feat: add StatVar for numerically stable statistics in audit and logs commands#26479
feat: add StatVar for numerically stable statistics in audit and logs commands#26479
Conversation
… commands - Add pkg/stats.StatVar using Welford's online algorithm (mean, variance, stddev, median, min, max, count, sum) - Add MedianTimeBetweenTurns and StdDevTimeBetweenTurns to workflow.LogMetrics - Use StatVar in copilot_events_jsonl.go for TBT computation - Add median/stddev fields to MetricsTrendData in audit_cross_run.go - Use StatVar in buildMetricsTrend() replacing manual accumulation - Use StatVar in health_metrics.go CalculateWorkflowHealth() Agent-Logs-Url: https://github.com/github/gh-aw/sessions/30f4a60c-5387-4216-b4a2-56643a7c1967 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
… memory usage Agent-Logs-Url: https://github.com/github/gh-aw/sessions/30f4a60c-5387-4216-b4a2-56643a7c1967 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Introduces a new pkg/stats.StatVar utility for numerically stable, incremental descriptive statistics (Welford) and wires it into CLI audit/log health metrics to compute richer aggregates (median/stddev in addition to avg/min/max).
Changes:
- Added
pkg/stats.StatVar(mean/variance/stddev + exact median) with unit tests. - Extended
workflow.LogMetricswith median + sample stddev for time-between-turns and populated these incopilot_events_jsonlparsing. - Refactored cross-run audit and workflow health computations to use
StatVarfor cost/tokens/turns/duration aggregation.
Show a summary per file
| File | Description |
|---|---|
| pkg/stats/statvar.go | New streaming stats accumulator (Welford) + exact median support. |
| pkg/stats/statvar_test.go | Unit tests for StatVar behavior and numerical stability. |
| pkg/workflow/metrics.go | Adds median/stddev fields for time-between-turns in LogMetrics. |
| pkg/cli/copilot_events_jsonl.go | Uses StatVar to compute avg/max/median/stddev TBT from timestamps. |
| pkg/cli/audit_cross_run.go | Uses StatVar to compute cross-run avg/median/stddev for key metrics. |
| pkg/cli/health_metrics.go | Uses StatVar internally to compute workflow health averages. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 6/6 changed files
- Comments generated: 3
| @@ -94,6 +96,9 @@ func CalculateWorkflowHealth(workflowName string, runs []WorkflowRun, threshold | |||
| totalDuration += run.Duration | |||
| totalTokens += run.TokenUsage | |||
| totalCost += run.EstimatedCost | |||
| durationStats.Add(float64(run.Duration)) | |||
There was a problem hiding this comment.
totalDuration is accumulated but never used in this function (there is no TotalDuration field on WorkflowHealth). This will fail compilation due to an unused variable. Remove totalDuration and the totalDuration += run.Duration line, or add a corresponding output field if the total duration is intended to be reported.
| if v < s.min { | ||
| s.min = v | ||
| } | ||
| if v > s.max { | ||
| s.max = v |
There was a problem hiding this comment.
The doc comment says NaN and ±Inf "will propagate through all derived statistics", but Min()/Max() may not propagate when NaN is added after initial finite values (comparisons with NaN are false, leaving min/max unchanged). Either update the comment to reflect actual behavior, or explicitly detect NaN/Inf in Add and set min/max accordingly (e.g., if math.IsNaN(v) then force min/max to NaN).
| if v < s.min { | |
| s.min = v | |
| } | |
| if v > s.max { | |
| s.max = v | |
| if math.IsNaN(v) { | |
| s.min = v | |
| s.max = v | |
| } else { | |
| if v < s.min { | |
| s.min = v | |
| } | |
| if v > s.max { | |
| s.max = v | |
| } |
| // MetricsTrendData contains aggregated cost, token, turn, and duration statistics | ||
| // across multiple runs, with spike detection for anomalous runs. |
There was a problem hiding this comment.
The MetricsTrendData type comment is duplicated twice back-to-back, which makes the docs noisy and harder to read. Remove one of the duplicate two-line blocks so the doc comment appears only once.
| // MetricsTrendData contains aggregated cost, token, turn, and duration statistics | |
| // across multiple runs, with spike detection for anomalous runs. |
|
@copilot review all comments |
… comment, clarify NaN doc - health_metrics.go: remove unused totalDuration variable - audit_cross_run.go: remove duplicate MetricsTrendData comment block - statvar.go: clarify Add() doc to accurately reflect NaN/Min/Max behavior Agent-Logs-Url: https://github.com/github/gh-aw/sessions/b393d657-9b75-407b-8c0e-e9dcd97e4bbe Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Addressed all three review comments in 39f35b1:
|
Summary
Adds a
pkg/stats.StatVarstatistical variable type and uses it throughout the audit and logs commands to compute richer, more statistically correct metrics.Changes
New:
pkg/stats.StatVarpkg/workflow/metrics.go—LogMetricsMedianTimeBetweenTurns time.DurationStdDevTimeBetweenTurns time.Duration(sample stddev)pkg/cli/copilot_events_jsonl.gostats.StatVarAvgTimeBetweenTurns,MaxTimeBetweenTurns,MedianTimeBetweenTurns, andStdDevTimeBetweenTurnspkg/cli/audit_cross_run.go—MetricsTrendDataMedianCost,StdDevCost,MedianTokens,StdDevTokens,MedianTurns,StdDevTurns,MedianDurationNs,StdDevDurationNsbuildMetricsTrend()now usesStatVarfor all four metric dimensions (cost, tokens, turns, duration), replacing verbose manual min/max/sum loopspkg/cli/health_metrics.goCalculateWorkflowHealth()now usesStatVarinternally for duration, token, and cost averagesTest coverage
pkg/stats/statvar_test.go: 9 test cases covering empty, single, two-observation, odd/even count, known variance (Wikipedia example), numerical stability, median idempotency, and all-identical inputs