feat: add benchmark repeatability analysis with Coefficient of Variation#112
feat: add benchmark repeatability analysis with Coefficient of Variation#112maryamtahhan wants to merge 6 commits into
Conversation
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughThis PR introduces a comprehensive repeatability analysis feature for benchmark results. It adds a CLI tool ( ChangesRepeatability Analysis Tool & Testing
Dashboard Integration & Visualization
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as analyze_repeatability.py
participant FS as File System
participant Report as Report (MD/JSON)
User->>CLI: Run with results directory
CLI->>FS: Scan for benchmarks.json
FS-->>CLI: List benchmark files
CLI->>FS: Load each benchmarks.json + test-metadata.json
FS-->>CLI: Benchmark & metadata pairs
CLI->>CLI: Group runs by configuration + concurrency
CLI->>CLI: For each group: calculate CV, std, mean per metric
CLI->>CLI: Map CV values to grades
CLI->>CLI: Generate aggregated statistics per base config
CLI->>Report: Write Markdown tables (optional: JSON)
Report-->>User: Reports generated
User->>User: Review repeatability analysis
sequenceDiagram
participant User as User (Streamlit)
participant Dashboard as Repeatability.py
participant Utils as repeatability_utils.py
participant FS as File System
participant Plot as Plotly/Streamlit Render
User->>Dashboard: Open Repeatability dashboard
Dashboard->>Dashboard: Load & cache benchmark runs
Dashboard->>FS: Scan for benchmarks.json + test-metadata.json
FS-->>Dashboard: Benchmark & metadata pairs
Dashboard->>Dashboard: Extract config keys & metrics
Dashboard->>Dashboard: Group runs by configuration
Dashboard->>Utils: calculate_repeatability_metrics()
Utils-->>Dashboard: Per-config CV, grade, mean
Dashboard->>Dashboard: Compute overview statistics
Dashboard->>Plot: Render grade distribution pie
Dashboard->>Plot: Render CV-by-metric box plot
Dashboard->>Dashboard: Apply grade/run filters
Dashboard->>Utils: filter_by_repeatability()
Utils-->>Dashboard: Filtered configuration rows
Dashboard->>Plot: Render configuration table (color-coded)
Plot-->>User: Interactive dashboard
User->>User: Explore repeatability insights
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes The PR introduces a cohesive repeatability analysis feature across three independent surfaces (CLI tool, dashboard, documentation) with substantial but self-contained logic. The CLI script (557 lines) and dashboard page (475 lines) are logic-dense with overlapping utility functions, requiring careful review of statistical calculations and data aggregation logic. The test suite (429 lines) is comprehensive but straightforward. Utilities and documentation are well-scoped. The moderate complexity stems from multiple interconnected functions, varied data flows (file I/O, grouping, metric calculation), and need to validate correctness of CV formulas and grading logic, offset by consistent patterns across both implementations. 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
automation/test-execution/ansible/scripts/analyze_repeatability.py (1)
187-200: 💤 Low valueUnused
enumerate— simplify to plain iteration
iis never referenced in the loop body (Ruff B007).♻️ Proposed fix
- for i, benchmark in enumerate(bench_data.get('benchmarks', [])): + for benchmark in bench_data.get('benchmarks', []):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@automation/test-execution/ansible/scripts/analyze_repeatability.py` around lines 187 - 200, The loop over bench_data.get('benchmarks', []) uses enumerate but never uses the index variable i; change the for loop in analyze_repeatability.py to iterate directly (e.g., for benchmark in bench_data.get('benchmarks', []):) so you remove the unused i; keep the existing body that computes config, strategy, concurrency, extended_key and appends into runs_by_config[extended_key] unchanged (symbols to locate: bench_data, benchmark, config, strategy, concurrency, extended_key, runs_by_config).automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py (1)
53-53: ⚡ Quick win
raw_dfis computed and cached but never consumed — remove it from the return value
load_benchmark_runsbuildsdf = pd.DataFrame(all_runs)and returns it as the first element of the tuple, but the caller unpacks it asraw_dfwhich is never used (Ruff RUF059). With@st.cache_data(ttl=300), this entire DataFrame is held in memory for 5 minutes unnecessarily.♻️ Proposed refactor
`@st.cache_data`(ttl=300) -def load_benchmark_runs(results_dir: str) -> tuple[pd.DataFrame, dict]: - """... - Returns: - Tuple of (raw_data_df, repeatability_metrics_dict) - """ +def load_benchmark_runs(results_dir: str) -> pd.DataFrame: + """... + Returns: + DataFrame with repeatability metrics per configuration + """ results_path = Path(results_dir) all_runs = [] runs_by_config = defaultdict(list) if not results_path.exists(): - return pd.DataFrame(), {} + return pd.DataFrame() for json_file in results_path.rglob("benchmarks.json"): try: ... - df = pd.DataFrame(all_runs) # remove this line ... - return df, repeatability_df + return repeatability_df # In main(): - raw_df, repeatability_df = load_benchmark_runs(str(results_dir)) + repeatability_df = load_benchmark_runs(str(results_dir))Also applies to: 126-126, 182-182, 396-396
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py at line 53, The function load_benchmark_runs currently constructs df (DataFrame) and returns it as the first tuple element which callers unpack as raw_df but never use; because the function is decorated with `@st.cache_data`(ttl=300) this wastes memory—remove the unused DataFrame from the return value by changing load_benchmark_runs to return only the actually-used object (e.g., the runs dict) and drop df/raw_df from the return and signature, update all call sites that unpack raw_df to stop expecting it (adjust unpacking to a single value or ignore the first element), and remove the cached DataFrame creation or its caching if the DataFrame is actually needed; apply the same fix to the other similar locations referenced (lines 126, 182, 396).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@automation/test-execution/ansible/tests/test_repeatability.py`:
- Around line 413-416: The print call uses an unnecessary f-string: change
print(f"\nFailed tests:") to a normal string print("\nFailed tests:") to remove
the spurious f-prefix flagged by Ruff; ensure similar non-interpolated literals
in the failed_tests handling (e.g., the loop printing f" -
{class_name}.{method_name}: {error}" is fine because it interpolates) are left
unchanged.
In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py:
- Line 232: The marker color fallback uses an invalid hex string '#gray' in the
list comprehension building marker=dict(colors=[colors.get(grade, '#gray') for
grade in grade_counts.index]); change the fallback to a valid CSS color (e.g.,
'gray') or a proper hex like '#808080' so replace occurrences of
colors.get(grade, '#gray') with colors.get(grade, 'gray') (or ' `#808080`') to
ensure Plotly accepts the color.
- Around line 79-106: The loop that builds row uses unguarded dict indexing
(bench['metrics'], metrics['tokens_per_second']['successful']['mean'], etc.)
which raises KeyError and causes the outer try/except to skip the whole file;
change each direct access to use a safe nested accessor (re-use the existing
get_nested_value function used in analyze_repeatability.py or use chained
.get(...) calls with a default of None) for every metric and metadata field
(e.g., replace metrics['...']['successful']['mean'] with
get_nested_value(metrics, ['...','successful','mean'], None)), so missing
individual metrics become None instead of aborting processing of the entire
benchmarks.json.
- Around line 388-390: The code calls a non-existent method get_results_dir() on
DashboardConfig causing an AttributeError; change the call to use the correct
method name get_results_directory() so replace config.get_results_dir() with
config.get_results_directory() where DashboardConfig is used in this file
(ensure no other occurrences remain).
---
Nitpick comments:
In `@automation/test-execution/ansible/scripts/analyze_repeatability.py`:
- Around line 187-200: The loop over bench_data.get('benchmarks', []) uses
enumerate but never uses the index variable i; change the for loop in
analyze_repeatability.py to iterate directly (e.g., for benchmark in
bench_data.get('benchmarks', []):) so you remove the unused i; keep the existing
body that computes config, strategy, concurrency, extended_key and appends into
runs_by_config[extended_key] unchanged (symbols to locate: bench_data,
benchmark, config, strategy, concurrency, extended_key, runs_by_config).
In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py:
- Line 53: The function load_benchmark_runs currently constructs df (DataFrame)
and returns it as the first tuple element which callers unpack as raw_df but
never use; because the function is decorated with `@st.cache_data`(ttl=300) this
wastes memory—remove the unused DataFrame from the return value by changing
load_benchmark_runs to return only the actually-used object (e.g., the runs
dict) and drop df/raw_df from the return and signature, update all call sites
that unpack raw_df to stop expecting it (adjust unpacking to a single value or
ignore the first element), and remove the cached DataFrame creation or its
caching if the DataFrame is actually needed; apply the same fix to the other
similar locations referenced (lines 126, 182, 396).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 4ec20825-d56e-40cf-832d-d8c83673bdad
📒 Files selected for processing (7)
automation/test-execution/ansible/scripts/analyze_repeatability.pyautomation/test-execution/ansible/tests/test_repeatability.pyautomation/test-execution/dashboard-examples/vllm_dashboard/Home.pyautomation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.pyautomation/test-execution/dashboard-examples/vllm_dashboard/repeatability_utils.pydocs/docs.mddocs/methodology/repeatability-analysis.md
| if failed_tests: | ||
| print(f"\nFailed tests:") | ||
| for class_name, method_name, error in failed_tests: | ||
| print(f" - {class_name}.{method_name}: {error}") |
There was a problem hiding this comment.
Spurious f prefix on string literal (Ruff F541)
f"Failed tests:" has no interpolation — the f is a no-op and is flagged as an error by Ruff.
🐛 Proposed fix
- print(f"Failed tests:")
+ print("Failed tests:")🧰 Tools
🪛 Ruff (0.15.12)
[error] 414-414: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@automation/test-execution/ansible/tests/test_repeatability.py` around lines
413 - 416, The print call uses an unnecessary f-string: change print(f"\nFailed
tests:") to a normal string print("\nFailed tests:") to remove the spurious
f-prefix flagged by Ruff; ensure similar non-interpolated literals in the
failed_tests handling (e.g., the loop printing f" - {class_name}.{method_name}:
{error}" is fine because it interpolates) are left unchanged.
| for bench in data.get('benchmarks', []): | ||
| metrics = bench['metrics'] | ||
| config = bench['config'] | ||
|
|
||
| concurrency = config.get('strategy', {}).get('max_concurrency', 0) | ||
|
|
||
| # Extract key metrics | ||
| row = { | ||
| 'test_run_id': metadata.get('test_run_id', 'unknown'), | ||
| 'platform': metadata.get('platform', 'unknown'), | ||
| 'model': metadata.get('model', 'unknown'), | ||
| 'model_short': metadata.get('model', 'unknown').split('/')[-1], | ||
| 'workload': metadata.get('workload', 'unknown'), | ||
| 'cores': metadata.get('core_count', 0), | ||
| 'tensor_parallel': metadata.get('tensor_parallel', 1), | ||
| 'vllm_version': metadata.get('vllm_version', 'unknown'), | ||
| 'concurrency': concurrency, | ||
| 'throughput_mean': metrics['tokens_per_second']['successful']['mean'], | ||
| 'ttft_mean': metrics['time_to_first_token_ms']['successful']['mean'], | ||
| 'ttft_p90': metrics['time_to_first_token_ms']['successful']['percentiles']['p90'], | ||
| 'ttft_p95': metrics['time_to_first_token_ms']['successful']['percentiles']['p95'], | ||
| 'tpot_mean': metrics['inter_token_latency_ms']['successful']['mean'], | ||
| 'tpot_p90': metrics['inter_token_latency_ms']['successful']['percentiles']['p90'], | ||
| 'tpot_p95': metrics['inter_token_latency_ms']['successful']['percentiles']['p95'], | ||
| 'request_latency_mean': metrics['request_latency']['successful']['mean'], | ||
| 'request_latency_p90': metrics['request_latency']['successful']['percentiles']['p90'], | ||
| 'request_latency_p95': metrics['request_latency']['successful']['percentiles']['p95'], | ||
| } |
There was a problem hiding this comment.
Silent data loss: unguarded key access will skip entire files on missing metrics
Any benchmark entry that's missing a metric key (e.g. bench['metrics'], metrics['tokens_per_second']['successful']['mean']) raises a KeyError. The outer try/except at line 68 catches this and skips the entire benchmarks.json file with only a logger.warning. This is inconsistent with analyze_repeatability.py, which uses get_nested_value() with per-value None defaults so that only the missing metric is excluded, not the whole file.
🛡️ Proposed fix — use safe nested access per metric
+ def _get(d, *keys, default=None):
+ for k in keys:
+ if isinstance(d, dict) and k in d:
+ d = d[k]
+ else:
+ return default
+ return d
for bench in data.get('benchmarks', []):
- metrics = bench['metrics']
- config = bench['config']
+ metrics = bench.get('metrics', {})
+ config = bench.get('config', {})
concurrency = config.get('strategy', {}).get('max_concurrency', 0)
row = {
...
- 'throughput_mean': metrics['tokens_per_second']['successful']['mean'],
- 'ttft_mean': metrics['time_to_first_token_ms']['successful']['mean'],
- 'ttft_p90': metrics['time_to_first_token_ms']['successful']['percentiles']['p90'],
+ 'throughput_mean': _get(metrics, 'tokens_per_second', 'successful', 'mean'),
+ 'ttft_mean': _get(metrics, 'time_to_first_token_ms', 'successful', 'mean'),
+ 'ttft_p90': _get(metrics, 'time_to_first_token_ms', 'successful', 'percentiles', 'p90'),
# ... apply similarly to remaining fields
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py
around lines 79 - 106, The loop that builds row uses unguarded dict indexing
(bench['metrics'], metrics['tokens_per_second']['successful']['mean'], etc.)
which raises KeyError and causes the outer try/except to skip the whole file;
change each direct access to use a safe nested accessor (re-use the existing
get_nested_value function used in analyze_repeatability.py or use chained
.get(...) calls with a default of None) for every metric and metadata field
(e.g., replace metrics['...']['successful']['mean'] with
get_nested_value(metrics, ['...','successful','mean'], None)), so missing
individual metrics become None instead of aborting processing of the entire
benchmarks.json.
| fig = go.Figure(data=[go.Pie( | ||
| labels=grade_counts.index, | ||
| values=grade_counts.values, | ||
| marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]), |
There was a problem hiding this comment.
'#gray' is not a valid Plotly/CSS color
colors.get(grade, '#gray') — the fallback '#gray' is not a valid hex color (hex colors look like '#808080'). Use the named color 'gray' instead.
🐛 Proposed fix
- marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]),
+ marker=dict(colors=[colors.get(grade, 'gray') for grade in grade_counts.index]),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]), | |
| marker=dict(colors=[colors.get(grade, 'gray') for grade in grade_counts.index]), |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py
at line 232, The marker color fallback uses an invalid hex string '#gray' in the
list comprehension building marker=dict(colors=[colors.get(grade, '#gray') for
grade in grade_counts.index]); change the fallback to a valid CSS color (e.g.,
'gray') or a proper hex like '#808080' so replace occurrences of
colors.get(grade, '#gray') with colors.get(grade, 'gray') (or ' `#808080`') to
ensure Plotly accepts the color.
Add comprehensive repeatability analysis tooling to measure benchmark consistency using Coefficient of Variation (CV) metrics. New features: - analyze_repeatability.py: Calculate CV for metrics across multiple runs - Automatic grouping by platform, model, cores, TP, vLLM version, concurrency - Repeatability grading (Excellent/Good/Acceptable/Poor) based on CV thresholds - Markdown and JSON report generation with detailed CV tables and rankings - Dashboard utilities for integrating CV metrics into visualizations CV interpretation: - CV < 1%: Excellent (ideal for regression testing) - CV 1-3%: Good (suitable for performance comparisons) - CV 3-5%: Acceptable (moderate variance) - CV > 5%: Poor (high variance, unreliable results) Metrics analyzed: - Request latency (mean, P90, P95) - TTFT (mean, P90, P95) - TPoT/ITL (mean, P90, P95) - Output throughput Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Move README_REPEATABILITY.md to docs/methodology/repeatability-analysis.md - Add Jekyll frontmatter for GitHub Pages integration - Update docs index to include repeatability analysis - Add to documentation status table Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add 34 unit tests covering: - CV calculation accuracy and edge cases - Repeatability grade assignment (Excellent/Good/Acceptable/Poor) - Letter grade mapping (A+ through C) - Nested value extraction from benchmark JSON - Metric repeatability analysis - Edge cases: empty data, single run, zero mean, NaN handling All tests pass with pytest. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add new Repeatability dashboard (page 4) with: Features: - CV metrics overview with grade distribution - Interactive filters (minimum grade, minimum runs) - Box plots showing CV distribution by metric type - Detailed configuration table with color-coded grades - CSV export for repeatability analysis - Grade thresholds visualization (Excellent/Good/Acceptable/Poor) Uses existing repeatability_utils.py for CV calculations. Also updates Home.py to include the new dashboard in navigation and feature descriptions. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add repeatability analysis to Client-Side Metrics dashboard: - Import repeatability_utils functions - Calculate CV metrics for grouped configurations - Display repeatability summary in collapsible expander - Show CV percentages and grades (Excellent/Good/Acceptable/Poor) - Format metrics with configuration details (platform, cores, concurrency) - Add grade legend for easy interpretation Users can now see repeatability metrics directly in the performance dashboard without navigating to the dedicated Repeatability page. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add comprehensive repeatability analysis tooling to measure benchmark consistency using Coefficient of Variation (CV) metrics.
New features:
CV interpretation:
Metrics analyzed:
Summary by CodeRabbit