feat: add benchmark repeatability analysis with Coefficient of Variation by maryamtahhan · Pull Request #112 · redhat-et/vllm-cpu-perf-eval

maryamtahhan · 2026-04-28T08:03:38Z

Add comprehensive repeatability analysis tooling to measure benchmark consistency using Coefficient of Variation (CV) metrics.

New features:

analyze_repeatability.py: Calculate CV for metrics across multiple runs
Automatic grouping by platform, model, cores, TP, vLLM version, concurrency
Repeatability grading (Excellent/Good/Acceptable/Poor) based on CV thresholds
Markdown and JSON report generation with detailed CV tables and rankings
Dashboard utilities for integrating CV metrics into visualizations

CV interpretation:

CV < 1%: Excellent (ideal for regression testing)
CV 1-3%: Good (suitable for performance comparisons)
CV 3-5%: Acceptable (moderate variance)
CV > 5%: Poor (high variance, unreliable results)

Metrics analyzed:

Request latency (mean, P90, P95)
TTFT (mean, P90, P95)
TPoT/ITL (mean, P90, P95)
Output throughput

Summary by CodeRabbit

New Features
- Added benchmark repeatability analysis tool that evaluates consistency across multiple runs using Coefficient of Variation metrics and generates graded repeatability assessments.
- Added "Repeatability" dashboard page with visualization of repeatability metrics, grade distribution, and filterable configuration tables.
- Added comprehensive documentation and examples for running repeatability analysis.

coderabbitai · 2026-04-28T08:03:46Z

Warning

Rate limit exceeded

@maryamtahhan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 42 minutes and 48 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9fc67f26-0ddd-47d5-8a9b-3a846c05a4d8

📥 Commits

Reviewing files that changed from the base of the PR and between abde060 and 981ccd2.

📒 Files selected for processing (8)

automation/test-execution/ansible/scripts/analyze_repeatability.py
automation/test-execution/ansible/tests/test_repeatability.py
automation/test-execution/dashboard-examples/vllm_dashboard/Home.py
automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py
automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py
automation/test-execution/dashboard-examples/vllm_dashboard/repeatability_utils.py
docs/docs.md
docs/methodology/repeatability-analysis.md

📝 Walkthrough

Walkthrough

This PR introduces a comprehensive repeatability analysis feature for benchmark results. It adds a CLI tool (analyze_repeatability.py) that scans benchmark directories, computes coefficient of variation (CV) across multiple runs per configuration, and generates both Markdown and JSON reports with repeatability grades. A companion Streamlit dashboard visualizes these metrics with interactive filtering and distribution charts. Complete test coverage and documentation explain the statistical approach and usage.

Changes

Repeatability Analysis Tool & Testing

Layer / File(s)	Summary
Statistical Foundations `automation/test-execution/ansible/scripts/analyze_repeatability.py` (lines 1–145)	Defines `CV_THRESHOLDS` mapping CV percentages to grades (Excellent/Good/Acceptable/Poor), `METRICS_CONFIG` specifying which JSON paths to extract for request latency, TTFT, TPoT, and throughput metrics, and core helper functions: `calculate_cv()` for sample-based variation, `get_repeatability_grade()` to map CV to grade strings, and `get_letter_grade()` to assign overall A–C grades from average CV.
Data Loading & Analysis `automation/test-execution/ansible/scripts/analyze_repeatability.py` (lines 146–263)	`load_benchmark_runs()` recursively scans results directories for `benchmarks.json` paired with `test-metadata.json`, groups runs by configuration key (including `max_concurrency`), and aggregates metadata. `analyze_metric_repeatability()` computes per-metric mean, sample standard deviation, CV percentage, repeatability grade, and optional percentile statistics (p90/p95) for each configuration group with ≥2 runs.
Report Generation `automation/test-execution/ansible/scripts/analyze_repeatability.py` (lines 266–460)	`generate_markdown_report()` creates Markdown tables grouped by base configuration, displaying mean and optional percentile metrics across concurrency levels with ±CV formatting and unit-specific precision. `generate_json_report()` serializes analysis results with NaN-to-null conversion for downstream integration.
CLI Interface `automation/test-execution/ansible/scripts/analyze_repeatability.py` (lines 462–557)	`main()` entry point parses arguments (results directory, Markdown/JSON output paths, verbosity flag), filters configurations with fewer than 2 runs, triggers report generation, logs progress and errors, and returns appropriate exit codes.
Testing & Validation `automation/test-execution/ansible/tests/test_repeatability.py`	Test classes validate CV calculation (empty, single, zero-mean, known-value edge cases), grade boundary behavior including NaN handling, letter-grade thresholds, nested JSON path traversal, and `analyze_metric_repeatability()` aggregation for single vs. multi-run scenarios with expected grade assignments.
Methodology Documentation `docs/methodology/repeatability-analysis.md`, `docs/docs.md`	Complete guide covering quick-start commands, CV formula and interpretation, per-metric configuration, output formats (with example Markdown and JSON schemas), troubleshooting, and dashboard integration patterns. Navigation index updated to include the new documentation page.

Dashboard Integration & Visualization

Layer / File(s)	Summary
Dashboard Utilities `automation/test-execution/dashboard-examples/vllm_dashboard/repeatability_utils.py`	Shared helper functions for dashboard integration: `calculate_cv()` and `get_repeatability_grade()` compute CV and grades from Pandas Series; `get_grade_color()` maps grades to display colors; `calculate_repeatability_metrics()` aggregates per-group mean/std/CV/grade; `format_metric_with_cv()`, `get_repeatability_summary()`, and `create_repeatability_comparison_table()` prepare displayable summaries; `filter_by_repeatability()` filters data by run count, CV threshold, and grade; `add_cv_annotations_to_plot()` adds Plotly figure annotations with grade-driven styling.
Repeatability Dashboard Page `automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py`	Streamlit page (`main()`) that loads benchmark results, computes repeatability metrics per configuration, and renders: overview statistics (configuration count, CV distribution), grade distribution pie chart, CV-by-metric box plot with threshold reference lines, and a filterable/sortable configuration table with color-coded rows and CSV export. Includes sidebar styling and error handling for insufficient data.
Navigation & Home `automation/test-execution/dashboard-examples/vllm_dashboard/Home.py`	Updates landing page to add a fourth dashboard column ("📈 Repeatability") to the overview cards, extends the Dashboard Features section to document CV metrics and filtering capabilities, and adjusts Quick Start formatting.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as analyze_repeatability.py
    participant FS as File System
    participant Report as Report (MD/JSON)
    
    User->>CLI: Run with results directory
    CLI->>FS: Scan for benchmarks.json
    FS-->>CLI: List benchmark files
    CLI->>FS: Load each benchmarks.json + test-metadata.json
    FS-->>CLI: Benchmark & metadata pairs
    CLI->>CLI: Group runs by configuration + concurrency
    CLI->>CLI: For each group: calculate CV, std, mean per metric
    CLI->>CLI: Map CV values to grades
    CLI->>CLI: Generate aggregated statistics per base config
    CLI->>Report: Write Markdown tables (optional: JSON)
    Report-->>User: Reports generated
    User->>User: Review repeatability analysis

sequenceDiagram
    participant User as User (Streamlit)
    participant Dashboard as Repeatability.py
    participant Utils as repeatability_utils.py
    participant FS as File System
    participant Plot as Plotly/Streamlit Render
    
    User->>Dashboard: Open Repeatability dashboard
    Dashboard->>Dashboard: Load & cache benchmark runs
    Dashboard->>FS: Scan for benchmarks.json + test-metadata.json
    FS-->>Dashboard: Benchmark & metadata pairs
    Dashboard->>Dashboard: Extract config keys & metrics
    Dashboard->>Dashboard: Group runs by configuration
    Dashboard->>Utils: calculate_repeatability_metrics()
    Utils-->>Dashboard: Per-config CV, grade, mean
    Dashboard->>Dashboard: Compute overview statistics
    Dashboard->>Plot: Render grade distribution pie
    Dashboard->>Plot: Render CV-by-metric box plot
    Dashboard->>Dashboard: Apply grade/run filters
    Dashboard->>Utils: filter_by_repeatability()
    Utils-->>Dashboard: Filtered configuration rows
    Dashboard->>Plot: Render configuration table (color-coded)
    Plot-->>User: Interactive dashboard
    User->>User: Explore repeatability insights

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The PR introduces a cohesive repeatability analysis feature across three independent surfaces (CLI tool, dashboard, documentation) with substantial but self-contained logic. The CLI script (557 lines) and dashboard page (475 lines) are logic-dense with overlapping utility functions, requiring careful review of statistical calculations and data aggregation logic. The test suite (429 lines) is comprehensive but straightforward. Utilities and documentation are well-scoped. The moderate complexity stems from multiple interconnected functions, varied data flows (file I/O, grouping, metric calculation), and need to validate correctness of CV formulas and grading logic, offset by consistent patterns across both implementations.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main change: adding benchmark repeatability analysis using Coefficient of Variation (CV) as the core metric. It is clear, specific, and directly related to the comprehensive feature set introduced across multiple files.
Docstring Coverage	✅ Passed	Docstring coverage is 93.65% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

automation/test-execution/ansible/scripts/analyze_repeatability.py (1)

187-200: 💤 Low value

Unused enumerate — simplify to plain iteration

i is never referenced in the loop body (Ruff B007).

♻️ Proposed fix

-            for i, benchmark in enumerate(bench_data.get('benchmarks', [])):
+            for benchmark in bench_data.get('benchmarks', []):

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@automation/test-execution/ansible/scripts/analyze_repeatability.py` around
lines 187 - 200, The loop over bench_data.get('benchmarks', []) uses enumerate
but never uses the index variable i; change the for loop in
analyze_repeatability.py to iterate directly (e.g., for benchmark in
bench_data.get('benchmarks', []):) so you remove the unused i; keep the existing
body that computes config, strategy, concurrency, extended_key and appends into
runs_by_config[extended_key] unchanged (symbols to locate: bench_data,
benchmark, config, strategy, concurrency, extended_key, runs_by_config).

automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py (1)

53-53: ⚡ Quick win

raw_df is computed and cached but never consumed — remove it from the return value

load_benchmark_runs builds df = pd.DataFrame(all_runs) and returns it as the first element of the tuple, but the caller unpacks it as raw_df which is never used (Ruff RUF059). With @st.cache_data(ttl=300), this entire DataFrame is held in memory for 5 minutes unnecessarily.

♻️ Proposed refactor

 `@st.cache_data`(ttl=300)
-def load_benchmark_runs(results_dir: str) -> tuple[pd.DataFrame, dict]:
-    """...
-    Returns:
-        Tuple of (raw_data_df, repeatability_metrics_dict)
-    """
+def load_benchmark_runs(results_dir: str) -> pd.DataFrame:
+    """...
+    Returns:
+        DataFrame with repeatability metrics per configuration
+    """
     results_path = Path(results_dir)
     all_runs = []
     runs_by_config = defaultdict(list)

     if not results_path.exists():
-        return pd.DataFrame(), {}
+        return pd.DataFrame()

     for json_file in results_path.rglob("benchmarks.json"):
         try:
             ...
-    df = pd.DataFrame(all_runs)  # remove this line
     ...
-    return df, repeatability_df
+    return repeatability_df

 # In main():
-    raw_df, repeatability_df = load_benchmark_runs(str(results_dir))
+    repeatability_df = load_benchmark_runs(str(results_dir))

Also applies to: 126-126, 182-182, 396-396

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py
at line 53, The function load_benchmark_runs currently constructs df (DataFrame)
and returns it as the first tuple element which callers unpack as raw_df but
never use; because the function is decorated with `@st.cache_data`(ttl=300) this
wastes memory—remove the unused DataFrame from the return value by changing
load_benchmark_runs to return only the actually-used object (e.g., the runs
dict) and drop df/raw_df from the return and signature, update all call sites
that unpack raw_df to stop expecting it (adjust unpacking to a single value or
ignore the first element), and remove the cached DataFrame creation or its
caching if the DataFrame is actually needed; apply the same fix to the other
similar locations referenced (lines 126, 182, 396).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@automation/test-execution/ansible/tests/test_repeatability.py`:
- Around line 413-416: The print call uses an unnecessary f-string: change
print(f"\nFailed tests:") to a normal string print("\nFailed tests:") to remove
the spurious f-prefix flagged by Ruff; ensure similar non-interpolated literals
in the failed_tests handling (e.g., the loop printing f"  -
{class_name}.{method_name}: {error}" is fine because it interpolates) are left
unchanged.

In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py:
- Line 232: The marker color fallback uses an invalid hex string '#gray' in the
list comprehension building marker=dict(colors=[colors.get(grade, '#gray') for
grade in grade_counts.index]); change the fallback to a valid CSS color (e.g.,
'gray') or a proper hex like '#808080' so replace occurrences of
colors.get(grade, '#gray') with colors.get(grade, 'gray') (or ' `#808080`') to
ensure Plotly accepts the color.
- Around line 79-106: The loop that builds row uses unguarded dict indexing
(bench['metrics'], metrics['tokens_per_second']['successful']['mean'], etc.)
which raises KeyError and causes the outer try/except to skip the whole file;
change each direct access to use a safe nested accessor (re-use the existing
get_nested_value function used in analyze_repeatability.py or use chained
.get(...) calls with a default of None) for every metric and metadata field
(e.g., replace metrics['...']['successful']['mean'] with
get_nested_value(metrics, ['...','successful','mean'], None)), so missing
individual metrics become None instead of aborting processing of the entire
benchmarks.json.
- Around line 388-390: The code calls a non-existent method get_results_dir() on
DashboardConfig causing an AttributeError; change the call to use the correct
method name get_results_directory() so replace config.get_results_dir() with
config.get_results_directory() where DashboardConfig is used in this file
(ensure no other occurrences remain).

---

Nitpick comments:
In `@automation/test-execution/ansible/scripts/analyze_repeatability.py`:
- Around line 187-200: The loop over bench_data.get('benchmarks', []) uses
enumerate but never uses the index variable i; change the for loop in
analyze_repeatability.py to iterate directly (e.g., for benchmark in
bench_data.get('benchmarks', []):) so you remove the unused i; keep the existing
body that computes config, strategy, concurrency, extended_key and appends into
runs_by_config[extended_key] unchanged (symbols to locate: bench_data,
benchmark, config, strategy, concurrency, extended_key, runs_by_config).

In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py:
- Line 53: The function load_benchmark_runs currently constructs df (DataFrame)
and returns it as the first tuple element which callers unpack as raw_df but
never use; because the function is decorated with `@st.cache_data`(ttl=300) this
wastes memory—remove the unused DataFrame from the return value by changing
load_benchmark_runs to return only the actually-used object (e.g., the runs
dict) and drop df/raw_df from the return and signature, update all call sites
that unpack raw_df to stop expecting it (adjust unpacking to a single value or
ignore the first element), and remove the cached DataFrame creation or its
caching if the DataFrame is actually needed; apply the same fix to the other
similar locations referenced (lines 126, 182, 396).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 4ec20825-d56e-40cf-832d-d8c83673bdad

📥 Commits

Reviewing files that changed from the base of the PR and between 8c3c15d and abde060.

📒 Files selected for processing (7)

automation/test-execution/ansible/scripts/analyze_repeatability.py
automation/test-execution/ansible/tests/test_repeatability.py
automation/test-execution/dashboard-examples/vllm_dashboard/Home.py
automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py
automation/test-execution/dashboard-examples/vllm_dashboard/repeatability_utils.py
docs/docs.md
docs/methodology/repeatability-analysis.md

coderabbitai · 2026-05-05T13:56:11Z

+    if failed_tests:
+        print(f"\nFailed tests:")
+        for class_name, method_name, error in failed_tests:
+            print(f"  - {class_name}.{method_name}: {error}")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Spurious f prefix on string literal (Ruff F541)

f"Failed tests:" has no interpolation — the f is a no-op and is flagged as an error by Ruff.

🐛 Proposed fix

- print(f"Failed tests:") + print("Failed tests:")

🧰 Tools

🪛 Ruff (0.15.12)

[error] 414-414: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@automation/test-execution/ansible/tests/test_repeatability.py` around lines 413 - 416, The print call uses an unnecessary f-string: change print(f"\nFailed tests:") to a normal string print("\nFailed tests:") to remove the spurious f-prefix flagged by Ruff; ensure similar non-interpolated literals in the failed_tests handling (e.g., the loop printing f" - {class_name}.{method_name}: {error}" is fine because it interpolates) are left unchanged.

coderabbitai · 2026-05-05T13:56:12Z

+            for bench in data.get('benchmarks', []):
+                metrics = bench['metrics']
+                config = bench['config']
+
+                concurrency = config.get('strategy', {}).get('max_concurrency', 0)
+
+                # Extract key metrics
+                row = {
+                    'test_run_id': metadata.get('test_run_id', 'unknown'),
+                    'platform': metadata.get('platform', 'unknown'),
+                    'model': metadata.get('model', 'unknown'),
+                    'model_short': metadata.get('model', 'unknown').split('/')[-1],
+                    'workload': metadata.get('workload', 'unknown'),
+                    'cores': metadata.get('core_count', 0),
+                    'tensor_parallel': metadata.get('tensor_parallel', 1),
+                    'vllm_version': metadata.get('vllm_version', 'unknown'),
+                    'concurrency': concurrency,
+                    'throughput_mean': metrics['tokens_per_second']['successful']['mean'],
+                    'ttft_mean': metrics['time_to_first_token_ms']['successful']['mean'],
+                    'ttft_p90': metrics['time_to_first_token_ms']['successful']['percentiles']['p90'],
+                    'ttft_p95': metrics['time_to_first_token_ms']['successful']['percentiles']['p95'],
+                    'tpot_mean': metrics['inter_token_latency_ms']['successful']['mean'],
+                    'tpot_p90': metrics['inter_token_latency_ms']['successful']['percentiles']['p90'],
+                    'tpot_p95': metrics['inter_token_latency_ms']['successful']['percentiles']['p95'],
+                    'request_latency_mean': metrics['request_latency']['successful']['mean'],
+                    'request_latency_p90': metrics['request_latency']['successful']['percentiles']['p90'],
+                    'request_latency_p95': metrics['request_latency']['successful']['percentiles']['p95'],
+                }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Silent data loss: unguarded key access will skip entire files on missing metrics

Any benchmark entry that's missing a metric key (e.g. bench['metrics'], metrics['tokens_per_second']['successful']['mean']) raises a KeyError. The outer try/except at line 68 catches this and skips the entire benchmarks.json file with only a logger.warning. This is inconsistent with analyze_repeatability.py, which uses get_nested_value() with per-value None defaults so that only the missing metric is excluded, not the whole file.

🛡️ Proposed fix — use safe nested access per metric

+ def _get(d, *keys, default=None): + for k in keys: + if isinstance(d, dict) and k in d: + d = d[k] + else: + return default + return d for bench in data.get('benchmarks', []): - metrics = bench['metrics'] - config = bench['config'] + metrics = bench.get('metrics', {}) + config = bench.get('config', {}) concurrency = config.get('strategy', {}).get('max_concurrency', 0) row = { ... - 'throughput_mean': metrics['tokens_per_second']['successful']['mean'], - 'ttft_mean': metrics['time_to_first_token_ms']['successful']['mean'], - 'ttft_p90': metrics['time_to_first_token_ms']['successful']['percentiles']['p90'], + 'throughput_mean': _get(metrics, 'tokens_per_second', 'successful', 'mean'), + 'ttft_mean': _get(metrics, 'time_to_first_token_ms', 'successful', 'mean'), + 'ttft_p90': _get(metrics, 'time_to_first_token_ms', 'successful', 'percentiles', 'p90'), # ... apply similarly to remaining fields }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py around lines 79 - 106, The loop that builds row uses unguarded dict indexing (bench['metrics'], metrics['tokens_per_second']['successful']['mean'], etc.) which raises KeyError and causes the outer try/except to skip the whole file; change each direct access to use a safe nested accessor (re-use the existing get_nested_value function used in analyze_repeatability.py or use chained .get(...) calls with a default of None) for every metric and metadata field (e.g., replace metrics['...']['successful']['mean'] with get_nested_value(metrics, ['...','successful','mean'], None)), so missing individual metrics become None instead of aborting processing of the entire benchmarks.json.

coderabbitai · 2026-05-05T13:56:12Z

+    fig = go.Figure(data=[go.Pie(
+        labels=grade_counts.index,
+        values=grade_counts.values,
+        marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]),


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

'#gray' is not a valid Plotly/CSS color

colors.get(grade, '#gray') — the fallback '#gray' is not a valid hex color (hex colors look like '#808080'). Use the named color 'gray' instead.

🐛 Proposed fix

- marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]), + marker=dict(colors=[colors.get(grade, 'gray') for grade in grade_counts.index]),

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]),

marker=dict(colors=[colors.get(grade, 'gray') for grade in grade_counts.index]),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py at line 232, The marker color fallback uses an invalid hex string '#gray' in the list comprehension building marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]); change the fallback to a valid CSS color (e.g., 'gray') or a proper hex like '#808080' so replace occurrences of colors.get(grade, '#gray') with colors.get(grade, 'gray') (or ' `#808080`') to ensure Plotly accepts the color.

Add comprehensive repeatability analysis tooling to measure benchmark consistency using Coefficient of Variation (CV) metrics. New features: - analyze_repeatability.py: Calculate CV for metrics across multiple runs - Automatic grouping by platform, model, cores, TP, vLLM version, concurrency - Repeatability grading (Excellent/Good/Acceptable/Poor) based on CV thresholds - Markdown and JSON report generation with detailed CV tables and rankings - Dashboard utilities for integrating CV metrics into visualizations CV interpretation: - CV < 1%: Excellent (ideal for regression testing) - CV 1-3%: Good (suitable for performance comparisons) - CV 3-5%: Acceptable (moderate variance) - CV > 5%: Poor (high variance, unreliable results) Metrics analyzed: - Request latency (mean, P90, P95) - TTFT (mean, P90, P95) - TPoT/ITL (mean, P90, P95) - Output throughput Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

- Move README_REPEATABILITY.md to docs/methodology/repeatability-analysis.md - Add Jekyll frontmatter for GitHub Pages integration - Update docs index to include repeatability analysis - Add to documentation status table Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Add 34 unit tests covering: - CV calculation accuracy and edge cases - Repeatability grade assignment (Excellent/Good/Acceptable/Poor) - Letter grade mapping (A+ through C) - Nested value extraction from benchmark JSON - Metric repeatability analysis - Edge cases: empty data, single run, zero mean, NaN handling All tests pass with pytest. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Add new Repeatability dashboard (page 4) with: Features: - CV metrics overview with grade distribution - Interactive filters (minimum grade, minimum runs) - Box plots showing CV distribution by metric type - Detailed configuration table with color-coded grades - CSV export for repeatability analysis - Grade thresholds visualization (Excellent/Good/Acceptable/Poor) Uses existing repeatability_utils.py for CV calculations. Also updates Home.py to include the new dashboard in navigation and feature descriptions. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Add repeatability analysis to Client-Side Metrics dashboard: - Import repeatability_utils functions - Calculate CV metrics for grouped configurations - Display repeatability summary in collapsible expander - Show CV percentages and grades (Excellent/Good/Acceptable/Poor) - Format metrics with configuration details (platform, cores, concurrency) - Add grade legend for easy interpretation Users can now see repeatability metrics directly in the performance dashboard without navigating to the dedicated Repeatability page. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

maryamtahhan force-pushed the feat/cv branch from 2e34dfb to 7951669 Compare May 5, 2026 12:18

maryamtahhan marked this pull request as ready for review May 5, 2026 12:19

coderabbitai Bot reviewed May 5, 2026

View reviewed changes

maryamtahhan and others added 6 commits May 5, 2026 14:57

fix: remove trailing whitespace from repeatability README

fe08985

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

maryamtahhan force-pushed the feat/cv branch from abde060 to 981ccd2 Compare May 5, 2026 14:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add benchmark repeatability analysis with Coefficient of Variation#112

feat: add benchmark repeatability analysis with Coefficient of Variation#112
maryamtahhan wants to merge 6 commits into
redhat-et:mainfrom
maryamtahhan:feat/cv

maryamtahhan commented Apr 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 28, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 5, 2026

Uh oh!

coderabbitai Bot May 5, 2026

Uh oh!

coderabbitai Bot May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]),
	marker=dict(colors=[colors.get(grade, 'gray') for grade in grade_counts.index]),

Conversation

maryamtahhan commented Apr 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

maryamtahhan commented Apr 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 28, 2026 •

edited

Loading