Skip to content

feat: add benchmark repeatability analysis with Coefficient of Variation#112

Open
maryamtahhan wants to merge 6 commits into
redhat-et:mainfrom
maryamtahhan:feat/cv
Open

feat: add benchmark repeatability analysis with Coefficient of Variation#112
maryamtahhan wants to merge 6 commits into
redhat-et:mainfrom
maryamtahhan:feat/cv

Conversation

@maryamtahhan
Copy link
Copy Markdown
Collaborator

@maryamtahhan maryamtahhan commented Apr 28, 2026

Add comprehensive repeatability analysis tooling to measure benchmark consistency using Coefficient of Variation (CV) metrics.

New features:

  • analyze_repeatability.py: Calculate CV for metrics across multiple runs
  • Automatic grouping by platform, model, cores, TP, vLLM version, concurrency
  • Repeatability grading (Excellent/Good/Acceptable/Poor) based on CV thresholds
  • Markdown and JSON report generation with detailed CV tables and rankings
  • Dashboard utilities for integrating CV metrics into visualizations

CV interpretation:

  • CV < 1%: Excellent (ideal for regression testing)
  • CV 1-3%: Good (suitable for performance comparisons)
  • CV 3-5%: Acceptable (moderate variance)
  • CV > 5%: Poor (high variance, unreliable results)

Metrics analyzed:

  • Request latency (mean, P90, P95)
  • TTFT (mean, P90, P95)
  • TPoT/ITL (mean, P90, P95)
  • Output throughput

Summary by CodeRabbit

  • New Features
    • Added benchmark repeatability analysis tool that evaluates consistency across multiple runs using Coefficient of Variation metrics and generates graded repeatability assessments.
    • Added "Repeatability" dashboard page with visualization of repeatability metrics, grade distribution, and filterable configuration tables.
    • Added comprehensive documentation and examples for running repeatability analysis.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

Warning

Rate limit exceeded

@maryamtahhan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 42 minutes and 48 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9fc67f26-0ddd-47d5-8a9b-3a846c05a4d8

📥 Commits

Reviewing files that changed from the base of the PR and between abde060 and 981ccd2.

📒 Files selected for processing (8)
  • automation/test-execution/ansible/scripts/analyze_repeatability.py
  • automation/test-execution/ansible/tests/test_repeatability.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/Home.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/repeatability_utils.py
  • docs/docs.md
  • docs/methodology/repeatability-analysis.md
📝 Walkthrough

Walkthrough

This PR introduces a comprehensive repeatability analysis feature for benchmark results. It adds a CLI tool (analyze_repeatability.py) that scans benchmark directories, computes coefficient of variation (CV) across multiple runs per configuration, and generates both Markdown and JSON reports with repeatability grades. A companion Streamlit dashboard visualizes these metrics with interactive filtering and distribution charts. Complete test coverage and documentation explain the statistical approach and usage.

Changes

Repeatability Analysis Tool & Testing

Layer / File(s) Summary
Statistical Foundations
automation/test-execution/ansible/scripts/analyze_repeatability.py (lines 1–145)
Defines CV_THRESHOLDS mapping CV percentages to grades (Excellent/Good/Acceptable/Poor), METRICS_CONFIG specifying which JSON paths to extract for request latency, TTFT, TPoT, and throughput metrics, and core helper functions: calculate_cv() for sample-based variation, get_repeatability_grade() to map CV to grade strings, and get_letter_grade() to assign overall A–C grades from average CV.
Data Loading & Analysis
automation/test-execution/ansible/scripts/analyze_repeatability.py (lines 146–263)
load_benchmark_runs() recursively scans results directories for benchmarks.json paired with test-metadata.json, groups runs by configuration key (including max_concurrency), and aggregates metadata. analyze_metric_repeatability() computes per-metric mean, sample standard deviation, CV percentage, repeatability grade, and optional percentile statistics (p90/p95) for each configuration group with ≥2 runs.
Report Generation
automation/test-execution/ansible/scripts/analyze_repeatability.py (lines 266–460)
generate_markdown_report() creates Markdown tables grouped by base configuration, displaying mean and optional percentile metrics across concurrency levels with ±CV formatting and unit-specific precision. generate_json_report() serializes analysis results with NaN-to-null conversion for downstream integration.
CLI Interface
automation/test-execution/ansible/scripts/analyze_repeatability.py (lines 462–557)
main() entry point parses arguments (results directory, Markdown/JSON output paths, verbosity flag), filters configurations with fewer than 2 runs, triggers report generation, logs progress and errors, and returns appropriate exit codes.
Testing & Validation
automation/test-execution/ansible/tests/test_repeatability.py
Test classes validate CV calculation (empty, single, zero-mean, known-value edge cases), grade boundary behavior including NaN handling, letter-grade thresholds, nested JSON path traversal, and analyze_metric_repeatability() aggregation for single vs. multi-run scenarios with expected grade assignments.
Methodology Documentation
docs/methodology/repeatability-analysis.md, docs/docs.md
Complete guide covering quick-start commands, CV formula and interpretation, per-metric configuration, output formats (with example Markdown and JSON schemas), troubleshooting, and dashboard integration patterns. Navigation index updated to include the new documentation page.

Dashboard Integration & Visualization

Layer / File(s) Summary
Dashboard Utilities
automation/test-execution/dashboard-examples/vllm_dashboard/repeatability_utils.py
Shared helper functions for dashboard integration: calculate_cv() and get_repeatability_grade() compute CV and grades from Pandas Series; get_grade_color() maps grades to display colors; calculate_repeatability_metrics() aggregates per-group mean/std/CV/grade; format_metric_with_cv(), get_repeatability_summary(), and create_repeatability_comparison_table() prepare displayable summaries; filter_by_repeatability() filters data by run count, CV threshold, and grade; add_cv_annotations_to_plot() adds Plotly figure annotations with grade-driven styling.
Repeatability Dashboard Page
automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py
Streamlit page (main()) that loads benchmark results, computes repeatability metrics per configuration, and renders: overview statistics (configuration count, CV distribution), grade distribution pie chart, CV-by-metric box plot with threshold reference lines, and a filterable/sortable configuration table with color-coded rows and CSV export. Includes sidebar styling and error handling for insufficient data.
Navigation & Home
automation/test-execution/dashboard-examples/vllm_dashboard/Home.py
Updates landing page to add a fourth dashboard column ("📈 Repeatability") to the overview cards, extends the Dashboard Features section to document CV metrics and filtering capabilities, and adjusts Quick Start formatting.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as analyze_repeatability.py
    participant FS as File System
    participant Report as Report (MD/JSON)
    
    User->>CLI: Run with results directory
    CLI->>FS: Scan for benchmarks.json
    FS-->>CLI: List benchmark files
    CLI->>FS: Load each benchmarks.json + test-metadata.json
    FS-->>CLI: Benchmark & metadata pairs
    CLI->>CLI: Group runs by configuration + concurrency
    CLI->>CLI: For each group: calculate CV, std, mean per metric
    CLI->>CLI: Map CV values to grades
    CLI->>CLI: Generate aggregated statistics per base config
    CLI->>Report: Write Markdown tables (optional: JSON)
    Report-->>User: Reports generated
    User->>User: Review repeatability analysis
Loading
sequenceDiagram
    participant User as User (Streamlit)
    participant Dashboard as Repeatability.py
    participant Utils as repeatability_utils.py
    participant FS as File System
    participant Plot as Plotly/Streamlit Render
    
    User->>Dashboard: Open Repeatability dashboard
    Dashboard->>Dashboard: Load & cache benchmark runs
    Dashboard->>FS: Scan for benchmarks.json + test-metadata.json
    FS-->>Dashboard: Benchmark & metadata pairs
    Dashboard->>Dashboard: Extract config keys & metrics
    Dashboard->>Dashboard: Group runs by configuration
    Dashboard->>Utils: calculate_repeatability_metrics()
    Utils-->>Dashboard: Per-config CV, grade, mean
    Dashboard->>Dashboard: Compute overview statistics
    Dashboard->>Plot: Render grade distribution pie
    Dashboard->>Plot: Render CV-by-metric box plot
    Dashboard->>Dashboard: Apply grade/run filters
    Dashboard->>Utils: filter_by_repeatability()
    Utils-->>Dashboard: Filtered configuration rows
    Dashboard->>Plot: Render configuration table (color-coded)
    Plot-->>User: Interactive dashboard
    User->>User: Explore repeatability insights
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The PR introduces a cohesive repeatability analysis feature across three independent surfaces (CLI tool, dashboard, documentation) with substantial but self-contained logic. The CLI script (557 lines) and dashboard page (475 lines) are logic-dense with overlapping utility functions, requiring careful review of statistical calculations and data aggregation logic. The test suite (429 lines) is comprehensive but straightforward. Utilities and documentation are well-scoped. The moderate complexity stems from multiple interconnected functions, varied data flows (file I/O, grouping, metric calculation), and need to validate correctness of CV formulas and grading logic, offset by consistent patterns across both implementations.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main change: adding benchmark repeatability analysis using Coefficient of Variation (CV) as the core metric. It is clear, specific, and directly related to the comprehensive feature set introduced across multiple files.
Docstring Coverage ✅ Passed Docstring coverage is 93.65% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@maryamtahhan maryamtahhan marked this pull request as ready for review May 5, 2026 12:19
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
automation/test-execution/ansible/scripts/analyze_repeatability.py (1)

187-200: 💤 Low value

Unused enumerate — simplify to plain iteration

i is never referenced in the loop body (Ruff B007).

♻️ Proposed fix
-            for i, benchmark in enumerate(bench_data.get('benchmarks', [])):
+            for benchmark in bench_data.get('benchmarks', []):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@automation/test-execution/ansible/scripts/analyze_repeatability.py` around
lines 187 - 200, The loop over bench_data.get('benchmarks', []) uses enumerate
but never uses the index variable i; change the for loop in
analyze_repeatability.py to iterate directly (e.g., for benchmark in
bench_data.get('benchmarks', []):) so you remove the unused i; keep the existing
body that computes config, strategy, concurrency, extended_key and appends into
runs_by_config[extended_key] unchanged (symbols to locate: bench_data,
benchmark, config, strategy, concurrency, extended_key, runs_by_config).
automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py (1)

53-53: ⚡ Quick win

raw_df is computed and cached but never consumed — remove it from the return value

load_benchmark_runs builds df = pd.DataFrame(all_runs) and returns it as the first element of the tuple, but the caller unpacks it as raw_df which is never used (Ruff RUF059). With @st.cache_data(ttl=300), this entire DataFrame is held in memory for 5 minutes unnecessarily.

♻️ Proposed refactor
 `@st.cache_data`(ttl=300)
-def load_benchmark_runs(results_dir: str) -> tuple[pd.DataFrame, dict]:
-    """...
-    Returns:
-        Tuple of (raw_data_df, repeatability_metrics_dict)
-    """
+def load_benchmark_runs(results_dir: str) -> pd.DataFrame:
+    """...
+    Returns:
+        DataFrame with repeatability metrics per configuration
+    """
     results_path = Path(results_dir)
     all_runs = []
     runs_by_config = defaultdict(list)

     if not results_path.exists():
-        return pd.DataFrame(), {}
+        return pd.DataFrame()

     for json_file in results_path.rglob("benchmarks.json"):
         try:
             ...
-    df = pd.DataFrame(all_runs)  # remove this line
     ...
-    return df, repeatability_df
+    return repeatability_df

 # In main():
-    raw_df, repeatability_df = load_benchmark_runs(str(results_dir))
+    repeatability_df = load_benchmark_runs(str(results_dir))

Also applies to: 126-126, 182-182, 396-396

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py
at line 53, The function load_benchmark_runs currently constructs df (DataFrame)
and returns it as the first tuple element which callers unpack as raw_df but
never use; because the function is decorated with `@st.cache_data`(ttl=300) this
wastes memory—remove the unused DataFrame from the return value by changing
load_benchmark_runs to return only the actually-used object (e.g., the runs
dict) and drop df/raw_df from the return and signature, update all call sites
that unpack raw_df to stop expecting it (adjust unpacking to a single value or
ignore the first element), and remove the cached DataFrame creation or its
caching if the DataFrame is actually needed; apply the same fix to the other
similar locations referenced (lines 126, 182, 396).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@automation/test-execution/ansible/tests/test_repeatability.py`:
- Around line 413-416: The print call uses an unnecessary f-string: change
print(f"\nFailed tests:") to a normal string print("\nFailed tests:") to remove
the spurious f-prefix flagged by Ruff; ensure similar non-interpolated literals
in the failed_tests handling (e.g., the loop printing f"  -
{class_name}.{method_name}: {error}" is fine because it interpolates) are left
unchanged.

In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py:
- Line 232: The marker color fallback uses an invalid hex string '#gray' in the
list comprehension building marker=dict(colors=[colors.get(grade, '#gray') for
grade in grade_counts.index]); change the fallback to a valid CSS color (e.g.,
'gray') or a proper hex like '#808080' so replace occurrences of
colors.get(grade, '#gray') with colors.get(grade, 'gray') (or ' `#808080`') to
ensure Plotly accepts the color.
- Around line 79-106: The loop that builds row uses unguarded dict indexing
(bench['metrics'], metrics['tokens_per_second']['successful']['mean'], etc.)
which raises KeyError and causes the outer try/except to skip the whole file;
change each direct access to use a safe nested accessor (re-use the existing
get_nested_value function used in analyze_repeatability.py or use chained
.get(...) calls with a default of None) for every metric and metadata field
(e.g., replace metrics['...']['successful']['mean'] with
get_nested_value(metrics, ['...','successful','mean'], None)), so missing
individual metrics become None instead of aborting processing of the entire
benchmarks.json.
- Around line 388-390: The code calls a non-existent method get_results_dir() on
DashboardConfig causing an AttributeError; change the call to use the correct
method name get_results_directory() so replace config.get_results_dir() with
config.get_results_directory() where DashboardConfig is used in this file
(ensure no other occurrences remain).

---

Nitpick comments:
In `@automation/test-execution/ansible/scripts/analyze_repeatability.py`:
- Around line 187-200: The loop over bench_data.get('benchmarks', []) uses
enumerate but never uses the index variable i; change the for loop in
analyze_repeatability.py to iterate directly (e.g., for benchmark in
bench_data.get('benchmarks', []):) so you remove the unused i; keep the existing
body that computes config, strategy, concurrency, extended_key and appends into
runs_by_config[extended_key] unchanged (symbols to locate: bench_data,
benchmark, config, strategy, concurrency, extended_key, runs_by_config).

In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py:
- Line 53: The function load_benchmark_runs currently constructs df (DataFrame)
and returns it as the first tuple element which callers unpack as raw_df but
never use; because the function is decorated with `@st.cache_data`(ttl=300) this
wastes memory—remove the unused DataFrame from the return value by changing
load_benchmark_runs to return only the actually-used object (e.g., the runs
dict) and drop df/raw_df from the return and signature, update all call sites
that unpack raw_df to stop expecting it (adjust unpacking to a single value or
ignore the first element), and remove the cached DataFrame creation or its
caching if the DataFrame is actually needed; apply the same fix to the other
similar locations referenced (lines 126, 182, 396).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 4ec20825-d56e-40cf-832d-d8c83673bdad

📥 Commits

Reviewing files that changed from the base of the PR and between 8c3c15d and abde060.

📒 Files selected for processing (7)
  • automation/test-execution/ansible/scripts/analyze_repeatability.py
  • automation/test-execution/ansible/tests/test_repeatability.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/Home.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_📈_Repeatability.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/repeatability_utils.py
  • docs/docs.md
  • docs/methodology/repeatability-analysis.md

Comment on lines +413 to +416
if failed_tests:
print(f"\nFailed tests:")
for class_name, method_name, error in failed_tests:
print(f" - {class_name}.{method_name}: {error}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Spurious f prefix on string literal (Ruff F541)

f"Failed tests:" has no interpolation — the f is a no-op and is flagged as an error by Ruff.

🐛 Proposed fix
-        print(f"Failed tests:")
+        print("Failed tests:")
🧰 Tools
🪛 Ruff (0.15.12)

[error] 414-414: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@automation/test-execution/ansible/tests/test_repeatability.py` around lines
413 - 416, The print call uses an unnecessary f-string: change print(f"\nFailed
tests:") to a normal string print("\nFailed tests:") to remove the spurious
f-prefix flagged by Ruff; ensure similar non-interpolated literals in the
failed_tests handling (e.g., the loop printing f"  - {class_name}.{method_name}:
{error}" is fine because it interpolates) are left unchanged.

Comment on lines +79 to +106
for bench in data.get('benchmarks', []):
metrics = bench['metrics']
config = bench['config']

concurrency = config.get('strategy', {}).get('max_concurrency', 0)

# Extract key metrics
row = {
'test_run_id': metadata.get('test_run_id', 'unknown'),
'platform': metadata.get('platform', 'unknown'),
'model': metadata.get('model', 'unknown'),
'model_short': metadata.get('model', 'unknown').split('/')[-1],
'workload': metadata.get('workload', 'unknown'),
'cores': metadata.get('core_count', 0),
'tensor_parallel': metadata.get('tensor_parallel', 1),
'vllm_version': metadata.get('vllm_version', 'unknown'),
'concurrency': concurrency,
'throughput_mean': metrics['tokens_per_second']['successful']['mean'],
'ttft_mean': metrics['time_to_first_token_ms']['successful']['mean'],
'ttft_p90': metrics['time_to_first_token_ms']['successful']['percentiles']['p90'],
'ttft_p95': metrics['time_to_first_token_ms']['successful']['percentiles']['p95'],
'tpot_mean': metrics['inter_token_latency_ms']['successful']['mean'],
'tpot_p90': metrics['inter_token_latency_ms']['successful']['percentiles']['p90'],
'tpot_p95': metrics['inter_token_latency_ms']['successful']['percentiles']['p95'],
'request_latency_mean': metrics['request_latency']['successful']['mean'],
'request_latency_p90': metrics['request_latency']['successful']['percentiles']['p90'],
'request_latency_p95': metrics['request_latency']['successful']['percentiles']['p95'],
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Silent data loss: unguarded key access will skip entire files on missing metrics

Any benchmark entry that's missing a metric key (e.g. bench['metrics'], metrics['tokens_per_second']['successful']['mean']) raises a KeyError. The outer try/except at line 68 catches this and skips the entire benchmarks.json file with only a logger.warning. This is inconsistent with analyze_repeatability.py, which uses get_nested_value() with per-value None defaults so that only the missing metric is excluded, not the whole file.

🛡️ Proposed fix — use safe nested access per metric
+            def _get(d, *keys, default=None):
+                for k in keys:
+                    if isinstance(d, dict) and k in d:
+                        d = d[k]
+                    else:
+                        return default
+                return d

             for bench in data.get('benchmarks', []):
-                metrics = bench['metrics']
-                config = bench['config']
+                metrics = bench.get('metrics', {})
+                config = bench.get('config', {})

                 concurrency = config.get('strategy', {}).get('max_concurrency', 0)

                 row = {
                     ...
-                    'throughput_mean': metrics['tokens_per_second']['successful']['mean'],
-                    'ttft_mean': metrics['time_to_first_token_ms']['successful']['mean'],
-                    'ttft_p90': metrics['time_to_first_token_ms']['successful']['percentiles']['p90'],
+                    'throughput_mean': _get(metrics, 'tokens_per_second', 'successful', 'mean'),
+                    'ttft_mean': _get(metrics, 'time_to_first_token_ms', 'successful', 'mean'),
+                    'ttft_p90': _get(metrics, 'time_to_first_token_ms', 'successful', 'percentiles', 'p90'),
                     # ... apply similarly to remaining fields
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py
around lines 79 - 106, The loop that builds row uses unguarded dict indexing
(bench['metrics'], metrics['tokens_per_second']['successful']['mean'], etc.)
which raises KeyError and causes the outer try/except to skip the whole file;
change each direct access to use a safe nested accessor (re-use the existing
get_nested_value function used in analyze_repeatability.py or use chained
.get(...) calls with a default of None) for every metric and metadata field
(e.g., replace metrics['...']['successful']['mean'] with
get_nested_value(metrics, ['...','successful','mean'], None)), so missing
individual metrics become None instead of aborting processing of the entire
benchmarks.json.

fig = go.Figure(data=[go.Pie(
labels=grade_counts.index,
values=grade_counts.values,
marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

'#gray' is not a valid Plotly/CSS color

colors.get(grade, '#gray') — the fallback '#gray' is not a valid hex color (hex colors look like '#808080'). Use the named color 'gray' instead.

🐛 Proposed fix
-        marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]),
+        marker=dict(colors=[colors.get(grade, 'gray') for grade in grade_counts.index]),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
marker=dict(colors=[colors.get(grade, '#gray') for grade in grade_counts.index]),
marker=dict(colors=[colors.get(grade, 'gray') for grade in grade_counts.index]),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@automation/test-execution/dashboard-examples/vllm_dashboard/pages/4_`📈_Repeatability.py
at line 232, The marker color fallback uses an invalid hex string '#gray' in the
list comprehension building marker=dict(colors=[colors.get(grade, '#gray') for
grade in grade_counts.index]); change the fallback to a valid CSS color (e.g.,
'gray') or a proper hex like '#808080' so replace occurrences of
colors.get(grade, '#gray') with colors.get(grade, 'gray') (or ' `#808080`') to
ensure Plotly accepts the color.

maryamtahhan and others added 6 commits May 5, 2026 14:57
Add comprehensive repeatability analysis tooling to measure benchmark
consistency using Coefficient of Variation (CV) metrics.

New features:
- analyze_repeatability.py: Calculate CV for metrics across multiple runs
- Automatic grouping by platform, model, cores, TP, vLLM version, concurrency
- Repeatability grading (Excellent/Good/Acceptable/Poor) based on CV thresholds
- Markdown and JSON report generation with detailed CV tables and rankings
- Dashboard utilities for integrating CV metrics into visualizations

CV interpretation:
- CV < 1%: Excellent (ideal for regression testing)
- CV 1-3%: Good (suitable for performance comparisons)
- CV 3-5%: Acceptable (moderate variance)
- CV > 5%: Poor (high variance, unreliable results)

Metrics analyzed:
- Request latency (mean, P90, P95)
- TTFT (mean, P90, P95)
- TPoT/ITL (mean, P90, P95)
- Output throughput

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Move README_REPEATABILITY.md to docs/methodology/repeatability-analysis.md
- Add Jekyll frontmatter for GitHub Pages integration
- Update docs index to include repeatability analysis
- Add to documentation status table

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add 34 unit tests covering:
- CV calculation accuracy and edge cases
- Repeatability grade assignment (Excellent/Good/Acceptable/Poor)
- Letter grade mapping (A+ through C)
- Nested value extraction from benchmark JSON
- Metric repeatability analysis
- Edge cases: empty data, single run, zero mean, NaN handling

All tests pass with pytest.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add new Repeatability dashboard (page 4) with:

Features:
- CV metrics overview with grade distribution
- Interactive filters (minimum grade, minimum runs)
- Box plots showing CV distribution by metric type
- Detailed configuration table with color-coded grades
- CSV export for repeatability analysis
- Grade thresholds visualization (Excellent/Good/Acceptable/Poor)

Uses existing repeatability_utils.py for CV calculations.

Also updates Home.py to include the new dashboard in navigation
and feature descriptions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add repeatability analysis to Client-Side Metrics dashboard:

- Import repeatability_utils functions
- Calculate CV metrics for grouped configurations
- Display repeatability summary in collapsible expander
- Show CV percentages and grades (Excellent/Good/Acceptable/Poor)
- Format metrics with configuration details (platform, cores, concurrency)
- Add grade legend for easy interpretation

Users can now see repeatability metrics directly in the performance
dashboard without navigating to the dedicated Repeatability page.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant