[CI][Benchmarks] Add dashboard info in contrib #19549

Merged · 1 commit · Jul 29, 2025

106 changes: 106 additions & 0 deletions devops/scripts/benchmarks/CONTRIB.md
@@ -67,6 +67,104 @@ The suite is structured around three main components: Suites, Benchmarks, and Results
* `display_name`: Optional user-friendly name for the benchmark (string). Defaults to `name()`.
* `explicit_group`: Optional explicit group name for results (string). Used to group results in visualizations.

## Dashboard and Visualization

The benchmark suite generates an interactive HTML dashboard that visualizes `Result` objects and their metadata.

### Data Flow from Results to Dashboard

1. **Collection Phase:**
* Benchmarks generate `Result` objects containing performance measurements.
* The framework combines these with `BenchmarkMetadata` from benchmarks and suites; this metadata defines how the charts are built.
* All data is packaged into a `BenchmarkOutput` object containing runs, metadata, and tags for serialization.

2. **Serialization:**
* For local viewing (`--output-html local`): data is written as JavaScript variables in `data.js`, which is loaded directly by the HTML dashboard.
* For remote deployment (`--output-html remote`): data is written as JSON in `data.json`. The `config.js` file contains the URL where the JSON file is hosted.
* Historical runs may be separated into archive files for better dashboard load times.

3. **Dashboard Rendering:**
* JavaScript processes the data to create three chart types:
* **Historical Results**: Time-series charts showing performance trends over multiple runs. One chart for each unique benchmark scenario.
* **Historical Layer Comparisons**: Time-series charts for grouped results. Benchmark scenarios that can be directly compared are grouped either via `explicit_group()` or by matching the beginning of their labels against predefined groups.
* **Comparisons**: Bar charts comparing selected runs side-by-side, again grouped by `explicit_group()` or label prefixes.
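
The sketch below illustrates this flow with simplified stand-in dataclasses. The field names come from the descriptions in this document, but the real `Result`, `BenchmarkRun`, and `BenchmarkOutput` classes carry more fields and may have different constructors, and the `data.js` variable name used here is only illustrative.

```python
# Illustrative sketch only: simplified stand-ins for the real classes,
# using field names mentioned in this document.
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Result:
    label: str                  # unique scenario name; one time-series chart per label
    value: float
    unit: str
    suite: str
    stddev: float = 0.0
    git_hash: str = ""
    lower_is_better: bool = True


@dataclass
class BenchmarkRun:
    name: str                   # run identifier shown in the run selector
    date: str                   # x-axis of the historical charts
    results: list = field(default_factory=list)


@dataclass
class BenchmarkOutput:
    runs: list
    metadata: dict              # chart-defining metadata, keyed by benchmark
    tags: dict
    default_compare_names: list = field(default_factory=list)


output = BenchmarkOutput(
    runs=[BenchmarkRun("baseline", "2025-07-29",
                       [Result("submit_kernel", 12.3, "us", "compute")])],
    metadata={},
    tags={},
)

# --output-html remote: plain JSON, fetched from the URL configured in config.js
with open("data.json", "w") as f:
    json.dump(asdict(output), f)

# --output-html local: the same data embedded as a JavaScript variable in data.js
with open("data.js", "w") as f:
    f.write("benchmarkRuns = " + json.dumps(asdict(output)) + ";")
```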

### Chart Types and Result Mapping

**Historical Results (Time-series):**
* One chart per unique `result.label`.
* X-axis: `BenchmarkRun.date` (time).
* Y-axis: `result.value` with `result.unit`.
* Multiple lines for different `BenchmarkRun.name` entries.
* Points include `result.stddev`, `result.git_hash`, and environment info in tooltips.

**Historical Layer Comparisons:**
* Groups related results using `benchmark.explicit_group()` or `result.label` prefixes.
* Useful for comparing different implementations/configurations of the same benchmark.
* Same time-series format but with grouped data.

**Comparisons (Bar charts):**
* Compares selected runs side-by-side.
* X-axis: `BenchmarkRun.name`.
* Y-axis: `result.value` with `result.unit`.
* One bar per selected run.
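
As a rough illustration of the mapping above, the helper below returns the group a result would be charted under in the comparison views; the precedence of `explicit_group()` over label prefixes is an assumption based on the descriptions in this section.

```python
# Sketch of the grouping rule described above (illustrative only, not the
# dashboard's actual JavaScript).
def comparison_group(label, explicit_group, predefined_prefixes):
    """Return the group a result is charted under in the comparison views."""
    if explicit_group:                      # explicit_group() takes priority here
        return explicit_group
    for prefix in predefined_prefixes:      # otherwise fall back to label prefixes
        if label.startswith(prefix):
            return prefix
    return None                             # ungrouped: appears only in per-label charts

# Historical Results need no grouping: every unique label gets its own
# time-series chart, with one line per selected run name.
```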

### Dashboard Features Controlled by Results/Metadata

**Visual Properties:**
* **Chart Title**: `metadata.display_name` or `result.label`.
* **Y-axis Range**: `metadata.range_min` and `range_max` (when custom ranges are enabled).
* **Direction Indicator**: `result.lower_is_better` (shows "Lower/Higher is better").
* **Grouping**: `benchmark.explicit_group()` groups related results together.

**Filtering and Organization:**
* **Suite Filters**: Filter by `result.suite`.
* **Tag Filters**: Filter by `metadata.tags`.
* **Regex Search**: Search by `result.label` patterns; `metadata.display_name` patterns are not searchable.
* **Stability**: Hide/show based on `metadata.unstable`.

**Information Display:**
* **Description**: `metadata.description` appears prominently above charts.
* **Notes**: `metadata.notes` provides additional context (toggleable).
* **Tags**: `metadata.tags` displayed as colored badges with descriptions.
* **Command Details**: Shows `result.command` and `result.env` in expandable sections.
* **Git Information**: `result.git_url` and `result.git_hash` for benchmark source tracking.
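
The following minimal sketch shows how these metadata fields might translate into chart presentation; it mirrors the rules above rather than the dashboard's actual JavaScript.

```python
# Sketch of how metadata fields drive chart presentation (illustrative only).
def chart_title(metadata, result_label):
    # display_name, when set, replaces the raw label in the title; note that
    # regex search still matches result.label rather than the display name.
    return metadata.display_name or result_label

def y_axis_range(metadata, values, custom_ranges_enabled):
    # range_min/range_max apply only when the custom-ranges option is enabled.
    if custom_ranges_enabled and metadata.range_min is not None \
            and metadata.range_max is not None:
        return metadata.range_min, metadata.range_max
    return min(values), max(values)         # otherwise auto-scale to the data
```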

### Dashboard Interaction

**Run Selection:**
* Users select which `BenchmarkRun.name` entries to compare.
* Default selection uses `BenchmarkOutput.default_compare_names`.
* Changes affect all chart types simultaneously.

**URL State Preservation:**
* All filters, selections, and options are preserved in URL parameters.
* Enables sharing specific dashboard views by copying the URL.
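
A minimal sketch of the selection fallback, reusing the simplified `BenchmarkOutput` shape from the earlier sketch:

```python
# Sketch of the default run selection (illustrative only).
def runs_to_display(output, user_selection=None):
    selected = user_selection or output.default_compare_names
    return [run for run in output.runs if run.name in selected]
```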

### Best Practices for Dashboard-Friendly Results

**Naming:**
* Use unique, descriptive `result.label` names.
* Consider `metadata.display_name` for prettier chart titles.
* Ensure `benchmark.name()` is unique across all suites.

**Grouping:**
* Use `benchmark.explicit_group()` to group related measurements.
* Ensure grouped results have the same `result.unit`.
* Group-level metadata keys defined in `Suite.additional_metadata()` should match the corresponding group name prefixes.

**Metadata:**
* Provide `metadata.description` for user understanding.
* Use `metadata.notes` for implementation details or caveats.
* Tag with relevant `metadata.tags` for filtering.
* Set `metadata.range_min`/`range_max` for consistent comparisons when needed.

**Stability:**
* Mark unstable benchmarks with `metadata.unstable` to hide them by default.
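
Putting these practices together, a dashboard-friendly benchmark might emit values along these lines (plain dictionaries are used here to avoid implying exact constructor signatures):

```python
# Hypothetical example of dashboard-friendly values. Names are invented and
# plain dictionaries stand in for the real Result/metadata objects.
results = [
    # unique, descriptive labels; matching units within one explicit group
    dict(label="SubmitKernel_in_order", value=10.2, unit="us",
         explicit_group="SubmitKernel", lower_is_better=True),
    dict(label="SubmitKernel_out_of_order", value=11.7, unit="us",
         explicit_group="SubmitKernel", lower_is_better=True),
]

metadata = dict(
    display_name="Submit Kernel (in-order)",      # prettier chart title
    description="Measures kernel submission overhead.",
    notes="Single-threaded submission; see the benchmark source for details.",
    tags=["submit", "latency"],
    unstable=False,                               # mark unstable to hide by default
    range_min=0, range_max=20,                    # optional fixed y-axis range
)
```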

## Adding New Benchmarks

1. **Create Benchmark Class:** Implement a new class inheriting from `benches.base.Benchmark`. Implement required methods (`setup`, `run`, `teardown`, `name`) and optional ones (`description`, `get_tags`, etc.) as needed.
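
A minimal sketch of such a class is shown below; the exact method signatures and the way results are returned are assumptions, so copy the real interface from an existing benchmark under `benches/` and from `benches/base.py`.

```python
# Minimal sketch of a new benchmark class. The exact method signatures and
# the way results are returned are assumptions -- copy the real interface
# from an existing benchmark under benches/ and from benches/base.py.
from benches.base import Benchmark

class MyKernelBenchmark(Benchmark):
    def name(self) -> str:
        return "my_kernel_submit_latency"   # must be unique across all suites

    def description(self) -> str:           # optional
        return "Measures submission latency of a hypothetical kernel."

    def get_tags(self) -> list:             # optional
        return ["latency", "submit"]

    def setup(self):
        # build or fetch whatever the benchmark binary needs
        ...

    def run(self):
        # execute the workload and produce Result objects; the real run()
        # signature and return value may differ
        ...

    def teardown(self):
        # clean up temporary files, processes, etc.
        ...
```
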
@@ -84,6 +182,14 @@ The suite is structured around three main components: Suites, Benchmarks, and Results
* **Use unique names:** Ensure `benchmark.name()` and `result.label` are descriptive and unique.
* **Group related results:** Use `benchmark.explicit_group()` consistently for results you want to compare directly in outputs. Ensure units match within a group. If defining group-level metadata in the Suite, ensure the chosen `explicit_group` name starts with the corresponding key defined in `additional_metadata()`.
* **Test locally:** Before submitting changes, test with relevant drivers/backends (e.g., using `--compute-runtime --build-igc` for L0). Check the visualization locally if possible (`--output-markdown --output-html`, then open the generated files).
* **Test dashboard visualization:** When adding new benchmarks, always generate and review the HTML dashboard to ensure:
* Chart titles and labels are clear and readable.
* Results are grouped logically using `explicit_group()`.
* Metadata (description, notes, tags) displays correctly.
* Y-axis ranges are appropriate (consider setting `range_min`/`range_max` if needed).
* Filtering by suite and tags works as expected.
* Time-series trends make sense for historical data.
* **Tip**: Use `--dry-run --output-html local` to regenerate the dashboard without re-running benchmarks. This uses existing historical data and is useful for testing metadata changes, new groupings, or dashboard improvements.

## Utilities
