feat: GPU Telemetry Realtime Dashboard Display #370
base: main
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The PR introduces comprehensive GPU telemetry support to the aiperf system by adding new configuration fields for telemetry modes and URLs, defining real-time telemetry message types and commands, creating UI dashboard components for GPU metrics visualization, establishing telemetry hooks and mixins, and updating telemetry managers and record processors to track and publish real-time metrics alongside traditional telemetry collection.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant RecordsManager
    participant TelemetryManager
    participant Processor
    participant Dashboard
    participant Hooks
    User->>RecordsManager: START_REALTIME_TELEMETRY
    RecordsManager->>RecordsManager: Switch mode to REALTIME_DASHBOARD
    RecordsManager->>TelemetryManager: Collect telemetry
    TelemetryManager->>Processor: Process metrics
    Processor->>Processor: Format MetricResult entries
    Processor->>RecordsManager: Return metrics
    RecordsManager->>RecordsManager: _generate_realtime_telemetry_metrics()
    RecordsManager->>Hooks: Trigger ON_REALTIME_TELEMETRY_METRICS
    Hooks->>Dashboard: on_realtime_telemetry_metrics(metrics)
    Dashboard->>Dashboard: Update GPU tables per GPU
    Dashboard->>User: Display real-time metrics
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

The changes span multiple architectural layers with significant logic additions to telemetry processing and UI integration, balanced by cohesive grouping around the telemetry feature. Key complexity drivers include: (1) multi-file coordination across enums, messages, hooks, and mixins; (2) state management in TelemetryManager with a new collector_id_to_url mapping; (3) a comprehensive refactor of records_manager with new reporting tasks and command handling; (4) three new UI component classes with nested helper methods; (5) integration points requiring cross-subsystem understanding. However, the changes are relatively homogeneous in intent (telemetry feature additions) despite broad file distribution.
Pre-merge checks and finishing touches — ✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Codecov Report — ❌ Patch coverage check flagged (the coverage value did not render in this capture).
Actionable comments posted: 5
🧹 Nitpick comments (3)
docs/tutorials/gpu-telemetry.md (1)
144-144: Consider documenting the purpose of HTML comment markers.

The section delimiters (e.g., `<!-- setup-vllm-gpu-telemetry-default-openai-endpoint-server -->` and `<!-- /setup-vllm-gpu-telemetry-default-openai-endpoint-server -->`) appear intentional, likely for programmatic documentation generation or test orchestration. Consider adding a brief comment block at the top of the file explaining their purpose and any downstream tooling that depends on them, so future maintainers understand why they exist.

Also applies to: 208-208, 251-251, 261-261, 265-265, 286-286
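Paired markers like these are easy to consume programmatically. As a hedged sketch (the `extract_tagged_block` helper and the `setup-server` tag below are hypothetical illustrations, not actual aiperf tooling), a doc-test runner might pull out a tagged snippet like this:

```python
import re


def extract_tagged_block(markdown: str, tag: str) -> str:
    """Return the text between <!-- tag --> and <!-- /tag --> markers."""
    pattern = re.compile(
        rf"<!--\s*{re.escape(tag)}\s*-->(.*?)<!--\s*/{re.escape(tag)}\s*-->",
        re.DOTALL,
    )
    match = pattern.search(markdown)
    if match is None:
        raise KeyError(f"No block tagged {tag!r} found")
    return match.group(1).strip()


doc = """
<!-- setup-server -->
docker run --gpus all my-server
<!-- /setup-server -->
"""
print(extract_tagged_block(doc, "setup-server"))  # docker run --gpus all my-server
```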
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (2)
41-43: Annotate mutable class attributes as `ClassVar` (or make them tuples).

`STATS_FIELDS` and `COLUMNS` are shared, mutable class attributes flagged by Ruff (RUF012). Mark them as `ClassVar` and switch to immutable tuples to avoid accidental per-instance mutation. Apply this diff:

```diff
-from rich.text import Text
+from typing import ClassVar
+
+from rich.text import Text
@@
-    STATS_FIELDS = ["avg", "min", "max", "p99", "p90", "p50", "std"]
-    COLUMNS = ["Metric", *STATS_FIELDS]
+    STATS_FIELDS: ClassVar[tuple[str, ...]] = ("avg", "min", "max", "p99", "p90", "p50", "std")
+    COLUMNS: ClassVar[tuple[str, ...]] = ("Metric", *STATS_FIELDS)
```
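To illustrate why the mutable class attribute matters (a minimal standalone sketch, not aiperf code): every instance shares the same class-level list object, so an in-place mutation through one instance is visible everywhere. A `ClassVar`-annotated tuple avoids both the shared mutation and the type-checker blind spot:

```python
from typing import ClassVar


class BadTable:
    COLUMNS = ["Metric"]  # one list object shared by every instance


class GoodTable:
    # Immutable and explicitly class-level; Ruff/type checkers flag misuse
    COLUMNS: ClassVar[tuple[str, ...]] = ("Metric",)


a, b = BadTable(), BadTable()
a.COLUMNS.append("avg")  # mutates the shared list...
print(b.COLUMNS)         # ...so every other instance sees ['Metric', 'avg']
print(GoodTable.COLUMNS) # ('Metric',) — tuples cannot be mutated in place
```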
112-120: Narrow the exception handling when updating cells.

Catching bare `Exception` hides legitimate errors. `DataTable.update_cell` mainly raises `RowDoesNotExist` or `ColumnDoesNotExist`; handle those explicitly so real bugs surface. Apply this diff:

```diff
-from textual.widgets.data_table import ColumnKey, RowDoesNotExist, RowKey
+from textual.widgets.data_table import ColumnDoesNotExist, ColumnKey, RowDoesNotExist, RowKey
@@
     for col_name, cell_value in zip(self.COLUMNS, row_cells, strict=True):
         try:
             self.data_table.update_cell(  # type: ignore
                 row_key, self._column_keys[col_name], cell_value, update_width=True
             )
-        except Exception as e:
+        except (RowDoesNotExist, ColumnDoesNotExist) as e:
             _logger.warning(
                 f"Error updating cell {col_name} with value {cell_value}: {e!r}"
             )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- aiperf/common/config/user_config.py (2 hunks)
- aiperf/common/enums/__init__.py (2 hunks)
- aiperf/common/enums/command_enums.py (1 hunk)
- aiperf/common/enums/message_enums.py (1 hunk)
- aiperf/common/enums/telemetry_enums.py (1 hunk)
- aiperf/common/hooks.py (2 hunks)
- aiperf/common/messages/__init__.py (2 hunks)
- aiperf/common/messages/telemetry_messages.py (3 hunks)
- aiperf/common/mixins/__init__.py (2 hunks)
- aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1 hunk)
- aiperf/exporters/csv_exporter.py (0 hunks)
- aiperf/exporters/json_exporter.py (0 hunks)
- aiperf/gpu_telemetry/telemetry_data_collector.py (0 hunks)
- aiperf/gpu_telemetry/telemetry_manager.py (4 hunks)
- aiperf/post_processors/telemetry_results_processor.py (3 hunks)
- aiperf/records/records_manager.py (10 hunks)
- aiperf/ui/__init__.py (3 hunks)
- aiperf/ui/base_ui.py (1 hunk)
- aiperf/ui/dashboard/__init__.py (2 hunks)
- aiperf/ui/dashboard/aiperf_dashboard_ui.py (1 hunk)
- aiperf/ui/dashboard/aiperf_textual_app.py (7 hunks)
- aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1 hunk)
- docs/tutorials/gpu-telemetry.md (5 hunks)
- tests/ci/test_docs_end_to_end/test_runner.py (3 hunks)
- tests/gpu_telemetry/test_telemetry_manager.py (13 hunks)
- tests/post_processors/test_telemetry_results_processor.py (1 hunk)
- tests/records/test_records_filtering.py (2 hunks)
💤 Files with no reviewable changes (3)
- aiperf/exporters/csv_exporter.py
- aiperf/exporters/json_exporter.py
- aiperf/gpu_telemetry/telemetry_data_collector.py
🧰 Additional context used
🧬 Code graph analysis (19)
aiperf/ui/dashboard/__init__.py (1)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
    GPUMetricsTable (21-154)
    RealtimeTelemetryDashboard (248-313)
    SingleNodeView (157-245)
aiperf/ui/dashboard/aiperf_dashboard_ui.py (4)
  aiperf/common/mixins/hooks_mixin.py (1)
    attach_hook (96-114)
  aiperf/common/hooks.py (2)
    AIPerfHook (41-57)
    on_realtime_telemetry_metrics (352-364)
  aiperf/ui/dashboard/aiperf_textual_app.py (1)
    on_realtime_telemetry_metrics (273-277)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1)
    on_realtime_telemetry_metrics (304-313)
tests/post_processors/test_telemetry_results_processor.py (3)
  tests/data_exporters/conftest.py (1)
    sample_telemetry_record (17-38)
  aiperf/common/models/telemetry_models.py (2)
    TelemetryRecord (69-107)
    TelemetryHierarchy (270-320)
  aiperf/post_processors/telemetry_results_processor.py (3)
    TelemetryResultsProcessor (23-99)
    process_telemetry_record (39-45)
    get_telemetry_hierarchy (31-37)
aiperf/post_processors/telemetry_results_processor.py (3)
  aiperf/exporters/display_units_utils.py (1)
    normalize_endpoint_display (17-36)
  aiperf/common/models/telemetry_models.py (4)
    TelemetryHierarchy (270-320)
    TelemetryRecord (69-107)
    add_record (240-251)
    add_record (293-320)
  aiperf/common/models/record_models.py (1)
    MetricResult (25-51)
aiperf/common/enums/telemetry_enums.py (1)
  aiperf/common/enums/base_enums.py (1)
    CaseInsensitiveStrEnum (10-49)
aiperf/common/enums/__init__.py (1)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
aiperf/ui/base_ui.py (2)
  aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1)
    RealtimeTelemetryMetricsMixin (15-43)
  aiperf/common/mixins/realtime_metrics_mixin.py (1)
    RealtimeMetricsMixin (15-34)
aiperf/gpu_telemetry/telemetry_manager.py (1)
  aiperf/common/messages/telemetry_messages.py (1)
    TelemetryRecordsMessage (20-45)
aiperf/common/config/user_config.py (2)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
  aiperf/common/config/cli_parameter.py (1)
    DisableCLI (22-29)
aiperf/ui/__init__.py (2)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
    GPUMetricsTable (21-154)
    RealtimeTelemetryDashboard (248-313)
    SingleNodeView (157-245)
  aiperf/ui/dashboard/custom_widgets.py (2)
    MaximizableWidget (15-32)
    NonFocusableDataTable (10-12)
aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (6)
  aiperf/common/enums/message_enums.py (1)
    MessageType (7-51)
  aiperf/common/hooks.py (3)
    AIPerfHook (41-57)
    on_message (313-334)
    provides_hooks (206-228)
  aiperf/common/messages/telemetry_messages.py (1)
    RealtimeTelemetryMetricsMessage (79-86)
  aiperf/common/mixins/message_bus_mixin.py (1)
    MessageBusClientMixin (34-141)
  aiperf/common/models/record_models.py (1)
    MetricResult (25-51)
  aiperf/common/aiperf_logger.py (1)
    AIPerfLogger (25-239)
aiperf/common/messages/__init__.py (1)
  aiperf/common/messages/telemetry_messages.py (2)
    RealtimeTelemetryMetricsMessage (79-86)
    StartRealtimeTelemetryCommand (89-98)
aiperf/common/hooks.py (2)
  aiperf/ui/dashboard/aiperf_textual_app.py (1)
    on_realtime_telemetry_metrics (273-277)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1)
    on_realtime_telemetry_metrics (304-313)
aiperf/records/records_manager.py (7)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
  aiperf/common/messages/telemetry_messages.py (2)
    RealtimeTelemetryMetricsMessage (79-86)
    StartRealtimeTelemetryCommand (89-98)
  aiperf/exporters/display_units_utils.py (1)
    normalize_endpoint_display (17-36)
  aiperf/common/models/error_models.py (1)
    ErrorDetails (11-47)
  aiperf/gpu_telemetry/telemetry_manager.py (2)
    TelemetryManager (40-378)
    _normalize_dcgm_url (106-118)
  aiperf/post_processors/telemetry_results_processor.py (2)
    summarize (47-99)
    get_telemetry_hierarchy (31-37)
  aiperf/common/models/telemetry_models.py (1)
    TelemetryResults (323-348)
aiperf/ui/dashboard/aiperf_textual_app.py (4)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
  aiperf/common/messages/telemetry_messages.py (1)
    StartRealtimeTelemetryCommand (89-98)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
    RealtimeTelemetryDashboard (248-313)
    set_status_message (292-302)
    on_realtime_telemetry_metrics (304-313)
  aiperf/common/hooks.py (1)
    on_realtime_telemetry_metrics (352-364)
tests/gpu_telemetry/test_telemetry_manager.py (1)
  tests/data_exporters/test_gpu_telemetry_console_exporter.py (1)
    mock_user_config (29-33)
aiperf/common/mixins/__init__.py (1)
  aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1)
    RealtimeTelemetryMetricsMixin (15-43)
aiperf/common/messages/telemetry_messages.py (7)
  aiperf/common/enums/command_enums.py (1)
    CommandType (7-17)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
  aiperf/common/enums/message_enums.py (1)
    MessageType (7-51)
  aiperf/common/messages/command_messages.py (1)
    CommandMessage (57-97)
  aiperf/common/models/error_models.py (1)
    ErrorDetails (11-47)
  aiperf/common/models/record_models.py (1)
    MetricResult (25-51)
  aiperf/common/models/telemetry_models.py (2)
    ProcessTelemetryResult (351-362)
    TelemetryRecord (69-107)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (5)
  aiperf/common/aiperf_logger.py (1)
    AIPerfLogger (25-239)
  aiperf/common/config/service_config.py (1)
    ServiceConfig (30-171)
  aiperf/common/models/record_models.py (1)
    MetricResult (25-51)
  aiperf/ui/dashboard/custom_widgets.py (2)
    MaximizableWidget (15-32)
    NonFocusableDataTable (10-12)
  aiperf/ui/dashboard/aiperf_textual_app.py (2)
    compose (115-148)
    on_realtime_telemetry_metrics (273-277)
🪛 Ruff (0.14.0)
aiperf/common/config/user_config.py
244-244: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py
41-41: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
42-42: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
117-117: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (11)
docs/tutorials/gpu-telemetry.md (3)
37-38: Well-documented default endpoint behavior.

The IMPORTANT note clearly explains that the default endpoint `http://localhost:9401/metrics` is always attempted regardless of the `--gpu-telemetry` flag. This is well-emphasized and reinforced again in the multi-node example (lines 294, 316). The clarification is critical to user understanding and is properly positioned.
180-206: Port forwarding and DCGM exporter configuration is clearly explained.

The setup instructions for the DCGM exporter are well-structured with clear explanations of port forwarding (`-p 9401:9400`), the metrics collection interval, and custom metrics file mounting. The example commands are syntactically correct and production-ready.
31-35: Table effectively communicates flag behavior.

The comparison table succinctly explains the differences between no flag, flag only, and custom URLs, making it easy for users to understand which endpoints are collected and displayed in each scenario.
aiperf/ui/dashboard/__init__.py (1)
34-38: LGTM!

The new dashboard components (GPUMetricsTable, RealtimeTelemetryDashboard, SingleNodeView) are correctly imported and exported. The changes follow the standard mkinit pattern for exposing the public API surface.
Also applies to: 55-55, 63-63, 65-65
aiperf/ui/__init__.py (1)
20-20: LGTM!

The telemetry dashboard components are correctly re-exported from the dashboard submodule to the top-level UI package API. This follows the standard mkinit-generated pattern.
Also applies to: 28-28, 30-30, 52-52, 62-62, 64-64
aiperf/common/enums/command_enums.py (1)
17-17: LGTM!

The new START_REALTIME_TELEMETRY command type follows the existing naming conventions and is properly integrated into the CommandType enum.
aiperf/ui/base_ui.py (1)
7-7: LGTM!

The RealtimeTelemetryMetricsMixin is correctly integrated into the BaseAIPerfUI class hierarchy. The mixin follows the same pattern as the existing RealtimeMetricsMixin and provides the necessary hooks for handling real-time telemetry metrics. The docstring updates accurately reflect the new capability.
Also applies to: 12-17, 21-23, 25-27
aiperf/common/enums/message_enums.py (1)
44-44: LGTM!

The new REALTIME_TELEMETRY_METRICS message type is correctly added to the MessageType enum, following the established naming conventions and alphabetical ordering.
aiperf/ui/dashboard/aiperf_dashboard_ui.py (1)
73-76: LGTM!

The ON_REALTIME_TELEMETRY_METRICS hook attachment follows the established pattern for hook registration in the dashboard UI. The integration correctly wires the telemetry metrics flow from the mixin to the textual app's handler.
aiperf/common/mixins/__init__.py (1)
44-46: LGTM!

The RealtimeTelemetryMetricsMixin is correctly imported and exported following the standard mkinit pattern. The alphabetical ordering is maintained.
Also applies to: 69-69
tests/records/test_records_filtering.py (1)
352-358: LGTM!The test updates correctly reflect the new
dcgm_urlparameter added toTelemetryRecordsMessage. The test values are consistent and appropriate for the testing scenarios.Also applies to: 386-392
Pull Request Overview
Adds real-time GPU telemetry visualization dashboard to monitor GPU metrics during benchmark execution, introducing a new interactive TUI dashboard that displays live GPU metrics (utilization, memory, temperature, power).
- Real-time GPU telemetry dashboard with --gpu-telemetry 'dashboard' flag
- New UI components and data streaming infrastructure for telemetry visualization
- Enhanced telemetry management with endpoint tracking and performance optimization
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| aiperf/ui/dashboard/realtime_telemetry_dashboard.py | New interactive TUI dashboard for real-time GPU metrics visualization |
| aiperf/ui/dashboard/aiperf_textual_app.py | Integrates telemetry dashboard into main UI with mode switching |
| aiperf/records/records_manager.py | Adds telemetry data streaming and background task management |
| aiperf/post_processors/telemetry_results_processor.py | Enhances telemetry processor with hierarchy access and display formatting |
| aiperf/gpu_telemetry/telemetry_manager.py | Updates telemetry manager with URL tracking and message enrichment |
| aiperf/common/config/user_config.py | Adds GPU telemetry mode configuration parsing |
| tests/ | Updates test files with new dcgm_url field and configuration changes |
Halfway through, but it looks really well done so far! Great work!
```python
telemetry_mode: GPUTelemetryMode = Field(
    default=GPUTelemetryMode.REALTIME_DASHBOARD,
    description="The GPU telemetry mode to set when starting realtime telemetry",
)
```
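One plausible reading of how such a field participates in the flow (a sketch with stand-in types — the real `GPUTelemetryMode` and command handling live in aiperf and may differ):

```python
from dataclasses import dataclass
from enum import Enum


class GPUTelemetryMode(str, Enum):
    """Stand-in for aiperf's CaseInsensitiveStrEnum-based mode enum."""
    SUMMARY = "summary"
    REALTIME_DASHBOARD = "realtime_dashboard"


@dataclass
class StartRealtimeTelemetryCommand:
    """Sketch: the command carries the mode the records manager should adopt."""
    telemetry_mode: GPUTelemetryMode = GPUTelemetryMode.REALTIME_DASHBOARD


def handle_command(current_mode: GPUTelemetryMode,
                   cmd: StartRealtimeTelemetryCommand) -> GPUTelemetryMode:
    # The records manager would switch its reporting mode based on the command.
    return cmd.telemetry_mode


mode = handle_command(GPUTelemetryMode.SUMMARY, StartRealtimeTelemetryCommand())
print(mode.value)  # realtime_dashboard
```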
this may be a note to myself, but how is this being used? what is it needed by?
```python
# before
@background_task(interval=None, immediate=True)
async def _report_realtime_metrics_task(self) -> None:
    """Report the real-time metrics at a regular interval (only if the UI type is dashboard)."""

# after
@background_task(interval=DEFAULT_REALTIME_METRICS_INTERVAL, immediate=False)
```
The problem with doing it this way is that my original `interval=None` means the task won't repeat, so if the UI type is not dashboard, it will stop the task. But here, it will still fire the task every interval and then exit, so it's a little less efficient.
alternatively, we could have a custom exception that we raise, that background_task would catch, to know it should stop.
like raise BackgroundTaskDisabled("UI type is not Dashboard: {ui_type=}") if we always assume it will be disabled at the beginning
or raise StopBackgroundTask("..") if we want to allow exiting mid-execution
sorta like that.
But granted, those options dont ever allow you to resume the task
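The stop-exception idea can be sketched as follows; `background_task` here is a hypothetical stand-in for aiperf's decorator, shown only to illustrate the control flow of a runner that stops rescheduling once the body raises:

```python
import asyncio


class StopBackgroundTask(Exception):
    """Raised inside a task body to tell the runner to stop rescheduling it."""


def background_task(interval: float):
    """Hypothetical decorator: rerun the coroutine every `interval` seconds
    until its body raises StopBackgroundTask."""
    def wrap(fn):
        async def runner(*args, **kwargs):
            while True:
                try:
                    await fn(*args, **kwargs)
                except StopBackgroundTask:
                    break  # stop permanently; no way to resume
                await asyncio.sleep(interval)
        return runner
    return wrap


runs = []


@background_task(interval=0.01)
async def report(ui_is_dashboard: bool):
    if not ui_is_dashboard:
        raise StopBackgroundTask("UI type is not Dashboard")
    runs.append("tick")
    if len(runs) >= 3:
        raise StopBackgroundTask("demo: stop after 3 ticks")


asyncio.run(report(True))
print(runs)  # ['tick', 'tick', 'tick']
```

As noted, the trade-off is that a raised stop is terminal: the runner exits its loop, so the task cannot be resumed later.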
As discussed on Slack, there could be assumptions around being able to resume the task, needed in the future for the sweep task.

I modified it so that `_report_realtime_inference_metrics_task` works the same with `@background_task(interval=None, immediate=True)`, and `_report_realtime_telemetry_metrics_task` has this flow:

- `@background_task(interval=None, immediate=True)` fires the task once on startup
- Lines 476-477: if `ui_type != DASHBOARD`, it returns immediately and stops permanently
- If the dashboard was enabled by the user in the CLI: start processing realtime telemetry metrics
- If the dashboard was not enabled: sleep. If the user clicks on the "5 - GPU Telemetry" tab, the task can be woken up by `_on_start_realtime_telemetry_command` to resume

What do you think about this solution?
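The sleep-until-woken flow described above can be modeled with an `asyncio.Event` (a standalone sketch; the class and method names are illustrative, not the actual records_manager code):

```python
import asyncio


class TelemetryReporter:
    """Sketch: the loop parks on an Event while the dashboard is inactive
    and resumes when the start command sets it."""

    def __init__(self, dashboard_enabled: bool):
        self._active = asyncio.Event()
        if dashboard_enabled:
            self._active.set()
        self.reports = []

    def on_start_realtime_telemetry_command(self):
        self._active.set()  # wake the sleeping task

    async def report_loop(self, cycles: int):
        for _ in range(cycles):
            await self._active.wait()  # sleep here until woken
            self.reports.append("telemetry")
            await asyncio.sleep(0)


async def main():
    reporter = TelemetryReporter(dashboard_enabled=False)
    task = asyncio.create_task(reporter.report_loop(cycles=2))
    await asyncio.sleep(0.01)  # loop is parked; nothing reported yet
    assert reporter.reports == []
    reporter.on_start_realtime_telemetry_command()  # user opens the GPU tab
    await task
    return reporter.reports


print(asyncio.run(main()))  # ['telemetry', 'telemetry']
```

Unlike a terminal stop-exception, the Event can also be cleared to pause the loop again later, which fits the resume-for-sweep assumption.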
```python
if self.service_config.ui_type != AIPerfUIType.DASHBOARD:
    return
if self.user_config.gpu_telemetry_mode != GPUTelemetryMode.REALTIME_DASHBOARD:
    return
```
same thing with the above code about the interval thing.
```python
if self.user_config.gpu_telemetry_mode != GPUTelemetryMode.REALTIME_DASHBOARD:
    return

telemetry_metrics = await self._generate_realtime_telemetry_metrics()
```
is there any way we can know if we do not need to process the realtime metrics because they are the same? like a last_updated kinda thing? or an update count?
ooh that makes sense, will do
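A minimal sketch of the suggested change-detection (hypothetical names; the real implementation would live in the records manager): track an update counter and skip regeneration when it has not advanced since the last publish:

```python
class TelemetryCache:
    """Skip regenerating realtime metrics when no new records arrived."""

    def __init__(self):
        self._update_count = 0      # bumped on every incoming record
        self._published_count = -1  # update count at the last publish
        self.generations = 0        # how many times we actually did the work

    def add_record(self, record):
        self._update_count += 1

    def maybe_generate_metrics(self):
        if self._update_count == self._published_count:
            return None  # nothing changed since last publish; skip the work
        self._published_count = self._update_count
        self.generations += 1
        return f"metrics@{self._update_count}"


cache = TelemetryCache()
cache.add_record({"gpu": 0})
print(cache.maybe_generate_metrics())  # metrics@1
print(cache.maybe_generate_metrics())  # None  (no new records)
cache.add_record({"gpu": 1})
print(cache.maybe_generate_metrics())  # metrics@2
```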
```python
    return_exceptions=True,
)

telemetry_metrics = [
```
this is a bit complex, I would add a comment about how it is flattening the metric results or something
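For reference, the flattening pattern being discussed looks roughly like this (synthetic data; the real code operates on `MetricResult` objects gathered per collector):

```python
# Each gather() result is one collector's list of metric entries, or an
# Exception when return_exceptions=True let a collector failure through.
results = [
    [{"tag": "gpu_utilization", "avg": 83.0}],
    RuntimeError("collector 2 unreachable"),
    [{"tag": "gpu_power_usage", "avg": 212.5},
     {"tag": "gpu_temperature", "avg": 61.0}],
]

# Flatten the per-collector lists into one list, dropping failed collectors.
telemetry_metrics = [
    metric
    for per_collector in results
    if not isinstance(per_collector, Exception)
    for metric in per_collector
]
print([m["tag"] for m in telemetry_metrics])
# ['gpu_utilization', 'gpu_power_usage', 'gpu_temperature']
```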
Adds real-time GPU telemetry visualization dashboard to monitor GPU metrics during benchmark execution.
Key Additions:
Technical Changes
Design Decisions
Performance Analysis:
Rationale: Trading 0.1% CPU overhead for eliminating 3-5 second wait times provides significantly better UX. The performance cost is negligible while the user experience benefit is substantial.
Documentation
Example:

Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation