
Conversation

Contributor

@ilana-n ilana-n commented Oct 16, 2025

Adds a real-time GPU telemetry visualization dashboard to monitor GPU metrics during benchmark execution.

Key Additions:

  • Real-time Telemetry Dashboard: New interactive TUI dashboard that displays live GPU metrics (utilization, memory, temperature, power)
    • A new 'current' column shows the most recent value of each metric
    • As with inference metrics, statistics are aggregated over the entire run
  • New CLI Flag: --gpu-telemetry 'dashboard' enables the real-time dashboard view during benchmarks
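The flag value presumably maps onto a telemetry-mode enum. A minimal sketch of how that parsing could look (the member values are assumptions inferred from the `--gpu-telemetry 'dashboard'` flag; the real class derives from aiperf's CaseInsensitiveStrEnum, approximated here with a `_missing_` hook):

```python
from enum import Enum

class GPUTelemetryMode(str, Enum):
    # Values are assumed from the CLI flag; the real enum lives in
    # aiperf/common/enums/telemetry_enums.py.
    SUMMARY = "summary"
    REALTIME_DASHBOARD = "dashboard"

    @classmethod
    def _missing_(cls, value):
        # Rough stand-in for CaseInsensitiveStrEnum's case-insensitive lookup.
        if isinstance(value, str):
            lowered = value.lower()
            for member in cls:
                if member.value == lowered:
                    return member
        return None

print(GPUTelemetryMode("Dashboard").name)  # REALTIME_DASHBOARD
```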

Technical Changes

  • Added RealtimeTelemetryDashboard UI component with multi-GPU support
  • Added methods to RecordsManager to support real-time data streaming
  • Added comprehensive tests for new functionality

Design Decisions

  • Telemetry Remains Active When User Switches Tabs
    • Decision: Telemetry collection continues running even when users navigate away from the dashboard view.

Performance Analysis:

  • CPU overhead: ~0.1% (3-5ms per iteration, every 5 seconds)
  • Startup delay when re-enabling: 3-5 seconds

Rationale: Trading ~0.1% CPU overhead for eliminating 3-5 second wait times is a clear win: the performance cost is negligible, while the user-experience benefit is substantial.
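As a sanity check, the two figures quoted above are consistent with each other; a quick back-of-the-envelope calculation using only the numbers from this description:

```python
INTERVAL_S = 5.0  # one telemetry iteration every 5 seconds, per the analysis above

def cpu_overhead(work_ms: float, interval_s: float = INTERVAL_S) -> float:
    """Fraction of one CPU core spent doing `work_ms` of work every `interval_s`."""
    return (work_ms / 1000.0) / interval_s

# 3-5 ms of work every 5 s works out to 0.06%-0.10% of a core.
print(f"{cpu_overhead(3.0):.2%} - {cpu_overhead(5.0):.2%}")  # 0.06% - 0.10%
```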

Documentation

  • Updated GPU telemetry tutorial for real-time dashboard usage

Example:
[Screenshot: real-time GPU telemetry dashboard, 2025-10-17]

Summary by CodeRabbit

Release Notes

  • New Features

    • Added real-time GPU telemetry dashboard for monitoring GPU metrics
    • Introduced GPU telemetry modes (SUMMARY and REALTIME_DASHBOARD)
    • Added support for custom DCGM endpoints for GPU data collection
  • Bug Fixes

    • Removed unnecessary debug logging from exporters
    • Removed spurious warnings for empty telemetry data
  • Documentation

    • Added comprehensive GPU telemetry setup tutorial


coderabbitai bot commented Oct 16, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The PR introduces comprehensive GPU telemetry support to the aiperf system by adding new configuration fields for telemetry modes and URLs, defining real-time telemetry message types and commands, creating UI dashboard components for GPU metrics visualization, establishing telemetry hooks and mixins, and updating telemetry managers and record processors to track and publish real-time metrics alongside traditional telemetry collection.

Changes

Cohort / File(s) Summary
Configuration & Enums
aiperf/common/config/user_config.py, aiperf/common/enums/telemetry_enums.py, aiperf/common/enums/command_enums.py, aiperf/common/enums/message_enums.py, aiperf/common/enums/__init__.py
Introduces GPUTelemetryMode enum with SUMMARY and REALTIME_DASHBOARD modes; adds gpu_telemetry_mode and gpu_telemetry_urls to UserConfig with automatic parser; adds START_REALTIME_TELEMETRY command type and REALTIME_TELEMETRY_METRICS message type.
Message Infrastructure
aiperf/common/messages/telemetry_messages.py, aiperf/common/messages/__init__.py
Defines RealtimeTelemetryMetricsMessage carrying metrics list and StartRealtimeTelemetryCommand with telemetry_mode; adds dcgm_url field to TelemetryRecordsMessage; exports new messages.
Hooks & Mixins
aiperf/common/hooks.py, aiperf/common/mixins/realtime_telemetry_metrics_mixin.py, aiperf/common/mixins/__init__.py
Adds ON_REALTIME_TELEMETRY_METRICS hook and decorator; creates RealtimeTelemetryMetricsMixin extending MessageBusClientMixin with async handler for receiving and triggering telemetry metrics.
Telemetry Processing
aiperf/gpu_telemetry/telemetry_manager.py, aiperf/gpu_telemetry/telemetry_data_collector.py, aiperf/post_processors/telemetry_results_processor.py
Updates TelemetryManager with collector_id_to_url mapping for dcgm_url tracking; replaces gpu_telemetry with gpu_telemetry_urls; removes warning in TelemetryDataCollector; TelemetryResultsProcessor exposes get_telemetry_hierarchy and updates header formatting with endpoint and model name.
Records Management
aiperf/records/records_manager.py
Comprehensive refactor to track telemetry endpoints, publish RealtimeTelemetryMetricsMessage at intervals, handle START_REALTIME_TELEMETRY command, and use telemetry processor hierarchy for export; adds _report_realtime_telemetry_metrics_task and _generate_realtime_telemetry_metrics.
UI Components
aiperf/ui/dashboard/realtime_telemetry_dashboard.py, aiperf/ui/dashboard/aiperf_textual_app.py, aiperf/ui/dashboard/aiperf_dashboard_ui.py, aiperf/ui/dashboard/__init__.py, aiperf/ui/base_ui.py, aiperf/ui/__init__.py
Introduces GPUMetricsTable, SingleNodeView, and RealtimeTelemetryDashboard UI components; integrates RealtimeTelemetryMetricsMixin into BaseAIPerfUI; wires telemetry metrics handler in AIPerfTextualApp with telemetry panel and mode switching; exports new dashboard components.
Minor Cleanup
aiperf/exporters/csv_exporter.py, aiperf/exporters/json_exporter.py
Removes debug logging statements from exporter constructors.
Tests
tests/gpu_telemetry/test_telemetry_manager.py, tests/post_processors/test_telemetry_results_processor.py, tests/records/test_records_filtering.py, tests/ci/test_docs_end_to_end/test_runner.py
Updates telemetry manager tests to use gpu_telemetry_urls; adds test for get_telemetry_hierarchy; updates TelemetryRecordsMessage instantiation with dcgm_url; enhances CI shutdown to handle DCGM exporter containers.
Documentation
docs/tutorials/gpu-telemetry.md
Adds vLLM GPU telemetry setup guide with custom DCGM metrics configuration, exporter setup, health checks, and multi-node telemetry examples.
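Shape-wise, the two new message types listed in the walkthrough might look roughly like this (plain dataclasses standing in for aiperf's Pydantic models; the field names come from the summary above, everything else is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class MetricResult:
    # Stand-in for aiperf.common.models.record_models.MetricResult;
    # the real model carries more statistics fields.
    tag: str
    avg: float
    unit: str = ""

@dataclass
class RealtimeTelemetryMetricsMessage:
    # Carries a flat list of per-GPU metric results to the dashboard.
    metrics: list[MetricResult] = field(default_factory=list)

@dataclass
class StartRealtimeTelemetryCommand:
    # Command that switches the records manager into realtime reporting.
    telemetry_mode: str = "dashboard"

msg = RealtimeTelemetryMetricsMessage(
    metrics=[MetricResult(tag="gpu_utilization", avg=87.5, unit="%")]
)
print(msg.metrics[0].tag)  # gpu_utilization
```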

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant RecordsManager
    participant TelemetryManager
    participant Processor
    participant Dashboard
    participant Hooks

    User->>RecordsManager: START_REALTIME_TELEMETRY
    RecordsManager->>RecordsManager: Switch mode to REALTIME_DASHBOARD
    RecordsManager->>TelemetryManager: Collect telemetry
    TelemetryManager->>Processor: Process metrics
    Processor->>Processor: Format MetricResult entries
    Processor->>RecordsManager: Return metrics
    RecordsManager->>RecordsManager: _generate_realtime_telemetry_metrics()
    RecordsManager->>Hooks: Trigger ON_REALTIME_TELEMETRY_METRICS
    Hooks->>Dashboard: on_realtime_telemetry_metrics(metrics)
    Dashboard->>Dashboard: Update GPU tables per GPU
    Dashboard->>User: Display real-time metrics
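The periodic publish loop in the diagram (RecordsManager generating metrics and triggering the hook at intervals) can be sketched with plain asyncio; the callback-style wiring and names below are illustrative, not the actual RecordsManager API:

```python
import asyncio

async def report_realtime_telemetry_metrics(generate, publish,
                                            interval: float, cycles: int) -> None:
    # Sketch of _report_realtime_telemetry_metrics_task: periodically build
    # the metrics snapshot and publish it. The real task loops until
    # shutdown; `cycles` bounds it here so the example terminates.
    for _ in range(cycles):
        metrics = await generate()
        await publish(metrics)
        await asyncio.sleep(interval)

published: list[dict] = []

async def fake_generate() -> dict:
    return {"gpu_utilization": 87.5}

async def fake_publish(metrics: dict) -> None:
    published.append(metrics)

asyncio.run(report_realtime_telemetry_metrics(fake_generate, fake_publish,
                                              interval=0.0, cycles=3))
print(len(published))  # 3
```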

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes span multiple architectural layers with significant logic additions to telemetry processing and UI integration, balanced by cohesive grouping around the telemetry feature. Key complexity drivers include: (1) multi-file coordination across enums, messages, hooks, and mixins; (2) state management in TelemetryManager with new collector_id_to_url mapping; (3) comprehensive refactor of records_manager with new reporting tasks and command handling; (4) three new UI component classes with nested helper methods; (5) integration points requiring cross-subsystem understanding. However, changes are relatively homogeneous in intent (telemetry feature additions) despite broad file distribution.

Poem

🐰 Hops with glee through GPU glow,
Metrics dance in dashboard's flow,
Real-time telemetry takes the stage,
Telemetry tales on every page!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 87.32% which is sufficient. The required threshold is 80.00%.
Title Check ✅ Passed The PR title "feat: GPU Telemetry Realtime Dashboard Display" directly and accurately captures the main change of this pull request. The changeset introduces a new real-time GPU telemetry visualization dashboard (RealtimeTelemetryDashboard) with UI components (GPUMetricsTable, SingleNodeView) that display live metrics like utilization, memory, and temperature. The title is concise and clearly communicates the primary objective without vague terms or unnecessary details. While "Display" is somewhat redundant with "Dashboard," the full phrase remains meaningful and emphasizes the visual aspect of the feature being added.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.



codecov bot commented Oct 16, 2025

@ilana-n ilana-n changed the base branch from main to feature/gpu_telemetry_v2 October 16, 2025 21:34
@ilana-n ilana-n requested a review from Copilot October 16, 2025 21:34

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (3)
docs/tutorials/gpu-telemetry.md (1)

144-144: Consider documenting the purpose of HTML comment markers.

The section delimiters (e.g., <!-- setup-vllm-gpu-telemetry-default-openai-endpoint-server -->, <!-- /setup-vllm-gpu-telemetry-default-openai-endpoint-server -->) appear intentional, likely for programmatic documentation generation or test orchestration. Consider adding a brief comment block at the top of the file explaining their purpose and any downstream tooling that depends on them, so future maintainers understand why they exist.

Also applies to: 208-208, 251-251, 261-261, 265-265, 286-286
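If the markers do feed a doc-test harness, a typical consumer pulls out the text between paired markers. A hypothetical extractor (not the project's actual tooling) could look like:

```python
import re

def extract_section(markdown: str, name: str) -> str:
    """Return the text between <!-- name --> and <!-- /name --> markers,
    or "" if the pair is absent (illustrative helper only)."""
    pattern = re.compile(
        rf"<!--\s*{re.escape(name)}\s*-->(.*?)<!--\s*/{re.escape(name)}\s*-->",
        re.DOTALL,
    )
    match = pattern.search(markdown)
    return match.group(1).strip() if match else ""

doc = """
<!-- setup-server -->
docker run my-server
<!-- /setup-server -->
"""
print(extract_section(doc, "setup-server"))  # docker run my-server
```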

aiperf/ui/dashboard/realtime_telemetry_dashboard.py (2)

41-43: Annotate mutable class attributes as ClassVar (or make them tuples).

STATS_FIELDS and COLUMNS are shared, mutable class attributes flagged by Ruff (RUF012). Mark them as ClassVar and switch to immutable tuples to avoid accidental per-instance mutation.

Apply this diff:

-from rich.text import Text
+from typing import ClassVar
+
+from rich.text import Text
@@
-    STATS_FIELDS = ["avg", "min", "max", "p99", "p90", "p50", "std"]
-    COLUMNS = ["Metric", *STATS_FIELDS]
+    STATS_FIELDS: ClassVar[tuple[str, ...]] = ("avg", "min", "max", "p99", "p90", "p50", "std")
+    COLUMNS: ClassVar[tuple[str, ...]] = ("Metric", *STATS_FIELDS)

112-120: Narrow the exception handling when updating cells.

Catching bare Exception hides legitimate errors. DataTable.update_cell mainly raises RowDoesNotExist or ColumnDoesNotExist; handle those explicitly so real bugs surface.

Apply this diff:

-from textual.widgets.data_table import ColumnKey, RowDoesNotExist, RowKey
+from textual.widgets.data_table import ColumnDoesNotExist, ColumnKey, RowDoesNotExist, RowKey
@@
-        for col_name, cell_value in zip(self.COLUMNS, row_cells, strict=True):
-            try:
-                self.data_table.update_cell(  # type: ignore
-                    row_key, self._column_keys[col_name], cell_value, update_width=True
-                )
-            except Exception as e:
+        for col_name, cell_value in zip(self.COLUMNS, row_cells, strict=True):
+            try:
+                self.data_table.update_cell(  # type: ignore
+                    row_key, self._column_keys[col_name], cell_value, update_width=True
+                )
+            except (RowDoesNotExist, ColumnDoesNotExist) as e:
                 _logger.warning(
                     f"Error updating cell {col_name} with value {cell_value}: {e!r}"
                 )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9a2824 and 0cd8551.

📒 Files selected for processing (27)
  • aiperf/common/config/user_config.py (2 hunks)
  • aiperf/common/enums/__init__.py (2 hunks)
  • aiperf/common/enums/command_enums.py (1 hunks)
  • aiperf/common/enums/message_enums.py (1 hunks)
  • aiperf/common/enums/telemetry_enums.py (1 hunks)
  • aiperf/common/hooks.py (2 hunks)
  • aiperf/common/messages/__init__.py (2 hunks)
  • aiperf/common/messages/telemetry_messages.py (3 hunks)
  • aiperf/common/mixins/__init__.py (2 hunks)
  • aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1 hunks)
  • aiperf/exporters/csv_exporter.py (0 hunks)
  • aiperf/exporters/json_exporter.py (0 hunks)
  • aiperf/gpu_telemetry/telemetry_data_collector.py (0 hunks)
  • aiperf/gpu_telemetry/telemetry_manager.py (4 hunks)
  • aiperf/post_processors/telemetry_results_processor.py (3 hunks)
  • aiperf/records/records_manager.py (10 hunks)
  • aiperf/ui/__init__.py (3 hunks)
  • aiperf/ui/base_ui.py (1 hunks)
  • aiperf/ui/dashboard/__init__.py (2 hunks)
  • aiperf/ui/dashboard/aiperf_dashboard_ui.py (1 hunks)
  • aiperf/ui/dashboard/aiperf_textual_app.py (7 hunks)
  • aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1 hunks)
  • docs/tutorials/gpu-telemetry.md (5 hunks)
  • tests/ci/test_docs_end_to_end/test_runner.py (3 hunks)
  • tests/gpu_telemetry/test_telemetry_manager.py (13 hunks)
  • tests/post_processors/test_telemetry_results_processor.py (1 hunks)
  • tests/records/test_records_filtering.py (2 hunks)
💤 Files with no reviewable changes (3)
  • aiperf/exporters/csv_exporter.py
  • aiperf/exporters/json_exporter.py
  • aiperf/gpu_telemetry/telemetry_data_collector.py
🧰 Additional context used
🧬 Code graph analysis (19)
aiperf/ui/dashboard/__init__.py (1)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
  • GPUMetricsTable (21-154)
  • RealtimeTelemetryDashboard (248-313)
  • SingleNodeView (157-245)
aiperf/ui/dashboard/aiperf_dashboard_ui.py (4)
aiperf/common/mixins/hooks_mixin.py (1)
  • attach_hook (96-114)
aiperf/common/hooks.py (2)
  • AIPerfHook (41-57)
  • on_realtime_telemetry_metrics (352-364)
aiperf/ui/dashboard/aiperf_textual_app.py (1)
  • on_realtime_telemetry_metrics (273-277)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1)
  • on_realtime_telemetry_metrics (304-313)
tests/post_processors/test_telemetry_results_processor.py (3)
tests/data_exporters/conftest.py (1)
  • sample_telemetry_record (17-38)
aiperf/common/models/telemetry_models.py (2)
  • TelemetryRecord (69-107)
  • TelemetryHierarchy (270-320)
aiperf/post_processors/telemetry_results_processor.py (3)
  • TelemetryResultsProcessor (23-99)
  • process_telemetry_record (39-45)
  • get_telemetry_hierarchy (31-37)
aiperf/post_processors/telemetry_results_processor.py (3)
aiperf/exporters/display_units_utils.py (1)
  • normalize_endpoint_display (17-36)
aiperf/common/models/telemetry_models.py (4)
  • TelemetryHierarchy (270-320)
  • TelemetryRecord (69-107)
  • add_record (240-251)
  • add_record (293-320)
aiperf/common/models/record_models.py (1)
  • MetricResult (25-51)
aiperf/common/enums/telemetry_enums.py (1)
aiperf/common/enums/base_enums.py (1)
  • CaseInsensitiveStrEnum (10-49)
aiperf/common/enums/__init__.py (1)
aiperf/common/enums/telemetry_enums.py (1)
  • GPUTelemetryMode (7-11)
aiperf/ui/base_ui.py (2)
aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1)
  • RealtimeTelemetryMetricsMixin (15-43)
aiperf/common/mixins/realtime_metrics_mixin.py (1)
  • RealtimeMetricsMixin (15-34)
aiperf/gpu_telemetry/telemetry_manager.py (1)
aiperf/common/messages/telemetry_messages.py (1)
  • TelemetryRecordsMessage (20-45)
aiperf/common/config/user_config.py (2)
aiperf/common/enums/telemetry_enums.py (1)
  • GPUTelemetryMode (7-11)
aiperf/common/config/cli_parameter.py (1)
  • DisableCLI (22-29)
aiperf/ui/__init__.py (2)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
  • GPUMetricsTable (21-154)
  • RealtimeTelemetryDashboard (248-313)
  • SingleNodeView (157-245)
aiperf/ui/dashboard/custom_widgets.py (2)
  • MaximizableWidget (15-32)
  • NonFocusableDataTable (10-12)
aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (6)
aiperf/common/enums/message_enums.py (1)
  • MessageType (7-51)
aiperf/common/hooks.py (3)
  • AIPerfHook (41-57)
  • on_message (313-334)
  • provides_hooks (206-228)
aiperf/common/messages/telemetry_messages.py (1)
  • RealtimeTelemetryMetricsMessage (79-86)
aiperf/common/mixins/message_bus_mixin.py (1)
  • MessageBusClientMixin (34-141)
aiperf/common/models/record_models.py (1)
  • MetricResult (25-51)
aiperf/common/aiperf_logger.py (1)
  • AIPerfLogger (25-239)
aiperf/common/messages/__init__.py (1)
aiperf/common/messages/telemetry_messages.py (2)
  • RealtimeTelemetryMetricsMessage (79-86)
  • StartRealtimeTelemetryCommand (89-98)
aiperf/common/hooks.py (2)
aiperf/ui/dashboard/aiperf_textual_app.py (1)
  • on_realtime_telemetry_metrics (273-277)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1)
  • on_realtime_telemetry_metrics (304-313)
aiperf/records/records_manager.py (7)
aiperf/common/enums/telemetry_enums.py (1)
  • GPUTelemetryMode (7-11)
aiperf/common/messages/telemetry_messages.py (2)
  • RealtimeTelemetryMetricsMessage (79-86)
  • StartRealtimeTelemetryCommand (89-98)
aiperf/exporters/display_units_utils.py (1)
  • normalize_endpoint_display (17-36)
aiperf/common/models/error_models.py (1)
  • ErrorDetails (11-47)
aiperf/gpu_telemetry/telemetry_manager.py (2)
  • TelemetryManager (40-378)
  • _normalize_dcgm_url (106-118)
aiperf/post_processors/telemetry_results_processor.py (2)
  • summarize (47-99)
  • get_telemetry_hierarchy (31-37)
aiperf/common/models/telemetry_models.py (1)
  • TelemetryResults (323-348)
aiperf/ui/dashboard/aiperf_textual_app.py (4)
aiperf/common/enums/telemetry_enums.py (1)
  • GPUTelemetryMode (7-11)
aiperf/common/messages/telemetry_messages.py (1)
  • StartRealtimeTelemetryCommand (89-98)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
  • RealtimeTelemetryDashboard (248-313)
  • set_status_message (292-302)
  • on_realtime_telemetry_metrics (304-313)
aiperf/common/hooks.py (1)
  • on_realtime_telemetry_metrics (352-364)
tests/gpu_telemetry/test_telemetry_manager.py (1)
tests/data_exporters/test_gpu_telemetry_console_exporter.py (1)
  • mock_user_config (29-33)
aiperf/common/mixins/__init__.py (1)
aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1)
  • RealtimeTelemetryMetricsMixin (15-43)
aiperf/common/messages/telemetry_messages.py (7)
aiperf/common/enums/command_enums.py (1)
  • CommandType (7-17)
aiperf/common/enums/telemetry_enums.py (1)
  • GPUTelemetryMode (7-11)
aiperf/common/enums/message_enums.py (1)
  • MessageType (7-51)
aiperf/common/messages/command_messages.py (1)
  • CommandMessage (57-97)
aiperf/common/models/error_models.py (1)
  • ErrorDetails (11-47)
aiperf/common/models/record_models.py (1)
  • MetricResult (25-51)
aiperf/common/models/telemetry_models.py (2)
  • ProcessTelemetryResult (351-362)
  • TelemetryRecord (69-107)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (5)
aiperf/common/aiperf_logger.py (1)
  • AIPerfLogger (25-239)
aiperf/common/config/service_config.py (1)
  • ServiceConfig (30-171)
aiperf/common/models/record_models.py (1)
  • MetricResult (25-51)
aiperf/ui/dashboard/custom_widgets.py (2)
  • MaximizableWidget (15-32)
  • NonFocusableDataTable (10-12)
aiperf/ui/dashboard/aiperf_textual_app.py (2)
  • compose (115-148)
  • on_realtime_telemetry_metrics (273-277)
🪛 Ruff (0.14.0)
aiperf/common/config/user_config.py

244-244: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

aiperf/ui/dashboard/realtime_telemetry_dashboard.py

41-41: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


42-42: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


117-117: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (11)
docs/tutorials/gpu-telemetry.md (3)

37-38: Well-documented default endpoint behavior.

The IMPORTANT note clearly explains that the default endpoint http://localhost:9401/metrics is always attempted regardless of the --gpu-telemetry flag. This is well-emphasized and reinforced again in the multi-node example (lines 294, 316). The clarification is critical to user understanding and is properly positioned.


180-206: Port forwarding and DCGM exporter configuration is clearly explained.

The setup instructions for the DCGM exporter are well-structured with clear explanations of port forwarding (-p 9401:9400), metrics collection interval, and custom metrics file mounting. The example commands are syntactically correct and production-ready.


31-35: Table effectively communicates flag behavior.

The comparison table succinctly explains the differences between no flag, flag only, and custom URLs, making it easy for users to understand which endpoints are collected and displayed in each scenario.

aiperf/ui/dashboard/__init__.py (1)

34-38: LGTM!

The new dashboard components (GPUMetricsTable, RealtimeTelemetryDashboard, SingleNodeView) are correctly imported and exported. The changes follow the standard mkinit pattern for exposing public API surface.

Also applies to: 55-55, 63-63, 65-65

aiperf/ui/__init__.py (1)

20-20: LGTM!

The telemetry dashboard components are correctly re-exported from the dashboard submodule to the top-level UI package API. This follows the standard mkinit-generated pattern.

Also applies to: 28-28, 30-30, 52-52, 62-62, 64-64

aiperf/common/enums/command_enums.py (1)

17-17: LGTM!

The new START_REALTIME_TELEMETRY command type follows the existing naming conventions and is properly integrated into the CommandType enum.

aiperf/ui/base_ui.py (1)

7-7: LGTM!

The RealtimeTelemetryMetricsMixin is correctly integrated into the BaseAIPerfUI class hierarchy. The mixin follows the same pattern as the existing RealtimeMetricsMixin and provides the necessary hooks for handling real-time telemetry metrics. The docstring updates accurately reflect the new capability.

Also applies to: 12-17, 21-23, 25-27

aiperf/common/enums/message_enums.py (1)

44-44: LGTM!

The new REALTIME_TELEMETRY_METRICS message type is correctly added to the MessageType enum, following the established naming conventions and alphabetical ordering.

aiperf/ui/dashboard/aiperf_dashboard_ui.py (1)

73-76: LGTM!

The ON_REALTIME_TELEMETRY_METRICS hook attachment follows the established pattern for hook registration in the dashboard UI. The integration correctly wires the telemetry metrics flow from the mixin to the textual app's handler.

aiperf/common/mixins/__init__.py (1)

44-46: LGTM!

The RealtimeTelemetryMetricsMixin is correctly imported and exported following the standard mkinit pattern. The alphabetical ordering is maintained.

Also applies to: 69-69

tests/records/test_records_filtering.py (1)

352-358: LGTM!

The test updates correctly reflect the new dcgm_url parameter added to TelemetryRecordsMessage. The test values are consistent and appropriate for the testing scenarios.

Also applies to: 386-392

@ilana-n ilana-n changed the title GPU Telemetry Realtime Dashboard Display feat: GPU Telemetry Realtime Dashboard Display Oct 16, 2025
@github-actions github-actions bot added the feat label Oct 16, 2025

Copilot AI left a comment


Pull Request Overview

Adds real-time GPU telemetry visualization dashboard to monitor GPU metrics during benchmark execution, introducing a new interactive TUI dashboard that displays live GPU metrics (utilization, memory, temperature, power).

  • Real-time GPU telemetry dashboard with --gpu-telemetry 'dashboard' flag
  • New UI components and data streaming infrastructure for telemetry visualization
  • Enhanced telemetry management with endpoint tracking and performance optimization

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
aiperf/ui/dashboard/realtime_telemetry_dashboard.py New interactive TUI dashboard for real-time GPU metrics visualization
aiperf/ui/dashboard/aiperf_textual_app.py Integrates telemetry dashboard into main UI with mode switching
aiperf/records/records_manager.py Adds telemetry data streaming and background task management
aiperf/post_processors/telemetry_results_processor.py Enhances telemetry processor with hierarchy access and display formatting
aiperf/gpu_telemetry/telemetry_manager.py Updates telemetry manager with URL tracking and message enrichment
aiperf/common/config/user_config.py Adds GPU telemetry mode configuration parsing
tests/ Updates test files with new dcgm_url field and configuration changes


@ilana-n ilana-n force-pushed the ilana/gpu-telemetry-dashboard branch 2 times, most recently from d24a8fd to 0cd8551 Compare October 17, 2025 22:04
@ilana-n ilana-n changed the base branch from feature/gpu_telemetry_v2 to main October 17, 2025 22:12
@ilana-n ilana-n force-pushed the ilana/gpu-telemetry-dashboard branch from 7911183 to c17e2ae Compare October 17, 2025 23:04
@ilana-n ilana-n force-pushed the ilana/gpu-telemetry-dashboard branch from 0453e7f to 47a2e8e Compare October 20, 2025 02:06
Contributor

@ajcasagrande ajcasagrande left a comment


Halfway through, but it looks really well done so far! Great work!

Comment on lines +95 to +98
telemetry_mode: GPUTelemetryMode = Field(
default=GPUTelemetryMode.REALTIME_DASHBOARD,
description="The GPU telemetry mode to set when starting realtime telemetry",
)
Contributor

This may be a note to myself, but how is this being used? What is it needed by?

@background_task(interval=None, immediate=True)
async def _report_realtime_metrics_task(self) -> None:
"""Report the real-time metrics at a regular interval (only if the UI type is dashboard)."""
@background_task(interval=DEFAULT_REALTIME_METRICS_INTERVAL, immediate=False)
Contributor

The problem with doing it this way is that my original interval=None means the task won't repeat: if the UI type is not dashboard, it will stop the task. But here, it will still fire the task every interval and then exit, so it's a little less efficient.

Alternatively, we could have a custom exception that we raise, which background_task would catch to know it should stop.

Like raise BackgroundTaskDisabled(f"UI type is not Dashboard: {ui_type=}") if we always assume it will be disabled at the beginning, or raise StopBackgroundTask("..") if we want to allow exiting mid-execution. Sorta like that.

But granted, those options don't ever allow you to resume the task.
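The stop-exception idea sketched in this comment can be captured in a few lines; StopBackgroundTask and the decorator internals below are hypothetical simplifications of the described @background_task, not aiperf's actual implementation:

```python
import asyncio

class StopBackgroundTask(Exception):
    """Hypothetical control-flow exception a task raises to end its loop."""

def background_task(interval: float):
    # Simplified stand-in for aiperf's @background_task decorator: rerun
    # the wrapped coroutine every `interval` until it opts out.
    def decorator(fn):
        async def runner(*args, **kwargs):
            while True:
                try:
                    await fn(*args, **kwargs)
                except StopBackgroundTask:
                    break  # task opted out; stop rescheduling
                await asyncio.sleep(interval)
        return runner
    return decorator

calls: list[int] = []

@background_task(interval=0.0)
async def report():
    calls.append(1)
    if len(calls) >= 2:
        raise StopBackgroundTask("dashboard UI not active")

asyncio.run(report())
print(len(calls))  # 2
```

As the comment notes, once the loop breaks there is no way to resume it, which is the limitation discussed below.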

Contributor Author

As discussed on slack, there could be assumptions around being able to resume the task needed in the future for the sweep task.

I modified it so that the _report_realtime_inference_metrics_task works the same with @background_task(interval=None, immediate=True) and _report_realtime_telemetry_metrics_task has this flow:

  • @background_task(interval=None, immediate=True) fires the task once on startup
  • Lines 476-477: If ui_type != DASHBOARD, it returns immediately and stops permanently
  • If dashboard enabled by user in CLI: start processing realtime telemetry metrics
  • If dashboard not enabled, sleep. If user clicks on the "5 - GPU Telemetry" tab then it can be woken up by the _on_start_realtime_telemetry_command to resume

What do you think about this solution?
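The park-and-resume flow described in this reply maps naturally onto an asyncio.Event; a toy sketch under that assumption (names illustrative, not the RecordsManager implementation):

```python
import asyncio

async def telemetry_task(wake: asyncio.Event, dashboard_enabled_at_start: bool) -> list[str]:
    # Fired once on startup (interval=None, immediate=True in the description).
    out: list[str] = []
    if not dashboard_enabled_at_start:
        # Park until _on_start_realtime_telemetry_command sets the event,
        # e.g. when the user opens the "5 - GPU Telemetry" tab.
        await wake.wait()
    out.append("reporting realtime telemetry")
    return out

async def main() -> list[str]:
    wake = asyncio.Event()
    task = asyncio.create_task(telemetry_task(wake, dashboard_enabled_at_start=False))
    await asyncio.sleep(0)  # let the task start and park on wake.wait()
    wake.set()              # simulate the START_REALTIME_TELEMETRY command
    return await task

result = asyncio.run(main())
print(result)  # ['reporting realtime telemetry']
```

Unlike the stop-exception approach, the event can be cleared and set again, so the task stays resumable.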

Comment on lines +489 to +492
if self.service_config.ui_type != AIPerfUIType.DASHBOARD:
return
if self.user_config.gpu_telemetry_mode != GPUTelemetryMode.REALTIME_DASHBOARD:
return
Contributor

Same thing as the above comment about the interval.

if self.user_config.gpu_telemetry_mode != GPUTelemetryMode.REALTIME_DASHBOARD:
return

telemetry_metrics = await self._generate_realtime_telemetry_metrics()
Contributor

Is there any way we can know if we do not need to process the realtime metrics because they are the same? Like a last_updated kind of thing, or an update count?
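One cheap way to implement this suggestion is a monotonically increasing update counter compared against the count at the last publish; an illustrative sketch, not existing aiperf code:

```python
class RealtimeTelemetryCache:
    """Track record arrivals and skip regeneration when nothing changed."""

    def __init__(self) -> None:
        self._update_count = 0       # incremented per telemetry record received
        self._published_count = -1   # count at the time of the last publish

    def record_received(self) -> None:
        self._update_count += 1

    def needs_publish(self) -> bool:
        if self._update_count == self._published_count:
            return False  # no new records since the last publish; skip work
        self._published_count = self._update_count
        return True

cache = RealtimeTelemetryCache()
print(cache.needs_publish())  # True (first publish)
print(cache.needs_publish())  # False (nothing new)
cache.record_received()
print(cache.needs_publish())  # True
```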

Contributor Author

ooh that makes sense, will do

return_exceptions=True,
)

telemetry_metrics = [
Contributor

This is a bit complex; I would add a comment about how it is flattening the metric results or something.
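For reference, the flattening the reviewer is pointing at, gathering per-endpoint results and folding the successful lists into one, can be written with the comment baked in (a hypothetical reconstruction of the surrounding code's shape):

```python
import asyncio

async def gather_and_flatten(coros) -> list:
    # asyncio.gather with return_exceptions=True yields one entry per
    # endpoint: either a list of MetricResult-like items or an exception.
    results = await asyncio.gather(*coros, return_exceptions=True)
    # Flatten: skip failed endpoints, then splice each surviving
    # per-endpoint list into a single flat metrics list.
    return [
        metric
        for result in results
        if not isinstance(result, BaseException)
        for metric in result
    ]

async def healthy() -> list[str]:
    return ["gpu_utilization", "gpu_power_usage"]

async def failing() -> list[str]:
    raise RuntimeError("endpoint down")

flat = asyncio.run(gather_and_flatten([healthy(), failing(), healthy()]))
print(flat)
```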
