feat: GPU Telemetry Realtime Dashboard Display #370
base: main
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The PR introduces comprehensive GPU telemetry support to the aiperf system by adding new configuration fields for telemetry modes and URLs, defining real-time telemetry message types and commands, creating UI dashboard components for GPU metrics visualization, establishing telemetry hooks and mixins, and updating telemetry managers and record processors to track and publish real-time metrics alongside traditional telemetry collection.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant RecordsManager
    participant TelemetryManager
    participant Processor
    participant Dashboard
    participant Hooks
    User->>RecordsManager: START_REALTIME_TELEMETRY
    RecordsManager->>RecordsManager: Switch mode to REALTIME_DASHBOARD
    RecordsManager->>TelemetryManager: Collect telemetry
    TelemetryManager->>Processor: Process metrics
    Processor->>Processor: Format MetricResult entries
    Processor->>RecordsManager: Return metrics
    RecordsManager->>RecordsManager: _generate_realtime_telemetry_metrics()
    RecordsManager->>Hooks: Trigger ON_REALTIME_TELEMETRY_METRICS
    Hooks->>Dashboard: on_realtime_telemetry_metrics(metrics)
    Dashboard->>Dashboard: Update GPU tables per GPU
    Dashboard->>User: Display real-time metrics
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

The changes span multiple architectural layers with significant logic additions to telemetry processing and UI integration, balanced by cohesive grouping around the telemetry feature. Key complexity drivers include: (1) multi-file coordination across enums, messages, hooks, and mixins; (2) state management in TelemetryManager with a new collector_id_to_url mapping; (3) a comprehensive refactor of records_manager with new reporting tasks and command handling; (4) three new UI component classes with nested helper methods; (5) integration points requiring cross-subsystem understanding. However, the changes are relatively homogeneous in intent (telemetry feature additions) despite broad file distribution.
Pre-merge checks and finishing touches — ✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Codecov Report — ❌ Patch coverage check flagged (the coverage value did not render in this capture).
Actionable comments posted: 5
🧹 Nitpick comments (3)
docs/tutorials/gpu-telemetry.md (1)
144-144: Consider documenting the purpose of HTML comment markers.

The section delimiters (e.g., `<!-- setup-vllm-gpu-telemetry-default-openai-endpoint-server -->` and `<!-- /setup-vllm-gpu-telemetry-default-openai-endpoint-server -->`) appear intentional, likely for programmatic documentation generation or test orchestration. Consider adding a brief comment block at the top of the file explaining their purpose and any downstream tooling that depends on them, so future maintainers understand why they exist.

Also applies to: 208-208, 251-251, 261-261, 265-265, 286-286
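Paired markers like these are easy to consume programmatically. As a hedged sketch (the `extract_tagged_block` helper and the `setup-server` tag below are hypothetical illustrations, not actual aiperf tooling), a doc-test runner might pull out a tagged snippet like this:

```python
import re


def extract_tagged_block(markdown: str, tag: str) -> str:
    """Return the text between <!-- tag --> and <!-- /tag --> markers."""
    pattern = re.compile(
        rf"<!--\s*{re.escape(tag)}\s*-->(.*?)<!--\s*/{re.escape(tag)}\s*-->",
        re.DOTALL,
    )
    match = pattern.search(markdown)
    if match is None:
        raise KeyError(f"No block tagged {tag!r} found")
    return match.group(1).strip()


doc = """
<!-- setup-server -->
docker run --gpus all my-server
<!-- /setup-server -->
"""
print(extract_tagged_block(doc, "setup-server"))  # docker run --gpus all my-server
```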
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (2)
41-43: Annotate mutable class attributes as `ClassVar` (or make them tuples).

`STATS_FIELDS` and `COLUMNS` are shared, mutable class attributes flagged by Ruff (RUF012). Mark them as `ClassVar` and switch to immutable tuples to avoid accidental per-instance mutation. Apply this diff:

```diff
-from rich.text import Text
+from typing import ClassVar
+
+from rich.text import Text
@@
-    STATS_FIELDS = ["avg", "min", "max", "p99", "p90", "p50", "std"]
-    COLUMNS = ["Metric", *STATS_FIELDS]
+    STATS_FIELDS: ClassVar[tuple[str, ...]] = ("avg", "min", "max", "p99", "p90", "p50", "std")
+    COLUMNS: ClassVar[tuple[str, ...]] = ("Metric", *STATS_FIELDS)
```
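To illustrate why the mutable class attribute matters (a minimal standalone sketch, not aiperf code): every instance shares the same class-level list object, so an in-place mutation through one instance is visible everywhere. A `ClassVar`-annotated tuple avoids both the shared mutation and the type-checker blind spot:

```python
from typing import ClassVar


class BadTable:
    COLUMNS = ["Metric"]  # one list object shared by every instance


class GoodTable:
    # Immutable and explicitly class-level; Ruff/type checkers flag misuse
    COLUMNS: ClassVar[tuple[str, ...]] = ("Metric",)


a, b = BadTable(), BadTable()
a.COLUMNS.append("avg")  # mutates the shared list...
print(b.COLUMNS)         # ...so every other instance sees ['Metric', 'avg']
print(GoodTable.COLUMNS) # ('Metric',) — tuples cannot be mutated in place
```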
112-120: Narrow the exception handling when updating cells.

Catching bare `Exception` hides legitimate errors. `DataTable.update_cell` mainly raises `RowDoesNotExist` or `ColumnDoesNotExist`; handle those explicitly so real bugs surface. Apply this diff:

```diff
-from textual.widgets.data_table import ColumnKey, RowDoesNotExist, RowKey
+from textual.widgets.data_table import ColumnDoesNotExist, ColumnKey, RowDoesNotExist, RowKey
@@
     for col_name, cell_value in zip(self.COLUMNS, row_cells, strict=True):
         try:
             self.data_table.update_cell(  # type: ignore
                 row_key, self._column_keys[col_name], cell_value, update_width=True
             )
-        except Exception as e:
+        except (RowDoesNotExist, ColumnDoesNotExist) as e:
             _logger.warning(
                 f"Error updating cell {col_name} with value {cell_value}: {e!r}"
             )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- aiperf/common/config/user_config.py (2 hunks)
- aiperf/common/enums/__init__.py (2 hunks)
- aiperf/common/enums/command_enums.py (1 hunk)
- aiperf/common/enums/message_enums.py (1 hunk)
- aiperf/common/enums/telemetry_enums.py (1 hunk)
- aiperf/common/hooks.py (2 hunks)
- aiperf/common/messages/__init__.py (2 hunks)
- aiperf/common/messages/telemetry_messages.py (3 hunks)
- aiperf/common/mixins/__init__.py (2 hunks)
- aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1 hunk)
- aiperf/exporters/csv_exporter.py (0 hunks)
- aiperf/exporters/json_exporter.py (0 hunks)
- aiperf/gpu_telemetry/telemetry_data_collector.py (0 hunks)
- aiperf/gpu_telemetry/telemetry_manager.py (4 hunks)
- aiperf/post_processors/telemetry_results_processor.py (3 hunks)
- aiperf/records/records_manager.py (10 hunks)
- aiperf/ui/__init__.py (3 hunks)
- aiperf/ui/base_ui.py (1 hunk)
- aiperf/ui/dashboard/__init__.py (2 hunks)
- aiperf/ui/dashboard/aiperf_dashboard_ui.py (1 hunk)
- aiperf/ui/dashboard/aiperf_textual_app.py (7 hunks)
- aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1 hunk)
- docs/tutorials/gpu-telemetry.md (5 hunks)
- tests/ci/test_docs_end_to_end/test_runner.py (3 hunks)
- tests/gpu_telemetry/test_telemetry_manager.py (13 hunks)
- tests/post_processors/test_telemetry_results_processor.py (1 hunk)
- tests/records/test_records_filtering.py (2 hunks)
💤 Files with no reviewable changes (3)
- aiperf/exporters/csv_exporter.py
- aiperf/exporters/json_exporter.py
- aiperf/gpu_telemetry/telemetry_data_collector.py
🧰 Additional context used
🧬 Code graph analysis (19)
aiperf/ui/dashboard/__init__.py (1)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
    GPUMetricsTable (21-154)
    RealtimeTelemetryDashboard (248-313)
    SingleNodeView (157-245)
aiperf/ui/dashboard/aiperf_dashboard_ui.py (4)
  aiperf/common/mixins/hooks_mixin.py (1)
    attach_hook (96-114)
  aiperf/common/hooks.py (2)
    AIPerfHook (41-57)
    on_realtime_telemetry_metrics (352-364)
  aiperf/ui/dashboard/aiperf_textual_app.py (1)
    on_realtime_telemetry_metrics (273-277)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1)
    on_realtime_telemetry_metrics (304-313)
tests/post_processors/test_telemetry_results_processor.py (3)
  tests/data_exporters/conftest.py (1)
    sample_telemetry_record (17-38)
  aiperf/common/models/telemetry_models.py (2)
    TelemetryRecord (69-107)
    TelemetryHierarchy (270-320)
  aiperf/post_processors/telemetry_results_processor.py (3)
    TelemetryResultsProcessor (23-99)
    process_telemetry_record (39-45)
    get_telemetry_hierarchy (31-37)
aiperf/post_processors/telemetry_results_processor.py (3)
  aiperf/exporters/display_units_utils.py (1)
    normalize_endpoint_display (17-36)
  aiperf/common/models/telemetry_models.py (4)
    TelemetryHierarchy (270-320)
    TelemetryRecord (69-107)
    add_record (240-251)
    add_record (293-320)
  aiperf/common/models/record_models.py (1)
    MetricResult (25-51)
aiperf/common/enums/telemetry_enums.py (1)
  aiperf/common/enums/base_enums.py (1)
    CaseInsensitiveStrEnum (10-49)
aiperf/common/enums/__init__.py (1)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
aiperf/ui/base_ui.py (2)
  aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1)
    RealtimeTelemetryMetricsMixin (15-43)
  aiperf/common/mixins/realtime_metrics_mixin.py (1)
    RealtimeMetricsMixin (15-34)
aiperf/gpu_telemetry/telemetry_manager.py (1)
  aiperf/common/messages/telemetry_messages.py (1)
    TelemetryRecordsMessage (20-45)
aiperf/common/config/user_config.py (2)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
  aiperf/common/config/cli_parameter.py (1)
    DisableCLI (22-29)
aiperf/ui/__init__.py (2)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
    GPUMetricsTable (21-154)
    RealtimeTelemetryDashboard (248-313)
    SingleNodeView (157-245)
  aiperf/ui/dashboard/custom_widgets.py (2)
    MaximizableWidget (15-32)
    NonFocusableDataTable (10-12)
aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (6)
  aiperf/common/enums/message_enums.py (1)
    MessageType (7-51)
  aiperf/common/hooks.py (3)
    AIPerfHook (41-57)
    on_message (313-334)
    provides_hooks (206-228)
  aiperf/common/messages/telemetry_messages.py (1)
    RealtimeTelemetryMetricsMessage (79-86)
  aiperf/common/mixins/message_bus_mixin.py (1)
    MessageBusClientMixin (34-141)
  aiperf/common/models/record_models.py (1)
    MetricResult (25-51)
  aiperf/common/aiperf_logger.py (1)
    AIPerfLogger (25-239)
aiperf/common/messages/__init__.py (1)
  aiperf/common/messages/telemetry_messages.py (2)
    RealtimeTelemetryMetricsMessage (79-86)
    StartRealtimeTelemetryCommand (89-98)
aiperf/common/hooks.py (2)
  aiperf/ui/dashboard/aiperf_textual_app.py (1)
    on_realtime_telemetry_metrics (273-277)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (1)
    on_realtime_telemetry_metrics (304-313)
aiperf/records/records_manager.py (7)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
  aiperf/common/messages/telemetry_messages.py (2)
    RealtimeTelemetryMetricsMessage (79-86)
    StartRealtimeTelemetryCommand (89-98)
  aiperf/exporters/display_units_utils.py (1)
    normalize_endpoint_display (17-36)
  aiperf/common/models/error_models.py (1)
    ErrorDetails (11-47)
  aiperf/gpu_telemetry/telemetry_manager.py (2)
    TelemetryManager (40-378)
    _normalize_dcgm_url (106-118)
  aiperf/post_processors/telemetry_results_processor.py (2)
    summarize (47-99)
    get_telemetry_hierarchy (31-37)
  aiperf/common/models/telemetry_models.py (1)
    TelemetryResults (323-348)
aiperf/ui/dashboard/aiperf_textual_app.py (4)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
  aiperf/common/messages/telemetry_messages.py (1)
    StartRealtimeTelemetryCommand (89-98)
  aiperf/ui/dashboard/realtime_telemetry_dashboard.py (3)
    RealtimeTelemetryDashboard (248-313)
    set_status_message (292-302)
    on_realtime_telemetry_metrics (304-313)
  aiperf/common/hooks.py (1)
    on_realtime_telemetry_metrics (352-364)
tests/gpu_telemetry/test_telemetry_manager.py (1)
  tests/data_exporters/test_gpu_telemetry_console_exporter.py (1)
    mock_user_config (29-33)
aiperf/common/mixins/__init__.py (1)
  aiperf/common/mixins/realtime_telemetry_metrics_mixin.py (1)
    RealtimeTelemetryMetricsMixin (15-43)
aiperf/common/messages/telemetry_messages.py (7)
  aiperf/common/enums/command_enums.py (1)
    CommandType (7-17)
  aiperf/common/enums/telemetry_enums.py (1)
    GPUTelemetryMode (7-11)
  aiperf/common/enums/message_enums.py (1)
    MessageType (7-51)
  aiperf/common/messages/command_messages.py (1)
    CommandMessage (57-97)
  aiperf/common/models/error_models.py (1)
    ErrorDetails (11-47)
  aiperf/common/models/record_models.py (1)
    MetricResult (25-51)
  aiperf/common/models/telemetry_models.py (2)
    ProcessTelemetryResult (351-362)
    TelemetryRecord (69-107)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py (5)
  aiperf/common/aiperf_logger.py (1)
    AIPerfLogger (25-239)
  aiperf/common/config/service_config.py (1)
    ServiceConfig (30-171)
  aiperf/common/models/record_models.py (1)
    MetricResult (25-51)
  aiperf/ui/dashboard/custom_widgets.py (2)
    MaximizableWidget (15-32)
    NonFocusableDataTable (10-12)
  aiperf/ui/dashboard/aiperf_textual_app.py (2)
    compose (115-148)
    on_realtime_telemetry_metrics (273-277)
🪛 Ruff (0.14.0)
aiperf/common/config/user_config.py
244-244: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
aiperf/ui/dashboard/realtime_telemetry_dashboard.py
41-41: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
42-42: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
117-117: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (11)
docs/tutorials/gpu-telemetry.md (3)
37-38: Well-documented default endpoint behavior.

The IMPORTANT note clearly explains that the default endpoint `http://localhost:9401/metrics` is always attempted regardless of the `--gpu-telemetry` flag. This is well-emphasized and reinforced again in the multi-node example (lines 294, 316). The clarification is critical to user understanding and is properly positioned.
180-206: Port forwarding and DCGM exporter configuration is clearly explained.

The setup instructions for the DCGM exporter are well-structured with clear explanations of port forwarding (`-p 9401:9400`), the metrics collection interval, and custom metrics file mounting. The example commands are syntactically correct and production-ready.
31-35: Table effectively communicates flag behavior.

The comparison table succinctly explains the differences between no flag, flag only, and custom URLs, making it easy for users to understand which endpoints are collected and displayed in each scenario.
aiperf/ui/dashboard/__init__.py (1)
34-38: LGTM!

The new dashboard components (GPUMetricsTable, RealtimeTelemetryDashboard, SingleNodeView) are correctly imported and exported. The changes follow the standard mkinit pattern for exposing the public API surface.
Also applies to: 55-55, 63-63, 65-65
aiperf/ui/__init__.py (1)
20-20: LGTM!

The telemetry dashboard components are correctly re-exported from the dashboard submodule to the top-level UI package API. This follows the standard mkinit-generated pattern.
Also applies to: 28-28, 30-30, 52-52, 62-62, 64-64
aiperf/common/enums/command_enums.py (1)
17-17: LGTM!

The new START_REALTIME_TELEMETRY command type follows the existing naming conventions and is properly integrated into the CommandType enum.
aiperf/ui/base_ui.py (1)
7-7: LGTM!

The RealtimeTelemetryMetricsMixin is correctly integrated into the BaseAIPerfUI class hierarchy. The mixin follows the same pattern as the existing RealtimeMetricsMixin and provides the necessary hooks for handling real-time telemetry metrics. The docstring updates accurately reflect the new capability.
Also applies to: 12-17, 21-23, 25-27
aiperf/common/enums/message_enums.py (1)
44-44: LGTM!

The new REALTIME_TELEMETRY_METRICS message type is correctly added to the MessageType enum, following the established naming conventions and alphabetical ordering.
aiperf/ui/dashboard/aiperf_dashboard_ui.py (1)
73-76: LGTM!

The ON_REALTIME_TELEMETRY_METRICS hook attachment follows the established pattern for hook registration in the dashboard UI. The integration correctly wires the telemetry metrics flow from the mixin to the textual app's handler.
aiperf/common/mixins/__init__.py (1)
44-46: LGTM!

The RealtimeTelemetryMetricsMixin is correctly imported and exported following the standard mkinit pattern. The alphabetical ordering is maintained.
Also applies to: 69-69
tests/records/test_records_filtering.py (1)
352-358: LGTM!The test updates correctly reflect the new
dcgm_urlparameter added toTelemetryRecordsMessage. The test values are consistent and appropriate for the testing scenarios.Also applies to: 386-392
Pull Request Overview
Adds real-time GPU telemetry visualization dashboard to monitor GPU metrics during benchmark execution, introducing a new interactive TUI dashboard that displays live GPU metrics (utilization, memory, temperature, power).
- Real-time GPU telemetry dashboard with --gpu-telemetry 'dashboard' flag
- New UI components and data streaming infrastructure for telemetry visualization
- Enhanced telemetry management with endpoint tracking and performance optimization
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| aiperf/ui/dashboard/realtime_telemetry_dashboard.py | New interactive TUI dashboard for real-time GPU metrics visualization |
| aiperf/ui/dashboard/aiperf_textual_app.py | Integrates telemetry dashboard into main UI with mode switching |
| aiperf/records/records_manager.py | Adds telemetry data streaming and background task management |
| aiperf/post_processors/telemetry_results_processor.py | Enhances telemetry processor with hierarchy access and display formatting |
| aiperf/gpu_telemetry/telemetry_manager.py | Updates telemetry manager with URL tracking and message enrichment |
| aiperf/common/config/user_config.py | Adds GPU telemetry mode configuration parsing |
| tests/ | Updates test files with new dcgm_url field and configuration changes |
Halfway through, but it looks really well done so far! Great work!
```python
telemetry_mode: GPUTelemetryMode = Field(
    default=GPUTelemetryMode.REALTIME_DASHBOARD,
    description="The GPU telemetry mode to set when starting realtime telemetry",
)
```
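One plausible reading of how such a field participates in the flow (a sketch with stand-in types — the real `GPUTelemetryMode` and command handling live in aiperf and may differ):

```python
from dataclasses import dataclass
from enum import Enum


class GPUTelemetryMode(str, Enum):
    """Stand-in for aiperf's CaseInsensitiveStrEnum-based mode enum."""
    SUMMARY = "summary"
    REALTIME_DASHBOARD = "realtime_dashboard"


@dataclass
class StartRealtimeTelemetryCommand:
    """Sketch: the command carries the mode the records manager should adopt."""
    telemetry_mode: GPUTelemetryMode = GPUTelemetryMode.REALTIME_DASHBOARD


def handle_command(current_mode: GPUTelemetryMode,
                   cmd: StartRealtimeTelemetryCommand) -> GPUTelemetryMode:
    # The records manager would switch its reporting mode based on the command.
    return cmd.telemetry_mode


mode = handle_command(GPUTelemetryMode.SUMMARY, StartRealtimeTelemetryCommand())
print(mode.value)  # realtime_dashboard
```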
this may be a note to myself, but how is this being used? what is it needed by?
```python
# before
@background_task(interval=None, immediate=True)
async def _report_realtime_metrics_task(self) -> None:
    """Report the real-time metrics at a regular interval (only if the UI type is dashboard)."""

# after
@background_task(interval=DEFAULT_REALTIME_METRICS_INTERVAL, immediate=False)
```
The problem with doing it this way is that my original `interval=None` means the task won't repeat, so if the UI type is not dashboard, it will stop the task. But here, it will still fire the task every interval and then exit, so it's a little less efficient.
alternatively, we could have a custom exception that we raise, that background_task would catch, to know it should stop.
like raise BackgroundTaskDisabled("UI type is not Dashboard: {ui_type=}") if we always assume it will be disabled at the beginning
or raise StopBackgroundTask("..") if we want to allow exiting mid-execution
sorta like that.
But granted, those options dont ever allow you to resume the task
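The stop-exception idea can be sketched as follows; `background_task` here is a hypothetical stand-in for aiperf's decorator, shown only to illustrate the control flow of a runner that stops rescheduling once the body raises:

```python
import asyncio


class StopBackgroundTask(Exception):
    """Raised inside a task body to tell the runner to stop rescheduling it."""


def background_task(interval: float):
    """Hypothetical decorator: rerun the coroutine every `interval` seconds
    until its body raises StopBackgroundTask."""
    def wrap(fn):
        async def runner(*args, **kwargs):
            while True:
                try:
                    await fn(*args, **kwargs)
                except StopBackgroundTask:
                    break  # stop permanently; no way to resume
                await asyncio.sleep(interval)
        return runner
    return wrap


runs = []


@background_task(interval=0.01)
async def report(ui_is_dashboard: bool):
    if not ui_is_dashboard:
        raise StopBackgroundTask("UI type is not Dashboard")
    runs.append("tick")
    if len(runs) >= 3:
        raise StopBackgroundTask("demo: stop after 3 ticks")


asyncio.run(report(True))
print(runs)  # ['tick', 'tick', 'tick']
```

As noted, the trade-off is that a raised stop is terminal: the runner exits its loop, so the task cannot be resumed later.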
As discussed on Slack, there could be assumptions around being able to resume the task, needed in the future for the sweep task.

I modified it so that `_report_realtime_inference_metrics_task` works the same with `@background_task(interval=None, immediate=True)`, and `_report_realtime_telemetry_metrics_task` has this flow:

- `@background_task(interval=None, immediate=True)` fires the task once on startup
- Lines 476-477: if `ui_type != DASHBOARD`, it returns immediately and stops permanently
- If the dashboard was enabled by the user in the CLI: start processing realtime telemetry metrics
- If the dashboard was not enabled: sleep. If the user clicks on the "5 - GPU Telemetry" tab, the task can be woken up by `_on_start_realtime_telemetry_command` to resume

What do you think about this solution?
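The sleep-until-woken flow described above can be modeled with an `asyncio.Event` (a standalone sketch; the class and method names are illustrative, not the actual records_manager code):

```python
import asyncio


class TelemetryReporter:
    """Sketch: the loop parks on an Event while the dashboard is inactive
    and resumes when the start command sets it."""

    def __init__(self, dashboard_enabled: bool):
        self._active = asyncio.Event()
        if dashboard_enabled:
            self._active.set()
        self.reports = []

    def on_start_realtime_telemetry_command(self):
        self._active.set()  # wake the sleeping task

    async def report_loop(self, cycles: int):
        for _ in range(cycles):
            await self._active.wait()  # sleep here until woken
            self.reports.append("telemetry")
            await asyncio.sleep(0)


async def main():
    reporter = TelemetryReporter(dashboard_enabled=False)
    task = asyncio.create_task(reporter.report_loop(cycles=2))
    await asyncio.sleep(0.01)  # loop is parked; nothing reported yet
    assert reporter.reports == []
    reporter.on_start_realtime_telemetry_command()  # user opens the GPU tab
    await task
    return reporter.reports


print(asyncio.run(main()))  # ['telemetry', 'telemetry']
```

Unlike a terminal stop-exception, the Event can also be cleared to pause the loop again later, which fits the resume-for-sweep assumption.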
```python
if self.service_config.ui_type != AIPerfUIType.DASHBOARD:
    return
if self.user_config.gpu_telemetry_mode != GPUTelemetryMode.REALTIME_DASHBOARD:
    return
```
same thing with the above code about the interval thing.
```python
if self.user_config.gpu_telemetry_mode != GPUTelemetryMode.REALTIME_DASHBOARD:
    return

telemetry_metrics = await self._generate_realtime_telemetry_metrics()
```
is there any way we can know if we do not need to process the realtime metrics because they are the same? like a last_updated kinda thing? or an update count?
ooh that makes sense, will do
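A minimal sketch of the suggested change-detection (hypothetical names; the real implementation would live in the records manager): track an update counter and skip regeneration when it has not advanced since the last publish:

```python
class TelemetryCache:
    """Skip regenerating realtime metrics when no new records arrived."""

    def __init__(self):
        self._update_count = 0      # bumped on every incoming record
        self._published_count = -1  # update count at the last publish
        self.generations = 0        # how many times we actually did the work

    def add_record(self, record):
        self._update_count += 1

    def maybe_generate_metrics(self):
        if self._update_count == self._published_count:
            return None  # nothing changed since last publish; skip the work
        self._published_count = self._update_count
        self.generations += 1
        return f"metrics@{self._update_count}"


cache = TelemetryCache()
cache.add_record({"gpu": 0})
print(cache.maybe_generate_metrics())  # metrics@1
print(cache.maybe_generate_metrics())  # None  (no new records)
cache.add_record({"gpu": 1})
print(cache.maybe_generate_metrics())  # metrics@2
```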
```python
    return_exceptions=True,
)

telemetry_metrics = [
```
this is a bit complex, I would add a comment about how it is flattening the metric results or something
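For reference, the flattening pattern being discussed looks roughly like this (synthetic data; the real code operates on `MetricResult` objects gathered per collector):

```python
# Each gather() result is one collector's list of metric entries, or an
# Exception when return_exceptions=True let a collector failure through.
results = [
    [{"tag": "gpu_utilization", "avg": 83.0}],
    RuntimeError("collector 2 unreachable"),
    [{"tag": "gpu_power_usage", "avg": 212.5},
     {"tag": "gpu_temperature", "avg": 61.0}],
]

# Flatten the per-collector lists into one list, dropping failed collectors.
telemetry_metrics = [
    metric
    for per_collector in results
    if not isinstance(per_collector, Exception)
    for metric in per_collector
]
print([m["tag"] for m in telemetry_metrics])
# ['gpu_utilization', 'gpu_power_usage', 'gpu_temperature']
```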
Adds real-time GPU telemetry visualization dashboard to monitor GPU metrics during benchmark execution.
Key Additions:
Technical Changes
Design Decisions
Performance Analysis:
Rationale: Trading 0.1% CPU overhead for eliminating 3-5 second wait times provides significantly better UX. The performance cost is negligible while the user experience benefit is substantial.
Documentation
Example:

Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation