1 change: 1 addition & 0 deletions README.md
@@ -57,6 +57,7 @@ Features
| **[Sequence Distributions](docs/tutorials/sequence-distributions.md)** | Mixed ISL/OSL pairings | Benchmarking mixed use cases |
| **[Goodput](docs/tutorials/goodput.md)** | Throughput of requests meeting user-defined SLOs | SLO validation, capacity planning, runtime/model comparisons |
| **[Request Rate with Max Concurrency](docs/tutorials/request-rate-concurrency.md)** | Dual control of request timing and concurrent connection ceiling (Poisson or constant modes) | Testing API rate/concurrency limits, avoiding thundering herd, realistic client simulation |
| **[GPU Telemetry](docs/tutorials/gpu-telemetry.md)** | Real-time GPU metrics collection via DCGM (power, utilization, memory, temperature, etc.) | Performance optimization, resource monitoring, multi-node telemetry |
| **[Template Endpoint](docs/tutorials/template-endpoint.md)** | Benchmark custom APIs with flexible Jinja2 request templates | Custom API formats, rapid prototyping, non-standard endpoints |

### Working with Benchmark Data
49 changes: 36 additions & 13 deletions docs/tutorials/gpu-telemetry.md
@@ -12,7 +12,7 @@ This guide shows you how to collect GPU metrics (power, utilization, memory, tem
This guide covers two setup paths depending on your inference backend:

### Path 1: Dynamo (Built-in DCGM)
If you're using **Dynamo**, it comes with DCGM pre-configured on port 9401. No additional setup needed! Just use the `--gpu-telemetry` flag to enable console display and optionally add additional DCGM url endpoints.
If you're using **Dynamo**, it comes with DCGM pre-configured on port 9401. No additional setup needed! Just use the `--gpu-telemetry` flag to enable console display and optionally specify additional DCGM exporter URL endpoints. URLs can be specified with or without the `http://` prefix (e.g., `localhost:9400` or `http://localhost:9400`).

### Path 2: Other Inference Servers (Custom DCGM)
If you're using **any other inference backend**, you'll need to set up DCGM separately.
@@ -28,15 +28,28 @@ AIPerf provides GPU telemetry collection with the `--gpu-telemetry` flag. Here's

### How the `--gpu-telemetry` Flag Works

| Usage | Command | What Gets Collected (If Available) | Console Display | CSV/JSON Export |
|-------|---------|---------------------|-----------------|-----------------|
| **No flag** | `aiperf profile --model MODEL ...` | `http://localhost:9400/metrics` + `http://localhost:9401/metrics` | ❌ No | ✅ Yes |
| **Flag only** | `aiperf profile --model MODEL ... --gpu-telemetry` | `http://localhost:9400/metrics` + `http://localhost:9401/metrics` | ✅ Yes | ✅ Yes |
| **Custom URLs** | `aiperf profile --model MODEL ... --gpu-telemetry http://node1:9400/metrics http://node2:9400/metrics` | `http://localhost:9400/metrics` + `http://localhost:9401/metrics` + custom URLs | ✅ Yes | ✅ Yes |
| Usage | Command | What Gets Collected (If Available) | Console Display | Dashboard View | CSV/JSON Export |
|-------|---------|---------------------|-----------------|----------------|-----------------|
| **No flag** | `aiperf profile --model MODEL ...` | `http://localhost:9400/metrics` + `http://localhost:9401/metrics` | ❌ No | ❌ No | ✅ Yes |
| **Flag only** | `aiperf profile --model MODEL ... --gpu-telemetry` | `http://localhost:9400/metrics` + `http://localhost:9401/metrics` | ✅ Yes | ❌ No | ✅ Yes |
| **Dashboard mode** | `aiperf profile --model MODEL ... --gpu-telemetry dashboard` | `http://localhost:9400/metrics` + `http://localhost:9401/metrics` | ✅ Yes | ✅ Yes | ✅ Yes |
| **Custom URLs** | `aiperf profile --model MODEL ... --gpu-telemetry node1:9400 http://node2:9400/metrics` | `http://localhost:9400/metrics` + `http://localhost:9401/metrics` + custom URLs | ✅ Yes | ❌ No | ✅ Yes |
| **Dashboard + URLs** | `aiperf profile --model MODEL ... --gpu-telemetry dashboard localhost:9400` | `http://localhost:9400/metrics` + `http://localhost:9401/metrics` + custom URLs | ✅ Yes | ✅ Yes | ✅ Yes |

> [!IMPORTANT]
> The default endpoints `http://localhost:9400/metrics` and `http://localhost:9401/metrics` are ALWAYS attempted for telemetry collection, regardless of whether the `--gpu-telemetry` flag is used. The flag primarily controls whether metrics are displayed on the console and allows you to specify additional custom DCGM exporter endpoints.

> [!NOTE]
> When specifying custom DCGM exporter URLs, the `http://` prefix is optional. URLs like `localhost:9400` will automatically be treated as `http://localhost:9400`. Both formats work identically.

### Real-Time Dashboard View

Adding `dashboard` to the `--gpu-telemetry` flag enables a live terminal UI (TUI) that displays GPU metrics in real-time during your benchmark runs:

```bash
aiperf profile --model MODEL ... --gpu-telemetry dashboard
```

---

# 1: Using Dynamo
@@ -48,7 +61,7 @@ Dynamo includes DCGM out of the box on port 9401 - no extra setup needed!
```bash
# Set environment variables
export AIPERF_REPO_TAG="main"
export DYNAMO_PREBUILT_IMAGE_TAG="nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0"
export DYNAMO_PREBUILT_IMAGE_TAG="nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1"
export MODEL="Qwen/Qwen3-0.6B"

# Download the Dynamo container
@@ -99,7 +112,7 @@ uv pip install ./aiperf

```bash
# Wait for Dynamo API to be ready (up to 15 minutes)
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"a\"}],\"max_completion_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "Dynamo not ready after 15min"; exit 1; }
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"a\"}],\"max_completion_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "Dynamo not ready after 15min"; exit 1; }
```
```bash
# Wait for DCGM Exporter to be ready (up to 2 minutes after Dynamo is ready)
@@ -116,7 +129,7 @@ aiperf profile \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url localhost:8080 \
--url localhost:8000 \
--synthetic-input-tokens-mean 100 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 200 \
@@ -131,6 +144,9 @@ aiperf profile \
--gpu-telemetry
```

> [!TIP]
> The `dashboard` keyword enables a live terminal UI for real-time GPU telemetry visualization. Press `5` to maximize the GPU Telemetry panel during the benchmark run.

---

# 2: Using Other Inference Server
@@ -279,6 +295,12 @@ aiperf profile \
--gpu-telemetry
```

> [!TIP]
> The `dashboard` keyword enables a live terminal UI for real-time GPU telemetry visualization. Press `5` to maximize the GPU Telemetry panel during the benchmark run.

## Multi-Node GPU Telemetry Example

For distributed setups with multiple nodes, you can collect GPU telemetry from all nodes simultaneously:
@@ -287,12 +309,13 @@ For distributed setups with multiple nodes, you can collect GPU telemetry from a
# Example: Collecting telemetry from 3 nodes in a distributed setup
# Note: The default endpoints http://localhost:9400/metrics and http://localhost:9401/metrics
# are always attempted in addition to these custom URLs
# URLs can be specified with or without the http:// prefix
aiperf profile \
--model Qwen/Qwen3-0.6B \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url localhost:8080 \
--url localhost:8000 \
--synthetic-input-tokens-mean 100 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 200 \
@@ -304,14 +327,14 @@ aiperf profile \
--warmup-request-count 1 \
--conversation-num 8 \
--random-seed 100 \
--gpu-telemetry http://node1:9400/metrics http://node2:9400/metrics http://node3:9400/metrics
--gpu-telemetry node1:9400 node2:9400 http://node3:9400/metrics
```

This will collect GPU metrics from:
- `http://localhost:9400/metrics` (default, always attempted)
- `http://localhost:9401/metrics` (default, always attempted)
- `http://node1:9400/metrics` (custom node 1)
- `http://node2:9400/metrics` (custom node 2)
- `http://node1:9400` (custom node 1, normalized from `node1:9400`)
- `http://node2:9400` (custom node 2, normalized from `node2:9400`)
- `http://node3:9400/metrics` (custom node 3)

All metrics are displayed on the console and saved to the output CSV and JSON files, with GPU indices and hostnames distinguishing metrics from different nodes.
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -17,6 +17,8 @@ nav:
- Time-based Benchmarking: tutorials/time-based-benchmarking.md
- Sequence Distributions: tutorials/sequence-distributions.md
- Goodput: tutorials/goodput.md
- Request Rate with Max Concurrency: tutorials/request-rate-concurrency.md
- GPU Telemetry: tutorials/gpu-telemetry.md
- Template Endpoint: tutorials/template-endpoint.md
- Reference:
- Architecture: architecture.md
40 changes: 39 additions & 1 deletion src/aiperf/common/config/user_config.py
@@ -19,7 +19,7 @@
from aiperf.common.config.loadgen_config import LoadGeneratorConfig
from aiperf.common.config.output_config import OutputConfig
from aiperf.common.config.tokenizer_config import TokenizerConfig
from aiperf.common.enums import CustomDatasetType
from aiperf.common.enums import CustomDatasetType, GPUTelemetryMode
from aiperf.common.enums.timing_enums import RequestRateMode, TimingMode
from aiperf.common.utils import load_json_str

@@ -224,6 +224,44 @@ def _count_dataset_entries(self) -> int:
),
]

_gpu_telemetry_mode: GPUTelemetryMode = GPUTelemetryMode.SUMMARY
_gpu_telemetry_urls: list[str] = []

@model_validator(mode="after")
def _parse_gpu_telemetry_config(self) -> Self:
"""Parse gpu_telemetry list into mode and URLs."""
if not self.gpu_telemetry:
return self

mode = GPUTelemetryMode.SUMMARY
urls = []

for item in self.gpu_telemetry:
if item in ["dashboard"]:
mode = GPUTelemetryMode.REALTIME_DASHBOARD
elif item.startswith("http") or ":" in item:
normalized_url = item if item.startswith("http") else f"http://{item}"
urls.append(normalized_url)

self._gpu_telemetry_mode = mode
self._gpu_telemetry_urls = urls
return self

@property
def gpu_telemetry_mode(self) -> GPUTelemetryMode:
"""Get the GPU telemetry display mode (parsed from gpu_telemetry list)."""
return self._gpu_telemetry_mode

@gpu_telemetry_mode.setter
def gpu_telemetry_mode(self, value: GPUTelemetryMode) -> None:
"""Set the GPU telemetry display mode."""
self._gpu_telemetry_mode = value

@property
def gpu_telemetry_urls(self) -> list[str]:
"""Get the parsed GPU telemetry DCGM endpoint URLs."""
return self._gpu_telemetry_urls

@model_validator(mode="after")
def _compute_config(self) -> Self:
"""Compute additional configuration.
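The parsing rules implemented by the new `_parse_gpu_telemetry_config` validator can be summarized with a small standalone sketch. Note that `parse_gpu_telemetry` below is a hypothetical helper written only for illustration; it mirrors the validator's logic in this diff and is not part of the codebase.

```python
# Standalone sketch of the parsing rules in _parse_gpu_telemetry_config above.
# `parse_gpu_telemetry` is a hypothetical helper used only for illustration.
from aiperf.common.enums import GPUTelemetryMode


def parse_gpu_telemetry(items: list[str]) -> tuple[GPUTelemetryMode, list[str]]:
    mode = GPUTelemetryMode.SUMMARY
    urls: list[str] = []
    for item in items:
        if item == "dashboard":
            # the literal "dashboard" keyword switches on the live TUI
            mode = GPUTelemetryMode.REALTIME_DASHBOARD
        elif item.startswith("http") or ":" in item:
            # bare host:port values are normalized to http://host:port
            urls.append(item if item.startswith("http") else f"http://{item}")
    return mode, urls


# Equivalent of: --gpu-telemetry dashboard node1:9400 http://node2:9400/metrics
mode, urls = parse_gpu_telemetry(["dashboard", "node1:9400", "http://node2:9400/metrics"])
assert mode is GPUTelemetryMode.REALTIME_DASHBOARD
assert urls == ["http://node1:9400", "http://node2:9400/metrics"]
```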
4 changes: 4 additions & 0 deletions src/aiperf/common/enums/__init__.py
@@ -96,6 +96,9 @@
from aiperf.common.enums.system_enums import (
SystemState,
)
from aiperf.common.enums.telemetry_enums import (
GPUTelemetryMode,
)
from aiperf.common.enums.timing_enums import (
CreditPhase,
RequestRateMode,
@@ -131,6 +134,7 @@
"ExportLevel",
"FrequencyMetricUnit",
"FrequencyMetricUnitInfo",
"GPUTelemetryMode",
"GenericMetricUnit",
"ImageFormat",
"LifecycleState",
1 change: 1 addition & 0 deletions src/aiperf/common/enums/command_enums.py
@@ -14,6 +14,7 @@ class CommandType(CaseInsensitiveStrEnum):
SHUTDOWN = "shutdown"
SHUTDOWN_WORKERS = "shutdown_workers"
SPAWN_WORKERS = "spawn_workers"
START_REALTIME_TELEMETRY = "start_realtime_telemetry"


class CommandResponseStatus(CaseInsensitiveStrEnum):
1 change: 1 addition & 0 deletions src/aiperf/common/enums/message_enums.py
@@ -41,6 +41,7 @@ class MessageType(CaseInsensitiveStrEnum):
PROFILE_PROGRESS = "profile_progress"
PROFILE_RESULTS = "profile_results"
REALTIME_METRICS = "realtime_metrics"
REALTIME_TELEMETRY_METRICS = "realtime_telemetry_metrics"
REGISTRATION = "registration"
SERVICE_ERROR = "service_error"
STATUS = "status"
11 changes: 11 additions & 0 deletions src/aiperf/common/enums/telemetry_enums.py
@@ -0,0 +1,11 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

from aiperf.common.enums.base_enums import CaseInsensitiveStrEnum


class GPUTelemetryMode(CaseInsensitiveStrEnum):
"""GPU telemetry display mode."""

SUMMARY = "summary"
REALTIME_DASHBOARD = "realtime_dashboard"
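A quick sanity check of the new enum's values (the case-insensitive lookup behavior is an assumption based on the `CaseInsensitiveStrEnum` base class name):

```python
# Minimal sketch; the two values below come directly from the enum in this diff.
from aiperf.common.enums import GPUTelemetryMode

assert GPUTelemetryMode.SUMMARY.value == "summary"
assert GPUTelemetryMode.REALTIME_DASHBOARD.value == "realtime_dashboard"
# GPUTelemetryMode("SUMMARY") should also resolve to GPUTelemetryMode.SUMMARY,
# assuming CaseInsensitiveStrEnum performs case-insensitive value lookup.
```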
16 changes: 16 additions & 0 deletions src/aiperf/common/hooks.py
@@ -44,6 +44,7 @@ class AIPerfHook(CaseInsensitiveStrEnum):
ON_INIT = "@on_init"
ON_MESSAGE = "@on_message"
ON_REALTIME_METRICS = "@on_realtime_metrics"
ON_REALTIME_TELEMETRY_METRICS = "@on_realtime_telemetry_metrics"
ON_PROFILING_PROGRESS = "@on_profiling_progress"
ON_PULL_MESSAGE = "@on_pull_message"
ON_RECORDS_PROGRESS = "@on_records_progress"
@@ -348,6 +349,21 @@ def _on_realtime_metrics(self, metrics: list[MetricResult]) -> None:
return _hook_decorator(AIPerfHook.ON_REALTIME_METRICS, func)


def on_realtime_telemetry_metrics(func: Callable) -> Callable:
"""Decorator to specify that the function is a hook that should be called when real-time GPU telemetry metrics are received.
See :func:`aiperf.common.hooks._hook_decorator`.

Example:
```python
class MyPlugin(RealtimeTelemetryMetricsMixin):
@on_realtime_telemetry_metrics
def _on_realtime_telemetry_metrics(self, metrics: list[MetricResult]) -> None:
pass
```
"""
return _hook_decorator(AIPerfHook.ON_REALTIME_TELEMETRY_METRICS, func)


def on_pull_message(
*message_types: MessageTypeT | Callable[[SelfT], Iterable[MessageTypeT]],
) -> Callable:
4 changes: 4 additions & 0 deletions src/aiperf/common/messages/__init__.py
@@ -31,6 +31,7 @@
ShutdownCommand,
ShutdownWorkersCommand,
SpawnWorkersCommand,
StartRealtimeTelemetryCommand,
TargetedServiceMessage,
)
from aiperf.common.messages.credit_messages import (
@@ -75,6 +76,7 @@
)
from aiperf.common.messages.telemetry_messages import (
ProcessTelemetryResultMessage,
RealtimeTelemetryMetricsMessage,
TelemetryRecordsMessage,
TelemetryStatusMessage,
)
@@ -127,13 +129,15 @@
"ProfileStartCommand",
"RealtimeMetricsCommand",
"RealtimeMetricsMessage",
"RealtimeTelemetryMetricsMessage",
"RecordsProcessingStatsMessage",
"RegisterServiceCommand",
"RegistrationMessage",
"RequiresRequestNSMixin",
"ShutdownCommand",
"ShutdownWorkersCommand",
"SpawnWorkersCommand",
"StartRealtimeTelemetryCommand",
"StatusMessage",
"TargetedServiceMessage",
"TelemetryRecordsMessage",
11 changes: 11 additions & 0 deletions src/aiperf/common/messages/command_messages.py
@@ -242,6 +242,17 @@ class RealtimeMetricsCommand(CommandMessage):
command: CommandTypeT = CommandType.REALTIME_METRICS


class StartRealtimeTelemetryCommand(CommandMessage):
"""Command to start the realtime telemetry background task in RecordsManager.

This command is sent when the user dynamically enables the telemetry dashboard
by pressing the telemetry option in the UI. This always sets the GPU telemetry
mode to REALTIME_DASHBOARD.
"""

command: CommandTypeT = CommandType.START_REALTIME_TELEMETRY


class SpawnWorkersCommand(CommandMessage):
command: CommandTypeT = CommandType.SPAWN_WORKERS

21 changes: 20 additions & 1 deletion src/aiperf/common/messages/telemetry_messages.py
@@ -5,7 +5,12 @@

from aiperf.common.enums import MessageType
from aiperf.common.messages.service_messages import BaseServiceMessage
from aiperf.common.models import ErrorDetails, ProcessTelemetryResult, TelemetryRecord
from aiperf.common.models import (
ErrorDetails,
MetricResult,
ProcessTelemetryResult,
TelemetryRecord,
)
from aiperf.common.types import MessageTypeT


@@ -19,6 +24,10 @@ class TelemetryRecordsMessage(BaseServiceMessage):
...,
description="The ID of the telemetry data collector that collected the records.",
)
dcgm_url: str = Field(
...,
description="The DCGM endpoint URL that was contacted (e.g., 'http://localhost:9400/metrics')",
)
records: list[TelemetryRecord] = Field(
..., description="The telemetry records collected from GPU monitoring"
)
@@ -62,3 +71,13 @@ class TelemetryStatusMessage(BaseServiceMessage):
default_factory=list,
description="List of DCGM endpoint URLs that were reachable and will provide data",
)


class RealtimeTelemetryMetricsMessage(BaseServiceMessage):
"""Message from the records manager to show real-time GPU telemetry metrics."""

message_type: MessageTypeT = MessageType.REALTIME_TELEMETRY_METRICS

metrics: list[MetricResult] = Field(
..., description="The current real-time GPU telemetry metrics."
)
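For context, a consumer of the new message would receive the current GPU telemetry snapshot via the `metrics` field. The sketch below only uses the fields visible in this diff; the handler name and wiring are illustrative, not the actual dashboard implementation.

```python
# Sketch of a consumer-side handler for RealtimeTelemetryMetricsMessage.
# Only message_type and metrics are taken from this diff; how the handler is
# registered (e.g., via the on_realtime_telemetry_metrics hook) is assumed.
from aiperf.common.messages import RealtimeTelemetryMetricsMessage


def handle_realtime_telemetry(msg: RealtimeTelemetryMetricsMessage) -> None:
    # msg.metrics is a list[MetricResult] holding the latest GPU telemetry values
    for metric in msg.metrics:
        print(metric)  # a real consumer would update the dashboard panel instead
```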
4 changes: 4 additions & 0 deletions src/aiperf/common/mixins/__init__.py
@@ -44,6 +44,9 @@
from aiperf.common.mixins.realtime_metrics_mixin import (
RealtimeMetricsMixin,
)
from aiperf.common.mixins.realtime_telemetry_metrics_mixin import (
RealtimeTelemetryMetricsMixin,
)
from aiperf.common.mixins.reply_client_mixin import (
ReplyClientMixin,
)
@@ -67,6 +70,7 @@
"ProgressTrackerMixin",
"PullClientMixin",
"RealtimeMetricsMixin",
"RealtimeTelemetryMetricsMixin",
"ReplyClientMixin",
"TaskManagerMixin",
"WorkerTrackerMixin",