From 431c34bb1d803954fddbbec73314c1f1be03af52 Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Wed, 1 Oct 2025 21:23:50 -0700 Subject: [PATCH 01/11] docs: add comprehencsive metrics docs --- docs/metrics_reference.md | 109 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 109 insertions(+) create mode 100644 docs/metrics_reference.md diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md new file mode 100644 index 000000000..1cc3dfa85 --- /dev/null +++ b/docs/metrics_reference.md @@ -0,0 +1,109 @@ + +# AIPerf Metrics Reference + +This document provides a comprehensive reference of all metrics available in AIPerf for benchmarking LLM inference performance. Metrics are organized by computation type to help you understand when and how each metric is calculated. + +## Understanding Metric Types + +AIPerf computes metrics in three distinct phases during benchmark execution: + +### Record Metrics +Computed **individually for each request/response(s) pair** during the benchmark run. These metrics capture per-request characteristics such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99) that reveal performance variability across requests. + +**Examples**: Request Latency, TTFT, Token Counts, Inter-Token Latency + +### Aggregate Metrics +Computed by **tracking or accumulating values** across all requests in **real-time** during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce **single scalar values** representing the entire benchmark run. + +**Examples**: Request Count, Error Count, Min/Max Timestamps + +### Derived Metrics +Computed **after the benchmark completes** by applying mathematical formulas to other metric results. These metrics depend on one or more prerequisite metrics being available first. Derived metrics can produce either single values or distributions depending on their dependencies. + +**Examples**: Request Throughput, Benchmark Duration, Total Token Counts + +--- + +## Record Metrics + +| Metric | Tag | Explanation | Formula | +|--------|-----|-------------|---------| +| **Request Latency** | `request_latency` | Measures the total end-to-end time from sending a request until receiving the final response. | `responses[-1].perf_ns - start_perf_ns` | +| **Time to First Token (TTFT)** | `ttft` | Measures how long it takes to receive the first token after sending a request. Critical for user-perceived responsiveness in streaming scenarios. | `responses[0].perf_ns - request.start_perf_ns` | +| **Time to Second Token (TTST)** | `ttst` | Measures the time gap between the first and second tokens. Helps identify generation startup overhead separate from streaming throughput. | `responses[1].perf_ns - responses[0].perf_ns` | +| **Inter Token Latency (ITL)** | `inter_token_latency` | Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. | `(request_latency - ttft) / (output_sequence_length - 1)` | +| **Inter Chunk Latency (ICL)** | `inter_chunk_latency` | Captures the time gaps between all consecutive response chunks in a streaming response, providing a distribution of chunk arrival times. | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | +| **Output Token Count** | `output_token_count` | The number of output tokens generated for a single request, excluding reasoning tokens. 
| `output_token_count` | +| **Reasoning Token Count** | `reasoning_token_count` | The number of reasoning tokens generated for a single request (e.g., chain-of-thought tokens in reasoning models). | `reasoning_token_count` | +| **Output Sequence Length (OSL)** | `output_sequence_length` | The total number of completion tokens (output + reasoning) generated for a single request. | `(output_token_count or 0) + (reasoning_token_count or 0)` | +| **Input Sequence Length (ISL)** | `input_sequence_length` | The number of input/prompt tokens for a single request. | `input_token_count` | +| **Output Token Throughput Per User** | `output_token_throughput_per_user` | The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. | `1.0 / inter_token_latency_seconds` | + +--- + +## Aggregate Metrics + +| Metric | Tag | Explanation | Formula | +|--------|-----|-------------|---------| +| **Request Count** | `request_count` | The total number of successfully completed requests in the benchmark. | `sum(1 for request in valid_requests)` | +| **Error Request Count** | `error_request_count` | The total number of failed/error requests encountered during the benchmark. | `sum(1 for request in error_requests)` | +| **Minimum Request Timestamp** | `min_request_timestamp` | The wall-clock timestamp of the first request sent in the benchmark, used to calculate benchmark duration. | `min(timestamp_ns for record in records)` | +| **Maximum Response Timestamp** | `max_response_timestamp` | The wall-clock timestamp of the last response received in the benchmark, used to calculate benchmark duration. | `max(timestamp_ns + request_latency for record in records)` | + +--- + +## Derived Metrics + +| Metric | Tag | Explanation | Formula | +|--------|-----|-------------|---------| +| **Request Throughput** | `request_throughput` | The overall rate of completed requests per second across the entire benchmark. | `request_count / benchmark_duration_seconds` | +| **Output Token Throughput** | `output_token_throughput` | The aggregate token generation rate across all concurrent requests, measured as total tokens per second. | `total_osl / benchmark_duration_seconds` | +| **Benchmark Duration** | `benchmark_duration` | The total elapsed time from the first request sent to the last response received. | `max_response_timestamp - min_request_timestamp` | +| **Total Output Tokens** | `total_output_tokens` | The sum of all output tokens (excluding reasoning tokens) generated across all requests. | `sum(output_token_count for record in records)` | +| **Total Reasoning Tokens** | `total_reasoning_tokens` | The sum of all reasoning tokens generated across all requests. | `sum(reasoning_token_count for record in records)` | +| **Total Output Sequence Length** | `total_osl` | The sum of all completion tokens (output + reasoning) generated across all requests. | `sum(output_sequence_length for record in records)` | +| **Total Input Sequence Length** | `total_isl` | The sum of all input/prompt tokens processed across all requests. 
| `sum(input_sequence_length for record in records)` | + +--- + +## Reference Tables + +### Metric Summary + +| Type | Computation | Output | +|------|-------------|--------| +| **Record** | Per-request during benchmark | Statistical distributions (min, max, mean, p50, p90, p99) | +| **Aggregate** | Real-time accumulation across all requests | Single scalar values | +| **Derived** | Post-benchmark from other metrics | Single values or distributions | + +### Time Units + +| Aspect | Details | +|--------|---------| +| **Internal Storage** | Nanoseconds (`perf_ns`) for maximum precision | +| **Display Format** | Milliseconds (ms) or Seconds (s) for readability | +| **Conversion** | Automatic based on metric `display_unit` setting | + +### Model Requirements + +| Requirement | Description | Example Metrics | +|-------------|-------------|-----------------| +| **Token-producing models** | Models that return `usage` information with input/output token counts | `output_token_count`, `input_sequence_length`, `output_token_throughput` | +| **Streaming responses** | Endpoints that support Server-Sent Events (SSE) | `ttft`, `inter_token_latency`, `inter_chunk_latency` | +| **Reasoning token support** | Models that expose reasoning/thinking token counts separately | `reasoning_token_count`, `total_reasoning_tokens` | + +### Metric Flags Reference + +| Flag | Description | Impact | +|------|-------------|--------| +| `LARGER_IS_BETTER` | Higher values indicate better performance | Used for throughput and count metrics to indicate optimization direction | +| `PRODUCES_TOKENS_ONLY` | Only computed for token-producing models | Metric skipped if model doesn't provide token count information | +| `STREAMING_TOKENS_ONLY` | Only computed for streaming responses with tokens | Requires both streaming support and token information | +| `STREAMING_ONLY` | Only computed for streaming responses | Requires Server-Sent Events (SSE) support | +| `SUPPORTS_REASONING` | Requires reasoning token support | Only available for models like OpenAI o1 that expose reasoning tokens | +| `NO_CONSOLE` | Not displayed in console output | Metric available in JSON/CSV exports but hidden from terminal display | +| `ERROR_ONLY` | Only computed for error requests | Tracks error-specific information | + From 2f90f5cb48bcef42dfa0e7f69b22b311d4834fb0 Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Wed, 1 Oct 2025 22:04:40 -0700 Subject: [PATCH 02/11] update docs --- README.md | 59 +++++- docs/metrics_reference.md | 431 ++++++++++++++++++++++++++++++++++---- 2 files changed, 445 insertions(+), 45 deletions(-) diff --git a/README.md b/README.md index 087b7763c..2ef0e5f8d 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ SPDX-License-Identifier: Apache-2.0 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/aiperf) -**[Architecture](docs/architecture.md)**| **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Migrating from Genai-Perf](docs/migrating.md)** | **[CLI Options](docs/cli_options.md)** +**[Architecture](docs/architecture.md)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Migrating from Genai-Perf](docs/migrating.md)** | **[CLI Options](docs/cli_options.md)** | **[Metrics Reference](docs/metrics_reference.md)** | AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. 
@@ -84,7 +84,6 @@ aiperf profile --benchmark-duration 300.0 --benchmark-grace-period 30.0 [other o
- + +## Metrics Reference + +AIPerf provides comprehensive metrics organized into three categories. For detailed descriptions, requirements, and nuances of each metric, see the **[Complete Metrics Reference](docs/metrics_reference.md)**. + +### Record Metrics + +Computed individually for each request and its response(s). Record metrics produce statistical distributions (min, max, mean, p50, p90, p99, etc.). + +| Metric | Tag | Formula | Unit | +|--------|-----|---------|------| +| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - start_perf_ns` | `ms` | +| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` | +| [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` | +| [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` | +| [**Inter Chunk Latency (ICL)**](docs/metrics_reference.md#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` | +| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `output_token_count` | `tokens` | +| [**Reasoning Token Count**](docs/metrics_reference.md#reasoning-token-count) | `reasoning_token_count` | `reasoning_token_count` | `tokens` | +| [**Output Sequence Length (OSL)**](docs/metrics_reference.md#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` | +| [**Input Sequence Length (ISL)**](docs/metrics_reference.md#input-sequence-length-isl) | `input_sequence_length` | `input_token_count` | `tokens` | +| [**Output Token Throughput Per User**](docs/metrics_reference.md#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` | + +### Aggregate Metrics + +Computed by tracking values across all requests in real-time. Aggregate metrics produce single scalar values. + +| Metric | Tag | Formula | Unit | +|--------|-----|---------|------| +| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for request in valid_requests)` | `requests` | +| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for request in error_requests)` | `requests` | +| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp` | `min(timestamp_ns for record in records)` | `datetime` | +| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp` | `max(timestamp_ns + request_latency for record in records)` | `datetime` | + +### Derived Metrics + +Computed using formulas based on other metrics, but **not** computed per-record. These are calculated either after the benchmark completes for final results or in real-time across all current data for live metrics display. 
+ +| Metric | Tag | Formula | Unit | +|--------|-----|---------|------| +| [**Request Throughput**](docs/metrics_reference.md#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` | +| [**Output Token Throughput**](docs/metrics_reference.md#output-token-throughput) | `output_token_throughput` | `total_osl / benchmark_duration_seconds` | `tokens/sec` | +| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` | +| [**Total Output Tokens**](docs/metrics_reference.md#total-output-tokens) | `total_output_tokens` | `sum(output_token_count for record in records)` | `tokens` | +| [**Total Reasoning Tokens**](docs/metrics_reference.md#total-reasoning-tokens) | `total_reasoning_tokens` | `sum(reasoning_token_count for record in records)` | `tokens` | +| [**Total Output Sequence Length**](docs/metrics_reference.md#total-output-sequence-length) | `total_osl` | `sum(output_sequence_length for record in records)` | `tokens` | +| [**Total Input Sequence Length**](docs/metrics_reference.md#total-input-sequence-length) | `total_isl` | `sum(input_sequence_length for record in records)` | `tokens` | + +
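
To see how the three categories relate, here is a minimal, illustrative sketch (not AIPerf's internal implementation; the record layout and values are hypothetical) that computes one metric of each kind from raw per-request timings:

```python
# Hypothetical per-request data: (request start, [response arrival times]) in nanoseconds
records = [
    (0, [200_000_000, 350_000_000, 500_000_000]),
    (100_000_000, [400_000_000, 600_000_000]),
]

# Record metric: one request_latency value per request -> a distribution
request_latencies_ns = [responses[-1] - start for start, responses in records]

# Aggregate metric: a single counter accumulated across all requests
request_count = len(records)

# Derived metric: computed from other metric results after the fact
benchmark_duration_sec = (
    max(r[-1] for _, r in records) - min(start for start, _ in records)
) / 1e9
request_throughput = request_count / benchmark_duration_sec  # ~3.33 requests/sec
```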
+ + ## Known Issues - Output sequence length constraints (`--output-tokens-mean`) cannot be guaranteed unless you pass `ignore_eos` and/or `min_tokens` via `--extra-inputs` to an inference server that supports them. diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md index 1cc3dfa85..953e248d2 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -10,62 +10,407 @@ This document provides a comprehensive reference of all metrics available in AIP AIPerf computes metrics in three distinct phases during benchmark execution: -### Record Metrics -Computed **individually for each request/response(s) pair** during the benchmark run. These metrics capture per-request characteristics such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99) that reveal performance variability across requests. +**Record Metrics** are computed individually for each request and its response(s) during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture per-request characteristics such as latency, token counts, and streaming behavior. Record metrics produce statistical distributions (min, max, mean, median, p90, p99) that reveal performance variability across requests. -**Examples**: Request Latency, TTFT, Token Counts, Inter-Token Latency +**Aggregate Metrics** are computed by tracking or accumulating values across all requests in real-time during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce single scalar values representing the entire benchmark run. -### Aggregate Metrics -Computed by **tracking or accumulating values** across all requests in **real-time** during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce **single scalar values** representing the entire benchmark run. - -**Examples**: Request Count, Error Count, Min/Max Timestamps - -### Derived Metrics -Computed **after the benchmark completes** by applying mathematical formulas to other metric results. These metrics depend on one or more prerequisite metrics being available first. Derived metrics can produce either single values or distributions depending on their dependencies. - -**Examples**: Request Throughput, Benchmark Duration, Total Token Counts +**Derived Metrics** are computed by applying mathematical formulas to other metric results, but are **not** computed per-record like Record Metrics. Instead, these metrics depend on one or more prerequisite metrics being available first and are calculated either after the benchmark completes for final results or in real-time across all current data for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies. --- ## Record Metrics -| Metric | Tag | Explanation | Formula | -|--------|-----|-------------|---------| -| **Request Latency** | `request_latency` | Measures the total end-to-end time from sending a request until receiving the final response. | `responses[-1].perf_ns - start_perf_ns` | -| **Time to First Token (TTFT)** | `ttft` | Measures how long it takes to receive the first token after sending a request. Critical for user-perceived responsiveness in streaming scenarios. 
| `responses[0].perf_ns - request.start_perf_ns` | -| **Time to Second Token (TTST)** | `ttst` | Measures the time gap between the first and second tokens. Helps identify generation startup overhead separate from streaming throughput. | `responses[1].perf_ns - responses[0].perf_ns` | -| **Inter Token Latency (ITL)** | `inter_token_latency` | Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. | `(request_latency - ttft) / (output_sequence_length - 1)` | -| **Inter Chunk Latency (ICL)** | `inter_chunk_latency` | Captures the time gaps between all consecutive response chunks in a streaming response, providing a distribution of chunk arrival times. | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | -| **Output Token Count** | `output_token_count` | The number of output tokens generated for a single request, excluding reasoning tokens. | `output_token_count` | -| **Reasoning Token Count** | `reasoning_token_count` | The number of reasoning tokens generated for a single request (e.g., chain-of-thought tokens in reasoning models). | `reasoning_token_count` | -| **Output Sequence Length (OSL)** | `output_sequence_length` | The total number of completion tokens (output + reasoning) generated for a single request. | `(output_token_count or 0) + (reasoning_token_count or 0)` | -| **Input Sequence Length (ISL)** | `input_sequence_length` | The number of input/prompt tokens for a single request. | `input_token_count` | -| **Output Token Throughput Per User** | `output_token_throughput_per_user` | The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. | `1.0 / inter_token_latency_seconds` | +| Metric | Tag | Formula | Unit | +|--------|-----|---------|------| +| [**Request Latency**](#request-latency) | `request_latency` | `responses[-1].perf_ns - start_perf_ns` | `ms` | +| [**Time to First Token (TTFT)**](#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` | +| [**Time to Second Token (TTST)**](#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` | +| [**Inter Token Latency (ITL)**](#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` | +| [**Inter Chunk Latency (ICL)**](#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` | +| [**Output Token Count**](#output-token-count) | `output_token_count` | `output_token_count` | `tokens` | +| [**Reasoning Token Count**](#reasoning-token-count) | `reasoning_token_count` | `reasoning_token_count` | `tokens` | +| [**Output Sequence Length (OSL)**](#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` | +| [**Input Sequence Length (ISL)**](#input-sequence-length-isl) | `input_sequence_length` | `input_token_count` | `tokens` | +| [**Output Token Throughput Per User**](#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` | --- ## Aggregate Metrics -| Metric | Tag | Explanation | Formula | -|--------|-----|-------------|---------| -| **Request Count** | `request_count` | The total number of successfully completed requests in the benchmark. 
| `sum(1 for request in valid_requests)` | -| **Error Request Count** | `error_request_count` | The total number of failed/error requests encountered during the benchmark. | `sum(1 for request in error_requests)` | -| **Minimum Request Timestamp** | `min_request_timestamp` | The wall-clock timestamp of the first request sent in the benchmark, used to calculate benchmark duration. | `min(timestamp_ns for record in records)` | -| **Maximum Response Timestamp** | `max_response_timestamp` | The wall-clock timestamp of the last response received in the benchmark, used to calculate benchmark duration. | `max(timestamp_ns + request_latency for record in records)` | +| Metric | Tag | Formula | Unit | +|--------|-----|---------|------| +| [**Request Count**](#request-count) | `request_count` | `sum(1 for request in valid_requests)` | `requests` | +| [**Error Request Count**](#error-request-count) | `error_request_count` | `sum(1 for request in error_requests)` | `requests` | +| [**Minimum Request Timestamp**](#minimum-request-timestamp) | `min_request_timestamp` | `min(timestamp_ns for record in records)` | `datetime` | +| [**Maximum Response Timestamp**](#maximum-response-timestamp) | `max_response_timestamp` | `max(timestamp_ns + request_latency for record in records)` | `datetime` | --- ## Derived Metrics -| Metric | Tag | Explanation | Formula | -|--------|-----|-------------|---------| -| **Request Throughput** | `request_throughput` | The overall rate of completed requests per second across the entire benchmark. | `request_count / benchmark_duration_seconds` | -| **Output Token Throughput** | `output_token_throughput` | The aggregate token generation rate across all concurrent requests, measured as total tokens per second. | `total_osl / benchmark_duration_seconds` | -| **Benchmark Duration** | `benchmark_duration` | The total elapsed time from the first request sent to the last response received. | `max_response_timestamp - min_request_timestamp` | -| **Total Output Tokens** | `total_output_tokens` | The sum of all output tokens (excluding reasoning tokens) generated across all requests. | `sum(output_token_count for record in records)` | -| **Total Reasoning Tokens** | `total_reasoning_tokens` | The sum of all reasoning tokens generated across all requests. | `sum(reasoning_token_count for record in records)` | -| **Total Output Sequence Length** | `total_osl` | The sum of all completion tokens (output + reasoning) generated across all requests. | `sum(output_sequence_length for record in records)` | -| **Total Input Sequence Length** | `total_isl` | The sum of all input/prompt tokens processed across all requests. 
| `sum(input_sequence_length for record in records)` | +| Metric | Tag | Formula | Unit | +|--------|-----|---------|------| +| [**Request Throughput**](#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` | +| [**Output Token Throughput**](#output-token-throughput) | `output_token_throughput` | `total_osl / benchmark_duration_seconds` | `tokens/sec` | +| [**Benchmark Duration**](#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` | +| [**Total Output Tokens**](#total-output-tokens) | `total_output_tokens` | `sum(output_token_count for record in records)` | `tokens` | +| [**Total Reasoning Tokens**](#total-reasoning-tokens) | `total_reasoning_tokens` | `sum(reasoning_token_count for record in records)` | `tokens` | +| [**Total Output Sequence Length**](#total-output-sequence-length) | `total_osl` | `sum(output_sequence_length for record in records)` | `tokens` | +| [**Total Input Sequence Length**](#total-input-sequence-length) | `total_isl` | `sum(input_sequence_length for record in records)` | `tokens` | + +--- + +## Detailed Metric Descriptions + +### Request Latency + +**Type:** Record Metric + +**Description:** Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete wall-clock time experienced by the client for a single request. + +**Requirements:** +- Available for all request types (streaming and non-streaming) +- No special requirements + +**Notes:** Request latency includes all components: network time, queuing, prompt processing, token generation, and response transmission. For streaming requests, it measures from request start to the final chunk received. + +--- + +### Time to First Token (TTFT) + +**Type:** Record Metric + +**Description:** Measures how long it takes to receive the first token after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. + +**Requirements:** +- Streaming responses with Server-Sent Events (SSE) +- At least 1 response chunk + +**Notes:** TTFT includes network latency, queuing time, prompt processing, and generation of the first token. The metric is skipped for non-streaming endpoints. TTFT is a key indicator of interactive performance and perceived latency for end users. + +--- + +### Time to Second Token (TTST) + +**Type:** Record Metric + +**Description:** Measures the time gap between the first and second tokens. This metric helps identify generation startup overhead separate from steady-state streaming throughput. + +**Requirements:** +- Streaming responses with Server-Sent Events (SSE) +- At least 2 response chunks (tokens) + +**Notes:** Records with fewer than 2 tokens will skip this metric. TTST is useful for diagnosing issues in the token generation pipeline that may not be apparent from TTFT alone. A high TTST relative to subsequent inter-token latencies may indicate startup inefficiencies. + +--- + +### Inter Token Latency (ITL) + +**Type:** Record Metric + +**Description:** Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. 
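
For intuition, a quick worked example with made-up numbers (the formula details are listed below):

```
request_latency = 2,000 ms, ttft = 200 ms, output_sequence_length = 101 tokens
ITL = (2,000 ms - 200 ms) / (101 - 1) = 18 ms per token
```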
+ +**Requirements:** +- Streaming responses with Server-Sent Events (SSE) +- At least 2 tokens in the output sequence +- Valid `ttft`, `request_latency`, and `output_sequence_length` metrics + +**Formula Details:** ITL computes the average time between tokens by dividing the remaining latency (after TTFT) by the number of token intervals: +``` +ITL = (request_latency - ttft) / (output_sequence_length - 1) +``` + +**Notes:** Records with fewer than 2 tokens will skip this metric. ITL is a critical metric for understanding streaming performance and predicting generation times for different output lengths. + +--- + +### Inter Chunk Latency (ICL) + +**Type:** Record Metric + +**Description:** Captures the time gaps between all consecutive response chunks in a streaming response, providing a distribution of chunk arrival times rather than a single average. + +**Requirements:** +- Streaming responses with Server-Sent Events (SSE) +- At least 2 response chunks + +**Formula Details:** ICL produces an array of latencies: +``` +ICL = [responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))] +``` + +**Notes:** Unlike ITL (which produces a single average), ICL provides the full distribution of inter-chunk times. This is useful for detecting variability, jitter, or issues in streaming delivery. Analyzing ICL distributions can reveal batching behavior, scheduling issues, or network variability. + +--- + +### Output Token Count + +**Type:** Record Metric + +**Description:** The number of output tokens generated for a single request, excluding reasoning tokens. This represents the visible output tokens returned to the user across all responses for the request. + +**Requirements:** +- Token-producing endpoints that return actual token content (text) +- Excludes embeddings and other non-generative endpoints + +**Notes:** AIPerf counts tokens from the returned content using a tokenizer. For streaming requests with multiple responses, tokens are counted across all response chunks. For models that support reasoning tokens, this metric counts only the non-reasoning output tokens. + +--- + +### Reasoning Token Count + +**Type:** Record Metric + +**Description:** The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. + +**Requirements:** +- Models/backends that support reasoning output (e.g., OpenAI o1, o1-mini, o1-preview) +- The backend must separate reasoning content into a `reasoning_content` field, distinct from the regular `content` field in the response(s) + +**Notes:** AIPerf counts tokens from the `reasoning_content` field using a tokenizer, just like other token metrics. The metric does NOT differentiate `` tags or extract reasoning from within the regular `content` field. The backend must provide reasoning as a separate field in the response. Standard models/backends that don't expose reasoning content separately will skip this metric. + +--- + +### Output Sequence Length (OSL) + +**Type:** Record Metric + +**Description:** The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. 
+ +**Requirements:** +- Token-producing endpoints that return text content +- Excludes embeddings and other non-generative endpoints + +**Formula Details:** +``` +OSL = (output_token_count or 0) + (reasoning_token_count or 0) +``` + +**Notes:** AIPerf counts tokens from the generated text content across all responses. If no token content is available (e.g., embeddings endpoints), this metric is skipped. OSL represents the total completion tokens generated, sometimes called "completion token count" in other tools. For models without reasoning tokens, OSL equals the output token count. + +--- + +### Input Sequence Length (ISL) + +**Type:** Record Metric + +**Description:** The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. + +**Requirements:** +- Token-producing endpoints (chat, completion, etc.) +- AIPerf tokenizes the input prompt to compute the count + +**Notes:** ISL represents the number of tokens in the input prompt sent to the model. AIPerf computes this by tokenizing the input using the appropriate tokenizer for the model. This metric is useful for understanding the relationship between input size and latency/throughput. + +--- + +### Output Token Throughput Per User + +**Type:** Record Metric + +**Description:** The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. + +**Requirements:** +- Streaming responses with Server-Sent Events (SSE) +- Valid `inter_token_latency` metric + +**Formula Details:** +``` +Output Token Throughput Per User = 1.0 / inter_token_latency_seconds +``` + +**Notes:** This metric computes the inverse of ITL to show tokens per second from an individual user's perspective. It differs from Output Token Throughput (aggregate across all concurrent requests) by focusing on single-request experience. This is useful for understanding the user experience independent of concurrency effects. + +--- + +### Request Count + +**Type:** Aggregate Metric + +**Description:** The total number of successfully completed requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. + +**Requirements:** +- No special requirements + +**Notes:** This is a fundamental metric for calculating throughput and success rates. Requests that encounter errors are tracked separately in Error Request Count. + +--- + +### Error Request Count + +**Type:** Aggregate Metric + +**Description:** The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. + +**Requirements:** +- No special requirements + +**Notes:** Error requests are tracked separately from successful requests. The error rate can be computed as `error_request_count / (request_count + error_request_count)`. + +--- + +### Minimum Request Timestamp + +**Type:** Aggregate Metric + +**Description:** The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. + +**Requirements:** +- No special requirements + +**Notes:** This uses wall-clock timestamps (not performance counters), representing real calendar time. Useful for correlating benchmark results with external system monitoring and logs. 
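
As a simplified illustration (not AIPerf's actual implementation), an aggregate metric like this can be tracked in real time with a single running value rather than by storing every record:

```python
# Running minimum of request timestamps, updated as each record arrives
min_request_timestamp_ns: int | None = None

def observe(timestamp_ns: int) -> None:
    global min_request_timestamp_ns
    if min_request_timestamp_ns is None or timestamp_ns < min_request_timestamp_ns:
        min_request_timestamp_ns = timestamp_ns

for ts in (1_700_000_000_500_000_000, 1_700_000_000_250_000_000):
    observe(ts)

print(min_request_timestamp_ns)  # 1700000000250000000
```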
+ +--- + +### Maximum Response Timestamp + +**Type:** Aggregate Metric + +**Description:** The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. + +**Requirements:** +- Valid `request_latency` for at least one request + +**Formula Details:** +``` +Maximum Response Timestamp = max(timestamp_ns + request_latency for record in records) +``` + +**Notes:** This uses wall-clock timestamps (not performance counters), representing real calendar time. Combined with Minimum Request Timestamp, this defines the total benchmark duration. + +--- + +### Request Throughput + +**Type:** Derived Metric + +**Description:** The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. + +**Requirements:** +- Valid `request_count` metric +- Valid `benchmark_duration` metric + +**Formula Details:** +``` +Request Throughput = request_count / benchmark_duration_seconds +``` + +**Notes:** This metric captures the aggregate request processing rate. Higher values indicate better system throughput. Request throughput is affected by concurrency level, request complexity, and system capacity. + +--- + +### Output Token Throughput + +**Type:** Derived Metric + +**Description:** The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. + +**Requirements:** +- Token-producing endpoints that generate text content +- Valid `total_osl` and `benchmark_duration` metrics + +**Formula Details:** +``` +Output Token Throughput = total_osl / benchmark_duration_seconds +``` + +**Notes:** This metric measures aggregate throughput across all concurrent requests and represents the overall system token generation rate. Not applicable to embeddings or other non-generative endpoints. Higher values indicate better system utilization and capacity. + +--- + +### Benchmark Duration + +**Type:** Derived Metric + +**Description:** The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. + +**Requirements:** +- Valid `min_request_timestamp` metric +- Valid `max_response_timestamp` metric + +**Formula Details:** +``` +Benchmark Duration = max_response_timestamp - min_request_timestamp +``` + +**Notes:** Uses wall-clock timestamps representing real calendar time. This is the denominator for throughput calculations and represents the effective measurement window. + +--- + +### Total Output Tokens + +**Type:** Derived Metric + +**Description:** The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total visible output token workload. + +**Requirements:** +- Token-producing endpoints that return text content +- Valid `output_token_count` for processed records + +**Formula Details:** +``` +Total Output Tokens = sum(output_token_count for record in records) +``` + +**Notes:** AIPerf counts tokens from the returned content using a tokenizer. This metric aggregates output tokens across all successful requests and is useful for capacity planning and cost estimation. + +--- + +### Total Reasoning Tokens + +**Type:** Derived Metric + +**Description:** The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. 
+ +**Requirements:** +- Models/backends that support reasoning output +- Backend must provide `reasoning_content` as a separate field +- Valid `reasoning_token_count` for processed records + +**Formula Details:** +``` +Total Reasoning Tokens = sum(reasoning_token_count for record in records) +``` + +**Notes:** AIPerf counts tokens from the `reasoning_content` field. This metric is only available for models like OpenAI o1 that expose reasoning tokens separately. Useful for understanding the reasoning overhead and cost for reasoning-enabled models. + +--- + +### Total Output Sequence Length + +**Type:** Derived Metric + +**Description:** The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. + +**Requirements:** +- Token-producing endpoints that return text content +- Valid `output_sequence_length` for processed records + +**Formula Details:** +``` +Total Output Sequence Length = sum(output_sequence_length for record in records) +``` + +**Notes:** This aggregates the complete token generation workload including both output and reasoning tokens. For models without reasoning tokens, this equals Total Output Tokens. This is the numerator for Output Token Throughput calculations. + +--- + +### Total Input Sequence Length + +**Type:** Derived Metric + +**Description:** The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. + +**Requirements:** +- Token-producing endpoints +- Valid `input_sequence_length` for processed records + +**Formula Details:** +``` +Total Input Sequence Length = sum(input_sequence_length for record in records) +``` + +**Notes:** AIPerf tokenizes input prompts to compute token counts. This metric is useful for understanding the input workload, capacity planning, and analyzing the relationship between input size and system performance. 
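
To tie the token totals above together, here is a small illustrative sketch (the record values are hypothetical) showing how per-record counts roll up into the derived totals:

```python
# Hypothetical per-record token counts
records = [
    {"output": 120, "reasoning": 0, "input": 300},
    {"output": 80, "reasoning": 40, "input": 250},
]

total_output_tokens = sum(r["output"] for r in records)        # 200
total_reasoning_tokens = sum(r["reasoning"] for r in records)  # 40
total_osl = sum(r["output"] + r["reasoning"] for r in records) # 240
total_isl = sum(r["input"] for r in records)                   # 550

# OSL is output + reasoning, so the totals stay consistent
assert total_osl == total_output_tokens + total_reasoning_tokens
```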
--- @@ -91,16 +436,16 @@ Computed **after the benchmark completes** by applying mathematical formulas to | Requirement | Description | Example Metrics | |-------------|-------------|-----------------| -| **Token-producing models** | Models that return `usage` information with input/output token counts | `output_token_count`, `input_sequence_length`, `output_token_throughput` | -| **Streaming responses** | Endpoints that support Server-Sent Events (SSE) | `ttft`, `inter_token_latency`, `inter_chunk_latency` | -| **Reasoning token support** | Models that expose reasoning/thinking token counts separately | `reasoning_token_count`, `total_reasoning_tokens` | +| **Token-producing endpoints** | Endpoints that return token content (text/tokens) in responses that can be counted; excludes embeddings and other non-generative endpoints | `output_token_count`, `input_sequence_length`, `output_token_throughput` | +| **Streaming responses** | Endpoints that support Server-Sent Events (SSE) returning multiple response chunks | `ttft`, `inter_token_latency`, `inter_chunk_latency` | +| **Reasoning token support** | Models/backends that expose reasoning content in a separate `reasoning_content` field in responses (not embedded in `` tags) | `reasoning_token_count`, `total_reasoning_tokens` | ### Metric Flags Reference | Flag | Description | Impact | |------|-------------|--------| | `LARGER_IS_BETTER` | Higher values indicate better performance | Used for throughput and count metrics to indicate optimization direction | -| `PRODUCES_TOKENS_ONLY` | Only computed for token-producing models | Metric skipped if model doesn't provide token count information | +| `PRODUCES_TOKENS_ONLY` | Only computed for token-producing endpoints | Metric skipped for non-generative endpoints like embeddings | | `STREAMING_TOKENS_ONLY` | Only computed for streaming responses with tokens | Requires both streaming support and token information | | `STREAMING_ONLY` | Only computed for streaming responses | Requires Server-Sent Events (SSE) support | | `SUPPORTS_REASONING` | Requires reasoning token support | Only available for models like OpenAI o1 that expose reasoning tokens | From e85b7eb43ccb772abb1081202f214cf63a76733a Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Wed, 1 Oct 2025 23:01:08 -0700 Subject: [PATCH 03/11] tweaks to metrics docs --- README.md | 6 +- docs/metrics_reference.md | 441 +++++++++++++++++--------------------- 2 files changed, 199 insertions(+), 248 deletions(-) diff --git a/README.md b/README.md index 2ef0e5f8d..b9e897556 100644 --- a/README.md +++ b/README.md @@ -166,7 +166,7 @@ AIPerf provides comprehensive metrics organized into three categories. For detai ### Record Metrics -Computed individually for each request and its response(s). Record metrics produce statistical distributions (min, max, mean, p50, p90, p99, etc.). +Computed **individually** for **each request** and its **response(s)**. Record metrics produce **statistical distributions** (min, max, mean, p50, p90, p99, etc.). | Metric | Tag | Formula | Unit | |--------|-----|---------|------| @@ -183,7 +183,7 @@ Computed individually for each request and its response(s). Record metrics produ ### Aggregate Metrics -Computed by tracking values across all requests in real-time. Aggregate metrics produce single scalar values. +Computed by **tracking** values across **all requests** in **real-time**. Aggregate metrics produce **single scalar values**. 
| Metric | Tag | Formula | Unit | |--------|-----|---------|------| @@ -194,7 +194,7 @@ Computed by tracking values across all requests in real-time. Aggregate metrics ### Derived Metrics -Computed using formulas based on other metrics, but **not** computed per-record. These are calculated either after the benchmark completes for final results or in real-time across all current data for live metrics display. +Computed using **formulas** based on other metrics, but **not** computed per-record. These are calculated either **after the benchmark completes** for final results or in **real-time** across **all current data** for live metrics display. | Metric | Tag | Formula | Unit | |--------|-----|---------|------| diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md index 953e248d2..b1c0ceb93 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -10,445 +10,396 @@ This document provides a comprehensive reference of all metrics available in AIP AIPerf computes metrics in three distinct phases during benchmark execution: -**Record Metrics** are computed individually for each request and its response(s) during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture per-request characteristics such as latency, token counts, and streaming behavior. Record metrics produce statistical distributions (min, max, mean, median, p90, p99) that reveal performance variability across requests. +### Record Metrics -**Aggregate Metrics** are computed by tracking or accumulating values across all requests in real-time during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce single scalar values representing the entire benchmark run. +Record Metrics are computed **individually** for **each request** and its **response(s)** during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture **per-request characteristics** such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99) that reveal performance variability across requests. -**Derived Metrics** are computed by applying mathematical formulas to other metric results, but are **not** computed per-record like Record Metrics. Instead, these metrics depend on one or more prerequisite metrics being available first and are calculated either after the benchmark completes for final results or in real-time across all current data for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies. +#### Example: ---- +`request_latency` measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests. 
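
As a minimal sketch of the idea (illustrative only; the latency values are made up), the per-request values are summarized with order statistics rather than a single number:

```python
import statistics

# One request_latency value (ms) per completed request -- a record metric
request_latencies_ms = [102.4, 98.7, 110.3, 95.1, 240.9, 101.8, 99.5, 97.2, 103.6, 100.4]

cuts = statistics.quantiles(request_latencies_ms, n=100)  # 99 percentile cut points
print(f"mean={statistics.mean(request_latencies_ms):.1f} ms, "
      f"p50={statistics.median(request_latencies_ms):.1f} ms, "
      f"p90={cuts[89]:.1f} ms, p99={cuts[98]:.1f} ms")
```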
-## Record Metrics - -| Metric | Tag | Formula | Unit | -|--------|-----|---------|------| -| [**Request Latency**](#request-latency) | `request_latency` | `responses[-1].perf_ns - start_perf_ns` | `ms` | -| [**Time to First Token (TTFT)**](#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` | -| [**Time to Second Token (TTST)**](#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` | -| [**Inter Token Latency (ITL)**](#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` | -| [**Inter Chunk Latency (ICL)**](#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` | -| [**Output Token Count**](#output-token-count) | `output_token_count` | `output_token_count` | `tokens` | -| [**Reasoning Token Count**](#reasoning-token-count) | `reasoning_token_count` | `reasoning_token_count` | `tokens` | -| [**Output Sequence Length (OSL)**](#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` | -| [**Input Sequence Length (ISL)**](#input-sequence-length-isl) | `input_sequence_length` | `input_token_count` | `tokens` | -| [**Output Token Throughput Per User**](#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` | +### Aggregate Metrics ---- +Aggregate Metrics are computed by **tracking** or **accumulating** values across **all requests** in **real-time** during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a **single value** representing the entire benchmark run. + + +#### Example: -## Aggregate Metrics +`request_count` increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution). -| Metric | Tag | Formula | Unit | -|--------|-----|---------|------| -| [**Request Count**](#request-count) | `request_count` | `sum(1 for request in valid_requests)` | `requests` | -| [**Error Request Count**](#error-request-count) | `error_request_count` | `sum(1 for request in error_requests)` | `requests` | -| [**Minimum Request Timestamp**](#minimum-request-timestamp) | `min_request_timestamp` | `min(timestamp_ns for record in records)` | `datetime` | -| [**Maximum Response Timestamp**](#maximum-response-timestamp) | `max_response_timestamp` | `max(timestamp_ns + request_latency for record in records)` | `datetime` | +### Derived Metrics + +Derived Metrics are computed by applying **mathematical formulas** to other metric results, but are **not** computed per-record like Record Metrics. Instead, these metrics depend on one or more **prerequisite metrics** being available first and are calculated either **after the benchmark completes** for final results or in **real-time** across **all current data** for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies. + +#### Example: + +`request_throughput` is computed from `request_count / benchmark_duration_seconds`. This requires both `request_count` and `benchmark_duration` to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec). 
--- -## Derived Metrics +## Quick Reference + +For a quick reference of all metrics with their tags, formulas, and units, see the **[Metrics Reference section in the README](../README.md#metrics-reference)**. -| Metric | Tag | Formula | Unit | -|--------|-----|---------|------| -| [**Request Throughput**](#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` | -| [**Output Token Throughput**](#output-token-throughput) | `output_token_throughput` | `total_osl / benchmark_duration_seconds` | `tokens/sec` | -| [**Benchmark Duration**](#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` | -| [**Total Output Tokens**](#total-output-tokens) | `total_output_tokens` | `sum(output_token_count for record in records)` | `tokens` | -| [**Total Reasoning Tokens**](#total-reasoning-tokens) | `total_reasoning_tokens` | `sum(reasoning_token_count for record in records)` | `tokens` | -| [**Total Output Sequence Length**](#total-output-sequence-length) | `total_osl` | `sum(output_sequence_length for record in records)` | `tokens` | -| [**Total Input Sequence Length**](#total-input-sequence-length) | `total_isl` | `sum(input_sequence_length for record in records)` | `tokens` | +The sections below provide detailed descriptions, requirements, and notes for each metric. --- ## Detailed Metric Descriptions -### Request Latency +### Latency & Timing Metrics -**Type:** Record Metric +These metrics measure time and latency characteristics of requests and responses. + +#### Request Latency -**Description:** Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete wall-clock time experienced by the client for a single request. +**Tag:** `request_latency` +**Type:** Record Metric -**Requirements:** -- Available for all request types (streaming and non-streaming) -- No special requirements +**Description:** Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. -**Notes:** Request latency includes all components: network time, queuing, prompt processing, token generation, and response transmission. For streaming requests, it measures from request start to the final chunk received. +**Notes:** +- Available for all request types (streaming and non-streaming); no special requirements. +- Includes all components: network time, queuing, prompt processing, token generation, and response transmission. +- For streaming requests, measures from request start to the final chunk received. --- -### Time to First Token (TTFT) +#### Time to First Token (TTFT) +**Tag:** `ttft` **Type:** Record Metric -**Description:** Measures how long it takes to receive the first token after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. - -**Requirements:** -- Streaming responses with Server-Sent Events (SSE) -- At least 1 response chunk +**Description:** Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. 
This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. -**Notes:** TTFT includes network latency, queuing time, prompt processing, and generation of the first token. The metric is skipped for non-streaming endpoints. TTFT is a key indicator of interactive performance and perceived latency for end users. +**Notes:** +- Requires `--streaming` flag, with a token-producing endpoint, and at least 1 response chunk. +- Includes network latency, queuing time, prompt processing, and generation of the first token (or chunk of tokens). --- -### Time to Second Token (TTST) +#### Time to Second Token (TTST) +**Tag:** `ttst` **Type:** Record Metric -**Description:** Measures the time gap between the first and second tokens. This metric helps identify generation startup overhead separate from steady-state streaming throughput. +**Description:** Measures the time gap between the first and second chunk of tokens (SSE messages). This metric helps identify generation startup overhead separate from steady-state streaming throughput. -**Requirements:** -- Streaming responses with Server-Sent Events (SSE) -- At least 2 response chunks (tokens) - -**Notes:** Records with fewer than 2 tokens will skip this metric. TTST is useful for diagnosing issues in the token generation pipeline that may not be apparent from TTFT alone. A high TTST relative to subsequent inter-token latencies may indicate startup inefficiencies. +**Notes:** +- Requires `--streaming` flag, with a token-producing endpoint, and at least 2 response chunks (tokens). --- -### Inter Token Latency (ITL) +#### Inter Token Latency (ITL) +**Tag:** `inter_token_latency` **Type:** Record Metric **Description:** Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. -**Requirements:** -- Streaming responses with Server-Sent Events (SSE) -- At least 2 tokens in the output sequence -- Valid `ttft`, `request_latency`, and `output_sequence_length` metrics - -**Formula Details:** ITL computes the average time between tokens by dividing the remaining latency (after TTFT) by the number of token intervals: +**Formula Details:** ``` ITL = (request_latency - ttft) / (output_sequence_length - 1) ``` -**Notes:** Records with fewer than 2 tokens will skip this metric. ITL is a critical metric for understanding streaming performance and predicting generation times for different output lengths. +**Notes:** +- Requires `--streaming` flag, with a token-producing endpoint, at least 2 output tokens, and valid `ttft`, `request_latency`, and `output_sequence_length` metrics. --- -### Inter Chunk Latency (ICL) +#### Inter Chunk Latency (ICL) +**Tag:** `inter_chunk_latency` **Type:** Record Metric -**Description:** Captures the time gaps between all consecutive response chunks in a streaming response, providing a distribution of chunk arrival times rather than a single average. +**Description:** Captures the time gaps between all consecutive response chunks (SSE messages) in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size. 
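
To make the distinction concrete, a small sketch with made-up chunk timings (each SSE chunk may carry more than one token); the formula details follow below:

```python
# Arrival times (ms, relative to request start) of 4 SSE chunks for one request,
# carrying 1, 3, 2, and 4 tokens respectively
chunk_arrivals_ms = [210.0, 280.0, 345.0, 430.0]
tokens_per_chunk = [1, 3, 2, 4]

# ICL: one gap per consecutive pair of chunks -> a distribution of 3 values
icl_ms = [b - a for a, b in zip(chunk_arrivals_ms, chunk_arrivals_ms[1:])]
print(icl_ms)  # [70.0, 65.0, 85.0]

# ITL: a single per-request average over token intervals, ignoring chunk boundaries
ttft_ms, request_latency_ms = chunk_arrivals_ms[0], chunk_arrivals_ms[-1]
output_sequence_length = sum(tokens_per_chunk)  # 10 tokens
itl_ms = (request_latency_ms - ttft_ms) / (output_sequence_length - 1)
print(round(itl_ms, 1))  # 24.4
```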
-**Requirements:** -- Streaming responses with Server-Sent Events (SSE) -- At least 2 response chunks - -**Formula Details:** ICL produces an array of latencies: +**Formula Details:** ``` ICL = [responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))] ``` -**Notes:** Unlike ITL (which produces a single average), ICL provides the full distribution of inter-chunk times. This is useful for detecting variability, jitter, or issues in streaming delivery. Analyzing ICL distributions can reveal batching behavior, scheduling issues, or network variability. +**Notes:** +- Requires `--streaming` flag, with a token-producing endpoint, and at least 2 response chunks. +- Unlike ITL (which produces a single average), ICL provides the full distribution of inter-chunk times. +- Useful for detecting variability, jitter, or issues in streaming delivery. +- Analyzing ICL distributions can reveal batching behavior, scheduling issues, or network variability. --- -### Output Token Count +### Token Count Metrics -**Type:** Record Metric +These metrics track token counts for individual requests and aggregated across all requests. + +#### Output Token Count -**Description:** The number of output tokens generated for a single request, excluding reasoning tokens. This represents the visible output tokens returned to the user across all responses for the request. +**Tag:** `output_token_count` +**Type:** Record Metric -**Requirements:** -- Token-producing endpoints that return actual token content (text) -- Excludes embeddings and other non-generative endpoints +**Description:** The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the visible output tokens returned to the user across all responses for the request. -**Notes:** AIPerf counts tokens from the returned content using a tokenizer. For streaming requests with multiple responses, tokens are counted across all response chunks. For models that support reasoning tokens, this metric counts only the non-reasoning output tokens. +**Notes:** +- Requires token-producing endpoints that return actual token content (text); excludes embeddings and other non-generative endpoints. +- AIPerf counts tokens from the returned content using a tokenizer. +- For streaming requests with multiple responses, the responses are joined together and then tokens are counted. +- For models and endpoints that support reasoning tokens, this metric counts only the non-reasoning output tokens. +- This **will** count tokens inside of the `` tags, if they are present in the `content` field of the response. --- -### Reasoning Token Count +#### Reasoning Token Count +**Tag:** `reasoning_token_count` **Type:** Record Metric **Description:** The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. -**Requirements:** -- Models/backends that support reasoning output (e.g., OpenAI o1, o1-mini, o1-preview) -- The backend must separate reasoning content into a `reasoning_content` field, distinct from the regular `content` field in the response(s) - -**Notes:** AIPerf counts tokens from the `reasoning_content` field using a tokenizer, just like other token metrics. The metric does NOT differentiate `` tags or extract reasoning from within the regular `content` field. The backend must provide reasoning as a separate field in the response. 
Standard models/backends that don't expose reasoning content separately will skip this metric. +**Notes:** +- Requires models/backends that support reasoning output with reasoning content separated into a `reasoning_content` field, distinct from the regular `content` field in the response(s). +- AIPerf counts tokens from the `reasoning_content` field using a tokenizer, just like other token metrics. +- Does NOT differentiate `` tags or extract reasoning from within the regular `content` field. +- The backend must provide reasoning as a separate field in the response. +- Standard models/backends that don't expose reasoning content separately will skip this metric. --- -### Output Sequence Length (OSL) +#### Output Sequence Length (OSL) +**Tag:** `output_sequence_length` **Type:** Record Metric **Description:** The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. -**Requirements:** -- Token-producing endpoints that return text content -- Excludes embeddings and other non-generative endpoints - **Formula Details:** ``` OSL = (output_token_count or 0) + (reasoning_token_count or 0) ``` -**Notes:** AIPerf counts tokens from the generated text content across all responses. If no token content is available (e.g., embeddings endpoints), this metric is skipped. OSL represents the total completion tokens generated, sometimes called "completion token count" in other tools. For models without reasoning tokens, OSL equals the output token count. +**Notes:** +- Requires token-producing endpoints that return text content; excludes embeddings and other non-generative endpoints. +- AIPerf counts tokens from the generated text content across all responses. +- For models and endpoints that do not support/separate reasoning tokens, OSL equals the output token count. --- -### Input Sequence Length (ISL) +#### Input Sequence Length (ISL) +**Tag:** `input_sequence_length` **Type:** Record Metric **Description:** The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. -**Requirements:** -- Token-producing endpoints (chat, completion, etc.) -- AIPerf tokenizes the input prompt to compute the count - -**Notes:** ISL represents the number of tokens in the input prompt sent to the model. AIPerf computes this by tokenizing the input using the appropriate tokenizer for the model. This metric is useful for understanding the relationship between input size and latency/throughput. +**Notes:** +- Requires token-producing endpoints (chat, completion, etc.). +- AIPerf tokenizes the input prompt to compute the count using the appropriate tokenizer for the model. +- Useful for understanding the relationship between input size and latency/throughput. --- -### Output Token Throughput Per User +#### Total Output Tokens -**Type:** Record Metric - -**Description:** The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. +**Tag:** `total_output_tokens` +**Type:** Derived Metric -**Requirements:** -- Streaming responses with Server-Sent Events (SSE) -- Valid `inter_token_latency` metric +**Description:** The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total visible output token workload. 
**Formula Details:** ``` -Output Token Throughput Per User = 1.0 / inter_token_latency_seconds +Total Output Tokens = sum(output_token_count for record in records) ``` -**Notes:** This metric computes the inverse of ITL to show tokens per second from an individual user's perspective. It differs from Output Token Throughput (aggregate across all concurrent requests) by focusing on single-request experience. This is useful for understanding the user experience independent of concurrency effects. +**Notes:** +- Requires token-producing endpoints that return text content, with valid `output_token_count` for processed records. +- AIPerf counts tokens from the returned content using a tokenizer. +- Aggregates output tokens across all successful requests. +- Useful for capacity planning and cost estimation. --- -### Request Count +#### Total Reasoning Tokens -**Type:** Aggregate Metric - -**Description:** The total number of successfully completed requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. - -**Requirements:** -- No special requirements - -**Notes:** This is a fundamental metric for calculating throughput and success rates. Requests that encounter errors are tracked separately in Error Request Count. - ---- - -### Error Request Count - -**Type:** Aggregate Metric +**Tag:** `total_reasoning_tokens` +**Type:** Derived Metric -**Description:** The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. +**Description:** The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. -**Requirements:** -- No special requirements +**Formula Details:** +``` +Total Reasoning Tokens = sum(reasoning_token_count for record in records) +``` -**Notes:** Error requests are tracked separately from successful requests. The error rate can be computed as `error_request_count / (request_count + error_request_count)`. +**Notes:** +- Requires models/backends that support reasoning output with `reasoning_content` as a separate field, and valid `reasoning_token_count` for processed records. +- AIPerf counts tokens from the `reasoning_content` field. +- Only available for models like OpenAI o1 that expose reasoning tokens separately. +- Useful for understanding the reasoning overhead and cost for reasoning-enabled models. --- -### Minimum Request Timestamp +#### Total Output Sequence Length -**Type:** Aggregate Metric +**Tag:** `total_osl` +**Type:** Derived Metric -**Description:** The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. +**Description:** The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. -**Requirements:** -- No special requirements +**Formula Details:** +``` +Total Output Sequence Length = sum(output_sequence_length for record in records) +``` -**Notes:** This uses wall-clock timestamps (not performance counters), representing real calendar time. Useful for correlating benchmark results with external system monitoring and logs. +**Notes:** +- Requires token-producing endpoints that return text content, with valid `output_sequence_length` for processed records. +- Aggregates the complete token generation workload including both output and reasoning tokens. 
+- For models without reasoning tokens, this equals Total Output Tokens. +- Numerator for Output Token Throughput calculations. --- -### Maximum Response Timestamp - -**Type:** Aggregate Metric +#### Total Input Sequence Length -**Description:** The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. +**Tag:** `total_isl` +**Type:** Derived Metric -**Requirements:** -- Valid `request_latency` for at least one request +**Description:** The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. **Formula Details:** ``` -Maximum Response Timestamp = max(timestamp_ns + request_latency for record in records) +Total Input Sequence Length = sum(input_sequence_length for record in records) ``` -**Notes:** This uses wall-clock timestamps (not performance counters), representing real calendar time. Combined with Minimum Request Timestamp, this defines the total benchmark duration. +**Notes:** +- Requires token-producing endpoints, with valid `input_sequence_length` for processed records. +- AIPerf tokenizes input prompts to compute token counts. +- Useful for understanding the input workload, capacity planning, and analyzing the relationship between input size and system performance. --- -### Request Throughput +### Throughput Metrics + +These metrics measure the rate of requests and token generation. +#### Request Throughput + +**Tag:** `request_throughput` **Type:** Derived Metric **Description:** The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. -**Requirements:** -- Valid `request_count` metric -- Valid `benchmark_duration` metric - **Formula Details:** ``` Request Throughput = request_count / benchmark_duration_seconds ``` -**Notes:** This metric captures the aggregate request processing rate. Higher values indicate better system throughput. Request throughput is affected by concurrency level, request complexity, and system capacity. +**Notes:** +- Requires valid `request_count` and `benchmark_duration` metrics. +- Captures the aggregate request processing rate; higher values indicate better system throughput. +- Affected by concurrency level, request complexity, and system capacity. --- -### Output Token Throughput +#### Output Token Throughput +**Tag:** `output_token_throughput` **Type:** Derived Metric **Description:** The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. -**Requirements:** -- Token-producing endpoints that generate text content -- Valid `total_osl` and `benchmark_duration` metrics - **Formula Details:** ``` Output Token Throughput = total_osl / benchmark_duration_seconds ``` -**Notes:** This metric measures aggregate throughput across all concurrent requests and represents the overall system token generation rate. Not applicable to embeddings or other non-generative endpoints. Higher values indicate better system utilization and capacity. +**Important:** This metric specifically includes the TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric. 
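+
+As a rough illustration (hypothetical numbers, not AIPerf output) of why the two metrics diverge, consider a single streaming request, where the benchmark duration equals the request latency:
+
+```python
+# Hypothetical single-request numbers.
+ttft_s = 0.5                  # time to first token
+request_latency_s = 2.5       # total request time
+output_sequence_length = 100  # completion tokens generated
+
+# Per-user view: excludes TTFT (inverse of inter-token latency).
+itl_s = (request_latency_s - ttft_s) / (output_sequence_length - 1)
+output_token_throughput_per_user = 1.0 / itl_s  # ~49.5 tokens/s
+
+# Aggregate view: includes TTFT in the measurement window.
+output_token_throughput = output_sequence_length / request_latency_s  # 40.0 tokens/s
+```
+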
---- +**Notes:** +- Requires token-producing endpoints that generate text content, with valid `total_osl` and `benchmark_duration` metrics. +- Measures aggregate throughput across all concurrent requests; represents the overall system token generation rate. +- Not applicable to embeddings or other non-generative endpoints. +- Higher values indicate better system utilization and capacity. -### Benchmark Duration +--- -**Type:** Derived Metric +#### Output Token Throughput Per User -**Description:** The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. +**Tag:** `output_token_throughput_per_user` +**Type:** Record Metric -**Requirements:** -- Valid `min_request_timestamp` metric -- Valid `max_response_timestamp` metric +**Description:** The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. **Formula Details:** ``` -Benchmark Duration = max_response_timestamp - min_request_timestamp +Output Token Throughput Per User = 1.0 / inter_token_latency_seconds ``` -**Notes:** Uses wall-clock timestamps representing real calendar time. This is the denominator for throughput calculations and represents the effective measurement window. +**Important:** This metric specifically excludes the TTFT from the equation, so it is **not** directly comparable to the [Output Token Throughput](#output-token-throughput) metric. ---- +**Notes:** +- Requires `--streaming` flag, with a token-producing endpoint, and valid `inter_token_latency` metric. +- Computes the inverse of ITL to show tokens per second from an individual user's perspective. +- Differs from Output Token Throughput (aggregate across all concurrent requests) by focusing on single-request experience. +- Useful for understanding the user experience independent of concurrency effects. -### Total Output Tokens +--- -**Type:** Derived Metric +### System & Benchmark Metrics -**Description:** The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total visible output token workload. +These metrics track overall benchmark execution and system-level counters. -**Requirements:** -- Token-producing endpoints that return text content -- Valid `output_token_count` for processed records +#### Request Count -**Formula Details:** -``` -Total Output Tokens = sum(output_token_count for record in records) -``` +**Tag:** `request_count` +**Type:** Aggregate Metric -**Notes:** AIPerf counts tokens from the returned content using a tokenizer. This metric aggregates output tokens across all successful requests and is useful for capacity planning and cost estimation. +**Description:** The total number of **successfully completed** requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. --- -### Total Reasoning Tokens +#### Error Request Count -**Type:** Derived Metric - -**Description:** The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. 
- -**Requirements:** -- Models/backends that support reasoning output -- Backend must provide `reasoning_content` as a separate field -- Valid `reasoning_token_count` for processed records +**Tag:** `error_request_count` +**Type:** Aggregate Metric -**Formula Details:** -``` -Total Reasoning Tokens = sum(reasoning_token_count for record in records) -``` +**Description:** The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. -**Notes:** AIPerf counts tokens from the `reasoning_content` field. This metric is only available for models like OpenAI o1 that expose reasoning tokens separately. Useful for understanding the reasoning overhead and cost for reasoning-enabled models. +**Notes:** +- Error rate can be computed as `error_request_count / (request_count + error_request_count)`. --- -### Total Output Sequence Length +#### Minimum Request Timestamp -**Type:** Derived Metric +**Tag:** `min_request_timestamp` +**Type:** Aggregate Metric -**Description:** The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. +**Description:** The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. -**Requirements:** -- Token-producing endpoints that return text content -- Valid `output_sequence_length` for processed records +--- -**Formula Details:** -``` -Total Output Sequence Length = sum(output_sequence_length for record in records) -``` +#### Maximum Response Timestamp -**Notes:** This aggregates the complete token generation workload including both output and reasoning tokens. For models without reasoning tokens, this equals Total Output Tokens. This is the numerator for Output Token Throughput calculations. +**Tag:** `max_response_timestamp` +**Type:** Aggregate Metric + +**Description:** The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. --- -### Total Input Sequence Length +#### Benchmark Duration +**Tag:** `benchmark_duration` **Type:** Derived Metric -**Description:** The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. - -**Requirements:** -- Token-producing endpoints -- Valid `input_sequence_length` for processed records +**Description:** The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. **Formula Details:** ``` -Total Input Sequence Length = sum(input_sequence_length for record in records) +Benchmark Duration = max_response_timestamp - min_request_timestamp ``` -**Notes:** AIPerf tokenizes input prompts to compute token counts. This metric is useful for understanding the input workload, capacity planning, and analyzing the relationship between input size and system performance. +**Notes:** +- Requires valid `min_request_timestamp` and `max_response_timestamp` metrics. +- Uses wall-clock timestamps representing real calendar time. +- Denominator for throughput calculations; represents the effective measurement window. 
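+
+The sketch below (illustrative only, with hypothetical values) shows how the derived duration and throughput metrics fall out of the aggregate timestamps and totals described above:
+
+```python
+# Hypothetical aggregate results from a 60-second benchmark run.
+NANOS_PER_SECOND = 1_000_000_000
+
+min_request_timestamp_ns = 1_700_000_000 * NANOS_PER_SECOND
+max_response_timestamp_ns = min_request_timestamp_ns + 60 * NANOS_PER_SECOND
+request_count = 600
+total_osl = 120_000  # total completion tokens across all requests
+
+benchmark_duration_s = (max_response_timestamp_ns - min_request_timestamp_ns) / NANOS_PER_SECOND
+request_throughput = request_count / benchmark_duration_s   # 10.0 requests/s
+output_token_throughput = total_osl / benchmark_duration_s  # 2000.0 tokens/s
+```
+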
--- - -## Reference Tables - -### Metric Summary - -| Type | Computation | Output | -|------|-------------|--------| -| **Record** | Per-request during benchmark | Statistical distributions (min, max, mean, p50, p90, p99) | -| **Aggregate** | Real-time accumulation across all requests | Single scalar values | -| **Derived** | Post-benchmark from other metrics | Single values or distributions | - -### Time Units - -| Aspect | Details | -|--------|---------| -| **Internal Storage** | Nanoseconds (`perf_ns`) for maximum precision | -| **Display Format** | Milliseconds (ms) or Seconds (s) for readability | -| **Conversion** | Automatic based on metric `display_unit` setting | - -### Model Requirements - -| Requirement | Description | Example Metrics | -|-------------|-------------|-----------------| -| **Token-producing endpoints** | Endpoints that return token content (text/tokens) in responses that can be counted; excludes embeddings and other non-generative endpoints | `output_token_count`, `input_sequence_length`, `output_token_throughput` | -| **Streaming responses** | Endpoints that support Server-Sent Events (SSE) returning multiple response chunks | `ttft`, `inter_token_latency`, `inter_chunk_latency` | -| **Reasoning token support** | Models/backends that expose reasoning content in a separate `reasoning_content` field in responses (not embedded in `` tags) | `reasoning_token_count`, `total_reasoning_tokens` | - -### Metric Flags Reference - -| Flag | Description | Impact | -|------|-------------|--------| -| `LARGER_IS_BETTER` | Higher values indicate better performance | Used for throughput and count metrics to indicate optimization direction | -| `PRODUCES_TOKENS_ONLY` | Only computed for token-producing endpoints | Metric skipped for non-generative endpoints like embeddings | -| `STREAMING_TOKENS_ONLY` | Only computed for streaming responses with tokens | Requires both streaming support and token information | -| `STREAMING_ONLY` | Only computed for streaming responses | Requires Server-Sent Events (SSE) support | -| `SUPPORTS_REASONING` | Requires reasoning token support | Only available for models like OpenAI o1 that expose reasoning tokens | -| `NO_CONSOLE` | Not displayed in console output | Metric available in JSON/CSV exports but hidden from terminal display | -| `ERROR_ONLY` | Only computed for error requests | Tracks error-specific information | - From 8ad95c7f55a4518f30abc893251aa4d14e2078a8 Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Thu, 2 Oct 2025 09:36:31 -0700 Subject: [PATCH 04/11] Update metrics_reference.md Signed-off-by: Anthony Casagrande --- docs/metrics_reference.md | 137 ++++++++++++++++++++++++++------------ 1 file changed, 95 insertions(+), 42 deletions(-) diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md index b1c0ceb93..5df20dc22 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -14,6 +14,10 @@ AIPerf computes metrics in three distinct phases during benchmark execution: Record Metrics are computed **individually** for **each request** and its **response(s)** during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture **per-request characteristics** such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99) that reveal performance variability across requests. 
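+
+As an illustration (a sketch, not AIPerf's implementation), per-request record values can be summarized into such a distribution like this:
+
+```python
+import statistics
+
+# Hypothetical per-request latencies in milliseconds, one value per request.
+request_latency_ms = [212.0, 198.5, 405.2, 230.1, 199.9, 1021.7, 240.3]
+
+percentiles = statistics.quantiles(request_latency_ms, n=100)
+summary = {
+    "min": min(request_latency_ms),
+    "max": max(request_latency_ms),
+    "mean": statistics.mean(request_latency_ms),
+    "p50": percentiles[49],
+    "p90": percentiles[89],
+    "p99": percentiles[98],
+}
+```
+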
+**Examples:** `request_latency`, `ttft`, `inter_token_latency`, `output_token_count`, `input_sequence_length` + +**Dependencies:** Record Metrics can depend on raw request/response data and other Record Metrics from the same request. + #### Example: `request_latency` measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests. @@ -22,6 +26,9 @@ Record Metrics are computed **individually** for **each request** and its **resp Aggregate Metrics are computed by **tracking** or **accumulating** values across **all requests** in **real-time** during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a **single value** representing the entire benchmark run. +**Examples:** `request_count`, `error_request_count`, `min_request_timestamp`, `max_response_timestamp` + +**Dependencies:** Aggregate Metrics can depend on Record Metrics and other Aggregate Metrics. #### Example: @@ -31,6 +38,10 @@ Aggregate Metrics are computed by **tracking** or **accumulating** values across Derived Metrics are computed by applying **mathematical formulas** to other metric results, but are **not** computed per-record like Record Metrics. Instead, these metrics depend on one or more **prerequisite metrics** being available first and are calculated either **after the benchmark completes** for final results or in **real-time** across **all current data** for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies. +**Examples:** `request_throughput`, `output_token_throughput`, `benchmark_duration` + +**Dependencies:** Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics. + #### Example: `request_throughput` is computed from `request_count / benchmark_duration_seconds`. This requires both `request_count` and `benchmark_duration` to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec). @@ -53,8 +64,9 @@ These metrics measure time and latency characteristics of requests and responses #### Request Latency -**Tag:** `request_latency` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `request_latency` | Record Metric | [`NONE`](#flag-none) | **Description:** Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. @@ -67,8 +79,9 @@ These metrics measure time and latency characteristics of requests and responses #### Time to First Token (TTFT) -**Tag:** `ttft` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `ttft` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only) | **Description:** Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. 
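+
+As a rough sketch of what this measurement looks like from the client side (illustrative; `stream_chunks` is a hypothetical iterator of SSE chunks, not an AIPerf API):
+
+```python
+import time
+
+def measure_ttft_ns(stream_chunks):
+    """Return the time to the first received chunk in nanoseconds, or None if none arrive."""
+    start_perf_ns = time.perf_counter_ns()
+    for _chunk in stream_chunks:
+        # The first received chunk ends the TTFT measurement.
+        return time.perf_counter_ns() - start_perf_ns
+    return None
+```
+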
@@ -80,8 +93,9 @@ These metrics measure time and latency characteristics of requests and responses #### Time to Second Token (TTST) -**Tag:** `ttst` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `ttst` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only) | **Description:** Measures the time gap between the first and second chunk of tokens (SSE messages). This metric helps identify generation startup overhead separate from steady-state streaming throughput. @@ -92,8 +106,9 @@ These metrics measure time and latency characteristics of requests and responses #### Inter Token Latency (ITL) -**Tag:** `inter_token_latency` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `inter_token_latency` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. @@ -109,8 +124,9 @@ ITL = (request_latency - ttft) / (output_sequence_length - 1) #### Inter Chunk Latency (ICL) -**Tag:** `inter_chunk_latency` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `inter_chunk_latency` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`EXPERIMENTAL`](#flag-experimental) | **Description:** Captures the time gaps between all consecutive response chunks (SSE messages) in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size. @@ -133,8 +149,9 @@ These metrics track token counts for individual requests and aggregated across a #### Output Token Count -**Tag:** `output_token_count` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `output_token_count` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the visible output tokens returned to the user across all responses for the request. @@ -149,8 +166,9 @@ These metrics track token counts for individual requests and aggregated across a #### Reasoning Token Count -**Tag:** `reasoning_token_count` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `reasoning_token_count` | Record Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. @@ -165,8 +183,9 @@ These metrics track token counts for individual requests and aggregated across a #### Output Sequence Length (OSL) -**Tag:** `output_sequence_length` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `output_sequence_length` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. 
@@ -184,8 +203,9 @@ OSL = (output_token_count or 0) + (reasoning_token_count or 0) #### Input Sequence Length (ISL) -**Tag:** `input_sequence_length` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `input_sequence_length` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. @@ -198,8 +218,9 @@ OSL = (output_token_count or 0) + (reasoning_token_count or 0) #### Total Output Tokens -**Tag:** `total_output_tokens` -**Type:** Derived Metric +| Tag | Type | Flags | +|-----|------|-------| +| `total_output_tokens` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total visible output token workload. @@ -218,8 +239,9 @@ Total Output Tokens = sum(output_token_count for record in records) #### Total Reasoning Tokens -**Tag:** `total_reasoning_tokens` -**Type:** Derived Metric +| Tag | Type | Flags | +|-----|------|-------| +| `total_reasoning_tokens` | Derived Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. @@ -238,8 +260,9 @@ Total Reasoning Tokens = sum(reasoning_token_count for record in records) #### Total Output Sequence Length -**Tag:** `total_osl` -**Type:** Derived Metric +| Tag | Type | Flags | +|-----|------|-------| +| `total_osl` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. @@ -258,8 +281,9 @@ Total Output Sequence Length = sum(output_sequence_length for record in records) #### Total Input Sequence Length -**Tag:** `total_isl` -**Type:** Derived Metric +| Tag | Type | Flags | +|-----|------|-------| +| `total_isl` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. @@ -281,8 +305,9 @@ These metrics measure the rate of requests and token generation. #### Request Throughput -**Tag:** `request_throughput` -**Type:** Derived Metric +| Tag | Type | Flags | +|-----|------|-------| +| `request_throughput` | Derived Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. @@ -300,8 +325,9 @@ Request Throughput = request_count / benchmark_duration_seconds #### Output Token Throughput -**Tag:** `output_token_throughput` -**Type:** Derived Metric +| Tag | Type | Flags | +|-----|------|-------| +| `output_token_throughput` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The aggregate token generation rate across all concurrent requests, measured as total tokens per second. 
This represents the system's overall token generation capacity. @@ -322,8 +348,9 @@ Output Token Throughput = total_osl / benchmark_duration_seconds #### Output Token Throughput Per User -**Tag:** `output_token_throughput_per_user` -**Type:** Record Metric +| Tag | Type | Flags | +|-----|------|-------| +| `output_token_throughput_per_user` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. @@ -348,8 +375,9 @@ These metrics track overall benchmark execution and system-level counters. #### Request Count -**Tag:** `request_count` -**Type:** Aggregate Metric +| Tag | Type | Flags | +|-----|------|-------| +| `request_count` | Aggregate Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The total number of **successfully completed** requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. @@ -357,8 +385,9 @@ These metrics track overall benchmark execution and system-level counters. #### Error Request Count -**Tag:** `error_request_count` -**Type:** Aggregate Metric +| Tag | Type | Flags | +|-----|------|-------| +| `error_request_count` | Aggregate Metric | [`ERROR_ONLY`](#flag-error-only) | **Description:** The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. @@ -369,8 +398,9 @@ These metrics track overall benchmark execution and system-level counters. #### Minimum Request Timestamp -**Tag:** `min_request_timestamp` -**Type:** Aggregate Metric +| Tag | Type | Flags | +|-----|------|-------| +| `min_request_timestamp` | Aggregate Metric | [`NO_CONSOLE`](#flag-no-console) | **Description:** The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. @@ -378,8 +408,9 @@ These metrics track overall benchmark execution and system-level counters. #### Maximum Response Timestamp -**Tag:** `max_response_timestamp` -**Type:** Aggregate Metric +| Tag | Type | Flags | +|-----|------|-------| +| `max_response_timestamp` | Aggregate Metric | [`NO_CONSOLE`](#flag-no-console) | **Description:** The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. @@ -387,8 +418,9 @@ These metrics track overall benchmark execution and system-level counters. #### Benchmark Duration -**Tag:** `benchmark_duration` -**Type:** Derived Metric +| Tag | Type | Flags | +|-----|------|-------| +| `benchmark_duration` | Derived Metric | [`NO_CONSOLE`](#flag-no-console) | **Description:** The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. @@ -403,3 +435,24 @@ Benchmark Duration = max_response_timestamp - min_request_timestamp - Denominator for throughput calculations; represents the effective measurement window. --- + +## Metric Flags Reference + +Metric flags are used to control when and how metrics are computed, displayed, and grouped. Flags can be combined using bitwise operations to create composite behaviors. 
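+
+The following sketch illustrates the idea of combining flags with a bitwise OR; the names mirror the table below, but the enum itself is illustrative and not AIPerf's actual definition:
+
+```python
+from enum import Flag
+
+class MetricFlags(Flag):
+    NONE = 0
+    STREAMING_ONLY = 1
+    PRODUCES_TOKENS_ONLY = 2
+    NO_CONSOLE = 4
+    # Composite flag: requires both streaming and a token-producing endpoint.
+    STREAMING_TOKENS_ONLY = STREAMING_ONLY | PRODUCES_TOKENS_ONLY
+
+flags = MetricFlags.STREAMING_TOKENS_ONLY
+print(MetricFlags.STREAMING_ONLY in flags)        # True
+print(MetricFlags.PRODUCES_TOKENS_ONLY in flags)  # True
+print(MetricFlags.NO_CONSOLE in flags)            # False
+```
+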
+ +| Flag | Description | Impact | +|------|-------------|--------| +| `NONE` | No flags set | Metric has default behavior with no special restrictions | +| `STREAMING_ONLY` | Only computed for streaming responses | Requires Server-Sent Events (SSE) with multiple response chunks; skipped for non-streaming requests | +| `ERROR_ONLY` | Only computed for error requests | Tracks error-specific information; skipped for successful requests | +| `PRODUCES_TOKENS_ONLY` | Only computed for token-producing endpoints | Requires endpoints that return text/token content; skipped for embeddings and non-generative endpoints | +| `NO_CONSOLE` | Not displayed in console output | Metric computed but excluded from terminal display; available in JSON/CSV exports | +| `LARGER_IS_BETTER` | Higher values indicate better performance | Used for throughput and count metrics to indicate optimization direction | +| `INTERNAL` | Internal system metric (also `NO_CONSOLE`) | Not user-facing; used for internal processing and debugging | +| `SUPPORTS_AUDIO_ONLY` | Only applicable to audio endpoints | Requires audio-based input/output; skipped for text-only endpoints | +| `SUPPORTS_IMAGE_ONLY` | Only applicable to image endpoints | Requires image-based input/output; skipped for text-only endpoints | +| `SUPPORTS_REASONING` | Requires reasoning token support | Only available for models that expose reasoning content in separate fields (e.g., OpenAI o1) | +| `EXPERIMENTAL` | Experimental feature (also `NO_CONSOLE`) | Not production-ready; subject to change; excluded from default display | +| `STREAMING_TOKENS_ONLY` | Combination: `STREAMING_ONLY` + `PRODUCES_TOKENS_ONLY` | Requires both streaming support and token-producing endpoints | + +--- From dd1d97d9bcee0d826c3a68d0ffdee53158d35007 Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Thu, 2 Oct 2025 10:04:27 -0700 Subject: [PATCH 05/11] Update metrics_reference.md Signed-off-by: Anthony Casagrande --- docs/metrics_reference.md | 249 +++++++++++++++++++++++--------------- 1 file changed, 150 insertions(+), 99 deletions(-) diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md index 5df20dc22..60d945e17 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -56,20 +56,25 @@ The sections below provide detailed descriptions, requirements, and notes for ea --- -## Detailed Metric Descriptions +# Detailed Metric Descriptions -### Latency & Timing Metrics +## Latency & Timing Metrics These metrics measure time and latency characteristics of requests and responses. -#### Request Latency +### Request Latency -| Tag | Type | Flags | -|-----|------|-------| -| `request_latency` | Record Metric | [`NONE`](#flag-none) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `request_latency` | Record Metric | [`NONE`](#flag-none) | - | **Description:** Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. +**Formula Details:** +``` +Request Latency = responses[-1].perf_ns - start_perf_ns +``` + **Notes:** - Available for all request types (streaming and non-streaming); no special requirements. - Includes all components: network time, queuing, prompt processing, token generation, and response transmission. 
@@ -77,38 +82,48 @@ These metrics measure time and latency characteristics of requests and responses --- -#### Time to First Token (TTFT) +### Time to First Token (TTFT) -| Tag | Type | Flags | -|-----|------|-------| -| `ttft` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `ttft` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only) | - | **Description:** Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. +**Formula Details:** +``` +TTFT = responses[0].perf_ns - request.start_perf_ns +``` + **Notes:** - Requires `--streaming` flag, with a token-producing endpoint, and at least 1 response chunk. - Includes network latency, queuing time, prompt processing, and generation of the first token (or chunk of tokens). --- -#### Time to Second Token (TTST) +### Time to Second Token (TTST) -| Tag | Type | Flags | -|-----|------|-------| -| `ttst` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `ttst` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only) | - | **Description:** Measures the time gap between the first and second chunk of tokens (SSE messages). This metric helps identify generation startup overhead separate from steady-state streaming throughput. +**Formula Details:** +``` +TTST = responses[1].perf_ns - responses[0].perf_ns +``` + **Notes:** - Requires `--streaming` flag, with a token-producing endpoint, and at least 2 response chunks (tokens). --- -#### Inter Token Latency (ITL) +### Inter Token Latency (ITL) -| Tag | Type | Flags | -|-----|------|-------| -| `inter_token_latency` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `inter_token_latency` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `request_latency`, `ttft`, `output_sequence_length` | **Description:** Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. @@ -122,11 +137,11 @@ ITL = (request_latency - ttft) / (output_sequence_length - 1) --- -#### Inter Chunk Latency (ICL) +### Inter Chunk Latency (ICL) -| Tag | Type | Flags | -|-----|------|-------| -| `inter_chunk_latency` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`EXPERIMENTAL`](#flag-experimental) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `inter_chunk_latency` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`EXPERIMENTAL`](#flag-experimental) | - | **Description:** Captures the time gaps between all consecutive response chunks (SSE messages) in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size. 
@@ -143,18 +158,23 @@ ICL = [responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(respo --- -### Token Count Metrics +## Token Count Metrics These metrics track token counts for individual requests and aggregated across all requests. -#### Output Token Count +### Output Token Count -| Tag | Type | Flags | -|-----|------|-------| -| `output_token_count` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `output_token_count` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | **Description:** The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the visible output tokens returned to the user across all responses for the request. +**Formula Details:** +``` +Output Token Count = output_token_count +``` + **Notes:** - Requires token-producing endpoints that return actual token content (text); excludes embeddings and other non-generative endpoints. - AIPerf counts tokens from the returned content using a tokenizer. @@ -164,14 +184,19 @@ These metrics track token counts for individual requests and aggregated across a --- -#### Reasoning Token Count +### Reasoning Token Count -| Tag | Type | Flags | -|-----|------|-------| -| `reasoning_token_count` | Record Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `reasoning_token_count` | Record Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | **Description:** The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. +**Formula Details:** +``` +Reasoning Token Count = reasoning_token_count +``` + **Notes:** - Requires models/backends that support reasoning output with reasoning content separated into a `reasoning_content` field, distinct from the regular `content` field in the response(s). - AIPerf counts tokens from the `reasoning_content` field using a tokenizer, just like other token metrics. @@ -181,11 +206,11 @@ These metrics track token counts for individual requests and aggregated across a --- -#### Output Sequence Length (OSL) +### Output Sequence Length (OSL) -| Tag | Type | Flags | -|-----|------|-------| -| `output_sequence_length` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `output_sequence_length` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | **Description:** The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. 
@@ -201,14 +226,19 @@ OSL = (output_token_count or 0) + (reasoning_token_count or 0) --- -#### Input Sequence Length (ISL) +### Input Sequence Length (ISL) -| Tag | Type | Flags | -|-----|------|-------| -| `input_sequence_length` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `input_sequence_length` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | **Description:** The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. +**Formula Details:** +``` +Input Sequence Length = input_token_count +``` + **Notes:** - Requires token-producing endpoints (chat, completion, etc.). - AIPerf tokenizes the input prompt to compute the count using the appropriate tokenizer for the model. @@ -216,11 +246,11 @@ OSL = (output_token_count or 0) + (reasoning_token_count or 0) --- -#### Total Output Tokens +### Total Output Tokens -| Tag | Type | Flags | -|-----|------|-------| -| `total_output_tokens` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `total_output_tokens` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `output_token_count` | **Description:** The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total visible output token workload. @@ -237,11 +267,11 @@ Total Output Tokens = sum(output_token_count for record in records) --- -#### Total Reasoning Tokens +### Total Reasoning Tokens -| Tag | Type | Flags | -|-----|------|-------| -| `total_reasoning_tokens` | Derived Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `total_reasoning_tokens` | Derived Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `reasoning_token_count` | **Description:** The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. @@ -258,11 +288,11 @@ Total Reasoning Tokens = sum(reasoning_token_count for record in records) --- -#### Total Output Sequence Length +### Total Output Sequence Length -| Tag | Type | Flags | -|-----|------|-------| -| `total_osl` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `total_osl` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `output_sequence_length` | **Description:** The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. 
@@ -279,11 +309,11 @@ Total Output Sequence Length = sum(output_sequence_length for record in records) --- -#### Total Input Sequence Length +### Total Input Sequence Length -| Tag | Type | Flags | -|-----|------|-------| -| `total_isl` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `total_isl` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `input_sequence_length` | **Description:** The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. @@ -299,15 +329,15 @@ Total Input Sequence Length = sum(input_sequence_length for record in records) --- -### Throughput Metrics +## Throughput Metrics These metrics measure the rate of requests and token generation. -#### Request Throughput +### Request Throughput -| Tag | Type | Flags | -|-----|------|-------| -| `request_throughput` | Derived Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `request_throughput` | Derived Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | `request_count`, `benchmark_duration` | **Description:** The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. @@ -323,11 +353,16 @@ Request Throughput = request_count / benchmark_duration_seconds --- -#### Output Token Throughput +### Output Token Throughput + +> [!IMPORTANT] +> This metric is computed as a single values across all requests, and it includes the TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric. + + +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `output_token_throughput` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `total_osl`, `benchmark_duration` | -| Tag | Type | Flags | -|-----|------|-------| -| `output_token_throughput` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | **Description:** The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. @@ -336,8 +371,6 @@ Request Throughput = request_count / benchmark_duration_seconds Output Token Throughput = total_osl / benchmark_duration_seconds ``` -**Important:** This metric specifically includes the TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric. - **Notes:** - Requires token-producing endpoints that generate text content, with valid `total_osl` and `benchmark_duration` metrics. - Measures aggregate throughput across all concurrent requests; represents the overall system token generation rate. @@ -346,11 +379,15 @@ Output Token Throughput = total_osl / benchmark_duration_seconds --- -#### Output Token Throughput Per User +### Output Token Throughput Per User + +> [!IMPORTANT] +> This metric is computed per-request, and it excludes the TTFT from the equation, so it is **not** directly comparable to the [Output Token Throughput](#output-token-throughput) metric. 
+ -| Tag | Type | Flags | -|-----|------|-------| -| `output_token_throughput_per_user` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `output_token_throughput_per_user` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `inter_token_latency` | **Description:** The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. @@ -359,8 +396,6 @@ Output Token Throughput = total_osl / benchmark_duration_seconds Output Token Throughput Per User = 1.0 / inter_token_latency_seconds ``` -**Important:** This metric specifically excludes the TTFT from the equation, so it is **not** directly comparable to the [Output Token Throughput](#output-token-throughput) metric. - **Notes:** - Requires `--streaming` flag, with a token-producing endpoint, and valid `inter_token_latency` metric. - Computes the inverse of ITL to show tokens per second from an individual user's perspective. @@ -369,58 +404,78 @@ Output Token Throughput Per User = 1.0 / inter_token_latency_seconds --- -### System & Benchmark Metrics +## System & Benchmark Metrics These metrics track overall benchmark execution and system-level counters. -#### Request Count +### Request Count -| Tag | Type | Flags | -|-----|------|-------| -| `request_count` | Aggregate Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `request_count` | Aggregate Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | **Description:** The total number of **successfully completed** requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. +**Formula Details:** +``` +Request Count = sum(1 for request in valid_requests) +``` + --- -#### Error Request Count +### Error Request Count -| Tag | Type | Flags | -|-----|------|-------| -| `error_request_count` | Aggregate Metric | [`ERROR_ONLY`](#flag-error-only) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `error_request_count` | Aggregate Metric | [`ERROR_ONLY`](#flag-error-only) | - | **Description:** The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. +**Formula Details:** +``` +Error Request Count = sum(1 for request in error_requests) +``` + **Notes:** - Error rate can be computed as `error_request_count / (request_count + error_request_count)`. --- -#### Minimum Request Timestamp +### Minimum Request Timestamp -| Tag | Type | Flags | -|-----|------|-------| -| `min_request_timestamp` | Aggregate Metric | [`NO_CONSOLE`](#flag-no-console) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `min_request_timestamp` | Aggregate Metric | [`NO_CONSOLE`](#flag-no-console) | - | **Description:** The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. 
+**Formula Details:** +``` +Minimum Request Timestamp = min(timestamp_ns for record in records) +``` + --- -#### Maximum Response Timestamp +### Maximum Response Timestamp -| Tag | Type | Flags | -|-----|------|-------| -| `max_response_timestamp` | Aggregate Metric | [`NO_CONSOLE`](#flag-no-console) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `max_response_timestamp` | Aggregate Metric | [`NO_CONSOLE`](#flag-no-console) | `request_latency` | **Description:** The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. +**Formula Details:** +``` +Maximum Response Timestamp = max(timestamp_ns + request_latency for record in records) +``` + --- -#### Benchmark Duration +### Benchmark Duration -| Tag | Type | Flags | -|-----|------|-------| -| `benchmark_duration` | Derived Metric | [`NO_CONSOLE`](#flag-no-console) | +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `benchmark_duration` | Derived Metric | [`NO_CONSOLE`](#flag-no-console) | `min_request_timestamp`, `max_response_timestamp` | **Description:** The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. @@ -436,7 +491,7 @@ Benchmark Duration = max_response_timestamp - min_request_timestamp --- -## Metric Flags Reference +# Metric Flags Reference Metric flags are used to control when and how metrics are computed, displayed, and grouped. Flags can be combined using bitwise operations to create composite behaviors. @@ -448,11 +503,7 @@ Metric flags are used to control when and how metrics are computed, displayed, a | `PRODUCES_TOKENS_ONLY` | Only computed for token-producing endpoints | Requires endpoints that return text/token content; skipped for embeddings and non-generative endpoints | | `NO_CONSOLE` | Not displayed in console output | Metric computed but excluded from terminal display; available in JSON/CSV exports | | `LARGER_IS_BETTER` | Higher values indicate better performance | Used for throughput and count metrics to indicate optimization direction | -| `INTERNAL` | Internal system metric (also `NO_CONSOLE`) | Not user-facing; used for internal processing and debugging | -| `SUPPORTS_AUDIO_ONLY` | Only applicable to audio endpoints | Requires audio-based input/output; skipped for text-only endpoints | -| `SUPPORTS_IMAGE_ONLY` | Only applicable to image endpoints | Requires image-based input/output; skipped for text-only endpoints | -| `SUPPORTS_REASONING` | Requires reasoning token support | Only available for models that expose reasoning content in separate fields (e.g., OpenAI o1) | -| `EXPERIMENTAL` | Experimental feature (also `NO_CONSOLE`) | Not production-ready; subject to change; excluded from default display | +| `SUPPORTS_REASONING` | Requires reasoning token support | Only available for models and endpoints that expose reasoning content in separate fields | | `STREAMING_TOKENS_ONLY` | Combination: `STREAMING_ONLY` + `PRODUCES_TOKENS_ONLY` | Requires both streaming support and token-producing endpoints | --- From c6f9bf53cf20cd750bfaf6f0ff289bf016406922 Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Thu, 2 Oct 2025 10:52:49 -0700 Subject: [PATCH 06/11] Update metrics_reference.md Signed-off-by: Anthony Casagrande --- docs/metrics_reference.md | 308 ++++++++++++++++++-------------------- 1 file changed, 146 insertions(+), 162 deletions(-) diff --git 
a/docs/metrics_reference.md b/docs/metrics_reference.md index 60d945e17..995cd5573 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -58,29 +58,12 @@ The sections below provide detailed descriptions, requirements, and notes for ea # Detailed Metric Descriptions -## Latency & Timing Metrics +## Streaming Metrics -These metrics measure time and latency characteristics of requests and responses. +These metrics are specific to streaming requests and measure real-time token generation characteristics. -### Request Latency - -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `request_latency` | Record Metric | [`NONE`](#flag-none) | - | - -**Description:** Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. - -**Formula Details:** -``` -Request Latency = responses[-1].perf_ns - start_perf_ns -``` - -**Notes:** -- Available for all request types (streaming and non-streaming); no special requirements. -- Includes all components: network time, queuing, prompt processing, token generation, and response transmission. -- For streaming requests, measures from request start to the final chunk received. - ---- +> [!NOTE] +> **Requirements:** All metrics in this section require the `--streaming` flag with a token-producing endpoint and at least one non-empty response chunk. ### Time to First Token (TTFT) @@ -90,13 +73,12 @@ Request Latency = responses[-1].perf_ns - start_perf_ns **Description:** Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. -**Formula Details:** -``` -TTFT = responses[0].perf_ns - request.start_perf_ns +**Formula:** +```python +ttft = responses[0].perf_ns - request.start_perf_ns ``` **Notes:** -- Requires `--streaming` flag, with a token-producing endpoint, and at least 1 response chunk. - Includes network latency, queuing time, prompt processing, and generation of the first token (or chunk of tokens). --- @@ -109,13 +91,13 @@ TTFT = responses[0].perf_ns - request.start_perf_ns **Description:** Measures the time gap between the first and second chunk of tokens (SSE messages). This metric helps identify generation startup overhead separate from steady-state streaming throughput. -**Formula Details:** -``` -TTST = responses[1].perf_ns - responses[0].perf_ns +**Formula:** +```python +ttst = responses[1].perf_ns - responses[0].perf_ns ``` **Notes:** -- Requires `--streaming` flag, with a token-producing endpoint, and at least 2 response chunks (tokens). +- Requires at least 2 non-empty response chunks to compute the time between first and second tokens. --- @@ -127,13 +109,13 @@ TTST = responses[1].perf_ns - responses[0].perf_ns **Description:** Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. 
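To make the steady-state definition concrete, here is a small worked example with invented numbers (a 2 s request, 200 ms TTFT, 101 completion tokens); it is illustrative only and not AIPerf output:

```python
# Hypothetical values for one streaming request (not real AIPerf measurements).
request_latency_ms = 2000.0        # end-to-end latency
ttft_ms = 200.0                    # time to first token
output_sequence_length = 101       # completion tokens (output + reasoning)

# Average gap between consecutive tokens after the first one arrives.
inter_token_latency_ms = (request_latency_ms - ttft_ms) / (output_sequence_length - 1)
print(inter_token_latency_ms)      # 18.0 ms per token
```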
-**Formula Details:** -``` -ITL = (request_latency - ttft) / (output_sequence_length - 1) +**Formula:** +```python +inter_token_latency = (request_latency - ttft) / (output_sequence_length - 1) ``` **Notes:** -- Requires `--streaming` flag, with a token-producing endpoint, at least 2 output tokens, and valid `ttft`, `request_latency`, and `output_sequence_length` metrics. +- Requires at least 2 non-empty response chunks and valid `ttft`, `request_latency`, and `output_sequence_length` metrics. --- @@ -145,64 +127,67 @@ ITL = (request_latency - ttft) / (output_sequence_length - 1) **Description:** Captures the time gaps between all consecutive response chunks (SSE messages) in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size. -**Formula Details:** -``` -ICL = [responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))] +**Formula:** +```python +inter_chunk_latency = [responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))] ``` **Notes:** -- Requires `--streaming` flag, with a token-producing endpoint, and at least 2 response chunks. +- Requires at least 2 response chunks. - Unlike ITL (which produces a single average), ICL provides the full distribution of inter-chunk times. - Useful for detecting variability, jitter, or issues in streaming delivery. - Analyzing ICL distributions can reveal batching behavior, scheduling issues, or network variability. --- -## Token Count Metrics +### Output Token Throughput Per User -These metrics track token counts for individual requests and aggregated across all requests. +> [!IMPORTANT] +> This metric is computed per-request, and it excludes the TTFT from the equation, so it is **not** directly comparable to the [Output Token Throughput](#output-token-throughput) metric. -### Output Token Count | Tag | Type | Flags | Depends On | |-----|------|-------|------------| -| `output_token_count` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | +| `output_token_throughput_per_user` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `inter_token_latency` | -**Description:** The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the visible output tokens returned to the user across all responses for the request. +**Description:** The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. -**Formula Details:** -``` -Output Token Count = output_token_count +**Formula:** +```python +output_token_throughput_per_user = 1.0 / inter_token_latency_seconds ``` **Notes:** -- Requires token-producing endpoints that return actual token content (text); excludes embeddings and other non-generative endpoints. -- AIPerf counts tokens from the returned content using a tokenizer. -- For streaming requests with multiple responses, the responses are joined together and then tokens are counted. -- For models and endpoints that support reasoning tokens, this metric counts only the non-reasoning output tokens. -- This **will** count tokens inside of the `` tags, if they are present in the `content` field of the response. 
+- Computes the inverse of ITL to show tokens per second from an individual user's perspective. +- Differs from Output Token Throughput (aggregate across all concurrent requests) by focusing on single-request experience. +- Useful for understanding the user experience independent of concurrency effects. --- -### Reasoning Token Count +## Token Based Metrics + +These metrics track token counts and throughput for token-producing endpoints. + +> [!NOTE] +> **Requirements:** All metrics in this section require token-producing endpoints that return text content (chat, completion, etc.). These metrics are not available for embeddings or other non-generative endpoints. + +### Output Token Count | Tag | Type | Flags | Depends On | |-----|------|-------|------------| -| `reasoning_token_count` | Record Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | +| `output_token_count` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | -**Description:** The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. +**Description:** The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the visible output tokens returned to the user across all responses for the request. -**Formula Details:** -``` -Reasoning Token Count = reasoning_token_count +**Formula:** +```python +output_token_count = len(tokenizer.encode(content)) ``` **Notes:** -- Requires models/backends that support reasoning output with reasoning content separated into a `reasoning_content` field, distinct from the regular `content` field in the response(s). -- AIPerf counts tokens from the `reasoning_content` field using a tokenizer, just like other token metrics. -- Does NOT differentiate `` tags or extract reasoning from within the regular `content` field. -- The backend must provide reasoning as a separate field in the response. -- Standard models/backends that don't expose reasoning content separately will skip this metric. +- For streaming requests with multiple responses, the responses are joined together and then tokens are counted. +- For models that support reasoning tokens, this metric counts only the non-reasoning output tokens. +- This **will** count tokens inside of the `` tags, if they are present in the `content` field of the response. --- @@ -214,15 +199,13 @@ Reasoning Token Count = reasoning_token_count **Description:** The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. -**Formula Details:** -``` -OSL = (output_token_count or 0) + (reasoning_token_count or 0) +**Formula:** +```python +output_sequence_length = (output_token_count or 0) + (reasoning_token_count or 0) ``` **Notes:** -- Requires token-producing endpoints that return text content; excludes embeddings and other non-generative endpoints. -- AIPerf counts tokens from the generated text content across all responses. -- For models and endpoints that do not support/separate reasoning tokens, OSL equals the output token count. +- For models that do not support/separate reasoning tokens, OSL equals the output token count. 
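The note above about joining streaming responses before counting can be sketched as follows; the chunk strings and the `gpt2` tokenizer are stand-ins chosen for illustration, not AIPerf's actual tokenizer selection logic:

```python
# Illustrative sketch only: counting visible output tokens from joined streaming chunks.
from transformers import AutoTokenizer

chunks = ["The quick", " brown fox", " jumps over the lazy dog."]  # made-up SSE chunk contents
content = "".join(chunks)  # streaming responses are joined before tokenizing

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the configured tokenizer
output_token_count = len(tokenizer.encode(content, add_special_tokens=False))
print(output_token_count)
```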
--- @@ -234,14 +217,12 @@ OSL = (output_token_count or 0) + (reasoning_token_count or 0) **Description:** The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. -**Formula Details:** -``` -Input Sequence Length = input_token_count +**Formula:** +```python +input_sequence_length = len(tokenizer.encode(prompt)) ``` **Notes:** -- Requires token-producing endpoints (chat, completion, etc.). -- AIPerf tokenizes the input prompt to compute the count using the appropriate tokenizer for the model. - Useful for understanding the relationship between input size and latency/throughput. --- @@ -254,40 +235,17 @@ Input Sequence Length = input_token_count **Description:** The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total visible output token workload. -**Formula Details:** -``` -Total Output Tokens = sum(output_token_count for record in records) +**Formula:** +```python +total_output_tokens = sum(output_token_count for record in records) ``` **Notes:** -- Requires token-producing endpoints that return text content, with valid `output_token_count` for processed records. -- AIPerf counts tokens from the returned content using a tokenizer. - Aggregates output tokens across all successful requests. - Useful for capacity planning and cost estimation. --- -### Total Reasoning Tokens - -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `total_reasoning_tokens` | Derived Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `reasoning_token_count` | - -**Description:** The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. - -**Formula Details:** -``` -Total Reasoning Tokens = sum(reasoning_token_count for record in records) -``` - -**Notes:** -- Requires models/backends that support reasoning output with `reasoning_content` as a separate field, and valid `reasoning_token_count` for processed records. -- AIPerf counts tokens from the `reasoning_content` field. -- Only available for models like OpenAI o1 that expose reasoning tokens separately. -- Useful for understanding the reasoning overhead and cost for reasoning-enabled models. - ---- - ### Total Output Sequence Length | Tag | Type | Flags | Depends On | @@ -296,16 +254,14 @@ Total Reasoning Tokens = sum(reasoning_token_count for record in records) **Description:** The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. -**Formula Details:** -``` -Total Output Sequence Length = sum(output_sequence_length for record in records) +**Formula:** +```python +total_osl = sum(output_sequence_length for record in records) ``` **Notes:** -- Requires token-producing endpoints that return text content, with valid `output_sequence_length` for processed records. - Aggregates the complete token generation workload including both output and reasoning tokens. - For models without reasoning tokens, this equals Total Output Tokens. -- Numerator for Output Token Throughput calculations. --- @@ -317,96 +273,125 @@ Total Output Sequence Length = sum(output_sequence_length for record in records) **Description:** The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. 
-**Formula Details:** -``` -Total Input Sequence Length = sum(input_sequence_length for record in records) +**Formula:** +```python +total_isl = sum(input_sequence_length for record in records) ``` **Notes:** -- Requires token-producing endpoints, with valid `input_sequence_length` for processed records. -- AIPerf tokenizes input prompts to compute token counts. - Useful for understanding the input workload, capacity planning, and analyzing the relationship between input size and system performance. --- -## Throughput Metrics +### Output Token Throughput -These metrics measure the rate of requests and token generation. +> [!IMPORTANT] +> This metric is computed as a single values across all requests, and it includes the TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric. -### Request Throughput | Tag | Type | Flags | Depends On | |-----|------|-------|------------| -| `request_throughput` | Derived Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | `request_count`, `benchmark_duration` | +| `output_token_throughput` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `total_osl`, `benchmark_duration` | -**Description:** The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. -**Formula Details:** -``` -Request Throughput = request_count / benchmark_duration_seconds +**Description:** The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. + +**Formula:** +```python +output_token_throughput = total_osl / benchmark_duration_seconds ``` **Notes:** -- Requires valid `request_count` and `benchmark_duration` metrics. -- Captures the aggregate request processing rate; higher values indicate better system throughput. -- Affected by concurrency level, request complexity, and system capacity. +- Measures aggregate throughput across all concurrent requests; represents the overall system token generation rate. +- Higher values indicate better system utilization and capacity. --- -### Output Token Throughput +## Reasoning Metrics -> [!IMPORTANT] -> This metric is computed as a single values across all requests, and it includes the TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric. +These metrics are specific to models that support reasoning/thinking tokens. +> [!NOTE] +> **Requirements:** All metrics in this section require models and backends that expose reasoning content in a separate `reasoning_content` field, distinct from the regular `content` field. + +### Reasoning Token Count | Tag | Type | Flags | Depends On | |-----|------|-------|------------| -| `output_token_throughput` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `total_osl`, `benchmark_duration` | - +| `reasoning_token_count` | Record Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | -**Description:** The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. 
+**Description:** The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. -**Formula Details:** +**Formula:** +```python +reasoning_token_count = len(tokenizer.encode(reasoning_content)) ``` -Output Token Throughput = total_osl / benchmark_duration_seconds + +**Notes:** +- Does **not** differentiate `` tags or extract reasoning from within the regular `content` field. + +--- + +### Total Reasoning Tokens + +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `total_reasoning_tokens` | Derived Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `reasoning_token_count` | + +**Description:** The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. + +**Formula:** +```python +total_reasoning_tokens = sum(reasoning_token_count for record in records) ``` **Notes:** -- Requires token-producing endpoints that generate text content, with valid `total_osl` and `benchmark_duration` metrics. -- Measures aggregate throughput across all concurrent requests; represents the overall system token generation rate. -- Not applicable to embeddings or other non-generative endpoints. -- Higher values indicate better system utilization and capacity. +- Useful for understanding the reasoning overhead and cost for reasoning-enabled models. --- -### Output Token Throughput Per User +## General Metrics -> [!IMPORTANT] -> This metric is computed per-request, and it excludes the TTFT from the equation, so it is **not** directly comparable to the [Output Token Throughput](#output-token-throughput) metric. +> [!NOTE] +> **Requirements:** Metrics in this section are available for all benchmark runs with no special requirements. +### Request Latency | Tag | Type | Flags | Depends On | |-----|------|-------|------------| -| `output_token_throughput_per_user` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `inter_token_latency` | +| `request_latency` | Record Metric | [`NONE`](#flag-none) | - | -**Description:** The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. +**Description:** Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. -**Formula Details:** -``` -Output Token Throughput Per User = 1.0 / inter_token_latency_seconds +**Formula:** +```python +request_latency = responses[-1].perf_ns - start_perf_ns ``` **Notes:** -- Requires `--streaming` flag, with a token-producing endpoint, and valid `inter_token_latency` metric. -- Computes the inverse of ITL to show tokens per second from an individual user's perspective. -- Differs from Output Token Throughput (aggregate across all concurrent requests) by focusing on single-request experience. -- Useful for understanding the user experience independent of concurrency effects. +- Includes all components: network time, queuing, prompt processing, token generation, and response transmission. +- For streaming requests, measures from request start to the final chunk received. 
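As a rough sketch of how a per-request record metric like this could be computed from captured timestamps (the `Response` and `RequestRecord` shapes below are assumptions for illustration, not AIPerf's internal types):

```python
from dataclasses import dataclass

@dataclass
class Response:
    perf_ns: int  # monotonic clock reading when this response chunk arrived

@dataclass
class RequestRecord:
    start_perf_ns: int        # monotonic clock reading when the request was sent
    responses: list[Response]

def request_latency_ms(record: RequestRecord) -> float:
    # End-to-end time: request start to the final response chunk.
    return (record.responses[-1].perf_ns - record.start_perf_ns) / 1e6

record = RequestRecord(start_perf_ns=0, responses=[Response(150_000_000), Response(900_000_000)])
print(request_latency_ms(record))  # 900.0 ms for this made-up record
```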
--- -## System & Benchmark Metrics +### Request Throughput + +| Tag | Type | Flags | Depends On | +|-----|------|-------|------------| +| `request_throughput` | Derived Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | `request_count`, `benchmark_duration` | -These metrics track overall benchmark execution and system-level counters. +**Description:** The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. + +**Formula:** +```python +request_throughput = request_count / benchmark_duration_seconds +``` + +**Notes:** +- Captures the aggregate request processing rate; higher values indicate better system throughput. +- Affected by concurrency level, request complexity, output sequence length, and system capacity. + +--- ### Request Count @@ -416,9 +401,9 @@ These metrics track overall benchmark execution and system-level counters. **Description:** The total number of **successfully completed** requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. -**Formula Details:** -``` -Request Count = sum(1 for request in valid_requests) +**Formula:** +```python +request_count = sum(1 for record if record.valid) ``` --- @@ -431,9 +416,9 @@ Request Count = sum(1 for request in valid_requests) **Description:** The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. -**Formula Details:** -``` -Error Request Count = sum(1 for request in error_requests) +**Formula:** +```python +error_request_count = sum(1 for record if not record.valid) ``` **Notes:** @@ -449,9 +434,9 @@ Error Request Count = sum(1 for request in error_requests) **Description:** The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. -**Formula Details:** -``` -Minimum Request Timestamp = min(timestamp_ns for record in records) +**Formula:** +```python +min_request_timestamp = min(timestamp_ns for record in records) ``` --- @@ -464,9 +449,9 @@ Minimum Request Timestamp = min(timestamp_ns for record in records) **Description:** The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. -**Formula Details:** -``` -Maximum Response Timestamp = max(timestamp_ns + request_latency for record in records) +**Formula:** +```python +max_response_timestamp = max(timestamp_ns + request_latency for record in records) ``` --- @@ -479,15 +464,14 @@ Maximum Response Timestamp = max(timestamp_ns + request_latency for record in re **Description:** The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. -**Formula Details:** -``` -Benchmark Duration = max_response_timestamp - min_request_timestamp +**Formula:** +```python +benchmark_duration = max_response_timestamp - min_request_timestamp ``` **Notes:** -- Requires valid `min_request_timestamp` and `max_response_timestamp` metrics. - Uses wall-clock timestamps representing real calendar time. -- Denominator for throughput calculations; represents the effective measurement window. +- Used as the denominator for throughput calculations; represents the effective measurement window. 
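Putting the aggregate timestamps and the derived formulas above together, a minimal sketch with invented numbers looks like this:

```python
# Invented values: 100 successful requests over a 12.5 second measurement window.
NS_PER_SEC = 1_000_000_000

min_request_timestamp = 1_700_000_000 * NS_PER_SEC                # first request sent (wall clock, ns)
max_response_timestamp = min_request_timestamp + 12_500_000_000   # last response received (wall clock, ns)
request_count = 100

benchmark_duration_seconds = (max_response_timestamp - min_request_timestamp) / NS_PER_SEC
request_throughput = request_count / benchmark_duration_seconds

print(benchmark_duration_seconds, request_throughput)  # 12.5 seconds, 8.0 requests/sec
```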
--- From 5705b28b8329a420b670cbc2fb300abf732ce438 Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Thu, 2 Oct 2025 11:07:03 -0700 Subject: [PATCH 07/11] Update metrics_reference.md Signed-off-by: Anthony Casagrande --- docs/metrics_reference.md | 180 ++++++++++---------------------------- 1 file changed, 45 insertions(+), 135 deletions(-) diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md index 995cd5573..6caf53cb9 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -8,42 +8,45 @@ This document provides a comprehensive reference of all metrics available in AIP ## Understanding Metric Types -AIPerf computes metrics in three distinct phases during benchmark execution: +AIPerf computes metrics in three distinct phases during benchmark execution: **Record Metrics**, **Aggregate Metrics**, and **Derived Metrics**. -### Record Metrics +## Record Metrics -Record Metrics are computed **individually** for **each request** and its **response(s)** during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture **per-request characteristics** such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99) that reveal performance variability across requests. +Record Metrics are computed **individually** for **each request** and its **response(s)** during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture **per-request characteristics** such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99, etc.) that reveal performance variability across requests. -**Examples:** `request_latency`, `ttft`, `inter_token_latency`, `output_token_count`, `input_sequence_length` +### Example Metrics +`request_latency`, `ttft`, `inter_token_latency`, `output_token_count`, `input_sequence_length` -**Dependencies:** Record Metrics can depend on raw request/response data and other Record Metrics from the same request. - -#### Example: +### Dependencies +Record Metrics can depend on raw request/response data and other Record Metrics from the same request. +### Example Scenario `request_latency` measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests. -### Aggregate Metrics +## Aggregate Metrics Aggregate Metrics are computed by **tracking** or **accumulating** values across **all requests** in **real-time** during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a **single value** representing the entire benchmark run. -**Examples:** `request_count`, `error_request_count`, `min_request_timestamp`, `max_response_timestamp` - -**Dependencies:** Aggregate Metrics can depend on Record Metrics and other Aggregate Metrics. +### Example Metrics +`request_count`, `error_request_count`, `min_request_timestamp`, `max_response_timestamp` -#### Example: +### Dependencies +Aggregate Metrics can depend on Record Metrics and other Aggregate Metrics. +### Example Scenario `request_count` increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution). 
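A toy sketch of that accumulation pattern; the class and method names here are invented for illustration and are not AIPerf internals:

```python
class RequestCounter:
    """Tracks a single running total across all requests (an aggregate metric)."""

    def __init__(self) -> None:
        self.value = 0

    def on_request_completed(self, success: bool) -> None:
        if success:
            self.value += 1  # one increment per successfully completed request

counter = RequestCounter()
for ok in (True, True, False, True):
    counter.on_request_completed(ok)
print(counter.value)  # 3 successful requests counted so far
```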
-### Derived Metrics +## Derived Metrics Derived Metrics are computed by applying **mathematical formulas** to other metric results, but are **not** computed per-record like Record Metrics. Instead, these metrics depend on one or more **prerequisite metrics** being available first and are calculated either **after the benchmark completes** for final results or in **real-time** across **all current data** for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies. -**Examples:** `request_throughput`, `output_token_throughput`, `benchmark_duration` +### Example Metrics +`request_throughput`, `output_token_throughput`, `benchmark_duration` -**Dependencies:** Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics. - -#### Example: +### Dependencies +Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics. +### Example Scenario `request_throughput` is computed from `request_count / benchmark_duration_seconds`. This requires both `request_count` and `benchmark_duration` to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec). --- @@ -60,18 +63,12 @@ The sections below provide detailed descriptions, requirements, and notes for ea ## Streaming Metrics -These metrics are specific to streaming requests and measure real-time token generation characteristics. - > [!NOTE] -> **Requirements:** All metrics in this section require the `--streaming` flag with a token-producing endpoint and at least one non-empty response chunk. +> All metrics in this section require the `--streaming` flag with a token-producing endpoint and at least one non-empty response chunk. ### Time to First Token (TTFT) -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `ttft` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only) | - | - -**Description:** Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. +Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. **Formula:** ```python @@ -85,11 +82,7 @@ ttft = responses[0].perf_ns - request.start_perf_ns ### Time to Second Token (TTST) -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `ttst` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only) | - | - -**Description:** Measures the time gap between the first and second chunk of tokens (SSE messages). This metric helps identify generation startup overhead separate from steady-state streaming throughput. +Measures the time gap between the first and second chunk of tokens (SSE messages). This metric helps identify generation startup overhead separate from steady-state streaming throughput. 
**Formula:** ```python @@ -103,11 +96,7 @@ ttst = responses[1].perf_ns - responses[0].perf_ns ### Inter Token Latency (ITL) -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `inter_token_latency` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `request_latency`, `ttft`, `output_sequence_length` | - -**Description:** Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. +Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. **Formula:** ```python @@ -121,11 +110,7 @@ inter_token_latency = (request_latency - ttft) / (output_sequence_length - 1) ### Inter Chunk Latency (ICL) -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `inter_chunk_latency` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`EXPERIMENTAL`](#flag-experimental) | - | - -**Description:** Captures the time gaps between all consecutive response chunks (SSE messages) in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size. +Captures the time gaps between all consecutive response chunks (SSE messages) in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size. **Formula:** ```python @@ -145,12 +130,7 @@ inter_chunk_latency = [responses[i].perf_ns - responses[i-1].perf_ns for i in ra > [!IMPORTANT] > This metric is computed per-request, and it excludes the TTFT from the equation, so it is **not** directly comparable to the [Output Token Throughput](#output-token-throughput) metric. - -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `output_token_throughput_per_user` | Record Metric | [`STREAMING_TOKENS_ONLY`](#flag-streaming-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `inter_token_latency` | - -**Description:** The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. +The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. **Formula:** ```python @@ -166,18 +146,12 @@ output_token_throughput_per_user = 1.0 / inter_token_latency_seconds ## Token Based Metrics -These metrics track token counts and throughput for token-producing endpoints. - > [!NOTE] -> **Requirements:** All metrics in this section require token-producing endpoints that return text content (chat, completion, etc.). These metrics are not available for embeddings or other non-generative endpoints. +> All metrics in this section require token-producing endpoints that return text content (chat, completion, etc.). These metrics are not available for embeddings or other non-generative endpoints. 
### Output Token Count -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `output_token_count` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | - -**Description:** The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the visible output tokens returned to the user across all responses for the request. +The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the visible output tokens returned to the user across all responses for the request. **Formula:** ```python @@ -193,11 +167,7 @@ output_token_count = len(tokenizer.encode(content)) ### Output Sequence Length (OSL) -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `output_sequence_length` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | - -**Description:** The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. +The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. **Formula:** ```python @@ -211,11 +181,7 @@ output_sequence_length = (output_token_count or 0) + (reasoning_token_count or 0 ### Input Sequence Length (ISL) -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `input_sequence_length` | Record Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | - -**Description:** The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. +The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. **Formula:** ```python @@ -229,11 +195,7 @@ input_sequence_length = len(tokenizer.encode(prompt)) ### Total Output Tokens -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `total_output_tokens` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `output_token_count` | - -**Description:** The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total visible output token workload. +The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total visible output token workload. **Formula:** ```python @@ -248,11 +210,7 @@ total_output_tokens = sum(output_token_count for record in records) ### Total Output Sequence Length -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `total_osl` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `output_sequence_length` | - -**Description:** The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. +The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. 
**Formula:** ```python @@ -267,11 +225,7 @@ total_osl = sum(output_sequence_length for record in records) ### Total Input Sequence Length -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `total_isl` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `input_sequence_length` | - -**Description:** The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. +The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. **Formula:** ```python @@ -288,13 +242,7 @@ total_isl = sum(input_sequence_length for record in records) > [!IMPORTANT] > This metric is computed as a single values across all requests, and it includes the TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric. - -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `output_token_throughput` | Derived Metric | [`PRODUCES_TOKENS_ONLY`](#flag-produces-tokens-only), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `total_osl`, `benchmark_duration` | - - -**Description:** The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. +The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. **Formula:** ```python @@ -309,18 +257,12 @@ output_token_throughput = total_osl / benchmark_duration_seconds ## Reasoning Metrics -These metrics are specific to models that support reasoning/thinking tokens. - > [!NOTE] -> **Requirements:** All metrics in this section require models and backends that expose reasoning content in a separate `reasoning_content` field, distinct from the regular `content` field. +> All metrics in this section require models and backends that expose reasoning content in a separate `reasoning_content` field, distinct from the regular `content` field. ### Reasoning Token Count -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `reasoning_token_count` | Record Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | - -**Description:** The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. +The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. **Formula:** ```python @@ -334,11 +276,7 @@ reasoning_token_count = len(tokenizer.encode(reasoning_content)) ### Total Reasoning Tokens -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `total_reasoning_tokens` | Derived Metric | [`SUPPORTS_REASONING`](#flag-supports-reasoning), [`LARGER_IS_BETTER`](#flag-larger-is-better) | `reasoning_token_count` | - -**Description:** The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. +The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. 
**Formula:** ```python @@ -353,15 +291,11 @@ total_reasoning_tokens = sum(reasoning_token_count for record in records) ## General Metrics > [!NOTE] -> **Requirements:** Metrics in this section are available for all benchmark runs with no special requirements. +> Metrics in this section are available for all benchmark runs with no special requirements. ### Request Latency -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `request_latency` | Record Metric | [`NONE`](#flag-none) | - | - -**Description:** Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. +Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. **Formula:** ```python @@ -376,11 +310,7 @@ request_latency = responses[-1].perf_ns - start_perf_ns ### Request Throughput -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `request_throughput` | Derived Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | `request_count`, `benchmark_duration` | - -**Description:** The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. +The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. **Formula:** ```python @@ -395,11 +325,7 @@ request_throughput = request_count / benchmark_duration_seconds ### Request Count -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `request_count` | Aggregate Metric | [`LARGER_IS_BETTER`](#flag-larger-is-better) | - | - -**Description:** The total number of **successfully completed** requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. +The total number of **successfully completed** requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. **Formula:** ```python @@ -410,11 +336,7 @@ request_count = sum(1 for record if record.valid) ### Error Request Count -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `error_request_count` | Aggregate Metric | [`ERROR_ONLY`](#flag-error-only) | - | - -**Description:** The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. +The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. **Formula:** ```python @@ -428,11 +350,7 @@ error_request_count = sum(1 for record if not record.valid) ### Minimum Request Timestamp -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `min_request_timestamp` | Aggregate Metric | [`NO_CONSOLE`](#flag-no-console) | - | - -**Description:** The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. 
+The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. **Formula:** ```python @@ -443,11 +361,7 @@ min_request_timestamp = min(timestamp_ns for record in records) ### Maximum Response Timestamp -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `max_response_timestamp` | Aggregate Metric | [`NO_CONSOLE`](#flag-no-console) | `request_latency` | - -**Description:** The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. +The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. **Formula:** ```python @@ -458,11 +372,7 @@ max_response_timestamp = max(timestamp_ns + request_latency for record in record ### Benchmark Duration -| Tag | Type | Flags | Depends On | -|-----|------|-------|------------| -| `benchmark_duration` | Derived Metric | [`NO_CONSOLE`](#flag-no-console) | `min_request_timestamp`, `max_response_timestamp` | - -**Description:** The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. +The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. **Formula:** ```python From 25f8117560e512db86cc635ce7c8387d7c50ab81 Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Thu, 2 Oct 2025 11:12:01 -0700 Subject: [PATCH 08/11] Update metrics_reference.md Signed-off-by: Anthony Casagrande --- docs/metrics_reference.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md index 6caf53cb9..9051f2845 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -14,39 +14,39 @@ AIPerf computes metrics in three distinct phases during benchmark execution: **R Record Metrics are computed **individually** for **each request** and its **response(s)** during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture **per-request characteristics** such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99, etc.) that reveal performance variability across requests. -### Example Metrics +#### Example Metrics `request_latency`, `ttft`, `inter_token_latency`, `output_token_count`, `input_sequence_length` -### Dependencies +#### Dependencies Record Metrics can depend on raw request/response data and other Record Metrics from the same request. -### Example Scenario +#### Example Scenario `request_latency` measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests. ## Aggregate Metrics Aggregate Metrics are computed by **tracking** or **accumulating** values across **all requests** in **real-time** during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a **single value** representing the entire benchmark run. 
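For reference, the distribution summary described above can be sketched with the standard library; the latency values are invented and AIPerf's own statistics code may differ:

```python
import statistics

latencies_ms = [12.1, 14.8, 13.5, 90.2, 15.0, 13.9, 14.2, 16.7, 13.1, 14.4]  # made-up record values

percentiles = statistics.quantiles(latencies_ms, n=100)  # cut points for p1..p99
summary = {
    "min": min(latencies_ms),
    "max": max(latencies_ms),
    "mean": statistics.mean(latencies_ms),
    "median": statistics.median(latencies_ms),
    "p90": percentiles[89],  # index 89 holds the 90th percentile
    "p99": percentiles[98],  # index 98 holds the 99th percentile
}
print(summary)
```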
-### Example Metrics +#### Example Metrics `request_count`, `error_request_count`, `min_request_timestamp`, `max_response_timestamp` -### Dependencies +#### Dependencies Aggregate Metrics can depend on Record Metrics and other Aggregate Metrics. -### Example Scenario +#### Example Scenario `request_count` increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution). ## Derived Metrics Derived Metrics are computed by applying **mathematical formulas** to other metric results, but are **not** computed per-record like Record Metrics. Instead, these metrics depend on one or more **prerequisite metrics** being available first and are calculated either **after the benchmark completes** for final results or in **real-time** across **all current data** for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies. -### Example Metrics +#### Example Metrics `request_throughput`, `output_token_throughput`, `benchmark_duration` -### Dependencies +#### Dependencies Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics. -### Example Scenario +#### Example Scenario `request_throughput` is computed from `request_count / benchmark_duration_seconds`. This requires both `request_count` and `benchmark_duration` to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec). --- From 6f9c60e5981d935428a825c4cde2196043d24c30 Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Thu, 2 Oct 2025 11:16:30 -0700 Subject: [PATCH 09/11] Update metrics_reference.md Signed-off-by: Anthony Casagrande --- docs/metrics_reference.md | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md index 9051f2845..7a0fb6307 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -6,6 +6,43 @@ This document provides a comprehensive reference of all metrics available in AIPerf for benchmarking LLM inference performance. Metrics are organized by computation type to help you understand when and how each metric is calculated. 
+## Table of Contents + +- [Understanding Metric Types](#understanding-metric-types) + - [Record Metrics](#record-metrics) + - [Aggregate Metrics](#aggregate-metrics) + - [Derived Metrics](#derived-metrics) +- [Quick Reference](#quick-reference) +- [Detailed Metric Descriptions](#detailed-metric-descriptions) + - [Streaming Metrics](#streaming-metrics) + - [Time to First Token (TTFT)](#time-to-first-token-ttft) + - [Time to Second Token (TTST)](#time-to-second-token-ttst) + - [Inter Token Latency (ITL)](#inter-token-latency-itl) + - [Inter Chunk Latency (ICL)](#inter-chunk-latency-icl) + - [Output Token Throughput Per User](#output-token-throughput-per-user) + - [Token Based Metrics](#token-based-metrics) + - [Output Token Count](#output-token-count) + - [Output Sequence Length (OSL)](#output-sequence-length-osl) + - [Input Sequence Length (ISL)](#input-sequence-length-isl) + - [Total Output Tokens](#total-output-tokens) + - [Total Output Sequence Length](#total-output-sequence-length) + - [Total Input Sequence Length](#total-input-sequence-length) + - [Output Token Throughput](#output-token-throughput) + - [Reasoning Metrics](#reasoning-metrics) + - [Reasoning Token Count](#reasoning-token-count) + - [Total Reasoning Tokens](#total-reasoning-tokens) + - [General Metrics](#general-metrics) + - [Request Latency](#request-latency) + - [Request Throughput](#request-throughput) + - [Request Count](#request-count) + - [Error Request Count](#error-request-count) + - [Minimum Request Timestamp](#minimum-request-timestamp) + - [Maximum Response Timestamp](#maximum-response-timestamp) + - [Benchmark Duration](#benchmark-duration) +- [Metric Flags Reference](#metric-flags-reference) + +--- + ## Understanding Metric Types AIPerf computes metrics in three distinct phases during benchmark execution: **Record Metrics**, **Aggregate Metrics**, and **Derived Metrics**. From d20b5c55afbf0236378c1a801d64d0926f2b8b7e Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Thu, 2 Oct 2025 11:18:54 -0700 Subject: [PATCH 10/11] Update metrics_reference.md Signed-off-by: Anthony Casagrande --- docs/metrics_reference.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md index 7a0fb6307..41546b78a 100644 --- a/docs/metrics_reference.md +++ b/docs/metrics_reference.md @@ -68,7 +68,7 @@ Aggregate Metrics are computed by **tracking** or **accumulating** values across `request_count`, `error_request_count`, `min_request_timestamp`, `max_response_timestamp` #### Dependencies -Aggregate Metrics can depend on Record Metrics and other Aggregate Metrics. +Aggregate Metrics can depend on raw request/response data, Record Metrics and other Aggregate Metrics. #### Example Scenario `request_count` increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution). @@ -81,7 +81,8 @@ Derived Metrics are computed by applying **mathematical formulas** to other metr `request_throughput`, `output_token_throughput`, `benchmark_duration` #### Dependencies -Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics. +Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics, but do not have +any knowledge of the individual request/response data. #### Example Scenario `request_throughput` is computed from `request_count / benchmark_duration_seconds`. 
This requires both `request_count` and `benchmark_duration` to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec). From c25567e1f82a4b72f843a5cb3878ff3b3c1a6dcc Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Thu, 2 Oct 2025 11:23:29 -0700 Subject: [PATCH 11/11] Update README.md Signed-off-by: Anthony Casagrande --- README.md | 49 ++++++++++++++++++++++++++++--------------------- 1 file changed, 28 insertions(+), 21 deletions(-) diff --git a/README.md b/README.md index b9e897556..147ee9996 100644 --- a/README.md +++ b/README.md @@ -162,49 +162,56 @@ METRICS REFERENCE ## Metrics Reference -AIPerf provides comprehensive metrics organized into three categories. For detailed descriptions, requirements, and nuances of each metric, see the **[Complete Metrics Reference](docs/metrics_reference.md)**. +AIPerf provides comprehensive metrics organized into four functional categories. For detailed descriptions, requirements, and nuances of each metric, see the **[Complete Metrics Reference](docs/metrics_reference.md)**. -### Record Metrics +### Streaming Metrics -Computed **individually** for **each request** and its **response(s)**. Record metrics produce **statistical distributions** (min, max, mean, p50, p90, p99, etc.). +Metrics specific to streaming requests that measure real-time token generation characteristics. Requires `--streaming` flag. | Metric | Tag | Formula | Unit | |--------|-----|---------|------| -| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - start_perf_ns` | `ms` | | [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` | | [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` | | [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` | | [**Inter Chunk Latency (ICL)**](docs/metrics_reference.md#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` | -| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `output_token_count` | `tokens` | -| [**Reasoning Token Count**](docs/metrics_reference.md#reasoning-token-count) | `reasoning_token_count` | `reasoning_token_count` | `tokens` | -| [**Output Sequence Length (OSL)**](docs/metrics_reference.md#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` | -| [**Input Sequence Length (ISL)**](docs/metrics_reference.md#input-sequence-length-isl) | `input_sequence_length` | `input_token_count` | `tokens` | | [**Output Token Throughput Per User**](docs/metrics_reference.md#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` | -### Aggregate Metrics +### Token Based Metrics -Computed by **tracking** values across **all requests** in **real-time**. Aggregate metrics produce **single scalar values**. +Metrics for token-producing endpoints that track token counts and throughput. Requires text-generating endpoints (chat, completion, etc.). 
| Metric | Tag | Formula | Unit | |--------|-----|---------|------| -| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for request in valid_requests)` | `requests` | -| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for request in error_requests)` | `requests` | -| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp` | `min(timestamp_ns for record in records)` | `datetime` | -| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp` | `max(timestamp_ns + request_latency for record in records)` | `datetime` | +| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `len(tokenizer.encode(content))` | `tokens` | +| [**Output Sequence Length (OSL)**](docs/metrics_reference.md#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` | +| [**Input Sequence Length (ISL)**](docs/metrics_reference.md#input-sequence-length-isl) | `input_sequence_length` | `len(tokenizer.encode(prompt))` | `tokens` | +| [**Total Output Tokens**](docs/metrics_reference.md#total-output-tokens) | `total_output_tokens` | `sum(output_token_count for record in records)` | `tokens` | +| [**Total Output Sequence Length**](docs/metrics_reference.md#total-output-sequence-length) | `total_osl` | `sum(output_sequence_length for record in records)` | `tokens` | +| [**Total Input Sequence Length**](docs/metrics_reference.md#total-input-sequence-length) | `total_isl` | `sum(input_sequence_length for record in records)` | `tokens` | +| [**Output Token Throughput**](docs/metrics_reference.md#output-token-throughput) | `output_token_throughput` | `total_osl / benchmark_duration_seconds` | `tokens/sec` | -### Derived Metrics +### Reasoning Metrics -Computed using **formulas** based on other metrics, but **not** computed per-record. These are calculated either **after the benchmark completes** for final results or in **real-time** across **all current data** for live metrics display. +Metrics specific to models that support reasoning/thinking tokens. Requires models with separate `reasoning_content` field. | Metric | Tag | Formula | Unit | |--------|-----|---------|------| +| [**Reasoning Token Count**](docs/metrics_reference.md#reasoning-token-count) | `reasoning_token_count` | `len(tokenizer.encode(reasoning_content))` | `tokens` | +| [**Total Reasoning Tokens**](docs/metrics_reference.md#total-reasoning-tokens) | `total_reasoning_tokens` | `sum(reasoning_token_count for record in records)` | `tokens` | + +### General Metrics + +Metrics available for all benchmark runs with no special requirements. 
+

| Metric | Tag | Formula | Unit |
|--------|-----|---------|------|
| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - start_perf_ns` | `ms` |
| [**Request Throughput**](docs/metrics_reference.md#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` |
| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for record in records if record.valid)` | `requests` |
| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for record in records if not record.valid)` | `requests` |
| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp` | `min(timestamp_ns for record in records)` | `datetime` |
| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp` | `max(timestamp_ns + request_latency for record in records)` | `datetime` |
| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` |
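
To make the formulas above concrete, here is a minimal sketch of how the three metric types fit together, written against a hypothetical `RequestRecord` structure. It is illustrative only, not AIPerf's actual implementation: the class name, field names, and helper functions are assumptions, and only the formulas themselves come from the tables above.

```python
from dataclasses import dataclass

NS_PER_SEC = 1e9


@dataclass
class RequestRecord:
    """Hypothetical per-request record; field names are illustrative, not AIPerf's API."""
    timestamp_ns: int             # wall-clock time the request was sent (ns)
    start_perf_ns: int            # perf-counter reading when the request was sent
    response_perf_ns: list[int]   # perf-counter reading of each streamed response chunk
    output_token_count: int
    reasoning_token_count: int
    valid: bool = True


def record_metrics(r: RequestRecord) -> dict:
    """Record metrics: computed per request from its own request/response data."""
    request_latency = r.response_perf_ns[-1] - r.start_perf_ns          # responses[-1].perf_ns - start_perf_ns
    ttft = r.response_perf_ns[0] - r.start_perf_ns                      # responses[0].perf_ns - start_perf_ns
    osl = (r.output_token_count or 0) + (r.reasoning_token_count or 0)  # output + reasoning tokens
    itl = (request_latency - ttft) / (osl - 1) if osl > 1 else None     # (request_latency - ttft) / (osl - 1)
    return {"request_latency": request_latency, "ttft": ttft,
            "output_sequence_length": osl, "inter_token_latency": itl}


def summarize(records: list[RequestRecord]) -> dict:
    """Aggregate metrics (tracked across all requests) plus Derived metrics (formulas over other metrics)."""
    valid = [r for r in records if r.valid]
    per_request = [record_metrics(r) for r in valid]

    # Aggregate metrics: counters and min/max timestamps over the whole run.
    request_count = sum(1 for r in records if r.valid)
    error_request_count = sum(1 for r in records if not r.valid)
    min_request_timestamp = min(r.timestamp_ns for r in valid)
    max_response_timestamp = max(r.timestamp_ns + m["request_latency"]
                                 for r, m in zip(valid, per_request))

    # Derived metrics: pure formulas over the results above, no access to raw request data.
    benchmark_duration = (max_response_timestamp - min_request_timestamp) / NS_PER_SEC
    total_osl = sum(m["output_sequence_length"] for m in per_request)
    return {
        "request_count": request_count,
        "error_request_count": error_request_count,
        "benchmark_duration": benchmark_duration,                   # seconds
        "request_throughput": request_count / benchmark_duration,   # requests/sec
        "output_token_throughput": total_osl / benchmark_duration,  # tokens/sec
    }


# Example: one request whose last chunk arrives 0.5 s after it was sent.
records = [RequestRecord(timestamp_ns=1_000, start_perf_ns=0,
                         response_perf_ns=[200_000_000, 350_000_000, 500_000_000],
                         output_token_count=3, reasoning_token_count=0)]
print(summarize(records))
```

With this single record, `benchmark_duration` works out to 0.5 s, `request_throughput` to 2 requests/sec, and `output_token_throughput` to 6 tokens/sec, matching the formulas in the tables above.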