27 commits
552375f
Add replay from trace strategy
VincentG1234 Feb 28, 2026
a957299
fix the CI mypy
VincentG1234 Mar 11, 2026
b557fe1
add e2e tests
VincentG1234 Mar 15, 2026
edc18ba
fix ruff error CI
VincentG1234 Mar 17, 2026
18433f5
Add trace replay documentation
VincentG1234 Mar 18, 2026
a8e6444
refactor: move trace_io to utils for cross-component sharing
VincentG1234 Apr 20, 2026
584f753
replace manual trace loading with datasets.load_dataset
VincentG1234 Apr 20, 2026
cde76f4
refactor benchmark.entrypoints: remove max_requests data truncation
VincentG1234 Apr 20, 2026
a229476
refactor benchmark.entrypoint: remove replay special case
VincentG1234 Apr 21, 2026
d43e920
fix replay profile dataset filtering semantics
VincentG1234 Apr 21, 2026
c0739e4
erase useless diffs
VincentG1234 Apr 22, 2026
7d76d5f
fix trace replay tests for multiprocessing context
VincentG1234 Apr 22, 2026
8a7adde
fix ruff issue
VincentG1234 Apr 22, 2026
a6126fb
refactor trace_synthetic and trace_io: remove max_rows; use data_samp…
VincentG1234 Apr 23, 2026
ba792eb
fix replay profile data sample handling
VincentG1234 Apr 25, 2026
fc524d2
test: restore e2e utils
VincentG1234 Apr 25, 2026
c0150d0
fix trace replay ordering alignment
VincentG1234 Apr 26, 2026
b532df9
Fix trace replay alignment and semantics across loading, scheduling, …
VincentG1234 Apr 26, 2026
e3f317d
Fix trace replay alignment and semantics across loading, scheduling, …
VincentG1234 Apr 26, 2026
abfa41c
refactor unit tests: strengthen trace replay unit coverage
VincentG1234 Apr 27, 2026
09dcb32
fix ci: fix mdformat pre-commit on datasets guide
VincentG1234 Apr 28, 2026
6d7eac5
docs: clarify trace replay timestamp semantics
VincentG1234 Apr 29, 2026
b6c56f3
docs: clarify trace replay dataset examples and explanations
VincentG1234 Apr 30, 2026
cfeecc5
docs: clarify trace io and profiles wording
VincentG1234 May 2, 2026
0a1c7eb
fix replay trace scheduling completion and optimize synthetic prompt …
VincentG1234 May 5, 2026
b981ce9
fix ci with precommit
VincentG1234 May 9, 2026
4c0b43b
enforce single trace data source in resolve_args
VincentG1234 May 12, 2026
29 changes: 29 additions & 0 deletions docs/getting-started/benchmark.md
@@ -65,6 +65,7 @@ GuideLLM offers a wide range of configuration options to customize your benchmar
| `--random-seed` | Random seed for reproducibility | `--random-seed 42` |
| `--max-seconds` | Duration for each benchmark in seconds | `--max-seconds 30` |
| `--max-requests` | Maximum number of requests for each benchmark | `--max-requests 1000` |
| `--data-samples` | Maximum number of dataset rows to load | `--data-samples 1000` |
| `--output-dir` | Directory path to save output files | `--output-dir results/` |
| `--outputs` | Output formats to generate | `--outputs json csv html` |

@@ -187,6 +188,34 @@ guidellm benchmark \

You can customize synthetic data generation with additional parameters such as standard deviation, minimum, and maximum values. See the [Datasets Synthetic data documentation](../guides/datasets.md#synthetic-data) for more details.

### Trace Replay Benchmarking (beta)

For realistic load testing, replay trace events using each row's timestamp and token lengths. Trace files must be JSONL and are loaded with the `trace_synthetic` data type. By default, each row uses `timestamp`, `input_length`, and `output_length` fields. Timestamps may be absolute or monotonic values; GuideLLM sorts them and converts them to offsets from the first event before scheduling:

```json
{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}
{"timestamp": 1234500.5, "input_length": 512, "output_length": 64}
```

In this example, the second request is scheduled 0.5 seconds after the first request.
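
Conceptually, the conversion from raw timestamps to scheduling offsets works as in the sketch below. This is illustrative only; the function name is not part of the GuideLLM API:

```python
def to_relative_offsets(timestamps: list[float]) -> list[float]:
    """Sort raw trace timestamps and convert them to offsets from the first event."""
    ordered = sorted(timestamps)
    first = ordered[0]
    return [t - first for t in ordered]


# The two example rows above yield offsets [0.0, 0.5]:
# the second request fires 0.5 seconds after the first.
offsets = to_relative_offsets([1234500.5, 1234500.0])
```

Because only the differences between timestamps matter, absolute epoch times and monotonic clock values produce identical schedules.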

Run with the `replay` profile:

```bash
guidellm benchmark \
--target "http://localhost:8000" \
--data path/to/trace.jsonl \
--data-args type_=trace_synthetic \
--profile replay \
--rate 1.0
```

The `--rate` parameter acts as a time scale for the intervals between trace events, not requests per second: `1.0` preserves the original timing, `2.0` doubles the intervals and runs twice as long, and `0.5` halves the intervals and runs twice as fast.

GuideLLM orders trace rows by timestamp before scheduling and payload generation, so each scheduled event uses the token lengths from the same sorted row. Use `--data-samples` to limit how many trace rows are loaded and replayed. `--max-requests` remains a runtime completion constraint; it does not truncate the trace dataset.
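
The time-scale semantics can be sketched as follows (an illustrative helper, not a GuideLLM function):

```python
def schedule_times(
    start_time: float, relative_offsets: list[float], time_scale: float = 1.0
) -> list[float]:
    """Map relative trace offsets to absolute scheduling times.

    ``time_scale`` stretches or compresses the intervals between events:
    1.0 preserves the original timing, 2.0 doubles the intervals,
    0.5 halves them.
    """
    return [start_time + time_scale * offset for offset in relative_offsets]


# With offsets [0.0, 0.5] and --rate 2.0, the 0.5 s gap becomes 1.0 s.
times = schedule_times(100.0, [0.0, 0.5], time_scale=2.0)
```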

If your trace uses different column names, map them with `timestamp_column`, `prompt_tokens_column`, and `output_tokens_column` in `--data-args`.

### Working with Real Data

While synthetic data is convenient for quick tests, you can benchmark with real-world data:
47 changes: 47 additions & 0 deletions docs/guides/datasets.md
@@ -13,6 +13,10 @@ The following arguments can be used to configure datasets and their processing:
- `prompt_column`: Specifies the column name for the prompt. By default, GuideLLM will try the most common column names (e.g., `prompt`, `text`, `input`).
- `prompt_tokens_count_column`: Specifies the column name for the prompt token count. These are used to set the request prompt token count for counting metrics. By default, GuideLLM assumes no token count is provided.
- `output_tokens_count_column`: Specifies the column name for the output token count. These are used to set the request output token count for the request and counting metrics. By default, GuideLLM assumes no token count is provided.
- `type_`: Selects a specialized dataset deserializer, such as `trace_synthetic` for trace replay files.
- `timestamp_column`: Specifies the timestamp column for `trace_synthetic` data. The default is `timestamp`.
- `prompt_tokens_column`: Specifies the prompt token length column for `trace_synthetic` data. The default is `input_length`.
- `output_tokens_column`: Specifies the output token length column for `trace_synthetic` data. The default is `output_length`.
- `split`: Specifies the dataset split to use (e.g., `train`, `val`, `test`). By default, GuideLLM will try the most common split names (e.g., `train`, `validation`, `test`) if the dataset has splits, otherwise it will use the entire dataset.
- Any remaining arguments are passed directly into the dataset constructor as kwargs.
- `--data-sampler`: Specifies the sampling strategy for datasets. By default, no sampling is applied. When set to `random`, it enables random shuffling of the dataset, which can be useful for creating diverse batches during benchmarking.
@@ -116,22 +120,62 @@ GuideLLM supports various file formats for datasets, including text, CSV, JSON,
#### Supported Formats with Examples

- **Text files (`.txt`, `.text`)**: Where each line is a separate prompt to use.

```
Hello, how are you?
What is your name?
```

- **CSV files (`.csv`)**: Where each row is a separate dataset entry and the first row contains the column names. The columns should include `prompt` or other common names for the prompt which will be used as the prompt column. Additional columns can be included based on the previously mentioned aliases for the `--data-column-mapper` argument.

```csv
prompt,output_tokens_count,additional_column,additional_column2
Hello, how are you?,5,foo,bar
What is your name?,3,baz,qux
```

- **JSON Lines files (`.jsonl`)**: Where each line is a separate JSON object. The objects should include `prompt` or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the `--data-args` argument.

```json
{"prompt": "Hello, how are you?", "output_tokens_count": 5, "additional_column": "foo", "additional_column2": "bar"}
{"prompt": "What is your name?", "output_tokens_count": 3, "additional_column": "baz", "additional_column2": "qux"}
```

- **Trace files (`.jsonl` with `trace_synthetic` type)**: Specialized JSONL files for replay benchmarking with `timestamp`, `input_length`, and `output_length` fields. Used with `--profile replay` to replay trace events using each row's timestamp and token lengths. Timestamps must be numeric values in seconds on a shared timeline; any consistent zero point works, because GuideLLM sorts them and converts them to offsets from the first event before scheduling. Date strings are not parsed yet, so provide timestamps as numbers. See [Trace Replay Benchmarking](../getting-started/benchmark.md#trace-replay-benchmarking).

```json
{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}
{"timestamp": 1234500.5, "input_length": 512, "output_length": 64}
```

In this example, the second request is scheduled 0.5 seconds after the first request. Trace rows are ordered by timestamp before GuideLLM schedules requests and generates synthetic payloads. This keeps each scheduled event aligned with the prompt and output token lengths from the same row.

Use `--data-args type_=trace_synthetic` to enable trace loading:

```bash
guidellm benchmark \
--target http://localhost:8000 \
--profile replay \
--rate 1.0 \
--data path/to/trace.jsonl \
--data-args type_=trace_synthetic
```

If your trace uses different column names, configure them with `timestamp_column`, `prompt_tokens_column`, and `output_tokens_column`:

```bash
guidellm benchmark \
--target http://localhost:8000 \
--profile replay \
--rate 1.0 \
--data replay.jsonl \
--data-args type_=trace_synthetic,timestamp_column=timestamp,prompt_tokens_column=input_length,output_tokens_column=output_length
```

For replay, `--rate` is a time scale for the intervals between trace events rather than requests per second. Use `--data-samples` to limit how many trace rows are loaded and replayed. Use `--max-requests` only as a runtime completion constraint; it does not limit the trace rows loaded from the file.
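
The interaction between the two flags can be sketched as below. The helper name is hypothetical, not part of the GuideLLM API:

```python
def select_trace_rows(
    relative_offsets: list[float], data_samples: int = -1
) -> list[float]:
    """--data-samples truncates the loaded trace; -1 means load everything.

    --max-requests is enforced later, at runtime, and never shrinks this list.
    """
    if data_samples > 0:
        return relative_offsets[:data_samples]
    return relative_offsets


# Loading four trace rows with --data-samples 2 replays only the first two.
rows = select_trace_rows([0.0, 1.0, 2.0, 3.0], data_samples=2)
```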

- **JSON files (`.json`)**: Where the entire dataset is represented as a JSON array of objects nested under a specific key. To tell GuideLLM which key holds the array, pass a `--data-column-mapper` argument of `"field": "NAME"`. The objects should include `prompt` or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the `--data-column-mapper` argument.

```json
{
"version": "1.0",
@@ -141,8 +185,11 @@
]
}
```

- **Parquet files (`.parquet`)**: A binary columnar storage format for efficient data processing. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.

- **Arrow files (`.arrow`)**: A cross-language development platform for in-memory data. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.

- **HDF5 files (`.hdf5`)**: A hierarchical data format for storing large amounts of data. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.

#### Example Commands
9 changes: 9 additions & 0 deletions src/guidellm/benchmark/entrypoints.py
@@ -355,6 +355,8 @@ async def resolve_profile(
max_global_error_rate: float | None,
over_saturation: dict[str, Any] | None = None,
console: Console | None = None,
data: list[Any] | None = None,
**profile_kwargs: Any,
) -> Profile:
"""
Resolve and configure a benchmark profile with rate and constraint settings.
@@ -376,6 +378,8 @@ async def resolve_profile(
:param max_global_error_rate: Maximum global error rate threshold before stopping
:param over_saturation: Over-saturation detection configuration (dict)
:param console: Console instance for progress reporting, or None
:param data: Optional list of data sources.
:param profile_kwargs: Additional profile-specific arguments.
:return: Configured Profile instance ready for benchmarking
:raises ValueError: If constraints are provided with a pre-configured Profile
"""
@@ -403,6 +407,8 @@
random_seed=random_seed,
rampup_duration=rampup,
constraints={**constraints},
data=data,
**profile_kwargs,
)
elif constraints:
raise ValueError(
@@ -536,6 +542,9 @@
max_global_error_rate=args.max_global_error_rate,
over_saturation=args.over_saturation,
console=console,
data=args.data,
data_args=args.data_args,
data_samples=request_loader.info.get("data_samples", -1),
)
output_formats = await resolve_output_formats(
outputs=args.outputs, output_dir=args.output_dir, console=console
110 changes: 109 additions & 1 deletion src/guidellm/benchmark/profiles.py
@@ -13,6 +13,7 @@

from abc import ABC, abstractmethod
from collections.abc import Generator
from pathlib import Path
from typing import TYPE_CHECKING, Annotated, Any, ClassVar, Literal

import numpy as np
@@ -37,8 +38,10 @@
SchedulingStrategy,
SynchronousStrategy,
ThroughputStrategy,
TraceReplayStrategy,
)
from guidellm.schemas import PydanticClassRegistryMixin
from guidellm.utils.trace_io import load_relative_timestamps

if TYPE_CHECKING:
from guidellm.benchmark.schemas import Benchmark
@@ -48,13 +51,14 @@
"ConcurrentProfile",
"Profile",
"ProfileType",
"ReplayProfile",
"SweepProfile",
"SynchronousProfile",
"ThroughputProfile",
]

ProfileType = Annotated[
Literal["synchronous", "concurrent", "throughput", "async", "sweep"],
Literal["synchronous", "concurrent", "throughput", "async", "sweep", "replay"],
"Profile type identifiers for polymorphic deserialization",
]

@@ -328,6 +332,110 @@ def next_strategy(
return SynchronousStrategy()


@Profile.register("replay")
class ReplayProfile(Profile):
"""
Replay a trace file:
schedule each request at start_time + time_scale * relative_timestamp[i].

For this profile, the ``rate`` argument is interpreted as time_scale (scale factor
applied to relative timestamps), not as requests per second.

When ``data_samples`` is set, the replayed timestamps are truncated to match
the sampled dataset size.
"""

type_: Literal["replay"] = "replay" # type: ignore[assignment]
relative_timestamps: list[float] = Field(
description="Request start times relative to first event (first = 0)",
)
time_scale: float = Field(
default=1.0,
gt=0,
description="Scale factor applied to relative timestamps",
)

@classmethod
def resolve_args(
cls,
rate_type: str,
rate: list[float] | None,
random_seed: int,
**kwargs: Any,
) -> dict[str, Any]:
_ = (rate_type, random_seed) # unused
data = kwargs.get("data")
if not data:
raise ValueError("Replay profile requires data (path to trace file)")
if len(data) != 1:
raise ValueError(
f"ReplayProfile requires exactly one data source, received {len(data)}"
)
if not data[0]:
raise ValueError("Replay profile requires data (path to trace file)")
path = Path(data[0]) if isinstance(data[0], str) else data[0]
if not path.exists():
raise ValueError(f"Replay trace file not found: {path}")

# For replay profile, rate is interpreted as time_scale (not requests per
# second)
time_scale = rate[0] if rate and len(rate) > 0 else 1.0

# Honor a custom timestamp column when configured via --data-args so the
# replay profile and trace_synthetic deserializer use the same field.
data_args = kwargs.get("data_args") or []
first_args = data_args[0] if data_args else {}
timestamp_column = "timestamp"
if isinstance(first_args, dict):
raw_timestamp_column = first_args.get("timestamp_column")
if isinstance(raw_timestamp_column, str) and raw_timestamp_column.strip():
timestamp_column = raw_timestamp_column

relative_timestamps = load_relative_timestamps(
path, timestamp_column=timestamp_column
)
data_samples = kwargs.get("data_samples", -1)
if isinstance(data_samples, int) and data_samples > 0:
relative_timestamps = relative_timestamps[:data_samples]

if not relative_timestamps:
raise ValueError(
"No timestamps remain after applying data_samples. "
"The trace is empty or all events were filtered out."
)

constraints = dict(kwargs.get("constraints") or {})
if not any(
key in constraints
for key in ("max_number", "max_num", "max_requests", "max_req")
):
constraints["max_requests"] = len(relative_timestamps)

return {
"relative_timestamps": relative_timestamps,
"time_scale": time_scale,
"constraints": constraints,
}

@property
def strategy_types(self) -> list[str]:
return ["trace"]

def next_strategy(
self,
prev_strategy: SchedulingStrategy | None,
prev_benchmark: Benchmark | None,
) -> TraceReplayStrategy | None:
_ = prev_benchmark
# Replay has a single strategy; return it once, then None
if prev_strategy is not None:
return None
return TraceReplayStrategy(
relative_timestamps=self.relative_timestamps,
time_scale=self.time_scale,
)


@Profile.register("concurrent")
class ConcurrentProfile(Profile):
"""
2 changes: 2 additions & 0 deletions src/guidellm/data/deserializers/__init__.py
@@ -25,6 +25,7 @@
SyntheticTextDataset,
SyntheticTextDatasetDeserializer,
)
from .trace_synthetic import TraceSyntheticDatasetDeserializer

__all__ = [
"ArrowFileDatasetDeserializer",
@@ -46,4 +47,5 @@
"SyntheticTextDatasetDeserializer",
"TarFileDatasetDeserializer",
"TextFileDatasetDeserializer",
"TraceSyntheticDatasetDeserializer",
]