Add docs for data/datasets and how to configure them in GuideLLM #137
# Dataset Configurations

GuideLLM supports various dataset configurations to enable benchmarking and evaluation of large language models (LLMs). This document provides a comprehensive guide to configuring datasets for different use cases, along with detailed examples and rationale for choosing specific pathways.

## Data Arguments Overview

The following arguments can be used to configure datasets and their processing:

- `--data`: Specifies the dataset source. This can be a file path, Hugging Face dataset ID, synthetic data configuration, or in-memory data.
- `--data-args`: A JSON string or dictionary that controls how datasets are parsed and prepared. It includes aliases specific to GuideLLM flows:
  - `prompt_column`: Specifies the column name for the prompt. By default, GuideLLM tries the most common column names (e.g., `prompt`, `text`, `input`).
  - `prompt_tokens_count_column`: Specifies the column name for the prompt token count, which is used as the request's prompt token count when computing token metrics. By default, GuideLLM assumes no token count is provided.
  - `output_tokens_count_column`: Specifies the column name for the output token count, which is used as the request's output token count and when computing token metrics. By default, GuideLLM assumes no token count is provided.
  - `split`: Specifies the dataset split to use (e.g., `train`, `val`, `test`). By default, GuideLLM tries the most common split names (e.g., `train`, `validation`, `test`) if the dataset has splits; otherwise, it uses the entire dataset.
  - Any remaining arguments are passed directly into the dataset constructor as kwargs.
- `--data-sampler`: Specifies the sampling strategy for the dataset. By default, no sampling is applied. When set to `random`, the dataset is randomly shuffled, which can be useful for creating diverse batches during benchmarking.
- `--processor`: Specifies the processor or tokenizer to use. This is only required for synthetic data generation or when local calculations are specified through configuration settings. By default, the processor is set to the `--model` argument. If `--model` is not supplied, it defaults to the model retrieved from the backend.
- `--processor-args`: A JSON string containing any arguments to pass to the processor or tokenizer constructor. These arguments are passed as a dictionary of kwargs.

### Example Usage

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext" \
  --data-args '{"prompt_column": "prompt", "split": "train"}' \
  --processor "path/to/processor" \
  --processor-args '{"arg1": "value1"}' \
  --data-sampler "random"
```
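
If your dataset already stores per-request token counts, the same alias mechanism can map those columns as well. The column names below are illustrative; match them to your dataset's actual schema:

```bash
# Column names here are hypothetical; replace them with the actual columns in your dataset.
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext" \
  --data-args '{"prompt_column": "question", "prompt_tokens_count_column": "prompt_tokens", "output_tokens_count_column": "output_tokens"}'
```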

## Dataset Types

GuideLLM supports several types of datasets, each with its own advantages and use cases. Below are the main dataset types supported by GuideLLM, including synthetic data, Hugging Face datasets, file-based datasets, and in-memory datasets.

### Synthetic Data

Synthetic datasets allow you to generate data on the fly with customizable parameters. This is useful for controlled experiments, stress testing, and simulating specific scenarios. For example, you might want to evaluate how a model handles long prompts or generates outputs with specific characteristics.

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "prompt_tokens=256,output_tokens=128"
```

Or using a JSON string:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data '{"prompt_tokens": 256, "output_tokens": 128}'
```

#### Configuration Options

The following options control the generated token distributions; a combined example follows this list:

- `prompt_tokens`: Average number of tokens in prompts. If nothing else is specified, all requests will have this number of prompt tokens.
- `prompt_tokens_stdev`: Standard deviation for prompt tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied but min/max are specified, a uniform distribution is used.
- `prompt_tokens_min`: Minimum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the minimum is 1.
- `prompt_tokens_max`: Maximum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the maximum is 5 times the standard deviation.
- `output_tokens`: Average number of tokens in outputs. If nothing else is specified, all requests will have this number of output tokens.
- `output_tokens_stdev`: Standard deviation for output tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied but min/max are specified, a uniform distribution is used.
- `output_tokens_min`: Minimum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the minimum is 1.
- `output_tokens_max`: Maximum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the maximum is 5 times the standard deviation.
- `samples`: Number of samples to generate (default: 1000). More samples increase the time taken to generate the dataset before benchmarking, but also decrease the likelihood of requests being cached.
- `source`: Source text for generation (default: `data:prideandprejudice.txt.gz`). This can be any text file, a URL pointing to a text file, or a compressed text file. The text is sampled at word and punctuation granularity and then combined into a single string of the desired length.
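
As a sketch combining these options (the values are illustrative), the following generates prompts averaging 256 tokens with a standard deviation of 32, clamped to the range 128–512, along with fixed 128-token outputs and 500 samples:

```bash
# Illustrative values; tune them to the scenario you want to simulate.
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "prompt_tokens=256,prompt_tokens_stdev=32,prompt_tokens_min=128,prompt_tokens_max=512,output_tokens=128,samples=500"
```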

#### Notes

- A processor/tokenizer is required. By default, the model passed in or retrieved from the server is used. If unavailable, use the `--processor` argument to specify a directory or Hugging Face model ID containing the processor/tokenizer files.

### Hugging Face Datasets

GuideLLM supports datasets from the Hugging Face Hub or local directories that follow the `datasets` library format. This allows you to easily leverage a wide range of datasets for benchmarking and evaluation with real-world data.

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "garage-bAInd/Open-Platypus"
```

Or using a local dataset:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset"
```
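
Column and split selection work the same way for Hugging Face datasets. For example (the split and column names below are illustrative and depend on the specific dataset's schema):

```bash
# "train" and "instruction" are illustrative; check the dataset card for the actual split and column names.
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "garage-bAInd/Open-Platypus" \
  --data-args '{"split": "train", "prompt_column": "instruction"}'
```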

#### Notes

- Hugging Face datasets can be specified by ID, a local directory, or a path to a local Python file.

### File-Based Datasets

GuideLLM supports various file formats for datasets, including text, CSV, JSON, and more. These datasets can be used for benchmarking and evaluation, allowing you to work with structured data in a familiar format that matches your use case.

#### Supported Formats with Examples

- **Text files (`.txt`, `.text`)**
  ```
  Hello, how are you?
  What is your name?
  ```
- **CSV files (`.csv`)**
  ```csv
  prompt,output
  "Hello, how are you?","I'm fine."
  "What is your name?","My name is GuideLLM."
  ```
- **JSON files (`.json`, `.jsonl`)** (for the line-delimited `.jsonl` layout, see the note after this list)
  ```json
  [
    {"prompt": "Hello, how are you?", "output": "I'm fine."},
    {"prompt": "What is your name?", "output": "My name is GuideLLM."}
  ]
  ```
- **Parquet files (`.parquet`)**: a binary columnar storage format for efficient data processing.
- **Arrow files (`.arrow`)**: a cross-language development platform for in-memory data.
- **HDF5 files (`.hdf5`)**: a hierarchical data format for storing large amounts of data.
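
For `.jsonl` files, the same records are written as one JSON object per line rather than as a JSON array:

```jsonl
{"prompt": "Hello, how are you?", "output": "I'm fine."}
{"prompt": "What is your name?", "output": "My name is GuideLLM."}
```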

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext"
```

#### Notes

- Ensure the file format matches the expected structure for the dataset.

### In-Memory Datasets

In-memory datasets allow you to directly pass data as Python objects, making them ideal for quick prototyping and testing without the need to save data to disk.

#### Supported Formats with Examples

- **Dictionary of columns and values**
  ```python
  {
      "column1": ["value1", "value2"],
      "column2": ["value3", "value4"]
  }
  ```
- **List of dictionaries**
  ```python
  [
      {"column1": "value1", "column2": "value3"},
      {"column1": "value2", "column2": "value4"}
  ]
  ```
- **List of items**
  ```python
  ["value1", "value2", "value3"]
  ```

#### Example Usage

```python
from guidellm.benchmark import benchmark_generative_text

data = [
    {"prompt": "Hello", "output": "Hi"},
    {"prompt": "How are you?", "output": "I'm fine."}
]

benchmark_generative_text(data=data, ...)
```

#### Notes

- Ensure that the data format is consistent and adheres to one of the supported structures.
- For dictionaries, all columns must have the same number of samples.
- For lists of dictionaries, all items must have the same keys.
- For lists of items, all elements must be of the same type.