From 0ab9c29f6f7f6a25fc88a202da6ff9e153d9aa9c Mon Sep 17 00:00:00 2001
From: Mark Kurtz
Date: Fri, 25 Apr 2025 05:33:07 +0000
Subject: [PATCH 1/4] Add docs for data/datasets and how to configure them in GuideLLM

---
 README.md        |   1 +
 docs/datasets.md | 192 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 193 insertions(+)
 create mode 100644 docs/datasets.md

diff --git a/README.md b/README.md
index 663cc95..6639ebd 100644
--- a/README.md
+++ b/README.md
@@ -163,6 +163,7 @@ Our comprehensive documentation offers detailed guides and resources to help you

- [**Installation Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/install.md) - This guide provides step-by-step instructions for installing GuideLLM, including prerequisites and setup tips.
- [**Backends Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/backends.md) - A comprehensive overview of supported backends and how to set them up for use with GuideLLM.
+- [**Data/Datasets Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/datasets.md) - Information on supported datasets, including how to use them for benchmarking.
- [**Metrics Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/metrics.md) - Detailed explanations of the metrics used in GuideLLM, including definitions and how to interpret them.
- [**Outputs Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/outputs.md) - Information on the different output formats supported by GuideLLM and how to use them.
- [**Architecture Overview**](https://github.com/neuralmagic/guidellm/tree/main/docs/architecture.md) - A detailed look at GuideLLM's design, components, and how they interact.

diff --git a/docs/datasets.md b/docs/datasets.md
new file mode 100644
index 0000000..9d20e44
--- /dev/null
+++ b/docs/datasets.md
@@ -0,0 +1,192 @@
# Dataset Configurations

GuideLLM supports various dataset configurations to enable benchmarking and evaluation of large language models (LLMs). This document provides a comprehensive guide to configuring datasets for different use cases, along with detailed examples and rationale for choosing specific pathways.

## Data Arguments Overview

The following arguments can be used to configure datasets and their processing:

- `--data`: Specifies the dataset source. This can be a file path, Hugging Face dataset ID, synthetic data configuration, or in-memory data.
- `--data-args`: A JSON string or dictionary argument that controls how datasets are parsed and prepared. This includes specific aliases for GuideLLM flows, such as:
  - `prompt_column`: Specifies the column name for the prompt. By default, GuideLLM will try the most common column names (e.g., `prompt`, `text`, `input`).
  - `prompt_tokens_count_column`: Specifies the column name for the prompt token count. This value is used as the request's prompt token count in token-count metrics. By default, GuideLLM assumes no token count is provided.
  - `output_tokens_count_column`: Specifies the column name for the output token count. This value sets the request's output token count and is used in token-count metrics. By default, GuideLLM assumes no token count is provided.
  - `split`: Specifies the dataset split to use (e.g., `train`, `val`, `test`). By default, GuideLLM will try the most common split names (e.g., `train`, `validation`, `test`) if the dataset has splits; otherwise, it will use the entire dataset.
  - Any remaining arguments are passed directly into the dataset constructor as kwargs.
- `--data-sampler`: Specifies the sampling strategy for datasets. By default, no sampling is applied. When set to `random`, it enables random shuffling of the dataset, which can be useful for creating diverse batches during benchmarking.
- `--processor`: Specifies the processor or tokenizer to use. This is only required for synthetic data generation or when local calculations are specified through configuration settings. By default, the processor is set to the `--model` argument. If `--model` is not supplied, it defaults to the model retrieved from the backend.
- `--processor-args`: A JSON string containing any arguments to pass to the processor or tokenizer constructor. These arguments are passed as a dictionary of kwargs.

### Example Usage

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext" \
  --data-args '{"prompt_column": "prompt", "split": "train"}' \
  --processor "path/to/processor" \
  --processor-args '{"arg1": "value1"}' \
  --data-sampler "random"
```

## Dataset Types

GuideLLM supports several types of datasets, each with its own advantages and use cases. Below are the main dataset types supported by GuideLLM, including synthetic data, Hugging Face datasets, file-based datasets, and in-memory datasets.

### Synthetic Data

Synthetic datasets allow you to generate data on the fly with customizable parameters. This is useful for controlled experiments, stress testing, and simulating specific scenarios. For example, you might want to evaluate how a model handles long prompts or generates outputs with specific characteristics.

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "prompt_tokens=256,output_tokens=128"
```

Or using a JSON string:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data '{"prompt_tokens": 256, "output_tokens": 128}'
```

#### Configuration Options

- `prompt_tokens`: Average number of tokens in prompts. If nothing else is specified, all requests will have this number of tokens.
- `prompt_tokens_stdev`: Standard deviation for prompt tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
- `prompt_tokens_min`: Minimum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the minimum is 1.
- `prompt_tokens_max`: Maximum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the maximum is 5 times the standard deviation.
- `output_tokens`: Average number of tokens in outputs. If nothing else is specified, all requests will have this number of tokens.
- `output_tokens_stdev`: Standard deviation for output tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
- `output_tokens_min`: Minimum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the minimum is 1.
- `output_tokens_max`: Maximum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the maximum is 5 times the standard deviation.
- `samples`: Number of samples to generate (default: 1000). More samples will increase the time taken to generate the dataset before benchmarking, but will also decrease the liklihood of caching requests.
- `source`: Source text for generation (default: `data:prideandprejudice.txt.gz`). This can be any text file, a URL pointing to a text file, or a compressed text file. The text is sampled at word and punctuation granularity and then combined into a single string of the desired length.

#### Notes

- A processor/tokenizer is required. By default, the model passed in or retrieved from the server is used. If unavailable, use the `--processor` argument to specify a directory or Hugging Face model ID containing the processor/tokenizer files.

### Hugging Face Datasets

GuideLLM supports datasets from the Hugging Face Hub or local directories that follow the `datasets` library format. This allows you to easily leverage a wide range of datasets for benchmarking and evaluation with real-world data.

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "garage-bAInd/Open-Platypus"
```

Or using a local dataset:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset"
```

#### Notes

- Hugging Face datasets can be specified by ID, a local directory, or a path to a local Python file.

### File-Based Datasets

GuideLLM supports various file formats for datasets, including text, CSV, JSON, and more. These datasets can be used for benchmarking and evaluation, allowing you to work with structured data in a familiar format that matches your use case.

#### Supported Formats with Examples

- **Text files (`.txt`, `.text`)**
  ```
  Hello, how are you?
  What is your name?
  ```
- **CSV files (`.csv`)**
  ```csv
  prompt,output
  "Hello, how are you?","I'm fine."
  "What is your name?","My name is GuideLLM."
  ```
- **JSON files (`.json`, `.jsonl`)**
  ```json
  [
    {"prompt": "Hello, how are you?", "output": "I'm fine."},
    {"prompt": "What is your name?", "output": "My name is GuideLLM."}
  ]
  ```
- **Parquet files (`.parquet`)**: A binary columnar storage format for efficient data processing.
- **Arrow files (`.arrow`)**: A cross-language development platform for in-memory data.
- **HDF5 files (`.hdf5`)**: A hierarchical data format for storing large amounts of data.

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext"
```

#### Notes

- Ensure the file format matches the expected structure for the dataset.

### In-Memory Datasets

In-memory datasets allow you to directly pass data as Python objects, making them ideal for quick prototyping and testing without the need to save data to disk.
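For the quickest start, a plain list of prompt strings can be passed directly. A minimal sketch using the entrypoint from the fuller example later in this section (all other benchmark arguments are elided, following the convention of that example):

```python
# Minimal in-memory sketch: a plain list of strings, where each item
# serves as a prompt. The full set of supported structures is listed below.
from guidellm.benchmark import benchmark_generative_text

prompts = ["Hello, how are you?", "What is your name?"]

benchmark_generative_text(data=prompts, ...)
```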

### Supported Formats with Examples

- **Dictionary of columns and values**
  ```python
  {
      "column1": ["value1", "value2"],
      "column2": ["value3", "value4"]
  }
  ```
- **List of dictionaries**
  ```python
  [
      {"column1": "value1", "column2": "value3"},
      {"column1": "value2", "column2": "value4"}
  ]
  ```
- **List of items**
  ```python
  ["value1", "value2", "value3"]
  ```

### Example Usage

```python
from guidellm.benchmark import benchmark_generative_text

data = [
    {"prompt": "Hello", "output": "Hi"},
    {"prompt": "How are you?", "output": "I'm fine."}
]

benchmark_generative_text(data=data, ...)
```

### Notes

- Ensure that the data format is consistent and adheres to one of the supported structures.
- For dictionaries, all columns must have the same number of samples.
- For lists of dictionaries, all items must have the same keys.
- For lists of items, all elements must be of the same type.

From 0f42b3779c1f5962ed23a54ad89932d9a48a4f25 Mon Sep 17 00:00:00 2001
From: Mark Kurtz
Date: Fri, 25 Apr 2025 05:34:39 +0000
Subject: [PATCH 2/4] Fix header levels

---
 docs/datasets.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/datasets.md b/docs/datasets.md
index 9d20e44..447896b 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -150,7 +150,7 @@ In-memory datasets allow you to directly pass data as Python objects, making the

-### Supported Formats with Examples
+#### Supported Formats with Examples

 - **Dictionary of columns and values**
   ```python
@@ -171,7 +171,7 @@ In-memory datasets allow you to directly pass the
 ["value1", "value2", "value3"]
 ```

-### Example Usage
+#### Example Usage

 ```python
 from guidellm.benchmark import benchmark_generative_text
@@ -184,7 +184,7 @@ data = [

 benchmark_generative_text(data=data, ...)
 ```

-### Notes
+#### Notes

 - Ensure that the data format is consistent and adheres to one of the supported structures.
 - For dictionaries, all columns must have the same number of samples.

From 24706640c50609eeb3083210cf2d3db000c812da Mon Sep 17 00:00:00 2001
From: Mark Kurtz
Date: Fri, 25 Apr 2025 01:36:52 -0400
Subject: [PATCH 3/4] Update docs/datasets.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
 docs/datasets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/datasets.md b/docs/datasets.md
index 447896b..086c2d7 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -69,7 +69,7 @@ guidellm benchmark \
 - `output_tokens_stdev`: Standard deviation for output tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
 - `output_tokens_min`: Minimum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the minimum is 1.
 - `output_tokens_max`: Maximum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the maximum is 5 times the standard deviation.
-- `samples`: Number of samples to generate (default: 1000). More samples will increase the time taken to generate the dataset before benchmarking, but will also decrease the liklihood of caching requests.
+- `samples`: Number of samples to generate (default: 1000). More samples will increase the time taken to generate the dataset before benchmarking, but will also decrease the likelihood of caching requests.
 - `source`: Source text for generation (default: `data:prideandprejudice.txt.gz`). This can be any text file, a URL pointing to a text file, or a compressed text file. The text is sampled at word and punctuation granularity and then combined into a single string of the desired length.

From 09196a5e42430325be7aaa98c7cedccfc38d56ab Mon Sep 17 00:00:00 2001
From: Mark Kurtz
Date: Fri, 25 Apr 2025 16:26:39 +0000
Subject: [PATCH 4/4] Update dataset documentation based on review comments

---
 docs/datasets.md | 62 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/docs/datasets.md b/docs/datasets.md
index 086c2d7..86aa3af 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -24,7 +24,7 @@ guidellm benchmark \
   --target "http://localhost:8000" \
   --rate-type "throughput" \
   --max-requests 1000 \
-  --data "path/to/dataset.ext" \
+  --data "path/to/dataset|dataset_id" \
   --data-args '{"prompt_column": "prompt", "split": "train"}' \
   --processor "path/to/processor" \
   --processor-args '{"arg1": "value1"}' \
   --data-sampler "random"
@@ -103,6 +103,8 @@ guidellm benchmark \
 #### Notes

 - Hugging Face datasets can be specified by ID, a local directory, or a path to a local Python file.
+- A supported Hugging Face datasets format is defined as one that can be loaded using the `datasets` library with the `load_dataset` function and is therefore representable as a `Dataset`, `DatasetDict`, `IterableDataset`, or `IterableDatasetDict`. More information on the supported data types and additional args for the underlying use of `load_dataset` can be found in the [Hugging Face datasets documentation](https://huggingface.co/docs/datasets/en/loading#hugging-face-hub).
+- A processor/tokenizer is only required if `GUIDELLM__PREFERRED_PROMPT_TOKENS_SOURCE="local"` or `GUIDELLM__PREFERRED_OUTPUT_TOKENS_SOURCE="local"` is set in the environment. In this case, the processor/tokenizer must be specified using the `--processor` argument. If not set, the processor/tokenizer will be set to the model passed in or retrieved from the server.

### File-Based Datasets

GuideLLM supports various file formats for datasets, including text, CSV, JSON, and more. These datasets can be used for benchmarking and evaluation, allowing you to work with structured data in a familiar format that matches your use case.

#### Supported Formats with Examples

- **Text files (`.txt`, `.text`)**: Where each line is a separate prompt to use.
  ```
  Hello, how are you?
  What is your name?
  ```
- **CSV files (`.csv`)**: Where each row is a separate dataset entry and the first row contains the column names. The columns should include `prompt` or another common name for the prompt, which will be used as the prompt column. Additional columns can be included based on the previously mentioned aliases for the `--data-args` argument.
  ```csv
  prompt,output_tokens_count,additional_column,additional_column2
  "Hello, how are you?",5,foo,bar
  "What is your name?",3,baz,qux
  ```
- **JSON Lines files (`.jsonl`)**: Where each line is a separate JSON object. The objects should include `prompt` or another common name for the prompt, which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the `--data-args` argument.
  ```json
  {"prompt": "Hello, how are you?", "output_tokens_count": 5, "additional_column": "foo", "additional_column2": "bar"}
  {"prompt": "What is your name?", "output_tokens_count": 3, "additional_column": "baz", "additional_column2": "qux"}
  ```
- **JSON files (`.json`)**: Where the entire dataset is represented as a JSON array of objects nested under a specific key. To surface the correct key, a `--data-args` entry of `"field": "NAME"` must be passed, pointing to where the array exists. The objects should include `prompt` or another common name for the prompt, which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the `--data-args` argument.
  ```json
  {
    "version": "1.0",
    "data": [
      {"prompt": "Hello, how are you?", "output_tokens_count": 5, "additional_column": "foo", "additional_column2": "bar"},
      {"prompt": "What is your name?", "output_tokens_count": 3, "additional_column": "baz", "additional_column2": "qux"}
    ]
  }
  ```
- **Parquet files (`.parquet`)**: A binary columnar storage format for efficient data processing. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.
- **Arrow files (`.arrow`)**: A cross-language development platform for in-memory data. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.
- **HDF5 files (`.hdf5`)**: A hierarchical data format for storing large amounts of data. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext" \
  --data-args '{"prompt_column": "prompt", "split": "train"}'
```

Where `.ext` can be any of the supported file formats listed above.

#### Notes

- Ensure the file format matches the expected structure for the dataset and is listed as a supported format.
- The `--data-args` argument can be used to specify additional parameters for parsing the dataset, such as the prompt column name or the split to use.
- A processor/tokenizer is only required if `GUIDELLM__PREFERRED_PROMPT_TOKENS_SOURCE="local"` or `GUIDELLM__PREFERRED_OUTPUT_TOKENS_SOURCE="local"` is set in the environment. In this case, the processor/tokenizer must be specified using the `--processor` argument. If not set, the processor/tokenizer will be set to the model passed in or retrieved from the server.
- More information on the supported formats and additional args for the underlying use of `load_dataset` can be found in the [Hugging Face datasets documentation](https://huggingface.co/docs/datasets/en/loading#local-and-remote-files).
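Since file-based datasets rely on the Hugging Face `datasets` library under the hood, one way to verify that a file parses as expected before benchmarking is to load it directly with `load_dataset`. A minimal sketch, assuming a hypothetical local `data.jsonl` with a `prompt` field on each line (for CSV, swap the builder name to `"csv"`):

```python
# Sketch: load a local JSON Lines file the same way the Hugging Face
# `datasets` library would, and confirm the prompt column is present.
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.jsonl", split="train")

print(dataset.column_names)   # e.g., ['prompt', 'output_tokens_count', ...]
print(dataset[0]["prompt"])   # first prompt that would be sent as a request
```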
### In-Memory Datasets

In-memory datasets allow you to directly pass data as Python objects, making them ideal for quick prototyping and testing without the need to save data to disk.

#### Supported Formats with Examples

- **Dictionary of columns and values**: Where each key is a column name and the values are lists of data points. The keys should include `prompt` or another common name for the prompt, which will be used as the prompt column. Additional columns can be included based on the previously mentioned aliases for the `--data-args` argument.
  ```python
  {
      "column1": ["value1", "value2"],
      "column2": ["value3", "value4"]
  }
  ```
- **List of dictionaries**: Where each dictionary represents a single data point with key-value pairs. The dictionaries should include `prompt` or another common name for the prompt, which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the `--data-args` argument.
  ```python
  [
      {"column1": "value1", "column2": "value3"},
      {"column1": "value2", "column2": "value4"}
  ]
  ```
- **List of items**: Where each item is a single data point used as the prompt.
  ```python
  [
      "value1",
      "value2",
      "value3"
  ]
  ```

#### Example Usage

```python
from guidellm.benchmark import benchmark_generative_text

data = [
    {"prompt": "Hello", "output": "Hi"},
    {"prompt": "How are you?", "output": "I'm fine."}
]

benchmark_generative_text(data=data, ...)
```

#### Notes

- Ensure that the data format is consistent and adheres to one of the supported structures.
- For dictionaries, all columns must have the same number of samples.
- For lists of dictionaries, all items must have the same keys.
- For lists of items, all elements must be of the same type.
- A processor/tokenizer is only required if `GUIDELLM__PREFERRED_PROMPT_TOKENS_SOURCE="local"` or `GUIDELLM__PREFERRED_OUTPUT_TOKENS_SOURCE="local"` is set in the environment. In this case, the processor/tokenizer must be specified using the `--processor` argument. If not set, the processor/tokenizer will be set to the model passed in or retrieved from the server.
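To see why the consistency rules above matter, the structures can be materialized with the Hugging Face `datasets` library, which rejects columns of mismatched lengths. A minimal sketch (illustrative only; GuideLLM performs its own conversion internally, and `Dataset.from_list` requires a reasonably recent `datasets` release):

```python
# Sketch: the in-memory structures expressed as Hugging Face datasets.
from datasets import Dataset

# Dictionary of columns: every column must have the same number of samples.
column_data = {
    "prompt": ["Hello, how are you?", "What is your name?"],
    "output_tokens_count": [5, 3],
}
ds_from_columns = Dataset.from_dict(column_data)

# List of dictionaries: every item must carry the same keys.
row_data = [
    {"prompt": "Hello, how are you?", "output_tokens_count": 5},
    {"prompt": "What is your name?", "output_tokens_count": 3},
]
ds_from_rows = Dataset.from_list(row_data)

print(ds_from_columns.column_names)  # ['prompt', 'output_tokens_count']
print(ds_from_rows.column_names)     # ['prompt', 'output_tokens_count']
```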