Add docs for data/datasets and how to configure them in GuideLLM #137
# Dataset Configurations

GuideLLM supports various dataset configurations to enable benchmarking and evaluation of large language models (LLMs). This document provides a comprehensive guide to configuring datasets for different use cases, along with detailed examples and rationale for choosing specific pathways.

## Data Arguments Overview

The following arguments can be used to configure datasets and their processing:

- `--data`: Specifies the dataset source. This can be a file path, Hugging Face dataset ID, synthetic data configuration, or in-memory data.
- `--data-args`: A JSON string or dictionary that controls how datasets are parsed and prepared. It includes aliases specific to GuideLLM flows:
  - `prompt_column`: Specifies the column name for the prompt. By default, GuideLLM tries the most common column names (e.g., `prompt`, `text`, `input`).
  - `prompt_tokens_count_column`: Specifies the column name for the prompt token count, which is used as the request's prompt token count when computing token metrics. By default, GuideLLM assumes no token count is provided.
  - `output_tokens_count_column`: Specifies the column name for the output token count, which is used as the request's output token count and when computing token metrics. By default, GuideLLM assumes no token count is provided.
  - `split`: Specifies the dataset split to use (e.g., `train`, `val`, `test`). By default, GuideLLM tries the most common split names (e.g., `train`, `validation`, `test`) if the dataset has splits; otherwise, it uses the entire dataset.
  - Any remaining arguments are passed directly into the dataset constructor as kwargs.
- `--data-sampler`: Specifies the sampling strategy for the dataset. By default, no sampling is applied. When set to `random`, the dataset is randomly shuffled, which can be useful for creating diverse batches during benchmarking.
- `--processor`: Specifies the processor or tokenizer to use. This is only required for synthetic data generation or when local calculations are specified through configuration settings. By default, the processor is set to the `--model` argument. If `--model` is not supplied, it defaults to the model retrieved from the backend.
- `--processor-args`: A JSON string containing any arguments to pass to the processor or tokenizer constructor. These arguments are passed as a dictionary of kwargs.

### Example Usage

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext" \
  --data-args '{"prompt_column": "prompt", "split": "train"}' \
  --processor "path/to/processor" \
  --processor-args '{"arg1": "value1"}' \
  --data-sampler "random"
```
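
If your dataset already stores per-request token counts, the same alias mechanism can map those columns as well. The column names below are illustrative; match them to your dataset's actual schema:

```bash
# Column names here are hypothetical; replace them with the actual columns in your dataset.
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext" \
  --data-args '{"prompt_column": "question", "prompt_tokens_count_column": "prompt_tokens", "output_tokens_count_column": "output_tokens"}'
```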

## Dataset Types

GuideLLM supports several types of datasets, each with its own advantages and use cases. Below are the main dataset types supported by GuideLLM, including synthetic data, Hugging Face datasets, file-based datasets, and in-memory datasets.

### Synthetic Data

Synthetic datasets allow you to generate data on the fly with customizable parameters. This is useful for controlled experiments, stress testing, and simulating specific scenarios. For example, you might want to evaluate how a model handles long prompts or generates outputs with specific characteristics.

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "prompt_tokens=256,output_tokens=128"
```

Or using a JSON string:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data '{"prompt_tokens": 256, "output_tokens": 128}'
```

#### Configuration Options

The following options control the generated token distributions; a combined example follows this list:

- `prompt_tokens`: Average number of tokens in prompts. If nothing else is specified, all requests will have this number of prompt tokens.
- `prompt_tokens_stdev`: Standard deviation for prompt tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied but min/max are specified, a uniform distribution is used.
- `prompt_tokens_min`: Minimum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the minimum is 1.
- `prompt_tokens_max`: Maximum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the maximum is 5 times the standard deviation.
- `output_tokens`: Average number of tokens in outputs. If nothing else is specified, all requests will have this number of output tokens.
- `output_tokens_stdev`: Standard deviation for output tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied but min/max are specified, a uniform distribution is used.
- `output_tokens_min`: Minimum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the minimum is 1.
- `output_tokens_max`: Maximum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the maximum is 5 times the standard deviation.
- `samples`: Number of samples to generate (default: 1000). More samples increase the time taken to generate the dataset before benchmarking, but also decrease the likelihood of requests being cached.
- `source`: Source text for generation (default: `data:prideandprejudice.txt.gz`). This can be any text file, a URL pointing to a text file, or a compressed text file. The text is sampled at word and punctuation granularity and then combined into a single string of the desired length.
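
As a sketch combining these options (the values are illustrative), the following generates prompts averaging 256 tokens with a standard deviation of 32, clamped to the range 128–512, along with fixed 128-token outputs and 500 samples:

```bash
# Illustrative values; tune them to the scenario you want to simulate.
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "prompt_tokens=256,prompt_tokens_stdev=32,prompt_tokens_min=128,prompt_tokens_max=512,output_tokens=128,samples=500"
```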

#### Notes

- A processor/tokenizer is required. By default, the model passed in or retrieved from the server is used. If unavailable, use the `--processor` argument to specify a directory or Hugging Face model ID containing the processor/tokenizer files.

### Hugging Face Datasets

GuideLLM supports datasets from the Hugging Face Hub or local directories that follow the `datasets` library format. This allows you to easily leverage a wide range of datasets for benchmarking and evaluation with real-world data.

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "garage-bAInd/Open-Platypus"
```

Or using a local dataset:

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset"
```
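
Column and split selection work the same way for Hugging Face datasets. For example (the split and column names below are illustrative and depend on the specific dataset's schema):

```bash
# "train" and "instruction" are illustrative; check the dataset card for the actual split and column names.
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "garage-bAInd/Open-Platypus" \
  --data-args '{"split": "train", "prompt_column": "instruction"}'
```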

#### Notes

- Hugging Face datasets can be specified by ID, a local directory, or a path to a local Python file.

### File-Based Datasets

GuideLLM supports various file formats for datasets, including text, CSV, JSON, and more. These datasets can be used for benchmarking and evaluation, allowing you to work with structured data in a familiar format that matches your use case.

#### Supported Formats with Examples

- **Text files (`.txt`, `.text`)**
  ```
  Hello, how are you?
  What is your name?
  ```
- **CSV files (`.csv`)**
  ```csv
  prompt,output
  "Hello, how are you?","I'm fine."
  "What is your name?","My name is GuideLLM."
  ```
- **JSON files (`.json`, `.jsonl`)** (for the line-delimited `.jsonl` layout, see the note after this list)
  ```json
  [
    {"prompt": "Hello, how are you?", "output": "I'm fine."},
    {"prompt": "What is your name?", "output": "My name is GuideLLM."}
  ]
  ```
- **Parquet files (`.parquet`)**: a binary columnar storage format for efficient data processing.
- **Arrow files (`.arrow`)**: a cross-language development platform for in-memory data.
- **HDF5 files (`.hdf5`)**: a hierarchical data format for storing large amounts of data.
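
For `.jsonl` files, the same records are written as one JSON object per line rather than as a JSON array:

```jsonl
{"prompt": "Hello, how are you?", "output": "I'm fine."}
{"prompt": "What is your name?", "output": "My name is GuideLLM."}
```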

#### Example Commands

```bash
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type "throughput" \
  --max-requests 1000 \
  --data "path/to/dataset.ext"
```

#### Notes

- Ensure the file format matches the expected structure for the dataset.

### In-Memory Datasets

In-memory datasets allow you to directly pass data as Python objects, making them ideal for quick prototyping and testing without the need to save data to disk.

#### Supported Formats with Examples

- **Dictionary of columns and values**
  ```python
  {
      "column1": ["value1", "value2"],
      "column2": ["value3", "value4"]
  }
  ```
- **List of dictionaries**
  ```python
  [
      {"column1": "value1", "column2": "value3"},
      {"column1": "value2", "column2": "value4"}
  ]
  ```
- **List of items**
  ```python
  ["value1", "value2", "value3"]
  ```

#### Example Usage

```python
from guidellm.benchmark import benchmark_generative_text

data = [
    {"prompt": "Hello", "output": "Hi"},
    {"prompt": "How are you?", "output": "I'm fine."}
]

benchmark_generative_text(data=data, ...)
```

#### Notes

- Ensure that the data format is consistent and adheres to one of the supported structures.
- For dictionaries, all columns must have the same number of samples.
- For lists of dictionaries, all items must have the same keys.
- For lists of items, all elements must be of the same type.