Commit 6cac7b1

Upload prefill and decode heavy benchmarking configs

1 parent e26d0cd commit 6cac7b1

8 files changed: +365 −1 lines changed

benchmarking/prefix-cache-aware/high-cache-values.yaml

Lines changed: 6 additions & 0 deletions

```diff
@@ -23,6 +23,12 @@ logLevel: INFO
 # NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/dataset.json.
 gcsPath: ""
 
+# An S3 bucket path that points to the dataset file.
+# The file will be copied from this path to the local file system
+# at /dataset/s3-dataset.json for use during the run.
+# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/s3-dataset.json.
+s3Path: ""
+
 # hfToken optionally creates a secret with the specified token.
 # Can be set using helm install --set hftoken=<token>
 hfToken: ""
```
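Both prefix-cache values files gain the same `s3Path` knob. As a minimal sketch of how the new field composes with `config.data.path` at install time (release name, chart path, and bucket are illustrative placeholders, not part of this commit):

```bash
# Hypothetical install using a dataset staged in S3.
# Per the NOTE above, config.data.path must point at the local copy.
helm install high-cache ../inference-perf -f high-cache-values.yaml \
  --set "s3Path=s3://<YOUR_BUCKET>/dataset.json" \
  --set "config.data.path=/dataset/s3-dataset.json"
```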

benchmarking/prefix-cache-aware/low-cache-values.yaml

Lines changed: 6 additions & 0 deletions

```diff
@@ -23,6 +23,12 @@ logLevel: INFO
 # NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/dataset.json.
 gcsPath: ""
 
+# An S3 bucket path that points to the dataset file.
+# The file will be copied from this path to the local file system
+# at /dataset/s3-dataset.json for use during the run.
+# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/s3-dataset.json.
+s3Path: ""
+
 # hfToken optionally creates a secret with the specified token.
 # Can be set using helm install --set hftoken=<token>
 hfToken: ""
```
benchmarking/single-workload/decode-heavy-values.yaml

Lines changed: 76 additions & 0 deletions (new file)

```yaml
# Decode Heavy Configuration
job:
  image:
    repository: quay.io/inference-perf/inference-perf
    tag: "0.2.0" # Defaults to .Chart.AppVersion
  serviceAccountName: ""
  nodeSelector: {}
  # Example resources:
  # resources:
  #   requests:
  #     cpu: "1"
  #     memory: "4Gi"
  #   limits:
  #     cpu: "2"
  #     memory: "8Gi"
  resources: {}

logLevel: INFO

# A GCS bucket path that points to the dataset file.
# The file will be copied from this path to the local file system
# at /dataset/dataset.json for use during the run.
# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/dataset.json.
gcsPath: ""

# An S3 bucket path that points to the dataset file.
# The file will be copied from this path to the local file system
# at /dataset/s3-dataset.json for use during the run.
# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/s3-dataset.json.
s3Path: ""

# hfToken optionally creates a secret with the specified token.
# Can be set using helm install --set hftoken=<token>
hfToken: ""

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 200
        duration: 60
      - rate: 210
        duration: 60
      - rate: 220
        duration: 60
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: infinity_instruct
    path: ""
    input_distribution:
      max: 1024
    output_distribution:
      max: 1024
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
```
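A quick way to sanity-check what this values file produces is to render the chart without installing it; a sketch assuming the chart layout used by the guides below (release name is arbitrary):

```bash
# Render the manifests locally so the generated inference-perf
# config can be inspected before a real run.
helm template decode-heavy ../inference-perf -f decode-heavy-values.yaml
```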
benchmarking/single-workload/prefill-heavy-values.yaml

Lines changed: 77 additions & 0 deletions (new file)

```yaml
# Prefill Heavy Configuration
job:
  image:
    repository: quay.io/inference-perf/inference-perf
    tag: "0.2.0" # Defaults to .Chart.AppVersion
  serviceAccountName: ""
  nodeSelector: {}
  # Example resources:
  # resources:
  #   requests:
  #     cpu: "1"
  #     memory: "4Gi"
  #   limits:
  #     cpu: "2"
  #     memory: "8Gi"
  resources: {}

logLevel: INFO

# A GCS bucket path that points to the dataset file.
# The file will be copied from this path to the local file system
# at /dataset/dataset.json for use during the run.
# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/dataset.json.
gcsPath: ""

# An S3 bucket path that points to the dataset file.
# The file will be copied from this path to the local file system
# at /dataset/s3-dataset.json for use during the run.
# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/s3-dataset.json.
s3Path: ""

# hfToken optionally creates a secret with the specified token.
# Can be set using helm install --set hftoken=<token>
hfToken: ""

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 300
        duration: 30
      - rate: 310
        duration: 30
      - rate: 320
        duration: 30
      - rate: 330
        duration: 30
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: billsum_conversations
    path: ""
    input_distribution:
      max: 1024
    output_distribution:
      max: 1024
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
```
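Individual stages can also be overridden at install time without editing the file, using Helm's list-index `--set` syntax; the values here are illustrative only:

```bash
# Hypothetical override: raise the first stage to 400 QPS for 60s;
# the remaining stages keep the values from the file.
helm install prefill-heavy ../inference-perf -f prefill-heavy-values.yaml \
  --set "config.load.stages[0].rate=400" \
  --set "config.load.stages[0].duration=60"
```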

mkdocs.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -84,6 +84,8 @@ nav:
     - Benchmark: performance/benchmark/index.md
     - Advanced Benchmarking Configs:
       - Prefix Cache Aware: performance/benchmark/advanced-configs/prefix-cache-aware.md
+      - Decode Heavy Workload: performance/benchmark/advanced-configs/decode-heavy-workload.md
+      - Prefill Heavy Workload: performance/benchmark/advanced-configs/prefill-heavy-workload.md
     - Regression Testing: performance/regression-testing/index.md
   - Reference:
     - v1 API Reference: reference/spec.md
```
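To preview the two new nav entries, the standard MkDocs dev server works; this assumes the repo's docs dependencies are already installed (the exact requirements are not part of this commit):

```bash
# Serve the docs locally with live reload at http://127.0.0.1:8000
mkdocs serve
```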
site-src/performance/benchmark/advanced-configs/decode-heavy-workload.md

Lines changed: 97 additions & 0 deletions (new file)

# Decode Heavy Workload Benchmarking

This guide shows how to deploy a decode-heavy benchmarking config using inference-perf.

## Prerequisites

Before you begin, ensure you have the following:

* **Helm 3+**: [Installation Guide](https://helm.sh/docs/intro/install/)
* **Kubernetes Cluster**: Access to a Kubernetes cluster
* **Hugging Face Token Secret**: A Hugging Face token to pull models.
* **Gateway Deployed**: Your inference server/gateway must be deployed and accessible within the cluster.

Follow the [benchmarking guide](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/#benchmark) for more information on how to set up the gateway and validate benchmark results.

## Infinity Instruct Dataset Configuration

The chart uses the `infinity_instruct` [dataset type](https://huggingface.co/datasets/BAAI/Infinity-Instruct).

> NOTE: Currently, you need to download the dataset and supply it for inference-perf to ingest. With Helm, you can supply it by uploading it to a GCS or S3 bucket, as sketched below. Otherwise, you can follow the inference-perf guides to run locally with a local dataset file path.
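As a minimal sketch, staging a dataset file in GCS for `gcsPath` to reference could look like this (bucket name and local file are placeholders):

```bash
# Upload a locally prepared dataset so the chart can copy it into the run.
gsutil cp dataset.json gs://<YOUR_BUCKET>/dataset.json
```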
## Deployment

### 1. Check out the repo.

```bash
git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
cd gateway-api-inference-extension/benchmarking/single-workload
```

### 2. Get the target IP.

The examples below show how to get the IP of a gateway or a k8s service.

```bash
# Get gateway IP
GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
# Get LoadBalancer k8s service IP
SVC_IP=$(kubectl get service/vllm-llama3-8b-instruct -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo $GW_IP
echo $SVC_IP
```
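Whichever address applies then feeds the `IP` and `PORT` variables used by the install commands in the next step; for example (port 80 is an assumption about your gateway listener):

```bash
export IP=$GW_IP
export PORT='80'
```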
### 3. Deploy the Decode Heavy Configuration

This configuration drives a workload that stresses the decode (token generation) phase of the inference server. It uses the `decode-heavy-values.yaml` file.

=== "Google Cloud Storage (GCS)"

    Use the `gcsPath` field to provide your dataset stored on GCS. The dataset will be downloaded from the bucket and stored locally on the cluster at `/dataset/gcs-dataset.json`.

    ```bash
    export IP='<YOUR_IP>'
    export PORT='<YOUR_PORT>'
    export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
    helm install decode-heavy ../inference-perf -f decode-heavy-values.yaml \
      --set hfToken=${HF_TOKEN} \
      --set "config.server.base_url=http://${IP}:${PORT}" \
      --set "config.data.path=/dataset/gcs-dataset.json" \
      --set "gcsPath=<PATH TO DATASET FILE ON GCS BUCKET>"
    ```

    **Parameters to customize:**

    * `decode-heavy`: A unique name for this deployment.
    * `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
    * `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
    * `config.server.base_url`: The base URL (IP and port) of your inference server.
    * `gcsPath`: The path to the dataset file hosted on your GCS bucket.

=== "Simple Storage Service (S3)"

    Use the `s3Path` field to provide your dataset stored on S3. The dataset will be downloaded from the bucket and stored locally on the cluster at `/dataset/s3-dataset.json`.

    ```bash
    export IP='<YOUR_IP>'
    export PORT='<YOUR_PORT>'
    export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
    helm install decode-heavy ../inference-perf -f decode-heavy-values.yaml \
      --set hfToken=${HF_TOKEN} \
      --set "config.server.base_url=http://${IP}:${PORT}" \
      --set "config.data.path=/dataset/s3-dataset.json" \
      --set "s3Path=<PATH TO DATASET FILE ON S3 BUCKET>"
    ```

    **Parameters to customize:**

    * `decode-heavy`: A unique name for this deployment.
    * `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
    * `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
    * `config.server.base_url`: The base URL (IP and port) of your inference server.
    * `s3Path`: The path to the dataset file hosted on your S3 bucket.

## Clean Up

To uninstall the deployed chart:

```bash
helm uninstall decode-heavy
```

## Post Benchmark Analysis

Follow the benchmarking guide instructions to [compare benchmark results](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/#analyze-the-results).
site-src/performance/benchmark/advanced-configs/prefill-heavy-workload.md

Lines changed: 98 additions & 0 deletions (new file)

# Prefill Heavy Workload Benchmarking

This guide shows how to deploy a prefill-heavy benchmarking config using inference-perf.

## Prerequisites

Before you begin, ensure you have the following:

* **Helm 3+**: [Installation Guide](https://helm.sh/docs/intro/install/)
* **Kubernetes Cluster**: Access to a Kubernetes cluster
* **Hugging Face Token Secret**: A Hugging Face token to pull models.
* **Gateway Deployed**: Your inference server/gateway must be deployed and accessible within the cluster.

Follow the [benchmarking guide](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/#benchmark) for more information on how to set up the gateway and validate benchmark results.

## BillSum Conversations Dataset Configuration

The chart uses the `billsum_conversations` [dataset type](https://huggingface.co/datasets/FiscalNote/billsum), built from the BillSum summarization dataset, whose long input documents keep the workload prefill heavy.

> NOTE: Currently, you need to download the dataset and supply it for inference-perf to ingest. With Helm, you can supply it by uploading it to a GCS or S3 bucket, as sketched below. Otherwise, you can follow the inference-perf guides to run locally with a local dataset file path.
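As a minimal sketch, staging a dataset file in S3 for `s3Path` to reference could look like this (bucket name and local file are placeholders):

```bash
# Upload a locally prepared dataset so the chart can copy it into the run.
aws s3 cp dataset.json s3://<YOUR_BUCKET>/dataset.json
```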
## Deployment

### 1. Check out the repo.

```bash
git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
cd gateway-api-inference-extension/benchmarking/single-workload
```

### 2. Get the target IP.

The examples below show how to get the IP of a gateway or a k8s service.

```bash
# Get gateway IP
GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
# Get LoadBalancer k8s service IP
SVC_IP=$(kubectl get service/vllm-llama3-8b-instruct -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo $GW_IP
echo $SVC_IP
```

### 3. Deploy the Prefill Heavy Configuration

This configuration drives a workload with long prompts that stresses the prefill (prompt processing) phase of the inference server. It uses the `prefill-heavy-values.yaml` file.

=== "Google Cloud Storage (GCS)"

    Use the `gcsPath` field to provide your dataset stored on GCS. The dataset will be downloaded from the bucket and stored locally on the cluster at `/dataset/gcs-dataset.json`.

    ```bash
    export IP='<YOUR_IP>'
    export PORT='<YOUR_PORT>'
    export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
    helm install prefill-heavy ../inference-perf -f prefill-heavy-values.yaml \
      --set hfToken=${HF_TOKEN} \
      --set "config.server.base_url=http://${IP}:${PORT}" \
      --set "config.data.path=/dataset/gcs-dataset.json" \
      --set "gcsPath=<PATH TO DATASET FILE ON GCS BUCKET>"
    ```

    **Parameters to customize:**

    * `prefill-heavy`: A unique name for this deployment.
    * `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
    * `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
    * `config.server.base_url`: The base URL (IP and port) of your inference server.
    * `gcsPath`: The path to the dataset file hosted on your GCS bucket.

=== "Simple Storage Service (S3)"

    Use the `s3Path` field to provide your dataset stored on S3. The dataset will be downloaded from the bucket and stored locally on the cluster at `/dataset/s3-dataset.json`.

    ```bash
    export IP='<YOUR_IP>'
    export PORT='<YOUR_PORT>'
    export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
    helm install prefill-heavy ../inference-perf -f prefill-heavy-values.yaml \
      --set hfToken=${HF_TOKEN} \
      --set "config.server.base_url=http://${IP}:${PORT}" \
      --set "config.data.path=/dataset/s3-dataset.json" \
      --set "s3Path=<PATH TO DATASET FILE ON S3 BUCKET>"
    ```

    **Parameters to customize:**

    * `prefill-heavy`: A unique name for this deployment.
    * `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
    * `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
    * `config.server.base_url`: The base URL (IP and port) of your inference server.
    * `s3Path`: The path to the dataset file hosted on your S3 bucket.

## Clean Up

To uninstall the deployed chart:

```bash
helm uninstall prefill-heavy
```

## Post Benchmark Analysis

Follow the benchmarking guide instructions to [compare benchmark results](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/#analyze-the-results).

site-src/performance/benchmark/index.md

Lines changed: 3 additions & 1 deletion

```diff
@@ -173,9 +173,11 @@ detailed list of configuration knobs.
 
 The following is a list of advanced configurations available.
 
-| Guides | Config | Directory | Config(s) |
+| Guide | Directory | Config(s) |
 | :--- | :--- | :--- |
 | [Prefix Cache Aware Guide](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/advanced-configs/prefix-cache-aware/#prefix-cache-aware-benchmarking) | `benchmarking/prefix-cache-aware` | `high-cache-values.yaml` `low-cache-values.yaml` |
+| [Decode Heavy Guide](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/advanced-configs/decode-heavy) | `benchmarking/single-workload` | `decode-heavy-values.yaml` |
+| [Prefill Heavy Guide](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/advanced-configs/prefill-heavy) | `benchmarking/single-workload` | `prefill-heavy-values.yaml` |
 
 ## Analyze the results
```