
Commit 86076e2

Add vllm speculative decoding docs and helper script
Signed-off-by: DilreetRaju <[email protected]>
1 parent 6deeecb commit 86076e2

File tree

4 files changed: +158 -0 lines changed

- components/backends/vllm/launch/agg_spec_decoding.sh
- docs/backends/vllm/README.md
- docs/backends/vllm/speculative_decoding.md
- docs/hidden_toctree.rst
components/backends/vllm/launch/agg_spec_decoding.sh

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
```bash
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -e
trap 'echo Cleaning up...; kill 0' EXIT

# ---------------------------
# 1. Frontend (Ingress)
# ---------------------------
python -m dynamo.frontend --http-port=8000 &

# ---------------------------
# 2. Speculative Main Worker
# ---------------------------
# This runs the main model with EAGLE as the draft model for speculative decoding
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enforce-eager \
  --speculative_config '{
    "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    "draft_tensor_parallel_size": 1,
    "num_speculative_tokens": 2,
    "method": "eagle"
  }' \
  --connector none \
  --gpu-memory-utilization 0.8
```
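
Since the worker pins itself to GPU 0 via `CUDA_VISIBLE_DEVICES`, a quick way to confirm placement while the models load is to watch GPU memory with standard NVIDIA tooling (an optional convenience check, not part of the script):

```bash
# Memory used on GPU 0 should climb toward ~80% of VRAM (--gpu-memory-utilization 0.8)
# as the target and draft models load. Refreshes every 2 seconds; Ctrl-C to stop.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 2
```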

docs/backends/vllm/README.md

Lines changed: 7 additions & 0 deletions
@@ -151,6 +151,13 @@ bash launch/dep.sh

Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!

### Speculative Decoding with Aggregated Serving (Meta-Llama-3.1-8B-Instruct + Eagle3)

Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
This setup demonstrates how to use Dynamo to launch an Eagle-based speculative decoding instance under the **vLLM aggregated serving framework** for faster inference while maintaining accuracy.

**Guide:** [Speculative Decoding Quickstart](./speculative_decoding.md)

### Kubernetes Deployment

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../components/backends/vllm/deploy/README.md)
docs/backends/vllm/speculative_decoding.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)

This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
Since the model is only **8B parameters**, it fits on a single GPU; note that the FP16 weights alone occupy roughly 16GB of VRAM, so choose a card with headroom for the Eagle3 draft model and KV cache on top of that.
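
As background on the `num_speculative_tokens` setting used later in this guide: in the standard speculative-decoding analysis, if the draft model's tokens are accepted independently with per-token probability α, one verification pass of the target model over k drafted tokens yields on average

(1 − α^(k+1)) / (1 − α)

output tokens. The launch script below uses k = 2, which tops out at 3 tokens per target-model pass when acceptance is high; α is a property of how closely Eagle3 tracks the target model, not a knob you set.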
## Step 1: Set Up Your Docker Environment

First, we’ll initialize a Docker container using the vLLM backend.
You can refer to the [vLLM Quickstart Guide](./README.md#vllm-quick-start), or follow the full steps below.
### 1. Launch Docker Compose

```bash
docker compose -f deploy/docker-compose.yml up -d
```
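
If you want to confirm the supporting services came up before moving on, the standard Compose status command works (purely optional):

```bash
# Lists the services started by the compose file along with their state.
docker compose -f deploy/docker-compose.yml ps
```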
### 2. Build the Container

```bash
./container/build.sh --framework VLLM
```
### 3. Run the Container

```bash
./container/run.sh -it --framework VLLM --mount-workspace
```
## Step 2: Get Access to the Llama 3.1 Model

The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
Approval usually takes around **5 minutes**.

Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:

```bash
export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
```
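
To confirm the token is set and your access request was approved, you can ask the Hub who it thinks you are (a quick check using the `huggingface-cli` tool that ships with `huggingface_hub`; skip it if the CLI isn't in the image):

```bash
# Should print your Hugging Face username rather than an authentication error.
huggingface-cli whoami
```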
62+
63+
64+
## Step 3: Run Aggregated Speculative Decoding

Now that your environment is ready, start the aggregated server with **speculative decoding**.

```bash
# Requires only one GPU
cd components/backends/vllm
bash launch/agg_spec_decoding.sh
```

Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
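
If you are scripting against the endpoint, a readiness loop saves guessing at when the download finishes (a sketch, assuming the frontend also exposes the standard OpenAI-compatible `/v1/models` route alongside `/v1/completions`):

```bash
# Block until the frontend on :8000 starts answering, then continue.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "Waiting for the frontend to become ready..."
  sleep 5
done
echo "Frontend is ready."
```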
## Step 4: Example Request

To verify your setup, try sending a simple prompt to your model:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Write a poem about why Sakura trees are beautiful.",
    "max_tokens": 250
  }'
```

### Example Output

```json
{
  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
```
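
For interactive use you can also stream tokens as they are generated, which makes the speculative-decoding speedup easy to see (a sketch, assuming the frontend honors the standard OpenAI-compatible `stream` flag):

```bash
# -N disables curl's output buffering so server-sent chunks print as they arrive.
curl -N http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Write a haiku about Sakura trees.",
    "max_tokens": 64,
    "stream": true
  }'
```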
## Additional Resources

* [vLLM Quickstart](./README.md#vllm-quick-start)
* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

docs/hidden_toctree.rst

Lines changed: 1 addition & 0 deletions
@@ -72,6 +72,7 @@
backends/vllm/gpt-oss.md
backends/vllm/multi-node.md
backends/vllm/prometheus.md
backends/vllm/speculative_decoding.md

benchmarks/kv-router-ab-testing.md
