<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)

This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
Since the model is only **8B parameters**, you can run it on **any GPU with at least 16 GB of VRAM**.
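
If you're not sure what your GPU provides, you can check each device's name and total memory with `nvidia-smi`:

```bash
# Print each GPU's name and total VRAM (requires the NVIDIA driver on the host)
nvidia-smi --query-gpu=name,memory.total --format=csv
```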

## Step 1: Set Up Your Docker Environment

First, we’ll initialize a Docker container using the vLLM backend.
You can refer to the [vLLM Quickstart Guide](./README.md#vllm-quick-start), or follow the full steps below.

### 1. Launch Docker Compose

```bash
docker compose -f deploy/docker-compose.yml up -d
```
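
Before moving on, you can confirm that the services defined in the compose file came up cleanly:

```bash
# List the containers started by the compose file along with their status
docker compose -f deploy/docker-compose.yml ps
```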

### 2. Build the Container

```bash
./container/build.sh --framework VLLM
```

### 3. Run the Container

```bash
./container/run.sh -it --framework VLLM --mount-workspace
```

## Step 2: Get Access to the Llama 3.1 Model

The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
Approval usually takes around **5 minutes**.

Once you have access, generate a **Hugging Face access token** with permission to read gated repositories, then set it inside your container:

```bash
export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
```
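
If the `huggingface-cli` tool is available inside the container, you can sanity-check the token before any weights are downloaded:

```bash
# Prints your Hugging Face username if the token is valid
huggingface-cli whoami
```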

## Step 3: Run Aggregated Speculative Decoding

Now that your environment is ready, start the aggregated server with **speculative decoding**.

```bash
# Requires only one GPU
cd components/backends/vllm
bash launch/agg_spec_decoding.sh
```

Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
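
You can check readiness by listing the models registered with the OpenAI-compatible frontend (this assumes the launch script serves on the default port 8000):

```bash
# Returns a JSON list of served models once the server is ready
curl -s http://localhost:8000/v1/models
```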

## Step 4: Example Request

To verify your setup, try sending a simple prompt to your model.
Since the request uses the chat-style `messages` field, it goes to the `/v1/chat/completions` endpoint:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
    ],
    "max_tokens": 250
  }'
```

### Example Output

```json
{
  "id": "chatcmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
      },
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
```
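
If you’d rather stream tokens as they are generated, the standard OpenAI-style `stream` flag should also work; here is a minimal variant of the request above:

```bash
# Stream the response as server-sent events instead of waiting for the full completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
    ],
    "max_tokens": 250,
    "stream": true
  }'
```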

## Additional Resources

* [vLLM Quickstart](./README.md#vllm-quick-start)
* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)