docs: Guide for Speculative Decoding in VLLM using Eagle3 and Meta-Llama-3.1-8B-Instruct #3895
base: main
Conversation
👋 Hi Dilu-Bilu! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough: This pull request introduces speculative decoding support to the vLLM backend by adding a new deployment script that orchestrates a frontend ingress and a speculative worker with an Eagle3 draft-model configuration, along with comprehensive documentation and toctree integration.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–25 minutes
Pre-merge checks: ❌ 1 failed (warning) · ✅ 2 passed
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- components/backends/vllm/launch/agg_spec_decoding.sh (1 hunk)
- docs/backends/vllm/README.md (1 hunk)
- docs/backends/vllm/speculative_decoding.md (1 hunk)
- docs/hidden_toctree.rst (1 hunk)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3895/merge) by Dilu-Bilu.
components/backends/vllm/launch/agg_spec_decoding.sh
[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed: the file has a shebang but is not marked executable. Exit code 1. Run 'chmod +x components/backends/vllm/launch/agg_spec_decoding.sh'.
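The fix the hook asks for is a one-liner. As a self-contained sketch, using a scratch copy of the script (the real path, components/backends/vllm/launch/agg_spec_decoding.sh, lives in the repo):

```shell
# Scratch copy standing in for components/backends/vllm/launch/agg_spec_decoding.sh
script=agg_spec_decoding.sh
printf '#!/bin/bash\necho "spec decoding launch"\n' > "$script"

# The fix the pre-commit hook asks for: mark the shebang script executable
chmod +x "$script"

# Confirm the executable bit is set
test -x "$script" && echo "executable: yes"
```

After running chmod on the real file, the change to the permission bit must be committed and pushed for the hook to pass in CI.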
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
docs/hidden_toctree.rst (1)
75-75: LGTM. Toctree entry is correctly placed and follows the established format for vLLM backend documentation.
docs/backends/vllm/README.md (1)
154-159: LGTM. The new subsection is well-written, clearly describes the speculative decoding setup, and uses an appropriate relative link to the guide.components/backends/vllm/launch/agg_spec_decoding.sh (3)
4-5: LGTM on error handling. The `set -e` and trap on EXIT provide basic safety for process cleanup and error detection.
22-27: LGTM on speculative configuration. The `speculative_config` JSON structure is well-formed and appropriately configured for an Eagle3 draft model with Llama-3.1-8B-Instruct (draft_tensor_parallel_size=1, num_speculative_tokens=2, method=eagle).
18-27: Environment variable naming verified as correct. The variables `DYN_SYSTEM_ENABLED` and `DYN_SYSTEM_PORT` follow established Dynamo naming conventions in the codebase. These exact names appear consistently across examples, tests, and launch scripts for multiple backends (vllm, trtllm, sglang), confirming they are standardized and intentional.
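For illustration, the pieces called out above might be wired together roughly like this. The environment variable names come from the review; the port value and the draft-model path are placeholders, not values taken from this PR:

```shell
# Dynamo system-metrics variables (names confirmed in the review; values illustrative)
export DYN_SYSTEM_ENABLED=true
export DYN_SYSTEM_PORT=8081

# Eagle3 speculative_config as summarized by the review; "model" is a placeholder path.
SPEC_CONFIG='{
  "model": "<eagle3-draft-model>",
  "draft_tensor_parallel_size": 1,
  "num_speculative_tokens": 2,
  "method": "eagle"
}'

# Sanity-check the JSON before handing it to the worker launch command
echo "$SPEC_CONFIG" | python3 -c 'import json,sys; json.load(sys.stdin)' \
  && echo "speculative_config: valid JSON"
```

Validating the JSON up front avoids an opaque worker-startup failure if the config string gets mangled by quoting.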
Signed-off-by: DilreetRaju <[email protected]>
Force-pushed from 3c3e959 to 15777d5.
@Dilu-Bilu Thanks! @athreesh Could you take a look at this one, or hand off to whoever would be best?
This looks great, @Dilu-Bilu! I am going to approve and merge.
LGTM! @alec-flowers for viz
Overview:
This PR adds a new guide and helper script for running Meta-Llama-3.1-8B-Instruct with speculative decoding using Eagle3.
It also updates the vLLM README to reference this new guide under the Advanced Examples section.
Details:
Changes included in this PR:
New files added:
- docs/backends/vllm/speculative_decoding.md — A step-by-step guide for deploying Meta-Llama-3.1-8B-Instruct with aggregated speculative decoding using Eagle3.
- components/backends/vllm/launch/agg_spec_decoding.sh — A helper script to easily start the speculative decoding server.

Updated files:
- docs/backends/vllm/README.md — Added a new section under Advanced Examples referencing the new speculative decoding guide.

This setup enables users to run Meta-Llama-3.1-8B-Instruct on a single GPU (≥16 GB VRAM) with Eagle3 as the draft model, allowing faster and more efficient inference via speculative decoding.
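Once deployed, the frontend can be smoke-tested with an OpenAI-compatible chat request; a minimal sketch, where the host, port, and served model name are assumptions rather than values taken from this PR:

```shell
# Build an OpenAI-compatible chat-completions payload (model name assumed for illustration)
BODY='{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [{"role": "user", "content": "Say hello."}],
  "max_tokens": 32
}'

# Validate the payload locally before sending it
echo "$BODY" | python3 -c 'import json,sys; json.load(sys.stdin)' && echo "payload: valid JSON"

# Uncomment to send against a running frontend (endpoint assumed):
# curl -s http://localhost:8000/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$BODY"
```

With speculative decoding active, the response should be indistinguishable in shape from a non-speculative deployment; the speedup shows up in latency, not in the API contract.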
Where should the reviewer start?
- docs/backends/vllm/README.md — Check that the new Advanced Examples section is clear and correctly linked.
- docs/backends/vllm/speculative_decoding.md — Verify that the guide is accurate, complete, and easy to follow.
- components/backends/vllm/launch/agg_spec_decoding.sh — Review for correctness and best practices when launching the server.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)