
Conversation


Dilu-Bilu commented Oct 26, 2025

Overview:

This PR adds a new guide and helper script for running Meta-Llama-3.1-8B-Instruct with speculative decoding using Eagle3.
It also updates the vLLM README to reference this new guide under the Advanced Examples section.


Details:

Changes included in this PR:

New files added:

  • docs/backends/vllm/speculative_decoding.md — A step-by-step guide for deploying Meta-Llama-3.1-8B-Instruct with aggregated speculative decoding using Eagle3.
  • components/backends/vllm/launch/agg_spec_decoding.sh — A helper script to easily start the speculative decoding server.

Updated files:

  • docs/backends/vllm/README.md — Added a new section under Advanced Examples referencing the new speculative decoding guide.

This setup enables users to run Meta-Llama-3.1-8B-Instruct on a single GPU (≥16 GB VRAM) with Eagle3 as the draft model, allowing faster and more efficient inference via speculative decoding.
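As a quick orientation (not a substitute for the guide itself), running the new setup end to end would look roughly like the sketch below. HF_TOKEN is the conventional Hugging Face variable for gated models such as Llama 3.1; the guide's own setup steps should be treated as authoritative.

```bash
# Llama 3.1 weights are gated on Hugging Face, so export an access token
# first (HF_TOKEN is the conventional variable; placeholder value below).
export HF_TOKEN=<your-hf-token>

# Start the frontend and the Eagle3 speculative worker on a single GPU
# with >=16 GB VRAM, using the helper script added in this PR.
./components/backends/vllm/launch/agg_spec_decoding.sh
```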


Where should the reviewer start?

  • docs/backends/vllm/README.md — Check that the new Advanced Examples section is clear and correctly linked.
  • docs/backends/vllm/speculative_decoding.md — Verify that the guide is accurate, complete, and easy to follow.
  • components/backends/vllm/launch/agg_spec_decoding.sh — Review for correctness and best practices when launching the server.

Related Issues:

  • Closes GitHub issue: N/A (new documentation & helper script)

Summary by CodeRabbit

  • New Features

    • Added a speculative decoding deployment option that improves inference performance through an aggregated serving architecture.
  • Documentation

    • Introduced a comprehensive quickstart guide for speculative decoding configuration and deployment.
    • Updated backend documentation with advanced deployment examples and configuration references.


copy-pr-bot bot commented Oct 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot commented

👋 Hi Dilu-Bilu! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

github-actions bot added the external-contribution label (Pull request is from an external contributor) Oct 26, 2025

coderabbitai bot commented Oct 26, 2025

Walkthrough

This pull request introduces speculative decoding support to the vLLM backend by adding a new deployment script that orchestrates a frontend ingress and a speculative worker with an Eagle3 draft model configuration, along with comprehensive documentation and toctree integration.

Changes

  • Speculative Decoding Launch Script (components/backends/vllm/launch/agg_spec_decoding.sh): New Bash script that orchestrates a multi-component vLLM deployment (frontend ingress on port 8000, a speculative worker with DYN_SYSTEM enabled on port 8081, the Meta-Llama-3.1-8B-Instruct model, an Eagle3 draft configuration, and GPU memory settings) and includes a process cleanup trap. The orchestration pattern is sketched after this list.
  • Documentation Updates (docs/backends/vllm/README.md, docs/backends/vllm/speculative_decoding.md, docs/hidden_toctree.rst): Added a new "Speculative Decoding with Aggregated Serving" subsection to the README with a link to the new quickstart guide; introduced comprehensive speculative_decoding.md covering Docker setup, Hugging Face model access, deployment steps, example inference requests, and validation; registered the new guide in the documentation toctree.
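To make the walkthrough concrete, here is a minimal sketch of the orchestration pattern it describes, not the script under review. The ports, the DYN_SYSTEM_* variable names, the cleanup trap, and the speculative parameters come from this thread; the module paths, flag spellings, and the draft model placeholder are assumptions.

```bash
#!/bin/bash
set -e

# Clean up both background processes when the script exits
# (the "process cleanup trap" the walkthrough mentions).
trap 'kill $(jobs -p) 2>/dev/null' EXIT

# Frontend ingress on port 8000 (module path and flag are assumptions).
python -m dynamo.frontend --http-port 8000 &

# Speculative worker. DYN_SYSTEM_ENABLED / DYN_SYSTEM_PORT are confirmed as
# the established names in a review comment below; the worker module and
# flag spellings are illustrative, and <eagle3-draft-model> is a placeholder.
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
  python -m dynamo.vllm \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --speculative-config '{"method": "eagle", "model": "<eagle3-draft-model>", "num_speculative_tokens": 2, "draft_tensor_parallel_size": 1}' &

wait
```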

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • components/backends/vllm/launch/agg_spec_decoding.sh — Verify GPU device configurations, port assignments (8000, 8081), model identifiers, and speculative decoding parameters (draft_tensor_parallel_size, num_speculative_tokens, method)
  • docs/backends/vllm/speculative_decoding.md — Validate Docker commands, Hugging Face token setup instructions, and example payloads for accuracy and completeness (a sample request is sketched after this list)
  • Cross-file consistency — Ensure README references align with actual speculative_decoding.md content and file paths
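For reference while validating the guide's example payloads, a request of this general shape would exercise the deployment. The route and port assume the frontend serves an OpenAI-compatible API on 8000, consistent with the walkthrough above; the payload is illustrative, not copied from the guide.

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Explain speculative decoding in one sentence."}],
        "max_tokens": 64,
        "stream": false
      }'
```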

Poem

🐰 Hark! A script hops forth with Eagle wings,
Speculating tokens, swift as spring,
Where Llama dances with its draft so light,
Through aggregated paths, deployment takes flight! 🚀✨

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed: The PR title "docs: Guide for Speculative Decoding in VLLM using Eagle3 and Meta-Llama-3.1-8B-Instruct" clearly summarizes the primary change in the changeset. The title accurately reflects the main deliverable, a new documentation guide for speculative decoding using Eagle3 and the specified model. While the PR also includes a helper script (agg_spec_decoding.sh), the documentation is the focal point, and the title appropriately emphasizes this with the "docs:" prefix. The title is specific, concise, and provides enough context for a reviewer to understand the purpose of the change.
  • Description Check ✅ Passed: The PR description fully adheres to the required template structure with all four sections properly completed. The Overview clearly states the purpose, the Details section comprehensively lists new and updated files with descriptions, the "Where should the reviewer start?" section provides targeted guidance with specific files and review focus areas, and the Related Issues section is appropriately filled out. The description is well organized, informative, and gives reviewers clear direction on how to evaluate the changes.


coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6deeecb and 0ea82b6.

📒 Files selected for processing (4)
  • components/backends/vllm/launch/agg_spec_decoding.sh (1 hunks)
  • docs/backends/vllm/README.md (1 hunks)
  • docs/backends/vllm/speculative_decoding.md (1 hunks)
  • docs/hidden_toctree.rst (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3895/merge) by Dilu-Bilu.
components/backends/vllm/launch/agg_spec_decoding.sh

[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed: the file has a shebang but is not marked executable. Exit code 1. Run 'chmod +x components/backends/vllm/launch/agg_spec_decoding.sh'.
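The hook's suggested fix is a one-liner, but the mode change also has to be committed for CI to pass. A sketch, assuming a standard git workflow:

```bash
chmod +x components/backends/vllm/launch/agg_spec_decoding.sh
# Stage and commit the mode change so the pre-commit hook passes in CI too.
git add components/backends/vllm/launch/agg_spec_decoding.sh
git commit -m "chore: mark agg_spec_decoding.sh executable"
```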

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
docs/hidden_toctree.rst (1)

75-75: LGTM. Toctree entry is correctly placed and follows the established format for vLLM backend documentation.

docs/backends/vllm/README.md (1)

154-159: LGTM. The new subsection is well-written, clearly describes the speculative decoding setup, and uses an appropriate relative link to the guide.

components/backends/vllm/launch/agg_spec_decoding.sh (3)

4-5: LGTM on error handling. The set -e and trap on EXIT provide basic safety for process cleanup and error detection.


22-27: LGTM on speculative configuration. The speculative_config JSON structure is well-formed and appropriately configured for Eagle3 draft model with Llama-3.1-8B-Instruct (draft_tensor_parallel_size=1, num_speculative_tokens=2, method=eagle).
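Assembled from the values this comment names, the config presumably looks something like the sketch below. The draft model identifier is a placeholder (the PR text does not name the exact repo), and the accepted keys can vary across vLLM versions.

```bash
# Sketch of the speculative_config JSON described above; "eagle", 2, and 1
# come from the review comment, the draft model repo id is a placeholder.
SPEC_CONFIG='{
  "method": "eagle",
  "model": "<eagle3-draft-model-repo>",
  "num_speculative_tokens": 2,
  "draft_tensor_parallel_size": 1
}'
```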


18-27: Environment variable naming verified as correct.

The variables DYN_SYSTEM_ENABLED and DYN_SYSTEM_PORT follow established Dynamo naming conventions in the codebase. These exact names appear consistently across examples, tests, and launch scripts for multiple backends (vllm, trtllm, sglang), confirming they are standardized and intentional.

Dilu-Bilu force-pushed the add-spec-decode-docs-vllm branch from 3c3e959 to 15777d5 on October 26, 2025 at 17:57
grahamking (Contributor) commented

@Dilu-Bilu Thanks!

@athreesh Could you take a look at this one, or hand off to whoever would be best?

athreesh (Contributor) commented

this looks great @Dilu-Bilu! I am going to merge in and approve

athreesh left a comment

LGTM! @alec-flowers for viz


Labels

docs, external-contribution, size/L
