
Conversation


Dilu-Bilu commented Oct 26, 2025

Overview:

This PR adds a new guide and helper script for running Meta-Llama-3.1-8B-Instruct with speculative decoding using Eagle3.
It also updates the vLLM README to reference this new guide under the Advanced Examples section.


Details:

Changes included in this PR:

New files added:

  • docs/backends/vllm/speculative_decoding.md — A step-by-step guide for deploying Meta-Llama-3.1-8B-Instruct with aggregated speculative decoding using Eagle3.
  • components/backends/vllm/launch/agg_spec_decoding.sh — A helper script to easily start the speculative decoding server.

Updated files:

  • docs/backends/vllm/README.md — Added a new section under Advanced Examples referencing the new speculative decoding guide.

This setup enables users to run Meta-Llama-3.1-8B-Instruct on a single GPU (≥16 GB VRAM) with Eagle3 as the draft model, allowing faster and more efficient inference via speculative decoding.
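As a quick orientation (not a substitute for the guide itself), running the new setup end to end would look roughly like the sketch below. HF_TOKEN is the conventional Hugging Face variable for gated models such as Llama 3.1; the guide's own setup steps should be treated as authoritative.

```bash
# Llama 3.1 weights are gated on Hugging Face, so export an access token
# first (HF_TOKEN is the conventional variable; placeholder value below).
export HF_TOKEN=<your-hf-token>

# Start the frontend and the Eagle3 speculative worker on a single GPU
# with >=16 GB VRAM, using the helper script added in this PR.
./components/backends/vllm/launch/agg_spec_decoding.sh
```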


Where should the reviewer start?

  • docs/backends/vllm/README.md — Check that the new Advanced Examples section is clear and correctly linked.
  • docs/backends/vllm/speculative_decoding.md — Verify that the guide is accurate, complete, and easy to follow.
  • components/backends/vllm/launch/agg_spec_decoding.sh — Review for correctness and best practices when launching the server.

Related Issues:

  • Closes GitHub issue: N/A (new documentation & helper script)

Summary by CodeRabbit

  • New Features

    • Added a speculative decoding deployment option that improves inference performance through an aggregated serving architecture.
  • Documentation

    • Introduced a comprehensive quickstart guide for speculative decoding configuration and deployment.
    • Updated backend documentation with advanced deployment examples and configuration references.


copy-pr-bot bot commented Oct 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot commented

👋 Hi Dilu-Bilu! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

github-actions bot added the external-contribution label (Pull request is from an external contributor) Oct 26, 2025

coderabbitai bot commented Oct 26, 2025

Walkthrough

This pull request introduces speculative decoding support to the vLLM backend by adding a new deployment script that orchestrates a frontend ingress and a speculative worker with an Eagle3 draft model configuration, along with comprehensive documentation and toctree integration.

Changes

  • Speculative Decoding Launch Script (components/backends/vllm/launch/agg_spec_decoding.sh): New Bash script that orchestrates a multi-component vLLM deployment (frontend ingress on port 8000, a speculative worker with DYN_SYSTEM enabled on port 8081, the Meta-Llama-3.1-8B-Instruct model, an Eagle3 draft configuration, and GPU memory settings) and includes a process cleanup trap. The orchestration pattern is sketched after this list.
  • Documentation Updates (docs/backends/vllm/README.md, docs/backends/vllm/speculative_decoding.md, docs/hidden_toctree.rst): Added a new "Speculative Decoding with Aggregated Serving" subsection to the README with a link to the new quickstart guide; introduced comprehensive speculative_decoding.md covering Docker setup, Hugging Face model access, deployment steps, example inference requests, and validation; registered the new guide in the documentation toctree.
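To make the walkthrough concrete, here is a minimal sketch of the orchestration pattern it describes, not the script under review. The ports, the DYN_SYSTEM_* variable names, the cleanup trap, and the speculative parameters come from this thread; the module paths, flag spellings, and the draft model placeholder are assumptions.

```bash
#!/bin/bash
set -e

# Clean up both background processes when the script exits
# (the "process cleanup trap" the walkthrough mentions).
trap 'kill $(jobs -p) 2>/dev/null' EXIT

# Frontend ingress on port 8000 (module path and flag are assumptions).
python -m dynamo.frontend --http-port 8000 &

# Speculative worker. DYN_SYSTEM_ENABLED / DYN_SYSTEM_PORT are confirmed as
# the established names in a review comment below; the worker module and
# flag spellings are illustrative, and <eagle3-draft-model> is a placeholder.
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
  python -m dynamo.vllm \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --speculative-config '{"method": "eagle", "model": "<eagle3-draft-model>", "num_speculative_tokens": 2, "draft_tensor_parallel_size": 1}' &

wait
```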

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • components/backends/vllm/launch/agg_spec_decoding.sh — Verify GPU device configurations, port assignments (8000, 8081), model identifiers, and speculative decoding parameters (draft_tensor_parallel_size, num_speculative_tokens, method)
  • docs/backends/vllm/speculative_decoding.md — Validate Docker commands, Hugging Face token setup instructions, and example payloads for accuracy and completeness (a sample request is sketched after this list)
  • Cross-file consistency — Ensure README references align with actual speculative_decoding.md content and file paths
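For reference while validating the guide's example payloads, a request of this general shape would exercise the deployment. The route and port assume the frontend serves an OpenAI-compatible API on 8000, consistent with the walkthrough above; the payload is illustrative, not copied from the guide.

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Explain speculative decoding in one sentence."}],
        "max_tokens": 64,
        "stream": false
      }'
```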

Poem

🐰 Hark! A script hops forth with Eagle wings,
Speculating tokens, swift as spring,
Where Llama dances with its draft so light,
Through aggregated paths, deployment takes flight! 🚀✨

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed: The PR title "docs: Guide for Speculative Decoding in VLLM using Eagle3 and Meta-Llama-3.1-8B-Instruct" clearly summarizes the primary change in the changeset. The title accurately reflects the main deliverable, a new documentation guide for speculative decoding using Eagle3 and the specified model. While the PR also includes a helper script (agg_spec_decoding.sh), the documentation is the focal point, and the title appropriately emphasizes this with the "docs:" prefix. The title is specific, concise, and provides enough context for a reviewer to understand the purpose of the change.
  • Description Check ✅ Passed: The PR description fully adheres to the required template structure with all four sections properly completed. The Overview clearly states the purpose, the Details section comprehensively lists new and updated files with descriptions, the "Where should the reviewer start?" section provides targeted guidance with specific files and review focus areas, and the Related Issues section is appropriately filled out. The description is well organized, informative, and gives reviewers clear direction on how to evaluate the changes.


coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6deeecb and 0ea82b6.

📒 Files selected for processing (4)
  • components/backends/vllm/launch/agg_spec_decoding.sh (1 hunks)
  • docs/backends/vllm/README.md (1 hunks)
  • docs/backends/vllm/speculative_decoding.md (1 hunks)
  • docs/hidden_toctree.rst (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3895/merge) by Dilu-Bilu.
components/backends/vllm/launch/agg_spec_decoding.sh

[error] 1-1: pre-commit hook 'check-shebang-scripts-are-executable' failed: the file has a shebang but is not marked executable. Exit code 1. Run 'chmod +x components/backends/vllm/launch/agg_spec_decoding.sh'.
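The hook's suggested fix is a one-liner, but the mode change also has to be committed for CI to pass. A sketch, assuming a standard git workflow:

```bash
chmod +x components/backends/vllm/launch/agg_spec_decoding.sh
# Stage and commit the mode change so the pre-commit hook passes in CI too.
git add components/backends/vllm/launch/agg_spec_decoding.sh
git commit -m "chore: mark agg_spec_decoding.sh executable"
```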

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
docs/hidden_toctree.rst (1)

75-75: LGTM. Toctree entry is correctly placed and follows the established format for vLLM backend documentation.

docs/backends/vllm/README.md (1)

154-159: LGTM. The new subsection is well-written, clearly describes the speculative decoding setup, and uses an appropriate relative link to the guide.

components/backends/vllm/launch/agg_spec_decoding.sh (3)

4-5: LGTM on error handling. The set -e and trap on EXIT provide basic safety for process cleanup and error detection.


22-27: LGTM on speculative configuration. The speculative_config JSON structure is well-formed and appropriately configured for Eagle3 draft model with Llama-3.1-8B-Instruct (draft_tensor_parallel_size=1, num_speculative_tokens=2, method=eagle).
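Assembled from the values this comment names, the config presumably looks something like the sketch below. The draft model identifier is a placeholder (the PR text does not name the exact repo), and the accepted keys can vary across vLLM versions.

```bash
# Sketch of the speculative_config JSON described above; "eagle", 2, and 1
# come from the review comment, the draft model repo id is a placeholder.
SPEC_CONFIG='{
  "method": "eagle",
  "model": "<eagle3-draft-model-repo>",
  "num_speculative_tokens": 2,
  "draft_tensor_parallel_size": 1
}'
```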


18-27: Environment variable naming verified as correct.

The variables DYN_SYSTEM_ENABLED and DYN_SYSTEM_PORT follow established Dynamo naming conventions in the codebase. These exact names appear consistently across examples, tests, and launch scripts for multiple backends (vllm, trtllm, sglang), confirming they are standardized and intentional.

Dilu-Bilu force-pushed the add-spec-decode-docs-vllm branch from 3c3e959 to 15777d5 on October 26, 2025 at 17:57
grahamking (Contributor) commented

@Dilu-Bilu Thanks!

@athreesh Could you take a look at this one, or hand off to whoever would be best?

athreesh (Contributor) commented

this looks great @Dilu-Bilu! I am going to merge in and approve

athreesh left a comment

LGTM! @alec-flowers for viz


Labels

docs, external-contribution, size/L
