diff --git a/container/BUILD_DGX_SPARK_GUIDE.md b/container/BUILD_DGX_SPARK_GUIDE.md new file mode 100644 index 0000000000..c2d4341acf --- /dev/null +++ b/container/BUILD_DGX_SPARK_GUIDE.md @@ -0,0 +1,130 @@ +# Building Dynamo for DGX-SPARK (vLLM) + +## How `build.sh` Chooses the Dockerfile + +The `build.sh` script automatically selects the correct Dockerfile based on the platform and optional flags: + +### Dockerfile Selection Logic + +``` +IF framework == "VLLM": + IF --dgx-spark flag is set OR platform is linux/arm64: + Use: Dockerfile.vllm.dgx-spark (NVIDIA's pre-built vLLM with Blackwell support) + ELSE: + Use: Dockerfile.vllm (Build from source) +ELSE IF framework == "TRTLLM": + Use: Dockerfile.trtllm +ELSE IF framework == "SGLANG": + Use: Dockerfile.sglang +ELSE: + Use: Dockerfile +``` + +### How to Use + +#### For DGX-SPARK (Blackwell GPUs) + +**Automatic detection (recommended):** +```bash +./container/build.sh --framework VLLM --platform linux/arm64 +``` + +**Explicit flag:** +```bash +./container/build.sh --framework VLLM --dgx-spark +``` + +#### For x86_64 (standard GPUs) + +```bash +./container/build.sh --framework VLLM +# or explicitly +./container/build.sh --framework VLLM --platform linux/amd64 +``` + +## Key Differences + +### Standard vLLM Dockerfile (`Dockerfile.vllm`) +- Builds vLLM from source +- Uses CUDA 12.8 +- Supports: Ampere, Ada, Hopper GPUs +- **Does NOT support Blackwell (compute_121)** + +### DGX-SPARK Dockerfile (`Dockerfile.vllm.dgx-spark`) +- Uses NVIDIA's pre-built vLLM container (`nvcr.io/nvidia/vllm:25.09-py3`) +- Uses CUDA 13.0 +- Supports: **Blackwell GPUs (compute_121)** via DGX-SPARK +- Skips building vLLM from source (avoids nvcc errors) +- **Builds UCX v1.19.0 from source** with CUDA 13 support +- **Builds NIXL 0.7.0 from source** with CUDA 13 support (self-contained, no cache dependency) +- **Builds NIXL Python wheel** with CUDA 13 support +- Adds Dynamo's runtime customizations and integrations + +## Why DGX-SPARK Needs Special Handling + +DGX-SPARK systems use **Blackwell GPUs** with architecture `compute_121`. When trying to build vLLM from source with older CUDA toolchains: + +``` +ERROR: nvcc fatal : Unsupported gpu architecture 'compute_121a' +``` + +**Solution:** Use NVIDIA's pre-built vLLM container that already includes: +- CUDA 13.0 support +- Blackwell GPU architecture support +- DGX Spark functional support +- NVFP4 format optimization + +### Why Build UCX and NIXL from Source? 
+ +The DGX-SPARK Dockerfile builds UCX v1.19.0 and NIXL 0.7.0 **from source** instead of copying from the base image: + +**Reason 1: CUDA 13 Compatibility** +- NIXL 0.7.0 is the first version with native CUDA 13.0 support +- Building from source ensures proper linkage against `libcudart.so.13` (not `libcudart.so.12`) +- Avoids runtime errors: `libcudart.so.12: cannot open shared object file` + +**Reason 2: Cache Independence** +- The base image (`dynamo_base`) may have cached NIXL 0.6.x built with CUDA 12 +- Building fresh in the DGX-SPARK Dockerfile ensures we always get NIXL 0.7.0 with CUDA 13 +- Self-contained build = predictable results + +**Reason 3: ARM64 Optimization** +- UCX and NIXL are built specifically for `aarch64` architecture +- GDS backend is disabled (`-Ddisable_gds_backend=true`) as it's not supported on ARM64 + +## Build Arguments + +When using the `--dgx-spark` flag, `build.sh` automatically: +- Selects `Dockerfile.vllm.dgx-spark` +- Sets `PLATFORM=linux/arm64` (forced) +- Sets `NIXL_REF=0.7.0` (for CUDA 13 support) +- Sets `ARCH=arm64` and `ARCH_ALT=aarch64` + +The DGX-SPARK Dockerfile itself hardcodes: +- `BASE_IMAGE=nvcr.io/nvidia/vllm` +- `BASE_IMAGE_TAG=25.09-py3` + +All other build arguments work the same way. + +## Troubleshooting + +### Error: `exec /bin/sh: exec format error` +- **Cause:** Building with wrong platform +- **Fix:** Use `--platform linux/arm64` for DGX-SPARK + +### Error: `nvcc fatal : Unsupported gpu architecture 'compute_121a'` +- **Cause:** Building from source without Blackwell support +- **Fix:** Use `--dgx-spark` or `--platform linux/arm64` to use pre-built container + +### Error: `libcudart.so.12: cannot open shared object file` +- **Cause:** NIXL was built with CUDA 12 but container has CUDA 13 +- **Fix:** Rebuild with `--dgx-spark` flag to ensure NIXL 0.7.0 with CUDA 13 support +- **Verify:** Inside container: `ldd /opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu/plugins/libplugin_UCX_MO.so | grep cudart` should show `libcudart.so.13` (not `.so.12`) + +## References + +- [NVIDIA vLLM Release 25.09 Documentation](https://docs.nvidia.com/deeplearning/frameworks/vllm-release-notes/rel-25-09.html) +- [NVIDIA NGC Container Registry](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm) +- [NIXL 0.7.0 Release Notes](https://github.com/ai-dynamo/nixl/releases/tag/0.7.0) - CUDA 13.0 support +- [DGX-SPARK README](../docs/backends/vllm/DGX-SPARK_README.md) - Complete deployment guide + diff --git a/container/Dockerfile.vllm.dgx-spark b/container/Dockerfile.vllm.dgx-spark new file mode 100644 index 0000000000..4f92b2cdfb --- /dev/null +++ b/container/Dockerfile.vllm.dgx-spark @@ -0,0 +1,263 @@ +# syntax=docker/dockerfile:1.10.0 +# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +# DGX-SPARK specific Dockerfile for vLLM +# Uses NVIDIA's pre-built vLLM container that supports Blackwell GPUs (compute_121) +# See: https://docs.nvidia.com/deeplearning/frameworks/vllm-release-notes/rel-25-09.html + +ARG BASE_IMAGE="nvcr.io/nvidia/vllm" +ARG BASE_IMAGE_TAG="25.09-py3" + +ARG DYNAMO_BASE_IMAGE="dynamo:latest-none" +FROM ${DYNAMO_BASE_IMAGE} AS dynamo_base + +######################################################## +########## Runtime Image (based on NVIDIA vLLM) ####### +######################################################## +# +# PURPOSE: Production runtime environment for DGX-SPARK +# +# This stage uses NVIDIA's pre-built vLLM container that already includes: +# - vLLM with DGX Spark functional support (Blackwell compute_121) +# - CUDA 13.0 support +# - NVFP4 format support +# - All necessary GPU acceleration libraries +# +# We add Dynamo's customizations on top: +# - Dynamo runtime libraries +# - NIXL for KV cache transfer +# - Custom backend integrations +# + +FROM ${BASE_IMAGE}:${BASE_IMAGE_TAG} AS runtime + +WORKDIR /workspace +ENV DYNAMO_HOME=/opt/dynamo +ENV VIRTUAL_ENV=/opt/dynamo/venv +ENV PATH="${VIRTUAL_ENV}/bin:${PATH}" +# Add system Python site-packages to PYTHONPATH so we can use NVIDIA's vLLM +ENV PYTHONPATH="/usr/local/lib/python3.12/dist-packages:${PYTHONPATH}" + +# NVIDIA vLLM container already has Python 3.12 and vLLM installed +# We just need to set up Dynamo's virtual environment and dependencies +ARG ARCH_ALT=aarch64 +ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl +ENV NIXL_LIB_DIR=$NIXL_PREFIX/lib/${ARCH_ALT}-linux-gnu +ENV NIXL_PLUGIN_DIR=$NIXL_LIB_DIR/plugins + +# Install additional dependencies for Dynamo +# Note: NVIDIA vLLM container already has Python and CUDA tools +RUN apt-get update && \ + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + # Python runtime - CRITICAL for virtual environment to work + python3.12-dev \ + build-essential \ + # jq and curl for polling various endpoints and health checks + jq \ + git \ + git-lfs \ + curl \ + # Libraries required by UCX to find RDMA devices + libibverbs1 rdma-core ibverbs-utils libibumad3 \ + libnuma1 librdmacm1 ibverbs-providers \ + # JIT Kernel Compilation, flashinfer + ninja-build \ + g++ \ + # prometheus dependencies + ca-certificates && \ + rm -rf /var/lib/apt/lists/* + +# NVIDIA vLLM container has CUDA already, but ensure CUDA tools are in PATH +ENV PATH=/usr/local/cuda/bin:$PATH + +# DeepGemm runs nvcc for JIT kernel compilation, however the CUDA include path +# is not properly set for compilation. Set CPATH to help nvcc find the headers. 
+ENV CPATH=/usr/local/cuda/include + +### COPY NATS & ETCD ### +# Copy nats and etcd from dev image +COPY --from=dynamo_base /usr/bin/nats-server /usr/bin/nats-server +COPY --from=dynamo_base /usr/local/bin/etcd/ /usr/local/bin/etcd/ +# Add ETCD and CUDA binaries to PATH so cicc and other CUDA tools are accessible +ENV PATH=/usr/local/bin/etcd/:/usr/local/cuda/nvvm/bin:/usr/local/cuda/bin:$PATH + +### COPY UV EARLY (needed for building NIXL Python wheel) ### +COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv +COPY --from=ghcr.io/astral-sh/uv:latest /uvx /bin/uvx + +# Build UCX and NIXL directly in this stage for CUDA 13.0 support +# This ensures we get fresh NIXL 0.7.0 with CUDA 13 support, not cached CUDA 12 version + +# Build UCX from source +RUN apt-get update && \ + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + autoconf automake libtool pkg-config \ + libibverbs-dev librdmacm-dev \ + && rm -rf /var/lib/apt/lists/* \ + && cd /usr/local/src \ + && git clone https://github.com/openucx/ucx.git \ + && cd ucx && git checkout v1.19.0 \ + && ./autogen.sh \ + && ./configure \ + --prefix=/usr/local/ucx \ + --enable-shared \ + --disable-static \ + --disable-doxygen-doc \ + --enable-optimizations \ + --enable-cma \ + --enable-devel-headers \ + --with-cuda=/usr/local/cuda \ + --with-verbs \ + --with-dm \ + --enable-mt \ + && make -j$(nproc) \ + && make -j$(nproc) install-strip \ + && echo "/usr/local/ucx/lib" > /etc/ld.so.conf.d/ucx.conf \ + && echo "/usr/local/ucx/lib/ucx" >> /etc/ld.so.conf.d/ucx.conf \ + && ldconfig \ + && cd /usr/local/src \ + && rm -rf ucx + +# Build NIXL 0.7.0 from source with CUDA 13.0 support +# Build both C++ library and Python wheel +RUN apt-get update && \ + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + meson ninja-build python3-pip \ + && rm -rf /var/lib/apt/lists/* \ + && git clone --depth 1 --branch 0.7.0 "https://github.com/ai-dynamo/nixl.git" /opt/nixl \ + && cd /opt/nixl \ + && meson setup build/ --buildtype=release --prefix=$NIXL_PREFIX -Ddisable_gds_backend=true \ + && ninja -C build/ -j$(nproc) \ + && ninja -C build/ install \ + && echo "$NIXL_LIB_DIR" > /etc/ld.so.conf.d/nixl.conf \ + && echo "$NIXL_PLUGIN_DIR" >> /etc/ld.so.conf.d/nixl.conf \ + && ldconfig \ + && mkdir -p /opt/dynamo/wheelhouse/nixl \ + && /bin/uv build . 
--out-dir /opt/dynamo/wheelhouse/nixl --config-settings=setup-args="-Ddisable_gds_backend=true" \ + && cd - \ + && rm -rf /opt/nixl + +ENV PATH=/usr/local/ucx/bin:$PATH + +# Set library paths for NIXL and UCX +ENV LD_LIBRARY_PATH=\ +/usr/local/cuda/lib64:\ +$NIXL_LIB_DIR:\ +$NIXL_PLUGIN_DIR:\ +/usr/local/ucx/lib:\ +/usr/local/ucx/lib/ucx:\ +$LD_LIBRARY_PATH + +### VIRTUAL ENVIRONMENT SETUP ### +# Note: uv was already copied earlier (needed for building NIXL Python wheel) + +# Create Dynamo's virtual environment +RUN uv venv /opt/dynamo/venv --python 3.12 + +# Install Dynamo dependencies +# Note: vLLM is available via PYTHONPATH pointing to system Python +# Note: We copy dynamo wheels from base, but NIXL wheel was built fresh above with CUDA 13 support +COPY benchmarks/ /opt/dynamo/benchmarks/ +RUN mkdir -p /opt/dynamo/wheelhouse +COPY --from=dynamo_base /opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl /opt/dynamo/wheelhouse/ +COPY --from=dynamo_base /opt/dynamo/wheelhouse/ai_dynamo*.whl /opt/dynamo/wheelhouse/ +RUN uv pip install \ + /opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \ + /opt/dynamo/wheelhouse/ai_dynamo*any.whl \ + /opt/dynamo/wheelhouse/nixl/nixl*.whl \ + && cd /opt/dynamo/benchmarks \ + && UV_GIT_LFS=1 uv pip install --no-cache . \ + && cd - \ + && rm -rf /opt/dynamo/benchmarks + +# Install common and test dependencies +RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \ + --mount=type=bind,source=./container/deps/requirements.test.txt,target=/tmp/requirements.test.txt \ + UV_GIT_LFS=1 uv pip install \ + --no-cache \ + --requirement /tmp/requirements.txt \ + --requirement /tmp/requirements.test.txt + +# Copy benchmarks, examples, and tests for CI +COPY . /workspace/ + +# Copy attribution files +COPY ATTRIBUTION* LICENSE /workspace/ + +# Copy launch banner +RUN --mount=type=bind,source=./container/launch_message.txt,target=/workspace/launch_message.txt \ + sed '/^#\s/d' /workspace/launch_message.txt > ~/.launch_screen && \ + echo "cat ~/.launch_screen" >> ~/.bashrc && \ + echo "source $VIRTUAL_ENV/bin/activate" >> ~/.bashrc + +ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"] +CMD [] + +########################################################### +########## Development (run.sh, runs as root user) ######## +########################################################### +# +# PURPOSE: Local development environment for use with run.sh (not Dev Container plug-in) +# +# This stage runs as root and provides: +# - Development tools and utilities for local debugging +# - Support for vscode/cursor development outside the Dev Container plug-in +# +# Use this stage if you need a full-featured development environment with extra tools, +# but do not use it with the Dev Container plug-in. + +FROM runtime AS dev + +# Don't want ubuntu to be editable, just change uid and gid. 
+ARG WORKSPACE_DIR=/workspace + +# Install utilities as root +RUN apt-get update -y && \ + apt-get install -y --no-install-recommends \ + # Install utilities + nvtop \ + wget \ + tmux \ + vim \ + git \ + openssh-client \ + iproute2 \ + rsync \ + zip \ + unzip \ + htop \ + # Build Dependencies + autoconf \ + automake \ + cmake \ + libtool \ + meson \ + net-tools \ + pybind11-dev \ + # Rust build dependencies + clang \ + libclang-dev \ + protobuf-compiler && \ + rm -rf /var/lib/apt/lists/* + +# Set workspace directory variable +ENV WORKSPACE_DIR=${WORKSPACE_DIR} \ + DYNAMO_HOME=${WORKSPACE_DIR} \ + RUSTUP_HOME=/usr/local/rustup \ + CARGO_HOME=/usr/local/cargo \ + CARGO_TARGET_DIR=/workspace/target \ + VIRTUAL_ENV=/opt/dynamo/venv \ + PATH=/usr/local/cargo/bin:$PATH + +COPY --from=dynamo_base /usr/local/rustup /usr/local/rustup +COPY --from=dynamo_base /usr/local/cargo /usr/local/cargo + +# Install maturin, for maturin develop +# Editable install of dynamo +RUN uv pip install maturin[patchelf] && \ + uv pip install --no-deps -e . + +ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"] +CMD [] + diff --git a/container/build.sh b/container/build.sh index 7ce32457af..1a26405c02 100755 --- a/container/build.sh +++ b/container/build.sh @@ -24,6 +24,7 @@ set -e TAG= RUN_PREFIX= PLATFORM=linux/amd64 +USE_DGX_SPARK=false # Get short commit hash commit_id=$(git rev-parse --short HEAD) @@ -116,6 +117,7 @@ SGLANG_BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base" SGLANG_BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04" NIXL_REF=0.6.0 +NIXL_REF_DGX=0.7.0 # CUDA 13.0 support for DGX-SPARK NIXL_UCX_REF=v1.19.0 NIXL_UCX_EFA_REF=9d2b88a1f67faf9876f267658bd077b379b8bb76 @@ -339,6 +341,13 @@ get_options() { missing_requirement "$1" fi ;; + --dgx-spark) + if [ -n "$2" ] && [[ "$2" != --* ]]; then + echo "ERROR: --dgx-spark does not take any argument" + exit 1 + fi + USE_DGX_SPARK=true + ;; -?*) error 'ERROR: Unknown option: ' "$1" ;; @@ -482,6 +491,7 @@ show_help() { echo " [--sccache-bucket S3 bucket name for sccache (required with --use-sccache)]" echo " [--sccache-region S3 region for sccache (required with --use-sccache)]" echo " [--vllm-max-jobs number of parallel jobs for compilation (only used by vLLM framework)]" + echo " [--dgx-spark Use DGX-SPARK specific Dockerfile for vLLM (Blackwell GPU support, auto-detected for ARM64)]" echo "" echo " Note: When using --use-sccache, AWS credentials must be set:" echo " export AWS_ACCESS_KEY_ID=your_access_key" @@ -500,6 +510,14 @@ error() { get_options "$@" +# If --dgx-spark is specified, always force linux/arm64 (DGX-SPARK is ARM64 only) +if [[ "$USE_DGX_SPARK" == "true" ]]; then + if [[ "$PLATFORM" != *"linux/arm64"* ]]; then + echo "Note: --dgx-spark requires linux/arm64 platform, overriding to linux/arm64" + fi + PLATFORM="--platform linux/arm64" +fi + # Automatically set ARCH and ARCH_ALT if PLATFORM is linux/arm64 ARCH="amd64" if [[ "$PLATFORM" == *"linux/arm64"* ]]; then @@ -507,9 +525,23 @@ if [[ "$PLATFORM" == *"linux/arm64"* ]]; then BUILD_ARGS+=" --build-arg ARCH=arm64 --build-arg ARCH_ALT=aarch64 " fi +# Automatically use NIXL 0.7.0 (CUDA 13.0 support) for DGX-SPARK builds +# Only when explicitly building for DGX-SPARK with --dgx-spark flag +if [[ "$USE_DGX_SPARK" == "true" ]]; then + echo "Note: Using NIXL ${NIXL_REF_DGX} for CUDA 13.0 compatibility (DGX-SPARK)" + NIXL_REF=$NIXL_REF_DGX +fi + # Update DOCKERFILE if framework is VLLM if [[ $FRAMEWORK == "VLLM" ]]; then - DOCKERFILE=${SOURCE_DIR}/Dockerfile.vllm + # Use DGX-SPARK Dockerfile when: + # 1. 
Explicitly requested with --dgx-spark flag, OR + # 2. Building for ARM64 platform (DGX-SPARK requires Blackwell GPU support) + if [[ "$USE_DGX_SPARK" == "true" ]] || [[ "$PLATFORM" == *"linux/arm64"* ]]; then + DOCKERFILE=${SOURCE_DIR}/Dockerfile.vllm.dgx-spark + else + DOCKERFILE=${SOURCE_DIR}/Dockerfile.vllm + fi elif [[ $FRAMEWORK == "TRTLLM" ]]; then DOCKERFILE=${SOURCE_DIR}/Dockerfile.trtllm elif [[ $FRAMEWORK == "NONE" ]]; then @@ -592,7 +624,12 @@ fi # BUILD DEV IMAGE -BUILD_ARGS+=" --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg BASE_IMAGE_TAG=$BASE_IMAGE_TAG" +# Only pass BASE_IMAGE and BASE_IMAGE_TAG for non-DGX-SPARK builds +# DGX-SPARK uses hardcoded base images in Dockerfile.vllm.dgx-spark +# Skip these build args when building for DGX-SPARK +if [[ ! (("$USE_DGX_SPARK" == "true") || ("$PLATFORM" == *"linux/arm64"* && $FRAMEWORK == "VLLM")) ]]; then + BUILD_ARGS+=" --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg BASE_IMAGE_TAG=$BASE_IMAGE_TAG" +fi if [ -n "${GITHUB_TOKEN}" ]; then BUILD_ARGS+=" --build-arg GITHUB_TOKEN=${GITHUB_TOKEN} " diff --git a/docs/backends/vllm/DGX-SPARK_README.md b/docs/backends/vllm/DGX-SPARK_README.md new file mode 100644 index 0000000000..689f73281a --- /dev/null +++ b/docs/backends/vllm/DGX-SPARK_README.md @@ -0,0 +1,527 @@ + + +# LLM Deployment using vLLM on DGX-SPARK + +This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM on **DGX-SPARK systems**. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation. + +> [!NOTE] +> This guide is specifically tailored for **DGX-SPARK** systems running **ARM64 architecture**. For general x86_64 deployments, refer to the main [README.md](./README.md). + +## Use the Latest Release + +We recommend using the latest stable release of Dynamo to avoid breaking changes: + +[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) + +You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: + +```bash +git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) +``` + +--- + +## Table of Contents +- [DGX-SPARK Specific Considerations](#dgx-spark-specific-considerations) +- [Feature Support Matrix](#feature-support-matrix) +- [Quick Start](#quick-start) +- [Single Node Examples](#run-single-node-examples) +- [Advanced Examples](#advanced-examples) +- [Deploy on Kubernetes](#kubernetes-deployment) +- [Configuration](#configuration) + +## DGX-SPARK Specific Considerations + +### Architecture Requirements + +DGX-SPARK systems run on **ARM64 architecture** (`aarch64`), which requires specific build configurations: + +- **Platform**: `linux/arm64` +- **Architecture**: `arm64` / `aarch64` +- **Base Images**: ARM64-compatible NVIDIA CUDA images + +### Build Requirements + +When building containers for DGX-SPARK, you **must** specify the ARM64 platform: + +```bash +# Correct build command for DGX-SPARK +./container/build.sh --framework VLLM --platform linux/arm64 +``` + +> [!WARNING] +> **Do not use the default build command** (`./container/build.sh --framework VLLM`) as it defaults to `linux/amd64` and will cause `exec /bin/sh: exec format error` on ARM64 systems. 
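+
+If you are not sure which platform an already-built image targets, you can inspect it before running. This is a sketch, not part of the build script; the tag below is an example, so substitute whatever tag your build produced:
+
+```bash
+# Example tag; adjust to the tag your build produced (e.g. dynamo:latest-vllm)
+docker image inspect dynamo:latest-vllm --format '{{.Os}}/{{.Architecture}}'
+# A correctly built DGX-SPARK image reports: linux/arm64
+```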
+ +### Performance Considerations + +DGX-SPARK systems may have different performance characteristics compared to x86_64 systems: + +- **Memory bandwidth**: ARM64 systems may have different memory access patterns +- **GPU utilization**: Ensure proper GPU affinity and NUMA awareness +- **Container overhead**: ARM64 containers may have slightly different resource usage + +## Feature Support Matrix + +### Core Dynamo Features + +| Feature | vLLM on DGX-SPARK | Notes | +|---------|-------------------|-------| +| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | Fully supported on ARM64 | +| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP - ARM64 compatibility verified | +| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | ARM64 optimized | +| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | Platform agnostic | +| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP - ARM64 testing in progress | +| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | ARM64 compatible | +| [**LMCache**](./LMCache_Integration.md) | ✅ | ARM64 supported | + +### Large Scale P/D and WideEP Features + +| Feature | vLLM on DGX-SPARK | Notes | +|--------------------|-------------------|-----------------------------------------------------------------------| +| **WideEP** | ✅ | Support for PPLX / DeepEP verified on ARM64 | +| **DP Rank Routing**| ✅ | Supported via external control of DP ranks - ARM64 optimized | +| **GB200 Support** | 🚧 | Container functional on main - ARM64 compatibility testing ongoing | + +## vLLM Quick Start + +Below we provide a guide that lets you run all of our common deployment patterns on a single DGX-SPARK node. + +### Start NATS and ETCD in the background + +Start using [Docker Compose](../../../deploy/docker-compose.yml) + +```bash +docker compose -f deploy/docker-compose.yml up -d +``` + +### Pull or build container + +We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). + +> [!IMPORTANT] +> **For DGX-SPARK systems**, you must build your own container with ARM64 support: + +```bash +# Build for ARM64 architecture (DGX-SPARK) +./container/build.sh --framework VLLM --platform linux/arm64 + +# Or explicitly with --dgx-spark flag (also forces linux/arm64) +./container/build.sh --framework VLLM --dgx-spark +``` + +> [!NOTE] +> The build script automatically (when using `--dgx-spark` or ARM64 platform): +> - Detects ARM64 platform and sets the correct architecture arguments (`ARCH=arm64`, `ARCH_ALT=aarch64`) +> - Uses NIXL 0.7.0 (CUDA 13.0 support) instead of 0.6.0 when `--dgx-spark` flag is set +> - Uses a special `Dockerfile.vllm.dgx-spark` that leverages NVIDIA's pre-built vLLM container (`nvcr.io/nvidia/vllm:25.09-py3`) +> - This container already includes DGX Spark functional support (Blackwell GPU compute_121) and fixes the `nvcc fatal: Unsupported gpu architecture 'compute_121a'` error + +### Run container + +```bash +./container/run.sh -it --framework VLLM [--mount-workspace] +``` + +> [!NOTE] +> **`--mount-workspace` is optional** and mounts your local Dynamo project directory into the container at `/workspace`. 
Use it for: +> - **Development**: When you need to edit source files and see changes immediately +> - **Local testing**: When running examples that need access to project files +> +> Skip `--mount-workspace` for production deployments or when you don't need to modify source code. + +#### What happens when you run the container? + +When you run `./container/run.sh -it --framework VLLM`, it: + +1. **Starts a Docker container** using the `dynamo:latest-vllm` image +2. **Runs interactively** (`-it` flag) with a bash shell +3. **Does NOT automatically start any vLLM service** - it just gives you a shell inside the container +4. **No model is loaded by default** - you're just in an empty container environment + +The container uses `/opt/nvidia/nvidia_entrypoint.sh` as the entrypoint, which typically just starts a bash shell when no specific command is provided. + +#### Setting up different serving modes + +The serving modes are controlled by **launch scripts** that you run **inside the container**. Here's how: + +**Aggregated Serving (Single GPU):** +```bash +# Start the container +./container/run.sh -it --framework VLLM + +# Inside the container, run: +cd components/backends/vllm +bash launch/agg.sh +``` + +**Disaggregated Serving (Two GPUs):** +```bash +# Start the container +./container/run.sh -it --framework VLLM + +# Inside the container, run: +cd components/backends/vllm +bash launch/disagg.sh +``` + +**Other available modes:** +- `agg_router.sh` - Aggregated serving with KV routing (2 GPUs) +- `disagg_router.sh` - Disaggregated serving with KV routing (3 GPUs) +- `dep.sh` - Data Parallel Attention / Expert Parallelism (4 GPUs) + +#### Complete workflow example + +Here's a complete example for disaggregated serving: + +```bash +# 1. Build the container (if not already built) +./container/build.sh --framework VLLM --platform linux/arm64 + +# 2. Start the container +./container/run.sh -it --framework VLLM + +# 3. Inside the container, start disaggregated serving +cd components/backends/vllm +bash launch/disagg.sh + +# 4. Test the API +curl -X POST "http://localhost:8000/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-0.6B", + "prompt": "Hello, how are you?", + "max_tokens": 50 + }' +``` + +> [!TIP] +> **Key Points:** +> - **No model is loaded by default** - the container just gives you a shell +> - **Models are specified in the launch scripts** (currently `Qwen/Qwen3-0.6B`) +> - **You can modify the launch scripts** to use different models +> - **The `--enforce-eager` flag** is for quick testing (remove for production) +> - **GPU assignment** is handled by `CUDA_VISIBLE_DEVICES` in the scripts + +This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks. + +## Run Single Node Examples + +> [!IMPORTANT] +> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility. 
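+
+For orientation, the sketch below shows the two commands such a launch script wraps, using flags taken from the examples in this guide; the real scripts under `components/backends/vllm/launch/` set additional flags and handle cleanup, so treat this as an illustration only:
+
+```bash
+# Minimal sketch of what a launch script does (illustrative flags only)
+python3 -m dynamo.frontend --http-port 8000 &                      # ingress / HTTP frontend
+python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager     # vLLM worker
+```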
+ +This figure shows an overview of the major components to deploy: + +``` ++------+ +-----------+ +------------------+ +---------------+ +| HTTP |----->| dynamo |----->| vLLM Worker |------------>| vLLM Prefill | +| |<-----| ingress |<-----| |<------------| Worker | ++------+ +-----------+ +------------------+ +---------------+ + | ^ | + query best | | return | publish kv events + worker | | worker_id v + | | +------------------+ + | +---------| kv-router | + +------------->| | + +------------------+ +``` + +Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern. + +### Aggregated Serving + +```bash +# requires one gpu +cd components/backends/vllm +bash launch/agg.sh +``` + +### Aggregated Serving with KV Routing + +```bash +# requires two gpus +cd components/backends/vllm +bash launch/agg_router.sh +``` + +### Disaggregated Serving + +```bash +# requires two gpus +cd components/backends/vllm +bash launch/disagg.sh +``` + +### Disaggregated Serving with KV Routing + +```bash +# requires three gpus +cd components/backends/vllm +bash launch/disagg_router.sh +``` + +### Single Node Data Parallel Attention / Expert Parallelism + +This example is not meant to be performant but showcases Dynamo routing to data parallel workers + +```bash +# requires four gpus +cd components/backends/vllm +bash launch/dep.sh +``` + +> [!TIP] +> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker. + +## Advanced Examples + +Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example! + +### Multi-Node Disaggregated Serving + +DGX-SPARK systems are perfect for multi-node deployments, especially when paired with other GPU servers. This allows you to optimize resource utilization across different hardware configurations. + +#### Example: DGX-SPARK + RTX 3090 Setup + +**Hardware Configuration:** +- **DGX-SPARK (ARM64)**: Prefill worker (prompt processing) + Frontend +- **RTX 3090 Server (x86_64)**: Decode worker (token generation) + +**Why this works well:** +- DGX-SPARK: High compute power (~100 TFLOPs FP16) for compute-intensive prefill +- RTX 3090: Handles decode after receiving KV cache from DGX Spark +- Network efficiency: Only KV cache data transferred, not full model weights + +#### Step-by-Step Multi-Node Setup + +**1. Infrastructure Setup** + +**On DGX-SPARK (Head Node):** +```bash +# Start NATS and ETCD services +docker compose -f deploy/docker-compose.yml up -d + +# Set environment variables +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +**On RTX 3090 Server:** +```bash +# Set the same environment variables +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +**2. Build Containers** + +**On DGX-SPARK:** +```bash +# Option 1: Use platform flag (recommended) +./container/build.sh --framework VLLM --platform linux/arm64 + +# Option 2: Use dgx-spark flag (also uses NIXL 0.7.0 with CUDA 13.0) +./container/build.sh --framework VLLM --dgx-spark +``` + +**On RTX 3090 Server:** +```bash +./container/build.sh --framework VLLM --platform linux/amd64 +``` + +**3. Deploy Workers** + +> [!NOTE] +> **Worker flags**: Prefill worker uses `--is-prefill-worker` flag. 
Decode worker runs without any flag (just regular vllm command). + +**On DGX-SPARK (Prefill Worker + Frontend):** +```bash +# Start the container +./container/run.sh -it --framework VLLM + +# Inside the container, set environment variables: +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" + +cd components/backends/vllm + +# Start frontend with KV routing +python -m dynamo.frontend \ + --router-mode kv \ + --http-port 8000 & + +# Start prefill worker (handles prompt processing - compute-intensive) +python -m dynamo.vllm \ + --model Qwen/Qwen3-0.6B \ + --enforce-eager \ + --is-prefill-worker \ + --connector nixl +``` + +**On RTX 3090 Server (Decode Worker):** +```bash +# Start the container +./container/run.sh -it --framework VLLM + +# Inside the container, set environment variables: +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" + +cd components/backends/vllm + +# Start decode worker (handles token generation - memory-intensive) +python -m dynamo.vllm \ + --model Qwen/Qwen3-0.6B \ + --enforce-eager \ + --connector nixl +``` + +#### How Multi-Node Disaggregation Works + +1. **Request Flow:** + - Client sends request to DGX-SPARK frontend (port 8000) + - Frontend routes prefill work to DGX-SPARK itself (high compute for prefill) + - DGX-SPARK processes the prompt and builds the KV cache (compute-intensive) + - KV cache is transferred to the RTX 3090 server + - RTX 3090 server generates tokens using the KV cache (memory-intensive) + +2. **KV Cache Transfer:** + - Uses NIXL connector for efficient KV cache transfer over network + - Layer-by-layer streaming: Each layer's KV cache can be transferred as it's computed + - DGX-SPARK sends processed KV cache to decode server + - Decode server uses the KV cache for efficient token generation + +3. **Resource Optimization:** + - DGX-SPARK: Handles compute-intensive prefill (~100 TFLOPs compute) + - RTX 3090: Handles memory-intensive decode after receiving KV cache + +#### Network Requirements + +Ensure your firewall allows: +- **Port 4222**: NATS communication +- **Port 2379**: ETCD communication +- **Port 8000**: HTTP API (if accessing from external clients) + +#### Testing Multi-Node Setup + +Once both workers are running, test from any machine: + +```bash +curl -X POST "http://:8000/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-0.6B", + "prompt": "Hello, how are you?", + "max_tokens": 50 + }' +``` + +> [!TIP] +> **Multi-Node Benefits:** +> - **Resource optimization**: Use each machine's strengths +> - **Scalability**: Add more prefill or decode workers as needed +> - **Cost efficiency**: Leverage existing hardware optimally +> - **Network efficiency**: Only KV cache data transferred, not full model weights + +### Kubernetes Deployment + +For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../components/backends/vllm/deploy/README.md) + +> [!NOTE] +> When deploying on Kubernetes with DGX-SPARK nodes, ensure your cluster nodes are properly labeled with ARM64 architecture and use ARM64-compatible base images. + +## Configuration + +vLLM workers are configured through command-line arguments. 
Key parameters include:
+
+- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
+- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
+- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
+- `--connector`: Select the KV transfer backend vLLM should use (`nixl`, `lmcache`, `kvbm`, or `none`). This is a helper flag that overwrites the engine's `KVTransferConfig`.
+
+See `args.py` for the full list of configuration options and their defaults.
+
+The vLLM [serve arguments documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) recommends running `vllm serve --help` to list the available CLI arguments; Dynamo uses the same argument parser as vLLM, so those arguments can be passed through.
+
+### Hashing Consistency for KV Events
+
+When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
+
+- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
+- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:
+
+```bash
+vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
+```
+
+See the high-level notes in [KV Cache Routing](../../../docs/router/kv_cache_routing.md) on deterministic event IDs.
+
+## Request Migration
+
+You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+
+```bash
+python3 -m dynamo.vllm ... --migration-limit=3
+```
+
+This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
+
+## Request Cancellation
+
+When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
+
+### Cancellation Support Matrix
+
+|                   | Prefill | Decode |
+|-------------------|---------|--------|
+| **Aggregated**    | ✅      | ✅     |
+| **Disaggregated** | ✅      | ✅     |
+
+For more details, see the [Request Cancellation Architecture](../../../docs/fault_tolerance/request_cancellation.md) documentation.
+
+## Troubleshooting
+
+### Common Issues on DGX-SPARK
+
+#### Build Errors
+
+**Error**: `exec /bin/sh: exec format error`
+
+**Solution**: Ensure you're building with the correct platform:
+```bash
+./container/build.sh --framework VLLM --platform linux/arm64
+```
+
+**Error**: `nvcc fatal: Unsupported gpu architecture 'compute_121a'` (Blackwell GPU)
+
+**Solution**: This error occurs when building vLLM from source with an older CUDA toolchain that does not support Blackwell GPUs. The build script automatically uses `Dockerfile.vllm.dgx-spark`, which leverages NVIDIA's pre-built vLLM container (`nvcr.io/nvidia/vllm:25.09-py3`) with native DGX Spark support:
+```bash
+./container/build.sh --framework VLLM --platform linux/arm64
+```
+The special Dockerfile skips building vLLM from source and uses NVIDIA's container that already includes compute_121 (Blackwell) support.
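+
+To confirm that a running container really is the pre-built DGX-SPARK variant rather than a source build, a quick check is sketched below; it assumes `torch` and `vllm` are importable, which they are in the NVIDIA vLLM base image:
+
+```bash
+# Sanity checks inside the DGX-SPARK container (sketch; package availability assumed from the base image)
+nvcc --version | grep release                                           # expect a CUDA 13.x release string
+python3 -c "import vllm; print(vllm.__version__)"                       # pre-built vLLM from the base image
+python3 -c "import torch; print(torch.cuda.get_device_capability(0))"   # expect (12, 1) on Blackwell
+```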
+
+#### Performance Issues
+
+- **Memory bandwidth**: Monitor memory usage patterns specific to ARM64
+- **GPU utilization**: Check GPU affinity settings and NUMA topology
+- **Container overhead**: ARM64 containers may have different resource requirements
+
+#### Architecture Detection
+
+To verify your system architecture:
+```bash
+uname -m  # Should return 'aarch64' for DGX-SPARK
+```
+
+### Getting Help
+
+For DGX-SPARK specific issues:
+1. Check this README for ARM64-specific considerations
+2. Verify you're using the correct build platform (`linux/arm64`)
+3. Review the main [README.md](./README.md) for general troubleshooting
+4. Open an issue with `DGX-SPARK` and `ARM64` tags for specific support