diff --git a/container/BUILD_DGX_SPARK_GUIDE.md b/container/BUILD_DGX_SPARK_GUIDE.md new file mode 100644 index 0000000000..c2d4341acf --- /dev/null +++ b/container/BUILD_DGX_SPARK_GUIDE.md @@ -0,0 +1,130 @@ +# Building Dynamo for DGX-SPARK (vLLM) + +## How `build.sh` Chooses the Dockerfile + +The `build.sh` script automatically selects the correct Dockerfile based on the platform and optional flags: + +### Dockerfile Selection Logic + +``` +IF framework == "VLLM": + IF --dgx-spark flag is set OR platform is linux/arm64: + Use: Dockerfile.vllm.dgx-spark (NVIDIA's pre-built vLLM with Blackwell support) + ELSE: + Use: Dockerfile.vllm (Build from source) +ELSE IF framework == "TRTLLM": + Use: Dockerfile.trtllm +ELSE IF framework == "SGLANG": + Use: Dockerfile.sglang +ELSE: + Use: Dockerfile +``` + +### How to Use + +#### For DGX-SPARK (Blackwell GPUs) + +**Automatic detection (recommended):** +```bash +./container/build.sh --framework VLLM --platform linux/arm64 +``` + +**Explicit flag:** +```bash +./container/build.sh --framework VLLM --dgx-spark +``` + +#### For x86_64 (standard GPUs) + +```bash +./container/build.sh --framework VLLM +# or explicitly +./container/build.sh --framework VLLM --platform linux/amd64 +``` + +## Key Differences + +### Standard vLLM Dockerfile (`Dockerfile.vllm`) +- Builds vLLM from source +- Uses CUDA 12.8 +- Supports: Ampere, Ada, Hopper GPUs +- **Does NOT support Blackwell (compute_121)** + +### DGX-SPARK Dockerfile (`Dockerfile.vllm.dgx-spark`) +- Uses NVIDIA's pre-built vLLM container (`nvcr.io/nvidia/vllm:25.09-py3`) +- Uses CUDA 13.0 +- Supports: **Blackwell GPUs (compute_121)** via DGX-SPARK +- Skips building vLLM from source (avoids nvcc errors) +- **Builds UCX v1.19.0 from source** with CUDA 13 support +- **Builds NIXL 0.7.0 from source** with CUDA 13 support (self-contained, no cache dependency) +- **Builds NIXL Python wheel** with CUDA 13 support +- Adds Dynamo's runtime customizations and integrations + +## Why DGX-SPARK Needs Special Handling + +DGX-SPARK systems use **Blackwell GPUs** with architecture `compute_121`. When trying to build vLLM from source with older CUDA toolchains: + +``` +ERROR: nvcc fatal : Unsupported gpu architecture 'compute_121a' +``` + +**Solution:** Use NVIDIA's pre-built vLLM container that already includes: +- CUDA 13.0 support +- Blackwell GPU architecture support +- DGX Spark functional support +- NVFP4 format optimization + +### Why Build UCX and NIXL from Source? 
+ +The DGX-SPARK Dockerfile builds UCX v1.19.0 and NIXL 0.7.0 **from source** instead of copying from the base image: + +**Reason 1: CUDA 13 Compatibility** +- NIXL 0.7.0 is the first version with native CUDA 13.0 support +- Building from source ensures proper linkage against `libcudart.so.13` (not `libcudart.so.12`) +- Avoids runtime errors: `libcudart.so.12: cannot open shared object file` + +**Reason 2: Cache Independence** +- The base image (`dynamo_base`) may have cached NIXL 0.6.x built with CUDA 12 +- Building fresh in the DGX-SPARK Dockerfile ensures we always get NIXL 0.7.0 with CUDA 13 +- Self-contained build = predictable results + +**Reason 3: ARM64 Optimization** +- UCX and NIXL are built specifically for `aarch64` architecture +- GDS backend is disabled (`-Ddisable_gds_backend=true`) as it's not supported on ARM64 + +## Build Arguments + +When using the `--dgx-spark` flag, `build.sh` automatically: +- Selects `Dockerfile.vllm.dgx-spark` +- Sets `PLATFORM=linux/arm64` (forced) +- Sets `NIXL_REF=0.7.0` (for CUDA 13 support) +- Sets `ARCH=arm64` and `ARCH_ALT=aarch64` + +The DGX-SPARK Dockerfile itself hardcodes: +- `BASE_IMAGE=nvcr.io/nvidia/vllm` +- `BASE_IMAGE_TAG=25.09-py3` + +All other build arguments work the same way. + +## Troubleshooting + +### Error: `exec /bin/sh: exec format error` +- **Cause:** Building with wrong platform +- **Fix:** Use `--platform linux/arm64` for DGX-SPARK + +### Error: `nvcc fatal : Unsupported gpu architecture 'compute_121a'` +- **Cause:** Building from source without Blackwell support +- **Fix:** Use `--dgx-spark` or `--platform linux/arm64` to use pre-built container + +### Error: `libcudart.so.12: cannot open shared object file` +- **Cause:** NIXL was built with CUDA 12 but container has CUDA 13 +- **Fix:** Rebuild with `--dgx-spark` flag to ensure NIXL 0.7.0 with CUDA 13 support +- **Verify:** Inside container: `ldd /opt/nvidia/nvda_nixl/lib/aarch64-linux-gnu/plugins/libplugin_UCX_MO.so | grep cudart` should show `libcudart.so.13` (not `.so.12`) + +## References + +- [NVIDIA vLLM Release 25.09 Documentation](https://docs.nvidia.com/deeplearning/frameworks/vllm-release-notes/rel-25-09.html) +- [NVIDIA NGC Container Registry](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm) +- [NIXL 0.7.0 Release Notes](https://github.com/ai-dynamo/nixl/releases/tag/0.7.0) - CUDA 13.0 support +- [DGX-SPARK README](../docs/backends/vllm/DGX-SPARK_README.md) - Complete deployment guide + diff --git a/container/Dockerfile.vllm.dgx-spark b/container/Dockerfile.vllm.dgx-spark new file mode 100644 index 0000000000..4f92b2cdfb --- /dev/null +++ b/container/Dockerfile.vllm.dgx-spark @@ -0,0 +1,263 @@ +# syntax=docker/dockerfile:1.10.0 +# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +# DGX-SPARK specific Dockerfile for vLLM +# Uses NVIDIA's pre-built vLLM container that supports Blackwell GPUs (compute_121) +# See: https://docs.nvidia.com/deeplearning/frameworks/vllm-release-notes/rel-25-09.html + +ARG BASE_IMAGE="nvcr.io/nvidia/vllm" +ARG BASE_IMAGE_TAG="25.09-py3" + +ARG DYNAMO_BASE_IMAGE="dynamo:latest-none" +FROM ${DYNAMO_BASE_IMAGE} AS dynamo_base + +######################################################## +########## Runtime Image (based on NVIDIA vLLM) ####### +######################################################## +# +# PURPOSE: Production runtime environment for DGX-SPARK +# +# This stage uses NVIDIA's pre-built vLLM container that already includes: +# - vLLM with DGX Spark functional support (Blackwell compute_121) +# - CUDA 13.0 support +# - NVFP4 format support +# - All necessary GPU acceleration libraries +# +# We add Dynamo's customizations on top: +# - Dynamo runtime libraries +# - NIXL for KV cache transfer +# - Custom backend integrations +# + +FROM ${BASE_IMAGE}:${BASE_IMAGE_TAG} AS runtime + +WORKDIR /workspace +ENV DYNAMO_HOME=/opt/dynamo +ENV VIRTUAL_ENV=/opt/dynamo/venv +ENV PATH="${VIRTUAL_ENV}/bin:${PATH}" +# Add system Python site-packages to PYTHONPATH so we can use NVIDIA's vLLM +ENV PYTHONPATH="/usr/local/lib/python3.12/dist-packages:${PYTHONPATH}" + +# NVIDIA vLLM container already has Python 3.12 and vLLM installed +# We just need to set up Dynamo's virtual environment and dependencies +ARG ARCH_ALT=aarch64 +ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl +ENV NIXL_LIB_DIR=$NIXL_PREFIX/lib/${ARCH_ALT}-linux-gnu +ENV NIXL_PLUGIN_DIR=$NIXL_LIB_DIR/plugins + +# Install additional dependencies for Dynamo +# Note: NVIDIA vLLM container already has Python and CUDA tools +RUN apt-get update && \ + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + # Python runtime - CRITICAL for virtual environment to work + python3.12-dev \ + build-essential \ + # jq and curl for polling various endpoints and health checks + jq \ + git \ + git-lfs \ + curl \ + # Libraries required by UCX to find RDMA devices + libibverbs1 rdma-core ibverbs-utils libibumad3 \ + libnuma1 librdmacm1 ibverbs-providers \ + # JIT Kernel Compilation, flashinfer + ninja-build \ + g++ \ + # prometheus dependencies + ca-certificates && \ + rm -rf /var/lib/apt/lists/* + +# NVIDIA vLLM container has CUDA already, but ensure CUDA tools are in PATH +ENV PATH=/usr/local/cuda/bin:$PATH + +# DeepGemm runs nvcc for JIT kernel compilation, however the CUDA include path +# is not properly set for compilation. Set CPATH to help nvcc find the headers. 
+ENV CPATH=/usr/local/cuda/include + +### COPY NATS & ETCD ### +# Copy nats and etcd from dev image +COPY --from=dynamo_base /usr/bin/nats-server /usr/bin/nats-server +COPY --from=dynamo_base /usr/local/bin/etcd/ /usr/local/bin/etcd/ +# Add ETCD and CUDA binaries to PATH so cicc and other CUDA tools are accessible +ENV PATH=/usr/local/bin/etcd/:/usr/local/cuda/nvvm/bin:/usr/local/cuda/bin:$PATH + +### COPY UV EARLY (needed for building NIXL Python wheel) ### +COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv +COPY --from=ghcr.io/astral-sh/uv:latest /uvx /bin/uvx + +# Build UCX and NIXL directly in this stage for CUDA 13.0 support +# This ensures we get fresh NIXL 0.7.0 with CUDA 13 support, not cached CUDA 12 version + +# Build UCX from source +RUN apt-get update && \ + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + autoconf automake libtool pkg-config \ + libibverbs-dev librdmacm-dev \ + && rm -rf /var/lib/apt/lists/* \ + && cd /usr/local/src \ + && git clone https://github.com/openucx/ucx.git \ + && cd ucx && git checkout v1.19.0 \ + && ./autogen.sh \ + && ./configure \ + --prefix=/usr/local/ucx \ + --enable-shared \ + --disable-static \ + --disable-doxygen-doc \ + --enable-optimizations \ + --enable-cma \ + --enable-devel-headers \ + --with-cuda=/usr/local/cuda \ + --with-verbs \ + --with-dm \ + --enable-mt \ + && make -j$(nproc) \ + && make -j$(nproc) install-strip \ + && echo "/usr/local/ucx/lib" > /etc/ld.so.conf.d/ucx.conf \ + && echo "/usr/local/ucx/lib/ucx" >> /etc/ld.so.conf.d/ucx.conf \ + && ldconfig \ + && cd /usr/local/src \ + && rm -rf ucx + +# Build NIXL 0.7.0 from source with CUDA 13.0 support +# Build both C++ library and Python wheel +RUN apt-get update && \ + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + meson ninja-build python3-pip \ + && rm -rf /var/lib/apt/lists/* \ + && git clone --depth 1 --branch 0.7.0 "https://github.com/ai-dynamo/nixl.git" /opt/nixl \ + && cd /opt/nixl \ + && meson setup build/ --buildtype=release --prefix=$NIXL_PREFIX -Ddisable_gds_backend=true \ + && ninja -C build/ -j$(nproc) \ + && ninja -C build/ install \ + && echo "$NIXL_LIB_DIR" > /etc/ld.so.conf.d/nixl.conf \ + && echo "$NIXL_PLUGIN_DIR" >> /etc/ld.so.conf.d/nixl.conf \ + && ldconfig \ + && mkdir -p /opt/dynamo/wheelhouse/nixl \ + && /bin/uv build . 
--out-dir /opt/dynamo/wheelhouse/nixl --config-settings=setup-args="-Ddisable_gds_backend=true" \ + && cd - \ + && rm -rf /opt/nixl + +ENV PATH=/usr/local/ucx/bin:$PATH + +# Set library paths for NIXL and UCX +ENV LD_LIBRARY_PATH=\ +/usr/local/cuda/lib64:\ +$NIXL_LIB_DIR:\ +$NIXL_PLUGIN_DIR:\ +/usr/local/ucx/lib:\ +/usr/local/ucx/lib/ucx:\ +$LD_LIBRARY_PATH + +### VIRTUAL ENVIRONMENT SETUP ### +# Note: uv was already copied earlier (needed for building NIXL Python wheel) + +# Create Dynamo's virtual environment +RUN uv venv /opt/dynamo/venv --python 3.12 + +# Install Dynamo dependencies +# Note: vLLM is available via PYTHONPATH pointing to system Python +# Note: We copy dynamo wheels from base, but NIXL wheel was built fresh above with CUDA 13 support +COPY benchmarks/ /opt/dynamo/benchmarks/ +RUN mkdir -p /opt/dynamo/wheelhouse +COPY --from=dynamo_base /opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl /opt/dynamo/wheelhouse/ +COPY --from=dynamo_base /opt/dynamo/wheelhouse/ai_dynamo*.whl /opt/dynamo/wheelhouse/ +RUN uv pip install \ + /opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \ + /opt/dynamo/wheelhouse/ai_dynamo*any.whl \ + /opt/dynamo/wheelhouse/nixl/nixl*.whl \ + && cd /opt/dynamo/benchmarks \ + && UV_GIT_LFS=1 uv pip install --no-cache . \ + && cd - \ + && rm -rf /opt/dynamo/benchmarks + +# Install common and test dependencies +RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \ + --mount=type=bind,source=./container/deps/requirements.test.txt,target=/tmp/requirements.test.txt \ + UV_GIT_LFS=1 uv pip install \ + --no-cache \ + --requirement /tmp/requirements.txt \ + --requirement /tmp/requirements.test.txt + +# Copy benchmarks, examples, and tests for CI +COPY . /workspace/ + +# Copy attribution files +COPY ATTRIBUTION* LICENSE /workspace/ + +# Copy launch banner +RUN --mount=type=bind,source=./container/launch_message.txt,target=/workspace/launch_message.txt \ + sed '/^#\s/d' /workspace/launch_message.txt > ~/.launch_screen && \ + echo "cat ~/.launch_screen" >> ~/.bashrc && \ + echo "source $VIRTUAL_ENV/bin/activate" >> ~/.bashrc + +ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"] +CMD [] + +########################################################### +########## Development (run.sh, runs as root user) ######## +########################################################### +# +# PURPOSE: Local development environment for use with run.sh (not Dev Container plug-in) +# +# This stage runs as root and provides: +# - Development tools and utilities for local debugging +# - Support for vscode/cursor development outside the Dev Container plug-in +# +# Use this stage if you need a full-featured development environment with extra tools, +# but do not use it with the Dev Container plug-in. + +FROM runtime AS dev + +# Don't want ubuntu to be editable, just change uid and gid. 
+ARG WORKSPACE_DIR=/workspace + +# Install utilities as root +RUN apt-get update -y && \ + apt-get install -y --no-install-recommends \ + # Install utilities + nvtop \ + wget \ + tmux \ + vim \ + git \ + openssh-client \ + iproute2 \ + rsync \ + zip \ + unzip \ + htop \ + # Build Dependencies + autoconf \ + automake \ + cmake \ + libtool \ + meson \ + net-tools \ + pybind11-dev \ + # Rust build dependencies + clang \ + libclang-dev \ + protobuf-compiler && \ + rm -rf /var/lib/apt/lists/* + +# Set workspace directory variable +ENV WORKSPACE_DIR=${WORKSPACE_DIR} \ + DYNAMO_HOME=${WORKSPACE_DIR} \ + RUSTUP_HOME=/usr/local/rustup \ + CARGO_HOME=/usr/local/cargo \ + CARGO_TARGET_DIR=/workspace/target \ + VIRTUAL_ENV=/opt/dynamo/venv \ + PATH=/usr/local/cargo/bin:$PATH + +COPY --from=dynamo_base /usr/local/rustup /usr/local/rustup +COPY --from=dynamo_base /usr/local/cargo /usr/local/cargo + +# Install maturin, for maturin develop +# Editable install of dynamo +RUN uv pip install maturin[patchelf] && \ + uv pip install --no-deps -e . + +ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"] +CMD [] + diff --git a/container/build.sh b/container/build.sh index 7ce32457af..1a26405c02 100755 --- a/container/build.sh +++ b/container/build.sh @@ -24,6 +24,7 @@ set -e TAG= RUN_PREFIX= PLATFORM=linux/amd64 +USE_DGX_SPARK=false # Get short commit hash commit_id=$(git rev-parse --short HEAD) @@ -116,6 +117,7 @@ SGLANG_BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base" SGLANG_BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04" NIXL_REF=0.6.0 +NIXL_REF_DGX=0.7.0 # CUDA 13.0 support for DGX-SPARK NIXL_UCX_REF=v1.19.0 NIXL_UCX_EFA_REF=9d2b88a1f67faf9876f267658bd077b379b8bb76 @@ -339,6 +341,13 @@ get_options() { missing_requirement "$1" fi ;; + --dgx-spark) + if [ -n "$2" ] && [[ "$2" != --* ]]; then + echo "ERROR: --dgx-spark does not take any argument" + exit 1 + fi + USE_DGX_SPARK=true + ;; -?*) error 'ERROR: Unknown option: ' "$1" ;; @@ -482,6 +491,7 @@ show_help() { echo " [--sccache-bucket S3 bucket name for sccache (required with --use-sccache)]" echo " [--sccache-region S3 region for sccache (required with --use-sccache)]" echo " [--vllm-max-jobs number of parallel jobs for compilation (only used by vLLM framework)]" + echo " [--dgx-spark Use DGX-SPARK specific Dockerfile for vLLM (Blackwell GPU support, auto-detected for ARM64)]" echo "" echo " Note: When using --use-sccache, AWS credentials must be set:" echo " export AWS_ACCESS_KEY_ID=your_access_key" @@ -500,6 +510,14 @@ error() { get_options "$@" +# If --dgx-spark is specified, always force linux/arm64 (DGX-SPARK is ARM64 only) +if [[ "$USE_DGX_SPARK" == "true" ]]; then + if [[ "$PLATFORM" != *"linux/arm64"* ]]; then + echo "Note: --dgx-spark requires linux/arm64 platform, overriding to linux/arm64" + fi + PLATFORM="--platform linux/arm64" +fi + # Automatically set ARCH and ARCH_ALT if PLATFORM is linux/arm64 ARCH="amd64" if [[ "$PLATFORM" == *"linux/arm64"* ]]; then @@ -507,9 +525,23 @@ if [[ "$PLATFORM" == *"linux/arm64"* ]]; then BUILD_ARGS+=" --build-arg ARCH=arm64 --build-arg ARCH_ALT=aarch64 " fi +# Automatically use NIXL 0.7.0 (CUDA 13.0 support) for DGX-SPARK builds +# Only when explicitly building for DGX-SPARK with --dgx-spark flag +if [[ "$USE_DGX_SPARK" == "true" ]]; then + echo "Note: Using NIXL ${NIXL_REF_DGX} for CUDA 13.0 compatibility (DGX-SPARK)" + NIXL_REF=$NIXL_REF_DGX +fi + # Update DOCKERFILE if framework is VLLM if [[ $FRAMEWORK == "VLLM" ]]; then - DOCKERFILE=${SOURCE_DIR}/Dockerfile.vllm + # Use DGX-SPARK Dockerfile when: + # 1. 
Explicitly requested with --dgx-spark flag, OR + # 2. Building for ARM64 platform (DGX-SPARK requires Blackwell GPU support) + if [[ "$USE_DGX_SPARK" == "true" ]] || [[ "$PLATFORM" == *"linux/arm64"* ]]; then + DOCKERFILE=${SOURCE_DIR}/Dockerfile.vllm.dgx-spark + else + DOCKERFILE=${SOURCE_DIR}/Dockerfile.vllm + fi elif [[ $FRAMEWORK == "TRTLLM" ]]; then DOCKERFILE=${SOURCE_DIR}/Dockerfile.trtllm elif [[ $FRAMEWORK == "NONE" ]]; then @@ -592,7 +624,12 @@ fi # BUILD DEV IMAGE -BUILD_ARGS+=" --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg BASE_IMAGE_TAG=$BASE_IMAGE_TAG" +# Only pass BASE_IMAGE and BASE_IMAGE_TAG for non-DGX-SPARK builds +# DGX-SPARK uses hardcoded base images in Dockerfile.vllm.dgx-spark +# Skip these build args when building for DGX-SPARK +if [[ ! (("$USE_DGX_SPARK" == "true") || ("$PLATFORM" == *"linux/arm64"* && $FRAMEWORK == "VLLM")) ]]; then + BUILD_ARGS+=" --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg BASE_IMAGE_TAG=$BASE_IMAGE_TAG" +fi if [ -n "${GITHUB_TOKEN}" ]; then BUILD_ARGS+=" --build-arg GITHUB_TOKEN=${GITHUB_TOKEN} " diff --git a/docs/backends/vllm/DGX-SPARK_README.md b/docs/backends/vllm/DGX-SPARK_README.md new file mode 100644 index 0000000000..689f73281a --- /dev/null +++ b/docs/backends/vllm/DGX-SPARK_README.md @@ -0,0 +1,527 @@ + + +# LLM Deployment using vLLM on DGX-SPARK + +This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM on **DGX-SPARK systems**. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation. + +> [!NOTE] +> This guide is specifically tailored for **DGX-SPARK** systems running **ARM64 architecture**. For general x86_64 deployments, refer to the main [README.md](./README.md). + +## Use the Latest Release + +We recommend using the latest stable release of Dynamo to avoid breaking changes: + +[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) + +You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: + +```bash +git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) +``` + +--- + +## Table of Contents +- [DGX-SPARK Specific Considerations](#dgx-spark-specific-considerations) +- [Feature Support Matrix](#feature-support-matrix) +- [Quick Start](#quick-start) +- [Single Node Examples](#run-single-node-examples) +- [Advanced Examples](#advanced-examples) +- [Deploy on Kubernetes](#kubernetes-deployment) +- [Configuration](#configuration) + +## DGX-SPARK Specific Considerations + +### Architecture Requirements + +DGX-SPARK systems run on **ARM64 architecture** (`aarch64`), which requires specific build configurations: + +- **Platform**: `linux/arm64` +- **Architecture**: `arm64` / `aarch64` +- **Base Images**: ARM64-compatible NVIDIA CUDA images + +### Build Requirements + +When building containers for DGX-SPARK, you **must** specify the ARM64 platform: + +```bash +# Correct build command for DGX-SPARK +./container/build.sh --framework VLLM --platform linux/arm64 +``` + +> [!WARNING] +> **Do not use the default build command** (`./container/build.sh --framework VLLM`) as it defaults to `linux/amd64` and will cause `exec /bin/sh: exec format error` on ARM64 systems. 
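+
+If you are not sure which platform an already-built image targets, you can inspect it before running. This is a sketch, not part of the build script; the tag below is an example, so substitute whatever tag your build produced:
+
+```bash
+# Example tag; adjust to the tag your build produced (e.g. dynamo:latest-vllm)
+docker image inspect dynamo:latest-vllm --format '{{.Os}}/{{.Architecture}}'
+# A correctly built DGX-SPARK image reports: linux/arm64
+```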
+ +### Performance Considerations + +DGX-SPARK systems may have different performance characteristics compared to x86_64 systems: + +- **Memory bandwidth**: ARM64 systems may have different memory access patterns +- **GPU utilization**: Ensure proper GPU affinity and NUMA awareness +- **Container overhead**: ARM64 containers may have slightly different resource usage + +## Feature Support Matrix + +### Core Dynamo Features + +| Feature | vLLM on DGX-SPARK | Notes | +|---------|-------------------|-------| +| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | Fully supported on ARM64 | +| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP - ARM64 compatibility verified | +| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | ARM64 optimized | +| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | Platform agnostic | +| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP - ARM64 testing in progress | +| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | ARM64 compatible | +| [**LMCache**](./LMCache_Integration.md) | ✅ | ARM64 supported | + +### Large Scale P/D and WideEP Features + +| Feature | vLLM on DGX-SPARK | Notes | +|--------------------|-------------------|-----------------------------------------------------------------------| +| **WideEP** | ✅ | Support for PPLX / DeepEP verified on ARM64 | +| **DP Rank Routing**| ✅ | Supported via external control of DP ranks - ARM64 optimized | +| **GB200 Support** | 🚧 | Container functional on main - ARM64 compatibility testing ongoing | + +## vLLM Quick Start + +Below we provide a guide that lets you run all of our common deployment patterns on a single DGX-SPARK node. + +### Start NATS and ETCD in the background + +Start using [Docker Compose](../../../deploy/docker-compose.yml) + +```bash +docker compose -f deploy/docker-compose.yml up -d +``` + +### Pull or build container + +We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). + +> [!IMPORTANT] +> **For DGX-SPARK systems**, you must build your own container with ARM64 support: + +```bash +# Build for ARM64 architecture (DGX-SPARK) +./container/build.sh --framework VLLM --platform linux/arm64 + +# Or explicitly with --dgx-spark flag (also forces linux/arm64) +./container/build.sh --framework VLLM --dgx-spark +``` + +> [!NOTE] +> The build script automatically (when using `--dgx-spark` or ARM64 platform): +> - Detects ARM64 platform and sets the correct architecture arguments (`ARCH=arm64`, `ARCH_ALT=aarch64`) +> - Uses NIXL 0.7.0 (CUDA 13.0 support) instead of 0.6.0 when `--dgx-spark` flag is set +> - Uses a special `Dockerfile.vllm.dgx-spark` that leverages NVIDIA's pre-built vLLM container (`nvcr.io/nvidia/vllm:25.09-py3`) +> - This container already includes DGX Spark functional support (Blackwell GPU compute_121) and fixes the `nvcc fatal: Unsupported gpu architecture 'compute_121a'` error + +### Run container + +```bash +./container/run.sh -it --framework VLLM [--mount-workspace] +``` + +> [!NOTE] +> **`--mount-workspace` is optional** and mounts your local Dynamo project directory into the container at `/workspace`. 
Use it for: +> - **Development**: When you need to edit source files and see changes immediately +> - **Local testing**: When running examples that need access to project files +> +> Skip `--mount-workspace` for production deployments or when you don't need to modify source code. + +#### What happens when you run the container? + +When you run `./container/run.sh -it --framework VLLM`, it: + +1. **Starts a Docker container** using the `dynamo:latest-vllm` image +2. **Runs interactively** (`-it` flag) with a bash shell +3. **Does NOT automatically start any vLLM service** - it just gives you a shell inside the container +4. **No model is loaded by default** - you're just in an empty container environment + +The container uses `/opt/nvidia/nvidia_entrypoint.sh` as the entrypoint, which typically just starts a bash shell when no specific command is provided. + +#### Setting up different serving modes + +The serving modes are controlled by **launch scripts** that you run **inside the container**. Here's how: + +**Aggregated Serving (Single GPU):** +```bash +# Start the container +./container/run.sh -it --framework VLLM + +# Inside the container, run: +cd components/backends/vllm +bash launch/agg.sh +``` + +**Disaggregated Serving (Two GPUs):** +```bash +# Start the container +./container/run.sh -it --framework VLLM + +# Inside the container, run: +cd components/backends/vllm +bash launch/disagg.sh +``` + +**Other available modes:** +- `agg_router.sh` - Aggregated serving with KV routing (2 GPUs) +- `disagg_router.sh` - Disaggregated serving with KV routing (3 GPUs) +- `dep.sh` - Data Parallel Attention / Expert Parallelism (4 GPUs) + +#### Complete workflow example + +Here's a complete example for disaggregated serving: + +```bash +# 1. Build the container (if not already built) +./container/build.sh --framework VLLM --platform linux/arm64 + +# 2. Start the container +./container/run.sh -it --framework VLLM + +# 3. Inside the container, start disaggregated serving +cd components/backends/vllm +bash launch/disagg.sh + +# 4. Test the API +curl -X POST "http://localhost:8000/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-0.6B", + "prompt": "Hello, how are you?", + "max_tokens": 50 + }' +``` + +> [!TIP] +> **Key Points:** +> - **No model is loaded by default** - the container just gives you a shell +> - **Models are specified in the launch scripts** (currently `Qwen/Qwen3-0.6B`) +> - **You can modify the launch scripts** to use different models +> - **The `--enforce-eager` flag** is for quick testing (remove for production) +> - **GPU assignment** is handled by `CUDA_VISIBLE_DEVICES` in the scripts + +This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks. + +## Run Single Node Examples + +> [!IMPORTANT] +> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility. 
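+
+For orientation, the sketch below shows the two commands such a launch script wraps, using flags taken from the examples in this guide; the real scripts under `components/backends/vllm/launch/` set additional flags and handle cleanup, so treat this as an illustration only:
+
+```bash
+# Minimal sketch of what a launch script does (illustrative flags only)
+python3 -m dynamo.frontend --http-port 8000 &                      # ingress / HTTP frontend
+python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager     # vLLM worker
+```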
+ +This figure shows an overview of the major components to deploy: + +``` ++------+ +-----------+ +------------------+ +---------------+ +| HTTP |----->| dynamo |----->| vLLM Worker |------------>| vLLM Prefill | +| |<-----| ingress |<-----| |<------------| Worker | ++------+ +-----------+ +------------------+ +---------------+ + | ^ | + query best | | return | publish kv events + worker | | worker_id v + | | +------------------+ + | +---------| kv-router | + +------------->| | + +------------------+ +``` + +Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern. + +### Aggregated Serving + +```bash +# requires one gpu +cd components/backends/vllm +bash launch/agg.sh +``` + +### Aggregated Serving with KV Routing + +```bash +# requires two gpus +cd components/backends/vllm +bash launch/agg_router.sh +``` + +### Disaggregated Serving + +```bash +# requires two gpus +cd components/backends/vllm +bash launch/disagg.sh +``` + +### Disaggregated Serving with KV Routing + +```bash +# requires three gpus +cd components/backends/vllm +bash launch/disagg_router.sh +``` + +### Single Node Data Parallel Attention / Expert Parallelism + +This example is not meant to be performant but showcases Dynamo routing to data parallel workers + +```bash +# requires four gpus +cd components/backends/vllm +bash launch/dep.sh +``` + +> [!TIP] +> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker. + +## Advanced Examples + +Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example! + +### Multi-Node Disaggregated Serving + +DGX-SPARK systems are perfect for multi-node deployments, especially when paired with other GPU servers. This allows you to optimize resource utilization across different hardware configurations. + +#### Example: DGX-SPARK + RTX 3090 Setup + +**Hardware Configuration:** +- **DGX-SPARK (ARM64)**: Prefill worker (prompt processing) + Frontend +- **RTX 3090 Server (x86_64)**: Decode worker (token generation) + +**Why this works well:** +- DGX-SPARK: High compute power (~100 TFLOPs FP16) for compute-intensive prefill +- RTX 3090: Handles decode after receiving KV cache from DGX Spark +- Network efficiency: Only KV cache data transferred, not full model weights + +#### Step-by-Step Multi-Node Setup + +**1. Infrastructure Setup** + +**On DGX-SPARK (Head Node):** +```bash +# Start NATS and ETCD services +docker compose -f deploy/docker-compose.yml up -d + +# Set environment variables +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +**On RTX 3090 Server:** +```bash +# Set the same environment variables +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +**2. Build Containers** + +**On DGX-SPARK:** +```bash +# Option 1: Use platform flag (recommended) +./container/build.sh --framework VLLM --platform linux/arm64 + +# Option 2: Use dgx-spark flag (also uses NIXL 0.7.0 with CUDA 13.0) +./container/build.sh --framework VLLM --dgx-spark +``` + +**On RTX 3090 Server:** +```bash +./container/build.sh --framework VLLM --platform linux/amd64 +``` + +**3. Deploy Workers** + +> [!NOTE] +> **Worker flags**: Prefill worker uses `--is-prefill-worker` flag. 
Decode worker runs without any flag (just regular vllm command). + +**On DGX-SPARK (Prefill Worker + Frontend):** +```bash +# Start the container +./container/run.sh -it --framework VLLM + +# Inside the container, set environment variables: +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" + +cd components/backends/vllm + +# Start frontend with KV routing +python -m dynamo.frontend \ + --router-mode kv \ + --http-port 8000 & + +# Start prefill worker (handles prompt processing - compute-intensive) +python -m dynamo.vllm \ + --model Qwen/Qwen3-0.6B \ + --enforce-eager \ + --is-prefill-worker \ + --connector nixl +``` + +**On RTX 3090 Server (Decode Worker):** +```bash +# Start the container +./container/run.sh -it --framework VLLM + +# Inside the container, set environment variables: +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" + +cd components/backends/vllm + +# Start decode worker (handles token generation - memory-intensive) +python -m dynamo.vllm \ + --model Qwen/Qwen3-0.6B \ + --enforce-eager \ + --connector nixl +``` + +#### How Multi-Node Disaggregation Works + +1. **Request Flow:** + - Client sends request to DGX-SPARK frontend (port 8000) + - Frontend routes prefill work to DGX-SPARK itself (high compute for prefill) + - DGX-SPARK processes the prompt and builds the KV cache (compute-intensive) + - KV cache is transferred to the RTX 3090 server + - RTX 3090 server generates tokens using the KV cache (memory-intensive) + +2. **KV Cache Transfer:** + - Uses NIXL connector for efficient KV cache transfer over network + - Layer-by-layer streaming: Each layer's KV cache can be transferred as it's computed + - DGX-SPARK sends processed KV cache to decode server + - Decode server uses the KV cache for efficient token generation + +3. **Resource Optimization:** + - DGX-SPARK: Handles compute-intensive prefill (~100 TFLOPs compute) + - RTX 3090: Handles memory-intensive decode after receiving KV cache + +#### Network Requirements + +Ensure your firewall allows: +- **Port 4222**: NATS communication +- **Port 2379**: ETCD communication +- **Port 8000**: HTTP API (if accessing from external clients) + +#### Testing Multi-Node Setup + +Once both workers are running, test from any machine: + +```bash +curl -X POST "http://:8000/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-0.6B", + "prompt": "Hello, how are you?", + "max_tokens": 50 + }' +``` + +> [!TIP] +> **Multi-Node Benefits:** +> - **Resource optimization**: Use each machine's strengths +> - **Scalability**: Add more prefill or decode workers as needed +> - **Cost efficiency**: Leverage existing hardware optimally +> - **Network efficiency**: Only KV cache data transferred, not full model weights + +### Kubernetes Deployment + +For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../components/backends/vllm/deploy/README.md) + +> [!NOTE] +> When deploying on Kubernetes with DGX-SPARK nodes, ensure your cluster nodes are properly labeled with ARM64 architecture and use ARM64-compatible base images. + +## Configuration + +vLLM workers are configured through command-line arguments. 
Key parameters include:
+
+- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
+- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
+- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
+- `--connector`: Select the KV transfer backend vLLM should use (`nixl`, `lmcache`, `kvbm`, or `none`). This is a helper flag that overwrites the engine's `KVTransferConfig`.
+
+See `args.py` for the full list of configuration options and their defaults.
+
+The vLLM [serve arguments documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) recommends running `vllm serve --help` to list the available CLI arguments; Dynamo uses the same argument parser as vLLM, so those arguments can be passed through.
+
+### Hashing Consistency for KV Events
+
+When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
+
+- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
+- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:
+
+```bash
+vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
+```
+
+See the high-level notes in [KV Cache Routing](../../../docs/router/kv_cache_routing.md) on deterministic event IDs.
+
+## Request Migration
+
+You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+
+```bash
+python3 -m dynamo.vllm ... --migration-limit=3
+```
+
+This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
+
+## Request Cancellation
+
+When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
+
+### Cancellation Support Matrix
+
+|                   | Prefill | Decode |
+|-------------------|---------|--------|
+| **Aggregated**    | ✅      | ✅     |
+| **Disaggregated** | ✅      | ✅     |
+
+For more details, see the [Request Cancellation Architecture](../../../docs/fault_tolerance/request_cancellation.md) documentation.
+
+## Troubleshooting
+
+### Common Issues on DGX-SPARK
+
+#### Build Errors
+
+**Error**: `exec /bin/sh: exec format error`
+
+**Solution**: Ensure you're building with the correct platform:
+```bash
+./container/build.sh --framework VLLM --platform linux/arm64
+```
+
+**Error**: `nvcc fatal: Unsupported gpu architecture 'compute_121a'` (Blackwell GPU)
+
+**Solution**: This error occurs when building vLLM from source with an older CUDA toolchain that does not support Blackwell GPUs. The build script automatically uses `Dockerfile.vllm.dgx-spark`, which leverages NVIDIA's pre-built vLLM container (`nvcr.io/nvidia/vllm:25.09-py3`) with native DGX Spark support:
+```bash
+./container/build.sh --framework VLLM --platform linux/arm64
+```
+The special Dockerfile skips building vLLM from source and uses NVIDIA's container that already includes compute_121 (Blackwell) support.
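+
+To confirm that a running container really is the pre-built DGX-SPARK variant rather than a source build, a quick check is sketched below; it assumes `torch` and `vllm` are importable, which they are in the NVIDIA vLLM base image:
+
+```bash
+# Sanity checks inside the DGX-SPARK container (sketch; package availability assumed from the base image)
+nvcc --version | grep release                                           # expect a CUDA 13.x release string
+python3 -c "import vllm; print(vllm.__version__)"                       # pre-built vLLM from the base image
+python3 -c "import torch; print(torch.cuda.get_device_capability(0))"   # expect (12, 1) on Blackwell
+```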
+
+#### Performance Issues
+
+- **Memory bandwidth**: Monitor memory usage patterns specific to ARM64
+- **GPU utilization**: Check GPU affinity settings and NUMA topology
+- **Container overhead**: ARM64 containers may have different resource requirements
+
+#### Architecture Detection
+
+To verify your system architecture:
+```bash
+uname -m  # Should return 'aarch64' for DGX-SPARK
+```
+
+### Getting Help
+
+For DGX-SPARK specific issues:
+1. Check this README for ARM64-specific considerations
+2. Verify you're using the correct build platform (`linux/arm64`)
+3. Review the main [README.md](./README.md) for general troubleshooting
+4. Open an issue with `DGX-SPARK` and `ARM64` tags for specific support