diff --git a/.github/workflows/lint-docs.yaml b/.github/workflows/lint-docs.yaml
new file mode 100644
index 00000000000..f50853e7932
--- /dev/null
+++ b/.github/workflows/lint-docs.yaml
@@ -0,0 +1,28 @@
+name: Lint Documentation
+on:
+  push:
+    paths:
+      - "**.md"
+    branches:
+      - main
+  pull_request:
+    paths:
+      - "**.md"
+permissions:
+  contents: read
+
+jobs:
+  markdown-link-check:
+    name: Broken Links
+    runs-on: ubuntu-latest
+    steps:
+      - name: Harden Runner
+        uses: step-security/harden-runner@20cf305ff2072d973412fa9b1e3a4f227bda3c76 # v2.14.0
+        with:
+          egress-policy: audit
+
+      - uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
+      - uses: tcort/github-action-markdown-link-check@e7c7a18363c842693fadde5d41a3bd3573a7a225 # v1.1.2
+        with:
+          use-quiet-mode: 'yes'
+          config-file: .markdownlinkcheck.json
diff --git a/.gitignore b/.gitignore
index a8beb74cb7c..37431bda907 100644
--- a/.gitignore
+++ b/.gitignore
@@ -118,3 +118,8 @@ profiling_results*
 # Node.js
 node_modules/
 package-lock.json
+
+# Docusaurus
+docs/.docusaurus/
+docs/build/
+docs/.cache-loader/
diff --git a/benchmarks/incluster/README.md b/benchmarks/incluster/README.md
deleted file mode 120000
index ab6c21f5862..00000000000
--- a/benchmarks/incluster/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../../docs/benchmarks/benchmarking.md
\ No newline at end of file
diff --git a/benchmarks/incluster/README.md b/benchmarks/incluster/README.md
new file mode 100644
index 00000000000..fc6136bbac5
--- /dev/null
+++ b/benchmarks/incluster/README.md
@@ -0,0 +1,545 @@
+---
+title: "Dynamo Benchmarking Guide"
+---
+
+
+# Dynamo Benchmarking Guide
+
+This benchmarking framework lets you compare performance across any combination of:
+- **DynamoGraphDeployments**
+- **External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.)
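Before running a sweep, it helps to confirm that an endpoint is reachable and is serving the model you plan to benchmark. A minimal sketch, assuming the endpoint exposes the OpenAI-compatible `/v1/models` route (vLLM-style frontends typically do — verify for your stack); the helper names here are illustrative, not part of the framework:

```python
import json
import urllib.request


def model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]


def served_models(base_url: str) -> list:
    """Query an OpenAI-compatible endpoint for the models it serves."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return model_ids(json.load(resp))
```

For example, after port-forwarding a frontend service, `served_models("http://localhost:8000")` should include the model name you intend to pass to the benchmark.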
+ +## Choosing Your Benchmarking Approach + +Dynamo provides two benchmarking approaches to suit different use cases: **client-side** and **server-side**. Client-side refers to running benchmarks on your local machine and connecting to Kubernetes deployments via port-forwarding, while server-side refers to running benchmarks directly within the Kubernetes cluster using internal service URLs. Which method to use depends on your use case. + +**TLDR:** +Need high performance/load testing? Server-side. +Just quick testing/comparison? Client-side. + +### Use Client-Side Benchmarking When: +- You want to quickly test deployments +- You want immediate access to results on your local machine +- You're comparing external services or deployments (not necessarily just Dynamo deployments) +- You need to run benchmarks from your laptop/workstation + +→ **[Go to Client-Side Benchmarking (Local)](#client-side-benchmarking-local)** + +### Use Server-Side Benchmarking When: +- You have a development environment with kubectl access +- You're doing performance validation with high load/speed requirements +- You're experiencing timeouts or performance issues with client-side benchmarking +- You want optimal network performance (no port-forwarding overhead) +- You're running automated CI/CD pipelines +- You need isolated execution environments +- You're doing resource-intensive benchmarking +- You want persistent result storage in the cluster + +→ **[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)** + +### Quick Comparison + +| Feature | Client-Side | Server-Side | +|---------|-------------|-------------| +| **Location** | Your local machine | Kubernetes cluster | +| **Network** | Port-forwarding required | Direct service DNS | +| **Setup** | Quick and simple | Requires cluster resources | +| **Performance** | Limited by local resources, may timeout under high load | Optimal cluster performance, handles high load | +| **Isolation** | Shared 
environment | Isolated job execution | +| **Results** | Local filesystem | Persistent volumes | +| **Best for** | Light load | High load | + +## What This Tool Does + +The framework is a Python-based wrapper around `aiperf` that: +- Benchmarks any HTTP endpoints +- Runs concurrency sweeps across configurable load levels +- Generates comparison plots with your custom labels +- Works with any HuggingFace-compatible model on NVIDIA GPUs (H200, H100, A100, etc.) +- Provides direct Python script execution for maximum flexibility + +**Default sequence lengths**: Input: 2000 tokens, Output: 256 tokens (configurable with `--isl` and `--osl`) + +**Important**: The `--model` parameter configures AIPerf for benchmarking and provides logging context. The default `--model` value in the benchmarking script is `Qwen/Qwen3-0.6B`, but it must match the model deployed at the endpoint(s). + +--- + +## Client-Side Benchmarking (Local) {#client-side-benchmarking-local} + +Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding. + +## Prerequisites + +1. **Dynamo container environment** - You must be running inside a Dynamo container with the benchmarking tools pre-installed. + +2. **HTTP endpoints** - Ensure you have HTTP endpoints available for benchmarking. These can be: + - DynamoGraphDeployments exposed via HTTP endpoints + - External services (vLLM, llm-d, AIBrix, etc.) + - Any HTTP endpoint serving HuggingFace-compatible models + +3. **Benchmark dependencies** - Since benchmarks run locally, you need to install the required Python dependencies. Install them using: + ```bash + pip install -r deploy/utils/requirements.txt + ``` + +## User Workflow + +Follow these steps to benchmark Dynamo deployments using client-side benchmarking: + +### Step 1: Establish Kubernetes Cluster and Install Dynamo +Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. 
First follow the [installation guide](/docs/kubernetes/installation_guide.md) to install the Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md) to set up benchmarking resources.

### Step 2: Deploy DynamoGraphDeployments
Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Each deployment should have a frontend service exposed.

### Step 3: Port-Forward and Benchmark Deployment A
```bash
# Port-forward the frontend service for deployment A
kubectl port-forward -n <namespace> svc/<frontend-service> 8000:8000 > /dev/null 2>&1 &
# Note: remember to stop the port-forward process after benchmarking.

# Benchmark deployment A using Python scripts
python3 -m benchmarks.utils.benchmark \
  --benchmark-name deployment-a \
  --endpoint-url http://localhost:8000 \
  --model "your-model-name" \
  --output-dir ./benchmarks/results
```

### Step 4: [If Comparative] Teardown Deployment A and Establish Deployment B
If comparing multiple deployments, tear down deployment A and deploy deployment B with a different configuration.

### Step 5: [If Comparative] Port-Forward and Benchmark Deployment B
```bash
# Port-forward the frontend service for deployment B
kubectl port-forward -n <namespace> svc/<frontend-service> 8001:8000 > /dev/null 2>&1 &

# Benchmark deployment B using Python scripts
python3 -m benchmarks.utils.benchmark \
  --benchmark-name deployment-b \
  --endpoint-url http://localhost:8001 \
  --model "your-model-name" \
  --output-dir ./benchmarks/results
```

### Step 6: Generate Summary and Visualization
```bash
# Generate plots and summary using Python plotting script
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results

# Or plot only specific benchmark experiments
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b
```

## Use Cases

The benchmarking framework supports various comparative analysis scenarios:

- **Compare multiple DynamoGraphDeployments of a single backend** (e.g., aggregated vs disaggregated configurations)
- **Compare different backends** (e.g., vLLM vs TensorRT-LLM vs SGLang)
- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B)
- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
- **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations)

## Configuration and Usage

### Command Line Options

```bash
python3 -m benchmarks.utils.benchmark --benchmark-name <name> --endpoint-url <url> [OPTIONS]

REQUIRED:
  --benchmark-name NAME   Name/label for this benchmark (used in plots and results)
  --endpoint-url URL      HTTP endpoint URL to benchmark (e.g., http://localhost:8000)

OPTIONS:
  -h, --help              Show help message and examples
  -m, --model MODEL       Model name for AIPerf configuration and logging (default: Qwen/Qwen3-0.6B)
                          NOTE: This must match the model deployed at the endpoint
  -i, --isl LENGTH
Input sequence length (default: 2000) + -s, --std STDDEV Input sequence standard deviation (default: 10) + -o, --osl LENGTH Output sequence length (default: 256) + -d, --output-dir DIR Output directory (default: ./benchmarks/results) + --verbose Enable verbose output +``` + +### Important Notes + +- **Benchmark Name**: The benchmark name becomes the label in plots and results +- **Name Restrictions**: Names can only contain letters, numbers, hyphens, and underscores. The name `plots` is reserved. +- **Port-Forwarding**: You must have an exposed endpoint before benchmarking +- **Model Parameter**: The `--model` parameter configures AIPerf for testing and logging, and must match the model deployed at the endpoint +- **Sequential Benchmarking**: For comparative benchmarks, deploy and benchmark each configuration separately + +### What Happens During Benchmarking + +The Python benchmarking module: +1. **Connects** to your port-forwarded endpoint +2. **Benchmarks** using AIPerf at various concurrency levels (default: 1, 2, 5, 10, 50, 100, 250) +3. **Measures** key metrics: latency, throughput, time-to-first-token +4. **Saves** results to an output directory organized by benchmark name + +The Python plotting module: +1. **Generates** comparison plots using your benchmark name in `/plots/` +2. 
**Creates** summary statistics and visualizations + +### Plotting Options + +The plotting script supports several options for customizing which experiments to visualize: + +```bash +# Plot all benchmark experiments in the data directory +python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results + +# Plot only specific benchmark experiments +python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b + +# Specify custom output directory for plots +python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --output-dir ./custom-plots +``` + +**Available Options:** +- `--data-dir`: Directory containing benchmark results (required) +- `--benchmark-name`: Specific benchmark experiment name to plot (can be specified multiple times). Names must match subdirectory names under the data dir. +- `--output-dir`: Custom output directory for plots (defaults to data-dir/plots) + +**Note**: If `--benchmark-name` is not specified, the script will plot all subdirectories found in the data directory. + +### Using Your Own Models and Configuration + +The benchmarking framework supports any HuggingFace-compatible LLM model. Specify your model in the benchmark script's `--model` parameter. It must match the model name of the deployment. You can override the default sequence lengths (2000/256 tokens) with `--isl` and `--osl` flags if needed for your specific workload. + +The benchmarking framework is built around Python modules that provide direct control over the benchmark workflow. The Python benchmarking module connects to your existing endpoints, runs the benchmarks, and can generate plots. Deployment is user-managed and out of scope for this tool. + +### Comparison Limitations + +The plotting system supports up to 12 different benchmarks in a single comparison. 
+ +### Concurrency Configuration + +You can customize the concurrency levels using the CONCURRENCIES environment variable: + +```bash +# Custom concurrency levels +CONCURRENCIES="1,5,20,50" python3 -m benchmarks.utils.benchmark \ + --benchmark-name my-test \ + --endpoint-url http://localhost:8000 + +# Or set permanently +export CONCURRENCIES="1,2,5,10,25,50,100" +python3 -m benchmarks.utils.benchmark \ + --benchmark-name test \ + --endpoint-url http://localhost:8000 +``` + +## Understanding Your Results + +After benchmarking completes, check `./benchmarks/results/` (or your custom output directory): + +### Plot Labels and Organization + +The plotting script uses the `--benchmark-name` as the experiment name in all generated plots. For example: +- `--benchmark-name aggregated` → plots will show "aggregated" as the label +- `--benchmark-name vllm-disagg` → plots will show "vllm-disagg" as the label + +This allows you to easily identify and compare different configurations in the visualization plots. 
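Beyond the generated plots, the per-concurrency AIPerf JSON files can be post-processed directly. A sketch of loading one benchmark's sweep into a dictionary keyed by concurrency level (the directory layout follows the structure described below; the metric key names inside the JSON vary across AIPerf versions, so inspect your files before relying on specific fields):

```python
import json
from pathlib import Path


def collect_results(benchmark_dir):
    """Map concurrency level -> parsed AIPerf JSON for one benchmark run.

    Expects <benchmark_dir>/c<N>/profile_export_aiperf.json subdirectories.
    """
    results = {}
    for json_path in sorted(Path(benchmark_dir).glob("c*/profile_export_aiperf.json")):
        concurrency = int(json_path.parent.name.lstrip("c"))
        with open(json_path) as f:
            results[concurrency] = json.load(f)
    return results
```

From the returned dictionary you can build your own tables or plots, for example comparing a metric across concurrency levels for two benchmark names.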
### Summary and Plots

```text
benchmarks/results/plots
├── SUMMARY.txt                                   # Quick overview of all results
├── p50_inter_token_latency_vs_concurrency.png   # Token generation speed
├── avg_time_to_first_token_vs_concurrency.png   # Response time
├── request_throughput_vs_concurrency.png        # Requests per second
├── efficiency_tok_s_gpu_vs_user.png             # GPU efficiency
└── avg_inter_token_latency_vs_concurrency.png   # Average latency
```

### Data Files

Raw data is organized by deployment/benchmark type and concurrency level:

**For Any Benchmarking (uses your custom benchmark name):**
```text
results/                       # Client-side: ./benchmarks/results/ (or custom dir); server-side: /data/results/
├── plots/                     # Performance visualization plots
│   ├── SUMMARY.txt
│   ├── p50_inter_token_latency_vs_concurrency.png
│   ├── avg_inter_token_latency_vs_concurrency.png
│   ├── request_throughput_vs_concurrency.png
│   ├── efficiency_tok_s_gpu_vs_user.png
│   └── avg_time_to_first_token_vs_concurrency.png
├── <benchmark-name>/          # Results for your benchmark (uses your custom name)
│   ├── c1/                    # Concurrency level 1
│   │   └── profile_export_aiperf.json
│   ├── c2/                    # Concurrency level 2
│   ├── c5/                    # Concurrency level 5
│   └── ...                    # Other concurrency levels (10, 50, 100, 250)
└── <another-benchmark-name>/  # Results for additional benchmarking runs
    └── c*/                    # Same structure as above
```

**Example with actual benchmark names:**
```text
results/
├── plots/
├── experiment-a/    # --benchmark-name experiment-a
├── experiment-b/    # --benchmark-name experiment-b
└── experiment-c/    # --benchmark-name experiment-c
```

Each concurrency directory contains:
- **`profile_export_aiperf.json`** - Structured metrics from AIPerf
- **`profile_export_aiperf.csv`** - CSV format metrics from AIPerf
- **`profile_export.json`** - Raw AIPerf results
- **`inputs.json`** - Generated test inputs

---

## Server-Side Benchmarking (In-Cluster) {#server-side-benchmarking-in-cluster}

Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.

## What Server-Side Benchmarking Does

The server-side benchmarking solution:
- Runs benchmarks directly within the Kubernetes cluster using internal service URLs
- Uses Kubernetes service DNS for direct communication (no port forwarding required)
- Leverages the existing benchmarking infrastructure (`benchmarks.utils.benchmark`)
- Stores results persistently using `dynamo-pvc`
- Provides isolated execution environment with configurable resources
- Handles high load/speed requirements without timeout issues
- **Note**: Each benchmark job runs within a single Kubernetes namespace, but can benchmark services across multiple namespaces using the full DNS format `svc_name.namespace.svc.cluster.local`

## Prerequisites

1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](/docs/kubernetes/README.md))
2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md))
3.
**Docker image** containing the Dynamo benchmarking tools + +## Quick Start + +### Step 1: Deploy Your DynamoGraphDeployment +Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Ensure it has a frontend service exposed. + +### Step 2: Deploy and Run Benchmark Job + +**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag. + +```bash +export NAMESPACE=benchmarking + +# Deploy the benchmark job with default settings +kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE + +# Monitor the job, wait for it to complete +kubectl logs -f job/dynamo-benchmark -n $NAMESPACE +``` + +#### Customize the job configuration + +To customize the benchmark parameters, edit the `benchmarks/incluster/benchmark_job.yaml` file and modify: + +- **Model name**: Change `"Qwen/Qwen3-0.6B"` in the args section +- **Benchmark name**: Change `"qwen3-0p6b-vllm-agg"` to your desired benchmark name +- **Service URL**: Change `"vllm-agg-frontend:8000"` so the service URL matches your deployed service +- **Docker image**: Change the image field if needed + +Then deploy: +```bash +kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE +``` + +### Step 3: Retrieve Results +```bash +# Create access pod (skip this step if access pod is already running) +kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE +kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s + +# Download the results +kubectl cp $NAMESPACE/pvc-access-pod:/data/results/ ./benchmarks/results/ + +# Cleanup +kubectl 
delete pod pvc-access-pod -n $NAMESPACE
```

### Step 4: Generate Plots
```bash
# Generate performance plots from the downloaded results
python3 -m benchmarks.utils.plot \
  --data-dir ./benchmarks/results
```

This will create visualization plots. For more details on interpreting these plots, see the [Summary and Plots](#summary-and-plots) section above.

## Cross-Namespace Service Access

Server-side benchmarking can benchmark services across multiple namespaces from a single job using Kubernetes DNS. When referencing services in other namespaces, use the full DNS format:

```bash
# Access service in same namespace
SERVICE_URL=vllm-agg-frontend:8000

# Access service in different namespace
SERVICE_URL=vllm-agg-frontend.production.svc.cluster.local:8000
```

**DNS Format**: `<service-name>.<namespace>.svc.cluster.local:<port>`

This allows you to:
- Benchmark multiple services across different namespaces in a single job
- Compare services running in different environments (dev, staging, production)
- Test cross-namespace integrations without port-forwarding
- Run comprehensive cross-namespace performance comparisons

## Configuration

The benchmark job is configured directly in the YAML file.

### Default Configuration

- **Model**: `Qwen/Qwen3-0.6B`
- **Benchmark Name**: `qwen3-0p6b-vllm-agg`
- **Service**: `vllm-agg-frontend:8000`
- **Docker Image**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`

### Customizing the Job

To customize the benchmark, edit `benchmarks/incluster/benchmark_job.yaml`:

1. **Change the model**: Update the `--model` argument
2. **Change the benchmark name**: Update the `--benchmark-name` argument
3. **Change the service URL**: Update the `--endpoint-url` argument (use `<service-name>.<namespace>.svc.cluster.local:<port>` for cross-namespace access)
4. **Change Docker image**: Update the image field if needed

### Example: Multi-Namespace Benchmarking

To benchmark services across multiple namespaces, run a separate benchmark job for each service, since each job runs a single benchmark. The results are stored in the same PVC, so they can be accessed together.

```yaml
# Job 1: Production service
args:
  - --model
  - "Qwen/Qwen3-0.6B"
  - --benchmark-name
  - "prod-vllm"
  - --endpoint-url
  - "vllm-agg-frontend.production.svc.cluster.local:8000"
  - --output-dir
  - /data/results

# Job 2: Staging service
args:
  - --model
  - "Qwen/Qwen3-0.6B"
  - --benchmark-name
  - "staging-vllm"
  - --endpoint-url
  - "vllm-agg-frontend.staging.svc.cluster.local:8000"
  - --output-dir
  - /data/results
```

## Understanding Your Results

Results are stored in `/data/results` and follow the same structure as client-side benchmarking:

```text
/data/results/
└── <benchmark-name>/    # Results for your benchmark name
    ├── c1/              # Concurrency level 1
    │   └── profile_export_aiperf.json
    ├── c2/              # Concurrency level 2
    └── ...              # Other concurrency levels
```

## Monitoring and Debugging

### Check Job Status
```bash
kubectl describe job dynamo-benchmark -n $NAMESPACE
```

### View Logs
```bash
# Follow logs in real-time
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
```

### Debug Failed Jobs
```bash
# Check pod status
kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark

# Describe failed pod
kubectl describe pod <pod-name> -n $NAMESPACE
```

## Troubleshooting

### Common Issues

1. **Service not found**: Ensure your DynamoGraphDeployment frontend service is running
2. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible
3. **Image pull issues**: Ensure the Docker image is accessible from the cluster
4.
**Resource constraints**: Adjust resource limits if the job is being evicted + +### Debug Commands + +```bash +# Check PVC status +kubectl get pvc dynamo-pvc -n $NAMESPACE + +# Check service endpoints +kubectl get svc -n $NAMESPACE + +# Verify your service exists and has endpoints +SVC_NAME="${SERVICE_URL%%:*}" +kubectl get svc "$SVC_NAME" -n "$NAMESPACE" +kubectl get endpoints "$SVC_NAME" -n "$NAMESPACE" +``` + +--- + +## Customize Benchmarking Behavior + +The built-in Python workflow connects to endpoints, benchmarks with aiperf, and generates plots. If you want to modify the behavior: + +1. **Extend the workflow**: Modify `benchmarks/utils/workflow.py` to add custom deployment types or metrics collection + +2. **Generate different plots**: Modify `benchmarks/utils/plot.py` to generate a different set of plots for whatever you wish to visualize. + +3. **Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process. + +The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow. + +--- + +## Testing with Mocker Backend + +For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) that simulates LLM inference without requiring actual GPU resources. This is useful for: + +- **Testing deployments** without expensive GPU infrastructure +- **Developing and debugging** router, planner, or frontend logic +- **CI/CD pipelines** that need to validate infrastructure without model execution +- **Benchmarking framework validation** to ensure your setup works before using real backends + +The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference. 
+ +See the [mocker directory](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) for usage examples and configuration options. diff --git a/benchmarks/profiler/README.md b/benchmarks/profiler/README.md deleted file mode 120000 index d0192ec6a3e..00000000000 --- a/benchmarks/profiler/README.md +++ /dev/null @@ -1 +0,0 @@ -../../docs/benchmarks/sla_driven_profiling.md \ No newline at end of file diff --git a/benchmarks/profiler/README.md b/benchmarks/profiler/README.md new file mode 100644 index 00000000000..973ce53b1c5 --- /dev/null +++ b/benchmarks/profiler/README.md @@ -0,0 +1,624 @@ +--- +title: "SLA-Driven Profiling with DynamoGraphDeploymentRequest" +--- + +# SLA-Driven Profiling with DynamoGraphDeploymentRequest + +> [!TIP] +> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process. + +## Overview + +Dynamo provides automated SLA-driven profiling through **DynamoGraphDeploymentRequests (DGDR)**. Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically. 
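Conceptually, a DGDR pairs your SLA targets with profiling settings and lets the operator do the rest. The sketch below is purely illustrative — the `apiVersion` is assumed from the DGD examples elsewhere in this document, and the field names should be checked against the sample manifests shipped in `deploy/` (e.g. `profile_sla_dgdr.yaml`):

```yaml
# Hypothetical DGDR sketch -- consult the samples in deploy/ for the real schema
apiVersion: nvidia.com/v1alpha1          # assumed
kind: DynamoGraphDeploymentRequest
metadata:
  name: profile-my-model
spec:
  profilingConfig:
    config:
      # Illustrative SLA targets (ms), analogous to the profiler's --ttft / --itl flags
      sla:
        ttft: 200
        itl: 15
      sweep:
        use_ai_configurator: false       # profile with AIPerf on real engines
```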
+ +**Key Benefits:** +- **Declarative**: Specify SLAs, not implementation details +- **Automated**: No manual job setup or result processing +- **Integrated**: Seamlessly works with Dynamo Operator +- **Production-Ready**: Generates optimized configurations with SLA planner + +This document covers: +- Technical details of online vs offline profiling +- Profiling process internals (GPU usage, measurements, interpolation) +- Direct script usage for advanced scenarios +- Comprehensive troubleshooting + +## Support Matrix + +| Backend | Dense Models | MoE Models | +|---------|-------------|------------| +| vLLM | ✅ | 🚧 | +| SGLang | ✅ | ✅ | +| TensorRT-LLM | ✅ | 🚧 | + +Specifically, the profiler sweeps over the following parallelization mapping for prefill and decode: +| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping | +|---------|-------------|------------| +| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP | +| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP | +| Other Models | TP | TP | + +> [!NOTE] +> - Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend. + +## Using DGDR for Profiling (Recommended) + +The recommended way to profile models is through DGDRs. Sample configurations are provided in `deploy/`: + +**Available Samples:** +- **`profile_sla_dgdr.yaml`**: Standard profiling with AIPerf on real engines +- **`profile_sla_aic_dgdr.yaml`**: Fast profiling with AI Configurator simulation +- **`profile_sla_moe_dgdr.yaml`**: MoE model profiling + +The Dynamo Operator automatically: +1. Discovers GPU resources (cluster-scoped operators only) +2. Runs profiling (AIPerf on real engines or AI Configurator simulation) +3. Generates optimal DGD configuration with SLA planner +4. 
Deploys the DGD to your cluster

See the [Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for prerequisites and detailed instructions.

## Hardware Configuration

Hardware parameters have sensible defaults and are **optional** - you can override them if needed:

```yaml
profilingConfig:
  config:
    # Override hardware defaults if needed
    hardware:
      min_num_gpus_per_engine: 1
      max_num_gpus_per_engine: 8
      num_gpus_per_node: 8

    # Only needed when using AI Configurator (sweep.use_ai_configurator: true)
    sweep:
      aic_system: h200_sxm  # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
```

### Automatic GPU Discovery (Optional Feature)

Cluster-scoped operators can optionally enable automatic GPU discovery to detect hardware from cluster nodes. When enabled, the hardware config is auto-detected and overrides any manually specified values.

```yaml
spec:
  enableGpuDiscovery: true
```

This feature is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions. It is not available for namespace-restricted operators.

## Profiling Method

1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
2. **Identify Sweep Ranges**: Automatically determine the minimum and maximum number of GPUs per engine. The minimum is determined by the model size and GPU VRAM. The maximum is set to one node for dense models and 4 nodes for MoE models.
3. **Parallelization Mapping Sweep**: Using the input ISL and OSL, test the performance of the engines with different parallelization mappings.
   - For dense models, we test different TP sizes for both prefill and decode.
   - For MoE models (SGLang), we evaluate both TEP and DEP as candidates for prefill and decode.
   - **Prefill**:
     - TP/TEP: We measure TTFT with batch size = 1 (assuming the ISL is long enough to saturate compute) without KV reuse.
     - DEP: Attention uses data parallelism. We send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst. This stabilizes measurements when the first batch may launch before all requests arrive.

   ![Prefill Performance](/img/h100_prefill_performance.png)

   - **Decode**: Since the ITL (or iteration time) depends on how many requests are in flight, we measure the ITL under different numbers of in-flight requests, ranging from 1 to the maximum number of requests the engine's KV cache can hold. To measure the ITL without interference from piggy-backed prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring the ITL. Because the KV cache is sufficient for all the requests, it can hold the KV cache of the pre-computed prompts and skip the prefill phase when measuring the ITL. For MoE models, however, this is not guaranteed because the KV cache differs across attention DP ranks; we are working on a framework-side change to fix this issue. For example, the plot below shows the decode parallelization mapping sweep results on H100 for deepseek-ai/DeepSeek-R1-Distill-Llama-8B.

   ![Decode Performance](/img/h100_decode_performance.png)
4. **Recommendation**: Selects the optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLAs on TTFT and ITL. Specifically, the profiler chooses the point (or a point on the curve, for decode) that lies to the left of the vertical red dashed line representing the SLAs while having the highest y-coordinate (throughput per GPU).
5. **In-Depth Profiling on the Recommended P/D Engine**: After finding the best TP size for prefill and decode, the script interpolates the TTFT against ISL, and the ITL against active KV cache usage and decode context length. This provides a more accurate estimate of performance as ISL and OSL change and is used by the SLA planner.
![ITL Interpolation](/img/pd_interpolation.png)
   - **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size = 1.
   - **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths. The active KV usage determines the complexity of the memory-bound attention kernel, while the active KV usage divided by the average context length determines the complexity of the compute-bound MLP kernel. For example, the figure below shows the ITL of the DS-Distilled Llama 8B model on H100 TP4: the ITL grows near-linearly with active KV usage under a fixed context length, and the slope increases as the context length decreases.

To run the parallelization mapping sweep and the in-depth profiling on the recommended P/D engine, the profiler needs to know the engine's forward pass time under different loads. There are two ways to achieve this: run AIPerf on real engines, or use AI Configurator to run simulations.

### AIPerf on Real Engines

Profiles your model by creating real test deployments in Kubernetes and measuring their performance.

**Characteristics:**
- **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM

**DGDR Configuration:**
```yaml
profilingConfig:
  config:
    sweep:
      use_ai_configurator: false  # Default
```

### AI Configurator Simulation

Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
+ +**Characteristics:** +- **Duration**: 20-30 seconds +- **Accuracy**: Estimated (may have errors for unusual configurations) +- **GPU Requirements**: None +- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon) + +**DGDR Configuration:** +```yaml +profilingConfig: + config: + sweep: + use_ai_configurator: true + aic: + system: h200_sxm # GPU system type + model_name: QWEN3_32B # AIC model identifier + backend_version: "0.20.0" +``` + +**Supported Configurations:** + +For the current list of supported models, systems, and backend versions, see the [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features). + +To check from the command line: `aiconfigurator cli --help` + +**Currently supports:** +- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6) +- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM +- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more + +### Output Format + +After profiling, the DGDR status contains: + +1. **Recommended Configuration**: Optimal TP for prefill and decode +2. **Performance Data**: Interpolation models for SLA planner +3. **Generated DGD**: Complete deployment manifest + +**Example Recommendations:** +``` +Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU) +Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU) +``` + +#### Interactive Configuration Selection WebUI + +When running the profiler with `--pick-with-webui`, an interactive web interface is launched that allows you to visually explore profiling results and manually select configurations. 
+ +**Features:** +- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables +- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput +- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML +- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests) +- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets + +**Selection Methods:** +1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination +2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each + +**Example DGD Config Output:** + +When you click "Show Config", you'll see a DynamoGraphDeployment configuration like: + +```yaml +# DynamoGraphDeployment Configuration +# Prefill: 1 GPU(s), TP=1 +# Decode: 4 GPU(s), TP=4 +# Model: Qwen/Qwen3-32B-FP8 +# Backend: trtllm +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +spec: + services: + PrefillWorker: + subComponentType: prefill + replicas: 1 + extraPodSpec: + mainContainer: + args: + - --tensor-parallel-size=1 + DecodeWorker: + subComponentType: decode + replicas: 1 + extraPodSpec: + mainContainer: + args: + - --tensor-parallel-size=4 +``` + +**Usage:** +```bash +python -m benchmarks.profiler.profile_sla \ + --backend trtllm \ + --config path/to/disagg.yaml \ + --pick-with-webui \ + --use-ai-configurator \ + --model Qwen/Qwen3-32B-FP8 \ + --aic-system h200_sxm \ + --ttft 200 --itl 15 +``` + +Once you have selected a configuration, the full DynamoGraphDeployment CRD will be saved in your output folder as `config_with_planner.yaml`. + +The WebUI launches on port 8000 by default (configurable with `--webui-port`). 
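The recommendation logic described earlier — pick the configuration to the left of the SLA line with the highest per-GPU throughput — can be sketched as follows. This is an illustrative sketch only, not the profiler's actual internals; the candidate tuples and numbers are made up to mirror the example recommendations above.

```python
# Illustrative sketch of SLA-constrained selection (not the profiler's real code).
# Each candidate is (parallel_mapping, latency_ms, throughput_per_gpu).
def pick_best(candidates, sla_latency_ms):
    """Return the highest-throughput candidate that meets the latency SLA."""
    feasible = [c for c in candidates if c[1] <= sla_latency_ms]
    if not feasible:
        return None  # No configuration meets the SLA; relax targets or add GPUs
    return max(feasible, key=lambda c: c[2])

prefill_candidates = [
    ("TP1", 95.0, 9800.0),
    ("TP2", 61.0, 12900.0),
    ("TP4", 48.4, 15505.2),  # mirrors the example recommendation above
    ("TP8", 45.0, 9100.0),
]
best = pick_best(prefill_candidates, sla_latency_ms=200.0)
print(best)  # -> ('TP4', 48.4, 15505.2)
```

The same selection runs independently for prefill (against the TTFT target) and decode (against the ITL target).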
+ 
+#### Output Performance Plots
+
+The profiler generates the following plots to visualize the performance data:
+
+**Parallelization Mapping Sweep Plots:**
+- `prefill_performance.png`: TTFT vs parallelization mapping size
+- `decode_performance.png`: ITL vs parallelization mapping size and in-flight requests
+
+Note that these two plots are based on the input ISL and OSL.
+
+**In-Depth Profiling for the Recommended P/D Engine Plots:**
+- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL for the recommended prefill engine
+- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL for the recommended prefill engine
+- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length for the recommended decode engine
+- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length for the recommended decode engine
+
+
+### Output Interpolation Data
+
+The profiler generates `.npz` files to store the performance data for the recommended P/D engine:
+
+**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
+- `prefill_isl`: 1D array of input sequence lengths tested
+- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
+- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
+
+**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
+- `max_kv_tokens`: Total KV token capacity of the decode engine
+- `x_kv_usage`: 1D array of active KV usage percentages [0, 1]
+- `y_context_length`: 1D array of average context lengths tested
+- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
+- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
+
+## DGDR Configuration Reference
+
+This section provides detailed explanations of all DGDR `profilingConfig` options. 
The DGDR controller passes this configuration to the profiler script, which is defined in `benchmarks/profiler/utils/profiler_argparse.py`. + +### Configuration Structure + +All profiler configuration goes under `spec.profilingConfig.config`: + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeploymentRequest +metadata: + name: my-deployment +spec: + model: "Qwen/Qwen3-0.6B" # High-level: model to deploy + backend: vllm # High-level: inference backend + + profilingConfig: + profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Required + configMapRef: # Optional: base DGD config + name: my-config + key: disagg.yaml + + config: # Profiler configuration + sla: { ... } + hardware: { ... } + sweep: { ... } + aic: { ... } + planner: { ... } + + deploymentOverrides: # Optional + workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" +``` + +### SLA Configuration (Required) + +Define your performance requirements and workload characteristics: + +```yaml +profilingConfig: + config: + sla: + isl: 3000 # Average input sequence length (tokens) + osl: 150 # Average output sequence length (tokens) + ttft: 200.0 # Target Time To First Token (milliseconds) + itl: 20.0 # Target Inter-Token Latency (milliseconds) +``` + +**What these control:** +- **ISL/OSL**: Based on your expected traffic patterns +- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine) +- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine) +- **Trade-offs**: Tighter SLAs require more GPU resources + +### Hardware Configuration (Optional) + +Control GPU search space and constraints: + +```yaml +profilingConfig: + config: + hardware: + min_num_gpus_per_engine: 2 # if not provided, will automatically determine based on model and VRAM size + max_num_gpus_per_engine: 8 # Maximum GPUs to test + num_gpus_per_node: 8 # GPUs per node (for multi-node MoE) + gpu_type: h200_sxm # GPU type hint +``` + +**When to use:** +- 
**min_num_gpus_per_engine**: Skip small TP sizes if your model is large
+- **max_num_gpus_per_engine**: Limit the search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
+- **num_gpus_per_node**: Determines the upper bound on the number of GPUs per node for dense models and configures Grove for multi-node MoE engines
+- **gpu_type**: Informational, auto-detected by the controller
+
+> [!TIP]
+> If you don't specify hardware constraints, the controller auto-detects them based on your model size and available cluster resources.
+
+### Sweep Configuration (Optional)
+
+Control profiling behavior:
+
+```yaml
+profilingConfig:
+  config:
+    sweep:
+      use_ai_configurator: false             # Use offline profiling (default: false)
+      prefill_interpolation_granularity: 16  # Samples for prefill TTFT curve
+      decode_interpolation_granularity: 6    # Samples for decode ITL curve
+```
+
+**Use cases:**
+- **use_ai_configurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
+- **prefill_interpolation_granularity**: How many samples to benchmark for the prefill TTFT curve (lower = faster but possibly less accurate)
+- **decode_interpolation_granularity**: How many samples to benchmark for the decode ITL curve (lower = faster but possibly less accurate). Since the ITL interpolation is a 3D surface and takes longer to run, we default to a smaller number of samples. Increasing this value may increase the profiling time quadratically. 
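To see why decode profiling time can grow quadratically with the granularity, note that the decode sweep covers a two-dimensional surface of (KV usage, context length) points. The sketch below is a hypothetical illustration of that sample count, assuming the same granularity is used for both axes; it is not the profiler's actual sampling code.

```python
# Hypothetical illustration: if the decode sweep samples a (KV usage x context
# length) grid, the number of benchmark points grows with the square of the
# granularity.
def decode_sample_count(granularity: int) -> int:
    kv_usage_points = granularity        # samples along the KV-usage axis
    context_length_points = granularity  # samples along the context-length axis
    return kv_usage_points * context_length_points

for g in (6, 12):
    print(g, decode_sample_count(g))
# Doubling the granularity from 6 to 12 grows the sweep from 36 to 144 points (4x).
```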
+ +### AI Configurator Configuration (Required if `use_ai_configurator: true`) + +Configure AI Configurator profiling mode: + +```yaml +profilingConfig: + config: + sweep: + use_ai_configurator: true + aic_system: h200_sxm # GPU system: h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm + aic_hf_id: Qwen/Qwen3-32B # Huggingface model id + aic_backend_version: "0.20.0" # TensorRT-LLM version: 0.20.0, 1.0.0rc3 +``` + +**Supported configurations:** See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features) + +### Planner Configuration (Optional) + +Pass arguments to the SLA planner: + +```yaml +profilingConfig: + config: + planner: + planner_min_endpoint: 2 # Minimum endpoints to maintain + planner_adjustment_interval: 60 # Adjustment interval (seconds) + planner_load_predictor: linear # Load prediction method +``` + +> [!NOTE] +> Planner arguments use `planner_` prefix. See planner documentation for full list. + +### Engine Configuration (Auto-configured) + +The controller automatically sets these from high-level fields: + +```yaml +# You specify: +spec: + model: "Qwen/Qwen3-0.6B" + backend: vllm + +# Controller auto-injects into config: +profilingConfig: + config: + deployment: + model: "Qwen/Qwen3-0.6B" # From spec.model + engine: + backend: vllm # From spec.backend + config: /path/to/configmap # From spec.profilingConfig.configMapRef (if provided) +``` + +**You should not manually set** `deployment.model` or `engine.backend` in `profilingConfig.config` - they are automatically injected from the high-level fields. 
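Before submitting a DGDR, the rules above can be sanity-checked locally. The sketch below is an illustrative pre-flight check based only on the constraints documented in this section (required SLA fields, AIC section required when `use_ai_configurator` is true, and `deployment.model` being auto-injected); it is not an official validator.

```python
# Illustrative pre-flight check for a DGDR profilingConfig.config dict,
# based on the rules described in this guide (not an official validator).
def validate_profiling_config(config: dict) -> list[str]:
    errors = []
    sla = config.get("sla", {})
    for field in ("isl", "osl", "ttft", "itl"):
        if field not in sla:
            errors.append(f"sla.{field} is required")
    if config.get("sweep", {}).get("use_ai_configurator") and "aic" not in config:
        errors.append("aic section is required when use_ai_configurator is true")
    # Auto-injected by the controller; should not be set manually.
    if "model" in config.get("deployment", {}):
        errors.append("deployment.model is auto-injected; do not set it manually")
    return errors

cfg = {"sla": {"isl": 3000, "osl": 150, "ttft": 200.0, "itl": 20.0},
       "sweep": {"use_ai_configurator": True}}
print(validate_profiling_config(cfg))
# -> ['aic section is required when use_ai_configurator is true']
```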
+ +### Complete Example: AIPerf on Real Engines + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeploymentRequest +metadata: + name: vllm-dense-online +spec: + model: "Qwen/Qwen3-0.6B" + backend: vllm + + profilingConfig: + profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" + config: + sla: + isl: 3000 + osl: 150 + ttft: 200.0 + itl: 20.0 + + hardware: + min_num_gpus_per_engine: 1 + max_num_gpus_per_engine: 8 + + sweep: + use_ai_configurator: false + + deploymentOverrides: + workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" + + autoApply: true +``` + +### Complete Example: AI Configurator Simulation + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeploymentRequest +metadata: + name: trtllm-aic-offline +spec: + model: "Qwen/Qwen3-32B" + backend: trtllm + + profilingConfig: + profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1" + config: + sla: + isl: 4000 + osl: 500 + ttft: 300.0 + itl: 10.0 + + sweep: + use_ai_configurator: true + + aic: + system: h200_sxm + model_name: QWEN3_32B + backend_version: "0.20.0" + + deploymentOverrides: + workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1" + + autoApply: true +``` + +### Complete Example: MoE Model + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeploymentRequest +metadata: + name: sglang-moe +spec: + model: "deepseek-ai/DeepSeek-R1" + backend: sglang + + profilingConfig: + profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1" + config: + sla: + isl: 2048 + osl: 512 + ttft: 300.0 + itl: 25.0 + + hardware: + num_gpus_per_node: 8 + max_num_gpus_per_engine: 32 + + engine: + is_moe_model: true # Enable MoE profiling mode + + deploymentOverrides: + workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1" + + autoApply: true +``` + +## Troubleshooting + +### Profiling Takes Too Long + +**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only): +```yaml +sweep: + use_ai_configurator: true +``` + 
+**Solution 2**: Reduce the search space:
+```yaml
+config:
+  hardware:
+    min_num_gpus_per_engine: 4  # Skip TP1, TP2
+    max_num_gpus_per_engine: 8  # Don't test beyond TP8
+```
+
+### SLA Cannot Be Met
+
+**Symptoms**: Profiler reports that no configuration meets the targets
+
+**Solutions:**
+1. Relax SLA targets (increase TTFT/ITL)
+2. Add more GPU resources
+3. Try a different backend
+4. Use a smaller model
+
+### AI Configurator: Attention Head Constraint Error
+
+**Symptoms**: Profiling fails with error:
+```
+AssertionError: num_heads should be divisible by tp_size and the division result should be >= 4
+```
+
+**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
+
+**Affected Models:**
+- **Qwen3-0.6B** (16 heads): Max TP = 4 ❌ Fails at TP=8
+- **GPT-2** (12 heads): Max TP = 3
+- Most models **<1B parameters**: May hit this constraint
+
+**Solution**: Limit `max_num_gpus_per_engine` in your DGDR:
+
+```yaml
+profilingConfig:
+  profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
+  config:
+    hardware:
+      max_num_gpus_per_engine: 4  # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
+    sweep:
+      use_ai_configurator: true
+    aic:
+      system: h200_sxm
+      model_name: QWEN3_0_6B
+```
+
+**Calculate Max TP**: `max_tp = num_attention_heads / 4`
+
+> **Note**: This is an AI Configurator limitation. Online profiling doesn't have this constraint.
+
+### Image Pull Errors
+
+**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
+
+**Solution**: Ensure image pull secrets are configured:
+```bash
+kubectl create secret docker-registry nvcr-imagepullsecret \
+  --docker-server=nvcr.io \
+  --docker-username='$oauthtoken' \
+  --docker-password='<YOUR_NGC_API_KEY>' \
+  --namespace <NAMESPACE>
+```
+
+### Out of Memory During Profiling
+
+**Symptoms**: OOM errors in profiling jobs
+
+**Solutions:**
+1. Reduce `gpu_memory_utilization` in engine config
+2. Reduce `--max-context-length`
+3. Skip larger TP configurations
+4. 
Use fewer GPUs per test
+
+### Unsupported Parallelization Mapping in Backend
+
+**Symptoms**: Startup or runtime error in the backend. For example, a prime number of attention heads restricts the TP size to 1 (e.g., falcon-7b with 71 attention heads), or a backend may not support different TP sizes for prefill and decode.
+
+**Solutions:**
+1. Ask the backend maintainers to add support for the use case, then bump the backend version in Dynamo.
+2. Restrict the minimum and maximum number of GPUs per engine to the supported range.
+
+## Next Steps
+
+- **Deploy with DGDR**: See [Quick Start Guide](/docs/planner/sla_planner_quickstart.md)
+- **Understand SLA Planner**: Read [SLA Planner Deep Dive](/docs/planner/sla_planner.md)
+- **Monitor Deployments**: Set up [Observability](/docs/kubernetes/observability/metrics.md)
+- **Optimize Performance**: See [Performance Tuning](/docs/performance/tuning.md)
+
+## Related Documentation
+
+- [DGDR API Reference](/docs/kubernetes/api_reference.md)
+- [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md)
+- [SLA Planner Architecture](/docs/planner/sla_planner.md)
+- [Profiler Arguments Reference](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/profiler/utils/profiler_argparse.py)
diff --git a/deploy/README.md b/deploy/README.md
deleted file mode 120000
index f6eccd892ef..00000000000
--- a/deploy/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../docs/kubernetes/README.md
\ No newline at end of file
diff --git a/deploy/README.md b/deploy/README.md
new file mode 100644
index 00000000000..7caa302f5be
--- /dev/null
+++ b/deploy/README.md
@@ -0,0 +1,256 @@
+---
+title: "Deploying Dynamo on Kubernetes"
+---
+
+
+
+# Deploying Dynamo on Kubernetes
+
+High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
+
+## Important Terminology
+
+**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created. 
+- Used for: Resource isolation, RBAC, organizing deployments
+- Example: `dynamo-system`, `team-a-namespace`
+
+**Dynamo Namespace**: The logical namespace used by Dynamo components for [service discovery](/docs/kubernetes/service_discovery.md).
+- Used for: Runtime component communication, service discovery
+- Specified in: `.spec.services.<serviceName>.dynamoNamespace` field
+- Example: `my-llm`, `production-model`, `dynamo-dev`
+
+These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
+
+## Prerequisites
+
+Before you begin, ensure you have the following tools installed:
+
+| Tool | Minimum Version | Installation Guide |
+|------|-----------------|-------------------|
+| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
+| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |
+
+Verify your installation:
+```bash
+kubectl version --client  # Should show v1.24+
+helm version              # Should show v3.0+
+```
+
+For detailed installation instructions, see the [Prerequisites section](/docs/kubernetes/installation_guide.md#prerequisites) in the Installation Guide.
+
+## Pre-deployment Checks
+
+Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:
+
+```bash
+./deploy/pre-deployment/pre-deployment-check.sh
+```
+
+This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for more details.
+
+## 1. Install Platform First
+
+```bash
+# 1. Set environment
+export NAMESPACE=dynamo-system
+export RELEASE_VERSION=0.x.x  # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
+
+# 2. 
Install CRDs (skip if on shared cluster where CRDs already exist) +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz +helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default + +# 3. Install Platform +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz +helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace +``` + +**For Shared/Multi-Tenant Clusters:** + +If your cluster has namespace-restricted Dynamo operators, add this flag to step 3: +```bash +--set dynamo-operator.namespaceRestriction.enabled=true +``` + +For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](/docs/kubernetes/installation_guide.md)**. + +## 2. Choose Your Backend + +Each backend has deployment examples and configuration options: + +| Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node | +|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:| +| **[SGLang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +| **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | +| **[vLLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | + +## 3. 
Deploy Your First Model
+
+```bash
+export NAMESPACE=dynamo-system
+kubectl create namespace ${NAMESPACE}
+
+# To pull the model from Hugging Face
+export HF_TOKEN=<your_hf_token>
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="$HF_TOKEN" \
+  -n ${NAMESPACE}
+
+# Deploy any example (this uses vLLM with a Qwen model using aggregated serving)
+kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
+
+# Check status
+kubectl get dynamoGraphDeployment -n ${NAMESPACE}
+
+# Test it
+kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
+curl http://localhost:8000/v1/models
+```
+
+For SLA-based autoscaling, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
+
+## Understanding Dynamo's Custom Resources
+
+Dynamo provides two main Kubernetes Custom Resources for deploying models:
+
+### DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
+
+The **recommended approach** for generating optimal configurations. DGDR provides a high-level interface where you specify:
+- Model name and backend framework
+- SLA targets (latency requirements)
+- GPU type (optional)
+
+Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
+- SLA-driven configuration generation
+- Automated resource optimization
+- Users who want simplicity over control
+
+**Note**: DGDR generates a DGD spec, which you can then use to deploy.
+
+### DynamoGraphDeployment (DGD) - Direct Configuration
+
+A lower-level interface that defines your complete inference pipeline:
+- Model configuration
+- Resource allocation (GPUs, memory)
+- Scaling policies
+- Frontend/backend connections
+
+Use this when you need fine-grained control or have already completed profiling.
+
+Refer to the [API Reference and Documentation](/docs/kubernetes/api_reference.md) for more details. 
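As a concrete point of comparison between the two resources, a minimal DGDR sketch looks like the following. The model, backend, image, and SLA values are placeholders taken from the examples in this repository; see the API reference for the authoritative field list.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-request
spec:
  model: "Qwen/Qwen3-0.6B"   # placeholder model
  backend: vllm              # placeholder backend
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000    # expected input length (tokens)
        osl: 150     # expected output length (tokens)
        ttft: 200.0  # target time-to-first-token (ms)
        itl: 20.0    # target inter-token latency (ms)
  autoApply: true    # deploy the generated DGD automatically
```

Once applied, the controller profiles the model and writes the generated DGD spec to the DGDR status; with a DGD you would instead spell out the services, images, and GPU resources yourself.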
+ +## 📖 API Reference & Documentation + +For detailed technical specifications of Dynamo's Kubernetes resources: + +- **[API Reference](/docs/kubernetes/api_reference.md)** - Complete CRD field specifications for all Dynamo resources +- **[Create Deployment](/docs/kubernetes/deployment/create_deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment +- **[Operator Guide](/docs/kubernetes/dynamo_operator.md)** - Dynamo operator configuration and management + +### Choosing Your Architecture Pattern + +When creating a deployment, select the architecture pattern that best fits your use case: + +- **Development / Testing** - Use `agg.yaml` as the base configuration +- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference +- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability + +### Frontend and Worker Components + +You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). 
The Frontend serves as a framework-agnostic HTTP entry point that: + +- Provides OpenAI-compatible `/v1/chat/completions` endpoint +- Auto-discovers backend workers via [service discovery](/docs/kubernetes/service_discovery.md) (Kubernetes-native by default) +- Routes requests and handles load balancing +- Validates and preprocesses requests + +### Customizing Your Deployment + +Example structure: +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: my-llm +spec: + services: + Frontend: + dynamoNamespace: my-llm + componentType: frontend + replicas: 1 + extraPodSpec: + mainContainer: + image: your-image + VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker + dynamoNamespace: dynamo-dev + componentType: worker + replicas: 1 + envFromSecret: hf-token-secret # for HuggingFace models + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: your-image + command: ["/bin/sh", "-c"] + args: + - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags] +``` + +Worker command examples per backend: +```yaml +# vLLM worker +args: + - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B + +# SGLang worker +args: + - >- + python3 -m dynamo.sglang + --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --tp 1 + --trust-remote-code + +# TensorRT-LLM worker +args: + - python3 -m dynamo.trtllm + --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml +``` + +Key customization points include: +- **Model Configuration**: Specify model in the args command +- **Resource Allocation**: Configure GPU requirements under `resources.limits` +- **Scaling**: Set `replicas` for number of worker instances +- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs +- **Worker Specialization**: Add `--is-prefill-worker` flag for 
disaggregated prefill workers + +## Additional Resources + +- **[Examples](../examples.md)** - Complete working examples +- **[Create Custom Deployments](/docs/kubernetes/deployment/create_deployment.md)** - Build your own CRDs +- **[Managing Models with DynamoModel](/docs/kubernetes/deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models +- **[Operator Documentation](/docs/kubernetes/dynamo_operator.md)** - How the platform works +- **[Service Discovery](/docs/kubernetes/service_discovery.md)** - Discovery backends and configuration +- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users +- **[GitOps Deployment with FluxCD](/docs/kubernetes/fluxcd.md)** - For advanced users +- **[Logging](/docs/kubernetes/observability/logging.md)** - For logging setup +- **[Multinode Deployment](/docs/kubernetes/deployment/multinode-deployment.md)** - For multinode deployment +- **[Grove](/docs/kubernetes/grove.md)** - For grove details and custom installation +- **[Monitoring](/docs/kubernetes/observability/metrics.md)** - For monitoring setup +- **[Model Caching with Fluid](/docs/kubernetes/model_caching_with_fluid.md)** - For model caching with Fluid diff --git a/docs/DOCUSAURUS_MIGRATION_PLAN.md b/docs/DOCUSAURUS_MIGRATION_PLAN.md new file mode 100644 index 00000000000..1e88f4f80e3 --- /dev/null +++ b/docs/DOCUSAURUS_MIGRATION_PLAN.md @@ -0,0 +1,768 @@ +# Docusaurus Migration Plan for NVIDIA Dynamo Documentation + +> **Date:** January 2026 +> **Status:** Approved +> **Decision:** Full Migration (Option A) with Native Versioning + +--- + +## Table of Contents + +1. [Executive Summary](#executive-summary) +2. [Current State Analysis](#current-state-analysis) +3. [Migration Approach](#migration-approach) +4. [Local Development & Testing](#local-development--testing) +5. [Project Structure](#project-structure) +6. [Content Migration](#content-migration) +7. [Versioning Strategy](#versioning-strategy) +8. 
[Theme & Styling](#theme--styling) +9. [Implementation Steps](#implementation-steps) +10. [Migration Checklist](#migration-checklist) + +--- + +## Executive Summary + +Full migration from Sphinx to Docusaurus, replacing the existing documentation build system entirely. This approach uses Docusaurus native versioning and focuses on local development/testing before any deployment decisions. + +### Key Decisions + +| Decision | Choice | +|----------|--------| +| **Migration Approach** | Option A: Full Docusaurus Migration | +| **Versioning** | Docusaurus Native (`docs:version` command) | +| **Deployment** | TBD (local testing first) | +| **Theme** | Custom CSS on Classic Theme (NVIDIA branding) | + +### Goals + +- ✅ Complete replacement of Sphinx +- ✅ Native versioned documentation +- ✅ Local preview in browser for testing +- ✅ Modern developer experience with hot reload +- ✅ MDX support for interactive documentation + +--- + +## Current State Analysis + +### Existing Infrastructure + +| Component | Current Implementation | +|-----------|----------------------| +| **Framework** | Sphinx 7.x with nvidia_sphinx_theme | +| **Content Format** | RST (index.rst) + Markdown (MyST parser) | +| **Extensions** | mermaid, sphinx_design, ablog, sphinx_tabs, etc. | +| **Build System** | Makefile + Docker (container/Dockerfile.docs) | +| **CI/CD** | `.github/workflows/generate-docs.yml` (612 lines) | + +### Files to Migrate + +| Category | Files | Format | +|----------|-------|--------| +| Entry point | `index.rst` | RST → MDX | +| Configuration | `conf.py`, `Makefile` | → `docusaurus.config.ts` | +| Content | ~100+ docs | Markdown (mostly compatible) | +| Includes | `_includes/*.rst` | RST → MDX components | +| Extensions | `_extensions/github_alerts.py` | → MDX/Admonitions | +| Static assets | `_static/*`, `images/*` | → `static/` | + +--- + +## Migration Approach + +### Full Docusaurus Migration + +Complete replacement of Sphinx with Docusaurus, migrating all content. 
+ +**Benefits:** +- Clean break from legacy system +- Modern React-based stack +- Native versioning built-in +- Fast hot-reload development server +- Better search capabilities +- MDX support for interactive docs + +**New Structure:** +``` +docs/ +├── docusaurus/ # New Docusaurus project +│ ├── docusaurus.config.ts # Main configuration +│ ├── sidebars.ts # Navigation structure +│ ├── package.json # Dependencies +│ ├── tsconfig.json +│ ├── docs/ # Current version documentation +│ │ ├── intro.md +│ │ ├── backends/ +│ │ ├── kubernetes/ +│ │ ├── guides/ +│ │ └── ... +│ ├── versioned_docs/ # Auto-generated by Docusaurus +│ │ ├── version-0.3.0/ +│ │ └── version-0.2.0/ +│ ├── versioned_sidebars/ +│ ├── src/ +│ │ ├── components/ # Custom React components +│ │ ├── css/ +│ │ │ └── custom.css # NVIDIA theme overrides +│ │ └── pages/ +│ ├── static/ +│ │ └── img/ +│ └── versions.json # Version manifest +└── sphinx/ # Old Sphinx docs (to be removed after migration) +``` + +--- + +## Local Development & Testing + +### Quick Start + +```bash +# Navigate to docs directory +cd docs/docusaurus + +# Install dependencies +npm install + +# Start development server with hot reload +npm run start +# Opens http://localhost:3000 in your browser + +# Build static site (for testing production build) +npm run build + +# Serve the production build locally +npm run serve +# Opens http://localhost:3000 with production build +``` + +### Development Commands + +| Command | Description | +|---------|-------------| +| `npm run start` | Start dev server with hot reload (http://localhost:3000) | +| `npm run build` | Build production static site to `build/` | +| `npm run serve` | Serve production build locally | +| `npm run clear` | Clear Docusaurus cache | +| `npm run docusaurus docs:version X.Y.Z` | Create a new version snapshot | + +### Testing Workflow + +1. **Make changes** to docs in `docs/docusaurus/docs/` +2. **View instantly** at http://localhost:3000 (hot reload) +3. 
**Test production build:** + ```bash + npm run build && npm run serve + ``` +4. **Open browser** to http://localhost:3000 to verify + +--- + +## Project Structure + +### Initial Setup + +```bash +# Create Docusaurus project +cd docs +npx create-docusaurus@latest docusaurus classic --typescript + +# Install additional plugins +cd docusaurus +npm install @docusaurus/theme-mermaid +npm install @docusaurus/plugin-client-redirects +``` + +### Configuration Files + +#### `docusaurus.config.ts` + +```typescript +import {themes as prismThemes} from 'prism-react-renderer'; +import type {Config} from '@docusaurus/types'; +import type * as Preset from '@docusaurus/preset-classic'; + +const config: Config = { + title: 'NVIDIA Dynamo', + tagline: 'High-performance, low-latency inference framework', + favicon: 'img/favicon.ico', + + // For local testing, use localhost + url: 'http://localhost:3000', + baseUrl: '/', + + organizationName: 'ai-dynamo', + projectName: 'dynamo', + + onBrokenLinks: 'warn', + onBrokenMarkdownLinks: 'warn', + + i18n: { + defaultLocale: 'en', + locales: ['en'], + }, + + markdown: { + mermaid: true, + }, + + themes: ['@docusaurus/theme-mermaid'], + + presets: [ + [ + 'classic', + { + docs: { + routeBasePath: '/', // Docs at root + sidebarPath: './sidebars.ts', + editUrl: 'https://github.com/ai-dynamo/dynamo/tree/main/docs/docusaurus/', + showLastUpdateTime: true, + // Versioning config + lastVersion: 'current', + versions: { + current: { + label: 'dev', + path: 'dev', + }, + }, + }, + blog: false, // Disable blog + theme: { + customCss: './src/css/custom.css', + }, + } satisfies Preset.Options, + ], + ], + + plugins: [ + [ + '@docusaurus/plugin-client-redirects', + { + redirects: [ + // Preserve existing redirects from Sphinx + {from: '/guides/tool-calling', to: '/agents/tool-calling'}, + {from: '/architecture/architecture', to: '/design_docs/architecture'}, + // Add more as needed + ], + }, + ], + ], + + themeConfig: { + navbar: { + title: 'NVIDIA 
Dynamo', + logo: { + alt: 'NVIDIA Logo', + src: 'img/nvidia-logo.svg', + }, + items: [ + { + type: 'docsVersionDropdown', + position: 'right', + dropdownActiveClassDisabled: true, + }, + { + href: 'https://github.com/ai-dynamo/dynamo', + label: 'GitHub', + position: 'right', + }, + ], + }, + footer: { + style: 'dark', + links: [ + { + title: 'Documentation', + items: [ + {label: 'Getting Started', to: '/'}, + {label: 'Backends', to: '/backends'}, + {label: 'Kubernetes', to: '/kubernetes'}, + ], + }, + { + title: 'Community', + items: [ + {label: 'GitHub', href: 'https://github.com/ai-dynamo/dynamo'}, + {label: 'Issues', href: 'https://github.com/ai-dynamo/dynamo/issues'}, + ], + }, + ], + copyright: `Copyright © ${new Date().getFullYear()} NVIDIA Corporation & Affiliates`, + }, + prism: { + theme: prismThemes.github, + darkTheme: prismThemes.dracula, + additionalLanguages: ['bash', 'python', 'yaml', 'rust', 'toml', 'json'], + }, + } satisfies Preset.ThemeConfig, +}; + +export default config; +``` + +#### `sidebars.ts` + +```typescript +import type {SidebarsConfig} from '@docusaurus/plugin-content-docs'; + +const sidebars: SidebarsConfig = { + docs: [ + 'intro', + { + type: 'category', + label: 'Getting Started', + items: [ + 'installation', + 'quickstart', + 'support-matrix', + ], + }, + { + type: 'category', + label: 'Backends', + items: [ + 'backends/index', + { + type: 'category', + label: 'SGLang', + items: [ + 'backends/sglang/index', + 'backends/sglang/gpt-oss', + ], + }, + { + type: 'category', + label: 'vLLM', + items: [ + 'backends/vllm/index', + ], + }, + { + type: 'category', + label: 'TensorRT-LLM', + items: [ + 'backends/trtllm/index', + ], + }, + ], + }, + { + type: 'category', + label: 'Kubernetes', + items: [ + 'kubernetes/deployment', + 'kubernetes/observability', + 'kubernetes/multinode', + ], + }, + { + type: 'category', + label: 'User Guides', + items: [ + 'agents/tool-calling', + 'multimodal/index', + 'performance/tuning', + 
'observability/metrics',
      ],
    },
    {
      type: 'category',
      label: 'Design Docs',
      items: [
        'design_docs/architecture',
        'design_docs/disagg_serving',
        'design_docs/distributed_runtime',
      ],
    },
    {
      type: 'category',
      label: 'Reference',
      items: [
        'reference/cli',
        'reference/glossary',
      ],
    },
  ],
};

export default sidebars;
```

---

## Content Migration

### Automated Conversion Process

```bash
# 1. Convert RST files to Markdown (strip the .rst suffix so the output
#    is file.md rather than file.rst.md)
find ../sphinx -name "*.rst" | while read -r f; do
  pandoc "$f" -f rst -t gfm -o "${f%.rst}.md"
done

# 2. Copy Markdown files (already compatible), preserving directory structure
rsync -a --include='*/' --include='*.md' --exclude='*' ../sphinx/ docs/

# 3. Run migration script for Sphinx-specific syntax
python scripts/migrate_content.py
```

### Migration Script

Create `scripts/migrate_content.py`:

```python
#!/usr/bin/env python3
"""Migrate Sphinx markdown syntax to Docusaurus MDX."""

import re
from pathlib import Path

REPLACEMENTS = [
    # Admonitions: MyST -> Docusaurus. Match the whole fenced directive so
    # only admonition fences are rewritten; a bare ``` to ::: replacement
    # would also destroy the closing fence of every ordinary code block.
    # (Assumes admonitions contain no nested code fences.)
    (r'```\{(note|warning|tip|caution|danger)\}\n([\s\S]*?)\n```',
     r':::\1\n\2\n:::'),

    # References
    (r':ref:`([^`]+)`', r'[\1](\1.md)'),

    # Code blocks with sphinx-specific options
    (r'```\{code-block\} (\w+)', r'```\1'),

    # Remove toctree directives (handled by sidebars.ts)
    (r'```\{toctree\}[\s\S]*?```', ''),
]

def migrate_file(filepath: Path):
    content = filepath.read_text()

    for pattern, replacement in REPLACEMENTS:
        content = re.sub(pattern, replacement, content)

    # Write back
    filepath.write_text(content)
    print(f"Migrated: {filepath}")

def main():
    docs_dir = Path("docs")
    for md_file in docs_dir.rglob("*.md"):
        migrate_file(md_file)

if __name__ == "__main__":
    main()
```

### Manual Fixes Required

| Sphinx Feature | Docusaurus Equivalent |
|----------------|----------------------|
| `.. include::` directives | Import MDX components |
| `.. 
toctree::` | `sidebars.ts` configuration |
| `:doc:` references | Standard markdown links |
| `{guilabel}`, `{menuselection}` | Bold text or custom component |
| Sphinx tabs | `<Tabs>`/`<TabItem>` components from `@theme/Tabs` |

---

## Versioning Strategy

### Docusaurus Native Versioning

Docusaurus handles versioning automatically with the `docs:version` command.

**How it works:**

```bash
# When ready to release version 0.4.0:
npm run docusaurus docs:version 0.4.0

# This creates:
# - versioned_docs/version-0.4.0/ (snapshot of docs/)
# - versioned_sidebars/version-0.4.0-sidebars.json
# - Updates versions.json: ["0.4.0", "0.3.0", ...]
```

**Version Structure:**
```
docs/docusaurus/
├── docs/                  # "current" (dev) version
├── versioned_docs/
│   ├── version-0.4.0/     # Release 0.4.0
│   ├── version-0.3.0/     # Release 0.3.0
│   └── version-0.2.0/     # Release 0.2.0
├── versioned_sidebars/
│   ├── version-0.4.0-sidebars.json
│   ├── version-0.3.0-sidebars.json
│   └── version-0.2.0-sidebars.json
└── versions.json          # ["0.4.0", "0.3.0", "0.2.0"]
```

**Configuration in `docusaurus.config.ts`:**

```typescript
docs: {
  lastVersion: '0.4.0', // the newest released snapshot, not the dev docs
  versions: {
    current: {
      label: 'dev',
      path: 'dev',
      banner: 'unreleased',
    },
    '0.4.0': {
      label: '0.4.0 (latest)',
      path: 'latest',
    },
    '0.3.0': {
      label: '0.3.0',
      path: '0.3.0',
    },
  },
},
```

**URL Structure:**
```
/dev/    → Current development docs
/latest/ → Latest stable (0.4.0)
/0.3.0/  → Archived version
/0.2.0/  → Archived version
```

---

## Theme & Styling

### NVIDIA Branding with Custom CSS

Create `src/css/custom.css`:

```css
/**
 * NVIDIA Dynamo Documentation Theme
 * Custom styling to match NVIDIA branding
 */

:root {
  /* NVIDIA Brand Colors */
  --ifm-color-primary: #76b900;
  --ifm-color-primary-dark: #6aa600;
  --ifm-color-primary-darker: #5f9400;
  --ifm-color-primary-darkest: #4d7a00;
  --ifm-color-primary-light: #84c219;
  --ifm-color-primary-lighter: 
#93cb33;
  --ifm-color-primary-lightest: #a8d64d;

  /* Navigation */
  --ifm-navbar-background-color: #1a1a1a;
  --ifm-navbar-link-color: #ffffff;
  --ifm-navbar-link-hover-color: #76b900;

  /* Code blocks */
  --ifm-code-font-size: 95%;
  --docusaurus-highlighted-code-line-bg: rgba(118, 185, 0, 0.1);

  /* Sidebar */
  --ifm-menu-color-active: #76b900;
}

/* Dark mode */
[data-theme='dark'] {
  --ifm-background-color: #1a1a1a;
  --ifm-background-surface-color: #242424;
  --ifm-color-primary: #76b900;
}

/* Navbar styling */
.navbar {
  box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.1);
}

.navbar__title {
  font-weight: 700;
}

/* Footer styling */
.footer {
  background-color: #1a1a1a;
}

.footer__link-item {
  color: #b0b0b0;
}

.footer__link-item:hover {
  color: #76b900;
}

/* Admonitions */
.alert--note {
  --ifm-alert-background-color: rgba(118, 185, 0, 0.1);
  --ifm-alert-border-color: #76b900;
}

/* Version badge */
.badge--secondary {
  background-color: #76b900;
  border-color: #76b900;
}
```

### Logo Assets

Place in `static/img/`:
- `nvidia-logo.svg` - NVIDIA logo for navbar
- `favicon.ico` - Browser favicon

---

## Implementation Steps

### Phase 1: Setup (Day 1-2)

```bash
# 1. Create Docusaurus project (run from the repository's docs/ directory)
cd docs
npx create-docusaurus@latest docusaurus classic --typescript

# 2. Install dependencies
cd docusaurus
npm install @docusaurus/theme-mermaid @docusaurus/plugin-client-redirects

# 3. Apply configuration
# - Update docusaurus.config.ts (see above)
# - Create sidebars.ts
# - Add custom.css

# 4. Test setup
npm run start
# Verify http://localhost:3000 shows default page
```

### Phase 2: Content Migration (Day 3-5)

```bash
# 1. Copy existing markdown content
mkdir -p docs/backends docs/kubernetes docs/guides

# 2. Run conversion scripts
python scripts/migrate_content.py

# 3. 
Convert index.rst to intro.md +pandoc ../index.rst -f rst -t gfm -o docs/intro.md + +# 4. Copy images +cp -r ../images static/img/ + +# 5. Iteratively fix issues +npm run start # Check browser for errors +``` + +### Phase 3: Validation (Day 6-7) + +```bash +# 1. Build production site +npm run build + +# 2. Check for broken links +npm run serve +# Manually test navigation + +# 3. Test version switching +npm run docusaurus docs:version 0.3.0 +npm run start +# Verify version dropdown works +``` + +### Phase 4: Cleanup + +```bash +# After validation, remove old Sphinx files: +# - docs/conf.py +# - docs/Makefile +# - docs/index.rst +# - docs/_extensions/ +# - docs/_includes/ +# - docs/_static/ + +# Keep docusaurus/ as the new docs root (or move up) +``` + +--- + +## Migration Checklist + +### Pre-Migration +- [ ] Audit all existing content (pages, images, downloads) +- [ ] Document all Sphinx extensions in use +- [ ] Create redirect map from old URLs to new +- [ ] Set up Docusaurus development environment + +### Setup +- [ ] Initialize Docusaurus project +- [ ] Configure `docusaurus.config.ts` +- [ ] Create `sidebars.ts` from toctrees +- [ ] Add NVIDIA custom CSS theme +- [ ] Add logo and favicon + +### Content Migration +- [ ] Convert `index.rst` to `intro.md` +- [ ] Migrate all Markdown files +- [ ] Convert RST files to MDX +- [ ] Migrate images to `static/img/` +- [ ] Fix internal links +- [ ] Implement custom components (if needed) + +### Validation +- [ ] Test all pages render correctly +- [ ] Verify all internal links work +- [ ] Test code block syntax highlighting +- [ ] Test Mermaid diagrams +- [ ] Test version dropdown (after creating test version) +- [ ] Test mobile responsiveness +- [ ] Run `npm run build` without errors + +### Post-Migration +- [ ] Remove old Sphinx configuration files +- [ ] Update `.gitignore` for Docusaurus +- [ ] Update CONTRIBUTING.md for docs workflow +- [ ] Create GitHub Actions workflow (when ready for CI) + +--- + +## Quick 
Reference + +### Commands Cheat Sheet + +```bash +# Development +npm run start # Start dev server (http://localhost:3000) +npm run build # Build production site +npm run serve # Serve production build locally +npm run clear # Clear cache + +# Versioning +npm run docusaurus docs:version 0.4.0 # Create version snapshot + +# Debugging +npm run build -- --debug # Build with debug output +``` + +### File Locations + +| What | Where | +|------|-------| +| Main config | `docusaurus.config.ts` | +| Sidebar nav | `sidebars.ts` | +| Current docs | `docs/` | +| Versioned docs | `versioned_docs/` | +| Custom CSS | `src/css/custom.css` | +| Static files | `static/` | +| Build output | `build/` | + +--- + +*Document updated: January 2026* diff --git a/docs/MIGRATION_COMPLETE.md b/docs/MIGRATION_COMPLETE.md new file mode 100644 index 00000000000..1813e7cce80 --- /dev/null +++ b/docs/MIGRATION_COMPLETE.md @@ -0,0 +1,142 @@ +# Docusaurus Migration Summary Report + +**Migration Completed:** Phase 4 Complete (Restructured) +**Date:** January 2026 + +--- + +## Executive Summary + +The NVIDIA Dynamo documentation has been successfully migrated from Sphinx (reStructuredText/Markdown) to Docusaurus 3.9.2 (React/MDX). The Docusaurus project now lives directly in `docs/` (not a subfolder), providing a cleaner structure. The migration preserves all existing content while adding modern features including local search, improved navigation, and native versioning support. 
+

---

## Final Directory Structure

```
docs/
├── docusaurus.config.ts         # Main Docusaurus configuration
├── sidebars.ts                  # Navigation structure
├── package.json                 # Node.js dependencies
├── package-lock.json            # Dependency lock file
├── versions.json                # Version manifest
├── tsconfig.json                # TypeScript config
├── docs/                        # Current version content (for Docusaurus)
├── versioned_docs/              # Created via `docs:version` command
├── versioned_sidebars/          # Created via `docs:version` command
├── src/
│   └── css/custom.css           # NVIDIA theme (#76b900 green)
├── static/img/                  # Static images and assets
├── build/                       # Generated output (gitignored)
├── node_modules/                # Dependencies (gitignored)
├── agents/                      # Source content directories
├── backends/
├── kubernetes/
├── ... (other content dirs)
├── images/                      # Shared images
├── README.md                    # Build instructions
├── DOCUSAURUS_MIGRATION_PLAN.md # Original migration plan
└── MIGRATION_COMPLETE.md        # This summary
```

---

## Quick Reference

### Development Commands

```bash
cd docs

# Start development server (hot reload)
npm run start

# Build production site
npm run build

# Serve production build locally
npm run serve

# Clear cache
npm run clear
```

### Creating New Versions

When releasing a new version of Dynamo:

```bash
cd docs
npm run docusaurus docs:version X.Y.Z
```

Then update `docusaurus.config.ts` to configure the version labels and paths. 
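For example, with a hypothetical `0.4.0` release as the newest snapshot, the relevant options might look like this sketch (the version numbers are placeholders; substitute real Dynamo releases):

```typescript
// Sketch only: '0.4.0' is a placeholder release number.
// `lastVersion` names the newest released snapshot (not `current`),
// so readers land on the stable docs by default.
const docsVersionOptions = {
  lastVersion: '0.4.0',
  versions: {
    current: {label: 'dev (next)', path: 'dev', banner: 'unreleased'},
    '0.4.0': {label: '0.4.0 (latest)', path: '', banner: 'none'},
  },
};
```

These keys go under the `docs` options of the classic preset in `docusaurus.config.ts`; the same shape is shown in `docs/README.md`.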
+ +### URLs + +| Environment | URL | +|-------------|-----| +| Development | `http://localhost:3000` | +| Current docs | `/` | +| Versioned docs | `/X.Y.Z/` (after creating versions) | + +--- + +## Features Added + +| Feature | Implementation | +|---------|----------------| +| **Local Search** | `@easyops-cn/docusaurus-search-local` - Press `Ctrl+K` | +| **Version Dropdown** | Native Docusaurus versioning with navbar dropdown | +| **Mermaid Diagrams** | `@docusaurus/theme-mermaid` plugin | +| **Dark Theme** | Dark mode toggle in navbar | +| **NVIDIA Branding** | Custom CSS with #76b900 green theme | +| **Auto Sidebar** | Generated from directory structure | +| **MDX Support** | React components in Markdown | + +--- + +## Migration Statistics + +| Metric | Value | +|--------|-------| +| Total files migrated | 96 | +| RST files converted | 8 | +| Sphinx files removed | 8 | +| Sphinx directories removed | 6 | + +--- + +## Phase 4 Restructuring + +The Docusaurus project was moved from `docs/docusaurus/` to `docs/` directly: + +- ✅ Moved all Docusaurus config files to `docs/` +- ✅ Updated `editUrl` in docusaurus.config.ts +- ✅ Updated `.gitignore` paths +- ✅ Removed `docs/docusaurus/` subfolder +- ✅ Reset versioning (run `docs:version` to recreate) + +--- + +## Recommendations + +1. **Verify Content:** Review key pages to ensure formatting is correct +2. **Update CI/CD:** Modify pipeline to use `cd docs && npm run build` instead of Sphinx +3. **Link Check:** Run `npm run build` to catch broken internal links +4. **Create Versions:** Run `npm run docusaurus docs:version X.Y.Z` for each release +5. **Search Index:** Local search indexes on build; verify search works after deployment + +--- + +## Rollback (if needed) + +The original Sphinx files are preserved in git history. 
To rollback: + +```bash +git checkout HEAD~N -- docs/conf.py docs/Makefile docs/index.rst docs/_extensions docs/_includes docs/_static +``` + +--- + +**Migration Complete** 🎉 diff --git a/docs/Makefile b/docs/Makefile deleted file mode 100644 index 169b4bcdb96..00000000000 --- a/docs/Makefile +++ /dev/null @@ -1,94 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# Minimal makefile for Sphinx documentation -# - -# You can set these variables from the command line, and also -# from the environment for the first two. -SPHINXOPTS ?= -W -SPHINXBUILD ?= sphinx-build -SOURCEDIR = . -BUILDDIR = build - -##@ General - -# Put it first so that "make" without argument is like "make help". 
-help: ## Display help for all targets - @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) - @echo "" - @echo "Additional documentation targets:" - @awk 'BEGIN {FS = ":.*##"; printf " \033[36m%-20s\033[0m %s\n", "TARGET", "DESCRIPTION"} /^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST) - -clean: ## Clean build artifacts - @rm -fr ${BUILDDIR} - -##@ Helm Documentation - -## Location to install dependencies to -LOCALBIN ?= $(shell pwd)/bin -$(LOCALBIN): - mkdir -p $(LOCALBIN) - -## Tool Versions -HELM_DOCS_VERSION ?= 1.14.2 - -## Tool Binaries -HELM_DOCS ?= $(LOCALBIN)/helm-docs-$(HELM_DOCS_VERSION) - -.PHONY: helm-docs-install -helm-docs-install: $(HELM_DOCS) ## Download helm-docs locally if necessary -$(HELM_DOCS): $(LOCALBIN) - @echo "📥 Downloading helm-docs $(HELM_DOCS_VERSION)..." - @ARCH=$$(uname -m); \ - OS=$$(uname -s | tr '[:upper:]' '[:lower:]'); \ - curl -sSL "https://github.com/norwoodj/helm-docs/releases/download/v$(HELM_DOCS_VERSION)/helm-docs_$(HELM_DOCS_VERSION)_$${OS}_$${ARCH}.tar.gz" | \ - tar xz -C $(LOCALBIN) helm-docs && \ - mv $(LOCALBIN)/helm-docs $(HELM_DOCS) && \ - echo "✅ helm-docs $(HELM_DOCS_VERSION) installed successfully" - -.PHONY: generate-helm-docs -generate-helm-docs: helm-docs-install ## Generate README.md for Helm charts from values.yaml - @echo "📚 Generating Helm chart documentation..." - @cd ../deploy/helm/charts/platform && $(realpath $(HELM_DOCS)) \ - --template-files=README.md.gotmpl \ - --output-file=README.md \ - --sort-values-order=file \ - --chart-to-generate=. \ - --ignore-non-descriptions - @echo "✅ Generated documentation at ../deploy/helm/charts/platform/README.md" - -.PHONY: helm-docs-clean -helm-docs-clean: ## Remove generated helm documentation - @echo "🧹 Cleaning generated helm documentation..." 
- @rm -f ../deploy/helm/charts/platform/README.md - @echo "✅ Cleaned helm documentation" - -.PHONY: generate-crd-docs -generate-crd-docs: ## Generate CRD API reference documentation - @echo "📚 Generating CRD API reference documentation..." - @cd ../deploy/operator && make generate-api-docs - @echo "✅ CRD API reference generated" - -.PHONY: docs-all -docs-all: generate-helm-docs generate-crd-docs html ## Generate all documentation (Sphinx + Helm + CRDs) - -.PHONY: help Makefile clean - - -# Catch-all target: route all unknown targets to Sphinx using the new -# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). -%: - @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/README.md b/docs/README.md index a7b98729324..665815a4d5f 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,81 +2,164 @@ orphan: true --- -# Building Documentation +# NVIDIA Dynamo Documentation -This directory contains the documentation source files for NVIDIA Dynamo. +This directory contains the documentation source files for NVIDIA Dynamo, built with [Docusaurus](https://docusaurus.io/). 
-## Prerequisites +## Quick Start -- Python 3.11 or later -- [uv](https://docs.astral.sh/uv/) package manager +```bash +# Navigate to the docs directory +cd docs -## Build Instructions +# Install dependencies +npm install -### Option 1: Dedicated Docs Environment (Recommended) +# Start development server (with hot reload) +npm run start +# Opens http://localhost:3000 in your browser -This approach builds the docs without requiring the full project dependencies (including `ai-dynamo-runtime`): +# Build for production +npm run build -```bash -# One-time setup: Create docs environment and install dependencies -uv venv .venv-docs -uv pip install --python .venv-docs --group docs +# Serve production build locally +npm run serve +``` + +## Documentation Commands + +| Command | Description | +|---------|-------------| +| `npm run start` | Start dev server with hot reload | +| `npm run build` | Build production static site to `build/` | +| `npm run serve` | Serve production build locally | +| `npm run clear` | Clear Docusaurus cache | +| `npm run docusaurus docs:version X.Y.Z` | Create a new version snapshot | + +## Directory Structure -# Generate documentation -uv run --python .venv-docs --no-project docs/generate_docs.py ``` +docs/ +├── docusaurus.config.ts # Main Docusaurus configuration +├── sidebars.ts # Navigation structure +├── package.json # Dependencies +├── versions.json # Version manifest +├── tsconfig.json # TypeScript config +├── docs/ # Current version content +├── versioned_docs/ # Released versions (created via docs:version) +├── versioned_sidebars/ # Sidebars for each version +├── src/ +│ └── css/custom.css # NVIDIA theme +├── static/img/ # Static images +├── build/ # Generated output (gitignored) +├── agents/ # Content source (linked in docs/) +├── backends/ +├── kubernetes/ +└── ... +``` + +## Versioning -The generated HTML will be available in `docs/build/html/`. +The documentation supports multiple versions matching Dynamo releases. 
-### Option 2: Using Full Development Environment +### Creating a New Version -If you already have the full project dependencies installed (i.e., you're actively developing the codebase), you can use `uv run` directly: +When releasing a new version of Dynamo: ```bash -uv run --group docs docs/generate_docs.py +cd docs +npm run docusaurus docs:version X.Y.Z ``` -This will use your existing project environment and add the docs dependencies. +This will: +1. Copy `docs/` to `versioned_docs/version-X.Y.Z/` +2. Copy `sidebars.ts` to `versioned_sidebars/` +3. Add the version to `versions.json` -### Option 3: Using Docker +### Version Configuration -Build the docs in a Docker container with all dependencies isolated: +After creating versions, update `docusaurus.config.ts` to configure version labels and paths: -```bash -docker build -f container/Dockerfile.docs -t dynamo-docs . +```typescript +docs: { + lastVersion: 'X.Y.Z', // Set the latest stable version + versions: { + current: { label: 'dev (next)', path: 'dev', banner: 'unreleased' }, + 'X.Y.Z': { label: 'X.Y.Z (latest)', path: '', banner: 'none' }, + }, +} ``` -The documentation will be built inside the container. To extract the built docs: +## Writing Documentation -```bash -# Run the container and copy the output -docker run --rm -v $(pwd)/docs/build:/workspace/dynamo/docs/build dynamo-docs +### File Format -# Or create a container to copy files from -docker create --name temp-docs dynamo-docs -docker cp temp-docs:/workspace/dynamo/docs/build ./docs/build -docker rm temp-docs +Documentation is written in Markdown with [MDX](https://mdxjs.com/) support. + +### Frontmatter + +Each document should have frontmatter: + +```markdown +--- +title: "Page Title" +sidebar_position: 1 +--- + +# Page Title + +Content here... ``` -This approach is ideal for CI/CD pipelines or when you want complete isolation from your local environment. 
+
### Admonitions

-## Directory Structure

-- `docs/` - Documentation source files (Markdown and reStructuredText)
-- `docs/conf.py` - Sphinx configuration
-- `docs/_static/` - Static assets (CSS, JS, images)
-- `docs/_extensions/` - Custom Sphinx extensions
-- `docs/build/` - Generated documentation output (not tracked in git)
+Use Docusaurus admonitions for callouts:

+```markdown
+:::note
+This is a note.
+:::
+
+:::tip
+This is a tip.
+:::
+
+:::warning
+This is a warning.
+:::
+
+:::danger
+This is a danger notice.
+:::
+```

-## Redirect Creation
+### Code Blocks
+
+````markdown
+```python title="example.py"
+def hello():
+    print("Hello, Dynamo!")
+```
+````
+
+### Internal Links
+
+Link to other docs using relative paths:
+
+```markdown
+See the [Backend Guide](./backends/vllm/README.md) for more details.
+```

-When moving or renaming files a redirect must be created.
+## Search

-Redirect entries should be added to the `redirects` dictionary in `conf.py`. For detailed information on redirect syntax, see the [sphinx-reredirects usage documentation](https://documatt.com/sphinx-reredirects/usage/#introduction).
+The documentation includes local search powered by `@easyops-cn/docusaurus-search-local`. Use `Ctrl+K` to open search.

-## Dependency Management
+## Theme

-Documentation dependencies are defined in `pyproject.toml` under the `[dependency-groups]` section:
+The site uses the Docusaurus Classic theme with custom NVIDIA branding:
+- Primary color: NVIDIA Green (#76b900)
+- Dark navbar and footer
+- Custom logo and favicon

```toml
[dependency-groups]
diff --git a/docs/_extensions/__init__.py b/docs/_extensions/__init__.py
deleted file mode 100644
index 868a8a06587..00000000000
--- a/docs/_extensions/__init__.py
+++ /dev/null
@@ -1,19 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -Custom Sphinx extensions for Dynamo documentation. -""" - -__version__ = "0.1.0" diff --git a/docs/_extensions/github_alerts.py b/docs/_extensions/github_alerts.py deleted file mode 100644 index fec4d3a43fd..00000000000 --- a/docs/_extensions/github_alerts.py +++ /dev/null @@ -1,255 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -""" -AST-based Sphinx extension to convert GitHub-flavored markdown alerts to MyST admonitions. - -This extension works on the parsed document AST, making it more robust than text preprocessing. -It finds blockquote nodes that match GitHub alert patterns and replaces them with admonition nodes. 
-""" - -import re -from typing import Any, Dict - -from docutils import nodes -from sphinx.application import Sphinx -from sphinx.util import logging - -__version__ = "0.2.0" - -# Set up logger for the extension -logger = logging.getLogger(__name__) - -# Log when the extension module is imported -logger.info(f"GitHub alerts extension v{__version__} imported successfully") - - -class GitHubAlertsTransformer: - """AST transformer for GitHub alerts to MyST admonitions.""" - - # Mapping of GitHub alert types to MyST admonition types - ALERT_MAPPING = { - "note": nodes.note, - "tip": nodes.tip, - "important": nodes.important, - "warning": nodes.warning, - "caution": nodes.caution, - "danger": nodes.danger, - "info": nodes.note, # Map info to note - "hint": nodes.tip, # Map hint to tip - } - - def __init__(self): - # Regex to match GitHub alert syntax in text - self.alert_pattern = re.compile(r"^\[!(.*?)\](?:\s+(.*))?$") - - def is_github_alert_blockquote(self, node: nodes.block_quote) -> bool: - """ - Check if a blockquote node represents a GitHub alert. - - Returns: - bool: True if this is a GitHub alert blockquote, False otherwise - """ - if not isinstance(node, nodes.block_quote): - return False - - # GitHub alerts start with a paragraph containing [!TYPE] - if not node.children or not isinstance(node.children[0], nodes.paragraph): - return False - - first_para = node.children[0] - if not first_para.children or not isinstance( - first_para.children[0], nodes.Text - ): - return False - - first_text = first_para.children[0].astext() - match = self.alert_pattern.match(first_text.strip()) - - return match is not None - - def create_admonition_node(self, blockquote: nodes.block_quote) -> nodes.admonition: - """ - Create a docutils admonition node from a GitHub alert blockquote. 
- - Args: - blockquote: The blockquote node containing the GitHub alert - - Returns: - The created admonition node - """ - # Extract alert information from the blockquote - first_para = blockquote.children[0] - first_text = first_para.children[0].astext() - match = self.alert_pattern.match(first_text.strip()) - - if not match: - raise ValueError("Not a valid GitHub alert blockquote") - - alert_type = match.group(1).lower().strip() - title = match.group(2).strip() if match.group(2) else None - - # Extract content nodes (everything after the first paragraph) - content_nodes = [] - - # If there's a title, check if there's more content in the first paragraph - if title and len(first_para.children) > 1: - # Create new paragraph with remaining content - remaining_para = nodes.paragraph() - # Properly detach and add child nodes - for child in first_para.children[1:]: - child.parent = None # Detach from current parent - remaining_para.append(child) - content_nodes.append(remaining_para) - elif not title and len(first_para.children) > 1: - # No title, but there's content after [!TYPE] - treat as content - content_para = nodes.paragraph() - # Properly detach and add child nodes - for child in first_para.children[1:]: - child.parent = None # Detach from current parent - content_para.append(child) - content_nodes.append(content_para) - - # Add any additional paragraphs/content - for child in blockquote.children[1:]: - child.parent = None # Detach from current parent - content_nodes.append(child) - - # Map to MyST admonition type - admonition_class = self.ALERT_MAPPING.get(alert_type, nodes.note) - admonition = admonition_class() - - # Add title if present - if title: - title_node = nodes.title(title, title) - admonition.append(title_node) - - # Add content nodes - for content_node in content_nodes: - content_node.parent = None # Ensure node is properly detached - admonition.append(content_node) - - return admonition - - def transform_document(self, document: nodes.document) -> 
None: - """Transform all GitHub alert blockquotes in the document.""" - - # Find all blockquote nodes - blockquotes = document.traverse(nodes.block_quote) - - for blockquote in blockquotes: - if self.is_github_alert_blockquote(blockquote): - # Create admonition node from blockquote - admonition = self.create_admonition_node(blockquote) - - # Replace blockquote with admonition - blockquote.parent.replace(blockquote, admonition) - - -def transform_github_alerts(app: Sphinx, doctree: nodes.document, docname: str) -> None: - """ - Transform GitHub alerts in the document tree. - - This function is connected to Sphinx's 'doctree-resolved' event. - - Args: - app: The Sphinx application instance - doctree: The document tree to transform - docname: The document name being processed - """ - # Check if this is a markdown file by looking at the source file - # Sphinx strips extensions from docnames, so we need to check the source - env = app.env - source_file = env.doc2path(docname, base=None) - is_markdown = source_file and source_file.suffix in (".md", ".markdown") - - if not is_markdown: - return - - # Check if the extension is enabled - if not app.config.github_alerts_enabled: - return - - logger.debug(f"Processing GitHub alerts in {docname}") - - try: - # Get the transformer instance - transformer = getattr(app, "_github_alerts_transformer", None) - if transformer is None: - transformer = GitHubAlertsTransformer() - app._github_alerts_transformer = transformer - - # Count blockquotes before transformation - initial_blockquotes = list(doctree.traverse(nodes.block_quote)) - initial_admonitions = list(doctree.traverse(nodes.Admonition)) - alert_blockquotes = [ - bq - for bq in initial_blockquotes - if transformer.is_github_alert_blockquote(bq) - ] - - if alert_blockquotes: - logger.info( - f"GitHub alerts: Converting {len(alert_blockquotes)} alert(s) in {docname}" - ) - - # Transform the document - transformer.transform_document(doctree) - - # Count remaining blockquotes and 
new admonitions for verification - remaining_blockquotes = list(doctree.traverse(nodes.block_quote)) - remaining_admonitions = list(doctree.traverse(nodes.Admonition)) - - logger.debug( - f"GitHub alerts: {docname} - {len(initial_blockquotes)} → {len(remaining_blockquotes)} blockquotes, {len(remaining_admonitions) - len(initial_admonitions)} admonitions created" - ) - else: - logger.debug(f"GitHub alerts: No alerts found in {docname}") - except Exception as e: - logger.error(f"GitHub alerts: Error processing {docname}: {e}") - raise - - -def setup(app: Sphinx) -> Dict[str, Any]: - """ - Setup function for the Sphinx extension. - - Args: - app: The Sphinx application instance - - Returns: - Extension metadata - """ - logger.info("GitHub alerts extension setup() called") - - try: - # Connect our transformer to the doctree-resolved event - # This happens after parsing but before writing - app.connect("doctree-resolved", transform_github_alerts) - logger.info("GitHub alerts extension connected to 'doctree-resolved' event") - - # Add configuration values - app.add_config_value("github_alerts_enabled", True, "env") - - logger.info("GitHub alerts extension setup completed") - - return { - "version": __version__, - "parallel_read_safe": True, - "parallel_write_safe": True, - } - except Exception as e: - logger.error(f"GitHub alerts extension setup failed: {e}") - raise diff --git a/docs/_includes/dive_in_examples.rst b/docs/_includes/dive_in_examples.rst deleted file mode 100644 index 261e896d77d..00000000000 --- a/docs/_includes/dive_in_examples.rst +++ /dev/null @@ -1,32 +0,0 @@ -The examples below assume you build the latest image yourself from source. If using a prebuilt image follow the examples from the corresponding branch. - -.. grid:: 1 2 2 2 - :gutter: 3 - :margin: 0 - :padding: 3 4 0 0 - - .. 
grid-item-card:: :doc:`Hello World <../examples/runtime/hello_world/README>` - :link: ../examples/runtime/hello_world/README - :link-type: doc - - Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph - - .. grid-item-card:: :doc:`vLLM <../backends/vllm/README>` - :link: ../backends/vllm/README - :link-type: doc - - Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with VLLM. - - .. grid-item-card:: :doc:`SGLang <../backends/sglang/README>` - :link: ../backends/sglang/README - :link-type: doc - - Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang. - - .. grid-item-card:: :doc:`TensorRT-LLM <../backends/trtllm/README>` - :link: ../backends/trtllm/README - :link-type: doc - - Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM. - - diff --git a/docs/_includes/install.rst b/docs/_includes/install.rst deleted file mode 100644 index 3403c6f827b..00000000000 --- a/docs/_includes/install.rst +++ /dev/null @@ -1,44 +0,0 @@ -Pip (PyPI) ----------- - -Install a pre-built wheel from PyPI. - -.. code-block:: bash - - # Create a virtual environment and activate it - uv venv venv - source venv/bin/activate - - # Install Dynamo from PyPI (choose one backend extra) - uv pip install "ai-dynamo[sglang]==my-tag" # or [vllm], [trtllm] - - -Pip from source ---------------- - -Install directly from a local checkout for development. - -.. code-block:: bash - - # Clone the repository - git clone https://github.com/ai-dynamo/dynamo.git - cd dynamo - - # Create a virtual environment and activate it - uv venv venv - source venv/bin/activate - uv pip install ".[sglang]" # or [vllm], [trtllm] - - -Docker ------- - -Pull and run prebuilt images from NVIDIA NGC (`nvcr.io`). - -.. 
code-block:: bash - - # Run a container (mount your workspace if needed) - docker run --rm -it \ - --gpus all \ - --network host \ - nvcr.io/nvidia/ai-dynamo/sglang-runtime:my-tag # or vllm, tensorrtllm diff --git a/docs/_includes/quick_start_local.rst b/docs/_includes/quick_start_local.rst deleted file mode 100644 index 05b6e63b5f4..00000000000 --- a/docs/_includes/quick_start_local.rst +++ /dev/null @@ -1,45 +0,0 @@ -Get started with Dynamo locally in just a few commands: - -**1. Install Dynamo** - -.. code-block:: bash - - # Install uv (recommended Python package manager) - curl -LsSf https://astral.sh/uv/install.sh | sh - - # Create virtual environment and install Dynamo - uv venv venv - source venv/bin/activate - # Use prerelease flag to install RC versions of flashinfer and/or other dependencies - uv pip install --prerelease=allow "ai-dynamo[sglang]" # or [vllm], [trtllm] - -**2. Start etcd/NATS** - -.. code-block:: bash - - # Fetch and start etcd and NATS using Docker Compose - VERSION=$(uv pip show ai-dynamo | grep Version | cut -d' ' -f2) - curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/refs/tags/v${VERSION}/deploy/docker-compose.yml - docker compose -f docker-compose.yml up -d - -**3. Run Dynamo** - -.. code-block:: bash - - # Start the OpenAI compatible frontend (default port is 8000) - python -m dynamo.frontend - - # In another terminal, start an SGLang worker - python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B - -**4. Test your deployment** - -.. code-block:: bash - - curl localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model": "Qwen/Qwen3-0.6B", - "messages": [{"role": "user", "content": "Hello!"}], - "max_tokens": 50}' - - diff --git a/docs/_sections/backends.rst b/docs/_sections/backends.rst deleted file mode 100644 index e77774f4105..00000000000 --- a/docs/_sections/backends.rst +++ /dev/null @@ -1,9 +0,0 @@ -Backends -======== - -.. 
toctree:: - :maxdepth: 1 - - vLLM <../backends/vllm/README> - SGLang <../backends/sglang/README> - TensorRT-LLM <../backends/trtllm/README> \ No newline at end of file diff --git a/docs/_sections/examples.rst b/docs/_sections/examples.rst deleted file mode 100644 index 30258a46bee..00000000000 --- a/docs/_sections/examples.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. - Quickstart Page (left sidebar target) -.. - -Examples -======== - -.. include:: ../_includes/dive_in_examples.rst \ No newline at end of file diff --git a/docs/_sections/frontends.rst b/docs/_sections/frontends.rst deleted file mode 100644 index b5e4e3e5da8..00000000000 --- a/docs/_sections/frontends.rst +++ /dev/null @@ -1,7 +0,0 @@ -Frontends -========= - -.. toctree:: - :maxdepth: 1 - - KServe <../frontends/kserve.md> \ No newline at end of file diff --git a/docs/_sections/installation.rst b/docs/_sections/installation.rst deleted file mode 100644 index b9543fb5586..00000000000 --- a/docs/_sections/installation.rst +++ /dev/null @@ -1,10 +0,0 @@ -.. - Installation Page (left sidebar target) -.. - -Installation -============ - -.. include:: ../_includes/install.rst - - diff --git a/docs/_sections/k8s_deployment.rst b/docs/_sections/k8s_deployment.rst deleted file mode 100644 index 087f8fd08df..00000000000 --- a/docs/_sections/k8s_deployment.rst +++ /dev/null @@ -1,14 +0,0 @@ -Deployment Guide -================ - -.. 
toctree:: - :hidden: - - Kubernetes Quickstart <../kubernetes/README> - Detailed Installation Guide <../kubernetes/installation_guide> - Dynamo Operator <../kubernetes/dynamo_operator> - Service Discovery <../kubernetes/service_discovery> - Webhooks <../kubernetes/webhooks> - Minikube Setup <../kubernetes/deployment/minikube> - Managing Models with DynamoModel <../kubernetes/deployment/dynamomodel-guide> - Autoscaling <../kubernetes/autoscaling> diff --git a/docs/_sections/k8s_multinode.rst b/docs/_sections/k8s_multinode.rst deleted file mode 100644 index 3a1c7cff2c4..00000000000 --- a/docs/_sections/k8s_multinode.rst +++ /dev/null @@ -1,8 +0,0 @@ -Multinode -========= - -.. toctree:: - :hidden: - - Multinode Deployments <../kubernetes/deployment/multinode-deployment> - Grove <../kubernetes/grove> diff --git a/docs/_sections/k8s_observability.rst b/docs/_sections/k8s_observability.rst deleted file mode 100644 index af7c6ff66d9..00000000000 --- a/docs/_sections/k8s_observability.rst +++ /dev/null @@ -1,8 +0,0 @@ -Observability -============= - -.. toctree:: - :hidden: - - Metrics <../kubernetes/observability/metrics> - Logging <../kubernetes/observability/logging> diff --git a/docs/_sections/observability.rst b/docs/_sections/observability.rst deleted file mode 100644 index c1b108c9752..00000000000 --- a/docs/_sections/observability.rst +++ /dev/null @@ -1,13 +0,0 @@ -Observability -============= - -.. 
toctree:: - :hidden: - - Overview <../observability/README> - Prometheus + Grafana Setup <../observability/prometheus-grafana> - Metrics <../observability/metrics> - Metrics Developer Guide <../observability/metrics-developer-guide> - Health Checks <../observability/health-checks> - Tracing <../observability/tracing> - Logging <../observability/logging> diff --git a/docs/_static/custom.js b/docs/_static/custom.js deleted file mode 100644 index 03900df2ae0..00000000000 --- a/docs/_static/custom.js +++ /dev/null @@ -1,19 +0,0 @@ -// Add RunLLM widget -document.addEventListener("DOMContentLoaded", function () { - var script = document.createElement("script"); - script.type = "module"; - script.id = "runllm-widget-script" - - script.src = "https://widget.runllm.com"; - - script.setAttribute("version", "stable"); - script.setAttribute("runllm-keyboard-shortcut", "Mod+j"); // cmd-j or ctrl-j to open the widget. - script.setAttribute("runllm-name", "dynamo"); - script.setAttribute("runllm-position", "BOTTOM_RIGHT"); - script.setAttribute("runllm-position-y", "120px"); - script.setAttribute("runllm-position-x", "20px"); - script.setAttribute("runllm-assistant-id", "758"); - - script.async = true; - document.head.appendChild(script); - }); diff --git a/docs/_static/switcher.json b/docs/_static/switcher.json deleted file mode 100644 index 3b1e3994d1a..00000000000 --- a/docs/_static/switcher.json +++ /dev/null @@ -1,12 +0,0 @@ -[ - { - "name": "0.1.0 (current release)", - "version": "0.1.0", - "url": "https://docs.nvidia.com/dynamo/latest/index.html" - }, - { - "name": "older releases", - "version": "archives", - "url": "https://docs.nvidia.com/dynamo/archives/" - } -] \ No newline at end of file diff --git a/docs/conf.py b/docs/conf.py deleted file mode 100644 index 33b8e76af2b..00000000000 --- a/docs/conf.py +++ /dev/null @@ -1,170 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2023-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 - -# Configuration file for the Sphinx documentation builder. -import os -import sys - -# -- Project information ----------------------------------------------------- -project = "NVIDIA Dynamo" -copyright = "2024-2026, NVIDIA CORPORATION & AFFILIATES" -author = "NVIDIA" - -# Version is set via DYNAMO_DOCS_VERSION env var during build (e.g., "0.3.0") -# Defaults to "dev" for main branch and PR builds -release = os.environ.get("DYNAMO_DOCS_VERSION", "dev") - -# -- General configuration --------------------------------------------------- - -# Standard extensions -extensions = [ - "ablog", - "myst_parser", - "sphinx_copybutton", - "sphinx_design", - "sphinx_prompt", - # "sphinxcontrib.bibtex", - "sphinx_tabs.tabs", - "sphinx_sitemap", - "sphinx.ext.autodoc", - "sphinx.ext.autosummary", - "sphinx.ext.mathjax", - "sphinx.ext.napoleon", - "sphinx.ext.ifconfig", - "sphinx.ext.extlinks", - "sphinxcontrib.mermaid", - "sphinx_reredirects", -] - -# Redirects configuration -redirects = { - # PR #3802 - "guides/tool-calling": "../agents/tool-calling.html", # key format corrected - "architecture/architecture": "../design_docs/architecture.html", - "architecture/disagg_serving": "../design_docs/disagg_serving.html", - "architecture/distributed_runtime": "../design_docs/distributed_runtime.html", - "architecture/dynamo_flow": "../design_docs/dynamo_flow.html", - "architecture/request_cancellation": "../fault_tolerance/request_cancellation.html", - "architecture/request_migration": "../fault_tolerance/request_migration.html", - "kubernetes/create_deployment": "../kubernetes/deployment/create_deployment.html", - "kubernetes/minikube": "../kubernetes/deployment/minikube.html", - "kubernetes/multinode-deployment": "../kubernetes/deployment/multinode-deployment.html", - "kubernetes/logging": "../kubernetes/observability/logging.html", - "kubernetes/metrics": "../kubernetes/observability/metrics.html", - "architecture/kv_cache_routing": 
"../router/kv_cache_routing.html", - # PR #3658 - "API/nixl_connect/README": "../../api/nixl_connect/README.html", - "API/nixl_connect/connector": "../../api/nixl_connect/connector.html", - "API/nixl_connect/descriptor": "../../api/nixl_connect/descriptor.html", - "API/nixl_connect/device": "../../api/nixl_connect/device.html", - "API/nixl_connect/device_kind": "../../api/nixl_connect/device_kind.html", - "API/nixl_connect/operation_status": "../../api/nixl_connect/operation_status.html", - "API/nixl_connect/rdma_metadata": "../../api/nixl_connect/rdma_metadata.html", - "API/nixl_connect/read_operation": "../../api/nixl_connect/read_operation.html", - "API/nixl_connect/readable_operation": "../../api/nixl_connect/readable_operation.html", - "API/nixl_connect/writable_operation": "../../api/nixl_connect/writable_operation.html", - "API/nixl_connect/write_operation": "../../api/nixl_connect/write_operation.html", - "guides/backend": "../development/backend-guide.html", - "runtime/README": "../development/runtime-guide.html", - "guides/tool_calling": "../agents/tool-calling.html", - "architecture/kvbm_architecture": "../kvbm/kvbm_architecture.html", - "architecture/kvbm_components": "../kvbm/kvbm_components.html", - "architecture/kvbm_intro": "../kvbm/kvbm_intro.html", - "architecture/kvbm_motivation": "../kvbm/kvbm_motivation.html", - "architecture/kvbm_reading": "../kvbm/kvbm_reading.html", - "guides/run_kvbm_in_trtllm": "../kvbm/trtllm-setup.html", - "guides/run_kvbm_in_vllm": "../kvbm/vllm-setup.html", - "guides/health_check": "../observability/health-checks.html", - "guides/logging": "../observability/logging.html", - "guides/metrics": "../observability/metrics.html", - "guides/disagg_perf_tuning": "../performance/tuning.html", - "architecture/load_planner": "../planner/load_planner.html", - "architecture/planner_intro": "../planner/planner_intro.html", - "architecture/sla_planner": "../planner/sla_planner.html", - "kubernetes/sla_planner_quickstart": 
"../planner/sla_planner_quickstart.html", - "guides/dynamo_run": "../reference/cli.html", - "dynamo_glossary": "../reference/glossary.html", - "support_matrix": "../reference/support-matrix.html", - "components/router/README": "../router/README.html", - # Multimodal documentation consolidation - "backends/vllm/multimodal": "../../multimodal/vllm.html", - "backends/vllm/multimodal_vllm_guide": "../../multimodal/vllm.html", - "backends/trtllm/multimodal_support": "../../multimodal/trtllm.html", - "backends/trtllm/multimodal_trtllm_guide": "../../multimodal/trtllm.html", - "backends/trtllm/multinode/multinode-multimodal-example": "../../../multimodal/trtllm.html", - "backends/sglang/multimodal_epd": "../../multimodal/sglang.html", - "backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html", - "multimodal/multimodal_intro": "index.html", -} - -# Custom extensions -sys.path.insert(0, os.path.abspath("_extensions")) -extensions.append("github_alerts") - -# Handle Mermaid diagrams as code blocks (not directives) to avoid warnings -myst_fence_as_directive = ["mermaid"] # Uncomment if sphinxcontrib-mermaid is installed - -# File extensions (myst_parser automatically handles .md files) -source_suffix = [".rst", ".md"] - -# MyST parser configuration -myst_enable_extensions = [ - "colon_fence", # ::: code blocks - "deflist", # Definition lists - "html_image", # HTML images - "tasklist", # Task lists -] - -# Templates path -templates_path = ["_templates"] - -# List of patterns to ignore when looking for source files -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "build"] - -# -- Options for HTML output ------------------------------------------------- -html_theme = "nvidia_sphinx_theme" -html_static_path = ["_static"] -html_extra_path = ["project.json"] -html_theme_options = { - "collapse_navigation": False, - "icon_links": [ - { - "name": "GitHub", - "url": "https://github.com/ai-dynamo/dynamo", - "icon": "fa-brands fa-github", - } - ], - "switcher": 
{ - # Use single shared URL so all versions see the same switcher list - # When a new version is added, all old docs automatically see it - "json_url": "https://docs.nvidia.com/dynamo/versions1.json", - "version_match": release, - }, - "extra_head": { - """ - - """ - }, - "extra_footer": { - """ - - """ - }, - "navbar_start": ["navbar-logo"], - "primary_sidebar_end": [], -} - -# Document settings -master_doc = "index" -html_title = f"{project} Documentation" -html_short_title = project -html_baseurl = "https://docs.nvidia.com/dynamo/latest/" - -# Suppress warnings for external links and missing references -suppress_warnings = [ - "myst.xref_missing", # Missing cross-references of relative links outside docs folder -] - -# Additional MyST configuration -myst_heading_anchors = 7 # Generate anchors for headers -myst_substitutions = {} # Custom substitutions diff --git a/docs/agents/tool-calling.md b/docs/docs/agents/tool-calling.md similarity index 99% rename from docs/agents/tool-calling.md rename to docs/docs/agents/tool-calling.md index dd0d116215d..1aee142ec8f 100644 --- a/docs/agents/tool-calling.md +++ b/docs/docs/agents/tool-calling.md @@ -1,3 +1,7 @@ +--- +title: "Tool Calling with Dynamo" +--- + # Tool Calling with Dynamo You can connect Dynamo to external tools and services using function calling (also known as tool calling). By providing a list of available functions, Dynamo can choose diff --git a/docs/api/nixl_connect/README.md b/docs/docs/api/nixl_connect/README.md similarity index 92% rename from docs/api/nixl_connect/README.md rename to docs/docs/api/nixl_connect/README.md index 2a65fa76951..1953da2d6e8 100644 --- a/docs/api/nixl_connect/README.md +++ b/docs/docs/api/nixl_connect/README.md @@ -1,3 +1,7 @@ +--- +title: "Dynamo NIXL Connect" +--- + + +# Dynamo Runtime + +

A Datacenter Scale Distributed Inference Serving Framework

+ +[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) + +Rust implementation of the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads. + +## Prerequisites + +### Install Rust and Cargo using [rustup](https://rustup.rs/) + +```bash +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +``` + +### Build + +```bash +cargo build +cargo test +``` + +### Start Dependencies + +#### Docker Compose + +The simplest way to deploy the prerequisite services is with +[docker-compose](https://docs.docker.com/compose/install/linux/), +using [deploy/docker-compose.yml](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml). + +```bash +# At the root of the repository: +docker compose -f deploy/docker-compose.yml up -d +``` + +This deploys a [NATS.io](https://nats.io/) server and an [etcd](https://etcd.io/) server, +used for communication between components and for component discovery at runtime. + + +#### Local (alternative) + +To deploy the prerequisite services locally instead of using `docker-compose`, +you can launch each one manually: + +- [NATS.io](https://docs.nats.io/running-a-nats-service/introduction/installation) server with [JetStream](https://docs.nats.io/nats-concepts/jetstream) enabled + - example: `nats-server -js --trace` +- [etcd](https://etcd.io) server + - follow the [etcd installation](https://etcd.io/docs/v3.5/install/) instructions to start an `etcd` server locally + + +### Run Examples + +When developing or running examples, any process or user that shares your core services (`etcd` and `nats.io`) will +be operating within your distributed runtime. + +The current examples use a hard-coded `namespace`; `namespace` collisions will be addressed later. + +All examples require the `etcd` and `nats.io` prerequisites to be running and available.
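Before launching an example, you can sanity-check that both services are reachable. The snippet below is a minimal sketch, assuming the default etcd client port (2379) and that NATS was started with its HTTP monitoring endpoint enabled on port 8222 (e.g. `nats-server -js -m 8222`); adjust the ports to match your setup.

```shell
# Verify etcd is up (default client port 2379).
# A healthy server responds with a small JSON health report.
curl -s http://localhost:2379/health

# Verify NATS is up (requires monitoring enabled, e.g. `nats-server -js -m 8222`).
curl -s http://localhost:8222/healthz
```

If either command fails to connect, revisit the Docker Compose or local setup steps above before running the examples.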
+ +#### Rust `hello_world` + +Open two terminals. In the first, start the server: + +```bash +cd examples/hello_world +cargo run --bin server +``` + +In the second terminal, run the client: + +```bash +cd examples/hello_world +cargo run --bin client +``` + +The client should print output similar to: +``` + Finished `dev` profile [unoptimized + debuginfo] target(s) in 6.25s + Running `target/debug/client` +Annotated { data: Some("h"), id: None, event: None, comment: None } +Annotated { data: Some("e"), id: None, event: None, comment: None } +Annotated { data: Some("l"), id: None, event: None, comment: None } +Annotated { data: Some("l"), id: None, event: None, comment: None } +Annotated { data: Some("o"), id: None, event: None, comment: None } +Annotated { data: Some(" "), id: None, event: None, comment: None } +Annotated { data: Some("w"), id: None, event: None, comment: None } +Annotated { data: Some("o"), id: None, event: None, comment: None } +Annotated { data: Some("r"), id: None, event: None, comment: None } +Annotated { data: Some("l"), id: None, event: None, comment: None } +Annotated { data: Some("d"), id: None, event: None, comment: None } +``` + +#### Python + +See the [README.md](https://github.com/ai-dynamo/dynamo/tree/main/lib/runtime/lib/bindings/python/README.md) for details. + +The Python and Rust `hello_world` client and server examples are interchangeable, +so you can start the Python `server.py` and talk to it from the Rust `client`.
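The cross-language pairing can be sketched as follows. The path to `server.py` below is an assumption for illustration; consult the Python bindings README linked above for its actual location.

```shell
# Terminal 1: start the Python hello_world server
# (path is an assumption -- see the Python bindings README for the real location)
python server.py

# Terminal 2: drive the Python server with the Rust client
cd examples/hello_world
cargo run --bin client
```

Because both implementations register against the same `etcd`/`nats.io` services, the Rust client discovers the Python server the same way it discovers the Rust one.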