diff --git a/.github/workflows/lint-docs.yaml b/.github/workflows/lint-docs.yaml
new file mode 100644
index 00000000000..f50853e7932
--- /dev/null
+++ b/.github/workflows/lint-docs.yaml
@@ -0,0 +1,28 @@
+name: Lint Documentation
+on:
+  push:
+    paths:
+      - "**.md"
+    branches:
+      - main
+  pull_request:
+    paths:
+      - "**.md"
+permissions:
+  contents: read
+
+jobs:
+  markdown-link-check:
+    name: Broken Links
+    runs-on: ubuntu-latest
+    steps:
+      - name: Harden Runner
+        uses: step-security/harden-runner@20cf305ff2072d973412fa9b1e3a4f227bda3c76 # v2.14.0
+        with:
+          egress-policy: audit
+
+      - uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
+      - uses: tcort/github-action-markdown-link-check@e7c7a18363c842693fadde5d41a3bd3573a7a225 # v1.1.2
+        with:
+          use-quiet-mode: 'yes'
+          config-file: .markdownlinkcheck.json
diff --git a/.gitignore b/.gitignore
index a8beb74cb7c..37431bda907 100644
--- a/.gitignore
+++ b/.gitignore
@@ -118,3 +118,8 @@ profiling_results*
 # Node.js
 node_modules/
 package-lock.json
+
+# Docusaurus
+docs/.docusaurus/
+docs/build/
+docs/.cache-loader/
diff --git a/benchmarks/incluster/README.md b/benchmarks/incluster/README.md
deleted file mode 120000
index ab6c21f5862..00000000000
--- a/benchmarks/incluster/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../../docs/benchmarks/benchmarking.md
\ No newline at end of file
diff --git a/benchmarks/incluster/README.md b/benchmarks/incluster/README.md
new file mode 100644
index 00000000000..fc6136bbac5
--- /dev/null
+++ b/benchmarks/incluster/README.md
@@ -0,0 +1,545 @@
+---
+title: "Dynamo Benchmarking Guide"
+---
+
+
+# Dynamo Benchmarking Guide
+
+This benchmarking framework lets you compare performance across any combination of:
+- **DynamoGraphDeployments**
+- **External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.)
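Before running a sweep, it helps to confirm that an endpoint is reachable and is serving the model you plan to benchmark. A minimal sketch, assuming the endpoint exposes the OpenAI-compatible `/v1/models` route (vLLM-style frontends typically do — verify for your stack); the helper names here are illustrative, not part of the framework:

```python
import json
import urllib.request


def model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]


def served_models(base_url: str) -> list:
    """Query an OpenAI-compatible endpoint for the models it serves."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return model_ids(json.load(resp))
```

For example, after port-forwarding a frontend service, `served_models("http://localhost:8000")` should include the model name you intend to pass to the benchmark.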
+ +## Choosing Your Benchmarking Approach + +Dynamo provides two benchmarking approaches to suit different use cases: **client-side** and **server-side**. Client-side refers to running benchmarks on your local machine and connecting to Kubernetes deployments via port-forwarding, while server-side refers to running benchmarks directly within the Kubernetes cluster using internal service URLs. Which method to use depends on your use case. + +**TLDR:** +Need high performance/load testing? Server-side. +Just quick testing/comparison? Client-side. + +### Use Client-Side Benchmarking When: +- You want to quickly test deployments +- You want immediate access to results on your local machine +- You're comparing external services or deployments (not necessarily just Dynamo deployments) +- You need to run benchmarks from your laptop/workstation + +→ **[Go to Client-Side Benchmarking (Local)](#client-side-benchmarking-local)** + +### Use Server-Side Benchmarking When: +- You have a development environment with kubectl access +- You're doing performance validation with high load/speed requirements +- You're experiencing timeouts or performance issues with client-side benchmarking +- You want optimal network performance (no port-forwarding overhead) +- You're running automated CI/CD pipelines +- You need isolated execution environments +- You're doing resource-intensive benchmarking +- You want persistent result storage in the cluster + +→ **[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)** + +### Quick Comparison + +| Feature | Client-Side | Server-Side | +|---------|-------------|-------------| +| **Location** | Your local machine | Kubernetes cluster | +| **Network** | Port-forwarding required | Direct service DNS | +| **Setup** | Quick and simple | Requires cluster resources | +| **Performance** | Limited by local resources, may timeout under high load | Optimal cluster performance, handles high load | +| **Isolation** | Shared 
environment | Isolated job execution | +| **Results** | Local filesystem | Persistent volumes | +| **Best for** | Light load | High load | + +## What This Tool Does + +The framework is a Python-based wrapper around `aiperf` that: +- Benchmarks any HTTP endpoints +- Runs concurrency sweeps across configurable load levels +- Generates comparison plots with your custom labels +- Works with any HuggingFace-compatible model on NVIDIA GPUs (H200, H100, A100, etc.) +- Provides direct Python script execution for maximum flexibility + +**Default sequence lengths**: Input: 2000 tokens, Output: 256 tokens (configurable with `--isl` and `--osl`) + +**Important**: The `--model` parameter configures AIPerf for benchmarking and provides logging context. The default `--model` value in the benchmarking script is `Qwen/Qwen3-0.6B`, but it must match the model deployed at the endpoint(s). + +--- + +## Client-Side Benchmarking (Local) {#client-side-benchmarking-local} + +Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding. + +## Prerequisites + +1. **Dynamo container environment** - You must be running inside a Dynamo container with the benchmarking tools pre-installed. + +2. **HTTP endpoints** - Ensure you have HTTP endpoints available for benchmarking. These can be: + - DynamoGraphDeployments exposed via HTTP endpoints + - External services (vLLM, llm-d, AIBrix, etc.) + - Any HTTP endpoint serving HuggingFace-compatible models + +3. **Benchmark dependencies** - Since benchmarks run locally, you need to install the required Python dependencies. Install them using: + ```bash + pip install -r deploy/utils/requirements.txt + ``` + +## User Workflow + +Follow these steps to benchmark Dynamo deployments using client-side benchmarking: + +### Step 1: Establish Kubernetes Cluster and Install Dynamo +Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. 
First follow the [installation guide](/docs/kubernetes/installation_guide.md) to install the Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md) to set up benchmarking resources.

### Step 2: Deploy DynamoGraphDeployments
Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Each deployment should have a frontend service exposed.

### Step 3: Port-Forward and Benchmark Deployment A
```bash
# Port-forward the frontend service for deployment A
kubectl port-forward -n <namespace> svc/<frontend-service> 8000:8000 > /dev/null 2>&1 &
# Note: remember to stop the port-forward process after benchmarking.

# Benchmark deployment A using Python scripts
python3 -m benchmarks.utils.benchmark \
  --benchmark-name deployment-a \
  --endpoint-url http://localhost:8000 \
  --model "your-model-name" \
  --output-dir ./benchmarks/results
```

### Step 4: [If Comparative] Teardown Deployment A and Establish Deployment B
If comparing multiple deployments, tear down deployment A and deploy deployment B with a different configuration.

### Step 5: [If Comparative] Port-Forward and Benchmark Deployment B
```bash
# Port-forward the frontend service for deployment B
kubectl port-forward -n <namespace> svc/<frontend-service> 8001:8000 > /dev/null 2>&1 &

# Benchmark deployment B using Python scripts
python3 -m benchmarks.utils.benchmark \
  --benchmark-name deployment-b \
  --endpoint-url http://localhost:8001 \
  --model "your-model-name" \
  --output-dir ./benchmarks/results
```

### Step 6: Generate Summary and Visualization
```bash
# Generate plots and summary using Python plotting script
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results

# Or plot only specific benchmark experiments
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b
```

## Use Cases

The benchmarking framework supports various comparative analysis scenarios:

- **Compare multiple DynamoGraphDeployments of a single backend** (e.g., aggregated vs disaggregated configurations)
- **Compare different backends** (e.g., vLLM vs TensorRT-LLM vs SGLang)
- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B)
- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
- **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations)

## Configuration and Usage

### Command Line Options

```bash
python3 -m benchmarks.utils.benchmark --benchmark-name <name> --endpoint-url <url> [OPTIONS]

REQUIRED:
  --benchmark-name NAME   Name/label for this benchmark (used in plots and results)
  --endpoint-url URL      HTTP endpoint URL to benchmark (e.g., http://localhost:8000)

OPTIONS:
  -h, --help              Show help message and examples
  -m, --model MODEL       Model name for AIPerf configuration and logging (default: Qwen/Qwen3-0.6B)
                          NOTE: This must match the model deployed at the endpoint
  -i, --isl LENGTH
Input sequence length (default: 2000) + -s, --std STDDEV Input sequence standard deviation (default: 10) + -o, --osl LENGTH Output sequence length (default: 256) + -d, --output-dir DIR Output directory (default: ./benchmarks/results) + --verbose Enable verbose output +``` + +### Important Notes + +- **Benchmark Name**: The benchmark name becomes the label in plots and results +- **Name Restrictions**: Names can only contain letters, numbers, hyphens, and underscores. The name `plots` is reserved. +- **Port-Forwarding**: You must have an exposed endpoint before benchmarking +- **Model Parameter**: The `--model` parameter configures AIPerf for testing and logging, and must match the model deployed at the endpoint +- **Sequential Benchmarking**: For comparative benchmarks, deploy and benchmark each configuration separately + +### What Happens During Benchmarking + +The Python benchmarking module: +1. **Connects** to your port-forwarded endpoint +2. **Benchmarks** using AIPerf at various concurrency levels (default: 1, 2, 5, 10, 50, 100, 250) +3. **Measures** key metrics: latency, throughput, time-to-first-token +4. **Saves** results to an output directory organized by benchmark name + +The Python plotting module: +1. **Generates** comparison plots using your benchmark name in `/plots/` +2. 
**Creates** summary statistics and visualizations + +### Plotting Options + +The plotting script supports several options for customizing which experiments to visualize: + +```bash +# Plot all benchmark experiments in the data directory +python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results + +# Plot only specific benchmark experiments +python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b + +# Specify custom output directory for plots +python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --output-dir ./custom-plots +``` + +**Available Options:** +- `--data-dir`: Directory containing benchmark results (required) +- `--benchmark-name`: Specific benchmark experiment name to plot (can be specified multiple times). Names must match subdirectory names under the data dir. +- `--output-dir`: Custom output directory for plots (defaults to data-dir/plots) + +**Note**: If `--benchmark-name` is not specified, the script will plot all subdirectories found in the data directory. + +### Using Your Own Models and Configuration + +The benchmarking framework supports any HuggingFace-compatible LLM model. Specify your model in the benchmark script's `--model` parameter. It must match the model name of the deployment. You can override the default sequence lengths (2000/256 tokens) with `--isl` and `--osl` flags if needed for your specific workload. + +The benchmarking framework is built around Python modules that provide direct control over the benchmark workflow. The Python benchmarking module connects to your existing endpoints, runs the benchmarks, and can generate plots. Deployment is user-managed and out of scope for this tool. + +### Comparison Limitations + +The plotting system supports up to 12 different benchmarks in a single comparison. 
+ +### Concurrency Configuration + +You can customize the concurrency levels using the CONCURRENCIES environment variable: + +```bash +# Custom concurrency levels +CONCURRENCIES="1,5,20,50" python3 -m benchmarks.utils.benchmark \ + --benchmark-name my-test \ + --endpoint-url http://localhost:8000 + +# Or set permanently +export CONCURRENCIES="1,2,5,10,25,50,100" +python3 -m benchmarks.utils.benchmark \ + --benchmark-name test \ + --endpoint-url http://localhost:8000 +``` + +## Understanding Your Results + +After benchmarking completes, check `./benchmarks/results/` (or your custom output directory): + +### Plot Labels and Organization + +The plotting script uses the `--benchmark-name` as the experiment name in all generated plots. For example: +- `--benchmark-name aggregated` → plots will show "aggregated" as the label +- `--benchmark-name vllm-disagg` → plots will show "vllm-disagg" as the label + +This allows you to easily identify and compare different configurations in the visualization plots. 
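Beyond the generated plots, the per-concurrency AIPerf JSON files can be post-processed directly. A sketch of loading one benchmark's sweep into a dictionary keyed by concurrency level (the directory layout follows the structure described below; the metric key names inside the JSON vary across AIPerf versions, so inspect your files before relying on specific fields):

```python
import json
from pathlib import Path


def collect_results(benchmark_dir):
    """Map concurrency level -> parsed AIPerf JSON for one benchmark run.

    Expects <benchmark_dir>/c<N>/profile_export_aiperf.json subdirectories.
    """
    results = {}
    for json_path in sorted(Path(benchmark_dir).glob("c*/profile_export_aiperf.json")):
        concurrency = int(json_path.parent.name.lstrip("c"))
        with open(json_path) as f:
            results[concurrency] = json.load(f)
    return results
```

From the returned dictionary you can build your own tables or plots, for example comparing a metric across concurrency levels for two benchmark names.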
### Summary and Plots

```text
benchmarks/results/plots
├── SUMMARY.txt                                   # Quick overview of all results
├── p50_inter_token_latency_vs_concurrency.png   # Token generation speed
├── avg_time_to_first_token_vs_concurrency.png   # Response time
├── request_throughput_vs_concurrency.png        # Requests per second
├── efficiency_tok_s_gpu_vs_user.png             # GPU efficiency
└── avg_inter_token_latency_vs_concurrency.png   # Average latency
```

### Data Files

Raw data is organized by deployment/benchmark type and concurrency level:

**For Any Benchmarking (uses your custom benchmark name):**
```text
results/                       # Client-side: ./benchmarks/results/ (or custom dir); server-side: /data/results/
├── plots/                     # Performance visualization plots
│   ├── SUMMARY.txt
│   ├── p50_inter_token_latency_vs_concurrency.png
│   ├── avg_inter_token_latency_vs_concurrency.png
│   ├── request_throughput_vs_concurrency.png
│   ├── efficiency_tok_s_gpu_vs_user.png
│   └── avg_time_to_first_token_vs_concurrency.png
├── <benchmark-name>/          # Results for your benchmark (uses your custom name)
│   ├── c1/                    # Concurrency level 1
│   │   └── profile_export_aiperf.json
│   ├── c2/                    # Concurrency level 2
│   ├── c5/                    # Concurrency level 5
│   └── ...                    # Other concurrency levels (10, 50, 100, 250)
└── <another-benchmark-name>/  # Results for additional benchmarking runs
    └── c*/                    # Same structure as above
```

**Example with actual benchmark names:**
```text
results/
├── plots/
├── experiment-a/    # --benchmark-name experiment-a
├── experiment-b/    # --benchmark-name experiment-b
└── experiment-c/    # --benchmark-name experiment-c
```

Each concurrency directory contains:
- **`profile_export_aiperf.json`** - Structured metrics from AIPerf
- **`profile_export_aiperf.csv`** - CSV format metrics from AIPerf
- **`profile_export.json`** - Raw AIPerf results
- **`inputs.json`** - Generated test inputs

---

## Server-Side Benchmarking (In-Cluster) {#server-side-benchmarking-in-cluster}

Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.

## What Server-Side Benchmarking Does

The server-side benchmarking solution:
- Runs benchmarks directly within the Kubernetes cluster using internal service URLs
- Uses Kubernetes service DNS for direct communication (no port forwarding required)
- Leverages the existing benchmarking infrastructure (`benchmarks.utils.benchmark`)
- Stores results persistently using `dynamo-pvc`
- Provides isolated execution environment with configurable resources
- Handles high load/speed requirements without timeout issues
- **Note**: Each benchmark job runs within a single Kubernetes namespace, but can benchmark services across multiple namespaces using the full DNS format `svc_name.namespace.svc.cluster.local`

## Prerequisites

1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](/docs/kubernetes/README.md))
2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md))
3.
**Docker image** containing the Dynamo benchmarking tools + +## Quick Start + +### Step 1: Deploy Your DynamoGraphDeployment +Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Ensure it has a frontend service exposed. + +### Step 2: Deploy and Run Benchmark Job + +**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag. + +```bash +export NAMESPACE=benchmarking + +# Deploy the benchmark job with default settings +kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE + +# Monitor the job, wait for it to complete +kubectl logs -f job/dynamo-benchmark -n $NAMESPACE +``` + +#### Customize the job configuration + +To customize the benchmark parameters, edit the `benchmarks/incluster/benchmark_job.yaml` file and modify: + +- **Model name**: Change `"Qwen/Qwen3-0.6B"` in the args section +- **Benchmark name**: Change `"qwen3-0p6b-vllm-agg"` to your desired benchmark name +- **Service URL**: Change `"vllm-agg-frontend:8000"` so the service URL matches your deployed service +- **Docker image**: Change the image field if needed + +Then deploy: +```bash +kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE +``` + +### Step 3: Retrieve Results +```bash +# Create access pod (skip this step if access pod is already running) +kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE +kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s + +# Download the results +kubectl cp $NAMESPACE/pvc-access-pod:/data/results/ ./benchmarks/results/ + +# Cleanup +kubectl 
delete pod pvc-access-pod -n $NAMESPACE
```

### Step 4: Generate Plots
```bash
# Generate performance plots from the downloaded results
python3 -m benchmarks.utils.plot \
  --data-dir ./benchmarks/results
```

This will create visualization plots. For more details on interpreting these plots, see the [Summary and Plots](#summary-and-plots) section above.

## Cross-Namespace Service Access

Server-side benchmarking can benchmark services across multiple namespaces from a single job using Kubernetes DNS. When referencing services in other namespaces, use the full DNS format:

```bash
# Access service in same namespace
SERVICE_URL=vllm-agg-frontend:8000

# Access service in different namespace
SERVICE_URL=vllm-agg-frontend.production.svc.cluster.local:8000
```

**DNS Format**: `<service-name>.<namespace>.svc.cluster.local:<port>`

This allows you to:
- Benchmark multiple services across different namespaces in a single job
- Compare services running in different environments (dev, staging, production)
- Test cross-namespace integrations without port-forwarding
- Run comprehensive cross-namespace performance comparisons

## Configuration

The benchmark job is configured directly in the YAML file.

### Default Configuration

- **Model**: `Qwen/Qwen3-0.6B`
- **Benchmark Name**: `qwen3-0p6b-vllm-agg`
- **Service**: `vllm-agg-frontend:8000`
- **Docker Image**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`

### Customizing the Job

To customize the benchmark, edit `benchmarks/incluster/benchmark_job.yaml`:

1. **Change the model**: Update the `--model` argument
2. **Change the benchmark name**: Update the `--benchmark-name` argument
3. **Change the service URL**: Update the `--endpoint-url` argument (use `<service-name>.<namespace>.svc.cluster.local:<port>` for cross-namespace access)
4. **Change Docker image**: Update the image field if needed

### Example: Multi-Namespace Benchmarking

To benchmark services across multiple namespaces, run a separate benchmark job for each service, since each job runs a single benchmark. The results are stored in the same PVC, so they can be accessed together.

```yaml
# Job 1: Production service
args:
  - --model
  - "Qwen/Qwen3-0.6B"
  - --benchmark-name
  - "prod-vllm"
  - --endpoint-url
  - "vllm-agg-frontend.production.svc.cluster.local:8000"
  - --output-dir
  - /data/results

# Job 2: Staging service
args:
  - --model
  - "Qwen/Qwen3-0.6B"
  - --benchmark-name
  - "staging-vllm"
  - --endpoint-url
  - "vllm-agg-frontend.staging.svc.cluster.local:8000"
  - --output-dir
  - /data/results
```

## Understanding Your Results

Results are stored in `/data/results` and follow the same structure as client-side benchmarking:

```text
/data/results/
└── <benchmark-name>/    # Results for your benchmark name
    ├── c1/              # Concurrency level 1
    │   └── profile_export_aiperf.json
    ├── c2/              # Concurrency level 2
    └── ...              # Other concurrency levels
```

## Monitoring and Debugging

### Check Job Status
```bash
kubectl describe job dynamo-benchmark -n $NAMESPACE
```

### View Logs
```bash
# Follow logs in real-time
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
```

### Debug Failed Jobs
```bash
# Check pod status
kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark

# Describe failed pod
kubectl describe pod <pod-name> -n $NAMESPACE
```

## Troubleshooting

### Common Issues

1. **Service not found**: Ensure your DynamoGraphDeployment frontend service is running
2. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible
3. **Image pull issues**: Ensure the Docker image is accessible from the cluster
4.
**Resource constraints**: Adjust resource limits if the job is being evicted + +### Debug Commands + +```bash +# Check PVC status +kubectl get pvc dynamo-pvc -n $NAMESPACE + +# Check service endpoints +kubectl get svc -n $NAMESPACE + +# Verify your service exists and has endpoints +SVC_NAME="${SERVICE_URL%%:*}" +kubectl get svc "$SVC_NAME" -n "$NAMESPACE" +kubectl get endpoints "$SVC_NAME" -n "$NAMESPACE" +``` + +--- + +## Customize Benchmarking Behavior + +The built-in Python workflow connects to endpoints, benchmarks with aiperf, and generates plots. If you want to modify the behavior: + +1. **Extend the workflow**: Modify `benchmarks/utils/workflow.py` to add custom deployment types or metrics collection + +2. **Generate different plots**: Modify `benchmarks/utils/plot.py` to generate a different set of plots for whatever you wish to visualize. + +3. **Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process. + +The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow. + +--- + +## Testing with Mocker Backend + +For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) that simulates LLM inference without requiring actual GPU resources. This is useful for: + +- **Testing deployments** without expensive GPU infrastructure +- **Developing and debugging** router, planner, or frontend logic +- **CI/CD pipelines** that need to validate infrastructure without model execution +- **Benchmarking framework validation** to ensure your setup works before using real backends + +The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference. 
+ +See the [mocker directory](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) for usage examples and configuration options. diff --git a/benchmarks/profiler/README.md b/benchmarks/profiler/README.md deleted file mode 120000 index d0192ec6a3e..00000000000 --- a/benchmarks/profiler/README.md +++ /dev/null @@ -1 +0,0 @@ -../../docs/benchmarks/sla_driven_profiling.md \ No newline at end of file diff --git a/benchmarks/profiler/README.md b/benchmarks/profiler/README.md new file mode 100644 index 00000000000..973ce53b1c5 --- /dev/null +++ b/benchmarks/profiler/README.md @@ -0,0 +1,624 @@ +--- +title: "SLA-Driven Profiling with DynamoGraphDeploymentRequest" +--- + +# SLA-Driven Profiling with DynamoGraphDeploymentRequest + +> [!TIP] +> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process. + +## Overview + +Dynamo provides automated SLA-driven profiling through **DynamoGraphDeploymentRequests (DGDR)**. Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically. 
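Conceptually, a DGDR pairs your SLA targets with profiling settings and lets the operator do the rest. The sketch below is purely illustrative — the `apiVersion` is assumed from the DGD examples elsewhere in this document, and the field names should be checked against the sample manifests shipped in `deploy/` (e.g. `profile_sla_dgdr.yaml`):

```yaml
# Hypothetical DGDR sketch -- consult the samples in deploy/ for the real schema
apiVersion: nvidia.com/v1alpha1          # assumed
kind: DynamoGraphDeploymentRequest
metadata:
  name: profile-my-model
spec:
  profilingConfig:
    config:
      # Illustrative SLA targets (ms), analogous to the profiler's --ttft / --itl flags
      sla:
        ttft: 200
        itl: 15
      sweep:
        use_ai_configurator: false       # profile with AIPerf on real engines
```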
+ +**Key Benefits:** +- **Declarative**: Specify SLAs, not implementation details +- **Automated**: No manual job setup or result processing +- **Integrated**: Seamlessly works with Dynamo Operator +- **Production-Ready**: Generates optimized configurations with SLA planner + +This document covers: +- Technical details of online vs offline profiling +- Profiling process internals (GPU usage, measurements, interpolation) +- Direct script usage for advanced scenarios +- Comprehensive troubleshooting + +## Support Matrix + +| Backend | Dense Models | MoE Models | +|---------|-------------|------------| +| vLLM | ✅ | 🚧 | +| SGLang | ✅ | ✅ | +| TensorRT-LLM | ✅ | 🚧 | + +Specifically, the profiler sweeps over the following parallelization mapping for prefill and decode: +| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping | +|---------|-------------|------------| +| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP | +| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP | +| Other Models | TP | TP | + +> [!NOTE] +> - Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend. + +## Using DGDR for Profiling (Recommended) + +The recommended way to profile models is through DGDRs. Sample configurations are provided in `deploy/`: + +**Available Samples:** +- **`profile_sla_dgdr.yaml`**: Standard profiling with AIPerf on real engines +- **`profile_sla_aic_dgdr.yaml`**: Fast profiling with AI Configurator simulation +- **`profile_sla_moe_dgdr.yaml`**: MoE model profiling + +The Dynamo Operator automatically: +1. Discovers GPU resources (cluster-scoped operators only) +2. Runs profiling (AIPerf on real engines or AI Configurator simulation) +3. Generates optimal DGD configuration with SLA planner +4. 
Deploys the DGD to your cluster

See the [Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for prerequisites and detailed instructions.

## Hardware Configuration

Hardware parameters have sensible defaults and are **optional** - you can override them if needed:

```yaml
profilingConfig:
  config:
    # Override hardware defaults if needed
    hardware:
      min_num_gpus_per_engine: 1
      max_num_gpus_per_engine: 8
      num_gpus_per_node: 8

    # Only needed when using AI Configurator (sweep.use_ai_configurator: true)
    sweep:
      aic_system: h200_sxm  # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
```

### Automatic GPU Discovery (Optional Feature)

Cluster-scoped operators can optionally enable automatic GPU discovery to detect hardware from cluster nodes. When enabled, the hardware config is auto-detected and overrides any manually specified values.

```yaml
spec:
  enableGpuDiscovery: true
```

This feature is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions. It is not available for namespace-restricted operators.

## Profiling Method

1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
2. **Identify Sweep Ranges**: Automatically determine the minimum and maximum number of GPUs per engine. The minimum is determined by the model size and GPU VRAM. The maximum is set to one node for dense models and 4 nodes for MoE models.
3. **Parallelization Mapping Sweep**: Using the input ISL and OSL, test the performance of the engines with different parallelization mappings.
   - For dense models, we test different TP sizes for both prefill and decode.
   - For MoE models (SGLang), we evaluate both TEP and DEP as candidates for prefill and decode.
   - **Prefill**:
     - TP/TEP: We measure TTFT with batch size = 1 (assuming the ISL is long enough to saturate compute) without KV reuse.
     - DEP: Attention uses data parallelism. We send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst. This stabilizes measurements when the first batch may launch before all requests arrive.

   ![Prefill Performance](/img/h100_prefill_performance.png)

   - **Decode**: Since the ITL (or iteration time) depends on how many requests are in flight, we measure the ITL under different numbers of in-flight requests, ranging from 1 to the maximum number of requests the engine's KV cache can hold. To measure the ITL without interference from piggy-backed prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring the ITL. Because the KV cache is sufficient for all the requests, it can hold the KV cache of the pre-computed prompts and skip the prefill phase when measuring the ITL. For MoE models, however, this is not guaranteed because the KV cache differs across attention DP ranks; we are working on a framework-side change to fix this issue. For example, the plot below shows the decode parallelization mapping sweep results on H100 for deepseek-ai/DeepSeek-R1-Distill-Llama-8B.

   ![Decode Performance](/img/h100_decode_performance.png)
4. **Recommendation**: Selects the optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLAs on TTFT and ITL. Specifically, the profiler chooses the point (or a point on the curve, for decode) that lies to the left of the vertical red dashed line representing the SLAs while having the highest y-coordinate (throughput per GPU).
5. **In-Depth Profiling on the Recommended P/D Engine**: After finding the best TP size for prefill and decode, the script interpolates the TTFT against ISL, and the ITL against active KV cache usage and decode context length. This provides a more accurate estimate of performance as ISL and OSL change and is used by the SLA planner.
![ITL Interpolation](/img/pd_interpolation.png)
   - **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size = 1.
   - **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths. The active KV usage determines the complexity of the memory-bound attention kernel, while the active KV usage divided by the average context length determines the complexity of the compute-bound MLP kernel. For example, the figure below shows the ITL of the DS-Distilled Llama 8B model on H100 TP4: the ITL grows near-linearly with active KV usage under a fixed context length, and the slope increases as the context length decreases.

To run the parallelization mapping sweep and the in-depth profiling on the recommended P/D engine, the profiler needs to know the engine's forward pass time under different loads. There are two ways to achieve this: run AIPerf on real engines, or use AI Configurator to run simulations.

### AIPerf on Real Engines

Profiles your model by creating real test deployments in Kubernetes and measuring their performance.

**Characteristics:**
- **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM

**DGDR Configuration:**
```yaml
profilingConfig:
  config:
    sweep:
      use_ai_configurator: false  # Default
```

### AI Configurator Simulation

Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
+ +**Characteristics:** +- **Duration**: 20-30 seconds +- **Accuracy**: Estimated (may have errors for unusual configurations) +- **GPU Requirements**: None +- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon) + +**DGDR Configuration:** +```yaml +profilingConfig: + config: + sweep: + use_ai_configurator: true + aic: + system: h200_sxm # GPU system type + model_name: QWEN3_32B # AIC model identifier + backend_version: "0.20.0" +``` + +**Supported Configurations:** + +For the current list of supported models, systems, and backend versions, see the [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features). + +To check from the command line: `aiconfigurator cli --help` + +**Currently supports:** +- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6) +- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM +- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more + +### Output Format + +After profiling, the DGDR status contains: + +1. **Recommended Configuration**: Optimal TP for prefill and decode +2. **Performance Data**: Interpolation models for SLA planner +3. **Generated DGD**: Complete deployment manifest + +**Example Recommendations:** +``` +Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU) +Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU) +``` + +#### Interactive Configuration Selection WebUI + +When running the profiler with `--pick-with-webui`, an interactive web interface is launched that allows you to visually explore profiling results and manually select configurations. 
+ +**Features:** +- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables +- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput +- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML +- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests) +- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets + +**Selection Methods:** +1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination +2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each + +**Example DGD Config Output:** + +When you click "Show Config", you'll see a DynamoGraphDeployment configuration like: + +```yaml +# DynamoGraphDeployment Configuration +# Prefill: 1 GPU(s), TP=1 +# Decode: 4 GPU(s), TP=4 +# Model: Qwen/Qwen3-32B-FP8 +# Backend: trtllm +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +spec: + services: + PrefillWorker: + subComponentType: prefill + replicas: 1 + extraPodSpec: + mainContainer: + args: + - --tensor-parallel-size=1 + DecodeWorker: + subComponentType: decode + replicas: 1 + extraPodSpec: + mainContainer: + args: + - --tensor-parallel-size=4 +``` + +**Usage:** +```bash +python -m benchmarks.profiler.profile_sla \ + --backend trtllm \ + --config path/to/disagg.yaml \ + --pick-with-webui \ + --use-ai-configurator \ + --model Qwen/Qwen3-32B-FP8 \ + --aic-system h200_sxm \ + --ttft 200 --itl 15 +``` + +Once you have selected a configuration, the full DynamoGraphDeployment CRD will be saved in your output folder as `config_with_planner.yaml`. + +The WebUI launches on port 8000 by default (configurable with `--webui-port`). 
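The recommendation logic described earlier — pick the configuration to the left of the SLA line with the highest per-GPU throughput — can be sketched as follows. This is an illustrative sketch only, not the profiler's actual internals; the candidate tuples and numbers are made up to mirror the example recommendations above.

```python
# Illustrative sketch of SLA-constrained selection (not the profiler's real code).
# Each candidate is (parallel_mapping, latency_ms, throughput_per_gpu).
def pick_best(candidates, sla_latency_ms):
    """Return the highest-throughput candidate that meets the latency SLA."""
    feasible = [c for c in candidates if c[1] <= sla_latency_ms]
    if not feasible:
        return None  # No configuration meets the SLA; relax targets or add GPUs
    return max(feasible, key=lambda c: c[2])

prefill_candidates = [
    ("TP1", 95.0, 9800.0),
    ("TP2", 61.0, 12900.0),
    ("TP4", 48.4, 15505.2),  # mirrors the example recommendation above
    ("TP8", 45.0, 9100.0),
]
best = pick_best(prefill_candidates, sla_latency_ms=200.0)
print(best)  # -> ('TP4', 48.4, 15505.2)
```

The same selection runs independently for prefill (against the TTFT target) and decode (against the ITL target).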
+ 
+#### Output Performance Plots
+
+The profiler generates the following plots to visualize the performance data:
+
+**Parallelization Mapping Sweep Plots:**
+- `prefill_performance.png`: TTFT vs parallelization mapping size
+- `decode_performance.png`: ITL vs parallelization mapping size and in-flight requests
+
+Note that these two plots are based on the input ISL and OSL.
+
+**In-Depth Profiling for the Recommended P/D Engine Plots:**
+- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL for the recommended prefill engine
+- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL for the recommended prefill engine
+- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length for the recommended decode engine
+- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length for the recommended decode engine
+
+
+### Output Interpolation Data
+
+The profiler generates `.npz` files to store the performance data for the recommended P/D engine:
+
+**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
+- `prefill_isl`: 1D array of input sequence lengths tested
+- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
+- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
+
+**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
+- `max_kv_tokens`: Total KV token capacity of the decode engine
+- `x_kv_usage`: 1D array of active KV usage percentages [0, 1]
+- `y_context_length`: 1D array of average context lengths tested
+- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
+- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
+
+## DGDR Configuration Reference
+
+This section provides detailed explanations of all DGDR `profilingConfig` options. 
The DGDR controller passes this configuration to the profiler script, which is defined in `benchmarks/profiler/utils/profiler_argparse.py`. + +### Configuration Structure + +All profiler configuration goes under `spec.profilingConfig.config`: + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeploymentRequest +metadata: + name: my-deployment +spec: + model: "Qwen/Qwen3-0.6B" # High-level: model to deploy + backend: vllm # High-level: inference backend + + profilingConfig: + profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Required + configMapRef: # Optional: base DGD config + name: my-config + key: disagg.yaml + + config: # Profiler configuration + sla: { ... } + hardware: { ... } + sweep: { ... } + aic: { ... } + planner: { ... } + + deploymentOverrides: # Optional + workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" +``` + +### SLA Configuration (Required) + +Define your performance requirements and workload characteristics: + +```yaml +profilingConfig: + config: + sla: + isl: 3000 # Average input sequence length (tokens) + osl: 150 # Average output sequence length (tokens) + ttft: 200.0 # Target Time To First Token (milliseconds) + itl: 20.0 # Target Inter-Token Latency (milliseconds) +``` + +**What these control:** +- **ISL/OSL**: Based on your expected traffic patterns +- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine) +- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine) +- **Trade-offs**: Tighter SLAs require more GPU resources + +### Hardware Configuration (Optional) + +Control GPU search space and constraints: + +```yaml +profilingConfig: + config: + hardware: + min_num_gpus_per_engine: 2 # if not provided, will automatically determine based on model and VRAM size + max_num_gpus_per_engine: 8 # Maximum GPUs to test + num_gpus_per_node: 8 # GPUs per node (for multi-node MoE) + gpu_type: h200_sxm # GPU type hint +``` + +**When to use:** +- 
**min_num_gpus_per_engine**: Skip small TP sizes if your model is large
+- **max_num_gpus_per_engine**: Limit the search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
+- **num_gpus_per_node**: Determines the upper bound on the number of GPUs per node for dense models and configures Grove for multi-node MoE engines
+- **gpu_type**: Informational, auto-detected by the controller
+
+> [!TIP]
+> If you don't specify hardware constraints, the controller auto-detects them based on your model size and available cluster resources.
+
+### Sweep Configuration (Optional)
+
+Control profiling behavior:
+
+```yaml
+profilingConfig:
+  config:
+    sweep:
+      use_ai_configurator: false             # Use offline profiling (default: false)
+      prefill_interpolation_granularity: 16  # Samples for prefill TTFT curve
+      decode_interpolation_granularity: 6    # Samples for decode ITL curve
+```
+
+**Use cases:**
+- **use_ai_configurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
+- **prefill_interpolation_granularity**: How many samples to benchmark for the prefill TTFT curve (lower = faster but possibly less accurate)
+- **decode_interpolation_granularity**: How many samples to benchmark for the decode ITL curve (lower = faster but possibly less accurate). Since the ITL interpolation is a 3D surface and takes longer to run, we default to a smaller number of samples. Increasing this value may increase the profiling time quadratically. 
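To see why decode profiling time can grow quadratically with the granularity, note that the decode sweep covers a two-dimensional surface of (KV usage, context length) points. The sketch below is a hypothetical illustration of that sample count, assuming the same granularity is used for both axes; it is not the profiler's actual sampling code.

```python
# Hypothetical illustration: if the decode sweep samples a (KV usage x context
# length) grid, the number of benchmark points grows with the square of the
# granularity.
def decode_sample_count(granularity: int) -> int:
    kv_usage_points = granularity        # samples along the KV-usage axis
    context_length_points = granularity  # samples along the context-length axis
    return kv_usage_points * context_length_points

for g in (6, 12):
    print(g, decode_sample_count(g))
# Doubling the granularity from 6 to 12 grows the sweep from 36 to 144 points (4x).
```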
+ +### AI Configurator Configuration (Required if `use_ai_configurator: true`) + +Configure AI Configurator profiling mode: + +```yaml +profilingConfig: + config: + sweep: + use_ai_configurator: true + aic_system: h200_sxm # GPU system: h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm + aic_hf_id: Qwen/Qwen3-32B # Huggingface model id + aic_backend_version: "0.20.0" # TensorRT-LLM version: 0.20.0, 1.0.0rc3 +``` + +**Supported configurations:** See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features) + +### Planner Configuration (Optional) + +Pass arguments to the SLA planner: + +```yaml +profilingConfig: + config: + planner: + planner_min_endpoint: 2 # Minimum endpoints to maintain + planner_adjustment_interval: 60 # Adjustment interval (seconds) + planner_load_predictor: linear # Load prediction method +``` + +> [!NOTE] +> Planner arguments use `planner_` prefix. See planner documentation for full list. + +### Engine Configuration (Auto-configured) + +The controller automatically sets these from high-level fields: + +```yaml +# You specify: +spec: + model: "Qwen/Qwen3-0.6B" + backend: vllm + +# Controller auto-injects into config: +profilingConfig: + config: + deployment: + model: "Qwen/Qwen3-0.6B" # From spec.model + engine: + backend: vllm # From spec.backend + config: /path/to/configmap # From spec.profilingConfig.configMapRef (if provided) +``` + +**You should not manually set** `deployment.model` or `engine.backend` in `profilingConfig.config` - they are automatically injected from the high-level fields. 
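Before submitting a DGDR, the rules above can be sanity-checked locally. The sketch below is an illustrative pre-flight check based only on the constraints documented in this section (required SLA fields, AIC section required when `use_ai_configurator` is true, and `deployment.model` being auto-injected); it is not an official validator.

```python
# Illustrative pre-flight check for a DGDR profilingConfig.config dict,
# based on the rules described in this guide (not an official validator).
def validate_profiling_config(config: dict) -> list[str]:
    errors = []
    sla = config.get("sla", {})
    for field in ("isl", "osl", "ttft", "itl"):
        if field not in sla:
            errors.append(f"sla.{field} is required")
    if config.get("sweep", {}).get("use_ai_configurator") and "aic" not in config:
        errors.append("aic section is required when use_ai_configurator is true")
    # Auto-injected by the controller; should not be set manually.
    if "model" in config.get("deployment", {}):
        errors.append("deployment.model is auto-injected; do not set it manually")
    return errors

cfg = {"sla": {"isl": 3000, "osl": 150, "ttft": 200.0, "itl": 20.0},
       "sweep": {"use_ai_configurator": True}}
print(validate_profiling_config(cfg))
# -> ['aic section is required when use_ai_configurator is true']
```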
+ +### Complete Example: AIPerf on Real Engines + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeploymentRequest +metadata: + name: vllm-dense-online +spec: + model: "Qwen/Qwen3-0.6B" + backend: vllm + + profilingConfig: + profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" + config: + sla: + isl: 3000 + osl: 150 + ttft: 200.0 + itl: 20.0 + + hardware: + min_num_gpus_per_engine: 1 + max_num_gpus_per_engine: 8 + + sweep: + use_ai_configurator: false + + deploymentOverrides: + workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" + + autoApply: true +``` + +### Complete Example: AI Configurator Simulation + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeploymentRequest +metadata: + name: trtllm-aic-offline +spec: + model: "Qwen/Qwen3-32B" + backend: trtllm + + profilingConfig: + profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1" + config: + sla: + isl: 4000 + osl: 500 + ttft: 300.0 + itl: 10.0 + + sweep: + use_ai_configurator: true + + aic: + system: h200_sxm + model_name: QWEN3_32B + backend_version: "0.20.0" + + deploymentOverrides: + workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1" + + autoApply: true +``` + +### Complete Example: MoE Model + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeploymentRequest +metadata: + name: sglang-moe +spec: + model: "deepseek-ai/DeepSeek-R1" + backend: sglang + + profilingConfig: + profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1" + config: + sla: + isl: 2048 + osl: 512 + ttft: 300.0 + itl: 25.0 + + hardware: + num_gpus_per_node: 8 + max_num_gpus_per_engine: 32 + + engine: + is_moe_model: true # Enable MoE profiling mode + + deploymentOverrides: + workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1" + + autoApply: true +``` + +## Troubleshooting + +### Profiling Takes Too Long + +**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only): +```yaml +sweep: + use_ai_configurator: true +``` + 
+**Solution 2**: Reduce the search space:
+```yaml
+config:
+  hardware:
+    min_num_gpus_per_engine: 4  # Skip TP1, TP2
+    max_num_gpus_per_engine: 8  # Don't test beyond TP8
+```
+
+### SLA Cannot Be Met
+
+**Symptoms**: Profiler reports that no configuration meets the targets
+
+**Solutions:**
+1. Relax SLA targets (increase TTFT/ITL)
+2. Add more GPU resources
+3. Try a different backend
+4. Use a smaller model
+
+### AI Configurator: Attention Head Constraint Error
+
+**Symptoms**: Profiling fails with error:
+```
+AssertionError: num_heads should be divisible by tp_size and the division result should be >= 4
+```
+
+**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
+
+**Affected Models:**
+- **Qwen3-0.6B** (16 heads): Max TP = 4 ❌ Fails at TP=8
+- **GPT-2** (12 heads): Max TP = 3
+- Most models **<1B parameters**: May hit this constraint
+
+**Solution**: Limit `max_num_gpus_per_engine` in your DGDR:
+
+```yaml
+profilingConfig:
+  profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
+  config:
+    hardware:
+      max_num_gpus_per_engine: 4  # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
+    sweep:
+      use_ai_configurator: true
+    aic:
+      system: h200_sxm
+      model_name: QWEN3_0_6B
+```
+
+**Calculate Max TP**: `max_tp = num_attention_heads / 4`
+
+> **Note**: This is an AI Configurator limitation. Online profiling doesn't have this constraint.
+
+### Image Pull Errors
+
+**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
+
+**Solution**: Ensure image pull secrets are configured:
+```bash
+kubectl create secret docker-registry nvcr-imagepullsecret \
+  --docker-server=nvcr.io \
+  --docker-username='$oauthtoken' \
+  --docker-password='<YOUR_NGC_API_KEY>' \
+  --namespace <NAMESPACE>
+```
+
+### Out of Memory During Profiling
+
+**Symptoms**: OOM errors in profiling jobs
+
+**Solutions:**
+1. Reduce `gpu_memory_utilization` in engine config
+2. Reduce `--max-context-length`
+3. Skip larger TP configurations
+4. 
Use fewer GPUs per test
+
+### Unsupported Parallelization Mapping in Backend
+
+**Symptoms**: Startup or runtime error in the backend. For example, a prime number of attention heads restricts the TP size to 1 (e.g., falcon-7b with 71 attention heads), or a backend may not support different TP sizes for prefill and decode.
+
+**Solutions:**
+1. Ask the backend maintainers to add support for the use case, then bump the backend version in Dynamo.
+2. Restrict the minimum and maximum number of GPUs per engine to the supported range.
+
+## Next Steps
+
+- **Deploy with DGDR**: See [Quick Start Guide](/docs/planner/sla_planner_quickstart.md)
+- **Understand SLA Planner**: Read [SLA Planner Deep Dive](/docs/planner/sla_planner.md)
+- **Monitor Deployments**: Set up [Observability](/docs/kubernetes/observability/metrics.md)
+- **Optimize Performance**: See [Performance Tuning](/docs/performance/tuning.md)
+
+## Related Documentation
+
+- [DGDR API Reference](/docs/kubernetes/api_reference.md)
+- [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md)
+- [SLA Planner Architecture](/docs/planner/sla_planner.md)
+- [Profiler Arguments Reference](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/profiler/utils/profiler_argparse.py)
diff --git a/deploy/README.md b/deploy/README.md
deleted file mode 120000
index f6eccd892ef..00000000000
--- a/deploy/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../docs/kubernetes/README.md
\ No newline at end of file
diff --git a/deploy/README.md b/deploy/README.md
new file mode 100644
index 00000000000..7caa302f5be
--- /dev/null
+++ b/deploy/README.md
@@ -0,0 +1,256 @@
+---
+title: "Deploying Dynamo on Kubernetes"
+---
+
+
+
+# Deploying Dynamo on Kubernetes
+
+High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
+
+## Important Terminology
+
+**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created. 
+- Used for: Resource isolation, RBAC, organizing deployments
+- Example: `dynamo-system`, `team-a-namespace`
+
+**Dynamo Namespace**: The logical namespace used by Dynamo components for [service discovery](/docs/kubernetes/service_discovery.md).
+- Used for: Runtime component communication, service discovery
+- Specified in: `.spec.services.<serviceName>.dynamoNamespace` field
+- Example: `my-llm`, `production-model`, `dynamo-dev`
+
+These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
+
+## Prerequisites
+
+Before you begin, ensure you have the following tools installed:
+
+| Tool | Minimum Version | Installation Guide |
+|------|-----------------|-------------------|
+| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
+| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |
+
+Verify your installation:
+```bash
+kubectl version --client  # Should show v1.24+
+helm version              # Should show v3.0+
+```
+
+For detailed installation instructions, see the [Prerequisites section](/docs/kubernetes/installation_guide.md#prerequisites) in the Installation Guide.
+
+## Pre-deployment Checks
+
+Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:
+
+```bash
+./deploy/pre-deployment/pre-deployment-check.sh
+```
+
+This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for more details.
+
+## 1. Install Platform First
+
+```bash
+# 1. Set environment
+export NAMESPACE=dynamo-system
+export RELEASE_VERSION=0.x.x  # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
+
+# 2. 
Install CRDs (skip if on shared cluster where CRDs already exist) +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz +helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default + +# 3. Install Platform +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz +helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace +``` + +**For Shared/Multi-Tenant Clusters:** + +If your cluster has namespace-restricted Dynamo operators, add this flag to step 3: +```bash +--set dynamo-operator.namespaceRestriction.enabled=true +``` + +For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](/docs/kubernetes/installation_guide.md)**. + +## 2. Choose Your Backend + +Each backend has deployment examples and configuration options: + +| Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node | +|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:| +| **[SGLang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +| **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | +| **[vLLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | + +## 3. 
Deploy Your First Model
+
+```bash
+export NAMESPACE=dynamo-system
+kubectl create namespace ${NAMESPACE}
+
+# To pull the model from Hugging Face
+export HF_TOKEN=<your_hf_token>
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="$HF_TOKEN" \
+  -n ${NAMESPACE}
+
+# Deploy any example (this uses vLLM with a Qwen model using aggregated serving)
+kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
+
+# Check status
+kubectl get dynamoGraphDeployment -n ${NAMESPACE}
+
+# Test it
+kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
+curl http://localhost:8000/v1/models
+```
+
+For SLA-based autoscaling, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
+
+## Understanding Dynamo's Custom Resources
+
+Dynamo provides two main Kubernetes Custom Resources for deploying models:
+
+### DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
+
+The **recommended approach** for generating optimal configurations. DGDR provides a high-level interface where you specify:
+- Model name and backend framework
+- SLA targets (latency requirements)
+- GPU type (optional)
+
+Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
+- SLA-driven configuration generation
+- Automated resource optimization
+- Users who want simplicity over control
+
+**Note**: DGDR generates a DGD spec, which you can then use to deploy.
+
+### DynamoGraphDeployment (DGD) - Direct Configuration
+
+A lower-level interface that defines your complete inference pipeline:
+- Model configuration
+- Resource allocation (GPUs, memory)
+- Scaling policies
+- Frontend/backend connections
+
+Use this when you need fine-grained control or have already completed profiling.
+
+Refer to the [API Reference and Documentation](/docs/kubernetes/api_reference.md) for more details. 
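As a concrete point of comparison between the two resources, a minimal DGDR sketch looks like the following. The model, backend, image, and SLA values are placeholders taken from the examples in this repository; see the API reference for the authoritative field list.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-request
spec:
  model: "Qwen/Qwen3-0.6B"   # placeholder model
  backend: vllm              # placeholder backend
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000    # expected input length (tokens)
        osl: 150     # expected output length (tokens)
        ttft: 200.0  # target time-to-first-token (ms)
        itl: 20.0    # target inter-token latency (ms)
  autoApply: true    # deploy the generated DGD automatically
```

Once applied, the controller profiles the model and writes the generated DGD spec to the DGDR status; with a DGD you would instead spell out the services, images, and GPU resources yourself.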
+ +## 📖 API Reference & Documentation + +For detailed technical specifications of Dynamo's Kubernetes resources: + +- **[API Reference](/docs/kubernetes/api_reference.md)** - Complete CRD field specifications for all Dynamo resources +- **[Create Deployment](/docs/kubernetes/deployment/create_deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment +- **[Operator Guide](/docs/kubernetes/dynamo_operator.md)** - Dynamo operator configuration and management + +### Choosing Your Architecture Pattern + +When creating a deployment, select the architecture pattern that best fits your use case: + +- **Development / Testing** - Use `agg.yaml` as the base configuration +- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference +- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability + +### Frontend and Worker Components + +You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). 
The Frontend serves as a framework-agnostic HTTP entry point that: + +- Provides OpenAI-compatible `/v1/chat/completions` endpoint +- Auto-discovers backend workers via [service discovery](/docs/kubernetes/service_discovery.md) (Kubernetes-native by default) +- Routes requests and handles load balancing +- Validates and preprocesses requests + +### Customizing Your Deployment + +Example structure: +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: my-llm +spec: + services: + Frontend: + dynamoNamespace: my-llm + componentType: frontend + replicas: 1 + extraPodSpec: + mainContainer: + image: your-image + VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker + dynamoNamespace: dynamo-dev + componentType: worker + replicas: 1 + envFromSecret: hf-token-secret # for HuggingFace models + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: your-image + command: ["/bin/sh", "-c"] + args: + - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags] +``` + +Worker command examples per backend: +```yaml +# vLLM worker +args: + - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B + +# SGLang worker +args: + - >- + python3 -m dynamo.sglang + --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --tp 1 + --trust-remote-code + +# TensorRT-LLM worker +args: + - python3 -m dynamo.trtllm + --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml +``` + +Key customization points include: +- **Model Configuration**: Specify model in the args command +- **Resource Allocation**: Configure GPU requirements under `resources.limits` +- **Scaling**: Set `replicas` for number of worker instances +- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs +- **Worker Specialization**: Add `--is-prefill-worker` flag for 
disaggregated prefill workers + +## Additional Resources + +- **[Examples](../examples.md)** - Complete working examples +- **[Create Custom Deployments](/docs/kubernetes/deployment/create_deployment.md)** - Build your own CRDs +- **[Managing Models with DynamoModel](/docs/kubernetes/deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models +- **[Operator Documentation](/docs/kubernetes/dynamo_operator.md)** - How the platform works +- **[Service Discovery](/docs/kubernetes/service_discovery.md)** - Discovery backends and configuration +- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users +- **[GitOps Deployment with FluxCD](/docs/kubernetes/fluxcd.md)** - For advanced users +- **[Logging](/docs/kubernetes/observability/logging.md)** - For logging setup +- **[Multinode Deployment](/docs/kubernetes/deployment/multinode-deployment.md)** - For multinode deployment +- **[Grove](/docs/kubernetes/grove.md)** - For grove details and custom installation +- **[Monitoring](/docs/kubernetes/observability/metrics.md)** - For monitoring setup +- **[Model Caching with Fluid](/docs/kubernetes/model_caching_with_fluid.md)** - For model caching with Fluid diff --git a/docs/DOCUSAURUS_MIGRATION_PLAN.md b/docs/DOCUSAURUS_MIGRATION_PLAN.md new file mode 100644 index 00000000000..1e88f4f80e3 --- /dev/null +++ b/docs/DOCUSAURUS_MIGRATION_PLAN.md @@ -0,0 +1,768 @@ +# Docusaurus Migration Plan for NVIDIA Dynamo Documentation + +> **Date:** January 2026 +> **Status:** Approved +> **Decision:** Full Migration (Option A) with Native Versioning + +--- + +## Table of Contents + +1. [Executive Summary](#executive-summary) +2. [Current State Analysis](#current-state-analysis) +3. [Migration Approach](#migration-approach) +4. [Local Development & Testing](#local-development--testing) +5. [Project Structure](#project-structure) +6. [Content Migration](#content-migration) +7. [Versioning Strategy](#versioning-strategy) +8. 
[Theme & Styling](#theme--styling) +9. [Implementation Steps](#implementation-steps) +10. [Migration Checklist](#migration-checklist) + +--- + +## Executive Summary + +Full migration from Sphinx to Docusaurus, replacing the existing documentation build system entirely. This approach uses Docusaurus native versioning and focuses on local development/testing before any deployment decisions. + +### Key Decisions + +| Decision | Choice | +|----------|--------| +| **Migration Approach** | Option A: Full Docusaurus Migration | +| **Versioning** | Docusaurus Native (`docs:version` command) | +| **Deployment** | TBD (local testing first) | +| **Theme** | Custom CSS on Classic Theme (NVIDIA branding) | + +### Goals + +- ✅ Complete replacement of Sphinx +- ✅ Native versioned documentation +- ✅ Local preview in browser for testing +- ✅ Modern developer experience with hot reload +- ✅ MDX support for interactive documentation + +--- + +## Current State Analysis + +### Existing Infrastructure + +| Component | Current Implementation | +|-----------|----------------------| +| **Framework** | Sphinx 7.x with nvidia_sphinx_theme | +| **Content Format** | RST (index.rst) + Markdown (MyST parser) | +| **Extensions** | mermaid, sphinx_design, ablog, sphinx_tabs, etc. | +| **Build System** | Makefile + Docker (container/Dockerfile.docs) | +| **CI/CD** | `.github/workflows/generate-docs.yml` (612 lines) | + +### Files to Migrate + +| Category | Files | Format | +|----------|-------|--------| +| Entry point | `index.rst` | RST → MDX | +| Configuration | `conf.py`, `Makefile` | → `docusaurus.config.ts` | +| Content | ~100+ docs | Markdown (mostly compatible) | +| Includes | `_includes/*.rst` | RST → MDX components | +| Extensions | `_extensions/github_alerts.py` | → MDX/Admonitions | +| Static assets | `_static/*`, `images/*` | → `static/` | + +--- + +## Migration Approach + +### Full Docusaurus Migration + +Complete replacement of Sphinx with Docusaurus, migrating all content. 
+ +**Benefits:** +- Clean break from legacy system +- Modern React-based stack +- Native versioning built-in +- Fast hot-reload development server +- Better search capabilities +- MDX support for interactive docs + +**New Structure:** +``` +docs/ +├── docusaurus/ # New Docusaurus project +│ ├── docusaurus.config.ts # Main configuration +│ ├── sidebars.ts # Navigation structure +│ ├── package.json # Dependencies +│ ├── tsconfig.json +│ ├── docs/ # Current version documentation +│ │ ├── intro.md +│ │ ├── backends/ +│ │ ├── kubernetes/ +│ │ ├── guides/ +│ │ └── ... +│ ├── versioned_docs/ # Auto-generated by Docusaurus +│ │ ├── version-0.3.0/ +│ │ └── version-0.2.0/ +│ ├── versioned_sidebars/ +│ ├── src/ +│ │ ├── components/ # Custom React components +│ │ ├── css/ +│ │ │ └── custom.css # NVIDIA theme overrides +│ │ └── pages/ +│ ├── static/ +│ │ └── img/ +│ └── versions.json # Version manifest +└── sphinx/ # Old Sphinx docs (to be removed after migration) +``` + +--- + +## Local Development & Testing + +### Quick Start + +```bash +# Navigate to docs directory +cd docs/docusaurus + +# Install dependencies +npm install + +# Start development server with hot reload +npm run start +# Opens http://localhost:3000 in your browser + +# Build static site (for testing production build) +npm run build + +# Serve the production build locally +npm run serve +# Opens http://localhost:3000 with production build +``` + +### Development Commands + +| Command | Description | +|---------|-------------| +| `npm run start` | Start dev server with hot reload (http://localhost:3000) | +| `npm run build` | Build production static site to `build/` | +| `npm run serve` | Serve production build locally | +| `npm run clear` | Clear Docusaurus cache | +| `npm run docusaurus docs:version X.Y.Z` | Create a new version snapshot | + +### Testing Workflow + +1. **Make changes** to docs in `docs/docusaurus/docs/` +2. **View instantly** at http://localhost:3000 (hot reload) +3. 
**Test production build:** + ```bash + npm run build && npm run serve + ``` +4. **Open browser** to http://localhost:3000 to verify + +--- + +## Project Structure + +### Initial Setup + +```bash +# Create Docusaurus project +cd docs +npx create-docusaurus@latest docusaurus classic --typescript + +# Install additional plugins +cd docusaurus +npm install @docusaurus/theme-mermaid +npm install @docusaurus/plugin-client-redirects +``` + +### Configuration Files + +#### `docusaurus.config.ts` + +```typescript +import {themes as prismThemes} from 'prism-react-renderer'; +import type {Config} from '@docusaurus/types'; +import type * as Preset from '@docusaurus/preset-classic'; + +const config: Config = { + title: 'NVIDIA Dynamo', + tagline: 'High-performance, low-latency inference framework', + favicon: 'img/favicon.ico', + + // For local testing, use localhost + url: 'http://localhost:3000', + baseUrl: '/', + + organizationName: 'ai-dynamo', + projectName: 'dynamo', + + onBrokenLinks: 'warn', + onBrokenMarkdownLinks: 'warn', + + i18n: { + defaultLocale: 'en', + locales: ['en'], + }, + + markdown: { + mermaid: true, + }, + + themes: ['@docusaurus/theme-mermaid'], + + presets: [ + [ + 'classic', + { + docs: { + routeBasePath: '/', // Docs at root + sidebarPath: './sidebars.ts', + editUrl: 'https://github.com/ai-dynamo/dynamo/tree/main/docs/docusaurus/', + showLastUpdateTime: true, + // Versioning config + lastVersion: 'current', + versions: { + current: { + label: 'dev', + path: 'dev', + }, + }, + }, + blog: false, // Disable blog + theme: { + customCss: './src/css/custom.css', + }, + } satisfies Preset.Options, + ], + ], + + plugins: [ + [ + '@docusaurus/plugin-client-redirects', + { + redirects: [ + // Preserve existing redirects from Sphinx + {from: '/guides/tool-calling', to: '/agents/tool-calling'}, + {from: '/architecture/architecture', to: '/design_docs/architecture'}, + // Add more as needed + ], + }, + ], + ], + + themeConfig: { + navbar: { + title: 'NVIDIA 
Dynamo', + logo: { + alt: 'NVIDIA Logo', + src: 'img/nvidia-logo.svg', + }, + items: [ + { + type: 'docsVersionDropdown', + position: 'right', + dropdownActiveClassDisabled: true, + }, + { + href: 'https://github.com/ai-dynamo/dynamo', + label: 'GitHub', + position: 'right', + }, + ], + }, + footer: { + style: 'dark', + links: [ + { + title: 'Documentation', + items: [ + {label: 'Getting Started', to: '/'}, + {label: 'Backends', to: '/backends'}, + {label: 'Kubernetes', to: '/kubernetes'}, + ], + }, + { + title: 'Community', + items: [ + {label: 'GitHub', href: 'https://github.com/ai-dynamo/dynamo'}, + {label: 'Issues', href: 'https://github.com/ai-dynamo/dynamo/issues'}, + ], + }, + ], + copyright: `Copyright © ${new Date().getFullYear()} NVIDIA Corporation & Affiliates`, + }, + prism: { + theme: prismThemes.github, + darkTheme: prismThemes.dracula, + additionalLanguages: ['bash', 'python', 'yaml', 'rust', 'toml', 'json'], + }, + } satisfies Preset.ThemeConfig, +}; + +export default config; +``` + +#### `sidebars.ts` + +```typescript +import type {SidebarsConfig} from '@docusaurus/plugin-content-docs'; + +const sidebars: SidebarsConfig = { + docs: [ + 'intro', + { + type: 'category', + label: 'Getting Started', + items: [ + 'installation', + 'quickstart', + 'support-matrix', + ], + }, + { + type: 'category', + label: 'Backends', + items: [ + 'backends/index', + { + type: 'category', + label: 'SGLang', + items: [ + 'backends/sglang/index', + 'backends/sglang/gpt-oss', + ], + }, + { + type: 'category', + label: 'vLLM', + items: [ + 'backends/vllm/index', + ], + }, + { + type: 'category', + label: 'TensorRT-LLM', + items: [ + 'backends/trtllm/index', + ], + }, + ], + }, + { + type: 'category', + label: 'Kubernetes', + items: [ + 'kubernetes/deployment', + 'kubernetes/observability', + 'kubernetes/multinode', + ], + }, + { + type: 'category', + label: 'User Guides', + items: [ + 'agents/tool-calling', + 'multimodal/index', + 'performance/tuning', + 
'observability/metrics',
      ],
    },
    {
      type: 'category',
      label: 'Design Docs',
      items: [
        'design_docs/architecture',
        'design_docs/disagg_serving',
        'design_docs/distributed_runtime',
      ],
    },
    {
      type: 'category',
      label: 'Reference',
      items: [
        'reference/cli',
        'reference/glossary',
      ],
    },
  ],
};

export default sidebars;
```

---

## Content Migration

### Automated Conversion Process

```bash
# 1. Convert RST files to Markdown (strip the .rst suffix so the output
#    is file.md rather than file.rst.md)
find ../sphinx -name "*.rst" | while read -r f; do
  pandoc "$f" -f rst -t gfm -o "${f%.rst}.md"
done

# 2. Copy Markdown files (already compatible), preserving directory structure
rsync -a --include='*/' --include='*.md' --exclude='*' ../sphinx/ docs/

# 3. Run migration script for Sphinx-specific syntax
python scripts/migrate_content.py
```

### Migration Script

Create `scripts/migrate_content.py`:

```python
#!/usr/bin/env python3
"""Migrate Sphinx markdown syntax to Docusaurus MDX."""

import re
from pathlib import Path

REPLACEMENTS = [
    # Admonitions: MyST -> Docusaurus. Match the whole fenced directive so
    # only admonition fences are rewritten; a bare ``` to ::: replacement
    # would also destroy the closing fence of every ordinary code block.
    # (Assumes admonitions contain no nested code fences.)
    (r'```\{(note|warning|tip|caution|danger)\}\n([\s\S]*?)\n```',
     r':::\1\n\2\n:::'),

    # References
    (r':ref:`([^`]+)`', r'[\1](\1.md)'),

    # Code blocks with sphinx-specific options
    (r'```\{code-block\} (\w+)', r'```\1'),

    # Remove toctree directives (handled by sidebars.ts)
    (r'```\{toctree\}[\s\S]*?```', ''),
]

def migrate_file(filepath: Path):
    content = filepath.read_text()

    for pattern, replacement in REPLACEMENTS:
        content = re.sub(pattern, replacement, content)

    # Write back
    filepath.write_text(content)
    print(f"Migrated: {filepath}")

def main():
    docs_dir = Path("docs")
    for md_file in docs_dir.rglob("*.md"):
        migrate_file(md_file)

if __name__ == "__main__":
    main()
```

### Manual Fixes Required

| Sphinx Feature | Docusaurus Equivalent |
|----------------|----------------------|
| `.. include::` directives | Import MDX components |
| `.. 
toctree::` | `sidebars.ts` configuration |
| `:doc:` references | Standard markdown links |
| `{guilabel}`, `{menuselection}` | Bold text or custom component |
| Sphinx tabs | `<Tabs>`/`<TabItem>` components from `@theme/Tabs` |

---

## Versioning Strategy

### Docusaurus Native Versioning

Docusaurus handles versioning automatically with the `docs:version` command.

**How it works:**

```bash
# When ready to release version 0.4.0:
npm run docusaurus docs:version 0.4.0

# This creates:
# - versioned_docs/version-0.4.0/ (snapshot of docs/)
# - versioned_sidebars/version-0.4.0-sidebars.json
# - Updates versions.json: ["0.4.0", "0.3.0", ...]
```

**Version Structure:**
```
docs/docusaurus/
├── docs/                  # "current" (dev) version
├── versioned_docs/
│   ├── version-0.4.0/     # Release 0.4.0
│   ├── version-0.3.0/     # Release 0.3.0
│   └── version-0.2.0/     # Release 0.2.0
├── versioned_sidebars/
│   ├── version-0.4.0-sidebars.json
│   ├── version-0.3.0-sidebars.json
│   └── version-0.2.0-sidebars.json
└── versions.json          # ["0.4.0", "0.3.0", "0.2.0"]
```

**Configuration in `docusaurus.config.ts`:**

```typescript
docs: {
  lastVersion: '0.4.0', // the newest released snapshot, not the dev docs
  versions: {
    current: {
      label: 'dev',
      path: 'dev',
      banner: 'unreleased',
    },
    '0.4.0': {
      label: '0.4.0 (latest)',
      path: 'latest',
    },
    '0.3.0': {
      label: '0.3.0',
      path: '0.3.0',
    },
  },
},
```

**URL Structure:**
```
/dev/    → Current development docs
/latest/ → Latest stable (0.4.0)
/0.3.0/  → Archived version
/0.2.0/  → Archived version
```

---

## Theme & Styling

### NVIDIA Branding with Custom CSS

Create `src/css/custom.css`:

```css
/**
 * NVIDIA Dynamo Documentation Theme
 * Custom styling to match NVIDIA branding
 */

:root {
  /* NVIDIA Brand Colors */
  --ifm-color-primary: #76b900;
  --ifm-color-primary-dark: #6aa600;
  --ifm-color-primary-darker: #5f9400;
  --ifm-color-primary-darkest: #4d7a00;
  --ifm-color-primary-light: #84c219;
  --ifm-color-primary-lighter: 
#93cb33;
  --ifm-color-primary-lightest: #a8d64d;

  /* Navigation */
  --ifm-navbar-background-color: #1a1a1a;
  --ifm-navbar-link-color: #ffffff;
  --ifm-navbar-link-hover-color: #76b900;

  /* Code blocks */
  --ifm-code-font-size: 95%;
  --docusaurus-highlighted-code-line-bg: rgba(118, 185, 0, 0.1);

  /* Sidebar */
  --ifm-menu-color-active: #76b900;
}

/* Dark mode */
[data-theme='dark'] {
  --ifm-background-color: #1a1a1a;
  --ifm-background-surface-color: #242424;
  --ifm-color-primary: #76b900;
}

/* Navbar styling */
.navbar {
  box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.1);
}

.navbar__title {
  font-weight: 700;
}

/* Footer styling */
.footer {
  background-color: #1a1a1a;
}

.footer__link-item {
  color: #b0b0b0;
}

.footer__link-item:hover {
  color: #76b900;
}

/* Admonitions */
.alert--note {
  --ifm-alert-background-color: rgba(118, 185, 0, 0.1);
  --ifm-alert-border-color: #76b900;
}

/* Version badge */
.badge--secondary {
  background-color: #76b900;
  border-color: #76b900;
}
```

### Logo Assets

Place in `static/img/`:
- `nvidia-logo.svg` - NVIDIA logo for navbar
- `favicon.ico` - Browser favicon

---

## Implementation Steps

### Phase 1: Setup (Day 1-2)

```bash
# 1. Create Docusaurus project (run from the repository's docs/ directory)
cd docs
npx create-docusaurus@latest docusaurus classic --typescript

# 2. Install dependencies
cd docusaurus
npm install @docusaurus/theme-mermaid @docusaurus/plugin-client-redirects

# 3. Apply configuration
# - Update docusaurus.config.ts (see above)
# - Create sidebars.ts
# - Add custom.css

# 4. Test setup
npm run start
# Verify http://localhost:3000 shows default page
```

### Phase 2: Content Migration (Day 3-5)

```bash
# 1. Copy existing markdown content
mkdir -p docs/backends docs/kubernetes docs/guides

# 2. Run conversion scripts
python scripts/migrate_content.py

# 3. 
Convert index.rst to intro.md +pandoc ../index.rst -f rst -t gfm -o docs/intro.md + +# 4. Copy images +cp -r ../images static/img/ + +# 5. Iteratively fix issues +npm run start # Check browser for errors +``` + +### Phase 3: Validation (Day 6-7) + +```bash +# 1. Build production site +npm run build + +# 2. Check for broken links +npm run serve +# Manually test navigation + +# 3. Test version switching +npm run docusaurus docs:version 0.3.0 +npm run start +# Verify version dropdown works +``` + +### Phase 4: Cleanup + +```bash +# After validation, remove old Sphinx files: +# - docs/conf.py +# - docs/Makefile +# - docs/index.rst +# - docs/_extensions/ +# - docs/_includes/ +# - docs/_static/ + +# Keep docusaurus/ as the new docs root (or move up) +``` + +--- + +## Migration Checklist + +### Pre-Migration +- [ ] Audit all existing content (pages, images, downloads) +- [ ] Document all Sphinx extensions in use +- [ ] Create redirect map from old URLs to new +- [ ] Set up Docusaurus development environment + +### Setup +- [ ] Initialize Docusaurus project +- [ ] Configure `docusaurus.config.ts` +- [ ] Create `sidebars.ts` from toctrees +- [ ] Add NVIDIA custom CSS theme +- [ ] Add logo and favicon + +### Content Migration +- [ ] Convert `index.rst` to `intro.md` +- [ ] Migrate all Markdown files +- [ ] Convert RST files to MDX +- [ ] Migrate images to `static/img/` +- [ ] Fix internal links +- [ ] Implement custom components (if needed) + +### Validation +- [ ] Test all pages render correctly +- [ ] Verify all internal links work +- [ ] Test code block syntax highlighting +- [ ] Test Mermaid diagrams +- [ ] Test version dropdown (after creating test version) +- [ ] Test mobile responsiveness +- [ ] Run `npm run build` without errors + +### Post-Migration +- [ ] Remove old Sphinx configuration files +- [ ] Update `.gitignore` for Docusaurus +- [ ] Update CONTRIBUTING.md for docs workflow +- [ ] Create GitHub Actions workflow (when ready for CI) + +--- + +## Quick 
Reference + +### Commands Cheat Sheet + +```bash +# Development +npm run start # Start dev server (http://localhost:3000) +npm run build # Build production site +npm run serve # Serve production build locally +npm run clear # Clear cache + +# Versioning +npm run docusaurus docs:version 0.4.0 # Create version snapshot + +# Debugging +npm run build -- --debug # Build with debug output +``` + +### File Locations + +| What | Where | +|------|-------| +| Main config | `docusaurus.config.ts` | +| Sidebar nav | `sidebars.ts` | +| Current docs | `docs/` | +| Versioned docs | `versioned_docs/` | +| Custom CSS | `src/css/custom.css` | +| Static files | `static/` | +| Build output | `build/` | + +--- + +*Document updated: January 2026* diff --git a/docs/MIGRATION_COMPLETE.md b/docs/MIGRATION_COMPLETE.md new file mode 100644 index 00000000000..1813e7cce80 --- /dev/null +++ b/docs/MIGRATION_COMPLETE.md @@ -0,0 +1,142 @@ +# Docusaurus Migration Summary Report + +**Migration Completed:** Phase 4 Complete (Restructured) +**Date:** January 2026 + +--- + +## Executive Summary + +The NVIDIA Dynamo documentation has been successfully migrated from Sphinx (reStructuredText/Markdown) to Docusaurus 3.9.2 (React/MDX). The Docusaurus project now lives directly in `docs/` (not a subfolder), providing a cleaner structure. The migration preserves all existing content while adding modern features including local search, improved navigation, and native versioning support. 
+

---

## Final Directory Structure

```
docs/
├── docusaurus.config.ts         # Main Docusaurus configuration
├── sidebars.ts                  # Navigation structure
├── package.json                 # Node.js dependencies
├── package-lock.json            # Dependency lock file
├── versions.json                # Version manifest
├── tsconfig.json                # TypeScript config
├── docs/                        # Current version content (for Docusaurus)
├── versioned_docs/              # Created via `docs:version` command
├── versioned_sidebars/          # Created via `docs:version` command
├── src/
│   └── css/custom.css           # NVIDIA theme (#76b900 green)
├── static/img/                  # Static images and assets
├── build/                       # Generated output (gitignored)
├── node_modules/                # Dependencies (gitignored)
├── agents/                      # Source content directories
├── backends/
├── kubernetes/
├── ... (other content dirs)
├── images/                      # Shared images
├── README.md                    # Build instructions
├── DOCUSAURUS_MIGRATION_PLAN.md # Original migration plan
└── MIGRATION_COMPLETE.md        # This summary
```

---

## Quick Reference

### Development Commands

```bash
cd docs

# Start development server (hot reload)
npm run start

# Build production site
npm run build

# Serve production build locally
npm run serve

# Clear cache
npm run clear
```

### Creating New Versions

When releasing a new version of Dynamo:

```bash
cd docs
npm run docusaurus docs:version X.Y.Z
```

Then update `docusaurus.config.ts` to configure the version labels and paths. 
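For example, with a hypothetical `0.4.0` release as the newest snapshot, the relevant options might look like this sketch (the version numbers are placeholders; substitute real Dynamo releases):

```typescript
// Sketch only: '0.4.0' is a placeholder release number.
// `lastVersion` names the newest released snapshot (not `current`),
// so readers land on the stable docs by default.
const docsVersionOptions = {
  lastVersion: '0.4.0',
  versions: {
    current: {label: 'dev (next)', path: 'dev', banner: 'unreleased'},
    '0.4.0': {label: '0.4.0 (latest)', path: '', banner: 'none'},
  },
};
```

These keys go under the `docs` options of the classic preset in `docusaurus.config.ts`; the same shape is shown in `docs/README.md`.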
+ +### URLs + +| Environment | URL | +|-------------|-----| +| Development | `http://localhost:3000` | +| Current docs | `/` | +| Versioned docs | `/X.Y.Z/` (after creating versions) | + +--- + +## Features Added + +| Feature | Implementation | +|---------|----------------| +| **Local Search** | `@easyops-cn/docusaurus-search-local` - Press `Ctrl+K` | +| **Version Dropdown** | Native Docusaurus versioning with navbar dropdown | +| **Mermaid Diagrams** | `@docusaurus/theme-mermaid` plugin | +| **Dark Theme** | Dark mode toggle in navbar | +| **NVIDIA Branding** | Custom CSS with #76b900 green theme | +| **Auto Sidebar** | Generated from directory structure | +| **MDX Support** | React components in Markdown | + +--- + +## Migration Statistics + +| Metric | Value | +|--------|-------| +| Total files migrated | 96 | +| RST files converted | 8 | +| Sphinx files removed | 8 | +| Sphinx directories removed | 6 | + +--- + +## Phase 4 Restructuring + +The Docusaurus project was moved from `docs/docusaurus/` to `docs/` directly: + +- ✅ Moved all Docusaurus config files to `docs/` +- ✅ Updated `editUrl` in docusaurus.config.ts +- ✅ Updated `.gitignore` paths +- ✅ Removed `docs/docusaurus/` subfolder +- ✅ Reset versioning (run `docs:version` to recreate) + +--- + +## Recommendations + +1. **Verify Content:** Review key pages to ensure formatting is correct +2. **Update CI/CD:** Modify pipeline to use `cd docs && npm run build` instead of Sphinx +3. **Link Check:** Run `npm run build` to catch broken internal links +4. **Create Versions:** Run `npm run docusaurus docs:version X.Y.Z` for each release +5. **Search Index:** Local search indexes on build; verify search works after deployment + +--- + +## Rollback (if needed) + +The original Sphinx files are preserved in git history. 
To rollback: + +```bash +git checkout HEAD~N -- docs/conf.py docs/Makefile docs/index.rst docs/_extensions docs/_includes docs/_static +``` + +--- + +**Migration Complete** 🎉 diff --git a/docs/Makefile b/docs/Makefile deleted file mode 100644 index 169b4bcdb96..00000000000 --- a/docs/Makefile +++ /dev/null @@ -1,94 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# Minimal makefile for Sphinx documentation -# - -# You can set these variables from the command line, and also -# from the environment for the first two. -SPHINXOPTS ?= -W -SPHINXBUILD ?= sphinx-build -SOURCEDIR = . -BUILDDIR = build - -##@ General - -# Put it first so that "make" without argument is like "make help". 
-help: ## Display help for all targets - @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) - @echo "" - @echo "Additional documentation targets:" - @awk 'BEGIN {FS = ":.*##"; printf " \033[36m%-20s\033[0m %s\n", "TARGET", "DESCRIPTION"} /^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST) - -clean: ## Clean build artifacts - @rm -fr ${BUILDDIR} - -##@ Helm Documentation - -## Location to install dependencies to -LOCALBIN ?= $(shell pwd)/bin -$(LOCALBIN): - mkdir -p $(LOCALBIN) - -## Tool Versions -HELM_DOCS_VERSION ?= 1.14.2 - -## Tool Binaries -HELM_DOCS ?= $(LOCALBIN)/helm-docs-$(HELM_DOCS_VERSION) - -.PHONY: helm-docs-install -helm-docs-install: $(HELM_DOCS) ## Download helm-docs locally if necessary -$(HELM_DOCS): $(LOCALBIN) - @echo "📥 Downloading helm-docs $(HELM_DOCS_VERSION)..." - @ARCH=$$(uname -m); \ - OS=$$(uname -s | tr '[:upper:]' '[:lower:]'); \ - curl -sSL "https://github.com/norwoodj/helm-docs/releases/download/v$(HELM_DOCS_VERSION)/helm-docs_$(HELM_DOCS_VERSION)_$${OS}_$${ARCH}.tar.gz" | \ - tar xz -C $(LOCALBIN) helm-docs && \ - mv $(LOCALBIN)/helm-docs $(HELM_DOCS) && \ - echo "✅ helm-docs $(HELM_DOCS_VERSION) installed successfully" - -.PHONY: generate-helm-docs -generate-helm-docs: helm-docs-install ## Generate README.md for Helm charts from values.yaml - @echo "📚 Generating Helm chart documentation..." - @cd ../deploy/helm/charts/platform && $(realpath $(HELM_DOCS)) \ - --template-files=README.md.gotmpl \ - --output-file=README.md \ - --sort-values-order=file \ - --chart-to-generate=. \ - --ignore-non-descriptions - @echo "✅ Generated documentation at ../deploy/helm/charts/platform/README.md" - -.PHONY: helm-docs-clean -helm-docs-clean: ## Remove generated helm documentation - @echo "🧹 Cleaning generated helm documentation..." 
- @rm -f ../deploy/helm/charts/platform/README.md - @echo "✅ Cleaned helm documentation" - -.PHONY: generate-crd-docs -generate-crd-docs: ## Generate CRD API reference documentation - @echo "📚 Generating CRD API reference documentation..." - @cd ../deploy/operator && make generate-api-docs - @echo "✅ CRD API reference generated" - -.PHONY: docs-all -docs-all: generate-helm-docs generate-crd-docs html ## Generate all documentation (Sphinx + Helm + CRDs) - -.PHONY: help Makefile clean - - -# Catch-all target: route all unknown targets to Sphinx using the new -# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). -%: - @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/README.md b/docs/README.md index a7b98729324..665815a4d5f 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,81 +2,164 @@ orphan: true --- -# Building Documentation +# NVIDIA Dynamo Documentation -This directory contains the documentation source files for NVIDIA Dynamo. +This directory contains the documentation source files for NVIDIA Dynamo, built with [Docusaurus](https://docusaurus.io/). 
-## Prerequisites +## Quick Start -- Python 3.11 or later -- [uv](https://docs.astral.sh/uv/) package manager +```bash +# Navigate to the docs directory +cd docs -## Build Instructions +# Install dependencies +npm install -### Option 1: Dedicated Docs Environment (Recommended) +# Start development server (with hot reload) +npm run start +# Opens http://localhost:3000 in your browser -This approach builds the docs without requiring the full project dependencies (including `ai-dynamo-runtime`): +# Build for production +npm run build -```bash -# One-time setup: Create docs environment and install dependencies -uv venv .venv-docs -uv pip install --python .venv-docs --group docs +# Serve production build locally +npm run serve +``` + +## Documentation Commands + +| Command | Description | +|---------|-------------| +| `npm run start` | Start dev server with hot reload | +| `npm run build` | Build production static site to `build/` | +| `npm run serve` | Serve production build locally | +| `npm run clear` | Clear Docusaurus cache | +| `npm run docusaurus docs:version X.Y.Z` | Create a new version snapshot | + +## Directory Structure -# Generate documentation -uv run --python .venv-docs --no-project docs/generate_docs.py ``` +docs/ +├── docusaurus.config.ts # Main Docusaurus configuration +├── sidebars.ts # Navigation structure +├── package.json # Dependencies +├── versions.json # Version manifest +├── tsconfig.json # TypeScript config +├── docs/ # Current version content +├── versioned_docs/ # Released versions (created via docs:version) +├── versioned_sidebars/ # Sidebars for each version +├── src/ +│ └── css/custom.css # NVIDIA theme +├── static/img/ # Static images +├── build/ # Generated output (gitignored) +├── agents/ # Content source (linked in docs/) +├── backends/ +├── kubernetes/ +└── ... +``` + +## Versioning -The generated HTML will be available in `docs/build/html/`. +The documentation supports multiple versions matching Dynamo releases. 
-### Option 2: Using Full Development Environment +### Creating a New Version -If you already have the full project dependencies installed (i.e., you're actively developing the codebase), you can use `uv run` directly: +When releasing a new version of Dynamo: ```bash -uv run --group docs docs/generate_docs.py +cd docs +npm run docusaurus docs:version X.Y.Z ``` -This will use your existing project environment and add the docs dependencies. +This will: +1. Copy `docs/` to `versioned_docs/version-X.Y.Z/` +2. Copy `sidebars.ts` to `versioned_sidebars/` +3. Add the version to `versions.json` -### Option 3: Using Docker +### Version Configuration -Build the docs in a Docker container with all dependencies isolated: +After creating versions, update `docusaurus.config.ts` to configure version labels and paths: -```bash -docker build -f container/Dockerfile.docs -t dynamo-docs . +```typescript +docs: { + lastVersion: 'X.Y.Z', // Set the latest stable version + versions: { + current: { label: 'dev (next)', path: 'dev', banner: 'unreleased' }, + 'X.Y.Z': { label: 'X.Y.Z (latest)', path: '', banner: 'none' }, + }, +} ``` -The documentation will be built inside the container. To extract the built docs: +## Writing Documentation -```bash -# Run the container and copy the output -docker run --rm -v $(pwd)/docs/build:/workspace/dynamo/docs/build dynamo-docs +### File Format -# Or create a container to copy files from -docker create --name temp-docs dynamo-docs -docker cp temp-docs:/workspace/dynamo/docs/build ./docs/build -docker rm temp-docs +Documentation is written in Markdown with [MDX](https://mdxjs.com/) support. + +### Frontmatter + +Each document should have frontmatter: + +```markdown +--- +title: "Page Title" +sidebar_position: 1 +--- + +# Page Title + +Content here... ``` -This approach is ideal for CI/CD pipelines or when you want complete isolation from your local environment. 
+
### Admonitions

-## Directory Structure

-- `docs/` - Documentation source files (Markdown and reStructuredText)
-- `docs/conf.py` - Sphinx configuration
-- `docs/_static/` - Static assets (CSS, JS, images)
-- `docs/_extensions/` - Custom Sphinx extensions
-- `docs/build/` - Generated documentation output (not tracked in git)
+Use Docusaurus admonitions for callouts:

+```markdown
+:::note
+This is a note.
+:::
+
+:::tip
+This is a tip.
+:::
+
+:::warning
+This is a warning.
+:::
+
+:::danger
+This is a danger notice.
+:::
+```

-## Redirect Creation
+### Code Blocks
+
+````markdown
+```python title="example.py"
+def hello():
+    print("Hello, Dynamo!")
+```
+````
+
+### Internal Links
+
+Link to other docs using relative paths:
+
+```markdown
+See the [Backend Guide](./backends/vllm/README.md) for more details.
+```

-When moving or renaming files a redirect must be created.
+## Search

-Redirect entries should be added to the `redirects` dictionary in `conf.py`. For detailed information on redirect syntax, see the [sphinx-reredirects usage documentation](https://documatt.com/sphinx-reredirects/usage/#introduction).
+The documentation includes local search powered by `@easyops-cn/docusaurus-search-local`. Use `Ctrl+K` to open search.

-## Dependency Management
+## Theme

-Documentation dependencies are defined in `pyproject.toml` under the `[dependency-groups]` section:
+The site uses the Docusaurus Classic theme with custom NVIDIA branding:
+- Primary color: NVIDIA Green (#76b900)
+- Dark navbar and footer
+- Custom logo and favicon

```toml
[dependency-groups]
diff --git a/docs/_extensions/__init__.py b/docs/_extensions/__init__.py
deleted file mode 100644
index 868a8a06587..00000000000
--- a/docs/_extensions/__init__.py
+++ /dev/null
@@ -1,19 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -Custom Sphinx extensions for Dynamo documentation. -""" - -__version__ = "0.1.0" diff --git a/docs/_extensions/github_alerts.py b/docs/_extensions/github_alerts.py deleted file mode 100644 index fec4d3a43fd..00000000000 --- a/docs/_extensions/github_alerts.py +++ /dev/null @@ -1,255 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -""" -AST-based Sphinx extension to convert GitHub-flavored markdown alerts to MyST admonitions. - -This extension works on the parsed document AST, making it more robust than text preprocessing. -It finds blockquote nodes that match GitHub alert patterns and replaces them with admonition nodes. 
-""" - -import re -from typing import Any, Dict - -from docutils import nodes -from sphinx.application import Sphinx -from sphinx.util import logging - -__version__ = "0.2.0" - -# Set up logger for the extension -logger = logging.getLogger(__name__) - -# Log when the extension module is imported -logger.info(f"GitHub alerts extension v{__version__} imported successfully") - - -class GitHubAlertsTransformer: - """AST transformer for GitHub alerts to MyST admonitions.""" - - # Mapping of GitHub alert types to MyST admonition types - ALERT_MAPPING = { - "note": nodes.note, - "tip": nodes.tip, - "important": nodes.important, - "warning": nodes.warning, - "caution": nodes.caution, - "danger": nodes.danger, - "info": nodes.note, # Map info to note - "hint": nodes.tip, # Map hint to tip - } - - def __init__(self): - # Regex to match GitHub alert syntax in text - self.alert_pattern = re.compile(r"^\[!(.*?)\](?:\s+(.*))?$") - - def is_github_alert_blockquote(self, node: nodes.block_quote) -> bool: - """ - Check if a blockquote node represents a GitHub alert. - - Returns: - bool: True if this is a GitHub alert blockquote, False otherwise - """ - if not isinstance(node, nodes.block_quote): - return False - - # GitHub alerts start with a paragraph containing [!TYPE] - if not node.children or not isinstance(node.children[0], nodes.paragraph): - return False - - first_para = node.children[0] - if not first_para.children or not isinstance( - first_para.children[0], nodes.Text - ): - return False - - first_text = first_para.children[0].astext() - match = self.alert_pattern.match(first_text.strip()) - - return match is not None - - def create_admonition_node(self, blockquote: nodes.block_quote) -> nodes.admonition: - """ - Create a docutils admonition node from a GitHub alert blockquote. 
- - Args: - blockquote: The blockquote node containing the GitHub alert - - Returns: - The created admonition node - """ - # Extract alert information from the blockquote - first_para = blockquote.children[0] - first_text = first_para.children[0].astext() - match = self.alert_pattern.match(first_text.strip()) - - if not match: - raise ValueError("Not a valid GitHub alert blockquote") - - alert_type = match.group(1).lower().strip() - title = match.group(2).strip() if match.group(2) else None - - # Extract content nodes (everything after the first paragraph) - content_nodes = [] - - # If there's a title, check if there's more content in the first paragraph - if title and len(first_para.children) > 1: - # Create new paragraph with remaining content - remaining_para = nodes.paragraph() - # Properly detach and add child nodes - for child in first_para.children[1:]: - child.parent = None # Detach from current parent - remaining_para.append(child) - content_nodes.append(remaining_para) - elif not title and len(first_para.children) > 1: - # No title, but there's content after [!TYPE] - treat as content - content_para = nodes.paragraph() - # Properly detach and add child nodes - for child in first_para.children[1:]: - child.parent = None # Detach from current parent - content_para.append(child) - content_nodes.append(content_para) - - # Add any additional paragraphs/content - for child in blockquote.children[1:]: - child.parent = None # Detach from current parent - content_nodes.append(child) - - # Map to MyST admonition type - admonition_class = self.ALERT_MAPPING.get(alert_type, nodes.note) - admonition = admonition_class() - - # Add title if present - if title: - title_node = nodes.title(title, title) - admonition.append(title_node) - - # Add content nodes - for content_node in content_nodes: - content_node.parent = None # Ensure node is properly detached - admonition.append(content_node) - - return admonition - - def transform_document(self, document: nodes.document) -> 
None: - """Transform all GitHub alert blockquotes in the document.""" - - # Find all blockquote nodes - blockquotes = document.traverse(nodes.block_quote) - - for blockquote in blockquotes: - if self.is_github_alert_blockquote(blockquote): - # Create admonition node from blockquote - admonition = self.create_admonition_node(blockquote) - - # Replace blockquote with admonition - blockquote.parent.replace(blockquote, admonition) - - -def transform_github_alerts(app: Sphinx, doctree: nodes.document, docname: str) -> None: - """ - Transform GitHub alerts in the document tree. - - This function is connected to Sphinx's 'doctree-resolved' event. - - Args: - app: The Sphinx application instance - doctree: The document tree to transform - docname: The document name being processed - """ - # Check if this is a markdown file by looking at the source file - # Sphinx strips extensions from docnames, so we need to check the source - env = app.env - source_file = env.doc2path(docname, base=None) - is_markdown = source_file and source_file.suffix in (".md", ".markdown") - - if not is_markdown: - return - - # Check if the extension is enabled - if not app.config.github_alerts_enabled: - return - - logger.debug(f"Processing GitHub alerts in {docname}") - - try: - # Get the transformer instance - transformer = getattr(app, "_github_alerts_transformer", None) - if transformer is None: - transformer = GitHubAlertsTransformer() - app._github_alerts_transformer = transformer - - # Count blockquotes before transformation - initial_blockquotes = list(doctree.traverse(nodes.block_quote)) - initial_admonitions = list(doctree.traverse(nodes.Admonition)) - alert_blockquotes = [ - bq - for bq in initial_blockquotes - if transformer.is_github_alert_blockquote(bq) - ] - - if alert_blockquotes: - logger.info( - f"GitHub alerts: Converting {len(alert_blockquotes)} alert(s) in {docname}" - ) - - # Transform the document - transformer.transform_document(doctree) - - # Count remaining blockquotes and 
new admonitions for verification - remaining_blockquotes = list(doctree.traverse(nodes.block_quote)) - remaining_admonitions = list(doctree.traverse(nodes.Admonition)) - - logger.debug( - f"GitHub alerts: {docname} - {len(initial_blockquotes)} → {len(remaining_blockquotes)} blockquotes, {len(remaining_admonitions) - len(initial_admonitions)} admonitions created" - ) - else: - logger.debug(f"GitHub alerts: No alerts found in {docname}") - except Exception as e: - logger.error(f"GitHub alerts: Error processing {docname}: {e}") - raise - - -def setup(app: Sphinx) -> Dict[str, Any]: - """ - Setup function for the Sphinx extension. - - Args: - app: The Sphinx application instance - - Returns: - Extension metadata - """ - logger.info("GitHub alerts extension setup() called") - - try: - # Connect our transformer to the doctree-resolved event - # This happens after parsing but before writing - app.connect("doctree-resolved", transform_github_alerts) - logger.info("GitHub alerts extension connected to 'doctree-resolved' event") - - # Add configuration values - app.add_config_value("github_alerts_enabled", True, "env") - - logger.info("GitHub alerts extension setup completed") - - return { - "version": __version__, - "parallel_read_safe": True, - "parallel_write_safe": True, - } - except Exception as e: - logger.error(f"GitHub alerts extension setup failed: {e}") - raise diff --git a/docs/_includes/dive_in_examples.rst b/docs/_includes/dive_in_examples.rst deleted file mode 100644 index 261e896d77d..00000000000 --- a/docs/_includes/dive_in_examples.rst +++ /dev/null @@ -1,32 +0,0 @@ -The examples below assume you build the latest image yourself from source. If using a prebuilt image follow the examples from the corresponding branch. - -.. grid:: 1 2 2 2 - :gutter: 3 - :margin: 0 - :padding: 3 4 0 0 - - .. 
grid-item-card:: :doc:`Hello World <../examples/runtime/hello_world/README>` - :link: ../examples/runtime/hello_world/README - :link-type: doc - - Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph - - .. grid-item-card:: :doc:`vLLM <../backends/vllm/README>` - :link: ../backends/vllm/README - :link-type: doc - - Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with VLLM. - - .. grid-item-card:: :doc:`SGLang <../backends/sglang/README>` - :link: ../backends/sglang/README - :link-type: doc - - Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang. - - .. grid-item-card:: :doc:`TensorRT-LLM <../backends/trtllm/README>` - :link: ../backends/trtllm/README - :link-type: doc - - Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM. - - diff --git a/docs/_includes/install.rst b/docs/_includes/install.rst deleted file mode 100644 index 3403c6f827b..00000000000 --- a/docs/_includes/install.rst +++ /dev/null @@ -1,44 +0,0 @@ -Pip (PyPI) ----------- - -Install a pre-built wheel from PyPI. - -.. code-block:: bash - - # Create a virtual environment and activate it - uv venv venv - source venv/bin/activate - - # Install Dynamo from PyPI (choose one backend extra) - uv pip install "ai-dynamo[sglang]==my-tag" # or [vllm], [trtllm] - - -Pip from source ---------------- - -Install directly from a local checkout for development. - -.. code-block:: bash - - # Clone the repository - git clone https://github.com/ai-dynamo/dynamo.git - cd dynamo - - # Create a virtual environment and activate it - uv venv venv - source venv/bin/activate - uv pip install ".[sglang]" # or [vllm], [trtllm] - - -Docker ------- - -Pull and run prebuilt images from NVIDIA NGC (`nvcr.io`). - -.. 
code-block:: bash - - # Run a container (mount your workspace if needed) - docker run --rm -it \ - --gpus all \ - --network host \ - nvcr.io/nvidia/ai-dynamo/sglang-runtime:my-tag # or vllm, tensorrtllm diff --git a/docs/_includes/quick_start_local.rst b/docs/_includes/quick_start_local.rst deleted file mode 100644 index 05b6e63b5f4..00000000000 --- a/docs/_includes/quick_start_local.rst +++ /dev/null @@ -1,45 +0,0 @@ -Get started with Dynamo locally in just a few commands: - -**1. Install Dynamo** - -.. code-block:: bash - - # Install uv (recommended Python package manager) - curl -LsSf https://astral.sh/uv/install.sh | sh - - # Create virtual environment and install Dynamo - uv venv venv - source venv/bin/activate - # Use prerelease flag to install RC versions of flashinfer and/or other dependencies - uv pip install --prerelease=allow "ai-dynamo[sglang]" # or [vllm], [trtllm] - -**2. Start etcd/NATS** - -.. code-block:: bash - - # Fetch and start etcd and NATS using Docker Compose - VERSION=$(uv pip show ai-dynamo | grep Version | cut -d' ' -f2) - curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/refs/tags/v${VERSION}/deploy/docker-compose.yml - docker compose -f docker-compose.yml up -d - -**3. Run Dynamo** - -.. code-block:: bash - - # Start the OpenAI compatible frontend (default port is 8000) - python -m dynamo.frontend - - # In another terminal, start an SGLang worker - python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B - -**4. Test your deployment** - -.. code-block:: bash - - curl localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model": "Qwen/Qwen3-0.6B", - "messages": [{"role": "user", "content": "Hello!"}], - "max_tokens": 50}' - - diff --git a/docs/_sections/backends.rst b/docs/_sections/backends.rst deleted file mode 100644 index e77774f4105..00000000000 --- a/docs/_sections/backends.rst +++ /dev/null @@ -1,9 +0,0 @@ -Backends -======== - -.. 
toctree:: - :maxdepth: 1 - - vLLM <../backends/vllm/README> - SGLang <../backends/sglang/README> - TensorRT-LLM <../backends/trtllm/README> \ No newline at end of file diff --git a/docs/_sections/examples.rst b/docs/_sections/examples.rst deleted file mode 100644 index 30258a46bee..00000000000 --- a/docs/_sections/examples.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. - Quickstart Page (left sidebar target) -.. - -Examples -======== - -.. include:: ../_includes/dive_in_examples.rst \ No newline at end of file diff --git a/docs/_sections/frontends.rst b/docs/_sections/frontends.rst deleted file mode 100644 index b5e4e3e5da8..00000000000 --- a/docs/_sections/frontends.rst +++ /dev/null @@ -1,7 +0,0 @@ -Frontends -========= - -.. toctree:: - :maxdepth: 1 - - KServe <../frontends/kserve.md> \ No newline at end of file diff --git a/docs/_sections/installation.rst b/docs/_sections/installation.rst deleted file mode 100644 index b9543fb5586..00000000000 --- a/docs/_sections/installation.rst +++ /dev/null @@ -1,10 +0,0 @@ -.. - Installation Page (left sidebar target) -.. - -Installation -============ - -.. include:: ../_includes/install.rst - - diff --git a/docs/_sections/k8s_deployment.rst b/docs/_sections/k8s_deployment.rst deleted file mode 100644 index 087f8fd08df..00000000000 --- a/docs/_sections/k8s_deployment.rst +++ /dev/null @@ -1,14 +0,0 @@ -Deployment Guide -================ - -.. 
toctree:: - :hidden: - - Kubernetes Quickstart <../kubernetes/README> - Detailed Installation Guide <../kubernetes/installation_guide> - Dynamo Operator <../kubernetes/dynamo_operator> - Service Discovery <../kubernetes/service_discovery> - Webhooks <../kubernetes/webhooks> - Minikube Setup <../kubernetes/deployment/minikube> - Managing Models with DynamoModel <../kubernetes/deployment/dynamomodel-guide> - Autoscaling <../kubernetes/autoscaling> diff --git a/docs/_sections/k8s_multinode.rst b/docs/_sections/k8s_multinode.rst deleted file mode 100644 index 3a1c7cff2c4..00000000000 --- a/docs/_sections/k8s_multinode.rst +++ /dev/null @@ -1,8 +0,0 @@ -Multinode -========= - -.. toctree:: - :hidden: - - Multinode Deployments <../kubernetes/deployment/multinode-deployment> - Grove <../kubernetes/grove> diff --git a/docs/_sections/k8s_observability.rst b/docs/_sections/k8s_observability.rst deleted file mode 100644 index af7c6ff66d9..00000000000 --- a/docs/_sections/k8s_observability.rst +++ /dev/null @@ -1,8 +0,0 @@ -Observability -============= - -.. toctree:: - :hidden: - - Metrics <../kubernetes/observability/metrics> - Logging <../kubernetes/observability/logging> diff --git a/docs/_sections/observability.rst b/docs/_sections/observability.rst deleted file mode 100644 index c1b108c9752..00000000000 --- a/docs/_sections/observability.rst +++ /dev/null @@ -1,13 +0,0 @@ -Observability -============= - -.. 
toctree:: - :hidden: - - Overview <../observability/README> - Prometheus + Grafana Setup <../observability/prometheus-grafana> - Metrics <../observability/metrics> - Metrics Developer Guide <../observability/metrics-developer-guide> - Health Checks <../observability/health-checks> - Tracing <../observability/tracing> - Logging <../observability/logging> diff --git a/docs/_static/custom.js b/docs/_static/custom.js deleted file mode 100644 index 03900df2ae0..00000000000 --- a/docs/_static/custom.js +++ /dev/null @@ -1,19 +0,0 @@ -// Add RunLLM widget -document.addEventListener("DOMContentLoaded", function () { - var script = document.createElement("script"); - script.type = "module"; - script.id = "runllm-widget-script" - - script.src = "https://widget.runllm.com"; - - script.setAttribute("version", "stable"); - script.setAttribute("runllm-keyboard-shortcut", "Mod+j"); // cmd-j or ctrl-j to open the widget. - script.setAttribute("runllm-name", "dynamo"); - script.setAttribute("runllm-position", "BOTTOM_RIGHT"); - script.setAttribute("runllm-position-y", "120px"); - script.setAttribute("runllm-position-x", "20px"); - script.setAttribute("runllm-assistant-id", "758"); - - script.async = true; - document.head.appendChild(script); - }); diff --git a/docs/_static/switcher.json b/docs/_static/switcher.json deleted file mode 100644 index 3b1e3994d1a..00000000000 --- a/docs/_static/switcher.json +++ /dev/null @@ -1,12 +0,0 @@ -[ - { - "name": "0.1.0 (current release)", - "version": "0.1.0", - "url": "https://docs.nvidia.com/dynamo/latest/index.html" - }, - { - "name": "older releases", - "version": "archives", - "url": "https://docs.nvidia.com/dynamo/archives/" - } -] \ No newline at end of file diff --git a/docs/conf.py b/docs/conf.py deleted file mode 100644 index 33b8e76af2b..00000000000 --- a/docs/conf.py +++ /dev/null @@ -1,170 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2023-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 - -# Configuration file for the Sphinx documentation builder. -import os -import sys - -# -- Project information ----------------------------------------------------- -project = "NVIDIA Dynamo" -copyright = "2024-2026, NVIDIA CORPORATION & AFFILIATES" -author = "NVIDIA" - -# Version is set via DYNAMO_DOCS_VERSION env var during build (e.g., "0.3.0") -# Defaults to "dev" for main branch and PR builds -release = os.environ.get("DYNAMO_DOCS_VERSION", "dev") - -# -- General configuration --------------------------------------------------- - -# Standard extensions -extensions = [ - "ablog", - "myst_parser", - "sphinx_copybutton", - "sphinx_design", - "sphinx_prompt", - # "sphinxcontrib.bibtex", - "sphinx_tabs.tabs", - "sphinx_sitemap", - "sphinx.ext.autodoc", - "sphinx.ext.autosummary", - "sphinx.ext.mathjax", - "sphinx.ext.napoleon", - "sphinx.ext.ifconfig", - "sphinx.ext.extlinks", - "sphinxcontrib.mermaid", - "sphinx_reredirects", -] - -# Redirects configuration -redirects = { - # PR #3802 - "guides/tool-calling": "../agents/tool-calling.html", # key format corrected - "architecture/architecture": "../design_docs/architecture.html", - "architecture/disagg_serving": "../design_docs/disagg_serving.html", - "architecture/distributed_runtime": "../design_docs/distributed_runtime.html", - "architecture/dynamo_flow": "../design_docs/dynamo_flow.html", - "architecture/request_cancellation": "../fault_tolerance/request_cancellation.html", - "architecture/request_migration": "../fault_tolerance/request_migration.html", - "kubernetes/create_deployment": "../kubernetes/deployment/create_deployment.html", - "kubernetes/minikube": "../kubernetes/deployment/minikube.html", - "kubernetes/multinode-deployment": "../kubernetes/deployment/multinode-deployment.html", - "kubernetes/logging": "../kubernetes/observability/logging.html", - "kubernetes/metrics": "../kubernetes/observability/metrics.html", - "architecture/kv_cache_routing": 
"../router/kv_cache_routing.html", - # PR #3658 - "API/nixl_connect/README": "../../api/nixl_connect/README.html", - "API/nixl_connect/connector": "../../api/nixl_connect/connector.html", - "API/nixl_connect/descriptor": "../../api/nixl_connect/descriptor.html", - "API/nixl_connect/device": "../../api/nixl_connect/device.html", - "API/nixl_connect/device_kind": "../../api/nixl_connect/device_kind.html", - "API/nixl_connect/operation_status": "../../api/nixl_connect/operation_status.html", - "API/nixl_connect/rdma_metadata": "../../api/nixl_connect/rdma_metadata.html", - "API/nixl_connect/read_operation": "../../api/nixl_connect/read_operation.html", - "API/nixl_connect/readable_operation": "../../api/nixl_connect/readable_operation.html", - "API/nixl_connect/writable_operation": "../../api/nixl_connect/writable_operation.html", - "API/nixl_connect/write_operation": "../../api/nixl_connect/write_operation.html", - "guides/backend": "../development/backend-guide.html", - "runtime/README": "../development/runtime-guide.html", - "guides/tool_calling": "../agents/tool-calling.html", - "architecture/kvbm_architecture": "../kvbm/kvbm_architecture.html", - "architecture/kvbm_components": "../kvbm/kvbm_components.html", - "architecture/kvbm_intro": "../kvbm/kvbm_intro.html", - "architecture/kvbm_motivation": "../kvbm/kvbm_motivation.html", - "architecture/kvbm_reading": "../kvbm/kvbm_reading.html", - "guides/run_kvbm_in_trtllm": "../kvbm/trtllm-setup.html", - "guides/run_kvbm_in_vllm": "../kvbm/vllm-setup.html", - "guides/health_check": "../observability/health-checks.html", - "guides/logging": "../observability/logging.html", - "guides/metrics": "../observability/metrics.html", - "guides/disagg_perf_tuning": "../performance/tuning.html", - "architecture/load_planner": "../planner/load_planner.html", - "architecture/planner_intro": "../planner/planner_intro.html", - "architecture/sla_planner": "../planner/sla_planner.html", - "kubernetes/sla_planner_quickstart": 
"../planner/sla_planner_quickstart.html", - "guides/dynamo_run": "../reference/cli.html", - "dynamo_glossary": "../reference/glossary.html", - "support_matrix": "../reference/support-matrix.html", - "components/router/README": "../router/README.html", - # Multimodal documentation consolidation - "backends/vllm/multimodal": "../../multimodal/vllm.html", - "backends/vllm/multimodal_vllm_guide": "../../multimodal/vllm.html", - "backends/trtllm/multimodal_support": "../../multimodal/trtllm.html", - "backends/trtllm/multimodal_trtllm_guide": "../../multimodal/trtllm.html", - "backends/trtllm/multinode/multinode-multimodal-example": "../../../multimodal/trtllm.html", - "backends/sglang/multimodal_epd": "../../multimodal/sglang.html", - "backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html", - "multimodal/multimodal_intro": "index.html", -} - -# Custom extensions -sys.path.insert(0, os.path.abspath("_extensions")) -extensions.append("github_alerts") - -# Handle Mermaid diagrams as code blocks (not directives) to avoid warnings -myst_fence_as_directive = ["mermaid"] # Uncomment if sphinxcontrib-mermaid is installed - -# File extensions (myst_parser automatically handles .md files) -source_suffix = [".rst", ".md"] - -# MyST parser configuration -myst_enable_extensions = [ - "colon_fence", # ::: code blocks - "deflist", # Definition lists - "html_image", # HTML images - "tasklist", # Task lists -] - -# Templates path -templates_path = ["_templates"] - -# List of patterns to ignore when looking for source files -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "build"] - -# -- Options for HTML output ------------------------------------------------- -html_theme = "nvidia_sphinx_theme" -html_static_path = ["_static"] -html_extra_path = ["project.json"] -html_theme_options = { - "collapse_navigation": False, - "icon_links": [ - { - "name": "GitHub", - "url": "https://github.com/ai-dynamo/dynamo", - "icon": "fa-brands fa-github", - } - ], - "switcher": 
{ - # Use single shared URL so all versions see the same switcher list - # When a new version is added, all old docs automatically see it - "json_url": "https://docs.nvidia.com/dynamo/versions1.json", - "version_match": release, - }, - "extra_head": { - """ - - """ - }, - "extra_footer": { - """ - - """ - }, - "navbar_start": ["navbar-logo"], - "primary_sidebar_end": [], -} - -# Document settings -master_doc = "index" -html_title = f"{project} Documentation" -html_short_title = project -html_baseurl = "https://docs.nvidia.com/dynamo/latest/" - -# Suppress warnings for external links and missing references -suppress_warnings = [ - "myst.xref_missing", # Missing cross-references of relative links outside docs folder -] - -# Additional MyST configuration -myst_heading_anchors = 7 # Generate anchors for headers -myst_substitutions = {} # Custom substitutions diff --git a/docs/agents/tool-calling.md b/docs/docs/agents/tool-calling.md similarity index 99% rename from docs/agents/tool-calling.md rename to docs/docs/agents/tool-calling.md index dd0d116215d..1aee142ec8f 100644 --- a/docs/agents/tool-calling.md +++ b/docs/docs/agents/tool-calling.md @@ -1,3 +1,7 @@ +--- +title: "Tool Calling with Dynamo" +--- + # Tool Calling with Dynamo You can connect Dynamo to external tools and services using function calling (also known as tool calling). By providing a list of available functions, Dynamo can choose diff --git a/docs/api/nixl_connect/README.md b/docs/docs/api/nixl_connect/README.md similarity index 92% rename from docs/api/nixl_connect/README.md rename to docs/docs/api/nixl_connect/README.md index 2a65fa76951..1953da2d6e8 100644 --- a/docs/api/nixl_connect/README.md +++ b/docs/docs/api/nixl_connect/README.md @@ -1,3 +1,7 @@ +--- +title: "Dynamo NIXL Connect" +--- + + +# Dynamo Runtime + +

A Datacenter Scale Distributed Inference Serving Framework

+ +[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) + +Rust implementation of the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads. + +## Prerequisites + +### Install Rust and Cargo using [rustup](https://rustup.rs/) + +```bash +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +``` + +### Build + +```bash +cargo build +cargo test +``` + +### Start Dependencies + +#### Docker Compose + +The simplest way to deploy the prerequisite services is with +[docker-compose](https://docs.docker.com/compose/install/linux/), +using [deploy/docker-compose.yml](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml). + +```bash +# At the root of the repository: +docker compose -f deploy/docker-compose.yml up -d +``` + +This deploys a [NATS.io](https://nats.io/) server and an [etcd](https://etcd.io/) server, +used for communication between components and for component discovery at runtime. + + +#### Local (alternative) + +To deploy the prerequisite services locally instead of using `docker-compose`, +you can launch each one manually: + +- [NATS.io](https://docs.nats.io/running-a-nats-service/introduction/installation) server with [JetStream](https://docs.nats.io/nats-concepts/jetstream) enabled + - example: `nats-server -js --trace` +- [etcd](https://etcd.io) server + - follow the [etcd installation](https://etcd.io/docs/v3.5/install/) instructions to start an `etcd` server locally + + +### Run Examples + +When developing or running examples, any process or user that shares your core services (`etcd` and `nats.io`) will +be operating within your distributed runtime. + +The current examples use a hard-coded `namespace`; `namespace` collisions will be addressed later. + +All examples require the `etcd` and `nats.io` prerequisites to be running and available.
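Before launching an example, you can sanity-check that both services are reachable. The snippet below is a minimal sketch, assuming the default etcd client port (2379) and that NATS was started with its HTTP monitoring endpoint enabled on port 8222 (e.g. `nats-server -js -m 8222`); adjust the ports to match your setup.

```shell
# Verify etcd is up (default client port 2379).
# A healthy server responds with a small JSON health report.
curl -s http://localhost:2379/health

# Verify NATS is up (requires monitoring enabled, e.g. `nats-server -js -m 8222`).
curl -s http://localhost:8222/healthz
```

If either command fails to connect, revisit the Docker Compose or local setup steps above before running the examples.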
+ +#### Rust `hello_world` + +Open two terminals. In the first, start the server: + +```bash +cd examples/hello_world +cargo run --bin server +``` + +In the second terminal, run the client: + +```bash +cd examples/hello_world +cargo run --bin client +``` + +The client should print output similar to: +``` + Finished `dev` profile [unoptimized + debuginfo] target(s) in 6.25s + Running `target/debug/client` +Annotated { data: Some("h"), id: None, event: None, comment: None } +Annotated { data: Some("e"), id: None, event: None, comment: None } +Annotated { data: Some("l"), id: None, event: None, comment: None } +Annotated { data: Some("l"), id: None, event: None, comment: None } +Annotated { data: Some("o"), id: None, event: None, comment: None } +Annotated { data: Some(" "), id: None, event: None, comment: None } +Annotated { data: Some("w"), id: None, event: None, comment: None } +Annotated { data: Some("o"), id: None, event: None, comment: None } +Annotated { data: Some("r"), id: None, event: None, comment: None } +Annotated { data: Some("l"), id: None, event: None, comment: None } +Annotated { data: Some("d"), id: None, event: None, comment: None } +``` + +#### Python + +See the [README.md](https://github.com/ai-dynamo/dynamo/tree/main/lib/runtime/lib/bindings/python/README.md) for details. + +The Python and Rust `hello_world` client and server examples are interchangeable, +so you can start the Python `server.py` and talk to it from the Rust `client`.
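The cross-language pairing can be sketched as follows. The path to `server.py` below is an assumption for illustration; consult the Python bindings README linked above for its actual location.

```shell
# Terminal 1: start the Python hello_world server
# (path is an assumption -- see the Python bindings README for the real location)
python server.py

# Terminal 2: drive the Python server with the Rust client
cd examples/hello_world
cargo run --bin client
```

Because both implementations register against the same `etcd`/`nats.io` services, the Rust client discovers the Python server the same way it discovers the Rust one.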