**README.md** (4 additions, 4 deletions)

```diff
@@ -30,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
 
 ## Latest News
 
-- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
+- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./docs/backends/trtllm/gpt-oss.md)
 
 ## The Era of Multi-GPU, Multi-Node
 
@@ -65,9 +65,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
 To learn more about each framework and their capabilities, check out each framework's README!
 Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.
```
**components/README.md** (3 additions, 3 deletions)

```diff
@@ -23,9 +23,9 @@ This directory contains the core components that make up the Dynamo inference fr
 
 Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and TensorRT-LLM), each with their own deployment configurations and capabilities:
 
-- **[vLLM](backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
-- **[SGLang](backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
-- **[TensorRT-LLM](backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration
+- **[vLLM](/docs/backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
+- **[SGLang](/docs/backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
+- **[TensorRT-LLM](/docs/backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration
 
 Each engine provides launch scripts for different deployment patterns in their respective `/launch` & `/deploy` directories.
```
**components/backends/trtllm/deploy/README.md** (4 additions, 4 deletions)

```diff
@@ -232,7 +232,7 @@ envs:
 
 ## Testing the Deployment
 
-Send a test request to verify your deployment. See the [client section](../../../../components/backends/vllm/README.md#client) for detailed instructions.
+Send a test request to verify your deployment. See the [client section](../../../../docs/backends/vllm/README.md#client) for detailed instructions.
 
 **Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
 
@@ -254,7 +254,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving
 - **UCX** (default): Standard method for KV cache transfer
 - **NIXL** (experimental): Alternative transfer method
 
-For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-transfer.md).
+For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/kv-cache-transfer.md).
 
 ## Request Migration
 
@@ -282,8 +282,8 @@ Configure the `model` name and `host` based on your deployment.
```
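As a companion to the "Testing the Deployment" step above, here is a minimal sketch of such a test request. It assumes the frontend launched by `python3 -m dynamo.frontend` serves an OpenAI-compatible `/v1/chat/completions` endpoint on `localhost:8000`; the host, port, and model name are assumptions to adjust for your deployment, not values taken from this diff.

```python
# Minimal test-request sketch (not the documented client); adjust host,
# port, and model name for your deployment.
import json
import urllib.request

payload = {
    "model": "openai/gpt-oss-120b",  # assumed model name
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32,
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed frontend address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# A well-formed completion response confirms the frontend can reach a worker.
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```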
**components/backends/trtllm/performance_sweeps/README.md** (1 addition, 1 deletion)

```diff
@@ -41,7 +41,7 @@ Please note that:
 3. `post_process.py` - Scans the genai-perf results to produce a json with entries for each config point.
 4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
 
-For finer-grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 Slurm, please refer to [multinode-examples.md](../multinode/multinode-examples.md). This guide shares similar assumptions with the multinode examples guide.
+For finer-grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 Slurm, please refer to [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide shares similar assumptions with the multinode examples guide.
```
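To make the pareto line that `plot_performance_comparison.py` draws concrete, here is a small sketch of the underlying idea. This is not the repo's script: the input file name and the JSON field names (`throughput_per_gpu`, `tpot_ms`) are hypothetical placeholders for whatever `post_process.py` actually emits.

```python
# Sketch of a pareto line over sweep results; the file name and field
# names are hypothetical placeholders, not the real post_process.py schema.
import json

import matplotlib.pyplot as plt


def pareto_frontier(points):
    """Keep configs no other config beats on both throughput and latency."""
    frontier = []
    for tput, lat in sorted(points, key=lambda p: (-p[0], p[1])):
        if not frontier or lat < frontier[-1][1]:
            frontier.append((tput, lat))
    return frontier


with open("sweep_results.json") as f:  # assumed output of post_process.py
    results = json.load(f)

points = [(r["throughput_per_gpu"], r["tpot_ms"]) for r in results]
front = pareto_frontier(points)

plt.scatter(*zip(*points), alpha=0.4, label="all configs")
plt.plot(*zip(*front), "ro-", label="pareto line")
plt.xlabel("throughput per GPU (tokens/s)")
plt.ylabel("time per output token (ms)")
plt.legend()
plt.savefig("pareto.png")
```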
````diff
@@ -229,7 +229,7 @@ cd $DYNAMO_HOME/components/backends/sglang
 ./launch/disagg_dp_attn.sh
 ```
 
-When using MoE models, you can also use our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, sets the environment variable that controls the expert distribution recording directory, and sets the recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md).
+When using MoE models, you can also use our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, sets the environment variable that controls the expert distribution recording directory, and sets the recording mode to `stat`. You can learn more about expert parallelism load balancing [here](expert-distribution-eplb.md).
 
 ### Testing the Deployment
 
````
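As an illustration of what recording expert distribution data looks like in practice, here is a rough sketch of driving such endpoints by hand. The endpoint paths and port are assumptions about SGLang's native expert-distribution API (the passage above only confirms that such endpoints exist), so verify them against your SGLang version.

```python
# Hedged sketch: poke the SGLang HTTP server's expert-distribution
# recording endpoints. Paths and port are assumptions, not confirmed
# by this diff; check your SGLang version's API.
import urllib.request

SGLANG_HTTP = "http://localhost:30000"  # assumed SGLang server address


def post(path: str) -> bytes:
    req = urllib.request.Request(f"{SGLANG_HTTP}{path}", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.read()


post("/start_expert_distribution_record")  # begin recording (mode `stat`)
# ... drive some MoE inference traffic through the deployment here ...
post("/stop_expert_distribution_record")
post("/dump_expert_distribution_record")   # write stats to the recording directory
```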
```diff
@@ -266,24 +266,24 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
 
 ### Run a multi-node sized model
-- **[Run a multi-node model](docs/multinode-examples.md)**
+- **[Run a multi-node model](multinode-examples.md)**
 
 ### Large scale P/D disaggregation with WideEP
-- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
-- **[Run DeepSeek-R1-FP8 on GB200s](docs/dsr1-wideep-gb200.md)**
+- **[Run DeepSeek-R1 on 104+ H100s](dsr1-wideep-h100.md)**
+- **[Run DeepSeek-R1-FP8 on GB200s](dsr1-wideep-gb200.md)**
```