Commit da38e96

tanmayv25 and nnshah1 authored
feat: TRT-LLM disaggregated serving using UCX (ai-dynamo#562)
Signed-off-by: Tanmay Verma <[email protected]>
Signed-off-by: Tanmay Verma <[email protected]>
Co-authored-by: Neelay Shah <[email protected]>
1 parent 538b463 commit da38e96

27 files changed (+803 -310 lines)

container/Dockerfile.tensorrt_llm

+6
@@ -201,6 +201,12 @@ RUN pip install dist/ai_dynamo_runtime*cp312*.whl && \
 ENV DYNAMO_KV_CAPI_PATH="/opt/dynamo/bindings/lib/libdynamo_llm_capi.so"
 ENV DYNAMO_HOME=/workspace
 
+
+# Copy launch banner
+RUN --mount=type=bind,source=./container/launch_message.txt,target=/workspace/launch_message.txt \
+sed '/^#\s/d' /workspace/launch_message.txt > ~/.launch_screen && \
+echo "cat ~/.launch_screen" >> ~/.bashrc
+
 # FIXME: Copy more specific folders in for dev/debug after directory restructure
 COPY . /workspace
 
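
Note on the banner step above: the `sed '/^#\s/d'` filter deletes every line that begins with `#` followed by a whitespace character, presumably to strip a comment header from `launch_message.txt` before the banner is shown at login. The sketch below illustrates that behavior only. The sample file contents are hypothetical, and `\s` is a GNU sed extension (`[[:space:]]` is the POSIX spelling).

```bash
# Hypothetical banner file, created only for this demonstration.
banner="$(mktemp)"
printf '%s\n' '# SPDX header comment' '#!not-a-comment' 'Welcome to the container' > "$banner"

# Same filter as the Dockerfile step: drop lines that start with "#" plus whitespace.
filtered="$(sed '/^#\s/d' "$banner")"

# Portable spelling of the same filter, for sed implementations without \s.
portable="$(sed '/^#[[:space:]]/d' "$banner")"

printf '%s\n' "$filtered"
rm -f "$banner"
```

Only the first sample line is removed. A line such as `#!not-a-comment` is kept because the character after `#` is not whitespace.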

docs/guides/dynamo_run.md

+1 -1
@@ -342,7 +342,7 @@ See instructions [here](/examples/tensorrt_llm/README.md#run-container) to run t
 
 Execute the following to load the TensorRT-LLM model specified in the configuration.
 ```
-dynamo run out=pystr:/workspace/examples/tensorrt_llm/engines/agg_engine.py -- --engine_args /workspace/examples/tensorrt_llm/configs/llm_api_config.yaml
+dynamo run out=pystr:/workspace/examples/tensorrt_llm/engines/trtllm_engine.py -- --engine_args /workspace/examples/tensorrt_llm/configs/llm_api_config.yaml
 ```
 
 #### Dynamo does the pre-processing

examples/tensorrt_llm/README.md

+42 -9
@@ -25,6 +25,14 @@ This directory contains examples and reference implementations for deploying Lar
 See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture.
 Note that this TensorRT-LLM version does not support all the options yet.
 
+Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving.
+
+## Getting Started
+
+1. Choose a deployment architecture based on your requirements
+2. Configure the components as needed
+3. Deploy using the provided scripts
+
 ### Prerequisites
 
 Start required services (etcd and NATS) using [Docker Compose](../../deploy/docker-compose.yml)
@@ -68,6 +76,29 @@ This build script internally points to the base container image built with step
 ```
 ## Run Deployment
 
+This figure shows an overview of the major components to deploy:
+
+
+
+```
+
++------+      +-----------+      +------------------+             +---------------+
+| HTTP |----->| processor |----->|      Worker      |------------>|    Prefill    |
+|      |<-----|           |<-----|                  |<------------|    Worker     |
++------+      +-----------+      +------------------+             +---------------+
+                  |    ^                  |
+       query best |    | return           | publish kv events
+           worker |    | worker_id        v
+                  |    |         +------------------+
+                  |    +---------|     kv-router    |
+                  +------------->|                  |
+                                 +------------------+
+
+```
+
+Note: The above architecture illustrates all the components. The final components
+that get spawned depend upon the chosen graph.
+
 ### Example architectures
 
 #### Aggregated serving
@@ -82,21 +113,23 @@ cd /workspace/examples/tensorrt_llm
 dynamo serve graphs.agg_router:Frontend -f ./configs/agg_router.yaml
 ```
 
-<!--
-This is work in progress and will be enabled soon.
-
 #### Disaggregated serving
 ```bash
-cd /workspace/examples/llm
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
+cd /workspace/examples/tensorrt_llm
+TRTLLM_USE_UCX_KVCACHE=1 dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
 ```
 
+We are defining TRTLLM_USE_UCX_KVCACHE so that TRTLLM uses UCX for transferring the KV
+cache between the context and generation workers.
+
 #### Disaggregated serving with KV Routing
 ```bash
-cd /workspace/examples/llm
-dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
+cd /workspace/examples/tensorrt_llm
+TRTLLM_USE_UCX_KVCACHE=1 dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
 ```
--->
+
+We are defining TRTLLM_USE_UCX_KVCACHE so that TRTLLM uses UCX for transferring the KV
+cache between the context and generation workers.
 
 ### Client
 

@@ -108,7 +141,7 @@ See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) secti
 
 Remaining tasks:
 
-- [ ] Add support for the disaggregated serving.
+- [x] Add support for the disaggregated serving.
 - [ ] Add integration test coverage.
 - [ ] Add instructions for benchmarking.
 - [ ] Add multi-node support.
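
Note on the `TRTLLM_USE_UCX_KVCACHE=1` prefix in the disaggregated launch commands of this README diff: writing the assignment in front of the command places the variable in that one process's environment and leaves the invoking shell unchanged. The sketch below shows only this generic shell behavior. It uses `sh -c` as a stand-in for `dynamo serve` and makes no claim about how TensorRT-LLM reads the variable.

```bash
# Start from a shell where the variable is not set.
unset TRTLLM_USE_UCX_KVCACHE

# Prefix form: the variable exists only in the environment of this one command.
# `sh -c` stands in for `dynamo serve ...` here.
child_value="$(TRTLLM_USE_UCX_KVCACHE=1 sh -c 'printf %s "${TRTLLM_USE_UCX_KVCACHE:-unset}"')"

# The invoking shell does not keep the variable afterwards.
parent_value="${TRTLLM_USE_UCX_KVCACHE:-unset}"

printf 'child=%s parent=%s\n' "$child_value" "$parent_value"
```

If several commands in the same session need the setting, running `export TRTLLM_USE_UCX_KVCACHE=1` once beforehand is the usual alternative to repeating the prefix.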
