Commit 24167d0

[TRTLLM-8431][doc] update public doc and example, add etcd auto-scaling tests (#8602)
Signed-off-by: Lizhi Zhou <[email protected]>
1 parent 227c288 commit 24167d0

File tree

8 files changed

+198
-43
lines changed

examples/disaggregated/README.md

Lines changed: 34 additions & 2 deletions
@@ -204,7 +204,39 @@ srun -A <account> -p <partition> -t <time> \
 Additionally, we offer a fully executable script—please refer to [Disaggregated SLURM Scripts](./slurm/simple_example/).
 
 
-## Dynamic scaling (Prototype)
+## Dynamic scaling
+
+### Service discovery method
+
+The disaggregated server also supports dynamic service discovery and auto-scaling of context/generation servers. To enable it, add a `disagg_cluster` section to the configurations of both the context/generation servers and the disaggregated server. In this mode, each context/generation server must be started with an extra command-line flag `--server-role=[context|generation]`, and the `context_servers`/`generation_servers` sections must be removed from the disaggregated server's configuration. You can omit the `disagg_cluster` section from a context/generation server's config by passing only `--disagg_cluster_uri=<disagg_cluster_uri>` on the command line (the disaggregated server's config must still contain the section). Omitted fields fall back to the defaults shown below.
+
+```yaml
+disagg_cluster:
+  cluster_uri: <your_cluster_uri>
+  cluster_name: ""
+  minimal_instances:
+    context_servers: 1
+    generation_servers: 1
+  heartbeat_interval_sec: 5
+  inactive_interval_sec: 10
+```
+- `cluster_uri`: the HTTP address of the disaggregated server, e.g. `http://<your-disagg-server-host>:<your-disagg-server-port>`, or the address of a pre-configured etcd server, e.g. `etcd://<your-etcd-host>:2379`.
+- `cluster_name`: an optional namespace used to isolate multiple disagg-clusters in etcd.
+- `minimal_instances`: the equivalent of `num_instances` in the auto-scaling concept; the disaggregated server rejects requests while the number of active context/generation servers is below the corresponding threshold.
+- `heartbeat_interval_sec`: how often context/generation servers send heartbeats to the disaggregated server.
+- `inactive_interval_sec`: a server is marked inactive if no heartbeat is received within this interval (set it higher than the heartbeat interval).
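The interplay of `heartbeat_interval_sec` and `inactive_interval_sec` can be sketched in a few lines of Python. This is a hypothetical illustration of the liveness rule described above, not the actual TRT-LLM implementation; the names (`last_heartbeat`, `on_heartbeat`, `active_workers`) are made up for the sketch.

```python
HEARTBEAT_INTERVAL_SEC = 5   # how often workers send heartbeats (default above)
INACTIVE_INTERVAL_SEC = 10   # no heartbeat for this long => worker is inactive

# hypothetical map of worker id -> timestamp of the last received heartbeat
last_heartbeat = {}

def on_heartbeat(worker_id: str, now: float) -> None:
    # record the most recent heartbeat time for this worker
    last_heartbeat[worker_id] = now

def active_workers(now: float) -> list:
    # a worker counts as active while its last heartbeat is recent enough
    return [w for w, t in last_heartbeat.items()
            if now - t <= INACTIVE_INTERVAL_SEC]

on_heartbeat("ctx-0", now=100.0)
on_heartbeat("gen-0", now=104.0)
# at t=112, ctx-0 has missed two heartbeat windows and is considered inactive
print(active_workers(now=112.0))  # -> ['gen-0']
```

Setting `inactive_interval_sec` to roughly twice the heartbeat interval (as the defaults do) tolerates a single dropped heartbeat without flapping.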
+
+Note that the disaggregated server and all context/generation servers should use the same `disagg_cluster` configuration values; otherwise the disaggregated server may fail to keep the other servers alive or to detect their inactivity properly. If the `disagg_cluster` section is specified, the `context_servers`/`generation_servers` sections must not appear in the disaggregated server's configuration.
+
+Additionally, we offer a fully executable script—please refer to [Disaggregated SLURM Scripts](./slurm/service_discovery_example/).
+
+#### Dynamically adding servers
+
+To add servers dynamically, start more context/generation workers with the same `disagg_cluster` configuration; the disaggregated server will discover the new servers and dispatch requests to them automatically. If a context/generation server becomes inactive, the disaggregated server detects this and stops routing requests to it.
+
+
+### Metadata server method (Prototype)

 Currently, trtllm supports dynamic addition and removal of servers by leveraging etcd. To enable this feature, start the context and generation servers with the additional flags ```--metadata_server_config_file``` and ```--server_role```.
 Before launching the context and generation servers, you must first start the etcd server. By default, the etcd server listens for client requests at ```localhost:2379```.
@@ -240,7 +272,7 @@ refersh_interval: 10.0
 
 The ```hostname``` and ```port``` must match those used when starting the etcd server. The ```health_check_timeout``` parameter specifies how long to wait without a healthy response before a server is considered dead; by default, trtllm performs two checks before marking a server as dead. The ```refresh_interval``` parameter determines how often the latest server list is fetched from the etcd server.
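Taken together, a metadata server config file of this shape is implied by the parameters above. The values here are illustrative, not taken from the commit; only the field names come from the surrounding text.

```yaml
hostname: localhost        # must match the etcd server's host
port: 2379                 # must match the etcd server's client port
health_check_timeout: 5.0  # seconds without a healthy response before a server is considered dead
refresh_interval: 10.0     # how often to fetch the latest server list from etcd, in seconds
```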
 
-### Dynamically adding servers
+#### Dynamically adding servers
 
 Users can add servers by directly launching them with trtllm-serve. For example, you can start an additional generation server as follows:
 
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
```bash
#!/bin/bash
#SBATCH --partition=${partition}
#SBATCH --account=${account}
#SBATCH --job-name=${job_name}
#SBATCH --time=02:00:00

container_image="${container_image:-}"
mount_paths="${mount_paths:-}"
work_path="${work_path:-}"
enable_etcd="${enable_etcd:-0}"
disagg_port="8000"
ctx_port="8001"
gen_port="8002"

# use the first node as the disaggregated server node
disagg_server_node=$(head -n 1 <(scontrol show hostnames $SLURM_JOB_NODELIST))

if [[ "$enable_etcd" == "1" ]]; then
    # optionally launch an etcd server; the container image must have etcd installed
    disagg_cluster_uri="etcd://${disagg_server_node}:2379"
    srun --container-image=${container_image} \
         --container-mounts=${mount_paths} \
         -w $disagg_server_node -N 1 --ntasks-per-node=1 \
         --mpi=pmix \
         bash -c "etcd" &
    sleep 5 # wait for etcd to start
else
    # or use the disaggregated server's HTTP address as the built-in service discovery server
    disagg_cluster_uri="http://${disagg_server_node}:${disagg_port}"
fi

cat >${work_path}/disagg_config.yaml << EOL
hostname: localhost
port: ${disagg_port}
backend: pytorch
disagg_cluster:
  cluster_uri: ${disagg_cluster_uri}
  cluster_name: example_cluster
EOL

cat >${work_path}/ctx_extra-llm-api-config.yaml << EOL
disable_overlap_scheduler: True
cache_transceiver_config:
  backend: UCX
  max_tokens_in_buffer: 2048
EOL

cat >${work_path}/gen_extra-llm-api-config.yaml << EOL
cache_transceiver_config:
  backend: UCX
  max_tokens_in_buffer: 2048
EOL

# Launch a proxy without any context/generation servers.
srun --container-image=${container_image} \
     --container-mounts=${mount_paths} \
     -w $disagg_server_node -N 1 --ntasks-per-node=1 \
     --mpi=pmix \
     bash -c "trtllm-llmapi-launch trtllm-serve disaggregated -c ${work_path}/disagg_config.yaml" &

# Launch a context server with `tp_size=8` using two 4-GPU nodes; it registers itself through disagg_cluster_uri
srun --container-image=${container_image} \
     --container-mounts=${mount_paths} \
     -N 2 --ntasks-per-node=4 \
     --mpi=pmix \
     bash -c "trtllm-llmapi-launch trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tp_size 8 --host 0.0.0.0 --port ${ctx_port} --extra_llm_api_options ${work_path}/ctx_extra-llm-api-config.yaml --disagg_cluster_uri ${disagg_cluster_uri} --server-role context" &

# Launch a generation server with `tp_size=4` using one 4-GPU node.
srun --container-image=${container_image} \
     --container-mounts=${mount_paths} \
     -N 1 --ntasks-per-node=4 \
     --mpi=pmix \
     bash -c "trtllm-llmapi-launch trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tp_size 4 --host 0.0.0.0 --port ${gen_port} --extra_llm_api_options ${work_path}/gen_extra-llm-api-config.yaml --disagg_cluster_uri ${disagg_cluster_uri} --server-role generation" &
```

tensorrt_llm/commands/serve.py

Lines changed: 8 additions & 4 deletions
@@ -533,10 +533,14 @@ def serve_encoder(model: str, host: str, port: int, log_level: str,
     help=
     "The interval of logging metrics in seconds. Set to 0 to disable metrics logging."
 )
-def disaggregated(config_file: Optional[str],
-                  metadata_server_config_file: Optional[str],
-                  server_start_timeout: int, request_timeout: int,
-                  log_level: str, metrics_log_interval: int):
+def disaggregated(
+    config_file: Optional[str],
+    metadata_server_config_file: Optional[str],
+    server_start_timeout: int,
+    request_timeout: int,
+    log_level: str,
+    metrics_log_interval: int,
+):
     """Running server in disaggregated mode"""
 
     logger.set_level(log_level)

tensorrt_llm/serve/cluster_storage.py

Lines changed: 26 additions & 18 deletions
@@ -36,23 +36,26 @@ class WatchEvent:
 
 class WatchEventQueue:
 
-    def __init__(self, key_prefixes: List[str],
-                 events: asyncio.Queue[WatchEvent]):
+    def __init__(self, key_prefixes: List[str]):
         self.key_prefixes = key_prefixes
-        self.events = events
+        self.events = asyncio.Queue()
 
     async def drain(self):
         events = []
         event = await self.events.get()
-        logger.debug(f"Draining watch event: {self.events.qsize()}")
         events.append(event)
         while not self.events.empty():
             event = self.events.get_nowait()
             events.append(event)
             self.events.task_done()
-        logger.debug(f"after draining watch event: {self.events.qsize()}")
         return events
 
+    async def add_events(self, events: List[WatchEvent]):
+        loop = asyncio.get_event_loop()
+        for event in events:
+            self.events.put_nowait(event)
+        loop._write_to_self()
+
 
 class ClusterStorage(abc.ABC):
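The drain/add_events pattern in this hunk — await the first event, then empty the queue without further awaiting — can be illustrated standalone. This is a simplified sketch, not the TRT-LLM class itself; the `EventQueue` name and string events are made up.

```python
import asyncio

class EventQueue:
    def __init__(self):
        self.events = asyncio.Queue()

    async def drain(self):
        # block until at least one event arrives, then grab everything
        # that is already queued in a single non-blocking sweep
        batch = [await self.events.get()]
        while not self.events.empty():
            batch.append(self.events.get_nowait())
        return batch

    def add_events(self, events):
        # put_nowait never blocks, so producers can batch-enqueue safely
        for event in events:
            self.events.put_nowait(event)

async def main():
    q = EventQueue()
    q.add_events(["set:a", "set:b", "del:a"])
    return await q.drain()

print(asyncio.run(main()))  # -> ['set:a', 'set:b', 'del:a']
```

The commit additionally calls the private `loop._write_to_self()` after enqueueing, presumably to wake the event loop when events are added from outside it (e.g. from the etcd watch callback thread); the sketch above stays on a single loop and does not need that.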

@@ -104,17 +107,17 @@ async def get_prefix(self,
 
 
 def create_cluster_storage(cluster_uri, cluster_name, **kwargs):
-    if cluster_uri.startswith("http"):
+    if cluster_uri.startswith("http://") or cluster_uri.startswith("https://"):
         return HttpClusterStorageServer(cluster_uri, cluster_name, **kwargs)
-    elif cluster_uri.startswith("etcd"):
+    elif cluster_uri.startswith("etcd://"):
         return Etcd3ClusterStorage(cluster_uri, cluster_name, **kwargs)
     raise ValueError(f"Invalid cluster storage URI: {cluster_uri}")
 
 
 def create_cluster_storage_client(cluster_uri, cluster_name, **kwargs):
-    if cluster_uri.startswith("http"):
+    if cluster_uri.startswith("http://") or cluster_uri.startswith("https://"):
         return HttpClusterStorageClient(cluster_uri, cluster_name, **kwargs)
-    elif cluster_uri.startswith("etcd"):
+    elif cluster_uri.startswith("etcd://"):
         return Etcd3ClusterStorage(cluster_uri, cluster_name, **kwargs)
     raise ValueError(f"Invalid cluster storage URI: {cluster_uri}")
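The motivation for tightening the `startswith` checks shows up in a small sketch: a bare `"http"` prefix also matches unrelated or mistyped schemes. The helper below is hypothetical, mirroring only the dispatch logic of the factory functions above.

```python
def storage_kind(cluster_uri: str) -> str:
    # strict scheme match, as in the updated factory functions
    if cluster_uri.startswith("http://") or cluster_uri.startswith("https://"):
        return "http"
    elif cluster_uri.startswith("etcd://"):
        return "etcd"
    raise ValueError(f"Invalid cluster storage URI: {cluster_uri}")

print(storage_kind("https://disagg-host:8000"))  # -> http
print(storage_kind("etcd://etcd-host:2379"))     # -> etcd

# The old check `startswith("http")` would have accepted this typo silently;
# the strict check rejects it with a clear error instead.
try:
    storage_kind("httpx://disagg-host:8000")
except ValueError as e:
    print(e)  # -> Invalid cluster storage URI: httpx://disagg-host:8000
```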

@@ -138,7 +141,11 @@ def key_time():
 
 class HttpClusterStorageServer(ClusterStorage):
 
-    def __init__(self, cluster_uri, cluster_name, server: FastAPI = None):
+    def __init__(self,
+                 cluster_uri,
+                 cluster_name,
+                 server: FastAPI = None,
+                 **kwargs):
         self._storage = {}
         self._lock = asyncio.Lock()
         self._watch_handles = {}
@@ -237,7 +244,7 @@ async def watch(self, key_prefix: str) -> WatchEventQueue:
             )
         else:
             self._watch_handles[key_prefix] = WatchEventQueue(
-                key_prefixes=[key_prefix], events=asyncio.Queue())
+                key_prefixes=[key_prefix])
         return self._watch_handles[key_prefix]
 
     async def unwatch(self, key_prefix: str) -> None:
@@ -291,7 +298,7 @@ async def _check_expired(self):
 
 class HttpClusterStorageClient(ClusterStorage):
 
-    def __init__(self, cluster_uri, cluster_name):
+    def __init__(self, cluster_uri, cluster_name, **kwargs):
         self._session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(
             total=5))
         self._cluster_uri = cluster_uri if cluster_uri.startswith(
@@ -393,8 +400,8 @@ def __init__(self,
                  key_prefix: str,
                  cancel_event: Callable[[], None] = None):
         self.key_prefix = key_prefix
-        self._cancel_event = cancel_event
         self.events = asyncio.Queue()
+        self._cancel_event = cancel_event
 
     def cancel_event(self):
         if self._cancel_event:
@@ -406,7 +413,7 @@ def set_cancel_event(self, cancel_event: Callable[[], None]):
     def __del__(self):
         self.cancel_event()
 
-    def add_event(self, watch_resp):
+    def add_events_from_resp(self, watch_resp):
         try:
             for event in watch_resp.events:
                 # Event type is not in public interface of etcd3
@@ -430,7 +437,8 @@ class Etcd3ClusterStorage(ClusterStorage):
     def __init__(self,
                  cluster_uri: str,
                  cluster_name: str,
-                 one_single_lease: bool = False):
+                 one_single_lease: bool = False,
+                 **kwargs):
         cluster_uri = cluster_uri.replace("etcd://", "")
         host, port = cluster_uri.rsplit(":", 1)
         self._client = etcd3.client(host, port)
@@ -502,7 +510,7 @@ async def expire(self, key: str, ttl: int) -> bool:
         try:
             lease = self._get_lease(key, ttl)
             # TTL will be ignored since it can only be set when creating a lease
-            self.client.refresh_lease(lease_id=lease.id)
+            next(self.client.refresh_lease(lease_id=lease.id), None)
         except etcd3.Etcd3Exception as e:
             logger.error(f"Error refreshing lease {key}: {e}")
             return False
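The one-line fix above suggests that the etcd client's `refresh_lease` is implemented as a generator, so merely calling it performs no work; `next(..., None)` forces one iteration. A minimal illustration of that pitfall — the `refresh_lease` below is a hypothetical stand-in, not the real etcd3 API:

```python
sent = []  # records outbound keep-alive requests (illustrative only)

def refresh_lease(lease_id):
    # Stand-in for a client call implemented as a generator:
    # the body does not run until the generator is iterated.
    sent.append(lease_id)  # side effect: the keep-alive request goes out
    yield {"lease_id": lease_id, "ttl": 10}

refresh_lease(42)              # generator created but never iterated...
print(sent)                    # -> []  (no request was actually sent)

next(refresh_lease(42), None)  # ...iterating once triggers the side effect
print(sent)                    # -> [42]
```

Using `next(gen, None)` rather than `next(gen)` also avoids a `StopIteration` if the generator yields nothing.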
@@ -512,7 +520,7 @@ async def get_prefix(self,
                          key_prefix: str,
                          keys_only: bool = False) -> Dict[str, str]:
         try:
-            resp = self.client.get_prefix(key_prefix, keys_only=keys_only)
+            resp = self.client.get_prefix(key_prefix)
             return {
                 metadata.key.decode("utf-8"):
                 "" if keys_only else v.decode("utf-8")
@@ -528,7 +536,7 @@ async def watch(self, key_prefix: str) -> WatchEventQueue:
             return self._watch_handles[key_prefix]
         watch_handle = Etcd3WatchEventQueue(key_prefix=key_prefix)
         watch_id = self.client.add_watch_prefix_callback(
-            key_prefix, watch_handle.add_event)
+            key_prefix, watch_handle.add_events_from_resp)
         watch_handle.set_cancel_event(
             lambda: self.client.cancel_watch(watch_id))
         self._watch_handles[key_prefix] = watch_handle

tensorrt_llm/serve/disagg_auto_scaling.py

Lines changed: 5 additions & 2 deletions
@@ -94,18 +94,21 @@ def worker_key_prefix(self) -> str:
 
     async def watch_workers(self, get_existing_first: bool = True):
         workers = []
+        self._watch_handle = await self._cluster_storage.watch(
+            self.worker_key_prefix)
         if get_existing_first:
             # There is a tiny gap between getting existing workers and watching the key,
             # which may cause us to miss some workers registered in between.
             resp = await self._cluster_storage.get_prefix(
                 self.worker_key_prefix, keys_only=False)
+            events = []
             for worker_id, data in resp.items():
                 event = WatchEvent(storage_item=StorageItem(key=worker_id,
                                                             value=data),
                                    event_type=WatchEventType.SET)
                 workers.append(self._parse_worker_info(event))
-        self._watch_handle = await self._cluster_storage.watch(
-            self.worker_key_prefix)
+                events.append(event)
+            await self._watch_handle.add_events(events)
         return workers
 
     async def unwatch_workers(self) -> None:
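The reordering in this hunk — subscribe first, then list, then replay the listed items into the watch queue — is a common fix for the get-then-watch race: with the old order, a worker registering between the listing and the watch setup was lost entirely. A toy sketch of the idea using an in-memory store (all names hypothetical):

```python
import asyncio

class Store:
    def __init__(self):
        self.data = {}
        self.watch_queue = None  # set once someone subscribes

    def put(self, key, value):
        self.data[key] = value
        if self.watch_queue is not None:
            self.watch_queue.put_nowait((key, value))

async def watch_workers(store):
    # 1) subscribe first, so nothing registered from now on is missed
    store.watch_queue = asyncio.Queue()
    # 2) list the workers that already existed before the subscription
    existing = list(store.data.items())
    # 3) replay the existing entries into the same queue so the consumer
    #    sees one unified stream of (key, value) events
    for item in existing:
        store.watch_queue.put_nowait(item)
    return [k for k, _ in existing]

async def main():
    store = Store()
    store.put("worker/ctx-0", "alive")   # registered before watching
    workers = await watch_workers(store)
    store.put("worker/gen-0", "alive")   # registered after watching
    events = []
    while not store.watch_queue.empty():
        events.append(store.watch_queue.get_nowait())
    return workers, events

workers, events = asyncio.run(main())
print(workers)  # -> ['worker/ctx-0']
print(events)   # -> [('worker/ctx-0', 'alive'), ('worker/gen-0', 'alive')]
```

The trade-off flips: a worker registering between steps 1 and 2 can now show up both in the listing and as a watch event, so consumers must tolerate duplicate SET events instead of missing workers.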

tensorrt_llm/serve/router.py

Lines changed: 6 additions & 2 deletions
@@ -159,6 +159,7 @@ def __init__(self, server_role: ServerRole, servers: List[str],
     @abstractmethod
     def _on_servers_updated(self, old_servers, new_servers):
         """Called when the server list changes. Override in subclasses to handle index resets.
+        Called with lock already held.
         Args:
             old_servers: The previous server list
             new_servers: The new server list
@@ -639,8 +640,11 @@ async def finish_request(self,
                                   session=session)
 
     def _on_servers_updated(self, old_servers, new_servers):
-        raise NotImplementedError(
-            "KvCacheAwareRouter does not support server updates")
+        for new_server in new_servers:
+            self._server_state[new_server] = KvCacheAwareServerState(
+                new_server, self._use_tokens)
+        for old_server in old_servers:
+            self._server_state.pop(old_server, None)
 
 
 def create_router(router_config: Optional[RouterConfig],
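The new `_on_servers_updated` body is a simple state reconciliation: build fresh state for every server in `new_servers`, then drop state for every server in `old_servers`. A standalone sketch with a trivial stand-in for `KvCacheAwareServerState` (names and call shape hypothetical):

```python
class ServerState:
    # trivial stand-in for KvCacheAwareServerState
    def __init__(self, server, use_tokens):
        self.server = server
        self.use_tokens = use_tokens

def on_servers_updated(server_state, old_servers, new_servers, use_tokens=False):
    # mirror of the new callback body: create state for servers in
    # new_servers, then remove state for servers in old_servers
    for new_server in new_servers:
        server_state[new_server] = ServerState(new_server, use_tokens)
    for old_server in old_servers:
        server_state.pop(old_server, None)
    return server_state

state = {}
on_servers_updated(state, old_servers=[], new_servers=["ctx-0", "gen-0"])
print(sorted(state))  # -> ['ctx-0', 'gen-0']

on_servers_updated(state, old_servers=["ctx-0"], new_servers=["gen-1"])
print(sorted(state))  # -> ['gen-0', 'gen-1']
```

Note the ordering: because removals run after additions, a server present in both arguments would end up with no state, so in practice the two lists behave like disjoint added/removed sets even though the docstring describes them as the full previous and new lists.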
