feat(sglang): tp>1 colocate engine topology + weight-sync correctness fixes#46
feat(sglang): tp>1 colocate engine topology + weight-sync correctness fixes#46leviking98z-rgb wants to merge 2 commits into
Conversation
A tp>1 SRT server spans tp GPUs but colocate builds one engine per GPU: the lowest-physical-GPU rank of each tp group becomes the owner (launches the server with CUDA_VISIBLE_DEVICES widened for the spawn child, publishes its URL via a /dev/shm rendezvous file after the health check) and other ranks become pure HTTP clients (generate posts to the owner; sleep/wake/weight push no-op). Includes: - disable_custom_all_reduce default for tp>1 (container CUDA-graph IPC fails in sgl_kernel custom all-reduce) - per-tp-rank INDEPENDENT weight payload serialization: duplicating one serialized string across ranks unbalances the CUDA-IPC refcount handshake and leaks the gathered full weights (~16.5 GB/step measured -> OOM) tp=1 behavior unchanged. NOTE: measured on H20 colocate, tp=2 generation is SLOWER than tp=1 at all observed lengths (NCCL allreduce latency dominates with custom-AR disabled) - this PR ships the capability + fixes, not a default change.
|
Review — request changes. Direction is right and the hard parts check out: fanout=0 keeps the FSDP gather collectives in lockstep on tp clients ( Two blockers:
Also: guard (or |
…ping Addresses #46 review B1 (unlink the /dev/shm URL file on shutdown; clients health-validate the URL before accepting and get a timeout margin), B2 (set_lora_from_tensors now no-ops on tp clients — BROADCAST LoRA sync would 400 on duplicate adapter names), N1 (init/destroy_weights_update_group guarded for symmetry) and N2 (multi-entry CUDA_VISIBLE_DEVICES maps device.index through the visible list instead of treating it as physical).
|
Review blockers addressed in 29ae21a: rendezvous file unlinked on shutdown + clients health-validate the URL before accepting (with a timeout margin over the owner health budget) [B1]; set_lora_from_tensors and init/destroy_weights_update_group now no-op on tp clients [B2/N1]; multi-entry CUDA_VISIBLE_DEVICES maps device.index through the visible list [N2]. Spawn-target CVD setting and client health_check remain follow-ups. |
Part of the verl performance-parity series tracked in #40.
Summary
A tp>1 SRT server spans tp GPUs but colocate builds one engine per GPU. The lowest-physical-GPU rank of each tp group becomes the owner: it launches the server (CUDA_VISIBLE_DEVICES widened for the spawn child), publishes its URL via a /dev/shm rendezvous file once healthy; the other ranks become pure HTTP clients (generate posts to the owner; sleep/wake/weight-push no-op). Also:
disable_custom_all_reducedefaults on for tp>1 (sgl_kernel custom all-reduce CUDA-graph IPC fails across tp worker processes in containers)TensorWeightSyncnow serializesweight_payload_fanoutindependent copies (owner=tp_size, client=0, gather stays in lockstep)tp=1 behavior unchanged.
Measured caveat (why this is capability, not a default)
On H20 colocate with custom-AR disabled, tp=2 generation is SLOWER than tp=1 at every observed length (len
1k: 70.0 vs 47.7s; len3.4k: 177.7 vs 110.5s) — per-layer NCCL allreduce latency dominates. The vllm-tp2 advantage seen in verl does not transfer to sglang-NCCL here. Shipped for topologies/models where it does pay; default stays tp_size=1.Test Plan
tp2v10_unirl_drpo_gz6): cuda_alloc flat at 3.81 GB (leak gone), ratio 1.000xunirl_opt16_tp1_soak/unirl_opt16_tp2_soak)