Skip to content

feat(sglang): tp>1 colocate engine topology + weight-sync correctness fixes#46

Draft
leviking98z-rgb wants to merge 2 commits into
mainfrom
perf/05-sglang-tp-colocate
Draft

feat(sglang): tp>1 colocate engine topology + weight-sync correctness fixes#46
leviking98z-rgb wants to merge 2 commits into
mainfrom
perf/05-sglang-tp-colocate

Conversation

@leviking98z-rgb

@leviking98z-rgb leviking98z-rgb commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Part of the verl performance-parity series tracked in #40.

Summary

A tp>1 SRT server spans tp GPUs but colocate builds one engine per GPU. The lowest-physical-GPU rank of each tp group becomes the owner: it launches the server (CUDA_VISIBLE_DEVICES widened for the spawn child), publishes its URL via a /dev/shm rendezvous file once healthy; the other ranks become pure HTTP clients (generate posts to the owner; sleep/wake/weight-push no-op). Also:

  • disable_custom_all_reduce defaults on for tp>1 (sgl_kernel custom all-reduce CUDA-graph IPC fails across tp worker processes in containers)
  • per-tp-rank INDEPENDENT weight payload serialization: duplicating one serialized string across ranks unbalances the CUDA-IPC refcount handshake and leaks the gathered full weights (+16.5 GB/step measured via the cuda-alloc watermark -> OOM by step 4). TensorWeightSync now serializes weight_payload_fanout independent copies (owner=tp_size, client=0, gather stays in lockstep)

tp=1 behavior unchanged.

Measured caveat (why this is capability, not a default)

On H20 colocate with custom-AR disabled, tp=2 generation is SLOWER than tp=1 at every observed length (len1k: 70.0 vs 47.7s; len3.4k: 177.7 vs 110.5s) — per-layer NCCL allreduce latency dominates. The vllm-tp2 advantage seen in verl does not transfer to sglang-NCCL here. Shipped for topologies/models where it does pay; default stays tp_size=1.

Test Plan

  • 12+ step stable tp=2 run (tp2v10_unirl_drpo_gz6): cuda_alloc flat at 3.81 GB (leak gone), ratio 1.000x
  • A/B soak tp1 vs tp2 to len~4k (unirl_opt16_tp1_soak / unirl_opt16_tp2_soak)
  • tp=1 regression: all perf runs in this series

A tp>1 SRT server spans tp GPUs but colocate builds one engine per GPU: the
lowest-physical-GPU rank of each tp group becomes the owner (launches the
server with CUDA_VISIBLE_DEVICES widened for the spawn child, publishes its
URL via a /dev/shm rendezvous file after the health check) and other ranks
become pure HTTP clients (generate posts to the owner; sleep/wake/weight
push no-op). Includes:
- disable_custom_all_reduce default for tp>1 (container CUDA-graph IPC fails
  in sgl_kernel custom all-reduce)
- per-tp-rank INDEPENDENT weight payload serialization: duplicating one
  serialized string across ranks unbalances the CUDA-IPC refcount handshake
  and leaks the gathered full weights (~16.5 GB/step measured -> OOM)
tp=1 behavior unchanged. NOTE: measured on H20 colocate, tp=2 generation is
SLOWER than tp=1 at all observed lengths (NCCL allreduce latency dominates
with custom-AR disabled) - this PR ships the capability + fixes, not a
default change.
@leviking98z-rgb

Copy link
Copy Markdown
Collaborator Author

Review — request changes. Direction is right and the hard parts check out: fanout=0 keeps the FSDP gather collectives in lockstep on tp clients (base.py _iter_buckets walk), the per-rank independent serialization correctly balances the CUDA-IPC refcount handshake, and disable_custom_all_reduce is gated so tp=1 is untouched.

Two blockers:

  1. The /dev/shm rendezvous file is never removed in shutdown(), so every restart begins with a stale file; ctor order between owner (unlink) and clients (first read) is unordered across ray actors, so a client can adopt a dead — or a previous job's live — server URL. Unlink on shutdown and make the file run-scoped (nonce) or health-validated before accept.
  2. set_lora_from_tensors lacks the tp-client guard; LocalLoraWeightSync.sync is BROADCAST, so owner+clients all POST the same versioned lora_name to the shared server -> HTTP 400 "already loaded" -> crash on tp>1 + LoRA.

Also: guard (or require(tp==1)) init/destroy_weights_update_group for symmetry; multi-entry CVD should map device.index through the CVD list instead of treating the logical index as physical (silent wrong-GPU ownership); prefer setting the widened CVD inside the spawn target over mutating process-global env; give clients health_timeout + margin; client health_check always returns False (probe _base_url instead).

…ping

Addresses #46 review B1 (unlink the /dev/shm URL file on shutdown; clients
health-validate the URL before accepting and get a timeout margin), B2
(set_lora_from_tensors now no-ops on tp clients — BROADCAST LoRA sync
would 400 on duplicate adapter names), N1 (init/destroy_weights_update_group
guarded for symmetry) and N2 (multi-entry CUDA_VISIBLE_DEVICES maps
device.index through the visible list instead of treating it as physical).
@leviking98z-rgb

Copy link
Copy Markdown
Collaborator Author

Review blockers addressed in 29ae21a: rendezvous file unlinked on shutdown + clients health-validate the URL before accepting (with a timeout margin over the owner health budget) [B1]; set_lora_from_tensors and init/destroy_weights_update_group now no-op on tp clients [B2/N1]; multi-entry CUDA_VISIBLE_DEVICES maps device.index through the visible list [N2]. Spawn-target CVD setting and client health_check remain follow-ups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant