feat(sglang): tp>1 colocate engine topology + weight-sync correctness fixes by leviking98z-rgb · Pull Request #46 · Tencent-Hunyuan/UniRL

leviking98z-rgb · 2026-06-12T04:48:40Z

Part of the verl performance-parity series tracked in #40.

Summary

A tp>1 SRT server spans tp GPUs but colocate builds one engine per GPU. The lowest-physical-GPU rank of each tp group becomes the owner: it launches the server (CUDA_VISIBLE_DEVICES widened for the spawn child), publishes its URL via a /dev/shm rendezvous file once healthy; the other ranks become pure HTTP clients (generate posts to the owner; sleep/wake/weight-push no-op). Also:

disable_custom_all_reduce defaults on for tp>1 (sgl_kernel custom all-reduce CUDA-graph IPC fails across tp worker processes in containers)
per-tp-rank INDEPENDENT weight payload serialization: duplicating one serialized string across ranks unbalances the CUDA-IPC refcount handshake and leaks the gathered full weights (+16.5 GB/step measured via the cuda-alloc watermark -> OOM by step 4). TensorWeightSync now serializes weight_payload_fanout independent copies (owner=tp_size, client=0, gather stays in lockstep)

tp=1 behavior unchanged.

Measured caveat (why this is capability, not a default)

On H20 colocate with custom-AR disabled, tp=2 generation is SLOWER than tp=1 at every observed length (len~~1k: 70.0 vs 47.7s; len~~3.4k: 177.7 vs 110.5s) — per-layer NCCL allreduce latency dominates. The vllm-tp2 advantage seen in verl does not transfer to sglang-NCCL here. Shipped for topologies/models where it does pay; default stays tp_size=1.

Test Plan

12+ step stable tp=2 run (tp2v10_unirl_drpo_gz6): cuda_alloc flat at 3.81 GB (leak gone), ratio 1.000x
A/B soak tp1 vs tp2 to len~4k (unirl_opt16_tp1_soak / unirl_opt16_tp2_soak)
tp=1 regression: all perf runs in this series

A tp>1 SRT server spans tp GPUs but colocate builds one engine per GPU: the lowest-physical-GPU rank of each tp group becomes the owner (launches the server with CUDA_VISIBLE_DEVICES widened for the spawn child, publishes its URL via a /dev/shm rendezvous file after the health check) and other ranks become pure HTTP clients (generate posts to the owner; sleep/wake/weight push no-op). Includes: - disable_custom_all_reduce default for tp>1 (container CUDA-graph IPC fails in sgl_kernel custom all-reduce) - per-tp-rank INDEPENDENT weight payload serialization: duplicating one serialized string across ranks unbalances the CUDA-IPC refcount handshake and leaks the gathered full weights (~16.5 GB/step measured -> OOM) tp=1 behavior unchanged. NOTE: measured on H20 colocate, tp=2 generation is SLOWER than tp=1 at all observed lengths (NCCL allreduce latency dominates with custom-AR disabled) - this PR ships the capability + fixes, not a default change.

leviking98z-rgb · 2026-06-12T09:09:40Z

Review — request changes. Direction is right and the hard parts check out: fanout=0 keeps the FSDP gather collectives in lockstep on tp clients (base.py _iter_buckets walk), the per-rank independent serialization correctly balances the CUDA-IPC refcount handshake, and disable_custom_all_reduce is gated so tp=1 is untouched.

Two blockers:

The /dev/shm rendezvous file is never removed in shutdown(), so every restart begins with a stale file; ctor order between owner (unlink) and clients (first read) is unordered across ray actors, so a client can adopt a dead — or a previous job's live — server URL. Unlink on shutdown and make the file run-scoped (nonce) or health-validated before accept.
set_lora_from_tensors lacks the tp-client guard; LocalLoraWeightSync.sync is BROADCAST, so owner+clients all POST the same versioned lora_name to the shared server -> HTTP 400 "already loaded" -> crash on tp>1 + LoRA.

Also: guard (or require(tp==1)) init/destroy_weights_update_group for symmetry; multi-entry CVD should map device.index through the CVD list instead of treating the logical index as physical (silent wrong-GPU ownership); prefer setting the widened CVD inside the spawn target over mutating process-global env; give clients health_timeout + margin; client health_check always returns False (probe _base_url instead).

…ping Addresses #46 review B1 (unlink the /dev/shm URL file on shutdown; clients health-validate the URL before accepting and get a timeout margin), B2 (set_lora_from_tensors now no-ops on tp clients — BROADCAST LoRA sync would 400 on duplicate adapter names), N1 (init/destroy_weights_update_group guarded for symmetry) and N2 (multi-entry CUDA_VISIBLE_DEVICES maps device.index through the visible list instead of treating it as physical).

leviking98z-rgb · 2026-06-12T09:13:00Z

Review blockers addressed in 29ae21a: rendezvous file unlinked on shutdown + clients health-validate the URL before accepting (with a timeout margin over the owner health budget) [B1]; set_lora_from_tensors and init/destroy_weights_update_group now no-op on tp clients [B2/N1]; multi-entry CUDA_VISIBLE_DEVICES maps device.index through the visible list [N2]. Spawn-target CVD setting and client health_check remain follow-ups.

leviking98z-rgb mentioned this pull request Jun 12, 2026

VLM/LLM Infra Updating #40

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sglang): tp>1 colocate engine topology + weight-sync correctness fixes#46

feat(sglang): tp>1 colocate engine topology + weight-sync correctness fixes#46
leviking98z-rgb wants to merge 2 commits into
mainfrom
perf/05-sglang-tp-colocate

leviking98z-rgb commented Jun 12, 2026 •

edited

Loading

Uh oh!

leviking98z-rgb commented Jun 12, 2026

Uh oh!

leviking98z-rgb commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

leviking98z-rgb commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Measured caveat (why this is capability, not a default)

Test Plan

Uh oh!

leviking98z-rgb commented Jun 12, 2026

Uh oh!

leviking98z-rgb commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

leviking98z-rgb commented Jun 12, 2026 •

edited

Loading