feat(rollout): vLLM-Omni v2 rollout engine — canonical vllm_omni, retire v1#36

Open
celve wants to merge 51 commits into Tencent-Hunyuan:main from celve:LIN-382/vllm-omni-canonical

Conversation

celve commented Jun 11, 2026

Collaborator

Summary

Promotes the v2 vLLM-Omni rollout engine to the canonical vllm_omni and retires the original v1 engine. v2 is a thin engine core over a single backend seam with per-modality adapters, replacing v1's monolithic implementation.

Architecture — unirl/rollout/engine/vllm_omni/

  • backends/ — the only code that imports the vLLM-Omni runtime (boot, ports, per-stage collective_rpc).
  • adapters/ — per-modality registry keyed on config.modality; owns the RolloutReq/RolloutResp conversion and per-modality topology knobs.
  • pipelines/ / worker/ / patches/ — worker-side role packages (custom diffusion pipelines, weight-sync receive mixins, runtime hijacks).
  • weight_sync.py — canonical sync ops + LoRA lifecycle over the seam.

Supported modalities: SD3.5 (sd3_t2i), HunyuanImage-3 (hi3_*), HunyuanVideo-1.5 t2v (hv15_t2v), Qwen-Image (qwen_image_t2i).
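
A minimal sketch of a registry keyed on the modality string, to illustrate the adapters/ bullet above. Every name here (`register`, `resolve`, `ModalityAdapter`, the dict-shaped request) is hypothetical and chosen for illustration; the actual registry in this PR may be shaped differently.

```python
from typing import Callable, Dict, Type

# Hypothetical registry; the real one lives under adapters/ and may differ.
_ADAPTERS: Dict[str, Type["ModalityAdapter"]] = {}


class ModalityAdapter:
    """Illustrative base: converts a request into engine inputs and back."""

    def build_inputs(self, req: dict) -> dict:
        raise NotImplementedError

    def build_response(self, req: dict, raw: dict) -> dict:
        raise NotImplementedError


def register(modality: str) -> Callable[[Type[ModalityAdapter]], Type[ModalityAdapter]]:
    """Class decorator that binds an adapter class to one modality key."""

    def decorator(cls: Type[ModalityAdapter]) -> Type[ModalityAdapter]:
        if modality in _ADAPTERS:
            raise ValueError(f"duplicate modality key: {modality!r}")
        _ADAPTERS[modality] = cls
        return cls

    return decorator


def resolve(modality: str) -> ModalityAdapter:
    """Instantiate the adapter for a key, failing loudly on unknown keys."""
    try:
        return _ADAPTERS[modality]()
    except KeyError:
        known = ", ".join(sorted(_ADAPTERS))
        raise ValueError(f"unknown modality {modality!r}; registered: {known}") from None


@register("sd3_t2i")
class Sd3T2iAdapter(ModalityAdapter):
    def build_inputs(self, req: dict) -> dict:
        return {"prompts": list(req["prompts"])}

    def build_response(self, req: dict, raw: dict) -> dict:
        return {"images": raw["images"]}
```

Validating a config's modality against the live registry then reduces to calling `resolve(config_modality)` at construction time.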

v1 retirement / promotion

  • Deleted the v1 engine package; renamed vllm_omni_v2 → vllm_omni and classes VLLMOmniV2{RolloutEngine,EngineConfig} → VLLMOmni{…}.
  • Hoisted the shared sgl_compat shim (vendored sglang CUDA-IPC reductions / serializer, so the engine venv needs no sglang) into unirl/distributed/weight_sync/transfer/, beside its siblings bucketed_transfer / ipc_dispatch / checksum — no engine package owns cross-engine transfer code.
  • Collapsed the redundant engine-target suffix in config/validation.py; example recipes rewired to the v2 engine (modality keys + request-driven sampling).

Verification

  • End-to-end SD3.5 GRPO colocate on H20×8: boot → generate → PickScore → train loop runs clean; reward climbs off the ~0.76 baseline (rollouts 1→6: 0.7626 → 0.7946), tracking the reference curve.
  • Every recipe _target_ resolves (static guard); ruff clean.

Notes

  • Unit tests are not included in this PR.

celve added 30 commits June 10, 2026 14:04
…omni)

Rewrite of the vllm-omni rollout engine to the rollout-engine layout spec
v2, mirroring the sglang_diffusion reference. Coexists with v1 under the
distinct vllm_omni_v2 name; opt-in via cloned recipes; v1 retires after
the parity gate. Covers all 8 v1 modalities.

- backends/: the one runtime seam — base.py (Backend protocol +
  GenerateCall/StageSampling intent + OmniRawResult wire protocol, zero
  runtime code) and native.py (lazy Omni import, boot with stage-YAML
  overlays + temp file, request-id grouping, sleep/wake tasks, per-stage
  collective_rpc weight-sync verbs, tokenize_prompt).
- adapters/: registry keyed on modality; per-shape bases ar_dit/ar_only/
  dit_image/dit_video lift v1's request.py/response.py modality branches
  into overridable steps (incl. dit_recaption's per-prompt seeded calls).
- utils/: pure ports of the segment/condition/prompt/σ mechanics.
- weight_sync.py: one component owning all sync/LoRA state; v1-parity
  active LoRA re-push on wake (restore_lora_after_wake) + the
  mark_weights_released event; both LoRA transports (handle/byte-copy).
- config.py: v1 fields verbatim + live-registry modality validation +
  server_intent; VLLMOmniPorts self-reserved bind-to-0 master ports ride
  each stage's engine_args.master_port — kills the 30200+rank*200+idx*50
  port math, the RANK-env fallback, and the per-rank YAML rewrites.
- worker/ + pipelines/ + patches/: role-8 copies (v1 keeps its own — its
  stage YAMLs reference v1 qualnames until retirement); patches/ carries
  the install() bundle with per-patch DELETE-WHEN notes.
- Shared infra: rollout/engine/ports.py (ReservedPorts, ported from the
  LIN-371 branch); bucketed_transfer/ipc_dispatch/checksum hoisted to the
  neutral distributed/weight_sync/transfer/ package (v1 paths re-export;
  5 trainer-side import sites repointed).
- recipes: 8 *_v2.yaml opt-in clones (only the rollout _target_ lines
  changed); config/validation.py IPC gate accepts the v2 engine.
- tests/rollout/vllm_omni_v2/: 55 CPU tests — registry/topology parity
  vs the v1 frozensets, per-shape build_inputs/build_response, utils
  mechanics, WeightSync lifecycle, bare-engine sequencing, the
  dispatch-marker guard, ports, and a subprocess CPU-importability check
  with vllm/vllm_omni/sglang blocked.

GPU smoke (incl. reserved==settled port assertion) and the fixed-seed
parity gate vs v1 remain before v1 retirement.
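
To illustrate the port change described in the config.py bullet above, here is a sketch contrasting the quoted v1 arithmetic with a bind-to-0 reservation. `legacy_master_port` and `reserve_free_port` are hypothetical names; the PR's ReservedPorts / VLLMOmniPorts classes may hold the socket differently.

```python
import socket
from contextlib import closing


def legacy_master_port(rank: int, stage_idx: int) -> int:
    """The v1-style arithmetic quoted in the commit message."""
    return 30200 + rank * 200 + stage_idx * 50


def reserve_free_port(host: str = "127.0.0.1") -> int:
    """Ask the kernel for a currently free TCP port by binding to port 0.

    Assumption: the socket is closed right after the lookup, so another
    process could still grab the port before the engine binds it. A real
    reservation would likely keep the socket open until hand-off.
    """
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

The arithmetic form collides whenever two jobs share a node and a rank; the bind-to-0 form lets the kernel pick a port that is free at reservation time.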
… drop the YAML rewrite

Audit of the vllm-omni source (pin v0.20.0 + upstream main) showed the seam
was reconstructing machinery the runtime already provides:

- Boot now passes the PRISTINE packaged stage YAML plus enable_sleep_mode /
  master_port as Omni ctor kwargs — vllm-omni merges them into every stage's
  engine_args itself (load_stage_configs_from_yaml base_engine_args channel
  + the dedicated sleep-mode injection in _resolve_stage_configs, verified
  at the pin AND main; no stage YAML of ours defines either key, so the
  behavior is identical to the old overlay). Deletes _overlay_stage_tree,
  the yaml/tempfile load-overlay-write-unlink machinery, and the shutdown
  temp-file cleanup. Known cosmetic: upstream logs a spurious "top-level
  engine args are ignored: enable_sleep_mode" warning before its dedicated
  block applies the kwarg.
- tp_per_stage reads back from omni.stage_configs (the runtime's own merged
  per-stage configs) via _tp_from_stage_configs, replacing the pre-overlay
  YAML re-parse (_parse_tp_per_stage) and its ordering caveat.
- VLLMOmniPorts: two per-stage fields -> one reserved master-port BASE. At
  the pin, OmniDiffusionConfig.__post_init__ adds rand(0,100) + a 37-stride
  bind-check scan even to injected ports (so per-stage reserved ports were
  never honored verbatim anyway); from v0.21.0rc2 (#3803) the base is
  honored verbatim. Docstrings carry the MASTER_PORT-env-precedence upgrade
  landmine.
- patches/__init__: DELETE-WHEN corrections from the audit — flow-alignment
  deletable at pin >= v0.21 (upstream removed the v0.20.0 KV API),
  lora_request passthrough confirmed still absent at main (upstream-PR
  opportunity noted), tokenizer ratio-36 nuance (upstream now raises a
  clean ValueError; Base ckpt still needs the 0-fallback).
- Tests: ports tests adapted to the single field; new table-driven
  _assemble_omni_kwargs layering tests (dedicated keys beat the omni_extra
  escape hatch; hatch still beats timeout/mode defaults — v1 parity) and a
  _tp_from_stage_configs test (mapping- and attr-style entries).
  pytest tests/ -> 59 passed.

Boot mechanics only — no rollout-output surface touched; the pending parity
gate is unaffected.
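
A sketch of the two behaviors the new tests pin down: kwargs layering (dedicated keys beat the escape hatch, which beats defaults) and reading TP sizes from mapping- or attribute-style stage entries. The function names and the `engine_args` / `tensor_parallel_size` field layout are assumptions for illustration, not the PR's actual helpers.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Mapping


def assemble_kwargs(
    defaults: Mapping[str, Any],
    extra: Mapping[str, Any],
    dedicated: Mapping[str, Any],
) -> Dict[str, Any]:
    """Later layers win: defaults < escape hatch < dedicated keys."""
    merged = dict(defaults)
    merged.update(extra)
    merged.update(dedicated)
    return merged


def _get(obj: Any, key: str, default: Any = None) -> Any:
    """Read a field from either a mapping or an attribute-style object."""
    if isinstance(obj, Mapping):
        return obj.get(key, default)
    return getattr(obj, key, default)


def tp_from_stage_configs(stage_configs: Iterable[Any]) -> List[int]:
    """Collect one tensor-parallel size per stage, defaulting to 1."""
    sizes = []
    for stage in stage_configs:
        engine_args = _get(stage, "engine_args", {})
        sizes.append(int(_get(engine_args, "tensor_parallel_size", 1)))
    return sizes


kwargs = assemble_kwargs(
    defaults={"timeout": 300},
    extra={"timeout": 600, "enable_sleep_mode": False},
    dedicated={"enable_sleep_mode": True, "master_port": 29500},
)
stages = [
    {"engine_args": {"tensor_parallel_size": 2}},
    SimpleNamespace(engine_args=SimpleNamespace(tensor_parallel_size=4)),
]
tp_sizes = tp_from_stage_configs(stages)
```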
Drives the report's pending GPU checks against the live v0.20.0 runtime:
boot (sd35_t2i), settled-vs-reserved master-port window assertion (pin
semantics: base + rand(0,100) + 37-stride scan), generate, one tensor-bag
sync with fingerprint_tensor checksum match, sleep/wake + post-wake
generate. Live-LoRA wake re-push is left to the LoRA e2e recipe.
Image.to_pil() / Video.to_pils() existed but Images had no batch
counterpart, so five call sites re-derived the conversion in three
different spellings. Route them all through Images.to_pils().

vllm_omni_v2's images_to_pil(req, n) becomes pil_images_from_req — the
helper extracts and validates the request primitive (paralleling
texts_from_req); the conversion itself now lives on the type.
types/reward.py keeps tensor_frame_to_pil (extra grayscale→RGB + cpu
handling) — intentionally not unified.
…t sampling params authoritative

Adapter files are now named by model family (hi3_ar_dit / hi3_ar_only /
hi3_dit_recaption; dit_image keeps sd35_t2i only). The HI3 chat-template
knowledge (task_key / sys_type / output_modalities + the bot_task resolve
hook) moves from utils' _TASK_DEFAULTS table onto the adapter classes, and
build_prompt_entries moves with it.

The engine keeps no sampling defaults anymore: config's default_* fields are
gone and core_diff_kwargs / build_ar_sampling read the request's typed
sampling params directly. The two-engine trainer ships the WHOLE composed
params to the ar_recaption engine (it reads its AR slice for sampling and
the diffusion slice's height/width for the recaption prompt); recipes spell
the sampling values per request.
…b-adapters

Each registered modality adapter is now a thin binder: it constructs an
input_adapter and an output_adapter in __init__ and delegates build_inputs /
build_response to them. Input-side and output-side variation are independent
axes (i2t/it2i share the AR-bearing request side but differ in response
shape), so the per-output-shape inheritance bases leaked duplication across
branches — the build_ar_sampling class-attr borrow, the duplicated AR
build_inputs, and the duplicated image-attach decorate all dissolve into one
parameterized Hi3InputAdapter.

Layout: universal single-stage DiT skeletons live in dit.py (DitInputAdapter
with an extras() hook, DitOutputAdapter with conditions/decode/extra_decoded
hooks); families derive small subclasses in their own file (hi3.py / sd3.py /
hv15.py) — family-specific classes carry the family prefix, universal ones
don't. core_diff_kwargs / sde_extra_args move off the ABC to
utils/diff_kwargs.py as plain functions (they read no adapter state).

Engine-facing surface unchanged: same registry keys, same two verbs, same
knob class-attributes — the suite passes as-is except one test that
monkeypatched via the old module path (adapters.hi3_ar_dit → adapters.hi3).
Registry keys gain a short family prefix matching the adapter file names:

  t2i           -> hi3_t2i          it2i -> hi3_it2i
  i2t           -> hi3_i2t          t2t  -> hi3_t2t
  ar_recaption  -> hi3_ar_recaption dit_recaption -> hi3_dit_recaption
  sd35_t2i      -> sd3_t2i          t2v  -> hv15_t2v

Bare task keys silently bound a task to one family ("t2v" was HV1.5-only)
and collide as soon as another family serves the same task (wan/qwen_image
video+image models already live under models/). Clean break, no aliases —
v2 recipes, the config default, the GPU smoke driver, and test constants are
updated in this commit; the v1 engine and v1 recipes keep their keys.

The upstream chat-template task vocabulary (t2i_think / t2i_recaption / ...)
is unchanged — Hi3InputAdapter derives bot-task variants from its explicit
bot_task_base, not from the registry key.

BREAKING: out-of-tree v2 recipes must update rollout-engine `modality`
values to the family-namespaced keys.
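
For out-of-tree recipes, the rename table above can be applied with a one-off migration helper along these lines. `migrate_modality` is a hypothetical name; since the commit states there are no runtime aliases, this is meant for rewriting recipe files, not for use inside the engine.

```python
# Old bare key -> family-namespaced key, copied from the table above.
MODALITY_RENAMES = {
    "t2i": "hi3_t2i",
    "it2i": "hi3_it2i",
    "i2t": "hi3_i2t",
    "t2t": "hi3_t2t",
    "ar_recaption": "hi3_ar_recaption",
    "dit_recaption": "hi3_dit_recaption",
    "sd35_t2i": "sd3_t2i",
    "t2v": "hv15_t2v",
}


def migrate_modality(value: str) -> str:
    """Return the namespaced key for an old or already-migrated value."""
    if value in MODALITY_RENAMES.values():
        return value  # already migrated, leave untouched
    try:
        return MODALITY_RENAMES[value]
    except KeyError:
        raise ValueError(f"unrecognized modality key: {value!r}") from None
```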
…oded hook

extra_decoded had exactly one real override (the HI3 two-track AR text) plus
a ceremonial no-op default — a hook that never earned its slot. A single
build_decoded(pil_images, frame_groups, per_request) now owns the whole
decoded dict: the base returns {track_name: Images}, hv15 swaps the payload
for packed frame groups, and Hi3ImageOutputAdapter extends via super() with
the best-effort AR text. One method answers "what goes in decoded for this
family", and children derive with plain super() composition instead of
fitting pre-cut slots. Contract note in the docstring: keep the track_name
entry (a missing key silently yields decoded=None on that track).
…s only

The 3-arg signature leaked the skeleton's internal collection step into the
hook contract. decoded may span tracks beyond the DiT one (the HI3 two-track
AR text comes from the AR outputs, not the DiT slices), so the raw wire
groups are its natural currency. The base re-collects its PILs via the new
_collect helper (deliberate and cheap — collect_dit_outputs only gathers
references; pils_to_images still runs once), and the knob-threading for
collect_dit_outputs now lives in one place used by both the skeleton and
the hook implementations.
…tputs directly

A private helper used three times in one small file was ceremony; each call
site now reads self-contained and subclass derivation goes through plain
super(). The knobs are the same self attributes at every site, so the
selection can't drift.
…ed/conditions triple

An output adapter's whole job is to produce the three dicts assemble_tracks
consumes, so the class contract now mirrors that parameter list 1:1: three
parallel hooks — build_segments / build_decoded / build_conditions — all
with the uniform (req, per_request) currency, and build() reduced to guard +
assemble_tracks. The v1-parity AR sweep moves into build_segments (a named,
derivable home instead of unconditional skeleton behavior); conditions
implementations collect their own DiT outputs inline (cheap reference
gathering, same self knobs everywhere).

Hi3TextOutputAdapter adopts the same shape for full uniformity: the
conditions= ctor callable is gone — ar_recaption's fused prompt capture is
now a tiny Hi3ArRecaptionOutputAdapter subclass overriding build_conditions.
Text-extraction failure policies preserved exactly: best-effort in the HI3
two-track build_decoded, fatal in the AR-only one.
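
The three-hook contract described above can be sketched as follows. This is an illustration under assumptions: `assemble_tracks` is a stand-in with a guessed return shape, requests and per-request results are plain dicts, and the class names are hypothetical.

```python
from typing import Any, Dict, List


def assemble_tracks(
    segments: Dict[str, Any],
    decoded: Dict[str, Any],
    conditions: Dict[str, Any],
) -> Dict[str, Dict[str, Any]]:
    """Stand-in assembler: a missing decoded key yields None on that track."""
    return {
        name: {
            "segment": segments[name],
            "decoded": decoded.get(name),
            "conditions": conditions.get(name),
        }
        for name in segments
    }


class OutputAdapter:
    track_name = "image"

    def build_segments(self, req: dict, per_request: List[dict]) -> Dict[str, Any]:
        return {self.track_name: [r["latents"] for r in per_request]}

    def build_decoded(self, req: dict, per_request: List[dict]) -> Dict[str, Any]:
        return {self.track_name: [r["image"] for r in per_request]}

    def build_conditions(self, req: dict, per_request: List[dict]) -> Dict[str, Any]:
        return {}

    def build(self, req: dict, per_request: List[dict]) -> Dict[str, Dict[str, Any]]:
        # Guard, then hand the three dicts straight to the assembler.
        if not per_request:
            raise ValueError("engine returned no results for this request")
        return assemble_tracks(
            self.build_segments(req, per_request),
            self.build_decoded(req, per_request),
            self.build_conditions(req, per_request),
        )


class TwoTrackOutputAdapter(OutputAdapter):
    def build_decoded(self, req: dict, per_request: List[dict]) -> Dict[str, Any]:
        # Extend the base payload via super() instead of filling a pre-cut slot.
        decoded = super().build_decoded(req, per_request)
        decoded["text"] = [r.get("text") for r in per_request]  # best-effort
        return decoded
```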
…_sampling pair

Same principle as the output-adapter triple, read off the input wire type:
GenerateCall has two payload fields, so the input decomposition is a pair —
build_prompts(req) / build_sampling(req) — with build() reduced to pure
assembly. The whole sub-adapter layer now fits one sentence: a build()
orchestrator over build_<wire-field> hooks taking raw currency (req on the
way in, (req, per_request) on the way out).

The extras() hook is gone: hv15's num_frames delta becomes two
super()-extend overrides (the idiom Hi3ImageOutputAdapter.build_decoded
established) instead of a pair-of-dicts return. Hi3InputAdapter's stages
get public names while _resolve_task/_decorate/_ar_sampling/_dit_sampling
stay private inside them. Hi3DitRecaptionInputAdapter keeps its wholesale
build() — its prompts and sampling are paired per single-prompt call (seed +
gid slice decided together), the documented call-topology exception.
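
A sketch of the input-side pair, with build() as pure assembly and a video subclass adding num_frames through super(). `GenerateCall` here is a two-field stand-in and the dict-shaped request is an assumption; only the hook names come from the commit message.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class GenerateCall:
    """Stand-in wire type with the two payload fields the text mentions."""

    prompts: List[Dict[str, Any]]
    sampling: Dict[str, Any]


class DitInputAdapter:
    def build_prompts(self, req: dict) -> List[Dict[str, Any]]:
        return [{"prompt": text} for text in req["prompts"]]

    def build_sampling(self, req: dict) -> Dict[str, Any]:
        # Read the request's sampling params directly; no engine defaults.
        return {"num_inference_steps": req["sampling"]["num_inference_steps"]}

    def build(self, req: dict) -> GenerateCall:
        return GenerateCall(self.build_prompts(req), self.build_sampling(req))


class VideoInputAdapter(DitInputAdapter):
    def build_sampling(self, req: dict) -> Dict[str, Any]:
        sampling = super().build_sampling(req)
        sampling["num_frames"] = req["sampling"]["num_frames"]
        return sampling
```

Only the sampling override is shown; the commit describes a matching override on the other hook that follows the same super()-extend pattern.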
…ils/conditions.py

Each of the four condition builders had exactly one v2 caller and was 100%
family-specific (family capture keys, family padding rules, family condition
types) — family conversion logic wearing a utils costume, organized by kind
instead of by family. They now live where the placement rule says:

- build_hv15_conditions  -> Hv15VideoOutputAdapter.build_conditions
- build_sd3_text_condition -> Sd3OutputAdapter.build_conditions
- build_fused_mm_condition -> hi3_fused_conditions (module-level: two
  consumers in hi3.py)
- build_ar_fused_condition -> hi3_ar_fused_conditions

This also fixes a layering smell: utils/ no longer imports
unirl.models.hunyuan_image3 — the family-agnostic layer is now actually
family-agnostic. Error text preserved verbatim; the raise moves from
"wrapper diagnoses the builder's None" to the failed check itself. The
dead empty-input heads are dropped (unreachable post-collect: the build()
guard + collect_dit_outputs's per-request raise guarantee non-empty).
The v1 engine's private _build_* copies in vllm_omni/response.py are
parity code and stay.

Boundary rule: the inline criterion is family-specific LOGIC, not consumer
count — decoded_text_from_ar / seed_from_sample_id / grouped_pils_to_videos
are universal mechanics with single consumers today and stay in utils/.
…xamples/ layout

Adopt the upstream examples/ reorg (#272/#267) for the branch-only v2 files:
recipes/diffusion_rl/sd3_*_v2 -> examples/diffusion/sd3/, the hv15 t2v v2
recipe -> examples/diffusion/hunyuan_video15/, hi3_vllmomni_v2 ->
examples/unified_model/hi3/, and scripts/vllm_omni_v2_gpu_smoke.py ->
examples/. All three relocated recipes compose-check under the new
--config-name roots (hydra config_path is ../examples post-reorg).

ruff check/format + hook fixes on files this branch adds: drop unused
imports (test fixtures, native.py STAGE_KIND_DIFFUSION), sort
ipc_receive_mixin imports, format reflow, executable bit + TYPE_CHECKING
RolloutReq import + relocated usage path for the GPU smoke driver.

pack_initial_noise_extra_args ships x_T as EITHER a materialized
initial_noise_batch tensor OR a recipe (init_noise_group_ids +
init_noise_latent_shape + init_noise_seed), but the hv15 pipeline only
handled the batch form — a t2v request shipping the recipe form was
silently ignored and upstream RNG drew x_T instead (the x_T-collapse
class of bug the HI3 pipeline fails loudly on). Mirror sd3's recipe
branch: regenerate this request's row byte-identically via
NoiseRecipe(...).resolve().

CPU-verified only (lint + suite); the pipeline executes on the GPU pod —
covered by the next smoke/e2e run there.
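
The two x_T forms can be sketched as a single resolver that handles both and refuses a half-specified recipe. This is a dependency-free illustration: it uses Python's `random` and flat lists instead of tensors, it does not reproduce the PR's NoiseRecipe math, and the assumption that rows sharing a group id share noise is mine.

```python
import random
from typing import Any, Dict, List, Optional

RECIPE_KEYS = ("init_noise_group_ids", "init_noise_latent_shape", "init_noise_seed")


def resolve_initial_noise(extra_args: Dict[str, Any], row: int) -> Optional[List[float]]:
    """Return this row's x_T, or None when the request ships no noise at all."""
    batch = extra_args.get("initial_noise_batch")
    if batch is not None:
        return list(batch[row])  # materialized form

    present = [key for key in RECIPE_KEYS if key in extra_args]
    if not present:
        return None  # caller falls back to the upstream RNG
    if len(present) != len(RECIPE_KEYS):
        missing = sorted(set(RECIPE_KEYS) - set(present))
        raise ValueError(f"incomplete noise recipe, missing: {missing}")

    # Recipe form: regenerate deterministically from (seed, group id, shape).
    group_id = extra_args["init_noise_group_ids"][row]
    count = 1
    for dim in extra_args["init_noise_latent_shape"]:
        count *= dim
    rng = random.Random(f"{extra_args['init_noise_seed']}:{group_id}")
    return [rng.gauss(0.0, 1.0) for _ in range(count)]
```

Raising on a partial recipe is the point of the sketch: a request that ships noise intent should never fall through to fresh RNG silently.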
…vest naming

The three RL pipeline subclasses follow one protocol around upstream's
forward — install (once, idempotent) / arm (every request) / run (upstream,
with our taps and injectors firing inside) / harvest (export onto the wire).
This makes the protocol literal:

- New pipelines/_shared/interception.py holds the byte-identical mechanics
  (detach_cpu, stamp_custom_output, drain_trajectory_into,
  resolve_request_noise, inject_latents, make_sde_scheduler). Deliberately
  vllm-omni-free (wire objects duck-typed), so the mechanics are CPU-tested
  in the new test_pipeline_interception.py even though the pipelines
  themselves only run inside the GPU worker.
- FlowMatchSDEDiscreteScheduler.arm(eta=..., sde_indices=...) replaces
  private attr-poking at three call sites; counterpart of drain_trajectory.
- Hooks are renamed by purpose, not upstream target: every family now has
  one _install_conditioning_tap (encode_prompt for sd3/hv15,
  prepare_inputs_for_generation for hi3), an initial-noise injector, and
  _arm_*/_harvest_* steps; upstream-defect carriers are suffixed
  _workaround (sd3 T5 truncation; hv15's _sigma_override is now a
  try/finally contextmanager). forward() reads as the protocol.
- hi3/sde_scheduler.py (dead re-export shim) deleted.

Wire contract unchanged: custom_output keys, trajectory_* semantics,
extra_args reads, error prose, recipe regeneration math, first-call-only
capture. CPU-verified (lint + 71 tests incl. 12 new); the pipelines execute
only on the GPU pod — covered by its next smoke/e2e run.
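
The install / arm / run / harvest protocol might look roughly like this on a toy pipeline. `UpstreamPipeline` and the string payloads are stand-ins; only the four-step shape and the `_install_conditioning_tap` / `custom_output` names come from the commit message.

```python
from typing import Any, Dict, List


class UpstreamPipeline:
    """Toy stand-in for an upstream pipeline we cannot edit."""

    def encode_prompt(self, prompt: str) -> str:
        return f"emb({prompt})"

    def forward(self, prompt: str) -> Dict[str, Any]:
        return {"image": f"img[{self.encode_prompt(prompt)}]"}


class RLPipeline(UpstreamPipeline):
    def __init__(self) -> None:
        self._installed = False
        self._captured: List[str] = []

    def _install_conditioning_tap(self) -> None:
        """Install once, idempotently: wrap encode_prompt to record outputs."""
        if self._installed:
            return
        original = self.encode_prompt

        def tapped(prompt: str) -> str:
            out = original(prompt)
            self._captured.append(out)
            return out

        self.encode_prompt = tapped  # instance attribute shadows the method
        self._installed = True

    def _arm(self) -> None:
        """Reset per-request state before every run."""
        self._captured.clear()

    def _harvest(self, result: Dict[str, Any]) -> Dict[str, Any]:
        """Export what the tap saw onto the wire result."""
        result["custom_output"] = {"conditioning": list(self._captured)}
        return result

    def forward(self, prompt: str) -> Dict[str, Any]:
        self._install_conditioning_tap()
        self._arm()
        result = super().forward(prompt)  # upstream runs; the tap fires inside
        return self._harvest(result)
```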
…Image rollout path

Adds the ninth modality to the v2 engine (Qwen-Image RL was trainside-only;
the trainer bundle in models/qwen_image/ is the parity oracle). Single
diffusion stage, TP=1; no engine/weight_sync/backends/patches changes —
upstream v0.20.0 consumes sampling_params.sigmas natively and the generic
LoRA manager maps rank-64 adapters onto the fused to_qkv via
stacked_params_mapping.

Two model-specific seams, everything else the SD3 pattern:

- CFG semantics: upstream treats negative_prompt "" as present and defaults
  true_cfg_scale to 4.0 — the family input adapter omits the key when CFG
  is off (guidance 1.0, the oracle setting) and always maps guidance_scale
  onto true_cfg_scale. CFG-on is supported end-to-end: the encode tap
  captures the negative encode and build_conditions emits negative_text.
- Packed-latent boundary: the worker loop runs [B, S, C*4]; the trainer
  contract is [B, C, H, W]. RLQwenImagePipeline packs the driver x_T before
  prepare_latents injection and unpacks the harvested trajectory to
  [B, T+1, C, H, W] (upstream _unpack_latents is 5D; frame dim squeezed).

Also: max_sequence_length pinned to model_config (512) when the request
doesn't set one (upstream defaults 1024); ragged-pad concat for the
variable-length Qwen2.5-VL captures; stage YAML with max_lora_rank 64;
dancegrpo vllmomni_v2 recipe = trainside twin + rollout/sync delta only.

CPU suite: 80 passed (CFG trap both ways, msl pinning, ragged-pad mask
sums, negative_text + mixed-capture raise, σ-echo, K=0 NFT, registry
knobs). GPU smoke + trainside parity folded into the standing gate.
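
The CFG mapping from the first seam above, sketched as a small kwargs builder. The function name, the `> 1.0` threshold for "CFG on", and the empty-string fallback for a missing negative prompt are assumptions; the source only states that the key is omitted when CFG is off and that guidance_scale always maps onto true_cfg_scale.

```python
from typing import Any, Dict, Optional


def qwen_cfg_kwargs(
    guidance_scale: float, negative_prompt: Optional[str] = None
) -> Dict[str, Any]:
    """Build CFG-related kwargs without tripping upstream's defaults."""
    # Always set true_cfg_scale explicitly so the upstream default of 4.0
    # never applies.
    kwargs: Dict[str, Any] = {"true_cfg_scale": guidance_scale}
    if guidance_scale > 1.0:
        # CFG on: pass a negative prompt (assumed fallback: empty string).
        kwargs["negative_prompt"] = negative_prompt if negative_prompt is not None else ""
    # CFG off: omit the key entirely, since upstream treats "" as present.
    return kwargs
```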
…DE) vllmomni recipe

Smoke driver gains a SMOKE_MODALITY env (default sd3_t2i) with a per-family
model-config shim — qwen needs use_dynamic_shifting + dynamic_shift_overrides
(σ policy) and the max_sequence_length pin. New recipe
qwen_image_grpo_vllmomni_v2 = the dancegrpo_vllmomni_v2 twin with
FlowSDEStrategy (matches qwen_image_trainside), per the chosen e2e base.
…gmas

Diffusers' set_timesteps has a THIRD sigma mutation beyond the
static/dynamic shift branches: the shift_terminal whole-schedule stretch,
applied even to passed-in sigmas. SD3.5 configs leave it null; the
Qwen-Image-2512 checkpoint ships shift_terminal: 0.02, which stretched the
engine-pinned [.., 0.3584] into [.., 0.0200] on the worker and tripped the
sigma-echo gate in the qwen GPU smoke (max abs diff 3.384e-01 == exactly
0.3584 - 0.0200, the terminal displacement under the (1-0.02)/(1-0.3584)
stretch of 1-σ). Null it inside the same
transient-override block; restore via finally. Regression tests for both
config shapes.
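
The arithmetic behind the stretch can be checked with a few lines. This re-implements the terminal stretch as I understand it from diffusers' FlowMatchEulerDiscreteScheduler (rescale 1-σ so the last value lands on 1 - shift_terminal); the intermediate sigma values are hypothetical, since the commit only quotes the terminal one.

```python
from typing import List


def stretch_to_terminal(sigmas: List[float], shift_terminal: float) -> List[float]:
    """Rescale 1 - sigma so the final sigma equals shift_terminal."""
    one_minus = [1.0 - s for s in sigmas]
    scale = one_minus[-1] / (1.0 - shift_terminal)
    return [1.0 - (value / scale) for value in one_minus]


# Hypothetical pinned schedule; only the terminal 0.3584 is from the source.
pinned = [1.0, 0.75, 0.3584]
stretched = stretch_to_terminal(pinned, shift_terminal=0.02)
max_abs_diff = max(abs(a - b) for a, b in zip(pinned, stretched))
```

The first sigma (1.0) is a fixed point of the stretch and the displacement grows toward the end of the schedule, so the largest difference sits at the terminal entry: 0.3584 - 0.02 = 0.3384.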
…TE gate

Probe post-mortem (8-replica colocate, both failures at engine boot):

1) DistNetworkError port 30005: at the v0.20.0 pin the stage_configs_path
   route strips master_port from the kwargs that feed the per-stage
   base_engine_args merge (only enable_sleep_mode/lora_* get re-injected),
   so every stage settles from the SHARED (None or 30005)+rand(0,100)
   window. SD3's fast boots won the 37-stride scan's TOCTOU races; Qwen's
   ~35s boots lose them. New patch_master_port_unstrip() re-attaches the
   caller's reserved per-replica base through _strip_single_engine_args —
   YAML keys still win, scan stays as fallback. DELETE-WHEN pin >= 0.21.

2) CUDA OOM (50 MiB free at engine TE load): the trainer-side bundle loads
   the Qwen2.5-VL text encoder (~15 GiB/rank) that the separate-engine
   path never uses — the engine encodes prompts and the trainer replays
   captured conditions. New QwenImagePipelineConfig.load_text_encoder
   (default true; vllmomni recipes set false) gates it; pipeline builds no
   text_embed stage without it and generate() raises a directed error.

Eight simultaneous engine boots each hold ~20 GiB anon RSS during weight
materialization; the burst blows the pod's k8s memcg limit and the kernel
OOM-kills raylet (ActorUnavailableError, probe-b post-mortem in dmesg).
Exclusive flock on /tmp/diffrl_omni_boot.lock makes the load window
single-file per node; DIFFRL_OMNI_BOOT_SERIALIZE=0 opts out. Also narrows
the master-port settle TOCTOU as a side effect.
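
A minimal sketch of the flock-based serialization described above, using the lock path and opt-out variable quoted in the commit. `serialized_boot` is a hypothetical name, and `fcntl` makes this Unix-only.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def serialized_boot(
    lock_path: str = "/tmp/diffrl_omni_boot.lock",
    env_var: str = "DIFFRL_OMNI_BOOT_SERIALIZE",
) -> Iterator[None]:
    """Hold an exclusive per-node file lock for the duration of the block."""
    if os.environ.get(env_var, "1") == "0":
        yield  # opted out: run without serialization
        return
    # Append mode creates the file if needed without truncating it.
    with open(lock_path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until peers release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Wrapping only the weight-materialization window, for example `with serialized_boot(): engine = boot_engine()` where `boot_engine` is a placeholder, keeps the rest of startup concurrent.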
probe-c post-mortem: engine model 53.7 GiB + dummy run OOM'd at 116 MiB
free — the colocated trainer's caching allocator held ~40 GiB
reserved-but-unallocated (full per-rank model load before FSDP shard),
invisible to the engine subprocess. empty_cache() in boot() (which runs
inside the trainer's ray actor) returns it to the driver first.

probes b/d died at the TRAINER actor-pool bootstrap (handle.py create_remote),
not the engine boots: 8 ranks materializing the 20B transformer concurrently
hold ~20-23 GiB anon each and blow the pod's ~439 GiB memcg ceiling — the
kernel kills raylet (ActorUnavailableError). Same flock pattern as the
engine-boot serialization, on the bundle's heavy from_pretrained window;
gc before lock release so the serialized peak actually holds.
DIFFRL_MODEL_LOAD_SERIALIZE=0 opts out.

probe-e crossed the entire infra gauntlet (serialized loads, serialized
engine boots, cache flush, generation, sleep) and died in the first
TRAINING replay forward: the installed diffusers' QwenImage RoPE builder
requires txt_seq_lens (max() over it sizes the text frequency slice) and
predict_noise only passed the attention mask. Derive per-sample lengths
from the mask for both CFG branches.
…ength

probe-f: replay microbatches carry the batch-wide pad width (18) while
max(txt_seq_lens) reflects their own true max (12); diffusers RoPE applies
over the tensor width but slices freqs by txt_seq_lens -> shape crash in
apply_rotary_emb_qwen. Trim embeds+mask to the slice's true max so width
== max(txt_seq_lens) by construction (padding beyond it is meaningless).
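
The two RoPE fixes (deriving txt_seq_lens from the mask, then trimming to the slice's true max) can be sketched with plain lists. The real code operates on tensors; the function name is hypothetical and right-padded masks are assumed.

```python
from typing import List, Sequence, Tuple


def trim_to_true_max(
    embeds: Sequence[Sequence[float]], mask: Sequence[Sequence[int]]
) -> Tuple[List[List[float]], List[List[int]], List[int]]:
    """Derive per-sample lengths from the mask and drop excess padding.

    Assumes right padding: each mask row is a run of 1s followed by 0s.
    """
    txt_seq_lens = [sum(row) for row in mask]
    width = max(txt_seq_lens)
    trimmed_embeds = [list(row[:width]) for row in embeds]
    trimmed_mask = [list(row[:width]) for row in mask]
    return trimmed_embeds, trimmed_mask, txt_seq_lens


# A microbatch padded to the batch-wide width 18 whose true max is 12.
mask = [[1] * 12 + [0] * 6, [1] * 9 + [0] * 9]
embeds = [[0.5] * 18, [0.25] * 18]
trimmed_embeds, trimmed_mask, txt_seq_lens = trim_to_true_max(embeds, mask)
```

After trimming, the tensor width equals max(txt_seq_lens), which is the invariant the commit says the RoPE path needs.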
…se — the 60-rollout e2e silently skipped wandb)

AC-off training leaves the activation peak (~30-40 GiB on Qwen-Image at
mbs=1) reserved in the trainer's caching allocator; the colocated engine
then OOMs re-mapping its weights on wake (qwen e2e-c: 'Tried to allocate
2.00 MiB ... 2.38 MiB free' in rollout 2's generate). empty_cache() before
wake_up() returns it to the driver each cycle.
celve added 21 commits June 10, 2026 14:29
…CAST body)

e2e-d reproduced e2e-c's rollout-2 OOM despite the trainer-side flush:
train_step executes driver-side while the train-phase peak lives in the 8
ray actors. wake_up()'s body runs per-actor — flush there, immediately
before the engine re-maps its weight pool.
…rate-engine anchor

The qwen recipes inherited DiffusionGRPO's default 'native' (engine-emitted
rollout logp as the PPO anchor). Correct for trainside; across processes it
bakes the rollout<->replay numeric discrepancy (bf16-stored trajectory vs
in-flight fp32, engine vs trainer logp conventions) into every first-epoch
ratio: observed 1±2-5e-5 vs sd3's replay-anchored 1±7e-6, against a 1e-4
clip range — the trust region half-consumed by anchor error before any real
policy movement. Every sd3 separate-engine recipe sets replay; the qwen
clones missed it.
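
A small worked example of why the anchor matters, using the numbers quoted above. It assumes the usual PPO ratio exp(logp_new - logp_old); the specific log-probability value is made up for illustration.

```python
import math


def ppo_ratio(logp_new: float, logp_old: float) -> float:
    """Importance ratio between the current policy and the anchor."""
    return math.exp(logp_new - logp_old)


clip_range = 1e-4      # quoted clip range
anchor_error = 5e-5    # upper end of the observed native-anchor offset

# Replay anchor: the trainer recomputes logp_old itself, so with unchanged
# parameters the first-epoch ratio is 1 up to its own numeric noise.
replay_ratio = ppo_ratio(-3.2, -3.2)

# Native anchor: the engine's logp differs slightly from the trainer's, and
# that offset shows up in the ratio before the policy has moved at all.
native_ratio = ppo_ratio(-3.2 + anchor_error, -3.2)

consumed_fraction = anchor_error / clip_range
```

With these values `consumed_fraction` is 0.5, matching the commit's description of a trust region half-consumed by anchor error.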
Single-node 1x8 port of hunyuan_video15_t2v_vllmomni_nccl_separate_v2.yaml.
All 8 GPUs time-share train+rollout (colocate, default layout) so it runs on
one pod instead of 2x8. Numerically faithful: per-train-worker batch stays 32.

Deltas vs separate: num_devices 16->8 (drop layout/train_fraction);
enable_sleep_mode true; sync NCCLWeightSync -> IPCWeightSync(lora_merged=true)
keeping merged full-weight semantics over same-node CUDA-IPC; add
old_logp_source=replay (separate-engine anchor, prevents flat reward).
…y venv)

The colocate recipe used IPCWeightSync, whose worker mixin
(BucketedIPCReceiveMixin.__new__) eagerly imports sglang's
monkey_patch_torch_reductions for CUDA-IPC handle unpickling. On the two-venv
image, sglang lives only in .venv-sglang (cu130), not the vllm-omni .venv
(cu129), so every engine worker crashed at construction with
ModuleNotFoundError: No module named 'sglang'.

- Recipe: switch sync IPCWeightSync -> LocalLoraWeightSync (the proven sd3/qwen
  v2 colocate bridge). In-process, SGLang-free; engine runs base+adapter = the
  merged model. hv15 stage config already sets enable_lora/max_lora_rank=64.
- Mixin: make the eager sglang import in __new__ degrade gracefully when sglang
  is absent (it is only needed for the IPC receive path; lazy imports inside
  update_weights_from_ipc still raise clearly if IPC is used without sglang).
… over wire)

hv15 t2v decoded a valid video (out.output [B,3,F,H,W]) but the trainer-side
response saw empty .images -> collect_dit_outputs 'no PIL images'. Root cause:
the engine post-processes video into PIL frames, but PIL image lists do NOT
survive the engine worker->client wire for video (only tensors on custom_output
/ trajectory_* cross; verified the wire result carries only trajectory_latents).

Fix: RL pipeline stamps the decoded video tensor onto
custom_output['rl_decoded_video']; collect_dit_outputs rebuilds per-sample PIL
frames from it (VideoProcessor.postprocess_video) when .images is empty for
final_output_type=video. Image modalities unaffected. Removes the temp diags.
…gment

LatentSegment/Segment dropped the sample_indices row->sample identity mapping
(base.py: 'nothing read it, so it was removed') in the upstream-squash rebase,
but build_image_segment still passed it -> LatentSegment.__init__() got an
unexpected keyword argument 'sample_indices'. First surfaced by hv15 (first
modality to reach build_segments post-rebase); affects all diffusion modalities.
TextSegment.pack (AR path) keeps sample_indices — packed batches use it.

set_lora_handle serializes LoRA tensors to the Omni subprocess via sglang's
MultiprocessingSerializer (CUDA-IPC), absent from the vllm-omni-only venv ->
ModuleNotFoundError at the first weight_sync.sync() (after a full generate+
train). The engine already has a sglang-free byte-copy transport (set_lora_copy:
torch.save+base64; receiver set_lora_from_tensor_dict_copy: torch.load). Fall
back to it when sglang is unavailable; covers initial sync + wake-restore (both
route through set_lora_handle). LoRA is tiny so per-rank copy is free.
… sglang)

The v2 engine's weight-sync paths imported sglang's MultiprocessingSerializer /
monkey_patch_torch_reductions / FlattenedTensorBucket directly -> crash on the
vllm-omni-only venv (two-venv image; sglang lives in .venv-sglang). Repoint all
v2 sites to the already-vendored sgl_compat (verbatim sglang 0.5.10.post1, under
vllm_omni/weight_sync) — trainer + worker import the SAME module so CUDA-IPC
pickles round-trip. Reverts the byte-copy fallback (native) + graceful-skip
(__new__) stopgaps; the vendored path is the engine's intended sglang-free
design (v1 already uses it). Sites: __new__ monkey_patch,
update_weights_from_tensor, set_lora_from_tensor_dict; native.set_lora_handle;
full/tensor.sync.

- delete the v1 rollout engine package (rollout/engine/vllm_omni)
- rename vllm_omni_v2 -> vllm_omni: package dir, classes
  VLLMOmniV2{RolloutEngine,EngineConfig} -> VLLMOmni{...}, tests dir,
  and example scripts (drop the _v2 suffix everywhere)
- hoist the shared sgl_compat shim to distributed/weight_sync/transfer/
  (alongside its already-hoisted siblings) and repoint the 5 import sites;
  v2 no longer reaches back into the v1 package
- keep the v1 example yamls, rewired to the v2 engine: modality rename
  (sd35_t2i->sd3_t2i, t2v->hv15_t2v, ar_recaption->hi3_ar_recaption,
  dit_recaption->hi3_dit_recaption) + drop the default_* sampling fields
  (the v2 engine validates modality against its adapter registry and
  carries no sampling defaults); merge away the _v2 twins
- collapse the redundant _VLLM_OMNI_V2_ENGINE_TARGET_SUFFIX in config/validation
- strip stale strangler/coexists-with-v1 framing from the engine __init__

Verified: engine + trainer imports resolve; example _target_ dotpaths
resolve; unit suite 68 pass / 14 pre-existing env-fails — identical to
pre-change HEAD (zero regressions). GPU colocate e2e regression pending on pod.

Drop tests/ from the tree and apply ruff --fix (import ordering, induced by
the vllm_omni rename) plus ruff-format across the affected files.
celve requested a review from haonan3 June 11, 2026 12:55