feat(rollout): vLLM-Omni v2 rollout engine — canonical vllm_omni, retire v1 by celve · Pull Request #36 · Tencent-Hunyuan/UniRL

celve · 2026-06-11T12:45:00Z

Summary

Promotes the v2 vLLM-Omni rollout engine to the canonical vllm_omni and retires the original v1 engine. v2 is a thin engine core over a single backend seam with per-modality adapters, replacing v1's monolithic implementation.

Architecture — `unirl/rollout/engine/vllm_omni/`

backends/ — the only code that imports the vLLM-Omni runtime (boot, ports, per-stage collective_rpc).
adapters/ — per-modality registry keyed on config.modality; owns the RolloutReq↔RolloutResp conversion and per-modality topology knobs.
pipelines/ / worker/ / patches/ — worker-side role packages (custom diffusion pipelines, weight-sync receive mixins, runtime hijacks).
weight_sync.py — canonical sync ops + LoRA lifecycle over the seam.

Supported modalities: SD3.5 (sd3_t2i), HunyuanImage-3 (hi3_*), HunyuanVideo-1.5 t2v (hv15_t2v), Qwen-Image (qwen_image_t2i).
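
A per-modality registry keyed on `config.modality` could look roughly like the sketch below. It assumes decorator-based registration; `ADAPTERS`, `register_adapter`, `resolve_adapter` and `Sd3T2iAdapter` are hypothetical names for illustration, and only the `sd3_t2i` key comes from the list above.

```python
from typing import Callable, Dict, Type

# Hypothetical registry: modality key -> adapter class.
ADAPTERS: Dict[str, Type] = {}


def register_adapter(modality: str) -> Callable[[Type], Type]:
    """Class decorator that records an adapter under its modality key."""
    def decorator(cls: Type) -> Type:
        if modality in ADAPTERS:
            raise ValueError(f"duplicate modality key: {modality!r}")
        ADAPTERS[modality] = cls
        return cls
    return decorator


@register_adapter("sd3_t2i")
class Sd3T2iAdapter:
    """Stand-in adapter; the real one owns the request/response conversion."""


def resolve_adapter(modality: str) -> Type:
    """Look up the adapter for config.modality, failing loudly on unknown keys."""
    try:
        return ADAPTERS[modality]
    except KeyError:
        known = ", ".join(sorted(ADAPTERS))
        raise ValueError(f"unknown modality {modality!r}; registered: {known}") from None
```

Validating the configured modality against such a registry lets an unknown key fail at config time instead of somewhere inside generation.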

v1 retirement / promotion

Deleted the v1 engine package; renamed vllm_omni_v2 → vllm_omni and classes VLLMOmniV2{RolloutEngine,EngineConfig} → VLLMOmni{…}.
Hoisted the shared sgl_compat shim (vendored sglang CUDA-IPC reductions / serializer, so the engine venv needs no sglang) into unirl/distributed/weight_sync/transfer/, beside its siblings bucketed_transfer / ipc_dispatch / checksum — no engine package owns cross-engine transfer code.
Collapsed the redundant engine-target suffix in config/validation.py; example recipes rewired to the v2 engine (modality keys + request-driven sampling).

Verification

End-to-end SD3.5 GRPO colocate on H20×8: boot → generate → PickScore → train loop runs clean; reward climbs off the ~0.76 baseline (rollouts 1→6: 0.7626 → 0.7946), tracking the reference curve.
Every recipe _target_ resolves (static guard); ruff clean.
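
The PR does not show how the `_target_` guard is implemented. A minimal sketch, assuming recipes are already loaded into nested dicts and lists, is below; it imports the target modules, so a strictly static guard would need a different approach. All function names are hypothetical.

```python
import importlib
from typing import Any, Iterator, List


def iter_targets(node: Any) -> Iterator[str]:
    """Yield every `_target_` string found in a nested dict/list config."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "_target_" and isinstance(value, str):
                yield value
            else:
                yield from iter_targets(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_targets(item)


def target_resolves(dotpath: str) -> bool:
    """Return True if `pkg.module.Attr` imports and the attribute exists."""
    module_name, _, attr = dotpath.rpartition(".")
    if not module_name:
        return False
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


def unresolved_targets(config: Any) -> List[str]:
    """Collect the dotpaths that fail to resolve, in traversal order."""
    return [t for t in iter_targets(config) if not target_resolves(t)]
```

An empty result from `unresolved_targets` for every recipe corresponds to the "every `_target_` resolves" claim.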

Notes

Unit tests are not included in this PR.

…omni)

Rewrite of the vllm-omni rollout engine to the rollout-engine layout spec v2, mirroring the sglang_diffusion reference. Coexists with v1 under the distinct vllm_omni_v2 name; opt-in via cloned recipes; v1 retires after the parity gate. Covers all 8 v1 modalities.

- backends/: the one runtime seam, split into base.py (Backend protocol + GenerateCall/StageSampling intent + OmniRawResult wire protocol, zero runtime code) and native.py (lazy Omni import, boot with stage-YAML overlays + temp file, request-id grouping, sleep/wake tasks, per-stage collective_rpc weight-sync verbs, tokenize_prompt).
- adapters/: registry keyed on modality; per-shape bases ar_dit/ar_only/dit_image/dit_video lift v1's request.py/response.py modality branches into overridable steps (incl. dit_recaption's per-prompt seeded calls).
- utils/: pure ports of the segment/condition/prompt/σ mechanics.
- weight_sync.py: one component owning all sync/LoRA state; v1-parity active LoRA re-push on wake (restore_lora_after_wake) + the mark_weights_released event; both LoRA transports (handle/byte-copy).
- config.py: v1 fields verbatim + live-registry modality validation + server_intent; VLLMOmniPorts self-reserved bind-to-0 master ports ride each stage's engine_args.master_port, which kills the 30200+rank*200+idx*50 port math, the RANK-env fallback, and the per-rank YAML rewrites.
- worker/ + pipelines/ + patches/: role-8 copies (v1 keeps its own, since its stage YAMLs reference v1 qualnames until retirement); patches/ carries the install() bundle with per-patch DELETE-WHEN notes.
- Shared infra: rollout/engine/ports.py (ReservedPorts, ported from the LIN-371 branch); bucketed_transfer/ipc_dispatch/checksum hoisted to the neutral distributed/weight_sync/transfer/ package (v1 paths re-export; 5 trainer-side import sites repointed).
- recipes: 8 *_v2.yaml opt-in clones (only the rollout _target_ lines changed); config/validation.py IPC gate accepts the v2 engine.
- tests/rollout/vllm_omni_v2/: 55 CPU tests covering registry/topology parity vs the v1 frozensets, per-shape build_inputs/build_response, utils mechanics, WeightSync lifecycle, bare-engine sequencing, the dispatch-marker guard, ports, and a subprocess CPU-importability check with vllm/vllm_omni/sglang blocked.

GPU smoke (incl. reserved==settled port assertion) and the fixed-seed parity gate vs v1 remain before v1 retirement.
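
The contrast between the old port arithmetic and bind-to-0 reservation can be sketched as follows. `reserve_port` and `legacy_port` are hypothetical names; the actual ReservedPorts helper is not shown in this PR and may, for example, keep the socket open to avoid the release-then-rebind race that this simplified version has.

```python
import socket
from contextlib import closing


def reserve_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a free TCP port by binding to port 0, then release it."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def legacy_port(rank: int, idx: int) -> int:
    """The v1 formula quoted in the commit message: 30200 + rank*200 + idx*50."""
    return 30200 + rank * 200 + idx * 50
```

The formula yields a fixed port per (rank, idx) regardless of what else is listening on the node, whereas binding to port 0 lets the kernel pick a port that is free at that moment.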

… drop the YAML rewrite

Audit of the vllm-omni source (pin v0.20.0 + upstream main) showed the seam was reconstructing machinery the runtime already provides:

- Boot now passes the PRISTINE packaged stage YAML plus enable_sleep_mode / master_port as Omni ctor kwargs; vllm-omni merges them into every stage's engine_args itself (load_stage_configs_from_yaml base_engine_args channel + the dedicated sleep-mode injection in _resolve_stage_configs, verified at the pin AND main; no stage YAML of ours defines either key, so the behavior is identical to the old overlay). Deletes _overlay_stage_tree, the yaml/tempfile load-overlay-write-unlink machinery, and the shutdown temp-file cleanup. Known cosmetic: upstream logs a spurious "top-level engine args are ignored: enable_sleep_mode" warning before its dedicated block applies the kwarg.
- tp_per_stage reads back from omni.stage_configs (the runtime's own merged per-stage configs) via _tp_from_stage_configs, replacing the pre-overlay YAML re-parse (_parse_tp_per_stage) and its ordering caveat.
- VLLMOmniPorts: two per-stage fields -> one reserved master-port BASE. At the pin, OmniDiffusionConfig.__post_init__ adds rand(0,100) + a 37-stride bind-check scan even to injected ports (so per-stage reserved ports were never honored verbatim anyway); from v0.21.0rc2 (#3803) the base is honored verbatim. Docstrings carry the MASTER_PORT-env-precedence upgrade landmine.
- patches/__init__: DELETE-WHEN corrections from the audit: flow-alignment deletable at pin >= v0.21 (upstream removed the v0.20.0 KV API), lora_request passthrough confirmed still absent at main (upstream-PR opportunity noted), tokenizer ratio-36 nuance (upstream now raises a clean ValueError; Base ckpt still needs the 0-fallback).
- Tests: ports tests adapted to the single field; new table-driven _assemble_omni_kwargs layering tests (dedicated keys beat the omni_extra escape hatch; hatch still beats timeout/mode defaults, matching v1) and a _tp_from_stage_configs test (mapping- and attr-style entries).

pytest tests/ -> 59 passed. Boot mechanics only: no rollout-output surface touched; the pending parity gate is unaffected.
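
The layering rule exercised by those tests (dedicated keys over the escape hatch, escape hatch over defaults) can be written as an ordered dict merge. This is a sketch of the precedence only; `assemble_kwargs_sketch` and the `timeout` key are hypothetical, and the real `_assemble_omni_kwargs` signature is not shown in the PR.

```python
from typing import Any, Dict, Optional


def assemble_kwargs_sketch(
    defaults: Dict[str, Any],
    omni_extra: Optional[Dict[str, Any]],
    dedicated: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge three layers; later layers win on key collisions."""
    merged: Dict[str, Any] = {}
    merged.update(defaults)          # lowest priority: timeout/mode defaults
    merged.update(omni_extra or {})  # escape hatch overrides the defaults
    merged.update(dedicated)         # dedicated keys always win
    return merged


kwargs = assemble_kwargs_sketch(
    defaults={"timeout": 300},
    omni_extra={"timeout": 60, "enable_sleep_mode": False},
    dedicated={"enable_sleep_mode": True, "master_port": 41000},
)
```

Here `timeout` ends up as the escape-hatch value while `enable_sleep_mode` keeps the dedicated value, which is the behavior the table-driven tests are described as checking.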

Drives the report's pending GPU checks against the live v0.20.0 runtime: boot (sd35_t2i), settled-vs-reserved master-port window assertion (pin semantics: base + rand(0,100) + 37-stride scan), generate, one tensor-bag sync with fingerprint_tensor checksum match, sleep/wake + post-wake generate. Live-LoRA wake re-push is left to the LoRA e2e recipe.
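
The settled-vs-reserved window can be pictured with a small model of the pin's settle rule. This is a sketch under assumptions: the inclusive random offset, the `is_free` probe and the name `settle_port_sketch` are illustrative and not taken from the vllm-omni source.

```python
import random
from typing import Callable


def settle_port_sketch(
    base: int,
    is_free: Callable[[int], bool],
    rng: random.Random,
    stride: int = 37,
) -> int:
    """Start at base + rand(0, 100), then step by `stride` until a port is free."""
    port = base + rng.randint(0, 100)  # assumed inclusive on both ends
    while not is_free(port):
        port += stride
    return port


def in_settle_window(base: int, settled: int) -> bool:
    """Without collisions, the settled port lies within [base, base + 100]."""
    return base <= settled <= base + 100
```

Under this model a settled port outside the window implies at least one bind collision, which is the kind of signal a window assertion can surface.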

Image.to_pil() / Video.to_pils() existed but Images had no batch counterpart, so five call sites re-derived the conversion in three different spellings. Route them all through Images.to_pils(). vllm_omni_v2's images_to_pil(req, n) becomes pil_images_from_req — the helper extracts and validates the request primitive (paralleling texts_from_req); the conversion itself now lives on the type. types/reward.py keeps tensor_frame_to_pil (extra grayscale→RGB + cpu handling) — intentionally not unified.

…t sampling params authoritative Adapter files are now named by model family (hi3_ar_dit / hi3_ar_only / hi3_dit_recaption; dit_image keeps sd35_t2i only). The HI3 chat-template knowledge (task_key / sys_type / output_modalities + the bot_task resolve hook) moves from utils' _TASK_DEFAULTS table onto the adapter classes, and build_prompt_entries moves with it. The engine keeps no sampling defaults anymore: config's default_* fields are gone and core_diff_kwargs / build_ar_sampling read the request's typed sampling params directly. The two-engine trainer ships the WHOLE composed params to the ar_recaption engine (it reads its AR slice for sampling and the diffusion slice's height/width for the recaption prompt); recipes spell the sampling values per request.

…b-adapters Each registered modality adapter is now a thin binder: it constructs an input_adapter and an output_adapter in __init__ and delegates build_inputs / build_response to them. Input-side and output-side variation are independent axes (i2t/it2i share the AR-bearing request side but differ in response shape), so the per-output-shape inheritance bases leaked duplication across branches — the build_ar_sampling class-attr borrow, the duplicated AR build_inputs, and the duplicated image-attach decorate all dissolve into one parameterized Hi3InputAdapter. Layout: universal single-stage DiT skeletons live in dit.py (DitInputAdapter with an extras() hook, DitOutputAdapter with conditions/decode/extra_decoded hooks); families derive small subclasses in their own file (hi3.py / sd3.py / hv15.py) — family-specific classes carry the family prefix, universal ones don't. core_diff_kwargs / sde_extra_args move off the ABC to utils/diff_kwargs.py as plain functions (they read no adapter state). Engine-facing surface unchanged: same registry keys, same two verbs, same knob class-attributes — the suite passes as-is except one test that monkeypatched via the old module path (adapters.hi3_ar_dit → adapters.hi3).

Registry keys gain a short family prefix matching the adapter file names:

- t2i -> hi3_t2i
- it2i -> hi3_it2i
- i2t -> hi3_i2t
- t2t -> hi3_t2t
- ar_recaption -> hi3_ar_recaption
- dit_recaption -> hi3_dit_recaption
- sd35_t2i -> sd3_t2i
- t2v -> hv15_t2v

Bare task keys silently bound a task to one family ("t2v" was HV1.5-only) and collide as soon as another family serves the same task (wan/qwen_image video+image models already live under models/). Clean break, no aliases: v2 recipes, the config default, the GPU smoke driver, and test constants are updated in this commit; the v1 engine and v1 recipes keep their keys. The upstream chat-template task vocabulary (t2i_think / t2i_recaption / ...) is unchanged: Hi3InputAdapter derives bot-task variants from its explicit bot_task_base, not from the registry key.

BREAKING: out-of-tree v2 recipes must update rollout-engine `modality` values to the family-namespaced keys.
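
For out-of-tree recipes, the rename can be applied offline with a lookup table. The mapping below is copied from this commit message; the helper `migrate_modality` is hypothetical and is not part of the engine, which accepts no aliases.

```python
# Old bare task keys -> family-namespaced keys, as listed in the commit message.
RENAMED_MODALITIES = {
    "t2i": "hi3_t2i",
    "it2i": "hi3_it2i",
    "i2t": "hi3_i2t",
    "t2t": "hi3_t2t",
    "ar_recaption": "hi3_ar_recaption",
    "dit_recaption": "hi3_dit_recaption",
    "sd35_t2i": "sd3_t2i",
    "t2v": "hv15_t2v",
}


def migrate_modality(value: str) -> str:
    """Return the namespaced key for a pre-rename value; pass other values through."""
    return RENAMED_MODALITIES.get(value, value)
```

Values that are already namespaced pass through unchanged, so the helper is safe to run twice over the same recipe.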

…oded hook extra_decoded had exactly one real override (the HI3 two-track AR text) plus a ceremonial no-op default — a hook that never earned its slot. A single build_decoded(pil_images, frame_groups, per_request) now owns the whole decoded dict: the base returns {track_name: Images}, hv15 swaps the payload for packed frame groups, and Hi3ImageOutputAdapter extends via super() with the best-effort AR text. One method answers "what goes in decoded for this family", and children derive with plain super() composition instead of fitting pre-cut slots. Contract note in the docstring: keep the track_name entry (a missing key silently yields decoded=None on that track).

…s only The 3-arg signature leaked the skeleton's internal collection step into the hook contract. decoded may span tracks beyond the DiT one (the HI3 two-track AR text comes from the AR outputs, not the DiT slices), so the raw wire groups are its natural currency. The base re-collects its PILs via the new _collect helper (deliberate and cheap — collect_dit_outputs only gathers references; pils_to_images still runs once), and the knob-threading for collect_dit_outputs now lives in one place used by both the skeleton and the hook implementations.

…tputs directly A private helper used three times in one small file was ceremony; each call site now reads self-contained and subclass derivation goes through plain super(). The knobs are the same self attributes at every site, so the selection can't drift.

…ed/conditions triple An output adapter's whole job is to produce the three dicts assemble_tracks consumes, so the class contract now mirrors that parameter list 1:1: three parallel hooks — build_segments / build_decoded / build_conditions — all with the uniform (req, per_request) currency, and build() reduced to guard + assemble_tracks. The v1-parity AR sweep moves into build_segments (a named, derivable home instead of unconditional skeleton behavior); conditions implementations collect their own DiT outputs inline (cheap reference gathering, same self knobs everywhere). Hi3TextOutputAdapter adopts the same shape for full uniformity: the conditions= ctor callable is gone — ar_recaption's fused prompt capture is now a tiny Hi3ArRecaptionOutputAdapter subclass overriding build_conditions. Text-extraction failure policies preserved exactly: best-effort in the HI3 two-track build_decoded, fatal in the AR-only one.
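
The shape of that contract, three parallel hooks feeding `assemble_tracks` with `build()` reduced to a guard, can be sketched as below. The classes, dict payloads and the body of `assemble_tracks` are stand-ins for illustration; only the hook names and the `(req, per_request)` currency come from the commit message.

```python
from typing import Any, Dict, List

Req = Dict[str, Any]
PerRequest = List[Dict[str, Any]]


def assemble_tracks(segments: dict, decoded: dict, conditions: dict) -> dict:
    """Stand-in for the real assemble_tracks: bundle the three dicts."""
    return {"segments": segments, "decoded": decoded, "conditions": conditions}


class OutputAdapterSketch:
    track_name = "image"

    def build(self, req: Req, per_request: PerRequest) -> dict:
        if not per_request:  # guard: nothing came back over the wire
            raise ValueError("empty per_request")
        return assemble_tracks(
            self.build_segments(req, per_request),
            self.build_decoded(req, per_request),
            self.build_conditions(req, per_request),
        )

    def build_segments(self, req: Req, per_request: PerRequest) -> dict:
        return {self.track_name: [r["latents"] for r in per_request]}

    def build_decoded(self, req: Req, per_request: PerRequest) -> dict:
        return {self.track_name: [r["image"] for r in per_request]}

    def build_conditions(self, req: Req, per_request: PerRequest) -> dict:
        return {self.track_name: [r["cond"] for r in per_request]}


class TwoTrackOutputAdapterSketch(OutputAdapterSketch):
    def build_decoded(self, req: Req, per_request: PerRequest) -> dict:
        # Extend the base dict via super() instead of filling a pre-cut slot.
        decoded = super().build_decoded(req, per_request)
        decoded["text"] = [r.get("text") for r in per_request]  # best-effort: None if absent
        return decoded
```

A subclass changes one of the three dicts by overriding exactly one hook, and the base `track_name` entry survives because the override starts from `super()`.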

…_sampling pair Same principle as the output-adapter triple, read off the input wire type: GenerateCall has two payload fields, so the input decomposition is a pair — build_prompts(req) / build_sampling(req) — with build() reduced to pure assembly. The whole sub-adapter layer now fits one sentence: a build() orchestrator over build_<wire-field> hooks taking raw currency (req on the way in, (req, per_request) on the way out). The extras() hook is gone: hv15's num_frames delta becomes two super()-extend overrides (the idiom Hi3ImageOutputAdapter.build_decoded established) instead of a pair-of-dicts return. Hi3InputAdapter's stages get public names while _resolve_task/_decorate/_ar_sampling/_dit_sampling stay private inside them. Hi3DitRecaptionInputAdapter keeps its wholesale build() — its prompts and sampling are paired per single-prompt call (seed + gid slice decided together), the documented call-topology exception.
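
The input side mirrors this with two hooks and a pure-assembly `build()`. In the sketch below the field names on `GenerateCallSketch`, the request dict layout and both adapter classes are assumptions; the commit message says only that GenerateCall has two payload fields and that hv15 extends the base through `super()`. Only the sampling override is shown.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class GenerateCallSketch:
    prompts: List[Dict[str, Any]]  # hypothetical field name
    sampling: Dict[str, Any]       # hypothetical field name


class DitInputAdapterSketch:
    def build(self, req: Dict[str, Any]) -> GenerateCallSketch:
        # Pure assembly: one hook per wire field.
        return GenerateCallSketch(self.build_prompts(req), self.build_sampling(req))

    def build_prompts(self, req: Dict[str, Any]) -> List[Dict[str, Any]]:
        return [{"prompt": text} for text in req["texts"]]

    def build_sampling(self, req: Dict[str, Any]) -> Dict[str, Any]:
        params = req["sampling"]
        return {"height": params["height"], "width": params["width"]}


class Hv15InputAdapterSketch(DitInputAdapterSketch):
    def build_sampling(self, req: Dict[str, Any]) -> Dict[str, Any]:
        sampling = super().build_sampling(req)
        sampling["num_frames"] = req["sampling"]["num_frames"]  # the video-only delta
        return sampling
```

Every sampling value is read from the request, consistent with the engine keeping no sampling defaults of its own.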


…ils/conditions.py

Each of the four condition builders had exactly one v2 caller and was 100% family-specific (family capture keys, family padding rules, family condition types): family conversion logic wearing a utils costume, organized by kind instead of by family. They now live where the placement rule says:

- build_hv15_conditions -> Hv15VideoOutputAdapter.build_conditions
- build_sd3_text_condition -> Sd3OutputAdapter.build_conditions
- build_fused_mm_condition -> hi3_fused_conditions (module-level: two consumers in hi3.py)
- build_ar_fused_condition -> hi3_ar_fused_conditions

This also fixes a layering smell: utils/ no longer imports unirl.models.hunyuan_image3, so the family-agnostic layer is now actually family-agnostic. Error text preserved verbatim; the raise moves from "wrapper diagnoses the builder's None" to the failed check itself. The dead empty-input heads are dropped (unreachable post-collect: the build() guard + collect_dit_outputs's per-request raise guarantee non-empty). The v1 engine's private _build_* copies in vllm_omni/response.py are parity code and stay.

Boundary rule: the inline criterion is family-specific LOGIC, not consumer count. decoded_text_from_ar / seed_from_sample_id / grouped_pils_to_videos are universal mechanics with single consumers today and stay in utils/.

…xamples/ layout Adopt the upstream examples/ reorg (#272/#267) for the branch-only v2 files: recipes/diffusion_rl/sd3_*_v2 -> examples/diffusion/sd3/, the hv15 t2v v2 recipe -> examples/diffusion/hunyuan_video15/, hi3_vllmomni_v2 -> examples/unified_model/hi3/, and scripts/vllm_omni_v2_gpu_smoke.py -> examples/. All three relocated recipes compose-check under the new --config-name roots (hydra config_path is ../examples post-reorg).

ruff check/format + hook fixes on files this branch adds: drop unused imports (test fixtures, native.py STAGE_KIND_DIFFUSION), sort ipc_receive_mixin imports, format reflow, executable bit + TYPE_CHECKING RolloutReq import + relocated usage path for the GPU smoke driver.

pack_initial_noise_extra_args ships x_T as EITHER a materialized initial_noise_batch tensor OR a recipe (init_noise_group_ids + init_noise_latent_shape + init_noise_seed), but the hv15 pipeline only handled the batch form — a t2v request shipping the recipe form was silently ignored and upstream RNG drew x_T instead (the x_T-collapse class of bug the HI3 pipeline fails loudly on). Mirror sd3's recipe branch: regenerate this request's row byte-identically via NoiseRecipe(...).resolve(). CPU-verified only (lint + suite); the pipeline executes on the GPU pod — covered by the next smoke/e2e run there.
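
The recipe form works because the receiver can regenerate the same noise row from the recipe fields alone. The sketch below illustrates that idea with the standard-library RNG as a stand-in for a tensor generator; `NoiseRecipeSketch`, its fields and the seed-mixing rule are assumptions, not the signature of the real NoiseRecipe.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NoiseRecipeSketch:
    seed: int
    group_id: int
    latent_shape: Tuple[int, ...]

    def resolve(self) -> List[float]:
        """Regenerate this group's noise row as a flat list of Gaussian samples."""
        # Assumed mixing rule: one independent stream per (seed, group_id) pair.
        rng = random.Random(self.seed * 1_000_003 + self.group_id)
        count = math.prod(self.latent_shape)
        return [rng.gauss(0.0, 1.0) for _ in range(count)]
```

Sender and receiver that agree on the fields produce identical rows without shipping the noise itself, which is why a pipeline that ignores the recipe silently diverges from the driver's x_T.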

…vest naming

The three RL pipeline subclasses follow one protocol around upstream's forward: install (once, idempotent) / arm (every request) / run (upstream, with our taps and injectors firing inside) / harvest (export onto the wire). This makes the protocol literal:

- New pipelines/_shared/interception.py holds the byte-identical mechanics (detach_cpu, stamp_custom_output, drain_trajectory_into, resolve_request_noise, inject_latents, make_sde_scheduler). Deliberately vllm-omni-free (wire objects duck-typed), so the mechanics are CPU-tested in the new test_pipeline_interception.py even though the pipelines themselves only run inside the GPU worker.
- FlowMatchSDEDiscreteScheduler.arm(eta=..., sde_indices=...) replaces private attr-poking at three call sites; counterpart of drain_trajectory.
- Hooks are renamed by purpose, not upstream target: every family now has one _install_conditioning_tap (encode_prompt for sd3/hv15, prepare_inputs_for_generation for hi3), an initial-noise injector, and _arm_*/_harvest_* steps; upstream-defect carriers are suffixed _workaround (sd3 T5 truncation; hv15's _sigma_override is now a try/finally contextmanager). forward() reads as the protocol.
- hi3/sde_scheduler.py (dead re-export shim) deleted.

Wire contract unchanged: custom_output keys, trajectory_* semantics, extra_args reads, error prose, recipe regeneration math, first-call-only capture. CPU-verified (lint + 71 tests incl. 12 new); the pipelines execute only on the GPU pod, covered by its next smoke/e2e run.
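
The install / arm / run / harvest protocol can be illustrated with a toy pipeline. Everything in this sketch is hypothetical (the upstream class, the string "embeddings", the `conditioning` key); it shows only the control flow: an idempotent tap, per-request reset, first-call-only capture, and export onto the result.

```python
class UpstreamPipelineSketch:
    """Toy stand-in for an upstream pipeline with an encode step."""

    def encode_prompt(self, prompt: str) -> str:
        return f"emb({prompt})"

    def forward(self, prompt: str) -> dict:
        cond = self.encode_prompt(prompt)
        self.encode_prompt("")  # e.g. a second, negative-prompt encode
        return {"image": f"img<{cond}>"}


class RLPipelineSketch(UpstreamPipelineSketch):
    def __init__(self) -> None:
        self._installed = False
        self._captured = None

    def _install_conditioning_tap(self) -> None:
        if self._installed:  # install once, idempotent
            return
        original = self.encode_prompt

        def tapped(prompt: str) -> str:
            out = original(prompt)
            if self._captured is None:  # first-call-only capture
                self._captured = out
            return out

        self.encode_prompt = tapped
        self._installed = True

    def _arm(self) -> None:
        self._captured = None  # reset per request

    def _harvest(self, result: dict) -> dict:
        result["custom_output"] = {"conditioning": self._captured}
        return result

    def forward(self, prompt: str) -> dict:
        self._install_conditioning_tap()
        self._arm()
        result = super().forward(prompt)  # run: the tap fires inside upstream code
        return self._harvest(result)
```

Because the tap is installed on the instance and upstream calls `self.encode_prompt`, the capture happens without modifying the upstream `forward`.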

…Image rollout path

Adds the ninth modality to the v2 engine (Qwen-Image RL was trainside-only; the trainer bundle in models/qwen_image/ is the parity oracle). Single diffusion stage, TP=1; no engine/weight_sync/backends/patches changes: upstream v0.20.0 consumes sampling_params.sigmas natively and the generic LoRA manager maps rank-64 adapters onto the fused to_qkv via stacked_params_mapping. Two model-specific seams, everything else the SD3 pattern:

- CFG semantics: upstream treats negative_prompt "" as present and defaults true_cfg_scale to 4.0, so the family input adapter omits the key when CFG is off (guidance 1.0, the oracle setting) and always maps guidance_scale onto true_cfg_scale. CFG-on is supported end-to-end: the encode tap captures the negative encode and build_conditions emits negative_text.
- Packed-latent boundary: the worker loop runs [B, S, C*4]; the trainer contract is [B, C, H, W]. RLQwenImagePipeline packs the driver x_T before prepare_latents injection and unpacks the harvested trajectory to [B, T+1, C, H, W] (upstream _unpack_latents is 5D; frame dim squeezed).

Also: max_sequence_length pinned to model_config (512) when the request doesn't set one (upstream defaults 1024); ragged-pad concat for the variable-length Qwen2.5-VL captures; stage YAML with max_lora_rank 64; dancegrpo vllmomni_v2 recipe = trainside twin + rollout/sync delta only.

CPU suite: 80 passed (CFG trap both ways, msl pinning, ragged-pad mask sums, negative_text + mixed-capture raise, σ-echo, K=0 NFT, registry knobs). GPU smoke + trainside parity folded into the standing gate.
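
The CFG rule for this family reduces to a small mapping from request parameters to upstream kwargs. The sketch below assumes "CFG off" means a guidance scale of at most 1.0; the function name and the threshold comparison are illustrative, while the key names `true_cfg_scale` and `negative_prompt` come from the commit message.

```python
from typing import Any, Dict


def qwen_cfg_kwargs_sketch(guidance_scale: float, negative_prompt: str = "") -> Dict[str, Any]:
    """Map request guidance onto upstream kwargs without tripping the CFG defaults."""
    # Always pass true_cfg_scale explicitly so the upstream 4.0 default never applies.
    kwargs: Dict[str, Any] = {"true_cfg_scale": guidance_scale}
    if guidance_scale > 1.0:
        # CFG on: an empty string is still a valid negative prompt upstream.
        kwargs["negative_prompt"] = negative_prompt
    # CFG off: omit the key entirely, because even "" counts as "present" upstream.
    return kwargs
```

Omitting the key, rather than passing an empty string, is what keeps a guidance-1.0 request from running the negative branch.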

…DE) vllmomni recipe Smoke driver gains a SMOKE_MODALITY env (default sd3_t2i) with a per-family model-config shim — qwen needs use_dynamic_shifting + dynamic_shift_overrides (σ policy) and the max_sequence_length pin. New recipe qwen_image_grpo_vllmomni_v2 = the dancegrpo_vllmomni_v2 twin with FlowSDEStrategy (matches qwen_image_trainside), per the chosen e2e base.

…gmas Diffusers' set_timesteps has a THIRD sigma mutation beyond the static/dynamic shift branches: the shift_terminal whole-schedule stretch, applied even to passed-in sigmas. SD3.5 configs leave it null; the Qwen-Image-2512 checkpoint ships shift_terminal: 0.02, which stretched the engine-pinned [.., 0.3584] into [.., 0.0200] on the worker and tripped the sigma-echo gate in the qwen GPU smoke (max abs diff 3.384e-01, which is exactly 0.3584 - 0.0200, the displacement of the terminal sigma under the stretch ratio (1-0.02)/(1-0.3584)). Null it inside the same transient-override block; restore via finally. Regression tests for both config shapes.
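
The reported numbers can be reproduced in isolation. The sketch assumes the stretch rescales 1 - σ so that the final value lands on shift_terminal, which is the form used by diffusers' flow-match Euler scheduler as far as I know; the intermediate schedule values are made up, and only 0.3584 and 0.02 come from the commit message.

```python
from typing import List


def stretch_to_terminal(sigmas: List[float], shift_terminal: float) -> List[float]:
    """Rescale (1 - sigma) so that the last sigma becomes shift_terminal."""
    one_minus = [1.0 - s for s in sigmas]
    scale = one_minus[-1] / (1.0 - shift_terminal)
    return [1.0 - (v / scale) for v in one_minus]


pinned = [1.0, 0.8, 0.6, 0.3584]  # illustrative schedule ending at the pinned 0.3584
stretched = stretch_to_terminal(pinned, 0.02)
max_abs_diff = max(abs(a - b) for a, b in zip(pinned, stretched))
```

The largest deviation occurs at the terminal entry, where 0.3584 is moved to 0.02, a difference of 0.3384.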


…TE gate

Probe post-mortem (8-replica colocate, both failures at engine boot):

1) DistNetworkError port 30005: at the v0.20.0 pin the stage_configs_path route strips master_port from the kwargs that feed the per-stage base_engine_args merge (only enable_sleep_mode/lora_* get re-injected), so every stage settles from the SHARED (None or 30005)+rand(0,100) window. SD3's fast boots won the 37-stride scan's TOCTOU races; Qwen's ~35s boots lose them. New patch_master_port_unstrip() re-attaches the caller's reserved per-replica base through _strip_single_engine_args; YAML keys still win, scan stays as fallback. DELETE-WHEN pin >= 0.21.

2) CUDA OOM (50 MiB free at engine TE load): the trainer-side bundle loads the Qwen2.5-VL text encoder (~15 GiB/rank) that the separate-engine path never uses, since the engine encodes prompts and the trainer replays captured conditions. New QwenImagePipelineConfig.load_text_encoder (default true; vllmomni recipes set false) gates it; pipeline builds no text_embed stage without it and generate() raises a directed error.

Eight simultaneous engine boots each hold ~20 GiB anon RSS during weight materialization; the burst blows the pod's k8s memcg limit and the kernel OOM-kills raylet (ActorUnavailableError, probe-b post-mortem in dmesg). Exclusive flock on /tmp/diffrl_omni_boot.lock makes the load window single-file per node; DIFFRL_OMNI_BOOT_SERIALIZE=0 opts out. Also narrows the master-port settle TOCTOU as a side effect.
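
A per-node exclusive lock with an environment opt-out can be written as a context manager. The lock path and environment variable are the ones named above; the function name `serialized_boot` and the parameterization are assumptions, and the sketch is POSIX-only because it relies on `fcntl`.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def serialized_boot(
    lock_path: str = "/tmp/diffrl_omni_boot.lock",
    env_var: str = "DIFFRL_OMNI_BOOT_SERIALIZE",
) -> Iterator[None]:
    """Hold an exclusive flock for the duration of the block unless opted out."""
    if os.environ.get(env_var, "1") == "0":
        yield  # opt-out: run the block without taking the lock
        return
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Wrapping only the weight-materialization window in `with serialized_boot(): ...` keeps the peak anonymous memory of one boot from stacking on top of another on the same node.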

probe-c post-mortem: engine model 53.7 GiB + dummy run OOM'd at 116 MiB free — the colocated trainer's caching allocator held ~40 GiB reserved-but-unallocated (full per-rank model load before FSDP shard), invisible to the engine subprocess. empty_cache() in boot() (which runs inside the trainer's ray actor) returns it to the driver first.

probes b/d died at the TRAINER actor-pool bootstrap (handle.py create_remote), not the engine boots: 8 ranks materializing the 20B transformer concurrently hold ~20-23 GiB anon each and blow the pod's ~439 GiB memcg ceiling — the kernel kills raylet (ActorUnavailableError). Same flock pattern as the engine-boot serialization, on the bundle's heavy from_pretrained window; gc before lock release so the serialized peak actually holds. DIFFRL_MODEL_LOAD_SERIALIZE=0 opts out.

probe-e crossed the entire infra gauntlet (serialized loads, serialized engine boots, cache flush, generation, sleep) and died in the first TRAINING replay forward: the installed diffusers' QwenImage RoPE builder requires txt_seq_lens (max() over it sizes the text frequency slice) and predict_noise only passed the attention mask. Derive per-sample lengths from the mask for both CFG branches.

…ength probe-f: replay microbatches carry the batch-wide pad width (18) while max(txt_seq_lens) reflects their own true max (12); diffusers RoPE applies over the tensor width but slices freqs by txt_seq_lens -> shape crash in apply_rotary_emb_qwen. Trim embeds+mask to the slice's true max so width == max(txt_seq_lens) by construction (padding beyond it is meaningless).
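
Both fixes (deriving per-sample lengths from the mask, then trimming to the slice's true maximum) can be shown on plain lists. The sketch assumes right-padded 0/1 masks and uses nested lists in place of tensors; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def seq_lens_from_mask(mask: Sequence[Sequence[int]]) -> List[int]:
    """Per-sample true lengths, assuming a right-padded 0/1 attention mask."""
    return [sum(row) for row in mask]


def trim_to_true_max(
    embeds: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> Tuple[List[List[float]], List[List[int]], List[int]]:
    """Trim embeds and mask so that width == max(txt_seq_lens)."""
    lens = seq_lens_from_mask(mask)
    width = max(lens)
    trimmed_embeds = [list(row[:width]) for row in embeds]
    trimmed_mask = [list(row[:width]) for row in mask]
    return trimmed_embeds, trimmed_mask, lens
```

After trimming, the tensor width and the length used to slice the rotary frequencies agree by construction, which removes the mismatch described above (pad width 18 against a true max of 12).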


AC-off training leaves the activation peak (~30-40 GiB on Qwen-Image at mbs=1) reserved in the trainer's caching allocator; the colocated engine then OOMs re-mapping its weights on wake (qwen e2e-c: 'Tried to allocate 2.00 MiB ... 2.38 MiB free' in rollout 2's generate). empty_cache() before wake_up() returns it to the driver each cycle.

…CAST body) e2e-d reproduced e2e-c's rollout-2 OOM despite the trainer-side flush: train_step executes driver-side while the train-phase peak lives in the 8 ray actors. wake_up()'s body runs per-actor — flush there, immediately before the engine re-maps its weight pool.


…rate-engine anchor The qwen recipes inherited DiffusionGRPO's default 'native' (engine-emitted rollout logp as the PPO anchor). Correct for trainside; across processes it bakes the rollout<->replay numeric discrepancy (bf16-stored trajectory vs in-flight fp32, engine vs trainer logp conventions) into every first-epoch ratio: observed 1±2-5e-5 vs sd3's replay-anchored 1±7e-6, against a 1e-4 clip range — the trust region half-consumed by anchor error before any real policy movement. Every sd3 separate-engine recipe sets replay; the qwen clones missed it.
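
The effect of the anchor choice on the first-epoch ratio is simple arithmetic. In the sketch below the log-probability values and the 3e-5 discrepancy are illustrative numbers chosen inside the range quoted above; only the 1e-4 clip range comes from the commit message.

```python
import math


def ppo_ratio(new_logp: float, old_logp: float) -> float:
    """Importance ratio between the current policy and the anchor."""
    return math.exp(new_logp - old_logp)


trainer_logp = -12.5                # trainer's replayed logp before any update
engine_logp = trainer_logp + 3e-5   # engine-emitted logp with a small numeric offset
clip_range = 1e-4

replay_ratio = ppo_ratio(trainer_logp, trainer_logp)  # anchor = trainer's own replay
native_ratio = ppo_ratio(trainer_logp, engine_logp)   # anchor = engine-emitted logp
consumed = abs(native_ratio - 1.0) / clip_range       # share of the trust region lost
```

With the replay anchor the first-epoch ratio is exactly 1, so the whole clip range is available for real policy movement; with the native anchor a fraction of it is spent on cross-process numeric error.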


Single-node 1x8 port of hunyuan_video15_t2v_vllmomni_nccl_separate_v2.yaml. All 8 GPUs time-share train+rollout (colocate, default layout) so it runs on one pod instead of 2x8. Numerically faithful: per-train-worker batch stays 32. Deltas vs separate: num_devices 16->8 (drop layout/train_fraction); enable_sleep_mode true; sync NCCLWeightSync -> IPCWeightSync(lora_merged=true) keeping merged full-weight semantics over same-node CUDA-IPC; add old_logp_source=replay (separate-engine anchor, prevents flat reward).

…y venv)

The colocate recipe used IPCWeightSync, whose worker mixin (BucketedIPCReceiveMixin.__new__) eagerly imports sglang's monkey_patch_torch_reductions for CUDA-IPC handle unpickling. On the two-venv image, sglang lives only in .venv-sglang (cu130), not the vllm-omni .venv (cu129), so every engine worker crashed at construction with ModuleNotFoundError: No module named 'sglang'.

- Recipe: switch sync IPCWeightSync -> LocalLoraWeightSync (the proven sd3/qwen v2 colocate bridge). In-process, SGLang-free; engine runs base+adapter = the merged model. hv15 stage config already sets enable_lora/max_lora_rank=64.
- Mixin: make the eager sglang import in __new__ degrade gracefully when sglang is absent (it is only needed for the IPC receive path; lazy imports inside update_weights_from_ipc still raise clearly if IPC is used without sglang).


… over wire) hv15 t2v decoded a valid video (out.output [B,3,F,H,W]) but the trainer-side response saw empty .images -> collect_dit_outputs 'no PIL images'. Root cause: the engine post-processes video into PIL frames, but PIL image lists do NOT survive the engine worker->client wire for video (only tensors on custom_output / trajectory_* cross; verified the wire result carries only trajectory_latents). Fix: RL pipeline stamps the decoded video tensor onto custom_output['rl_decoded_video']; collect_dit_outputs rebuilds per-sample PIL frames from it (VideoProcessor.postprocess_video) when .images is empty for final_output_type=video. Image modalities unaffected. Removes the temp diags.

…gment LatentSegment/Segment dropped the sample_indices row->sample identity mapping (base.py: 'nothing read it, so it was removed') in the upstream-squash rebase, but build_image_segment still passed it -> LatentSegment.__init__() got an unexpected keyword argument 'sample_indices'. First surfaced by hv15 (first modality to reach build_segments post-rebase); affects all diffusion modalities. TextSegment.pack (AR path) keeps sample_indices — packed batches use it.

set_lora_handle serializes LoRA tensors to the Omni subprocess via sglang's MultiprocessingSerializer (CUDA-IPC), absent from the vllm-omni-only venv -> ModuleNotFoundError at the first weight_sync.sync() (after a full generate+ train). The engine already has a sglang-free byte-copy transport (set_lora_copy: torch.save+base64; receiver set_lora_from_tensor_dict_copy: torch.load). Fall back to it when sglang is unavailable; covers initial sync + wake-restore (both route through set_lora_handle). LoRA is tiny so per-rank copy is free.

… sglang) The v2 engine's weight-sync paths imported sglang's MultiprocessingSerializer / monkey_patch_torch_reductions / FlattenedTensorBucket directly -> crash on the vllm-omni-only venv (two-venv image; sglang lives in .venv-sglang). Repoint all v2 sites to the already-vendored sgl_compat (verbatim sglang 0.5.10.post1, under vllm_omni/weight_sync) — trainer + worker import the SAME module so CUDA-IPC pickles round-trip. Reverts the byte-copy fallback (native) + graceful-skip (__new__) stopgaps; the vendored path is the engine's intended sglang-free design (v1 already uses it). Sites: __new__ monkey_patch, update_weights_from_tensor, set_lora_from_tensor_dict; native.set_lora_handle; full/tensor.sync.

- delete the v1 rollout engine package (rollout/engine/vllm_omni)
- rename vllm_omni_v2 -> vllm_omni: package dir, classes VLLMOmniV2{RolloutEngine,EngineConfig} -> VLLMOmni{...}, tests dir, and example scripts (drop the _v2 suffix everywhere)
- hoist the shared sgl_compat shim to distributed/weight_sync/transfer/ (alongside its already-hoisted siblings) and repoint the 5 import sites; v2 no longer reaches back into the v1 package
- keep the v1 example yamls, rewired to the v2 engine: modality rename (sd35_t2i->sd3_t2i, t2v->hv15_t2v, ar_recaption->hi3_ar_recaption, dit_recaption->hi3_dit_recaption) + drop the default_* sampling fields (the v2 engine validates modality against its adapter registry and carries no sampling defaults); merge away the _v2 twins
- collapse the redundant _VLLM_OMNI_V2_ENGINE_TARGET_SUFFIX in config/validation
- strip stale strangler/coexists-with-v1 framing from the engine __init__

Verified: engine + trainer imports resolve; example _target_ dotpaths resolve; unit suite 68 pass / 14 pre-existing env-fails, identical to pre-change HEAD (zero regressions). GPU colocate e2e regression pending on pod.

Drop tests/ from the tree and apply ruff --fix (import ordering, induced by the vllm_omni rename) plus ruff-format across the affected files.


celve added 30 commits June 10, 2026 14:04

fix(gpu-smoke): drop removed default_* config kwargs (sampling rides the request post-refactor)

6dd738b

test: assert config restore (not diffusers' uninitialized _shift property)

5ab241e

fix(qwen vllmomni recipes): env-driven wandb block (was hardcoded false; the 60-rollout e2e silently skipped wandb)

a3af805

celve added 21 commits June 10, 2026 14:29

add image-dump example for vllm_omni_v2 visual checks (PickScore-annotated)

557c832

add group-dump example: 1 prompt x 16 samples, per-sample vs shared init noise, PickScore-ranked grids

3a35544

add ODE parity probe: engine vs trainside, same x_T/sigma/conds, per-step latent deltas

cd62a30

parity probe: fix predict_noise positional signature

0686155

parity probe: optional fp32-island alignment variant (PROBE_FP32_ISLANDS=1)

01ae2f2

diag(hv15): log model_class_name + postproc-registry hit + out.output type [TEMP]

5c30223

diag(hv15): dump wire-result .images/attrs on no-PIL-images raise [TEMP]

3756825

diag(hv15): dump multimodal_output + all public attrs of wire result [TEMP]

a423263

diag(hv15): dump outputs/latents/trajectory_decoded values to find video tensor [TEMP]

1399d83

chore: remove rollout test suite; ruff import-order + format cleanup

3016057


chore: drop dev/probe example scripts (gpu_smoke, dump_images, group_dump, parity_probe)

453aed5

perf(diffusion): drop empty_cache() flush before colocated engine wake

f42f904

celve requested a review from haonan3 June 11, 2026 12:55

celve wants to merge 51 commits into Tencent-Hunyuan:main from celve:LIN-382/vllm-omni-canonical
