feat(rollout): vLLM-Omni v2 rollout engine — canonical vllm_omni, retire v1#36
Open
celve wants to merge 51 commits into
…omni)
Rewrite of the vllm-omni rollout engine to the rollout-engine layout spec v2, mirroring the sglang_diffusion reference. Coexists with v1 under the distinct vllm_omni_v2 name; opt-in via cloned recipes; v1 retires after the parity gate. Covers all 8 v1 modalities.
- backends/: the one runtime seam, made of base.py (Backend protocol + GenerateCall/StageSampling intent + OmniRawResult wire protocol, zero runtime code) and native.py (lazy Omni import, boot with stage-YAML overlays + temp file, request-id grouping, sleep/wake tasks, per-stage collective_rpc weight-sync verbs, tokenize_prompt).
- adapters/: registry keyed on modality; per-shape bases ar_dit/ar_only/dit_image/dit_video lift v1's request.py/response.py modality branches into overridable steps (incl. dit_recaption's per-prompt seeded calls).
- utils/: pure ports of the segment/condition/prompt/σ mechanics.
- weight_sync.py: one component owning all sync/LoRA state; v1-parity active LoRA re-push on wake (restore_lora_after_wake) + the mark_weights_released event; both LoRA transports (handle/byte-copy).
- config.py: v1 fields verbatim + live-registry modality validation + server_intent; VLLMOmniPorts self-reserved bind-to-0 master ports ride each stage's engine_args.master_port, which kills the 30200+rank*200+idx*50 port math, the RANK-env fallback, and the per-rank YAML rewrites.
- worker/ + pipelines/ + patches/: role-8 copies (v1 keeps its own; its stage YAMLs reference v1 qualnames until retirement); patches/ carries the install() bundle with per-patch DELETE-WHEN notes.
- Shared infra: rollout/engine/ports.py (ReservedPorts, ported from the LIN-371 branch); bucketed_transfer/ipc_dispatch/checksum hoisted to the neutral distributed/weight_sync/transfer/ package (v1 paths re-export; 5 trainer-side import sites repointed).
- recipes: 8 *_v2.yaml opt-in clones (only the rollout _target_ lines changed); config/validation.py IPC gate accepts the v2 engine.
- tests/rollout/vllm_omni_v2/: 55 CPU tests covering registry/topology parity vs the v1 frozensets, per-shape build_inputs/build_response, utils mechanics, WeightSync lifecycle, bare-engine sequencing, the dispatch-marker guard, ports, and a subprocess CPU-importability check with vllm/vllm_omni/sglang blocked.
GPU smoke (incl. reserved==settled port assertion) and the fixed-seed parity gate vs v1 remain before v1 retirement.
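The bind-to-0 reservation mentioned for VLLMOmniPorts / ReservedPorts can be illustrated with a small sketch. This is not the code in rollout/engine/ports.py: the class name `ReservedPort` and its `release` method are hypothetical, and only the standard `socket` module is used. The idea is to let the kernel hand out a free port by binding to port 0 and to keep the socket open until the port is handed over.

```python
import socket


class ReservedPort:
    """Hypothetical helper: hold a kernel-assigned TCP port until released."""

    def __init__(self, host: str = "127.0.0.1") -> None:
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Port 0 asks the kernel for any currently free port.
        self._sock.bind((host, 0))
        self.port: int = self._sock.getsockname()[1]

    def release(self) -> None:
        """Close the holding socket so the real server can bind the port."""
        self._sock.close()


# Two reservations held at the same time cannot receive the same port.
first = ReservedPort()
second = ReservedPort()
ports = (first.port, second.port)
first.release()
second.release()
```

A short window remains between `release()` and the real bind, so a reservation like this narrows port races rather than eliminating them.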
… drop the YAML rewrite
Audit of the vllm-omni source (pin v0.20.0 + upstream main) showed the seam was reconstructing machinery the runtime already provides:
- Boot now passes the PRISTINE packaged stage YAML plus enable_sleep_mode / master_port as Omni ctor kwargs; vllm-omni merges them into every stage's engine_args itself (load_stage_configs_from_yaml base_engine_args channel + the dedicated sleep-mode injection in _resolve_stage_configs, verified at the pin AND main; no stage YAML of ours defines either key, so the behavior is identical to the old overlay). Deletes _overlay_stage_tree, the yaml/tempfile load-overlay-write-unlink machinery, and the shutdown temp-file cleanup. Known cosmetic: upstream logs a spurious "top-level engine args are ignored: enable_sleep_mode" warning before its dedicated block applies the kwarg.
- tp_per_stage reads back from omni.stage_configs (the runtime's own merged per-stage configs) via _tp_from_stage_configs, replacing the pre-overlay YAML re-parse (_parse_tp_per_stage) and its ordering caveat.
- VLLMOmniPorts: two per-stage fields -> one reserved master-port BASE. At the pin, OmniDiffusionConfig.__post_init__ adds rand(0,100) + a 37-stride bind-check scan even to injected ports (so per-stage reserved ports were never honored verbatim anyway); from v0.21.0rc2 (#3803) the base is honored verbatim. Docstrings carry the MASTER_PORT-env-precedence upgrade landmine.
- patches/__init__: DELETE-WHEN corrections from the audit: flow-alignment deletable at pin >= v0.21 (upstream removed the v0.20.0 KV API), lora_request passthrough confirmed still absent at main (upstream-PR opportunity noted), tokenizer ratio-36 nuance (upstream now raises a clean ValueError; Base ckpt still needs the 0-fallback).
- Tests: ports tests adapted to the single field; new table-driven _assemble_omni_kwargs layering tests (dedicated keys beat the omni_extra escape hatch; hatch still beats timeout/mode defaults, for v1 parity) and a _tp_from_stage_configs test (mapping- and attr-style entries).
pytest tests/ -> 59 passed. Boot mechanics only: no rollout-output surface touched; the pending parity gate is unaffected.
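The layering rule exercised by the _assemble_omni_kwargs tests (dedicated keys beat the omni_extra escape hatch, and the hatch beats defaults) amounts to an ordered dict merge. The sketch below is an assumption-laden illustration: the function name `assemble_kwargs` and the default keys `init_timeout` / `mode` are hypothetical, not the engine's real names.

```python
from typing import Any, Mapping


def assemble_kwargs(
    defaults: Mapping[str, Any],
    omni_extra: Mapping[str, Any],
    dedicated: Mapping[str, Any],
) -> dict[str, Any]:
    """Later layers win: defaults < omni_extra (escape hatch) < dedicated keys."""
    merged: dict[str, Any] = dict(defaults)
    merged.update(omni_extra)   # the hatch overrides defaults
    merged.update(dedicated)    # dedicated config fields override the hatch
    return merged


kwargs = assemble_kwargs(
    defaults={"init_timeout": 600, "mode": "default"},    # hypothetical defaults
    omni_extra={"init_timeout": 1200, "master_port": 1},  # user escape hatch
    dedicated={"master_port": 29500, "enable_sleep_mode": True},
)
```

Here the hatch raises `init_timeout` past its default, while its `master_port` loses to the dedicated value.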
Drives the report's pending GPU checks against the live v0.20.0 runtime: boot (sd35_t2i), settled-vs-reserved master-port window assertion (pin semantics: base + rand(0,100) + 37-stride scan), generate, one tensor-bag sync with fingerprint_tensor checksum match, sleep/wake + post-wake generate. Live-LoRA wake re-push is left to the LoRA e2e recipe.
Image.to_pil() / Video.to_pils() existed but Images had no batch counterpart, so five call sites re-derived the conversion in three different spellings. Route them all through Images.to_pils(). vllm_omni_v2's images_to_pil(req, n) becomes pil_images_from_req — the helper extracts and validates the request primitive (paralleling texts_from_req); the conversion itself now lives on the type. types/reward.py keeps tensor_frame_to_pil (extra grayscale→RGB + cpu handling) — intentionally not unified.
…t sampling params authoritative Adapter files are now named by model family (hi3_ar_dit / hi3_ar_only / hi3_dit_recaption; dit_image keeps sd35_t2i only). The HI3 chat-template knowledge (task_key / sys_type / output_modalities + the bot_task resolve hook) moves from utils' _TASK_DEFAULTS table onto the adapter classes, and build_prompt_entries moves with it. The engine keeps no sampling defaults anymore: config's default_* fields are gone and core_diff_kwargs / build_ar_sampling read the request's typed sampling params directly. The two-engine trainer ships the WHOLE composed params to the ar_recaption engine (it reads its AR slice for sampling and the diffusion slice's height/width for the recaption prompt); recipes spell the sampling values per request.
…b-adapters Each registered modality adapter is now a thin binder: it constructs an input_adapter and an output_adapter in __init__ and delegates build_inputs / build_response to them. Input-side and output-side variation are independent axes (i2t/it2i share the AR-bearing request side but differ in response shape), so the per-output-shape inheritance bases leaked duplication across branches — the build_ar_sampling class-attr borrow, the duplicated AR build_inputs, and the duplicated image-attach decorate all dissolve into one parameterized Hi3InputAdapter. Layout: universal single-stage DiT skeletons live in dit.py (DitInputAdapter with an extras() hook, DitOutputAdapter with conditions/decode/extra_decoded hooks); families derive small subclasses in their own file (hi3.py / sd3.py / hv15.py) — family-specific classes carry the family prefix, universal ones don't. core_diff_kwargs / sde_extra_args move off the ABC to utils/diff_kwargs.py as plain functions (they read no adapter state). Engine-facing surface unchanged: same registry keys, same two verbs, same knob class-attributes — the suite passes as-is except one test that monkeypatched via the old module path (adapters.hi3_ar_dit → adapters.hi3).
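The thin-binder shape described above can be sketched as follows. All class names here (`SharedInput`, `TextOutput`, `ImageOutput`, `I2TAdapter`, `IT2IAdapter`) are hypothetical stand-ins, and requests are plain dicts rather than the real RolloutReq type; the point is only that input and output variation are chosen independently and the binder does nothing but delegate.

```python
from typing import Any


class SharedInput:
    """Hypothetical input side shared by several modalities."""

    def build(self, req: dict[str, Any]) -> dict[str, Any]:
        return {"prompts": list(req["prompts"])}


class TextOutput:
    """Hypothetical output side producing a text-shaped response."""

    def build(self, req: dict[str, Any], raw: list[Any]) -> dict[str, Any]:
        return {"kind": "text", "items": [str(r) for r in raw]}


class ImageOutput:
    """Hypothetical output side producing an image-shaped response."""

    def build(self, req: dict[str, Any], raw: list[Any]) -> dict[str, Any]:
        return {"kind": "image", "items": list(raw)}


class ModalityAdapter:
    """Thin binder: constructs both sub-adapters once, then only delegates."""

    input_cls: type = SharedInput
    output_cls: type = TextOutput

    def __init__(self) -> None:
        self.input_adapter = self.input_cls()
        self.output_adapter = self.output_cls()

    def build_inputs(self, req):
        return self.input_adapter.build(req)

    def build_response(self, req, raw):
        return self.output_adapter.build(req, raw)


class I2TAdapter(ModalityAdapter):
    output_cls = TextOutput   # same request side, text response


class IT2IAdapter(ModalityAdapter):
    output_cls = ImageOutput  # same request side, image response
```

Because the two axes are bound rather than inherited, a new response shape needs one new output class and no change to the shared input side.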
Registry keys gain a short family prefix matching the adapter file names:
t2i -> hi3_t2i it2i -> hi3_it2i
i2t -> hi3_i2t t2t -> hi3_t2t
ar_recaption -> hi3_ar_recaption dit_recaption -> hi3_dit_recaption
sd35_t2i -> sd3_t2i t2v -> hv15_t2v
Bare task keys silently bound a task to one family ("t2v" was HV1.5-only)
and collide as soon as another family serves the same task (wan/qwen_image
video+image models already live under models/). Clean break, no aliases —
v2 recipes, the config default, the GPU smoke driver, and test constants are
updated in this commit; the v1 engine and v1 recipes keep their keys.
The upstream chat-template task vocabulary (t2i_think / t2i_recaption / ...)
is unchanged — Hi3InputAdapter derives bot-task variants from its explicit
bot_task_base, not from the registry key.
BREAKING: out-of-tree v2 recipes must update rollout-engine `modality`
values to the family-namespaced keys.
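A registry keyed on family-prefixed names might look like the sketch below. The names `ADAPTERS`, `register`, and `resolve` are hypothetical and the adapter classes are empty placeholders; only the keys `hv15_t2v` and `sd3_t2i` come from the rename table above. It shows the two properties the commit relies on: duplicate keys are rejected at registration, and a bare task key such as `t2v` no longer resolves.

```python
ADAPTERS: dict[str, type] = {}


def register(key: str):
    """Class decorator: bind a family-prefixed key to an adapter class."""

    def wrap(cls: type) -> type:
        if key in ADAPTERS:
            raise ValueError(f"duplicate modality key: {key!r}")
        ADAPTERS[key] = cls
        return cls

    return wrap


@register("hv15_t2v")
class Hv15T2V:
    """Placeholder for the HunyuanVideo-1.5 text-to-video adapter."""


@register("sd3_t2i")
class Sd3T2I:
    """Placeholder for the SD3 text-to-image adapter."""


def resolve(modality: str) -> type:
    """Look up an adapter class, failing with the list of known keys."""
    try:
        return ADAPTERS[modality]
    except KeyError:
        known = ", ".join(sorted(ADAPTERS))
        raise ValueError(f"unknown modality {modality!r}; known: {known}") from None
```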
…oded hook
extra_decoded had exactly one real override (the HI3 two-track AR text) plus
a ceremonial no-op default — a hook that never earned its slot. A single
build_decoded(pil_images, frame_groups, per_request) now owns the whole
decoded dict: the base returns {track_name: Images}, hv15 swaps the payload
for packed frame groups, and Hi3ImageOutputAdapter extends via super() with
the best-effort AR text. One method answers "what goes in decoded for this
family", and children derive with plain super() composition instead of
fitting pre-cut slots. Contract note in the docstring: keep the track_name
entry (a missing key silently yields decoded=None on that track).
…s only The 3-arg signature leaked the skeleton's internal collection step into the hook contract. decoded may span tracks beyond the DiT one (the HI3 two-track AR text comes from the AR outputs, not the DiT slices), so the raw wire groups are its natural currency. The base re-collects its PILs via the new _collect helper (deliberate and cheap — collect_dit_outputs only gathers references; pils_to_images still runs once), and the knob-threading for collect_dit_outputs now lives in one place used by both the skeleton and the hook implementations.
…tputs directly A private helper used three times in one small file was ceremony; each call site now reads self-contained and subclass derivation goes through plain super(). The knobs are the same self attributes at every site, so the selection can't drift.
…ed/conditions triple An output adapter's whole job is to produce the three dicts assemble_tracks consumes, so the class contract now mirrors that parameter list 1:1: three parallel hooks — build_segments / build_decoded / build_conditions — all with the uniform (req, per_request) currency, and build() reduced to guard + assemble_tracks. The v1-parity AR sweep moves into build_segments (a named, derivable home instead of unconditional skeleton behavior); conditions implementations collect their own DiT outputs inline (cheap reference gathering, same self knobs everywhere). Hi3TextOutputAdapter adopts the same shape for full uniformity: the conditions= ctor callable is gone — ar_recaption's fused prompt capture is now a tiny Hi3ArRecaptionOutputAdapter subclass overriding build_conditions. Text-extraction failure policies preserved exactly: best-effort in the HI3 two-track build_decoded, fatal in the AR-only one.
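A minimal sketch of the segments/decoded/conditions triple, assuming dict-shaped per-request outputs. The `assemble_tracks` below is a simplified stand-in for the real function, and the class names and dict keys (`latent`, `image`, `cond`, `text`) are invented for illustration. It shows build() reduced to guard plus assembly, and a child extending build_decoded through plain super().

```python
def assemble_tracks(segments, decoded, conditions):
    """Stand-in assembler: join the three dicts per track name."""
    names = sorted(set(segments) | set(decoded) | set(conditions))
    return {
        name: {
            "segments": segments.get(name),
            "decoded": decoded.get(name),  # a missing key yields None on that track
            "conditions": conditions.get(name),
        }
        for name in names
    }


class OutputAdapter:
    track_name = "dit"

    def build_segments(self, req, per_request):
        return {self.track_name: [r["latent"] for r in per_request]}

    def build_decoded(self, req, per_request):
        return {self.track_name: [r["image"] for r in per_request]}

    def build_conditions(self, req, per_request):
        return {self.track_name: [r["cond"] for r in per_request]}

    def build(self, req, per_request):
        if not per_request:  # guard
            raise ValueError("no outputs for request")
        return assemble_tracks(
            self.build_segments(req, per_request),
            self.build_decoded(req, per_request),
            self.build_conditions(req, per_request),
        )


class TwoTrackOutputAdapter(OutputAdapter):
    """Adds a best-effort text track on top of the base decoded dict."""

    def build_decoded(self, req, per_request):
        decoded = super().build_decoded(req, per_request)
        decoded["ar_text"] = [r.get("text") for r in per_request]
        return decoded
```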
…_sampling pair Same principle as the output-adapter triple, read off the input wire type: GenerateCall has two payload fields, so the input decomposition is a pair — build_prompts(req) / build_sampling(req) — with build() reduced to pure assembly. The whole sub-adapter layer now fits one sentence: a build() orchestrator over build_<wire-field> hooks taking raw currency (req on the way in, (req, per_request) on the way out). The extras() hook is gone: hv15's num_frames delta becomes two super()-extend overrides (the idiom Hi3ImageOutputAdapter.build_decoded established) instead of a pair-of-dicts return. Hi3InputAdapter's stages get public names while _resolve_task/_decorate/_ar_sampling/_dit_sampling stay private inside them. Hi3DitRecaptionInputAdapter keeps its wholesale build() — its prompts and sampling are paired per single-prompt call (seed + gid slice decided together), the documented call-topology exception.
…the request post-refactor)
…ils/conditions.py
Each of the four condition builders had exactly one v2 caller and was 100% family-specific (family capture keys, family padding rules, family condition types): family conversion logic wearing a utils costume, organized by kind instead of by family. They now live where the placement rule says:
- build_hv15_conditions -> Hv15VideoOutputAdapter.build_conditions
- build_sd3_text_condition -> Sd3OutputAdapter.build_conditions
- build_fused_mm_condition -> hi3_fused_conditions (module-level: two consumers in hi3.py)
- build_ar_fused_condition -> hi3_ar_fused_conditions
This also fixes a layering smell: utils/ no longer imports unirl.models.hunyuan_image3, so the family-agnostic layer is now actually family-agnostic. Error text preserved verbatim; the raise moves from "wrapper diagnoses the builder's None" to the failed check itself. The dead empty-input heads are dropped (unreachable post-collect: the build() guard + collect_dit_outputs's per-request raise guarantee non-empty). The v1 engine's private _build_* copies in vllm_omni/response.py are parity code and stay.
Boundary rule: the inline criterion is family-specific LOGIC, not consumer count: decoded_text_from_ar / seed_from_sample_id / grouped_pils_to_videos are universal mechanics with single consumers today and stay in utils/.
…xamples/ layout Adopt the upstream examples/ reorg (#272/#267) for the branch-only v2 files: recipes/diffusion_rl/sd3_*_v2 -> examples/diffusion/sd3/, the hv15 t2v v2 recipe -> examples/diffusion/hunyuan_video15/, hi3_vllmomni_v2 -> examples/unified_model/hi3/, and scripts/vllm_omni_v2_gpu_smoke.py -> examples/. All three relocated recipes compose-check under the new --config-name roots (hydra config_path is ../examples post-reorg).
ruff check/format + hook fixes on files this branch adds: drop unused imports (test fixtures, native.py STAGE_KIND_DIFFUSION), sort ipc_receive_mixin imports, format reflow, executable bit + TYPE_CHECKING RolloutReq import + relocated usage path for the GPU smoke driver.
pack_initial_noise_extra_args ships x_T as EITHER a materialized initial_noise_batch tensor OR a recipe (init_noise_group_ids + init_noise_latent_shape + init_noise_seed), but the hv15 pipeline only handled the batch form — a t2v request shipping the recipe form was silently ignored and upstream RNG drew x_T instead (the x_T-collapse class of bug the HI3 pipeline fails loudly on). Mirror sd3's recipe branch: regenerate this request's row byte-identically via NoiseRecipe(...).resolve(). CPU-verified only (lint + suite); the pipeline executes on the GPU pod — covered by the next smoke/e2e run there.
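The either/or wire contract can be sketched in plain Python. This is a stand-in, not the pipeline code: the real path regenerates tensors through NoiseRecipe(...).resolve(), whereas this sketch uses the standard `random` module on flat lists, and the function name `resolve_noise_sketch` plus the per-group seeding rule are assumptions. The key names are the ones listed in the commit message.

```python
import random
from typing import Any


def resolve_noise_sketch(extra_args: dict[str, Any], row: int) -> list[float]:
    """Return x_T for one request row from either wire form, or fail loudly."""
    batch = extra_args.get("initial_noise_batch")
    if batch is not None:
        return list(batch[row])  # materialized form: pick this request's row

    keys = ("init_noise_group_ids", "init_noise_latent_shape", "init_noise_seed")
    if all(k in extra_args for k in keys):
        group_id = extra_args["init_noise_group_ids"][row]
        size = 1
        for dim in extra_args["init_noise_latent_shape"]:
            size *= dim
        # Assumed seeding rule: one deterministic stream per (seed, group id).
        rng = random.Random(f'{extra_args["init_noise_seed"]}:{group_id}')
        return [rng.gauss(0.0, 1.0) for _ in range(size)]

    # Neither form present: refuse instead of letting upstream RNG draw x_T.
    raise ValueError("request carries neither initial_noise_batch nor a noise recipe")


recipe = {
    "init_noise_group_ids": [7, 7, 9],
    "init_noise_latent_shape": [2, 3],
    "init_noise_seed": 1234,
}
```

Raising in the final branch is the "fail loudly" behaviour the message contrasts with the silent fallback.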
…vest naming
The three RL pipeline subclasses follow one protocol around upstream's forward: install (once, idempotent) / arm (every request) / run (upstream, with our taps and injectors firing inside) / harvest (export onto the wire). This makes the protocol literal:
- New pipelines/_shared/interception.py holds the byte-identical mechanics (detach_cpu, stamp_custom_output, drain_trajectory_into, resolve_request_noise, inject_latents, make_sde_scheduler). Deliberately vllm-omni-free (wire objects duck-typed), so the mechanics are CPU-tested in the new test_pipeline_interception.py even though the pipelines themselves only run inside the GPU worker.
- FlowMatchSDEDiscreteScheduler.arm(eta=..., sde_indices=...) replaces private attr-poking at three call sites; counterpart of drain_trajectory.
- Hooks are renamed by purpose, not upstream target: every family now has one _install_conditioning_tap (encode_prompt for sd3/hv15, prepare_inputs_for_generation for hi3), an initial-noise injector, and _arm_*/_harvest_* steps; upstream-defect carriers are suffixed _workaround (sd3 T5 truncation; hv15's _sigma_override is now a try/finally contextmanager). forward() reads as the protocol.
- hi3/sde_scheduler.py (dead re-export shim) deleted.
Wire contract unchanged: custom_output keys, trajectory_* semantics, extra_args reads, error prose, recipe regeneration math, first-call-only capture. CPU-verified (lint + 71 tests incl. 12 new); the pipelines execute only on the GPU pod, covered by its next smoke/e2e run.
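The install / arm / run / harvest protocol can be made concrete with a toy skeleton. Everything here is hypothetical (`InterceptedPipeline`, the single-callable "upstream", the `eta` field); the real pipelines tap several upstream methods and export tensors onto the wire. The sketch only shows why install must be idempotent while arm and harvest run per request.

```python
class InterceptedPipeline:
    """Toy skeleton of install / arm / run / harvest around an upstream forward."""

    def __init__(self, upstream_forward):
        self._upstream_forward = upstream_forward
        self._installed = False
        self._eta = None
        self._trajectory = []

    def install(self):
        """Once, idempotent: wrap upstream so its output is tapped."""
        if self._installed:
            return
        inner = self._upstream_forward

        def tapped(x):
            out = inner(x)
            self._trajectory.append(out)  # tap fires inside the upstream run
            return out

        self._upstream_forward = tapped
        self._installed = True

    def arm(self, eta):
        """Every request: reset per-request state."""
        self._eta = eta
        self._trajectory = []

    def harvest(self):
        """Export what the taps captured and clear the buffer."""
        captured, self._trajectory = self._trajectory, []
        return {"trajectory": captured, "eta": self._eta}

    def forward(self, x, eta):
        self.install()
        self.arm(eta)
        result = self._upstream_forward(x)  # run
        return result, self.harvest()
```

If install wrapped upstream on every call, the second request would record each step twice, which is the failure the idempotence guard prevents.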
…Image rollout path
Adds the ninth modality to the v2 engine (Qwen-Image RL was trainside-only; the trainer bundle in models/qwen_image/ is the parity oracle). Single diffusion stage, TP=1; no engine/weight_sync/backends/patches changes: upstream v0.20.0 consumes sampling_params.sigmas natively and the generic LoRA manager maps rank-64 adapters onto the fused to_qkv via stacked_params_mapping. Two model-specific seams, everything else the SD3 pattern:
- CFG semantics: upstream treats negative_prompt "" as present and defaults true_cfg_scale to 4.0; the family input adapter omits the key when CFG is off (guidance 1.0, the oracle setting) and always maps guidance_scale onto true_cfg_scale. CFG-on is supported end-to-end: the encode tap captures the negative encode and build_conditions emits negative_text.
- Packed-latent boundary: the worker loop runs [B, S, C*4]; the trainer contract is [B, C, H, W]. RLQwenImagePipeline packs the driver x_T before prepare_latents injection and unpacks the harvested trajectory to [B, T+1, C, H, W] (upstream _unpack_latents is 5D; frame dim squeezed).
Also: max_sequence_length pinned to model_config (512) when the request doesn't set one (upstream defaults 1024); ragged-pad concat for the variable-length Qwen2.5-VL captures; stage YAML with max_lora_rank 64; dancegrpo vllmomni_v2 recipe = trainside twin + rollout/sync delta only.
CPU suite: 80 passed (CFG trap both ways, msl pinning, ragged-pad mask sums, negative_text + mixed-capture raise, σ-echo, K=0 NFT, registry knobs). GPU smoke + trainside parity folded into the standing gate.
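The CFG trap described above (an empty negative_prompt counts as present upstream) suggests a request-building rule along these lines. This is a sketch under assumptions: `qwen_cfg_kwargs` is a hypothetical helper, "CFG on" is taken to mean guidance_scale > 1.0, and sending "" when CFG is on but no negative prompt is supplied is an illustrative choice, not confirmed adapter behaviour.

```python
from typing import Any, Optional


def qwen_cfg_kwargs(guidance_scale: float, negative_prompt: Optional[str]) -> dict[str, Any]:
    """Build per-request CFG kwargs (hypothetical helper)."""
    # Always map guidance_scale onto true_cfg_scale so upstream's 4.0 default never applies.
    kwargs: dict[str, Any] = {"true_cfg_scale": guidance_scale}
    cfg_on = guidance_scale > 1.0  # assumption: 1.0 means CFG off
    if cfg_on:
        # Only send the key when CFG is wanted; upstream treats "" as present.
        kwargs["negative_prompt"] = negative_prompt or ""
    return kwargs
```

With guidance 1.0 the key is omitted entirely, so an empty string on the request cannot switch CFG on by accident.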
…DE) vllmomni recipe Smoke driver gains a SMOKE_MODALITY env (default sd3_t2i) with a per-family model-config shim — qwen needs use_dynamic_shifting + dynamic_shift_overrides (σ policy) and the max_sequence_length pin. New recipe qwen_image_grpo_vllmomni_v2 = the dancegrpo_vllmomni_v2 twin with FlowSDEStrategy (matches qwen_image_trainside), per the chosen e2e base.
…gmas Diffusers' set_timesteps has a THIRD sigma mutation beyond the static/dynamic shift branches: the shift_terminal whole-schedule stretch, applied even to passed-in sigmas. SD3.5 configs leave it null; the Qwen-Image-2512 checkpoint ships shift_terminal: 0.02, which stretched the engine-pinned [.., 0.3584] into [.., 0.0200] on the worker and tripped the sigma-echo gate in the qwen GPU smoke (max abs diff 3.384e-01 == 0.3584 - 0.0200, exactly what the (1-0.02)/(1-0.3584) stretch yields at the final sigma). Null it inside the same transient-override block; restore via finally. Regression tests for both config shapes.
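The reported numbers can be reproduced with a few lines of arithmetic. The sketch below assumes the stretch has the form used by diffusers' stretch_shift_to_terminal (rescale 1 - sigma so the final value lands on shift_terminal); the function name and the intermediate schedule values are illustrative, and only 0.3584 and 0.02 come from the commit message.

```python
def stretch_to_terminal(sigmas: list[float], shift_terminal: float) -> list[float]:
    """Stretch a schedule so its last sigma equals shift_terminal."""
    one_minus = [1.0 - s for s in sigmas]
    scale = one_minus[-1] / (1.0 - shift_terminal)
    return [1.0 - om / scale for om in one_minus]


# Illustrative schedule; only the final value 0.3584 is taken from the commit.
pinned = [1.0, 0.8, 0.6, 0.3584]
stretched = stretch_to_terminal(pinned, 0.02)
max_abs_diff = max(abs(a - b) for a, b in zip(pinned, stretched))
```

Each value of 1 - sigma is multiplied by (1-0.02)/(1-0.3584), so the gap grows towards the end of the schedule and peaks at the final sigma, where it equals 0.3584 - 0.02 = 0.3384.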
…TE gate
Probe post-mortem (8-replica colocate, both failures at engine boot):
1) DistNetworkError port 30005: at the v0.20.0 pin the stage_configs_path route strips master_port from the kwargs that feed the per-stage base_engine_args merge (only enable_sleep_mode/lora_* get re-injected), so every stage settles from the SHARED (None or 30005)+rand(0,100) window. SD3's fast boots won the 37-stride scan's TOCTOU races; Qwen's ~35s boots lose them. New patch_master_port_unstrip() re-attaches the caller's reserved per-replica base through _strip_single_engine_args; YAML keys still win, scan stays as fallback. DELETE-WHEN pin >= 0.21.
2) CUDA OOM (50 MiB free at engine TE load): the trainer-side bundle loads the Qwen2.5-VL text encoder (~15 GiB/rank) that the separate-engine path never uses; the engine encodes prompts and the trainer replays captured conditions. New QwenImagePipelineConfig.load_text_encoder (default true; vllmomni recipes set false) gates it; pipeline builds no text_embed stage without it and generate() raises a directed error.
Eight simultaneous engine boots each hold ~20 GiB anon RSS during weight materialization; the burst blows the pod's k8s memcg limit and the kernel OOM-kills raylet (ActorUnavailableError, probe-b post-mortem in dmesg). Exclusive flock on /tmp/diffrl_omni_boot.lock makes the load window single-file per node; DIFFRL_OMNI_BOOT_SERIALIZE=0 opts out. Also narrows the master-port settle TOCTOU as a side effect.
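The per-node serialization described above can be sketched with `fcntl.flock`. The lock path and the DIFFRL_OMNI_BOOT_SERIALIZE opt-out are taken from the commit message; the context-manager name `serialized_boot` and its exact structure are assumptions, and the sketch is Unix-only because it relies on `fcntl`.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def serialized_boot(lock_path: str = "/tmp/diffrl_omni_boot.lock"):
    """Hold an exclusive per-node file lock for the duration of the block."""
    if os.environ.get("DIFFRL_OMNI_BOOT_SERIALIZE", "1") == "0":
        yield  # opt-out: run without taking the lock
        return
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until peers release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A caller would wrap only the memory-heavy load window, for example `with serialized_boot(): engine = boot_engine()`, where `boot_engine` is a placeholder for the real boot call.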
probe-c post-mortem: engine model 53.7 GiB + dummy run OOM'd at 116 MiB free — the colocated trainer's caching allocator held ~40 GiB reserved-but-unallocated (full per-rank model load before FSDP shard), invisible to the engine subprocess. empty_cache() in boot() (which runs inside the trainer's ray actor) returns it to the driver first.
probes b/d died at the TRAINER actor-pool bootstrap (handle.py create_remote), not the engine boots: 8 ranks materializing the 20B transformer concurrently hold ~20-23 GiB anon each and blow the pod's ~439 GiB memcg ceiling — the kernel kills raylet (ActorUnavailableError). Same flock pattern as the engine-boot serialization, on the bundle's heavy from_pretrained window; gc before lock release so the serialized peak actually holds. DIFFRL_MODEL_LOAD_SERIALIZE=0 opts out.
probe-e crossed the entire infra gauntlet (serialized loads, serialized engine boots, cache flush, generation, sleep) and died in the first TRAINING replay forward: the installed diffusers' QwenImage RoPE builder requires txt_seq_lens (max() over it sizes the text frequency slice) and predict_noise only passed the attention mask. Derive per-sample lengths from the mask for both CFG branches.
…ength probe-f: replay microbatches carry the batch-wide pad width (18) while max(txt_seq_lens) reflects their own true max (12); diffusers RoPE applies over the tensor width but slices freqs by txt_seq_lens -> shape crash in apply_rotary_emb_qwen. Trim embeds+mask to the slice's true max so width == max(txt_seq_lens) by construction (padding beyond it is meaningless).
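The two replay fixes (derive txt_seq_lens from the attention mask, then trim to the slice's true max) can be sketched on plain nested lists. The real code operates on tensors; the helper name `trim_to_true_max` is hypothetical and right-padded 0/1 masks are assumed. The widths 18 and 12 mirror the probe-f numbers.

```python
def trim_to_true_max(embeds, mask):
    """Return (embeds, mask, txt_seq_lens) with width == max(txt_seq_lens)."""
    txt_seq_lens = [sum(row) for row in mask]  # per-sample true lengths
    width = max(txt_seq_lens)
    trimmed_embeds = [row[:width] for row in embeds]
    trimmed_mask = [row[:width] for row in mask]
    return trimmed_embeds, trimmed_mask, txt_seq_lens


# A microbatch padded to the batch-wide width 18 whose own true max is 12.
PAD_WIDTH = 18
true_lens = [12, 9]
mask = [[1] * n + [0] * (PAD_WIDTH - n) for n in true_lens]
embeds = [[float(i) for i in range(PAD_WIDTH)] for _ in true_lens]
embeds_t, mask_t, lens = trim_to_true_max(embeds, mask)
```

After trimming, the tensor width and max(txt_seq_lens) agree by construction, which is the invariant the RoPE slicing depends on.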
…se — the 60-rollout e2e silently skipped wandb)
AC-off training leaves the activation peak (~30-40 GiB on Qwen-Image at mbs=1) reserved in the trainer's caching allocator; the colocated engine then OOMs re-mapping its weights on wake (qwen e2e-c: 'Tried to allocate 2.00 MiB ... 2.38 MiB free' in rollout 2's generate). empty_cache() before wake_up() returns it to the driver each cycle.
…CAST body) e2e-d reproduced e2e-c's rollout-2 OOM despite the trainer-side flush: train_step executes driver-side while the train-phase peak lives in the 8 ray actors. wake_up()'s body runs per-actor — flush there, immediately before the engine re-maps its weight pool.
…nit noise, PickScore-ranked grids
…rate-engine anchor The qwen recipes inherited DiffusionGRPO's default 'native' (engine-emitted rollout logp as the PPO anchor). Correct for trainside; across processes it bakes the rollout<->replay numeric discrepancy (bf16-stored trajectory vs in-flight fp32, engine vs trainer logp conventions) into every first-epoch ratio: observed 1±2-5e-5 vs sd3's replay-anchored 1±7e-6, against a 1e-4 clip range — the trust region half-consumed by anchor error before any real policy movement. Every sd3 separate-engine recipe sets replay; the qwen clones missed it.
…step latent deltas
Single-node 1x8 port of hunyuan_video15_t2v_vllmomni_nccl_separate_v2.yaml. All 8 GPUs time-share train+rollout (colocate, default layout) so it runs on one pod instead of 2x8. Numerically faithful: per-train-worker batch stays 32. Deltas vs separate: num_devices 16->8 (drop layout/train_fraction); enable_sleep_mode true; sync NCCLWeightSync -> IPCWeightSync(lora_merged=true) keeping merged full-weight semantics over same-node CUDA-IPC; add old_logp_source=replay (separate-engine anchor, prevents flat reward).
…y venv)
The colocate recipe used IPCWeightSync, whose worker mixin (BucketedIPCReceiveMixin.__new__) eagerly imports sglang's monkey_patch_torch_reductions for CUDA-IPC handle unpickling. On the two-venv image, sglang lives only in .venv-sglang (cu130), not the vllm-omni .venv (cu129), so every engine worker crashed at construction with ModuleNotFoundError: No module named 'sglang'.
- Recipe: switch sync IPCWeightSync -> LocalLoraWeightSync (the proven sd3/qwen v2 colocate bridge). In-process, SGLang-free; engine runs base+adapter = the merged model. hv15 stage config already sets enable_lora/max_lora_rank=64.
- Mixin: make the eager sglang import in __new__ degrade gracefully when sglang is absent (it is only needed for the IPC receive path; lazy imports inside update_weights_from_ipc still raise clearly if IPC is used without sglang).
…deo tensor [TEMP]
… over wire) hv15 t2v decoded a valid video (out.output [B,3,F,H,W]) but the trainer-side response saw empty .images -> collect_dit_outputs 'no PIL images'. Root cause: the engine post-processes video into PIL frames, but PIL image lists do NOT survive the engine worker->client wire for video (only tensors on custom_output / trajectory_* cross; verified the wire result carries only trajectory_latents). Fix: RL pipeline stamps the decoded video tensor onto custom_output['rl_decoded_video']; collect_dit_outputs rebuilds per-sample PIL frames from it (VideoProcessor.postprocess_video) when .images is empty for final_output_type=video. Image modalities unaffected. Removes the temp diags.
…gment LatentSegment/Segment dropped the sample_indices row->sample identity mapping (base.py: 'nothing read it, so it was removed') in the upstream-squash rebase, but build_image_segment still passed it -> LatentSegment.__init__() got an unexpected keyword argument 'sample_indices'. First surfaced by hv15 (first modality to reach build_segments post-rebase); affects all diffusion modalities. TextSegment.pack (AR path) keeps sample_indices — packed batches use it.
set_lora_handle serializes LoRA tensors to the Omni subprocess via sglang's MultiprocessingSerializer (CUDA-IPC), which is absent from the vllm-omni-only venv -> ModuleNotFoundError at the first weight_sync.sync() (after a full generate + train). The engine already has a sglang-free byte-copy transport (set_lora_copy: torch.save+base64; receiver set_lora_from_tensor_dict_copy: torch.load). Fall back to it when sglang is unavailable; covers initial sync + wake-restore (both route through set_lora_handle). LoRA is tiny, so the per-rank copy is effectively free.
… sglang) The v2 engine's weight-sync paths imported sglang's MultiprocessingSerializer / monkey_patch_torch_reductions / FlattenedTensorBucket directly -> crash on the vllm-omni-only venv (two-venv image; sglang lives in .venv-sglang). Repoint all v2 sites to the already-vendored sgl_compat (verbatim sglang 0.5.10.post1, under vllm_omni/weight_sync) — trainer + worker import the SAME module so CUDA-IPC pickles round-trip. Reverts the byte-copy fallback (native) + graceful-skip (__new__) stopgaps; the vendored path is the engine's intended sglang-free design (v1 already uses it). Sites: __new__ monkey_patch, update_weights_from_tensor, set_lora_from_tensor_dict; native.set_lora_handle; full/tensor.sync.
- delete the v1 rollout engine package (rollout/engine/vllm_omni)
- rename vllm_omni_v2 -> vllm_omni: package dir, classes
VLLMOmniV2{RolloutEngine,EngineConfig} -> VLLMOmni{...}, tests dir,
and example scripts (drop the _v2 suffix everywhere)
- hoist the shared sgl_compat shim to distributed/weight_sync/transfer/
(alongside its already-hoisted siblings) and repoint the 5 import sites;
v2 no longer reaches back into the v1 package
- keep the v1 example yamls, rewired to the v2 engine: modality rename
(sd35_t2i->sd3_t2i, t2v->hv15_t2v, ar_recaption->hi3_ar_recaption,
dit_recaption->hi3_dit_recaption) + drop the default_* sampling fields
(the v2 engine validates modality against its adapter registry and
carries no sampling defaults); merge away the _v2 twins
- collapse the redundant _VLLM_OMNI_V2_ENGINE_TARGET_SUFFIX in config/validation
- strip stale strangler/coexists-with-v1 framing from the engine __init__
Verified: engine + trainer imports resolve; example _target_ dotpaths
resolve; unit suite 68 pass / 14 pre-existing env-fails — identical to
pre-change HEAD (zero regressions). GPU colocate e2e regression pending on pod.
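The recipe rewiring described in this commit (modality rename plus dropping the default_* sampling fields) can be expressed as a small migration function. The rename table is the one listed in the message; the function name `migrate_engine_config`, the flat-dict config shape, and the example field `default_num_steps` are assumptions made for illustration.

```python
MODALITY_RENAMES = {
    "sd35_t2i": "sd3_t2i",
    "t2v": "hv15_t2v",
    "ar_recaption": "hi3_ar_recaption",
    "dit_recaption": "hi3_dit_recaption",
}


def migrate_engine_config(cfg: dict) -> dict:
    """Return a copy with v1 modality keys renamed and default_* fields dropped."""
    out = {k: v for k, v in cfg.items() if not k.startswith("default_")}
    if "modality" in out:
        # Unknown or already-migrated keys pass through unchanged.
        out["modality"] = MODALITY_RENAMES.get(out["modality"], out["modality"])
    return out


old = {"modality": "t2v", "default_num_steps": 30, "enable_sleep_mode": True}
new = migrate_engine_config(old)
```

Running the function twice gives the same result as running it once, since migrated keys are not in the rename table.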
Drop tests/ from the tree and apply ruff --fix (import ordering, induced by the vllm_omni rename) plus ruff-format across the affected files.
…dump, parity_probe)
Summary
Promotes the v2 vLLM-Omni rollout engine to the canonical `vllm_omni` and retires the original v1 engine. v2 is a thin engine core over a single backend seam with per-modality adapters, replacing v1's monolithic implementation.
Architecture (`unirl/rollout/engine/vllm_omni/`)
- `backends/`: the only code that imports the vLLM-Omni runtime (boot, ports, per-stage `collective_rpc`).
- `adapters/`: per-modality registry keyed on `config.modality`; owns the `RolloutReq` ↔ `RolloutResp` conversion and per-modality topology knobs.
- `pipelines/` / `worker/` / `patches/`: worker-side role packages (custom diffusion pipelines, weight-sync receive mixins, runtime hijacks).
- `weight_sync.py`: canonical sync ops + LoRA lifecycle over the seam.
Supported modalities: SD3.5 (`sd3_t2i`), HunyuanImage-3 (`hi3_*`), HunyuanVideo-1.5 t2v (`hv15_t2v`), Qwen-Image (`qwen_image_t2i`).
v1 retirement / promotion
- `vllm_omni_v2` → `vllm_omni` and classes `VLLMOmniV2{RolloutEngine,EngineConfig}` → `VLLMOmni{…}`.
- `sgl_compat` shim (vendored sglang CUDA-IPC reductions / serializer, so the engine venv needs no sglang) into `unirl/distributed/weight_sync/transfer/`, beside its siblings `bucketed_transfer` / `ipc_dispatch` / `checksum`; no engine package owns cross-engine transfer code.
- `config/validation.py`; example recipes rewired to the v2 engine (modality keys + request-driven sampling).
Verification
- `_target_` resolves (static guard); `ruff` clean.
Notes