feat(transport): TensorMeta segment views — selection without hydration#49
Draft
leviking98z-rgb wants to merge 1 commit into
Root-cause fix for the DP-balance hydration workaround.

TensorMeta gains view_plan: an ordered list of (ref_idx, start, end) segments in ref-local units. select/select_units/select_segments build lazy views over the remote refs (zero data motion on the driver); misaligned slices degrade to views instead of raising; localize preserves plans through ref routing (with_refs, not from_handles).

Materialization is centralized in TensorMeta.assemble with a documented trailing-dim contract: when segments span refs that were padded to different widths, the narrower ones are right-padded with zeros (the TextTokenCondition.concat convention). assemble is wired into every input path: transport.hydrate and the base/gpu_store/transfer_queue get_batch.

balance_track_for_dp now permutes via native track.select(perm): data stays worker-resident and materializes on the destination worker. hydrate_track remains as a utility but is off the balance path.

Verified: 7 CPU unit tests (permutation + ragged pad, view slicing, misaligned-slice degradation, packed token segments, plan survival through with_refs, empty selection, assemble parity) and a 16-GPU e2e gate (viewbal_e2e: 5 steps, ratio_mean 0.9995-1.0004, rank token spread 0.06%).
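The segment-view idea can be illustrated with a minimal sketch. This is not the PR's code: `ToyMeta`, `ref_sizes`, and `_locate` are hypothetical names, and the real TensorMeta presumably carries more state (handles, layout, dtype). The sketch only shows how selection can rewrite a plan of (ref_idx, start, end) segments without fetching any data, including a slice that does not land on a ref boundary.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

# (ref_idx, start, end) in ref-local units; `end` is exclusive (an assumption).
Segment = Tuple[int, int, int]


@dataclass(frozen=True)
class ToyMeta:
    """Hypothetical stand-in for TensorMeta: knows ref sizes, never holds data."""

    ref_sizes: Tuple[int, ...]                       # units stored behind each remote ref
    view_plan: Optional[Tuple[Segment, ...]] = None  # None means "all refs, in order"

    def segments(self) -> Tuple[Segment, ...]:
        if self.view_plan is not None:
            return self.view_plan
        return tuple((i, 0, n) for i, n in enumerate(self.ref_sizes) if n > 0)

    def _locate(self, unit: int) -> Tuple[int, int]:
        """Map a unit index in the current view to (ref_idx, ref-local offset)."""
        if unit < 0:
            raise IndexError("negative indices are not supported in this sketch")
        for ref_idx, start, end in self.segments():
            length = end - start
            if unit < length:
                return ref_idx, start + unit
            unit -= length
        raise IndexError("unit index out of range")

    def select(self, indices: List[int]) -> "ToyMeta":
        """Return a new view; only the plan changes, no data is touched."""
        plan: List[Segment] = []
        for unit in indices:
            ref_idx, off = self._locate(unit)
            # Coalesce with the previous segment when the selection is contiguous.
            if plan and plan[-1][0] == ref_idx and plan[-1][2] == off:
                plan[-1] = (ref_idx, plan[-1][1], off + 1)
            else:
                plan.append((ref_idx, off, off + 1))
        # The dataclass is frozen, so build a new instance instead of mutating.
        return replace(self, view_plan=tuple(plan))


meta = ToyMeta(ref_sizes=(4, 4))
sliced = meta.select(list(range(2, 6)))  # misaligned: starts mid-ref, crosses a boundary
print(sliced.view_plan)                  # ((0, 2, 4), (1, 0, 2))
permuted = sliced.select([3, 0, 1])      # a permutation composes with the existing view
print(permuted.view_plan)                # ((1, 1, 2), (0, 2, 4))
```

Because the result is a new frozen instance, the original meta is untouched, which avoids the in-place mutation problem attributed to the hydration workaround.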
Part of the verl performance-parity series tracked in #40. Root-cause follow-up to #45.
Summary
#45's DP balancing needed hydrate_track: a driver-side full hydration of every TensorMeta field before Batch.select, because selection had no representation on remote refs (TensorMeta.select raised; _slice_by_refs only cut at ref boundaries). That workaround broke the transport's zero-copy premise (worker -> driver -> worker bounce), mutated frozen dataclasses in place, left field types history-dependent, and buried a padding convention in a private helper.

This PR adds the missing primitive: segment views.
- TensorMeta.view_plan: an ordered list of (ref_idx, start, end) segments in ref-local units (rows for CONCAT fields, tokens for PACKED fields).
- select / select_units / select_segments build lazy views, with zero data motion on the driver; a misaligned slice now degrades to a view instead of raising.
- localize preserves plans through ref routing (with_refs, not from_handles).
- Materialization is centralized in TensorMeta.assemble and wired into every input path (hydrate, base / gpu_store / transfer_queue get_batch). The trailing-dim contract is documented there: when segments span refs that were padded to different widths, the narrower ones are right-padded with zeros (the TextTokenCondition.concat convention), so consumers of 2D+ per-shard-padded fields must be mask-driven.
- balance_track_for_dp now permutes via native track.select(perm): data stays worker-resident and materializes on the destination worker.
- hydrate_track remains a utility but is off the balance path.

Test Plan
- tests/test_tensormeta_views.py: 7 CPU unit tests covering permutation + ragged right-pad, view slicing, misaligned-slice degradation, packed token segments, plan survival through with_refs, empty selection, and assemble parity.
- 16-GPU e2e gate (viewbal_e2e, Qwen3-4B DRPO + balance on): 5 steps, ratio_mean 0.9995-1.0004, rank token spread 0.06%, rewards nominal.
- backend.get(refs) / boundary-slice fast paths.
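The "permutation + ragged right-pad" case and the mask-driven requirement from the trailing-dim contract can be sketched as follows. This is an illustration under stated assumptions, not the PR's TensorMeta.assemble: the `assemble` function here is hypothetical, remote refs are modeled as already-fetched nested lists, and segment ends are treated as exclusive.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int, int]  # (ref_idx, start, end), end exclusive (assumption)
Rows = List[List[int]]


def assemble(refs: Sequence[Rows], plan: Sequence[Segment], pad_value: int = 0) -> Rows:
    """Gather the rows named by `plan` and right-pad them to a common width."""
    picked: Rows = []
    for ref_idx, start, end in plan:
        # Copy rows so padding never mutates the source shard.
        picked.extend(list(row) for row in refs[ref_idx][start:end])
    width = max((len(row) for row in picked), default=0)
    return [row + [pad_value] * (width - len(row)) for row in picked]


# Two shards, each already padded to its own width (3 and 5).
token_refs = [
    [[11, 12, 13], [14, 15, 0]],
    [[21, 22, 23, 24, 25], [26, 0, 0, 0, 0]],
]
mask_refs = [
    [[1, 1, 1], [1, 1, 0]],
    [[1, 1, 1, 1, 1], [1, 0, 0, 0, 0]],
]

# A permutation that pulls one row from shard 1, then both rows from shard 0.
plan = [(1, 1, 2), (0, 0, 2)]

tokens = assemble(token_refs, plan)
mask = assemble(mask_refs, plan)
print(tokens)  # [[26, 0, 0, 0, 0], [11, 12, 13, 0, 0], [14, 15, 0, 0, 0]]
print(mask)    # [[1, 0, 0, 0, 0], [1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]

# Mask-driven consumers stay correct: the extra zeros are masked out.
valid_tokens = sum(sum(row) for row in mask)
print(valid_tokens)  # 6
```

Padding the mask field with the same zero convention marks the new positions as invalid, which is why a consumer that counts or reduces through the mask sees the same six real tokens regardless of the assembled width.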