[data, docs] feat: add WDS-native offline packing workflow by kaimo455 · Pull Request #78 · baidu-baige/LoongForge

kaimo455 · 2026-06-04T09:34:49Z

Summary

add the WDS-native offline packing pipeline, including manifest scanning, media-specific packing, pack-plan generation, and WDS writing
move offline packing implementation into the wds_pack package with CLI entrypoints and updated configs
update runtime handling for packed text/image/video samples and refresh offline packing docs
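
For readers unfamiliar with the pack-plan step, the idea can be pictured as a bin-packing pass over per-sample token lengths, run separately for each media type. The following is only a minimal sketch under stated assumptions: it uses first-fit-decreasing (this PR does not say which algorithm wds_pack actually uses), the function name `build_pack_plan` and the `media_type` record field are hypothetical, and the other record fields follow the pack_plan.jsonl description given later in this thread.

```python
import json
from collections import defaultdict


def build_pack_plan(samples, max_tokens):
    """Group samples into packs of at most max_tokens, never mixing media types.

    samples: iterable of (sample_id, media_type, token_len) tuples.
    Returns a list of dicts, one per pack (hypothetical record layout).
    """
    by_type = defaultdict(list)
    for sample_id, media_type, token_len in samples:
        if token_len > max_tokens:
            raise ValueError(f"{sample_id} is longer than max_tokens")
        by_type[media_type].append((sample_id, token_len))

    plan = []
    for media_type in sorted(by_type):
        bins, totals = [], []
        # First-fit-decreasing: place the longest samples first,
        # each into the first pack that still has room.
        for sid, n in sorted(by_type[media_type], key=lambda s: -s[1]):
            for i, total in enumerate(totals):
                if total + n <= max_tokens:
                    bins[i].append((sid, n))
                    totals[i] += n
                    break
            else:
                bins.append([(sid, n)])
                totals.append(n)
        for members, total in zip(bins, totals):
            plan.append({
                "media_type": media_type,  # assumption: not a documented field
                "sample_ids": [sid for sid, _ in members],
                "sample_token_lens": [n for _, n in members],
                "total_token_len": total,
            })
    return plan


def dump_plan_jsonl(plan):
    """Serialize the plan as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in plan)
```

Writing the plan out before touching any media keeps the packing decision auditable: each line can be checked against the token budget without opening a shard.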

fdcp · 2026-06-18T11:41:56Z

Origin of WDS-native offline packing pipeline

The offline packing logic implemented in this PR was originally developed by me for LLaVA-OneVision-1.5. The original implementation and demo examples are available here:

1. Core offline packing tool in LLaVA-OneVision-1.5
https://github.com/fdcp/LLaVA-OneVision-1.5/tree/main/tools/data_preprocess/offline_packing
2. End-to-end offline packing examples for LLaVA-OneVision-1.5
https://github.com/fdcp/LLaVA-OneVision-1.5/tree/main/examples_offline_packing

This packing solution was later migrated and upgraded for LLaVA-OneVision-2:
3. Standalone offline packing module for LLaVA-OneVision-2
https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2/tree/main/offline_packing
4. Sample packing scripts compatible with LLaVA-OneVision-1.5 workflows in V2
https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2/tree/main/examples/llava_onevision1_5/sample_packing

Custom adaptations for LoongForge

Based on the proven multi-modal offline packing framework above, I refactored the entire pipeline to fit LoongForge’s requirements:

Natively support WDS format, including manifest scanning, media-type specific packing, pack-plan generation and WDS writing
Isolate all packing implementations into a dedicated wds_pack package with standardized CLI entrypoints
Refine runtime handling for packed text/image/video samples and fully update offline packing documentation

kaimo455 · 2026-06-19T16:47:29Z

Thanks @fdcp for pointing this out, and more importantly, thank you for developing the original multimodal offline packing pipeline. We appreciate the work you did for LLaVA-OneVision and respect the contribution behind it.

We noticed this attribution issue and have already added a documentation update in #94. The update adds an Acknowledgements section to both the offline data packing feature documentation and the offline packing README, explicitly noting that LoongForge’s WDS-native offline packing workflow is based on the multimodal offline packing framework originally developed for LLaVA-OneVision-1.5 and later migrated/upgraded for LLaVA-OneVision-2. It also includes the upstream links you listed.

We also want to clarify that LoongForge previously collaborated with the LLaVA-NeXt / LLaVA-OneVision work, which is part of the historical context behind these changes. Some older repository or package names may still carry aiak-* naming, while the current LoongForge repository has migrated and adapted part of the LLaVA-OneVision offline packing capabilities.

Compared with the upstream offline packing workflow, the LoongForge adaptation focuses on integration with our WDS-native training data path, including:

consuming existing uncompressed WebDataset tar shards directly, rather than first materializing paired JSON/image sample files or split sample directories
building a manifest/SQLite index with sample metadata, media type, token length, source shard, and tar member byte offsets
separating packing by declared media type (text, image, video) and validating that packed samples do not mix media types unexpectedly
emitting an explicit pack_plan.jsonl with sample_ids, sample_token_lens, and total_token_len for audit/debugging before writing output shards
writing packed WebDataset shards by reusing media bytes from source tar offsets instead of re-reading copied media files from an intermediate directory
storing _meta fields in packed sample JSON so the generated packed sample can be traced back to its source sample ids and token lengths
integrating the workflow into LoongForge’s wds_pack package, config-driven CLI entrypoints, and runtime handling for packed text/image/video chat samples
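
To make the offset-based points above concrete, here is a rough sketch of how tar member byte offsets can be indexed in SQLite and then reused to copy media bytes into a packed shard. It is an illustration only, not the wds_pack implementation: the table schema, the function names, the `pack0.0.jpg` output naming, and the key names inside `_meta` are all assumptions. It relies only on the Python standard library (`tarfile.TarInfo.offset_data` and `size`).

```python
import io
import json
import sqlite3
import tarfile


def build_manifest(tar_path, conn):
    """Index each regular file of an uncompressed tar shard by byte offset."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS members ("
        "sample_id TEXT, ext TEXT, shard TEXT, "
        "data_offset INTEGER, data_size INTEGER)"
    )
    with tarfile.open(tar_path, "r:") as tar:  # "r:" means no compression
        for info in tar:
            if not info.isfile():
                continue
            # Assumes flat member names like "a.jpg": key before the first dot.
            sample_id, _, ext = info.name.partition(".")
            conn.execute(
                "INSERT INTO members VALUES (?, ?, ?, ?, ?)",
                (sample_id, ext, tar_path, info.offset_data, info.size),
            )
    conn.commit()


def read_member(shard, data_offset, data_size):
    """Read one member's payload straight from the shard, without tar parsing."""
    with open(shard, "rb") as f:
        f.seek(data_offset)
        return f.read(data_size)


def add_bytes(tar, name, data):
    """Append an in-memory payload to an open tar file."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def write_packed_sample(out_tar, pack_id, conn, sample_ids, token_lens):
    """Write one packed sample: media copied by offset, plus JSON with _meta."""
    for i, sid in enumerate(sample_ids):
        rows = conn.execute(
            "SELECT ext, shard, data_offset, data_size FROM members "
            "WHERE sample_id = ? AND ext != 'json'",
            (sid,),
        ).fetchall()
        for ext, shard, data_offset, data_size in rows:
            payload = read_member(shard, data_offset, data_size)
            add_bytes(out_tar, f"{pack_id}.{i}.{ext}", payload)
    # Hypothetical _meta layout for tracing a pack back to its sources.
    meta = {"_meta": {"sample_ids": sample_ids, "sample_token_lens": token_lens}}
    add_bytes(out_tar, f"{pack_id}.json", json.dumps(meta).encode("utf-8"))
```

Seeking directly to a recorded offset only works because the source shards are uncompressed; with gzip-compressed tars the payload would not be addressable by byte position.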

Thanks again for paying attention to the project and for raising this. We welcome your PRs for new features, improvements, or any changes that can make the offline packing pipeline more efficient, easier to maintain, and easier for users to adopt. If you think the acknowledgement wording can be improved further, please also feel free to suggest edits directly.

docs: update offline packing WDS-native workflow

bf84905

github-actions Bot added the ci and documentation labels Jun 4, 2026

kaimo455 self-assigned this Jun 4, 2026

kaimo455 changed the title ~~docs: update offline packing WDS-native workflow~~ [docs] docs: update offline packing WDS-native workflow Jun 4, 2026

feat: add WDS-native offline packing pipeline

f41fdf5

kaimo455 changed the title ~~[docs] docs: update offline packing WDS-native workflow~~ [data, docs] feat: add WDS-native offline packing workflow Jun 4, 2026

github-actions Bot added the docker label Jun 4, 2026

chore: add SPDX headers for wds_pack package markers

1e87e12

kaimo455 merged commit ada5957 into master Jun 4, 2026
9 checks passed

kaimo455 merged 3 commits into master from docs/offline-packing-wds-native

2 participants
