[data, docs] feat: add WDS-native offline packing workflow#78

Merged: kaimo455 merged 3 commits into master from docs/offline-packing-wds-native on Jun 4, 2026

kaimo455 commented Jun 4, 2026

Collaborator

Summary

  • add the WDS-native offline packing pipeline, including manifest scanning, media-specific packing, pack-plan generation, and WDS writing
  • move offline packing implementation into the wds_pack package with CLI entrypoints and updated configs
  • update runtime handling for packed text/image/video samples and refresh offline packing docs

github-actions (bot) added the ci and documentation labels Jun 4, 2026
kaimo455 self-assigned this Jun 4, 2026
kaimo455 changed the title from "docs: update offline packing WDS-native workflow" to "[docs] docs: update offline packing WDS-native workflow" Jun 4, 2026
kaimo455 changed the title from "[docs] docs: update offline packing WDS-native workflow" to "[data, docs] feat: add WDS-native offline packing workflow" Jun 4, 2026
github-actions (bot) added the docker label Jun 4, 2026
kaimo455 merged commit ada5957 into master Jun 4, 2026
9 checks passed

fdcp commented Jun 18, 2026

Origin of WDS-native offline packing pipeline

The offline packing logic implemented in this PR was originally developed by me for LLaVA-OneVision-1.5. The original implementation and demo examples are available here:

  1. Core offline packing tool in LLaVA-OneVision-1.5
    https://github.com/fdcp/LLaVA-OneVision-1.5/tree/main/tools/data_preprocess/offline_packing
  2. End-to-end offline packing examples for LLaVA-OneVision-1.5
    https://github.com/fdcp/LLaVA-OneVision-1.5/tree/main/examples_offline_packing

This packing solution was later migrated and upgraded for LLaVA-OneVision-2:

  3. Standalone offline packing module for LLaVA-OneVision-2
    https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2/tree/main/offline_packing
  4. Sample packing scripts compatible with LLaVA-OneVision-1.5 workflows in V2
    https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2/tree/main/examples/llava_onevision1_5/sample_packing

Custom adaptations for LoongForge

Based on the proven multi-modal offline packing framework above, I refactored the entire pipeline to fit LoongForge’s requirements:

  • Natively support WDS format, including manifest scanning, media-type specific packing, pack-plan generation and WDS writing
  • Isolate all packing implementations into a dedicated wds_pack package with standardized CLI entrypoints
  • Refine runtime handling for packed text/image/video samples and fully update offline packing documentation
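
For readers unfamiliar with the manifest-scanning step mentioned in the list above, the following is a minimal sketch of how such a scan could work over an uncompressed WebDataset tar shard. It is not the code from this PR: the function names `scan_shard` and `read_member`, the `members` table, and its columns are all hypothetical, and only the Python standard library (`tarfile`, `sqlite3`) is used.

```python
import sqlite3
import tarfile


def scan_shard(shard_path, conn):
    """Index every regular member of one uncompressed tar shard.

    Hypothetical schema: one row per tar member, holding the byte offset
    of the member's payload inside the shard and the payload size.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS members ("
        "shard TEXT, sample_key TEXT, ext TEXT, "
        "data_offset INTEGER, size INTEGER)"
    )
    rows = []
    # Mode "r:" opens the archive without any decompression.
    with tarfile.open(shard_path, "r:") as tar:
        for info in tar:
            if not info.isfile():
                continue
            # WebDataset convention: the sample key is the path up to the
            # first dot of the basename; the rest is the extension.
            dirname, _, base = info.name.rpartition("/")
            stem, _, ext = base.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            rows.append((shard_path, key, ext, info.offset_data, info.size))
    conn.executemany("INSERT INTO members VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def read_member(shard_path, data_offset, size):
    """Read one member's bytes straight from the shard via its offset."""
    with open(shard_path, "rb") as f:
        f.seek(data_offset)
        return f.read(size)
```

With an index like this, a later writing stage can copy media bytes directly out of the source shard with `read_member` instead of extracting files to an intermediate directory. This only works for uncompressed tar files, since offsets into a compressed stream are not directly seekable.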

kaimo455 commented Jun 19, 2026

Collaborator Author

Thanks @fdcp for pointing this out, and more importantly, thank you for developing the original multimodal offline packing pipeline. We appreciate the work you did for LLaVA-OneVision and respect the contribution behind it.

We noticed this attribution issue and have already added a documentation update in #94. The update adds an Acknowledgements section to both the offline data packing feature documentation and the offline packing README, explicitly noting that LoongForge’s WDS-native offline packing workflow is based on the multimodal offline packing framework originally developed for LLaVA-OneVision-1.5 and later migrated/upgraded for LLaVA-OneVision-2. It also includes the upstream links you listed.

We also want to clarify that LoongForge previously collaborated with the LLaVA-NeXt / LLaVA-OneVision projects, which is part of the historical context behind these changes. Some older repository or package names may still carry aiak-* naming, and the current LoongForge repository has migrated and adapted part of the LLaVA-OneVision offline packing capabilities.

Compared with the upstream offline packing workflow, the LoongForge adaptation focuses on integration with our WDS-native training data path, including:

  • consuming existing uncompressed WebDataset tar shards directly, rather than first materializing paired JSON/image sample files or split sample directories
  • building a manifest/SQLite index with sample metadata, media type, token length, source shard, and tar member byte offsets
  • separating packing by declared media type (text, image, video) and validating that packed samples do not mix media types unexpectedly
  • emitting an explicit pack_plan.jsonl with sample_ids, sample_token_lens, and total_token_len for audit/debugging before writing output shards
  • writing packed WebDataset shards by reusing media bytes from source tar offsets instead of re-reading copied media files from an intermediate directory
  • storing _meta fields in packed sample JSON so the generated packed sample can be traced back to its source sample ids and token lengths
  • integrating the workflow into LoongForge’s wds_pack package, config-driven CLI entrypoints, and runtime handling for packed text/image/video chat samples
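
To make the pack-plan idea in the list above concrete, here is a minimal sketch of how samples could be grouped per media type and emitted as JSONL records. It is an illustration under stated assumptions, not the wds_pack implementation: the packing heuristic shown is plain first-fit decreasing, and `build_pack_plan`, `dump_pack_plan`, `max_tokens`, and the `media_type` record field are hypothetical. Only `sample_ids`, `sample_token_lens`, and `total_token_len` are field names taken from the description above.

```python
import json
from collections import defaultdict


def build_pack_plan(samples, max_tokens):
    """Group samples into packs without mixing media types.

    samples: iterable of (sample_id, media_type, token_len) tuples.
    max_tokens: hypothetical per-pack token budget.
    Returns a list of plan records (dicts), one per packed sample.
    """
    by_media = defaultdict(list)
    for sample_id, media_type, token_len in samples:
        if token_len > max_tokens:
            raise ValueError(f"{sample_id} exceeds the budget of {max_tokens} tokens")
        by_media[media_type].append((sample_id, token_len))

    plan = []
    for media_type in sorted(by_media):
        packs = []
        # First-fit decreasing: place the longest samples first, each into
        # the first pack of the same media type that still has room.
        for sample_id, token_len in sorted(by_media[media_type], key=lambda s: -s[1]):
            for pack in packs:
                if pack["total_token_len"] + token_len <= max_tokens:
                    break
            else:
                pack = {
                    "media_type": media_type,  # assumed field, not in the PR text
                    "sample_ids": [],
                    "sample_token_lens": [],
                    "total_token_len": 0,
                }
                packs.append(pack)
            pack["sample_ids"].append(sample_id)
            pack["sample_token_lens"].append(token_len)
            pack["total_token_len"] += token_len
        plan.extend(packs)
    return plan


def dump_pack_plan(plan):
    """Serialize the plan as JSONL text, one pack per line."""
    return "".join(json.dumps(record) + "\n" for record in plan)
```

Because the plan is written before any output shard exists, each line can be audited on its own, for example by checking that the entries of `sample_token_lens` sum to `total_token_len` and stay within the budget.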

Thanks again for paying attention to the project and for raising this. We welcome your PRs for new features, improvements, or any changes that can make the offline packing pipeline more efficient, easier to maintain, and easier for users to adopt. If you think the acknowledgement wording can be improved further, please also feel free to suggest edits directly.

Labels: ci, docker, documentation
