Skip to content

fix: add preprocessing script for convert SenseNova-SI dataset to qwen3-vl format#33

Merged
caizhongang merged 2 commits into
mainfrom
fix-qwen-data
Apr 29, 2026
Merged

fix: add preprocessing script for convert SenseNova-SI dataset to qwen3-vl format#33
caizhongang merged 2 commits into
mainfrom
fix-qwen-data

Conversation

@KetoneOlefine

Copy link
Copy Markdown
Collaborator

Summary

This PR fixes dataset schema incompatibilities that prevent Qwen3-VL data loading for SenseNova-SI-800K.jsonl.

Root causes

  1. image has mixed types across samples (list[str] and str), which breaks Arrow/HuggingFace dataset construction.
  2. Samples use conversations with plain string payloads, while the training loader expects:
    • messages key
    • content as a structured list of typed objects.

What changed

  • Added training/qwen3_vl/preprocess_sensenova_si_dataset.py to normalize source JSONL:
    • image: str -> [str]
    • conversations -> messages
    • role mapping: human -> user, gpt -> assistant
    • text mapping: value -> [{"type":"text","text": ...}]
  • Updated docs:
    • README.md
    • README_CN.md
  • Updated example dataset config:
    • training/qwen3_vl/data.yaml now points to SenseNova-SI-800K_qwen3vl_format.jsonl

Fixes #31

… Updated dataset YAML path and added instructions for data preprocessing and training configuration.
@caizhongang caizhongang merged commit 150572f into main Apr 29, 2026
2 checks passed
@caizhongang caizhongang deleted the fix-qwen-data branch April 29, 2026 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Training fails at dataset build: ArrowInvalid on mixed image types + KeyError 'messages'

2 participants