fix: add preprocessing script for convert SenseNova-SI dataset to qwen3-vl format by KetoneOlefine · Pull Request #33 · OpenSenseNova/SenseNova-SI

KetoneOlefine · 2026-04-29T07:45:00Z

Summary

This PR fixes dataset schema incompatibilities that prevent Qwen3-VL data loading for SenseNova-SI-800K.jsonl.

Root causes

image has mixed types across samples (list[str] and str), which breaks Arrow/HuggingFace dataset construction.
Samples use conversations with plain string payloads, while the training loader expects:
- messages key
- content as a structured list of typed objects.

What changed

Added training/qwen3_vl/preprocess_sensenova_si_dataset.py to normalize source JSONL:
- image: str -> [str]
- conversations -> messages
- role mapping: human -> user, gpt -> assistant
- text mapping: value -> [{"type":"text","text": ...}]
Updated docs:
- README.md
- README_CN.md
Updated example dataset config:
- training/qwen3_vl/data.yaml now points to SenseNova-SI-800K_qwen3vl_format.jsonl

Fixes #31

… Updated dataset YAML path and added instructions for data preprocessing and training configuration.

KetoneOlefine added 2 commits April 29, 2026 07:36

Enhance README and add preprocessing script for SenseNova-SI dataset.…

7a49648

… Updated dataset YAML path and added instructions for data preprocessing and training configuration.

Refactor code style in preprocess_sensenova_si_dataset.py.

0aac56c

KetoneOlefine mentioned this pull request Apr 29, 2026

Training fails at dataset build: ArrowInvalid on mixed image types + KeyError 'messages' #31

Closed

caizhongang approved these changes Apr 29, 2026

View reviewed changes

caizhongang merged commit 150572f into main Apr 29, 2026
2 checks passed

caizhongang deleted the fix-qwen-data branch April 29, 2026 07:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add preprocessing script for convert SenseNova-SI dataset to qwen3-vl format#33

fix: add preprocessing script for convert SenseNova-SI dataset to qwen3-vl format#33
caizhongang merged 2 commits into
mainfrom
fix-qwen-data

KetoneOlefine commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

KetoneOlefine commented Apr 29, 2026

Summary

Root causes

What changed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants