Support vlm training with multi-image and videos #1460

@HuiyingLi

Description

Add a dataset format that supports mixtures of pure-text, single-image, multi-image, and video samples.
For example, a JSONL file in which each record looks like this:
{
  "conversations": [
    {"from": "human", "value": "\nDescribe first. \nNow second."},
    {"from": "gpt", "value": "First is... Second is..."}
  ],
  "images": ["img1.jpg", "img2.jpg"]
}
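
To make the proposal concrete, here is a minimal sketch of how such records might be parsed and bucketed by modality. It is not an existing loader: `Sample`, `parse_record`, and `load_jsonl` are hypothetical names, and the `"videos"` key is an assumption, since the example above only shows `"images"`.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One record of the proposed format (hypothetical container)."""
    conversations: list[dict]
    images: list[str] = field(default_factory=list)
    # Assumed key: the issue mentions video but does not show its field name.
    videos: list[str] = field(default_factory=list)

    @property
    def modality(self) -> str:
        # Video takes precedence if a record carries both videos and images.
        if self.videos:
            return "video"
        if len(self.images) > 1:
            return "multi_image"
        if self.images:
            return "single_image"
        return "text"


def parse_record(line: str) -> Sample:
    """Parse one JSONL line and check that every turn has 'from' and 'value'."""
    record = json.loads(line)
    turns = record["conversations"]
    for turn in turns:
        if "from" not in turn or "value" not in turn:
            raise ValueError(f"malformed turn: {turn!r}")
    return Sample(turns, record.get("images", []), record.get("videos", []))


def load_jsonl(path: str) -> list[Sample]:
    """Read a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [parse_record(line) for line in f if line.strip()]
```

Treating the media keys as optional lets pure-text records omit them entirely, so one file can mix all four kinds of sample.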

Metadata

Labels

enhancement: New feature or request