Dataset format that supports mixtures of pure-text, single-image, multi-image, and video samples.
Example: a JSONL file with one record per line (pretty-printed here for readability):
{
"conversations": [
{"from": "human", "value": "<image>\nDescribe first.<image>\nNow second."},
{"from": "gpt", "value": "First is... Second is..."}
],
"images": ["img1.jpg", "img2.jpg"]
}
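
A minimal sketch of how such a file might be read and sanity-checked. It assumes each image is referenced in the human turn by an `<image>` placeholder and that the number of placeholders should equal the length of `images`; neither the placeholder name nor this rule is confirmed by the format description, and `parse_record` / `load_jsonl` are hypothetical helper names. Video entries are not covered because their field name is not given here.

```python
import json

# Assumed placeholder token; the format description does not confirm it.
IMAGE_TOKEN = "<image>"


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check that placeholders match the image list."""
    record = json.loads(line)
    # Pure-text records are assumed to omit the "images" key entirely.
    images = record.get("images", [])
    n_tokens = sum(
        turn["value"].count(IMAGE_TOKEN)
        for turn in record["conversations"]
        if turn["from"] == "human"
    )
    if n_tokens != len(images):
        raise ValueError(
            f"{n_tokens} image placeholders but {len(images)} images listed"
        )
    return record


def load_jsonl(path: str) -> list[dict]:
    """Load every non-empty line of a JSONL file as a validated record."""
    with open(path, encoding="utf-8") as f:
        return [parse_record(line) for line in f if line.strip()]
```

Checking the counts at load time surfaces mismatched records early, rather than during training.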