Support vlm training with multi-image and videos

Dataset format that supports mixtures of pure text, single/multi image and video
e.g. jsonls
  {
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe first. <image>\nNow second."},
      {"from": "gpt", "value": "First is... Second is..."}
    ],
    "images": ["img1.jpg", "img2.jpg"]
  }