Thanks for your great work!
In your project, the caption branch is trained only on VG data, so its captioning ability may be weaker than models trained on large-scale caption data together with a large language model. Do you plan to train this model on larger caption datasets, or is there other future work in this direction?
Currently, high-quality, human-annotated region-level caption data are still limited. However, we find it helpful to further mix VG and SemanticSA-1B in a fully multimodal training stage on the v1.1 pre-trained models, without an LLM.
TAP is also a strong image-level vision encoder for MLLMs (e.g., LLaVA). We find it can serve as a natural high-resolution replacement for low-resolution CLIP models while achieving comparable performance (VQAv2, GQA, MMB, ...).
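For readers who want a concrete picture of the "drop-in vision encoder" idea, here is a minimal, hypothetical sketch of how a TAP-style image encoder could feed a LLaVA-style LLM through a small projector. The names `vision_encoder`, `VisionProjector`, and `encode_image_tokens`, as well as the assumed output shape of the encoder, are placeholders for illustration and are not the official TAP or LLaVA API.

```python
# Hypothetical sketch: replacing a low-resolution CLIP vision tower with a
# high-resolution TAP-style image encoder in a LLaVA-style MLLM.
# `vision_encoder` is assumed to return patch features of shape
# (batch, num_patches, vision_dim); this is an illustrative assumption.
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)


def encode_image_tokens(image: torch.Tensor,
                        vision_encoder: nn.Module,
                        projector: VisionProjector) -> torch.Tensor:
    """Produce LLM-ready visual tokens from a (high-resolution) image."""
    with torch.no_grad():
        feats = vision_encoder(image)  # (B, N, vision_dim) -- assumed shape
    return projector(feats)            # (B, N, llm_dim), concatenated with text tokens downstream
```

In this framing, only the vision tower and the projector change; the LLM and the text pipeline stay the same, which is what makes a high-resolution encoder a "natural replacement" for CLIP.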