
Caption branch #17

Open
jetyingjia opened this issue Jun 19, 2024 · 2 comments
@jetyingjia

Thanks for your great work!
In your project, the caption branch is trained only on VG data, so its captioning ability may be weaker than models trained with large-scale caption data and a large language model. Do you plan to train this model with larger caption datasets, or do you have other future work in mind?

@PhyscalX
Collaborator

Hi, @jetyingjia

  1. Currently, high-quality, human-annotated region-level caption data are still limited. However, we find it helpful to further mix VG and SemanticSA-1B in a fully multimodal training stage on the v1.1 pre-trained models, without an LLM.

  2. TAP is also a strong image-level vision encoder for MLLMs (e.g., LLaVA). We find it can serve as a natural high-resolution replacement for low-resolution CLIP models while achieving comparable performance (VQAv2, GQA, MMB, ...).
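
To make point 2 concrete, here is a minimal sketch (not the authors' code) of the idea of swapping a low-resolution CLIP vision tower for a high-resolution TAP-style image encoder in a LLaVA-like MLLM. `TAPImageEncoder` is a hypothetical stand-in for the real TAP image encoder, and all dimensions are illustrative assumptions only:

```python
# Sketch only: illustrates plugging a high-resolution vision encoder into a
# LLaVA-style pipeline. The encoder below is a placeholder, NOT TAP's real API.
import torch
import torch.nn as nn


class TAPImageEncoder(nn.Module):
    """Placeholder for an image-level vision encoder (assumed interface)."""

    def __init__(self, out_dim: int = 1024):
        super().__init__()
        # Stand-in backbone: one conv that tokenizes a 1024x1024 image into a
        # grid of patch features, mimicking a high-resolution ViT patch embed.
        self.patch_embed = nn.Conv2d(3, out_dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(images)          # (B, C, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, C)


class VisionLanguageConnector(nn.Module):
    """Projects vision tokens into the LLM embedding space (LLaVA-style MLP)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_tokens)


if __name__ == "__main__":
    encoder = TAPImageEncoder(out_dim=1024)
    connector = VisionLanguageConnector(vision_dim=1024, llm_dim=4096)
    images = torch.randn(2, 3, 1024, 1024)      # high-resolution input
    tokens = connector(encoder(images))          # (2, 4096, 4096) tokens for the LLM
    print(tokens.shape)
```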

@jetyingjia
Author

Hi, @PhyscalX
In some of my use cases, TAP performs well and is very efficient. I hope a better version will come in the future.
