The setting of Llava-Next #17

Open
TungChintao opened this issue Feb 27, 2025 · 1 comment

@TungChintao

Hello,
Thank you for your contributions to the community. We have a few questions regarding your work:


1. Token Count in LLaVA-Next

LLaVA-Next utilizes dynamic resolution with grid configurations (e.g., 1:2, 2:1, 2:2, 1:3, 3:1), resulting in variable token counts rather than a fixed value of 2880. The token count of 2880 appears specific to the 2:2 grid configuration.
Question: Could you clarify the basis for reporting 2880 tokens for LLaVA-Next in your paper?
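To make the concern concrete, here is a rough sketch of the arithmetic as we understand it. It assumes each 336 px tile yields 576 visual tokens (24 x 24 patches) and that one downsampled base image is added alongside the grid tiles. The constant and the helper name `upper_bound_tokens` are ours, for illustration only; the sketch ignores newline tokens and any removal of padding, so real counts may differ.

```python
# Assumption: 576 visual tokens per 336 px tile (24 x 24 patches).
TOKENS_PER_TILE = 576

def upper_bound_tokens(rows: int, cols: int) -> int:
    """Rough token count for a rows x cols grid plus one base image.

    Hypothetical helper: ignores newline tokens and unpadding.
    """
    tiles = rows * cols + 1  # grid tiles plus the downsampled base image
    return tiles * TOKENS_PER_TILE

# Grid configurations listed in the question above.
GRIDS = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]

for rows, cols in GRIDS:
    print(f"{rows}:{cols} grid -> {upper_bound_tokens(rows, cols)} tokens")
```

Under these assumptions only the 2:2 grid reaches 2880 (5 x 576); the two-tile and three-tile grids come out at 1728 and 2304 respectively.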


2. Token Retention Strategy

Your method operates on individual images. Under LLaVA-Next's dynamic resolution framework:

  • How is the fixed retained token count of 160 ensured across varying grid configurations?
  • If the configuration model = visionzip(model, dominant=135, contextual=25) is applied to each sub-image separately, could the actual number of retained tokens scale to 160 * n, where n depends on the grid (e.g., 2, 3, 4, 5)?
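The scaling we have in mind can be sketched as follows. This is only our guess at the behaviour, not a description of the actual implementation: it assumes the 135 dominant and 25 contextual tokens are kept independently for every sub-image, and that n counts the grid tiles plus one base image. The function name `retained_tokens` is hypothetical.

```python
# Values taken from visionzip(model, dominant=135, contextual=25).
DOMINANT = 135
CONTEXTUAL = 25

def retained_tokens(rows: int, cols: int, include_base: bool = True) -> int:
    """Retained tokens if the budget is applied per sub-image.

    Assumption: n = rows * cols grid tiles, plus one base image when
    include_base is True. This is a hypothesis, not the confirmed method.
    """
    n = rows * cols + (1 if include_base else 0)
    return (DOMINANT + CONTEXTUAL) * n

for rows, cols in [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]:
    print(f"{rows}:{cols} grid -> {retained_tokens(rows, cols)} retained tokens")
```

If this reading is right, a 2:2 grid would retain 800 tokens rather than 160, and the budget would vary from image to image depending on the selected grid.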

We greatly appreciate your time and insights!

@TungChintao
Author

@Yangsenqiao I sincerely hope to receive your reply to the questions above.
