The setting of Llava-Next #17

TungChintao · 2025-02-27T09:59:56Z

Hello,
Thank you for your contributions to the community. We have a few questions regarding your work:

1. Token Count in LLaVA-Next

LLaVA-Next utilizes dynamic resolution with grid configurations (e.g., 1:2, 2:1, 2:2, 1:3, 3:1), resulting in variable token counts rather than a fixed value of 2880. The token count of 2880 appears specific to the 2:2 grid configuration.
Question: Could you clarify the basis for reporting 2880 tokens for LLaVA-Next in your paper?
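
To make the arithmetic behind this question concrete, here is a minimal sketch of how the count varies with the grid. It rests on assumptions that are not stated in this thread: each 336x336 view produces 24 x 24 = 576 visual tokens, one downscaled base view is added on top of the grid tiles, and unpadding and newline tokens are ignored. The name `visual_tokens` is ours, for illustration only.

```python
# Assumption: one 336x336 view encodes to 24 * 24 = 576 visual tokens.
TOKENS_PER_VIEW = 24 * 24

# Grid configurations listed in the question, written as rows:cols.
GRIDS = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]


def visual_tokens(rows, cols, include_base=True):
    """Token count for a rows x cols tile grid, optionally plus one base view.

    Ignores unpadding and newline tokens, so real counts may differ.
    """
    views = rows * cols + (1 if include_base else 0)
    return views * TOKENS_PER_VIEW


counts = {f"{r}:{c}": visual_tokens(r, c) for r, c in GRIDS}
for grid, count in counts.items():
    print(grid, count)
```

Under these assumptions only the 2:2 grid reaches 2880 (5 views x 576); the other listed grids give 1728 or 2304.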

2. Token Retention Strategy

Your method operates on individual images. Under LLaVA-Next's dynamic resolution framework:

How is a fixed retained token count of 160 ensured across the different grid configurations?
If model = visionzip(model, dominant=135, contextual=25) is applied, does the actual number of retained tokens scale to 160 * n (where n depends on the grid, e.g., 2, 3, 4, or 5)?
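
The scaling we are asking about can be sketched as follows. This assumes the dominant and contextual budgets are applied independently to each sub-image, which is exactly the behavior we would like confirmed; `retained_tokens` is a hypothetical helper of ours, not part of the VisionZip code.

```python
def retained_tokens(dominant, contextual, num_views):
    """Total tokens kept if every view keeps dominant + contextual tokens.

    Assumption: the budget is per view, not per input image.
    """
    return (dominant + contextual) * num_views


# n = 1 is the single-image case; 2..5 are the values of n from the question.
retained = {n: retained_tokens(135, 25, n) for n in (1, 2, 3, 4, 5)}
for n, total in retained.items():
    print(n, total)
```

If the budget is per view, the total ranges from 320 to 800 for n between 2 and 5, rather than staying at 160.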

We greatly appreciate your time and insights!


TungChintao · 2025-03-06T07:02:15Z

@Yangsenqiao I would greatly appreciate a reply to the questions above.
