
Conversation

brian-dellabetta
Collaborator

@brian-dellabetta brian-dellabetta commented Oct 20, 2025

SUMMARY:
Upgrade the lm_eval vision language tests from Qwen 2.5 to Qwen 3. After updating the configs to include apply_chat_template, the scores closely align with those achieved with Qwen 2.5.

  • Switch to the neuralmagic/calibration dataset, based on the suggestion here, to avoid tracing issues related to the VL dataset.
  • Switch to the chartqa task to increase the number of samples and reduce variance in accuracy (see the sketch below).
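For reference, a rough sketch of the updated flow (model stub, recipe, and arguments here are illustrative placeholders, not the exact test-harness code, which is driven by the lm_eval test configs):

import lm_eval
from llmcompressor import oneshot

MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"  # hypothetical Qwen 3 VL checkpoint

# Calibrate with the text-only neuralmagic/calibration dataset, which avoids
# the tracing issues seen with the vision-language dataset.
oneshot(
    model=MODEL_ID,
    dataset="neuralmagic/calibration",
    recipe="recipe.yaml",                # placeholder recipe
    output_dir="qwen3-vl-compressed",
    num_calibration_samples=512,         # placeholder sample count
)

# Evaluate the compressed model on chartqa, applying the chat template so
# prompts match how the instruct model was trained.
results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args={"pretrained": "qwen3-vl-compressed"},
    tasks=["chartqa"],                   # larger task, lower variance
    batch_size=100,
    limit=1000,
    apply_chat_template=True,
)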

TEST PLAN:
The 3 lm_eval VL tests were run, and the accuracies were updated.

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

Collaborator

@dsikka dsikka left a comment


Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

@brian-dellabetta
Collaborator Author

Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

mmmu_val is 900 evals total instead of 30. That would probably add ~40 minutes to each lm-eval run, and we run two for each config, so total test time would increase by over 3 hours with that change (~40 min × 2 runs × 3 configs).

@dsikka
Collaborator

dsikka commented Oct 20, 2025


Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

mmmu_val is 900 evals total instead of 30. That would probably add ~40 minutes to each lm-eval run, and we run two for each config, so total test time would increase by over 3 hours with that change.

The 30 datapoints have proven to be very noisy historically. A happy medium might be better, but we should also validate the runtime at a batch size of 100.
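Something like this would be enough to check (a sketch using the lm-eval Python API; the model stub and sample limit are illustrative):

import time
import lm_eval

start = time.perf_counter()
lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args={"pretrained": "Qwen/Qwen3-VL-8B-Instruct"},  # placeholder stub
    tasks=["mmmu_val"],
    batch_size=100,
    limit=300,          # possible "happy medium" between 30 and 900 samples
    apply_chat_template=True,
)
print(f"elapsed: {(time.perf_counter() - start) / 60:.1f} min")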

_template true

Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/qwen3-vl-lmeval branch from ea00c16 to 57e50b1 on October 21, 2025 21:57
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta marked this pull request as ready for review October 22, 2025 20:19
num_fewshot: int = 5
limit: int = 1000
batch_size: int = 100
apply_chat_template: bool = False
Collaborator


Should we not always be applying a chat template?

Collaborator Author


None of the others have; they use add_bos_token=True instead. This is just to be backwards-compatible. Research does not always apply a chat template either; the transforms benchmarks aren't using it -- https://github.com/neuralmagic/research/blob/bdellabe/transforms-benchmarks/examples/llm_compress_eval_example.py#L171-L176
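To make the distinction concrete, a minimal sketch of the two modes (illustrative lm-eval calls; model stubs and tasks are placeholders):

import lm_eval

# Existing text tests: no chat template, BOS token added via model_args.
lm_eval.simple_evaluate(
    model="hf",
    model_args={"pretrained": "some-compressed-model", "add_bos_token": True},
    tasks=["gsm8k"],
)

# New VL tests: format prompts with the tokenizer's chat template instead.
lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args={"pretrained": "some-compressed-vl-model"},
    tasks=["chartqa"],
    apply_chat_template=True,
)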
