
Conversation

brian-dellabetta
Collaborator

@brian-dellabetta brian-dellabetta commented Oct 20, 2025

SUMMARY:
Upgrade the lm_eval vision language tests from Qwen 2.5 to Qwen 3. After updating the configs to include apply_chat_template, the scores closely align with those achieved with Qwen 2.5.

  • Switch to the neuralmagic/calibration dataset, based on the suggestion here, to avoid tracing issues related to the VL dataset.
  • Switch to the chartqa task to increase the number of samples and reduce variance in accuracy (see the sketch below).
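For reference, a rough sketch of the updated flow (model stub, recipe, and arguments here are illustrative placeholders, not the exact test-harness code, which is driven by the lm_eval test configs):

import lm_eval
from llmcompressor import oneshot

MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"  # hypothetical Qwen 3 VL checkpoint

# Calibrate with the text-only neuralmagic/calibration dataset, which avoids
# the tracing issues seen with the vision-language dataset.
oneshot(
    model=MODEL_ID,
    dataset="neuralmagic/calibration",
    recipe="recipe.yaml",                # placeholder recipe
    output_dir="qwen3-vl-compressed",
    num_calibration_samples=512,         # placeholder sample count
)

# Evaluate the compressed model on chartqa, applying the chat template so
# prompts match how the instruct model was trained.
results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args={"pretrained": "qwen3-vl-compressed"},
    tasks=["chartqa"],                   # larger task, lower variance
    batch_size=100,
    limit=1000,
    apply_chat_template=True,
)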

TEST PLAN:
The 3 lm_eval VL tests were run, and the accuracies were updated.

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

Collaborator

@dsikka dsikka left a comment


Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

@brian-dellabetta
Collaborator Author

Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

mmmu_val is 900 evals total instead of 30. That would probably add ~40 minutes to each lm-eval run, and we run two for each config, so total test time would increase by over 3 hours with that change (~40 min × 2 runs × 3 configs).

@dsikka
Collaborator

dsikka commented Oct 20, 2025


Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

mmmu_val is 900 evals total instead of 30. That would probably add ~40 minutes to each lm-eval run, and we run two for each config, so total test time would increase by over 3 hours with that change.

The 30 datapoints have proven to be very noisy historically. A happy medium might be better, but we should also validate the runtime at a batch size of 100.
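Something like this would be enough to check (a sketch using the lm-eval Python API; the model stub and sample limit are illustrative):

import time
import lm_eval

start = time.perf_counter()
lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args={"pretrained": "Qwen/Qwen3-VL-8B-Instruct"},  # placeholder stub
    tasks=["mmmu_val"],
    batch_size=100,
    limit=300,          # possible "happy medium" between 30 and 900 samples
    apply_chat_template=True,
)
print(f"elapsed: {(time.perf_counter() - start) / 60:.1f} min")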

_template true

Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/qwen3-vl-lmeval branch from ea00c16 to 57e50b1 on October 21, 2025 21:57
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta marked this pull request as ready for review October 22, 2025 20:19
num_fewshot: int = 5
limit: int = 1000
batch_size: int = 100
apply_chat_template: bool = False
Collaborator


Should we not always be applying a chat template?

Collaborator Author


None of the others have; they use add_bos_token=True instead. This is just to be backwards-compatible. Research does not always apply a chat template either; the transforms benchmarks aren't using it -- https://github.com/neuralmagic/research/blob/bdellabe/transforms-benchmarks/examples/llm_compress_eval_example.py#L171-L176
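To make the distinction concrete, a minimal sketch of the two modes (illustrative lm-eval calls; model stubs and tasks are placeholders):

import lm_eval

# Existing text tests: no chat template, BOS token added via model_args.
lm_eval.simple_evaluate(
    model="hf",
    model_args={"pretrained": "some-compressed-model", "add_bos_token": True},
    tasks=["gsm8k"],
)

# New VL tests: format prompts with the tokenizer's chat template instead.
lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args={"pretrained": "some-compressed-vl-model"},
    tasks=["chartqa"],
    apply_chat_template=True,
)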
