[tests] Update lm_eval VL tests to qwen 3 #1953
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Why not just use mmmu_val instead of the literature task? That gives us around 0.53 for the dense model.

> Why not use mmmu_val

The 30 datapoints have proven to be very noisy historically. A happy medium might be better, but we should also validate the runtime for a batch size of 100.
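For a rough sense of why 30 datapoints is noisy, the binomial standard error of an accuracy estimate can be sketched. This is a back-of-envelope only, assuming independent samples; the 0.53 figure is the dense-model score mentioned above, and 30 vs. 1000 are the sample counts under discussion.

```python
# Back-of-envelope: standard error of an accuracy estimate,
# assuming independent Bernoulli samples, se = sqrt(p*(1-p)/n).
import math

def accuracy_stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

# At p ~= 0.53 (the dense-model score mentioned above):
print(round(accuracy_stderr(0.53, 30), 3))    # n = 30   -> 0.091
print(round(accuracy_stderr(0.53, 1000), 3))  # n = 1000 -> 0.016
```

At 30 samples, a ~9-point standard error swamps the accuracy deltas these tests are trying to detect; at 1000 samples it drops to under 2 points.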
```python
num_fewshot: int = 5
limit: int = 1000
batch_size: int = 100
apply_chat_template: bool = False
```
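A minimal, hypothetical sketch of how such a config might be modeled as a dataclass. Only the field names and defaults come from the snippet above; the class name `LmEvalConfig` and the override usage are illustrative assumptions, not the repo's actual code.

```python
# Hypothetical container for the eval settings shown in the review
# snippet above; class name and usage are illustrative only.
from dataclasses import dataclass, asdict

@dataclass
class LmEvalConfig:
    num_fewshot: int = 5
    limit: int = 1000
    batch_size: int = 100
    apply_chat_template: bool = False

# A per-model override keeps the other defaults intact:
qwen3_cfg = LmEvalConfig(apply_chat_template=True)
print(asdict(qwen3_cfg))
```

Defaulting `apply_chat_template` to `False` keeps existing test configs backwards-compatible; only the Qwen 3 tests need to opt in.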
Should we not always be applying a chat template?
None of the other tests do; they use add_bos_token=True
instead. This is just to be backwards-compatible. Research does not always apply a chat template either; the transforms benchmarks aren't using it -- https://github.com/neuralmagic/research/blob/bdellabe/transforms-benchmarks/examples/llm_compress_eval_example.py#L171-L176
SUMMARY:
Upgrade the lm_eval vision language tests from Qwen 2.5 to Qwen 3. After updating to include apply_chat_template, the scores closely align with what was achieved with Qwen 2.5.
- Use the neuralmagic/calibration dataset, based on the suggestion here, to avoid tracing issues related to the VL dataset.
- Use the chartqa task, to increase the number of samples and reduce variance in accuracy.

TEST PLAN:
The 3 lm_eval VL tests were run, and the accuracies were updated.