feat: implement model-graded flag for use with MCQEvals #329
Summary
Implements a global CLI flag (`--model-graded`) that enables an LLM-as-a-judge implementation of the specified benchmark. It is only to be used with MCQEval tasks, which is validated by searching for "mcq" in task metadata tags (note: the binding between the MCQEval implementation and the "mcq" metadata tag is enforced by a CI workflow in a separate PR; see it here).
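For context, a typical invocation would look something like this (benchmark and model names are illustrative, and `bench` is assumed to be openbench's CLI entrypoint):

```bash
# Grade an MCQ benchmark with an LLM judge instead of exact answer matching
bench eval mmlu --model groq/llama-3.3-70b-versatile --model-graded
```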
Testing
- `pytest`
- `pre-commit run --all-files`
Note
Adds a `--model-graded` flag to enable LLM-based MCQ grading, validates usage for MCQ benchmarks, wires global scoring config, updates the MCQ scorer, and documents usage.
- Add a `--model-graded` flag in `src/openbench/_cli/eval_command.py` with help text and an env var; expand groups early and validate via `validate_mcq_flag()`.
- When enabled, set `openbench.utils.mcq.USE_MODEL_GRADING=True`, echo a notice, and warn if `GROQ_API_KEY` is missing.
- Update `create_mcq_scorer()` in `src/openbench/scorers/mcq.py` to support LLM grading via `model_graded_fact` with custom instructions and an optional `grading_model` (see the first sketch below).
- Introduce `USE_MODEL_GRADING` and `MCQ_GRADING_MODEL`; have `MCQEval()` pass these into `create_mcq_scorer()`.
- Add `is_mcq_task()` and the fast static-analysis helper `get_mcq_benchmarks()` in `src/openbench/utils/mcq.py`.
- Add `validate_mcq_flag()` in the CLI to restrict the flag to MCQ benchmarks (see the second sketch below).
- Document `--model-graded` in `docs/cli/eval.mdx`.
- Add `docs/development/mcq.mdx`.

Written by Cursor Bugbot for commit 9bc56aa.
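For reviewers who want the shape of the scorer change without opening the diff, here is a minimal sketch of how an MCQ scorer can delegate to inspect_ai's `model_graded_fact` when model grading is enabled. The function signature, instruction text, and the `match()` fallback are illustrative, not the PR's exact implementation:

```python
# Minimal sketch (not the PR's exact code): an MCQ scorer that optionally
# delegates grading to an LLM judge via inspect_ai's model_graded_fact.
from inspect_ai.scorer import Scorer, match, model_graded_fact


def create_mcq_scorer(
    use_model_grading: bool = False,
    grading_model: str | None = None,
) -> Scorer:
    if use_model_grading:
        # LLM-as-a-judge: a grading model decides whether the submission
        # selected the same option as the target, instead of string matching.
        return model_graded_fact(
            instructions=(
                "Judge whether the submitted answer selects the same "
                "multiple-choice option as the target answer. Reply with "
                "GRADE: C if it does, otherwise GRADE: I."
            ),
            model=grading_model,  # None -> grade with the evaluated model
        )
    # Default path: deterministic matching of the answer letter.
    return match()
```

The `grading_model` override here corresponds to the `MCQ_GRADING_MODEL` config named in the list above.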
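And a sketch of the CLI-side guard, assuming `get_mcq_benchmarks()` returns the names of benchmarks whose task metadata carries the "mcq" tag (actual signatures in the PR may differ):

```python
# Sketch of the "mcq"-tag guard; signatures and the error message are assumed.
from typing import Any

from openbench.utils.mcq import get_mcq_benchmarks  # helper added in this PR


def is_mcq_task(metadata: dict[str, Any] | None) -> bool:
    # A task counts as MCQ if its metadata tags include "mcq".
    return "mcq" in (metadata or {}).get("tags", [])


def validate_mcq_flag(benchmark: str, model_graded: bool) -> None:
    # --model-graded only makes sense for MCQEval-backed benchmarks.
    if model_graded and benchmark not in get_mcq_benchmarks():
        raise SystemExit(
            f"--model-graded requires an MCQ benchmark; '{benchmark}' "
            "has no 'mcq' tag in its task metadata."
        )
```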