@nmayorga7 nmayorga7 commented Dec 1, 2025

Summary

Implements a global CLI flag (--model-graded) that enables an LLM-as-a-judge implementation of the specified benchmark. It is only for use with MCQEval tasks, validated by searching for "mcq" in the task's metadata tags (note: the binding between the MCQEval implementation and the "mcq" metadata tag is enforced by a CI workflow in a separate PR).
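The tag check described above might look something like the following sketch. The function signature and metadata layout here are assumptions for illustration, not the actual openbench code:

```python
# Hypothetical sketch of validating --model-graded against task metadata.
# The helper name and the {"tags": [...]} layout are assumptions.

def validate_mcq_flag(task_metadata: dict) -> None:
    """Raise if --model-graded is used with a task not tagged as MCQ."""
    tags = task_metadata.get("tags", [])
    if not any("mcq" in tag.lower() for tag in tags):
        raise ValueError(
            "--model-graded is only supported for MCQ benchmarks "
            "(no 'mcq' tag found in task metadata)"
        )

# An MCQ-tagged task passes validation; anything else raises.
validate_mcq_flag({"tags": ["mcq", "knowledge"]})
```

Tying the flag to a metadata tag (rather than to the task class) keeps the CLI decoupled from benchmark internals, which is why the tag/implementation binding needs its own CI enforcement.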

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Testing

  • I have run the existing test suite (pytest)
  • Tested locally with MCQEval and non-MCQEval for proper functionality
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

Note

Adds a --model-graded flag to enable LLM-based MCQ grading, validates usage for MCQ benchmarks, wires global scoring config, updates MCQ scorer, and documents usage.

  • CLI:
    • Add --model-graded flag in src/openbench/_cli/eval_command.py with help/env var; expand groups early and validate via validate_mcq_flag().
    • On enable, set openbench.utils.mcq.USE_MODEL_GRADING=True, echo notice, and warn if GROQ_API_KEY missing.
  • Scoring (MCQ):
    • Enhance create_mcq_scorer() in src/openbench/scorers/mcq.py to support LLM grading via model_graded_fact with custom instructions and optional grading_model.
    • Default remains regex-based; returns model-graded scorer when enabled.
  • Utils/Infrastructure:
    • Add global config USE_MODEL_GRADING and MCQ_GRADING_MODEL; have MCQEval() pass these into create_mcq_scorer().
    • Add MCQ detection helpers is_mcq_task() and fast static analysis get_mcq_benchmarks() in src/openbench/utils/mcq.py.
    • Add validate_mcq_flag() in CLI to restrict flag to MCQ benchmarks.
  • Docs:
    • Document --model-graded in docs/cli/eval.mdx.
    • Add "Model-Graded Scoring" section with usage and notes in docs/development/mcq.mdx.

Written by Cursor Bugbot for commit 9bc56aa.

@nmayorga7 nmayorga7 requested a review from AarushSah as a code owner December 1, 2025 01:11
    )

    # Return a factory function that matches the regex path pattern
    return lambda: graded_scorer
Bug: Model-graded scorer discards grouped and additional metrics

When use_model_grading=True, the code builds a metrics list including grouped metrics (for category/subject breakdowns) and additional_metrics, but then calls model_graded_fact() without applying them. The regex path correctly uses @scorer(metrics=metrics) to attach these metrics, but the model-graded path returns the raw scorer. This means users of --model-graded will lose per-category and per-subject metric breakdowns that work with regex scoring.
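The suggested fix is to route the model-graded scorer through the same metrics-attaching decorator the regex path uses. A minimal self-contained sketch of that pattern (the scorer decorator and model_graded_fact below are stubs standing in for inspect_ai's, and the metric names are hypothetical):

```python
# Stand-in for inspect_ai's @scorer(metrics=...) decorator: here it simply
# attaches the metrics list to the factory so reporting code can find it.
def scorer(metrics):
    def wrap(factory):
        factory.metrics = metrics
        return factory
    return wrap

# Hypothetical metric names; the real list includes grouped metrics.
metrics = ["accuracy", "grouped(category)", "grouped(subject)"]

def model_graded_fact():
    return lambda output, target: output.strip() == target  # stub judge

# Fixed pattern: wrap the model-graded factory with @scorer(metrics=metrics),
# exactly as the regex path does, instead of returning the raw scorer.
@scorer(metrics=metrics)
def model_graded_mcq():
    return model_graded_fact()
```

With the wrapper applied, the grouped per-category and per-subject breakdowns survive regardless of which grading path is chosen.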


        if result.returncode == 0:
            mcq_files.update(result.stdout.strip().split("\n"))
    except Exception:
        pass

Bug: Silent grep failure causes false validation errors on Windows

The get_mcq_benchmarks() function uses the grep command via subprocess, but silently catches all exceptions with pass. On Windows (or systems without grep), this returns an empty list, causing validate_mcq_flag to reject ALL benchmarks when using --model-graded, with a misleading error message stating they are not MCQ-based even when they are.
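One portable way to address this is to drop the grep subprocess entirely and scan the source tree in pure Python. A sketch under assumed names (find_mcq_files and the MCQEval search string are illustrative; the real helper is get_mcq_benchmarks()):

```python
from pathlib import Path

def find_mcq_files(root, needle="MCQEval"):
    """Pure-Python replacement for the grep subprocess: scans .py files
    directly, so behavior is identical on Windows and Unix."""
    hits = set()
    for path in Path(root).rglob("*.py"):
        try:
            if needle in path.read_text(encoding="utf-8", errors="ignore"):
                hits.add(str(path))
        except OSError:
            continue  # skip one unreadable file rather than abort the scan
    return hits
```

Narrowing the exception handling to per-file OSError also avoids the original failure mode, where a missing grep binary silently produced an empty result set.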

