@nmayorga7 nmayorga7 commented Dec 1, 2025

Summary

Implements a global CLI flag (--model-graded) that enables an LLM-as-a-judge implementation of the specified benchmark. It is only for use with MCQEval tasks, validated by searching for "mcq" in the task's metadata tags (note: the binding between the MCQEval implementation and the "mcq" metadata tag is enforced by a CI workflow in a separate PR).
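The tag check described above might look something like the following sketch. The function signature and metadata layout here are assumptions for illustration, not the actual openbench code:

```python
# Hypothetical sketch of validating --model-graded against task metadata.
# The helper name and the {"tags": [...]} layout are assumptions.

def validate_mcq_flag(task_metadata: dict) -> None:
    """Raise if --model-graded is used with a task not tagged as MCQ."""
    tags = task_metadata.get("tags", [])
    if not any("mcq" in tag.lower() for tag in tags):
        raise ValueError(
            "--model-graded is only supported for MCQ benchmarks "
            "(no 'mcq' tag found in task metadata)"
        )

# An MCQ-tagged task passes validation; anything else raises.
validate_mcq_flag({"tags": ["mcq", "knowledge"]})
```

Tying the flag to a metadata tag (rather than to the task class) keeps the CLI decoupled from benchmark internals, which is why the tag/implementation binding needs its own CI enforcement.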

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Testing

  • I have run the existing test suite (pytest)
  • Tested locally with MCQEval and non-MCQEval for proper functionality
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

Note

Adds a --model-graded flag to enable LLM-based MCQ grading, validates usage for MCQ benchmarks, wires global scoring config, updates MCQ scorer, and documents usage.

  • CLI:
    • Add --model-graded flag in src/openbench/_cli/eval_command.py with help/env var; expand groups early and validate via validate_mcq_flag().
    • On enable, set openbench.utils.mcq.USE_MODEL_GRADING=True, echo notice, and warn if GROQ_API_KEY missing.
  • Scoring (MCQ):
    • Enhance create_mcq_scorer() in src/openbench/scorers/mcq.py to support LLM grading via model_graded_fact with custom instructions and optional grading_model.
    • Default remains regex-based; returns model-graded scorer when enabled.
  • Utils/Infrastructure:
    • Add global config USE_MODEL_GRADING and MCQ_GRADING_MODEL; have MCQEval() pass these into create_mcq_scorer().
    • Add MCQ detection helpers is_mcq_task() and fast static analysis get_mcq_benchmarks() in src/openbench/utils/mcq.py.
    • Add validate_mcq_flag() in CLI to restrict flag to MCQ benchmarks.
  • Docs:
    • Document --model-graded in docs/cli/eval.mdx.
    • Add "Model-Graded Scoring" section with usage and notes in docs/development/mcq.mdx.

Written by Cursor Bugbot for commit 9bc56aa.

@nmayorga7 nmayorga7 requested a review from AarushSah as a code owner December 1, 2025 01:11
    )

    # Return a factory function that matches the regex path pattern
    return lambda: graded_scorer
Bug: Model-graded scorer discards grouped and additional metrics

When use_model_grading=True, the code builds a metrics list including grouped metrics (for category/subject breakdowns) and additional_metrics, but then calls model_graded_fact() without applying them. The regex path correctly uses @scorer(metrics=metrics) to attach these metrics, but the model-graded path returns the raw scorer. This means users of --model-graded will lose per-category and per-subject metric breakdowns that work with regex scoring.
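The suggested fix is to route the model-graded scorer through the same metrics-attaching decorator the regex path uses. A minimal self-contained sketch of that pattern (the scorer decorator and model_graded_fact below are stubs standing in for inspect_ai's, and the metric names are hypothetical):

```python
# Stand-in for inspect_ai's @scorer(metrics=...) decorator: here it simply
# attaches the metrics list to the factory so reporting code can find it.
def scorer(metrics):
    def wrap(factory):
        factory.metrics = metrics
        return factory
    return wrap

# Hypothetical metric names; the real list includes grouped metrics.
metrics = ["accuracy", "grouped(category)", "grouped(subject)"]

def model_graded_fact():
    return lambda output, target: output.strip() == target  # stub judge

# Fixed pattern: wrap the model-graded factory with @scorer(metrics=metrics),
# exactly as the regex path does, instead of returning the raw scorer.
@scorer(metrics=metrics)
def model_graded_mcq():
    return model_graded_fact()
```

With the wrapper applied, the grouped per-category and per-subject breakdowns survive regardless of which grading path is chosen.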


        if result.returncode == 0:
            mcq_files.update(result.stdout.strip().split("\n"))
    except Exception:
        pass

Bug: Silent grep failure causes false validation errors on Windows

The get_mcq_benchmarks() function uses the grep command via subprocess, but silently catches all exceptions with pass. On Windows (or systems without grep), this returns an empty list, causing validate_mcq_flag to reject ALL benchmarks when using --model-graded, with a misleading error message stating they are not MCQ-based even when they are.
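One portable way to address this is to drop the grep subprocess entirely and scan the source tree in pure Python. A sketch under assumed names (find_mcq_files and the MCQEval search string are illustrative; the real helper is get_mcq_benchmarks()):

```python
from pathlib import Path

def find_mcq_files(root, needle="MCQEval"):
    """Pure-Python replacement for the grep subprocess: scans .py files
    directly, so behavior is identical on Windows and Unix."""
    hits = set()
    for path in Path(root).rglob("*.py"):
        try:
            if needle in path.read_text(encoding="utf-8", errors="ignore"):
                hits.add(str(path))
        except OSError:
            continue  # skip one unreadable file rather than abort the scan
    return hits
```

Narrowing the exception handling to per-file OSError also avoids the original failure mode, where a missing grep binary silently produced an empty result set.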

