Fix LLM judge overall pass reconciliation#978

Merged
leggetter merged 2 commits into main from fix/llm-judge-overall-reconcile
Jun 25, 2026

@leggetter

Collaborator

Summary

  • Derive overall_transcript_pass from the criteria[] array when present (logical AND of all criterion passes).
  • Log when reconciliation overrides a contradictory top-level boolean from the model.
  • Add unit tests for reconciliation (llm-judge-parse.test.ts).

Fixes eval-ci failure on main run 28158908775: scenario 02 passed every heuristic and LLM criterion, but the judge returned overall_transcript_pass: false.
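
A minimal sketch of the reconciliation described above, not the merged implementation. The function name matches the one used in this PR, but its body, the `CriterionResult` shape, and the log wording are assumptions made for illustration.

```typescript
// Hypothetical shape of one judged criterion; field names are assumptions.
interface CriterionResult {
  id: string;
  pass: boolean;
}

// Derive the overall verdict from criteria[] when present (logical AND of
// all criterion passes); otherwise fall back to the model's top-level boolean.
function reconcileOverallTranscriptPass(
  overallFromModel: boolean,
  criteria: CriterionResult[],
): boolean {
  if (criteria.length === 0) {
    return overallFromModel;
  }
  return criteria.every((c) => c.pass);
}

// Same shape as the failing run: every criterion passes, yet the model
// reports overall_transcript_pass: false.
const criteria: CriterionResult[] = [
  { id: "c1", pass: true },
  { id: "c2", pass: true },
];
const overallFromModel = false;
const overall = reconcileOverallTranscriptPass(overallFromModel, criteria);

// Log only when reconciliation actually overrides the model's value.
if (criteria.length > 0 && overall !== overallFromModel) {
  console.warn(
    `overall_transcript_pass reconciled: model=${overallFromModel}, derived=${overall}`,
  );
}
```

With an empty `criteria` array the sketch leaves the model's top-level value untouched, so judges that return no per-criterion breakdown behave as before.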

Test plan

  • npm run test in docs/agent-evaluation/
  • npm run typecheck in docs/agent-evaluation/
  • eval-ci green on this PR

Made with Cursor

Derive overall_transcript_pass from criteria when present so eval-ci does
not fail when the model marks every criterion pass but overall false.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings June 25, 2026 10:18

Copilot AI left a comment

Contributor

Pull request overview

Summary

Adjusts LLM-judge JSON parsing in the agent-evaluation harness to make overall_transcript_pass consistent with per-criterion judgments, addressing eval-ci failures caused by contradictory top-level booleans from the model.

Changes:

  • Adds reconcileOverallTranscriptPass() to derive overall pass from criteria[] when present.
  • Logs when reconciliation overrides the model’s top-level overall value.
  • Adds a no-runner unit test for reconciliation and wires it into npm test.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/agent-evaluation/src/llm-judge.ts | Reconciles overall pass from criteria and logs when the model’s top-level value disagrees. |
| docs/agent-evaluation/src/llm-judge-parse.test.ts | Adds unit-style assertions covering reconciliation behavior. |
| docs/agent-evaluation/package.json | Adds a new test script and includes it in the main test pipeline. |

Comment thread docs/agent-evaluation/src/llm-judge.ts Outdated
throw new Error(`JSON.parse failed: ${detail}`);
}
const overall = Boolean(parsed.overall_transcript_pass);
const overall_from_model = Boolean(parsed.overall_transcript_pass);
Comment thread docs/agent-evaluation/src/llm-judge-parse.test.ts
Comment thread docs/agent-evaluation/src/llm-judge.ts Outdated
}
}
const overall = reconcileOverallTranscriptPass(overall_from_model, criteria);
if (criteria.length > 0 && overall !== overall_from_model) {
Parse string "true"/"false" explicitly, apply the same to criteria pass
fields, only log reconciliation when overall was explicitly set, and align
test assertion messages with other eval unit tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
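
For context on why the follow-up commit parses strings explicitly: in JavaScript, `Boolean("false")` evaluates to `true`, because any non-empty string is truthy. The sketch below shows one way the behavior in the commit message could look. `parseExplicitBoolean`, `resolveOverall`, and the choice to treat a missing or unrecognised criterion `pass` as a failure are all assumptions for illustration, not the merged code.

```typescript
// Hypothetical helper: returns undefined when the value is absent or is
// not a recognisable boolean, so callers can tell "unset" from "false".
function parseExplicitBoolean(value: unknown): boolean | undefined {
  if (typeof value === "boolean") return value;
  if (typeof value === "string") {
    const normalized = value.trim().toLowerCase();
    if (normalized === "true") return true;
    if (normalized === "false") return false;
  }
  return undefined;
}

// Assumed shape of the parsed judge JSON.
interface ParsedJudgeOutput {
  overall_transcript_pass?: unknown;
  criteria?: Array<{ pass?: unknown }>;
}

function resolveOverall(parsed: ParsedJudgeOutput): {
  overall: boolean;
  overridden: boolean;
} {
  const explicit = parseExplicitBoolean(parsed.overall_transcript_pass);
  // Assumption: a missing or unparseable criterion pass counts as a failure.
  const passes = (parsed.criteria ?? []).map(
    (c) => parseExplicitBoolean(c.pass) ?? false,
  );
  const overall =
    passes.length > 0 ? passes.every((p) => p) : explicit ?? false;
  // Report an override only when the model explicitly set a top-level value
  // and the criteria-derived verdict contradicts it.
  const overridden =
    explicit !== undefined && passes.length > 0 && overall !== explicit;
  return { overall, overridden };
}

// The naive coercion treats the string "false" as a pass.
const naive = Boolean("false");
const fixed = parseExplicitBoolean("false");
```

Returning `undefined` rather than `false` for an absent value is what lets the caller skip the reconciliation log when the model never set `overall_transcript_pass` at all.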
@leggetter leggetter merged commit 7562289 into main Jun 25, 2026
2 checks passed
@leggetter leggetter deleted the fix/llm-judge-overall-reconcile branch June 25, 2026 11:35

2 participants