Fix LLM judge overall pass reconciliation by leggetter · Pull Request #978 · hookdeck/outpost

leggetter · 2026-06-25T10:18:02Z

Summary

Derive overall_transcript_pass from the criteria[] array when present (logical AND of all criterion passes).
Log when reconciliation overrides a contradictory top-level boolean from the model.
Add unit tests for reconciliation (llm-judge-parse.test.ts).
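
The derivation described in the summary can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `Criterion` shape and the function body are assumptions. Only the name `reconcileOverallTranscriptPass`, its two arguments, and the `criteria.length > 0 && overall !== overall_from_model` check appear in the diff excerpts on this page.

```typescript
// Hypothetical shape of one judged criterion; the real type in llm-judge.ts may differ.
interface Criterion {
  name: string;
  pass: boolean;
}

// Sketch: when criteria are present, overall pass is the logical AND of every
// criterion's pass; otherwise fall back to the model's own top-level value.
function reconcileOverallTranscriptPass(
  overallFromModel: boolean,
  criteria: Criterion[],
): boolean {
  if (criteria.length === 0) {
    return overallFromModel;
  }
  return criteria.every((c) => c.pass);
}

// The contradictory case from the failing run: every criterion passes,
// but the model reports overall false.
const criteria: Criterion[] = [
  { name: "a", pass: true },
  { name: "b", pass: true },
];
const overallFromModel = false;
const overall = reconcileOverallTranscriptPass(overallFromModel, criteria);

// Log only when the derived value overrides the model's value.
if (criteria.length > 0 && overall !== overallFromModel) {
  console.warn(
    `overall_transcript_pass reconciled: model=${overallFromModel}, derived=${overall}`,
  );
}
```

With this approach the per-criterion judgments are treated as the source of truth, so a contradictory top-level boolean no longer fails the run on its own.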

Fixes eval-ci failure on main run 28158908775: scenario 02 passed every heuristic and LLM criterion, but the judge returned overall_transcript_pass: false.

Test plan

npm run test in docs/agent-evaluation/
npm run typecheck in docs/agent-evaluation/
eval-ci green on this PR

Made with Cursor

Derive overall_transcript_pass from criteria when present so eval-ci does not fail when the model marks every criterion pass but overall false. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Summary

Adjusts LLM-judge JSON parsing in the agent-evaluation harness to make overall_transcript_pass consistent with per-criterion judgments, addressing eval-ci failures caused by contradictory top-level booleans from the model.

Changes:

Adds reconcileOverallTranscriptPass() to derive overall pass from criteria[] when present.
Logs when reconciliation overrides the model’s top-level overall value.
Adds a no-runner unit test for reconciliation and wires it into npm test.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
docs/agent-evaluation/src/llm-judge.ts	Reconciles overall pass from criteria and logs when the model’s top-level value disagrees.
docs/agent-evaluation/src/llm-judge-parse.test.ts	Adds unit-style assertions covering reconciliation behavior.
docs/agent-evaluation/package.json	Adds a new test script and includes it in the main `test` pipeline.

    throw new Error(`JSON.parse failed: ${detail}`);
  }
-  const overall = Boolean(parsed.overall_transcript_pass);
+  const overall_from_model = Boolean(parsed.overall_transcript_pass);


    }
  }
+  const overall = reconcileOverallTranscriptPass(overall_from_model, criteria);
+  if (criteria.length > 0 && overall !== overall_from_model) {


Parse string "true"/"false" explicitly, apply the same to criteria pass fields, only log reconciliation when overall was explicitly set, and align test assertion messages with other eval unit tests. Co-authored-by: Cursor <cursoragent@cursor.com>
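
Explicit string parsing matters here because `Boolean("false")` evaluates to `true` in JavaScript: any non-empty string coerces to true. A minimal sketch of stricter parsing follows. The helper name `parseJudgeBoolean`, its `undefined` return for missing or unrecognised input, and the trimming and case-folding are all assumptions for illustration, not the PR's code.

```typescript
// Hypothetical helper: returns undefined when the value is absent or not
// recognisably boolean, so callers can tell "explicitly set" from "missing".
function parseJudgeBoolean(value: unknown): boolean | undefined {
  if (typeof value === "boolean") return value;
  if (typeof value === "string") {
    const s = value.trim().toLowerCase();
    if (s === "true") return true;
    if (s === "false") return false;
  }
  return undefined;
}

// Naive coercion treats the string "false" as truthy.
const naive = Boolean("false"); // true
// Explicit parsing recovers the intended value.
const strict = parseJudgeBoolean("false"); // false
// A missing field stays distinguishable from an explicit false.
const missing = parseJudgeBoolean(undefined); // undefined
```

Returning `undefined` for a missing field is one way to support logging reconciliation only when the overall value was explicitly set.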

Reconcile LLM judge overall pass with criteria array.

5caea74

Copilot AI review requested due to automatic review settings June 25, 2026 10:18

Copilot started reviewing on behalf of leggetter June 25, 2026 10:18

Copilot AI reviewed Jun 25, 2026

leggetter merged commit 7562289 into main Jun 25, 2026
2 checks passed

leggetter deleted the fix/llm-judge-overall-reconcile branch June 25, 2026 11:35

leggetter merged 2 commits into main from fix/llm-judge-overall-reconcile
