Background
The Custom Engine e2e tests (e2e/custom_engine_test.go) currently assert only that the CLI produced output containing the substrings PASS and 1 passed. They do not verify:
- The custom engine's final_message actually reached the judge.
- expect.must_contain and other gate assertions evaluated correctly.
- The generated report.json payload matches what was actually executed.
A custom engine that returns the wrong final_message but happens to print PASS somewhere would pass these tests.
Proposal
After the run, parse the generated report.json and assert:
- The case's final_message matches the fixture script's output.
- The case's judge.outcome is the expected pass/fail.
- The transcript role/content matches the user message that was fed in.
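
The three assertions above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the report.json layout (a top-level cases array whose entries carry final_message, judge.outcome, and a transcript of role/content pairs) is an assumption inferred from the field names in this issue, and the report, reportCase, message, expectation, and verifyReport names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report mirrors an ASSUMED shape of report.json. The JSON keys below are
// guesses based on the field names in this issue, not the real schema.
type report struct {
	Cases []reportCase `json:"cases"`
}

type reportCase struct {
	FinalMessage string `json:"final_message"`
	Judge        struct {
		Outcome string `json:"outcome"`
	} `json:"judge"`
	Transcript []message `json:"transcript"`
}

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// expectation holds what the fixture is known to produce (hypothetical helper type).
type expectation struct {
	FinalMessage string // exact output of the fixture script
	Outcome      string // expected judge outcome, e.g. "pass" or "fail"
	UserMessage  string // user message fed into the run
}

// verifyReport parses raw report.json bytes and checks the single case
// against want. It returns the first mismatch it finds, or nil.
func verifyReport(data []byte, want expectation) error {
	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		return fmt.Errorf("parse report.json: %w", err)
	}
	if len(r.Cases) != 1 {
		return fmt.Errorf("got %d cases, want 1", len(r.Cases))
	}
	c := r.Cases[0]
	if c.FinalMessage != want.FinalMessage {
		return fmt.Errorf("final_message = %q, want %q", c.FinalMessage, want.FinalMessage)
	}
	if c.Judge.Outcome != want.Outcome {
		return fmt.Errorf("judge.outcome = %q, want %q", c.Judge.Outcome, want.Outcome)
	}
	// The transcript must contain the user message exactly as it was fed in.
	for _, m := range c.Transcript {
		if m.Role == "user" && m.Content == want.UserMessage {
			return nil
		}
	}
	return fmt.Errorf("transcript has no user message %q", want.UserMessage)
}
```

In the e2e test itself, the bytes would come from reading the generated report.json (for example with os.ReadFile), and a non-nil error would be passed to t.Fatal. Comparing final_message exactly, rather than by substring, is what closes the gap described in Background: an engine that merely prints PASS no longer satisfies the check.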
Why this is a follow-up, not a blocker
The current e2e tests do catch the most important regression — "Custom Engine doesn't run at all" — but the safety net is shallow.
Origin
Raised during the self code-review of feat/custom-engine-local (CI/tests finding #3).