Inject OUTPOST_API_KEY into agent eval CI step 1 #976
Merged
Forward live Outpost credentials to the agent sandbox during eval:ci so smoke tests run for real; tighten the LLM judge when keys are present and keep the missing-env exception only for local transcript-only runs. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Each run step only receives the secrets it needs: Anthropic/eval vars stay on ci-eval.sh; Outpost execution-only vars stay on execute-ci-artifacts. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify that publish 202 means accepted, not immediately listable in events APIs; steer quickstarts and the agent prompt to Console verification and optional polling instead of hard-fail events.list. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Retry parse failures with stop_reason/token logging and persist full raw judge output to llm-judge-failure.json for CI triage. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
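The retry-and-persist behaviour described in the last commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `llm-judge.ts`: the `JudgeReply` and `JudgeFailure` shapes, the `judgeWithRetry` name, and the default of two attempts are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real types in llm-judge.ts are not shown in this PR page.
interface JudgeReply {
  text: string; // raw model output, expected to be JSON
  stopReason: string | null;
  outputTokens: number;
}

interface JudgeFailure {
  attempts: { raw: string; stopReason: string | null; outputTokens: number; error: string }[];
}

type JudgeResult<T> = { ok: true; value: T } | { ok: false; failure: JudgeFailure };

// Calls the judge up to `maxAttempts` times and returns the first reply that parses as JSON.
// Every unparseable reply is logged and kept in full so it can be written to a failure artifact.
async function judgeWithRetry<T>(
  callJudge: () => Promise<JudgeReply>,
  maxAttempts = 2, // assumed default, not taken from the PR
): Promise<JudgeResult<T>> {
  const failure: JudgeFailure = { attempts: [] };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const reply = await callJudge();
    try {
      return { ok: true, value: JSON.parse(reply.text) as T };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      console.warn(
        `judge parse failure ${attempt}/${maxAttempts}: ` +
          `stop_reason=${reply.stopReason} output_tokens=${reply.outputTokens}`,
      );
      failure.attempts.push({
        raw: reply.text,
        stopReason: reply.stopReason,
        outputTokens: reply.outputTokens,
        error,
      });
    }
  }
  return { ok: false, failure };
}
```

On an `ok: false` result, a caller could serialize `failure` to a file such as `llm-judge-failure.json` for CI triage. Logging the stop reason alongside the token count helps distinguish truncated output from malformed output.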
Pull request overview
Updates Outpost documentation and the agent-evaluation CI harness so agent eval step 1 can run live Outpost smoke tests (when credentials are present), while also clarifying publish semantics (202 acceptance vs eventual observability) in quickstarts and concept docs.
Changes:
- Document and standardize “publish accepted (202) ≠ immediately observable in events/attempts” across quickstarts, concepts, and the agent prompt.
- Require and wire `OUTPOST_API_KEY` (plus defaults for `OUTPOST_TEST_WEBHOOK_URL` / `OUTPOST_API_BASE_URL`) into `ci-eval.sh` and the CI step 1 environment.
- Tighten and harden the LLM judge (retry on invalid JSON; stricter execution scoring when live creds are present; emit a failure artifact on judge parse failures).
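To illustrate the first point, a publish that returns 202 may not be visible in listing APIs straight away, so a check can poll for a bounded time instead of failing on the first empty result. The sketch below is generic: `pollUntilFound` and its `lookup` callback are hypothetical names, and `lookup` stands in for whatever events query the caller uses. The eight-attempt bound mirrors the quickstart snippet quoted in the review; the delay is an assumption.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `lookup` until it returns a value or the attempts run out.
// `lookup` is a hypothetical callback, e.g. a wrapper around an events listing call.
async function pollUntilFound<T>(
  lookup: () => Promise<T | undefined>,
  maxAttempts = 8,
  delayMs = 500, // assumed delay, not taken from the PR
): Promise<T | undefined> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const found = await lookup();
    if (found !== undefined) return found;
    // Do not sleep after the final attempt.
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  // Not observed within the window; this alone does not prove the publish failed.
  return undefined;
}
```

Returning `undefined` rather than throwing leaves the decision to the caller, which matches the guidance to treat polling as optional and to verify delivery in the Console or logs.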
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/content/quickstarts/hookdeck-outpost-typescript.mdoc | Adds callout + optional polling guidance to avoid false negatives right after publish. |
| docs/content/concepts.mdoc | Adds “Publish acceptance vs observability” concept section + anchor used by quickstarts/prompts. |
| docs/agent-evaluation/src/run-agent-eval.ts | Updates header/help text to reflect optional forwarding of Outpost creds into the agent sandbox. |
| docs/agent-evaluation/src/llm-judge.ts | Adds stricter judge rules when live secrets exist, retries on invalid JSON, and writes a failure artifact on repeated parse failures. |
| docs/agent-evaluation/scripts/ci-eval.sh | Requires OUTPOST_API_KEY in CI and sets defaults for Outpost env forwarded into the run. |
| docs/agent-evaluation/results/README.md | Updates interpretation of execution evidence and the role of step 2 re-execution. |
| docs/agent-evaluation/README.md | Aligns CI and local-run docs with step-scoped secrets and strict-vs-lenient execution scoring. |
| docs/agent-evaluation/hookdeck-outpost-agent-prompt.md | Instructs agents to verify delivery via Console/logs and avoid immediate events.list hard-fail. |
| docs/agent-evaluation/fixtures/placeholder-values-for-turn0.md | Updates operator guidance on which env is needed for CI/full runs. |
| docs/agent-evaluation/.env.example | Updates guidance on OUTPOST_API_KEY usage + adds scorer model note. |
| .github/workflows/docs-agent-eval-ci.yml | Injects Outpost env into step 1 and documents per-step env scoping. |
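The "require the key, default the rest" behaviour listed for `ci-eval.sh` is implemented in shell in the PR; the TypeScript sketch below only illustrates the same logic. The function name `resolveOutpostEnv` and the `OutpostEnv` shape are hypothetical, and the fallback values are assumptions taken from the workflow snippet quoted later in the review (the webhook URL falling back to `EVAL_TEST_DESTINATION_URL`, and the base URL shown there).

```typescript
interface OutpostEnv {
  apiKey: string;
  testWebhookUrl: string;
  apiBaseUrl: string;
}

// Assumed default, copied from the workflow env shown in the review comment.
const DEFAULT_API_BASE_URL = "https://api.outpost.hookdeck.com/2025-07-01";

// Fails fast when the API key is missing and fills in defaults for the other values.
function resolveOutpostEnv(env: Record<string, string | undefined>): OutpostEnv {
  const apiKey = env.OUTPOST_API_KEY?.trim();
  if (!apiKey) {
    throw new Error("OUTPOST_API_KEY is required for the CI eval run");
  }
  // Assumption: the test webhook URL falls back to the eval destination URL.
  const testWebhookUrl =
    env.OUTPOST_TEST_WEBHOOK_URL?.trim() || env.EVAL_TEST_DESTINATION_URL?.trim();
  if (!testWebhookUrl) {
    throw new Error("Set OUTPOST_TEST_WEBHOOK_URL or EVAL_TEST_DESTINATION_URL");
  }
  const apiBaseUrl = env.OUTPOST_API_BASE_URL?.trim() || DEFAULT_API_BASE_URL;
  return { apiKey, testWebhookUrl, apiBaseUrl };
}
```

Failing before the agent run starts keeps a missing secret from surfacing later as a confusing smoke-test failure.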
Comment on lines +143 to +146

    ```typescript
    const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
    let found;
    for (let attempt = 1; attempt <= 8 && !found; attempt++) {

Comment on lines +181 to +188

    async function writeJudgeFailureArtifact(
      run_path: string,
      artifact: LlmJudgeFailureArtifact,
    ): Promise<string> {
      const path = judgeFailureArtifactPath(run_path);
      await writeFile(path, `${JSON.stringify(artifact, null, 2)}\n`, "utf8");
      return path;
    }

Comment on lines 65 to +72

    - name: Run eval CI slice (scenarios 01, 02)
      env:
        ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        EVAL_TEST_DESTINATION_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
        EVAL_LOCAL_DOCS: "1"
        OUTPOST_API_KEY: ${{ secrets.OUTPOST_API_KEY }}
        OUTPOST_TEST_WEBHOOK_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
        OUTPOST_API_BASE_URL: https://api.outpost.hookdeck.com/2025-07-01

Clarify HTTP 202 acceptance vs eventual consistency in concepts and all four language quickstarts; use accurate GET /events?tenant_id= wording. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
alexluong approved these changes Jun 24, 2026
leggetter added a commit that referenced this pull request Jun 25, 2026
…977)

* Redact eval artifact secrets and clarify quickstart polling snippets. Address Copilot review on #976: redact known secrets when writing transcripts and judge failures, and scan results/runs before CI upload. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
* Add tests for eval artifact secret redaction. Cover pattern and literal env redaction, JSON artifact output, and wire npm run test:redact-secrets into npm run test. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
* Extend eval artifact redaction to webhook URLs and CI env. Add EVAL_TEST_DESTINATION_URL and OUTPOST_TEST_WEBHOOK_URL to literal redaction; pass secrets into the CI redact step so re-scan can replace plain JSON echoes (Copilot review on #977). Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix eval artifact redaction corrupting JSON structure. Deep-walk artifact objects and redact string leaves before serialization so transcript.json stays parseable for heuristic scoring. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
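The last fix in that commit, redacting string leaves before serialization instead of rewriting the serialized text, can be sketched as below. This is an illustration of the technique only: `redactDeep`, the `Json` type, and the `[REDACTED]` mask are hypothetical names and values, not the identifiers used in the repository.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Walks a JSON-like value and replaces every occurrence of each secret inside string leaves.
// Object keys, numbers, booleans and null are left untouched, so the structure is preserved.
function redactDeep(value: Json, secrets: string[], mask = "[REDACTED]"): Json {
  if (typeof value === "string") {
    // Skip empty secrets: splitting on "" would insert the mask between every character.
    return secrets.reduce(
      (text, secret) => (secret ? text.split(secret).join(mask) : text),
      value,
    );
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactDeep(item, secrets, mask));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = redactDeep(child, secrets, mask);
    }
    return out;
  }
  return value;
}
```

Because the replacement happens on parsed values, `JSON.stringify` always receives well-formed data, and a file such as `transcript.json` stays parseable no matter which characters the secret or the mask contains.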
Summary
- Forward `OUTPOST_API_KEY` (and related Outpost env) to the agent sandbox in CI step 1, not only in `execute-ci-artifacts.sh` step 2
- Require `OUTPOST_API_KEY` in `ci-eval.sh` and default `OUTPOST_TEST_WEBHOOK_URL` / `OUTPOST_API_BASE_URL`
- […] `criteria[].pass` to match judge evidence (avoids pass=false with “counts as pass” prose)

Test plan
- […] `docs/agent-evaluation/**`
- […] `execute-ci-artifacts.sh` as a deterministic re-run gate
- `npm run typecheck` in `docs/agent-evaluation/` (passed in dev)

Made with Cursor