Inject OUTPOST_API_KEY into agent eval CI step 1 by leggetter · Pull Request #976 · hookdeck/outpost

leggetter · 2026-06-24T10:05:15Z

Summary

Forward OUTPOST_API_KEY (and related Outpost env) to the agent sandbox in CI step 1, not only to execute-ci-artifacts.sh in step 2.
Require OUTPOST_API_KEY in ci-eval.sh and provide defaults for OUTPOST_TEST_WEBHOOK_URL and OUTPOST_API_BASE_URL.
Tighten the LLM judge when live credentials are present (strict execution scoring, no missing-env exception); keep the lenient exception for local transcript-only runs without the key.
Require criteria[].pass to match the judge's evidence, which avoids verdicts with pass=false whose prose says the result “counts as pass”.
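The env handling summarized above could be sketched roughly as follows. This is an illustration only, not the code in ci-eval.sh or llm-judge.ts: `resolveOutpostEnv`, `judgeMode`, and the `OutpostEnv` shape are hypothetical names, the base URL default is copied from the workflow snippet in this PR, and the webhook fallback to EVAL_TEST_DESTINATION_URL is an assumption modeled on the same workflow.

```typescript
// Hypothetical sketch: function names, types, and defaults are assumptions.
type Env = Record<string, string | undefined>;

interface OutpostEnv {
  apiKey: string;
  webhookUrl: string | undefined;
  apiBaseUrl: string;
}

// Assumed default, taken from the workflow env shown in this PR.
const DEFAULT_API_BASE_URL = "https://api.outpost.hookdeck.com/2025-07-01";

// Fail fast when the key is missing; fill in defaults for the other two values.
function resolveOutpostEnv(env: Env): OutpostEnv {
  const apiKey = env.OUTPOST_API_KEY?.trim();
  if (!apiKey) {
    throw new Error("OUTPOST_API_KEY is required for the CI eval slice");
  }
  const webhookUrl =
    env.OUTPOST_TEST_WEBHOOK_URL?.trim() ||
    env.EVAL_TEST_DESTINATION_URL?.trim() ||
    undefined;
  const apiBaseUrl = env.OUTPOST_API_BASE_URL?.trim() || DEFAULT_API_BASE_URL;
  return { apiKey, webhookUrl, apiBaseUrl };
}

// Strict execution scoring when a live key is present, lenient otherwise.
function judgeMode(env: Env): "strict" | "lenient" {
  return env.OUTPOST_API_KEY?.trim() ? "strict" : "lenient";
}
```

Keeping the strict/lenient decision in one small function means a local transcript-only run and a CI run differ only in the environment they pass in.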

Test plan

Merge and re-run Docs agent eval (CI slice) on a branch that touches docs/agent-evaluation/**
Confirm step 1 agent transcript shows live Outpost smoke tests (not mock/dummy key workarounds)
Confirm step 2 still runs execute-ci-artifacts.sh as a deterministic re-run gate
Local: npm run typecheck in docs/agent-evaluation/ (passed in dev)

Made with Cursor

Forward live Outpost credentials to the agent sandbox during eval:ci so smoke tests run for real; tighten the LLM judge when keys are present and keep the missing-env exception only for local transcript-only runs. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Each run step only receives the secrets it needs: Anthropic/eval vars stay on ci-eval.sh; Outpost execution-only vars stay on execute-ci-artifacts. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Clarify that publish 202 means accepted, not immediately listable in events APIs; steer quickstarts and the agent prompt to Console verification and optional polling instead of hard-fail events.list. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
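The optional polling mentioned here could look something like the sketch below. It is a hedged illustration, not the quickstart's snippet: `pollUntilFound` is a hypothetical helper, the `lookup` callback stands in for whatever events query the caller uses, and the attempt count and delay are arbitrary.

```typescript
// Hypothetical helper: a 202 from publish means "accepted", so the event may
// not be listable yet. Poll a caller-supplied lookup instead of failing on the
// first empty result.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollUntilFound<T>(
  lookup: () => Promise<T | undefined>,
  maxAttempts = 8,
  delayMs = 500,
): Promise<T | undefined> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const found = await lookup();
    if (found !== undefined) return found;
    // Wait between attempts, but not after the last one.
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  // Not found within the window: treat as "not yet observable", not as failure.
  return undefined;
}
```

Returning `undefined` rather than throwing leaves it to the caller to decide whether an unobserved event is an error or just a prompt to check the Console.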

Retry parse failures with stop_reason/token logging and persist full raw judge output to llm-judge-failure.json for CI triage. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
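A minimal sketch of this retry-on-parse-failure idea follows. It is not the code in llm-judge.ts: `judgeWithRetry`, `ModelReply`, and `JudgeResult` are hypothetical, and `callModel` is a stand-in for the real judge request. On exhaustion it returns the raw replies so a caller could persist them (for example to llm-judge-failure.json).

```typescript
// Hypothetical types: the real judge response shape is not shown in this PR.
interface ModelReply {
  text: string;
  stopReason: string;
  outputTokens: number;
}

type JudgeResult =
  | { ok: true; verdict: unknown; attempts: number }
  | { ok: false; attempts: number; rawOutputs: ModelReply[] };

async function judgeWithRetry(
  callModel: () => Promise<ModelReply>,
  maxAttempts = 2,
): Promise<JudgeResult> {
  const rawOutputs: ModelReply[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const reply = await callModel();
    rawOutputs.push(reply);
    try {
      return { ok: true, verdict: JSON.parse(reply.text), attempts: attempt };
    } catch {
      // Truncated output typically shows up as invalid JSON; log why it stopped.
      console.warn(
        `judge parse failed (attempt ${attempt}, stop_reason=${reply.stopReason}, output_tokens=${reply.outputTokens})`,
      );
    }
  }
  // Keep every raw reply so the failure can be triaged from a CI artifact.
  return { ok: false, attempts: maxAttempts, rawOutputs };
}
```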

Copilot

Pull request overview

Updates Outpost documentation and the agent-evaluation CI harness so agent eval step 1 can run live Outpost smoke tests (when credentials are present), while also clarifying publish semantics (202 acceptance vs eventual observability) in quickstarts and concept docs.

Changes:

Document and standardize “publish accepted (202) ≠ immediately observable in events/attempts” across quickstarts, concepts, and the agent prompt.
Require and wire OUTPOST_API_KEY (plus defaults for OUTPOST_TEST_WEBHOOK_URL / OUTPOST_API_BASE_URL) into ci-eval.sh and the CI step 1 environment.
Tighten and harden the LLM judge (retry on invalid JSON; stricter execution scoring when live creds are present; emit a failure artifact on judge parse failures).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| docs/content/quickstarts/hookdeck-outpost-typescript.mdoc | Adds callout + optional polling guidance to avoid false negatives right after publish. |
| docs/content/concepts.mdoc | Adds “Publish acceptance vs observability” concept section + anchor used by quickstarts/prompts. |
| docs/agent-evaluation/src/run-agent-eval.ts | Updates header/help text to reflect optional forwarding of Outpost creds into the agent sandbox. |
| docs/agent-evaluation/src/llm-judge.ts | Adds stricter judge rules when live secrets exist, retries on invalid JSON, and writes a failure artifact on repeated parse failures. |
| docs/agent-evaluation/scripts/ci-eval.sh | Requires `OUTPOST_API_KEY` in CI and sets defaults for Outpost env forwarded into the run. |
| docs/agent-evaluation/results/README.md | Updates interpretation of execution evidence and the role of step 2 re-execution. |
| docs/agent-evaluation/README.md | Aligns CI and local-run docs with step-scoped secrets and strict-vs-lenient execution scoring. |
| docs/agent-evaluation/hookdeck-outpost-agent-prompt.md | Instructs agents to verify delivery via Console/logs and avoid immediate events.list hard-fail. |
| docs/agent-evaluation/fixtures/placeholder-values-for-turn0.md | Updates operator guidance on which env is needed for CI/full runs. |
| docs/agent-evaluation/.env.example | Updates guidance on `OUTPOST_API_KEY` usage + adds scorer model note. |
| .github/workflows/docs-agent-eval-ci.yml | Injects Outpost env into step 1 and documents per-step env scoping. |


+```typescript
+const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
+let found;
+for (let attempt = 1; attempt <= 8 && !found; attempt++) {


+async function writeJudgeFailureArtifact(
+  run_path: string,
+  artifact: LlmJudgeFailureArtifact,
+): Promise<string> {
+  const path = judgeFailureArtifactPath(run_path);
+  await writeFile(path, `${JSON.stringify(artifact, null, 2)}\n`, "utf8");
+  return path;
+}


      - name: Run eval CI slice (scenarios 01, 02)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          EVAL_TEST_DESTINATION_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
          EVAL_LOCAL_DOCS: "1"
+          OUTPOST_API_KEY: ${{ secrets.OUTPOST_API_KEY }}
+          OUTPOST_TEST_WEBHOOK_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
+          OUTPOST_API_BASE_URL: https://api.outpost.hookdeck.com/2025-07-01


Clarify HTTP 202 acceptance vs eventual consistency in concepts and all four language quickstarts; use accurate GET /events?tenant_id= wording. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>

…977)

* Redact eval artifact secrets and clarify quickstart polling snippets. Address Copilot review on #976: redact known secrets when writing transcripts and judge failures, and scan results/runs before CI upload. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
* Add tests for eval artifact secret redaction. Cover pattern and literal env redaction, JSON artifact output, and wire npm run test:redact-secrets into npm run test. Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
* Extend eval artifact redaction to webhook URLs and CI env. Add EVAL_TEST_DESTINATION_URL and OUTPOST_TEST_WEBHOOK_URL to literal redaction; pass secrets into the CI redact step so re-scan can replace plain JSON echoes (Copilot review on #977). Co-Authored-By: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix eval artifact redaction corrupting JSON structure. Deep-walk artifact objects and redact string leaves before serialization so transcript.json stays parseable for heuristic scoring. Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
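The deep-walk redaction described in the last item could be sketched as below. This is an assumption-laden illustration, not the code from #977: `redactDeep`, the `Json` type, and the `[REDACTED]` placeholder are hypothetical. The point it shows is that redacting string leaves before serialization never touches JSON syntax, whereas a find-and-replace on already serialized text can cut across quotes or escape sequences.

```typescript
// Hypothetical sketch of leaf-level redaction; names and placeholder are assumptions.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function redactDeep(value: Json, secrets: readonly string[]): Json {
  if (typeof value === "string") {
    let out = value;
    for (const secret of secrets) {
      // Skip empty secrets: splitting on "" would redact between every character.
      if (secret.length > 0) out = out.split(secret).join("[REDACTED]");
    }
    return out;
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactDeep(item, secrets));
  }
  if (value !== null && typeof value === "object") {
    const result: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      result[key] = redactDeep(child, secrets);
    }
    return result;
  }
  // Numbers, booleans, and null carry no secret text.
  return value;
}
```

Because only string leaves change, `JSON.stringify` of the result is always valid JSON, which keeps a file such as transcript.json parseable for later scoring.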

leggetter and others added 4 commits June 24, 2026 11:04

Harden LLM judge against truncated JSON responses.

d1228b7


leggetter marked this pull request as ready for review June 24, 2026 13:48

leggetter requested review from alexluong and Copilot June 24, 2026 13:48

Copilot started reviewing on behalf of leggetter June 24, 2026 13:48

Copilot AI reviewed Jun 24, 2026


alexluong approved these changes Jun 24, 2026


leggetter merged commit 9e035f9 into main Jun 24, 2026
3 checks passed

leggetter deleted the fix/eval-ci-inject-outpost-env branch June 24, 2026 15:59

leggetter restored the fix/eval-ci-inject-outpost-env branch June 24, 2026 16:00

leggetter mentioned this pull request Jun 24, 2026

Redact eval artifact secrets and clarify quickstart polling snippets #977

Merged

2 tasks

leggetter merged 5 commits into main from fix/eval-ci-inject-outpost-env
