Inject OUTPOST_API_KEY into agent eval CI step 1 #976

Merged
leggetter merged 5 commits into main from fix/eval-ci-inject-outpost-env on Jun 24, 2026
@leggetter (Collaborator)

Summary

  • Forward OUTPOST_API_KEY (and related Outpost env) to the agent sandbox in CI step 1, not only in execute-ci-artifacts.sh step 2
  • Require OUTPOST_API_KEY in ci-eval.sh and default OUTPOST_TEST_WEBHOOK_URL / OUTPOST_API_BASE_URL
  • Tighten the LLM judge when live credentials are present (strict execution scoring; no missing-env exception); keep the lenient exception for local transcript-only runs without the key
  • Require criteria[].pass to match judge evidence (avoids pass=false with “counts as pass” prose)
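
The two judge-related bullets above can be illustrated with a small sketch. It is not the code in llm-judge.ts: the `Criterion` shape (beyond `pass`), the `evidence` field name, the helper names, and the phrase-matching heuristic are all assumptions made for illustration. It only shows the idea of choosing strict scoring when a live key is present and flagging criteria whose prose contradicts the boolean.

```typescript
// Sketch under stated assumptions; names below are hypothetical, not the PR's API.
type ExecutionScoring = "strict" | "lenient";

// Strict when a live Outpost key is present; lenient for transcript-only local runs.
function executionScoringMode(env: Record<string, string | undefined>): ExecutionScoring {
  return env.OUTPOST_API_KEY?.trim() ? "strict" : "lenient";
}

interface Criterion {
  id: string; // hypothetical identifier
  pass: boolean; // corresponds to criteria[].pass in the PR description
  evidence: string; // hypothetical field holding the judge's prose
}

// Assumed heuristic: prose that claims a pass while the boolean says false.
const PASS_PROSE = /\b(counts as (a )?pass|treated as (a )?pass|should pass)\b/i;

// Returns the ids of criteria where pass=false but the evidence reads as a pass.
function findInconsistentCriteria(criteria: Criterion[]): string[] {
  return criteria
    .filter((c) => !c.pass && PASS_PROSE.test(c.evidence))
    .map((c) => c.id);
}
```

A caller could reject or retry a judge report whenever `findInconsistentCriteria` returns a non-empty list, rather than trusting either the boolean or the prose alone.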

Test plan

  • Merge and re-run Docs agent eval (CI slice) on a branch that touches docs/agent-evaluation/**
  • Confirm step 1 agent transcript shows live Outpost smoke tests (not mock/dummy key workarounds)
  • Confirm step 2 still runs execute-ci-artifacts.sh as a deterministic re-run gate
  • Local: npm run typecheck in docs/agent-evaluation/ (passed in dev)

Made with Cursor

leggetter and others added 4 commits June 24, 2026 11:04
Forward live Outpost credentials to the agent sandbox during eval:ci so
smoke tests run for real; tighten the LLM judge when keys are present and
keep the missing-env exception only for local transcript-only runs.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Each run step only receives the secrets it needs: Anthropic/eval vars
stay on ci-eval.sh; Outpost execution-only vars stay on execute-ci-artifacts.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Clarify that publish 202 means accepted, not immediately listable in
events APIs; steer quickstarts and the agent prompt to Console
verification and optional polling instead of hard-fail events.list.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Retry parse failures with stop_reason/token logging and persist full raw
judge output to llm-judge-failure.json for CI triage.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
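
The retry behaviour described in the last commit above might look roughly like the following. This is a minimal sketch, not the code in llm-judge.ts: `JudgeReply`, `JudgeFailureRecord`, `judgeWithRetry`, the field names, and the default of two attempts are assumptions. The caller-supplied `callJudge` stands in for whatever function actually calls the model.

```typescript
// Sketch under stated assumptions; all type and function names are hypothetical.
interface JudgeReply {
  text: string; // raw model output that should contain JSON
  stopReason: string | null; // e.g. why generation stopped, if the API reports it
  outputTokens: number;
}

interface JudgeFailureRecord {
  attempts: Array<{
    attempt: number;
    stopReason: string | null;
    outputTokens: number;
    rawText: string; // full raw output kept for CI triage
    parseError: string;
  }>;
}

type JudgeOutcome =
  | { ok: true; value: unknown }
  | { ok: false; failure: JudgeFailureRecord };

// Calls the judge up to maxAttempts times and returns the first reply that parses as JSON.
async function judgeWithRetry(
  callJudge: () => Promise<JudgeReply>,
  maxAttempts = 2,
): Promise<JudgeOutcome> {
  const failure: JudgeFailureRecord = { attempts: [] };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const reply = await callJudge();
    try {
      return { ok: true, value: JSON.parse(reply.text) };
    } catch (err) {
      failure.attempts.push({
        attempt,
        stopReason: reply.stopReason,
        outputTokens: reply.outputTokens,
        rawText: reply.text,
        parseError: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return { ok: false, failure };
}
```

On `ok: false`, the caller would serialize `failure` to llm-judge-failure.json so the raw output and stop reasons are available when triaging a CI run.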
@leggetter leggetter marked this pull request as ready for review June 24, 2026 13:48
@leggetter leggetter requested review from alexluong and Copilot June 24, 2026 13:48

Copilot AI (Contributor) left a comment

Pull request overview

Updates Outpost documentation and the agent-evaluation CI harness so agent eval step 1 can run live Outpost smoke tests (when credentials are present), while also clarifying publish semantics (202 acceptance vs eventual observability) in quickstarts and concept docs.

Changes:

  • Document and standardize “publish accepted (202) ≠ immediately observable in events/attempts” across quickstarts, concepts, and the agent prompt.
  • Require and wire OUTPOST_API_KEY (plus defaults for OUTPOST_TEST_WEBHOOK_URL / OUTPOST_API_BASE_URL) into ci-eval.sh and the CI step 1 environment.
  • Tighten and harden the LLM judge (retry on invalid JSON; stricter execution scoring when live creds are present; emit a failure artifact on judge parse failures).
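
The "require the key, default the rest" behaviour in the second bullet can be sketched as below. The real logic lives in ci-eval.sh, a shell script, so this TypeScript version is only an illustration. The base URL is the value shown in the workflow excerpt in this PR; treating it as the script's default, and falling back from OUTPOST_TEST_WEBHOOK_URL to EVAL_TEST_DESTINATION_URL, are assumptions, as are the names `resolveOutpostEnv` and `OutpostEnv`.

```typescript
// Sketch under stated assumptions; not a port of ci-eval.sh.
type Env = Record<string, string | undefined>;

interface OutpostEnv {
  OUTPOST_API_KEY: string;
  OUTPOST_API_BASE_URL: string;
  OUTPOST_TEST_WEBHOOK_URL: string;
}

// Value taken from the workflow in this PR; using it as the default is an assumption.
const DEFAULT_OUTPOST_API_BASE_URL = "https://api.outpost.hookdeck.com/2025-07-01";

function resolveOutpostEnv(env: Env): OutpostEnv {
  const apiKey = env.OUTPOST_API_KEY?.trim();
  if (!apiKey) {
    throw new Error("OUTPOST_API_KEY is required for the eval CI slice");
  }
  // Assumed fallback: reuse the eval destination URL when no webhook URL is set.
  const webhookUrl =
    env.OUTPOST_TEST_WEBHOOK_URL?.trim() || env.EVAL_TEST_DESTINATION_URL?.trim();
  if (!webhookUrl) {
    throw new Error("Set OUTPOST_TEST_WEBHOOK_URL or EVAL_TEST_DESTINATION_URL");
  }
  return {
    OUTPOST_API_KEY: apiKey,
    OUTPOST_API_BASE_URL: env.OUTPOST_API_BASE_URL?.trim() || DEFAULT_OUTPOST_API_BASE_URL,
    OUTPOST_TEST_WEBHOOK_URL: webhookUrl,
  };
}
```

Failing fast on a missing key keeps step 1 from silently degrading to mock or dummy-key workarounds, which is the situation this PR sets out to avoid.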

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/content/quickstarts/hookdeck-outpost-typescript.mdoc | Adds callout + optional polling guidance to avoid false negatives right after publish. |
| docs/content/concepts.mdoc | Adds “Publish acceptance vs observability” concept section + anchor used by quickstarts/prompts. |
| docs/agent-evaluation/src/run-agent-eval.ts | Updates header/help text to reflect optional forwarding of Outpost creds into the agent sandbox. |
| docs/agent-evaluation/src/llm-judge.ts | Adds stricter judge rules when live secrets exist, retries on invalid JSON, and writes a failure artifact on repeated parse failures. |
| docs/agent-evaluation/scripts/ci-eval.sh | Requires OUTPOST_API_KEY in CI and sets defaults for Outpost env forwarded into the run. |
| docs/agent-evaluation/results/README.md | Updates interpretation of execution evidence and the role of step 2 re-execution. |
| docs/agent-evaluation/README.md | Aligns CI and local-run docs with step-scoped secrets and strict-vs-lenient execution scoring. |
| docs/agent-evaluation/hookdeck-outpost-agent-prompt.md | Instructs agents to verify delivery via Console/logs and avoid immediate events.list hard-fail. |
| docs/agent-evaluation/fixtures/placeholder-values-for-turn0.md | Updates operator guidance on which env is needed for CI/full runs. |
| docs/agent-evaluation/.env.example | Updates guidance on OUTPOST_API_KEY usage + adds scorer model note. |
| .github/workflows/docs-agent-eval-ci.yml | Injects Outpost env into step 1 and documents per-step env scoping. |


Comment on lines +143 to +146

```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
let found;
for (let attempt = 1; attempt <= 8 && !found; attempt++) {
```

Comment on lines +181 to +188

```typescript
async function writeJudgeFailureArtifact(
  run_path: string,
  artifact: LlmJudgeFailureArtifact,
): Promise<string> {
  const path = judgeFailureArtifactPath(run_path);
  await writeFile(path, `${JSON.stringify(artifact, null, 2)}\n`, "utf8");
  return path;
}
```

Comment on lines 65 to +72

```yaml
- name: Run eval CI slice (scenarios 01, 02)
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    EVAL_TEST_DESTINATION_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
    EVAL_LOCAL_DOCS: "1"
    OUTPOST_API_KEY: ${{ secrets.OUTPOST_API_KEY }}
    OUTPOST_TEST_WEBHOOK_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
    OUTPOST_API_BASE_URL: https://api.outpost.hookdeck.com/2025-07-01
```

Clarify HTTP 202 acceptance vs eventual consistency in concepts and all
four language quickstarts; use accurate GET /events?tenant_id= wording.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
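
Because a 202 only means the publish was accepted, a script that checks for the event right away should poll rather than fail on the first empty result. The helper below is a generic sketch of that pattern. `pollUntilFound` and its `lookup` callback are hypothetical and not part of any Outpost SDK; the callback stands in for a request such as GET /events?tenant_id=. The default of 8 attempts echoes the loop bound in the reviewed quickstart snippet, while the 500 ms delay is an assumption.

```typescript
// Sketch under stated assumptions; `lookup` is supplied by the caller.
async function pollUntilFound<T>(
  lookup: () => Promise<T | undefined>,
  options: { attempts?: number; delayMs?: number } = {},
): Promise<T | undefined> {
  const { attempts = 8, delayMs = 500 } = options;
  const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

  for (let attempt = 1; attempt <= attempts; attempt++) {
    const found = await lookup();
    if (found !== undefined) return found;
    // Wait between attempts, but not after the final one.
    if (attempt < attempts) await sleep(delayMs);
  }
  // Still not observable: not proof of a failed publish, so verify another way.
  return undefined;
}
```

An `undefined` result is best treated as "not visible yet, check the Console or logs" rather than as a hard failure, in line with the guidance this commit adds to the quickstarts.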
@leggetter leggetter merged commit 9e035f9 into main Jun 24, 2026
3 checks passed
@leggetter leggetter deleted the fix/eval-ci-inject-outpost-env branch June 24, 2026 15:59
@leggetter leggetter restored the fix/eval-ci-inject-outpost-env branch June 24, 2026 16:00
leggetter added a commit that referenced this pull request Jun 25, 2026
…977)

* Redact eval artifact secrets and clarify quickstart polling snippets.

Address Copilot review on #976: redact known secrets when writing
transcripts and judge failures, and scan results/runs before CI upload.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* Add tests for eval artifact secret redaction.

Cover pattern and literal env redaction, JSON artifact output, and wire
npm run test:redact-secrets into npm run test.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* Extend eval artifact redaction to webhook URLs and CI env.

Add EVAL_TEST_DESTINATION_URL and OUTPOST_TEST_WEBHOOK_URL to literal
redaction; pass secrets into the CI redact step so re-scan can replace
plain JSON echoes (Copilot review on #977).

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix eval artifact redaction corrupting JSON structure.

Deep-walk artifact objects and redact string leaves before serialization so
transcript.json stays parseable for heuristic scoring.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
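
The last fix in the commit message above, redacting string leaves before serialization, can be sketched as follows. This is an illustration of the technique rather than the repository's implementation: `Json`, `redactString`, `redactDeep`, and the `[REDACTED]` marker are assumed names. Walking the object first means secret values are replaced before JSON escaping is applied, so a secret containing quotes or backslashes cannot leave broken JSON behind.

```typescript
// Sketch under stated assumptions; names and the marker text are hypothetical.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const REDACTED = "[REDACTED]";

// Replaces every literal occurrence of each non-empty secret in one string.
function redactString(value: string, secrets: string[]): string {
  let out = value;
  for (const secret of secrets) {
    if (secret.length > 0) out = out.split(secret).join(REDACTED);
  }
  return out;
}

// Recursively copies a JSON value, redacting string leaves only.
function redactDeep(value: Json, secrets: string[]): Json {
  if (typeof value === "string") return redactString(value, secrets);
  if (Array.isArray(value)) return value.map((item) => redactDeep(item, secrets));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = redactDeep(child, secrets);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

Serializing the result with `JSON.stringify` then yields a file such as transcript.json that still parses, which matters because later scoring steps read it back.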