hookdeck · leggetter · Jun 24, 2026
diff --git a/.github/workflows/docs-agent-eval-ci.yml b/.github/workflows/docs-agent-eval-ci.yml
@@ -1,9 +1,12 @@
 # Runs scenarios 01+02 (curl + TypeScript SDK) with heuristic + LLM judge.
 # Sets EVAL_LOCAL_DOCS=1 so the agent reads repo docs under docs/ (not production WebFetch).
+# Injects OUTPOST_API_KEY (and related env) for the agent run so smoke tests hit live Outpost;
+# step 2 re-runs the saved artifacts deterministically.
 # Triggers: workflow_dispatch, or push (main) / pull_request when docs / OpenAPI / agent-eval / TS SDK paths change.
 # Each run bills Anthropic (agent + judge).
 # Requires repo secrets: ANTHROPIC_API_KEY, EVAL_TEST_DESTINATION_URL, OUTPOST_API_KEY
 # (OUTPOST_TEST_WEBHOOK_URL uses the same URL as EVAL_TEST_DESTINATION_URL in CI.)
+# Env is scoped per step — see each run step's env block (no job-wide secret export).
 # See docs/agent-evaluation/README.md § CI (recommended slice).
 name: Docs agent eval (CI slice)
 
@@ -64,6 +67,9 @@ jobs:
           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
           EVAL_TEST_DESTINATION_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
           EVAL_LOCAL_DOCS: "1"
+          OUTPOST_API_KEY: ${{ secrets.OUTPOST_API_KEY }}
+          OUTPOST_TEST_WEBHOOK_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
+          OUTPOST_API_BASE_URL: https://api.outpost.hookdeck.com/2025-07-01
         run: ./scripts/ci-eval.sh
 
       - name: Execute generated curl + TypeScript artifacts (live Outpost)

diff --git a/docs/agent-evaluation/.env.example b/docs/agent-evaluation/.env.example
@@ -6,9 +6,8 @@ ANTHROPIC_API_KEY=
 # Required for Turn 0 template (test webhook URL injected into the prompt)
 EVAL_TEST_DESTINATION_URL=
 
-# Strongly recommended for a *full* eval: run the agent’s curl/script/app against a real project.
-# The harness does not read this key; you (or a future verifier) use it after the run.
-# OUTPOST_API_KEY=   # required for ./scripts/execute-ci-artifacts.sh after eval:ci; GitHub Actions CI execution step
+# Strongly recommended for CI and full local eval: forwarded to the agent sandbox when set.
+# OUTPOST_API_KEY=
 # OUTPOST_API_BASE_URL=https://api.outpost.hookdeck.com/2025-07-01
 # OUTPOST_TEST_WEBHOOK_URL=https://hkdk.events/your-source-id   # often same as EVAL_TEST_DESTINATION_URL
 # OUTPOST_CI_PUBLISH_TOPIC=user.created   # optional; publish topic for npm run smoke:execute-ci (must exist in project)
@@ -36,4 +35,5 @@ EVAL_TEST_DESTINATION_URL=
 # Scoring is ON by default after each scenario (heuristic + LLM). Opt out:
 # EVAL_NO_SCORE_HEURISTIC=1
 # EVAL_NO_SCORE_LLM=1
+# LLM judge model (default claude-sonnet-4-6 — current Sonnet tier; no newer Sonnet as of 2026-06)
 # EVAL_SCORE_MODEL=claude-sonnet-4-6
diff --git a/docs/agent-evaluation/README.md b/docs/agent-evaluation/README.md
@@ -140,6 +140,7 @@ Each scenario run uses one directory:
   - **`<stamp>-scenario-NN.eval-aborted.json`** — **SIGTERM** / **SIGINT** before completion (not **SIGKILL**)
     If **`transcript.json`** is missing, check these files next to **`…/runs/<stamp>-scenario-NN/`** (same directory as the run folder, not inside it).
 - **`heuristic-score.json`** / **`llm-score.json`** — by default (unless disabled above)
+- **`llm-judge-failure.json`** — when the LLM judge returns unparseable JSON after retries: full raw model text per attempt plus `stop_reason` and token usage (for CI triage)
 - **Agent-written files** — the SDK **`cwd`** is this directory. Defaults include **`Write`**, **`Edit`**, and **`Bash`** for clones, installs, and generated code.
 
 Re-score a finished run without re-invoking the agent — uses **today's** [`src/score-transcript.ts`](src/score-transcript.ts) and **scenario markdown on disk** (so LLM criteria update when you edit **`## Success criteria`**):
@@ -210,19 +211,19 @@ For **pull-request or main-branch** automation, run **two** scenarios only:
 
 ```sh
 cd docs/agent-evaluation && npm ci && npm run eval:ci
-# or: ./scripts/ci-eval.sh   # requires ANTHROPIC_API_KEY + EVAL_TEST_DESTINATION_URL in the environment
-# after a successful eval:ci, live Outpost smoke: OUTPOST_API_KEY + OUTPOST_TEST_WEBHOOK_URL ./scripts/execute-ci-artifacts.sh
+# or: ./scripts/ci-eval.sh   # requires ANTHROPIC_API_KEY, EVAL_TEST_DESTINATION_URL, OUTPOST_API_KEY
+# after a successful eval:ci, CI step 2 re-runs artifacts: ./scripts/execute-ci-artifacts.sh
 ```
 
 `eval:ci` is **`npm run eval -- --scenarios 01,02`**: both **heuristic** checks and the **LLM judge** (grounded in each scenario's **`## Success criteria`**). Skipping the judge would leave you with regex-only signal, which does not encode the product checklist.
 
-**GitHub Actions:** add repository secrets **`ANTHROPIC_API_KEY`**, **`EVAL_TEST_DESTINATION_URL`**, and **`OUTPOST_API_KEY`**. Workflow **`.github/workflows/docs-agent-eval-ci.yml`** runs **`./scripts/ci-eval.sh`** with **`EVAL_LOCAL_DOCS=1`** (agent **reads docs from the repo**), then **`./scripts/execute-ci-artifacts.sh`**: picks the **newest** **`*-scenario-01`** / **`*-scenario-02`** pair from **`results/runs/`**, runs the generated **`.sh`** then **`npx tsx`** on the TypeScript artifact (**`npm install`** in the **02** run dir when **`package.json`** exists). **`OUTPOST_TEST_WEBHOOK_URL`** in CI is set from the same secret as **`EVAL_TEST_DESTINATION_URL`**. Triggers on **`workflow_dispatch`** (manual: Actions → **Docs agent eval (CI slice)** → **Run workflow**, pick branch), pushes to **`main`**, and **pull requests** when **`docs/content/**`**, **`docs/apis/**`**, **`sdks/outpost-typescript/**`**, root **`docs/README.md`** / **`docs/AGENTS.md`**, or **`docs/agent-evaluation/**`** change (GitHub does not allow **`paths`** + **`paths-ignore`** together on the same event, so edits under e.g. **`docs/agent-evaluation/README.md`** also match **`docs/agent-evaluation/**`** and can trigger a run). Uses **`ubuntu-latest`** (Claude Agent SDK needs normal filesystem access — avoid tight sandboxes; see **Permissions / failures** above). **Fork PRs** skip this job (secrets are not available).
+**GitHub Actions:** add repository secrets **`ANTHROPIC_API_KEY`**, **`EVAL_TEST_DESTINATION_URL`**, and **`OUTPOST_API_KEY`**. Workflow **`.github/workflows/docs-agent-eval-ci.yml`** scopes env **per step** (no job-wide secret blast radius): step 1 **`./scripts/ci-eval.sh`** gets Anthropic + eval + Outpost creds so the agent can **read repo docs** (`EVAL_LOCAL_DOCS=1`) and **run live smoke tests**; step 2 **`./scripts/execute-ci-artifacts.sh`** gets only Outpost execution vars (including tenant cleanup). Step 2 re-runs the newest **`*-scenario-01`** / **`*-scenario-02`** pair deterministically (generated **`.sh`** then **`npx tsx`** on the TypeScript artifact; **`npm install`** in the **02** run dir when **`package.json`** exists). **`OUTPOST_TEST_WEBHOOK_URL`** in CI is set from the same secret as **`EVAL_TEST_DESTINATION_URL`**. Triggers on **`workflow_dispatch`** (manual: Actions → **Docs agent eval (CI slice)** → **Run workflow**, pick branch), pushes to **`main`**, and **pull requests** when **`docs/content/**`**, **`docs/apis/**`**, **`sdks/outpost-typescript/**`**, root **`docs/README.md`** / **`docs/AGENTS.md`**, or **`docs/agent-evaluation/**`** change (GitHub does not allow **`paths`** + **`paths-ignore`** together on the same event, so edits under e.g. **`docs/agent-evaluation/README.md`** also match **`docs/agent-evaluation/**`** and can trigger a run). Uses **`ubuntu-latest`** (Claude Agent SDK needs normal filesystem access — avoid tight sandboxes; see **Permissions / failures** above). **Fork PRs** skip this job (secrets are not available).
 
 The workflow uses **`concurrency: { group: outpost-docs-agent-eval-live-outpost, cancel-in-progress: false }`** so only one run at a time talks to the shared CI Outpost project for execution, and sets **`OUTPOST_CI_CLEANUP_TENANT=customer_acme_001`** so **`execute-ci-artifacts.sh`** **DELETE**s that tenant before the curl script and again on **EXIT** (clears destinations from prior runs and avoids parallel deletes). Override the tenant id only if your Turn 0 fixtures consistently use another id.
 
 - **`ANTHROPIC_API_KEY`** — required for the agent and for the **LLM judge** (Success criteria) after each scenario you run.
 - **`EVAL_TEST_DESTINATION_URL`** — required for Turn 0; same Source URL as `{{TEST_DESTINATION_URL}}` (and, in CI, reused as **`OUTPOST_TEST_WEBHOOK_URL`** for execution).
-- **`OUTPOST_API_KEY`** — required for **`execute-ci-artifacts.sh`** and for **GitHub Actions** execution after **`eval:ci`**. For **local** transcript-only runs you can omit it. Put the key in **`docs/agent-evaluation/.env`** (or export); never paste it into chat.
+- **`OUTPOST_API_KEY`** — required for **`ci-eval.sh`**, **`execute-ci-artifacts.sh`**, and **GitHub Actions**. The eval runner forwards it (with **`OUTPOST_TEST_WEBHOOK_URL`** / **`OUTPOST_API_BASE_URL`**) into the agent sandbox when set so the model can run live smoke tests; the LLM judge scores execution strictly in that mode. For **local transcript-only** runs you can omit it (the judge applies a missing-env exception). Put the key in **`docs/agent-evaluation/.env`** (or export); never paste it into chat.
 - **`EVAL_LOCAL_DOCS=1`** — Turn 0 replaces public doc URLs with **absolute paths to repo docs** (primarily **`.mdoc`** under **`docs/content/`**, plus OpenAPI under **`docs/apis/`**; the Turn 0 template itself is **[`hookdeck-outpost-agent-prompt.md`](hookdeck-outpost-agent-prompt.md)**). The agent uses **Read** on **`docs/`** instead of **WebFetch** to production. Use locally when validating unpublished docs; **GitHub Actions** sets this for **`docs-agent-eval-ci.yml`**.
 - **`EVAL_SKIP_HARNESS_PRE_STEPS=1`** — skip **`git_clone`** (and any future **`preSteps`**) declared in a scenario's **`## Eval harness`** JSON block; useful offline or when the baseline folder is already present.
 
@@ -250,7 +251,9 @@ Changing **`EVAL_PERMISSION_MODE`** is usually unnecessary; widening **`EVAL_TOO
 
 ### Transcript vs execution (full pass)
 
-`npm run eval` only captures **what the model produced**; by itself it does **not** call Outpost (transcript review). **`./scripts/execute-ci-artifacts.sh`** (and the **GitHub Actions** workflow's second step) runs the **01** shell + **02** TypeScript outputs against **live** Outpost when **`OUTPOST_API_KEY`** and **`OUTPOST_TEST_WEBHOOK_URL`** are set.
+When **`OUTPOST_API_KEY`** is set, `npm run eval` forwards it to the agent sandbox; transcripts can show **live** curl/SDK smoke tests and the **LLM judge** scores execution-style Success criteria from that evidence (no missing-env exception). **`./scripts/execute-ci-artifacts.sh`** (and the **GitHub Actions** workflow's second step) still **re-runs** the saved **01** shell + **02** TypeScript outputs deterministically — a separate gate from the agent session.
+
+Without **`OUTPOST_API_KEY`**, eval is **transcript-only** for execution: heuristics + LLM still run, but the judge may pass execution rows when failure was solely due to missing env (see [`src/llm-judge.ts`](src/llm-judge.ts)).
 
 **Local smoke (no agent):** to verify secrets and the managed API the same way CI does—without depending on a fresh eval transcript—run from **`docs/agent-evaluation/`** with **`OUTPOST_API_KEY`** and **`OUTPOST_TEST_WEBHOOK_URL`** set (e.g. **`source .env`**):
 

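The README text above says step 2 picks the newest `*-scenario-01` / `*-scenario-02` pair from `results/runs/`. A minimal sketch of one way to select such a directory follows. It is not the code in `execute-ci-artifacts.sh`: the `newest_run_dir` helper is hypothetical, it assumes `<stamp>` values sort lexicographically in chronological order, and it picks each scenario independently, whereas the real script may pair them differently.

```sh
#!/usr/bin/env bash
# Sketch only: newest_run_dir is a hypothetical helper, not part of execute-ci-artifacts.sh.

# newest_run_dir RUNS_DIR SCENARIO_ID
# Prints the newest "<stamp>-scenario-<id>" directory under RUNS_DIR.
# Assumption: stamps sort lexicographically in chronological order.
newest_run_dir() {
  local runs_dir="$1" scenario="$2" dir newest=""
  for dir in "$runs_dir"/*-scenario-"$scenario"; do
    # An unmatched glob stays literal, so skip anything that is not a directory.
    [ -d "$dir" ] || continue
    # Glob expansion is sorted, so the last match is the newest stamp.
    newest="$dir"
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

Under those assumptions, `newest_run_dir results/runs 01` would print the latest scenario 01 run directory, and a non-zero exit status signals that no matching run exists yet.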
diff --git a/docs/agent-evaluation/fixtures/placeholder-values-for-turn0.md b/docs/agent-evaluation/fixtures/placeholder-values-for-turn0.md
@@ -6,7 +6,7 @@ The **prompt template itself** lives in one place only:
 
 Do **not** paste real API keys into chat. Have operators put `OUTPOST_API_KEY` in a project `**.env`\*\* (or another loader), not in the agent transcript. Use a throwaway Hookdeck project when possible.
 
-For `**npm run eval -- --scenario …**` (or `**--scenarios**` / `**--all**`), the runner only needs `**ANTHROPIC_API_KEY**` and `**EVAL_TEST_DESTINATION_URL**`. To score a **full** eval (generated commands/code actually work), you still need `**OUTPOST_API_KEY`** (and usually `**OUTPOST_TEST_WEBHOOK_URL**`) when you **execute** the agent’s output afterward. Optional `**EVAL_LOCAL_DOCS=1`** points Turn 0 at repo paths instead of live `{{DOCS_URL}}` links.
+For `**npm run eval -- --scenario …**` (or `**--scenarios**` / `**--all**`), the runner requires `**ANTHROPIC_API_KEY**` and `**EVAL_TEST_DESTINATION_URL**`. For **CI** and **full** evals (agent runs live smoke tests + strict LLM execution scoring), also set `**OUTPOST_API_KEY**` (and usually `**OUTPOST_TEST_WEBHOOK_URL**` — defaults to `EVAL_TEST_DESTINATION_URL` in `ci-eval.sh`). Optional `**EVAL_LOCAL_DOCS=1`** points Turn 0 at repo paths instead of live `{{DOCS_URL}}` links.
 
 ---
 

diff --git a/docs/agent-evaluation/hookdeck-outpost-agent-prompt.md b/docs/agent-evaluation/hookdeck-outpost-agent-prompt.md
@@ -104,6 +104,7 @@ Goal: tenant → **one destination** (often webhook to `{{TEST_DESTINATION_URL}}
 - Default to **curl** when they want the absolute minimum and did not name a language.
 - When they name **TypeScript**, **Python**, or **Go**, produce **only** what that language’s **quickstart** describes—typically **one file** (plus `package.json` / `go.mod` / venv if the quickstart needs it), not a full application tree.
 - Ask only for env vars and details the quickstart still needs.
+- **Verify delivery the quickstart way:** after publish, print the **event id** (or 202 success) and point the operator to **Hookdeck Console** / dashboard **logs** for the test webhook URL. **Do not** add an immediate `events.list` / `GET …/events` check that throws if the event is missing on the first try — publish is **202 (accepted)** and observability APIs are **eventually consistent** (see **{{DOCS_URL}}/concepts#publish-acceptance-vs-observability**). Reserve events/attempts listing and retry UX for **Building your own UI** / full-stack paths, not the one-file quickstart.
 
 ### New minimal application
 
@@ -145,6 +146,7 @@ Apply **only** the items below that fit the task; **skip** any that do not apply
 - [ ] **Ran** the smallest end-to-end check that fits this task (e.g. run the script or shell flow once, exercise one new API path, or smoke the UI/API flow you added) and saw a clear success signal (e.g. event id, HTTP 2xx, or expected output).
 - [ ] **Secrets:** The platform Outpost API key remains **server-side** / **environment** only — not in client bundles, not hard-coded in committed source.
 - [ ] **Repeatable:** Env vars, how to run, and how to verify with the test destination above are stated briefly (README, comments, or chat — match the task size; a one-file script may need only inline or chat notes).
+- [ ] **Quick path verification:** If this is a **quickstart-shaped** script (one file / curl flow), success is **publish accepted (202 / event id printed)** plus **Hookdeck Console** (or dashboard logs) for the test webhook — **not** a hard-fail `events.list` / `GET …/events` immediately after publish. If you must list events in code, **poll with retries** per **{{DOCS_URL}}/concepts#publish-acceptance-vs-observability**; do not throw on the first empty list.
 
 **When editing an existing application repository (Existing application or equivalent):**
 

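The checklist item added above asks for polling with retries instead of throwing on the first empty list. A minimal sketch of that retry loop in shell follows. The `poll_until` name and its arguments are hypothetical (nothing here comes from the Outpost docs or SDK), and the command being polled is left to the caller.

```sh
#!/usr/bin/env bash
# Sketch only: poll_until is a hypothetical helper, not an Outpost API or SDK function.

# poll_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
poll_until() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # An empty result right after a 202 is expected; wait before retrying.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A caller would pass a command that exits non-zero while the event is not yet visible, for example a small function wrapping whatever events lookup the script already performs. For the one-file quickstart, the guidance above still prefers printing the event id and checking Hookdeck Console over any listing call.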
diff --git a/docs/agent-evaluation/results/README.md b/docs/agent-evaluation/results/README.md
@@ -36,7 +36,7 @@ npm run score -- --run results/runs/<stamp>-scenario-NN --write
 npm run score -- --run results/runs/<stamp>-scenario-NN --llm --write
 ```
 
-**Execution** (curl/SDK against live Outpost with `OUTPOST_API_KEY`) is **not** recorded in these JSON files. Use **`../scripts/execute-ci-artifacts.sh`** after **`eval:ci`**, or the second step in **`.github/workflows/docs-agent-eval-ci.yml`**, and the **Execution (full pass)** rows in `[../scenarios/](../scenarios/)` for human notes.
+**Execution** when `OUTPOST_API_KEY` is set: the agent may run live smoke tests during eval (evidence in `transcript.json`; **LLM judge** scores execution-style criteria). **`execute-ci-artifacts.sh`** (CI step 2) re-runs saved artifacts deterministically. Without the key, execution is transcript-only / manual — see **Execution (full pass)** rows in `[../scenarios/](../scenarios/)`.
 
 ---
 

diff --git a/docs/agent-evaluation/scripts/ci-eval.sh b/docs/agent-evaluation/scripts/ci-eval.sh
@@ -1,11 +1,12 @@
 #!/usr/bin/env bash
 # CI-friendly agent eval: scenarios 01+02 with heuristic + LLM judge (Success criteria from each scenario .md).
 #
-# Required secrets (e.g. GitHub Actions): ANTHROPIC_API_KEY, EVAL_TEST_DESTINATION_URL
+# Required secrets (e.g. GitHub Actions): ANTHROPIC_API_KEY, EVAL_TEST_DESTINATION_URL, OUTPOST_API_KEY
 # Optional: same vars in docs/agent-evaluation/.env for local runs.
 #
 # Scenarios: 01 = curl quickstart shape; 02 = TypeScript SDK script. See README § CI.
-# After success, run ./scripts/execute-ci-artifacts.sh with OUTPOST_API_KEY + OUTPOST_TEST_WEBHOOK_URL for live Outpost (CI does this automatically).
+# OUTPOST_API_KEY is forwarded to the agent sandbox so it can run smoke tests during eval:ci.
+# After success, ./scripts/execute-ci-artifacts.sh re-runs the saved artifacts (CI step 2).
 set -euo pipefail
 
 ROOT="$(cd "$(dirname "$0")/.." && pwd)"
@@ -19,5 +20,13 @@ if [[ -z "${EVAL_TEST_DESTINATION_URL:-}" ]]; then
   echo "ci-eval: EVAL_TEST_DESTINATION_URL is not set" >&2
   exit 1
 fi
+if [[ -z "${OUTPOST_API_KEY:-}" ]]; then
+  echo "ci-eval: OUTPOST_API_KEY is not set (required so the agent can run live Outpost smoke tests)" >&2
+  exit 1
+fi
+
+export OUTPOST_TEST_WEBHOOK_URL="${OUTPOST_TEST_WEBHOOK_URL:-${EVAL_TEST_DESTINATION_URL:-}}"
+: "${OUTPOST_API_BASE_URL:=https://api.outpost.hookdeck.com/2025-07-01}"
+export OUTPOST_API_BASE_URL OUTPOST_TEST_WEBHOOK_URL
 
 exec npm run eval:ci
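The defaulting lines added in this hunk rely on two parameter-expansion forms: `${A:-${B:-}}` substitutes a fallback without assigning it, while `: "${VAR:=default}"` assigns the default in place. A minimal standalone sketch of the same behavior follows, wrapped in a hypothetical `resolve_outpost_env` function that does not exist in `ci-eval.sh`, so the effect can be checked in isolation.

```sh
#!/usr/bin/env bash
# Sketch only: resolve_outpost_env is a hypothetical wrapper around the lines added to ci-eval.sh.

resolve_outpost_env() {
  # Keep OUTPOST_TEST_WEBHOOK_URL if it is set and non-empty; otherwise fall back to
  # EVAL_TEST_DESTINATION_URL. The inner ":-" keeps this safe under "set -u".
  OUTPOST_TEST_WEBHOOK_URL="${OUTPOST_TEST_WEBHOOK_URL:-${EVAL_TEST_DESTINATION_URL:-}}"
  # ":=" assigns the default when the variable is unset or empty.
  : "${OUTPOST_API_BASE_URL:=https://api.outpost.hookdeck.com/2025-07-01}"
  export OUTPOST_API_BASE_URL OUTPOST_TEST_WEBHOOK_URL
}
```

An explicitly exported `OUTPOST_TEST_WEBHOOK_URL` wins over the fallback, which matches the workflow setting it from the same secret as `EVAL_TEST_DESTINATION_URL`.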