6 changes: 6 additions & 0 deletions .github/workflows/docs-agent-eval-ci.yml
@@ -1,9 +1,12 @@
# Runs scenarios 01+02 (curl + TypeScript SDK) with heuristic + LLM judge.
# Sets EVAL_LOCAL_DOCS=1 so the agent reads repo docs under docs/ (not production WebFetch).
# Injects OUTPOST_API_KEY (and related env) for the agent run so smoke tests hit live Outpost;
# step 2 re-runs the saved artifacts deterministically.
# Triggers: workflow_dispatch, or push (main) / pull_request when docs / OpenAPI / agent-eval / TS SDK paths change.
# Each run bills Anthropic (agent + judge).
# Requires repo secrets: ANTHROPIC_API_KEY, EVAL_TEST_DESTINATION_URL, OUTPOST_API_KEY
# (OUTPOST_TEST_WEBHOOK_URL uses the same URL as EVAL_TEST_DESTINATION_URL in CI.)
# Env is scoped per step — see each run step's env block (no job-wide secret export).
# See docs/agent-evaluation/README.md § CI (recommended slice).
name: Docs agent eval (CI slice)

@@ -64,6 +67,9 @@ jobs:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
EVAL_TEST_DESTINATION_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
EVAL_LOCAL_DOCS: "1"
OUTPOST_API_KEY: ${{ secrets.OUTPOST_API_KEY }}
OUTPOST_TEST_WEBHOOK_URL: ${{ secrets.EVAL_TEST_DESTINATION_URL }}
OUTPOST_API_BASE_URL: https://api.outpost.hookdeck.com/2025-07-01
run: ./scripts/ci-eval.sh

- name: Execute generated curl + TypeScript artifacts (live Outpost)
6 changes: 3 additions & 3 deletions docs/agent-evaluation/.env.example
@@ -6,9 +6,8 @@ ANTHROPIC_API_KEY=
# Required for Turn 0 template (test webhook URL injected into the prompt)
EVAL_TEST_DESTINATION_URL=

# Strongly recommended for a *full* eval: run the agent’s curl/script/app against a real project.
# The harness does not read this key; you (or a future verifier) use it after the run.
# OUTPOST_API_KEY= # required for ./scripts/execute-ci-artifacts.sh after eval:ci; GitHub Actions CI execution step
# Strongly recommended for CI and full local eval: forwarded to the agent sandbox when set.
# OUTPOST_API_KEY=
# OUTPOST_API_BASE_URL=https://api.outpost.hookdeck.com/2025-07-01
# OUTPOST_TEST_WEBHOOK_URL=https://hkdk.events/your-source-id # often same as EVAL_TEST_DESTINATION_URL
# OUTPOST_CI_PUBLISH_TOPIC=user.created # optional; publish topic for npm run smoke:execute-ci (must exist in project)
@@ -36,4 +35,5 @@ EVAL_TEST_DESTINATION_URL=
# Scoring is ON by default after each scenario (heuristic + LLM). Opt out:
# EVAL_NO_SCORE_HEURISTIC=1
# EVAL_NO_SCORE_LLM=1
# LLM judge model (default claude-sonnet-4-6 — current Sonnet tier; no newer Sonnet as of 2026-06)
# EVAL_SCORE_MODEL=claude-sonnet-4-6
13 changes: 8 additions & 5 deletions docs/agent-evaluation/README.md
@@ -140,6 +140,7 @@ Each scenario run uses one directory:
- **`<stamp>-scenario-NN.eval-aborted.json`** — **SIGTERM** / **SIGINT** before completion (not **SIGKILL**)
If **`transcript.json`** is missing, check these files next to **`…/runs/<stamp>-scenario-NN/`** (same directory as the run folder, not inside it).
- **`heuristic-score.json`** / **`llm-score.json`** — by default (unless disabled above)
- **`llm-judge-failure.json`** — when the LLM judge returns unparseable JSON after retries: full raw model text per attempt plus `stop_reason` and token usage (for CI triage)
- **Agent-written files** — the SDK **`cwd`** is this directory. Defaults include **`Write`**, **`Edit`**, and **`Bash`** for clones, installs, and generated code.

Re-score a finished run without re-invoking the agent — uses **today's** [`src/score-transcript.ts`](src/score-transcript.ts) and **scenario markdown on disk** (so LLM criteria update when you edit **`## Success criteria`**):
@@ -210,19 +211,19 @@ For **pull-request or main-branch** automation, run **two** scenarios only:

```sh
cd docs/agent-evaluation && npm ci && npm run eval:ci
# or: ./scripts/ci-eval.sh # requires ANTHROPIC_API_KEY + EVAL_TEST_DESTINATION_URL in the environment
# after a successful eval:ci, live Outpost smoke: OUTPOST_API_KEY + OUTPOST_TEST_WEBHOOK_URL ./scripts/execute-ci-artifacts.sh
# or: ./scripts/ci-eval.sh # requires ANTHROPIC_API_KEY, EVAL_TEST_DESTINATION_URL, OUTPOST_API_KEY
# after a successful eval:ci, CI step 2 re-runs artifacts: ./scripts/execute-ci-artifacts.sh
```

`eval:ci` is **`npm run eval -- --scenarios 01,02`**: both **heuristic** checks and the **LLM judge** (grounded in each scenario's **`## Success criteria`**). Skipping the judge would leave you with regex-only signal, which does not encode the product checklist.

**GitHub Actions:** add repository secrets **`ANTHROPIC_API_KEY`**, **`EVAL_TEST_DESTINATION_URL`**, and **`OUTPOST_API_KEY`**. Workflow **`.github/workflows/docs-agent-eval-ci.yml`** runs **`./scripts/ci-eval.sh`** with **`EVAL_LOCAL_DOCS=1`** (agent **reads docs from the repo**), then **`./scripts/execute-ci-artifacts.sh`**: picks the **newest** **`*-scenario-01`** / **`*-scenario-02`** pair from **`results/runs/`**, runs the generated **`.sh`** then **`npx tsx`** on the TypeScript artifact (**`npm install`** in the **02** run dir when **`package.json`** exists). **`OUTPOST_TEST_WEBHOOK_URL`** in CI is set from the same secret as **`EVAL_TEST_DESTINATION_URL`**. Triggers on **`workflow_dispatch`** (manual: Actions → **Docs agent eval (CI slice)** → **Run workflow**, pick branch), pushes to **`main`**, and **pull requests** when **`docs/content/**`**, **`docs/apis/**`**, **`sdks/outpost-typescript/**`**, root **`docs/README.md`** / **`docs/AGENTS.md`**, or **`docs/agent-evaluation/**`** change (GitHub does not allow **`paths`** + **`paths-ignore`** together on the same event, so edits under e.g. **`docs/agent-evaluation/README.md`** also match **`docs/agent-evaluation/**`** and can trigger a run). Uses **`ubuntu-latest`** (Claude Agent SDK needs normal filesystem access — avoid tight sandboxes; see **Permissions / failures** above). **Fork PRs** skip this job (secrets are not available).
**GitHub Actions:** add repository secrets **`ANTHROPIC_API_KEY`**, **`EVAL_TEST_DESTINATION_URL`**, and **`OUTPOST_API_KEY`**. Workflow **`.github/workflows/docs-agent-eval-ci.yml`** scopes env **per step** (no job-wide secret blast radius): step 1 **`./scripts/ci-eval.sh`** gets Anthropic + eval + Outpost creds so the agent can **read repo docs** (`EVAL_LOCAL_DOCS=1`) and **run live smoke tests**; step 2 **`./scripts/execute-ci-artifacts.sh`** gets only Outpost execution vars (including tenant cleanup). Step 2 re-runs the newest **`*-scenario-01`** / **`*-scenario-02`** pair deterministically (generated **`.sh`** then **`npx tsx`** on the TypeScript artifact; **`npm install`** in the **02** run dir when **`package.json`** exists). **`OUTPOST_TEST_WEBHOOK_URL`** in CI is set from the same secret as **`EVAL_TEST_DESTINATION_URL`**. Triggers on **`workflow_dispatch`** (manual: Actions → **Docs agent eval (CI slice)** → **Run workflow**, pick branch), pushes to **`main`**, and **pull requests** when **`docs/content/**`**, **`docs/apis/**`**, **`sdks/outpost-typescript/**`**, root **`docs/README.md`** / **`docs/AGENTS.md`**, or **`docs/agent-evaluation/**`** change (GitHub does not allow **`paths`** + **`paths-ignore`** together on the same event, so edits under e.g. **`docs/agent-evaluation/README.md`** also match **`docs/agent-evaluation/**`** and can trigger a run). Uses **`ubuntu-latest`** (Claude Agent SDK needs normal filesystem access — avoid tight sandboxes; see **Permissions / failures** above). **Fork PRs** skip this job (secrets are not available).
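
The "newest pair" selection described above could be sketched roughly as follows. This is not the actual `execute-ci-artifacts.sh` logic, which this diff does not show. `newest_run` is a hypothetical helper, and the sketch assumes run stamps sort lexicographically in time order.

```sh
# newest_run RUNS_DIR SCENARIO_NUMBER
# Prints the newest "<stamp>-scenario-NN" directory, or returns 1 if none exists.
# Assumption: stamps sort lexicographically in time order (e.g. ISO-like timestamps).
newest_run() {
  runs_dir="$1"
  scenario="$2"
  newest=""
  for dir in "$runs_dir"/*-scenario-"$scenario"; do
    # Skips sibling files and the unexpanded glob when nothing matches.
    [ -d "$dir" ] || continue
    # Globs expand in sorted order, so the last directory seen has the newest stamp.
    newest="$dir"
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

For example, `newest_run results/runs 01` would print the most recent scenario 01 run directory. Because the glob ends at `-scenario-NN`, sibling files such as `<stamp>-scenario-NN.eval-aborted.json` are not picked up.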

The workflow uses **`concurrency: { group: outpost-docs-agent-eval-live-outpost, cancel-in-progress: false }`** so only one run at a time talks to the shared CI Outpost project for execution, and sets **`OUTPOST_CI_CLEANUP_TENANT=customer_acme_001`** so **`execute-ci-artifacts.sh`** **DELETE**s that tenant before the curl script and again on **EXIT** (clears destinations from prior runs and avoids parallel deletes). Override the tenant id only if your Turn 0 fixtures consistently use another id.
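
The delete-before-and-on-EXIT pattern described above could look something like this minimal sketch. `delete_tenant` is a hypothetical stub that only prints what it would do; the real script's request is not shown in this diff, so no endpoint is assumed here. `run_with_tenant_cleanup` is likewise an illustrative name.

```sh
# Hypothetical stub: the real script would send the DELETE for this tenant.
# It only prints here, so the sketch runs without network access or credentials.
delete_tenant() {
  echo "DELETE tenant $1"
}

# run_with_tenant_cleanup TENANT_ID COMMAND [ARGS...]
run_with_tenant_cleanup() {
  (
    tenant_id="$1"
    shift
    # Clean up once before the run (clears leftovers from a previous run)...
    delete_tenant "$tenant_id"
    # ...and again when this subshell exits, whether the command passed or failed.
    trap 'delete_tenant "$tenant_id"' EXIT
    "$@"
  )
}
```

Running the command in a subshell keeps the `EXIT` trap scoped to that one invocation, and the subshell still returns the command's own exit status. A call might look like `run_with_tenant_cleanup "$OUTPOST_CI_CLEANUP_TENANT" sh ./generated.sh`, where `generated.sh` is a placeholder name.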

- **`ANTHROPIC_API_KEY`** — required for the agent and for the **LLM judge** (Success criteria) after each scenario you run.
- **`EVAL_TEST_DESTINATION_URL`** — required for Turn 0; same Source URL as `{{TEST_DESTINATION_URL}}` (and, in CI, reused as **`OUTPOST_TEST_WEBHOOK_URL`** for execution).
- **`OUTPOST_API_KEY`** — required for **`execute-ci-artifacts.sh`** and for **GitHub Actions** execution after **`eval:ci`**. For **local** transcript-only runs you can omit it. Put the key in **`docs/agent-evaluation/.env`** (or export); never paste it into chat.
- **`OUTPOST_API_KEY`** — required for **`ci-eval.sh`**, **`execute-ci-artifacts.sh`**, and **GitHub Actions**. The eval runner forwards it (with **`OUTPOST_TEST_WEBHOOK_URL`** / **`OUTPOST_API_BASE_URL`**) into the agent sandbox when set so the model can run live smoke tests; the LLM judge scores execution strictly in that mode. For **local transcript-only** runs you can omit it (the judge applies a missing-env exception). Put the key in **`docs/agent-evaluation/.env`** (or export); never paste it into chat.
- **`EVAL_LOCAL_DOCS=1`** — Turn 0 replaces public doc URLs with **absolute paths to repo docs** (primarily **`.mdoc`** under **`docs/content/`**, plus OpenAPI under **`docs/apis/`**; the Turn 0 template itself is **[`hookdeck-outpost-agent-prompt.md`](hookdeck-outpost-agent-prompt.md)**). The agent uses **Read** on **`docs/`** instead of **WebFetch** to production. Use locally when validating unpublished docs; **GitHub Actions** sets this for **`docs-agent-eval-ci.yml`**.
- **`EVAL_SKIP_HARNESS_PRE_STEPS=1`** — skip **`git_clone`** (and any future **`preSteps`**) declared in a scenario's **`## Eval harness`** JSON block; useful offline or when the baseline folder is already present.

@@ -250,7 +251,9 @@ Changing **`EVAL_PERMISSION_MODE`** is usually unnecessary; widening **`EVAL_TOO

### Transcript vs execution (full pass)

`npm run eval` only captures **what the model produced**; by itself it does **not** call Outpost (transcript review). **`./scripts/execute-ci-artifacts.sh`** (and the **GitHub Actions** workflow's second step) runs the **01** shell + **02** TypeScript outputs against **live** Outpost when **`OUTPOST_API_KEY`** and **`OUTPOST_TEST_WEBHOOK_URL`** are set.
When **`OUTPOST_API_KEY`** is set, `npm run eval` forwards it to the agent sandbox; transcripts can show **live** curl/SDK smoke tests and the **LLM judge** scores execution-style Success criteria from that evidence (no missing-env exception). **`./scripts/execute-ci-artifacts.sh`** (and the **GitHub Actions** workflow's second step) still **re-runs** the saved **01** shell + **02** TypeScript outputs deterministically — a separate gate from the agent session.

Without **`OUTPOST_API_KEY`**, eval is **transcript-only** for execution: heuristics + LLM still run, but the judge may pass execution rows when failure was solely due to missing env (see [`src/llm-judge.ts`](src/llm-judge.ts)).
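
As a rough illustration of the two modes described above, a wrapper script could report which one a run will fall into before it starts. `eval_execution_mode` is a hypothetical helper, not part of the harness.

```sh
# Hypothetical helper: prints the execution-scoring mode implied by the environment.
eval_execution_mode() {
  if [ -n "${OUTPOST_API_KEY:-}" ]; then
    # Key is forwarded to the agent sandbox; the judge scores execution strictly.
    echo "live"
  else
    # No key: the judge may apply the missing-env exception.
    echo "transcript-only"
  fi
}
```

Printing the mode up front (for example `echo "eval mode: $(eval_execution_mode)"`) makes it easier to tell later whether a passing execution row came from real evidence or from the missing-env exception.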

**Local smoke (no agent):** to verify secrets and the managed API the same way CI does—without depending on a fresh eval transcript—run from **`docs/agent-evaluation/`** with **`OUTPOST_API_KEY`** and **`OUTPOST_TEST_WEBHOOK_URL`** set (e.g. **`source .env`**):

@@ -6,7 +6,7 @@ The **prompt template itself** lives in one place only:

Do **not** paste real API keys into chat. Have operators put `OUTPOST_API_KEY` in a project **`.env`** (or another loader), not in the agent transcript. Use a throwaway Hookdeck project when possible.

For `**npm run eval -- --scenario …**` (or `**--scenarios**` / `**--all**`), the runner only needs `**ANTHROPIC_API_KEY**` and `**EVAL_TEST_DESTINATION_URL**`. To score a **full** eval (generated commands/code actually work), you still need `**OUTPOST_API_KEY`** (and usually `**OUTPOST_TEST_WEBHOOK_URL**`) when you **execute** the agent’s output afterward. Optional `**EVAL_LOCAL_DOCS=1`** points Turn 0 at repo paths instead of live `{{DOCS_URL}}` links.
For **`npm run eval -- --scenario …`** (or **`--scenarios`** / **`--all`**), the runner requires **`ANTHROPIC_API_KEY`** and **`EVAL_TEST_DESTINATION_URL`**. For **CI** and **full** evals (the agent runs live smoke tests and the LLM judge scores execution strictly), also set **`OUTPOST_API_KEY`** (and usually **`OUTPOST_TEST_WEBHOOK_URL`**, which defaults to `EVAL_TEST_DESTINATION_URL` in `ci-eval.sh`). Optional **`EVAL_LOCAL_DOCS=1`** points Turn 0 at repo paths instead of live `{{DOCS_URL}}` links.

---

2 changes: 2 additions & 0 deletions docs/agent-evaluation/hookdeck-outpost-agent-prompt.md
@@ -104,6 +104,7 @@ Goal: tenant → **one destination** (often webhook to `{{TEST_DESTINATION_URL}}
- Default to **curl** when they want the absolute minimum and did not name a language.
- When they name **TypeScript**, **Python**, or **Go**, produce **only** what that language’s **quickstart** describes—typically **one file** (plus `package.json` / `go.mod` / venv if the quickstart needs it), not a full application tree.
- Ask only for env vars and details the quickstart still needs.
- **Verify delivery the quickstart way:** after publish, print the **event id** (or 202 success) and point the operator to **Hookdeck Console** / dashboard **logs** for the test webhook URL. **Do not** add an immediate `events.list` / `GET …/events` check that throws if the event is missing on the first try — publish is **202 (accepted)** and observability APIs are **eventually consistent** (see **{{DOCS_URL}}/concepts#publish-acceptance-vs-observability**). Reserve events/attempts listing and retry UX for **Building your own UI** / full-stack paths, not the one-file quickstart.

### New minimal application

@@ -145,6 +146,7 @@ Apply **only** the items below that fit the task; **skip** any that do not apply
- [ ] **Ran** the smallest end-to-end check that fits this task (e.g. run the script or shell flow once, exercise one new API path, or smoke the UI/API flow you added) and saw a clear success signal (e.g. event id, HTTP 2xx, or expected output).
- [ ] **Secrets:** The platform Outpost API key remains **server-side** / **environment** only — not in client bundles, not hard-coded in committed source.
- [ ] **Repeatable:** Env vars, how to run, and how to verify with the test destination above are stated briefly (README, comments, or chat — match the task size; a one-file script may need only inline or chat notes).
- [ ] **Quick path verification:** If this is a **quickstart-shaped** script (one file / curl flow), success is **publish accepted (202 / event id printed)** plus **Hookdeck Console** (or dashboard logs) for the test webhook — **not** a hard-fail `events.list` / `GET …/events` immediately after publish. If you must list events in code, **poll with retries** per **{{DOCS_URL}}/concepts#publish-acceptance-vs-observability**; do not throw on the first empty list.
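
The "poll with retries" advice in the last item could be sketched as below. `poll_until_nonempty` is a hypothetical helper, and the command passed to it stands in for whatever call lists events in a given script; no specific Outpost endpoint or SDK method is assumed.

```sh
# poll_until_nonempty MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it prints non-empty output or the attempts run out.
poll_until_nonempty() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # An empty list or a failing call means "not visible yet", not an error.
    output="$("$@" 2>/dev/null || true)"
    if [ -n "$output" ]; then
      printf '%s\n' "$output"
      return 0
    fi
    # Wait between attempts, but not after the final one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A caller would treat a non-zero return as "not observed yet, check the console" rather than as a failed publish, since acceptance (202) and observability are separate signals.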

**When editing an existing application repository (Existing application or equivalent):**

2 changes: 1 addition & 1 deletion docs/agent-evaluation/results/README.md
@@ -36,7 +36,7 @@ npm run score -- --run results/runs/<stamp>-scenario-NN --write
npm run score -- --run results/runs/<stamp>-scenario-NN --llm --write
```

**Execution** (curl/SDK against live Outpost with `OUTPOST_API_KEY`) is **not** recorded in these JSON files. Use **`../scripts/execute-ci-artifacts.sh`** after **`eval:ci`**, or the second step in **`.github/workflows/docs-agent-eval-ci.yml`**, and the **Execution (full pass)** rows in `[../scenarios/](../scenarios/)` for human notes.
**Execution** when `OUTPOST_API_KEY` is set: the agent may run live smoke tests during eval (evidence in `transcript.json`; **LLM judge** scores execution-style criteria). **`execute-ci-artifacts.sh`** (CI step 2) re-runs saved artifacts deterministically. Without the key, execution is transcript-only / manual — see **Execution (full pass)** rows in `[../scenarios/](../scenarios/)`.

---

13 changes: 11 additions & 2 deletions docs/agent-evaluation/scripts/ci-eval.sh
@@ -1,11 +1,12 @@
#!/usr/bin/env bash
# CI-friendly agent eval: scenarios 01+02 with heuristic + LLM judge (Success criteria from each scenario .md).
#
# Required secrets (e.g. GitHub Actions): ANTHROPIC_API_KEY, EVAL_TEST_DESTINATION_URL
# Required secrets (e.g. GitHub Actions): ANTHROPIC_API_KEY, EVAL_TEST_DESTINATION_URL, OUTPOST_API_KEY
# Optional: same vars in docs/agent-evaluation/.env for local runs.
#
# Scenarios: 01 = curl quickstart shape; 02 = TypeScript SDK script. See README § CI.
# After success, run ./scripts/execute-ci-artifacts.sh with OUTPOST_API_KEY + OUTPOST_TEST_WEBHOOK_URL for live Outpost (CI does this automatically).
# OUTPOST_API_KEY is forwarded to the agent sandbox so it can run smoke tests during eval:ci.
# After success, ./scripts/execute-ci-artifacts.sh re-runs the saved artifacts (CI step 2).
set -euo pipefail

ROOT="$(cd "$(dirname "$0")/.." && pwd)"
@@ -19,5 +20,13 @@ if [[ -z "${EVAL_TEST_DESTINATION_URL:-}" ]]; then
echo "ci-eval: EVAL_TEST_DESTINATION_URL is not set" >&2
exit 1
fi
if [[ -z "${OUTPOST_API_KEY:-}" ]]; then
echo "ci-eval: OUTPOST_API_KEY is not set (required so the agent can run live Outpost smoke tests)" >&2
exit 1
fi

export OUTPOST_TEST_WEBHOOK_URL="${OUTPOST_TEST_WEBHOOK_URL:-${EVAL_TEST_DESTINATION_URL:-}}"
: "${OUTPOST_API_BASE_URL:=https://api.outpost.hookdeck.com/2025-07-01}"
export OUTPOST_API_BASE_URL OUTPOST_TEST_WEBHOOK_URL

exec npm run eval:ci
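
The guard-and-default pattern this script now uses can be exercised in isolation with a sketch like the one below. `require_value` and `apply_outpost_defaults` are hypothetical helper names, written in portable `sh` rather than the script's bash; the default base URL is the one from the diff.

```sh
# require_value NAME VALUE
# Returns 1 with a message on stderr when VALUE is empty.
require_value() {
  if [ -z "$2" ]; then
    echo "ci-eval: $1 is not set" >&2
    return 1
  fi
}

# Fills in defaults without overriding values the caller already set.
apply_outpost_defaults() {
  # Fall back to the eval destination URL when no webhook URL was provided.
  OUTPOST_TEST_WEBHOOK_URL="${OUTPOST_TEST_WEBHOOK_URL:-${EVAL_TEST_DESTINATION_URL:-}}"
  # ':=' assigns only when the variable is unset or empty.
  : "${OUTPOST_API_BASE_URL:=https://api.outpost.hookdeck.com/2025-07-01}"
  export OUTPOST_API_BASE_URL OUTPOST_TEST_WEBHOOK_URL
}
```

A call such as `require_value OUTPOST_API_KEY "${OUTPOST_API_KEY:-}" || exit 1` reproduces the new fail-fast check, and an explicitly exported `OUTPOST_TEST_WEBHOOK_URL` still takes precedence over the fallback.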