Cautilus

Cautilus keeps agent and workflow behavior honest while prompts keep changing. It is a repo-local contract layer for agent and workflow behavior evaluation: define the behavior you are trying to protect once, then verify it survives prompt, skill, and wrapper changes.

The product has three connected jobs: discover declared behavior claims worth proving from selected source docs, verify curated claims through bounded evaluation packets, and improve behavior within an explicit budget once the proof surface is honest. It ships as a standalone binary plus the Cautilus Agent, which a host repo can install without copying another scaffold first.

Agents are first-class users of the product surface. Commands should emit durable packets with enough state for the next agent to resume, not only terminal prose for a human operator.

Cautilus installs as a machine-level binary, but its agent-facing surface is intentionally repo-local. The binary is shared across repos. The Cautilus Agent surface, adapter wiring, prompts, and instruction-routing surface are not: they stay checked into each host repo so evaluation behavior remains reproducible, reviewable, and owned by the repo that declares it.

Current Release Boundary

The current external-adoption slice is eval-only while the broader claim, improve, live app-runner, and review-learning contracts are still being rewritten. Host repos can use cautilus evaluate fixture, cautilus evaluate observation, and post-run cautilus evaluate skill-experiment with checked-in fixtures, host-owned adapters, preserved task packets, and the current evaluation and skill-experiment report packets. skill-experiment compare compares host-preserved baseline and variant outputs; it does not clone, install, or execute skills. Treat claim discovery automation, improve automation, live eval app-runner workflows, and review-learning packet capture or selected-packet summaries as opt-in product slices until the rewrite closes.

Who It Is For

teams maintaining agent runtimes or chatbot loops whose prompts and wrappers change frequently
maintainers shipping repo-owned skills who want protected validation, not trigger-only smoke checks
operators who want review-ready outputs and explicit comparison evidence before accepting workflow changes

Day-1 trigger: your repo already has behavior that matters, but prompt tweaks and ad hoc evals no longer explain whether a candidate actually got better.

Not for: repos that only need deterministic lint, unit, or type checks and do not have an evaluator-dependent behavior surface.

Quick Start

Prerequisites:

native macOS or native Linux
a target host repo you can edit locally
git available on PATH

curl -fsSL \
  https://raw.githubusercontent.com/corca-ai/cautilus/main/install.sh \
  | sh
cd /path/to/host-repo
cautilus init

If this machine still has a legacy Homebrew install, remove that copy first and then reinstall through install.sh:

brew uninstall cautilus
curl -fsSL https://raw.githubusercontent.com/corca-ai/cautilus/main/install.sh | sh

If you want to hand setup to an agent, ask it to repeat cautilus doctor --repo-root . --next-action, do exactly what the packet says, and stop only after cautilus doctor --repo-root . reports readiness plus a first_bounded_run.

Quick links:

What Cautilus promises: docs/specs/user/index.spec.md
Maintainer claim map: docs/specs/contracts/index.spec.md
Start here — Cautilus, proven on itself: docs/specs/index.spec.md
Full command catalog: docs/guides/cli.md
Fresh consumer bootstrap after the binary is on PATH: docs/guides/consumer-adoption.md
Public executable spec report: https://corca-ai.github.io/cautilus/

docs/specs/index.spec.md is the top-level "proven on itself" apex and the specdown entry; the user and maintainer spec indexes it links to remain the curated claim source of truth. Raw discover claims packets remain the high-recall, source-ref-backed proof-planning input, not the primary document a user should review. The Cautilus Agent curates that packet against the repo: reduce false positives, raise likely missing public promises, and separate in-scope discovery bugs from out-of-scope narrative gaps. The public website report is generated from the claim spec tree, but host repos do not need that renderer before Cautilus can inspect readiness, claims, evals, or improvement work. Each claim page pairs a bounded product promise with executable evidence or an explicit evidence gap. Read the user spec index to understand what Cautilus promises, then use the maintainer index to inspect proof routes, adapters, fixtures, and known gaps.

One Bounded Eval Loop

Start here if you want the current stable cross-repo slice before reading the full surface. You need one checked-in cautilus.evaluation_input.v1 fixture and a host-owned adapter runner. This loop verifies a bounded behavior fixture and produces reopenable observed and summary packets.

Input (CLI)

cautilus evaluate fixture \
  --fixture ./fixtures/eval/<behavior>.fixture.json \
  --output-dir /tmp/cautilus-eval
cautilus evaluate observation \
  --input /tmp/cautilus-eval/eval-observed.json \
  --output /tmp/cautilus-eval/eval-summary.json

Input (For Agent): "Run this checked-in Cautilus eval fixture and summarize the observed packet and summary packet."

Cautilus turns the fixture run into durable eval packets that another agent or maintainer can reopen. The summary is not a global product verdict; it is evidence for the behavior fixture and adapter path that the host repo chose. Next step: a human decides whether that evidence is enough for the host repo's current proof need.

The same small loop anchors the public spec report in docs/specs/user/index.spec.md. It is the shortest currently stable external-adoption example of the product claim: Cautilus turns behavior evidence into a reviewable decision surface.
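
If the two-step loop above needs to run from a script instead of a shell, a thin wrapper could look like the sketch below. It uses only the subcommands and flags shown in the CLI example. The function names are hypothetical, and it assumes that `cautilus` is on PATH, that a failed step exits with a non-zero status, and that `evaluate fixture` writes eval-observed.json into the output directory as the example paths suggest.

```python
import subprocess
from pathlib import Path


def build_eval_commands(fixture: Path, output_dir: Path) -> list[list[str]]:
    """Return argv lists for the two documented commands, in run order."""
    observed = output_dir / "eval-observed.json"
    summary = output_dir / "eval-summary.json"
    return [
        ["cautilus", "evaluate", "fixture",
         "--fixture", str(fixture),
         "--output-dir", str(output_dir)],
        ["cautilus", "evaluate", "observation",
         "--input", str(observed),
         "--output", str(summary)],
    ]


def run_eval_loop(fixture: Path, output_dir: Path) -> Path:
    """Run both steps in order and return the summary packet path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for argv in build_eval_commands(fixture, output_dir):
        # check=True raises CalledProcessError on a non-zero exit status
        # (assumption: cautilus signals failure through its exit status).
        subprocess.run(argv, check=True)
    return output_dir / "eval-summary.json"


# Example call (not executed here; the fixture path is a placeholder):
# run_eval_loop(Path("fixtures/eval/demo.fixture.json"), Path("/tmp/cautilus-eval"))
```

Keeping command construction separate from execution makes the wrapper easy to check without the binary installed. The returned summary path is still only evidence for the chosen fixture and adapter path, not a verdict.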

Dogfood Example

Cautilus is useful when a repo instruction such as AGENTS.md is supposed to steer an agent's first move. In charness, an instruction-surface fixture proved that the agent first selected the startup bootstrap helper find-skills, then selected the durable work skill for the actual task. That turned "did the agent read and follow the repo instructions?" from transcript judgment into a reproducible packet with artifacts another maintainer can reopen. The same dogfood run also exposed a useful limit: routing proof is not backend subagent capability proof. Keeping that distinction in the packet prevented the result from over-claiming what had been verified.

Scenarios

Cautilus has three connected product layers: claim discovery, bounded evaluation, and bounded improvement. External host repos should start with the eval-only slice above unless they are intentionally adopting claim discovery or improvement during the current contract rewrite.

Claim discovery turns adapter-owned entry docs and linked Markdown into cautilus.claim_proof_plan.v1 candidates. It is proof planning, not a verdict that the repo is correct. The Cautilus Agent curates false positives, likely missing promises, scan boundaries, and extraction and review budgets before any eval plan is trusted.

Evaluation uses two top-level surfaces: dev for AI-assisted development work such as repo contracts, tools, and skills, and app for AI-powered product behavior such as chat, prompt, and service responses. For the live reader-facing contract, read docs/specs/user/evaluation.spec.md. For the full command catalog, including claim review, scenario normalization, live targets, and improvement commands, read docs/guides/cli.md. Sample normalization inputs live in examples/starters/ and the checked-in fixture directories under fixtures/.

Why Cautilus

Prompt strings change, but behavior is the real contract.

Concrete picture: you tweak a chatbot system prompt. One user's follow-up experience improves. Another user silently loses context recovery across turns. Anecdotes will not tell you which effect dominates. Cautilus treats the context-recovery case as a protected scenario kept out of tuning so the signal stays honest. It stores the evidence in a durable file the next maintainer can reopen from disk. Later docs use the shorthand held-out for that protected validation path and packet for those reopenable machine-readable files.
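
The held-out idea above can be illustrated with a small guard that refuses to proceed when a protected scenario also appears in the tuning set. This is a generic sketch, not Cautilus code: `ScenarioSplit`, its fields, and the scenario names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioSplit:
    """Scenario names used while iterating versus names reserved for validation."""
    tuning: set[str] = field(default_factory=set)
    held_out: set[str] = field(default_factory=set)

    def leaked(self) -> set[str]:
        """Return held-out scenario names that also appear in the tuning set."""
        return self.tuning & self.held_out

    def assert_honest(self) -> None:
        """Raise if any protected scenario was exposed to tuning."""
        leaked = self.leaked()
        if leaked:
            raise ValueError(f"held-out scenarios used for tuning: {sorted(leaked)}")


# Hypothetical names mirroring the chatbot example above.
split = ScenarioSplit(tuning={"follow-up"}, held_out={"context-recovery"})
split.assert_honest()  # passes: the two sets are disjoint
```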

The stance, in five contrasts:

Unlike a dashboard-first review tool, Cautilus treats packets, CLI commands, and repo instructions as agent-facing interfaces first; HTML is a human-readable mirror, not the source of truth.
Unlike a prompt manager, Cautilus does not freeze one prompt string as the contract — it treats the behavior under evaluation as the contract (intent-first).
Unlike a benchmark scrapbook, Cautilus separates iteration from protected validation and keeps evidence reopenable from files (held-out honesty, packet-first).
Unlike ad hoc eval scripts, Cautilus makes adapters, reports, review files, and compare artifacts first-class product boundaries (structured review).
Unlike open-ended improver loops, Cautilus keeps search and revision explicitly bounded by budgets, checkpoints, and blocked-readiness conditions (bounded autonomy).

The proof layers are deliberately split because humans, code, and AI are good at different work. Human-auditable claims stay readable. Deterministic claims belong in ordinary tests and CI. Evaluator-dependent behavior goes through cautilus evaluate. Improvement work waits until the proof surface is explicit.

Cautilus also ships a GEPA-style bounded prompt search seam above the one-shot improver: multi-generation reflective mutation, protected reevaluation, frontier-promotion review reuse, checkpoint feedback reinjection, bounded merge synthesis, and Pareto-style frontier selection. Deep dive: docs/guides/improve.md.
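
For readers unfamiliar with Pareto-style frontier selection, here is a minimal generic sketch of the idea. It does not reproduce the Cautilus search seam. The `Candidate` type, the candidate names, and the scores are made up for illustration, with one score per protected scenario and higher meaning better.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    scores: tuple[float, ...]  # one score per scenario, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    pairs = list(zip(a.scores, b.scores))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates, in input order."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative scores only: (follow-up quality, context recovery).
candidates = [
    Candidate("baseline", (0.6, 0.9)),
    Candidate("variant_a", (0.8, 0.5)),
    Candidate("variant_b", (0.5, 0.4)),
]
frontier = pareto_frontier(candidates)
```

Here `variant_b` drops out because `baseline` is better on both scenarios, while `baseline` and `variant_a` both stay because each wins on a different scenario. That is the situation where a single averaged score would hide the trade-off.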

The longer-term direction is close to the workflow philosophy behind DSPy: prompts can change as long as the evaluated behavior survives.

Core Flow

Two entry points share one host-owned cautilus-adapter.yaml and return the same durable decision surface. Operators use the standalone CLI. Claude and Codex use the repo-local Cautilus Agent that cautilus init installs under .agents/skills/cautilus-agent/.

The minimum host-repo shape is an adapter, an installed Cautilus Agent, and run artifacts such as eval-cases.json, eval-observed.json, and eval-summary.json. The result is not just a pass/fail bit: it is a set of machine-readable packets plus readable views that another maintainer or agent can reopen. See docs/specs/user/reviewable-artifacts.spec.md for the rendered-artifact claim.
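
Reopening a run from disk can be as simple as the sketch below. The three file names come from the paragraph above. The helper names are hypothetical, the files are assumed to sit together in one run directory, and each file is parsed as plain JSON with no assumption about the packet schema.

```python
import json
from pathlib import Path

# File names taken from the README; everything else here is illustrative.
RUN_ARTIFACTS = ("eval-cases.json", "eval-observed.json", "eval-summary.json")


def reopen_packets(run_dir: Path) -> dict[str, object]:
    """Load whichever run artifacts exist in run_dir, keyed by file name."""
    packets: dict[str, object] = {}
    for name in RUN_ARTIFACTS:
        path = run_dir / name
        if path.is_file():
            # Parsed as generic JSON; no packet schema is assumed.
            packets[name] = json.loads(path.read_text(encoding="utf-8"))
    return packets


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the expected artifact names that are absent from run_dir."""
    return [name for name in RUN_ARTIFACTS if not (run_dir / name).is_file()]
```

A reviewer or agent could call `missing_artifacts` first to see whether a run is complete before reading any packet content.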

Use cautilus doctor --next-action for the next onboarding step, cautilus doctor --scope agent-surface for agent-surface discoverability, and cautilus doctor for repo wiring readiness. From this repo, npm run consumer:onboard:smoke is the shortest end-to-end adoption proof against a fresh consumer.
