Crawfish

Crawfish is the control plane for governed agent swarms.

Harnesses are abundant. Constitutions are not enough. Evaluation is how a swarm learns without becoming opaque.

Harnesses are abundant. Cognition is volatile. Governance is lagging.
Crawfish exists for the layer above all of that: lifecycle, contracts, continuity, verification, doctrine, and multi-owner control.

Crawfish is a lifecycle-managed runtime for agent swarms that need to survive real operating conditions: budgets, approvals, outages, degraded dependencies, foreign-owner encounters, and model churn. It is not another assistant shell, not another graph toy, and not a harness trying to pretend it is the whole system.

Why Now

The agent stack is changing faster than the rules around it.

Specialized harnesses keep multiplying: OpenClaw, Codex, Claude Code, Gemini CLI, ACP-compatible clients, and more.
The quality of reasoning is improving, but it is also unstable across vendors, models, and release cycles.
Governance and operational practice still trail capability growth.
Multi-owner agent encounters are no longer theoretical. They already happen on the same laptop.

Most teams are still driving by the rear-view mirror, in the sense described by Notion's "Steam, Steel, and Infinite Minds": building with yesterday's application assumptions while swarm-scale agency is arriving with today's tools.

Crawfish is built for that mismatch.

Why Constitutions Are Not Enough

High-level principles matter. They are still not governance.

Anthropic's "Claude's Constitution" is a strong example of rule-guided model behavior. Anthropic's earlier Constitutional AI work made the same point at training time: written principles can shape behavior. But a constitution does not enforce itself once agents begin roaming across workspaces, owners, harnesses, and execution surfaces.

The frontier problem is different:

a principle can say "do not overreach"
the runtime still needs a pre_dispatch checkpoint
the system still needs evidence that the checkpoint ran
the operator still needs escalation when the check cannot be enforced

That is why Crawfish now treats governance as runtime structure, not policy prose:

doctrine packs
jurisdiction classes
oversight checkpoints
enforcement records
policy incidents

If a rule exists without a checkpoint, evidence that the checkpoint ran, and an escalation path, the swarm is still operating in wild-west mode.
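A minimal sketch of the checkpoint, evidence, and escalation triad described above. Every type and function name here is hypothetical and chosen for illustration; none of them is taken from the Crawfish crates. The point is only that a checkpoint always yields a record, and that a check which cannot be enforced escalates instead of silently passing.

```rust
// Hypothetical types for illustration; not the Crawfish API.
#[derive(Debug, Clone, PartialEq)]
pub enum CheckpointOutcome {
    Passed,
    Failed(String),
    // The check could not run at all, for example because the
    // execution surface offers no hook for it.
    Unenforceable(String),
}

// Evidence that a named checkpoint ran and what it concluded.
#[derive(Debug, Clone, PartialEq)]
pub struct EnforcementRecord {
    pub checkpoint: String,
    pub outcome: CheckpointOutcome,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Decision {
    Dispatch,
    Deny,
    Escalate,
}

/// Turns a checkpoint outcome into a decision and always returns
/// an enforcement record alongside it.
pub fn run_checkpoint(name: &str, outcome: CheckpointOutcome) -> (Decision, EnforcementRecord) {
    let decision = match &outcome {
        CheckpointOutcome::Passed => Decision::Dispatch,
        CheckpointOutcome::Failed(_) => Decision::Deny,
        CheckpointOutcome::Unenforceable(_) => Decision::Escalate,
    };
    let record = EnforcementRecord {
        checkpoint: name.to_string(),
        outcome,
    };
    (decision, record)
}
```

Calling `run_checkpoint("pre_dispatch", CheckpointOutcome::Unenforceable(..))` in this sketch returns `Decision::Escalate` together with a record naming the checkpoint, so the operator sees both the gap and the proof that it was noticed.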

Why Swarm, Not Assistant

An assistant is usually imagined as a single interface.
A swarm is a governed system of bounded workers, harness-backed execution surfaces, tools, policies, and owners.

Crawfish treats the future as swarm-shaped:

many agents, not one
many harnesses, not one
many owners and trust domains, not one
many continuity states, not a binary up/down illusion

Swarm here does not imply shared trust, shared memory, or ambient context sharing. It means a governed collection of agents and harness-backed workers under one control plane.

Why Swarm, Not Role-Split Multi-Agent

Much of the earlier "multi-agent" work was about splitting roles inside one application, not governing real encounters across owners, harnesses, and trust boundaries.

LangChain's multi-agent docs frame the problem primarily as context engineering: deciding what information each sub-agent should see and how much context to pass.
OpenAI's Agents SDK frames multi-agent coordination around handoffs and shared run context inside one agentic application.
AutoGen's Swarm docs explicitly describe agents that share the same message context.

Those patterns are useful. They are not the same as the environment Crawfish is built for.

Crawfish targets the point where "many" stops meaning "more prompt wrappers inside one app" and starts meaning:

many bounded workers, not one conversation tree
many owners, not one ambient authority
many harness surfaces, not one centrally managed loop
many real encounter boundaries, not only context partitioning

That is why Crawfish needs doctrine, checkpoints, leases, evidence, evaluation, and escalation. Splitting context is a coordination problem. Swarm governance is a different systems problem.

Why Crawfish Is Not Another Harness

Harnesses are execution surfaces. Crawfish governs them.

OpenClaw is an interactive gateway-native harness surface.
Codex, Claude Code, Gemini CLI, and future ACP-compatible adapters are specialized general-purpose harnesses.
MCP tools are tool-plane integrations.
A2A is the first real remote-agent plane in the current design, using Agent Cards and task-based delegation in the shape introduced by Google's "A2A: A New Era of Agent Interoperability".

Crawfish does not compete by being one more reasoning loop. It competes by making many volatile reasoning loops behave like one inspectable system.

Start With Mainline Alpha

The current supported getting-started path is deliberately narrow:

local swarm control
local-first task.plan under verify_loop
approval-gated local workspace.patch.apply
incident.enrich as the supporting workload
inspectable events, traces, evaluations, alerts, and restart recovery

OpenClaw, A2A, treaties, federation packs, remote evidence, and remote follow-up remain in the repository as experimental alpha surfaces. They still compile and run under CI, but they are not the public happy path and they are not what crawfish init generates by default.

The deeper remote-governance discussion is retained as an experimental appendix later in this document; it is not the mainline onboarding path.

Concept Discipline

Crawfish keeps a broader architecture than it exposes on the public happy path.

For the README, quickstart, and main benchmark story, keep the external model narrow:

local governed swarm runtime
lifecycle-managed actions
local-first task.plan
deterministic verification
approval-gated local mutation
inspectable traces, evaluations, reviews, and alerts

The more advanced remote line remains implemented, but it should be read as retained experimental architecture rather than the default user journey. The compression rule is simple:

treaty: can remote delegation happen
federation pack: how remote states and results are interpreted
evidence bundle: what proof is required to admit the result
follow-up: how the same action continues when proof is incomplete

What The Control Plane Enforces

Crawfish is opinionated about what must survive model churn.

Lifecycle: agents are supervised resources with desired state, health, drain behavior, degraded profiles, and recovery rules.
Contracts: deadlines, budgets, approval rules, mutation mode, and fallback policy are compiled into runtime behavior.
Governance: same-device foreign-owner encounters are classified, constrained, auditable, and revocable.
Continuity: when a model route or harness disappears, the swarm contracts into deterministic work, store-and-forward, or handoff instead of vanishing behind retries.
Verification: success is not whatever a model claims. Verification-sensitive work runs under deterministic checks and bounded retry budgets.
Inspection: actions expose phase, artifacts, checkpoints, external refs, event lineage, governance metadata, and operator-readable failure codes.

What Runs Today

The current public happy path is mainline alpha: local swarm control, local harnesses, deterministic fallback, approval-gated local mutation, and inspectable evaluation.

Mainline Alpha

task.plan runs as a local-first planning path: claude_code -> codex -> deterministic
task.plan also runs under the implemented verify_loop, so local harness output and deterministic fallback are both forced through the same bounded verifier
workspace.patch.apply performs local deterministic edits under approval, grants, leases, revocation, workspace locks, and audit receipts
incident.enrich emits incident_enrichment.json and incident_summary.md
repo.review and ci.triage remain implemented supporting workloads
repo.index remains internal plumbing for repo-aware workloads
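The local-first order for task.plan (claude_code -> codex -> deterministic) can be pictured as a preference list that is filtered by what is installed and by what the contract permits. The sketch below is an assumption-laden illustration: `Route`, `PREFERENCE`, and `select_route` are hypothetical names, and the real selection logic in the runtime may weigh more inputs than these two.

```rust
// Hypothetical route selection sketch; not the Crawfish implementation.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Route {
    ClaudeCode,
    Codex,
    Deterministic,
}

/// Preference order as stated in the text above.
pub const PREFERENCE: [Route; 3] = [Route::ClaudeCode, Route::Codex, Route::Deterministic];

/// Picks the first preferred route that is usable.
/// A harness is usable when it is installed locally; the deterministic
/// planner is usable only when the compiled contract allows fallback.
pub fn select_route(installed: &[Route], contract_allows_deterministic: bool) -> Option<Route> {
    PREFERENCE.iter().copied().find(|route| match route {
        Route::Deterministic => contract_allows_deterministic,
        harness => installed.contains(harness),
    })
}
```

Returning `None` when no harness is installed and the contract forbids fallback keeps the failure explicit, instead of hiding it behind retries.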

Experimental Alpha Surfaces

The repository also contains implemented but experimental alpha surfaces:

OpenClaw inbound and outbound
A2A outbound remote-agent delegation
treaty / federation / remote evidence / remote follow-up lines

They remain compiled and tested, but they are not the recommended getting-started path and they are no longer the default example or crawfish init template.

Verified Execution Strategies

verify_loop is the first implemented execution strategy beyond single_pass.

For task.plan, Crawfish now does this:

Select an execution surface.
Run one proposal attempt.
Deterministically verify the result.
Feed structured verification failures back into the next attempt.
Stop on success, human handoff, or budget exhaustion.
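The five steps above can be sketched as a bounded loop. This is an illustrative sketch under stated assumptions, not the runtime's code: `verify_loop`, `Verdict`, and `LoopOutcome` are hypothetical names, the execution surface is reduced to a `propose` closure, and the deterministic verifier is reduced to a `verify` closure.

```rust
// Hypothetical verify-loop sketch; names are illustrative only.
#[derive(Debug, PartialEq)]
pub enum LoopOutcome {
    Verified { attempts: u32, output: String },
    HumanHandoff { attempts: u32 },
    BudgetExhausted { last_failures: Vec<String> },
}

pub enum Verdict {
    Pass,
    // Structured failures that are fed into the next attempt.
    Fail(Vec<String>),
    // The verifier decides a human has to take over.
    Handoff,
}

/// Runs propose -> verify up to `max_attempts` times.
/// `propose` receives the failures from the previous attempt
/// (empty on the first attempt) and returns a new proposal.
pub fn verify_loop<P, V>(max_attempts: u32, mut propose: P, verify: V) -> LoopOutcome
where
    P: FnMut(&[String]) -> String,
    V: Fn(&str) -> Verdict,
{
    let mut failures: Vec<String> = Vec::new();
    for attempt in 1..=max_attempts {
        let output = propose(failures.as_slice());
        match verify(output.as_str()) {
            Verdict::Pass => return LoopOutcome::Verified { attempts: attempt, output },
            Verdict::Handoff => return LoopOutcome::HumanHandoff { attempts: attempt },
            Verdict::Fail(found) => failures = found,
        }
    }
    LoopOutcome::BudgetExhausted { last_failures: failures }
}
```

The design choice worth noting is that the verifier, not the proposer, decides when the loop ends, so the same check applies whichever surface produced the proposal.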

Today that surface can be:

a local Claude Code process
a local Codex process
an OpenClaw outbound run
a deterministic fallback planner

This is where the project starts to look beyond the current generation of agent demos.
Reasoning quality will keep changing. Verification and control have to outlive that churn.

Evaluation Spine

Tracing alone is not enough. Evaluation alone is not enough. A control plane needs both.

LangSmith provides a useful reference shape here through its observability concepts, pairwise evaluation, annotation queues, automation rules, and experiment comparison: traces, datasets, evaluators, review, alerts, and comparison loops belong to one operational system. Crawfish does not copy LangSmith's product. It lifts that shape into swarm runtime infrastructure.

The runtime now builds an evaluation spine:

trace -> scorecard -> review queue -> alert -> dataset -> replay -> compare

That spine is attached to real action execution:

task.plan
repo.review
incident.enrich

The point is not to build a hosted dashboard first. The point is to make swarms inspectable and corrigible before the UI arrives.

Observability is the rear-view mirror. Evaluation is the learning loop.

In Crawfish:

TraceBundle captures inputs, executor lineage, artifacts, events, external refs, and verification outputs
EvaluationRecord turns deterministic checks into durable quality evidence
ReviewQueueItem escalates work that should not quietly auto-complete
FeedbackNote lets operator judgment flow back into future iterations without rewriting history
AlertRule turns governance or evaluation failures into visible operator signals
DatasetCase freezes completed actions into replayable evaluation datasets with doctrine and jurisdiction metadata
ExperimentRun replays those cases against one executor surface so the swarm can learn without polluting production review queues
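As a rough illustration of how deterministic checks might become an EvaluationRecord and, when the score is low, a ReviewQueueItem: the sketch below reuses those two names from the list above, but every field, the `CheckResult` type, the `evaluate` function, and the pass-ratio scoring rule are assumptions made for this example.

```rust
// Hypothetical shapes; field names and scoring rule are assumptions.
pub struct CheckResult {
    pub name: String,
    pub passed: bool,
}

pub struct EvaluationRecord {
    pub action_id: String,
    pub score: f64,
    pub failed_checks: Vec<String>,
}

pub struct ReviewQueueItem {
    pub action_id: String,
    pub reason: String,
}

/// Scores an action as the fraction of passed checks and opens a
/// review item when the score falls below `review_below`.
/// An action with no checks scores 0.0, so it never auto-completes quietly.
pub fn evaluate(
    action_id: &str,
    checks: &[CheckResult],
    review_below: f64,
) -> (EvaluationRecord, Option<ReviewQueueItem>) {
    let failed_checks: Vec<String> = checks
        .iter()
        .filter(|check| !check.passed)
        .map(|check| check.name.clone())
        .collect();
    let score = if checks.is_empty() {
        0.0
    } else {
        (checks.len() - failed_checks.len()) as f64 / checks.len() as f64
    };
    let review = if score < review_below {
        Some(ReviewQueueItem {
            action_id: action_id.to_string(),
            reason: format!("score {:.2} below threshold {:.2}", score, review_below),
        })
    } else {
        None
    };
    let record = EvaluationRecord {
        action_id: action_id.to_string(),
        score,
        failed_checks,
    };
    (record, review)
}
```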

For remote-agent work, that spine now treats the returned result as a governance event:

task_plan_remote_default scores remote outcome disposition, delegation receipt evidence, remote task lineage, and treaty-violation absence
federation metadata now carries through trace, review, alert, dataset, and replay paths so remote escalation stays visible after execution
A2A outcomes that come back as review_required or rejected are visible in the same trace, review, and alert substrate as local failures
remote-agent quality is therefore judged on both proposal quality and treaty evidence quality

Pairwise Review

Single-run evaluation tells you whether one executor met the bar. Pairwise review tells you whether one route is actually better than another.

Crawfish now treats executor-first comparison as a control-plane primitive:

launch two isolated experiment runs against one dataset
compare them deterministically before any human judgment
open a human review item only when the signals are too close or too conflicted to trust automation

That shape is borrowed deliberately from LangSmith's pairwise evaluation, annotation queues, and experiment comparison, but reinterpreted as runtime substrate rather than a hosted UI.

The important product choice is what Crawfish does not do here:

no LLM-as-judge
no opaque winner selection
no prompt arena disconnected from runtime doctrine

Instead, pairwise outcomes are driven by doctrine incidents, terminal status, normalized evaluation score, and explicit review resolution when automation should stop pretending to be certain.
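A deterministic comparison along those lines might look like the sketch below. The signal order (doctrine incidents first, then terminal status, then score margin) and all names (`RunSummary`, `PairwiseOutcome`, `compare`, `min_margin`) are assumptions for illustration; the source states which signals matter, not how they are ranked.

```rust
use std::cmp::Ordering;

// Hypothetical summary of one experiment run.
#[derive(Debug, Clone, PartialEq)]
pub struct RunSummary {
    pub doctrine_incidents: u32,
    pub succeeded: bool,
    // Assumed to lie in 0.0..=1.0.
    pub normalized_score: f64,
}

#[derive(Debug, PartialEq)]
pub enum PairwiseOutcome {
    LeftWins,
    RightWins,
    // Signals are too close to trust automation.
    NeedsReview,
}

/// Compares two runs without any model-based judge.
pub fn compare(left: &RunSummary, right: &RunSummary, min_margin: f64) -> PairwiseOutcome {
    // 1. Fewer doctrine incidents wins outright.
    match left.doctrine_incidents.cmp(&right.doctrine_incidents) {
        Ordering::Less => return PairwiseOutcome::LeftWins,
        Ordering::Greater => return PairwiseOutcome::RightWins,
        Ordering::Equal => {}
    }
    // 2. A successful terminal status beats a failed one.
    match (left.succeeded, right.succeeded) {
        (true, false) => return PairwiseOutcome::LeftWins,
        (false, true) => return PairwiseOutcome::RightWins,
        _ => {}
    }
    // 3. Otherwise the score gap must reach `min_margin`.
    let delta = left.normalized_score - right.normalized_score;
    if delta >= min_margin {
        PairwiseOutcome::LeftWins
    } else if -delta >= min_margin {
        PairwiseOutcome::RightWins
    } else {
        PairwiseOutcome::NeedsReview
    }
}
```

In this sketch a run with a higher score still loses to a run with fewer doctrine incidents, which mirrors the idea that a technically better answer does not excuse a governance failure.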

Remote-agent comparisons inherit the same rule: a route that returns weaker frontier evidence or more treaty violations does not get to hide behind a technically successful transport call.

Philosophy

The forward-looking product philosophy lives in docs/spec/philosophy.md.

The short version:

build for swarm-age governance, not single-agent demos
harnesses are replaceable, control planes are strategic
reasoning is volatile; contracts and verification must survive model churn
institutions lag capability growth, as argued in Notion's "Steam, Steel, and Infinite Minds"; runtime guardrails cannot
constitutions do not enforce themselves
constitutions guide models; institutions govern swarms
frontier enforcement gaps are runtime failures, not merely policy failures
evaluation is how a swarm learns without becoming opaque
treaties precede marketplaces, reputation, and federation packs
treaties decide whether delegation is lawful; federation packs decide how remote states and results are interpreted
evidence bundles and remote review workflow decide whether frontier results are admissible
design for future multi-owner encounters, not yesterday's app sandbox

The supporting spec set lives in:

Quickstart

The reference example lives under examples/hero-swarm/.

cargo test --workspace
cargo run -p crawfish-cli --bin crawfish -- init ./sandbox
cp examples/hero-swarm/Crawfish.toml ./sandbox/Crawfish.toml
cp examples/hero-swarm/agents/task_planner.toml ./sandbox/agents/
cp examples/hero-swarm/agents/workspace_editor.toml ./sandbox/agents/
cp examples/hero-swarm/agents/incident_enricher.toml ./sandbox/agents/
cd sandbox
mkdir -p src docs incident
printf 'pub fn value() -> u32 { 42 }\n' > src/lib.rs
cp ../examples/hero-swarm/data/sample-incident.log incident/sample-incident.log
cp ../examples/hero-swarm/data/service-manifest.toml incident/service-manifest.toml
cargo run -p crawfish-cli --bin crawfish -- run &
sleep 1

cargo run -p crawfish-cli --bin crawfish -- action submit \
  --target-agent task_planner \
  --capability task.plan \
  --goal "propose a rollout checklist" \
  --caller-owner local-dev \
  --inputs-json '{
    "workspace_root": ".",
    "objective": "Prepare a rollout checklist for tightening local validation around src/lib.rs",
    "context_files": ["src/lib.rs"],
    "desired_outputs": ["rollout checklist", "operator handoff"]
  }' \
  --json

cargo run -p crawfish-cli --bin crawfish -- inspect <action-id> --json
cargo run -p crawfish-cli --bin crawfish -- action events <action-id> --json
cargo run -p crawfish-cli --bin crawfish -- action trace <action-id> --json
cargo run -p crawfish-cli --bin crawfish -- action evals <action-id> --json
cargo run -p crawfish-cli --bin crawfish -- review list --json

cargo run -p crawfish-cli --bin crawfish -- action submit \
  --target-agent workspace_editor \
  --capability workspace.patch.apply \
  --goal "materialize the rollout checklist" \
  --caller-owner local-dev \
  --workspace-write \
  --mutating \
  --inputs-json '{
    "workspace_root": ".",
    "edits": [{
      "path": "docs/rollout-checklist.md",
      "op": "create",
      "contents": "# Rollout Checklist\n\n- Inspect src/lib.rs\n- Add validation coverage\n- Run targeted tests\n- Capture operator handoff\n"
    }]
  }' \
  --json

cargo run -p crawfish-cli --bin crawfish -- action list --phase awaiting_approval --json
cargo run -p crawfish-cli --bin crawfish -- action approve <mutation-action-id> --approver local-dev --json

cargo run -p crawfish-cli --bin crawfish -- action submit \
  --target-agent incident_enricher \
  --capability incident.enrich \
  --goal "enrich local incident" \
  --caller-owner local-dev \
  --inputs-json '{
    "service_name": "api",
    "log_file": "incident/sample-incident.log",
    "service_manifest_file": "incident/service-manifest.toml"
  }' \
  --json

cargo run -p crawfish-cli --bin crawfish -- alert list --json

For the full reference walkthrough, run examples/hero-swarm/demo.sh.

If claude or codex is installed locally, task_planner will prefer those harnesses first. If neither local wrapper is available, Crawfish falls back to deterministic planning when the compiled contract allows it.

Public Status

Crawfish is public and maintained seriously, but it is still alpha.

Surface	Status
CLI	public, unstable
`Crawfish.toml` and manifests	public, unstable
local UDS HTTP API	public, unstable
Rust workspace crates	public, unstable

Current support baseline:

version posture: 0.x / alpha
implementation posture: Rust-first, not Rust-only
supported runtime environments: macOS and Linux
supported MCP transport in the current codebase: SSE only
supported mainline alpha path: local swarm control and local-first task.plan
implemented but experimental alpha surfaces: OpenClaw, A2A, treaty/federation remote governance

Breaking alpha changes are allowed, but they must ship with:

a changelog entry in docs/project/CHANGELOG.md
README or spec updates
a migration note when the break is user-visible

Primary alpha config direction:

quality.evaluation_profile is the primary evaluation selector
quality.evaluation_hook still parses during alpha, but it is deprecated and only normalized for legacy built-ins

Project maintenance policy lives in:

Experimental Alpha Appendix: Remote-Agent Governance

The sections below describe retained but experimental remote-governance surfaces rather than the recommended onboarding path.

Why Remote Agents Are Not Just Another Harness

Remote agents are not only remote processes. They are separate authorities.

A harness crossing changes the execution surface. A remote-agent crossing changes the governance problem. A2A's Agent Card model and task lifecycle make that explicit: the runtime is delegating work to another agent system, not just spawning another wrapper on the same machine.

That is why Crawfish treats remote delegation differently:

harnesses are selected execution surfaces
remote agents are treaty-governed delegation targets
federation packs decide how remote states, evidence gaps, and remote results are interpreted after delegation
doctrine still applies, but treaties decide whether cross-system delegation is allowed at all
remote task lineage, remote principal identity, and delegation receipts must remain inspectable

Why Treaties Precede Marketplaces

Before reputation systems, marketplaces, or federation policy packs, a swarm needs a lawful basis for remote delegation.

In Crawfish, that basis is the treaty.

A treaty decides:

which remote principal is recognized
which capabilities may be delegated
which data scopes may cross the boundary
which artifact classes may come back
which checkpoints and result evidence are mandatory
whether missing evidence should be escalated or denied
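To make the first three treaty decisions concrete, here is a small sketch of an admission check. It is not the Crawfish treaty format: the `Treaty` fields, `DelegationDecision`, and `check_delegation` are hypothetical, and artifact classes, mandatory checkpoints, and the missing-evidence policy are left out to keep the example short.

```rust
use std::collections::HashSet;

// Hypothetical treaty shape covering principal, capabilities, and scopes.
pub struct Treaty {
    pub remote_principal: String,
    pub capabilities: HashSet<String>,
    pub data_scopes: HashSet<String>,
}

#[derive(Debug, PartialEq)]
pub enum DelegationDecision {
    Allowed,
    Denied(&'static str),
}

/// Decides whether one delegation request is lawful under the treaty.
/// Checks run in a fixed order so the denial reason is deterministic.
pub fn check_delegation(
    treaty: &Treaty,
    principal: &str,
    capability: &str,
    scopes: &[&str],
) -> DelegationDecision {
    if treaty.remote_principal != principal {
        return DelegationDecision::Denied("unrecognized remote principal");
    }
    if !treaty.capabilities.contains(capability) {
        return DelegationDecision::Denied("capability not delegable");
    }
    if scopes.iter().any(|scope| !treaty.data_scopes.contains(*scope)) {
        return DelegationDecision::Denied("data scope not permitted");
    }
    DelegationDecision::Allowed
}
```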

That is why the current A2A line is treaty-governed rather than marketplace-driven. Google's "A2A: A New Era of Agent Interoperability" gives the task-plane shape. Crawfish adds the control-plane question: not just can the swarm delegate, but under what treaty, with what evidence, and how does the runtime respond when the evidence comes back incomplete.

Markets can come later. The treaty has to come first.

Why Federation Packs Matter After The Treaty

Treaties answer the first question: may this swarm delegate across the boundary at all?

Federation packs answer the next question: once the remote side starts talking back, how should the control plane interpret what it sees?

That second question matters because remote-agent governance does not end at dispatch:

a remote task can return input-required
it can demand auth instead of finishing
it can return artifacts that are technically well-formed but outside the allowed class or scope
it can finish without enough evidence for the local control plane to trust the result

So Crawfish now separates the two responsibilities:

treaty packs define whether delegation is lawful
federation packs define how remote state, evidence, and results are escalated, reviewed, accepted, or rejected
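The interpretation step can be sketched as a pure function from remote state to a local disposition. The mapping below is an assumption chosen to match the four cases listed above; `RemoteState`, `Disposition`, and `interpret` are hypothetical names, and a real federation pack would presumably be configurable and not hard-coded.

```rust
// Hypothetical remote states, reduced to the cases discussed above.
#[derive(Debug, Clone, PartialEq)]
pub enum RemoteState {
    InputRequired,
    AuthRequired,
    Completed {
        artifact_in_scope: bool,
        evidence_complete: bool,
    },
}

#[derive(Debug, PartialEq)]
pub enum Disposition {
    Accept,
    ReviewRequired(&'static str),
    Reject(&'static str),
}

/// Maps what the remote side reported onto a local control-plane outcome.
pub fn interpret(state: &RemoteState) -> Disposition {
    match state {
        RemoteState::InputRequired => {
            Disposition::ReviewRequired("remote task is waiting for input")
        }
        RemoteState::AuthRequired => {
            Disposition::ReviewRequired("remote side demanded auth")
        }
        // Out-of-scope artifacts are rejected even if well-formed.
        RemoteState::Completed { artifact_in_scope: false, .. } => {
            Disposition::Reject("artifact outside allowed class or scope")
        }
        RemoteState::Completed { evidence_complete: false, .. } => {
            Disposition::ReviewRequired("evidence incomplete")
        }
        RemoteState::Completed { .. } => Disposition::Accept,
    }
}
```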

That is how a control plane turns remote delegation from "we made an HTTP call" into governable swarm behavior.

Why Evidence Bundles Decide Admissibility

Treaties decide whether remote delegation is lawful. Federation packs decide how remote state and remote results should be interpreted. But neither is enough unless the runtime can produce an admissible evidence bundle when the remote side replies.

That is why Crawfish now treats remote evidence as a first-class control-plane object:

remote terminal state evidence
remote artifact manifest
remote scope and data evidence
checkpoint evidence for admission, pre_dispatch, and post_result
treaty violations, policy incidents, and review disposition

This follows the same broad lesson behind LangSmith's observability concepts: traces matter because they preserve evidence, not because they make the UI look richer. In Crawfish, evidence bundles are what decide whether a remote result is admissible, blocked for review, or rejected.

Remote review is therefore not a UI-only feature. It is the operator workflow that turns a treaty-governed but ambiguous remote outcome into an explicit control-plane result:

accept_result
reject_result
needs_followup

needs_followup is now a real control-plane continuation. Crawfish creates a structured RemoteFollowupRequest, keeps the action blocked, preserves the prior remote evidence bundle, and requires an explicit operator-triggered re-dispatch before the same action may create a fresh remote attempt.
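The follow-up behavior can be sketched as a small state transition. RemoteFollowupRequest is named in the text above; its fields, the `ActionState` enum, and the three functions below are hypothetical, and mapping accept_result to a completed action and reject_result to a failed one is an assumption of this example.

```rust
// Hypothetical sketch of review resolution and follow-up gating.
#[derive(Debug, Clone, PartialEq)]
pub enum ReviewResolution {
    AcceptResult,
    RejectResult,
    NeedsFollowup,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RemoteFollowupRequest {
    pub action_id: String,
    // The prior evidence bundle is preserved, not discarded.
    pub prior_evidence: Vec<String>,
    pub redispatch_approved: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ActionState {
    Completed,
    Failed,
    Blocked(RemoteFollowupRequest),
}

/// Applies the operator's review resolution to an action.
pub fn resolve(action_id: &str, prior_evidence: Vec<String>, resolution: ReviewResolution) -> ActionState {
    match resolution {
        ReviewResolution::AcceptResult => ActionState::Completed,
        ReviewResolution::RejectResult => ActionState::Failed,
        ReviewResolution::NeedsFollowup => ActionState::Blocked(RemoteFollowupRequest {
            action_id: action_id.to_string(),
            prior_evidence,
            redispatch_approved: false,
        }),
    }
}

/// Records the explicit operator trigger; other states pass through unchanged.
pub fn approve_redispatch(state: ActionState) -> ActionState {
    match state {
        ActionState::Blocked(mut request) => {
            request.redispatch_approved = true;
            ActionState::Blocked(request)
        }
        other => other,
    }
}

/// A fresh remote attempt is allowed only after operator approval.
pub fn may_redispatch(state: &ActionState) -> bool {
    matches!(state, ActionState::Blocked(request) if request.redispatch_approved)
}
```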

The project is Rust-first, not Rust-only:

crates/ is the implementation spine for the runtime, control plane, storage, and native outbound adapters.
integrations/ is the edge zone for isolated bridge packages where a non-Rust implementation is pragmatic.
The current example is integrations/openclaw-inbound/, a thin TypeScript ingress bridge. The policy engine, lifecycle authority, storage, and runtime decisions remain in Rust.

Experimental remote and federation examples live under examples/experimental/.
