ClaudeScientist

A research co-pilot that remembers, verifies, and lets you steer.

Chinese version: README.zh-CN.md

ClaudeScientist plugs into Claude Code and adds what most AI scientist systems leave out: it remembers what you've tried, verifies your numbers before you publish them, and gives you a live terminal dashboard where you can watch the research unfold and step in at any time.

You give Claude a research question. It generates hypotheses, ranks them in a tournament, runs experiments with built-in safety checks, and tracks provenance for every number it produces. You watch the whole process in a second terminal and can reject, redirect, or approve at any point.

Current version: v5.0.0. The cockpit is now an "activity streaming" research monitor. A phase strip at the top of the screen (idle / explore / select / experiment / verify / prove / review / narrate) is derived live from cockpit_events. The main pane shows activity cards, one per research action (a Bradley-Terry (BT) tournament, a proof diagnose loop, a Lean attempt), instead of a flat event firehose. A new Focus tab lists the node(s) the agent is working on right now. The original event stream is preserved as a collapsible audit log at the bottom (toggle with A). Two new optional MCP atomic tools, cockpit__set_phase and cockpit__narrate, let SOP-driven agents annotate decisions without coupling to the cockpit's rendering. No schema migration is required: phase, focus, and activity are pure functions over the existing cockpit_events table, per ADR 0011 and architecture.md §14.
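To make "pure functions over cockpit_events" concrete, here is a minimal sketch of deriving the current phase from an event list. The `Event` shape, the `set_phase` event kind, and the name `derive_phase` are hypothetical and chosen for illustration only. The phase names and the no-migration principle are the only parts taken from this README; the real derivation is specified in ADR 0011.

```python
from dataclasses import dataclass

# Phase names as listed for the v5.0.0 phase strip.
PHASES = ("idle", "explore", "select", "experiment",
          "verify", "prove", "review", "narrate")


@dataclass(frozen=True)
class Event:
    seq: int      # assumed: monotonically increasing event id
    kind: str     # hypothetical event kind, e.g. "set_phase"
    payload: str  # hypothetical payload; for "set_phase", the phase name


def derive_phase(events):
    """Return the most recently set valid phase, or "idle" if none was set.

    The function only reads its argument and stores nothing, so the phase
    can always be recomputed from the event log without a new column.
    """
    phase = "idle"
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.kind == "set_phase" and ev.payload in PHASES:
            phase = ev.payload
    return phase
```

Because the result depends only on the events passed in, replaying the same log always yields the same phase, which is what allows a view like this to be added without touching the schema.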

v4.2.0 features retained (see retrospective-v4.2.md): tab grouping into Cross / Empirical / Proof, collapsible detail sections, pane-scoped w/i/t keys, the multi-provider vector backend (DashScope / Jina / Voyage / GLM tested via ADR 0010, default local Qwen/Qwen3-Embedding-0.6B), reports-as-files (closure / draft / diagnostic / portfolio / cascade) per ADR 0009, and the cold-start Welcome screen. See architecture.md §13 for the two-trunk split.

What it looks like

Open two terminals side by side. That's the whole UI.

The cockpit TUI — hypothesis tree, evidence, ratings, and event stream in one terminal.

The two terminals don't talk to each other directly — they both read and write the same SQLite file. This is the central design choice: every module collaborates through a shared database, not over the network.
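The shared-file pattern can be sketched with Python's standard `sqlite3` module: two independent connections (standing in for Claude Code's tools and the cockpit) exchange events through one database file and never open a socket. The column names (`kind`, `body`), the helper names, and the choice of WAL journal mode are assumptions for illustration, not the project's actual schema or settings.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_state(db_path):
    """Open a connection to the shared state file, creating the table if needed."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    # Assumption: WAL is a common choice when one process writes while
    # another reads; the README does not say which mode the project uses.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cockpit_events ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "kind TEXT NOT NULL, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def emit(conn, kind, body):
    """Append one event; the `with` block commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO cockpit_events (kind, body) VALUES (?, ?)", (kind, body)
        )


def poll(conn, after_id):
    """Return all events newer than `after_id`, oldest first."""
    return conn.execute(
        "SELECT id, kind, body FROM cockpit_events WHERE id > ? ORDER BY id",
        (after_id,),
    ).fetchall()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "state.db"
        agent = open_state(db_path)    # stands in for Terminal A
        cockpit = open_state(db_path)  # stands in for Terminal B
        emit(agent, "hypothesis_added", "H1")
        emit(cockpit, "human_reject", "H1")  # intervention written by the TUI
        seen = poll(agent, after_id=0)       # agent picks it up on its next turn
        agent.close()
        cockpit.close()
        return seen
```

Calling `demo()` returns both rows in insertion order, showing that each side sees the other's writes simply by re-reading the file.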

| Role | Where | What it does |
| --- | --- | --- |
| Claude Code | Terminal A | Drives the research: understands your question, calls tools, writes and runs code |
| MCP servers | Background | Provide the tools Claude calls: memory, verification, literature search, proof generation |
| Hooks | Auto-loaded at startup | Run safety checks before/after every tool call (block data leaks, log provenance) |
| Cockpit TUI | Terminal B | Shows live state; lets you approve, reject, or redirect hypotheses |
| SQLite | `.research-agent/state.db` | The single file that holds all state: hypotheses, evidence, ratings, metrics, events |

What you can do with it

Track your research thinking. Every hypothesis, piece of evidence, and branching decision lives in a persistent graph. Papers you read along the way are compressed and searchable. Come back next week — it's all there. Want to revisit a direction you pruned? Run a counterfactual replay without touching the live state.
Rank competing ideas. A Bradley-Terry tournament compares hypotheses head-to-head and produces a leaderboard with confidence intervals, so you know which direction is actually winning.
Lock your goalposts before experimenting. Preregistration makes you commit to a metric, direction, and threshold before you see results. Multiple-comparison correction is applied automatically.
Make your numbers trustworthy. Every reported number gets checked: Is it reproducible across random seeds? Which files produced it? Has anything changed since? Baseline comparisons are checked for fair compute budgets. A reviewer agent blocks any unverified claim from reaching a writeup.
Catch mistakes before they compound. A failure ledger remembers every debugging session. Next time you hit a similar problem, the system surfaces how you fixed it before.
Watch and steer in real time. The cockpit TUI shows the hypothesis tree, ratings, and event stream live. Press a key to reject a bad hypothesis or inject a note — interventions are picked up at the next turn.
Generate and verify statistical proofs (v4.0). A proof trunk handles drafting, segmentation, diagnosis against known error patterns, and optional Lean 4 formal verification.
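For readers unfamiliar with Bradley-Terry ranking, the sketch below fits strengths from head-to-head win counts using the classic minorization-maximization update. It is a generic textbook version under stated assumptions, not the project's implementation: the function name and input format are hypothetical, the README does not say which fitting method the project uses, and the confidence intervals mentioned above are omitted.

```python
def fit_bradley_terry(wins, n_items, iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from pairwise outcomes.

    wins[(i, j)] is the number of times item i beat item j.
    Returns a list of non-negative strengths that sums to 1, where
    P(i beats j) is modeled as s[i] / (s[i] + s[j]).
    """
    strength = [1.0 / n_items] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            # Items that were never compared keep their current strength.
            updated.append(total_wins / denom if denom else strength[i])
        norm = sum(updated)
        updated = [s / norm for s in updated]
        converged = max(abs(a - b) for a, b in zip(updated, strength)) < tol
        strength = updated
        if converged:
            break
    return strength


# Hypothetical tournament among three hypotheses (indices 0, 1, 2).
example_wins = {(0, 1): 3, (1, 0): 1, (0, 2): 4, (2, 0): 1, (1, 2): 3, (2, 1): 2}
example_strengths = fit_bradley_terry(example_wins, n_items=3)
```

Sorting hypotheses by the returned strengths gives the leaderboard order. In this example hypothesis 0 ranks first, followed by 1 and then 2.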

Quick start

Install and run the setup wizard:

uv sync
uv run python -m claudescientist.setup

The wizard walks you through embedding backend, proof corpus seeding, held-out directory, Lean toolchain, and auto-prune — all in one pass. Run it again any time; it skips steps that are already done.

Literature search uses two external MCPs. arXiv is launched through uv tool run arxiv-mcp-server; OpenAlex is launched through npx -y openalex-research-mcp, so install Node.js/npm if you want the OpenAlex-backed librarian tools.

Manual setup (without the wizard)

uv sync --extra proof    # pulls in sentence-transformers for the proof trunk
uv run python scripts/seed_proof_corpus.py
uv run python scripts/seed_proof_failures.py

Run — open two terminals from the repo root:

# Terminal A: Claude Code (from the repo root)
claude

# Terminal B: cockpit TUI (from the repo root)
uv run python -m cockpit.tui

For the Chinese UI on Windows Terminal (PowerShell syntax):

chcp 65001
$env:PYTHONUTF8=1
uv run python -m cockpit.tui --lang zh

Press L inside the TUI to toggle English / Chinese labels.

Lean formal verification is a separate opt-in setup — see docs/setup-lean.md.

Where to go next

If you're new, read in this order:

docs/overview.md — the complete mental model: how the pieces fit, what happens end-to-end, the three design principles
docs/workflows/first-research-task.md — walk through one full task from start to finish
docs/architecture.md — the contracts between modules (treat as binding)
docs/tool-reference.md — every MCP tool, with signature and usage guidance

More:

Design rationale for each major decision → docs/adr/
Where the project is headed → docs/roadmap.md
Historical plans → docs/archive/
Agent and contributor rules → AGENTS.md

Runtime details

Default paths:

Shared state: .research-agent/state.db under the repo root
Generated reports: reports/ under the repo root; gitignored by default, force-add individual files only when you intentionally want to share them
Held-out datasets: %USERPROFILE%\.research-agent\heldout, configurable via RESEARCH_AGENT_HELDOUT_DIR
Embedding backend: local (sentence-transformers/Qwen/Qwen3-Embedding-0.6B); override with RESEARCH_AGENT_EMBED_BACKEND=mock|openai. Tests use mock automatically.
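The defaults and overrides above can be summarized in a short sketch. The function name `resolve_runtime_config` and the returned dictionary keys are hypothetical; only the paths and environment variable names come from this README. `Path.home()` is used here as a portable stand-in for `%USERPROFILE%`, which is an assumption about how the project resolves that path.

```python
import os
from pathlib import Path


def resolve_runtime_config(env=None, repo_root="."):
    """Resolve the documented default paths, honoring environment overrides."""
    env = os.environ if env is None else env
    root = Path(repo_root)
    heldout_override = env.get("RESEARCH_AGENT_HELDOUT_DIR")
    return {
        # Shared state lives in a single SQLite file under the repo root.
        "state_db": root / ".research-agent" / "state.db",
        # Generated reports are written here (gitignored by default).
        "reports_dir": root / "reports",
        # Held-out data stays outside the repo unless overridden.
        "heldout_dir": (
            Path(heldout_override)
            if heldout_override
            else Path.home() / ".research-agent" / "heldout"
        ),
        # "local" is the documented default backend.
        "embed_backend": env.get("RESEARCH_AGENT_EMBED_BACKEND", "local"),
    }
```

Passing an explicit `env` mapping, as in `resolve_runtime_config(env={"RESEARCH_AGENT_EMBED_BACKEND": "mock"})`, makes the resolution easy to exercise in tests without touching the real environment.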

Dev server commands for individual MCP modules, followed by the CLI command for registering a held-out dataset:

uv run python -m memory_mcp.dev_server
uv run python -m verify_mcp.dev_server
uv run python -m prove_mcp.dev_server
uv run python -m cockpit.mcp_server
uv run python -m claudescientist.heldout register <name> <path>

Validation

Before shipping a change:

uv run ruff check
uv run pytest tests/memory_mcp tests/verify_mcp tests/prove_mcp tests/hooks tests/cockpit tests/scripts tests/e2e
uv run python -m cockpit.tui --once --lang zh
uv run python -c "import memory_mcp.server; import verify_mcp.server; import prove_mcp.server; import cockpit.mcp_server; print('OK')"

Current status

The repo works for local development and integration. A fresh end-to-end validation pass is needed before calling it production-ready.

A few things to know:

Auto-prune is dry-run by default. Set RESEARCH_AGENT_AUTO_PRUNE=1 to let it actually pause weak branches.
The cockpit is terminal-only. No browser frontend, no web server.
The prover agent works without Lean. The NL proof workflow runs on its own; Lean is extra insurance you can set up later via docs/setup-lean.md.
mem_nodes.elo_score is a legacy column. New code should read mem_bt_ratings.strength.
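The dry-run default for auto-prune can be pictured as a simple gate. This is an illustrative sketch only: the function names, the strength threshold, and the `"would_pause"` label are hypothetical, and treating exactly the value `1` as "enabled" is an assumption based on the wording above.

```python
import os


def auto_prune_enabled(env=None):
    """True only when RESEARCH_AGENT_AUTO_PRUNE is set to "1" (assumed rule)."""
    env = os.environ if env is None else env
    return env.get("RESEARCH_AGENT_AUTO_PRUNE") == "1"


def plan_prune(strengths, threshold, env=None):
    """Return (branch_id, action) pairs for branches below `threshold`.

    In dry-run mode the action is "would_pause", so a caller can log the
    plan without changing any state; with the flag set it is "pause".
    """
    action = "pause" if auto_prune_enabled(env) else "would_pause"
    return [
        (branch_id, action)
        for branch_id, strength in sorted(strengths.items())
        if strength < threshold
    ]
```

With the flag unset, `plan_prune({"H1": 0.7, "H2": 0.1}, threshold=0.2, env={})` reports `H2` as `"would_pause"` and changes nothing.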

Full tool list and scope details: docs/tool-reference.md and AGENTS.md.
