Matrix evaluation for LLM agents — pytest for your cases, codecov for your regressions.
Python ≥3.11 · install from this repo (uv sync / pip install -e .) — PyPI publish pending · Architecture
More: Extended product brief · Battlecard · Competitive landscape · Spike DX analysis · v1 checklist
| Works today | validate · run · gate · baseline update; @case · @evaluator · RuntimeAdapter; .gategrid/ reports and per-profile baselines; contrib/file_edit and MCP example trees; OpenCrabs hashline dogfood (examples/opencrabs/); legacy agent_eval_matrix removed |
| CI | PR gate, main baseline artifact, run.sample — docs/guides/ci.md · Phase 5 |
| Post-v1 | gategrid init, HTML report, marketplace GitHub Action — Phase 6 |
This repo’s CI runs pytest, ci-gate-mock on PR, and baseline refresh on main via .github/workflows/gategrid-ci.yml. Copy .github/workflows/gategrid.yml.example or see docs/guides/ci.md.
Building an agent is only half the work. You still need to know whether a new prompt, tool surface, model, or MCP server build actually helps — and whether yesterday’s change broke last week’s behavior on your stack.
Most teams end up with:
- One-off scripts that don’t compose
- Benchmarks tied to a single agent framework
- CI that runs evals but doesn’t gate regressions
- Baselines that mix unrelated profiles or PR envs, so “pass rate vs main” lies
Gategrid is the shared runner: you bring cases, runtime, and scorers in your eval tree; it runs the grid, stores artifacts only under .gategrid/, and compares runs to git-friendly golden baselines for one gated profile at a time.
| It is | It isn’t |
|---|---|
A matrix runner (cases × profiles × models) |
An agent framework |
| pytest-shaped plugins (your code, our infra) | A hosted eval SaaS |
| Single-profile CI gates + one baseline file per lane | A mandatory multi-profile fleet baseline |
CLI-first (run, gate, baseline update; PR never updates baseline) |
Direct MCP protocol tests without an LLM |
| Git-native golden runs (codecov-style) | promptfoo / LangSmith-style cloud baselines only |
Think pytest plus codecov-style compare to a stored golden run — for one stack at a time in CI, with optional benchmark matrices when you want to compare many profiles on the same cases.
File-edit sandboxes and MCP helpers live in contrib, not in core — you can run matrices without them.
| Gate (CI default) | Benchmark (optional) | |
|---|---|---|
| Question | Did our stack regress? | Which stack is best? |
| Profiles per run | One per gate matrix | Many (A/B tool surfaces) |
| Baseline | .gategrid/baselines/<profile>.json |
Report only; no PR gate |
baseline update |
main / nightly only, full case grid |
Not used for gating |
PR and main should use the same profile and the same baseline file. Overall and like-for-like comparisons stay honest. Full gate YAML, sampling, and CI recipes: docs/guides/ci.md and README-pitch-draft.md.
flowchart LR
subgraph yours [Your repo]
Cases["@case + fixtures"]
Runtime[RuntimeAdapter or your loop]
Eval["@evaluator gate / metric"]
Matrix[matrix YAML files]
end
subgraph ours [Gategrid]
Grid[Matrix executor]
Report[".gategrid/"]
Gate[gate + baseline]
end
Matrix --> Grid
Cases --> Grid
Runtime --> Grid
Grid --> Eval
Eval --> Report
Report --> Gate
| You own | Framework owns |
|---|---|
| Cases, runtime, evaluators, matrix YAML under your tree | Grid expansion, retries, traces |
Several matrix files per repo (smoke, mcp-gate, benchmark, …) |
Reports, one baseline file per gate lane |
| One profile in each gate matrix | gategrid gate, baseline update rules |
Pass --root (or GATEGRID_EVAL_ROOT) when the eval tree is not the repo root — e.g. examples/gategrid/ or examples/opencrabs/.
Secrets: values in process env only; YAML names api_key_env / env_pass_through, never secret values.
- One profile per gate — PR and
maincompare against the samebaselines/<profile>.json, not a mixed fleet baseline. - Regression gate in the CLI —
runproduces a report;gatechecks overall and like-for-like vs baseline;baseline updaterefreshes the golden run onmainonly. - Three layers of pass — cell (
gateevaluators), regression (vs baseline), optional hard limits on this run (matrixgate.limits— examples in pitch draft). - Bring your stack —
RuntimeAdapter, optionalgategrid[pydantic-ai], optionalgategrid[mcp], optional contrib (file-edit sandbox, MCP profile helpers, LLM-judge base class).
PR sampling: run.sample in matrix YAML (max_cells, share, always_include_tags) — see docs/guides/ci.md and examples/gategrid/matrices/ci-gate-pr-mock.yaml.
From a clone:
git clone https://github.com/leshchenko1979/gategrid.git
cd gategrid
uv sync --extra dev
# Matrix run (default eval root: parent of matrices/)
uv run gategrid validate --matrix examples/gategrid/matrices/smoke.yaml
uv run gategrid run --matrix examples/gategrid/matrices/smoke.yaml
# Golden baseline + gate (use the report path printed by run)
uv run gategrid baseline update --from-report .gategrid/reports/<report-from-run-output> --profile demo
uv run gategrid gate --profile demoArtifacts live under .gategrid/reports/ and .gategrid/baselines/ (GATEGRID_HOME overrides). Partial runs: --case <id> (gate may warn on fingerprint mismatch).
MCP mock (offline):
uv sync --extra dev --extra pydantic-ai --extra mcp
uv run gategrid validate --matrix examples/gategrid/matrices/mcp-gate-mock.yaml --root examples/gategrid
uv run gategrid run --matrix examples/gategrid/matrices/mcp-gate-mock.yaml --root examples/gategridCI gate (mock, offline):
uv run gategrid validate --matrix examples/gategrid/matrices/ci-gate-mock.yaml --root examples/gategrid
uv run gategrid run --matrix examples/gategrid/matrices/ci-gate-mock.yaml --root examples/gategridSee examples/gategrid/README.md for layout, PR sampling (ci-gate-pr-mock), baseline regen, and live mcp-gate (requires OPENAI_API_KEY).
LLM-mediated E2E over your MCP server (stdio subprocess or remote HTTP). Gategrid does not run docker, databases, or product side effects — you start the MCP process and own secrets.
| Install | Use when |
|---|---|
pip install "gategrid[pydantic-ai,mcp]" |
Path A: pydantic-ai agent + MCP toolsets (example adapter) |
pip install "gategrid[mcp]" |
Path B: your adapter + official MCP SDK (or another client) |
MCP connection settings live in profile.data.mcp (not a core profile field). Helpers: gategrid.contrib.mcp.mcp_from_profile, resolve_env_pass_through for data.env_pass_through names only.
from gategrid import case, evaluator
from gategrid.contrib.mcp import mcp_from_profile
# In your RuntimeAdapter.execute:
# mcp_cfg = mcp_from_profile(ctx.profile)
# Path A: mcp_toolset_from_data(...) + run_agent(toolsets=[...])
# Path B: your MCP client + agent loop → RunArtifactuv sync --extra dev --extra pydantic-ai --extra mcp
export OPENAI_API_KEY=...
gategrid run --matrix examples/gategrid/matrices/mcp-gate.yaml --root examples/gategridfrom gategrid import case, evaluator
@case(tags=["smoke"], data={"user_prompt": "Create a standup tomorrow 9am"})
def create_event() -> None:
pass # prompt in case data; adapter runs the agent loop
@evaluator(role="gate")
def event_created(ctx, artifact):
return artifact.metrics.get("calendar_write_ok")gategrid run --matrix examples/gategrid/matrices/smoke.yamlDogfood for gate and benchmark personas: five profile variants × ten file-edit cases (hashline protocol hypotheses vs a reference stack).
| Artifact | Path |
|---|---|
| Gate matrix | examples/opencrabs/matrices/hashline-gate.yaml — one profile, gate.baseline, regression bounds |
| Bench matrix | examples/opencrabs/matrices/hashline-bench.yaml — multi-profile, no PR gate |
| Report write-up | docs/reports/hashline/hashline_hypothesis_report.md |
| Charts | docs/reports/hashline/hashline_hypothesis_report.ipynb |
| Repro | examples/opencrabs/README.md — gate vs bench, baseline, CI tiers |
uv sync --extra dev --extra pydantic-ai
gategrid validate --matrix examples/opencrabs/matrices/hashline-smoke.yaml --root examples/opencrabs
gategrid run --matrix examples/opencrabs/matrices/hashline-smoke.yaml --root examples/opencrabs
# Gate / bench: MINIMAX_API_KEY — examples/opencrabs/README.md| Role | Typical use |
|---|---|
| MCP / tool authors | Gate one candidate profile on shared cases before release |
| Agent engineers | Same gate matrix locally and in CI |
| Platform / QA | PR gate vs baselines/<profile>.json; main updates baseline |
| Researchers | Optional benchmark matrix with many profiles — reports only |
Gategrid is a thin git-native regression gate for one agent stack at a time — not a hosted experiment browser or red-team suite.
| Gategrid | promptfoo | DeepEval | |
|---|---|---|---|
| CI regression | One profile, git baseline file | Pass-rate / Action compare; cloud share common | Pytest pass; regression UI → Confident AI |
| Agent runtime | Pluggable RuntimeAdapter |
Providers + custom JS | Bring your app |
| Matrix | Gate vs benchmark personas | Prompt × provider matrix | Datasets / metrics |
Detail: docs/roadmap/product/battlecard.md · docs/roadmap/product/competitive-landscape.md.
From a clone (recommended until PyPI publish):
uv sync --extra dev
# or: pip install -e ".[dev]"Optional extras (same names when published):
uv sync --extra pydantic-ai # LLM runtime (pydantic-ai)
uv sync --extra pydantic-ai --extra mcp # MCP toolsets (pydantic-ai path)
uv sync --extra mcp # MCP SDK only (bring-your-own adapter)Python ≥3.11. Secrets via environment only.
Monorepo layout:
| Path | Role |
|---|---|
| src/gategrid/ | Framework (cli, executor, gate, contrib/, integrations/) |
| examples/gategrid/ | Smoke, MCP gate (mock/live), CI gate (ci-gate-mock) |
| examples/file_edit/ | contrib/file_edit sample |
| examples/opencrabs/ | Hashline hypothesis dogfood (Spike C) |
| schemas/v1/ | Frozen report/baseline/matrix JSON schemas |
Operator setup: CLAUDE.md. Coding principles: CODE.md. Build status: v1-implementation-checklist.md.
uv sync --extra dev
uv run pytest tests/
uv run gategrid validate --matrix examples/gategrid/matrices/smoke.yamlSee LICENSE.