This document defines a practical benchmark loop for the runbook investigate agent.
- Measure root-cause quality and safety over time.
- Catch regressions before shipping prompt/tool/runtime changes.
- Support both offline replay and shadow-mode evaluation.
- Runner: `src/eval/investigation-benchmark.ts`
- Scoring: `src/eval/scoring.ts`
- Dataset bootstrap: `src/eval/setup-datasets.ts`
- RCAEval converter: `src/eval/rcaeval-to-fixtures.ts`
- Rootly logs converter: `src/eval/rootly-logs-to-fixtures.ts`
- TraceRCA converter: `src/eval/tracerca-to-fixtures.ts`
- Unified benchmark runner: `src/eval/run-all-benchmarks.ts`
- Sample fixtures: `examples/evals/investigation-fixtures.sample.json`
- RCAEval input sample: `examples/evals/rcaeval-input.sample.json`
```json
{
  "version": "1.0",
  "passThreshold": 0.7,
  "cases": [
    {
      "id": "case-id",
      "incidentId": "PD-123",
      "query": "Investigate incident PD-123",
      "context": "Additional logs or timeline",
      "tags": ["redis", "latency"],
      "expected": {
        "rootCause": "optional exact phrase",
        "rootCauseKeywords": ["redis", "connection pool"],
        "affectedServices": ["checkout-api", "redis"],
        "confidenceAtLeast": "medium",
        "requiredPhrases": ["evidence"],
        "forbiddenPhrases": ["drop database"]
      },
      "execute": {
        "maxIterations": 6,
        "autoRemediate": false
      }
    }
  ]
}
```

Run the benchmark with:

```shell
npm run eval:investigate -- --fixtures examples/evals/investigation-fixtures.sample.json
```

Optional flags:
- `--out <path>`: output report JSON path
- `--limit <n>`: run first N cases
- `--offline`: score from fixture `mockResult` fields (no live model/tool execution)
Example:

```shell
npm run eval:investigate -- \
  --fixtures examples/evals/investigation-fixtures.sample.json \
  --out .runbook/evals/nightly.json \
  --limit 20
```

The runner writes a JSON report with:
- case-level scores
- pass/fail by threshold
- event counts (phase changes, hypotheses, queries, evaluations)
- aggregate pass rate and average score
The current overall score is an average of the available components:
- root-cause correctness (`rootCause` / `rootCauseKeywords`)
- affected-service coverage (`affectedServices`)
- confidence floor (`confidenceAtLeast`)
- phrase compliance (`requiredPhrases`, `forbiddenPhrases`)
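As an illustration, the component averaging can be sketched as below. This is a simplified sketch with assumed component names; the actual logic lives in `src/eval/scoring.ts` and may differ.

```typescript
// Hypothetical sketch of the overall-score aggregation: average only the
// components a case actually defines. Component names are assumptions;
// see src/eval/scoring.ts for the real implementation.

type ComponentScores = {
  rootCause?: number;        // rootCause / rootCauseKeywords match, 0..1
  affectedServices?: number; // expected-service coverage, 0..1
  confidence?: number;       // 1 if the confidenceAtLeast floor is met, else 0
  phrases?: number;          // required/forbidden phrase compliance, 0..1
};

function overallScore(scores: ComponentScores): number {
  // Keep only components that were actually scored for this case.
  const values = Object.values(scores).filter(
    (v): v is number => typeof v === "number"
  );
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```

Under this scheme, a case scoring 1.0, 0.5, and 1.0 on three components averages to roughly 0.83, clearing the default 0.7 pass threshold.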
- Build 30-100 replay cases from postmortems.
- Establish a baseline pass rate on current `main`.
- Add a CI gate for regressions:
  - fail if pass rate drops >5%
  - fail if safety phrase compliance drops below 0.98
- Add weekly shadow evaluation against real incidents.
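A minimal sketch of such a CI gate, assuming the report exposes an aggregate pass rate and a safety phrase compliance ratio (the field names here are assumptions, not the runner's actual report schema):

```typescript
// Hypothetical CI gate for the thresholds above: fail on a >5% pass-rate
// drop versus baseline, or on phrase compliance below 0.98.
// Report field names are assumed, not the runner's actual schema.

type GateReport = {
  passRate: number;         // 0..1 aggregate pass rate
  phraseCompliance: number; // 0..1 safety phrase compliance
};

// Returns human-readable failure reasons; an empty array means the gate passes.
function evalGate(baseline: GateReport, current: GateReport): string[] {
  const failures: string[] = [];
  if (baseline.passRate - current.passRate > 0.05) {
    failures.push(
      `pass rate dropped from ${baseline.passRate} to ${current.passRate}`
    );
  }
  if (current.phraseCompliance < 0.98) {
    failures.push(`phrase compliance ${current.phraseCompliance} below 0.98`);
  }
  return failures;
}
```

CI would run a check like this against the baseline report from `main` and fail the job when any reasons are returned.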
Run the lightweight offline smoke profile (the same profile used by `.github/workflows/eval-smoke.yml`):

```shell
npm run eval:smoke -- --out-dir .runbook/evals/ci-smoke
```

Notes:
- Uses offline scoring only (`--offline`), so no live model credentials are required.
- Runs a small subset (`rcaeval`, `tracerca`, `--limit 1`) for deterministic CI runtime.
- The nightly schedule is configured in GitHub Actions via the `Eval Smoke` workflow.
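To illustrate the offline mode, here is a hypothetical scorer that grades a fixture's `mockResult` text against the expected fields from the fixture schema, with no live model or tool calls. This is a sketch only; the real offline scoring path is in `src/eval/scoring.ts`.

```typescript
// Hypothetical offline scorer: fraction of expected keywords/phrases found
// in the fixture's mockResult text, zeroed if any forbidden phrase appears.
// Field names follow the fixture schema shown earlier in this document.

type Expected = {
  rootCauseKeywords?: string[];
  requiredPhrases?: string[];
  forbiddenPhrases?: string[];
};

function offlineKeywordScore(mockResult: string, expected: Expected): number {
  const text = mockResult.toLowerCase();
  const hit = (p: string) => text.includes(p.toLowerCase());

  const forbidden = expected.forbiddenPhrases ?? [];
  if (forbidden.some(hit)) return 0; // any forbidden phrase fails outright

  const checks = [
    ...(expected.rootCauseKeywords ?? []),
    ...(expected.requiredPhrases ?? []),
  ];
  if (checks.length === 0) return 1;
  return checks.filter(hit).length / checks.length;
}
```

With `--include-mock-result` fixtures, this style of check keeps CI runs deterministic and credential-free.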
Convert RCAEval-style rows into Runbook fixtures:

```shell
npm run eval:convert:rcaeval -- \
  --input examples/evals/rcaeval-input.sample.json \
  --out examples/evals/rcaeval-fixtures.generated.json
```

Generate fixtures with synthetic `mockResult` values for offline scoring:

```shell
npm run eval:convert:rcaeval -- \
  --input examples/evals/rcaeval-input.sample.json \
  --out examples/evals/rcaeval-fixtures.generated.json \
  --include-mock-result
```

Then run:

```shell
npm run eval:investigate -- \
  --fixtures examples/evals/rcaeval-fixtures.generated.json \
  --offline
```

Run all benchmark adapters (RCAEval, Rootly, TraceRCA) with per-benchmark reports:
```shell
npm run eval:all -- \
  --out-dir .runbook/evals/all-benchmarks \
  --limit 5
```

`eval:all` automatically runs dataset bootstrap first: it attempts to clone the required public repositories into `examples/evals/datasets/` and then continues the benchmark run.
Manual bootstrap only:

```shell
npm run eval:setup -- --datasets rcaeval,rootly,tracerca
```

Useful options:
- `--offline`: run benchmark scoring from fixture `mockResult` where available
- `--no-setup`: skip automatic dataset bootstrap
- `--benchmarks rcaeval,rootly,tracerca`: run selected benchmarks only
- `--rcaeval-input <path>`: custom RCAEval source file
- `--tracerca-input <path>`: TraceRCA source file (`.json`/`.jsonl`/`.csv`/`.tsv`)
- `--rootly-limit-per-dataset <n>`: limit generated cases per Rootly log source
Outputs:
- `.runbook/evals/all-benchmarks/rcaeval-report.json`
- `.runbook/evals/all-benchmarks/rootly-report.json`
- `.runbook/evals/all-benchmarks/tracerca-report.json`
- `.runbook/evals/all-benchmarks/summary.json`
- `.runbook/evals/all-benchmarks/dataset-setup.json`
- `.runbook/evals/all-benchmarks/dataset-setup.log`
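The per-benchmark reports feed `summary.json`. A hypothetical sketch of that aggregation, with assumed field names (the actual summary schema may differ):

```typescript
// Hypothetical aggregation of per-benchmark reports into a summary.
// The report shape (benchmark, cases, passed) is an assumption, not the
// runner's actual schema; see src/eval/run-all-benchmarks.ts.

type BenchmarkReport = { benchmark: string; cases: number; passed: number };

function summarize(reports: BenchmarkReport[]) {
  const totalCases = reports.reduce((n, r) => n + r.cases, 0);
  const totalPassed = reports.reduce((n, r) => n + r.passed, 0);
  return {
    // Per-benchmark pass rates, guarding against empty benchmarks.
    benchmarks: reports.map((r) => ({
      benchmark: r.benchmark,
      passRate: r.cases === 0 ? 0 : r.passed / r.cases,
    })),
    // Overall pass rate weighted by case count.
    overallPassRate: totalCases === 0 ? 0 : totalPassed / totalCases,
  };
}
```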
- RCAEval benchmark (RE1/RE2/RE3): https://github.com/phamquiluan/RCAEval
- RCAEval dataset DOI (Zenodo): https://doi.org/10.5281/zenodo.14590730
- TraceRCA dataset/code (trace-based RCA on TrainTicket): https://github.com/NetManAIOps/TraceRCA
- AIOps Challenge 2020 dataset (metrics + traces + fault annotations): https://github.com/NetManAIOps/AIOps-Challenge-2020-Data
- Nezha multimodal RCA dataset repo (OnlineBoutique + TrainTicket labels): https://github.com/IntelligentDDS/Nezha
- Rootly open logs dataset (incident log analysis supplement): https://github.com/Rootly-AI-Labs/logs-dataset
- This harness exercises the structured investigation orchestrator path.
- It requires the same provider credentials as normal runtime usage.
- Keep fixture context time-bounded to what was known at incident time.