This document defines a practical benchmark loop for the runbook investigate agent.
- Measure root-cause quality and safety over time.
- Catch regressions before shipping prompt/tool/runtime changes.
- Support both offline replay and shadow-mode evaluation.
- Runner: `src/eval/investigation-benchmark.ts`
- Scoring: `src/eval/scoring.ts`
- Dataset bootstrap: `src/eval/setup-datasets.ts`
- RCAEval converter: `src/eval/rcaeval-to-fixtures.ts`
- Rootly logs converter: `src/eval/rootly-logs-to-fixtures.ts`
- TraceRCA converter: `src/eval/tracerca-to-fixtures.ts`
- Unified benchmark runner: `src/eval/run-all-benchmarks.ts`
- Sample fixtures: `examples/evals/investigation-fixtures.sample.json`
- RCAEval input sample: `examples/evals/rcaeval-input.sample.json`
```json
{
  "version": "1.0",
  "passThreshold": 0.7,
  "cases": [
    {
      "id": "case-id",
      "incidentId": "PD-123",
      "query": "Investigate incident PD-123",
      "context": "Additional logs or timeline",
      "tags": ["redis", "latency"],
      "expected": {
        "rootCause": "optional exact phrase",
        "rootCauseKeywords": ["redis", "connection pool"],
        "affectedServices": ["checkout-api", "redis"],
        "confidenceAtLeast": "medium",
        "requiredPhrases": ["evidence"],
        "forbiddenPhrases": ["drop database"]
      },
      "execute": {
        "maxIterations": 6,
        "autoRemediate": false
      }
    }
  ]
}
```

Run the benchmark with:

```shell
npm run eval:investigate -- --fixtures examples/evals/investigation-fixtures.sample.json
```

Optional flags:
- `--out <path>`: output report JSON path
- `--limit <n>`: run first N cases
- `--offline`: score from fixture `mockResult` fields (no live model/tool execution)
Example:

```shell
npm run eval:investigate -- \
  --fixtures examples/evals/investigation-fixtures.sample.json \
  --out .runbook/evals/nightly.json \
  --limit 20
```

The runner writes a JSON report with:
- case-level scores
- pass/fail by threshold
- event counts (phase changes, hypotheses, queries, evaluations)
- aggregate pass rate and average score
The current overall score is an average of the available components:
- root-cause correctness (`rootCause` / `rootCauseKeywords`)
- affected-service coverage (`affectedServices`)
- confidence floor (`confidenceAtLeast`)
- phrase compliance (`requiredPhrases`, `forbiddenPhrases`)
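As an illustration, the component averaging can be sketched as below. This is a simplified sketch with assumed component names; the actual logic lives in `src/eval/scoring.ts` and may differ.

```typescript
// Hypothetical sketch of the overall-score aggregation: average only the
// components a case actually defines. Component names are assumptions;
// see src/eval/scoring.ts for the real implementation.

type ComponentScores = {
  rootCause?: number;        // rootCause / rootCauseKeywords match, 0..1
  affectedServices?: number; // expected-service coverage, 0..1
  confidence?: number;       // 1 if the confidenceAtLeast floor is met, else 0
  phrases?: number;          // required/forbidden phrase compliance, 0..1
};

function overallScore(scores: ComponentScores): number {
  // Keep only components that were actually scored for this case.
  const values = Object.values(scores).filter(
    (v): v is number => typeof v === "number"
  );
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```

Under this scheme, a case scoring 1.0, 0.5, and 1.0 on three components averages to roughly 0.83, clearing the default 0.7 pass threshold.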
- Build 30-100 replay cases from postmortems.
- Establish a baseline pass rate on current `main`.
- Add a CI gate for regressions:
  - fail if pass rate drops >5%
  - fail if safety phrase compliance drops below 0.98
- Add weekly shadow evaluation against real incidents.
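A minimal sketch of such a CI gate, assuming the report exposes an aggregate pass rate and a safety phrase compliance ratio (the field names here are assumptions, not the runner's actual report schema):

```typescript
// Hypothetical CI gate for the thresholds above: fail on a >5% pass-rate
// drop versus baseline, or on phrase compliance below 0.98.
// Report field names are assumed, not the runner's actual schema.

type GateReport = {
  passRate: number;         // 0..1 aggregate pass rate
  phraseCompliance: number; // 0..1 safety phrase compliance
};

// Returns human-readable failure reasons; an empty array means the gate passes.
function evalGate(baseline: GateReport, current: GateReport): string[] {
  const failures: string[] = [];
  if (baseline.passRate - current.passRate > 0.05) {
    failures.push(
      `pass rate dropped from ${baseline.passRate} to ${current.passRate}`
    );
  }
  if (current.phraseCompliance < 0.98) {
    failures.push(`phrase compliance ${current.phraseCompliance} below 0.98`);
  }
  return failures;
}
```

CI would run a check like this against the baseline report from `main` and fail the job when any reasons are returned.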
Run the lightweight offline smoke profile (the same profile used by `.github/workflows/eval-smoke.yml`):

```shell
npm run eval:smoke -- --out-dir .runbook/evals/ci-smoke
```

Notes:
- Uses offline scoring only (`--offline`), so no live model credentials are required.
- Runs a small subset (`rcaeval`, `tracerca`, `--limit 1`) for deterministic CI runtime.
- The nightly schedule is configured in GitHub Actions via the `Eval Smoke` workflow.
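To illustrate the offline mode, here is a hypothetical scorer that grades a fixture's `mockResult` text against the expected fields from the fixture schema, with no live model or tool calls. This is a sketch only; the real offline scoring path is in `src/eval/scoring.ts`.

```typescript
// Hypothetical offline scorer: fraction of expected keywords/phrases found
// in the fixture's mockResult text, zeroed if any forbidden phrase appears.
// Field names follow the fixture schema shown earlier in this document.

type Expected = {
  rootCauseKeywords?: string[];
  requiredPhrases?: string[];
  forbiddenPhrases?: string[];
};

function offlineKeywordScore(mockResult: string, expected: Expected): number {
  const text = mockResult.toLowerCase();
  const hit = (p: string) => text.includes(p.toLowerCase());

  const forbidden = expected.forbiddenPhrases ?? [];
  if (forbidden.some(hit)) return 0; // any forbidden phrase fails outright

  const checks = [
    ...(expected.rootCauseKeywords ?? []),
    ...(expected.requiredPhrases ?? []),
  ];
  if (checks.length === 0) return 1;
  return checks.filter(hit).length / checks.length;
}
```

With `--include-mock-result` fixtures, this style of check keeps CI runs deterministic and credential-free.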
Convert RCAEval-style rows into Runbook fixtures:

```shell
npm run eval:convert:rcaeval -- \
  --input examples/evals/rcaeval-input.sample.json \
  --out examples/evals/rcaeval-fixtures.generated.json
```

Generate fixtures with synthetic `mockResult` values for offline scoring:

```shell
npm run eval:convert:rcaeval -- \
  --input examples/evals/rcaeval-input.sample.json \
  --out examples/evals/rcaeval-fixtures.generated.json \
  --include-mock-result
```

Then run:

```shell
npm run eval:investigate -- \
  --fixtures examples/evals/rcaeval-fixtures.generated.json \
  --offline
```

Run all benchmark adapters (RCAEval, Rootly, TraceRCA) with per-benchmark reports:
```shell
npm run eval:all -- \
  --out-dir .runbook/evals/all-benchmarks \
  --limit 5
```

`eval:all` automatically runs dataset bootstrap first: it attempts to clone the required public repositories into `examples/evals/datasets/` and then continues the benchmark run.
Manual bootstrap only:

```shell
npm run eval:setup -- --datasets rcaeval,rootly,tracerca
```

Useful options:
- `--offline`: run benchmark scoring from fixture `mockResult` where available
- `--no-setup`: skip automatic dataset bootstrap
- `--benchmarks rcaeval,rootly,tracerca`: run selected benchmarks only
- `--rcaeval-input <path>`: custom RCAEval source file
- `--tracerca-input <path>`: TraceRCA source file (`.json`/`.jsonl`/`.csv`/`.tsv`)
- `--rootly-limit-per-dataset <n>`: limit generated cases per Rootly log source
Outputs:
- `.runbook/evals/all-benchmarks/rcaeval-report.json`
- `.runbook/evals/all-benchmarks/rootly-report.json`
- `.runbook/evals/all-benchmarks/tracerca-report.json`
- `.runbook/evals/all-benchmarks/summary.json`
- `.runbook/evals/all-benchmarks/dataset-setup.json`
- `.runbook/evals/all-benchmarks/dataset-setup.log`
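The per-benchmark reports feed `summary.json`. A hypothetical sketch of that aggregation, with assumed field names (the actual summary schema may differ):

```typescript
// Hypothetical aggregation of per-benchmark reports into a summary.
// The report shape (benchmark, cases, passed) is an assumption, not the
// runner's actual schema; see src/eval/run-all-benchmarks.ts.

type BenchmarkReport = { benchmark: string; cases: number; passed: number };

function summarize(reports: BenchmarkReport[]) {
  const totalCases = reports.reduce((n, r) => n + r.cases, 0);
  const totalPassed = reports.reduce((n, r) => n + r.passed, 0);
  return {
    // Per-benchmark pass rates, guarding against empty benchmarks.
    benchmarks: reports.map((r) => ({
      benchmark: r.benchmark,
      passRate: r.cases === 0 ? 0 : r.passed / r.cases,
    })),
    // Overall pass rate weighted by case count.
    overallPassRate: totalCases === 0 ? 0 : totalPassed / totalCases,
  };
}
```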
- RCAEval benchmark (RE1/RE2/RE3): https://github.com/phamquiluan/RCAEval
- RCAEval dataset DOI (Zenodo): https://doi.org/10.5281/zenodo.14590730
- TraceRCA dataset/code (trace-based RCA on TrainTicket): https://github.com/NetManAIOps/TraceRCA
- AIOps Challenge 2020 dataset (metrics + traces + fault annotations): https://github.com/NetManAIOps/AIOps-Challenge-2020-Data
- Nezha multimodal RCA dataset repo (OnlineBoutique + TrainTicket labels): https://github.com/IntelligentDDS/Nezha
- Rootly open logs dataset (incident log analysis supplement): https://github.com/Rootly-AI-Labs/logs-dataset
- This harness exercises the structured investigation orchestrator path.
- It requires the same provider credentials as normal runtime usage.
- Keep fixture context time-bounded to what was known at incident time.