Investigate Agent Evaluation Harness

This document defines a practical benchmark loop for the runbook investigate agent.

Goals

  • Measure root-cause quality and safety over time.
  • Catch regressions before shipping prompt/tool/runtime changes.
  • Support both offline replay and shadow-mode evaluation.

Harness components

  • Runner: src/eval/investigation-benchmark.ts
  • Scoring: src/eval/scoring.ts
  • Dataset bootstrap: src/eval/setup-datasets.ts
  • RCAEval converter: src/eval/rcaeval-to-fixtures.ts
  • Rootly logs converter: src/eval/rootly-logs-to-fixtures.ts
  • TraceRCA converter: src/eval/tracerca-to-fixtures.ts
  • Unified benchmark runner: src/eval/run-all-benchmarks.ts
  • Sample fixtures: examples/evals/investigation-fixtures.sample.json
  • RCAEval input sample: examples/evals/rcaeval-input.sample.json

Fixture format

{
  "version": "1.0",
  "passThreshold": 0.7,
  "cases": [
    {
      "id": "case-id",
      "incidentId": "PD-123",
      "query": "Investigate incident PD-123",
      "context": "Additional logs or timeline",
      "tags": ["redis", "latency"],
      "expected": {
        "rootCause": "optional exact phrase",
        "rootCauseKeywords": ["redis", "connection pool"],
        "affectedServices": ["checkout-api", "redis"],
        "confidenceAtLeast": "medium",
        "requiredPhrases": ["evidence"],
        "forbiddenPhrases": ["drop database"]
      },
      "execute": {
        "maxIterations": 6,
        "autoRemediate": false
      }
    }
  ]
}
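The shape above can be sketched as TypeScript types. These names are illustrative only, not the actual definitions in src/eval/; they show what a parsed fixture file is expected to contain, plus a minimal structural check before handing it to the runner.

```typescript
// Illustrative types for the fixture file format shown above.
// These are assumptions mirroring the JSON sample, not the real src/eval types.
type Confidence = "low" | "medium" | "high";

interface ExpectedOutcome {
  rootCause?: string;
  rootCauseKeywords?: string[];
  affectedServices?: string[];
  confidenceAtLeast?: Confidence;
  requiredPhrases?: string[];
  forbiddenPhrases?: string[];
}

interface FixtureCase {
  id: string;
  incidentId: string;
  query: string;
  context?: string;
  tags?: string[];
  expected: ExpectedOutcome;
  execute?: { maxIterations?: number; autoRemediate?: boolean };
}

interface FixtureFile {
  version: string;
  passThreshold: number; // overall score a case must reach to pass
  cases: FixtureCase[];
}

// Minimal structural check for a parsed fixture file.
function isFixtureFile(value: unknown): value is FixtureFile {
  const v = value as FixtureFile;
  return (
    typeof v === "object" &&
    v !== null &&
    typeof v.version === "string" &&
    typeof v.passThreshold === "number" &&
    Array.isArray(v.cases) &&
    v.cases.every((c) => typeof c.id === "string" && typeof c.query === "string")
  );
}
```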

Run benchmark

npm run eval:investigate -- --fixtures examples/evals/investigation-fixtures.sample.json

Optional flags:

  • --out <path>: output report JSON path
  • --limit <n>: run first N cases
  • --offline: score from fixture mockResult fields (no live model/tool execution)

Example:

npm run eval:investigate -- \
  --fixtures examples/evals/investigation-fixtures.sample.json \
  --out .runbook/evals/nightly.json \
  --limit 20

Output report

The runner writes a JSON report with:

  • case-level scores
  • pass/fail by threshold
  • event counts (phase changes, hypotheses, queries, evaluations)
  • aggregate pass rate and average score
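The aggregate fields can be computed from the case-level scores as below. This is a sketch with assumed field names; the actual report schema is produced by src/eval/investigation-benchmark.ts.

```typescript
// Sketch of the report aggregates: pass rate and average score over cases.
// Field names are illustrative assumptions, not the runner's real schema.
interface CaseResult {
  id: string;
  score: number;   // 0..1 overall score for the case
  passed: boolean; // score >= passThreshold
}

function aggregate(results: CaseResult[]): { passRate: number; averageScore: number } {
  if (results.length === 0) return { passRate: 0, averageScore: 0 };
  const passed = results.filter((r) => r.passed).length;
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return {
    passRate: passed / results.length,
    averageScore: total / results.length,
  };
}
```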

Scoring model (draft)

The overall score is currently the unweighted average of whichever of these components the fixture defines:

  • root-cause correctness (rootCause / rootCauseKeywords)
  • affected-service coverage (affectedServices)
  • confidence floor (confidenceAtLeast)
  • phrase compliance (requiredPhrases, forbiddenPhrases)
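The averaging rule can be sketched as follows: only components the fixture actually defines contribute to the mean. Component names here are illustrative; the real implementation lives in src/eval/scoring.ts.

```typescript
// Sketch of the draft scoring rule: average only the components present.
// Names are assumptions mirroring the list above, not the real scoring.ts API.
function overallScore(components: {
  rootCause?: number;        // root-cause correctness, 0..1
  serviceCoverage?: number;  // affected-service coverage, 0..1
  confidenceFloor?: number;  // 1 if confidence meets the floor, else 0
  phraseCompliance?: number; // required/forbidden phrase compliance, 0..1
}): number {
  const present = Object.values(components).filter(
    (v): v is number => typeof v === "number",
  );
  if (present.length === 0) return 0;
  return present.reduce((sum, v) => sum + v, 0) / present.length;
}
```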

Recommended rollout

  1. Build 30-100 replay cases from postmortems.
  2. Establish baseline pass rate on current main.
  3. Add CI gate for regressions:
    • fail if pass rate drops >5%
    • fail if safety phrase compliance drops below 0.98
  4. Add weekly shadow evaluation against real incidents.
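The CI gate in step 3 can be sketched as a comparison between a recorded baseline run and the candidate run. The thresholds come straight from the rollout plan above; the function and field names are illustrative (step 3's "drops >5%" is read here as 5 percentage points).

```typescript
// Sketch of the step-3 CI regression gate. Names are assumptions.
interface RunStats {
  passRate: number;         // 0..1 aggregate pass rate
  phraseCompliance: number; // 0..1 safety phrase compliance
}

function gate(baseline: RunStats, candidate: RunStats): string[] {
  const failures: string[] = [];
  // fail if pass rate drops by more than 5 percentage points
  if (baseline.passRate - candidate.passRate > 0.05) {
    failures.push("pass rate regressed by more than 5%");
  }
  // fail if safety phrase compliance drops below 0.98
  if (candidate.phraseCompliance < 0.98) {
    failures.push("safety phrase compliance below 0.98");
  }
  return failures;
}
```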

CI smoke lane

Run the lightweight offline smoke profile (same profile used by .github/workflows/eval-smoke.yml):

npm run eval:smoke -- --out-dir .runbook/evals/ci-smoke

Notes:

  • Uses offline scoring only (--offline) so no live model credentials are required.
  • Runs a small subset (rcaeval, tracerca with --limit 1) for deterministic CI runtime.
  • Nightly schedule is configured in GitHub Actions via Eval Smoke.

RCAEval adapter workflow

Convert RCAEval-style rows into Runbook fixtures:

npm run eval:convert:rcaeval -- \
  --input examples/evals/rcaeval-input.sample.json \
  --out examples/evals/rcaeval-fixtures.generated.json

Generate fixtures with synthetic mockResult values for offline scoring:

npm run eval:convert:rcaeval -- \
  --input examples/evals/rcaeval-input.sample.json \
  --out examples/evals/rcaeval-fixtures.generated.json \
  --include-mock-result

Then run:

npm run eval:investigate -- \
  --fixtures examples/evals/rcaeval-fixtures.generated.json \
  --offline
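In offline mode each case is scored directly from its mockResult, so no model or tools run. A minimal illustrative fragment of what --include-mock-result generates is shown below; the field names are assumptions mirroring the expected block, not the converter's exact output.

```json
{
  "mockResult": {
    "rootCause": "redis connection pool exhaustion",
    "affectedServices": ["checkout-api", "redis"],
    "confidence": "high",
    "summary": "Evidence: connection pool saturation in redis metrics"
  }
}
```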

Unified benchmark run

Run all benchmark adapters (RCAEval, Rootly, TraceRCA) with per-benchmark reports:

npm run eval:all -- \
  --out-dir .runbook/evals/all-benchmarks \
  --limit 5

eval:all automatically runs the dataset bootstrap first: it attempts to clone the required public repositories into examples/evals/datasets/, then continues with the benchmark run.

Manual bootstrap only:

npm run eval:setup -- --datasets rcaeval,rootly,tracerca

Useful options:

  • --offline: run benchmark scoring from fixture mockResult where available
  • --no-setup: skip automatic dataset bootstrap
  • --benchmarks rcaeval,rootly,tracerca: run selected benchmarks only
  • --rcaeval-input <path>: custom RCAEval source file
  • --tracerca-input <path>: TraceRCA source file (.json/.jsonl/.csv/.tsv)
  • --rootly-limit-per-dataset <n>: limit generated cases per Rootly log source

Outputs:

  • .runbook/evals/all-benchmarks/rcaeval-report.json
  • .runbook/evals/all-benchmarks/rootly-report.json
  • .runbook/evals/all-benchmarks/tracerca-report.json
  • .runbook/evals/all-benchmarks/summary.json
  • .runbook/evals/all-benchmarks/dataset-setup.json
  • .runbook/evals/all-benchmarks/dataset-setup.log

Open datasets (recommended)

The dataset bootstrap covers three public sources: RCAEval, Rootly incident logs, and TraceRCA (cloned into examples/evals/datasets/ via npm run eval:setup above).

Notes

  • This harness exercises the structured investigation orchestrator path.
  • Live runs require the same provider credentials as normal runtime usage; --offline runs do not.
  • Keep fixture context time-bounded to what was known at incident time.