Summary
Add a test harness that validates agent behavior through repeatable scenarios — not just code correctness (unit tests) but whether the agent produces useful, accurate output for given inputs.
Motivation
Agent quality depends on prompt engineering, skill design, and tool orchestration, none of which go test covers. Frameworks like Google ADK treat evaluation as a first-class concern. DocsClaw should provide its own eval framework rather than depend on an external one.
Proposed design
An eval scenario is a YAML file describing input, expected behavior, and pass criteria:
name: executive-summary
description: Agent should produce a structured summary from a report
config-dir: testdata/executive-assistant
input: |
  Summarize this Q1 report: Revenue $4.2M, up 12%.
  Key driver: Enterprise segment grew 25%.
  Risk: Two large renewals pending in Q2.
expect:
  contains:
    - "revenue"
    - "enterprise"
    - "risk"
  max_length: 500
  has_sections:
    - "Summary"
    - "Key decisions"
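To make the pass criteria concrete, here is a minimal Go sketch of how the expect block could be evaluated against an agent's output. The Expect type, the Check function, and the matching rules are all assumptions for illustration: contains is treated as a case-insensitive substring match, max_length as a character count, and has_sections as a heading at the start of a line. The proposal does not pin down any of these. YAML parsing is left out so the example needs only the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Expect mirrors the expect block of a scenario file.
// Field semantics are assumptions of this sketch, not a settled spec.
type Expect struct {
	Contains    []string // substrings that must appear (case-insensitive here)
	MaxLength   int      // upper bound in characters; 0 means no limit
	HasSections []string // headings that must start some line of the output
}

// Check returns one message per failed criterion; an empty result means pass.
func Check(output string, e Expect) []string {
	var failures []string
	lower := strings.ToLower(output)

	for _, want := range e.Contains {
		if !strings.Contains(lower, strings.ToLower(want)) {
			failures = append(failures, fmt.Sprintf("contains: missing %q", want))
		}
	}

	if n := utf8.RuneCountInString(output); e.MaxLength > 0 && n > e.MaxLength {
		failures = append(failures, fmt.Sprintf("max_length: got %d, limit %d", n, e.MaxLength))
	}

	for _, section := range e.HasSections {
		if !hasSection(output, section) {
			failures = append(failures, fmt.Sprintf("has_sections: missing %q", section))
		}
	}
	return failures
}

// hasSection reports whether any line, after stripping leading '#' characters
// and surrounding whitespace, starts with the title (case-insensitive).
func hasSection(output, title string) bool {
	for _, line := range strings.Split(output, "\n") {
		heading := strings.TrimSpace(strings.TrimLeft(strings.TrimSpace(line), "#"))
		if strings.HasPrefix(strings.ToLower(heading), strings.ToLower(title)) {
			return true
		}
	}
	return false
}

func main() {
	e := Expect{
		Contains:    []string{"revenue", "enterprise", "risk"},
		MaxLength:   500,
		HasSections: []string{"Summary", "Key decisions"},
	}
	// Hand-written stand-in for agent output, not a real model response.
	output := "## Summary\nRevenue rose 12%, led by Enterprise. Risk: two renewals.\n## Key decisions\nNone yet."
	for _, f := range Check(output, e) {
		fmt.Println("FAIL", f)
	}
}
```

Returning every failed criterion, instead of stopping at the first, would let a report show all the ways a scenario regressed in a single run.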
A docsclaw eval subcommand runs scenarios against a live or mocked LLM:
docsclaw eval scenarios/ # run all scenarios
docsclaw eval scenarios/hr/ # run HR-specific scenarios
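The directory argument suggests that scenarios are discovered recursively, so that passing a subdirectory narrows the run. A minimal sketch of that discovery step, assuming every .yaml or .yml file under the given root is a scenario (the helper names findScenarios and isScenarioFile are hypothetical, not existing DocsClaw code):

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// isScenarioFile reports whether a path looks like a scenario file.
// Treating every .yaml/.yml file as a scenario is an assumption of this sketch.
func isScenarioFile(path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	return ext == ".yaml" || ext == ".yml"
}

// findScenarios walks root recursively and returns the scenario paths.
// filepath.WalkDir visits entries in lexical order, so the result is stable.
func findScenarios(root string) ([]string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // e.g. root does not exist or is unreadable
		}
		if !d.IsDir() && isScenarioFile(path) {
			paths = append(paths, path)
		}
		return nil
	})
	return paths, err
}

func main() {
	root := "." // default to the current directory when no argument is given
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	paths, err := findScenarios(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

A stable ordering keeps run logs comparable between CI runs, which matters once results are diffed or tracked over time.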
Key features
- Deterministic mode: fixed seed / temperature 0 for reproducibility
- Golden file comparison: save a known-good output, diff against it
- Scoring rubric: LLM-as-judge for subjective quality (optional)
- CI integration: exit code reflects pass/fail for pipeline gating
- Cost tracking: report token usage per scenario
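Three of these features (golden file comparison, CI exit codes, and cost tracking) can be sketched together. The Result type, firstDiff, and summarize below are hypothetical names, and the comparison rule (line by line, ignoring trailing whitespace) is an assumption; an actual implementation might prefer a full diff or a fuzzier match.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Result is a hypothetical per-scenario record; the field names are
// assumptions of this sketch.
type Result struct {
	Name   string
	Passed bool
	Tokens int // token usage reported by the LLM client for this scenario
}

// firstDiff compares output against a golden text line by line, ignoring
// trailing whitespace. It returns the 1-based number of the first
// mismatching line, or 0 when the texts match.
func firstDiff(golden, got string) int {
	g := strings.Split(strings.TrimRight(golden, " \t\r\n"), "\n")
	o := strings.Split(strings.TrimRight(got, " \t\r\n"), "\n")
	for i := 0; i < len(g) || i < len(o); i++ {
		if i >= len(g) || i >= len(o) {
			return i + 1 // one text has extra lines
		}
		if strings.TrimRight(g[i], " \t\r") != strings.TrimRight(o[i], " \t\r") {
			return i + 1
		}
	}
	return 0
}

// summarize counts failed scenarios and adds up token usage.
func summarize(results []Result) (failed, tokens int) {
	for _, r := range results {
		if !r.Passed {
			failed++
		}
		tokens += r.Tokens
	}
	return failed, tokens
}

func main() {
	// Placeholder texts and token count, for illustration only.
	golden := "Summary\nRevenue up 12%.\n"
	got := "Summary\nRevenue up 12%.\n"
	results := []Result{
		{Name: "executive-summary", Passed: firstDiff(golden, got) == 0, Tokens: 100},
	}
	failed, tokens := summarize(results)
	fmt.Printf("%d scenario(s), %d failed, %d tokens\n", len(results), failed, tokens)
	if failed > 0 {
		os.Exit(1) // non-zero exit lets a CI pipeline gate on eval results
	}
}
```

Exact golden comparison is only meaningful when output is reproducible, so it pairs naturally with deterministic mode or a mocked LLM.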
Related
docs/demo/ could become the first eval cases