feat: add evaluation harness for agent behavior testing #35

@pavelanni

Summary

Add a test harness that validates agent behavior through repeatable scenarios — not just code correctness (unit tests) but whether the agent produces useful, accurate output for given inputs.

Motivation

Agent quality depends on prompt engineering, skill design, and tool orchestration, none of which go test covers. Frameworks like Google ADK treat evaluation as a first-class concern. DocsClaw should ship its own eval harness rather than depend on an external framework.

Proposed design

An eval scenario is a YAML file describing input, expected behavior, and pass criteria:

name: executive-summary
description: Agent should produce a structured summary from a report
config-dir: testdata/executive-assistant
input: |
  Summarize this Q1 report: Revenue $4.2M, up 12%.
  Key driver: Enterprise segment grew 25%.
  Risk: Two large renewals pending in Q2.
expect:
  contains:
    - "revenue"
    - "enterprise"
    - "risk"
  max_length: 500
  has_sections:
    - "Summary"
    - "Key decisions"
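
As a rough illustration of how the expect block could be evaluated, here is a minimal Go sketch. The Expect struct and the Check and hasHeading functions are hypothetical names, not existing DocsClaw code. It also makes three assumptions the scenario format does not spell out: contains matches case-insensitively (the example expects "revenue" while the input says "Revenue"), max_length counts characters, and a section is a line whose text, after stripping leading # characters, equals the section name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Expect mirrors the `expect` block of a scenario file. The yaml tags follow
// the example scenario; the struct itself is a hypothetical sketch.
type Expect struct {
	Contains    []string `yaml:"contains"`
	MaxLength   int      `yaml:"max_length"`
	HasSections []string `yaml:"has_sections"`
}

// Check returns one message per failed criterion; an empty result means pass.
// Assumptions: case-insensitive substring match, max_length in characters,
// and MaxLength == 0 disables the length check.
func Check(output string, e Expect) []string {
	var failures []string
	lower := strings.ToLower(output)
	for _, want := range e.Contains {
		if !strings.Contains(lower, strings.ToLower(want)) {
			failures = append(failures, fmt.Sprintf("missing substring %q", want))
		}
	}
	if e.MaxLength > 0 {
		if n := utf8.RuneCountInString(output); n > e.MaxLength {
			failures = append(failures, fmt.Sprintf("length %d exceeds max_length %d", n, e.MaxLength))
		}
	}
	for _, section := range e.HasSections {
		if !hasHeading(output, section) {
			failures = append(failures, fmt.Sprintf("missing section %q", section))
		}
	}
	return failures
}

// hasHeading reports whether any line, with leading '#' markers and
// surrounding whitespace removed, equals name (ignoring case).
func hasHeading(output, name string) bool {
	for _, line := range strings.Split(output, "\n") {
		title := strings.TrimSpace(strings.TrimLeft(strings.TrimSpace(line), "#"))
		if strings.EqualFold(title, name) {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative agent output, not real model output.
	output := "# Summary\nRevenue grew 12% on Enterprise strength.\n\n# Key decisions\nWatch the renewal risk in Q2."
	e := Expect{
		Contains:    []string{"revenue", "enterprise", "risk"},
		MaxLength:   500,
		HasSections: []string{"Summary", "Key decisions"},
	}
	for _, f := range Check(output, e) {
		fmt.Println("FAIL:", f)
	}
}
```

Returning a list of failures rather than a single boolean lets the harness report every unmet criterion in one run.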

A docsclaw eval subcommand runs scenarios against a live or mocked LLM:

docsclaw eval scenarios/       # run all scenarios
docsclaw eval scenarios/hr/    # run HR-specific scenarios

Key features

  • Deterministic mode: fixed seed / temperature 0 for reproducibility
  • Golden file comparison: save a known-good output, diff against it
  • Scoring rubric: LLM-as-judge for subjective quality (optional)
  • CI integration: exit code reflects pass/fail for pipeline gating
  • Cost tracking: report token usage per scenario
