feat: add evaluation harness for agent behavior testing

## Summary

Add a test harness that validates agent behavior through repeatable scenarios. Unit tests cover code correctness; the harness checks whether the agent produces useful, accurate output for given inputs.

## Motivation

Agent quality depends on prompt engineering, skill design, and tool orchestration, none of which `go test` covers. Frameworks like Google ADK treat evaluation as a first-class concern. DocsClaw should have its own eval framework rather than depend on an external one.

## Proposed design

An eval scenario is a YAML file describing input, expected behavior, and pass criteria:

```yaml
name: executive-summary
description: Agent should produce a structured summary from a report
config-dir: testdata/executive-assistant
input: |
  Summarize this Q1 report: Revenue $4.2M, up 12%.
  Key driver: Enterprise segment grew 25%.
  Risk: Two large renewals pending in Q2.
expect:
  contains:
    - "revenue"
    - "enterprise"
    - "risk"
  max_length: 500
  has_sections:
    - "Summary"
    - "Key decisions"
```
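
To make the `expect` block concrete, here is a minimal sketch of how its three criteria could be checked against an agent's output. The `Expect` type, the `Check` method, and the `hasHeading` helper are hypothetical names. The semantics are assumptions rather than part of the proposal: `contains` is matched case-insensitively, `max_length` is counted in characters, and `has_sections` looks for Markdown headings.

```go
package main

import (
	"fmt"
	"strings"
)

// Expect mirrors the `expect` block of a scenario file.
// Field semantics here are assumptions, not settled design.
type Expect struct {
	Contains    []string
	MaxLength   int // assumed: limit in characters; 0 disables the check
	HasSections []string
}

// Check returns one human-readable failure per unmet criterion.
func (e Expect) Check(output string) []string {
	var failures []string
	lower := strings.ToLower(output)
	for _, want := range e.Contains {
		// Assumed: substring match, ignoring case.
		if !strings.Contains(lower, strings.ToLower(want)) {
			failures = append(failures, fmt.Sprintf("missing text %q", want))
		}
	}
	if n := len([]rune(output)); e.MaxLength > 0 && n > e.MaxLength {
		failures = append(failures, fmt.Sprintf("output is %d characters, limit is %d", n, e.MaxLength))
	}
	for _, section := range e.HasSections {
		if !hasHeading(output, section) {
			failures = append(failures, fmt.Sprintf("missing section %q", section))
		}
	}
	return failures
}

// hasHeading reports whether any line is a Markdown heading with the given title.
func hasHeading(output, title string) bool {
	for _, line := range strings.Split(output, "\n") {
		trimmed := strings.TrimSpace(line)
		if !strings.HasPrefix(trimmed, "#") {
			continue
		}
		heading := strings.TrimSpace(strings.TrimLeft(trimmed, "#"))
		if strings.EqualFold(heading, title) {
			return true
		}
	}
	return false
}

func main() {
	expect := Expect{
		Contains:    []string{"revenue", "enterprise", "risk"},
		MaxLength:   500,
		HasSections: []string{"Summary", "Key decisions"},
	}
	// Hand-written stand-in for an agent response.
	output := "## Summary\nRevenue rose 12%, led by Enterprise.\n\n## Key decisions\nReview renewal risk in Q2."
	for _, f := range expect.Check(output) {
		fmt.Println("FAIL:", f)
	}
}
```

Returning a list of failures instead of a single boolean lets a report show every unmet criterion for a scenario in one run.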

A `docsclaw eval` subcommand runs scenarios against a live or mocked LLM:

```bash
docsclaw eval scenarios/       # run all scenarios
docsclaw eval scenarios/hr/    # run HR-specific scenarios
```
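
Running a directory implies some discovery step. The sketch below assumes scenarios are `.yaml` or `.yml` files found by walking the directory recursively; `findScenarios` and `isScenarioFile` are hypothetical helpers, not existing DocsClaw code.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// isScenarioFile reports whether a path looks like a scenario file.
// Assumption: scenarios are identified by a .yaml or .yml extension.
func isScenarioFile(path string) bool {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".yaml", ".yml":
		return true
	}
	return false
}

// findScenarios returns every scenario file under root.
// filepath.WalkDir visits entries in lexical order, so the result is stable across runs.
func findScenarios(root string) ([]string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && isScenarioFile(path) {
			paths = append(paths, path)
		}
		return nil
	})
	return paths, err
}

func main() {
	// Defaults to the current directory when no argument is given.
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	paths, err := findScenarios(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, "eval:", err)
		os.Exit(2)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

With this approach, `scenarios/hr/` is an ordinary subdirectory and needs no special handling.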

## Key features

- **Deterministic mode**: fixed seed / temperature 0 for reproducibility
- **Golden file comparison**: save a known-good output, diff against it
- **Scoring rubric**: LLM-as-judge for subjective quality (optional)
- **CI integration**: exit code reflects pass/fail for pipeline gating
- **Cost tracking**: report token usage per scenario
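
Three of these features fit in a short sketch: golden file comparison, an exit code for CI gating, and per-scenario token reporting. The `Result` type, `summarize`, `matchesGolden`, and the `update` parameter are all hypothetical, and the scenario names and token counts in `main` are made-up illustration values.

```go
package main

import (
	"fmt"
	"os"
)

// Result is a hypothetical per-scenario record.
type Result struct {
	Name     string
	Failures []string
	Tokens   int // assumed: prompt plus completion tokens reported by the LLM client
}

// summarize prints one line per scenario plus a token total.
// It returns 0 if every scenario passed and 1 otherwise.
func summarize(results []Result) int {
	code, total := 0, 0
	for _, r := range results {
		status := "PASS"
		if len(r.Failures) > 0 {
			status = "FAIL"
			code = 1
		}
		total += r.Tokens
		fmt.Printf("%-4s %s (%d tokens)\n", status, r.Name, r.Tokens)
		for _, f := range r.Failures {
			fmt.Printf("     - %s\n", f)
		}
	}
	fmt.Printf("total tokens: %d\n", total)
	return code
}

// matchesGolden compares output with the golden file at path.
// When update is true it rewrites the golden file instead of comparing.
func matchesGolden(path, output string, update bool) (bool, error) {
	if update {
		return true, os.WriteFile(path, []byte(output), 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	return string(want) == output, nil
}

func main() {
	// Made-up results for illustration only.
	results := []Result{
		{Name: "executive-summary", Tokens: 412},
		{Name: "hr-policy-lookup", Failures: []string{`missing section "Summary"`}, Tokens: 388},
	}
	code := summarize(results)
	// A real command would call os.Exit(code) so the pipeline can gate on it.
	fmt.Println("exit code:", code)
}
```

Exact byte-for-byte golden comparison is only meaningful when the model output is reproducible, which is why it pairs naturally with deterministic mode.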

## Related

- Closes the gap identified in #33 (ADK evaluation)
- Scenarios in `docs/demo/` could become the first eval cases
