Roadmap: Behavioral A/B benchmark (stock vs lobotomized-CC) #11

@dividedby

Parent

PRD: release-adoption control plane — #2 (slice 5, roadmap)

What to build

The value-proving track (ADR 0002): a Behavioral A/B benchmark comparing stock CC against lobotomized-CC, with version, model, effort, and prompt held identical across the two arms. An LLM judge scores each pair (paired comparison, randomized presentation order) on the behavioral axes the Lobotomy targets, with a Correctness guardrail alongside. The result is evidence, not a gate; it feeds the Behavioral A/B field of the Adoption record.
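
A minimal sketch of how paired verdicts might roll up into the evidence described above. Everything here is an assumption for illustration: `PairResult`, `summarize`, the axis names, and the guardrail definition (a task where stock was correct and lobotomized-CC was not counts as a regression) are hypothetical and not taken from bench or tweakcc-maint.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """One judged pair on one behavioral axis (hypothetical shape)."""
    task_id: str
    axis: str                  # e.g. "verbosity"; axis names are assumptions
    winner: str                # "stock", "lobotomized", or "tie"
    stock_correct: bool
    lobotomized_correct: bool


def summarize(results):
    """Tally winners per axis and list tasks that trip the Correctness guardrail."""
    per_axis = {}
    for r in results:
        per_axis.setdefault(r.axis, Counter())[r.winner] += 1

    # Assumed guardrail: stock solved the task, lobotomized-CC did not.
    regressions = sorted(
        {r.task_id for r in results if r.stock_correct and not r.lobotomized_correct}
    )
    return {
        "per_axis": {axis: dict(counts) for axis, counts in per_axis.items()},
        "correctness_regressions": regressions,
        "guardrail_ok": not regressions,
    }
```

Because the benchmark is evidence rather than a gate, this sketch only reports `guardrail_ok`; it never raises or blocks.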

Hybrid home: ~/repos/bench exposes its run/judge/aggregate as library primitives (a leaf refactor in bench); the tweakcc-specific behavior-bait fixtures, the behavioral rubric, and the A/B driver live in tweakcc-maint. Driver seam mirrors the gate: the LLM judge sits behind a stubbable port so pairing/randomization/aggregation are tested without real model calls.
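
One way the stubbable judge port could look, as a rough sketch. The `Judge` protocol, `judge_pair`, the `"first"`/`"second"`/`"tie"` verdict vocabulary, and the `PrefersShorter` stub are all hypothetical names, not the actual bench primitives. The point is that order randomization and the mapping back to arms can be exercised with a deterministic stub and no model calls.

```python
import random
from typing import Protocol


class Judge(Protocol):
    """Port the driver depends on; a real implementation would call an LLM."""

    def compare(self, prompt: str, first: str, second: str) -> str:
        """Return "first", "second", or "tie"."""
        ...


def judge_pair(judge, prompt, stock_out, lobotomized_out, rng):
    """Show the two outputs in random order, then map the verdict back to an arm."""
    arms = [("stock", stock_out), ("lobotomized", lobotomized_out)]
    rng.shuffle(arms)  # the judge never learns which arm is which
    verdict = judge.compare(prompt, arms[0][1], arms[1][1])
    if verdict == "tie":
        return "tie"
    if verdict == "first":
        return arms[0][0]
    if verdict == "second":
        return arms[1][0]
    raise ValueError(f"unexpected verdict: {verdict!r}")


class PrefersShorter:
    """Deterministic stub judge for tests: the shorter output wins."""

    def compare(self, prompt, first, second):
        if len(first) == len(second):
            return "tie"
        return "first" if len(first) < len(second) else "second"
```

With the stub in place, a test can sweep seeds and check that the winning arm does not depend on presentation order, for example `judge_pair(PrefersShorter(), "p", "a long answer", "short", random.Random(0))`.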

Tracking placeholder: needs its own design pass (and the bench refactor) before it is AFK-ready.

Acceptance criteria

  • Triaged: scope the bench library refactor + the behavioral rubric/fixtures.

Blocked by

None - roadmap, needs triage.

🤖 Generated with Claude Code
