feat: PR scorecard system — closed-loop metrics for AI code quality #1199
Description
Problem
We have no aggregate view of how well our AI automation performs per PR. Individual corrections are logged to Langfuse, but we can't answer:
- How many iterations did PR #X take before merge?
- What's our one-shot rate (PR created → merged with zero fix cycles)?
- Did the CLAUDE.md change last week actually reduce iterations?
- What patterns does the agent consistently struggle with?
- Did a merged PR actually fix the issue, or did it regress?
Proposal: PR Scorecard
When an `ambient-code:managed` PR is merged (or after N fix cycles), write a structured scorecard as a PR comment:
```html
<!-- ambient-code:scorecard
{
  "source_issue": "#1200",
  "sessions": ["session-abc123", "session-def456"],
  "iterations": 4,
  "breakdown": {
    "initial": 1,
    "ci_fixes": 1,
    "review_rounds": 2
  },
  "one_shot": false,
  "context_commit": "abc123def",
  "corrections_logged": 2,
  "agent_friction": "Had to search 5 files to find the right service pattern. CLAUDE.md doesn't document the handler registration flow.",
  "created_at": "2026-04-01T10:00:00Z",
  "merged_at": "2026-04-03T14:30:00Z"
}
-->
```
Data sources
| Field | Source |
|---|---|
| `source_issue` | PR frontmatter (`source=` field) |
| `sessions` | PR frontmatter (`session_id=`) + batch fixer session IDs |
| `iterations` | `retry_count` from frontmatter at merge time |
| `breakdown` | Parsed from batch fixer logs: CI fix vs review round vs conflict resolution |
| `one_shot` | `iterations == 0` (initial session created it, zero fix cycles) |
| `context_commit` | Git SHA of CLAUDE.md + `.claude/` directory at session creation time |
| `corrections_logged` | Count of `log_correction` calls from session traces in Langfuse |
| `agent_friction` | Agent self-reports via a new `log_friction` tool or a `friction` field in `log_correction` |
| `created_at` / `merged_at` | GitHub API timestamps |
Regression tracking
New label: ambient-code:regression
Applied when:
- The source issue is reopened after PR merge
- Someone comments `@ambient-code regression` on the merged PR
- A new issue references the merged PR as the cause
The scorecard gets updated with `"regressed": true` when the label is added.
Aggregation
The weekly feedback loop GHA (feedback-loop.yml) expands to also:
- Query all merged `ambient-code:managed` PRs from the past week
- Parse scorecards from PR comments
- Aggregate metrics:
  - One-shot rate: % of PRs that needed zero fix cycles
  - Avg iterations: mean `retry_count` across PRs
  - Regression rate: % of merged PRs that got `ambient-code:regression`
  - Common friction: most frequent agent-reported friction patterns
  - Context effectiveness: compare iteration counts before/after context changes (group by `context_commit`)
- Post a summary to a GH discussion or milestone description
Context A/B testing
With `context_commit` tracked per PR, we can correlate context changes with outcomes:
```
Week 1 (context commit abc): avg 3.2 iterations, 20% one-shot rate
Week 2 (context commit def — added handler patterns): avg 1.8 iterations, 45% one-shot rate
```
This tells us whether a CLAUDE.md change actually helped.
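The comparison reduces to grouping scorecards by `context_commit` and averaging iterations per group. An illustrative sketch (real analysis would also want sample sizes and one-shot rates per group):

```python
from collections import defaultdict
from statistics import mean

def iterations_by_context(scorecards: list[dict]) -> dict:
    """Average iteration count per context_commit, for before/after comparison."""
    groups = defaultdict(list)
    for s in scorecards:
        groups[s["context_commit"]].append(s["iterations"])
    return {commit: mean(iters) for commit, iters in groups.items()}

cards = [
    {"context_commit": "abc", "iterations": 4},
    {"context_commit": "abc", "iterations": 2},
    {"context_commit": "def", "iterations": 1},
]
avgs = iterations_by_context(cards)  # avgs["abc"] == 3, avgs["def"] == 1
```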
Agent friction reporting
Two options:
Option A: Add a friction field to the existing log_correction tool schema. The fixer already calls it — just add an optional field for self-reported difficulty.
Option B: New log_friction tool that the agent calls when it struggles (not a correction, just "this was hard"). Lower barrier than corrections.
Leaning toward Option A — keep it in one tool, one Langfuse pipeline.
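Under Option A, the payload change is one optional field. A hypothetical example of what an extended `log_correction` call's payload might carry (the exact tool schema is an assumption; only the field names `friction` and `log_correction` come from the proposal):

```python
# Hypothetical Option A payload: the existing log_correction call gains an
# optional "friction" field for self-reported difficulty. Values illustrative.
correction = {
    "session_id": "session-abc123",
    "correction": "Use the registered handler instead of constructing a new one",
    "friction": "CLAUDE.md doesn't document the handler registration flow",
}
```

Because `friction` is optional, existing callers keep working unchanged, and the Langfuse pipeline only needs to tolerate (and index) the new key.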
Implementation phases
Phase 1: Scorecard writing
- Add a post-merge GHA trigger or a step in the batch fixer that writes the scorecard comment when a PR is merged
- Parse frontmatter for session ID, source issue, retry count
- Include `context_commit` (SHA of the `.claude/` directory at the time)
Phase 2: Regression tracking
- Create `ambient-code:regression` label
- Add `pull_request: [labeled]` trigger to amber handler for regression label
- Update scorecard with `regressed: true`
Phase 3: Aggregation
- Extend feedback loop to scrape scorecards from merged PRs
- Generate weekly metrics summary
- Post to GH discussion or milestone
Phase 4: Context A/B testing
- Compare metrics across `context_commit` values
- Surface recommendations: "Iteration count dropped 40% after commit X — keep those changes"
Labels
`ambient-code:auto-fix`, `enhancement`
Phase 5: Multi-Runner Review Aggregation
Problem
A single model reviewing code has blind spots. Different models catch different things — CodeRabbit finds security patterns, Claude catches architectural issues, Gemini spots performance problems. Today these run independently (CodeRabbit via their bot, Claude via amber-auto-review) with no aggregation.
Proposal
For `ambient-code:managed` PRs, run reviews through multiple runners and aggregate findings into a single prioritized review comment.
Runners
| Runner | Source | What it catches well |
|---|---|---|
| CodeRabbit | External bot (already runs) | Security, dependency issues, OWASP patterns |
| Claude Code | ACP session (claude runner) | Architecture, pattern consistency, codebase conventions |
| Gemini CLI | ACP session (gemini-cli runner) | Performance, algorithmic complexity, resource usage |
Flow
```
PR created/updated
 ├─ CodeRabbit runs automatically (external)
 ├─ Claude review session (ACP, claude runner)
 └─ Gemini review session (ACP, gemini-cli runner)
        ↓
Aggregator session reads all three reviews
        ↓
Posts single comment: prioritized findings, deduped, with model attribution
```
Implementation
- Trigger: PR `opened`/`synchronize` event (like current amber-auto-review)
- Parallel reviews: Create 2 ACP sessions (claude + gemini) pointing at the PR branch. CodeRabbit runs on its own.
- Wait: Poll both sessions for completion (v0.0.5 `wait: true` with idle detection)
- Aggregate: A third session (or the GHA itself) reads:
- CodeRabbit's PR comments (via GH API)
- Claude review session's output
- Gemini review session's output
- Post: Single comment with:
- Critical issues (agreed by 2+ models)
- Model-specific findings (only one model flagged it)
- Confidence: "3/3 models flagged this" vs "1/3"
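The dedupe-and-attribute step can be sketched by grouping findings on a normalized key and counting how many runners flagged each one. The finding shape and key here are assumptions; a real implementation would need fuzzier matching than exact lowercase descriptions:

```python
from collections import defaultdict

def aggregate_findings(findings: list[dict]) -> list[dict]:
    """Dedupe findings across runners and attach model attribution."""
    grouped = defaultdict(set)
    for f in findings:
        # Naive dedupe key: same file + same (case-folded) description.
        key = (f["file"], f["issue"].lower())
        grouped[key].add(f["model"])
    total_models = 3  # CodeRabbit, Claude, Gemini
    return [
        {
            "file": file,
            "issue": issue,
            "models": sorted(models),
            "confidence": f"{len(models)}/{total_models} models flagged this",
        }
        for (file, issue), models in grouped.items()
    ]

findings = [
    {"model": "coderabbit", "file": "svc.go", "issue": "Unvalidated input"},
    {"model": "claude", "file": "svc.go", "issue": "unvalidated input"},
    {"model": "gemini", "file": "loop.go", "issue": "O(n^2) scan"},
]
report = aggregate_findings(findings)
```

Findings with a 2+/3 count would sort into the "critical" section of the posted comment; 1/3 findings stay as model-specific notes.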
Scorecard integration
Review metrics feed into the PR scorecard:
- Number of review findings per model
- Agreement rate across models
- False positive rate (findings dismissed by human reviewer)
Open questions
- Should the aggregator be a separate session or inline GHA logic?
- How to handle CodeRabbit's async timing (it may not have commented yet when our sessions finish)?
- Cost: 3 reviews per PR — gate behind a label (`ambient-code:deep-review`) or run for all?