
feat: PR scorecard system — closed-loop metrics for AI code quality #1199

@Gkrumbach07

Description


Problem

We have no aggregate view of how well our AI automation performs per PR. Individual corrections are logged to Langfuse, but we can't answer:

  • How many iterations did PR #X take before merge?
  • What's our one-shot rate (PR created → merged with zero fix cycles)?
  • Did the CLAUDE.md change last week actually reduce iterations?
  • What patterns does the agent consistently struggle with?
  • Did a merged PR actually fix the issue, or did it regress?

Proposal: PR Scorecard

When an ambient-code:managed PR is merged (or after N fix cycles), write a structured scorecard as a PR comment:

```html
<!-- ambient-code:scorecard
{
  "source_issue": "#1200",
  "sessions": ["session-abc123", "session-def456"],
  "iterations": 4,
  "breakdown": {
    "initial": 1,
    "ci_fixes": 1,
    "review_rounds": 2
  },
  "one_shot": false,
  "context_commit": "abc123def",
  "corrections_logged": 2,
  "agent_friction": "Had to search 5 files to find the right service pattern. CLAUDE.md doesn't document the handler registration flow.",
  "created_at": "2026-04-01T10:00:00Z",
  "merged_at": "2026-04-03T14:30:00Z"
}
-->
```

Data sources

| Field | Source |
| --- | --- |
| `source_issue` | PR frontmatter (`source=` field) |
| `sessions` | PR frontmatter (`session_id=`) plus batch fixer session IDs |
| `iterations` | `retry_count` from frontmatter at merge time |
| `breakdown` | Parsed from batch fixer logs: CI fix vs. review round vs. conflict resolution |
| `one_shot` | `iterations == 0` (initial session created it, zero fix cycles) |
| `context_commit` | Git SHA of `CLAUDE.md` + the `.claude/` directory at session creation time |
| `corrections_logged` | Count of `log_correction` calls from session traces in Langfuse |
| `agent_friction` | Agent self-reports via a new `log_friction` tool or a `friction` field on `log_correction` |
| `created_at` / `merged_at` | GitHub API timestamps |
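Assuming the frontmatter is emitted as plain `key=value` lines in the PR body (the exact format isn't pinned down in this issue), field extraction could look like:

```python
def parse_frontmatter(pr_body: str) -> dict[str, str]:
    """Extract key=value frontmatter fields (source=, session_id=, retry_count=).

    Assumption: frontmatter is bare key=value lines in the PR body;
    adjust to whatever delimiting the session tooling actually uses.
    """
    fields: dict[str, str] = {}
    for line in pr_body.splitlines():
        line = line.strip()
        # Skip prose lines: a real frontmatter key has no spaces before "=".
        if "=" in line and " " not in line.split("=", 1)[0]:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip()
    return fields
```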

Regression tracking

New label: ambient-code:regression

Applied when:

  • The source issue is reopened after PR merge
  • Someone comments @ambient-code regression on the merged PR
  • A new issue references the merged PR as the cause

The scorecard gets updated with "regressed": true when the label is added.
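A sketch of that update when the label lands. Only the JSON rewrite is shown; editing the comment through the GitHub API is omitted, and the regex mirrors the scorecard format above:

```python
import json
import re

SCORECARD_COMMENT_RE = re.compile(
    r"(<!--\s*ambient-code:scorecard\s*)(\{.*?\})(\s*-->)", re.DOTALL
)

def mark_regressed(comment_body: str) -> str:
    """Set "regressed": true inside an embedded scorecard comment."""
    def _update(match: re.Match) -> str:
        scorecard = json.loads(match.group(2))
        scorecard["regressed"] = True
        return match.group(1) + json.dumps(scorecard, indent=2) + match.group(3)
    return SCORECARD_COMMENT_RE.sub(_update, comment_body, count=1)
```

The same parse-mutate-rewrite approach works for any field added to the scorecard later.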

Aggregation

The weekly feedback loop GHA (feedback-loop.yml) expands to also:

  1. Query all merged ambient-code:managed PRs from the past week
  2. Parse scorecards from PR comments
  3. Aggregate metrics:
    • One-shot rate: % of PRs that needed zero fix cycles
    • Avg iterations: mean retry_count across PRs
    • Regression rate: % of merged PRs that got ambient-code:regression
    • Common friction: most frequent agent-reported friction patterns
    • Context effectiveness: compare iteration counts before/after context changes (group by context_commit)
  4. Post a summary to a GH discussion or milestone description
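The metric math in step 3 is simple arithmetic once scorecards are parsed. A sketch, assuming each scorecard is a dict shaped like the example above, with `regressed` absent unless set:

```python
from collections import Counter

def aggregate(scorecards: list[dict]) -> dict:
    """Compute weekly metrics from parsed scorecard dicts."""
    n = len(scorecards)
    one_shot = sum(1 for s in scorecards if s.get("one_shot"))
    regressed = sum(1 for s in scorecards if s.get("regressed"))
    frictions = Counter(
        s["agent_friction"] for s in scorecards if s.get("agent_friction")
    )
    return {
        "one_shot_rate": one_shot / n if n else 0.0,
        "avg_iterations": sum(s.get("iterations", 0) for s in scorecards) / n if n else 0.0,
        "regression_rate": regressed / n if n else 0.0,
        "common_friction": [f for f, _ in frictions.most_common(3)],
    }
```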

Context A/B testing

With context_commit tracked per PR, we can correlate context changes with outcomes:

Week 1 (context commit abc): avg 3.2 iterations, 20% one-shot rate
Week 2 (context commit def — added handler patterns): avg 1.8 iterations, 45% one-shot rate

This tells us whether a CLAUDE.md change actually helped.
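The before/after comparison is just a group-by on `context_commit`; a minimal sketch:

```python
from collections import defaultdict

def iterations_by_context(scorecards: list[dict]) -> dict[str, float]:
    """Mean iteration count per context_commit, for before/after comparison."""
    groups: dict[str, list[int]] = defaultdict(list)
    for s in scorecards:
        groups[s["context_commit"]].append(s.get("iterations", 0))
    return {sha: sum(v) / len(v) for sha, v in groups.items()}
```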

Agent friction reporting

Two options:

Option A: Add a friction field to the existing log_correction tool schema. The fixer already calls it — just add an optional field for self-reported difficulty.

Option B: New log_friction tool that the agent calls when it struggles (not a correction, just "this was hard"). Lower barrier than corrections.

Leaning toward Option A — keep it in one tool, one Langfuse pipeline.

Implementation phases

Phase 1: Scorecard writing

  • Add a post-merge GHA trigger or a step in the batch fixer that writes the scorecard comment when a PR is merged
  • Parse frontmatter for session ID, source issue, retry count
  • Include context_commit (SHA of .claude/ directory at the time)

Phase 2: Regression tracking

  • Create ambient-code:regression label
  • Add pull_request: [labeled] trigger to amber handler for regression label
  • Update scorecard with regressed: true

Phase 3: Aggregation

  • Extend feedback loop to scrape scorecards from merged PRs
  • Generate weekly metrics summary
  • Post to GH discussion or milestone

Phase 4: Context A/B testing

  • Compare metrics across context_commit values
  • Surface recommendations: "Iteration count dropped 40% after commit X — keep those changes"

Labels

ambient-code:auto-fix, enhancement


Phase 5: Multi-Runner Review Aggregation

Problem

A single model reviewing code has blind spots. Different models catch different things — CodeRabbit finds security patterns, Claude catches architectural issues, Gemini spots performance problems. Today these run independently (CodeRabbit via their bot, Claude via amber-auto-review) with no aggregation.

Proposal

For ambient-code:managed PRs, run reviews through multiple runners and aggregate findings into a single prioritized review comment.

Runners

| Runner | Source | What it catches well |
| --- | --- | --- |
| CodeRabbit | External bot (already runs) | Security, dependency issues, OWASP patterns |
| Claude Code | ACP session (`claude` runner) | Architecture, pattern consistency, codebase conventions |
| Gemini CLI | ACP session (`gemini-cli` runner) | Performance, algorithmic complexity, resource usage |

Flow

```text
PR created/updated
  ├─ CodeRabbit runs automatically (external)
  ├─ Claude review session (ACP, claude runner)
  └─ Gemini review session (ACP, gemini-cli runner)
       ↓
  Aggregator session reads all three reviews
       ↓
  Posts single comment: prioritized findings, deduped, with model attribution
```

Implementation

  1. Trigger: PR opened / synchronize event (like current amber-auto-review)
  2. Parallel reviews: Create 2 ACP sessions (claude + gemini) pointing at the PR branch. CodeRabbit runs on its own.
  3. Wait: Poll both sessions for completion (v0.0.5 wait: true with idle detection)
  4. Aggregate: A third session (or the GHA itself) reads:
    • CodeRabbit's PR comments (via GH API)
    • Claude review session's output
    • Gemini review session's output
  5. Post: Single comment with:
    • Critical issues (agreed by 2+ models)
    • Model-specific findings (only one model flagged it)
    • Confidence: "3/3 models flagged this" vs "1/3"
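The dedupe-and-confidence step (item 5) could bucket findings by how many models reported them. A sketch, assuming findings have already been normalized to `(model, finding_key)` pairs; turning raw review text into shared keys is the hard part and is out of scope here:

```python
from collections import defaultdict

def bucket_findings(findings: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group deduped finding keys by how many of the 3 models flagged them."""
    models_per_finding: dict[str, set[str]] = defaultdict(set)
    for model, key in findings:
        models_per_finding[key].add(model)
    buckets: dict[str, list[str]] = {"critical": [], "model_specific": []}
    for key, models in models_per_finding.items():
        # 2+ models agreeing => critical; a single model => model-specific.
        label = "critical" if len(models) >= 2 else "model_specific"
        buckets[label].append(f"{key} ({len(models)}/3 models)")
    return buckets
```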

Scorecard integration

Review metrics feed into the PR scorecard:

  • Number of review findings per model
  • Agreement rate across models
  • False positive rate (findings dismissed by human reviewer)

Open questions

  • Should the aggregator be a separate session or inline GHA logic?
  • How to handle CodeRabbit's async timing (it may not have commented yet when our sessions finish)?
  • Cost: 3 reviews per PR — gate behind a label (ambient-code:deep-review) or run for all?
