
feat: PR scorecard system — closed-loop metrics for AI code quality #1199

@Gkrumbach07

Description


Problem

We have no aggregate view of how well our AI automation performs per PR. Individual corrections are logged to Langfuse, but we can't answer:

  • How many iterations did PR #X take before merge?
  • What's our one-shot rate (PR created → merged with zero fix cycles)?
  • Did the CLAUDE.md change last week actually reduce iterations?
  • What patterns does the agent consistently struggle with?
  • Did a merged PR actually fix the issue, or did it regress?

Proposal: PR Scorecard

When an ambient-code:managed PR is merged (or after N fix cycles), write a structured scorecard as a PR comment:

```html
<!-- ambient-code:scorecard
{
  "source_issue": "#1200",
  "sessions": ["session-abc123", "session-def456"],
  "iterations": 4,
  "breakdown": {
    "initial": 1,
    "ci_fixes": 1,
    "review_rounds": 2
  },
  "one_shot": false,
  "context_commit": "abc123def",
  "corrections_logged": 2,
  "agent_friction": "Had to search 5 files to find the right service pattern. CLAUDE.md doesn't document the handler registration flow.",
  "created_at": "2026-04-01T10:00:00Z",
  "merged_at": "2026-04-03T14:30:00Z"
}
-->
```

Data sources

| Field | Source |
| --- | --- |
| `source_issue` | PR frontmatter (`source=` field) |
| `sessions` | PR frontmatter (`session_id=`) plus batch fixer session IDs |
| `iterations` | `retry_count` from frontmatter at merge time |
| `breakdown` | Parsed from batch fixer logs: CI fix vs. review round vs. conflict resolution |
| `one_shot` | `iterations == 0` (initial session created it, zero fix cycles) |
| `context_commit` | Git SHA of `CLAUDE.md` + the `.claude/` directory at session creation time |
| `corrections_logged` | Count of `log_correction` calls from session traces in Langfuse |
| `agent_friction` | Agent self-reports via a new `log_friction` tool or a `friction` field on `log_correction` |
| `created_at` / `merged_at` | GitHub API timestamps |
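Assuming the frontmatter is emitted as plain `key=value` lines in the PR body (the exact format isn't pinned down in this issue), field extraction could look like:

```python
def parse_frontmatter(pr_body: str) -> dict[str, str]:
    """Extract key=value frontmatter fields (source=, session_id=, retry_count=).

    Assumption: frontmatter is bare key=value lines in the PR body;
    adjust to whatever delimiting the session tooling actually uses.
    """
    fields: dict[str, str] = {}
    for line in pr_body.splitlines():
        line = line.strip()
        # Skip prose lines: a real frontmatter key has no spaces before "=".
        if "=" in line and " " not in line.split("=", 1)[0]:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip()
    return fields
```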

Regression tracking

New label: ambient-code:regression

Applied when:

  • The source issue is reopened after PR merge
  • Someone comments @ambient-code regression on the merged PR
  • A new issue references the merged PR as the cause

The scorecard gets updated with "regressed": true when the label is added.
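A sketch of that update when the label lands. Only the JSON rewrite is shown; editing the comment through the GitHub API is omitted, and the regex mirrors the scorecard format above:

```python
import json
import re

SCORECARD_COMMENT_RE = re.compile(
    r"(<!--\s*ambient-code:scorecard\s*)(\{.*?\})(\s*-->)", re.DOTALL
)

def mark_regressed(comment_body: str) -> str:
    """Set "regressed": true inside an embedded scorecard comment."""
    def _update(match: re.Match) -> str:
        scorecard = json.loads(match.group(2))
        scorecard["regressed"] = True
        return match.group(1) + json.dumps(scorecard, indent=2) + match.group(3)
    return SCORECARD_COMMENT_RE.sub(_update, comment_body, count=1)
```

The same parse-mutate-rewrite approach works for any field added to the scorecard later.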

Aggregation

The weekly feedback loop GHA (feedback-loop.yml) expands to also:

  1. Query all merged ambient-code:managed PRs from the past week
  2. Parse scorecards from PR comments
  3. Aggregate metrics:
    • One-shot rate: % of PRs that needed zero fix cycles
    • Avg iterations: mean retry_count across PRs
    • Regression rate: % of merged PRs that got ambient-code:regression
    • Common friction: most frequent agent-reported friction patterns
    • Context effectiveness: compare iteration counts before/after context changes (group by context_commit)
  4. Post a summary to a GH discussion or milestone description
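The metric math in step 3 is simple arithmetic once scorecards are parsed. A sketch, assuming each scorecard is a dict shaped like the example above, with `regressed` absent unless set:

```python
from collections import Counter

def aggregate(scorecards: list[dict]) -> dict:
    """Compute weekly metrics from parsed scorecard dicts."""
    n = len(scorecards)
    one_shot = sum(1 for s in scorecards if s.get("one_shot"))
    regressed = sum(1 for s in scorecards if s.get("regressed"))
    frictions = Counter(
        s["agent_friction"] for s in scorecards if s.get("agent_friction")
    )
    return {
        "one_shot_rate": one_shot / n if n else 0.0,
        "avg_iterations": sum(s.get("iterations", 0) for s in scorecards) / n if n else 0.0,
        "regression_rate": regressed / n if n else 0.0,
        "common_friction": [f for f, _ in frictions.most_common(3)],
    }
```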

Context A/B testing

With context_commit tracked per PR, we can correlate context changes with outcomes:

Week 1 (context commit abc): avg 3.2 iterations, 20% one-shot rate
Week 2 (context commit def — added handler patterns): avg 1.8 iterations, 45% one-shot rate

This tells us whether a CLAUDE.md change actually helped.
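The before/after comparison is just a group-by on `context_commit`; a minimal sketch:

```python
from collections import defaultdict

def iterations_by_context(scorecards: list[dict]) -> dict[str, float]:
    """Mean iteration count per context_commit, for before/after comparison."""
    groups: dict[str, list[int]] = defaultdict(list)
    for s in scorecards:
        groups[s["context_commit"]].append(s.get("iterations", 0))
    return {sha: sum(v) / len(v) for sha, v in groups.items()}
```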

Agent friction reporting

Two options:

Option A: Add a friction field to the existing log_correction tool schema. The fixer already calls it — just add an optional field for self-reported difficulty.

Option B: New log_friction tool that the agent calls when it struggles (not a correction, just "this was hard"). Lower barrier than corrections.

Leaning toward Option A — keep it in one tool, one Langfuse pipeline.

Implementation phases

Phase 1: Scorecard writing

  • Add a post-merge GHA trigger or a step in the batch fixer that writes the scorecard comment when a PR is merged
  • Parse frontmatter for session ID, source issue, retry count
  • Include context_commit (SHA of .claude/ directory at the time)

Phase 2: Regression tracking

  • Create ambient-code:regression label
  • Add pull_request: [labeled] trigger to amber handler for regression label
  • Update scorecard with regressed: true

Phase 3: Aggregation

  • Extend feedback loop to scrape scorecards from merged PRs
  • Generate weekly metrics summary
  • Post to GH discussion or milestone

Phase 4: Context A/B testing

  • Compare metrics across context_commit values
  • Surface recommendations: "Iteration count dropped 40% after commit X — keep those changes"

Labels

ambient-code:auto-fix, enhancement


Phase 5: Multi-Runner Review Aggregation

Problem

A single model reviewing code has blind spots. Different models catch different things — CodeRabbit finds security patterns, Claude catches architectural issues, Gemini spots performance problems. Today these run independently (CodeRabbit via their bot, Claude via amber-auto-review) with no aggregation.

Proposal

For ambient-code:managed PRs, run reviews through multiple runners and aggregate findings into a single prioritized review comment.

Runners

| Runner | Source | What it catches well |
| --- | --- | --- |
| CodeRabbit | External bot (already runs) | Security, dependency issues, OWASP patterns |
| Claude Code | ACP session (`claude` runner) | Architecture, pattern consistency, codebase conventions |
| Gemini CLI | ACP session (`gemini-cli` runner) | Performance, algorithmic complexity, resource usage |

Flow

```text
PR created/updated
  ├─ CodeRabbit runs automatically (external)
  ├─ Claude review session (ACP, claude runner)
  └─ Gemini review session (ACP, gemini-cli runner)
       ↓
  Aggregator session reads all three reviews
       ↓
  Posts single comment: prioritized findings, deduped, with model attribution
```

Implementation

  1. Trigger: PR opened / synchronize event (like current amber-auto-review)
  2. Parallel reviews: Create 2 ACP sessions (claude + gemini) pointing at the PR branch. CodeRabbit runs on its own.
  3. Wait: Poll both sessions for completion (v0.0.5 wait: true with idle detection)
  4. Aggregate: A third session (or the GHA itself) reads:
    • CodeRabbit's PR comments (via GH API)
    • Claude review session's output
    • Gemini review session's output
  5. Post: Single comment with:
    • Critical issues (agreed by 2+ models)
    • Model-specific findings (only one model flagged it)
    • Confidence: "3/3 models flagged this" vs "1/3"
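The dedupe-and-confidence step (item 5) could bucket findings by how many models reported them. A sketch, assuming findings have already been normalized to `(model, finding_key)` pairs; turning raw review text into shared keys is the hard part and is out of scope here:

```python
from collections import defaultdict

def bucket_findings(findings: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group deduped finding keys by how many of the 3 models flagged them."""
    models_per_finding: dict[str, set[str]] = defaultdict(set)
    for model, key in findings:
        models_per_finding[key].add(model)
    buckets: dict[str, list[str]] = {"critical": [], "model_specific": []}
    for key, models in models_per_finding.items():
        # 2+ models agreeing => critical; a single model => model-specific.
        label = "critical" if len(models) >= 2 else "model_specific"
        buckets[label].append(f"{key} ({len(models)}/3 models)")
    return buckets
```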

Scorecard integration

Review metrics feed into the PR scorecard:

  • Number of review findings per model
  • Agreement rate across models
  • False positive rate (findings dismissed by human reviewer)

Open questions

  • Should the aggregator be a separate session or inline GHA logic?
  • How to handle CodeRabbit's async timing (it may not have commented yet when our sessions finish)?
  • Cost: 3 reviews per PR — gate behind a label (ambient-code:deep-review) or run for all?
