feat(learn): weight loops in Headroom Learn + RTK-loop eval by purva-8 · Pull Request #1160 · chopratejas/headroom

purva-8 · 2026-06-19T10:44:18Z

Description

headroom learn ranked recommendations by a single LLM-guessed estimated_tokens_saved with a flat hardcoded confidence, and had no notion of a loop. Two consequences: (1) RTK re-fetch loops were invisible. RTK truncates a command's output, the agent re-runs larger-limit variants, and those calls succeed (is_error=False), so nothing flagged them; on top of that, analyze() returned early when a session had no failures and no events. (2) Even when surfaced, a loop ranked no higher than a one-off mistake. This PR adds loop-aware weighting plus an eval that reproduces an RTK loop, runs it through Learn, and checks that the resulting guardrail prevents the loop from re-triggering.
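The early-return problem can be illustrated with a minimal sketch. The function names and parameters below are hypothetical stand-ins, not the actual analyze() code; the sketch only shows why a session made up entirely of successful calls was skipped before, and why checking detected loops first changes that.

```python
def has_work_before(failures: list, events: list) -> bool:
    """Old gate as described above: sessions with no failures and no events are skipped."""
    return bool(failures) or bool(events)


def has_work_after(failures: list, events: list, loops: list) -> bool:
    """New gate: loops are detected up front, so a loop alone keeps the session."""
    return bool(failures) or bool(events) or bool(loops)


# An RTK re-fetch session: every call succeeded, so failures and events are empty.
rtk_session = {"failures": [], "events": [], "loops": [{"occurrences": 3}]}

skipped_before = not has_work_before(rtk_session["failures"], rtk_session["events"])
skipped_after = not has_work_after(**rtk_session)
```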

Closes #1159

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Performance improvement
Code refactoring (no functional changes)

Changes Made

New headroom/learn/loops.py: detect_loops() (canonical signature collapses RTK pagination/limit variants; classifies error vs rtk-refetch loops; measured wasted tokens), format_loops_for_digest(), apply_loop_weighting().
analyzer.py: detect loops up front (fixes the no-failure early-return), lead the digest with them, prioritize loops in the system prompt, re-sort after weighting.
models.py: Recommendation.is_loop_guardrail / loop_occurrences.
benchmarks/rtk_loop_learn_eval.py + headroom/learn/fixtures.py: the two-phase RTK-loop eval and its session fixtures.
Tests, docs/rtk-loop-weighting.md, CHANGELOG entry.
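To make the canonical-signature idea concrete, here is a minimal sketch of how limit variants of one command could collapse into a single group. It is not the code in headroom/learn/loops.py: the Call type, the regex patterns, the --limit/--offset flags, the "all calls errored" classification rule, and the choice to count every attempt except the last as waste are all assumptions made for illustration. Only the >=3 repeat threshold and the error vs rtk-refetch split come from this PR.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical normalization rules; the real module may collapse variants differently.
_LIMIT_PATTERNS = [
    (re.compile(r"\|\s*(head|tail)\s+(-n\s*)?-?\d+"), r"| \1 -N"),
    (re.compile(r"--(limit|offset|page)[= ]\d+"), r"--\1 N"),
]


@dataclass
class Call:  # illustrative stand-in for a captured tool call
    command: str
    output_tokens: int
    is_error: bool = False


def canonical_signature(command: str) -> str:
    """Replace numeric limits with a placeholder so variants share one key."""
    sig = command.strip()
    for pattern, repl in _LIMIT_PATTERNS:
        sig = pattern.sub(repl, sig)
    return re.sub(r"\s+", " ", sig)


def detect_loops_sketch(calls, min_repeats=3):
    """Group calls by signature and report groups repeated at least min_repeats times."""
    groups = defaultdict(list)
    for call in calls:
        groups[canonical_signature(call.command)].append(call)

    loops = []
    for sig, group in groups.items():
        if len(group) < min_repeats:
            continue
        # Assumption: a loop is an "error" loop only if every attempt failed.
        kind = "error" if all(c.is_error for c in group) else "rtk-refetch"
        # Assumption: all attempts except the final one count as wasted tokens.
        wasted = sum(c.output_tokens for c in group[:-1])
        loops.append(
            {"signature": sig, "occurrences": len(group), "kind": kind, "wasted_tokens": wasted}
        )
    return sorted(loops, key=lambda loop: loop["wasted_tokens"], reverse=True)
```

Keying on a normalized string keeps detection deterministic and cheap; the trade-off is that any variant the patterns do not anticipate ends up in its own group.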

Testing

Unit tests pass (pytest)
Linting passes (ruff check .)
Type checking passes (mypy headroom) - not run (mypy not in my minimal env; see Not tested)
New tests added for new functionality
Manual testing performed

Test Output

$ python -m pytest tests/test_learn/ -q
190 passed, 3 skipped, 1 warning in 5.85s
$ ruff check <changed files>
All checks passed!

Real Behavior Proof

Environment: macOS (Darwin 25.0), Python 3.10.18, fresh venv (pip install -e minus the optional hnswlib/proxy extras, which are unrelated to learn); real LLM via the analyzer's claude CLI backend (HEADROOM_LEARN_CLI=claude, claude-cli 2.1.158) — no API key used.
Exact command / steps: HEADROOM_LEARN_CLI=claude python -c "from benchmarks.rtk_loop_learn_eval import run_eval; c=run_eval(use_real_llm=True); print(c.render())"
Observed result: the analyzer shelled out to a real model and produced the "Commands" guardrail quoted below, naming the looping command. The digest reports the measured 5,005-token waste and asks the model to rank loops first, so the model emitted that figure; in this run the guardrail ranked #1 and the scorecard was all-PASS (below). Caveat: real mode is run-dependent. The rule's wording, and whether the post-hoc apply_loop_weighting fuzzy match fires, vary across runs (in one run it did not tag the rule). The deterministic CI eval (stub LLM) is the stable, reproducible artifact; this real run corroborates it.
Not tested: the analyzer's API-key path (ANTHROPIC/OPENAI/GEMINI) — exercised the equivalent claude CLI backend instead; mypy; a live agent obeying the written rule end-to-end (Phase 2 is a non-recurrence check, not a live agent — called out in the doc).

Real model output from this run, ranked #1 at the measured 5,005-token weight:

Commands — When grepping logs (or any large file), never loop with increasing | head -N limits — tool output is capped at ~4 KB regardless of N, so repeated attempts return identical bytes. Instead: redirect to a temp file (grep ... > /tmp/out.txt) then read it, or use grep -c first…

[PASS] loop_detected          (1 loop(s), ~5,005 tok wasted)
[PASS] guardrail_produced
[PASS] ranked_first
[PASS] names_command
[PASS] prescribes_fix
[PASS] weight_reflects_waste
[PASS] guardrail_holds
RESULT: PASS

(One real-mode run via the claude CLI backend. The deterministic pytest eval above is the stable artifact; see the run-dependence caveat under Observed result.)

The real run also caught an over-brittle check: an earlier names_command required the literal "TimeoutError"; the real model wrote a more general rule (grep + head -N) without it, so I fixed the check to verify the looping command is named, not an incidental literal.
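A check along these lines might look like the following sketch. It is an assumption about the shape of the fix, not the eval's actual names_command implementation: the function signature and the rule of "every program in the looping pipeline must appear in the rule text" are illustrative choices.

```python
import re


def names_command(rule_text: str, loop_signature: str) -> bool:
    """Pass if the rule mentions every program in the looping pipeline.

    Hypothetical check: it derives what to look for from the detected loop
    instead of hardcoding an incidental literal such as "TimeoutError".
    """
    programs = [
        stage.split()[0] for stage in loop_signature.split("|") if stage.strip()
    ]
    text = rule_text.lower()
    return bool(programs) and all(
        re.search(rf"\b{re.escape(program.lower())}\b", text) for program in programs
    )
```

Deriving the expected names from the loop's own signature keeps the check tied to the behavior under test, so a differently worded but correct rule still passes.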

Review Readiness

I have performed a self-review
This PR is ready for human review

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have updated the CHANGELOG.md if applicable

Additional Notes

No new dependencies. No network, no user/assistant content dropped — operates on already-captured session digests.
Kept as one logical change. mypy not run locally (minimal env); happy to address anything CI's mypy flags.

Headroom Learn ranked recommendations by a single LLM-guessed estimated_tokens_saved with a flat hardcoded confidence, and had no notion of a loop. Two consequences:

- RTK re-fetch loops were invisible: RTK truncates a command's output, so the agent re-runs larger-limit variants to fetch more. Those calls SUCCEED (is_error=False), and analyze() early-returned when a session had no failures and no events, skipping the loop entirely.
- Even when surfaced, a loop ranked no higher than a one-off mistake.

Add headroom/learn/loops.py:

- detect_loops(): group calls by a canonical signature that collapses RTK pagination/limit variants, flag >=3 repeats, classify error vs rtk-refetch loops, and compute MEASURED wasted tokens.
- format_loops_for_digest(): surface loops as a highest-priority digest section so the analyzer LLM sees them.
- apply_loop_weighting(): raise a matching recommendation's savings to at least the loop's measured waste and tag it as a loop guardrail, so loops outrank one-offs deterministically.

Wire into analyzer.py: detect loops up front (fixes the no-failure early-return), lead the digest with them, prioritize loops in the system prompt, and re-sort after weighting.

Add is_loop_guardrail / loop_occurrences to Recommendation.

Eval: benchmarks/rtk_loop_learn_eval.py reproduces the loop, runs learn, scores the guardrail (produced/ranked-first/names-command/prescribes-fix/weight-reflects-waste), then injects it and asserts a guarded session does not re-trigger it. Deterministic in CI; real-LLM via --real.

Tests: tests/test_learn/test_loop_weighting.py (13) and test_rtk_loop_eval.py. Design notes in docs/rtk-loop-weighting.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
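The weighting step described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped apply_loop_weighting(): Rec is a stand-in for Recommendation that carries only the fields named in this PR, the loop dictionaries are a made-up shape, and the matcher (every program in the loop's pipeline appears in the rule text) is a simplified substitute for the fuzzy match.

```python
import re
from dataclasses import dataclass


@dataclass
class Rec:  # illustrative stand-in for Recommendation
    text: str
    estimated_tokens_saved: int
    is_loop_guardrail: bool = False
    loop_occurrences: int = 0


def _mentions_loop(text: str, signature: str) -> bool:
    """Simplified matcher (assumption): all pipeline programs appear in the text."""
    programs = [s.split()[0].lower() for s in signature.split("|") if s.strip()]
    lowered = text.lower()
    return bool(programs) and all(
        re.search(rf"\b{re.escape(p)}\b", lowered) for p in programs
    )


def apply_loop_weighting_sketch(recs, loops):
    """Raise matching recommendations to the loop's measured waste, then re-sort."""
    for rec in recs:
        for loop in loops:
            if _mentions_loop(rec.text, loop["signature"]):
                # Floor, not overwrite: never lower an estimate that is already higher.
                rec.estimated_tokens_saved = max(
                    rec.estimated_tokens_saved, loop["wasted_tokens"]
                )
                rec.is_loop_guardrail = True
                rec.loop_occurrences = max(rec.loop_occurrences, loop["occurrences"])
    return sorted(recs, key=lambda r: r.estimated_tokens_saved, reverse=True)
```

Using max() as a floor means a measured figure can only raise a recommendation's rank. If the matcher misses, the recommendation keeps its LLM-guessed value and is not tagged.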

github-actions · 2026-06-19T10:44:30Z

PR governance

This PR follows the template and is marked ready for human review.

purva-8 · 2026-06-19T10:59:26Z

Correction (resolved). Disregard my CRLF diagnosis above — I had it wrong. The Test Output check was failing because my own PR body was submitted with CRLF line endings (a tooling round-trip on my end), not because GitHub forces CRLF or because the script is broken. Other PRs with LF bodies pass this check fine. Normalizing my body to LF cleared it, and the label is now status: ready for review.

The regex (\n after the fence tag) is still mildly fragile on CRLF input, but it's a latent edge case, not the blocker I claimed. Sorry for the noise.
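The CRLF edge case can be reproduced with a small sketch. The patterns below are hypothetical and are not the governance script's actual regex; they only show why requiring a bare \n after the fence tag misses CRLF input, and that allowing an optional \r (or normalizing the body to LF first) avoids it.

```python
import re

FENCE = "`" * 3  # built at runtime so this example does not nest code fences

# Hypothetical strict pattern: requires a bare LF right after the fence tag.
STRICT = re.compile(FENCE + r"\w*\n(.*?)" + FENCE, re.DOTALL)

# Tolerant variant: accepts LF or CRLF after the fence tag.
TOLERANT = re.compile(FENCE + r"\w*\r?\n(.*?)" + FENCE, re.DOTALL)

lf_body = "Test Output\n" + FENCE + "text\n190 passed\n" + FENCE
crlf_body = lf_body.replace("\n", "\r\n")

strict_on_lf = STRICT.search(lf_body)
strict_on_crlf = STRICT.search(crlf_body)      # no match: "\r" precedes "\n"
tolerant_on_crlf = TOLERANT.search(crlf_body)

# Alternative: normalize line endings once, then keep the strict pattern.
normalized = STRICT.search(crlf_body.replace("\r\n", "\n"))
```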

…igure

The design doc stated the LLM 'rated the eval's loop at 150 tokens vs ~5,000 measured' as an empirical result. 150 was the deterministic stub's constant, never a real model output, so the claim was removed. Doc now states the implementation is a hybrid (digest prompt hint + post-hoc fuzzy boost) and that the prompt hint is currently load-bearing while the post-hoc boost is fuzzy-match-based and does not always fire. Stub comment clarified as a simulated value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

JerrettDavis

This PR is not merge-ready yet. The body still says the review-readiness box is intentionally unchecked pending maintainer agreement, and GitHub reports mergeStateStatus=UNSTABLE rather than clean/green. Please mark it ready, ensure the required CI surface is green, and update the description once the approach is no longer pending.

purva-8 · 2026-06-20T19:32:18Z

Thanks @JerrettDavis - done:

Marked ready for review and checked the Review Readiness boxes.
Updated the description to drop the "pending maintainer agreement" wording.
Fixed the template/governance flag (it was my PR body having CRLF line endings - normalized to LF; label is now ready for review).

The only remaining non-green status is that 4 workflows are awaiting maintainer approval to run (first-time contributor) — CI, Wrap E2E, Init E2E, and Deploy Documentation, all currently action_required. I can't trigger those myself. Could a maintainer approve the run so CI can go green?

github-actions Bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) Jun 19, 2026

purva-8 mentioned this pull request Jun 19, 2026

feat(learn): weight loops in Headroom Learn + RTK-loop eval #1157

Closed

github-actions Bot added and then removed the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) Jun 19, 2026

JerrettDavis requested changes Jun 19, 2026


purva-8 marked this pull request as ready for review June 20, 2026 19:32

purva-8 requested a review from JerrettDavis June 20, 2026 19:34

github-actions Bot added the "status: ready for review" label (Pull request body is complete and the author marked it ready for human review) and removed the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) Jun 20, 2026

purva-8 wants to merge 2 commits into chopratejas:main from purva-8:learn-loop-weighting

2 participants
