feat(learn): weight loops in Headroom Learn + RTK-loop eval #1160

Open
purva-8 wants to merge 2 commits into chopratejas:main from purva-8:learn-loop-weighting

Conversation

purva-8 commented Jun 19, 2026

Description

headroom learn ranked recommendations by a single LLM-guessed estimated_tokens_saved with a flat hardcoded confidence, and had no notion of a loop. Two consequences: (1) RTK re-fetch loops were invisible. RTK truncates a command's output, the agent re-runs larger-limit variants, those calls succeed (is_error=False), and analyze() early-returned when a session had no failures and no events, so the loop was never analyzed. (2) Even when surfaced, a loop ranked no higher than a one-off mistake. This PR adds loop-aware weighting plus an eval that reproduces an RTK loop, runs it through Learn, and checks that the resulting guardrail prevents the loop from re-triggering.

Closes #1159

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • New headroom/learn/loops.py: detect_loops() (canonical signature collapses RTK pagination/limit variants; classifies error vs rtk-refetch loops; measured wasted tokens), format_loops_for_digest(), apply_loop_weighting().
  • analyzer.py: detect loops up front (fixes the no-failure early-return), lead the digest with them, prioritize loops in the system prompt, re-sort after weighting.
  • models.py: Recommendation.is_loop_guardrail / loop_occurrences.
  • benchmarks/rtk_loop_learn_eval.py + headroom/learn/fixtures.py: the two-phase RTK-loop eval and its session fixtures.
  • Tests, docs/rtk-loop-weighting.md, CHANGELOG entry.
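To make the detection step concrete, here is a minimal sketch of grouping calls by a canonical signature. It is not the code in headroom/learn/loops.py: the Call and Loop types, the two regex patterns, the repeat threshold default, the classification rule, and the "all but the last call are wasted" metric are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical patterns for limit/pagination arguments; the real module may
# recognize a different set.
_LIMIT_PATTERNS = [
    re.compile(r"\|\s*(head|tail)\s+(-n\s*)?-?\d+"),  # "| head -50", "| tail -n 100"
    re.compile(r"--(limit|offset|page)[= ]\d+"),      # "--limit 200", "--page=3"
]


@dataclass
class Call:  # hypothetical stand-in for a captured tool call
    command: str
    output_tokens: int
    is_error: bool


@dataclass
class Loop:  # hypothetical result type
    signature: str
    kind: str
    occurrences: int
    wasted_tokens: int


def canonical_signature(command: str) -> str:
    """Strip limit/pagination arguments so variants share one signature."""
    sig = command
    for pattern in _LIMIT_PATTERNS:
        sig = pattern.sub("", sig)
    return " ".join(sig.split())  # collapse leftover whitespace


def detect_loops(calls, min_repeats=3):
    """Group calls by signature and report groups with enough repeats."""
    groups = defaultdict(list)
    for call in calls:
        groups[canonical_signature(call.command)].append(call)

    loops = []
    for sig, group in groups.items():
        if len(group) < min_repeats:
            continue
        # Assumed rule: any failing call makes it an error loop; a group of
        # only successful calls is treated as an RTK re-fetch loop.
        kind = "error" if any(c.is_error for c in group) else "rtk-refetch"
        # Assumed metric: every call except the final one counts as waste.
        wasted = sum(c.output_tokens for c in group[:-1])
        loops.append(Loop(sig, kind, len(group), wasted))
    return sorted(loops, key=lambda lp: lp.wasted_tokens, reverse=True)
```

The point of the sketch is that the waste figure comes from recorded token counts rather than from an LLM estimate, which is what lets a loop be weighted by measurement.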

Testing

  • Unit tests pass (pytest)
  • Linting passes (ruff check .)
  • Type checking passes (mypy headroom) - not run (mypy not in my minimal env; see Not tested)
  • New tests added for new functionality
  • Manual testing performed

Test Output

$ python -m pytest tests/test_learn/ -q
190 passed, 3 skipped, 1 warning in 5.85s
$ ruff check <changed files>
All checks passed!

Real Behavior Proof

  • Environment: macOS (Darwin 25.0), Python 3.10.18, fresh venv (pip install -e minus the optional hnswlib/proxy extras, which are unrelated to learn); real LLM via the analyzer's claude CLI backend (HEADROOM_LEARN_CLI=claude, claude-cli 2.1.158) — no API key used.
  • Exact command / steps: HEADROOM_LEARN_CLI=claude python -c "from benchmarks.rtk_loop_learn_eval import run_eval; c=run_eval(use_real_llm=True); print(c.render())"
  • Observed result: the analyzer shelled out to a real model and produced the "Commands" guardrail quoted below, naming the looping command. The digest reports the measured 5,005-token waste and asks the model to rank loops first, so the model emitted that figure; in this run the guardrail ranked #1 and the scorecard was all-PASS (below). Caveat: real mode is run-dependent. The rule's wording, and whether the post-hoc apply_loop_weighting fuzzy match fires, vary across runs (in one run it did not tag the rule). The deterministic CI eval (stub LLM) is the stable, reproducible artifact; this real run corroborates it.
  • Not tested: the analyzer's API-key path (ANTHROPIC/OPENAI/GEMINI) — exercised the equivalent claude CLI backend instead; mypy; a live agent obeying the written rule end-to-end (Phase 2 is a non-recurrence check, not a live agent — called out in the doc).

Real model output from this run, ranked #1 at the measured 5,005-token weight:

Commands — When grepping logs (or any large file), never loop with increasing | head -N limits — tool output is capped at ~4 KB regardless of N, so repeated attempts return identical bytes. Instead: redirect to a temp file (grep ... > /tmp/out.txt) then read it, or use grep -c first…

[PASS] loop_detected          (1 loop(s), ~5,005 tok wasted)
[PASS] guardrail_produced
[PASS] ranked_first
[PASS] names_command
[PASS] prescribes_fix
[PASS] weight_reflects_waste
[PASS] guardrail_holds
RESULT: PASS

(One real-mode run via the claude CLI backend. The deterministic pytest eval above is the stable artifact; see the run-dependence caveat under Observed result.)

The real run also caught an over-brittle check: an earlier names_command required the literal "TimeoutError"; the real model wrote a more general rule (grep + head -N) without it, so I fixed the check to verify the looping command is named, not an incidental literal.
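A sketch of what the relaxed check could look like, assuming the check receives the guardrail text and the loop's canonical signature. The function body is illustrative and is not taken from benchmarks/rtk_loop_learn_eval.py; treating the first token of the signature as "the command" is an assumption.

```python
import re


def names_command(guardrail_text: str, loop_signature: str) -> bool:
    """True when the guardrail mentions the executable that was looping."""
    parts = loop_signature.split()
    if not parts:
        return False
    executable = parts[0]  # assumed: the first token identifies the command
    # Whole-word match, so "grep" does not match inside "grepping".
    pattern = rf"(?<!\w){re.escape(executable)}(?!\w)"
    return re.search(pattern, guardrail_text, re.IGNORECASE) is not None
```

Under this sketch, a rule that says "redirect to a temp file (grep ... > /tmp/out.txt)" passes for a grep loop even though it never mentions "TimeoutError".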

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md if applicable

Additional Notes

  • No new dependencies. No network, no user/assistant content dropped — operates on already-captured session digests.
  • Kept as one logical change. mypy not run locally (minimal env); happy to address anything CI's mypy flags.

Headroom Learn ranked recommendations by a single LLM-guessed
estimated_tokens_saved with a flat hardcoded confidence, and had no
notion of a loop. Two consequences:

- RTK re-fetch loops were invisible: RTK truncates a command's output,
  so the agent re-runs larger-limit variants to fetch more. Those calls
  SUCCEED (is_error=False), and analyze() early-returned when a session
  had no failures and no events — skipping the loop entirely.
- Even when surfaced, a loop ranked no higher than a one-off mistake.

Add headroom/learn/loops.py:
- detect_loops(): group calls by a canonical signature that collapses
  RTK pagination/limit variants, flag >=3 repeats, classify error vs
  rtk-refetch loops, and compute MEASURED wasted tokens.
- format_loops_for_digest(): surface loops as a highest-priority digest
  section so the analyzer LLM sees them.
- apply_loop_weighting(): raise a matching recommendation's savings to
  at least the loop's measured waste and tag it as a loop guardrail, so
  loops outrank one-offs deterministically.
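The weighting step described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped apply_loop_weighting(): the token-overlap matcher, its 0.5 threshold, and the simplified Recommendation and Loop types are hypothetical, and the real fuzzy match may work differently.

```python
import re
from dataclasses import dataclass


@dataclass
class Recommendation:  # simplified; only the fields the sketch needs
    text: str
    estimated_tokens_saved: int
    is_loop_guardrail: bool = False
    loop_occurrences: int = 0


@dataclass
class Loop:  # hypothetical detected-loop record
    signature: str
    occurrences: int
    wasted_tokens: int


def _tokens(text: str) -> set:
    return set(re.findall(r"[\w.-]+", text.lower()))


def matches(rec: Recommendation, loop: Loop, min_overlap: float = 0.5) -> bool:
    """Assumed fuzzy match: share of signature tokens found in the rule text."""
    sig_tokens = _tokens(loop.signature)
    if not sig_tokens:
        return False
    return len(sig_tokens & _tokens(rec.text)) / len(sig_tokens) >= min_overlap


def apply_loop_weighting(recs, loops):
    """Raise matching recommendations to the measured waste, then re-sort."""
    for rec in recs:
        for loop in loops:
            if matches(rec, loop):
                # Floor, not overwrite: a higher existing estimate is kept.
                rec.estimated_tokens_saved = max(
                    rec.estimated_tokens_saved, loop.wasted_tokens
                )
                rec.is_loop_guardrail = True
                rec.loop_occurrences = max(rec.loop_occurrences, loop.occurrences)
    return sorted(recs, key=lambda r: r.estimated_tokens_saved, reverse=True)
```

Because the boost depends on a text match, a rule phrased without the loop's tokens is left untouched, which is consistent with the caveat that the post-hoc boost does not always fire.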

Wire into analyzer.py: detect loops up front (fixes the no-failure
early-return), lead the digest with them, prioritize loops in the system
prompt, and re-sort after weighting. Add is_loop_guardrail /
loop_occurrences to Recommendation.

Eval: benchmarks/rtk_loop_learn_eval.py reproduces the loop, runs learn,
scores the guardrail (produced/ranked-first/names-command/prescribes-fix/
weight-reflects-waste), then injects it and asserts a guarded session
does not re-trigger it. Deterministic in CI; real-LLM via --real.

Tests: tests/test_learn/test_loop_weighting.py (13) and
test_rtk_loop_eval.py. Design notes in docs/rtk-loop-weighting.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions Bot commented Jun 19, 2026

Contributor

PR governance

This PR follows the template and is marked ready for human review.

@github-actions github-actions Bot added the status: needs author action Pull request body or readiness checklist still needs author updates label Jun 19, 2026
@github-actions github-actions Bot added status: needs author action Pull request body or readiness checklist still needs author updates and removed status: needs author action Pull request body or readiness checklist still needs author updates labels Jun 19, 2026

purva-8 commented Jun 19, 2026

Author

Correction (resolved). Disregard my CRLF diagnosis above — I had it wrong. The Test Output check was failing because my own PR body was submitted with CRLF line endings (a tooling round-trip on my end), not because GitHub forces CRLF or because the script is broken. Other PRs with LF bodies pass this check fine. Normalizing my body to LF cleared it, and the label is now status: ready for review.

The regex (\n after the fence tag) is still mildly fragile on CRLF input, but it's a latent edge case, not the blocker I claimed. Sorry for the noise.
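For readers unfamiliar with this failure mode, here is a small reproduction. The patterns are hypothetical stand-ins, not the governance script's actual regex; they only show why requiring a bare \n after the fence tag misses CRLF bodies, and two ways to tolerate them.

```python
import re

# Built programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Hypothetical stand-ins for the check's pattern, not the real script.
STRICT = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)
TOLERANT = re.compile(FENCE + r"(\w*)\r?\n(.*?)" + FENCE, re.DOTALL)


def fenced_blocks(body: str, pattern: re.Pattern) -> list[str]:
    """Return the stripped contents of every fenced block the pattern finds."""
    return [m.group(2).strip() for m in pattern.finditer(body)]


def normalize_newlines(body: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    return body.replace("\r\n", "\n").replace("\r", "\n")


lf_body = "Test Output\n\n" + FENCE + "\n190 passed\n" + FENCE + "\n"
crlf_body = lf_body.replace("\n", "\r\n")

print(fenced_blocks(lf_body, STRICT))      # block found
print(fenced_blocks(crlf_body, STRICT))    # nothing found: "\r" precedes "\n"
print(fenced_blocks(crlf_body, TOLERANT))  # block found again
```

Either accepting an optional \r in the pattern or normalizing the body before matching would remove the edge case; normalizing first keeps the pattern itself simpler.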

…igure

The design doc stated the LLM 'rated the eval's loop at 150 tokens vs ~5,000
measured' as an empirical result. 150 was the deterministic stub's constant,
never a real model output — removed. Doc now states the implementation is a
hybrid (digest prompt hint + post-hoc fuzzy boost) and that the prompt hint is
currently load-bearing while the post-hoc boost is fuzzy-match-based and does
not always fire. Stub comment clarified as a simulated value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

JerrettDavis left a comment

Collaborator

This PR is not merge-ready yet. The body still says the review-readiness box is intentionally unchecked pending maintainer agreement, and GitHub reports mergeStateStatus=UNSTABLE rather than clean/green. Please mark it ready, ensure the required CI surface is green, and update the description once the approach is no longer pending.

@purva-8 purva-8 marked this pull request as ready for review June 20, 2026 19:32

purva-8 commented Jun 20, 2026

Author

Thanks @JerrettDavis - done:

  • Marked ready for review and checked the Review Readiness boxes.
  • Updated the description to drop the "pending maintainer agreement" wording.
  • Fixed the template/governance flag (it was my PR body having CRLF line endings - normalized to LF; label is now ready for review).

The only remaining non-green status is that 4 workflows are awaiting maintainer approval to run (first-time contributor) — CI, Wrap E2E, Init E2E, and Deploy Documentation, all currently action_required. I can't trigger those myself. Could a maintainer approve the run so CI can go green?

@purva-8 purva-8 requested a review from JerrettDavis June 20, 2026 19:34
@github-actions github-actions Bot added status: ready for review Pull request body is complete and the author marked it ready for human review and removed status: needs author action Pull request body or readiness checklist still needs author updates labels Jun 20, 2026
Labels

status: ready for review Pull request body is complete and the author marked it ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(learn): weight loops in Headroom Learn

2 participants