feat(learn): weight loops in Headroom Learn + RTK-loop eval #1160
purva-8 wants to merge 2 commits into
Headroom Learn ranked recommendations by a single LLM-guessed estimated_tokens_saved with a flat hardcoded confidence, and had no notion of a loop. Two consequences:

- RTK re-fetch loops were invisible: RTK truncates a command's output, so the agent re-runs larger-limit variants to fetch more. Those calls SUCCEED (is_error=False), and analyze() early-returned when a session had no failures and no events, skipping the loop entirely.
- Even when surfaced, a loop ranked no higher than a one-off mistake.

Add headroom/learn/loops.py:

- detect_loops(): group calls by a canonical signature that collapses RTK pagination/limit variants, flag >=3 repeats, classify error vs rtk-refetch loops, and compute MEASURED wasted tokens.
- format_loops_for_digest(): surface loops as a highest-priority digest section so the analyzer LLM sees them.
- apply_loop_weighting(): raise a matching recommendation's savings to at least the loop's measured waste and tag it as a loop guardrail, so loops outrank one-offs deterministically.

Wire into analyzer.py: detect loops up front (fixes the no-failure early-return), lead the digest with them, prioritize loops in the system prompt, and re-sort after weighting. Add is_loop_guardrail / loop_occurrences to Recommendation.

Eval: benchmarks/rtk_loop_learn_eval.py reproduces the loop, runs learn, scores the guardrail (produced/ranked-first/names-command/prescribes-fix/weight-reflects-waste), then injects it and asserts a guarded session does not re-trigger it. Deterministic in CI; real-LLM via --real.

Tests: tests/test_learn/test_loop_weighting.py (13) and test_rtk_loop_eval.py. Design notes in docs/rtk-loop-weighting.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR governance
This PR follows the template and is marked ready for human review.
Correction (resolved). Disregard my CRLF diagnosis above — I had it wrong. The The regex ( |
…igure

The design doc stated the LLM 'rated the eval's loop at 150 tokens vs ~5,000 measured' as an empirical result. 150 was the deterministic stub's constant, never a real model output, so the claim was removed. The doc now states that the implementation is a hybrid (digest prompt hint + post-hoc fuzzy boost) and that the prompt hint is currently load-bearing, while the post-hoc boost is fuzzy-match-based and does not always fire. The stub comment is clarified as a simulated value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
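For readers following the hybrid description above, here is a minimal sketch of what a post-hoc fuzzy boost could look like, and why it might not fire. It is not the code in headroom/learn/loops.py: the `Recommendation` fields mirror the names in this PR, but the token-overlap matcher, the `threshold` parameter, the dict shape of a loop, and the function name `apply_loop_weighting_sketch` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Recommendation:
    """Simplified stand-in; only the field names come from this PR."""
    text: str
    estimated_tokens_saved: int
    is_loop_guardrail: bool = False
    loop_occurrences: int = 0


def _tokens(s: str) -> set:
    # Lowercased word-ish tokens; keeps flags such as "-rn" intact.
    return set(re.findall(r"[\w-]+", s.lower()))


def _overlap(signature: str, text: str) -> float:
    """Fraction of the loop signature's tokens that appear in the rule text."""
    sig = _tokens(signature)
    return len(sig & _tokens(text)) / len(sig) if sig else 0.0


def apply_loop_weighting_sketch(recs, loops, threshold=0.6):
    """Raise the best-matching recommendation to at least the measured waste.

    Assumption: each loop is a dict with "signature", "wasted_tokens" and
    "occurrences". If no recommendation clears the threshold, nothing is
    boosted, which is the "does not always fire" case.
    """
    for loop in loops:
        best = max(recs, key=lambda r: _overlap(loop["signature"], r.text), default=None)
        if best is None or _overlap(loop["signature"], best.text) < threshold:
            continue  # fuzzy match did not fire for this loop
        best.estimated_tokens_saved = max(best.estimated_tokens_saved, loop["wasted_tokens"])
        best.is_loop_guardrail = True
        best.loop_occurrences = loop["occurrences"]
    # Re-sort so boosted loop guardrails outrank one-off recommendations.
    return sorted(recs, key=lambda r: r.estimated_tokens_saved, reverse=True)
```

A rule that names the looping command is boosted and re-ranked first; a rule that the model phrases more generally can fall below the threshold and stay untagged, which is why the prompt hint still carries the ranking in that case.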
JerrettDavis left a comment
This PR is not merge-ready yet. The body still says the review-readiness box is intentionally unchecked pending maintainer agreement, and GitHub reports mergeStateStatus=UNSTABLE rather than clean/green. Please mark it ready, ensure the required CI surface is green, and update the description once the approach is no longer pending.
Thanks @JerrettDavis - done:
The only remaining non-green status is that 4 workflows are awaiting maintainer approval to run (first-time contributor).
Description
headroom learn ranked recommendations by a single LLM-guessed estimated_tokens_saved with a flat hardcoded confidence, and had no notion of a loop. So (1) RTK re-fetch loops were invisible - RTK truncates a command's output, the agent re-runs larger-limit variants, those calls succeed (is_error=False), and analyze() even early-returned when a session had no failures and no events - and (2) even when surfaced, a loop ranked no higher than a one-off mistake. This adds loop-aware weighting plus the eval that reproduces an RTK loop, runs it through Learn, and checks the guardrail prevents re-triggering.
Closes #1159
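To make the description concrete, here is a rough sketch of signature-based loop detection over successful calls. It is an illustration under stated assumptions, not the detect_loops() in this PR: the `Call` record, the regex patterns for limit variants, the rule that any errored call makes the loop an "error" loop, and the choice to count every call except the last as waste are all hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical record of one tool call in a session."""
    command: str
    is_error: bool
    output_tokens: int


# Assumed spellings of pagination/limit variants; the real rules may differ.
_LIMIT_PATTERNS = [
    re.compile(r"\|\s*(head|tail)\s+-n?\s*\d+"),  # "| head -50", "| tail -n 200"
    re.compile(r"--(limit|offset|page)[= ]\d+"),  # "--limit 100", "--page=2"
]


def canonical_signature(command: str) -> str:
    """Strip limit variants so re-fetches of one command share a signature."""
    sig = command
    for pattern in _LIMIT_PATTERNS:
        sig = pattern.sub("", sig)
    return " ".join(sig.split())  # normalize whitespace


def detect_loops_sketch(calls, min_repeats=3):
    """Group calls by signature and flag groups with >= min_repeats calls."""
    groups = defaultdict(list)
    for call in calls:
        groups[canonical_signature(call.command)].append(call)

    loops = []
    for signature, group in groups.items():
        if len(group) < min_repeats:
            continue
        # Assumption: any errored call marks an error loop; otherwise the
        # calls all succeeded and the repetition is a re-fetch loop.
        kind = "error" if any(c.is_error for c in group) else "rtk-refetch"
        # Assumption: everything before the final call was wasted output.
        wasted = sum(c.output_tokens for c in group[:-1])
        loops.append({
            "signature": signature,
            "kind": kind,
            "occurrences": len(group),
            "wasted_tokens": wasted,
        })
    return sorted(loops, key=lambda l: l["wasted_tokens"], reverse=True)
```

Because detection runs on all calls rather than only failed ones, a session with no errors still yields a loop here, which is the case the old early-return skipped.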
Type of Change
Changes Made
- headroom/learn/loops.py: detect_loops() (canonical signature collapses RTK pagination/limit variants; classifies error vs rtk-refetch loops; measured wasted tokens), format_loops_for_digest(), apply_loop_weighting().
- analyzer.py: detect loops up front (fixes the no-failure early-return), lead the digest with them, prioritize loops in the system prompt, re-sort after weighting.
- models.py: Recommendation.is_loop_guardrail / loop_occurrences.
- benchmarks/rtk_loop_learn_eval.py + headroom/learn/fixtures.py: the two-phase RTK-loop eval and its session fixtures.
- docs/rtk-loop-weighting.md, CHANGELOG entry.
Testing
- pytest
- ruff check .
- mypy headroom: not run (mypy not in my minimal env; see Not tested)
Test Output
Real Behavior Proof
- pip install -e minus the optional hnswlib/proxy extras, which are unrelated to learn); real LLM via the analyzer's claude CLI backend (HEADROOM_LEARN_CLI=claude, claude-cli 2.1.158), no API key used.
- HEADROOM_LEARN_CLI=claude python -c "from benchmarks.rtk_loop_learn_eval import run_eval; c=run_eval(use_real_llm=True); print(c.render())"
- apply_loop_weighting fuzzy match fires, vary across runs (in one run it did not tag the rule). The deterministic CI eval (stub LLM) is the stable, reproducible artifact; this real run corroborates it.
- mypy; a live agent obeying the written rule end-to-end (Phase 2 is a non-recurrence check, not a live agent; called out in the doc).

Real model output from this run, ranked #1 at the measured 5,005-token weight:
(One real-mode run via the claude CLI backend. The deterministic pytest eval above is the stable artifact; see the run-dependence caveat under Observed result.)

The real run also caught an over-brittle check: an earlier names_command required the literal "TimeoutError"; the real model wrote a more general rule (grep + head -N) without it, so I fixed the check to verify the looping command is named, not an incidental literal.
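The difference between the brittle check and the fixed one can be sketched roughly as below. This is a guess at the shape of the check, not the eval's actual code: the function signature, the use of the first shell word as "the command", and the example strings are hypothetical.

```python
import re
import shlex


def names_command(rule_text: str, looping_command: str) -> bool:
    """Return True if the rule names the program that looped.

    Assumption: "the looping command" is identified by its first shell word
    (for example "grep"), not by any argument that happened to be searched for.
    """
    program = shlex.split(looping_command)[0]
    pattern = rf"\b{re.escape(program)}\b"
    return re.search(pattern, rule_text, re.IGNORECASE) is not None


def names_command_brittle(rule_text: str) -> bool:
    """The earlier style of check: requires an incidental literal."""
    return "TimeoutError" in rule_text
```

A general rule such as "Do not re-run grep with a larger head -N; narrow the pattern first." passes the first check and fails the second, which matches the behaviour described above.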
Checklist
Additional Notes