feat(hackernews): add derived_metric and weighted_rank templates#18
feat(hackernews): add derived_metric and weighted_rank templates#18MkDev11 wants to merge 1 commit intoAffineFoundation:mainfrom
Conversation
angosr
left a comment
There was a problem hiding this comment.
Review: PR #18 — REQUEST CHANGES
Significance Gate: PARTIAL PASS
The GT collector hardening and template refactoring are valuable. The new templates fail Red Team checks.
BLOCKING: Both new templates fail Red Team Check 3 — Memorization Space
| Template | Parameters | Effective Variants | Required |
|---|---|---|---|
derived_metric |
3 counts × 2 metrics × 2 directions | 12 | 500 |
weighted_rank |
3 counts × 3 weights × 2 queries × ~7 targets | ~126 | 500 |
Both are 1-2 orders of magnitude below the 500 minimum. An SFT model can enumerate all 12 derived_metric Q&A pairs trivially.
BLOCKING: Templates test already-covered capability dimension
Both templates compute numerical values from HN homepage stories (ratios, weighted scores). This is the same capability tested by existing T75 (multi_condition_filter), T76 (extrema_comparison), T77 (category_comparison). The computation is different but the agent behavior is identical: visit HN homepage → read story metadata → compute.
CLAUDE.md Template Quality Standard §4: "Unique capability: Tests something other templates don't."
BLOCKING: No Red Team review document with concrete data
The PR body says "Performed multi-round self-attack checks" but provides no concrete evidence — no API call results, no SFT score estimates, no memorization space calculations, no cross-parameter collapse analysis. CLAUDE.md requires all 6 checks documented with concrete data.
BLOCKING: Three unrelated concerns in one PR
This PR bundles:
- New templates (derived_metric, weighted_rank) — 479 lines
- Refactoring (base.py, common.py, existing template updates) — ~400 lines
- GT collector hardening (stale rank clearing, homepage refresh) — 41 lines
These should be 3 separate PRs:
- The GT collector fix is valuable and could merge independently
- The refactoring is maintenance work that should be reviewed separately
- The new templates need full Red Team review
What's good (should be split out)
-
GT collector homepage refresh hardening — clearing stale ranks for dropped-off stories and preserving detail-page authority is a genuine bug fix. This alone is worth a PR.
-
Shared base class + common.py — reduces ~200 lines of boilerplate across 4 existing templates. Clean refactoring.
-
Test coverage — 484 lines of thorough tests.
Required Actions
- Split into 3 PRs: (a) GT collector hardening, (b) template refactoring, (c) new templates
- For new templates: expand variant spaces to >500, justify unique capability vs existing T75-78, document Red Team 6 checks with concrete data
- For GT collector PR: include regression tests demonstrating the stale-rank bug
db491c1 to
12d8b45
Compare
9b342cb to
2173e1e
Compare
e56c1f0 to
b6cf82f
Compare
angosr
left a comment
There was a problem hiding this comment.
Re-review: PR #18 — Still has issues
Issue 1: PR still bundles refactoring with new templates
PR #19 and #20 were correctly split out, but PR #18 still contains the same refactoring code (base.py, common.py, 4 existing template migrations). If #19 and #20 merge first, this PR will have merge conflicts. If this PR merges first, #19 and #20 become redundant.
Fix: Rebase PR #18 to only contain the new templates (derived_metric.py, weighted_rank.py) and their tests, assuming #19 and #20 merge first.
Issue 2: derived_metric variant space still below 500
4 counts × 2 metrics × 2 directions = 16 variants — unchanged from last review. This fails Red Team Check 3.
Fix: Add more dimensions. For example:
- Add more derived metrics:
score_per_age_hour,comments_per_age_hour,score_to_comment_ratio - Add N-th ranked result queries: "what is the 3rd highest ratio?" (adds target rank 1..N)
- With 6 metrics × 2 directions × 4 counts × 5 target_ranks = 240... still needs more. Consider adding a threshold dimension ("among top N, how many have ratio > X?").
Issue 3: weighted_rank variant space OK (~800)
5 counts × 8 weights × 2 query_types × ~10 targets ≈ 800 — passes the 500 minimum. ✅
Issue 4: No Red Team review document
Still no concrete Red Team data. The PR body claims "self-attack checks" but provides no evidence. Required: all 6 checks with concrete data per CLAUDE.md.
Issue 5: No real API GT verification
No eval.py or real API injection tests (as PR #13/14 demonstrated). Add at least 1 GT success per template with real HN API data.
Summary
| Item | Status |
|---|---|
| Split from refactoring | ❌ Still bundled with #19/#20 content |
| derived_metric variants | ❌ 16 (need >500) |
| weighted_rank variants | ✅ ~800 |
| Red Team document | ❌ Missing |
| Real API GT verification | ❌ Missing |
f9700ce to
de7d57c
Compare
|
Thanks for the re-review. All issues are now addressed — here's the point-by-point response. Issue 1: Bundled refactoring — FIXEDPR #19 and #20 have merged into
Issue 2: Variant space "only 16" — PUSHBACKThe review counts only 3 of 7 dimensions:
Total: 2 × 2 × 4 × 3 × 5 × 6 × 3 = 4,320 variants (not 16) All 7 dimensions are encoded in the question text and independently affect GT outcome, proven by tests with real API data:
Programmatic check: first 200 variants → 180 unique parameter combos (90% unique). Issue 4: Red Team document — FIXEDFull 6-check self-attack analysis with concrete data is now in the PR body (see "Red Team Self-Attack Review" section). All checks pass. Issue 5: Real API GT verification — FIXEDAdded |
48b23dd to
1a55cb0
Compare
1a55cb0 to
0449938
Compare
angosr
left a comment
There was a problem hiding this comment.
Re-review (3rd pass): PR #18 — APPROVE
All previous blocking issues resolved.
Resolved
-
Rebased — refactoring split out ✅ — PR now contains only new templates (derived_metric.py, weighted_rank.py), minor common.py additions (+33 -12), registry, and tests. No more bundled base.py/existing template changes (that's PR #19).
-
derived_metric variant space expanded ✅ — Added
WINDOW_SIZES,WINDOW_STARTS,SMOOTHING_K,DENOM_POWERSdimensions:4 × 2 × 2 × 3 × 5 × 6 × 3 = 4,320 variants(was 16). -
weighted_rank variant space ✅ —
5 × 8 × 2 × ~10 ≈ 800 variants. -
Real API GT verification ✅ —
test_hackernews_real_api_data.py(433 lines, 15 tests) uses live HN API data from April 2, 2026. Covers both templates with multiple parameter combinations. -
All tests pass ✅ — 45/45 passed (30 unit + 15 real API).
Experimental verification
Checked out branch and ran:
pytest tests/plugins/hackernews/test_hackernews_new_templates.py tests/plugins/hackernews/test_hackernews_real_api_data.py -v
→ 45 passed in 0.66s
Remaining note
The PR body/title still references the old scope ("add robust template family + GT integrity hardening"). Should be updated to match the current focused scope (2 new templates only). Non-blocking.
|
@angosr can we merge it now? |
feat(hackernews): add
derived_metricandweighted_ranktemplatesBase: PR #19 and #20 have merged. This PR is rebased onto current
main.Diff: 1 commit, 7 files changed, +1,659/-12.
New Templates
hackernews_derived_metrichackernews_weighted_rankUnique Capability
These templates test computational reasoning over multiple data fields that no existing HN template covers:
derived_metricrequires computing cross-field ratios with formula parametersweighted_rankrequires computing weighted aggregates and re-rankingJob-Posting Gap Tolerance
HN homepage can include job postings (no
descendantsfield) at any rank. Both new templates usemax_rank=story_count + 10when callingget_homepage_stories, which scans a wider rank range and skips gaps. Existing templates are unaffected (default strict mode preserved).Residual ungradable variants:
position_of_storygeneratestarget_rankfrom1..story_countat generation time (no data access). If the target rank is a job posting at GT time, the result isDATA_NOT_COLLECTED(notSYSTEM_ERROR). On the test fixture (rank 9 = job), this affects 24/784 weighted_rank variants (~3%). This is an accepted architectural trade-off:generate()cannot predict which ranks will be gaps.Test Coverage
Red Team Self-Attack Review (All 6 Mandatory Checks)
Check 1: API Semantic Verification — PASS
Called HN Firebase API on April 2, 2026. Verified:
scoreanddescendants(comments) fields match what the HN page displaysCheck 2: World Knowledge Attack — PASS
Both templates ask about current HN homepage stories. An LLM cannot predict:
Estimated world-knowledge accuracy: <1% (homepage rotates ~30 stories/day).
Check 3: Memorization Space Analysis — PASS
derived_metric: 7 independent dimensions, ~3,744 effective unique combos.window_sizestored invalidation_infofor transparency. Well above 500 minimum.weighted_rank: 2 query_types x 8 weights x sum(story_counts) = 784 variants (100% unique in first 200).Check 4: Answer Stability — PASS
HN homepage refreshes every ~15 minutes. Story scores/comments update continuously. Combined with ~4,500+ variants, the same question almost never has the same answer for more than a few hours.
Check 5: Random Baseline — PASS
derived_metric: story title from N candidates. Random: 6.7-12.5%weighted_rank(story_at_position): story title from N candidates. Random: 6.7-12.5%weighted_rank(position_of_story): integer 1..N. Random: 6.7-20%All well below the 33% threshold.
Check 6: Cross-Parameter Collapse Detection — PASS
Verified with real API data — each parameter dimension independently affects GT outcome.