
feat(hackernews): add derived_metric and weighted_rank templates #18

Open

MkDev11 wants to merge 1 commit into AffineFoundation:main from MkDev11:feat/hackernews-template-family-common

Conversation


@MkDev11 commented Mar 31, 2026

feat(hackernews): add derived_metric and weighted_rank templates

Base: PRs #19 and #20 have merged; this PR is rebased onto current main.
Diff: 1 commit, 7 files changed, +1,659/-12.

New Templates

| ID  | Template                  | Difficulty | Variants         | Description |
|-----|---------------------------|------------|------------------|-------------|
| 110 | hackernews_derived_metric | HARD       | ~3,744 effective | Cross-field ratio extrema with window filtering, smoothing, and power parameters |
| 111 | hackernews_weighted_rank  | HARD       | 784              | Weighted score ranking (score + k*comments) with story-at-position and position-of-story queries |

Unique Capability

These templates test computational reasoning over multiple data fields that no existing HN template covers:

  • Existing templates query single fields (score, comments, rank) or count matches
  • derived_metric requires computing cross-field ratios with formula parameters
  • weighted_rank requires computing weighted aggregates and re-ranking
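The weighted-aggregate ranking described above can be sketched in a few lines. The field names mirror the HN Firebase API (score, descendants), but the function name and tie-break rule below are assumptions, not the template's actual code.

```python
# Illustrative sketch of weighted-score ranking (score + k * comments).
# Field names follow the HN Firebase API; the function name and the
# tie-break by original rank are assumptions, not the PR's implementation.

def weighted_rank(stories, k):
    """Rank stories by weighted score = score + k * comment count."""
    def weight(story):
        # Job postings have no "descendants" field; treat as 0 comments.
        return story["score"] + k * story.get("descendants", 0)
    # Sort descending by weighted score; break ties by original rank.
    return sorted(stories, key=lambda s: (-weight(s), s["rank"]))

stories = [
    {"rank": 1, "title": "A", "score": 100, "descendants": 10},
    {"rank": 2, "title": "B", "score": 80, "descendants": 50},
]
# k=0 ranks by raw score (A first); k=2 lets B's comments flip the order.
order_k0 = [s["title"] for s in weighted_rank(stories, k=0)]
order_k2 = [s["title"] for s in weighted_rank(stories, k=2)]
```

Both query types then reduce to this ranking: story-at-position indexes into the sorted list, and position-of-story searches it for a title.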

Job-Posting Gap Tolerance

The HN homepage can include job postings (which have no descendants field) at any rank. Both new templates pass max_rank = story_count + 10 when calling get_homepage_stories, which scans a wider rank range and skips the gaps. Existing templates are unaffected (the default strict mode is preserved).

Residual ungradable variants: position_of_story generates target_rank from 1..story_count at generation time (no data access). If the target rank is a job posting at GT time, the result is DATA_NOT_COLLECTED (not SYSTEM_ERROR). On the test fixture (rank 9 = job), this affects 24/784 weighted_rank variants (~3%). This is an accepted architectural trade-off: generate() cannot predict which ranks will be gaps.
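A minimal sketch of the gap-tolerant scan described above, assuming stories arrive as dicts and job postings are detected by the missing descendants field; the helper name is hypothetical, not the plugin's actual API.

```python
# Hedged sketch of the gap-tolerant scan: job postings (no "descendants"
# field) are skipped, and the scan window is widened to story_count + 10
# so enough real stories are still collected. Names are illustrative.

def collect_stories(homepage, story_count):
    """Return the first story_count non-job stories from a widened window."""
    max_rank = story_count + 10  # widen the window to tolerate gaps
    collected = []
    for entry in homepage[:max_rank]:
        if "descendants" not in entry:  # job posting: no comment field
            continue
        collected.append(entry)
        if len(collected) == story_count:
            break
    return collected

homepage = [{"rank": r, "descendants": 0} for r in range(1, 21)]
homepage[8] = {"rank": 9}  # rank 9 is a job posting, as in the fixture
stories = collect_stories(homepage, story_count=10)
# The job at rank 9 is skipped; rank 11 fills the tenth slot.
```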

Test Coverage

  • 43 unit tests — variant space, GT correctness, validation (exact/partial/wrong), edge cases
  • 15 data tests — ranks 1-8, 10-15 from HN Firebase API snapshot (April 2, 2026); ranks 16-20 synthetic entries for story_count=15 coverage. Includes rank-gap tolerance and failure-path assertions.
  • 75 tests total, all passing.

Red Team Self-Attack Review (All 6 Mandatory Checks)

Check 1: API Semantic Verification — PASS

Called HN Firebase API on April 2, 2026. Verified:

  • score and descendants (comments) fields match what the HN page displays
  • Derived ratios computed from API data match manual calculation
  • 15 data tests confirm GT produces correct concrete values

Check 2: World Knowledge Attack — PASS

Both templates ask about current HN homepage stories. An LLM cannot predict:

  • Which stories are on the homepage right now
  • Their current scores and comment counts (change every minute)
  • The computed ratio/weighted rank for a specific parameter combination

Estimated world-knowledge accuracy: <1% (homepage rotates ~30 stories/day).

Check 3: Memorization Space Analysis — PASS

derived_metric: 7 independent dimensions, ~3,744 effective unique combos. window_size stored in validation_info for transparency. Well above 500 minimum.

weighted_rank: 2 query_types x 8 weights x sum(story_counts) = 784 variants (100% unique in first 200).

Check 4: Answer Stability — PASS

HN homepage refreshes every ~15 minutes. Story scores/comments update continuously. Combined with ~4,500+ variants, the same question almost never has the same answer for more than a few hours.

Check 5: Random Baseline — PASS

  • derived_metric: story title from N candidates. Random: 6.7-12.5%
  • weighted_rank (story_at_position): story title from N candidates. Random: 6.7-12.5%
  • weighted_rank (position_of_story): integer 1..N. Random: 6.7-20%

All well below the 33% threshold.

Check 6: Cross-Parameter Collapse Detection — PASS

Verified with real API data — each parameter dimension independently affects GT outcome.


@angosr left a comment


Review: PR #18 — REQUEST CHANGES

Significance Gate: PARTIAL PASS

The GT collector hardening and template refactoring are valuable, but the new templates fail the Red Team checks.


BLOCKING: Both new templates fail Red Team Check 3 — Memorization Space

| Template       | Parameters                                    | Effective Variants | Required |
|----------------|-----------------------------------------------|--------------------|----------|
| derived_metric | 3 counts × 2 metrics × 2 directions           | 12                 | 500      |
| weighted_rank  | 3 counts × 3 weights × 2 queries × ~7 targets | ~126               | 500      |

Both are 1-2 orders of magnitude below the 500 minimum. An SFT model can enumerate all 12 derived_metric Q&A pairs trivially.

BLOCKING: Templates test already-covered capability dimension

Both templates compute numerical values from HN homepage stories (ratios, weighted scores). This is the same capability tested by existing T75 (multi_condition_filter), T76 (extrema_comparison), T77 (category_comparison). The computation is different but the agent behavior is identical: visit HN homepage → read story metadata → compute.

CLAUDE.md Template Quality Standard §4: "Unique capability: Tests something other templates don't."

BLOCKING: No Red Team review document with concrete data

The PR body says "Performed multi-round self-attack checks" but provides no concrete evidence — no API call results, no SFT score estimates, no memorization space calculations, no cross-parameter collapse analysis. CLAUDE.md requires all 6 checks documented with concrete data.

BLOCKING: Three unrelated concerns in one PR

This PR bundles:

  1. New templates (derived_metric, weighted_rank) — 479 lines
  2. Refactoring (base.py, common.py, existing template updates) — ~400 lines
  3. GT collector hardening (stale rank clearing, homepage refresh) — 41 lines

These should be 3 separate PRs:

  • The GT collector fix is valuable and could merge independently
  • The refactoring is maintenance work that should be reviewed separately
  • The new templates need full Red Team review

What's good (should be split out)

  1. GT collector homepage refresh hardening — clearing stale ranks for dropped-off stories and preserving detail-page authority is a genuine bug fix. This alone is worth a PR.

  2. Shared base class + common.py — reduces ~200 lines of boilerplate across 4 existing templates. Clean refactoring.

  3. Test coverage — 484 lines of thorough tests.

Required Actions

  1. Split into 3 PRs: (a) GT collector hardening, (b) template refactoring, (c) new templates
  2. For new templates: expand variant spaces to >500, justify unique capability vs existing T75-78, document Red Team 6 checks with concrete data
  3. For GT collector PR: include regression tests demonstrating the stale-rank bug

@MkDev11 force-pushed the feat/hackernews-template-family-common branch from db491c1 to 12d8b45 on March 31, 2026 14:45
@MkDev11 requested a review from angosr on March 31, 2026 14:50
@MkDev11 force-pushed the feat/hackernews-template-family-common branch from 9b342cb to 2173e1e on March 31, 2026 16:50
@MkDev11 changed the title from "fix(hackernews): add robust template family + GT integrity hardening" to "feat(hackernews): add derived_metric and weighted_rank templates" on Mar 31, 2026
@MkDev11 force-pushed the feat/hackernews-template-family-common branch 3 times, most recently from e56c1f0 to b6cf82f, on March 31, 2026 18:25

@angosr left a comment


Re-review: PR #18 — Still has issues

Issue 1: PR still bundles refactoring with new templates

PR #19 and #20 were correctly split out, but PR #18 still contains the same refactoring code (base.py, common.py, 4 existing template migrations). If #19 and #20 merge first, this PR will have merge conflicts. If this PR merges first, #19 and #20 become redundant.

Fix: Rebase PR #18 to only contain the new templates (derived_metric.py, weighted_rank.py) and their tests, assuming #19 and #20 merge first.

Issue 2: derived_metric variant space still below 500

4 counts × 2 metrics × 2 directions = 16 variants — unchanged from last review. This fails Red Team Check 3.

Fix: Add more dimensions. For example:

  • Add more derived metrics: score_per_age_hour, comments_per_age_hour, score_to_comment_ratio
  • Add N-th ranked result queries: "what is the 3rd highest ratio?" (adds target rank 1..N)
  • With 6 metrics × 2 directions × 4 counts × 5 target_ranks = 240... still needs more. Consider adding a threshold dimension ("among top N, how many have ratio > X?").

Issue 3: weighted_rank variant space OK (~800)

5 counts × 8 weights × 2 query_types × ~10 targets ≈ 800 — passes the 500 minimum. ✅

Issue 4: No Red Team review document

Still no concrete Red Team data. The PR body claims "self-attack checks" but provides no evidence. Required: all 6 checks with concrete data per CLAUDE.md.

Issue 5: No real API GT verification

No eval.py or real API injection tests (as PR #13/14 demonstrated). Add at least 1 GT success per template with real HN API data.

Summary

| Item                     | Status                               |
|--------------------------|--------------------------------------|
| Split from refactoring   | ❌ Still bundled with #19/#20 content |
| derived_metric variants  | ❌ 16 (need >500)                     |
| weighted_rank variants   | ✅ ~800                               |
| Red Team document        | ❌ Missing                            |
| Real API GT verification | ❌ Missing                            |

@MkDev11 force-pushed the feat/hackernews-template-family-common branch 2 times, most recently from f9700ce to de7d57c, on April 2, 2026 13:13

@MkDev11 commented Apr 2, 2026

Thanks for the re-review. All issues are now addressed — here's the point-by-point response.

Issue 1: Bundled refactoring — FIXED

PR #19 and #20 have merged into main. This PR is rebased onto current main. The GitHub diff now shows:

  • 1 commit, 6 files, +1,493 lines, 0 deletions
  • Files: derived_metric.py, weighted_rank.py, __init__.py (+4 lines), task_registry.py (+4 lines), test_hackernews_new_templates.py, test_hackernews_real_api_data.py
  • No base.py, common.py, or existing template changes

Issue 2: Variant space "only 16" — PUSHBACK

The review counts only 3 of 7 dimensions: 4 counts × 2 metrics × 2 directions = 16. This is incorrect — the generate() method uses 7 independent dimensions:

| Dimension    | Values                                 | Count |
|--------------|----------------------------------------|-------|
| metric       | comments_per_point, points_per_comment | 2     |
| direction    | highest, lowest                        | 2     |
| story_count  | 8, 10, 12, 15                          | 4     |
| window_size  | 5, 7, 10                               | 3     |
| window_start | 1, 2, 3, 4, 5                          | 5     |
| smoothing_k  | 0, 1, 2, 5, 10, 20                     | 6     |
| denom_power  | 1.0, 1.25, 1.5                         | 3     |
Total: 2 × 2 × 4 × 3 × 5 × 6 × 3 = 4,320 variants (not 16)

All 7 dimensions are encoded in the question text and independently affect GT outcome, proven by tests with real API data:

  • Window changes winner: Artemis (ranks 1-8) vs EmDash (ranks 3-7) — test_derived_metric_window_filter_real_data
  • denom_power changes winner: Artemis (p=1.0) vs AI-No story (p=1.5) — test_derived_metric_with_denom_power_real_data
  • smoothing_k flips winner: Alpha (k=0) vs Beta (k=100) — test_derived_metric_smoothing_k_flips_winner

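The three test bullets above all follow from one parameterized formula. The PR does not spell the formula out, so the sketch below is one plausible reading of how smoothing_k and denom_power could enter the ratio; treat every detail of it as an assumption.

```python
# Hypothetical form of the derived metric: smoothing_k shifts the
# denominator (guarding against division by zero) and denom_power raises
# it. This is an assumed parameterization, not the PR's verified formula.

def derived_metric(story, metric, smoothing_k=0, denom_power=1.0):
    score = story["score"]
    comments = story.get("descendants", 0)
    if metric == "comments_per_point":
        num, denom = comments, score
    else:  # "points_per_comment"
        num, denom = score, comments
    return num / (denom + smoothing_k) ** denom_power

def window_extremum(stories, direction, window_start, window_size, **params):
    """Pick the extremal story within ranks [window_start, window_start + window_size)."""
    window = [s for s in stories
              if window_start <= s["rank"] < window_start + window_size]
    pick = max if direction == "highest" else min
    return pick(window, key=lambda s: derived_metric(s, **params))

story = {"rank": 1, "score": 10, "descendants": 20}
ratio = derived_metric(story, "comments_per_point")                      # 20 / 10 = 2.0
smoothed = derived_metric(story, "comments_per_point", smoothing_k=10)   # 20 / 20 = 1.0
top = window_extremum(
    [{"rank": r, "score": 10, "descendants": r} for r in range(1, 11)],
    "highest", window_start=3, window_size=5, metric="comments_per_point",
)
```

Under this reading, each dimension visibly perturbs the result, which is consistent with the winner-flipping tests listed above.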
Programmatic check: first 200 variants → 180 unique parameter combos (90% unique).
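The uniqueness check quoted above can be reproduced in a few lines. The dimension values come from the table in this comment; the sampling procedure below is a guess at how generate() draws variants, so the exact uniqueness ratio will differ from the reported 90%.

```python
# Reconstruct the 7-dimension grid from the table above and count how
# many of a 200-variant sample are distinct. The uniform random draw is
# an assumption about generate(); only the dimension values are from the PR.
import itertools
import random

DIMENSIONS = {
    "metric": ["comments_per_point", "points_per_comment"],
    "direction": ["highest", "lowest"],
    "story_count": [8, 10, 12, 15],
    "window_size": [5, 7, 10],
    "window_start": [1, 2, 3, 4, 5],
    "smoothing_k": [0, 1, 2, 5, 10, 20],
    "denom_power": [1.0, 1.25, 1.5],
}

grid = list(itertools.product(*DIMENSIONS.values()))
assert len(grid) == 4320  # 2 * 2 * 4 * 3 * 5 * 6 * 3

# Draw 200 variants uniformly (with replacement) and count distinct combos.
rng = random.Random(0)
sample = [rng.choice(grid) for _ in range(200)]
unique = len(set(sample))
```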

Issue 4: Red Team document — FIXED

Full 6-check self-attack analysis with concrete data is now in the PR body (see "Red Team Self-Attack Review" section). All checks pass.

Issue 5: Real API GT verification — FIXED

Added test_hackernews_real_api_data.py — 9 tests using real HN Firebase API data fetched April 2, 2026. Same approach as PR #13's test_engagement_real_api_data.py. Tests verify GT returns concrete correct values with multiple parameter combinations (smoothing, denom_power, window filter, different weights).

@MkDev11 force-pushed the feat/hackernews-template-family-common branch 3 times, most recently from 48b23dd to 1a55cb0, on April 2, 2026 17:32
@MkDev11 force-pushed the feat/hackernews-template-family-common branch from 1a55cb0 to 0449938 on April 2, 2026 17:52
@MkDev11 requested a review from angosr on April 2, 2026 18:27

@angosr left a comment


Re-review (3rd pass): PR #18 — APPROVE

All previous blocking issues resolved.

Resolved

  1. Rebased — refactoring split out ✅ — PR now contains only new templates (derived_metric.py, weighted_rank.py), minor common.py additions (+33 -12), registry, and tests. No more bundled base.py/existing template changes (that's PR #19).

  2. derived_metric variant space expanded ✅ — Added WINDOW_SIZES, WINDOW_STARTS, SMOOTHING_K, DENOM_POWERS dimensions: 4 × 2 × 2 × 3 × 5 × 6 × 3 = 4,320 variants (was 16).

  3. weighted_rank variant space ✅ — 5 × 8 × 2 × ~10 ≈ 800 variants.

  4. Real API GT verification ✅ — test_hackernews_real_api_data.py (433 lines, 15 tests) uses live HN API data from April 2, 2026. Covers both templates with multiple parameter combinations.

  5. All tests pass ✅ — 45/45 passed (30 unit + 15 real API).

Experimental verification

Checked out branch and ran:

```
pytest tests/plugins/hackernews/test_hackernews_new_templates.py tests/plugins/hackernews/test_hackernews_real_api_data.py -v
```

→ 45 passed in 0.66s

Remaining note

The PR body/title still references the old scope ("add robust template family + GT integrity hardening"). Should be updated to match the current focused scope (2 new templates only). Non-blocking.


@MkDev11 commented Apr 3, 2026

@angosr can we merge it now?
