feat(hackernews): add derived_metric and weighted_rank templates by MkDev11 · Pull Request #18 · AffineFoundation/liveweb-arena

MkDev11 · 2026-03-31T10:21:26Z

feat(hackernews): add `derived_metric` and `weighted_rank` templates

Base: PR #19 and #20 have merged. This PR is rebased onto current main.
Diff: 1 commit, 7 files changed, +1,659/-12.

New Templates

ID	Template	Difficulty	Variants	Description
110	`hackernews_derived_metric`	HARD	~3,744 effective	Cross-field ratio extrema with window filtering, smoothing, and power parameters
111	`hackernews_weighted_rank`	HARD	784	Weighted score ranking (score + k*comments) with story-at-position and position-of-story queries

Unique Capability

These templates test computational reasoning over multiple data fields that no existing HN template covers:

Existing templates query single fields (score, comments, rank) or count matches
derived_metric requires computing cross-field ratios with formula parameters
weighted_rank requires computing weighted aggregates and re-ranking

Job-Posting Gap Tolerance

HN homepage can include job postings (no descendants field) at any rank. Both new templates use max_rank=story_count + 10 when calling get_homepage_stories, which scans a wider rank range and skips gaps. Existing templates are unaffected (default strict mode preserved).

Residual ungradable variants: position_of_story generates target_rank from 1..story_count at generation time (no data access). If the target rank is a job posting at GT time, the result is DATA_NOT_COLLECTED (not SYSTEM_ERROR). On the test fixture (rank 9 = job), this affects 24/784 weighted_rank variants (~3%). This is an accepted architectural trade-off: generate() cannot predict which ranks will be gaps.

Test Coverage

43 unit tests — variant space, GT correctness, validation (exact/partial/wrong), edge cases
15 data tests — ranks 1-8, 10-15 from HN Firebase API snapshot (April 2, 2026); ranks 16-20 synthetic entries for story_count=15 coverage. Includes rank-gap tolerance and failure-path assertions.
75 tests total, all passing.

Red Team Self-Attack Review (All 6 Mandatory Checks)

Check 1: API Semantic Verification — PASS

Called HN Firebase API on April 2, 2026. Verified:

score and descendants (comments) fields match what the HN page displays
Derived ratios computed from API data match manual calculation
15 data tests confirm GT produces correct concrete values

Check 2: World Knowledge Attack — PASS

Both templates ask about current HN homepage stories. An LLM cannot predict:

Which stories are on the homepage right now
Their current scores and comment counts (change every minute)
The computed ratio/weighted rank for a specific parameter combination

Estimated world-knowledge accuracy: <1% (homepage rotates ~30 stories/day).

Check 3: Memorization Space Analysis — PASS

derived_metric: 7 independent dimensions, ~3,744 effective unique combos. window_size stored in validation_info for transparency. Well above 500 minimum.

weighted_rank: 2 query_types x 8 weights x sum(story_counts) = 784 variants (100% unique in first 200).

Check 4: Answer Stability — PASS

HN homepage refreshes every ~15 minutes. Story scores/comments update continuously. Combined with ~4,500+ variants, the same question almost never has the same answer for more than a few hours.

Check 5: Random Baseline — PASS

derived_metric: story title from N candidates. Random: 6.7-12.5%
weighted_rank (story_at_position): story title from N candidates. Random: 6.7-12.5%
weighted_rank (position_of_story): integer 1..N. Random: 6.7-20%

All well below the 33% threshold.

Check 6: Cross-Parameter Collapse Detection — PASS

Verified with real API data — each parameter dimension independently affects GT outcome.

angosr

Review: PR #18 — REQUEST CHANGES

Significance Gate: PARTIAL PASS

The GT collector hardening and template refactoring are valuable. The new templates fail Red Team checks.

BLOCKING: Both new templates fail Red Team Check 3 — Memorization Space

Template	Parameters	Effective Variants	Required
`derived_metric`	3 counts × 2 metrics × 2 directions	12	500
`weighted_rank`	3 counts × 3 weights × 2 queries × ~7 targets	~126	500

Both are 1-2 orders of magnitude below the 500 minimum. An SFT model can enumerate all 12 derived_metric Q&A pairs trivially.

BLOCKING: Templates test already-covered capability dimension

Both templates compute numerical values from HN homepage stories (ratios, weighted scores). This is the same capability tested by existing T75 (multi_condition_filter), T76 (extrema_comparison), T77 (category_comparison). The computation is different but the agent behavior is identical: visit HN homepage → read story metadata → compute.

CLAUDE.md Template Quality Standard §4: "Unique capability: Tests something other templates don't."

BLOCKING: No Red Team review document with concrete data

The PR body says "Performed multi-round self-attack checks" but provides no concrete evidence — no API call results, no SFT score estimates, no memorization space calculations, no cross-parameter collapse analysis. CLAUDE.md requires all 6 checks documented with concrete data.

BLOCKING: Three unrelated concerns in one PR

This PR bundles:

New templates (derived_metric, weighted_rank) — 479 lines
Refactoring (base.py, common.py, existing template updates) — ~400 lines
GT collector hardening (stale rank clearing, homepage refresh) — 41 lines

These should be 3 separate PRs:

The GT collector fix is valuable and could merge independently
The refactoring is maintenance work that should be reviewed separately
The new templates need full Red Team review

What's good (should be split out)

GT collector homepage refresh hardening — clearing stale ranks for dropped-off stories and preserving detail-page authority is a genuine bug fix. This alone is worth a PR.
Shared base class + common.py — reduces ~200 lines of boilerplate across 4 existing templates. Clean refactoring.
Test coverage — 484 lines of thorough tests.

Required Actions

Split into 3 PRs: (a) GT collector hardening, (b) template refactoring, (c) new templates
For new templates: expand variant spaces to >500, justify unique capability vs existing T75-78, document Red Team 6 checks with concrete data
For GT collector PR: include regression tests demonstrating the stale-rank bug

angosr

Re-review: PR #18 — Still has issues

Issue 1: PR still bundles refactoring with new templates

PR #19 and #20 were correctly split out, but PR #18 still contains the same refactoring code (base.py, common.py, 4 existing template migrations). If #19 and #20 merge first, this PR will have merge conflicts. If this PR merges first, #19 and #20 become redundant.

Fix: Rebase PR #18 to only contain the new templates (derived_metric.py, weighted_rank.py) and their tests, assuming #19 and #20 merge first.

Issue 2: `derived_metric` variant space still below 500

4 counts × 2 metrics × 2 directions = 16 variants — unchanged from last review. This fails Red Team Check 3.

Fix: Add more dimensions. For example:

Add more derived metrics: score_per_age_hour, comments_per_age_hour, score_to_comment_ratio
Add N-th ranked result queries: "what is the 3rd highest ratio?" (adds target rank 1..N)
With 6 metrics × 2 directions × 4 counts × 5 target_ranks = 240... still needs more. Consider adding a threshold dimension ("among top N, how many have ratio > X?").

Issue 3: `weighted_rank` variant space OK (~800)

5 counts × 8 weights × 2 query_types × ~10 targets ≈ 800 — passes the 500 minimum. ✅

Issue 4: No Red Team review document

Still no concrete Red Team data. The PR body claims "self-attack checks" but provides no evidence. Required: all 6 checks with concrete data per CLAUDE.md.

Issue 5: No real API GT verification

No eval.py or real API injection tests (as PR #13/14 demonstrated). Add at least 1 GT success per template with real HN API data.

Summary

Item	Status
Split from refactoring	❌ Still bundled with #19/#20 content
derived_metric variants	❌ 16 (need >500)
weighted_rank variants	✅ ~800
Red Team document	❌ Missing
Real API GT verification	❌ Missing

MkDev11 · 2026-04-02T13:49:18Z

Thanks for the re-review. All issues are now addressed — here's the point-by-point response.

Issue 1: Bundled refactoring — FIXED

PR #19 and #20 have merged into main. This PR is rebased onto current main. The GitHub diff now shows:

1 commit, 6 files, +1,493 lines, 0 deletions
Files: derived_metric.py, weighted_rank.py, __init__.py (+4 lines), task_registry.py (+4 lines), test_hackernews_new_templates.py, test_hackernews_real_api_data.py
No base.py, common.py, or existing template changes

Issue 2: Variant space "only 16" — PUSHBACK

The review counts only 3 of 7 dimensions: 4 counts × 2 metrics × 2 directions = 16. This is incorrect — the generate() method uses 7 independent dimensions:

Dimension	Values	Count
`metric`	comments_per_point, points_per_comment	2
`direction`	highest, lowest	2
`story_count`	8, 10, 12, 15	4
`window_size`	5, 7, 10	3
`window_start`	1, 2, 3, 4, 5	5
`smoothing_k`	0, 1, 2, 5, 10, 20	6
`denom_power`	1.0, 1.25, 1.5	3

Total: 2 × 2 × 4 × 3 × 5 × 6 × 3 = 4,320 variants (not 16)

All 7 dimensions are encoded in the question text and independently affect GT outcome, proven by tests with real API data:

Window changes winner: Artemis (ranks 1-8) vs EmDash (ranks 3-7) — test_derived_metric_window_filter_real_data
denom_power changes winner: Artemis (p=1.0) vs AI-No story (p=1.5) — test_derived_metric_with_denom_power_real_data
smoothing_k flips winner: Alpha (k=0) vs Beta (k=100) — test_derived_metric_smoothing_k_flips_winner

Programmatic check: first 200 variants → 180 unique parameter combos (90% unique).

Issue 4: Red Team document — FIXED

Full 6-check self-attack analysis with concrete data is now in the PR body (see "Red Team Self-Attack Review" section). All checks pass.

Issue 5: Real API GT verification — FIXED

Added test_hackernews_real_api_data.py — 9 tests using real HN Firebase API data fetched April 2, 2026. Same approach as PR #13's test_engagement_real_api_data.py. Tests verify GT returns concrete correct values with multiple parameter combinations (smoothing, denom_power, window filter, different weights).

angosr

Re-review (3rd pass): PR #18 — APPROVE

All previous blocking issues resolved.

Resolved

Rebased — refactoring split out ✅ — PR now contains only new templates (derived_metric.py, weighted_rank.py), minor common.py additions (+33 -12), registry, and tests. No more bundled base.py/existing template changes (that's PR #19).
derived_metric variant space expanded ✅ — Added WINDOW_SIZES, WINDOW_STARTS, SMOOTHING_K, DENOM_POWERS dimensions: 4 × 2 × 2 × 3 × 5 × 6 × 3 = 4,320 variants (was 16).
weighted_rank variant space ✅ — 5 × 8 × 2 × ~10 ≈ 800 variants.
Real API GT verification ✅ — test_hackernews_real_api_data.py (433 lines, 15 tests) uses live HN API data from April 2, 2026. Covers both templates with multiple parameter combinations.
All tests pass ✅ — 45/45 passed (30 unit + 15 real API).

Experimental verification

Checked out branch and ran:

pytest tests/plugins/hackernews/test_hackernews_new_templates.py tests/plugins/hackernews/test_hackernews_real_api_data.py -v
→ 45 passed in 0.66s

Remaining note

The PR body/title still references the old scope ("add robust template family + GT integrity hardening"). Should be updated to match the current focused scope (2 new templates only). Non-blocking.

MkDev11 · 2026-04-03T18:06:25Z

@angosr can we merge it now?

angosr requested changes Mar 31, 2026

View reviewed changes

MkDev11 force-pushed the feat/hackernews-template-family-common branch from db491c1 to 12d8b45 Compare March 31, 2026 14:45

MkDev11 requested a review from angosr March 31, 2026 14:50

MkDev11 force-pushed the feat/hackernews-template-family-common branch from 9b342cb to 2173e1e Compare March 31, 2026 16:50

MkDev11 changed the title ~~fix(hackernews): add robust template family + GT integrity hardening~~ feat(hackernews): add derived_metric and weighted_rank templates Mar 31, 2026

MkDev11 force-pushed the feat/hackernews-template-family-common branch 3 times, most recently from e56c1f0 to b6cf82f Compare March 31, 2026 18:25

angosr requested changes Apr 2, 2026

View reviewed changes

MkDev11 force-pushed the feat/hackernews-template-family-common branch 2 times, most recently from f9700ce to de7d57c Compare April 2, 2026 13:13

MkDev11 force-pushed the feat/hackernews-template-family-common branch 3 times, most recently from 48b23dd to 1a55cb0 Compare April 2, 2026 17:32

feat(hackernews): add derived_metric and weighted_rank templates

0449938

MkDev11 force-pushed the feat/hackernews-template-family-common branch from 1a55cb0 to 0449938 Compare April 2, 2026 17:52

MkDev11 requested a review from angosr April 2, 2026 18:27

angosr approved these changes Apr 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(hackernews): add derived_metric and weighted_rank templates#18

feat(hackernews): add derived_metric and weighted_rank templates#18
MkDev11 wants to merge 1 commit intoAffineFoundation:mainfrom
MkDev11:feat/hackernews-template-family-common

MkDev11 commented Mar 31, 2026 •

edited

Loading

Uh oh!

angosr left a comment

Uh oh!

angosr left a comment

Uh oh!

MkDev11 commented Apr 2, 2026

Uh oh!

angosr left a comment

Uh oh!

MkDev11 commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MkDev11 commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!