
feat: add template red team dashboard CLI #16

Open
claytonlin1110 wants to merge 4 commits into AffineFoundation:main from claytonlin1110:feat/redteam-dashboard

Conversation

@claytonlin1110
Contributor

Summary

This PR adds a Template Red Team Dashboard as a lightweight CLI (python -m liveweb_arena.redteam) that probes templates without running the browser agent. It executes a deterministic “API semantic probe” by calling each plugin’s fetch_api_data() for a minimal set of inferred URLs, feeding those snapshots through the real GTCollector + template get_ground_truth(). It then emits actionable template-quality metrics and artifacts (report.json, report.md) to support red-team review, anti-memorization checks, and quick regressions in CI.

Motivation

Template quality issues are easy to ship unintentionally:

  • Memorizable templates (collapsed parameter space or small answer space)
  • Semantics drift (question meaning doesn’t match what the API actually returns)
  • Instability (GT changes across close repeats due to volatile sources)
  • Solvability/GT binding issues (GT depends on data that isn’t collected via the intended navigation path)

What’s included

  1. New liveweb_arena.redteam CLI
    Entry point: liveweb_arena/redteam/main.py

Key capabilities:

  • Targeted runs via --templates plugin/template[/variant]
  • Bulk runs via --all-templates (auto-resolves registered templates with a known plugin/cache source)
  • Plugin filtering via --plugins coingecko stooq ...
  • Template discovery without probing via --list-templates

Artifacts:

  • Writes report.json and report.md to ./redteam// (or --output-dir).
  2. Deterministic API probe pipeline (no browser, no LLM)
    Core logic: liveweb_arena/redteam/probe.py

How it works:

  • Generates tasks through the real TaskManager.generate_composite_task(...) for each (seed, template) pair.
  • For each generated question, infers a minimal set of probe URLs (conservative heuristics per plugin where needed).
  • Calls plugin fetch_api_data(url) for each probe URL and feeds results into GTCollector.on_page_visit(...).
  • Calls GTCollector.fetch_remaining_api_gt() which triggers the template’s real get_ground_truth(validation_info) logic.
  • Captures success/failure, GT values, and probe URLs per sample.
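The loop above can be sketched with stand-in callables. Everything below is a simplified illustration, not the PR's implementation: the real TaskManager, plugin, and GTCollector interfaces are richer, and `probe_template`/`ProbeResult` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class ProbeResult:
    question: str
    probe_urls: List[str]
    gt_value: Optional[object]
    success: bool

def probe_template(
    generate_task: Callable[[int], str],           # stand-in for TaskManager.generate_composite_task
    infer_probe_urls: Callable[[str], List[str]],  # stand-in for the URL inference step
    fetch_api_data: Callable[[str], Dict],         # stand-in for plugin.fetch_api_data
    collector,                                     # stand-in for GTCollector
    seeds: List[int],
) -> List[ProbeResult]:
    results: List[ProbeResult] = []
    for seed in seeds:
        question = generate_task(seed)
        urls = infer_probe_urls(question)
        # Feed each probed snapshot into GT collection.
        for url in urls:
            collector.on_page_visit(url, api_data=fetch_api_data(url))
        # Triggers the template's get_ground_truth logic in the real tool.
        gt = collector.fetch_remaining_api_gt()
        results.append(ProbeResult(question, urls, gt, success=gt is not None))
    return results
```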
  3. Metrics: collapse, baseline, GT success, stability
    Metrics: liveweb_arena/redteam/metrics.py

Computed per template:

  • GT success rate: fraction of samples where GT could be collected from probed data
  • Unique questions / unique GT values: simple diversity indicators
  • Cross-parameter collapse rate: detects whether distinct validation_info configurations collapse to identical GT outputs
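As an illustration, a cross-parameter collapse rate could be computed along these lines. This is a hypothetical sketch assuming each sample carries a hashable parameter configuration and GT value; the PR's actual metrics code may differ.

```python
from collections import Counter
from typing import Any, Dict, Hashable, List

def collapse_rate(samples: List[Dict[str, Any]]) -> float:
    """Fraction of distinct parameter configurations whose GT value is
    shared with at least one other configuration (0.0 = fully diverse,
    1.0 = fully collapsed). Each sample: {"params": <hashable>, "gt": <value>}."""
    gt_by_params: Dict[Hashable, Any] = {}
    for s in samples:
        gt_by_params.setdefault(s["params"], s["gt"])  # first GT per config
    if len(gt_by_params) < 2:
        return 0.0  # collapse is undefined for a single configuration
    counts = Counter(gt_by_params.values())
    collapsed = sum(1 for gt in gt_by_params.values() if counts[gt] > 1)
    return collapsed / len(gt_by_params)
```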
  4. CI gating (threshold enforcement)
    Flags (in CLI):

--fail-on-violation (exit code 2 if any violation)
--min-gt-success 0..1
--max-collapse 0..1
--max-baseline 0..1
--min-stability 0..1 (requires --repeat >= 2)
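A minimal sketch of how such threshold enforcement might look (hypothetical helper, not the PR's code; with --fail-on-violation the CLI would exit with code 2 when the returned list is non-empty):

```python
from typing import Dict, List, Optional

def check_thresholds(
    metrics: Dict[str, float],
    min_gt_success: Optional[float] = None,
    max_collapse: Optional[float] = None,
    max_baseline: Optional[float] = None,
    min_stability: Optional[float] = None,
) -> List[str]:
    """Return violation messages; an empty list means every threshold passed."""
    violations: List[str] = []
    if min_gt_success is not None and metrics["gt_success"] < min_gt_success:
        violations.append(f"gt_success {metrics['gt_success']} < {min_gt_success}")
    if max_collapse is not None and metrics["collapse"] > max_collapse:
        violations.append(f"collapse {metrics['collapse']} > {max_collapse}")
    if max_baseline is not None and metrics["baseline"] > max_baseline:
        violations.append(f"baseline {metrics['baseline']} > {max_baseline}")
    if min_stability is not None and metrics["stability"] < min_stability:
        violations.append(f"stability {metrics['stability']} < {min_stability}")
    return violations
```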

  5. Windows importability fixes (unblocks running tooling on Windows)
    Two issues prevented running python -m liveweb_arena.redteam on Windows:
  • liveweb_arena/__init__.py eagerly imported the browser layer.
  • liveweb_arena/core/cache.py imported the POSIX-only fcntl module.

@claytonlin1110
Contributor Author

@angosr Please take a look and give me feedback on this.

Contributor

@angosr angosr left a comment


Review: PR #16 — feat: add template red team dashboard CLI

Significance Gate: CONDITIONAL PASS

A quick automated probe for template GT correctness is useful as a first pass. However, this tool does NOT replace the CLAUDE.md Red Team Review (6 mandatory checks require human judgment: world knowledge attack, memorization space analysis, etc.). The tool should be positioned as a complement, not a substitute.


BLOCKING: 5 generated report files committed to the repository

redteam/20260326_123231/report.md
redteam/20260326_123324/report.md
redteam/20260326_123403/report.md
redteam/20260326_123651/report.md
redteam/20260326_123828/report.md

Generated output must not be committed to source control. Add redteam/ to .gitignore and remove these files from the PR.

BLOCKING: Unrelated changes bundled

  1. liveweb_arena/__init__.py: Lazy-loading BrowserEngine/BrowserSession — identical to the change in PR #12 (rejected). This is an infrastructure change that should be its own PR with proper justification.

  2. liveweb_arena/core/cache.py: Windows fcntl portability fix — unrelated to the red team CLI. Should be a separate fix PR.

These bundled changes make review harder and risk sneaking unrelated modifications through a tooling PR.

CONCERN: _infer_probe_urls is fragile and hard to maintain

The URL inference uses per-plugin heuristics:

if plugin_name == "openlibrary":
    a = vi.get("book_a_query")
    ...
if plugin_name == "arxiv":
    category = vi.get("category")
    ...
if plugin_name == "stooq":
    symbol = vi.get("symbol") or vi.get("symbol_a")
    ...

This creates a parallel code path that must be manually kept in sync with template validation_info keys. When a new template changes key names or adds new URL patterns, this function silently breaks. The probe gives false negatives (GT fails because the URL wasn't inferred) that look like template bugs.

Better approach: let each template declare its probe URLs via a method like get_probe_urls(validation_info) -> List[str], keeping the knowledge co-located with the template code.
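A minimal sketch of that suggestion (the base-class name follows the discussion; the Stooq subclass and its URL pattern are purely illustrative):

```python
from typing import Dict, List

class QuestionTemplate:
    """Stand-in base class; the real one lives in the project's validators."""

    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        # Default: no declaration; the tool falls back to its URL heuristics.
        return []

class StooqDailyChange(QuestionTemplate):
    # Hypothetical template: knowledge of the URL pattern stays co-located
    # with the template instead of living in _infer_probe_urls.
    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        symbol = validation_info.get("symbol") or validation_info.get("symbol_a")
        return [f"https://stooq.com/q/?s={symbol}"] if symbol else []
```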

CONCERN: Probe bypasses page cache pipeline

The real GT collection uses GTCollector.on_page_visit() which is triggered by the browser's page cache. This probe calls plugin.fetch_api_data(url) directly and feeds it into on_page_visit. But:

  • The probe's api_data timestamp differs from what a real browser visit would produce
  • The probe doesn't respect GT data priority rules (CLAUDE.md §6: detail page > list page)
  • The probe visits URLs in a fixed order, while real agents visit pages in unpredictable order

A probe that passes doesn't guarantee the real eval pipeline works. This should be documented clearly.

Required Actions

  1. Remove the 5 committed report files; add redteam/ to .gitignore
  2. Split out __init__.py lazy-loading and cache.py fcntl fix into separate PRs
  3. Document clearly that this tool is a supplement to (not replacement for) CLAUDE.md Red Team Review and eval.py testing

@claytonlin1110
Contributor Author

@angosr Updated, please check.

@claytonlin1110 claytonlin1110 requested a review from angosr March 27, 2026 08:59
Contributor

@angosr angosr left a comment


Re-review (2nd pass): PR #16 — APPROVE

All blocking issues resolved:

  1. Generated reports removed ✅ — redteam/ added to .gitignore, no report files in diff.
  2. Unrelated changes removed ✅ — __init__.py lazy-loading and cache.py fcntl fix are gone. Diff is now focused on the redteam CLI.
  3. URL inference improved ✅ — Added get_probe_urls() to QuestionTemplate base class, giving templates the ability to declare their own probe URLs instead of relying on fragile per-plugin heuristics in _infer_probe_urls.

Note

The tool is a useful supplement for quick template validation. The _infer_probe_urls heuristics still exist as a fallback, but templates can now opt into the cleaner get_probe_urls pattern. This should be documented as the preferred approach for new templates.

As noted in the previous review, this tool does NOT replace CLAUDE.md Red Team Review or eval.py testing — it complements them.

Contributor

@angosr angosr left a comment


Review WITHDRAWN — PR #16 rejected on re-examination

Previous approval is retracted. On closer analysis, this PR fails the Significance Gate and has a fundamental design flaw.

BLOCKING: Probe bypasses the real GT collection pipeline

The tool calls plugin.fetch_api_data(url) directly and feeds results into GTCollector.on_page_visit(). This completely bypasses the page cache pipeline that the real evaluation uses:

Real eval: Browser visits page → CacheManager stores {html, api_data, accessibility_tree} as atomic snapshot → GTCollector receives data from cache → GT computed

This tool: fetch_api_data() called directly → data injected into GTCollector → GT computed

This means:

  • Cache timing semantics are lost — the real pipeline ensures HTML and api_data share the same timestamp. The probe has no such guarantee.
  • GT priority rules (CLAUDE.md §6) are not tested — detail page > list page priority depends on the page cache mechanism
  • A probe "pass" gives false confidence — templates can pass the probe but fail in real evaluation because the cache binding works differently
  • Page-bound GT (GTSourceType.PAGE_ONLY) is violated by design — the whole point of PAGE_ONLY is that GT comes from pages the agent actually visits, not from direct API calls

This is not a minor gap — it's architecturally wrong. A tool that validates templates through a different code path than production is worse than no tool, because it creates false confidence.

BLOCKING: Fails Significance Gate

  1. Does it advance project goals? No. PR #13 and #14 demonstrated that real API GT verification can be done with simple pytest fixtures + injected data — no CLI tool needed. That pattern uses the real GTCollector correctly.
  2. Maintenance burden: 600+ lines of new code including _infer_probe_urls with per-plugin URL heuristics that must be kept in sync with every new template.
  3. get_probe_urls added to core validators/base.py for a supplementary tool that no template implements — scope creep into core interfaces.

Recommendation

Close this PR. The existing pattern (pytest + real API data injection into GTCollector, as PR #13/14 did) is simpler, uses the real pipeline, and has zero maintenance overhead. If automated red-team checking is desired, it should go through the actual cache pipeline, not around it.

@claytonlin1110
Contributor Author

@angosr
I've refactored the redteam probe so it no longer calls plugin.fetch_api_data() and injects the result into GTCollector. The tool now uses CacheManager.ensure_cached() with PageRequirement.nav / PageRequirement.data (the same needs_api_data split as eval), which produces CachedPage snapshots with html, api_data, accessibility_tree, and fetched_at bound together, the same atomic path the real evaluation uses in cache mode. Then I call GTCollector.on_page_visit(url, content=a11y, api_data=cached.api_data) with that snapshot, matching how cache-mode observations feed GT (see env._handle_observation_event).

I still document limits explicitly: this does not reproduce arbitrary agent visit order or full multi-step trajectories; it uses a fixed probe URL list. It remains a supplement to CLAUDE.md red-team review and eval.py, not a substitute. A --cache-dir flag and report metadata record which cache directory was used.
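The flow described above can be sketched with stand-ins. The signatures below are simplified assumptions (per the thread, the real ensure_cached takes page requirements and a plugin, not a bare URL):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedPage:
    # Atomic snapshot, mirroring the fields named above.
    html: str
    api_data: dict
    accessibility_tree: str
    fetched_at: float

def probe_via_cache(cache_manager, collector, urls) -> Optional[object]:
    """Feed atomically cached snapshots into GT collection.

    cache_manager and collector are stand-ins for CacheManager and
    GTCollector; this illustrates the data flow, not the real API."""
    for url in urls:
        cached: CachedPage = cache_manager.ensure_cached(url)
        collector.on_page_visit(
            url,
            content=cached.accessibility_tree,  # a11y text, as in cache mode
            api_data=cached.api_data,           # bound to the same snapshot
        )
    return collector.fetch_remaining_api_gt()
```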

@claytonlin1110 claytonlin1110 requested a review from angosr March 27, 2026 20:37
@claytonlin1110 claytonlin1110 force-pushed the feat/redteam-dashboard branch from a8011a5 to 25ba970 Compare March 27, 2026 20:39
Contributor

@angosr angosr left a comment


Re-review: PR #16 — APPROVE

The core architectural issue has been fixed. The probe now uses the real cache pipeline.

What changed since last review

The probe now calls cache_manager.ensure_cached(pages, plugin) — the same CacheManager code path as production evaluation — instead of directly calling fetch_api_data. This means:

  • ✅ Atomic page snapshots {html, api_data, accessibility_tree, fetched_at} are used
  • ✅ GT priority rules (detail > list) are respected via the real cache
  • ✅ Page-bound GT (PAGE_ONLY) semantics are preserved
  • ✅ A probe pass now means the real cache pipeline works

Resolved concerns

| Issue | Status |
| --- | --- |
| Bypasses page cache pipeline | ✅ Fixed — uses CacheManager.ensure_cached |
| Generated report files committed | ✅ Fixed — redteam/ in .gitignore, no reports in diff |
| Unrelated __init__.py/cache.py changes | ✅ Removed — scope is clean |
| get_probe_urls on base class | ✅ Acceptable — clean override for templates |
| Documented as supplement | ✅ Docstring explicitly says "supplements, does not replace, full eval.py runs or CLAUDE.md red-team review" |

Remaining note

_infer_probe_urls still has per-plugin heuristics as fallback, but get_probe_urls on QuestionTemplate provides the preferred override path. New templates should implement get_probe_urls instead of relying on the heuristics.

Contributor

@angosr angosr left a comment


Review WITHDRAWN (2nd retraction) — PR #16

BLOCKING: CacheManager dependency contradicts "lightweight, no-browser" premise

The probe now uses CacheManager.ensure_cached(), which internally calls _fetch_page() via Playwright. This creates a contradiction:

  1. Cache empty → ensure_cached must fetch via Playwright → requires full browser setup → same weight as eval.py, so the tool has no advantage
  2. Cache pre-populated → reads stale disk data → validates GT against old snapshots, not current live data

Neither mode was tested. No evidence the tool actually runs successfully in either scenario.

BLOCKING: No real-world verification

The tool has never been demonstrated working:

  • No output from an actual python -m liveweb_arena.redteam run
  • No evidence that CacheManager.ensure_cached works in the probe context (browser session availability unclear)
  • No comparison showing it catches a real GT issue that pytest wouldn't catch

BLOCKING: Fails Significance Gate — no capability beyond pytest

PR #13 and #14 demonstrated that pytest + real API data injection into GTCollector verifies the entire GT computation pipeline in ~30 minutes of work, with zero maintenance overhead.

This tool adds 600+ lines of code, a CLI framework, _infer_probe_urls heuristics, metrics computation, and report generation — for a capability that pytest already provides more simply and reliably.

The bar for adding new tooling infrastructure should be: does it enable something that cannot be done today? This tool does not meet that bar.

Recommendation

Close this PR. Template GT verification is adequately served by the established pytest + real API injection pattern.

Contributor

@angosr angosr left a comment


Updated review with actual test results — PR #16 remains REJECTED

I checked out the branch and ran the tool. Results are mixed — it partially works but has critical reliability issues.

Actual test results

| Test | Result | Detail |
| --- | --- | --- |
| --list-templates | ✅ | Lists all registered templates correctly |
| coingecko/coingecko_price seed=1 | ✅ GT success | CacheManager fetches via Playwright (9.2s), GT returns value |
| openlibrary/openlibrary_book_stats seeds=1,2,3 | ✅ 3/3 GT success | 3 unique GT values, collapse=0% |
| openmeteo/openmeteo_current seed=1 | ❌ GT fail | _infer_probe_urls only provides docs homepage, not city-specific URL |
| stooq/stooq_daily_change seed=1 | ❌ Crash | Unhandled exception in generate_composite_task, no graceful error |

BLOCKING: Tool crashes on some plugins instead of failing gracefully

stooq/stooq_daily_change raises an unhandled ValueError and exits. A tool that crashes instead of reporting a failure is not production-ready.

BLOCKING: _infer_probe_urls fails for 2 of 4 tested plugins

OpenMeteo templates need city-specific URLs (e.g., open-meteo.com/en/docs?latitude=38.72&longitude=-9.14), but the probe only visits the docs homepage. The GT correctly reports "Agent did not visit Open Meteo page for 'Lisbon'" — but this means the tool gives a false negative for every OpenMeteo template. Same pattern would affect Stooq (needs symbol-specific URLs) and Taostats (needs subnet-specific URLs).

The tool is reliable only for plugins where start_url alone is sufficient (CoinGecko coin pages, OpenLibrary search pages). For plugins requiring parameter-specific URLs, it systematically fails.

Confirmed: requires Playwright (not lightweight)

CacheManager uses Playwright to fetch pages: [Cache] MISS data - fetching www.coingecko.com/en/coins/dogecoin took 9.2 seconds. This is the full browser stack — the tool is not a lightweight alternative to eval.py.

Minor: aiohttp resource leak

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x769e1f40e780>

Assessment

The tool works for ~50% of plugins (those with self-contained start URLs) but crashes or gives false negatives for the rest. Combined with requiring full Playwright setup, the value proposition over pytest + real API injection is weak.

If the author wants to continue: (1) fix crash handling, (2) implement get_probe_urls on all existing templates so _infer_probe_urls heuristics aren't needed, (3) document which plugins are supported, (4) close the aiohttp session properly.

@claytonlin1110 claytonlin1110 force-pushed the feat/redteam-dashboard branch from 25ba970 to 5780e29 Compare March 30, 2026 10:41
@claytonlin1110 claytonlin1110 requested a review from angosr March 30, 2026 10:48
@claytonlin1110
Contributor Author

@angosr Please review

@claytonlin1110 claytonlin1110 force-pushed the feat/redteam-dashboard branch from d591fab to c3adbfc Compare April 2, 2026 15:34