
feat: add template red team dashboard CLI #16

Open
claytonlin1110 wants to merge 4 commits into AffineFoundation:main from claytonlin1110:feat/redteam-dashboard

Conversation

@claytonlin1110
Contributor

Summary

This PR adds a Template Red Team Dashboard as a lightweight CLI (python -m liveweb_arena.redteam) that probes templates without running the browser agent. It executes a deterministic “API semantic probe” by calling each plugin’s fetch_api_data() for a minimal set of inferred URLs, feeding those snapshots through the real GTCollector + template get_ground_truth(). It then emits actionable template-quality metrics and artifacts (report.json, report.md) to support red-team review, anti-memorization checks, and quick regressions in CI.

Motivation

Template quality issues are easy to ship unintentionally:

  • Memorizable templates (collapsed parameter space or small answer space)
  • Semantics drift (question meaning doesn’t match what the API actually returns)
  • Instability (GT changes across close repeats due to volatile sources)
  • Solvability/GT binding issues (GT depends on data that isn’t collected via the intended navigation path)

What’s included

  1. New liveweb_arena.redteam CLI
    Entry point: liveweb_arena/redteam/main.py

Key capabilities:

  • Targeted runs via --templates plugin/template[/variant]
  • Bulk runs via --all-templates (auto-resolves registered templates with a known plugin/cache source)
  • Plugin filtering via --plugins coingecko stooq ...
  • Template discovery without probing via --list-templates

Artifacts:

  • Writes report.json and report.md to ./redteam// (or --output-dir).
  2. Deterministic API probe pipeline (no browser, no LLM)
    Core logic: liveweb_arena/redteam/probe.py

How it works:

  • Generates tasks through the real TaskManager.generate_composite_task(...) for each (seed, template) pair.
  • For each generated question, infers a minimal set of probe URLs (conservative heuristics per plugin where needed).
  • Calls plugin fetch_api_data(url) for each probe URL and feeds results into GTCollector.on_page_visit(...).
  • Calls GTCollector.fetch_remaining_api_gt() which triggers the template’s real get_ground_truth(validation_info) logic.
  • Captures success/failure, GT values, and probe URLs per sample.
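The loop above can be sketched with stand-in callables. Everything below is a simplified illustration, not the PR's implementation: the real TaskManager, plugin, and GTCollector interfaces are richer, and `probe_template`/`ProbeResult` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class ProbeResult:
    question: str
    probe_urls: List[str]
    gt_value: Optional[object]
    success: bool

def probe_template(
    generate_task: Callable[[int], str],           # stand-in for TaskManager.generate_composite_task
    infer_probe_urls: Callable[[str], List[str]],  # stand-in for the URL inference step
    fetch_api_data: Callable[[str], Dict],         # stand-in for plugin.fetch_api_data
    collector,                                     # stand-in for GTCollector
    seeds: List[int],
) -> List[ProbeResult]:
    results: List[ProbeResult] = []
    for seed in seeds:
        question = generate_task(seed)
        urls = infer_probe_urls(question)
        # Feed each probed snapshot into GT collection.
        for url in urls:
            collector.on_page_visit(url, api_data=fetch_api_data(url))
        # Triggers the template's get_ground_truth logic in the real tool.
        gt = collector.fetch_remaining_api_gt()
        results.append(ProbeResult(question, urls, gt, success=gt is not None))
    return results
```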
  3. Metrics: collapse, baseline, GT success, stability
    Metrics: liveweb_arena/redteam/metrics.py

Computed per template:

  • GT success rate: fraction of samples where GT could be collected from probed data
  • Unique questions / unique GT values: simple diversity indicators
  • Cross-parameter collapse rate: detects whether distinct validation_info configurations collapse to identical GT outputs
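As an illustration, a cross-parameter collapse rate could be computed along these lines. This is a hypothetical sketch assuming each sample carries a hashable parameter configuration and GT value; the PR's actual metrics code may differ.

```python
from collections import Counter
from typing import Any, Dict, Hashable, List

def collapse_rate(samples: List[Dict[str, Any]]) -> float:
    """Fraction of distinct parameter configurations whose GT value is
    shared with at least one other configuration (0.0 = fully diverse,
    1.0 = fully collapsed). Each sample: {"params": <hashable>, "gt": <value>}."""
    gt_by_params: Dict[Hashable, Any] = {}
    for s in samples:
        gt_by_params.setdefault(s["params"], s["gt"])  # first GT per config
    if len(gt_by_params) < 2:
        return 0.0  # collapse is undefined for a single configuration
    counts = Counter(gt_by_params.values())
    collapsed = sum(1 for gt in gt_by_params.values() if counts[gt] > 1)
    return collapsed / len(gt_by_params)
```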
  4. CI gating (threshold enforcement)
    Flags (in CLI):

--fail-on-violation (exit code 2 if any violation)
--min-gt-success 0..1
--max-collapse 0..1
--max-baseline 0..1
--min-stability 0..1 (requires --repeat >= 2)
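A minimal sketch of how such threshold enforcement might look (hypothetical helper, not the PR's code; with --fail-on-violation the CLI would exit with code 2 when the returned list is non-empty):

```python
from typing import Dict, List, Optional

def check_thresholds(
    metrics: Dict[str, float],
    min_gt_success: Optional[float] = None,
    max_collapse: Optional[float] = None,
    max_baseline: Optional[float] = None,
    min_stability: Optional[float] = None,
) -> List[str]:
    """Return violation messages; an empty list means every threshold passed."""
    violations: List[str] = []
    if min_gt_success is not None and metrics["gt_success"] < min_gt_success:
        violations.append(f"gt_success {metrics['gt_success']} < {min_gt_success}")
    if max_collapse is not None and metrics["collapse"] > max_collapse:
        violations.append(f"collapse {metrics['collapse']} > {max_collapse}")
    if max_baseline is not None and metrics["baseline"] > max_baseline:
        violations.append(f"baseline {metrics['baseline']} > {max_baseline}")
    if min_stability is not None and metrics["stability"] < min_stability:
        violations.append(f"stability {metrics['stability']} < {min_stability}")
    return violations
```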

  5. Windows importability fixes (unblocks running tooling on Windows)
    Two issues prevented running python -m liveweb_arena.redteam on Windows:
  • liveweb_arena/__init__.py eagerly imported the browser layer.
  • liveweb_arena/core/cache.py imported the POSIX-only fcntl module.

@claytonlin1110
Contributor Author

@angosr Please take a look and give me feedback on this.

Contributor

@angosr angosr left a comment


Review: PR #16 — feat: add template red team dashboard CLI

Significance Gate: CONDITIONAL PASS

A quick automated probe for template GT correctness is useful as a first pass. However, this tool does NOT replace the CLAUDE.md Red Team Review (6 mandatory checks require human judgment: world knowledge attack, memorization space analysis, etc.). The tool should be positioned as a complement, not a substitute.


BLOCKING: 5 generated report files committed to the repository

redteam/20260326_123231/report.md
redteam/20260326_123324/report.md
redteam/20260326_123403/report.md
redteam/20260326_123651/report.md
redteam/20260326_123828/report.md

Generated output must not be committed to source control. Add redteam/ to .gitignore and remove these files from the PR.

BLOCKING: Unrelated changes bundled

  1. liveweb_arena/__init__.py: Lazy-loading BrowserEngine/BrowserSession — identical to the change in PR #12 (rejected). This is an infrastructure change that should be its own PR with proper justification.

  2. liveweb_arena/core/cache.py: Windows fcntl portability fix — unrelated to the red team CLI. Should be a separate fix PR.

These bundled changes make review harder and risk sneaking unrelated modifications through a tooling PR.

CONCERN: _infer_probe_urls is fragile and hard to maintain

The URL inference uses per-plugin heuristics:

if plugin_name == "openlibrary":
    a = vi.get("book_a_query")
    ...
if plugin_name == "arxiv":
    category = vi.get("category")
    ...
if plugin_name == "stooq":
    symbol = vi.get("symbol") or vi.get("symbol_a")
    ...

This creates a parallel code path that must be manually kept in sync with template validation_info keys. When a new template changes key names or adds new URL patterns, this function silently breaks. The probe gives false negatives (GT fails because the URL wasn't inferred) that look like template bugs.

Better approach: let each template declare its probe URLs via a method like get_probe_urls(validation_info) -> List[str], keeping the knowledge co-located with the template code.
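A minimal sketch of that suggestion (the base-class name follows the discussion; the Stooq subclass and its URL pattern are purely illustrative):

```python
from typing import Dict, List

class QuestionTemplate:
    """Stand-in base class; the real one lives in the project's validators."""

    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        # Default: no declaration; the tool falls back to its URL heuristics.
        return []

class StooqDailyChange(QuestionTemplate):
    # Hypothetical template: knowledge of the URL pattern stays co-located
    # with the template instead of living in _infer_probe_urls.
    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        symbol = validation_info.get("symbol") or validation_info.get("symbol_a")
        return [f"https://stooq.com/q/?s={symbol}"] if symbol else []
```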

CONCERN: Probe bypasses page cache pipeline

The real GT collection uses GTCollector.on_page_visit() which is triggered by the browser's page cache. This probe calls plugin.fetch_api_data(url) directly and feeds it into on_page_visit. But:

  • The probe's api_data timestamp differs from what a real browser visit would produce
  • The probe doesn't respect GT data priority rules (CLAUDE.md §6: detail page > list page)
  • The probe visits URLs in a fixed order, while real agents visit pages in unpredictable order

A probe that passes doesn't guarantee the real eval pipeline works. This should be documented clearly.

Required Actions

  1. Remove the 5 committed report files; add redteam/ to .gitignore
  2. Split out __init__.py lazy-loading and cache.py fcntl fix into separate PRs
  3. Document clearly that this tool is a supplement to (not replacement for) CLAUDE.md Red Team Review and eval.py testing

@claytonlin1110
Contributor Author

@angosr Updated, please check.

@claytonlin1110 claytonlin1110 requested a review from angosr March 27, 2026 08:59
Contributor

@angosr angosr left a comment


Re-review (2nd pass): PR #16 — APPROVE

All blocking issues resolved:

  1. Generated reports removed ✅ — redteam/ added to .gitignore, no report files in diff.
  2. Unrelated changes removed ✅ — __init__.py lazy-loading and cache.py fcntl fix are gone. Diff is now focused on the redteam CLI.
  3. URL inference improved ✅ — Added get_probe_urls() to QuestionTemplate base class, giving templates the ability to declare their own probe URLs instead of relying on fragile per-plugin heuristics in _infer_probe_urls.

Note

The tool is a useful supplement for quick template validation. The _infer_probe_urls heuristics still exist as a fallback, but templates can now opt into the cleaner get_probe_urls pattern. This should be documented as the preferred approach for new templates.

As noted in the previous review, this tool does NOT replace CLAUDE.md Red Team Review or eval.py testing — it complements them.

Contributor

@angosr angosr left a comment


Review WITHDRAWN — PR #16 rejected on re-examination

Previous approval is retracted. On closer analysis, this PR fails the Significance Gate and has a fundamental design flaw.

BLOCKING: Probe bypasses the real GT collection pipeline

The tool calls plugin.fetch_api_data(url) directly and feeds results into GTCollector.on_page_visit(). This completely bypasses the page cache pipeline that the real evaluation uses:

Real eval: Browser visits page → CacheManager stores {html, api_data, accessibility_tree} as atomic snapshot → GTCollector receives data from cache → GT computed

This tool: fetch_api_data() called directly → data injected into GTCollector → GT computed

This means:

  • Cache timing semantics are lost — the real pipeline ensures HTML and api_data share the same timestamp. The probe has no such guarantee.
  • GT priority rules (CLAUDE.md §6) are not tested — detail page > list page priority depends on the page cache mechanism
  • A probe "pass" gives false confidence — templates can pass the probe but fail in real evaluation because the cache binding works differently
  • Page-bound GT (GTSourceType.PAGE_ONLY) is violated by design — the whole point of PAGE_ONLY is that GT comes from pages the agent actually visits, not from direct API calls

This is not a minor gap — it's architecturally wrong. A tool that validates templates through a different code path than production is worse than no tool, because it creates false confidence.

BLOCKING: Fails Significance Gate

  1. Does it advance project goals? No. PR #13 and #14 demonstrated that real API GT verification can be done with simple pytest fixtures + injected data — no CLI tool needed. That pattern uses the real GTCollector correctly.
  2. Maintenance burden: 600+ lines of new code including _infer_probe_urls with per-plugin URL heuristics that must be kept in sync with every new template.
  3. get_probe_urls added to core validators/base.py for a supplementary tool that no template implements — scope creep into core interfaces.

Recommendation

Close this PR. The existing pattern (pytest + real API data injection into GTCollector, as PR #13/14 did) is simpler, uses the real pipeline, and has zero maintenance overhead. If automated red-team checking is desired, it should go through the actual cache pipeline, not around it.

@claytonlin1110
Contributor Author

@angosr
I've refactored the redteam probe so it no longer calls plugin.fetch_api_data() and injects the result into GTCollector. The tool now uses CacheManager.ensure_cached() with PageRequirement.nav / PageRequirement.data (the same needs_api_data split as eval), which produces CachedPage snapshots with html, api_data, accessibility_tree, and fetched_at bound together, the same atomic path the real evaluation uses in cache mode. Then I call GTCollector.on_page_visit(url, content=a11y, api_data=cached.api_data) with that snapshot, matching how cache-mode observations feed GT (see env._handle_observation_event).

I still document limits explicitly: this does not reproduce arbitrary agent visit order or full multi-step trajectories; it uses a fixed probe URL list. It remains a supplement to CLAUDE.md red-team review and eval.py, not a substitute. A --cache-dir flag and report metadata record which cache directory was used.
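The flow described above can be sketched with stand-ins. The signatures below are simplified assumptions (per the thread, the real ensure_cached takes page requirements and a plugin, not a bare URL):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedPage:
    # Atomic snapshot, mirroring the fields named above.
    html: str
    api_data: dict
    accessibility_tree: str
    fetched_at: float

def probe_via_cache(cache_manager, collector, urls) -> Optional[object]:
    """Feed atomically cached snapshots into GT collection.

    cache_manager and collector are stand-ins for CacheManager and
    GTCollector; this illustrates the data flow, not the real API."""
    for url in urls:
        cached: CachedPage = cache_manager.ensure_cached(url)
        collector.on_page_visit(
            url,
            content=cached.accessibility_tree,  # a11y text, as in cache mode
            api_data=cached.api_data,           # bound to the same snapshot
        )
    return collector.fetch_remaining_api_gt()
```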

@claytonlin1110 claytonlin1110 requested a review from angosr March 27, 2026 20:37
@claytonlin1110 claytonlin1110 force-pushed the feat/redteam-dashboard branch from a8011a5 to 25ba970 Compare March 27, 2026 20:39
Contributor

@angosr angosr left a comment


Re-review: PR #16 — APPROVE

The core architectural issue has been fixed. The probe now uses the real cache pipeline.

What changed since last review

The probe now calls cache_manager.ensure_cached(pages, plugin) — the same CacheManager code path as production evaluation — instead of directly calling fetch_api_data. This means:

  • ✅ Atomic page snapshots {html, api_data, accessibility_tree, fetched_at} are used
  • ✅ GT priority rules (detail > list) are respected via the real cache
  • ✅ Page-bound GT (PAGE_ONLY) semantics are preserved
  • ✅ A probe pass now means the real cache pipeline works

Resolved concerns

| Issue | Status |
| --- | --- |
| Bypasses page cache pipeline | ✅ Fixed — uses CacheManager.ensure_cached |
| Generated report files committed | ✅ Fixed — redteam/ in .gitignore, no reports in diff |
| Unrelated __init__.py/cache.py changes | ✅ Removed — scope is clean |
| get_probe_urls on base class | ✅ Acceptable — clean override for templates |
| Documented as supplement | ✅ Docstring explicitly says "supplements, does not replace, full eval.py runs or CLAUDE.md red-team review" |

Remaining note

_infer_probe_urls still has per-plugin heuristics as fallback, but get_probe_urls on QuestionTemplate provides the preferred override path. New templates should implement get_probe_urls instead of relying on the heuristics.

Contributor

@angosr angosr left a comment


Review WITHDRAWN (2nd retraction) — PR #16

BLOCKING: CacheManager dependency contradicts "lightweight, no-browser" premise

The probe now uses CacheManager.ensure_cached(), which internally calls _fetch_page() via Playwright. This creates a contradiction:

  1. Cache empty → ensure_cached must fetch via Playwright → requires full browser setup → same weight as eval.py, so the tool has no advantage
  2. Cache pre-populated → reads stale disk data → validates GT against old snapshots, not current live data

Neither mode was tested. No evidence the tool actually runs successfully in either scenario.

BLOCKING: No real-world verification

The tool has never been demonstrated working:

  • No output from an actual python -m liveweb_arena.redteam run
  • No evidence that CacheManager.ensure_cached works in the probe context (browser session availability unclear)
  • No comparison showing it catches a real GT issue that pytest wouldn't catch

BLOCKING: Fails Significance Gate — no capability beyond pytest

PR #13 and #14 demonstrated that pytest + real API data injection into GTCollector verifies the entire GT computation pipeline in ~30 minutes of work, with zero maintenance overhead.

This tool adds 600+ lines of code, a CLI framework, _infer_probe_urls heuristics, metrics computation, and report generation — for a capability that pytest already provides more simply and reliably.

The bar for adding new tooling infrastructure should be: does it enable something that cannot be done today? This tool does not meet that bar.

Recommendation

Close this PR. Template GT verification is adequately served by the established pytest + real API injection pattern.

Contributor

@angosr angosr left a comment


Updated review with actual test results — PR #16 remains REJECTED

I checked out the branch and ran the tool. Results are mixed — it partially works but has critical reliability issues.

Actual test results

| Test | Result | Detail |
| --- | --- | --- |
| --list-templates | ✅ | Lists all registered templates correctly |
| coingecko/coingecko_price seed=1 | ✅ GT success | CacheManager fetches via Playwright (9.2s), GT returns value |
| openlibrary/openlibrary_book_stats seeds=1,2,3 | ✅ 3/3 GT success | 3 unique GT values, collapse=0% |
| openmeteo/openmeteo_current seed=1 | ❌ GT fail | _infer_probe_urls only provides docs homepage, not city-specific URL |
| stooq/stooq_daily_change seed=1 | ❌ Crash | Unhandled exception in generate_composite_task, no graceful error |

BLOCKING: Tool crashes on some plugins instead of failing gracefully

stooq/stooq_daily_change raises an unhandled ValueError and exits. A tool that crashes instead of reporting a failure is not production-ready.

BLOCKING: _infer_probe_urls fails for 2 of 4 tested plugins

OpenMeteo templates need city-specific URLs (e.g., open-meteo.com/en/docs?latitude=38.72&longitude=-9.14), but the probe only visits the docs homepage. The GT correctly reports "Agent did not visit Open Meteo page for 'Lisbon'" — but this means the tool gives a false negative for every OpenMeteo template. Same pattern would affect Stooq (needs symbol-specific URLs) and Taostats (needs subnet-specific URLs).

The tool is reliable only for plugins where start_url alone is sufficient (CoinGecko coin pages, OpenLibrary search pages). For plugins requiring parameter-specific URLs, it systematically fails.

Confirmed: requires Playwright (not lightweight)

CacheManager uses Playwright to fetch pages: [Cache] MISS data - fetching www.coingecko.com/en/coins/dogecoin took 9.2 seconds. This is the full browser stack — the tool is not a lightweight alternative to eval.py.

Minor: aiohttp resource leak

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x769e1f40e780>

Assessment

The tool works for ~50% of plugins (those with self-contained start URLs) but crashes or gives false negatives for the rest. Combined with requiring full Playwright setup, the value proposition over pytest + real API injection is weak.

If the author wants to continue: (1) fix crash handling, (2) implement get_probe_urls on all existing templates so _infer_probe_urls heuristics aren't needed, (3) document which plugins are supported, (4) close the aiohttp session properly.

@claytonlin1110 claytonlin1110 force-pushed the feat/redteam-dashboard branch from 25ba970 to 5780e29 Compare March 30, 2026 10:41
@claytonlin1110 claytonlin1110 requested a review from angosr March 30, 2026 10:48
@claytonlin1110
Contributor Author

@angosr Please review

@claytonlin1110 claytonlin1110 force-pushed the feat/redteam-dashboard branch from d591fab to c3adbfc Compare April 2, 2026 15:34