feat: add template red team dashboard CLI (#16)
claytonlin1110 wants to merge 4 commits into AffineFoundation:main
Conversation
@angosr please check and give me feedback for this
angosr
left a comment
Review: PR #16 — feat: add template red team dashboard CLI
Significance Gate: CONDITIONAL PASS
A quick automated probe for template GT correctness is useful as a first pass. However, this tool does NOT replace the CLAUDE.md Red Team Review (6 mandatory checks require human judgment: world knowledge attack, memorization space analysis, etc.). The tool should be positioned as a complement, not a substitute.
BLOCKING: 5 generated report files committed to the repository
redteam/20260326_123231/report.md
redteam/20260326_123324/report.md
redteam/20260326_123403/report.md
redteam/20260326_123651/report.md
redteam/20260326_123828/report.md
Generated output must not be committed to source control. Add redteam/ to .gitignore and remove these files from the PR.
BLOCKING: Unrelated changes bundled
- liveweb_arena/__init__.py: lazy-loading BrowserEngine/BrowserSession — identical to the change in PR #12 (rejected). This is an infrastructure change that should be its own PR with proper justification.
- liveweb_arena/core/cache.py: Windows fcntl portability fix — unrelated to the red team CLI. Should be a separate fix PR.
These bundled changes make review harder and risk sneaking unrelated modifications through a tooling PR.
CONCERN: _infer_probe_urls is fragile and hard to maintain
The URL inference uses per-plugin heuristics:
```python
if plugin_name == "openlibrary":
    a = vi.get("book_a_query")
    ...
if plugin_name == "arxiv":
    category = vi.get("category")
    ...
if plugin_name == "stooq":
    symbol = vi.get("symbol") or vi.get("symbol_a")
    ...
```

This creates a parallel code path that must be manually kept in sync with template validation_info keys. When a new template changes key names or adds new URL patterns, this function silently breaks. The probe gives false negatives (GT fails because the URL wasn't inferred) that look like template bugs.
Better approach: let each template declare its probe URLs via a method like get_probe_urls(validation_info) -> List[str], keeping the knowledge co-located with the template code.
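A minimal sketch of the suggested pattern (the `ArxivCategoryTemplate` subclass and the exact base-class shape are hypothetical; only `get_probe_urls` and the `validation_info` dict come from this review):

```python
from typing import Dict, List


class QuestionTemplate:
    """Stand-in for the project's base class; the real one has more members."""

    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        # Default: no declared probe URLs; the caller may fall back
        # to the legacy _infer_probe_urls heuristics.
        return []


class ArxivCategoryTemplate(QuestionTemplate):
    # Hypothetical template: declares its own probe URLs, so the CLI
    # never has to guess validation_info key names like "category".
    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        category = validation_info["category"]
        return [f"https://arxiv.org/list/{category}/recent"]
```

With this in place, the probe can simply call `template.get_probe_urls(vi)` and only fall back to heuristics when the list is empty.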
CONCERN: Probe bypasses page cache pipeline
The real GT collection uses GTCollector.on_page_visit() which is triggered by the browser's page cache. This probe calls plugin.fetch_api_data(url) directly and feeds it into on_page_visit. But:
- The probe's api_data timestamp differs from what a real browser visit would produce
- The probe doesn't respect GT data priority rules (CLAUDE.md §6: detail page > list page)
- The probe visits URLs in a fixed order, while real agents visit pages in unpredictable order
A probe that passes doesn't guarantee the real eval pipeline works. This should be documented clearly.
Required Actions
- Remove the 5 committed report files; add redteam/ to .gitignore
- Split out the __init__.py lazy-loading and cache.py fcntl fix into separate PRs
- Document clearly that this tool is a supplement to (not a replacement for) CLAUDE.md Red Team Review and eval.py testing
@angosr Updated, please check.
angosr
left a comment
Re-review (2nd pass): PR #16 — APPROVE
All blocking issues resolved:
- Generated reports removed ✅ — redteam/ added to .gitignore, no report files in diff.
- Unrelated changes removed ✅ — __init__.py lazy-loading and cache.py fcntl fix are gone. Diff is now focused on the redteam CLI.
- URL inference improved ✅ — added get_probe_urls() to the QuestionTemplate base class, letting templates declare their own probe URLs instead of relying on fragile per-plugin heuristics in _infer_probe_urls.
Note
The tool is a useful supplement for quick template validation. The _infer_probe_urls heuristics still exist as a fallback, but templates can now opt into the cleaner get_probe_urls pattern. This should be documented as the preferred approach for new templates.
As noted in the previous review, this tool does NOT replace CLAUDE.md Red Team Review or eval.py testing — it complements them.
angosr
left a comment
Review WITHDRAWN — PR #16 rejected on re-examination
Previous approval is retracted. On closer analysis, this PR fails the Significance Gate and has a fundamental design flaw.
BLOCKING: Probe bypasses the real GT collection pipeline
The tool calls plugin.fetch_api_data(url) directly and feeds results into GTCollector.on_page_visit(). This completely bypasses the page cache pipeline that the real evaluation uses:
Real eval: Browser visits page → CacheManager stores {html, api_data, accessibility_tree} as atomic snapshot → GTCollector receives data from cache → GT computed
This tool: fetch_api_data() called directly → data injected into GTCollector → GT computed
This means:
- Cache timing semantics are lost — the real pipeline ensures HTML and api_data share the same timestamp. The probe has no such guarantee.
- GT priority rules (CLAUDE.md §6) are not tested — detail page > list page priority depends on the page cache mechanism
- A probe "pass" gives false confidence — templates can pass the probe but fail in real evaluation because the cache binding works differently
- Page-bound GT (GTSourceType.PAGE_ONLY) is violated by design — the whole point of PAGE_ONLY is that GT comes from pages the agent actually visits, not from direct API calls
This is not a minor gap — it's architecturally wrong. A tool that validates templates through a different code path than production is worse than no tool, because it creates false confidence.
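To make the timestamp-binding point concrete, here is a minimal sketch (all class and function names are illustrative stand-ins, not the project's real types): an atomic cache snapshot binds `html`, `api_data`, and the accessibility tree to one `fetched_at`, while direct injection carries no such binding.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PageSnapshot:
    # One atomic capture: html, api_data, and the accessibility tree
    # share a single fetched_at, so GT computation sees one consistent
    # moment in time -- this is what the real cache pipeline guarantees.
    html: str
    api_data: Dict
    accessibility_tree: Dict
    fetched_at: float = field(default_factory=time.time)


def probe_style_injection(api_data: Dict) -> Dict:
    # Direct fetch_api_data() injection: no html, no tree, and a
    # timestamp unrelated to anything a browser actually rendered.
    return {"api_data": api_data, "fetched_at": time.time()}
```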
BLOCKING: Fails Significance Gate
- Does it advance project goals? No. PR #13 and #14 demonstrated that real API GT verification can be done with simple pytest fixtures + injected data — no CLI tool needed. That pattern uses the real GTCollector correctly.
- Maintenance burden: 600+ lines of new code, including _infer_probe_urls with per-plugin URL heuristics that must be kept in sync with every new template.
- get_probe_urls added to core validators/base.py for a supplementary tool that no template implements — scope creep into core interfaces.
Recommendation
Close this PR. The existing pattern (pytest + real API data injection into GTCollector, as PR #13/14 did) is simpler, uses the real pipeline, and has zero maintenance overhead. If automated red-team checking is desired, it should go through the actual cache pipeline, not around it.
@angosr I still document limits explicitly: this does not reproduce arbitrary agent visit order or full multi-step trajectories; it uses a fixed probe URL list. It remains a supplement to CLAUDE.md red-team review and eval.py, not a substitute. A --cache-dir flag and report metadata record which cache directory was used.
(force-pushed: a8011a5 → 25ba970)
angosr
left a comment
Re-review: PR #16 — APPROVE
The core architectural issue has been fixed. The probe now uses the real cache pipeline.
What changed since last review
The probe now calls cache_manager.ensure_cached(pages, plugin) — the same CacheManager code path as production evaluation — instead of directly calling fetch_api_data. This means:
- ✅ Atomic page snapshots {html, api_data, accessibility_tree, fetched_at} are used
- ✅ GT priority rules (detail > list) are respected via the real cache
- ✅ Page-bound GT (PAGE_ONLY) semantics are preserved
- ✅ A probe pass now means the real cache pipeline works
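The corrected flow can be sketched as follows (FakeCacheManager and probe_via_cache are illustrative stand-ins; the real CacheManager fetches full page snapshots via Playwright on a cache miss, and the real collector is GTCollector.on_page_visit):

```python
from typing import Callable, Dict, List


class FakeCacheManager:
    """Stand-in for the project's CacheManager."""

    def __init__(self, fetch_page: Callable[[str], Dict]) -> None:
        self._fetch_page = fetch_page
        self._cache: Dict[str, Dict] = {}

    def ensure_cached(self, pages: List[str]) -> None:
        # Same contract as the production ensure_cached call: every URL
        # is materialized through the cache layer before GT collection.
        for url in pages:
            if url not in self._cache:
                self._cache[url] = self._fetch_page(url)

    def get(self, url: str) -> Dict:
        return self._cache[url]


def probe_via_cache(urls: List[str], cache: FakeCacheManager, on_page_visit) -> None:
    # GT collection reads from the cache, mirroring production order:
    # cache first, then one on_page_visit call per cached snapshot.
    cache.ensure_cached(urls)
    for url in urls:
        on_page_visit(url, cache.get(url))
```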
Resolved concerns
| Issue | Status |
|---|---|
| Bypasses page cache pipeline | ✅ Fixed — uses CacheManager.ensure_cached |
| Generated report files committed | ✅ Fixed — redteam/ in .gitignore, no reports in diff |
| Unrelated __init__.py/cache.py changes | ✅ Removed — scope is clean |
| get_probe_urls on base class | ✅ Acceptable — clean override for templates |
| Documented as supplement | ✅ Docstring explicitly says "supplements, does not replace, full eval.py runs or CLAUDE.md red-team review" |
Remaining note
_infer_probe_urls still has per-plugin heuristics as fallback, but get_probe_urls on QuestionTemplate provides the preferred override path. New templates should implement get_probe_urls instead of relying on the heuristics.
angosr
left a comment
Review WITHDRAWN (2nd retraction) — PR #16
BLOCKING: CacheManager dependency contradicts "lightweight, no-browser" premise
The probe now uses CacheManager.ensure_cached(), which internally calls _fetch_page() via Playwright. This creates a contradiction:
- Cache empty → ensure_cached must fetch via Playwright → requires full browser setup → same weight as eval.py, so the tool has no advantage
- Cache pre-populated → reads stale disk data → validates GT against old snapshots, not current live data
Neither mode was tested. No evidence the tool actually runs successfully in either scenario.
BLOCKING: No real-world verification
The tool has never been demonstrated working:
- No output from an actual python -m liveweb_arena.redteam run
- No evidence that CacheManager.ensure_cached works in the probe context (browser session availability unclear)
- No comparison showing it catches a real GT issue that pytest wouldn't catch
BLOCKING: Fails Significance Gate — no capability beyond pytest
PR #13 and #14 demonstrated that pytest + real API data injection into GTCollector verifies the entire GT computation pipeline in ~30 minutes of work, with zero maintenance overhead.
This tool adds 600+ lines of code, a CLI framework, _infer_probe_urls heuristics, metrics computation, and report generation — for a capability that pytest already provides more simply and reliably.
The bar for adding new tooling infrastructure should be: does it enable something that cannot be done today? This tool does not meet that bar.
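For reference, the pytest pattern from PR #13/#14 can be sketched roughly as below (GTCollector here is a simplified stand-in; the real class applies the CLAUDE.md §6 priority rules, and the injected dict mimics a captured API response):

```python
from typing import Dict, Optional


class GTCollector:
    """Simplified stand-in for the project's GTCollector."""

    def __init__(self) -> None:
        self.ground_truth: Optional[str] = None

    def on_page_visit(self, url: str, api_data: Dict) -> None:
        # Real implementation applies detail > list priority rules;
        # this stand-in just records the value it is fed.
        self.ground_truth = api_data.get("value")


def test_template_gt_from_injected_data() -> None:
    collector = GTCollector()
    # Inject a real API response captured once, instead of probing live.
    collector.on_page_visit("https://example.com/detail", {"value": "42"})
    assert collector.ground_truth == "42"
```

The whole verification lives in a few lines of test code, with no CLI, no URL inference, and no report generation to maintain.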
Recommendation
Close this PR. Template GT verification is adequately served by the established pytest + real API injection pattern.
angosr
left a comment
Updated review with actual test results — PR #16 remains REJECTED
I checked out the branch and ran the tool. Results are mixed — it partially works but has critical reliability issues.
Actual test results
| Test | Result | Detail |
|---|---|---|
| --list-templates | ✅ | Lists all registered templates correctly |
| coingecko/coingecko_price seed=1 | ✅ GT success | CacheManager fetches via Playwright (9.2s), GT returns value |
| openlibrary/openlibrary_book_stats seeds=1,2,3 | ✅ 3/3 GT success | 3 unique GT values, collapse=0% |
| openmeteo/openmeteo_current seed=1 | ❌ GT fail | _infer_probe_urls only provides docs homepage, not city-specific URL |
| stooq/stooq_daily_change seed=1 | ❌ Crash | Unhandled exception in generate_composite_task, no graceful error |
BLOCKING: Tool crashes on some plugins instead of failing gracefully
stooq/stooq_daily_change raises an unhandled ValueError and exits. A tool that crashes instead of reporting a failure is not production-ready.
BLOCKING: _infer_probe_urls fails for 2 of 4 tested plugins
OpenMeteo templates need city-specific URLs (e.g., open-meteo.com/en/docs?latitude=38.72&longitude=-9.14), but the probe only visits the docs homepage. The GT correctly reports "Agent did not visit Open Meteo page for 'Lisbon'" — but this means the tool gives a false negative for every OpenMeteo template. Same pattern would affect Stooq (needs symbol-specific URLs) and Taostats (needs subnet-specific URLs).
The tool is reliable only for plugins where start_url alone is sufficient (CoinGecko coin pages, OpenLibrary search pages). For plugins requiring parameter-specific URLs, it systematically fails.
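A template-declared override would close this gap; a sketch for the OpenMeteo case (the class name and validation_info keys are hypothetical, the URL shape is the one quoted above):

```python
from typing import Dict, List


class OpenMeteoCurrentTemplate:
    """Hypothetical template showing parameter-specific probe URLs."""

    def get_probe_urls(self, validation_info: Dict) -> List[str]:
        lat = validation_info["latitude"]
        lon = validation_info["longitude"]
        # City-specific docs URL, not the bare homepage the
        # _infer_probe_urls heuristics fall back to.
        return [f"https://open-meteo.com/en/docs?latitude={lat}&longitude={lon}"]
```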
Confirmed: requires Playwright (not lightweight)
CacheManager uses Playwright to fetch pages: [Cache] MISS data - fetching www.coingecko.com/en/coins/dogecoin took 9.2 seconds. This is the full browser stack — the tool is not a lightweight alternative to eval.py.
Minor: aiohttp resource leak
```
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x769e1f40e780>
```
Assessment
The tool works for ~50% of plugins (those with self-contained start URLs) but crashes or gives false negatives for the rest. Combined with requiring full Playwright setup, the value proposition over pytest + real API injection is weak.
If the author wants to continue: (1) fix crash handling, (2) implement get_probe_urls on all existing templates so _infer_probe_urls heuristics aren't needed, (3) document which plugins are supported, (4) close the aiohttp session properly.
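For point (1), the crash fix is a per-template error boundary; a minimal sketch (ProbeResult and run_probes are illustrative names, not project code):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    template: str
    ok: bool
    error: str = ""


def run_probes(templates: List[str], probe: Callable[[str], None]) -> List[ProbeResult]:
    # Catch per-template exceptions so one bad plugin (e.g. an unhandled
    # ValueError in generate_composite_task) becomes a reported failure
    # instead of a crash that aborts the whole run.
    results: List[ProbeResult] = []
    for name in templates:
        try:
            probe(name)
            results.append(ProbeResult(name, True))
        except Exception as exc:  # report, don't crash
            results.append(ProbeResult(name, False, f"{type(exc).__name__}: {exc}"))
    return results
```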
(force-pushed: 25ba970 → 5780e29)
@angosr Please review
(force-pushed: d591fab → c3adbfc)
Summary
This PR adds a Template Red Team Dashboard as a lightweight CLI (python -m liveweb_arena.redteam) that probes templates without running the browser agent. It executes a deterministic “API semantic probe” by calling each plugin’s fetch_api_data() for a minimal set of inferred URLs, feeding those snapshots through the real GTCollector + template get_ground_truth(). It then emits actionable template-quality metrics and artifacts (report.json, report.md) to support red-team review, anti-memorization checks, and quick regressions in CI.
Motivation
Template quality issues are easy to ship unintentionally:
What’s included
Entry point: liveweb_arena/redteam/main.py
Key capabilities:
Artifacts:
Core logic: liveweb_arena/redteam/probe.py
How it works:
Metrics: liveweb_arena/redteam/metrics.py
Computed per template:
Flags (in CLI):
- --fail-on-violation (exit code 2 if any violation)
- --min-gt-success 0..1
- --max-collapse 0..1
- --max-baseline 0..1
- --min-stability 0..1 (requires --repeat >= 2)
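The gating logic behind these flags can be sketched as a small pure function (the function name and exact metric names are illustrative; the exit-code-2-on-violation behavior is the one documented above):

```python
def gate_exit_code(
    gt_success: float,
    collapse: float,
    baseline: float,
    min_gt_success: float = 0.0,
    max_collapse: float = 1.0,
    max_baseline: float = 1.0,
) -> int:
    # Mirrors --fail-on-violation: exit 2 if any threshold is
    # violated, otherwise exit 0.
    violated = (
        gt_success < min_gt_success
        or collapse > max_collapse
        or baseline > max_baseline
    )
    return 2 if violated else 0
```

This keeps the threshold comparisons in one testable place instead of scattering them through the CLI argument handling.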
Two issues prevented running python -m liveweb_arena.redteam on Windows: