
Request: ablation data, benchmark scoring code, and significance tests — independent audit of 4 vision LLMs on 40 verified signals could not reproduce PatternAgent's contribution #21

@roman-rr

Warning to practitioners evaluating PatternAgent for production trading

As of April 2026, none of the frontier vision LLMs we independently tested can read directional chart patterns reliably enough for production trading. Deploying PatternAgent (or any functionally equivalent vision-LLM-as-chart-reader architecture) as a soft prior or directional filter in a live trading pipeline introduces negative edge with a structural long bias, not zero edge: the losses come from the bias, not from the chance-level accuracy.

Empirical findings across 4 frontier vision LLMs (Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.7, Gemini 3 Flash Preview — 2 vendors, 3 price tiers) on 40 verified production crypto trades (20 TP hits, 20 SL hits, balanced 20 long / 20 short), 215 LLM calls total, $1.16 spent:

  • Direction accuracy indistinguishable from 50% baseline. Haiku 51.4%, Gemini 51.4%, Opus 57.1% at n=37. All Wilson 95% CIs contain 50%.
  • Confidence does not discriminate correct from wrong calls. Signed point-biserial r(conf × correctness, outcome) ≈ 0 for all three; CIs straddle zero.
  • Pattern-name accuracy: 1 correct out of 215 calls across the entire audit.
  • Gemini 3 Flash: 100% long-bias. Called LONG on 17 of 17 long fixtures and 18 of 20 short fixtures — a 90 percentage-point long/short gap. That is not visual pattern reading. It is a bullish prior disconnected from image content.

Full audit with methodology, raw data, per-model tables, verbatim prompt, and 4 sample rendered charts: https://gist.github.com/roman-rr/c1cd675f7c35b68ae5ac281c30080166

If you are considering integrating PatternAgent into a live trading system: this is the evidence. The discussion below is a good-faith request to the maintainers for missing artifacts (ablation data, benchmark scorer, significance tests) that could update this assessment. Until such artifacts surface, practitioners should treat the paper's published accuracy numbers as unverified and the visual-agent approach as unvalidated for production deployment.


Hi QuantAgent team — thanks for open-sourcing the pipeline. We wanted to integrate a PatternAgent-style visual expert into our production crypto signals service and took the paper's claims as our starting point. In the course of evaluation we ran into several gaps in the published artifacts and an empirical replication that diverges meaningfully from the paper's headline numbers. This issue is a good-faith request for clarification on the reproducibility and validity points below.

Missing from the paper and/or repo

1. No ablation study. The paper does not report PatternAgent's isolated contribution. No "PatternAgent off" vs "IndicatorAgent + TrendAgent only" baseline appears in the main text or appendix. If this data exists, could it be shared?

2. Benchmark scoring code is absent from the repo. The 100 CSVs per asset are shipped under benchmark/, but the scorer that produced Table 1's numbers (e.g. 50.7% BTC, 63.7% SPX) is not in the public repo. We could not replicate the 50.7% without writing our own evaluator, which means our reimplementation and yours could diverge silently. Could the scoring harness be released?

3. LLM used for the benchmark is not disclosed in the paper's experiments section. default_config.py defaults to gpt-4o-mini (tool) + gpt-4o (graph), but the paper does not state which model produced Table 1. Appendix H discloses GPT-4o's use for writing the manuscript, not for running the experiments. Explicit confirmation would help reproducibility.

4. No significance testing. With 100 segments × 3-bar windows ≈ 300 binary outcomes per asset, the standard error on a proportion near 50% is ~2.9pp. The +5.4pp BTC gap over XGBoost sits inside 2σ and would not survive a standard binomial or two-proportion test (a quick check is sketched after this list). Has any significance analysis been run on the published deltas?

5. Training-data overlap. The BTC test data (2023-04 to 2025-06) overlaps GPT-4o's training window. Was any memorization / leakage analysis performed?
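To make item 4 concrete, the kind of check we have in mind is a plain two-proportion z-test. The sketch below is ours, not the paper's, and it assumes ~300 binary outcomes per arm and a baseline sitting 5.4pp below the published 50.7% (the paper does not report the underlying counts):

```python
import math

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """Two-sided z-test for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal CDF tail
    return z, p_value

# Published BTC accuracy 50.7% vs a baseline implied 5.4pp lower (~45.3%),
# ~300 binary outcomes per arm (100 segments x 3-bar windows) -- our assumptions.
z, p = two_proportion_z(0.507, 300, 0.453, 300)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")  # roughly z = 1.3, p = 0.19
```

Under those assumptions the delta is roughly 1.3σ, which is why we are asking for significance tests rather than asserting the published result is noise.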

Our independent empirical audit

We ran PatternAgent's task shape (render bare candlestick chart → ask vision LLM to identify a classical pattern from the same 16-pattern glossary → collect direction + confidence) on 40 verified crypto signals (20 TP hits, 20 SL hits, balanced 20 long / 20 short) across 4 frontier vision LLMs from 2 vendors:

  • Claude Haiku 4.5 (Anthropic, via OpenRouter)
  • Claude Sonnet 4.6 (Anthropic, via Azure Foundry)
  • Claude Opus 4.7 (Anthropic, via Azure Foundry)
  • Gemini 3 Flash Preview (Google, via OpenRouter)

Methodology matches what we understand to be PatternAgent's design: bare candlestick rendering (node-canvas, 48 bars × 1200×700 px for 4H primary + 45 bars × 1200×500 px for 1D macro), 16-pattern glossary verbatim, strict JSON output schema, point-in-time OHLCV from Hyperliquid. Every fixture is a real signal that shipped to subscribers and was subsequently verified against market outcome via verification.takeProfitHit or verification.stopLossHit.
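For concreteness, the per-fixture scoring rule reduces to the sketch below. verification.takeProfitHit and verification.stopLossHit are the fields named above; every other name (side, direction, confidence, pattern) is our own schema for the strict JSON output, not something PatternAgent defines:

```python
def winning_direction(fixture: dict) -> str:
    """Resolve the verified market outcome for a shipped signal.

    A long that hit take-profit, or a short that hit stop-loss, means the
    market moved up; the mirror cases mean it moved down.
    """
    side = fixture["side"]  # "long" or "short" -- our field name, not PatternAgent's
    tp = fixture["verification"]["takeProfitHit"]
    sl = fixture["verification"]["stopLossHit"]
    assert tp != sl, "every fixture resolved to exactly one of TP / SL"
    if side == "long":
        return "long" if tp else "short"
    return "short" if tp else "long"

def score_call(response: dict, fixture: dict) -> bool:
    """True iff the model's directional call matched the verified outcome.

    `response` is the parsed JSON we prompted the vision LLM to return, roughly
    {"pattern": str, "direction": "long" | "short", "confidence": float}.
    """
    return response["direction"] == winning_direction(fixture)
```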

Headline results

  • Direction accuracy: Haiku 51.4%, Gemini 51.4%, Opus 57.1% (n=37 each, after prep-failed fixtures). Wilson 95% confidence intervals: [35.9%, 66.6%], [35.9%, 66.6%], [40.9%, 72.0%] — all fully contain the 50% coin-flip baseline.
  • Confidence calibration: signed point-biserial r(confidence × correctness, winner outcome) = −0.09, +0.05, −0.03 for Haiku, Gemini, Opus respectively. All 95% CIs straddle zero. Confidence does not discriminate correct calls from wrong calls. (Both statistics are sketched after this list.)
  • Pattern-name accuracy: 1 correct out of 215 total calls across the entire audit (Claude Sonnet 4.6 naming BTC's April 7 2025 tariff-flush as "V-shaped Reversal" — the sole hit; the other 214 calls were wrong).
  • Gemini 3 Flash called LONG on 17 of 17 long fixtures (100%) and 18 of 20 short fixtures — a 90 percentage-point long/short imbalance which we read as a structural bullish prior, not visual pattern recognition. Opus' gap was 49pp, Haiku's was 13.8pp.
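Both statistics are standard; the stdlib sketch below shows how we compute them (Wilson score interval plus the generic point-biserial coefficient; the exact signing convention for confidence is documented in the gist):

```python
import math

Z95 = 1.959964  # two-sided 95% normal quantile

def wilson_ci(correct: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def point_biserial(outcomes: list[int], signed_conf: list[float]) -> float:
    """Pearson correlation between a 0/1 outcome and a continuous score
    (numerically identical to the point-biserial coefficient)."""
    n = len(outcomes)
    mo, ms = sum(outcomes) / n, sum(signed_conf) / n
    cov = sum((o - mo) * (s - ms) for o, s in zip(outcomes, signed_conf)) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in outcomes) / n)
    ss = math.sqrt(sum((s - ms) ** 2 for s in signed_conf) / n)
    return cov / (so * ss)

# 19 correct directional calls out of 37 reproduces the first interval above:
print(wilson_ci(19, 37))  # ~ (0.359, 0.666)
```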

Full methodology, per-model tables, per-round cost/latency breakdowns, the verbatim prompt, and four sample rendered charts are in the public gist linked above. Total audit cost: $1.16 across 215 LLM calls. Raw JSON data per round is referenced in the gist appendix.

Scope note

Our audit targets PatternAgent specifically (pattern_agent.py) — the 16-pattern identification task. We did not independently test trend_agent.py's task shape (candlesticks + overlaid trendlines for channel-direction analysis), so our findings do not speak to that component. indicator_agent.py is text-only and outside the scope of a vision-LLM audit.

What would change our assessment

We'd genuinely update our conclusion if any of the following surfaces:

  • Ablation data showing PatternAgent's isolated contribution above baseline (IndicatorAgent + TrendAgent only)
  • Public benchmark scorer we could run against the shipped CSVs to verify the 50.7% / 63.7% / etc. numbers end-to-end
  • Significance tests on the published accuracy deltas (binomial CI, bootstrap, or comparable)
  • A single LLM that clears these thresholds on balanced test fixtures: direction accuracy Wilson 95% lower bound > 50%, signed r(confidence, outcome) ≥ 0.3 with CI excluding zero, long/short call imbalance < 10pp (an executable version of this check is sketched below)
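Operationally, that last item amounts to the check sketched below; the inputs are per-model aggregates in our own naming, and wilson_lower mirrors the interval used in our results:

```python
import math

def wilson_lower(correct: int, n: int, z: float = 1.959964) -> float:
    """Lower bound of the Wilson 95% score interval."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half

def clears_thresholds(correct: int, n: int,
                      conf_outcome_r: float, r_ci_low: float,
                      long_calls: int, short_calls: int) -> bool:
    """Acceptance criteria from the last item above, evaluated on balanced fixtures."""
    accuracy_ok = wilson_lower(correct, n) > 0.50
    confidence_ok = conf_outcome_r >= 0.30 and r_ci_low > 0.0
    imbalance = abs(long_calls - short_calls) / (long_calls + short_calls)
    return accuracy_ok and confidence_ok and imbalance < 0.10
```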

Happy to compare notes, run additional tests on your specific setup, or correct our audit if we've misunderstood PatternAgent's intended evaluation protocol. Thanks for considering.
