feat(web): ask examples that ChatGPT verifiably gets wrong #20

Merged: andylizf merged 1 commit into main from web/ask-examples-hard on Jun 4, 2026

@andylizf andylizf commented Jun 4, 2026

Tested 23 SimpleQA-style candidates against Codex (GPT-5-class) as the 'ChatGPT control group'. Modern models have memorized nearly all public-benchmark prose facts — the survivors are table/infobox cells:

| question | ChatGPT no-web | ChatGPT with web | our agent (live, verified) |
| --- | --- | --- | --- |
| Inter shots on target, 2010 UCL final | 4 ❌ | 4 ❌ (cites UEFA/Wikipedia!) | 7 ✅ reads the stats table tile |
| Nagaland RTO code NL-03 | Kohima ❌ | Tuensang ✅ | Tuensang ✅ triangulates district infoboxes |

The shots question is the flagship: it is answered wrong even with browsing, because text extraction mangles the multi-column match-stats table, which is exactly the failure mode the paper's pipeline figure illustrates. Kept 'Explain The Starry Night' + '介绍一下兵马俑' ("Introduce the Terracotta Army") for approachability.
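
To make the table failure mode concrete, here is a minimal Python sketch of how flattening a stats table to plain text can lose the link between a number and its column. It is not the project's pipeline: the team names, the stat values, and the helpers `CellCollector`, `flatten`, and `lookup` are all invented for illustration, and real extractors may behave differently.

```python
from html.parser import HTMLParser

# Toy three-column stats table (left value | stat name | right value).
# Team names and numbers are invented for illustration only.
TABLE_HTML = """
<table>
  <tr><th>Team A</th><th></th><th>Team B</th></tr>
  <tr><td>5</td><td>Shots on target</td><td>9</td></tr>
  <tr><td>3</td><td>Corners</td><td>6</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects cell text row by row, preserving column positions."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Whitespace between tags arrives here too; ignore it.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def flatten(rows):
    """Mimics naive text extraction: drop empty cells, join the rest."""
    return " ".join(cell for row in rows for cell in row if cell)


def lookup(rows, stat, team):
    """Reads one cell by (row label, column header)."""
    col = rows[0].index(team)
    for row in rows[1:]:
        if stat in row:
            return row[col]
    raise KeyError(stat)


parser = CellCollector()
parser.feed(TABLE_HTML)
rows = parser.rows
flat = flatten(rows)
```

In the flattened string the value after "Shots on target" sits next to the first value of the following row, and nothing says which team either number belongs to. The cell-level `lookup` keeps that association, which is the property a reader of the table tile would rely on.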

Replace two generic ask examples with questions whose answers live in
table/infobox cells — verified empirically:

- 'How many shots on target did Inter have in the 2010 Champions League
  final?' -> 7. GPT-5-class Codex answers 4 WITHOUT web access and STILL
  answers 4 WITH web search enabled (while citing UEFA/Wikipedia) — text
  scraping fumbles the match-statistics table. Our agent reads the table
  tile and answers 7.
- 'Which district in Nagaland has the RTO code NL-03?' -> Tuensang. Codex
  (no web) says Kohima; our agent triangulates NL-01/02/03 across district
  infoboxes and answers correctly.

Both verified end-to-end against the live agent. Kept one accessible classic
and the Chinese example for approachability.
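
The infobox triangulation described in the commit message could be sketched roughly as follows. This is an assumption about the general idea, not the agent's actual implementation: only the Tuensang and NL-03 pairing comes from this PR, while the other district names, the `registration` field name, and both helper functions are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical infobox snapshots keyed by district page. Only the
# Tuensang -> NL-03 pairing comes from the PR; the other districts and
# the field name "registration" are placeholders.
INFOBOXES = {
    "Tuensang": {"registration": "NL-03"},
    "District X": {"registration": "NL-01"},
    "District Y": {"registration": "NL-02"},
    "District Z": {},  # infobox without a registration field
}


def build_code_index(infoboxes):
    """Maps each registration code to the set of districts claiming it."""
    index = defaultdict(set)
    for district, fields in infoboxes.items():
        code = fields.get("registration")
        if code:
            index[code].add(district)
    return index


def district_for(code, infoboxes):
    """Returns the district only when exactly one infobox claims the code."""
    claimants = build_code_index(infoboxes).get(code, set())
    if len(claimants) == 1:
        return next(iter(claimants))
    return None  # missing or conflicting evidence: decline to answer
```

Checking neighbouring codes alongside the target one is what makes this a cross-check rather than a single lookup: a code claimed by two districts, or by none, yields no answer instead of a guess.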

vercel Bot commented Jun 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| web | Ready | Preview, Comment | Jun 4, 2026 3:03pm |

@andylizf andylizf merged commit daeeb16 into main Jun 4, 2026
6 checks passed
@andylizf andylizf deleted the web/ask-examples-hard branch June 4, 2026 15:03