feat(web): ask examples that ChatGPT verifiably gets wrong by andylizf · Pull Request #20 · StarTrail-org/PixelRAG

andylizf · 2026-06-04T15:02:31Z

Tested 23 SimpleQA-style candidates against Codex (GPT-5-class), used as the 'ChatGPT control group'. Modern models have memorized nearly all public-benchmark prose facts; the survivors are questions whose answers live in table/infobox cells:

question	ChatGPT no-web	ChatGPT with web	our agent (live, verified)
Inter shots on target, 2010 UCL final	4 ❌	4 ❌ (cites UEFA/Wikipedia!)	7 ✅ reads the stats table tile
Nagaland RTO code NL-03	Kohima ❌	Tuensang ✅	Tuensang ✅ triangulates district infoboxes
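
A minimal sketch of the filtering step behind this table, for illustration only. The PR does not spell out the exact selection rule, so `survives` (keep a candidate when the control misses it without web access) is an assumption, as are the `Candidate` type and its field names. Only the two rows reported above are included; the other 21 candidates are not listed in the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One tested question (hypothetical structure, not from the repo)."""
    question: str
    expected: str
    control_no_web: str
    control_with_web: str


# The two rows reported in the table; values copied from the PR description.
CANDIDATES = [
    Candidate("Inter shots on target, 2010 UCL final", "7", "4", "4"),
    Candidate("Nagaland RTO code NL-03", "Tuensang", "Kohima", "Tuensang"),
]


def survives(c: Candidate) -> bool:
    # Assumed rule: keep questions the control gets wrong without web access.
    return c.control_no_web != c.expected


def fails_even_with_web(c: Candidate) -> bool:
    # Stricter check: the control is still wrong with web search enabled.
    return c.control_with_web != c.expected


survivors = [c for c in CANDIDATES if survives(c)]
flagship = [c for c in survivors if fails_even_with_web(c)]

for c in survivors:
    tag = "wrong even with web" if fails_even_with_web(c) else "fixed by web"
    print(f"{c.question}: {tag}")
```

Under this assumed rule both rows survive, and only the shots question remains wrong once web search is enabled.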

The shots question is the flagship: ChatGPT gets it wrong even with browsing, because text extraction mangles the multi-column match-stats table, which is exactly the failure mode the paper's pipeline figure illustrates. Kept 'Explain The Starry Night' and '介绍一下兵马俑' ('Tell me about the Terracotta Army') for approachability.
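
One plausible way a flattened stats table leads to a wrong cell is sketched below. This is an illustration, not a diagnosis of what the browsing tool actually did. The layout (home value, stat label, away value), the team names, and every number are made up for the example and are not real match data; `flatten`, `naive_lookup`, and `cell_lookup` are hypothetical helpers.

```python
# Illustrative three-column stats table: home value | label | away value.
# All values are invented for this sketch.
ROWS = [
    ("Team A", "", "Team B"),        # header row
    ("9", "Total shots", "14"),
    ("3", "Shots on target", "6"),
    ("2", "Corners", "5"),
]


def flatten(rows):
    """Mimic plain-text extraction: emit non-empty cells in reading order."""
    return " ".join(cell for row in rows for cell in row if cell)


def naive_lookup(text, label):
    """Read the first token after the label, as a text-only reader might."""
    tail = text.split(label, 1)[1].split()
    return tail[0]


def cell_lookup(rows, team, label):
    """Read the cell at (label row, team column) using the table structure."""
    col = rows[0].index(team)
    for row in rows[1:]:
        if row[1] == label:
            return row[col]
    raise KeyError(label)


text = flatten(ROWS)
print(text)
print("text-only:", naive_lookup(text, "Shots on target"))
print("structured:", cell_lookup(ROWS, "Team A", "Shots on target"))
```

In the flattened string the home team's value sits before the label and the away team's value after it, so the text-only heuristic returns the other team's number, while the structured lookup returns the intended cell.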

Replace two generic ask examples with questions whose answers live in table/infobox cells, verified empirically:

- 'How many shots on target did Inter have in the 2010 Champions League final?' -> 7. GPT-5-class Codex answers 4 WITHOUT web access and STILL answers 4 WITH web search enabled (while citing UEFA/Wikipedia); text scraping fumbles the match-statistics table. Our agent reads the table tile and answers 7.
- 'Which district in Nagaland has the RTO code NL-03?' -> Tuensang. Codex (no web) says Kohima; our agent triangulates NL-01/02/03 across district infoboxes and answers correctly.

Both verified end-to-end against the live agent. Kept one accessible classic and the Chinese example for approachability.

vercel · 2026-06-04T15:02:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
web	Ready	Preview, Comment	Jun 4, 2026 3:03pm

vercel Bot deployed to Preview June 4, 2026 15:03

andylizf merged commit daeeb16 into main Jun 4, 2026
6 checks passed

andylizf deleted the web/ask-examples-hard branch June 4, 2026 15:03

andylizf merged 1 commit into main from web/ask-examples-hard
