A platform where domain experts evaluate AI-generated scientific hypotheses to benchmark how well different models perform at scientific reasoning.
AI models generate hypotheses across scientific domains (Physics, Chemistry, Biology, and more). Experts rate each hypothesis on three criteria (novelty, plausibility, and testability), each on a 1–5 scale. The ratings are aggregated into a leaderboard that ranks models by their average expert score.
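The aggregation step could look roughly like the sketch below. This is an illustration, not the platform's actual code: the `Rating` and `LeaderboardEntry` shapes, the `buildLeaderboard` name, and the choice to weight the three criteria equally are all assumptions made for the example.

```typescript
// Hypothetical shape of one expert rating; field names are assumptions.
interface Rating {
  model: string;
  novelty: number; // 1-5
  plausibility: number; // 1-5
  testability: number; // 1-5
}

// Hypothetical leaderboard row.
interface LeaderboardEntry {
  model: string;
  ratingCount: number;
  averageScore: number;
}

// Averages the three criteria per rating (equal weights, an assumption),
// then averages those per-rating scores per model and sorts descending.
function buildLeaderboard(ratings: Rating[]): LeaderboardEntry[] {
  const totals = new Map<string, { sum: number; count: number }>();

  for (const r of ratings) {
    const score = (r.novelty + r.plausibility + r.testability) / 3;
    const t = totals.get(r.model) ?? { sum: 0, count: 0 };
    t.sum += score;
    t.count += 1;
    totals.set(r.model, t);
  }

  return Array.from(totals, ([model, t]) => ({
    model,
    ratingCount: t.count,
    averageScore: t.sum / t.count,
  })).sort((a, b) => b.averageScore - a.averageScore);
}
```

Calling `buildLeaderboard` with every stored rating returns one row per model, best average first. A real leaderboard might also require a minimum number of ratings before a model is ranked; that is left out here.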
Currently benchmarking 180+ models from OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, xAI, and others.
Anyone can contribute evaluations. Sign in at hypothesisai-production.up.railway.app and start rating hypotheses in your area of expertise. The platform is free to use.
- React
- Next.js
- PostgreSQL
- OpenRouter
- Tailwind CSS