PHYSICS is a benchmark for university-level physics problem solving, designed to assess the reasoning and analytical capabilities of foundation models. The dataset contains 1,297 PhD-qualifying exam problems spanning six fundamental physics disciplines.
- Dataset Size: 1,297 problems
- Problem Domains:
  - Classical Mechanics
  - Quantum Mechanics
  - Thermodynamics & Statistical Mechanics
  - Electromagnetism
  - Atomic Physics
  - Optics
- Problem Complexity: Requires deep mathematical modeling and multi-step logical reasoning.
- Automated Evaluation System:
  - SymPy-based symbolic verification of mathematical answers (see the sketch after this list)
  - GPT-4o-based validation of natural-language answers
- Benchmarking Across 33 Models:
  - Proprietary models (e.g., o3-mini, Gemini-1.5-Pro)
  - Open-source models (e.g., DeepSeek-R1, Llama-3.3-70B)
- Performance Gap Analysis:
  - The best-performing model, o3-mini, reaches only 59.9% accuracy
  - Open-source models lag far behind, revealing substantial gaps in physics problem-solving ability
- Sources: Publicly available PhD-qualifying exam questions
- Annotation Process:
  - Structured review by expert annotators
  - Strict data quality control
- Evaluation Metrics:
  - Problem complexity and difficulty classification
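To make the symbolic-verification step concrete, here is a minimal sketch of SymPy-based equivalence checking. It assumes answers have already been extracted as SymPy-parsable expression strings; the function name `symbolically_equivalent` and the fallback behavior are our illustration, not the paper's exact implementation.

```python
# Minimal sketch of SymPy-based symbolic equivalence checking.
# Assumes gold and predicted answers arrive as plain, parsable strings;
# the real pipeline's extraction and normalization steps are omitted.
import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equivalent(gold: str, pred: str) -> bool:
    """Return True if two answer strings simplify to the same expression."""
    try:
        gold_expr = parse_expr(gold)
        pred_expr = parse_expr(pred)
    except (sp.SympifyError, SyntaxError, TypeError):
        return False  # unparsable answers fall through to the LLM-based check

    # simplify(gold - pred) == 0 is a standard equivalence test; if SymPy
    # cannot reduce the difference to zero, we conservatively treat the
    # answers as different.
    return sp.simplify(gold_expr - pred_expr) == 0

# Example: algebraically identical answers in different surface forms.
assert symbolically_equivalent("sin(x)**2 + cos(x)**2", "1")
assert not symbolically_equivalent("m*g*h", "m*g*h/2")
```

Note that this check is conservative: a difference SymPy fails to simplify counts as a mismatch, which is one reason the pipeline pairs it with a model-based check for free-form answers.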
How PHYSICS compares with existing math and science benchmarks:

Benchmark | Multi-modal | Size | Level | Question Type | Evaluation | Reasoning Steps |
---|---|---|---|---|---|---|
JEEBench | ❌ | 515 | CEE | OE, MC | Rule-Based | - |
MATH | ❌ | 12,500 | K12-Comp | OE | Rule-Based | - |
HARDMath | ❌ | 1,466 | Graduate | OE | Rule + Model | - |
GSM8K | ❌ | 8,500 | K8 | OE | Rule-Based | 5.0 |
GPQA | ❌ | 227 | Graduate | OE | Rule-Based | 3.6 |
SciQ | ❌ | 13,679 | K4-K8 | MC, OE | Rule-Based | - |
SciEval | ❌ | 1,657 | - | OE, MC | Rule-Based | - |
SciBench | ✅ | 295 | College | OE | Rule-Based | 2.8 |
MMMU | ✅ | 443 | College | OE, MC | Rule-Based | - |
MMMU-Pro | ✅ | 3,460 | College | MC | Rule-Based | - |
ScienceQA | ✅ | 617 | K1-K12 | MC | Rule-Based | 2.4 |
OlympiadBench | ✅ | 2,334 | Comp | OE | Rule-Based | 3.7 |
PutnamBench | ❌ | 1,692 | College | OE | Rule-Based | - |
Ours | ✅ | 1,297 | PhD-Qualifying | OE | Rule + Model | 5.7 |
Legend:
- Level:
  - Comp: Competition
  - College: College Level
  - CEE: College Entrance Examination
  - K1-K12: Elementary and High School Level
- Question Type:
  - OE: Open-ended Questions
  - MC: Multiple-choice Questions
- Reasoning Steps: Average number of steps, based on statistics from the corresponding papers.
- Evaluation Framework:
  - SymPy-based symbolic equivalence checking
  - LLM-based accuracy verification
  - Weighted scoring based on correctness and complexity (hedged sketches of the last two follow below)
- Fine-Grained Analysis:
  - Step-by-step reasoning assessment
  - Identification of the first erroneous step
  - Error categorization for detailed failure analysis
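The LLM-based accuracy verification can be sketched as follows, assuming an OpenAI-style client. The prompt, the `llm_judge` helper, and the one-word verdict format are illustrative assumptions, not the exact prompt used in the paper.

```python
# Hedged sketch of model-based answer validation with GPT-4o.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative judge prompt; the paper's actual prompt is not reproduced here.
JUDGE_PROMPT = """You are grading a physics answer.
Question: {question}
Reference answer: {gold}
Model answer: {pred}
Reply with exactly one word: CORRECT or INCORRECT."""

def llm_judge(question: str, gold: str, pred: str) -> bool:
    """Ask GPT-4o whether a free-form answer matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic grading
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       question=question, gold=gold, pred=pred)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```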
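Similarly, the weighted-scoring bullet can be illustrated with a toy aggregator. The per-problem complexity weights here are hypothetical; the paper's exact weighting scheme is not reproduced.

```python
# Purely illustrative weighted scoring: each problem contributes its
# (assumed) complexity weight to the total, and a model earns that weight
# only if its answer is judged correct.
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """results: one (is_correct, complexity_weight) pair per problem."""
    total = sum(weight for _, weight in results)
    earned = sum(weight for correct, weight in results if correct)
    return 100.0 * earned / total if total else 0.0

# Example: two easy problems solved, one hard problem missed.
print(weighted_score([(True, 1.0), (True, 1.0), (False, 3.0)]))  # 40.0
```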
Accuracy (%) of each model by subfield, with validation (Val) and test (Test) averages:

Model | AMO | E&M | CM | Opt. | QM | Stats. | Val | Test |
---|---|---|---|---|---|---|---|---|
**Proprietary Models** | | | | | | | | |
o3-mini | 52.4 | 64.9 | 59.8 | 51.5 | 66.0 | 60.0 | 55.0 | 59.9 |
o1-mini | 45.4 | 41.8 | 41.9 | 40.6 | 44.3 | 48.0 | 44.1 | 43.6 |
Gemini-1.5-pro† | 35.5 | 40.2 | 31.5 | 32.2 | 44.5 | 43.7 | 35.3 | 38.4 |
GPT-4o† | 35.3 | 44.1 | 33.4 | 23.4 | 33.8 | 45.0 | 34.7 | 36.7 |
Claude-3.5-Sonnet† | 37.2 | 34.8 | 27.6 | 35.5 | 35.1 | 38.4 | 31.7 | 34.7 |
**Open-Source Models** | | | | | | | | |
DeepSeek-R1 | 37.0 | 48.6 | 38.3 | 43.1 | 44.5 | 51.5 | 44.2 | 44.3 |
Qwen2.5-Math-72B | 27.0 | 34.8 | 27.3 | 27.4 | 36.2 | 37.0 | 38.5 | 32.2 |
Llama-3.3-70B | 28.2 | 35.8 | 27.9 | 17.2 | 31.4 | 41.3 | 34.3 | 31.5 |
phi-4 | 32.8 | 33.0 | 19.8 | 27.2 | 23.4 | 35.2 | 28.7 | 29.1 |
Qwen2.5-72B | 28.8 | 30.9 | 23.0 | 25.4 | 27.4 | 33.2 | 31.5 | 28.7 |
Qwen2.5-32B | 25.5 | 27.5 | 19.4 | 20.8 | 24.7 | 41.1 | 23.3 | 27.6 |
Mistral-Small-24B | 19.1 | 29.5 | 19.6 | 17.6 | 15.2 | 28.4 | 25.1 | 21.8 |
Qwen2.5-7B | 21.8 | 28.1 | 11.2 | 18.7 | 17.4 | 22.1 | 20.9 | 20.4 |
Qwen2.5-14B | 23.8 | 19.7 | 14.1 | 12.3 | 13.5 | 28.2 | 25.3 | 19.6 |
Gemma-2-27b | 14.3 | 19.0 | 16.2 | 13.4 | 18.4 | 25.9 | 21.7 | 18.3 |
Yi-1.5-34B | 11.0 | 15.4 | 18.0 | 13.2 | 19.6 | 25.2 | 25.3 | 17.4 |
Qwen2.5-Math-1.5B | 13.3 | 14.8 | 16.5 | 16.2 | 17.2 | 19.5 | 15.1 | 16.4 |
InternVL2-5-38B† | 15.3 | 12.5 | 12.5 | 7.7 | 18.0 | 23.1 | 16.7 | 15.3 |
Aria† | 13.0 | 14.0 | 14.2 | 11.7 | 9.7 | 14.4 | 12.7 | 12.9 |
QwQ-32B-Preview | 16.7 | 7.5 | 10.1 | 11.2 | 10.6 | 14.8 | 12.4 | 12.1 |
Gemma-2-9b | 9.4 | 8.2 | 9.1 | 16.5 | 12.1 | 16.9 | 15.2 | 11.9 |
Mistral-7B | 10.1 | 10.4 | 5.1 | 13.7 | 11.6 | 17.6 | 12.6 | 11.7 |
Llama-3.1-8B | 8.4 | 17.4 | 6.8 | 14.7 | 7.4 | 16.1 | 9.1 | 11.7 |
Mathstral-7B | 7.3 | 10.0 | 12.0 | 9.6 | 8.2 | 17.6 | 12.0 | 10.8 |
c4ai-command-r-v01 | 2.0 | 7.8 | 7.5 | 3.8 | 7.5 | 11.4 | 6.8 | 7.0 |
DeepSeek-R1-Distill-Qwen-32B | 9.1 | 5.4 | 4.8 | 9.8 | 2.3 | 10.2 | 7.1 | 6.8 |
Gemma-2-2b | 6.6 | 6.2 | 3.9 | 10.3 | 3.9 | 7.3 | 6.1 | 6.1 |
Qwen2-VL-72B† | 11.8 | 3.5 | 4.6 | 4.0 | 2.9 | 4.2 | 4.5 | 5.0 |
Internlm3-8b | 1.8 | 4.6 | 4.7 | 3.2 | 4.0 | 9.2 | 4.1 | 4.8 |
DeepSeek-vl2-small† | 3.1 | 1.8 | 1.8 | 4.5 | 0.0 | 0.3 | 4.8 | 1.7 |
THUDM-chatglm3-6b | 0.9 | 2.3 | 0.0 | 0.7 | 0.9 | 2.0 | 0.9 | 1.2 |
Qwen2.5-Math-7B | 1.4 | 1.7 | 0.0 | 2.1 | 0.0 | 1.5 | 1.9 | 1.0 |
DeepSeek-math-7b-rl | 0.7 | 0.0 | 0.0 | 1.5 | 0.0 | 0.6 | 0.9 | 0.4 |
† These models are multi-modal; problems that include images are also evaluated on them.

Abbreviations: AMO (Atomic Physics) | E&M (Electromagnetism) | CM (Classical Mechanics) | Opt. (Optics) | QM (Quantum Mechanics) | Stats. (Thermodynamics and Statistical Physics).
- Introduced a challenging benchmark of expert-annotated physics problems across six subfields, requiring deep multi-step reasoning and the integration of physical theory.
- Developed a robust automated evaluation framework combining SymPy-based symbolic checking with GPT-4o-based assessment for precise measurement of model performance.
- Conducted a comprehensive evaluation of 33 open-source and proprietary models, analyzing their strengths, weaknesses, and limitations.
- Provided in-depth analyses of prompting techniques, long chain-of-thought (CoT) reasoning, failure cases, and retrieval-augmented generation (RAG) to guide future model improvements.
@misc{feng2025physicsbenchmarkingfoundationmodels,
title={PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving},
author={Kaiyue Feng and Yilun Zhao and Yixin Liu and Tianyu Yang and Chen Zhao and John Sous and Arman Cohan},
year={2025},
eprint={2503.21821},
archivePrefix={arXiv},
primaryClass={physics.ed-ph},
url={https://arxiv.org/abs/2503.21821},
}
This project is licensed under the MIT License.