PHYSICS is a benchmark for university-level physics problem solving, designed to assess the reasoning and analytical capabilities of foundation models. The dataset contains 1,297 PhD-qualifying exam problems spanning six fundamental physics disciplines.
- Dataset Size: 1,297 problems
- Problem Domains:
  - Classical Mechanics
  - Quantum Mechanics
  - Thermodynamics & Statistical Mechanics
  - Electromagnetism
  - Atomic Physics
  - Optics
- Problem Complexity: Requires deep mathematical modeling and multi-step logical reasoning.
- Automated Evaluation System:
  - SymPy-based symbolic verification of mathematical answers (see the sketch after this list)
  - GPT-4o-based validation of natural-language answers
- Benchmarking Across 33 Models:
  - Proprietary models (e.g., o3-mini, Gemini-1.5-Pro)
  - Open-source models (e.g., DeepSeek-R1, Llama-3.3-70B)
- Performance Gap Analysis:
  - The best-performing model, o3-mini, reaches only 59.9% accuracy
  - Open-source models lag far behind, revealing substantial gaps in physics problem-solving ability
- Sources: Publicly available PhD-qualifying exam questions
- Annotation Process:
  - Structured review by expert annotators
  - Strict data quality control
- Evaluation Metrics:
  - Problem complexity and difficulty classification
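To make the symbolic-verification step concrete, here is a minimal sketch of SymPy-based equivalence checking. It assumes answers have already been extracted as SymPy-parsable expression strings; the function name `symbolically_equivalent` and the fallback behavior are our illustration, not the paper's exact implementation.

```python
# Minimal sketch of SymPy-based symbolic equivalence checking.
# Assumes gold and predicted answers arrive as plain, parsable strings;
# the real pipeline's extraction and normalization steps are omitted.
import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equivalent(gold: str, pred: str) -> bool:
    """Return True if two answer strings simplify to the same expression."""
    try:
        gold_expr = parse_expr(gold)
        pred_expr = parse_expr(pred)
    except (sp.SympifyError, SyntaxError, TypeError):
        return False  # unparsable answers fall through to the LLM-based check

    # simplify(gold - pred) == 0 is a standard equivalence test; if SymPy
    # cannot reduce the difference to zero, we conservatively treat the
    # answers as different.
    return sp.simplify(gold_expr - pred_expr) == 0

# Example: algebraically identical answers in different surface forms.
assert symbolically_equivalent("sin(x)**2 + cos(x)**2", "1")
assert not symbolically_equivalent("m*g*h", "m*g*h/2")
```

Note that this check is conservative: a difference SymPy fails to simplify counts as a mismatch, which is one reason the pipeline pairs it with a model-based check for free-form answers.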
How PHYSICS compares with existing math and science benchmarks:

Benchmark | Multi-modal | Size | Level | Question Type | Evaluation | Reasoning Steps |
---|---|---|---|---|---|---|
JEEBench | ❌ | 515 | CEE | OE, MC | Rule-Based | - |
MATH | ❌ | 12,500 | K12-Comp | OE | Rule-Based | - |
HARDMath | ❌ | 1,466 | Graduate | OE | Rule + Model | - |
GSM8K | ❌ | 8,500 | K8 | OE | Rule-Based | 5.0 |
GPQA | ❌ | 227 | Graduate | OE | Rule-Based | 3.6 |
SciQ | ❌ | 13,679 | K4-K8 | MC, OE | Rule-Based | - |
SciEval | ❌ | 1,657 | - | OE, MC | Rule-Based | - |
SciBench | ✅ | 295 | College | OE | Rule-Based | 2.8 |
MMMU | ✅ | 443 | College | OE, MC | Rule-Based | - |
MMMU-Pro | ✅ | 3,460 | College | MC | Rule-Based | - |
ScienceQA | ✅ | 617 | K1-K12 | MC | Rule-Based | 2.4 |
OlympiadBench | ✅ | 2,334 | Comp | OE | Rule-Based | 3.7 |
PutnamBench | ❌ | 1,692 | College | OE | Rule-Based | - |
Ours | ✅ | 1,297 | PhD-Qualifying | OE | Rule + Model | 5.7 |
Legend:
- Level:
  - Comp: Competition
  - College: College Level
  - CEE: College Entrance Examination
  - K1-K12: Elementary and High School Level
- Question Type:
  - OE: Open-ended Questions
  - MC: Multiple-choice Questions
- Reasoning Steps: Average number of steps, based on statistics from the corresponding papers.
- Evaluation Framework:
  - SymPy-based symbolic equivalence checking
  - LLM-based accuracy verification
  - Weighted scoring based on correctness and complexity (hedged sketches of the last two follow below)
- Fine-Grained Analysis:
  - Step-by-step reasoning assessment
  - Identification of the first erroneous step
  - Error categorization for detailed failure analysis
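The LLM-based accuracy verification can be sketched as follows, assuming an OpenAI-style client. The prompt, the `llm_judge` helper, and the one-word verdict format are illustrative assumptions, not the exact prompt used in the paper.

```python
# Hedged sketch of model-based answer validation with GPT-4o.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative judge prompt; the paper's actual prompt is not reproduced here.
JUDGE_PROMPT = """You are grading a physics answer.
Question: {question}
Reference answer: {gold}
Model answer: {pred}
Reply with exactly one word: CORRECT or INCORRECT."""

def llm_judge(question: str, gold: str, pred: str) -> bool:
    """Ask GPT-4o whether a free-form answer matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic grading
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       question=question, gold=gold, pred=pred)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```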
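Similarly, the weighted-scoring bullet can be illustrated with a toy aggregator. The per-problem complexity weights here are hypothetical; the paper's exact weighting scheme is not reproduced.

```python
# Purely illustrative weighted scoring: each problem contributes its
# (assumed) complexity weight to the total, and a model earns that weight
# only if its answer is judged correct.
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """results: one (is_correct, complexity_weight) pair per problem."""
    total = sum(weight for _, weight in results)
    earned = sum(weight for correct, weight in results if correct)
    return 100.0 * earned / total if total else 0.0

# Example: two easy problems solved, one hard problem missed.
print(weighted_score([(True, 1.0), (True, 1.0), (False, 3.0)]))  # 40.0
```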
Accuracy (%) of each model by subfield, with validation (Val) and test (Test) averages:

Model | AMO | E&M | CM | Opt. | QM | Stats. | Val | Test |
---|---|---|---|---|---|---|---|---|
**Proprietary Models** | | | | | | | | |
o3-mini | 52.4 | 64.9 | 59.8 | 51.5 | 66.0 | 60.0 | 55.0 | 59.9 |
o1-mini | 45.4 | 41.8 | 41.9 | 40.6 | 44.3 | 48.0 | 44.1 | 43.6 |
Gemini-1.5-pro† | 35.5 | 40.2 | 31.5 | 32.2 | 44.5 | 43.7 | 35.3 | 38.4 |
GPT-4o† | 35.3 | 44.1 | 33.4 | 23.4 | 33.8 | 45.0 | 34.7 | 36.7 |
Claude-3.5-Sonnet† | 37.2 | 34.8 | 27.6 | 35.5 | 35.1 | 38.4 | 31.7 | 34.7 |
**Open-Source Models** | | | | | | | | |
DeepSeek-R1 | 37.0 | 48.6 | 38.3 | 43.1 | 44.5 | 51.5 | 44.2 | 44.3 |
Qwen2.5-Math-72B | 27.0 | 34.8 | 27.3 | 27.4 | 36.2 | 37.0 | 38.5 | 32.2 |
Llama-3.3-70B | 28.2 | 35.8 | 27.9 | 17.2 | 31.4 | 41.3 | 34.3 | 31.5 |
phi-4 | 32.8 | 33.0 | 19.8 | 27.2 | 23.4 | 35.2 | 28.7 | 29.1 |
Qwen2.5-72B | 28.8 | 30.9 | 23.0 | 25.4 | 27.4 | 33.2 | 31.5 | 28.7 |
Qwen2.5-32B | 25.5 | 27.5 | 19.4 | 20.8 | 24.7 | 41.1 | 23.3 | 27.6 |
Mistral-Small-24B | 19.1 | 29.5 | 19.6 | 17.6 | 15.2 | 28.4 | 25.1 | 21.8 |
Qwen2.5-7B | 21.8 | 28.1 | 11.2 | 18.7 | 17.4 | 22.1 | 20.9 | 20.4 |
Qwen2.5-14B | 23.8 | 19.7 | 14.1 | 12.3 | 13.5 | 28.2 | 25.3 | 19.6 |
Gemma-2-27b | 14.3 | 19.0 | 16.2 | 13.4 | 18.4 | 25.9 | 21.7 | 18.3 |
Yi-1.5-34B | 11.0 | 15.4 | 18.0 | 13.2 | 19.6 | 25.2 | 25.3 | 17.4 |
Qwen2.5-Math-1.5B | 13.3 | 14.8 | 16.5 | 16.2 | 17.2 | 19.5 | 15.1 | 16.4 |
InternVL2-5-38B† | 15.3 | 12.5 | 12.5 | 7.7 | 18.0 | 23.1 | 16.7 | 15.3 |
Aria† | 13.0 | 14.0 | 14.2 | 11.7 | 9.7 | 14.4 | 12.7 | 12.9 |
QwQ-32B-Preview | 16.7 | 7.5 | 10.1 | 11.2 | 10.6 | 14.8 | 12.4 | 12.1 |
Gemma-2-9b | 9.4 | 8.2 | 9.1 | 16.5 | 12.1 | 16.9 | 15.2 | 11.9 |
Mistral-7B | 10.1 | 10.4 | 5.1 | 13.7 | 11.6 | 17.6 | 12.6 | 11.7 |
Llama-3.1-8B | 8.4 | 17.4 | 6.8 | 14.7 | 7.4 | 16.1 | 9.1 | 11.7 |
Mathstral-7B | 7.3 | 10.0 | 12.0 | 9.6 | 8.2 | 17.6 | 12.0 | 10.8 |
c4ai-command-r-v01 | 2.0 | 7.8 | 7.5 | 3.8 | 7.5 | 11.4 | 6.8 | 7.0 |
DeepSeek-R1-Distill-Qwen-32B | 9.1 | 5.4 | 4.8 | 9.8 | 2.3 | 10.2 | 7.1 | 6.8 |
Gemma-2-2b | 6.6 | 6.2 | 3.9 | 10.3 | 3.9 | 7.3 | 6.1 | 6.1 |
Qwen2-VL-72B† | 11.8 | 3.5 | 4.6 | 4.0 | 2.9 | 4.2 | 4.5 | 5.0 |
Internlm3-8b | 1.8 | 4.6 | 4.7 | 3.2 | 4.0 | 9.2 | 4.1 | 4.8 |
DeepSeek-vl2-small† | 3.1 | 1.8 | 1.8 | 4.5 | 0.0 | 0.3 | 4.8 | 1.7 |
THUDM-chatglm3-6b | 0.9 | 2.3 | 0.0 | 0.7 | 0.9 | 2.0 | 0.9 | 1.2 |
Qwen2.5-Math-7B | 1.4 | 1.7 | 0.0 | 2.1 | 0.0 | 1.5 | 1.9 | 1.0 |
DeepSeek-math-7b-rl | 0.7 | 0.0 | 0.0 | 1.5 | 0.0 | 0.6 | 0.9 | 0.4 |
† These models are multi-modal; problems that include images are also evaluated on them.

Abbreviations: AMO (Atomic Physics) | E&M (Electromagnetism) | CM (Classical Mechanics) | Opt. (Optics) | QM (Quantum Mechanics) | Stats. (Thermodynamics and Statistical Physics).
- Introduced a challenging benchmark of expert-annotated physics problems across six subfields, requiring deep multi-step reasoning and the integration of physical theory.
- Developed a robust automated evaluation framework combining SymPy-based symbolic checking with GPT-4o-based assessment for precise measurement of model performance.
- Conducted a comprehensive evaluation of 33 open-source and proprietary models, analyzing their strengths, weaknesses, and limitations.
- Provided in-depth analyses of prompting techniques, long chain-of-thought (CoT) reasoning, failure cases, and retrieval-augmented generation (RAG) to guide future model improvements.
@misc{feng2025physicsbenchmarkingfoundationmodels,
title={PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving},
author={Kaiyue Feng and Yilun Zhao and Yixin Liu and Tianyu Yang and Chen Zhao and John Sous and Arman Cohan},
year={2025},
eprint={2503.21821},
archivePrefix={arXiv},
primaryClass={physics.ed-ph},
url={https://arxiv.org/abs/2503.21821},
}
This project is licensed under the MIT License.