A curated benchmark of physics problems, step-by-step solutions, and evaluation rubrics for testing large language model reasoning in scientific tasks.
This repository is designed to evaluate whether AI systems can solve physics problems with mathematical consistency, conceptual clarity, and physically meaningful reasoning.
- Create challenging physics problems across undergraduate and graduate-level topics.
- Provide detailed step-by-step solutions.
- Define evaluation rubrics for grading AI-generated answers.
- Highlight common reasoning mistakes made in physics problem solving.
- Support benchmarking of LLMs in scientific reasoning tasks.
The benchmark includes, or is planned to include, problems in:
- Classical mechanics
- Electromagnetism
- Quantum mechanics
- Statistical physics
- Condensed matter physics
- Quantum information and quantum dynamics
problems/
    mechanics/
    electromagnetism/
    quantum_mechanics/
    statistical_physics/
    condensed_matter/
solutions/
    mechanics/
    electromagnetism/
    quantum_mechanics/
    statistical_physics/
    condensed_matter/
rubrics/
    evaluation_rubric.md
scripts/
    validate_dataset.py
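As an illustration of the kind of consistency check a dataset validator could perform on this layout, here is a minimal sketch. It is not the contents of `scripts/validate_dataset.py`. The function name `find_unmatched_problems` is hypothetical, and the sketch assumes that problems are stored as Markdown files and that each solution uses the same relative path under `solutions/`, neither of which is stated above.

```python
from pathlib import Path


def find_unmatched_problems(root):
    """List problem files that have no same-named file under solutions/.

    Assumptions (not confirmed by the repository): problems are *.md files,
    and a solution mirrors its problem's path relative to problems/.
    """
    root = Path(root)
    problems_dir = root / "problems"
    solutions_dir = root / "solutions"

    missing = []
    # Walk every topic folder (mechanics/, electromagnetism/, ...) recursively.
    for problem in sorted(problems_dir.rglob("*.md")):
        relative = problem.relative_to(problems_dir)
        if not (solutions_dir / relative).is_file():
            missing.append(relative.as_posix())
    return missing
```

Calling `find_unmatched_problems(".")` from the repository root would return a list of relative paths such as `mechanics/example.md` for any problem without a counterpart solution, and an empty list when the two trees match.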
- Classical Mechanics: Inclined plane with friction and an external force
- Electromagnetism: Electric field of a uniformly charged spherical shell
- Quantum Mechanics: Spin-1/2 particle in a magnetic field
- Statistical Physics: Two-level system and partition function
- Condensed Matter / Quantum Dynamics: Two-spin transverse-field Ising Hamiltonian
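To give a sense of how a reference answer for the last problem in the list above could be checked numerically, the sketch below builds a two-spin transverse-field Ising Hamiltonian and compares its spectrum with the closed-form eigenvalues. The sign convention H = -J Z1 Z2 - h (X1 + X2) and the function names are assumptions made for illustration; the problem in the benchmark may use a different convention.

```python
import numpy as np

# Pauli matrices and the single-spin identity.
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
I2 = np.eye(2)


def two_spin_tfim(J, h):
    """Return the 4x4 matrix of H = -J Z1 Z2 - h (X1 + X2) (assumed convention)."""
    coupling = np.kron(Z, Z)
    transverse = np.kron(X, I2) + np.kron(I2, X)
    return -J * coupling - h * transverse


def numerical_spectrum(J, h):
    """Eigenvalues in ascending order from exact diagonalization."""
    return np.linalg.eigvalsh(two_spin_tfim(J, h))


def closed_form_spectrum(J, h):
    """Analytic eigenvalues: +/-J and +/-sqrt(J^2 + 4 h^2), sorted ascending."""
    gap = np.sqrt(J**2 + 4.0 * h**2)
    return np.sort(np.array([-gap, -J, J, gap]))


if __name__ == "__main__":
    J, h = 1.0, 0.5
    print("numerical:  ", numerical_spectrum(J, h))
    print("closed form:", closed_form_spectrum(J, h))
```

The two limits give a quick sanity check on the closed form: at h = 0 the levels reduce to -J and +J, each doubly degenerate, and at J = 0 they reduce to -2h, 0, 0, and +2h, as expected for two independent spins in a transverse field.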
This repository is under active development. Initial examples focus on physics reasoning tasks relevant to AI model evaluation, including multi-step problem solving, symbolic manipulation, and conceptual interpretation.
Thiago Rocha Girão Souza
PhD Candidate in Physics
Quantum Computing | Quantum Dynamics | Scientific Python