thiagorgs/physics-llm-evaluation-benchmark

Physics LLM Evaluation Benchmark

A curated benchmark of physics problems, step-by-step solutions, and evaluation rubrics for testing large language model reasoning in scientific tasks.

This repository is designed to evaluate whether AI systems can solve physics problems with mathematical consistency, conceptual clarity, and physically meaningful reasoning.

Goals

  • Create challenging physics problems across undergraduate and graduate-level topics.
  • Provide detailed step-by-step solutions.
  • Define evaluation rubrics for grading AI-generated answers.
  • Highlight common reasoning mistakes made in physics problem solving.
  • Support benchmarking of LLMs in scientific reasoning tasks.
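As a rough illustration of how a rubric could turn graded criteria into a single score, the sketch below takes the three qualities named above (mathematical consistency, conceptual clarity, physically meaningful reasoning) and combines them as a weighted average. The criterion keys, the weights, and the function name `rubric_score` are all hypothetical; the actual rubric is defined in rubrics/evaluation_rubric.md and may be structured differently.

```python
# Hypothetical criteria and weights, chosen only for illustration.
# The real rubric is in rubrics/evaluation_rubric.md and may differ.
WEIGHTS = {
    "mathematical_consistency": 0.4,
    "conceptual_clarity": 0.3,
    "physical_reasoning": 0.3,
}


def rubric_score(marks):
    """Return the weighted average of per-criterion marks, each in [0, 1]."""
    missing = set(WEIGHTS) - set(marks)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= marks[name] <= 1.0:
            raise ValueError(f"mark for {name!r} must lie in [0, 1]")
    # The weights sum to 1, so the result also lies in [0, 1].
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "mathematical_consistency": 0.5,
        "conceptual_clarity": 1.0,
        "physical_reasoning": 0.0,
    }
    print(rubric_score(example))
```

A weighted average is only one possible design; a rubric might instead require a minimum mark on every criterion before an answer counts as correct.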

Topics

The benchmark covers, or is planned to cover, problems in the following areas:

  • Classical mechanics
  • Electromagnetism
  • Quantum mechanics
  • Statistical physics
  • Condensed matter physics
  • Quantum information and quantum dynamics

Repository Structure

problems/
  mechanics/
  electromagnetism/
  quantum_mechanics/
  statistical_physics/
  condensed_matter/

solutions/
  mechanics/
  electromagnetism/
  quantum_mechanics/
  statistical_physics/
  condensed_matter/

rubrics/
  evaluation_rubric.md

scripts/
  validate_dataset.py
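
The layout above pairs each topic folder under problems/ with one under solutions/. The sketch below shows one consistency check that a script such as scripts/validate_dataset.py might run: it lists problem files that have no solution file of the same name. This is not the repository's actual script. The function name `find_unmatched_problems` and the assumption that a problem and its solution share a filename are hypothetical.

```python
from pathlib import Path

# Topic folders as listed in the repository structure.
TOPICS = [
    "mechanics",
    "electromagnetism",
    "quantum_mechanics",
    "statistical_physics",
    "condensed_matter",
]


def find_unmatched_problems(root):
    """Return problem files (relative to root) that lack a same-named solution.

    Assumption: a problem and its solution share the same filename
    within their topic folder. The real naming scheme may differ.
    """
    root = Path(root)
    unmatched = []
    for topic in TOPICS:
        problem_dir = root / "problems" / topic
        if not problem_dir.is_dir():
            continue  # topic not populated yet
        for problem in sorted(problem_dir.iterdir()):
            if not problem.is_file():
                continue
            solution = root / "solutions" / topic / problem.name
            if not solution.is_file():
                unmatched.append(problem.relative_to(root).as_posix())
    return unmatched


if __name__ == "__main__":
    for path in find_unmatched_problems("."):
        print(f"no solution found for {path}")
```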

Current Examples

  • Classical Mechanics: Inclined plane with friction and an external force
  • Electromagnetism: Electric field of a uniformly charged spherical shell
  • Quantum Mechanics: Spin-1/2 particle in a magnetic field
  • Statistical Physics: Two-level system and partition function
  • Condensed Matter / Quantum Dynamics: Two-spin transverse-field Ising Hamiltonian
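
To give a sense of the multi-step reasoning these examples call for, here is a minimal sketch of the two-level system. It assumes energy levels 0 and ε (the problem in the repository may use a different convention, such as ±ε/2), for which the partition function is Z = 1 + exp(-βε) and the mean energy is ⟨E⟩ = ε / (exp(βε) + 1). The function names are illustrative. The closed form is cross-checked against a numerical derivative of ln Z, using ⟨E⟩ = -∂ ln Z / ∂β.

```python
import math


def partition_function(eps, beta):
    """Z for a two-level system with energies 0 and eps (assumed convention)."""
    return 1.0 + math.exp(-beta * eps)


def mean_energy(eps, beta):
    """Closed form: <E> = eps * exp(-beta * eps) / Z."""
    return eps * math.exp(-beta * eps) / partition_function(eps, beta)


def mean_energy_numeric(eps, beta, h=1e-6):
    """<E> = -d(ln Z)/d(beta), estimated by a central difference."""
    def log_z(b):
        return math.log(partition_function(eps, b))
    return -(log_z(beta + h) - log_z(beta - h)) / (2.0 * h)


if __name__ == "__main__":
    eps, beta = 2.0, 0.7
    print(mean_energy(eps, beta), mean_energy_numeric(eps, beta))
```

A check of this kind, comparing a derived formula against an independent numerical evaluation, is one way to test the mathematical consistency of a submitted solution.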

Status

This repository is under active development. Initial examples focus on physics reasoning tasks relevant to AI model evaluation, including multi-step problem solving, symbolic manipulation, and conceptual interpretation.

Author

Thiago Rocha Girão Souza
PhD Candidate in Physics
Quantum Computing | Quantum Dynamics | Scientific Python
