
Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks


S. K. Rithvik

arXiv:2602.19006v1 Announce Type: new Abstract: We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical computation, comprising 900 baseline and 75 tool-augmented assessments. Results reveal clear tier stratification: flagship models achieve 81% average accuracy, outperforming mid-tier (77%) and fast models (67%) by 4pp and 14pp respectively. Task difficulty patterns emerge distinctly: derivations show highest performance (92% average, 100% for flagship models), while numerical computation remains most challenging (42%). Tool augmentation on numerical tasks yields task-dependent effects: modest overall improvement (+4.4pp) at 3x token cost masks dramatic heterogeneity ranging from +29pp gains to -16pp degradation. Reproducibility analysis across three runs quantifies 6.3pp average variance, with flagship models demonstrating exceptional stability (GPT-5 achieves zero variance) while specialized models require multi-run evaluation. This work contributes: (i) a benchmark for quantum mechanics with automatic verification, (ii) systematic evaluation quantifying tier-based performance hierarchies, (iii) empirical analysis of tool augmentation trade-offs, and (iv) reproducibility characterization. All tasks, verifiers, and results are publicly released.
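Contribution (i), a benchmark with automatic verification, deserves a concrete picture. The paper's verifier code is not reproduced here; the following is a minimal sketch of how a numerical-answer verifier of this kind might work, where the function names, the answer-extraction regex, and the tolerance are illustrative assumptions rather than the authors' implementation.

    import re

    def extract_final_answer(response: str) -> float | None:
        # Pull the last number out of a free-form model response.
        # Illustrative only; the paper's actual parsing rules may differ.
        matches = re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", response)
        return float(matches[-1]) if matches else None

    def verify_numeric(response: str, reference: float, rel_tol: float = 1e-3) -> bool:
        # Accept the response if its final number lies within a relative
        # tolerance of the reference value (the tolerance is an assumption).
        answer = extract_final_answer(response)
        if answer is None:
            return False
        return abs(answer - reference) <= rel_tol * max(abs(reference), 1e-12)

    # Example: harmonic-oscillator ground state in units of hbar*omega, E_0 = 0.5.
    print(verify_numeric("... so E_0 = 0.5 hbar omega, i.e. 0.5", reference=0.5))  # True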

Executive Summary

This article presents a comprehensive evaluation of large language models on quantum mechanics problem-solving across diverse models and tasks. The study evaluates 15 models from five providers on 20 tasks and reveals clear tier stratification: flagship models average 81% accuracy, ahead of mid-tier (77%) and fast (67%) models. Derivations are the strongest task category (92% average, 100% for flagship models), while numerical computation remains the hardest (42%). Tool augmentation on numerical tasks yields only a modest average gain (+4.4pp) at roughly 3x token cost, with strongly task-dependent effects, and a reproducibility analysis across three runs finds 6.3pp average variance, with flagship models notably stable. The work contributes a quantum mechanics benchmark with automatic verification, a systematic quantification of tier-based performance hierarchies, an empirical analysis of tool augmentation trade-offs, and a reproducibility characterization; all tasks, verifiers, and results are publicly released. These findings bear directly on the development and deployment of large language models in quantum mechanics and beyond.
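The reproducibility characterization can be illustrated with a small, hedged sketch: the run-to-run accuracies below are invented, and the max-minus-min spread is one simple stability measure, not necessarily the paper's exact variance definition.

    # Hypothetical per-run accuracies (fractions) across three evaluation runs.
    runs = {
        "GPT-5":        [0.95, 0.95, 0.95],  # flagship; zero run-to-run variance per the paper
        "mid-tier-A":   [0.78, 0.74, 0.79],  # invented values
        "fast-model-B": [0.70, 0.62, 0.66],  # invented values
    }

    for model, accs in runs.items():
        spread_pp = (max(accs) - min(accs)) * 100  # spread in percentage points
        print(f"{model:14s} mean={sum(accs) / len(accs):.1%}  spread={spread_pp:.1f}pp")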

Key Points

  • The study evaluates 15 large language models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) on 20 quantum mechanics tasks, comprising 900 baseline and 75 tool-augmented assessments.
  • Clear tier stratification is observed: flagship models average 81% accuracy, outperforming mid-tier (77%) and fast (67%) models (see the sketch after this list).
  • Task difficulty patterns emerge distinctly: derivations show the highest performance (92% average), while numerical computation remains the most challenging (42%).
  • Tool augmentation on numerical tasks yields task-dependent effects: a modest overall improvement (+4.4pp) at 3x token cost masks per-task swings from +29pp to -16pp.
  • Reproducibility analysis across three runs finds 6.3pp average variance, with flagship models exceptionally stable (GPT-5 shows zero variance).
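A back-of-the-envelope check of the tier gaps reported in the abstract, using the published averages directly (the per-model results are in the paper's public release and are not reproduced here):

    tier_accuracy = {"flagship": 0.81, "mid-tier": 0.77, "fast": 0.67}  # reported averages

    flagship = tier_accuracy["flagship"]
    for tier in ("mid-tier", "fast"):
        gap_pp = (flagship - tier_accuracy[tier]) * 100
        print(f"flagship leads {tier} by {gap_pp:.0f}pp")
    # flagship leads mid-tier by 4pp
    # flagship leads fast by 14pp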

Merits

Comprehensive evaluation framework

The study provides a systematic, rigorous evaluation framework for large language models on quantum mechanics problem-solving, with all tasks, verifiers, and results publicly released, and the same design can be applied to other scientific domains; a sketch of the underlying pattern follows.
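To make "framework" operational, here is a hedged sketch of the task-plus-verifier pattern such a harness typically builds on; the Task structure, evaluate loop, and stub model below are hypothetical, and the authors' actual harness lives in their public release.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Task:
        prompt: str
        verify: Callable[[str], bool]  # returns True if the response is correct

    def evaluate(models: dict, tasks: list[Task]) -> dict:
        # Run every model on every task and return per-model accuracy.
        scores = {}
        for name, model_answer in models.items():
            correct = sum(task.verify(model_answer(task.prompt)) for task in tasks)
            scores[name] = correct / len(tasks)
        return scores

    # Hypothetical usage with a trivial stub model:
    tasks = [Task("What is <n|n> for a normalized state |n>?", lambda r: "1" in r)]
    print(evaluate({"stub-model": lambda p: "The answer is 1."}, tasks))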

Insights into tier-based performance hierarchies

The results reveal clear tier stratification, providing insights into the performance capabilities of different models and their potential applications.

Empirical analysis of tool augmentation trade-offs

The study quantifies the effects of tool augmentation on numerical tasks, highlighting the trade-off between a modest average accuracy gain and a roughly threefold increase in token cost; a back-of-the-envelope version of this comparison is sketched below.
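The trade-off can be made concrete with the headline numbers from the abstract (+4.4pp mean gain at roughly 3x token cost); the per-task deltas below are invented solely to illustrate the reported heterogeneity (+29pp to -16pp).

    mean_gain_pp = 4.4      # reported average accuracy improvement, percentage points
    token_multiplier = 3.0  # tool runs consume roughly 3x the tokens of baseline runs

    # Gain per extra 1x of token budget beyond baseline (2x extra at a 3x multiplier).
    print(f"{mean_gain_pp / (token_multiplier - 1):.1f}pp per extra 1x tokens")

    # Invented per-task deltas spanning the reported range:
    task_deltas_pp = [29, 12, 5, 0, -4, -16]
    helped = sum(d > 0 for d in task_deltas_pp)
    hurt = sum(d < 0 for d in task_deltas_pp)
    print(f"{helped} tasks improved, {hurt} degraded; mean {sum(task_deltas_pp) / len(task_deltas_pp):+.1f}pp")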

Demerits

Limited model diversity

The study evaluates models from only five providers, which may not represent the full range of large language models available.

Task selection bias

The choice of 20 tasks may emphasize particular aspects of quantum mechanics and thus may not represent the full scope of the field.

Expert Commentary

The study presents a comprehensive evaluation of large language models on quantum mechanics problem-solving, which is of clear interest to both the AI and scientific communities. The results offer valuable insight into the capabilities of different model tiers and their potential applications. The limitations noted above, restricted provider coverage and possible task selection bias, should be acknowledged and addressed in future research. The findings are relevant to the development and deployment of large language models in scientific domains, including quantum mechanics, and highlight the need for more effective evaluation frameworks and regulatory policies.

Recommendations

  • Future studies should aim to evaluate a more diverse range of models and tasks to provide a more comprehensive understanding of large language models in scientific domains.
  • Researchers should develop more effective evaluation frameworks and metrics to account for the complexities of scientific applications and the trade-offs between performance improvement and increased token cost.
