The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?

arXiv:2603.03334v1 Announce Type: new Abstract: The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 questions originally authored by professors of graduate-level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Three options are provided for each question, exactly one of which is correct. To avoid data leakage, all questions are newly created rather than sourced from existing materials. The validity of questions is verified through a procedure based on cross-LLM disagreement, followed by manual expert review. By adopting a multiple-choice format, our dataset enables objective, reproducible, and bias-free evaluation through the lm_eval library. Baseline results with state-of-the-art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath-MCQ at the following link: https://github.com/biancaraimondi/CompMath-MCQ.git

Executive Summary

This study introduces CompMath-MCQ, a novel benchmark dataset for evaluating Large Language Models (LLMs) on advanced mathematical reasoning in a multiple-choice setting. The dataset comprises 1,500 questions spanning graduate-level mathematical topics, each verified through cross-LLM disagreement and manual expert review. Baseline results indicate that even state-of-the-art LLMs face significant challenges in computational mathematical reasoning. The benchmark advances the assessment of LLMs in higher-level mathematics, with implications for their use in education and research, while underscoring how much further these models must develop before they can reason reliably at an advanced level.

Key Points

  • CompMath-MCQ is a new benchmark dataset for evaluating LLMs on graduate-level mathematical reasoning in a multiple-choice setting.
  • The dataset consists of 1,500 questions covering graduate-level mathematical topics, verified through a rigorous process.
  • Baseline results suggest that LLMs face significant challenges in computational mathematical reasoning.
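
The three-option multiple-choice protocol lends itself to a simple log-likelihood evaluation: the model's "answer" is whichever option it scores as the most likely continuation of the question. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's harness (in practice the lm_eval library performs this scoring); the stub scorer and data layout are assumptions for demonstration only.

```python
from typing import Callable, Sequence

def evaluate_mcq(
    questions: Sequence[dict],
    loglikelihood: Callable[[str, str], float],
) -> float:
    """Accuracy of a model on multiple-choice questions.

    Each question dict holds a "prompt", a list of "options",
    and "answer", the index of the single correct option.
    """
    correct = 0
    for q in questions:
        # Score every option as a continuation of the prompt and
        # predict the one the model assigns the highest likelihood.
        scores = [loglikelihood(q["prompt"], opt) for opt in q["options"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == q["answer"]
    return correct / len(questions)

# Stub "model" that simply prefers longer options (illustration only).
stub = lambda prompt, option: float(len(option))
questions = [
    {"prompt": "Q1", "options": ["a", "bb", "c"], "answer": 1},
    {"prompt": "Q2", "options": ["aaa", "b", "c"], "answer": 2},
]
accuracy = evaluate_mcq(questions, stub)  # picks "bb" for Q1, "aaa" for Q2
```

Because exactly one of three options is correct, random guessing yields 33% accuracy, which gives baseline numbers an unambiguous floor.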

Merits

Strengths in Dataset Creation

The study's creators employed a robust process to ensure the absence of data leakage, verifying the validity of questions through cross-LLM disagreement and manual expert review.
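
The paper does not publish its implementation of this step, but the underlying idea can be sketched: pose each question to several independent LLMs and route any question on which they disagree to manual expert review. The function name and data layout below are assumptions for illustration, not the authors' code.

```python
from typing import Mapping, Sequence

def flag_for_review(answers: Mapping[str, Sequence[int]]) -> list:
    """Return IDs of questions whose LLM answers disagree.

    `answers` maps each question ID to the option index chosen by
    each independent LLM. Unanimous questions pass automatically;
    the rest are escalated to manual expert review.
    """
    return [qid for qid, picks in answers.items() if len(set(picks)) > 1]

# Three LLMs answered two questions; q2 splits the vote.
flagged = flag_for_review({"q1": [0, 0, 0], "q2": [0, 1, 0]})
```

Disagreement here is a cheap proxy for ambiguity or error: a question that multiple capable models answer differently is more likely to be ill-posed, making it a sensible trigger for human inspection.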

Contribution to LLM Evaluation

CompMath-MCQ fills a significant gap in the evaluation of LLMs, focusing on advanced mathematical reasoning in a multiple-choice setting.

Potential Applications

The study's findings can inform the development of LLMs for educational and research purposes, leading to improved mathematical reasoning and problem-solving capabilities.

Demerits

Limitation in Generalizability

The study's findings may not be directly generalizable to other domains or applications, given the specificity of the CompMath-MCQ dataset.

Need for Further Development

The baseline results indicate that LLMs still face significant challenges in computational mathematical reasoning, necessitating further research and development.

Expert Commentary

The introduction of CompMath-MCQ is a significant contribution to the field of LLM evaluation, highlighting the need for comprehensive assessment of these models in advanced mathematical reasoning. While the study's findings are not entirely surprising, given the known limitations of LLMs, they underscore the importance of continued research and development in this area. Furthermore, the study's emphasis on the importance of verifying the validity of questions through cross-LLM disagreement and manual expert review serves as a valuable reminder of the need for rigorous evaluation in LLM research. Overall, this study has the potential to inform the development of LLMs in educational and research settings, leading to improved mathematical reasoning and problem-solving capabilities.

Recommendations

  • Researchers should prioritize the development of LLMs for advanced mathematical reasoning, focusing on targeted evaluation and fine-tuning techniques.
  • Educational institutions and governments should invest in the development of LLMs for educational purposes, given their potential to enhance mathematical reasoning and problem-solving capabilities.
