Academic

GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

arXiv:2603.19252v1 Announce Type: cross Abstract: Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.

Executive Summary

The article introduces GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. Experiments reveal a clear gap between large language models (LLMs) and humans: the best-performing model, GPT-5-nano, scores 75.89 exact match against 94.74 for human annotators. The analysis surfaces three common failure patterns: exact match failures under the multiple-choice setting, weak reliance on the diagram, and overextended reasoning that never converges on an answer. GeoChallenge is a valuable tool for evaluating the symbolic reasoning of LLMs, though its narrow focus on geometry limits how far the findings extend. The results argue for further research on geometric reasoning and visual grounding in more robust models.

Key Points

  • GeoChallenge is a novel dataset for evaluating the symbolic reasoning of LLMs
  • The dataset consists of 90K automatically generated multiple-choice geometry proof problems
  • Experiments reveal a significant performance gap between LLMs and humans
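The "exact match" metric behind the reported 75.89 vs. 94.74 gap can be illustrated with a minimal sketch. The abstract does not spell out the scoring protocol, so the multi-answer set-equality rule below is an assumption, not the paper's definition:

```python
def exact_match(predicted: set[str], gold: set[str]) -> bool:
    """Score 1 only when the predicted answer set equals the gold set.

    Assumed multi-answer protocol (the paper's exact scoring rules are
    not given in the abstract): selecting a strict subset or superset of
    the correct options scores zero, which makes exact match far
    stricter than per-option accuracy.
    """
    return predicted == gold

# A model that finds two of three correct options still scores 0:
print(exact_match({"A", "C"}, {"A", "B", "C"}))       # False
print(exact_match({"A", "B", "C"}, {"A", "B", "C"}))  # True
```

Under this rule, near-misses on any one option zero out the whole item, which would help explain why "exact match failures under the multiple-choice setting" appears as a distinct failure pattern.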

Merits

Strength: Comprehensive Evaluation Tool

GeoChallenge gives researchers a comprehensive evaluation tool for assessing the symbolic reasoning capabilities of LLMs in a controlled setting: problems are generated at scale with aligned textual descriptions, diagrams, and formal language annotations.

Strength: Fine-Grained Complexity Ratings

The dataset includes fine-grained complexity ratings, so evaluations can be stratified by proof difficulty and tailored to specific tasks and models.
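One way such ratings support controlled evaluation is per-level accuracy reporting. A brief sketch follows; the field names (`"question"`, `"complexity"`) are hypothetical, since the abstract does not describe GeoChallenge's schema:

```python
# Illustrative records only; GeoChallenge's actual format is not
# specified in the abstract.
problems = [
    {"question": "Prove AB = CD", "complexity": 1},
    {"question": "Prove triangle ABC ~ triangle DEF", "complexity": 3},
    {"question": "Prove O is the circumcenter", "complexity": 3},
]

def bucket_by_complexity(problems: list[dict]) -> dict[int, list[dict]]:
    """Group problems by complexity rating so accuracy can be reported
    per level rather than as a single aggregate number."""
    buckets: dict[int, list[dict]] = {}
    for p in problems:
        buckets.setdefault(p["complexity"], []).append(p)
    return buckets

by_level = bucket_by_complexity(problems)
print(sorted(by_level))  # complexity levels present -> [1, 3]
print(len(by_level[3]))  # problems at level 3 -> 2
```

Stratifying results this way would expose whether a model's aggregate score hides a steep drop-off on deeper proofs.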

Strength: Visual Grounding

GeoChallenge requires multi-step reasoning over aligned textual descriptions and diagrams, providing a more nuanced evaluation of LLMs' visual grounding abilities.

Demerits

Limitation: Narrow Focus on Geometry

The GeoChallenge dataset focuses exclusively on geometry, limiting its applicability to other domains and areas of symbolic reasoning.

Limitation: Limited Generalizability

The findings may not generalize to other languages or cultural contexts, highlighting the need for further research and validation.

Limitation: Potential for Overfitting

Because the problems are generated automatically, they may share templates and surface patterns that models can exploit without genuine reasoning, a form of overfitting that would compromise the validity and reliability of the results.

Expert Commentary

The article presents a novel and relevant contribution to the field of AI research, highlighting the importance of visual reasoning and symbolic reasoning in LLMs. However, the limitations of the dataset and the findings must be acknowledged and addressed through further research. The development of more robust and versatile LLMs is critical for real-world applications, and the findings have significant implications for policy-making and investment in AI research.

Recommendations

  • Develop more comprehensive and diverse datasets that cover a broader range of domains and areas of symbolic reasoning.
  • Invest in research and development of more robust and versatile LLMs that can handle complex geometric reasoning tasks.

Sources

Original: arXiv - cs.AI