LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
arXiv:2602.24173v1 Announce Type: new Abstract: We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15$\%$ accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.
Executive Summary
The article presents LemmaBench, a benchmark designed to assess the capabilities of Large Language Models (LLMs) in research-level mathematics. Unlike existing benchmarks, which rely on static, hand-curated contest or textbook-style problems, LemmaBench is built by an automated pipeline that extracts lemmas from arXiv papers and rewrites them into self-contained statements, making all assumptions and required definitions explicit. Evaluating state-of-the-art LLMs on the benchmark, the authors report roughly 10-15% theorem-proving accuracy (pass@1), highlighting a significant gap between LLM performance and human-level capabilities. Because the benchmark is refreshed with new problems drawn from the latest research, earlier instances can be used for training without compromising future evaluations, so assessments stay meaningful over time. The results have important implications for the development of AI systems capable of contributing to human research, and the authors' approach can be applied to other domains where continuously updated benchmarks are needed.
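The digest does not include the authors' pipeline code. As a rough illustration of the extraction step only, the following minimal Python sketch assumes arXiv LaTeX sources have already been downloaded locally and that lemmas are marked with a standard \begin{lemma} environment (both are assumptions; real papers often use custom theorem environments). The harder step, rewriting each lemma into a self-contained statement with all definitions and hypotheses inlined, is not shown.

```python
import re
from pathlib import Path

# Illustrative sketch only: pull lemma environments out of arXiv LaTeX sources.
# Assumes lemmas use a standard \begin{lemma} ... \end{lemma} environment;
# real papers often define custom environments via \newtheorem.
LEMMA_RE = re.compile(r"\\begin\{lemma\}(.*?)\\end\{lemma\}", re.DOTALL)

def extract_lemmas(tex_source: str) -> list[str]:
    """Return the raw statements of all lemma environments in one LaTeX source."""
    return [m.group(1).strip() for m in LEMMA_RE.finditer(tex_source)]

def extract_from_paper(paper_dir: Path) -> list[dict]:
    """Walk one paper's source directory and collect candidate lemma statements."""
    records = []
    for tex_file in paper_dir.glob("**/*.tex"):
        source = tex_file.read_text(errors="ignore")
        for statement in extract_lemmas(source):
            records.append({"file": tex_file.name, "statement": statement})
    return records

if __name__ == "__main__":
    # Example usage on a locally downloaded source tree (path is hypothetical).
    for rec in extract_from_paper(Path("arxiv_sources/2602.24173")):
        print(rec["file"], "->", rec["statement"][:80])
```

A real pipeline would additionally need to resolve \newtheorem aliases and discard lemmas whose statements cannot be made self-contained.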
Key Points
- ▸ LemmaBench is a novel benchmark for evaluating LLM capabilities in research-level mathematics
- ▸ The benchmark relies on an automated pipeline to extract and rephrase lemmas from arXiv
- ▸ State-of-the-art LLMs achieve roughly 10-15% accuracy (pass@1) in theorem proving on LemmaBench (the metric is sketched below)
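For context, pass@1 means a problem counts as solved only if a single sampled proof attempt is judged correct. A common way to estimate pass@k from n sampled attempts per problem, of which c succeed, is the unbiased estimator popularized for code-generation benchmarks; a minimal sketch follows (whether LemmaBench uses this exact estimator, and the sample counts shown, are assumptions for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the per-problem success rate c/n, averaged over the benchmark.
per_problem = [(8, 1), (8, 0), (8, 2)]  # illustrative (n, c) pairs, not real data
print(sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem))
```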
Merits
Strength in Design
The use of an automated pipeline to refresh the benchmark with the latest research results is a significant strength: problems drawn from newly published work are unlikely to appear in a model's training data, so evaluations remain uncontaminated and meaningful over time.
Demerits
Limited Generalizability
The results may not generalize to other domains or types of mathematical problems, which could limit the applicability of LemmaBench.
Expert Commentary
The introduction of LemmaBench represents a significant advancement in the field of AI evaluation, particularly in the context of research-level mathematics. By leveraging an automated pipeline to extract and rephrase lemmas from arXiv, the authors have created a dynamic benchmark that can adapt to the latest research developments. The results of this study highlight the substantial gap between LLM performance and human-level capabilities, underscoring the need for continued research and development in this area. Furthermore, the approach taken by the authors can be applied to other domains where dynamic benchmarks are necessary, offering a valuable framework for evaluating AI systems in a rapidly evolving research landscape.
Recommendations
- ✓ Future research should focus on improving LLM theorem-proving capabilities, using dynamic benchmarks like LemmaBench to measure that progress reliably.
- ✓ The use of dynamic benchmarks should be explored in other domains where research-level evaluations are necessary, such as natural language processing, computer vision, and more.