
SorryDB: Can AI Provers Complete Real-World Lean Theorems?

arXiv:2603.02668v1 Announce Type: new Abstract: We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real-world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned with community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.

Executive Summary

This article presents SorryDB, a dynamically updating benchmark of open Lean tasks for evaluating AI provers on real-world formalization work. The authors assess generalist large language models, agentic approaches, and specialized symbolic provers on a snapshot of 1000 tasks drawn from the benchmark. The results show that current approaches are complementary: an agentic approach based on Gemini Flash leads overall, but it is not strictly better than off-the-shelf LLMs, specialized provers, or even a curated list of Lean tactics. Because the task stream is continuously refreshed, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects, and its alignment with community needs makes it a valuable tool for the formal mathematics community.

Key Points

  • SorryDB is a dynamically-updating benchmark of open Lean tasks drawn from 78 real-world formalization projects on GitHub.
  • The authors evaluate various approaches on a snapshot of 1000 tasks from SorryDB, showing that current approaches are complementary.
  • SorryDB mitigates test-set contamination and offers a robust metric for evaluating AI agents' ability to contribute to novel formal mathematics projects.
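The "open Lean tasks" SorryDB collects are theorems in real project files whose proofs are stubbed out with the `sorry` placeholder. The following is a hypothetical illustration of such a task (not an entry from SorryDB itself): a prover's job is to replace the `sorry` with a proof that the project's Lean compiler accepts.

```lean
-- A stated theorem with its proof left open; Lean accepts the file
-- but warns that the declaration depends on `sorry`.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  sorry

-- A successful completion replaces the placeholder, e.g.:
theorem add_comm_example' (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

Unlike competition problems, such tasks sit inside a project with its own imports and definitions, so solving them typically requires understanding the surrounding dependencies.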

Merits

Strength in Design

SorryDB's dynamic nature and alignment with community needs make it a valuable tool for the formal mathematics community.

Comprehensive Evaluation

The authors assess various approaches, including generalist large language models, agentic approaches, and specialized symbolic provers.

Demerits

Limited Scope

The evaluation is limited to a snapshot of 1000 tasks from SorryDB, which may not be representative of the full scope of formal mathematics projects.

Dependence on Data

The performance of AI provers may be dependent on the quality and quantity of data in SorryDB.

Expert Commentary

SorryDB is a significant contribution: by drawing tasks from live GitHub formalization projects rather than competition problems, it measures something much closer to what the community actually needs from AI provers. The caveats are those noted above: a 1000-task snapshot may not represent the full breadth of formalization work, and prover performance will inevitably track the quality and quantity of data flowing into the benchmark. Nevertheless, its dynamic nature and alignment with community needs make SorryDB a valuable tool, with clear implications both for formal verification techniques and tooling and for artificial intelligence in mathematics more broadly.

Recommendations

  • Future work should focus on expanding the scope of SorryDB to include a more comprehensive range of formal mathematics projects.
  • The development of more robust and diverse datasets for AI provers in formal mathematics projects is essential for improving their performance and accuracy.
