
VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean


Yutong Xin, Qiaochu Chen, Greg Durrett, Işıl Dillig

arXiv:2602.18307v1 Announce Type: cross Abstract: Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia-group/VeriSoftBench.

Executive Summary

This article introduces VeriSoftBench, a repository-scale formal verification benchmark for the Lean proof assistant. Unlike existing benchmarks that focus on Mathlib-style mathematics, VeriSoftBench targets software verification by drawing 500 Lean 4 proof obligations from open-source formal-methods developments, packaged to preserve repository context and cross-file dependencies. The authors evaluate frontier large language models (LLMs) and specialized provers, yielding three key findings: provers tuned for Mathlib-style mathematics transfer poorly to the repository-centric setting; success is negatively correlated with transitive repository dependence, so tasks with large, multi-hop dependency closures are solved less often; and curated context restricted to a proof's dependency closure outperforms exposing the full repository, while still leaving substantial room for improvement. The benchmark and evaluation suite are publicly released, offering a valuable resource for the formal verification community and a foundation for more robust proof automation tools for software verification.
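To make the repository-centric setting concrete, the following is a hypothetical Lean 4 sketch (not taken from VeriSoftBench; all names are illustrative). The point is that the proof obligation depends on the codebase's own definitions (`Stack.push`, `Stack.size`) rather than on library mathematics, which is the kind of context Mathlib-tuned provers rarely see:

```lean
/-- A project-specific data structure, as a verification
    codebase might define it. -/
structure Stack (α : Type) where
  items : List α

def Stack.push (s : Stack α) (x : α) : Stack α :=
  ⟨x :: s.items⟩

def Stack.size (s : Stack α) : Nat :=
  s.items.length

/-- The proof obligation: discharging it requires unfolding the
    repository's own `push` and `size`, not Mathlib lemmas. -/
theorem push_size (s : Stack α) (x : α) :
    (s.push x).size = s.size + 1 := by
  simp [Stack.push, Stack.size]
```

A prover given only the theorem statement, without the surrounding definitions in its context, has little chance here; this is the gap between Mathlib-style and repository-style tasks that the benchmark is designed to measure.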

Key Points

  • VeriSoftBench, a repository-scale formal verification benchmark for Lean, is introduced.
  • Existing benchmarks focus on mathematics, whereas VeriSoftBench tackles software verification.
  • Provers tuned for Mathlib-style mathematics struggle in the repository-centric setting.
  • Success is negatively correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved.
  • Curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, though substantial headroom remains.

Merits

Strength in Representative Benchmarking

VeriSoftBench provides a representative benchmark for software verification, offering a more accurate reflection of real-world challenges and opportunities.

Insight into Proof Automation Limitations

The study highlights the limitations of existing provers in tackling software verification challenges, underscoring the need for more robust and effective proof automation tools.

Contribution to Research Community

The release of VeriSoftBench and its evaluation suite enables the formal verification community to build upon this research, driving progress in proof automation and software verification.

Demerits

Limited Scope and Generalizability

The study focuses on Lean and a specific set of open-source formal-methods developments, potentially limiting the scope and generalizability of the findings.

Need for Further Investigation into Transitive Repository Dependence

While the study highlights the importance of transitive repository dependence, further investigation is required to fully understand its impact on proof automation and software verification.
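One way to picture transitive repository dependence is a chain of project-internal hops, as in this hypothetical Lean 4 sketch (illustrative names, not benchmark content): the final theorem cannot be proved without an intermediate lemma, which in turn depends on a base definition, so the proof's dependency closure spans the whole chain.

```lean
-- Hop 1: a project-specific definition.
def double (n : Nat) : Nat := n + n

-- Hop 2: a lemma about that definition.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double; omega

-- Hop 3: a theorem whose proof goes through hop 2, and
-- transitively through hop 1.
theorem double_pos (n : Nat) (h : 0 < n) : 0 < double n := by
  rw [double_eq_two_mul]; omega
```

In real verification codebases such chains are far longer and cross file boundaries, which is presumably why solve rates drop as the closure grows; pinning down the mechanism behind that drop is exactly the open question noted above.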

Expert Commentary

This study is a significant contribution to formal verification: it exposes the limitations of existing provers outside the Mathlib setting and makes the case for benchmarks that reflect real verification codebases. The findings point to a need for proof automation tools that handle project-specific definitions and deep dependency closures, a direction that will require both better context selection and further study of how transitive repository dependence affects LLM-based provers. While the scope is limited to Lean, the work provides a solid foundation for future research, and the public release of VeriSoftBench and its evaluation suite is a crucial step toward measurable progress in proof automation for software verification.

Recommendations

  • Future research should focus on developing more robust and effective proof automation tools that can tackle software verification challenges in a repository-centric setting.
  • Investigations into transitive repository dependence should be prioritized to better understand its impact on proof automation and software verification.
