Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models
arXiv:2602.16793v1 Announce Type: new Abstract: In the past year, custom and unreleased math reasoning models reached gold medal performance on the International Mathematical Olympiad (IMO). Similar performance was then reported using large-scale inference on publicly available models but at prohibitive costs (e.g., 3000 USD per problem). In this work, we present an inference pipeline that attains best-in-class performance on IMO-style math problems at an average inference cost orders of magnitude below competing methods while using only general-purpose off-the-shelf models. Our method relies on insights about grader failure in solver-grader pipelines, which we call the Cognitive Well (iterative refinement converging to a wrong solution that the solver as well as the pipeline's internal grader consider to be basically correct). Our pipeline addresses these failure modes through conjecture extraction, wherein candidate lemmas are isolated from generated solutions and independently verified alongside their negations in a fresh environment (context detachment). On IMO-ProofBench Advanced (PB-Adv), our pipeline achieves 67.1 percent performance using Gemini 3.0 Pro with an average cost per question of approximately 31 USD. At the time of evaluation, this represented the state-of-the-art on PB-Adv among both public and unreleased models, and more than doubles the success rate of the next best publicly accessible pipeline, all at a fraction of the cost.
Executive Summary
This article presents an inference pipeline for IMO-style competition math that achieves best-in-class performance using only general-purpose off-the-shelf models, at an average cost orders of magnitude below competing large-scale inference methods (which run to roughly 3000 USD per problem). The pipeline targets a grader failure mode the authors call the Cognitive Well: iterative refinement converging to a wrong solution that both the solver and the pipeline's internal grader accept as essentially correct. Its core technique, conjecture extraction, isolates candidate lemmas from generated solutions and verifies them, alongside their negations, in a fresh environment (context detachment). On IMO-ProofBench Advanced, the pipeline scores 67.1% with Gemini 3.0 Pro at approximately 31 USD per question, which at the time of evaluation was state-of-the-art on that benchmark and more than double the success rate of the next best publicly accessible pipeline, at a fraction of its cost.
Key Points
- ▸ An inference pipeline built from general-purpose off-the-shelf models attains best-in-class performance on IMO-style math problems at an average cost orders of magnitude below competing methods.
- ▸ Conjecture extraction isolates candidate lemmas from generated solutions and verifies them, alongside their negations, in a fresh context, countering the "Cognitive Well" failure mode in which solver and grader jointly accept a wrong solution.
- ▸ The pipeline scores 67.1% on IMO-ProofBench Advanced using Gemini 3.0 Pro at roughly 31 USD per question — state-of-the-art at the time of evaluation and more than double the next best publicly accessible pipeline.
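The control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the paper's pipeline calls LLMs for lemma extraction and verification, whereas here those calls (`extract_lemmas`, `verify_in_fresh_context`) are hypothetical stand-ins stubbed with deterministic logic so the structure is runnable. The key ideas shown are (a) verifying each candidate lemma in a fresh context, detached from the solution that produced it, and (b) also checking the lemma's negation, rejecting the solution when the verifier accepts both.

```python
# A set of "known-true" statements standing in for an LLM verifier.
TRUE_FACTS = {"n^2 >= 0 for all integers n"}


def extract_lemmas(solution: str) -> list[str]:
    """Isolate candidate lemmas from a generated solution.
    Stub: treat lines beginning with 'Lemma:' as candidates; the
    paper extracts these with a model instead."""
    return [line.removeprefix("Lemma:").strip()
            for line in solution.splitlines()
            if line.startswith("Lemma:")]


def verify_in_fresh_context(statement: str) -> bool:
    """Context detachment: judge a statement with no access to the
    solution that produced it. Stub: look up a fact table; negated
    statements verify exactly when the base statement does not."""
    if statement.startswith("NOT "):
        return statement.removeprefix("NOT ") not in TRUE_FACTS
    return statement in TRUE_FACTS


def check_solution(solution: str) -> bool:
    """Accept a solution only if every extracted lemma verifies in a
    fresh context AND its negation does not (if both 'verify', the
    grader is unreliable on this statement, so we reject)."""
    for lemma in extract_lemmas(solution):
        holds = verify_in_fresh_context(lemma)
        negation_holds = verify_in_fresh_context("NOT " + lemma)
        if not holds or negation_holds:
            return False
    return True
```

With these stubs, a solution resting on the true lemma passes, while one resting on an unverifiable lemma is rejected even if a grader reading the full solution in context might have waved it through — the mechanism the paper uses to escape the Cognitive Well.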
Merits
Strength in Efficiency and Cost-Effectiveness
The pipeline matches or exceeds the performance of far more expensive custom and unreleased systems at roughly 31 USD per question, orders of magnitude cheaper than competing large-scale inference approaches, making strong competition-math performance broadly accessible.
Innovative Methodology
Conjecture extraction with context detachment directly targets the Cognitive Well failure mode, offering a principled way to break the false consensus between a solver and its internal grader.
Demerits
Potential for Overreliance on Off-the-Shelf Models
Because the pipeline is built entirely on off-the-shelf models, it inherits their capabilities, blind spots, release cycles, and pricing; this dependence may also discourage investment in more robust purpose-built solutions.
Limited Generalizability
The results are demonstrated on IMO-style proof problems and may not generalize to other mathematical domains or applications, limiting the broader impact of the research.
Expert Commentary
The pipeline's results with off-the-shelf models are a significant contribution to the field, but its limitations — dependence on the underlying models and uncertain generalizability beyond IMO-style problems — should be weighed carefully. Its efficiency and cost-effectiveness make strong competition-math performance accessible well beyond labs with custom models, and further research is needed to explore the approach's broader applications and limits.
Recommendations
- ✓ Future research should explore the potential for off-the-shelf models to be applied to other math problem-solving domains, such as education or research.
- ✓ More robust grader designs should be pursued to address the approach's remaining limitations, such as verifier reliability on statements outside the IMO-style distribution.