Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
arXiv:2602.12665v1 Announce Type: new
Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that test abstraction under renaming and redundant structure. We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under semantics-preserving perturbations such as clause reordering, filler clauses, and variable renaming. Across models, we observe sharp performance transitions under targeted structural interventions even when surface statistics are held fixed, revealing brittleness regimes that are invisible to aggregate SAT accuracy.
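The abstract's premise that 2-SAT satisfiability "is characterized by the implication graph" refers to the classical Aspvall-Plass-Tarjan result: a 2-CNF formula is satisfiable iff no variable lies in the same strongly connected component as its negation. A minimal sketch of that decision procedure (not code from the paper) follows, using Kosaraju's SCC algorithm:

```python
def solve_2sat(num_vars, clauses):
    """Decide satisfiability of a 2-CNF formula via its implication graph.

    Literals are nonzero ints: v means x_v, -v means NOT x_v.
    Each clause (a, b) contributes implications (-a -> b) and (-b -> a).
    SAT iff no variable shares a strongly connected component with its
    negation (Aspvall, Plass, and Tarjan, 1979).
    """
    # node index: literal l -> 2*(|l|-1), its negation at the odd slot
    def node(l):
        return 2 * (abs(l) - 1) + (1 if l < 0 else 0)

    n = 2 * num_vars
    adj = [[] for _ in range(n)]    # implication graph
    radj = [[] for _ in range(n)]   # reversed edges for Kosaraju
    for a, b in clauses:
        for u, v in ((-a, b), (-b, a)):  # implications of (a OR b)
            adj[node(u)].append(node(v))
            radj[node(v)].append(node(u))

    # pass 1: iterative DFS on adj, record nodes in post-order
    visited = [False] * n
    order = []
    for s in range(n):
        if visited[s]:
            continue
        visited[s] = True
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if not visited[v]:
                    visited[v] = True
                    stack.append((v, iter(adj[v])))
                    break
            else:
                order.append(u)
                stack.pop()

    # pass 2: label SCCs on the reversed graph in reverse post-order
    comp = [-1] * n
    c = 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            u = stack.pop()
            for v in radj[u]:
                if comp[v] == -1:
                    comp[v] = c
                    stack.append(v)
        c += 1

    # SAT iff x_i and NOT x_i never share an SCC
    return all(comp[2 * i] != comp[2 * i + 1] for i in range(num_vars))
```

This is what makes 2-SAT a useful diagnostic: a linear-time ground-truth oracle exists, so every model answer can be checked exactly.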
Executive Summary
The article presents a novel diagnostic benchmark for evaluating the robustness of large language model (LLM)-based reasoners on structured 2-CNF formulas in 2-SAT problems. The authors introduce parameterized families of formulas that allow controlled manipulation of the structural phenomena governing satisfiability, such as contradiction-cycle UNSAT cores, SAT instances with prescribed solution multiplicity, and planted backbones. The study evaluates LLM-based reasoners on decision accuracy and assignment validity, and finds sharp performance transitions under targeted structural interventions, revealing brittleness regimes that aggregate SAT accuracy does not capture. The research highlights the brittleness of current LLM-based reasoners and underscores the need for more nuanced evaluation metrics.
Key Points
- ▸ Introduction of a diagnostic benchmark for 2-SAT with parameterized families of structured 2-CNF formulas.
- ▸ Evaluation of LLM-based reasoners on decision accuracy and assignment validity.
- ▸ Identification of performance transitions under targeted structural interventions.
- ▸ Revelation of brittleness regimes in LLM-based reasoners that are not captured by aggregate SAT accuracy.
Merits
Innovative Benchmark
The introduction of a diagnostic benchmark that isolates distinct competencies and failure modes in LLM-based reasoners is a significant contribution to the field. It provides a controlled testbed for evaluating the robustness of reasoning models.
Controlled Structural Manipulation
The ability to tune satisfiability along interpretable axes, such as contradiction-cycle UNSAT cores and planted backbones, allows for a more nuanced understanding of model performance. This controlled manipulation is crucial for identifying specific failure modes.
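To make the "controllable size and imbalance" of generator (i) concrete, here is a hypothetical re-creation of a contradiction-cycle generator (the paper's actual generators are only paraphrased in the abstract, so the construction and parameter names below are assumptions). It plants an implication cycle that passes through both x1 and NOT x1, forcing UNSAT, with path lengths k and m controlling cycle size and imbalance:

```python
import random

def contradiction_cycle(k, m, seed=0):
    """Build an UNSAT 2-CNF core whose implication graph contains a
    cycle through x1 and NOT x1 (illustrative sketch, not the paper's code).

    The path x1 -> x2 -> ... -> xk -> NOT x1 uses the first k variables;
    the return path NOT x1 -> x_{k+1} -> ... -> x_{k+m} -> x1 uses m
    more, so k versus m controls the size and imbalance of the cycle.
    The implication u -> v is encoded as the clause (NOT u OR v).
    """
    def implies(u, v):
        return (-u, v)

    clauses = []
    # forward path: x1 -> x2 -> ... -> xk -> NOT x1
    path = list(range(1, k + 1))
    for u, v in zip(path, path[1:]):
        clauses.append(implies(u, v))
    clauses.append(implies(path[-1], -1))
    # return path: NOT x1 -> x_{k+1} -> ... -> x_{k+m} -> x1
    prev = -1
    for v in range(k + 1, k + m + 1):
        clauses.append(implies(prev, v))
        prev = v
    clauses.append(implies(prev, 1))
    # shuffle so surface clause order does not reveal the planted cycle
    rng = random.Random(seed)
    rng.shuffle(clauses)
    return k + m, clauses
```

Since both x1 and NOT x1 lie on one implication cycle, they share a strongly connected component and the formula is unsatisfiable regardless of clause order.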
Comprehensive Evaluation
The study evaluates LLM-based reasoners on both decision accuracy and assignment validity, providing a holistic view of model performance. This comprehensive approach is essential for understanding the strengths and limitations of current reasoning models.
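Assignment validity, unlike decision accuracy, can be verified mechanically against the formula itself. A minimal checker (assuming the model's output has been parsed into a variable-to-Boolean dict; the interface is an assumption, not the paper's harness):

```python
def assignment_satisfies(clauses, assignment):
    """Check whether a model-produced assignment satisfies a 2-CNF formula.

    clauses: iterable of (a, b) literal pairs (v for x_v, -v for NOT x_v).
    assignment: dict mapping variable index to bool. A literal is true
    when its sign agrees with the assigned truth value.
    """
    def lit_true(l):
        return assignment[abs(l)] == (l > 0)

    # every clause must have at least one true literal
    return all(lit_true(a) or lit_true(b) for a, b in clauses)
```

Scoring both metrics matters because a model can guess "SAT" correctly while emitting an invalid witness; only the pair exposes that failure mode.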
Demerits
Limited Scope
The focus on 2-SAT problems may limit the generalizability of the findings to other types of logical problems or reasoning tasks. Further research is needed to evaluate the robustness of LLM-based reasoners on a broader range of problems.
Model Specificity
The study does not specify which LLM-based reasoners were evaluated, making it difficult to assess the applicability of the findings to specific models. Future research should include a more detailed analysis of different LLM-based reasoners.
Surface Statistics
While the study holds surface statistics fixed, the impact of these statistics on model performance is not fully explored. A more detailed analysis of how surface statistics affect reasoning could provide additional insights.
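The semantics-preserving perturbations named in the abstract (clause reordering, filler clauses, variable renaming) are simple to express. A sketch under assumed conventions; the paper's exact filler construction may differ, and tautological fillers are one plausible choice:

```python
import random

def perturb(num_vars, clauses, seed=0):
    """Apply semantics-preserving perturbations: variable renaming,
    tautological filler clauses, and clause reordering (illustrative).

    Returns (new_num_vars, new_clauses) with identical satisfiability.
    """
    rng = random.Random(seed)
    # variable renaming: a random permutation of variable indices
    perm = list(range(1, num_vars + 1))
    rng.shuffle(perm)
    rename = {v: perm[v - 1] for v in range(1, num_vars + 1)}
    out = [(rename[abs(a)] * (1 if a > 0 else -1),
            rename[abs(b)] * (1 if b > 0 else -1)) for a, b in clauses]
    # filler: tautologies (x OR NOT x) on fresh variables change nothing
    for v in range(num_vars + 1, num_vars + 4):
        out.append((v, -v))
    rng.shuffle(out)  # clause reordering
    return num_vars + 3, out
```

Because each transformation preserves the satisfying-assignment structure up to renaming, any change in model output under these perturbations is attributable to surface sensitivity rather than logical difficulty.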
Expert Commentary
The article marks a meaningful advance in how LLM-based reasoners are evaluated. By tying satisfiability to interpretable structural parameters of the implication graph, the benchmark separates genuine logical competence from sensitivity to surface form, and the reported brittleness regimes, invisible to aggregate SAT accuracy, underscore the need for more robust and reliable reasoning models in AI applications. The study's main limitations are its restriction to 2-SAT and the absence of named evaluated models. Future research should broaden the evaluation to other classes of logical problems, report per-model results for specific LLM-based reasoners, and examine more closely how surface statistics interact with structural difficulty.
Recommendations
- ✓ Future research should evaluate the robustness of LLM-based reasoners on a broader range of logical problems to ensure the generalizability of the findings.
- ✓ A more detailed analysis of different LLM-based reasoners should be included in future studies to assess the applicability of the findings to specific models.
- ✓ The impact of surface statistics on model performance should be explored in more detail to provide a comprehensive understanding of reasoning model robustness.