
LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification

arXiv:2602.21044v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark designed to systematically evaluate multi-path logical reasoning, constructed via a neuro-symbolic framework that leverages backward logic generation and semantic instantiation. This pipeline yields solver-verified reasoning problems characterized by high-depth multi-path reasoning and inherent logical distractions, where each instance is associated with an exhaustive set of minimal proofs. We further propose a reference-free evaluation framework to rigorously assess model performance in both convergent and divergent regimes. Experiments on state-of-the-art language models reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives, and the coverage gap grows substantially with reasoning depth. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. Our code and data will be released at https://github.com/kkkkarry/LogicGraph.

Executive Summary

LogicGraph presents a novel benchmark for evaluating the ability of large language models (LLMs) to perform multi-path logical reasoning, a critical aspect of real-world reasoning problems. The benchmark is constructed using a neuro-symbolic framework that leverages backward logic generation and semantic instantiation. Experiments on state-of-the-art LLMs reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. This research has significant implications for developing more robust and versatile LLMs capable of tackling complex reasoning tasks.
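To make the construction concrete, the backward-generation idea can be sketched as follows. This is an illustrative toy, not the authors' pipeline: it generates a propositional rule set backwards from a goal so that the goal admits several independent derivations (the multi-path property), then verifies derivability with a simple forward-chaining "solver". All names (`generate_multipath_problem`, `forward_chain`, the `P0…` atoms) are invented for this sketch.

```python
from itertools import count

def generate_multipath_problem(goal="G", branches=2, depth=2):
    """Backward generation: each expanded atom gets `branches` rules,
    each rule introducing fresh premise atoms, down to `depth` levels.
    Multiple rules concluding the same atom create multiple proof paths."""
    fresh = count()
    rules, facts = [], set()

    def expand(atom, level):
        if level == 0:
            facts.add(atom)            # leaf premises become given facts
            return
        for _ in range(branches):      # several rules conclude the same atom
            premises = [f"P{next(fresh)}" for _ in range(2)]
            rules.append((premises, atom))
            for p in premises:
                expand(p, level - 1)

    expand(goal, depth)
    return rules, facts

def forward_chain(rules, facts):
    """Solver-style verification: saturate the fact set under the rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and all(p in derived for p in premises):
                derived.add(conclusion)
                changed = True
    return derived

rules, facts = generate_multipath_problem()
assert "G" in forward_chain(rules, facts)   # the goal is derivable
# Each of the `branches` top-level rules for G anchors a distinct minimal proof.
```

Because every expanded atom receives more than one concluding rule, the number of distinct minimal proofs grows exponentially with depth, which is what makes exhaustive proof enumeration (and hence divergent evaluation) meaningful.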

Key Points

  • LogicGraph is the first benchmark to systematically evaluate multi-path logical reasoning.
  • The benchmark is constructed using a neuro-symbolic framework that leverages backward logic generation and semantic instantiation.
  • Experiments reveal a common limitation in state-of-the-art LLMs: they tend to commit early to a single route and fail to explore alternatives.

Merits

Strength in Novelty

LogicGraph introduces a novel benchmark that addresses the limitation of existing LLM evaluations, which primarily focus on convergent logical reasoning.

Strength in Rigor

The benchmark is constructed using a neuro-symbolic framework that leverages backward logic generation and semantic instantiation, ensuring the evaluation of multi-path logical reasoning is rigorous and comprehensive.

Demerits

Limitation in Proposed Solutions

The benchmark exposes a common limitation in state-of-the-art LLMs, but it does not provide a clear solution to this limitation, leaving room for further research.

Limitation in Scope

The benchmark focuses on multi-path logical reasoning and may not capture other aspects of real-world reasoning problems, such as common sense and world knowledge.

Expert Commentary

LogicGraph presents a significant contribution to the field of natural language processing and artificial intelligence. By evaluating LLMs' capacity for multi-path logical reasoning, the benchmark highlights a critical blind spot in existing evaluations. To address this limitation, researchers and developers should consider incorporating mechanisms that allow LLMs to explore multiple logical paths. Furthermore, the development of more robust and versatile LLMs has significant implications for applications in areas such as natural language processing, expert systems, and decision-making systems.
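The "divergence gap" the commentary refers to can be quantified with a simple coverage score. The metric below is an assumed, simplified form (the paper's exact reference-free framework is not reproduced here): each proof is represented as a set of rule identifiers, and coverage is the fraction of the gold exhaustive minimal-proof set that the model recovers.

```python
def proof_coverage(gold_proofs, model_proofs):
    """Fraction of the exhaustive minimal proofs recovered by the model.
    Each proof is an iterable of rule identifiers; order is irrelevant."""
    gold = {frozenset(p) for p in gold_proofs}
    found = {frozenset(p) for p in model_proofs}
    return len(gold & found) / len(gold)

# Hypothetical example: three minimal proofs exist, the model finds two.
gold = [{"r1", "r2"}, {"r3", "r4"}, {"r5"}]
model = [{"r1", "r2"}, {"r5"}]
print(proof_coverage(gold, model))   # 2 of 3 minimal proofs recovered
```

Under a metric like this, a model that commits early to one route scores near 1/|gold| regardless of depth, which is exactly the failure mode the benchmark is designed to expose.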

Recommendations

  • Further research is needed to develop mechanisms that allow LLMs to explore multiple logical paths.
  • Developers should consider neuro-symbolic integration approaches when evaluating LLMs' multi-path logical reasoning.
