
NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect


Pratibha Zunjare, Michael Hsiao

arXiv:2603.02504v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present NeuroProlog, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities. At inference, we introduce an execution-guided decoding pipeline with a fine-grained error taxonomy that enables iterative program repair and quantifies model self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B-32B parameters) demonstrates consistent improvements: cocktail training achieves significant accuracy gains of +5.23% (Qwen-32B, p < 0.01), +3.43% (GPT-OSS-20B, p < 0.01), and +5.54% (Llama-3B, p < 0.05) over single-task baselines. Systematic error analysis reveals scale-dependent learning dynamics: at 32B scale, cocktail training transforms unfixable type errors (12% repair rate) into correctable domain errors (96% repair rate), achieving 92.7% overall correction; at 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.
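The abstract describes the execution-guided decoding pipeline only at a high level. The following Python sketch shows one plausible shape for such a loop under stated assumptions: the helper names (`generate_program`, `run_prolog`, `classify_error`) and the coarse error categories are hypothetical stand-ins loosely matching the taxonomy the abstract mentions, not the paper's actual API.

```python
# Hedged sketch of an execution-guided repair loop; all function names and
# the error taxonomy below are illustrative, not NeuroProlog's implementation.

MAX_REPAIRS = 3

def classify_error(stderr: str) -> str:
    """Toy error taxonomy: map an interpreter message to a coarse category."""
    if "syntax" in stderr:
        return "syntax_error"
    if "type" in stderr:
        return "type_error"
    return "domain_error"

def solve_with_repair(problem: str, generate_program, run_prolog):
    """Generate a program, execute it, and re-prompt with error feedback."""
    program = generate_program(problem, feedback=None)
    for _ in range(MAX_REPAIRS):
        ok, output = run_prolog(program)
        if ok:
            return output                      # verified, executable answer
        category = classify_error(output)      # fine-grained error signal
        # Re-prompt the model with the error category as repair feedback.
        program = generate_program(problem, feedback=category)
    return None                                # give up after MAX_REPAIRS

# Minimal demo with stub model/executor: first attempt fails, repair succeeds.
def fake_generate(problem, feedback):
    return "bad" if feedback is None else "solve(X) :- X is 6*7."

def fake_run(program):
    if "solve" in program:
        return (True, 42)
    return (False, "syntax error near 'bad'")

print(solve_with_repair("What is six times seven?", fake_generate, fake_run))  # 42
```

The demo stubs stand in for the LLM and the Prolog interpreter; the point is only the control flow, in which each failed execution yields a typed error signal that conditions the next generation attempt.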

Executive Summary

This article introduces NeuroProlog, a neurosymbolic framework that enables verifiable mathematical reasoning by compiling math word problems into executable Prolog programs. The framework employs a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives (formula-to-rule translation, program synthesis, and program-answer alignment) in a unified symbolic representation space. Evaluation on GSM8K across four model scales (3B-32B) demonstrates consistent accuracy gains over single-task baselines, and a systematic error analysis reveals scale-dependent learning dynamics, including a capacity threshold below which type-safe symbolic reasoning breaks down. The findings show how NeuroProlog addresses a core limitation of Large Language Models in mathematical reasoning, with implications for both practical applications and policy considerations.

Key Points

  • NeuroProlog is a neurosymbolic framework that ensures verifiable mathematical reasoning by compiling math word problems into executable Prolog programs.
  • The framework employs a multi-task Cocktail training strategy that optimizes three synergistic objectives in a unified symbolic representation space.
  • Comprehensive evaluation on GSM8K across four model scales (3B-32B) demonstrates consistent accuracy gains of +3.43% to +5.54% over single-task baselines.
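To make the compilation step concrete, here is a hedged sketch of how a GSM8K-style word problem might map to a symbolic program. The Prolog rendering is purely illustrative (the paper's actual KB/SOLVE program format is not shown in the abstract), and since no Prolog interpreter is assumed here, a small Python stand-in evaluates the same arithmetic to check the expected answer.

```python
# Illustrative only: NeuroProlog compiles word problems to Prolog; the exact
# program format (KB rules plus a SOLVE predicate) is not reproduced here.

problem = ("Natalia sold clips to 48 of her friends in April, and then she "
           "sold half as many clips in May. How many clips did she sell "
           "altogether in April and May?")

# A plausible (hypothetical) Prolog rendering of the problem:
prolog_program = """
april(48).
may(M) :- april(A), M is A // 2.
solve(Total) :- april(A), may(M), Total is A + M.
"""

# Python stand-in for executing the symbolic program and verifying the answer.
def solve() -> int:
    april = 48
    may = april // 2          # "half as many clips in May"
    return april + may        # total across both months

answer = solve()
assert answer == 72
print(answer)  # 72
```

The appeal of the symbolic route is that each intermediate quantity is an explicit, executable term, so a wrong program fails verification instead of producing a fluent but unchecked chain of prose.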

Merits

Strength in Mathematical Reasoning

NeuroProlog delivers statistically significant accuracy gains on GSM8K, directly addressing the tendency of Large Language Models to produce fluent but logically inconsistent mathematical solutions.

Scalability and Flexibility

The multi-task Cocktail training strategy improved accuracy at every scale tested (3B-32B), suggesting the approach transfers across model sizes rather than being tuned to a single configuration.

Demerits

Complexity and Computational Requirements

The multi-task training strategy and unified symbolic representation space add engineering and computational overhead that can be difficult to manage, and the reported results suggest the benefits are scale-dependent: at 8B scale, the same training eliminates syntactic errors but introduces semantic failures.

Dependence on Prolog Programming

The framework's reliance on Prolog may limit its portability to other symbolic representations or programming languages, and it requires Prolog execution tooling in the inference loop.

Expert Commentary

NeuroProlog and its multi-task Cocktail training strategy are a meaningful contribution to neurosymbolic reasoning. The results show that joint symbolic supervision can address a known weakness of Large Language Models, namely fluent but logically inconsistent mathematical solutions, while the execution-guided repair analysis offers a useful lens on model self-debugging capacity. For practical and policy settings that demand auditable answers, the formal verification guarantees are the framework's most consequential property.

Recommendations

  • Further research and development on NeuroProlog and its application to various domains and tasks are necessary to fully explore its potential and limitations.
  • The integration of NeuroProlog with other AI frameworks and systems could enable novel applications and help mitigate the limits of single-task training; evaluation beyond GSM8K would also test how far the approach generalizes across domains.
