
BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models

Jiangxi Chen, Qian Liu

arXiv:2602.12889v1

Abstract: We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021--2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.

Executive Summary

The article introduces BaziQA-Benchmark, a standardized evaluation suite for assessing symbolic and temporally compositional reasoning in large language models. Derived from 200 professionally curated multiple-choice problems from the Global Fortune-teller Competition (2021-2025), the benchmark enables objective scoring and controlled comparison across years, domains, and model families. The study evaluates contemporary language models under a multi-turn setting, finding that while models consistently outperform chance, they remain far from saturation, with pronounced sensitivity to temporal composition and reasoning order. The authors also introduce a lightweight Structured Reasoning Protocol to further probe reasoning behavior.
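To make the scoring setup concrete, here is a minimal sketch of how accuracy on a 200-item multiple-choice benchmark might be computed against a chance baseline. The item schema, field names, and the assumption of four answer options are illustrative; the paper does not publish its data format.

```python
import random

# Hypothetical item schema: the paper does not publish a data format,
# so the four-option layout and field names below are assumptions.
items = [
    {"id": i, "options": ["A", "B", "C", "D"], "answer": random.choice("ABCD")}
    for i in range(200)
]

def accuracy(predictions, items):
    """Fraction of predicted option letters matching the gold answers."""
    correct = sum(p == item["answer"] for p, item in zip(predictions, items))
    return correct / len(items)

# A uniform random guesser should land near the 1/4 chance baseline,
# which is the floor the paper reports models consistently beat.
guesses = [random.choice(item["options"]) for item in items]
print(f"random-guess accuracy: {accuracy(guesses, items):.3f} (chance = 0.250)")
```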

Key Points

  • Introduction of BaziQA-Benchmark for evaluating symbolic and temporally compositional reasoning in large language models.
  • Benchmark derived from 200 professionally curated multiple-choice problems from the Global Fortune-teller Competition.
  • Objective scoring and controlled comparison across years, domains, and model families.
  • Models outperform chance but exhibit pronounced sensitivity to temporal composition and reasoning order.
  • Introduction of a lightweight Structured Reasoning Protocol to constrain inference order without adding domain knowledge.
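The abstract describes the Structured Reasoning Protocol only as a constraint on inference order that adds no domain knowledge. One plausible reading is a fixed sequence of prompt stages the model must complete, in order, before committing to an answer; the stage names and wording below are assumptions for illustration, not the authors' protocol.

```python
# Hypothetical staging of the protocol: the abstract only says it
# constrains inference order without adding domain knowledge, so the
# stage wording below is an illustrative assumption, not the authors'.
STAGES = [
    "Step 1: list the chart's symbolic elements relevant to the question.",
    "Step 2: state which temporal conditions apply and how they interact.",
    "Step 3: eliminate answer options inconsistent with steps 1-2.",
    "Step 4: output the letter of the single remaining option.",
]

def run_protocol(chat, question: str) -> str:
    """Drive a multi-turn chat through the fixed stage order.

    `chat` is any callable from a message history to a reply string,
    e.g. a thin wrapper around a chat-completion API.
    """
    history = [{"role": "user", "content": question}]
    reply = ""
    for stage in STAGES:
        history.append({"role": "user", "content": stage})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
    return reply  # the final turn should contain the chosen letter
```

Because the stages only sequence the model's reasoning and mention no domain facts, a scaffold of this shape matches the abstract's claim of constraining inference order without injecting knowledge.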

Merits

Standardized Benchmark

The BaziQA-Benchmark provides a standardized and objective evaluation tool, enabling controlled comparison across different models and years, which is crucial for advancing the field of AI reasoning.

Professionally Curated Problems

The use of professionally curated problems from a reputable competition ensures high-quality and relevant evaluation scenarios, enhancing the benchmark's credibility and utility.

Comprehensive Evaluation

The study evaluates models under a multi-turn setting and analyzes performance variations across temporal difficulty, reasoning domains, and inference protocols, providing a thorough assessment of model capabilities.
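As a sketch of what such a per-dimension analysis might look like in code, the snippet below groups item-level accuracy by a metadata field such as reasoning domain or competition year. The field names are assumptions; the paper does not publish its analysis code or schema.

```python
from collections import defaultdict

def accuracy_by(items, predictions, key):
    """Break item-level accuracy down by a metadata field such as
    'year' or 'domain'. Field names are hypothetical; the released
    benchmark may use a different schema."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item, pred in zip(items, predictions):
        group = item[key]
        totals[group] += 1
        hits[group] += int(pred == item["answer"])
    return {group: hits[group] / totals[group] for group in totals}

# e.g. per-year accuracy over the 2021-2025 competition splits:
# by_year = accuracy_by(items, predictions, key="year")
```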

Demerits

Limited Scope

The benchmark is derived from a specific competition, which may limit its generalizability to other domains or types of reasoning problems.

Model Performance Gaps

While models outperform chance, they remain far from saturation, with systematic failures on precise temporal localization and multi-condition symbolic judgments, indicating that current models are not yet reliable on this class of compositional reasoning task.

Potential Bias

The benchmark's reliance on problems from a specific competition may introduce biases that are not representative of broader reasoning tasks, potentially limiting the benchmark's applicability.

Expert Commentary

The introduction of BaziQA-Benchmark represents a significant advancement in the evaluation of symbolic and temporally compositional reasoning in large language models. The benchmark's standardized and objective nature provides a much-needed tool for comparing model performance across different dimensions, including temporal difficulty, reasoning domains, and inference protocols. The study's findings reveal that while contemporary models outperform chance, they still exhibit pronounced sensitivity to temporal composition and reasoning order, indicating that there is substantial room for improvement. The introduction of the Structured Reasoning Protocol further enhances the benchmark's utility by enabling a more detailed analysis of model reasoning behavior. However, the benchmark's reliance on problems from a specific competition may limit its generalizability, and future research should aim to address this limitation by incorporating a more diverse set of reasoning tasks. Overall, the BaziQA-Benchmark is a valuable contribution to the field of AI, providing a robust framework for evaluating and advancing the reasoning capabilities of large language models.

Recommendations

  • Expand the benchmark to include a more diverse set of reasoning tasks and domains to enhance its generalizability and applicability.
  • Conduct further research to identify and address the specific challenges and limitations in temporal composition and reasoning order, with the goal of improving model performance in these areas.
