BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models
arXiv:2602.12889v1
Abstract: We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is …
Jiangxi Chen, Qian Liu