Humans and LLMs Diverge on Probabilistic Inferences
arXiv:2602.23546v1 Abstract: Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25--30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
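The central comparison in the paper is between the distribution of human likelihood judgments for each item and the distribution of model responses. The sketch below shows one way such a comparison might be scored; the rating scale, binning, and choice of Jensen-Shannon distance are illustrative assumptions for this summary, not the paper's reported methodology.

```python
# Sketch: comparing a human likelihood-judgment distribution with an LLM's
# response distribution for one ProbCOPA-style item. The 0-1 rating scale,
# the binning, and Jensen-Shannon distance are illustrative assumptions,
# not the paper's actual metric.
import numpy as np
from scipy.spatial.distance import jensenshannon

def to_distribution(ratings, n_bins=5, low=0.0, high=1.0):
    """Bin likelihood ratings into a normalized histogram."""
    counts, _ = np.histogram(ratings, bins=n_bins, range=(low, high))
    return counts / counts.sum()

# Hypothetical annotations for one item: ~25-30 human raters vs. repeated model samples.
human_ratings = np.array([0.8, 0.7, 0.9, 0.6, 0.8, 0.75, 0.85, 0.7, 0.65, 0.9])
model_ratings = np.array([0.95, 1.0, 0.9, 1.0, 0.95, 1.0, 0.9, 0.95, 1.0, 0.9])

human_dist = to_distribution(human_ratings)
model_dist = to_distribution(model_ratings)

# Jensen-Shannon distance in [0, 1]; 0 means the distributions are identical.
distance = jensenshannon(human_dist, model_dist)
print(f"JS distance between human and model judgments: {distance:.3f}")
```

A large distance on many items would reflect the paper's headline finding: human judgments spread across the likelihood scale, while model responses cluster away from that spread.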
Executive Summary
The paper examines how humans and large language models (LLMs) differ when making probabilistic inferences, i.e., conclusions that are likely given a premise but not strictly entailed by it. The authors introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for likelihood by 25--30 human participants. Human judgments are graded and varied, whereas eight state-of-the-art reasoning LLMs consistently fail to reproduce these human-like distributions, and an analysis of the models' reasoning chains points to a common pattern used to evaluate such inferences. The findings reveal persistent differences between humans and LLMs and underscore the need to evaluate reasoning beyond deterministic settings.
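To make the dataset description concrete, a hypothetical item record is sketched below. The field names, example texts, and 0-1 rating scale are assumptions for illustration; the paper's published schema is not given in this summary.

```python
# Hypothetical shape of one ProbCOPA item; field names and the 0-1 rating
# scale are illustrative assumptions, not the dataset's published format.
from dataclasses import dataclass

@dataclass
class ProbCopaItem:
    premise: str                 # observed situation
    inference: str               # conclusion that is likely, but not entailed
    human_ratings: list[float]   # one likelihood judgment per annotator (25-30 per item)

item = ProbCopaItem(
    premise="The streets were wet when she left for work.",
    inference="It rained during the night.",
    human_ratings=[0.8, 0.7, 0.9, 0.6, 0.85],
)
mean_likelihood = sum(item.human_ratings) / len(item.human_ratings)
print(f"Mean human likelihood judgment: {mean_likelihood:.2f}")
```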
Key Points
- ▸ Introduction of ProbCOPA dataset for probabilistic inferences
- ▸ LLMs fail to produce human-like distributions in probabilistic judgments
- ▸ Persistent differences between humans and LLMs in reasoning patterns
Merits
Novel Dataset
The ProbCOPA dataset, with 210 handcrafted items each rated by 25--30 participants, provides a valuable human-annotated resource for studying probabilistic inference
Comprehensive Analysis
The study compares human judgment distributions with responses from eight state-of-the-art reasoning LLMs and analyzes the models' reasoning chains, highlighting key differences
Demerits
Limited Generalizability
The dataset consists of 210 handcrafted English items, so the findings may not generalize to other languages, domains, or tasks, limiting the scope of the conclusions
Methodological Limitations
Reliance on a single dataset and a particular set of evaluation metrics may introduce biases and limit the robustness of the results
Expert Commentary
The findings highlight the graded, varied nature of human probabilistic reasoning and the difficulty of building AI systems that reproduce it. ProbCOPA offers a useful resource for future research, and the comparison of human and LLM reasoning patterns clarifies where the two diverge. At the same time, the study's limitations, notably the limited generalizability of a 210-item English dataset, should be weighed before drawing conclusions about broader reasoning abilities.
Recommendations
- ✓ Future research should focus on developing more advanced LLMs that can accurately mimic human probabilistic judgments
- ✓ The development of more comprehensive datasets and evaluation metrics is necessary to fully capture the complexities of human reasoning