Humans and LLMs Diverge on Probabilistic Inferences
arXiv:2602.23546v1 Abstract: Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25--30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
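The central comparison in the paper is between the distribution of human likelihood judgments for each item and the distribution of model responses. The sketch below shows one way such a comparison might be scored; the rating scale, binning, and choice of Jensen-Shannon distance are illustrative assumptions for this summary, not the paper's reported methodology.

```python
# Sketch: comparing a human likelihood-judgment distribution with an LLM's
# response distribution for one ProbCOPA-style item. The 0-1 rating scale,
# the binning, and Jensen-Shannon distance are illustrative assumptions,
# not the paper's actual metric.
import numpy as np
from scipy.spatial.distance import jensenshannon

def to_distribution(ratings, n_bins=5, low=0.0, high=1.0):
    """Bin likelihood ratings into a normalized histogram."""
    counts, _ = np.histogram(ratings, bins=n_bins, range=(low, high))
    return counts / counts.sum()

# Hypothetical annotations for one item: ~25-30 human raters vs. repeated model samples.
human_ratings = np.array([0.8, 0.7, 0.9, 0.6, 0.8, 0.75, 0.85, 0.7, 0.65, 0.9])
model_ratings = np.array([0.95, 1.0, 0.9, 1.0, 0.95, 1.0, 0.9, 0.95, 1.0, 0.9])

human_dist = to_distribution(human_ratings)
model_dist = to_distribution(model_ratings)

# Jensen-Shannon distance in [0, 1]; 0 means the distributions are identical.
distance = jensenshannon(human_dist, model_dist)
print(f"JS distance between human and model judgments: {distance:.3f}")
```

A large distance on many items would reflect the paper's headline finding: human judgments spread across the likelihood scale, while model responses cluster away from that spread.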
Executive Summary
The paper examines how humans and large language models (LLMs) differ when making probabilistic inferences, i.e., conclusions that are likely given a premise but not strictly entailed by it. The authors introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for likelihood by 25--30 human participants. Human judgments are graded and varied, whereas eight state-of-the-art reasoning LLMs consistently fail to reproduce these human-like distributions, and an analysis of the models' reasoning chains points to a common pattern used to evaluate such inferences. The findings reveal persistent differences between humans and LLMs and underscore the need to evaluate reasoning beyond deterministic settings.
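To make the dataset description concrete, a hypothetical item record is sketched below. The field names, example texts, and 0-1 rating scale are assumptions for illustration; the paper's published schema is not given in this summary.

```python
# Hypothetical shape of one ProbCOPA item; field names and the 0-1 rating
# scale are illustrative assumptions, not the dataset's published format.
from dataclasses import dataclass

@dataclass
class ProbCopaItem:
    premise: str                 # observed situation
    inference: str               # conclusion that is likely, but not entailed
    human_ratings: list[float]   # one likelihood judgment per annotator (25-30 per item)

item = ProbCopaItem(
    premise="The streets were wet when she left for work.",
    inference="It rained during the night.",
    human_ratings=[0.8, 0.7, 0.9, 0.6, 0.85],
)
mean_likelihood = sum(item.human_ratings) / len(item.human_ratings)
print(f"Mean human likelihood judgment: {mean_likelihood:.2f}")
```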
Key Points
- ▸ Introduction of ProbCOPA dataset for probabilistic inferences
- ▸ LLMs fail to produce human-like distributions in probabilistic judgments
- ▸ Persistent differences between humans and LLMs in reasoning patterns
Merits
Novel Dataset
The ProbCOPA dataset, with 210 handcrafted items each rated by 25--30 participants, provides a valuable human-annotated resource for studying probabilistic inference
Comprehensive Analysis
The study compares human judgment distributions with responses from eight state-of-the-art reasoning LLMs and analyzes the models' reasoning chains, highlighting key differences
Demerits
Limited Generalizability
The dataset consists of 210 handcrafted English items, so the findings may not generalize to other languages, domains, or tasks, limiting the scope of the conclusions
Methodological Limitations
Reliance on a single dataset and a particular set of evaluation metrics may introduce biases and limit the robustness of the results
Expert Commentary
The findings highlight the graded, varied nature of human probabilistic reasoning and the difficulty of building AI systems that reproduce it. ProbCOPA offers a useful resource for future research, and the comparison of human and LLM reasoning patterns clarifies where the two diverge. At the same time, the study's limitations, notably the limited generalizability of a 210-item English dataset, should be weighed before drawing conclusions about broader reasoning abilities.
Recommendations
- ✓ Future research should focus on developing more advanced LLMs that can accurately mimic human probabilistic judgments
- ✓ The development of more comprehensive datasets and evaluation metrics is necessary to fully capture the complexities of human reasoning