Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

Anna Babarczy, Andras Lukacs, Peter Vedres, Zeteny Bujka

arXiv:2603.18007v1

Abstract: The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities -- specifically, the ability to infer others' beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.
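To make the abstract's experimental manipulation concrete, the sketch below shows one way such stimuli could be constructed: varying the number of relevant inferential cues a story contains and injecting irrelevant distractor sentences. Everything in it is a hypothetical placeholder; the paper's actual materials are not reproduced here.

```python
# Hypothetical illustration of the two stimulus manipulations the abstract
# describes: varying how many relevant inferential cues a story contains,
# and injecting irrelevant distractor sentences. All sentences here are
# invented placeholders, not the paper's actual test materials.
import random

BASE = "Peter tells his aunt that her new haircut looks lovely."
CUES = [  # each cue supports inferring Peter's true belief and intention
    "Peter privately thinks the haircut is awful.",
    "His aunt spent a lot of money on the haircut.",
    "Peter does not want to hurt his aunt's feelings.",
]
DISTRACTORS = [  # irrelevant to the mental-state inference
    "Outside, the afternoon is cold and windy.",
    "A delivery van idles in front of the house.",
]

def build_story(n_cues: int, n_distractors: int, seed: int = 0) -> str:
    """Compose a story variant with the requested cue and distractor counts."""
    rng = random.Random(seed)
    body = CUES[:n_cues] + rng.sample(DISTRACTORS, n_distractors)
    rng.shuffle(body)  # interleave cues and distractors
    return " ".join([BASE] + body)

# An easy condition (all cues, no noise) vs. a hard one (one cue, two distractors).
easy = build_story(n_cues=3, n_distractors=0)
hard = build_story(n_cues=1, n_distractors=2)
print(easy)
print(hard)
```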

Executive Summary

This study evaluates the Theory of Mind (ToM) capabilities of Large Language Models (LLMs) using an adapted version of the Strange Stories paradigm, a text-based test widely used in human ToM research. Comparing five models against human controls, the results show a clear performance gap: earlier and smaller models were sensitive to the number of inferential cues and to distracting information, while GPT-4o achieved high accuracy and robustness comparable to humans. The findings bear on ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation, and they carry implications for the development of AI systems that must reason about human mental states.

Key Points

  • The study evaluates the ToM capabilities of five LLMs using the Strange Stories test, a text-based tool adapted from human ToM research, with human controls as a baseline.
  • Earlier and smaller models were strongly affected by the number of relevant inferential cues and, to some extent, by irrelevant or distracting information in the stories.
  • GPT-4o demonstrated high accuracy and robustness, performing comparably to humans even in the most challenging conditions.
  • The results speak to the boundary between genuine understanding and statistical approximation in LLMs.

Merits

Strength in Methodology

The study employs a well-established and widely used text-based tool adapted from human ToM research, providing a robust and reliable method for evaluating the ToM capabilities of LLMs.
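For concreteness, here is a minimal sketch of what such an evaluation loop might look like, assuming the OpenAI Python SDK. The story, question, and model names are illustrative placeholders; the authors' actual test items, model set, and harness are not published in the abstract.

```python
# Minimal sketch of a Strange Stories-style evaluation loop, assuming the
# OpenAI Python SDK. The story, question, and model names are illustrative
# placeholders, not the authors' actual test items or model set.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STORY = (
    "Emma wraps an empty box and tells her brother it contains his birthday "
    "present, planning to surprise him later with tickets hidden in a card."
)
QUESTION = "Is what Emma says true? Why does she say it?"

def ask(model: str, story: str, question: str) -> str:
    """Present one story/question pair to a chat model and return its answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Read the story, then answer the question about it."},
            {"role": "user", "content": f"{story}\n\n{question}"},
        ],
        temperature=0,  # reduce sampling variance across models
    )
    return response.choices[0].message.content

for model in ["gpt-4o", "gpt-3.5-turbo"]:  # placeholder model set
    print(model, "->", ask(model, STORY, QUESTION))
```

In the human Strange Stories literature, open-ended answers to such items are typically scored by raters against mental-state criteria; presumably the same kind of scoring would be applied to the model outputs collected this way.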

GPT-4o Performance

GPT-4o demonstrates high accuracy and robustness comparable to human controls, even in the most challenging conditions, providing strong evidence that its ToM performance substantially exceeds that of the earlier and smaller models tested.

Demerits

Limited Generalizability

The findings rest on a single adapted test battery and five models, so they may not generalize to other LLMs or to other ToM tasks.

Lack of Deep Explanation

The study does not explain how GPT-4o achieves its strong ToM performance, leaving open whether the underlying mechanism is genuine mental-state attribution or highly effective pattern completion.

Expert Commentary

This study is a meaningful step forward in the ongoing debate about the cognitive status of LLMs. The finding that GPT-4o matches human performance even in the most challenging conditions is striking, but output parity does not settle the mechanistic question: the model may be attributing mental states, or it may be completing patterns well enough that the two are behaviorally indistinguishable on this test. Further research is needed to probe those mechanisms before such systems are relied on to collaborate with humans, and the results underscore the broader need for transparency and explainability in AI systems.

Recommendations

  • Future research should focus on developing more sophisticated AI systems that can collaborate with humans and understand human mental states.
  • The development of more transparent and explainable AI systems should be a priority, particularly in applications where human safety and well-being are at stake.
