
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models


Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek

arXiv:2602.22072v1 (Announce Type: new)

Abstract: Theory of Mind (ToM) refers to an agent's ability to model the internal states of others. Contributing to the debate on whether large language models (LLMs) exhibit genuine ToM capabilities, our study investigates their ToM robustness using perturbations of false-belief tasks and examines the potential of Chain-of-Thought (CoT) prompting to enhance performance and explain the LLM's decision. We introduce a handcrafted, richly annotated ToM dataset that includes classic and perturbed false-belief tasks, the corresponding spaces of valid reasoning chains for correct task completion, subsequent reasoning faithfulness, and task solutions, and we propose metrics to evaluate reasoning chain correctness and the extent to which final answers are faithful to the reasoning traces of the generated CoT. We show a steep drop in ToM capabilities under task perturbation for all evaluated LLMs, questioning the notion that any robust form of ToM is present. While CoT prompting improves ToM performance overall in a faithful manner, it surprisingly degrades accuracy for some perturbation classes, indicating that selective application is necessary.
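To make the setup concrete, here is a minimal sketch of how a classic false-belief item, a perturbed variant, and the two prompting conditions (direct answer versus CoT) might be represented. The schema, field names, example stories, and the perturbation class label are illustrative assumptions, not the paper's actual dataset format.

```python
from dataclasses import dataclass

@dataclass
class ToMItem:
    """One false-belief task instance (illustrative schema, not the paper's format)."""
    story: str          # narrative describing events and who observed them
    question: str       # first-order belief question about an agent's mental state
    perturbation: str   # "none" for the classic task, or a perturbation class label
    gold_answer: str    # correct answer given the agent's (possibly false) belief

# Classic Sally-Anne-style item (hypothetical wording).
classic = ToMItem(
    story=("Sally puts her marble in the basket and leaves the room. "
           "While she is away, Anne moves the marble to the box."),
    question="Where will Sally look for her marble when she returns?",
    perturbation="none",
    gold_answer="in the basket",
)

# A perturbed variant: the protagonist witnesses the move, so her belief is no longer false.
perturbed = ToMItem(
    story=("Sally puts her marble in the basket. Anne moves the marble to the box "
           "while Sally watches through the open door."),
    question="Where will Sally look for her marble?",
    perturbation="witnessed_transfer",  # hypothetical perturbation class name
    gold_answer="in the box",
)

def direct_prompt(item: ToMItem) -> str:
    """Ask for the answer only, with no explicit reasoning."""
    return f"{item.story}\n\nQuestion: {item.question}\nAnswer with the location only."

def cot_prompt(item: ToMItem) -> str:
    """Ask the model to reason step by step before answering (Chain-of-Thought)."""
    return (f"{item.story}\n\nQuestion: {item.question}\n"
            "Think step by step about what each character saw and believes, "
            "then give the final answer on a new line starting with 'Answer:'.")
```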

Executive Summary

This study investigates the theory of mind (ToM) capabilities of large language models (LLMs) using perturbed false-belief tasks and Chain-of-Thought (CoT) prompting. The authors introduce a handcrafted ToM dataset and propose metrics to evaluate reasoning chain correctness and faithfulness. Results show a steep drop in ToM capabilities under task perturbation, questioning the notion of robust ToM in LLMs. While CoT prompting improves ToM performance overall, it surprisingly degrades accuracy for some perturbation classes. The study thus contributes to the debate on LLMs' ToM capabilities and highlights the importance of applying CoT selectively. The findings have significant implications for the development and evaluation of LLMs in applications such as natural language processing, decision-making, and human-computer interaction.
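The paper proposes metrics for reasoning chain correctness and for how faithful final answers are to the generated CoT trace; the exact definitions are given in the paper. The sketch below is only one plausible reading of those two ideas, and the function names, the substring-based step matching, and the 'Answer:' extraction heuristic are all assumptions.

```python
import re

def chain_correctness(generated_steps: list[str], valid_chains: list[list[str]]) -> float:
    """Coverage of the best-matching valid reasoning chain by the generated reasoning.
    A crude proxy: a valid step counts as present if it appears (lowercased) as a
    substring of the concatenated generated steps."""
    generated = " ".join(step.lower() for step in generated_steps)
    best = 0.0
    for chain in valid_chains:
        covered = sum(1 for step in chain if step.lower() in generated)
        best = max(best, covered / len(chain))
    return best

def answer_from_trace(cot_output: str) -> str | None:
    """Extract the answer the model itself states at the end of its reasoning trace."""
    match = re.search(r"answer:\s*(.+)", cot_output, flags=re.IGNORECASE)
    return match.group(1).strip().lower() if match else None

def is_faithful(cot_output: str, final_answer: str) -> bool:
    """Treat the final answer as faithful if it matches the conclusion of the model's own trace."""
    concluded = answer_from_trace(cot_output)
    return concluded is not None and concluded == final_answer.strip().lower()
```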

Key Points

  • ToM capabilities of LLMs are investigated using perturbed false-belief tasks and CoT prompting.
  • Authors introduce a handcrafted ToM dataset and propose metrics for evaluating reasoning chain correctness and faithfulness.
  • Results show a steep drop in ToM capabilities under task perturbation, challenging the notion of robust ToM in LLMs.
  • CoT prompting improves ToM performance overall, but degrades accuracy for some perturbation classes, so selective application is advisable (see the routing sketch after this list).
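Because CoT helps overall but hurts on some perturbation classes, one way to operationalize "selective application" is to decide per perturbation class whether to use CoT, based on held-out measurements. The routing sketch below, which reuses the hypothetical direct_prompt and cot_prompt helpers from the earlier sketch, illustrates that idea; the validation accuracies are invented placeholders, not results from the paper.

```python
# Per-class decision on whether to use CoT prompting. The numbers below are
# invented placeholders; in practice they would be measured on a held-out split,
# reflecting the finding that CoT helps overall but hurts some perturbation classes.
VAL_ACCURACY = {
    # perturbation_class: (accuracy_direct, accuracy_cot)
    "none":               (0.62, 0.81),
    "witnessed_transfer": (0.70, 0.55),
    "relocation_back":    (0.48, 0.66),
}

def use_cot(perturbation: str) -> bool:
    """Apply CoT only where it beat direct prompting on validation data."""
    direct_acc, cot_acc = VAL_ACCURACY.get(perturbation, (0.0, 0.0))
    return cot_acc > direct_acc

def build_prompt(item) -> str:
    """Route each item to the prompting strategy expected to work best for its class."""
    return cot_prompt(item) if use_cot(item.perturbation) else direct_prompt(item)
```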

Merits

Strength in methodology

The study introduces a novel, handcrafted ToM dataset, providing a richly annotated resource for evaluating ToM capabilities in LLMs.

Insightful findings

The study reveals a steep drop in ToM capabilities under task perturbation, highlighting the limitations of current LLMs in modeling internal states of others.

Practical implications

The results bear directly on how LLMs should be developed and evaluated for applications such as natural language processing, decision-making, and human-computer interaction.

Demerits

Limitation in generalizability

The findings may not generalize to other ToM tasks or LLM architectures, which could limit the work's broader impact.

Methodological complexity

The methodology, which combines CoT prompting with newly proposed metrics for reasoning chain correctness and faithfulness, is relatively involved and may be challenging to replicate.

Potential for over-reliance on CoT

Positioning CoT prompting as the main lever for improving ToM performance risks encouraging over-reliance on this technique, which could limit the development of more robust ToM capabilities in LLMs.

Expert Commentary

This study makes a meaningful contribution to the debate on LLMs' ToM capabilities, highlighting the limitations of current LLMs in modeling the internal states of others. The richly annotated ToM dataset and the proposed correctness and faithfulness metrics provide a valuable resource for evaluating ToM in LLMs. At the same time, the findings raise concerns about over-reliance on CoT prompting and about how far the results generalize to other tasks and architectures. Overall, the implications for developing and evaluating LLMs are substantial, pointing toward the need for more robust ToM capabilities.

Recommendations

  • Future studies should investigate the development of more robust ToM capabilities in LLMs, including the use of novel architectures and training methods.
  • The use of CoT prompting and metrics for evaluating reasoning chain correctness and faithfulness should be explored in more detail, particularly in the context of human-LM interaction systems.

Sources

  • Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek. Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models. arXiv:2602.22072v1.