A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities

Faiz Ghifari Haznitrama, Faeyza Rishad Ardi, Alice Oh

arXiv:2603.02540v1. Abstract: Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with tasks that are trivial for humans. This is because current benchmarks focus on task completion, failing to probe the foundational cognitive abilities that underlie these behaviors. We address this by introducing the NeuroCognition benchmark, grounded in three adapted neuropsychological tests: Raven's Progressive Matrices (abstract relational reasoning), Spatial Working Memory (maintenance and systematic search), and the Wisconsin Card Sorting Test (cognitive flexibility). Our evaluation reveals that while models perform strongly on text, their performance degrades on images and with increased complexity. Furthermore, we observe that complex reasoning is not universally beneficial, whereas simple, human-like strategies yield partial gains. We also find that NeuroCognition correlates positively with standard general-capability benchmarks, while still measuring distinct cognitive abilities beyond them. Overall, NeuroCognition highlights where current LLMs align with human-like intelligence and where they lack core adaptive cognition, showing the potential to serve as a verifiable, scalable source for improving LLMs.
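
To make the "general factor" claim concrete, here is a minimal sketch of estimating a single latent capability factor from a models-by-benchmarks score matrix. The matrix shape (156 models, 10 benchmarks) mirrors the paper's setup, but the scores, loadings, and noise level below are synthetic stand-ins, not the paper's data.

```python
# Minimal sketch: one-factor analysis of a models x benchmarks score matrix.
# All data here are simulated; only the shape (156 x 10) mirrors the paper.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_models, n_benchmarks = 156, 10

# Simulate scores driven by a single latent ability plus noise.
ability = rng.normal(size=(n_models, 1))                  # latent "g" per model
loadings = rng.uniform(0.5, 1.0, size=(1, n_benchmarks))  # how strongly each benchmark taps it
scores = ability @ loadings + 0.3 * rng.normal(size=(n_models, n_benchmarks))

fa = FactorAnalysis(n_components=1).fit(scores)

# High, roughly uniform loadings on one factor are the signature of a general factor.
print("factor loadings:", fa.components_.round(2))
```

If a single factor carries high loadings on every benchmark and explains most of the shared variance, the score matrix is consistent with a unified general factor of capability.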

Executive Summary

This article introduces NeuroCognition, a neuropsychologically grounded benchmark for evaluating the cognitive abilities of large language models (LLMs). The authors argue that current benchmarks measure task completion rather than the foundational cognitive abilities beneath it, which is why LLMs can score well on leaderboards yet fail tasks that are trivial for humans. NeuroCognition addresses this by adapting three neuropsychological tests: Raven's Progressive Matrices, Spatial Working Memory, and the Wisconsin Card Sorting Test. The evaluation shows that performance degrades on image-based tasks and as task complexity increases, exposing gaps in core adaptive cognition. The authors suggest that NeuroCognition can serve as a verifiable, scalable source for improving LLMs, contributing to the ongoing debate on the limitations of LLMs and the need for more comprehensive evaluations.
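
To make the adapted-test idea concrete, below is a deliberately simplified, hypothetical sketch of a Wisconsin Card Sorting Test loop of the kind an LLM could be run through: a hidden sorting rule shifts without warning, and the subject must detect the shift from feedback alone. NeuroCognition's actual prompts, stimuli, and scoring are not reproduced here; in this toy version the policy names the rule dimension directly instead of sorting cards onto key-card piles.

```python
# Hypothetical, simplified WCST-style loop for probing cognitive flexibility.
# Not the paper's protocol: the policy guesses the rule dimension directly.
import random

COLORS  = ["red", "green", "blue", "yellow"]
SHAPES  = ["circle", "triangle", "star", "cross"]
NUMBERS = [1, 2, 3, 4]
RULES   = ["color", "shape", "number"]

def draw_card():
    return {"color": random.choice(COLORS),
            "shape": random.choice(SHAPES),
            "number": random.choice(NUMBERS)}

def run_wcst(respond, n_trials=60, shift_after=8):
    """Score a policy respond(card, history) that returns a guessed rule.
    The hidden rule shifts after `shift_after` consecutive correct answers,
    so sustained success requires noticing and adapting to the shift."""
    rule, streak, correct, history = random.choice(RULES), 0, 0, []
    for _ in range(n_trials):
        card = draw_card()
        guess = respond(card, history)
        ok = (guess == rule)
        history.append((card, guess, ok))   # the feedback a model would see
        correct += ok
        streak = streak + 1 if ok else 0
        if streak >= shift_after:           # unannounced rule shift
            rule = random.choice([r for r in RULES if r != rule])
            streak = 0
    return correct / n_trials

# Toy baseline: keep a rewarded guess, abandon an unrewarded one.
def win_stay_lose_shift(card, history):
    if history and history[-1][2]:
        return history[-1][1]
    tried = history[-1][1] if history else None
    return random.choice([r for r in RULES if r != tried])

print(f"accuracy: {run_wcst(win_stay_lose_shift):.2f}")
```

The win-stay/lose-shift baseline is a deliberately simple, human-like strategy, chosen here to echo the paper's observation that such strategies can yield partial gains: it commits to a rewarded rule and recovers quickly after each unannounced shift.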

Key Points

  • The NeuroCognition benchmark is introduced to evaluate LLMs' cognitive abilities
  • Current benchmarks focus on task completion and fail to probe foundational cognitive abilities
  • LLMs perform strongly on text-based tasks, but performance degrades on images and with increased complexity
  • NeuroCognition correlates positively with standard general-capability benchmarks while measuring distinct abilities beyond them

Merits

Strength

The study grounds its evaluation in established neuropsychological tests, which gives the results a clear interpretation in terms of specific cognitive abilities (relational reasoning, working memory, cognitive flexibility) and points to concrete targets for improving LLMs.

Demerits

Limitation

The study's reliance on adapted neuropsychological tests may limit the generalizability of the findings to more complex or real-world tasks.

Expert Commentary

This study makes a useful contribution to the debate over the limits of current LLM evaluation. By introducing NeuroCognition, the authors provide a concrete framework for probing the cognitive abilities that underlie LLM behavior rather than only end-task performance. The reliance on adapted neuropsychological tests may limit generalizability to messier real-world settings, but the findings make a persuasive case for neuropsychologically grounded evaluation as a complement to standard benchmarks. The results also matter beyond research: policymakers and regulatory bodies that lean on benchmark scores should account for the limitations of current evaluations and support the development of more comprehensive assessments.

Recommendations

  • Future studies should explore the adaptation of NeuroCognition to more complex or real-world tasks to enhance its generalizability and applicability.
  • Researchers should consider incorporating multiple evaluation frameworks, including NeuroCognition, to provide a more comprehensive understanding of LLMs' cognitive abilities.

Sources

  • arXiv:2603.02540v1