Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian · March 24, 2026

#cs.CL #cs.AI

arXiv:2603.20209v1 Abstract: Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales, an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://kidgym.github.io/KidGym-Website/.

Executive Summary

This article introduces KidGym, a 2D grid-based benchmark for assessing Multimodal Large Language Models (MLLMs). Inspired by the Wechsler Intelligence Scales, KidGym evaluates five capabilities: Execution, Perception Reasoning, Learning, Memory, and Planning. The benchmark consists of 12 unique tasks, each targeting at least one core capability, with randomly generated layouts, and is designed to be fully user-customizable and extensible. The authors evaluate state-of-the-art MLLMs on KidGym and report that the results reveal several limitations of current models. The task design is intended to mirror the stages of children's cognitive growth.

Key Points

▸ KidGym is a 2D grid-based benchmark for assessing MLLM capabilities
▸ The benchmark evaluates five essential capabilities: Execution, Perception Reasoning, Learning, Memory, and Planning
▸ KidGym consists of 12 unique tasks, each targeting at least one core capability
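
Because each task targets at least one capability, a task's result can count toward more than one capability score. The sketch below shows one plausible way to aggregate such results. It is not taken from KidGym: the task names, the mapping, the success rates, and the choice of a simple mean are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical task-to-capability mapping. The task names are placeholders,
# not KidGym's actual tasks; only the capability names come from the abstract.
TASK_CAPABILITIES = {
    "task_a": ["Execution"],
    "task_b": ["Memory", "Planning"],
    "task_c": ["Learning", "Planning"],
}


def capability_scores(task_success, task_capabilities):
    """Average each task's success rate into every capability it targets."""
    buckets = defaultdict(list)
    for task, capabilities in task_capabilities.items():
        for capability in capabilities:
            buckets[capability].append(task_success[task])
    # Simple unweighted mean per capability (an assumed aggregation rule).
    return {cap: sum(vals) / len(vals) for cap, vals in buckets.items()}


# Made-up success rates, for illustration only.
example_success = {"task_a": 0.9, "task_b": 0.5, "task_c": 0.3}
scores = capability_scores(example_success, TASK_CAPABILITIES)

if __name__ == "__main__":
    for capability, score in sorted(scores.items()):
        print(f"{capability}: {score:.2f}")
```

A mean is only one option. A benchmark could equally report per-task results alone, or weight tasks by difficulty.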

Merits

Strength

KidGym covers five capabilities across 12 tasks in a single benchmark, giving researchers a structured way to assess model performance

Flexibility

KidGym is designed to be fully user-customizable and extensible, enabling researchers to create new evaluation scenarios and adjust difficulty levels
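
To make "randomly generated layouts" and "adjustable difficulty levels" concrete, here is a minimal sketch of a seeded grid generator. It does not use KidGym's API, which the abstract does not describe. The function `generate_layout`, the `DIFFICULTY` presets, and the idea that larger grids with more objects are harder are all hypothetical.

```python
import random


def generate_layout(size, num_objects, seed=None):
    """Return a size x size grid with num_objects placed on distinct random cells.

    Empty cells hold "."; object cells hold a label such as "obj0".
    All names here are hypothetical and not taken from KidGym.
    """
    if num_objects > size * size:
        raise ValueError("more objects than cells")
    rng = random.Random(seed)  # local RNG, so a given seed reproduces a layout
    cells = [(r, c) for r in range(size) for c in range(size)]
    chosen = rng.sample(cells, num_objects)  # sampling without replacement
    grid = [["." for _ in range(size)] for _ in range(size)]
    for i, (r, c) in enumerate(chosen):
        grid[r][c] = f"obj{i}"
    return grid


# Hypothetical difficulty presets: larger grids and more objects are assumed harder.
DIFFICULTY = {
    "easy": {"size": 4, "num_objects": 2},
    "hard": {"size": 8, "num_objects": 6},
}

if __name__ == "__main__":
    for row in generate_layout(seed=0, **DIFFICULTY["easy"]):
        print(" ".join(f"{cell:>4}" for cell in row))
```

Fixing the seed keeps an evaluation run reproducible. Varying it across episodes means a model cannot succeed by memorizing one fixed layout.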

Demerits

Limitation

KidGym's 2D grid-based design may not be representative of real-world scenarios, which often involve more complex and dynamic environments

Scope

KidGym focuses primarily on evaluating MLLM capabilities in a controlled environment, and its applicability to real-world scenarios is unclear

Expert Commentary

KidGym offers a structured way to evaluate MLLMs by decomposing competence into five testable capabilities, following the approach of the Wechsler Intelligence Scales for children. Its randomly generated layouts and adjustable difficulty levels support more robust evaluation and let researchers extend the benchmark as models improve. However, the 2D grid-based design and controlled environments may limit how well results transfer to real-world scenarios. Despite these limitations, KidGym is a valuable contribution to MLLM evaluation, and it could become a useful reference point as researchers continue to develop and refine these models.

Recommendations

✓ Researchers should explore the applicability of KidGym to real-world scenarios and applications
✓ Developers should consider incorporating more dynamic and complex environments into KidGym's design

Sources

Original: arXiv - cs.CL
