From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

arXiv:2603.02775v1 Announce Type: new Abstract: Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

Executive Summary

The article introduces KMP-Bench, a comprehensive benchmark for evaluating the pedagogical intelligence of Large Language Models (LLMs) in K-8 mathematical tutoring. The benchmark assesses LLMs from two complementary perspectives: holistic pedagogical capability (KMP-Dialogue, scored against six core principles over multi-turn dialogues) and granular tutoring skills (KMP-Skills, covering multi-turn problem-solving, error detection and correction, and problem generation). The evaluation reveals that leading LLMs excel at tasks with verifiable solutions but struggle to apply nuanced pedagogical principles. Fine-tuning on KMP-Pile, a companion 150K-dialogue dataset, substantially improves performance on KMP-Bench, highlighting the value of pedagogically rich training data.

Key Points

  • Introduction of KMP-Bench, a comprehensive benchmark for evaluating LLMs in mathematical tutoring
  • Evaluation of LLMs from two complementary perspectives: holistic pedagogical capabilities and granular tutoring abilities
  • Disparity in LLM performance: excelling at tasks with verifiable solutions but struggling with nuanced pedagogical principles
  • Release of KMP-Pile, a large-scale (150K) dialogue dataset; models fine-tuned on it show substantial improvement on KMP-Bench

Merits

Comprehensive Evaluation Framework

KMP-Bench provides a thorough assessment of LLMs' pedagogical intelligence, covering multiple aspects of mathematical tutoring.

Demerits

Limited Generalizability

The benchmark's exclusive focus on K-8 mathematics may limit its applicability to other subjects, higher grade levels, or non-mathematical tutoring scenarios.

Expert Commentary

The article contributes meaningfully to the ongoing discussion of LLMs in education. KMP-Bench and KMP-Pile together address a critical gap: existing evaluations of AI tutoring systems rely on simplistic metrics or narrow scenarios and miss multi-turn teaching effectiveness. The observed disparity, with strong performance on verifiable tasks but weak adherence to pedagogical principles, underscores the need for pedagogically informed training data rather than problem-solving ability alone. As AI-powered education evolves, comprehensive evaluation frameworks like KMP-Bench will be essential for measuring whether these systems actually teach, not merely answer.

Recommendations

  • Future research should focus on expanding KMP-Bench to cover a broader range of educational domains and age groups.
  • Developers of AI-powered educational tools should prioritize the creation of pedagogically-rich training data to improve the performance of their systems.