From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

arXiv:2603.02775v1 Announce Type: new Abstract: Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

Executive Summary

The article introduces KMP-Bench, a comprehensive benchmark for evaluating the pedagogical intelligence of Large Language Models (LLMs) in K-8 mathematical tutoring. The benchmark assesses LLMs from two complementary perspectives: holistic pedagogical capability (KMP-Dialogue, scored against six core principles over multi-turn dialogues) and granular tutoring skills (KMP-Skills, covering multi-turn problem-solving, error detection and correction, and problem generation). The evaluation reveals that leading LLMs excel at tasks with verifiable solutions but struggle to apply nuanced pedagogical principles. Fine-tuning on KMP-Pile, a companion 150K-dialogue dataset, substantially improves performance on KMP-Bench, highlighting the value of pedagogically rich training data.

Key Points

  • Introduction of KMP-Bench, a comprehensive benchmark for evaluating LLMs in mathematical tutoring
  • Evaluation of LLMs from two complementary perspectives: holistic pedagogical capabilities and granular tutoring abilities
  • Disparity in LLM performance: excelling at tasks with verifiable solutions but struggling with nuanced pedagogical principles
  • Release of KMP-Pile, a large-scale (150K) dialogue dataset; models fine-tuned on it show substantial improvement on KMP-Bench

Merits

Comprehensive Evaluation Framework

KMP-Bench provides a thorough assessment of LLMs' pedagogical intelligence, covering multiple aspects of mathematical tutoring.

Demerits

Limited Generalizability

The benchmark's exclusive focus on K-8 mathematics may limit its applicability to other subjects, higher grade levels, or non-mathematical tutoring scenarios.

Expert Commentary

The article contributes meaningfully to the ongoing discussion of LLMs in education. KMP-Bench and KMP-Pile together address a critical gap: existing evaluations of AI tutoring systems rely on simplistic metrics or narrow scenarios and miss multi-turn teaching effectiveness. The observed disparity, with strong performance on verifiable tasks but weak adherence to pedagogical principles, underscores the need for pedagogically informed training data rather than problem-solving ability alone. As AI-powered education evolves, comprehensive evaluation frameworks like KMP-Bench will be essential for measuring whether these systems actually teach, not merely answer.

Recommendations

  • Future research should focus on expanding KMP-Bench to cover a broader range of educational domains and age groups.
  • Developers of AI-powered educational tools should prioritize the creation of pedagogically-rich training data to improve the performance of their systems.