GenAI-LA: Generative AI and Learning Analytics Workshop (LAK 2026), April 27--May 1, 2026, Bergen, Norway
arXiv:2602.15531v1. Abstract: This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.
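The dataset structure the abstract describes (for each of 139 questions, one human-teacher explanation plus six generated by LLM-simulated teacher roles, each annotated with binary labels across five risk dimensions) can be sketched as a record schema. The field names below are illustrative assumptions for exposition, not the actual schema of the released dataset.

```python
from dataclasses import dataclass, field

# Hypothetical record schema for one EduEVAL-DB explanation.
# Field names are assumptions; the released dataset may differ.
@dataclass
class ExplanationRecord:
    question_id: str   # one of the 139 ScienceQA-derived questions
    source: str        # "human_teacher" or one of six simulated teacher roles
    explanation: str
    # Binary labels for the rubric's five risk dimensions (True = risk present)
    risk_labels: dict = field(default_factory=lambda: {
        "factual_correctness": False,
        "depth_and_completeness": False,
        "focus_and_relevance": False,
        "student_level_appropriateness": False,
        "ideological_bias": False,
    })

record = ExplanationRecord(
    question_id="sqa-0042",
    source="human_teacher",
    explanation="Plants make their own food through photosynthesis...",
)
print(sum(record.risk_labels.values()))  # number of flagged risks -> 0
```

With one human and six simulated explanations per question, 139 questions yield the reported 854 explanations in total (in this sketch, 854 such records).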
Executive Summary
This article introduces EduEVAL-DB, a dataset designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations for 139 questions drawn from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels; each question pairs one human-teacher explanation with six generated by LLM-simulated teacher roles. A pedagogical risk rubric operationalizes five complementary risk dimensions, and preliminary validation experiments assess the suitability of EduEVAL-DB for evaluation. The findings suggest that supervised fine-tuning on EduEVAL-DB can support pedagogical risk detection with models deployable on consumer hardware.
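The fine-tuning result summarized above implies converting each labeled explanation into a supervised training example for binary risk detection. A minimal sketch of that conversion follows; the prompt template, label encoding, and function name are assumptions for illustration, not the authors' actual training format.

```python
# Hypothetical conversion of a labeled explanation into a prompt/completion
# pair for supervised fine-tuning. Template and encoding are assumptions.
RISK_DIMENSIONS = [
    "factual_correctness",
    "depth_and_completeness",
    "focus_and_relevance",
    "student_level_appropriateness",
    "ideological_bias",
]

def to_sft_example(question: str, explanation: str, labels: dict) -> dict:
    prompt = (
        "Question: " + question + "\n"
        "Explanation: " + explanation + "\n"
        "For each risk dimension, answer 'risk' or 'ok': "
        + ", ".join(RISK_DIMENSIONS)
    )
    # Encode the five binary labels as a single completion string.
    completion = "; ".join(
        f"{dim}: {'risk' if labels.get(dim, False) else 'ok'}"
        for dim in RISK_DIMENSIONS
    )
    return {"prompt": prompt, "completion": completion}

ex = to_sft_example(
    "Why does ice float on water?",
    "Ice floats because solid water is denser than liquid water.",
    {"factual_correctness": True},  # factually wrong claim -> risk flagged
)
print(ex["completion"])
# -> factual_correctness: risk; depth_and_completeness: ok;
#    focus_and_relevance: ok; student_level_appropriateness: ok;
#    ideological_bias: ok
```

Pairs in this shape could then be fed to a standard SFT pipeline for a small local model such as Llama 3.1 8B.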
Key Points
- ▸ Introduction of EduEVAL-DB, a novel dataset for evaluating and training automatic pedagogical evaluators and AI tutors
- ▸ Proposed pedagogical risk rubric with five complementary risk dimensions
- ▸ Preliminary validation experiments demonstrating the suitability of EduEVAL-DB for evaluation
Merits
Strength in Addressing Pedagogical Risks
EduEVAL-DB addresses a critical gap in the development of AI-powered educational tools by providing a labeled dataset for evaluating and training automatic pedagogical evaluators and AI tutors. The proposed pedagogical risk rubric operationalizes five complementary risk dimensions (factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias), supporting more informed decisions about deploying such tools in educational settings.
Demerits
Limited Generalizability
The dataset covers only 139 questions from a curated subset of the ScienceQA benchmark, which may not be representative of the broader educational landscape. Further research is needed to assess how well EduEVAL-DB generalizes to other subjects and grade levels.
Expert Commentary
The introduction of EduEVAL-DB represents a meaningful step forward in the development of AI-powered educational tools. By providing a labeled dataset for evaluating and training automatic pedagogical evaluators and AI tutors, EduEVAL-DB addresses a critical gap in the field. However, the dataset's limited scope raises questions about the broader applicability of the findings. As AI-driven educational tools continue to evolve, ongoing research will be needed to ensure that these tools are designed and deployed with pedagogical risks and the variety of real teacher roles in mind.
Recommendations
- ✓ Future research should focus on expanding the scope of EduEVAL-DB to include a broader range of subjects and grade levels.
- ✓ Developers of AI-powered educational tools should prioritize the integration of pedagogical risk assessments and evaluations into their products.