ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation
arXiv:2603.13251v1 Announce Type: new Abstract: Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We introduce ManiBench, a specialized benchmark evaluating LLM performance in generating Manim CE code, where temporal fidelity and version-aware API correctness are critical. ManiBench targets two key failure modes: Syntactic Hallucinations (valid Python referencing non-existent or deprecated Manim APIs) and Visual-Logic Drift (generated visuals diverging from intended mathematical logic through timing errors or missing causal relationships). The benchmark comprises 150-200 problems across five difficulty levels spanning calculus, linear algebra, probability, topology, and AI, grounded in analysis of 3Blue1Brown's ManimGL source (53,000 lines, 143 scene classes). Evaluation uses a four-tier framework measuring Executability, Version-Conflict Error Rate, Alignment Score, and Coverage Score. An open-source framework automates evaluation across multiple models and prompting strategies. Code, data, and the benchmark suite are available at https://github.com/nabin2004/ManiBench, and the dataset is hosted at https://huggingface.co/datasets/nabin2004/ManiBench.
Executive Summary
This article introduces ManiBench, a benchmark designed to evaluate the performance of Large Language Models (LLMs) in generating Manim code. Manim is a Python library for creating animated mathematical visualizations and is widely used in mathematics education. ManiBench specifically targets Syntactic Hallucinations (valid Python that references non-existent or deprecated Manim APIs) and Visual-Logic Drift (generated visuals that diverge from the intended mathematical logic), two failure modes that general-purpose code benchmarks such as HumanEval and MBPP do not capture. The benchmark comprises 150-200 problems across five difficulty levels, grounded in an analysis of 3Blue1Brown's ManimGL source. A four-tier evaluation framework measures Executability, Version-Conflict Error Rate, Alignment Score, and Coverage Score, and an open-source framework automates evaluation across multiple models and prompting strategies. This work has significant implications for AI-assisted education and code generation, highlighting the need for specialized benchmarks that address the unique challenges of generating accurate and effective educational materials.
Key Points
- ▸ ManiBench is a novel benchmark for evaluating LLMs in generating Manim code.
- ▸ The benchmark targets Syntactic Hallucinations and Visual-Logic Drift, common failure modes in code generation.
- ▸ ManiBench comprises 150-200 problems across five difficulty levels, grounded in the analysis of 3Blue1Brown's ManimGL source.
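To make the Syntactic Hallucination failure mode concrete, the sketch below shows a candidate Manim scene that is valid Python but uses `ShowCreation`, a ManimGL-era animation name whose Manim CE replacement is `Create`, and flags it with a simple AST scan. The denylist, the helper name `find_hallucinated_calls`, and the static-check approach are illustrative assumptions for this article, not part of ManiBench's actual implementation.

```python
import ast

# Illustrative denylist: ManimGL-era names that no longer exist in Manim CE,
# mapped to their current replacements. ManiBench's real API checks are
# presumably far more extensive.
DEPRECATED_APIS = {"ShowCreation": "Create", "TextMobject": "Text"}

def find_hallucinated_calls(source: str) -> list[str]:
    """Return deprecated/non-existent API names referenced by the code.

    The code parses cleanly -- the hallucination is semantic, not syntactic,
    which is exactly why AST-level (or execution-level) checks are needed.
    """
    tree = ast.parse(source)
    return sorted(
        {node.id for node in ast.walk(tree)
         if isinstance(node, ast.Name) and node.id in DEPRECATED_APIS}
    )

candidate = """
from manim import Scene, Circle, ShowCreation

class Demo(Scene):
    def construct(self):
        self.play(ShowCreation(Circle()))  # valid Python, invalid Manim CE
"""

print(find_hallucinated_calls(candidate))  # -> ['ShowCreation']
```

Because the candidate is only parsed, never executed, this check runs without a Manim installation; a full Executability check would still need to render the scene.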
Merits
Comprehensive Evaluation Framework
ManiBench's four-tier evaluation framework provides a comprehensive assessment of LLM performance, including Executability, Version-Conflict Error Rate, Alignment Score, and Coverage Score.
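As a rough sketch of how the first two tiers might be computed, the snippet below approximates Executability with a compile check and Version-Conflict Error Rate as the fraction of samples referencing a known-deprecated name. The function names, the pass/fail criteria, and the substring-based conflict check are assumptions for illustration; ManiBench's actual scoring (which would need to render scenes in a real Manim CE environment) is not reproduced here.

```python
def is_executable(source: str) -> bool:
    """Tier 1 (Executability), approximated: does the code at least compile?
    A real harness would execute/render in a sandboxed Manim CE install."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def version_conflict_rate(samples: list[str], deprecated: set[str]) -> float:
    """Tier 2 (Version-Conflict Error Rate), approximated: fraction of
    samples that reference any known-deprecated API name."""
    if not samples:
        return 0.0
    hits = sum(any(name in s for name in deprecated) for s in samples)
    return hits / len(samples)

samples = [
    "self.play(Create(c))",        # compiles, no conflict
    "self.play(ShowCreation(c))",  # compiles, but deprecated API
    "self.play(",                  # syntax error
]
print(sum(is_executable(s) for s in samples))            # 2 of 3 compile
print(version_conflict_rate(samples, {"ShowCreation"}))  # 1 of 3 conflicts
```

The Alignment and Coverage tiers are harder to sketch, since they require comparing rendered frames against the intended mathematical content rather than static code properties.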
Open-Source and Automatable
The open-source framework automates evaluation across multiple models and prompting strategies, facilitating easy replication and extension of the benchmark.
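The overall shape of such a harness can be sketched as a grid sweep over (model, prompting strategy) pairs. The `generate` and `score` callables and every name below are hypothetical stand-ins, not ManiBench's API.

```python
from itertools import product

def run_benchmark(models, strategies, problems, generate, score):
    """Average a per-sample score for every (model, strategy) pair.

    generate(model, strategy, problem) -> candidate code string
    score(code) -> float in [0, 1], e.g. an executability indicator
    """
    results = {}
    for model, strategy in product(models, strategies):
        codes = [generate(model, strategy, p) for p in problems]
        results[(model, strategy)] = sum(map(score, codes)) / len(problems)
    return results

# Stub run with toy callables, just to show the harness shape:
grid = run_benchmark(
    models=["model-a", "model-b"],
    strategies=["zero-shot", "few-shot"],
    problems=["p1", "p2"],
    generate=lambda m, s, p: f"# {m}/{s}/{p}",
    score=lambda code: 1.0,
)
print(grid[("model-a", "few-shot")])  # -> 1.0
```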
Domain-Specific
ManiBench is specifically designed for the Manim domain, addressing the unique challenges of generating accurate and effective educational materials in mathematics education.
Demerits
Limited Scope
ManiBench is currently limited to the Manim domain and may not be applicable to other code generation tasks.
Dependence on Specific Dataset
ManiBench's evaluation framework is grounded in the analysis of 3Blue1Brown's ManimGL source, which may limit its generalizability to other datasets or domains.
Expert Commentary
The development of ManiBench represents a meaningful step forward in evaluating LLMs for code generation. By targeting Syntactic Hallucinations and Visual-Logic Drift, it measures failure modes that general-purpose benchmarks overlook, and its comprehensive evaluation framework and open-source tooling make the benchmark straightforward to replicate and extend. However, its limited scope and its dependence on a single source corpus are notable limitations that warrant further consideration. As AI-assisted education continues to evolve, ManiBench serves as a reminder that LLMs in this domain require ongoing, domain-specific evaluation and improvement.
Recommendations
- ✓ Future research should focus on extending ManiBench to other code generation tasks and domains, with a particular emphasis on evaluating LLMs in high-stakes applications, such as education and healthcare.
- ✓ Developers and researchers should prioritize the development of specialized benchmarks for code generation, addressing the unique challenges and failure modes associated with each domain and application.