DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

arXiv:2602.13318v1 Announce Type: new Abstract: Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper-to-slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan-heisler/DeckBench.

Executive Summary

The article introduces DECKBench, a novel benchmark for evaluating multi-agent frameworks designed for academic slide generation and editing. DECKBench addresses the limitations of existing benchmarks by providing a comprehensive evaluation protocol that assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. The framework is built on a curated dataset of paper-to-slide pairs augmented with simulated editing instructions. The authors also present a modular multi-agent baseline system that decomposes the task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. The results highlight the strengths and failure modes of current systems, offering actionable insights for improvement. This work establishes a standardized foundation for reproducible and comparable evaluation in academic presentation generation and editing.
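The baseline's stage decomposition can be sketched as a simple pipeline. This is an illustrative outline, not the authors' implementation: the stage names follow the paper's decomposition (parsing and summarization, slide planning, HTML creation, iterative editing), but every function body, the `Slide` type, and the edit-instruction format are placeholder assumptions.

```python
from dataclasses import dataclass

@dataclass
class Slide:
    title: str
    bullets: list

def parse_and_summarize(paper_text: str) -> list:
    """Split the paper into sections and keep the first line of each as a summary."""
    sections = [s.strip() for s in paper_text.split("\n\n") if s.strip()]
    return [s.split("\n")[0] for s in sections]

def plan_slides(summaries: list) -> list:
    """Map each section summary to one planned slide."""
    return [Slide(title=s, bullets=[s]) for s in summaries]

def render_html(slides: list) -> str:
    """Render the planned deck as minimal HTML, one <section> per slide."""
    body = "".join(
        f"<section><h2>{s.title}</h2><ul>"
        + "".join(f"<li>{b}</li>" for b in s.bullets)
        + "</ul></section>"
        for s in slides
    )
    return f"<html><body>{body}</body></html>"

def apply_edit(slides: list, instruction: dict) -> list:
    """Apply one simulated editing instruction (here: retitle a slide)."""
    slides[instruction["slide"]].title = instruction["title"]
    return slides

# End-to-end: parse -> plan -> edit -> render.
paper = "Intro\nWe study X.\n\nMethod\nWe do Y.\n\nResults\nY works."
slides = plan_slides(parse_and_summarize(paper))
slides = apply_edit(slides, {"slide": 0, "title": "Introduction"})
html = render_html(slides)
```

The point of the decomposition is that each stage has a narrow contract, so a failure (e.g., an edit applied to the wrong slide) can be localized to one stage rather than debugged end to end.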

Key Points

  • Introduction of DECKBench, a benchmark for multi-agent slide generation and editing.
  • Comprehensive evaluation protocol assessing fidelity, coherence, layout quality, and instruction following.
  • Modular multi-agent baseline system for slide generation and editing.
  • Experimental results highlighting strengths and failure modes.
  • Standardized foundation for reproducible and comparable evaluation.

Merits

Comprehensive Evaluation Protocol

DECKBench provides a thorough evaluation protocol that assesses multiple dimensions of slide generation and editing, including fidelity, coherence, layout quality, and instruction following. This holistic approach supports a more accurate and nuanced assessment of multi-agent systems than single-metric summarization benchmarks.

Modular Multi-Agent System

The proposed baseline system decomposes the task into manageable sub-tasks, allowing for a clear and structured approach to slide generation and editing. This modularity facilitates easier debugging, improvement, and scalability of the system.

Standardized Benchmark

DECKBench establishes a standardized foundation for evaluating multi-agent slide generation and editing systems. This standardization promotes reproducibility and comparability, fostering advancements in the field.

Demerits

Limited Dataset

The dataset used in DECKBench is curated and augmented with simulated editing instructions. While this approach provides a realistic evaluation environment, it may not fully capture the diversity and complexity of real-world academic slide generation and editing tasks.

Baseline System Limitations

The modular multi-agent baseline is intentionally simple, which makes it a clear reference point but not a representative of state-of-the-art systems. Results anchored to this baseline may therefore understate what stronger systems can achieve, limiting the benchmark's usefulness for differentiating high-performance systems.

Potential Bias in Evaluation

The evaluation protocol, while comprehensive, may still be subject to biases inherent in the design of the benchmark and the selection of evaluation metrics. This could affect the objectivity and fairness of the assessment.

Expert Commentary

DECKBench represents a meaningful advance in automated academic slide generation and editing. By pairing a comprehensive evaluation protocol with a modular multi-agent baseline, the authors address critical gaps in existing benchmarks. The focus on fidelity, coherence, layout quality, and multi-turn instruction following ensures a thorough assessment of multi-agent systems. That said, the curated dataset and the potential for bias in the chosen metrics are notable limitations that could affect how well results generalize beyond the benchmark. Even so, DECKBench establishes a standardized foundation for reproducible and comparable evaluation. The practical implications are substantial: it gives researchers and developers a concrete tool for improving automated presentation systems, and standardized evaluation protocols can inform decisions about adopting such systems in educational and professional settings. Overall, DECKBench is a valuable contribution that offers actionable insights and should spur further research in multi-agent slide generation and editing.

Recommendations

  • Expand the dataset to include a more diverse range of academic papers and editing instructions to better capture real-world scenarios.
  • Develop more sophisticated multi-agent systems to serve as benchmarks, ensuring that the evaluation protocol can assess high-performance systems effectively.
