Academic

EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

arXiv:2604.05557v1 Abstract: Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed by existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration, and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the evidence accumulated in memory to answer objective questions that require cross-paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows and providing an evaluation platform for verifiable and reproducible research agents.

Executive Summary

EpiBench presents a groundbreaking benchmark for evaluating multimodal agents in multi-turn scientific research workflows, addressing a critical gap in existing evaluation frameworks. Unlike prior benchmarks that focus on static or single-step tasks, EpiBench simulates dynamic research processes requiring agents to proactively search the literature, integrate evidence from figures and tables, and sustain reasoning over multiple interactions. The benchmark’s episodic structure and process-level evaluation framework enable fine-grained diagnosis of agent capabilities, revealing that even the leading model achieves only 29.23% accuracy on the hard split. This underscores the substantial challenge of replicating human-like research workflows and provides a rigorous platform for advancing verifiable and reproducible AI research agents.

Key Points

  • EpiBench introduces episodic multi-turn workflows to evaluate agents' ability to simulate human-like scientific research, including proactive literature search and evidence integration across papers (a minimal sketch of such a task structure follows this list).
  • The benchmark incorporates multimodal elements (figures, tables) and requires sustained evidence use over time, unlike existing benchmarks that focus on static or single-step tasks.
  • The process-level evaluation framework enables fine-grained diagnosis of agent performance, revealing significant gaps in current models' ability to handle complex research workflows (e.g., 29.23% accuracy on the hard split).
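
The abstract does not specify EpiBench's data format, but the episodic structure it describes (a task, several search-and-read turns, accumulated cross-paper evidence, and a final objective question) maps naturally onto a small schema. The Python sketch below is purely illustrative: the class and field names (Evidence, Turn, Episode, modality, locator, and so on) are assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """One piece of evidence gathered during an episode."""
    paper_id: str   # e.g., an arXiv identifier
    modality: str   # "text", "figure", or "table"
    locator: str    # e.g., "Figure 3" or "Table 2"
    content: str    # extracted passage, caption, or cell values

@dataclass
class Turn:
    """One step of the workflow: a query plus the evidence it yields."""
    query: str
    retrieved: list[Evidence] = field(default_factory=list)

@dataclass
class Episode:
    """A short research workflow ending in an objective question."""
    task: str               # the research task statement
    turns: list[Turn]       # the reference search-and-read trajectory
    memory: list[Evidence]  # evidence that must be accumulated across papers
    question: str           # requires cross-paper comparison and multi-figure integration
    gold_answer: str        # reference answer for accuracy scoring
```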

Merits

Novelty and Scope

EpiBench fills a critical void in AI evaluation by addressing the lack of benchmarks for multi-turn, multi-evidence research workflows, which are essential for advancing autonomous scientific discovery agents.

Rigorous Evaluation Framework

The process-level evaluation framework allows for granular analysis of agent performance, enabling researchers to identify specific weaknesses (e.g., evidence integration, sustained reasoning) and target improvements effectively.
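To make "granular analysis" concrete: a process-level evaluator can score each stage of the workflow separately rather than only the final answer. Building on the hypothetical Episode and Turn types sketched earlier, the function below is one plausible shape for such a diagnostic; the stage names and the exact-match answer check are assumptions, not EpiBench's published rubric.

```python
def diagnose_episode(episode: Episode, agent_turns: list[Turn],
                     final_answer: str) -> dict[str, float]:
    """Hypothetical process-level scoring: one score per workflow stage."""
    gold = {(e.paper_id, e.locator) for e in episode.memory}
    found = {(e.paper_id, e.locator) for t in agent_turns for e in t.retrieved}
    gold_papers = {p for p, _ in gold}
    found_papers = {p for p, _ in found}

    return {
        # Proactive search: did the agent locate the required papers?
        "retrieval": len(gold_papers & found_papers) / max(len(gold_papers), 1),
        # Evidence alignment: did it consult the required figures/tables?
        "alignment": len(gold & found) / max(len(gold), 1),
        # Final answer: exact match as a simple placeholder for graded scoring.
        "answer": float(final_answer.strip().lower()
                        == episode.gold_answer.strip().lower()),
    }
```

A per-stage breakdown of this kind is what would let a researcher tell whether a 29.23% hard-split accuracy stems from failed retrieval, misaligned evidence, or a reasoning error at the final step.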

Multimodal Integration

By incorporating figures and tables alongside textual data, EpiBench reflects the real-world complexity of scientific research, where evidence often spans multiple modalities.

Demerits

Limited Generalizability

The benchmark’s focus on episodic research workflows may limit its applicability to other domains (e.g., legal or policy analysis) where multi-turn reasoning is also critical but structured differently.

Scalability Challenges

The complexity of multi-turn workflows and multimodal evidence integration may pose scalability issues, potentially limiting the benchmark’s adoption in resource-constrained environments.

Subjectivity in Evaluation

The subjective nature of defining 'correct' research workflows and evidence integration could introduce variability in benchmark results, particularly in open-ended areas like hypothesis generation.

Expert Commentary

EpiBench represents a paradigm shift in how we evaluate AI systems for complex, multi-step research tasks. The benchmark’s emphasis on episodic workflows and multimodal evidence integration mirrors the cognitive demands of human researchers, making it a more realistic proxy for autonomous scientific discovery than static benchmarks.

The poor performance of leading models (29.23% on the hard split) is not merely a reflection of current technological limitations but a clarion call for rethinking architectural designs. Traditional transformer-based models, while powerful, struggle with sustained reasoning and adaptive tool use, capabilities that may require novel paradigms such as neuro-symbolic integration or reinforcement learning for tool orchestration.

Furthermore, the benchmark’s process-level evaluation framework is particularly laudable, as it enables researchers to dissect failures at a granular level, whether in evidence retrieval, cross-paper alignment, or memory management. This diagnostic capability is invaluable for targeted improvements. However, the benchmark’s focus on episodic tasks may inadvertently prioritize short-term gains over long-term research strategies, such as hypothesis generation or experimental design. Future iterations of EpiBench could expand to include these higher-order cognitive tasks to fully capture the spectrum of scientific reasoning.

Recommendations

  • Develop hybrid neuro-symbolic architectures that combine the pattern-recognition strengths of neural networks with the logical rigor of symbolic reasoning to improve sustained evidence use and cross-paper alignment.
  • Expand EpiBench to include longitudinal research workflows, such as hypothesis generation and experimental design, to evaluate agents' ability to engage in open-ended scientific reasoning rather than episodic tasks.
  • Incorporate adversarial elements (e.g., noisy or conflicting evidence) into the benchmark to test the robustness of agents' reasoning and decision-making under uncertainty (see the sketch after this list).
  • Collaborate with domain experts (e.g., biologists, chemists) to ensure the benchmark’s tasks remain grounded in real-world research challenges and reflect the evolving needs of scientific disciplines.
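
One concrete way to realize the adversarial recommendation above is to perturb an episode's evidence pool with contradictory duplicates, forcing the agent to reconcile conflicting sources. The sketch below reuses the hypothetical Evidence and Episode types from earlier and is not a feature of the current benchmark.

```python
import copy
import random

def inject_conflicting_evidence(episode: Episode, rate: float = 0.2,
                                seed: int = 0) -> Episode:
    """Illustrative robustness probe: add contradictory copies of some
    evidence items so the agent must reconcile conflicting sources.
    A sketch only; not part of EpiBench itself."""
    rng = random.Random(seed)
    noisy = copy.deepcopy(episode)
    for ev in list(noisy.memory):  # snapshot: we append while iterating
        if rng.random() < rate:
            noisy.memory.append(Evidence(
                paper_id=ev.paper_id,
                modality=ev.modality,
                locator=ev.locator,
                content=ev.content + " [conflicting restatement with altered values]",
            ))
    return noisy
```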

Sources

Original: arXiv - cs.CL