MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
arXiv:2603.00873v1 Abstract: With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval accuracy, and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
Executive Summary
The paper introduces MC-Search, a benchmark for evaluating multimodal agentic search with structured long reasoning chains. It assesses multimodal large language models (MLLMs) on tasks that require step-wise, cross-modal, and knowledge-grounded reasoning. The benchmark comprises 3,333 high-quality examples with hop-wise annotated reasoning chains (averaging 3.7 hops), supporting evaluation of answer accuracy, reasoning quality, and stepwise retrieval and planning accuracy. The paper also presents a unified agentic MM-RAG pipeline used to benchmark six leading MLLMs, and Search-Align, a process-supervised fine-tuning framework that improves planning and retrieval fidelity in open-source MLLMs.
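The process-level metrics described above can be illustrated with a minimal sketch. The per-hop fields below (`gold_modality`, `retrieved_ids`, and so on) are hypothetical stand-ins for MC-Search's actual per-hop annotations, which the paper does not fully specify here; the sketch only shows the general shape of scoring retrieval and planning hop by hop rather than by final answer alone.

```python
# Hedged sketch of process-level scoring over an annotated reasoning chain.
# Field names are illustrative, not MC-Search's actual schema.
from dataclasses import dataclass, field

@dataclass
class Hop:
    sub_question: str
    gold_modality: str                               # modality the annotation prescribes
    predicted_modality: str                          # modality the agent actually queried
    gold_ids: set = field(default_factory=set)       # annotated supporting facts
    retrieved_ids: set = field(default_factory=set)  # facts the agent retrieved

def stepwise_retrieval_accuracy(hops):
    """Fraction of hops whose retrieval covers all annotated supporting facts."""
    return sum(h.gold_ids <= h.retrieved_ids for h in hops) / len(hops)

def planning_accuracy(hops):
    """Fraction of hops where the agent queried the annotated modality."""
    return sum(h.gold_modality == h.predicted_modality for h in hops) / len(hops)
```

A chain that retrieves the right evidence through the wrong modality would score high on retrieval but low on planning, which is exactly the kind of modality-misaligned behavior the benchmark is designed to surface.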
Key Points
- ▸ Introduction of MC-Search benchmark for agentic MM-RAG
- ▸ Evaluation of six leading MLLMs using long, step-wise annotated reasoning chains and process-level metrics
- ▸ Development of a unified agentic MM-RAG pipeline and Search-Align fine-tuning framework
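The agentic MM-RAG loop named in the key points can be sketched at a high level: the agent plans a sub-question, chooses a retrieval modality, retrieves evidence, and repeats until the planner decides it can answer. All component names here (`plan_step`, `retrievers`, `answer`) are illustrative assumptions, not the paper's actual pipeline API.

```python
# Hedged sketch of an agentic MM-RAG control loop. The planner returns
# (sub_question, modality) pairs, or None when enough evidence is gathered.
def agentic_mmrag(question, plan_step, retrievers, answer, max_hops=6):
    evidence = []
    for _ in range(max_hops):
        step = plan_step(question, evidence)  # adaptive planning per hop
        if step is None:                      # planner decides to stop
            break
        sub_q, modality = step
        evidence.extend(retrievers[modality](sub_q))  # modality-specific retrieval
    return answer(question, evidence)
```

Unlike the fixed retrieve-then-generate paradigm, each hop here is conditioned on the evidence gathered so far, which is what makes over-retrieval, under-retrieval, and modality-misaligned planning observable as distinct failure modes.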
Merits
Comprehensive Evaluation Framework
MC-Search evaluates MLLMs beyond final-answer accuracy: its process-level metrics score reasoning quality, stepwise retrieval, and planning, exposing failure modes such as over- and under-retrieval and modality-misaligned planning.
Demerits
Limited Scope of Reasoning Structures
The benchmark covers only five representative reasoning structures, which may not capture the full diversity of real-world multimodal reasoning tasks.
Expert Commentary
The introduction of MC-Search and Search-Align marks a significant step toward more capable and transparent MLLMs. By pairing a process-level evaluation framework with a fine-tuning framework that leverages verified reasoning chains, the authors address a critical gap: existing benchmarks judge only final answers, not how an agent plans and retrieves across hops. The work has implications for natural language processing, computer vision, and human-computer interaction, wherever multimodal agents must reason over multiple retrieval steps.
Recommendations
- ✓ Future research should focus on expanding the scope of MC-Search to cover a broader range of reasoning structures and real-world scenarios
- ✓ The development of Search-Align should be extended to other AI models and applications to improve their performance and transparency