MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
arXiv:2603.00873v1 Abstract: With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval accuracy, and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
Executive Summary
The paper introduces MC-Search, a benchmark for evaluating multimodal agentic search with structured long reasoning chains. It assesses multimodal large language models (MLLMs) on tasks that require step-wise, cross-modal, and knowledge-grounded reasoning. The benchmark comprises 3,333 high-quality examples with hop-wise annotated reasoning chains (averaging 3.7 hops), supporting evaluation of answer accuracy, reasoning quality, and stepwise retrieval and planning accuracy. The paper also presents a unified agentic MM-RAG pipeline used to benchmark six leading MLLMs, and Search-Align, a process-supervised fine-tuning framework that improves planning and retrieval fidelity in open-source MLLMs.
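The process-level metrics described above can be illustrated with a minimal sketch. The per-hop fields below (`gold_modality`, `retrieved_ids`, and so on) are hypothetical stand-ins for MC-Search's actual per-hop annotations, which the paper does not fully specify here; the sketch only shows the general shape of scoring retrieval and planning hop by hop rather than by final answer alone.

```python
# Hedged sketch of process-level scoring over an annotated reasoning chain.
# Field names are illustrative, not MC-Search's actual schema.
from dataclasses import dataclass, field

@dataclass
class Hop:
    sub_question: str
    gold_modality: str                               # modality the annotation prescribes
    predicted_modality: str                          # modality the agent actually queried
    gold_ids: set = field(default_factory=set)       # annotated supporting facts
    retrieved_ids: set = field(default_factory=set)  # facts the agent retrieved

def stepwise_retrieval_accuracy(hops):
    """Fraction of hops whose retrieval covers all annotated supporting facts."""
    return sum(h.gold_ids <= h.retrieved_ids for h in hops) / len(hops)

def planning_accuracy(hops):
    """Fraction of hops where the agent queried the annotated modality."""
    return sum(h.gold_modality == h.predicted_modality for h in hops) / len(hops)
```

A chain that retrieves the right evidence through the wrong modality would score high on retrieval but low on planning, which is exactly the kind of modality-misaligned behavior the benchmark is designed to surface.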
Key Points
- ▸ Introduction of MC-Search benchmark for agentic MM-RAG
- ▸ Evaluation of six leading MLLMs using long, step-wise annotated reasoning chains and process-level metrics
- ▸ Development of a unified agentic MM-RAG pipeline and Search-Align fine-tuning framework
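The agentic MM-RAG loop named in the key points can be sketched at a high level: the agent plans a sub-question, chooses a retrieval modality, retrieves evidence, and repeats until the planner decides it can answer. All component names here (`plan_step`, `retrievers`, `answer`) are illustrative assumptions, not the paper's actual pipeline API.

```python
# Hedged sketch of an agentic MM-RAG control loop. The planner returns
# (sub_question, modality) pairs, or None when enough evidence is gathered.
def agentic_mmrag(question, plan_step, retrievers, answer, max_hops=6):
    evidence = []
    for _ in range(max_hops):
        step = plan_step(question, evidence)  # adaptive planning per hop
        if step is None:                      # planner decides to stop
            break
        sub_q, modality = step
        evidence.extend(retrievers[modality](sub_q))  # modality-specific retrieval
    return answer(question, evidence)
```

Unlike the fixed retrieve-then-generate paradigm, each hop here is conditioned on the evidence gathered so far, which is what makes over-retrieval, under-retrieval, and modality-misaligned planning observable as distinct failure modes.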
Merits
Comprehensive Evaluation Framework
MC-Search evaluates MLLMs beyond final-answer accuracy: its process-level metrics score reasoning quality, stepwise retrieval, and planning, exposing failure modes such as over- and under-retrieval and modality-misaligned planning.
Demerits
Limited Scope of Reasoning Structures
The benchmark covers only five representative reasoning structures, which may not capture the full diversity of real-world multimodal reasoning tasks.
Expert Commentary
The introduction of MC-Search and Search-Align marks a significant step toward more capable and transparent MLLMs. By pairing a process-level evaluation framework with a fine-tuning framework that leverages verified reasoning chains, the authors address a critical gap: existing benchmarks judge only final answers, not how an agent plans and retrieves across hops. The work has implications for natural language processing, computer vision, and human-computer interaction, wherever multimodal agents must reason over multiple retrieval steps.
Recommendations
- ✓ Future research should focus on expanding the scope of MC-Search to cover a broader range of reasoning structures and real-world scenarios
- ✓ The development of Search-Align should be extended to other AI models and applications to improve their performance and transparency