
Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

arXiv:2603.18425v1 Announce Type: new Abstract: Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target mismatch along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weight and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.

Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura


Executive Summary

This paper contributes to the growing body of research on multimodal task interference in large language models (LLMs). The authors introduce a benchmark for evaluating task interference in multimodal LLMs, comprising six tasks across text and vision with systematic variation of history-target mismatch. The study finds that task interference is highly directional, with severe performance drops when switching from text-only histories to image-based targets but minimal degradation in the reverse direction. The analysis shows that modality mismatch is the strongest driver of interference, that co-occurring mismatches amplify it, and that shifts in reasoning requirements have minimal impact. These findings have significant implications for the development and deployment of multimodal dialogue systems.

Key Points

  • The authors introduce a benchmark for evaluating task interference in multimodal LLMs.
  • Task interference is highly directional, with significant performance drops when switching from text-only to image-based targets.
  • Modality mismatches have a significant impact on task interference, while reasoning requirement shifts have minimal impact.
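The directional finding above can be made concrete with a small sketch. The snippet below is an illustration only, not the paper's evaluation code: it defines interference as the accuracy drop on a target task when a mismatched-task history precedes it, relative to answering the same target task with no prior history. The numbers are made up to mirror the reported asymmetry.

```python
# Hypothetical sketch: directional task interference measured as the accuracy
# drop on a target task caused by a preceding (mismatched) task history,
# relative to a no-history baseline. Numbers are illustrative, not from the paper.

def interference(acc_with_history: float, acc_no_history: float) -> float:
    """Absolute accuracy drop attributable to the preceding task history."""
    return acc_no_history - acc_with_history

# Made-up accuracies reflecting the paper's directional finding:
text_to_image = interference(acc_with_history=0.52, acc_no_history=0.78)
image_to_text = interference(acc_with_history=0.74, acc_no_history=0.76)

print(f"text->image drop: {text_to_image:.2f}")  # large drop
print(f"image->text drop: {image_to_text:.2f}")  # small drop
```

A positive value means the history hurt target-task performance; the asymmetry between the two directions is what the paper calls directional interference.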

Merits

Novel benchmark

The introduction of a benchmark for evaluating task interference in multimodal LLMs is a significant contribution to the field.

Methodological rigor

The study employs a systematic design with varying levels of history-target mismatch, ensuring a comprehensive analysis of task interference.
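To make the systematic design concrete, the sketch below enumerates the mismatch profile of a history-target pair along the three axes named in the abstract. The task attributes and values here are hypothetical placeholders, not the benchmark's actual task definitions.

```python
# Hypothetical sketch: flagging history-target mismatch along the three axes
# the paper varies (modality, reasoning requirement, answer format).
# Task specs below are illustrative placeholders, not the benchmark's tasks.

AXES = ("modality", "reasoning", "answer_format")

def mismatch_profile(history_task: dict, target_task: dict) -> dict:
    """Return, per axis, whether the history and target tasks differ."""
    return {axis: history_task[axis] != target_task[axis] for axis in AXES}

# Two illustrative task specifications.
history = {"modality": "text", "reasoning": True, "answer_format": "free-form"}
target = {"modality": "image", "reasoning": True, "answer_format": "multiple-choice"}

print(mismatch_profile(history, target))
# {'modality': True, 'reasoning': False, 'answer_format': True}
```

Crossing such profiles over all task pairs is one way a benchmark could isolate each axis's contribution and test whether co-occurring mismatches compound, as the paper reports.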

Insights into multimodal dialogue systems

The findings provide valuable insights into the challenges of developing and deploying multimodal dialogue systems.

Demerits

Narrow model scope

The study focuses exclusively on LLMs, limiting the generalizability of the findings to other types of dialogue systems.

Lack of real-world context

The experiments are conducted in a controlled environment, which may not accurately reflect real-world scenarios.

Expert Commentary

The paper makes a significant contribution to the field of multimodal dialogue systems, highlighting the need for systematic evaluation of task interference in multimodal LLMs. The findings have important implications for the development and deployment of these systems, and the study's methodological rigor provides a solid foundation for future research. However, its limitations, namely the exclusive focus on LLMs and the controlled, non-real-world experimental setting, should be addressed in future work. Overall, this paper is a valuable addition to the literature on multimodal dialogue systems, and its findings will likely influence how such systems are designed.

Recommendations

  • Future studies should consider evaluating task interference in other types of dialogue systems, beyond LLMs.
  • Researchers should explore ways to mitigate task interference in multimodal dialogue systems, such as developing more effective task switching strategies.
