Feature-level Interaction Explanations in Multimodal Transformers

arXiv:2603.13326v1 Announce Type: new Abstract: Multimodal Transformers often produce predictions without clarifying how different modalities jointly support a decision. Most existing multimodal explainable AI (MXAI) methods extend unimodal saliency to multimodal backbones, highlighting important tokens or patches within each modality, but they rarely pinpoint which cross-modal feature pairs provide complementary evidence (synergy) or serve as reliable backups (redundancy). We present Feature-level I2MoE (FL-I2MoE), a structured Mixture-of-Experts layer that operates directly on token/patch sequences from frozen pretrained encoders and explicitly separates unique, synergistic, and redundant evidence at the feature level. We further develop an expert-wise explanation pipeline that combines attribution with top-K% masking to assess faithfulness, and we introduce Monte Carlo interaction probes to quantify pairwise behavior: the Shapley Interaction Index (SII) to score synergistic pairs and a redundancy-gap score to capture substitutable (redundant) pairs. Across three benchmarks (MMIMDb, ENRICO, and MMHS150K), FL-I2MoE yields more interaction-specific and concentrated importance patterns than a dense Transformer with the same encoders. Finally, pair-level masking shows that removing pairs ranked by SII or redundancy-gap degrades performance more than masking randomly chosen pairs under the same budget, supporting that the identified interactions are causally relevant.

Executive Summary

This article presents Feature-level I2MoE (FL-I2MoE), a multimodal explainable AI (MXAI) method that explicitly separates unique, synergistic, and redundant evidence at the feature level. FL-I2MoE is a structured Mixture-of-Experts layer that operates directly on token/patch sequences from frozen pretrained encoders. The authors pair it with an expert-wise explanation pipeline that combines attribution with top-K% masking to assess faithfulness, and with Monte Carlo interaction probes: the Shapley Interaction Index (SII) to score synergistic pairs and a redundancy-gap score to capture substitutable (redundant) pairs. Evaluated on three benchmarks (MMIMDb, ENRICO, and MMHS150K), FL-I2MoE yields more interaction-specific and concentrated importance patterns than a dense Transformer with the same encoders. Pair-level masking shows that removing pairs ranked by SII or redundancy-gap degrades performance more than masking randomly chosen pairs under the same budget, supporting the claim that the identified interactions are causally relevant. This research contributes to the development of more interpretable and reliable multimodal models.
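The abstract does not spell out the masking procedure, but the general idea behind top-K% masking faithfulness checks is standard: ablate the features an attribution method ranks highest and measure how much the model's score drops. A minimal illustrative sketch (the function name, the zero-baseline masking choice, and the scalar `model_fn` interface are assumptions, not the paper's implementation):

```python
import numpy as np

def topk_mask_faithfulness(model_fn, x, attributions, k_percent=10.0):
    """Mask the top-K% highest-attribution features and return the drop
    in the model's score. A larger drop suggests the attribution is more
    faithful. `model_fn` maps a feature vector to a scalar score."""
    n = x.size
    k = max(1, int(np.ceil(n * k_percent / 100.0)))
    # Indices of the K features with the largest attribution magnitude.
    top_idx = np.argsort(-np.abs(attributions))[:k]
    x_masked = x.copy()
    x_masked[top_idx] = 0.0  # zero baseline; other baselines (mean, noise) are common too
    return model_fn(x) - model_fn(x_masked)
```

For a linear `model_fn` such as a plain sum, masking the single top-attributed feature of an all-ones vector drops the score by exactly that feature's value, which makes the sketch easy to sanity-check.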

Key Points

  • FL-I2MoE explicitly separates unique, synergistic, and redundant evidence at the feature level.
  • The method combines attribution with top-K% masking to assess faithfulness.
  • The Shapley Interaction Index (SII) and redundancy-gap score quantify pairwise behavior.
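The Shapley Interaction Index for a feature pair (i, j) averages the discrete second difference v(S∪{i,j}) − v(S∪{i}) − v(S∪{j}) + v(S) over coalitions S with Shapley-style weights; those weights distribute total mass uniformly over coalition sizes, so an unbiased Monte Carlo estimate can sample a size uniformly and then a coalition of that size. A hedged sketch of such a probe (the function name and value-function interface are illustrative; the paper's redundancy-gap score is not defined in the abstract, so it is not sketched here):

```python
import random

def shapley_interaction_mc(v, n, i, j, num_samples=1000, seed=0):
    """Monte Carlo estimate of the Shapley Interaction Index for features
    i and j under value function v, which maps a frozenset of feature
    indices to a scalar. Sampling a coalition size uniformly and then a
    random coalition of that size matches the SII's coalition weighting."""
    rng = random.Random(seed)
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for _ in range(num_samples):
        s = rng.randint(0, len(others))       # coalition size, uniform over 0..n-2
        S = frozenset(rng.sample(others, s))  # random coalition of that size
        # Discrete second difference: extra value of adding i and j jointly
        # beyond adding each alone (positive => synergy, negative => redundancy).
        total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / num_samples
```

As a sanity check, for a purely synergistic value function that pays 1 only when both i and j are present, every sampled second difference equals 1, so the estimator returns exactly 1.0.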

Merits

Strength

FL-I2MoE provides explicit feature-level explanations, enhancing interpretability and trustworthiness of multimodal models.

Demerits

Limitation

The method relies on frozen pretrained encoders, which may limit its adaptability to diverse multimodal tasks and datasets.

Expert Commentary

This research addresses a real gap in multimodal explainable AI: the need for feature-level, interaction-aware explanations in complex multimodal models. By combining FL-I2MoE's structured experts with Shapley Interaction Index (SII) and redundancy-gap probes, the authors provide a framework for assessing how individual feature pairs contribute to a model's decision. The evaluation on three benchmarks shows that FL-I2MoE captures more interaction-specific and concentrated importance patterns than a comparable dense Transformer, and the pair-level masking results lend causal support to the identified interactions. The main caveat is the reliance on frozen pretrained encoders, which may limit adaptability where those encoders transfer poorly. Extending FL-I2MoE to more flexible, domain-specific models and validating it in real-world scenarios are natural next steps.

Recommendations

  • Future research should explore the extension of FL-I2MoE to more flexible and domain-specific models.
  • The authors should investigate the application of FL-I2MoE in real-world scenarios to further validate its practical implications.

Sources