
PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation

Nina Hosseini-Kivanani

arXiv:2602.18652v1 Abstract: Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduce PolyFrame, our system for the MWE-2026 AdMIRe 2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision-language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic-regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.

Executive Summary

The article presents PolyFrame, a system designed to address the challenge of multimodal idiom disambiguation, particularly in multilingual contexts. PolyFrame uses a unified pipeline for both image+text and text-only ranking tasks, leveraging frozen CLIP-style vision-language encoders and the multilingual BGE M3 encoder, and trains only lightweight modules: a logistic-regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. The study demonstrates significant performance improvements over the CLIP baseline, achieving notable results in both English and Portuguese and maintaining robust performance across 15 languages in the multilingual blind test. The findings suggest that effective idiom disambiguation can be achieved without fine-tuning large multimodal encoders, highlighting the potential of lightweight, idiom-aware modules.
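
Of the modules listed above, Borda rank fusion is the most self-contained: each scorer produces its own candidate ranking, and the rankings are merged by summing positional points. The paper's exact weighting is not given in the abstract, so the following is a generic, unweighted Borda-count sketch (function name and example rankings are illustrative):

```python
def borda_fuse(rankings):
    """Fuse several candidate rankings with Borda counts.

    Each ranking is a list of candidate ids ordered best-first;
    a candidate at position i of an n-item ranking earns n - i points.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, cand in enumerate(ranking):
            scores[cand] = scores.get(cand, 0) + (n - pos)
    # Highest total score first (sorted() is stable, so earlier
    # insertion order breaks ties deterministically)
    return sorted(scores, key=scores.get, reverse=True)

ranks = [
    ["a", "b", "c"],  # e.g. an image+text similarity ranking
    ["a", "c", "b"],  # e.g. a caption-only ranking
    ["b", "a", "c"],  # e.g. a paraphrase-based ranking
]
print(borda_fuse(ranks))  # → ['a', 'b', 'c']
```

Because the fusion operates only on rank positions, it can combine scorers whose raw scores live on incompatible scales, which fits the paper's mix of vision-language and text-only rankers.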

Key Points

  • PolyFrame introduces a unified pipeline for multimodal idiom disambiguation, combining image+text and text-only ranking tasks.
  • The system achieves significant performance improvements using lightweight modules and frozen encoders.
  • Idiom-aware paraphrasing is identified as the primary contributor to performance enhancements.
  • The study demonstrates robust zero-shot transfer capabilities to Portuguese and consistent performance across 15 languages.

Merits

Innovative Approach

PolyFrame's unified pipeline and the use of lightweight modules represent an innovative approach to multimodal idiom disambiguation, offering a practical solution without the need for fine-tuning large encoders.
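
The "frozen encoder plus lightweight module" recipe described above amounts to fitting a small trainable head, such as the paper's logistic-regression sentence-type predictor, on fixed embeddings while the encoder weights never change. The sketch below is a generic illustration of that pattern, not the paper's implementation; the toy 2-D "embeddings", hyperparameters, and function names are all assumptions:

```python
import math

def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a binary logistic-regression head on fixed feature vectors.

    X: feature vectors (standing in for frozen sentence embeddings);
    y: 0/1 labels (e.g. literal vs idiomatic usage).
    Only the head's weights w and bias b are updated.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return 1 if the head's decision score is positive, else 0."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0

# Toy stand-in for frozen embeddings: idiomatic examples cluster at +1,
# literal ones at -1 along the first dimension.
X = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, 0.0]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
print(predict(w, b, [0.7, 0.1]))  # → 1 (idiomatic)
```

The practical appeal is exactly the one the article highlights: the trainable parameter count is the embedding dimension plus one, so training is cheap even when the frozen encoder is very large.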

High Performance

The system achieves strong results, reaching 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese, and maintains robustness across 15 languages on the multilingual blind test, demonstrating its effectiveness in multilingual settings.

Ablation Insights

The ablation results provide valuable insights into the contributions of different modules, highlighting the importance of idiom-aware rewriting and sentence-type prediction.

Demerits

Limited Generalization

While the system performs well in the tested languages, its generalization to other languages and idiomatic expressions not included in the study remains uncertain.

Dependency on Pretrained Models

The reliance on frozen CLIP-style and BGE M3 encoders may limit the adaptability of the system to new or emerging multimodal contexts.

Complexity of Implementation

The integration of multiple lightweight modules, while effective, adds complexity to the system, which may pose challenges for scalability and real-world deployment.

Expert Commentary

The article presents a significant advancement in the field of multimodal idiom disambiguation, addressing a critical challenge in both monolingual and multilingual settings. The innovative use of lightweight modules and frozen encoders not only improves performance but also offers a practical solution that avoids the computational overhead of fine-tuning large models. The study's findings are particularly noteworthy for their zero-shot transfer capabilities and consistent performance across multiple languages, demonstrating the potential for widespread applicability. However, the reliance on pretrained models and the complexity of integrating multiple modules pose challenges that need to be addressed for broader adoption. The ablation results provide valuable insights into the contributions of different components, underscoring the importance of idiom-aware rewriting and sentence-type prediction. Overall, the study contributes meaningfully to the fields of multimodal learning and multilingual NLP, offering a robust framework for future research and practical applications.

Recommendations

  • Further research should explore the generalization of PolyFrame to additional languages and idiomatic expressions to ensure its robustness and adaptability.
  • Investigation into the integration of more advanced, yet lightweight, modules could enhance the system's performance and scalability without compromising efficiency.
