
A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs

arXiv:2603.00432v1. Abstract: We introduce a typology-aware diagnostic for multilingual masked language models that tests reliance on word order versus inflectional form. Using Universal Dependencies, we apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head–dependent swaps, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. We evaluate mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial and head–dependent perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German/Spanish/Russian, and it does not mitigate the impact of scrambling. Top-5 word accuracy shows the same pattern: under full scrambling, the gold word rarely appears among the five highest-ranked recons

Anna Feldman, Libby Barak, Jing Peng · March 4, 2026

#cs.CL

Executive Summary

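This article introduces a typology-aware diagnostic that tests how much multilingual masked language models rely on word order versus inflectional form. Using Universal Dependencies data, it applies four inference-time perturbations (full token scrambling, content-word scrambling with function words held fixed, dependency-based head–dependent swaps, and sentence-level lemma substitution) to mBERT and XLM-R in English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in every language, while partial and head–dependent perturbations cause smaller but still large drops. Lemma substitution has little effect in Chinese but substantially lowers accuracy in German, Spanish, and Russian, and it does not offset the damage from scrambling. Together, the results suggest that both models depend heavily on word order and that their sensitivity to inflection varies with the language's morphology.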
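The four perturbations named in the abstract can be illustrated with a minimal sketch on a toy sentence. This is not the authors' implementation: the `Token` class, the `CONTENT_UPOS` set, the function names, and the choice to swap a single head–dependent pair are all assumptions made for illustration, and the toy annotations only imitate the Universal Dependencies CoNLL-U convention (1-based head index, 0 for the root).

```python
import random
from dataclasses import dataclass

# Assumption: these UPOS tags count as "content words"; the paper may use another set.
CONTENT_UPOS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


@dataclass(frozen=True)
class Token:
    form: str   # surface (inflected) form
    lemma: str  # dictionary form
    upos: str   # universal part-of-speech tag
    head: int   # 1-based index of the syntactic head, 0 for the root


def full_scramble(tokens, rng):
    """Shuffle every token position."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return [t.form for t in shuffled]


def content_scramble(tokens, rng):
    """Shuffle content words among themselves; function words keep their positions."""
    slots = [i for i, t in enumerate(tokens) if t.upos in CONTENT_UPOS]
    sources = slots[:]
    rng.shuffle(sources)
    out = [t.form for t in tokens]
    for dst, src in zip(slots, sources):
        out[dst] = tokens[src].form
    return out


def head_dependent_swap(tokens, dep_index):
    """Swap the token at dep_index (0-based) with its syntactic head."""
    head = tokens[dep_index].head
    if head == 0:
        raise ValueError("the root token has no head to swap with")
    out = [t.form for t in tokens]
    out[dep_index], out[head - 1] = out[head - 1], out[dep_index]
    return out


def lemma_substitute(tokens):
    """Replace every surface form with its lemma (the +L condition)."""
    return [t.lemma for t in tokens]


# Toy sentence with hand-written annotations (hypothetical, not taken from a treebank).
sentence = [
    Token("The", "the", "DET", 2),
    Token("dogs", "dog", "NOUN", 3),
    Token("chased", "chase", "VERB", 0),
    Token("a", "a", "DET", 5),
    Token("cat", "cat", "NOUN", 3),
]

rng = random.Random(0)
print(full_scramble(sentence, rng))
print(content_scramble(sentence, rng))
print(head_dependent_swap(sentence, 1))  # ['The', 'chased', 'dogs', 'a', 'cat']
print(lemma_substitute(sentence))        # ['the', 'dog', 'chase', 'a', 'cat']
```

In an actual evaluation, each perturbed sentence would have one word masked and be passed to the model at inference time. According to the abstract, the +L condition lemmatizes the masked target label as well as the context, a step this sketch does not cover.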

Key Points

▸ Introduction of a typology-aware diagnostic framework for multilingual masked language models
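▸ Evaluation of mBERT and XLM-R on English, Chinese, German, Spanish, and Russian under four inference-time perturbations
▸ Full scrambling drives word-level reconstruction accuracy near zero in all five languages; partial and head–dependent perturbations cause smaller but still large drops
▸ Lemma substitution has little effect in Chinese but substantially lowers accuracy in German, Spanish, and Russian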
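The word-level reconstruction accuracy in the key points, and the top-5 variant mentioned in the abstract, can be sketched as a simple hit rate over masked positions. The function name `top_k_accuracy`, its input format (one ranked list of candidate words per masked position), and the toy data are assumptions for illustration. How the paper maps subword predictions to whole words is not stated in the abstract, so the sketch takes word-level candidates as given.

```python
def top_k_accuracy(ranked_predictions, gold_words, k=5):
    """Fraction of masked positions whose gold word is among the top-k candidates.

    ranked_predictions: one list per masked position, best candidate first.
    gold_words: the original word at each masked position.
    """
    if len(ranked_predictions) != len(gold_words):
        raise ValueError("predictions and gold words must have the same length")
    if not gold_words:
        raise ValueError("at least one masked position is required")
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(ranked_predictions, gold_words)
    )
    return hits / len(gold_words)


# Hypothetical model outputs for four masked positions (not real results).
predictions = [
    ["cat", "dog", "bird", "fox", "cow"],
    ["chased", "saw", "ate", "hit", "fed"],
    ["a", "the", "one", "this", "his"],
    ["red", "big", "old", "new", "tiny"],
]
gold = ["dog", "chased", "her", "tiny"]

print(top_k_accuracy(predictions, gold, k=1))  # 0.25
print(top_k_accuracy(predictions, gold, k=5))  # 0.75
```

Comparing these two numbers between the unperturbed and perturbed versions of the same sentences gives the kind of accuracy drop the abstract reports.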

Merits

Comprehensive Evaluation Framework

The proposed framework evaluates multilingual models systematically along two axes, word order and inflectional form, using perturbations that range from full token scrambling to targeted head–dependent swaps.

Demerits

Limited Language Scope

The study evaluates only five languages (English, Chinese, German, Spanish, and Russian), which may not represent the full range of typological diversity.

Expert Commentary

The article makes a significant contribution to natural language processing by providing a rigorous evaluation framework for multilingual masked language models. Its results show why linguistic typology matters for model development and evaluation: the same perturbation can have very different effects depending on a language's morphology. The limited language scope, however, means the findings need to be confirmed on a broader sample. The findings have practical implications for building more effective multilingual models and might also inform language policy and planning in multilingual societies.

Recommendations

✓ Future studies should cover a more typologically diverse set of languages
✓ Developers of multilingual models should consider incorporating the proposed diagnostic framework into their evaluation pipelines

Sources

arXiv - cs.CL

