A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs
arXiv:2603.00432v1. Abstract: We introduce a typology-aware diagnostic for multilingual masked language models that tests reliance on word order versus inflectional form. Using Universal Dependencies, we apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head-dependent swaps, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. We evaluate mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial and head-dependent perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German/Spanish/Russian, and it does not mitigate the impact of scrambling. Top-5 word accuracy shows the same pattern: under full scrambling, the gold word rarely appears among the five highest-ranked reconstructions. We release code, sampling scripts, and balanced evaluation subsets; Turkish results under strict reconstruction are reported in the appendix.
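The four perturbations named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released code: the token dictionaries (with `form`, `lemma`, `upos`, and a 1-based `head`, loosely mirroring CoNLL-U fields), the `FUNCTION_UPOS` set, and all function names are assumptions made for illustration, and the paper may define function words and swaps differently.

```python
import random

# Assumed split: UPOS tags treated as function words in this sketch.
# The paper's actual definition may differ.
FUNCTION_UPOS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ", "PUNCT"}


def full_scramble(tokens, rng):
    """Return a copy of the sentence with all tokens in random order."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def content_scramble(tokens, rng):
    """Shuffle content words among their own slots; function words stay fixed."""
    slots = [i for i, t in enumerate(tokens) if t["upos"] not in FUNCTION_UPOS]
    content = [tokens[i] for i in slots]
    rng.shuffle(content)
    out = list(tokens)
    for i, tok in zip(slots, content):
        out[i] = tok
    return out


def head_dependent_swap(tokens, dep_index):
    """Swap one dependent with its head (head is 1-based, 0 marks the root).

    The stored head indices are not updated afterwards, so they describe
    the original order, not the perturbed one.
    """
    out = list(tokens)
    head = tokens[dep_index]["head"]
    if head == 0:  # the root has no head to swap with
        return out
    out[dep_index], out[head - 1] = out[head - 1], out[dep_index]
    return out


def lemmatize(tokens):
    """Replace every surface form with its lemma (the +L condition)."""
    return [{**t, "form": t["lemma"]} for t in tokens]


# Hypothetical example sentence, annotated by hand.
sentence = [
    {"form": "The", "lemma": "the", "upos": "DET", "head": 2},
    {"form": "dogs", "lemma": "dog", "upos": "NOUN", "head": 3},
    {"form": "chased", "lemma": "chase", "upos": "VERB", "head": 0},
    {"form": "a", "lemma": "a", "upos": "DET", "head": 5},
    {"form": "cat", "lemma": "cat", "upos": "NOUN", "head": 3},
]

rng = random.Random(0)  # fixed seed so the perturbations are reproducible
print([t["form"] for t in full_scramble(sentence, rng)])
print([t["form"] for t in content_scramble(sentence, rng)])
print([t["form"] for t in head_dependent_swap(sentence, 1)])
print([t["form"] for t in lemmatize(sentence)])
```

In this sketch the +L condition lemmatizes every token, so a masked target's label would simply be the `lemma` of the masked position, in line with the abstract's statement that both context and target label are lemmatized.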
Executive Summary
This article introduces a typology-aware diagnostic framework that tests how much multilingual masked language models rely on word order versus inflectional form. Using Universal Dependencies data, it applies inference-time perturbations (full token scrambling, content-word scrambling with function words fixed, head-dependent swaps, and lemma substitution) to mBERT and XLM-R in English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy to near zero in every language, while the partial and head-dependent perturbations cause smaller but still large drops. Lemma substitution has little effect in Chinese but substantially lowers accuracy in German, Spanish, and Russian, and it does not offset the damage from scrambling. Together, these results suggest that evaluations of multilingual models should take typological differences into account.
Key Points
- Introduction of a typology-aware diagnostic framework for multilingual masked language models
- Evaluation of mBERT and XLM-R on English, Chinese, German, Spanish, and Russian under four inference-time perturbations
- Full scrambling drives word-level reconstruction accuracy near zero in all five languages; lemma substitution hurts German, Spanish, and Russian far more than Chinese
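The two metrics behind these results, word-level reconstruction accuracy and top-5 word accuracy, can be sketched as a single top-k function. This is an illustrative sketch only: the function name, the input layout (one ranked candidate list per masked word), and the toy predictions are assumptions, not values or code from the paper.

```python
def top_k_accuracy(ranked_predictions, gold_words, k=5):
    """Fraction of masked words whose gold form is among the top-k candidates.

    ranked_predictions: one list of candidate words per masked position,
                        ordered from most to least probable.
    gold_words:         the gold word for each masked position.
    With k=1 this reduces to exact word-level reconstruction accuracy.
    """
    if len(ranked_predictions) != len(gold_words):
        raise ValueError("predictions and gold words must align one-to-one")
    if not gold_words:
        return 0.0
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(ranked_predictions, gold_words)
    )
    return hits / len(gold_words)


# Toy, made-up predictions for three masked positions.
predictions = [["cat", "dog", "bird"], ["ran", "walked"], ["blue"]]
gold = ["dog", "ran", "red"]

print(top_k_accuracy(predictions, gold, k=1))  # only "ran" is ranked first
print(top_k_accuracy(predictions, gold, k=5))  # "dog" and "ran" are in the top 5
```

Reporting both k=1 and k=5 separates "the model ranks the right word first" from "the right word is at least a plausible candidate", which is why the abstract's observation that the gold word rarely reaches the top five under full scrambling is a stronger finding than a drop in top-1 accuracy alone.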
Merits
Comprehensive Evaluation Framework
The proposed framework evaluates multilingual models systematically along two axes, word order and inflectional form, using perturbations of graded severity that range from single head-dependent swaps to full token scrambling.
Demerits
Limited Language Scope
The main experiments cover only five languages (English, Chinese, German, Spanish, and Russian), with Turkish results confined to the appendix. This sample may not represent the full range of typological diversity.
Expert Commentary
The article contributes a rigorous, typologically grounded evaluation framework for multilingual masked language models. Its results show that sensitivity to lemma substitution varies with a language's morphology, which underlines the importance of considering linguistic typology in model development and evaluation. The narrow language sample is the main limitation and calls for further research. The findings have practical implications for the development of more effective multilingual models and may inform language policy and planning in multilingual societies.
Recommendations
- Future studies should expand the language scope to include a more diverse range of languages and linguistic typologies
- Developers of multilingual models should consider incorporating the proposed diagnostic framework into their evaluation pipelines