Vocabulary shapes cross-lingual variation of word-order learnability in language models
arXiv:2603.19427v1 Abstract: Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
Executive Summary
This study investigates how vocabulary structure shapes the learnability of word order in language models. The authors pretrain transformer language models on synthetic word-order variants of natural languages and observe that greater word-order irregularity raises model surprisal, indicating reduced learnability. Crucially, the structure of the word and subword vocabulary strongly predicts model surprisal, contradicting the common assumption that cross-lingual variation in word-order learnability can be explained by a coarse distinction between free- and fixed-word-order languages. These findings have implications for language model development and shed new light on the computational learnability of word order across languages.
Key Points
- ▸ The structure of the word and subword vocabulary strongly predicts model surprisal in language models.
- ▸ Greater word-order irregularity raises model surprisal and reduces learnability.
- ▸ The distinction between free- and fixed-word-order languages does not explain cross-lingual variation.
Merits
Strength of Methodology
The authors employ a robust methodology, pretraining transformer language models on synthetic word-order variants of natural languages, which allows for a comprehensive investigation of the relationship between vocabulary structure and word-order learnability.
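The paper does not spell out its exact variant-generation procedure here, but the idea of a "spectrum of word-order variants" can be illustrated with a minimal sketch. The function names and the `irregularity` knob below are hypothetical stand-ins: full sentence reversal is one variant the study tests, and a controlled shuffle of token positions approximates graded word-order irregularity.

```python
import random

def reverse_variant(tokens):
    """Sentence reversal: the same tokens in reverse order."""
    return tokens[::-1]

def shuffled_variant(tokens, irregularity, rng):
    """Illustrative irregularity knob: permute a random fraction of positions.

    irregularity=0.0 leaves the sentence unchanged; irregularity=1.0
    shuffles every position. Intermediate values give a graded spectrum.
    """
    tokens = list(tokens)
    k = round(irregularity * len(tokens))
    idx = rng.sample(range(len(tokens)), k)   # positions to disturb
    vals = [tokens[i] for i in idx]
    rng.shuffle(vals)                         # permute only those positions
    for i, v in zip(idx, vals):
        tokens[i] = v
    return tokens

sent = "the cat sat on the mat".split()
print(reverse_variant(sent))                  # ['mat', 'the', 'on', 'sat', 'cat', 'the']
print(shuffled_variant(sent, 0.5, random.Random(0)))
```

A corpus rewritten this way preserves vocabulary and sentence length while varying only word-order regularity, which is what lets the study isolate word order as the experimental variable.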
Insights into Computational Learnability
The study sheds new light on the computational learnability of word order across languages, highlighting the importance of vocabulary structure in predicting model surprisal.
Demerits
Limitation of Generalizability
The study's generalizability to downstream natural language processing tasks is limited: it isolates one aspect of language modeling and may not capture the complexities of real-world language use.
Overreliance on Surprisal Metric
The study relies heavily on surprisal as its sole measure of learnability, which may not capture the full range of linguistic phenomena and can be confounded by factors such as model architecture and training data.
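For readers unfamiliar with the metric being critiqued: surprisal is the standard information-theoretic quantity, the negative log-probability a model assigns to each token given its context, so lower average surprisal means the model predicts the language more easily. A minimal sketch (using made-up probabilities rather than a real model's outputs):

```python
import math

def surprisal_bits(prob):
    """Surprisal of an event with model probability p: -log2(p), in bits."""
    return -math.log2(prob)

def mean_surprisal(token_probs):
    """Average per-token surprisal over a sequence of model probabilities."""
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)

# A more learnable word order lets the model assign higher probability
# to each next token, lowering the average surprisal:
print(mean_surprisal([0.5, 0.25, 0.5]))   # (1 + 2 + 1) / 3 bits
```

The critique above is that two corpora can yield different mean surprisal for reasons unrelated to word order (tokenization, architecture, data size), which is exactly why the paper's finding that vocabulary structure predicts surprisal matters.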
Expert Commentary
The study's findings are a significant contribution to natural language processing, shedding new light on the computational learnability of word order across languages. The robust methodology and the identification of vocabulary structure as a key driver of model surprisal are notable strengths. However, the limited generalizability to real-world language use and the reliance on surprisal as the sole learnability metric leave room for further research. The implications for language model evaluation and cross-lingual language understanding are significant and warrant further exploration.
Recommendations
- ✓ Future studies should investigate the relationship between vocabulary structure and word-order learnability in more naturalistic language settings.
- ✓ Language model developers should consider incorporating measures of vocabulary structure into their evaluation metrics.
Sources
Original: arXiv - cs.CL