Vocabulary shapes cross-lingual variation of word-order learnability in language models
arXiv:2603.19427v1 Abstract: Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
Executive Summary
This study investigates how vocabulary structure shapes the learnability of word order in language models. The authors pretrain transformer language models on synthetic word-order variants of natural languages and observe that greater word-order irregularity raises model surprisal, indicating reduced learnability. Crucially, the structure of the word and subword vocabulary strongly predicts model surprisal, contradicting the common assumption that cross-lingual variation in word-order learnability can be explained by a coarse distinction between free- and fixed-word-order languages. These findings have implications for language model development and shed new light on the computational learnability of word order across languages.
Key Points
- ▸ The structure of the word and subword vocabulary strongly predicts model surprisal in language models.
- ▸ Greater word-order irregularity raises model surprisal and reduces learnability.
- ▸ The distinction between free- and fixed-word-order languages does not explain cross-lingual variation.
Merits
Strength of Methodology
The authors employ a robust methodology, pretraining transformer language models on synthetic word-order variants of natural languages, which allows for a comprehensive investigation of the relationship between vocabulary structure and word-order learnability.
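The paper does not spell out its exact variant-generation procedure here, but the idea of a "spectrum of word-order variants" can be illustrated with a minimal sketch. The function names and the `irregularity` knob below are hypothetical stand-ins: full sentence reversal is one variant the study tests, and a controlled shuffle of token positions approximates graded word-order irregularity.

```python
import random

def reverse_variant(tokens):
    """Sentence reversal: the same tokens in reverse order."""
    return tokens[::-1]

def shuffled_variant(tokens, irregularity, rng):
    """Illustrative irregularity knob: permute a random fraction of positions.

    irregularity=0.0 leaves the sentence unchanged; irregularity=1.0
    shuffles every position. Intermediate values give a graded spectrum.
    """
    tokens = list(tokens)
    k = round(irregularity * len(tokens))
    idx = rng.sample(range(len(tokens)), k)   # positions to disturb
    vals = [tokens[i] for i in idx]
    rng.shuffle(vals)                         # permute only those positions
    for i, v in zip(idx, vals):
        tokens[i] = v
    return tokens

sent = "the cat sat on the mat".split()
print(reverse_variant(sent))                  # ['mat', 'the', 'on', 'sat', 'cat', 'the']
print(shuffled_variant(sent, 0.5, random.Random(0)))
```

A corpus rewritten this way preserves vocabulary and sentence length while varying only word-order regularity, which is what lets the study isolate word order as the experimental variable.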
Insights into Computational Learnability
The study sheds new light on the computational learnability of word order across languages, highlighting the importance of vocabulary structure in predicting model surprisal.
Demerits
Limitation of Generalizability
The study's generalizability to downstream natural language processing tasks is limited: it isolates one aspect of language modeling and may not capture the complexities of real-world language use.
Overreliance on Surprisal Metric
The study relies heavily on surprisal as its sole measure of learnability, which may not capture the full range of linguistic phenomena and can be confounded by factors such as model architecture and training data.
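For readers unfamiliar with the metric being critiqued: surprisal is the standard information-theoretic quantity, the negative log-probability a model assigns to each token given its context, so lower average surprisal means the model predicts the language more easily. A minimal sketch (using made-up probabilities rather than a real model's outputs):

```python
import math

def surprisal_bits(prob):
    """Surprisal of an event with model probability p: -log2(p), in bits."""
    return -math.log2(prob)

def mean_surprisal(token_probs):
    """Average per-token surprisal over a sequence of model probabilities."""
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)

# A more learnable word order lets the model assign higher probability
# to each next token, lowering the average surprisal:
print(mean_surprisal([0.5, 0.25, 0.5]))   # (1 + 2 + 1) / 3 bits
```

The critique above is that two corpora can yield different mean surprisal for reasons unrelated to word order (tokenization, architecture, data size), which is exactly why the paper's finding that vocabulary structure predicts surprisal matters.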
Expert Commentary
The study's findings are a significant contribution to natural language processing, shedding new light on the computational learnability of word order across languages. The robust methodology and the identification of vocabulary structure as a key driver of model surprisal are notable strengths. However, the limited generalizability to real-world language use and the reliance on surprisal as the sole learnability metric leave room for further research. The implications for language model evaluation and cross-lingual language understanding are significant and warrant further exploration.
Recommendations
- ✓ Future studies should investigate the relationship between vocabulary structure and word-order learnability in more naturalistic language settings.
- ✓ Language model developers should consider incorporating measures of vocabulary structure into their evaluation metrics.
Sources
Original: arXiv - cs.CL