Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
arXiv:2602.17653v1
Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
Executive Summary
This article examines the typological alignment of language models in their treatment of differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. The authors train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. The results reveal a dissociation between two typological dimensions of DAM, with models exhibiting human-like preferences for natural markedness direction but not reproducing the strong object preference in human languages. This study contributes to our understanding of how language models internalize typological regularities and highlights the need to re-examine the sources of these tendencies.
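To make the evaluation setup concrete, below is a minimal sketch of minimal-pair scoring with a GPT-2 language model. It assumes the HuggingFace transformers API and, for illustration only, the public gpt2 checkpoint; the paper's own models are trained from scratch on synthetic DAM corpora, and the sentence pair and the "dom" marker here are hypothetical placeholders rather than the paper's actual stimuli.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of a sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted tokens to recover the summed log-prob
    return -out.loss.item() * (ids.size(1) - 1)

# Hypothetical minimal pair from a synthetic SOV DAM language: the object is
# animate (semantically atypical), so natural markedness predicts overt marking
marked = "the farmer dom the child saw"
unmarked = "the farmer the child saw"

delta = sentence_logprob(marked) - sentence_logprob(unmarked)
print(f"preference for marked variant: {delta:+.3f} (positive = marked preferred)")
```

A model whose behavior tracks the natural markedness direction should assign a higher probability to overt marking on semantically atypical arguments than on typical ones, which is the kind of contrast the paper's minimal-pair comparisons probe.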
Key Points
- ▸ GPT-2 models trained on synthetic corpora show only partial typological alignment in their treatment of differential argument marking (DAM)
- ▸ Models reliably exhibit human-like preferences for natural markedness direction
- ▸ Models do not reproduce the strong object preference in human languages
Merits
Strength in methodology
The study employs a controlled synthetic-language learning paradigm, which allows precise manipulation of the DAM systems in the training data and meaningful comparisons between models.
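As an illustration of what such a controlled synthetic grammar might look like, here is a hedged sketch of a toy DAM corpus generator. The lexicon, the "dom" marker, the SOV word order, and the rule that animate objects receive overt marking are assumptions made for exposition, not a reconstruction of the paper's 18 actual systems.

```python
import random

# Toy lexicon; all items are illustrative placeholders
ANIMATE = ["child", "farmer", "dog"]
INANIMATE = ["stone", "book", "cart"]
VERBS = ["saw", "pushed", "found"]
MARKER = "dom"  # hypothetical overt object marker

def make_sentence(rng: random.Random) -> str:
    """Generate one SOV sentence; animate (atypical) objects get overt marking."""
    subj = rng.choice(ANIMATE)  # subjects are kept prototypically animate
    obj_is_animate = rng.random() < 0.5
    obj = rng.choice(ANIMATE if obj_is_animate else INANIMATE)
    verb = rng.choice(VERBS)
    obj_phrase = f"{MARKER} the {obj}" if obj_is_animate else f"the {obj}"
    return f"the {subj} {obj_phrase} {verb} ."

rng = random.Random(0)
corpus = [make_sentence(rng) for _ in range(100_000)]
print("\n".join(corpus[:5]))
```

Varying which argument is eligible for marking (subject vs. object) and whether typical or atypical arguments receive the marker yields the family of contrasting DAM systems the paper compares.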
Contribution to field
The study provides new evidence on which typological regularities language models can internalize from distributional input alone, supporting the conclusion that different cross-linguistic tendencies may arise from distinct underlying sources.
Demerits
Limitation in scope
The study focuses solely on GPT-2 models, so its findings may not generalize to other architectures, training scales, or natural languages.
Need for longitudinal analysis
The study offers a static snapshot and may not capture how models' typological alignment evolves; a longitudinal analysis would help track such changes over the course of training or across model versions.
Expert Commentary
The study's findings on the typological alignment of language models in DAM are thought-provoking. By revealing a dissociation between natural markedness direction and the object preference, the authors challenge the assumption that typological tendencies in human languages share a single underlying source. The result calls for further investigation into how language models relate to human linguistic preferences, particularly in the context of DAM, and carries implications for natural language processing, language education, and our broader understanding of linguistic universals.
Recommendations
- ✓ Future studies should investigate the role of linguistic diversity and cultural context in shaping the typological alignment of language models.
- ✓ Researchers should explore using language models for cross-linguistic analysis and for comparison with human linguistic preferences, which could support the development of more effective and culturally sensitive language models.