Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study
arXiv:2603.12906v1

Abstract: Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a systematic study of compact language models by extending BabyBERTa to English-French scenarios under strictly size-matched data conditions, covering monolingual, bilingual, and cross-lingual settings. Our design contrasts two types of training corpora: (i) child-directed speech (about 2.5M tokens), following BabyBERTa and related work, and (ii) multi-domain corpora (about 10M tokens), extending the BabyLM framework to French. To enable fair evaluation, we also introduce new resources, including French versions of QAMR and QASRL, as well as English and French multi-domain corpora. We evaluate the models on both syntactic and semantic tasks and compare them with models trained on Wikipedia-only data. The results reveal context-dependent effects: training on Wikipedia consistently benefits semantic tasks, whereas child-directed speech improves grammatical judgments in monolingual settings. Bilingual pretraining yields notable gains for textual entailment, with particularly strong improvements for French. Importantly, similar patterns emerge across BabyBERTa, RoBERTa, and LTG-BERT, suggesting consistent trends across architectures.
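To make the "strictly size-matched data conditions" concrete, here is a minimal sketch of trimming each pretraining corpus to a common token budget before training. It assumes line-oriented text files and a shared subword tokenizer; the file names, the roberta-base tokenizer, and the truncate_to_budget helper are illustrative placeholders, not the paper's actual pipeline.

```python
# Hypothetical sketch of a size-matched data condition: trim every corpus
# to the same token budget with one shared tokenizer. Only the ~2.5M-token
# budget for child-directed speech comes from the abstract; file names and
# the tokenizer checkpoint are placeholders.
from transformers import AutoTokenizer

TOKEN_BUDGET = 2_500_000  # ~2.5M tokens, the child-directed-speech condition

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # stand-in tokenizer

def truncate_to_budget(path: str, budget: int) -> list[str]:
    """Keep whole lines until the cumulative subword count reaches the budget."""
    kept, used = [], 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            n = len(tokenizer.tokenize(line.strip()))
            if used + n > budget:
                break
            kept.append(line.strip())
            used += n
    return kept

# Applying the same budget to every corpus keeps the comparison size-matched.
english_cds = truncate_to_budget("childes_en.txt", TOKEN_BUDGET)  # hypothetical file
french_cds = truncate_to_budget("childes_fr.txt", TOKEN_BUDGET)   # hypothetical file
```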
Executive Summary
This study examines compact language models in multilingual settings by extending BabyBERTa to English-French scenarios under strictly size-matched data conditions. The researchers contrast two types of training corpora, child-directed speech and multi-domain text, and evaluate the resulting models on syntactic and semantic tasks. The results reveal context-dependent effects: Wikipedia training benefits semantic tasks, while child-directed speech improves grammatical judgments in monolingual settings. Bilingual pretraining yields notable gains for textual entailment, particularly for French. Similar patterns hold across BabyBERTa, RoBERTa, and LTG-BERT, suggesting the trends are consistent across architectures, with implications for multilingual model development.
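Since the summary repeatedly refers to "compact" models, the sketch below shows what a BabyBERTa-scale masked LM looks like in the Hugging Face API. The abstract does not report hyperparameters, so the sizes here (8 layers, 256 hidden units, an 8k vocabulary) are assumptions in the range of the original BabyBERTa work, not this paper's configuration.

```python
# Illustrative BabyBERTa-scale masked LM; all sizes are assumed values,
# not the paper's reported configuration.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=8192,              # assumed small subword vocabulary
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    max_position_embeddings=130,  # supports sequences up to 128 tokens
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # a few million, vs. ~125M for RoBERTa-base
```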
Key Points
- ▸ The study extends BabyBERTa to English-French scenarios, covering monolingual, bilingual, and cross-lingual settings under size-matched data conditions.
- ▸ The researchers contrast two types of training corpora: child-directed speech (about 2.5M tokens) and multi-domain corpora (about 10M tokens).
- ▸ The results reveal context-dependent effects: Wikipedia training benefits semantic tasks, while child-directed speech improves grammatical judgments in monolingual settings (see the scoring sketch after this list).
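The abstract does not name the grammaticality benchmark, but grammatical judgments for masked LMs are commonly scored by comparing pseudo-log-likelihoods over minimal pairs, as in BLiMP-style evaluations. The sketch below illustrates that general scoring procedure with a stand-in checkpoint and an invented sentence pair; it is not the paper's evaluation code.

```python
# Minimal-pair grammaticality scoring via pseudo-log-likelihood (PLL):
# mask each token in turn and sum the log-probability the model assigns
# to the original token. The checkpoint and sentence pair are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # stand-in checkpoint
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log-probs of each token when it alone is masked."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

good, bad = "The cats sleep.", "The cats sleeps."  # illustrative minimal pair
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))  # expect True
```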
Merits
Strength in Design
The systematic, size-matched design isolates corpus type as the variable of interest, allowing a controlled comparison of child-directed speech and multi-domain training.
Robust Findings Across Architectures
The consistent trends observed across BabyBERTa, RoBERTa, and LTG-BERT suggest that the findings are robust and generalizable.
Demerits
Limited Generalizability
The findings may not transfer to other languages or language pairs, given the exclusive focus on English-French scenarios.
Data-Dependent Results
The results may be tied to the specific corpora and token budgets used (about 2.5M and 10M tokens) and to the compact model scale, so they may shift under different data or model regimes.
Expert Commentary
The study makes a valuable contribution to compact language model development in multilingual settings. Its central finding is that the choice of pretraining corpus has context-dependent effects: no single corpus type dominates, with Wikipedia favoring semantic tasks and child-directed speech favoring monolingual grammatical judgments. The consistency of these patterns across BabyBERTa, RoBERTa, and LTG-BERT is a notable strength. The main limitation is scope: with only the English-French pair examined, it remains open whether the effects hold for other, especially typologically more distant, language pairs. Future work should replicate the findings in other languages and test whether they carry over to practical applications.
Recommendations
- ✓ Future studies should investigate the effects of training corpora on model performance in other languages and multilingual settings.
- ✓ The development of language models for real-world applications should take into account the context-dependent effects observed in this study.