A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
arXiv:2602.22014v1 Announce Type: new Abstract: Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets that are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on ModernBERT pre-training, which we carry out in this study with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms in order to pick the best one. We find that diversity-driven sampling can gain up to 10 points on some tasks relative to randomly sampled pre-training data of commensurate size. We also see that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens can match the performance of a model pre-trained for 1,775h on a randomly sampled dataset of 2.4B tokens.
Executive Summary
This study investigates the impact of diversity on ModernBERT pre-training, with a focus on reducing pre-training dataset size while maintaining performance. The authors compare diversity-driven sampling algorithms and find that such sampling can yield significant performance gains: a model pre-trained for 483h on a diversity-driven dataset of 150M tokens achieves performance comparable to a model pre-trained for 1,775h on a randomly sampled dataset of 2.4B tokens. The study highlights the potential benefits of prioritizing diversity over sheer size in pre-training datasets.
Key Points
- ▸ Diversity-driven sampling can lead to significant performance gains in ModernBERT pre-training
- ▸ Reducing pre-training dataset size while maintaining performance is possible with diversity-driven sampling
- ▸ A model pre-trained on a diversity-driven dataset of 150M tokens can achieve comparable performance to a model pre-trained on a randomly sampled dataset of 2.4B tokens
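To make the idea of diversity-driven sampling concrete, the sketch below shows one possible strategy: greedy farthest-point selection over document embeddings, where each new document is the one maximizing its minimum cosine distance to those already chosen. The paper compares several diversity-driven algorithms without this summary specifying them, so this particular algorithm, and the function names `cosine_distance` and `diversity_sample`, are illustrative assumptions rather than the authors' method.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def diversity_sample(embeddings, k):
    """Greedily pick k documents, each maximizing its minimum
    cosine distance to the documents already selected
    (farthest-point sampling; an illustrative assumption, not
    necessarily the paper's algorithm)."""
    selected = [0]  # seed with the first document
    while len(selected) < k:
        best_idx, best_score = None, -1.0
        for i in range(len(embeddings)):
            if i in selected:
                continue
            # distance to the closest already-selected document
            score = min(cosine_distance(embeddings[i], embeddings[j])
                        for j in selected)
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected
```

For example, given four toy 2-D embeddings where the second is nearly identical to the first, the sampler skips the near-duplicate and picks the most dissimilar document instead, which is exactly the behavior that lets a small diverse subset stand in for a much larger random one.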
Merits
Improved Performance
The study demonstrates that diversity-driven sampling can lead to significant performance gains in ModernBERT pre-training, with gains of up to 10 points on some tasks over randomly sampled pre-training data of commensurate size.
Demerits
Limited Generalizability
The study focuses on a specific model (ModernBERT) and dataset, which may limit the generalizability of the findings to other models and datasets.
Expert Commentary
The study's findings highlight the importance of diversity in pre-training datasets and demonstrate the potential benefits of prioritizing diversity in machine learning model development. The use of diversity-driven sampling algorithms can lead to significant performance gains, while reducing the computational resources required for pre-training. However, further research is needed to fully explore the generalizability of these findings and to develop more efficient and effective diversity-driven sampling methods.
Recommendations
- ✓ Future studies should investigate the application of diversity-driven sampling to other machine learning models and datasets
- ✓ Researchers should explore the development of more efficient and effective diversity-driven sampling methods, such as those that incorporate multiple diversity metrics or that use active learning techniques.