Scale Dependent Data Duplication

arXiv:2603.06603v1 Abstract: Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g., translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy loss gradients for semantically equivalent documents become more aligned. Smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, we embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m. For moderate corpus sizes, the cosine similarity between nearest neighbors follows an isotropic power law baseline. However, as corpus size grows to hundreds of billions of tokens, the nearest-neighbor similarities deviate sharply, indicating accelerated semantic collisions. Finally, controlled pretraining on data sampled with replacement from pools of finite unique documents shows that limited uniqueness yields mild degradation for small models, but rapidly increasing loss penalties for larger models, breaking naive scaling extrapolation. We derive explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus. Our results identify and resolve an unstudied source of scale-dependence, allowing for more accurate prediction at scale.
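The gradient-alignment measurement the abstract describes can be illustrated with a toy model (not the paper's setup): two hypothetical bag-of-words documents, a small linear classifier standing in for a language model, and the cosine similarity between their flattened cross-entropy gradients. In this linear toy, alignment is driven entirely by token overlap, which mirrors the "small model" regime the authors describe.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical setup: a linear model W scoring bag-of-words documents.
# doc_a and doc_b stand in for two semantically equivalent documents
# (e.g., a text and its translation) that share NO surface tokens but
# map to the same target label.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # 4 classes, vocabulary of 8 tokens
doc_a = np.array([3., 0., 1., 0., 2., 0., 0., 0.])
doc_b = np.array([0., 3., 0., 1., 0., 2., 0., 0.])
target = 2                            # shared "meaning" label

def grad(doc):
    """Cross-entropy gradient of the toy linear classifier w.r.t. W."""
    logits = W @ doc
    p = np.exp(logits - logits.max())
    p /= p.sum()                      # softmax probabilities
    p[target] -= 1.0                  # dL/dlogits for cross-entropy
    return np.outer(p, doc).ravel()   # dL/dW, flattened

alignment = cosine(grad(doc_a), grad(doc_b))
```

Because the gradient here factors as an outer product of the error vector and the token counts, documents with disjoint token supports have exactly zero gradient alignment despite the shared label: a surface-level model "sees" no duplication. The paper's first finding is that sufficiently capable models break this pattern, aligning gradients by meaning rather than by shared tokens.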

Executive Summary

This article summarizes a study of data duplication during the pretraining of large-scale machine learning models. The authors note that duplication can lead to memorization and degrade generalization, motivating aggressive deduplication pipelines, but argue that what counts as a duplicate is scale-dependent: as model capability increases, semantically equivalent documents (e.g., translations) increasingly behave like exact duplicates during training. Embedding all 192 million FineWeb-Edu-Dedup documents with EmbeddingGemma-300m, the authors show that nearest-neighbor cosine similarities deviate sharply from an isotropic baseline as corpus size grows to hundreds of billions of tokens. They also derive explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus.

Key Points

  • Data duplication during pretraining can lead to memorization and degrade generalization.
  • What constitutes a duplicate is scale-dependent: as model capability increases, semantically equivalent documents increasingly act like exact duplicates.
  • Gradient-alignment analysis, large-scale embedding of 192 million documents, and controlled pretraining on pools of finite unique documents demonstrate the effects of limited semantic uniqueness on model performance.

Merits

Strength in Methodology

The study combines three complementary methods: measuring gradient alignment between semantically equivalent documents, analyzing nearest-neighbor cosine similarities over all 192 million FineWeb-Edu-Dedup documents embedded with EmbeddingGemma-300m, and running controlled pretraining experiments on data sampled with replacement from pools of finite unique documents.
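The isotropic baseline mentioned in the abstract can be sketched numerically (a simplified stand-in, not the paper's analysis): for random unit vectors, the mean cosine similarity to the nearest neighbor grows only slowly with corpus size, so a real embedded corpus whose nearest-neighbor similarities climb sharply above this curve signals semantic collisions. Dimensions and corpus sizes below are illustrative.

```python
import numpy as np

def mean_nn_similarity(n, d, rng):
    """Mean cosine similarity to the nearest neighbor among n random
    (isotropic) unit vectors in d dimensions: the null baseline with
    no semantic structure at all."""
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T                       # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)      # exclude self-similarity
    return float(S.max(axis=1).mean())

rng = np.random.default_rng(1)
# Nearest-neighbor similarity under isotropy rises slowly as the
# "corpus" grows; sharp deviations above this curve in a real corpus
# indicate accelerated semantic collisions, per the paper's finding.
sims = [mean_nn_similarity(n, d=64, rng=rng) for n in (100, 400, 1600)]
```

This brute-force O(n²) similarity matrix is only feasible for toy sizes; at the paper's scale (192 million documents) approximate nearest-neighbor search would be required.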

Contribution to Field

The study provides a new understanding of the phenomenon of data duplication during pretraining and identifies an unstudied source of scale-dependence.

Implications for Practice

The study provides explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus.
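The paper's exact scaling laws are not reproduced in this summary, but the basic mechanism of the controlled experiments, sampling with replacement from a finite pool of unique documents, can be sketched with a standard coupon-collector expectation. The pool and sample sizes below are hypothetical.

```python
def expected_unique_fraction(pool_size, num_samples):
    """Expected fraction of draws that hit distinct documents when
    sampling num_samples times with replacement from pool_size unique
    documents (standard expectation, not the paper's scaling law)."""
    expected_unique = pool_size * (1 - (1 - 1 / pool_size) ** num_samples)
    return expected_unique / num_samples

# Hypothetical run: training for 4x more samples than there are unique
# documents means most updates revisit already-seen data, roughly
# (1 - e^-4) / 4, or about 25% effective uniqueness.
f = expected_unique_fraction(pool_size=1_000_000, num_samples=4_000_000)
```

Per the paper's third finding, small models tolerate this repetition with only mild degradation, while larger models incur rapidly increasing loss penalties, which is why naive scaling extrapolation breaks on semantically redundant corpora.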

Demerits

Limitation in Generalizability

The study is conducted on a single corpus (FineWeb-Edu-Dedup) with a single embedding model, and its findings may not generalize to other domains, languages, or datasets.

Assumptions about Model Capability

The study assumes that model capability increases linearly with scale, which may not be the case in real-world scenarios.

Expert Commentary

This study provides a valuable contribution to the field of machine learning by identifying a previously unstudied source of scale-dependence. The authors' combination of large-scale embedding analysis and controlled pretraining experiments is a significant strength. However, the study's assumptions about model capability and the limits on generalizability may constrain its applicability to real-world scenarios. Nonetheless, its findings have important implications for machine learning practitioners and policymakers alike.

Recommendations

  • Future studies should investigate the effects of duplication on model performance in other domains and datasets to improve generalizability.
  • Machine learning practitioners should consider using deduplication pipelines and evaluating their models for signs of memorization and overfitting.