
Data-efficient pre-training by scaling synthetic megadocs

arXiv:2603.18534v1 Announce Type: new Abstract: Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.

Executive Summary

This article proposes a novel approach to pre-training language models by leveraging synthetic data augmentation and scaling, particularly in scenarios where data availability, rather than compute, is the binding constraint. The authors demonstrate that mixing synthetically generated rephrases with web data improves i.i.d. validation loss on the web data, achieves better loss scaling, and raises data efficiency to roughly 1.48× at 32 rephrases per document. The introduction of 'megadocs' – substantially longer documents formed by stitching together synthetic rephrases of the same web document, or by stretching a document with inserted rationales – further increases data efficiency (to 1.80× at 32 generations per document) and improves long-context loss. The study shows how synthetic data algorithms can be designed to benefit more from increased compute when data is constrained, with significant implications for building more efficient and effective language models.

Key Points

  • Mixing web data with synthetically generated rephrases improves i.i.d. validation loss on the web data, even though the synthetic data comes from a different distribution.
  • Megadocs, formed by stitching together synthetic rephrases or stretching documents with rationales, improve i.i.d. loss, downstream benchmarks, and long-context loss, raising data efficiency from 1.48× to 1.80× at 32 generations per document.
  • The advantage of megadocs over simple rephrasing widens as more synthetic data is generated.
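The two megadoc constructions named in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the separator string, and the sentence/rationale interleaving scheme are all assumptions.

```python
# Hypothetical sketch of the two megadoc constructions described in the
# abstract. Names, separators, and structure are illustrative assumptions.

SEP = "\n\n"  # assumed separator between rephrases inside a megadoc


def stitch_megadoc(rephrases: list[str]) -> str:
    """Stitching: concatenate N synthetic rephrases of one web document
    into a single long training document instead of N short documents."""
    return SEP.join(rephrases)


def stretch_megadoc(sentences: list[str], rationales: list[str]) -> str:
    """Stretching: lengthen one document by inserting a generated
    rationale after each of its sentences (one-to-one pairing assumed)."""
    parts = []
    for sentence, rationale in zip(sentences, rationales):
        parts.append(sentence)
        parts.append(rationale)
    return " ".join(parts)


# Usage: 32 rephrases of one source document become one long megadoc,
# rather than 32 separate short training documents.
rephrases = [f"Rephrase {i} of the source document." for i in range(32)]
megadoc = stitch_megadoc(rephrases)
```

The key design point the paper emphasizes is that both constructions keep generations from the same source document in a single long training example, which is what drives the improved long-context loss relative to treating each rephrase as its own short document.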

Merits

Improved Data Efficiency

The proposed approach achieves better loss scaling and increases data efficiency, allowing for more effective pre-training with limited data.

Scalability

The study demonstrates that the proposed method can benefit from increased compute when data is constrained, making it a scalable solution for large-scale language model development.

Flexibility

The introduction of megadocs provides a flexible framework for constructing synthetic data, allowing researchers to explore various methods for generating longer, more informative documents.

Demerits

Limited Generalizability

The experiments center on web-text pre-training with a particular dataset, and it is unclear whether the proposed approach generalizes to other domains, data mixtures, or model scales.

Computational Requirements

The method requires significant compute to generate and process large volumes of synthetic data, which may put it out of reach for researchers with limited computational resources.

Evaluation Metrics

The study relies on a limited set of evaluation metrics, and it is unclear whether the proposed approach would perform well on other metrics or tasks.

Expert Commentary

This article makes a significant contribution to the field of natural language processing by proposing a novel approach to pre-training language models using synthetic data augmentation and scaling. The study demonstrates that this approach can improve data efficiency and long-context loss, and the megadoc framework offers a flexible way to construct synthetic training data. While the work has limitations, including uncertain generalizability and substantial computational requirements, it provides valuable insight into the role of synthetic data generation in large-scale language model development. The findings matter most in scenarios where data availability, not compute, limits pre-training.

Recommendations

  • Further research is needed to explore the generalizability of the proposed approach to other domains and tasks.
  • The computational cost of generating synthetic data should be carefully measured and optimized to make the method accessible to researchers with limited compute.
