Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

arXiv:2602.19548v1 Announce Type: new Abstract: One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.

Executive Summary

This article challenges the conventional practice of applying a single HTML-to-text extractor when building pretraining corpora for large language models (LLMs). The authors show that combining multiple extractors can substantially increase token yield and improve downstream task performance, particularly on structured content such as tables and code blocks. By taking a union over different extractors, the token yield of DCLM-Baseline can be increased by up to 71% without compromising benchmark performance, with clear implications for the design of more effective and efficient LLM pretraining datasets.

Key Points

  • The use of a single fixed extractor for HTML-to-text extraction can lead to suboptimal coverage and utilization of Internet data
  • Different extractors can yield similar model performance on standard language understanding tasks, yet the sets of pages that survive a fixed filtering pipeline can differ substantially between extractors
  • Extractor choice can significantly impact downstream task performance on structured content such as tables and code blocks, with differences of up to 10 percentage points on WikiTQ and 3 points on HumanEval
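To make the structured-content point concrete, here is a minimal, hypothetical illustration (not any extractor from the paper) of how two extraction strategies can diverge on the same HTML table: a naive tag-stripper loses row and cell boundaries, while a table-aware pass preserves them.

```python
import re

HTML = "<p>Scores:</p><table><tr><td>A</td><td>1</td></tr><tr><td>B</td><td>2</td></tr></table>"

def strip_tags(html):
    # Naive extractor: drop all markup; table structure is flattened away,
    # so cell and row boundaries become ambiguous whitespace.
    return re.sub(r"<[^>]+>", " ", html)

def table_aware(html):
    # Structure-preserving extractor: render cells as pipe-separated values
    # and rows as separate lines before stripping the remaining tags.
    html = re.sub(r"</td>\s*<td>", " | ", html)
    html = re.sub(r"</tr>", "\n", html)
    return re.sub(r"<[^>]+>", "", html)

print(strip_tags(HTML))   # cells run together; row boundaries lost
print(table_aware(HTML))  # "A | 1" and "B | 2" appear on separate rows
```

A downstream model pretrained on the first output never sees the table's row/column alignment, which is one plausible mechanism for the WikiTQ gap the paper reports.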

Merits

Improved Token Yield

The proposed approach of taking a union over different extractors can increase the token yield of DCLM-Baseline by up to 71%, leading to more comprehensive and diverse pretraining datasets
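As a sketch of how such a union could be computed (illustrative only; the function names, toy extractors, and length-based filter below are assumptions, not the paper's actual DCLM pipeline): a page is retained if the output of at least one extractor passes the quality filter.

```python
def union_of_extractors(pages, extractors, passes_filter):
    """Keep each page's first extractor output that survives filtering.

    pages: dict mapping URL -> raw HTML
    extractors: list of callables, html -> extracted text
    passes_filter: callable, text -> bool (stand-in quality filter)
    """
    kept = {}
    for url, html in pages.items():
        for extract in extractors:
            text = extract(html)
            if passes_filter(text):
                kept[url] = text  # survives under at least one extractor
                break
    return kept

# Toy demo: extractor A only recovers prose-heavy pages, extractor B
# recovers more text per page; the filter demands a minimum length.
pages = {"u1": "long prose page " * 3, "u2": "tbl"}
extract_a = lambda h: h if "prose" in h else ""
extract_b = lambda h: h * 10
passes = lambda t: len(t) >= 15

single = union_of_extractors(pages, [extract_a], passes)
both = union_of_extractors(pages, [extract_a, extract_b], passes)
```

In the toy run, the single-extractor corpus keeps only `u1`, while the union also recovers `u2`, mirroring the paper's observation that different extractors rescue different pages from the filter.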

Demerits

Increased Complexity

The use of multiple extractors may add complexity to the pretraining pipeline, potentially requiring additional computational resources and infrastructure

Expert Commentary

This article makes a significant contribution to the field of natural language processing by challenging conventional wisdom and demonstrating the benefits of using multiple extractors for HTML-to-text extraction. The authors' approach has important implications for the development of more effective and efficient LLM pretraining datasets, and their findings are likely to influence the design of future pretraining pipelines. However, the increased complexity of using multiple extractors must be carefully managed to ensure that the benefits are realized in practice. Further research is needed to explore the optimal combination of extractors and filtering pipelines for different downstream tasks and applications.

Recommendations

  • Researchers and practitioners should consider using multiple extractors and taking a union over their outputs to increase token yield and improve downstream task performance
  • Future studies should investigate the optimal combination of extractors and filtering pipelines for different downstream tasks and applications, and develop guidelines and standards for LLM pretraining datasets
