Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

arXiv:2602.19548v1 Announce Type: new Abstract: One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.

Executive Summary

This article challenges the conventional practice of applying a single HTML-to-text extractor when building pretraining corpora for large language models (LLMs). The authors show that combining multiple extractors can substantially increase token yield and improve downstream task performance, particularly on structured content such as tables and code blocks. By taking a union over different extractors, the token yield of DCLM-Baseline can be increased by up to 71% without compromising benchmark performance, with clear implications for the design of more effective and efficient LLM pretraining datasets.

Key Points

  • The use of a single fixed extractor for HTML-to-text extraction can lead to suboptimal coverage and utilization of Internet data
  • Different extractors can yield similar model performance on standard language understanding tasks, yet the sets of pages that survive a fixed filtering pipeline can differ substantially between extractors
  • Extractor choice can significantly impact downstream task performance on structured content such as tables and code blocks, with differences of up to 10 percentage points on WikiTQ and 3 points on HumanEval
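To make the structured-content point concrete, here is a minimal, hypothetical illustration (not any extractor from the paper) of how two extraction strategies can diverge on the same HTML table: a naive tag-stripper loses row and cell boundaries, while a table-aware pass preserves them.

```python
import re

HTML = "<p>Scores:</p><table><tr><td>A</td><td>1</td></tr><tr><td>B</td><td>2</td></tr></table>"

def strip_tags(html):
    # Naive extractor: drop all markup; table structure is flattened away,
    # so cell and row boundaries become ambiguous whitespace.
    return re.sub(r"<[^>]+>", " ", html)

def table_aware(html):
    # Structure-preserving extractor: render cells as pipe-separated values
    # and rows as separate lines before stripping the remaining tags.
    html = re.sub(r"</td>\s*<td>", " | ", html)
    html = re.sub(r"</tr>", "\n", html)
    return re.sub(r"<[^>]+>", "", html)

print(strip_tags(HTML))   # cells run together; row boundaries lost
print(table_aware(HTML))  # "A | 1" and "B | 2" appear on separate rows
```

A downstream model pretrained on the first output never sees the table's row/column alignment, which is one plausible mechanism for the WikiTQ gap the paper reports.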

Merits

Improved Token Yield

The proposed approach of taking a union over different extractors can increase the token yield of DCLM-Baseline by up to 71%, leading to more comprehensive and diverse pretraining datasets
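As a sketch of how such a union could be computed (illustrative only; the function names, toy extractors, and length-based filter below are assumptions, not the paper's actual DCLM pipeline): a page is retained if the output of at least one extractor passes the quality filter.

```python
def union_of_extractors(pages, extractors, passes_filter):
    """Keep each page's first extractor output that survives filtering.

    pages: dict mapping URL -> raw HTML
    extractors: list of callables, html -> extracted text
    passes_filter: callable, text -> bool (stand-in quality filter)
    """
    kept = {}
    for url, html in pages.items():
        for extract in extractors:
            text = extract(html)
            if passes_filter(text):
                kept[url] = text  # survives under at least one extractor
                break
    return kept

# Toy demo: extractor A only recovers prose-heavy pages, extractor B
# recovers more text per page; the filter demands a minimum length.
pages = {"u1": "long prose page " * 3, "u2": "tbl"}
extract_a = lambda h: h if "prose" in h else ""
extract_b = lambda h: h * 10
passes = lambda t: len(t) >= 15

single = union_of_extractors(pages, [extract_a], passes)
both = union_of_extractors(pages, [extract_a, extract_b], passes)
```

In the toy run, the single-extractor corpus keeps only `u1`, while the union also recovers `u2`, mirroring the paper's observation that different extractors rescue different pages from the filter.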

Demerits

Increased Complexity

The use of multiple extractors may add complexity to the pretraining pipeline, potentially requiring additional computational resources and infrastructure

Expert Commentary

This article makes a significant contribution to the field of natural language processing by challenging conventional wisdom and demonstrating the benefits of using multiple extractors for HTML-to-text extraction. The authors' approach has important implications for the development of more effective and efficient LLM pretraining datasets, and their findings are likely to influence the design of future pretraining pipelines. However, the increased complexity of using multiple extractors must be carefully managed to ensure that the benefits are realized in practice. Further research is needed to explore the optimal combination of extractors and filtering pipelines for different downstream tasks and applications.

Recommendations

  • Researchers and practitioners should consider using multiple extractors and taking a union over their outputs to increase token yield and improve downstream task performance
  • Future studies should investigate the optimal combination of extractors and filtering pipelines for different downstream tasks and applications, and develop guidelines and standards for LLM pretraining datasets
