Articles

Academic · 1 min

Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

arXiv:2602.19548v1. Abstract: One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense …

Jeffrey Li, Josh Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pour Ansari, Fartash Faghri
Academic · 1 min

DEEP: Docker-based Execution and Evaluation Platform

arXiv:2602.19583v1. Abstract: Comparative evaluation of several systems is a recurrent task in research. It is a key step before deciding which system …

Sergio Gómez González, Miguel Domingo, Francisco Casacuberta