Skip to main content

Category

Academic

Academic · 1 min

TurkicNLP: An NLP Toolkit for Turkic Languages

arXiv:2602.19174v1 Announce Type: new Abstract: Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most …

Sherzod Hakimov
4 views
Academic · 1 min

Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

arXiv:2602.19548v1 Announce Type: new Abstract: One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense …

Jeffrey Li, Josh Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pour Ansari, Fartash Faghri
7 views