propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
arXiv:2602.12414v1. Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A …
Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries