propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
arXiv:2602.12414v1 Announce Type: new Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.
Executive Summary
The paper introduces propella-1, a family of small multilingual LLMs (0.6B, 1.7B, and 4B parameters) that annotate text documents across 18 properties organized into six categories. The models support 57 languages and produce structured JSON annotations, offering a more nuanced alternative to the single scalar quality scores that have dominated data curation for LLM pretraining. Evaluated against a frontier commercial LLM used as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. The authors also release propella-annotations, a dataset of over three billion document annotations, and use it to show substantial differences in quality, reasoning depth, and content composition across widely used pretraining datasets that single-score approaches cannot capture.
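To make the multi-property idea concrete, the sketch below shows what a structured annotation in this style might look like and how each dimension can be read independently. The category names follow the paper's six categories, but the individual property names and value scales are illustrative assumptions, not the actual propella-1 schema.

```python
import json

# Hypothetical multi-property annotation in the style the paper describes:
# properties grouped under six categories, emitted as structured JSON.
# Property names and value scales below are assumptions for illustration.
annotation_json = """
{
  "core_content": {"topic": "astronomy", "format": "tutorial"},
  "classification": {"language": "en", "document_type": "web_article"},
  "quality_and_value": {"quality_score": 4, "reasoning_depth": 3},
  "audience_and_purpose": {"audience": "general", "purpose": "educational"},
  "safety_and_compliance": {"toxicity": "none"},
  "geographic_relevance": {"region": "global"}
}
"""

annotation = json.loads(annotation_json)

# Unlike a single scalar score, each dimension is inspectable on its own.
for category, properties in annotation.items():
    print(category, properties)
```

Because the output conforms to a fixed schema, downstream curation pipelines can validate and index annotations mechanically rather than parsing free text.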
Key Points
- Introduction of propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) for multi-property document annotation.
- Annotation across 18 properties in six categories, supporting 57 languages and producing structured JSON outputs.
- Evaluation of the 4B model against a frontier commercial LLM used as a reference annotator, achieving higher agreement than much larger general-purpose models.
- Release of the propella-annotations dataset, with over three billion document annotations covering FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC.
- Multi-dimensional analysis revealing substantial differences among pretraining datasets that single-score methods cannot capture.
Merits
Comprehensive Annotation Framework
The multi-property annotation approach provides a more detailed and flexible framework for data curation, addressing the limitations of single scalar quality scores.
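The flexibility claim can be illustrated with a small filtering sketch: a conjunction of criteria across independent properties that no single scalar score can express. The field names and thresholds here are assumptions for illustration, not the paper's schema.

```python
# Sketch of multi-property filtering: keep documents that are high quality
# AND educational AND non-toxic. A single scalar score would conflate these
# dimensions and could not express this conjunction.
# Field names and thresholds are illustrative assumptions.

documents = [
    {"id": "a", "quality_score": 5, "purpose": "educational", "toxicity": "none"},
    {"id": "b", "quality_score": 5, "purpose": "advertising", "toxicity": "none"},
    {"id": "c", "quality_score": 2, "purpose": "educational", "toxicity": "none"},
]

def keep(doc):
    # Each criterion targets a different annotated property.
    return (
        doc["quality_score"] >= 4
        and doc["purpose"] == "educational"
        and doc["toxicity"] == "none"
    )

kept = [d["id"] for d in documents if keep(d)]
print(kept)  # only document "a" passes all three criteria
```

Changing the curation target (say, keeping advertising copy for a domain-specific corpus) only means editing the predicate, not re-annotating the data.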
Multilingual Support
The models' support for 57 languages enhances their applicability and utility in diverse linguistic contexts.
High Agreement with the Reference Annotator
Evaluated against a frontier commercial LLM used as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models, validating its effectiveness as a small, deployable annotator.
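A minimal sketch of how such an evaluation might be scored: per-property agreement between a model's annotations and those of a reference annotator. The exact metric and property set used in the paper are not specified here, so this helper and its inputs are assumptions.

```python
from collections import Counter

def per_property_agreement(model_anns, reference_anns):
    """Fraction of documents where the model's label matches the reference,
    computed independently for each annotated property (assumed metric)."""
    counts = Counter()
    matches = Counter()
    for model_doc, ref_doc in zip(model_anns, reference_anns):
        for prop, ref_value in ref_doc.items():
            counts[prop] += 1
            matches[prop] += int(model_doc.get(prop) == ref_value)
    return {prop: matches[prop] / counts[prop] for prop in counts}

# Toy example with two documents and two hypothetical properties.
model = [{"audience": "general", "quality": 4}, {"audience": "expert", "quality": 2}]
reference = [{"audience": "general", "quality": 4}, {"audience": "general", "quality": 2}]
print(per_property_agreement(model, reference))  # {'audience': 0.5, 'quality': 1.0}
```

Reporting agreement per property, rather than one aggregate number, mirrors the paper's argument for multi-dimensional evaluation over single scores.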
Demerits
Model Size and Scalability
While the 4B model shows strong agreement, the paper's headline result is reported for the largest variant; the annotation accuracy of the smaller 0.6B and 1.7B models on complex properties is less established.
Data Bias and Representation
The study does not extensively address potential biases in the pretraining datasets or the models' annotations, which could impact the reliability of the results.
Commercial Viability
The practical implementation and adoption of these models in commercial settings may face challenges related to integration and cost.
Expert Commentary
propella-1 represents a meaningful advance in data curation for LLM pretraining. By moving away from single scalar quality scores, the work provides a more nuanced and interpretable framework for annotating text documents: the multi-property approach allows flexible filtering and a finer-grained view of the quality dimensions within pretraining corpora. The evaluation against a frontier commercial LLM used as a reference annotator underscores the method's effectiveness, with the 4B model achieving higher agreement than much larger general-purpose models. However, limitations remain: potential biases in the underlying datasets and in the annotations themselves are not deeply examined, and the smaller models' performance on complex annotation tasks warrants further investigation. The release of the propella-annotations dataset, covering over three billion documents, is a valuable resource for researchers and practitioners, offering insight into the composition and quality of widely used pretraining corpora. This work sets a useful precedent for future research in data curation, emphasizing comprehensive, multilingual annotation as a path to more robust and diverse training data.
Recommendations
- Further research should address potential biases in the pretraining corpora and in the models' annotations to ensure fair and reliable curation outcomes.
- Future studies should evaluate how well the smaller 0.6B and 1.7B models hold up on complex annotation tasks, to broaden the family's applicability.