propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
arXiv:2602.12414v1 Announce Type: new Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.
Executive Summary
The paper introduces propella-1, a family of small multilingual LLMs (0.6B, 1.7B, and 4B parameters) that annotate text documents across 18 properties organized into six categories. The models support 57 languages and produce structured JSON annotations, offering a more nuanced alternative to the single scalar quality scores that have dominated data curation for LLM pretraining. Evaluated against a frontier commercial LLM used as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. The authors also release propella-annotations, a dataset of over three billion document annotations, and use it to show substantial differences in quality, reasoning depth, and content composition across widely used pretraining datasets that single-score approaches cannot capture.
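To make the multi-property idea concrete, the sketch below shows what a structured annotation in this style might look like and how each dimension can be read independently. The category names follow the paper's six categories, but the individual property names and value scales are illustrative assumptions, not the actual propella-1 schema.

```python
import json

# Hypothetical multi-property annotation in the style the paper describes:
# properties grouped under six categories, emitted as structured JSON.
# Property names and value scales below are assumptions for illustration.
annotation_json = """
{
  "core_content": {"topic": "astronomy", "format": "tutorial"},
  "classification": {"language": "en", "document_type": "web_article"},
  "quality_and_value": {"quality_score": 4, "reasoning_depth": 3},
  "audience_and_purpose": {"audience": "general", "purpose": "educational"},
  "safety_and_compliance": {"toxicity": "none"},
  "geographic_relevance": {"region": "global"}
}
"""

annotation = json.loads(annotation_json)

# Unlike a single scalar score, each dimension is inspectable on its own.
for category, properties in annotation.items():
    print(category, properties)
```

Because the output conforms to a fixed schema, downstream curation pipelines can validate and index annotations mechanically rather than parsing free text.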
Key Points
- Introduction of propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) for multi-property document annotation.
- Annotation across 18 properties in six categories, supporting 57 languages and producing structured JSON outputs.
- Evaluation of the 4B model against a frontier commercial LLM used as a reference annotator, achieving higher agreement than much larger general-purpose models.
- Release of the propella-annotations dataset, with over three billion document annotations covering FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC.
- Multi-dimensional analysis revealing substantial differences among pretraining datasets that single-score methods cannot capture.
Merits
Comprehensive Annotation Framework
The multi-property annotation approach provides a more detailed and flexible framework for data curation, addressing the limitations of single scalar quality scores.
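The flexibility claim can be illustrated with a small filtering sketch: a conjunction of criteria across independent properties that no single scalar score can express. The field names and thresholds here are assumptions for illustration, not the paper's schema.

```python
# Sketch of multi-property filtering: keep documents that are high quality
# AND educational AND non-toxic. A single scalar score would conflate these
# dimensions and could not express this conjunction.
# Field names and thresholds are illustrative assumptions.

documents = [
    {"id": "a", "quality_score": 5, "purpose": "educational", "toxicity": "none"},
    {"id": "b", "quality_score": 5, "purpose": "advertising", "toxicity": "none"},
    {"id": "c", "quality_score": 2, "purpose": "educational", "toxicity": "none"},
]

def keep(doc):
    # Each criterion targets a different annotated property.
    return (
        doc["quality_score"] >= 4
        and doc["purpose"] == "educational"
        and doc["toxicity"] == "none"
    )

kept = [d["id"] for d in documents if keep(d)]
print(kept)  # only document "a" passes all three criteria
```

Changing the curation target (say, keeping advertising copy for a domain-specific corpus) only means editing the predicate, not re-annotating the data.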
Multilingual Support
The models' support for 57 languages enhances their applicability and utility in diverse linguistic contexts.
High Agreement with the Reference Annotator
Evaluated against a frontier commercial LLM used as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models, validating its effectiveness as a small, deployable annotator.
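A minimal sketch of how such an evaluation might be scored: per-property agreement between a model's annotations and those of a reference annotator. The exact metric and property set used in the paper are not specified here, so this helper and its inputs are assumptions.

```python
from collections import Counter

def per_property_agreement(model_anns, reference_anns):
    """Fraction of documents where the model's label matches the reference,
    computed independently for each annotated property (assumed metric)."""
    counts = Counter()
    matches = Counter()
    for model_doc, ref_doc in zip(model_anns, reference_anns):
        for prop, ref_value in ref_doc.items():
            counts[prop] += 1
            matches[prop] += int(model_doc.get(prop) == ref_value)
    return {prop: matches[prop] / counts[prop] for prop in counts}

# Toy example with two documents and two hypothetical properties.
model = [{"audience": "general", "quality": 4}, {"audience": "expert", "quality": 2}]
reference = [{"audience": "general", "quality": 4}, {"audience": "general", "quality": 2}]
print(per_property_agreement(model, reference))  # {'audience': 0.5, 'quality': 1.0}
```

Reporting agreement per property, rather than one aggregate number, mirrors the paper's argument for multi-dimensional evaluation over single scores.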
Demerits
Model Size and Scalability
While the 4B model shows strong agreement, the paper's headline result is reported for the largest variant; the annotation accuracy of the smaller 0.6B and 1.7B models on complex properties is less established.
Data Bias and Representation
The study does not extensively address potential biases in the pretraining datasets or the models' annotations, which could impact the reliability of the results.
Commercial Viability
The practical implementation and adoption of these models in commercial settings may face challenges related to integration and cost.
Expert Commentary
propella-1 represents a meaningful advance in data curation for LLM pretraining. By moving away from single scalar quality scores, the work provides a more nuanced and interpretable framework for annotating text documents: the multi-property approach allows flexible filtering and a finer-grained view of the quality dimensions within pretraining corpora. The evaluation against a frontier commercial LLM used as a reference annotator underscores the method's effectiveness, with the 4B model achieving higher agreement than much larger general-purpose models. However, limitations remain: potential biases in the underlying datasets and in the annotations themselves are not deeply examined, and the smaller models' performance on complex annotation tasks warrants further investigation. The release of the propella-annotations dataset, covering over three billion documents, is a valuable resource for researchers and practitioners, offering insight into the composition and quality of widely used pretraining corpora. This work sets a useful precedent for future research in data curation, emphasizing comprehensive, multilingual annotation as a path to more robust and diverse training data.
Recommendations
- Further research should address potential biases in the pretraining corpora and in the models' annotations to ensure fair and reliable curation outcomes.
- Future studies should evaluate how well the smaller 0.6B and 1.7B models hold up on complex annotation tasks, to broaden the family's applicability.