PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
arXiv:2603.16354v1 Announce Type: new Abstract: We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.
Executive Summary
PashtoCorp is a significant contribution to low-resource NLP: a 1.25-billion-word Pashto corpus, 83x larger than the previously largest dedicated Pashto resource, assembled via a reproducible pipeline combining 32 purpose-built web scrapers, seven HuggingFace datasets, Arabic-script tokenization, SHA-256 deduplication, and quality filtering. The corpus yields measurable downstream gains: continued MLM pretraining on PashtoCorp reduces held-out perplexity by 25.1% and improves NER entity F1 by 10% relative, while cutting training variance nearly 7x. The release of corpus, trained model, and code via HuggingFace and GitHub supports reproducibility. Notably, a leave-one-out ablation reveals Wikipedia's disproportionate impact on NER performance, underscoring the critical role of source selection in low-resource corpus design.
Key Points
- 1.25B-word corpus is 40x larger than the OSCAR Pashto subset and 83x larger than the prior largest dedicated Pashto corpus
- Reproducible pipeline includes Arabic-script tokenization, SHA-256 deduplication, and quality filtering
- Continued pretraining on PashtoCorp yields measurable improvements in perplexity, NER F1, and training stability
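The deduplication and filtering steps named above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the normalization rule and the 20-word length threshold are assumptions chosen for clarity.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", text).strip()

def dedup_and_filter(docs, min_words=20):
    """SHA-256 exact deduplication plus a simple length-based quality filter.

    The min_words threshold is a placeholder; the paper's actual quality
    criteria may differ.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # drop very short fragments
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a previously seen document
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text rather than the raw bytes means two copies of a scraped page that differ only in whitespace still collapse to one document.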
Merits
Scale and Accessibility
PashtoCorp’s unprecedented size and open-access distribution enable novel research and model training for Pashto, a language with 60M speakers.
Reproducibility and Transparency
The documented pipeline and open-source deployment align with best practices in computational linguistics.
Demerits
Source Dependency
Although Wikipedia accounts for only 0.7% of documents, removing it alone reduces NER entity F1 by 47%. This single-point dependency suggests the corpus's downstream value is fragile with respect to the loss or degradation of one high-quality source.
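The leave-one-out ablation behind this finding can be sketched generically. The `train_and_eval` callback is a hypothetical stand-in for the paper's full pretrain-and-fine-tune loop; the function only illustrates the ablation bookkeeping.

```python
def leave_one_out_ablation(sources, train_and_eval):
    """Measure each source's contribution by removing it and re-evaluating.

    `sources` maps source name -> list of documents; `train_and_eval`
    (hypothetical) trains on a corpus and returns a downstream score
    such as entity F1.
    """
    baseline = train_and_eval([d for docs in sources.values() for d in docs])
    impact = {}
    for name in sources:
        # rebuild the corpus with one source held out
        corpus = [d for k, docs in sources.items() if k != name for d in docs]
        score = train_and_eval(corpus)
        impact[name] = (baseline - score) / baseline  # relative drop
    return impact
```

A large relative drop for a small source (Wikipedia here) is exactly the quality-over-quantity signal the ablation is designed to surface.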
Expert Commentary
PashtoCorp is a watershed moment in low-resource language AI. Its scale alone transforms the landscape for Pashto NLP, but its true innovation lies in the reproducibility of the pipeline and the empirical validation of its downstream impact. The 25.1% perplexity reduction and 10% relative F1 improvement are practically meaningful, particularly given the paucity of prior benchmarks. The fact that a single source, Wikipedia, is so critical that removing it cuts entity F1 by 47% reveals a key insight: in low-resource settings, the quality of sources may outweigh their quantity. This demands a shift in corpus design priorities from volume to curatorial quality. Moreover, the open-access deployment via HuggingFace positions PashtoCorp as a reference standard. However, the dependency on Wikipedia warrants a proactive strategy: future efforts should diversify sources, incorporate domain-specific corpora (e.g., news, legal, educational), and develop redundancy metrics to mitigate single-source risk. PashtoCorp does not merely expand capacity; it strengthens the paradigm for building high-impact, reproducible resources for underrepresented languages.
Recommendations
1. Expand PashtoCorp with targeted acquisitions of domain-specific datasets (e.g., Pashto news archives, academic publications) to diversify source dependency.
2. Develop a source-diversity scoring metric to quantify redundancy and mitigate overreliance on any single corpus component.
3. Encourage replication studies using the PashtoCorp pipeline for other low-resource languages (e.g., Dari, Uyghur, Kurdish) to validate generalizability.
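One concrete form the recommended source-diversity metric could take is normalized Shannon entropy over the per-source document distribution. This is a hypothetical score, not one proposed in the paper.

```python
import math

def source_diversity(doc_counts):
    """Normalized Shannon entropy of the source distribution.

    Returns 0.0 when all documents come from one source and 1.0 when
    documents are spread evenly across all sources. A hypothetical
    redundancy score, not a metric from the paper.
    """
    total = sum(doc_counts.values())
    probs = [c / total for c in doc_counts.values() if c > 0]
    if len(probs) <= 1:
        return 0.0  # a single source offers no diversity
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))  # normalize to [0, 1]
```

Tracking such a score during corpus construction would flag the kind of single-source concentration that the Wikipedia ablation exposed after the fact.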