PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
arXiv:2603.16354v1 Announce Type: new Abstract: We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.
Executive Summary
PashtoCorp is a significant contribution to low-resource NLP: a 1.25-billion-word Pashto corpus, 83x larger than the previously largest dedicated Pashto resource, assembled via a reproducible pipeline combining 32 purpose-built web scrapers, seven HuggingFace datasets, Arabic-script tokenization, SHA-256 deduplication, and quality filtering. The corpus yields measurable downstream gains: continued MLM pretraining on PashtoCorp reduces held-out perplexity by 25.1% and improves NER entity F1 by 10% relative, while cutting training variance nearly 7x. The release of corpus, trained model, and code via HuggingFace and GitHub supports reproducibility. Notably, a leave-one-out ablation reveals Wikipedia's disproportionate impact on NER performance, underscoring the critical role of source selection in low-resource corpus design.
Key Points
- 1.25B-word corpus is 40x larger than the OSCAR Pashto subset and 83x larger than the prior largest dedicated Pashto corpus
- Reproducible pipeline includes Arabic-script tokenization, SHA-256 deduplication, and quality filtering
- Continued pretraining on PashtoCorp yields measurable improvements in perplexity, NER F1, and training stability
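The deduplication and filtering steps named above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the normalization rule and the 20-word length threshold are assumptions chosen for clarity.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", text).strip()

def dedup_and_filter(docs, min_words=20):
    """SHA-256 exact deduplication plus a simple length-based quality filter.

    The min_words threshold is a placeholder; the paper's actual quality
    criteria may differ.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # drop very short fragments
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a previously seen document
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text rather than the raw bytes means two copies of a scraped page that differ only in whitespace still collapse to one document.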
Merits
Scale and Accessibility
PashtoCorp’s unprecedented size and open-access distribution enable novel research and model training for Pashto, a language with 60M speakers.
Reproducibility and Transparency
The documented pipeline and open-source deployment align with best practices in computational linguistics.
Demerits
Source Dependency
Although Wikipedia accounts for only 0.7% of documents, removing it alone reduces NER entity F1 by 47%. This single-point dependency suggests the corpus's downstream value is fragile with respect to the loss or degradation of one high-quality source.
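The leave-one-out ablation behind this finding can be sketched generically. The `train_and_eval` callback is a hypothetical stand-in for the paper's full pretrain-and-fine-tune loop; the function only illustrates the ablation bookkeeping.

```python
def leave_one_out_ablation(sources, train_and_eval):
    """Measure each source's contribution by removing it and re-evaluating.

    `sources` maps source name -> list of documents; `train_and_eval`
    (hypothetical) trains on a corpus and returns a downstream score
    such as entity F1.
    """
    baseline = train_and_eval([d for docs in sources.values() for d in docs])
    impact = {}
    for name in sources:
        # rebuild the corpus with one source held out
        corpus = [d for k, docs in sources.items() if k != name for d in docs]
        score = train_and_eval(corpus)
        impact[name] = (baseline - score) / baseline  # relative drop
    return impact
```

A large relative drop for a small source (Wikipedia here) is exactly the quality-over-quantity signal the ablation is designed to surface.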
Expert Commentary
PashtoCorp is a watershed moment in low-resource language AI. Its scale alone transforms the landscape for Pashto NLP, but its true innovation lies in the reproducibility of the pipeline and the empirical validation of its downstream impact. The 25.1% perplexity reduction and 10% relative F1 improvement are practically meaningful, particularly given the paucity of prior benchmarks. The fact that a single source, Wikipedia, is so critical that removing it cuts entity F1 by 47% reveals a key insight: in low-resource settings, the quality of sources may outweigh their quantity. This demands a shift in corpus design priorities from volume to curatorial quality. Moreover, the open-access deployment via HuggingFace positions PashtoCorp as a reference standard. However, the dependency on Wikipedia warrants a proactive strategy: future efforts should diversify sources, incorporate domain-specific corpora (e.g., news, legal, educational), and develop redundancy metrics to mitigate single-source risk. PashtoCorp does not merely expand capacity; it strengthens the paradigm for building high-impact, reproducible resources for underrepresented languages.
Recommendations
1. Expand PashtoCorp with targeted acquisitions of domain-specific datasets (e.g., Pashto news archives, academic publications) to diversify source dependency.
2. Develop a source-diversity scoring metric to quantify redundancy and mitigate overreliance on any single corpus component.
3. Encourage replication studies using the PashtoCorp pipeline for other low-resource languages (e.g., Dari, Uyghur, Kurdish) to validate generalizability.
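One concrete form the recommended source-diversity metric could take is normalized Shannon entropy over the per-source document distribution. This is a hypothetical score, not one proposed in the paper.

```python
import math

def source_diversity(doc_counts):
    """Normalized Shannon entropy of the source distribution.

    Returns 0.0 when all documents come from one source and 1.0 when
    documents are spread evenly across all sources. A hypothetical
    redundancy score, not a metric from the paper.
    """
    total = sum(doc_counts.values())
    probs = [c / total for c in doc_counts.values() if c > 0]
    if len(probs) <= 1:
        return 0.0  # a single source offers no diversity
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))  # normalize to [0, 1]
```

Tracking such a score during corpus construction would flag the kind of single-source concentration that the Wikipedia ablation exposed after the fact.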