From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset
arXiv:2602.14062v1 Announce Type: new Abstract: Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88% of unique sentences account for 50% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set. These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation.
Executive Summary
The article analyzes the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025). It documents growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, of which 975.89 hours are validated and usable for supervised ASR training. The analysis finds highly concentrated contributor participation (Gini = 0.941), age representation skewed toward young adults, and missing self-reported gender labels on 41.97% of clips. It also examines sentence-level concentration in the validated subset and identifies practical priorities for improving dataset maturity.
Key Points
- ▸ Rapid growth of the Pashto Common Voice dataset from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours.
- ▸ Extreme concentration of contributor participation, with a Gini coefficient of 0.941.
- ▸ Age representation skewed toward young adults, and 41.97% of clips without self-reported gender labels, which limits metadata-based subgroup auditing.
- ▸ Moderate prompt reuse with 35.88% of unique sentences accounting for 50% of validated clips.
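The two concentration figures above can be illustrated with a minimal sketch. It is not the paper's code: the abstract does not say which Gini estimator was used, so the rank-weighted formula below is an assumption, and the `clips` rows, speaker IDs, and function names are hypothetical toy data chosen for illustration.

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal shares)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula over values sorted in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def share_of_items_for_coverage(counts, target=0.5):
    """Smallest fraction of items, most frequent first, that covers
    at least `target` of all clips."""
    xs = sorted(counts, reverse=True)
    total = sum(xs)
    if total == 0:
        return 0.0
    running = 0
    for k, x in enumerate(xs, start=1):
        running += x
        if running >= target * total:
            return k / len(xs)
    return 1.0


# Hypothetical (speaker, sentence) pairs, one per validated clip.
clips = [
    ("spk_a", "s1"), ("spk_a", "s2"), ("spk_a", "s3"), ("spk_a", "s1"),
    ("spk_a", "s4"), ("spk_a", "s1"), ("spk_a", "s2"),
    ("spk_b", "s1"), ("spk_b", "s5"),
    ("spk_c", "s1"),
]

per_speaker = Counter(speaker for speaker, _ in clips)
per_sentence = Counter(sentence for _, sentence in clips)

speaker_gini = gini(per_speaker.values())
sentence_share = share_of_items_for_coverage(per_sentence.values(), 0.5)

print(f"Gini over clips per speaker: {speaker_gini:.3f}")
print(f"Share of unique sentences covering 50% of clips: {sentence_share:.2%}")
```

On real data, a Gini near 1 together with a sentence share well above a few percent would match the pattern the abstract describes: concentration driven by a few prolific contributors rather than by a small set of repeated prompts.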
Merits
Comprehensive Analysis
The article analyzes the Pashto Common Voice dataset along several dimensions: scale, validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration.
Quantitative Audit
The study offers a quantitative audit of a rapidly scaling low-resource speech corpus, which is crucial for understanding the current state and future development of ASR systems for underrepresented languages.
Practical Recommendations
The article identifies practical priorities for improving dataset maturity, such as expanded validation capacity and broader demographic participation, which are actionable and relevant for future dataset development.
Demerits
Limited Scope
The analysis is focused solely on the Pashto component of the Common Voice corpus, which may limit the generalizability of the findings to other languages or datasets.
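Data Limitations
The demographic findings rest on self-reported metadata, which is incomplete: 41.97% of clips lack a gender label, and age representation is strongly skewed toward young adults. As the abstract itself notes, this limits how far subgroups can be audited from metadata alone.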
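A metadata completeness check of this kind can be sketched as follows. This is an illustrative sketch, not the authors' procedure: it assumes a tab-separated clip table with `age` and `gender` columns (as in Common Voice clip listings), and the sample rows, label values, and the `metadata_summary` helper are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; an empty field means the label was not self-reported.
SAMPLE_ROWS = [
    ["client_id", "age", "gender"],
    ["c1", "twenties", "male"],
    ["c1", "twenties", ""],
    ["c2", "thirties", "female"],
    ["c3", "", ""],
    ["c4", "twenties", "male"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n"


def metadata_summary(tsv_text):
    """Return clip count, share of clips without a gender label,
    and clip counts per age bucket (missing ages grouped together)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    n = len(rows)
    missing_gender = sum(
        1 for r in rows if not (r.get("gender") or "").strip()
    )
    age_counts = Counter(
        (r.get("age") or "").strip() or "(missing)" for r in rows
    )
    return {
        "clips": n,
        "missing_gender_share": missing_gender / n if n else 0.0,
        "age_counts": age_counts,
    }


summary = metadata_summary(SAMPLE_TSV)
print(f"Clips: {summary['clips']}")
print(f"Missing gender labels: {summary['missing_gender_share']:.2%}")
for bucket, count in summary["age_counts"].most_common():
    print(f"  {bucket}: {count}")
```

Counting per clip rather than per contributor follows the abstract's phrasing ("41.97% of clips"); a per-contributor count could give a different picture when a few speakers record most of the clips.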
Snapshot Dependence
The analysis centers on version 24.0 (December 2025). Because the corpus is still growing quickly, its findings may not reflect later releases.
Expert Commentary
The article offers a timely analysis of the Pashto Common Voice dataset, documenting both its rapid growth and its structural weaknesses: highly concentrated contributor participation, age representation skewed toward young adults, and missing gender labels on 41.97% of clips. These findings matter to researchers and practitioners building ASR systems for underrepresented languages, because total hours alone do not show how evenly those hours are spread across speakers. The recommendations to expand validation capacity and broaden demographic participation are actionable and relevant for future dataset development. The single-language scope and the reliance on self-reported metadata should be kept in mind when interpreting the results.
Recommendations
- ✓ Expand validation capacity to ensure the quality and reliability of the dataset.
- ✓ Promote broader demographic participation to improve the representativeness of the dataset.
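The case for expanding validation capacity can be made concrete with the two hour totals quoted in the abstract. The short calculation below is a sketch: the derived share is simple arithmetic on those figures, not a number reported in the abstract, and the abstract does not say how the remaining hours split between clips awaiting validation and clips that were rejected.

```python
# Hour totals as quoted in the abstract for Common Voice Pashto.
TOTAL_HOURS = 2768.7
VALIDATED_HOURS = 975.89

# Derived quantities (not reported in the abstract).
validated_share = VALIDATED_HOURS / TOTAL_HOURS
remaining_hours = TOTAL_HOURS - VALIDATED_HOURS

print(f"Validated share of total hours: {validated_share:.1%}")
print(f"Hours outside the validated subset: {remaining_hours:.2f}")
```

Roughly a third of the recorded audio is validated, which is why validation throughput, rather than recording volume, appears to be the limiting factor.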