RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks

Alexandra Diaconu, Mădălina Vînaga, Bogdan Alexe

arXiv:2603.02368v1

Abstract: We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.

Executive Summary

The article introduces RO-N3WS, a diverse Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR) systems, particularly in low-resource and out-of-distribution (OOD) conditions. The dataset comprises over 126 hours of transcribed audio drawn from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcasts. The authors evaluate Whisper and Wav2Vec 2.0 in zero-shot and fine-tuned settings and compare real speech against synthetic data generated with expressive TTS models. They report that even limited fine-tuning on real speech from RO-N3WS yields substantial improvements in word error rate (WER) over zero-shot baselines.
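Since the headline result is stated in terms of WER, a minimal sketch of how that metric is usually computed may help. WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The function names and the example sentences below are illustrative assumptions; they are not taken from the paper or from the RO-N3WS release, and the paper's own text normalization before scoring is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer


# Made-up example (not from the dataset): one substitution and one
# deletion against a four-word reference.
print(word_error_rate("buna ziua tuturor ascultatorilor", "buna seara tuturor"))  # 0.5
```

Comparing a zero-shot model with a fine-tuned one then amounts to scoring both against the same references and, if a single number is wanted, reporting the relative reduction between the two WER values.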

Key Points

  • Introduction of RO-N3WS, a diverse Romanian speech dataset
  • Evaluation of state-of-the-art ASR systems using RO-N3WS
  • Substantial WER improvements with limited fine-tuning on RO-N3WS

Merits

Diverse Dataset

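The dataset spans five stylistically distinct domains (broadcast news, audiobooks, film dialogue, children's stories, and podcasts), which supports robust training and fine-tuning and allows evaluation under out-of-distribution conditions.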

Reproducibility

The authors plan to release all models, scripts, and data splits, which supports reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.

Demerits

Limited Scope

The dataset is specific to the Romanian language, limiting its direct applicability to other languages.

Dependence on Fine-Tuning

The reported WER improvements require fine-tuning on RO-N3WS, which may not be feasible in every deployment scenario.

Expert Commentary

The introduction of RO-N3WS is a significant contribution to the field of ASR, particularly in low-resource conditions. The dataset's diversity and the substantial WER improvements demonstrated in the article highlight the importance of robust training and fine-tuning in ASR systems. However, the limited scope of the dataset and the dependence on fine-tuning are notable limitations. Further research is needed to explore the applicability of RO-N3WS to other languages and to develop more efficient fine-tuning methods.

Recommendations

  • Explore the applicability of RO-N3WS to other languages
  • Develop more efficient fine-tuning methods to reduce the dependence on fine-tuning with in-domain real speech