ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization
arXiv:2603.19256v1 Announce Type: new Abstract: Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). For Task 1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task 2, we fine-tune the pyannote.audio community-1 segmentation model with targeted hyperparameter optimization under an extreme low-resource setting (10 training files), achieving a Diarization Error Rate (DER) of 0.19974 on the public leaderboard and 0.26723 on the private test set. Our results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
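The abstract mentions fuzzy-matching-based chunk boundary validation but does not spell out the mechanism. A minimal sketch of the general idea, assuming each audio chunk's transcript is checked against the aligned slice of the reference transcript via a similarity ratio (the `validate_chunk` helper and the 0.85 threshold are hypothetical illustrations, not details from the paper):

```python
from difflib import SequenceMatcher

def validate_chunk(chunk_text: str, reference_text: str,
                   threshold: float = 0.85) -> bool:
    """Accept a chunk only if its transcript fuzzily matches the aligned
    slice of the reference transcript. Chunks whose boundaries cut words
    in half tend to score low and get filtered out of the corpus."""
    ratio = SequenceMatcher(None, chunk_text, reference_text).ratio()
    return ratio >= threshold

# Example: an exact match passes, an unrelated transcript is rejected.
print(validate_chunk("hello world", "hello world"))  # True
print(validate_chunk("abc", "xyz"))                  # False
```

A threshold-based filter like this trades recall for precision: a stricter threshold discards more borderline chunks but yields a cleaner training corpus.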
Executive Summary
This article presents ShobdoSetu, a data-centric framework for Bengali long-form speech recognition and speaker diarization. To address Bengali's under-served status in ASR research, the authors propose a pipeline that incorporates LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. The framework achieves competitive results on both tasks: for Task 1, a Word Error Rate of 16.751 (public leaderboard) and 15.551 (private test set); for Task 2, a Diarization Error Rate of 0.19974 (public) and 0.26723 (private). These results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield strong performance in low-resource settings, a finding with direct relevance for the more than 230 million Bengali speakers currently under-served by speech recognition technology.
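The summary quotes Word Error Rate figures; for readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, insertions, deletions) between a hypothesis and the reference, divided by the reference word count. A self-contained sketch (the `wer` helper is illustrative, not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution in a three-word reference gives WER = 1/3.
print(wer("ami bhalo achi", "ami kharap achi"))
```

Note that WER can exceed 1.0 when the hypothesis contains many spurious insertions, which is why leaderboard values are sometimes reported as percentages rather than fractions.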
Key Points
- ▸ The ShobdoSetu framework addresses the under-served Bengali language in automatic speech recognition research.
- ▸ The framework incorporates novel techniques, including language normalization, chunk boundary validation, and muffled-zone augmentation.
- ▸ ShobdoSetu achieves competitive performance on public and private test sets for both Task 1 and Task 2.
Merits
Strength in Data Engineering
The article highlights the importance of careful data engineering in achieving competitive performance in low-resource settings.
Domain-Adaptive Fine-Tuning
The authors demonstrate the effectiveness of domain-adaptive fine-tuning for Bengali speech processing, achieving competitive results even with as few as 10 diarization training files.
Demerits
Limited Scope
The article focuses specifically on the Bengali language and may not be applicable to other languages or domains.
Dependence on Large Language Models
The framework relies on pre-trained language models, which may limit its applicability in scenarios with limited computational resources.
Expert Commentary
The article shows that careful data engineering and domain-adaptive fine-tuning can produce competitive Bengali speech recognition and diarization systems even in low-resource settings. That said, the pipeline is tailored to Bengali, and transferring it to other languages or domains would require further validation. Its reliance on large language models for normalization may also pose challenges where computational resources are limited. Despite these limitations, the findings carry practical and policy implications: they underscore the need for more research on under-served languages and the value of investing in domain-adaptive fine-tuning techniques.
Recommendations
- ✓ Future research should focus on adapting the ShobdoSetu framework to other under-served languages and domains.
- ✓ Investigating the use of smaller language models or alternative techniques for domain-adaptive fine-tuning could improve the framework's applicability in scenarios with limited computational resources.
Sources
Original: arXiv - cs.CL