ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization
arXiv:2603.19256v1 Announce Type: new Abstract: Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). For Task 1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task 2, we fine-tune the pyannote.audio community-1 segmentation model with targeted hyperparameter optimization under an extreme low-resource setting (10 training files), achieving a Diarization Error Rate (DER) of 0.19974 on the public leaderboard and 0.26723 on the private test set. Our results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
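The abstract mentions fuzzy-matching-based chunk boundary validation but does not spell out the mechanism. A minimal sketch of the general idea, assuming each audio chunk's transcript is checked against the aligned slice of the reference transcript via a similarity ratio (the `validate_chunk` helper and the 0.85 threshold are hypothetical illustrations, not details from the paper):

```python
from difflib import SequenceMatcher

def validate_chunk(chunk_text: str, reference_text: str,
                   threshold: float = 0.85) -> bool:
    """Accept a chunk only if its transcript fuzzily matches the aligned
    slice of the reference transcript. Chunks whose boundaries cut words
    in half tend to score low and get filtered out of the corpus."""
    ratio = SequenceMatcher(None, chunk_text, reference_text).ratio()
    return ratio >= threshold

# Example: an exact match passes, an unrelated transcript is rejected.
print(validate_chunk("hello world", "hello world"))  # True
print(validate_chunk("abc", "xyz"))                  # False
```

A threshold-based filter like this trades recall for precision: a stricter threshold discards more borderline chunks but yields a cleaner training corpus.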
Executive Summary
This article presents ShobdoSetu, a data-centric framework for Bengali long-form speech recognition and speaker diarization. To address Bengali's under-served status in ASR research, the authors propose a pipeline that incorporates LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. The framework achieves competitive results on both tasks: for Task 1, a Word Error Rate of 16.751 (public leaderboard) and 15.551 (private test set); for Task 2, a Diarization Error Rate of 0.19974 (public) and 0.26723 (private). These results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield strong performance in low-resource settings, a finding with direct relevance for the more than 230 million Bengali speakers currently under-served by speech recognition technology.
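The summary quotes Word Error Rate figures; for readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, insertions, deletions) between a hypothesis and the reference, divided by the reference word count. A self-contained sketch (the `wer` helper is illustrative, not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution in a three-word reference gives WER = 1/3.
print(wer("ami bhalo achi", "ami kharap achi"))
```

Note that WER can exceed 1.0 when the hypothesis contains many spurious insertions, which is why leaderboard values are sometimes reported as percentages rather than fractions.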
Key Points
- ▸ The ShobdoSetu framework addresses the under-served Bengali language in automatic speech recognition research.
- ▸ The framework incorporates novel techniques, including language normalization, chunk boundary validation, and muffled-zone augmentation.
- ▸ ShobdoSetu achieves competitive performance on public and private test sets for both Task 1 and Task 2.
Merits
Strength in Data Engineering
The article highlights the importance of careful data engineering in achieving competitive performance in low-resource settings.
Domain-Adaptive Fine-Tuning
The authors demonstrate the effectiveness of domain-adaptive fine-tuning for Bengali speech processing, achieving competitive results even with as few as 10 diarization training files.
Demerits
Limited Scope
The article focuses specifically on the Bengali language and may not be applicable to other languages or domains.
Dependence on Large Language Models
The framework relies on pre-trained language models, which may limit its applicability in scenarios with limited computational resources.
Expert Commentary
The article shows that careful data engineering and domain-adaptive fine-tuning can produce competitive Bengali speech recognition and diarization systems even in low-resource settings. That said, the pipeline is tailored to Bengali, and transferring it to other languages or domains would require further validation. Its reliance on large language models for normalization may also pose challenges where computational resources are limited. Despite these limitations, the findings carry practical and policy implications: they underscore the need for more research on under-served languages and the value of investing in domain-adaptive fine-tuning techniques.
Recommendations
- ✓ Future research should focus on adapting the ShobdoSetu framework to other under-served languages and domains.
- ✓ Investigating the use of smaller language models or alternative techniques for domain-adaptive fine-tuning could improve the framework's applicability in scenarios with limited computational resources.
Sources
Original: arXiv - cs.CL