Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
arXiv:2602.21741v1

Abstract: We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the pyannote.audio pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
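The centroid-based agglomerative clustering mentioned in the abstract can be illustrated with a minimal numpy sketch (not the authors' code): singleton clusters of speaker embeddings are merged greedily by centroid cosine distance until no pair of centroids is closer than a stopping threshold. The function name and the threshold value below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def centroid_agglomerative(embeddings, threshold):
    """Greedy centroid-linkage agglomerative clustering (illustrative sketch).

    Repeatedly merges the two clusters whose centroids are closest in
    cosine distance, stopping once the nearest pair is farther apart
    than `threshold`. Returns one integer cluster label per embedding.
    """
    emb = np.asarray(embeddings, dtype=float)
    # L2-normalise so a dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Start with one singleton cluster per embedding.
    clusters = [[i] for i in range(len(emb))]

    while len(clusters) > 1:
        cents = np.stack([emb[c].mean(axis=0) for c in clusters])
        cents = cents / np.linalg.norm(cents, axis=1, keepdims=True)
        dist = 1.0 - cents @ cents.T          # pairwise cosine distance
        np.fill_diagonal(dist, np.inf)        # ignore self-distances
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        if dist[i, j] > threshold:            # nothing close enough to merge
            break
        clusters[i] += clusters[j]
        del clusters[j]

    labels = np.empty(len(emb), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

In a diarization pipeline the inputs would be per-segment speaker embeddings (e.g. from wespeaker-voxceleb-resnet34-LM) and the threshold would be tuned on development data.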
Executive Summary
The article presents an end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization, addressing significant challenges posed by the language's large phoneme inventory, dialectal variations, code-mixing with English, and the scarcity of large-scale labeled corpora. The authors achieve competitive Word Error Rates (WER) and Diarization Error Rates (DER) through a combination of fine-tuned models, source separation techniques, and carefully tuned hyperparameters. The study highlights the importance of domain-specific fine-tuning, vocal source separation, and silence-aware chunking for effective Bengali speech processing.
Key Points
- Achieved best private WER of 0.37738 and public WER of 0.36137 for ASR
- Achieved best private DER of 0.27671 and public DER of 0.20936 for speaker diarization
- Utilized BengaliAI fine-tuned Whisper medium model and Demucs source separation for ASR
- Employed Bengali-fine-tuned segmentation model, wespeaker-voxceleb-resnet34-LM embeddings, and centroid-based agglomerative clustering for diarization
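Swapping the segmentation model inside a pyannote.audio pipeline is typically done through the pipeline's config.yaml. The fragment below is a rough sketch modelled on the layout of the public pyannote speaker-diarization pipeline configs; the Bengali checkpoint path and the clustering threshold are illustrative placeholders, not values reported in the paper.

```yaml
pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    # Point the segmentation stage at a locally fine-tuned checkpoint
    # (placeholder path, for illustration only).
    segmentation: ./checkpoints/segmentation-bengali-finetuned.ckpt
    embedding: pyannote/wespeaker-voxceleb-resnet34-LM
    clustering: AgglomerativeClustering

params:
  clustering:
    method: centroid        # centroid-linkage agglomerative clustering
    threshold: 0.70         # illustrative; tuned on development data
  segmentation:
    min_duration_off: 0.0
```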
Merits
Innovative Approach
The study combines multiple complementary techniques, including segmentation fine-tuning, vocal source separation, and silence-aware chunking, to address the distinct challenges of Bengali speech processing.
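Silence-aware chunking, one of the three design choices the authors highlight, can be sketched in a few lines of numpy (this is an illustrative reimplementation under assumed parameters, not the competition code): frame-level energies are computed, and each cut is placed at a quiet frame near the chunk-length limit rather than at a fixed offset, so words are not sliced in half.

```python
import numpy as np

def silence_boundary_chunks(audio, sr, max_chunk_s=30.0,
                            frame_ms=25, silence_db=-35.0):
    """Split mono audio into chunks of at most `max_chunk_s` seconds,
    cutting at the last silent frame (or, failing that, the quietest
    frame) in the second half of each window. Illustrative sketch;
    any sub-frame tail at the end is dropped."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame
    # Per-frame RMS energy in dB relative to the loudest frame.
    rms = np.sqrt(np.mean(
        audio[:n_frames * frame].reshape(n_frames, frame) ** 2, axis=1))
    db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)

    max_frames = int(max_chunk_s * sr / frame)
    chunks, start = [], 0
    while start < n_frames:
        end = min(start + max_frames, n_frames)
        if end < n_frames:
            # Search the second half of the window for a silent frame
            # and cut there instead of at the hard limit.
            window = db[start + max_frames // 2:end]
            quiet = np.where(window < silence_db)[0]
            offset = quiet[-1] if len(quiet) else np.argmin(window)
            end = start + max_frames // 2 + int(offset) + 1
        chunks.append(audio[start * frame:end * frame])
        start = end
    return chunks
```

Each resulting chunk can then be transcribed independently (e.g. by a Whisper model, which expects inputs of at most 30 seconds) and the hypotheses concatenated.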
Competitive Performance
The achieved WER and DER metrics are competitive, demonstrating the effectiveness of the proposed methods in handling low-resource languages.
Practical Insights
The article provides practical insights into the most impactful design choices for low-resource speech processing, which can be valuable for future research and applications.
Demerits
Limited Generalizability
The study focuses on Bengali, which may limit the generalizability of the findings to other languages with similar characteristics.
Data Scarcity
The scarcity of large-scale labeled corpora for Bengali poses a significant challenge, which may affect the robustness and scalability of the proposed methods.
Complexity
The combination of multiple techniques and models increases the complexity of the system, which may hinder its practical deployment in real-world scenarios.
Expert Commentary
The article presents a rigorous and well-reasoned approach to addressing the challenges of Bengali speech processing. The combination of fine-tuned models, source separation, and silence-aware chunking demonstrates a sophisticated understanding of the unique characteristics of the language. The achieved metrics are impressive, particularly given the scarcity of labeled data. However, the study's focus on Bengali may limit its generalizability to other languages. Future research should explore the adaptability of these methods to other low-resource languages with similar characteristics. Additionally, the complexity of the proposed system may pose challenges for practical deployment, suggesting a need for further simplification and optimization. Overall, the study makes a significant contribution to the field of low-resource language processing and provides valuable insights for both academic research and practical applications.
Recommendations
- Future research should investigate the adaptability of the proposed methods to other low-resource languages to assess their generalizability.
- Efforts should be made to create and label larger corpora for Bengali and other low-resource languages to support more robust and scalable speech processing systems.