ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark
arXiv:2602.12911v1 Abstract: Code-switching (CS), the use of English words such as drug names or procedure names within Vietnamese speech, is a common phenomenon in Vietnamese medical communication. It poses challenges for Automatic Speech Recognition (ASR) systems, especially in low-resource languages such as Vietnamese. Most current ASR systems struggle to correctly recognize English medical terms embedded in Vietnamese sentences, and no existing benchmark addresses this challenge. In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. Each utterance includes at least one English medical term drawn from a curated bilingual lexicon covering five medical topics. Using this dataset, we evaluate several state-of-the-art ASR models and examine specific fine-tuning strategies for improving medical term recognition, to identify the most effective approach on the dataset. Experimental results show that Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions. The combination of both approaches yields the best balance between overall and code-switched accuracy. This work provides the first benchmark for Vietnamese medical code-switching and offers insights into effective domain adaptation for low-resource, multilingual ASR systems.
Executive Summary
The article introduces ViMedCSS, a novel dataset designed to address the challenges of code-switching in Vietnamese medical speech, where English terms are frequently interspersed. This 34-hour dataset comprises 16,576 utterances, each containing at least one English medical term from a curated bilingual lexicon. The study evaluates state-of-the-art ASR models and various fine-tuning strategies to enhance the recognition of medical terms within Vietnamese sentences. The findings suggest that Vietnamese-optimized models excel on general segments, while multilingual pretraining improves the recognition of English insertions. Combining both approaches yields the best balance between overall and code-switched accuracy, providing a benchmark for Vietnamese medical code-switching and insights into domain adaptation for low-resource, multilingual ASR systems.
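The benchmark's central contrast is between overall transcription accuracy and accuracy on the English insertions specifically. Below is a minimal sketch of how such a split metric can be computed, assuming (reference, hypothesis) transcript pairs and the bilingual term lexicon; the example sentences, lexicon entries, and the `cs_term_recall` helper are illustrative assumptions, not the paper's released evaluation code.

```python
# Sketch of the two headline quantities: overall word error rate vs. recall
# of English code-switched terms. Requires: pip install jiwer
import jiwer

def cs_term_recall(refs, hyps, lexicon):
    """Fraction of lexicon terms appearing in a reference that the ASR
    hypothesis reproduces verbatim -- a crude proxy for CS accuracy."""
    hit = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_l, hyp_l = ref.lower(), hyp.lower()
        for term in lexicon:
            if term in ref_l:
                total += 1
                hit += int(term in hyp_l)
    return hit / max(total, 1)

# Hypothetical reference/hypothesis pair: the model garbles "paracetamol"
# into Vietnamese-like syllables, a typical code-switching failure mode.
refs = ["bệnh nhân được kê paracetamol mỗi sáu giờ"]
hyps = ["bệnh nhân được kê pa ra sê ta mon mỗi sáu giờ"]
lexicon = {"paracetamol", "insulin", "endoscopy"}

print("Overall WER:", jiwer.wer(refs, hyps))                    # every word counts equally
print("CS term recall:", cs_term_recall(refs, hyps, lexicon))   # 0.0 in this example
```

A system can score a tolerable overall WER while missing every medical term, which is why the two numbers are reported separately.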
Key Points
- ▸ Introduction of ViMedCSS dataset for Vietnamese medical code-switching.
- ▸ Evaluation of state-of-the-art ASR models and fine-tuning strategies.
- ▸ Combination of Vietnamese-optimized models and multilingual pretraining yields the best results.
Merits
Comprehensive Dataset
The ViMedCSS dataset is extensive and well-curated, covering five medical topics and providing a robust benchmark for future research.
Innovative Approach
The study combines Vietnamese-optimized models with multilingual pretraining, offering a novel approach to improving ASR accuracy in code-switching scenarios (see the sketch after this list).
Practical Insights
The findings provide actionable insights for developers working on ASR systems for low-resource languages with code-switching challenges.
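To make the multilingual side of this trade-off concrete, here is a hedged sketch of running a multilingual baseline of the kind the study contrasts with Vietnamese-optimized models. The checkpoint `openai/whisper-small` and the audio path are stand-ins, since the summary above does not name the paper's exact models.

```python
# Illustrative baseline inference, not the paper's setup.
# Requires: pip install transformers torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # multilingual checkpoint (assumption)
)

# Decoding is forced to Vietnamese; any English medical insertions must
# still be recovered from the model's multilingual pretraining, which is
# exactly the tension the benchmark measures.
result = asr(
    "vimedcss_utterance.wav",  # hypothetical audio file from the dataset
    generate_kwargs={"language": "vi", "task": "transcribe"},
)
print(result["text"])
```

A Vietnamese-optimized checkpoint dropped into the same pipeline would typically transcribe the Vietnamese segments more faithfully but garble the English terms, motivating the combined strategy the paper reports as best.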
Demerits
Limited Generalizability
The dataset and findings are specific to Vietnamese medical code-switching, which may limit their applicability to other languages or domains.
Dataset Size
While substantial for a specialized benchmark, 34 hours is relatively small for adapting large ASR models, which may limit the robustness of the reported results.
Model Specificity
The study focuses on specific ASR models and fine-tuning strategies, which may not cover the full spectrum of possible approaches.
Expert Commentary
The introduction of the ViMedCSS dataset marks a significant advancement in the field of ASR for low-resource languages, particularly in the context of code-switching. The study's rigorous evaluation of state-of-the-art ASR models and fine-tuning strategies provides valuable insights into the challenges and potential solutions for improving medical term recognition in Vietnamese speech. The combination of Vietnamese-optimized models with multilingual pretraining demonstrates a balanced approach that could serve as a blueprint for similar studies in other languages and domains. However, the specificity of the dataset and the models used may limit the generalizability of the findings. Future research could explore the applicability of these methods to other low-resource languages and different code-switching scenarios. Additionally, expanding the dataset size could enhance the robustness of the results and provide a more comprehensive benchmark for the research community.
Recommendations
- ✓ Future studies should explore the applicability of the findings to other low-resource languages and different code-switching scenarios.
- ✓ Expanding the dataset size could enhance the robustness of the results and provide a more comprehensive benchmark for the research community.