Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation

arXiv:2602.13263v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems often degrade on accented speech because acoustic-phonetic and prosodic shifts induce a mismatch to training data, making labeled accent adaptation costly. However, common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification when fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. The pipeline starts with a target-aware preselection step based on submodular mutual information to improve query relevance and reduce downstream computation. It then generates multiple pseudo-transcriptions per utterance via perturbation-based decoding and scores each hypothesis using two reference-free signals: speech-text alignment in a shared embedding space and predicted word error rate (WER). A simple percentile-based selection rule retains reliable pseudo-labels for fine-tuning while discarding noisy utterances. In an in-domain setting, selecting ~1.5k utterances from a 30k pool achieves 10.91% WER, close to 10.45% obtained using 30k supervised labels. In a cross-domain setting with a mismatched candidate pool, consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels under strong accent shift, and matched-hour experiments on a stronger ASR backbone further confirm gains over random sampling and recent selection baselines.

Executive Summary

The article presents an approach to improving automatic speech recognition (ASR) performance on accented speech through a multimodal consistency-guided, reference-free data selection pipeline. The method addresses the cost of labeled accent adaptation by replacing purely text-centric pseudo-label heuristics (such as perplexity filtering, which can favor fluent but acoustically mismatched hypotheses) with selection signals grounded in both modalities. The pipeline comprises a target-aware preselection step based on submodular mutual information, perturbation-based decoding to generate multiple pseudo-transcriptions per utterance, and hypothesis scoring using speech-text embedding alignment and predicted word error rate (WER). In an in-domain setting, fine-tuning on roughly 1.5k selected utterances from a 30k pool yields 10.91% WER, close to the 10.45% obtained with full supervision, while in cross-domain settings the consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels.
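The scoring-and-filtering step described above can be illustrated with a minimal sketch. The paper does not specify how the two reference-free signals are combined, so the rank normalization, equal weighting, and the function name `select_pseudo_labels` below are all assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def select_pseudo_labels(align_scores, pred_wers, keep_top_pct=5):
    """Sketch of a percentile-based selection rule (hypothetical).

    align_scores: speech-text alignment scores per utterance (higher is better)
    pred_wers:    predicted WER per utterance (lower is better)
    Returns indices of utterances whose combined score lands in the
    top `keep_top_pct` percent of the candidate pool.
    """
    align = np.asarray(align_scores, dtype=float)
    wer = np.asarray(pred_wers, dtype=float)
    n = len(align)
    # Rank-normalize each signal to [0, 1] so the two are comparable.
    a_rank = align.argsort().argsort() / max(n - 1, 1)
    w_rank = (-wer).argsort().argsort() / max(n - 1, 1)  # negate: low WER is good
    combined = 0.5 * a_rank + 0.5 * w_rank
    # Keep only utterances above the percentile cutoff.
    threshold = np.percentile(combined, 100 - keep_top_pct)
    return np.where(combined >= threshold)[0]
```

With `keep_top_pct=5`, this would retain roughly 1.5k of 30k candidates, matching the selection ratio reported in the abstract.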

Key Points

  • Introduction of a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation.
  • Use of submodular mutual information for target-aware preselection to improve query relevance and reduce computation.
  • Perturbation-based decoding to generate multiple pseudo-transcriptions per utterance.
  • Scoring hypotheses using speech-text alignment and predicted WER for reliable pseudo-label selection.
  • Near-supervised in-domain WER (10.91% with ~1.5k selected utterances vs. 10.45% with 30k supervised labels) and avoidance of pseudo-label-induced degradation in cross-domain settings.
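The target-aware preselection in the first bullet can be sketched as a greedy maximization of a facility-location-style submodular mutual information objective between the candidate pool and target-accent query utterances. The specific objective, the embedding source, and the function name `smi_preselect` are assumptions for illustration; the paper's exact SMI instantiation may differ:

```python
import numpy as np

def smi_preselect(pool_emb, query_emb, k):
    """Greedy preselection sketch (hypothetical facility-location SMI).

    pool_emb:  (N, d) embeddings of the unlabeled candidate pool
    query_emb: (M, d) embeddings of target-accent query utterances
    Greedily picks k pool items maximizing, summed over queries,
    the best cosine similarity to any selected item.
    """
    p = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sim = p @ q.T                          # (N, M) cosine similarities
    selected = []
    best = np.full(q.shape[0], -1.0)       # current best coverage per query
    for _ in range(k):
        # Marginal gain of adding each candidate to the selected set.
        gains = np.maximum(sim, best).sum(axis=1) - best.sum()
        gains[selected] = -np.inf          # never re-pick an item
        i = int(np.argmax(gains))
        selected.append(i)
        best = np.maximum(best, sim[i])
    return selected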

Merits

Innovative Methodology

The article introduces a novel approach that combines multimodal consistency and reference-free data selection, which is a significant advancement over traditional text-centric heuristics.

Effective Performance

The method achieves a WER close to supervised labels in in-domain settings and avoids degradation in cross-domain settings, demonstrating its robustness and effectiveness.

Comprehensive Evaluation

The study provides a thorough evaluation in both in-domain and cross-domain settings, including matched-hour experiments, which strengthens the validity of the findings.

Demerits

Complexity

The pipeline involves multiple steps and sophisticated techniques, which may increase computational complexity and implementation challenges.

Limited Generalization

While the method shows promise, its effectiveness may vary across different accents and languages, requiring further validation in diverse settings.

Dependence on Pretrained Models

The approach relies on pretrained models for speech-text alignment and WER prediction, which may introduce biases or limitations inherent in those models.

Expert Commentary

The article presents a significant advancement in the field of ASR accent adaptation by introducing a multimodal consistency-guided, reference-free data selection pipeline. The method addresses a critical challenge in ASR systems, which is their performance degradation on accented speech due to acoustic-phonetic and prosodic shifts. The innovative approach of combining submodular mutual information for preselection, perturbation-based decoding, and scoring hypotheses using speech-text alignment and predicted WER demonstrates a robust and effective solution. The study's comprehensive evaluation in both in-domain and cross-domain settings further strengthens the validity of the findings. However, the complexity of the pipeline and the dependence on pretrained models are notable limitations that need to be addressed. Overall, the article makes a valuable contribution to the field and has important implications for both practical applications and policy considerations in AI and speech recognition.

Recommendations

  • Further validation of the proposed method across a broader range of accents and languages to ensure its generalizability.
  • Exploration of techniques to simplify the pipeline and reduce computational complexity for practical deployment.
  • Investigation of the potential biases introduced by pretrained models and development of strategies to mitigate them.

Sources