ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models
arXiv:2602.18721v1 Announce Type: new Abstract: Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
Executive Summary
The article introduces ReHear, a framework for iterative pseudo-label refinement in semi-supervised speech recognition. By integrating an instruction-tuned, audio-aware large language model (LLM) into the self-training loop, ReHear conditions the refiner on both the ASR hypothesis and the source audio, enabling recovery of phonetically accurate transcripts even from severe recognition errors. The refined pseudo-labels then serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experiments across diverse benchmarks show that ReHear mitigates error propagation and consistently outperforms both supervised and pseudo-labeling baselines. Because semi-supervised learning underpins many ASR deployments, from voice assistants to speech-to-text systems, these gains are practically significant; even so, broader evaluation is needed to establish the framework's robustness and scalability.
Key Points
- ▸ ReHear integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop.
- ▸ The LLM conditions on both the ASR hypothesis and the source audio, enabling phonetically accurate transcript recovery.
- ▸ Experimental results demonstrate ReHear's effectiveness in mitigating error propagation and outperforming baselines.
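The three points above describe a self-training cycle: decode unlabeled audio, refine each hypothesis with an audio-aware LLM, and fine-tune on the refined pseudo-labels. A minimal sketch of that loop is shown below; the function names (`rehear_loop`, `make_asr`, `toy_refiner`, `toy_fine_tune`) and the string-based "audio" are illustrative stand-ins, not the paper's actual APIs or models.

```python
def rehear_loop(asr_model, audio_llm, fine_tune, labeled, unlabeled_audio, rounds=3):
    """Sketch of ReHear-style self-training with LLM-refined pseudo-labels."""
    for _ in range(rounds):
        # 1. Decode the unlabeled audio with the current ASR model.
        hypotheses = [asr_model(a) for a in unlabeled_audio]
        # 2. Refine each hypothesis with the audio-aware LLM, which sees
        #    both the text hypothesis and the source audio.
        pseudo = [audio_llm(a, h) for a, h in zip(unlabeled_audio, hypotheses)]
        # 3. Fine-tune on labeled data plus the refined pseudo-labels.
        asr_model = fine_tune(asr_model, labeled + list(zip(unlabeled_audio, pseudo)))
    return asr_model

# Toy stand-ins: "audio" is a clip id; the reference transcripts below
# stand in for what an audio-aware refiner could recover from the waveform.
truth = {"clip1": "hello world", "clip2": "good morning"}

def make_asr(transcripts):
    # An "ASR model" is just a transcript lookup in this toy setting.
    return lambda audio: transcripts[audio]

def toy_refiner(audio, hypothesis):
    # Idealized audio-aware refinement: with access to the source audio,
    # the refiner recovers the reference transcript.
    return truth[audio]

def toy_fine_tune(model, pairs):
    # "Fine-tuning" here simply memorizes the refined pseudo-labels.
    return make_asr(dict(pairs))

# Start from a model that makes phonetically plausible errors.
initial = make_asr({"clip1": "hollow word", "clip2": "good mourning"})
final = rehear_loop(initial, toy_refiner, toy_fine_tune,
                    labeled=[], unlabeled_audio=["clip1", "clip2"], rounds=2)
print(final("clip1"))  # -> hello world
```

The key design point the sketch captures is step 2: because the refiner is conditioned on the audio as well as the hypothesis, it can correct errors (e.g. "hollow word") that a text-only corrector would have to guess at.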
Merits
Strength in Mitigating Error Propagation
ReHear's iterative pseudo-label refinement effectively addresses the limitation of noisy supervision in semi-supervised learning, leading to improved accuracy and robustness in automatic speech recognition.
Flexibility and Scalability
The proposed framework allows for the integration of various instruction-tuned LLMs and ASR models, making it a flexible and scalable solution for semi-supervised learning applications.
Potential for Real-World Applications
ReHear's effectiveness in improving speech recognition accuracy has significant implications for real-world applications, including voice assistants, speech-to-text systems, and human-computer interfaces.
Demerits
Limited Evaluation on Adversarial Cases
The article does not provide a comprehensive evaluation of ReHear's performance in adversarial cases, such as noisy or degraded audio inputs, which may limit its robustness in practical applications.
Dependence on High-Quality Audio
The proposed framework relies on the availability of high-quality audio data, which may not be feasible in all scenarios, particularly in resource-constrained environments.
Potential Overfitting to Training Data
The iterative refinement process may lead to overfitting to the training data, particularly if the LLM is not properly regularized or if the training data is biased.
Expert Commentary
ReHear represents a meaningful advance in semi-supervised learning for automatic speech recognition: conditioning an instruction-tuned LLM on the source audio, rather than on text alone, directly targets the confirmation bias that undermines conventional pseudo-labeling. That said, the limitations noted above, namely potential overfitting during iterative refinement and dependence on high-quality audio, remain open questions, and continued research is needed before the approach can be considered robust and scalable in practice.
Recommendations
- ✓ Further evaluation of ReHear in various scenarios, including adversarial cases and resource-constrained environments.
- ✓ Investigation of regularization techniques and data augmentation methods to mitigate overfitting and improve robustness.
- ✓ Exploration of the potential applications of ReHear in other domains, such as natural language processing and computer vision.