Whisper: Courtside Edition - Enhancing ASR Performance Through LLM-Driven Context Generation

Yonathan Ron, Shiri Gilboa, Tammuz Dubnov

Abstract (arXiv:2602.18966v1): Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology), our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p < 0.001). Improvements are observed in 40.1% of segments, with degradation in only 7.1%, substantially outperforming direct transcript post-editing. These results demonstrate that prompt-based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine-tuning.

Executive Summary

This article presents Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline designed to enhance automatic speech recognition (ASR) performance in domain-specific speech. By applying specialized LLM agents for domain context identification, named entity recognition, and jargon detection, the pipeline generates compact prompts that guide Whisper's decoder. The results show a statistically significant 17.0% relative reduction in word error rate (WER) on NBA basketball commentary segments, outperforming direct transcript post-editing. This study demonstrates the potential of prompt-based augmentation for scalable domain adaptation in ASR, offering a practical alternative to costly model fine-tuning. The findings have significant implications for the development of more accurate and efficient ASR systems in various domains, including real-time transcription and language understanding applications.
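To make the pipeline concrete, here is a minimal, hypothetical sketch of its prompt-generation stage. The paper's actual agents are LLM calls; the stand-in functions below (`find_entities`, `find_jargon`, and the `lexicon`) are illustrative placeholders so the prompt-assembly step can run on its own. The resulting prompt would be fed back to Whisper via its `initial_prompt` parameter for a second decoding pass.

```python
# Hypothetical sketch of the prompt-generation stage of a Whisper-enhancement
# pipeline. LLM agents are replaced with trivial heuristics for illustration.
import re

PROMPT_TOKEN_BUDGET = 200  # Whisper conditions on a limited prompt window (~224 tokens)

def find_entities(draft: str) -> list[str]:
    """Stand-in for the named-entity agent: capitalized multi-word spans."""
    return re.findall(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)+)\b", draft)

def find_jargon(draft: str, lexicon: set[str]) -> list[str]:
    """Stand-in for the jargon-detection agent: match against a domain lexicon."""
    words = set(re.findall(r"[a-z'-]+", draft.lower()))
    return sorted(words & lexicon)

def build_prompt(draft: str, domain: str, lexicon: set[str]) -> str:
    """Compose a compact context prompt from a first-pass Whisper transcript."""
    terms = find_entities(draft) + find_jargon(draft, lexicon)
    prompt = f"{domain} commentary. Vocabulary: " + ", ".join(dict.fromkeys(terms))
    # Trim to the budget (word count as a rough proxy for BPE tokens).
    return " ".join(prompt.split()[:PROMPT_TOKEN_BUDGET])

draft = "Luka Doncic drains a stepback three in the pick and roll"
lexicon = {"stepback", "pick", "roll"}
prompt = build_prompt(draft, "NBA basketball", lexicon)
# A second pass would then run, e.g.:
#   model.transcribe(audio, initial_prompt=prompt)
```

The key design point this sketch captures is that the prompt must stay compact: Whisper only conditions on a short prompt window, so the agents' job is to distill the domain context into a few high-value entities and terms rather than a full description.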

Key Points

  • Whisper: Courtside Edition is a novel multi-agent LLM pipeline designed to enhance ASR performance in domain-specific speech.
  • The pipeline applies LLM agents for domain context identification, named entity recognition, and jargon detection to generate compact prompts for ASR.
  • The results show a statistically significant 17.0% relative reduction in WER on NBA basketball commentary segments, outperforming direct transcript post-editing.

Merits

Strength in Scalability

The study demonstrates the potential of prompt-based augmentation for scalable domain adaptation in ASR, offering a practical alternative to costly model fine-tuning.

Improvement in ASR Performance

The pipeline achieves a statistically significant 17.0% relative reduction in WER on NBA basketball commentary segments, outperforming direct transcript post-editing.
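For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the reference length; the headline 17.0% figure follows directly from the reported absolute scores. A small self-contained check (the paper's exact tooling is not specified; this is the standard definition):

```python
# Word error rate via word-level Levenshtein distance:
# (substitutions + insertions + deletions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# The reported WER scores imply the 17.0% relative reduction:
relative_reduction = (0.217 - 0.180) / 0.217  # ~0.170
```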

Demerits

Limited Domain Scope

The study focuses on NBA basketball commentary segments, which may not generalize to other domains or languages.

Lack of Human Evaluation

The study relies solely on automatic metrics, such as WER, without human evaluation to assess the quality and relevance of the generated transcripts.

Expert Commentary

The study presents a novel approach to enhancing ASR performance on domain-specific speech through prompt-based augmentation, achieving meaningful gains without any model retraining. Two limitations stand out, however: the evaluation is confined to a single domain (NBA basketball commentary) in a single language, and quality is assessed only through automatic metrics such as WER. Future work should test whether the gains generalize to other domains and languages, and should add human evaluation of transcript quality and relevance. If the improvements hold up more broadly, prompt-based context injection offers a practical, low-cost path to domain adaptation that could shape how ASR is deployed in specialized industries and applications.

Recommendations

  • Future studies should evaluate the pipeline on additional domains and languages, and incorporate human judgments of transcript quality and relevance alongside WER.
  • Practitioners deploying ASR on specialized content, particularly real-time transcription, should consider prompt-based context injection as a low-cost first step before committing to model fine-tuning.
