Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages

arXiv:2603.00941v1 Announce Type: new Abstract: Evaluating ASR systems for Indian languages is challenging due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words. Traditional Word Error Rate (WER) often presents a bleaker picture of system performance than what human users perceive. Better aligning evaluation with real-world performance requires capturing permissible orthographic variations, which is extremely challenging for under-resourced Indian languages. Leveraging recent advances in LLMs, we propose a framework for creating benchmarks that capture permissible variations. Through extensive experiments, we demonstrate that OIWER, by accounting for orthographic variations, reduces pessimistic error rates (an average improvement of 6.3 points), narrows inflated model gaps (e.g., Gemini-Canary performance difference drops from 18.1 to 11.5 points), and aligns more closely with human perception than prior methods like WER-SN by 4.9 points.
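For context, the conventional WER the abstract contrasts against is an edit distance over word sequences, normalized by reference length. A minimal stdlib-only sketch (illustrative only, not the paper's implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("a b c", "a x c")` scores one substitution out of three reference words, i.e. 1/3. Because equality here is exact string match, any permissible spelling variant counts as a full error, which is precisely the pessimism the paper targets.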

Executive Summary

This article proposes a novel framework for evaluating speech recognition systems for Indian languages, addressing the challenges posed by orthographic variations. By leveraging large language models, the authors develop an Orthographically-Informed Word Error Rate (OIWER) metric that captures permissible variations, resulting in closer alignment with human perception and less pessimistic error rates. The framework, demonstrated through extensive experiments, could substantially improve the evaluation of Automatic Speech Recognition (ASR) systems for under-resourced languages. The study's findings highlight the importance of accounting for linguistic complexities in evaluation and carry significant implications for the development of more effective ASR systems.

Key Points

  • The article proposes a novel framework for evaluating ASR systems for Indian languages, addressing orthographic variations.
  • The OIWER metric captures permissible variations, resulting in improved alignment with human perception.
  • Extensive experiments demonstrate the effectiveness of the framework in reducing pessimistic error rates and model gaps.

Merits

Advancements in ASR Evaluation

The framework addresses significant challenges in evaluating ASR systems for under-resourced languages, enabling more accurate assessments of system performance.

Improved Alignment with Human Perception

The OIWER metric aligns more closely with human perception, providing a more realistic evaluation of ASR system performance.
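The paper's OIWER is built from LLM-curated benchmarks of permissible variations, and its exact procedure is the authors' own. Purely as a toy illustration, one way a metric can forgive accepted spellings is to canonicalize words through a variant lexicon (the `variants` mapping below is hypothetical) before computing plain WER:

```python
def _edit_distance(ref: list, hyp: list) -> int:
    # Standard Levenshtein distance over token lists (rolling row).
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # sub/match
            prev = cur
    return d[len(hyp)]

def variant_aware_wer(reference: str, hypothesis: str, variants: dict) -> float:
    """Toy variant-aware WER: `variants` maps a canonical spelling to a
    set of permissible alternatives; both strings are canonicalized
    before scoring, so accepted spellings do not count as errors.
    (Illustrative sketch only, not the paper's OIWER procedure.)"""
    canon = {form: head for head, forms in variants.items() for form in forms}
    norm = lambda text: [canon.get(w, w) for w in text.split()]
    ref, hyp = norm(reference), norm(hypothesis)
    return _edit_distance(ref, hyp) / max(len(ref), 1)
```

With `variants = {"colour": {"color"}}`, the hypothesis word "color" is mapped to "colour" before scoring and no longer counts as a substitution, which is the intuition behind an orthography-aware error rate.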

Demerits

Limited Scope

The study is limited to Indian languages, and it remains to be seen whether the framework can be effectively applied to other languages with complex orthographic systems.

Technical Complexity

The framework requires significant technical expertise and computational resources, potentially limiting its accessibility for researchers and practitioners.

Expert Commentary

The article's proposed framework represents a meaningful advance in the evaluation of ASR systems, particularly for under-resourced languages. By leveraging large language models, the authors show that evaluation metrics can be made more faithful to how users actually perceive transcription quality. However, the study's limitations, including its scope and technical complexity, should be addressed in future work. If adopted and extended, the framework could improve language support and access to information for speakers of under-resourced languages.

Recommendations

  • Future research should explore the application of the framework to other languages with complex orthographic systems.
  • The development of more accessible and user-friendly versions of the framework is necessary to facilitate its adoption by researchers and practitioners.
