Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR
arXiv:2603.16184v1. Abstract: We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32), a model 6x larger, while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference throughput is approximately 20x faster than MERaLiON, at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
Executive Summary
This article presents Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. By fine-tuning Qwen3-ASR models exclusively on publicly available speech corpora with a balanced sampling strategy, the researchers achieve competitive accuracy at a small fraction of the training cost and inference time of larger systems. The average error rate of 14.85 is close to that of the much larger MERaLiON-2-10B-ASR (14.32), while the training cost of $81 is roughly 1/233 of the baseline's $18,862. These results are a meaningful step toward deployment-ready multilingual ASR built from moderate-scale pretrained models.
Key Points
- ▸ Polyglot-Lion is a family of compact multilingual ASR models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay.
- ▸ The models achieve competitive results by fine-tuning Qwen3-ASR with a balanced sampling strategy that equalizes the number of training utterances per language (see the sketch after this list).
- ▸ Training cost ($81 vs. $18,862) and per-sample inference time (0.10 s vs. 2.02 s) are far lower than for the larger MERaLiON-2-10B-ASR baseline.
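The abstract describes the sampling strategy only at a high level. Below is a minimal sketch of one way to implement per-language balancing; the record layout and the downsample-to-the-minimum choice are assumptions, since the paper may equalize counts differently (e.g., by oversampling smaller languages):

```python
import random
from collections import defaultdict

# Minimal sketch of the balanced sampling idea from the abstract: equalize
# the number of training utterances per language. The record layout and
# field names here are assumptions, not the paper's actual pipeline.

def balance_by_language(utterances, seed=0):
    """Downsample every language's pool to the size of the smallest pool."""
    pools = defaultdict(list)
    for utt in utterances:
        pools[utt["lang"]].append(utt)
    n = min(len(pool) for pool in pools.values())
    rng = random.Random(seed)
    balanced = [u for pool in pools.values() for u in rng.sample(pool, n)]
    rng.shuffle(balanced)
    return balanced

# Toy corpus with the paper's four target languages, deliberately skewed.
corpus = (
    [{"lang": "en", "audio": f"en_{i}.wav"} for i in range(1000)]
    + [{"lang": "zh", "audio": f"zh_{i}.wav"} for i in range(800)]
    + [{"lang": "ms", "audio": f"ms_{i}.wav"} for i in range(500)]
    + [{"lang": "ta", "audio": f"ta_{i}.wav"} for i in range(300)]
)
train = balance_by_language(corpus)
assert len(train) == 4 * 300  # 300 utterances per language
```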
Merits
Significant Reduction in Training Cost
The authors report a training cost of $81 on a single RTX PRO 6000 GPU, roughly 1/233 of the $18,862 quoted for the 128-GPU baseline, making the approach highly cost-effective for building deployment-ready multilingual ASR systems.
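The ratio follows directly from the two dollar figures in the abstract; a one-line check:

```python
# Cost figures quoted in the abstract (USD).
polyglot_cost, meralion_cost = 81, 18_862
print(f"{meralion_cost / polyglot_cost:.0f}x cheaper")  # -> 233x cheaper
```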
Improved Inference Throughput
At 0.10 s/sample versus 2.02 s/sample, Polyglot-Lion's inference is approximately 20x faster than MERaLiON-2-10B-ASR's, making it suitable for real-time applications.
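Likewise, the speedup is implied directly by the two per-sample latencies:

```python
# Per-sample latencies quoted in the abstract (seconds).
print(f"{2.02 / 0.10:.1f}x faster")  # -> 20.2x faster, matching the ~20x claim
```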
Competitive Results
The average error rate of 14.85 across the 12 benchmarks is within 0.53 points of MERaLiON-2-10B-ASR's 14.32, even though the baseline is roughly 6x larger, demonstrating the effectiveness of Polyglot-Lion in multilingual ASR tasks.
Demerits
Limited Dataset Diversity
The models are fine-tuned only on publicly available speech corpora, which may limit the diversity of the training data and the generalizability of the results to unseen domains and speaker populations.
Language Tag Conditioning Omission
The authors deliberately omit language-tag conditioning, so the model must infer the language implicitly from audio; this could degrade performance on acoustically ambiguous or code-switched utterances where an explicit tag would help (see the sketch below).
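To make the design choice concrete, the sketch below contrasts tag-conditioned and tag-free decoder targets; the <|lang|> token format is hypothetical and not Qwen3-ASR's actual scheme:

```python
# Illustrative only: how a training target might look with and without
# language-tag conditioning. The token format is hypothetical.

def make_target(transcript: str, lang: str | None = None) -> str:
    """Build the decoder target string for one utterance."""
    if lang is not None:
        # Tag-conditioned: the model is told the language up front.
        return f"<|{lang}|> {transcript}"
    # Tag-free (Polyglot-Lion's choice): the model must identify the
    # language implicitly from the audio alone.
    return transcript

print(make_target("selamat pagi", lang="ms"))  # <|ms|> selamat pagi
print(make_target("selamat pagi"))             # selamat pagi
```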
Expert Commentary
The authors' recipe of fine-tuning Qwen3-ASR models with a linguistically balanced sampling strategy is a notable result for multilingual ASR: matching a 10B-parameter specialist to within about half a point of average error rate, at a tiny fraction of the training cost and inference time, makes deployment-ready multilingual ASR far more accessible. The deliberate omission of language-tag conditioning is an interesting design choice, testing whether language identification can be learned implicitly from audio rather than supplied as an input. The limited diversity of the public training corpora is a real concern, but the recipe is cheap enough that it can be re-run as more varied data becomes available.
Recommendations
- ✓ Future research should investigate the use of Polyglot-Lion in more diverse linguistic landscapes and evaluate its performance in real-world applications.
- ✓ The authors should explore alternative approaches to language identification, such as language-tag conditioning or multimodal inputs, to improve the model's robustness and generalizability.