Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
arXiv:2602.12546v1 Announce Type: cross Abstract: We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on Librispeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.
Executive Summary
The article presents a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack, removing the need for external speech encoders or pretrained large language models. A modality-aware sparse mixture of experts (MoE) routes each token, via hard top-1 selection, into disjoint expert pools for speech and text, inside hybrid-causality Conformer blocks that attend bidirectionally over speech and causally over text. Training combines a CTC loss on speech positions with label-smoothed cross-entropy for text generation. The 113M-parameter model outperforms a 139M-parameter AED baseline on Librispeech (2.8% vs. 3.2% WER on test-clean; 5.6% vs. 6.0% on test-other) and lowers average WER on Common Voice 16.1 from 12.2% to 10.6% with a single multilingual model across five languages. The authors report this as the first randomly initialized decoder-only ASR model to surpass strong AED baselines, achieving better accuracy with fewer active parameters and without alignment or adaptation modules.
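The hybrid-causality design described above can be made concrete with an attention mask: speech positions attend bidirectionally within the speech prefix, while text positions attend causally to earlier text and to the full speech prefix. The following is a minimal NumPy sketch under that assumption; the function name and the exact masking details are illustrative, not taken from the paper.

```python
import numpy as np

def hybrid_causality_mask(num_speech: int, num_text: int) -> np.ndarray:
    """Boolean attention mask for a [speech | text] sequence (True = may attend).

    Speech frames attend bidirectionally to all speech frames; text tokens
    attend to the whole speech prefix plus earlier text tokens (causal).
    Speech never attends to text.
    """
    n = num_speech + num_text
    mask = np.zeros((n, n), dtype=bool)
    # Speech block: full bidirectional attention within speech.
    mask[:num_speech, :num_speech] = True
    # Text rows: all of speech, plus lower-triangular (causal) text.
    mask[num_speech:, :num_speech] = True
    mask[num_speech:, num_speech:] = np.tril(
        np.ones((num_text, num_text), dtype=bool))
    return mask

m = hybrid_causality_mask(3, 2)
```

With 3 speech frames and 2 text tokens, every speech row sees only the speech block, the first text row sees the speech prefix plus itself, and the last text row sees everything.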
Key Points
- ▸ Introduction of a decoder-only Conformer architecture for ASR.
- ▸ Use of modality-aware sparse mixture of experts (MoE) with distinct expert pools for speech and text.
- ▸ Hybrid-causality Conformer blocks for bidirectional speech and causal text processing.
- ▸ Training with CTC loss on speech positions and label-smoothed cross-entropy for text generation.
- ▸ Achievement of better accuracy with fewer active parameters compared to strong AED baselines.
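The modality-aware MoE in the points above can be sketched as two disjoint expert pools with a per-modality router and hard top-1 selection, so only one expert's parameters are active per token. This is a minimal NumPy illustration with hypothetical shapes and names; the paper's experts sit inside Conformer feed-forward modules, not a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS = 8, 4

# Disjoint expert pools: separate weights for speech and text tokens.
pools = {
    "speech": [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)],
    "text":   [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)],
}
# One router per modality produces logits over that modality's experts only.
routers = {mod: rng.standard_normal((D, N_EXPERTS)) for mod in pools}

def moe_layer(x: np.ndarray, modality: str) -> np.ndarray:
    """Hard top-1 routing within the modality's own expert pool."""
    logits = x @ routers[modality]             # (T, N_EXPERTS)
    choice = logits.argmax(axis=-1)            # hard top-1 selection
    out = np.empty_like(x)
    for t, e in enumerate(choice):
        out[t] = x[t] @ pools[modality][e]     # only the chosen expert runs
    return out

speech_out = moe_layer(rng.standard_normal((6, D)), "speech")
text_out = moe_layer(rng.standard_normal((4, D)), "text")
```

Because routing is by modality and the pools share no parameters, a speech token can never activate a text expert (and vice versa), which is what keeps the active parameter count low despite the larger total parameter budget.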
Merits
Innovative Architecture
The decoder-only Conformer architecture is a novel approach in ASR, integrating speech and text processing within a single stack, which simplifies the model and reduces the need for external components.
Efficient Parameter Usage
The model achieves better accuracy with fewer active parameters, making it more efficient and potentially more scalable for real-world applications.
Multilingual Performance
The model demonstrates strong performance across multiple languages, reducing average WER on Common Voice 16.1, indicating its potential for multilingual ASR tasks.
Demerits
Limited Scope of Evaluation
The evaluation is primarily focused on Librispeech and Common Voice 16.1 datasets, which may not fully represent the diversity of real-world ASR scenarios.
Complexity in Implementation
The use of modality-aware sparse MoE and hybrid-causality Conformer blocks adds complexity to the model, which may pose challenges for implementation and deployment.
Dependence on Training Techniques
The model's performance is heavily reliant on specific training techniques, such as CTC loss and label-smoothed cross-entropy, which may limit its adaptability to other tasks or datasets.
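Of the two objectives mentioned above, the label-smoothed cross-entropy on text tokens is the simpler to illustrate; a minimal NumPy sketch follows (the function name and `eps` default are assumptions for illustration, and the paper's CTC term on speech positions is omitted here).

```python
import numpy as np

def label_smoothed_ce(logits: np.ndarray, targets: np.ndarray,
                      eps: float = 0.1) -> float:
    """Label-smoothed cross-entropy over text tokens.

    The one-hot target keeps weight (1 - eps); the remaining eps is spread
    uniformly over the other classes, discouraging over-confident outputs.
    With eps = 0 this reduces to the standard negative log-likelihood.
    """
    T, V = logits.shape
    # Numerically stable log-softmax.
    logp = logits - logits.max(axis=-1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=-1, keepdims=True))
    smooth = np.full((T, V), eps / (V - 1))
    smooth[np.arange(T), targets] = 1.0 - eps
    return float(-(smooth * logp).sum(axis=-1).mean())

rng = np.random.default_rng(1)
logits = rng.standard_normal((5, 10))
targets = rng.integers(0, 10, size=5)
loss = label_smoothed_ce(logits, targets, eps=0.1)
```

In the paper's setup this term would be summed with a CTC loss computed on the speech positions; the weighting between the two is a training hyperparameter not specified in the abstract.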
Expert Commentary
This work marks a notable step for decoder-only ASR. By handling speech and text in one Conformer stack, the model dispenses with external speech encoders and pretrained LLMs, while the modality-aware sparse MoE (disjoint expert pools with hard top-1 routing) keeps per-token compute low. The hybrid-causality blocks, bidirectional over speech and causal over text, let the model read the full acoustic context while generating text autoregressively, and the joint objective of CTC on speech positions plus label-smoothed cross-entropy on text supplies an alignment signal without dedicated alignment or adaptation modules. That a 113M-parameter model trained from random initialization beats a 139M AED baseline is strong evidence for the design's efficiency. The main caveats are scope and complexity: evaluation is limited to Librispeech and Common Voice 16.1, which may not reflect the full diversity of real-world ASR conditions, and the routing and masking machinery adds implementation and deployment burden. Within those limits, the contributions are substantial and can inform both practical ASR systems and research on unified speech-text modeling.
Recommendations
- ✓ Further evaluation of the model on a more diverse range of datasets to assess its performance in various real-world scenarios.
- ✓ Exploration of techniques to simplify the model's implementation and deployment, making it more accessible for practical applications.