LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
arXiv:2602.15675v1 Announce Type: new Abstract: Despite the advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic -- the most widely understood Arabic dialect -- severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.
Executive Summary
The authors present a novel synthetic data pipeline for training dialectal text-to-speech (TTS) models, specifically targeting the under-resourced Egyptian Arabic dialect. Leveraging large language models (LLMs), they create a 38-hour transcribed speech dataset from two speakers across diverse domains. By fine-tuning the XTTS v2 model on this dataset, they report improved performance over a baseline trained on other Arabic dialects. The contributions include a publicly available Egyptian Arabic TTS dataset, a reproducible synthetic data generation pipeline, and an open-source fine-tuned model. This work advances Egyptian Arabic speech synthesis research and promotes more inclusive language technologies.
Key Points
- ▸ Introduction of a novel synthetic data pipeline for training dialectal TTS models
- ▸ Creation of a publicly available Egyptian Arabic TTS dataset
- ▸ Fine-tuning of the XTTS v2 model for improved Egyptian Arabic synthesis (see the data-preparation sketch below)
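The paper does not specify how the 38 hours of audio are packaged for XTTS v2 fine-tuning, so the sketch below only illustrates one common convention: the pipe-delimited, LJSpeech-style metadata.csv layout that Coqui TTS dataset formatters accept. The clip IDs, directory names, and the sample sentence are all invented for illustration.

```python
# Hypothetical sketch: organize (clip_id, transcript) pairs into the
# pipe-delimited LJSpeech-style metadata.csv layout that Coqui TTS dataset
# formatters accept. The paper does not specify its actual dataset layout.
import csv
from pathlib import Path

def write_metadata(pairs, out_dir: Path) -> None:
    """pairs: iterable of (clip_id, egyptian_arabic_text).
    The corresponding wav files are assumed to live under out_dir/wavs/."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "metadata.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        for clip_id, text in pairs:
            # LJSpeech convention: id | raw transcription | normalized transcription
            writer.writerow([clip_id, text, text])

# Invented example row with an Egyptian Arabic sentence:
write_metadata([("clip_0001", "ازيك عامل ايه النهارده؟")], Path("niletts_dataset"))
```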
Merits
Addressing the under-resourced Egyptian Arabic dialect
The authors fill a critical gap in the existing research by focusing on the most widely understood Arabic dialect, Egyptian Arabic.
Reproducible synthetic data generation pipeline
The authors provide a detailed and reproducible approach to generating dialectal TTS datasets, making it easier for other researchers to build upon this work.
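As a rough illustration of what such a pipeline looks like end to end, here is a minimal structural sketch of the four stages the abstract describes (LLM text generation, audio synthesis, transcription plus diarization, manual verification). The paper does not name the underlying LLM, synthesizer, ASR, or diarization tools, so every stage function below is a hypothetical placeholder showing only the data flow.

```python
# Structural sketch of the four-stage synthetic pipeline described in the
# abstract. Every stage function is a hypothetical placeholder: the paper
# does not name the specific LLM, synthesizer, ASR, or diarization tools.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    text: str        # Egyptian Arabic transcript
    audio_path: str  # path to the synthesized clip

def generate_text(domain: str) -> str:
    """Stage 1: prompt an LLM for Egyptian Arabic content in the given domain."""
    raise NotImplementedError("call your LLM of choice here")

def synthesize(text: str) -> str:
    """Stage 2: convert text to speech with an audio synthesis tool; return a wav path."""
    raise NotImplementedError("call your synthesizer of choice here")

def transcribe_and_diarize(audio_path: str) -> list[Utterance]:
    """Stage 3: automatic transcription plus speaker diarization."""
    raise NotImplementedError("call your ASR/diarization stack here")

def verify(utterances: list[Utterance]) -> list[Utterance]:
    """Stage 4: manual quality verification; here, just drop empty transcripts."""
    return [u for u in utterances if u.text.strip()]

def run_pipeline(domains: list[str]) -> list[Utterance]:
    dataset: list[Utterance] = []
    for domain in domains:
        audio = synthesize(generate_text(domain))
        dataset.extend(verify(transcribe_and_diarize(audio)))
    return dataset
```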
Open-source fine-tuned model
The release of the open-source fine-tuned model lets other researchers build on the reported improvements and advance Egyptian Arabic speech synthesis research; a minimal loading sketch follows.
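For context, this is roughly how an XTTS v2 checkpoint is typically loaded with the Coqui TTS Python API; the paths, reference clip, and prompt below are placeholders rather than the actual release layout.

```python
# Minimal inference sketch with the Coqui TTS API (pip install TTS).
# Checkpoint/config paths, the reference clip, and the prompt are placeholders;
# the actual release layout may differ.
from TTS.api import TTS

tts = TTS(
    model_path="niletts_xtts_v2/",              # hypothetical fine-tuned checkpoint dir
    config_path="niletts_xtts_v2/config.json",  # hypothetical config path
)
tts.tts_to_file(
    text="صباح الخير، تحب أساعدك ازاي؟",        # Egyptian Arabic prompt (invented)
    speaker_wav="reference_speaker.wav",        # short clip for XTTS voice cloning
    language="ar",                              # XTTS v2 Arabic language code
    file_path="output.wav",
)
```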
Demerits
Limited domain diversity
While the dataset covers diverse domains, it draws on only two speakers, and the authors acknowledge limited diversity in speaker characteristics, accents, and speaking styles.
Dependence on large language models
The pipeline depends on LLM-generated text and synthesized audio, so biases, dialectal inaccuracies, or acoustic artifacts in those upstream models can propagate into the dataset and, in turn, into the fine-tuned TTS model unless the components are carefully selected and validated.
Expert Commentary
The authors' work demonstrates a clear understanding of the challenges and opportunities in developing dialectal TTS models. By leveraging synthetic data generation, they create a valuable resource for the research community and promote the development of more inclusive language technologies. However, the limitations of the pipeline and dataset should be weighed carefully, and future research should address them through expanded speaker and domain diversity, more robust upstream language models, and thorough evaluation protocols.
Recommendations
- ✓ Future research should prioritize the development of more diverse and representative datasets for dialectal TTS models.
- ✓ The research community should continue to engage with stakeholders from under-resourced languages to better understand their needs and priorities for language technologies.