LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models

Ahmed Khaled Khamis, Hesham Ali

arXiv:2602.15675v1 Abstract: Despite advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic -- the most widely understood Arabic dialect -- severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate it against a baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.

Executive Summary

The authors present a synthetic data pipeline for training dialectal text-to-speech (TTS) models, targeting the under-resourced Egyptian Arabic dialect. Leveraging large language models (LLMs) to generate dialectal text, they build NileTTS, a 38-hour transcribed speech dataset spanning medical, sales, and general conversational domains. Fine-tuning the XTTS v2 model on this dataset yields improved performance over a baseline trained on other Arabic dialects. The contributions include a publicly available Egyptian Arabic TTS dataset, a reproducible synthetic data generation pipeline, and an open-source fine-tuned model, with clear implications for advancing Egyptian Arabic speech synthesis and more inclusive language technologies.
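The pipeline described in the abstract can be sketched as a chain of stages: LLM text generation, audio synthesis, automatic transcription, and a quality check. The sketch below is illustrative only; every stage function is a hypothetical placeholder standing in for a real LLM, TTS engine, and ASR model, not the authors' implementation.

```python
# Hypothetical sketch of the synthetic data pipeline: each stub would be
# backed by a real LLM, TTS engine, and ASR system in practice.

def generate_text(prompt: str) -> str:
    """Stand-in for an LLM call producing Egyptian Arabic content."""
    return f"generated dialogue for: {prompt}"

def synthesize_speech(text: str) -> bytes:
    """Stand-in for an audio synthesis tool converting text to a waveform."""
    return text.encode("utf-8")  # placeholder "audio"

def transcribe(audio: bytes) -> str:
    """Stand-in for automatic speech recognition of the synthetic audio."""
    return audio.decode("utf-8")

def build_corpus(prompts: list[str]) -> list[dict]:
    """Run every prompt through the pipeline, keeping a (text, audio,
    transcript) record only when the round-trip transcript matches the
    generated text -- a crude automatic quality gate."""
    corpus = []
    for prompt in prompts:
        text = generate_text(prompt)
        audio = synthesize_speech(text)
        transcript = transcribe(audio)
        if transcript == text:
            corpus.append({"text": text, "audio": audio,
                           "transcript": transcript})
    return corpus

corpus = build_corpus(["a medical consultation", "a sales call"])
print(len(corpus))  # 2
```

In the paper's actual pipeline the final gate also involves speaker diarization and manual verification, which the sketch above does not model.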

Key Points

  • Introduction of a novel synthetic data pipeline for dialectal TTS models
  • Creation of a publicly available Egyptian Arabic TTS dataset
  • Fine-tuning of the XTTS v2 model for improved performance

Merits

Addressing the under-resourced Egyptian Arabic dialect

The authors fill a critical gap in the existing research by focusing on the most widely understood Arabic dialect, Egyptian Arabic.

Reproducible synthetic data generation pipeline

The authors provide a detailed and reproducible approach to generating dialectal TTS datasets, making it easier for other researchers to build upon this work.
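One plausible way to automate part of the pipeline's quality verification is to keep a synthetic clip only when its ASR transcript stays close to the LLM-generated source text, measured by character error rate (CER). This is a hedged sketch, not the paper's method; the 0.1 threshold is an illustrative choice.

```python
# Illustrative transcript-agreement filter for synthetic TTS data, using
# character error rate computed from classic edit distance.

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def passes_quality_gate(source_text: str, asr_transcript: str,
                        max_cer: float = 0.1) -> bool:
    """Accept a clip only if the ASR transcript closely matches the source."""
    return character_error_rate(source_text, asr_transcript) <= max_cer

print(passes_quality_gate("ahlan wa sahlan", "ahlan wa sahlan"))   # True
print(passes_quality_gate("ahlan wa sahlan", "totally different")) # False
```

A filter like this reduces, but does not replace, the manual verification step the authors describe, since ASR errors and synthesis errors can cancel out.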

Open-source fine-tuned model

The release of the open-source fine-tuned model enables other researchers to leverage this improved performance and advance Egyptian Arabic speech synthesis research.

Demerits

Limited domain diversity

While the dataset covers diverse domains, it is drawn from only two speakers, which limits diversity in speaker characteristics, accents, and speaking styles.

Dependence on large language models

The pipeline inherits the quality and biases of the underlying large language models: errors or unnatural phrasing in the generated Egyptian Arabic text propagate directly into the synthesized speech if the models are not carefully selected and validated.

Expert Commentary

The authors' work demonstrates a critical understanding of the challenges and opportunities in developing dialectal TTS models. By leveraging synthetic data generation, they create a valuable resource for the research community and promote the development of more inclusive language technologies. However, the limitations of the pipeline and dataset should be carefully considered, and future research should aim to address these issues through expanded domain diversity, more robust language models, and thorough evaluation protocols.

Recommendations

  • Future research should prioritize the development of more diverse and representative datasets for dialectal TTS models.
  • The research community should continue to engage with stakeholders from under-resourced languages to better understand their needs and priorities for language technologies.
