Fish Audio S2 Technical Report
arXiv:2603.08823v1 (cross-listed). Abstract: We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
Executive Summary
The Fish Audio S2 Technical Report presents Fish Audio S2, an open-source text-to-speech system that supports multi-speaker, multi-turn generation and instruction-following control via natural-language descriptions. Training relies on a multi-stage recipe and a staged data pipeline covering video and speech captioning, voice-quality assessment, and reward modeling. The SGLang-based inference engine is described as production-ready for streaming, with a real-time factor (RTF) of 0.195, meaning roughly 0.195 s of processing per second of generated audio, and a time-to-first-audio below 100 ms. The authors release the model weights, fine-tuning code, and the inference engine, which advances the state of open-source text-to-speech. However, the report lacks a detailed discussion of technical challenges and potential biases in the system. The emphasis on open-sourcing is commendable, but the long-term sustainability of the project remains uncertain.
Key Points
- Introduction of Fish Audio S2, an open-sourced text-to-speech system
- Multi-stage training recipe and staged data pipeline for scalability
- Production-ready inference engine with low RTF and time-to-first-audio
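The two latency figures quoted above can be made concrete with a small measurement helper. The sketch below is not the Fish Audio S2 engine's API: `measure_streaming` and `StreamingStats` are hypothetical names, and the audio stream is assumed to be any iterable of sample chunks. It uses the standard definition of RTF as processing time divided by the duration of the audio produced, so values below 1 mean faster than real time.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class StreamingStats:
    time_to_first_audio_s: float  # wall time until the first chunk arrived
    total_time_s: float           # wall time until the stream ended
    audio_duration_s: float       # seconds of audio produced
    rtf: float                    # total_time_s / audio_duration_s


def measure_streaming(
    chunks: Iterable[Sequence[float]],
    sample_rate: int,
    clock: Callable[[], float] = time.perf_counter,
) -> StreamingStats:
    """Consume a stream of audio chunks and report latency statistics.

    `chunks` stands in for whatever a streaming TTS client yields
    (an assumption of this sketch); each chunk is a sequence of samples.
    """
    start = clock()
    first_chunk_at = None
    n_samples = 0
    for chunk in chunks:
        if first_chunk_at is None:
            # Timestamp taken as soon as the first chunk is available.
            first_chunk_at = clock()
        n_samples += len(chunk)
    end = clock()

    if first_chunk_at is None or n_samples == 0:
        raise ValueError("stream produced no audio")

    audio_duration = n_samples / sample_rate
    total = end - start
    return StreamingStats(
        time_to_first_audio_s=first_chunk_at - start,
        total_time_s=total,
        audio_duration_s=audio_duration,
        rtf=total / audio_duration,
    )
```

Passing a generator that wraps a real streaming client would time the actual synthesis. The injectable `clock` parameter exists only so the arithmetic can be checked deterministically; for example, 2 s of audio finished 0.39 s after the request corresponds to an RTF of 0.195.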
Merits
Advancements in Open-Source TTS
The release of Fish Audio S2 advances open-source text-to-speech: the model weights, fine-tuning code, and a streaming-capable, SGLang-based inference engine are all publicly available.
Scalability and Flexibility
The multi-stage training recipe and staged data pipeline are designed to scale training. The pipeline covers video captioning, speech captioning, voice-quality assessment, and reward modeling.
Demerits
Technical Challenges and Biases
The report lacks a detailed discussion of potential technical challenges and biases in the system, either of which may affect its performance and reliability in real-world applications.
Long-Term Sustainability
Open-sourcing the system does not by itself guarantee long-term sustainability: continued maintenance depends on community engagement, which is uncertain.
Expert Commentary
The Fish Audio S2 Technical Report presents a significant development in text-to-speech systems. The authors' emphasis on open-sourcing is commendable, as it lets the community engage with and improve the system. However, the report lacks a detailed discussion of potential technical challenges and biases, which may affect performance and reliability in real-world applications, and the long-term sustainability of the project remains uncertain. As the field evolves, these concerns need to be addressed to keep the system adaptable and maintainable. Fish Audio S2 has implications for natural language processing, human-computer interaction, and content creation. As the technical community works with the system, it should also consider the consequences of this technology for employment and society as a whole.
Recommendations
- Recommendation 1: The authors should provide a detailed discussion of the technical challenges and biases in the system, along with potential ways to mitigate them.
- Recommendation 2: The community should prioritize the long-term sustainability of the project, keeping it maintainable and adaptable to changing requirements and technological advances.