Fish Audio S2 Technical Report

arXiv:2603.08823v1. Abstract: We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han · March 11, 2026

#cs.SD #cs.AI #cs.CL

Executive Summary

The Fish Audio S2 Technical Report presents Fish Audio S2, an open-source text-to-speech (TTS) system featuring multi-speaker, multi-turn generation and instruction-following control via natural-language descriptions. The system is trained with a multi-stage recipe and a staged data pipeline covering video and speech captioning, voice-quality assessment, and reward modeling. Its production-ready streaming inference engine achieves a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. The authors release model weights, fine-tuning code, and an SGLang-based inference engine, a release intended to push the frontier of open-source TTS. However, the report does not discuss technical challenges or potential biases in the system in detail. The authors' emphasis on open-sourcing the system is commendable, but the long-term sustainability of the project remains uncertain.
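To make the RTF figure concrete, the short sketch below works through the arithmetic. It assumes the common convention that RTF is processing time divided by the duration of the audio produced; the summary above does not spell out the definition, so treat this as an interpretation. The function names are illustrative and are not part of the Fish Audio code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed convention: compute time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    reported_rtf = 0.195  # figure quoted in the abstract
    clip_seconds = 10.0   # hypothetical clip length, chosen for illustration

    # Compute time implied by the reported RTF for a clip of this length.
    compute_seconds = reported_rtf * clip_seconds
    print(f"compute time for {clip_seconds:.0f} s of audio: {compute_seconds:.2f} s")
    print(f"speedup over real time: {speedup_over_real_time(reported_rtf):.2f}x")
```

Under this reading, an RTF below 1 means audio is generated faster than it plays back, which is the precondition for uninterrupted streaming.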

Key Points

▸ Introduction of Fish Audio S2, an open-sourced text-to-speech system
▸ Multi-stage training recipe and staged data pipeline for scalability
▸ Production-ready inference engine with low RTF and time-to-first-audio
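For readers who want to check the two streaming metrics on their own deployment, the following is a minimal sketch of how time-to-first-audio and RTF could be measured from any iterator of raw PCM chunks. It does not use the Fish Audio or SGLang APIs: `measure_stream` and `fake_tts_stream` are hypothetical names, the 16-bit mono PCM format is an assumption, and RTF is taken to be wall-clock time divided by audio duration.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    sample_rate: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit mono PCM
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (time_to_first_audio_seconds, real_time_factor) for a chunk stream."""
    start = clock()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        # Record the arrival time of the first non-empty chunk only.
        if first_chunk_at is None and chunk:
            first_chunk_at = clock()
        total_bytes += len(chunk)
    end = clock()

    if first_chunk_at is None:
        raise ValueError("stream produced no audio")

    audio_seconds = total_bytes / (sample_rate * bytes_per_sample)
    return first_chunk_at - start, (end - start) / audio_seconds


def fake_tts_stream(n_chunks: int, chunk_bytes: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client; yields silence."""
    for _ in range(n_chunks):
        time.sleep(0.01)  # simulated generation latency, not a real measurement
        yield b"\x00" * chunk_bytes


if __name__ == "__main__":
    ttfa, rtf = measure_stream(fake_tts_stream(5, 32000), sample_rate=16000)
    print(f"time-to-first-audio: {ttfa * 1000:.1f} ms, RTF: {rtf:.3f}")
```

The `clock` parameter is injectable so the function can be tested deterministically; in practice the default `time.perf_counter` would be used and the chunk iterator would come from the real inference client.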

Merits

Advancements in Open-Source TTS

The release of Fish Audio S2 pushes the frontier of open-source text-to-speech systems, offering a scalable and production-ready solution for various applications.

Scalability and Flexibility

The multi-stage training recipe and staged data pipeline enable the system to handle diverse data sources and adapt to changing requirements.

Demerits

Technical Challenges and Biases

The report lacks detailed discussions on potential technical challenges and biases in the system, which may impact its performance and reliability in real-world applications.

Long-Term Sustainability

Because the system is open source, its long-term sustainability depends on continued community engagement and maintenance, neither of which is guaranteed.

Expert Commentary

The Fish Audio S2 Technical Report presents a significant development in text-to-speech systems. The authors' emphasis on open-sourcing the system is commendable, as it lets the community engage with and improve it. However, the report does not discuss potential technical challenges and biases in detail, and these may affect performance and reliability in real-world applications. The long-term sustainability of the project also remains uncertain. As the field evolves, these concerns need to be addressed to keep the system adaptable and maintainable. Fish Audio S2 has implications for natural language processing, human-computer interaction, and content creation. As the technical community continues to work with the system, it should also consider the technology's potential consequences for employment and society as a whole.

Recommendations

✓ Recommendation 1: The authors should provide a detailed discussion on the technical challenges and biases in the system, as well as potential solutions to mitigate these issues.
✓ Recommendation 2: The community should prioritize the long-term sustainability of the project, ensuring its maintainability and adaptability to changing requirements and technological advancements.

Sources

arXiv - cs.CL