On the Emotion Understanding of Synthesized Speech
arXiv:2603.16483v1
Abstract: Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.
Executive Summary
This article critically examines the assumption that emotion understanding models learn fundamental representations that transfer to synthesized speech, by systematically evaluating Speech Emotion Recognition (SER) models on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis systems. The authors find that current SER models struggle to generalize to synthesized speech because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech, and that generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. These results suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, which calls into question their use as rewards or evaluation metrics for emotional expressiveness in speech synthesis.
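To make the evaluation setup concrete, the sketch below runs a publicly available discriminative SER checkpoint on a human recording and a synthesized rendition of the same utterance and compares the predicted emotion labels. The checkpoint name, file paths, and the use of the Hugging Face `transformers` audio-classification pipeline are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch: compare SER predictions on human vs. synthesized speech.
# Assumes a wav2vec2-based SER checkpoint (superb/wav2vec2-base-superb-er)
# and two 16 kHz WAV files with the same spoken content; neither is taken
# from the paper's actual experimental setup.
from transformers import pipeline

ser = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

def top_emotion(wav_path: str) -> str:
    """Return the highest-scoring emotion label for one utterance."""
    scores = ser(wav_path)  # list of {"label": ..., "score": ...}
    return max(scores, key=lambda s: s["score"])["label"]

human_label = top_emotion("human_angry.wav")  # hypothetical path
synth_label = top_emotion("tts_angry.wav")    # hypothetical path
print(f"human: {human_label} | synthesized: {synth_label}")
```

A systematic version of this check would aggregate label agreement or accuracy over an entire test set; frequent disagreement on the synthesized side is the generalization failure the paper reports.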
Key Points
- ▸ Current SER models struggle to generalize to synthesized speech because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech (a simple probe of this mismatch is sketched after this list).
- ▸ Generative SLMs tend to infer emotion from textual semantics rather than paralinguistic cues.
- ▸ Existing SER models often exploit non-robust shortcuts rather than capturing fundamental features.
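As a concrete illustration of the representation-mismatch point in the first bullet, a simple probe can embed a matched human/synthesized pair with a self-supervised speech encoder and measure how far apart the representations are; large, consistent gaps would be symptomatic of the mismatch the paper describes. The encoder choice, pooling strategy, and file paths below are illustrative assumptions, not the paper's protocol.

```python
# Sketch of a representation-mismatch probe (illustrative, not the paper's
# protocol): embed paired human and synthesized utterances with a
# self-supervised speech encoder and compare mean-pooled hidden states.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def embed(wav_path: str) -> torch.Tensor:
    """Mean-pooled final-layer representation of one utterance."""
    wave, sr = torchaudio.load(wav_path)
    if sr != 16000:  # the encoder expects 16 kHz input
        wave = torchaudio.functional.resample(wave, sr, 16000)
    inputs = extractor(wave.squeeze(0).numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)

human_emb = embed("human_utt.wav")  # hypothetical paths for a matched pair
synth_emb = embed("tts_utt.wav")
similarity = torch.nn.functional.cosine_similarity(human_emb, synth_emb, dim=0)
print(f"cosine similarity (human vs. synthesized): {similarity.item():.3f}")
```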
Merits
Strength
The study provides a comprehensive evaluation of SER models on synthesized speech, covering various datasets and synthesis models.
Demerits
Limitation
The study focuses primarily on the performance of SER models, without exploring potential solutions or methods for improving paralinguistic understanding in SLMs.
Expert Commentary
This study is a timely contribution to emotion understanding for synthesized speech. By critically testing the assumption that SER models learn fundamental representations that transfer to synthesized speech, the authors show how readily such models break down outside human recordings, and how readily generative SLMs fall back on textual semantics instead of paralinguistic cues. The findings caution against treating current SER outputs as reliable rewards or evaluation metrics for emotional expressiveness in synthesis, and motivate a closer analysis of the representation mismatch between synthesized and human speech. The evaluation is thorough and the conclusions are well supported by the data. The main limitation is that the study diagnoses the problem without proposing remedies; methods for improving paralinguistic understanding in SLMs remain an important direction for future research.
Recommendations
- ✓ Future research should develop emotion understanding models that capture paralinguistic cues robustly enough to generalize from human to synthesized speech.
- ✓ Developers should prioritize emotion understanding models that rely on genuine paralinguistic features rather than non-robust shortcuts, and should be cautious about using current SER models as rewards or evaluation metrics for synthesized speech.