Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

arXiv:2602.24080v1 (Announce Type: new)

Abstract: The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably as Turing test judges. In response, we propose an interpretable model that leverages the fine-grained human-likeness ratings and delivers accurate and transparent human-vs-machine discrimination, offering a powerful tool for automatic human-likeness evaluation. Our work establishes the first human-likeness evaluation for S2S systems and moves beyond binary outcomes to enable detailed diagnostic insights, paving the way for human-like improvements in conversational AI systems.

Executive Summary

This paper presents a preliminary Turing test for speech-to-speech (S2S) systems, asking whether they can converse like humans. In the first Turing test conducted for S2S systems, the authors collected 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. No evaluated system passed, revealing a significant gap in human-likeness; the bottleneck lies not in semantic understanding but in paralinguistic features, emotional expressivity, and conversational persona. The study also finds that off-the-shelf AI models are unreliable as Turing test judges and proposes an interpretable model for human-likeness evaluation in their place. The work contributes a fine-grained taxonomy of 18 human-likeness dimensions and a diagnostic tool for improving conversational AI systems.
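The abstract reports that no system "passes the test" but does not spell out the pass criterion. A common way to operationalize passing is to compare the rate at which judges label a system "human" against the 50% chance baseline; the sketch below illustrates that idea with an exact one-sided binomial test (the function name, toy counts, and the 50% baseline are assumptions for illustration, not the paper's protocol):

```python
from math import comb

def binomial_pvalue_below(k: int, n: int, p0: float = 0.5) -> float:
    """One-sided exact binomial test: the probability of seeing k or fewer
    'judged human' verdicts out of n trials if the true rate were p0."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))

# Toy numbers: a system labeled 'human' in only 30 of 100 judgments
# sits reliably below the 50% chance baseline, i.e. it fails the test.
p = binomial_pvalue_below(30, 100)
print(f"p-value = {p:.2e}")  # far below 0.05
```

A system whose judged-human rate is statistically indistinguishable from chance (or from the rate at which real humans are judged human) would, under this reading, pass.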

Key Points

  • The authors conducted the first Turing test for S2S systems, collecting human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants.
  • The results indicate a significant gap in human-likeness, with the bottleneck attributed to paralinguistic features, emotional expressivity, and conversational persona.
  • The study proposes an interpretable model for human-likeness evaluation, leveraging fine-grained human-likeness ratings for accurate and transparent human-vs-machine discrimination.
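The abstract describes the proposed judge only as "an interpretable model that leverages the fine-grained human-likeness ratings." One natural reading is a linear classifier over the 18 dimension ratings, whose learned weights expose how much each dimension drives the human-vs-machine verdict. The sketch below illustrates that idea on synthetic 1-5 ratings; it is not the paper's actual model, and all names and data are invented:

```python
import math
import random

N_DIMS = 18  # size of the paper's human-likeness taxonomy

def fit_logreg(X, y, lr=0.1, epochs=500):
    """Logistic regression via batch gradient descent. Each weight w[j]
    is directly readable as the influence of dimension j on the verdict."""
    n, d = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / n for j in range(d)]   # center the ratings
    Xc = [[x[j] - mu[j] for j in range(d)] for x in X]
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(Xc, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            e = 1.0 / (1.0 + math.exp(-z)) - yi        # prediction error
            for j in range(d):
                gw[j] += e * xi[j]
            gb += e
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b, mu

def p_human(w, b, mu, ratings):
    """Probability that a dialogue with these ratings came from a human."""
    z = b + sum(wj * (xj - mj) for wj, xj, mj in zip(w, ratings, mu))
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic ratings: human dialogues score higher on every dimension
# (a toy assumption purely so the classifier has something to learn).
random.seed(0)
def sample(mean):
    return [min(5.0, max(1.0, random.gauss(mean, 0.5))) for _ in range(N_DIMS)]

X = [sample(4.0) for _ in range(50)] + [sample(2.5) for _ in range(50)]
y = [1] * 50 + [0] * 50
w, b, mu = fit_logreg(X, y)
print(f"human-like profile:   p = {p_human(w, b, mu, [4.2] * N_DIMS):.2f}")
print(f"machine-like profile: p = {p_human(w, b, mu, [2.4] * N_DIMS):.2f}")
```

Because the model is linear, sorting the learned weights yields a transparent ranking of which dimensions (e.g. paralinguistic cues or emotional expressivity) most sway the human-vs-machine decision, which is the kind of diagnostic insight the paper emphasizes.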

Merits

Strength in Methodology

The authors employed a comprehensive and systematic approach to evaluate the human-likeness of S2S systems, collecting a large dataset of human judgments and developing a fine-grained taxonomy of human-likeness dimensions.

Demerits

Limitation in Generalizability

The findings may not generalize to all S2S systems, since the evaluation covered a specific set of nine state-of-the-art systems. Further research is needed to validate the results and to confirm that the proposed judge model transfers to systems outside this set.

Expert Commentary

This article makes a significant contribution to conversational AI: a comprehensive evaluation of human-likeness in S2S systems and a diagnostic tool for identifying where they fall short. The findings and the proposed judge model matter for building more human-like conversational systems, which can enhance user experience and engagement. That said, the generalizability limitation noted above means the results need further validation before the model is applied to other systems. The work may also bear on the regulation of conversational AI, particularly in industries where human-likeness is a critical factor.

Recommendations

  • Further research is needed to validate the study's findings and ensure the proposed model's applicability to other S2S systems.
  • Development of more human-like conversational AI systems should prioritize paralinguistic features, emotional expressivity, and conversational persona, the key improvement areas the study identifies.
