
DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization

Jianing Yang, Yusuke Fujita, Yui Sudo

arXiv:2603.09180v1. Abstract: Spoken dialog systems with cascaded ASR-LLM-TTS modules retain strong LLM intelligence, but VAD segmentation often forces half-duplex turns and brittle control. VAD-free end-to-end models, on the other hand, support full-duplex interaction but struggle to maintain conversational intelligence. In this paper, we present DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. Our key idea is to convert conventional utterance-wise long turns into chunk-wise micro-turn interactions, enabling rapid bidirectional exchange while preserving the strengths of a capable text LLM. To reliably coordinate turn-taking and response timing, we introduce a set of conversational special control tokens that steer the LLM's behavior under streaming constraints. On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.

Executive Summary

The paper presents DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. By converting utterance-wise long turns into chunk-wise micro-turn interactions, DuplexCascade enables rapid bidirectional exchange while preserving the strengths of a capable text LLM. The system introduces conversational special control tokens to steer the LLM's turn-taking and response timing under streaming constraints. The authors report state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems on two benchmarks, Full-DuplexBench and VoiceBench. If these results hold more broadly, the approach could make human-computer conversation more natural and interactive, with potential applications in customer service, language learning, and accessibility technologies.

Key Points

  • DuplexCascade is a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue.
  • The system converts utterance-wise long turns into chunk-wise micro-turn interactions.
  • Conversational special control tokens are introduced to steer the LLM's behavior under streaming constraints.
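
The micro-turn idea in the points above can be illustrated with a small sketch. It is not the authors' implementation: the token names `<wait>` and `<speak>`, the `run_micro_turns` loop, and the `toy_policy` stand-in for the LLM are all assumptions made for illustration, since the abstract does not list the actual control tokens. The sketch only shows the control flow: each ASR chunk triggers one micro-turn, and the model's leading control token decides whether anything is sent to TTS.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# Hypothetical control tokens (assumption: the paper's real token set is not
# given in the abstract).
WAIT = "<wait>"    # keep listening, produce no speech for this micro-turn
SPEAK = "<speak>"  # respond now; the text after the token goes to TTS


@dataclass
class MicroTurnState:
    # Interleaved log of user chunks and model outputs, one pair per micro-turn.
    history: List[str] = field(default_factory=list)
    # Text segments that would be forwarded to the TTS module.
    spoken: List[str] = field(default_factory=list)


def run_micro_turns(
    asr_chunks: Iterable[str],
    policy: Callable[[List[str]], str],
) -> MicroTurnState:
    """Run one model call per ASR chunk instead of one per full utterance."""
    state = MicroTurnState()
    for chunk in asr_chunks:
        state.history.append(f"user: {chunk}")
        output = policy(state.history)  # stand-in for the streaming LLM call
        state.history.append(f"model: {output}")
        if output.startswith(SPEAK):
            state.spoken.append(output[len(SPEAK):].strip())
        # On WAIT nothing is synthesized; the loop simply consumes the next chunk.
    return state


def toy_policy(history: List[str]) -> str:
    """Toy rule in place of an LLM: answer once the user chunk ends a question."""
    if history[-1].rstrip().endswith("?"):
        return f"{SPEAK} It is sunny."
    return WAIT


if __name__ == "__main__":
    result = run_micro_turns(["what is", "the weather", "today?"], toy_policy)
    print(result.spoken)
```

In this toy run the model stays silent for the first two chunks and speaks on the third. A real system would replace `toy_policy` with an LLM trained to emit such tokens and would likely need further tokens, for example to handle interruptions.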

Merits

Improved Efficiency

DuplexCascade processes speech in chunk-wise micro-turns rather than waiting for a VAD-segmented utterance to end. This enables rapid bidirectional exchange and is intended to reduce response latency in spoken dialog systems.

Enhanced Conversational Intelligence

The system preserves the strengths of a capable text LLM, allowing for more natural and interactive human-computer conversations.

Demerits

Complexity

The introduction of conversational special control tokens may increase the complexity of the system, requiring additional computational resources and expertise to implement.

Limited Generalizability

The results may not generalize to other languages or domains, requiring further testing and validation to ensure the effectiveness of DuplexCascade.

Expert Commentary

DuplexCascade is a notable contribution to spoken dialogue systems because it targets two limitations at once: the half-duplex turns and brittle control imposed by VAD segmentation in cascaded pipelines, and the weaker conversational intelligence of VAD-free end-to-end models. Its conversational special control tokens give the LLM a way to coordinate turn-taking and response timing under streaming constraints. The added system complexity is a concern, but the reported state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech systems make it a promising approach.

Recommendations

  • Future research should focus on testing and validating DuplexCascade in different languages and domains to ensure its generalizability.
  • The conversational special control tokens should be studied further to understand their impact on the system's complexity and efficiency.
