Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
arXiv:2602.23266v1
Abstract: Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
Executive Summary
The article proposes Discourse-Aware Dual-Track Streaming Response (DDTSR), an architecture designed to reduce response latency in cascaded spoken dialogue systems. An auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel, which enables listen-while-thinking and speak-while-thinking. Two further mechanisms support this: streaming-based cross-modal collaboration, which overlaps ASR, LLM inference, and TTS, and curriculum-learning-based discourse continuity enhancement, which keeps early responses consistent with the later reasoning output. The authors report a 19%-51% reduction in response latency on two spoken dialogue benchmarks while preserving discourse quality. They also report that DDTSR works as a plug-and-play module with diverse large language model (LLM) backbones and remains robust across varying utterance lengths.
Key Points
- ▸ The Discourse-Aware Dual-Track Streaming Response (DDTSR) framework is introduced to reduce response latency in cascaded spoken dialogue systems.
- ▸ DDTSR employs an auxiliary small model for minimal-committal discourse connectives and a large model for knowledge-intensive reasoning in parallel.
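- ▸ The framework also incorporates streaming-based cross-modal collaboration, which overlaps ASR, LLM inference, and TTS, and curriculum-learning-based discourse continuity enhancement, which keeps early responses coherent with the subsequent reasoning output.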
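The dual-track idea in the points above can be illustrated with a minimal sketch. It is not the authors' implementation: both model functions are hypothetical stand-ins that only sleep and return fixed strings, and the `speak` callback is an assumed placeholder for a TTS front end. The sketch shows only the scheduling pattern, in which the large model starts reasoning immediately while the small model's connective is spoken first.

```python
import asyncio
import time


async def small_model_connective(user_text: str) -> str:
    """Hypothetical stand-in for the auxiliary small model (fast, low commitment)."""
    await asyncio.sleep(0.05)  # simulated short generation time, not a measured value
    return "Well,"


async def large_model_answer(user_text: str) -> str:
    """Hypothetical stand-in for the large model (slower, knowledge-intensive)."""
    await asyncio.sleep(0.30)  # simulated longer generation time, not a measured value
    return "the capital of France is Paris."


async def dual_track_respond(user_text: str, speak) -> list[str]:
    # Start the large model first so it reasons while the connective is produced.
    large_task = asyncio.create_task(large_model_answer(user_text))
    connective = await small_model_connective(user_text)
    speak(connective)  # earliest speakable moment: only the connective is needed
    answer = await large_task
    speak(answer)
    return [connective, answer]


def run_demo() -> list[tuple[float, str]]:
    """Run one turn and record (elapsed seconds, text) for each spoken chunk."""
    spoken: list[tuple[float, str]] = []
    start = time.perf_counter()

    def speak(text: str) -> None:
        spoken.append((time.perf_counter() - start, text))

    asyncio.run(dual_track_respond("What is the capital of France?", speak))
    return spoken
```

Calling `run_demo()` returns the two chunks in spoken order with their timestamps. In a real system the connective would also need to stay consistent with whatever the large model later says, which is the role the abstract assigns to the discourse continuity training.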
Merits
Strength in Reducing Response Latency
DDTSR reduces response latency by 19%-51% on two spoken dialogue benchmarks while preserving discourse quality, which makes it an attractive option for real-time spoken interaction.
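As a rough illustration of why speaking a connective early can move the first audio forward, the toy calculation below compares a strictly sequential pipeline with one that only waits for a short connective. All stage durations are made-up values in milliseconds, not figures from the paper, and the model ignores the additional overlap between ASR and LLM inference that the abstract describes.

```python
def sequential_first_audio_ms(asr_ms: int, llm_ms: int, tts_first_chunk_ms: int) -> int:
    """Sequential pipeline: each stage waits for the previous one to finish."""
    return asr_ms + llm_ms + tts_first_chunk_ms


def connective_first_audio_ms(asr_ms: int, connective_ms: int, tts_first_chunk_ms: int) -> int:
    """Connective-first pipeline: synthesis starts once the short connective is ready."""
    return asr_ms + connective_ms + tts_first_chunk_ms


def relative_reduction(baseline_ms: int, new_ms: int) -> float:
    """Fraction of the baseline latency that was removed."""
    return (baseline_ms - new_ms) / baseline_ms


# Hypothetical stage durations, chosen only to make the arithmetic easy to follow.
ASR_MS, LLM_MS, CONNECTIVE_MS, TTS_FIRST_CHUNK_MS = 400, 600, 100, 200

baseline = sequential_first_audio_ms(ASR_MS, LLM_MS, TTS_FIRST_CHUNK_MS)
dual_track = connective_first_audio_ms(ASR_MS, CONNECTIVE_MS, TTS_FIRST_CHUNK_MS)
reduction = relative_reduction(baseline, dual_track)
```

The size of the reduction in this toy model depends entirely on how large the LLM stage is relative to the other stages, so it says nothing about the reported benchmark numbers.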
Scalability and Compatibility
The authors report that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones and remains robust across varying utterance lengths, which suggests it is practical to adopt in existing cascaded systems.
Demerits
Potential Overhead in Resource Utilization
Running two models in parallel while overlapping ASR, LLM inference, and TTS may increase compute and memory use. The abstract does not report resource costs, so the size of this overhead is unknown.
Dependence on High-Performance Computing
The proposed architecture may require high-performance computing infrastructure to achieve optimal performance, which could be a limitation in resource-constrained environments.
Expert Commentary
The proposed Discourse-Aware Dual-Track Streaming Response (DDTSR) architecture addresses a central challenge for spoken dialogue systems: achieving human-like responsiveness in real-time interaction. Its combination of parallel small-large model processing and cross-modal streaming could make cascaded systems more responsive without replacing the underlying LLM. However, the possible overhead in resource utilization and the dependence on high-performance computing infrastructure must be weighed carefully in practical applications. Further research is needed to establish the limitations of DDTSR and how well the reported latency gains carry over to deployed systems.
Recommendations
- ✓ Further investigation into the potential overhead in resource utilization and the dependence on high-performance computing infrastructure is recommended to ensure the feasibility of DDTSR in practical applications.
- ✓ The authors' innovative approach to leveraging parallel processing and cross-modal collaboration should be explored in further research to fully understand its potential impact on the field.