Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao, Chu-Ren Huang, Jinghang Gu, Changqing Yin, Haizhou Li · February 28, 2026

#cs.CL

arXiv:2602.23266v1

Abstract: Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.

Executive Summary

The article proposes Discourse-Aware Dual-Track Streaming Response (DDTSR), an architecture designed to reduce response latency in cascaded spoken dialogue systems. An auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel, which enables listen-while-thinking and speak-while-thinking. Streaming-based cross-modal collaboration overlaps ASR, large language model (LLM) inference, and TTS to advance the earliest speakable moment, and curriculum-learning-based discourse continuity enhancement keeps the early response coherent with the subsequent reasoning output. The authors report a 19%-51% reduction in response latency on two spoken dialogue benchmarks while preserving discourse quality. They also report that DDTSR is compatible with diverse LLM backbones and remains robust across varying utterance lengths.

Key Points

▸ The Discourse-Aware Dual-Track Streaming Response (DDTSR) framework is introduced to reduce response latency in cascaded spoken dialogue systems.
▸ An auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel.
▸ Streaming-based cross-modal collaboration overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment, and curriculum-learning-based discourse continuity enhancement keeps early responses consistent with the subsequent reasoning output.
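
The dual-track idea in the key points above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the simulated delays, and the example strings are all hypothetical, and `asyncio.sleep` stands in for real model inference. It only shows why starting the large model in the background while a fast small model produces a connective lets the system speak earlier than it could by waiting for the full answer.

```python
import asyncio
import time


async def small_model_connective(user_text: str) -> str:
    # Hypothetical stand-in for the auxiliary small model: fast, low-commitment opener.
    await asyncio.sleep(0.01)
    return "Well,"


async def large_model_answer(user_text: str) -> str:
    # Hypothetical stand-in for the large model: slower, knowledge-intensive reasoning.
    await asyncio.sleep(0.05)
    return "the museum opens at nine."


async def dual_track_response(user_text: str):
    start = time.perf_counter()
    # Start the large model in the background so both tracks run concurrently.
    large_task = asyncio.create_task(large_model_answer(user_text))
    # The connective is ready first and could be sent to TTS immediately.
    connective = await small_model_connective(user_text)
    first_speakable_s = time.perf_counter() - start
    # The substantive answer follows once the large model finishes.
    body = await large_task
    total_s = time.perf_counter() - start
    return [connective, body], first_speakable_s, total_s


def run_demo():
    return asyncio.run(dual_track_response("When does the museum open?"))


if __name__ == "__main__":
    chunks, first_speakable_s, total_s = run_demo()
    print(" ".join(chunks))
    print(f"first speakable after {first_speakable_s:.3f}s, complete after {total_s:.3f}s")
```

In this toy setup the first speakable chunk is available as soon as the small model returns, while the full answer still takes as long as the large model needs. Keeping the connective coherent with what the large model later says is the problem the paper's discourse continuity enhancement targets, and it is not modeled here.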

Merits

Strength in Reducing Response Latency

DDTSR reduces response latency by 19%-51% on two spoken dialogue benchmarks while preserving discourse quality, making it an attractive option for real-time spoken interaction.
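
To make the reported range concrete, the following minimal sketch applies a relative latency reduction to a baseline. The 1000 ms baseline is a hypothetical figure chosen for illustration, not a measurement from the paper; only the 19% and 51% endpoints come from the abstract.

```python
def latency_after_reduction(baseline_ms: float, reduction_percent: float) -> float:
    """Return the latency that remains after a relative reduction given in percent."""
    if not 0 <= reduction_percent <= 100:
        raise ValueError("reduction_percent must be between 0 and 100")
    return baseline_ms * (100 - reduction_percent) / 100


# Hypothetical baseline latency, used only to illustrate the arithmetic.
HYPOTHETICAL_BASELINE_MS = 1000

low = latency_after_reduction(HYPOTHETICAL_BASELINE_MS, 19)
high = latency_after_reduction(HYPOTHETICAL_BASELINE_MS, 51)
print(f"19% reduction: {low:.0f} ms, 51% reduction: {high:.0f} ms")
# prints "19% reduction: 810 ms, 51% reduction: 490 ms"
```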

Scalability and Compatibility

The abstract reports that DDTSR works as a plug-and-play module compatible with diverse LLM backbones and remains robust across varying utterance lengths, which the authors take as evidence of practicality and scalability.

Demerits

Potential Overhead in Resource Utilization

Employing two models in parallel and implementing streaming-based cross-modal collaboration may lead to increased resource utilization, which could be a limitation in certain scenarios.

Dependence on High-Performance Computing

The proposed architecture may require high-performance computing infrastructure to achieve optimal performance, which could be a limitation in resource-constrained environments.

Expert Commentary

The proposed Discourse-Aware Dual-Track Streaming Response (DDTSR) architecture is a significant contribution to the field of spoken dialogue systems, addressing the critical challenge of achieving human-like responsiveness in real-time interaction. Its combination of parallel small-large model processing and cross-modal streaming is a promising way to shorten the wait before a system starts speaking. However, the potential overhead in resource utilization and the possible dependence on high-performance computing infrastructure must be carefully considered in practical applications. Further research is needed to establish the implications and limitations of DDTSR.

Recommendations

✓ Further investigation into the potential overhead in resource utilization and the dependence on high-performance computing infrastructure is recommended to ensure the feasibility of DDTSR in practical applications.
✓ The authors' innovative approach to leveraging parallel processing and cross-modal collaboration should be explored in further research to fully understand its potential impact on the field.

Sources

arXiv - cs.CL
