Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
arXiv:2602.23225v1 Announce Type: new

Abstract: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
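The parallel-forced decoding idea described in the abstract can be sketched in a few lines. This is a hypothetical, minimal illustration, not the paper's implementation (that lives in the linked repo): `model_logits` is a stand-in for a masked-diffusion LM forward pass, and the strategy simply forces at least `k` masked positions to be committed per step, instead of letting a confidence threshold collapse decoding to one token at a time (AR-like behavior).

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def model_logits(tokens, vocab_size=50, seed=0):
    """Stand-in for a masked-diffusion LM forward pass.

    Returns per-position logits over the vocabulary. A real DLM would
    condition on the full (partially masked) sequence.
    """
    rng = np.random.default_rng(seed + int(np.sum(tokens >= 0)))
    return rng.normal(size=(len(tokens), vocab_size))

def parallel_forced_decode(tokens, k=4):
    """Unmask at least k positions per step (hypothetical sketch).

    Confidence-based decoders often unmask only the single most
    confident position, degenerating into left-to-right AR updates;
    forcing k-way commits keeps decoding genuinely parallel.
    """
    tokens = tokens.copy()
    steps = 0
    while np.any(tokens == MASK):
        logits = model_logits(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf = probs.max(-1)                       # per-position confidence
        masked = np.flatnonzero(tokens == MASK)
        # commit the k most confident masked positions (or all that remain)
        chosen = masked[np.argsort(-conf[masked])][:k]
        tokens[chosen] = probs[chosen].argmax(-1)
        steps += 1
    return tokens, steps

seq = np.full(16, MASK)
out, steps = parallel_forced_decode(seq, k=4)
print(steps)  # 16 masked tokens / 4 per step = 4 decoding steps
```

The design choice being illustrated: the number of sequential forward passes is bounded by `ceil(n / k)` rather than `n`, which is exactly the latency advantage the paper attributes to genuinely non-AR decoding.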
Executive Summary
This paper examines why Diffusion Language Models (DLMs), despite their promise of parallel token generation, tend in practice to collapse into left-to-right, autoregressive (AR)-like decoding, forfeiting the latency benefits of parallel hardware. The authors argue that the primary driver is a mismatch between DLM training objectives and the highly sequential structure of standard training data, including long chain-of-thought (CoT) supervision. They introduce NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric remedy: training examples are curated as multiple independent reasoning trajectories and paired with a parallel-forced decoding strategy that commits several tokens per step. On math reasoning benchmarks, NAP outperforms DLMs trained on standard long CoT data under parallel decoding, with gains that grow as parallelism increases, suggesting that revisiting data and supervision is a principled route toward genuinely non-autoregressive generation.
Key Points
- ▸ Diffusion Language Models (DLMs) struggle with parallel decoding due to autoregressive (AR) behavior
- ▸ Mismatch between DLM objectives and sequential training data structure drives AR-like behavior
- ▸ NAP (Non-Autoregressive Parallel DLMs) is proposed to align supervision with non-AR parallel decoding
- ▸ On math reasoning benchmarks, NAP outperforms DLMs trained on standard long CoT data under parallel decoding, with gains growing as parallelism increases
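The latency argument behind these points can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative, not drawn from the paper's measurements: it counts only sequential model forward passes, ignoring per-pass cost, and assumes a fixed number of tokens committed per pass.

```python
import math

def decoding_steps(n_tokens: int, tokens_per_step: int) -> int:
    """Sequential forward passes needed to emit n_tokens when
    tokens_per_step tokens are committed per pass (AR decoding: 1)."""
    return math.ceil(n_tokens / tokens_per_step)

for k in (1, 2, 4, 8):
    print(k, decoding_steps(512, k))
# AR (k=1) needs 512 passes; committing 8 tokens per pass cuts
# this to 64, which is why latency scales better with output length.
```

This is the scaling a DLM only realizes if it actually commits multiple tokens per step; a DLM that degenerates to AR-like decoding stays on the k=1 line.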
Merits
Strength 1: Novel Approach
The authors propose a novel approach to mitigating AR-like behavior in DLMs, which is a significant contribution to the field.
Strength 2: Experimental Validation
The study provides experimental results that demonstrate the effectiveness of NAP in reducing AR-like behavior and improving parallel decoding performance.
Demerits
Limitation 1: Data-centric Approach
The proposed approach relies heavily on data curation and may not be scalable to large datasets or diverse domains.
Limitation 2: Computational Resources
The parallel-forced decoding strategy may require significant computational resources, which could be a limitation for real-world applications.
Expert Commentary
The study offers a clear diagnosis of why DLMs drift toward AR-like decoding and backs its data-centric remedy with convincing benchmark results. The main open questions concern practicality: curating multiple independent reasoning trajectories per example may be costly to scale across large corpora and diverse domains, and parallel-forced decoding may carry nontrivial compute overhead in real-world deployment. Even so, the work makes a persuasive case that data and supervision, not just decoding heuristics or architectures, are the right levers for moving DLMs toward genuinely parallel generation.
Recommendations
- ✓ Recommendation 1: Further Research on Scalability — investigate whether NAP's multi-trajectory data curation scales to large datasets and diverse domains.
- ✓ Recommendation 2: Investigation of Alternative Decoding Strategies — explore decoding schemes that achieve multi-token parallel updates without the computational overhead of parallel-forced decoding.