Discrete Stochastic Localization for Non-autoregressive Generation
arXiv:2602.16169v1 Announce Type: new
Abstract: Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work we show that \emph{training alone} can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose \textsc{DSL} (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, \textsc{DSL} fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with \(\sim\)4$\times$ fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
Executive Summary
This article proposes Discrete Stochastic Localization (DSL), an approach that improves the efficiency of non-autoregressive (NAR) generation in language models through training alone. DSL fine-tunes a single SNR-invariant denoiser across a continuum of corruption levels, so that one Diffusion Transformer handles both intermediate draft noise and mask-style endpoint corruption. On OpenWebText, DSL yields large MAUVE gains at low step budgets, surpasses the MDLM+ReMDM baseline with roughly 4x fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses attribute the gains to improved self-correction and uncertainty calibration, which make remasking markedly more compute-efficient. These results suggest that better training, not just better sampler design, can address the error accumulation and distribution shift that limit current iterative refinement methods.
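The core training idea described above, one denoiser trained across the full range of corruption levels, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' implementation: `MASK_ID`, the vocabulary size, and the uniform stand-in for the denoiser's output distribution are all hypothetical, and a real DSL model would be a Diffusion Transformer trained with the paper's SNR-invariant objective.

```python
import numpy as np

MASK_ID = 0   # hypothetical mask-token id
VOCAB = 16    # hypothetical vocabulary size

def corrupt(tokens, t, rng):
    """Mask each token independently with probability t (the corruption level)."""
    mask = rng.random(tokens.shape) < t
    noisy = np.where(mask, MASK_ID, tokens)
    return noisy, mask

def training_step(tokens, rng):
    """One sketched training step spanning the corruption continuum:
    sample t ~ U(0, 1), so a single denoiser sees everything from lightly
    corrupted drafts (small t) to fully masked sequences (t near 1)."""
    t = rng.uniform(0.0, 1.0)
    noisy, mask = corrupt(tokens, t, rng)
    # Stand-in for the denoiser's predicted distribution over the vocabulary;
    # a uniform distribution here just makes the masked-position loss concrete.
    probs = np.full((tokens.size, VOCAB), 1.0 / VOCAB)
    nll = -np.log(probs[np.arange(tokens.size), tokens])
    loss = nll[mask].mean() if mask.any() else 0.0
    return t, loss

rng = np.random.default_rng(0)
tokens = rng.integers(1, VOCAB, size=32)
t, loss = training_step(tokens, rng)
```

The point of the sketch is the sampled corruption level `t`: because each step draws a fresh `t`, the same network is trained to denoise drafts at every noise level, which is what lets one model serve both endpoint (fully masked) and intermediate (partially revised) states at sampling time.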
Key Points
- ▸ DSL proposes a novel approach to improve NAR generation efficiency
- ▸ DSL enables a single Diffusion Transformer to bridge intermediate draft noise and endpoint corruption
- ▸ Experimental results demonstrate significant improvements in step-efficiency and quality
Merits
Improved step-efficiency
DSL achieves large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with roughly 4x fewer denoiser evaluations.
Enhanced quality
DSL matches autoregressive sample quality at high step budgets, narrowing the quality gap that typically separates NAR models from autoregressive ones.
Increased compute-efficiency
DSL enables remasking to be markedly more compute-efficient, reducing the computational requirements of NAR generation.
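The compute savings above come from the remasking sampler needing fewer denoiser calls per generated sequence. As a rough illustration of how a remasking loop works, the sketch below implements generic confidence-based remasking: fill all masks in parallel, then remask the least-confident positions so later steps can revise them. This is a simplified stand-in, not ReMDM's actual schedule, and `fake_denoiser` is a hypothetical placeholder for the trained model.

```python
import numpy as np

MASK_ID = 0   # hypothetical mask-token id
VOCAB = 16    # hypothetical vocabulary size

def fake_denoiser(seq, rng):
    """Placeholder for the trained denoiser: returns a predicted token and a
    confidence score for every position (a real model would be a Transformer)."""
    preds = rng.integers(1, VOCAB, size=seq.shape)
    conf = rng.random(seq.shape)
    return preds, conf

def remask_sample(length, steps, remask_frac, rng):
    """Iterative refinement with remasking: each denoiser call both fills the
    current masks and (except on the last step) reopens low-confidence slots."""
    seq = np.full(length, MASK_ID)
    for step in range(steps):
        preds, conf = fake_denoiser(seq, rng)
        # fill every currently masked position in parallel
        seq = np.where(seq == MASK_ID, preds, seq)
        if step < steps - 1:
            # remask the least-confident fraction so later steps can revise them
            k = int(remask_frac * length)
            worst = np.argsort(conf)[:k]
            seq[worst] = MASK_ID
    return seq

rng = np.random.default_rng(0)
out = remask_sample(length=24, steps=4, remask_frac=0.25, rng=rng)
```

The total cost is `steps` denoiser evaluations regardless of sequence length, which is why making each refinement step more reliable, as DSL's training aims to do, translates directly into fewer denoiser calls for the same output quality.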
Demerits
Limited evaluation on diverse datasets
The article primarily evaluates DSL on OpenWebText, and further evaluation on diverse datasets is necessary to confirm its generalizability.
Potential overfitting to specific corruption levels
Although DSL trains across a continuum of corruption levels, the fine-tuned denoiser may still overweight the corruption regimes emphasized during training, which could degrade performance under sampling schedules or noise distributions not seen at training time.
Expert Commentary
This work makes a credible case that training, rather than sampler modifications alone, can close much of the step-efficiency gap in NAR generation. The experimental results are compelling, and the analysis of self-correction and uncertainty calibration helps explain why DSL succeeds where prior iterative refinement accumulates errors. That said, the evaluation is limited to OpenWebText, so broader testing on diverse datasets, along with an examination of sensitivity to the corruption-level distribution, is needed to confirm generalizability and robustness. If those results hold, DSL could meaningfully change how remasking-based NAR generation is trained and deployed.
Recommendations
- ✓ Further evaluation of DSL on diverse datasets to confirm its generalizability and robustness.
- ✓ Incorporation of additional metrics and analyses to better understand the potential benefits and limitations of DSL.