Sink-Aware Pruning for Diffusion Language Models

arXiv:2602.17664v1 Announce Type: new Abstract: Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention-sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
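The abstract's core measurement is how much the dominant attention-sink position shifts across denoising timesteps. The sketch below is an illustrative proxy for that idea, not the authors' implementation: it takes per-timestep attention maps, finds the key position receiving the most attention mass at each step, and reports the fraction of consecutive steps at which that position changes. The tensor layout and the instability metric are assumptions for illustration.

```python
import numpy as np

def sink_positions(attn_maps):
    """Dominant 'sink' per denoising timestep.

    attn_maps: (timesteps, queries, keys) attention weights, assumed
    already averaged over heads (illustrative layout, not the paper's).
    """
    incoming = attn_maps.sum(axis=1)   # attention mass each key receives
    return incoming.argmax(axis=1)     # dominant sink position per step

def sink_instability(attn_maps):
    """Fraction of consecutive timesteps where the dominant sink moves,
    a simple proxy for the positional variance the paper describes."""
    pos = sink_positions(attn_maps)
    if len(pos) < 2:
        return 0.0
    return float(np.mean(pos[1:] != pos[:-1]))

# Toy example: a sink that alternates between positions 0 and 5,
# mimicking the transient DLM sinks the paper observes.
rng = np.random.default_rng(0)
T, N = 8, 16
maps = rng.random((T, N, N)) * 0.1
for t in range(T):
    maps[t, :, 0 if t % 2 == 0 else 5] += 1.0

print(sink_instability(maps))
```

For an AR-style trajectory with a fixed sink at one position, this metric would stay near 0; for the alternating toy trajectory above it is maximal.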

Executive Summary

This paper proposes Sink-Aware Pruning, a method for efficient pruning of diffusion language models (DLMs). Unlike existing pruning heuristics inherited from autoregressive language models, which preserve attention-sink tokens by default, the proposed method identifies and prunes unstable sink tokens in DLMs, achieving a better quality-efficiency trade-off. The authors demonstrate the method's effectiveness without retraining, outperforming strong prior pruning baselines under matched compute. The contribution lies in recognizing that sink behavior in DLMs differs from that in autoregressive models and in developing a pruning strategy tailored to this difference. The method has practical implications for deploying DLMs in resource-constrained environments.

Key Points

  • Diffusion language models incur high inference cost due to iterative denoising, motivating efficient pruning.
  • Existing pruning heuristics inherited from autoregressive language models often preserve attention sink tokens.
  • Sink-Aware Pruning identifies and prunes unstable sink tokens in DLMs, achieving a better quality-efficiency trade-off.
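The key points above hinge on distinguishing stable sinks (worth preserving, as in AR models) from transient ones (prune-eligible, as in DLMs). A minimal toy version of that classification is sketched below; the stability rule and threshold are hypothetical, not the paper's exact criterion.

```python
from collections import Counter

def classify_sinks(sink_pos_per_step, stability_threshold=0.5):
    """Split sink positions into stable vs transient.

    A position counts as a stable sink only if it is the dominant sink
    in at least `stability_threshold` of the denoising steps; the rest
    are transient and left eligible for pruning. Threshold and rule are
    illustrative assumptions.
    """
    counts = Counter(sink_pos_per_step)
    total = len(sink_pos_per_step)
    stable = {p for p, c in counts.items() if c / total >= stability_threshold}
    transient = set(counts) - stable
    return stable, transient

# AR-like trajectory: the sink never moves, so it is kept.
print(classify_sinks([0, 0, 0, 0, 0, 0]))
# DLM-like trajectory: the sink wanders, so every sink is prune-eligible.
print(classify_sinks([0, 3, 7, 3, 9, 2]))
```

Under this toy rule, AR-style behavior reproduces the conventional "keep the sink" heuristic, while DLM-style wandering sinks fall through to pruning, matching the paper's motivation.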

Merits

Strength

The proposed method is tailored to the specific characteristics of diffusion language models, leading to improved performance and efficiency.

Innovative approach

Sink-Aware Pruning introduces a new pruning strategy that addresses the limitations of existing methods, demonstrating a significant advancement in the field.

Demerits

Limitation

The proposed method may not generalize to other types of language models beyond DLMs, limiting its applicability.

Assumptions

The effectiveness of Sink-Aware Pruning rests on the empirical observation that the dominant attention-sink position in DLMs shifts substantially across the generation trajectory; in models or settings where sinks remain stable, pruning them could harm quality.

Expert Commentary

Sink-Aware Pruning is a well-crafted and timely contribution to the field of language models. The authors demonstrate a clear understanding of the challenges of efficiently pruning DLMs and develop a strategy that directly targets them. The focus on the distinct attention dynamics of DLMs, rather than reusing autoregressive heuristics, is the work's main advance, and the evaluation supports the method's effectiveness. While the method has limitations, it is a valuable addition to the research landscape. As the field evolves, pruning strategies tailored to the specific behavior of different model families will become increasingly important, and this work is a step in that direction.

Recommendations

  • Future research should explore the applicability of Sink-Aware Pruning to other types of language models beyond DLMs.
  • The development of Sink-Aware Pruning highlights the importance of understanding the characteristics of different language models and developing pruning strategies tailored to these models.
