Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs

arXiv:2603.07475v1. Abstract: Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.

Executive Summary

This study examines how training objectives shape the internal representations of autoregressive (AR) and diffusion language models (dLLMs). The researchers compare native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B) layer by layer and token by token. They find that dLLMs form more hierarchical abstractions with substantial early-layer redundancy, whereas AR models form tightly coupled, depth-dependent representations, and that AR-initialized dLLMs retain AR-like dynamics despite diffusion training. Exploiting the observed redundancy, the authors introduce a static inference-time layer-skipping method that cuts FLOPs by up to 18.75% in native dLLMs while preserving over 90% of performance on reasoning and code generation benchmarks.

Key Points

  • dLLMs produce more hierarchical abstractions with early-layer redundancy
  • AR models produce tightly coupled, depth-dependent representations
  • Static inference-time layer skipping cuts FLOPs by up to 18.75% in native dLLMs while preserving over 90% performance

Merits

Strength in Methodology

The study performs a layer- and token-wise representational analysis across native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B), a three-way comparison that helps separate the effect of the training objective from that of architecture and initialization.
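The summary does not name the similarity measure behind the analysis; a common choice for comparing layer representations is linear Centered Kernel Alignment (CKA). The sketch below is illustrative only (the function `linear_cka` and the toy activation matrices are assumptions, not the paper's code) and shows how the kind of adjacent-layer redundancy the authors report could be quantified:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices.

    X, Y: (n_tokens, d) hidden states from two layers (d may differ).
    Returns a similarity in [0, 1]; values near 1 between adjacent
    layers indicate representational redundancy.
    """
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    # HSIC-based form: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

# Toy example: two nearly identical "layers" score close to 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((128, 64))          # stand-in for layer i activations
B = A + 0.01 * rng.standard_normal((128, 64))  # stand-in for layer i+1
print(round(linear_cka(A, B), 4))
```

In practice one would extract per-layer hidden states for a batch of tokens and compute this score for every layer pair, yielding the depth-wise similarity structure the study describes.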

Practical Efficiency Gains

The introduced layer-skipping method is static and task-agnostic, requires no architectural changes or KV-cache sharing, and reduces FLOPs by up to 18.75% on native dLLMs while preserving over 90% of performance on reasoning and code generation benchmarks; AR models degrade sharply under comparable skipping.
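As an illustration of the general idea (not the authors' implementation): a static skip schedule can be realized by treating selected residual blocks as identities at inference time, with no retraining. The `SKIP` set, the toy residual stack, and the 0.01 scaling below are all hypothetical choices for the sketch:

```python
import numpy as np

# Hypothetical static skip schedule: a fixed subset of layer indices
# to bypass at inference time, chosen once per model (task-agnostic).
SKIP = {2, 3}

def forward(x, layers, skip=SKIP):
    """Run a stack of residual layers, bypassing those in `skip`.

    Each layer computes x + f_i(x), so a skipped layer reduces to the
    identity; this is cheap and relatively safe where adjacent-layer
    representations are redundant.
    """
    for i, f in enumerate(layers):
        if i in skip:
            continue        # identity: reuse the incoming representation
        x = x + f(x)        # standard residual update
    return x

# Toy stack of 8 "layers": small random linear maps standing in for blocks.
rng = np.random.default_rng(0)
layers = [(lambda W: (lambda x: 0.01 * x @ W))(rng.standard_normal((16, 16)))
          for _ in range(8)]

x = rng.standard_normal(16)
full = forward(x, layers, skip=set())   # all 8 layers
pruned = forward(x, layers)             # 2 of 8 skipped -> 25% fewer layer FLOPs
drift = np.linalg.norm(full - pruned) / np.linalg.norm(full)
print(f"relative output drift after skipping: {drift:.3f}")
```

Because the schedule is fixed rather than input-dependent, it adds no routing overhead and composes with KV-cache optimizations, consistent with the "cache-orthogonal" framing in the abstract.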

Demerits

Limitation in Generalizability

The study focuses on specific model architectures and tasks, limiting the generalizability of the findings to other scenarios.

Dependence on Initialization

AR-initialized dLLMs such as Dream-7B retain AR-like representational dynamics despite diffusion training, so the redundancy that makes layer skipping safe in native dLLMs may not emerge in models converted from AR checkpoints.

Expert Commentary

This study provides a nuanced understanding of the internal representation structures of AR models and dLLMs, linking training objectives to the geometry of learned representations. The proposed layer-skipping method illustrates the value of cache-orthogonal optimizations: because it requires no KV-cache sharing, it can in principle be stacked with existing cache-based efficiency techniques. However, the limited range of architectures and tasks studied, together with the persistence of initialization bias, means further work is needed to establish how broadly these findings hold.

Recommendations

  • Future research should explore the generalizability of the study's findings to other model architectures and tasks, as well as the persistence of initialization bias in dLLMs.
  • The developed inference-time layer-skipping method should be evaluated on a broader range of NLP tasks to assess its applicability and effectiveness.
