LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

arXiv:2604.00004v1 Announce Type: cross Abstract: The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense Q/Q, K/K, and V/V self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of n × n relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback-Leibler divergence and gradients. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks. Notably, our method achieves these results using only 4.25M training tokens compared to the 256M tokens required by LongReD and CPT. Our code is available at https://github.com/gracefulning/LinearARD.

Executive Summary

This article presents LinearARD, a self-distillation method for restoring the short-text capabilities of Rotary Position Embeddings (RoPE)-scaled students in Large Language Models (LLMs). The method aligns the row-wise distributions of self-relation matrices to supervise attention dynamics, and overcomes the quadratic memory bottleneck of n × n relation maps with a linear-memory kernel. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks, and it does so with only 4.25M training tokens versus the 256M required by LongReD and CPT. This efficiency has significant implications for LLM development, enabling far cheaper context-window extension without sacrificing original capabilities.

Key Points

  • LinearARD is a self-distillation method that restores RoPE-scaled students through attention-structure consistency with a frozen native-RoPE teacher.
  • The method aligns row-wise distributions of self-relation matrices to supervise attention dynamics.
  • LinearARD overcomes the quadratic memory bottleneck through a linear-memory kernel.
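To make the alignment target concrete, the dense form of the row-wise distillation objective can be sketched in NumPy. This is a deliberately naive illustration under assumed shapes and names (`row_kl` is hypothetical, and per the abstract the actual method never materializes the full n × n maps):

```python
import numpy as np

def row_kl(teacher_feats, student_feats):
    """Row-wise KL between the softmaxed self-relation maps F @ F.T of a
    frozen teacher and a RoPE-scaled student. Naive dense sketch: this
    materializes the full n x n relation maps that the paper's
    linear-memory kernel is designed to avoid."""
    def row_softmax(f):
        logits = f @ f.T                               # n x n self-relation map
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)        # one distribution per row

    p = row_softmax(teacher_feats)   # teacher's attention-structure target
    q = row_softmax(student_feats)   # student's current structure
    # mean over rows of KL(p_i || q_i)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

In the paper this alignment is applied to each of the Q, K, and V feature streams, so the loss supervises attention structure directly rather than matching opaque hidden states.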

Merits

Strength in Scalability

LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks, and it cuts the training-token budget from 256M to 4.25M, making it a markedly more scalable recipe for context-window extension.

Effective Attention Supervision

The method's focus on aligning row-wise distributions of self-relation matrices provides effective supervision of attention dynamics, leading to improved model performance.

Efficient Memory Usage

The linear-memory kernel introduced in LinearARD overcomes the quadratic memory bottleneck, enabling more efficient training of large-scale models.
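A minimal NumPy sketch of the log-sum-exp idea, assuming a two-pass scheme over key chunks; `streamed_row_kl`, the chunk size, and the two-pass structure are illustrative assumptions, whereas the paper fuses the logit recomputation into a custom backward kernel with exact gradients:

```python
import numpy as np

def lse(x):
    """Numerically stable log-sum-exp along the last axis."""
    m = x.max(axis=-1)
    return m + np.log(np.exp(x - m[..., None]).sum(axis=-1))

def streamed_row_kl(tf, sf, chunk=4):
    """Row-wise KL between softmaxed self-relation maps of teacher (tf)
    and student (sf) features, holding only n x chunk logit tiles in
    memory instead of the full n x n maps."""
    n = tf.shape[0]
    # Pass 1: per-token log-sum-exp statistics for both models.
    lse_t = np.full(n, -np.inf)
    lse_s = np.full(n, -np.inf)
    for j in range(0, n, chunk):
        lse_t = np.logaddexp(lse_t, lse(tf @ tf[j:j + chunk].T))
        lse_s = np.logaddexp(lse_s, lse(sf @ sf[j:j + chunk].T))
    # Pass 2: recompute logit tiles and accumulate the exact KL.
    kl = np.zeros(n)
    for j in range(0, n, chunk):
        log_p = tf @ tf[j:j + chunk].T - lse_t[:, None]   # log-prob tile (teacher)
        log_q = sf @ sf[j:j + chunk].T - lse_s[:, None]   # log-prob tile (student)
        kl += (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
    return float(kl.mean())
```

Because the per-row normalizers are captured once as log-sum-exp scalars, the tiles can be recomputed on the fly and discarded, so peak memory grows linearly in sequence length rather than quadratically.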

Demerits

Limited Model Compatibility

LinearARD is specifically designed for RoPE-scaled students, which may limit its applicability to other types of LLMs.

Potential Overreliance on Teacher Models

Because the method distills against a frozen native-RoPE teacher, the student may be anchored to the teacher's attention behavior, potentially capping gains from the student's own capacity rather than fully exploiting it.

Expert Commentary

LinearARD is a notable advance in LLM development, addressing the degradation of short-text capabilities that typically accompanies positional-encoding scaling. Recovering 98.3% of short-text performance while surpassing baselines on long-context benchmarks, at roughly 1/60th of the training tokens required by LongReD and CPT, speaks to both its effectiveness and its efficiency. That said, its restriction to RoPE-based models and its dependence on a frozen teacher deserve careful study in future work. If these limitations can be addressed, the approach could substantially reduce the cost of context-window extension across the field.

Recommendations

  • Future research should explore the application of LinearARD to other types of LLMs and its potential integration with existing models.
  • Investigations into the potential overreliance on teacher models and strategies for mitigating this issue are essential for the long-term success of LinearARD.

Sources

Original: arXiv - cs.AI