LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

arXiv:2604.00004v1 Announce Type: cross Abstract: The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense Q/Q, K/K, and V/V self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of n × n relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback-Leibler divergence and gradients. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks. Notably, our method achieves these results using only 4.25M training tokens compared to the 256M tokens required by LongReD and CPT. Our code is available at https://github.com/gracefulning/LinearARD.

Executive Summary

This article presents LinearARD, a self-distillation method for restoring the short-text capabilities of Rotary Position Embeddings (RoPE)-scaled students in Large Language Models (LLMs). The method aligns the row-wise distributions of self-relation matrices to supervise attention dynamics, and overcomes the quadratic memory bottleneck of n × n relation maps with a linear-memory kernel. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks, and it does so with only 4.25M training tokens versus the 256M required by LongReD and CPT. This efficiency has significant implications for LLM development, enabling far cheaper context-window extension without sacrificing original capabilities.

Key Points

  • LinearARD is a self-distillation method that restores RoPE-scaled students through attention-structure consistency with a frozen native-RoPE teacher.
  • The method aligns row-wise distributions of self-relation matrices to supervise attention dynamics.
  • LinearARD overcomes the quadratic memory bottleneck through a linear-memory kernel.
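To make the alignment target concrete, the dense form of the row-wise distillation objective can be sketched in NumPy. This is a deliberately naive illustration under assumed shapes and names (`row_kl` is hypothetical, and per the abstract the actual method never materializes the full n × n maps):

```python
import numpy as np

def row_kl(teacher_feats, student_feats):
    """Row-wise KL between the softmaxed self-relation maps F @ F.T of a
    frozen teacher and a RoPE-scaled student. Naive dense sketch: this
    materializes the full n x n relation maps that the paper's
    linear-memory kernel is designed to avoid."""
    def row_softmax(f):
        logits = f @ f.T                               # n x n self-relation map
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)        # one distribution per row

    p = row_softmax(teacher_feats)   # teacher's attention-structure target
    q = row_softmax(student_feats)   # student's current structure
    # mean over rows of KL(p_i || q_i)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

In the paper this alignment is applied to each of the Q, K, and V feature streams, so the loss supervises attention structure directly rather than matching opaque hidden states.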

Merits

Strength in Scalability

LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks, and it cuts the training-token budget from 256M to 4.25M, making it a markedly more scalable recipe for context-window extension.

Effective Attention Supervision

The method's focus on aligning row-wise distributions of self-relation matrices provides effective supervision of attention dynamics, leading to improved model performance.

Efficient Memory Usage

The linear-memory kernel introduced in LinearARD overcomes the quadratic memory bottleneck, enabling more efficient training of large-scale models.
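A minimal NumPy sketch of the log-sum-exp idea, assuming a two-pass scheme over key chunks; `streamed_row_kl`, the chunk size, and the two-pass structure are illustrative assumptions, whereas the paper fuses the logit recomputation into a custom backward kernel with exact gradients:

```python
import numpy as np

def lse(x):
    """Numerically stable log-sum-exp along the last axis."""
    m = x.max(axis=-1)
    return m + np.log(np.exp(x - m[..., None]).sum(axis=-1))

def streamed_row_kl(tf, sf, chunk=4):
    """Row-wise KL between softmaxed self-relation maps of teacher (tf)
    and student (sf) features, holding only n x chunk logit tiles in
    memory instead of the full n x n maps."""
    n = tf.shape[0]
    # Pass 1: per-token log-sum-exp statistics for both models.
    lse_t = np.full(n, -np.inf)
    lse_s = np.full(n, -np.inf)
    for j in range(0, n, chunk):
        lse_t = np.logaddexp(lse_t, lse(tf @ tf[j:j + chunk].T))
        lse_s = np.logaddexp(lse_s, lse(sf @ sf[j:j + chunk].T))
    # Pass 2: recompute logit tiles and accumulate the exact KL.
    kl = np.zeros(n)
    for j in range(0, n, chunk):
        log_p = tf @ tf[j:j + chunk].T - lse_t[:, None]   # log-prob tile (teacher)
        log_q = sf @ sf[j:j + chunk].T - lse_s[:, None]   # log-prob tile (student)
        kl += (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
    return float(kl.mean())
```

Because the per-row normalizers are captured once as log-sum-exp scalars, the tiles can be recomputed on the fly and discarded, so peak memory grows linearly in sequence length rather than quadratically.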

Demerits

Limited Model Compatibility

LinearARD is specifically designed for RoPE-scaled students, which may limit its applicability to other types of LLMs.

Potential Overreliance on Teacher Models

Because the method distills against a frozen native-RoPE teacher, the student may be anchored to the teacher's attention behavior, potentially capping gains from the student's own capacity rather than fully exploiting it.

Expert Commentary

LinearARD is a notable advance in LLM development, addressing the degradation of short-text capabilities that typically accompanies positional-encoding scaling. Recovering 98.3% of short-text performance while surpassing baselines on long-context benchmarks, at roughly 1/60th of the training tokens required by LongReD and CPT, speaks to both its effectiveness and its efficiency. That said, its restriction to RoPE-based models and its dependence on a frozen teacher deserve careful study in future work. If these limitations can be addressed, the approach could substantially reduce the cost of context-window extension across the field.

Recommendations

  • Future research should explore the application of LinearARD to other types of LLMs and its potential integration with existing models.
  • Investigations into the potential overreliance on teacher models and strategies for mitigating this issue are essential for the long-term success of LinearARD.

Sources

Original: arXiv - cs.AI