Why Attend to Everything? Focus is the Key
arXiv:2604.03260v1 Announce Type: new Abstract: We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks--from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
Executive Summary
This article introduces Focus, a method for efficient attention that learns which token pairs matter instead of approximating all of them. Tokens are assigned to groups by learnable centroids; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: training only the centroids (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks, and Focus surpasses full attention both when retrofitted at 124M parameters and when trained from scratch at 7B scale. At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding a 2x speedup, and decomposing that pattern into two standard FlashAttention calls reaches 8.6x at 1M tokens. Unlike LoRA, Focus preserves alignment, retaining TruthfulQA scores after adaptation, and Sinkhorn normalization enforces balanced groups that discover interpretable linguistic categories without supervision.
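The top-k discretization described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and array shapes are assumptions, and the real method applies this pattern inside the attention kernel.

```python
import numpy as np

def topk_groups(scores, k=2):
    """Keep each token's k highest-scoring groups, turning the soft
    token-to-group routing into a hard boolean sparsity pattern.
    `scores` is (tokens, groups); illustrative only."""
    # Indices of the k largest scores per row (unordered within the top k).
    idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hard = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(hard, idx, True, axis=1)
    return hard

scores = np.array([[0.1, 0.9, 0.5],
                   [0.3, 0.2, 0.8]])
print(topk_groups(scores, k=2))
# [[False  True  True]
#  [ True False  True]]
```

At inference, a token then attends (beyond its local window) only to tokens in its selected groups, which is what the abstract decomposes into two standard FlashAttention calls.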
Key Points
- ▸ Focus learns which token pairs matter by assigning tokens to groups via learnable centroids
- ▸ Distant attention is restricted to same-group pairs, while local attention operates at full resolution
- ▸ The method achieves zero degradation on downstream benchmarks and improves domain perplexity
Merits
Strength in Efficiency
Because all model weights stay frozen and only the centroids (as few as 148K parameters) are trained, Focus is purely additive, improving domain perplexity with zero degradation on downstream benchmarks
Preservation of Alignment
Unlike LoRA, which degrades TruthfulQA at every learning rate and rank, Focus preserves alignment: instruction-tuned models retain their TruthfulQA scores after adaptation
Interpretability
Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision
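The balanced-group constraint can be illustrated with classic Sinkhorn iterations on the token-to-group logits. This is a generic sketch of the technique, not the paper's implementation; the function name and iteration count are assumptions.

```python
import numpy as np

def sinkhorn_balance(logits, n_iters=50):
    """Turn (T, G) token-to-group logits into a balanced soft assignment:
    each token's row sums to 1, and each group's column receives an equal
    T/G share of tokens, via alternating row/column normalization."""
    P = np.exp(logits - logits.max(axis=1, keepdims=True))  # positive matrix
    T, G = P.shape
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)             # rows sum to 1
        P = P / P.sum(axis=0, keepdims=True) * (T / G)   # columns sum to T/G
    return P

rng = np.random.default_rng(1)
P = sinkhorn_balance(rng.normal(size=(12, 3)))
print(np.round(P.sum(axis=0), 3))  # each group gets ~4 tokens of mass
```

Enforcing balance as a hard constraint prevents the degenerate solution where all tokens collapse into one group, which would make same-group attention as dense as full attention.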
Demerits
Computational Complexity
Learnable centroid routing and Sinkhorn normalization add training-time overhead that the abstract does not quantify, which may offset some of the inference-time gains
Limited Generalizability
Although retrofits span five attention architectures, the from-scratch comparison is limited to 7B scale with only 2B training tokens, and gains are reported chiefly as perplexity, so generalization to larger training budgets and other tasks remains to be shown
Expert Commentary
Focus marks a notable advance among efficient attention methods. By routing tokens to groups with learnable centroids and restricting distant attention to same-group pairs, it improves domain perplexity with zero degradation on downstream benchmarks while leaving all model weights frozen. The preservation of alignment, where LoRA degrades at every learning rate and rank, and the unsupervised emergence of interpretable linguistic categories are notable strengths. Open questions remain around the overhead of centroid routing and Sinkhorn normalization, and around whether the from-scratch result at 7B scale with 2B tokens holds at larger training budgets. Even so, the results suggest Focus can make large language models both faster and more adaptable in real-world applications.
Recommendations
- ✓ Future research should investigate the application of Focus to other attention architectures and tasks
- ✓ The authors should explore methods to mitigate the computational complexity of Focus and improve its generalizability
Sources
Original: arXiv - cs.CL