Academic

Why Attend to Everything? Focus is the Key

arXiv:2604.03260v1 Announce Type: new Abstract: We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks--from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.

Executive Summary

This article introduces Focus, a method for efficient attention that learns which token pairs matter instead of approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs, while local attention operates at full resolution. Because all model weights stay frozen, the method is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks, and Focus surpasses full attention both as a retrofit at 124M parameters and when trained from scratch at 7B scale. At inference, top-k group routing yields a 2x speedup, reaching 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment and retains TruthfulQA scores after adaptation. Sinkhorn normalization enforces balanced groups, and the resulting groups discover interpretable linguistic categories without supervision.
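The inference-time speedup mentioned above comes from discretizing soft group routing into a hard sparsity pattern. The following is an illustrative sketch of that idea, not the paper's implementation: the function names, the routing-score layout, and the causal pair enumeration are assumptions.

```python
# Hypothetical sketch: each token keeps only its k highest-scoring groups,
# so attention pairs become a hard, sparse set. Names are illustrative.

def topk_groups(routing, k):
    """For each token, keep the indices of its k highest-scoring groups."""
    allowed = []
    for scores in routing:
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        allowed.append(set(top))
    return allowed

def sparse_pairs(allowed, groups):
    """Causal pairs (q, j), j <= q, where key j's group is in query q's top-k set."""
    n = len(allowed)
    return {(q, j) for q in range(n) for j in range(q + 1)
            if groups[j] in allowed[q]}
```

In the paper, a pattern like this is reportedly decomposed into two standard FlashAttention calls (local window plus grouped pairs), which is how the wall-clock speedup is obtained without custom kernels.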

Key Points

  • Learnable centroids assign tokens to groups, letting Focus learn which token pairs matter rather than approximating all of them
  • Distant attention is restricted to same-group pairs, while local attention operates at full resolution
  • The method achieves zero degradation on downstream benchmarks and improves domain perplexity
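The grouping mechanism in the points above can be sketched in a few lines. This is a minimal reconstruction under stated assumptions, not the paper's code: the dot-product centroid scoring, the function names, and the local window size are all illustrative.

```python
# Illustrative sketch of centroid-based group assignment and the resulting
# attention mask: local pairs at full resolution, distant pairs only within
# the same group. All names and the window size are assumptions.

def assign_groups(tokens, centroids):
    """Assign each token embedding to its highest-scoring centroid."""
    groups = []
    for t in tokens:
        scores = [sum(a * b for a, b in zip(t, c)) for c in centroids]
        groups.append(max(range(len(centroids)), key=lambda i: scores[i]))
    return groups

def focus_mask(groups, window=2):
    """Causal mask: allow (q, k) if k is within the local window of q,
    or if both tokens were routed to the same group."""
    n = len(groups)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys up to and including the query
            local = q - k <= window
            same_group = groups[q] == groups[k]
            mask[q][k] = local or same_group
    return mask
```

Because only the centroids are trained, a sketch like this would leave every model weight untouched, which is what makes the retrofit purely additive.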

Merits

Strength in Efficiency

Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks

Preservation of Alignment

Unlike LoRA, Focus preserves alignment and retains TruthfulQA scores after adaptation

Interpretability

The Sinkhorn normalization enforces balanced groups, discovering interpretable linguistic categories without supervision
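The balancing constraint can be illustrated with a standard Sinkhorn iteration: alternately rescaling rows and columns of exponentiated assignment scores until each token distributes unit mass and each group receives an equal share. This is a generic sketch of the technique, assuming a dense score matrix; the iteration count and score layout are not taken from the paper.

```python
import math

def sinkhorn(scores, iters=50):
    """Alternately rescale rows and columns of exp(scores) so every token
    sums to 1 and every group holds n/g total mass (balanced groups)."""
    n, g = len(scores), len(scores[0])
    p = [[math.exp(s) for s in row] for row in scores]
    for _ in range(iters):
        for i in range(n):                      # rows: each token sums to 1
            z = sum(p[i])
            p[i] = [v / z for v in p[i]]
        for j in range(g):                      # columns: each group holds n/g
            z = sum(p[i][j] for i in range(n))
            for i in range(n):
                p[i][j] *= (n / g) / z
    return p
```

Enforcing balance as a hard constraint this way prevents the degenerate solution where every token collapses into one group.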

Demerits

Computational Complexity

The efficiency gains may be partly offset by the overhead of computing centroid assignments and running Sinkhorn normalization during training

Limited Generalizability

Although retrofitting is evaluated across five attention architectures, the from-scratch comparison covers a single 7B model trained on only 2B tokens, and generalization to other tasks and longer training runs remains open

Expert Commentary

The introduction of Focus marks a significant advance in efficient attention methods. By combining learnable centroid routing with group-restricted distant attention, the authors obtain domain-perplexity gains without degrading downstream benchmarks. The preservation of alignment after adaptation and the unsupervised emergence of interpretable linguistic categories are notable strengths. The computational overhead of routing and the limited scope of the from-scratch comparison remain open concerns. Nevertheless, the results demonstrate the potential of Focus to improve the efficiency and effectiveness of large language models in real-world applications.

Recommendations

  • Future research should investigate the application of Focus to other attention architectures and tasks
  • The authors should explore methods to mitigate the computational complexity of Focus and improve its generalizability

Sources

Original: arXiv - cs.CL