SLA2: Sparse-Linear Attention with Learnable Routing and QAT

arXiv:2602.12675v1 Announce Type: new Abstract: Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.

Executive Summary

The article introduces SLA2, an advanced version of Sparse-Linear Attention (SLA), which aims to improve the efficiency and performance of diffusion models for video generation. SLA2 addresses two primary limitations of the original SLA: its heuristic, magnitude-based split of computations between the sparse and linear branches, and a mismatch, identified through a formal error analysis, between SLA's formulation and a direct decomposition into sparse and linear attention. The proposed solution includes a learnable router for dynamic branch selection, a more faithful sparse-linear attention formulation with a learnable combination ratio, and a sparse + low-bit attention design that uses quantization-aware fine-tuning to reduce quantization error. Experimental results show 97% attention sparsity and an 18.6x attention speedup without compromising generation quality.

Key Points

  • SLA2 introduces a learnable router for dynamic computation selection.
  • SLA2 provides a more accurate sparse-linear attention formulation.
  • SLA2 incorporates a sparse + low-bit attention design to reduce quantization error.
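The first two points can be sketched in a few lines. The following is a minimal NumPy toy, not the paper's implementation: the router here is a hypothetical linear probe (scalar weight `w`, bias `b`) over block-pair similarities, and `alpha` stands in for the learnable combination ratio; in SLA2 these would be trained end-to-end with gradients, which this sketch omits.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_blocks(Q, K, w=1.0, b=0.0, block=4):
    """Toy learnable router: score each (query-block, key-block) pair with a
    scalar weight w and bias b, and route the pair to the sparse branch when
    sigmoid(w * similarity + b) > 0.5, i.e. when the logit is positive.
    Diagonal blocks are always kept so no query row is fully masked."""
    n, d = Q.shape
    qb = Q.reshape(-1, block, d).mean(axis=1)   # per-block query summaries
    kb = K.reshape(-1, block, d).mean(axis=1)   # per-block key summaries
    logits = w * (qb @ kb.T) + b                # router logit per block pair
    keep = logits > 0
    keep |= np.eye(len(qb), dtype=bool)
    return np.kron(keep, np.ones((block, block), dtype=bool))

def sparse_linear_attention(Q, K, V, mask, alpha):
    """Blend a sparse softmax branch and a linear-attention branch with a
    learnable ratio alpha in [0, 1]."""
    n, d = Q.shape
    # Sparse branch: softmax attention restricted to routed positions.
    scores = np.where(mask, Q @ K.T / np.sqrt(d), -np.inf)
    sparse_out = softmax(scores, axis=-1) @ V
    # Linear branch: feature map phi(x) = elu(x) + 1 keeps values positive
    # and lets attention be computed in O(n * d^2) instead of O(n^2 * d).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    qf, kf = phi(Q), phi(K)
    linear_out = (qf @ (kf.T @ V)) / (qf @ kf.sum(axis=0))[:, None]
    # Learnable ratio combines the two branches.
    return alpha * sparse_out + (1.0 - alpha) * linear_out
```

Because the output is affine in `alpha`, the ratio smoothly interpolates between pure sparse attention (`alpha = 1`) and pure linear attention (`alpha = 0`), which is what makes it trainable alongside the router.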

Merits

Innovative Approach

The introduction of a learnable router and a more faithful sparse-linear attention formulation represents a significant advancement over the heuristic methods used in SLA.

Performance Improvements

Experimental results show substantial improvements in attention sparsity and speedup, making SLA2 highly efficient for video generation tasks.

Reduction of Quantization Error

The sparse + low-bit attention design effectively reduces quantization error, enhancing the overall performance of diffusion models.
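The mechanism behind quantization-aware fine-tuning can be illustrated with fake quantization. The helper below is a generic sketch, not the paper's kernel (the abstract does not specify bit width or quantization scheme): a symmetric quantize-dequantize round trip whose rounding error the forward pass sees during fine-tuning, while in practice gradients bypass the rounding via a straight-through estimator so the model learns to tolerate it.

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Symmetric fake quantization: map x to signed num_bits integers,
    then dequantize. In quantization-aware fine-tuning, the forward pass
    uses these rounded values while gradients skip the rounding
    (straight-through estimator), so the model adapts to the rounding
    error of a low-bit attention kernel."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax      # one scale for the whole tensor
    if scale == 0.0:
        return x.copy()
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale
```

Lowering the bit width enlarges the round-trip error (the worst case is half a quantization step, `scale / 2`), which is exactly the error that fine-tuning is asked to absorb.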

Demerits

Complexity

The introduction of additional learnable parameters and complex formulations may increase the computational and implementation complexity.

Generalizability

The effectiveness of SLA2 has been demonstrated primarily in video generation; its applicability to other domains or tasks remains to be explored.

Expert Commentary

The article presents a meaningful advance in attention mechanisms for diffusion-based video generation. The learnable router and the more faithful sparse-linear formulation address the key limitations of the original SLA, and the reported 97% attention sparsity with an 18.6x attention speedup indicates substantial gains in computational efficiency. The sparse + low-bit attention design is a notable contribution, using quantization-aware fine-tuning to keep quantization error in check. That said, the additional learnable components increase implementation complexity, and the method's generalizability beyond video generation remains untested. Practically, SLA2 offers a pathway to more efficient and scalable generative models; for policy, efficiency gains of this kind bear on decisions about deploying AI systems, since they lower the compute cost of reliable performance across applications.

Recommendations

  • Further research should explore the generalizability of SLA2 to other domains and tasks beyond video generation to assess its broader applicability.
  • Future work could focus on simplifying the implementation of SLA2 to reduce computational complexity while maintaining its performance benefits.
