Automated Attention Pattern Discovery at Scale in Large Language Models
arXiv:2604.03764v1 Announce Type: new

Abstract: Large language models have found success by scaling up capabilities to work in general settings. The same can unfortunately not be said for interpretability methods. The current trend in mechanistic interpretability is to provide precise explanations of specific behaviors in controlled settings. These often do not generalize, or are too resource intensive for larger studies. In this work we propose to study repeated behaviors in large language models by mining completion scenarios in Java code datasets, through exploiting the structured nature of code. We collect the attention patterns generated in the attention heads to demonstrate that they are scalable signals for global interpretability of model components. We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern - Masked Autoencoder (AP-MAE), a vision transformer-based model that efficiently reconstructs masked attention patterns. Experiments on StarCoder2 show that AP-MAE (i) reconstructs masked attention patterns with high accuracy, (ii) generalizes across unseen models with minimal degradation, (iii) reveals recurring patterns across inferences, (iv) predicts whether a generation will be correct without access to ground truth, with accuracies ranging from 55% to 70% depending on the task, and (v) enables targeted interventions that increase accuracy by 13.6% when applied selectively, but cause collapse when applied excessively. These results establish attention patterns as a scalable signal for interpretability and demonstrate that AP-MAE provides a transferable foundation for both analysis and intervention in large language models. Beyond its standalone value, AP-MAE also serves as a selection procedure to guide fine-grained mechanistic approaches. We release code and models to support future work in large-scale interpretability.
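The mechanism the abstract describes — treating each head's attention map as an image, splitting it into patches, and masking a large fraction of them for a ViT-style autoencoder to reconstruct — can be sketched in miniature. This is an illustrative setup, not the paper's released code: the toy attention generator, the patch size, and the 75% mask ratio are assumptions.

```python
import numpy as np

def causal_attention_pattern(seq_len, d, rng):
    """Toy stand-in for one head's attention map: softmax(QK^T / sqrt(d)) under a causal mask."""
    q = rng.standard_normal((seq_len, d))
    k = rng.standard_normal((seq_len, d))
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(seq_len, k=1)] = -np.inf  # tokens cannot attend to the future
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def patchify(attn, p):
    """Split an (L, L) attention map into non-overlapping (p, p) patches, as a ViT would."""
    n = attn.shape[0] // p
    return attn.reshape(n, p, n, p).transpose(0, 2, 1, 3).reshape(n * n, p * p)

def mask_patches(num_patches, mask_ratio, rng):
    """MAE-style masking: hide a random subset of patch indices for the model to reconstruct."""
    idx = rng.permutation(num_patches)
    n_masked = int(round(mask_ratio * num_patches))
    return idx[n_masked:], idx[:n_masked]  # (visible, masked)

rng = np.random.default_rng(0)
attn = causal_attention_pattern(seq_len=16, d=8, rng=rng)  # 16x16 attention "image"
patches = patchify(attn, p=4)                              # 16 patches of 4x4 pixels
visible, masked = mask_patches(len(patches), mask_ratio=0.75, rng=rng)
```

Only the visible patches would be fed to the encoder; the decoder is trained to fill in the masked ones, which is what lets reconstruction quality serve as a signal later.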
Executive Summary
This article proposes a novel approach to large-scale interpretability of large language models that exploits the structured nature of code in Java datasets. The research introduces the Attention Pattern - Masked Autoencoder (AP-MAE), a vision transformer-based model that reconstructs masked attention patterns with high accuracy. Experiments on StarCoder2 demonstrate that attention patterns are a scalable signal for global interpretability, and that AP-MAE generalizes to unseen models with minimal degradation while revealing recurring patterns across inferences. The results also show that AP-MAE can predict generation correctness without ground truth (55% to 70% accuracy depending on the task) and enables targeted interventions that raise accuracy by 13.6% when applied selectively. This work provides a promising direction for large-scale interpretability and a step toward more transparent and reliable AI models.
Key Points
- ▸ The article proposes a novel approach to large-scale interpretability of large language models.
- ▸ The Attention Pattern - Masked Autoencoder (AP-MAE) model reconstructs masked attention patterns with high accuracy.
- ▸ AP-MAE generalizes across unseen models and reveals recurring patterns across inferences.
- ▸ AP-MAE predicts generation correctness without ground truth (55% to 70% accuracy) and enables selective interventions that raise accuracy by 13.6%.
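The correctness-prediction point plausibly rests on scoring how well the autoencoder reproduces an inference's attention patterns. The sketch below is a hypothetical stand-in for the paper's classifier: the MSE score, the threshold rule, and the synthetic maps are all assumptions for illustration.

```python
import numpy as np

def reconstruction_error(attn, recon):
    """Per-inference score: mean squared error between a head's true attention
    pattern and its autoencoder reconstruction."""
    return float(np.mean((attn - recon) ** 2))

def predict_correctness(errors, threshold):
    """Hypothetical decision rule: a low reconstruction error means the pattern
    looks 'familiar' to the autoencoder, so the generation is predicted correct."""
    return [e <= threshold for e in errors]

# Synthetic maps stand in for real attention patterns: two near-perfect
# reconstructions and two poor ones.
rng = np.random.default_rng(1)
true_maps = [rng.random((8, 8)) for _ in range(4)]
noise = [0.01, 0.01, 0.4, 0.4]  # reconstruction quality per inference
recons = [m + rng.normal(0.0, s, m.shape) for m, s in zip(true_maps, noise)]
errors = [reconstruction_error(m, r) for m, r in zip(true_maps, recons)]
preds = predict_correctness(errors, threshold=0.05)
```

A real deployment would calibrate the threshold on held-out labeled generations, which is consistent with the task-dependent 55% to 70% accuracy the abstract reports.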
Merits
Strength in Scalability
The research demonstrates the scalability of attention patterns for global interpretability, addressing a significant limitation of current mechanistic interpretability methods.
Transferable Foundation
AP-MAE provides a transferable foundation for both analysis and intervention in large language models, enabling targeted interventions to increase accuracy.
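One way such a targeted intervention could look: nudge a selected head's attention pattern toward its AP-MAE reconstruction and renormalize. The blending rule and the `alpha` parameter here are hypothetical, not the paper's procedure; the sketch only illustrates why a selective nudge (small `alpha`, few heads) is a different regime from wholesale replacement.

```python
import numpy as np

def intervene(attn, recon, alpha):
    """Nudge one head's attention pattern toward its AP-MAE reconstruction.

    alpha = 0 leaves the head untouched; alpha = 1 replaces it outright.
    Rows are clipped and renormalized so the result is still a valid
    attention distribution."""
    blended = (1.0 - alpha) * attn + alpha * recon
    blended = np.clip(blended, 0.0, None)
    return blended / blended.sum(axis=-1, keepdims=True)

head = np.array([[0.7, 0.3], [0.2, 0.8]])    # original pattern
recon = np.array([[0.5, 0.5], [0.5, 0.5]])   # reconstruction
gentle = intervene(head, recon, alpha=0.25)  # selective nudge
aggressive = intervene(head, recon, alpha=1.0)  # full replacement
```

Applying the aggressive setting across many heads at once would discard the model's own routing entirely, consistent with the collapse the abstract reports under excessive intervention.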
Demerits
Limited Generalizability
The experiments cover only Java code completion with StarCoder2, so the results may not generalize to other domains, datasets, or model families, highlighting the need for further research to establish the robustness of AP-MAE.
Excessive Intervention Risk
The model's sensitivity to excessive intervention highlights the risk of over-reliance on targeted interventions, which may lead to model collapse or degradation.
Expert Commentary
By mining attention patterns at scale rather than dissecting individual behaviors, this work addresses a key limitation of current mechanistic interpretability methods, and AP-MAE's transferability makes it useful both as an analysis tool and as a selection procedure to guide finer-grained studies. The caveats are real, however: the evaluation is confined to Java code completion on StarCoder2, so robustness across domains and model families remains to be established, and the finding that excessive intervention causes collapse suggests that attention-level edits must be applied conservatively rather than relied upon wholesale.
Recommendations
- ✓ Future research should focus on establishing the robustness of AP-MAE across various domains and datasets.
- ✓ Development of more transparent and reliable AI models should prioritize explainability and accountability, with scalable tools like AP-MAE serving as complements to fine-grained mechanistic analysis.
Sources
Original: arXiv - cs.LG