Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
arXiv:2602.16740v1 Announce Type: new Abstract: In mechanistic interpretability, recent work scrutinizes transformer "circuits" - sparse, single- or multi-layer sub-computations that may reflect human …
Karan Bali, Jack Stanley, Praneet Suresh, Danilo Bzdok