Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations
arXiv:2603.18353v1
Announce Type: new
Abstract: Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can …