Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear …
arXiv:2602.17881v1 (announce type: cross)
Abstract: Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at …
Joschka Braun