
Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations


Joschka Braun

arXiv:2602.17881v1 Announce Type: cross Abstract: Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.

Executive Summary

This article sheds light on the limitations of steering vectors in language models, a method for controlling behavior by adding a learned bias to activations at inference time. The author investigates why steering reliability differs across behaviors and finds that two geometric properties of the training data predict more reliable steering: higher cosine similarity between training activation differences, and better separation of positive and negative activations along the steering direction. The findings suggest that steering vectors are unreliable when the latent target behavior representation is not well approximated by the linear steering direction. The article offers a practical diagnostic for steering unreliability and motivates the development of more robust steering methods.
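The intervention itself is simple: at a chosen layer, a fixed vector is added to the model's activations at inference time. A minimal numpy sketch of that additive step (illustrative shapes and names; real steering operates on a transformer's residual stream, typically via a forward hook):

```python
import numpy as np

# Minimal sketch of activation steering (hypothetical shapes and names):
# at inference time, a fixed steering vector v is added to the activations h
# at one layer, scaled by a steering strength alpha.
rng = np.random.default_rng(0)
d_model = 16                        # hidden size (illustrative)
h = rng.normal(size=(4, d_model))   # activations for 4 token positions
v = rng.normal(size=d_model)
v = v / np.linalg.norm(v)           # unit-norm steering direction
alpha = 2.0                         # steering strength

h_steered = h + alpha * v           # broadcast: same bias at every position
print(h_steered.shape)              # (4, 16)
```

The same bias is broadcast across all token positions; the strength `alpha` trades off effect size against degradation of the model's fluency.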

Key Points

  • Higher cosine similarity between training activation differences predicts more reliable steering.
  • Behavior datasets with better separation of positive and negative activations along the steering direction are more reliably steerable.
  • Steering vectors trained on different prompt variations are directionally distinct yet perform similarly well.
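The first two findings suggest a concrete pre-steering diagnostic. A minimal sketch, assuming a mean-difference construction of the steering vector (the thesis may train vectors differently) and hypothetical array shapes:

```python
import numpy as np

def steering_diagnostics(pos_acts, neg_acts):
    """Illustrative versions of the two geometric predictors.
    pos_acts, neg_acts: (n_pairs, d_model) activations for paired
    positive/negative prompts (assumed mean-difference steering vector)."""
    diffs = pos_acts - neg_acts                 # per-pair activation differences
    v = diffs.mean(axis=0)                      # steering vector (mean difference)
    v_hat = v / np.linalg.norm(v)

    # Predictor 1: mean pairwise cosine similarity of training differences.
    unit_diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    cos = unit_diffs @ unit_diffs.T
    n = len(diffs)
    mean_cos = (cos.sum() - n) / (n * (n - 1))  # exclude self-similarity

    # Predictor 2: separation of positive vs. negative activations when
    # projected onto the steering direction (d'-style gap).
    p = pos_acts @ v_hat
    q = neg_acts @ v_hat
    pooled_std = np.sqrt((p.var() + q.var()) / 2)
    separation = (p.mean() - q.mean()) / pooled_std
    return mean_cos, separation
```

Low mean cosine similarity among the training differences, or weak separation of the projections, would flag a behavior as likely hard to steer before any steering is attempted.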

Merits

Strength of Geometric Predictors

The use of geometric predictors to explain when steering vectors fail is a significant contribution of this article: it turns steering reliability from an empirical mystery into a measurable property of the training activations.

Demerits

Limited Generalizability

The findings are based on a specific set of behavior datasets and models, and may not generalize to other language models or domains.

Expert Commentary

The article makes a significant contribution to language model research by explaining when and why steering vectors fail. Framing reliability in terms of geometric properties of the training activations is a novel approach that yields a practical, measurable diagnostic for behavior control. That said, the findings rest on a specific set of behavior datasets and may not generalize to other models or domains, so further research is needed to confirm and extend them.

Recommendations

  • Future research should aim to develop more robust steering methods that account for non-linear latent behavior representations.
  • Steering vectors should be paired with reliability diagnostics, such as the geometric predictors described here, rather than trusted on their own for behavior control.
