
Concept Heterogeneity-aware Representation Steering

arXiv:2603.02237v1 Announce Type: new Abstract: Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.
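The abstract's claim that difference-in-means steering is an OT map can be made concrete with a standard result (stated here for reference, not taken from the paper): the quadratic-cost optimal transport map between two Gaussians is affine, and when the covariances coincide it collapses to a pure translation, which is exactly the difference-in-means shift.

```latex
T(x) = m_1 + \Sigma_0^{-1/2}\big(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\big)^{1/2}\Sigma_0^{-1/2}\,(x - m_0),
\qquad
\Sigma_0 = \Sigma_1 \;\Rightarrow\; T(x) = x + (m_1 - m_0).
```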

Executive Summary

This article presents Concept Heterogeneity-aware Representation Steering (CHaRS), an approach to controlling large language models (LLMs) that views representation steering through the lens of optimal transport. Traditional methods rely on a single global steering direction, which implicitly assumes the target concept is represented homogeneously across the embedding space. CHaRS relaxes this assumption by modeling source and target representations as Gaussian mixture models, formulating steering as a discrete optimal transport problem between semantic latent clusters, and deriving an explicit, input-dependent steering map via barycentric projection. Experiments show that CHaRS yields more effective behavioral control than global steering.
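For orientation, the global baseline that CHaRS improves on can be sketched in a few lines. This is a minimal numpy illustration of difference-in-means steering, assuming activations have already been collected from contrastive prompts; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def diff_in_means_direction(pos_acts, neg_acts):
    """Global steering direction: mean(positive) - mean(negative).

    pos_acts, neg_acts: (n, d) arrays of hidden activations collected
    from contrastive prompt sets (names are illustrative).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(h, direction, alpha=1.0):
    """Apply the same global translation to every activation.

    This is the restrictive assumption CHaRS relaxes: one shift for
    the whole embedding space, regardless of the input.
    """
    return h + alpha * direction

# Toy contrastive data: two Gaussian clouds separated along every axis.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(100, 8))
neg = rng.normal(loc=-1.0, size=(100, 8))
v = diff_in_means_direction(pos, neg)
steered = steer(np.zeros(8), v, alpha=0.5)
```

Note that `v` is input-independent: every activation receives the identical shift, which is precisely what becomes brittle when the concept's representation is clustered.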

Key Points

  • CHaRS models source and target representations as Gaussian mixture models to address concept heterogeneity.
  • Optimal transport is used to derive an explicit, input-dependent steering map.
  • CHaRS yields more effective behavioral control than traditional global steering methods.
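The cluster-level idea behind the points above can be sketched in a toy form. This is not the paper's implementation; it assumes equal numbers of equally weighted source and target clusters (so the discrete OT plan reduces to an optimal assignment), pre-computed cluster means, and a Gaussian kernel for the input-dependent weights. All names and defaults are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_ot_steer(x, src_means, tgt_means, bandwidth=1.0):
    """Input-dependent steering shift from cluster-level discrete OT.

    src_means, tgt_means: (K, d) cluster centroids of source/target
    representations (e.g. from a GMM fit, assumed done elsewhere).
    With uniform weights and equal K, the OT plan is a permutation,
    found here by solving an assignment problem on squared distances.
    """
    # Squared-Euclidean cost between source and target cluster means.
    cost = ((src_means[:, None, :] - tgt_means[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)   # optimal cluster matching
    shifts = tgt_means[cols] - src_means[rows]  # per-cluster translation

    # Kernel weights: how strongly x belongs to each source cluster.
    d2 = ((x - src_means) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * bandwidth**2))
    w /= w.sum()
    # Smooth, kernel-weighted combination of cluster-level shifts
    # (a barycentric-projection-style map, in toy form).
    return x + w @ shifts

# Two clusters that must move in opposite directions: a single global
# direction cannot serve both, but the cluster-level map can.
src = np.array([[0.0, 0.0], [10.0, 0.0]])
tgt = np.array([[0.0, 5.0], [10.0, -5.0]])
a = cluster_ot_steer(np.array([0.0, 0.0]), src, tgt)   # ≈ [0., 5.]
b = cluster_ot_steer(np.array([10.0, 0.0]), src, tgt)  # ≈ [10., -5.]
```

The example makes the contrast with global steering explicit: inputs near different clusters receive different, even opposing, shifts.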

Merits

Strength

CHaRS's approach to modeling concept heterogeneity through Gaussian mixture models addresses a significant limitation of traditional methods.

Demerits

Limitation

Relative to a single difference-in-means direction, CHaRS adds computational cost: it must fit Gaussian mixture models to source and target representations, solve a discrete optimal transport problem between cluster means, and evaluate kernel weights per input at inference time.

Expert Commentary

CHaRS is a significant contribution to the field of large language models, addressing a critical limitation of traditional representation steering methods. By leveraging insights from optimal transport and Gaussian mixture models, CHaRS offers a more effective approach to behavioral control. However, the increased computational complexity may limit its practical applicability. Nonetheless, CHaRS's methodology provides a valuable framework for future research and development in the field.

Recommendations

  • Future research should focus on optimizing the computational efficiency of CHaRS while maintaining its effectiveness.
  • The development of CHaRS highlights the need for further investigation into concept heterogeneity in LLM representations and its implications for model design and control.

Sources