Navigating the Concept Space of Language Models

Wilson E. Marcílio-Jr, Danilo M. Eler

arXiv:2603.23524v1. Abstract: Sparse autoencoders (SAEs) trained on large language model activations output thousands of features that enable mapping to human-interpretable concepts. The current practice for analyzing these features primarily relies on inspecting top-activating examples, manually browsing individual features, or performing semantic search on concepts of interest, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.

Executive Summary

The paper presents Concept Explorer, an interactive system for post-hoc exploration of sparse autoencoder (SAE) features in large language models. It organizes the features' concept explanations with hierarchical neighborhood embeddings, building a multi-resolution manifold over the SAE feature embeddings so that users can move progressively from coarse concept clusters to fine-grained neighborhoods. Applied to SAE features extracted from SmolLM2, the system reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows. Those workflows rely on inspecting top-activating examples, browsing individual features by hand, or running semantic search for concepts of interest, which makes exploratory discovery difficult at scale.

Key Points

  • Concept Explorer is an interactive system for post-hoc exploration of SAE features
  • The system utilizes hierarchical neighborhood embeddings to organize concept explanations
  • Concept Explorer enables progressive navigation from coarse concept clusters to fine-grained neighborhoods
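
To make the coarse-to-fine idea concrete, here is a minimal sketch in plain Python. It is not the authors' method: the summary does not describe how the hierarchical neighborhood embeddings are built, so this example substitutes a simple k-means-style clustering for the coarse level and a cosine-similarity nearest-neighbor lookup for the fine level. The feature names, the 3-dimensional vectors, and the helper functions (`coarse_clusters`, `fine_neighbors`) are all hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def assign(vectors, centers):
    """Label each vector with the index of its most similar center."""
    return [max(range(len(centers)), key=lambda j: cosine(v, centers[j]))
            for v in vectors]

def coarse_clusters(vectors, k, iters=10):
    """Coarse level: k-means-style clustering (a stand-in, not the paper's technique)."""
    # Deterministic seeding: start from the first vector, then repeatedly add
    # the vector that is least similar to every center chosen so far.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centers)))
    for _ in range(iters):
        labels = assign(vectors, centers)
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(vectors, centers)

def fine_neighbors(vectors, labels, query, n):
    """Fine level: the n most similar features inside the query's coarse cluster."""
    same = [i for i, lab in enumerate(labels) if lab == labels[query] and i != query]
    same.sort(key=lambda i: cosine(vectors[query], vectors[i]), reverse=True)
    return same[:n]

# Hypothetical feature explanations with toy embeddings (assumed, not from the paper).
features = {
    "python syntax":       [0.90, 0.10, 0.00],
    "javascript keywords": [0.80, 0.20, 0.10],
    "sql queries":         [0.85, 0.05, 0.20],
    "french verbs":        [0.10, 0.90, 0.10],
    "spanish nouns":       [0.00, 0.95, 0.20],
    "german articles":     [0.20, 0.80, 0.00],
}
names = list(features)
vectors = [features[name] for name in names]

labels = coarse_clusters(vectors, k=2)            # step 1: coarse concept clusters
query = names.index("python syntax")
neighbors = [names[i] for i in fine_neighbors(vectors, labels, query, n=2)]  # step 2

print(labels)
print(neighbors)
```

On this toy data the programming-related explanations and the language-related explanations fall into separate coarse clusters, and the fine-grained lookup then ranks neighbors only within the selected cluster. A real system would operate on embeddings of thousands of SAE feature explanations and would likely use more than two levels.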

Merits

Strength in Scalability

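Existing workflows (inspecting top-activating examples, manually browsing individual features, or running semantic search) make exploratory discovery difficult when an SAE yields thousands of features. Concept Explorer is designed to scale to that setting.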

Strength in Interactive Exploration

Hierarchical neighborhood embeddings let users navigate progressively from coarse concept clusters to fine-grained neighborhoods, which supports discovery, comparison, and relationship analysis among concepts.

Demerits

Limitation in Specialization

Concept Explorer is specifically designed for SAE features and may require adaptation for other types of features or models.

Limitation in Resource Intensity

The abstract does not report computational cost. Building a multi-resolution manifold over thousands of feature embeddings may require significant resources for large-scale exploration.

Expert Commentary

Concept Explorer addresses a practical gap in language model interpretability: SAEs produce thousands of features, and current workflows offer little support for exploring them as a whole. Organizing concept explanations with hierarchical neighborhood embeddings gives analysts a structured way to discover, compare, and relate concepts rather than inspecting features one at a time. Its focus on SAE features and its potential resource cost point to the need for further research and development. If such tools make exploratory discovery in large language models more efficient, they could also inform policy decisions on the deployment and regulation of artificial intelligence systems.

Recommendations

  • Future research should focus on adapting Concept Explorer for other types of features or models to increase its versatility.
  • Developers should prioritize reducing the system's resource requirements to make it accessible to a wider range of users.

Sources

Original: arXiv - cs.CL