Discovering Universal Activation Directions for PII Leakage in Language Models
arXiv:2602.16980v1

Abstract
Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model's representations, enabling both risk amplification and mitigation.
Executive Summary
This article presents UniLeak, a mechanistic-interpretability framework that identifies universal activation directions in language models: latent directions whose addition to the residual stream at inference time consistently increases the likelihood of generating personally identifiable information (PII) across prompts. The framework recovers these directions without access to training data or ground-truth PII, relying only on the model's own generated text. Steering along the recovered directions substantially increases PII leakage compared with existing prompt-based extraction methods. The study thus reframes PII leakage as the superposition of a latent signal in the model's representations, a view that enables both risk amplification and mitigation, with clear implications for building more robust and transparent language models.
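The intervention the abstract describes is simple in form: during the forward pass, add a scaled direction vector to the residual stream at some layer. Below is a minimal sketch of that kind of activation steering, using a PyTorch forward hook on a Hugging Face causal LM. The model choice, layer index, steering strength `alpha`, and the random placeholder `direction` are illustrative assumptions for demonstration only, not UniLeak's actual parameters or its discovered directions.

```python
# Minimal activation-steering sketch: add a fixed direction to the residual
# stream of one transformer block at inference time. Model, layer, and scale
# are placeholder choices, not UniLeak's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM with .transformer.h blocks
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6
alpha = 4.0  # steering strength (hypothetical value)
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()  # unit-norm placeholder direction

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the residual-stream tensor
    # of shape (batch, seq_len, d_model). Add the direction at every position.
    hidden = output[0] + alpha * direction.to(output[0])
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
ids = tok("Contact details for the patient are", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

Because the hook returns a modified output, PyTorch substitutes it for the block's original output on every decoding step, so the steering is applied consistently throughout generation.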
Key Points
- ▸ UniLeak is a mechanistic-interpretability framework for identifying universal activation directions in language models.
- ▸ The framework recovers these directions without access to training data or ground-truth PII (one hedged illustration of such a recovery step follows this list).
- ▸ Steering along these universal directions substantially increases PII leakage compared to existing methods.
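The paper states only that directions are recovered from self-generated text, without ground-truth PII; the concrete algorithm is not reproduced here. As one hedged illustration of the general recipe, the sketch below labels a model's own sampled continuations with a crude regex for email/phone patterns, caches residual-stream activations at one layer, and takes a difference of class means as a candidate direction. The regex, layer choice, prompts, and labeling scheme are all placeholder assumptions, not UniLeak's published procedure.

```python
# Hedged sketch: derive a candidate "PII direction" from self-generated text
# alone via a difference of class means. All heuristics here are stand-ins.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
layer_idx = 6
pii_pattern = re.compile(r"[\w.]+@[\w.]+|\+?\d[\d\- ]{7,}\d")  # crude email/phone match

pos_acts, neg_acts = [], []
prompts = ["My contact information is", "The weather today is"]
for prompt in prompts:
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=25, do_sample=True, top_p=0.9)
    text = tok.decode(gen[0], skip_special_tokens=True)
    with torch.no_grad():
        # hidden_states[layer_idx + 1] is the residual stream after block layer_idx
        hs = model(gen, output_hidden_states=True).hidden_states[layer_idx + 1]
    act = hs[0].mean(dim=0)  # mean activation over the sequence positions
    (pos_acts if pii_pattern.search(text) else neg_acts).append(act)

if pos_acts and neg_acts:
    direction = torch.stack(pos_acts).mean(0) - torch.stack(neg_acts).mean(0)
    direction = direction / direction.norm()  # candidate steering direction
```

A direction obtained this way could then be plugged into the steering hook from the earlier sketch; in practice one would use far more samples and a stronger PII detector than this toy regex.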
Merits
Improved Understanding of PII Leakage
The study provides a new perspective on PII leakage, highlighting the superposition of a latent signal in the model's representations.
Robust and Transparent Language Models
The findings have significant implications for the development of more robust and transparent language models.
Demerits
Limited Generalizability
UniLeak is evaluated on only a limited set of models and datasets, which may limit how well the findings generalize to other architectures and domains.
Lack of Real-World Applications
The study does not demonstrate the practical applications of UniLeak in real-world scenarios.
Expert Commentary
The study presents a novel approach to understanding PII leakage in language models, with significant implications for building more robust and transparent models. The framework's ability to recover universal activation directions without access to training data or ground-truth PII is a notable achievement. However, the limited evaluation scope and the absence of demonstrated real-world applications are significant limitations. Moreover, because the work focuses on language models, its findings may not transfer directly to other types of AI systems, whose internal representations of sensitive information may differ.
Recommendations
- ✓ Future studies should evaluate UniLeak on a broader range of models and datasets to increase its generalizability.
- ✓ Researchers should explore the practical applications of UniLeak in real-world scenarios to demonstrate its value and potential impact.