Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations
arXiv:2604.00209v1 Abstract: Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on how contextual privacy understanding in LLMs can be improved.
Executive Summary
This study investigates the internal encoding of contextual privacy norms in large language models (LLMs) through the lens of contextual integrity (CI) theory. The authors find that the three core CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions within LLM activation spaces. While this structural encoding suggests a capacity for privacy-norm awareness, the persistent leakage of private information in deployment reveals a critical disconnect between conceptual representation and behavioral output. To address this gap, the authors propose CI-parametric steering, a targeted intervention method that manipulates each CI dimension independently, reducing privacy violations more effectively and predictably than monolithic approaches. The findings underscore that contextual privacy failures stem from misalignment between representation and behavior, not from a lack of internal awareness, and point to the compositional structure of CI as a lever for more reliable contextual privacy control in LLMs.
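To make the probing setup concrete, here is a minimal sketch of how linear probes for the three CI parameters might be trained on hidden activations. This is not the authors' code: the hidden size, label sets, and activations are synthetic stand-ins (with a planted linear signal so the probes have something to find), standing in for residual-stream activations extracted from an LLM.

```python
# Minimal probing sketch: one linear probe per CI parameter on (synthetic)
# hidden activations. In the paper's setting these would be residual-stream
# activations from an LLM; here random data keeps the sketch standalone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 768, 2000  # hidden size and number of prompts (assumed values)

# One label per CI parameter for each prompt (class counts are illustrative).
labels = {
    "information_type": rng.integers(0, 5, n),        # e.g. health, finance, ...
    "recipient": rng.integers(0, 4, n),               # e.g. doctor, employer, ...
    "transmission_principle": rng.integers(0, 3, n),  # e.g. consent, duty, ...
}

# Synthetic activations with a planted linear signal per parameter; real
# activations from an LLM would replace this matrix.
X = rng.normal(size=(n, d_model))
for i, y in enumerate(labels.values()):
    X[:, i] += y  # plant each parameter along a distinct coordinate

# High held-out accuracy for a linear classifier is what "linearly
# separable directions in activation space" would look like in practice.
for name, y in labels.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"{name}: probe accuracy = {probe.score(X_te, y_te):.2f}")
```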
Key Points
- ▸ LLMs encode CI parameters as linearly separable directions in activation space
- ▸ Persistent privacy leakage indicates a gap between representation and behavior
- ▸ CI-parametric steering enables targeted, dimension-specific intervention to mitigate privacy violations (see the steering sketch after this list)
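Below is a minimal sketch of what dimension-specific steering could look like, assuming one steering vector per CI parameter added to a transformer layer's residual stream via a forward hook. The direction vectors, scaling coefficients, and layer index are illustrative assumptions, not the authors' released method.

```python
# Sketch of CI-parametric steering: a forward hook adds an independently
# scaled direction per CI parameter to a layer's hidden states.
import torch

d_model = 768  # assumed hidden size

# Unit-norm directions, one per CI parameter; in practice these might be
# derived from probe weights or mean activation differences, not random noise.
directions = {
    "information_type": torch.randn(d_model),
    "recipient": torch.randn(d_model),
    "transmission_principle": torch.randn(d_model),
}
directions = {k: v / v.norm() for k, v in directions.items()}

# One strength per dimension: this independence is what distinguishes
# parametric steering from a single monolithic "privacy" vector.
alphas = {"information_type": 0.0, "recipient": 4.0, "transmission_principle": 2.0}

def steering_hook(module, inputs, output):
    # Decoder layers often return a tuple whose first element is the
    # hidden-state tensor; handle both cases.
    hidden = output[0] if isinstance(output, tuple) else output
    for name, v in directions.items():
        hidden = hidden + alphas[name] * v.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Hypothetical usage with a loaded causal LM (layer index is an assumption):
# handle = model.model.layers[15].register_forward_hook(steering_hook)
# ... run generation ...
# handle.remove()
```

The key design point is that each CI dimension carries its own coefficient, so an intervention on, say, the recipient dimension leaves the other two untouched, which is what makes the resulting control more predictable than monolithic steering.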
Merits
Novel Structural Analysis
The study is the first to identify and validate the structured encoding of CI parameters within LLM activation space, offering a new theoretical framework for understanding contextual privacy.
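One simple way to sanity-check the claimed functional independence is to measure pairwise cosine similarity between the learned probe directions: values near zero indicate the CI parameters occupy nearly orthogonal subspaces. The check below uses random placeholder weight vectors (which are near-orthogonal in high dimensions by construction) and is an illustrative diagnostic, not necessarily the paper's exact methodology.

```python
# Diagnostic sketch: pairwise cosine similarity between probe directions.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Suppose probe_weights[name] holds the weight vector of the linear probe
# for that CI parameter (e.g. from the earlier probing sketch); random
# vectors stand in here.
rng = np.random.default_rng(0)
probe_weights = {name: rng.normal(size=768) for name in
                 ("information_type", "recipient", "transmission_principle")}

names = list(probe_weights)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        c = cosine(probe_weights[names[i]], probe_weights[names[j]])
        print(f"cos({names[i]}, {names[j]}) = {c:.3f}")  # near 0 => independent
```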
Demerits
Limited Scope of Validation
While the findings are theoretically compelling, the study lacks empirical validation across diverse real-world deployment scenarios, limiting generalizability.
Expert Commentary
This work represents a significant advance at the intersection of AI ethics and cognitive modeling. The identification of privacy norms as compositional, linearly separable representations within LLM architectures is both theoretically elegant and practically significant. It challenges the prevailing assumption that privacy violations arise from a lack of internal capacity, pointing instead to a misalignment between conceptual encoding and operational output. The CI-parametric steering mechanism is particularly noteworthy for its precision and its potential applicability across deployment contexts. Moreover, the study's grounding in contextual integrity theory provides a robust theoretical foundation that strengthens its credibility. While the limitations in real-world applicability warrant further investigation, the conceptual breakthrough here is substantial: it could inform not only LLM design but also broader AI governance frameworks that prioritize contextual sensitivity over reactive mitigation.
Recommendations
- ✓ 1. Incorporate CI-parametric steering as a baseline feature in LLM evaluation suites for sensitive applications.
- ✓ 2. Fund comparative studies on CI-based interventions across diverse LLM architectures and deployment environments to validate scalability and robustness.
Sources
Original: arXiv - cs.CL