Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
arXiv:2602.21496v1 Announce Type: new Abstract: While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive-information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.
Executive Summary
This study proposes SemSIEdit, an agentic inference-time framework that lets Large Language Models (LLMs) self-regulate and edit sensitive information while preserving narrative flow. The analysis reports a 34.6% reduction in sensitive-information leakage across all three SemSI categories, at the cost of a 9.8% loss in utility. The study also uncovers a Scale-Dependent Safety Divergence and a Reasoning Paradox, highlighting the complexity of addressing semantic sensitive information in LLMs. The findings have significant implications for LLM development, particularly in applications that handle sensitive information.
Key Points
- ▸ SemSIEdit framework for agentic self-regulation of LLMs
- ▸ 34.6% reduction in sensitive information leakage across all three SemSI categories, with a 9.8% utility loss
- ▸ Scale-Dependent Safety Divergence: large models achieve safety through constructive expansion, while smaller models revert to destructive truncation
- ▸ Reasoning Paradox: inference-time reasoning increases risk, but also enables safe rewrites
Merits
Strength in Addressing Complex Sensitive Information
SemSIEdit addresses the limitations of existing defenses by introducing an agentic framework that can self-regulate and edit sensitive information while preserving narrative flow.
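The paper does not publish the Editor's implementation; the sketch below is a hypothetical illustration of the iterative critique-and-rewrite loop it describes. The `critique` and `rewrite` functions here are simple rule-based stand-ins for the LLM calls an actual agentic Editor would make, and all names are illustrative assumptions.

```python
# Hypothetical sketch of an iterative critique-and-rewrite loop in the
# spirit of SemSIEdit. In the real framework, critique() and rewrite()
# would be LLM calls; here they are rule-based stand-ins.

SENSITIVE_TERMS = {"diagnosis", "home address"}  # assumed critic lexicon

def critique(text: str) -> list[str]:
    """Return the sensitive spans the critic flags in the text."""
    return [term for term in SENSITIVE_TERMS if term in text]

def rewrite(text: str, span: str) -> str:
    """Rewrite a flagged span in place, keeping the sentence intact
    rather than refusing or deleting the whole passage."""
    return text.replace(span, "[redacted detail]")

def semsi_edit(text: str, max_rounds: int = 3) -> str:
    """Iteratively critique and rewrite until no spans are flagged
    or the round budget is exhausted."""
    for _ in range(max_rounds):
        spans = critique(text)
        if not spans:
            break  # converged: nothing sensitive remains
        for span in spans:
            text = rewrite(text, span)
    return text

print(semsi_edit("Her diagnosis was mentioned near her home address."))
# -> Her [redacted detail] was mentioned near her [redacted detail].
```

The loop structure, rather than the toy rules, is the point: editing spans in place is what preserves narrative flow, in contrast to the refusal or truncation behaviors the paper attributes to smaller models.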
Significant Reduction in Sensitive Information Leakage
The study reveals a significant reduction in sensitive information leakage, demonstrating the effectiveness of the SemSIEdit framework.
Demerits
Limitation in Model-Scale Dependence
The study highlights a Scale-Dependent Safety Divergence, where large models achieve safety through constructive expansion, while smaller models revert to destructive truncation, introducing complexity in LLM development.
Reasoning Paradox and Increased Risk
The study uncovers a Reasoning Paradox, where inference-time reasoning increases risk, but also enables safe rewrites, highlighting the need for further research to balance risk and safety in LLM development.
Expert Commentary
The study makes significant contributions to LLM safety and security by introducing the SemSIEdit framework and demonstrating its effectiveness in reducing sensitive-information leakage. However, it also highlights the complexity of addressing semantic sensitive information in LLMs, particularly the dependence of safe behavior on model scale and the double-edged role of inference-time reasoning. Further research is needed to balance risk and safety in LLM development and to address the implications of LLMs for human-model interaction and misinformation.
Recommendations
- ✓ For researchers: Investigate how model scale shapes sensitive-information leakage and safety behavior, and explore strategies to mitigate the Reasoning Paradox.
- ✓ For policymakers: Develop regulatory frameworks that address the safety and security of LLMs, particularly in applications involving sensitive information.