Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

arXiv:2602.21496v1 Announce Type: new Abstract: While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.
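
The abstract describes an agentic "Editor" that iteratively critiques and rewrites sensitive spans instead of refusing to answer. The sketch below shows one way such an inference-time critique-and-rewrite loop could be structured; the llm() callable, the prompts, and the stopping rule are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of an inference-time critique-and-rewrite loop in the
# spirit of SemSIEdit. The llm() callable, the prompts, and max_rounds are
# illustrative assumptions, not details taken from the paper.
from typing import Callable


def agentic_edit(draft: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Iteratively critique a draft and rewrite flagged spans in place."""
    text = draft
    for _ in range(max_rounds):
        # Critic pass: ask the model to flag spans that leak semantic
        # sensitive information (identity inferences, reputation-harmful
        # claims, unverifiable assertions).
        critique = llm(
            "List any spans in the text below that leak semantic sensitive "
            "information, one per line. Reply NONE if the text is clean.\n\n"
            + text
        )
        if critique.strip().upper() == "NONE":
            break  # nothing left to edit
        # Editor pass: rewrite the flagged spans while keeping the
        # surrounding narrative intact, rather than refusing outright.
        text = llm(
            "Rewrite the text so that the spans below no longer leak "
            "sensitive information, preserving narrative flow and utility.\n\n"
            "Spans:\n" + critique + "\n\nText:\n" + text
        )
    return text
```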

Executive Summary

This study proposes SemSIEdit, an inference-time agentic framework in which Large Language Models (LLMs) critique and rewrite their own sensitive spans while preserving narrative flow. Across all three Semantic Sensitive Information (SemSI) categories, agentic rewriting reduces leakage by 34.6% at a marginal utility cost of 9.8%. The study also uncovers a Scale-Dependent Safety Divergence and a Reasoning Paradox, underscoring how hard context-dependent sensitive information is to control. The findings are most relevant to LLM deployments that handle sensitive personal or reputational information.
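
The headline numbers are relative changes against an unedited baseline. The snippet below is a minimal sketch of that arithmetic; the baseline and post-edit scores are made-up values chosen only to reproduce the reported percentages, not figures from the paper.

```python
# Minimal sketch of the relative-change arithmetic behind the headline
# numbers. The baseline and post-edit scores are made-up values chosen only
# to reproduce the reported percentages, not figures from the paper.

def relative_change(before: float, after: float) -> float:
    """Percentage change relative to the baseline value."""
    return 100.0 * (before - after) / before

leakage_before, leakage_after = 0.52, 0.34  # fraction of responses leaking SemSI
utility_before, utility_after = 0.82, 0.74  # task-utility score

print(f"Leakage reduction: {relative_change(leakage_before, leakage_after):.1f}%")  # 34.6%
print(f"Utility loss:      {relative_change(utility_before, utility_after):.1f}%")  # 9.8%
```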

Key Points

  • SemSIEdit framework for agentic self-regulation of LLMs
  • Significant reduction in sensitive information leakage (34.6%)
  • Scale-Dependent Safety Divergence: large models achieve safety through constructive expansion, while smaller models revert to destructive truncation
  • Reasoning Paradox: inference-time reasoning increases risk, but also enables safe rewrites

Merits

Strength in Addressing Complex Sensitive Information

SemSIEdit addresses the limitations of existing defenses, which target structured PII or fall back on outright refusal, by introducing an agentic framework that lets the model regulate and rewrite its own sensitive spans while preserving narrative flow.

Significant Reduction in Sensitive Information Leakage

The study reports a 34.6% reduction in SemSI leakage across all three categories at only a 9.8% utility cost, supporting the effectiveness of the SemSIEdit framework.

Demerits

Limitation in Model-Scale Dependence

The study highlights a Scale-Dependent Safety Divergence: large reasoning models achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models fall back on destructive truncation (deleting text). The defense therefore does not transfer cleanly across model scales, complicating LLM development.
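
One simple way to separate these two behaviors is to compare the length of the edited response to the original. The heuristic below is an illustrative assumption for exposition, not the classification used in the paper.

```python
# Hypothetical heuristic that separates constructive expansion from
# destructive truncation by comparing word counts; the 0.8 threshold is an
# illustrative assumption, not a metric defined in the paper.
def classify_edit(original: str, edited: str, trunc_ratio: float = 0.8) -> str:
    """Label an edit by how much of the original length survives."""
    ratio = len(edited.split()) / max(len(original.split()), 1)
    if ratio < trunc_ratio:
        return "destructive truncation"  # large spans deleted outright
    if ratio > 1.0:
        return "constructive expansion"  # nuance and qualifiers added
    return "in-place rewrite"
```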

Reasoning Paradox and Increased Risk

The study uncovers a Reasoning Paradox: inference-time reasoning raises baseline risk by letting the model draw deeper sensitive inferences, yet the same capability empowers the Editor to execute safe rewrites. How to balance these two effects remains an open problem for LLM development.

Expert Commentary

The study contributes to LLM safety and security by introducing the SemSIEdit framework and demonstrating a meaningful reduction in sensitive information leakage at modest utility cost. At the same time, the Scale-Dependent Safety Divergence and the Reasoning Paradox show that semantic sensitive information remains difficult to control: the defense behaves differently across model scales, and the reasoning that powers the defense also deepens the underlying risk. Further work is needed to balance these effects and to understand the implications for human-model interaction and misinformation.

Recommendations

  • For researchers: investigate how model scale drives the divergence between constructive expansion and destructive truncation, and explore strategies to mitigate the Reasoning Paradox.
  • For policymakers: develop regulatory frameworks that address the safety and security of LLMs, particularly in applications that handle sensitive information.
