ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
arXiv:2602.20708v1 Announce Type: new Abstract: Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulates adversarial query-key dependencies while amplifying task-relevant elements to restore the LLM's functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding a task utility gain of over 50%. Furthermore, ICON demonstrates robust Out-of-Distribution (OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.
Executive Summary
This article proposes ICON, a novel framework to defend Large Language Model (LLM) agents against Indirect Prompt Injection (IPI) attacks. ICON leverages a Latent Space Trace Prober to detect attacks and a Mitigating Rectifier to restore the LLM's functional trajectory. The framework achieves a competitive attack success rate (ASR) and yields a significant task utility gain. The authors also demonstrate ICON's robustness in Out-of-Distribution (OOD) generalization and multi-modal agent applications. This work provides a promising solution to mitigate IPI attacks, a critical concern in LLM-based agents.
Key Points
- ▸ ICON is a probing-to-mitigation framework that detects IPI attacks with a Latent Space Trace Prober and restores task continuity with a Mitigating Rectifier.
- ▸ The framework achieves a competitive 0.4% ASR and a task utility gain of over 50% compared to existing defenses.
- ▸ ICON demonstrates robust OOD generalization and extends effectively to multi-modal agents.
Merits
Strength in Detection
The Latent Space Trace Prober effectively detects IPI attacks with minimal false positives.
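The paper does not release code, but the idea of flagging an attack via an "over-focusing" intensity score can be illustrated with a toy sketch. Everything below is an assumption for illustration only: the function names, the retrieved-content span, the uniform-attention baseline, and the threshold of 2.0 are not the authors' actual method.

```python
import numpy as np

def intensity_score(attn_weights: np.ndarray, retrieved_span: slice) -> float:
    """Toy over-focusing score: average attention mass that query tokens place
    on the retrieved-content span, relative to a uniform-attention baseline.
    attn_weights: (num_queries, seq_len), each row summing to 1."""
    span_mass = attn_weights[:, retrieved_span].sum(axis=-1).mean()
    uniform = (retrieved_span.stop - retrieved_span.start) / attn_weights.shape[-1]
    return float(span_mass / uniform)  # > 1 means over-focusing on retrieved text

def flag_injection(attn_weights: np.ndarray, retrieved_span: slice,
                   threshold: float = 2.0) -> bool:
    """Flag an IPI attack when the intensity score exceeds a fixed threshold."""
    return intensity_score(attn_weights, retrieved_span) > threshold
```

In practice a prober like this would run over hidden states or attention maps extracted from the agent's backbone; the sketch only conveys the shape of the decision rule.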
Task Continuity
The Mitigating Rectifier restores the LLM's functional trajectory, ensuring task continuity during adversarial queries.
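The attention-steering idea behind the Mitigating Rectifier can likewise be sketched: suppress query-key logits pointing into the suspected adversarial span, amplify those pointing at the original task instruction, and renormalize. This is a minimal illustration under assumed names and bias magnitudes, not the paper's implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def steer_attention(attn_logits: np.ndarray, adversarial_span: slice,
                    task_span: slice, suppress: float = -5.0,
                    amplify: float = 1.5) -> np.ndarray:
    """Toy attention steering: bias pre-softmax query-key logits away from the
    suspected adversarial span and toward the task instruction, then
    renormalize so each query's attention still sums to 1."""
    logits = attn_logits.copy()
    logits[:, adversarial_span] += suppress  # down-weight adversarial keys
    logits[:, task_span] += amplify          # amplify task-relevant keys
    return softmax(logits, axis=-1)
```

Because steering operates on logits rather than masking tokens outright, the agent keeps reading the retrieved content and the workflow continues, which is the property the review credits for avoiding over-refusal.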
Robustness
ICON demonstrates robustness in OOD generalization and multi-modal agent applications.
Demerits
Complexity
The framework's probing-to-mitigation mechanism may introduce additional computational overhead and complexity.
Task-Specific Training
The effectiveness of ICON may depend on task-specific training and fine-tuning of the LLM agent.
Evaluation Metrics
The evaluation metrics used in the paper may not comprehensively capture the performance of ICON in real-world scenarios.
Expert Commentary
The authors' proposal of ICON represents a significant advancement in the field of Large Language Model security. By leveraging the Latent Space Trace Prober and Mitigating Rectifier, ICON effectively balances security and efficiency. While the framework's complexity and task-specific training requirements may pose challenges, its robustness and generalizability make it a promising solution for real-world applications. As the field continues to evolve, it is essential to address the security concerns in LLMs and develop robust defense mechanisms like ICON.
Recommendations
- ✓ Future work should focus on evaluating ICON's performance in more diverse and complex scenarios, including real-world applications and edge cases.
- ✓ The development of more efficient and scalable versions of ICON is crucial for widespread adoption in LLM-based systems.