ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
arXiv:2602.20708v1 Announce Type: new Abstract: Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulates adversarial query-key dependencies while amplifying task-relevant elements to restore the LLM's functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding a task utility gain of over 50%. Furthermore, ICON demonstrates robust Out-of-Distribution (OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.
Executive Summary
This article proposes ICON, a novel framework to defend Large Language Model (LLM) agents against Indirect Prompt Injection (IPI) attacks. ICON leverages a Latent Space Trace Prober to detect attacks and a Mitigating Rectifier to restore the LLM's functional trajectory. The framework achieves a competitive attack success rate (ASR) and yields a significant task utility gain. The authors also demonstrate ICON's robustness in Out-of-Distribution (OOD) generalization and multi-modal agent applications. This work provides a promising solution to mitigate IPI attacks, a critical concern in LLM-based agents.
Key Points
- ▸ ICON is a probing-to-mitigation framework that detects IPI attacks with a Latent Space Trace Prober and restores task continuity with a Mitigating Rectifier.
- ▸ The framework achieves a competitive 0.4% ASR and a task utility gain of over 50% compared to existing defenses.
- ▸ ICON demonstrates robust OOD generalization and extends effectively to multi-modal agents.
Merits
Strength in Detection
The Latent Space Trace Prober effectively detects IPI attacks with minimal false positives.
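The paper does not release code, but the idea of flagging an attack via an "over-focusing" intensity score can be illustrated with a toy sketch. Everything below is an assumption for illustration only: the function names, the retrieved-content span, the uniform-attention baseline, and the threshold of 2.0 are not the authors' actual method.

```python
import numpy as np

def intensity_score(attn_weights: np.ndarray, retrieved_span: slice) -> float:
    """Toy over-focusing score: average attention mass that query tokens place
    on the retrieved-content span, relative to a uniform-attention baseline.
    attn_weights: (num_queries, seq_len), each row summing to 1."""
    span_mass = attn_weights[:, retrieved_span].sum(axis=-1).mean()
    uniform = (retrieved_span.stop - retrieved_span.start) / attn_weights.shape[-1]
    return float(span_mass / uniform)  # > 1 means over-focusing on retrieved text

def flag_injection(attn_weights: np.ndarray, retrieved_span: slice,
                   threshold: float = 2.0) -> bool:
    """Flag an IPI attack when the intensity score exceeds a fixed threshold."""
    return intensity_score(attn_weights, retrieved_span) > threshold
```

In practice a prober like this would run over hidden states or attention maps extracted from the agent's backbone; the sketch only conveys the shape of the decision rule.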
Task Continuity
The Mitigating Rectifier restores the LLM's functional trajectory, ensuring task continuity during adversarial queries.
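The attention-steering idea behind the Mitigating Rectifier can likewise be sketched: suppress query-key logits pointing into the suspected adversarial span, amplify those pointing at the original task instruction, and renormalize. This is a minimal illustration under assumed names and bias magnitudes, not the paper's implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def steer_attention(attn_logits: np.ndarray, adversarial_span: slice,
                    task_span: slice, suppress: float = -5.0,
                    amplify: float = 1.5) -> np.ndarray:
    """Toy attention steering: bias pre-softmax query-key logits away from the
    suspected adversarial span and toward the task instruction, then
    renormalize so each query's attention still sums to 1."""
    logits = attn_logits.copy()
    logits[:, adversarial_span] += suppress  # down-weight adversarial keys
    logits[:, task_span] += amplify          # amplify task-relevant keys
    return softmax(logits, axis=-1)
```

Because steering operates on logits rather than masking tokens outright, the agent keeps reading the retrieved content and the workflow continues, which is the property the review credits for avoiding over-refusal.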
Robustness
ICON demonstrates robustness in OOD generalization and multi-modal agent applications.
Demerits
Complexity
The framework's probing-to-mitigation mechanism may introduce additional computational overhead and complexity.
Task-Specific Training
The effectiveness of ICON may depend on task-specific training and fine-tuning of the LLM agent.
Evaluation Metrics
The evaluation metrics used in the paper may not comprehensively capture the performance of ICON in real-world scenarios.
Expert Commentary
The authors' proposal of ICON represents a significant advancement in the field of Large Language Model security. By leveraging the Latent Space Trace Prober and Mitigating Rectifier, ICON effectively balances security and efficiency. While the framework's complexity and task-specific training requirements may pose challenges, its robustness and generalizability make it a promising solution for real-world applications. As the field continues to evolve, it is essential to address the security concerns in LLMs and develop robust defense mechanisms like ICON.
Recommendations
- ✓ Future work should focus on evaluating ICON's performance in more diverse and complex scenarios, including real-world applications and edge cases.
- ✓ The development of more efficient and scalable versions of ICON is crucial for widespread adoption in LLM-based systems.