
RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

He Zhu, Yanshu Li, Wen Liu, Haitian Yang

arXiv:2603.12582v1. Abstract: Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator, without fine-tuning, to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism, particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.

Executive Summary

This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. By leveraging a frozen, pre-trained Replaced Token Detection (RTD) discriminator to localize suspicious tokens, RTD-Guard needs only two black-box queries to the victim model to decide whether an input is adversarial. Comprehensive experiments demonstrate that it detects adversarial texts generated by diverse state-of-the-art attack methods and surpasses existing detection baselines across multiple metrics. Because it requires no adversarial data, no model tuning, and no internal access to the victim model, RTD-Guard offers a highly efficient, practical, and resource-light defense mechanism, particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.

Key Points

  • RTD-Guard is a novel black-box framework for detecting textual adversarial examples.
  • RTD-Guard employs a pre-trained Replaced Token Detection (RTD) discriminator to localize suspicious tokens.
  • RTD-Guard requires only two black-box queries to detect adversarial examples.
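The detection loop implied by these points can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rtd_score` and `victim_confidence` are hypothetical stand-ins for, respectively, an off-the-shelf RTD discriminator (e.g., an ELECTRA-style model scoring each token's probability of being "replaced") and the black-box victim classifier. Only the control flow, localize, mask, and compare confidence across exactly two victim queries, reflects the abstract.

```python
def rtd_score(tokens):
    """Hypothetical RTD discriminator: per-token probability of being 'replaced'."""
    suspicious = {"terrrible", "gud", "awfull"}  # toy stand-in for learned scores
    return [0.9 if t in suspicious else 0.1 for t in tokens]

def victim_confidence(tokens):
    """Hypothetical black-box victim: top-class confidence for the input."""
    return 0.95 if "terrrible" in tokens else 0.60  # toy stand-in

def detect_adversarial(text, score_threshold=0.5, shift_threshold=0.2, mask="[MASK]"):
    tokens = text.split()
    # Victim query 1: confidence on the original input.
    conf_before = victim_confidence(tokens)
    # Localize suspicious tokens with the frozen RTD discriminator (no victim query).
    scores = rtd_score(tokens)
    masked = [mask if s > score_threshold else t for t, s in zip(tokens, scores)]
    # Victim query 2: confidence after masking the suspicious tokens.
    conf_after = victim_confidence(masked)
    # A large confidence drop after masking flags the input as adversarial.
    return (conf_before - conf_after) > shift_threshold

print(detect_adversarial("this movie is terrrible"))  # True: perturbed input flagged
print(detect_adversarial("this movie is bad"))        # False: clean input passes
```

Note that the RTD discriminator is queried locally and never counts against the victim-query budget, which is how the method stays at exactly two black-box queries per input.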

Merits

Detection Performance

RTD-Guard surpasses existing detection baselines across multiple metrics.

Efficiency

RTD-Guard is highly efficient and requires minimal resources.

Practicality

RTD-Guard is suitable for real-world deployment in resource-constrained or privacy-sensitive environments.

Demerits

Limitation

The effectiveness of RTD-Guard may be limited by the quality of the pre-trained RTD discriminator.

Vulnerability to Adversarial Attacks

RTD-Guard may be vulnerable to advanced adversarial attacks that evade detection by the RTD discriminator.

Expert Commentary

The introduction of RTD-Guard marks a significant step forward for black-box textual adversarial detection. By leveraging a pre-trained RTD discriminator, it removes the need for adversarial data, model tuning, or internal model access, making it a promising option for NLP security. Its effectiveness, however, is bounded by the quality of that discriminator, and adaptive attacks crafted to evade RTD scoring could slip past detection. Even so, the combination of only two black-box queries and an off-the-shelf discriminator makes RTD-Guard a highly efficient and practical defense mechanism.

Recommendations

  • Further research is needed to improve the quality of the pre-trained RTD discriminator and enhance its ability to detect advanced adversarial attacks.
  • RTD-Guard should be integrated into existing NLP systems to provide a lightweight defense mechanism against adversarial attacks.
