
CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks


Alexandre Le Mercier, Thomas Demeester, Chris Develder

arXiv:2603.12206v1. Abstract: State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves a 95.9% token-level F1 score and a 99.3% document-level F1 score on malicious token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB of VRAM, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
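The pipeline the abstract describes can be sketched end to end: per-token features (the paper's block output embeddings) feed a binary classifier that flags malicious tokens. The snippet below is an illustration only, not the authors' implementation: `extract_boe` is a hypothetical stand-in for a real forward pass through Mamba's blocks, and a simple threshold lambda stands in for the paper's trained XGBoost model.

```python
import random

def extract_boe(token, dim=4):
    """Hypothetical stand-in for a Mamba block output embedding (BOE).

    In the real pipeline these features come from a forward pass through
    the SSM blocks; here we derive a deterministic pseudo-random vector
    from the token string so the sketch is self-contained.
    """
    rng = random.Random(token)
    return [rng.random() for _ in range(dim)]

def token_scores(tokens, classify):
    # CLASP frames HiSPA mitigation as token-level binary classification:
    # each token is scored independently from its BOE features.
    return [classify(extract_boe(t)) for t in tokens]

# Toy threshold rule standing in for the paper's XGBoost classifier.
flag = lambda feats: feats[0] > 0.95

flags = token_scores(["experience", "python", "zx!!trigger"], flag)
```

Because each token is classified independently of any downstream model, a detector like this can run as a front-line filter before the protected LLM ever sees the input, which is what makes the reported 1,032 tokens/s throughput and sub-4GB VRAM footprint relevant.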

Executive Summary

The paper introduces CLASP, a model designed to defend state space models (SSMs) and hybrid large language models against Hidden State Poisoning Attacks (HiSPAs). CLASP achieves high detection accuracy, with a 95.9% token-level F1 score and a 99.3% document-level F1 score, and generalizes to unseen attack patterns. Operating independently of any downstream model, it processes 1,032 tokens per second with minimal computational overhead, potentially making it suitable as a lightweight front-line defense for SSM-based and hybrid architectures.

Key Points

  • CLASP defends against Hidden State Poisoning Attacks (HiSPAs) in state space models
  • The model achieves high detection accuracy, with 95.9% token-level F1 score and 99.3% document-level F1 score
  • CLASP generalizes to unseen attack patterns and operates efficiently with minimal computational overhead
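The key points distinguish token-level from document-level F1, which implies some rule for turning per-token verdicts into a per-document one. The abstract does not state that rule, so the sketch below assumes the simplest option: a document is flagged once at least `min_flagged` of its tokens are flagged (`min_flagged` is a hypothetical parameter, not from the paper).

```python
def document_flag(token_flags, min_flagged=1):
    # Assumed aggregation rule (not specified in the abstract): a document
    # is deemed poisoned once at least `min_flagged` tokens are flagged.
    return sum(token_flags) >= min_flagged

clean = document_flag([False, False, False])    # no token flagged
poisoned = document_flag([False, True, False])  # one flagged token suffices
```

Such an aggregation explains why document-level F1 (99.3%) can exceed token-level F1 (95.9%): a document is detected even when only a fraction of its injected tokens are caught.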

Merits

Efficient Defense Mechanism

CLASP provides an efficient defense against HiSPAs, combining high detection accuracy with minimal computational overhead (1,032 tokens per second, under 4GB of VRAM).

Generalizability

The model generalizes to unseen attack patterns, retaining a 96.9% document-level F1 under leave-one-out cross-validation, which makes it a robust defense mechanism.

Demerits

Limited Context

The evaluation of CLASP is limited to a single résumé-screening scenario, which may not be representative of other deployment contexts.

Dependence on XGBoost Classifier

CLASP's performance depends on an XGBoost classifier operating on BOE features, a design choice that may not be optimal in all scenarios.

Expert Commentary

The introduction of CLASP represents a significant advancement in the defense against Hidden State Poisoning Attacks (HiSPAs) in state space models. The model's high detection accuracy, generalizability, and efficiency make it a valuable contribution to the field. However, further research is needed to fully evaluate CLASP's performance in diverse scenarios and to explore its potential applications. Additionally, the development of CLASP highlights the ongoing need for research into adversarial attacks and defense mechanisms, and the importance of considering security and robustness in the development of language models.

Recommendations

  • Further evaluation of CLASP in diverse scenarios to fully assess its performance and generalizability
  • Exploration of CLASP's potential applications in real-world settings, including its use as a lightweight front-line defense for state space models and hybrid architectures
