CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks
arXiv:2603.12206v1 Abstract: State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing HiSPA mitigation as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves a 95.9% token-level F1 score and a 99.3% document-level F1 score on malicious token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
Executive Summary
The article introduces CLASP, a model designed to defend hybrid large language models against Hidden State Poisoning Attacks (HiSPAs). CLASP achieves high detection accuracy, with a 95.9% token-level F1 score and a 99.3% document-level F1 score, and generalizes to unseen attack patterns. The model operates independently of any downstream model and runs efficiently, processing 1,032 tokens per second with minimal computational overhead, making it a suitable lightweight front-line defense for state space models and hybrid architectures.
Key Points
- ▸ CLASP defends against Hidden State Poisoning Attacks (HiSPAs) in state space models
- ▸ The model achieves high detection accuracy, with 95.9% token-level F1 score and 99.3% document-level F1 score
- ▸ CLASP generalizes to unseen attack patterns and operates efficiently with minimal computational overhead
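The core idea, framing HiSPA detection as per-token binary classification over Mamba's block output embeddings (BOEs), can be illustrated with a minimal sketch. The embeddings, labels, and dimensions below are synthetic stand-ins (the paper's actual BOE feature extraction is not reproduced here), and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost to keep the example self-contained:

```python
# Hypothetical sketch of CLASP-style token-level detection.
# Assumptions: synthetic per-token embeddings stand in for real Mamba
# block output embeddings (BOEs); scikit-learn's gradient boosting
# replaces the XGBoost classifier used in the paper.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
d = 16  # embedding dimension (illustrative only)

# Synthetic training data: benign token embeddings cluster near 0,
# "poisoned" token embeddings are shifted, mimicking the distinct
# BOE patterns the classifier is meant to pick up.
benign = rng.normal(0.0, 1.0, size=(500, d))
poison = rng.normal(3.0, 1.0, size=(500, d))
X = np.vstack([benign, poison])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = malicious token

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)

# Token-level predictions for one simulated document containing a
# poisoned span; a document is flagged if any token is malicious,
# which is one simple way to lift token-level scores to the
# document level.
doc_tokens = rng.normal(3.0, 1.0, size=(10, d))
token_flags = clf.predict(doc_tokens)
doc_flag = bool(token_flags.any())
print(doc_flag)
```

Because the classifier only sees per-token embeddings, this style of detector can run ahead of (and independently of) whatever downstream LLM consumes the text, matching the front-line-defense deployment the article describes.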
Merits
Efficient Defense Mechanism
CLASP provides an efficient defense mechanism against HiSPAs, with minimal computational overhead and high detection accuracy
Generalizability
The model generalizes to unseen attack patterns, making it a robust defense mechanism
Demerits
Limited Context
The evaluation of CLASP is limited to a specific scenario, which may not be representative of all possible use cases
Dependence on XGBoost Classifier
The model's performance relies on the XGBoost classifier, which may not be optimal in all scenarios
Expert Commentary
The introduction of CLASP represents a significant advancement in the defense against Hidden State Poisoning Attacks (HiSPAs) in state space models. The model's high detection accuracy, generalizability, and efficiency make it a valuable contribution to the field. However, further research is needed to fully evaluate CLASP's performance in diverse scenarios and to explore its potential applications. Additionally, the development of CLASP highlights the ongoing need for research into adversarial attacks and defense mechanisms, and the importance of considering security and robustness in the development of language models.
Recommendations
- ✓ Further evaluation of CLASP in diverse scenarios to fully assess its performance and generalizability
- ✓ Exploration of CLASP's potential applications in real-world settings, including its use as a lightweight front-line defense for state space models and hybrid architectures