
Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection


J Alex Corll

arXiv:2602.11247v1 Announce Type: cross Abstract: Multi-turn prompt injection attacks distribute malicious intent across multiple conversation turns, exploiting the assumption that each turn is evaluated independently. While single-turn detection has been extensively studied, no published formula exists for aggregating per-turn pattern scores into a conversation-level risk score at the proxy layer -- without invoking an LLM. We identify a fundamental flaw in the intuitive weighted-average approach: it converges to the per-turn score regardless of turn count, meaning a 20-turn persistent attack scores identically to a single suspicious turn. Drawing on analogies from change-point detection (CUSUM), Bayesian belief updating, and security risk-based alerting, we propose peak + accumulation scoring -- a formula combining peak single-turn risk, persistence ratio, and category diversity. Evaluated on 10,654 multi-turn conversations -- 588 attacks sourced from WildJailbreak adversarial prompts and 10,066 benign conversations from WildChat -- the formula achieves 90.8% recall at 1.20% false positive rate with an F1 of 85.9%. A sensitivity analysis over the persistence parameter reveals a phase transition at rho ~ 0.4, where recall jumps 12 percentage points with negligible FPR increase. We release the scoring algorithm, pattern library, and evaluation harness as open source.

Executive Summary

The article 'Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection' introduces a novel approach to detecting multi-turn prompt injection attacks in large language models (LLMs). The authors identify a critical flaw in the conventional weighted-average method for aggregating per-turn risk scores, which fails to account for the cumulative effect of persistent attacks. They propose a new scoring formula that combines peak single-turn risk, persistence ratio, and category diversity, drawing inspiration from change-point detection, Bayesian belief updating, and security risk-based alerting. Evaluated on a dataset of 10,654 multi-turn conversations, the formula achieves a high recall rate of 90.8% with a low false positive rate of 1.20%, demonstrating its effectiveness in identifying malicious intent distributed across multiple conversation turns.
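The averaging flaw the authors identify is easy to reproduce numerically. The sketch below is illustrative only: the recency-decay weighting and the per-turn scores are hypothetical stand-ins, not the paper's actual per-turn scorer.

```python
def weighted_average(scores, decay=0.9):
    """Recency-weighted mean of per-turn risk scores in [0, 1]."""
    weights = [decay ** (len(scores) - 1 - i) for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# A single suspicious turn and a 20-turn persistent attack score identically:
# the average converges to the per-turn level, so turn count is invisible.
single = weighted_average([0.6])
persistent = weighted_average([0.6] * 20)
print(single, persistent)  # both ~0.6
```

Any convex combination of identical per-turn scores returns that score, which is exactly why persistence must enter the formula as a separate term.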

Key Points

  • Identification of a fundamental flaw in the weighted-average approach for multi-turn attack detection.
  • Introduction of a new scoring formula combining peak single-turn risk, persistence ratio, and category diversity.
  • Evaluation on 10,654 multi-turn conversations (588 WildJailbreak-derived attacks, 10,066 benign WildChat conversations), achieving 90.8% recall at a 1.20% false positive rate.
  • Release of the scoring algorithm, pattern library, and evaluation harness as open source.
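The abstract names the three ingredients but not the exact formula, so the following is a hypothetical sketch: the weights `alpha` and `beta`, the per-turn threshold, and the diversity normalization are all assumptions, not the authors' published coefficients.

```python
def conversation_risk(turn_scores, turn_categories,
                      alpha=0.3, beta=0.2, threshold=0.5):
    """Hypothetical peak + accumulation score in [0, 1].

    Combines the three ingredients named in the abstract:
    peak single-turn risk, persistence ratio, category diversity.
    """
    peak = max(turn_scores)
    # persistence: fraction of turns whose score clears a per-turn threshold
    persistence = sum(s >= threshold for s in turn_scores) / len(turn_scores)
    # diversity: distinct matched pattern categories, normalized by turn count
    diversity = len({c for c in turn_categories if c}) / len(turn_categories)
    return min(peak + alpha * persistence + beta * diversity, 1.0)

# A 20-turn persistent attack now outscores a single suspicious turn
# embedded in an otherwise benign conversation.
print(conversation_risk([0.6] * 20, ["roleplay"] * 20))
print(conversation_risk([0.1] * 19 + [0.6], [None] * 19 + ["roleplay"]))
```

The key design choice, under these assumptions, is that persistence and diversity are additive bonuses on top of the peak, so turn count and pattern breadth raise the score instead of being averaged away.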

Merits

Innovative Approach

The article presents a novel and innovative approach to multi-turn attack detection, addressing a significant gap in the current literature.

High Performance Metrics

The proposed formula achieves strong results on the benchmark: 90.8% recall at a 1.20% false positive rate, with an F1 of 85.9%, suggesting it could be effective in practical deployments.
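The reported figures are internally consistent, which can be verified from the dataset sizes given in the abstract. The derived true-positive and false-positive counts below are approximations from the rounded published rates, not numbers the paper reports directly.

```python
# Consistency check of the reported metrics, using the dataset sizes
# from the abstract: 588 attacks, 10,066 benign conversations.
attacks, benign = 588, 10_066
recall, fpr = 0.908, 0.0120

tp = recall * attacks            # ~534 attacks caught
fp = fpr * benign                # ~121 false alarms
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(f1, 3))  # ~0.815, ~0.859
```

The implied F1 of ~0.859 matches the reported 85.9%, so the recall, FPR, and F1 figures describe the same operating point.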

Open Source Release

The authors have made the scoring algorithm, pattern library, and evaluation harness available as open source, facilitating further research and practical implementation.

Demerits

Limited Dataset Diversity

The evaluation dataset consists of conversations sourced from WildJailbreak adversarial prompts and WildChat, which may not fully represent the diversity of multi-turn conversations in different contexts.

Sensitivity to Parameters

The performance of the formula is sensitive to the persistence parameter, which may require careful tuning for optimal results in different scenarios.
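A sweep like the paper's sensitivity analysis is straightforward to set up, though the ~0.4 phase transition itself can only be observed on the real WildJailbreak/WildChat data. The toy corpus, scorer, and the reading of rho as a per-turn persistence threshold below are all assumptions for illustration.

```python
def conv_score(scores, rho, alpha=0.3):
    """Toy peak + persistence score; rho is the per-turn threshold."""
    peak = max(scores)
    persistence = sum(s >= rho for s in scores) / len(scores)
    return min(peak + alpha * persistence, 1.0)

# Toy corpus: persistent attacks hover around 0.45 per turn,
# benign chatter stays near 0.2.
attacks = [[0.45] * 10 for _ in range(50)]
benign = [[0.2] * 10 for _ in range(50)]

for rho in (0.3, 0.4, 0.5):
    recall = sum(conv_score(s, rho) >= 0.7 for s in attacks) / len(attacks)
    fpr = sum(conv_score(s, rho) >= 0.7 for s in benign) / len(benign)
    print(f"rho={rho}: recall={recall:.2f}  fpr={fpr:.2f}")
```

Even this toy shows why recall can change sharply rather than smoothly: once rho crosses the typical per-turn score of borderline attack turns, the persistence term switches off all at once.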

Potential Overfitting

The formula's high performance on the evaluated dataset does not guarantee similar results on other datasets, raising concerns about potential overfitting.

Expert Commentary

The article presents a significant advancement in the field of LLM security by addressing the critical issue of multi-turn prompt injection attacks. The identification of the flaw in the weighted-average approach is a crucial insight, as it underscores the limitations of existing methods in handling persistent attacks.

The proposed peak + accumulation scoring formula is a well-reasoned and innovative solution that combines multiple risk factors into a single conversation-level assessment. The strong performance metrics achieved on the evaluated dataset demonstrate the formula's potential for real-world applications. However, the sensitivity to the persistence parameter and the potential for overfitting are important considerations that warrant further investigation.

The open-source release of the scoring algorithm, pattern library, and evaluation harness is a commendable contribution to the research community, facilitating further validation and refinement of the approach. Overall, this work sets a new benchmark for multi-turn attack detection and highlights the importance of continuous innovation in the field of AI security.

Recommendations

  • Further validation of the formula on diverse datasets to ensure its robustness and generalizability.
  • Exploration of adaptive mechanisms to dynamically adjust the persistence parameter based on conversation context.
  • Integration of the proposed formula into existing LLM systems to enhance their security capabilities and real-world deployment.
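The second recommendation could take many forms; one minimal sketch, entirely hypothetical and not from the paper, is a schedule that relaxes the persistence threshold as conversations grow longer, on the assumption that long benign chats are common but long runs of borderline-risky turns are not.

```python
def adaptive_rho(turn_count, base_rho=0.4, floor=0.3, k=0.02):
    """Hypothetical schedule for the persistence threshold rho:
    hold base_rho for short conversations, then lower it gradually
    (never below `floor`) as the conversation lengthens."""
    return max(floor, base_rho - k * max(0, turn_count - 10))
```

Any such schedule would need to be validated against the same recall/FPR trade-off the paper measures, since the reported phase transition suggests small changes in rho can have large effects.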
