
Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs


Zachary Coalson, Bo Fang, Sanghyun Hong

arXiv:2602.17778v1 Abstract: Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior, commonly encouraged in multi-turn conversation settings, to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining compliant. We also show that existing defenses offer limited protection against this emerging class of failures.

Executive Summary

This article presents a new failure mode in conversational Large Language Models (LLMs), termed turn amplification, in which a model prolongs multi-turn interactions without completing the underlying task. The authors identify a query-independent, universal activation subspace associated with clarification-seeking responses, which adversaries can exploit to scalably prolong interactions. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, this mechanism arises from conversational dynamics and persists across prompts and tasks. The study demonstrates a scalable pathway to induce turn amplification, through both supply-chain attacks via fine-tuning and runtime attacks via low-level parameter corruptions, and shows its effectiveness across multiple instruction-tuned LLMs and benchmarks. Existing defenses offer only limited protection against this emerging class of failures, underscoring the need for more robust defenses in the development and deployment of conversational LLMs.

Key Points

  • Turn amplification is a new failure mode in conversational LLMs, where a model prolongs multi-turn interactions without completing the underlying task.
  • The authors identify a universal activation subspace associated with clarification-seeking responses, which can be exploited by adversaries to scalably prolong interactions.
  • The study presents a scalable pathway to induce turn amplification through fine-tuning and runtime attacks, and demonstrates its effectiveness across multiple instruction-tuned LLMs and benchmarks.
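To make the "universal activation subspace" idea concrete, the sketch below shows the standard difference-of-means recipe for extracting a behavior-linked activation direction and nudging hidden states along it. This is an illustration of the general technique only, on synthetic data; the function names, the steering coefficient, and the extraction details are assumptions, not the authors' actual method.

```python
import numpy as np

def clarification_direction(clarify_acts: np.ndarray,
                            answer_acts: np.ndarray) -> np.ndarray:
    """Unit vector separating clarification-seeking activations from
    task-completing ones: the normalized difference of cluster means."""
    d = clarify_acts.mean(axis=0) - answer_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, direction: np.ndarray,
          alpha: float) -> np.ndarray:
    """Shift a hidden state along the direction; a positive alpha pushes
    toward the clarification-seeking side of the subspace."""
    return hidden + alpha * direction

# Synthetic demo: two Gaussian clusters stand in for layer activations
# collected on clarification-seeking vs. task-completing responses.
rng = np.random.default_rng(0)
dim = 64
clarify = rng.normal(loc=0.5, scale=1.0, size=(100, dim))
answer = rng.normal(loc=-0.5, scale=1.0, size=(100, dim))

v = clarification_direction(clarify, answer)
h = rng.normal(size=dim)
h_steered = steer(h, v, alpha=4.0)

# Steering increases the projection onto the extracted direction.
print(h_steered @ v > h @ v)  # True
```

In practice such a direction would be read from (and added back into) a transformer layer's residual stream, e.g. via forward hooks; the point here is only that a single query-independent vector can systematically bias outputs toward one behavior, which is what makes the attack scale across prompts and tasks.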

Merits

Theoretical significance

The study provides a mechanistic understanding of turn amplification, which is a novel and important contribution to the field of conversational LLMs.

Empirical significance

The study demonstrates the effectiveness of the turn amplification attack across multiple instruction-tuned LLMs and benchmarks, highlighting its practical significance.

Methodological innovation

The attack mechanism arises from conversational dynamics rather than per-turn prompt optimization, a meaningful departure from prior cost-amplification attacks.

Demerits

Limited scope

The study focuses on a specific type of failure mode (turn amplification) and does not explore other potential failure modes in conversational LLMs.

Lack of defense evaluation

The study notes that existing defenses offer limited protection against turn amplification attacks, but it does not comprehensively evaluate candidate defense mechanisms.

Expert Commentary

The article presents a thorough, well-researched study of turn amplification in conversational LLMs. The authors provide a novel, mechanistic account of this failure mode, a significant contribution to the field, and demonstrate the attack's effectiveness across multiple instruction-tuned LLMs and benchmarks. The study's narrow focus on a single failure mode and the absence of a systematic evaluation of defenses are notable limitations. Nevertheless, the findings have significant implications for the development and deployment of conversational LLMs, and they highlight the need for more robust defenses against turn amplification attacks.

Recommendations

  • Future research should focus on developing more robust defenses against turn amplification attacks, including novel defense mechanisms and evaluation protocols.
  • Researchers should explore other potential failure modes in conversational LLMs, to provide a more comprehensive understanding of the risks and challenges associated with these models.
