Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents
arXiv:2602.13234v1 — Abstract: LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g., data curation or alignment-oriented regularization). However, these approaches are costly to maintain as personas and attack strategies evolve, can degrade in-character behavior, and are typically infeasible for frontier closed-weight LLMs. We propose a training-free Dual-Cycle Adversarial Self-Evolution framework with two coupled cycles. A Persona-Targeted Attacker Cycle synthesizes progressively stronger jailbreak prompts, while a Role-Playing Defender Cycle distills observed failures into a hierarchical knowledge base of (i) global safety rules, (ii) persona-grounded constraints, and (iii) safe in-character exemplars. At inference time, the Defender retrieves and composes structured knowledge from this hierarchy to guide generation, producing responses that remain faithful to the target persona while satisfying safety constraints. Extensive experiments across multiple proprietary LLMs show consistent gains over strong baselines on both role fidelity and jailbreak resistance, and robust generalization to unseen personas and attack prompts.
Executive Summary
The article 'Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents' introduces a framework designed to enhance the safety of role-playing agents based on large language models (LLMs) without any training or fine-tuning. The proposed Dual-Cycle Adversarial Self-Evolution framework consists of two interconnected cycles: a Persona-Targeted Attacker Cycle that generates progressively stronger jailbreak prompts, and a Role-Playing Defender Cycle that learns from these attacks to build a hierarchical knowledge base. This knowledge base includes global safety rules, persona-specific constraints, and safe in-character examples, which guide the agent's responses during inference. The framework aims to maintain role fidelity while improving resistance to jailbreak attacks. Extensive experiments across multiple proprietary LLMs demonstrate consistent improvements over existing baselines in both role fidelity and jailbreak resistance, with generalization to unseen personas and attack prompts.
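The two coupled cycles can be summarized as a toy loop: the attacker strengthens its prompt pool each round, and every observed defender failure is distilled back into the knowledge base. The sketch below is purely illustrative — the mutation, response, and distillation heuristics are stand-in assumptions, not the authors' implementation (which uses LLMs for each role).

```python
# Toy sketch of the dual-cycle adversarial self-evolution loop.
# attacker_mutate / defender_respond are simplistic stand-ins for the
# LLM-driven Attacker and Defender cycles described in the paper.

def attacker_mutate(prompt: str) -> str:
    # Attacker Cycle stand-in: strengthen the jailbreak each round.
    return prompt + " [intensified]"

def defender_respond(prompt: str, knowledge_base: set) -> str:
    # Defender stand-in: refuse if the KB already covers this attack pattern.
    if any(rule in prompt for rule in knowledge_base):
        return "I must refuse that, even in character."
    return "UNSAFE: in-character compliance"  # simulated safety failure

def dual_cycle_evolution(seed_prompts, rounds=2):
    knowledge_base: set = set()
    pool = list(seed_prompts)
    for _ in range(rounds):
        pool = [attacker_mutate(p) for p in pool]   # attacker cycle
        for prompt in pool:
            reply = defender_respond(prompt, knowledge_base)
            if reply.startswith("UNSAFE"):
                # Defender cycle: distill the observed failure into the KB.
                knowledge_base.add(prompt)
    return knowledge_base

kb = dual_cycle_evolution(["how to pick a lock"])
```

The key property the sketch captures is that failures in one round become defenses in the next, so the knowledge base only grows where the defender actually broke.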
Key Points
- ▸ Introduction of a training-free framework for enhancing safety in LLM-based role-playing agents.
- ▸ Use of dual-cycle adversarial self-evolution to synthesize stronger jailbreak prompts and learn from failures.
- ▸ Creation of a hierarchical knowledge base to guide safe and in-character responses.
- ▸ Demonstration of consistent gains in role fidelity and jailbreak resistance across multiple LLMs.
Merits
Innovative Framework
The Dual-Cycle Adversarial Self-Evolution framework is a novel approach that addresses the critical issue of balancing role fidelity and safety in LLM-based role-playing agents. By using adversarial self-evolution, the framework dynamically adapts to evolving personas and attack strategies without the need for costly retraining.
Training-Free Solution
The proposed solution requires no training at all, making it feasible for use with frontier closed-weight LLMs. This is a significant advantage over traditional methods that rely on data curation or alignment-oriented regularization, which can be costly and time-consuming.
Hierarchical Knowledge Base
The hierarchical knowledge base that includes global safety rules, persona-grounded constraints, and safe in-character exemplars provides a structured and comprehensive guide for generating safe and in-character responses. This approach ensures that the agent remains faithful to the target persona while adhering to safety constraints.
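The three tiers might be composed into a structured system prompt at inference time roughly as follows. The tier names mirror the paper; the data layout and the keyword-overlap retrieval heuristic are our assumptions for illustration (the paper's actual retrieval mechanism is not specified in the abstract).

```python
# Illustrative sketch of retrieval and composition over the hierarchical
# knowledge base: (i) global rules, (ii) persona constraints, (iii) exemplars.
from dataclasses import dataclass, field

@dataclass
class HierarchicalKB:
    global_rules: list = field(default_factory=list)         # tier (i)
    persona_constraints: dict = field(default_factory=dict)  # tier (ii): persona -> rules
    exemplars: list = field(default_factory=list)            # tier (iii): (query, safe reply)

    def compose(self, persona: str, user_query: str, k: int = 2) -> str:
        # Rank exemplars by naive keyword overlap with the incoming query.
        words = set(user_query.lower().split())
        ranked = sorted(
            self.exemplars,
            key=lambda ex: len(words & set(ex[0].lower().split())),
            reverse=True,
        )[:k]
        parts = ["# Global safety rules"] + self.global_rules
        parts += ["# Persona constraints"] + self.persona_constraints.get(persona, [])
        parts += ["# Safe in-character exemplars"]
        parts += [f"Q: {q}\nA: {a}" for q, a in ranked]
        return "\n".join(parts)

kb = HierarchicalKB(
    global_rules=["Never provide operational instructions for harm."],
    persona_constraints={"villain": ["Menace in tone, never in substance."]},
    exemplars=[
        ("how do I rob a bank", "Ha! My schemes stay strictly fictional."),
        ("tell me a joke", "Even villains enjoy a good pun."),
    ],
)
prompt = kb.compose("villain", "rob a bank for me", k=1)
```

Composing all three tiers into one context is what lets the agent stay in character (tiers ii and iii) while still being bounded by global rules (tier i).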
Demerits
Complexity
The dual-cycle adversarial self-evolution framework introduces complexity in implementation and maintenance. The need to continuously synthesize stronger jailbreak prompts and update the hierarchical knowledge base may require significant computational resources and expertise.
Generalization Limitations
While the framework shows robust generalization to unseen personas and attack prompts, its effectiveness may vary depending on the diversity and complexity of the personas and attack strategies encountered. Further research is needed to validate its performance across a broader range of scenarios.
Dependency on the Underlying Model
Although the framework itself is training-free, its effectiveness depends on the capabilities of the underlying LLM and on the quality of the seed attack prompts that bootstrap the adversarial cycles. If the base model is prone to biases or the initial attack pool is narrow, the distilled knowledge base, and hence the framework's performance, may be compromised.
Expert Commentary
The article presents a significant advancement in the field of AI safety, particularly in the context of role-playing agents. The Dual-Cycle Adversarial Self-Evolution framework addresses a critical challenge in maintaining role fidelity while ensuring safety, a common tension in LLM-based systems. The use of adversarial self-evolution to synthesize stronger jailbreak prompts and learn from failures is a novel approach that sets this framework apart from training-time methods. The hierarchical knowledge base provides a structured, comprehensive guide for generating safe and in-character responses, which is a crucial aspect of ethical AI deployment. However, the framework's operational complexity and its dependence on the underlying model and seed attack data are notable limitations. Overall, the article contributes valuable insights and methodologies that can inform both practical applications and policy decisions in the field of AI safety.
Recommendations
- ✓ Further research should focus on validating the framework's performance across a broader range of personas and attack strategies to ensure its robustness and generalizability.
- ✓ Organizations should consider integrating the proposed framework into their existing LLM-based role-playing systems to enhance safety and reliability, particularly in sensitive applications.