Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing
arXiv:2603.15647v1 Abstract: Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during deployment and inference. However, real-world safety is a full-lifecycle problem: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. This motivates inference-time governance that steers behavior without costly retraining. To address this, we introduce the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus clustering mechanism: it pools data only within the intersection of utility and safety similarity graphs, effectively preventing unsafe generalization across semantically proximal but risk-divergent contexts. Our theoretical analysis yields a sublinear regret guarantee, demonstrating near-optimal performance of CCLUB. Extensive experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.
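To make the system-prompt routing idea concrete, the sketch below shows a plain LinUCB router in which each candidate system prompt is a bandit arm scored by an upper confidence bound on its expected reward for the current context. This is a minimal, hypothetical illustration of the LinUCB component only; the paper's CCLUB additionally layers consensus clustering on top of these per-arm estimators, and all names and parameters here are assumptions for exposition.

```python
import numpy as np

class LinUCBPromptRouter:
    """Minimal LinUCB sketch: each arm is a candidate system prompt.

    Hypothetical illustration of the bandit component; CCLUB's
    consensus clustering (not shown) would pool statistics across
    contexts deemed similar under both utility and safety metrics.
    """

    def __init__(self, n_prompts: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha  # exploration strength
        # Per-arm ridge-regression statistics: A = I + sum x x^T, b = sum r x.
        self.A = [np.eye(dim) for _ in range(n_prompts)]
        self.b = [np.zeros(dim) for _ in range(n_prompts)]

    def select(self, x: np.ndarray) -> int:
        """Pick the system prompt with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # point estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)    # exploration bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Fold the observed (context, reward) pair into the chosen arm."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In use, the deployment loop would embed the incoming query into a context vector, call `select` to choose a system prompt, observe a (utility- and safety-aware) reward signal, and call `update`.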
Executive Summary
This paper introduces the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing in large language models (LLMs). CCLUB employs a conservative consensus clustering mechanism to prevent unsafe generalization across semantically proximal but risk-divergent contexts. A theoretical analysis yields a sublinear regret guarantee, establishing that CCLUB is near-optimal. Experiments validate its superiority over strong baselines, with a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap. By providing inference-time governance without costly retraining, the framework lets frozen LLMs adapt to evolving jailbreak behaviors and pluralistic, time-varying safety norms.
Key Points
- ▸ CCLUB provides a unified framework for adaptive social alignment via system-prompt routing in LLMs.
- ▸ Conservative consensus clustering mechanism prevents unsafe generalization across semantically proximal but risk-divergent contexts.
- ▸ Theoretical analysis yields a sublinear regret guarantee, establishing that CCLUB achieves near-optimal performance.
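The conservative pooling rule in the second key point can be sketched as a graph intersection: two contexts may share data only if they are similar under both the utility metric and the safety metric, so pairs that look alike semantically but diverge in risk are never pooled. The function below is an illustrative reconstruction under assumed inputs (precomputed similarity matrices and thresholds `tau_u`, `tau_s`), not the paper's implementation.

```python
import numpy as np

def consensus_clusters(utility_sim: np.ndarray,
                       safety_sim: np.ndarray,
                       tau_u: float,
                       tau_s: float) -> list:
    """Cluster contexts via the INTERSECTION of two similarity graphs.

    Hypothetical sketch: an edge exists only where BOTH the utility
    graph and the safety graph agree, so data is never pooled across
    semantically proximal but risk-divergent contexts. Thresholds and
    similarity matrices are illustrative assumptions.
    """
    n = utility_sim.shape[0]
    # Keep an edge only if it survives in both graphs (intersection).
    adj = (utility_sim >= tau_u) & (safety_sim >= tau_s)
    # Connected components of the intersection graph form the clusters.
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            i = stack.pop()
            if i in comp:
                continue
            comp.add(i)
            stack.extend(j for j in range(n) if adj[i, j] and j not in comp)
        seen |= comp
        clusters.append(comp)
    return clusters
```

For example, two contexts with high utility similarity but low safety similarity end up in separate clusters, which is exactly the conservative behavior the abstract describes.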
Merits
Strength in Addressing Real-World Safety Concerns
CCLUB addresses the limitations of post-training alignment methods by providing inference-time governance without costly retraining, enabling LLMs to adapt to evolving jailbreak behaviors and pluralistic safety norms.
Improved Performance
Experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.
Demerits
Potential Computational Overhead
The consensus clustering mechanism may add computational overhead at inference time, which could hinder practical deployment in resource-constrained environments.
Assumptions of Theoretical Analysis
The sublinear regret guarantee rests on modeling assumptions, such as the linear reward structure inherent to LinUCB, that may not hold in real-world deployments, which could limit how far the theoretical results generalize.
Expert Commentary
CCLUB is a meaningful advance in language model safety and security. By unifying adaptive social alignment with system-prompt routing, it sidesteps the key limitation of post-training alignment, a static policy, and lets frozen LLMs track evolving jailbreak behaviors and pluralistic safety norms. Its strengths, improved performance and inference-time adaptability, are tempered by the potential computational overhead of the consensus clustering mechanism. Further research is needed on its practical deployability and on how well the theoretical guarantees transfer to real-world scenarios.
Recommendations
- ✓ Future research should evaluate CCLUB's deployability and generalizability in real-world settings, including the computational overhead of consensus clustering at scale.
- ✓ Policymakers should consider the implications of inference-time governance mechanisms like CCLUB for the development and deployment of LLMs.