Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing
arXiv:2603.15647v1 Abstract: Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during deployment and inference. However, real-world safety is a full-lifecycle problem: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. This motivates inference-time governance that steers behavior without costly retraining. To address this, we introduce the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus clustering mechanism: it pools data only within the intersection of utility and safety similarity graphs, effectively preventing unsafe generalization across semantically proximal but risk-divergent contexts. Our theoretical analysis yields a sublinear regret guarantee, demonstrating near-optimal performance of CCLUB. Extensive experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.
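To make the system-prompt routing idea concrete, the sketch below shows a plain LinUCB router in which each candidate system prompt is a bandit arm scored by an upper confidence bound on its expected reward for the current context. This is a minimal, hypothetical illustration of the LinUCB component only; the paper's CCLUB additionally layers consensus clustering on top of these per-arm estimators, and all names and parameters here are assumptions for exposition.

```python
import numpy as np

class LinUCBPromptRouter:
    """Minimal LinUCB sketch: each arm is a candidate system prompt.

    Hypothetical illustration of the bandit component; CCLUB's
    consensus clustering (not shown) would pool statistics across
    contexts deemed similar under both utility and safety metrics.
    """

    def __init__(self, n_prompts: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha  # exploration strength
        # Per-arm ridge-regression statistics: A = I + sum x x^T, b = sum r x.
        self.A = [np.eye(dim) for _ in range(n_prompts)]
        self.b = [np.zeros(dim) for _ in range(n_prompts)]

    def select(self, x: np.ndarray) -> int:
        """Pick the system prompt with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # point estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)    # exploration bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Fold the observed (context, reward) pair into the chosen arm."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In use, the deployment loop would embed the incoming query into a context vector, call `select` to choose a system prompt, observe a (utility- and safety-aware) reward signal, and call `update`.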
Executive Summary
This paper introduces the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing in large language models (LLMs). CCLUB employs a conservative consensus clustering mechanism to prevent unsafe generalization across semantically proximal but risk-divergent contexts. A theoretical analysis yields a sublinear regret guarantee, establishing that CCLUB is near-optimal. Experiments validate its superiority over strong baselines, with a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap. By providing inference-time governance without costly retraining, the framework lets frozen LLMs adapt to evolving jailbreak behaviors and pluralistic, time-varying safety norms.
Key Points
- ▸ CCLUB provides a unified framework for adaptive social alignment via system-prompt routing in LLMs.
- ▸ Conservative consensus clustering mechanism prevents unsafe generalization across semantically proximal but risk-divergent contexts.
- ▸ Theoretical analysis yields a sublinear regret guarantee, establishing that CCLUB achieves near-optimal performance.
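The conservative pooling rule in the second key point can be sketched as a graph intersection: two contexts may share data only if they are similar under both the utility metric and the safety metric, so pairs that look alike semantically but diverge in risk are never pooled. The function below is an illustrative reconstruction under assumed inputs (precomputed similarity matrices and thresholds `tau_u`, `tau_s`), not the paper's implementation.

```python
import numpy as np

def consensus_clusters(utility_sim: np.ndarray,
                       safety_sim: np.ndarray,
                       tau_u: float,
                       tau_s: float) -> list:
    """Cluster contexts via the INTERSECTION of two similarity graphs.

    Hypothetical sketch: an edge exists only where BOTH the utility
    graph and the safety graph agree, so data is never pooled across
    semantically proximal but risk-divergent contexts. Thresholds and
    similarity matrices are illustrative assumptions.
    """
    n = utility_sim.shape[0]
    # Keep an edge only if it survives in both graphs (intersection).
    adj = (utility_sim >= tau_u) & (safety_sim >= tau_s)
    # Connected components of the intersection graph form the clusters.
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            i = stack.pop()
            if i in comp:
                continue
            comp.add(i)
            stack.extend(j for j in range(n) if adj[i, j] and j not in comp)
        seen |= comp
        clusters.append(comp)
    return clusters
```

For example, two contexts with high utility similarity but low safety similarity end up in separate clusters, which is exactly the conservative behavior the abstract describes.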
Merits
Strength in Addressing Real-World Safety Concerns
CCLUB addresses the limitations of post-training alignment methods by providing inference-time governance without costly retraining, enabling LLMs to adapt to evolving jailbreak behaviors and pluralistic safety norms.
Improved Performance
Experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.
Demerits
Potential Computational Overhead
The consensus clustering mechanism may add computational overhead at inference time, which could hinder practical deployment in resource-constrained environments.
Assumptions of Theoretical Analysis
The sublinear regret guarantee rests on modeling assumptions, such as the linear reward structure inherent to LinUCB, that may not hold in real-world deployments, which could limit how far the theoretical results generalize.
Expert Commentary
CCLUB is a meaningful advance in language model safety and security. By unifying adaptive social alignment with system-prompt routing, it sidesteps the key limitation of post-training alignment, a static policy, and lets frozen LLMs track evolving jailbreak behaviors and pluralistic safety norms. Its strengths, improved performance and inference-time adaptability, are tempered by the potential computational overhead of the consensus clustering mechanism. Further research is needed on its practical deployability and on how well the theoretical guarantees transfer to real-world scenarios.
Recommendations
- ✓ Future research should evaluate CCLUB's deployability and generalizability in real-world settings, including the computational overhead of consensus clustering at scale.
- ✓ Policymakers should consider the implications of inference-time governance mechanisms like CCLUB for the development and deployment of LLMs.