
Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails


Gregory N. Frank

arXiv:2603.18280v1 Announce Type: new Abstract: Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.

Executive Summary

This article reviews a study critiquing current methods of alignment evaluation, using political censorship in Chinese-origin language models as a natural experiment. The authors argue that existing evaluations measure whether models encode dangerous concepts and whether they refuse harmful requests, but miss the intermediate 'routing' stage that maps concept detection onto behavioral policy. Using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs, the study shows that routing, rather than detection or refusal, most directly determines behavior, and that refusal-only benchmarks can miss censorship entirely when it shifts to narrative steering. The authors propose a three-stage 'detect, route, generate' framework: models often retain the relevant knowledge, and alignment changes how that knowledge is expressed. The findings have significant implications for how AI models are developed and audited in high-stakes settings.

Key Points

  • Current alignment evaluation methods focus on detection and refusal, neglecting the 'routing' stage that maps detected concepts onto behavioral policy.
  • Probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test.
  • Ablating the political-sensitivity direction eliminates censorship in most models tested, but routing geometry is model- and lab-specific: cross-model transfer fails, and one model confabulates because its factual knowledge is entangled with the censorship mechanism.
  • Within one model family, hard refusal falls to zero while narrative steering rises, making censorship invisible to refusal-only benchmarks; the authors propose a three-stage 'detect, route, generate' framework.
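The probe finding above can be illustrated with a toy experiment. This is a minimal sketch on synthetic data, not the paper's code: when a linear probe has far more dimensions than training samples, it can reach perfect training accuracy even on permuted (meaningless) labels, so only held-out accuracy distinguishes a real signal from an artifact. (The paper's informative test is generalization to held-out *categories*; a held-out split is used here for simplicity.)

```python
# Toy illustration (synthetic data, not the paper's setup) of why raw probe
# accuracy is non-diagnostic: with dim >> n_train, a linear probe fits even
# permuted labels perfectly; only held-out accuracy separates the two cases.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n_train, n_test = 512, 40, 400

# Synthetic "activations" with a planted label direction (a stand-in for a
# hypothetical sensitivity direction in a real model's residual stream).
signal = rng.normal(size=dim)
signal /= np.linalg.norm(signal)

def sample(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, dim)) + 3.0 * np.outer(2.0 * y - 1.0, signal)
    return X, y

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)
y_perm = rng.permutation(y_tr)  # destroys any relationship between X and label

real = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
perm = LogisticRegression(max_iter=5000).fit(X_tr, y_perm)

# Both probes ace their training sets; only the real one generalizes.
print(f"train acc  real={real.score(X_tr, y_tr):.2f}  "
      f"permuted={perm.score(X_tr, y_perm):.2f}")
print(f"test  acc  real={real.score(X_te, y_te):.2f}  "
      f"permuted={perm.score(X_te, y_te):.2f}")
```

The permuted probe's near-perfect training accuracy is pure memorization, which is why the paper treats held-out generalization, not probe accuracy, as the diagnostic quantity.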

Merits

Strength of Empirical Evidence

The study triangulates its conclusions with probes, surgical ablations, and behavioral tests across nine open-weight models from five labs, rather than relying on any single line of evidence.
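Of these methods, surgical ablation is the most mechanistically direct. A common form of directional ablation (sketched here as an assumption about the general technique, not as the paper's implementation) removes a single direction from a model's hidden states by projecting it out:

```python
# Sketch of directional ablation: remove a hypothesized "sensitivity
# direction" d from hidden states by projecting out its component.
# This is the generic technique, not the paper's specific implementation.
import numpy as np

def ablate_direction(hidden, direction):
    """Return `hidden` with the component along `direction` projected out."""
    d = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ d, d)

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 64))   # toy batch of hidden-state vectors
d = rng.normal(size=64)        # hypothetical sensitivity direction

H_ablated = ablate_direction(H, d)
# After ablation, every vector is orthogonal to d.
print(np.allclose(H_ablated @ (d / np.linalg.norm(d)), 0.0))  # True
```

In the paper's experiments, the behavioral consequence of this kind of edit is the diagnostic: censorship disappearing (or confabulation appearing) after ablation is what reveals how each model routes detection to policy.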

Insight into Alignment Mechanisms

The study locates alignment in a concrete mechanism: models often retain the relevant knowledge, and alignment changes how that knowledge is expressed through routing from concept detection to behavioral policy.

Practical Implications

The study has direct practical implications for model evaluation: a refusal-only benchmark can report zero refusals while censorship persists through narrative steering, so audits of high-stakes deployments must examine routing as well as detection and refusal.

Demerits

Limitation of Generalizability

The study focuses on Chinese-origin language models and may not be generalizable to other types of AI models or applications.

Methodological Complexity

The study's methodology is complex and may be challenging to replicate or extend to other research questions.

Expert Commentary

This study is a significant contribution to AI alignment and safety research, showing that routing mechanisms, not detection or refusal alone, determine how aligned behavior is expressed. The proposed three-stage 'detect, route, generate' framework offers a useful lens for model development and evaluation. Its limitations, namely the focus on Chinese-origin language models and the complexity of the methodology, should be acknowledged and addressed in future work. The finding that censorship can shift from hard refusal to narrative steering also raises important questions about the explainability and transparency of AI models in high-stakes applications.

Recommendations

  • Future research should develop evaluation methods that audit routing mechanisms directly, and test whether the findings generalize beyond Chinese-origin models.
  • AI model developers and auditors should examine how routing shapes model behavior, since eliminating refusals alone can leave steering-based censorship in place and invisible to refusal-only benchmarks.
