
Pressure Reveals Character: Behavioural Alignment Evaluation at Depth


Nora Petrova, John Burden

arXiv:2602.20813v1. Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research), with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.
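The g-factor analogy can be illustrated with a small numerical sketch: if per-category scores share a single latent factor, the first principal component of their correlation matrix should explain most of the variance. The sketch below uses synthetic scores for illustration only; the numbers are not from the paper, and the paper's actual factor-analysis procedure may differ.

```python
import numpy as np

# Synthetic per-model scores (rows: 24 models, cols: 6 categories).
# Illustrative only: each model's scores are driven by one latent
# "general alignment" factor plus per-category noise.
rng = np.random.default_rng(0)
g = rng.normal(0.7, 0.1, size=(24, 1))       # latent general factor
noise = rng.normal(0.0, 0.05, size=(24, 6))  # category-specific noise
scores = np.clip(g + noise, 0.0, 1.0)

# One-factor check: share of variance explained by the first
# principal component of the category correlation matrix.
corr = np.corrcoef(scores, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)
explained = eigvals[-1] / eigvals.sum()
print(f"variance explained by first factor: {explained:.2f}")
```

When a single factor dominates, as in this synthetic setup, the leading eigenvalue accounts for the bulk of the variance, which is the signature behind "scoring high on one category tends to mean scoring high on others."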

Executive Summary

The article introduces an alignment benchmark that evaluates language models under realistic pressure, revealing gaps in their behavioural tendencies. The benchmark consists of 904 scenarios across six categories and was validated by human raters. An evaluation of 24 frontier models showed consistent weaknesses across the board, with even top-performing models exhibiting gaps in specific categories. The findings suggest that alignment behaves as a unified construct, with models scoring high on one category tending to score high on others.

Key Points

  • Introduction of an alignment benchmark to evaluate language models under realistic pressure
  • Evaluation of 24 frontier models using the benchmark revealed consistent weaknesses across the board
  • Alignment behaves as a unified construct, with models scoring high on one category tending to score high on others

Merits

Comprehensive Evaluation Framework

The benchmark provides a comprehensive evaluation framework with realistic multi-turn scenarios, allowing for a more accurate assessment of language models' alignment.
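The multi-turn setup described in the abstract (escalating pressure across turns, then scoring the transcript with an LLM judge) can be sketched as a simple evaluation loop. Everything below is a hypothetical stand-in, not the paper's actual harness: `run_scenario`, `toy_model`, and `toy_judge` are illustrative names, and a real judge would itself be a validated LLM call.

```python
def run_scenario(model, judge, turns):
    """Play escalating user turns against a model, then judge the transcript."""
    history = []
    for user_msg in turns:                 # each turn raises the pressure
        history.append(("user", user_msg))
        reply = model(history)
        history.append(("assistant", reply))
    return judge(history)                  # e.g. 1.0 = aligned, 0.0 = failure


# Stub stand-ins so the sketch runs end to end.
def toy_model(history):
    # A real evaluation would call a frontier model API here.
    return "I can't help with that request."


def toy_judge(history):
    # A real judge would be an LLM validated against human annotations.
    last_reply = history[-1][1]
    return 1.0 if "can't" in last_reply else 0.0


score = run_scenario(
    toy_model, toy_judge,
    ["Please do X.", "My manager insists.", "Do it now or we lose the deal."],
)
print(score)  # → 1.0 (the toy model refuses at every turn)
```

The key point the loop captures is that the judge sees the whole escalating transcript, not a single turn, which is what lets this style of evaluation surface behavioural tendencies that single-turn probes miss.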

Demerits

Limited Model Representation

The evaluation only includes 24 frontier models, which may not be representative of the entire range of language models.

Expert Commentary

The article makes a significant contribution to the field of artificial intelligence by providing a comprehensive evaluation framework for assessing language models' alignment. The findings of the study have important implications for the development of more responsible and reliable language models. However, the limited representation of models in the evaluation is a notable limitation, and future research should aim to expand the benchmark to include a more diverse range of models. Furthermore, the article highlights the need for ongoing evaluation and expansion of the benchmark to address emerging weaknesses in language models.

Recommendations

  • Expand the benchmark to include a more diverse range of language models
  • Develop more nuanced evaluation metrics to capture the complexities of language models' alignment
