AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
arXiv:2602.22755v1 Announce Type: new Abstract: We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has …
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, Rowan Wang
7 views