Abstractive Red-Teaming of Language Model Character
arXiv:2602.12318v1
Abstract: We want language model assistants to conform to a character specification, which asserts how the model should act across diverse …
Nate Rahn, Allison Qi, Avery Griffin, Jonathan Michala, Henry Sleight, Erik Jones