From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions
arXiv:2603.13359v1 Announce Type: new Abstract: Language models are commonly fine-tuned for safety alignment to refuse harmful prompts. One approach fine-tunes them to generate categorical refusal …
Rishab Alagharu, Ishneet Sukhvinder Singh, Shaibi Shamsudeen, Zhen Wu, Ashwinee Panda
5 views