From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions
arXiv:2603.13359v1
Abstract: Language models are commonly fine-tuned for safety alignment to refuse harmful prompts. One approach fine-tunes them to generate categorical refusal …