Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
arXiv:2603.07445v1 (Announce Type: new)
Abstract: Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift …
Guoli Wang, Haonan Shi, Tu Ouyang, An Wang