Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
arXiv:2602.17546v1
Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and …
Jyotin Goel, Souvik Maji, Pratik Mazumder