Why Is RLHF Alignment Shallow? A Gradient Analysis
arXiv:2603.04851v1
Abstract: Why is safety alignment in LLMs shallow? We prove that gradient-based alignment inherently concentrates on positions where harm is decided …
Robin Young