Semantic Containment as a Fundamental Property of Emergent Misalignment
arXiv:2603.04407v1
Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM): behavioral failures that extend far beyond the training distribution. Recent …
Rohan Saxena