Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms
arXiv:2604.00012v1 Announce Type: cross Abstract: Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific …
Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang