Safety Training Persists Through Helpfulness Optimization in LLM Agents
arXiv:2603.02229v1 (Announce Type: cross)
Abstract: Safety post-training has been studied extensively in single-step "chat" settings, where safety typically refers to refusing harmful requests. We study …
Benjamin Plaut