Detoxifying LLMs via Representation Erasure-Based Preference Optimization
arXiv:2602.23391v1 (Announce Type: new)
Abstract: Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based …
Nazanin Mohammadi Sepahvand, Eleni Triantafillou, Hugo Larochelle, Doina Precup, Daniel M. Roy, Gintare Karolina Dziugaite