Safety Training Persists Through Helpfulness Optimization in LLM Agents
Abstract (arXiv:2603.02229v1): Safety post-training has been studied extensively in single-step "chat" settings where safety typically refers to refusing harmful requests. We study an "agentic" (i.e., multi-step, tool-use) setting where safety refers to harmful actions directly taken by the LLM. We compare the effects of running direct preference optimization (DPO) on safety or helpfulness alone vs both metrics sequentially. As expected, training on one metric alone results in an extreme point along this frontier. However, unlike prior work, we find that safety training persists through subsequent helpfulness training. We also find that all training configurations end up near a linear Pareto frontier with $R^2 = 0.77$. Even post-training on both metrics simultaneously simply results in another point on the frontier rather than finding a "best of both worlds" strategy, despite the presence of such strategies in our DPO dataset. Overall, our findings underscore the need for better understanding of post-training dynamics.
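For context, the post-training method compared throughout is direct preference optimization. The standard DPO objective (Rafailov et al., 2023), which training of this kind builds on, is reproduced below; how the paper constructs safety- and helpfulness-labeled preference pairs $(x, y_w, y_l)$ in the agentic setting is specific to the paper and not restated here.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$

Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $y_w$ and $y_l$ are the preferred and dispreferred responses, and $\beta$ controls how far the policy may drift from the reference.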
Executive Summary
This article examines how safety training interacts with helpfulness optimization in Large Language Model (LLM) agents. The authors apply direct preference optimization (DPO) to safety and helpfulness preference data separately, sequentially, and simultaneously, in a multi-step, tool-use setting where safety concerns harmful actions taken by the agent rather than refused chat requests. Notably, safety training persists through subsequent helpfulness training, and all training configurations land near a single linear Pareto frontier ($R^2 = 0.77$); even training on both metrics at once yields just another point on that frontier rather than a "best of both worlds" policy, despite such strategies being present in the DPO dataset. These results highlight the need for a deeper understanding of post-training dynamics, particularly for high-stakes deployments where safety is paramount.
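As an illustration only, here is a minimal sketch of the per-pair DPO loss, assuming the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are already available; the paper's actual training code, hyperparameters, and agentic rollout format are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-pair DPO loss: push the policy to prefer the chosen response
    (e.g., the safer or more helpful action) over the rejected one,
    anchored to a frozen reference model via the beta term."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with hypothetical summed log-probabilities per response.
policy_chosen = torch.tensor([-12.3, -8.7])
policy_rejected = torch.tensor([-11.9, -9.5])
ref_chosen = torch.tensor([-12.8, -9.0])
ref_rejected = torch.tensor([-11.5, -9.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```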
Key Points
- ▸ Safety training in LLM agents persists through subsequent helpfulness optimization
- ▸ Direct preference optimization (DPO) on safety and helpfulness, whether applied separately, sequentially, or simultaneously, trades one metric against the other
- ▸ All training configurations land near a single linear Pareto frontier ($R^2 = 0.77$), regardless of training order; a minimal illustration of such a frontier fit follows this list
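As an illustration of the frontier analysis, the sketch below fits a line to hypothetical (helpfulness, safety) scores and computes $R^2$; the values are made up and stand in for the paper's per-configuration evaluations, which are not reproduced here.

```python
import numpy as np

# Hypothetical (helpfulness, safety) scores for several training
# configurations; the paper's actual numbers are not reproduced here.
helpfulness = np.array([0.35, 0.48, 0.55, 0.62, 0.74, 0.81])
safety      = np.array([0.92, 0.84, 0.79, 0.71, 0.60, 0.51])

# Fit a line safety = a * helpfulness + b and compute R^2, mirroring
# the style of the paper's "linear Pareto frontier" analysis.
a, b = np.polyfit(helpfulness, safety, deg=1)
pred = a * helpfulness + b
ss_res = np.sum((safety - pred) ** 2)
ss_tot = np.sum((safety - safety.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"slope={a:.2f}, intercept={b:.2f}, R^2={r2:.2f}")
```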
Merits
Robust experimental design
The multi-step, tool-use setting and the comparison of separate, sequential, and simultaneous DPO runs on safety and helpfulness isolate how post-training on one objective affects performance on the other.
Insight into post-training dynamics
The persistence of safety training and the convergence of every configuration onto the same frontier expose how little is understood about how LLMs respond to successive or competing training objectives.
Demerits
Limited generalizability
The results may not transfer to other LLM architectures or training settings, limiting how broadly the findings apply.
Lack of real-world context
The study's focus on a controlled, multi-step setting may not fully capture the complexities and nuances of real-world applications, where LLMs are often deployed in high-stakes contexts.
Expert Commentary
The persistence of safety training through helpfulness optimization is encouraging for deploying LLM agents in high-stakes applications, but the linear Pareto frontier suggests that current post-training methods trade the two objectives against each other rather than finding policies that excel at both, even when such policies are present in the preference data. This underscores the need for a deeper understanding of post-training dynamics and for regulatory frameworks that prioritize safety and accountability. As LLM agents take actions with real-world consequences, a nuanced understanding of how they adapt to successive training objectives becomes essential.
Recommendations
- ✓ Future studies should test whether these findings generalize to other LLM architectures and training settings
- ✓ Researchers should prioritize explainability and transparency techniques for LLMs to better understand how these systems make decisions and adapt to successive training objectives