Safety Training Persists Through Helpfulness Optimization in LLM Agents
Abstract (arXiv:2603.02229v1): Safety post-training has been studied extensively in single-step "chat" settings where safety typically refers to refusing harmful requests. We study an "agentic" (i.e., multi-step, tool-use) setting where safety refers to harmful actions directly taken by the LLM. We compare the effects of running direct preference optimization (DPO) on safety or helpfulness alone vs both metrics sequentially. As expected, training on one metric alone results in an extreme point along this frontier. However, unlike prior work, we find that safety training persists through subsequent helpfulness training. We also find that all training configurations end up near a linear Pareto frontier with $R^2 = 0.77$. Even post-training on both metrics simultaneously simply results in another point on the frontier rather than finding a "best of both worlds" strategy, despite the presence of such strategies in our DPO dataset. Overall, our findings underscore the need for better understanding of post-training dynamics.
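For context, the post-training method compared throughout is direct preference optimization. The standard DPO objective (Rafailov et al., 2023), which training of this kind builds on, is reproduced below; how the paper constructs safety- and helpfulness-labeled preference pairs $(x, y_w, y_l)$ in the agentic setting is specific to the paper and not restated here.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$

Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $y_w$ and $y_l$ are the preferred and dispreferred responses, and $\beta$ controls how far the policy may drift from the reference.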
Executive Summary
This article examines how safety training interacts with helpfulness optimization in Large Language Model (LLM) agents. The authors apply direct preference optimization (DPO) to safety and helpfulness preference data separately, sequentially, and simultaneously, in a multi-step, tool-use setting where safety concerns harmful actions taken by the agent rather than refused chat requests. Notably, safety training persists through subsequent helpfulness training, and all training configurations land near a single linear Pareto frontier ($R^2 = 0.77$); even training on both metrics at once yields just another point on that frontier rather than a "best of both worlds" policy, despite such strategies being present in the DPO dataset. These results highlight the need for a deeper understanding of post-training dynamics, particularly for high-stakes deployments where safety is paramount.
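As an illustration only, here is a minimal sketch of the per-pair DPO loss, assuming the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are already available; the paper's actual training code, hyperparameters, and agentic rollout format are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-pair DPO loss: push the policy to prefer the chosen response
    (e.g., the safer or more helpful action) over the rejected one,
    anchored to a frozen reference model via the beta term."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with hypothetical summed log-probabilities per response.
policy_chosen = torch.tensor([-12.3, -8.7])
policy_rejected = torch.tensor([-11.9, -9.5])
ref_chosen = torch.tensor([-12.8, -9.0])
ref_rejected = torch.tensor([-11.5, -9.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```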
Key Points
- ▸ Safety training in LLM agents persists through subsequent helpfulness optimization
- ▸ Direct preference optimization (DPO) on safety and helpfulness, whether applied separately, sequentially, or simultaneously, trades one metric against the other
- ▸ All training configurations land near a single linear Pareto frontier ($R^2 = 0.77$), regardless of training order; a minimal illustration of such a frontier fit follows this list
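As an illustration of the frontier analysis, the sketch below fits a line to hypothetical (helpfulness, safety) scores and computes $R^2$; the values are made up and stand in for the paper's per-configuration evaluations, which are not reproduced here.

```python
import numpy as np

# Hypothetical (helpfulness, safety) scores for several training
# configurations; the paper's actual numbers are not reproduced here.
helpfulness = np.array([0.35, 0.48, 0.55, 0.62, 0.74, 0.81])
safety      = np.array([0.92, 0.84, 0.79, 0.71, 0.60, 0.51])

# Fit a line safety = a * helpfulness + b and compute R^2, mirroring
# the style of the paper's "linear Pareto frontier" analysis.
a, b = np.polyfit(helpfulness, safety, deg=1)
pred = a * helpfulness + b
ss_res = np.sum((safety - pred) ** 2)
ss_tot = np.sum((safety - safety.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"slope={a:.2f}, intercept={b:.2f}, R^2={r2:.2f}")
```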
Merits
Robust experimental design
The multi-step, tool-use setting and the comparison of separate, sequential, and simultaneous DPO runs on safety and helpfulness isolate how post-training on one objective affects performance on the other.
Insight into post-training dynamics
The persistence of safety training and the convergence of every configuration onto the same frontier expose how little is understood about how LLMs respond to successive or competing training objectives.
Demerits
Limited generalizability
The results may not transfer to other LLM architectures or training settings, limiting how broadly the findings apply.
Lack of real-world context
The study's focus on a controlled, multi-step setting may not fully capture the complexities and nuances of real-world applications, where LLMs are often deployed in high-stakes contexts.
Expert Commentary
The persistence of safety training through helpfulness optimization is encouraging for deploying LLM agents in high-stakes applications, but the linear Pareto frontier suggests that current post-training methods trade the two objectives against each other rather than finding policies that excel at both, even when such policies are present in the preference data. This underscores the need for a deeper understanding of post-training dynamics and for regulatory frameworks that prioritize safety and accountability. As LLM agents take actions with real-world consequences, a nuanced understanding of how they adapt to successive training objectives becomes essential.
Recommendations
- ✓ Future studies should test whether these findings generalize to other LLM architectures and training settings
- ✓ Researchers should prioritize explainability and transparency techniques for LLMs to better understand how these systems make decisions and adapt to successive training objectives