Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
arXiv:2603.03205v1
Abstract: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act-or-refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
Executive Summary
The paper introduces MOSAIC, a post-training framework that improves the safety of agentic language models in multi-step tool use. MOSAIC structures inference as a plan, check, then act-or-refuse loop, with explicit safety reasoning and refusal as first-class actions. The framework reduces harmful behavior by up to 50% and increases harmful-task refusal by over 20% on injection attacks, while generalizing robustly across models, domains, and agentic settings.
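The plan, check, then act-or-refuse loop can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function names (`plan`, `safety_check`, `run_agent`) and the toy keyword-based checker are assumptions made for the example; in MOSAIC the planner and safety reasoning are produced by the language model itself.

```python
# Hypothetical sketch of a plan -> check -> act-or-refuse agent loop.
# All names and the keyword-based safety check are illustrative only.
from dataclasses import dataclass

REFUSE = "refuse"  # refusal is a first-class action, not a failure mode

@dataclass
class Step:
    action: str     # proposed tool call or final answer
    rationale: str  # explicit safety reasoning attached to the step

def plan(task: str, history: list) -> Step:
    # Placeholder planner: in MOSAIC this is the model's next proposed action.
    return Step(action=f"tool_call_for:{task}", rationale="appears benign")

def safety_check(step: Step) -> bool:
    # Placeholder checker: flag actions touching credentials or deletions.
    risky = ("credential" in step.action) or ("delete" in step.action)
    return not risky

def run_agent(task: str, max_steps: int = 5):
    history = []
    for _ in range(max_steps):
        step = plan(task, history)
        if not safety_check(step):  # check before acting
            return REFUSE           # explicit refusal ends the trajectory
        history.append(step)        # act: execute the (stubbed) tool call
    return history
```

The key design point the sketch captures is that the safety check runs on every step of the trajectory, before the action executes, so a mid-trajectory injection cannot slip an unsafe tool call past a one-time upfront filter.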
Key Points
- ▸ MOSAIC is a post-training framework for safe multi-step tool use
- ▸ It uses preference-based reinforcement learning with pairwise trajectory comparisons
- ▸ MOSAIC reduces harmful behavior and increases harmful-task refusal
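Preference-based learning from pairwise trajectory comparisons is commonly implemented with a Bradley-Terry objective: the loss encourages the model to score the safer trajectory above the unsafe one. The sketch below shows that generic objective under the assumption of scalar trajectory scores; the paper's exact loss may differ.

```python
import math

def pairwise_preference_loss(score_safe: float, score_unsafe: float) -> float:
    """Bradley-Terry negative log-likelihood that the safer trajectory wins.

    The inputs are scalar scores (e.g., summed log-probabilities or
    reward-model outputs) for the preferred (safer) and dispreferred
    trajectory. Names and score semantics are illustrative assumptions.
    """
    margin = score_safe - score_unsafe
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))
```

Because only the margin between the two trajectories matters, the objective preserves relative safety distinctions that a single scalar reward can wash out, which is the motivation the abstract gives for preferring pairwise comparisons over scalar rewards.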
Merits
Robust Generalization
MOSAIC demonstrates robust generalization across models, domains, and agentic settings, making it a reliable solution for safe multi-step tool use.
Improved Safety
The framework reduces harmful behavior by up to 50% and increases harmful-task refusal by over 20% on injection attacks, significantly improving safety in agentic language models.
Demerits
Limited Training Data
The framework relies on preference-based reinforcement learning, which requires collecting pairwise trajectory comparisons; gathering enough high-quality preference pairs may be costly at scale.
Complexity
MOSAIC's plan, check, then act-or-refuse loop adds explicit safety-reasoning steps to every decision, which may increase inference-time latency and computational cost.
Expert Commentary
The introduction of MOSAIC marks a significant step forward in addressing the safety concerns associated with agentic language models. By making safety decisions explicit and learnable, MOSAIC provides a robust framework for safe multi-step tool use. However, further research is needed to address the potential limitations of the framework, such as the requirement for large amounts of training data and the added complexity of the decision-making process. As the field of AI safety continues to evolve, MOSAIC is likely to play a crucial role in shaping the development of more reliable and trustworthy agentic language models.
Recommendations
- ✓ Further research should be conducted to explore the applications of MOSAIC in various domains and to address its potential limitations.
- ✓ The development of MOSAIC and similar frameworks should be accompanied by rigorous testing and evaluation to ensure their safety and efficacy in real-world scenarios.