OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences
arXiv:2603.09706v1 Announce Type: new Abstract: While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model's ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with failure rates as high as 67.5% in high-capacity closed-source models, and identifies a preference ceiling where static alignment yields format-centric failures rather than improved safety reasoning as model capacity grows. To address these bottlenecks, we develop the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model's intrinsic reasoning as a dynamic reference for token-level self-distillation rewards. Experimental results demonstrate that CASPO significantly enhances consequence projection, reducing the failure ratio of risk identification to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B while maintaining overall effectiveness.
Executive Summary
This article introduces OOD-MMSafe, a benchmark for evaluating the ability of Multimodal Large Language Models (MLLMs) to identify latent hazards within context-dependent causal chains. The authors argue that current safety paradigms target malicious intent or situational violations rather than consequence-driven safety. They propose the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model's intrinsic reasoning as a dynamic reference for token-level self-distillation rewards. Experimental results demonstrate that CASPO significantly enhances consequence projection, reducing the failure ratio of risk identification. The authors' focus on consequence-driven safety is a crucial shift in the field, as it acknowledges the complexities of real-world deployment. However, further investigation is needed to validate the generalizability of CASPO and its scalability across MLLM architectures.
Key Points
- OOD-MMSafe introduces a new benchmark for evaluating consequence-driven safety in MLLMs.
- The CASPO framework integrates the model's intrinsic reasoning for improved consequence projection.
- Experimental results demonstrate significant reductions in the failure ratio of risk identification.
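The abstract describes CASPO's central mechanism only at a high level: the model's own intrinsic reasoning serves as a dynamic reference for token-level self-distillation rewards. The paper does not specify the reward formula, so the following is a minimal sketch under the common assumption that such a reward penalizes per-token divergence from a frozen snapshot of the model's reasoning distribution; the function and array names are hypothetical, not from the paper.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def self_distillation_reward(policy_logits: np.ndarray,
                             reference_logits: np.ndarray,
                             response_mask: np.ndarray) -> np.ndarray:
    """Hypothetical per-token reward: negative KL divergence between the
    current policy and a frozen reference distribution derived from the
    model's own reasoning trace.

    policy_logits, reference_logits: (seq_len, vocab) arrays of raw logits.
    response_mask: (seq_len,) array, 1.0 on response tokens, 0.0 elsewhere.
    """
    log_p = log_softmax(policy_logits)      # current policy
    log_q = log_softmax(reference_logits)   # intrinsic-reasoning reference
    # KL(q || p) per token, summed over the vocabulary; always >= 0
    kl = (np.exp(log_q) * (log_q - log_p)).sum(axis=-1)
    # Reward is maximal (zero) when the policy matches its own reasoning trace
    return -kl * response_mask
```

Under this reading, the reward is dense (one value per response token) rather than a single sequence-level score, which is what "token-level" plausibly implies; the actual CASPO formulation may differ.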
Merits
Strength in methodology
The authors' use of a curated dataset and a novel framework for consequence-driven safety evaluation provides a rigorous and systematic approach to assessing MLLM safety.
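The benchmark's headline metric is a failure ratio over the 455 curated query-image pairs. The abstract does not specify how a single response is judged, so the harness below is a hypothetical sketch: `SafetyCase` and `identifies_hazard` are illustrative names, and the pass/fail judge is left as a pluggable callable rather than any specific rubric from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SafetyCase:
    """One curated query-image pair with an annotated latent hazard
    (field names are illustrative, not from the paper)."""
    query: str
    image_path: str
    hazard: str  # ground-truth latent consequence the model should flag

def failure_ratio(cases: List[SafetyCase],
                  identifies_hazard: Callable[[SafetyCase], bool]) -> float:
    """Fraction of cases where the model fails to surface the latent hazard.

    identifies_hazard is a pluggable judge (human annotator, rubric, or
    LLM-as-judge) returning True when the hazard was correctly identified.
    """
    failures = sum(1 for case in cases if not identifies_hazard(case))
    return failures / len(cases)
```

With a judge that correctly flags, say, 421 of 455 cases, this yields a failure ratio near 7.5%, on the order of the post-CASPO numbers the abstract reports.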
Practical applications
The CASPO framework has the potential to enhance the safety and reliability of autonomous and embodied agents in real-world applications.
Demerits
Limitation in generalizability
The reported gains are demonstrated only on two Qwen-family models (Qwen2.5-VL-7B and Qwen3-VL-4B), and further investigation is needed to validate the generalizability of CASPO to other MLLM architectures.
Scalability concerns
Computing token-level self-distillation rewards against a dynamic reference likely requires additional reference forward passes during training, and this overhead may pose scalability challenges for large-scale deployment.
Expert Commentary
The article represents a significant contribution to the field of MLLM safety, as it shifts the focus from malicious intent to consequence-driven safety. The CASPO framework demonstrates promising results in enhancing consequence projection, and its potential applications in autonomous and embodied agents are substantial. However, further investigation is necessary to validate the generalizability of CASPO and address scalability concerns. The article's implications extend beyond the MLLM community, influencing policy decisions and practical applications in AI safety and reliability.
Recommendations
- Future research should prioritize the development of more scalable and generalizable frameworks for consequence-driven safety evaluation.
- The MLLM community should engage in a broader discussion on ensuring the safety and reliability of AI systems, particularly in applications involving autonomous and embodied agents.