MOSAIC: Composable Safety Alignment with Modular Control Tokens

arXiv:2603.16210v1 Announce Type: new Abstract: Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.
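The abstract's core mechanism can be illustrated with a minimal sketch. The paper does not publish an implementation here, so the constraint names and the prefix-composition scheme below are assumptions: each safety constraint owns a learnable embedding vector, the backbone stays frozen, and active constraints are composed by prepending their vectors to the input embeddings at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative only)

# Input sequence embeddings produced by the frozen backbone (5 tokens here).
input_embeds = rng.normal(size=(5, D))

# One learnable control-token vector per safety constraint.
# Constraint names are hypothetical, not from the paper.
control_tokens = {
    "no_medical_advice": rng.normal(size=(1, D)),
    "region_specific_rules": rng.normal(size=(1, D)),
}

def compose(active, input_embeds):
    """Prepend the embeddings of the active constraints to the input.

    Activation is just selection: constraints can be toggled per request
    without touching the frozen backbone weights.
    """
    prefix = [control_tokens[name] for name in active]
    return np.concatenate(prefix + [input_embeds], axis=0)

# Activate both constraints: 2 control tokens + 5 input tokens = 7 rows.
seq = compose(["no_medical_advice", "region_specific_rules"], input_embeds)
print(seq.shape)
```

Because composition is concatenation of independently trained vectors, new constraints can be added or dropped at inference time without retraining the others, which is the modularity the paper claims.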

Executive Summary

The paper introduces MOSAIC, a modular framework for compositional safety alignment in large language models. MOSAIC encodes context-dependent safety rules as learnable control tokens, which can be activated and composed at inference time. The framework achieves strong defense performance while substantially reducing over-refusal and preserving model utility. This addresses the limitations of existing methods: parameter-level alignment entangles safety behavior with general capabilities, and prompt-based instructions provide only weak enforcement. By optimizing control tokens over a frozen backbone model, MOSAIC offers a promising solution for real-world deployments of large language models.

Key Points

  • MOSAIC is a modular framework for compositional safety alignment
  • Learnable control tokens enable flexible and context-dependent safety rules
  • The framework achieves strong defense performance with lower over-refusal

Merits

Modularity and Flexibility

MOSAIC's modular design allows for easy integration and composition of safety constraints, making it a versatile solution for various applications and deployments.

Demerits

Computational Complexity

The introduction of learnable control tokens and order-based task sampling may increase the computational complexity of the framework, potentially impacting its scalability and efficiency.

Expert Commentary

The MOSAIC framework represents a significant advancement in the field of AI safety and alignment. By providing a modular and flexible solution for compositional safety alignment, MOSAIC addresses the limitations of existing approaches and offers a promising path forward for real-world deployments of large language models. However, further research is needed to fully explore the potential of MOSAIC and address the challenges associated with its implementation, such as computational complexity and explainability.

Recommendations

  • Further research on the scalability and efficiency of MOSAIC
  • Investigation into the explainability and transparency of the safety alignment process
