Mozi: Governed Autonomy for Drug Discovery LLM Agents
arXiv:2603.03655v1 Announce Type: new Abstract: Tool-augmented large language model (LLM) agents promise to unify scientific reasoning with computation, yet their deployment in high-stakes domains like drug discovery is bottlenecked by two critical barriers: unconstrained tool-use governance and poor long-horizon reliability. In dependency-heavy pharmaceutical pipelines, autonomous agents often drift into irreproducible trajectories, where early-stage hallucinations multiplicatively compound into downstream failures. To overcome this, we present Mozi, a dual-layer architecture that bridges the flexibility of generative AI with the deterministic rigor of computational biology. Layer A (Control Plane) establishes a governed supervisor--worker hierarchy that enforces role-based tool isolation, limits execution to constrained action spaces, and drives reflection-based replanning. Layer B (Workflow Plane) operationalizes canonical drug discovery stages -- from Target Identification to Lead Optimization -- as stateful, composable skill graphs. This layer integrates strict data contracts and strategic human-in-the-loop (HITL) checkpoints to safeguard scientific validity at high-uncertainty decision boundaries. Operating on the design principle of "free-form reasoning for safe tasks, structured execution for long-horizon pipelines," Mozi provides built-in robustness mechanisms and trace-level auditability to mitigate error accumulation. We evaluate Mozi on PharmaBench, a curated benchmark for biomedical agents, demonstrating superior orchestration accuracy over existing baselines. Furthermore, through end-to-end therapeutic case studies, we demonstrate Mozi's ability to navigate massive chemical spaces, enforce stringent toxicity filters, and generate highly competitive in silico candidates, effectively transforming the LLM from a fragile conversationalist into a reliable, governed co-scientist.
Executive Summary
This article introduces Mozi, a dual-layer architecture designed to govern tool-augmented large language model (LLM) agents in high-stakes domains such as drug discovery. Mozi targets two critical barriers: unconstrained tool-use governance and poor long-horizon reliability. By establishing a governed supervisor-worker hierarchy and enforcing strict data contracts, Mozi improves the reliability and reproducibility of agent trajectories. The authors evaluate Mozi on PharmaBench, where it demonstrates superior orchestration accuracy over existing baselines, and show through end-to-end therapeutic case studies that it can generate highly competitive in silico candidates. Mozi's design principle of "free-form reasoning for safe tasks, structured execution for long-horizon pipelines" provides built-in robustness mechanisms and trace-level auditability to mitigate error accumulation. This study has significant implications for building reliable, governed LLM agents in drug discovery and beyond.
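The Control Plane's role-based tool isolation can be illustrated with a minimal sketch. This is not Mozi's actual API; all class and tool names below (`Worker`, `Supervisor`, `score_pose`, `plan_synthesis`) are hypothetical, chosen only to show the pattern of a supervisor routing tasks to workers whose action spaces are constrained to a fixed tool set.

```python
# Hypothetical sketch of role-based tool isolation in a supervisor-worker
# hierarchy, in the spirit of Mozi's Control Plane. All names are illustrative.

class Worker:
    """A worker agent that may only call tools in its allowed set."""

    def __init__(self, role, allowed_tools):
        self.role = role
        self.allowed_tools = dict(allowed_tools)  # tool name -> callable

    def call(self, tool_name, *args):
        # Constrained action space: reject any tool outside this role's set.
        if tool_name not in self.allowed_tools:
            raise PermissionError(f"{self.role} may not call {tool_name}")
        return self.allowed_tools[tool_name](*args)


class Supervisor:
    """Routes tasks to workers; workers never see each other's tools."""

    def __init__(self):
        self.workers = {}

    def register(self, worker):
        self.workers[worker.role] = worker

    def dispatch(self, role, tool_name, *args):
        return self.workers[role].call(tool_name, *args)


# Example: a docking worker cannot invoke a synthesis-planning tool.
dock = Worker("docking", {"score_pose": lambda smiles: len(smiles)})
sup = Supervisor()
sup.register(dock)
print(sup.dispatch("docking", "score_pose", "CCO"))  # prints 3
try:
    sup.dispatch("docking", "plan_synthesis", "CCO")
except PermissionError as err:
    print(err)  # the out-of-role call is refused
```

The key design choice the paper describes is that governance lives in the plumbing, not in the prompt: a worker cannot reach a tool outside its role even if its reasoning goes astray.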
Key Points
- ▸ Mozi is a dual-layer architecture designed to govern LLM agents in high-stakes domains like drug discovery.
- ▸ Mozi addresses two critical barriers: unconstrained tool-use governance and poor long-horizon reliability.
- ▸ Mozi's design principle provides built-in robustness mechanisms and trace-level auditability to mitigate error accumulation.
Merits
Strength in Addressing Critical Barriers
Mozi effectively addresses two critical barriers in the deployment of LLM agents in high-stakes domains like drug discovery, making it a significant contribution to the field.
Superior Orchestration Accuracy
Mozi demonstrates superior orchestration accuracy on PharmaBench, a curated benchmark for biomedical agents, making it a strong candidate for orchestrating drug discovery pipelines.
Robustness Mechanisms and Trace-Level Auditability
Mozi's design provides built-in robustness mechanisms and trace-level auditability that mitigate error accumulation and make agent runs reproducible and trustworthy.
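One common way to realize trace-level auditability, sketched below under assumptions of my own (the paper does not specify Mozi's trace format), is an append-only log in which each tool invocation is hash-chained to its predecessor, so a completed run can be replayed and any post-hoc tampering detected.

```python
# Illustrative hash-chained execution trace; names and format are assumptions,
# not Mozi's actual mechanism.
import hashlib
import json

class Trace:
    """Append-only record of tool calls, linked by a SHA-256 hash chain."""

    GENESIS = "0" * 64

    def __init__(self):
        self.records = []
        self._last_hash = self.GENESIS

    def log(self, step, tool, inputs, output):
        # Each record embeds the previous record's hash before being hashed.
        record = {"step": step, "tool": tool, "inputs": inputs,
                  "output": output, "prev": self._last_hash}
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        record["hash"] = digest
        self._last_hash = digest
        self.records.append(record)

    def verify(self):
        # Recompute every hash; any edited record breaks the chain.
        prev = self.GENESIS
        for r in self.records:
            body = {k: v for k, v in r.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if r["prev"] != prev or r["hash"] != expected:
                return False
            prev = r["hash"]
        return True


trace = Trace()
trace.log(1, "target_lookup", {"gene": "EGFR"}, "P00533")
trace.log(2, "dock", {"ligand": "CCO"}, "-6.2 kcal/mol")
print(trace.verify())  # prints True
```

Rewriting any logged output after the fact makes `verify()` return False, which is the property that lets an auditor trust a trajectory without re-running it.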
Demerits
Limited Generalizability
Mozi's performance on PharmaBench may not generalize to other domains or applications, limiting its broader impact.
Dependence on Human-in-the-Loop Checkpoints
Mozi's reliance on human-in-the-loop (HITL) checkpoints may introduce additional complexity and overhead, potentially hindering its adoption in high-throughput applications.
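The checkpoint pattern behind this concern can be sketched as a simple gate that pauses the pipeline at a high-uncertainty boundary until a reviewer decides, which also shows where the overhead comes from. The function name, uncertainty threshold, and reviewer callback below are illustrative assumptions, not Mozi's interface.

```python
# Hypothetical HITL checkpoint: auto-approve low-uncertainty results and
# escalate the rest to a human reviewer. Illustrative only.

def hitl_checkpoint(candidate, uncertainty, threshold, ask_human):
    """Return True if the candidate may proceed to the next pipeline stage."""
    if uncertainty <= threshold:
        return True  # "safe task": the pipeline flows on without a human
    # High-uncertainty boundary: block until the reviewer answers.
    return ask_human(candidate)


# A scripted reviewer stands in for the human.
auto = hitl_checkpoint("mol_a", uncertainty=0.1, threshold=0.5,
                       ask_human=lambda c: False)   # reviewer never consulted
escalated = hitl_checkpoint("mol_b", uncertainty=0.9, threshold=0.5,
                            ask_human=lambda c: False)
print(auto, escalated)  # prints True False
```

Every escalated call is a synchronous wait on a person, which is exactly the throughput cost the demerit above identifies; tuning the threshold trades that cost against the risk of letting uncertain decisions through unreviewed.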
Expert Commentary
Mozi represents a significant step forward in the development of reliable, governed LLM agents, particularly in high-stakes domains like drug discovery. The authors' focus on removing critical deployment barriers and building in robustness mechanisms is a valuable contribution to the field. However, whether Mozi's performance generalizes beyond PharmaBench to other domains and applications remains an open question, and its reliance on human-in-the-loop (HITL) checkpoints adds coordination overhead that could slow adoption in high-throughput settings. Nevertheless, Mozi's design principle and architecture have significant implications for trustworthy, reliable AI systems, particularly in applications where errors can have severe consequences.
Recommendations
- ✓ Further investigation into the generalizability of Mozi's performance to other domains and applications is warranted.
- ✓ Exploring alternative approaches to human-AI collaboration, such as automated decision-making or explainable AI, may help mitigate the overhead associated with HITL checkpoints.