Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
arXiv:2603.05900v1 Announce Type: new Abstract: Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model's lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy's intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate $\times$ Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at https://github.com/tmlr-group/RePO.
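The reward sparsity described in the abstract is easy to see in a deliberately simplified sketch: a candidate earns credit only when it both satisfies the requested property and stays within the similarity constraint, so most random edits score zero. The function name, inputs, and threshold below are illustrative assumptions, not the paper's implementation.

```python
def verifiable_reward(satisfies_property: bool, similarity: float,
                      sim_threshold: float = 0.4) -> float:
    """Binary verifiable reward: 1.0 only when the candidate molecule
    satisfies the requested property AND remains within the similarity
    budget relative to the input molecule; 0.0 otherwise.

    Because both conditions must hold at once, most sampled edits
    score 0.0 -- the sparse-feedback problem the abstract describes."""
    return 1.0 if (satisfies_property and similarity >= sim_threshold) else 0.0
```

For example, a candidate that satisfies the property but drifts to similarity 0.2 still receives zero reward, which is why pure RLVR learns slowly without effective exploration.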
Executive Summary
This article presents Reference-guided Policy Optimization (RePO), an approach that applies large language model (LLM) reasoning to instruction-based molecular optimization. RePO targets the setting where each data point provides only a single optimized reference molecule and no step-by-step optimization trajectory, combining exploration and exploitation in one update: it samples candidate molecules together with their intermediate reasoning trajectories, trains them against verifiable rewards that measure property satisfaction under similarity constraints, and adds reference guidance by supervising only the answer while keeping the policy's own reasoning as context. On molecular optimization benchmarks, RePO outperforms supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) baselines, improves the balance across competing objectives, and generalizes better to unseen instruction styles. The publicly available code and consistent results make RePO a valuable contribution to the field.
Key Points
- ▸ RePO introduces a hybrid learning strategy that balances exploration and exploitation.
- ▸ RePO samples candidate molecules with intermediate reasoning trajectories to train the model using verifiable rewards and reference guidance.
- ▸ RePO outperforms traditional supervised fine-tuning and reinforcement learning with verifiable rewards on molecular optimization benchmarks.
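The hybrid update in the key points above can be sketched numerically. The toy function below combines a policy-gradient surrogate on the sampled trajectory with a supervised term on the reference answer alone; the function names, scalar inputs, and `guidance_weight` are illustrative assumptions, not the paper's actual loss.

```python
def repo_loss(logp_trajectory: float, advantage: float,
              nll_reference_answer: float,
              guidance_weight: float = 0.5) -> float:
    """One-sample sketch of a RePO-style objective.

    rl_term: policy-gradient surrogate on the sampled reasoning
        trajectory, scaled by an advantage derived from the verifiable
        reward -- this drives exploration of new molecules.
    guidance_term: supervised NLL of the reference answer only, with the
        policy's own intermediate reasoning kept as fixed context -- this
        grounds outputs to the reference when rewards are sparse."""
    rl_term = -advantage * logp_trajectory
    guidance_term = guidance_weight * nll_reference_answer
    return rl_term + guidance_term
```

With `advantage=1.0`, `logp_trajectory=-2.0`, and `nll_reference_answer=3.0`, the loss is `2.0 + 0.5 * 3.0 = 3.5`; when the advantage is zero (all rewards sparse), only the guidance term remains, which is how the supervised signal keeps training moving.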
Merits
Addressing the Challenges of Instruction-based Molecular Optimization
RePO tackles the core challenges of instruction-based molecular optimization: each data point supplies only a single optimized reference molecule with no step-by-step optimization trajectory, and the model's limited exploration leaves RLVR with sparse feedback under similarity constraints.
Hybrid Learning Strategy
RePO introduces a hybrid learning strategy that balances exploration and exploitation: an RL term on verifiable rewards drives exploration of new molecules, while a supervised guidance term on the reference answer (with the policy's own reasoning kept as context) mitigates reward sparsity and stabilizes training.
Improved Performance
On molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines such as GRPO on the optimization metric (Success Rate $\times$ Similarity), while improving the balance across competing objectives and generalizing better to unseen instruction styles.
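To make the cited metric concrete, the sketch below computes Success Rate $\times$ Similarity over a toy batch of outcomes. How the paper aggregates similarity (here, the mean similarity over successful cases) is an assumption on our part, not a definition taken from the paper.

```python
def success_times_similarity(results):
    """Compute Success Rate x Similarity over a batch.

    results: list of (success, similarity) pairs, one per test case.
    The aggregation choice (mean similarity of successful cases) is an
    illustrative assumption, not necessarily the paper's definition."""
    if not results:
        return 0.0
    sims_of_successes = [sim for ok, sim in results if ok]
    success_rate = len(sims_of_successes) / len(results)
    mean_sim = (sum(sims_of_successes) / len(sims_of_successes)
                if sims_of_successes else 0.0)
    return success_rate * mean_sim
```

For a batch of four cases where two succeed with similarities 0.8 and 0.6, the metric is `0.5 * 0.7 = 0.35`, illustrating why the metric rewards methods that succeed without drifting far from the input molecule.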
Demerits
Dependence on LLMs
RePO depends on the capabilities of the underlying LLM, which may not suit every molecular optimization task and can require significant computational resources for candidate sampling and training.
Limited Generalizability
While RePO demonstrates good generalization to unseen instruction styles, its performance may degrade when applied to significantly different molecular optimization tasks or domains.
Expert Commentary
The article presents a novel approach to molecular optimization built on LLM reasoning. RePO's hybrid learning strategy and strong benchmark results make it a promising contribution to the field. However, its dependence on the underlying LLM and its untested generalizability beyond the evaluated benchmarks warrant further investigation. The work also points to potential applications across industries and to the broader value of hybrid strategies that pair RL exploration with supervised reference guidance. Overall, the article is a valuable contribution that highlights the potential of LLMs in this domain.
Recommendations
- ✓ Further research is needed to investigate the dependence of RePO on LLMs and to develop more generalizable approaches to molecular optimization.
- ✓ Hybrid strategies that combine RL exploration with supervised reference guidance, as in RePO, merit further study in other generation tasks where only answer-level supervision is available.