Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
arXiv:2603.05900v1 Announce Type: new Abstract: Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model's lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy's intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate $\times$ Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at https://github.com/tmlr-group/RePO.
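The reward sparsity described in the abstract is easy to see in a deliberately simplified sketch: a candidate earns credit only when it both satisfies the requested property and stays within the similarity constraint, so most random edits score zero. The function name, inputs, and threshold below are illustrative assumptions, not the paper's implementation.

```python
def verifiable_reward(satisfies_property: bool, similarity: float,
                      sim_threshold: float = 0.4) -> float:
    """Binary verifiable reward: 1.0 only when the candidate molecule
    satisfies the requested property AND remains within the similarity
    budget relative to the input molecule; 0.0 otherwise.

    Because both conditions must hold at once, most sampled edits
    score 0.0 -- the sparse-feedback problem the abstract describes."""
    return 1.0 if (satisfies_property and similarity >= sim_threshold) else 0.0
```

For example, a candidate that satisfies the property but drifts to similarity 0.2 still receives zero reward, which is why pure RLVR learns slowly without effective exploration.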
Executive Summary
This article presents Reference-guided Policy Optimization (RePO), an approach that applies large language model (LLM) reasoning to instruction-based molecular optimization. RePO targets the setting where each data point provides only a single optimized reference molecule and no step-by-step optimization trajectory, combining exploration and exploitation in one update: it samples candidate molecules together with their intermediate reasoning trajectories, trains them against verifiable rewards that measure property satisfaction under similarity constraints, and adds reference guidance by supervising only the answer while keeping the policy's own reasoning as context. On molecular optimization benchmarks, RePO outperforms supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) baselines, improves the balance across competing objectives, and generalizes better to unseen instruction styles. The publicly available code and consistent results make RePO a valuable contribution to the field.
Key Points
- ▸ RePO introduces a hybrid learning strategy that balances exploration and exploitation.
- ▸ RePO samples candidate molecules with intermediate reasoning trajectories to train the model using verifiable rewards and reference guidance.
- ▸ RePO outperforms traditional supervised fine-tuning and reinforcement learning with verifiable rewards on molecular optimization benchmarks.
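The hybrid update in the key points above can be sketched numerically. The toy function below combines a policy-gradient surrogate on the sampled trajectory with a supervised term on the reference answer alone; the function names, scalar inputs, and `guidance_weight` are illustrative assumptions, not the paper's actual loss.

```python
def repo_loss(logp_trajectory: float, advantage: float,
              nll_reference_answer: float,
              guidance_weight: float = 0.5) -> float:
    """One-sample sketch of a RePO-style objective.

    rl_term: policy-gradient surrogate on the sampled reasoning
        trajectory, scaled by an advantage derived from the verifiable
        reward -- this drives exploration of new molecules.
    guidance_term: supervised NLL of the reference answer only, with the
        policy's own intermediate reasoning kept as fixed context -- this
        grounds outputs to the reference when rewards are sparse."""
    rl_term = -advantage * logp_trajectory
    guidance_term = guidance_weight * nll_reference_answer
    return rl_term + guidance_term
```

With `advantage=1.0`, `logp_trajectory=-2.0`, and `nll_reference_answer=3.0`, the loss is `2.0 + 0.5 * 3.0 = 3.5`; when the advantage is zero (all rewards sparse), only the guidance term remains, which is how the supervised signal keeps training moving.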
Merits
Addressing the Challenges of Instruction-based Molecular Optimization
RePO tackles the core challenges of instruction-based molecular optimization: each data point supplies only a single optimized reference molecule with no step-by-step optimization trajectory, and the model's limited exploration leaves RLVR with sparse feedback under similarity constraints.
Hybrid Learning Strategy
RePO introduces a hybrid learning strategy that balances exploration and exploitation: an RL term on verifiable rewards drives exploration of new molecules, while a supervised guidance term on the reference answer (with the policy's own reasoning kept as context) mitigates reward sparsity and stabilizes training.
Improved Performance
On molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines such as GRPO on the optimization metric (Success Rate $\times$ Similarity), while improving the balance across competing objectives and generalizing better to unseen instruction styles.
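To make the cited metric concrete, the sketch below computes Success Rate $\times$ Similarity over a toy batch of outcomes. How the paper aggregates similarity (here, the mean similarity over successful cases) is an assumption on our part, not a definition taken from the paper.

```python
def success_times_similarity(results):
    """Compute Success Rate x Similarity over a batch.

    results: list of (success, similarity) pairs, one per test case.
    The aggregation choice (mean similarity of successful cases) is an
    illustrative assumption, not necessarily the paper's definition."""
    if not results:
        return 0.0
    sims_of_successes = [sim for ok, sim in results if ok]
    success_rate = len(sims_of_successes) / len(results)
    mean_sim = (sum(sims_of_successes) / len(sims_of_successes)
                if sims_of_successes else 0.0)
    return success_rate * mean_sim
```

For a batch of four cases where two succeed with similarities 0.8 and 0.6, the metric is `0.5 * 0.7 = 0.35`, illustrating why the metric rewards methods that succeed without drifting far from the input molecule.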
Demerits
Dependence on LLMs
RePO depends on the capabilities of the underlying LLM, which may not suit every molecular optimization task and can require significant computational resources for candidate sampling and training.
Limited Generalizability
While RePO demonstrates good generalization to unseen instruction styles, its performance may degrade when applied to significantly different molecular optimization tasks or domains.
Expert Commentary
The article presents a novel approach to molecular optimization built on LLM reasoning. RePO's hybrid learning strategy and strong benchmark results make it a promising contribution to the field. However, its dependence on the underlying LLM and its untested generalizability beyond the evaluated benchmarks warrant further investigation. The work also points to potential applications across industries and to the broader value of hybrid strategies that pair RL exploration with supervised reference guidance. Overall, the article is a valuable contribution that highlights the potential of LLMs in this domain.
Recommendations
- ✓ Further research is needed to investigate the dependence of RePO on LLMs and to develop more generalizable approaches to molecular optimization.
- ✓ Hybrid strategies that combine RL exploration with supervised reference guidance, as in RePO, merit further study in other generation tasks where only answer-level supervision is available.