Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models
arXiv:2603.12344v1
Abstract: Molecular Property Prediction (MPP) is a central task in drug discovery. While Large Language Models (LLMs) show promise as generalist models for MPP, their current performance remains below the threshold for practical adoption. We propose TreeKD, a novel knowledge distillation method that transfers complementary knowledge from tree-based specialist models into LLMs. Our approach trains specialist decision trees on functional group features, then verbalizes their learned predictive rules as natural language to enable rule-augmented context learning. This enables LLMs to leverage structural insights that are difficult to extract from SMILES strings alone. We further introduce rule-consistency, a test-time scaling technique inspired by bagging that ensembles predictions across diverse rules from a Random Forest. Experiments on 22 ADMET properties from the TDC benchmark demonstrate that TreeKD substantially improves LLM performance, narrowing the gap with SOTA specialist models and advancing toward practical generalist models for molecular property prediction.
Executive Summary
This article introduces TreeKD, a novel knowledge distillation method that transfers complementary knowledge from tree-based specialist models into Large Language Models (LLMs) for Molecular Property Prediction (MPP). By training specialist decision trees on functional group features and verbalizing their learned predictive rules as natural language, TreeKD enables LLMs to leverage structural insights difficult to extract from SMILES strings alone. Experiments on 22 ADMET properties from the TDC benchmark demonstrate that TreeKD substantially improves LLM performance, narrowing the gap with SOTA specialist models. This advances the development of practical generalist models for MPP, a central task in drug discovery.
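As a rough illustration of the verbalization step described above, the sketch below walks one molecule down a tiny hand-built decision tree and turns the path it follows into a sentence. Everything in it is an assumption made for illustration: the `Node` class, the `decision_path` and `verbalize` helpers, the sentence template, the example tree, and its labels are hypothetical and not taken from the paper. Functional-group counts are assumed to be precomputed by a cheminformatics toolkit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One node of a tiny hand-built decision tree (hypothetical, not from the paper)."""
    feature: Optional[str] = None   # functional group tested at this node
    threshold: float = 0.5          # go right when the group count exceeds this
    left: Optional["Node"] = None   # branch taken when the group is absent
    right: Optional["Node"] = None  # branch taken when the group is present
    label: Optional[str] = None     # set only on leaf nodes


def decision_path(root: Node, counts: Dict[str, int]) -> Tuple[List[Tuple[str, bool]], str]:
    """Follow one molecule down the tree; return the tests it passed through and the leaf label."""
    tests: List[Tuple[str, bool]] = []
    node = root
    while node.label is None:
        present = counts.get(node.feature, 0) > node.threshold
        tests.append((node.feature, present))
        node = node.right if present else node.left
    return tests, node.label


def verbalize(tests: List[Tuple[str, bool]], label: str, prop: str) -> str:
    """Render a decision path as one natural-language rule (assumed template)."""
    if not tests:
        return f"The predicted {prop} is {label}."
    clauses = [
        f"has at least one {name} group" if present else f"has no {name} group"
        for name, present in tests
    ]
    condition = " and ".join(clauses)
    return f"If a molecule {condition}, the predicted {prop} is {label}."


# Invented example tree and labels; they carry no chemical meaning.
tree = Node(
    feature="carboxylic acid",
    left=Node(feature="amide", left=Node(label="high"), right=Node(label="low")),
    right=Node(label="low"),
)

# Assumed precomputed functional-group counts for one molecule.
counts = {"carboxylic acid": 0, "amide": 2}
tests, label = decision_path(tree, counts)
rule = verbalize(tests, label, "permeability")
print(rule)
```

In a setup like this, the resulting sentence would be placed in the LLM prompt alongside the SMILES string, so the model sees the structural cue spelled out rather than having to infer it from the raw string.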
Key Points
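- TreeKD distills knowledge from tree-based specialist models into LLMs for MPP.
- The approach trains decision trees on functional group features and verbalizes their predictive rules as natural language for rule-augmented context learning.
- Rule-consistency, a test-time scaling technique inspired by bagging, ensembles predictions across diverse rules from a Random Forest.
- TreeKD improves LLM performance on 22 ADMET properties from the TDC benchmark.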
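The rule-consistency idea from the abstract, which ensembles predictions made under different Random Forest rules, can be sketched as a simple vote. The abstract does not say how predictions are aggregated, so the majority vote below is an assumption, and `rule_consistency`, `fake_llm`, and the canned answers are hypothetical stand-ins for a real LLM call.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def rule_consistency(
    smiles: str,
    rules: Sequence[str],
    predict_with_rule: Callable[[str, str], str],
) -> Tuple[str, float]:
    """Query the predictor once per rule and return the majority label with its vote share.

    Majority voting is an assumed aggregation; ties go to the label seen first.
    """
    if not rules:
        raise ValueError("at least one rule is required")
    votes = Counter(predict_with_rule(smiles, rule) for rule in rules)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(rules)


# Placeholder rules, one per tree in the forest (invented for illustration).
rules = ["rule from tree 1", "rule from tree 2", "rule from tree 3"]

# Canned answers standing in for LLM outputs under each rule.
canned = {rules[0]: "low", rules[1]: "high", rules[2]: "low"}


def fake_llm(smiles: str, rule: str) -> str:
    """Hypothetical stand-in for prompting an LLM with a molecule and one rule."""
    return canned[rule]


winner, agreement = rule_consistency("CC(=O)O", rules, fake_llm)
print(winner, round(agreement, 2))
```

Each rule costs one extra model call, so the number of rules sets the trade-off between test-time compute and the stability of the final prediction.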
Merits
Strengths of TreeKD
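The method targets the performance gap between generalist LLMs and specialist models by supplying the LLM with verbalized decision-tree rules as context. The abstract reports that this substantially improves LLM performance across 22 ADMET properties.
Narrowing the gap with SOTA models
TreeKD narrows the gap with state-of-the-art specialist models on the TDC benchmark; the abstract does not claim that it surpasses them. The result is a step toward practical adoption of generalist models for MPP.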
Demerits
Limitations of the current implementation
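The approach depends on specialist decision trees trained on functional group features, so the usefulness of the verbalized rules likely depends on how well those features capture the property in question. The reported evaluation covers 22 ADMET properties from the TDC benchmark, and generalization to other biochemical datasets is not established.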
Expert Commentary
The article presents a meaningful advance in applying LLMs to MPP, a critical task in drug discovery. By combining the strengths of tree-based specialist models and LLMs, TreeKD improves LLM performance, although it narrows rather than closes the gap with specialist models. The current implementation may not generalize to diverse biochemical datasets, and further research is needed to address this limitation. If generalist models reach practical adoption in drug discovery, the method could also have policy implications, including the possible adaptation of regulatory frameworks.
Recommendations
- Future research should test whether TreeKD generalizes to more diverse biochemical datasets and to molecular properties beyond the 22 ADMET tasks evaluated.
- Regulatory bodies should reassess their frameworks to accommodate the increasing use of generalist models in MPP, ensuring public health and safety while fostering innovation.