Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models

arXiv:2603.12344v1. Abstract: Molecular Property Prediction (MPP) is a central task in drug discovery. While Large Language Models (LLMs) show promise as generalist models for MPP, their current performance remains below the threshold for practical adoption. We propose TreeKD, a novel knowledge distillation method that transfers complementary knowledge from tree-based specialist models into LLMs. Our approach trains specialist decision trees on functional group features, then verbalizes their learned predictive rules as natural language to enable rule-augmented context learning. This enables LLMs to leverage structural insights that are difficult to extract from SMILES strings alone. We further introduce rule-consistency, a test-time scaling technique inspired by bagging that ensembles predictions across diverse rules from a Random Forest. Experiments on 22 ADMET properties from the TDC benchmark demonstrate that TreeKD substantially improves LLM performance, narrowing the gap with SOTA specialist models…

Khiem Le, Sreejata Dey, Marcos Martínez Galindo, Vanessa Lopez, Ting Hua, Nitesh V. Chawla, Hoang Thanh Lam · March 16, 2026

#cs.LG

Executive Summary

This article introduces TreeKD, a novel knowledge distillation method that transfers complementary knowledge from tree-based specialist models into Large Language Models (LLMs) for Molecular Property Prediction (MPP). By training specialist decision trees on functional group features and verbalizing their learned predictive rules as natural language, TreeKD enables LLMs to leverage structural insights difficult to extract from SMILES strings alone. Experiments on 22 ADMET properties from the TDC benchmark demonstrate that TreeKD substantially improves LLM performance, narrowing the gap with SOTA specialist models. This advances the development of practical generalist models for MPP, a central task in drug discovery.
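
The verbalization step described above can be illustrated with a minimal sketch. It assumes a hand-built toy tree over functional-group counts; the Node class, the verbalize_path helper, the feature names, thresholds, labels, and sentence template are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    # Internal node: split on `feature` <= `threshold`. Leaf: `prediction` is set.
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch taken when count <= threshold
    right: Optional["Node"] = None   # branch taken when count > threshold
    prediction: Optional[str] = None


def verbalize_path(tree: Node, counts: Dict[str, int]) -> str:
    """Follow one molecule's path through the tree and render it as a sentence."""
    conditions: List[str] = []
    node = tree
    while node.prediction is None:
        value = counts.get(node.feature, 0)  # absent groups count as zero
        if value <= node.threshold:
            conditions.append(
                f"the number of {node.feature} groups is at most {node.threshold:g}"
            )
            node = node.left
        else:
            conditions.append(
                f"the number of {node.feature} groups is greater than {node.threshold:g}"
            )
            node = node.right
    return "If " + " and ".join(conditions) + f", then the predicted label is {node.prediction}."


# Hypothetical toy tree: names, thresholds, and labels are illustrative only.
toy_tree = Node(
    feature="hydroxyl",
    threshold=1,
    left=Node(prediction="negative"),
    right=Node(
        feature="aromatic ring",
        threshold=2,
        left=Node(prediction="positive"),
        right=Node(prediction="negative"),
    ),
)

# Functional-group counts for one made-up molecule.
rule = verbalize_path(toy_tree, {"hydroxyl": 3, "aromatic ring": 1})
print(rule)
```

In this sketch the resulting sentence would be placed in the LLM's context next to the molecule's SMILES string. How TreeKD actually phrases and selects its rules is not specified in the abstract.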

Key Points

▸ TreeKD distills knowledge from tree-based specialist models into LLMs for MPP.
▸ The approach verbalizes predictive rules as natural language for rule-augmented context learning.
▸ TreeKD improves LLM performance on 22 ADMET properties from the TDC benchmark.
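
The abstract also describes rule-consistency, a test-time technique inspired by bagging that ensembles predictions across diverse rules from a Random Forest. A minimal sketch of that idea follows, assuming a classification task and a simple majority vote. The prompt template, the build_prompt and rule_consistency helpers, and the fake_llm stand-in are hypothetical; the paper's actual aggregation scheme may differ.

```python
from collections import Counter
from typing import Callable, List


def build_prompt(smiles: str, rule: str) -> str:
    # Hypothetical prompt template pairing the molecule with one verbalized rule.
    return (
        f"Molecule (SMILES): {smiles}\n"
        f"Rule from a specialist tree: {rule}\n"
        "Answer with a single label."
    )


def rule_consistency(smiles: str, rules: List[str], llm: Callable[[str], str]) -> str:
    """Query the model once per rule and return the most frequent answer."""
    votes = Counter(llm(build_prompt(smiles, rule)) for rule in rules)
    return votes.most_common(1)[0][0]


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption): echoes the label named in the rule.
    return "positive" if "label is positive" in prompt else "negative"


# Illustrative rules, as if verbalized from three different trees of a forest.
rules = [
    "If the number of hydroxyl groups is greater than 1, then the predicted label is positive.",
    "If the number of amide groups is at most 0, then the predicted label is negative.",
    "If the number of aromatic ring groups is at most 2, then the predicted label is positive.",
]

prediction = rule_consistency("CCO", rules, fake_llm)
print(prediction)
```

For a regression property, the vote would presumably be replaced by an average of the numeric answers, which is again an assumption rather than something the abstract states.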

Merits

Strengths of TreeKD

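The proposed method targets the performance gap between generalist LLMs and specialist models. It supplies the LLM with verbalized decision-tree rules over functional group features, structural information that is difficult to extract from SMILES strings alone.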

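Narrowing the gap with SOTA models

TreeKD substantially improves LLM performance on 22 ADMET properties and narrows the gap with state-of-the-art specialist models. The abstract does not claim that it surpasses them, so its readiness for practical adoption in MPP remains to be shown.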

Demerits

Limitations of the current implementation

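The approach depends on specialist decision trees trained on functional group features, so its quality is tied to how well those features capture the property being predicted. The reported evaluation covers 22 ADMET properties from the TDC benchmark, and the method may not generalize to more diverse biochemical datasets.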

Expert Commentary

The article presents a notable advance in applying LLMs to MPP, a central task in drug discovery. By combining the strengths of tree-based specialist models and LLMs, TreeKD improves LLM performance, although the abstract reports a narrowed gap with specialist models rather than parity. The current implementation may not generalize to diverse biochemical datasets, and further research is needed to address this limitation. If generalist models reach practical adoption in drug discovery, the method could also carry practical and policy implications, including for how regulatory frameworks treat model-based predictions.

Recommendations

✓ Future research should test whether TreeKD generalizes to diverse biochemical datasets and explore its application to other MPP tasks.
✓ Regulatory bodies should reassess whether their frameworks accommodate the growing use of generalist models in MPP, protecting public health and safety while fostering innovation.

Sources

arXiv - cs.LG

Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models

Related Articles

ConstitutionGPT: An AI-Powered Multilingual Legal Assistance System for Indian Citizens

AI Copyright Infringement: Navigating the Legal Risks of AI-Generated Content

The Rhetoric of Machine Learning

Busemann energy-based attention for emotion analysis in Poincaré discs
