
MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion

arXiv:2602.22405v1 Announce Type: new Abstract: Most machine learning models for molecular property prediction rely on a single molecular representation (either a sequence, a graph, or a 3D structure) and treat molecular geometry as static. We present MolFM-Lite, a multi-modal model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, while conditioning predictions on experimental context via Feature-wise Linear Modulation (FiLM). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann-weighted priors over multiple RDKit-generated conformers, capturing the thermodynamic distribution of molecular shapes; and (2) a cross-modal fusion layer where each modality can attend to others, enabling complementary information sharing. We evaluate on four MoleculeNet scaffold-split benchmarks using our model's own splits, and report all baselines re-evaluated under the same protocol. Comprehensive ablation studies across all four datasets confirm that each architectural component contributes independently, with tri-modal fusion providing 7-11% AUC improvement over single-modality baselines and conformer ensembles adding approximately 2% over single-conformer variants. Pre-training on ZINC250K (~250K molecules) using cross-modal contrastive and masked-atom objectives enables effective weight initialization at modest compute cost. We release all code, trained models, and data splits to support reproducibility.
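To make the first contribution concrete, here is a minimal sketch of conformer ensemble attention as the abstract describes it: a Boltzmann prior over conformer energies mixed with a learned attention distribution. The function name, the `prior_weight` mixing coefficient, and the convex-combination form are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def conformer_ensemble_pool(conf_embeds, conf_energies, attn_scores,
                            temperature_K=298.15, prior_weight=0.5):
    """Pool per-conformer embeddings into one molecular embedding.

    conf_embeds:   (n_conformers, d) embeddings, one per RDKit conformer
    conf_energies: (n_conformers,) relative energies in kcal/mol
    attn_scores:   (n_conformers,) learned attention logits

    The Boltzmann prior softmax(-E / kT) captures the thermodynamic
    population of each shape; `prior_weight` (a hypothetical mixing
    coefficient) blends it with the learned attention distribution.
    """
    kT = 0.0019872041 * temperature_K  # gas constant R in kcal/(mol*K)
    boltzmann = softmax(-np.asarray(conf_energies, dtype=float) / kT)
    learned = softmax(np.asarray(attn_scores, dtype=float))
    weights = prior_weight * boltzmann + (1.0 - prior_weight) * learned
    weights /= weights.sum()  # renormalize the convex combination
    return weights @ conf_embeds, weights
```

With uniform learned scores, lower-energy conformers receive strictly higher pooled weight, which is the intended inductive bias.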

Executive Summary

This article presents MolFM-Lite, a multi-modal machine learning model for molecular property prediction that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, with predictions conditioned on experimental context via Feature-wise Linear Modulation (FiLM). The model's contributions include a conformer ensemble attention mechanism with Boltzmann-weighted priors and a cross-modal fusion layer. Evaluation on four MoleculeNet scaffold-split benchmarks demonstrates the effectiveness of the model, with tri-modal fusion providing a 7-11% AUC improvement over single-modality baselines and conformer ensembles adding roughly 2% over single-conformer variants. The authors also release all code, trained models, and data splits to support reproducibility.
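The FiLM conditioning mentioned above is a simple mechanism: a context vector produces a per-channel scale and shift applied to the hidden features (Perez et al., 2018). The sketch below is a generic illustration under assumed shapes; the projection matrices `W_gamma`/`W_beta` and the context contents are hypothetical, not the paper's actual parameterization.

```python
import numpy as np

def film(features, context, W_gamma, W_beta):
    """Feature-wise Linear Modulation.

    features: (n, d) hidden features for one modality
    context:  (c,) experimental-context vector (e.g. assay metadata)
    W_gamma, W_beta: (c, d) projections producing per-channel scale/shift

    Returns gamma * features + beta, broadcast over the n rows.
    """
    gamma = context @ W_gamma  # (d,) per-channel scale
    beta = context @ W_beta    # (d,) per-channel shift
    return gamma * features + beta

rng = np.random.default_rng(0)
n, d, c = 3, 4, 2
W_g = rng.normal(size=(c, d))
W_b = rng.normal(size=(c, d))
h = rng.normal(size=(n, d))
out = film(h, np.array([1.0, 0.0]), W_g, W_b)
```

Because the modulation is feature-wise, the same context can amplify or suppress individual channels without mixing them, which is what makes FiLM a lightweight way to inject experimental conditions.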

Key Points

  • MolFM-Lite is a multi-modal machine learning model for molecular property prediction.
  • The model leverages SELFIES sequences, molecular graphs, and conformer ensembles through cross-attention fusion and Feature-wise Linear Modulation (FiLM).
  • Evaluation on four MoleculeNet scaffold-split benchmarks, with all baselines re-evaluated under the same protocol, shows a 7-11% AUC improvement from tri-modal fusion over single-modality baselines.

Merits

Strength in Multi-Modal Fusion

MolFM-Lite's fusion of three complementary molecular representations (1D SELFIES sequences, 2D molecular graphs, and 3D conformer ensembles) through cross-attention and Feature-wise Linear Modulation (FiLM) is a significant strength of the model. Each representation carries information the others lack (sequence-level composition, bond topology, and spatial geometry), and letting each modality attend to the others is what the reported ablations credit with the 7-11% AUC improvement over single-modality baselines.
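The cross-modal fusion layer can be sketched as single-head cross-attention, where one modality's tokens query the concatenated tokens of the other two. This is a minimal illustration under assumed shapes and a shared embedding dimension; the paper's actual fusion layer may use multiple heads, projections, and normalization that are omitted here.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(queries, keys_values, d_k):
    """Single-head cross-attention: one modality's tokens (queries)
    attend to another modality's tokens (keys/values)."""
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores) @ keys_values

rng = np.random.default_rng(1)
d = 8
seq_tokens = rng.normal(size=(5, d))   # 1D SELFIES token embeddings
graph_nodes = rng.normal(size=(7, d))  # 2D graph node embeddings
conf_tokens = rng.normal(size=(4, d))  # 3D conformer embeddings

# The 1D modality attends to the concatenation of the other two;
# a residual connection keeps the original sequence information.
others_for_seq = np.concatenate([graph_nodes, conf_tokens])
fused_seq = seq_tokens + cross_attend(seq_tokens, others_for_seq, d)
```

Symmetric updates for the graph and conformer streams would complete the "each modality can attend to others" scheme the abstract describes.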

Thorough Evaluation

The authors evaluate MolFM-Lite on four MoleculeNet scaffold-split benchmarks, re-evaluating all baselines under the same protocol and ablating each architectural component across all four datasets. One caveat: because the benchmarks use the model's own splits, the reported numbers are not directly comparable to previously published results, though the re-evaluated baselines mitigate this within the paper's own comparisons.
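Scaffold splitting is what makes this evaluation stricter than a random split: molecules sharing a core scaffold are kept in the same partition, so test scaffolds are unseen during training. The greedy grouping below is a sketch of the common MoleculeNet-style procedure; in practice the scaffold keys would be Bemis-Murcko scaffold SMILES from RDKit's `MurckoScaffold`, which is elided here to keep the example dependency-free.

```python
from collections import defaultdict

def scaffold_split(mol_ids, scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: each whole scaffold group is assigned to
    a single partition, largest groups first.

    mol_ids:   list of molecule identifiers
    scaffolds: parallel list of scaffold keys (opaque strings here)
    """
    groups = defaultdict(list)
    for mid, scaf in zip(mol_ids, scaffolds):
        groups[scaf].append(mid)
    # Assign the largest scaffold families first (common convention).
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mol_ids)
    n_train, n_valid = int(frac_train * n), int(frac_valid * n)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(valid) + len(group) <= n_valid:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because groups are never broken up, no scaffold appears in both train and test, which is exactly the generalization pressure scaffold-split benchmarks are designed to apply.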

Open-Source Code and Data

The authors release all code, trained models, and data splits to support reproducibility, which is essential for the scientific community to build upon and improve the model.

Demerits

Computational Cost

Pre-training on ZINC250K (~250K molecules) with cross-modal contrastive and masked-atom objectives still requires non-trivial computational resources, even though the authors characterize the cost as modest. In addition, generating and encoding multiple RDKit conformers per molecule adds overhead at both training and inference time, which may be a limitation for some researchers or organizations.

Dependence on Specific Datasets

MolFM-Lite's performance may be heavily dependent on the specific datasets used for training and evaluation. Further evaluation on diverse datasets is necessary to establish the model's generalizability.

Interpretability

The model's complex architecture and multi-modal fusion approach may make it challenging to interpret the results and understand the underlying mechanisms driving the predictions.

Expert Commentary

MolFM-Lite is a significant contribution to the field of molecular property prediction, offering a novel combination of multi-modal cross-attention fusion and context conditioning via FiLM. The model's performance on four MoleculeNet scaffold-split benchmarks is strong, and the ablations across all four datasets suggest each architectural component earns its place. However, the computational cost of conformer generation and pre-training, and the reliance on the authors' own data splits, limit direct comparison with prior work. The model's complex architecture and multi-modal fusion also raise open questions about interpretability and explainability. As the field evolves, effective methods for transfer learning and explanation of multi-modal predictions, supported by open-source code and data sharing, will be essential.

Recommendations

  • Future research should focus on developing more efficient and scalable methods for pre-training and fine-tuning MolFM-Lite on large datasets.
  • Investigate the use of MolFM-Lite in various applications, such as drug discovery and materials science, to demonstrate its practical viability.
  • Develop effective explainability methods for multi-modal fusion models, such as MolFM-Lite, to understand the underlying mechanisms driving the predictions.
