
Multi-RF Fusion with Multi-GNN Blending for Molecular Property Prediction

arXiv:2603.20724v1 Abstract: Multi-RF Fusion achieves a test ROC-AUC of 0.8476 +/- 0.0002 on ogbg-molhiv (10 seeds), placing #1 on the OGB leaderboard ahead of HyperFusion (0.8475 +/- 0.0003). The core of the method is a rank-averaged ensemble of 12 Random Forest models trained on concatenated molecular fingerprints (FCFP, ECFP, MACCS, atom pairs -- 4,263 dimensions total), blended with deep-ensembled GNN predictions at 12% weight. Two findings drive the result: (1) setting max_features to 0.20 instead of the default sqrt(d) gives a +0.008 AUC gain on this scaffold split, and (2) averaging GNN predictions across 10 seeds before blending with the RF eliminates GNN seed variance entirely, dropping the final standard deviation from 0.0008 to 0.0002. No external data or pre-training is used.

Zacharie Bugaud

Executive Summary

This article presents Multi-RF Fusion, an ensemble method that tops the OGB ogbg-molhiv leaderboard for molecular property prediction. The approach blends a rank-averaged ensemble of 12 Random Forest models, trained on concatenated molecular fingerprints, with deep-ensembled Graph Neural Network (GNN) predictions at a 12% weight. Two findings drive the result: raising max_features to 0.20 (rather than the sqrt(d) default) and averaging GNN predictions across seeds before blending, which sharply reduces run-to-run variance. Because the method uses only public data and no external pre-training, it is comparatively easy to reproduce and apply.
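The pipeline described above can be sketched end to end on synthetic data. This is a minimal illustration, not the authors' code: the fingerprint matrix, the `gnn_score` stand-in, and all sizes are placeholders, but the structure -- 12 Random Forests with `max_features=0.20`, rank-normalized and averaged, then blended with a GNN score at 12% weight -- follows the abstract.

```python
# Hedged sketch of the Multi-RF Fusion pipeline on synthetic data.
# Fingerprints and the GNN score are simulated placeholders.
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 600, 64                      # stand-in for the 4,263 fingerprint dims
X = rng.integers(0, 2, size=(n, d)).astype(float)
w = rng.normal(size=d)
y = (X @ w + rng.normal(scale=2.0, size=n) > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# 1) Rank-averaged ensemble of 12 RFs with max_features=0.20 (not sqrt(d)).
ranked = []
for seed in range(12):
    rf = RandomForestClassifier(
        n_estimators=100, max_features=0.20, random_state=seed, n_jobs=-1
    ).fit(X_tr, y_tr)
    p = rf.predict_proba(X_te)[:, 1]
    ranked.append(rankdata(p) / len(p))   # rank-normalize before averaging
rf_score = np.mean(ranked, axis=0)

# 2) Blend with a (seed-averaged) GNN score at 12% weight. No GNN is
#    trained in this sketch, so a noisy view of the labels stands in.
gnn_score = np.clip(y_te + rng.normal(scale=0.8, size=len(y_te)), 0, 1)
gnn_rank = rankdata(gnn_score) / len(gnn_score)
blend = 0.88 * rf_score + 0.12 * gnn_rank

print(f"RF-only AUC: {roc_auc_score(y_te, rf_score):.3f}")
print(f"Blended AUC: {roc_auc_score(y_te, blend):.3f}")
```

Rank-averaging before blending matters because the RF probabilities and the GNN scores live on different scales; converting both to ranks makes the 88/12 weighting meaningful.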

Key Points

  • Multi-RF Fusion achieves state-of-the-art results in molecular property prediction
  • The method combines a rank-averaged ensemble of 12 Random Forest models with deep-ensembled GNN predictions
  • Optimizing the max_features parameter (0.20 rather than the sqrt(d) default) and averaging GNN predictions across seeds drive the performance gain

Merits

Strength in ensemble methods

The article shows that carefully tuned classical ensembles remain competitive with, and here exceed, deep models on molecular property prediction, a notoriously challenging task.

Practical application

Because the method uses only standard molecular fingerprints and public data, with no external pre-training, it is comparatively easy to reproduce and deploy in real-world settings.

Methodological contributions

The article isolates two concrete, reusable findings -- a non-default max_features setting and seed-averaged GNN predictions -- underscoring how much headroom careful hyperparameter tuning and variance reduction can offer.

Demerits

Limited generalizability

The method is evaluated on a specific dataset (ogbg-molhiv), and its performance on other datasets and tasks is not explored.

Overreliance on hyperparameter tuning

The reported +0.008 AUC swing from a single hyperparameter (max_features) suggests the method is sensitive to tuning choices, which may limit reproducibility on other splits and tasks.

Lack of interpretability

The article offers no insight into which molecular features or substructures drive the predictions, which is essential for understanding and trusting the model's outputs.

Expert Commentary

The article presents a simple but effective ensemble method for molecular property prediction that tops the OGB ogbg-molhiv leaderboard, notably without external data or pre-training. The main caveats are sensitivity to hyperparameter choices, evaluation on a single benchmark, and the absence of any interpretability analysis linking molecular substructures to predictions. Still, the work is a useful reminder that well-tuned classical ensembles, combined with variance-reduction tricks such as seed averaging, remain competitive with purely deep approaches.

Recommendations

  • Future research should explore the generalizability of the method to other datasets and tasks.
  • The authors should investigate methods to improve the interpretability of the model, such as feature importance and partial dependence plots.
  • The research community should continue to explore the use of ensemble methods and hyperparameter tuning in machine learning for molecular property prediction.
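As a starting point for the interpretability recommendation above, impurity-based feature importances from a fitted Random Forest are cheap to compute with scikit-learn. This is a hedged sketch: the binary features below are synthetic stand-ins for fingerprint bits, and the two informative bits (indices 3 and 7) are planted by construction.

```python
# Minimal interpretability sketch: rank fingerprint bits by
# impurity-based importance. Data and indices are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(500, 20)).astype(float)
# Labels depend only on bits 3 and 7, plus noise.
y = (X[:, 3] + X[:, 7] + rng.normal(scale=0.5, size=500) > 1.0).astype(int)

rf = RandomForestClassifier(n_estimators=200, max_features=0.20,
                            random_state=0).fit(X, y)

# Bits 3 and 7 should dominate the importance ranking here.
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top feature indices:", top)
```

On real fingerprints, the same ranking maps important bits back to substructures, and tools such as partial dependence plots can then probe how each bit shifts the predicted probability.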

Sources

Original: arXiv - cs.AI