
BindCLIP: A Unified Contrastive-Generative Representation Learning Framework for Virtual Screening


Anjie Qiao, Zhen Wang, Yaliang Li, Jiahua Rao, Yuedong Yang

arXiv:2602.15236v1 Announce Type: new Abstract: Virtual screening aims to efficiently identify active ligands from massive chemical libraries for a given target pocket. Recent CLIP-style models such as DrugCLIP enable scalable virtual screening by embedding pockets and ligands into a shared space. However, our analyses indicate that such representations can be insensitive to fine-grained binding interactions and may rely on shortcut correlations in training data, limiting their ability to rank ligands by true binding compatibility. To address these issues, we propose BindCLIP, a unified contrastive-generative representation learning framework for virtual screening. BindCLIP jointly trains pocket and ligand encoders using CLIP-style contrastive learning together with a pocket-conditioned diffusion objective for binding pose generation, so that pose-level supervision directly shapes the retrieval embedding space toward interaction-relevant features. To further mitigate shortcut reliance, we introduce hard-negative augmentation and a ligand-ligand anchoring regularizer that prevents representation collapse. Experiments on two public benchmarks demonstrate consistent improvements over strong baselines. BindCLIP achieves substantial gains on challenging out-of-distribution virtual screening and improves ligand-analogue ranking on the FEP+ benchmark. Together, these results indicate that integrating generative, pose-level supervision with contrastive learning yields more interaction-aware embeddings and improves generalization in realistic screening settings, bringing virtual screening closer to real-world applicability.

Executive Summary

The article proposes BindCLIP, a unified contrastive-generative representation learning framework for virtual screening. By training pocket and ligand encoders with CLIP-style contrastive learning alongside a pocket-conditioned diffusion objective for binding pose generation, BindCLIP steers its retrieval embeddings toward interaction-relevant features that purely contrastive models can miss. The framework also incorporates hard-negative augmentation and a ligand-ligand anchoring regularizer to mitigate shortcut reliance and prevent representation collapse. Empirical results on two public benchmarks show consistent improvements over strong baselines, particularly on challenging out-of-distribution virtual screening and on ligand-analogue ranking on the FEP+ benchmark. BindCLIP's results underscore the value of combining generative and contrastive learning for more accurate and generalizable virtual screening.
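To make the CLIP-style retrieval setting concrete, the sketch below ranks a toy ligand library against a single pocket by cosine similarity in the shared embedding space. This is a minimal illustration of the screening step only: the function name `rank_ligands` and the toy vectors are ours, and in the actual system the embeddings would come from BindCLIP's trained pocket and ligand encoders.

```python
import numpy as np

def rank_ligands(pocket_emb, ligand_embs):
    """Rank ligands by cosine similarity to one pocket embedding.

    Illustrative only: in a real pipeline the inputs are encoder
    outputs; here they are plain toy vectors.
    """
    p = pocket_emb / np.linalg.norm(pocket_emb)
    L = ligand_embs / np.linalg.norm(ligand_embs, axis=1, keepdims=True)
    scores = L @ p                # cosine similarity per ligand
    order = np.argsort(-scores)   # highest similarity first
    return order, scores

# Toy library of 3 ligand embeddings; ligand 0 is most aligned with the pocket.
pocket = np.array([1.0, 0.0, 0.0])
library = np.array([[0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.5, 0.5, 0.5]])
order, scores = rank_ligands(pocket, library)
print(order.tolist())  # → [0, 2, 1]
```

Because screening reduces to a matrix-vector product over precomputed ligand embeddings, this retrieval step scales to massive chemical libraries, which is the efficiency property the paper's critique of embedding quality takes as its starting point.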

Key Points

  • BindCLIP integrates contrastive and generative learning for improved virtual screening
  • Pose-level supervision shapes the retrieval embedding space toward interaction-relevant features
  • Hard-negative augmentation and ligand-ligand anchoring regularization mitigate shortcut reliance
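The key points above can be sketched as a single toy training objective: a CLIP-style InfoNCE term over a ligand batch that includes an appended hard negative, plus a ligand-ligand anchoring penalty that keeps ligand embeddings near reference embeddings so the space does not collapse. All function names, the temperature, and the regularizer weight are illustrative assumptions, not the paper's actual hyperparameters, and the diffusion term is omitted here for brevity.

```python
import numpy as np

def info_nce(pocket, ligands, pos_idx, temp=0.07):
    """CLIP-style contrastive loss for one pocket against a ligand batch
    (in-batch negatives plus appended hard negatives)."""
    p = pocket / np.linalg.norm(pocket)
    L = ligands / np.linalg.norm(ligands, axis=1, keepdims=True)
    logits = (L @ p) / temp
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[pos_idx]))

def anchor_reg(ligands, ref_ligands):
    """Ligand-ligand anchoring: penalize drift of ligand embeddings away
    from reference embeddings, preventing representation collapse."""
    diff = ligands - ref_ligands
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy batch: true ligand, one in-batch negative, one hard negative (a decoy
# deliberately resembling the positive).
pocket = np.array([1.0, 0.0])
ligands = np.array([[0.95, 0.05],   # positive
                    [0.0, 1.0],     # in-batch negative
                    [0.7, 0.7]])    # hard negative
total = info_nce(pocket, ligands, pos_idx=0) + 0.1 * anchor_reg(ligands, ligands)
print(total)
```

The hard negative raises the contrastive loss more than an easy negative would, which is exactly why such decoys discourage the encoder from leaning on shortcut correlations.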

Merits

Improved Generalizability

BindCLIP's integration of contrastive and generative learning enables it to generalize better to out-of-distribution virtual screening settings, making it a more practical solution for real-world applications.

Enhanced Interaction Awareness

Pose-level supervision from the pocket-conditioned diffusion objective pushes BindCLIP's embeddings to encode fine-grained binding interactions rather than surface-level chemistry, improving the accuracy of retrieval-based virtual screening.
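A minimal sketch of what a pocket-conditioned diffusion objective looks like, under simplifying assumptions: corrupt the ligand pose coordinates with Gaussian noise at one fixed noise level, then score how well a denoiser (given the pocket) predicts that noise. The `zero_denoiser` stand-in, the fixed `alpha_bar`, and the toy shapes are ours; a real model samples the timestep and conditions an equivariant network on the pocket structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_denoiser(noisy_pose, pocket):
    """Hypothetical stand-in for the pocket-conditioned denoising network."""
    return np.zeros_like(noisy_pose)

def diffusion_pose_loss(denoiser, pose, pocket, alpha_bar=0.5):
    """Simplified DDPM-style step: noise the ligand coordinates, then
    measure MSE between the added noise and the denoiser's prediction."""
    eps = rng.standard_normal(pose.shape)
    noisy = np.sqrt(alpha_bar) * pose + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = denoiser(noisy, pocket)
    return float(np.mean((eps_hat - eps) ** 2))

pose = rng.standard_normal((10, 3))   # 10 atoms, xyz coordinates (toy)
pocket = rng.standard_normal((32,))   # pocket embedding (toy)
loss = diffusion_pose_loss(zero_denoiser, pose, pocket)
print(loss > 0)
```

Because predicting the noise well requires knowing where the ligand should sit in the pocket, gradients from this loss carry binding-geometry information back into the shared encoders, which is the mechanism by which pose-level supervision shapes the retrieval space.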

Demerits

Computational Complexity

The added training machinery, including the diffusion head, hard-negative augmentation, and the ligand-ligand anchoring regularizer, may increase BindCLIP's training cost relative to purely contrastive models, which could matter when retraining or fine-tuning for large-scale virtual screening campaigns.

Data Requirements

The effectiveness of BindCLIP may be limited by the availability and quality of training data, particularly in cases where the data is biased or incomplete.

Expert Commentary

The article presents a novel and promising approach to virtual screening, which has the potential to improve the accuracy and generalizability of existing methods. However, the computational complexity and data requirements of BindCLIP may limit its feasibility for large-scale applications. Nonetheless, the results presented in the article are compelling, and the framework's potential to enable the discovery of new drugs and therapies makes it an exciting area of research. Future work should focus on optimizing the computational complexity of BindCLIP and exploring its applications in real-world drug discovery settings.

Recommendations

  • Optimize BindCLIP's training cost and validate the framework in real-world drug discovery settings.
  • Integrate BindCLIP with existing virtual screening pipelines and develop shared evaluation standards for screening methodologies.
