1 Introduction

Is there a pre-trained model that explores the chemical space of pockets and ligands while considering their complementarity? Recently, many deep learning methods have been proposed to understand the chemical space of protein pockets or drug molecules (also called ligands) and to facilitate drug design in many aspects, e.g., finding hits for a novel target [59], repurposing existing drugs for new targets [25, 57, 67], and searching for similar pockets and molecules [35, 46]. While these models have shown promising potential in learning separate pocket or molecular spaces for specific tasks [17, 21, 31, 47, 71], jointly pre-training pockets and ligands while considering their complementarity remains to be explored.

We propose the co-supervised pre-training (CoSP) framework for understanding the joint chemical space of pockets and ligands. Taking ligands as an example, contrastive self-supervised pre-training [17, 49, 56] has yielded significant achievements in recent years. By identifying well-defined positive and negative ligand pairs via a contrastive loss, the model can learn the underlying knowledge to facilitate downstream tasks. However, these self-supervised methods only capture data dependencies in the "self" domain while ignoring additional information from complementary fields, such as bindable pockets. Meanwhile, previous studies [1, 5, 11, 37] have shown that pocket-ligand complementarity plays a crucial role in determining molecular properties, since chemically similar ligands tend to bind to similar pockets. Inspired by this, we introduce cross-domain dependencies between pockets and ligands to improve molecular representation learning.

We propose the gated geometric message passing (GGMP) layer to extract expressive bio-representations for 3D pockets and ligands. All bio-objects are treated as 3D graphs [20, 24] in which each node contains invariant chemical features (atomic number, etc.) and equivariant geometric features (position and direction). For each bio-object, we optimize the pairwise energy function [22], which considers both chemical and geometric features via a gated operation. By minimizing the energy function, we derive the updating rules for the position and direction vectors. Finally, we combine these rules with classical message passing, resulting in GGMP.

We introduce the ChemInfoNCE loss to reduce negative sampling bias [9, 39]. In contrastive learning, false negative pairs that are actually positive lead to performance degradation, an effect called negative sampling bias. Chuang et al. [9] assume that the label distribution of the classification task is uniform and propose DebiasedInfoNCE to alleviate this problem. Considering the specificity of molecules and extending the setting to continuous property prediction (regression), we introduce chemical similarity-enhanced negative ligand sampling. Interestingly, improving the sampling strategy is equivalent to modifying sample weights; we therefore provide a systematic understanding from the view of loss functions and propose ChemInfoNCE.

We evaluate our model on several downstream tasks, from pocket matching and molecule property prediction to virtual screening. Extensive experiments show that our approach achieves competitive results on these tasks, suggesting that pocket-ligand complementarity can improve bio-representation learning.

2 Related Work

Motivation. Proteins and molecules achieve their biological functions by binding to each other [7]; thus, exploring protein-ligand complexes helps to improve the understanding of proteins, molecules, and their interactions. To improve generalization and reduce complexity, we further consider local patterns, namely the protein pocket x and the bindable ligand \(\hat{x}\). Taking \((x,\hat{x})\) as the positive pair and \((x, \hat{x}^-)\) as the negative pair, where \(\hat{x}^-\) cannot bind to x, we aim to pre-train a pocket model \(f: x \mapsto \boldsymbol{h}\) and a ligand model \(\hat{f}: \hat{x} \mapsto \hat{\boldsymbol{h}}\) such that the mutual information between \(\boldsymbol{h}\) and \(\hat{\boldsymbol{h}}\) is maximized.

Table 1. Protein and molecule pre-training methods

Equivariant 3D GNN. Extensive works have shown that 3D structural conformations can improve the quality of bio-representations with the help of equivariant message passing layers [4, 6, 10, 19, 41, 50]. Inspired by energy analysis [20, 22], we propose a new gated geometric message passing (GGMP) layer that considers not only the node position but also its direction, where the latter can indicate the location of pocket cavities and the angles of molecular bonds.

InfoNCE. The original InfoNCE was proposed by [36] to contrast semantically similar (positive) and dissimilar (negative) pairs of data points, such that the representations of similar pairs \((x, \hat{x})\) are close while those of dissimilar pairs \((x, \hat{x}^-)\) are more orthogonal. By default, negative pairs are uniformly sampled from the data distribution; therefore, false negative pairs lead to a significant performance drop. To address this issue, DebiasedInfoNCE [9] was proposed, which assumes that the label distribution of the classification task is uniform. Although DebiasedInfoNCE has achieved good results on image classification, it is not suitable for direct transfer to regression tasks, as the uniform distribution assumption is too strict. For bio-objects, we discard this assumption, extend the setting to continuous attribute prediction, use fingerprint similarity to measure the probability that a ligand is negative, and propose ChemInfoNCE.

Self Bio Pre-training. Many pre-training methods have been proposed for a single protein or ligand domain; they can be classified as sequence-based, graph-based, or structure-based. We summarize the protein pre-training models in Table 1. Among sequential models, CPCProt [33] maximizes the mutual information between predicted residues and their context. Profile Prediction [48] suggests predicting the MSA profile as a new pre-training task. OntoProtein [70] integrates GO (Gene Ontology) knowledge graphs into protein pre-training. While most sequence models rely on the transformer architecture, CARP [66] finds that CNNs can achieve competitive results with far fewer parameters and lower runtime costs. Recently, GearNet [74] explored the potential of 3D structural pre-training from the perspectives of masked prediction and contrastive learning. We also summarize the molecule pre-training models in Table 1. Among sequential models, FragNet [44] combines a masked language model with multi-view contrastive learning to maximize the mutual information within the same SMILES and the agreement across augmented SMILES. Beyond SMILES, more approaches [17, 29, 40, 49, 56, 71, 73] choose graph representations that better capture structural information. For example, Grover [40] integrates message passing and transformer architectures and pre-trains a super-large GNN on 10 million molecules. MICRO-Graph [71] and MGSSL [73] use motifs for contrastive learning. Incorporating domain knowledge, MoCL [49] uses substructure substitution as a new data augmentation operation and predicts pairwise fingerprint similarities. Although these pre-training methods show promising results, they do not consider 3D molecular conformations. To fill this gap, GraphMVP [31] and 3DInfomax [47] maximize the mutual information between 3D and 2D views of the same molecule and achieve further performance improvements. In addition, GEM [16] proposes a geometry-enhanced graph neural network and pre-trains it via geometric tasks. For pre-training individual proteins or molecules, these methods demonstrate promising potential on various downstream tasks but ignore pocket-ligand complementarity.

Cross Bio Pre-training. In parallel with our study, Uni-Mol [76], probably the first pre-trained model that can handle both protein pockets and molecules, was released as a preprint. However, it pre-trains pockets and ligands separately without considering their interactions, and our approach differs in pre-training data, pre-training strategy, model structure, and downstream tasks.

3 Methodology

3.1 Co-Supervised Pre-training Framework

We propose the co-supervised pre-training (CoSP) framework, as shown in Fig. 1, to explore the joint chemical space of protein pockets and ligands. The methodological innovations include:

  1. We propose the gated geometric message passing layer to model 3D pockets and ligands.

  2. We establish a co-supervised pre-training framework to learn pocket and ligand representations.

  3. We introduce ChemInfoNCE with improved negative sampling guided by chemical similarity.

  4. We evaluate the model on pocket matching, molecule property prediction, and virtual screening tasks.

3.2 Geometric Representation

We introduce the unified data representation and neural network for modeling 3D pockets and ligands. We use structures collected from the BioLip dataset [64] as pre-training data for developing the \(\text {CoSP}_{\text {base}}\) model. Further, we augment the pre-training data with the CrossDock dataset [18], resulting in the \(\text {CoSP}_{\text {large}}\) model. In downstream tasks where ligand conformations are not provided, we generate 3D conformations using MMFF [52] when it succeeds and fall back to 2D conformations otherwise.
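
As an illustration, such a conformer-generation fallback could be implemented with RDKit roughly as below; this is a minimal sketch, and the function name, random seed, and error handling are our assumptions rather than the paper's exact pipeline:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def get_conformation(smiles: str) -> Chem.Mol:
    """Return a molecule with 3D coordinates if MMFF succeeds, else 2D coords."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    # Try to embed and relax a 3D conformer with the MMFF94 force field.
    if AllChem.EmbedMolecule(mol, randomSeed=42) == 0 and \
       AllChem.MMFFOptimizeMolecule(mol) == 0:
        return Chem.RemoveHs(mol)      # success: keep the optimized 3D conformer
    mol = Chem.RemoveHs(mol)
    AllChem.Compute2DCoords(mol)       # failure: fall back to a 2D layout
    return mol
```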

Fig. 1.

Overview of CoSP. We contrast bound pocket-ligand pairs with unbound ones to learn complementarity-aware chemical embeddings. We extract positive pocket-ligand pairs (e) from protein-ligand complexes (d) and augment positive/negative relations of complexes via ligand similarity (f). We pre-train the model on the BioLip dataset (a), followed by finetuning (b) and evaluation (c) on different tasks.

Pocket and Ligand Graph. We represent a bio-object as a graph \(\mathcal {G}(X, \mathcal {V}, \mathcal {E})\), consisting of the coordinate matrix \(X \in \mathbb {R}^{n,3}\), node features \(\mathcal {V} \in \mathbb {R}^{n, d_f}\), and edge features \(\mathcal {E} \in \mathbb {R}^{n, d_e}\), where n, \(d_f\), and \(d_e\) denote the number of nodes, the node feature dimension, and the edge feature dimension. For pockets, the graph nodes are the amino acids within 10 \(\mathring{A}\) of the ligand, X contains the positions of the residues' \(C_{\alpha }\) atoms, and we construct \(\mathcal {E}\) via the k-NN algorithm. For molecules, the graph nodes are all ligand atoms except hydrogens, X contains the atom positions, and we use the molecular bonds as \(\mathcal {E}\).
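
As a sketch, the pocket graph construction could look as follows (NumPy-based; the value of k and the helper name are our illustrative assumptions, while the 10 Å cutoff follows the text):

```python
import numpy as np

def build_pocket_graph(ca_pos, ligand_pos, k=8, cutoff=10.0):
    """Keep residues whose C-alpha lies within `cutoff` Angstrom of any ligand
    atom, then connect each kept node to its k nearest neighbours (n > k)."""
    # Distance from every C-alpha to the closest ligand atom.
    d = np.linalg.norm(ca_pos[:, None, :] - ligand_pos[None, :, :], axis=-1)
    keep = np.where(d.min(axis=1) <= cutoff)[0]
    X = ca_pos[keep]                                    # node coordinates (n, 3)
    # k-NN edges over the retained pocket residues.
    pdist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(pdist, np.inf)
    nbrs = np.argsort(pdist, axis=1)[:, :k]             # (n, k) neighbour indices
    edges = [(i, j) for i in range(len(X)) for j in nbrs[i]]
    return X, edges
```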

Gated Geometric Message Passing. From layer t to \(t+1\), we use the gated geometric message passing (GGMP) layer to update the 3D graph representations, i.e., \([\boldsymbol{v}_i^{t+1}, \boldsymbol{x}_{i}^{t+1} , \boldsymbol{n}_{i}^{t+1}] = \textrm{GGMP}(\boldsymbol{v}_i^{t}, \boldsymbol{x}_{i}^{t}, \boldsymbol{n}_{i}^{t})\), where \(\boldsymbol{n}_{i}\) is the direction vector. For molecules, \(\boldsymbol{n}_{i}\) points away from the neighborhood center of node i; for pockets, \(\boldsymbol{n}_{i}\) indicates the position of pocket cavities. Given 3D conformations, we minimize the pairwise energy function E:

$$\begin{aligned} E(X, \mathcal {V}, \mathcal {E}) = \sum _{(i,j) \in \mathcal {E}} u(\boldsymbol{v}_i, \boldsymbol{v}_j, \boldsymbol{e}_{ij}) g(\langle \boldsymbol{n}_i, \boldsymbol{n}_j \rangle , d_{ij}^2) \end{aligned}$$
(1)

where \(d_{ij}^2 = ||\boldsymbol{x}_i-\boldsymbol{x}_j||^2\), both chemical energy \(u(\cdot )\) and geometric energy \(g(\cdot )\) are considered. By calculating the gradients of \(\boldsymbol{x}_i\) and \(\boldsymbol{n}_i\), we obtain their updating rules:

$$\begin{aligned} \begin{aligned} -\frac{\partial E(X, \mathcal {V}, \mathcal {E})}{\partial \boldsymbol{x}_i} = -\sum _{j \in \mathcal {N}_i} 2 u_{ij} \frac{\partial g_{ij} }{\partial d_{ij}^2} (\boldsymbol{x}_i-\boldsymbol{x}_j) \\ \approx \sum _{j \in \mathcal {N}_i} u(\boldsymbol{v}_i, \boldsymbol{v}_j, \boldsymbol{e}_{ij}) \phi _x(d_{ij}^2, \langle \boldsymbol{n}_{i}, \boldsymbol{n}_{j} \rangle ) (\boldsymbol{x}_{i}-\boldsymbol{x}_{j}) \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned} \begin{aligned} -\frac{\partial E(X, \mathcal {V}, \mathcal {E})}{\partial \boldsymbol{n}_i} = -\sum _{j \in \mathcal {N}_i} u_{ij} \frac{\partial g_{ij} }{\partial \langle \boldsymbol{n}_i, \boldsymbol{n}_j \rangle } \boldsymbol{n}_j \\ \approx \sum _{j \in \mathcal {N}_i} u(\boldsymbol{v}_i, \boldsymbol{v}_j, \boldsymbol{e}_{ij}) \phi _n(d_{ij}^2, \langle \boldsymbol{n}_{i}, \boldsymbol{n}_{j} \rangle ) \boldsymbol{n}_{j} \end{aligned} \end{aligned}$$
(3)

Note that \(\phi _x\) and \(\phi _n\) are approximations of \(\frac{\partial g_{ij} }{\partial d_{ij}^2}\) and \(\frac{\partial g_{ij} }{\partial \langle \boldsymbol{n}_i, \boldsymbol{n}_j \rangle }\). Combining these updating rules with classical graph message passing, we propose the GGMP layer:

$$\begin{aligned}&\boldsymbol{m}_{ij} = \phi _{m}(\boldsymbol{v}_i^t, \boldsymbol{v}_j^t, \boldsymbol{e}_{ij} )\end{aligned}$$
(4)
$$\begin{aligned}&\boldsymbol{g}_{ij} = \phi _{g}(d_{ij}^2, \langle \boldsymbol{n}_{i}^{t}, \boldsymbol{n}_{j}^{t} \rangle )\end{aligned}$$
(5)
$$\begin{aligned}&\boldsymbol{v}_i^{t+1} = \phi _{h}(\boldsymbol{v}_i^t, \sum _{j \in \mathcal {N}_i} \boldsymbol{m}_{ij} \boldsymbol{g}_{ij})\end{aligned}$$
(6)
$$\begin{aligned}&\boldsymbol{x}_{i}^{t+1} = \boldsymbol{x}_{i}^{t} + \lambda \sum _{j\in \mathcal {N}_{i}} u(\boldsymbol{m}_{ij}) \phi _x(\boldsymbol{g}_{ij}) (\boldsymbol{x}_{i}^{t}-\boldsymbol{x}_{j}^{t}) \end{aligned}$$
(7)
$$\begin{aligned}&\boldsymbol{n}_{i}^{t+1} = \boldsymbol{n}_{i}^{t} + \lambda \sum _{j\in \mathcal {N}_{i}} u(\boldsymbol{m}_{ij}) \phi _n(\boldsymbol{g}_{ij}) \boldsymbol{n}_{j}^{t} \end{aligned}$$
(8)

where \(\phi _*\) and u are approximated by neural networks, \(\lambda \) is a hyperparameter, and \(\boldsymbol{n}^0_{i} = -\sum _{j\in \mathcal {N}(i)}{\boldsymbol{x}^0_j}/||\sum _{j\in \mathcal {N}(i)}{\boldsymbol{x}^0_j}||\).
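
To make the layer concrete, here is a minimal PyTorch sketch of one GGMP layer following Eqs. (4)-(8); it assumes a fixed neighbour list, a given edge feature dimension, and small MLPs for \(\phi _*\) and u, all of which are our illustrative choices rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class GGMPLayer(nn.Module):
    """Sketch of gated geometric message passing, Eqs. (4)-(8)."""
    def __init__(self, d, d_e=1, lam=0.1):
        super().__init__()
        self.phi_m = nn.Sequential(nn.Linear(2 * d + d_e, d), nn.SiLU())  # Eq. (4)
        self.phi_g = nn.Sequential(nn.Linear(2, d), nn.SiLU())            # Eq. (5)
        self.phi_h = nn.Sequential(nn.Linear(2 * d, d), nn.SiLU())        # Eq. (6)
        self.u = nn.Linear(d, 1)      # chemical gate u(m_ij)
        self.phi_x = nn.Linear(d, 1)  # scalar weight for the coordinate update
        self.phi_n = nn.Linear(d, 1)  # scalar weight for the direction update
        self.lam = lam

    def forward(self, v, x, n, e, nbr):
        # v: (N, d) node features, x: (N, 3) coordinates, n: (N, 3) directions,
        # e: (N, K, d_e) edge features, nbr: (N, K) neighbour indices.
        vj, xj, nj = v[nbr], x[nbr], n[nbr]
        d2 = ((x[:, None] - xj) ** 2).sum(-1, keepdim=True)     # squared distances
        dot = (n[:, None] * nj).sum(-1, keepdim=True)           # <n_i, n_j>
        m = self.phi_m(torch.cat([v[:, None].expand_as(vj), vj, e], -1))  # Eq. (4)
        g = self.phi_g(torch.cat([d2, dot], -1))                          # Eq. (5)
        v_new = self.phi_h(torch.cat([v, (m * g).sum(1)], -1))            # Eq. (6)
        w = self.u(m)                                           # gate per edge
        x_new = x + self.lam * (w * self.phi_x(g) * (x[:, None] - xj)).sum(1)  # Eq. (7)
        n_new = n + self.lam * (w * self.phi_n(g) * nj).sum(1)  # Eq. (8)
        return v_new, x_new, n_new
```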

3.3 Contrastive Loss

In contrastive learning, biased negative sampling impairs model performance because false negative data are sampled during training. Previous methods [9, 39] address this problem under the assumption that false negative samples are uniformly distributed in the classification setting. We propose chemical knowledge-based sampling to better address this issue, where fingerprint similarity is used to measure the probability that a ligand is negative. Interestingly, changing the sampling distribution is equivalent to designing a weighted loss, and we provide a comprehensive understanding from the perspective of contrastive losses.

Uni-contrastive Loss. Given a pocket \(x \sim p\), we draw positive ligands \(\hat{x}^+\) from the distribution \(\hat{p}_x^+\) of bindable molecules and negative ligands \(\{\hat{x}_i^-\}_{i=1}^N \) from the distribution \(\hat{q}\) of non-bindable ones. By default, positive ligands are determined by the pocket-ligand complexes, while negative ones are uniformly sampled from the ligand set. We use the pocket model f and the ligand model \(\hat{f}\) to learn the latent representations \(\boldsymbol{h}\), \(\hat{\boldsymbol{h}}^+\) and \(\{ \hat{\boldsymbol{h}}_i^- \}_{i=1}^N\), where the proxy task is to maximize the positive similarity \(s^+(\boldsymbol{h}, \hat{\boldsymbol{h}}^+)\) against the negative similarities \(s_{i}^-(\boldsymbol{h} ,\hat{\boldsymbol{h}}_i^-), i=1,\cdots ,N\), resulting in:

$$\begin{aligned} L_{\text {Uni}}&= \mathbb {E}_{x \sim p, \hat{x}^+ \sim \hat{p}_x^+, \atop \{\hat{x}_i^-\}_{i=1}^N \sim \hat{q}} \left[ \log {(1 + \frac{Q}{N} \sum _{i=1}^N \frac{s_{i}^-(\boldsymbol{h} ,\hat{\boldsymbol{h}}_i^-)}{s^+(\boldsymbol{h}, \hat{\boldsymbol{h}}^+)})} \right] \end{aligned}$$
(9)

where Q and N are constants. For each data sample x, the gradients with respect to \(s^+\) and \(s_i^-\) are:

$$\begin{aligned} \frac{\partial {L}}{\partial s^+}&= \frac{1}{1+\sum _{i=1}^N s_i^- / s^+} \sum _{i=1}^N {\frac{\partial s_i^-/s^+}{\partial s^+}}\end{aligned}$$
(10)
$$\begin{aligned} \frac{\partial {L}}{\partial s_i^-}&= \frac{1}{1+\sum _{i=1}^N s_i^- / s^+} \frac{\partial s_i^-/s^+}{\partial s_i^-} \end{aligned}$$
(11)

The \(L_{\text {Uni}}\) provides balanced gradients to positive and negative samples, i.e., \(\frac{\partial {L}}{\partial s^+} = \sum _i {\frac{\partial {L}}{\partial s_i^-}}\). One can verify that InfoNCE is a special case of \(L_{\text {Uni}}\) obtained by setting \(s^+(\boldsymbol{h}, \hat{\boldsymbol{h}}^+) = e^{\gamma \boldsymbol{h}^T \hat{\boldsymbol{h}}^+}\) and \(s_{i}^-(\boldsymbol{h} ,\hat{\boldsymbol{h}}_i^-) = e^{\gamma \boldsymbol{h}^T \hat{\boldsymbol{h}}_i^-}\).
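
A minimal PyTorch sketch of Eq. (9) with these exponential similarities (the batch layout, function name, and default constants are our assumptions):

```python
import torch

def uni_contrastive_loss(h, h_pos, h_neg, gamma=1.0, Q=1.0):
    """L_Uni of Eq. (9) with exponential similarities, i.e. InfoNCE.
    h: (B, d) pocket embeddings, h_pos: (B, d) bound-ligand embeddings,
    h_neg: (B, N, d) embeddings of sampled negative ligands."""
    s_pos = torch.exp(gamma * (h * h_pos).sum(-1))                   # (B,)
    s_neg = torch.exp(gamma * torch.einsum('bd,bnd->bn', h, h_neg))  # (B, N)
    N = h_neg.shape[1]
    ratio = (Q / N) * (s_neg / s_pos[:, None]).sum(-1)  # (Q/N) * sum_i s_i^- / s^+
    return torch.log1p(ratio).mean()
```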

DebiasedInfoNCE. Uniformly sampling negative ligands from the data distribution \(\hat{q}\) could mistake positive samples for negative ones. Denoting \(h(\cdot )\) as the labeling function, [9] suggests drawing negative samples from the true negative distribution \(\hat{q}_x^-(\hat{x}^-) = p(\hat{x}^-|h(\hat{x}^-) \ne h(x))\). To handle the event \(\{ h(\hat{x}^-) \ne h(x) \}\), the joint distribution \(p(\hat{x},c)=p(\hat{x}|c)p(c)\) over data \(\hat{x}\) and label c is considered. Assuming the class probability \(p(c)=\tau ^+\) is uniform and letting \(\tau ^-=1-\tau ^+\) be the probability of observing any different class, \(\hat{q}\) can be decomposed as \(\tau ^- \hat{q}_x^-(\hat{x}^-) + \tau ^+ \hat{q}_x^+(\hat{x}^-)\). Therefore, \(\hat{q}_x^- = (\hat{q} - \tau ^+ \hat{q}_x^+)/\tau ^-\), and the DebiasedInfoNCE is:

$$\begin{aligned} L_{\text {Debiased}}&= \mathbb {E}_{x \sim p, \hat{x}^+ \sim \hat{p}_x^+, \atop \{\hat{x}_i^-\}_{i=1}^N \sim \hat{q}_x^-} \left[ \log {(1 + \frac{Q}{N} \sum _{i=1}^N \frac{s_{i}^-(\boldsymbol{h} ,\hat{\boldsymbol{h}}_i^-)}{s^+(\boldsymbol{h}, \hat{\boldsymbol{h}}^+)})} \right] \end{aligned}$$
(12)

where \(s^+(\boldsymbol{h}, \hat{\boldsymbol{h}}^+)=e^{\boldsymbol{h}^T \hat{\boldsymbol{h}}^+}\), \(s_{i}^-(\boldsymbol{h} ,\hat{\boldsymbol{h}}_i^-)=e^{\boldsymbol{h}^T \hat{\boldsymbol{h}}_i^-}\). With mild assumptions, the approximated DebiasedInfoNCE can be written as:

$$\begin{aligned} \mathbb {E}_{x \sim p, \hat{x}^+ \sim p_x^+, \atop \{\hat{x}_i^-\}_{i=1}^N \sim \hat{q}} \left[ \log { (1+ \frac{Q}{\tau ^-} \sum _{i=1}^{N} (e^{\boldsymbol{h}^T \hat{\boldsymbol{h}}_i^- - \boldsymbol{h}^T \hat{\boldsymbol{h}}^+} - \tau ^+) ) }\right] \end{aligned}$$
(13)

ChemInfoNCE. Although DebiasedInfoNCE alleviates sampling bias to some extent, it has some shortcomings. First, for classification with discrete labels, the assumption of uniform class probabilities may be too strong, especially for unbalanced datasets. Second, in regression, molecules have continuous chemical properties, and the event \(\{ h(\hat{x}) \ne h(\hat{x}^-) \}\) cannot describe the validity of negative data. To address these issues, we introduce a new event \(\{ \text {sim}(\hat{x},\hat{x}^-)<\tau \}\) to measure the validity of negative samples, where \(\text {sim}(\cdot , \cdot )\) is a chemical similarity function. The underlying assumption is that molecules with lower chemical similarity to the reference ligand are more likely to be true negative samples.

$$\begin{aligned} \begin{aligned} q_x^-(\hat{x}^-)&:= q(\hat{x}^-| \text {sim}(\hat{x}, \hat{x}^-) < \tau ) \\&\propto \max (1-\text {sim}(\hat{x},\hat{x}^-)-\tau ,0 ) \cdot p(\hat{x}^-) \end{aligned} \end{aligned}$$
(14)

By denoting \(w_i = \max (1-\text {sim}(\hat{x},\hat{x}_i^-)-\tau ,0 )\), the final ChemInfoNCE can be simplified as:

$$\begin{aligned} L_{\text {Chem}}&\approx \mathbb {E}_{x \sim p, \hat{x}^+ \sim p_x^+, \atop \{\hat{x}_i^-\}_{i=1}^N \sim \hat{q}} \left[ \log { (1+ \sum _{i=1}^{N} ( \rho _i e^{\boldsymbol{h}^T \hat{\boldsymbol{h}}_i^- - \boldsymbol{h}^T \hat{\boldsymbol{h}}^+} ) ) }\right] \end{aligned}$$
(15)

where \(\rho _i = \frac{w_i}{\sum _{i=1}^N w_i}\).
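
A minimal PyTorch sketch of Eq. (15), assuming the chemical similarities between the positive ligand and each negative candidate have been precomputed (the function name and batch layout are ours):

```python
import torch

def chem_infonce(h, h_pos, h_neg, sim, tau=0.1):
    """ChemInfoNCE, Eq. (15). h: (B, d) pockets, h_pos: (B, d) positives,
    h_neg: (B, N, d) negatives, sim: (B, N) chemical similarity between the
    positive ligand and each negative candidate, tau: similarity threshold."""
    w = torch.clamp(1.0 - sim - tau, min=0.0)              # w_i of Eq. (14)
    rho = w / w.sum(-1, keepdim=True).clamp(min=1e-8)      # normalised weights
    # h^T h_i^- - h^T h^+ for every negative in the batch.
    logits = torch.einsum('bd,bnd->bn', h, h_neg) - (h * h_pos).sum(-1, keepdim=True)
    return torch.log1p((rho * logits.exp()).sum(-1)).mean()
```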

Table 2. Molecule property prediction. We compare different methods across 9 benchmarks. The best and second-best results are highlighted in bold and underlined, respectively.

4 Experiments

In this section, we conduct extensive experiments to verify the effectiveness of the proposed method from three perspectives:

  1. Ligand: Can the ligand model provide competitive results in predicting molecular properties?

  2. Pocket: How does the pre-trained pocket model perform on pocket matching tasks?

  3. Pocket-ligand: Can the joint model find potential binding pocket-ligand pairs, i.e., perform virtual screening?

4.1 Pre-training Setup

Pre-training Dataset. We adopt the BioLip [64] dataset for pre-training \(\text {CoSP}_\text {base}\); the original BioLip contains 573,225 entries as of 2022.04.01. Compared to PDBBind [54] with 23,496 complexes, BioLip contains many more complexes that lack binding affinity labels and thus provides a more comprehensive view for binding mode analysis. To focus on drug-like molecules and their binding pockets, we filter out unrelated complexes containing peptides, DNA, RNA, single ions, etc. In addition, we augment the pre-training data with the CrossDock dataset [18] to develop \(\text {CoSP}_\text {large}\).

Experimental Setting. We pre-train \(\text {CoSP}_\text {base}\) with 6 GGMP layers via the ChemInfoNCE loss, where the hidden feature dimension is 128. We train the model for 50 epochs using the Adam optimizer on NVIDIA A100s, with an initial learning rate of 0.01 and a batch size of 100. The chemical ligand similarity is calculated by RDKit [28]. To achieve better performance, \(\text {CoSP}_\text {large}\) extends the GNN from 6 to 12 layers, increases the hidden dimension from 128 to 1024, and uses the augmented dataset (BioLip+CrossDock).
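
For illustration, the ligand similarity could be computed with RDKit roughly as below; Morgan fingerprints with radius 2 and Tanimoto similarity are our assumed choices, as the paper only states that RDKit is used:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ligand_similarity(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between Morgan fingerprints of two ligands."""
    fp = lambda s: AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(s), radius=2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp(smiles_a), fp(smiles_b))
```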

4.2 Downstream Task 1: Molecule Property Prediction

Experimental Setup. Can the model learn expressive features for molecule classification and regression tasks? We evaluate CoSP on 9 benchmarks collected in MoleculeNet [61]. Following previous research, we use scaffold splitting to generate train/validation/test sets with a ratio of 8:1:1. We report AUC-ROC for classification tasks and RMSE for regression tasks. The means and standard deviations of results over three random seeds are reported by default. We finetune the model using code similar to that of MGSSL [73].
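
Scaffold splitting groups molecules by their Bemis-Murcko scaffolds so that test-set scaffolds are unseen during training. A minimal sketch with RDKit follows; the greedy largest-group-first assignment is one common heuristic, not necessarily the paper's exact procedure:

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: whole scaffold groups are assigned to one set."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smi)].append(i)
    n = len(smiles_list)
    train, valid, test = [], [], []
    # Assign the largest scaffold groups first so no scaffold spans two sets.
    for idx in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(idx) <= frac_train * n:
            train += idx
        elif len(valid) + len(idx) <= frac_valid * n:
            valid += idx
        else:
            test += idx
    return train, valid, test
```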

Baselines. We evaluate CoSP against a broad set of baselines, including D-MPNN [65], Attentive FP [63], \(\text {N-Gram}_{\text {RF}}\), \(\text {N-Gram}_{\text {XGB}}\) [30], MolCLR [56], PretrainGNN [23], GraphMVP-G, GraphMVP-C [32], 3DInfomax [47], MICRO-graph [71], \(\text {GROVER}_{\text {base}}\), \(\text {GROVER}_{\text {large}}\) [40], GEM [16], and Uni-Mol [76]. Most of these baselines are pre-training methods, except for \(\text {N-Gram}_{\text {RF}}\) and \(\text {N-Gram}_{\text {XGB}}\). Some methods mentioned in the related work are not included because their experimental setup, e.g., data splitting, differs from ours.

Results and Analysis. We show the results in Table 2. The main observations are: (1) \(\text {CoSP}_\text {large}\) achieves the best results on 4/9 downstream tasks and top-3 results on 9/9 downstream tasks. (2) Pre-training techniques help improve the model's generalization ability, and the model learns expressive molecular features via co-supervised pre-training. By extending the model size and pre-training data volume, \(\text {CoSP}_\text {large}\) achieves non-trivial performance gains over \(\text {CoSP}_\text {base}\). (3) Through ablation studies, we further verify the superiority of ChemInfoNCE over DebiasedInfoNCE, as it achieves consistent performance gains on various datasets.

Table 3. Pocket matching results. We compare different methods across 10 benchmarks.

4.3 Downstream Task 2: Pocket Matching

Experimental Setup. Can the pre-trained model identify chemically similar pockets? We explore the discriminative ability of the pocket model on pocket matching tasks. To comprehensively understand the potential of the proposed method, we evaluate it on 10 benchmarks recently collected in the ProSPECCTs dataset [15]. For each sub-dataset, the positive and negative pairs of pockets are defined differently according to the research objectives. We summarize five research objectives: O1: Is the model robust to the pocket definition? O2: Is the model robust to pocket flexibility? O3: Can the model distinguish between pockets with different properties? O4: Can the model distinguish dissimilar proteins binding to identical ligands and cofactors? O5: How does the model perform in real-world applications? We report AUC-ROC scores on all benchmarks.

Baselines. We compare CoSP with both classical and deep learning baselines. The classical methods can be divided into profile-based, graph-based, and grid-based approaches. Profile-based methods encode topological, physicochemical, and statistical properties in a unified way for comparing pockets, e.g., SiteAlign [42], RAPMAD [27], TIFP [13], FuzCav [58], PocketMatch [69], SMAP [62], TM-align [72], KRIPO [60], and Grim [13]. Graph-based methods adopt isomorphism detection algorithms to find common motifs, e.g., Cavbase [43], IsoMIF [8], and ProBiS [26]. Grid-based methods represent pockets by regularly spaced pharmacophoric grid points, e.g., VolSite/Shaper [12]. Other tools include SiteEngines [45] and SiteHopper [3]. We also compare with a recent deep learning model, DeeplyTough [46].

Results and Analysis. We present the pocket matching results in Table 3, where the pre-trained model achieves competitive results in most cases. Specifically, CoSP is robust to the pocket definition (O1) and achieves the highest AUC scores on D1 and D1.2. The robustness also holds under conformational variability (O2), where \(\text {CoSP}_\text {large}\) achieves a 1.00 AUC score on D2. It should be noted that robustness to homogeneous pockets does not mean that the model lacks discrimination; on the contrary, the model can identify pockets with different physicochemical and shape properties (O3) on D3 and D4. Compared with the previous deep learning method (DeeplyTough), \(\text {CoSP}_\text {large}\) performs better at distinguishing different pockets bound to the same ligands and cofactors (O4); refer to the results on D5, D5.2, D6, and D6.2. Last but not least, \(\text {CoSP}_\text {large}\) shows good potential for practical applications (O5) with a 0.90 AUC score. In addition, we find that the pocket direction plays a key role in extracting pocket features, as it helps indicate the location of the pocket cavity. As shown in Table 3, pocket matching performance degrades if the directional feature \(\boldsymbol{n}\) is removed.

Table 4. Virtual screening results on DUD-E.

4.4 Downstream Task 3: Virtual Screening

Experimental Setup. Can the model distinguish the molecules most likely to bind to a given pocket? We evaluate CoSP on the DUD-E [34] dataset, which consists of 102 targets across different protein families. For each target, DUD-E provides 224 actives (positive examples) and 10,000 decoy ligands (negative examples) on average. The decoys are chosen to be physically similar but topologically different from the actives. During finetuning, we use the same data splitting as GraphCNN [51] and report AUC-ROC and ROC enrichment (RE) scores. Note that \(x\% \text {RE}\) denotes the ratio of the true positive rate (TPR) to the false positive rate (FPR) at \(x\%\) FPR.
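
For reference, the RE score can be computed directly from the ROC curve as sketched below (scikit-learn based; interpolating the TPR at the requested FPR is our choice):

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_enrichment(y_true, y_score, fpr_level=0.01):
    """RE at a given FPR: TPR(fpr_level) / fpr_level, e.g. fpr_level=0.01 for 1% RE."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    tpr_at = np.interp(fpr_level, fpr, tpr)  # TPR interpolated at the requested FPR
    return tpr_at / fpr_level
```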

Baselines. We compare \(\text {CoSP}_{\text {large}}\) with AutoDock Vina [53], RF-score [2], NNScore [14], 3DCNN [38], GraphCNN [51], DrugVQA [75], GanDTi [55], and AttentionSiteDTI [68]. AutoDock Vina is a commonly used open-source program for molecular docking. RF-score uses random forests to capture protein-ligand binding effects. The other methods use deep learning models to learn protein-ligand binding.

Fig. 2.

Two examples of virtual screening. For each pocket, we show two ligands among the top 1% active molecules identified by the model. We use AutoDock Vina to generate the molecular binding pose and compute the affinity score.

Results and Analysis. We present the results in Table 4 and observe that: (1) The random forest-based RF-score and the MLP-based NNScore achieve results competitive with Vina, indicating the potential of machine learning for virtual screening. (2) The deep learning-based GraphCNN, 3DCNN, DrugVQA, GanDTi, and AttentionSiteDTI significantly outperform both RF-score and NNScore. (3) \(\text {CoSP}_\text {large}\) achieves a competitive AUC score and outperforms all baselines in RE scores. The improvement of \(\text {CoSP}_\text {large}\) suggests that the model can effectively learn protein-ligand interactions from the pre-training data. (4) In addition, we select the top 1% of ligands identified by the model as actives for a given pocket and use AutoDock Vina to validate the docking results. In Fig. 2, the visual results show that our model can identify high-affinity ligands, which is helpful for drug discovery.

5 Conclusion

This paper proposes a co-supervised pre-training framework to learn the joint pocket and ligand space via a chemically inspired contrastive loss. The pre-trained model achieves competitive results on molecule property prediction, pocket matching, and virtual screening. We hope this unified modeling framework can further advance the development of AI-guided drug discovery.