Introduction

Interactions between small molecules and proteins play pivotal roles in various biological processes across organisms. However, the majority of these interactions remain understudied due to experimental constraints and human biases, limiting our understanding of the complex mechanisms governing life and hampering drug discovery efforts.

Metabolite-protein interactions (MPIs) play a crucial role in regulating metabolic pathways, triggering signal transduction, and maintaining cellular balance. However, MPIs are frequently low-affinity and difficult to detect experimentally. A recent study discovered that many overlooked MPIs contribute to the survival and growth of organisms in response to a changing environment1. Additionally, proteome-wide characterization of MPIs provides strong evidence that metabolites serve not only as intermediates in metabolic reactions but also as signaling molecules via interactions with proteins that are not enzymes2,3.

In addition to intra-species MPIs, the microbiome co-evolves with its human host and plays a role in shaping human phenotypes4. In the human body, the microbiota produces an extremely diverse metabolite repertoire that can gain access to and interact with host cells, thus influencing the phenotype of the human host5. For example, butyrate produced by the microbiome binds to diverse human non-enzyme proteins6. The human microbiome is not only associated with a large number of human diseases but also responsible for the efficacy and toxicity of therapeutics (e.g., cancer immunotherapy) that target the human host7,8. Thus, elucidating previously unrecognized interspecies MPIs will shed light on the molecular mechanisms underlying microbiome-human interactions9,10 and offer new opportunities for developing novel therapeutics11.

Beyond MPIs, uncovering novel drug-target interactions (DTIs) will facilitate identifying novel therapeutic targets, understanding polypharmacology, and advancing drug repurposing, thereby accelerating drug discovery and development. Unfortunately, small molecule ligands remain unknown for more than 90% of protein families12. Many of these understudied proteins could be potential drug targets13. The lack of knowledge about small molecule ligands of understudied proteins hinders drug development for presently incurable diseases14. On the one hand, many disease-causing genes are functionally and pharmaceutically uncharacterized15, and developing assays for compound screening of understudied proteins is a time-consuming and high-risk endeavor. On the other hand, a drug acts on a biological system and often interacts with not only its intended target but also unknown off-target(s) that may lead to unexpected adverse reactions or therapeutic effects16. Thus, identifying drug-target interactions, including those involving chemicals with novel scaffolds, will improve the success rate of drug discovery for unmet medical needs17.

Given the experimental challenges in elucidating understudied intra- and inter-species molecular interactions, deep learning offers a promising alternative approach owing to its recent phenomenal successes in Natural Language Processing (NLP) and image processing. A number of deep learning models have been developed to predict drug-target interactions18,19,20. Nevertheless, few methods can accurately and reliably predict understudied molecular interactions due to the dearth of labeled data and out-of-distribution (OOD) problems, in which small molecules or proteins involved in the interaction are significantly different from those in the annotated databases used as training data. A plethora of transfer learning techniques have been developed to address the molecular OOD problem21. However, they offer little help in bridging remote chemical spaces21.

Semi-supervised learning and meta-learning have each shown promise in addressing OOD challenges in protein-ligand interaction predictions12,22. Semi-supervised learning uses labeled data to learn patterns that generalize to unlabeled data. This approach has shown potential in exploring new chemical spaces, as demonstrated by Liu et al.22. On the other hand, meta-learning is an approach to learning to learn, and it has demonstrated superior generalization power across various applications23,24. Cai et al. developed an out-of-cluster meta-learning (OOC-ML) method that notably enhanced generalization performance for OOD protein-chemical interactions12. OOC-ML simulates an OOD scenario: it harnesses common patterns extracted from predicting ligands across distinct protein clusters (meta-model) and generalizes this knowledge to a different protein cluster. These two techniques complement each other: semi-supervised learning explores unlabeled OOD data while meta-learning exploits labeled data. To our knowledge, no method has been developed that combines these approaches to overcome data scarcity and address OOD challenges in predicting molecular interactions. Additionally, the state-of-the-art semi-supervised learning method uses a teacher-student model in which the teacher model is fixed in each iteration22, which may lead to confirmation bias in pseudo-labeling, another problem that needs to be addressed.

In this paper, we have developed MMAPLE - Meta Model Agnostic Pseudo Label Learning - to address the aforementioned challenges. MMAPLE incorporates the concepts of meta-learning and transfer learning into a semi-supervised learning framework. Under a meta-learning framework, the student model in MMAPLE constantly sends feedback to the teacher to reduce confirmation bias. MMAPLE is effective in exploring unlabeled data and addressing the OOD problem. We have demonstrated that MMAPLE significantly improves the accuracy of DTI, human metabolite-enzyme interaction, and understudied microbiome-human MPI predictions on multiple base models in the OOD setting. Using MMAPLE, we have identified and experimentally validated novel microbiome-human MPIs and proposed their associations with human physiology. Our findings suggest that MMAPLE can be a general framework for investigating understudied biological problems.

Results

Overview of MMAPLE

We evaluate the proposed MMAPLE method on three diverse OOD cases: novel DTIs, hidden human MPIs, and understudied microbiome-human MPIs. The statistics of training/validation and testing data are shown in Fig. 1A. In brief, for OOD DTIs and human MPIs, no chemicals in the testing data have a Tanimoto coefficient larger than 0.5 to those in the training/validation set. Details of the distribution of chemical similarities between training/validation and testing data are shown in Fig. 1C. Although 1.7% of the chemicals in the testing data are similar to those in the training/validation data of microbiome-human MPIs (Tanimoto coefficient larger than 0.5), the proteins are significantly different based on e-value, as shown in the lower-right panel of Fig. 1C. Furthermore, there are no labeled microbiome-human MPIs in the training data. Thus, all benchmarks represent a challenging scenario of label scarcity and/or OOD prediction.

Fig. 1: Data statistics and framework illustration.
figure 1

A Statistics of training, validation, and testing data used in this study. B Illustration of the MMAPLE framework. A deep learning model is trained using both labeled and unlabeled data and is iteratively updated using gradients from the trained model as metadata. C Chemical and protein similarity distributions between training/validation and testing datasets. Upper left: chemical similarity distribution of OOD DTI; upper right: chemical similarity distribution of hidden human MPI; lower left and right: chemical and protein similarity distributions of the zero-shot microbiome-human MPI experiment. Chemical similarity is quantified by the Tanimoto coefficient of chemical fingerprints. Protein similarity is measured by the negative logarithm (base 10) of the e-value derived from BLAST61.

The uniqueness of MMAPLE in predicting understudied OOD molecular interactions is threefold, as shown in Fig. 1B. Firstly, MMAPLE iteratively transfers knowledge from observed molecular interactions to the unexplored chemical genomics space of interest, employing a teacher-student approach. Secondly, unlike a conventional teacher-student model, the teacher model receives feedback from the student model to perform a meta-update, in line with meta-learning. Lastly, akin to transfer learning, the training of the student model incorporates a new target domain sampling strategy, whose aim is to guarantee that the unlabeled target domain of interest mirrors the distribution of the labeled source domain and to increase sampling efficiency. This alignment helps the model acquire a more robust and generalizable representation of the data. In contrast, untargeted random sampling of unlabeled data during training may lead to a data distribution markedly divergent from that of the target domain, owing to the astronomical size of the chemical genomics space. For the three cases in this study, the target domain differs and is defined such that (1) its data distribution is significantly different from the labeled source domain to avoid data leakage, and (2) the data are relevant to the problem of interest. For example, for DTI prediction, only drug targets are sampled.

We trained base binary classification models using labeled molecular interactions from ChEMBL25. The base models used in this study included four state-of-the-art models for chemical-protein interaction prediction: the pre-trained protein language model DISAE26, TransformerCPI18, DeepPurpose19, and BACPI20. MMAPLE was then applied to these models to explore the unlabeled molecular interaction space. MMAPLE first initializes a teacher model using the labeled data. A target domain sampling strategy is then applied to select a set of unlabeled data from the large space of understudied OOD DTIs or MPIs. The pre-trained teacher model makes predictions on the selected unlabeled data and assigns labels to them (pseudo labels). Next, a student model is trained using the pseudo-labeled data. Different from a conventional teacher-student model, the student model is evaluated on labeled data and provides feedback (metadata) to the teacher model while the student is being trained. Finally, the teacher model is updated based on the performance of the student and generates new pseudo labels. This process repeats until training converges; the number of iterations depends on the base model and the problem of interest. The details of MMAPLE are in the Method section.

MMAPLE significantly improves the performance of OOD DTI predictions

We first evaluated the performance of MMAPLE for OOD DTI predictions. We used molecular interactions in ChEMBL25 and HMDB27 as training data and annotated DTIs from DrugBank28 as testing data. To simulate an OOD scenario, we removed from the training data all chemicals that are structurally similar to drugs in the testing data (Tanimoto coefficient >0.5). As shown in Fig. 2, both PR-AUCs and ROC-AUCs of MMAPLE are significantly improved over all base models, with p-values less than 0.05 (Supplemental Tables 1, 2). The improvement in PR-AUC ranges from 13% to 26%. Furthermore, the trained models are less overfitted than the base models, as supported by the narrow gaps between the training curves of validation and testing data shown in Supplemental Fig. 1.

Fig. 2: OOD DTI prediction outcomes when applying MMAPLE to base models.
figure 2

A ROC curves. B PR curves. C UMAP visualization of chemical space. Top to bottom - original fingerprints, baseline embeddings from DISAE, and MMAPLE embeddings. Orange and blue dots for MMAPLE and baseline, respectively.

The superior performance of MMAPLE may be because it better aligns the embedding space of OOD samples with that of the training data. To test this hypothesis, we investigated whether MMAPLE could alleviate the distribution shift between training and testing data. We extracted the embeddings of the training and testing examples before training, as well as those produced by DISAE and MMAPLE, and then used Uniform Manifold Approximation and Projection (UMAP) for visual analysis. Figure 2C supports this hypothesis. Before training, the embeddings of training chemicals are scattered around those of testing chemicals. While DISAE - the best-performing base model - narrows the dispersion, our model achieves a tighter overlap between the two distributions. Importantly, our model not only draws them closer but also ensures a more uniform distribution within each group, reducing inter-distribution gaps.
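For illustration, the following is a minimal sketch of such a UMAP comparison using the umap-learn package; the embedding files and parameter values are placeholders, not those used in this study.

```python
# Minimal sketch of the UMAP comparison in Fig. 2C. Assumptions: chemical
# embeddings for the training and testing sets saved as 2-D numpy arrays
# (placeholder file names below).
import numpy as np
import umap
import matplotlib.pyplot as plt

train_emb = np.load("train_embeddings.npy")   # hypothetical file
test_emb = np.load("test_embeddings.npy")     # hypothetical file

# Fit one projection on both sets so the two clouds share coordinates.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
coords = reducer.fit_transform(np.vstack([train_emb, test_emb]))

n_train = len(train_emb)
plt.scatter(coords[:n_train, 0], coords[:n_train, 1], s=2, label="train")
plt.scatter(coords[n_train:, 0], coords[n_train:, 1], s=2, label="test")
plt.legend()
plt.show()
```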

Transfer learning in the protein space via protein language modeling can improve the performance of DTI prediction26. As shown in Fig. 2A, B and Fig. 3A, B in the next section, DISAE, which is based on a pre-trained protein language model, outperforms the other baselines that do not utilize a language model. However, the improvement from DISAE is not as significant as that from MMAPLE. Additionally, we studied whether transfer learning in the chemical space could boost the performance of OOD DTI predictions by applying chemical pretraining and fine-tuning based on the self-supervised Motif Learning Graph Neural Network (MoLGNN)29. Consistent with recent findings21, no improvement was detected, as shown in Supplemental Table 3.

Fig. 3: Hidden Human MPI prediction outcomes when applying MMAPLE to base models.
figure 3

A ROC curves. B PR curves. C UMAP visualization of chemical space. Top to bottom - original fingerprints, baseline embeddings from DISAE, and MMAPLE embeddings. Orange and blue dots for MMAPLE and baseline, respectively.

MMAPLE significantly improves the performance of hidden OOD human MPI predictions

We next evaluated the performance of the MMAPLE model in predicting hidden human MPIs. We first trained the model using ChEMBL, which primarily includes exogenous small molecule ligands and druggable protein targets. We then evaluated the trained model on human MPIs from the Human Metabolome Database (HMDB)27. The test cases were in the OOD setting, as supported by the chemical similarity distributions (Fig. 1C).

Figure 3A, B indicates that MMAPLE significantly outperforms all state-of-the-art base models on both ROC and PR curves. The ROC-AUC and PR-AUC increase by 17% to 20% and 17% to 30%, respectively, suggesting that MMAPLE can accurately predict hidden human MPIs in an OOD setting.

Again, MMAPLE training brings the embeddings of testing samples closer to those of the training data than the baseline does, as shown in Fig. 3C. Overall, our results suggest that MMAPLE significantly outperforms the state-of-the-art methods for OOD DTI and hidden MPI predictions.

MMAPLE significantly improves the performance of understudied OOD interspecies MPI predictions and reveals the molecular basis of microbiome-human interactions

Known interspecies microbiome-human MPIs are extremely scarce: only 17 observed active interactions are available (see Methods for details). To investigate interspecies interactions, MMAPLE was trained on a combination of three datasets (HMDB, ChEMBL, and NJS1630), while the test set consisted of the 17 annotated positive along with 145 negative microbiome-human MPIs from the literature31,32. As shown in Fig. 1C, no metabolite-protein pairs in the testing set have chemicals or proteins similar to those in the training/validation set. Because no interspecies MPIs exist in the training and validation sets, the problem is a zero-shot learning scenario. The previously best-performing model, DISAE, performs poorly, with a PR-AUC of 0.193, indicating that transfer learning alone is not sufficient to address the OOD challenge of interspecies MPI predictions. Our results, presented in Fig. 4A, demonstrate that MMAPLE significantly outperforms DISAE in terms of ROC and PR, achieving a three-fold increase in PR-AUC for interspecies MPI predictions. These findings indicate that MMAPLE holds promise for deepening our comprehension of interspecies interactions, thus serving as a valuable tool for investigating the impact of the microbiome on human health and disease.

Fig. 4: Results of microbiome metabolite-human protein interaction predictions.
figure 4

A ROC and PR curves of models tested on 17 literature-annotated positive and 145 negative microbiome-human MPIs. B Top 7 predicted G-protein coupled receptor (GPCR) genes that interact with TMAO and GPCR functional assay results, with p-values indicating the tail probability from kernel density estimation; C Predicted 3D binding pose of TMAO in the CXCR4 antagonist conformation; D Interaction patterns between TMAO and CXCR4; E Proposed molecular mechanism of TMAO-human interactions. No assay is available for GNRHR and ADGRA3. Antagonist: a molecule that blocks the receptor, preventing activation and the usual cellular response. Agonist: a molecule that activates the receptor, triggering a cellular response.

To further validate the performance of MMAPLE, we predicted and experimentally validated the interactions between trimethylamine N-oxide (TMAO) and human G-protein coupled receptors (GPCRs). TMAO is a small molecule generated by gut microbial metabolism. Elevated plasma levels of TMAO have been observed to increase the risk for major adverse cardiovascular events33, activate inflammatory pathways34, and promote foam cell formation35. Additionally, TMAO inhibits insulin signaling36. However, it remains elusive how TMAO modulates these pathological processes at a molecular level. Besides its biological interest, TMAO is one of the most challenging molecules for MMAPLE. Firstly, current studies of microbiome-human interactions mainly focus on short-chain fatty acids, so few data exist for TMAO. Secondly, the chemical structure of TMAO is significantly different from those of known metabolites involved in microbiome-human MPIs, as shown in Supplemental Fig. 2. Thus, we chose TMAO to rigorously evaluate MMAPLE in an OOD scenario.

Figure 4B lists the top 7 predicted GPCR genes that interact with TMAO, each with a p-value less than 5.0e-6 (approximate false discovery rate of 0.05). We performed GPCR functional assays to experimentally test the binding activities of five of them at a TMAO concentration of 30 μM, the same concentration used in a previous study and within the physiological concentration range of TMAO in humans (1-45 μM)37. Assays for the two top-ranked GPCRs GNRHR and ADGRA3 are not available. As shown in Fig. 4B, TMAO acts as an antagonist at all five tested GPCRs, blocking receptor activity, with the strongest activity at CXCR4 (activity score >30). Other top-ranked predictions can be found in Supplemental Table 4. We also performed additional experiments to analyze the predictions from the baseline model (DISAE); MMAPLE significantly outperforms the base model, as shown in Supplemental Figure 3. The full predictive results are in Supplemental Table 5.

Protein-ligand docking with AutoDock Vina38 suggests that TMAO can fit into the antagonist conformation of the CXCR4 structure, as shown in Fig. 4C, D. Among the interacting residues, TRP94, TYR116, and GLU288 also interact with the co-crystallized ligand in the CXCR4 structure (PDB ID: 3ODU). TYR116 and GLU288 provide attractive charges to the nitrogen atom of TMAO, and ARG188 forms a conventional hydrogen bond with an oxygen atom of TMAO. These strong interactions could keep TMAO in the binding pocket. The CXCR4 antagonism by TMAO suggests a causal linkage for observed microbiome TMAO-human interactions, as illustrated in Fig. 4E. It is known that CXCR4 regulates the PI3K and RAF/RAS/MEK pathways39 (KEGG Pathway: https://www.genome.jp/pathway/hsa04062). The PI3K pathway regulates bile acid synthesis39, and TMAO's inhibition of bile acid synthesis may be responsible for its promotion of atherosclerosis33. The physiological effect of TMAO on obesity and insulin resistance may act via the CXCR4-RAF/RAS/MEK axis: deficiency of CXCR4 and impaired RAF/RAS/MEK signaling have been observed to result in obesity and insulin resistance40,41,42,43. Thus, the microbiome TMAO-human CXCR4 interaction may be responsible for several of the observed pathological effects of TMAO. However, other TMAO effects, such as inflammation, cannot be directly explained by the TMAO-CXCR4 interaction; it is possible that other human proteins interact with TMAO.

While the other top-predicted GPCRs (GLP1R, GIPR, CALCRL, C3AR1) show much weaker binding to TMAO than CXCR4 at the tested TMAO concentration, the antagonist (inhibitory) effect of TMAO could be amplified at higher TMAO concentrations, such as after food consumption, and aligns with experimental evidence. GLP1R activation is known to reduce inflammation, suggesting that TMAO inhibition of GLP1R might increase inflammation44,45. Similarly, studies have shown that gut GIPR is associated with diet-induced inflammation and insulin resistance46. Modulation of CALCRL is associated with insulin resistance47, and its deletion can worsen intestinal inflammation48. C3AR1 plays a protective role against atherosclerosis49, implying that TMAO blocking C3AR1 activity might increase the risk of atherosclerosis.

Semi-supervised learning, meta-learning, and target domain sampling synergistically contribute to the performance of MMAPLE

In our comprehensive ablation study, we rigorously examined the influence of several key components on the model's performance: the introduction of target domain sampling and semi-supervised learning with pseudo labels, the choice between soft and hard pseudo labels, and the application of meta-learning.

When excluding meta-learning from the training process, we kept the teacher model static, restricting it to generating constant pseudo labels for the student model to learn from. This led to a performance decline compared to the full MMAPLE model, as shown in Table 1, underscoring the critical role of meta-learning's iterative feedback.

Table 1 Ablation study

To investigate the effect of teacher-student training, we trained the model solely with meta-learning by leveraging the Model-Agnostic Meta-Learning (MAML) framework50. This approach, while consistently outperforming the baseline, yielded a PR-AUC well below that of MMAPLE (a 124% relative difference). This experiment not only demonstrated the intricate dependencies between meta-learning and semi-supervised learning but also underscored that the synergy of these techniques is necessary for superior model performance.

Models trained on one-hot (hard) labels are subject to over-fitting since they do not represent soft decision boundaries across concepts. Soft labels, which are probability distributions over the possible classes, are often more effective because they provide the model with more information about uncertainty in the data and confer robustness against label noise, resulting in more robust predictions51,52. As shown in Table 1, using soft labels improved ROC-AUC by 25% and doubled PR-AUC.

The objective of target domain sampling aligns with that of transfer learning. As shown in Table 1, target domain sampling significantly increased the performance of the model, by 11% in ROC-AUC and 115% in PR-AUC, demonstrating the effectiveness of this strategy in improving the performance of MMAPLE.

In summary, integrating meta-learning, target domain sampling, and soft labeling into a teacher-student framework yields superior performance compared to each of these approaches individually, as well as any combination of two of them.

Discussion

In this study, we present MMAPLE, a highly effective deep learning framework designed to address the challenges of data scarcity and OOD problems encountered when applying machine learning to understudied biological domains where transfer learning is less effective. Through extensive evaluations, we have demonstrated the exceptional capabilities of MMAPLE in exploring the unlabeled data space and facilitating knowledge transfer from one chemical space to another. Using MMAPLE, we successfully predicted and experimentally validated novel interactions between microbiome metabolites and human proteins, thereby shedding light on the intricate interplay between these components. Notably, our framework does not rely on a specific model and can accommodate various deep-learning architectures tailored to specific biological tasks. Thus, MMAPLE serves as a versatile and robust framework for investigating a wide range of understudied biological problems.

MMAPLE shows potential for improvement in several key areas. Firstly, the current implementation of MMAPLE lacks the ability to estimate the uncertainty associated with pseudo labels. By incorporating an accurate uncertainty quantification mechanism, it becomes possible to select high-confidence pseudo labels during training, thereby reducing the impact of noise. Secondly, the process of sampling pseudo labels in a vast and imbalanced chemical-protein interaction space is time-consuming, particularly when aiming to achieve a desired positive-versus-negative ratio. The performance of MMAPLE could be further enhanced by an unbiased and efficient sampling strategy, for example, sampling based on protein family or chemical similarity clustering. Thirdly, while MMAPLE has thus far been applied exclusively to classification problems, it would be interesting to extend it to regression problems. Lastly, the meta-update in the current implementation of MMAPLE uses in-distribution data; incorporating OOC meta-learning may further improve the generalization power of MMAPLE in an OOD setting. These directions are left for future study.

Method

Data sets

Experiment 1: DTI prediction

Training/validation data

We used molecular interactions in ChEMBL25 and HMDB27 as training and validation data. It contained 298,736 total pairs with 230,811 unique chemicals and 3084 unique proteins.

OOD testing data

The annotated DTIs from DrugBank28 were used as testing data. To simulate an OOD scenario, we removed from the training data all chemicals that are structurally similar to drugs in the testing data (Tanimoto coefficient > 0.5). The testing data totaled 21,760 pairs, including 8917 unique chemicals and 3266 proteins.
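For illustration, the following is a minimal sketch of this similarity filter using RDKit Morgan fingerprints; the SMILES lists are placeholders.

```python
# Minimal sketch of the OOD filter: drop training chemicals with a
# Tanimoto coefficient > 0.5 to any test drug. The SMILES lists are
# placeholders, not the actual ChEMBL/DrugBank data.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

train_smiles = ["CCO", "c1ccccc1O"]           # placeholder chemicals
test_smiles = ["CC(=O)Oc1ccccc1C(=O)O"]       # placeholder drugs

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

test_fps = [fingerprint(s) for s in test_smiles]

def is_ood(smiles, cutoff=0.5):
    """Keep a training chemical only if no test drug is similar to it."""
    fp = fingerprint(smiles)
    return all(DataStructs.TanimotoSimilarity(fp, t) <= cutoff
               for t in test_fps)

train_smiles_kept = [s for s in train_smiles if is_ood(s)]
```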

Unlabeled data

To focus on the unexplored domain of interest, a target domain sampling strategy was developed. Specifically, we selected unlabeled pairs of drug targets and drug-like chemicals but excluded already labeled pairs. For each chemical, we sampled six proteins, resulting in 53,502 total unlabeled pairs. The detailed data statistics can be found in Fig. 1.
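A minimal sketch of this pairing step is shown below; the data structures and function names are illustrative assumptions.

```python
# Minimal sketch of target domain sampling for Experiment 1: pair each
# drug-like chemical with six randomly chosen drug targets, skipping
# pairs that are already labeled. Inputs are toy placeholders.
import random

def sample_unlabeled(chemicals, drug_targets, labeled_pairs, k=6, seed=0):
    rng = random.Random(seed)
    pairs = []
    for chem in chemicals:
        # Exclude pairs that already carry a label.
        candidates = [t for t in drug_targets if (chem, t) not in labeled_pairs]
        pairs.extend((chem, t) for t in rng.sample(candidates, k))
    return pairs

# Example with toy identifiers.
unlabeled = sample_unlabeled(
    chemicals=[f"chem{i}" for i in range(3)],
    drug_targets=[f"prot{j}" for j in range(10)],
    labeled_pairs={("chem0", "prot1")},
)
```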

Experiment 2: Human MPI prediction

Training/validation data

The training data for this experiment was sourced from ChEMBL (version 29)25. It consisted of 334,668 pairs with 252,712 unique compounds and 5204 unique proteins, where each pair represented an activity with a single protein as the target.

OOD testing data

For testing, we utilized HMDB27, which provides interactions between metabolites and human enzymes. We randomly sampled 10,000 pairs as the testing data, covering 8921 unique compounds and 2611 unique proteins.

Unlabeled data

To create the unlabeled dataset, we considered all unlabeled metabolite-enzyme pairwise combinations. We included all unique metabolites and randomly selected two enzymes to associate with each chemical. This resulted in a sizeable unlabeled dataset of 44,644 samples. The detailed data statistics can be found in Fig. 1.

Experiment 3: Microbiome-human MPI prediction

Training/validation data

For this experiment, the training data consisted of a combination of ChEMBL, HMDB, and NJS1630 datasets. After removing duplicates and unusable data, the dataset contained a total of 1,667,708 samples including 357,213 unique compounds and 168,517 unique proteins.

OOD testing dataset

The testing dataset was manually created from two published studies. The first31 provided information on interactions between 241 GPCRs and metabolites from simplified human microbiomes (SIHUMIs) consisting of seven representative bacterial species. The second32 screened gut microbiota metabolomes to identify ligands for various GPCRs. Since this study focused on small molecule metabolites, lipids were excluded, resulting in a total of 162 MPIs, including 17 positive activities.

Unlabeled data

For the protein side, we included all GPCRs from UniProt53. In addition, an equal number of proteins were randomly selected from the Pfam dataset. Chemical samples were the 240 unique metabolites from the NJS16 dataset. Overall, the unlabeled data consisted of 73,238 pairs. The detailed data statistics can be found in Fig. 1.

MMAPLE base models

To enable a fair comparison with the baseline models, we currently focus on binary classification problems. Four state-of-the-art base models were employed to evaluate the performance of MMAPLE:

  • DISAE26. Distilled Sequence Alignment Embedding (DISAE) is a method developed by us that includes three major modules: a protein language model, chemical structure modeling, and a module combining the two. The protein sequence module uses distilled sequence alignment embedding, leveraging a transformer-based architecture trained on nearly half a million protein domain sequences to generate meaningful protein embeddings. This is crucial for predicting protein-ligand interactions in out-of-distribution (OOD) scenarios. The chemical module is a graph isomorphism network (GIN) that obtains chemical features, numerical representations of small molecules that capture their chemical properties. Finally, DISAE includes an attentive pooling module that combines the protein and chemical embeddings from the first two modules to produce the final output for predicting DTIs or MPIs as a binary classification task (i.e., active or inactive). The attentive pooling module uses a cross-attention mechanism to weigh the importance of each protein and chemical embedding, allowing it to focus on the most relevant information when making the prediction; a generic sketch of such a pooling head is given after this list. Lbase denotes the loss function of the base model, which is a binary cross-entropy loss in this case.

  • TransformerCPI18. Adapted from the transformer architecture, TransformerCPI takes a protein sequence as the input to the encoder and an atom sequence as the input to the decoder, and learns the interaction in the final layers. Specifically, the amino acid sequence is embedded with a Word2vec model pre-trained on all human protein sequences in UniProt, and the self-attention layers in the encoder are replaced with a gated convolutional network that outputs the final representation of proteins. The atom features of chemicals are learned through a graph convolutional network (GCN) by aggregating neighboring atom features. The interaction features are then obtained by the transformer decoder, which consists of self-attention layers and feed-forward layers.

  • DeepPurpose19. DeepPurpose provides a library for DTI prediction incorporating seven protein encoders and eight compound encoders to learn protein and compound representations, respectively, and feeds the learned embeddings into an MLP decoder to generate predictions. We implemented the best-reported architecture, a convolutional neural network (CNN) for both protein and compound feature representation learning, as another base model of MMAPLE.

  • BACPI20. The last base model included in this study is the Bi-directional Attention neural network for Compound-Protein Interaction (BACPI). It consists of chemical representation learning, protein representation learning, and a CPI prediction component that combines them. BACPI employs a graph attention network (GAT) to learn rich structural information from molecular graphs. For proteins, it introduces a CNN module that takes the amino acid sequence as input and learns local contextual features by using a context window to split the sequence into overlapping subsequences of amino acids. Finally, the atom structure graphs and residue sequence features are fed into the bi-directional attention neural network, which integrates the representations and captures the important regions of compounds and proteins; the integrated features are used to predict the CPI.
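The sketch below illustrates a generic attentive (cross-attention) pooling head of the kind used by DISAE's final module. It is a minimal example under stated assumptions: the dimensions, parameterization, and exact scoring scheme are illustrative, not the published architecture.

```python
# Generic attentive-pooling head: a bilinear score matrix between
# residue and atom embeddings drives attention over each side, and the
# pooled vectors are concatenated for binary classification.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, d_prot, d_chem):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_prot, d_chem) * 0.01)
        self.out = nn.Linear(d_prot + d_chem, 2)   # active / inactive

    def forward(self, prot, chem):
        # prot: (B, Lp, d_prot) residue embeddings; chem: (B, La, d_chem) atoms.
        scores = torch.einsum('bpd,de,bae->bpa', prot, self.W, chem)
        # Attention over residues (max over atoms) and vice versa.
        a_prot = torch.softmax(scores.max(dim=2).values, dim=1)  # (B, Lp)
        a_chem = torch.softmax(scores.max(dim=1).values, dim=1)  # (B, La)
        p_vec = (a_prot.unsqueeze(-1) * prot).sum(dim=1)         # (B, d_prot)
        c_vec = (a_chem.unsqueeze(-1) * chem).sum(dim=1)         # (B, d_chem)
        return self.out(torch.cat([p_vec, c_vec], dim=-1))       # logits

# Example with random inputs.
pool = AttentivePooling(d_prot=256, d_chem=128)
logits = pool(torch.randn(4, 100, 256), torch.randn(4, 30, 128))
```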

Semi-supervised meta-learning

We adopted a semi-supervised meta-learning paradigm for model training. As in standard pseudo-labeling, there is a pair of teacher and student models: the teacher model takes unlabeled data as input, and its predictions serve as pseudo labels for the student model, which learns from the combination of labeled and pseudo-labeled data. However, instead of learning from a fixed teacher model, the student constantly sends feedback to the teacher in the form of its performance on labeled data, and the teacher keeps updating the pseudo labels on every mini-batch. This strategy can alleviate the problem of confirmation bias in pseudo-labeling54. The illustration of MMAPLE training is shown in Fig. 5. Let T and S denote the teacher model and the student model, and θT and θS the corresponding parameters (\({\theta }_{T}^{{\prime} }\) and \({\theta }_{S}^{{\prime} }\) denote the updated parameters). We use \({{\mathcal{L}}}\) to represent the loss function and T(xu; θT) to stand for the teacher's predictions on unlabeled data xu, with similar notation for S(xu; θS) and \(S({x}_{l};{\theta }_{S}^{{\prime} })\). CE denotes the cross-entropy loss.

Fig. 5: Illustration of MMAPLE training schema.
figure 5

A teacher model generates pseudo labels by predicting a batch of unlabeled data. The pseudo labels are further passed to a filter to control the balance ratio of positive and negative samples (a hyperparameter). A student model generates predictions from the same unlabeled data as those used in the teacher model and is updated by minimizing the loss function \({{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S})\) as in equation (3). Then, the updated student model takes a batch of labeled data and generates new predictions that are compared with the ground-truth labels, minimizing the loss \({{{\mathcal{L}}}}_{l}({\theta }_{S}^{{\prime} })\) as in equation (6).

Model training

MMAPLE does not work alone but is built on top of other models. The training process is repeated until optimization converges. The number of iterations depends on the base model and training data, so it varies accordingly. To ensure a fair comparison with the base models, both MMAPLE and base models were constructed using the same architecture. The detailed training procedure is shown in Algorithm 1.

The update rule of student

On a batch of unlabeled data xu, sample T(xu; θT) from the teacher's prediction and optimize the student model with the objective

$${\min }_{{\theta }_{S}}{{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S})$$
(1)

where

$${{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S}):={{\mathbb{E}}}_{{x}_{u}}[CE(T({x}_{u};{\theta }_{T}),S({x}_{u};{\theta }_{S}))]$$
(2)

The optimization of each mini-batch is performed as

$${\theta }_{S}^{{\prime} }={\theta }_{S}-{\eta }_{S}{\nabla }_{{\theta }_{S}}{{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S})$$
(3)

The update rule of teacher

On a batch of labeled data (xl, yl), use the student's update to optimize the teacher model with the objective

$${\min }_{{\theta }_{T}}{{{\mathcal{L}}}}_{l}\left({\theta }_{S}-{\eta }_{S}{\nabla }_{{\theta }_{S}}{{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S})\right)$$
(4)

where

$${{{\mathcal{L}}}}_{l}({\theta }_{S}^{{\prime} }):={{\mathbb{E}}}_{{x}_{l},{y}_{l}}[CE({y}_{l},S({x}_{l};{\theta }_{S}^{{\prime} }))]$$
(5)

The optimization of each mini-batch is performed as

$${\theta }_{T}^{{\prime} }={\theta }_{T}-{\eta }_{T}{\nabla }_{{\theta }_{T}}{{{\mathcal{L}}}}_{l}\left({\theta }_{S}-{\eta }_{S}{\nabla }_{{\theta }_{S}}{{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S})\right)$$
(6)

We experimented with both hard and soft labels. Owing to the superior performance of soft labels over hard labels, the final MMAPLE was trained using soft labels. The two variants are described as follows:

Using soft labels

Because we always treat θS as fixed parameters when optimizing Equation (6) and ignore its higher-order dependence on θT, the objective is fully differentiable with respect to θT when soft pseudo labels are used, i.e., when T(xu; θT) is the full distribution predicted by the teacher model. This allows us to perform standard back-propagation to obtain the gradient.

Additionally, we incorporated temperature scaling to soften the teacher model's predictions55. T(xu; θT) is the teacher's output distribution computed by applying a softmax over the logits z: \({{\mbox{softmax}}}{(z)}_{i}=\frac{\exp ({z}_{i}/T)}{\mathop{\sum }_{j=1}^{n}\exp ({z}_{j}/T)}\), where the temperature parameter T controls the "softness" of the output probabilities. In the implementation, the temperature was tuned by hyperparameter search.
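To make the update rules above concrete, the following is a minimal PyTorch sketch of one teacher-student meta-update with temperature-scaled soft pseudo labels. It uses the third-party higher library for the differentiable student step; the model, optimizer, and hyperparameter names are illustrative assumptions, not the published implementation.

```python
# One MMAPLE-style meta-update with soft pseudo labels (Eqs. (1)-(6)).
import torch
import torch.nn.functional as F
import higher  # pip install higher

def mmaple_step(teacher, student, teacher_opt, student_opt,
                x_u, x_l, y_l, temperature=2.0):
    # 1. Teacher produces soft pseudo labels; the graph is kept so the
    #    labeled loss can later back-propagate into the teacher.
    pseudo = F.softmax(teacher(x_u) / temperature, dim=-1)

    with higher.innerloop_ctx(student, student_opt,
                              copy_initial_weights=False) as (fstudent, diffopt):
        # 2. Differentiable student update on pseudo-labeled data (Eq. 3):
        #    soft cross-entropy between teacher and student distributions.
        log_q = F.log_softmax(fstudent(x_u), dim=-1)
        loss_u = -(pseudo * log_q).sum(dim=-1).mean()
        diffopt.step(loss_u)

        # 3. The updated student is evaluated on labeled data (Eq. 5);
        #    this is the student's feedback to the teacher.
        loss_l = F.cross_entropy(fstudent(x_l), y_l)

        # 4. Teacher meta-update (Eq. 6): the gradient of the labeled
        #    loss flows through the student's update back to the teacher.
        teacher_opt.zero_grad()
        loss_l.backward()
        teacher_opt.step()

    # 5. Commit an ordinary (non-differentiable) student step on the
    #    same batch, with the pseudo labels now treated as constants.
    student_opt.zero_grad()
    log_q = F.log_softmax(student(x_u), dim=-1)
    (-(pseudo.detach() * log_q).sum(dim=-1).mean()).backward()
    student_opt.step()
```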

For quality control of soft labels, we employed a balance sampler to control the ratio between positive and negative hard labels derived from the soft labels. This provides a mechanism to dynamically adjust the positive-to-negative ratio during training. This ratio served as a crucial parameter governing the training process, enabling us to strike a balance between the two label categories and thereby alleviate bias and imbalance in the dataset.
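A minimal sketch of such a balance sampler is given below; the target positive fraction and threshold are treated as hyperparameters, and all names are illustrative assumptions.

```python
# Subsample pseudo-labeled pairs so that roughly `pos_frac` of the kept
# examples carry a provisional positive (active) hard label.
import torch

def balance_indices(prob_active, pos_frac=0.5, threshold=0.5):
    pos = torch.nonzero(prob_active >= threshold).flatten()
    neg = torch.nonzero(prob_active < threshold).flatten()
    # Downsample whichever side exceeds the requested ratio.
    n_pos = min(len(pos), int(pos_frac / (1.0 - pos_frac) * len(neg)))
    n_neg = min(len(neg), int((1.0 - pos_frac) / pos_frac * n_pos))
    pos = pos[torch.randperm(len(pos))[:n_pos]]
    neg = neg[torch.randperm(len(neg))[:n_neg]]
    return torch.cat([pos, neg])

# Example: keep a balanced subset of a batch of teacher probabilities.
keep = balance_indices(torch.rand(512))
```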

Using hard labels

When using hard pseudo labels, we followed the derivative rule proposed in ref. 54, a slightly modified version of REINFORCE, to obtain the approximated gradient of \({{{\mathcal{L}}}}_{l}\) in Equation (6) with respect to θT as follows:

$$h={\eta }_{S}\cdot {\left({\nabla }_{{\theta }_{S}^{{\prime} }}CE\left({y}_{l},S\left({x}_{l};{\theta }_{S}^{(t+1)}\right)\right)\right)}^{\top }\cdot {\nabla }_{{\theta }_{S}^{{\prime} }}CE\left({\hat{y}}_{u},S\left({x}_{u};{\theta }_{S}^{(t)}\right)\right)$$
(7)

The teacher’s gradient from the student’s feedback:

$${g}_{T}^{(t)}=h\cdot {\nabla }_{{\theta }_{T}}CE\left({\hat{y}}_{u},T({x}_{u};{\theta }_{T})\right)\Big|_{{\theta }_{T}={\theta }_{T}^{(t)}}$$
(8)

Algorithm 1

Training procedure

Require: N, the batch size; nsup, the number of epochs of supervised training; nfreeze, the number of epochs during which the teacher model is frozen; n, the total number of training epochs

Input: Xun, Xl

Stage 1:

for epoch = 1 to nsup do

for t = 1 to \(\frac{{N}_{l}}{N}\) do

sample Xl of size N from the labeled data (without replacement)

Update θT with Lbase

end for

end for

save the model with early stopping

Stage 2:

Initialize the teacher model with θT

Initialize student model with random parameters θS

for epoch = 1 to nfreeze do

for t = 1 to \(\frac{\min ({N}_{un},{N}_{l})}{N}\) do

sample Xun of size N from unlabeled data (without replacement)

update θS with student update rule

end for

end for

for epoch = nfreeze + 1 to n do

sample Xun of size N from unlabeled data (without replacement)

update θS with student update rule

update θT with teacher update rule

end for

Model evaluation

The model performance was measured using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves and their corresponding areas under the curve (AUC). While ROC is a commonly used metric, it may give an optimistic impression of a model's performance, particularly on imbalanced datasets56; therefore, PR is a better metric than ROC for evaluating MMAPLE. A three-fold cross-validation approach was utilized to ensure the robustness of the performance evaluation, and consistency across evaluations was maintained by using the same folds for all base models.
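For illustration, both metrics can be computed with scikit-learn as in the following sketch (the input arrays are placeholders); average precision is used here as a common estimator of PR-AUC.

```python
# Minimal sketch of the fold-level evaluation with placeholder data.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([1, 0, 1, 1, 0, 0])               # placeholder labels
y_score = np.array([0.9, 0.2, 0.6, 0.7, 0.4, 0.1])  # placeholder scores

print(f"ROC-AUC: {roc_auc_score(y_true, y_score):.3f}")
print(f"PR-AUC:  {average_precision_score(y_true, y_score):.3f}")
```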

Statistical significance of prediction

In our study, we focused on predicting GPCR genes that interact with TMAO and evaluated the statistical significance of each prediction. The prediction scores generated by our model were subjected to Kernel Density Estimation (KDE) using the Python package SciPy57. KDE is a non-parametric way to estimate the probability density function of the prediction scores. By applying KDE, we calculated the tail probability for each predicted interaction score, which we interpreted as a p-value. This p-value indicates the rarity of a prediction within the overall distribution of scores, providing a statistical basis for identifying the most significant GPCR-TMAO interactions. The detailed prediction results can be found in Supplemental Table 4, and the DISAE predictions in Supplemental Table 5.
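A minimal sketch of this tail-probability computation with SciPy is shown below; the score array is a placeholder.

```python
# KDE-based tail probability: P(X >= score) under the estimated density,
# read as a p-value. `scores` stands in for the model's prediction scores.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)        # placeholder prediction scores

kde = gaussian_kde(scores)

def tail_p(score):
    """Tail probability of `score` under the estimated density."""
    return kde.integrate_box_1d(score, np.inf)

p_values = np.array([tail_p(s) for s in scores])
```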

Statistics and Reproducibility

In general, statistical analyses were conducted using paired t-tests. We employed three-fold cross-validation to ensure the robustness of our results. For each fold, we applied early stopping and tested the model on the held-out testing set. The final reported mean performance is the average result over these testing sets.

GPCR functional assay

Trimethylamine N-oxide (TMAO) (purity: 95%, molecular weight: 76.12) was purchased from Sigma-Aldrich (MO, USA).

The GPCR functional assay was performed using the PathHunter® β-Arrestin assay by Eurofins (CA, USA). The PathHunter® β-Arrestin assay monitors the activation of a GPCR in a homogeneous, non-imaging assay format using a technology called Enzyme Fragment Complementation (EFC) with β-galactosidase (β-Gal) as the functional reporter. The enzyme is split into two inactive complementary portions (EA for Enzyme Acceptor and PK for ProLink) expressed as fusion proteins in the cell. EA is fused to β-Arrestin and PK is fused to the GPCR of interest. When the GPCR is activated and β-Arrestin is recruited to the receptor, PK and EA complementation occurs, restoring β-Gal activity, which is measured using chemiluminescent PathHunter® Detection Reagents.

The compound activity was analyzed using the CBIS data analysis suite (ChemInnovation, CA).

For agonist mode assays, percentage activity was calculated using the following formula:

%Activity = 100% × (mean RLU of test sample − mean RLU of vehicle control) / (mean RLU of MAX control ligand − mean RLU of vehicle control)

Where RLU is the relative luminescence unit of the measurement.

For antagonist mode assays, percentage inhibition was calculated using the following formula:

%Inhibition = 100% × (1 − (mean RLU of test sample − mean RLU of vehicle control) / (mean RLU of EC80 control − mean RLU of vehicle control))

Where EC80 is the 80% maximal effective concentration of TMAO.

Protein-ligand docking

AutoDock Vina38 was applied to TMAO to find its best conformation in the CXCR4 chemokine receptor (PDB ID: 3ODU). The center of the co-crystallized ligand (ligand ID: ITD) in 3ODU was used to define the center of the search space, and 12 Å of extra space was added to the edge of ITD to set up the docking box for TMAO. The binding energies between TMAO and 3ODU were reported in kcal/mol.
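For illustration, the following is a minimal sketch of such a docking run using the Python bindings of AutoDock Vina (version 1.2 or later); the file names and box-center coordinates are hypothetical placeholders, not the values used in this study.

```python
# Minimal docking sketch with the AutoDock Vina Python bindings.
# The 12-Angstrom padding follows the text above; all file names and
# coordinates are hypothetical.
from vina import Vina

v = Vina(sf_name="vina")
v.set_receptor("3ODU_receptor.pdbqt")      # prepared CXCR4 structure
v.set_ligand_from_file("TMAO.pdbqt")       # prepared TMAO ligand

# Search box centered on the co-crystallized ligand ITD, with 12 A
# added to the edge of ITD's bounding box (placeholder numbers).
v.compute_vina_maps(center=[18.0, 8.0, 25.0], box_size=[24.0, 24.0, 24.0])

v.dock(exhaustiveness=8, n_poses=5)
print(v.energies(n_poses=1))               # binding energies in kcal/mol
v.write_poses("TMAO_CXCR4_poses.pdbqt", n_poses=5, overwrite=True)
```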

Ablation study

All ablation studies were applied to the microbiome-human MPI prediction experiment (Experiment 3).

Vanilla TS

Vanilla teacher-student model, where the teacher model is pre-trained and kept frozen while training the student model, so the student relies on fixed pseudo labels to learn, without sending feedback to the teacher model. Hard labels are used for pseudo-labeling.

TS soft

Same as Vanilla TS, except that soft labels are used.

OOC-ML (out-of-cluster meta-learning)

As demonstrated in the published work12, we created five clusters based on molecular scaffolds and forced the model to see data from a different cluster at every meta-update, thereby pushing the model to generalize to unseen data.