Keywords

1 Introduction

Most drug molecules usually perform their functions through the interaction with target proteins in human body. The discovery for drug targets has become the significant focus of innovative drugs research [1, 2]. Hence, prediction of drug-target interactions (DTIs) is one of the most important steps in genomic drug discovery pipeline and drug repurposing [3,4,5], the purpose is to discover putative new drugs and new uses of existing drugs. Nevertheless, due to the limitation of throughput and cost, the traditional experimental methods are difficult to be widely applied for DTIs prediction. It is of great significance to develop effective calculation methods to predict the interaction between drugs and targets.

Recently, there are already a variety of calculation methods used to identify molecular associations [6,7,8,9,10,11,12,13,14,15,16], especially the interaction between drugs and targets [17,18,19,20,21]. Luo et al. developed a computational pipeline to detect novel DTIs from a constructed heterogeneous network, which achieves substantial performance improvement over other state-of-the-art methods [22]. Van Laarhoven et al. proposed a simple machine learning method that uses Gaussian interaction profile kernel and regularized least squares for predicting drug-target interactions [23]. Chen et al. proposed a drug-target interaction prediction method by random walk on a large-scale heterogeneous network, which combines drug-drug similarity network, protein-protein similarity network and known drug-target interaction network [24]. Ezzat et al. provide a comprehensive overview and empirical evaluation on the computational DTIs prediction, which helps understanding the advantages and disadvantages of the state-of-the-art methods [25]. Based on these methods, we proposed a multi-molecular network, also called molecular associations network (MAN) [26] to detect the interactions between drug candidate and related target proteins.

2 Materials and Methods

2.1 Datasets Construction

In the multi-molecular network, the high quality data mainly from nine open source database, which obtained nine known relationships (shown in Table 1) and five types of molecules (shown in Table 2). MAN contained topological relationships and distributions among all the molecules in the heterogeneous network. The drug molecular data and target protein sequences can be collected from DrugBank database and STRING database.

Table 1. Nine known relationships in the molecular associations network
Table 2. The number of 5 types of biomolecules from the nine known relationships

2.2 Molecular Association Network

From the collection of nine known relationships between five types of biomolecules annotated in many famous databases which mentioned above, we constructed a multi-molecular network, also called Molecular Associations Network (MAN) by linking arbitrary two association nodes. The complex molecular associations network is shown in Fig. 1. Based on the known associations, some biomolecules are suggested to interact with each other. In tfhe network graph, the heterogeneous nodes correspond to five types of biomolecules (drug, protein, disease, miRNA and lncRNA), and edges correspond to associations among them. The construction of systematic and MAN network provides a new perspective for predicting interactions between drug and target.

Fig. 1.
figure 1

Construction of multi-molecular network

2.3 Network Embedding: Node2Vec

Grover Aditya and Jure Leskovec proposed a graph embedding method, called Node2Vec, an algorithmic framework for learning continuous feature representations for nodes in graphs [36]. Different from the traditional graph embedding model, it can be seen as an extension of DeepWalk [37]. On the basis of DeepWalk, Node2Vec introduces two biased random walk methods: breadth-first search (BFS) [38] and depth-first search (DFS) [39], to characterize the structural equivalence and homophily of the network. Compared with random walk without any guidance, this method achieves the purpose of biased random walk by introducing Return Parameter and In-out Parameter, that is, the whole random walk process moves between BFS and DFS by setting different offsets. Take node V11 as an example, two search strategies as shown in Fig. 2.

Fig. 2.
figure 2

Two search strategies from node V11 (step = 3)

2.3.1 Biased Random Walk

Suppose that a random walk started with node M and end with node N. Here, due to the use of two different search strategies (BFS and DFS), the selection of strategy will directly affect the result of random walk. The unnormalized transition probability algorithm is introduced to solve this problem. The transfer probability between the two nodes can be described as follow:

$$ \pi_{NX} = \alpha_{pq} (M,X) \bullet w_{NX} $$
(1)

where, X represents the next position. w is the weight of the edge of the two nodes, which is based on the scenario. α is search bias.

$$ \alpha_{pq} (M,X) = \left\{ {\begin{array}{*{20}l} {\frac{1}{p}} \hfill & {if\,d_{MX} = 0} \hfill \\ 1 \hfill & {if\,d_{MX} = 1} \hfill \\ {\frac{1}{q}} \hfill & {if\,d_{MX} = 2} \hfill \\ \end{array} } \right. $$
(2)

where, dMX is the shortest distance between M and N; p is return parameter, which controls the probability of returning to the original node; q represents in-out parameter, which controls the relationship between BFS and DFS. The setting of different p and q determine the priority of node sequence. When training the model, to find the best p and q by according to the needs of the scene and grid search.

  1. (1)

    When d = 0, it means to return to node M from N. At this time, the search bias is 1/p, which can be understood as returning to the previous step with a probability of 1/p;

  2. (2)

    When d = 1, X is the direct neighbor of M, which is equivalent to BFS, then the bias is 1;

  3. (3)

    When d = 2, X is the neighbor’s neighbor of M, which is equivalent to DFS, then the bias is 1/q.

2.3.2 Feature Learning

Now, suppose there is a graph G = (V, E), where V is the set of nodes and E is the set of edges. The objective function for maximizing log-property could be described as follows:

$$ \mathop {max}\limits_{f} \sum\limits_{v \in V} {logPr(N_{s} (v)\left| {f(v)} \right.)} $$
(3)

where, function f: V —> Rd to represent the mapping from vertex to feature representation, where d is a pre-set hyper-parameter that represents the dimension of feature representation of each vertex. As a result, f is a matrix whose size is |V| × d. v ∈ V, and Ns(v) ⊂ V represents the neighbor vertex of vertex v under the sampling strategy s.

Assuming that the possibility of observing neighborhood nodes is independent of the feature representation of observing any other neighborhood nodes of a given vertex, so as to decompose this conditional probability, then

$$ Pr(Ns(v)\left| {f(v)} \right.) = \prod\limits_{{n_{i} \in N_{S} (v)}} {Pr} \left( {n_{i} \left| {f(v)} \right.} \right) $$
(4)

Assuming that the influence between two vertices in the feature space is symmetrical, then

$$ Pr\left( {n_{i} \left| {f(v)} \right.} \right) = \frac{{exp\left( {f\left( {n_{i} } \right) \bullet f(v)} \right)}}{{\sum\limits_{x \in V} {exp} (f(x) \bullet f(v))}} $$
(5)

The purpose of these two assumptions is to better handle the optimization problem. Based on the above assumptions, the objective function can be simplified as follow:

$$ max_{f} \sum\limits_{v \in V} {\left[ { - logZ_{v} + \sum\limits_{{v_{i} \in N_{S} (v)}} f \left( {v_{i} } \right) \bullet f(v)} \right]} $$
(6)

For each node,

$$ Z_{v} = \sum\limits_{x \in V} {exp} (f(v) \bullet f(x)) $$
(7)

2.4 Random Forest

Random forest (RF) is an ensemble algorithm that includes a number of decision trees [40]. Focus on the problem of classification, each decision tree is treated as a classifier. Each sample is input into each tree for classification, and the category with the largest number of votes is designated as the final output.

In the process of feature importance assessment using random forest, it depends on the contribution of each feature to each tree in the RF. The contribution usually measured by Gini index or error rate of out-of-bag (OOB) data. Assuming that there is n features f1, f2, f3,…, fn, the Gini variable importance measures (VIM) of each feature fi can be described as follow:

$$ Gini_{n} = \sum\nolimits_{m = 1}^{\left| M \right|} {\sum\nolimits_{{m^{\prime } \ne m}} {p_{nm} p_{{nm^{\prime } }} } } = 1 - \sum\nolimits_{m = 1}^{\left| M \right|} {p_{nm}^{2} } $$
(8)

where, m represents m classes. pnm is the proportion of class k in node n.

2.5 Performance Measurement Tools

In our study, in order to size up the effectiveness and steadiness of our constructed model, we counted the results of five parameters [41,42,43]: Accuracy (Acc), recall (sensitivity, hit rate, or true positive rate (TPR)), specificity(selectivity, or true negative rate (TNR)),precision (positive predictive value (PPV)) and Matthews’s Correlation Coefficient (MCC), respectively. These parameters can be represented as follows:

$$ Acc = \frac{TP + TN}{TP + FP + TN + FN} $$
(9)
$$ TPR = \frac{TP}{TP + FN} $$
(10)
$$ TNR = \frac{TN}{FP + TN} $$
(11)
$$ PPV = \frac{TP}{FP + TP} $$
(12)
$$ MCC = \frac{(TP \times TN) - (FP \times FN)}{{\sqrt {(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)} }} $$
(13)

3 Results and Discussion

In this paper, a deep learning method was derived from the idea of molecular association network, and proposed for predicting DTIs. Then, the node2vec embedding method was employed to obtain a behavior feature vector of each node from the multi-molecular network, which is constructed by integrating the associations among drug, protein, disease, lncRNA and miRNA. The good performance was obtained for this method based on behavior feature than traditional attribute feature on the collected datasets. Here, random forest classifier model was used to fulfill the experiment. During this experiment, we set the same parameters to compare the performances of the two different features on the model, the results as shown in Table 3 and 4. From the two tables, it is obvious that the accuracy on behavior features is 5% higher than the accuracy on attribute features under five-fold cross validation.

Table 3. Performance evaluation with RF on attribute features
Table 4. Performance evaluation with RF on behavior features

The ROC curve of random forest classifier based on attribute feature and behavior feature with 5-fold cross-validation that is shown in Fig. 3 and Fig. 4, respectively. It is obvious that the average of AUC is 0.8957 by using attribute information, the average of AUC is 0.9396 by using behavior information based on MAN network. So the behavior information of nodes play an important role in the DTIs predictions.

Fig. 3.
figure 3

The ROC curve of random forest on attribute feature

Fig. 4.
figure 4

The ROC curve of random forest on behavior feature

4 Conclusion

In this study, we developed a deep learning method to discover the potential interaction between drugs and target proteins on a large scale by investigating the relationship among five molecules (drug, protein, miRNA, lncRNA and disease). And we construct a novel scheme based on above five molecules and nine relationships between arbitrarily two molecules, which called MAN network. Focus on this network, each node can obtain a feature vector by using node behavior information (the relationship of each node with others could be described by node2vec graph embedding method). The traditional attribute feature vector comes from the drug molecular fin-gerprint and protein sequences on integrated dataset. Finally, a random forest (RF) classifier is performed on these features to predict potential drug-target pairs. Experimental results indicated that the behavior feature could be performed better on random forest classifier. It is also demonstrated that the use of behavior information is very helpful for addressing the problem of drug molecules and target proteins. This work is a new attempt to predict DTIs and would have potential applications for drug discovery and repositioning.