Keywords

1 Introduction

Drug-disease association is almost involved in the entire process of drug repositioning, providing a theoretical basis for the discovery of new drug efficacy. Therefore, it is a prospective task to explore as many new drug-disease associations as possible. In recent years, several computational methods of drug-disease association based on drug target information, drug structure information, disease semantic information and other information sources have been proposed. For example, some methods use disease, drug and drug target to predict drug-disease associations (TL-HGBI). Drug - disease association prediction based on drug target information is a popular method [1]. Drug targets are also considered to be one of the sources of information for predicting drug-disease interactions, but the computational conditions for these methods are that the drug can find the corresponding drug target information. In these methods, a three-layer heterogeneous network is typically constructed using drugs, diseases, and drug targets, and the network is constructed based on the distribution of similarity measures [2]. Combining multiple associated sources of information provides more insight into predictive drug-disease association than using only drug targets as sources of information [3]. Therefore, how to effectively integrate more information sources has attracted wide attention [4].

Inspired by graph representation learning, we re-examine some basic relational prediction problems from the perspective of graphs to find better solutions. Graph is a basic and commonly used data structure. Many scenes in the real world can be abstracted into a graph structure, such as social network, traffic network, etc. [5]. The biomolecule in the cell can also be viewed as a graph structure, with the association of different types of biomolecules forming the edges of the graph and the biomolecules serving as the nodes of the graph [6]. Using graph theory to develop reliable bio-association graph technology to solve bio-association prediction problem will have a subversive impact on current bioinformatics research [7]. There is no doubt that the seamless integration of graph with biomacromolecules will drive the development of the post-genomic era [8].

The prediction of nodes and edges is an important task in network analysis [9]. In the node classification task, the most likely node label in the prediction network is the first task [10]. For example, in the drug-target interaction network, the focus is on predicting the functional labeling of drugs [11]. Similarly, in a molecular association network, we want to predict whether a pair of nodes in the network should have an edge that connects them [12, 13]. Predicting nodes and edges can help us discover new interactions between drugs and diseases [14]. Node2vec is an algorithm framework for learning the continuous feature representation of nodes in a network [15]. It defines a flexible concept of node network domain and designs a biased random walk process to effectively explore different network domains [16].

Computational methods used to find new drugs and disease associations can solve the problem of high cost and low efficiency, so it has important practical significance [17]. Based on the similarity of biomolecular association network and graph structure, this paper proposes a biomolecular network representation learning model to predict drug-disease association [18]. The model is based on the biomolecular network representation method Node2Bio and XGboost classifier [19].

The biomolecular network consists of two parts: nodes (drugs, diseases, proteins, ncRNA (miRNA, lncRNA)) and edges (the relationship of nodes) [20]. Each node can be represented in two ways: attribute information of the node (such as the molecular fingerprint of the drug and the phenotype of the disease) and a vector of relationships with other nodes in the network embedding [21]. Finally, all node features are integrated to form feature descriptors and imported into the XGboost classifier to predict the association of each drug with all diseases [22]. It is worth noting that although the main purpose is to predict drug-disease association, our proposed molecular association network model and iterative update algorithm can be applied to other prediction problems as well [23].

2 Materials and Methods

2.1 Nine Kinds of Molecular Associations

To build a molecular association network, we need to download drugs, diseases, lncRNA, miRNA and protein information from different data sources. Then the feature vectors of drug, disease, lncRNA, miRNA and protein were calculated by different methods. All known interactions are derived from existing databases [24]. Drugs and diseases are downloaded from the CTD database and drugs SMILE is downloaded from DrugBank [25]. Zhang et al. collated 18,416 drug-disease associations from the CTD database and named this data set “SCMFDD-S” [26]. Drug-protein associations were collected from the DrugBank database for a total of 11,107 associations. The Protein-protein association is based on 19,237 associations in the STRING dataset [27]. The Protein-disease association was collected from the DisGeNET [28] database and a total of 25,087 associations were collected. A total of 690 lncRNA-protein associations were collected from the LncRNA2Target [29] database. A total of 1264 lncRNA-disease associations were collected from the LncRNADisease [30] database and the lncRNASNP2 [31] database. 4494 miRNA-protein associations were collected from miRTarBase [32]. The miRNA-disease association was collected from HMDD [33] for a total of 16,427. 8374 miRNA-lncRNA associations were downloaded from lncRNASNP2 [31].

2.2 Disease MeSH Descriptors and Directed Acyclic Graph

In this study, we used the MeSH disease descriptor downloaded from the National Library to calculate the semantic similarity of the disease. This representation is described by a directed acyclic graph (DAG), in which nodes in the DAG represent disease, and the ends of each edge are the parent and child nodes, respectively [34]. If the disease \( p\left( j \right) \) is the parent of the disease \( p\left( i \right) \), the disease \( p\left( i \right) \) can be described as:

$$ DAG_{p\left( i \right)} = \left( {p\left( i \right),N_{p\left( i \right)} ,E_{p\left( i \right)} } \right) $$
(1)

where \( N_{p\left( i \right)} \) represents the set of points for all diseases. \( E_{p\left( i \right)} \) contains all the edges in \( DAG_{p\left( i \right)} \).

In \( DAG_{p\left( i \right)} \) of disease \( s \), the contribution of any ancestral disease \( p\left( i \right) \) to disease \( s \) is as the formula:

$$ \left\{ {\begin{array}{*{20}l} {D_{p\left( i \right)} \left( s \right) = 1} \hfill & {if\;s = p\left( i \right)} \hfill \\ {D_{p\left( i \right)} \left( s \right) = \hbox{max} \left\{ {\beta \cdot D_{p\left( i \right)} \left( \acute{s} \right) |\acute{s} \in children\;of\;s} \right\} } \hfill & {if\;s \ne p\left( i \right)} \hfill \\ \end{array} } \right. $$
(2)

In addition, disease \( p\left( i \right) \) contributes 1 to its own semantic value. Therefore, the semantic value \( DV\left( {p\left( i \right)} \right) \) of the disease \( p\left( i \right) \) is defined as follows:

$$ DV\left( {p\left( i \right)} \right) = \sum\nolimits_{{s \in N_{p\left( i \right)} }} {D_{p\left( i \right)} \left( s \right)} $$
(3)

We hypothesized that the more DAG Shared between diseases, the higher the semantic similarity score. The DAG similarity value \( SV_{1} \left( {p\left( i \right),p\left( j \right)} \right) \) of the disease \( p\left( i \right) \) and disease \( p\left( j \right) \) is calculated as:

$$ SV_{1} \left( {p\left( i \right),p\left( j \right)} \right) = \frac{{\mathop \sum \nolimits_{{s \in N_{p\left( i \right)} {\bigcap }N_{p\left( j \right)} }} \left( {D_{p\left( i \right)} \left( s \right) + D_{p\left( j \right)} \left( s \right)} \right)}}{{DV\left( {p\left( i \right)} \right) + DV\left( {p\left( j \right)} \right)}} $$
(4)

2.3 Stacked Autoencoder

Stacked auto-encoder (SAE) is a multi-layer neural network and is a deep learning model that uses modular units to create deep neural networks [35]. The purpose of Auto-encoder is to make the value of the output as close as possible to the value of the input. Given a drug molecular fingerprint set \( x \), autoencoder input \( x \) through an expression to determine the mapping of hidden:

$$ {\text{Y}} =\upsigma\left( {W_{1} x + b_{1} } \right) $$
(5)

where \( \sigma \) denotes the logistic sigmoid. Y is the result of the hidden representation, and \( x \) is the reconstructed vector after mapping:

$$ \acute{x} = \sigma \left( {W_{2} x + b_{2} } \right) $$
(6)

The stack auto-encoder is a combination of multiple autoencoders. The principle is to use the output of the first layer of the autoencoder as the input of the next layer of the autoencoder, and so on, to obtain the output of the last layer of the auto-encoder. In this paper, a drug fingerprint obtains a descriptor representing a structural feature by a stacked autoencoder.

2.4 NcRNA and Protein Sequence

We chose to encode the sequence using a 64 (4 × 4 × 4) dimensional vector encoding ncRNA and analyzed it with K-mer, where k is taken as 3. The 3-mer mode is a sliding window containing 3 nucleotides to analyze each transcription. In the initial state, the number of occurrences of all patterns is set to 0. If the window matches exactly the string in the transcript, the count is incremented by 1 and the slide continues. Finally, divide the number of occurrences by the length of the sequence to get the normalized frequency.

The article by Shen et al. [36] proposes that protein sequences can be encoded into four classes based on the polar side chains of the amino acids. Each protein sequence is characterized by a 3-mer. The ncRNA uses the same normalized frequency calculation method.

2.5 Node Representation

In the molecular association network, many nodes and edges are involved in the prediction task. We chose node2vec to learn the continuous feature representation of nodes in the network [37]. Suppose just traversed go from edge (t, v) to node v. Assume that the transition probability of the next step edge (v, x) is \( \pi_{vx} \). We set the unnormalized transition probability to \( \pi_{vx} = \alpha_{pq} \left( {t,x} \right) \cdot \omega_{vx} \), where \( d_{tx} \) represents the shortest path distance between nodes t and x:

$$ \alpha_{pq} \left( {t,x} \right) = \left\{ {\begin{array}{*{20}c} {\frac{1}{p}\;\; if\;d_{tx} = 0} \\ {1\;\; if\;d_{tx} = 1} \\ {\frac{1}{q}\;\; if\;d_{tx} = 2} \\ \end{array} } \right. $$
(7)

2.6 XGBoost

XGBoost algorithm has been widely applied in the field of bioinformatics. XGBoost is an integration of several weak classifiers, in this case the CART regression tree model. The objective function of XGBoost is defined as:

$$ Obj = \sum\nolimits_{m = 1}^{n} {l\left( {y_{m} ,\hat{y}_{m} } \right) + \sum\nolimits_{k = 1}^{K} {\Omega \left( {f_{k} } \right)} } $$
(8)
$$ \varOmega \left( f \right) = \varUpsilon T + 0.5\lambda \left\| \omega \right\|^{2} $$
(9)

Here \( l \) is a differentiable convex loss function that measures the difference between the prediction \( \widehat{{y_{m} }} \) and the target \( y_{m} \). The complexity of the \( \Omega \) penalty model. The newly generated tree is to fit the residual error predicted last time. When t trees are generated, the prediction score is:

$$ \hat{y}_{m}^{\left( t \right)} = \hat{y}_{m}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{m} } \right) $$
(10)

The target function is updated to:

$$ {\mathcal{L}}^{\left( t \right)} = \sum\nolimits_{m = 1}^{n} {l\left( {y_{m} ,\hat{y}_{m}^{t - 1} + f_{t} \left( {x_{m} } \right)} \right) +\Omega \left( {f_{t} } \right)} $$
(11)

In general, a second order approximation can be used to quickly optimize the target. The approximate objective function is:

$$ {\mathcal{L}}^{\left( t \right)} \simeq \sum\nolimits_{m = 1}^{n} {\left[ {l\left( {y_{m} ,\hat{y}^{t - 1} } \right) + g_{m} f_{t} \left( {x_{m} } \right) + \frac{1}{2}h_{m} f_{t}^{2} \left( {x_{m} } \right)} \right] +\Omega \left( {f_{t} } \right)} $$
(12)

where \( g_{m} \) is the first derivative and \( h_{m} \) is the second derivative.

$$ g_{m} = \partial_{{\hat{y}^{{\left( {t - 1} \right)}} }} l\left( {y_{m} ,\hat{y}^{t - 1} } \right) $$
(13)
$$ h_{m} = \partial_{{\hat{y}^{{\left( {t - 1} \right)}} }}^{2} l\left( {y_{m} ,\hat{y}^{t - 1} } \right) $$
(14)

Since the prediction score of the former \( t - 1 \) tree and the residual of \( y \) do not affect the optimization of the objective function, the objective function can be simplified as:

$$ {\tilde{\mathcal{L}}}^{\left( t \right)} = \sum\nolimits_{m = 1}^{n} {\left[ {g_{m} f_{t} \left( {x_{m} } \right) + \frac{1}{2}h_{m} f_{t}^{2} \left( {x_{m} } \right)} \right] +\Omega f\left( t \right)} $$
(15)

3 Results and Discussion

3.1 Evaluation Criteria

In order to verify the predictive power of our model. Five-fold cross-validation was performed to verify. All samples were first randomly divided into nearly the same number of five subsets. Each time four subsets are used as a training set and the remaining subsets are used as test sets, the process is repeated five times so that each subset can be used as a test set. Finally, the average of the five groups was taken as the final result. Several evaluation criteria used in our study to estimate the predictive power of our model, including sensitivity (Sen.), specificity (Spec.), precision (Prec.) accuracy (Acc.) and Matthews correlation coefficient (MCC). The calculation method is as follows:

$$ Sen. = \frac{TP}{TP + FN} $$
(16)
$$ Spec. = \frac{TN}{FP + TN} $$
(17)
$$ Prec. = \frac{TP}{TP + FP} $$
(18)
$$ Acc. = \frac{TP + TN}{TP + TN + FP + FN} $$
(19)
$$ MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {\left( {TP + FP} \right)\left( {TP + FN} \right)\left( {TN + FP} \right)\left( {TN + FN} \right)} }} $$
(20)

For further evaluation, we also compute the receiver operating characteristic (ROC) curve, sum up the ROC curve in a numerical way, and calculate the area under the ROC curve (AUC). We compute the precision-recall (PR) curve and calculate the area under the PR curve (AUPR).

4 Results and Discussion

4.1 Five-Fold Cross-Validation on SCMFDD-S Dataset

We performed five-fold cross-validation on the SCMFDD-S data set to evaluate the performance of Node2Bio in predicting drug-disease association [38]. The process of cross-validation is to divide the data set into five equal parts, select a different set as the test set each time, and the remaining four sets as the training set, and repeat the experiment five times [39]. Node2Bio yielded an average accuracy of 77.42%, sensitivity of 75.25%, specificity of 79.59%, precision of 78.67%, Matthews correlation coefficient of 54.90% and AUC of 85.69% with standard deviations of 0.24%, 1.01%, 0.74%, 0.41%, 0.46% and 0.12% [40]. To evaluate the performance of Node2Bio, we compare it to some related methods of NTSIM-C. The comparison method uses the same data set for five-fold cross-validation. The experimental results represented by AUC are shown in Table 1. The results from experiments demonstrate that the performance of Node2Bio is significantly better than the related methods of NTSIM-C. Unlike the comparison method, Node2Bio combines nine molecular associations and integrates related information from a cellular perspective to achieve significant predictive effects.

Table 1. AUC comparison of Node2Bio-based method with different methods

5 Conclusion

In this study, we proposed a computational method for predicting drug-disease associations using a highly efficient biomolecular network representation model. The proposed method leverages multiple types of relational data that are biologically associated and constructs a heterogeneous network on which a graph embedding technique, node2vec, is applied for feature extraction. Using the embedding feature as inputs, we adopted the XGboost algorithm to do classification for drug-disease association. The experimental results are the proposed method to be effective, robust and superior to existing methodologies. It is anticipated that the model we trained can be applied to predict drug effects on different kinds of diseases on a large scale.