Keywords

1 Introduction

Traditional drug development faces difficulties relating to the expensive, time consuming and high risk of failure. Studies have demonstrated that drug repositioning, which aims to discovery new indications for existing drugs, offers a promising alternative to drug development. Some successful repositioned drugs (e.g. Sildenafil, thalidomide, raloxifene) have historically generated high revenues for their patent holders or companies [1]. Compared to in vivo experimental methods for drug repositioning, in silico approaches are efficient at identifying potential drug-disease association, and thus significantly reduce research costs. Therefore, it is necessary to develop a computational method for identifying drug-disease associations.

To date, much effort has been allocated to developing computational approaches for predicting drug-disease associations. Conventional computational methods mainly depend on two strategies, the network-based method and feature-based method. A key idea behind network-based algorithms is the construction of complex biological networks with large-scale biological data. Wang et al. [2] proposed a drug-disease heterogeneous network model termed Heterogeneous Graph Based Inference (HGBI) and extended the algorithm to a three-layer network (HL_HGBI), adding a new layer of the target information [3]. However, the assumption was that drugs should have diverse indications and diseases should have diverse treatments. Martínez et al. [4] constructed a complex network which included drugs, diseases and proteins. Protein interactions were used as a bridge to perform DrugNet, a general network-based prioritization based on a propagation flow algorithm. Luo et al. [5] exploited known drug-disease associations to devise the drug-drug and disease-disease similarity measures, then building a drug-disease heterogeneous network, on which a bi-random walk algorithm was adopted to predict novel potential associations between drugs and diseases.

Much attention has also been devoted to introducing feature-based methods. Bleakley et al. provided a supervised learning approach [i.e. support vector machine (SVM)] on a bipartite local model (BLM) from chemical and genomic data [6]. Mei et al. [7] proposed BLM-NII, combining BLM with a neighbor-based interaction-profile inferring(NII) procedure. Gottlieb et al. [8] conducted multiple drug-drug and disease-disease similarity measures as classification features, implementing a classification algorithm named PREDICT to infer potential drug indications. Yang et al. [9] calculated relevance scores between drugs and diseases from a drug-target-pathway-gene-disease network and learnt a probabilistic matrix factorization model (PMF) based on known drug-disease associations to classify drug-disease associations. However, most of these approaches rely on the known association information and directly set the weight of unknown disease-drug associations to zero. This is perhaps the major reason that existing methods can’t obtain a satisfactory performance based on sparse known associations validated by biological experiments.

In this article, we propose a graph regularized transductive regression (GRTR) method to deal with the problem of the sparse known associations for drug-disease association prediction. A three-layer heterogeneous network composed of drugs, diseases and targets is constructed from multiple datasets. Then we approximately calculate drug related diseases from local neighborhood information and adjust the weight of links with diseases based on it. Through a transductive regression model with graph regularization, the relevance score for potential drug-disease associations will be iteratively updated and all drugs ranked by their scores and judged whether they are related to a disease. Compared to the previous nine advanced prediction methods, GRTR performs better in terms of AUC and AUPR. Furthermore, the effect of varying weighted parameters and the effect of preliminarily estimating drug-related disease are analyzed. Case studies on the selected drugs and targets further exhibit the predictive ability of drug-disease association.

2 Methods

The overall process of predicting new drug-disease associations by GRTR is displayed in Fig. 1. GRTR first constructs a three-layer heterogeneous network composed of drugs, diseases and targets. Next, local information is obtained based on preliminary estimates for drug-related disease from the distribution of diseases associated with neighbor nodes in the heterogeneous network. Finally, using the heterogeneous network, the known relationships with diseases and preliminary estimation results as inputs, GRTR adopt graph regularization transductive regression to score and rank drug-disease associations iteratively.

Fig. 1.
figure 1

GRTR workflow. Given the inputs of the heterogeneous network matrix S and the matrix of known association \( {\mathbf{y}}_{{\mathbf{L}}} \), we first obtain preliminary estimates for drug related diseases \( {\mathbf{y}}_{{\mathbf{u}}} \) using neighbor distribution information. We then score and rank drug-disease associations iteratively based on graph regularization transductive regression. The top rank drugs for each disease in the predicted association matrix f are treated as the candidate drugs for those diseases for further experimental investigation.

2.1 Heterogeneous Network Construction

The three-layer heterogeneous network consists of three nodes types: drug nodes, disease nodes and target nodes. Suppose that m, n and k are the number of drugs, diseases and targets, respectively. \( {\text{S}}^{11} = \left\{ {{\text{S}}_{i,j}^{11} } \right\}_{i = 1,j = 1}^{m,m} \) is an adjacency matrix of the drugs similarity network, \( {\text{S}}^{22} = \left\{ {{\text{S}}_{i,j}^{22} } \right\}_{i = 1,j = 1}^{k,k} \) is an adjacency matrix of the protein interaction network and \( {\text{S}}^{33} = \left\{ {{\text{S}}_{i,j}^{33} } \right\}_{i = 1,j = 1}^{n,n} \) is an adjacency matrix of the disease similarity network. Drug similarities can be calculated based on their chemical structures. Disease similarities and protein-protein interactions can be obtained from online datasets. We connect the above three subnetworks using experimentally verified drug-disease associations (\( {\text{S}}^{13} = \left\{ {{\text{S}}_{i,j}^{13} } \right\}_{i = 1,j = 1}^{m,n} \)), target-disease associations (\( {\text{S}}^{23} = \left\{ {{\text{S}}_{i,j}^{23} } \right\}_{i = 1,j = 1}^{k,n} \)) and drug-target associations (\( {\text{S}}^{12} = \left\{ {{\text{S}}_{i,j}^{12} } \right\}_{i = 1,j = 1}^{m,k} \)) to form a heterogeneous network. The adjacency matrix of the heterogeneous network can be represented as follows:

$$ S = \left( {\begin{array}{*{20}c} {S^{11} } & {S^{12} } & {S^{13} } \\ {\left( {S^{12} } \right)^{T} } & {S^{22} } & {S^{23} } \\ {\left( {S^{13} } \right)^{T} } & {\left( {S^{23} } \right)^{T} } & {S^{33} } \\ \end{array} } \right) $$

where \( \left( \cdot \right)^{T} \) represents the transpose of a matrix.

2.2 Preliminary Estimation of Drug Related Disease

In our research, the node with no known associations with a disease is unlabeled while other nodes are labeled. Preliminary estimation for the related diseases for an unlabeled drug is a local estimation. According to the assumption that drugs which are ‘close together’ will have associations with the same disease [10], we will consider neighborhood information based on the equal combination of diseases which have association with neighbor nodes in the heterogeneous network. Firstly, the neighbors of a drug \( {\text{i}} \) can be defined by the nearest labeled nodes \( {\text{N}} \) in the heterogeneous network.

$$ N(i) = \left\{ {j\,|\,S_{ij} > \sigma ,1 \le i \le m + n + k,1 \le j \le m + n + k} \right\} $$
(1)

where \( \sigma \) is a threshold and in this paper \( \sigma = 0.5 \). Then we use the mean distribution of the neighbor’s disease to describe the biological network’s local information and obtain preliminary estimations for the related diseases (\( \tilde{y} \)).

$$ \tilde{y}_{i} = \frac{{\sum\limits_{j \in N(i)} {S_{i,j} y_{j} } }}{{\sum\limits_{j \in N(i)} {S_{i,j} } }} $$
(2)

where \( {\text{y}}_{j} \) denotes the known associations between nodes \( {\text{j}} \) and diseases. Here, diseases can be understood as discrete variables. Hence, the variance of a neighbor’s disease distribution (\( \sigma_{{\tilde{y}}}^{2} \)) can be obtained as follows:

$$ \sigma_{{\tilde{y}_{i} }}^{2} = \frac{{\sum\limits_{j \in N(l + i)} {S_{i,j} } (y_{j} - \tilde{y}_{i} )^{2} }}{{\sum\limits_{j \in N(i)} {S_{i,j} } }} $$
(3)

2.3 Graph Regularized Transductive Regression

The main idea of our prediction method is based on transductive regression which is one of the most popular methods for imbalanced (sparse) data analysis, because prediction though transductive regression can lead to good knowledge extraction of the hidden network structure [11]. Wan et al. [12] presented a graph regularization-based transductive regression (Grempt) method using a symmetry meta-path to deal with label prediction on heterogeneous information networks, which have performed satisfactorily. In order to address the limitations of the symmetry meta-path, we revise the objective function’s first term to directly consider different links classes in the heterogeneous network. The revised objective function is defined as follows:

$$ \begin{aligned} J(f) & = \sum\limits_{p,q = 1}^{\left| A \right|} {w_{p,q} \sum\limits_{i = 1}^{{v_{p} }} {\sum\limits_{j = 1}^{{v_{q} }} {S_{i,j}^{p,q} \left\| {\frac{{f_{i}^{p} }}{{\sqrt {D_{ii}^{pq} } }} - \frac{{f_{j}^{q} }}{{\sqrt {D_{jj}^{pq} } }}} \right\|} } }^{2} \\ & \quad { + }\,\alpha_{1} \sum\limits_{i}^{\left| L \right|} {\left\| {f_{i} - y_{i} } \right\|^{2} } \\ & \quad + \,\alpha_{2} \sum\limits_{i}^{\left| U \right|} {\frac{{\left\| {f_{i} - \tilde{y}_{i} } \right\|^{2} }}{{\sigma_{{\tilde{y}_{i} }}^{2} }}} \\ \end{aligned} $$
(4)

where \( \alpha_{1} \) and \( \alpha_{2} \) are two regularization coefficients which balance the different components of the model. \( {\text{A}} = \left\{ {{\text{drug}},\,{\text{disease}},\,{\text{target}}} \right\} \) is the network nodes category, \( {\text{w}}_{p,q} \) is the correlation between categories \( A_{p} \) and \( A_{q} \) \( \left( {{\text{p}}, {\text{q}} \in \left\{ {1,2, \ldots ,\left| A \right|} \right\}} \right) \), \( v_{p} \) is the number of nodes which belong to category \( A_{p} \), \( {\text{S}}_{i,j}^{pq} \) is the relevance between object \( {\text{i}} \in A_{p} \) and object \( {\text{j}} \in A_{q} \) in the network, \( f_{i} \) and \( f_{i}^{p} \) are the prediction results of node \( {\text{i}} \) where p denotes \( {\text{i}} \in A_{p} \), \( {\text{D}}_{ii}^{pq} \) is the sum of the i-th row in \( {\text{S}}^{pq} \), \( {\text{L}} \) is the labeled nodes set which has an association with disease and U is the unlabeled node set. The model consists of 3 functions and each one corresponds to different meaning:

  • The first part of the objective function is the global smoothness item, which formulates that similar nodes are likely to be associated with similar diseases.

  • The second term of the objective function minimizes the difference between the predicted results and the known association.

  • The last term formulates a regularization item to minimize the difference between the predicted results and the preliminary estimation from local characteristics.

The global minimum is calculated by differentiating (4) with respect to \( f_{L}^{p} \) and \( f_{U}^{p} \) respectively, which gives:

$$ \begin{aligned} \frac{\partial J(f)}{{\partial f_{L}^{p} }} & = \sum\limits_{p,q,p \ne q}^{\left| A \right|} {2w_{p,q} \left( {f_{L}^{p} - R_{LL}^{pq} f_{L}^{q} - R_{LU}^{pq} f_{U}^{q} } \right)} \\ & \quad + \,4w_{p,p} \left( {f_{L}^{p} - R_{LL}^{pp} f_{L}^{p} - R_{LU}^{pp} f_{U}^{p} } \right) + 2\alpha_{1} \left( {f_{L}^{p} - y_{L}^{p} } \right) \\ \end{aligned} $$
(5)
$$ \begin{aligned} \frac{\partial J(f)}{{\partial f_{U}^{p} }} & = \sum\limits_{p,q,p \ne q}^{\left| A \right|} {2w_{p,q} \left( {f_{U}^{p} - R_{UL}^{pq} f_{L}^{q} - R_{UU}^{pq} f_{U}^{q} } \right)} \\ & \quad + \,4w_{p,p} \left( {f_{U}^{p} - R_{UL}^{pp} f_{L}^{p} - R_{UU}^{pp} f_{U}^{p} } \right) + 2\frac{{\alpha_{1} }}{{\sigma_{{\tilde{y}^{p} }}^{2} }}\left( {f_{L}^{p} - \tilde{y}^{p} } \right) \\ \end{aligned} $$
(6)

where \( {\text{f}}_{\text{L}}^{\text{p}} \) denotes the prediction result of labeled nodes belonging to \( {\text{A}}_{\text{p}} \) and \( {\text{f}}_{\text{U}}^{\text{p}} \) denotes the prediction result of unlabeled nodes belonging to \( {\text{A}}_{\text{p}} \). \( {\text{R}}^{\text{pq}} = \left( {{\text{D}}^{\text{pq}} } \right)^{{ - \frac{1}{2}}} {\text{S}}^{\text{pq}} \left( {{\text{D}}^{\text{qp}} } \right)^{{ - \frac{1}{2}}} \) is the integration of the whole heterogeneous network, which can be rearranged based on labeled and unlabeled objectives.

$$ R^{pq} = \left( {\begin{array}{*{20}c} {R_{{_{LL} }}^{pq} } & {R_{{_{LU} }}^{pq} } \\ {R_{{_{UL} }}^{pq} } & {R_{{_{UU} }}^{pq} } \\ \end{array} } \right)\, \cdot \,\text{p} ,{q \in }\left\{ {\text{1} ,\text{2} ,\ldots ,\left| A \right|} \right\} $$

Suggested that \( \frac{{\partial {\text{J}}\left( f \right)}}{{\partial f_{L}^{p} }} = 0 \) and \( \frac{{\partial {\text{J}}\left( f \right)}}{{\partial f_{U}^{p} }} = 0 \), the closed-form solution is obtained. However, the iterative solution is sometimes preferable [13]. The detail steps of GRTR to predict potential associations are described in Algorithm 1.

figure a

3 Experiment and Results

3.1 Dataset

Experimentally confirmed drug-disease associations and drug–target associations are both downloaded from the supplementary material of [8]. Gottlieb et al. have collected 1933 known drug-disease associations involving 593 drugs registered in DrugBank [14] and 313 diseases listed in the Online Mendelian Inheritance in Man (OMIM) database [15]. At last, we get 2814 known drug–target associations between 593 drugs and 777 proteins.

The interactions between diseases and proteins are obtained from DisGeNET [16], for a total of 10010 relationships between 3221 proteins and 313 diseases.

The disease–disease similarity network is downloaded from Online Mendelian Inheritance in Man Mining Tool (MimMiner) [17]. According to the MimMiner database, disease–disease similarities have already been normalized to the range [0, 1].

The protein–protein interaction network is built using 37039 binary interactions among 9465 genes in the Human Protein Reference Database (HPRD) [18].

The drug–drug similarities are calculated based on their chemical structures. First, the chemical structures of all drug compounds in the Canonical Simplified Molecular Input Line-Entry System (SMILES) format [19] are downloaded from DrugBank. Then, the Chemical Development Kit [20] is used to calculate a binary fingerprint for each drug. Finally, Tanimoto score [21] of two drugs was calculated based on their fingerprints, which was in the range of [0, 1].

3.2 Parameters Selection and the Effect of Preliminary Estimation for Drug-Related Disease

There are three parameters \( {\text{w}} \), \( \alpha_{1} \) and \( \alpha_{2} \) in our prediction. \( {\text{w}} \) controls the importance of different network. \( \alpha_{1} \) and \( \alpha_{2} \) control the contribution of known labeled objects and preliminary estimation, respectively. We set \( {\text{w}} = 1 \) for easy. To determine the optimal configuration of \( \alpha_{1} \) and \( \alpha_{2} \), we firstly let both increase from 0 to 1 in increments of 0.05 and record the change in AUC. The results can be seen in Fig. 2(a), in which AUC value increases rapidly as both \( \alpha_{1} \) and \( \alpha_{2} \) increase, and then became steadily reaching the maximum AUC value. However, in order to determine the general future trend as \( \alpha_{1} \) and \( \alpha_{2} \) become larger, we also vary them from 1 to 200, as demonstrated in Fig. 2(b). AUC value rapidly decreases in the range \( 0\, \le \,\alpha_{1} \, \le \,10 \) and then remains almost constant in the range \( 10\, < \,\alpha_{1} \, \le \,200 \) which shows the result is not improved for these regions. But, there is an opposite trend for \( \alpha_{2} \), which first rises rapidly in the range of \( 0\, \le \,\alpha_{2} \, \le \,10 \) and after keeping a short constant, it decreases in the range \( 30\, \le \,\alpha_{2} \, \le \,200 \). Finally, we select \( \alpha_{1} \) = 1 and \( \alpha_{2} \) = 20 for getting a better prediction result. Although \( \alpha_{2} \) is much larger than \( \alpha_{1} \), it fits with the reality that preliminary estimation for drug-related disease information is more important than it is for predicting new relations.

Fig. 2.
figure 2

The influence of different \( \alpha_{1} \) and \( \alpha_{2} \) values on AUC. (a): 0–1 in 0.05 increments. (b): 0–200 in 10 increments.

If we don’t use the preliminary estimation for drug-related disease (\( \alpha_{2} \) = 0, \( \alpha_{1} \ne 0) \), the largest AUC is 0.9139. As \( \alpha_{2} \) gets larger, AUC turns to be larger until reaches the best value. To a certain extent, preliminary estimation for drug-related disease is significant, and can improve predictive ability.

3.3 Compared with Existing Methods

Systematic experiments are performed to evaluate the performance of the presented approach with nine other methods: Weighted Nearest Neighbor-Gaussian Interaction Profile(WNN-GIP) [22], Collaborative Matrix Factorization(CMF) [23], Kernelized Bayesian matrix factorization(KBMF) [24], Neighborhood Regularized Logistic Matrix Factorization(NRLMF) [25], a bipartite local model (BLM) [12], BLM with neighbor-based interaction profile inferring (BLM-NII) [7], comprehensive similarity measures and Bi-Random Walk algorithm (MBiRW) [5], standard LapRLS improved by incorporating a new kernel (NetLapRLS) [26]. We use 10-fold validation to compare GRTR performance. The area under the receiver operating characteristic (ROC) curve (AUC) [27] and the area under the precision recall (PR) curve (AUPR) are used to measure the quality of the predicted drugs for diseases. Figure 3 shows the ROC and PR curves of the 10-fold validation experiments. Table 1 gives the AUC and AUPR values. As expected, the GRTR’s AUC value is 0.9668, which outperforms all other competitive methods significantly. GRTR is 2.10% better than the second-best method, NRLMF, which also achieved an impressive result of 0.9465. For AUPR, we observe that the values are lower than those in the original papers. The main reason for this is that the data we used is larger and comparatively sparser. But GRTR also performs well, obtaining the second best in the dataset with the AUPR value of 0.5925. Though GRTR is slightly lower than NRLMF, it is still very competitive among the methods.

Fig. 3.
figure 3

The ROC and PR curves of GRTR and nine existing methods.

Table 1. AUC and AUPR values of GRTR and nine existing method.

3.4 Case Study

Here, the capability of our method in predicting novel drug-disease associations is further examined here. One well-known biological database CTD [30] and some references are used to verify the predicted novel drug-disease associations. For each disease, the candidate drugs are ranked based on the prediction scores and the top-10 predicted drugs as prediction results are collected. For instance, 8 of the top 10 potentially related drugs have been directly shown to be linked with Diabetes Mellitus type II (see Table 2), a endocrine system disease and metabolic disease. Lovastatin (DrugBank: DB00227) is predicted to treat it and has been recorded in CTD. Figure 4 presents lovastatin’s neighbor drugs and the diseases they can treat. Vitamin d-dependent rickets, osteoporosis and hyperlipoproteinemia are metabolic disease. And barakat syndrome is an endocrine system disease. In addition, we also find many associated genes between those that lovastatin can interact with to treat diabetes mellitus and the neighbors can act on to treat corresponding disease, e.g. there are 1307 genes shared with the pravastatin treating hyperlipoproteinemia, 1460 genes shared with the calcitriol treating vitamin d-dependent Rickets and 762 genes shared with the Ergocalciferol treating barakat syndrome, etc.

Table 2. The top 10 predicted results for diabetes mellitus associated drugs.
Fig. 4.
figure 4

Lovastatin (DB00227)’s neighbors and diseases can be treated. The yellow circle is the predicted drug, the red circles are the neighbor drugs of the predicted drug and the green circles are the diseases its neighbor can treated. (Color figure online)

4 Conclusions

Identifying drug-disease associations is helpful in reducing the difficulty of drug development and contributing to improved understanding of the underlying complex relations among drugs, targets and diseases. In this work, we systematically studied the problem of predicting drug-disease associations. Conventional methods for drug-disease association prediction mainly achieved unsatisfactory performance for the sparse known associations. However, the number of drug-disease associations verified by biological experiments is far less than that of the potential drug-disease associations. Therefore, GRTR based on graph regularized transductive regression was developed to predict potential drug-disease associations. At first a three-layer heterogeneous network consisting of drugs, diseases and targets was constructed. Afterwards, preliminary estimation for drug-related diseases was conceived from neighbor information. Ultimately, transductive regression strategy was adopted a to predict drug-disease associations on the heterogeneous network. The superior performance of GRTR was validated by cross validation and the top-ranked predictions. Experiment results indicate that our method can predict better than nine other approaches. Furthermore, case studies on several drugs indicated that potential drug-disease association predicted by GRTR could assist in the biomedical research.

Despite the efficiency of GRTR, there are still some limitations which require further optimization. Firstly, our method involved multiple parameters and the establishment of the optimal parameter values is still a challenging problem. Secondly, more biological information can be used to improve predictions. Finally, although higher reliability has been achieved, the current capability of GRTR remains unsatisfactory and necessitates further improvement.