1 Introduction

Drug-Drug Interactions (DDIs) refer to the situations when the activity of a drug is abolished, diminished or potentiated by a co-administrated drug. Many Adverse Drug Reactions (ADRs) caused by DDIs could be fatal, so that DDIs discovery is a crucial task in medical safety improvement and clinical cost reduction before the drugs are approved or administered.

Many researchers conduct experimental trials, such as metabolic profiles tests, transporter-associated pharmacokinetic interactions, or even animal-based tests. Although these methods support medically and biological verification, they are quite time-consuming and expensive. Moreover, DDIs may be the result of combinations of various processes. Therefore, it is impractical and unrealistic to predict all the potential DDIs by experimental tests. The knowledge of DDIs is still extremely obscure and incomplete.

Nowadays, many digital drug databases with DDIs reports have been constructed to provide online biochemical and pharmacological information. In order to accelerate the discovery procedure and reduce the exploration space, machine learning techniques are introduced into the DDIs discovery task. Most previous work consider DDIs discovery as a binary classification problem, in which drug-drug pairs with DDI labels are positive samples while pairs with non-DDI labels are negative samples.

However, in the practical applications, not all the drug-drug pairs with non-DDI labels can be simply viewed as negative samples, since potential DDIs may be discovered. Therefore, in this paper, we treat DDIs discovery as a positive and unlabeled learning (PU learning) problem. Some researchers have started the research on PU learning problem. Generally, existing mainstream work on PU learning can be divided into three families.

  1. 1.

    Two-phase methods. In general, these methods consist of two major phases: (1) identify the negative samples from the unlabeled set; (2) train classifiers iteratively and choose a classifier with the best performance. These methods include Naive Bayesian (Liu et al, 2013), S-EM (Liu et al, 2002), PEBL (Yu et al, 2002), Roc-SVM (Li and Liu, 2003), PNLH (Fung et al, 2006).

  2. 2.

    Reduction methods. These methods reduce PU learning to a problem of learning with high one-sided noise by treating the unlabeled set as noisy negative examples. Logistic regression is performed after weighting the examples to handle noise rates of greater than 1/2 (Lee and Liu, 2003). Biased SVM (B-SVM) (Liu et al, 2013) directly uses the asymmetric cost formulation of the SVM algorithm.

  3. 3.

    Statistical methods. These methods calculate statistical probabilities to predict the label of the ambiguous samples. Predict probabilities are added as sample weights to identify protein records (Elkan and Noto, 2008). Pulce framework proposed in (Ienco and Pensa, 2016) learns a distance metric by leveraging statistical relationships between attributes.

However, none of these three types of methods is able to achieve satisfactory PU learning performance, because: (1) the two-phase methods iteratively train classifiers after reliable negative set is extracted, which are quite time-consuming; (2) the reduction methods directly learns from noisy samples using parameter-sensitive methods, e.g., weighted logistic regression and biased SVM, which are difficult to achieve optimal performance; (3) the statistical prediction methods lack prior information in most real-world applications, which leads to relatively poor accuracy of statistical prediction.

In this paper, we propose a PU learning framework based on Extreme Learning Machine (ELM), named PU-ELM. PU-ELM first converts the reliable negative set extraction task to a one-class classification problem, and then treats the weighted learning task as a semi-supervised learning problem (PU-ELM-SS) or a weighted learning problem (PU-ELM-EW). Therefore, our PU-ELM framework provides novel abilities of both negative samples extraction and weighted learning.

To the best of our knowledge, this is the first paper solving PU learning problems based on ELM. In summary, a PU learning framework PU-ELM based on ELM is proposed, which consists of reliable negative set extraction and the learning component. To be specific, the major contributions of this paper are as follows.

  1. 1.

    A negative set extraction strategy based on one class ELM (OC-ELM-RN) is proposed to extract reliable negative set with higher-quality;

  2. 2.

    A semi-supervised approach (denoted as PU-ELM-SS) is applied in the learning component, which achieves extremely fast PU learning speed;

  3. 3.

    An entropy based weighted approach (PU-ELM-EW) is proposed, which assigns more effective weights to the ambiguous samples based on entropy, and achieves better weighted learning performance;

  4. 4.

    Extensive experiments on a synthetic dataset with various settings are conducted to verify the performance of PU-ELM compared with state-of-the-art methods.

  5. 5.

    The practical effectiveness of PU-ELM is verified on a real-world dataset extracted from DrugBank.

The remainder of this paper is organized as follows. Section 2 introduces preliminaries related to this paper. The features calculation is presented in Section 3. Our reliable negative set extraction approach is proposed in Section 4.1, and our weighted learning approach is proposed in Sect. 4.2. Experiments results are analyzed in Sect. 5. Section 6 draws the conclusion.

2 Preliminaries

In this section, we first introduce the drug-drug interactions (DDIs) discovery task, and the positive and unlabeled learning (PU learning) problem. Then we give a brief introduction to Extreme Learning Machine (ELM).

2.1 Drug-drug interactions

DrugBankFootnote 1 is a gold standard database of drug information, from which drug-drug pairs and properties of drugs can be collected. Adverse drug reactions information corresponding to a specific drug can be extracted from MetaADEDB (Cheng et al, 2013) and FAERSFootnote 2 maintained by FDA. Each ADR report describes primary suspect drug, secondary suspect drug, concomitant and interacting. Based on these information, the drug-ADR connections can be constructed.

Given the dataset above, the drug-drug interaction discovery task is to predict whether a drug-drug pair without confirmed DDIs will cause potential ADRs if these two drugs are administered together. This task is a great help to pharmacists and researchers for medicine development and administration.

DDIs prediction problem is studied in (Cheng and Zhao, 2014) based on several machine learning methods. A network of DDIs is constructed from DrugBank, MetaADEDB and FAERS. The predict model is trained directly using Naive Bayes, decision tree, k-nearest neighbors, logistic regression and Support Vector Machine. However, this work does not consider the unlabeled samples, and the proposed PU learning approach is intuitive, which leads to lot of room for performance improvement.

Fig. 1
figure 1

Flowchart of PU learning problem


In the PU learning problem, given a dataset \(\mathbb {S}\), the subset \(\mathbb {P} \in \mathbb {S}\) refers to the set containing all the samples with positive labels. The subset of remaining unlabeled samples is denoted as \(\mathbb {U}\), and we have \(\mathbb {P} \bigcup \mathbb {U} = \mathbb {S}\).

We cannot simply treat any sample \(\mathbf {x}_i \in \mathbb {U}\) as either a positive sample or a negative one. Therefore, the first step is to determine a reliable negative set (denoted as \(\mathbb {RN}\)) from \(\mathbb {U}\), in which the samples can be treated as negative. Then the second step is to train a classifier by the positive set \(\mathbb {P}\), reliable negative set \(\mathbb {RN}\) and remaining unlabeled set \(\mathbb {U}{'}\) with ambiguous samples, where \(\mathbb {U}{'}=\mathbb {U}\setminus \mathbb{RN}\). The flowchart of PU learning problem is shown as Fig. 1.

In addition to the three families of PU learning methods mentioned in Section 1, some researchers aim to further improve the learning performance. Xiao et al. assign two similarity weights to the ambiguous samples, and then build an SVM based classifier. Some previous work (Nguyen et al, 2012; Pan et al, 2012; Chang et al, 2016) focus on PU learning problem in time series and data stream, in which data will be expired outside the sliding window. A multi-typed heterogeneous collective classification method named MHCC is proposed in (Li et al, 2014), and extended to PU learning problem to detect fake restaurants reviews. Random walk with restarts and kNN methods are used in (Lan et al, 2016) to extract reliable negative samples. However, the operations of large dimensional matrix in these methods have extremely high computation cost. An ensemble of SVMs with bootstrap resamples is applied in (Claesen et al, 2015) to increase robustness against label noises. However, the iteratively resampling procedure is quite time-consuming.

In the bioinformatics field, some real-world applications with PU learning problems have been studied, such as protein-protein interactions (PPI) detection (Tsai et al, 2008) , genomics sequence classfication (Xiao and Segal, 2008), disease genes identification (Yang et al, 2012) and kinase substrate prediction (Yang et al, 2016) . However, these work mainly focus on bioinformatic feature extraction and representation. The PU learning strategies and methods are relatively naive and intuitive.

2.2 Extreme learning machine

Extreme Learning Machine (ELM) (Huang et al, 2004; Huang et al, 2006) achieves extremely fast learning speed and good generalization performance. ELM has been proved to be a powerful learning component in many fields (Zhao et al, 2011; Zhao et al, 2014; Zhao et al, 2016; Wang et al, 2008; Sun et al, 2011; Sun et al, 2014). Different from traditional back-propagation feed-forward neural networks, ELM randomly generates parameters of the single hidden layer, so that ELM achieves extremely fast learning speed and good generalization performance.

Given N arbitrary samples \(\{{\mathbf{{x}}_i},{{t}_i}\} \in {\mathbf{{R}}^{n \times m}}\), the optimization problem of ELM is described as

$$\begin{aligned}&{\text {minimize:}}\qquad \dfrac{1}{2}||\beta ||^2 + \dfrac{1}{2}\sum \limits _{i=1}^{N}{\mathbf {C}_i||\xi _i||^2} onumber \\&\text {subject to:} \qquad {\text {h}}(\mathbf {x}_i)\beta = t_i^{\top } - \xi _i^{\top }, ~~~ i=1,\ldots ,N \end{aligned}$$
(1)

where \(\mathbf {C}_i\) is the ith penalty coefficient for tradeoff, and \(\xi _i\) is the error vector to the ith corresponding training sample.

ELM is mathematically modeled as

$$\begin{aligned} \sum \limits _{i = 1}^L {{\beta _i}{\text {G}}({{\varvec{\omega }}_i},{b_i},\mathbf{{x}})} = \varvec{\beta }{\text {h}}(\mathbf {x}) \end{aligned}$$
(2)

where L is the number of hidden layer nodes, \({{\varvec{\omega }}_i} = {[{\varvec{\omega }_{i1}},{\varvec{\omega }_{i2}},...,{\varvec{\omega }_{in}}]^\mathrm {T}}\) is the input weight vector from input nodes to the ith hidden node, \(b_i\) is the bias of ith hidden node, \(\varvec{\beta }_i\) is the output weight from the ith hidden node to the output node. \({\text {G}}({{\varvec{\omega }}_i},{b_i},\mathbf{{x}})\) is the activation function to generate mapping neurons, which can be any nonlinear piecewise continuous functions.

The outputs of hidden layer, denoted as matrix \(\mathbf {H}\), can be viewed as the procedure of ELM feature mapping, which is calculated as

$$\begin{aligned} \mathbf {H} = \begin{bmatrix} {\text {h}} ({\mathbf{{x}}_1}) \\ \vdots \\ {\text {h}} ({\mathbf{{x}}_N}) \\ \end{bmatrix} = {\begin{bmatrix} {{\text {G}}({{\varvec{\omega }}_1},{b_1},\mathbf{{x}}_1)}&\cdots&{{\text {G}}({\varvec{\omega }}_L,b_L,\mathbf{{x}}_1)} \\ \vdots&\cdots&\vdots \\ {{\text {G}}({{\varvec{\omega }}_1},{b_1},\mathbf{{x}}_N)}&\cdots&{{\text {G}}({{\varvec{\omega }}_L},{b_L},\mathbf{{x}}_N)} \\ \end{bmatrix}_{N \times L}} \end{aligned}$$
(3)

ELM aims to minimize the norms of both the output weights and the training errors between the output \(\mathbf {O}=\mathbf {H}\varvec{\beta }\) and original class labels \(\mathbf {T}\). In other words, ELM with L hidden layer nodes tries to approximate training samples with zero error, i.e., \(\sum olimits _{i=1}^{L}{||\mathbf {O}_i-\mathbf {T}_i||}=0\). Thus, by replacing \(\mathbf {O}\) with \(\mathbf {T}\), the output weights \(\varvec{\beta }\) can be calculated by feature mapping matrix \(\mathbf {H}\) and training samples labels \(\mathbf {T}\):

$$\begin{aligned} \varvec{\beta } = {{\mathbf {H}}^\dag }{} \mathbf{{T}} \end{aligned}$$
(4)

where \({\mathbf {H}}^\dag\) is the Moore-Penrose Inverse of \(\mathbf {H}\).

According to the ridge regression theory (Hoerl and Kennard, 2000), adding biases to the diagonal elements of a symmetric matrix improves the stability and generalization performance. Thus the output function of ELM with bias \(\dfrac{\mathbf {I}}{C}\) is rewritten as

$$\begin{aligned} {\text {f}}(\mathbf {x})=\varvec{{\text {h}}}(\mathbf {x})\varvec{\beta }=\varvec{{\text {h}}}(\mathbf {x})\mathbf {H}^\mathrm {T} \left( \dfrac{\mathbf {I}}{C}+\mathbf {HH}^\text {T}\right) ^{-1} \mathbf {T} \end{aligned}$$
(5)

or

$$\begin{aligned} {\text {f}}(\mathbf {x})=\varvec{{\text {h}}}(\mathbf {x})\varvec{\beta }=\varvec{{\text {h}}}(\mathbf {x})\left( \dfrac{\mathbf {I}}{C}+\mathbf {H}^\mathrm {T}\mathbf {H}\right) ^{-1}\mathbf {H}^\mathrm {T} \mathbf {T} \end{aligned}$$
(6)

in the case that the number of training samples is much larger than the dimensionality of the feature space.

3 Drug-drug interactions features

Inspired by the similarities in (Cheng and Zhao, 2014), we calculate four improved drug-drug similarities, which are side effect similarity, chemical structural similarity, target protein similarity and therapeutic similarity.

Side effect similarity: each drug is coded by ADE bit vectors in MetaADEDB. Each bit represents one ADE. If an ADE is associated with a given drug in MetaADEDB, the corresponding bit is set to 1; otherwise 0. Then the side effect similarity \({S_{SE}}(d_{i},d_j)\) between drugs \(d_i\) and \(d_j\) is calculated by Jaccard similarity coefficient.

Chemical structural similarity: the chemical structural similarity \({S_{CS}}(d_i,d_j)\) between drugs \(d_i\) and \(d_j\) is calculated via Jaccard similarity coefficient using MACCS keys.

Target protein similarity: each drug is coded using target protein bit vectors. Each bit represents one target protein. If a target protein is associated with a drug in the DTI network, the corresponding bit is set to 1; otherwise 0. Then \({S_{TP}}(d_i,d_j)\) between drug \(d_i\) and \(d_j\) is calculated using Jaccard similarity coefficient using the drug’s target protein bit vectors.

Therapeutic similarity: the kth level drug therapeutic similarity \({S_{TS}}(d_i,d_j)\) is defined via the ATC codes as

$$\begin{aligned} {S_{TS}}(d_i,d_j) = \dfrac{1}{5}\sum \limits _{k=1}^5{\dfrac{{\text {ATC}}_k(d_i)\cap {\text {ATC}}_k(d_j)}{{\text {ATC}}_k(d_i)\cup {\text {ATC}}_k(d_j)}} \end{aligned}$$
(7)

where \({\text {ATC}}_k(d_i)\) represents all ATC codes at the kth level of drug \(d_i\), five levels of ATC codes in total. For a drug with multiple ATC codes, the therapeutic similarity is calculated using the average value.

4 PU learning framework based on ELM

Our proposed framework PU-ELM consists of two major components: (1) the reliable negative set extraction method; (2) the classifier learning method. The first component applies a one-class classification strategy based RN set extraction algorithm OC-ELM-RN. In the second component, we apply two different strategies, namely semi-supervised learning algorithm PU-ELM-SS and entropy based learning algorithm PU-ELM-EW. All the components in our PU-ELM framework are based on ELM theory. The overall framework is shown as Fig. 2.

Fig. 2
figure 2

PU-ELM framework

4.1 Reliable negative set extraction

The quality of reliable negative set directly effects the PU learning performance. Most existing approaches utilizes probability threshold (Liu et al, 2013), frequency threshold (Liu et al, 2002) and prototype vector boundary (Li and Liu, 2003) to extract reliable negative set \(\mathbb {RN}\). However, the lack of a priori information leads to unsatisfactory results.

In order to improve the quality of the reliable negative set, we propose a one-class classification strategy based extraction algorithm named OC-ELM-RN (One-Class ELM for Reliable Negative extraction). In the classification problem with only one target class, \({\text {f}}(\mathbf {X})\) in the original ELM is the least mean square solution for given input \({\text {g}}(\mathbf {X})\) and class label T. Note that T is a constant value representing the target class label. The output weight \(\varvec{\beta }\) becomes a hyper plane approximation mapping \({\text {g}}(\mathbf {X})\) to T (Leng et al, 2015). Then the difference \(|{\text {f}}(\mathbf {X})-T|\) is the distance between the training samples and the ELM hyper plane. Therefore, a distance function \({\text {d}}(\mathbf {x}_i)=|{\text {f}}(\mathbf {x})_i-T|\) is defined to indicate how far the sample \(\mathbf {x}_i\) is from the ELM hyper plane.

We first give the definition of one-class classification problem in the context of this paper.

Definition 1

(One-class classification) Given l labeled data \(\{\mathbf {x}_i,T\}_{i=1}^{l}\) all belonging to class T, u unlabeled data \(\{\mathbf {x}_i\}_{i=1}^{u}\), and a distance threshold \(\xi\), for each unlabeled samples \(\mathbf {x}_i\), if \({\text {d}}(\mathbf {x}_i)<\xi\), \(\mathbf {x}_i\) is accepted by the target class, which means \(\mathbf {x}_i\) belongs to class T; rejected otherwise.

In order to set up a representation of negative samples space, PU-ELM first extracts representative negative samples from unlabeled samples. In other words, a fraction \(\theta\) of the unlabeled samples, which are most deviant from the target class T, will be rejected by the target class. Thus in framework PU-ELM, the one-class \(\theta\)-extraction problem can be described follows.

Definition 2

(One-class \(\theta\)-extraction) Given l labeled data \(\{\mathbf {x}_i,T\}_{i=1}^{l}\) all belonging to class T, u unlabeled data \(\{\mathbf {x}_i\}_{i=1}^{u}\), and a distance threshold \(\xi\), the positive one-class extraction problem is to extract the samples of class T from the unlabeled samples whose distances are smaller than \(\xi\); and the negative one-class extraction problem is to extract negative samples whose distances are larger than or equal to \(\xi\), with a rejection rate \(\theta\).

Based on Definition 2, our proposed OC-ELM-RN is designed to extract reliable negative set \(\mathbb {RN}\). Since the class labels of all the labeled training samples are a constant value T, we define the output vector \(\mathbf {T}\) including l labeled samples as

$$\begin{aligned} \mathbf {T} = [t_1,\ldots ,t_l]^\top = [T,\ldots ,T]^\top _{l} \end{aligned}$$
(8)

According to the optimization problem of ELM, in the one-class classification problem, whether each sample can be accepted by the target class (i.e., the positive class) is determined by the distance from the sample to the target class in the ELM feature space, which indicates the distance to the ELM hyper plane. Given a sample \(\mathbf {x}_i\), the distance function between \(\mathbf {x}_i\) and the target class (i.e., the positive class) is defined as

$$\begin{aligned} {\text {d}}(\mathbf {x}_i)=\, & {} \big | {\text {h}}(\mathbf {x}_i)^{\top }\varvec{\beta }-T\big | onumber \\= \, & {} \bigg | {\text {h}}(\mathbf {x}_i)^{\top }\mathbf {H}^{\top }\left( \dfrac{\mathbf {I}}{C} + \mathbf {HH}^{\top }\right) ^{-1}\mathbf {T} - T\bigg | \end{aligned}$$
(9)

A larger \({\text {d}}(\mathbf {x}_i)\) means the training sample \(\mathbf {x}_i\) is further from the target class T. The only hyper-parameter of one-class \(\theta\)-extraction is the rejection fraction ratio \(\theta\), which determines the ratio of samples rejected by the target class. Then given all the sample distances calculated by Equation 9, the distance threshold \(\xi\) converted from rejection fraction \(\theta\) directly determines whether each sample belongs to the target class T or not. We first sort the samples by their distances from the target class, and then assign the value of threshold \(\xi\) to the distance of the jth least deviant sample, where \(j=\lfloor \theta N\rfloor\).

Then we describe the output function for \(\mathbf {x}_i\) to the target class as

$$\begin{aligned} {\text {f}}(\mathbf {X}) = sign(\xi - {\text {d}}(\mathbf {X})) \end{aligned}$$
(10)
figure a

With the output function of one-class ELM, Algorithm 1 shows the major procedure of OC-ELM-RN. Given l positive training samples and u unlabeled training samples, OC-ELM-RN returns r reliable negative samples \((r<u)\). OC-ELM-RN first initiates an empty set \(\mathbb {RN}\) to maintain reliable negative samples (Line 1). For each \(\mathbf {x}_i\) in the unlabeled set \(\mathbb {U}\) (Lines 2-5), OC-ELM-RN calculates the distance \(d_i\) between \(\mathbf {x}_i\) and the target class (i.e., the positive class in PU learning problem) (Line 3). If the distance \(d_i\) is not less than a threshold \(\xi\) (Line 4), OC-ELM-RN treats \(\mathbf {x}_i\) as a reliable negative sample, and pushes \(\mathbf {x}_i\) into the reliable negative set \(\mathbb {RN}\) (Line 5).

4.2 Classifiers learning

In the second component of PU-ELM, classifiers are trained using the positive set \(\mathbb {P}\), reliable negative set \(\mathbb {RN}\) extracted by OC-ELM-RN, and the remaining unlabeled set \(\mathbb {U}\). We present two learning strategies based on semi-supervised learning and entropy based weighted learning respectively.

4.2.1 The semi-supervised learning approach

Partially labeled-data learning problems comparison. Note that the semi-supervised classification problem is different from the three problems mentioned in Sect. 4.1, including one-class classification (refers to Definition 1), positive one-class extraction and negative one-class extraction (refers to Definition 2). The comparison is presented in Table 1. Although all of them are designed to deal with a training dataset consists of both labeled and unlabeled samples, the semi-supervised classification task is to learn how to classify the unlabeled samples into the multiple positive and multiple negative classes. Therefore, methods based on ELM hyper plane distance are impractical in the semi-supervised learning scenario. In order to achieve extremely fast learning speed, we present a PU learning method PU-ELM-SS based on semi-supervised ELM (Huang et al, 2014).

Table 1 Partially labeled-data learning problems comparison

The learning algorithm. In the classification problem of learning with a few labeled samples and plenty more unlabeled samples, the semi-supervised ELM (SS-ELM) (Huang et al, 2014) maintains the original ELM geometric properties of feature space for the training data in decision space. In the PU learning problem, since the training dataset includes reliable negative samples in set \(\mathbb {RN}\), positive samples in \(\mathbb {P}\) and unlabeled samples in \(\mathbb {U}\), we first design a semi-supervised learning method based on semi-supervised ELM to realize the training phase of the classifier. For convenience, the framework PU-ELM with semi-supervised ELM as classifier is hereafter referred to as PU-ELM-SS.

Given l labeled data \(\{\mathbf {x}_i, \mathbf {t}_i\}_{i=1}^{l}\) and u unlabeled data \(\{\mathbf {x}_i\}_{i=1}^{u}\), the manifold regularization framework assumes that if two samples are close to each other, the difference between the conditional probabilities of these two points should be small (Huang et al, 2014). In order to penalize large variation in the conditional probability based on manifold regularization, we approximate the conditional probability as

$$\begin{aligned} \hat{L}_{m} = \frac{1}{2}\sum _{i,j}{w_{i,j}||\mathbf {t}_i-\mathbf {t}_j||}={\text {Tr}}(\mathbf {T}^{\top }\mathbf {LT}) \end{aligned}$$
(11)

where \(\mathbf {L=D-A}\) is the Laplacian matrix of similarity graph formed by both labeled and unlabeled training samples, \(\mathbf {D}\) is the degree matrix and \(\mathbf {A}\) is the adjacency matrix of the sample graph. \(\mathbf {T}\in \mathbf {R}^{(l+u)\times m}\) is the target labels with its ith row equal to \(\mathbf {t}_i\), and \({\text {Tr}}(\cdot )\) denotes the trace of a matrix.

Thus, the optimization problem with manifold regularization is described as

$$\begin{aligned}&{\text {minimize:}}\qquad \dfrac{1}{2}||\varvec{\beta }||^2 + \dfrac{1}{2}\sum \limits _{i=1}^{N}{\mathbf {C}_i}||\xi _i||^2 + \dfrac{\lambda }{2}{\text {Tr}}(\mathbf {T}^{\top }\mathbf {LT}) onumber \\&\text {subject to:}\qquad {\text {h}}(\mathbf {x}_i)\varvec{\beta } = \mathbf {t}_i^{\top } - \xi _i^{\top }, ~~~ i=1,\ldots ,l onumber \\&\qquad \qquad {\text {f}}(\mathbf {x}_i) = {\text {h}}(\mathbf {x}_i)\varvec{\beta }, ~~~ i=1,\ldots ,l+u \end{aligned}$$
(12)

where \(\mathbf {C}_i\) is the penalty coefficient of the ith training sample, \(\lambda\) is a tradeoff parameter. Therefore, the output weight \(\mathbf {\beta }\) of is calculated as

$$\begin{aligned} \varvec{\beta } = (\mathbf {I}_{l+u} + \mathbf {H}^{\top }\mathbf {CH} + \lambda \mathbf {H}^{\top }\mathbf {LH})^{-1} \mathbf {H}^{\top }\mathbf {C}\mathbf {T} \end{aligned}$$
(13)

when the number of samples N is larger than the number of hidden layer nodes L, or

$$\begin{aligned} \varvec{\beta } = \mathbf {H}^{\top }(\mathbf {I}_{l+u} + \mathbf {CH}\mathbf {H}^{\top } + \lambda \mathbf {LH}\mathbf {H}^{\top })^{-1}\mathbf {C}\mathbf {T} \end{aligned}$$
(14)

when L is larger than N.

The procedure of framework PU-ELM applying SS-ELM is presented as Algorithm 2.

figure b

The graph Laplacian matrix is first constructed using both the labeled and unlabeled training samples (Line 1). Then PU-ELM-SS constructs an ELM neural network by randomly generating the input weights \(\varvec{\omega }\) and biases \(\mathbf {b}\) (Line 2). After the calculation of the ELM feature mapping matrix \(\mathbf {H}\) (Line 3), the output weight \(\varvec{\beta }\) is calculated using Equation 13 in the case that L is larger than or equal to N, or using Equation 14 in the case that L is smaller than N.

4.2.2 The entropy based weighted learning approach

Weighted learning. Intuitively, given a classification problem with biased distribution of samples belonging to different classes, with the advantage in quantity, the majority class tends to push the separating boundary towards the class with minor number of samples to gain better classification (Zong et al, 2013). Since ELM is to minimize both the training errors and the norm of output weight, a weighted ELM proposed in (Zong et al, 2013) tries to assign a higher tradeoff coefficient C to the minority class.

In the weighted ELM, an \(N\times N\) diagonal matrix \(\mathbf {W}\) indicating the weight of training samples is first generated. Each diagonal element \(\mathbf {W}_{ii}\) in the weight matrix is the weight of the ith element \(\mathbf {x}_i\). According to the optimization problem of the original ELM, to minimize the norm of output weight and weighted cumulative error, the optimization problem is described as

$$\begin{aligned}&{\text {minimize:}}\qquad \dfrac{1}{2}||\varvec{\beta }||^2 + \dfrac{1}{2}\sum \limits _{i=1}^{N}{\mathbf {W}_i}{\mathbf {C}_i}||\xi _i||^2 onumber \\&\text {subject to:}\qquad {\text {h}}(\mathbf {x}_i)\varvec{\beta } = \mathbf {t}_i^{\top } - \xi _i^{\top }, ~~~ i=1,\ldots ,N \end{aligned}$$
(15)

Similar to the original ELM, when the size of training dataset N is relatively larger than the number of hidden layer nodes L, the calculation of left pseudo-inverse of an \(L\times L\) matrix is more efficient than the right pseudo-inverse of an \(N\times N\) matrix; in the other case when L is relatively small, right pseudo-inverse is recommended. Therefore, we have the following two solutions to the output weight \(\varvec{\beta }\) in different cases:

$$\begin{aligned} \varvec{\beta } = \mathbf {H}^{\top } \left(\dfrac{\mathbf {I}}{C} + \mathbf {WHH}^{\top }\right)^{-1}\mathbf {WT} \end{aligned}$$
(16)

when N is smaller than L, or

$$\begin{aligned} \varvec{\beta } = \left(\dfrac{\mathbf {I}}{C} + \mathbf {H}^{\top }\mathbf {WH}\right)^{-1}\mathbf {H}^{\top }\mathbf {WT} \end{aligned}$$
(17)

when N is larger than L.

Weights determination. The remaining key issue is how to set the specific weight to each training sample. Different from weighted ELM (Zong et al, 2013), which assign weights of training samples manually, we propose an entropy based weighting scheme. We first run a clustering method (e.g. k-Means) to cluster the training set into two sets. We denote the cluster with more positive samples than negative ones as the positive cluster \(C_P\), and the cluster with more negative samples as the negative cluster \(C_N\). Then, we calculate the weight of each sample following these two principles:

  1. 1.

    the samples of positive set and reliable negative set have larger weights than the unlabeled samples;

  2. 2.

    within the positive set or the reliable negative set, the samples with smaller distance to the cluster centroid have larger weights.

The first principle is intuitive. The positive and reliable negative samples are more convinced to the weighted learning method. Therefore, we set the weights of the samples of both positive set \(\mathbb {P}\) and reliable negative set \(\mathbb {RN}\) to 1. We also calculate the entropy values of the positive cluster \(C_P\) and the negative cluster \(C_N\) respectively.

\(E_{+}\) is the entropy of \(C_P\) calculated as

$$\begin{aligned} E_+ = -p_+\log _2{p_+} - p_u^{+}\log _2{p_u^+} \end{aligned}$$
(18)

where \(p_+=\dfrac{n_+}{n_+ + n_u^+}\) and \(p_u^+=\dfrac{n_u^+}{n_+ + n_u^+}\).

\(E_{-}\) is the entropy of \(C_N\) calculated as

$$\begin{aligned} E_- = -p_-\log _2{p_-} - p_u^{-}\log _2{p_u^-} \end{aligned}$$
(19)

where \(p_-=\dfrac{n_-}{n_- + n_u^-}\) and \(p_u^-=\dfrac{n_u^-}{n_- + n_u^-}\).

As to the second principle, being closer to the cluster centroid means being closer to the centroid of either the positive samples or the reliable negative samples extracted by OC-ELM-RN. In other words, the unlabeled samples far from the centroids of the two clusters are more ambiguous than the rest unlabeled samples. Therefore, taking both these two principles into consideration, the weight of each unlabeled training sample \(\mathbf {x}_i\) is calculated as

$$\begin{aligned} \mathbf {W}_{ii} = {\left\{ \begin{array}{ll} E_+\cdot (1-\dfrac{r_i}{r_+}), &{} \mathbf {x}_i \in C_P\\ E_-\cdot (1-\dfrac{r_i}{r_-}), &{} \mathbf {x}_i \in C_N \end{array}\right. } \end{aligned}$$
(20)

where \(r_i\) is the distance between \(\mathbf {x}_i\) and the cluster centroid, \(r_+\) is the maximum distance (i.e. the radius) of \(C_P\) and \(r_-\) is the maximum distance of \(C_N\).

The proposed algorithm. According to the above theory, we present the procedure of our proposed entropy based weighted learning algorithm PU-ELM-EW as Algorithm 3.

figure c

In PU-ELM-EW, we first initiate several variables to maintain the number of clusters (i.e. k) and the number of each class in each cluster (i.e. \(n_+,n_u^+,n_-,n_u^-\)) (Line 1). Then we invoke the clustering methods (here we apply k-Means as an example) to separate the whole training dataset into k clusters \(C_P\) and \(C_N\) (Line 2). For each sample \(\mathbf {x}_i\) in \(C_P\) (Lines 3-7), we add one to the number of positive samples \(n_+\) (Line 5) or the number of unlabeled samples in \(C_P\) \(n_u^+\) (Line 7), and calculate the Euclidean distance between \(\mathbf {x}_i\) and the centroid of \(C_P\) (Line 8). For each sample \(\mathbf {x}_i\) in \(C_N\) (Lines 9-13), we add one to the number of reliable negative samples \(n_-\) (Line 11) or the number of unlabeled samples in \(C_N\) \(n_u^-\) (Line 13), and calculate the Euclidean distance between \(\mathbf {x}_i\) and the centroid of \(C_N\) (Line 14). With the total numbers of samples in every set, entropies \(E_+\) and \(E_-\) is calculated using Equation 18 and 19 (Line 15). With the distances between samples and clusters, the training weight of each sample is calculated using Equation 20 (Line 17). Then PU-ELM-EW begins the neural network construction. Similar to the original ELM, PU-ELM-EW first generates the input weights \(\varvec{\omega }\) and biases \(\mathbf {b}\) randomly (Line 18). Then the ELM feature mapping matrix \(\varvec{\beta }\) is calculated using Equation 16 (Line 21). If the number of hidden layer neurons L is larger than the number of training samples N, PU-ELM-EW uses Equation 16 to calculate the output weight \(\varvec{\beta }\) (Line 21); otherwise, Equation 17 (Line 23).

5 Experiments

In this section, we first introduce our experiments setup. Then we present the performance evaluation on both synthetic datasets and real-world applications.

5.1 Experiments setup

All the experiments are conducted on a 64-bit Windows PC with Intel Core i5-6500 (3.20GHz) CPU, 8GB RAM memory. Data processing and learning methods are realized using MATLAB R2013b. Our proposed framework PU-ELM is verified on both synthetic and real-world datasets:

  1. 1.

    Experiments with different settings of data and algorithms are evaluated on the synthetic dataset.

  2. 2.

    Real-world application performance are verified on the real-world dataset.

Eight state-of-the-art methods are selected as our rivals in our experiments:

  1. 1.

    PU-SVM-SS: Based on our proposed PU-ELM-SS, we replace one-class ELM with one-class SVM to extract reliable negative set, and semi-supervised ELM with semi-supervised SVM to train classifiers.

  2. 2.

    PU-SVM-Biased: Based on our proposed PU-ELM-EW, we replace one-class ELM with one-class SVM, and weighted ELM with biased SVM (Liu et al, 2013).

  3. 3.

    One-class ELM: A direct implementation of one-class ELM for both reliable negative set extraction and classifiers training. The PU learning problem is completely treated as a one-class classification problem.

  4. 4.

    Spy+SVM: The spy technique of S-EM (Liu et al, 2002) is applied for reliable negative set extraction, combined with SVM to train classifiers iteratively.

  5. 5.

    Roc-SVM (Li and Liu, 2003): The Rocchio technique is applied for reliable negative set extraction, combined with SVM to train classifiers iteratively.

  6. 6.

    NB+SVM (Liu et al, 2013): The Naive Bayesian method is applied for reliable negative set extraction, combined with SVM to train classifiers iteratively.

  7. 7.

    PEBL (Yu et al, 2002): The 1-DNF method is applied to filter out possible positive samples by calculating attributes frequencies of each sample.

  8. 8.

    RESVM (Claesen et al, 2015): Several methods are ensemble to achieve better learning performance.

5.2 Synthetic dataset evaluation

In order to evaluate our algorithms under various settings, in this subsection, we first present our evaluation results on a synthetic dataset.

5.2.1 Data generation and parameters

We generate a binary classification dataset with a positive class and a negative class. Each class has 500 data points (i.e., totally 1000 samples in the dataset). Each sample subjects to multivariate Gauss distribution. The positive samples have a mean value matrix \([-2,-2]\) and a covariance matrix [2, 0; 0, 2]; while the negative samples have a mean value matrix [2, 2] and the same covariance matrix [2, 0; 0, 2]. The distribution of generated data is shown as Fig. 3.

Fig. 3
figure 3

Synthetic dataset distribution

To simulate a PU learning problem, we need to guarantee that the actual class labels of both positive samples and unlabeled samples are maintained, so that the classification performance can be evaluated. Therefore, in each set of our experiments, we adopt the following strategies to construct our synthetic dataset:

  • We keep a positive rate \(\lambda\) of the positive samples. The parameter \(\lambda\) is varied from \(10\%\) to \(100\%\).

  • We treat the rest (\(1-\lambda\)) positive samples as unlabeled samples.

  • We treat all the negative samples as unlabeled samples.

Another parameter which has to be determined before the experiments is the reliable negative set rejection rate \(\theta\). It decides the fraction of rejection of the unlabeled samples whose distance is larger than distance threshold \(\xi\), as explained in Section 4.1. The rate \(\theta\) is determined by the following set of experiments.

5.2.2 Experiments results

Determination of \(\varvec{\theta }\). We first try to determine an appropriate \(\theta\). Since all the unlabeled samples have to be traversed, no matter what the value of \(\theta\) is, \(\theta\) has no direct effect on the training time of PU-ELM. We only evaluate the variation of F-score with an increasing \(\theta\).

Since \(\theta\) is the parameter of our proposed reliable negative extraction method, only four methods using OC-ELM-RN strategy are evaluated in this set of experiments, which are PU-ELM-SS, PU-ELM-EW, PU-SVM-SS, PU-SVM-Biased. The positive rate \(\lambda\) is set to 30% temporarily.

Fig. 4
figure 4

F-score comparison with varied \(\theta\)

The results are shown in Figure 4. We select 15(%) as the value of rejection rate \(\theta\) in the following experiments, due to its highest learning performance. Besides, we find some outcomes which may guide the observation in the following experiments: (1) PU-ELM-EW has the highest performance; (2) the SVM based method combined with our extraction strategy (PU-SVM-Biased) outperforms PU-ELM-SS; (3) the effect of \(\theta\) on ELM based methods is less than that on SVM based methods, which suggests that ELM has better generalization performance than SVM in the context of this dataset and application.

Training time evaluation. We then evaluate training time with varied positive sample rate \(\lambda\). The results are presented in Fig. 5.

Fig. 5
figure 5

Training time comparison with varied \(\lambda\)

The overall training time contains two major parts: (1) reliable negative set extraction time; (2)weighted learning time. From the results we can see that one-class ELM has the highest learning speed. Since one-class ELM traverses the unlabeled samples once without iterations, the training time of one-class ELM decreases linearly along with the increasing \(\lambda\). The second fastest group of methods includes PU-ELM-SS and PU-SVM-Biased, which both have no iterations in neither of the two major phases. The training time of the rest methods drop slightly, because the major time-consuming part is the iterative weighted learning, which is not directly related to the positive rate. The other proposed method PU-ELM-EW achieves relatively faster learning speed compared with state-of-the-art methods, due to the learning speed advantage of ELM. According to the above experimental results, we set \(\lambda\) to \(35\%\) in our following experiments.

F-score evaluation. Then we compare the classification performance between the proposed methods and our rivals. F-score is evaluated with varied positive rate \(\lambda\) in Fig. 6.

Fig. 6
figure 6

F-Score comparison with varied \(\lambda\)

As can be seen, PU-ELM-EW achieves the highest PU learning performance, due to its both one-class strategy based reliable negative set extraction and entropy based weighted learning strategy. PU-SVM-Biased achieves a similar performance only next to PU-ELM-EW. The NB+SVM and PEBL are based on a priori probabilities, thus they achieve rapid performance increment when \(\lambda\) remains relatively small. Between these two probability based methods, NB+SVM has a higher performance than PEBL. The one-class ELM, which is the most intuitive implementation for PU learning problem, achieves the lowest F-score with slightest fluctuation.

5.3 Real-world application evaluation

We also investigate the practical effectiveness in the real-world applications in this subsection.

5.3.1 Dataset

As introduced in Sect. 2, the real-world dataset used in our experiments is extracted from DrugBank, MetaADEDB and FAERS. This dataset contains 721 FDA-approved drugs in total, among which we extract 6946 DDI pairs as positive training samples. Then the same number of non-DDI pairs from these 721 drugs are randomly constructed and employed as unlabeled training samples. Each DDI pair has four similarity based attributes presented in Sect. 3. The reliable negative set rejection rate of this real-world dataset is set to 3.7%.

Since from the medical perspective, the unlabeled DDI pairs are not experimentally validated, the actual unlabed samples, which are predicted positive in our experiments, cannot be considered as misclassified. That is, false positive rate (FPR) and true negative rate (TNR) cannot be calculated. As a result, we only evaluate true positive rate (TPR) and false negative rate (FNR=1-TPR) in this set of experiments. In other words, we focus on how many positive samples are correctly classified by our methods. The predicted positive samples will be bio-experimentally validated and reported in our future work.

5.3.2 Experiments results

In this set of experiments, we verify the training time and true positive rate of PU-ELM. The results are presented in Table 2.

Table 2 Performance comparison on the real-world dataset

From the results we find that one-class ELM has the highest learning speed (which is 10.611 s), because directly implementing the one-class classification strategy needs no iterations of clustering or classifiers training. However, this strategy has the poorest learning performance, which is far from the requirement of practical applications. Our proposed PU-ELM-SS has the second highest learning speed (11.521 s), which is nearly the same as one-class ELM. Besides, PU-ELM-SS has a very promising true positive rate (75.311%) only second to our proposed PU-ELM-EW (81.882%), which has a significant TPR advantage compared with the rest state-of-the-art methods. RESVM also gains a good learning performance due to its ensemble of multiple learning strategy, but it has an order of magnitude longer training time than the rest methods except PEBL. In summary, aiming at practical applications, PU-ELM-SS achieves the highest learning speed and PU-ELM-EW achieves the best PU learning performance.

6 Conclusion

Focusing on the PU learning problem, we propose an ELM based framework PU-ELM. Within the framework, the first component OC-ELM-RN improves the quality of the extracted reliable negative set. As to the learning component, PU-ELM-SS based on semi-supervised learning strategy achieves extremely fast learning speed; PU-ELM-EW based on entropy weighting strategy improves the PU learning performance. Extensive experiments on both synthetic and real-world datasets verified the effectiveness and efficiency of PU-ELM with varied settings, indicating that PU-ELM has a better ability for DDIs discovery applications.