1 Introduction

Data mining is a very important and widely used technology in many diverse domains. Classification is an important data mining technology. In common binary classification tasks, the presence of both positive and negative examples in training data is needed to build an efficient classifier. Nevertheless, the acquisition of class labels is both costly and difficult in some real-world fields. In the worst case, it is impossible to extract class labels. In such cases, the algorithm for classification that only exploits a small portion of positive and large amount of unlabeled examples is needed. Such learning is termed as positive and unlabeled (PU) learning. Figure 1 shows the data distribution for PU learning in two-dimensional space. From the Fig. 1, we observe that each example has two features, [x]1 and [x]2. “Red + ” represents positive examples and “blue circle” represents unlabeled examples, where “blue ⊕” and “blue ⊖ ” describe positive and negative examples in unlabeled ones respectively. The objective of PU learning is to predict the label of these unlabeled examples (blue circle) and any new example point. In this paper, our main work is classification for PU learning.

Fig. 1
figure 1

Two-dimensional data distribution for PU learning

PU learning is a hot topic in the literatures of data mining. For example in practice, users may mark their favorite Web pages (positive examples), but they are usually unwilling to mark boring pages (negative examples). Recently, more and more applications for PU learning appear, such as bioinformatics classification [1], time series classification [2], data stream classification [3, 4], opinion mining [5, 6] and so on.

For PU learning, several approaches have been proposed. A theoretical study of probably approximately correct (PAC) learning from positive and unlabeled examples was first done in [7]. The study concentrated on the computational complexity of learning and showed that function classes learnable under the statistical queries model. Recently, learning from positive examples was also studied theoretically in [8] within a Bayesian framework where the distribution of functions and examples were assumed known. In [9], it was shown that if the sample size was large enough, minimizing the number of unlabeled examples classified as positive while constraining the positive examples to be correctly classified would give a good classifier. With these theories, various approaches have been suggested in the literatures to solve PU learning. The main difference for these methods was how to use unlabeled examples, and there were four types of methods.

The first type is the two-step strategy, which selects possible negative or positive examples from unlabeled examples, and then builds classifiers using positive examples and negative examples. The popular techniques for extracting possible negative or positive examples included spy, Rocchio and 1-DNF [10]. After extracting possible negative or positive examples, standard machine learning methods such as Naïve Bayes and SVM [11] are used to train classifiers. At present, the two-step strategy is a widely used method, such as S-EM [12], ROC-SVM [13], 1-DNF, PNLH [14], LCLC [15], Pulce [16, 17] etc. Pulce took advantage of the intrinsic relationships between attribute values and exceeded the independence assumption made by Naïve Bayes. In fact, Pulce leveraged the statistical properties of the data to learn a distance metric during the classification task.

The second type of method is related to one-class classification, which estimates the distribution of the positive class only from positive examples, such as one-class support vector machine [18, 19]. In fact, it has been proved that these methods are effective only for the case where the number of positive examples is large enough to capture the characteristics of the positive class, and their performance would be rather poor when this number is very small [20].

The third type of method is the probability estimation approach [21, 22]. In [21], it took each unlabeled example as both positive and negative example with weights pre-computed by an additional classifier for protein record identification. Luigi Cerulo et al made use of the approach in [23] to reconstruct gene regulatory networks without negative examples.

The fourth type of methods is a one-step method. It includes BSVM [9], WL [24, 25] and LUHC [26] which is most related with our work. BSVM was built by giving appropriate weights to the positive examples and unlabeled examples which were regarded as negative examples with noise, respectively. Experimental results indicated that the performance was better than most of two-step strategies. WL used Logistic Regression technique after weighting the negative class. LUHL proposed a Laplacian unit-hyperplane classifier which adding a manifold regularizer to make the predicted labels and the initial labels sufficiently close on the labeled points.

For PU learning, we mainly propose a novel classifier based on global and local learning (GLLC). From the global perceptive, our classifier constructs a biased least square support vector machine (BLSSVM) with all positive and unlabeled examples (can be taken as negative examples with noise) rather than only support vectors like BSVM. The obvious advantage of BLSSVM is that a little noise in a large amount of negative examples will have little effect on the building of the final classifier, whereas the same noise in support vectors will produce serious deviation on the construction of the final classifier, such as BSVM. In addition, BLSSVM has a very small fluctuation and excellent performance over a wide ratio of positive examples in unlabeled examples. This is because the number of negative examples is far more than positive examples in unlabeled examples and the distribution of negative class changes little over a wide ratio of positive examples in unlabeled examples. This leads to the final classifier being more stable and accurate whatever the ratio of positive examples in unlabeled examples is. From the local perspective, a smooth regularization term indicates the two data points should belong to the same class if they are close in the intrinsic geometry. This smooth term not only reflects geometrical properties of the training examples but also makes up the insufficient training of the global learning. In summary, combining global learning and local learning will produce a novel classifier which is called GLLC for short. To sum up, GLLC makes best use of the advantages and bypasses the disadvantages. Finally the numerical experiments on both synthetic and several real world datasets are presented to show the effectiveness of GLLC. The preliminary results show that our GLLC is less sensitive to the proportion of labeled examples, and superior to BSVM, LUHC and EM on both the ability of classification and the computational efficiency.

The rest of this paper is organized as follows. Some previous works are introduced in Section 2. The proposed GLLC approach is presented in Section 3. Then in Section 4, experiments on both synthetic and real datasets are reported. Finally, we conclude this paper in Section 5.

2 Related works

Before we go into the detail of related works, let us give the definition of PU learning firstly. Throughout this paper, we suppose that the training dataset is represented as T = {(x i , y i ),i = 1,⋯,p, x j , j = p + 1,⋯,p + u} where p and u are the number of positive and unlabeled examples respectively, \(x_{i} \in \mathbb {R}^{n}\) is a training example input and y i = 1 is the positive class label of x i ,(i = 1,⋯ ,p), the label of x j , j = p + 1,⋯ ,p + u is unknown, but it equals + 1 or − 1. The objective of PU learning is to find a real function f(x) in \(\mathbb {R}^{n}\) such that the output value of y for any x can be predicted by the sign function of f(x) as

$$ sign(f(x)). $$
(1)

2.1 BSVM

One of the most popular methods for PU learning was proposed based on SVM technique in [9] which was called BSVM. BSVM took unlabeled examples as negative examples with noise, namely y i = − 1,(i = p + 1,⋯ ,p + u), and then constructed an SVM classifier by giving a larger penalty parameter to weight the positive examples errors and a smaller penalty parameter to weight the negative examples errors. Then the optimization problem of BSVM can be formulated as

$$\begin{array}{@{}rcl@{}} &&\underset{(w,b,\xi )}{\min} \frac{1}{2}\left\| {\left. w \right\|} \right.^{2}+C_{p} \sum\limits_{i = 1}^{p} {\xi_{i} } +C_{n} \sum\limits_{i=p + 1}^{p+u} {\xi_{i} }\\ && s.t. y_{i} (<w,\varphi (x_{i} )>+b)\ge 1-\xi_{i} i = 1,2,...,p+u\\ && \xi_{i} \ge 0 i = 1,2,...,p+u \end{array} $$
(2)

where ξ=(ξ 1, ξ 2,...,ξ p + u )T is penalizing variable vector. C p and C n represent the penalty parameters of misclassification for positive and unlabeled examples respectively. Similar to classical SVM, function φ(⋅) maps the input space into a higher dimensional space when the dataset is nonlinearly separable. And then searching for a hyperplane in the feature spaces

$$ f(x)=<w,\varphi (x)>+ b = 0 $$
(3)

to predict the class label of any example x with (1).

Although BSVM [9] is a state-of-the-art learning algorithm for PU learning [27], only its support vectors play a role in classification, and performance is not satisfactory when the number of the labeled positive examples is small.

2.2 Least squares SVM

The least squares support vector machine (LSSVM) is a promising classification technique proposed firstly by Suykens et al. [28]. LSSVM seeks an optimal separating hyper-plane \(f(x)=<w,\varphi (x)>+b = 0,w\in \mathbb {R}^{n},b\in \mathbb {R}\)

That maximizes the margin between two classes based on given training examples {x i , y i }, i = 1, ⋯ ,l, where x i is a vector in the input space \(\mathbb {R}^{n}\) and y i denotes the class label taking a value of + 1 or − 1. φ(⋅) is a nonlinear function which is used to map the input space into a higher dimensional space. Therefore, the maximum-margin classifier is obtained by solving

$$ \begin{array}{l} \underset{(w,b,\xi )}{\min} \frac{1}{2}\left\| {\left. w \right\|} \right.^{2}+\frac{C}{2}\sum\limits_{i = 1}^{l} {{\xi_{i}^{2}}} \\ s.t.y_{i} (<w,\varphi (x_{i} )>+b)= 1-\xi_{i} i = 1,2,...,l \\ \end{array} $$
(4)

The geometric interpretation of the above problem (4) with \(x\in \mathbb {R}^{2}\) for linearly separable case is shown in Fig. 2., where minimizing \(\frac {1}{2}\left \| {\left . w \right \|} \right .^{2}\) realizes the maximal margin between the straight lines < w, φ(x) > + b = 1 a n d < w, φ(x) > + b = − 1, minimizing \(\sum \limits _{i = 1}^{l} {{\xi _{i}^{2}}}\) makes the two straight lines proximal to all inputs of positive (“ + ”) and negative (“ ”) examples respectively.

Fig. 2
figure 2

An interpretation of two-dimensional least squares SVM

It’s worth mentioning that the superiority of LSSVM is simpler and faster than SVM. This is because LSSVM needs to solve a quadratic programming with only equality constraints, or equivalently a linear system of equations, but SVM needs to solve a quadratic programming with inequality constraints.

3 Global and local learning

3.1 Global learning

Consider the PU learning described in the beginning of Section 2. To overcome the shortcoming of BSVM, we first construct a biased least squares SVM (BLSSVM) classifier by giving two different penalty parameters to weight the misclassification errors of positive and unlabeled examples respectively, where unlabeled examples can be regarded as negative examples with noise, i.e., y i = − 1,i = p + 1,⋯ ,p + u. Thus, we seek the decision function

$$f(x)=<w,x>+b = 0,w\in \mathbb{R}^{n},b\in \mathbb{R} $$

so that the classification hyper-plane separates both positive examples and unlabeled examples with the maximum margin, resulting in the primal optimization problem

$$ \begin{array}{l} \underset{(w,b,\xi )}{\min} \frac{\text{1}}{2}\left\| {\left. w \right\|} \right.^{2}+\frac{C_{p} }{2}\sum\limits_{i = 1}^{p} {{\xi_{i}^{2}}} +\frac{C_{n} }{2}\sum\limits_{i=p + 1}^{p+u} {{\xi_{i}^{2}}} \\ s.t. y_{i} (<w,x_{i} >+b)= 1-\xi_{i} i = 1,2,...,p+u \\ \end{array} $$
(5)

Both penalizing term and equality constraints reflect that all positive examples approximate to straight line < w, x > + b = 1 and unlabeled examples approximate to < w, x > + b = − 1 respectively. However, the degree of approximation for positive and unlabeled examples is different, which is controlled by the parameters C p and C n . Obviously, C p is larger than C n because we pay more attention to classify positive examples for PU learning. More concisely, Let

$$\begin{array}{@{}rcl@{}} P&=&\left( {{\begin{array}{*{20}c} {x_{1},} \hfill & {x_{2},} \hfill & {\cdots,} \hfill & {x_{p} } \hfill \\ \end{array} }} \right)^{T},\\ U&=&\left( {{\begin{array}{*{20}c} {x_{p + 1},} \hfill & {x_{p + 2},} \hfill & {\cdots,} \hfill & {x_{p+u} } \hfill \\ \end{array} }} \right)^{T}, \\ X&=&\left( {P^{T},U^{T}} \right)^{T}, \\ Y&=&diag(y_{1}, y_{2} ,{\cdots} ,y_{p+u} ), \\ C&=&diag(\underbrace {C_{p} ,{\cdots} ,C_{p} ,}_{p}\underbrace {C_{n} ,\cdots ,C_{n} }_{u}), \\ \xi &=&(\xi_{\text{1}} ,\xi_{\text{2}} ,{\cdots} ,\xi_{p+u} )^{T} \end{array} $$
(6)

Thus, the optimization model (5) can rewrite as matrix format

$$\begin{array}{@{}rcl@{}} &&\underset{w,b,\xi}{\min} \frac{\text{1}}{2}w^{T}w+\frac{1}{2}\xi^{T}C\xi\\ &&s.t. Y\left( {Xw+\mathrm{e}b} \right)+\xi =e \end{array} $$
(7)

where e is an appropriate vector of ones.

Clearly, all training data takes part in classification. We can call the BLSSVM classifier which is obtained by (7) as global learning classifier. Compared with traditional methods, such as BSVM, BLSSVM has some obvious merits. First, BLSSVM can reflect the class labels of all examples more sufficiently and accurately than BSVM. This is because that only support vectors determine the final classifier for BSVM, which can be deemed as local learning. Whereas the whole examples are used to construct the final classifier for BLSSVM, which can be regarded as global learning. Obviously, negative support vectors may contain some positive examples and produce some effects on the construction of BSVM. Contrarily, if all examples take part in the building of classifier, compared with most correct negative examples, those false negative examples are not important for the final classifier construction. Second, BLSSVM is a more stable classifier than BSVM. Namely, the performance of BLSSVM is less sensitive than that of BSVM over a wide ratio of positive examples in unlabeled examples. This is because that the number of negative examples is far more than positive examples in unlabeled examples and the distribution of negative class changes little over a wide ratio of positive examples in unlabeled examples. All in all, Global learning for classification is more accurate and effective for PU learning.

3.2 Local learning

In the semi-supervised learning scenario, a common hypothesis [29,30,31,32] is that two data points are more likely to belong to the same class if they are close in the intrinsic geometry. In the other words, the difference between function f(x i ) and f(x j ) should be smaller if the distance between x i and x j is shorter. In this paper, we take this common hypothesis into PU learning, minimize the following regularization term

$$ \left\| f \right\|_{M}^{2} =\frac{1}{(p+u)^{2}}\sum\limits_{i,j = 1}^{p+u} {w_{ij} (f(x_{i} )} -f(x_{j} ))^{2} $$
(8)

where w i j is similarity weight between x i and x j . It can be calculated by Gaussian functions:

$$ w_{ij} =\left\{ {{\begin{array}{*{20}c} {\exp (-{\left\| {x_{i} -x_{j} } \right\|^{2}} \mathord{\left/ {\vphantom {{\left\| {x_{i} -x_{j} } \right\|^{2}} {2\sigma }}} \right. \kern-\nulldelimiterspace} {2\sigma })} \hfill & {if x_{i}, x_{j} ~are ~neighbor} \hfill \\ 0 \hfill & {otherwise} \hfill \\ \end{array} }} \right. $$
(9)

where σ is a bandwidth hyper-parameter. It controls the decay rate.

Obviously, if the x i is near to x j , implying that they have a high similarity, the difference between f(x i ) and f(x j ) should be severely punished since w i j is large, resulting in that |f(x i ) − f(x j )| is small. Note L = DW, W is the graph weight matrix where (i,j)-th entry is w i j and D is a diagonal degree matrix with \(D_{ii} =\sum \limits _{j = 1}^{l+u} {w_{ij} }\). The regularization item in (8) can be rewritten as matrix format

$$ \left\| f \right\|_{M}^{2} \text{=}f^{T}Lf $$
(10)

where f =(f(x 1),f(x 2),⋯f(x p + u ))T. Then, the final classification is performed by (1).

It is easily observed that, regularization term in (10) is local learning, which just considers the class label of neighborhood data points. Such local learning reflects intrinsic geometric information for classification.

3.3 Global and local learning

Now let us consider PU learning scenario, combining global learning in (7) and local learning in (10), we obtain the optimization problem

$$\begin{array}{@{}rcl@{}} &&\underset{w,b,\xi}{\min} \frac{\lambda }{2}w^{T}w+\frac{1}{2}\xi^{T}C\xi +f^{T}Lf\\ &&s.t. Y\left( {Xw+\mathrm{e}b} \right)+\xi =e \end{array} $$
(11)

Note that λ is positive parameter. Let

$$ f=\left( {f(x_{1} ),f(x_{2} ),{\cdots} f(x_{p+u} )} \right)^{T}=Xw+eb $$
(12)

Then, the above problem (11) leads to our primal problem

$$ \begin{array}{l} \underset{w,b,\xi}{\min} \frac{\lambda }{2}w^{T}w+\frac{1}{2}\xi^{T}C\xi +\frac{1}{2}(Xw+eb)^{T}L(Xw+eb) \\ s.t. Y\left( {Xw+\mathrm{e}b} \right)+\xi =e \\ \end{array} $$
(13)

Clearly, the minimizing problem (13) is a quadratic programming with only equality constraints. Introducing the Lagrange multiplier α = (α 1,⋯,α p + u )T for optimization problem (13), we obtain

$$\begin{array}{@{}rcl@{}} L\left( {w,b,\xi ,\alpha } \right)&=&\frac{\lambda }{2}w^{T}w+\frac{1}{2}\xi^{\mathrm{T}}C\xi\\ &&+\,\frac{1}{2}(Xw+eb)^{T}L(Xw+eb)\\ &&-\,\alpha^{T}Y\left( {Xw+eb} \right)-\alpha^{T}\xi +\alpha^{T}e \end{array} $$
(14)

Based on the Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions, taking the partial derivatives of (14) with respect to w, b, q, α and equating them to zero, we obtain the following optimal conditions

$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial w}&=&\lambda w+X^{T}L(Xw+eb)-X^{T}Y\alpha = 0 \end{array} $$
(15)
$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial b}&=&\mathrm{e}^{T}L(Xw+eb)-\mathrm{e}^{T}Y\alpha = 0 \end{array} $$
(16)
$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial \xi }&=&c\xi -\alpha = 0 \end{array} $$
(17)
$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial \alpha }&=&Y\left( {Xw+\mathrm{e}b} \right)+\xi -e = 0 \end{array} $$
(18)

Substituting (17) into (18), we can solve the resulting equations for w, α and b

figure f

where I is an identity matrix. The matrix format is

$$ \left( {{\begin{array}{lll} {\lambda I+X^{T}LX} & {X^{T}Le} & {-X^{T}Y} \\ {\mathrm{e}^{T}LX} & {\mathrm{e}^{T}Le} & {\mathrm{e}^{T}Y} \\ {YX} & {Y\mathrm{e}} & {C^{-1}} \\ \end{array} }} \right)\left( {{\begin{array}{lll} w \\ b \\ \alpha \\ \end{array} }} \right)=\left( {{\begin{array}{lll} 0 \\ 0 \\ e \\ \end{array} }} \right) $$
(22)

It is worth mentioning that the dimension of variables is too high to solve in short time. Therefore, we can solve the value of w from the formulation (19) when the matrix (λ I + X T L X) is invertible. Namely,

$$ w^{\ast }=(\lambda I+X^{T}LX)^{-1}(X^{T}Y\alpha -X^{T}Leb) $$
(23)

Let

$$ A=(\lambda I+X^{T}LX)^{-1} $$
(24)

Substituting w into equations (2019), we can obtain the solution of α ∗ and b from the following linear equations

$$ \left( {{\begin{array}{ll} {\mathrm{e}^{T}LXAX^{T}Y\!-\mathrm{e}^{T}Y} & {\mathrm{e}^{T}Le\,-\,\mathrm{e}^{T}LXAX^{T}Le}\\ {YXAX^{T}Y\,+\,C^{-1}}& {Y\mathrm{e}\,-\,YXAX^{T}Le}\\ \end{array} }} \!\right)\!\!\left( \!{{\begin{array}{ll} \alpha\\ b \\ \end{array} }} \!\right)\,=\,\left( \! {{\begin{array}{ll} 0 \\ e \\ \end{array} }} \!\right) $$
(25)

Once α ∗ and b are found, a new example x can be classified as positive class or negative class by the following formulation

$$\begin{array}{@{}rcl@{}} y&=&sign(f(x))\\ &=&sign(\sum\limits_{j = 1}^{p+u} \alpha_{j} y_{j} <x,x_{j} >+b) \end{array} $$
(26)

Now, the key problem is how to guarantee the invertibility of (λ I + X T L X) and what are the merits of (λ I + X T L X)? The following theorems can answer these questions.

Theorem 1

(λ I + X T L X) is symmetric and positive definite.

Proof

L is a symmetric and semi-positive definite matrix, then the following equations are satisfied

$$ \forall z\in \mathrm{R}^{n}, z^{T}X^{T}LXz=(Xz)^{T}L(Xz)\ge 0 $$
(27)
$$ (X^{T}LX)^{T}=X^{T}L^{T}X=X^{T}LX $$
(28)

Obviously, X T L X is a symmetric and semi-positive definite matrix. Noting that λ > 0, we can derive λ I is symmetric and positive definite. Therefore (λ I + X T L X) is symmetric and positive definite. □

Theorem 2

A = (λ I + X T L X)−1 is also a positive definite matrix and there exists the matrix B, such that

$$ A=BB^{T} $$
(29)

Proof

Since (λ I + X T L X) is a positive definite matrix, its invertible matrix is also a positive definite matrix. It can be decomposed into

$$ A=T{\Lambda}^{\frac{1}{2}}{\Lambda}^{\frac{1}{2}}T^{T}=(T{\Lambda}^{\frac{1}{2}})(T{\Lambda} ^{\frac{1}{2}})^{T}=BB^{T} $$
(30)

where Λ is a diagnose matrix, the diagnosed entry is eigenvalue, T is composed by eigenvectors. □

Through the previous theorems, substituting A into B B T, formula (20) can be rewritten as

$$\begin{array}{@{}rcl@{}} \!\!\left( \! {{\begin{array}{ll} {\mathrm{e}^{T}L(XB)(XB)^{T}Y\,-\,\mathrm{e}^{T}Y} & {\mathrm{e}^{T}Le\,-\,\mathrm{e}^{T}L(XB)(XB)^{T}Le} \\ {Y(XB)(XB)^{T}Y\,+\,C^{-1}} & {Y\mathrm{e}\,-\,Y(XB)(XB)^{T}Le} \\ \end{array} }}\! \right)\!\!\left( \! {{\begin{array}{ll} \alpha \\ b \\ \end{array} }} \!\right)\,=\,\left( \! {{\begin{array}{ll} 0 \\ e \\ \end{array} }} \!\right)\\ \end{array} $$
(31)

It is natural to find that coefficient matrix in (26) which is partitioned into four blocks always contains inner product operation, just like (X B)(X B)T. As we all known, inner product is a kind of kernel or some distance measure, such as linear kernels, RBF kernels, Euclidean distance, Mahalanobis distance and so on. In addition, we find the object of inner product is XB, and it can be regarded as a linear transformation for training data X. In fact, this kind of linear transformation plays an important role in feature selection and dimension decrease. In conclusion, (X B)(X B)T is a novel metric measure after linear mapping of X. These new insights prompt us to extend the above linear separable case to nonlinear separable case. For nonlinearly separable case, we also introduce a nonlinear function Φ(X B) mapping the input feature space to the higher dimensional space. Naturally, inner product Φ(X B)Φ(X B)T is deemed as a kernel function (matrix). Note K = Φ(X B)Φ(X B)T. Therefore, for nonlinear separable case, substituting (X B)(X B)T for K will receive the following linear equations

$$ \left( {{\begin{array}{ll} {\mathrm{e}^{T}LKY-\mathrm{e}^{T}Y} & {\mathrm{e}^{T}Le-\mathrm{e}^{T}LKLe} \\ {YKY+C^{-1}} & {Y\mathrm{e}-YKLe} \\ \end{array} }} \right)\left( {{\begin{array}{ll} \alpha \\ b \\ \end{array} }} \right)=\left( {{\begin{array}{ll} 0 \\ e \\ \end{array} }} \right) $$
(32)

Based on (27), a nonlinear classifier appears and classifies any new example x as positive class or negative class by the following decision function

$$\begin{array}{@{}rcl@{}} y&=&sign(f(x))\\ &=&sign(\sum\limits_{j = 1}^{p+u} \alpha_{j}^{\ast } y_{j} K(Bx,Bx_{j} )+b^{\ast }) \end{array} $$
(33)

Right now, we establish the complete learning framework. Noting that our classifier is constructed based on global and local learning, so it is called GLLC for short. From a global learning perspective, GLLC not only has a strong robustness but also has a very small fluctuation over a wide ratio of positive examples in unlabeled examples. From a local learning perspective, a smooth regularization term will be minimized to make the two neighborhood data points are more likely belong to the same class if they are close in the intrinsic geometry. This smooth term not only reflects geometrical properties of the training examples but also alleviates the insufficient training of the global learning. Third, the time complexity of GLLC is lower than that of BSVM, where GLLC only needs to solve linear equations and BSVM is a quadratic programming. The implementation of the algorithm for GLLC is shown in Table 1.

Table 1 The solving algorithm for GLLC

4 Experiments

To evaluate the performance of GLLC, we perform experiments on three sets of artificial datasets, six UCI benchmark datasets [33] and one USPS handwritten image dataset in this paper. More concretely, we compare GLLC with the following state-of-the-art algorithms:

  1. 1.

    BSVM is a classical support vector machine. It uses positive examples and unlabeled examples which can be negative examples with noise to build an SVM classifier.

  2. 2.

    Pulce takes into account data dependencies and learns a distance model from attribute relationships to train a k-NN-like classifier.

  3. 3.

    LUHC determines a decision unit-hyperplane by solving a quadratic programming problem.

  4. 4.

    NB is another PU learning algorithm, which takes ‘selected completely at random’ assumption and estimates the prior probability of positive, then learn classifier model.

  5. 5.

    S-EM uses spy technique to identify reliable negative examples from the unlabeled examples, and uses the Expectation Maximization (EM) algorithm to build a NB classifier.

All the classifiers are implemented in MATLAB 2014a environment on a PC with Intel Core(TM) core i3 CPU (2.4 GHz) with 2 GB RAMS. In both classification datasets, we use LIBSVM [34] to build a classifier for BSVM, LPU package [35] to implement system for NB and S-EM.

4.1 Performance metric

We use the popular F value on the positive class as the evaluation measure. F value takes into account of both recall and precision

$$ F=\frac{2pr}{p+r} $$
(34)

Where

$$p=\frac{TP}{TP+FP} $$

And

$$r=\frac{TP}{TP+FN}\quad . $$

TP, TN, FP and FN are the number of true positive, true negative, false positive and false negative, respectively. A high precision ensures that the identified positives are predominantly true positives, and a high recall ensures that most of the positives are identified. F value captures the average effect of both precision and recall, and is therefore suitable for our purpose. In addition, the accuracy of classification is also showed in this paper to evaluate the performance of GLLC.

4.2 The selection of the parameters

For our GLLC, hyper-parameters are set by cross validation from some grids. Unfortunately, F value cannot be computed on the validation set during the training process because there is no negative example. An approximate computing method [9] is used to evaluate the performance by

$$ \overset{\wedge}{F} =\frac{{r_{p}^{2}} }{P(f(x)= 1)} $$
(35)

where x is the random variable representing the input vector, P(f(x) = 1) is the probability of an input example x classified as positive example, r p is the recall for positive set P in the validation set. Therefore, parameter selection algorithm for our GLLC is shown in Table 2. Concretely, hyper-parameters are set by five-fold cross validation from some grids introduced in the following.

  • GLLC. RBF kernel parameter σ, C n and λ are chosen from {2−6,2−5,⋯,26} and C p is searched from 2C n ,

  • LUHC RBF kernel parameter σ and λ are also searched from {2−6, 2−5, ⋯, 26} and ν is selected from {0.1, 0.2, ⋯ , 0.9}

  • BSVM. RBF kernel parameter σ, C n are also chosen from the set {2−6, 2−5, ⋯, 26} and C p is searched from 2C n

  • Pulce. Parameter k is varied over the set of values {1, 3, 7}.

The final decision function is obtained by the optimal parameters by selecting the best average F value in (30).

Table 2 Parameter selection algorithm for GLLC

4.3 Experiment results

4.3.1 Artificial datasets

Consider three types of two-dimensional synthetic ‘two half moons’ datasets with a few randomly labeled positive examples. These three types of datasets contain 200, 400 and 600 examples respectively, where red star represents positive examples and black dot represent negative examples. In order to demonstrate the classification performance for PU learning, 10%, 20% and 30% proposition of positive examples are labeled randomly and radius data points are taken as unlabeled examples for these three types of datasets, which are marked blue triangle in Figs. 345678910 and 11. In order to reflect the advantages of the manifold regularization in our GLLC algorithm, we compare the GLLC with LUHC and BSVM only in this experiment. The RBF kernel is used by all these three methods. In Figs. 311a, red stars represent positive unlabeled examples; black solid points represent negative unlabeled examples; All these examples are unlabeled examples. Blue triangles are positive examples. Figures 311b, c and d show the classification results of GLLC, LUHC and BSVM. More specifically, we can draw the following conclusions:

Fig. 3
figure 3

Comparison classification results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 200 and the proportions of positive examples is 10%

Fig. 4
figure 4

Comparison results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 200 and the proportions of positive examples is 20%

Fig. 5
figure 5

Comparison results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 200 and the proportions of positive examples is 30%

Fig. 6
figure 6

Comparison results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 400 and the proportions of positive examples is 10%

Fig. 7
figure 7

Comparison results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 400 and the proportions of positive examples is 20%

Fig. 8
figure 8

Comparison results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 400 and the proportions of positive examples is 30%

Fig. 9
figure 9

Comparison results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 600 and the proportions of positive examples is 10%

Fig. 10
figure 10

Comparison results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 600 and the proportions of positive examples is 20%

Fig. 11
figure 11

Comparison results of our GLLC, LUHC and BSVM on artificial half-moons datasets when the number of labeled examples is 400 and the proportions of positive examples is 30%

No matter how many samples and what ratio of labeled positive samples, our GLLC always identifies all positive examples, i.e., the accuracy and F value of GLLC is 100%. This result verifies effectiveness of our local learning again.

LUHC also has good classification effective, especially when the number of labeled examples is larger However, its classification power isn’t as good as GLLC, such as Fig. 4c.

With the decrease of the labeled positive examples, more and more unlabeled positive examples are classified negative examples inaccurately for BSVM.

All in all, we summarize an important conclusion, that is Laplacian regularization for local learning plays a key role in classification for PU learning.

4.3.2 UCI datasets

To better illustrate the effective of our GLLC in this paper, we also provide the numerical results of GLLC and other classical methods on six UCI datasets. More concretely, our experiments are set up in the following way: firstly, each dataset is divided into two subsets: 50% for training and 50% for testing; then, we randomly select 10%, 20% and 30% of the training set as positive set, and use the remainder as unlabeled data; Finally, we transform them into PU tasks. Each experiment runs 10 times independently, and records the average F value and classification accuracy.

Table 3 shows the details of the UCI datasets. Instance: Total number of examples, Feature: Number of features, Test: Number of test examples, Train (10%): Number of training examples, where 10% positive examples are selected as positive set and the rest as the unlabeled set, Train (20%): Number of training examples, where 20% positive examples are selected as positive set and the rest as the unlabeled set, Train (30%): Number of training examples, where 10% positive examples are selected as positive set and the rest as the unlabeled set.

Table 3 Details of the six UCI datasets

The values in Tables 45 and 6 are the mean classification F value and Tables 78 and 9 show the average accuracy when 10%, 20% and 30% of the training set are selected as the labeled positive examples, respectively. The best F value and accuracy are depicted by bold figures. From these tables, we can also observe the superiority of GLLC. Compared with LUHC, BSVM for linear and Gauss kernel situations, Pulce, NB and S-EM, GLLC has the best performance, and the F value exceeds to 0.81 on all datasets except Hepatitis. The gaps between GLLC and other competitors are more and more apparent when the ratio of the labeled positive is smaller and smaller. These results verify our theory analysis again in Section 2. Namely, GLLC is insensitive to the counts of labeled positive examples. However, why is Hepatitis dataset an exception? Based on the analysis of its features, we find the dependencies among attributes are strong and the dataset is correlated. Pulce just takes into account these neighborhood function attribute relationships to train a knn-like classifier. Nevertheless, SVMs-like classifiers are only focusing on providing maximum margin hyper-plane between two classes without considering dependencies among attributes. That is why classification average F value and accuracy of Pulce are the higher than that of other SVMs classifiers.

Table 4 Average F value of GLLC-lin, GLLC-rbf, LUHC-lin, LUHC-rbf, Pulce, BSVM-lin, BSVM-rbf, NB, and S-EM on UCI datasets at the test set with 10% of labeled positive examples
Table 5 Average F value of GLLC-lin, GLLC-rbf, LUHC-lin, LUHC-rbf, Pulce, BSVM-lin, BSVM-rbf, NB, and S-EM on UCI datasets at the test set with 20% of labeled positive examples
Table 6 Average F value of GLLC-lin, GLLC-rbf, LUHC-lin, LUHC-rbf, Pulce, BSVM-lin, BSVM-rbf, NB, and S-EM on UCI datasets at the test set with 30% of labeled positive examples
Table 7 Average accuracy of GLLC-lin, GLLC-rbf, LUHC-lin, LUHC-rbf, Pulce, BSVM-lin, BSVM-rbf, NB, and S-EM on UCI datasets at the test set with 10% of labeled positive examples
Table 8 Average accuracy of GLLC-lin, GLLC-rbf, LUHC-lin, LUHC-rbf, Pulce, BSVM-lin, BSVM-rbf, NB, and S-EM on UCI datasets at the test set with 20% of labeled positive examples
Table 9 Average accuracy of GLLC-lin, GLLC-rbf, LUHC-lin, LUHC-rbf, Pulce, BSVM-lin, BSVM-rbf, NB, and S-EM on UCI datasets at the test set with 30% of labeled positive examples

In order to contrast more obvious, we also provide the change trend of average F value and average classification accuracy for different ratio of labeled positive examples in Figs. 12 and 13. In Figs. 12 and 13, x-axis represents the percentage of randomly labeled positive points; y-axis is the average classification F value and accuracy. It is evident from experimental results in Figs. 12 and 13 that the performance of GLLC is always excellent and effective on five datasets. F value and accuracy reaches to 0.8–0.95, whereas other competitors are just between 0.5 and 0.7. Corresponding to tables above, Pulce has the best result on Hepatitis dataset. This behavior clearly affects that the dataset are dense and correlated. Pulce exploits the dependencies among attributes. In addition, the performance of our GLLC is more stable than other methods with the change for the percentage of labeled positive examples changes.

Fig. 12
figure 12

Comparison the mean F value of our GLLC-lin, GLLC-rbf, LUHC-lin, LUHC-rbf, Pulce, BSVM-lin, BSVM-rbf, NB and S-EM on six UCI datasets

Fig. 13
figure 13

Comparison the mean accuracy of our GLLC-lin, GLLC-rbf, LUHC-lin, LUHC-rbf, Pulce, BSVM-lin, BSVM-rbf, NB and S-EM on six UCI datasets

4.3.3 Handwritten image

In the following, we conduct the performance evaluation on big data, USPS. The USPS [36] database consists of grayscale handwritten digit images from 0 to 9, as shown in Fig. 14. Each digit contains 1100 images, and the size of each image is 16 *16 pixels with 256 gray levels. Here we select four pair wise digits of varying difficulty for classification.

Fig. 14
figure 14

Representation of Handwritten image on USPS

Figure 15a–d gives the average F value of digit 1 versus 7, 2 versus 5, 0 versus 8 and 4 versus 6 with 5%, 10%, 20%, 30%, 40% and 50% ratio of the training sets are considered as labeled positive examples, respectively. In this subsection, we choose to compare GLLC with the approaches LUHC, pulce and BSVM. We use linear kernel for our GLLC, LUHC and BSVM. From Fig. 15 (a-d), it can be noticed that competitors exhibit an increment trend in performance when the number of labeled positive examples grows from 5% to 50%. Different from these approaches, our GLLC shows excellent results, F value is always near to 100% whatever the ratio of labeled positive examples is. The difference of competitors and GLLC is obvious.

Fig. 15
figure 15

F value of GLLC-lin, LUHC-lin, Pulce, BSVM-lin on handwritten image datasets, where x-axis is the ratio of positive labeled examples

4.4 Experimental analysis

4.4.1 The analysis of parameters

To further analyze our GLLC, in this subsection, we report the influence of our experiment results on the four parameters λ, σ, C p and C n The common approach to select the parameters is to find their optimal values from a range of values. Below, we take six UCI datasets as an example to analyze how the parameters affect the performance of GLLC. Figure 16a–d shows the average F value of GLLC with these four parameters when the ratio of labeled positive examples is 0.3. From Fig. 16a–d, we can observe that F value does change as the parameter changes. This observation means the parameters effect on the classification performance of our GLLC. As for a future comment, if only the parameters reach suitable values, average F value would be higher. In other words, the influence of parameters is obvious and we have to select best values to promise the optimal measure. Besides, Fig. 16c–d shows the change trend of C p and C n , subject to C p > C n .

Fig. 16
figure 16

Influence of parameter λ, σ, C p and C n to F value when the ratio of labeled positive examples is 0.3

4.4.2 The analysis of time complexity

In this subsection, we also analyze the time complexity of GLLC. Take the prediction of handwritten image as an example. Figure 17 lists the mean training central processing unit (CPU) time of GLLC with some other competitors, where the time units are seconds. From the Fig. 17, we can see GLLC always consume the least amount of mean training time in the same hardware environment no matter what ratio of labeled positive examples is. Moreover, with the ratio of the labeled positive examples increasing, the gaps of computational time between GLLC and other competitors are with a growing trend. It is reasonable that GLLC just solve a group of linear equations which will produce little effect on the computational time, whereas LUHC, BSVM have to solve QPP and Pulce computes an amount of distances by considering dependencies of attributions which will have a dramatic effect, especially when the ratio of labeled positive increases. To sum up, compared with LUHC, BSVM and Pulce, our GLLC not only has better classification ability but also spends remarkably less computational time

Fig. 17
figure 17

Training consuming time of GLLC-lin, LUHC-lin, Pulce, BSVM-lin on handwritten image datasets, where x-axis is the ratio of positive labeled examples

5 Conclusions

In this paper we have put forward a novel classifier combined with global and local regularization, named GLLC for short. In theory, we have proved that our GLLC is not only stable and robust to the changes of the counts of labeled positive examples but also has very low computational time. Experiments on artificial datasets, six UCI datasets and handwritten image classification have shown that GLLC is more effective than LUHC, BSVM and other popular methods for PU learning. In brief, GLLC has more strong discriminative power for positive and unlabeled learning.