1 Introduction

In many real-world applications in data mining, information retrieval and pattern recognition, labeled data are usually scarce, and labeling a huge number of data points requires expensive human labor and much time. Unlabeled data, however, are often abundant and can be obtained easily and cheaply. How to use both the labeled and the unlabeled data to improve performance is therefore an important problem, and it motivates the active research direction of semi-supervised learning [1].

Most semi-supervised learning algorithms [2–5] are built on cluster and manifold assumptions [6, 7]. These assumptions are sensible, since in many real-world problems neighboring data points, or data points lying on the same structure (manifold), are likely to share the same label. A typical family of algorithms is that developed on data graphs [8–11].

Based on the data graph, Zhu et al. [11] proposed an algorithm called Harmonic Energy Minimization (HEM). In HEM, Gaussian fields and harmonic functions are used to propagate the label information to the unlabeled data. HEM can be interpreted as a random walk on the graph and yields probabilistic outputs, i.e., the outputs can be viewed as the probabilities of the data points belonging to the labeled classes. However, the algorithm clamps the labels of the labeled data, which makes it sensitive to noise in those labels. Later, Belkin et al. [8] proposed an algorithm that relaxes the constraints on the labeled data and is therefore less sensitive to label noise. However, its random-walk interpretation is unclear, and the algorithm may fail to classify the data when the density of the data varies greatly across classes. In addition, the matrix involved may be singular when the constructed graph is not connected, which makes the algorithm unsolvable.

Recently, Zhou et al. [10] proposed an algorithm called Learning with Local and Global Consistency (LLGC). The algorithm uses the normalized Laplacian [12] to construct the regularizer, which avoids some of these drawbacks. LLGC can also be explained in terms of random walks on the graph. Under this interpretation, however, its outputs are not probabilities but normalized commute times [13]. Thus the algorithm lacks a mechanism for computing the probabilities of the data points belonging to the classes, which may be very useful for further data processing.

In this paper, we propose a general graph-based algorithm with normalized weights for semi-supervised learning that eliminates the drawbacks mentioned above. We use the Laplacian with normalized weights to construct the regularizer, and a novel class label is introduced into the algorithm to discover novel classes. Several theoretical interpretations on the graph are given, which make the algorithm well founded for semi-supervised learning tasks.

The rest of this paper is organized as follows: we propose the algorithm in Sect. 2. In Sect. 3, we give theoretical interpretations of the proposed algorithm from three viewpoints on the graph, i.e., a regularization framework, label propagation and random walks. Some discussions are given in Sect. 4. In Sect. 5, experimental results on several toy examples and benchmark datasets are reported to demonstrate the effectiveness of our algorithm. Finally, we give conclusions in Sect. 6.

2 The algorithm

Given a point set \({{\mathcal{X}}} = \{x_1,\ldots,x_l,x_{l+1},\ldots,x_{n}\}\) and a labeled class set \({{\mathcal{C}}}= \{1,\ldots,c\},\) the first l points \(x_i\) (i ≤ l) are labeled as \(y_i \in{{\mathcal{C}}}\) and the remaining u points \(x_{l+1},\ldots,x_{l+u}\) are unlabeled. Here \(n = l + u,\) and usually \(l \ll u.\) We introduce an additional label to construct the label set \({\tilde{\mathcal{C}}}= \{1,\ldots,c,c+1\}.\) The label c + 1 gives the algorithm a mechanism to discover a novel class.

The goal of the algorithm is to predict the labels of the unlabeled points using both the labeled data and the unlabeled data.

Let \(F=[F_1^T,\ldots,F_n^T]^T \in {\mathbb{R}}^{n\times (c+1)}\) be the soft label matrix, where \(F_i\in {\mathbb{R}}^{c+1}\,(1\leq i \leq n)\) are row vectors and each element of \(F_i\) lies in [0, 1]. Define the matrix \(Y=[Y_1^T,\ldots,Y_n^T]^T \in {\mathbb{R}}^{n\times (c+1)},\) where \(Y_i\in {\mathbb{R}}^{c+1}\,(1\leq i \leq n)\) are row vectors. For the labeled data, \(Y_{ij}=1\) if \(x_i\) is labeled as j and \(Y_{ij}=0\) otherwise. For the unlabeled data \(x_i\), \(Y_{ij}=1\) if j = c + 1 and \(Y_{ij}=0\) otherwise. Our algorithm is described as follows:

  1.

    Construct the weighted neighborhood graph. Points \(x_i\) and \(x_j\) are linked by an edge whose weight is calculated by

    $$ W_{ij}=e^{ {-\|x_i-x_j\|^2} / {\sigma^2}} $$
    (1)

    if \(x_i\) is among the k nearest neighbors of \(x_j\) or \(x_j\) is among the k nearest neighbors of \(x_i\); otherwise, \(W_{ij} = 0.\) Here \(\left\|\cdot\right\|\) denotes the 2-norm of a vector, i.e., \(\|x\|^2=x^T x.\)

  2.

    Calculate the normalized weights by

    $$ \tilde{W}_{ij} = {W_{ij}} / ({\sqrt {d_i d_j}}) $$
    (2)

    and the normalized weight matrix can be written as \(\tilde{W}= D^{-1/2}WD^{-1/2},\) where D is a diagonal matrix with entries \(d_i = \sum_j{W_{ij}}.\)

  3.

    Calculate \(P = \tilde{D}^{-1}\tilde{W},\) where \(\tilde{D}\) is a diagonal matrix with entries \(\tilde{d}_i=\sum_j{\tilde{W}_{ij}}.\)

  4.

    Calculate the soft label matrix \(F \in {\mathbb{R}}^{n\times (c+1)}\) by

    $$ F = (I - I_\alpha P)^{-1}I_\beta Y $$
    (3)

    where I is an n × n identity matrix, \(I_\alpha\) is an n × n diagonal matrix whose ith entry is \(\alpha_i\), and \(I_\beta = I - I_\alpha.\) Here \(\alpha_i\) (0 ≤ α i  < 1) is a parameter for the data point \(x_i\), which will be discussed later (a code sketch of the complete procedure follows this list). The label of data point \(x_i\) is then assigned as

    $$ y_i=\arg\max_{j\le c+1}F_{ij} $$
    (4)

If \(y_i = c + 1\), then \(x_i\) can be seen as a sample coming from a novel class. This mechanism of novel class discovery is useful since the unlabeled data may not all belong to the labeled classes. On the other hand, if prior knowledge tells us that the number of classes is exactly c, then a point with \(y_i = c + 1\) can be treated as an outlier, or be assigned as

$$ y_i=\arg\max_{j\le c}F_{ij} $$
(5)
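
To make the four steps above concrete, the following is a minimal NumPy/SciPy sketch of the procedure. It is not the authors' reference implementation: the function name `ggssl`, the label encoding (0 for unlabeled points) and the default parameter values are our own choices, and a practical implementation would use sparse matrices for the k-neighborhood graph.

```python
import numpy as np
from scipy.spatial.distance import cdist

def ggssl(X, y, c, k=6, sigma=1.0, alpha_l=0.0, alpha_u=0.99999):
    """X: (n, d) data matrix; y: length-n array with labels in 1..c, 0 = unlabeled."""
    n = X.shape[0]
    # Step 1: symmetric k-neighborhood graph with Gaussian weights (Eq. 1).
    d2 = cdist(X, X, 'sqeuclidean')
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), knn.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    # Step 2: normalized weights W~ = D^{-1/2} W D^{-1/2} (Eq. 2).
    d = W.sum(axis=1)
    W_tilde = W / np.sqrt(np.outer(d, d))
    # Step 3: P = D~^{-1} W~.
    P = W_tilde / W_tilde.sum(axis=1, keepdims=True)
    # Step 4: closed-form soft labels F = (I - I_alpha P)^{-1} I_beta Y (Eq. 3).
    Y = np.zeros((n, c + 1))
    Y[y > 0, y[y > 0] - 1] = 1.0        # labeled points
    Y[y == 0, c] = 1.0                  # unlabeled points get the (c+1)-th column
    alpha = np.where(y > 0, alpha_l, alpha_u)
    F = np.linalg.solve(np.eye(n) - alpha[:, None] * P,
                        (1 - alpha)[:, None] * Y)
    return np.argmax(F, axis=1) + 1, F  # label c+1 marks a possible novel class
```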

3 Interpretations on graph for the algorithm

In this section, we give theoretical interpretations of the algorithm proposed in Sect. 2 from the viewpoint of the graph. We show that the algorithm can be derived from a regularization framework, and that it can also be seen as a label propagation process and as a special Markov random walk.

Consider a graph \({{\mathcal{G}}}= ({{\mathcal{V}}},{{\mathcal{E}}})\) with nodes \({{\mathcal{V}}}\) corresponding to the n data points, where nodes \({{\mathcal{L}}}=\{1,\ldots,l\}\) correspond to the labeled points with labels \(y_1,\ldots,y_l\) and nodes \({{\mathcal{U}}}=\{l+1,\ldots, l+u\}\) correspond to the unlabeled points. The normalized weights \(\tilde{W}\) used below are defined on the edges of the graph.

3.1 Regularization framework

Denote by tr(·) the trace operator, and by \(\left\|\cdot\right\|_F\) the Frobenius norm of a matrix, i.e., \(\|M\|_F^2=tr(M^T M).\) Consider a regularization framework on the graph in which the cost function associated with F is defined as

$$ {{\mathcal{J}}}(F)=\frac{1}{2}\sum\limits_{i,j=1}^n{\tilde{W}_{ij}\left\|F_i-F_j\right\|_F^2} +\sum\limits_{i = 1}^n {\mu_i \tilde{d}_{i} \left\|F_i - Y_i \right\|_F^2} $$
(6)

where \(\tilde{W}_{ij}, F_i, Y_i,\) and \(\tilde{d}_{i}\) are defined as in Sect. 2.

The first term in the cost function is a regularization term, which measures the smoothness of the resulting labels on the graph. The second term is a fitting term, which measures the difference between the resulting labels and the initial label assignment. The trade-off between these two competing constraints is controlled by \(\mu_i\) and \(\tilde{d}_{i}.\) Here \(\mu_i > 0\) is a regularization parameter for the ith data point \(x_i\), and \(\tilde{d}_{i} = \sum_j{\tilde{W}_{ij}}\) is the degree of \(x_i\).

For convenience of analysis, we rewrite (6) in matrix form as

$$ {{\mathcal{J}}}(F) = tr(F^T\tilde{L}F) + tr((F-Y)^T U\tilde{D}(F-Y)) $$
(7)

where \(\tilde{L}=\tilde{D}- \tilde{W}\) is the Laplacian matrix with normalized weights \(\tilde{W},\) and U is a diagonal matrix with the ith entry being \(\mu_i\).

The optimal solution of this optimization problem can easily be obtained by setting the derivative of \({{\mathcal{J}}}(F)\) to zero, i.e.,

$$ \left.{\frac{{\partial{{\mathcal{J}}}}}{{\partial F}}}\right|_{F = F^* } = 2\tilde{L}F^* + 2U\tilde{D}(F^* - Y) = 0 $$
(8)

Let us introduce a set of variables

$$ \alpha_i = {1}/(1+\mu_i) \quad (i = 1,2,\ldots,n) $$
(9)

and note that \(P = \tilde{D}^{-1}\tilde{W}\); the solution can then be derived as

$$ \begin{aligned} F^*&=(\tilde{L} + U\tilde{D})^{ - 1} U\tilde{D}Y\\ &=(I - P + U)^{-1}UY\\ &=(I_\alpha-I_\alpha P + I_\beta)^{ - 1} I_\beta Y\\ &=(I - I_\alpha P)^{ - 1}I_\beta Y\\ \end{aligned} $$
(10)

which is exactly the classification function (3) used in the proposed algorithm.
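
As a sanity check (ours, not part of the paper), the following snippet builds a small random graph and verifies numerically that the closed form (10) satisfies the stationarity condition (8):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 2
W_tilde = rng.random((n, n)); W_tilde = (W_tilde + W_tilde.T) / 2
np.fill_diagonal(W_tilde, 0.0)
d_tilde = W_tilde.sum(axis=1)
L_tilde = np.diag(d_tilde) - W_tilde                    # Laplacian with normalized weights
mu = rng.uniform(0.5, 2.0, n)                           # regularization parameters
Y = np.eye(c + 1)[rng.integers(0, c + 1, n)]            # random initial label assignment

P = W_tilde / d_tilde[:, None]
alpha = 1.0 / (1.0 + mu)                                # Eq. (9)
F = np.linalg.solve(np.eye(n) - alpha[:, None] * P,     # Eq. (10) / Eq. (3)
                    (1 - alpha)[:, None] * Y)

grad = 2 * L_tilde @ F + 2 * np.diag(mu * d_tilde) @ (F - Y)   # Eq. (8)
print(np.abs(grad).max())                               # numerically zero: stationarity holds
```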

3.2 Label propagation

Let us consider an iterative label propagation process. In each iteration, part of the label information of each data point is received from its neighbors, and the rest is received from its initial label (see Fig. 1a). The label information at time t + 1 is propagated according to

$$ F(t + 1) = \hat{P}F(t) + I_\beta Y $$
(11)

where \(\hat{P}= I_\alpha P,\) and \(I_\alpha\), \(I_\beta\) and P are defined as in Sect. 2.

Fig. 1

The label propagation and random walks on the graph. (a) In each iteration of the label propagation process, part of the label information of each data point is received from its neighbors' labels, and the rest is received from its initial label \(y_i\). (b) Each data point \(x_i\) randomly walks to its neighbors with probabilities determined by P, and at each step has probability \(\beta_i\) of returning to itself. The walk stops when it hits one of the data points on the graph twice consecutively

We now show that the sequence F(t) converges to the solution in (3). By the iteration (11), we have

$$ F(t)=\hat{P}^{t} F(0) + \sum\limits_{i = 0}^{t - 1} {\hat{P}^i I_\beta Y} $$
(12)

Note that the ∞-norm of the matrix \(\hat{P}\) is less than 1 when 0 ≤ α i  < 1 (1 ≤ i ≤ n). By a standard matrix property, the spectral radius of \(\hat{P}\) is not greater than its ∞-norm, i.e., \(\rho(\hat{P}) < 1.\) Therefore, \(I-\hat{P}\) is invertible, \(\mathop{\lim }\limits_{t \to \infty}{\hat{P}^{t}}=0\) and \(\mathop{\lim}\limits_{t \to \infty }\sum\nolimits_{i = 0}^{t - 1}{\hat{P}^i I_\beta Y}=(I-\hat{P})^{-1}I_\beta Y.\) Hence the iteration process converges to

$$ F^* = \mathop{\lim}\limits_{t \to \infty } F(t) = (I-\hat{P})^{-1}I_\beta Y = (I-I_\alpha P)^{-1}I_\beta Y $$
(13)

and does not depend on the initial value F(0).

Therefore, the algorithm proposed in Sect. 2 can be interpreted as an iterative label propagation process on the graph with transition matrix \(\hat{P}.\)
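
The convergence argument can be checked numerically; the following sketch (ours) iterates (11) from a random F(0) on a small random graph and compares the limit with the closed form (3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 6, 2
W_tilde = rng.random((n, n)); W_tilde = (W_tilde + W_tilde.T) / 2
np.fill_diagonal(W_tilde, 0.0)
P = W_tilde / W_tilde.sum(axis=1, keepdims=True)
alpha = rng.uniform(0.0, 0.99, n)
Y = np.eye(c + 1)[rng.integers(0, c + 1, n)]

P_hat = alpha[:, None] * P
F_closed = np.linalg.solve(np.eye(n) - P_hat,
                           (1 - alpha)[:, None] * Y)    # Eq. (3)

F = rng.random((n, c + 1))                              # arbitrary F(0)
for _ in range(2000):                                   # iterate Eq. (11)
    F = P_hat @ F + (1 - alpha)[:, None] * Y
print(np.allclose(F, F_closed))                         # True: limit independent of F(0)
```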

3.3 A special random walk

Imagine a random walk on the graph (see Fig. 1b) whose transition probability matrix \(\tilde{P}\) is

$$ \tilde{P}=I_\beta + I_\alpha P $$
(14)

where \(I_\alpha\), \(I_\beta\) and P are defined as in Sect. 2. Note that each row of \(\tilde{P}\) sums to 1, which means that \(\tilde{P}\) is a stochastic matrix. The stop rule of this special random walk is defined as follows:

Stop rule: Each point walks randomly on the graph according to the transition probability matrix \(\tilde{P},\) and stops when it hits one of the points on the graph twice consecutively. The walker is considered to have hit its starting point once before the walk begins.

Denote G as

$$ G = I_\beta + \hat{P} I_\beta + \hat{P}^2 I_\beta + \cdots + \hat{P}^n I_\beta + \cdots $$
(15)

Note that \((\hat{P}^k I_\beta)_{ij}\) is the probability that a walk starting from the ith point stops at the jth point at the kth step, so \(G_{ij}\) is the probability that a walk starting from the ith point stops at the jth point.

G can be written as \(G = (I - I_\alpha P)^{-1}I_\beta\), so (3) can be written as

$$ F = GY $$
(16)

From (15) and (16) we see that \(F_{ij}\) (j ≤ c) is exactly the probability that a walk started from the ith point stops at a labeled data point whose label is j, and \(F_{i,c+1}\) is the probability that the walk stops at one of the unlabeled data points.

Therefore, the algorithm proposed in Sect. 2 can also be interpreted as a special random walk on the graph, with transition probability matrix \(\tilde{P}\) defined in (14) and with the stop condition of hitting one of the data points twice consecutively.
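
The random-walk interpretation can also be verified by simulation. The sketch below (ours, on a small dense graph without self-loops, so that the only way to hit the same point twice consecutively is through the \(I_\beta\) part of \(\tilde{P}\)) simulates the stop rule and compares the empirical stopping distribution with the corresponding row of G:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
W = rng.random((n, n)); W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)                       # no self-loops in the graph itself
P = W / W.sum(axis=1, keepdims=True)
alpha = rng.uniform(0.2, 0.9, n)
beta = 1.0 - alpha
P_tilde = np.diag(beta) + alpha[:, None] * P   # Eq. (14); rows sum to 1

G = np.linalg.solve(np.eye(n) - alpha[:, None] * P, np.diag(beta))  # Eq. (15)

start, trials = 0, 100_000
stops = np.zeros(n)
for _ in range(trials):
    cur = start                                # the start counts as one hit
    while True:
        nxt = rng.choice(n, p=P_tilde[cur])
        if nxt == cur:                         # same point hit twice consecutively
            stops[nxt] += 1
            break
        cur = nxt
print(np.abs(stops / trials - G[start]).max()) # small Monte Carlo error
```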

Several properties of the proposed algorithm can be understood clearly from this random-walk viewpoint. The stop condition of hitting a point twice consecutively gives the starting data point a chance to stop the walk at another data point, which means that the resulting label of a labeled point can differ from its initial label.

The method proposed by Zhu et al. [11] can also be interpreted as a random walk, but its transition probability matrix and stop condition are different from ours. In their method, the walk can only stop at labeled data points, while in our algorithm it can also stop at unlabeled data points, which gives our algorithm the mechanism to discover novel classes in the data.

4 Discussions

It is interesting to note that the label propagation procedure and the random walk defined in Sect. 3 look very similar; the two procedures, however, are essentially different. First, the transition matrices are different (\(\hat{P}\) in the label propagation procedure versus \(\tilde{P}\) in the random walk). Second, the transition directions in the two procedures are reversed (see Fig. 1).

The proposed algorithm is an extension of HEM [11]. The parameters \(\alpha_i\) introduced for each data point \(x_i\) make the algorithm more general: HEM is the special case with \(\alpha_i = 0\) for the labeled data and \(\alpha_i = 1\) for the unlabeled data, whereas the general algorithm allows the parameters \(\alpha_i\) to be set with more freedom. For a labeled point \(x_i\), if we are sure that the initial label is definitely correct, \(\alpha_i\) can be set to zero, which means the resulting label of \(x_i\) will equal the initial label and remain unchanged; otherwise, \(\alpha_i\) may be set to a positive value so that the resulting label of \(x_i\) can differ from the initial label, which is important for detecting noise in the labeled data. For an unlabeled point \(x_i\), \(\alpha_i\) can be set to a large value below 1; setting \(\alpha_i = 1\) would force the resulting label of \(x_i\) to lie in 1 to c and thus lose the capability to discover a novel class. Moreover, \(\alpha_i = 1\) may make the matrix \((I - I_\alpha P)\) singular. Therefore, we constrain \(\alpha_i < 1\) in our algorithm.

The algorithm LLGC [10] is also derived from a regularization framework, but both terms of its framework differ from ours. The outputs of LLGC are not probability values, whereas in our algorithm, denoting \({\bf 1}_n = [1,\ldots,1]^T \in {\mathbb{R}}^{n \times 1},\) we have

$$ \begin{aligned} \left. \begin{array}{l} P{{\mathbf{1}}}_n={{\mathbf{1}}}_n\\ Y {{\mathbf{1}}}_{c+1}={{\mathbf{1}}} _n\\\end{array} \right\} &\Rightarrow I_\alpha P {{\mathbf{1}}}_n + I_\beta Y {{\mathbf{1}}}_{c+1} = {{\mathbf{1}}}_n\\ &\Rightarrow I_\beta Y {{\mathbf{1}}}_{c+1}=(I - I_\alpha P) {{\mathbf{1}}}_n\\ &\Rightarrow(I - I_\alpha P)^{ - 1} I_\beta Y {{\mathbf{1}}}_{c+1}={{\mathbf{1}}}_n\\ \end{aligned} $$
(17)

which indicates that the outputs are probability values, and thus might be more convenient for further data processing. It is worth noting that if we remove \(\tilde{d}_{i}\) from the second term of the regularization framework in (6), the results are no longer probability values and cannot be interpreted by the label propagation and random-walk views above.
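
The row-sum property (17) is easy to confirm numerically; the following snippet (ours) uses a random row-stochastic P and a random indicator matrix Y:

```python
import numpy as np

rng = np.random.default_rng(3)
n, c = 7, 3
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # P 1_n = 1_n
Y = np.eye(c + 1)[rng.integers(0, c + 1, n)]                # Y 1_{c+1} = 1_n
alpha = rng.uniform(0.0, 0.99, n)
F = np.linalg.solve(np.eye(n) - alpha[:, None] * P,
                    (1 - alpha)[:, None] * Y)               # Eq. (3)
print(np.allclose(F.sum(axis=1), 1.0))                      # True: rows of F are probabilities
```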

The effect of the normalized weights is illustrated in Fig. 2. Recall that the normalized weight between \(x_i\) and \(x_j\) is defined as \(\tilde{W}_{ij}={W_{ij}}/({\sqrt {d_i d_j}}).\)

Fig. 2

The relative changes of the weights before and after normalization. A thicker line denotes a larger weight. The changes indicate that normalizing the weights makes it easier to classify data whose density differs greatly between regions

The normalization strengthens the weights in low-density regions and weakens the weights in high-density regions, which evens out the overall weights. Therefore, the normalized weights can make classification easier when the density of the data varies greatly across classes. It is worth noting that the normalized Laplacian matrix \(I-D^{-1/2}WD^{-1/2}=I-\tilde{W}\) also exploits the normalized weights \(\tilde{W}.\) However, when the Laplacian matrix \(\tilde{L}\) in (7) is replaced by the normalized Laplacian matrix, as in LLGC [10], the results are no longer probability values, since the normalized Laplacian matrix is usually not a Laplacian matrix.
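
The following small experiment (ours; it uses a fully connected graph rather than a k-neighborhood graph for simplicity) illustrates the effect sketched in Fig. 2: after normalization, the average edge weight inside a sparse cluster becomes comparable to that inside a dense cluster.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(4)
dense = rng.normal(0.0, 0.1, size=(50, 2))     # tightly packed cluster
sparse = rng.normal(5.0, 1.0, size=(50, 2))    # spread-out cluster
X = np.vstack([dense, sparse])

W = np.exp(-cdist(X, X, 'sqeuclidean') / 1.0 ** 2)   # Eq. (1) with sigma = 1
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
W_tilde = W / np.sqrt(np.outer(d, d))                # Eq. (2)

print(W[:50, :50].mean(), W[50:, 50:].mean())              # raw weights: dense cluster much larger
print(W_tilde[:50, :50].mean(), W_tilde[50:, 50:].mean())  # after normalization: comparable
```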

5 Experiments

In this section, we first validate our algorithm on some toy examples and then evaluate it on several benchmark datasets. Finally, we give an experiment to verify the capability of our algorithm to discover a novel class in the data.

In our experiments, since there is no prior knowledge to exploit, we simply set the regularization parameters \(\alpha_i\) in (9) to the same value \(\alpha_l\) for all labeled data points and to the same value \(\alpha_u\) for all unlabeled data points.

5.1 Toy examples

We give several toy examples to analyze and validate our algorithm. The effects of the normalized weights, \(\alpha_l\), and \(\alpha_u\) are discussed on these toy problems.

Figure 3a shows toy data consisting of two classes with very different density distributions. The results illustrate that, with the normalized weights, our method can effectively classify the data when the density varies greatly across classes.

Fig. 3

a The partially labeled data. b Classification result without normalized weights. c Classification result with normalized weights. The results demonstrate that with the normalized weights, the algorithm can successfully classify data whose density differs greatly between classes

Figure 4a shows toy data of two rings with 8 labeled points. Under the cluster and manifold assumptions, an ideal classifier should assign the points on the outer ring to one class and the points on the inner ring to another. Thus there is one incorrectly labeled point in each class, which can be viewed as noise. The situation in Fig. 4a is very likely to occur in real-world problems, since labeling noise easily arises, for instance from the tiredness or carelessness of the annotator. Therefore, developing a robust classifier that can automatically detect such noise is of vital importance.

Fig. 4

a The partially labeled data with noise. b Classification result with \(\alpha_l = 0\). c Classification result with \(\alpha_l = 0.99\). The results demonstrate that when \(\alpha_l\) is set to a positive value, our method can effectively detect the noise in the labeled data

To some extent, our method can automatically detect the noise in the labeled data if we set \(\alpha_l\) to a positive value. If we set \(\alpha_l\) to zero, however, the noise in the labeled data cannot be detected, since \(\alpha_l = 0\) means the resulting label will not change from its initial label. Therefore, if the label information for a labeled data point \(x_i\) is not fully reliable, we can set the corresponding \(\alpha_i\) to a larger value; on the contrary, if we can ensure that the label information for \(x_i\) is correct, the corresponding \(\alpha_i\) can be set to zero.

Figure 5a shows toy data of three rings with only two labeled points. Under the cluster and manifold assumptions, an ideal classifier should treat the three rings as three classes. However, only two classes are labeled, so it is desirable to discover the intermediate ring as a novel class. The results clearly demonstrate that our algorithm can discover the novel class when \(\alpha_u < 1\).

Fig. 5

a The partially labeled data. b Classification result with \(\alpha_u = 1\). c Classification result with \(\alpha_u = 0.99999\); the green down triangles denote the novel-class data discovered by our algorithm. The results demonstrate that when \(\alpha_u\) is set to a value less than 1, our method can effectively discover a novel class in the data

5.2 Experiments on benchmark datasets

We evaluate our algorithm on the benchmark datasets provided in [6], and compare with the k-nearest-neighbor classifier (kNN), SVM, and several popular semi-supervised learning algorithms, including the Transductive SVM (TSVM) [4], Low Density Separation (LDS) [14], Cluster Kernels (CK) [15], Laplacian Regularized Least Squares (LRLS) [2] and Learning with Local and Global Consistency (LLGC) [10]. We denote our algorithm without normalized weights as GGSSL1 and with normalized weights as GGSSL2.

The benchmark consists of seven datasets; a brief description is summarized in Table 1. The first two datasets were generated from two Gaussians and have no manifold structure. For the image datasets Digit1, USPS and COIL, the manifold assumption is expected to hold. For each dataset, 12 splits are provided. Each split contains 100 labeled data points, with at least one labeled point per class, and there is no bias in the labeling process. In these experiments, for the kNN classifier we use the nearest-neighbor classifier (1-NN). For LLGC and our algorithm, the parameter k used to construct the k-neighborhood graph is simply set to 6, and the parameter σ in (1) is determined by \(\sigma =\sqrt {-{\frac{\bar{d}}{\ln(s)}}},\) where \(\bar{d}\) is the average of the squared Euclidean distances over all edge-connected pairs, i.e., \(\bar{d}= {\frac{1}{z}}\sum\nolimits_{i,j:\,W_{ij}\ne 0}{\left\| {x_i - x_j } \right\|^2}\) (z is the number of edge-connected pairs), and s is searched over s ∈ {0.0001·1/k, 0.001·1/k, 0.01·1/k, 0.1·1/k, 1/k}.
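
For reference, a sketch (ours) of the bandwidth heuristic described above; `sigma_from_graph` is a hypothetical helper name, and the default value of s is only one example from the search grid.

```python
import numpy as np
from scipy.spatial.distance import cdist

def sigma_from_graph(X, k=6, s=0.1 / 6):
    """Bandwidth heuristic: sigma = sqrt(-d_bar / ln(s)) over the k-neighborhood edges."""
    d2 = cdist(X, X, 'sqeuclidean')
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]           # k nearest neighbors of each point
    mask = np.zeros(d2.shape, dtype=bool)
    mask[np.repeat(np.arange(len(X)), k), knn.ravel()] = True
    mask |= mask.T                                      # symmetric edge set
    d_bar = d2[mask].mean()                             # average squared edge length
    return np.sqrt(-d_bar / np.log(s))

# s would be chosen from {0.0001/k, 0.001/k, 0.01/k, 0.1/k, 1/k} by model selection.
```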

Table 1 Description of the benchmark datasets

In our algorithm, the regularization parameter \(\alpha_l\) is simply set to 0 and \(\alpha_u\) to 0.99999.

The results are summarized in Table 2; the results for SVM, TSVM, LDS, CK and LRLS are those reported in [6]. The experimental results demonstrate that no algorithm is uniformly better than the others; how to select an algorithm for a specific dataset therefore remains an open problem.

Table 2 Average test errors (%) with 100 labeled training data points

The performance of our algorithm on these benchmark datasets is comparable to that of LLGC, and it performs better on Digit1 and COIL, which indicates that our algorithm can be expected to perform well on data with manifold structure.

It is worth noting that the main computational cost of our algorithm lies in the first step, i.e., the construction of W, which is a necessary step for any graph-based method. Equation 3 in our algorithm is actually solved as a large sparse linear system, which has been studied intensively; efficient algorithms exist whose running time is nearly linear [16].

5.3 Novel class discovery

We present an experiment to validate the capability of our algorithm to discover a novel class in the data. Since the COIL dataset in the benchmark consists of six classes, it is selected for this experiment. We keep the label information only for the first three classes and remove the label information of the last three classes. Therefore, the last three classes can be seen as a novel class in this setting.

We use the kNN classifier as the baseline and compare our algorithm with LLGC. The parameters in the algorithms are set as in the previous experiments. LLGC by itself has no mechanism to discover a novel class: for each data point \(x_i\), LLGC outputs three values corresponding to the first three classes, and \(x_i\) is classified to the class whose corresponding value is maximal. Here we make a slight modification: if the maximum value is lower than a threshold t, then \(x_i\) is classified to the novel class (a sketch of this rule is given below).
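
For clarity, the modified decision rule used for LLGC in this comparison can be sketched as follows (our own illustration; `F_llgc` denotes the n × 3 matrix of LLGC outputs):

```python
import numpy as np

def llgc_with_threshold(F_llgc, t):
    """F_llgc: (n, 3) LLGC outputs for the three labeled classes; t: rejection threshold."""
    labels = np.argmax(F_llgc, axis=1) + 1                # classes 1..3
    labels[F_llgc.max(axis=1) < t] = F_llgc.shape[1] + 1  # below t -> novel class (label 4)
    return labels
```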

We record the test error rate on the data from the first three classes, the test error rate on the data from the last three classes, and the overall test error rate on all the data.

The results are presented in Table 3. For LLGC, the threshold t is set to the value for which the overall test error rate is minimal, t = 0.00008 in this experiment. It is worth pointing out that this way of setting t favors LLGC and is practically infeasible, since the test error rate is usually unavailable in practice. Our algorithm has an intrinsic mechanism to discover the novel class, and the results in this experiment demonstrate that the mechanism is effective in practice.

Table 3 Test errors (%) for the COIL dataset. Data from the first three classes are seen as “data with labeled class”, and data from the last three classes are seen as “data with novel class”

6 Conclusions

In this paper, we have proposed a general graph-based algorithm for semi-supervised learning. The algorithm is formulated as an optimization problem which can be solved effectively and efficiently. Several drawbacks of traditional graph-based methods have been eliminated in our algorithm. Moreover, our algorithm has a mechanism to discover novel classes in data, which is useful in practice in data mining, information retrieval and pattern recognition. We have also given three theoretical interpretations of our algorithm. Experimental results on several toy examples and benchmark datasets have demonstrated its effectiveness.