1 Introduction

Mobile phones are more widely used than ever in daily life, and pictures play an increasingly important role in mobile communication. With so many pictures stored on mobile phones, it is useful to annotate new, unlabeled pictures with a set of predefined labels. For example, we may wish to automatically recognize and tag acquaintances in new pictures in a Facebook account based on their faces. This task involves semi-supervised image classification: labeled images are scarce and difficult to obtain compared with the abundant unlabeled images available in the real world. The above image classification task is in fact a label propagation [4, 5] problem, which utilizes both labeled and unlabeled data. Label propagation is part of semi-supervised learning [1, 2] and can be treated as a graph-based semi-supervised learning method [3–5].

Many researchers [7–12] have found that the construction of the graph \(G = (V,E)\) is the key to label propagation. The vertex set \(V\) represents the data points, while the weights \(W\) on the edge set \(E\) represent the similarity among data points. The task of graph construction is to design the weighted edge-connection strategy. Once the graph is constructed, the label information can be propagated to all unlabeled samples by a graph-based semi-supervised learning method.

Well-known graph construction methods include: (1) neighbor-based methods, e.g., \(k\)-nearest neighbor (KNN) [6] and the \(\varepsilon\)-ball neighborhood [7]; (2) local-structure-based methods, e.g., locally linear embedding (LLE) [8]; (3) self-representation-based methods, e.g., sparse subspace clustering (SSC) [9], low-rank representation (LRR) [10–12], least-squares regression (LSR) [13], and smooth representation clustering (SMR) [14].

Assume the dataset is denoted as \(X = [x_{1}, x_{2}, \ldots, x_{n}] \in R^{d \times n}\); the objective function of the self-representation-based graph construction method is

$$X = XZ, \quad \text{s.t.}\ Z \ne I,$$
(1)

where \(Z = [z_{1}, z_{2}, \ldots, z_{n}] \in R^{n \times n}\) is a square matrix, \(I\) is the identity matrix, and \(Z_{ij}\) denotes the similarity between \(x_{i}\) and \(x_{j}\). Self-representation-based graph construction assumes that each sample can be represented by a linear combination of all samples; the similarity between any two samples can then be measured by the linear coefficients. In addition, a symmetrization step is defined as [7–12]

$$W = (|Z| + |Z^{T} |)/2.$$
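As a minimal sketch of this symmetrization step (assuming the learned coefficients are stored in a NumPy array Z; the function name is ours):

```python
import numpy as np

def symmetrize(Z):
    """Turn a (possibly asymmetric) coefficient matrix Z into a similarity graph W."""
    return (np.abs(Z) + np.abs(Z.T)) / 2
```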

Starting from (1), many researchers have considered various regularization terms on \(Z\). Sparse subspace clustering (SSC) [9] aims to represent each sample using as few other samples as possible. The mathematical formulation of the SSC graph is

$$\mathop {\text{argmin}}\limits_{Z} ||Z||_{1}, \quad \text{s.t.}\ X = XZ, \quad \text{diag}(Z) = 0,$$
(2)

where \(||Z||_{1}\) denotes the \(l_{1}\) norm of \(Z\), and \(||Z||_{1} = \sum\nolimits_{i = 1}^{n} \sum\nolimits_{j = 1}^{n} |Z_{ij}|.\)

Low-rank representation (LRR) [10–12] requires that the representation matrix have a low-rank structure. The mathematical formulation of the LRR graph is

$$\mathop {\text{argmin}}\limits_{Z} ||Z||_{*}, \quad \text{s.t.}\ X = XZ,$$
(3)

where \(||Z||_{*}\) denotes the nuclear norm of \(Z\), that is, the sum of the singular values of \(Z\).

The least-squares regression (LSR) graph [13] solves the following problem

$$\mathop {\text{argmin}}\limits_{Z} ||Z||_{\text{F}}, \quad \text{s.t.}\ X = XZ, \quad \text{diag}(Z) = 0,$$
(4)

where \(||Z||_{\text{F}}\) denotes the Frobenius norm of \(Z\), and \(||Z||_{\text{F}} = ( \sum\nolimits_{i = 1}^{n} \sum\nolimits_{j = 1}^{n} Z_{ij}^{2} )^{\frac{1}{2}}.\)
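For reference, the three regularizers in (2)–(4) can be evaluated directly with standard linear-algebra routines. The following NumPy sketch (function names are ours) mirrors the definitions above; in particular, the nuclear norm is the sum of singular values:

```python
import numpy as np

def l1_norm(Z):
    # ||Z||_1: sum of absolute values of all entries, as in (2)
    return np.abs(Z).sum()

def nuclear_norm(Z):
    # ||Z||_*: sum of the singular values of Z, as in (3)
    return np.linalg.svd(Z, compute_uv=False).sum()

def frobenius_norm(Z):
    # ||Z||_F: square root of the sum of squared entries, as in (4)
    return np.sqrt((Z ** 2).sum())
```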

If the data are noisy, the SSC, LRR, and LSR graph construction methods usually adopt the following strategy: extend the constraint \(X = XZ\) to \(X = XZ + {\mathcal{S}}\), where \({\mathcal{S}} \in R^{d \times n}\) is the noise matrix. Three regularization terms (\(||{\mathcal{S}}||_{1}\), \(||{\mathcal{S}}||_{2,1}\), \(||{\mathcal{S}}||_{\text{F}}\)) are used for \({\mathcal{S}}\) in SSC, LRR, and LSR, respectively, where \(||{\mathcal{S}}||_{2,1}\) is the \(l_{2,1}\) norm of \({\mathcal{S}}\), i.e., \(||{\mathcal{S}}||_{2,1} = \sum\nolimits_{j = 1}^{n} \sqrt{\sum\nolimits_{i = 1}^{n} {\mathcal{S}}_{ij}^{2}}\). Different regularization terms for \({\mathcal{S}}\) correspond to different assumptions about the noise distribution; e.g., \(||{\mathcal{S}}||_{\text{F}}\) assumes approximately Gaussian noise (see [9, 11, 12] for details). For example, LSR uses the norm \(||{\mathcal{S}}||_{\text{F}}\), and its mathematical expression is

$$\mathop {\text{argmin}}\limits_{Z} ||X - XZ||_{\text{F}}^{2} + \lambda_{1} ||Z||_{\text{F}}^{2}, \quad \text{s.t.}\ \text{diag}(Z) = 0,$$
(5)

where \(\lambda_{1} > 0\) is a tuning parameter.
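If the constraint \(\text{diag}(Z) = 0\) is dropped, (5) reduces to ridge regression and admits the closed-form solution \(Z = (X^{T}X + \lambda_{1}I)^{-1}X^{T}X\). The sketch below illustrates this simplified case only (it does not enforce the zero-diagonal constraint of the original LSR formulation, and the function name is ours):

```python
import numpy as np

def lsr_graph(X, lam1=0.01):
    """Simplified LSR coefficients, ignoring the diag(Z) = 0 constraint.

    X: d x n data matrix (columns are samples).
    Solves (X^T X + lam1 I) Z = X^T X."""
    n = X.shape[1]
    Z = np.linalg.solve(X.T @ X + lam1 * np.eye(n), X.T @ X)
    return Z
```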

Smooth representation clustering (SMR) [14] extends the LSR graph by requiring smoothness of the representation, i.e., if \(x_{i} \to x_{j}\) then \(z_{i} \to z_{j}\). Its mathematical expression is

$$\mathop {\text{argmin}}\limits_{Z} ||X - XZ||_{\text{F}}^{2} + \lambda_{1}\, \text{trace}(ZLZ^{T}),$$
(6)

where \(L\) is the Laplacian matrix of the graph \(W\), i.e., \(L = D - W\), \(W\) is a KNN graph, and \(D\) is the diagonal matrix with \(D_{ii} = \sum\nolimits_{j = 1}^{n} W_{ij}\). Notice that the SMR graph relies on the KNN graph \(W\); thus, the quality of the SMR graph degrades if the KNN graph \(W\) is of poor quality.
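As an illustration of the graph that SMR depends on, the following sketch builds a simple 0/1 KNN graph and its Laplacian \(L = D - W\) (the helper name and the unweighted 0/1 edges are our simplifications; SMR itself may use weighted edges):

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Build a symmetric 0/1 KNN graph W and its Laplacian L = D - W.

    X: d x n data matrix (columns are samples)."""
    n = X.shape[1]
    # pairwise squared Euclidean distances between columns
    sq = (X ** 2).sum(axis=0)
    dist = sq[:, None] + sq[None, :] - 2 * X.T @ X
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(dist[i])[1:k + 1]   # skip the point itself
        W[i, idx] = 1.0
    W = np.maximum(W, W.T)                   # symmetrize the 0/1 graph
    D = np.diag(W.sum(axis=1))
    return D - W, W
```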

The construction of the graph is the key issue in graph-based semi-supervised learning. Thus, our main objective is to construct a graph that reflects the true structure of the data. Many studies have reported that eliminating edges between different categories is the greatest challenge in graph construction: an ideal graph should have no edges connecting different categories. Such undesired, wrongly connected edges between categories are often called ‘bridge’ edges. Without bridge edges, the graph gains many desirable properties, such as sparsity and a block structure. Note that the SSC and LRR graphs try to eliminate ‘bridge’ edges by forcing the graph to be sparse and low rank, respectively. In this paper, our main goal is to propose a novel method that eliminates ‘bridge’ edges more effectively.

Among these self-representation-based graph construction methods, we observe that the LSR graph realizes the self-representation idea most directly because it imposes the weakest regularization on \(Z\); its main disadvantage is its poor ability to eliminate ‘bridge’ edges, as seen in Fig. 6a. Inspired by this observation, we adopt the objective function of LSR to realize the self-representation idea while attempting to enhance the ability to eliminate ‘bridge’ edges. Based on LSR, we propose nonnegative least-squares regression (NLSR) by adding a nonnegative constraint on \(Z\) and removing the constraint \({\text{diag}}(Z) = 0\). The nonnegative constraint on \(Z\) avoids the cancellation of positive and negative coefficients and also improves the sparsity of \(Z\) to some extent. The constraint \({\text{diag}}(Z) = 0\) is removed to further increase the sparsity of \(Z\). In addition, NLSR considers sparse corrupted noise \(E\) to handle noisy data and improve the robustness of the model. Furthermore, we propose a weighted version of NLSR to further eliminate ‘bridge’ edges.

The rest of the paper is organized as follows: In Sect. 2, we introduce the proposed nonnegative least-squares regression (NLSR) method. In Sect. 3, we introduce the weighted NLSR method. Experimental results are presented in Sect. 4. Finally, conclusions are drawn in Sect. 5.

2 Nonnegative least-squares regression

Assume that \(X = [x_{1}, x_{2}, \ldots, x_{n}] \in R^{d \times n}\) denotes a dataset with \(n\) sample points, each of dimension \(d\). Adding the nonnegative constraint and dropping the diagonal constraint of the LSR graph, (5) can be rewritten as

$$\begin{aligned} \mathop {\text{argmin}}\limits_{Z} ||X - XZ||_{\text{F}}^{2} + \lambda_{1} ||Z||_{\text{F}}^{2}, \hfill \\ \quad \text{s.t.}\ Z \ge 0, \hfill \\ \end{aligned}$$
(7)

where \(Z = [z_{1}, z_{2}, \ldots, z_{n}] \in R^{n \times n}\) is the representation matrix, and \(\lambda_{1} > 0\) is a tuning parameter.

Assume that the noise estimate is \({\mathcal{S}}\); then \(X = XZ + {\mathcal{S}}\), and the noise regularization term in (7) is actually \(||{\mathcal{S}}||_{\text{F}}\). Moreover, we assume that there is sparse corrupted noise \(E\) in the data, so that \(X = XZ + {\mathcal{S}} + E\). Nonnegative least-squares regression (NLSR) considers both the sparse corrupted noise \(E\) and the ordinary Gaussian noise \({\mathcal{S}}\); its mathematical expression is

$$\begin{aligned} \mathop {\text{argmin}}\limits_{Z} ||{\mathcal{S}}||_{\text{F}}^{2} + \lambda_{1} ||Z||_{\text{F}}^{2} + \lambda_{2} ||E||_{1}, \hfill \\ \quad \text{s.t.}\ X = XZ + {\mathcal{S}} + E, \quad Z \ge 0, \hfill \\ \end{aligned}$$
(8)

where \(\lambda_{1}, \lambda_{2} > 0\) are tuning parameters.

Problem (8) can be further written as

$$\begin{aligned} \mathop {\text{argmin}}\limits_{Z} \frac{1}{2}||X - XZ - E||_{\text{F}}^{2} + \frac{\lambda_{1}}{2}||Z||_{\text{F}}^{2} + \lambda_{2} ||E||_{1}, \hfill \\ \quad \text{s.t.}\ Z \ge 0. \hfill \\ \end{aligned}$$
(9)

We can solve problem (9) by optimizing the variables separately, i.e., optimizing one variable while fixing the others. Using the inexact ALM [15] method, we separate the variables of the objective function with an auxiliary variable \(C\); problem (9) then becomes

$$\begin{aligned} \mathop {\text{argmin}}\limits_{Z} \frac{1}{2}||X - XZ - E||_{\text{F}}^{2} + \frac{\lambda_{1}}{2}||Z||_{\text{F}}^{2} + \lambda_{2} ||E||_{1}, \hfill \\ \quad \text{s.t.}\ C = Z, \quad C \ge 0. \hfill \\ \end{aligned}$$
(10)

The Lagrange function of problem (10) is

$${\mathcal{L}} = \frac{1}{2}||X - XZ - E||_{\text{F}}^{2} + \frac{\lambda_{1}}{2}||Z||_{\text{F}}^{2} + \lambda_{2} ||E||_{1} + \langle \varGamma, C - Z \rangle + \frac{\mu}{2}||C - Z||_{\text{F}}^{2},$$
(11)

where \(\varGamma \in R^{n \times n}\) is the Lagrange multiplier, and \(\mu \ge 0\) is the penalty parameter.

Fixing the other variables, we solve for \(Z\) from

$$\frac{{\partial {\mathcal{L}}}}{\partial Z} = - X^{T} \left( {X - XZ - E} \right) + \lambda_{1} Z - \varGamma + \mu \left( {Z - C} \right) = 0$$
(12)

then

$$Z = \left( {\frac{{\left( {X^{T} X + \lambda_{1} I} \right)}}{\mu } + I} \right)^{ - 1} \left[ {\frac{{X^{T} X - X^{T} E + \varGamma }}{\mu } + C} \right].$$
(13)

Fixing the other variables, we solve for \(C\):

$$\frac{{\partial {\mathcal{L}}}}{\partial C} = \varGamma + \mu \left( {C - Z} \right) = 0, \quad C \ge 0,$$
(14)
$$C = \max\left(0, Z - \frac{\varGamma}{\mu}\right).$$
(15)

Fixing the other variables, we solve for \(E\):

$$E = \arg \min_{E} \lambda_{2} ||E||_{1} + \frac{1}{2}||E - (X - XZ)||_{\text{F}}^{ 2} = \varTheta_{{\lambda_{2} }} \left( {X - XZ} \right),$$
(16)

where \(\varTheta_{\beta}(x) = {\text{sign}}(x)\max(|x| - \beta, 0)\) is the soft-threshold operator [16], and

$${\text{sign}}(x) = \left\{ \begin{array}{ll} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0. \end{array} \right.$$
(17)
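A short NumPy sketch of the soft-threshold operator in (16)–(17) (the function name is ours):

```python
import numpy as np

def soft_threshold(M, beta):
    """Element-wise soft-thresholding: sign(M) * max(|M| - beta, 0)."""
    return np.sign(M) * np.maximum(np.abs(M) - beta, 0.0)
```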

The whole algorithm is described in Algorithm 1.
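Since Algorithm 1 is not reproduced here, the following NumPy sketch outlines one possible implementation of the inexact-ALM iterations (13)–(16). The penalty schedule (rho, mu_max), the stopping tolerance, and the initializations are our assumptions, not values taken from the paper:

```python
import numpy as np

def nlsr(X, lam1=0.01, lam2=0.1, rho=1.1, mu=1e-2, mu_max=1e6,
         tol=1e-6, max_iter=500):
    """Sketch of NLSR solved by inexact ALM.

    X: d x n data matrix (columns are samples).
    Returns the coefficient matrix Z and the symmetrized graph W."""
    d, n = X.shape
    Z = np.zeros((n, n))
    C = np.zeros((n, n))
    E = np.zeros((d, n))
    Gamma = np.zeros((n, n))
    XtX = X.T @ X
    I = np.eye(n)

    for _ in range(max_iter):
        # Z-update, Eq. (13)
        A = (XtX + lam1 * I) / mu + I
        B = (XtX - X.T @ E + Gamma) / mu + C
        Z = np.linalg.solve(A, B)
        # C-update, Eqs. (14)-(15): projection onto the nonnegative orthant
        C = np.maximum(0.0, Z - Gamma / mu)
        # E-update, Eq. (16): element-wise soft-thresholding
        R = X - X @ Z
        E = np.sign(R) * np.maximum(np.abs(R) - lam2, 0.0)
        # multiplier and penalty updates (standard inexact-ALM schedule)
        Gamma = Gamma + mu * (C - Z)
        mu = min(rho * mu, mu_max)
        if np.max(np.abs(C - Z)) < tol:
            break

    W = (np.abs(Z) + np.abs(Z.T)) / 2   # symmetrize to obtain the graph
    return Z, W
```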

3 Weighted NLSR

In a good graph, there should be no edges connecting samples from different categories; such wrongly connected edges are often called ‘bridge’ edges. To further eliminate ‘bridge’ edges in the graph, we propose weighted NLSR by introducing a weight matrix. Weighted NLSR (WNLSR) solves the problem

$$\begin{aligned} \mathop {\text{argmin}}\limits_{Z} \frac{1}{2}||X - XZ - E||_{\text{F}}^{2} + \frac{\lambda^{\prime}_{1}}{2}||K \cdot Z||_{\text{F}}^{2} + \lambda_{2} ||E||_{1}, \hfill \\ \quad \text{s.t.}\ Z \ge 0, \quad ||K||_{\text{F}}^{2} = \text{constant}, \quad K_{ii} \ne 0, \hfill \\ \end{aligned}$$
(18)

where the constraint \(||K||_{\text{F}}^{2} = {\text{constant}} > 0\) avoids the trivial solution \(K = 0\). If \(K_{ii} = 0\), then \(Z\) tends to the identity matrix, which is also a trivial solution.

One can define \(K\) from another graph, such as the \(k\)-nearest-neighbor (KNN) graph [6]. In fact, the most important issue in graph construction is eliminating ‘bridge’ edges. However, if that graph contains ‘bridge’ edges, \(K\) will also induce ‘bridge’ edges in \(Z\).

\(K\) can also be regarded as a variable and solved automatically. Notice that constraints on \(K\) must be added to avoid the trivial solution \(K = 0\). Consider an example in which the dataset is \(X = [x_{1}, x_{2}, x_{3}, x_{4}] \in R^{d \times 4}\). There are two classes (\(k = 2\)): \(x_{1}, x_{2}\) belong to the first class, while \(x_{3}, x_{4}\) belong to the second class.

The ideal \(Z\) could be

$$Z = \left[ {\begin{array}{*{20}c} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ \end{array} } \right].$$

Each element denotes an edge between a pair of points. The weight matrix \(K\) is designed to help remove wrong connections in \(Z\). Concretely, \(K\) should penalize pairs of points that do not belong to the same class, so that the corresponding entries in \(Z\) become 0. Thus, an ideal \(K\) (with \(K_{ii} = 1\) to avoid the trivial solution) could be

$$K = \left[ {\begin{array}{*{20}c} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ \end{array} } \right].$$

The pivotal elements of \(K\) are the off-diagonal entries equal to 1 that connect points from different classes. These elements prevent ‘bridge’ edges, i.e., the wrongly connected edges between different categories, from emerging. However, the value of \(K\) does not depend on \(X\) but on \(Z\), and during the optimization \(Z\) always contains some wrong connections. For example, \(Z\) could be

$$Z = \left[ {\begin{array}{*{20}c} 1 & 1 & 0 & {\bf{1}} \\ 1 & 1 & 0 & 0 \\ {\bf{1}} & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ \end{array} } \right].$$

The wrong connections in \(Z\) (the entries \(Z_{14}\) and \(Z_{31}\), shown in bold) correspond to ‘bridge’ edges. When minimizing \(||K \cdot Z||_{\text{F}}^{2}\) over \(K\), such a \(Z\) will drive \(K\) toward

$$K = \left[ {\begin{array}{*{20}c} 1 & 0 & 1 & {\bf{0}} \\ 0 & 1 & 1 & 1 \\ {\bf{0}} & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ \end{array} } \right].$$

In this \(K\), the entries \(K_{14}\) and \(K_{31}\) (shown in bold) impose no constraint on the corresponding wrong connections in \(Z\), so the wrong connections in \(Z\) cannot be eliminated by minimizing \(||K \cdot Z||_{\text{F}}^{2}\). Thus, updating \(K\) automatically does not help improve the quality of \(Z\).

To address this problem, we start from the original problem (9). Formula (9) is a special case of (18) with \(K = \mathbf{1}_{n \times n}\), where \(\mathbf{1}_{n \times n}\) is the \(n \times n\) matrix whose elements all equal one. In this four-sample example,

$$K = \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ \end{array} } \right].$$

Next, we define \(K\) by introducing a variable \(p \ge 0\):

$$K = \max[I, (A \ge p)],$$
(19)

where \(I\) is the identity matrix and \(A\) is a random matrix whose entries are drawn from the standard uniform distribution on the open interval (0, 1). \((A \ge p)\) is a binary matrix with \((A \ge p)_{ij} = 1\) if \(A_{ij} \ge p\) and \((A \ge p)_{ij} = 0\) otherwise. Notice that \(K_{ii} = 1\) is required to avoid the trivial solution \(Z = I\).
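A short sketch of the construction in (19); the optional random seed and the function name are ours, added only for reproducibility:

```python
import numpy as np

def random_weight_matrix(n, p, seed=None):
    """Build K = max[I, (A >= p)] as in (19).

    A has i.i.d. entries drawn uniformly from (0, 1); K_ii = 1 is enforced
    by taking the maximum with the identity matrix."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.0, 1.0, size=(n, n))
    return np.maximum(np.eye(n), (A >= p).astype(float))
```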

Edge connections among samples can be divided into right connections and wrong connections. Our goal is to construct an ideal \(K\) that only constrains the wrong connections, i.e., an element of \(K\) equals 1 when the corresponding connection is wrong and equals 0 when the corresponding connection is right. If \(p = 0\), then \(K = \mathbf{1}_{n \times n}\): the constraints on the right connections may reduce the self-representation ability, while the constraints on the wrong connections are largely concealed because the coefficients of the right connections are always nonzero.

If \(p\) becomes a little larger, a few elements of \(K\) become zero. The constraints on some right connections are then removed, and the self-representation ability improves. Although the constraints on some wrong connections are also removed, the remaining wrong connections are still constrained, and these constraints receive more weight because fewer right connections are constrained. Thus, a proper \(p\) not only improves the self-representation ability of the model but also strengthens the constraints on a certain number of wrong connections.

Defining \(K\) by (19), we can solve problem (18) by optimizing the variables separately. Using the inexact ALM [15] method, we separate the variables of the objective function with an auxiliary variable \(C\); problem (18) then becomes

$$\begin{aligned} \mathop {\text{argmin}}\limits_{Z} \frac{1}{2}||X - XZ - E||_{\text{F}}^{2} + \frac{\lambda^{\prime}_{1}}{2}||K \cdot C||_{\text{F}}^{2} + \lambda_{2} ||E||_{1}, \hfill \\ \quad \text{s.t.}\ C = Z, \quad C \ge 0. \hfill \\ \end{aligned}$$
(20)

The Lagrange function of problem (20) is

$${\mathcal{L}} = \frac{1}{2}||X - XZ - E||_{\text{F}}^{2} + \frac{\lambda^{\prime}_{1}}{2}||K \cdot C||_{\text{F}}^{2} + \lambda_{2} ||E||_{1} + \langle \varGamma, C - Z \rangle + \frac{\mu}{2}||C - Z||_{\text{F}}^{2},$$
(21)

where \(\varGamma \in R^{n \times n}\) is the Lagrange multiplier, and \(\mu \ge 0\) is the penalty parameter.

Fixing the other variables, we solve for \(Z\) from

$$\frac{{\partial {\mathcal{L}}}}{\partial Z} = - X^{T} \left( {X - XZ - E} \right) - \varGamma + \mu \left( {Z - C} \right) = 0,$$
(22)

then

$$Z = \left( {\frac{{X^{T} X}}{\mu } + I} \right)^{ - 1} \left[ {\frac{{X^{T} X - X^{T} E + \varGamma }}{\mu } + C} \right].$$
(23)

Fixing the other variables, we solve for \(C\):

$$\frac{{\partial {\mathcal{L}}}}{{\partial C_{ij} }} = \lambda^{\prime}_{1} K_{ij}^{2} \cdot C_{ij} + \varGamma_{ij} + \mu \left( {C - Z} \right)_{ij} = 0, C_{ij} \ge 0.$$
(24)

Then

$$C_{ij} = \frac{{Z_{ij} - \frac{{\varGamma_{ij} }}{\mu }}}{{\left( {\frac{{\lambda^{\prime}_{1} }}{\mu }K_{ij}^{2} + 1} \right)}}.$$
(25)

We can update C by

$$C = \left[ {1./\left( {\frac{{\lambda^{\prime}_{1} }}{\mu }K \cdot K + 1} \right)} \right] \cdot \left( {Z - \frac{\varGamma }{\mu }} \right),$$
(26)

where ./ and \(\cdot\) denote element-wise division and multiplication, respectively, followed by projection onto the nonnegative orthant:

$$C = \hbox{max} \left( {0,C} \right).$$
(27)
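The element-wise update (26)–(27) can be coded directly; a sketch assuming NumPy arrays Z, Gamma, K and scalars lam1p (for \(\lambda^{\prime}_{1}\)) and mu (the function name is ours):

```python
import numpy as np

def update_C(Z, Gamma, K, lam1p, mu):
    """WNLSR C-update, Eqs. (26)-(27): element-wise scaling followed by
    projection onto the nonnegative orthant."""
    C = (Z - Gamma / mu) / (lam1p / mu * K * K + 1.0)
    return np.maximum(0.0, C)
```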

Fixing the other variables, we solve for \(E\):

$$E = \mathop {\text{argmin}}\limits_{\text{E}} \lambda_{2} \left| {\left| E \right|} \right|_{1} + \frac{1}{2}\left| {\left| {E - \left( {X - XZ} \right)} \right|} \right|_{F}^{2} = \varTheta_{{\lambda_{2} }} \left( {X - XZ} \right),$$
(28)

where \(\varTheta_{\beta}(x) = {\text{sign}}(x)\max(|x| - \beta, 0)\) is the soft-threshold operator [16], and

$${\text{sign}}(x) = \left\{ \begin{array}{ll} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0. \end{array} \right.$$
(29)

The whole algorithm is described in Algorithm 2.
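Analogously to the NLSR sketch in Sect. 2, a minimal NumPy version of the WNLSR iterations (23)–(28) might look as follows; again, the ALM schedule, initializations, and stopping rule are our assumptions rather than details from Algorithm 2:

```python
import numpy as np

def wnlsr(X, K, lam1p=0.01, lam2=0.1, rho=1.1, mu=1e-2, mu_max=1e6,
          tol=1e-6, max_iter=500):
    """Sketch of WNLSR solved by inexact ALM.

    X: d x n data matrix; K: n x n weight matrix built as in (19)."""
    d, n = X.shape
    Z = np.zeros((n, n))
    C = np.zeros((n, n))
    E = np.zeros((d, n))
    Gamma = np.zeros((n, n))
    XtX = X.T @ X
    I = np.eye(n)

    for _ in range(max_iter):
        # Z-update, Eq. (23)
        B = (XtX - X.T @ E + Gamma) / mu + C
        Z = np.linalg.solve(XtX / mu + I, B)
        # C-update, Eqs. (26)-(27)
        C = np.maximum(0.0, (Z - Gamma / mu) / (lam1p / mu * K * K + 1.0))
        # E-update, Eq. (28)
        R = X - X @ Z
        E = np.sign(R) * np.maximum(np.abs(R) - lam2, 0.0)
        # multiplier and penalty updates
        Gamma = Gamma + mu * (C - Z)
        mu = min(rho * mu, mu_max)
        if np.max(np.abs(C - Z)) < tol:
            break

    return Z, (np.abs(Z) + np.abs(Z.T)) / 2
```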

4 Experiments

In this section, the semi-supervised learning task is used to evaluate the performance of the proposed NLSR method. We measure different graph construction algorithms by the classification error on the unlabeled data. The datasets include face datasets and a motion dataset (see Figs. 1, 2).

Fig. 1 Sample images from the YaleB dataset. Each row belongs to the same person

Fig. 2 Sample images from the Hopkins155 datasets. Different colors indicate different motions

4.1 Datasets

ORL dataset The ORL dataset consists of 10 different images for each of 40 distinct subjects, taken at different times, under different lighting conditions, with different facial expressions, and with/without glasses. Each image is 32 × 32 pixels with 256 gray levels per pixel. The first ten subjects are selected to construct the data matrix for the experiments.

Yale dataset The Yale dataset contains 165 grayscale images of 15 individuals, with 11 images per subject, one per facial expression or configuration. Each image is represented by a 1024-dimensional vector in image space.

Extended YaleB dataset [17] This database is challenging due to large noise and corruption. It contains 2414 frontal face images of 38 subjects, with 64 face images per subject. We choose the cropped images of the first five individuals and resize them to 32 × 32 pixels. The data are projected into a 10 × 6-dimensional subspace by PCA.

Hopkins155 motion dataset [18] This is a motion segmentation dataset consisting of 156 video sequences with extracted feature points and their tracks across frames. It contains board sequences, traffic sequences, and pedestrian movement sequences. The first 100 two-motion video sequences are selected to construct the data matrices for the experiments. We use PCA to project the data into a 12-dimensional subspace.

4.2 Semi-supervised image classification

To demonstrate how the classification performance can be improved by our method, we compare the proposed algorithm with four algorithms: the KNN graph [6], SSC graph [9], LSR graph [13], and SMR graph [14]. The parameters are set according to the corresponding references, and the best parameters are determined by a finite grid search [19]. We do not include the LRR graph as a compared algorithm because it performs poorly in these experiments.

The parameter \(\lambda^{\prime}_{1}\) in WNLSR is always set as \(\lambda^{\prime}_{1} = \lambda_{1}\), which is denoted as WNLSR1. In addition, we can set \(\lambda^{\prime}_{1} = \lambda_{1}(1 - p + p/n)\) by considering the number of elements in \(K \cdot Z\), which is denoted as WNLSR2. Besides, \(\lambda_{1}\) is set to 0.01 throughout the paper.

We choose the well-known local and global consistency (LGC) [5] method as the semi-supervised learning method for comparing the performance of different graph construction methods. Assume that the dataset is \(X = [x_{1}, x_{2}, \ldots, x_{n}] \in R^{d \times n}\) and that the first \(l\) samples \(x_{1}, x_{2}, \ldots, x_{l}\) are labeled. The labels come from \(k\) categories, \(L = [1, 2, \ldots, k]\). \(Y = [Y_{l}\; Y_{u}]^{T} \in R^{n \times k}\) is the label matrix: if sample \(x_{i}\) is labeled \(j\), \(j \in \{1, 2, \ldots, k\}\), then \(Y_{ij} = 1\); otherwise \(Y_{ij} = 0\). The optimization problem of LGC is

$$\mathop {\text{argmin}}\limits_{F} \text{tr}\{F^{T} \tilde{L}_{W} F + \beta (F - Y)^{T}(F - Y)\},$$
(30)

where \(\beta \in [0, +\infty)\) balances the local fitting term and the overall smoothness term of the objective function; we generally set \(\beta = 0.99\). \(F = [F_{l}\; F_{u}]^{T} \in R^{n \times k}\) is the desired classification function, and \(\tilde{L}_{W}\) is the normalized graph Laplacian of \(W\), \(\tilde{L}_{W} = D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}}\).
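For completeness, setting the gradient of (30) to zero gives the closed-form minimizer \(F = \beta(\tilde{L}_{W} + \beta I)^{-1}Y\). A NumPy sketch of LGC prediction from a graph W and label matrix Y (the argmax decision rule at the end is the usual choice and is our addition):

```python
import numpy as np

def lgc_predict(W, Y, beta=0.99):
    """Label propagation via LGC: minimize tr(F^T L F + beta (F-Y)^T (F-Y)).

    W: n x n symmetric affinity graph; Y: n x k binary label matrix
    (rows of unlabeled samples are all zero). Returns predicted class ids."""
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # normalized graph Laplacian  L = D^{-1/2} (D - W) D^{-1/2}
    L = np.eye(n) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    # closed-form minimizer of (30): (L + beta I) F = beta Y
    F = np.linalg.solve(L + beta * np.eye(n), beta * Y)
    return F.argmax(axis=1)
```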

The datasets include three face datasets (ORL, Yale, and YaleB) and the 100 small datasets from Hopkins155. The face data are first normalized (\(x_{i} = x_{i}/||x_{i}||_{2}\), \(i = 1, 2, \ldots, n\)). In total, 103 datasets are used in the semi-supervised learning experiments. For each dataset, the evaluations are conducted with different numbers of labeled samples \(NL\). For a fixed \(NL\), we run the experiments as follows:

1. Construct graphs by different methods.
2. Randomly choose \(NL\) points from the dataset as labeled data.
3. Apply the LGC method to perform label propagation.
4. Calculate the classification error on all unlabeled data.
5. Repeat the above process 20 times.

4.3 Experimental results

Experimental results are shown in Tables 1, 2, 3, and 4. From these results, we observe that: (1) in most cases, NLSR consistently achieves good performance compared with LSR, while WNLSR also performs well; (2) SSC also performs well on the ORL, Yale, and Extended YaleB face datasets; (3) although LSR and SMR usually perform well in unsupervised learning experiments [13, 14], they perform poorly on the 100 small datasets of Hopkins155, whereas the KNN and NLSR graphs perform well on these datasets. The results show that NLSR achieves good performance thanks to the nonnegative constraint and the design of the noise estimation.

Table 1 Average classification error rates of different graph construction methods on ORL dataset
Table 2 Average classification error rates of different graph construction methods on Yale dataset
Table 3 Average classification error rates of different graph construction methods on Extended YaleB dataset
Table 4 Average classification error rates of different graph construction methods on Hopkins155 datasets (100 datasets)

4.4 Algorithm analysis

Parameter selection is important for algorithms. We use the finite grid [19] method to select parameters for NLSR, as seen in Table 5. Notice that the parameters \(\lambda_{1}\) and \(\lambda_{2}\) control the nonnegative-constraint term and the noise-estimation term in the NLSR problem [see (8) and (7)]. When \(\lambda_{2} = +\infty\), we set \(E = 0\), and formula (8) reduces to formula (7), i.e., only the nonnegative constraint is considered.

Table 5 A finite grid of parameter values

Figure 3 shows the influence of parameter \(\lambda_{2}\) when \(\lambda_{1}\) is fixed. A suitable \(\lambda_{2}\) helps reduce the classification error rate, but the reduction is limited. Notice that when \(\lambda_{2}\) is small, the performance of the model suffers. This suggests that the nonnegative constraint plays the key role in the NLSR model, and that noise estimation is the more difficult part.

Fig. 3 Analysis of parameter \(\lambda_{2}\) in NLSR: classification error rates (ER %) on (1) ORL, (2) Yale, and (3) Extended YaleB datasets with NL = 5. NLSR2: \(\lambda_{1} = 0.01\) fixed, \(\lambda_{2}\) varied. NLSR1: \(\lambda_{1} = 0.01\), \(\lambda_{2} = +\infty\)

In our experiments, we simply set \(p = 0.3\) and \(p = 0.75\) for WNLSR. Figure 4 shows the average performance (over 100 runs) for various \(p\) on different datasets with NL = 5. For simplicity, we set \(\lambda_{1} = 0.01\) and \(\lambda_{2} = +\infty\).

Fig. 4 Error rates vs. \(p\) (WNLSR) on different datasets with NL = 5. The y-axis denotes the error rate (ER %) and the x-axis the variable \(p\). a, b ORL dataset; c, d Yale dataset; e, f Extended YaleB dataset; g, h Hopkins155 datasets

Since the proper \(p\) differs across datasets, it is important to find a way to estimate it. We estimate \(p\) as follows:

1. Consider a given semi-supervised learning problem.
2. Randomly select one labeled sample from each class as a testing sample, and regard these as unlabeled samples during graph learning.
3. Generate a random matrix and compute the prediction error on the testing samples for different values of \(p\).
4. Repeat the whole procedure 100 times for the ORL, Yale, and YaleB datasets; for Hopkins155, compute the average prediction error over the 100 datasets.

Figure 5 shows the average prediction error rates (over 100 runs) of the testing samples for various \(p\) on different datasets with NL = 5. We observe that the proper \(p\) estimated in Fig. 5 is roughly consistent with the proper \(p\) in Fig. 4 on the different datasets.

Fig. 5 Average prediction error rates of testing samples vs. \(p\) (WNLSR) on different datasets with NL = 5. The y-axis denotes the error rate (ER %) of the testing samples and the x-axis the variable \(p\). a, b ORL dataset; c, d Yale dataset; e, f Extended YaleB dataset; g, h Hopkins155 datasets

Now we analyze the sparsity of the representation coefficient matrix Z. The sparsity of a matrix Z can be defined as follows:

$${\text{sparsity}}\left( Z \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{n} {\text{sparsity}}\left( {Z_{i} } \right)}}{n} .$$
(31)

The sparsity \({\text{sparsity}}(Z_{i})\) of the vector \(Z_{i}\) can be calculated as [20]

$${\text{sparsity}}(Z_{i}) = \frac{\sqrt{n} - \left( \sum\nolimits_{j} |Z_{ij}| \right) \big/ \sqrt{\sum\nolimits_{j} Z_{ij}^{2}}}{\sqrt{n} - 1},$$
(32)

where \(Z_{ij}\) is the \(j\)th element of \(Z_{i}\).
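A sketch of the sparseness measure in (31)–(32), averaged over the rows of Z (the function name and the small epsilon guarding against zero rows are ours):

```python
import numpy as np

def sparseness(Z, eps=1e-12):
    """Average row-wise sparseness of Z as in (31)-(32); 1 means maximally
    sparse, 0 means all entries equal."""
    n = Z.shape[1]
    l1 = np.abs(Z).sum(axis=1)
    l2 = np.sqrt((Z ** 2).sum(axis=1)) + eps
    row_sparsity = (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)
    return row_sparsity.mean()
```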

The sparsest possible vector has a sparseness of one, whereas a vector with all elements equal has a sparseness of zero. As seen in Table 6, the representation coefficient matrices \(Z\) of KNN and SSC have high sparsity, while those of LSR and SMR have low sparsity. NLSR obtains a sparser representation coefficient matrix than LSR.

Table 6 The mean sparseness of the representation coefficient Z obtained by different methods on different datasets

The sparsity of the representation coefficient matrix \(Z\) directly determines the sparsity of the corresponding weight matrix \(W\). By examining the experimental results together with the sparsity of the graphs, we find that the sparsity of the graph has an important influence on semi-supervised learning. In fact, the most important factor behind the good performance of NLSR is that the nonnegative constraint helps improve the sparsity of the LSR graph. Sparsity, of course, is not the only factor determining classification performance. For example, the KNN and SSC graphs have the highest sparsity on the Hopkins155 and YaleB datasets, but their classification performance is not as good as that of NLSR. Figure 6 shows the graphs obtained by different graph construction methods on the ORL dataset (40 categories). The NLSR graph is “cleaner” than the LSR graph, i.e., NLSR has fewer incorrect edges between points than LSR. Notice that the SSC graph also does not have many incorrect edges; however, the aggregation degree of each category in the SSC graph is lower than in the LSR or NLSR graphs. Figure 7 shows the difference between NLSR (p = 0) and WNLSR (p = 0.3). We magnify part of the graph obtained by WNLSR; WNLSR eliminates more ‘bridge’ edges than NLSR.

Fig. 6 Graphs obtained by different graph construction methods on the ORL dataset. Graph values are normalized to [0, 1], and the diagonal is set to 0 for display

Fig. 7 Graphs obtained by WNLSR1 with different \(p\) on the ORL dataset. If \(p = 0\), WNLSR1 reduces to the NLSR method. Graph values are normalized to [0, 1], and the diagonal is set to 0 for display

In Table 7, we report the average running time of different graph construction methods. The LSR and KNN graphs usually have short running times. The SSC, NLSR, and WNLSR graphs, which generally perform well, have longer running times. The NLSR and WNLSR graphs run faster than the SSC graph on the ORL, Yale, and YaleB datasets, and slower than the SSC graph on the Hopkins155 datasets. The computer configuration is an Intel(R) Core(TM) i5-3470 CPU @ 3.20 GHz with 4 GB of memory, Microsoft Windows 7, and Matlab 2010b.

Table 7 The average running time of different graph construction methods

The major computational burden of NLSR and WNLSR (both iterative algorithms) lies in the computation of the matrix inverse in formula (13), which has a computational complexity of \(O(n^{3})\).

5 Conclusion and future work

Inspired by the idea of self-representation of data, we propose a novel graph construction method, the NLSR graph, based on the LSR graph. The biggest advantage of LSR is that it directly adopts the self-representation idea; however, its poor ability to eliminate wrong edges among samples limits its application. Based on LSR, we emphasize the nonnegative constraint on \(Z\) without adding other regularizations on \(Z\). In this way, NLSR not only retains the inherent advantage of LSR but also improves its ability to eliminate ‘bridge’ edges. In fact, the nonnegative constraint avoids the cancellation of positive and negative coefficients and improves the sparsity of the constructed graph. A weighted version of NLSR (WNLSR) is also proposed to further eliminate ‘bridge’ edges by constructing a proper weight matrix. Finally, we redesign the noise estimation by taking both small Gaussian noise and sparse corrupted noise into consideration. Image classification experiments show encouraging results for the NLSR algorithm compared with state-of-the-art algorithms in semi-supervised learning, especially in improving the LSR method.

The construction of the graph is the key issue in graph-based semi-supervised learning, so constructing a good graph is a meaningful task. In graph construction, we can focus on the following issue: a pair of samples that are close in the distance space but belong to different classes. The edges between such sample pairs are usually ‘bridge’ edges. We can study how to pull those samples away from each other by considering a Mahalanobis space [21–23]. It is also a meaningful task to develop more efficient algorithms for NLSR and WNLSR in future work.