1 Introduction

Dimensionality reduction has been widely used as a fundamental tool to analyze high-dimensional data [1, 18, 20]. Linear discriminant analysis (LDA) [24] and principal component analysis (PCA) [25] are among the most popular dimensionality reduction techniques. Some dimensionality reduction techniques, such as LDA and PCA, can be understood as matrix factorization under different objective function criteria. Matrix factorization approximately decomposes a matrix into a product of two or more matrices. Among existing matrix decomposition methods, non-negative matrix factorization (NMF) [12] can be used to obtain new representations of data points under non-negativity constraints; that is, it requires all elements of the factor matrices to be non-negative. These non-negativity constraints lead to parts-based representations of the objects because they only allow additive, not subtractive, combinations of the original data points. NMF is a helpful dimensionality reduction method for face recognition [7], document clustering [29], image processing [10] and computer vision [22].

Generally, clustering can be divided into unsupervised clustering and semi-supervised clustering. Unsupervised clustering uses no label information to cluster the data, whereas semi-supervised clustering relies on the labels of some data points, which may be user specified or randomly selected from the data. NMF is an unsupervised learning method: it does not use any prior knowledge of the data to guide the learning process, even though a certain amount of prior knowledge is available in real-world applications. Using prior knowledge to improve the performance of algorithms has become one of the hot areas of machine learning. Many researchers have pointed out that using a small amount of labeled data in conjunction with unlabeled data can produce encouraging improvements in learning performance [3, 6, 8, 30, 35]. However, it is infeasible to label all the data points in a database because the cost would be prohibitively high, whereas obtaining a small amount of labeled data is relatively inexpensive. Under these circumstances, semi-supervised learning algorithms can deliver better performance, and NMF has been extended to the semi-supervised setting accordingly [9, 16, 23, 31].

Liu et al. [16] proposed a constrained non-negative matrix factorization (CNMF) approach which took the label information as additional constraints. The main idea of their algorithm is that data points with the same class label must be strictly mapped to the same representation in the new parts-based representation space; the method thus forces the new representations to carry label information consistent with the original data. Obviously, this requirement is too strict: it weakens the representational ability of the new parts-based representation space for the remaining unlabeled data, because the constraints might assign totally wrong representations to unlabeled data. Wang et al. [23] proposed a penalized matrix factorization (PMF) algorithm, which took pairwise constraints as supervisory information. However, the penalties for violating the must-link constraints are hard to set. Yang et al. [31] proposed a pairwise constraints guided non-negative matrix factorization (PCNMF), which used the pairwise constraints to guide the clustering process.

Recently, manifold learning methods [36, 37] have also been incorporated into NMF. Cai et al. [2] proposed a graph regularized NMF (GNMF) algorithm which encoded the geometrical information of the data space by constructing a nearest neighbor graph to model the local manifold structure.

In our previous work [9], we proposed a semi-supervised non-negative matrix factorization with graph Laplacian (SEMINMF) method which incorporated label information and the graph Laplacian into NMF. However, SEMINMF requires the dimensionality of the factorized matrices to equal the number of clusters, which may result in a larger reconstruction error between the original matrix and the factorized matrices. Besides, the label information used in SEMINMF acts as hard constraints: it forces the factorized coefficient matrix to carry label information consistent with the cluster indicator matrix of the labeled points, which may also produce a larger reconstruction error.

In this paper, we propose a novel pairwise constrained non-negative matrix factorization with graph Laplacian (PCGNMF) method. Unlike SEMINMF, PCGNMF does not use the class label information for clustering directly, but utilizes the pairwise constraints generated among all the labeled data to enhance the learning quality. The label information used in SEMINMF can be regarded as hard constraints, while the pairwise constraints used in PCGNMF can be regarded as soft constraints. Moreover, PCGNMF can set the dimensionality of the factorized matrices freely, whereas in SEMINMF this dimensionality must equal the number of clusters. With the pairwise constraints, PCGNMF requires that two data points sharing the same class label have representations in the new parts-based representation space that are as similar as possible; conversely, data points with different class labels should have dissimilar representations in that space. We do not use the pairwise constraints directly to create the graph Laplacian matrix, because the number of pairwise constraints is small and cannot adequately characterize the local structure of the data. Instead, we incorporate the graph Laplacian into NMF, which requires nearby points to share similar representations as far as possible. In this way, we expect PCGNMF to obtain a more compact and discriminative representation of the data. To achieve this, we carefully design a new NMF objective function that incorporates both the pairwise constraint information and the graph Laplacian. Our experimental evaluations show that the proposed approach achieves state-of-the-art performance.

2 Related Works

Given a data matrix \(\mathbf {X}=[\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_n]\in \mathbb {R}^{m\times n}\), where each \(\mathbf {x}_j\), \(j=1,\ldots ,n\), is an m-dimensional non-negative vector denoting the jth data point, NMF aims to factorize \(\mathbf {X}\) into the product of two non-negative matrices \(\mathbf {U}\) and \(\mathbf {V}\) whose product is a good approximation to the original matrix, i.e.,

$$\begin{aligned} \mathbf {X}\approx \mathbf {U}\mathbf {V}^{T} \end{aligned}$$
(1)

In order to obtain the two non-negative matrices, we quantify the quality of the approximation by a cost function based on some distance metric. For example, if the Euclidean distance between the two matrices is used, the problem reduces to minimizing the following objective function:

$$\begin{aligned} J = ||\mathbf {X}-\mathbf {U}\mathbf {V}^{T}||^2 = \sum _{i=1}^m\sum _{j=1}^n({x_{ij}}-\sum _{c=1}^k{u_{ic}}{v_{jc}})^2 \end{aligned}$$
(2)

where \(\Vert \cdot \Vert \) denotes the matrix Frobenius norm, so the objective is the sum of squares of all the entries of the residual. The sizes of the factorized matrices \(\mathbf {U}\) and \(\mathbf {V}\) are \({m}\times {k}\) and \({n}\times {k}\), respectively, and \(k\) is the dimensionality of the factorization. Usually, \({k}\) is chosen such that \({k} \ll \min \{{m},{n}\}\). Each column vector \(\mathbf {u}_c\) of matrix \(\mathbf {U}\) can be regarded as a basis vector of the new representation space [4, 28], while the jth row vector of matrix \(\mathbf {V}\) contains the coefficients of a linear combination of the column vectors of \(\mathbf {U}\); this linear combination approximates the jth column vector \(\mathbf {x}_j\) of matrix \(\mathbf {X}\). NMF can derive the latent characteristic structure space \(\mathbf {U}\) through matrix factorization in the clustering process [13, 27, 29, 33].

When NMF is used for clustering tasks, the dimensionality \(k\) of the factorized matrices has multiple choices: we can set \(k\) equal to, larger than, or even smaller than the number of clusters. When \(k\) equals the number of clusters, each column of the decomposed matrix \(\mathbf {U}\) can be regarded as the center of one partition of the dataset, and each data point can be represented by an additive combination of all column vectors of \(\mathbf {U}\). Each entry in the jth row of the factorized matrix \(\mathbf {V}\) is the projection of the jth data point \(\mathbf {x}_j\) of the matrix \(\mathbf {X}\) onto the corresponding column vector of matrix \(\mathbf {U}\). Hence, the cluster membership of each data point can be determined by finding the basis (one column of \(\mathbf {U}\)) with which the data point has the largest projection value. More specifically, we examine each row of \(\mathbf {V}\) and assign data point \(\mathbf {x}_j\) to cluster \(c\) if \(c=\arg \max \limits _{c} v_{jc}\) [29]. Alternatively, we can apply K-means to the coefficient matrix for clustering when \(k\) equals the number of clusters; if \(k\) is larger or smaller than the number of clusters, K-means on the coefficient matrix is the only option.
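
A minimal sketch of this argmax assignment strategy, assuming NumPy (the random initialization, iteration count and the small constant eps are choices made for illustration, not prescribed by [29]):

```python
import numpy as np

def nmf_cluster(X, k, n_iter=200, eps=1e-10, seed=0):
    """Plain NMF (X ~ U V^T) followed by argmax cluster assignment.

    X : (m, n) non-negative data matrix, one data point per column.
    k : dimensionality of the factorization (here, the number of clusters).
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    U = rng.random((m, k))          # basis matrix, m x k
    V = rng.random((n, k))          # coefficient matrix, n x k

    for _ in range(n_iter):
        # Standard multiplicative updates for the Frobenius objective ||X - U V^T||^2.
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
        U *= (X @ V) / (U @ (V.T @ V) + eps)

    # Each row of V holds the projections of one data point onto the k bases;
    # the cluster label is the index of the largest entry (argmax_c v_jc).
    labels = np.argmax(V, axis=1)
    return U, V, labels
```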

Cai et al. [2] proposed a graph regularized NMF (GNMF) algorithm which incorporates the graph Laplacian into NMF. The objective function of GNMF is defined as:

$$\begin{aligned} J =||\mathbf {X}-\mathbf {U}\mathbf {V}^T||^2 + \lambda \mathrm {tr}(\mathbf {V}^T\mathbf {L}\mathbf {V}) \end{aligned}$$
(3)

where \(\mathrm {tr}(\cdot )\) is the trace operator, \(\mathbf {L}\) is the graph Laplacian matrix, \(\mathbf {L} = \mathbf {D} - \mathbf {W}\), \(\mathbf {W}\) is the affinity matrix, its entry \(w_{jq}\) denotes the similarity between point \(\mathbf {x}_j\) and \(\mathbf {x}_q\), \(\mathbf {D}\) is a diagonal matrix with its entries defined as \(d_{jj} = \sum _{q=1}^n w_{jq}\). Due to the graph Laplacian matrix, GNMF can effectively utilize the local structure of the data and obtain a compact representation for the data.
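
As a rough illustration of how such a graph regularizer might be assembled (a sketch only: the heat-kernel weights and the symmetrized \(p\)-nearest-neighbor graph follow the construction given later in Eq. (9), while the function name and the dense-matrix implementation are our own choices):

```python
import numpy as np

def graph_laplacian(X, p=3, sigma2=1.0):
    """Build the heat-kernel affinity W on a symmetrized p-nearest-neighbor graph,
    the degree matrix D, and the graph Laplacian L = D - W.

    X : (m, n) data matrix with one data point per column, as in the paper.
    """
    pts = X.T                                            # (n, m): one point per row
    n = pts.shape[0]
    sq_norm = np.sum(pts ** 2, axis=1)
    sq_dist = sq_norm[:, None] + sq_norm[None, :] - 2 * (pts @ pts.T)   # pairwise squared distances

    W = np.zeros((n, n))
    for j in range(n):
        neighbors = np.argsort(sq_dist[j])[1:p + 1]      # p nearest neighbors, excluding x_j itself
        for q in neighbors:
            w = np.exp(-sq_dist[j, q] / sigma2)
            W[j, q] = w
            W[q, j] = w                                  # symmetrize: x_j in N_p(x_q) or x_q in N_p(x_j)

    D = np.diag(W.sum(axis=1))
    L = D - W
    return W, D, L
```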

Recently, pairwise constraints have been incorporated into NMF. Yang et al. [31] proposed a PCNMF, which utilized the pairwise constraints to improve the performance of NMF. The objective function of PCNMF is defined as:

$$\begin{aligned} J&= ||\mathbf {X}-\mathbf {U}\mathbf {H}^T||^2 + \lambda \mathrm {tr}(\mathbf {H}^T\mathbf {S}\mathbf {H}) \nonumber \\ A_{ij}&= {\left\{ \begin{array}{ll} \alpha &{} \text {if } \mathbf {x}_i, \mathbf {x}_j\,(i\ne j) \text { have the same class label}\\ -(1 - \alpha ) &{} \text {if } \mathbf {x}_i, \mathbf {x}_j\,(i\ne j) \text { have different class labels}\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

where \(\mathbf {S} = \mathbf {D} - \mathbf {A}\) and \(\mathbf {D}\) is a diagonal matrix with entries \(d_{ii} = \sum _{j=1}^n A_{ij}\). The objective function of PCNMF resembles that of GNMF; the main difference is that PCNMF only uses the explicit pairwise constraints to construct the graph Laplacian matrix, while GNMF uses all the data to construct a graph that models the local structure.
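
For reference, a small sketch of assembling the PCNMF weight matrix \(\mathbf {A}\) and the matrix \(\mathbf {S}=\mathbf {D}-\mathbf {A}\) from a partial label vector (the encoding of unlabeled points as \(-1\) and the function name are assumptions of this sketch):

```python
import numpy as np

def pcnmf_constraint_laplacian(labels, alpha=0.9):
    """labels: length-n array; class id for labeled points, -1 for unlabeled points."""
    labels = np.asarray(labels)
    n = len(labels)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j or labels[i] < 0 or labels[j] < 0:
                continue
            A[i, j] = alpha if labels[i] == labels[j] else -(1.0 - alpha)
    D = np.diag(A.sum(axis=1))
    return D - A          # S = D - A, used in the PCNMF regularizer tr(H^T S H)
```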

Liu et al. [16] proposed a CNMF approach, which utilized the label information to enhance the performance of NMF. The objective function of CNMF is defined as:

$$\begin{aligned} J =||\mathbf {X}-\mathbf {U}\mathbf {Z}\mathbf {A}||^2 \end{aligned}$$
(5)

CNMF incorporates the label information by introducing an auxiliary matrix \(\mathbf {A}\). For any two data points, if \(\mathbf {x}_i\) and \(\mathbf {x}_j\) have the same class label, they will have the same representation in the new parts-based representation space.

In our previous work [9], we proposed the SEMINMF method, which incorporates label information and the graph Laplacian into NMF. The objective function of SEMINMF is defined as:

$$\begin{aligned} J =||\mathbf {X}-\mathbf {U}\mathbf {V}^T||^2 + \alpha \mathrm {tr}(\mathbf {V}^T\mathbf {L}\mathbf {V}) +\beta ||\mathbf {V}-\mathbf {Y}||^2 \end{aligned}$$
(6)

where \(\mathbf {L}\) is the graph Laplacian matrix and \(\mathbf {Y}\) is the cluster indicator matrix of the labeled points. The main drawback of SEMINMF is that \(k\) can only be set to the number of clusters, which may result in a larger reconstruction error between the original matrix and the factorized matrices.

Yang et al. [32] proposed a non-negative spectral clustering with discriminative regularization algorithm, which imposed non-negativity constraints on the cluster indicator matrix.

In Sect. 3, we present our novel NMF method with pairwise constraints and graph Laplacian (PCGNMF), which incorporates the pairwise constraints generated among the labeled data and the graph Laplacian into NMF. Its objective function differs from those of all the algorithms above.

3 NMF with Pairwise Constraints and Graph Laplacian

3.1 The Objective Function

Given a data set consisting of \(n\) data points \(\mathbf {X}=[\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_n]\in \mathbb {R}^{m \times n}\), the label information of the first \(s\) data points \(\mathbf {x}_t (t\le s)\) is available and the remaining points \(\mathbf {x}_r (s<r\le n)\) are unlabeled. From these labeled points we can obtain specific pairwise constraint information. Suppose the data set \(\mathbf {X}\) is to be divided into \(k\) clusters and we randomly select \(f\) labeled points from each cluster; the pairwise constraints can then easily be generated among the labeled points. More specifically, if two labeled points have the same class label, we generate a must-link constraint for them; if two labeled points have different class labels, a cannot-link constraint is generated for them. The numbers of must-link and cannot-link pairwise constraints are \(k\times \textit{C}_{f}^{2}\) and \(f^{2}\times \textit{C}_{k}^{2}\), respectively, where \(\textit{C}_{n}^{m}\) denotes the number of ways to select \(m\) objects from \(n\).

Then we can construct a must-link pairwise constraint symmetric matrix \(\mathbf {M}=[m_{pj}]\in \mathbb {R}^{n\times n}\) \((p, j=1,2,\ldots ,n)\) and a cannot-link pairwise constraint symmetric matrix \(\mathbf {C}=[c_{pj}]\in \mathbb {R}^{n\times n}\) \((p, j=1,2,\ldots ,n)\) with the first \(s\) labeled data points on the data set as follows:

$$\begin{aligned} m_{pj} = {\left\{ \begin{array}{ll} 1 &{} \quad \text {if } \mathbf {x}_p, \mathbf {x}_j\,(p\ne j) \text { have the same class label}\\ 0 &{}\quad \text {otherwise } \end{array}\right. }\nonumber \\ c_{pj} = {\left\{ \begin{array}{ll} 1 &{}\quad \text {if } \mathbf {x}_p, \mathbf {x}_j\,(p\ne j) \text { have different class labels}\\ 0 &{}\quad \text {otherwise } \end{array}\right. } \end{aligned}$$
(7)
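
A minimal sketch of building \(\mathbf {M}\) and \(\mathbf {C}\) from the labeled points (the encoding of unlabeled points as \(-1\) is an assumption of this sketch, not part of the paper's notation):

```python
import numpy as np

def pairwise_constraint_matrices(labels):
    """Build the must-link matrix M and cannot-link matrix C of Eq. (7).

    labels: length-n array; class id for the s labeled points, -1 for unlabeled points.
    """
    labels = np.asarray(labels)
    n = len(labels)
    M = np.zeros((n, n))
    C = np.zeros((n, n))
    labeled = np.where(labels >= 0)[0]
    for a, p in enumerate(labeled):
        for j in labeled[a + 1:]:
            if labels[p] == labels[j]:
                M[p, j] = M[j, p] = 1.0     # same class: must-link
            else:
                C[p, j] = C[j, p] = 1.0     # different classes: cannot-link
    return M, C
```

With \(f\) labeled points drawn from each of the \(k\) clusters, this construction produces \(k\times \textit{C}_{f}^{2}\) must-link pairs and \(f^{2}\times \textit{C}_{k}^{2}\) cannot-link pairs, matching the counts given above.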

With the pairwise constraints, our proposed approach reduces to minimizing the following objective function:

$$\begin{aligned} J&= \sum _{i=1}^m\sum _{j=1}^n \left( {x_{ij}}-\sum _{c=1}^k{u_{ic}}{v_{jc}}\right) ^2 +\alpha \sum _{c=1}^k\sum _{q=1}^n\sum _{j=1}^n w_{jq}(v_{jc} - v_{qc})^2 \nonumber \\&+\beta \sum _{j=1}^n \left( \sum _{p: m_{pj}=1}\sum _{c=1}^k\sum _{h=1,h\ne c}^k{v_{jc}}{v_{ph}} +\sum _{p: c_{pj}=1}\sum _{c=1}^k{v_{jc}}{v_{pc}} \right) \end{aligned}$$
(8)
$$\begin{aligned} w_{jq} = {\left\{ \begin{array}{ll} \exp (-\frac{||\mathbf {x}_j-\mathbf {x}_q||^2}{\sigma ^2}) &{} \text {if} \quad \mathbf {x}_j \in N_p(\mathbf {x}_q) \quad \text {or} \quad \mathbf {x}_q \in N_p(\mathbf {x}_j) \quad \text {and} \quad j \ne q \\ 0 &{} \text {otherwise } \end{array}\right. } \end{aligned}$$
(9)

Equation (8) can be rewritten in matrix form using an auxiliary matrix \(\mathbf {A}\in \mathbb {R}^{k\times k}\), defined as:

$$\begin{aligned} \mathbf {A}&= \left( \begin{array}{cccc} 0 &{} 1 &{} \cdots &{} 1 \\ 1 &{} 0 &{} \cdots &{} 1 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 1 &{} 1 &{} \cdots &{} 0 \\ \end{array} \right) \nonumber \\ J&= ||\mathbf {X}-\mathbf {U}\mathbf {V}^T||^2 + \alpha \mathrm {tr}(\mathbf {V}^T\mathbf {L}\mathbf {V}) +\beta [\mathrm {tr}(\mathbf {V}^T\mathbf {M}\mathbf {V}\mathbf {A}) + \mathrm {tr}(\mathbf {V}^T\mathbf {C}\mathbf {V})] \end{aligned}$$
(10)

Although Eq. (10) is compact, it may not be easy to see how it works. Hence, we analyze it via Eq. (8).

In Eq. (9), \(N_p(\mathbf {x}_q)\) denotes the set of the \(p\) nearest neighbors of the data point \(\mathbf {x}_q\). In Eq. (8), \(u_{ic}\ge 0\) and \(v_{jc}\ge 0\), \(i=1,2,\ldots ,m; q, p, j=1,2,\ldots ,n; c=1,2,\ldots ,k\). The first term in Eq. (8) corresponds to the cost function of NMF: the squared Euclidean distance between \(\mathbf {X}\) and \(\mathbf {U}\mathbf {V}^{T}\). The second term is the graph Laplacian regularization used to capture the local structure of the data; it implies that nearby points should share similar representations as far as possible. The third term is the cost for violating the pairwise constraints, and it comprises two components: the cost for violating the must-link constraints and the cost for violating the cannot-link constraints. We now analyze how the two components work when \(k\) is set to the number of clusters:

  1. 1.

    Suppose point \(\mathbf {x}_j\) belongs to the cth cluster; then \(v_{jc}\), its projection onto the corresponding column vector \(\mathbf {u}_{c}\) of the matrix \(\mathbf {U}\), is the largest value in the jth row of the matrix \(\mathbf {V}\). If point \(\mathbf {x}_p\) has a must-link constraint with point \(\mathbf {x}_j\) \((m_{pj}=1)\), then \(\mathbf {x}_p\) also belongs to the cth cluster, and we expect \(v_{pc}\) to be the largest value in the pth row of \(\mathbf {V}\) as well. In this case, the product of \(v_{jc}\) and \(v_{pc}\) is larger than any other product of \(v_{jc}\) and \(v_{ph}\) \((h=1,\ldots ,k;h\ne c)\) over the jth and pth rows of \(\mathbf {V}\). Therefore, \(v_{pc}\) should be maximized in the \(p\)-th row of the matrix \(\mathbf {V}\), which is imposed by minimizing \(\sum _{j=1}^n(\sum _{p: m_{pj}=1}\sum _{c=1}^k\sum _{h=1,h\ne c}^k{v_{jc}}{v_{ph}})\). When this sum is minimized, each \(v_{ph}(h=1,\ldots ,k;h\ne c)\) becomes as small as possible while \(v_{pc}\) becomes as large as possible. Eventually, the point \(\mathbf {x}_p\) is assigned to the cth cluster, as it has the largest projection value \({v_{pc}}\) in the pth row of the matrix \(\mathbf {V}\) onto the corresponding column vector \(\mathbf {u}_{c}\) of the matrix \(\mathbf {U}\).

  2. 2.

    When two points \(\mathbf {x}_j\) and \(\mathbf {x}_p\) have a cannot-link constraint \((c_{pj}=1)\), they must be assigned to different clusters. For example, suppose \(\mathbf {x}_j\) is assigned to the cth cluster and has the largest projection value \({v_{jc}}\) in the jth row of the matrix \(\mathbf {V}\) onto the corresponding column vector \(\mathbf {u}_{c}\) of the matrix \(\mathbf {U}\). Then \(\mathbf {x}_p\) must be assigned to a different cluster, say the hth \((h\ne c)\) cluster, so that it has the largest projection value \({v_{ph}}\) in the pth row of \(\mathbf {V}\) onto the corresponding column vector \(\mathbf {u}_{h}\) of \(\mathbf {U}\). That is, we expect the jth and pth rows of \(\mathbf {V}\) to be as orthogonal as possible, which is imposed by minimizing \(\sum _{j=1}^n(\sum _{p: c_{pj}=1}\sum _{c=1}^k{v_{jc}}{v_{pc}})\).

In PCGNMF, we can also set \(k\) to be smaller or bigger than the number of clusters. When \(k\) is different from the number of clusters, how the two components of pairwise constraints work is similar to the above analysis.

If points \(\mathbf {x}_j\) and \(\mathbf {x}_p\) have a must-link constraint \((m_{pj}=1)\), they should be assigned to the same cluster, which means that \(\mathbf {x}_j\) and \(\mathbf {x}_p\) should have very similar representations; in other words, \(v_{jc}\) should be almost the same as \(v_{pc} (c=1,\ldots ,k)\). If \(v_{jc}\) is the largest projection value in the jth row of \(\mathbf {V}\), we expect that \(v_{pc}\) will also be the largest projection value in the pth row of \(\mathbf {V}\) as far as possible. This is imposed by minimizing \(\sum _{j=1}^n(\sum _{p: m_{pj}=1}\sum _{c=1}^k\sum _{h=1,h\ne c}^k{v_{jc}}{v_{ph}})\). Conversely, if \(\mathbf {x}_{j}\) and \(\mathbf {x}_{p}\) have a cannot-link constraint \((c_{pj}=1)\), they should be assigned to different clusters, that is, \(\mathbf {x}_j\) and \(\mathbf {x}_p\) should possess quite dissimilar representations, so the jth and pth rows of the matrix \(\mathbf {V}\) should be as orthogonal as possible. This is imposed by minimizing \(\sum _{j=1}^n(\sum _{p: c_{pj}=1}\sum _{c=1}^k{v_{jc}}{v_{pc}})\).

The trade-off among these terms is governed by the positive parameters \(\alpha\) and \(\beta \), which specify the relative importance of the reconstruction error, the local geometrical structure and the violation of the pairwise constraints.
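
To make the roles of the three terms concrete, the following sketch evaluates the objective of Eq. (10) for given factors (assuming NumPy; \(\mathbf {W}\), \(\mathbf {M}\) and \(\mathbf {C}\) come from constructions like the sketches above, and \(\mathbf {A}\) is the all-ones matrix with a zero diagonal):

```python
import numpy as np

def pcgnmf_objective(X, U, V, W, M, C, alpha, beta):
    """Evaluate Eq. (10): reconstruction + graph regularizer + pairwise-constraint penalties."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    k = V.shape[1]
    A = np.ones((k, k)) - np.eye(k)                      # off-diagonal ones, zero diagonal

    reconstruction = np.linalg.norm(X - U @ V.T) ** 2
    graph_term     = alpha * np.trace(V.T @ L @ V)
    must_link      = np.trace(V.T @ M @ V @ A)           # penalizes v_jc * v_ph with h != c
    cannot_link    = np.trace(V.T @ C @ V)               # penalizes v_jc * v_pc
    return reconstruction + graph_term + beta * (must_link + cannot_link)
```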

3.2 The Algorithm

The objective function \(J \) of PCGNMF in Eq. (10) is not convex in the two matrix variables \(\mathbf {U}\) and \(\mathbf {V}\) jointly. Therefore, it is unrealistic to find the global minimum of \(J \). In the following, we introduce an iterative updating algorithm which obtains a local optimum of \(J \).

Using the matrix properties \(\mathrm {tr}\)(\(\mathbf {A}\mathbf {B}\))=\(\mathrm {tr}(\mathbf {B}\mathbf {A})\) and \(\mathrm {tr}(\mathbf {A})\)=\(\mathrm {tr}(\mathbf {A}^T)\), the objective function \(J \) can be rewritten as follows:

$$\begin{aligned} J&= \mathrm {tr}((\mathbf {X}-\mathbf {U}\mathbf {V}^{T})^T(\mathbf {X}-\mathbf {U}\mathbf {V}^{T}))+\alpha \mathrm {tr}(\mathbf {V}^T\mathbf {L}\mathbf {V})\nonumber \\&+\beta [\mathrm {tr}(\mathbf {V}^T\mathbf {M}\mathbf {V}\mathbf {A}) + \mathrm {tr}(\mathbf {V}^T\mathbf {C}\mathbf {V})]\nonumber \\&= \mathrm {tr}(\mathbf {X}^T\mathbf {X}) - 2\mathrm {tr}(\mathbf {X}^T\mathbf {U}\mathbf {V}^T) + \mathrm {tr}(\mathbf {V}\mathbf {U}^T\mathbf {U}\mathbf {V}^T) \nonumber \\&+\, \alpha \mathrm {tr}(\mathbf {V}^T\mathbf {L}\mathbf {V}) + \beta [\mathrm {tr}(\mathbf {V}^T\mathbf {M}\mathbf {V}\mathbf {A}) + \mathrm {tr}(\mathbf {V}^T\mathbf {C}\mathbf {V})] \end{aligned}$$
(11)

Let \(\phi _{ij}\) and \(\psi _{ij}\) be the Lagrange multipliers for the constraints \(u_{ij}\ge 0\) and \(v_{ij}\ge 0\), respectively, and let \(\varvec{\Phi }=[\phi _{ij}]\), \(\varvec{\Psi }=[\psi _{ij}]\). The Lagrange function \(\mathcal {L}\) is

$$\begin{aligned} \mathcal {L} = J + \mathrm {tr}(\varvec{\Phi }\mathbf {U}^T) + \mathrm {tr}(\varvec{\Psi }\mathbf {V}^T) \end{aligned}$$
(12)

Setting the derivatives of \(\mathcal {L}\) with respect to \(\mathbf {V}\) and \(\mathbf {U}\) to zero, we have:

$$\begin{aligned}&\displaystyle \frac{\partial \mathcal {L}}{\partial {\mathbf {V}}} = -2\mathbf {X}^T\mathbf {U} + 2\mathbf {V}\mathbf {U}^T\mathbf {U} + 2\alpha (\mathbf {DV} - \mathbf {WV}) +\beta (\mathbf {MVA} + \mathbf {CV}) + \varvec{\Psi }=0 \end{aligned}$$
(13)
$$\begin{aligned}&\displaystyle \frac{\partial \mathcal {L}}{\partial {\mathbf {U}}} = -2\mathbf {X}\mathbf {V} + 2\mathbf {U}\mathbf {V}^T\mathbf {V} + \varvec{\Phi }=0 \end{aligned}$$
(14)

Using the KKT conditions \(\psi _{jc}v_{jc} = 0\) and \(\phi _{ic}u_{ic} = 0\), we obtain the following updating rules for \(v_{jc}\) and \(u_{ic}\):

$$\begin{aligned}&\displaystyle v_{jc} \longleftarrow v_{jc}\frac{2(\mathbf {X}^T\mathbf {U})_{jc} + 2\alpha (\mathbf {W}\mathbf {V})_{jc} }{2(\mathbf {V}\mathbf {U}^T\mathbf {U})_{jc} + 2\alpha (\mathbf {D}\mathbf {V})_{jc} + \beta (\mathbf {M}\mathbf {V}\mathbf {A}+\mathbf {C}\mathbf {V})_{jc}}\end{aligned}$$
(15)
$$\begin{aligned}&\displaystyle u_{ic} \longleftarrow u_{ic}\frac{(\mathbf {X}\mathbf {V})_{ic}}{(\mathbf {U}\mathbf {V}^T\mathbf {V})_{ic}} \end{aligned}$$
(16)
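
Under the same notational assumptions as the earlier sketches (and with a small constant eps added to the denominators to avoid division by zero, which is our choice rather than part of the derivation), the updates (15) and (16) can be iterated roughly as follows:

```python
import numpy as np

def pcgnmf_updates(X, W, M, C, k, alpha, beta, n_iter=300, eps=1e-10, seed=0):
    """Iterate the multiplicative updates (15) and (16) of PCGNMF."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    U = rng.random((m, k))
    V = rng.random((n, k))
    D = np.diag(W.sum(axis=1))
    A = np.ones((k, k)) - np.eye(k)

    for _ in range(n_iter):
        # Update V by Eq. (15).
        numer_V = 2 * (X.T @ U) + 2 * alpha * (W @ V)
        denom_V = 2 * (V @ (U.T @ U)) + 2 * alpha * (D @ V) + beta * (M @ V @ A + C @ V)
        V *= numer_V / (denom_V + eps)

        # Update U by Eq. (16).
        U *= (X @ V) / (U @ (V.T @ V) + eps)
    return U, V
```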

3.3 Computational Complexity Analysis

The objective function of PCGNMF is minimized by iteratively updating the matrices \(\mathbf {U}\) and \(\mathbf {V}\). In this section, we discuss the extra computational cost of our PCGNMF algorithm.

Big \(O\) analysis is usually used to express the complexity of an algorithm [15]. However, it may not be precise enough to differentiate the complexity of PCGNMF. Thus, we count the arithmetic operations of the PCGNMF algorithm [2, 15]. Three arithmetic operations are involved in the updating computation: addition, multiplication and division, all performed on floating-point numbers [15]. Table 1 describes the parameters used in the complexity analysis.

Table 1 Parameters used in complexity analysis

Based on the updating rules, we count the number of operations for each update step in PCGNMF. Note that \(\mathbf {M}\) and \(\mathbf {C}\) are sparse matrices; we use \(m_{n}\) and \(c_{n}\) to denote the numbers of must-link and cannot-link pairwise constraints, respectively. Thus, we only need \((m_{n}k+nk^{2})\) flam (a floating point addition and multiplication) to compute \(\mathbf {MVA}\) and \(c_{n}k\) flam to compute \(\mathbf {CV}\). Moreover, \(\mathbf {W}\) is also a sparse matrix, so we only need \(npk\) flam to compute \(\mathbf {WV}\) [2]. Hence PCGNMF needs \((2mnk + m_{n}k + c_{n}k + 5nk + npk + 2mk^2 + 3nk^2 )\) fladd (a floating point addition), \((2mnk + m_{n}k + c_{n}k + npk + 3nk + 2mk^2 + 3nk^2)\) flmlt (a floating point multiplication) and \((mk + nk)\) fldiv (a floating point division) in each iteration. Besides the multiplicative updating, PCGNMF needs \(O(s^{2})\) operations to construct the constraint matrices \(\mathbf {M}\) and \(\mathbf {C}\), and \(O(n^2m)\) operations to construct the \(p\)-nearest neighbor graph [2].

Supposing the multiplicative updates stop after \(t\) iterations, the overall computational complexity of PCGNMF is \(O(tmnk + s^2 + n^2m)\).

4 Experimental Results

In this section, image clustering tasks are used to evaluate the performance of our proposed PCGNMF algorithm.

4.1 Evaluation Metrics

Two metrics are used to evaluate the clustering performance in each experiment [2, 14, 29]. Each result is evaluated by comparing the cluster label of each sample point with the label provided by the dataset. The first metric is the accuracy (\({AC}\)), which measures the percentage of correct labels obtained by the algorithm. Given a dataset of \(n\) images, let \({l}_i\) and \(\gamma _i\) be the cluster label and the dataset-provided label of the ith sample point, respectively. The \({AC}\) is defined as follows:

$$\begin{aligned} {AC} = \frac{\sum _{i=1}^n\delta (\gamma _i,{map}({l}_i)) }{{n}} \end{aligned}$$
(17)

where \(n\) denotes the total number of images in the dataset, \(\delta ({x},{y})\) is the delta function that equals one if x \(=\) y and zero otherwise, and map(\({l}_i\)) is the mapping function that maps each cluster label \({l}_i\) to the equivalent label of the dataset. The best mapping can be found with the Kuhn-Munkres algorithm [17].
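
A small sketch of computing \(AC\), using SciPy's linear_sum_assignment for the Kuhn-Munkres step (the confusion-matrix bookkeeping and the function name are our own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """AC of Eq. (17): best one-to-one mapping of cluster labels to dataset labels."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    true_ids = np.unique(true_labels)
    clus_ids = np.unique(cluster_labels)

    # Count how many points each (cluster, true class) pair has in common.
    counts = np.zeros((len(clus_ids), len(true_ids)))
    for i, c in enumerate(clus_ids):
        for j, t in enumerate(true_ids):
            counts[i, j] = np.sum((cluster_labels == c) & (true_labels == t))

    row, col = linear_sum_assignment(-counts)      # maximize the matched counts
    return counts[row, col].sum() / len(true_labels)
```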

The second metric is the normalized mutual information (\(NMI\)). In clustering problems, mutual information measures how similar two sets of clusters are. Given two sets of image clusters \({C}\) and \({C}'\), their mutual information metric MI(\({C}\), \({C}'\)) is defined as follows:

$$\begin{aligned} MI (\mathcal {{C}}, {C}') = \sum _{{c}_i\in {C},{c}'_j\in {C}'}{p}({c}_i,{c}'_j)\cdot \mathrm{log}\frac{{p}({c}_i,{c}'_j)}{{p}({c}_i)\cdot {p}({c}'_j)} \end{aligned}$$
(18)

where \({p}({c}_i)\) and \({p}({c}'_j)\) denote the probabilities that an image arbitrarily selected from the data set belongs to cluster \({c}_i\) and \({c}'_j\), respectively, and \({p}({c}_i,{c}'_j)\) denotes the joint probability that this arbitrarily selected image belongs to both \({c}_i\) and \({c}'_j\) at the same time. MI(\({C}\),\({C}'\)) takes values between zero and max(\({H}(\)C\()\),H(\({C}'\))), where \({H}({C})\) and H(\({C}'\)) are the entropies of C and \({C}'\), respectively. It reaches the maximum max(\({H}({C}),{H}({C}'\))) when the two sets of image clusters are identical and becomes zero when the two sets are completely independent. One important property of \({M\!I}({C},{C}'\)) is that its value is invariant to permutations of the cluster labels [14]. We use the following normalized metric \(N\!M\!I({C},{C}'\)), which takes values between zero and one:

$$\begin{aligned} N\!M\!I(\mathcal {{C} }, \mathcal {{C}'}) = \frac{{M\!I}({C},{C}')}{\mathrm{max}({H}({C}),{H}({C}'))} \end{aligned}$$
(19)
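
For completeness, a direct sketch of Eqs. (18) and (19) (natural logarithms are used here, and the function name is our own; library implementations such as scikit-learn may use different normalization choices):

```python
import numpy as np

def nmi_max(true_labels, cluster_labels):
    """NMI of Eq. (19): MI(C, C') / max(H(C), H(C'))."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_labels)
    t_ids, c_ids = np.unique(true_labels), np.unique(cluster_labels)

    # Mutual information, Eq. (18).
    mi = 0.0
    for c in c_ids:
        for t in t_ids:
            p_ct = np.sum((cluster_labels == c) & (true_labels == t)) / n
            p_c = np.sum(cluster_labels == c) / n
            p_t = np.sum(true_labels == t) / n
            if p_ct > 0:
                mi += p_ct * np.log(p_ct / (p_c * p_t))

    # Entropies H(C) and H(C') for the normalization in Eq. (19).
    def entropy(lab):
        probs = np.array([np.sum(lab == v) / n for v in np.unique(lab)])
        return -np.sum(probs * np.log(probs))

    return mi / max(entropy(true_labels), entropy(cluster_labels))
```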

4.2 Performance Evaluations and Comparisons

To evaluate how much the clustering performance can be improved by our method, we compare our algorithm with five other state-of-the-art algorithms:

  1.

    NMF based clustering [29].

  2.

    Graph regularized Non-negative Matrix Factorization (GNMF) which utilizes the local structure of the data by the graph Laplacian [2].

  3.

    CNMF which takes label information as additional constraints [14].

  4.

    PCNMF which incorporates the pairwise constraints information of the data into NMF [31].

  5.

    SEMINMF with graph Laplacian method which incorporates label information and graph Laplacian into NMF [9].

We conduct the performance evaluations on four image datasets, summarized in Table 2; each dataset contains a certain number of categories of images, and the detailed descriptions of each image dataset are given later. Generally, when NMF is used for clustering tasks, \(k\) is set to the number of clusters [2, 14, 29, 31]. In some cases, if \(k\) differs from the number of clusters, the performance of the algorithms may be even better. In order to demonstrate this difference, we first set \(k\) to the number of clusters; Tables 3 and 4 show the performance of each algorithm. Then, we report the best performance and the corresponding \(k\) of each algorithm in Tables 5 and 6. On AT&T, Yale, AR and USPS, the numbers of categories used for clustering are 20, 10, 20 and 6, respectively. The experiments are carried out as follows:

  (1)

    We conduct ten independent experiments on each dataset. In each experiment, we randomly select twenty subjects for clustering on AT&T and AR databases. On Yale database, we randomly select ten subjects for clustering. On USPS, we randomly select six subjects for clustering in each experiment.

  (2)

    In our experiments, three images are randomly selected from each cluster with labels on the AT&T and Yale datasets. On the AR database, we randomly select five images from each category to provide the label information. For the USPS dataset, we randomly pick 10 % of the images from each cluster as the available label information. For PCGNMF and PCNMF, the pairwise constraints are generated among all the labeled data points on each dataset.

  (3)

    In the clustering process, for NMF, GNMF, PCNMF and CNMF, the fast K-means algorithm [19] is further applied to the new data representation \(\mathbf {V}\) for clustering in order to achieve the best performance. For PCGNMF and SEMINMF, we use \(\mathbf {V}\) to determine the cluster label of each data point when \(k\) is set to the number of clusters: we examine each row of \(\mathbf {V}\) and assign data point \(\mathbf {x}_j\) to cluster \(c\) if \(c=\arg \max \limits _{c} v_{jc}\). If \(k\) is set smaller or larger than the number of clusters, we apply the fast K-means algorithm to the new data representation \(\mathbf {V}\) obtained by PCGNMF for clustering.

The above process is repeated ten times, and we report the average \({AC}\) and \(NMI\) over the ten tests. For each algorithm, the parameters are selected appropriately in order to achieve its best results. In GNMF, the regularization parameter \(\lambda \) is searched over the grid \(\{0.01,0.1,1,10,100,500,1000\}\). For PCNMF, \(\lambda \) is searched over \(\{0.01,0.1,1,10,100\}\) and \(\alpha \) over \(\{0.8,0.85,0.9,0.95,0.99\}\). For SEMINMF, the regularization parameter \(\alpha \) is searched over \(\{200,260,320,380,440,500,560,620\}\) and \(\beta \) over \(\{6,10,20,30,40,50,60,70,80,90,100\}\). For PCGNMF, \(\alpha \) is searched over \(\{0.01,0.1,1,10\}\), \(\beta \) over \(\{1,10,20,30,60,100\}\) and the number of nearest neighbors \(p\) over \(\{3,4,5,6,7,8,9,10\}\); in all our experiments, we simply fix \(\alpha \) = 0.1, \(\beta \) = 20 and \(p\) = 3. For GNMF and PCGNMF, the parameter \(\sigma ^{2}\) is set to 1 on each database. For SEMINMF, \(\sigma ^{2}\) is set to 1 on AT&T, Yale and AR, and to 0.1 on USPS.
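
A sketch of such a grid search, reusing the hypothetical helper functions from the earlier sketches (X, M, C, k and true_labels are assumed to be already prepared for one dataset; this is an illustration, not the exact experimental script):

```python
import itertools

best_params, best_ac = None, -1.0
for alpha, beta, p in itertools.product([0.01, 0.1, 1, 10],
                                        [1, 10, 20, 30, 60, 100],
                                        range(3, 11)):
    W, _, _ = graph_laplacian(X, p=p, sigma2=1.0)
    U, V = pcgnmf_updates(X, W, M, C, k, alpha, beta)
    ac = clustering_accuracy(true_labels, V.argmax(axis=1))
    if ac > best_ac:
        best_params, best_ac = (alpha, beta, p), ac
```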

Table 2 Descriptions of the four databases
Table 3 Clustering accuracy comparison on the four databases
Table 4 Clustering normalized mutual information comparison on the four databases
Table 5 The best clustering accuracy and corresponding \(k\) of each algorithm comparison on each database
Table 6 The best clustering normalized mutual information and corresponding \(k\) of each algorithm comparison on each database

4.3 Data Sets

4.3.1 AT&T Dataset

The AT&T dataset contains 400 images of 40 distinct subjects, with 10 different images per subject. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). In all experiments, the original images are normalized in scale and orientation such that the two eyes are aligned at the same position. Then, the facial areas are cropped to obtain the final images for clustering. The size of each cropped image is 32\(\times \)32 pixels, with 256 gray levels per pixel. Thus, each image can be represented by a 1,024-dimensional vector [14].

4.3.2 Yale Dataset

The Yale Face database contains 165 grayscale images in GIF format of 15 individuals. There are 11 images per subject, one for each facial expression or configuration: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. Preprocessing for this dataset is done in the same way as for the AT&T dataset, so each image can also be represented by a 1,024-dimensional vector.

4.3.3 AR Dataset

The AR database consists of over 4,000 frontal images of 126 individuals. We select a subset (with only illumination and expression changes) containing 50 male subjects and 50 female subjects, 1,399 images in total [26, 34].

4.3.4 USPS Dataset

The USPS handwritten digit database contains 10 digit classes. We select a popular subset containing 9,298 16\(\times \)16 handwritten digit images in total.

When \(k\) is set to the number of clusters, Tables 3 and 4 show the detailed clustering accuracy, normalized mutual information and standard deviations on the four datasets. On AT&T, SEMINMF obtains the second best performance; PCGNMF achieves a 0.2 % improvement in accuracy and a 1.7 % improvement in normalized mutual information over SEMINMF on average. On Yale, SEMINMF obtains the second best accuracy and CNMF the second best normalized mutual information; PCGNMF improves accuracy by 4 % over SEMINMF and normalized mutual information by 6.2 % over CNMF. On AR, SEMINMF is the best algorithm and PCGNMF obtains the second best performance. On USPS, we can see that the local structure of the data is particularly important: GNMF even obtains the best normalized mutual information with the graph Laplacian alone, while PCGNMF obtains the best accuracy.

Tables 5 and 6 (the subscripts in the tables denote the dimensionality \(k\) of the factorized matrices) show the best performance and the corresponding \(k\) of each algorithm on all databases. Note that in SEMINMF we can only set \(k\) to the number of clusters, so the results of SEMINMF are the same as in Tables 3 and 4; for comparison with the other algorithms, we list them again. On AT&T, NMF achieves its best performance when the dimensionality \(k\) of the factorized matrices is 12, GNMF and PCGNMF obtain their best performances when \(k\) equals the number of clusters, and PCNMF and CNMF improve slightly when \(k\) is 21 and 22, respectively. On Yale, when \(k\) is larger than the number of clusters, NMF, GNMF, CNMF and PCGNMF obtain better performances. On AR, PCGNMF obtains the best normalized mutual information when \(k\) is 23, but when \(k\) is 20 its normalized mutual information is worse than that of SEMINMF; being limited to \(k\) equal to the number of clusters is the main drawback of SEMINMF. On USPS, the performances of GNMF, PCNMF, CNMF and PCGNMF improve when \(k\) is larger than 6, and PCGNMF still obtains the best accuracy.

Tables 7 and 8 show the reconstruction errors of PCGNMF and SEMINMF on the AR and USPS databases. In SEMINMF, \(k\) can only be set to the number of clusters, which may result in a larger reconstruction error between the original matrix and the factorized matrices. Besides, the label information used in SEMINMF acts as hard constraints: SEMINMF forces the factorized coefficient matrix to fit the cluster indicator matrix of the labeled points, which is so strict that it may also lead to a larger reconstruction error. From Tables 7 and 8, we can see that when \(k\) equals the number of clusters, the reconstruction error of PCGNMF is smaller than that of SEMINMF. As \(k\) grows, the reconstruction error of PCGNMF becomes smaller, so the product of \(\mathbf {U}\) and \(\mathbf {V}\) becomes a better approximation of \(\mathbf {X}\).

Table 7 Reconstruction errors of PCGNMF and SEMINMF on AR database
Table 8 Reconstruction errors of PCGNMF and SEMINMF on USPS database

4.4 Parameters Selection

Our PCGNMF algorithm has three main parameters: the number of labeled points, the regularization parameters \(\alpha \) and \(\beta \). In this section, we illustrate the effect of the parameters on performance.

Figure 1 shows how the performances of the semi-supervised algorithms vary as the number of labeled points increases. On AT&T, as can be seen, the performances of PCNMF, CNMF, SEMINMF and PCGNMF improve significantly as the number of labeled points increases. On Yale, as the number of labeled points increases, PCGNMF makes use of the label information to enhance its performance and obtains the best results, while the performance of PCNMF does not improve significantly. On AR, when the number of labeled points per cluster reaches 7, PCGNMF is as good as SEMINMF. On USPS, the performance of CNMF does not improve significantly as the number of labeled points increases, and the performance of PCNMF even degrades; when the number of labeled points is 120, the normalized mutual information of PCGNMF is competitive with that of GNMF.

Fig. 1 When the number of labeled points varies, the performances of the algorithms on each database

To study how the two regularization parameters \(\alpha \) and \(\beta \) affect the image clustering performance, we carry out parameter sensitivity experiments, varying \(\alpha \) and \(\beta \) respectively. Figure 2a shows how the performance of PCGNMF varies with \(\alpha \) when \(\beta \) is fixed on AT&T. We can see that when \(\alpha \) varies from 0 to 0.1, the performance of PCGNMF improves; when \(\alpha \) varies from 0.1 to 0.2, PCGNMF consistently achieves good and stable performance; and when \(\alpha \) is greater than 0.2, the performance clearly declines. Figure 2b shows how the performance varies with \(\beta \) when \(\alpha \) is fixed: when \(\beta \) varies from 1 to 15, the performance improves significantly, and when \(\beta \) is greater than 25 and 20, the accuracy and the normalized mutual information of PCGNMF drop, respectively. Figure 3a shows how the performance of PCGNMF varies with \(\alpha \) when \(\beta \) is fixed on USPS: when \(\alpha \) varies from 0 to 1, the performance improves significantly, and PCGNMF consistently achieves good and stable performance when \(\alpha \) varies from 1 to 20. With \(\alpha \) fixed, Fig. 3b shows how the performance varies with \(\beta \): when \(\beta \) varies from 0 to 0.5, the performance improves significantly, and PCGNMF consistently achieves good and very stable performance when \(\beta \) varies from 0.5 to 10,000. From Figs. 2 and 3, we can see that the local structure and the pairwise constraints of the data are both important; by combining the graph Laplacian and the pairwise constraints, PCGNMF obtains a more compact and discriminative representation of the data and thus achieves good performance.

Fig. 2 The performance of PCGNMF versus a, c \(\alpha \) with \(\beta \) fixed, and b, d \(\beta \) with \(\alpha \) fixed, on AT&T and Yale, respectively

Fig. 3 The performance of PCGNMF versus a, c \(\alpha \) with \(\beta \) fixed, and b, d \(\beta \) with \(\alpha \) fixed, on AR and USPS, respectively

5 Conclusions

In order to enhance the performance of NMF, label information and pairwise constraints have been incorporated into NMF. However, some existing methods cannot make full use of the pairwise constraints and label information to improve the performance of NMF. The CNMF proposed by Liu et al. [16] did not consider that data points with different class labels should have dissimilar representations. The PCNMF proposed by Yang et al. [31] did not consider the local structure of the data. Our previous work SEMINMF [9] incorporated label information as hard constraints and the graph Laplacian into NMF; however, SEMINMF can only set the dimensionality of the factorized matrices to the number of clusters, which is its main drawback.

In this paper, the proposed PCGNMF algorithm takes into account both the pairwise constraints among the labeled data points and the local structure of the data via the graph Laplacian. In PCGNMF, the dimensionality of the factorized matrices can be set freely, so the model is more flexible.

Our experimental evaluations on image clustering tasks show that the proposed algorithm is effective and achieves state-of-the-art performance. Compared with SEMINMF, PCGNMF attains a smaller reconstruction error, which means that the product of its factorized matrices is a better approximation of the original data matrix.