1 Introduction

Clustering is a fundamentally important task for numerous applications, such as image classification [34], saliency detection [40], image segmentation [44] and motion segmentation [31]. The goal of clustering is to segment unlabeled data points into clusters so that points in the same cluster are more similar to each other than to points in different clusters [38]. Under the Lambertian assumption, the face images of a subject with a fixed pose and varying illumination approximately lie in a linear subspace of dimension 9 [1]. Thus, the face clustering problem can be considered as an image clustering problem over a union of subspaces. In the past decades, a large number of clustering methods have emerged, such as k-means [26], spectral clustering [28], support vector clustering [2], maximum margin clustering [37] and multi-view subspace clustering [42, 48, 49].

In the big data era, high-dimensional data is ubiquitous in many real applications such as image processing [18, 20]. However, high-dimensional data not only incurs a high computational cost in time and memory for related algorithms, but also degrades their performance due to the inevitable noise and the insufficient number of training samples [17, 19, 27]. Under the assumption that high-dimensional data approximately lie in multiple low-dimensional subspaces, subspace clustering has gained a lot of attention due to its capability and efficiency in data clustering [33]. The key idea of subspace clustering is to construct a weighted affinity graph matrix from the initial data. Thus, constructing an informative graph that captures the essential relationships among the data plays a key role in clustering.

Many research studies on graph learning investigate how to better capture the intrinsic structure information of data [41, 43, 46]. Generally speaking, the affinity graph can be constructed based on pairwise distances (e.g. Euclidean distance) or reconstruction coefficients. Recent methods for learning the affinity graph are based on the self-expressiveness property, which states that a sample in a union of subspaces can be expressed as a linear combination of the other samples, i.e., X = XZ. A symmetric affinity graph matrix is then induced from the new representation Z (i.e. \(G=\frac {1}{2}(|Z|+|Z^{T}|)\)), where Gij = Gji indicates whether samples i and j belong to the same subspace. The main issue of existing graph learning methods is that all features are treated equally in the graph construction, even though many features are redundant or even noisy.

Recently, sparse representation and low-rank representation have attracted much attention in data clustering due to their success in adaptively exploiting the intrinsic representation structures of data [15, 30]. We combine their advantages and address the issue in existing adaptive graph learning methods. In this paper, we propose a novel face clustering method via learning a sparsity preserving low-rank graph (LSPLRG). Specifically, the first step is to learn a sparse affinity graph via data self-representation and the \(\ell_1\)-norm. Then, a weighted matrix on the data reconstruction errors is imposed to suppress useless features with large reconstruction errors. Moreover, a distance regularization term is imposed to preserve the local structure information of the data, and a constraint on the representation matrix is added to the objective function to reduce the correlation among its rows. These factors improve the effectiveness of the proposed model and enable it to learn a graph that reveals the intrinsic similarity relationships among samples.

The main contributions of this paper are listed as follows:

  1. The proposed method learns an initial affinity graph on the sparse coefficients without any a priori graph or similarity matrix. The sparsity helps the obtained graph better capture the intrinsic structure of the data when the data are corrupted by noise.

  2. An adaptive weighted matrix is imposed on the data reconstruction errors to enhance the role of important features, while a constraint on the representation matrix is added to reduce the redundant features.

  3. The proposed model exploits the global and local structure of data through LRR and the distance regularization term, which helps learn a more effective graph for face clustering.

The rest of this paper is organized as follows. Section 2 reviews the related works, including low-rank representation and sparse subspace clustering. In Section 3, we introduce the details of our proposed clustering method LSPLRG and its optimization scheme. Experimental results are presented in Section 4, and Section 5 concludes this paper.

2 Related works

In this section, we briefly review the related work, namely low-rank representation (LRR) and sparse subspace clustering (SSC), before introducing our model. We first define some notations. For a matrix X, xj is its j th column and xij is its (i,j)th entry. The Frobenius norm and nuclear norm are denoted by \(\|X\|_{F}\) and \(\|X\|_{*}\) (the sum of the singular values of X), respectively. ⊙ denotes the element-wise multiplication. \(X^{T}\) is the transpose of X and tr(X) is the trace of X.

2.1 Low-rank representation

Recently, theoretical advances on LRR enable us to explore the low-dimensional subspace structures embedded in data. Given a set of data, LRR aims at finding the lowest-rank representation of all data jointly, preserving the membership of samples that belong to the same subspace [23]. Thus, each datum can usually be represented by other data that lie in the same subspace when the subspaces are independent and the data are noiseless. Generally, the LRR problem can be formulated as follows:

$$ \underset{Z}{\min} \|{Z}\|_{*}\quad \text{ subject to } \quad {X=AZ} $$
(1)

where the columns of A are a set of known bases or dictionary items and Z is called the low-rank representation of the data X. \(\|Z\|_{*}\) is the nuclear norm, which is the convex envelope of the rank function [7].

In real-world applications, the observed data often contain noise, and the data matrix X itself is used as the dictionary. With a balance parameter λ, a more general version of model (1) can be presented as follows:

$$ \underset{Z}{\min} \|{Z}\|_{*}+\lambda\|{E}\|_{l}\quad \text{s.t.} \quad {X=XZ+E} $$
(2)

Here, there are many strategies to define the error term E. For example, the \(\ell_0\)-norm characterizes random corruption, the \(\ell_{2,1}\)-norm characterizes sample-specific corruption, and the Frobenius norm models small Gaussian noise. A number of methods have been proposed for solving the above low-rank matrix problems, and the most commonly used are the Augmented Lagrange Multiplier (ALM) method and its variants. The advantage of LRR mainly comes from the fact that the low-rank component can reduce the influence of outliers, which means that LRR is able to correct corruptions in the data automatically.

2.2 Sparse subspace clustering

In recent years, Sparse Representation (SR) [10] has attracted much attention due to its effectiveness for representing and compressing high-dimensional signals. According to Compressed Sensing (CS) theory [4], the minimum \(\ell_1\)-norm solution to an underdetermined system of linear equations is also the sparsest possible solution under fairly general conditions. Inspired by SR, the standard sparse subspace clustering (SSC) algorithm [13] was proposed to cluster data points that lie in a union of low-dimensional subspaces. By exploiting the self-expressiveness property of the data X, the formulation of SSC is

$$ \underset{Z}{\min} \|{Z}\|_{1}+\lambda\|{E}\|_{F}\quad \text{s.t.} \quad {X=XZ+E}, \quad diag(Z)=0 $$
(3)

where Z is the coefficient matrix. Each column of Z is the sparse representation vector corresponding to each data point. E is the representation error and λ is a tradeoff parameter.

The main difference between LRR and SSC is the choice of the regularization term on Z. As can be seen from problem (3), \(\|Z\|_{1}\) is used as a convex surrogate of \(\|Z\|_{0}\) to promote the sparsity of Z in SSC, whereas \(\|Z\|_{*}\) is used to seek a jointly low-rank representation of all data in LRR. The element Zij of Z reflects the similarity between the data pair xi and xj. Hence Z is often used to define the affinity matrix \((|Z| + |Z^{T}|)/2\) for the final segmentation of the data. The clustering results are obtained by applying a spectral clustering algorithm [25], such as normalized cuts (NCuts) [32].
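To make this final step concrete, the following sketch (not the authors' code) builds the symmetric affinity matrix from a learned coefficient matrix Z and applies scikit-learn's normalized spectral clustering as a stand-in for NCuts; the function name and parameter choices are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_coefficients(Z, n_clusters, random_state=0):
    """Z: (n, n) self-representation coefficients; returns cluster labels."""
    affinity = 0.5 * (np.abs(Z) + np.abs(Z.T))   # symmetric, nonnegative graph
    sc = SpectralClustering(n_clusters=n_clusters,
                            affinity="precomputed",
                            assign_labels="discretize",
                            random_state=random_state)
    return sc.fit_predict(affinity)
```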

3 Learning a sparsity preserving low-rank graph

In the clustering task, we have a set of unlabeled data \(X=[x_{1},x_{2},...,x_{n}] \in \mathbb {R}^{d\times n}\) and aim to group the data into c clusters. An effective affinity graph that captures the intrinsic structures of the data is essential for obtaining good clustering performance. In this section, we present the details of our algorithm, including graph initialization and graph learning. Then, an optimization scheme based on iterative updating rules is used to solve the objective function.

3.1 Graph initialization

Recently, sparse coding has become a widely adopted tool, which assumes that any signal can be composed of a few basis signals. Different from other graph learning methods that learn an initial graph from the distances between data points, we learn an initial graph from the sparse representations of the data points. To construct a sparse graph, the objective is to compute a representation matrix Z = [z1,z2,...,zn] as the solution to the \(\ell_1\) problem

$$ \underset{Z}{\arg\min} \|X-XZ\|_{F}^{2}+\lambda\|{Z}\|_{1} $$
(4)

where \(\|\cdot\|_{F}\) is the Frobenius norm and \(\|Z\|_{1}\) is the \(\ell_1\)-norm of the matrix Z. The coefficient matrix Z can be regarded as an asymmetric graph matrix for a dictionary learning problem in which the dictionary is given by the data themselves. For each datum xi, the vector zi denotes the sparse coding vector required for reconstructing the sample xi from the set of data points. As can be seen, the optimization of problem (4) seeks a sparse graph matrix with a small reconstruction error. The sparsity helps the obtained graph better capture the intrinsic structure of the data points when they are corrupted by noise.
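A minimal sketch of this initialization, using scikit-learn's Lasso as a stand-in solver for (4) (its objective differs from (4) only in how the reconstruction term is scaled), is given below. Excluding each sample from its own dictionary is an extra convention adopted here to avoid the trivial identity solution; (4) itself does not impose it.

```python
import numpy as np
from sklearn.linear_model import Lasso

def init_sparse_graph(X, lam=0.01):
    """X: (d, n) data matrix with samples as columns; returns an (n, n) sparse Z."""
    d, n = X.shape
    Z = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]     # exclude x_i from its own dictionary
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(X[:, idx], X[:, i])             # min ||x_i - X_{-i} z||^2 + lam * ||z||_1
        Z[idx, i] = lasso.coef_
    return Z
```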

3.2 Graph learning

In the clustering task, the learned similarity graph should match the true affinities between data points and capture the multi-subspace structure information. To this end, many researchers have proposed to impose graph regularization terms in order to learn an affinity graph of high quality [6, 11, 35], and these methods have proved effective under various conditions.

However, a common problem is that many methods treat the reconstruction errors equally in the linear representation, which makes it harder to capture the intrinsic structure of the data. A robust graph learning method should adaptively assign different weights to the reconstruction errors during graph learning. Specifically, features with larger reconstruction errors should be assigned smaller weights, while important features should be assigned larger weights. Thus, a weighted nonnegative low-rank representation framework can be defined as

$$ \begin{array}{@{}rcl@{}} &&\underset{W,Z}{\min} \|{W^{\frac{1}{2}}\odot E}\|_{F}^{2}+ \frac{\lambda_{1}}{2} \|{W}\|_{F}^{2}+\lambda_{2} \|{Z}\|_{*}\\ &&\text{s.t.} \quad {X=XZ+E,Z,W \geqslant 0, W^{T} {\textbf{1}}={\textbf{1}}} \end{array} $$
(5)

where W is the weighted matrix with positive elements, and \(W^{\frac {1}{2}}\) is defined as the element-wise square root of W. The constraint \(W^{T}{\textbf{1}}={\textbf{1}}\) (where 1 denotes an all-ones vector of compatible dimension) ensures that the weights treat all samples equally. The second term in this criterion is a regularization term given by the Frobenius norm, which provides a solution in which the majority of elements are not null. Minimizing the sub-problem with respect to W, i.e., \(\min \limits _{W\geqslant 0, W^{T} \textbf {1}=\textbf {1}} \|{W^{\frac {1}{2}}\odot E}\|_{F}^{2}+ \frac {\lambda _{1}}{2} \|{W}\|_{F}^{2}\), leads to a sparse weighted matrix [36]. The boundary constraints \(W \geqslant 0, W^{T} \textbf {1}=\textbf {1}\) avoid trivial solutions [47], and \(Z\geqslant 0\) ensures that the learned graph is interpretable, with each element revealing the degree of similarity between the corresponding two samples.

Beyond the low-rank property, the local similarity structure learned from the data points has proved very helpful for subspace clustering [22]. Intuitively, close (similar) data points should have close (similar) representation coefficients. In order to exploit the local relationships between data points, a distance regularization term on the affinity matrix Z is imposed in the subspace clustering model. The model can then be written as

$$ \begin{array}{@{}rcl@{}} &&\underset{W,Z}{\min} \|{W^{\frac{1}{2}}\odot E}\|_{F}^{2}+ \frac{\lambda_{1}}{2} \|{W}\|_{F}^{2}+\lambda_{2} \|{Z}\|_{*}+\lambda_{3} Tr(D^{T} Z)\\ &&\text{s.t.} \quad {X=XZ+E,Z,W \geqslant 0, W^{T} {\textbf{1}}={\textbf{1}}} \end{array} $$
(6)

where the parameter λ3 controls the weight of the corresponding regularization term and Tr(⋅) is the trace operator. Defining the element in the i th row and j th column of the matrix D as \(d_{ij}=\|x_{i}-x_{j}\|_{2}^{2}\), we have \(Tr(D^{T} Z)={\sum }_{i,j=1}^{n} d_{ij}z_{ij}\). Thus, this term enforces that samples with small distances have similar representations. By imposing the distance regularization term, the local relationships between the points are preserved, so that the clustering performance can be improved.
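As a small sanity check on this term, the sketch below (with illustrative toy data) builds D from pairwise squared Euclidean distances and verifies the identity Tr(D^T Z) = Σ_{ij} d_ij z_ij.

```python
import numpy as np

def distance_matrix(X):
    """X: (d, n) with samples as columns; returns D with d_ij = ||x_i - x_j||_2^2."""
    sq = np.sum(X ** 2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.maximum(D, 0.0)          # clip tiny negatives caused by round-off

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))        # toy data: d = 5 features, n = 8 samples
Z = np.abs(rng.standard_normal((8, 8)))
D = distance_matrix(X)
assert np.isclose(np.trace(D.T @ Z), np.sum(D * Z))
```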

Once the representation matrix Z is obtained, graph learning methods usually derive the similarity affinity matrix from Z. In order to reduce the correlation among the rows of Z, a constraint on Z is added to the objective function. To prevent a sample from being selected to represent itself and to avoid trivial solutions, we further constrain the affinity matrix Z so that its diagonal elements are zero and each of its rows sums to one. Our final graph learning model, LSPLRG, can be formulated as follows

$$ \begin{array}{@{}rcl@{}} &&\underset{W,Z}{\min} \|{W^{\frac{1}{2}}\odot E}\|_{F}^{2}+ \frac{\lambda_{1}}{2} \|{W}\|_{F}^{2}+\lambda_{2} \|{Z}\|_{*}+\lambda_{3} Tr(D^{T} Z)+\lambda_{4} Tr(Z^{T} L_{e} Z)\\ &&\text{s.t.} \quad {X=XZ+E,Z,W \geqslant 0, W^{T} {\textbf{1}}={\textbf{1}}, diag(Z)=0, Z{\bar{\textbf{1}}}={\bar{\textbf{1}}}} \end{array} $$
(7)

where \({\bar {\textbf {1}}}\in \mathbb {R}^{n\times 1}\) is the all-ones vector and \(L_{e}={\bar {\textbf {1}}}{\bar {\textbf {1}}}^{T}-I\) (I is the identity matrix). In the optimization, we propose an alternating method to update each variable. Thus, both the matrix factorization and the graph learning are achieved in the learned subspace, and the merit of subspace clustering is inherited.

3.3 Update rules

To solve (7), we use the alternating direction method of multipliers (ADMM) [3] to obtain a locally optimal solution for the variables W and Z. We first introduce two auxiliary variables E = X − XZ and U = Z to make the optimization problem (7) separable, and problem (7) is rewritten as

$$ \begin{array}{@{}rcl@{}} &&\underset{W,Z}{\min} \|{W^{\frac{1}{2}}\odot E}\|_{F}^{2}+ \frac{\lambda_{1}}{2} \|{W}\|_{F}^{2}+\lambda_{2} \|{U}\|_{*}+\lambda_{3} Tr(D^{T} Z)+\lambda_{4} Tr(Z^{T} L_{e} Z)\\ &&\text{s.t.} \quad {X=XZ+E,Z=U,Z,W \geqslant 0, W^{T} {\textbf{1}}={\textbf{1}}, diag(Z)=0} \end{array} $$
(8)

The corresponding augmented Lagrangian function of (8) is as follows

$$ \begin{array}{@{}rcl@{}} L(Z,W,E,U,Y_{1},Y_{2})&=&\|{W^{\frac{1}{2}}\odot E}\|_{F}^{2}+ \frac{\lambda_{1}}{2} \|{W}\|_{F}^{2}+\lambda_{2} \|{U}\|_{*}\\ &&+\lambda_{3} Tr(D^{T} Z)+\lambda_{4} Tr(Z^{T} L_{e} Z)\\ &&+\frac{\mu}{2}\left( \|{X-XZ-E+\frac{Y_{1}}{\mu}}\|_{F}^{2}+\|{Z-U+\frac{Y_{2}}{\mu}}\|_{F}^{2}\right) \end{array} $$
(9)

where Y1 and Y2 are the Lagrangian multipliers, and μ is a scalar penalty parameter. The augmented Lagrangian is separable and can be minimized along one coordinate direction at each iteration, i.e. by minimizing it with respect to one variable while the others are fixed. We introduce the detailed procedure and the solution of each subproblem in the following.

Update W: Fix the other variables and update W by solving the following problem

$$ \underset{W,W^{T} {\textbf{1}}={\textbf{1}}}{\min} \|{W^{\frac{1}{2}}\odot E}\|_{F}^{2}+ \frac{\lambda_{1}}{2} \|{W}\|_{F}^{2} $$
(10)

which can be solved element-wise. Problem (10) is equivalent to the following minimization problem

$$ \underset{W,W^{T} {\textbf{1}}={\textbf{1}}}{\min} \sum\limits_{i=1}^{d}\sum\limits_{j=1}^{n}\left( w_{ij}e_{ij}^{2}+\frac{\lambda_{1}}{2} w_{ij}^{2}\right) $$
(11)

which, up to a constant, is equivalent to the following problem

$$ \underset{W,W^{T} {\textbf{1}}={\textbf{1}}}{\min} \sum\limits_{i=1}^{d}\sum\limits_{j=1}^{n}\left( w_{ij}+\frac {e_{ij}^{2}}{\lambda_{1}}\right)^{2} $$
(12)

Problem (12) is independent for different j, so we can optimize the following problem separately for each column wj of W

$$ \underset{w_{j}, {w_{j}^{T}}{\textbf{1}}={\textbf{1}}}{\min} \left\|w_{j}+\frac {1}{\lambda_{1}} h_{j}\right\|_{2}^{2} $$
(13)

where hj is the j th column of the matrix H = E ⊙ E.

According to [29], problem (13) can be transformed into the following Lagrangian function

$$ L(w_{j},\eta_{j},\beta_{j})=\frac{1}{2}\left\|w_{j}+\frac {1}{\lambda_{1}} h_{j}\right\|_{2}^{2}-\eta_{j}\left( {w_{j}^{T}}\textbf{1}-1\right)-{\beta_{j}^{T}} w_{j} $$
(14)

where ηj and βj ≥ 0 are the Lagrangian multipliers.

The optimal solution wj should satisfy that the derivative of (14) with respect to wj is equal to zero

$$ \frac{\partial L}{\partial w_{j}}=w_{j}+\frac{h_{j}}{\lambda_{1}}-\eta_{j}\textbf{1}-\beta_{j}=0 $$
(15)

For the i th element of wj, we have

$$ \begin{array}{@{}rcl@{}} w_{ij}+\frac{h_{ij}}{\lambda_{1}}-\eta_{j}-\beta_{ij}=0 \end{array} $$
(16)

By combining (16) with the Karush-Kuhn-Tucker (KKT) condition wijβij = 0 [29], we have

$$ w_{j} = \left( \eta_{j} \textbf{1} - \frac{h_{j}}{\lambda_{1}}\right)_{+} $$
(17)

where (v)+ = max(0,v). Combining (17) with the constraint \({w_{j}^{T}} \textbf {1}=1\) (and assuming no entry is clipped to zero), we have

$$ \begin{array}{@{}rcl@{}} &&\sum\limits_{i=1}^{d} \left( \eta_{j} - \frac{h_{ij}}{\lambda_{1}}\right)=1\\ \Rightarrow &&\eta_{j} = \frac{1}{d}+\frac{1}{d \lambda_{1}} \sum\limits_{i=1}^{d} h_{ij} \end{array} $$
(18)

Therefore, we can obtain the optimal solution wj according to (17).
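A sketch of this W-update is shown below; the columns of W are computed in closed form from the squared reconstruction errors H = E ⊙ E, with η_j obtained from (18) under the assumption that no entry is clipped by (·)_+.

```python
import numpy as np

def update_W(E, lam1):
    """E: (d, n) reconstruction error; returns a nonnegative W whose columns sum to one."""
    d, n = E.shape
    H = E * E                                         # h_ij = e_ij^2
    eta = 1.0 / d + H.sum(axis=0) / (d * lam1)        # eq. (18), one eta per column
    return np.maximum(eta[None, :] - H / lam1, 0.0)   # eq. (17), element-wise
```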

Update E: for updating E, we have the following minimization problem

$$ \underset{E}{\min} \|{W^{\frac{1}{2}}\odot E}\|_{F}^{2}+ \frac{\mu}{2}\|{X-XZ-E+\frac{Y_{1}}{\mu}}\|_{F}^{2} $$
(19)

which can be further simplified to

$$ \underset{E}{\min} \|{W^{\frac{1}{2}}\odot E}\|_{F}^{2}+ \frac{\mu}{2}\|E-S\|_{F}^{2} $$
(20)

where \(S={X-XZ+\frac {Y_{1}}{\mu }}\). By expanding the Frobenius norm and removing the irrelevant terms, we have

$$ \begin{array}{@{}rcl@{}} &&\underset{E}{\min} \sum\limits_{i=1}^{d} \sum\limits_{j=1}^{n} \left( w_{ij}e_{ij}^{2}+\frac{\mu}{2}(e_{ij}-s_{ij})^{2}\right)\\ &&\Leftrightarrow \sum\limits_{i=1}^{d} \sum\limits_{j=1}^{n} \underset{e_{ij}}{\min}\left( e_{ij}-\frac{\mu s_{ij}}{\mu+2w_{ij}}\right)^{2} \end{array} $$
(21)

The optimal solution to each element eij of variable E is

$$ e_{ij} = \frac{\mu s_{ij}}{\mu+2w_{ij}} $$
(22)
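The corresponding E-update, as a minimal sketch with S = X − XZ + Y1/μ and the closed form (22) applied element-wise:

```python
import numpy as np

def update_E(X, Z, Y1, W, mu):
    """Closed-form E-update; all arrays follow the shapes used in this section."""
    S = X - X @ Z + Y1 / mu
    return (mu * S) / (mu + 2.0 * W)   # eq. (22), element-wise
```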

Update U: for updating U, problem (9) is transformed as follows

$$ \min\limits_{U} \lambda_{2} \|{U}\|_{*}+\frac{\mu}{2}\|{Z-U+\frac{Y_{2}}{\mu}}\|_{F}^{2} $$
(23)

This problem has a closed-form solution by using the singular value thresholding (SVT) operator [21], i.e.

$$ U = {\Theta}_{\frac{\lambda_{2}}{\mu}} \left( Z+\frac{Y_{2}}{\mu}\right)=P \mathcal{S}_{\frac{\lambda_{2}}{\mu}}({\Sigma})Q^{T} $$
(24)

where \(P{\Sigma}Q^{T}\) is the singular value decomposition of \(\left (Z+\frac {Y_{2}}{\mu }\right )\), and \(\mathcal {S}_{\frac {\lambda _{2}}{\mu }}(\cdot )\) is the soft-thresholding operator [21].
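A sketch of this singular value thresholding step, with the shrinkage amount λ2/μ applied to the singular values of Z + Y2/μ:

```python
import numpy as np

def update_U(Z, Y2, lam2, mu):
    """SVT-based U-update, eq. (23)-(24)."""
    P, sigma, Qt = np.linalg.svd(Z + Y2 / mu, full_matrices=False)
    sigma = np.maximum(sigma - lam2 / mu, 0.0)   # soft-threshold the singular values
    return (P * sigma) @ Qt
```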

Update Z: when the other variables are fixed, the optimization problem (9) with respect to Z reduces to the following problem

$$ \begin{array}{@{}rcl@{}} &&\min\limits_{Z} \lambda_{3} Tr(D^{T} Z)+\lambda_{4} Tr(Z^{T} L_{e} Z)\\ &&+\frac{\mu}{2}\left( \|{X-XZ-E+\frac{Y_{1}}{\mu}}\|_{F}^{2}+\|{Z-U+\frac{Y_{2}}{\mu}}\|_{F}^{2}\right)\\ &&\text{s.t.} \quad {Z\geqslant 0, diag(Z)=0, Z{\bar{\textbf{1}}}={\bar{\textbf{1}}}} \end{array} $$
(25)

where, for the i th row, \(\bar {\textbf{1}}\) is understood as the column vector whose elements are all one except for the i th element, which is zero (so that the constraint diag(Z) = 0 is respected). We first calculate a latent solution \(\hat {Z}\) by solving the following problem

$$ \begin{array}{@{}rcl@{}} &&\min\limits_{Z} \lambda_{3} Tr(D^{T} Z)+\lambda_{4} Tr(Z^{T} L_{e} Z)\\ &&+\frac{\mu}{2}\left( \|{X-XZ-E+\frac{Y_{1}}{\mu}}\|_{F}^{2}+\|{Z-U+\frac{Y_{2}}{\mu}}\|_{F}^{2}\right) \end{array} $$
(26)

Problem (26) has the closed-form solution

$$ \hat{Z}=\left(\frac{2\lambda_{4}}{\mu} L_{e}+X^{T}X+I\right)^{-1}\left( X^{T}\left( X-E+\frac{Y_{1}}{\mu}\right)+U-\frac{Y_{2}}{\mu}-\frac{\lambda_{3}}{\mu}D\right) $$
(27)

The optimal solution Z can be calculated by minimizing the following problem

$$ \underset{Z\geqslant 0, diag(Z)=0, Z{\bar{\textbf{1}}}={\bar{\textbf{1}}}}{\min} \|Z-\hat{Z}\|_{F}^{2} $$
(28)

Similar to the optimization strategy of problem (13), we obtain each row of Z by

$$ z_{i} = (\xi_{i} \bar{1}^{T} +\bar{z}_{i})_{+} $$
(29)

where \(\bar {z}_{i}=[\bar {z}_{i1},...,\bar {z}_{ii},...,\bar {z}_{in}]\) is the i th row of \(\hat {Z}\) (obtained by (27)), with the element \(\bar {z}_{ii}\) set to zero. Similar to problem (13), the Lagrangian multiplier ξi is

$$ \xi_{i} = \frac{1-\bar{z}_{i}{\bar{\textbf{1}}}}{n-1} $$
(30)

From (29), we can obtain the optimal solution zi, i.e., the i th row of Z.
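A sketch of the complete Z-update is given below: the latent solution Ẑ follows the closed form (27), and each row is then projected to satisfy Z ≥ 0, diag(Z) = 0 and unit row sums, computing ξ_i as in (30) under the assumption that no entry is clipped.

```python
import numpy as np

def update_Z(X, E, U, D, Y1, Y2, lam3, lam4, mu):
    """Z-update of (25): closed-form latent solution followed by row-wise projection."""
    n = X.shape[1]
    Le = np.ones((n, n)) - np.eye(n)
    A = (2.0 * lam4 / mu) * Le + X.T @ X + np.eye(n)
    B = X.T @ (X - E + Y1 / mu) + U - Y2 / mu - (lam3 / mu) * D
    Z_hat = np.linalg.solve(A, B)                      # eq. (27)
    Z = np.zeros((n, n))
    for i in range(n):
        z = Z_hat[i].copy()
        z[i] = 0.0                                     # enforce diag(Z) = 0
        mask = np.arange(n) != i
        xi = (1.0 - z[mask].sum()) / (n - 1)           # eq. (30)
        Z[i, mask] = np.maximum(z[mask] + xi, 0.0)     # eq. (29)
    return Z
```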

Algorithm 1 The ADMM optimization procedure of LSPLRG

After optimizing the variables W, E, U and Z, the ADMM algorithm also updates the Lagrange multipliers Y1, Y2 as well as the parameter μ for faster convergence. The details of the optimization algorithm are exhibited in Algorithm 1.

Once the affinity graph Z is obtained, normalized cut (NCut) spectral clustering is applied to \((|Z| + |Z^{T}|)/2\) to group the data into clusters.
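Putting the updates together, a minimal end-to-end sketch of Algorithm 1 is given below; it relies on the helper functions sketched earlier (init_sparse_graph, distance_matrix, update_W, update_E, update_U, update_Z), and the stopping rule, ρ and μ_max values are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def lsplrg(X, lam1=100.0, lam2=1e-4, lam3=1e-2, lam4=1e-3,
           mu=0.1, rho=1.1, mu_max=1e8, max_iter=100, tol=1e-6):
    """Learns the affinity matrix Z of model (7) by the ADMM scheme of Section 3.3."""
    d, n = X.shape
    D = distance_matrix(X)            # pairwise squared distances for the locality term
    Z = init_sparse_graph(X)          # sparse initialization, eq. (4)
    U = Z.copy()
    E = X - X @ Z
    Y1, Y2 = np.zeros((d, n)), np.zeros((n, n))
    for _ in range(max_iter):
        W = update_W(E, lam1)
        E = update_E(X, Z, Y1, W, mu)
        U = update_U(Z, Y2, lam2, mu)
        Z = update_Z(X, E, U, D, Y1, Y2, lam3, lam4, mu)
        # dual updates and penalty growth, in the style of classical ADMM
        R1, R2 = X - X @ Z - E, Z - U
        Y1 += mu * R1
        Y2 += mu * R2
        mu = min(rho * mu, mu_max)
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
    return Z
```

The learned Z would then be passed to a spectral clustering routine such as the cluster_from_coefficients sketch in Section 2.2 to obtain the final groups.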

4 Experimental results

In this section, we consider the face clustering problem, where the goal is to group face images into clusters according to their subjects. The clustering performance is evaluated on four publicly available image data sets, namely ORL, Extended YaleB, AR, and LFW. The important statistics of these databases are summarized in Table 1. To evaluate the performance of the proposed method, we conduct experiments on face clustering and compare the results with several representative methods, including K-means, LRR [23], SSC [13], RLRR [9], BDR [24] and AWNLRR [36]. K-means serves as a baseline and obtains the final clustering results without learning an affinity matrix. The other six methods apply the same NCut spectral clustering to the affinity matrix to obtain the clustering results. For a fair comparison, in all experiments we use the Matlab codes released by the corresponding authors with the default or optimal parameter settings. The parameter values of each method remain the same for all evaluation runs and are set as suggested in the corresponding literature.

Table 1 Description of databases

4.1 Evaluation metrics

The clustering performance is evaluated by comparing the obtained label of each sample with that provided by the databases. Two metrics are used to quantitatively evaluate the clustering performance [5]. One metric is accuracy (AC) and the other is the normalized mutual information metric (NMI).

Given a data sample xi, let ri and si be the cluster label obtained by applying different algorithms and the label provided by the data set, respectively. The AC measures the percentage of correctly classified data points in the clustering solution compared with the ground truth class labels and is defined by

$$ AC = \frac{{\sum}_{i=1}^{N} \delta(r_{i},map(s_{i}))}{N} $$
(31)

where N is the total number of samples, and δ(x,y) is 1 if x = y and 0 otherwise. map(si) is a mapping function which can map the labels obtained by the clustering methods to the labels provided by the databases. The best mapping is usually fulfilled by the Kuhn-Munkres algorithm [8].
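A sketch of this metric, with the label mapping computed by the Kuhn-Munkres (Hungarian) algorithm via SciPy's linear_sum_assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """AC of eq. (31): best one-to-one match between clusters and classes."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    classes, clusters = np.unique(true_labels), np.unique(pred_labels)
    # count[i, j] = number of points with true class i assigned to cluster j
    count = np.array([[np.sum((true_labels == a) & (pred_labels == b))
                       for b in clusters] for a in classes])
    row, col = linear_sum_assignment(-count)      # maximize the matched points
    return count[row, col].sum() / true_labels.size
```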

The second metric used in clustering applications is the normalized mutual information (NMI). It measures the similarity of two clusterings based on the amount of statistical information shared by the corresponding random variables. Given two sets of image clusters \(\mathcal {C}=\{c_{1},...,c_{k}\}\) and \(\mathcal {C}^{\prime }=\{c^{\prime }_{1},...,c^{\prime }_{k}\}\), their mutual information metric MI(\(\mathcal {C},\mathcal {C}^{\prime }\)) is defined as:

$$ MI(\mathcal{C},\mathcal{C}^{\prime}) = \underset{c_{i}\in \mathcal{C}, c^{\prime}_{j}\in\mathcal{C}^{\prime}}{\sum} p(c_{i},c^{\prime}_{j})\cdot \log\frac{p(c_{i},c^{\prime}_{j})}{p(c_{i})\cdot p(c^{\prime}_{j})} $$
(32)

where \(p(c_{i})\) and \(p(c^{\prime }_{j})\) denote the probabilities that an image arbitrarily selected from the data set belongs to clusters \(c_{i}\) and \(c^{\prime }_{j}\), respectively, and \(p(c_{i},c^{\prime }_{j})\) denotes the joint probability that the arbitrarily selected image belongs to both clusters simultaneously. MI(\(\mathcal {C},\mathcal {C}^{\prime }\)) takes values between zero and max(\(H(\mathcal {C}),H(\mathcal {C}^{\prime })\)), where \(H(\mathcal {C})\) and \(H(\mathcal {C}^{\prime })\) denote the entropies of the clusterings \(\mathcal {C}\) and \(\mathcal {C}^{\prime }\), respectively. MI(\(\mathcal {C},\mathcal {C}^{\prime }\)) reaches its maximum when the two clusterings are identical, and equals zero when they are completely independent. An advantage of MI(\(\mathcal {C},\mathcal {C}^{\prime }\)) is that its value is unchanged under any permutation of the cluster labels. By dividing the mutual information by max(\(H(\mathcal {C}),H(\mathcal {C}^{\prime })\)), NMI is derived as follows:

$$ NMI(\mathcal{C},\mathcal{C}^{\prime}) = \frac{MI(\mathcal{C},\mathcal{C}^{\prime})}{max({H(\mathcal{C}),H(\mathcal{C}^{\prime})})} $$
(33)

Different from AC, NMI is invariant to the permutation of labels; that is, it does not require a mapping between the two clusterings in advance.
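A sketch of NMI computed directly from the contingency table, following (32)-(33); scikit-learn's normalized_mutual_info_score could be used instead, but note that its default normalizes by the average entropy rather than the maximum used here.

```python
import numpy as np

def nmi(true_labels, pred_labels):
    """NMI of eq. (32)-(33), normalized by the maximum of the two entropies."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    n, eps = true_labels.size, 1e-12
    classes, clusters = np.unique(true_labels), np.unique(pred_labels)
    # joint probabilities p(c_i, c'_j) and marginals p(c_i), p(c'_j)
    P = np.array([[np.sum((true_labels == a) & (pred_labels == b)) / n
                   for b in clusters] for a in classes])
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    mi = np.sum(P * np.log((P + eps) / (np.outer(pa, pb) + eps)))
    ha = -np.sum(pa * np.log(pa + eps))
    hb = -np.sum(pb * np.log(pb + eps))
    return mi / max(ha, hb)
```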

4.2 Clustering results

We are given a collection of face images from multiple subjects, captured under various illumination conditions and with different expressions. For each given cluster number k, the experiments are repeated for 20 runs on randomly chosen clusters, and the average clustering performance is reported as the final result. The accuracy (AC) and normalized mutual information (NMI) are calculated from the predicted and given labels, and the best results for each database are highlighted in bold in the tables.

4.2.1 ORL database

The ORL database contains ten different images of each of 40 distinct subjects. For each subject, the images are captured under different facial expressions and lighting conditions. The face images used in this work are cropped and resized to 32 × 32 pixels for computational efficiency, and the faces have been located in advance. Sample images from the ORL database are depicted in Fig. 1. In order to evaluate the algorithms over different sample sizes, different cluster numbers are selected in our experiments: we select {10,15,20,25,30,35,40} clusters from the ORL database. Tables 2 and 3 show the detailed clustering accuracy and normalized mutual information of the seven methods, respectively. Our proposed method consistently outperforms all the other competing methods in terms of AC, while SSC performs slightly better in terms of NMI when k = 35 and 40. This can be explained by the fact that sparsity can characterize the structure of the data as well as low-rankness in these cases.

Fig. 1 Sample face images from the ORL database

Table 2 Clustering Performance on the ORL Database: AC (%)
Table 3 Clustering Performance on the ORL Database: NMI (%)

4.2.2 AR database

The AR face database contains about 4000 face images of 126 subjects. For each subject, 26 images were taken in two separate sessions, with large variations in facial disguise, illumination and expression. Some images of the same subject from the AR face database are shown in Fig. 2. Each data point is normalized to have unit length. We then construct the data matrix X from subsets consisting of different numbers of subjects k ∈{10,12,14,16,18,20,22,24,26}. The subspace clustering methods are performed on X and the results are recorded in Tables 4 and 5. It can be seen that LSPLRG achieves competitive performance in most cases. One may notice that BDR distinctly outperforms the other methods on this database. This is because the relationships between the data are more important here, and they can be effectively exploited by the block diagonal representation. Thus, beyond the sparse vector and the low-rank matrix, the block-diagonal matrix is another interesting structured-sparsity pattern.

Fig. 2 Sample images from the AR database

Table 4 Clustering Performance on the AR Database: AC (%)
Table 5 Clustering Performance on the AR Database: NMI (%)

4.2.3 LFW database

The third database is the LFW face database, which consists of more than 13000 face images from 1680 subjects captured under unconstrained conditions. In our experiments, we select a subset of 1251 face images of 86 individuals for evaluation; each subject has only 10-20 images, so the number of samples per subject is imbalanced. All face images are cropped and resized to 32 × 32 pixels. Figure 3 shows typical face images from the LFW database. We select {10,20,30,40,50,60,70,80} clusters, and the experimental results of the different methods on this database are presented in Tables 6 and 7. The best clustering results are again achieved by our LSPLRG, which verifies that the proposed method has particular potential for face image clustering.

Fig. 3 Sample images from the LFW database

Table 6 Clustering Performance on the LFW Database: AC (%)
Table 7 Clustering Performance on the LFW Database: NMI (%)

4.2.4 Extended yale B database

The Extended Yale B face database contains 2414 frontal-face images of 38 subjects captured under different laboratory-controlled illumination conditions. For each subject, there are 59-64 nearly frontal images, which are manually aligned and cropped. In our experiments, each image is resized to 32 × 32 pixels and vectorized into a 1024-dimensional data point. Figure 4 shows some face images with various lighting conditions. Each data point is normalized to have unit length. We construct the data matrix X from subsets consisting of different numbers of subjects k ∈{10,15,20,25,30,35}. It should be noted that the Extended Yale B database is challenging for subspace clustering due to corruptions in the data caused by specular reflections. According to Tables 8 and 9, the proposed LSPLRG once again achieves the best results in all cases. The average clustering accuracies obtained by LRR, SSC, RLRR, BDR, AWNLRR and LSPLRG are 76.79%, 73.53%, 61.25%, 46.98%, 89.79%, and 91.76%, respectively. On this database, AWNLRR and LSPLRG significantly outperform the other methods in terms of both accuracy and normalized mutual information. This is because both methods include a graph manifold regularization that preserves the local geometrical structure. Moreover, the experimental results also validate that our method has an outstanding capability for overcoming the challenges of illumination variations and corruptions.

Fig. 4 Sample images from the Extended Yale B database

Table 8 Clustering Performance on the Extended Yale B Database: AC (%)
Table 9 Clustering Performance on the Extended Yale B Database: NMI (%)

4.3 Visualization of the affinity matrices

To further demonstrate the effectiveness of LSPLRG, we evaluate the affinity matrices obtained by different methods, which could qualitatively reflect the performance of affinity learning. Figure 5 shows the visualization of these affinity matrices.

Fig. 5 Affinity matrix comparisons on the ORL database. (a)-(c) are the affinity matrices obtained using SSC, AWNLRR and LSPLRG, respectively

We use 10 classes of samples from the ORL face database to visually compare the affinity matrices derived by SSC, AWNLRR and LSPLRG. From Fig. 5, we can see that the affinity matrix produced by our method LSPLRG has a clearer block-diagonal structure. Specifically, in the case of ORL, due to the noise produced by facial expressions and minor pose variations, the main concern is the off-diagonal parts. Compared with the affinity matrices of SSC and AWNLRR, the matrix obtained by LSPLRG has fewer nonzero entries in the off-diagonal blocks, and the entries within the diagonal blocks dominate the matrix in amplitude, which implies that each subject becomes highly compact and different subjects are better separated. This demonstrates that LSPLRG not only takes into account the global and local structures of the data, but also learns a sparse affinity matrix. Thus, the affinity matrix learned by our method carries more discriminative information and is better suited for face clustering.

4.4 Convergence and parameter discussion

Convergence is a basic requirement for a practical algorithm. In this subsection, we analyze the convergence property of the proposed ADMM-based method reported in Algorithm 1. In fact, the optimization scheme for problem (7) is equivalent to a two-block optimization problem, similar to the classical ADMM [16, 39]. The classical ADMM is intended to solve problems of the form

$$ \underset{z \in {\Omega}_{z}, w\in {\Omega}_{w}}{\min} f(z)+h(w) \quad \text{s.t.} \quad {Rz+Tw=u} $$
(34)

where f and h are convex functions. Ωz and Ωw are the boundary constraints of variables z and w. It is apparent that ADMM for problem (34) can be directly extended to solve the matrix optimization problem as follows

$$ \min\limits_{Z \in {\Omega}_{Z}, W\in {\Omega}_{W}} f(Z)+h(W) \quad \text{s.t.} \quad {RZ+TW=U} $$
(35)

where R,T,U are matrices. The augmented Lagrangian of problem (35), in the method of classical ADMM, is formulated as

$$ L(Z,W,C)=f(Z)+h(W)+\frac{\mu}{2}\|RZ+TW-U\|_{F}^{2}+\langle C,RZ+TW-U\rangle $$
(36)

where C is the Lagrangian multiplier, and μ is a penalty coefficient. ADMM updates two primal variables in an alternating scheme, and iteratively solves problem (36) as follows

$$ \begin{array}{@{}rcl@{}} &&Z^{t+1}=\arg\min\limits_{Z} L_{\mu}(Z,W^{t},C^{t}) \end{array} $$
(37)
$$ \begin{array}{@{}rcl@{}} &&W^{t+1}=\arg\min\limits_{W} L_{\mu}(Z^{t+1},W,C^{t}) \end{array} $$
(38)
$$ \begin{array}{@{}rcl@{}} &&C^{t+1}=C^{t}+{\mu}(RZ^{t+1}+TW^{t+1}-U) \end{array} $$
(39)

It should be noted that problem (7) is a special case of problem (35), and the proposed optimization algorithm shown in Algorithm 1 follows the same optimization style as the classical ADMM. Therefore, Algorithm 1 is equivalent to a two-block ADMM, whose global convergence is theoretically guaranteed [12, 16]. Figure 6 shows the convergence curves on four different databases, i.e. ORL, AR, LFW and Extended Yale B. The objective value decreases monotonically at each iteration and finally converges. Moreover, the optimization converges quickly, within about ten iterations, which demonstrates the fast convergence of the algorithm.

Fig. 6 Convergence curves on different databases

Several regularization parameters affect the performance of the proposed LSPLRG. λ1 controls the values of the weighted matrix W and is used to avoid trivial solutions. λ2, λ3 and λ4 are tunable parameters that balance the importance of the corresponding terms. We investigate their impact on the final clustering performance. In our experiments, we vary one parameter while fixing the others. Figure 7 shows the clustering accuracy of the proposed method with respect to the four parameters on the ORL database. Figure 7a shows the influence of the parameter λ1 on the clustering accuracy of LSPLRG; generally, a larger λ1 leads to better clustering performance. From Fig. 7b, we can see that the performance is not very good when λ2 is larger than 0.001; this degeneration is due to the relatively weak regularization effect of the remaining terms. From Fig. 7c, we can see that LSPLRG performs well and stably when λ3 ≤ 0.01. If λ3 is relatively large, the local geometrical structure of the data may not be well preserved because the reconstruction loss is relatively weakened. From Fig. 7d, it can be seen that the performance of LSPLRG is stable when λ4 is within the range {0.0001,0.001,0.01}. Based on these observations, we fix λ1 = 100, λ2 = 0.0001, λ3 = 0.01 and λ4 = 0.001 for all experiments in this paper.

Fig. 7 Clustering accuracy versus different values of parameters

5 Conclusion and future work

In this paper, a novel face clustering method called learning a sparsity preserving low-rank graph (LSPLRG) has been put forward. The proposed method jointly combines sparse representation (SR) and low-rank representation (LRR) to learn an optimal affinity graph for clustering. In contrast with existing graph-based methods, LSPLRG learns an initial affinity graph from the sparse coefficients without any a priori graph or similarity matrix, which is a critical step for reducing the influence of noise and outliers. Meanwhile, LSPLRG introduces an adaptive weighted matrix to constrain the self-representation, and integrates the distance regularization term with the low-rankness of the data representation to exploit the global and local structure of the data. To reduce the redundant features, a constraint on the representation matrix is added so that our model learns a more discriminative graph for face clustering. Extensive experiments on benchmark databases show that the proposed method produces very competitive results for face clustering compared with several state-of-the-art subspace clustering methods.

Similar to most clustering methods, the main limitation of our method is that it may be sensitive to some parameter initializations, which remains for our future study. Furthermore, we will try to develop our framework based on more effective features, and we plan to utilize newly developed techniques, including the block-level strategy [14] and the category-agnostic technique [45], to obtain salient edge features. In order to reduce the redundant features and the influence of noise and outliers, the optimal affinity graph could then be built from the salient edge features, which integrate local edge information and global location information.