1 Introduction

Data representation [15, 16, 27] is a fundamental issue in machine learning, computer vision, and related fields. In many applications, the data are high dimensional. Traditional methods that perform well in low-dimensional spaces can become entirely impractical in high-dimensional feature spaces. Therefore, dimensionality reduction has become increasingly important, since it can alleviate the curse of dimensionality, accelerate the learning process, and even provide significant insights into the nature of the problem. Generally speaking, dimensionality reduction techniques [1, 18, 19, 25, 26, 29, 33] can be divided into two categories: feature extraction [1, 18, 19, 33] and feature selection [6, 8, 17]. Feature extraction combines all original features to form new representations, while feature selection tries to select a subset of the most discriminative features. Compared with feature selection, which does not change the original representations of the data, feature extraction creates new features.

The most popular feature extraction approaches include Principal Component Analysis (PCA) [1, 22], Nonnegative Matrix Factorization (NMF) [7, 18, 19, 23, 31], Singular Value Decomposition (SVD), and Concept Factorization (CF) [33]. Although these methods have different motivations, they can all be interpreted as matrix decompositions, which find two or more lower-dimensional matrices whose product approximates the original one. The factorization leads to a reduced representation of the initial data and thus belongs to the family of dimensionality reduction techniques.

Unlike PCA [1] and SVD, NMF [18, 19] factorizes the original data matrix into the product of two matrices whose elements are constrained to be nonnegative. One matrix consists of basis vectors that reveal the latent semantic structure, and the other can be regarded as the coefficients, so that each sample point is a linear combination of the bases. NMF can be viewed as a parts-based representation of the data because only additive, not subtractive, combinations are allowed. Such a representation encodes the data using few components, which makes the encoding easy to interpret. Due to its ability to extract discriminative features and its computational feasibility, NMF and its extensions [5, 11, 21] have been widely applied in computer vision, especially to face recognition tasks.

Data locality has been widely exploited in many machine learning problems such as dimensionality reduction [16, 28], clustering [10, 20], and classification [9, 30, 32, 36, 37]. NMF yields sparse codings in which each data point is a linear combination of a few basis vectors. However, the sparsity achieved by NMF does not always respect data locality. As suggested by Yu et al. [20], locality necessarily leads to sparsity, but not necessarily vice versa. It has been shown in [36] that imposing a locality constraint implies sparsity of the encoding matrix, since only the basis vectors close to the input data point are chosen for its representation. In NMF, a sample point may be reconstructed from basis vectors that are far from it, which can result in unsatisfactory classification results. The standard NMF does not preserve locality during its decomposition process, whereas local coordinate coding [20] can preserve such properties.

Sparsity regularization of a given loss function has been widely investigated. Bradley et al. [3] proposed 1-SVM, which performs feature selection by adopting the 1-norm and thus leads to sparse solutions. Hoyer [14] extended NMF with explicit sparsity constraints by imposing a 1-norm penalty on the coefficient and basis matrices, which allows sparser representations than those obtained by standard NMF. Cai et al. [4] proposed a unified sparse subspace learning (SSL) approach based on 1-norm regularization. The shortcoming of 1-norm regularization is that it cannot ensure that all data vectors are sparse in the same features, so it is not well suited to feature selection. To address this issue, Nie et al. [24] proposed a robust feature selection approach by imposing the 2,1-norm on both the loss function and the regularization term. Gu et al. [12], Hou et al. [13], and Yang et al. [35] used the 2,1-norm in subspace learning, sparse regression, and discriminative feature selection, respectively. The 2,1-norm regularization term induces row sparsity while exploiting the correlations among all the features.

In this paper we present a novel matrix factorization approach, called Non-negative Matrix Factorization by Joint Locality-constrained and 2,1-norm Regularization (NMF2L), which imposes row sparsity and locality constraints at the same time. The contributions of this paper are summarized as follows.

  1. By requiring the basis vectors to be as close as possible to the original input data points, we incorporate local coordinate coding [36] into the non-negative matrix factorization objective function. By adding the 2,1-norm regularization, we obtain a row-sparse coefficient matrix.

  2. The proposed NMF2L performs feature selection on the coefficient matrix, rather than on the original data matrix as in traditional feature selection methods.

  3. We provide an efficient and effective multiplicative updating procedure for the NMF2L approach, together with a rigorous convergence analysis.

The rest of this paper is organized as follows. Section 2 reviews NMF and Nonnegative Sparse Coding (NNSC). Section 3 introduces our NMF2L method, the optimization scheme, and the convergence study. Section 4 describes and analyzes the experimental results. Finally, Section 5 concludes the paper and discusses future work.

2 Related works

In this section, we briefly review NMF and NNSC.

2.1 NMF

NMF is a decomposition approach for data matrices whose elements are nonnegative. Given a nonnegative matrix \(\mathbf {X}=[\mathbf {x}_{1},\mathbf {x}_{2},\cdots ,\mathbf {x}_{N}]\in \mathbb {R}_{+}^{M\times N}\), each column of X is a data point. NMF aims to find two nonnegative matrices \(\mathbf {U}=[u_{ik}]\in \mathbb {R}_{+}^{M\times K}\) and \(\mathbf {V}=[v_{jk}]\in \mathbb {R}_{+}^{K\times N}\) that solve the following optimization problem:

$$ \begin{array}{r} \mathcal{J}_{NMF} =\|\mathbf{X}-\mathbf{U}\mathbf{V}\|^{2}_{F}, \\ s.t.\ \ \mathbf{U}\geq 0, \mathbf{V}\geq 0, \end{array} $$
(1)

where ∥⋅∥ F is the Frobenius norm. Although the loss function \(\mathcal {J}_{NMF}\) is convex in U alone or in V alone, it is not convex in both matrices jointly. Therefore, it is unrealistic to expect an algorithm to find the global optimum of \(\mathcal {J}_{NMF}\). To solve the optimization problem, Lee et al. [18] proposed the following iterative update rules:

$$\begin{array}{@{}rcl@{}} u_{jk}&\leftarrow& u_{jk}\frac{(\mathbf{X}\mathbf{V}^{T})_{jk}}{(\mathbf{U}\mathbf{V}\mathbf{V}^{T})_{jk}}, \\ v_{ki}&\leftarrow& v_{ki}\frac{(\mathbf{U}^{T}\mathbf{X})_{ki}}{(\mathbf{U}^{T}\mathbf{U}\mathbf{V})_{ki}}. \end{array} $$
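As a concrete illustration, the following sketch implements these multiplicative updates in NumPy (the function name, the random initialization, and the small constant added for numerical stability are our own choices, not part of the original algorithm description):

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=200, eps=1e-10, seed=0):
    """Basic NMF via the Lee-Seung multiplicative updates shown above (a sketch)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps   # nonnegative basis matrix
    V = rng.random((K, N)) + eps   # nonnegative coefficient matrix
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update for u_jk
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update for v_ki
    return U, V
```

Each factor stays nonnegative because both updates only multiply nonnegative entries by nonnegative ratios.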

2.2 Non-negative sparse coding (NNSC)

NMF produces sparse representations that encode the data with only a few basis vectors. This property further promotes interpretability in practical problems. However, the sparseness introduced by nonnegativity alone may not be sufficient and is difficult to control. To address this difficulty, the Nonnegative Sparse Coding (NNSC) [14] method decomposes multivariate data into a set of nonnegative sparse components, using a 1-norm penalty to measure sparseness. Combining the reconstruction loss with this sparseness constraint, NNSC solves the following optimization problem:

$$\begin{array}{@{}rcl@{}} \mathcal{J}_{NNSC} =\|\mathbf{X}-\mathbf{U}\mathbf{V}\|^{2}_{F}+\lambda\|\mathbf{V}\|_{1}, \\ s.t.\ \ \mathbf{U}\geq 0, \mathbf{V}\geq 0, \end{array} $$
(2)

where ∥⋅∥1 is the 1-norm and λ is the tradeoff parameter. The loss function can be minimized with the following update rules:

$$\begin{array}{@{}rcl@{}} \mathbf{U}\leftarrow \mathbf{U}-\mu(\mathbf{U}\mathbf{V}-\mathbf{X})\mathbf{V}^{T}, \\ v_{ki}\leftarrow v_{ki}\frac{(\mathbf{U}^{T}\mathbf{X})_{ki}}{((\mathbf{U}^{T}\mathbf{U}\mathbf{V})_{ki}+\lambda)}, \end{array} $$

where μ denotes the step-size.
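A minimal sketch of these NNSC updates is given below, assuming NumPy; the explicit projection of U back onto the nonnegative orthant is an extra safeguard we add (the additive gradient step above can otherwise produce small negative entries):

```python
import numpy as np

def nnsc(X, K, lam=0.1, step=1e-3, n_iter=500, eps=1e-10, seed=0):
    """NNSC of Eq. (2): gradient step for U, multiplicative update for V (a sketch)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps
    V = rng.random((K, N)) + eps
    for _ in range(n_iter):
        U = U - step * (U @ V - X) @ V.T            # additive gradient step on U
        U = np.maximum(U, eps)                      # keep U nonnegative (our safeguard)
        V *= (U.T @ X) / (U.T @ U @ V + lam + eps)  # sparsity-penalized update for V
    return U, V
```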

3 Non-negative matrix factorization by joint locality-constrained and 2,1-norm regularization

In this section, we present a novel and effective approach for data representation. To this end, we consider two regularization terms, one preserving locality and one generating row sparsity of the coefficient matrix. We introduce them in sequence.

3.1 The objective function

The first regularization term is motivated by the concept of local coordinate coding [36]. We give the definition of coordinate coding below.

Definition

A coordinate coding is a pair (γ,C), where \(C\subset \mathbb {R}^{d}\) is a set of anchor points, and γ is a map of \(\mathbf {x}\in \mathbb {R}^{d}\) to \([\gamma_{v}(\mathbf{x})]_{v\in C}\in \mathbb{R}^{|C|}\). It induces the following physical approximation of x in \(\mathbb {R}^{d}\): \(\gamma (\mathbf {x})={\sum }_{v\in C}\gamma _{v}(\mathbf {x})v\).

On the basis of this definition, NMF can be considered a coordinate coding in which the columns of the basis matrix U serve as a set of anchor points, and each column of V contains the coordinates of the corresponding data point with respect to the anchor points. In order to preserve the local structure of the data, only a few anchor points close to the original data point should be chosen for its representation. The local coordinate constraint can be formulated as follows:

$$\begin{array}{@{}rcl@{}} \mathcal{Q}=\sum\limits_{k=1}^{K}|v_{ki}|\|\mathbf{u}_{k}-\mathbf{x}_{i}\|^{2}. \end{array} $$
(3)

The above constraint imposes a heavy penalty if \(\mathbf{x}_{i}\) is far away from the basis vector \(\mathbf{u}_{k}\) while its coordinate \(v_{ki}\) with respect to \(\mathbf{u}_{k}\) is large.
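For clarity, the locality penalty can be evaluated directly from U, V, and X. The sketch below (our own helper, assuming NumPy) computes the penalty summed over all samples, as it appears later in the overall objective (5):

```python
import numpy as np

def locality_penalty(X, U, V):
    """sum_i sum_k |v_ki| * ||u_k - x_i||^2, i.e. the locality term of Eq. (3) over all samples."""
    # dist2[k, i] = ||u_k - x_i||^2, with U of shape (M, K) and X of shape (M, N).
    dist2 = ((U[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # shape (K, N)
    return float(np.sum(np.abs(V) * dist2))
```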

We choose the second regularization term to distinguish the importance of different features. After the iterative updates, it is desirable that significant features be represented by non-zero values and insignificant features by zeros. Each row of the coefficient matrix V corresponds to a feature in the original space. This motivates us to add a 2,1-norm regularization on the coefficient matrix V, which drives many rows of V toward zero. We then keep the important features (i.e., those with non-zero values) and discard the unimportant ones. The 2,1-norm, used as the second regularization term, is given as follows:

$$\begin{array}{@{}rcl@{}} \|\mathbf{V}\|_{2,1}=\sum\limits_{j=1}^{K}\|\mathbf{V}^{(j)}\|_{2}, \end{array} $$
(4)

where \(\mathbf{V}^{(j)}\) is the j-th row of matrix V, which reflects the importance of the j-th feature to all the data points.
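Computationally, this norm is simply the sum of the Euclidean norms of the rows of V, as the short helper below illustrates (names are ours):

```python
import numpy as np

def l21_norm(V):
    """||V||_{2,1} of Eq. (4): sum of the 2-norms of the rows of V."""
    return float(np.sum(np.sqrt(np.sum(V ** 2, axis=1))))
```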

By integrating (3) and (4) into the traditional NMF, the overall loss function of NMF2L is defined as:

$$\begin{array}{@{}rcl@{}} \mathcal{O}&=&\|\mathbf{X}-\mathbf{U}\mathbf{V}\|_{F}^{2} +\mu\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}|v_{ki}|\|\mathbf{u}_{k}-\mathbf{x}_{i}\|^{2}+\lambda\|\mathbf{V}\|_{2,1}, \\ s.t.\ \ \mathbf{U}&=&[\mathbf{u}_{1},\cdots,\mathbf{u}_{K}]\in \mathbb{R}^{M\times K}\geq 0, \\ \mathbf{V}&=&[\mathbf{v}_{1},\cdots,\mathbf{v}_{N}]\in \mathbb{R}^{K\times N}\geq 0, \end{array} $$
(5)

where μ and λ are positive regularization parameters. We call (5) Nonnegative Matrix Factorization by Joint Locality-constrained and 2,1-norm Regularization (NMF2L). When μ = 0 and λ = 0, (5) degenerates to the original NMF.
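The sketch below evaluates the objective (5) for given factors by combining the reconstruction error with the two regularization terms; it is only an illustrative helper under the same NumPy assumption as above:

```python
import numpy as np

def nmf2l_objective(X, U, V, mu, lam):
    """Value of the NMF2L objective in Eq. (5)."""
    recon = np.linalg.norm(X - U @ V, 'fro') ** 2                   # ||X - UV||_F^2
    dist2 = ((U[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)      # ||u_k - x_i||^2, shape (K, N)
    locality = np.sum(np.abs(V) * dist2)                            # locality term, Eq. (3)
    row_sparsity = np.sum(np.sqrt(np.sum(V ** 2, axis=1)))          # ||V||_{2,1}, Eq. (4)
    return float(recon + mu * locality + lam * row_sparsity)
```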

3.2 The update rules

The loss function \(\mathcal {O}\) of NMF2L in (5) is not convex in U and V jointly. Therefore, it is unrealistic to find an algorithm that attains the global optimal solution. In the following, we give an iterative algorithm that reaches a local optimum. After some algebraic manipulation, the objective function can be rewritten as follows:

$$\begin{array}{@{}rcl@{}} \mathcal{O}=\|\mathbf{X}-\mathbf{U}\mathbf{V}\|_{F}^{2}+\mu\sum\limits_{i=1}^{N}\|(\mathbf{x}_{i}1^{T} -\mathbf{U})\boldsymbol{\Lambda}_{i}^{1/2}\|^{2}+\lambda\|\mathbf{V}\|_{2,1}, \\ s.t.\ \ \mathbf{U}\geq 0,\ \mathbf{V}\geq 0, \end{array} $$
(6)

where \(\boldsymbol {\Lambda }_{i}=diag(|\mathbf {v}_{i}|)\in \mathbb {R}^{K\times K}\) and \(\mathbf{v}_{i}\) is the i-th column of V. According to the matrix properties \(\text{Tr}(\mathbf{A}\mathbf{B})=\text{Tr}(\mathbf{B}\mathbf{A})\), \(\|\mathbf {A}\|_{F}^{2}=\text{Tr}(\mathbf {A}^{T}\mathbf {A})\), and \(\text{Tr}(\mathbf{A})=\text{Tr}(\mathbf{A}^{T})\), we have

$$\begin{array}{@{}rcl@{}} \mathcal{O}=\text{Tr}\left( \mathbf{X}\mathbf{X}^{T}+\mathbf{U}\mathbf{V}\mathbf{V}^{T}\mathbf{U}^{T}-2\mathbf{X}\mathbf{V}^{T}\mathbf{U}^{T} +\mu\sum\limits_{i=1}^{N}(\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\mathbf{1}\mathbf{x}_{i}^{T}\right.\\ \left.-2\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\mathbf{U}^{T}+\mathbf{U}\boldsymbol{\Lambda}_{i}\mathbf{U}^{T})\!\vphantom{\sum\limits_{i=1}^{N}}\right)\! +\lambda\|\mathbf{V}\|_{2,1}. \end{array} $$
(7)

Since U ≥ 0 and V ≥ 0, we introduce the Lagrange multipliers \(\boldsymbol{\Psi}=[\psi_{jk}]\) and \(\boldsymbol{\Phi}=[\phi_{ki}]\). The Lagrangian can then be written as

$$\begin{array}{@{}rcl@{}} \mathcal{L}=\text{Tr}\left( \mathbf{X}\mathbf{X}^{T}+\mathbf{U}\mathbf{V}\mathbf{V}^{T}\mathbf{U}^{T}-2 \mathbf{X}\mathbf{V}^{T}\mathbf{U}^{T} +\mu\sum\limits_{i=1}^{N}\left( \mathbf{x}_{i}\mathbf{1}^{T} \boldsymbol{\Lambda}_{i}\mathbf{1} \mathbf{x}_{i}^{T}\right.\right. \\ \left.\left.-2\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\mathbf{U}^{T}+\mathbf{U}\boldsymbol{\Lambda}_{i}\mathbf{U}^{T}\right)\right) +\lambda\|\mathbf{V}\|_{2,1} \\ -\text{Tr}(\boldsymbol{\Psi}\mathbf{U}^{T})-\text{Tr}(\boldsymbol{\Phi}\mathbf{V}^{T}). \end{array} $$
(8)

Setting \(\frac {\partial \mathcal {L}}{\partial \mathbf {U}}=0\) and \(\frac {\partial \mathcal {L}}{\partial \mathbf {V}}=0\), we obtain

$$\begin{array}{@{}rcl@{}} \boldsymbol{\Psi} = 2\mathbf{U}\mathbf{V}\mathbf{V}^{T}-2\mathbf{X}\mathbf{V}^{T}+ \mu\sum\limits_{i=1}^{N}(-2\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}+2\mathbf{U}\boldsymbol{\Lambda}_{i}), \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} \boldsymbol{\Phi} = 2\mathbf{U}^{T}\mathbf{U}\mathbf{V}-2\mathbf{U}^{T}\mathbf{X}+\mu(\mathbf{C}-2\mathbf{U}^{T}\mathbf{X} +\mathbf{D})+\lambda\mathbf{G}\mathbf{V}, \end{array} $$
(10)

where G is a diagonal matrix whose i-th diagonal element is \(\mathbf {G}_{ii}=\frac {1}{2\|\mathbf {V}^{(i)}\|}\). Define the column vector \(\mathbf {c}=diag(\mathbf {X}^{T}\mathbf {X})\in \mathbb {R}^{N}\) and let \(\mathbf{C}=(\mathbf{c},\cdots,\mathbf{c})^{T}\) be the K × N matrix whose rows all equal \(\mathbf{c}^{T}\). Similarly, define the column vector \(\mathbf {d}=diag(\mathbf {U}^{T}\mathbf {U})\in \mathbb {R}^{K}\) and let \(\mathbf{D}=(\mathbf{d},\cdots,\mathbf{d})\) be the K × N matrix whose columns all equal d.
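These auxiliary matrices are cheap to form explicitly. The following sketch builds G, C, and D from X, U, and V (the small constant guarding against division by zero for all-zero rows of V is our own addition):

```python
import numpy as np

def build_G_C_D(X, U, V, eps=1e-10):
    """Auxiliary matrices used in (10)-(14): diagonal G, and the K x N matrices C and D."""
    K, N = V.shape
    row_norms = np.sqrt(np.sum(V ** 2, axis=1)) + eps   # ||V^(i)|| for each row of V
    G = np.diag(1.0 / (2.0 * row_norms))                # G_ii = 1 / (2 ||V^(i)||)
    c = np.sum(X * X, axis=0)                           # diag(X^T X), length N
    C = np.tile(c, (K, 1))                              # every row of C equals c^T
    d = np.sum(U * U, axis=0)                           # diag(U^T U), length K
    D = np.tile(d[:, None], (1, N))                     # every column of D equals d
    return G, C, D
```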

Applying the Karush-Kuhn-Tucker conditions [2] \(\psi_{jk}u_{jk}=0\) and \(\phi_{ki}v_{ki}=0\), we obtain the following equations:

$$\begin{array}{@{}rcl@{}} (\mathbf{U}\mathbf{V}\mathbf{V}^{T})_{jk}u_{jk}-(\mathbf{X}\mathbf{V}^{T})_{jk}u_{jk} +\mu\left( \sum\limits_{i=1}^{N}\mathbf{U}\boldsymbol{\Lambda}_{i}\right)_{jk}u_{jk} \\ -\mu\left( \sum\limits_{i=1}^{N}\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\right)_{jk}u_{jk}=0, \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} 2(\mathbf{U}^{T}\mathbf{U}\mathbf{V})_{ki}v_{ki}-2(\mathbf{U}^{T}\mathbf{X})_{ki}v_{ki}+\lambda(\mathbf{G}\mathbf{V})_{ki}v_{ki} \\ +\mu(\mathbf{C}-2\mathbf{U}^{T}\mathbf{X}+\mathbf{D})_{ki}v_{ki}=0. \end{array} $$
(12)

We can achieve the following update rules:

$$\begin{array}{@{}rcl@{}} u_{jk} \leftarrow u_{jk}\frac{\left( \mathbf{X}\mathbf{V}^{T}+\mu{\sum}_{i=1}^{N}\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\right)_{jk}} {\left( \mathbf{U}\mathbf{V}\mathbf{V}^{T}+\mu{\sum}_{i=1}^{N} \mathbf{U}\boldsymbol{\Lambda}_{i}\right)_{jk}}, \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} v_{ki}\leftarrow v_{ki}\frac{2(\mu+1)(\mathbf{U}^{T}\mathbf{X})_{ki}} {(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+\mu\mathbf{D}+\lambda\mathbf{G}\mathbf{V})_{ki}}. \end{array} $$
(14)
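Putting the pieces together, a minimal NumPy sketch of the multiplicative updates (13) and (14) is given below. It is not the authors' implementation; in particular it uses the simplifications \({\sum }_{i}\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}=\mathbf{X}\mathbf{V}^{T}\) and \({\sum }_{i}\mathbf{U}\boldsymbol{\Lambda}_{i}=\mathbf{U}\,diag(\mathbf{V}\mathbf{1})\), which hold because V is nonnegative, and it adds a small constant for numerical stability:

```python
import numpy as np

def nmf2l(X, K, mu=1.0, lam=1.0, n_iter=300, eps=1e-10, seed=0):
    """Multiplicative updates (13) and (14) for NMF2L (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps
    V = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # Update (13). With V >= 0: sum_i x_i 1^T Lambda_i = X V^T and sum_i U Lambda_i = U diag(V 1).
        XVt = X @ V.T
        num_U = XVt + mu * XVt
        den_U = U @ V @ V.T + mu * (U * V.sum(axis=1)) + eps   # U * row sums of V = U diag(V 1)
        U *= num_U / den_U

        # Update (14), using the auxiliary matrices G, C, D defined above.
        G_diag = 1.0 / (2.0 * (np.sqrt(np.sum(V ** 2, axis=1)) + eps))   # diagonal of G
        C = np.tile(np.sum(X * X, axis=0), (K, 1))                       # rows are diag(X^T X)^T
        D = np.tile(np.sum(U * U, axis=0)[:, None], (1, N))              # columns are diag(U^T U)
        num_V = 2.0 * (mu + 1.0) * (U.T @ X)
        den_V = 2.0 * (U.T @ U @ V) + mu * C + mu * D + lam * (G_diag[:, None] * V) + eps
        V *= num_V / den_V
    return U, V
```

Each iteration is dominated by the dense matrix products, so its cost grows as O(MNK).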

3.3 Convergence analysis

In this section, we apply the auxiliary function method [19] to prove convergence. We first recall the definition of an auxiliary function.

Definition 1

\(Z(h,h^{\prime})\) is an auxiliary function of F(h) if the conditions

$$\begin{array}{@{}rcl@{}} Z(h,h^{\prime})\geq F(h), && Z(h,h)=F(h) \end{array} $$

are satisfied.

Lemma 1

If Z is an auxiliary function for F, then F is non-increasing under the update

$$\begin{array}{@{}rcl@{}} h^{(t+1)} &=& \arg \underset{h}{\min} Z(h,h^{(t)}). \end{array} $$

Proof.

$$\begin{array}{@{}rcl@{}} F(h^{(t+1)})\leq Z(h^{(t+1)},h^{(t)}) \leq Z(h^{(t)},h^{(t)})=F(h^{(t)}). \end{array} $$

The convergence of the algorithm is addressed in the following.

For any element \(v_{ab}\) in V, we use \(F_{v_{ab}}\) to denote the part of \(\mathcal {O}\) that depends only on \(v_{ab}\). It is easy to check that

$$\begin{array}{@{}rcl@{}} F^{\prime}_{v_{ab}}=(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}-2\mathbf{U}^{T}\mathbf{X} +\mu(\mathbf{C}- 2\mathbf{U}^{T}\mathbf{X} +\mathbf{D})+\lambda\mathbf{G}\mathbf{V})_{ab} ,\\ F^{\prime\prime}_{v_{ab}}=2(\mathbf{U}^{T}\mathbf{U})_{aa}+\lambda(\mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot \mathbf{V}))_{aa}, \end{array} $$

where ⊙ denotes the element-wise multiplication.

Theorem 1

The function

$$\begin{array}{@{}rcl@{}} Z\left( v,v_{ab}^{(t)}\right) &=&F_{v_{ab}}\left( v_{ab}^{(t)}\right)+F^{\prime}_{v_{ab}}\left( v_{ab}^{(t)}\right)\left( v-v_{ab}^{(t)}\right)\\ && +\frac{(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+ \mu\mathbf{D}+\lambda\mathbf{G}\mathbf{V})_{ab}}{v_{ab}^{(t)}}\left( v-v_{ab}^{(t)}\right)^{2} \end{array} $$
(15)

is an auxiliary function for \(F_{v_{ab}}\) .

Proof

Since \(Z(v,v)=F_{v_{ab}}(v)\) holds trivially, we only need to show that \(Z(v,v_{ab}^{(t)})\geq F_{v_{ab}}(v)\). To do this, we compare the Taylor series expansion of \(F_{v_{ab}}(v)\)

$$\begin{array}{@{}rcl@{}} F_{v_{ab}}(v)=F_{v_{ab}}\left( v_{ab}^{(t)}\right)+F^{\prime}_{v_{ab}}\left( v_{ab}^{(t)}\right)\left( v-v_{ab}^{(t)}\right) + \left( (\mathbf{U}^{T}\mathbf{U})_{aa}\right. \\ \left.+\lambda\frac{\left( \mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V})\right)_{aa}}{2}\right)\left( v-v_{ab}^{(t)}\right)^{2} \end{array} $$
(16)

with (15) to find that \(Z(v,v_{ab}^{(t)})\geq F_{v_{ab}}(v)\) is equivalent to

$$\begin{array}{@{}rcl@{}} \frac{(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+ \mu\mathbf{D}+\lambda\mathbf{G}\mathbf{V})_{ab}}{v_{ab}^{(t)}} \geq (\mathbf{U}^{T}\mathbf{U})_{aa}\\ +\frac{\lambda(\mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V}))_{aa}}{2}. \end{array} $$
(17)

We have

$$\begin{array}{@{}rcl@{}} (2\mathbf{U}^{T}\mathbf{U}\mathbf{V})_{ab}=2\sum\limits_{l=1}^{K}(\mathbf{U}^{T}\mathbf{U})_{al}v_{lb}^{(t)} \geq(\mathbf{U}^{T}\mathbf{U})_{aa}v_{ab}^{(t)} \end{array} $$

and

$$\begin{array}{@{}rcl@{}} \lambda(\mathbf{G}\mathbf{V})_{ab} &=&\lambda\sum\limits_{l=1}^{K}\mathbf{G}_{al}v_{lb}^{(t)}\geq \lambda\mathbf{G}_{aa}v_{ab}^{(t)} \\ &\geq& \lambda(\mathbf{G}_{aa}-(\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V}))_{aa})v_{ab}^{(t)}\\ &=&\lambda(\mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V}))_{aa}v_{ab}^{(t)}\\ &\geq& \frac{\lambda(\mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V}))_{aa}}{2}v_{ab}^{(t)}. \end{array} $$
(18)

Thus, (17) holds and \(Z(v,v_{ab}^{(t)})\geq F_{v_{ab}}(v)\). □

Theorem 2

Equation (14) can be obtained by minimizing the function \(Z(v,v_{ab}^{(t)})\), where \(v_{ab}^{(t)}\) is the iterative solution at the t-th step.

Proof

To obtain the minimum, we only need to set the derivative \(\frac {\partial Z(v,v_{ab}^{(t)})}{\partial v_{ab}}=0\), which gives

$$\begin{array}{@{}rcl@{}} \frac{\partial Z\left( v_{ab},v_{ab}^{(t)}\right)}{\partial v_{ab}}&=&F^{\prime}_{v_{ab}}\left( v_{ab}^{(t)}\right)\\ &&+\frac{2(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+\mu\mathbf{D}+\lambda\mathbf{G} \mathbf{V})_{ab}}{v_{ab}^{(t)}}\left( v_{ab}-v_{ab}^{(t)}\right)=0. \end{array} $$
(19)

Thus, by simple algebra, we obtain (14).

From Theorem 1, we know that \(Z(v,v_{ab}^{(t)})\) is an auxiliary function for \(F_{v_{ab}}\). According to Lemma 1 and Theorem 2, updating \(v_{ab}\) using (14) monotonically decreases the objective function in (7); therefore it converges to a local optimum.

The proof that updating U using (13) also decreases the objective function is similar to the above. □

3.4 Connection to gradient descent method

The objective function of NMF2L can also be minimized by a gradient descent algorithm, which yields the following update rules:

$$\begin{array}{@{}rcl@{}} u_{jk}&\leftarrow& u_{jk}+\eta_{jk}\frac{\partial\mathcal{O}}{\partial u_{jk}}, \end{array} $$
(20)
$$\begin{array}{@{}rcl@{}} v_{ki}&\leftarrow& v_{ki}+\delta_{ki}\frac{\partial\mathcal{O}}{\partial v_{ki}}, \end{array} $$
(21)

where \(\eta_{jk}\) and \(\delta_{ki}\) are the step-size parameters.

It is difficult to set these step-size parameters while maintaining the non-negativity of \(u_{jk}\) and \(v_{ki}\). Instead, we set

$$\begin{array}{@{}rcl@{}} \eta_{jk}=-\frac{u_{jk}}{2(\mathbf{U}\mathbf{V}\mathbf{V}^{T}+\mu{\sum}_{i=1}^{N}\mathbf{U} \boldsymbol{\Lambda}_{i})_{jk}}, \end{array} $$
(22)
$$\begin{array}{@{}rcl@{}} \delta_{ki}=-\frac{v_{ki}}{(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+\mu\mathbf{D}+\lambda\mathbf{G} \mathbf{V})_{ki}}, \end{array} $$
(23)

then we can obtain

$$\begin{array}{@{}rcl@{}} u_{jk}+\eta_{jk}\frac{\partial\mathcal{O}}{\partial u_{jk}}=u_{jk}\frac{\left( \mathbf{X}\mathbf{V}^{T}+ \mu{\sum}_{i=1}^{N}\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\right)_{jk}} {\left( \mathbf{U}\mathbf{V}\mathbf{V}^{T}+\mu{\sum}_{i=1}^{N}\mathbf{U}\boldsymbol{\Lambda}_{i}\right)_{jk}}, \end{array} $$
(24)
$$\begin{array}{@{}rcl@{}} v_{ki}+\delta_{ki}\frac{\partial\mathcal{O}}{\partial v_{ki}}=v_{ki}\frac{2(\mu+1)(\mathbf{U}^{T}\mathbf{X})_{ki}} {(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+\mu\mathbf{D}+\lambda\mathbf{G}\mathbf{V})_{ki}}, \end{array} $$
(25)

which are exactly the update rules in (13) and (14). It is clear that (13) and (14) are special cases of gradient descent with particular step sizes.

4 Experiments

In this section, we systematically evaluated NMF2L on the clustering task. Several experiments were performed to demonstrate the effectiveness of our algorithm.

4.1 Data preparation

Three publicly available databases are widely adopted as benchmark datasets. They are described as follows:

The ORL face dataset consists of 10 different face images for each of 40 distinct subjects. All 400 images were captured against a dark homogeneous background with the subjects in an upright, frontal position, with tolerance for some side movement. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses).

The Yale face dataset consists of 165 gray-scale face images of 15 subjects. The faces show variations in lighting condition (left-light, center-light, right-light), facial expression (normal, happy, sad, sleepy, surprised, and wink), and with/without glasses.

The CMU PIE face dataset contains 41,368 face images of 68 subjects in total. The face images were captured by 13 synchronized cameras and 21 flashes under varying poses, illuminations, and expressions. In our experiments, one near-frontal pose (C27) is chosen under different illuminations, lightings, and expressions, which leaves about 49 near-frontal face images for each individual.

In the experiments, the face images were preprocessed so that the faces were located. First, the original face images were normalized in scale and orientation such that the two eyes were aligned at the same position. Then the facial areas were cropped to form the final images for clustering. We resized them to 32 × 32 pixels with 256 gray levels per pixel for computational convenience.

4.2 Experimental design

This section contains the description of evaluation metrics, compared methods, and parameter settings.

4.2.1 Evaluation metrics

In our experiments, we set the number of clusters equal to the number of classes for all algorithms. We evaluated the clustering performance by comparing the clusters obtained by each algorithm with the true classes. The Accuracy (Acc) and the Normalized Mutual Information (NMI) metrics were adopted to measure the clustering results [34].

The Accuracy (Acc) is defined as follows:

$$\begin{array}{@{}rcl@{}} \text{Acc}=\frac{{\sum}_{i=1}^{n}\delta(map(r_{i}),l_{i})}{n}, \end{array} $$
(26)

where \(r_{i}\) is the cluster label of \(\mathbf{x}_{i}\), \(l_{i}\) is the true class label, n denotes the total number of samples, δ(x,y) is the delta function that equals one if x = y and zero otherwise, and \(map(r_{i})\) is the permutation mapping function that maps the obtained cluster label \(r_{i}\) to the equivalent class label of the data set.
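The permutation mapping map(·) is usually obtained by solving a linear assignment problem between clusters and classes. The sketch below (our own helper, using SciPy's Hungarian solver) computes Acc in this way:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Acc of Eq. (26): best one-to-one matching of cluster labels to class labels."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    clusters, classes = np.unique(cluster_labels), np.unique(true_labels)
    # counts[i, j] = number of samples assigned to cluster i whose true class is j.
    counts = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            counts[i, j] = np.sum((cluster_labels == c) & (true_labels == k))
    row_ind, col_ind = linear_sum_assignment(-counts)   # maximize the matched samples
    return counts[row_ind, col_ind].sum() / true_labels.size
```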

Normalized Mutual Information (NMI) is defined as follows:

$$\begin{array}{@{}rcl@{}} \text{NMI}=\frac{{\sum}_{i=1}^{c}{\sum}_{j=1}^{c}n_{i,j}\log\frac{n\,n_{i,j}}{n_{i}\hat{n}_{j}}} {\sqrt{\left( {\sum}_{i=1}^{c}n_{i}\log\frac{n_{i}}{n}\right)\left( {\sum}_{j=1}^{c}\hat{n}_{j}\log\frac{\hat{n}_{j}}{n}\right)}}, \end{array} $$
(27)

where \(n_{i}\) is the number of samples in the i-th cluster \(\mathcal {C}_{i}\) according to the clustering result, \(\hat {n}_{j}\) is the number of samples in the j-th ground-truth class \(\mathcal {C}^{\prime }_{j}\), and \(n_{i,j}\) denotes the number of samples in the intersection of \(\mathcal {C}_{i}\) and \(\mathcal {C}^{\prime }_{j}\).
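The helper below computes NMI directly from the cluster and class label vectors, following Eq. (27) with the geometric-mean normalization (again an illustrative sketch, not the evaluation code of [34]):

```python
import numpy as np

def nmi(true_labels, cluster_labels):
    """NMI of Eq. (27): mutual information normalized by the geometric mean of the entropies."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = true_labels.size
    clusters, classes = np.unique(cluster_labels), np.unique(true_labels)
    n_i = np.array([np.sum(cluster_labels == c) for c in clusters], dtype=float)
    n_j = np.array([np.sum(true_labels == k) for k in classes], dtype=float)
    mi = 0.0
    for a, c in enumerate(clusters):
        for b, k in enumerate(classes):
            n_ij = np.sum((cluster_labels == c) & (true_labels == k))
            if n_ij > 0:
                mi += n_ij * np.log(n * n_ij / (n_i[a] * n_j[b]))
    denom = np.sqrt(np.sum(n_i * np.log(n_i / n)) * np.sum(n_j * np.log(n_j / n)))
    return float(mi / denom)
```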

4.2.2 Compared methods

We compared the data clustering performance of the NMF2L algorithm with that of several state-of-the-art algorithms on the same datasets. The evaluated algorithms are listed below:

  • Traditional k-means clustering algorithm (KM).

  • Principal Component Analysis(PCA) [1].

  • Nonnegative Matrix Factorization (NMF) [18].

  • Nonnegative Matrix Factorization with Sparseness Constraints (NMFSC) [14].

  • Graph regularized Non-negative Matrix Factorization (GNMF) [5].

  • Nonnegative Local Coordinate Factorization (NLCF) [10] for feature extraction.

  • Our proposed Nonnegative Matrix Factorization by Joint Locality-constrained and 2,1-norm Regularization (NMF2L).

4.2.3 Parameter settings

We evaluated the clustering performance on three face datasets. For each dataset, the evaluation was performed for different numbers of clusters, and the dimensionality of the new space was set to the number of clusters. For NMF, once the number of clusters is given, no parameter selection is required. The regularization parameters were searched over the grid {0.001, 0.01, 0.1, 1, 10, 100, 500, 1000}. The neighborhood size in GNMF was tuned over the grid {2, 3, ⋯, 10}.

For a fixed cluster number k, we randomly chose k classes from the dataset and used the different algorithms to obtain new data representations V. K-means was then performed on the new representation V. Because k-means depends on initialization, it was repeated 20 times with random initializations and the average result was reported, as sketched below. We compared the obtained clusters with the original image classes to compute Acc and NMI.
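The evaluation protocol can be summarized by the following sketch, which reuses the clustering_accuracy and nmi helpers defined above and assumes scikit-learn's KMeans; the function name and the seeding scheme are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def evaluate_representation(V, true_labels, n_clusters, n_runs=20, seed=0):
    """K-means on the learned representation V, averaged over repeated random initializations."""
    accs, nmis = [], []
    for run in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed + run)
        pred = km.fit_predict(V.T)   # columns of V are the new sample representations
        accs.append(clustering_accuracy(true_labels, pred))
        nmis.append(nmi(true_labels, pred))
    return float(np.mean(accs)), float(np.mean(nmis))
```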

4.3 Clustering results

Tables 1, 2 and 3 report the Acc and NMI values for different numbers of clusters and the various algorithms. The last row of each table gives the average clustering result over k; mean values and standard deviations are reported.

Table 1 Clustering performance on the ORL dataset
Table 2 Clustering performance on the Yale dataset
Table 3 Clustering performance on the PIE dataset

On the ORL data set, our NMF2L algorithm achieves a performance gain of 9.08% in Acc and 9.96% in NMI over the best of the other algorithms. On the Yale data set, our proposed algorithm achieves a performance gain of 5.31% in Acc and 6.97% in NMI over the best of the other algorithms. On the PIE data set, our algorithm significantly outperforms the other six algorithms in terms of both Acc and NMI.

4.4 Parameter sensitivity

Parameter selection for unsupervised algorithms is one of the challenges in machine learning. In this part, we studied the clustering results of NMF2L with respect to different parameter settings. Our algorithm has two regularization parameters, denoted μ and λ in (5). We plotted the Acc and NMI of k-means for different values of μ and λ over the grid {0.0001, 0.01, 0.1, 1, 10, 100, 1000}. In this experiment, we randomly chose half of the samples per class for parameter selection and averaged the clustering results over 20 runs with random initializations. The average results are shown as 3D plots in Figs. 1, 2, and 3, where the horizontal axes denote the values of the parameters μ and λ and the vertical axis is the evaluation metric. In the 3D plots, the square/circle marker along the X/Y-axis indicates the best μ/λ for the varying parameter, and the number beside the intersection point gives the corresponding Acc or NMI value.

Fig. 1 Acc and NMI of k-means on the ORL data set

Fig. 2 Acc and NMI of k-means on the Yale data set

Fig. 3 Acc and NMI of k-means on the PIE data set

4.5 Convergence analysis

In this subsection, we performed an experiment to validate the convergence of our algorithm. Following the above experiments, we randomly chose half of the samples per class for the convergence analysis. The two regularization parameters μ and λ were both fixed at 10. We compared NMF and NMF2L to study the convergence speed. The convergence curves on all three datasets are shown in Fig. 4, where the horizontal axis is the number of iterations and the vertical axis is the value of the objective function. As shown in Fig. 4, the objective function value becomes stable after about 100 iterations on all datasets, which demonstrates the efficiency of our algorithm.

Fig. 4 Convergence curves of NMF and NMF2L: a ORL, b Yale, c PIE

4.6 Overall observations and discussion

In the clustering experiments above, we considered several groups of experiments based on different face databases. From the experimental results, we can draw the following observations:

  1. The performance of k-means is the worst among all algorithms, which shows that feature extraction is necessary to enhance the clustering performance. The performance of PCA is inferior to the NMF-based algorithms, which demonstrates the superiority of the parts-based representation. In most cases, GNMF performs better than NMF, which verifies the important role of the geometric structure in matrix decomposition; however, Table 2 shows that this is not the case on the Yale dataset. A possible explanation is that GNMF cannot guarantee that nearby points have the same class labels, and therefore graph-regularized NMF may even have negative effects.

  2. On all datasets, our NMF2L algorithm is superior to all the other algorithms. The reason lies in the fact that NMF2L uses the locality constraint to preserve the geometric structure and employs the 2,1-norm to generate row sparsity.

  3. We can see from Figs. 1, 2, and 3 that the clustering performance varies with different combinations of μ and λ; the impact of the regularization parameter values depends on the characteristics of the data set. We can also see from Fig. 4 that the objective function value converges rapidly.

5 Conclusion

In this paper, we have presented a novel matrix factorization algorithm, called Non-negative Matrix Factorization by Joint Locality-constrained and 2,1-norm Regularization (NMF2L), for feature extraction, in which feature selection and non-negative local coordinate factorization are solved simultaneously by optimizing a single objective function. Experimental results on three datasets have demonstrated the effectiveness of our approach compared with other matrix factorization algorithms. Future research on this topic includes: 1) how to apply the method to large-scale, real-life applications; and 2) how to extend the current framework to tensor-based nonnegative data decomposition.