1 Introduction

Data representation [15, 16, 27] is a fundamental issue in machine learning, computer vision, and related fields. In many applications, the data are high dimensional. Traditional methods that perform well in low-dimensional spaces can become entirely impractical in high-dimensional feature spaces. Therefore, dimensionality reduction has become increasingly important, since it can alleviate the curse of dimensionality, accelerate the learning process, and even provide significant insights into the nature of the problem. Generally speaking, dimensionality reduction techniques [1, 18, 19, 25, 26, 29, 33] can be divided into two categories: feature extraction [1, 18, 19, 33] and feature selection [6, 8, 17]. Feature extraction combines all original features to form new representations, while feature selection tries to select a subset of the most discriminative features. Compared with feature selection, which does not change the original representations of the data, feature extraction creates new features.

The most popular feature extraction approaches include Principal Component Analysis (PCA) [1, 22], Nonnegative Matrix Factorization (NMF) [7, 18, 19, 23, 31], Singular Value Decomposition (SVD), and Concept Factorization (CF) [33]. Although these methods have different motivations, they can all be interpreted as matrix decompositions, which find two or more lower-dimensional matrices whose product approximates the original one. The factorization leads to a reduced representation of the initial data and thus belongs to the family of dimensionality reduction techniques.

Unlike PCA [1] and SVD, NMF [18, 19] factorizes the original data matrix into the product of two matrices whose elements are constrained to be nonnegative. One matrix consists of basis vectors that reveal the latent semantic structure, and the other can be regarded as the coefficients, so that each sample point is a linear combination of the bases. NMF can be viewed as a parts-based representation of the data because only additive, not subtractive, combinations are allowed. Such a representation encodes the data using few components, which makes the encoding easy to interpret. Due to its ability to extract discriminative features and its computational feasibility, NMF and its extensions [5, 11, 21] have been widely applied in computer vision, especially to face recognition tasks.

Data locality has been widely exploited in many machine learning problems such as dimensionality reduction [16, 28], clustering [10, 20], and classification [9, 30, 32, 36, 37]. NMF yields sparse codings in which each data point is a linear combination of a few basis vectors. However, the sparsity achieved by NMF does not always respect data locality. As suggested by Yu et al. [20], locality necessarily leads to sparsity, but not necessarily vice versa. It has been shown in [36] that imposing a locality constraint implies sparsity of the encoding matrix, since only the basis vectors close to the input data point are chosen for its representation. In NMF, a sample point may be reconstructed from basis vectors that are far from it, which can result in unsatisfactory classification results. The standard NMF does not preserve locality during its decomposition process, whereas local coordinate coding [20] can preserve such properties.

Sparsity regularization of a given loss function has been widely investigated. Bradley et al. [3] proposed 1-SVM, which performs feature selection by adopting the 1-norm and thus leads to sparse solutions. Hoyer [14] extended NMF with explicit sparsity constraints by imposing a 1-norm penalty on the coefficient and basis matrices, which allows sparser representations than those obtained by standard NMF. Cai et al. [4] proposed a unified sparse subspace learning (SSL) approach based on 1-norm regularization. The shortcoming of 1-norm regularization is that it cannot ensure that all data vectors are sparse in the same features, so it is not well suited to feature selection. To address this issue, Nie et al. [24] proposed a robust feature selection approach by imposing the 2,1-norm on both the loss function and the regularization term. Gu et al. [12], Hou et al. [13], and Yang et al. [35] used the 2,1-norm in subspace learning, sparse regression, and discriminative feature selection, respectively. The 2,1-norm regularization term induces row sparsity while exploiting the correlations among all the features.

In this paper we present a novel matrix factorization approach, called Non-negative Matrix Factorization by Joint Locality-constrained and 2,1-norm Regularization (NMF2L), which imposes row sparsity and locality constraints at the same time. The contributions of this paper are summarized as follows.

  1. By requiring the basis vectors to be as close as possible to the original input data points, we incorporate local coordinate coding [36] into the non-negative matrix factorization objective function. By adding the 2,1-norm regularization, we obtain a row-sparse coefficient matrix.

  2. The proposed NMF2L performs feature selection on the coefficient matrix, rather than on the original data matrix as in traditional feature selection methods.

  3. We provide an efficient and effective multiplicative updating procedure for the NMF2L approach, together with a rigorous convergence analysis.

The rest of this paper is organized as follows. Section 2 reviews NMF and Nonnegative Sparse Coding (NNSC). Section 3 introduces our NMF2L method, the optimization scheme, and the convergence study. Section 4 describes and analyzes the experimental results. Finally, Section 5 concludes the paper and discusses future work.

2 Related works

In this section, we briefly review NMF and NNSC.

2.1 NMF

NMF is a decomposition approach for data matrices whose elements are nonnegative. Given a nonnegative matrix \(\mathbf {X}=[\mathbf {x}_{1},\mathbf {x}_{2},\cdots ,\mathbf {x}_{N}]\in \mathbb {R}_{+}^{M\times N}\), each column of X is a data point. NMF aims to find two nonnegative matrices \(\mathbf {U}=[u_{ik}]\in \mathbb {R}_{+}^{M\times K}\) and \(\mathbf {V}=[v_{jk}]\in \mathbb {R}_{+}^{K\times N}\) that solve the following optimization problem:

$$ \begin{array}{r} \mathcal{J}_{NMF} =\|\mathbf{X}-\mathbf{U}\mathbf{V}\|^{2}_{F}, \\ s.t.\ \ \mathbf{U}\geq 0, \mathbf{V}\geq 0, \end{array} $$
(1)

where ∥⋅∥ F is the Frobenius norm. Although the loss function \(\mathcal {J}_{NMF}\) is convex in U alone or in V alone, it is not convex in both matrices jointly. Therefore, it is unrealistic to expect an algorithm to find the global optimum of \(\mathcal {J}_{NMF}\). To solve the optimization problem, Lee et al. [18] proposed the following iterative update rules:

$$\begin{array}{@{}rcl@{}} u_{jk}&\leftarrow& u_{jk}\frac{(\mathbf{X}\mathbf{V}^{T})_{jk}}{(\mathbf{U}\mathbf{V}\mathbf{V}^{T})_{jk}}, \\ v_{ki}&\leftarrow& v_{ki}\frac{(\mathbf{U}^{T}\mathbf{X})_{ki}}{(\mathbf{U}^{T}\mathbf{U}\mathbf{V})_{ki}}. \end{array} $$
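As a concrete illustration, the following sketch implements these multiplicative updates in NumPy (the function name, the random initialization, and the small constant added for numerical stability are our own choices, not part of the original algorithm description):

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=200, eps=1e-10, seed=0):
    """Basic NMF via the Lee-Seung multiplicative updates shown above (a sketch)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps   # nonnegative basis matrix
    V = rng.random((K, N)) + eps   # nonnegative coefficient matrix
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update for u_jk
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update for v_ki
    return U, V
```

Each factor stays nonnegative because both updates only multiply nonnegative entries by nonnegative ratios.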

2.2 Non-negative sparse coding (NNSC)

NMF produces sparse representations that encode the data with only a few basis vectors. This property further promotes interpretability in practical problems. However, the sparseness introduced by nonnegativity alone may not be sufficient and is difficult to control. To address this difficulty, the Nonnegative Sparse Coding (NNSC) [14] method decomposes multivariate data into a set of nonnegative sparse components, using a 1-norm penalty to measure sparseness. Combining the reconstruction loss with this sparseness constraint, NNSC solves the following optimization problem:

$$\begin{array}{@{}rcl@{}} \mathcal{J}_{NNSC} =\|\mathbf{X}-\mathbf{U}\mathbf{V}\|^{2}_{F}+\lambda\|\mathbf{V}\|_{1}, \\ s.t.\ \ \mathbf{U}\geq 0, \mathbf{V}\geq 0, \end{array} $$
(2)

where ∥⋅∥1 is the 1-norm and λ is the tradeoff parameter. The loss function can be minimized with the following update rules:

$$\begin{array}{@{}rcl@{}} \mathbf{U}\leftarrow \mathbf{U}-\mu(\mathbf{U}\mathbf{V}-\mathbf{X})\mathbf{V}^{T}, \\ v_{ki}\leftarrow v_{ki}\frac{(\mathbf{U}^{T}\mathbf{X})_{ki}}{((\mathbf{U}^{T}\mathbf{U}\mathbf{V})_{ki}+\lambda)}, \end{array} $$

where μ denotes the step-size.
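A minimal sketch of these NNSC updates is given below, assuming NumPy; the explicit projection of U back onto the nonnegative orthant is an extra safeguard we add (the additive gradient step above can otherwise produce small negative entries):

```python
import numpy as np

def nnsc(X, K, lam=0.1, step=1e-3, n_iter=500, eps=1e-10, seed=0):
    """NNSC of Eq. (2): gradient step for U, multiplicative update for V (a sketch)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps
    V = rng.random((K, N)) + eps
    for _ in range(n_iter):
        U = U - step * (U @ V - X) @ V.T            # additive gradient step on U
        U = np.maximum(U, eps)                      # keep U nonnegative (our safeguard)
        V *= (U.T @ X) / (U.T @ U @ V + lam + eps)  # sparsity-penalized update for V
    return U, V
```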

3 Non-negative matrix factorization by joint locality-constrained and 2,1-norm regularization

In this section, we present a novel and effective approach for data representation. To this end, we consider two regularization terms, one preserving locality and one generating row sparsity of the coefficient matrix. We introduce them in sequence.

3.1 The objective function

The first regularization term is motivated by the concept of local coordinate coding [36]. We give the definition of coordinate coding below.

Definition

A coordinate coding is a pair (γ,C), where \(C\subset \mathbb {R}^{d}\) is a set of anchor points, and γ is a map of \(\mathbf {x}\in \mathbb {R}^{d}\) to \([\gamma_{v}(\mathbf{x})]_{v\in C}\in \mathbb{R}^{|C|}\). It induces the following physical approximation of x in \(\mathbb {R}^{d}\): \(\gamma (\mathbf {x})={\sum }_{v\in C}\gamma _{v}(\mathbf {x})v\).

On the basis of this definition, NMF can be considered a coordinate coding in which the columns of the basis matrix U serve as a set of anchor points, and each column of V contains the coordinates of the corresponding data point with respect to the anchor points. In order to preserve the local structure of the data, only a few anchor points close to the original data point should be chosen for its representation. The local coordinate constraint can be formulated as follows:

$$\begin{array}{@{}rcl@{}} \mathcal{Q}=\sum\limits_{k=1}^{K}|v_{ki}|\|\mathbf{u}_{k}-\mathbf{x}_{i}\|^{2}. \end{array} $$
(3)

The above constraint imposes a heavy penalty if \(\mathbf{x}_{i}\) is far away from the basis vector \(\mathbf{u}_{k}\) while its coordinate \(v_{ki}\) with respect to \(\mathbf{u}_{k}\) is large.
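For clarity, the locality penalty can be evaluated directly from U, V, and X. The sketch below (our own helper, assuming NumPy) computes the penalty summed over all samples, as it appears later in the overall objective (5):

```python
import numpy as np

def locality_penalty(X, U, V):
    """sum_i sum_k |v_ki| * ||u_k - x_i||^2, i.e. the locality term of Eq. (3) over all samples."""
    # dist2[k, i] = ||u_k - x_i||^2, with U of shape (M, K) and X of shape (M, N).
    dist2 = ((U[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # shape (K, N)
    return float(np.sum(np.abs(V) * dist2))
```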

We choose the second regularization term to distinguish the importance of different features. After the iterative updates, it is desirable that significant features be represented by non-zero values and insignificant features by zeros. Each row of the coefficient matrix V corresponds to a feature in the original space. This motivates us to add a 2,1-norm regularization on the coefficient matrix V, which drives many rows of V toward zero. We then keep the important features (i.e., those with non-zero values) and discard the unimportant ones. The 2,1-norm, used as the second regularization term, is given as follows:

$$\begin{array}{@{}rcl@{}} \|\mathbf{V}\|_{2,1}=\sum\limits_{j=1}^{K}\|\mathbf{V}^{(j)}\|_{2}, \end{array} $$
(4)

where \(\mathbf{V}^{(j)}\) is the j-th row of matrix V, which reflects the importance of the j-th feature to all the data points.
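Computationally, this norm is simply the sum of the Euclidean norms of the rows of V, as the short helper below illustrates (names are ours):

```python
import numpy as np

def l21_norm(V):
    """||V||_{2,1} of Eq. (4): sum of the 2-norms of the rows of V."""
    return float(np.sum(np.sqrt(np.sum(V ** 2, axis=1))))
```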

By integrating (3) and (4) into the traditional NMF, the overall loss function of NMF2L is defined as:

$$\begin{array}{@{}rcl@{}} \mathcal{O}&=&\|\mathbf{X}-\mathbf{U}\mathbf{V}\|_{F}^{2} +\mu\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}|v_{ki}|\|\mathbf{u}_{k}-\mathbf{x}_{i}\|^{2}+\lambda\|\mathbf{V}\|_{2,1}, \\ s.t.\ \ \mathbf{U}&=&[\mathbf{u}_{1},\cdots,\mathbf{u}_{K}]\in \mathbb{R}^{M\times K}\geq 0, \\ \mathbf{V}&=&[\mathbf{v}_{1},\cdots,\mathbf{v}_{N}]\in \mathbb{R}^{K\times N}\geq 0, \end{array} $$
(5)

where μ and λ are positive regularization parameters. We call (5) Nonnegative Matrix Factorization by Joint Locality-constrained and 2,1-norm Regularization (NMF2L). When μ = 0 and λ = 0, (5) degenerates to the original NMF.
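The sketch below evaluates the objective (5) for given factors by combining the reconstruction error with the two regularization terms; it is only an illustrative helper under the same NumPy assumption as above:

```python
import numpy as np

def nmf2l_objective(X, U, V, mu, lam):
    """Value of the NMF2L objective in Eq. (5)."""
    recon = np.linalg.norm(X - U @ V, 'fro') ** 2                   # ||X - UV||_F^2
    dist2 = ((U[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)      # ||u_k - x_i||^2, shape (K, N)
    locality = np.sum(np.abs(V) * dist2)                            # locality term, Eq. (3)
    row_sparsity = np.sum(np.sqrt(np.sum(V ** 2, axis=1)))          # ||V||_{2,1}, Eq. (4)
    return float(recon + mu * locality + lam * row_sparsity)
```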

3.2 The update rules

The loss function \(\mathcal {O}\) of NMF2L in (5) is not convex in U and V jointly. Therefore, it is unrealistic to find an algorithm that attains the global optimal solution. In the following, we give an iterative algorithm that reaches a local optimum. After some algebraic manipulation, the objective function can be rewritten as follows:

$$\begin{array}{@{}rcl@{}} \mathcal{O}=\|\mathbf{X}-\mathbf{U}\mathbf{V}\|_{F}^{2}+\mu\sum\limits_{i=1}^{N}\|(\mathbf{x}_{i}1^{T} -\mathbf{U})\boldsymbol{\Lambda}_{i}^{1/2}\|^{2}+\lambda\|\mathbf{V}\|_{2,1}, \\ s.t.\ \ \mathbf{U}\geq 0,\ \mathbf{V}\geq 0, \end{array} $$
(6)

where \(\boldsymbol {\Lambda }_{i}=diag(|\mathbf {v}_{i}|)\in \mathbb {R}^{K\times K}\) and \(\mathbf{v}_{i}\) is the i-th column of V. According to the matrix properties \(\text{Tr}(\mathbf{A}\mathbf{B})=\text{Tr}(\mathbf{B}\mathbf{A})\), \(\|\mathbf {A}\|_{F}^{2}=\text{Tr}(\mathbf {A}^{T}\mathbf {A})\), and \(\text{Tr}(\mathbf{A})=\text{Tr}(\mathbf{A}^{T})\), we have

$$\begin{array}{@{}rcl@{}} \mathcal{O}=\text{Tr}\left( \mathbf{X}\mathbf{X}^{T}+\mathbf{U}\mathbf{V}\mathbf{V}^{T}\mathbf{U}^{T}-2\mathbf{X}\mathbf{V}^{T}\mathbf{U}^{T} +\mu\sum\limits_{i=1}^{N}(\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\mathbf{1}\mathbf{x}_{i}^{T}\right.\\ \left.-2\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\mathbf{U}^{T}+\mathbf{U}\boldsymbol{\Lambda}_{i}\mathbf{U}^{T})\!\vphantom{\sum\limits_{i=1}^{N}}\right)\! +\lambda\|\mathbf{V}\|_{2,1}. \end{array} $$
(7)

Since U ≥ 0 and V ≥ 0, we introduce the Lagrange multipliers \(\boldsymbol{\Psi}=[\psi_{jk}]\) and \(\boldsymbol{\Phi}=[\phi_{ki}]\). The Lagrangian can then be written as

$$\begin{array}{@{}rcl@{}} \mathcal{L}=\text{Tr}\left( \mathbf{X}\mathbf{X}^{T}+\mathbf{U}\mathbf{V}\mathbf{V}^{T}\mathbf{U}^{T}-2 \mathbf{X}\mathbf{V}^{T}\mathbf{U}^{T} +\mu\sum\limits_{i=1}^{N}\left( \mathbf{x}_{i}\mathbf{1}^{T} \boldsymbol{\Lambda}_{i}\mathbf{1} \mathbf{x}_{i}^{T}\right.\right. \\ \left.\left.-2\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\mathbf{U}^{T}+\mathbf{U}\boldsymbol{\Lambda}_{i}\mathbf{U}^{T}\right)\right) +\lambda\|\mathbf{V}\|_{2,1} \\ -\text{Tr}(\boldsymbol{\Psi}\mathbf{U}^{T})-\text{Tr}(\boldsymbol{\Phi}\mathbf{V}^{T}). \end{array} $$
(8)

Setting \(\frac {\partial \mathcal {L}}{\partial \mathbf {U}}=0\) and \(\frac {\partial \mathcal {L}}{\partial \mathbf {V}}=0\), we obtain

$$\begin{array}{@{}rcl@{}} \boldsymbol{\Psi} = 2\mathbf{U}\mathbf{V}\mathbf{V}^{T}-2\mathbf{X}\mathbf{V}^{T}+ \mu\sum\limits_{i=1}^{N}(-2\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}+2\mathbf{U}\boldsymbol{\Lambda}_{i}), \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} \boldsymbol{\Phi} = 2\mathbf{U}^{T}\mathbf{U}\mathbf{V}-2\mathbf{U}^{T}\mathbf{X}+\mu(\mathbf{C}-2\mathbf{U}^{T}\mathbf{X} +\mathbf{D})+\lambda\mathbf{G}\mathbf{V}, \end{array} $$
(10)

where G is a diagonal matrix whose i-th diagonal element is \(\mathbf {G}_{ii}=\frac {1}{2\|\mathbf {V}^{(i)}\|}\). Define the column vector \(\mathbf {c}=diag(\mathbf {X}^{T}\mathbf {X})\in \mathbb {R}^{N}\) and let \(\mathbf{C}=(\mathbf{c},\cdots,\mathbf{c})^{T}\) be the K × N matrix whose rows all equal \(\mathbf{c}^{T}\). Similarly, define the column vector \(\mathbf {d}=diag(\mathbf {U}^{T}\mathbf {U})\in \mathbb {R}^{K}\) and let \(\mathbf{D}=(\mathbf{d},\cdots,\mathbf{d})\) be the K × N matrix whose columns all equal d.
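These auxiliary matrices are cheap to form explicitly. The following sketch builds G, C, and D from X, U, and V (the small constant guarding against division by zero for all-zero rows of V is our own addition):

```python
import numpy as np

def build_G_C_D(X, U, V, eps=1e-10):
    """Auxiliary matrices used in (10)-(14): diagonal G, and the K x N matrices C and D."""
    K, N = V.shape
    row_norms = np.sqrt(np.sum(V ** 2, axis=1)) + eps   # ||V^(i)|| for each row of V
    G = np.diag(1.0 / (2.0 * row_norms))                # G_ii = 1 / (2 ||V^(i)||)
    c = np.sum(X * X, axis=0)                           # diag(X^T X), length N
    C = np.tile(c, (K, 1))                              # every row of C equals c^T
    d = np.sum(U * U, axis=0)                           # diag(U^T U), length K
    D = np.tile(d[:, None], (1, N))                     # every column of D equals d
    return G, C, D
```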

Applying the Karush-Kuhn-Tucker conditions [2] \(\psi_{jk}u_{jk}=0\) and \(\phi_{ki}v_{ki}=0\), we obtain the following equations:

$$\begin{array}{@{}rcl@{}} (\mathbf{U}\mathbf{V}\mathbf{V}^{T})_{jk}u_{jk}-(\mathbf{X}\mathbf{V}^{T})_{jk}u_{jk} +\mu\left( \sum\limits_{i=1}^{N}\mathbf{U}\boldsymbol{\Lambda}_{i}\right)_{jk}u_{jk} \\ -\mu\left( \sum\limits_{i=1}^{N}\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\right)_{jk}u_{jk}=0, \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} 2(\mathbf{U}^{T}\mathbf{U}\mathbf{V})_{ki}v_{ki}-2(\mathbf{U}^{T}\mathbf{X})_{ki}v_{ki}+\lambda(\mathbf{G}\mathbf{V})_{ki}v_{ki} \\ +\mu(\mathbf{C}-2\mathbf{U}^{T}\mathbf{X}+\mathbf{D})_{ki}v_{ki}=0. \end{array} $$
(12)

We can achieve the following update rules:

$$\begin{array}{@{}rcl@{}} u_{jk} \leftarrow u_{jk}\frac{\left( \mathbf{X}\mathbf{V}^{T}+\mu{\sum}_{i=1}^{N}\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\right)_{jk}} {\left( \mathbf{U}\mathbf{V}\mathbf{V}^{T}+\mu{\sum}_{i=1}^{N} \mathbf{U}\boldsymbol{\Lambda}_{i}\right)_{jk}}, \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} v_{ki}\leftarrow v_{ki}\frac{2(\mu+1)(\mathbf{U}^{T}\mathbf{X})_{ki}} {(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+\mu\mathbf{D}+\lambda\mathbf{G}\mathbf{V})_{ki}}. \end{array} $$
(14)
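Putting the pieces together, a minimal NumPy sketch of the multiplicative updates (13) and (14) is given below. It is not the authors' implementation; in particular it uses the simplifications \({\sum }_{i}\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}=\mathbf{X}\mathbf{V}^{T}\) and \({\sum }_{i}\mathbf{U}\boldsymbol{\Lambda}_{i}=\mathbf{U}\,diag(\mathbf{V}\mathbf{1})\), which hold because V is nonnegative, and it adds a small constant for numerical stability:

```python
import numpy as np

def nmf2l(X, K, mu=1.0, lam=1.0, n_iter=300, eps=1e-10, seed=0):
    """Multiplicative updates (13) and (14) for NMF2L (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps
    V = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # Update (13). With V >= 0: sum_i x_i 1^T Lambda_i = X V^T and sum_i U Lambda_i = U diag(V 1).
        XVt = X @ V.T
        num_U = XVt + mu * XVt
        den_U = U @ V @ V.T + mu * (U * V.sum(axis=1)) + eps   # U * row sums of V = U diag(V 1)
        U *= num_U / den_U

        # Update (14), using the auxiliary matrices G, C, D defined above.
        G_diag = 1.0 / (2.0 * (np.sqrt(np.sum(V ** 2, axis=1)) + eps))   # diagonal of G
        C = np.tile(np.sum(X * X, axis=0), (K, 1))                       # rows are diag(X^T X)^T
        D = np.tile(np.sum(U * U, axis=0)[:, None], (1, N))              # columns are diag(U^T U)
        num_V = 2.0 * (mu + 1.0) * (U.T @ X)
        den_V = 2.0 * (U.T @ U @ V) + mu * C + mu * D + lam * (G_diag[:, None] * V) + eps
        V *= num_V / den_V
    return U, V
```

Each iteration is dominated by the dense matrix products, so its cost grows as O(MNK).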

3.3 Convergence analysis

In this section, we apply the auxiliary function method [19] to prove convergence. We first recall the definition of an auxiliary function.

Definition 1

\(Z(h,h^{\prime})\) is an auxiliary function of F(h) if the conditions

$$\begin{array}{@{}rcl@{}} Z(h,h^{\prime})\geq F(h), && Z(h,h)=F(h) \end{array} $$

are satisfied.

Lemma 1

If Z is an auxiliary function for F, then F is non-increasing under the update

$$\begin{array}{@{}rcl@{}} h^{(t+1)} &=& \arg \underset{h}{\min} Z(h,h^{(t)}). \end{array} $$

Proof.

$$\begin{array}{@{}rcl@{}} F(h^{(t+1)})\leq Z(h^{(t+1)},h^{(t)}) \leq Z(h^{(t)},h^{(t)})=F(h^{(t)}). \end{array} $$

The convergence of the algorithm is addressed in the following.

For any element \(v_{ab}\) in V, we use \(F_{v_{ab}}\) to denote the part of \(\mathcal {O}\) that depends only on \(v_{ab}\). It is easy to check that

$$\begin{array}{@{}rcl@{}} F^{\prime}_{v_{ab}}=(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}-2\mathbf{U}^{T}\mathbf{X} +\mu(\mathbf{C}- 2\mathbf{U}^{T}\mathbf{X} +\mathbf{D})+\lambda\mathbf{G}\mathbf{V})_{ab} ,\\ F^{\prime\prime}_{v_{ab}}=2(\mathbf{U}^{T}\mathbf{U})_{aa}+\lambda(\mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot \mathbf{V}))_{aa}, \end{array} $$

where ⊙ denotes the element-wise multiplication.

Theorem 1

The function

$$\begin{array}{@{}rcl@{}} Z\left( v,v_{ab}^{(t)}\right) &=&F_{v_{ab}}\left( v_{ab}^{(t)}\right)+F^{\prime}_{v_{ab}}\left( v_{ab}^{(t)}\right)\left( v-v_{ab}^{(t)}\right)\\ && +\frac{(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+ \mu\mathbf{D}+\lambda\mathbf{G}\mathbf{V})_{ab}}{v_{ab}^{(t)}}\left( v-v_{ab}^{(t)}\right)^{2} \end{array} $$
(15)

is an auxiliary function for \(F_{v_{ab}}\) .

Proof

Since \(Z(v,v)=F_{v_{ab}}(v)\) holds trivially, we only need to show that \(Z(v,v_{ab}^{(t)})\geq F_{v_{ab}}(v)\). To do this, we compare the Taylor series expansion of \(F_{v_{ab}}(v)\)

$$\begin{array}{@{}rcl@{}} F_{v_{ab}}(v)=F_{v_{ab}}\left( v_{ab}^{(t)}\right)+F^{\prime}_{v_{ab}}\left( v_{ab}^{(t)}\right)\left( v-v_{ab}^{(t)}\right) + \left( (\mathbf{U}^{T}\mathbf{U})_{aa}\right. \\ \left.+\lambda\frac{\left( \mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V})\right)_{aa}}{2}\right)\left( v-v_{ab}^{(t)}\right)^{2} \end{array} $$
(16)

with (15) to find that \(Z(v,v_{ab}^{(t)})\geq F_{v_{ab}}(v)\) is equivalent to

$$\begin{array}{@{}rcl@{}} \frac{(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+ \mu\mathbf{D}+\lambda\mathbf{G}\mathbf{V})_{ab}}{v_{ab}^{(t)}} \geq (\mathbf{U}^{T}\mathbf{U})_{aa}\\ +\frac{\lambda(\mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V}))_{aa}}{2}. \end{array} $$
(17)

We have

$$\begin{array}{@{}rcl@{}} (2\mathbf{U}^{T}\mathbf{U}\mathbf{V})_{ab}=2\sum\limits_{l=1}^{K}(\mathbf{U}^{T}\mathbf{U})_{al}v_{lb}^{(t)} \geq(\mathbf{U}^{T}\mathbf{U})_{aa}v_{ab}^{(t)} \end{array} $$

and

$$\begin{array}{@{}rcl@{}} \lambda(\mathbf{G}\mathbf{V})_{ab} &=&\lambda\sum\limits_{l=1}^{K}\mathbf{G}_{al}v_{lb}^{(t)}\geq \lambda\mathbf{G}_{aa}v_{ab}^{(t)} \\ &\geq& \lambda(\mathbf{G}_{aa}-(\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V}))_{aa})v_{ab}^{(t)}\\ &=&\lambda(\mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V}))_{aa}v_{ab}^{(t)}\\ &\geq& \frac{\lambda(\mathbf{G}-\mathbf{G}^{3}(\mathbf{V}\odot\mathbf{V}))_{aa}}{2}v_{ab}^{(t)}. \end{array} $$
(18)

Thus, (17) holds and \(Z(v,v_{ab}^{(t)})\geq F_{v_{ab}}(v)\). □

Theorem 2

Equation (14) can be obtained by minimizing the function \(Z(v,v_{ab}^{(t)})\), where \(v_{ab}^{(t)}\) is the iterative solution at the t-th step.

Proof

To obtain the minimum, we only need to set the derivative \(\frac {\partial Z(v,v_{ab}^{(t)})}{\partial v_{ab}}=0\), which gives

$$\begin{array}{@{}rcl@{}} \frac{\partial Z\left( v_{ab},v_{ab}^{(t)}\right)}{\partial v_{ab}}&=&F^{\prime}_{v_{ab}}\left( v_{ab}^{(t)}\right)\\ &&+\frac{2(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+\mu\mathbf{D}+\lambda\mathbf{G} \mathbf{V})_{ab}}{v_{ab}^{(t)}}\left( v_{ab}-v_{ab}^{(t)}\right)=0. \end{array} $$
(19)

Thus, by simple algebra, we obtain (14).

From Theorem 1, we know that \(Z(v,v_{ab}^{(t)})\) is an auxiliary function for \(F_{v_{ab}}\). According to Lemma 1 and Theorem 2, updating \(v_{ab}\) using (14) monotonically decreases the objective function in (7); therefore it converges to a local optimum.

The proof that updating U using (13) also decreases the objective function is similar to the above. □

3.4 Connection to gradient descent method

The objective function of NMF2L can also be minimized by a gradient descent algorithm, which yields the following update rules:

$$\begin{array}{@{}rcl@{}} u_{jk}&\leftarrow& u_{jk}+\eta_{jk}\frac{\partial\mathcal{O}}{\partial u_{jk}}, \end{array} $$
(20)
$$\begin{array}{@{}rcl@{}} v_{ki}&\leftarrow& v_{ki}+\delta_{ki}\frac{\partial\mathcal{O}}{\partial v_{ki}}, \end{array} $$
(21)

where \(\eta_{jk}\) and \(\delta_{ki}\) are the step-size parameters.

It is difficult to set these step-size parameters while maintaining the non-negativity of \(u_{jk}\) and \(v_{ki}\). Instead, we set

$$\begin{array}{@{}rcl@{}} \eta_{jk}=-\frac{u_{jk}}{2(\mathbf{U}\mathbf{V}\mathbf{V}^{T}+\mu{\sum}_{i=1}^{N}\mathbf{U} \boldsymbol{\Lambda}_{i})_{jk}}, \end{array} $$
(22)
$$\begin{array}{@{}rcl@{}} \delta_{ki}=-\frac{v_{ki}}{(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+\mu\mathbf{D}+\lambda\mathbf{G} \mathbf{V})_{ki}}, \end{array} $$
(23)

then we can obtain

$$\begin{array}{@{}rcl@{}} u_{jk}+\eta_{jk}\frac{\partial\mathcal{O}}{\partial u_{jk}}=u_{jk}\frac{\left( \mathbf{X}\mathbf{V}^{T}+ \mu{\sum}_{i=1}^{N}\mathbf{x}_{i}\mathbf{1}^{T}\boldsymbol{\Lambda}_{i}\right)_{jk}} {\left( \mathbf{U}\mathbf{V}\mathbf{V}^{T}+\mu{\sum}_{i=1}^{N}\mathbf{U}\boldsymbol{\Lambda}_{i}\right)_{jk}}, \end{array} $$
(24)
$$\begin{array}{@{}rcl@{}} v_{ki}+\delta_{ki}\frac{\partial\mathcal{O}}{\partial v_{ki}}=v_{ki}\frac{2(\mu+1)(\mathbf{U}^{T}\mathbf{X})_{ki}} {(2\mathbf{U}^{T}\mathbf{U}\mathbf{V}+\mu\mathbf{C}+\mu\mathbf{D}+\lambda\mathbf{G}\mathbf{V})_{ki}}, \end{array} $$
(25)

which are exactly the update rules in (13) and (14). It is clear that (13) and (14) are special cases of gradient descent with particular step sizes.

4 Experiments

In this section, we systematically evaluated NMF2L on the clustering task. Several experiments were performed to demonstrate the effectiveness of our algorithm.

4.1 Data preparation

Three publicly available databases are widely adopted as benchmark datasets. They are described as follows:

The ORL face dataset consists of 10 different face images for each of 40 distinct subjects. All 400 images were captured against a dark homogeneous background with the subjects in an upright, frontal position, with tolerance for some side movement. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses).

The Yale face dataset consists of 165 gray-scale face images of 15 subjects. The faces show variations in lighting condition (left-light, center-light, right-light), facial expression (normal, happy, sad, sleepy, surprised, and wink), and with/without glasses.

The CMU PIE face dataset contains 41,368 face images of 68 subjects in total. The face images were captured by 13 synchronized cameras and 21 flashes under varying poses, illuminations, and expressions. In our experiments, one near-frontal pose (C27) is chosen under different illuminations, lightings, and expressions, which leaves about 49 near-frontal face images for each individual.

In the experiments, the face images were preprocessed so that the faces were located. First, the original face images were normalized in scale and orientation such that the two eyes were aligned at the same position. Then the facial areas were cropped to form the final images for clustering. We resized them to 32 × 32 pixels with 256 gray levels per pixel for computational convenience.

4.2 Experimental design

This section contains the description of evaluation metrics, compared methods, and parameter settings.

4.2.1 Evaluation metrics

In our experiments, we set the number of clusters equal to the number of classes for all algorithms. We evaluated the clustering performance by comparing the clusters obtained by each algorithm with the true classes. The Accuracy (Acc) and the Normalized Mutual Information (NMI) metrics were adopted to measure the clustering results [34].

The Accuracy (Acc) is defined as follows:

$$\begin{array}{@{}rcl@{}} \text{Acc}=\frac{{\sum}_{i=1}^{n}\delta(map(r_{i}),l_{i})}{n}, \end{array} $$
(26)

where \(r_{i}\) is the cluster label of \(\mathbf{x}_{i}\), \(l_{i}\) is the true class label, n denotes the total number of samples, δ(x,y) is the delta function that equals one if x = y and zero otherwise, and \(map(r_{i})\) is the permutation mapping function that maps the obtained cluster label \(r_{i}\) to the equivalent class label of the data set.
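The permutation mapping map(·) is usually obtained by solving a linear assignment problem between clusters and classes. The sketch below (our own helper, using SciPy's Hungarian solver) computes Acc in this way:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Acc of Eq. (26): best one-to-one matching of cluster labels to class labels."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    clusters, classes = np.unique(cluster_labels), np.unique(true_labels)
    # counts[i, j] = number of samples assigned to cluster i whose true class is j.
    counts = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            counts[i, j] = np.sum((cluster_labels == c) & (true_labels == k))
    row_ind, col_ind = linear_sum_assignment(-counts)   # maximize the matched samples
    return counts[row_ind, col_ind].sum() / true_labels.size
```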

Normalized Mutual Information (NMI) is defined as follows:

$$\begin{array}{@{}rcl@{}} \text{NMI}=\frac{{\sum}_{i=1}^{c}{\sum}_{j=1}^{c}n_{i,j}\log\frac{n\,n_{i,j}}{n_{i}\hat{n}_{j}}} {\sqrt{\left( {\sum}_{i=1}^{c}n_{i}\log\frac{n_{i}}{n}\right)\left( {\sum}_{j=1}^{c}\hat{n}_{j}\log\frac{\hat{n}_{j}}{n}\right)}}, \end{array} $$
(27)

where \(n_{i}\) is the number of samples in the i-th cluster \(\mathcal {C}_{i}\) according to the clustering result, \(\hat {n}_{j}\) is the number of samples in the j-th ground-truth class \(\mathcal {C}^{\prime }_{j}\), and \(n_{i,j}\) denotes the number of samples in the intersection of \(\mathcal {C}_{i}\) and \(\mathcal {C}^{\prime }_{j}\).
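The helper below computes NMI directly from the cluster and class label vectors, following Eq. (27) with the geometric-mean normalization (again an illustrative sketch, not the evaluation code of [34]):

```python
import numpy as np

def nmi(true_labels, cluster_labels):
    """NMI of Eq. (27): mutual information normalized by the geometric mean of the entropies."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = true_labels.size
    clusters, classes = np.unique(cluster_labels), np.unique(true_labels)
    n_i = np.array([np.sum(cluster_labels == c) for c in clusters], dtype=float)
    n_j = np.array([np.sum(true_labels == k) for k in classes], dtype=float)
    mi = 0.0
    for a, c in enumerate(clusters):
        for b, k in enumerate(classes):
            n_ij = np.sum((cluster_labels == c) & (true_labels == k))
            if n_ij > 0:
                mi += n_ij * np.log(n * n_ij / (n_i[a] * n_j[b]))
    denom = np.sqrt(np.sum(n_i * np.log(n_i / n)) * np.sum(n_j * np.log(n_j / n)))
    return float(mi / denom)
```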

4.2.2 Compared methods

We compared the data clustering performance of the NMF2L algorithm with that of several state-of-the-art algorithms on the same datasets. The evaluated algorithms are listed below:

  • Traditional k-means clustering algorithm (KM).

  • Principal Component Analysis(PCA) [1].

  • Nonnegative Matrix Factorization (NMF) [18].

  • Nonnegative Matrix Factorization with Sparseness Constraints (NMFSC) [14].

  • Graph regularized Non-negative Matrix Factorization (GNMF) [5].

  • Nonnegative Local Coordinate Factorization (NLCF) [10] for feature extraction.

  • Our proposed Nonnegative Matrix Factorization by Joint Locality-constrained and 2,1-norm Regularization (NMF2L).

4.2.3 Parameter settings

We evaluated the clustering performance on three face datasets. For each dataset, the evaluation was performed for different numbers of clusters, and the dimensionality of the new space was set to the number of clusters. For NMF, once the number of clusters is given, no parameter selection is required. The regularization parameters were searched over the grid {0.001, 0.01, 0.1, 1, 10, 100, 500, 1000}. The neighborhood size in GNMF was tuned over the grid {2, 3, ⋯, 10}.

For a fixed cluster number k, we randomly chose k classes from the dataset and used the different algorithms to obtain new data representations V. K-means was then performed on the new representation V. Because k-means depends on initialization, it was repeated 20 times with random initializations and the average result was reported, as sketched below. We compared the obtained clusters with the original image classes to compute Acc and NMI.
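The evaluation protocol can be summarized by the following sketch, which reuses the clustering_accuracy and nmi helpers defined above and assumes scikit-learn's KMeans; the function name and the seeding scheme are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def evaluate_representation(V, true_labels, n_clusters, n_runs=20, seed=0):
    """K-means on the learned representation V, averaged over repeated random initializations."""
    accs, nmis = [], []
    for run in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed + run)
        pred = km.fit_predict(V.T)   # columns of V are the new sample representations
        accs.append(clustering_accuracy(true_labels, pred))
        nmis.append(nmi(true_labels, pred))
    return float(np.mean(accs)), float(np.mean(nmis))
```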

4.3 Clustering results

Tables 1, 2 and 3 report the Acc and NMI values for different numbers of clusters and the various algorithms. The last row of each table gives the average clustering result over k; mean values and standard deviations are reported.

Table 1 Clustering performance on the ORL dataset
Table 2 Clustering performance on the Yale dataset
Table 3 Clustering performance on the PIE dataset

On the ORL data set, our NMF2L algorithm achieves a performance gain of 9.08% in Acc and 9.96% in NMI over the best of the other algorithms. On the Yale data set, our proposed algorithm achieves a performance gain of 5.31% in Acc and 6.97% in NMI over the best of the other algorithms. On the PIE data set, our algorithm significantly outperforms the other six algorithms in terms of both Acc and NMI.

4.4 Parameter sensitivity

Parameter selection for unsupervised algorithms is one of the challenges in machine learning. In this part, we studied the clustering results of NMF2L with respect to different parameter settings. Our algorithm has two regularization parameters, denoted μ and λ in (5). We plotted the Acc and NMI of k-means for different values of μ and λ over the grid {0.0001, 0.01, 0.1, 1, 10, 100, 1000}. In this experiment, we randomly chose half of the samples per class for parameter selection and averaged the clustering results over 20 runs with random initializations. The average results are shown as 3D plots in Figs. 1, 2, and 3, where the horizontal axes denote the values of the parameters μ and λ and the vertical axis is the evaluation metric. In the 3D plots, the square/circle marker along the X/Y-axis indicates the best μ/λ for the varying parameter, and the number beside the intersection point gives the corresponding Acc or NMI value.

Fig. 1 Acc and NMI of k-means on the ORL data set

Fig. 2 Acc and NMI of k-means on the Yale data set

Fig. 3 Acc and NMI of k-means on the PIE data set

4.5 Convergence analysis

In this subsection, we performed an experiment to validate the convergence of our algorithm. Following the above experiments, we randomly chose half of the samples per class for the convergence analysis. The two regularization parameters μ and λ were both fixed at 10. We compared NMF and NMF2L to study the convergence speed. The convergence curves on all three datasets are shown in Fig. 4, where the horizontal axis is the number of iterations and the vertical axis is the value of the objective function. As shown in Fig. 4, the objective function value becomes stable after about 100 iterations on all datasets, which demonstrates the efficiency of our algorithm.

Fig. 4 Convergence curves of NMF and NMF2L: a ORL, b Yale, c PIE

4.6 Overall observations and discussion

In the clustering experiments above, we considered several groups of experiments based on different face databases. From the experimental results, we can draw the following observations:

  1. The performance of k-means is the worst among all algorithms, which shows that feature extraction is necessary to enhance the clustering performance. The performance of PCA is inferior to the NMF-based algorithms, which demonstrates the superiority of the parts-based representation. In most cases, GNMF performs better than NMF, which verifies the important role of the geometric structure in matrix decomposition; however, Table 2 shows that this is not the case on the Yale dataset. A possible explanation is that GNMF cannot guarantee that nearby points have the same class labels, and therefore graph-regularized NMF may even have negative effects.

  2. On all datasets, our NMF2L algorithm is superior to all the other algorithms. The reason lies in the fact that NMF2L uses the locality constraint to preserve the geometric structure and employs the 2,1-norm to generate row sparsity.

  3. We can see from Figs. 1, 2, and 3 that the clustering performance varies with different combinations of μ and λ; the impact of the regularization parameter values depends on the characteristics of the data set. We can also see from Fig. 4 that the objective function value converges rapidly.

5 Conclusion

In this paper, we have presented a novel matrix factorization algorithm, called Non-negative Matrix Factorization by Joint Locality-constrained and 2,1-norm Regularization (NMF2L), for feature extraction, in which feature selection and non-negative local coordinate factorization are solved simultaneously by optimizing a single objective function. Experimental results on three datasets have demonstrated the effectiveness of our approach compared with other matrix factorization algorithms. Future research on this topic includes: 1) how to apply the method to large-scale, real-life applications; and 2) how to extend the current framework to tensor-based nonnegative data decomposition.