
1 Introduction

Representation learning has attracted increasing attention in the past decade because of its impressive performance in practice [5, 28, 37, 52, 58]. Sparse representation and low-rank representation are two typical representation learning techniques. Given a set of basis vectors (i.e., an over-complete dictionary) and a test sample, sparse representation aims to find the sparsest representation of the test sample among all the linear combinations of dictionary atoms. Therefore, it can reveal the relationships among high-dimensional samples, which greatly benefits the processing of massive high-dimensional multimedia data such as images, videos, and web data. In recent years, many sparse representation methods have been developed and applied to various real-world applications, including image classification [57], image super-resolution [56], video anomaly detection [40], and visual tracking [2]. Researchers have also found that sparsity is supported by theories of the human visual system, i.e., the nerve cells in the connecting pathway react only to a limited amount of stimuli [44]. The sparse representation based classifier (SRC) is a classical method in this category, which achieves promising results in face recognition [52]. The key idea of sparse representation is to recover a signal from a small number of linear measurements. Given an over-complete dictionary \(D\) and a test sample \(y\), the objective function of SRC is:

$$\begin{aligned} \min _{\alpha } \Vert \alpha \Vert _1 \quad \mathrm {s.t.} \quad y = D\alpha , \end{aligned}$$

where \(\alpha \) is a coding coefficient vector whose non-zero elements correspond to the category that \(y\) belongs to.

Another type of representation learning, low-rank representation, has shown excellent performance for learning features from noisy observations. Existing low-rank methods have been applied to background modeling [8], shadow removal [8], subspace clustering [34], image processing [62] and multimedia analysis [63]. Low-rank representation (LRR) is able to discover the underlying structures in noisy data. The objective function of LRR is:

$$\begin{aligned} \min _{Z,E} \Vert Z\Vert _* + \lambda \Vert E\Vert _{2,1},\quad \mathrm {s.t.} \quad X = {\textit{DZ}} + E, \nonumber \end{aligned}$$

where \(Z\) is the coefficient matrix, \(E\) is the error matrix, and \(D\) is the dictionary.

From the objective functions of sparse representation and low-rank representation, we can observe that the dictionary \(D\) plays an important role. Traditionally, \(D\) is either pre-specified or learned from the training sample set. In SRC [52] and LRR [34], \(D\) is pre-specified as the original training set \(X\). However, this simple strategy has two major drawbacks. First, the original sample set may not faithfully represent the test samples, due to noise and uncertainty in practice. Second, the discriminative information contained in the sample set is ignored, which is critical for classification tasks. Therefore, it is necessary to adaptively learn a dictionary from the sample set.

Dictionary learning is indeed an active research topic in signal processing and computer vision, and significant progress has been made in recent years. The K-SVD algorithm learns an over-complete dictionary by updating dictionary atoms and sparse representations alternately [1], which generalizes the conventional K-means clustering process. Based on K-SVD, a discriminative version has also been designed [60], which makes use of the classification feedback in the training phase. Other strategies for enhancing the discriminability of the dictionary include associating the label information with each dictionary basis [23] and introducing the Fisher criterion into the dictionary learning process [58]. A category-specific dictionary learning method is proposed for fine-grained image categorization [19], which incorporates incoherence constraints among the different dictionaries in the feature coding objective. Dictionary learning is usually time-consuming. To address this problem, several methods emphasize specific discriminative criteria to reduce the computational complexity [28, 50]. When the data set contains clean samples, the methods above can achieve promising performance. However, when the training samples are noisy or heavily corrupted, which is very common in practice, these methods may fail. Specifically, if the samples are corrupted by large noise, the dictionary atoms will also become corrupted in order to represent these samples. In this chapter, we focus on learning a discriminative dictionary that is also robust to noise, and on applying it to image classification.

1.1 Motivation and Contributions

To deal with the noisy observations encountered in practice, current low-rank learning methods eliminate noise through a low-rank constraint [8]. In our case of dictionary learning for image classification, the training samples from the same class are linearly correlated and lie in a low-dimensional manifold. Hence, a sub-dictionary for representing samples from one class should be of low rank.

Inspired by the previous work, we aim at learning a discriminative dictionary for image classification that can handle training samples corrupted by large noise. We propose a discriminative dictionary learning with low-rank regularization (\(D^2L^2R^2\)) approach and illustrate it in Fig. 1. In the figure, the training sample set can be approximately recovered by multiplying the dictionary and the coding coefficient matrix. Each sub-dictionary is of low rank (it can be written as the product of two smaller matrices) to reduce the negative effect of noise contained in the training samples. The coding coefficients conform to the Fisher discrimination criterion. Benefiting from this design, our approach has the following advantages. First, the Fisher discriminant function helps us achieve a small ratio of within-class scatter to between-class scatter on the coefficients, so that the learned dictionary has strong discerning power. Second, the low-rank regularization yields a compact and clean dictionary that can reconstruct denoised images even when the training samples are contaminated.

Fig. 1
figure 1

Illustration of our \(D^2L^2R^2\) model. Each learned sub-dictionary is of low rank, which reduces the negative effect of noise contained in the training samples. The coding coefficients conform to the Fisher discrimination criterion, making the dictionary discriminative for the training samples

The methods most relevant to ours are Fisher discrimination dictionary learning (FDDL) [58] and dictionary learning for sparse representation (DLRD_SR) [37]. Different from FDDL, our \(D^2L^2R^2\) approach can cope well with training samples containing large noise and still achieves impressive performance, thanks to the low-rank regularization on the sub-dictionaries. Our approach also differs from the recently proposed DLRD_SR method, which integrates rank minimization into sparse representation and achieves impressive face recognition results, especially when corruption exists. Although DLRD_SR is claimed to handle noisy samples as well, it may suffer from information loss because of the low-rank regularization; our \(D^2L^2R^2\) approach compensates for this by enforcing the Fisher criterion on the coding coefficients of the training set.

In all, our contributions include:

  • We propose a novel low-rank dictionary learning method, which improves the discriminability of the dictionary and is well suited to dealing with noisy observations.

  • We formulate our model as an \(l_1\) regularized rank-minimization problem. The iterative projection method (IPM) and inexact augmented Lagrange multiplier (ALM) algorithms are employed to solve our objective function.

  • We evaluate the performance of our method and related methods on four image databases, including three face databases and a digit database. Extensive results demonstrate the effectiveness of our method.

  • This chapter is an extension of our previous papers [31, 32]. We provide more details of our method in this chapter, and add discussions of recently published related works. More discussions in the methodology and experimental parts are also provided.

1.2 Organization and Notations

The rest of this chapter is organized as follows. Section 2 introduces some related works. Section 3 introduces the formulations of our Discriminative Dictionary Learning with Low-Rank Regularization (\(D^2L^2R^2\)) approach, and the optimization algorithms are described in Sect. 4. Section 5 describes the classification scheme. Section 6 shows experimental results on four image datasets. Finally, we draw conclusions in Sect. 7. In addition, Table 1 summarizes the notations used in this chapter.

Table 1 Notations

2 Related Works

In this section, we briefly review some related works, including sparse representation, dictionary learning and low-rank learning.

2.1 Sparse Representation

We introduce the most popular sparse representation method, SRC [52]. Let \(Y\) denote the entire training sample set, which consists of \(n\) training samples from all \(c\) different classes, that is, \(Y = [Y_1, Y_2,\ldots ,Y_c]\). Here \(Y_i\in R^{d\times n_i}\) contains all the training samples from the \(i\)th class, \(d\) is the dimension of the samples, and \(n_i\) is the number of samples in the \(i\)th class. To classify a test sample \(y\), we need to go through two phases: coding and classification.

(a) Coding phase: we obtain the coding coefficient of \(y\) by solving the following minimization problem:

$$\begin{aligned} \alpha = \arg \min \Vert \alpha \Vert _0 \quad \text{ subject } \text{ to } \quad \Vert y-Y\alpha \Vert _2 \le \varepsilon . \end{aligned}$$
(1)

The model seeks the sparsest representation of \(y\) among all possible linear combinations of the basis vectors in the dictionary. For a general over-complete dictionary, finding this sparsest representation has been shown to be an NP-hard problem [12]. Instead, approximation methods have been proposed to tackle the problem. Among them, the simplest are matching pursuit [39] and orthogonal matching pursuit [11]. The most widely used approach is to replace the \(l_0\) norm with its convex envelope, the \(l_1\) norm, so that the minimization problem (1) becomes:

$$\begin{aligned} \alpha = \arg \min \Vert \alpha \Vert _1 \quad \text{ subject } \text{ to } \quad \Vert y-Y\alpha \Vert _2 \le \varepsilon , \end{aligned}$$
(2)

which can be reformulated using a Lagrange multiplier as follows:

$$\begin{aligned} \alpha = \arg \min \Vert y-Y\alpha \Vert _2 + \lambda \Vert \alpha \Vert _1. \end{aligned}$$
(3)

As reviewed in [55], there are three widely used approaches to the above \(l_1\) minimization problem: the interior-point method [18, 24], the Homotopy method [14, 16, 38, 43] and first-order methods [3, 13, 17, 42]. The primal-dual interior-point method converts a problem with inequality constraints into one with equality constraints in an iterative fashion and then applies Newton's barrier method. However, the interior-point method does not scale well to large-scale real-world problems. The Homotopy method exploits the fact that, as the balance parameter \(\lambda \) decreases, problem (3) traces a homotopy path from the \(l_2\) to the \(l_1\) solution. However, the Homotopy method is also computationally expensive. First-order methods exploit the structure of the \(l_1\) norm and significantly reduce the computational cost of each iteration.
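As a concrete illustration of the first-order family, the sketch below implements ISTA (iterative shrinkage-thresholding), a standard proximal-gradient method, for the common squared-loss form of problem (3). The step size, iteration count, and toy data are our own choices for illustration, not part of the methods cited above.

```python
import numpy as np

def soft_threshold(v, t):
    """Entry-wise soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(y, Y, lam=0.1, n_iter=500):
    """Minimize 0.5 * ||y - Y a||_2^2 + lam * ||a||_1 by ISTA,
    a first-order proximal-gradient method for problem (3)."""
    L = np.linalg.norm(Y, 2) ** 2          # Lipschitz constant of the gradient (spectral norm squared)
    a = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        grad = Y.T @ (Y @ a - y)           # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Toy usage with a random dictionary and a sparse ground-truth code.
rng = np.random.default_rng(0)
Y = rng.standard_normal((64, 200))
a_true = np.zeros(200)
a_true[[3, 50, 120]] = [1.0, -0.5, 2.0]
y = Y @ a_true + 0.01 * rng.standard_normal(64)
a_hat = ista(y, Y, lam=0.1)
```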

(b) Classification phase: \(y\) is classified as the category with the smallest residual:

$$\begin{aligned} \min _i r_i(y) = \Vert y-Y\delta _i(\alpha )\Vert _2, \end{aligned}$$
(4)

where \(\delta _i(\alpha )\) is a function that picks out the coefficients corresponding to the \(i\)th class.
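A minimal sketch of this classification rule follows; it assumes a coefficient vector obtained by any of the solvers above, and the function and variable names are ours rather than from the original SRC implementation.

```python
import numpy as np

def src_classify(y, Y, labels, alpha):
    """Assign y to the class with the smallest reconstruction residual (Eq. 4).

    Y      : d x n training matrix whose columns are training samples.
    labels : length-n array, labels[j] is the class of column j.
    alpha  : length-n sparse coefficient vector of y over Y.
    """
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        delta = np.where(labels == c, alpha, 0.0)   # delta_c(alpha): keep only class-c coefficients
        residuals.append(np.linalg.norm(y - Y @ delta))
    return classes[int(np.argmin(residuals))]
```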

Besides SRC, many other sparse representation methods have been applied to various domains, e.g., blind image deblurring [59], face alignment [49], human action recognition [54], image classification [57], image super-resolution [56], video anomaly detection [40], etc. The sparse representation methods above can deal with noisy observations to some extent. However, there are two shortcomings. First, the dictionary is usually pre-specified as the training set, which limits the learning performance. Second, the structural information within the training sample set is usually ignored, which is critical for classification tasks. Dictionary learning can tackle the first problem, and the second one can be alleviated by low-rank learning.

2.2 Dictionary Learning

The goal of dictionary learning in sparse representation or low-rank modeling is to learn a dictionary that yields sparse/low-rank representations for the training samples. Probabilistic approaches to dictionary learning learn a dictionary either by maximum likelihood [30] or by maximum a posteriori estimation [26]. Given the training samples \(Y=\{y_i\}_{i=1}^N\), maximum likelihood estimation learns a dictionary that maximizes the likelihood function \(p(Y|D)=\prod _{i=1}^Np(y_i|D)\). Each \(p(y_i|D)\) is calculated by integrating out its sparse coefficient, which is assumed to be a hidden variable. Maximum a posteriori estimation instead finds the dictionary that maximizes the posterior \(p(D|Y)\propto p(Y|D)p(D)\). Different choices of the prior \(p(D)\) lead to different formulations of the dictionary.

Many methods generalize the K-means clustering algorithm to learn the dictionary in sparse representation, because sparse representation generalizes K-means by allowing each sample to select more than one cluster and letting the coefficients take arbitrary values when minimizing the mean squared error (MSE). These methods alternately compute the sparse codes and update the dictionary. K-SVD [1] updates each atom of the dictionary by applying an SVD to the overall representation error matrix. K-SVD is shown to converge by reducing the MSE monotonically, but a global minimum is not guaranteed. Besides, K-SVD only considers good representation and is not optimal for classification. In [60], the classification error is integrated with the reconstruction error in the objective to learn a dictionary that is suitable for sparse representation and also has discriminative power. Discriminative K-SVD [60] can get around the issues of getting stuck at local minima and slow convergence. Label consistent K-SVD [23] further adds a discriminative sparse-code error term, i.e., label consistency of the sparse codes, so that training samples from the same class have similar sparse coefficients. In this way, it can explore the underlying structure of the training set and generate discriminative sparse representations that are optimal for classification. Pairwise similarity/dissimilarity constraints on the sparse codes are considered in [20] and combined with the classification error to achieve a discriminative dictionary; the idea is that signals from the same class should share similar sparse codes while those from different classes have dissimilar ones. Most recently, dictionary learning has also been extended to other scenarios, such as multi-modal visual dictionary learning [21] and online dictionary learning [51].
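To make the alternation concrete, the sketch below shows the core K-SVD atom update: for each atom, the residual is computed over the samples that currently use it, and a rank-one SVD refits the atom and its coefficients. The sparse-coding step is omitted, and this is a simplified illustration under our own conventions rather than the reference K-SVD implementation.

```python
import numpy as np

def ksvd_dictionary_update(Y, D, X):
    """One sweep of K-SVD atom updates: each atom (and the coefficients that use it)
    is refit by a rank-one SVD of the residual restricted to the samples that use it."""
    for k in range(D.shape[1]):
        users = np.nonzero(X[k, :])[0]          # samples whose codes use atom k
        if users.size == 0:
            continue
        X[k, users] = 0.0
        E_k = Y[:, users] - D @ X[:, users]     # residual without atom k's contribution
        U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
        D[:, k] = U[:, 0]                       # new atom: leading left singular vector (unit norm)
        X[k, users] = s[0] * Vt[0, :]           # matching coefficients
    return D, X
```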

2.3 Low-Rank Learning

Theoretical advances on low-rank matrix recovery and completion enable us to correctly recover underlying low-rank structure in data [8, 9], and low-rank matrix recovery has been applied in many areas. Given a matrix \(M\) of low rank, matrix completion aims at recovering it from noisy observations of a random small portion of its elements. It has been proved that under certain assumptions, the problem can be exactly solved and several methods have been proposed [7, 25].

Robust PCA [8] can be used to recover corrupted data in a single subspace by solving a matrix rank minimization problem. This can be regarded as an extension of sparse representation from vectors to matrices. Low-rank representation (LRR) [34] recovers data from multiple subspaces, and it shows very impressive results on subspace segmentation. Latent LRR (LatLRR) [35] is an extension of LRR that can recover the effects of unobserved hidden data. LatLRR can also extract salient features from corrupted images for use in classification tasks. Based on LRR, a fixed-rank representation (FRR) method [36] is proposed for unsupervised learning on image features. In [64], a non-negative low-rank and sparse graph is constructed for semi-supervised learning. Also, low-rank constraints have been introduced to visual domain adaptation in [22] and transfer subspace learning in [47]. In [10], a low-rank approximation method with structural incoherence is proposed for face recognition.

Given an observed and usually corrupted sample set \(X_O\), low-rank learning methods solve the nuclear norm regularized optimization problem, which can be generally formulated as:

$$\begin{aligned} \min _{X,E} \Vert X\Vert _* + \lambda \Vert E\Vert _l,\quad \mathrm {s.t.} \quad X_{O} = A(X) + E, \end{aligned}$$
(5)

where \(\Vert \cdot \Vert _*\) is the nuclear norm (trace norm), \(X\) and \(E\) are unknown matrices to learn, \(A\) is a linear operator, \(\Vert \cdot \Vert _l\) is a norm used to measure the noise, and \(\lambda > 0\) is a balance parameter. In Robust PCA, \(A\) is the identity operator and \(\Vert \cdot \Vert _l\) is chosen as the \(l_1\) norm, i.e., \(\Vert E\Vert _1\). In LRR, \(A(X) = {\textit{AX}}\), where \(A\) is a given dictionary, and the \(l_{2,1}\) norm \(\Vert \cdot \Vert _{2,1}\) is chosen for \(E\).

Several optimization algorithms have been proposed to solve the above problem, such as semi-definite programming (SDP) and accelerated proximal gradient (APG). However, these algorithms suffer from a large computational burden. Recently, Lin et al. proposed the augmented Lagrange multiplier (ALM) method, which solves the nuclear norm optimization problem efficiently. By introducing the singular value thresholding (SVT) operator, ALM has a computational complexity of \(\mathrm {O}(n^3)\). In this chapter, we adopt the ALM algorithm to solve the rank minimization problem.
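For reference, the SVT operator mentioned above has a compact closed form; a minimal sketch (our own, using standard numpy) is given below. It is the proximal operator of the nuclear norm: the singular values are shrunk by a threshold and those that vanish are discarded.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)     # shrink singular values by tau
    return (U * s_shrunk) @ Vt              # rebuild the matrix with the shrunk spectrum
```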

Most recently, some researchers have tried to incorporate low-rank constraints into dictionary learning models. A dictionary learning for sparse representation (DLRD_SR) method is proposed in [37], which seeks a low-rank dictionary for each class. It achieves excellent performance for face recognition with noise. A structured low-rank representation method is presented in [61], which incorporates structure information into dictionary learning. In addition, a low-rank dictionary learning approach is also proposed for object co-detection [53]. However, in these methods, the discriminative information contained in the training samples has not been extensively exploited. In our method, we explicitly utilize the class label information through the Fisher criterion.

3 Problem Formulation

We aim to learn a discriminative dictionary even when large noise exists in the training samples. In this section, we introduce our motivation, and describe the detailed formulations of our method.

3.1 Motivation

Let \(Y\) denote the training sample set which consists of \(n\) training samples from all \(c\) different classes: \(Y = [Y_1, Y_2,\ldots ,Y_c]\), and \(Y_i\in \mathrm {R}^{d\times n_i}\) denote a subset that contains all the training samples from the \(i\)th class.

We aim to learn a discriminative dictionary from \(Y\) for future image classification tasks. Rather than learning the dictionary as a whole from all the training samples, we separately learn a sub-dictionary \(D_i\) for the \(i\)th class. With all the sub-dictionaries learned, we obtain the whole dictionary \(D=[D_1,D_2,\ldots ,D_c]\), where \(c\) is the number of classes and \(D_i\) is the sub-dictionary for the \(i\)th class. Each \(D_i\) is of size \(d\times m_i\), where \(d\) is the dimension of each dictionary atom, which equals the feature dimension of each training sample, and \(m_i\) is the number of atoms in the \(i\)th sub-dictionary.

We represent the entire training set \(Y\) using the whole dictionary \(D\), and denote by \(X\) the obtained sparse coefficient matrix. We then have the reconstruction formulation \(Y \approx DX\), where \(X\) can be written as \(X=[X_1,X_2,\ldots ,X_c]\) and \(X_i\) is the sub-matrix of coefficients for representing \(Y_i\) over \(D\). We propose the following \(D^2L^2R^2\) model:

$$\begin{aligned} J_{D,X} = \mathop {\arg \min }\limits _{D,X} \{R(D,X)+\lambda _1 \Vert X\Vert _1 + \lambda _2 F(X) + \alpha \sum _i\Vert D_i\Vert _*\}, \nonumber \end{aligned}$$

where \(R(D,X)\) is the reconstruction function for expressing the discrimination power of \(D\), \(\Vert X\Vert _1\) is the \(l_1\) regularization on coding coefficient matrix, \(F(X)\) is the Fisher discriminant function of the coefficients \(X\), and \(\Vert D_i\Vert _*\) is the nuclear norm of each sub-dictionary \(D_i\), which is the convex envelope of its matrix rank. We will break down the model in the following subsections.

3.2 Discriminative Reconstruction Function

Our assumption is that the sub-dictionary \(D_i\) should have a good capability to represent the samples from the \(i\)th class, even if the samples are unseen during the training stage. To express this idea mathematically, we rewrite \(X_i\), the coding coefficient matrix of \(Y_i\) over \(D\), as \(X_i = [X_i^1;X_i^2;\ldots ;X_i^c]\), where \(X_i^j\in R^{m_j\times n_i}\) is the coding coefficient matrix of \(Y_i\) over \(D_j\). Therefore, we minimize \(\Vert Y_i-D_i X_i^i\Vert _\mathrm {F}\), which measures the reconstruction of the \(i\)th class over the \(i\)th sub-dictionary.

On the other hand, since we want to emphasize the representation ability of the \(i\)th sub-dictionary \(D_i\) for the \(i\)th class, it should not be able to represent samples from other classes. Formally, the term \(\sum _{j=1,j\ne i}^c\Vert D_i X_j^i\Vert _\mathrm {F}^2\) should be as small as possible, i.e., each \(X_j^i\) should have nearly zero elements.

Moreover, as a basis matrix in the high-dimensional space, the whole dictionary \(D\) should represent samples from any class \(Y_i\) well. Thus, we require the minimization of \(\Vert Y_i-DX_i\Vert _\mathrm {F}^2\). In all, for the sub-dictionary \(D_i\), we can write the discriminative reconstruction function as \(R(D_i,X_i)=\Vert Y_i-D_i X_i^i\Vert _\mathrm {F}^2 + \sum _{j=1,j\ne i}^c\Vert D_i X_j^i\Vert _\mathrm {F}^2 + \Vert Y_i-DX_i\Vert _\mathrm {F}^2\), and we need to minimize the value of \(R(D_i,X_i)\).
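To make the bookkeeping of sub-dictionaries and coefficient blocks concrete, the sketch below evaluates \(R(D_i,X_i)\) for one class. The block-indexing conventions (column ranges per class, row ranges per sub-dictionary) are assumptions of this illustration, not part of the original formulation.

```python
import numpy as np

def r_term(Y, X, D_blocks, class_cols, i):
    """Evaluate R(D_i, X_i) for class i as defined above.

    Y          : d x n matrix of all training samples, classes stacked column-wise.
    X          : m x n coefficient matrix of Y over the whole dictionary D = [D_1, ..., D_c].
    D_blocks   : list of sub-dictionaries [D_1, ..., D_c]; the rows of X are split accordingly.
    class_cols : list of (start, end) column ranges of each class inside Y (and X).
    """
    D = np.hstack(D_blocks)
    row_off = np.cumsum([0] + [Dj.shape[1] for Dj in D_blocks])
    s_i, e_i = class_cols[i]
    Y_i, X_i = Y[:, s_i:e_i], X[:, s_i:e_i]
    X_ii = X_i[row_off[i]:row_off[i + 1], :]                      # X_i^i

    term1 = np.linalg.norm(Y_i - D_blocks[i] @ X_ii, 'fro') ** 2  # D_i should represent class i
    term2 = 0.0
    for j, (s, e) in enumerate(class_cols):
        if j != i:
            X_ji = X[row_off[i]:row_off[i + 1], s:e]              # X_j^i: class j coded over D_i
            term2 += np.linalg.norm(D_blocks[i] @ X_ji, 'fro') ** 2
    term3 = np.linalg.norm(Y_i - D @ X_i, 'fro') ** 2             # the whole D represents class i
    return term1 + term2 + term3
```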

3.3 Fisher Discriminant Function

In addition to the discriminative reconstruction function, we want to make the coding coefficient matrix \(X\) discriminative as well. In this way, \(D\) will have discriminative power for training samples \(Y\).

The sparse coding coefficients in \(X\) can be regarded as new representations of the training sample set \(Y\). To explicitly incorporate the discriminative information, we apply the Fisher discrimination criterion [15] to the coefficient matrix \(X\), so that the ratio of within-class scatter to between-class scatter is minimized and samples from different classes can be well separated.

Let \(S_W(X)\) and \(S_B(X)\) denote the within-class scatter matrix and the between-class scatter matrix of \(X\), respectively. We have

$$\begin{aligned} S_W(X)&= \sum _{i=1}^c\sum _{x_k\in X_i}(x_k-\bar{x}_i)(x_k-\bar{x}_i)^T, \nonumber \\ S_B(X)&= \sum _{i=1}^c n_i(\bar{x}_i-\bar{x})(\bar{x}_i-\bar{x})^T, \nonumber \end{aligned}$$

where \(\bar{x}_i\) is the mean sample of \(X_i\), \(\bar{x}\) is the mean sample of \(X\), and \(n_i\) is the number of samples in the \(i\)th class.

Based on the scatter matrices \(S_W(X)\) and \(S_B(X)\), we can define a discriminant term \(\hat{F}(X)\) as

$$\begin{aligned} \hat{F}(X)=tr(S_W(X))-tr(S_B(X)). \nonumber \end{aligned}$$

Note that minimizing \(tr(S_W(X))-tr(S_B(X))\) is usually equivalent to minimizing the ratio of within-class scatter to between-class scatter. However, this function is not convex, which makes it difficult to solve our objective function. Fortunately, [58] suggests that, by adding a regularization term \(\eta \Vert X\Vert _\mathrm {F}^2\) with a proper value of \(\eta \), the function becomes strictly convex with respect to \(X\). Finally, our discriminative function is

$$\begin{aligned} F(X)=tr(S_W(X))-tr(S_B(X))+\eta \Vert X\Vert _\mathrm {F}^2, \end{aligned}$$
(6)

where the parameter \(\eta \) in the last term is set as \(\eta =1\).
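Since only the traces of the scatter matrices are needed, \(F(X)\) can be evaluated without forming the \(m\times m\) scatter matrices explicitly; a minimal sketch follows, where the per-class list layout of the coefficients is our own convention.

```python
import numpy as np

def fisher_term(X_list, eta=1.0):
    """F(X) = tr(S_W(X)) - tr(S_B(X)) + eta * ||X||_F^2  (Eq. 6).

    X_list[i] is the m x n_i coefficient block of class i.
    The traces are computed directly from squared distances.
    """
    X = np.hstack(X_list)
    x_bar = X.mean(axis=1, keepdims=True)                     # global mean coefficient
    tr_sw, tr_sb = 0.0, 0.0
    for Xi in X_list:
        xi_bar = Xi.mean(axis=1, keepdims=True)               # class mean coefficient
        tr_sw += np.sum((Xi - xi_bar) ** 2)                   # trace of within-class scatter
        tr_sb += Xi.shape[1] * np.sum((xi_bar - x_bar) ** 2)  # trace of between-class scatter
    return tr_sw - tr_sb + eta * np.sum(X ** 2)
```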

3.4 Low-Rank Regularization

In classification tasks, we usually assume that the training samples in the same class are linearly correlated and reside in a low-dimensional subspace, whereas samples from different classes reside in different subspaces. Therefore, a sub-dictionary for representing samples from one class should reasonably be of low rank. The low-rank constraint also makes each sub-dictionary more compact.

Besides, putting low-rank constraints on the sub-dictionaries also eliminates the negative influence of noise, which makes the dictionary cleaner and more robust. Of all the possible sub-dictionaries \(D_i\) that can represent samples from the \(i\)th class, we want to find the one with the most compact bases, that is, to minimize \(\Vert D_i\Vert _*\).

3.5 The \(D^2L^2R^2\) Model

Considering the discriminative reconstruction function, Fisher discriminant function, and the low-rank regularization together, we can build the following \(D^2L^2R^2\) model:

$$\begin{aligned} J_{(D,X)} =\mathop {\arg \min }\limits _{D,X} \left( \begin{array}{l} \sum _{i=1}^c \Vert Y_i-D_i X_i^i\Vert _\mathrm {F}^2+\Vert Y_i-DX_i\Vert _\mathrm {F}^2\\ + \sum _{j=1,j\ne i}^c\Vert D_jX_i^j\Vert _\mathrm {F}^2 + \lambda _1 \Vert X\Vert _1\\ +\,\lambda _2 F(X)+\alpha \sum _{i=1}^c\Vert D_i\Vert _* \end{array}\right) . \end{aligned}$$
(7)

4 Optimization

In this section, we introduce how to solve our model in Eq. (7). It is an \(l_1\)-regularized rank-minimization problem, and we can extend existing optimization tools to solve it by iteratively updating each variable. We update \(X_i\) and \(D_i\) alternately.

In particular, our optimization contains two major steps. First, we optimize \(X_i(i=1,2,\ldots ,c)\) when the dictionary \(D\) and all \(X_j(j\ne i)\) are fixed. Then, we can get coding coefficient matrix \(X\) by putting all the \(X_i (i=1,2,\ldots ,c)\) together. Second, we optimize \(D_i\) when \(X\) and \(D_j (j\ne i)\) are fixed. We describe the detailed implementations of solving these two sub-problems in this section, and also provide the complete algorithms.

4.1 Update Coding Coefficients \(X_i\)

Assuming that \(D\) is fixed, the original objective function Eq. (7) reduces to a sparse coding problem. We update each \(X_i\) one by one while keeping all \(X_j(j\ne i)\) fixed. This can be done by solving the following problem:

$$\begin{aligned} J_{(X_i)} =\mathop {\arg \min }\limits _{X_i} \left( \begin{array}{l} \Vert Y_i-D_i X_i^i\Vert _\mathrm {F}^2+\Vert Y_i-DX_i\Vert _\mathrm {F}^2\\ + \sum _{j=1,j\ne i}^c\Vert D_jX_i^j\Vert _\mathrm {F}^2\\ +\, \lambda _1\Vert X_i\Vert _1 + \lambda _2F_i(X_i) \end{array}\right) , \end{aligned}$$
(8)

where \(F_i(X_i)=\Vert X_i-\bar{X}_i\Vert _\mathrm {F}^2-\sum _{k=1}^c\Vert \bar{X}_k-\bar{X}\Vert _\mathrm {F}^2+\eta \Vert X_i\Vert _\mathrm {F}^2\), and \(\bar{X}_k\) and \(\bar{X}\) are matrices whose columns are all equal to the mean coefficient vector of the \(k\)th class and of all classes, respectively.

This reduced objective function can be solved by the Iterative Projection Method (IPM) in [45], by rewriting it as

$$\begin{aligned} J_{(X_i)}=\mathop {\arg \min }\limits _{X_i}\{Q(X_i)+2\tau \Vert X_i\Vert _1\}. \end{aligned}$$

The detailed implementation of the IPM algorithm can be found in [45, 58].
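Viewed abstractly, the IPM iteration alternates a gradient step on the smooth part \(Q(X_i)\) with a soft-thresholding (projection) step for the \(l_1\) part. The sketch below is a simplified proximal-gradient view with a fixed step size of our own choosing; the exact step-size rule follows [45, 58].

```python
import numpy as np

def ipm(grad_Q, X0, tau, step, n_iter=200):
    """Sketch for min_X Q(X) + 2*tau*||X||_1: gradient step on the smooth
    part Q, then entry-wise soft-thresholding on the l1 part.
    grad_Q is a callable returning dQ/dX at the current iterate."""
    X = X0.copy()
    for _ in range(n_iter):
        X = X - step * grad_Q(X)                                        # gradient step on Q
        X = np.sign(X) * np.maximum(np.abs(X) - 2.0 * tau * step, 0.0)  # soft-threshold ("projection")
    return X
```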

4.2 Updating Sub-dictionaries \(D_i\)

In the second step, we need to optimize the sub-dictionaries \(D_i\). When \(X\) is fixed, we update \(D_i\) while fixing all the other \(D_j(j\ne i)\). Note that when \(D_i\) is updated, the coding coefficients of \(Y_i\) over \(D_i\), i.e., \(X_i^i\), should be updated at the same time to reflect this change.

By ignoring the terms irrelevant to \(D_i\), the objective function Eq. (7) can be reduced to

$$\begin{aligned} J_{(D_i)} =\mathop {\arg \min }\limits _{D_i,X_i^i} \left( \begin{array}{l} \Vert Y_i-D_iX_i^i-\sum _{j=1,j\ne i}^c D_jX_i^j\Vert _\mathrm {F}^2\\ + \sum _{j=1,j\ne i}^c \Vert D_j X_i^j\Vert _\mathrm {F}^2 + \Vert Y_i-D_iX_i^i\Vert _\mathrm {F}^2\\ +\,\alpha \Vert D_i\Vert _* \end{array}\right) . \nonumber \end{aligned}$$

Denoting \(r(D_i)=\Vert Y_i-D_iX_i^i-\sum _{j=1,j\ne i}^c D_jX_i^j\Vert _\mathrm {F}^2+\sum _{j=1,j\ne i}^c \Vert D_j X_i^j\Vert _\mathrm {F}^2\) and moving the third term \(\Vert Y_i-D_iX_i^i\Vert _\mathrm {F}^2\) into the constraint, the above objective function can be converted into the following:

$$\begin{aligned} \begin{array}{rl} \min \limits _{D_i,E_i,X_i^i} &{} \Vert X_i^i\Vert _1+\alpha \Vert D_i\Vert _* + \beta \Vert E_i\Vert _{2,1}+\lambda r(D_i), \\ \mathrm {s.t.} &{} Y_i=D_i X_i^i+E_i , \end{array} \end{aligned}$$
(9)

where \(\left\| E_i\right\| _{2,1}=\sum \nolimits _{q=1}^{n_i}{\sqrt{\sum \nolimits _{p=1}^{d}{([E_i]_{pq})^2}}}\) is the \(l_{2,1}\) norm, which is usually adopted to model sample-specific corruption or noise.
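The \(l_{2,1}\) norm and its proximal operator (used in the ALM sub-steps later in this section) both have simple column-wise forms; a minimal sketch, assuming one sample per column:

```python
import numpy as np

def l21_norm(E):
    """l_{2,1} norm: sum of the l2 norms of the columns of E (one column per sample)."""
    return float(np.sum(np.linalg.norm(E, axis=0)))

def prox_l21(Q, tau):
    """Proximal operator of tau * ||.||_{2,1}: shrink each column of Q toward zero,
    zeroing out columns (samples) dominated by corruption."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale
```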

To facilitate the optimization, we introduce two relaxation variables \(J\) and \(Z\), and then Eq. (9) can be rewritten as:

$$\begin{aligned} \begin{array}{rl} \min \limits _{D_i,E_i,X_i^i,J,Z} &{} \Vert Z\Vert _1+\alpha \Vert J\Vert _* + \beta \Vert E_i\Vert _{2,1} + \lambda r(D_i),\\ \mathrm {s.t.}&{} Y_i = D_i X_i^i+E_i, D_i=J, X_i^i=Z. \end{array} \end{aligned}$$
(10)

The above problem can then be solved by the well-established inexact augmented Lagrange multiplier (ALM) method [6]. The corresponding augmented Lagrangian problem of Eq. (10) is

$$\begin{aligned} \min \limits _{D_i,E_i,X_i^i}&\Vert Z\Vert _1 +\, \alpha \Vert J\Vert _* + \beta \Vert E_i\Vert _{2,1} + \lambda r(D_i) \nonumber \\&+\, tr[T_1^t(Y_i - D_iX_i^i -E_i)] \nonumber \\&+\, tr[T_2^t(D_i-J)]+tr[T_3^t(X_i^i-Z)] \nonumber \\&+\, \frac{\mu }{2}(\Vert Y_i - D_i X_i^i-E_i\Vert _\mathrm {F}^2 \nonumber \\&+\,\Vert D_i-J\Vert _F^2+\Vert X_i^i-Z\Vert _\mathrm {F}^2), \end{aligned}$$
(11)

where \(T_1\), \(T_2\) and \(T_3\) are Lagrange multipliers and \(\mu >0\) is a penalty parameter.
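The sub-steps of the inexact ALM iteration that have simple closed forms can be written down directly; the sketch below shows one structural pass and is not the authors' Algorithm 1. The updates of \(J\), \(Z\) and \(E_i\) use singular value thresholding, soft-thresholding and column-wise shrinkage, respectively; the quadratic updates of \(D_i\) and \(X_i^i\), which involve \(r(D_i)\), are omitted, and all parameter values are placeholders of our own.

```python
import numpy as np

def alm_pass(Yi, Di, Xii, T1, T2, T3, mu,
             alpha=1.0, beta=0.1, rho=1.2, mu_max=1e8):
    """One structural pass of inexact ALM on Eq. (11), showing only the
    closed-form proximal sub-steps."""
    # J-step: prox of (alpha/mu) * nuclear norm at D_i + T2/mu (singular value thresholding).
    U, s, Vt = np.linalg.svd(Di + T2 / mu, full_matrices=False)
    J = (U * np.maximum(s - alpha / mu, 0.0)) @ Vt

    # Z-step: prox of (1/mu) * l1 norm at X_i^i + T3/mu (soft-thresholding).
    V = Xii + T3 / mu
    Z = np.sign(V) * np.maximum(np.abs(V) - 1.0 / mu, 0.0)

    # E-step: prox of (beta/mu) * l_{2,1} norm at Y_i - D_i X_i^i + T1/mu (column shrinkage).
    Q = Yi - Di @ Xii + T1 / mu
    norms = np.linalg.norm(Q, axis=0)
    E = Q * (np.maximum(norms - beta / mu, 0.0) / np.maximum(norms, 1e-12))

    # (D_i and X_i^i updates are quadratic problems and are left to Algorithm 1.)

    # Multiplier and penalty updates.
    T1 = T1 + mu * (Yi - Di @ Xii - E)
    T2 = T2 + mu * (Di - J)
    T3 = T3 + mu * (Xii - Z)
    mu = min(rho * mu, mu_max)
    return J, Z, E, T1, T2, T3, mu
```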

4.3 Algorithms and Analysis

The details of solving problem (11) are summarized in Algorithm 1. Each atom of the dictionary is normalized to a unit vector. A similar proof demonstrating the convergence property of Algorithm 1 can be found in Lin et al.'s work [33].

Computational cost is usually a major concern of algorithms in practice, so we analyze the time complexity of Algorithm 1. The most time-consuming steps in Algorithm 1 are step 1 and step 3, and each of them costs \(\mathrm {O}(n^3)\), where \(n\) is the number of training samples. Therefore, the total time complexity of Algorithm 1 is \(\mathrm {O}(tn^3)\), where \(t\) is the number of iterations.

figure a

Once the dictionary \(D\) is initialized, we can proceed by iteratively repeating the above process until a stopping criterion is reached. We summarize the complete procedures of our \(D^2L^2R^2\) approach in Algorithm 2.

5 Classification Scheme

In this section, we design a classification scheme based on the learned dictionary. Given a query sample \(y\), we calculate its coding over the dictionary \(D\) by solving the following problem:

figure b
$$\begin{aligned} x = \mathop {\arg \min }\limits _{x} \{\Vert y - D x\Vert _2^2 + \gamma \Vert x\Vert _1\}. \end{aligned}$$
(12)

Write \(x\) as \(x=[x_1;x_2;\ldots ;x_c]\), where \(x_i\) is the coefficient vector over sub-dictionary \(D_i\). The residual associated with the \(i\)th class is calculated as

$$\begin{aligned} e_i = \Vert y - D_i x_i\Vert _2^2 + w\Vert x-\bar{x}_i\Vert _2^2, \end{aligned}$$
(13)

where \(\bar{x}_i\) is the learned mean coefficient of class \(i\), and \(w\) is a preset weight parameter.

The identity of testing sample \(y\) is determined according to

$$\begin{aligned} \text{ identity }(y) = \arg \min _i \{e_i\}. \end{aligned}$$
(14)

6 Experiments

We evaluate the performance of our approach and related methods on the ORL [46], Extended Yale B [29] and CMU PIE [48] face datasets and the MNIST digit dataset [27]. To test the robustness of each compared method to noise, we simulate different types of noise in the experiments, including illumination changes, pixel corruptions, and block corruptions.

6.1 Datasets and Settings

The ORL dataset contains 400 images in total, 10 different images for each of 40 different subjects. The background of the images is uniform and dark, while the subjects are in a frontal, upright posture. The images were shot under different lighting conditions and with various facial expressions and details [46]. For each class, we randomly select half of the images as training samples and the rest as test samples. This random selection process is repeated five times in our experiments. Each image is normalized to the size of \(32\times 32\). To add noise, the images are manually corrupted by an unrelated block image at a random location. In our experiments, the percentage of the corrupted area is increased from 10 to 50 %. Figure 2 shows an example of images with 20 % block corruptions.

Fig. 2
figure 2

Example of ORL images with 20 % block corruptions. Top: original images. Bottom: corresponding corrupted images with 20 % block corruptions

The CMU PIE dataset consists of 41,368 images of 68 subjects, each person under 13 different poses, 43 different illumination conditions, and with 4 different expressions. We use the first 15 subjects and select the images with frontal pose and various expressions and illuminations. Each subject has 50 images; we randomly select 10 images for training and the rest for testing, and repeat our experiment on five random splits. The images are normalized to the size of \(32\times 32\). We replace a certain percentage of randomly selected pixels of each image with the pixel value 255. This percentage is increased from 10 to 40 %. Figure 3 exemplifies random pixel corruption of face images.

Fig. 3
figure 3

Example of original images (1st row) and noisy images with 10 % random pixel corruptions (2nd row) from PIE dataset

The Extended Yale B dataset contains 2,414 frontal-face images of 38 subjects captured under various laboratory-controlled lighting conditions. We choose the first 15 subjects and each subject has around 60 images. We randomly take half as training samples, and the rest as testing samples and repeat the experiment five times. The images are normalized to size \(32 \times 32\). We also replace a certain percentage of randomly selected pixels from the images with the pixel value 255. Figure 4 shows original images and noisy images with 30 % pixel corruptions.

Fig. 4
figure 4

Example of original images (1st row) and images with 30 % random pixel corruptions (2nd row) from Extended YaleB dataset

The MNIST handwritten digit dataset used in this experiment is a popular subset that contains 60,000 training samples and 10,000 test samples. In our experiments, we randomly select 30 samples of each digit to construct the training set, and select 2,000 test samples in total. The size of each digit image is \(28\times 28\). As before, we replace a certain percentage (from 10 to 40 %) of randomly selected pixels of the images with the pixel value 255. Figure 5 shows original and noisy images with 20 % pixel corruptions.

Fig. 5
figure 5

Example of original images (1st row) and images with 20 % random pixel corruptions (2nd row) from MNIST digit dataset

Parameter Settings. The number of dictionary atoms for each class is set to the number of training samples of that class. There are five parameters in our approach: \(\lambda _1\) and \(\lambda _2\) in Eq. (8) and \(\alpha ,\beta \) and \(\lambda \) in Eq. (9). In our experiments, we found that changing \(\alpha \) and \(\lambda \) does not affect the results much, so we set both of them to 1. Parameters of the compared algorithms are chosen by 5-fold cross-validation. For ORL, \(\lambda _1=0.005,\lambda _2=0.05,\beta =0.1\); for Extended Yale B, \(\lambda _1=0.005,\lambda _2=0.005,\beta =0.01\); for PIE, \(\lambda _1=0.025,\lambda _2=0.025,\beta =0.1\); for MNIST, \(\lambda _1=0.005,\lambda _2=0.005,\beta =0.01\). The influence of \(\lambda _1\) and \(\lambda _2\) will be discussed in the following subsections.

Compared Methods. We compare the proposed \(D^2L^2R^2\) method with the following methods:

  • DLRD_SR [37]: This is the work most relevant to our method; it applies low-rank regularization to the dictionary, but it does not exploit the discriminative information in the coefficients.

  • FDDL [58]: It introduces the Fisher criterion into the dictionary learning process, but it is not well suited to handling noisy datasets.

  • LRC [41]: LRC is a recently proposed classifier which falls in the category of nearest subspace classification.

  • SRC [52]: SRC is the most popular sparse representation based classification method.

  • LDA [4]: LDA is a classical supervised subspace learning method, which has been widely applied to face recognition and image classification.

6.2 Results and Analysis

ORL Face Dataset. In Table 2, we list the recognition accuracies (with standard deviations) of all compared methods under different levels of occlusion. From the table, we can observe that our approach consistently performs the best under all non-zero levels of corruption. FDDL achieves the best result when there is no corruption; however, as the percentage of occlusion increases, the performance of FDDL, along with that of LRC, SRC and LDA, drops rapidly, while \(D^2L^2R^2\) and DLRD_SR still obtain much better recognition rates. This demonstrates the effectiveness of low-rank regularization when noise exists. Comparing \(D^2L^2R^2\) with DLRD_SR, \(D^2L^2R^2\) performs better due to the Fisher criterion on the coefficients.

Table 2 Average recognition rate (%) of different algorithms on ORL dataset with various corruption percentage (%) (averaging over five random splits)
Fig. 6
figure 6

Average recognition accuracy on PIE dataset with different percentage of pixel corruptions (averaging over five random splits)

PIE Face Dataset. Figure 6 shows the recognition accuracy under various noise percentages on the PIE face dataset. Again, \(D^2L^2R^2\) performs the best most of the time. However, when there is no corruption or the percentage of corruption is very small, \(D^2L^2R^2\) cannot beat FDDL. We see that the low-rank regularization does not help much in this case, and can on the other hand degrade the performance. However, compared with DLRD_SR, \(D^2L^2R^2\) still obtains better accuracy due to the benefit of the Fisher criterion. This is also supported by LDA's good performance.

Extended YaleB Face Dataset. Table 3 lists the recognition accuracies of all compared methods on the Extended YaleB face dataset. Both \(D^2L^2R^2\) and DLRD_SR perform well when noise exists whereas recognition rates of LRC, SRC and LDA decrease fast with increasing noise, which demonstrates the superiority of low-rank regularization in terms of handling noise.

Table 3 Average recognition rate (%) of different algorithms on Extended YaleB dataset with various corruption percentage (%) (averaging over five random splits)

MNIST Digit Dataset. Table 4 shows the average recognition rates (with standard deviations) of all compared methods on the MNIST digit dataset. We can observe that our approach consistently outperforms all the other competing methods under different levels of corruption. The results demonstrate the capability of low-rank regularization in handling noise and the additional discrimination that the Fisher criterion introduces.

Table 4 Average recognition rate (%) of different algorithms on MNIST dataset with various corruption percentage (%) (averaging over five random splits)

6.3 Discussions

The results on the four datasets (both face and digit) show that \(D^2L^2R^2\) and DLRD_SR perform better under noisy conditions, which clearly demonstrates the advantage of low-rank regularization in dealing with noise. Comparing \(D^2L^2R^2\) with DLRD_SR, the former achieves better results almost all the time. This is due to the Fisher discriminant function on the coefficient matrix, which makes the learned dictionary more discerning. However, this function is defined on the coefficients of the training set, so when the training set is small compared to the test set, \(D^2L^2R^2\) and FDDL might not perform as well.

To investigate how sensitive the parameters are, we experiment on the PIE dataset without occlusion to see how different values of the parameters \(\lambda _1\) and \(\lambda _2\) affect the recognition accuracy. Figure 7 shows the recognition accuracy for different values of \(\lambda _1\), with the other parameters fixed. The accuracy reaches a plateau once \(\lambda _1\) reaches 0.025, which indicates that our method is insensitive to the choice of \(\lambda _1\). Similarly, Fig. 8 shows the recognition accuracy for different values of \(\lambda _2\). Again, the accuracy reaches a plateau once \(\lambda _2\) reaches 0.025, so our method is also insensitive to \(\lambda _2\). Also notice that when \(\lambda _1=0\), the accuracy drops by a relative 21 %, which shows the importance of the sparsity of the coefficients.

Fig. 7
figure 7

Recognition accuracy versus parameter \(\lambda _1\) on PIE dataset

Fig. 8
figure 8

Recognition accuracy versus parameter \(\lambda _2\) on PIE dataset

We also evaluate the computational cost of \(D^2L^2R^2\) and the other compared methods on the PIE dataset. Table 5 shows the running time of the different methods. We can observe that sparse representation based methods usually consume more time than linear methods such as LDA and LRC. As expected, our method's running time is between that of FDDL and DLRD_SR.

In addition, we only adopt images taken under controlled environments in the experiments, since the low-rank constraint is somewhat sensitive to images that are not well aligned. Designing models that learn a noise-free and discriminative dictionary from images taken in uncontrolled environments is left as future work.

Table 5 Running time (seconds) of different algorithms on PIE dataset

7 Summary

In this chapter, we present a novel low-rank dictionary learning method named discriminative dictionary learning with low-rank regularization (\(D^2L^2R^2\)) for face and digit image classification. Our method adopts a class-specific dictionary learning strategy and imposes a low-rank constraint on each sub-dictionary. The major contributions of our method include: (1) The supervised information is explicitly incorporated through the Fisher discriminant function, which makes the dictionary more discriminative; in addition, the correlation between each sub-dictionary and samples from other classes is minimized. (2) The low-rank constraints imposed on the sub-dictionaries make them more compact and robust to noise. (3) Based on the IPM and inexact ALM algorithms, an optimization algorithm is designed to solve the proposed \(l_1\)-regularized rank-minimization problem, and the classification scheme is also discussed. (4) We evaluate the classification performance of our approach and several compared methods on four face and digit image datasets. The experimental results show that our approach is superior to many state-of-the-art dictionary learning and classification methods, especially when the images are heavily corrupted.