
1 Introduction

Image classification has long been a core task in computer vision, with the aim of distinguishing different classes of images based on their features. In recent years, image classification algorithms have received a lot of attention because of their usefulness in many application areas (e.g., face recognition [21], handwritten digit recognition [15], and hyperspectral remote sensing imagery classification [10]), and a large number of such algorithms have been proposed. The mainstream classification algorithms can be broadly divided into two categories: non-parametric and parametric methods. The former are mainly representation-based, for example the nearest neighbor classifier, while the latter include many methods, such as the support vector machine (SVM) and neural networks (NNs). This study focuses on representation-based methods because of their mathematical interpretability.

As the earliest representation-based method, the nearest neighbor classifier simply represents a given point by its nearest neighbors and ignores the degree of representation. Therefore, researchers have developed more representation-based methods, such as the sparse representation-based classifier (SRC) [8] and the collaborative representation-based classifier (CRC) [19]. Generally, representation-based methods linearly represent a test sample using the given training samples and classify it by measuring the residuals between the sample and its class representations. During the representation process, SRC uses each class of training samples to linearly represent test samples, whereas CRC adopts all training samples to represent test samples. When the number of training samples in each class is small, the SRC representation can degrade classification performance. CRC avoids this situation well, which makes researchers prefer CRC. Therefore, several CRC extensions have been proposed, such as the collaborative-competitive representation-based classifier (CCRC) [18], double competitive constraints-based collaborative representation for classification (DCCRC) [6], weighted discriminative collaborative competitive representation (WDCCR) [5], and probabilistic CRC (ProCRC) [4]. Although the classification performance of CRC-based methods is good, their computational time increases as the amount of data increases. To maintain classification performance while reducing the computational effort, a natural idea is to learn informative features of samples through dictionaries and apply them in the classification method.

Back in 2006, Aharon et al. [1] proposed a dictionary learning method, the K-SVD algorithm, which is a generalization of K-means. Although K-SVD trains a dictionary with superior performance and performs well in image recovery and image compression, it is not suitable for image classification tasks because it does not utilize the label information of data. In view of this shortcoming, Zhang et al. [20] proposed a discriminative K-SVD (D-KSVD) algorithm based on K-SVD. D-KSVD adds a new label term to the original objective function by introducing label information, which maintains the performance of the dictionary while making it applicable to image classification tasks. Jiang et al. [9] proposed label consistent K-SVD (LC-KSVD) by adding a new label consistency constraint (a discriminative sparse coding error) to the objective function of D-KSVD, associating each dictionary atom with its corresponding label and forcing samples in the same class to have similar sparse representations.

The dictionaries learned by the above methods are all shared ones, which can adequately capture the main features of facial images when the intra-class variation is small. However, the representation capability of a shared dictionary decreases when images in the same class have a large intra-class distance owing to various factors during the shooting process. To remedy this, class-specific dictionary learning algorithms have been proposed, including Fisher discrimination-based dictionary learning (FDDL) [17], discriminative dictionary learning via Fisher discrimination K-SVD [22], and probabilistic collaborative dictionary learning [12]. Class-specific dictionary learning can learn dictionaries with better representation performance when sufficient data with large intra-class distances are available; however, the uncertainty of dictionary atoms increases if only a few training samples are used to learn the complete information of each class, which eventually degrades classification performance.

In this paper, we propose a new weak correlation-based discriminative dictionary learning (WCDDL) method. WCDDL first learns a structured dictionary from all training samples to ensure that the structured dictionary has a good reconstruction performance for test samples and then effectively associates each sub-dictionary with its corresponding class using the properties of class-specific structured dictionaries. In this way, each class-specific sub-dictionary has a good reconstruction ability for the training samples of that class. To address the weak discriminative ability of the dictionary brought by sparse training samples in each class, WCDDL incorporates a term, called the weak correlation term, which weakens the correlation between sub-dictionaries. Different from the methods in [11, 21], which increase the number of training samples to preserve classification performance, WCDDL increases the inter-class distance through the weak correlation term so that classes can be distinguished even with a small number of samples.

2 Proposed Method

This section presents WCDDL. Before explaining our proposed algorithm, we describe its learning framework. The goal of WCDDL is to learn a structured dictionary based on a given data set and apply the well-learned dictionary to classify unseen data points.

Let the given training sample set be \(H=\{(\textbf{y}_1,\ell _1),\ldots ,(\textbf{y}_n,\ell _n)\}\), where \(\textbf{y}_i\in \mathcal {R}^m\) is the ith training sample, \(\ell _i\in \{1,\ldots ,C\}\) is the label of \(\textbf{y}_i\), and n and C are the numbers of samples and classes, respectively. For the cth class, we denote its sample matrix as \(\textbf{Y}_c=[\textbf{y}_{c_1},\ldots ,\textbf{y}_{c_{n_c}}] \in \mathcal {R}^{m \times {n_c}}\), where \(c_i\in \{1,\ldots ,n\}\), and \(n_c\) is the number of training samples in the cth class. Note that \(n=\sum _{c=1}^{C}n_c\). Then, the entire training sample matrix is \(\textbf{Y} = [\textbf{Y}_1,\ldots ,\textbf{Y}_C] \in \mathcal {R}^{m\times n}\). Let \(\textbf{D}=[\textbf{D}_1,\ldots ,\textbf{D}_C]\in \mathcal {R}^{m\times (r\times C)}\) be the structured training dictionary, where \(\textbf{D}_c\in \mathcal {R}^{m\times r}\) is the sub-dictionary corresponding to the cth class, and r is the number of atoms in each sub-dictionary.

During the classification procedure, a test sample \(\textbf{y} \in \mathcal {R}^{m}\) can be represented by a linear combination of atoms of the structured dictionary \(\textbf{D}\). That is, \(\textbf{y} \approx \textbf{Dx} = \sum _{c=1}^C \textbf{D}_c \textbf{x}^c\), where \(\textbf{x} = [\textbf{x}^1,\ldots ,\textbf{x}^C]^T \in \mathcal {R}^{(r \times C) \times 1}\) is the coefficient vector of the dictionary \(\textbf{D}\) for sample \(\textbf{y}\), and \(\textbf{x}^c=[{x}^c_1,\ldots ,{x}^c_r]\in \mathcal {R}^{r}\) is the coefficient vector of the cth sub-dictionary for the sample \(\textbf{y}\).
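To make the shapes concrete, the following NumPy sketch (with arbitrary placeholder dimensions m, r, and C chosen only for illustration) shows how a test sample decomposes into the per-class contributions \(\textbf{D}_c \textbf{x}^c\).

```python
import numpy as np

# Placeholder dimensions for illustration only.
m, r, C = 64, 5, 3               # feature dimension, atoms per sub-dictionary, classes
D = np.random.randn(m, r * C)    # structured dictionary D = [D_1, ..., D_C]
x = np.random.randn(r * C)       # coefficient vector of a test sample y

# y is approximated by the sum of per-class contributions D_c x^c,
# where D_c and x^c are the cth column block of D and row block of x.
y_approx = sum(D[:, c * r:(c + 1) * r] @ x[c * r:(c + 1) * r] for c in range(C))
assert np.allclose(y_approx, D @ x)   # identical to the full product D x
```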

2.1 Dictionary Learning Algorithm

In our WCDDL, the structured dictionary can be obtained by solving the following optimization problem:

$$\begin{aligned} f{\mathbf {(D,X)}} = \min _{\mathbf {(D,X)}} r(\textbf{Y},\textbf{D},\textbf{X}) +\lambda g(\textbf{D},\textbf{X}) \end{aligned}$$
(1)

where \( \textbf{X} = [\textbf{x}_1,\ldots ,\textbf{x}_n] \in \mathcal {R}^{(r\times C) \times n}\) is the coefficient matrix with respect to the training sample matrix \(\textbf{Y}\), \(\textbf{x}_i\) is the coefficient vector of sample \(\textbf{y}_i\), \(r(\textbf{Y},\textbf{D},\textbf{X})\) is the reconstruction term, \(g(\textbf{D},\textbf{X})\) is the weak correlation term, and \(\lambda >0\) is the regularization parameter.

The purpose of reconstruction term \(r(\textbf{Y,D,X})\) is to learn a structured dictionary that extracts feature information from samples. In [17], \(r(\textbf{Y,D,X})\) is defined as

$$\begin{aligned} \begin{aligned} r(\textbf{Y,D,X})&= \sum _{c = 1}^{C}r(\textbf{Y}_c,\textbf{D},\textbf{X}_c) \end{aligned} \end{aligned}$$
(2)

where \(r(\textbf{Y}_c,\textbf{D},\textbf{X}_c)\) is used to learn the sub-dictionary of the cth class, and \(\textbf{X}_c\) is the coefficient matrix with respect to \(\textbf{Y}_c\). Further, this reconstruction term can be decomposed as

$$\begin{aligned} \begin{aligned} r(\textbf{Y,D,X})&= \sum _{c = 1}^{C}\left\{ \Vert \textbf{Y}_c - \textbf{DX}_c\Vert _2^2+\Vert \textbf{Y}_c - \textbf{D}_c \textbf{X}_c^c\Vert _2^2 +\sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_j \textbf{X}_c^j\Vert _2^2\right\} \end{aligned} \end{aligned}$$
(3)

where \(\textbf{X}_c^c\) is the coefficient matrix of \(\textbf{Y}_c\) with respect to the cth sub-dictionary. In (3), the first term \(\Vert \textbf{Y}_c-\textbf{DX}_c\Vert _2^2\) ensures that the learned dictionary approximately represents each sample well; the second term \(\Vert \textbf{Y}_c-\textbf{D}_c\textbf{X}_c^c\Vert _2^2\) forces the sub-dictionary \(\textbf{D}_c\) to represent the samples of the cth class as well as possible, provided that the whole dictionary can represent the samples approximately; the third term \(\sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_j\textbf{X}_c^j\Vert _2^2\) improves the discriminative property of the second term by reducing the representation ability of the sub-dictionaries of the other classes.
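As an illustration, the following Python sketch evaluates the reconstruction term of (3) for one class. It assumes the column-block layout of \(\textbf{D}\) and the row-block layout of \(\textbf{X}_c\) defined above and reads \(\Vert \cdot \Vert _2^2\) for matrices as the squared Frobenius norm.

```python
import numpy as np

def reconstruction_term(Y_c, D, X_c, c, r, C):
    """r(Y_c, D, X_c) in Eq. (3) for class c (squared Frobenius norms)."""
    D_c, X_cc = D[:, c * r:(c + 1) * r], X_c[c * r:(c + 1) * r, :]
    term1 = np.linalg.norm(Y_c - D @ X_c) ** 2        # whole dictionary reconstructs Y_c
    term2 = np.linalg.norm(Y_c - D_c @ X_cc) ** 2     # own sub-dictionary reconstructs Y_c
    term3 = sum(np.linalg.norm(D[:, j * r:(j + 1) * r] @ X_c[j * r:(j + 1) * r, :]) ** 2
                for j in range(C) if j != c)          # other sub-dictionaries contribute little
    return term1 + term2 + term3
```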

To compensate for the decrease in dictionary representation ability caused by a small number of training samples, we introduce the weak correlation term [5], which has the following form:

$$\begin{aligned} \begin{aligned} g(\textbf{D},\textbf{X})&= \sum _{c=1}^{C}g(\textbf{D}_c,\textbf{X}_c)\\&= \sum _{c=1}^{C}\sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_c\textbf{X}_c^c+\textbf{D}_j\textbf{X}_c^j\Vert _2^2\\&=\sum _{c=1}^{C}\sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_c\textbf{X}_c^c\Vert _2^2 +\Vert \textbf{D}_j\textbf{X}_c^j\Vert _2^2 +2(\textbf{D}_c\textbf{X}_c^c)^T(\textbf{D}_j \textbf{X}_c^j) \end{aligned} \end{aligned}$$
(4)

From (4), we can see that minimizing \(g(\textbf{D},\textbf{X})\) encourages the cross term \((\textbf{D}_c\textbf{X}_c^c)^T(\textbf{D}_j\textbf{X}_c^j)\) to be small, which can be regarded as the correlation between the dictionary representations of class c and class j. In other words, minimizing \(g(\textbf{D},\textbf{X})\) weakens the correlation between dictionary representations. As this term keeps shrinking, the correlation between the dictionary representations of class c and class j also keeps decreasing. In this way, we can distinguish between classes by extracting only a small amount of class information when the training samples are sparse.
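A corresponding sketch of the weak correlation term (4) is given below; as before, the block layout of \(\textbf{D}\) and \(\textbf{X}_c\) and the Frobenius-norm reading of \(\Vert \cdot \Vert _2^2\) are assumptions made only for illustration.

```python
import numpy as np

def weak_correlation_term(D, X_c, c, r, C):
    """g(D_c, X_c) in Eq. (4): penalizes correlated representations across classes."""
    Dc_Xcc = D[:, c * r:(c + 1) * r] @ X_c[c * r:(c + 1) * r, :]
    total = 0.0
    for j in range(C):
        if j == c:
            continue
        Dj_Xcj = D[:, j * r:(j + 1) * r] @ X_c[j * r:(j + 1) * r, :]
        # ||A + B||^2 = ||A||^2 + ||B||^2 + 2<A, B>; the inner product <A, B>
        # is the cross-class correlation that the minimization suppresses.
        total += np.linalg.norm(Dc_Xcc + Dj_Xcj) ** 2
    return total
```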

2.2 Solution

Now, we consider the solution to (1), using an alternating iterative method that alternately updates the structured dictionary \(\textbf{D}\) and the sparse coefficient matrix \(\textbf{X}\).

First, we fix the structured dictionary \(\textbf{D}\) and update the coefficient matrix \(\textbf{X}_c\) class by class. The specific objective function used to update \(\textbf{X}_c\) is as follows:

$$\begin{aligned} \begin{aligned} f({\textbf{X}_c})=\min _{\textbf{X}_c}~~&\Vert \textbf{Y}_c-\textbf{DX}_c\Vert _2^2+\Vert \textbf{Y}_c-\textbf{D}_c\textbf{X}_c^c\Vert _2^2\\ {}&+\sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_j\textbf{X}_c^j\Vert _2^2 + \lambda \sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_c\textbf{X}_c^c+\textbf{D}_j \textbf{X}_c^j\Vert _2^2 \end{aligned} \end{aligned}$$
(5)

In order to solve (5), we use a soft threshold function [14] for updating \(\textbf{X}_c\), which is widely used in sparse signal reconstruction tasks. The soft threshold function has the following form:

$$\begin{aligned} S(\mathbf {\beta })= sign(\mathbf {\beta })\odot (|\mathbf {\beta }|-\lambda _1w)_+ \end{aligned}$$
(6)

where the variable \(\mathbf {\beta }\) can be a scalar, vector, or matrix, \(\odot \) denotes element-wise multiplication of two vectors or matrices, \(\lambda _1>0\) is the weight factor, \(w\in \mathcal {R}\) is the threshold parameter that controls the magnitude of each change in \(\mathbf {\beta }\), \(sign(\cdot )\) is the sign function, and \((\cdot )_+=\max (\cdot ,0)\). In addition, we need the partial derivative of \(f({\textbf{X}_c})\) with respect to \(\textbf{X}_c\). That is

$$\begin{aligned} \begin{aligned} \frac{\partial f({\textbf{X}_c})}{\partial \textbf{X}_c} =&-2\textbf{D}^T(\textbf{Y}_c-\textbf{DX}_c)-2\textbf{D}_c^T(\textbf{Y}_c-\textbf{D}_c\textbf{X}_c^c) + 2\sum _{j=1,j\ne c}^{C}\textbf{D}_j^T\textbf{D}_j\textbf{X}_c^j\\&+ 2\lambda \sum _{j=1,j\ne c}^{C}(\textbf{D}_c^T + \textbf{D}_j^T)({\textbf{D}_c\textbf{X}_c^c}+{\textbf{D}_j\textbf{X}_c^j}) \end{aligned} \end{aligned}$$
(7)

Let \(\beta =\textbf{X}_c-\frac{\lambda }{2}\frac{\partial {f({\textbf{X}_c})}}{\partial {\textbf{X}_c}}\). We use the soft threshold function (6) to iteratively update the coefficient matrix for each class with the following equation:

$$\begin{aligned} \textbf{X}_c^t=S(\textbf{X}_c^{t-1}-\frac{\lambda }{2}\frac{\partial {f({\textbf{X}_c^{t-1}})}}{{\partial {\textbf{X}_c^{t-1}}}}),~~ c=1,\ldots ,C \end{aligned}$$
(8)

where t is the current iteration number, and \(\textbf{X}_c^{t-1}\) is the coefficient matrix of the cth class generated in the \((t-1)\)th iteration. We iterate until \(\textbf{X}_c\) converges or t reaches the predetermined maximum number of iterations.
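The coefficient update can be sketched as follows. The gradient of (7) is accumulated block by block over the row blocks of \(\textbf{X}_c\), and the soft threshold (6) is then applied as in (8); the step size \(\lambda /2\), the threshold \(\tau =\lambda _1 w\), and the fixed iteration budget follow the description above, while the block-wise bookkeeping is an implementation choice of this sketch.

```python
import numpy as np

def soft_threshold(B, tau):
    """Eq. (6): element-wise soft thresholding with threshold tau = lambda_1 * w."""
    return np.sign(B) * np.maximum(np.abs(B) - tau, 0.0)

def update_coefficients(Y_c, D, X_c, c, r, C, lam, tau, n_iter=50):
    """Iterate Eq. (8) for class c, accumulating the gradient of Eq. (7) block by block."""
    blk = lambda j: slice(j * r, (j + 1) * r)
    for _ in range(n_iter):
        D_c, X_cc = D[:, blk(c)], X_c[blk(c), :]
        grad = -2 * D.T @ (Y_c - D @ X_c)                    # from ||Y_c - D X_c||^2
        grad[blk(c), :] += -2 * D_c.T @ (Y_c - D_c @ X_cc)   # from ||Y_c - D_c X_c^c||^2
        for j in range(C):
            if j == c:
                continue
            D_j, X_cj = D[:, blk(j)], X_c[blk(j), :]
            grad[blk(j), :] += 2 * D_j.T @ (D_j @ X_cj)      # from ||D_j X_c^j||^2
            cross = D_c @ X_cc + D_j @ X_cj                  # weak correlation term
            grad[blk(c), :] += 2 * lam * D_c.T @ cross
            grad[blk(j), :] += 2 * lam * D_j.T @ cross
        X_c = soft_threshold(X_c - 0.5 * lam * grad, tau)    # Eq. (8)
    return X_c
```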

After updating the coefficient matrix \(\textbf{X}\), we fix it and update the structured dictionary \(\textbf{D}\), again class by class. The optimization problem with respect to \(\textbf{D}_c\) only is

$$\begin{aligned} \begin{aligned} f({\textbf{D}_c})=\min _{\textbf{D}_c}~~&\Vert \textbf{Y}_c-\textbf{DX}_c\Vert _2^2+\Vert \textbf{Y}_c-\textbf{D}_c\textbf{X}_c^c\Vert _2^2+\sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_j\textbf{X}_c^j\Vert _2^2\\&+\lambda \sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_c\textbf{X}_c^c+\textbf{D}_j\textbf{X}_c^j\Vert _2^2 \end{aligned} \end{aligned}$$
(9)

Let \(\mathbf {Y'} = \textbf{Y}_c-\sum _{j=1,j\ne c}^{C}\textbf{D}_j\textbf{X}_c^j\), so (9) can be rewritten as

$$\begin{aligned} \begin{aligned} f({\textbf{D}_c})= \min _{\textbf{D}_c}~~&\Vert \mathbf {Y'}-\textbf{D}_c\textbf{X}_c^c\Vert _2^2+\Vert \textbf{Y}_c-\textbf{D}_c\textbf{X}_c^c\Vert _2^2+\sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_j\textbf{X}_c^j\Vert _2^2\\&+\lambda \sum _{j=1,j\ne c}^{C}\Vert \textbf{D}_c\textbf{X}_c^c+\textbf{D}_j\textbf{X}_c^j\Vert _2^2 \end{aligned} \end{aligned}$$
(10)

Similarly, the partial derivative of \(f({\textbf{D}_c})\) with respect to \(\textbf{D}_c\) must be calculated to optimize (10). We set the partial derivative of \(f({\textbf{D}_c})\) with respect to \(\textbf{D}_c\) to zero and obtain

$$\begin{aligned} \begin{aligned} \frac{\partial f({\textbf{D}_c})}{\partial \textbf{D}_c} =&-2(\mathbf {Y'}-\textbf{D}_c\textbf{X}_c^c){\textbf{X}_c^c}^T - 2(\textbf{Y}_c-\textbf{D}_c\textbf{X}_c^c){\textbf{X}_c^c}^T\\&+2\lambda \sum _{j=1,j\ne c}^{C}(\textbf{D}_c\textbf{X}_c^c+\textbf{D}_j\textbf{X}_c^j){\textbf{X}_c^c}^T=0 \end{aligned} \end{aligned}$$
(11)

Since the columns of the coefficient matrix \(\textbf{X}_c^c\) are linearly independent, \({\textbf{X}_c^c}^T\) has full row rank and can be cancelled on the right of (11). By arranging (11), we have

$$\begin{aligned} \mathbf {Y'}-\textbf{D}_c\textbf{X}_c^c+\textbf{Y}_c-\textbf{D}_c\textbf{X}_c^c-\lambda \sum _{j=1,j\ne c}^{C}(\textbf{D}_c\textbf{X}_c^c+\textbf{D}_j\textbf{X}_c^j)=0 \end{aligned}$$
(12)

Rearranging (12), we can further obtain

$$\begin{aligned} \frac{2\textbf{Y}_c-(\lambda +1)\sum _{j=1,j\ne c}^{C}\textbf{D}_j\textbf{X}_c^j}{2+\lambda (C-1)}= \textbf{D}_c\textbf{X}_c^c \end{aligned}$$
(13)

Let \(\mathbf {Y''} = \frac{2\textbf{Y}_c-(\lambda +1)\sum _{j=1,j\ne c}^{C}\textbf{D}_j\textbf{X}_c^j}{2+\lambda (C-1)}\). Then (13) can be rewritten in the following form:

$$\begin{aligned} \mathbf {Y''} = \textbf{D}_c\textbf{X}_c^c \end{aligned}$$
(14)

We use the K-SVD method to get \(\textbf{D}_c\) in (14) and then obtain the structured dictionary \(\textbf{D}\).

We repeat the above updates until the coefficient matrix and the dictionary converge or the maximum number of iterations is reached.
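A minimal sketch of the dictionary update and the overall alternating loop is given below. It reuses update_coefficients from the previous sketch, forms \(\mathbf {Y''}\) as in (13), and replaces the K-SVD step for (14) with a plain least-squares fit followed by atom normalization; these last two choices are simple stand-ins for illustration rather than the exact procedure.

```python
import numpy as np

def update_subdictionary(Y_c, D, X_c, c, r, C, lam):
    """Update D_c via Eqs. (13)-(14); least squares stands in for the K-SVD step."""
    blk = lambda j: slice(j * r, (j + 1) * r)
    others = sum(D[:, blk(j)] @ X_c[blk(j), :] for j in range(C) if j != c)
    Y2 = (2 * Y_c - (lam + 1) * others) / (2 + lam * (C - 1))   # Y'' in Eq. (13)
    D_c = Y2 @ np.linalg.pinv(X_c[blk(c), :])                   # fit Y'' ~ D_c X_c^c
    # Normalize atoms to unit l2 norm (customary in dictionary learning);
    # the coefficients are re-estimated in the next alternation anyway.
    D_c /= np.maximum(np.linalg.norm(D_c, axis=0, keepdims=True), 1e-12)
    return D_c

def train_wcddl(Ys, D, Xs, r, lam, tau, n_outer=20):
    """Alternate coefficient and dictionary updates for a fixed iteration budget.

    Ys and Xs are lists holding Y_c and X_c for each class c."""
    C = len(Ys)
    for _ in range(n_outer):
        for c in range(C):
            Xs[c] = update_coefficients(Ys[c], D, Xs[c], c, r, C, lam, tau)
        for c in range(C):
            D[:, c * r:(c + 1) * r] = update_subdictionary(Ys[c], D, Xs[c], c, r, C, lam)
    return D, Xs
```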

2.3 Classification Algorithm

After we obtain the well-trained structured dictionary \(\textbf{D}\), we consider the label prediction for a given test sample. Let \(\textbf{y}_{test}\in \mathcal {R}^{m}\) be an arbitrary test sample. To estimate its class label, we first compute the coefficient vector \(\textbf{x}_{test}\) according to the structured dictionary \(\textbf{D}\). The objective function for solving the coefficient vector is as follows:

$$\begin{aligned} \textbf{x}_{test} = \arg \min _{\textbf{x}} \left\{ \Vert \textbf{y}_{test}-\textbf{Dx}\Vert _2^2+\lambda _1\Vert \textbf{x}\Vert _1\right\} \end{aligned}$$
(15)

To solve (15), we use the orthogonal matching pursuit (OMP) algorithm [13] to find the optimal solution vector \(\textbf{x}_{test}\).

To predict the label of \(\textbf{y}_{test}\), we use the class-specific representation \(\textbf{D}_c\textbf{x}_{test}^c\) of \(\textbf{y}_{test}\) to compute the representation residual \(e_c\) of the cth class. The residual of the cth class is calculated as follows:

$$\begin{aligned} e_c=\Vert \textbf{y}_{test}-\textbf{D}_c\textbf{x}_{test}^c\Vert _2^2+\lambda _2\Vert \textbf{x}_{test}-\textbf{m}_c\Vert _2^2 \end{aligned}$$
(16)

where \(0<\lambda _2<1\) is the weight factor, and \(\textbf{m}_c\) is the mean of the coefficient vectors of the training samples in the cth class. In (16), the first term \(\Vert \textbf{y}_{test}-\textbf{D}_c\textbf{x}_{test}^c\Vert _2^2\) is mainly used to calculate the reconstruction residual, and the second term \(\Vert \textbf{x}_{test}-\textbf{m}_c\Vert _2^2\) estimates how similar the coefficient vector is to the representation coefficient vectors of the cth class. If \(\Vert \textbf{x}_{test}-\textbf{m}_c\Vert _2^2\) is small, then the representation coefficient vector of \(\textbf{y}_{test}\) is close to the representation coefficient vectors of the cth class samples. Finally, the label of the test sample \(\textbf{y}_{test}\) is predicted by

$$\begin{aligned} \ell _{test}=\arg \min _{c=1,\ldots ,C} e_c \end{aligned}$$
(17)
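The classification stage can be sketched as follows. Here scikit-learn's OMP solver stands in for the sparse coding of (15) (OMP enforces a sparsity budget rather than the \(\ell _1\) penalty written in (15)), the matrix M of per-class mean coefficient vectors \(\textbf{m}_c\) is assumed to have been computed from the training coefficients, and the sparsity level n_nonzero is an illustrative parameter.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def classify(y_test, D, M, r, C, lam2, n_nonzero=30):
    """Eqs. (15)-(17): sparse-code y_test on D, then pick the class with the smallest residual.

    M is an (r*C) x C matrix whose cth column is the mean coefficient vector m_c."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
    x = omp.fit(D, y_test).coef_                              # x_test in Eq. (15)
    residuals = []
    for c in range(C):
        D_c, x_c = D[:, c * r:(c + 1) * r], x[c * r:(c + 1) * r]
        e_c = (np.linalg.norm(y_test - D_c @ x_c) ** 2
               + lam2 * np.linalg.norm(x - M[:, c]) ** 2)     # Eq. (16)
        residuals.append(e_c)
    return int(np.argmin(residuals)) + 1                      # Eq. (17), labels 1..C
```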

3 Experiments

The goal of this section is to validate the feasibility and efficiency of the proposed algorithm. First, we briefly introduce the facial datasets used for the experiments, and then we compare the classification accuracy of different representation-based models on these datasets, including K-SVD [1], D-KSVD [20], FDDL [17], LC-KSVD [9], and label embedded dictionary learning (LEDL) [16].

3.1 Datasets

In our experiments, we use two public facial datasets, Yale [3, 7] and ORL [2], which are widely used to validate representation-based algorithms. These datasets are described as follows:

  • Yale: This database is a facial dataset that contains 165 face images from 15 people, each with 11 face images. The images of each person cover different lighting conditions, expressions, and configurations: center light, with glasses, happy, left light, without glasses, normal, right light, sad, sleepy, surprised, and winking. Fig. 1(a) shows some specific examples of the Yale dataset.

  • ORL: The ORL database is also a facial dataset that contains 400 face images from 40 people, each with 10 face images. These images were acquired at different times of day, under different lighting, with different facial expressions (eyes open/closed, smiling/not smiling) and facial details (with/without glasses). Figure 1(b) provides some specific examples from the ORL dataset.

Fig. 1. Some image samples from (a) Yale and (b) ORL databases.

3.2 Performance Comparison

For comparison, we randomly divide both the Yale and ORL datasets into a training set and a test set. Compared models are first trained on the training set and then tested on the test set to measure classification performance. Each dataset is divided in the following way: p images from each class are taken as the training samples and the remaining images in that class are regarded as the test samples, where p takes values in the set \(\{2,3,4,5,6,7,8\}\) and the divided dataset is called pTrain. The randomness of division may influence the experimental results. To eliminate this randomness, we perform 50 random divisions for each pTrain and report the average results over the 50 trials.

Classification Performance. Table 1 presents the experimental results on Yale, where the highest values among the compared methods are highlighted in bold. First, we can clearly observe that the performance of all compared algorithms gradually decreases as the number of training samples decreases, which is inevitable. Second, we can see that WCDDL performs well on all seven divided datasets compared to the other algorithms. When the number of training samples is small, WCDDL performs much better than the other methods. For example, the proposed WCDDL has a classification accuracy of 72.6% on 2Train, in which each class has only two samples for training. In this case, our method improves the accuracy by 10.4% over the second-best algorithm, FDDL. For the case of 8Train, WCDDL is 6.9% higher in accuracy than the second-best method, LEDL. Relatively speaking, the improvement of WCDDL on smaller datasets is more obvious. Note that WCDDL is designed on the basis of FDDL; in detail, our method replaces the discriminative coefficient term in FDDL with the weak correlation term. The results indicate that WCDDL is much better than FDDL, which means that the weak correlation term works well.

Table 2 shows the experimental results on ORL, where the highest values among the six methods are in bold. Similar conclusions can be drawn from Table 2. More training samples induce better classification performance, and the classification accuracy of WCDDL is higher than that of the other representation-based algorithms under all training set divisions. On the ORL dataset, WCDDL ranks first in all seven divisions, followed by LEDL. As the number of training samples increases, the accuracy gap between WCDDL and LEDL keeps narrowing. In other words, the advantage of WCDDL is that it deals with sparse training samples better.

Table 1. Mean accuracy obtained by compared methods on Yale with various divided datasets.
Table 2. Mean accuracy obtained by compared methods on ORL with various divided datasets.

Representation Visualization. Owing to the relationship between WCDDL and FDDL, we further compare them by examining their structured dictionaries and reconstructed images. After training a model, we can obtain the structured dictionary \(\textbf{D}\) and coefficient matrix \(\textbf{X}\) and then reconstruct unseen images with \(\textbf{D}\) and \(\textbf{X}\).

Figures 2(a) and 2(b) show the dictionary and reconstructed samples generated by FDDL, respectively. From Fig. 2, we can see that the structured dictionary trained by FDDL extracts the overall facial features, and thus FDDL achieves good visualization results in the subsequent reconstruction of images.

Figure 3(a) shows the first two classes of atoms of the structured dictionary obtained by WCDDL, and Fig. 3(b) plots the first two classes of samples reconstructed by WCDDL. By comparing the dictionaries and reconstructed images obtained by FDDL and WCDDL, we find that the visualization result of WCDDL is not as good as that of FDDL. The proposed WCDDL aims to learn the distinguishing features of different classes and to reduce the correlation between dictionary representations of different classes. Thus, the reconstructed samples are very close within the same class and differ more between classes.

Fig. 2. Visualization of FDDL on Yale: (a) structured dictionary \(\textbf{D}\) and (b) reconstructed images.

Fig. 3. Visualization of WCDDL on Yale: (a) structured dictionary \(\textbf{D}\) and (b) reconstructed images.

4 Conclusion

In this study, image classification is implemented by the proposed WCDDL. To address the poor dictionary classification capability induced by sparse training samples, we introduce a weak correlation term to reduce the correlation between any two classes. To evaluate WCDDL, we conduct extensive experiments on two common facial datasets. Experimental results show the classification advantage of our approach, in particular for sparse training samples. On the Yale dataset, WCDDL outperforms the second-best method by 10.4% when there are only two training samples for each class. We also validate the effectiveness of the weak correlation term by comparing FDDL and WCDDL. Our work is not only applicable to face classification but can also be extended to other classification tasks, such as scene classification and handwriting recognition.

Of course, there are still some issues with our current approach. For example, the dictionary update in WCDDL takes a two-stage approach, in which the iterative update of the coefficient matrix is time-consuming. Thus, we may adopt a simpler and more efficient update scheme to improve the overall performance of the model in the future. How to further improve the classification performance and reduce the computational time when the training sample set is sparse will be our future research goal.