
1 Introduction

Automatic face recognition under occlusion has been a hot topic in computer vision and pattern recognition due to the increasing demand from real-world applications. Various approaches have been proposed, including subspace mapping algorithms [1,2,3], feature extraction [4,5,6] and kernel models [7,8,9,10]. However, all these methods use reconstructed images for classification. Since reconstruction may remove useful information and introduce redundant information, whether reconstructed images are suitable for occluded face recognition requires further study.

To avoid over-fitting, a regularization term is generally imposed on the LR model. Two constraints are widely used: the L2-norm and the L1-norm, the latter being the traditional choice for sparse representation. Recently proposed sparse representation classification (SRC) algorithms have achieved promising performance on image classification, image super-resolution and related tasks [11,12,13,14,15,16,17]. Since sparse representation classification approaches achieved competitive performance in face recognition [18], they have attracted researchers' attention in image classification. The sparse representation classification model has shown robustness to sparse random pixel corruption and block occlusion. Nevertheless, learning a discriminative dictionary for both sparse data representation and classification remains a difficult problem.

Some recent work, on the other hand, began to investigate the role of sparsity in face recognition [10, 19,20,21]. Liu et al. [21] introduced the dual form of dictionary learning and provided some theoretical proof, arguing that it is the L1 constraint together with the L2 constraint that makes SRC effective. To overcome high residual error and instability, Zeng et al. [20] analyzed the main principle of SRC and argued that the collaborative representation strategy can enhance interpretability. They presented a collaborative representation classifier (CRC) based on ridge regression. CRC can thus be considered a special case of the SRC algorithm; however, it does not provide a mechanism for noise removal, so it is not robust for recognizing occluded faces.

Face recognition algorithms that handle occlusion need to be robust against arbitrary occlusions. Despite the emergence of a large number of face detection algorithms, most existing algorithms focus on partial occlusion. In the early years, Wen et al. proposed a face occlusion detection method using Gabor filters [22]. To exploit temporal features, algorithms based on spatio-temporal information were used to find frontal faces [23,24,25]. Skin color-based detection approaches have also been used to find faces [26, 27]. Interestingly, other researchers focus on the popular "recognition by parts" scheme, whose main aim is to predict the head position by fitting an appropriate human body model to other parts, such as probabilistic weighted retrieval [28], locally salient ICA information [29], the PRSOM learning algorithm (Probabilistic Self-Organizing Model) [30], a discriminative and robust subspace model [31], a dynamic similarity function [32], local non-negative matrix factorization [33], a holistic PCA model [34], an SVM model [35], a Markov Random Field method [36], an optimal feature selection model [37], a confidence weighting model [38], and an embedded hidden Markov method [39]. These approaches can cope with partial occlusion by extracting features from the non-occluded parts; for severely occluded cases, however, their performance degrades. Other head detection approaches are also an active research area in surveillance applications, such as color model-based, contour-based and matching-based approaches, which can be regarded as a different application of face detection. Color model-based approaches [40, 41] determine face regions by extracting hair and face color information; their computational complexity is very low, but they fail when the head region is severely covered. Matching-based approaches [42, 43] detect the head by comparing the similarity between a training template and the current region. Contour-based approaches [44, 45] use complex geometric curves to describe face contour features; this kind of algorithm can deal with severe occlusion, but its computational cost is high, and it is hard to apply to low-resolution images. In this paper, we detect head regions with a novel and robust algorithm.

It is worth mentioning that convolutional neural network methods, such as DeepID [46] and WebFace [47], have proved able to handle face recognition under various variations. However, they exploit large amounts of data with very complex image variations to assist training, so their main drawbacks are high computational complexity and complex parameter tuning. They are therefore not well suited to undersampled face recognition, especially the face occlusion problem.

Note that the residual image (the difference between the raw and reconstructed image) contains most of the occluded information, as shown in Fig. 1; the occluded region in the residual image is clearly visible. In this paper, we propose a discriminative sparse coding model for the recognition task. With the same setting as [20, 21], we consider the scenario where only one non-occluded training sample is available for each subject of interest, which is close to many real application scenarios such as security and video surveillance. Compared with related methods, the advantages of our proposed model are highlighted as follows:

Fig. 1. Examples of raw images, reconstructed images and residual images. (a) The raw images. (b) The reconstructed images. (c) The residual images.

  • An occlusion variation dictionary is learned to represent the possible occlusion variations between training and testing samples. Different from SRC, our proposed model extracts features from the covariance of occlusion variations based on deep networks to construct the occlusion variation dictionary. Experimental results show that the learned dictionary can efficiently represent the possible occlusion variations.

  • A novel measurement strategy is proposed to improve sparsity, robustness and discriminative ability. Different from traditional sparse representation, whose only task is to minimize the reconstruction error, the proposed model introduces two terms, the similarity constraint term and the coefficient incoherence term, to ensure that the learned dictionary has powerful discriminative ability.

The remainder of the paper is organized as follows: Sect. 2 presents related work. Section 3 describes our proposed model. Section 4 reports experimental results and Sect. 5 draws conclusions.

2 Related Work

In SRC [15], Wright et al. proposed a general classification scheme in which the training samples of all classes are taken as the dictionary to represent a query face image, which is then classified by evaluating which class leads to the minimal reconstruction error. Since the SRC scheme has shown impressive performance in FR, how to design a framework and algorithm to learn a discriminative dictionary for both sparse data representation and classification has attracted a great deal of attention.

Wright et al. [15] proposed the sparse representation based classification (SRC) scheme for robust face recognition (FR). Given K classes of subjects, let \( D = \left[ {A_{1} ,A_{2} , \cdots ,A_{K} } \right] \) be the dictionary formed by the set of training samples, where \( A_{i} \) is the subset of training samples from class i, and let y be a test sample. The algorithm of SRC is summarized as follows.

  1. Normalize each training sample in \( A_{i} \), \( i = 1,2, \cdots ,K \).

  2. Solve the l1-minimization problem \( \hat{x} = {\text{argmin}}_{x} \left\{ {\left\| {y - Dx} \right\|_{2}^{2} + \gamma \left\| x \right\|_{1} } \right\} \), where \( \gamma \) is a scalar constant.

  3. Label the test sample y via \( {\text{Label}}\left( y \right) = {\text{argmin}}_{i} \left\{ {e_{i} } \right\} \), where \( e_{i} = \left\| {y - A_{i} \hat{\alpha }^{i} } \right\|_{2}^{2} \), \( \hat{x} = \left[ {\hat{\alpha }^{1} ,\hat{\alpha }^{2} , \cdots ,\hat{\alpha }^{K} } \right]^{T} \) and \( \hat{\alpha }^{i} \) is the coefficient vector associated with class i.

Obviously, the underlying assumption of this scheme is that a test sample can be represented as a weighted linear combination of just those training samples belonging to the same class. The impressive performance reported in [15] showed that sparse representation is naturally discriminative.
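
To make the scheme concrete, here is a minimal Python sketch of the SRC pipeline above, using scikit-learn's Lasso as the l1-minimization solver; the function name and solver choice are our own illustrative assumptions, not the implementation of [15].

```python
# Minimal SRC sketch. Columns of D are assumed l2-normalized (step 1).
# Note: sklearn's Lasso scales the data-fit term by 1/(2*n_samples),
# so alpha corresponds to gamma only up to a constant factor.
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(D, labels, y, gamma=0.01):
    """D: (n, N) dictionary of training samples; labels: (N,) class ids;
    y: (n,) test sample. Returns the predicted class label."""
    # Step 2: solve min_x ||y - Dx||_2^2 + gamma * ||x||_1.
    lasso = Lasso(alpha=gamma, fit_intercept=False, max_iter=10000)
    lasso.fit(D, y)
    x_hat = lasso.coef_
    # Step 3: pick the class with the minimal reconstruction residual e_i.
    best, best_err = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        err = np.linalg.norm(y - D[:, mask] @ x_hat[mask]) ** 2
        if err < best_err:
            best, best_err = c, err
    return best
```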

According to the predefined relationship between dictionary atoms and class labels, current supervised dictionary learning can be divided into three categories: shared dictionary learning, class-specific dictionary learning and hybrid dictionary learning. In shared dictionary learning, a dictionary shared by all classes is learned, while the discriminative power of the representation coefficients is also exploited [48, 49]. Generally, in this scheme, a shared dictionary and a classifier over the representation coefficients are learned together. However, there is no relationship between the dictionary atoms and the class labels, so no class-specific representation residuals are available for the classification task.

In class-specific dictionary learning, a dictionary whose atoms are predefined to correspond to the subject class labels is learned, so that the class-specific reconstruction error can be used to perform classification [50, 51].

Hybrid dictionary models, which combine shared dictionary atoms and class-specific dictionary atoms, have also been proposed [52, 53]. Although the shared atoms can make the learned hybrid dictionary more compact, balancing the shared part and the class-specific part of the hybrid dictionary is not a trivial task.

3 Proposed Model

Machine learning algorithms are often used in computer vision due to their ability to leverage large amounts of training data to improve performance. For the face recognition task, the deeply learned features are required to generalize well enough to identify new unseen classes without label prediction. To enhance the discriminative power of deeply learned features, Wen et al. proposed a new supervision signal called the center loss. Specifically, the center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers [54]. It is encouraging to see that their CNNs achieve state-of-the-art accuracy. Therefore, in this paper, we adopt this deep network model to extract occluded face features.
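
To illustrate the supervision signal, below is a minimal PyTorch sketch of the center loss of [54]; the class name and the random center initialization are our own assumptions.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss sketch: one learnable center per class, penalizing
    squared distances between deep features and their class centers."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # features: (batch, feat_dim); labels: (batch,) integer class ids
        diffs = features - self.centers[labels]
        return 0.5 * diffs.pow(2).sum(dim=1).mean()
```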

Recently proposed sparse representation classification (SRC) algorithms have achieved promising performance on image classification, image super-resolution and related tasks; a detailed introduction to sparse representation can be found in [15,16,17]. Since sparse representation classification approaches achieved competitive performance in face recognition [18], they have attracted researchers' attention in image classification. Nevertheless, learning a discriminative dictionary for both sparse data representation and classification remains an open problem.

To address these difficulties, we propose a modified sparse model. Different from traditional sparse representation, whose only task is to minimize the reconstruction error, the proposed model introduces two terms, the representation-constrained term and the coefficient incoherence term, to ensure that the learned dictionary has powerful discriminative ability.

3.1 Proposed Sparse Classification Model

The representation-constrained term projects each descriptor into its local coordinate system, which captures the correlations between similar descriptors through the shared dictionary. The coefficient incoherence term, on the other hand, ensures that samples from different classes are represented by independent sub-dictionaries.

In class-specific dictionary learning, each sub-dictionary of \( {\text{D}} = [D_{1} ,D_{2} , \cdots ,D_{K} ] \) corresponds to a class label, where \( D_{i} \) is the sub-dictionary of class i. In our experimental setting, the training deep feature samples are \( \left\{ {a_{ij} \left| {i = 1,2, \cdots ,K;j = 1,2, \cdots ,N} \right.} \right\} \), where \( a_{ij} \) denotes the j-th sample of class i, K is the number of classes, and N is the number of training samples per class. Let \( {\text{A}} = [A_{1} ,A_{2} , \cdots ,A_{K} ] \in R^{n \times N} \), where \( A_{i} = [a_{i1} ,a_{i2} , \cdots ,a_{iN} ] \) and n is the deep feature dimension. Our purpose is to include the classification error as a term in the objective function for dictionary learning, so that the learned dictionary is optimal for classification. The sparse code Z can be used directly as a feature for classification. Let \( Z = [Z_{1} ,Z_{2} , \cdots ,Z_{K} ] \), and denote the learned dictionary by \( {\text{D}} = [d_{1} ,d_{2} , \cdots ,d_{k} ] \in R^{n \times k} \) (k > n and k < N). We propose the following novel sparse model:

$$ \begin{aligned} \left\langle {D,W,Z} \right\rangle & = \arg \hbox{min} \left\{ {\left\| {A - DZ} \right\|_{F}^{2} + \lambda_{1} \left\| Z \right\|_{1} + \lambda_{2} \left\| {Z - m} \right\|_{F}^{2} + \gamma_{1} \left\| {WZ - B} \right\|_{F}^{2} + \gamma_{2} \left\| W \right\|_{F}^{2} } \right\} \\ & \quad {\text{s.t.}}\;\left\| {d_{c} } \right\|_{2} \le 1,\;\forall c \in \left\{ {1,2, \cdots ,k} \right\} \\ \end{aligned} $$
(1)

where \( m = [m_{1} ,m_{2} , \cdots ,m_{K} ] \in R^{k \times N} \) and \( m_{i} \) denotes the mean vector of \( Z_{i} \) in class i, \( \left\| {WZ - B} \right\|_{F}^{2} \) denotes the classification error, \( B = [b_{1} ,b_{2} , \cdots ,b_{N} ] \in R^{m \times N} \) holds the class labels of the input features, and \( b_{i} = [0, \cdots ,1, \cdots ,0]^{T} \in R^{m} \) is a label vector whose nonzero entry indicates the class. \( W \in R^{m \times k} \) denotes the matrix of classifier parameters, and \( \lambda_{1} \), \( \lambda_{2} \), \( \gamma_{1} \) and \( \gamma_{2} \) are scale adjustment parameters.

In our proposed model, the representation-constrained term \( \left\| {WZ - B} \right\|_{F}^{2} \) and the coefficient incoherence term \( \left\| W \right\|_{F}^{2} \) are thus introduced into Eq. 1.
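
For concreteness, the following NumPy sketch evaluates the objective of Eq. 1 for given (D, W, Z); shapes follow Sect. 3.1 and all variable names are illustrative assumptions.

```python
import numpy as np

def objective(A, D, Z, W, B, m, lam1, lam2, gam1, gam2):
    """Value of Eq. 1, excluding the unit-norm constraint on atoms d_c."""
    fro2 = lambda X: np.linalg.norm(X, 'fro') ** 2
    return (fro2(A - D @ Z)              # reconstruction error
            + lam1 * np.abs(Z).sum()     # l1 sparsity term
            + lam2 * fro2(Z - m)         # mean (similarity) constraint
            + gam1 * fro2(W @ Z - B)     # classification error term
            + gam2 * fro2(W))            # classifier regularizer
```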

3.2 Optimization Process

Obviously, the function in Eq. 1 is not jointly convex in (D, W, Z), but it is convex in each of D, W and Z when the other two variables are fixed. We can therefore split Eq. 1 into three sub-problems and optimize the variables alternately: update Z with D and W fixed, update D with W and Z fixed, and update W with D and Z fixed. The details are as follows.

Updating Z: when D and W are fixed, solving for Z is a sparse coding problem. When \( Z_{i} \) is updated, all \( Z_{j} (j \ne i) \) are fixed. Thus, for each \( Z_{i} \), the objective function in Eq. 1 reduces to:

$$ \left\langle Z \right\rangle = \arg \hbox{min} \left\{ {\left\| {A - DZ} \right\|_{F}^{2} + \lambda_{1} \left\| Z \right\|_{1} + \lambda_{2} \left\| {Z - m} \right\|_{F}^{2} + \gamma_{1} \left\| {WZ - B} \right\|_{F}^{2} } \right\} $$
(2)

By solving Eq. 2, we have:

$$ Z_{i} = \left\{ {D^{T} D + (\lambda_{1} + \lambda_{2} )I + \gamma_{1} W^{T} W} \right\}^{ - 1} (D^{T} A_{i} + \lambda_{2} m_{i} + \gamma_{1} W^{T} b_{i} ) $$
(3)
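
A NumPy sketch of the per-class closed-form update in Eq. 3 is given below; variable names and shapes are illustrative assumptions.

```python
import numpy as np

def update_Z_i(D, W, A_i, m_i, b_i, lam1, lam2, gam1):
    """D: (n, k) dictionary; W: (m, k) classifier; A_i: (n, N_i) features
    of class i; m_i: (k,) mean code; b_i: (m,) label vector of class i."""
    k = D.shape[1]
    G = D.T @ D + (lam1 + lam2) * np.eye(k) + gam1 * (W.T @ W)
    rhs = D.T @ A_i + (lam2 * m_i + gam1 * (W.T @ b_i))[:, None]
    return np.linalg.solve(G, rhs)  # (k, N_i) codes for class i
```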

Updating D: when Z and W are fixed, Eq. 1 becomes a dictionary update problem for \( {\text{D}} = [D_{1} ,D_{2} , \cdots ,D_{K} ] \). When \( D_{i} \) is updated, all \( D_{j} (j \ne i) \) are fixed. Thus, Eq. 1 reduces to:

$$ \left\langle D \right\rangle = \arg \hbox{min} \left\{ {\left\| {A - DZ} \right\|_{F}^{2} } \right\},\quad {\text{s.t.}}\;\left\| {d_{c} } \right\|_{2} = 1,\;\forall c \in \left\{ {1,2, \cdots ,k} \right\}. $$
(4)

The above problem in Eq. 4 can be solved effectively by the Lagrange dual method.
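
The paper solves Eq. 4 with the Lagrange dual method; as a simple stand-in for illustration, the sketch below uses an unconstrained least-squares fit followed by column normalization, which enforces the unit-norm constraint directly. This is a substitute technique, not the authors' solver.

```python
import numpy as np

def update_D(A, Z):
    """A: (n, N) features; Z: (k, N) codes; returns an (n, k) dictionary."""
    D = A @ Z.T @ np.linalg.pinv(Z @ Z.T)  # least-squares fit of A ≈ DZ
    norms = np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D / norms  # project each atom d_c onto the unit sphere
```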

Updating W: when D and Z are fixed, Eq. 1 reduces to:

$$ \left\langle W \right\rangle = \arg \hbox{min} \left\{ {\gamma_{1} \left\| {WZ - B} \right\|_{F}^{2} + \gamma_{2} \left\| W \right\|_{F}^{2} } \right\} $$
(5)

Obviously, Eq. 5 is a least-squares problem, which yields the following closed-form solution:

$$ W_{i} = b_{i} Z_{i}^{T} (Z_{i} Z_{i}^{T} + \frac{{\gamma_{2} }}{{\gamma_{1} }}I)^{ - 1} $$
(6)
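
Stacked over all classes at once, Eq. 6 is a ridge regression; a minimal NumPy sketch follows (names are illustrative).

```python
import numpy as np

def update_W(Z, B, gam1, gam2):
    """Z: (k, N) codes; B: (m, N) label matrix; returns (m, k) classifier."""
    k = Z.shape[0]
    return B @ Z.T @ np.linalg.inv(Z @ Z.T + (gam2 / gam1) * np.eye(k))
```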

Therefore, by iterating the above updates, the optimized values of all variables in Eq. 1 can be obtained.

4 Experimental Results

To evaluate the proposed model, we compare it with state-of-the-art sparse representation-based approaches for face recognition with occlusion: sparse representation based classification (SRC) [15], robust sparse coding (RSC) [16], correntropy-based sparse representation (CESR) [17] and extended sparse representation-based classification (ESRC) [19].

4.1 Results on the AR Database with Real-World Occlusion

We evaluate the performance of our proposed model in dealing with real occlusion using the AR face database [18], which consists of 4000 frontal-face images of 126 subjects (70 men and 56 women). Each subject was captured in two separate sessions with 13 images per session, under different variations including facial expressions, illumination changes and occlusions (such as sunglasses and scarves).

In the first group, all remaining occluded samples of the 80 subjects from session 1 and session 2 are used as the testing set, divided into a sunglasses subset and a scarf subset for each session. The results, together with those of existing methods, are shown in Table 1. Based on the results, we can draw the following conclusions:

Table 1. Recognition rates for different methods on the AR database.
  • In the same session (session 1 or session 2), the existing methods perform much better on the sunglasses subset than on the scarf subset. This is because the sunglasses occlude roughly 20% of the image, while the scarf occludes roughly 40%.

  • Sparse representation-based face recognition approaches such as SRC, RSC and CESR handle the occlusion problem poorly. For example, the recognition rates of SRC, RSC and CESR are only 13.33%, 35.83% and 10.00% on the scarf subset from session 2, respectively. The main reason is the lack of sufficient training samples to represent the test sample.

  • Our proposed algorithm obtains significantly higher recognition rates than most of the compared methods, achieving 89.68%, 86.48%, 70.16% and 64.29% on the four subsets, respectively. This indicates that our proposed model is more robust to occlusion variations than the existing methods.

4.2 Results on the CAS-PEAL Database with Real Occlusion

The CAS-PEAL face database [18] consists of 9594 images of 1040 subjects (595 males and 445 females), captured under different variations including pose, expression, accessory, lighting, time and distance. Each subject is captured under at least two of these variations. Here we take a subset with normal and accessory variations, containing 3038 images of 434 subjects with 7 images per subject: 1 neutral image, 3 images with glasses/sunglasses, and 3 images with hats. All images are cropped to 120 × 100 pixels.

In each recognition run, we select 350 subjects of interest for training and testing, and the remaining 84 subjects are used as external data for learning the occlusion variation dictionary. For training, we use only the neutral image of each of the 350 subjects. For testing, we consider three separate test subsets of the 350 subjects: the first consists of the 3 images of each subject wearing glasses/sunglasses (glasses subset); the second consists of the 3 images of each subject wearing hats (hat subset); the third combines the 6 images of each subject from the glasses and hat subsets.

The recognition rates for all methods on the CAS-PEAL database are given in Table 2. Based on the results, we can draw the following conclusions:

Table 2. Recognition rates for different methods on the CAS-PEAL database.
  • The hat occlusion subset is more challenging than the glasses/sunglasses occlusion subset on this database.

  • SRC-based face recognition approaches such as ESRC, RSC and CESR improve on the performance of ordinary SRC for glasses/sunglasses occlusion, but perform poorly on hat occlusion. For example, the recognition rate of CESR reaches 89.33% for glasses/sunglasses occlusion but degrades seriously, to only 29.43%, for hat occlusion.

  • Our proposed algorithm achieves the best results on all subsets. This is because the representation-constrained term makes the representation coefficients more discriminative, and the corresponding classification method effectively exploits this information. It indicates that the proposed model learns the occlusion variation well and is also effective at detecting occlusion cases.

5 Conclusion

In this paper, we present a novel sparse representation-based classification model and apply the alternating direction method of multipliers to solve it. Different from traditional sparse representation, whose only task is to minimize the reconstruction error, the proposed model introduces two terms, the representation-constrained term and the coefficient incoherence term, to ensure that the learned dictionary has powerful discriminative ability. The proposed model takes advantage of the structural characteristics of noise and provides a unified framework integrating error detection and error support into one sparse model. Extensive experiments demonstrate that the proposed model is robust to occlusions.