1 Introduction

Over the last few decades, the sparse representation-based classification (SRC) framework has been widely used in face and facial expression recognition (FER) because of its robustness to interferences such as corruption and occlusion. Wright et al. [1] first presented SRC and applied it to face recognition. Their work not only demonstrates the superiority of SRC, but also claims that feature extraction is unnecessary for the SRC framework when the training face images are sufficient. This claim is supported by experimental results showing that SRC with random projection-based features can outperform some traditional face recognition schemes. However, comparison of the experimental results of published works shows that the feature extraction method does affect the performance of the SRC framework, especially in FER [2], because effective facial expression features, such as Eigenfaces [2, 3], Gabor [2], LBP [4], HOG [5], and deep learning features [6], can highlight the characteristics of specific expressions better than raw face images. The experimental results also show that the SRC framework performs better than some traditional classification frameworks, such as the Support Vector Machine (SVM) and Nearest Neighbour (NN), when facial expression images are polluted by noise and occluding blocks. In the SRC-based FER work mentioned above, all the researchers take the overall facial features as the main basis for distinguishing expressions, while ignoring the unique features of each type of expression.

Therefore, with the development of FER research [7, 8], expression feature extraction methods have gradually begun to focus on reducing intra-class variation. Before FER researchers recognized its importance, intra-class variation had already been studied in many areas of object detection [9], where it is interpreted as illumination, viewpoint, scale, occlusion, shading, clutter, blur, motion, imaging noise, etc. In FER work, intra-class variation mainly refers to identity and illumination. In existing applications of the SRC framework, the two main ways of reducing the impact of intra-class variation are based on dictionary construction and on decomposition-reconstruction accuracy. In detail, one way is to apply different feature extraction or dictionary optimization methods to increase the difference between sample categories in the dictionary, such as PCA-based dictionary building relying on different images [10], intra-class variation reduction features (IVRF) [7, 8], or KSVD (K-means Singular Value Decomposition) [11]. The other way is to optimize the solution process of the sparse coding, such as Collaborative Representation Classification (CRC) [12] or extended sparse representation-based classification methods [13, 14]. Comparing these two ways, in our opinion, the first mainly focuses on dealing with facial expression characteristics and is the key to improving FER performance under SRC, while the second is more universal and emphasizes proposing a more general and precise recognition framework based on SRC.

In this paper, we follow the first way, aiming at a recognition method based on the SRC framework that is tailored to the FER task and robust to intra-class variation. In practice, reducing intra-class variation can improve the sparsity of the sparse weight matrix corresponding to the dictionary built from training samples. Analysis of the existing works indicates two ways to decrease the impact of intra-class variation. One is to extract the difference information between different types of facial expression images [7, 8]; the other is to extract the difference information between neutral face images and specific facial expression images [10, 15, 16]. In our opinion, the requirement for a comprehensive training dictionary is much higher in the first way, because those researchers [7, 8] use the difference information between query images and the intra-class variation reduced features (IVRF) of all seven facial expressions, and this leads to the loss of some common features of the action units (AUs) shared by different kinds of facial expressions. For instance, in the definition of FACS [17] (facial action coding system), the fear expression is composed of AU1 + AU2 + AU4 + AU5 + AU7 + AU20 + AU25, while AU1 + AU2 + AU5 + AU25 + AU26 composes the surprise expression. These two expressions share four AUs (AU1, AU2, AU5 and AU25), and many characteristics may be lost when they are directly subtracted from each other. In other words, the second way may lead to less information loss than the first. Recently, some researchers have realized this fact; for example, Du et al. [15] propose a facial expression method based on the difference images between the neutral face and specific expressions, but they do not extract auxiliary features of the neutral face. Lee et al. [18] directly use the difference vector between the ICV face image and the peak expression image in videos for FER. However, no matter which way we choose, the risk of losing characteristic information cannot be removed. Zhe et al. [19] have recognized this problem, but they only treat the differential information as auxiliary to the full-face information.

As a whole, it is noticeable that the overall facial features and the differential facial features are both critical to FER. However, they are not treated equally in many existing works. Therefore, this paper proposes a novel FER method based on compound-variational dictionaries to fuse these two kinds of features equally. Figure 1 illustrates the process of the proposed method.

Fig. 1 Flowchart of the proposed method based on decision-fusion and compound-variational dictionaries

The main contributions of this paper can be summarized as follows:

1. The compound-variational dictionaries are proposed to fuse mixed and difference information when dealing with the FER problem. As shown in Fig. 1, the compound dictionary, built from facial expression features at the apex, is used to preserve characteristics that may be lost during the difference operation between the neutral face and a specific expression face. On the other hand, the variational dictionary is used to remove intra-class variation, such as illumination and identity.

2. A novel FER process based on the compound-variational dictionaries and the SRC framework is proposed. To achieve better performance, we design a two-stage SRC procedure. First, we reconstruct the mixed information of a specific facial expression from the sparse coefficients over the compound dictionary and make a preliminary judgment for each query image sequence. Second, we use the difference information between the neutral face image and the apex expression image in one face image sequence to discriminate facial expressions with the variational dictionary. Finally, a decision-fusion strategy relying on the reconstruction error is applied to give the final judgment.

The rest of the paper is organized as follows: Sect. 2 introduces the proposed method for solving the FER task and gives detailed algorithms. Section 3 shows the experimental results and compares the proposed method with existing state-of-the-art methods in FER problems. The conclusion is given in Sect. 4.

2 Methods

2.1 Dictionary Design

In practice, the dictionary is essential for the SRC framework when carrying out the FER task. In early work, many researchers believed that a single sufficient dictionary was enough for the SRC framework to finish the task. However, with further research, many studies have found that two or more dictionaries can help the SRC framework improve its performance in certain situations, such as the under-sampled FR challenge [13], simplifying the process of sparse coding [12], and rating the exaggeration of expressions [20].

In this paper, inspired by the above work, we propose a dual-dictionary structure called compound-variational dictionaries to reduce intra-class variation and eliminate the impact of information loss. The construction process of the dictionaries is shown in Fig. 2.

Fig. 2 Establishing process of compound-variational dictionaries

Generally, it is challenging to align all the essential facial units when dealing with the whole face. So, when designing the variational dictionary, we determine some typical Regions of Interest (ROIs) that characterize the facial action units in face image sequences. As shown in Fig. 3, a representative set of ROIs is selected according to the Action Unit definitions in the FACS system [17] and the pre-processed facial feature points. In our study, we extract nine ROIs from each face image.

Fig. 3 Illustration of ROI selection

In practice, for each facial expression sequence we first select an expressionless face image, denoted \(d_{j}\). Second, we select several facial expression images at the apex level, denoted \(\left\{ {b_{j}^{n} } \right\}\), where n is the number of selected frames. The selected ROIs of face images with specific expressions are denoted \(\left\{ {\theta^{i} \left( \cdot \right)} \right\}_{K}\), where {i = 1, 2, …, 9} indexes the ROIs and {K = ksad, ksurprise, kangry, khappy, kfear, khate} denotes the number of training sequences of each expression. The compound dictionary D is formed by merging \(\left\{ {\theta^{i} \left( {\left\{ {b_{j}^{n} } \right\}} \right)} \right\}\) for each person and arranging the result into the matrix below:

$$D \equiv \left[ {H\left( {\left\{ {\theta^{i} \left( {\left\{ {b_{1}^{n} } \right\}} \right)} \right\}_{{k_{{{\text{sad}}}} }} } \right), \ldots ,H\left( {\left\{ {\theta^{i} \left( {\left\{ {b_{3}^{n} } \right\}} \right)} \right\}_{{k_{{{\text{angry}}}} }} } \right), \ldots ,H\left( {\left\{ {\theta^{i} \left( {\left\{ {b_{6}^{n} } \right\}} \right)} \right\}_{{k_{{{\text{hate}}}} }} } \right)} \right] \in R^{N \times K}$$
(1)

where \({\text{H}}\left( \cdot \right)\) denotes extracting HOG features and N denotes the feature dimension. The variational dictionary E is formed from the difference images between \(d_{j}\) and \(\left\{ {b_{j}^{n} } \right\}\) and is arranged into the matrix below:

$$E \equiv \left[ {L\left( {\left\{ {\theta^{i} \left( {\left\{ {b_{1}^{n} } \right\} - d_{1} } \right)} \right\}_{{k_{{{\text{sad}}}} }} } \right), \ldots ,L\left( {\left\{ {\theta^{i} \left( {\left\{ {b_{3}^{n} } \right\} - d_{3} } \right)} \right\}_{{k_{{{\text{angry}}}} }} } \right), \ldots ,L\left( {\left\{ {\theta^{i} \left( {\left\{ {b_{6}^{n} } \right\} - d_{6} } \right)} \right\}_{{k_{{{\text{hate}}}} }} } \right)} \right] \in R^{M \times K}$$
(2)

where \(L\left( \cdot \right)\) denotes extracting LBP features from the difference image between \(\left\{ {b_{j}^{n} } \right\}\) and \(d_{j}\), and M denotes the feature dimension.
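For concreteness, the assembly of D and E in Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration rather than the exact implementation: the helpers `extract_rois`, `hog_feature` and `lbp_feature` are hypothetical placeholders for the ROI selection of Fig. 3 and the descriptors of Sect. 2.2.

```python
import numpy as np

def build_dictionaries(sequences, extract_rois, hog_feature, lbp_feature):
    """Assemble the compound dictionary D (Eq. 1) and the variational dictionary E (Eq. 2).

    sequences    : list of training sequences; each entry holds the neutral frame d_j,
                   the apex frames {b_j^n} and the expression label.
    extract_rois : callable returning the nine ROI patches theta^i(.) of a frame.
    hog_feature  : callable H(.) returning a 1-D HOG vector for a patch.
    lbp_feature  : callable L(.) returning a 1-D LBP vector for a patch.
    """
    D_cols, E_cols, labels = [], [], []
    for seq in sequences:
        d_j = seq['neutral'].astype(float)
        for b in seq['apex']:
            b = b.astype(float)
            rois_apex = extract_rois(b)          # {theta^i({b_j^n})}
            rois_diff = extract_rois(b - d_j)    # {theta^i({b_j^n} - d_j)}
            D_cols.append(np.concatenate([hog_feature(r) for r in rois_apex]))
            E_cols.append(np.concatenate([lbp_feature(r) for r in rois_diff]))
            labels.append(seq['label'])
    D = np.stack(D_cols, axis=1)   # N x K (one column per training apex frame)
    E = np.stack(E_cols, axis=1)   # M x K
    return D, E, np.asarray(labels)
```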

2.2 Facial Features Extraction

In many previous works [5, 10, 15], HOG and LBP features have shown better performance than other features, such as Gabor, Haar, Eigenface, Fisherface and Laplacian face, when combined with SRC. Therefore, we use the histogram of oriented gradients (HOG) to extract the joint information and local binary patterns (LBP) to extract the differential information between the neutral face image and the apex expression images of one individual. Furthermore, the fineness of feature extraction is also very important [5, 8, 18]. In our previous work [5], we set up the following strategies for improving fineness. When extracting HOG features, the spatial cell segmentation strategies include: (1) face images are divided by a 4 × 4 pixel sliding window with a step of 2 pixels; (2) face images are divided by an 8 × 8 pixel sliding window with a step of 4 pixels; (3) face images are divided by a 16 × 16 pixel sliding window with a step of 8 pixels. When extracting LBP features, the strategies include: (1) a 4 × 4 pixel sliding window with a step of 1 pixel; (2) a 4 × 4 pixel sliding window with a step of 2 pixels; (3) an 8 × 8 pixel sliding window with a step of 2 pixels; (4) an 8 × 8 pixel sliding window with a step of 4 pixels. In this work, we use all the strategies instead of selecting a single best one, because the finer the features, the better the recognition accuracy that can be achieved.
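The multi-scale segmentation strategies above can be mimicked with a simple sliding-window extractor. The sketch below gives concrete versions of the `hog_feature` and `lbp_feature` placeholders used in the earlier dictionary sketch, using scikit-image's `hog` and `local_binary_pattern`; the descriptor parameters (orientation bins, LBP radius, histogram binning) are illustrative assumptions rather than the exact settings of [5].

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

def sliding_window_features(image, win, step, descriptor):
    """Concatenate descriptor outputs over overlapping win x win patches sampled every `step` pixels."""
    feats = []
    rows, cols = image.shape
    for r in range(0, rows - win + 1, step):
        for c in range(0, cols - win + 1, step):
            feats.append(descriptor(image[r:r + win, c:c + win]))
    return np.concatenate(feats)

def hog_patch(patch):
    # one HOG cell/block per patch (9 orientation bins)
    return hog(patch, orientations=9,
               pixels_per_cell=patch.shape, cells_per_block=(1, 1))

def lbp_patch(patch):
    # histogram of uniform LBP codes (P=8, R=1 gives 10 bins)
    codes = local_binary_pattern(patch, P=8, R=1, method='uniform')
    hist, _ = np.histogram(codes, bins=np.arange(11), density=True)
    return hist

HOG_SCALES = [(4, 2), (8, 4), (16, 8)]         # (window, step) of the three HOG strategies
LBP_SCALES = [(4, 1), (4, 2), (8, 2), (8, 4)]  # the four LBP strategies

def hog_feature(image):
    return np.concatenate([sliding_window_features(image, w, s, hog_patch)
                           for w, s in HOG_SCALES])

def lbp_feature(image):
    return np.concatenate([sliding_window_features(image, w, s, lbp_patch)
                           for w, s in LBP_SCALES])
```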

All the HOG features extracted by these three segmentation strategies are concatenated into one compound dictionary; the process is described in Fig. 4. Likewise, all the extracted LBP features are concatenated into one variational dictionary.

Fig. 4 The building process of the compound dictionary

2.3 Two-Stage SRC (TSSRC) Framework

In this paper, we design a two-stage SRC framework called TSSRC. In the TSSRC framework, each input face image sequence contributes two different sub-signals coded over the compound and variational dictionaries. For a test image sequence y, we first select a neutral face image \(p^{{\text{neutral}}}\) and apex face images \(q^{{\text{peak}}}\), from which the comprehensive characteristics \(y_{1} = H\left( {\theta^{i} \left( {q^{{{\text{peak}}}} } \right)} \right)\) and the differential characteristics \(y_{2} = L\left( {\theta^{i} \left( {q^{{{\text{peak}}}} - p^{{{\text{neutral}}}} } \right)} \right)\) are extracted. Then, the optimal sparse coefficient vectors \(\hat{l}\) and \(\hat{m}\) of the two-stage SRC are obtained by solving the following l1-norm minimization problems:

$$\hat{l} = \arg \min \left\| l \right\|_{1} \,s.t.\left\| {y_{1} - Dl} \right\|_{2} \le \varepsilon$$
(3)
$$\hat{m} = \arg \min \left\| m \right\|_{1} \quad s.t.\left\| {y_{2} - Em} \right\|_{2} \le \varepsilon$$
(4)

Conceptually, Eq. (3) means that the input face image sequence is represented by the features of the whole face, while Eq. (4) means that it is represented by the features of specific facial expression differences.
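Equations (3) and (4) can be solved with any standard l1 solver. As a rough sketch only, the constrained problems can be replaced by their Lagrangian (Lasso) counterparts; the weight `alpha` below stands in for the tolerance ε and would need tuning, and the dictionary columns are l2-normalized first so that coefficients are comparable across atoms.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(y, dictionary, alpha=0.01):
    """Approximate  arg min ||x||_1  s.t.  ||y - dictionary @ x||_2 <= eps
    by the Lagrangian form  min (1/2n)||y - Dx||_2^2 + alpha*||x||_1  (scikit-learn Lasso)."""
    atoms = dictionary / (np.linalg.norm(dictionary, axis=0, keepdims=True) + 1e-12)
    solver = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    solver.fit(atoms, y)
    return solver.coef_

# l_hat = sparse_code(y1, D)   # Eq. (3), coefficients over the compound dictionary
# m_hat = sparse_code(y2, E)   # Eq. (4), coefficients over the variational dictionary
```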

In particular, the preliminary expression class labels \(T_{{j_{1} }}^{{{\text{pre}}}}\) and \(Y_{{j_{2} }}^{{{\text{pre}}}}\) are determined by finding the expression class with the maximum sum of sparse coefficients:

$$T_{{j_{1} }}^{{{\text{pre}}}} = \mathop {\arg \max }\limits_{{j_{1} = 1, \ldots ,6}} \sum {\delta_{{j_{1} }} \left( {\hat{l}} \right)}$$
(5)
$$Y_{{j_{2} }}^{{{\text{pre}}}} = \mathop {\arg \max }\limits_{{j_{2} = 1, \ldots ,6}} \sum {\delta_{{j_{2} }} \left( {\hat{m}} \right)}$$
(6)

where \(\delta_{{j_{1} }} \left( {\hat{l}} \right) = \left[ {0, \ldots ,l_{1}^{{j_{1} }} ,l_{2}^{{j_{1} }} , \ldots } \right]\) and \(\delta_{{j_{2} }} \left( {\hat{m}} \right) = \left[ {0, \ldots ,m_{1}^{{j_{2} }} ,m_{2}^{{j_{2} }} , \ldots } \right]\) are the sparse coefficient sub-vectors corresponding to each expression class in \(\hat{l}\) or \(\hat{m}\), with the coefficients of all other classes set to zero.
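With a label vector aligning each dictionary column to its expression class, Eqs. (5) and (6) amount to summing the coefficient block of each class and keeping the largest. A minimal sketch (the class encoding is assumed, not taken from the paper):

```python
import numpy as np

def preliminary_label(coeffs, labels):
    """Eqs. (5)/(6): the preliminary class is the one whose coefficient sub-vector sums largest.

    coeffs : sparse coefficient vector (l_hat or m_hat).
    labels : expression class of each dictionary column, same length as coeffs.
    """
    classes = np.unique(labels)
    scores = np.array([coeffs[labels == c].sum() for c in classes])  # sum of delta_j(.)
    return classes[int(np.argmax(scores))]

# T_pre = preliminary_label(l_hat, labels)   # Eq. (5)
# Y_pre = preliminary_label(m_hat, labels)   # Eq. (6)
```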

If \(T_{{j_{1} }}^{{{\text{pre}}}}\) and \(Y_{{j_{2} }}^{{{\text{pre}}}}\) indicate the same expression class, we directly output this class as the final classification result. Otherwise, we perform a joint judgment based on the sparse coefficients \(\hat{l}\) and \(\hat{m}\).

In practical work, we first train two auxiliary decision dictionaries for D and E by coding each sample with the rest of the samples. Table 1 summarizes the process.

Table 1 Training process of auxiliary decision dictionaries

Finally, we use S and Z to integrate the two-stage classification results, as shown in Table 2.

Table 2 Reconstruction error based joint judgement
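Tables 1 and 2 are not reproduced here, so the exact role of S and Z is only summarized. The usual reconstruction-error criterion in SRC is the class-wise residual \(\left\| {y - D\delta_{j} \left( x \right)} \right\|_{2}\); a hedged sketch of a joint judgment along these lines is given below (the actual rule involving the auxiliary decision dictionaries may differ):

```python
import numpy as np

def class_residual(y, dictionary, coeffs, labels, c):
    """Reconstruction error ||y - D * delta_c(coeffs)||_2 using only the atoms of class c."""
    masked = np.where(labels == c, coeffs, 0.0)
    return np.linalg.norm(y - dictionary @ masked)

def joint_judgment(y1, D, l_hat, y2, E, m_hat, labels, t_pre, y_pre):
    """When the two preliminary labels disagree, keep the stage whose own candidate class
    reconstructs its query with the smaller (normalized) residual."""
    if t_pre == y_pre:
        return t_pre
    r1 = class_residual(y1, D, l_hat, labels, t_pre) / (np.linalg.norm(y1) + 1e-12)
    r2 = class_residual(y2, E, m_hat, labels, y_pre) / (np.linalg.norm(y2) + 1e-12)
    return t_pre if r1 <= r2 else y_pre
```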

2.4 Facial Expression Recognition Based on TSSRC

In this part, a particular implementation is considered to assess the performance of TSSRC for FER on facial image sequences, as shown in Fig. 5. The main steps of the proposed FER procedure for face image sequences with TSSRC are summarized as follows.

Fig. 5 Block diagram of the proposed approach

  • Step 1 An expressionless sample y1 (neutral face) and samples \(y_{2}^{w}\) (w = 3) at the apex expression are selected from the probe image sequence. The training samples are used to build the compound dictionary D and the variational dictionary E.

  • Step 2 The ROI regions of y1 and \(y_{2}^{w}\) are located and defined as \(\theta^{i} \left( {y_{1} } \right)\) and \(\theta^{i} \left( {y_{2}^{w} } \right)\).

  • Step 3 The HOG descriptor is applied to \(\theta^{i} \left( {y_{2}^{w} } \right)\) to generate the feature space \(H\left( {\theta^{i} \left( {y_{2}^{w} } \right)} \right)\).

  • Step 4 LBP descriptor is applied to the difference information between \(\theta^{i} \left( {y_{1} } \right)\) and \(\theta^{i} \left( {y_{2}^{w} } \right)\) to generate feature space \(L\left( {\theta^{i} \left( {y_{2}^{w} } \right) - \theta^{i} \left( {y_{1} } \right)} \right)\).

  • Step 5 SRC is executed for \(H\left( {\theta^{i} \left( {y_{2}^{w} } \right)} \right)\) and \(L\left( {\theta^{i} \left( {y_{2}^{w} } \right) - \theta^{i} \left( {y_{1} } \right)} \right)\) over D and E to generate two sparse coefficient vectors.

  • Step 6 If the judgments of the two SRC stages indicate different class labels, we combine the two sparse coefficient vectors based on the auxiliary decision dictionaries S and Z to obtain the final judgment, as shown in the sketch after this list.
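Putting the previous sketches together, Steps 1-6 for one probe sequence might look as follows. This again is a sketch under the earlier assumptions, reusing the hypothetical `extract_rois`, `hog_feature`, `lbp_feature`, `sparse_code`, `preliminary_label` and `joint_judgment`; for simplicity a single apex frame is coded, whereas the procedure above selects w = 3 apex frames.

```python
import numpy as np

def classify_sequence(neutral_frame, apex_frame, D, E, labels,
                      extract_rois, hog_feature, lbp_feature):
    """Steps 1-6 of the proposed FER procedure for a single probe image sequence (sketch)."""
    neutral = neutral_frame.astype(float)
    apex = apex_frame.astype(float)
    # Steps 2-4: ROIs, HOG of the apex frame, LBP of the difference image
    feat_hog = np.concatenate([hog_feature(r) for r in extract_rois(apex)])
    feat_lbp = np.concatenate([lbp_feature(r) for r in extract_rois(apex - neutral)])
    # Step 5: sparse coding over the compound and variational dictionaries
    l_hat = sparse_code(feat_hog, D)
    m_hat = sparse_code(feat_lbp, E)
    t_pre = preliminary_label(l_hat, labels)
    y_pre = preliminary_label(m_hat, labels)
    # Step 6: reconstruction-error-based fusion when the two stages disagree
    return joint_judgment(feat_hog, D, l_hat, feat_lbp, E, m_hat, labels, t_pre, y_pre)
```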

2.5 Complexity Analysis

In our method, the main time-consuming processes are the sparse coding steps over the compound, variational and auxiliary decision dictionaries. So, the total computational complexity of our method is \(O\left( {m_{1} k_{1} z_{1} + m_{2} k_{2} z_{2} + m_{3} k_{3} z_{3} + m_{4} k_{4} z_{4} } \right)\), where z1, z2, z3, z4 are the numbers of non-zero entries in the sparse coding results over the compound dictionary, the variational dictionary and the auxiliary decision dictionaries, and m1k1, m2k2, m3k3, m4k4 are the numbers of elements in those dictionaries.
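As an illustration only (not a measured cost), the operation count in the expression above can be evaluated directly from the dictionary shapes and the sparsity of the codes; the small helper below simply instantiates the formula for whatever dictionaries and coefficient vectors are supplied.

```python
import numpy as np

def coding_cost(dictionaries, codes):
    """Evaluate sum_i m_i * k_i * z_i from Sect. 2.5 for given dictionaries and sparse codes."""
    return sum(Dic.shape[0] * Dic.shape[1] * np.count_nonzero(c)
               for Dic, c in zip(dictionaries, codes))

# e.g. coding_cost([D, E], [l_hat, m_hat]) for the two main coding steps
```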

3 Experiment Results

To evaluate the performance of our approach, we carry out experiments on the CK+ database [21] and select 310 labelled expression image sequences from 110 subjects. The CK+ database mainly consists of Western faces, and each face image in the image sequences has 68 landmarks, as shown in Fig. 6. In practice, our method needs 7 landmarks to help select the ROIs.

Fig. 6 a Image sequence of a sad expression in the CK+ database with 7 landmarks; b Image sequence in the CK+ database with 68 landmarks

We select the first frame and the last three frames from each sequence to evaluate our method in the experiment. Moreover, the performance evaluation of the proposed approach is based on ten-fold cross-validation and the leave-one-subject-out (LOSO) protocol. In this study, the two eye locations were manually determined, and all cropped face images were rescaled to 64 × 64 pixels. Figure 7 shows example face images.
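Both evaluation protocols can be set up with standard utilities; a sketch is shown below, where the number of sequences and the per-sequence subject labels are assumed inputs (variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

def tenfold_splits(n_sequences, seed=0):
    """Ten-fold cross-validation indices over the expression sequences."""
    return KFold(n_splits=10, shuffle=True, random_state=seed).split(np.arange(n_sequences))

def loso_splits(subject_ids):
    """Leave-one-subject-out indices: each fold holds out all sequences of one subject."""
    return LeaveOneGroupOut().split(np.arange(len(subject_ids)), groups=subject_ids)

# for train_idx, test_idx in tenfold_splits(310):
#     build D, E from the training sequences and classify the test sequences
```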

Fig. 7 Cropped example face images

At the beginning, we select one fold as the testing samples and the remaining nine folds as training samples to evaluate the performance of our method, and use the experimental results to reveal the relationship between the two SRC stages, as shown in Tables 3 and 4.

Table 3 The confusion result (%) based on one fold in the first stage SRC
Table 4 The confusion result (%) based on one fold in the second stage SRC

The first-stage SRC, whose results are shown in Table 3, does better on the sad, surprise, happy, and hate expressions, while the second-stage SRC, whose results are shown in Table 4, does better on the angry and fear expressions. Therefore, the two SRC stages are complementary. Figures 8, 9, 10, 11, 12, 13, 14, 15 and 16 present the confusion results of the other nine folds, and the complementarity holds there as well.

Fig. 8 The confusion result based on the first fold

Fig. 9 The confusion result based on the second fold

Fig. 10 The confusion result based on the third fold

Fig. 11 The confusion result based on the fourth fold

Fig. 12 The confusion result based on the fifth fold

Fig. 13 The confusion result based on the sixth fold

Fig. 14 The confusion result based on the seventh fold

Fig. 15 The confusion result based on the eighth fold

Fig. 16 The confusion result based on the ninth fold

The results show that the first-stage sparse representation achieves good performance in most cases, which means that, in most instances, methods based on the overall facial features perform better than methods based on variational facial expression features. However, in some cases the recognition accuracy of the first-stage sparse representation drops sharply, for example in Figs. 9 and 11. This decline shows that the recognition results are affected by identity, a problem that cannot easily be overcome by a one-stage sparse representation method based on overall face features. Thus, the main role of the second-stage sparse representation is to improve the robustness of the first stage in such unexpected cases. However, the improvement in robustness comes at the cost of longer running time. In the following, we compare the running time of some one-stage sparse representation methods. The running time per image of our method is about one minute and twenty-five seconds. The experiments are conducted on a notebook with an Intel(R) Core(TM) i7-4860HQ and 32 GB RAM, and the program is written in MATLAB (Table 5).

Table 5 Running time of different one-stage sparse representation methods

To fuse the results of the two-stage SRC, we propose a method based on the analysis of the sparse coefficient vectors. Figures 17 and 18 display the sparse coefficient vectors of the two SRC stages when classifying the fear expression in one fold.

Fig. 17 The sparse coefficient vectors of the first stage SRC

Fig. 18 The sparse coefficient vectors of the second stage SRC

The fourth, fifth, sixth, thirteenth and fifteenth samples may be misclassified by the second-stage SRC while they are correctly classified by the first-stage SRC. In this situation, directly using sparsity weights [18] to fuse the two results faces a severe problem: the weights are hard to choose by experience. We therefore use the reconstruction error instead of sparsity weights to jointly judge the two results based on the sparse coefficient vectors.

In the following, we compare the final validation result after reconstruction-error-based fusion with several classical methods on the CK+ database.

Table 6 shows that the proposed TSSRC model achieves the best performance, which indicates that our method makes fuller use of the various kinds of facial expression information than the other SRC-based methods.

Table 6 FER rates (%) on the CK+ database

We also evaluate our method on the JAFFE database, which mainly consists of Asian faces. Example facial expression images are shown in Fig. 19. The results in Table 7 show that our method generalizes well.

Fig. 19 Image examples in the JAFFE database

Table 7 FER rates (%) on JAFFE database

4 Conclusions

In this paper, we propose a two-stage SRC method to make full use of compound and variational expression information. Compared with classical expression recognition methods, our model obtains better discriminative power. Extensive experiments on the well-known CK+ dataset show the effectiveness of our approach. We also conduct experiments on the JAFFE database, where the accuracy of our method reaches nearly 100%. Comparing the experimental results on these two databases, we find that our approach handles the problem of identity very well.