1 Introduction

Facial expression recognition, which aims to predict the six basic facial expressions (anger, disgust, fear, happiness, sadness and surprise), is a classic problem in the field of computer vision. In the past few years, expression recognition has drawn considerable attention [1, 4, 6, 13], as it can facilitate many other face-related tasks such as face recognition [17] and alignment [34]. Among various methods, deep neural networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated outstanding performance in expression recognition [10, 14, 15, 36].

However, most existing methods focus on recognizing strong expressions that are clearly separable, and ignore the weak expressions that are ubiquitous in normal daily communication. Due to the lack of salient features, weak expressions are more difficult to recognize than strong expressions (see Fig. 1). Research on recognizing weak expressions remains sparse, e.g., [11, 21, 35, 36]. In particular, the Peak-piloted Deep Network (PPDN) proposed in [36] is specially designed for coping with weak expressions. The key idea of PPDN is a peak-piloted feature transformation, which utilizes the intermediate-layer feature maps of peak expressions to supervise those of the corresponding non-peak expressions. This method improves the capability to capture the critical and subtle details of weak expressions, thereby outperforming the state-of-the-art methods in facial expression recognition.

Despite this significant progress, weak expression recognition is still a very challenging problem due to two main difficulties. First, as a special case of expression recognition, weak expression recognition suffers from the common difficulty that different subjects may exhibit the same expression with diverse visual appearances and facial intensities. Second, because weak expressions often do not bring about dramatic changes in visual appearance, different weak expressions can be highly similar to one another. For instance, as shown in Fig. 1, the weak expressions of fear and sadness are quite similar to each other [36].

Fig. 1 Examples of strong and weak expressions

In this work, we propose a novel method termed Deeper Cascaded Peak-piloted Network (DCPN) for weak expression recognition. Like PPDN, our DCPN exploits the intrinsic correlations between weak and strong expressions to magnify the critical and subtle details of weak expressions. To capture these details more precisely, the proposed DCPN uses a deeper and larger network architecture than the one used in PPDN. Furthermore, to prevent the enlarged network architecture from overfitting, we propose a new integration training method called cascaded fine-tuning.

Fig. 2 Illustration of the three training stages of DCPN. In the first stage, the pre-trained basic network is fine-tuned with data augmentation to obtain a better initialization. In the second stage, the basic network is used to choose a peak expression and non-peak expressions from each sequence. In the last stage, the resulting network is fine-tuned with the peak-piloted feature transformation. During back-propagation, stochastic gradient descent is used in the first stage and peak gradient suppression in the last stage

The training process of DCPN contains three main stages, as demonstrated in Fig. 2. In the first stage, the basic network of DCPN is first pre-trained on the ImageNet dataset and then fine-tuned for facial expression recognition. In the second stage, for every frame in each sequence, the fine-tuned network generates the prediction score of the corresponding expression label. The frame with the highest score is taken as the peak expression image (i.e. the strong expression), while the others are considered non-peak expression images. The peak expression image is usually the most easily recognizable expression in each sequence, and it tends to be the last frame, since each sequence begins with a neutral emotion and ends at the peak of the emotion. In the last stage, an image pair, consisting of a peak and a non-peak expression of the same type and subject, serves as the input to the network. The image pair passes through several intermediate layers to generate feature maps for each expression image, and the L2-norm of the difference between the feature maps of the image pair is minimized. The network utilizes a modified back-propagation algorithm named peak gradient suppression (PGS) [36]: whereas stochastic gradient descent (SGD) [5] would drive the feature maps of both images in the pair toward each other, PGS drives only the non-peak expression images toward the corresponding peak expression image.

Overall, this work establishes a refined version of PPDN [36] with the purpose of improving the ability to recognize weak expressions, and thereby fundamentally improving the accuracy of facial expression recognition. Our main contributions are summarized as follows:

  • Compared with the network adopted by PPDN, our DCPN uses a deeper and larger network, which captures the subtle details of expressions more precisely and thus performs better in weak expression recognition.

  • To prevent the enlarged network architecture from overfitting, we propose a new integration training method called cascaded fine-tuning. Experiments on several popular facial expression recognition databases show that our method distinctly outperforms PPDN and achieves state-of-the-art performance.

2 Related work

Deep learning algorithms have shown excellent performance on facial expression recognition at recent major conferences [10, 14, 15, 36] and competitions [2, 7, 8, 30]. These methods can be divided into two categories: sequence-based and still-image approaches. In the first category, Liu et al. [18] propose a method called 3D CNN-DAP, which is the first to apply a 3D convolutional network (C3D) to facial expression recognition. Jung et al. [15] propose a method called DTAGN, which integrates a C3D with a fully connected DNN. Jaiswal et al. [14] are the first to use a CNN in combination with Bi-directional Long Short-Term Memory (BLSTM) for facial expression recognition, outperforming the winner of the FERA 2015 challenge [30]. Fan et al. [8] propose a novel hybrid network combining an RNN and a C3D, which won the EmotiW2016 [7] facial expression recognition competition. In the second category, Yu et al. [32] utilize an ensemble of multiple CNNs. Bargal et al. [2] propose a hybrid network containing VGG16 [25], a modified VGG (13 layers) and a Residual Network [12]. Yao et al. [31] propose a network structure deeper and wider than Inception [27]. Zhao et al. [36] propose a novel peak-piloted feature transformation, which can be applied to any layer of the network to help recognize weak expressions.

As the above methods show, deeper and wider network structures (with more powerful feature-extraction capability), combined with techniques for preventing the enlarged networks from overfitting (e.g. multiple-network integration [8], fine-tuning [10], joint fine-tuning [15]), have become the mainstream of facial expression recognition. Sequence-based methods utilize both appearances and dynamic motions to exploit the correlations between different facial expression intensities within each sequence from the same subject, whereas still-image methods are more generic, recognizing facial expressions in both sequences and still images.

In contrast to purely sequence-based and still-image methods, PPDN takes a sequence of images as input in the training phase, to take dynamics into account, and a single image as input in the testing phase, thus combining the advantages of both categories. In this paper, the proposed DCPN builds on PPDN with a new, deeper and wider network structure and a more powerful integration training method that prevents the enlarged network from overfitting, giving it a stronger capability to recognize weak expressions.

3 Deeper cascaded peak-piloted network

In this section, we introduce the DCPN framework, which improves the capability to recognize weak expressions on the basis of PPDN.

3.1 Overall framework

The overall pipeline of the proposed DCPN is shown in Fig. 2. During training, DCPN takes an image sequence as input. The sequence passes through the network, which is first pre-trained on the ImageNet dataset and then fine-tuned for facial expression recognition. The sequence is then divided into two parts, a peak expression and the non-peak expressions, which serve as the new input for fine-tuning the network a second time. This integration training method, cascaded fine-tuning, couples two identical networks that share the same parameters and fine-tunes those parameters twice.
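The pipeline can be summarized in pseudocode. In this minimal sketch, `fine_tune`, `predict_scores` and `ppdn_fine_tune` are hypothetical placeholders for the three stages detailed in Sect. 3.3:

```python
# High-level sketch of the cascaded fine-tuning pipeline. `fine_tune`,
# `predict_scores` and `ppdn_fine_tune` are hypothetical helpers standing in
# for the three stages of Sect. 3.3; `sequences` holds labeled frame lists.
def train_dcpn(net, sequences):
    # Stage 1: fine-tune the ImageNet-pre-trained basic network on all frames,
    # with data augmentation, using ordinary SGD.
    fine_tune(net, [frame for seq in sequences for frame in seq.frames])

    # Stage 2: the fine-tuned network picks the peak frame of each sequence as
    # the frame with the highest score for the sequence's own label.
    pairs = []
    for seq in sequences:
        scores = [predict_scores(net, f)[seq.label] for f in seq.frames]
        peak = seq.frames[max(range(len(scores)), key=scores.__getitem__)]
        pairs += [(peak, f, seq.label) for f in seq.frames if f is not peak]

    # Stage 3: fine-tune a second time on (peak, non-peak) pairs with the
    # peak-piloted feature transformation and peak gradient suppression (PGS).
    ppdn_fine_tune(net, pairs)
    return net
```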

3.2 Basic network architecture

To further improve the performance of weak expression recognition, we design a new network, Inception-w, as the basic network architecture on the basis of PPDN (see Fig. 3). Compared with GoogLeNet [27], which is used as the basic network architecture in PPDN, Inception-w factorizes convolutions and applies aggressive dimension reduction to cut computational cost; the resulting computational and memory savings can then be spent on increasing both the width and depth of the network, improving its ability to capture the critical and subtle details needed to recognize weak expressions. In contrast to Inception-v4 [26], some inception structures are removed from Inception-w because the facial expression training databases are limited in size.

Inception-w utilizes three different inception structures, Inception-A, Inception-B and Inception-C, which were first proposed in Inception-v3 [28] and are deeper and wider than the traditional inception structure in GoogLeNet. Inception-w also uses two reduction modules, Reduction-A and Reduction-B, first proposed in Inception-v4. Furthermore, Inception-w substitutes fully convolutional (FC) layers for the fully connected layers to reduce the number of parameters. In total, Inception-w places the three inception structures and the two reduction modules after five convolutional layers and two max pooling layers. We denote these five convolutional layers and two max pooling layers as CONVs; the structure of the CONVs is the same as in Inception-v3. After them, the first FC layer generates intermediate features with 2048 dimensions, and the second FC layer generates the logit values of the label predictions for the six basic expressions. The auxiliary classifier sits on top of the last \(17\times 17\) layer, as shown in the dashed box of Fig. 3; it is introduced to improve the convergence of the deep network. The feature maps produced by different layers of the DCPN architecture can be seen in Fig. 4.
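Since the full Inception-w definition is lengthy, the sketch below only shows how one might assemble such a backbone in PyTorch. As an explicitly assumed stand-in, it starts from torchvision's Inception-v3, whose Inception-A/B/C blocks and auxiliary classifier Inception-w builds on, and re-heads it for the six basic expressions:

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical stand-in: Inception-w is a custom network, so as an approximation
# we start from torchvision's Inception-v3 (which provides the Inception-A/B/C
# blocks and the auxiliary classifier on the 17x17 maps) and replace both heads.
NUM_EXPRESSIONS = 6

net = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1,
                          aux_logits=True)                         # 299x299 input
net.fc = nn.Linear(net.fc.in_features, NUM_EXPRESSIONS)            # main classifier
net.AuxLogits.fc = nn.Linear(net.AuxLogits.fc.in_features, NUM_EXPRESSIONS)

x = torch.randn(2, 3, 299, 299)
net.train()
logits, aux_logits = net(x)           # in train mode, both heads are returned
print(logits.shape, aux_logits.shape)  # torch.Size([2, 6]) torch.Size([2, 6])
```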

Fig. 3 Basic network architecture. The output size of each module is the input size of the next one; the network is 42 layers deep, but its computational cost is only about 2.5 times higher than that of GoogLeNet

Fig. 4 Example visualizations of the different layers

3.3 Three-stage cascaded framework

The three-stage cascaded framework is described in detail in the following:

Stage 1 Facial expression databases, e.g. CK\(+\) [22] and Oulu-CASIA [29], provide only thousands of images, whereas a typical deep network has many parameters; this makes the network prone to overfitting. To overcome this problem, data augmentation is required. However, some traditional augmentations, such as rotation, translation and random clipping, may introduce noise into the facial expression databases. In this stage, each image therefore passes through three different transformations before being fed to the deep network: a random horizontal flip and random changes of brightness and saturation. Models pre-trained on the ImageNet dataset usually outperform models without any pre-training, thanks to the good initialization the pre-trained weights provide; therefore, the basic network is pre-trained on the ImageNet dataset. To combat the vanishing gradient problem in very deep networks and to improve convergence during training, the loss function is defined as the summation of an auxiliary classifier loss and a cross-entropy loss:

$$\begin{aligned} J&=\frac{1}{N}\left( J_{2}+\mu J_{1}+\lambda \sum _{i=1}^{N}\Vert W\Vert ^{2}\right) \\&=\frac{1}{N}\sum _{i=1}^{N}L(y_{i},f(x_{i};W))+\frac{\mu }{N}\sum _{i=1}^{N}L(y_{i},f_\mathrm{aux}(x_{i};W))\\&\quad +\lambda \Vert W\Vert ^{2}, \end{aligned}$$
(1)

which is an extended version of the loss in PPDN [36]. Here, \(J_{1}\) and \(J_{2}\) denote the auxiliary classifier loss and the cross-entropy loss, respectively, and \(\mu \) weights the auxiliary term. \(x_{i}\) is the i-th image of a batch of N training images, and \(y_{i}\in \{0,1,2,3,4,5\}\) is its ground-truth label. \(f(x_{i};W)\) and \(f_\mathrm{aux}(x_{i};W)\) are the logit values of the deep network and of the auxiliary classifier, and W denotes the parameters of the deep network. L is the cross-entropy loss between the logit values of the expression labels and the corresponding ground-truth labels. The final regularization term penalizes the complexity of the network parameters W. A stochastic gradient descent method is used for fine-tuning the network.
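For illustration, a stage-1 training step under Eq. (1) might look as follows in PyTorch, reusing `net` from the previous sketch; the jitter strengths and the delegation of the \(\lambda \Vert W\Vert ^{2}\) term to the optimizer's weight decay are our assumptions:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Stage-1 augmentation: only horizontal flips and brightness/saturation changes,
# since rotation, translation and random clipping may add noise (Sect. 3.3).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, saturation=0.3),  # assumed strengths
    transforms.ToTensor(),
])

criterion = nn.CrossEntropyLoss()        # L in Eq. (1), averaged over the batch
mu, lam = 0.4, 4e-5                      # aux weight and weight decay (Sect. 4.3)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9,
                            weight_decay=lam)  # SGD applies the lam*||W||^2 term

def stage1_step(images, labels):
    """One SGD step on Eq. (1): cross-entropy plus mu times the auxiliary loss."""
    net.train()                                   # train mode: both heads active
    logits, aux_logits = net(images)
    loss = criterion(logits, labels) + mu * criterion(aux_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```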

Stage 2 The peak frame is not known a priori in real-world videos or in some facial expression recognition databases. The image sequence from the same subject is therefore fed to the deep network fine-tuned in stage 1, and the frame with the highest prediction score in each sequence is treated as the peak expression image, while the others are treated as non-peak expression images. This training stage makes the method applicable to videos where information about the peak expression is not available.
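A minimal sketch of this selection step, assuming each sequence is given as a frame tensor with a single expression label:

```python
import torch

@torch.no_grad()
def split_peak(net, frames, label):
    """Stage 2: pick the frame whose softmax score for the sequence's own label
    is highest as the peak; all remaining frames are non-peak expressions."""
    net.eval()                                     # eval mode: only the main head
    scores = net(frames).softmax(dim=1)[:, label]  # frames: (T, 3, 299, 299)
    k = scores.argmax().item()
    return frames[k], torch.cat([frames[:k], frames[k + 1:]])
```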

Stage 3 During this stage, the fine-tuned deep network takes as input an image pair consisting of a peak and a non-peak expression of the same type and subject. The image pair passes through the intermediate layers of the deep network to generate feature maps for each expression image. To supervise the feature maps of the non-peak expression image with those of the peak expression image, the L2-norm of the difference between the two sets of feature maps is minimized, embedding the evolution from non-peak to peak expressions into the DCPN framework. The auxiliary classifier is no longer needed to improve convergence in this stage, because the fine-tuned deep network has already roughly converged. Following PPDN [36], we fine-tune the deep network a second time with a loss function defined as follows:

$$\begin{aligned} J'&=\frac{1}{N}\left( J'_{3}+J'_{1}+J'_{2}+\lambda \sum _{i=1}^{N}\Vert W\Vert ^{2}\right) \\&=\frac{1}{N}\sum _{i=1}^{N}L\left( y_{i}^{p},f\left( x_{i}^{p};W\right) \right) +\frac{1}{N}\sum _{i=1}^{N}L\left( y_{i}^{n},f\left( x_{i}^{n};W\right) \right) \\&\quad +\frac{1}{N}\sum _{i=1}^{N}\sum _{j\in \Omega }\Vert f_{j} \left( x_{i}^{p};W\right) -f_{j}\left( x_{i}^{n};W\right) \Vert ^{2}+\lambda \Vert W\Vert ^{2}, \end{aligned}$$
(2)

where \(J'_{3}\), \(J'_{1}\) and \(J'_{2}\) denote the L2-norm of the difference between the feature maps of each expression image pair and the two cross-entropy losses for recognition, respectively. \(\Omega \) is the set of layers that exploit the peak-piloted transformation, and \(f_{j}\) is the feature map of the j-th layer. \(x_{i}^{n}\) denotes a face with a non-peak expression and \(x_{i}^{p}\) a face with the corresponding peak expression. To drive the intermediate-layer feature maps of non-peak expressions toward those of the corresponding peak expression, we adopt a special-purpose back-propagation algorithm based on peak gradient suppression (PGS) [36]:

$$\begin{aligned} W^{+}&=W-\frac{\gamma }{N}\frac{\partial J'_{3}}{\partial f_{j} (W;x_{i}^{n})}\times \frac{\partial f_{j}(W;x_{i}^{n})}{\partial W}\\&\quad -\frac{\gamma }{N}\nabla _{W}(J'_{1})-\frac{\gamma }{N}\nabla _{W}(J'_{2})-2\gamma \lambda W, \end{aligned}$$
(3)

where \(\gamma \) is the learning rate. The difference between SGD and PGS is that the gradients due to the feature responses of the peak expression image, \(-\frac{\gamma }{N}\frac{\partial J'_{3}}{\partial f_{j}(W;x_{i}^{p})}\times \frac{\partial f_{j}(W;x_{i}^{p})}{\partial W}\), are suppressed in Eq. (3). In this way, PGS drives the feature maps of non-peak expressions toward those of peak expressions, instead of making the peak and non-peak expression images approach each other. Optimizing the loss function in Eq. (2) by PGS in stage 3 realizes the peak-piloted feature transformation, first proposed in PPDN.
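In an autodiff framework, PGS amounts to detaching the peak branch's feature maps inside the \(J'_{3}\) term, so the L2 loss back-propagates only through the non-peak branch while both cross-entropy terms flow normally. Below is a sketch under the assumption that a hypothetical helper `features(net, x)` returns the \(\Omega \) feature maps (here, the last two FC layers; see Sect. 4.3) together with the final logits:

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def stage3_step(net, features, optimizer, x_peak, x_non, labels):
    """One PGS step on Eq. (2); `features` is a hypothetical feature extractor."""
    feats_p, logits_p = features(net, x_peak)
    feats_n, logits_n = features(net, x_non)

    # J'_1 + J'_2: cross-entropy on both the peak and the non-peak images.
    loss = criterion(logits_p, labels) + criterion(logits_n, labels)

    # J'_3 with peak gradient suppression: detaching the peak feature maps means
    # the L2 term only pulls the non-peak features toward the (fixed) peak ones.
    for fp, fn in zip(feats_p, feats_n):
        loss = loss + (fp.detach() - fn).pow(2).sum(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()       # peak features receive no gradient from the L2 term
    optimizer.step()
    return loss.item()
```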

3.4 Cascaded fine-tuning method

Due to the large number of model parameters and the small training sets, the deep network is prone to overfitting. To alleviate this problem, we propose a novel integration training method called cascaded fine-tuning. First, we fine-tune the ImageNet-pre-trained deep network with data augmentation, which provides a good initialization. To speed up convergence, we add an auxiliary classifier loss to the loss function. We then fine-tune the resulting network again, adding peak-piloted feature supervision on selected layers, which drives the feature maps of non-peak expressions toward those of peak expressions and improves the performance of weak expression recognition.

4 Experiments

Although DCPN is designed for weak expression recognition, we conduct extensive experiments on two popular facial expression databases rather than on micro-expression databases, because training requires strong expressions to supervise the weak ones. Weak expressions are hard samples for traditional facial expression recognition, so improving the ability to recognize them directly improves the overall performance of facial expression recognition.

4.1 Data pre-processing

Most papers on the subject use face cropping techniques to discard irrelevant information and achieve high accuracy. We utilize the Multi-task Cascaded Convolutional Network (MTCNN) [33], which achieves superior accuracy over state-of-the-art techniques for face detection and alignment, to crop faces from each dataset. Within a sequence, the positions of the face regions cropped by MTCNN differ slightly from image to image. Each face region is described by four coordinates: those of the top-left corner and of the lower-right corner. To reduce the noise caused by resizing non-aligned face regions, as well as the effects of scale variability on minimizing the L2-norm of the difference between peak and non-peak expressions, we take the minimum coordinates of the top-left corner and the maximum coordinates of the lower-right corner to produce a single shared face region per sequence, ensuring that the facial frames of the same subject are aligned with each other. We then crop the central region of the face image, covering 87.5\(\%\) of the resulting face region, and resize it to \(299\times 299\) using the bilinear interpolation algorithm. As can be seen in Fig. 5, some details that have nothing to do with facial expression recognition, such as freckles, mustaches and blemishes, are dropped by this pre-processing.
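The cropping logic can be sketched as follows, assuming the MTCNN boxes of one sequence are given as (left, top, right, bottom) tuples; interpreting the 87.5\(\%\) crop as a linear central fraction, as in standard Inception pre-processing, is our assumption:

```python
from PIL import Image

CROP_FRACTION, TARGET = 0.875, 299  # central fraction and Inception input size

def union_box(boxes):
    """Union of the per-frame MTCNN boxes (left, top, right, bottom) of one
    sequence, so every frame is cropped with the same region and stays aligned."""
    ls, ts, rs, bs = zip(*boxes)
    return min(ls), min(ts), max(rs), max(bs)

def preprocess_ckplus(frame_paths, boxes):
    l, t, r, b = union_box(boxes)
    # Keep the central 87.5% of the shared face region, then resize bilinearly.
    dx = (r - l) * (1 - CROP_FRACTION) / 2
    dy = (b - t) * (1 - CROP_FRACTION) / 2
    crop = (int(l + dx), int(t + dy), int(r - dx), int(b - dy))
    return [Image.open(p).crop(crop).resize((TARGET, TARGET), Image.BILINEAR)
            for p in frame_paths]
```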

Fig. 5 Illustration of standard pre-processing results on CK\(+\), which involves face detection, central cropping and bilinear interpolation

Fig. 6 Illustration of standard pre-processing results on Oulu-CASIA, which involves face detection, zero padding and bilinear interpolation

As can be seen from Fig. 6, the data pre-processing for Oulu-CASIA differs slightly from that for CK\(+\). Because Oulu-CASIA is more challenging than CK\(+\), we need to shrink the face regions as much as possible. Instead of using the face regions directly, we crop faces using the coordinates of the two eyes provided by MTCNN, and determine the final rectangular face region by keeping the distance between the two eyes constant. Finally, we turn the rectangle into a square through zero padding and resize it to \(299\times 299\).
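A sketch of this eye-based crop follows; the margins around the eyes relative to the inter-ocular distance are assumed parameters, since the paper fixes only the principle of keeping the eye distance constant:

```python
import numpy as np
from PIL import Image

TARGET = 299

def preprocess_oulu(img, left_eye, right_eye, margin=0.6):
    """Crop around the eyes with a fixed eye-distance-relative margin (assumed
    value), zero-pad the rectangle to a square, and resize to 299x299."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    d = rx - lx                                    # inter-ocular distance
    l, r = int(lx - margin * d), int(rx + margin * d)
    t = int(min(ly, ry) - margin * d)
    b = int(max(ly, ry) + 2.0 * margin * d)        # extra room for the mouth
    face = np.asarray(img)[max(t, 0):b, max(l, 0):r]

    side = max(face.shape[:2])                     # zero-pad to a square
    pad = np.zeros((side, side, 3), dtype=face.dtype)
    y0, x0 = (side - face.shape[0]) // 2, (side - face.shape[1]) // 2
    pad[y0:y0 + face.shape[0], x0:x0 + face.shape[1]] = face
    return Image.fromarray(pad).resize((TARGET, TARGET), Image.BILINEAR)
```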

4.2 Description of the databases

Facial expression recognition databases usually provide video sequences. We conduct all experiments on two popular databases, \(\text {CK}+\) and Oulu-CASIA. \(\text {CK}+\) is a representative database for facial expression recognition. It contains the six basic facial expressions and one non-basic expression (contempt). It is composed of 593 sequences from 123 subjects, of which only 309 are annotated with the six basic expression labels and 18 with the non-basic expression label. The 118 annotated subjects are divided into ten groups; nine subsets are used for training, and the remaining subset is used for testing. In this database, each sequence starts with a neutral expression and ends with a peak expression. Oulu-CASIA contains 480 image sequences of the six basic facial expressions under normal illumination conditions. There are 80 subjects, and 10-fold cross-validation is performed in the same way as for CK\(+\). As in the CK\(+\) database, each facial expression evolves from a neutral to a peak expression.

4.3 Experimental setting

DCPN uses Inception-w as the basic network architecture. The pre-processed face regions are resized to \(299\times 299\).

In the first stage, the convolutional layer weights are initialized with those of the model pre-trained on the ImageNet dataset. We first fine-tune only the last two FC layers with a learning rate of 0.01 for 20,000 iterations, and then fine-tune all layers with a learning rate of 0.0001 for 10,000 iterations. We set \(\mu =0.4\) in Eq. (1). All models are trained with a weight decay \(\lambda \) of 0.00004 and a batch size of 32 images or image pairs. Stochastic gradient descent with a momentum of 0.9 is used for training the network.
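The two-phase schedule of this stage might be set up as follows, again reusing `net` and `stage1_step` from the earlier sketches; identifying the last two FC layers with the re-headed classifier layers of the stand-in model is our assumption:

```python
import torch

# Phase 1: train only the two re-headed FC layers (net.fc and net.AuxLogits.fc
# in the stand-in model) with lr = 0.01 for 20,000 iterations.
head_params = list(net.fc.parameters()) + list(net.AuxLogits.fc.parameters())
head_ids = {id(p) for p in head_params}
for p in net.parameters():
    p.requires_grad = id(p) in head_ids
opt_head = torch.optim.SGD(head_params, lr=0.01, momentum=0.9, weight_decay=4e-5)
# ... 20,000 iterations of stage1_step using opt_head ...

# Phase 2: unfreeze all layers and continue with lr = 0.0001 for 10,000 iterations.
for p in net.parameters():
    p.requires_grad = True
opt_all = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9,
                          weight_decay=4e-5)
# ... 10,000 iterations of stage1_step using opt_all ...
```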

In the last stage, the peak-piloted feature transformation is employed only on the last two FC layers, which shows better performance than applying it to all or other subsets of the network's layers [36]. The main reason is that the peak-piloted feature transformation is more useful for supervising the highly semantic features extracted by deep layers than the fine-grained ones extracted by shallow layers. The final network is fine-tuned with a learning rate of 0.000001 for 20,000 iterations. Following the standard setting of [23], we use 10-fold subject-independent cross-validation for evaluation in all experiments.

4.4 Evaluation on facial expression recognition

To evaluate the effects of various aspects of our approach and to compare it fairly with existing approaches, we divide the databases under the standard setting and conduct two sets of experiments, as shown in Table 1: one evaluates the performance of our method on weak, peak and combined expressions; the other compares our method with existing methods on the same databases.

Table 1 Standard partition of the dataset

Table 1 shows the data partition under the standard setting [36], where Weak is the number of weak expressions, consisting of the 7th to 9th frames of each sequence; Strong is the number of strong expressions, consisting of the last one to three frames of each sequence; and Peak is the number of peak expressions, consisting of the frames with the highest prediction score in each sequence.

Table 2 Average accuracy on CK\(+\) database
Table 3 Average accuracy on Oulu-CASIA database

The main advantage of DCPN is its improved ability to capture critical and subtle details, which markedly improves the recognition of weak expressions. To test this, we evaluate on three different test sets: "Peak", "Weak" and "Combined". The average accuracy of 10-fold cross-validation is shown in Tables 2 and 3. "Inception-w" denotes the average accuracy after the first stage of DCPN. Our first-stage results already outperform "PPDN", and the most substantial further improvements are obtained on the weak expression test set: 92.48 and 72.22\(\%\) for DCPN vs. 88.13 and 69.75\(\%\) for "Inception-w" on CK\(+\) and Oulu-CASIA, respectively. This is evidence of DCPN's strong performance in recognizing weak expressions. The improved weak expression recognition also benefits facial expression recognition in general: DCPN outperforms "PPDN" on the combined sets, where both peak and non-peak expressions are evaluated.

Table 4 Performance comparisons of still image methods on CK\(+\) database

Table 4 compares DCPN to still image-based approaches on CK\(+\) under the standard setting, which uses the strong expressions (i.e. the last one to three frames) for training and testing.

Table 5 Performance comparisons of sequence-based methods on CK\(+\) database
Table 6 Performance comparisons of sequence-based methods on Oulu-CASIA database

Tables 5 and 6 compare DCPN to sequence-based approaches on CK\(+\) and Oulu-CASIA. Unlike still-image approaches, sequence-based approaches use image sequences for training and testing. Given an image sequence in the test phase, we therefore use DCPN to choose the peak expression and then report the average accuracy on the peak expression. DCPN achieves better facial expression recognition performance than other state-of-the-art methods. On the CK\(+\) database, it has gains of 2.2\(\%\) and 0.3\(\%\) over "DTAGN(Joint)" [15] and "PPDN" [36]. On the Oulu-CASIA database, it achieves 86.23\(\%\) vs. 81.46\(\%\) for "DTAGN(Joint)" [15] and 84.58\(\%\) for "PPDN" [36].

5 Conclusions

In this paper, we propose a novel Deeper Cascaded Peak-piloted Network (DCPN) for weak expression recognition. We design a deeper network, Inception-w, and utilize the peak-piloted feature transformation to improve the performance of weak expression recognition; we also present an integration training method called cascaded fine-tuning to prevent the deep network from overfitting. The proposed DCPN shows improved ability to recognize weak expressions and achieves state-of-the-art performance on two popular facial expression recognition databases.