1 Introduction

Facial expression is a non-verbal way to share and portray a person's feelings in daily life. Generally, facial expressions can be categorized into two classes: macro-expressions and micro-expressions. A macro-expression (also known as a normal expression) lasts for 3/4 to 2 seconds and can easily be recognized by humans with the naked eye. By contrast, a micro-expression is much shorter (1/25 to 1/3 seconds; the precise duration varies across definitions [22, 40]) and far more imperceptible. Another key characteristic of micro-expressions is that they are spontaneous and uncontrollable. Even if someone tries to hide his/her true emotions behind a feigned macro-expression, micro-expressions will reveal the true emotion [28]. Therefore, compared with macro-expressions, micro-expressions are usually regarded as a vital and accurate cue for detecting a person's inner emotions. Research on micro-expression recognition has attracted much attention and has a broad range of applications in fields such as public security [6] and judicial criminal investigation [8]. It has shown great potential for preventing threats to social security, providing early warnings in emergencies, and helping to judge whether someone is telling the truth.

Despite its importance, recognizing micro-expressions is extremely difficult for both machines and humans due to their uncontrollability, short duration, and small range of motion [7]. Moreover, because micro-expressions are spontaneous and must be elicited in specific environments, micro-expression datasets are very limited, which in turn constrains the design of effective recognition algorithms. For example, although deep learning has demonstrated great success in various computer vision tasks such as image interpolation [1], image/video enhancement [18, 19], video compression [20], and the closely related task of macro-expression recognition, its power has not yet been fully exploited for micro-expression recognition. A key obstacle to applying deep-learning-based methods is the lack of large-scale datasets that would enable effective feature learning for micro-expressions.

Recently, visual attention mechanisms have been proposed and successfully applied to structured prediction tasks such as visual captioning [39] and quality assessment [21]. They are based on the reasonable assumption that human vision tends to focus on selective parts rather than the whole visual scene. By incorporating visual attention, deep models can learn richer and more discriminative features for visual tasks. Therefore, visual attention can be regarded as a feature extraction mechanism guided by contextual fixations.

Inspired by visual attention mechanisms [11, 43] and the widely used convolutional neural networks (CNNs), in this paper we take full advantage of visual attention and design an attention-based CNN for accurate micro-expression recognition, called MERTA. In particular, three types of attention are used: 1) General attention embeds the static information of facial landmarks; a facial expression is closely related to the layout of the landmark areas (e.g., happiness inevitably raises the corners of the mouth). 2) Motive attention embeds dynamic information; as micro-expressions are characterized by tiny facial movements, it is beneficial to emphasize the moving areas of the face. 3) Channel attention can be viewed as selecting semantic attributes on demand for the facial expression, since each channel-wise feature map is essentially the response map of the corresponding filter. For example, when predicting disgust, the channel attention assigns larger weights to feature maps generated by filters tuned to semantics such as a frown. We use VGGNet-16 [33] as the backbone to extract spatial features from the original images and their optical flow and optical strain images. The features after the second fully connected layers are then concatenated and fed into one layer of Long Short-Term Memory (LSTM) [9] and two fully connected layers to predict the micro-expression. We evaluate the effectiveness of the proposed model on the well-known CASME II dataset. The proposed algorithm surpasses the baseline model without attention by 4% and also outperforms state-of-the-art micro-expression recognition algorithms.

The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 introduces the proposed method together with the details of the attention mechanisms. Section 4 presents the experimental results and ablation analysis. Section 5 concludes the paper.

2 Related work

2.1 Micro-expressions recognition

In recent years, micro-expression recognition has gradually gained popularity and made remarkable progress. Pfister et al. [28] proposed a temporal interpolation model to recognize micro-expressions accurately. Xu et al. [38] used a facial dynamics map to identify and recognize micro-expressions. Wang et al. [35] reduced the redundancy of local binary patterns from three orthogonal planes (LBP-TOP) by proposing local binary patterns with six intersection points (LBP-SIP). In [25], micro-expressions are recognized using adaptive magnification of discriminative facial motion. The algorithm of Patel et al. [26] was the first to explore deep learning for micro-expression recognition: features are selected by transferring macro-expression representations, and an evolutionary algorithm searches for an optimal set of deep features. Borza et al. [2] used image differences to analyze motion changes and two classifiers to determine whether a micro-expression occurs at a given frame t. Li et al. [16] combined deep multi-task learning with a normalized histogram of oriented optical flow (HOOF) to detect micro-expressions: the face is divided into regions of interest (ROIs), and a robust optical flow method is integrated with HOOF features to estimate the direction of facial muscle movement. Kim et al. [13] recognized micro-expressions by learning spatio-temporal feature representations with expression-state constraints. Khor et al. [12] proposed an enriched long-term recurrent convolutional network (ELRCN), which first encodes each micro-expression frame into a feature vector through a CNN module and then predicts the micro-expression by passing the feature vectors to a long short-term memory (LSTM) module. Considering the small sample size in the 2018 Micro-Expression Grand Challenge (MEGC), Peng et al. [27] adopted a transfer learning approach to recognize micro-expressions with convolutional neural networks. Mayya et al. [23] interpolated video sequences with the temporal interpolation method (TIM) and then used a deep convolutional neural network (DCNN) to extract facial features on a CUDA-enabled general-purpose GPU (GPGPU) system. Reliable deep neural networks require extensive sets of labeled training samples; however, due to the subtle appearance changes and short duration of micro-expressions, micro-expression recognition remains a challenging task.

2.2 Micro-expression dataset

Generally, it is difficult for ordinary people to identify micro-expressions, which have a short duration, a small range of change, few motion areas, and arise from complex psychological conditions. Although much research has been carried out on micro-expression recognition, the available datasets are very few. The mainstream datasets are: 1) The Spontaneous Micro-expression Corpus (SMIC) [15] and SMIC II, proposed by Li et al. from the University of Oulu in Finland. Built in 2012, SMIC is the first spontaneous micro-expression dataset, containing 164 samples from 16 subjects; the micro-expressions are categorized into positive, negative, and surprise. 2) The USF-HD dataset [32], proposed by Shreve et al. from the University of South Florida. It contains both micro-expression and macro-expression samples, but the samples are posed by imitation rather than induced. 3) The Chinese Academy of Sciences Micro-Expression (CASME) dataset [34], CASME II [41], and CAS(ME)2 [30]. CASME contains 195 sequences with the onset, apex, and offset frames of each micro-expression marked, and CASME II contains 247 spontaneous samples from 26 subjects. Other micro-expression datasets include the Polikovsky dataset [29] from the University of Tsukuba in Japan, the York DDT (Deception Detection Test) database [36], and the Spontaneous Actions and Micro-Movements (SAMM) dataset [4] from the University of Manchester. In summary, existing micro-expression datasets are limited in both sample number and expression coverage, mainly because of the strict environmental requirements for recording micro-expressions and the difficulty of labeling them accurately. Therefore, a large-scale micro-expression dataset comparable to ImageNet is currently out of reach, which poses an inevitable obstacle for highly data-driven deep learning algorithms.

2.3 Attention mechanism

Attention mechanisms [17, 42] were originally developed based on characteristics of human vision. The human visual attention mechanism is a distinctive brain signal processing mechanism: vision quickly scans the global image to locate the target region that needs to be focused on, usually called the focus of attention, and then invests more attention resources in this region to gather detailed information while suppressing less relevant areas. Although it was studied decades ago, attention has recently become a hot topic in computer vision owing to its success in image/video captioning and visual question answering. In [24], Mnih et al. introduced an attention mechanism into a recurrent neural network (RNN) for image classification. Xu et al. [37] proposed the first visual attention model for image captioning. Yang et al. [43] refined spatial attention with a stacked attention model. Semantic attention relies on semantic concepts to select effective features; in [44], the filters of a convolutional layer are regarded as semantic detectors. In [10], Hu et al. proposed the Squeeze-and-Excitation block (channel-wise attention) to adaptively recalibrate channel-wise feature responses. In [45], Zhang et al. proposed a context encoding module to leverage global scene context information. In SCA-CNN [3], spatial attention and semantic attention are jointly applied to image captioning.

3 The proposed algorithm

In this section, we describe the proposed micro-expression recognition algorithm MERTA. The overall network structure is shown in Fig. 1. It is composed of three VGGNet-16 subnets with attention, whose outputs are concatenated and fed into a single layer of LSTM. We first describe the backbone structure and then present the three attention mechanisms individually.

Fig. 1

Network structure of the proposed MERTA. Given a micro-expression sequence, the input contains three parts: original frames, optical flow, and optical strain. Each part goes through a VGGNet-16 subnet, to which three attention mechanisms (general attention, motive attention, and channel attention) are applied. The outputs of the three subnets are then concatenated and fed into a single layer of LSTM, whose output passes through two fully connected layers to predict the micro-expression category (in this example, disgust)

3.1 Backbone network

The backbone of the proposed algorithm is similar to [5] and contains two parts: a CNN part that extracts spatial features from each frame of the sequence, and a recurrent LSTM that extracts temporal information from the consecutive spatial features. The combination of the two parts can efficiently exploit the spatio-temporal information of the input sequence. Inspired by [12], we adopt enriched inputs in our framework: the optical flow and optical strain are introduced and fed into two additional copies of the CNN to extract richer hierarchical features. Optical flow captures first-order motion information, while optical strain captures higher-order derivatives that represent the deformation incurred during non-rigid motion.

Suppose {I(x, y, t)} is a sequence of frames, where (x, y) are the 2-D spatial coordinates and t is the frame index. As a well-known motion estimation technique based on the brightness constancy assumption, optical flow is typically defined by:

$$ \begin{array}{@{}rcl@{}} \frac{\partial I}{\partial x}\cdot f_{x} + \frac{\partial I}{\partial y}\cdot f_{y} + I_{t} = 0, \end{array} $$
(1)

where \(I_{t}\) represents the temporal gradient and \(\boldsymbol{f} = [f_{x}=\frac{\partial x}{\partial t}, f_{y}=\frac{\partial y}{\partial t}]\) is the optical flow, whose magnitude is denoted as f = |f|. In this work, we adopt the algorithm in [31], where the optical flow is estimated using an L1 data term with a regularization term. As shown in Fig. 1, the optical flow captures the movement of the eyebrows (i.e., a frown), which is closely related to the expression of disgust. According to [32], the optical strain can be calculated directly from the optical flow as:

$$ \begin{array}{@{}rcl@{}} \boldsymbol{s} = \frac{1}{2}[\nabla \boldsymbol{f}+(\nabla \boldsymbol{f})^{T}], \end{array} $$
(2)

which can be expanded as:

$$ \begin{array}{@{}rcl@{}} \boldsymbol s = \left[\begin{array}{ll} \frac{\partial f_{x}}{\partial x} & \frac{1}{2}\frac{\partial f_{x}}{\partial y}+\frac{1}{2}\frac{\partial f_{y}}{\partial x} \\ \frac{1}{2}\frac{\partial f_{y}}{\partial x}+\frac{1}{2}\frac{\partial f_{x}}{\partial y} & \frac{\partial f_{y}}{\partial y} \end{array}\right]. \end{array} $$
(3)

Then, the magnitude of the optical strain, denoted s, is computed as the L2 (Frobenius) norm of \(\boldsymbol{s}\), i.e.,

$$ \begin{array}{@{}rcl@{}} s = \sqrt{\left( \frac{\partial f_{x}}{\partial x}\right)^{2}+\frac{1}{2}\left( \frac{\partial f_{y}}{\partial x}+\frac{\partial f_{x}}{\partial y}\right)^{2}+\left( \frac{\partial f_{y}}{\partial y}\right)^{2}}. \end{array} $$
(4)

As shown in Fig. 1, the optical strain captures the boundary of moving regions, highlighting the diverse deformation incurred during non-rigid facial muscle movement.
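To make the computation of these motion cues concrete, the following sketch derives the flow and strain magnitudes of Eqs. (1)-(4) for a pair of consecutive grayscale frames. It is an illustrative implementation only: the paper adopts the TV-L1 estimator of [31], whereas this sketch substitutes OpenCV's Farneback method for simplicity, and the function name flow_and_strain is ours.

```python
import cv2
import numpy as np

def flow_and_strain(prev_gray, curr_gray):
    """Compute the optical flow f = (f_x, f_y), its magnitude |f|, and the
    optical-strain magnitude s for two consecutive grayscale frames.
    Farneback flow is used here as a stand-in for the TV-L1 method [31]."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    f_mag = np.sqrt(fx ** 2 + fy ** 2)                # |f| from Eq. (1)

    # Spatial derivatives of the flow field form the strain tensor, Eq. (3).
    dfx_dy, dfx_dx = np.gradient(fx)                  # gradients along (y, x)
    dfy_dy, dfy_dx = np.gradient(fy)
    # Optical-strain magnitude, Eq. (4)
    s_mag = np.sqrt(dfx_dx ** 2 + 0.5 * (dfy_dx + dfx_dy) ** 2 + dfy_dy ** 2)
    return fx, fy, f_mag, s_mag
```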

Given the original frames \(\{I(x,y,t)\in \mathbb{R}^{3}\}\), the optical flow \(\{f_{x}(x,y,t)\in \mathbb{R}^{2}\}\), \(\{f_{y}(x,y,t)\in \mathbb{R}^{2}\}\), \(\{f(x,y,t)\in \mathbb{R}^{2}\}\), and the optical strain \(\{s(x,y,t)\in \mathbb{R}^{2}\}\), we leverage three separate VGGNet-16 [33] networks as the backbone to fully exploit the benefits of deep CNNs, where the optical strain is first converted into a 3-channel map by replicating it three times along the channel dimension. Three types of attention mechanisms are introduced into VGGNet-16 to further emphasize discriminative features. By extracting features from the individual inputs, the separate subnets can disentangle facial, motive, and deformation features, easing micro-expression recognition. Since high-level features generally contain semantic information, the feature maps from the second fully connected layers of the three subnets are fused by concatenation and then passed to the subsequent recurrent LSTM. We follow the framework of [12] in using a single LSTM layer, but with 256 hidden units, which is fewer than in [12], for a more compact feature representation and lower memory cost. As shown in Fig. 1, a 128-d fully connected layer and a 5-d fully connected layer are appended on top of the LSTM to predict the micro-expression.
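A minimal sketch of this backbone is given below, assuming a PyTorch implementation (the paper does not specify a framework). The class name MERTABackboneSketch and details such as taking the last LSTM time step for classification are our assumptions; the three subnets, 4096-d fc2 features, 256-unit LSTM, and 128-d/5-d head follow the text.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class MERTABackboneSketch(nn.Module):
    """Three VGG-16 subnets (frames, flow, strain); per-frame fc2 features are
    concatenated and fed to a 256-unit LSTM and a 128-d / 5-d head."""

    def __init__(self, num_classes=5):
        super().__init__()
        def make_subnet():
            net = vgg16(weights=None)   # weights=None for brevity; the paper initializes from VGG-Face
            # keep the classifier up to the second fully connected layer (4096-d output)
            net.classifier = nn.Sequential(*list(net.classifier.children())[:-2])
            return net
        self.subnets = nn.ModuleList([make_subnet() for _ in range(3)])
        self.lstm = nn.LSTM(input_size=3 * 4096, hidden_size=256, batch_first=True)
        self.head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                  nn.Linear(128, num_classes))

    def forward(self, frames, flow, strain):
        # each input: (batch, time=9, 3, 224, 224)
        feats = []
        for x, net in zip((frames, flow, strain), self.subnets):
            b, t = x.shape[:2]
            f = net(x.flatten(0, 1)).view(b, t, -1)   # (b, t, 4096) fc2 features
            feats.append(f)
        seq = torch.cat(feats, dim=-1)                # (b, t, 3*4096)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])                  # classify from the last step (assumption)
```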

Limited by the scale of micro-expression datasets, training the proposed framework end-to-end would be extremely difficult, so the framework is trained in two stages. In the first stage, we train the three VGGNet-16 subnets individually on labeled micro-expression samples. To accelerate training and obtain effective facial features from relatively small-scale training samples, each VGGNet-16 is initialized with the parameters of VGG-Face, which is trained on the large-scale face dataset Labeled Faces in the Wild (LFW). In the second stage, we fix the parameters of the VGGNet-16 subnets and train the remaining layers, including the LSTM module and the two fully connected layers. In both stages, the model is optimized with a cross-entropy loss:

$$ \begin{array}{@{}rcl@{}} L = -\sum\limits_{k} p_{k} \log q_{k}, \end{array} $$
(5)

where k indexes the micro-expression classes, pk is the one-hot vector of the ground-truth micro-expression class, and qk is the output of the softmax layer, representing the predicted probability of each class.
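The second training stage can be summarized with the following hedged sketch, reusing the backbone class from the earlier sketch. The learning rate follows Section 4.1; interpreting the paper's "decay rate" as Adam weight decay is an assumption.

```python
import torch.nn as nn
import torch.optim as optim

# Stage 2 (sketch): freeze the three VGG-16 subnets and update only the LSTM
# and the two fully connected layers with the cross-entropy loss of Eq. (5).
model = MERTABackboneSketch()                   # from the earlier backbone sketch
for p in model.subnets.parameters():
    p.requires_grad = False

criterion = nn.CrossEntropyLoss()               # equivalent to Eq. (5) for one-hot p_k
optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad],
                       lr=1e-5, weight_decay=1e-6)   # "decay rate" read as weight decay (assumption)
```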

3.2 Attention mechanism

Although the three-subnet backbone is designed to extract both static features (from the original-frame subnet) and dynamic features (from the optical flow and optical strain subnets), the features are extracted in an indiscriminate way: for example, non-facial regions are treated the same as facial regions. Even though this naive feature extraction has achieved great success in image classification, face recognition, and macro-expression recognition, it is far from sufficient for the challenging problem of micro-expression recognition, given its subtlety and short duration. As mentioned above, inspired by the brain's signal processing mechanism, attention is an effective way to emphasize discriminative information. Different from [5], which extracts facial features indiscriminately, the proposed algorithm introduces three types of attention: general attention highlights the landmark areas, which are rich in expression muscles; motive attention highlights the motion areas where the expression appears; and channel attention highlights expression-related semantic features. The proposed model with the three attentions incorporated is shown in Fig. 1.

3.2.1 General attention

General attention reflects the fact that all expressions, including micro-expressions, are most easily identified by the movement of facial landmarks. For example, if someone is smiling, the most obvious sign is the rise of the corners of the mouth, even though it takes about 42 muscles to smile. Therefore, facial landmarks mark the most discriminative areas to concentrate on. In this paper, we use the dlib C++ library [14] to detect 68 facial landmarks, \(\{\boldsymbol l^{k}=[{l_{x}^{k}},{l_{y}^{k}}], k=1,2,\cdots ,68\}\). As the detected landmarks are isolated pixels, we further smooth the landmark mask M, i.e.,

$$ \begin{array}{@{}rcl@{}} A_{g} = M \ast G, \end{array} $$
(6)

where the pixels of M are all zero except at {lk}, and G is a 25 × 25 Gaussian kernel. Ag then represents the general attention. The process is shown in Fig. 2, where the landmarks of each frame are marked by green diamonds with red numbers. It can be observed that the blurred landmark map highlights critical facial regions such as the eyes and mouth; emphasizing these regions facilitates micro-expression recognition.
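A possible implementation of the general attention map of Eq. (6) is sketched below using the standard dlib 68-point landmark predictor; the final normalization to [0, 1] is our assumption.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def general_attention(gray_frame):
    """Build the binary landmark mask M and smooth it with a 25x25 Gaussian
    kernel G, i.e. A_g = M * G (Eq. 6)."""
    mask = np.zeros_like(gray_frame, dtype=np.float32)
    for rect in detector(gray_frame, 1):
        shape = predictor(gray_frame, rect)
        for k in range(68):                      # the 68 landmarks l^k
            mask[shape.part(k).y, shape.part(k).x] = 1.0
    a_g = cv2.GaussianBlur(mask, (25, 25), 0)    # convolution with G
    return a_g / (a_g.max() + 1e-8)              # normalize to [0, 1] (assumption)
```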

Fig. 2

General attention. We detect the landmarks of each frame (marked by green diamonds with red numbers) and apply Gaussian smoothing to obtain the general attention areas

3.2.2 Motive attention

Motive attention captures critical motion information. Since a micro-expression occurs within a very short time, it occupies only a few frames even with a high-speed camera, so identifying a micro-expression from the few apex frames with a clear spatial signal would be very difficult. We therefore turn to the motion characteristics of the micro-expression and refer to the magnitudes of the optical flow and optical strain for motion clues. The 2-D mask for motive attention is defined as:

$$ \begin{array}{@{}rcl@{}} A_{m} = \frac{1}{2}(f+s). \end{array} $$
(7)
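Reusing the flow and strain magnitudes from the earlier optical-flow sketch, the motive attention map can be formed as follows; the normalization to [0, 1] is our assumption, added so that the map can be used directly as a spatial weight.

```python
# Motive attention map A_m of Eq. (7); f_mag and s_mag come from the
# flow_and_strain sketch above.
a_m = 0.5 * (f_mag + s_mag)
a_m = a_m / (a_m.max() + 1e-8)   # scale to [0, 1] (assumption)
```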

3.2.3 Channel attention

General attention and motive attention assign weights to features along the spatial dimensions, which relieves the distraction caused by less relevant facial regions. In fact, the same distraction problem occurs along the channel dimension. As mentioned above, each feature map can be regarded as the semantic response to a different filter, and understanding and utilizing this semantic information is very important for micro-expression recognition. For the VGG-Face network pre-trained for face recognition, the feature maps encode rich information about appearance characteristics, and different characteristics have different levels of importance; for instance, the size of the nostrils is more relevant to anger than whether the person has a hooked nose. Therefore, in addition to spatial attention, we also include semantic attention in the proposed work, denoted as channel attention.

Given the contextual facial features extracted from the conv5_3 layer of VGG-Face, our goal is to apply a set of scaling factors that automatically and selectively highlight the expression-dependent feature maps. The channel attention is shown in Fig. 3. Suppose the feature maps are represented as Φ = [ϕ1, ϕ2,⋯ , ϕ512], where \(\phi _{c}\in \mathbb {R}^{W\times H}\) is the c-th slice of the feature maps Φ and 512 is the total number of channels. We first use an average pooling layer to obtain a channel feature vector v:

Fig. 3

Channel attention follows the SE-Net structure to redistribute the weights of different channels. A pooling layer first reduces the spatial dimension of the features, two fully connected layers then produce a weight per channel, and the input features are multiplied by these weights to achieve the redistribution

$$ \begin{array}{@{}rcl@{}} \boldsymbol v = [v_{1}, v_{2}, \cdots, v_{512}], \boldsymbol v\in\mathbb{R}^{512}, \end{array} $$
(8)

where the average value vc represents the features of the c-th channel. Two fully connected layers are then used to learn an aggregated descriptor over the channels:

$$ \begin{array}{@{}rcl@{}} \boldsymbol u = \boldsymbol W_{2} \ast N(\boldsymbol W_{1} \ast \boldsymbol v + \boldsymbol b_{1}) + \boldsymbol b_{2}, \end{array} $$
(9)

where W1, W2 are the weights of the fully connected layers and b1, b2 are bias terms; N(⋅) denotes the non-linear activation function. Note that the two fully connected layers form a bottleneck structure that models the correlation between channels and outputs the same number of weights as input channels: we first reduce the feature dimension to 1/4 of the input and then restore the original dimension through the second fully connected layer. Compared with using a single fully connected layer directly, this design has more non-linearity, so it can better fit the complex correlations between channels while greatly reducing the number of parameters and the computation.

The normalized weight vector of the channel attention mechanism is then defined as:

$$ \begin{array}{@{}rcl@{}} A_{c} = \frac{1}{1+\exp(-\boldsymbol u)}, \end{array} $$
(10)

which applies a sigmoid function to u element-wise. To apply the normalized weights to each channel of the input feature maps, we replicate the weight vector to the same dimensions as the input feature maps (i.e., 14 × 14 × 512) and then perform element-wise multiplication.
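The channel attention of Eqs. (8)-(10) corresponds to a standard Squeeze-and-Excitation block [10]; a sketch in PyTorch is shown below. The class name and the reduction-ratio parameterization are ours, with the ratio of 4 taken from the text.

```python
import torch
import torch.nn as nn

class ChannelAttentionSketch(nn.Module):
    """SE-style channel attention: global average pooling, a two-layer
    bottleneck (512 -> 128 -> 512), and a sigmoid that rescales each channel."""

    def __init__(self, channels=512, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # v in Eq. (8)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),                          # N(.) in Eq. (9)
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                   # x: (b, 512, 14, 14)
        b, c, _, _ = x.shape
        v = self.pool(x).view(b, c)
        a_c = torch.sigmoid(self.fc(v)).view(b, c, 1, 1)    # Eq. (10)
        return x * a_c                                      # broadcast over H and W
```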

3.2.4 Fusion of attention mechanisms

Given the three attention mechanisms, the fusion procedure obviously has a great impact on their effectiveness. Both general attention and motive attention are spatial attention mechanisms that guide the model to emphasize certain spatial locations of the input feature maps: general attention focuses on the landmarks, while motive attention focuses on the motion areas. Channel attention, in contrast, helps to emphasize certain semantic information, as each feature map can be regarded as the semantic response to a different filter. Therefore, we first combine the two spatial attentions (general and motive) to suppress less relevant features and then apply the channel attention to emphasize more discriminative semantic features. As shown in Fig. 4, the three attention mechanisms are applied to the high-level feature maps from the conv5_3 layer of VGGNet-16, with dimensions 14 × 14 × 512. General attention and motive attention both have the same resolution as the original frames, so they are first combined by pixel-wise summation. To match the dimensions of the conv5_3 feature maps, the combined attention map is downsampled to 1/16 of its resolution by bilinear interpolation and then replicated along the channel dimension to a depth of 512. We then carry out pixel-wise multiplication between the resized attention map and the feature maps. After applying the spatial weights, the feature maps further go through the channel attention module to re-weight the different channels.
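The fusion order described above can be sketched as follows; the function name and the use of broadcasting (which realizes the channel-wise replication implicitly) are implementation choices of ours.

```python
import torch
import torch.nn.functional as F

def apply_ternary_attention(feat, a_g, a_m, channel_att):
    """Sketch of the fusion in Fig. 4. feat: conv5_3 features (b, 512, 14, 14);
    a_g, a_m: full-resolution spatial attention maps (b, 1, 224, 224);
    channel_att: the SE module from the previous sketch."""
    a_spatial = a_g + a_m                                   # pixel-wise summation
    a_spatial = F.interpolate(a_spatial, size=feat.shape[-2:],
                              mode='bilinear', align_corners=False)
    feat = feat * a_spatial        # broadcasting replicates the map over 512 channels
    return channel_att(feat)       # re-weight the channels
```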

Fig. 4

The combination of the three attention mechanisms. The two spatial attentions are first combined and integrated with the visual features, which then go through the channel attention mechanism

4 Experimental results

4.1 Experimental setting

The proposed algorithm is evaluated on the most commonly used CASME II dataset [41], which contains 255 spontaneous micro-expression video sequences recorded at a high frame rate (200 fps). The samples are divided into seven micro-expression categories: happiness, repression, disgust, fear, sadness, surprise, and others. The label of each micro-expression was set based not only on the action units (AUs) but also on the videos used to elicit the emotions and on the participants' responses. Because this additional information may conflict in some cases, the first Facial Micro-Expression Challenge proposed a new set of target classes based on the Facial Action Coding System (FACS), where samples are classified into seven new categories. The sample numbers of the different categories vary from 1 to 99; for fair comparison, categories VI and VII are ignored. The evaluations in our experiments are conducted with leave-one-subject-out cross-validation, i.e., the test subject is excluded from the training set. The recognition accuracy is then calculated by averaging over 26 evaluations (26 subjects in CASME II).

The Adam optimizer is used to train our model with an initial learning rate of 10−5 and a decay rate of 10−6. We train for 15 epochs to fine-tune the three VGGNet-16 subnets and for 20 epochs to obtain the final overall model. Analogous to most existing micro-expression recognition algorithms, the micro-expression sequences are interpolated to a fixed number of frames; in this work, the Temporal Interpolation Model (TIM) [15] is used to generate 10 frames for every sample. As optical flow and optical strain describe the motion between adjacent frames, there are only 9 inputs for these two subnets, and for consistency we feed only 9 frames into the first subnet as well. The ternary-attention-based visual features from the 9 sets of inputs are then sequentially fed into the LSTM to obtain the recognition result.
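For reference, the leave-one-subject-out protocol can be sketched as below; train_and_eval is a placeholder for the full two-stage training and evaluation pipeline and is not part of the paper.

```python
import numpy as np

def loso_accuracy(subjects, train_and_eval):
    """Leave-one-subject-out protocol on CASME II: for each of the 26 subjects,
    train on all other subjects, test on the held-out one, and average the
    per-fold accuracies."""
    accs = []
    for s in sorted(set(subjects)):
        train_idx = [i for i, subj in enumerate(subjects) if subj != s]
        test_idx = [i for i, subj in enumerate(subjects) if subj == s]
        accs.append(train_and_eval(train_idx, test_idx))   # returns fold accuracy
    return float(np.mean(accs))
```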

4.2 Ablation analysis

To validate the effectiveness of the proposed attention mechanisms, we carried out an extensive ablation analysis, comparing the proposed algorithm with four variants: the Baseline model denotes the backbone network without attention; Baseline-MA adds motive attention; Baseline-MA-GA adds both motive and general attention; Baseline-MA-CA adds motive and channel attention; and Baseline-MA-GA-CA is the proposed network with all three attention mechanisms. The accuracy of the different variants is reported in Table 1. The accuracy clearly increases as more attention mechanisms are involved, validating the contribution of each attentive component of the proposed algorithm.

Table 1 Accuracy of the proposed algorithm and its variants

4.3 Comparison with state-of-the-art algorithms

We also compare the proposed algorithm with state-of-the-art micro-expression recognition algorithms, including the benchmark LBP-TOP algorithm [28], the Facial Dynamics Map (FDM) algorithm [38], LBP with Six Intersection Points (LBP-SIP) [35], Adaptive Magnification of Discriminative Facial Motion (Adaptive MM + LBP-TOP) [25], and ELRCN-TE [12]. The accuracies of the different algorithms are shown in Table 2; the proposed algorithm consistently outperforms the existing micro-expression algorithms by large margins.

Table 2 Accuracy of the proposed and state-of-the-art algorithms

5 Conclusion

In this paper, we propose a novel micro-expression recognition algorithm with ternary attention. The backbone contains three VGGNet-16 subnets that extract features from the original frames, the optical flow, and the optical strain, respectively. These features are concatenated and passed through one layer of LSTM to obtain spatio-temporal features, which are then classified by two fully connected layers. To facilitate more effective feature extraction, we introduce three kinds of attention mechanisms: general attention emphasizes the more relevant facial regions around landmarks; motive attention guides the model to focus on facial areas with large motion; and channel attention puts more weight on the semantic features related to micro-expressions. Experimental results validate the effectiveness of each attention mechanism, and the proposed model outperforms state-of-the-art algorithms by large margins.