
1 Introduction

Facial Micro-Expressions (MEs) disclose a person's true mental state unconsciously, even while that person is deliberately trying to conceal it in a high-stakes situation. A transient ME lasts less than 1/5 s [1] and hides beneath ordinary, acted facial expressions. It has very low intensity [2] because it arises from tiny movements of facial muscles. It is challenging to create a high-stakes situation in a controlled environment and to elicit MEs voluntarily against their spontaneous nature. These factors impede real-time [3] ME recognition (MER) built from publicly available ME datasets, even though MER provides an important cue for detecting lies or deceitful behavior. There are only five spontaneous, publicly available ME datasets [3], and all of them contain a small number of samples. Facial Action Coding System (FACS) coding [4] has been used in most ME databases to identify the action units (AUs) that are linked to specific facial muscle movements within facial components. AUs that are pertinent to the emotional state help to reduce subjective bias in ME recognition. Despite all of these challenges, ME recognition has gained momentum in the computer vision community in the last few years [5] because of inescapable practical applications such as business negotiation, psychoanalysis, forensic investigation, and homeland security.

Many handcrafted feature descriptors based on appearance-based feature learning have been devised for ME recognition. For static and dynamic texture-based ME recognition, LBP and many of its variants have been proposed in [6,7,8,9,10,11,12]. Although widely used, LBP features are not naturally suited to AU/motion-based recognition. Hence, geometry- or motion-based descriptors [13,14,15,16,17] were introduced to capture facial deformation from facial landmarks or optical flow features. These techniques are susceptible to head-pose variations and depend on face registration. Gradient-based feature descriptors [18, 19] were also proposed to mitigate lighting variations but still suffer from head-pose variations.

Automatic feature learning methods enable effective, independent learning from the input, and many deep learning-based models have therefore gained popularity for ME recognition in recent years. The first attempt to use a convolutional neural network (CNN) deep model [20] for ME recognition could not reach high accuracy due to over-fitting. Takalkar and Xu [21] proposed a CNN-based model with data augmentation to combat the over-fitting problem but obtained only minor improvement due to data imbalance and subjective bias in annotations. In [22], temporal interpolation was used with a DCNN followed by a support vector machine (SVM) classifier to cope with the short duration of MEs. Peng et al. [23] proposed a dual temporal scale convolutional neural network (DTSCNN) for spatio-temporal feature extraction from optical flow inputs of a composite dataset built from two ME datasets; it outperformed several hand-designed methods. In [24], a CNN combined with long short-term memory (LSTM) was proposed, considering class discrimination, expression states, and persistence of states along with temporal change; it achieved better results but was not robust to the sample imbalance problem. In TLCNN [25], a pre-trained CNN was used to model spatial features from single frames, which were then fed to an LSTM, using combined video samples from three ME datasets. A pre-trained deep network-based method operating on the apex frame was presented in [26]. A two-stream network for optical flow-dependent high-level feature extraction and classification was introduced in [27]. In [28], a dual-stream shallow CNN with a saliency map was proposed to combat over-fitting. A deep recurrent-CNN (R-CNN) model trained from scratch for spatio-temporal feature learning was presented in [29]. In [30], a shallow R-CNN model was designed and evaluated on composite dataset samples. VGGNets with an LSTM-based discriminative attention model were proposed in [31]. Overall, deep learning-based approaches have shown strong and reliable representation of discriminative ME features.

Numerous studies based on handcrafted feature engineering, automatic spatial feature extraction, and temporal correlation among spatial features have been carried out to classify MEs from the SMIC [32], CASME [33], CASMEII [34], CAS(ME)2 [35], and SAMM [36] datasets. There is some resemblance between macro-expressions (MACE) and MEs in terms of facial dynamics. In addition, MACE datasets (e.g., CK+ [37], MUGFE [38], and OULU-CASIA [39]) have a large number of samples, which makes it possible to exploit transfer learning for recognizing MEs from limited samples with deep learning models. Automatic ME recognition from frame sequences is challenging because an ME persists over only a small number of frames. Moreover, ME videos contain many redundant and neutral frames that increase the computational cost and the influence of irrelevant features. Despite this, video-based end-to-end deep ME recognition is more resilient, as it models spatio-temporal features against illumination, head-pose, and subtle motion variations. Although combining ME samples from multiple ME datasets makes the problem harder due to inconsistencies among them, it helps to model MEs from a larger and more diverse group of samples whose intrinsic factors may form a set/subset of those found in samples in the wild. This paves the way to realistic categorization of more spontaneous and natural ME samples.

We propose a method that extracts spatio-temporal features from composite samples of five spontaneous ME datasets [32,33,34,35,36] with a 3D CNN residual network as the backbone, followed by an LSTM auto-encoder that de-noises and fine-tunes the high-level feature maps encoding the temporal deformation of spatial features, and finally a native structural regularizer with a soft-max classifier. Two cross-validation strategies, Leave-One-Subject-Out (LOSO) and Stratified 5-Fold cross-validation (CV), have been designed and applied to the combined samples to estimate standard accuracy, unweighted average recall (UAR), and unweighted average F1-score (UF1). Two-stage transfer learning has been used based on three benchmark MACE datasets [37,38,39]. Our model shows superior recognition accuracy on the three metrics and surpasses state-of-the-art methods.

The main contributions of this work are summarized below:

  • We propose an automatic ME recognition method based on a 3DCNN-18 residual network as the backbone for spatio-temporal feature extraction, followed by a conv-LSTM auto-encoder of size 2-1-2 that fine-tunes the temporal correlation among spatial features and de-noises them. Finally, a structural regularizer is appended for aggressive summarization before feeding the resulting vector to a soft-max classifier.

  • Macro-to-micro transfer learning has been revitalized to deal with over-fitting caused by the small number of ME video samples.

  • Construction of a composite ME video dataset with three class labels (negative, positive, and surprise) from five publicly available ME datasets.

  • Validation with two CV techniques, LOSO and Stratified 5-fold, to measure effectiveness and verify the generalization of the model across samples and subjects. The results show the very high effectiveness of the proposed method, which outperforms state-of-the-art MER approaches.

The rest of this paper is organized as follows: methodology in Sect. 2, experiments in Sect. 3, results and discussion in Sect. 4, and conclusions in Sect. 5.

2 Methodology

This section systematically presents our proposed method. It covers the processing pipeline, the model architecture, the evaluation metrics and loss function, and the transfer learning mechanism from macro- to micro-expressions.

2.1 Process Pipeline

The general pipeline shows the flow of input video processing steps: a 3DCNN-18 residual backbone [40] for spatio-temporal feature extraction, a 2D Conv-LSTM auto-encoder inspired by [41], and a structural regularizer followed by a soft-max classifier. It covers the path from the input sequence of spontaneous ME frames to the predicted emotional class label, as depicted in Fig. 1.

Faces in the ME video frames are detected and aligned with a reliable deep face alignment network [42] together with a real-time, robust single-shot scale-invariant face detector (S3FD) [43] to correct head pose, which reduces its negative influence on ME recognition. The face is cropped to the detected bounding box to eliminate irrelevant image regions. The facial images are then normalized to 64 × 64, which makes training on video sequences faster by reducing the computational cost. Each sequence is augmented with vertical flipping and an 8° counter-clockwise rotation, tripling the original number of samples, which helps to reduce over-fitting to some extent. Furthermore, the frame sequences are padded with the last frame, which is the apex frame in our case for all ME datasets except SMIC, where the apex frame is not annotated. We restrict the ME video length to 12 frames for batching to reduce the computational cost and unnecessary redundancy. As the dataset samples are highly imbalanced, we compute class weights for a weighted random sampler during batching to ensure that each class is represented. After these preprocessing, augmentation, and batching steps, the frame sequences are fed into the model depicted in Fig. 2 for spatio-temporal feature extraction and classification. A minimal sketch of these preprocessing and batching steps is given below, and the model is described in the following section.
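The sketch below assumes that face detection and alignment [42, 43] have already produced cropped 64 × 64 face frames; the helper names (pad_clip, augment_clip, make_loader) and exact transform parameters are illustrative rather than the exact implementation.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

TARGET_LEN = 12          # fixed clip length used for batching
TARGET_SIZE = (64, 64)   # spatial size after cropping and resizing


def pad_clip(frames, target_len=TARGET_LEN):
    """Pad a cropped face sequence with its last (apex) frame up to target_len."""
    frames = list(frames[:target_len])
    while len(frames) < target_len:
        frames.append(frames[-1])
    return np.stack(frames)                       # (T, H, W, C), uint8


def augment_clip(frames):
    """Vertical flip and a fixed 8-degree rotation to triple the sample count."""
    to_pil = transforms.ToPILImage()
    flip = transforms.RandomVerticalFlip(p=1.0)
    rot = transforms.RandomRotation(degrees=(8, 8))
    flipped = np.stack([np.array(flip(to_pil(f))) for f in frames])
    rotated = np.stack([np.array(rot(to_pil(f))) for f in frames])
    return [frames, flipped, rotated]


def make_loader(dataset, labels, batch_size=64):
    """Weighted random sampler so each class is represented despite imbalance.

    labels: 1-D numpy array of integer class indices, one per clip.
    """
    counts = np.bincount(labels)
    class_weights = 1.0 / counts
    sample_weights = torch.as_tensor(class_weights[labels], dtype=torch.double)
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```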

2.2 Model Architecture

The model encompasses an 18-layer residual 3DCNN backbone network, a 2-1-2 LSTM auto-encoder with 2D convolutions, and a 3DCNN layer followed by a structural regularizer and a soft-max classifier. The 3DCNN performs convolutions over the spatial and temporal dimensions of the ME frame sequence concurrently. It is implemented in a residual framework with 18 layers to capture the dynamic changes of spatial ME features on the face. The extracted spatio-temporal features, with dimensions (channel × length × height × width), are passed to the two-layer convolutional LSTM auto-encoder, which produces a compact representation of the features by reducing noise and fine-tuning the subtle spatial changes. Both LSTM encoder cells are unrolled along the length dimension. We use the vanilla LSTM cell with 2D convolutions, formulated in the following set of equations:

$$i_{t}=\sigma \left(W_{xi}\; conv\; x_{t}+W_{hi}\; conv\; h_{t-1}+W_{ci}*c_{t-1}+b_{i}\right)$$
(1)
$$f_{t}=\sigma \left(W_{xf}\; conv\; x_{t}+W_{hf}\; conv\; h_{t-1}+W_{cf}*c_{t-1}+b_{f}\right)$$
(2)
$$c_{t}=f_{t}*c_{t-1}+i_{t}*\tanh \left(W_{xc}\; conv\; x_{t}+W_{hc}\; conv\; h_{t-1}+b_{c}\right)$$
(3)
$$o_{t}=\sigma \left(W_{xo}\; conv\; x_{t}+W_{ho}\; conv\; h_{t-1}+W_{co}*c_{t}+b_{o}\right)$$
(4)
$$h_{t}=o_{t}*\tanh \left(c_{t}\right)$$
(5)

In Eqs. (1)-(5), $i_t$ is the input gate, $f_t$ the output of the forget gate, $c_t$ the current cell state, $o_t$ the output gate, and $h_t$ the hidden state at time step $t$. Here, $conv$ denotes the convolution operation and $*$ the Hadamard product. The LSTM cell is iteratively unrolled over the activated feature map sequence obtained from the 3DCNN backbone. The Conv-LSTM encoder maps the input into a compact encoded vector, namely the hidden vector $h_2$ of the second LSTM encoder cell for the last featured frame of the high-level feature sequence. A Conv-LSTM decoder with two LSTM cells is then used to reconstruct the fine-grained, smooth spatio-temporal transient evolution; a 3 × 3 convolutional filter is used in both the encoder and the decoder. The regenerated ME features, stacked on $h_3$ through the length iterations of the decoder, are fed into a 3DCNN layer with kernel 1 × 3 × 3, which converts the feature maps from a regression form to a class form so that the ME sequence can be categorized into the three labels negative, positive, and surprise. A native structural regularizer with adaptive average pooling then performs aggressive summarization to generate a vector of three predicted values, reducing the trainable parameters and alleviating over-fitting. A final soft-max classifier ensures that the three predicted values sum to 1.
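For illustration, a minimal PyTorch sketch of a convolutional LSTM cell implementing Eqs. (1)-(5) is given below. The peephole terms $W_{ci}$, $W_{cf}$, $W_{co}$ are modeled here as per-channel learnable weights, which is a simplification; the layer sizes are illustrative and not necessarily the exact configuration of our implementation.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Vanilla ConvLSTM cell with 2D convolutions, following Eqs. (1)-(5)."""

    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution yields the stacked pre-activations for the i, f, o, g gates.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)
        # Peephole weights for the Hadamard terms W_ci, W_cf, W_co (per-channel here).
        self.w_ci = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_cf = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_co = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.hid_ch = hid_ch

    def forward(self, x_t, state):
        h_prev, c_prev = state
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))
        i_pre, f_pre, o_pre, g_pre = torch.split(gates, self.hid_ch, dim=1)
        i_t = torch.sigmoid(i_pre + self.w_ci * c_prev)   # Eq. (1)
        f_t = torch.sigmoid(f_pre + self.w_cf * c_prev)   # Eq. (2)
        c_t = f_t * c_prev + i_t * torch.tanh(g_pre)      # Eq. (3)
        o_t = torch.sigmoid(o_pre + self.w_co * c_t)      # Eq. (4)
        h_t = o_t * torch.tanh(c_t)                       # Eq. (5)
        return h_t, c_t
```

The cell is unrolled over the temporal dimension of the backbone feature maps; the two encoder cells and two decoder cells of the 2-1-2 auto-encoder can be built from instances of this cell.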

Fig. 1.
figure 1

ME recognition pipeline comprising preprocessing of input video to class label prediction.

Fig. 2.
figure 2

Model architecture composed of 3D ResNet18 and Conv-LSTM auto-encoder appended with structural regularizer.
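To make the architecture of Fig. 2 concrete, a condensed sketch of the forward pass is given below. This is a sketch under stated assumptions rather than the exact implementation: ConvLSTMCell refers to the cell sketched in Sect. 2.2, the hidden channel size and the decoder state initialization are illustrative choices, and the final soft-max is assumed to be applied by the loss or at inference on the returned class scores.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


class MERNet(nn.Module):
    """Sketch of Fig. 2: 3D ResNet-18 -> 2-1-2 Conv-LSTM auto-encoder -> 3D conv -> pooling."""

    def __init__(self, num_classes=3, hid_ch=128):
        super().__init__()
        backbone = r3d_18(weights="KINETICS400_V1")        # use pretrained=True on older torchvision
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep the 5-D feature maps
        self.hid = hid_ch
        self.enc1, self.enc2 = ConvLSTMCell(512, hid_ch), ConvLSTMCell(hid_ch, hid_ch)
        self.dec1, self.dec2 = ConvLSTMCell(hid_ch, hid_ch), ConvLSTMCell(hid_ch, hid_ch)
        self.head = nn.Conv3d(hid_ch, num_classes, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.pool = nn.AdaptiveAvgPool3d(1)                 # structural regularizer

    def forward(self, clips):                               # clips: (B, 3, T, 64, 64)
        feats = self.backbone(clips)                        # (B, 512, T', H', W')
        b, _, t, h, w = feats.shape
        zero = lambda: torch.zeros(b, self.hid, h, w, device=feats.device)
        s1, s2 = (zero(), zero()), (zero(), zero())
        for step in range(t):                               # encoder: unroll both cells over time
            h1, c1 = self.enc1(feats[:, :, step], s1); s1 = (h1, c1)
            h2, c2 = self.enc2(h1, s2); s2 = (h2, c2)
        code = s2[0]                                        # encoded vector (h2 at the last step)
        s3, s4, outs = s2, s2, []
        for step in range(t):                               # decoder: regenerate a sequence from the code
            h3, c3 = self.dec1(code, s3); s3 = (h3, c3)
            h4, c4 = self.dec2(h3, s4); s4 = (h4, c4)
            outs.append(h4)
        dec = torch.stack(outs, dim=2)                      # (B, hid_ch, T', H', W')
        return self.pool(self.head(dec)).flatten(1)         # (B, num_classes) class scores
```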

2.3 Evaluation Metrics and Loss Functions

To evaluate the model on the composite ME samples, three metrics have been used to observe the influence of data organization, preprocessing, augmentation, and frame-rate setup, as well as to measure the effectiveness of the proposed model. Standard accuracy and the balanced macro metrics UAR and UF1 are defined in Eqs. (6), (7), and (8).

$$Standard\ Accuracy=\frac{TP+TN}{TP+FP+TN+FN}$$
(6)
$$Unweighted\ Average\ Recall\ (UAR)=\frac{1}{C}\sum_{c=1}^{C}\frac{TP_{c}}{TP_{c}+FN_{c}}$$
(7)
$$Unweighted\ Average\ F1\ Score\ (UF1)=\frac{1}{C}\sum_{c=1}^{C}\frac{2TP_{c}}{2TP_{c}+FP_{c}+FN_{c}}$$
(8)

In Eqs. (6), (7), and (8), TP, TN, FP, FN, and C represent True Positives, True Negatives, False Positives, False Negatives, and the number of classes, respectively; the subscript c denotes the counts for class c.

As the target ME datasets are highly imbalanced, we use these metrics to reduce the bias of plain accuracy toward the classes with more samples. Model sensitivity is measured with UAR by averaging the recall over the classes, and UF1 is computed at the macro level from the per-class precision and recall. The standard accuracy is the fraction of correctly classified samples out of all predicted samples.
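For illustration, these three metrics can be computed from predictions with scikit-learn as sketched below; the function and variable names, and the toy labels, are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score


def mer_metrics(y_true, y_pred):
    """Standard accuracy, UAR (macro recall), and UF1 (macro F1) as in Eqs. (6)-(8)."""
    acc = accuracy_score(y_true, y_pred)                  # Eq. (6)
    uar = recall_score(y_true, y_pred, average="macro")   # Eq. (7): per-class recall, averaged
    uf1 = f1_score(y_true, y_pred, average="macro")       # Eq. (8): per-class F1, averaged
    return acc, uar, uf1


# Example with toy three-class predictions (0 = negative, 1 = positive, 2 = surprise).
y_true = np.array([0, 0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 0, 1])
print(mer_metrics(y_true, y_pred))
```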

The model has been trained with a cross-entropy loss on the predicted output, which is suitable for multi-class classification and, with class weights, for highly imbalanced datasets such as the publicly available ME datasets. The analytical form is given in Eq. (9).

$$loss\left(x,class\right)=-\log \left(\frac{\exp \left(x\left[class\right]\right)}{\sum_{j}\exp \left(x\left[j\right]\right)}\right)=-x\left[class\right]+\log \left(\sum_{j}\exp \left(x\left[j\right]\right)\right)$$

and, with class weights,

$$loss\left(x,class\right)=weight\left[class\right]\left(-x\left[class\right]+\log \left(\sum_{j}\exp \left(x\left[j\right]\right)\right)\right)$$
(9)

In Eq. (9), x is the vector of predicted outputs for the three categories and class ∈ {0, 1, 2} indexes negative, positive, and surprise, respectively. As the model is trained on batches of ME frame sequences, the loss is averaged over the predicted samples in each batch.
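In PyTorch, the class-weighted form of Eq. (9) corresponds to nn.CrossEntropyLoss with a weight tensor; note that this loss expects raw (unnormalized) scores. A minimal sketch with illustrative class counts and an inverse-frequency weighting scheme follows.

```python
import torch
import torch.nn as nn

# Illustrative per-class sample counts for negative, positive, surprise.
class_counts = torch.tensor([250.0, 80.0, 70.0])
# One common inverse-frequency weighting scheme (illustrative, not the only choice).
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)   # Eq. (9), weighted mean over the batch

logits = torch.randn(4, 3)            # raw model outputs for a batch of 4 clips
targets = torch.tensor([0, 2, 1, 0])  # ground-truth class indices
loss = criterion(logits, targets)
```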

2.4 Macro to Micro Knowledge Transfer

The model has first been trained on three benchmark MACE datasets: CK+ [37], MUGFE [38], and OULU-CASIA [39]. In our model, the backbone network is pre-trained on Kinetics-400 [44], while the conv-LSTM auto-encoder with 4 LSTM cells and the final 3DCNN layer are initialized randomly. The proposed model has about 37.6 million trainable parameters. Publicly accessible ME datasets, on the other hand, contain relatively few video samples, so training the model directly on the target ME samples would cause over-fitting. Considering these facts, we pre-train the model on MACE video samples in the negative, positive, and surprise classes to acquire knowledge similar to that required for MEs. The model is then further trained on the composite ME samples in the same categories for the target MER task. Dataset organization and experimental details are discussed in Sect. 3.
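A schematic of the two-stage macro-to-micro transfer is sketched below. The optimizer settings, epoch counts, and loader names (mace_loader, me_loader) are placeholders, and the model/criterion are the sketches given earlier; this is not the exact training script.

```python
import torch


def train_stage(model, loader, criterion, epochs, lr, device="cuda"):
    """One training stage: used first on composite MACE data, then on composite ME data."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
    return model


# Stage 1: pre-train on the composite macro-expression (MACE) samples.
# model = train_stage(model, mace_loader, criterion, epochs=50, lr=1e-4)
# Stage 2: fine-tune on the composite micro-expression (ME) samples.
# model = train_stage(model, me_loader, criterion, epochs=50, lr=1e-3)
```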

3 Experiments

This section describes the organization of the data samples into three discrete categories, negative, positive, and surprise, for both MACE and ME. It also presents the dataset summaries, the model implementation, and the design and implementation of the LOSO and Stratified 5-fold cross-validations.

3.1 Macro and Micro Datasets Organization

Three benchmark MACE datasets have been used to transfer knowledge about ordinary prototypical facial expressions. The widely used extended Cohn-Kanade (CK+) dataset [37] is one of them; the other two, MUGFE [38] and OULU-CASIA [39], are also highly relevant sources of MACE video samples. We consider frame sequences for the expressions anger, disgust, sadness, happiness, and surprise in all three datasets. In CK+, the respective MACE videos have been reduced to 6 to 50 frames by manually eliminating some early neutral frames while preserving the last (peak) frame of each sequence. For the MUG facial expression dataset, we keep the video length at or below 62 frames by discarding neutral and less expressive frames. Videos of OULU-CASIA have been restricted to 20 frames in the same way, and only the samples from its strong visible lighting condition are used for our transfer learning. All samples from these datasets are categorized into three classes: negative (anger, disgust, and sadness), positive (happiness), and surprise. This results in a composite three-class MACE dataset for pre-training. The samples from these datasets are summarized in Table 1.

We consider the five spontaneous ME datasets listed in Table 2 for ME recognition in the negative, positive, and surprise categories. A composite dataset comprising ME video samples from those datasets has been constructed. The negative class is formed from anger, disgust, and sadness, while happiness and surprise map to the positive and surprise classes, respectively (a sketch of this mapping is given below). The number of samples in each class is tabulated in Table 1. To the best of our knowledge, this is the first attempt to recognize MEs across five ME datasets. To reduce subjective bias, all ME samples listed in Table 1, except those from SMIC, have been re-classified based on AUs [45].
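The emotion-to-class mapping used for both the MACE and ME composites can be expressed as a simple lookup, sketched below; the exact label strings differ between the source datasets, so this mapping is illustrative.

```python
# Map the original emotion labels of the source datasets onto the three target classes.
EMOTION_TO_CLASS = {
    "anger": "negative",
    "disgust": "negative",
    "sadness": "negative",
    "happiness": "positive",
    "surprise": "surprise",
}

CLASS_TO_INDEX = {"negative": 0, "positive": 1, "surprise": 2}


def relabel(emotion: str) -> int:
    """Return the composite three-class index for a source emotion label."""
    return CLASS_TO_INDEX[EMOTION_TO_CLASS[emotion.lower()]]
```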

Table 1. Summary of macro-expression and micro-expression video samples from the three MACE datasets and the five publicly available spontaneous ME datasets

Since ME videos contain many redundant frames, which are detrimental to recognition accuracy, we eliminate irrelevant frames by keeping only the frames between the onset and apex frames, inclusive. The ME video sequences are thus reduced to 6 to 13, 5 to 12, 7 to 11, and 11 to 12 frames for CASME, CASMEII, CAS(ME)2, and SAMM, respectively. The last frame in each ME sample is the frame with the highest peak. As SMIC is not annotated with apex frames, we keep its original video length, and only its high-speed (HS) and visual (VIS) samples are considered. This composite dataset introduces a domain-shifting challenge due to subject diversity and disparate attributes among the ME datasets, but the larger number of samples helps to alleviate over-fitting and improve model generalization, as supported by the results in Sect. 4. Preprocessing and augmentation are as discussed in Sect. 2.1, except that image smoothing is additionally used for augmentation in the case of the MACE dataset.

3.2 Experimental Settings

The proposed model has been pre-trained on the composite macro dataset summarized in Table 1. The sequence length of each video is made equal to 12 frames by selecting frames at equal intervals and padding with the last frame; the padded sequences are then used for pre-training. The hyper-parameters batch_size and initial learning rate are set to 64 and 0.0001, respectively. The learning rate is scheduled with a patience of 5, i.e., it is reduced when the validation loss does not improve over five consecutive epochs. Early stopping is applied when the validation loss degrades or remains unchanged for 10 consecutive epochs. The learnable parameters are optimized with the Adam optimizer. The dimension of a batch of videos is 64 × 3 × 12 × 64 × 64. For training and testing on the composite ME target samples, we use an initial learning rate of 0.001, and the early stopping epoch is 8. In both cases, a weighted random sampler is configured to combat the bias toward classes with more samples, and all other settings remain the same. The model is trained and tested on a platform with Windows 10 Pro, an Intel Core i5-7400 CPU at 3 GHz, 8 GB DDR4 RAM, and an NVIDIA GeForce RTX 2070 with 8 GB memory. The widely used Python deep learning framework PyTorch has been used for the experiments.
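The optimizer, learning-rate schedule, and early-stopping setup described above can be sketched in PyTorch as follows; the stand-in model and the random validation loss are placeholders for the real training and validation routines.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 3)                                      # stand-in for the MER model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # 1e-3 for the ME fine-tuning stage
# Reduce the LR (default factor 0.1) when the validation loss has plateaued for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=5)

best_loss, stale, patience = float("inf"), 0, 10             # stop after 10 epochs without improvement
for epoch in range(100):                                     # illustrative upper bound on epochs
    val_loss = torch.rand(1).item()                          # placeholder for the real validation loss
    scheduler.step(val_loss)
    if val_loss < best_loss:
        best_loss, stale = val_loss, 0
    else:
        stale += 1
        if stale >= patience:
            break
```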

3.3 Model Evaluations

The proposed model has first been aligned with similar domain knowledge through a hold-out evaluation (80%:20%) on the composite MACE samples, where it is trained and validated on the three classes. The resulting confusion matrix is given in Fig. 3. The pre-trained model has then been evaluated on our composite ME dataset from Table 1.

For this purpose, two validation strategies have been carefully designed and tuned to evaluate the model on the three classes negative, positive, and surprise: LOSO CV and Stratified 5-fold CV. The composite ME dataset introduces several challenges arising from the domain transition, notably subject diversity and large variations in temporal and spatial frequencies, together with the class imbalance problem. These can bias the model toward a specific group of ME video samples; LOSO and Stratified 5-fold CV reduce such biases for this kind of data distribution. In LOSO, 87 subjects are used to validate the model, with each subject held out once as the test split. In Stratified 5-fold, each of the 5 equal folds is in turn used as the test split. Standard accuracy, UAR, and UF1 are estimated from these validations; a sketch of how the splits can be set up is given below. Figure 4 shows the confusion matrices for the best accuracies of both tests, and the evolution of model convergence is recorded in Figs. 5 and 6.
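The two cross-validation protocols can be set up with scikit-learn's splitters as sketched below; the labels and subject IDs are toy values standing in for the composite ME dataset.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold

# Illustrative labels (0 = negative, 1 = positive, 2 = surprise) and subject IDs;
# in practice these come from the composite ME dataset (87 subjects in total).
y = np.repeat([0, 1, 2], 10)
groups = np.tile(np.arange(10), 3)
X = np.arange(len(y)).reshape(-1, 1)        # sample indices stand in for the video clips

# LOSO: each subject is held out once as the test split.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    pass                                     # train on train_idx, evaluate on test_idx

# Stratified 5-fold: each fold preserves the class proportions of the composite dataset.
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    pass                                     # train on train_idx, evaluate on test_idx
```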

Fig. 3.
figure 3

Confusion matrix from pre-training on the composite MACE dataset

Table 2. Disparate spatio-temporal properties of the five publicly available spontaneous ME datasets. Here, HS: high-speed camera, VIS: visual camera, NIR: near-infrared camera.
Fig. 4.
figure 4

Confusion matrix for LOSO CV and Stratified 5-folds on composite ME dataset

Fig. 5.
figure 5

LOSO CV standard accuracy, UAR, and UF1 scores

Fig. 6.
figure 6

Stratified 5-fold CV standard accuracy, UAR, and UF1 scores

4 Results and Discussion

We have constructed and used a composite ME dataset from the five spontaneous ME datasets mentioned in Table 3 for LOSO and Stratified 5-fold CV to categorize ME video sequences into three classes: negative, positive, and surprise. Our proposed method and its evaluation on this combined ME dataset show superiority over the state-of-the-art methods: it achieves significantly higher standard accuracy, UAR, and UF1 for both CVs, as is clear from the results in Table 3 and Figs. 4, 5, and 6. Table 3 records the results of recent methods on similar tasks. CapsuleNet based on the apex frame [46] achieves about 65% UAR and UF1, which is lower by 32% and 31%, respectively, than our proposed method. RCN-A [30] improves the scores to about 71% and 74% for the two metrics but still lags far behind our results. The macro-assisted network MicroNet [47] shows a reasonable gain for both metrics but ends up with scores below 86% and 87%, respectively. Our proposed method demonstrates a remarkable gain in all three metrics, with scores of about 97% for standard accuracy, UAR, and UF1 in the Stratified CV and consistent scores of 97%, 97%, and 96%, respectively, in the LOSO CV. We report both LOSO and Stratified CV, whereas the other approaches report only the former.

To lessen the over-fitting caused by the small ME datasets, our composite ME dataset with a larger number of video samples has a positive impact on all three accuracies while confronting the domain-shifting challenge. Transfer learning on the integrated MACE dataset is another contributing factor. Subjective bias reduction through the AU-based categorization of ME samples, the reduction of each frame sequence to 12 frames with the apex frame as the last one, augmentation, and the small spatial size of each frame also influence the model prediction favorably. On top of these factors, the model design contributes: a spatio-temporal backbone in a 3D residual framework and de-noising of the five-dimensional activated feature maps by two convolutional LSTM encoder layers and two accompanying LSTM decoder layers. This auto-encoder facilitates further abstraction of the relevant spatial changes in the ME frame sequence, and the final structural regularization reduces the number of trainable parameters.

The collective positive effects of our method on the accuracies also demonstrate the generalization of the model across the many attributes of the ME datasets, as captured by the normalized confusion matrices in Fig. 4. Some irregular weight updates cause minor drifting before epoch 20 and epoch 42 in Figs. 5 and 6, respectively, which is due to the random initialization of the parameters of the auto-encoder and the last 3DCNN layer. After those points, the model starts to converge on all three metrics in both CVs.

Table 3. LOSO CV ME recognition results of contemporary methods for the negative, positive, and surprise classes on different compositions of ME datasets.

5 Conclusions

This work has proposed a method and model for ME recognition on composite ME video sequences from the spontaneous datasets SMIC, CASME, CASMEII, CAS(ME)2, and SAMM. To the best of our knowledge, it is the first attempt at ME recognition across five ME datasets. The proposed model combines the synergies of a 3D ResNet-18 and a conv-LSTM auto-encoder. Two exhaustive cross-validations, LOSO and Stratified 5-fold, have been carried out on the composite dataset. Our method shows remarkable accuracies compared to state-of-the-art methods. The evaluation shows that the model generalizes well and remains highly effective despite the domain shift among ME datasets. In the future, this will help to classify ME frame sequences from diverse ME samples in the wild.