1 Introduction

Facial expression, an indispensable component of the human emotion expression system, is usually regarded as a non-verbal language reflecting the state of human emotions. Mehrabian's research [37] showed that facial expression carries the most emotional information: 55% of the affective content of a spoken message is conveyed by facial expression. As FER has made considerable progress in recent years, it has found a wide range of applications, such as human-robot interaction (HRI) [8], safe driving [25], medical diagnosis [24], and so on. Although FER has been studied for decades and notable advances have been obtained in both software and hardware systems [18], recognizing facial expressions with high accuracy remains difficult due to inter-personal variations, facial occlusion, and changes in facial pose. As illustrated in Fig. 1, the psychologists Ekman and Friesen [10] originally proposed that human beings share six prototypical emotions, namely anger (AN), disgust (DI), fear (FE), happiness (HA), sadness (SA), and surprise (SU), each of which reflects a unique psychological activity with a particular expression.

Fig. 1
figure 1

Example of six prototypical expressions from the CK+ database

Feature extraction obtains a high-level semantic representation by computing the texture, shape, spatial structure and other information of the original image, and it plays an important role in FER [3]. According to the type of input data, there are currently two mainstream approaches in FER research: image-based and video-based. Image-based approaches analyze and extract expression features from peak frames, while video-based approaches capture the temporal and motion information of image sequences containing facial expressions. Image-based approaches are normally further divided into two categories: geometric feature-based and appearance feature-based. In the former, the locations of facial key points are extracted and combined into a feature vector that represents facial geometric information (e.g., angle, distance, and position) [14]. Appearance feature-based approaches, in contrast, model appearance variations by applying descriptors to holistic or local regions to extract features [45].

Many previous studies are based on either a still frame or an image sequence; nevertheless, little work has been done to combine the two. In fact, the features extracted by these two kinds of methods are complementary: the peak frame of a facial expression is highly discriminative, while temporal information is indispensable in video classification tasks. Since a single feature is not comprehensive and rich enough to capture all dominant global information, it is necessary to fuse multiple complementary features to design a robust feature descriptor [29, 54]. In recent years, there has been a trend to apply neural network models [42] to FER, which yields state-of-the-art results [32]. However, such models bring their own problems, such as the limited availability of big data, poor generalization ability, and excessive consumption of processing time and memory.

This paper presents an effective system using both static and dynamic features to enhance the recognition accuracy of FER in video. We utilize two kinds of static information captured when a facial expression occurs: Gabor magnitude pictures with multiple scales and directions are used to extract texture features, and 49 facial key points are used for geometric features. For dynamic features, we improve the LBP-TOP method proposed by Zhao et al. [58], who demonstrated that the contribution of the XY plane is smaller than that of the XT and YT planes in FER. To enhance the contribution of the XY plane, we substitute the LBP histogram of the peak frame for the one extracted from the image sequence in the XY plane. Compared with the original LBP-TOP method, the improved LBP-TOP method is computationally simpler and more effective for characterizing spatiotemporal texture features; it is denoted I-LBP-TOP in this paper. Subsequently, the SVM implementation in the Shogun toolbox is used to obtain the results of three classifiers [60], and a majority voting (MV) strategy determines the final classification result.

The main contributions of this paper are as follows:

  • Firstly, we put forward an improved LBP-TOP descriptor (I-LBP-TOP), which remedies the shortcomings of the original LBP-TOP descriptor in FER.

  • Secondly, a new geometric feature (GF) is proposed and extended to time dimension (ST-GF).

  • Thirdly, a framework that integrates the spatiotemporal motion features (I-LBP-TOP and ST-GF) with static features (GMF) is proposed, which takes into account geometry-appearance and dynamic-still information simultaneously.

  • Finally, the experimental results demonstrate that the recognition performance of the system is greatly improved by introducing the multi-feature fusion method at the decision-level.

The remainder of this paper is arranged as follows. Previous related work is briefly reviewed in Section 2. In Section 3, we introduce the main contribution of this paper. Section 4 discusses and analyzes the experimental results and Section 5 concludes the paper.

2 Related work

2.1 Static image based approaches

Over the past few decades, most research has been devoted to expression analysis based on still frames. A typical approach is the Local Binary Pattern (LBP), which was first introduced by Ojala et al. [43] and applied to FER by Shan et al. [47]. As a simple descriptor with rotation and illumination invariance, LBP is widely used in expression recognition and has many variants, including Completed LBP (CLBP) [16], support LBP (sLBP) [40] and scale selective LBP (SSLBP) [15]. Gabor wavelets have also proven to be a powerful tool with optimal localization in both the spatial and frequency domains [26]. Gabor magnitude features are commonly used for modeling face changes, while several Gabor phase based approaches such as HGPP [55] and LGXP [51] also show competitive performance for facial feature extraction. Furthermore, Histogram of Oriented Gradients (HOG) features are constructed by computing and accumulating the intensity gradient distribution of local image regions to represent the shape and appearance of a facial image [9]. For geometric features, Liliana et al. utilized landmarks of facial components to analyze facial expressions [30]. In addition, [6] and [27] characterized rigid facial changes by computing the displacement and the coordinate differences of facial landmarks between a neutral face and an emotional face, respectively. However, some databases and real systems do not contain enough neutral expressions, which makes it impossible to compute such difference-based geometric features.

2.2 Dynamic image sequence based approaches

As a matter of fact, natural facial expression activity is a dynamic process whose evolution can be decomposed into three stages: the onset, the apex and the offset. Therefore, video-based approaches to FER have become an active topic in recent studies. Both the volume local binary pattern (VLBP) and LBP-TOP are extensions of the LBP descriptor to the time dimension that combine motion and appearance textures [57]. LBP-TOP, as a simplified version of VLBP, has shown promising performance in FER systems [23]. However, the histograms extracted from the XY plane are not as significant as those from the XT and YT planes. The authors of [2] proposed the LGBP-TOP descriptor, in which LBP-TOP is applied to each Gabor magnitude sequence to further enhance feature extraction. One drawback of this method is that the computational cost becomes very high when a facial expression sequence is represented as 40 Gabor magnitude sequences. Using an optical flow approach, Guojiang et al. analyzed the dynamic information of facial expressions and extracted characteristic flow that effectively reflects facial expression changes [17].

2.3 Multi-feature fusion based approaches

Recent studies suggest that fusing multiple features can yield better results than using a single feature in emotion recognition systems. The method reported in [1] showed that a simple combination of static and dynamic approaches can overcome their respective limitations. Moreover, Zhao et al. proposed a framework for facial expression analysis that concatenates dynamic and static information of video sequences [59]. However, it is difficult to generalize features across different persons from spatiotemporal texture information alone. Fan et al. combined PHOG-TOP and dense optical flow with a weighting strategy to extract both the spatial and dynamic motion information of facial expressions [11]. In [12], Feng et al. combined a two-stream CNN with LBP-TOP to capture spatial and temporal streams.

In addition, multiple features can be fused at the feature-level or the decision-level. For instance, Hu et al. [22] employed the Center-Symmetric Local Signal Magnitude Pattern (CS-LSMP) descriptor on multiple features to obtain fused features. Rathee et al. fused Gabor, HOG and DWT features using Multiview Distance Metric Learning (MDML), which exploits complementary image features to extract details while eliminating redundant information [44]. In [7], HOG-TOP, acoustic and geometric features were combined at the feature-level and classified with a multiple kernel SVM. In contrast, in [46], audio and visual information were fused at the decision-level with a decision rule to identify emotions, and in [13], Gao et al. applied improved DS-evidence theory to the extracted SIFT and LBP descriptors for decision-level fusion, improving the robustness of face recognition under complex conditions. In general, a system performs better when multiple complementary features are combined.

3 Proposed model

This section presents the detailed methodology of our proposed framework, including three types of feature descriptors (I-LBP-TOP, GMF, ST-GF) and a classification method for multi-feature fusion at the decision-level. As illustrated in Fig. 2, the proposed FER system includes the following steps: (1) for the preprocessed image sequence, extracting LBP histograms on the XT and YT planes and combining them with the LBP histogram of the peak image in the XY plane to generate the I-LBP-TOP descriptor; (2) employing the Gabor operator to extract GMFs from the preprocessed peak image; (3) applying facial key point detection to the sampled image sequence to obtain the ST-GF; (4) for the above three features, training three SVM base classifiers, whose outputs determine the final classification of an unknown sample via the MV strategy. The detailed algorithms of the framework are described in the following sections.

Fig. 2
figure 2

Overview of our proposed framework for FER

3.1 Texture and spatial-temporal motion LBP descriptor

The LBP is a descriptor for extracting local texture features, which processes the pixels of an image one by one. For each pixel, the neighboring pixels are thresholded against it to generate a binary code, which is usually converted to a decimal number representing the LBP value of the central pixel. This value reflects the texture information of the local neighborhood (see Fig. 3 for an illustration). In this way, each pixel in the image is re-encoded according to the values of its neighboring pixels to obtain the LBP feature.

Fig. 3
figure 3

The Original LBP descriptor

Denoting by gp the pth neighboring pixel of the central pixel gc, by P the total number of all involved neighbors and by R the radius of the neighborhood, the calculation method of the original LBP descriptor is given in (1).

$$ LBP_{P,R} = \sum\limits_{p = 0}^{P - 1}{s_{1}({g_{p}},{g_{c}}){2^{p}}}, $$
(1)

where the function s1 can be formulated as follows:

$$ s_{1}(x,y)=\begin{cases} 1 & \text{if~} x-y\geq 0 \\ 0 & \text{otherwise} \end{cases}. $$
(2)

The LBP maps involve the local information, while their statistical histograms are utilized as feature vectors to take global information into consideration. For an image of size M × N, the histogram after LBP encoding can be defined as:

$$ H(i)=\sum\limits_{m=1}^{M}\sum\limits_{n=1}^{N}s_{2}(LBP_{P,R}(m,n),i),~\forall i\in [1,I], $$
(3)

where the function s2 is defined as:

$$ s_{2}(x,y)=\begin{cases} 1 & \text{if~} x-y=0 \\ 0 & \text{otherwise} \end{cases}, $$
(4)

where i is the index of an LBP pattern and I is the maximal LBP pattern value. Note that the 59-dimensional histogram of uniform patterns is adopted in this paper.
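As a concrete illustration, the following NumPy sketch computes the basic LBP code map of Eq. (1) with P = 8, R = 1 and its histogram of Eq. (3). The neighbor ordering and the use of all 256 bins (instead of the 59-bin uniform-pattern mapping adopted in the paper) are simplifying assumptions, not the exact implementation used in our experiments.

```python
import numpy as np

def lbp_8_1(img):
    """Basic LBP with P=8, R=1 (Eq. 1): threshold the 8 neighbors against
    the central pixel and weight the resulting bits by powers of two."""
    img = np.asarray(img, dtype=np.float64)
    c = img[1:-1, 1:-1]                               # central pixels g_c
    # neighbor offsets (row, col) for p = 0..7
    offs = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
            (0, -1), (1, -1), (1, 0), (1, 1)]
    code = np.zeros_like(c, dtype=np.int32)
    for p, (dr, dc) in enumerate(offs):
        gp = img[1 + dr:img.shape[0] - 1 + dr, 1 + dc:img.shape[1] - 1 + dc]
        code += ((gp - c) >= 0).astype(np.int32) << p  # s1(g_p, g_c) * 2^p
    return code

def lbp_histogram(code, n_bins=256):
    """Histogram of LBP codes (Eq. 3); with the uniform-pattern mapping this
    would be reduced to 59 bins as in the paper."""
    hist, _ = np.histogram(code, bins=np.arange(n_bins + 1))
    return hist / max(hist.sum(), 1)                  # normalized histogram
```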

The co-occurrences on the XT and YT planes of LBP-TOP take both spatial and temporal domain information into account. A single frame is a two-dimensional plane, while a video sequence is a three-dimensional volume. Accordingly, a video sequence can be regarded as a stack of XY planes along the temporal dimension T, and likewise as stacks of XT and YT planes along the Y and X axes, respectively. Figure 4 shows a 33-frame image sequence and example images from the three planes. Figure 4a shows the frontal face image of frame 8 in the XY plane, while Fig. 4b and c show the top and side views of the volume, both of which give a visual impression of motion over time. The LBP histograms obtained from the image sequence on the XY, XT and YT planes are denoted by \({H_{XY_{seq}}}\), \({H_{XT_{seq}}}\) and \({H_{YT_{seq}}}\), respectively. A single histogram is then derived by concatenating the statistical histograms of the separate planes. Figure 5 demonstrates the procedure of LBP-TOP feature extraction for one block. The final histogram can be formulated as follows:

$$ \begin{array}{@{}rcl@{}} &&H=(H^{1},H^{2},H^{3}), \\ &&H^{j}(i)=\sum\limits_{x,y,t}s_{2}(LBP_{P,R}(x,y,t),i),~\forall i\in [1,I], \end{array} $$
(5)

where \(LBP_{P,R}(x,y,t)\) represents the LBP code of the central pixel computed in the jth plane, with j = 1,2,3 denoting the XY, XT and YT planes, respectively. In the block-based approach, the histograms of all block volumes are concatenated to obtain the final feature vector.
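Building on the sketch above, the snippet below assembles the LBP-TOP histogram of Eq. (5) for a single block volume. Averaging the normalized histograms of the parallel slices of each plane is a simplification of the per-volume accumulation in Eq. (5), and the block partition is omitted.

```python
def lbp_top(volume):
    """LBP-TOP sketch for one block volume (Eq. 5), assuming P=8, R=1.
    `volume` is a (T, H, W) image sequence; histograms from the XY, XT and
    YT planes are concatenated into a single feature vector."""
    T, H, W = volume.shape
    h_xy = np.mean([lbp_histogram(lbp_8_1(volume[t])) for t in range(T)], axis=0)
    h_xt = np.mean([lbp_histogram(lbp_8_1(volume[:, y, :])) for y in range(H)], axis=0)
    h_yt = np.mean([lbp_histogram(lbp_8_1(volume[:, :, x].T)) for x in range(W)], axis=0)
    return np.concatenate([h_xy, h_xt, h_yt])
```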

Fig. 4
figure 4

The example images in XY, XT and YT planes. (a) Image in the XY plane (128×256) in t = 8. (b) Image in the XT plane (128×33) in y = 128. (c) Image in the YT plane (33×256) in x = 64

Fig. 5
figure 5

LBP-TOP histogram of a block volume. (a) Block volume. (b) LBP histogram from three planes

However, in FER, the three planes of LBP-TOP contribute differently to feature expression, and not all components are of equal importance. Compared with the XT and YT planes, the features extracted from the XY plane contribute less: \({H_{XY_{seq}}}\) is accumulated over the dynamic sequence from the neutral face to the emotional face, and the neutral frames do not contain the corresponding expression information, which weakens the descriptive power of \({H_{XY_{seq}}}\). In contrast, \({H_{XT_{seq}}}\) and \({H_{YT_{seq}}}\) capture more about the movement of facial muscles.

Similar to [59], \({H_{XY_{seq}}}\) is abandoned, and \({H_{XT_{seq}}}\) and \({H_{YT_{seq}}}\) are jointly used to construct the spatial-temporal motion LBP features. However, unlike [59], we utilize the LBP histogram of the peak frame instead of a Gabor multi-orientation fusion histogram to enhance the spatial texture information. On the one hand, compared with the Gabor multi-orientation fusion histogram, the LBP histogram has a lower computational cost and can compensate for the deficiency of the original LBP-TOP descriptor in the XY plane. On the other hand, with uniform-pattern coding, the dimension of the LBP histogram is the same as that of the LBP-TOP histogram of a single plane, both being 59. In this paper, we combine the LBP texture feature of the peak frame (HXY) with the spatiotemporal motion feature of the image sequence, which we call the I-LBP-TOP descriptor. The details of the I-LBP-TOP algorithm are presented in Algorithm 1.

figure a
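The core of Algorithm 1 can be sketched as follows, again for a single un-blocked volume (the 8 × 5 block partition and the 80% overlap discussed in Section 4.2 are omitted here): the XY histogram is taken from the peak frame only, while the XT and YT histograms are still accumulated over the whole sequence. The functions lbp_8_1 and lbp_histogram refer to the earlier sketch.

```python
def i_lbp_top(sequence, peak_index=-1):
    """I-LBP-TOP sketch (Algorithm 1, simplified to one un-blocked volume).
    The last frame is assumed to be the peak frame, as in CK+ and Oulu-CASIA
    sequences that evolve from neutral to apex."""
    seq = np.asarray(sequence, dtype=np.float64)      # shape (T, H, W)
    T, H, W = seq.shape
    h_xy_peak = lbp_histogram(lbp_8_1(seq[peak_index]))   # H_XY from the peak frame
    h_xt = np.mean([lbp_histogram(lbp_8_1(seq[:, y, :])) for y in range(H)], axis=0)
    h_yt = np.mean([lbp_histogram(lbp_8_1(seq[:, :, x].T)) for x in range(W)], axis=0)
    return np.concatenate([h_xy_peak, h_xt, h_yt])
```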

3.2 Gabor magnitude descriptor

Gabor transform can analyze the gray changes of images in multi-resolution and multi-orientation effectively and thus is capable of solving the problem of different expressions with different scales. Meanwhile, it has good properties for information extraction in local spatial and frequency domain. Based on the above advantages, Gabor feature has been successfully applied in the field of FER. For point z = (x,y), the 2-D Gabor filter commonly used is defined as follows:

$$ G_{u,v}(z) = \frac{||k_{u,v}||^{2}}{\sigma^{2}}\exp\left(-\frac{||k_{u,v}||^{2}||z||^{2}}{2\sigma^{2}}\right)\times\left[\exp(ik_{u,v}z)-\exp\left(-\frac{\sigma^{2}}{2}\right)\right] $$
(6)

where \(k_{u,v} = k_{v}{e^{i\phi _{u}}}\), \(k_{v} = k_{\max \limits } /{\lambda ^{v}}\), and \(\phi_{u} = \pi u/8\); \(k_{\max \limits }\) is the maximum frequency, λ is the spacing factor between filters in the frequency domain, u and v correspond to the orientation and scale of the Gabor filter, respectively, and ||⋅|| is the norm operator.

The Gabor representation of an image is the convolution of image I(z) with Gabor filters, where five scales and eight orientations are used:

$$ F_{u,v}(z) = I * G_{u,v}(z), $$
(7)

where u ∈{0,1,⋯ ,7}, v ∈{0,1,⋯ ,4} and ∗ stands for the convolution operator. It should be noted that the Gabor wavelet coefficients are complex, hence the response Fu,v of a Gabor filter is complex. Unlike the Gabor phase, which varies rapidly, the magnitude of the Gabor response is relatively smooth and stable. Consequently, we exploit the magnitude of the response Fu,v as the Gabor feature. In this paper, we generate 40 Gabor magnitude maps for each face image by Gabor wavelet transformation. An example of the Gabor wavelet transformation with five scales and eight orientations is given in Fig. 6.
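The sketch below constructs the Gabor kernels of Eq. (6) and the 40 magnitude maps of Eq. (7) with NumPy/SciPy. The kernel size and the parameter values (σ = 2π, k_max = π/2, λ = √2) are common defaults and should be read as assumptions rather than the exact settings of our implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(u, v, size=31, sigma=2 * np.pi, k_max=np.pi / 2, lam=np.sqrt(2)):
    """2-D Gabor kernel of Eq. (6) for orientation u (0..7) and scale v (0..4)."""
    k = (k_max / lam ** v) * np.exp(1j * np.pi * u / 8)      # k_{u,v}
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    k2 = np.abs(k) ** 2
    z2 = xs ** 2 + ys ** 2                                   # ||z||^2
    dc = np.exp(-sigma ** 2 / 2)                             # DC-compensation term
    return (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2 * sigma ** 2)) * \
           (np.exp(1j * (k.real * xs + k.imag * ys)) - dc)

def gabor_magnitude_maps(image):
    """Convolve the face image with the 40-filter bank (Eq. 7) and keep
    the magnitude of each complex response."""
    maps = []
    for v in range(5):            # 5 scales
        for u in range(8):        # 8 orientations
            resp = fftconvolve(image.astype(np.float64), gabor_kernel(u, v), mode='same')
            maps.append(np.abs(resp))
    return np.stack(maps)         # shape (40, H, W)
```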

Fig. 6
figure 6

The example of Gabor wavelet transformation. (a) Real part of Gabor kernel. (b) 40 Gabor magnitude pictures

3.3 Geometric descriptor

Facial expression not only encompasses objective appearance and shape information, but also involves the identity characteristics of different subjects. The advantage of facial key point features is that they are not influenced by person identity factors such as face shape, gender, age, race or illumination in the input video. Accordingly, we employ facial key points as the geometric feature in this paper. The major task is to locate the key points of the face in an image, including the corners of the eyebrows, eyes, mouth and nose. Fortunately, there are many mature algorithms for facial key point detection [19] and face alignment [41], which are widely used in expression recognition, face tracking and face recognition. We utilize the Supervised Descent Method (SDM) algorithm to locate 49 key points of a face [52]. The coordinates of facial images with emotion are shown in Fig. 7. As depicted in Fig. 7, whether the images come from the same person or from different women or men, the coordinates of the same expression are very similar within each column, while the coordinates of different expressions differ greatly across rows, especially around the mouth, eyes and eyebrows. This demonstrates that facial key points can remove the common underlying facial structure and extract the shape attributes of expressions effectively. Therefore, we have adequate grounds to take the coordinates of these 49 facial key points as the geometric feature. Supposing that (xi,yi) represents the coordinates of the i-th facial landmark, a set of facial landmarks can be expressed as (8):

$$ V_{e} = [x_{1}, y_{1}, x_{2}, y_{2}, \cdots, x_{n}, y_{n}],\quad n = 49, $$
(8)

where Ve is the geometric vector of emotion e. The X and Y coordinates are then standardized separately to zero mean and unit variance, which transforms them into dimensionless values. As a result, a 98-dimensional coordinate vector is constructed to represent the geometric feature. This 98-dimensional vector is fed to the multi-kernel SVM and achieves high recognition accuracy.
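A minimal sketch of the GF construction of Eq. (8), assuming the 49 SDM landmarks are given as an array of (x, y) pairs:

```python
import numpy as np

def geometric_feature(landmarks):
    """GF sketch (Eq. 8): standardize the 49 x- and y-coordinates separately
    to zero mean and unit variance, then interleave them into a 98-D vector."""
    pts = np.asarray(landmarks, dtype=np.float64)   # shape (49, 2)
    std = (pts - pts.mean(axis=0)) / pts.std(axis=0)
    return std.reshape(-1)                          # [x1, y1, x2, y2, ..., x49, y49]
```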

Fig. 7
figure 7

The coordinates of facial expression images

As we all know, a facial expression can be considered as a dynamic process, in which the same facial key point undergoes relative displacement between frames. Therefore, the change from a neutral face to an expressive face in the time dimension can be described by the trajectories of facial key points. To obtain the trajectory data and the relative position information of the facial key points simultaneously, we extend the above GF descriptor to the time dimension. Specifically, the original image sequence is normalized to a fixed length L. Then the GFs of the L frames are concatenated into a one-dimensional vector, represented by V. Consequently, the ST-GF descriptor can be calculated as follows:

$$ V = \left[{V_{e}^{1}},{V_{e}^{2}},\cdots,{V_{e}^{L}}\right] $$
(9)
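A sketch of the ST-GF construction is given below. Since the exact temporal normalization scheme is not detailed above, uniform sampling of L frames is an assumption; geometric_feature refers to the GF sketch earlier in this subsection.

```python
def st_gf(landmark_sequence, L=6):
    """ST-GF sketch (Eq. 9): sample L frames from the landmark sequence and
    concatenate their GFs (L=6 on CK+, L=9 on Oulu-CASIA in our experiments)."""
    n = len(landmark_sequence)
    idx = np.linspace(0, n - 1, L).round().astype(int)   # uniform temporal sampling (an assumption)
    return np.concatenate([geometric_feature(landmark_sequence[i]) for i in idx])
```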

3.4 Classification

Classification is the process of training on features to obtain the optimal mapping between features and labels, so that unknown samples can be predicted correctly. The SVM is considered one of the most effective and robust classifiers for FER due to the following properties: (1) a good balance between model complexity and generalization error and (2) the capability to deal with high-dimensional data. We denote by \(\left \{ {(x_{i},y_{i}),i = 1,...,L} \right \}\), \(x_{i} \in R^{n}\), and yi ∈{− 1,1} a set of training data with labels. A new test sample x is classified by

$$ f(x) = {\text{sign}}\left( {\sum\limits_{i = 1}^{L} {{\alpha_{i}}{y_{i}}K(x,{x_{i}}) + b} } \right), $$
(10)

where αi are the Lagrange multipliers of the dual optimization problem that describe the separating hyperplane, b is a bias, and K(⋅,⋅) is a kernel function. For linearly separable data, SVM finds the hyperplane that maximizes the margin with respect to the support vectors. For nonlinear data, SVM maps the data into a higher-dimensional space by selecting an appropriate kernel function. Among various kernel functions, the most frequently used are the polynomial and radial basis function (RBF) kernels. However, different feature vectors have different dimensions and are of different importance for recognition, so it is difficult to ensure that one set of parameters and one kernel function suit all feature vectors. In this case, multiple kernel learning (MKL) is a good choice, which employs a convex combination of multiple kernels as a substitute for the single kernel:

$$ \begin{array}{@{}rcl@{}} &&K(x,x^{\prime}) = \sum\limits_{m = 1}^{M} {{d_{m}}{K_{m}}(x,x^{\prime})}, \\ &&s.t.\quad {d_{m}} \ge 0,\quad\sum\limits_{m = 1}^{M} {{d_{m}}} = 1, \end{array} $$
(11)

where M is the total number of kernel functions and Km represents basic kernel function.

Since the SVM is inherently a binary classifier, the one-vs-one and one-vs-rest approaches are simple but effective techniques for multi-class problems. In our study, we adopt the one-vs-one method to deal with the six-class problem. In this strategy, an SVM is trained between every pair of classes, so 15 binary SVM classifiers need to be designed.
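The following scikit-learn sketch illustrates the combined kernel of Eq. (11) together with a one-vs-one SVM on a precomputed Gram matrix; our experiments actually use the Shogun toolbox, and the kernel weights and parameters below are illustrative assumptions only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def mkl_gram(Xa, Xb, d_poly=0.5, d_rbf=0.5, gamma=1e-3, degree=2):
    """Convex combination of a polynomial and an RBF kernel (Eq. 11);
    the mixing weights d_m sum to one."""
    return d_poly * polynomial_kernel(Xa, Xb, degree=degree) + \
           d_rbf * rbf_kernel(Xa, Xb, gamma=gamma)

def train_and_predict(X_train, y_train, X_test):
    """One-vs-one multi-class SVM on the precomputed combined kernel."""
    clf = SVC(kernel='precomputed', decision_function_shape='ovo')
    clf.fit(mkl_gram(X_train, X_train), y_train)
    return clf.predict(mkl_gram(X_test, X_train))
```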

3.5 Multi-feature fusion at decision-level

According to the stage at which it is performed, feature fusion is mainly divided into feature-level fusion and decision-level fusion. Feature-level fusion is performed before classification and concatenates multiple features, directly or according to weight ratios, to form high-dimensional feature vectors. In contrast, decision-level fusion is carried out after classification, and the final category of a sample is determined by the MV strategy of ensemble learning. We utilize feature-level fusion for features with small differences, such as the components of the I-LBP-TOP descriptor, while decision-level fusion is adopted for the multiple types of features.

As proved by Bonab et al. [4], the MV strategy of ensemble learning combines multiple classifiers to achieve performance at least equal to the average of its individual components. Based on this result, we train three SVM base classifiers for the three types of features (I-LBP-TOP, GMF, ST-GF), and their predicted results are combined through MV. In the testing phase, for each sample x, the MV is calculated as follows:

$$ H(x) = \mathfrak c_{\kappa},\quad \kappa=\underset{j}{\arg\max}\left\{\sum\limits_{t = 1}^{T}{h_{t}}^{j}(x)\right\}, $$
(12)

where T is the number of base classifiers and \({h_{t}}^{j}(x) \in \{ 0, 1\}\) denotes the class indicator: if ht predicts sample x as class \(\mathfrak c_{j}\), the value is 1, otherwise it is 0. When an unknown sample is classified, the category with the largest number of votes is taken as the final classification result.
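A minimal sketch of the MV rule of Eq. (12), including the random tie-break used in our experiments (Subsection 4.6):

```python
import random
from collections import Counter

def majority_vote(predictions):
    """Decision-level fusion by MV (Eq. 12): `predictions` is the list of
    class labels returned by the T base classifiers for one sample."""
    counts = Counter(predictions)
    top = max(counts.values())
    candidates = [c for c, n in counts.items() if n == top]
    # if all classifiers disagree, one of their predictions is chosen at random
    return candidates[0] if len(candidates) == 1 else random.choice(candidates)

# e.g. majority_vote(['happiness', 'happiness', 'surprise']) -> 'happiness'
```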

4 Experimental results

To evaluate the performance of our proposed model, we perform experiments on the CK+ and Oulu-CASIA facial expression databases. The details of the experiments and results are given below. The experiments are run on Windows with an Intel Core (TM) i5-1035G1 CPU and 16GB of RAM. The feature extraction phase is implemented in MATLAB R2019a, while the classifiers are built with the Shogun toolbox in Python 3.6. To test the predictive performance of our model, we adopt leave-one-out cross validation, in which the expressions of each subject in turn serve as the test set; this protocol is very commonly used in expression recognition. In addition, 10-fold cross validation is used on the CK+ database to compare with existing methods. As shown in Fig. 8, our proposed FER system comprises the following stages: pre-processing, feature extraction, classification and feature fusion.

Fig. 8
figure 8

Flow chart of our proposed FER system

4.1 Facial expression databases and preprocessing

The extended Cohn-Kanade (CK+) database is an effective and widely used database for verifying expression recognition systems [35]. Each expression sequence in this database starts from a neutral expression and gradually changes to the expression peak. It contains 593 video sequences from 123 individuals between the ages of 18 and 30, of which only 327 facial expression sequences are labeled with one of seven universal emotions (anger, contempt, disgust, fear, happiness, sadness, and surprise). Most papers discard the contempt expression because of its small amount of data (only 18 sequences). To facilitate comparative experiments, we select the 309 facial expression sequences with the six basic expressions, excluding contempt.

The Oulu-CASIA database is another widely used video-based database, which contains the six basic expressions from 80 subjects aged 23 to 58 (a mix of male/female and with/without glasses) [56]. Facial expressions are captured by a VIS camera under three different lighting conditions: normal, weak and dark. Similar to the CK+ database, all expression sequences change from neutral to the peak of the emotion. We evaluate our model with the 480 sequences (80 subjects by six expressions) captured under normal illumination. Face sequence samples from the CK+ and Oulu-CASIA databases are illustrated in Fig. 9.

Fig. 9
figure 9

Samples of expression sequence

It should be noted that during expression data collection, the acquired facial images are inevitably tilted due to changes in facial muscles or head deflection. Hence, it is necessary to correct any in-plane rotation so that the eyes lie on the same horizontal line, which optimizes recognition performance. Even after alignment, images from benchmark facial expression databases still contain a lot of background information unrelated to the expression. Accordingly, face detection and normalization are applied to remove irrelevant information and improve the quality of subsequent feature extraction and recognition [20, 21]. Suppose that the distance between the pupils of the two eyes is d; the height and width of the cropped image are then 2.25 × d and 1.2 × d, determined relative to the midpoint between the two pupils, as shown in Fig. 10. These factor values are determined empirically [34]. Finally, on the CK+ database the cropped face is normalized to 256 × 128 for the LBP feature and to 112 × 96 for the Gabor feature to reduce its dimensionality. Since the original images of the Oulu-CASIA database have a low resolution of only 320 × 240, the cropped faces are normalized to 112 × 64 and 64 × 48 for these two features, respectively.
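The cropping step can be sketched as follows, assuming the two pupil coordinates are available from key point detection. The vertical placement of the inter-pupil midpoint inside the crop (about 45% from the top) is an assumption, as only the crop size is fixed above, and the in-plane rotation correction is omitted.

```python
import cv2
import numpy as np

def crop_face(image, left_eye, right_eye, out_size=(128, 256)):
    """Crop a 2.25*d-by-1.2*d face region relative to the inter-pupil midpoint,
    where d is the inter-pupil distance, then resize (e.g. to 256x128 for the
    LBP features on CK+; out_size is (width, height) for cv2.resize)."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    d = np.hypot(rx - lx, ry - ly)
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    h, w = 2.25 * d, 1.2 * d
    top = int(round(cy - 0.45 * h))          # vertical placement: an assumption
    left = int(round(cx - w / 2))
    crop = image[max(top, 0):int(top + h), max(left, 0):int(left + w)]
    return cv2.resize(crop, out_size)
```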

Fig. 10
figure 10

The original image and the image normalized to 256 × 128 after cropping on CK+ database

4.2 The effect of block numbers and overlapping ratio on I-LBP-TOP descriptor

First of all, we perform experiments on the CK+ database to determine the appropriate number of blocks and whether to use overlapping blocks for the I-LBP-TOP descriptor. Previous research shows that too few blocks make the extracted features insufficient and lead to poor accuracy, while a large number of blocks leads to very high feature dimensions, increasing time complexity and decreasing accuracy. Moreover, an overlapping ratio of 80% has been shown to give the best result. Figure 11 shows how the expression recognition accuracy varies with the number of blocks. Let '1' denote an overlapping ratio of 80% and '0' denote non-overlapping blocks. Then '00' indicates that neither the HXY nor the \({H_{(XT+YT)_{seq}}}\) feature is overlapped; '01' means the HXY feature is extracted without overlapping while the \({H_{(XT+YT)_{seq}}}\) feature uses 80% overlap; '11' represents both HXY and \({H_{(XT+YT)_{seq}}}\) with an overlapping ratio of 80%; and '10' indicates that blocking with an 80% overlap is adopted for HXY, whereas the \({H_{(XT+YT)_{seq}}}\) feature is not overlapped.
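For reference, an overlapping block partition can be sketched as below; because the exact stride implied by the 80% overlapping ratio is not specified here, the sliding scheme in this sketch is an assumption.

```python
import numpy as np

def block_partition(image, n_rows=8, n_cols=5, overlap=0.0):
    """Split an image into blocks of size (H/n_rows, W/n_cols); with
    overlap=0.8, neighboring blocks share roughly 80% of their extent
    along each axis (the exact stride used in our experiments may differ)."""
    H, W = image.shape[:2]
    bh, bw = H // n_rows, W // n_cols
    step_r = max(int(round(bh * (1 - overlap))), 1)
    step_c = max(int(round(bw * (1 - overlap))), 1)
    blocks = []
    for r in range(0, H - bh + 1, step_r):
        for c in range(0, W - bw + 1, step_c):
            blocks.append(image[r:r + bh, c:c + bw])
    return blocks
```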

Fig. 11
figure 11

Recognition rate of I-LBP-TOP descriptor according to the block size and whether performing overlapping ratio of 80% on CK+ database, where the black circle means the best result

It can be found that the best result is obtained with 8 × 5 blocks, and a small or a large value of blocks will degrade the recognition performance. Consequently, 8 × 5 blocks are selected for all facial images. Besides, for spatial-temporal LBP histogram, we adopt the overlapping ratio of 80% of original block, whereas for LBP histogram of the peak frame, the non-overlapping blocks are used to obtain more position information.

4.3 Comparison of I-LBP-TOP with original LBP-TOP and other components

Secondly, we separate the LBP-TOP descriptor into its three planes and investigate the performance of the LBP histogram of the peak frame (i.e., HXY), the LBP histograms of the three individual planes (i.e., \({H_{XY_{seq}}}\), \({H_{XT_{seq}}}\), \({H_{YT_{seq}}}\)), pairwise combinations of plane components (i.e., \({H_{XY}} + {H_{XT_{seq}}}\), \({H_{XY}} + {H_{YT_{seq}}}\), \({H_{(XY+XT)_{seq}}}\), \({H_{(XY+YT)_{seq}}}\), \({H_{(XT+YT)_{seq}}}\)), the I-LBP-TOP histogram and the original LBP-TOP histogram on the CK+ database. All LBP descriptors are coded with uniform patterns. Based on the 8 × 5 block division, the obtained feature vector has 59 × 8 × 5 = 2360 dimensions, which is reduced to 308 dimensions using principal component analysis (PCA) [61].
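The dimension reduction step can be reproduced with a few lines of scikit-learn, assuming the blocked histograms are stacked into an (n_samples, 2360) matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_dim(features, n_components=308):
    """Project the 2360-D blocked LBP histograms onto 308 principal components;
    n_components must not exceed the number of samples (309 on CK+)."""
    return PCA(n_components=n_components).fit_transform(np.asarray(features))
```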

We use an ablation study to analyze the contribution of the different components, as shown in Table 1. The highest performance is obtained by combining the LBP histogram of the peak frame with the spatiotemporal LBP histogram, which is 3.56% higher than the original LBP-TOP descriptor. Whereas the LBP histogram of the XY plane computed from the image sequence reaches only 84.14%, the LBP histogram of the peak image achieves 92.88% accuracy. We conclude that the features extracted from the XY plane of the image sequence are diluted by a series of weak expressions (i.e., neutral and transitional expressions) and therefore fail to capture representative texture features. In addition, the LBP descriptor on the XT and YT planes can effectively extract spatiotemporal information. The YT plane performs better than the other two planes, indicating that shape information in the vertical direction plays a more important role than that in the horizontal direction. The experimental results validate that our proposed I-LBP-TOP descriptor extracts more effective spatiotemporal texture features.

Table 1 Ablation study of LBP histogram from either XY, XT, YT plane or their combination on CK+ database

To test how well the LBP histogram components adapt to different classifiers, experiments are performed with K-Nearest Neighbor (KNN), Random Forest (RF), Artificial Neural Network (ANN) and SVM classifiers, as shown in Fig. 12. It can be seen from Fig. 12 that the recognition rates of the LBP histogram components follow essentially the same trend regardless of the classifier used. For instance, HXY performs much better than \({H_{XY_{seq}}}\), and the recognition rate of the I-LBP-TOP descriptor is always higher than that of the original LBP-TOP descriptor with every classifier. Furthermore, we find that SVM is the most suitable classifier for our LBP histogram features.

Fig. 12
figure 12

The recognition rate of LBP histogram from either XY, XT, YT plane or their combination on KNN, RF, ANN and SVM classifiers on CK+ database

In addition, we compare the computational speed of the original LBP-TOP and I-LBP-TOP descriptors on the CK+ database. The computation time varies with the length of the expression sequence and the number of blocks. Under the same conditions of sequence length (19) and block size (8 × 5), the computation time of the original LBP-TOP is 3.96 s, while that of I-LBP-TOP is 3.27 s. When the sequence length is set to 40, the computation times of the original LBP-TOP and I-LBP-TOP are 8.38 s and 7.21 s, respectively. As the sequence length and block number increase, the runtime advantage of the I-LBP-TOP descriptor becomes more pronounced.

4.4 Evaluation of the proposed geometric descriptors

In this subsection, we analyze the validity and superiority of our proposed geometric descriptors. As mentioned in Subsection 3.3, the dimensions of the GF and ST-GF descriptors are 98 and 98 × L, respectively. In the experiments, L is set to 6 on the CK+ database and 9 on the Oulu-CASIA database, which are the minimum sequence lengths. Table 2 shows the comparison results of the GF and ST-GF descriptors on the two databases (CK+ and Oulu-CASIA). Compared with the GF, the overall recognition rate of ST-GF is increased, especially for anger and sadness: ST-GF enhances GF and better captures the subtle changes of these two expressions.

Table 2 Comparison results of GF and ST-GF descriptors

Furthermore, our proposed ST-GF is compared with other geometric feature extraction algorithms on the CK+ database, as illustrated in Table 3. The method of [36] extracted geometric features from facial key regions: for the eyebrows and lips, the coordinate differences of key points between frames are used as displacement information, while for the eye regions, the ratio of horizontal to vertical projected distance is used. In [33], Euclidean distances between facial landmarks are fed into a graph-based network to obtain the DAUGN-G expression recognition model. And [50] employed Riemannian sparse coding and dictionary learning to encode the shape trajectories of 2D facial landmarks. The comparison shows that our proposed ST-GF achieves superior performance and outperforms these recent geometric features.

Table 3 Comparison results of different geometric features on CK+ database

4.5 Feature visualization

To further demonstrate the separability of the proposed features, we utilize the T-Distributed Stochastic Neighbor Embedding (t-SNE) [28] algorithm to visualize the feature vectors extracted by the I-LBP-TOP, Gabor and spatiotemporal geometric descriptors, respectively. The t-SNE algorithm is a useful visualization technique that projects high-dimensional data into a two-dimensional space. Figure 13 shows the visualization results on the CK+ database.

Fig. 13
figure 13

The visualization results on CK+ database

Figure 13(a) shows the random distribution of the original input data, with the various types of samples mixed together. Six clusters are formed in total, one for each facial expression. It may be observed from Fig. 13(b) to (d) that our feature extraction methods can distinguish the six kinds of expressions to a certain extent. In particular, the expressions of surprise, happiness and disgust are well clustered into their own categories.

4.6 The influence of feature fusion on expression recognition rate

Subsequently, we explore the effectiveness of combining the I-LBP-TOP feature, GMF and ST-GF at the decision-level. Each of the three features is classified by an SVM with its appropriate kernel function, and the prediction results are saved separately. For each sample, the final category is determined by majority voting over the three classifiers; if all three classifiers disagree, one of their predictions is selected at random as the final result. Tables 4, 5, 6 and 7 show the classification accuracy obtained with the I-LBP-TOP feature, GMF, ST-GF and the hybrid feature on the CK+ database, respectively. The horizontal axis represents the predicted class among the six emotions, and the vertical axis represents the target class, i.e., the correct label.

Table 4 Confusion matrix of I-LBP-TOP feature on CK+ database
Table 5 Confusion matrix of GMF on CK+ database
Table 6 Confusion matrix of ST-GF on CK+ database
Table 7 Confusion matrix of hybrid feature at decision-level on CK+ database

From the results in the tables, we observe that fear and sadness are the most confused expressions. On the one hand, because of the relatively small number of samples of these two expressions, it is difficult to extract a group of features that reveal their intrinsic patterns. On the other hand, the slight dynamic variations of the critical facial areas for both fear and sadness make them harder to distinguish clearly. Moreover, anger and sadness tend to be confused with each other, which may be due to their similar mouth motion. As shown in Fig. 14, it is difficult to distinguish some expressions accurately even for humans. However, happiness and surprise are easily recognized with an accuracy of nearly 100%, which is attributed to their relatively large muscle deformations and drastic changes in appearance.

Fig. 14
figure 14

Failed cases in the CK+ database. (a) Anger mislabeled as Disgust. (b) Anger mislabeled as Sad. (c) Sad mislabeled as Anger

According to the results illustrated in Fig. 15, we note that for each type of expression there is always one descriptor that performs better than the others. For instance, the I-LBP-TOP descriptor is better at identifying anger; GMFs have greater advantages in recognizing disgust and sadness; and ST-GF achieves the best recognition rate for fear. When merging at the decision-level, the recognition rate of each expression can be improved because the three descriptors excel at different expressions.

Fig. 15
figure 15

Comparison of recognition rate between single feature and hybrid feature

Table 8 presents the comparison results of different methods on the CK+ database. Fan et al. [11], Zhao et al. [59], Chen et al. [7], Bougourzi et al. [5], and Shanthi et al. [48] all used fusions of multiple hand-crafted features. Even compared with the recent convolutional neural network frameworks proposed by Yang et al. [53] and Kim et al. [27], our traditional hand-crafted features perform better under the same 10-fold cross validation. The experimental results illustrate that the features fused at the decision-level further improve on the individual features, achieving a final recognition rate of 99.35% on the CK+ database. Our approach outperforms other well-known algorithms, which demonstrates its effectiveness in processing dynamic expression sequences.

Table 8 Comparison results of our proposed method and other seven methods on CK+ database

In addition, to further demonstrate the soundness of our proposed method, the Oulu-CASIA database is also used for quantitative comparison with several methods from other papers. The comparison results are shown in Table 9. The performance of our method on the Oulu-CASIA database is inferior to that on CK+. In particular, the Gabor feature of the peak frame performs poorly, mainly because the expression changes of some peak frames are relatively small and the resolution of the original images is low. However, the proposed method still outperforms the other four methods based on hand-crafted features. According to Tables 8 and 9, although the best result is achieved by a machine learning based (ML-based) method, it comes at the expense of a large amount of computation and high hardware cost.

Table 9 Comparison results of our proposed method and other four methods on Oulu-CASIA database

4.7 Evaluation of computational time

In this experiment, we examine the computational cost of our proposed method and LBP-TOP on the CK+ and Oulu-CASIA databases. Table 10 shows the average computation time of feature extraction (MATLAB) and classification (Python). Although feature extraction on the CK+ database takes longer than on the Oulu-CASIA database because of its higher resolution, classification is faster. The feature extraction times of I-LBP-TOP, LBP-TOP and ST-GF are measured per image sequence, so they are more time-consuming than single-frame features. Compared to LBP-TOP, our proposed I-LBP-TOP descriptor shortens the runtime effectively. Moreover, the proposed geometric descriptor takes the least time in both feature extraction and classification, showing its potential for real-time implementation. Notably, GMF is computed using the construct_Gabor_filters_PhD.m and filter_image_with_Gabor_bank_PhD.m functions of the PhD toolbox; even though each image is expanded into 40 Gabor magnitude images, the feature extraction is still very fast. Overall, although the computational cost of the fused method is close to the sum of the individual features, its remarkable performance for FER cannot be ignored.

Table 10 Computational cost of our proposed method and LBP-TOP on CK+ and Oulu-CASIA databases

4.8 A selection of SVM kernel function

Finally, we evaluate the appropriate SVM kernel function for each individual descriptor, as shown in Table 11. Since our data is linearly inseparable, we only verify the effects of the polynomial kernel, the RBF kernel, and MKL on classification accuracy, where MKL denotes multiple kernel learning and is a linear combination of the polynomial and RBF kernels.

Table 11 Comparing the influence of SVM Kernel function on individual descriptor

Obviously, the choice of SVM kernel function plays a critical role in classification performance. We find that the RBF kernel outperforms the polynomial kernel when the feature dimension is small, whereas the polynomial kernel is more effective when the feature dimension is large. The RBF kernel works better for ST-GF, whose dimension (reduced from 98 × 6 to 282) is smaller than the number of samples (309), while for the I-LBP-TOP and Gabor features, whose original dimensions are much larger than 309, the polynomial kernel performs better. In addition, MKL is not always optimal: only when each single kernel already performs reasonably well can their combination bring further improvement. For instance, the polynomial and RBF kernels achieve accuracies of 95.47% and 94.17% on ST-GF, respectively, and accordingly MKL obtains the best result with an accuracy of 95.79%.

5 Conclusion

In this paper, we present an efficient facial expression recognition framework combining I-LBP-TOP, GMF and ST-GF at the decision-level for more accurate and competitive classification. The I-LBP-TOP descriptor extracts not only dynamic texture features, but also static texture features to characterize facial appearance changes. The adopted GMF effectively captures orientation and scale information. In addition, the proposed method that directly uses facial key points as the geometric feature achieves simple computation and high recognition accuracy. In the decision-level fusion strategy, each descriptor has the same weight and their complementary advantages are fully exploited to boost recognition performance.

Experiments performed on the CK+ and Oulu-CASIA facial expression databases confirm the superiority of the proposed approach over other existing methods. Our hybrid feature achieves an average recognition rate of 99.35% on the CK+ database and 80.63% on the Oulu-CASIA database for the six basic expressions. Nonetheless, there is still room to improve the accuracy of our algorithm in identifying fear and sadness. In the future, we will consider using data augmentation to increase the number of fear and sadness samples and alleviate the class imbalance, and will develop more powerful structures to model the movement of critical facial areas in videos. Furthermore, robust features need to be designed to resist head pose variation, occlusion and illumination effects in real-time environments.