1 Introduction

Research on facial expressions was started by psychologists. The Facial Action Coding System (FACS) [7] and the Emotional Facial Action Coding System (EMFACS) [8] were proposed by Ekman and Friesen. In their studies, facial expressions are described by action units (AUs) associated with six basic emotions. In the computer vision community, quite a few quantitative studies have been devoted to analyzing expressions in images and videos [45]. One of the most influential methods is the Active Appearance Model (AAM) [4], in which a statistical model is defined over 68 facial landmarks. Facial landmarks are easy to understand and manipulate, and action units and emotional labels can be inferred from them by rules [23]. Besides landmark-based methods, using the global or local appearance of the face is another way to recognize expressions [14, 20, 21, 30, 40, 50].

In facial expression recognition, performance can be improved by exploiting structure in 3D space. Recently, a 3D face tracker has been used for pose-independent and texture-independent facial expression recognition [6]. 3D Gabor filters have been used to extract features from 3D scanning data that remain invariant to head pose, clutter and lighting conditions [42]. Local Binary Pattern histograms from Three Orthogonal Planes (LBP-TOP) have been used to extract spatial–temporal features for recognizing facial expressions in movie clips [5]. The extra dimension plays an important role in these studies.

This paper is extended from our previous work [32]. The main contributions of this paper include:

  • Convolutional layers, max-pooling layers, dropout layers, Gabor layers and optical flow layers are defined for 3D data. General rules for designing 3D convolutional neural networks are discussed.

  • Four networks are proposed for facial expression recognition. After they have been trained separately, the decisions of the four networks are fused together. Experimental results show that these networks work well. The single networks and the ensemble network are evaluated on the Extended Cohn–Kanade dataset, achieving accuracies of 92.31% and 96.15%, respectively, which outperforms the state of the art [14, 20, 21, 40, 50]. The ensemble network obtains an accuracy of 61.11% on the FEEDTUM dataset.

The remainder of this paper is organized as follows. In Sect. 2, related work on deep learning and expression recognition is surveyed. Section 3 presents the framework of 3D CNNs. In Sect. 4, a new initialization method for neural networks is proposed. In Sect. 5, four networks are proposed to solve the facial expression recognition problem; experiments are carried out to evaluate the proposed method on posed and spontaneous facial expression datasets, and the results are analyzed and compared with previous work. Finally, conclusions are presented in Sect. 6.

2 Related work

2.1 Convolutional neural networks

Deep learning-based methods have been applied to many video analysis tasks including human gender recognition [46], English-to-Chinese translation [47], moving object recognition [11], static hand gesture recognition [24], etc. In particular, many studies employ convolutional neural networks (CNNs) to achieve superior performance. In a standard CNN [19], the element at position (x, y) in the c-th feature map of the l-th convolutional layer, denoted as \(v_{l,c}^{x,y}\), is given by

$$\begin{aligned} v_{l,c}^{x,y} = \tanh \left( \sum \limits _{C=1}^{C_{l-1}} \sum \limits _{p=1}^{P_l} \sum \limits _{q=1}^{Q_l} W_{l,c,C}^{p,q} v_{(l-1),C}^{(x+p-1),(y+q-1)} + b_{l,c} \right) , \end{aligned}$$
(1)

where \(\tanh\) is the hyperbolic tangent activation function and \(b_{l,c}\) is the bias for the c-th feature map. \(C_{l-1}\) is the number of feature maps in the \((l-1)\)-th layer. C indexes over the set of feature maps in the \((l-1)\)-th layer connected to the current feature map. \(W_{l,c,C}^{p,q}\) is the value at position (p, q) of the kernel connected to the C-th feature map. \(P_l\) and \(Q_l\) are the height and width of the kernel, respectively.

2.2 3D convolutional neural networks

3D convolutional neural networks were proposed by Ji et al. [15] for solving the human action recognition problem. They provide a general method for processing spatial–temporal data (e.g. moving object recognition [11]). Considering the 3rd dimension as the temporal dimension of a video sequence, we can capture motion information by 3D convolutions. Formally, the element at position (x, y, z) in the c-th feature map of the l-th convolutional layer, denoted as \(v_{l,c}^{x,y,z}\), is given by

$$\begin{aligned} v_{l,c}^{x,y,z} = \tanh \left( \sum \limits _{C=1}^{C_{l-1}} \sum \limits _{p=1}^{P_l} \sum \limits _{q=1}^{Q_l} \sum \limits _{r=1}^{R_l} W_{l,c,C}^{p,q,r} v_{(l-1),C}^{(x+p-1),(y+q-1),(z+r-1)} + b_{l,c}\right) , \end{aligned}$$
(2)

where \(W_{l,c,C}^{p,q,r}\) is the value at position (p, q, r) of the kernel connected to the C-th feature map in the previous layer. \(P_l\), \(Q_l\) and \(R_l\) are the height, width and depth of the kernel, respectively.
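As a concrete illustration, the following minimal NumPy sketch evaluates Eq. (2) directly with nested loops. It assumes a channel-first layout, "valid" boundaries (no padding) and a \(\tanh\) activation; it is written for clarity rather than efficiency and is not the implementation used in this paper.

```python
import numpy as np

def conv3d_layer(v_prev, W, b):
    """Sketch of Eq. (2).
    v_prev: previous-layer feature maps, shape (C_prev, X, Y, Z).
    W: kernels, shape (C_out, C_prev, P, Q, R).  b: biases, shape (C_out,)."""
    C_out, C_prev, P, Q, R = W.shape
    _, X, Y, Z = v_prev.shape
    out = np.zeros((C_out, X - P + 1, Y - Q + 1, Z - R + 1))
    for c in range(C_out):
        for x in range(X - P + 1):
            for y in range(Y - Q + 1):
                for z in range(Z - R + 1):
                    patch = v_prev[:, x:x + P, y:y + Q, z:z + R]
                    # sum over C, p, q, r, then add the bias and apply tanh
                    out[c, x, y, z] = np.tanh(np.sum(W[c] * patch) + b[c])
    return out
```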

In this paper, the fully connected layer and the 1D/2D/3D convolutional layers are reformulated in a general form, which reveals the close relationship among them.

2.3 CNNs for facial expression recognition

Facial expression is an interesting research subject in the computer vision community, with related tasks such as classification, detection, manipulation and transfer. In this paper, our attention is focused on classifying faces into six basic emotional categories using 3D convolutional neural networks.

As we know, CNNs were proposed in the 1990s. Interesting work on CNN-based expression recognition had already been done in the early days by Matsugu et al. [23]. After Krizhevsky et al. [17] won the ImageNet Large-Scale Visual Recognition Challenge 2012 (ILSVRC-2012) [27] with a novel network called AlexNet, deep CNNs attracted more attention than before. Sun et al. [30] designed an expression classifier based on the modern CNN proposed by Krizhevsky et al. Byeon et al. [3] designed an expression classifier based on the 3D CNN proposed by Ji et al. [15].

Both the work of Byeon et al. and this work use 3D CNNs to solve the facial expression recognition task. The differences between them should be clarified:

  • This paper aims at proposing general definitions (see Sect. 3.1) and designing rules (see Sect. 3.2) for 3D CNNs rather than solving a specific task.

  • In this work, an ensemble of four networks is proposed for facial expression recognition (see Sect. 5.1). The preprocessing steps, layer configurations and loss functions of the proposed networks differ from those of Byeon et al. Our networks achieve superior performance (see Sects. 5.3.4, 5.4.4).

  • We use a Gabor layer and an optical flow layer in our 3D CNNs to extract texture and motion features in spatial–temporal space. The performance can be greatly improved by fusing decisions supported by both the texture feature and the motion feature (see Sects. 5.3.4, 5.4.4).

3 Unified framework for 3D CNNs

In this section, a unified framework for 3D CNNs is proposed. In this framework, the definitions of convolutional layers, max-pooling layers and dropout layers are extended to 3D data. Low-level image feature extractors such as Gabor filters and optical flow calculators are defined as CNN layers. Based on these definitions, general rules for designing 3D convolutional neural networks are discussed.

3.1 Layers for 3D data

3.1.1 Convolutional layers for 3D data

Let us start from the fully connected layer. Denote the l-th fully connected layer’s activation as \({\mathbf{v }}_l \in {\mathbb {R}}^{n}\) and the previous layer’s activation as \({\mathbf{v }}_{l-1} \in {\mathbb {R}}^{m}\). A fully connected layer can be defined as follows:

$$\begin{aligned} {\mathbf{v }}_l = \tanh ( g({\mathbf{v }}_{l-1}) ), \end{aligned}$$
(3)

where

$$\begin{aligned} g({\mathbf{x }}) = {\mathbf{W }} {\mathbf{x }}+{\mathbf{b }}. \end{aligned}$$
(4)

The function g plays the role of the fully connected operator. It converts an m-dimensional vector to an n-dimensional vector by the linear transformation \({\mathbf{W }} \in {\mathbb {R}}^{n \times m}\), and then the bias \({\mathbf{b }} \in {\mathbb {R}}^{n}\) is added to the transformed vector. After that, the vector is passed through an element-wise nonlinear activation function \(\tanh\) (or sigmoid, \(\max (\cdot ,0)\), etc.).

The definition of the fully connected layer can be generalized to the convolutional layer. Firstly, denote the elements of \(g({\mathbf{x }})\), \({\mathbf{W }}\), \({\mathbf{x }}\) and \({\mathbf{b }}\) as \(g_i\), \(W_{i,j}\), \(x_j\) and \(b_i\) \((i \in \{1, 2, \ldots , n\}, j \in \{1, 2, \ldots , m\})\), respectively. The vectorized fully connected operator g can be written as

$$\begin{aligned} g_i = \left( \sum \limits _{j=1}^{m} W_{i,j} \cdot x_j \right) + b_i, \quad i \in \{1, 2, \ldots , n\}. \end{aligned}$$
(5)

Secondly, Eq. (5) is modified by replacing the scalar multiplication operator “\(\cdot\)” with the vector/matrix/tensor convolution operator “\(\bullet\)”. The scalar elements \(g_i\), \(W_{i,j}\), \(x_j\) and \(b_i\) are also replaced by vectors/matrices/tensors. Then we get

$$\begin{aligned} {\mathbf{g }}_i = \left( \sum \limits _{j=1}^{m} {\mathbf{W} }_{i,j} \bullet {\mathbf{x }}_j\right) + {\mathbf{b }}_i, \quad i \in \{1, 2, \ldots , n\}, \end{aligned}$$
(6)

where \({\mathbf{g }}_i \in {\mathbb {R}}^{{\text {dim}}({\mathbf{g }}_i)_1 \times {\text {dim}}({\mathbf{g }}_i)_2 \times \cdots \times {\text {dim}}({\mathbf{g }}_i)_D}\), \({\mathbf{W }}_{i,j} \in {\mathbb {R}}^{{\text {dim}}({\mathbf{W }}_{i,j})_1 \times {\text {dim}}({\mathbf{W }}_{i,j})_2 \times \cdots \times {\text {dim}}({\mathbf{W }}_{i,j})_D}\), \({\mathbf{x }}_j \in {\mathbb {R}}^{{\text {dim}}({\mathbf{x }}_j)_1 \times {\text {dim}}({\mathbf{x }}_j)_2 \times \cdots \times {\text {dim}}({\mathbf{x }}_j)_D}\), \({\mathbf{b }}_i \in {\mathbb {R}}^{{\text {dim}}({\mathbf{b }}_i)_1 \times {\text {dim}}({\mathbf{b }}_i)_2 \times \cdots \times {\text {dim}}({\mathbf{b }}_i)_D}\), \(i \in \{1, 2, \ldots , n\}\), \(j \in \{1, 2, \ldots , m\}\). \({\mathbf{b }}_i\) is a constrained vector/matrix/tensor in which all the elements are equal.

Fully connected layer and 1D/2D/3D convolutional layers can be defined by this general definition in the following specific cases:

  • When \(D=1\) and \({\text {dim}}({\mathbf{g }}_i)_1={\text {dim}}({\mathbf{x }}_j)_1={\text {dim}}({\mathbf{W }}_{i,j})_1=1\), Eq. (6) is specialized to Eqs. (4)/(5). It describes the fully connected layer (see Fig. 1a).

  • When \(D=1\) and \({\text {dim}}({\mathbf{g }}_i)_1, {\text {dim}}({\mathbf{x }}_j)_1, {\text {dim}}({\mathbf{W }}_{i,j})_1 \geqslant 2\), it describes the 1D convolutional layer in Time Delay Neural Networks (TDNNs) [35] (see Fig. 1b).

  • When \(D=2\), it describes the commonly used 2D convolutional layer (see Fig. 1c). \({\mathbf{x }}\) contains m channels of \([{\mathbf{x }}_1, {\mathbf{x }}_2, \ldots , {\mathbf{x }}_m]^{\mathrm {T}}\). Each channel is a spatial image of \({\text {dim}}({\mathbf{x }}_j)_1 \times {\text {dim}}({\mathbf{x }}_j)_2\). \({\mathbf{g }}\) contains n channels of \([{\mathbf{g }}_1, {\mathbf{g }}_2, \ldots , {\mathbf{g }}_n]^{\mathrm {T}}\). Each channel is a spatial image of \({\text {dim}}({\mathbf{g }}_i)_1 \times {\text {dim}}({\mathbf{g }}_i)_2\). \({\mathbf{W }}\) contains \(n \times m\) convolutional kernels, whose sizes are \({\text {dim}}({\mathbf{W }}_{i,j})_1 \times {\text {dim}}({\mathbf{W }}_{i,j})_2\). From this point of view, the convolutional layer can be considered as a fully connected layer with convolutional connections.

  • When \(D \geqslant 3\), it describes the D-dimensional convolutional layer which processes D-th order tensors (see Fig. 1d).
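The general definition of Eq. (6) can be sketched in a few lines of Python on top of SciPy's N-dimensional correlation (the cross-correlation form implied by the indexing in Eqs. (1) and (2)). The channel lists, kernel shapes and the `general_conv` helper below are illustrative assumptions, not part of the proposed framework's implementation.

```python
import numpy as np
from scipy.signal import correlate

def general_conv(x, W, b):
    """Sketch of Eq. (6) for any dimensionality D (before element-wise activation).
    x: list of m input channels, each a D-dimensional ndarray.
    W: nested list where W[i][j] is the kernel linking input channel j to output i.
    b: list of n scalar biases (broadcast over each output channel)."""
    n, m = len(W), len(x)
    out = []
    for i in range(n):
        acc = sum(correlate(x[j], W[i][j], mode='valid') for j in range(m))
        out.append(acc + b[i])
    return out

# D = 1 with unit-length channels recovers the fully connected layer of Eqs. (4)/(5);
# D = 2 with image-shaped channels gives the standard 2D convolutional layer;
# D = 3 with tensor-shaped channels gives the 3D convolutional layer of Eq. (2).
x2d = [np.random.randn(8, 8) for _ in range(3)]                       # m = 3 inputs
W2d = [[np.random.randn(3, 3) for _ in range(3)] for _ in range(4)]   # n = 4 outputs
g2d = general_conv(x2d, W2d, np.zeros(4))                             # each g_i is 6 x 6
```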

Fig. 1
figure 1

Examples of convolutional layers without element-wise activation. a Fully connected layer, b 1D convolutional layer, c 2D convolutional layer, d 3D convolutional layer

3.1.2 Max-pooling layers for 3D data

Spatial pooling is an important procedure in CNNs. The 2D max-pooling layer partitions a spatial image into a set of small non-overlapping \(2 \times 2\) regions. For each region, the maximum is output. Denote the c-th channel of the l-th layer as \({\mathbf{v }}_{l,c} \in {\mathbb {R}}^{{\text {dim}}({\mathbf{v }}_{l,c})_1 \times {\text {dim}}({\mathbf{v }}_{l,c})_2}\) and the c-th channel of the previous layer as \({\mathbf{v }}_{l-1,c} \in {\mathbb {R}}^{2 {\text {dim}}({\mathbf{v }}_{l,c})_1 \times 2 {\text {dim}}({\mathbf{v }}_{l,c})_2}\). The element at position \((x_1,x_2)\) of the l-th 2D max-pooling layer, denoted as \(v_{l,c}^{x_1,x_2}\), is given by

$$\begin{aligned} v_{l,c}^{x_1,x_2} = \max \left( R_{l,c}^{x_1,x_2} \right) , \end{aligned}$$
(7)

where

$$\begin{aligned} R_{l,c}^{x_1,x_2} = \left\{ v_{l-1,c}^{x'_1,x'_2} | x'_1 \in \{2x_1-1, 2x_1\}, x'_2 \in \{2x_2-1, 2x_2\} \right\} . \end{aligned}$$
(8)

The set \(R_{l,c}^{x_1,x_2}\) contains all elements in a \(2 \times 2\) sub-matrix. The max/avg function outputs the maximum/mean of this set. The 2D max-pooling layer converts \({\mathbf{v }}_{l-1}\) to \({\mathbf{v }}_{l}\) channel by channel.

Similarly, the max-pooling layer can be extended for 3D data. Denote the c-th channel of the l-th layer as \({\mathbf{v }}_{l,c} \in {\mathbb {R}}^{{\text {dim}}({\mathbf{v }}_{l,c})_1 \times {\text {dim}}({\mathbf{v }}_{l,c})_2 \times \cdots \times {\text {dim}}({\mathbf{v }}_{l,c})_D}\) and the c-th channel of the previous layer as \({\mathbf{v }}_{l-1,c} \in {\mathbb {R}}^{2 {\text {dim}}({\mathbf{v }}_{l,c})_1 \times 2 {\text {dim}}({\mathbf{v }}_{l,c})_2 \times \cdots \times 2 {\text {dim}}({\mathbf{v }}_{l,c})_D}\). The element at position \((x_1,x_2,\ldots ,x_D)\) of the l-th D-dimensional max-pooling layer, denoted as \(v_{l,c}^{x_1,x_2,\ldots ,x_D}\), is given by

$$\begin{aligned} v_{l,c}^{x_1,x_2,\ldots ,x_D} = \max \left( R_{l,c}^{x_1,x_2,\ldots ,x_D}\right) , \end{aligned}$$
(9)

where

$$\begin{aligned} R_{l,c}^{x_1,x_2,\ldots ,x_D} = \left\{ v_{l-1,c}^{x'_1,x'_2,\ldots ,x'_D} | x'_i \in \{2x_i-1, 2x_i\} , i \in \{1,2,\ldots ,D\} \right\} . \end{aligned}$$
(10)

The set \(R_{l,c}^{x_1,x_2,\ldots ,x_D}\) contains all elements in a \(2 \times 2 \times \cdots \times 2\) sub-tensor. The max/avg function outputs the maximum/mean of this set. When \(D=1\)/\(D=2\)/\(D=3\), Eqs. (9), (10) describe 1D/2D/3D max-pooling layer in TDNNs/2D CNNs/3D CNNs. Examples are illustrated in Fig. 2.
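A minimal NumPy sketch of the D-dimensional max-pooling of Eqs. (9), (10) for one channel is given below. It assumes even side lengths and uses a reshape trick instead of explicit loops; the helper name is hypothetical.

```python
import numpy as np

def maxpool_nd(v):
    """Sketch of Eqs. (9)-(10): non-overlapping 2 x ... x 2 max pooling of one channel.
    v is a D-dimensional ndarray whose side lengths are all even."""
    D = v.ndim
    # Split every axis into (output_size, 2), then take the max over the window axes.
    new_shape = []
    for s in v.shape:
        new_shape += [s // 2, 2]
    blocks = v.reshape(new_shape)
    return blocks.max(axis=tuple(range(1, 2 * D, 2)))

# Example: a 4 x 4 x 4 tensor is pooled to 2 x 2 x 2.
t = np.arange(64, dtype=float).reshape(4, 4, 4)
print(maxpool_nd(t).shape)   # (2, 2, 2)
```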

Fig. 2
figure 2

Examples of max-pooling layers. a 1D max-pooling layer, b 2D max-pooling layer, c 3D max-pooling layer

3.1.3 Dropout layers for 3D data

Dropout is a popular method to prevent over-fitting [12]. The dropout layer is usually applied after fully connected layers (see Fig. 3a). Denote the c-th activation of the l-th dropout layer as \(v_{l,c} \in {\mathbb {R}}\) and the c-th activation of the previous layer as \(v_{l-1,c} \in {\mathbb {R}}\). The element-wise dropout layer can be defined as

$$\begin{aligned} v_{l,c} = a_{l,c} v_{l-1,c}, \end{aligned}$$
(11)

where

$$\begin{aligned} a_{l,c} \sim B(1, 0.5). \end{aligned}$$
(12)

\(a_{l,c}\) is the random gate coefficient of the c-th activation. When \(a_{l,c}=1\), the c-th activation of the l-th layer remains the same as that of the previous layer. Otherwise, the c-th activation of the l-th layer is set to zero and the c-th activation of the previous layer is suppressed. After applying element-wise dropout, the number of activations in the l-th layer remains the same as that of the previous layer.

The dropout layer can also be used for processing vectors/matrices/tensors. Denote the c-th channel of the l-th dropout layer as \({\mathbf{v }}_{l,c} \in {\mathbb {R}}^{{\text {dim}}({\mathbf{v }}_{l,c})_1 \times {\text {dim}}({\mathbf{v }}_{l,c})_2 \times \cdots \times {\text {dim}}({\mathbf{v }}_{l,c})_D}\) and the c-th channel of the previous layer as \({\mathbf{v }}_{l-1,c} \in {\mathbb {R}}^{{\text {dim}}({\mathbf{v }}_{l,c})_1 \times {\text {dim}}({\mathbf{v }}_{l,c})_2 \times \cdots \times {\text {dim}}({\mathbf{v }}_{l,c})_D}\). The D-dimensional dropout layer can be defined as

$$\begin{aligned} {\mathbf{v }}_{l,c} = a_{l,c} {\mathbf{v }}_{l-1,c}, \end{aligned}$$
(13)

where

$$\begin{aligned} a_{l,c} \sim B(1, 0.5). \end{aligned}$$
(14)

\(a_{l,c}\) is the random gate coefficient of the c-th channel. When \(a_{l,c}=1\), the c-th channel of the l-th layer remains the same as that of the previous layer. Otherwise, the c-th channel of the l-th layer is set to the zero tensor and the c-th channel of the previous layer is suppressed. When \(D=1\)/\(D=2\)/\(D=3\), the dropout layer defined by Eqs. (13), (14) can be applied after 1D/2D/3D convolutional layers or 1D/2D/3D max-pooling layers. Examples are illustrated in Fig. 3b–d.
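The channel-wise dropout of Eqs. (13), (14) can be sketched as follows; the helper name and the list-of-channels layout are illustrative assumptions of this sketch.

```python
import numpy as np

def channel_dropout(channels, rng, p_keep=0.5):
    """Sketch of Eqs. (13)-(14): each channel is kept or zeroed as a whole.
    channels: list of D-dimensional ndarrays (one per channel c)."""
    gates = rng.binomial(1, p_keep, size=len(channels))   # a_{l,c} ~ B(1, 0.5)
    return [a * v for a, v in zip(gates, channels)]

rng = np.random.default_rng(0)
maps = [np.ones((8, 8, 16)) for _ in range(4)]   # four 3D feature maps
dropped = channel_dropout(maps, rng)             # some maps become all-zero tensors
```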

Fig. 3
figure 3

Examples of dropout layers. a Element-wise dropout layer, b 1D dropout layer, c 2D dropout layer, d 3D dropout layer

3.1.4 Gabor layers for 3D data

2D Gabor filters are widely used in image processing applications. They are biologically plausible, given the evidence that simple cells in the primate visual cortex behave similarly to Gabor functions [18]. Some research findings show that many Gabor-like kernels appear in the first layer of a CNN that is well trained on large-scale natural image datasets [43, 44]. However, little research has been devoted to Gabor filters for tensors, except [25, 38, 42].

A Gabor filter is defined as a sinusoidal wave multiplied by a Gaussian function. In 3D space, a Gabor filter is defined as

$$\begin{aligned} g_{\sigma ,\nu ,\theta ,\phi }(x,y,z) = n_{\sigma }(x,y,z) w_{\nu ,\theta ,\phi }(x,y,z), \end{aligned}$$
(15)

where

$$\begin{aligned} n_{\sigma }(x,y,z)= & {} \frac{1}{(2\pi )^{3/2}\sigma ^3} {\text {exp}}\left( -\frac{x^2+y^2+z^2}{2\sigma ^2}\right) , \end{aligned}$$
(16)
$$\begin{aligned} w_{\nu ,\theta ,\phi }(x,y,z)= {\text {exp}}\left[ i 2\pi \left( u_0 x + v_0 y + w_0 z\right) \right] , \end{aligned}$$
(17)

where

$$\begin{aligned} u_0 = \nu \sin \theta \cos \phi , v_0 = \nu \sin \theta \sin \phi , w_0 = \nu \cos \theta . \end{aligned}$$
(18)

In Eq. (16), \(\sigma\) determines the scale of the Gaussian function n. In Eqs. (17), (18), \(\theta\) and \(\phi\) are the yaw and pitch angles, which determine the orientation of the wave function w, and \(\nu\) determines its frequency.

A 3D Gabor filter bank is a set of filters created by varying \(\sigma\), \(\nu\), \(\theta\) and \(\phi\). It can be formally defined as

$$\begin{aligned} G= \left\{ g_{\sigma ,\nu ,\theta ,\phi } | \sigma \in \varSigma , \nu \in N, \theta \in \varTheta , \phi \in \varPhi \right\} , \end{aligned}$$
(19)

where \(\varSigma\), N, \(\varTheta\) and \(\varPhi\) limit \(\sigma\), \(\nu\), \(\theta\) and \(\phi\) in reasonable ranges.
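For illustration, a discrete 3D Gabor kernel and a bank built from parameter sets can be sketched as below. The kernel radius and the complex-valued return are assumptions of this sketch (in practice the real or imaginary part, or the magnitude, would be used), and the actual parameter values used in this paper are given in Sect. 5.2.

```python
import numpy as np

def gabor3d(sigma, nu, theta, phi, radius):
    """Sketch of Eqs. (15)-(18): one discrete 3D Gabor kernel of side 2*radius + 1."""
    r = np.arange(-radius, radius + 1)
    x, y, z = np.meshgrid(r, r, r, indexing='ij')
    # Gaussian envelope n_sigma of Eq. (16)
    n = np.exp(-(x**2 + y**2 + z**2) / (2 * sigma**2)) / ((2 * np.pi)**1.5 * sigma**3)
    # Complex wave w of Eqs. (17)-(18)
    u0 = nu * np.sin(theta) * np.cos(phi)
    v0 = nu * np.sin(theta) * np.sin(phi)
    w0 = nu * np.cos(theta)
    w = np.exp(1j * 2 * np.pi * (u0 * x + v0 * y + w0 * z))
    return n * w

def gabor_bank(sigmas, nus, thetas, phis, radius):
    """Sketch of Eq. (19): the bank is the Cartesian product of the parameter sets."""
    return [gabor3d(s, nu, th, ph, radius)
            for s in sigmas for nu in nus for th in thetas for ph in phis]
```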

To simplify the implementation, we regard the discrete 3D Gabor filter as a special kind of 3D convolutional layer. This can be done when the following requirements are satisfied:

  • The layer accepts a single channel as its input and produces multiple channels as its output.

  • Each convolutional kernel \({\mathbf{W }}_{i,j}\) is initialized to the i-th bank element \(G_i \in G\). The element at position (pqr) of the kernel, denoted as \({\mathbf{W }}_{i,j}^{p,q,r}\), is given by

    $$\begin{aligned} {\mathbf{W }}_{i,j}^{p,q,r} = G_i\left( p-R_x-1, q-R_y-1, r-R_z-1\right) , \end{aligned}$$
    (20)

    where i indexes over the elements of the filter bank and j is fixed at 1. \(R_x = ({\text {dim}}(W_{i,j})_1-1)/2\), \(R_y = ({\text {dim}}(W_{i,j})_2-1)/2\) and \(R_z = ({\text {dim}}(W_{i,j})_3-1)/2\) are the radii of the Gabor kernel along the x-axis, y-axis and z-axis.

  • Convolutional kernels are fixed during training.

  • The bias vector is initialized to zeros and fixed during training.

  • The identity activation function \(f(x)=x\) is used.

Gabor filters for higher-order tensors (>3D) can be defined in the same manner with more coordinates \((x,y,z,\ldots )\) and more orientation angles \((\theta ,\phi ,\ldots )\).

3.1.5 Optical flow layers for 3D data

The Horn–Schunck method [13] can be used to calculate dense optical flow fields for gray-scale videos. The field contains a 2D motion vector at each pixel. These vectors are calculated from the current gray-scale frame and its previous frame.

In 3D CNNs, an optical flow layer accepts a fixed-length gray-scale video clip \({\mathbf{v }}_{l-1} \in {\mathbb {R}}^{w \times h \times T \times 1}\) as its input and produces an output \({\mathbf{v }}_l \in {\mathbb {R}}^{w \times h \times (T-1) \times 2}\). Denote the c-th channel of the input at time t as \({\mathbf{v }}_{l-1,c}^t \in {\mathbb {R}}^{w \times h}\) and the c-th channel of the output at time t as \({\mathbf{v }}_{l,c}^t \in {\mathbb {R}}^{w \times h}\). The optical flow layer can be defined as follows:

$$\begin{aligned} {\mathbf{v }}_{l,1}^t= & {}\, o_x\left( {\mathbf{v }}_{l-1,1}^t, {\mathbf{v }}_{l-1,1}^{t+1}\right) \nonumber \\ {\mathbf{v }}_{l,2}^t= & {}\, o_y\left( {\mathbf{v }}_{l-1,1}^t, {\mathbf{v }}_{l-1,1}^{t+1}\right) \end{aligned}$$
(21)

where \(o_x\) and \(o_y\) are functions calculating the magnitudes of the optical flow field along x-axis and y-axis.
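A minimal NumPy/SciPy sketch of the Horn–Schunck iteration that produces \(o_x\) and \(o_y\) for one pair of frames is given below. The derivative estimates, the smoothness weight and the iteration count are simplified assumptions of this sketch, not the exact formulation of [13].

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(frame1, frame2, alpha=1.0, n_iter=100):
    """Sketch of the flow computed by the optical flow layer for two gray frames.
    Returns the x- and y-components (o_x, o_y) of the dense flow field."""
    f1 = frame1.astype(float)
    f2 = frame2.astype(float)
    # Simple derivative estimates (the original method averages finite differences
    # over both frames); axis 0 = rows (y), axis 1 = columns (x).
    Iy, Ix = np.gradient(f1)
    It = f2 - f1
    # Neighbourhood averaging kernel from the Horn-Schunck formulation.
    avg = np.array([[1/12, 1/6, 1/12],
                    [1/6,  0.0, 1/6],
                    [1/12, 1/6, 1/12]])
    u = np.zeros_like(f1)
    v = np.zeros_like(f1)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_bar - Ix * common
        v = v_bar - Iy * common
    return u, v   # o_x and o_y in Eq. (21)
```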

3.2 General 3D CNNs

A 3D CNN consists of the extended layers defined in Sect. 3.1 (convolutional, max-pooling, fully connected, dropout, Gabor and optical flow) and standard layers (flatten, softmax normalization, cross-entropy loss, mean squared error loss, etc.) [17, 19]. As illustrated in the example in Fig. 4, a flatten layer is placed in the middle of the network. It reshapes and concatenates multiple tensors into a single vector. The flatten layer divides the whole network into two parts: the convolutional part on the left and the fully connected part on the right.

The inputs and outputs of the layers on the left are sets of 3rd-order tensors, usually called feature maps or channels. An optical flow layer or a Gabor layer is often placed at the beginning of the network to calculate low-level dense image features. The cropping layer randomly crops a sub-tensor of a specified size for data augmentation. Stacked pairs of convolutional and max-pooling layers learn mid-level and high-level features from labeled data.

The layers on the right have the same functions as those in standard fully connected neural networks. Fully connected layers are good at logical reasoning based on the deep features learned by the convolutional part. According to Eq. (6), the fully connected layers can be considered as a specific case of the general convolutional layers, in which the spatial size of the activation is \(1 \times 1 \times \cdots \times 1\) and the spatial size of the convolution kernel is \(1 \times 1 \times \cdots \times 1\). For better understanding, the fully connected layer illustrated in Fig. 4 follows the regular form in [19]; it is intrinsically equivalent to the \(1 \times 1 \times \cdots \times 1\) convolutional layer. A softmax normalization layer or a cross-entropy loss layer is often attached to the end of a 3D CNN for solving classification problems. Sometimes, a mean squared error layer is used for solving regression problems. Dropout layers can be inserted after fully connected layers, convolutional layers and max-pooling layers to improve the quality of the intermediate representations.

The objective of the training algorithm is to minimize the loss with respect to the parameters. As long as all these layers are differentiable, the gradients of the loss with respect to the parameters can be calculated using the chain rule, and gradient descent can then be applied to solve this optimization problem. Stochastic Gradient Descent (SGD) with mini-batches makes full use of the parallel computing ability of modern computers and is widely used for training neural networks on large-scale datasets.

The depth of the network is a central topic in deep learning. Encouraged by recent advances in deep learning [28, 33], we have investigated some very deep networks. Unlike natural image tasks, the facial expression recognition task has very limited dataset sizes, category counts and input variations. This is not to say that the expression recognition problem is easy to solve, but such deep networks containing 16–19 layers with millions of trainable parameters are unnecessary. Theoretically, more abstract features can be learned as the network goes deeper, but a very deep network can cause problems such as over-fitting, degradation of training accuracy, a larger meta-parameter set, higher computation cost and vanishing/exploding gradients, all of which make the network difficult to train. For the expression recognition problem, we suggest using networks containing 2–4 convolutional layers and 1–2 fully connected layers.
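For readers who prefer a concrete picture, a minimal Keras sketch of a 3D CNN in the suggested depth range is given below. The filter counts, kernel sizes and optimizer settings are illustrative assumptions and do not reproduce the configurations of Tables 1, 2, 3 and 4 (which additionally contain Gabor, optical flow and cropping layers).

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# A hypothetical 3D CNN with 2 convolutional layers and 2 fully connected layers.
model = keras.Sequential([
    keras.Input(shape=(16, 128, 128, 1)),            # frames x height x width x channels
    layers.Conv3D(8, kernel_size=(3, 5, 5), activation='tanh'),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(16, kernel_size=(3, 5, 5), activation='tanh'),
    layers.MaxPooling3D(pool_size=2),
    layers.Flatten(),                                 # convolutional part ends here
    layers.Dense(128, activation='tanh'),
    layers.Dropout(0.5),
    layers.Dense(6, activation='softmax'),            # six basic emotions
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])
```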

Fig. 4
figure 4

An example of 3D CNN

4 Initialization method for neural networks

Some research findings [9, 10] show that good initialization methods can help keep activations and gradients stable, thus making training faster. In these studies, it is assumed that the data obey the standard normal distribution. The distribution of the activations can then be estimated from the distribution of the inputs, and the distribution of the gradients can be estimated from the distribution of the gradients in the next layer. By adjusting the standard deviation of the random initial values in each layer, the activations and gradients can be kept in a reasonable range (see Fig. 5).

Fig. 5
figure 5

Histograms of the inputs and outputs of activation functions. a, b Bad initialization: the activation function works in the linear region. c, d Good initialization: the activation function works in the nonlinear region. e, f Bad initialization: the activation function works in the saturated region

In practice, however, this assumption is not always satisfied. The dataset often does not obey the standard normal distribution, and the difference between the actual and estimated distributions grows as the layers go deeper. Consequently, initialization methods based on the standard normal distribution assumption become very sensitive.

Furthermore, the method of Glorot et al. [9] assumes that the activation function is \(\tanh\) or sigmoid. This assumption is invalid when the Rectified Linear Unit (ReLU) or Parametric Rectified Linear Unit (PReLU) is used. He et al. [10] improved the method to meet this requirement, but the improved method is still unsuitable for networks containing various kinds of non-standard layers (e.g. the networks described in Sect. 5.1). To this end, a new initialization method for neural networks is proposed.

Inspired by the methods of Glorot et al. [9] and He et al. [10], we use several trials to determine the standard deviation of the initial parameters so that the activations and gradients stay stable in each layer. The key to initialization is to determine a standard deviation that keeps the activations stable. Consider two extreme examples: a convolutional kernel sampled from \(N(0, 1000^2)\) may drive the activations into saturation, whereas a kernel sampled with a very small standard deviation (e.g. \(N(0, 0.001^2)\)) may keep the signal in the linear region of the activation function and make training slow. The optimal standard deviation lies somewhere between these two extremes. We search for it by the following steps:

  • Initialize each element of the convolutional kernels of the first layer to a random value sampled from a normal distribution with zero mean and a predefined large standard deviation.

  • Initialize the bias vector of the first layer to zeros.

  • Perform the feed-forward pass using a large mini-batch.

  • Calculate the histogram of the activations in the first layer.

  • If too many activations in the first layer are saturated (e.g. \(>0.95\) or \(<-\,0.95\) when using the \(\tanh\) activation function), regenerate the kernels with a smaller standard deviation and try again.

  • Otherwise, this standard deviation is chosen for the first layer.

  • A greedy layer-wise scheme is used to initialize the whole network. The remaining layers are initialized in the same manner, with the standard deviations of the previous layers fixed.

As concluded by He et al. [10], if an initialization method scales the forward signal properly, it also has the same effect on the backward signal, and vice versa. The pseudo-code of the proposed method is summarized in Algorithm 1.

Algorithm 1 Pseudo-code of the proposed initialization method
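Following the steps above, a minimal Python sketch of the standard-deviation search for a single \(\tanh\) layer is given below. The `layer_forward` helper, the initial deviation and the shrink factor are assumptions of this sketch; the greedy layer-wise loop of Algorithm 1 would call it once per layer, keeping the earlier layers fixed.

```python
import numpy as np

def search_std(layer_forward, batch, sigma0=10.0, shrink=0.5,
               saturation=0.95, max_saturated=0.05):
    """Sketch of the per-layer search: shrink the kernel standard deviation until
    only a small fraction of the layer's activations saturate.
    `layer_forward(batch, sigma)` is assumed to initialize the layer's kernels from
    N(0, sigma^2), zero its biases and return the layer's activations on `batch`."""
    sigma = sigma0
    while True:
        acts = layer_forward(batch, sigma)
        frac_saturated = np.mean(np.abs(acts) > saturation)
        if frac_saturated <= max_saturated:
            return sigma            # keep this sigma, then move on to the next layer
        sigma *= shrink             # regenerate the kernels with a smaller deviation
```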

Although the new method is simple and looks like an engineering trick, it takes into account the real distribution of the data in specific domains, a property not ensured by the existing methods [9, 10]. Besides, the proposed method can easily be extended to new networks built from a wide variety of novel layers without caring about their details.

5 Experiments

5.1 Network designs

As listed in Tables 1, 2, 3 and 4, four different networks are proposed to solve the expression classification problem, namely 3DCNN-A, 3DCNN-B, 3DCNN-C and 3DCNN-D. They are intended to make complementary predictions. The major difference between the four networks is their low-level feature extractors: 3D Gabor layers are used in 3DCNN-A and 3DCNN-B, and optical flow layers are used in 3DCNN-C and 3DCNN-D. This means that their predictions are made from different views of the data, namely texture and motion. The decisions of 3DCNN-A and 3DCNN-B are based on facial appearance, and the decisions of 3DCNN-C and 3DCNN-D are based on facial motion. In addition, there are several minor differences, including the depths of the convolutional/fully connected layers, the receptive field sizes of the convolutional layers, the numbers of output channels of the convolutional layers, and the activation functions.

Although the four networks work well independently, performance can be further improved by decision-level fusion. After all the networks have been trained on different views of the data separately, the decisions supported by each network (the activations of the normalized softmax layer, i.e. the probabilities of the six categories) are fused together to make the final prediction.

The symmetry of the face should also be considered: decisions based on the frontal view and the mirrored view are also fused together. The pipeline is outlined in Fig. 6.
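As an illustration, probability-averaging fusion over the softmax outputs of the four networks and the two views can be sketched as follows; the array names are hypothetical.

```python
import numpy as np

def fuse_probability_averaging(prob_sets):
    """Sketch of probability-averaging fusion.
    prob_sets: list of arrays of shape (n_samples, 6), one per network and per view
    (frontal and mirrored), each holding softmax outputs for the six emotions."""
    avg = np.mean(np.stack(prob_sets), axis=0)
    return np.argmax(avg, axis=1)     # fused class prediction per sample

# Hypothetical usage with four networks and two views per network (eight arrays):
# labels = fuse_probability_averaging([p_a, p_a_mir, p_b, p_b_mir,
#                                      p_c, p_c_mir, p_d, p_d_mir])
```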

Table 1 3DCNN-A
Table 2 3DCNN-B
Table 3 3DCNN-C
Table 4 3DCNN-D
Fig. 6
figure 6

Pipeline of predicting and decision-level fusion

5.2 Low-level feature extractor designs

In the proposed four networks, 3D Gabor layers and optical flow layers are used to extract low-level features in spatial–temporal space. The Gabor filter bank is created by varying the standard deviation \(\sigma\), frequency \(\nu\), yaw \(\theta\) and pitch \(\phi\) of the Gabor function. The standard deviation is selected from \(\{4, 8, 16, 32\}\), and the wavelength \(1/\nu\) is set equal to the standard deviation. The yaw and pitch angles are selected manually and listed in Table 5. In total, the Gabor filter bank extracts 80 feature maps from each sample.

Table 5 Yaw/pitch of the 3D Gabor filter bank

The Horn–Schunck method [13] is used to calculate dense optical flow fields for video clips containing 16 frames, yielding 15 fields per clip. For simplicity, assuming that adjacent fields are similar, the output tensor is extended by repeating its border elements so that the input and output sequences have the same length. In 3DCNN-C and 3DCNN-D, the optical flow layer accepts an input tensor of 128\(\times\)128\(\times\)16\(\times\)1 and produces an output tensor of 128\(\times\)128\(\times\)16\(\times\)2.

Fig. 7
figure 7

Samples of CK+ with manual annotations

5.3 Experiment on posed dataset

5.3.1 Extended Cohn–Kanade dataset

The Extended Cohn–Kanade dataset (CK+) [22] is employed to evaluate the proposed deep networks. It is one of the most commonly used datasets in facial expression recognition studies. It contains 97 subjects, with 1–9 video sequences recorded for each subject. Each sequence contains a motion varying from a neutral expression to a stable posed expression, and the lengths of the sequences differ. In total, there are 487 sequences.

We use the corresponding manual annotation data [26] which provide coordinates of 59 facial landmarks and six basic emotional labels. Figure 7 shows some samples from CK+ with annotations.

As shown in Table 6, all the sequences are divided into a training set, a validation set and a test set containing 78.6%, 10.7% and 10.7% of the data, respectively. From another point of view, the sequences are roughly divided into six balanced categories. The training, validation and test sets contain sequences belonging to subjects 58–138, subjects 42–57 and subjects 10–37, respectively. There is no overlap between the training and testing subjects. The validation set is used to design the networks, select meta-parameters and monitor the training progress.

Table 6 Subsets division of CK+

5.3.2 Preprocesses

As illustrated in Fig. 8, all faces were aligned and cropped by locating the outer facial landmarks, and the backgrounds were removed. The length of each sequence was rescaled to 16 frames using 3D cubic spline interpolation; we call these sequences CK+II. Frames 1–11 of each sequence were then removed, keeping only the last five frames containing stable expressions; these data are called CK+I. The dimensions of the samples in CK+I/CK+II are 128\(\times\)128\(\times\)5/128\(\times\)128\(\times\)16.
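A minimal sketch of the temporal rescaling step is given below, assuming an already aligned and cropped sequence stored as a (T, 128, 128) array; the helper name and the normalized time axis are assumptions of this sketch.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def rescale_sequence(frames, target_len=16):
    """Resample a (T, 128, 128) sequence to `target_len` frames with a cubic spline
    along the temporal axis."""
    T = frames.shape[0]
    spline = CubicSpline(np.linspace(0.0, 1.0, T), frames, axis=0)
    return spline(np.linspace(0.0, 1.0, target_len))

# CK+II keeps all 16 resampled frames; CK+I keeps only the last 5 (stable) frames.
# clip_ii = rescale_sequence(aligned_sequence)   # shape (16, 128, 128)
# clip_i = clip_ii[-5:]                          # shape (5, 128, 128)
```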

Fig. 8
figure 8

Samples of CK+ after preprocessing. a A sequence in CK+I, b a sequence in CK+II

5.3.3 Training and test

The training meta-parameters differed across the four networks. 3DCNN-A and 3DCNN-B were trained using the SGD algorithm with a batch size of 128 examples, a learning rate of 0.01, a momentum of 0.9 and a weight decay of 0.0005. 3DCNN-C and 3DCNN-D were trained using the SGD algorithm with a batch size of 32 examples and a learning rate of 0.005; momentum and weight decay were not used for these two networks. All the networks were initialized by the method proposed in Sect. 4 and trained on a single workstation with an Intel Xeon E5-2620 CPU. Each of them took roughly 2–4 days to converge.

After all the networks had converged, four decision-level fusion strategies, namely Nearest Neighbor (1-NN), Support Vector Machine (SVM), probability maximum and probability averaging, were applied. The performances of the single networks and the ensemble network were evaluated.

5.3.4 Results and analysis

The results, including the accuracies of the single networks and the ensemble network, are summarized in Table 7. The accuracies of 3DCNN-A, 3DCNN-B, 3DCNN-C and 3DCNN-D are 90.39%, 90.39%, 92.31% and 92.31%, respectively.

Table 7 Classification and fusion results on CK+

As listed in Table 8, four decision-level fusion strategies were tried. After fusing the four networks together, the best accuracy of 96.15% was achieved by the probability averaging method. The improvement is strongly associated with the combination of different views and low-level features that carry complementary information. Although fusing the frontal and mirrored views is a convenient trick used in many facial tasks, its effect is not very noticeable here (\(\approx \pm \,1\%\)). By comparison, fusing the decisions supported by the four networks improves the performance greatly (\(\approx 4\%\)). The proposed networks use two kinds of low-level features, the optical flow feature of the whole sequence and the 3D Gabor feature of the stable frames. They are entirely different and contain much more complementary information than a frontal face and its mirrored view. The success is due to their ensemble.

Table 8 Results of different decision-level fusion strategies on CK+

The confusion matrices (see Tables 9, 10) show that the networks separate each class from the others almost perfectly, except for the pairs anger–disgust and anger–sadness, which are naturally hard to distinguish.

Table 9 Confusion matrix of the single network on CK+ (take 3DCNN-D as an example)
Table 10 Confusion matrix of the ensemble network on CK+

As concluded by Krizhevsky et al. [17], data augmentation (mirroring and random cropping) and dropout are important for preventing over-fitting, especially for tasks with small sample sizes. In this experiment, new findings about the over-fitting problem are made. As shown in Table 11, the gap between the performances on the training set and the test set is very small for 3DCNN-C and 3DCNN-D (even without weight decay). The optical flow feature may play an important role in reducing the impact of over-fitting. The advantages of using motion information are clear. Optical flow is a widely used motion feature for processing video data, and by integrating an optical flow calculator into a deep convolutional neural network, the motion information can be extracted explicitly. Since the optical flow calculator is designed to work in temporal space, the motion signals are stronger and more stable than those from classic edge or texture detectors such as Gabor filters, which usually work in spatial space. This may be significant for purely motion-based tasks such as human action recognition. We suggest that optical flow features can be widely used in practice when distinguishable motion is available, and it is even better if complementary texture features are provided as well.

Table 11 The gaps between performances evaluated on training set and test set of CK+

5.3.5 Comparing with previous work

To compare our results with previous studies that use the Area Under the ROC Curve (AUC) as their performance measure, six stand-alone expression detectors were built by slightly modifying our classifier. Each detector performs a CNN feed-forward pass, fetches the probabilities (the output of the normalized softmax layer), compares the probability of the class of interest with a threshold and outputs a binary detection result. By varying the threshold, Receiver Operating Characteristic (ROC) curves (see Figs. 9, 10) and AUC values (see Table 12) were obtained.
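For one emotion of interest, the detector construction and its ROC/AUC evaluation can be sketched with scikit-learn as follows; the function name and array shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def one_vs_rest_roc(probs, labels, class_index):
    """Sketch of turning the classifier into a stand-alone detector: the softmax
    probability of the class of interest is thresholded, and the resulting ROC
    curve and AUC are computed for that class against the rest.
    probs: (n_samples, 6) softmax outputs; labels: (n_samples,) integer labels."""
    scores = probs[:, class_index]                 # probability of the target emotion
    targets = (labels == class_index).astype(int)  # 1 if the sample is that emotion
    fpr, tpr, _ = roc_curve(targets, scores)
    return fpr, tpr, auc(fpr, tpr)
```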

As listed in Table 12, the proposed method outperforms the compared methods. The method of Zheng et al. [50] uses still frames, and the others [14, 20, 21, 40] use video data. Compared with these methods, the proposed method has two important advantages.

  • Firstly, both static and dynamic information is explicitly extracted and integrated to make the final prediction, whereas the compared methods use only one of them.

  • Secondly, the mid-level and high-level representations of CNNs learned by supervised learning play important roles in distinguishing the subtle differences between expression categories. The compared methods use features without learning (such as Haar and Gabor features) or non-structural features (such as ICA).

Table 12 Performance comparison with the previous work on CK+
Fig. 9
figure 9

ROC curves of single network on CK+ (take 3DCNN-D as an example)

Fig. 10
figure 10

ROC curves of ensemble network on CK+

5.4 Experiment on spontaneous dataset

5.4.1 FEEDTUM dataset

The Facial Expressions and Emotions dataset of the Technical University of Munich (FEEDTUM) [36] consists of elicited spontaneous emotions of 18 subjects. Besides the neutral state, the dataset covers the six basic emotions for each subject, with three sequences recorded per subject per emotion. In total, there are 378 sequences.

The CK+ dataset and the FEEDTUM dataset both contain hundreds of dynamic sequences with emotional labels, and thus satisfy the basic requirements of the experiments in this work. But the latter is more challenging. The main difficulty is that the intensities of spontaneous expressions are lower than those of posed ones; even humans can hardly recognize them (an accuracy of 61% on average [37]). Besides, some of the faces are near-frontal with bias angles within \(\pm \,30 ^\circ\), which causes small misalignments that may have a negative impact on classification. We hope that this problem is alleviated by the strong anti-distortion ability of CNNs.

As shown in Table 13, all the sequences are divided into a training set, a validation set and a test set containing 77.8%, 11.1% and 11.1% of the data, respectively. There is no overlap between the training and testing subjects. All the subsets are balanced.

Table 13 Subsets division of FEEDTUM

5.4.2 Preprocesses

The bounding boxes of the faces were located using Zhu's method [51], and 68 landmarks were detected for each frame using the SDM method [34, 39]. To give the detected landmarks the same meaning as those in CK+, a simple linear mapping was used to transform the coordinates of the 68 landmarks into the coordinates of the 59 landmarks.

Since the intensity histogram of the frames in FEEDTUM differs from that in CK+, a histogram adaptation method was applied. The mean and standard deviation of the pixel intensities in the face region were calculated on the two datasets respectively. After a gray-scale linear transformation was applied to each frame in FEEDTUM, the histograms of the two datasets became similar.
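A minimal sketch of this gray-scale linear transformation is given below. The (mean, std) statistics are assumed to be computed beforehand over the face regions of the two datasets, and the clipping range assumes 8-bit intensities.

```python
import numpy as np

def adapt_histogram(frame, src_stats, ref_stats):
    """Linearly map the face-region intensity statistics of a FEEDTUM frame (src)
    toward those of CK+ (ref). src_stats and ref_stats are (mean, std) pairs."""
    src_mean, src_std = src_stats
    ref_mean, ref_std = ref_stats
    out = (frame.astype(float) - src_mean) * (ref_std / src_std) + ref_mean
    return np.clip(out, 0, 255)
```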

Finally, the preprocessing steps in Sect. 5.3.2 were followed, and the preprocessed data, called FEEDTUM-I and FEEDTUM-II, were obtained.

5.4.3 Training and test

The networks described in Sect. 5.1 were employed. They were initialized with the final states of the networks trained on the CK+ dataset (see Sect. 5.3.3) and then fine-tuned on the FEEDTUM dataset. The fine-tuning meta-parameters (batch size, learning rate, momentum, weight decay) were chosen to be the same as those in the pre-training stage. Each network took roughly 2–4 days to converge.

After all the networks had converged, the four networks were fused together using the probability averaging method. The performances of the single networks and the ensemble network were evaluated.

5.4.4 Results and analysis

The results, including the accuracies of the single networks and the ensemble network, are summarized in Table 14. The accuracies of 3DCNN-A, 3DCNN-B, 3DCNN-C and 3DCNN-D are 36.11%, 50.00%, 52.78% and 58.33%, respectively. The accuracy does not increase greatly after fusion (from 58.33 to 61.11%). 3DCNN-C and 3DCNN-D performed better than 3DCNN-A and 3DCNN-B, from which we conclude that the motion feature is more useful than the texture feature for processing spontaneous expressions.

Table 14 Classification and fusion results on FEEDTUM (pretrained on CK+)

5.4.5 Comparing with previous work

Table 15 Performance comparison with the previous work on FEEDTUM

As shown in Table 15, the proposed method achieved an accuracy of 61.11%, which is very close to the performance reported in [37, 48]. In those studies, spatial–temporal features on still frames and motion fields were hand-designed, and traditional classifiers such as SVM were used to give the final prediction. By contrast, we used standard image features without special design and focused on proposing a general classifier/feature learner for 3D data. We find that the 3D CNN-based methods ([3] and the proposed method) perform worse than the methods based on hand-crafted dynamic features [37, 48].

The difference between the FEEDTUM dataset and the CK+ dataset reveals why 3D CNNs do not perform well here. Although both datasets contain dynamic sequences, the sequences in the CK+ dataset vary from a neutral expression to a stable posed expression, whereas the sequences in the FEEDTUM dataset have no such clear segmentation. The 3D convolutional layers can learn specific motions (such as Action Units) on the CK+ dataset, and these motions provide information complementary to the facial appearance. On the FEEDTUM dataset, however, the 3D convolutional layers can only learn time-series pooling features, so the advantage is not obvious in this task.

Incidentally, the perception test performed by the creators of the FEEDTUM dataset shows that the average human recognition rate is 61%, the worst is 38% and the best is 93% [37]. The proposed method outperforms the average human recognition rate. Owing to scope limitations, most of our efforts are devoted to the general framework rather than to experiments on specific tasks. We expect to obtain better performance by integrating new kinds of low-level features or refining the networks.

6 Conclusion

The results show that 3D CNNs are capable of achieving good performance on the facial expression classification task. The key is to make use of the complementary information in a video sequence by explicitly extracting, processing and fusing multi-view data. To achieve this goal, five kinds of 3D CNN layers are defined and four networks are designed for processing spatial–temporal data.

Inspired by ideas from previous work such as Caffe [16] and Theano [1, 2], we have created two open-source projects called 2DCNN [29] and 4DCNN [31]. The projects implement the layers and algorithms of the convolutional neural networks used in this paper, including the convolutional layer, fully connected layer, max-pooling layer, dropout layer, softmax normalization layer, Gabor layer, optical flow layer, SGD solver with mini-batches, etc. The projects support multi-threaded parallel computing, multi-workstation parallel computing and GPU acceleration; the parallel computing is based on the MATLAB Parallel Computing Toolbox, and the GPU acceleration is based on MATLAB GPU computing. The 4DCNN project is derived from the 2DCNN project for processing 4th-order tensors. In a 4DCNN, the output of each layer is a 6D array (height × width × depth × time × channel × sample). The classic CNN can be viewed as a special case of the 4DCNN: when the 3rd and 4th dimensions are set to 1, the 4DCNN reduces to the classic CNN, so the project can also be used to implement classic CNNs. The training/testing code and the preprocessed data are provided in the open-source package.

Our open-source implementation is already able to process 4D data such as BU-4DFE [41] and BP4D-Spontaneous [49]. However, after some preliminary studies, we found that processing 4D data is more challenging, and applications based on 4DCNNs need to be studied further.

Another unsolved problem is how to control the optimal capacity of 3D CNNs by varying their depth and width. The optimal capacity may be determined by many factors such as the task, the low-level features, the number of training samples, the resolution of the images, memory limitations, etc. Until now, the capacity has been chosen by experience; more theoretical research and practical experience are needed.