1 Introduction

Human emotion recognition has been an active research area for the past few years, owing to the increasing demand for applications in perceptual and cognitive sciences and affective computing. It has become an essential component of fields such as computer animation, sociable robots, and neuromarketing. Human emotions can be recognized from facial expressions and vocal tones. According to Kaulard et al. [1], nonverbal components convey two-thirds of human communication, while verbal components convey only one-third. Various kinds of data, including physiological signals such as the electromyogram (EMG), electrocardiogram (ECG), and electroencephalogram (EEG), can also serve as input for the emotion recognition process. Among these, the facial image is the most promising input type, as it is noninvasive and provides ample information for expression recognition. Emotions can be categorized into three types: basic emotions (BEs), compound emotions (CEs), and micro-expressions (MEs). The basic emotions are neutral, anger, disgust, fear, surprise, sadness, and happiness.

Two categories of approaches for facial expression recognition (FER) are in use: conventional approaches and deep learning-based approaches. Compared with deep learning-based techniques, conventional techniques have the advantage of requiring less computational power, so no additional infrastructure is needed. However, input images with illumination changes, occlusion, or deflection of the head may impair face detection and reduce FER accuracy, and conventional techniques are not suitable for such noisy input data. Deep learning-based techniques address these issues. In recent years, convolutional neural networks (CNNs) have proven effective for face detection [2]. Because CNNs contain deep layers and use elaborate designs, they can handle noisy data automatically [3], and they have been shown to outperform conventional methods on the FER task [4, 5]. The performance of a CNN depends strongly on the choice of its hyperparameters. It is possible to enhance a CNN's performance by optimizing hyperparameters such as the number of hidden layers, the number of units in each layer, the number and size of filters, the batch size, and the learning rate. The present work considers the optimization of the hyperparameters that describe the CNN structure. Grid search and random search are commonly used for this purpose [6], but both have limitations: they are time-consuming and require domain expertise to identify ideal hyperparameter values. Metaheuristic-based approaches, being stochastic approximation methods, can address these shortcomings. The present work employs the differential evolution (DE) algorithm for tuning the selected hyperparameters.

2 Related Work

Kim et al. [7] proposed training multiple CNNs and showed that training can be improved by varying the network topology and the random weight initialization. An interesting method for selecting the CNN structure was presented by Gao et al. [8], who proposed gradient priority particle swarm optimization (GPSO) with gradient penalties for tuning the CNN architecture; their experiments showed competitive prediction performance on the emotion recognition task. Bergstra and Bengio [6] proposed grid and random search for tuning hyperparameters; since the number of hyperparameters is large, such testing is computationally expensive. Snoek et al. [9] addressed the limitations of trial-and-error techniques for hyperparameter optimization by proposing a Bayesian optimization framework. Bochinski et al. [10] showed that evolutionary algorithms can outperform existing hyperparameter optimization methods.

3 Methodology

The benchmark dataset for facial expression recognition is split into a training set (TS) and a testing set (TE). To ensure that samples of all classes are selected, stratified sampling without replacement is used. Samples selected from TS form the tuning set (TUS), which is further divided into TUS1 and TUS2: TUS1 is used for hyperparameter optimization, and TUS2 is used for validating the outcome of the optimization. Differential evolution is run until the termination condition is met, and the CNN is then trained on TS using the hyperparameters returned by DE. The holdout method is used for assessing the performance of the trained model: after training, the model's performance is assessed on TE. Table 1 specifies the architecture of the convolutional neural network used in the present work (Fig. 1).
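The data partitioning described above can be sketched with scikit-learn's stratified `train_test_split`; the 70/30 and 20% proportions follow Sect. 4, and the placeholder arrays stand in for the actual image data. Variable names (`X_ts`, `X_tus1`, etc.) are illustrative, not taken from the original implementation.

```python
# Sketch of the TS/TE and TUS1/TUS2 partitioning using stratified sampling.
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed for reproducibility

# Placeholder data: 100 samples spread over 6 emotion classes.
X = np.arange(100).reshape(100, 1)
y = np.repeat(np.arange(6), [17, 17, 17, 17, 16, 16])

# 70/30 train/test split (TS / TE), stratified by emotion label.
X_ts, X_te, y_ts, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=SEED)

# Tuning set TUS: 20% of TS, again stratified; then split into TUS1/TUS2.
X_rest, X_tus, y_rest, y_tus = train_test_split(
    X_ts, y_ts, test_size=0.20, stratify=y_ts, random_state=SEED)
X_tus1, X_tus2, y_tus1, y_tus2 = train_test_split(
    X_tus, y_tus, test_size=0.50, stratify=y_tus, random_state=SEED)
```

Stratifying every split keeps the six emotion classes represented in TE, TUS1, and TUS2 even though the tuning sets are small.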

Table 1 Configuration of convolutional neural network
Fig. 1
figure 1

Architecture of the proposed model

Hyperparameter Tuning—Metaheuristic optimization techniques have proven to yield better results when the search space is large and complex [11]. Since the number of hyperparameters in a CNN is large, tuning them is computationally expensive. Hence, the proposed model determines the optimal network topology by using the differential evolution (DE) algorithm. A simple yet powerful population-based stochastic search technique, differential evolution (DE) [12] has gained much attention and a wide range of successful applications [13, 14], owing to its simplicity, ease of implementation, and quick convergence.

The hyperparameters tuned by DE are the number of convolutional layers, filter size, stride, dropout rate, and batch size. A vector comprising these parameters serves as a chromosome for the DE algorithm. Precision and recall values for each of the six basic emotions are calculated from the confusion matrix. The fitness function is defined as F = AvgPrec + AvgRec, where AvgPrec is the average of the precision values computed for the basic emotions and AvgRec is the corresponding average of the recall values. DE improves the existing solutions through mutation, recombination, and selection. The general paradigm of differential evolution is shown in Fig. 2.
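The fitness computation follows directly from the definition above; the helper name and the zero-division guards below are illustrative choices, not part of the original formulation.

```python
# Fitness F = AvgPrec + AvgRec, computed from a 6x6 confusion matrix
# where cm[i, j] counts samples of true emotion i predicted as emotion j.
import numpy as np

def fitness_from_confusion(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # true positives per emotion
    pred_totals = cm.sum(axis=0)     # column sums: predicted-class totals
    true_totals = cm.sum(axis=1)     # row sums: true-class totals
    # Guard against empty rows/columns to avoid division by zero.
    precision = np.divide(tp, pred_totals,
                          out=np.zeros_like(tp), where=pred_totals > 0)
    recall = np.divide(tp, true_totals,
                       out=np.zeros_like(tp), where=true_totals > 0)
    return precision.mean() + recall.mean()

# A perfect classifier over six emotions attains the maximum fitness 2.0.
print(fitness_from_confusion(np.eye(6) * 10))  # → 2.0
```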

Fig. 2
figure 2

Differential evolution algorithm scheme

Initialization—Creation of a population of individuals. The ith individual vector (chromosome) of the population at the current generation t, with d dimensions, is

$$Z_{i} \left( t \right) = \left[ {Z_{i,1} \left( t \right),Z_{i,2} \left( t \right), \ldots ,Z_{i,d} \left( t \right)} \right]$$
(1)

Mutation—A random change of the vector components. For each individual vector Zk(t) in the current population, a new individual U, called the mutant individual, is derived by combining randomly selected individuals:

$$U_{k,n} \left( {t + 1} \right) = Z_{m,n} \left( t \right) + F*\left( {Z_{i,n} \left( t \right) - Z_{j,n} \left( t \right)} \right)$$
(2)

where m, i, and j are mutually different uniformly random integer indices, each distinct from the current index k; n denotes the component index; and F is a real positive parameter, called the mutation factor or scaling factor (usually F ∈ [0, 1]).

Recombination (Crossover)—Merging the genetic information of two or more parent individuals to produce one or more descendants. Binomial crossover is used in the present work. The binomial (uniform) crossover is performed on each component n (n = 1, 2, …, d) of the mutant individual Uk,n(t + 1). For each component, a random number r in the interval [0, 1] is drawn and compared with the crossover rate (CR), also called the recombination factor (another DE control parameter), CR ∈ [0, 1]. If r < CR, the nth component of the mutant individual Uk,n(t + 1) is selected; otherwise, the nth component of the target vector Zk,n(t) is retained:

$$U_{k,n} \left( {t + 1} \right) = \left\{ {\begin{array}{*{20}l} {U_{k,n} \left( {t + 1} \right),} \hfill & {{\text{if}}\;{\text{rand}}_{n} \left( {0,1} \right) < {\text{CR}}} \hfill \\ {Z_{k,n} \left( t \right),} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right.$$
(3)

Selection—Choice of the best individuals for the next cycle. If the new offspring yields a better value of the objective function, it replaces its parent in the next generation; otherwise, the parent is retained in the population, i.e.,

$$Z_{k} \left( {t + 1} \right) = \left\{ {\begin{array}{*{20}l} {U_{k} \left( {t + 1} \right),} \hfill & {{\text{if}}\;\;f\left( {U_{k} \left( {t + 1} \right)} \right) > f\left( {Z_{k} \left( t \right)} \right)} \hfill \\ {Z_{k} \left( t \right),} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right.$$
(4)

where f is the objective (fitness) function to be maximized; a candidate replaces its parent only if it yields a higher fitness. DE is a powerful population-based heuristic search technique that has empirically proven to be very robust for global optimization over continuous spaces. As DE has very few control parameters compared with other algorithms, it is effective and efficient and can be treated as a widely applicable approach for solving real-world problems [13, 14].
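The DE steps of Eqs. (1)–(4) can be sketched as follows. In the present work, f would be the CNN fitness AvgPrec + AvgRec evaluated on TUS1; here a simple analytic function stands in so the sketch is self-contained, and the control-parameter values (population size, F, CR, generation count) are illustrative defaults, not those of the original experiments.

```python
# Minimal DE/rand/1/bin loop maximizing f over box-constrained real vectors.
import numpy as np

def de_maximize(f, bounds, pop_size=20, F=0.8, CR=0.9, generations=50, seed=0):
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T  # bounds: list of (low, high)
    d = len(bounds)
    Z = rng.uniform(low, high, size=(pop_size, d))   # Eq. (1): initialization
    fit = np.array([f(z) for z in Z])
    for _ in range(generations):
        for k in range(pop_size):
            # Three mutually distinct indices, all different from k.
            m, i, j = rng.choice([x for x in range(pop_size) if x != k],
                                 size=3, replace=False)
            U = Z[m] + F * (Z[i] - Z[j])             # Eq. (2): mutation
            U = np.clip(U, low, high)
            mask = rng.random(d) < CR                # Eq. (3): binomial crossover
            trial = np.where(mask, U, Z[k])
            fu = f(trial)
            if fu > fit[k]:                          # Eq. (4): selection
                Z[k], fit[k] = trial, fu
    best = int(np.argmax(fit))
    return Z[best], fit[best]

# Example: maximize -(x^2 + y^2); the optimum lies at the origin.
x_best, f_best = de_maximize(lambda z: -np.sum(z**2), [(-5.0, 5.0), (-5.0, 5.0)])
```

For integer-valued hyperparameters such as the number of layers, the real-valued components would be rounded before building the network, a common practice when applying continuous DE to discrete search spaces.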

4 Experimentation

For experimentation, two benchmark datasets, CK+ and Japanese Female Facial Expressions (JAFFE), are used.

CK+ Dataset: This dataset has 593 image sequences representing seven basic expressions (happiness, sadness, surprise, disgust, fear, anger, and neutral) of 123 subjects. Since the work focuses on the recognition of six basic expressions, neutral-expression images were ignored. Of the 593 sequences, 309 have validated emotion labels belonging to one of the six emotions mentioned above; the remaining sequences were excluded. From each selected sequence, the last two frames were taken, yielding a dataset of 618 images.

Japanese Female Facial Expressions (JAFFE): The JAFFE dataset has 213 images of ten female Japanese models. Each image represents one of the seven basic emotions (including neutral emotion). Images pertaining to neutral expression are not used.

The proposed model is implemented using Keras with a TensorFlow back end in Python 3.6, and experiments are conducted on the selected datasets. Seventy percent of the samples are used for training, and the remaining 30% are used for testing; for validating the hyperparameter tuning, 20% of the samples from the training dataset are used. The samples are selected by stratified sampling. Tables 2, 3, 4, and 5 show the confusion matrices for the two datasets with and without hyperparameter tuning, and the prediction accuracies for the CK+ and JAFFE datasets are depicted in Fig. 3. For both datasets, optimization of the hyperparameters improved the accuracy of all basic emotions except fear, which benefited least from the optimization; for the JAFFE dataset, its accuracy decreased by 1%. The proposed model improved the overall classification accuracy by 4.32% for the CK+ dataset and by 3.78% for the JAFFE dataset.
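As an illustration of how a decoded chromosome could configure the network, the sketch below builds a Keras model from the five tuned hyperparameters. The exact layer configuration of Table 1 is not reproduced here, so the input shape, activation choices, and all concrete values are placeholders.

```python
# Sketch: decode a DE chromosome (conv layers, filters, kernel size,
# stride, dropout) into a Keras CNN. All concrete values are illustrative.
from tensorflow.keras import layers, models

def build_cnn(n_conv, n_filters, kernel, stride, dropout,
              input_shape=(48, 48, 1), n_classes=6):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for _ in range(int(n_conv)):
        model.add(layers.Conv2D(int(n_filters), int(kernel),
                                strides=int(stride), padding="same",
                                activation="relu"))
        model.add(layers.MaxPooling2D(2))
    model.add(layers.Flatten())
    model.add(layers.Dropout(float(dropout)))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example chromosome decoded from DE:
# [2 conv layers, 32 filters, 3x3 kernels, stride 1, dropout 0.4].
model = build_cnn(2, 32, 3, 1, 0.4)
```

In the tuning loop, each chromosome would be decoded this way, the model trained briefly on TUS1, and the fitness F computed from the resulting confusion matrix; the batch size component of the chromosome is passed to `model.fit` rather than to the builder.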

Table 2 Confusion matrix for CK+ dataset without hyperparameter tuning
Table 3 Confusion matrix for CK+ dataset with hyperparameter tuning
Table 4 Confusion matrix for JAFFE dataset without hyperparameter tuning
Table 5 Confusion matrix for JAFFE dataset with hyperparameter tuning
Fig. 3
figure 3

Classification accuracies for CK+ and JAFFE datasets

5 Conclusion

The present study optimizes the hyperparameters of a convolutional neural network to improve the human emotion recognition rate from facial expressions. Conventional techniques fail to offer good classification accuracy on noisy input data. As CNNs contain deep layers, they can handle noisy data and have proven suitable for facial expression recognition; however, CNNs demand high computational power, which limits their applicability. The performance of a CNN depends strongly on the choice of its hyperparameters. To enhance CNN performance for facial expression recognition, the hyperparameters are optimized using the DE algorithm, and the CK+ and JAFFE datasets are used for assessing the tuned model's performance. The results show that hyperparameter tuning improved the overall accuracy by approximately 4%.