1 Introduction

Emotion is a pivotal factor in human life, as it affects a person's working ability, mental state, and judgment. Experts have studied it in disciplines such as psychology, cognitive science, neuroscience, computer technology, and brain-computer interfacing (BCI) [38]. Electroencephalogram (EEG) based emotion recognition has opened considerable scope in these disciplines because it works from the actual affective state [20]. BCI offers numerous applications for EEG-based emotion recognition, including humanoid robots. Most humanoid robots lack emotional capability, and this field remains largely unexplored. Emotion recognition is a major component of an effective BCI, but it is complex because emotion is inherently fuzzy. Human emotion correlates with context, language, time, space, culture, and other factors. Consequently, absolutely true emotion labels cannot be guaranteed for EEG recordings, which complicates the task [3].

Many authors have proposed emotion recognition methods based on facial expression [51], gesture [9, 30], posture, speech [27], and other physical signals. Such data are easy to record, but they can also be deliberately controlled to mask the true emotion [15]. Signals related to the nervous system are much harder to control or mimic because they are activated involuntarily [39], and only subject experts can control them. Therefore, the signature of true emotion can be observed in nervous system related recordings. Several physiological recordings, such as EEG, electrocardiogram (ECG) [34], temperature, electromyogram (EMG) [5], respiration, and galvanic skin response (GSR), can be used to study human emotion [39]. A detailed investigation of brain activity under various emotions can support accurate and computationally efficient emotion recognition models. Recent research on dry electrodes [11, 13] and wearable devices promotes real-time EEG-based emotion identification for mental state monitoring [23, 35]. EEG-based emotion recognition is one of the key features required in human-machine interaction (HMI) and humanoid robots. This study focuses on EEG-based human emotion analysis, in which the electrical activity of the brain is investigated during different emotions (neutral, positive, and negative).

Many studies have addressed EEG-based human emotion analysis and tried to establish a definitive relationship between EEG signals and different emotions [28, 33]. EEG signal analysis is challenging because the signal is non-stationary [36]. In a real-time scenario, other signals are superimposed on the EEG recording and the signal-to-noise ratio (SNR) becomes low. Matrix decomposition-based EEG analysis methods have been proposed, but their high complexity makes real-time implementation difficult [7, 37, 41]. Stable emotion-related patterns in EEG recordings are observed in [55], which uses the DEAP dataset. The critical frequency bands and the significant channels for emotion are examined closely in [54]. Such channel selection helps to determine the electrode positions for emotion analysis. Time-frequency and various non-linear features have been studied for EEG-based emotion recognition, achieving 59.06% accuracy (ACC) on DEAP data [20]. It has also been suggested that the gamma band of EEG recordings is more strongly correlated with emotional function [18].

Many machine learning-based architectures have been proposed for EEG-based emotion recognition. A bi-hemisphere neural network was designed for EEG emotion detection and evaluated on the SEED dataset with 63.50% ACC [21]. Graph neural network-based emotion recognition on the same dataset achieved 89.23% ACC using the gamma band and 94.24% ACC using all bands [56]. A regional asymmetric convolutional neural network (CNN) based study on DEAP data reached 95% ACC for arousal and valence detection [6]. Most of these methods improve an existing model to achieve good classification of human emotion. The proposed approach instead employs multiple models and develops a hybrid approach to attain better ACC than the existing methods. Two models, based on CNN and long short-term memory (LSTM), are hybridized, and the final prediction is improved using an ensemble model.

The rest of the article is organized as follows. Section 2 presents the datasets. The proposed approach and its features are explained in Section 3. The proposed hybrid model, along with the CNN- and LSTM-based models, is presented in Section 4, which also covers the implementation of ensemble learning. Results are discussed in Section 5. Finally, the article is concluded in Section 6.

2 Dataset

We have used two datasets for EEG-based emotion recognition. Both datasets are described in detail in the following subsections.

2.1 SEED data

The database employed in the proposed approach has been obtained from the brain-like computing and machine learning (BCMI) laboratory. We employed the SJTU emotion EEG dataset (SEED) [8, 54]. The dataset contains EEG data of 15 subjects (7 males and 8 females) recorded in three separate sessions, each session having 15 trials. In each trial, the EEG signal is recorded while the subject watches Chinese film clips eliciting one of three types of emotion, namely positive, neutral, and negative. Each film clip lasts about 4 minutes, and two film clips targeting the same emotion are not shown consecutively. The participants reported their emotional reactions by completing a questionnaire immediately after watching each film clip. The EEG signals are recorded using a 62-channel electrode cap according to the international 10-20 system. The data is then down-sampled to 200 Hz to speed up processing, and a band-pass filter from 0-75 Hz is applied, which retains all the EEG rhythm information.

2.2 DEAP data

The DEAP data has been recorded for the analysis of human emotion using EEG signals. It was recorded from 32 healthy participants aged between 19 and 37 years, 16 of whom were female. Each participant was exposed to 40 music videos, each 1 min long and carrying the same emotion throughout. The data comprises 40 channels, of which the 32 EEG channels are investigated in this paper. The data was recorded with Biosemi ActiveTwo devices at a sampling rate of 512 Hz and then downsampled to 128 Hz to reduce system complexity. The DEAP data provides 32 files, each containing the 40-channel EEG recording of 40 videos of one minute duration each.

3 Proposed approach

A block diagram of the proposed approach is shown in Fig. 1. All subjects were seated on a chair in a resting state and asked to watch videos portraying different emotions. EEG signals were recorded simultaneously and then pre-processed. Differential entropy (DE) based features are computed in five EEG rhythms; these features are explained in the next sub-section. CNN and LSTM models are then employed and combined to obtain the hybrid model. Thereafter, an ensemble model is built on these models.

Fig. 1

Block diagram of the proposed system for EEG-based emotion recognition

3.1 Features extraction

We have employed DE as the feature in the proposed approach. DE extends the idea of Shannon entropy and measures the complexity of a continuous random variable. DE was first introduced as a feature for EEG-based emotion recognition by Duan et al. [8] and has been found to be better suited to emotion recognition than traditional features. DE has a balanced ability to discriminate EEG patterns between low- and high-frequency energy. DE features extracted from EEG data provide stable and accurate information for emotion classification [53]. The differential entropy is defined as:

$$ \begin{aligned} h(Y) &= -\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y-\mu)^{2}}{2\sigma^{2}}}\log\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y-\mu)^{2}}{2\sigma^{2}}}\right)dy \\ &= \frac{1}{2} \log\left(2\pi e\sigma^{2}\right) \end{aligned} $$
(1)

where the time series Y obeys the Gaussian distribution N(μ, σ²). DE was employed to construct features in five frequency bands: delta (1-3 Hz), theta (4-7 Hz), alpha (8-13 Hz), beta (14-30 Hz), and gamma (31-50 Hz). For the SEED dataset, the DE feature extracted from a sample EEG signal has 310 dimensions, as there are 62 channels in each of the five frequency bands [54]. Similarly, 32 channels are considered for the DEAP dataset in the five EEG sub-bands, which leads to a total of 160 DE features.
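As a minimal sketch of Eq. (1), the closed form can be evaluated directly from the variance of a band-filtered segment. The function names, the sample values, and the use of the natural logarithm are assumptions made for illustration; band-pass filtering into the five rhythms is assumed to have been done beforehand and is not shown.

```python
import math
from statistics import pvariance

def differential_entropy(samples):
    """DE of one band-filtered segment under the Gaussian assumption of Eq. (1)."""
    variance = pvariance(samples)  # population variance, i.e. sigma^2
    return 0.5 * math.log(2 * math.pi * math.e * variance)

def feature_dimension(n_channels, n_bands=5):
    """One DE value per channel and per frequency band."""
    return n_channels * n_bands

# Hypothetical band-filtered segment from a single channel (illustrative values)
segment = [0.5, -1.2, 0.3, 1.1, -0.7, 0.0, 0.9, -0.9]
print(differential_entropy(segment))

# Feature counts quoted in the text: 62 channels (SEED) and 32 channels (DEAP)
print(feature_dimension(62), feature_dimension(32))  # 310 160
```

Because the closed form depends only on σ², the DE of a band grows with the logarithm of the signal power in that band.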

The individual models, along with the hybrid model and the ensemble model, are explained in the next section.

4 Model employed for emotion recognition in the proposed system

In the proposed work, initially, the CNN and LSTM based models are employed for emotion recognition. Thereafter, a hybrid model is proposed which is a combination of CNN and LSTM. Finally, an ensemble model of these three proposed models is taken into consideration. All these models are explained in this section.

4.1 CNN-based model

The idea behind CNNs resembles that of traditional artificial neural networks (ANNs): both consist of neurons that self-optimize through learning. CNNs perform well on large data represented by matrices, such as images broken down into their pixel values [45]. A small n × n kernel slides over the entire feature matrix, performing convolutions over the superposed region [12]. The feature map size can be kept constant across multiple convolutions by zero padding, while functions such as max pooling reduce the amount of data to be processed and still retain the important information [26]. As the feature maps pass through successive convolutional layers, the filters learn to detect patterns and increasingly abstract features.

EEG-based emotion classification using CNNs was also explored in [46]. Cascade and parallel convolutional recurrent neural networks have been used for EEG human-intended movement classification tasks [52]. Additionally, EEG data can be converted to an image representation after feature extraction and before applying the CNN [42]. However, the accuracy of emotion recognition using a CNN alone is not high.

The details of the CNN architecture employed in the proposed approach are shown in Fig. 2. The CNN model consists of four convolutional (conv) blocks with 64, 128, 256, and 512 filters, respectively. The conv kernels are of size 5 × 5 and 3 × 3. All conv layers use padding and are followed by max pooling layers, which sub-sample over 2 × 2 windows. The network ends with three fully connected dense layers that feed the c-way softmax [24] classification layer. ReLU activation is employed because of its unity gradient, which passes the maximum amount of error during back-propagation [1]. Dropout is applied after every layer, which improves the performance of the model via a modest regularization effect [29]. The predictions of the CNN model are then fed to the proposed ensemble model for emotion recognition.
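To make the pooling step concrete, the sketch below applies 2 × 2 max pooling to a small feature map and traces how the spatial size would shrink across four conv blocks. It assumes stride-2 pooling and padding that preserves the size of each convolution; the 64 × 64 input and the helper names are hypothetical and are not taken from the paper.

```python
def max_pool_2x2(matrix):
    """2 x 2 max pooling with stride 2 over a list-of-lists feature map."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [max(matrix[r][c], matrix[r][c + 1],
             matrix[r + 1][c], matrix[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

def trace_shapes(height, width, filters=(64, 128, 256, 512)):
    """Shape after each conv block, assuming size-preserving padding and
    one stride-2 pooling step per block (floor division on odd sizes)."""
    shapes = []
    for n_filters in filters:
        height, width = height // 2, width // 2
        shapes.append((height, width, n_filters))
    return shapes

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
print(max_pool_2x2(feature_map))  # [[4, 5], [6, 7]]
print(trace_shapes(64, 64))       # hypothetical 64 x 64 input
```

Each pooling step keeps only the largest activation in a window, so the number of values per feature map drops by roughly a factor of four while the strongest responses survive.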

Fig. 2

CNN-based model architecture employed in the proposed approach for EEG-based emotion recognition

4.2 LSTM-based model

LSTM networks are modified recurrent neural networks (RNNs) capable of learning long-term dependencies. An LSTM network is parametrized by weight matrices from the input and the previous state for each of the gates, in addition to the memory cell, which mitigates the vanishing/exploding gradient problem [10].

We use the standard LSTM formulation, with the logistic function (σ) [4] on the gates and the hyperbolic tangent [2] on the activations. The input has the shape 1325 × 62. The model has 4 LSTM layers with dropout in between, and the output is then passed to a fully connected network. The softmax activation function [24] is used to predict the final output. The block diagram of the LSTM architecture is shown in Fig. 3.
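The standard formulation mentioned above can be illustrated with a single time step of a one-unit LSTM cell. This is a scalar sketch for readability only: the gate names, the parameter layout, and the numeric values are assumptions, and a real layer would use weight matrices instead of scalars.

```python
import math

def sigmoid(x):
    """Logistic function used on the gates."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM.

    params maps a gate name to (w_x, w_h, b): input weight,
    recurrent weight, and bias (hypothetical layout).
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))         # input gate
    f = sigmoid(pre_activation("forget"))        # forget gate
    o = sigmoid(pre_activation("output"))        # output gate
    g = math.tanh(pre_activation("candidate"))   # candidate cell value
    c = f * c_prev + i * g                       # updated memory cell
    h = o * math.tanh(c)                         # updated hidden state
    return h, c

# Illustrative parameters, not trained values
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}
h, c = 0.0, 0.0
for x in [0.2, -0.1, 0.4]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The additive update of the memory cell `c` is what lets gradients flow across many time steps, in contrast to the repeated squashing of a plain RNN state.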

Fig. 3

LSTM-based architecture employed in the proposed approach for EEG-based emotion recognition

4.3 Hybrid model

The hybrid model combines more than one base model in series. Figure 4 shows the structure of the hybrid model employed in the proposed approach. The hybrid model improves the performance by capturing information that the individual models leave undetected.

Fig. 4

Hybrid model employed in the proposed approach for EEG-based emotion recognition

The first three blocks of the hybrid model are convolutional (conv) blocks. Each conv block includes a max pooling layer and dropout regularization to avoid overfitting [29]. The output shape of the third conv block is 15 × 66 × 512, whereas the input shape expected by the LSTM block is 66 × 7680. A reshape layer is therefore placed between the conv and LSTM blocks to resolve this dimensional mismatch (note that 15 × 512 = 7680). In general, 2D conv blocks work on inputs in \(\mathbb {R}^{3}\), while LSTM inputs are in \(\mathbb {R}^{2}\). The LSTM network uses the tanh activation function [2] and batch normalization [47]. The output of the LSTM block is passed to a fully connected network that uses softmax [24] to calculate the output probabilities.
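The reshape between the conv and LSTM blocks can be sketched as follows. The shapes 15 × 66 × 512 and 66 × 7680 come from the text; the choice of the width axis as the LSTM time axis and the ordering of the flattened features are assumptions, since the paper does not state them.

```python
def conv_to_lstm_shape(conv_shape):
    """Map a (height, width, channels) conv output to (time_steps, features).

    Assumption: the width axis becomes the time axis and the
    height and channel axes are flattened into one feature axis.
    """
    height, width, channels = conv_shape
    return (width, height * channels)

def reshape_for_lstm(feature_map):
    """feature_map[h][w][c] -> sequence[w][h * channels + c] on nested lists."""
    height = len(feature_map)
    width = len(feature_map[0])
    return [
        [value for h in range(height) for value in feature_map[h][w]]
        for w in range(width)
    ]

print(conv_to_lstm_shape((15, 66, 512)))  # (66, 7680)

# Tiny 2 x 3 x 2 example with values that encode their own indices
tiny = [[[h * 100 + w * 10 + c for c in range(2)] for w in range(3)]
        for h in range(2)]
print(reshape_for_lstm(tiny)[1])  # [10, 11, 110, 111]
```

No values are discarded by this step; the same 15 × 66 × 512 activations are only regrouped into 66 time steps of 7680 features each.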

4.4 Ensemble learning-based model

Ensemble learning is mainly of two types, namely, homogeneous and heterogeneous. It combines the prediction from multiple models and integrates the individual strengths of the base models. This results in the robustness and the improved performance of the overall approach [50]. Ensemble learning is homogeneous when the base models are of the same type. In the proposed approach, ensemble learning is heterogeneous as the base models are different.

Once these models are trained, a statistical method is used to combine their predictions. Common choices are bagging, boosting, and stacking. We have employed stacking, as it is suitable for heterogeneous ensemble models [43]. In stacking, separate models learn in parallel on the dataset, and a small meta-model, usually a feed-forward neural network (FNN), receives the predictions of the base models as its input and combines them into the final output. The meta-model [48] learns to maximize the accuracy of the output prediction, which becomes our final output. In addition to stacking, we have also investigated the max function as a statistical method to combine the predictions. Figure 5 shows the block diagram of the ensemble model. The meta-model used in the stacking method consists of 4 fully connected (FC) layers followed by a softmax classifier [24].
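The two combination strategies can be sketched on the class probabilities of three base models. The probability values are made up, and the interpretation of the max function as an element-wise maximum followed by an argmax is an assumption; the trained meta-model itself is not reproduced here, only the input vector it would receive.

```python
def max_rule(predictions):
    """Element-wise max over base-model probability vectors, then argmax."""
    fused = [max(class_scores) for class_scores in zip(*predictions)]
    return fused.index(max(fused)), fused

def stack_input(predictions):
    """Concatenate the base-model outputs into one meta-model input vector."""
    return [p for model_probs in predictions for p in model_probs]

# Hypothetical softmax outputs for (negative, neutral, positive)
cnn_probs = [0.6, 0.3, 0.1]
lstm_probs = [0.2, 0.7, 0.1]
hybrid_probs = [0.3, 0.5, 0.2]
predictions = [cnn_probs, lstm_probs, hybrid_probs]

label, fused = max_rule(predictions)
print(label, fused)                    # 1 [0.6, 0.7, 0.2]
print(len(stack_input(predictions)))   # 9 inputs for the meta-model
```

The max rule needs no training, whereas stacking lets the meta-model learn how much to trust each base model for each class.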

Fig. 5

Ensemble model employed in the proposed approach for EEG-based emotion recognition

5 Results & discussion

The proposed approach has been evaluated on two datasets, namely SEED and DEAP. The performance of the various models has been investigated using k-fold cross-validation [32] with k = 10. The individual performances of the CNN, LSTM, and hybrid models have been obtained, as well as the performance of the ensemble model. Each model is trained for 60 epochs with a batch size of 64. The learning rate (LR) is not fixed, because with a fixed LR the loss saturates and the model stops improving. To overcome this limitation, we have employed an LR annealer, which makes the learning rate a variable parameter. Note that the same features and experimental setup are used for both datasets.
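The 10-fold protocol can be illustrated with a small index splitter. This is a generic sketch of k-fold cross-validation, not the authors' code: the function name and the sample count are hypothetical, and shuffling of the samples before splitting (commonly done in practice) is omitted for brevity.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Hypothetical dataset of 25 samples
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(25, k=10)):
    print(fold, len(train_idx), len(test_idx))
```

Every sample is used for testing exactly once, and the reported accuracy is typically the mean over the ten held-out folds.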

5.1 Experimental results for SEED data

The performance of the individual models is measured by parameters such as weighted average precision (WAP), weighted average sensitivity (WAS), and weighted average F1 score (WAF1). The F1 score is a good metric for checking the stability of a model. Table 1 lists the performance parameters of the individual models, the hybrid model, and the ensemble model for EEG emotion recognition.
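The weighted averages can be computed from a confusion matrix as sketched below, where each class contributes in proportion to its number of true samples. The weighting by class support is the usual convention and is assumed here; the confusion matrix values are invented for illustration.

```python
def weighted_metrics(confusion):
    """Return (WAP, WAS, WAF1) from confusion[i][j], the number of
    samples of true class i that were predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    wap = was = waf1 = 0.0
    for k in range(n_classes):
        tp = confusion[k][k]
        support = sum(confusion[k])                    # true samples of class k
        predicted = sum(row[k] for row in confusion)   # samples predicted as k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = support / total
        wap += weight * precision
        was += weight * recall
        waf1 += weight * f1
    return wap, was, waf1

# Hypothetical 3-class confusion matrix (negative, neutral, positive)
confusion = [
    [45, 3, 2],
    [4, 44, 2],
    [1, 2, 47],
]
print(weighted_metrics(confusion))
```

With support weighting, the weighted average sensitivity coincides with the overall accuracy, which is a useful sanity check when reading the tables.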

Table 1 Classification performance of individual and ensemble model for EEG-based emotion recognition on SEED data

The experimental results suggest that the CNN and LSTM models individually achieve classification accuracies (ACC) of 89.53% and 89.99%, respectively. The hybrid model achieves an ACC of 93.46%, and the stack-based ensemble model achieves an ACC of 97.16%. The results for SEED data are tabulated in Table 1. Table 1 shows that the ensemble-based method provides improved performance over the other models. We believe this is because the base models are not weak and provide good accuracy by themselves (Fig. 6).

Fig. 6

Box-and-whisker plots of the ACC achieved by the proposed (top-left) CNN-based model, (top-right) LSTM-based model, (bottom-left) hybrid model, and (bottom-right) ensemble stack model for EEG-based emotion recognition

Figure 7 shows the loss function and the LR plotted against the epoch. When the LR saturates after some epochs, there is no significant decrease in loss, which results in poor model performance. On the other hand, when we decrease the LR as the loss saturates, the loss settles more quickly and system performance improves.
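One common way to realise such an LR annealer is a reduce-on-plateau rule, sketched below. The paper does not specify the annealing schedule, so the class name, the reduction factor, the patience, and the loss values are all assumptions chosen for illustration.

```python
class PlateauAnnealer:
    """Multiply the LR by `factor` when the loss has not improved
    for `patience` consecutive epochs (hypothetical settings)."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, loss):
        """Record the loss of one epoch and return the LR for the next one."""
        if loss < self.best:
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

annealer = PlateauAnnealer(lr=0.01, factor=0.5, patience=2)
losses = [1.0, 0.8, 0.8, 0.8, 0.7]  # illustrative loss curve
print([annealer.step(loss) for loss in losses])
# [0.01, 0.01, 0.01, 0.005, 0.005]
```

The LR stays high while the loss is still falling and is reduced only once progress stalls, which matches the behaviour described for Fig. 7.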

Fig. 7

(a) Loss (b) Learning rate of the proposed CNN based model, RNN based model, and Hybrid model on SEED data

We also show box-and-whisker plots of ACC in Fig. 6 to shed more light on the results. The box indicates the inter-quartile range (IQR), which spans the results from the 25th to the 75th percentile, and the orange line marks the median. The minimum and maximum values are marked by the solid black lines at the top and bottom of each plot. The outliers, marked by circles, are results that fall outside the whisker range, which extends up to 1.5 × IQR beyond the box.
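The quantities drawn in such a plot can be computed as in the following sketch. The per-fold accuracies are invented for illustration, and the quartile convention (linear interpolation between data points, the `inclusive` method of the standard library) is an assumption about how the plots were produced.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR, whisker ends, and outliers for one box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # lowest result within the fences
        "whisker_high": max(inside),   # highest result within the fences
        "outliers": outliers,          # drawn as circles
    }

# Hypothetical per-fold accuracies (%), not results from the paper
fold_acc = [96.0, 96.5, 97.0, 97.0, 97.5, 98.0, 90.0]
summary = box_plot_summary(fold_acc)
print(summary["median"], summary["iqr"], summary["outliers"])
```

A narrow box with short whiskers indicates that the accuracy varies little across folds, which is the property highlighted when the models are compared.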

We further compare the experimental results of the proposed approach with past benchmark methodologies for emotion recognition on the SEED dataset. Table 2 presents this comparison. It can be observed that the proposed approach outperforms the previous methodologies. It can also be noticed that the standard deviation (STD) of the proposed approach is much lower than that of the other approaches in Table 2, which reflects the repeatability and reproducibility of the proposed approach.

Table 2 Comparison of previous benchmark methodologies for EEG-based emotion recognition on SEED data

5.2 Experimental results for DEAP data

We have also employed the DEAP dataset to evaluate the performance of the proposed approach for EEG-based emotion analysis with the same features and experimental setup. The performance of the proposed approach on the DEAP dataset is tabulated in Table 3. It can be observed from Table 3 that the ensemble obtains the maximum performance compared with the individual models. The CNN-based, LSTM-based, and hybrid models achieve classification performances of 63.50%, 63.89%, and 64.02%, respectively. Table 3 thus demonstrates that the ensemble model achieves better performance than the individual models. The performance of other existing works on DEAP data with the same DE feature is compared in Table 4. It can be observed from Table 4 that the proposed system attains better performance than the existing methods for EEG-based emotion recognition.

Table 3 Classification performance of individual and ensemble model for EEG-based emotion recognition on DEAP data
Table 4 Comparison of previous benchmark methodologies for EEG-based emotion recognition on DEAP data

In future work, we plan to extend this study by proposing new features for effective emotion recognition from EEG signals. The SEED and DEAP datasets will also be evaluated with the new features to further improve the existing performance. We also intend to test the proposed model in the development of other EEG-based neuronal systems.

6 Conclusion

This paper proposes an ensemble learning-based EEG emotion recognition system. First, the differential entropy is extracted from different frequency bands of the EEG signals. These features are then fed to CNN- and LSTM-based models. The hybrid model is developed by combining the sub-blocks of the CNN and LSTM models. The ensemble model is built on the CNN, LSTM, and hybrid models. The experimental results suggest that the ensemble model achieves better classification performance than the other models employed in the proposed approach. The proposed ensemble model outperforms the compared methodologies with 97.16% ACC for EEG-based emotion recognition on the SEED dataset. The proposed method is also evaluated on the DEAP dataset and obtains 65% ACC using the same features and model parameters. All the models provided impressive accuracy individually and showed a much lower standard deviation.

BCI is an upcoming field that relies heavily on the accurate, repeatable, and efficient classification of brain waves, which are frequently recorded by EEG methods. The experimental results suggest that the proposed approach is suitable for this purpose and paves the way for upcoming research fields such as humanoid robots, sophisticated prosthetics, and AI-assisted healthcare and recovery. In future, a hardware implementation of the proposed model can be developed.