1 Introduction

Emotions strongly influence how people interpret information, make decisions, and shape their actions when interacting with others. The study of psychological and behavioral interaction is a relatively recent area of research. Emotion analysis is an essential task in everyday life, particularly in the domain of human-computer interaction (HCI) [1, 2], as it can help improve the quality of communication between machine intelligence and the human brain. Emotion analysis is also used in health care to understand patients’ neurocognitive functioning [3], and physiological signals such as galvanic skin response, heart rate, electromyography and electroencephalography (EEG) are commonly used to assess emotional state [4]. The relationship between various emotional states and EEG signals has been extensively analyzed in the last decade [5]. EEG is an effective strategy for measuring brain responses to emotional stimuli because it is non-invasive, fast, and inexpensive. EEG signals are commonly applied to emotion recognition since they can be used to investigate different aspects of emotions based on frequency band, electrode location, and temporal details [6]. In recent studies, EEG signals have been shown to yield good classification scores for emotion classification and can successfully describe how cognitive and emotional acts are physiologically interrelated. Based on these peculiarities, EEG signals are considered a primary source of information for emotion detection in HCI systems. It is a promising area of study that has garnered a lot of attention from different disciplines, ranging from neuroscience to computer science. While previous investigations have looked into different methodologies for emotion recognition, the complexity of emotional patterns means that emotion recognition techniques are still in demand by many innovative applications. Emotion recognition is often essential in medical care in order to understand a patient’s social and mental state. Virtual-world-based applications [7], driving assistance [8], gaming [9], mental health care [10], and social security [11] are some of the other application areas of emotion recognition.

EEG can acquire brain activity signals that can be explored using machine learning algorithms to predict emotion-related behaviors and translate them into instructions. Typically, either a classification or a regression model is used for this task; the choice of machine learning algorithm is determined by the desired output and the sensory input presented throughout the experiment. Machine learning algorithms provide the ability to learn from problem-specific training data in order to automate the process of building analytical models and solving associated tasks. Deep learning is a machine learning paradigm based on artificial neural networks. Deep learning models surpass shallow machine learning algorithms and conventional data analysis approaches in many applications. The main distinction between deep learning and traditional machine learning is how performance scales as data volume increases. Hand-crafting features takes time and requires specialized knowledge, and the quality of the extracted features determines the performance of most ML algorithms. A significant difference of DL from traditional machine-learning algorithms is that it obtains high-level features directly from data; as a result, DL reduces the effort required to design a feature extractor for each problem. Therefore, in this research work both machine learning and deep learning based algorithms have been implemented for emotion recognition, and their performance accuracy has been evaluated and compared. This paper aims to investigate and analyze different classifiers and their performance accuracy on EEG signals for emotion detection.

Fig. 1 Workflow of Proposed Machine and Deep Learning Based Model for Emotion Classification

The overall workflow of the designed architecture is given in Fig. 1. In the initial stage, EEG signals are acquired while a subject wearing a Muse EEG headset is shown an emotional video on the screen. The second stage includes preprocessing of the acquired EEG signals using a basic filtering process in order to remove artifacts, followed by a feature extraction step that extracts statistical, temporal, frequential and time-frequency based features from the pre-processed EEG signals. Finally, these features are divided into train and test samples, and the train samples are used as input to the proposed machine and deep learning based models for a multi-class task representing the three emotional states, i.e. positive, negative and neutral. The paper also presents emotion classification accuracy results of the various implemented machine and deep learning based networks designed to classify stimulus-evoked EEG recordings. The results show that the deep learning based models provide better accuracy than some of the machine learning based classifiers. The rest of the paper presents an extensive literature review in Sect. 2, followed by the methodology in Sect. 3, including the dataset description, the proposed architecture and the implementation of the various models. Finally, Sect. 4 documents the results and discussion, followed by the conclusion in Sect. 5.

2 Related Work

An extensive literature survey has been carried out covering three aspects: (i) previous work on the various extracted features and feature extraction techniques for EEG signals; (ii) machine learning based EEG-evoked emotion recognition; and (iii) deep learning based emotion recognition tasks.

2.1 Related Work on Feature Extraction Techniques on EEG Signals

Several innovative and BCI-inspired studies focused on feature extraction and machine learning techniques have been published. Various feature extraction techniques have been used successfully in EEG analysis to retrieve complex and non-linear patterns that can help with classification tasks [12, 13]. Existing complex pattern discovery strategies rely on hand-crafted procedures for extracting prominent features from EEG signals, which are then classified using a variety of classifiers. Several studies have concentrated on the feature extraction step, with the goal of identifying the most important features for EEG-evoked emotions. The EEG-evoked emotion classification procedure can be separated into two primary stages, the first being the extraction of features from EEG signals that represent the prominent emotional state. In EEG based processes, signal processing techniques such as Fourier transforms, wavelet transforms and chaos theory are widely used for feature extraction. However, since these signals are dynamic, time-variant and non-linear in nature, the study of EEG signals is intricate with frequency and time domain techniques [14]. Measures of complexity and chaos within time domain-based mechanisms have been demonstrated to be the most discriminative in EEG classification [15]. The extracted EEG features can be separated into the frequency and temporal domains. Temporal features primarily capture time-related information of EEG signals, for example, fractal dimension [16], Hjorth features [17], and higher order crossing features [18]. Features extracted from the frequency domain aim to capture frequency-based EEG information, for example, differential entropy (DE) [19], rational asymmetry (RASM), power spectral density (PSD) [20], and so on. PSD features computed using the Fast Fourier Transform and Short-Time Fourier Transform, as described in [21], are considered the most basic and widely used. Other studies looked into the characteristics of decomposition techniques, including Intrinsic Mode Functions (IMF) and the Discrete Wavelet Transform (DWT) [22]. The multiband feature matrix (MFM) in [23] ensures that the location of the sensors on the scalp is taken into account during the feature extraction process. Other reported EEG features from different works are Shannon entropy, sample entropy, log-energy entropy, differential entropy, wavelet entropy, approximate entropy, Common Spatial Patterns (CSP) and the Asymmetry Index (AI) on the five frequency bands of the EEG signal, i.e. the delta, theta, alpha, beta, and gamma bands [24]. Due to the complicated nature of brain signals, recent publications consider an increasing number of non-linear features such as higher order crossings or fractal dimensions. Nonetheless, simple features such as band powers have become almost omnipresent, despite being premised on various underlying algorithms, and are often referred to solely for comparison purposes [25].

2.2 Machine Learning Based EEG Evoked Emotion Classification Related Work

Multiple machine learning (ML) based methods, such as linear discriminant analysis (LDA) [26], Naive Bayes (NB), support vector machines (SVM) [27], random forests (RF), k-nearest neighbors (k-NN) [28], and others, have been utilized with reasonable accuracy as EEG-evoked classifiers for emotion classification tasks. The Multi-Layer Perceptron (MLP) was used to detect emotions from normalized feature vectors of EEG signals with a reported accuracy of 69.69%. MLP and SVM classifiers were again implemented in [20] to classify emotion states from EEG recordings taken while listening to music, using PSD and DASM features, which improved the classification precision to 82.29%. An average accuracy of 87.53% was achieved by SVM from the linear dynamic features of EEG signals recorded while watching videos [29]. Some non-linear EEG features were also extracted and k-nearest neighbors was used to classify the extracted characteristics into emotional states, in order to demonstrate the superiority of a non-linear extraction method over earlier frequency-based extraction techniques [30]. Another work extracted different nonlinear features from the empirical mode decomposition (EMD) of an EEG dataset (SEED), and the evaluated attributes were fed into a random forest (RF) algorithm to obtain classification accuracies of 93.87, 91.45 and 89.59% for negative, neutral and positive emotions respectively [31]. In [32], the ICA technique was applied to extract features from the SEED dataset, and these features were classified by an ANN. A tunable Q wavelet transform (TQWT) algorithm was employed as a feature extractor, and then a rotation forest ensemble (RFE) classifier was fused with several classifiers such as k-NN, SVM, DT, ANN and RF algorithms, attaining over 93% classification precision with RFE + SVM [33]. Recently, Laura et al. [34] used Naïve Bayes, SVM with various kernels, Random Forest and an Artificial Neural Network classifier for emotion classification on the AMIGOS dataset.

In the current work, some of the conventional ML based algorithms have been implemented on EEG features for comparative analysis and their classification precision measured, which gave further inspiration to develop models that refine the results.

2.3 Deep Learning Based EEG Evoked Emotion Classification Related Work

Deep learning methods, as opposed to traditional machine learning approaches, extract deeply layered features automatically from large datasets and are better suited to representing the most prominent features of the data. Deep learning methods have been successful in recognizing emotions in numerous studies. In the last few years, deep learning (DL) methodology has achieved great success in the fields of computer vision [35, 36], natural language processing [37], speech recognition [38], and visual-stimuli-based EEG classification [39], as well as EEG-based emotion recognition [40]. DL based approaches such as Deep Belief Networks (DBNs) [41], Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNNs) [40], Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM) [42] and others have been adapted for EEG-dependent emotion classification. LSTM is an enhanced form of the Recurrent Neural Network (RNN) that is particularly good at processing time series data such as EEG signals. The LSTM network is well suited for EEG-related tasks because of its ability to remember past information; it also prevents the vanishing gradient problem that standard RNNs suffer from, and it has been used in a wide range of applications. Among the various deep learning models, CNN is one of the most accurate and promising deep neural network models for classification tasks. Different CNNs, as well as some multimodal frameworks with CNNs, have been commonly used for object, text, recorded video, visual and speech classification tasks [43, 44]. Recently the CNN, along with its variants, has been implemented for emotion recognition tasks [45].

3 Methodology

3.1 Dataset Acquisition

In this experiment, a publicly accessible Kaggle emotion dataset is utilized for the emotion classification task [46]. Data was collected using the Muse headband sensor from a male and a female subject for three minutes per emotion state, i.e. positive, negative, neutral. The Muse is a commercial EEG sensing system with five dry sensors: one reference point (NZ) and four electrode channels (AF7, AF8, TP9, TP10) for recording brain wave activity. In addition, six minutes of neutral data were collected while the subjects were at rest, and six (negative and positive) stimuli were used to evoke the emotions. Subjects were instructed not to close their eyes during any of the sessions. The subjects in the relaxed task were advised to ease their muscles and relax while listening to moderate music and sound effects intended to aid meditation. Similarly, another test was performed to record a neutral emotion, but with no stimuli at all; this test was performed before the others to avoid the long-term consequences of a relaxed state. The Muse headset’s EEG samples were automatically recorded for sixty seconds each. The data was found to stream at a variable frequency between 150 and 270 Hz. The recorded EEG data was sampled before the features were extracted.

3.2 Feature Extraction

Since EEG data is non-linear and non-stationary in nature, single values cannot be used to determine the class. The classification of EEG signals depends mostly on the temporal behavior of the signals rather than on individual values, and identifying the characteristics that govern the various frequency bands of an EEG signal often necessitates temporal analysis. Temporal statistical extraction is carried out for these purposes, using a sliding time window of size 1 s with a 0.5 s overlap. In the current research, some previously validated statistical attributes of EEG signals [47] are used for feature extraction. The following items describe the various statistical feature variants that are extracted from the initial dataset (a code sketch of the windowing and basic statistics follows the list):

  1. Mean: A sequence of signal values \(x_{1}, x_{2}, x_{3}, \ldots , x_{n}\) is considered within each time frame, and the mean is calculated as in Eq. 1:

    $$\begin{aligned} \hbox {mean}(\mu )= \frac{1}{n}\sum _{i=1}^{n}{x_{i}} \end{aligned}$$
    (1)
  2. Standard Deviation: The standard deviation of the signal values in each time frame is calculated as:

    $$\begin{aligned} \hbox {SD}(\sigma )= \sqrt{\frac{\sum _{i=1}^{n}({x_{i}-\mu })^2}{n}} \end{aligned}$$
    (2)
  3. Skewness and Kurtosis: These indicate the asymmetry and peakedness of the waveform, i.e. the third- and fourth-order statistical moments about the mean, and are defined in Eqs. 3 and 4:

    $$\begin{aligned} \hbox {Skewness}(s)= \frac{m_{3}}{\sigma ^{3}} \end{aligned}$$
    (3)
    $$\begin{aligned} \hbox {Kurtosis}(k)= \frac{m_{4}}{\sigma ^{4}} \end{aligned}$$
    (4)

    where \(m_{k}=\frac{1}{n}\sum _{i=1}^{n}({x_{i}-\mu })^{k}\) is the k-th central moment about the mean, with k = 3 for skewness and k = 4 for kurtosis.

  4. Maximum value inside every specific time interval: \({max_{1},max_{2},\ldots , max_{n}}\).

  5. Minimum value inside every specific time interval: \({min_{1},min_{2},\ldots , min_{n}}\).

  6. Min-Max Derivatives: By splitting the time frame in half and calculating the mean of each half, the min-max derivative values can be calculated as in Eq. 5:

    $$\begin{aligned} D=\frac{\mu ^p-\mu ^{p/2}}{2} \end{aligned}$$
    (5)

    where p = 1 s and p/2 = 0.5 s, i.e. \(\mu ^{p/2}\) is computed over the second half of the one-second data sequence. The same strategy is used to obtain the derivatives of the max and min features in the sub time windows (t) (Eqs. 6 and 7):

    $$\begin{aligned} max(t)=\frac{max^p-max^{p/2}}{2} \end{aligned}$$
    (6)
    $$\begin{aligned} min(t)=\frac{min^p-min^{p/2}}{2} \end{aligned}$$
    (7)
  7. After slicing the original one-second time window into four batches of 0.25 s each, further temporal features are extracted: the mean, maximum, and minimum values of each batch, {\(\mu _1, \mu _2, \mu _3, \mu _4\)}, {\(max_1, max_2, max_3, max_4\)} and {\(min_1, min_2, min_3, min_4\)}.

  8. The 1D Euclidean distances between all mean values are then calculated using the formulas \(\delta _{\mu 12}= |\mu _1- \mu _2|\), \(\delta _{\mu 13} = |\mu _1- \mu _3|\), \(\delta _{\mu 14} = |\mu _1- \mu _4|\), \(\delta _{\mu 23} = |\mu _2- \mu _3|\), \(\delta _{\mu 24} = |\mu _2- \mu _4|\), \(\delta _{\mu 34} = |\mu _3- \mu _4|\), and likewise for the minimum and maximum values, yielding 18 distance-based features. Combining the four mean values, four max values and four min values with these 18 features gives 30 features per time window for each signal, for a total of 150 temporal features per second over the five signals.

  9. From the preceding 150-feature vector, the last 6 features are discarded to obtain 144 features, which allows a \(12 \times 12\) square matrix to be constructed in order to evaluate the log-covariance as in Eq. 8:

    $$\begin{aligned} lcM=U(logm(cov(M))) \end{aligned}$$
    (8)

    where U(.) returns the upper triangular elements, logm(.) is the matrix logarithm, and the covariance matrix is given by Eq. 9:

    $$\begin{aligned} cov(M)=cov_{ij}=\frac{1}{n}\sum _{k=1}^{n}(x_{ki}-\mu _{i})(x_{kj}-\mu _{j}) \end{aligned}$$
    (9)
  10. Entropy: Entropy is an instability measure used in brain-machine interface applications to quantify the amount of chaos in a system; it is a non-linear measure that quantifies the level of relevance of the data. Shannon entropy has proven efficient for non-linear time series data and is calculated as in Eq. 10:

    $$\begin{aligned} \hbox {Shannon Entropy (SE)} = -\sum _{j} S_j \times \log (S_j) \end{aligned}$$
    (10)

    where \(S_{j}\) denotes each (normalized) signal value within the 1 s time window over which the entropy is determined.

  11. Log-Energy Entropy: The log-energy entropy is computed by splitting the same time period in half, as in Eq. 11:

    $$\begin{aligned} loge=\sum _{i} log (S_{i}^{2})+ \sum _{j} log (S_{j}^{2}) \end{aligned}$$
    (11)

    where i and j iterate over the values from the first sub-window (0–0.5 s) and the second sub-window (0.5–1 s), respectively.

  12. Fast Fourier Transform (FFT): The FFT is a useful tool for analyzing the spectrum of a time series and is computed at each time window as in Eq. 12:

    $$\begin{aligned} FFT=\sum _{n=0}^{N-1} S_{n}^{t}\, e^{-i2\pi k \frac{n}{N}}, \quad k=0,\ldots , N-1 \end{aligned}$$
    (12)

The EEG signals are represented using the statistical features mentioned above, computed for each electrode channel and time window. This produces a feature matrix of size \(2147 \times 2548\), where 2147 is the number of samples (rows) and 2548 is the number of features per sample. These features are then used as input to the various models for classification of the emotion states.
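As an illustration of the windowing and the basic statistics above, the following is a minimal sketch, assuming the raw per-channel Muse recordings are available as 1D NumPy arrays with a fixed effective sampling rate; the function name and the choice of fs are hypothetical, and only a subset of the described features (Eqs. 1–7) is shown:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def window_features(sig, fs=150, win=1.0, overlap=0.5):
    """Per-window statistics for one EEG channel (subset of the features above).

    sig: 1D array of raw samples; fs: assumed effective sampling rate in Hz
    (the Muse stream here varied between 150 and 270 Hz).
    """
    size = int(fs * win)                      # 1 s window
    step = int(fs * (win - overlap))          # 0.5 s hop -> 50% overlap
    feats = []
    for start in range(0, len(sig) - size + 1, step):
        w = sig[start:start + size]
        h2 = w[size // 2:]                    # second half of the window
        feats.append([
            w.mean(), w.std(),                    # Eqs. 1-2
            skew(w), kurtosis(w, fisher=False),   # Eqs. 3-4
            w.max(), w.min(),                     # items 4-5
            (w.mean() - h2.mean()) / 2,           # min-max derivative, Eq. 5
            (w.max() - h2.max()) / 2,             # Eq. 6
            (w.min() - h2.min()) / 2,             # Eq. 7
        ])
    return np.asarray(feats)
```

The batch statistics, distance features, log-covariance (e.g. via scipy.linalg.logm), entropies and FFT (e.g. numpy.fft.fft) would be appended per window in the same fashion.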

3.3 EEG Evoked Emotion Classification Using Machine and Deep Learning Algorithm

The current work proposes an EEG emotion classification model and compares various machine learning and deep learning based models for classifying EEG signals. The ML and DL based architectures implemented and evaluated in this paper have previously been applied to emotion recognition tasks on different emotion datasets; here, their performance and classification accuracy are measured on the dataset described in the section above.

3.4 Implemented Machine Learning Classification Algorithms for EEG Evoked Emotion Classification

In this paper, machine learning based methodologies, namely Support Vector Machine (SVM), Decision Tree (DT), Gaussian Naive Bayes (GNB), K-Nearest Neighbor (k-NN), Random Forest (RF), Multi-Layer Perceptron (MLP) and Artificial Neural Network (ANN), have been implemented, trained and tested on the emotion dataset described in Sect. 3.1.

A short description of each technique, with its parameter settings, follows:

Support Vector Machine (SVM) is widely favored as it produces high accuracy with comparatively little computation. SVM can be applied to solve regression as well as classification problems, although it is most commonly utilized in classification tasks. In this emotion classification task, SVM has been implemented on the dataset described in Sect. 3.1, and the extracted EEG temporal feature sets were fed to the SVM model as input for classifying each emotion state as negative, neutral or positive. The Radial Basis Function (RBF) kernel has been applied for this purpose, as it is widely used for many classification problems and has a localized and finite response along the entire x-axis. An SVM fundamentally performs binary classification, separating data points into two classes; the same method is applied to the multi-class problem by decomposing it into numerous binary classification problems. The goal is to map the emotion-based feature set to a high-dimensional space in order to achieve linear separation between each pair of classes. To classify the features of two classes, the SVM generates candidate hyperplanes and selects the one that maximizes the margin between the features of each pair of classes for multi-class emotion classification.
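A minimal sketch of this setup with scikit-learn is shown below, assuming X is the \(2147 \times 2548\) feature matrix and y holds the three emotion labels; the feature-scaling step and the random seed are assumptions, not details given in the paper:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Split the feature matrix into train/test sets (80/20, as used later).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# RBF kernels are sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)

# "ovo" decomposes the 3-class problem into pairwise binary problems.
svm = SVC(kernel="rbf", decision_function_shape="ovo")
svm.fit(scaler.transform(X_train), y_train)
print("SVM test accuracy:", svm.score(scaler.transform(X_test), y_test))
```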

Random Forest is among the most popular machine learning algorithms because of its simplicity and versatility. It is based on building a “forest" of multiple decision trees and combining them with a bagging algorithm, which can lift the overall classification accuracy. Here, the estimator parameter, i.e. the number of trees in the forest, is 50.

K-Nearest Neighbor (k-NN) is a supervised machine learning algorithm which can be utilized for both regression and classification problems. In k-NN, the similarity between feature vectors is used to predict the label of a new data point, which implies that the class assigned to a new test tuple is determined by how closely it resembles its nearest neighbors in the training set. In the current work the default number of neighbors, 5, is used.

Gaussian Naive Bayes is a basic yet surprisingly strong predictive method. NB classifiers are a family of classification algorithms based on Bayes’ theorem with the "naive" assumption that every pair of features is conditionally independent given the class variable. Gaussian Naive Bayes is a variant of Naive Bayes that assumes the features follow a Gaussian (normal) distribution and works well with continuous data.

Decision Tree is a systematic classification technique for both binary and multi-class problems. It queries the dataset, i.e. the feature sets of the EEG-evoked emotions, and splits the records at each node to reach the final emotional states. The decision tree algorithm can be displayed as a binary tree: a query is asked at the root and at each internal node, and the data at that node is further divided into multiple records with different features, while the leaves of the tree are the classes into which the dataset is divided. High-dimensional information can be handled with remarkable precision using decision trees. The criterion for measuring the quality of a split is either "gini" for the Gini impurity or "entropy", the latter of which is employed in this work.
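The four classifiers above can be instantiated in scikit-learn roughly as follows, a sketch using the stated parameters (50 trees, k = 5, entropy criterion) and reusing the train/test split from the SVM snippet:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "RF":  RandomForestClassifier(n_estimators=50),      # 50 trees in the forest
    "kNN": KNeighborsClassifier(n_neighbors=5),          # default k = 5
    "GNB": GaussianNB(),                                 # Gaussian likelihoods
    "DT":  DecisionTreeClassifier(criterion="entropy"),  # "gini" is the alternative
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name} test accuracy: {clf.score(X_test, y_test):.4f}")
```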

Artificial Neural Networks (ANN), also known as Neural Networks (NNs), are computing structures loosely modeled on the biological neural networks that make up the human brain. An ANN is a supervised learning technique which stores the acquired knowledge in the form of connected network units; this knowledge is not directly interpretable by a person, a consideration that prompted the extraction of classification rules in data mining. The classification process begins with the dataset, which has two components: a training sample and a test sample. The training sample is used to train the network, while the test sample is used to assess the classifier’s accuracy. Various methods, such as the hold-out process, cross-validation, and random sampling, can be used to divide a data set. In general, the learning steps of a neural network are as follows:

  1. The input, output, and hidden layers each have a fixed number of nodes.

  2. An algorithm (such as backpropagation) is chosen for the learning method.

A neural network’s most vital capability, adapting its structure and learning by changing its weights, makes it well suited to the artificial intelligence field. In this work, three dense layers have been used to build the ANN architecture for emotion recognition, as sketched below. The first dense layer with 256 neurons takes the input emotion feature vectors described in Sect. 3.2 and provides the resultant transformed feature map to the second dense layer with 128 neurons. Finally, the output dense layer with 3 neurons, representing the 3 emotion states, takes the output from the previous dense layer and provides the emotion classification accuracy along with the other evaluation parameters for the ANN model on the emotion dataset.
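A possible Keras realization of this three-layer ANN is sketched below; the hidden-layer activations (ReLU) and the softmax output are assumptions consistent with the categorical cross-entropy loss used later, since the paper does not state them for this model:

```python
from tensorflow import keras
from tensorflow.keras import layers

ann = keras.Sequential([
    layers.Input(shape=(2548,)),            # one row of the feature matrix
    layers.Dense(256, activation="relu"),   # first dense layer, 256 neurons
    layers.Dense(128, activation="relu"),   # second dense layer, 128 neurons
    layers.Dense(3, activation="softmax"),  # output layer: 3 emotion states
])
```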

Multi-Layer Perceptron (MLP) is a form of ANN. In its simplest form, an MLP consists of three layers of nodes: an input layer, at least one hidden layer, and an output layer. For training, the MLP employs a supervised learning mechanism known as backpropagation, which conducts learning over a set amount of time measured in epochs. Backpropagation is an application of automatic differentiation: the network’s outputs are compared with the ground truth to compute the classification error, which is propagated backwards through the network to extract a gradient with respect to the neuron weights; the weights are then updated in what is referred to as gradient descent optimization. In this way, the weights of the neurons are adjusted according to the error rate of the network, and following the learning process, the trained neural network acts as a function mapping inputs to the output class. A simple MLP model has been implemented in this paper with two dense layers of 100 and 3 neurons for the three-emotional-state classification.
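Correspondingly, a minimal sketch of this two-layer MLP in the same Keras style (again assuming ReLU and softmax activations, which the paper does not specify):

```python
mlp = keras.Sequential([
    layers.Input(shape=(2548,)),            # flattened EEG feature vector
    layers.Dense(100, activation="relu"),   # single hidden layer, 100 neurons
    layers.Dense(3, activation="softmax"),  # 3 emotion states
])
```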

3.5 Implemented Deep Learning Classification Algorithms for EEG Evoked Emotion Classification

In this work, three deep learning based strategies have been implemented to design emotion recognition models with improved classification accuracy, and their performance is compared with the ML methods described above. The DL based models, namely the Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), are trained and evaluated for detection of the three emotion states (Fig. 2).

Convolutional Neural Network (CNN) is already established as a popular and widely applicable deep learning approach. Its ability to extract feature maps through the convolutional layers and to select the best feature maps by means of the max-pooling layer removes the separate feature extraction and selection stages of conventional classification tasks.

Fig. 2 Architecture of Proposed CNN Model for Emotion Classification

In this paper, the CNN architecture is composed of 2 Conv1D layers, a MaxPooling1D layer, a Flatten layer and 2 fully connected (FC) layers. The first Conv1D layer receives the feature vector (2548, 1), representing a single-dimension input feature row, and outputs a feature map by applying 16 kernels of size 10 with stride 1 and no padding. The second Conv1D layer comprises 16 kernels of size 3 with stride 1 and no padding, yielding a resultant feature map, followed by a 1D max-pooling layer with pool size 8 which reduces the dimension of the output vector. Finally, the resultant feature maps are flattened into a 1D vector, and 2 fully connected layers with 100 and 3 output neurons derive the probabilities of the 3 classes, i.e. the negative, positive and neutral emotion states. Both Conv1D layers use the non-linear ReLU (Rectified Linear Unit) activation function, which zeroes out negative values in the feature map. The first dense layer likewise employs the ReLU activation function, while the output dense layer employs the softmax activation function, which returns class probabilities between 0 and 1. A code sketch of this architecture follows.
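A minimal Keras sketch of the described stack, with the layer sizes taken from the text and no further details implied:

```python
cnn = keras.Sequential([
    layers.Input(shape=(2548, 1)),                      # single-channel 1D feature row
    layers.Conv1D(16, kernel_size=10, strides=1,
                  padding="valid", activation="relu"),  # 16 kernels, size 10
    layers.Conv1D(16, kernel_size=3, strides=1,
                  padding="valid", activation="relu"),  # 16 kernels, size 3
    layers.MaxPooling1D(pool_size=8),                   # reduce output dimension
    layers.Flatten(),
    layers.Dense(100, activation="relu"),               # first FC layer
    layers.Dense(3, activation="softmax"),              # class probabilities
])
```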

Fig. 3 a Proposed LSTM model; b Proposed GRU model

Long Short-Term Memory (LSTM): Human emotions change over time, and this variability is expressed in the temporal interrelationships of the EEG signal. The Long Short-Term Memory network (LSTM), an enhanced design of the Recurrent Neural Network (RNN), is used to investigate these associations. An RNN is a network of loops in which knowledge is transferred from the current loop to the following loop; this looping nature makes the architecture useful for time series data. Standard RNNs, however, have a problem with long-term dependencies: as the distance between loops widens, an RNN can lose its ability to link information. Because of the explicit gating structure of its recurring module, both short- and long-term dependencies can be learned by an LSTM, which is why it was used in this study to examine the temporal correlations of the EEG signals. The proposed LSTM architecture is composed of 2 LSTM layers, 2 dropout layers and a dense layer. The extracted feature vector is fed to the first LSTM layer, which consists of 64 consecutive LSTM blocks with the ReLU activation function to learn long-term dependencies of the input feature vector. These learned output features are then fed to a dropout layer which randomly drops 20% of the neurons to prevent overfitting and improve the model’s generalization. The second LSTM layer is then applied, with 32 LSTM blocks and the sigmoid function, to obtain more generalized feature vectors, followed by another dropout layer. Finally, a dense layer of 3 neurons with the sigmoid function, indicating the 3 emotion states, is applied to obtain the classification accuracy. The proposed architecture is shown in Fig. 3a, the architecture of a single LSTM cell is shown in Fig. 4, and a code sketch follows.
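A minimal Keras sketch of this LSTM stack, treating each 2548-dimensional feature row as a length-2548 sequence of single values (an assumption about the input shaping, which the paper does not spell out):

```python
lstm = keras.Sequential([
    layers.Input(shape=(2548, 1)),              # features as a 1D sequence
    layers.LSTM(64, activation="relu",
                return_sequences=True),         # 64 LSTM blocks
    layers.Dropout(0.2),                        # drop 20% of neurons
    layers.LSTM(32, activation="sigmoid"),      # 32 LSTM blocks
    layers.Dropout(0.2),
    layers.Dense(3, activation="sigmoid"),      # 3 emotion states
])
```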

Fig. 4 Architecture of an LSTM cell

Fig. 5 Architecture of a GRU cell

Table 1 Summary of implemented DL based model with layers, parameters and activation function

Gated Recurrent Unit (GRU) is a newer form of recurrent neural network that is very similar to the LSTM. GRUs replace the LSTM’s cell state and forget/input gates with a reset gate and an update gate to control the flow of information. In this experiment, a GRU model has also been implemented to build an emotion classification model. The extracted emotion feature vectors are fed to the GRU layer, in which 256 GRU neurons connected to each other retain the previous feature details while learning the current feature map to classify the emotions. These learned features are then converted to a 1D vector using a flatten layer. Finally, the fully connected layer with 3 neurons, representing the 3 emotion states, provides the classification accuracy. The proposed architecture is shown in Fig. 3b, the architecture of a single GRU cell is shown in Fig. 5, and a code sketch follows.
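In the same hedged Keras style, the GRU model might be sketched as follows; return_sequences=True is an assumption, made so that the flatten layer has a sequence of learned features to work on:

```python
gru = keras.Sequential([
    layers.Input(shape=(2548, 1)),
    layers.GRU(256, return_sequences=True),  # 256 interconnected GRU neurons
    layers.Flatten(),                        # learned features to a 1D vector
    layers.Dense(3, activation="softmax"),   # 3 emotion states
])
```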

Fig. 6 ANN model’s a Accuracy graph; b Loss graph

Fig. 7 CNN model’s a Accuracy graph; b Loss graph

Fig. 8 LSTM model’s a Accuracy graph; b Loss graph

Fig. 9 GRU model’s a Accuracy graph; b Loss graph

Fig. 10 Comparison graph of all implemented ML and DL based models

Fig. 11 Comparison graph of all implemented ML based models

Fig. 12 Comparison graph of all implemented DL based models

A brief summary of the layer names, parameter sizes and activation functions used in the MLP, ANN, CNN, LSTM and GRU models is shown in Table 1. These architectures have been trained on 80% of the dataset, with the remaining 20% of the data utilized for testing the overall model performance. All implemented models were trained with a batch size of 16, the ADAM optimizer and the categorical cross-entropy loss function for 100 epochs. For the Adaptive Moment Estimation (ADAM) optimizer, a learning rate of 0.0001 and a decay of \(1e^{-6}\) were used. These architectures have been tested using the dataset described above, and the acquired results and further analysis are discussed in the Result and Discussion section.
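These training settings translate into Keras roughly as below; note that categorical cross-entropy requires one-hot labels, and that the stated decay of \(1e^{-6}\) corresponds to the learning-rate decay argument of the classic Keras Adam, whose exact spelling varies between Keras versions (an assumption here):

```python
y_train_1h = keras.utils.to_categorical(y_train, num_classes=3)
y_test_1h = keras.utils.to_categorical(y_test, num_classes=3)

# NB: the Conv1D/LSTM/GRU models expect a trailing channel axis,
# e.g. np.expand_dims(X_train, -1); the dense models take X_train as-is.
for model in (ann, mlp, cnn, lstm, gru):
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # decay=1e-6 in legacy Adam
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    history = model.fit(
        X_train, y_train_1h,
        validation_data=(X_test, y_test_1h),   # 20% held-out split
        batch_size=16, epochs=100,
    )
```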

4 Result and Discussion

In the current work, various learning algorithms were implemented for emotion detection to acquire better classification accuracy, and a comparative analysis of these methods with respect to their performance and accuracy is also provided. As discussed in the previous section, ten different models were implemented to classify the EEG-based emotion-evoked signals: the machine learning based models SVM, RF, k-NN, GNB, DT, MLP and ANN, and additionally the deep learning based models CNN, LSTM and GRU. As described in the dataset description section, the Kaggle emotion dataset was employed to train and test the proposed models. EEG signals are uncertain and frail, hiding many important details and clues which can contribute to classification tasks. Various statistical features such as the mean, standard deviation, skewness, kurtosis, maximum, minimum and their derivatives, along with the Euclidean distances between the mean, max and min values, were extracted. Furthermore, the log-covariance matrix, some non-linear features such as the Shannon and log-energy entropies, and a linear feature, i.e. the FFT (Fast Fourier Transform), were extracted. To achieve better accuracy from the EEG signals, these extracted features then served as the input data for the proposed ML and DL based classification techniques. Additionally, the categorical cross-entropy loss and the Adaptive Moment Estimation (Adam) optimizer were used for all the models described above for 100 epochs. The performance of the classification is further analyzed using the average train and test Accuracy (Acc.), Recall (R), Precision (P) and F1 score values (in %), together with the loss and accuracy graphs of the proposed DL based models. The number of accurate predictions divided by the total number of observations in the dataset yields the accuracy (Acc.), as described in Eq. 13. A comparison of the proposed work's performance with other existing work is provided in Table 3.

$$\begin{aligned} Accuracy (Acc.)= \frac{TP+TN}{TP+FP+FN+TN} \end{aligned}$$
(13)

The highest level of accuracy is 1.0, while the lowest level is 0.0.

Recall is the proportion of actual emotion states that are correctly classified (Eq. 14), while Precision is the proportion of predicted emotion states that are correct (Eq. 15).

$$\begin{aligned} Recall (R)= \frac{TP}{TP+FN} \end{aligned}$$
(14)
$$\begin{aligned} Precision (P)= \frac{TP}{TP+FP} \end{aligned}$$
(15)

Accuracy provides the overall percentage of correctly classified samples, while the F1 score is the harmonic mean of precision and recall, as described in Eq. 16.

$$\begin{aligned} F1 Score (F1)= \frac{2\cdot {P}\cdot {R}}{P+R} \end{aligned}$$
(16)
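For reference, all four measures can be computed directly with scikit-learn; the macro averaging over the three emotion classes is an assumption, since the paper does not state the averaging mode:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_pred = clf.predict(X_test)   # any fitted classifier from above
print("Acc:", accuracy_score(y_test, y_pred))                    # Eq. 13
print("R:  ", recall_score(y_test, y_pred, average="macro"))     # Eq. 14
print("P:  ", precision_score(y_test, y_pred, average="macro"))  # Eq. 15
print("F1: ", f1_score(y_test, y_pred, average="macro"))         # Eq. 16
```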

The overall train and test accuracy along with the precision, recall and F1 score for all models on the dataset are provided in Table 2.

Table 2 Classification train and test accuracy with precision, recall and F1 score (%) of machine and deep learning models on the emotion dataset
Table 3 Comparison analysis of the proposed work's performance with other existing work

It can be observed from this table that most of the implemented ML and DL based methodologies achieved good accuracy. Among the ML based algorithms implemented for emotion recognition, DT, RF and SVM achieved the highest classification accuracies of 98.12, 96.42 and 96.25% respectively. The DL based models LSTM, GRU and CNN performed strongly, with 97.42, 97.19 and 98.13% accuracy on the test dataset respectively and 100% classification accuracy on the train dataset. The CNN and RF models achieved the highest performance, while LSTM, GRU, DT and SVM achieved comparable accuracy and also performed well. Figures 6, 7, 8 and 9 show the accuracy and loss graphs for the training and testing of the ANN, CNN, LSTM and GRU models respectively. The comparison graph of all implemented methods is shown in Fig. 10, and the accuracy, F1 score, precision and recall of the ML and DL based models are compared separately in Figs. 11 and 12.

5 Conclusion

This study suggested several ML and DL based algorithms for the classification of EEG-evoked multiple emotion states, i.e. neutral, positive and negative. To develop an emotion classification model, various features were first extracted from the raw EEG data, and these feature sets were then supplied to the various ML and DL based models to obtain each model's overall performance accuracy and loss. The implemented methodologies were then compared and evaluated using the train and test datasets; the average accuracy on the train and test data, the F1 score, and each model's precision and recall were evaluated, with their accuracy and loss graphs also described. This paper experimented with previously implemented ML based methods such as SVM, k-NN, DT, RF and GNB for comparative analysis with the proposed deep learning based models. The proposed CNN model was able to classify the feature sets extracted from the EEG dataset to identify 3 different emotion states, positive, negative and neutral, from the EEG signals of a subject captured while viewing emotional video clips on the screen. The efficiency of the other algorithms such as the LSTM and GRU models is evident from their accuracy in comparison to the CNN model. A future extension of the work may include testing the models on other datasets acquired through different visual stimuli, and different optimization techniques can be applied to achieve optimal solutions and better performance.