1 Introduction

Intelligent human–computer systems are in high demand today because they improve a computer's ability to interact with people and to interpret human communication accurately. Emotional information processing plays a central role in human–computer interaction (HCI) and emotion detection [1]: recognizing people's emotions, and thereby gaining emotional intelligence, is essential for natural and correct interaction between human and computer [2]. With the emergence of the Metaverse concept, studies in this field have increased. Studies have shown that biomedical signals are directly affected by emotional changes, and emotion detection methods are therefore designed by analyzing biomedical signals produced by the autonomic nervous system (such as electrocardiography and galvanic skin response). According to many studies in the literature, the use of biomedical signals is one of the most accurate ways to detect emotion [6]. Researchers aim to create emotion detection systems with balanced performance in terms of both accuracy and response speed; a recent challenge in this area is therefore to use fewer physiological signals [3,4,5]. Since biomedical signals are nonlinear and nonstationary, appropriate features must be selected to increase the accuracy of the system.

Among these physiological signals, the ECG (electrocardiography) signal allows the electrical activity of the heart to be monitored. Emotion detection based on the analysis of short-term ECG signals will be useful in applications such as computer games, multimedia biofeedback systems, and the detection of emotional responses to changing stimuli in psychiatric studies [7].

Studies have shown that the characteristics of biomedical signals affect classification performance. In addition, the long computation time of emotion recognition/classification algorithms poses a challenge for real-time applications. Feature extraction and feature selection are therefore important steps, because well-chosen features improve both classification and the evaluated performance indicators [6]. Classifying with as few features as possible is also beneficial in terms of computational cost.

Regarding emotional states, Plutchik proposed eight basic emotions: fear, anger, sadness, disgust, curiosity, surprise, pleasure, and joy. All other emotions are mixtures of these (for instance, a mix of sadness and curiosity) [8]. From another point of view, emotions are divided into four regions of the arousal–valence plane. The valence axis ranges from negative emotions on the left to positive emotions on the right. The arousal axis ranges from the least activating emotions at the bottom to the most activating at the top, for example from sleepiness to excitement. Emotions are thus classified as positive or negative on the valence axis and as high or low on the arousal axis [9].

There are many studies in the literature on emotion analysis, and those using physiological signals in particular have attracted great interest from researchers. These studies differ in the physiological signal used, the database, the extracted features, and the classification methods. Some recent studies are summarized below:

Ferdinando et al. applied standard empirical mode decomposition and bivariate empirical mode decomposition to ECG signals recorded in the MAHNOB-HCI database, and obtained features based on the statistical distribution of the instantaneous frequency calculated using the Hilbert transform of the intrinsic mode functions. They used SVM and KNN for classification and achieved the highest performance metrics with KNN, obtaining accuracies of 59.7% and 55.8% for arousal and valence, respectively [7].

Hsu et al. extracted features in the time and frequency domains from ECG signals recorded in the MAHNOB-HCI database and used an SFFS-KBCS-based feature selection algorithm. They also reduced the dimensionality of the features using GDA. They used LS-SVM for classification and obtained results of 49.2 and 44.1% for arousal and valence, respectively [10].

Ben and Lachiri used ECG, respiration volume, skin temperature, and galvanic skin response signals recorded in the MAHNOB-HCI database to extract 169 features. They used SVM for classification and obtained accuracies of 64.23 and 68.75% for arousal and valence, respectively [11].

Siddharth et al. conducted a classification experiment using LSTM with physiological signals from four different datasets, including DEAP, MAHNOB-HCI, AMIGOS, and DREAMER. They also evaluated the performance of the algorithm extensively by using transfer learning to show that their proposed method overcomes inconsistencies between datasets. Using the ECG signal from the MAHNOB-HCI dataset with their proposed method, they achieved accuracies of 79% for both arousal and valence [12].

Baghizadeh et al. limited the physiological signals recorded in the MAHNOB-HCI database and obtained various features in the time, frequency, and time–frequency domains by applying the Poincaré map to the R-R, QT, and ST intervals of the ECG signals. They used three classifiers (KNN, SVM, and MLP) and achieved the best average accuracies, 82.17 ± 4.73% for arousal and 78.07 ± 3.59% for valence, with KNN [13].

In the proposed model, ECG signals from the MAHNOB-HCI database are preprocessed, features are extracted, and new features are generated with automated feature engineering. Classification is then performed for both valence and arousal levels using three learning methods: support vector machines (SVM), a feedforward neural network (FNN), and bidirectional long short-term memory (BiLSTM). Details are given in the Material and method section.

2 Material and method

2.1 MAHNOB-HCI database

MAHNOB-HCI is a database designed to elicit emotional responses to content, such as amusement or disgust, in order to study the natural behavior of healthy adults interacting with a computer while viewing multimedia. In the experiment, cameras (color and monochrome, at six different angles), a microphone, and an eye tracker were used to capture the participants' reactions, and several physiological signals were recorded: EEG (electroencephalography), ECG, GSR (galvanic skin response), body temperature, and respiration (RESP). The signals were recorded at 1024 Hz and down-sampled to 256 Hz [14].

In this study, since the classification process was based on arousal and valence, two-class emotion categorization was performed as shown in Table 1 according to the value of r.

Table 1 Emotion categorization according to the arousal–valence model [6]

Here, the r value is the evaluation score that the subjects entered from the keyboard, and these values can be found for each record in the database.

The rating scores were converted into two classes according to the r scale and labeled. As shown in Fig. 1, the data are evenly distributed between the two classes in both the valence and arousal categories.

Fig. 1 Data distribution pie chart for the valence and arousal categories

Within the scope of the study, classification is performed using only the ECG signal.

2.2 Methods

First, because the ECG records obtained from the database have different lengths, they were segmented into 15-s recordings. A preprocessing step was then carried out to remove noise from the raw ECG signals. After the R peaks were detected with the Pan–Tompkins algorithm, the P, Q, R, S, and T waves were located. From these waves, a PQRST fragment with a length of 281 samples was obtained first, followed by the maximum and minimum values of the P, Q, R, S, and T waves with a length of 10 samples. Thus, morphological features with a total length of 291 samples were extracted. In addition, HRV (heart rate variability) features of five samples were obtained and added, and a feature vector with a total length of 297 samples was created. The feature vector was then normalized. Finally, feature generation was carried out with feature engineering in order to increase both the number of features and the accuracy of the classification algorithm.

Two separate classification studies, one for valence and one for arousal, were then carried out with SVM, FNN, and BiLSTM. The block diagram of the method is shown in Fig. 2.

Fig. 2 The methodology used in the study for emotion classification
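The segmentation step described at the start of this section can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `segment_signal` is hypothetical, the 256 Hz rate is the down-sampled rate reported for the database, and discarding a trailing partial segment is an assumption, since the handling of leftover samples is not stated.

```python
def segment_signal(signal, fs=256, seconds=15):
    """Split a 1-D recording into consecutive, non-overlapping segments.

    Assumption: samples that do not fill a whole segment are discarded.
    """
    seg_len = int(fs * seconds)          # 15 s at 256 Hz -> 3840 samples
    n_full = len(signal) // seg_len      # number of complete segments
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]
```

For example, a 40-s recording yields two 15-s segments and the last 10 s are dropped under this assumption.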

2.2.1 ECG signal preprocessing

There are basically three types of noise in the ECG signal: (1) baseline drift, (2) power line interference, and (3) high-frequency noise.

Baseline drift is caused by breathing, movement of the patient, and poorly attached electrodes. It appears as the signal riding on a slowly varying DC voltage. The frequency range of this noise is below 0.5 Hz [15]. For this reason, a high-pass Butterworth filter with a cutoff frequency of 0.5 Hz was used to remove it.

Power line interference is usually concentrated around 50 Hz [15]. To remove it, an IIR (infinite impulse response) notch filter centered at 50 Hz was used.

The ECG is a low-frequency biomedical signal observed between 0.5 and 100 Hz, but most of its information content lies below 40 Hz. High-frequency content above this range may be electrical noise from the heart muscle, nervous activity from different parts of the body, or noise from other high-frequency devices in the environment [16, 17]. For this reason, a low-pass Butterworth filter with a passband corner frequency of 40 Hz and a stopband corner frequency of 60 Hz was used for high-frequency noise.
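The three filtering stages can be illustrated with a cascade of second-order (biquad) IIR sections in plain Python. This is a sketch under stated assumptions, not the filters used in the study: the filter orders are not reported, so second-order Butterworth sections (Q = 1/√2) are assumed for the high-pass and low-pass stages, the notch quality factor Q = 30 is an arbitrary illustrative choice, and the coefficient formulas are the widely used bilinear-transform biquad forms. The names `biquad_coeffs`, `apply_biquad`, and `preprocess_ecg` are hypothetical.

```python
import math

def biquad_coeffs(kind, f0, fs, q):
    """Return normalized (b, a) coefficients of a second-order IIR section."""
    w0 = 2.0 * math.pi * f0 / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * q)
    if kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    elif kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    elif kind == "notch":
        b = [1.0, -2 * cos_w0, 1.0]
    else:
        raise ValueError(f"unknown filter kind: {kind}")
    a = [1 + alpha, -2 * cos_w0, 1 - alpha]
    return [bi / a[0] for bi in b], [ai / a[0] for ai in a]

def apply_biquad(b, a, x):
    """Filter sequence x with one biquad (direct form I, zero initial state)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y

def preprocess_ecg(x, fs=256.0):
    """Baseline-drift, power-line, and high-frequency noise removal."""
    butter_q = 1.0 / math.sqrt(2.0)       # Q of a 2nd-order Butterworth section
    stages = [
        ("highpass", 0.5, butter_q),      # baseline drift below 0.5 Hz
        ("notch", 50.0, 30.0),            # 50 Hz power line (Q is an assumption)
        ("lowpass", 40.0, butter_q),      # high-frequency noise above 40 Hz
    ]
    for kind, f0, q in stages:
        b, a = biquad_coeffs(kind, f0, fs, q)
        x = apply_biquad(b, a, x)
    return x
```

A single second-order low-pass section will generally not meet a strict 40 Hz / 60 Hz passband/stopband specification; a higher order would normally be derived from the required attenuation, which is not given in the text.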

2.2.2 Detection of ECG signal P, Q, R, S, and T waves

The ECG signal is a biomedical signal consisting of P, Q, R, S, and T waves. Each wave results from one step of the electrical activity of the heart. These waves are used in clinical diagnosis and signal processing studies. To detect the other wave peaks of the ECG signal and to extract the morphological features, the QRS complex, which occurs with the contraction of the ventricles during the heartbeat, must be detected first. In this study, the Pan–Tompkins algorithm, which has been widely used for years owing to its high accuracy, was used for the detection of QRS complexes [18].

The P, Q, R, S, and T waves of the ECG signal are separated by characteristic intervals: the interval between P and Q lasts 120–200 ms, and the interval between S and T lasts 80–120 ms [19]. In the proposed study, the signal around each detected QRS complex was divided into two vectors, one to the left and one to the right of the complex. First, the P wave is detected as the maximum point in a 250-ms window to the left of the QRS complex. The T wave is then detected as the maximum point in a 150-ms window to the right of the QRS complex. Next, the minimum point between the onset of QRS and the P wave is taken as the Q wave, and the minimum point between the onset of QRS and the T wave is taken as the S wave. Finally, the maximum point between the Q wave and the S wave is taken as the R peak, as shown in Fig. 3.
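The window-based search can be sketched in plain Python as follows. This is one possible reading of the procedure, not the authors' implementation: `locate_waves` is a hypothetical name, `qrs_idx` stands for a QRS location delivered by a detector such as Pan–Tompkins, and the 50-ms guard `qrs_guard_s` that keeps the P and T windows clear of the QRS complex is an assumption introduced here because the text does not say how the edges of the complex are determined. The Q and S searches are taken between the P (or T) peak and the QRS location.

```python
def locate_waves(signal, qrs_idx, fs=256,
                 p_window_s=0.25, t_window_s=0.15, qrs_guard_s=0.05):
    """Return sample indices of the P, Q, R, S, and T points of one beat."""
    guard = round(qrs_guard_s * fs)      # assumed half-width of the QRS complex
    p_len = round(p_window_s * fs)       # 250-ms window left of the complex
    t_len = round(t_window_s * fs)       # 150-ms window right of the complex
    onset, offset = qrs_idx - guard, qrs_idx + guard
    if onset - p_len < 0 or offset + t_len > len(signal):
        raise ValueError("beat is too close to the edge of the signal")

    def argmax(lo, hi):
        return max(range(lo, hi), key=lambda i: signal[i])

    def argmin(lo, hi):
        return min(range(lo, hi), key=lambda i: signal[i])

    p = argmax(onset - p_len, onset)     # P: maximum left of the QRS complex
    t = argmax(offset, offset + t_len)   # T: maximum right of the QRS complex
    q = argmin(p + 1, qrs_idx)           # Q: minimum between P and the QRS
    s = argmin(qrs_idx + 1, t)           # S: minimum between the QRS and T
    r = argmax(q, s + 1)                 # R: maximum between Q and S
    return {"P": p, "Q": q, "R": r, "S": s, "T": t}
```

Because R is re-estimated as the maximum between Q and S, the result tolerates a QRS location that is a few samples away from the true peak.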

Fig. 3 Detected P, Q, R, S, and T waves

2.2.3 ECG signal morphological features

For an ECG signal, the PQRST fragment refers to the electrical activity in one beat of the heart. The morphological features here basically correspond to the positions, durations, amplitudes and shapes of certain waves, or deviations in the signal [20]. In this study, to detect the changes in the amplitudes and durations of P, Q, R, S, and T waves in recordings containing different emotional states, it is intended to extract a PQRST fragment representing each ECG recording.

For this purpose, a PQRST fragment with a total length of 281 samples, comprising 110 samples to the left of the R peak, the R peak itself, and 170 samples to the right, is sufficient [17]. At a sampling frequency of 256 Hz this corresponds to approximately 1.1 s, so the fragment spans roughly one cardiac cycle, given that a normal heart rate is between 60 and 100 bpm (beats per minute). The PQRST fragments extracted from each recording were averaged, and this average was differentiated to increase the salience of the change points on the signal. In this way, a feature vector representing each record was created. In addition, the maximum and minimum amplitude values of each of the P, Q, R, S, and T waves were added to the feature vector.
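The fragment extraction, averaging, and differentiation can be sketched as follows. The function names are hypothetical, and two details are assumptions because the text does not specify them: beats whose window would run past the edges of the recording are skipped, and the first difference is padded with a leading zero so that the differentiated fragment keeps its 281-sample length.

```python
def mean_pqrst_fragment(signal, r_peaks, left=110, right=170):
    """Average the 281-sample windows centred on each R peak of a recording."""
    frags = [signal[r - left:r + right + 1]
             for r in r_peaks
             if r - left >= 0 and r + right < len(signal)]  # skip edge beats
    if not frags:
        raise ValueError("no complete PQRST fragment in this recording")
    n = len(frags)
    return [sum(column) / n for column in zip(*frags)]

def differentiate(fragment):
    """First difference, zero-padded at the start to preserve the length."""
    return [0.0] + [b - a for a, b in zip(fragment, fragment[1:])]
```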

2.2.4 ECG signal heart rate variability features

Heart rate variability (HRV) is a measure used in clinical practice that reflects the activity of the autonomic nervous system (sympathetic and parasympathetic impulses) toward the myocardium. The HRV parameters used here are the average heart rate, the standard deviation of all R-R intervals (SDNN), the root mean square of successive R-R interval differences (RMSSD), the number of successive R-R interval differences greater than 50 ms (NN50), and pNN50, the ratio of NN50 to the total number of R-R intervals [21]. Because emotional change and arousal level are expected to affect HRV parameters through the autonomic nervous system, these HRV features were also used in the proposed study.
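The five HRV parameters can be computed from the detected R peaks as in the sketch below. The function names and dictionary keys are hypothetical; using the sample standard deviation (division by n − 1) for SDNN and expressing pNN50 as a percentage are assumptions, since the text does not fix these conventions.

```python
import math

def rr_intervals_ms(r_peaks, fs=256):
    """Convert R-peak sample indices into R-R intervals in milliseconds."""
    return [(b - a) * 1000.0 / fs for a, b in zip(r_peaks, r_peaks[1:])]

def hrv_features(rr_ms):
    """Mean heart rate, SDNN, RMSSD, NN50, and pNN50 from R-R intervals."""
    n = len(rr_ms)
    if n < 2:
        raise ValueError("at least two R-R intervals are required")
    mean_rr = sum(rr_ms) / n
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]   # successive differences
    sdnn = math.sqrt(sum((v - mean_rr) ** 2 for v in rr_ms) / (n - 1))
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    nn50 = sum(1 for d in diffs if abs(d) > 50.0)
    return {
        "mean_hr_bpm": 60000.0 / mean_rr,
        "sdnn_ms": sdnn,
        "rmssd_ms": rmssd,
        "nn50": nn50,
        "pnn50_pct": 100.0 * nn50 / n,   # NN50 relative to all R-R intervals
    }
```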

2.2.5 Automated feature engineering

Feature engineering is an important part of machine learning and classification studies; its purpose is to obtain a new, enriched representation of the data through additional generated variables. The aim is to build more accurate classification models with these new variables. This is done by transforming the original feature space into a new feature space, learning a predictive relationship between associated features [22]. The new features can be differences, ratios, or other transformations of existing features with mathematical operators [23].

In this study, gencfeatures, an automated feature engineering function in the MATLAB environment, was used. Its purpose is to automatically generate new features that make the input feature vector better suited to linear classification by applying mathematical operators. The operators used are difference, ratio, z-score (standardization), logarithm, square root, exponentiation, and trigonometric functions. According to the specified classification type, the function automatically creates a new feature vector that gives more weight to the input features with a strong effect on the classification [24, 25].

Since a two-class classification study is performed here, a linear estimator was used as the learner. With this algorithm, 450 output features are automatically generated from the 297 input features. The number 450 was chosen empirically, based on experience with the data, and the number of output features can be adjusted as desired. Details can be found in the literature [25]. The output features differ from the input features, and the mathematical operations used to obtain each new feature can be inspected.
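To make the idea of operator-based feature generation concrete, the sketch below builds candidate features from differences, ratios, and element-wise transformations of one input vector. It is only an illustration of the principle and does not reproduce gencfeatures: the function name `generate_candidates`, the naming scheme of the new features, and the rule of setting a ratio to 0.0 when the denominator is near zero are all assumptions, and the ranking step that keeps a target number of features (450 in this study) is omitted.

```python
import math
from itertools import combinations

def generate_candidates(x, eps=1e-12):
    """Build candidate features from one input vector with simple operators."""
    feats = {}
    for i, j in combinations(range(len(x)), 2):
        feats[f"x{i}-x{j}"] = x[i] - x[j]                      # difference
        # Ratio; near-zero denominators map to 0.0 (assumption of this sketch)
        feats[f"x{i}/x{j}"] = x[i] / x[j] if abs(x[j]) > eps else 0.0
    for i, v in enumerate(x):
        feats[f"sqrt|x{i}|"] = math.sqrt(abs(v))               # square root
        feats[f"log1p|x{i}|"] = math.log1p(abs(v))             # logarithm
        feats[f"sin(x{i})"] = math.sin(v)                      # trigonometric
    return feats
```

With d input features this produces d(d − 1) pairwise features and 3d element-wise features, which is why a selection step is needed in practice.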

2.2.6 Support vector machines

SVM is a classification algorithm designed for binary classification, but can also be used in multi-class classification studies. SVM is a supervised classification method [26] that divides d-dimensional data into two classes by separating them with a hyperplane [27,28,29,30].

2.2.7 Feedforward artificial neural networks

FNNs are one of the most widely used types of neural networks. In an FNN, neurons are arranged in layers and are fully interconnected. An FNN basically consists of an input layer, a series of hidden layers, and an output layer. Connections between neurons are represented by weights. Each neuron consists of a summation function and an activation function (such as sigmoid, hyperbolic tangent, or softmax) [31,32,33].

In this study, the sigmoid function is used as the activation function after every layer except the last. The last fully connected layer performs classification according to the output of the network.
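The forward pass of such a network can be sketched in plain Python. The layer sizes and weights are not given in the text, so they are left to the caller; applying softmax to the output of the last fully connected layer is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dense(x, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def fnn_forward(x, layers):
    """layers: list of (weights, biases); sigmoid follows all but the last."""
    for weights, biases in layers[:-1]:
        x = [sigmoid(v) for v in dense(x, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))   # class probabilities
```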

2.2.8 Bidirectional long short-term memory

Traditional recurrent neural network (RNN) and long short-term memory (LSTM) models propagate information forward only, so the output at time t depends only on the information before time t. To overcome this limitation, the BiLSTM model is obtained by combining two independent RNNs that process the sequence in opposite directions and replacing their hidden layers with LSTM cells.

Unlike LSTM, BiLSTM (bidirectional LSTM) has two hidden layers (forward and reverse) connected to the outputs. This structure works by considering past and future situations to improve accuracy [34, 35].

The first layer of the BiLSTM architecture created in this study is the feature input layer. Since the feature vector has 450 elements, a feature input layer with 450 channels is used. The following BiLSTM layer has 100 hidden LSTM cells. A softplus layer is then used because it increases stability during training.

The next layer, the batch normalization layer, allows the layers in the network to learn without waiting for the previous layers, that is, to learn simultaneously. It allows the network to be trained with a high learning rate and also makes the network more stable and regularized. Finally, a fully connected layer is followed by the classification layer with the softmax function.

The proposed model was trained with the Adam optimizer [36] for 100 epochs, using a learning rate of 0.001 and a batch size of 100.

3 Experimental results

Accuracy, sensitivity, and specificity values were calculated to evaluate the performance of the models used in the proposed study. Accuracy is calculated as shown in Eq. 1 and gives the proportion of signals classified correctly by the algorithm. Sensitivity is calculated as shown in Eq. 2 and, for this study, gives the rate of correctly detecting high-arousal and positive-valence ECG signals. Specificity is calculated as in Eq. 3 and gives the rate of correctly detecting low-arousal and negative-valence ECG signals. In addition, 10-fold cross-validation was used to split the data into training and test sets.

$$\mathrm{Accuracy} \, \left(\%\right)= \frac{{\text{TP}}+{\text{TN}}}{{\text{TP}}+{\text{TN}}+{\text{FP}}+{\text{FN}}}\times 100$$
(1)
$$\mathrm{Sensitivity} \, \left(\%\right)= \frac{\text{TP}}{{\text{TP}}+{\text{FN}}}\times 100$$
(2)
$$\mathrm{Specificity} \, \left(\%\right)= \frac{\text{TN}}{{\text{TN}}+{\text{FP}}}\times 100$$
(3)

where TP (true positive) is the number of correctly classified high or positive emotional ECG signals, TN (true negative) is the number of correctly classified low or negative emotional ECG signals, FP (false positive) is the number of ECG signals classified as high or positive despite having a low or negative emotion, and FN (false negative) is the number of ECG signals classified as low or negative despite having a high or positive emotion.
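Eqs. 1–3 can be evaluated directly from the predicted and true labels, as in the following sketch. The function names are hypothetical, and encoding the high/positive class as 1 and the low/negative class as 0 is an assumption made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity in percent (Eqs. 1-3)."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    sensitivity = 100.0 * tp / (tp + fn)   # undefined if no positive samples
    specificity = 100.0 * tn / (tn + fp)   # undefined if no negative samples
    return accuracy, sensitivity, specificity
```

Under 10-fold cross-validation, these values would be computed on the held-out fold of each split and then summarized across the folds.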

To assess the effectiveness of automated feature engineering, the features were first classified with SVM, FNN, and BiLSTM directly after normalization, without feature engineering. Then, after feature generation with feature engineering, the new features were classified with SVM, FNN, and BiLSTM using the same parameters. The results given in Table 2 show that automated feature engineering substantially increases the classification performance metrics.

Table 2 Experimental results of emotion classification

As the table shows, the accuracy of all classification algorithms increased for both arousal and valence. In addition, the BiLSTM algorithm combined with feature engineering gives the best results: although BiLSTM was the least successful method before feature engineering was applied, it achieved the highest accuracy afterwards.

4 Discussion

In the literature, different databases, features, and classification algorithms are used for emotion detection. Table 3 compares the best results obtained in this study with results from the literature.

Table 3 The proposed method and its comparison with current studies in the literature

Table 3 shows that the proposed method gives very good results with a small number of time-domain features combined with feature engineering. Automated feature engineering generated 450 features from the 297-sample input feature vector. Since the best results in this study were obtained with BiLSTM, only BiLSTM is included in the table. Compared with the results of Baghizadeh et al. [13], the proposed method achieves higher classification accuracy. Considering that only the ECG signal is used, the performance metrics of the proposed method show that it can be considered successful in comparison with the studies in the literature.

5 Conclusion

In this study, a machine learning-based method was proposed to detect emotion using ECG signals. The signal recordings were obtained from the MAHNOB-HCI database and were acquired while participants watched movie segments containing various emotional states. Butterworth filters were used to remove high-frequency noise and baseline drift from the ECG signals, and preprocessing was completed with an IIR notch filter to remove 50-Hz power line interference. Morphological and HRV features were then obtained from the preprocessed ECG signals. These features have lower computational cost and complexity than feature extraction techniques such as the wavelet transform and the Fourier transform. In the next step, features were generated with automated feature engineering. One of the main purposes of the study is to develop a method with high accuracy by combining a small number of features and a low processing load with automated feature engineering. Automated feature engineering increases classification accuracy by using mathematical operators to give more weight to the features with the greatest impact on classification. Table 2 shows that automated feature engineering improves the accuracy of all classification models. Considering that only ECG signals are used, the accuracies of 83.61% for arousal and 78.28% for valence obtained with BiLSTM compare favorably with the literature.

Emotion detection studies have been increasing steadily since the concept of the metaverse emerged and started to become widespread. In addition, efforts are being made to integrate biofeedback systems into many multimedia tools such as computer games. As technology develops, biological signal tracking systems are being integrated into smart devices, which will also make real-time emotion detection possible.