Introduction

Motivation and problem characterization

Emotions are present in many situations in our daily lives. They shape our choices, desires, tastes, memories, and other human aspects. Throughout human history, the range of emotions that can be felt and expressed has attracted the attention of behavioral scientists. As early as the 1900s, studies were conducted to find patterns for mapping the different human emotions (Russell 1980; Izard 1977, 1991).

Nowadays we know that emotions are distinguishable from each other and are built from the subjective experiences of each individual. These feelings can also be interpreted as involuntary physiological responses. However, emotions are not isolated, easily identified variables, since they are manifested through combined elements such as sensations, changes in voice, and facial expressions (Oliveira and Jaques 2013). The use of Artificial Intelligence (AI) techniques has contributed to this field of study (Santana et al. 2021; Saxena et al. 2020). AI algorithms are already successfully applied to the analysis of many complex and non-linear everyday problems (Gupta et al. 2018; Andrade et al. 2020; Oliveira et al. 2020; Santana et al. 2020a, 2020b, 2018; Cruz et al. 2018; Barbosa et al. 2020; Silva and Santana 2020; Gomes et al. 2020; Freitas Barbosa et al. 2021). Furthermore, these techniques are commonly successful in analyzing large volumes of data (Deshpande and Kumar 2018).

Facial expression analysis is currently the most common way to perform automatic emotion recognition (Santana et al. 2021). Despite being a well-explored field of study, there are still several gaps associated with this task. The development of solutions in this context requires a large amount of data, owing to the huge human variability, especially regarding demographic aspects (Lawrence et al. 2015; Reyes et al. 2018). This factor directly affects facial expressions and, therefore, needs to be considered. The need for large amounts of data leads to increased computational costs for facial expression analysis. In addition, pathologies, injuries, and human aging itself commonly affect the face and the individual’s ability to express emotions (Lawrence et al. 2015; Harms et al. 2010; Kohler et al. 2003). These challenges can be mitigated by associating facial expression data with data from other sources (Silva et al. 1997; Abdullah et al. 2021). Therefore, the face is not the only source of information for decision making regarding the classification of emotions.

From a neurological point of view, human emotions activate a series of affective-cognitive brain structures. The neuronal activity generated by emotions can be assessed with an electroencephalogram (EEG) (Izard 1977, 1991). EEG is one of the main techniques for acquiring human neurophysiological activity, mainly due to its reliability, effectiveness, simplicity, portability, and accessibility (Gupta et al. 2018; Andrade et al. 2020; Oliveira et al. 2020; Santana et al. 2020a; Alarcao and Fonseca 2017). In addition to neurophysiological activations, emotions affect peripheral physiological signals: common changes can be observed in galvanic skin response (GSR), heart rate, temperature, and respiratory rate. The association of these peripheral and central physiological signals favors the recognition of emotions (Santana et al. 2020a; Doma and Pirouz 2020; Vijayakumar et al. 2020; Khalili and Moradi 2009; Shu and Wang 2017).

Emotions can also be perceived and differentiated from patterns in human voice recordings. Changes in the time and frequency domains of these signals often appear during the expression of different emotions. Several studies have been dedicated to the recognition of emotions in speech, especially with the aim of incorporating this analysis into human–computer interfaces (Santana et al. 2021; Schuller et al. 2003; Livingstone and Russo 2018; Issa et al. 2020). However, developing models that understand the nuances of natural language and speech is still a complex task. Therefore, there is a tendency to combine this analysis with other types of data related to the manifestation of emotions (Santana et al. 2021).

One of the main factors that makes it difficult to recognize emotions is the existence of a pathology. Neurodegenerative pathologies such as Alzheimer’s disease and other dementias commonly lead to neurological impairments that affect both the identification of emotions and their expression (García-Casal et al. 2017; Behere et al. 2011; McIntosh et al. 2015). In addition, with the current and growing process of population aging around the world, we are also experiencing an increase in cases of these pathologies (Mundial and da Saúde 2018; Saúde 2021). According to Ferreira and Torro-Alves (2016), emotions are fundamental in the regulation of social interactions, as they guide our preferences, motivations, and decision making. They are also indispensable for good verbal and non-verbal communication (Chaturvedi et al. 2021; Dorneles et al. 2020). Thus, it is essential to develop tools that help in the identification of emotions to support a dignified and pleasant quality of life.

In the therapeutic context, automatic emotion recognition tools are important to improve interventions for the most diverse audiences. Some studies demonstrate that the emotional response can be used to improve patient engagement in the therapy process (Marinoiu et al. 2018; Schipor et al. 2011; Sourina et al. 2012; Delmastro et al. 2018; Aranha et al. 2017; Arroyo-Palacios and Slater 2016). It is important to highlight that greater engagement tends to increase the effectiveness of these therapeutic interventions (Lenze et al. 2011).

Therefore, this study proposes a method for recognizing emotions from multimodal data. This method will be incorporated as the core of a human–machine interface to support the therapy of elderly people with dementia. The aim is to contribute to the personalization and consequent optimization of the therapeutic process. We base the proposed method on artificial intelligence algorithms to deal with data from physiological parameters, facial expressions and speech signals.

We organize the article as follows. In the next section we present some recent and relevant studies of emotion recognition from these types of data. After this section, we detail our approach in the Materials and Methods section, followed by the results and discussion sections. Finally, we draw some conclusions, highlighting the main findings and future possibilities.

Related Works

Nowadays, automatic recognition of emotions has strong relevance in the therapeutic scenario. The emotional response of patients has already been used to shape the therapeutic experience, so that interventions become more appropriate to achieve the particular goals of each individual. Different therapy modalities can benefit from emotion recognition tools. Studies have already been carried out in the context of physical therapy for motor rehabilitation (Aranha et al. 2017), speech therapy (Schipor et al. 2011), and music therapy (Sourina et al. 2012), in addition to cognitive behavioral therapies (Marinoiu et al. 2018; Arroyo-Palacios and Slater 2016).

Aranha et al. (2017) proposed a serious game adaptation approach for motor rehabilitation. With the implemented framework, physical therapists can use the affective response of patients to adapt the commands of a game. The recognition of emotions was performed from the analysis of the user’s facial expression. With this, it was possible to achieve the goals of rehabilitation more effectively. An increase in the effectiveness of the therapeutic process was also identified by Schipor et al. (2011), this time in the context of speech therapy. Since speech quality is also influenced by the individual’s emotional condition, the authors implemented an emotion recognition module in a Computer Based Speech Therapy System (CBST) to assess the quality of word pronunciation in speech therapy for children. The results were promising and point to the close relationship between the human voice and emotional states.

Music therapy sessions can also be favored by assessing the affective state of the patient. Sourina et al. (2012) built a tool to identify the emotional state of the user in real time and use it to adjust the songs used during music therapy. This approach classifies emotions as fearful, frustrated, sad, happy, pleasant, and satisfied from EEG signals. Thus, using the perceived emotion, the system automatically selects the most appropriate music to meet the patient’s needs.

Marinoiu et al. (2018) investigated the expression of emotions in the context of robot-assisted therapy of children with Autism Spectrum Disorder (ASD). The authors performed emotion recognition in 3D videos collected with a Kinect system. After analyzing the data, they realized that emotional state identification has great potential to improve human–machine interaction and, consequently, improve therapeutic intervention for these individuals.

In order to modulate the cognitive-behavioral state of the participants, Arroyo-Palacios and Slater (2016) proposed a virtual reality scenario to identify and modulate the affective state of the user. In the proposed interface, participants were represented by virtual dancers and had to control the rhythm of the dance by modulating their own mood. Thus, people who were agitated had to make the avatar move more calmly, while people who were more relaxed had to make the character dance more frantically. Participants’ moods were identified from the physiological signals of skin conductance, heart rate, and respiratory rate. Only by modulating these parameters was it possible to control the avatar’s activity. The authors concluded that, by using this game, participants were able to become emotionally aroused in the activation condition and to relax in the relaxation condition.

In order to contribute to the development of studies related to emotional response, Soleymani et al. (2011) created the MAHNOB-HCI database. The database was built to acquire information about different manifestations of affective responses to audiovisual stimuli. This acquisition was performed with multimodal data that, among other information, includes records of physiological signals from the central nervous system (EEG) and the peripheral nervous system (electrocardiogram (ECG), GSR, respiratory amplitude, and skin temperature). Given the vast amount of information, this database has been used in several emotion recognition studies. The authors of the database themselves conducted a promising preliminary study of automatic emotion recognition from these data. For this analysis, the authors extracted spectral and statistical features from the physiological signals. In total, 318 attributes were extracted: 20 from the GSR signal, 64 from the ECG, 14 from the breathing pattern, 4 from the skin temperature, and 216 from the EEG signal. After feature extraction, the authors evaluated the performance of a Support Vector Machine (SVM) model with a radial basis function (RBF) kernel to classify data in terms of valence and arousal. The proposed model was able to classify peripheral physiological signals with an accuracy of 46.2% for the arousal class and 45% for valence. In the classification of EEG signals, the method obtained slightly better accuracies: 52.4% for arousal and 57% for valence.

A few years after the development of this database, Wiem and Lachiri (2017) used peripheral physiological signals from the MAHNOB-HCI database to propose a method for classifying emotions. Initially, the authors removed artifacts and noise using Butterworth filters. Then, 169 statistical attributes were extracted from each signal. Finally, the authors evaluated the performance of four SVM configurations for signal classification, with linear, polynomial, sigmoid, and Gaussian kernels. These different configurations showed similar results. The use of ECG signals resulted in the best classification performances, with accuracies around 65% for arousal and 60% for valence using the SVM with a linear kernel. This same SVM configuration reached accuracies between 53 and 63% in the classification of affective states using the other peripheral physiological signals. However, when the physiological signals were combined, the SVM with a polynomial kernel showed a better performance in classifying arousal levels, with an accuracy of 64.23%. In the case of valence, the SVM with a Gaussian kernel presented the best performance, with an accuracy of 68.75%.

Another relevant work that uses the physiological signals of the MAHNOB-HCI base is that of Wei et al. (2018). In this study, the authors sought to perform emotion recognition from EEG, ECG, respiration amplitude, and GSR signals. For feature extraction, the authors used a combination of attributes from the time and frequency domains. After extracting the attributes, the authors submitted the data for classification with an SVM algorithm with an RBF kernel. The hyperparameters of this algorithm (C and γ) were optimized by the grid search method. The authors evaluated the algorithm’s performance in classifying five emotions: Sadness, Happiness, Disgust, Neutral, and Fear. The classification was made separately for each of the physiological signals. Thus, an accuracy of 74.52% was obtained using EEG signals, 68.75% with ECG signals, 54.33% using respiration amplitude, and 57.69% with GSR signals. Subsequently, the authors also evaluated the performance of the combination of the four physiological data types, reaching an accuracy of up to 84.62%.

Still in the effort to find strategies for automatic emotion recognition, Livingstone and Russo (2018) proposed the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The database has voice and video recordings of professional actors expressing 8 emotions: calm, happy, sad, angry, fearful, surprise, disgust, and neutral. In this study, the authors also validated the database based on the analysis of 72 human evaluators. This evaluation showed that emotions are better identified using audio associated with video, or video alone, than using audio alone. Overall, the authors reported accuracies between 58 and 67% in classifying emotions through speech, associated with Kappa values between 0.41 and 0.52. The “neutral” and “angry” states were the most easily identified, while the highest amount of misclassification was associated with the “sad” emotion.

In 2020, Issa et al. (2020) proposed a method for classifying emotions from voice signals. Part of the method evaluation was done using the RAVDESS database. The proposed architecture consists of extracting Mel-frequency cepstral coefficients, chromagram, Mel-scale spectrogram, Tonnetz representation, and spectral contrast features from the sound files. After feature extraction, the data were classified by a convolutional neural network (CNN) with rectified linear unit (ReLU) activation. This model correctly classified 71.61% of the data from the RAVDESS database. Better classification performances were obtained for stronger emotions such as “angry”. There was greater confusion in the classification of emotions closer to each other, such as “calm” and “sad” or “happy” and “surprised”.

The following year, Luna-Jiménez et al. (2021) proposed the use of a CNN architecture for emotion recognition with the RAVDESS database. Pre-trained CNN architectures based on AlexNet were used for feature extraction. The authors obtained the best results with an RBF-kernel SVM as the classifier. The best evaluated model resulted in an accuracy of 76.58% in the identification of the 8 emotions in the database. The emotions “angry” and “disgust” were the best classified. Higher error rates were associated with the “sad” class, commonly confused with “calm” and “fearful”.

The FER facial expressions database was developed by Goodfellow et al. (2013). This database has 35,887 images, all resized to 48 × 48 pixels and converted to grayscale, covering 7 types of emotions: Anger (4,593), Disgust (547), Fear (5,121), Happy (8,989), Neutral (6,198), Sad (6,077), and Surprise (4,002). The authors used CNN and SVM to extract attributes and classify the images of facial expressions. The method they proposed reached 71.2% accuracy on the test set.

The FER database was also used by Ng et al. (2015). In their approach, the database was used to train a CNN model. Initially, the images were cropped and adjusted for better visualization of the facial expression. Then, they were used to refine the training of a CNN, given the size and diversity of the dataset. Finally, the trained architecture was used to classify the images from the EmotiW database. The accuracies found were moderate, ranging between 42 and 56%. It is important to mention that the authors did not report any class balancing steps. Since the database is originally unbalanced, this lack of balance may have negatively affected the results: the imbalance makes the CNN learn the patterns of some classes better than others, skewing the result. Five years later, Kusuma et al. (2020) conducted an emotion recognition study using a pre-trained VGG-16 model. The ImageNet image dataset was used to pre-train the model. Then, the authors used the model to classify the images from the FER-2013 database. Finally, their method was able to differentiate 7 distinct emotions with an accuracy of 69.40%.

In Table 1 we present the main information from these related studies, such as their main goal, the computational techniques used, and their main findings. The last line of this table presents our method, showing that our proposal is well contextualized in the state of the art.

Table 1 Summary of related works

Material and Methods

Theoretical background

Affective Computing, a burgeoning field at the intersection of computer science and psychology, is concerned with the development of computational systems endowed with the ability to perceive, interpret, and respond to human emotions (Bota et al. 2019; Cambria et al. 2017). It encompasses a diverse array of methodologies, including signal processing, machine learning, and human–computer interaction, which collectively enable machines to discern and appropriately react to human affective states (Picard 2000; Hasnul et al. 2021; Calvo and D’Mello 2010; Kołakowska et al. 2020). Of particular significance is the domain of healthcare, where emotion recognition assumes a crucial role. The recognition and comprehension of emotions in healthcare settings have profound implications for patient care, facilitating the early detection of mental health disorders and fostering tailored interventions (Picard 2000; Kołakowska et al. 2020; Sinha 2021). For instance, in the realm of mental health, emotion recognition systems can be employed to monitor the emotional well-being of individuals with conditions such as depression or anxiety, enabling timely intervention and personalized treatment strategies (Picard 2000; Kołakowska et al. 2020; Sinha 2021). Furthermore, in domains like telemedicine and remote patient monitoring, emotion recognition technologies hold the potential to assess patients’ emotional states during virtual consultations, thus providing a comprehensive understanding of their needs and experiences (Picard 2000; Hasnul et al. 2021; Kołakowska et al. 2020; Sinha 2021).

The paramount importance of emotion recognition in healthcare extends beyond patient care to the well-being of healthcare practitioners (Pujol et al. 2019). Professionals in high-stress occupations, including doctors, nurses, and caregivers, confront significant emotional challenges that can profoundly impact their mental health and job performance (Pujol et al. 2019). Here, emotion recognition systems offer an avenue for assessing the emotional states of healthcare workers, enabling timely intervention and support. For example, wearable devices equipped with sensors can continuously monitor physiological signals, such as heart rate variability and skin conductance, thereby providing valuable insights into emotional states such as stress or burnout (Hasnul et al. 2021; Kołakowska et al. 2020; Saganowski et al. 2020; Marcos et al. 2021; Ayata et al. 2020; Dhuheir et al. 2021). Such information can be utilized to deliver real-time feedback and interventions to healthcare professionals, thereby ensuring their well-being and overall job satisfaction (Hasnul et al. 2021; Kołakowska et al. 2020; Saganowski et al. 2020; Marcos et al. 2021; Ayata et al. 2020; Dhuheir et al. 2021). Moreover, the integration of emotion recognition technologies has the potential to foster the development of intelligent systems that respond empathetically to the emotional needs of healthcare providers, thereby cultivating a supportive work environment and augmenting their overall work experience (Hasnul et al. 2021; Kołakowska et al. 2020; Saganowski et al. 2020; Marcos et al. 2021; Ayata et al. 2020; Dhuheir et al. 2021).

Alzheimer’s disease and other forms of dementia impose a significant burden on individuals worldwide (Cobos and Rodríguez 2012; Olanrewaju et al. 2015; Castro et al. 2021; Livingston et al. 2017; Shafqat 2008). The prevalence of these conditions has reached alarming levels, with an estimated 50 million people currently affected globally (Zhang et al. 2021; Barnes and Yaffe 2011). Alzheimer’s disease, in particular, is the most common form of dementia, characterized by progressive cognitive decline and memory loss (Cobos and Rodríguez 2012; Olanrewaju et al. 2015; Castro et al. 2021; Livingston et al. 2017; Shafqat 2008). Patients with Alzheimer’s experience a range of symptoms, including impaired thinking, disorientation, language difficulties, and changes in behavior and mood. These debilitating effects severely impact the quality of life of patients and their families, leading to a decline in functional abilities, loss of independence, and diminished social engagement (Cobos and Rodríguez 2012; Olanrewaju et al. 2015; Castro et al. 2021; Livingston et al. 2017; Shafqat 2008). Moreover, the global impact of Alzheimer’s and dementia extends beyond the individual level, placing an immense strain on healthcare systems, economies, and society as a whole (Cobos and Rodríguez 2012; Olanrewaju et al. 2015; Castro et al. 2021; Livingston et al. 2017; Shafqat 2008). Several works have faced the problem of providing in vivo diagnosis for Alzheimer’s disease and other dementias by using image diagnosis optimized by machine learning and evolutionary computing (Souza et al. 2021; Santos et al. 2008a, 2008b, 2009a, 2009b; Silva et al. 2019). Given the profound consequences of these diseases, there is an urgent need for interventions that can alleviate symptoms and enhance the well-being of patients (Sörensen et al. 2006; Haan and Wallace 2004).

Music therapy and other art-based interventions have shown promising potential in assisting individuals with Alzheimer’s disease, particularly in the early stages of the illness. Music, with its ability to evoke emotional responses and retrieve memories, holds a unique position in therapeutic approaches for dementia patients. Engaging in music therapy sessions can stimulate various cognitive functions, including memory recall and emotional processing, leading to the emergence of positive affective memories (Matziorinis and Koelsch 2022; Brotons and Marti 2003; Guess 2017; Steen et al. 2018; Leggieri et al. 2019). The power of positive affective memories evoked through music therapy is significant, as it can enhance the overall quality of life for Alzheimer’s and dementia patients. These memories may evoke emotions, trigger reminiscence, and foster connections with personal experiences, promoting a sense of identity and emotional well-being (Steen et al. 2018; Blackburn and Bradshaw 2014; Guetin et al. 2013). By leveraging the emotional and memory-related benefits of music therapy, individuals with Alzheimer’s can experience improved mood, reduced agitation, enhanced communication, and increased social interaction. Importantly, these effects can extend beyond the duration of therapy sessions, creating a positive impact on daily life and social interactions for patients and their caregivers (Steen et al. 2018; Blackburn and Bradshaw 2014; Guetin et al. 2013).

Furthermore, the integration of emotion recognition tools in music therapy holds great potential for improving its efficacy (Kim and André 2008). With a multimodal approach, several different datasets, provided by different information, can be used to improve emotion recognition by machine learning. It is possible to combine video, audio, face, and physiological signals to improve emotion recognition accuracy, generating a precise way to estimate emotions in a biofeedback-based approach (Santana et al. 2021; Muyuan et al. 2004). By providing biofeedback to therapists, these tools can assist in assessing the emotional responses and levels of engagement of patients during music therapy sessions. This feedback enables therapists to fine-tune their interventions, tailoring the therapeutic approach to optimize the emergence of positive affective memories. With real-time information on the effectiveness of the therapeutic techniques employed, therapists can adjust the selection of music, tempo, or delivery method to maximize the desired emotional responses in patients (Sourina et al. 2012; Kim and André 2008; Muyuan et al. 2004; Lin et al. 2009; Yang et al. 2009). The incorporation of emotion recognition tools in music therapy enhances its precision and efficiency, ultimately leading to more targeted and personalized interventions, and potentially slowing the progression of Alzheimer’s disease while improving the well-being and quality of life of patients (Sourina et al. 2012; Kim and André 2008; Muyuan et al. 2004; Lin et al. 2009; Yang et al. 2009).

Proposal

Considering the importance of emotion recognition in the therapeutic context (Marinoiu et al. 2018; Schipor et al. 2011; Sourina et al. 2012; Aranha et al. 2017; Arroyo-Palacios and Slater 2016), we propose an emotion recognition approach based on multimodal data. It will be part of a human–machine interface (HMI) to support the therapy of elderly people with dementia. Overall, the interface works as a biofeedback of emotions. This way, a therapist may change the intervention based on the patient’s emotional response. In this context, we conducted some experiments to find the classification model that will integrate the system. We used publicly available databases of EEG and peripheral physiological signals, speech signals, and images of facial expressions.

Figure 1 illustrates the operation of the HMI. Data from 3 different sources (EEG, speech, and facial expressions) are acquired from the patient. These data then go through pre-processing and segmentation steps, followed by feature extraction. The resulting feature vector is submitted to the classification step, whose goal is to identify the emotion felt by the patient. Finally, the therapist may assess this emotional response and use it to adapt the intervention applied to the patient.

Fig. 1
figure 1

General proposal of the HMI system used in this study. First, the EEG, speech, and facial expression data are acquired and processed. After processing, the therapy is modulated according to the emotion recognition results. The patient then benefits from the personalized therapeutic intervention

Considering the EEG and audio signals, the pre-processing consists of applying a notch filter, to minimize the influence of the power-line frequency (60 Hz or 50 Hz, depending on the country), and a bandpass filter, to keep the signals within the expected range. Then, the signals were segmented into windows for later attribute extraction. As we used public databases, these two signal filtering steps had already been performed. Considering the images of facial expressions, the only pre-processing used was face segmentation, using the Haar-Cascade classifier. However, since public image databases were used, this step had also already been performed, so we used the complete image of the face as available.
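As a minimal illustration of this kind of pre-processing, the sketch below applies a notch filter followed by a bandpass filter to one channel. It assumes SciPy and uses illustrative cutoff values; the exact passbands adopted by the database authors are not reproduced here.

```python
import numpy as np
from scipy.signal import iirnotch, butter, filtfilt

def preprocess_signal(x, fs, notch_freq=60.0, band=(0.5, 45.0)):
    """Remove power-line interference and keep the band of interest.

    x: 1-D signal (one channel); fs: sampling rate in Hz.
    notch_freq and band are illustrative values, not the databases' exact settings.
    """
    # Notch filter centered on the power-line frequency (60 Hz or 50 Hz)
    b_notch, a_notch = iirnotch(w0=notch_freq, Q=30.0, fs=fs)
    x = filtfilt(b_notch, a_notch, x)

    # Band-pass filter to keep the signal within the expected range
    b_band, a_band = butter(N=4, Wn=band, btype="bandpass", fs=fs)
    return filtfilt(b_band, a_band, x)
```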

Therefore, this system seeks to improve the therapy of elderly people affected by dementia. It will provide emotional feedback for customized therapeutic interventions. The next sections present the tools and methods adopted to achieve this goal. The experiments and the findings shown here consist of a proof of concept for the development of the proposed HMI.

Datasets

In this study we used 3 different databases, all relating human data to the corresponding emotions. The first has peripheral and central physiological signals associated with 6 different classes of emotions. The second contains speech signals from people expressing 8 different emotions. Finally, the third consists of images of 7 emotions expressed on faces.

Further information regarding each database is provided in the following topics. All three databases were initially submitted to a stage of splitting the data into training-validation and test subsets. At this point, the original data were randomly divided into 70% for the set used for training and validating the models and 25% for the test set. In the validation stage, to find the best classification configuration, we used tenfold cross-validation. After defining the best classifier architecture, the entire database from the validation stage was used in the test stage as a training set to train the chosen model, which was then tested on the previously separated test set. Figure 2 illustrates this preparation of the sets. We left 5% of the datasets out to ensure there is no intersection between the training/validation and test sets. Furthermore, it is important to emphasize that the test set did not participate in the training of any model, being used only to assess the performance of the best model found in the training and validation stage. It is also worth mentioning the importance of designing these training-validation and test sets so that they present the same statistical behavior even though they are made of different instances.
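A minimal sketch of this split, assuming scikit-learn and placeholder data: the 5% hold-out is removed first so that the training/validation (70%) and test (25%) subsets cannot intersect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and labels standing in for any of the three databases.
X = np.random.rand(1000, 34)
y = np.random.randint(0, 6, size=1000)

# Hold out 5% so training/validation and test sets cannot intersect, then split the
# remaining 95% into 70% (training/validation) and 25% (test) of the original data.
X_keep, _, y_keep, _ = train_test_split(X, y, test_size=0.05, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_keep, y_keep, test_size=0.25 / 0.95, random_state=0)
```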

Fig. 2
figure 2

General design of datasets

Physiological data

The database used for emotion assessment from physiological signals was the Multimodal Database for Affect Recognition and Implicit Tagging (MAHNOB-HCI), developed by (Soleymani et al. 2011). The database has central and peripheral physiological data collected using a multimodal approach. Among the collected data there are Electroencephalogram (EEG), Electrocardiogram (ECG), Galvanic Skin Response (GSR), respiration amplitude, and skin temperature. All these data were used in our approach.

The instrumental arrangement for data collection was as follows. For GSR data, the authors used two electrodes on the distal phalanges of the middle and index fingers (Soleymani et al. 2011). To acquire the ECG information, they placed 3 electrodes at the upper right and left corners of the chest and abdomen. EEG was recorded from 32 active silver chloride electrodes, including 2 references, positioned according to the international 10–20 system. Finally, they measured skin temperature with a skin sensor. The authors started acquiring data from 30 volunteers with different cultural backgrounds, of both male and female genders. However, 6 subjects did not participate in all acquisition steps.

Emotional response to visual and auditory stimuli in MAHNOB-HCI was recorded in two stages. In the first one, 20 short videos were played to evoke emotions while the physiological response of each participant was recorded. At the end of each video, a neutral clip was played to minimize the emotional bias activated by the previous video, and participants self-assessed their response after watching the videos. The emotional evaluation was performed using a discrete scale with values between 1 and 9 (where 1 is the most pleasant emotion and 9 is the most unpleasant). This assessment was based on five different questions: (i) What emotion was presented? (ii) What level of pleasure? (iii) What level of activation? (iv) What level of dominance? and (v) What level of predictability? To label the videos, the 3D Pleasure-Activation-Dominance (PAD) model of affective states was used to determine which emotion each video was supposed to evoke. Then, the participants had to rate the stimuli on a scale from 1 to 9, corresponding to the following emotional states: neutral, anxiety, fun, sadness, happiness, disgust, anger, surprise, and fear. The data employed in our methodology were obtained during this acquisition phase. Although the authors meticulously devised the MAHNOB-HCI database to encompass data pertaining to nine distinct classes of emotions, specific data for the joy, fun, and fear classes were not made available. Consequently, we leveraged the dataset comprising the remaining six emotions to fulfill our research objectives.

Additionally, in the second acquisition stage, participants were asked to perform a digital content labeling task based on their emotional responses. During this second stage, the authors acquired the reactions expressed on the face using video cameras. Data from this step were not used in our current study.

Therefore, the database we use here has 285 signals: 55 related to the Happiness emotion, 14 to Sadness, 84 to Neutral, 40 to Disgust, 65 to Amusement, and 27 to Anger. Each signal originally has 47 channels; however, 9 were empty and 38 carried information. Among these channels, 1 to 32 carried the EEG signals, channels 33, 34, and 35 the ECG, channel 41 was dedicated to GSR, channel 45 to the respiration rate, and channel 46 to the skin temperature.
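For illustration, this channel layout can be captured in a simple mapping; the sketch below assumes each recording is loaded as a samples × 47 NumPy array and uses the 1-based channel indices listed above.

```python
import numpy as np

# Channel layout of the MAHNOB-HCI recordings used here (1-based indices, as listed above).
CHANNEL_MAP = {
    "EEG": list(range(1, 33)),   # channels 1-32
    "ECG": [33, 34, 35],
    "GSR": [41],
    "RESPIRATION": [45],
    "TEMPERATURE": [46],
}

def select_channels(signal: np.ndarray, modality: str) -> np.ndarray:
    """Return the columns of `signal` (samples x 47 channels) belonging to one modality."""
    idx = [c - 1 for c in CHANNEL_MAP[modality]]  # convert to 0-based indexing
    return signal[:, idx]
```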

Speech data

To perform the recognition of emotions through voice we used the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone and Russo 2018). This is a Canadian database composed of the voices of 24 professional actors, speaking English with an American accent. The database has 7356 audio and video files, totaling 25 GB of data. The participants are equally divided into 12 men and 12 women.

During the recordings, each individual spoke the following expressions in English: “Kids are talking by the door” and “Dogs are sitting by the door”. They spoke these phrases so as to represent 8 emotions: neutral, calm, joy, sadness, anger, fear, surprise, and disgust. Likewise, they recorded the 2 phrases in a singing voice to represent 6 emotions: neutral, calm, joy, sadness, anger, and fear. The idea behind this method was to get as close as possible to the desired expression in an induced way; to this end, each actor used different techniques to achieve the final goal. They recorded each sentence at 2 different levels of intensity, normal and strong, plus neutral expressions. In the end, each actor recorded, on average, for 4 h, with a microphone placed 20 cm in front of them.

As we would not use the video signals and the focus of this work was not singing, we selected the portion of files containing only spoken excerpts. This reduced our base to a set of 1440 audio-only files in .WAV format, sampled at 48 kHz with 16-bit resolution.

The files are originally named according to Table 2, following the order of identifiers: modality, channel, emotion, intensity, statement, repetition, and actor. Thus, each label is composed of 7 numbers of 2 digits each (e.g., 02-01-06-01-02-01-12.wav).

Table 2 Description of RAVDESS filenames. The 7 identifiers and respective codes

Originally, the database is organized by actor. However, as our objective is to classify emotions, we used the file labels to reorganize the base according to emotion classes. Thus, we created 8 folders, one for each emotion class. In this way, we can work with the files grouped by their characteristic classes, unlike the original organization by actor.
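A sketch of this reorganization, assuming the filename convention of Table 2: the emotion-code mapping follows the publicly documented RAVDESS convention, and the directory names are illustrative.

```python
import shutil
from pathlib import Path

# The third field of the filename encodes the emotion (standard RAVDESS codes).
EMOTION_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def reorganize_by_emotion(src_dir="RAVDESS_speech", dst_dir="RAVDESS_by_emotion"):
    """Copy each .wav file into a folder named after its emotion class."""
    for wav in Path(src_dir).glob("**/*.wav"):
        fields = wav.stem.split("-")       # e.g. ['02', '01', '06', '01', '02', '01', '12']
        emotion = EMOTION_CODES[fields[2]]
        target = Path(dst_dir) / emotion
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(wav, target / wav.name)
```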

As for the distribution of instances among the 8 classes of emotions, the base is unbalanced: each emotion has a total of 192 files, except for the Neutral emotion, which has only 96 audio files (Livingstone and Russo 2018).

Facial expressions

For the emotion recognition experiment on facial expressions, we used the Facial Expression Recognition 2013 (FER-2013) database, introduced in the ICML 2013 Challenges in Representation Learning (Goodfellow et al. 2013). The FER-2013 database consists of 35,887 images, all resized to 48 × 48 pixels and converted to grayscale, covering 7 types of emotions, namely: Anger (4,593), Disgust (547), Fear (5,121), Happy (8,989), Neutral (6,198), Sad (6,077), and Surprise (4,002). This database is currently considered the largest publicly available facial expression database for researchers who want to train machine learning models, mainly Deep Neural Networks (DNNs). Figure 3 shows examples of images from this database.

Fig. 3
figure 3

Samples of images from the FER-2013 database. The image shows two examples for each class in the database

Processing and Classification

For data processing we used one approach to deal with the physiological and voice signals and another for the images of facial expressions. For the signals, features regarding their temporal, frequency, and statistical distributions were extracted. For the images, we propose an architecture for feature extraction and classification based on deep networks and the Random Forest algorithm. The procedures adopted for each database are detailed below.

Signal data

The recognition of emotions through physiological and speech signals was performed following the steps shown in the diagram in Fig. 4.

Fig. 4
figure 4

General proposal for signal data assessment. We submitted the MAHNOB-HCI and RAVDESS signal databases to the following steps: feature extraction; class balancing; training and validation with intelligent models; and results assessment

Initially, we submitted the signals to a feature extraction step. In this step, we used the GNU/Octave mathematical computing software, version 4.0.3 (Eaton et al. 2015), to extract the 34 features mathematically described in Table 3. In this way, each instance of the signal is represented by its statistical characteristics and by characteristics in the time and frequency domains. These attributes proved to be relevant and effective in the representation of EEG, peripheral physiological, and voice signals in previous studies (Santana et al. 2020a; Espinola et al. 2021a, 2021b).

Table 3 List of the 34 features with their mathematical representations

Statistical information was extracted with the attributes in the left column of Table 3: mean, variance, standard deviation, root mean square, average amplitude changes, difference absolute deviation, integrated absolute value, logarithm detector, simple square integral, mean absolute value, mean logarithm kernel, skewness, kurtosis, maximum amplitude, and 3rd, 4th, and 5th moments.

The attributes related to the time-frequency domain of the signals are in the right column of Table 3: waveform length, zero crossing, slope sign changes, Hjorth parameters (activity, mobility, and complexity), mean frequency, median frequency, mean power, peak frequency, power spectrum ratio, total power, variance of central frequency, Shannon’s entropy, and 1st, 2nd, and 3rd spectral moments.
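For illustration, a few of these attributes can be computed as in the sketch below (NumPy, single channel). This is only a small subset of the 34 features of Table 3 and not the exact GNU/Octave implementation used in the experiments.

```python
import numpy as np

def extract_features(x: np.ndarray, fs: float) -> dict:
    """Compute an illustrative subset of the statistical and spectral features of Table 3
    for a single signal window x sampled at fs Hz."""
    feats = {
        "mean": np.mean(x),
        "variance": np.var(x),
        "rms": np.sqrt(np.mean(x ** 2)),
        "mean_absolute_value": np.mean(np.abs(x)),
        "waveform_length": np.sum(np.abs(np.diff(x))),
        "zero_crossings": int(np.sum(np.diff(np.sign(x)) != 0)),
    }
    # Spectral features from the power spectrum
    psd = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    feats["mean_frequency"] = np.sum(freqs * psd) / np.sum(psd)
    feats["peak_frequency"] = freqs[np.argmax(psd)]
    feats["total_power"] = np.sum(psd)
    return feats
```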

During the process of extracting attributes from the peripheral and central physiological signals, we performed windowing of these signals. We used a 5 s window with 1 s of overlap between consecutive windows. This procedure aims to improve the spectral characterization of the samples. From this windowing, we generated an unbalanced dataset with 8,097 instances: 1,704 in the Happiness class, 1,114 in the Neutral class, 500 in Sadness, 1,222 in Disgust, 2,650 in Amusement, and 907 in Anger. Finally, each of the 38 channels of these instances was subjected to the extraction of the 34 attributes of Table 3.
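A minimal sketch of this windowing step, assuming a 1-D NumPy array and the 5 s window with 1 s of overlap described above (i.e., a 4 s hop between window starts).

```python
import numpy as np

def window_signal(x: np.ndarray, fs: float, win_s: float = 5.0, overlap_s: float = 1.0) -> np.ndarray:
    """Split a 1-D signal into 5-s windows with 1-s overlap between consecutive windows."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)   # 4-s hop between window starts
    return np.array([x[i:i + win] for i in range(0, len(x) - win + 1, step)])
```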

Since the voice signals are shorter than the physiological ones and it is important to analyze them in their full context, we did not perform windowing on these signals. However, some of the audio files were mono (1 channel) and others stereo (2 channels), so, to avoid any kind of incompatibility, we duplicated the mono signals into equivalent stereo ones. Finally, we extracted the features of Table 3 from each channel of the signals.

After extracting attributes, we designed the training/validation and test sets as illustrated in Fig. 2. This step was performed for both the physiological and speech signal databases. The test sets (25%) were set aside to be used after finding the most suitable model for the classification of each type of signal. Therefore, the steps described below were performed only with the training/validation set, which corresponds to 70% of the instances of each database.

As mentioned before, for both physiological and speech signals, there is an unbalanced distribution of instances among the respective classes. If not adjusted, this class imbalance can generate biased learning, favoring the classes with more instances. To avoid this bias, we performed a class balancing step. Here we balanced the classes by adding synthetic instances to the minority classes using the Synthetic Minority Over-sampling Technique (SMOTE) (Blagus and Lusa 2013; Chawla et al. 2002). This algorithm creates synthetic instances based on the real instances of a given class: minority classes are balanced by taking each instance and adding synthetic samples along the line segments joining it to its k nearest neighbors. In our approach, we configured SMOTE with k = 3 neighbors. For this step, we used the Waikato Environment for Knowledge Analysis (Weka) software, version 3.8 (Witten and Frank 2005).
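An equivalent balancing step could be sketched with the SMOTE implementation of the imbalanced-learn package, as below; the k_neighbors = 3 setting mirrors the Weka configuration, and the data are placeholders.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Placeholder training/validation data (features X, labels y) for illustration.
X = np.random.rand(500, 34)
y = np.random.choice([0, 1, 2, 3, 4, 5], size=500, p=[0.4, 0.2, 0.15, 0.1, 0.1, 0.05])

# k_neighbors=3 mirrors the k = 3 setting used with Weka's SMOTE filter.
X_balanced, y_balanced = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)
```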

In the physiological signals database, the balancing of the classes resulted in 2,649 instances in the Sadness class, 2,658 in the Happiness class, 2,651 in the Disgust and Neutral classes, 2,650 in the Amusement class, and 2,658 in the Anger class. Therefore, this set now has a much more balanced distribution of instances among the classes.

In the context of speech signals, it was still necessary to apply SMOTE to expand the total number of instances of each class by 50%. We performed this procedure after balancing the classes in order to improve the distribution of the number of instances along the set, since this problem has a large number of classes (8). This expansion was an adjustment of the dimensionality of the set, that is, a better balance of the relationship between the numbers of instances, attributes, and classes. Thus, each emotion now has 288 instances.

Finally, the balanced sets of both databases were submitted to the training/validation stage. In this step, we evaluated the performance of 3 classic classifier algorithms: Random Forest, Support Vector Machine (SVM), and Extreme Learning Machine (ELM). Random Forest is an algorithm structured as a committee of decision trees; individually, these trees act as experts in identifying the patterns associated with the problem (Breiman 2001; Jackins et al. 2021; Pal 2005). SVM is a method that stands out for its good generalization performance in classification problems (Platt 1998; Cortes and Vapnik 1995; Zeng et al. 2021); it is based on the modeling of hyperplanes that serve as decision boundaries. ELM stands out for its great generalization power and its reduced training time, due to the random initialization of the input layer weights and the analytical calculation of the subsequent weights (Santana et al. 2018; Huang et al. 2004; Silva and Krohling 2016).

We investigated SVM because it is a well-established classification architecture in the context of biomedical signal and image classification. Additionally, ELM was used because it can also be considered as a fast algorithm for two-layer multilayer perceptron networks, another classic approach to biomedical problems. We also investigated Random Forest for its potential good performance in problems where generalization is difficult. Due to its ensemble behavior, Random Forest is robust to class imbalance and can deal well with problems that are expressed through multiple rules, unlike SVM and ELM which, in turn, look for general rules in the form of combinations of polynomials and other mathematical functions.

We tested different models for each of these methods, varying their main hyperparameters according to the configurations presented in Table 4. It is worth mentioning that the k-fold cross-validation method with k = 10 was used during the experiments to avoid overfitting (Jung and Hu 2015). In this method, the dataset is randomly divided into k subsets, with k − 1 used for training and the remaining subset used for validation. Successive training steps are performed until the performance has been validated on all k subsets. Furthermore, in order to obtain statistical information regarding the performance of the algorithms, each configuration was evaluated over 30 repetitions. These experiments were also conducted in Weka, version 3.8.

Table 4 Experimental settings for the classifiers applied to signal data
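Although the experiments were run in Weka, an equivalent protocol can be sketched with scikit-learn, as below: 10-fold cross-validation repeated 30 times over a few candidate configurations. ELM is omitted because it has no scikit-learn implementation, and the hyperparameter values shown are only an illustrative subset of Table 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder balanced training/validation data.
X = np.random.rand(300, 34)
y = np.random.randint(0, 6, size=300)

# Candidate configurations (illustrative subset of Table 4).
candidates = {
    "RF-300": RandomForestClassifier(n_estimators=300),
    "RF-500": RandomForestClassifier(n_estimators=500),
    "SVM-poly2": SVC(kernel="poly", degree=2),
    "SVM-rbf": SVC(kernel="rbf"),
}

# 10-fold cross-validation repeated 30 times, as in the Weka experiments
# (reduce n_repeats for a quicker run).
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=30, random_state=0)
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```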

The Weka environment was chosen because it is easy to prototype, separating the choice of the machine learning model from the final prototyping in the application, thus allowing users to build complex machine learning models for different applications. Weka also allows the chosen models to be saved in a file, for later application in the final emotion recognition solution.

Image data

For the recognition of emotions through images of facial expressions, we propose a new architecture based on a deep network with a Random Forest classifier at the output. As illustrated in Fig. 5, in this architecture we use a transfer learning approach to extract attributes from the images of the FER-2013 database. For this, we applied a LeNet network pre-trained on the MNIST dataset, which is composed of a training set of 60,000 images of handwritten digits (Deng 2012). LeNet was one of the first Convolutional Neural Networks. It was proposed by LeCun et al. (1998) and has 7 layers: 3 convolutional layers, 2 downsampling layers, and 2 fully connected layers. The convolution filters are used to extract spatial features from the images. This network extracted 500 features from each image of the database.
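A sketch of this transfer-learning idea in Keras is shown below. The layer sizes, activations, and input handling are illustrative assumptions (the original LeNet operates on 28 × 28 MNIST digits, and the pre-training step is omitted); only the use of a 500-unit penultimate layer as a fixed feature extractor mirrors the description above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_lenet(num_classes=10):
    """LeNet-style CNN with illustrative layer sizes."""
    return keras.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(6, 5, activation="tanh"), layers.AveragePooling2D(),
        layers.Conv2D(16, 5, activation="tanh"), layers.AveragePooling2D(),
        layers.Conv2D(120, 5, activation="tanh"),
        layers.Flatten(),
        layers.Dense(500, activation="tanh", name="feature_layer"),  # 500-feature representation
        layers.Dense(num_classes, activation="softmax"),
    ])

lenet = build_lenet()
# (Pre-training on MNIST would happen here; omitted for brevity.)

# Use the penultimate layer as a fixed feature extractor for FER-2013 images.
extractor = keras.Model(lenet.input, lenet.get_layer("feature_layer").output)
faces = np.random.rand(8, 48, 48, 1).astype("float32")   # placeholder grayscale images
features = extractor.predict(faces)                       # shape: (8, 500)
```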

Fig. 5
figure 5

General proposal for image data assessment. We submitted the FER-2013 image database of emotion-driven facial expressions to the following steps: feature extraction; class balancing; training and validation with Random Forest; and results assessment

After extracting attributes, we designed the training/validation and test sets (Fig. 2). The test set was only used after finding the most suitable model for image classification. Therefore, the steps described below were performed only with the training/validation set, which corresponds to 70% of the instances.

Since the database we used has a notable imbalance between classes, we also submitted the feature vectors of this set to the SMOTE (Chawla et al. 2002) method for balancing. As for the data from the other databases, we configured SMOTE with k = 3 nearest neighbors. Balancing increased the set to 62,269 instances, which became better distributed among the classes happy (8,989), fear (8,961), neutral (8,987), disgust (8,861), sad (8,872), anger (8,915), and surprise (8,684).

Finally, the balanced set was submitted to classification with Random Forest algorithms. This algorithm was chosen to compose this architecture because it is versatile, fast-executing and deals well with large datasets. Methods based on Random Forest have also been successful in sets with missing data, with poor balance and with little variability (Andrade et al. 2020; Oliveira et al. 2020; Gomes et al. 2020; Freitas Barbosa et al. 2021).

We conducted experiments with different Random Forest models, varying the number of trees among 10, 20, 50, 100, 150, 200, 250, 300, 350, 400, and 500. Table 5 details the settings. In these experiments we also used the k-fold cross-validation method with k = 10 to avoid overfitting (Jung and Hu 2015). In addition, each configuration was evaluated over 30 repetitions to verify the statistical behavior of the models. Both the class balancing and classification steps were carried out in Weka, version 3.8.

Table 5 Experimental settings for the classifiers applied to image data

Test stage

In the test stage, we used the subsets with 25% of each dataset (physiological signals, speech signals and images of facial expressions) that did not participate in the evaluation stages of the classification models. This step is important to verify the generalization capacity of the evaluated models. A good generalization is desirable, as it implies a good performance of the model to classify new data. "New data" is data that did not take part in the training and is therefore unknown to the algorithm.

From the training and validation stages, we assessed the performance of the tested models. This analysis allowed us to identify the most suitable models to deal with each of the three data sources. After identifying these models, we trained each one with the entire training/validation set. Finally, we use the trained models to estimate the classes of the data in the test sets.
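A minimal sketch of this final step, assuming scikit-learn and placeholder splits: the chosen configuration (here a Random Forest with 300 trees, as selected for the signal data) is retrained on the whole training/validation set and evaluated once on the held-out test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Placeholders standing in for the training/validation and test splits of Fig. 2.
X_trainval, y_trainval = np.random.rand(700, 34), np.random.randint(0, 6, 700)
X_test, y_test = np.random.rand(250, 34), np.random.randint(0, 6, 250)

# Retrain the chosen configuration on the entire training/validation set,
# then evaluate it once on the untouched test set.
best_model = RandomForestClassifier(n_estimators=300).fit(X_trainval, y_trainval)
y_pred = best_model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("kappa:", cohen_kappa_score(y_test, y_pred))
```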

Metrics

To evaluate the performance of the models in both the training/validation and the test stages, we used five metrics: accuracy, kappa index, sensitivity, specificity, and area under the ROC curve. It is important to note that for the analysis of the models we take into account this set of performance metrics, not just one of them. Table 6 presents the mathematical description of these metrics.

Table 6 Mathematical expressions for the metrics used to evaluate the classification performance. TP, TN, FP, and FN are the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively. TPR and FPR are the True Positive Rate and False Positive Rate, respectively

Accuracy is a metric that indicates how efficient the classifier is at correctly predicting the class of each instance. It is directly proportional to the numbers of true positives (TP) and true negatives (TN). The kappa statistic is a metric similar to accuracy; however, kappa takes into account the chance of random agreement (Artstein and Poesio 2008). When predictions are purely random, the kappa index assumes zero or negative values. Sensitivity is the metric used to assess the classifier’s performance in identifying the true positives; it is commonly called the true positive rate (TPR) and is also known as recall. Specificity, as opposed to sensitivity, is used to assess performance in identifying the true negatives; thus, it is known as the true negative rate (TNR). The area under the ROC (Receiver Operating Characteristic) curve, widely known as the Area Under the Curve (AUC), is also a metric used to assess how well the model performs in prediction. The ROC curve is a probabilistic curve, and the area under it represents the chance the model has to correctly predict the data (Hanley and McNeil 1982). The curve is built with the false positive rate (FPR) on the x-axis and the sensitivity (TPR) on the y-axis. In the case of multiclass problems, such as in this work, AUC can be evaluated in two ways: one vs. one and one vs. all. In the first, the curves of all pairwise combinations of classes are plotted; in the second, the curves of each class versus all the others are plotted.
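For reference, the standard definitions underlying these metrics, consistent with the quantities of Table 6, can be written as follows (here $p_o$ is the observed agreement and $p_e$ the agreement expected by chance):

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &
\kappa &= \frac{p_o - p_e}{1 - p_e},\\
\text{Sensitivity (TPR)} &= \frac{TP}{TP + FN}, &
\text{Specificity (TNR)} &= \frac{TN}{TN + FP},\\
\text{AUC} &= \int_{0}^{1} TPR \, d(FPR).
\end{align*}
```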

Results

In this section, we show the results of the training-validation and test stages for the three databases. Initially, we present the results for the physiological signals database, followed by the voice signals database. Finally, we present the results for the identification of emotions in facial expressions.

Physiological data

Table 7 and Fig. 6 show the performance of the classifiers in the training-validation stage. In Table 7 are the average and standard deviation values for all performance metrics. In addition, we highlight in this table the settings of each classifier family with the best performance. Sometimes, many configurations of the same classifier family obtained similar results. Thus, we performed a joint analysis of the metrics and the intervals defined by the standard deviations to establish the best setting.

Table 7 Classification performance for the dataset from physiological signals
Fig. 6
figure 6

Emotion classification performance from physiological signals. Each classifier family was assessed based on (a) accuracy, (b) kappa statistic, (c) sensitivity, (d) specificity, and (e) AUC

In this context, using the physiological signals, Random Forest with 300 trees presented the best performance. This model achieved high average values of accuracy (98.48%), kappa (0.9817), sensitivity (0.9957), specificity (0.9977) and AUC (0.9999). In the SVM family, we found the best results using the polynomial kernel of 2nd degree. These results were a little lower than those with Random Forest, with an accuracy of 95.29%, kappa index of 0.9435, sensitivity of 0.9887, 0.9942 specificity and AUC of 0.9968. The ELM configurations had the worst performances when compared to the other families. ELM with 4th degree polynomial kernel showed the best results among the ELMs. This ELM model achieved an accuracy of 44.28%, kappa of 0.3314, sensitivity of 0.4429, specificity of 0.8886 and 0.6657 for AUC. As mentioned earlier, we assessed all settings for 30 repetitions to verify their statistical behavior. Therefore, in the graphs of Fig. 6 we present the behavior of all metrics for these best configurations of each classifier family.

Considering these results, we used the previously trained Random Forest model with 300 trees to perform the prediction of emotions in the test set. The confusion matrix in Fig. 7 shows how this model distributed the test data among the 6 classes. The density of instances is also indicated by the color, so that the closer to green, the greater the number of instances in that region. One may see that the instances were almost all concentrated along the main diagonal of the matrix, which indicates that most of the data were correctly classified. Furthermore, in Table 8 we present the values of the classification performance metrics. The Random Forest model was able to classify the test set with 99.159% accuracy, kappa of 0.989, sensitivity of 0.992, specificity of 0.998, and AUC of 1.

Fig. 7
figure 7

Confusion matrix regarding the classification of emotions on the test set of physiological signals with a Random Forest of 300 trees

Table 8 Test results of the best overall method (Random Forest with 300 trees) in the classification of physiological data

Speech data

In the classification of emotions in voice signals, the results of the 3 types of architectures (Table 9) show that there is still considerable room for improvement for this type of data. Among the Random Forests, the 10-tree configuration had the worst result, reaching 38.03% average accuracy. As expected, as we increased the number of trees in the model, we noticed an increase in overall performance. Yet, this increase was not substantial, and the best-performing setting was the Random Forest with 300 trees. This model achieved an average accuracy of 43.01%, with a kappa statistic of 0.3488, sensitivity of 0.5848, specificity of 0.9073, and AUC of 0.8958. The performances of the SVMs were similar to those of the Random Forests, being slightly worse. The best-performing SVM configuration used a 4th-degree polynomial kernel, with an accuracy of 42.78%, kappa of 0.3460, sensitivity of 0.6035, specificity of 0.9110, and AUC of 0.8785. Among the ELM configurations, the 3rd-degree polynomial kernel stood out positively. However, the ELM results were even worse than those obtained with the other architectures. The highest accuracy obtained with this architecture was 40.79%, associated with a kappa of 0.3233, sensitivity of 0.4079, specificity of 0.9154, and AUC of 0.6616. The graphs in Fig. 8 present the behavior of all metrics for the best settings of the three classifier families on the speech data.

Table 9 Classification performance for the dataset from speech signals
Fig. 8
figure 8

Emotion classification performance from speech data. Each classifier family was assessed based on (a) accuracy, (b) kappa statistic, (c) sensitivity, (d) specificity, and (e) AUC

From these results, we found that the Random Forest with 300 trees achieved the best performance among the models. Therefore, we used this architecture to create the model for classifying the test data. The matrix in Fig. 9 shows the distribution of instances across the 8 classes of the problem. Although there is still confusion between the classes, most instances were correctly classified. The accuracy associated with this classification was 79.888% (Table 10). When applied to the test set, this model also yielded good results for kappa (0.769), sensitivity (0.799), specificity (0.971) and AUC (0.965).

Fig. 9
figure 9

Confusion matrix regarding the classification of emotions on the test set of speech signals with a Random Forest of 300 trees

Table 10 Test results of the best overall method (random forest with 300 trees) in the classification of speech data

Facial expressions

For the training-validation stage with the facial expression data, we obtained the results in Table 11 and Fig. 10. The plots in Fig. 10 illustrate the general behavior of the evaluated Random Forest configurations. As the number of trees increased, performance also improved until reaching a plateau, where it became almost constant. This point of stability was reached with the 350-tree configuration. Beyond this configuration, increasing the number of trees did not lead to significant gains in any of the performance metrics. This behavior was repeated for all metrics, although there was slightly more data dispersion for sensitivity and specificity. The Random Forest with 350 trees resulted in an average accuracy of 75.29%, kappa of 0.7117, sensitivity of 0.6116, specificity of 0.9233 and AUC of 0.8858. Therefore, this was the setting used to create the model for classifying the test set.
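The tree-count sweep behind Fig. 10 can be sketched as follows. This is an illustration only: X_feat stands for the deep features extracted from the face images (placeholder data here), the grid of tree counts is assumed, and 5-fold cross-validation replaces the 30 repetitions used in the study.

```python
# Sketch of the tree-count sweep used to locate the performance plateau (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder deep features and 7 emotion labels; replace with the extracted image features.
X_feat, y = make_classification(n_samples=700, n_classes=7, n_informative=10)

for n_trees in (50, 150, 250, 350, 450):
    rf = RandomForestClassifier(n_estimators=n_trees)
    mean_acc = cross_val_score(rf, X_feat, y, cv=5, scoring="accuracy").mean()
    print(f"{n_trees} trees -> mean accuracy {mean_acc:.3f}")
# Performance typically levels off beyond some tree count; in the paper this
# plateau appeared around 350 trees, which fixed the model used in the test stage.
```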

Table 11 Classification performance for the dataset from facial expressions
Fig. 10
figure 10

Emotion classification performance from facial expressions data. Each classifier configuration was assessed based on (a) accuracy, (b) kappa statistic, (c) sensitivity, (d) specificity, and (e) AUC

Table 12 presents the classification performance on the test set. For this database, the model classified the instances with 82.752% accuracy, kappa of 0.791, sensitivity of 0.828, specificity of 0.962 and AUC of 0.975. The distribution of these instances among the 7 classes of the problem is shown in Fig. 11. Once again, the confusion matrix shows that most instances were correctly classified.

Table 12 Test results of the best overall method (random forest with 350 trees) in the classification of facial data
Fig. 11
figure 11

Confusion matrix regarding the classification of emotions on the test set of facial expression data with a Random Forest of 350 trees

Discussion

Using the physiological signals database, we found interesting results in classifying the emotions (i.e. sadness, happiness, disgust, neutral, amusement, and anger). The results in the training-validation phase pointed to a good performance of the Random Forest and SVM algorithms. Sensitivity, specificity and AUC values for these methods were similar to each other. However, Random Forest presented slightly higher accuracy and kappa statistic. Both algorithms also showed high reliability in the results, since there was almost no data dispersion. On the other hand, the ELM configurations presented much lower results than those obtained with the other classifiers. The performance of the best ELM setting showed greater dispersion and was statistically dissociated from the others for all evaluated metrics (see Fig. 6). From these findings, we selected the Random Forest model with 300 trees as the one with the best performance. This model was later used to classify the test set.

In the test step with physiological data, most instances were correctly classified. The matrix in Fig. 7 shows this result. There was low confusion between classes, with few instances classified outside their class of origin. The highest confusion rate was between the disgust and anger classes, with 4 instances of disgust placed in anger. Furthermore, except for the happiness and anger classes, all other classes had instances classified as neutral. On the other hand, the neutral and happiness classes showed the lowest rate of confusion, with only 1 instance being classified outside its class of origin. In total, of the 2022 instances of the test set, only 17 were incorrectly classified, representing an error of 0.84%. This test performance is confirmed by the high values of all evaluation metrics presented in Table 8.

Considering the speech data, the results of the proposed method were not all positive. Overall, the results for the Random Forest, SVM and ELM configurations were similar to each other, achieving median values for most of the evaluated metrics. The results in Fig. 8 show that there was no statistically significant difference between these architectures with respect to kappa, sensitivity and specificity. However, there was a difference for accuracy and AUC, with Random Forest and SVM performing better than ELM. In most cases, ELM also presented greater data dispersion than the other architectures. It is also worth mentioning the good results for specificity and AUC. This indicates that the algorithms were more capable of identifying the classes to which a given instance does not belong than of identifying the correct class of that instance. It is important to highlight the large number of emotion classes in this voice database. In addition, this database has few data for each class, thus resulting in low variability. These factors hinder the training of the algorithms, reducing classification performance.
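The text does not state which statistical test underlies these comparisons. Purely as an illustration, a nonparametric comparison of the 30 per-repetition accuracies of two classifier families could be carried out with a Mann-Whitney U test, as sketched below with placeholder values.

```python
# Illustrative only: the statistical test used in the study is not specified here.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Placeholder values: replace with the 30 per-repetition validation accuracies of each family.
acc_rf = rng.normal(0.430, 0.02, 30)
acc_svm = rng.normal(0.428, 0.02, 30)

stat, p = mannwhitneyu(acc_rf, acc_svm, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")  # p >= 0.05 would suggest no significant difference
```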

In the test stage with the voice dataset, we also used a Random Forest model with 300 trees, since this model achieved the best performance during the training-validation stage. Its performance was even higher during the classification of the test data, reaching almost 80% accuracy. This result shows that the training, validation and test sets were properly designed. The confusion matrix (Fig. 9) shows that most instances were correctly classified. Of the 358 instances of the test set, 72 were incorrectly classified. Almost all instances of the calm, neutral and astonished classes were correctly classified. Additionally, contrary to what happened with the physiological signals, there was little confusion between the neutral class and the others. The highest confusion rate was between the rage and happy classes, with 5 of the 48 instances of rage being classified as happy. The class most often confused with the others was disgust, with 14 of its 48 instances classified outside the class of origin.

In classifying emotions from facial expressions, we found a gradual increase in the performance of the algorithms as we increased the number of trees in the Random Forest. From the evaluation of all metrics, we found that beyond the Random Forest with 350 trees there was no further increase in classification performance. Overall, this algorithm achieved good results for accuracy, kappa statistic, specificity and AUC. However, the sensitivity values were somewhat below the other metrics. This finding indicates some difficulty in identifying the correct class of each instance in this data modality.

In addition, the Random Forest with 350 trees also showed good classification performance on the test set. The matrix in Fig. 11 shows that most of the data were correctly classified. Of the 8969 instances in the test set, 1547 were misclassified. Most errors occurred between the neutral, happy and sad classes. There was a low rate of confusion between the disgust class and the others, as evidenced by the red quadrants of the matrix. The fact that there are fewer instances of this class in the set may have favored this result. There was also low confusion between the astonished class and the others.

The findings from all data modalities show that the Random Forest algorithm was sufficiently robust to generalize the results obtained in the training-validation steps to the test data. In addition, this model presented low data variability, which points to good reliability and repeatability of the performances obtained here. These findings were common to the three databases, even though they came from very different sources. This means that the proposed method is promising for the classification of emotions in multimodal data.

Conclusion

Considering the importance of emotions for the regulation of social interactions and the growing demand for tools that help in their identification, we propose an approach for the automatic recognition of emotions. Since the expression of emotions is commonly affected by neurodegenerative pathologies such as dementia, our approach aims to use emotional feedback to support the personalization of therapies for elderly people with dementia. In the therapeutic context, the customization of interventions has been shown to be effective in adapting care to better meet the individual needs of patients. In this approach, we evaluated the performance of a computational methodology in the recognition of emotions in data from different modalities.

Since previous studies show that data association benefits the identification of emotions, we used data that have a proven relationship with the expression of different human emotions. Therefore, we accessed data from facial expressions, speech signals, and central and peripheral physiological signals, all obtained from public databases. Data analysis was performed using artificial intelligence tools. In this sense, two approaches were proposed: one for images (facial expressions) and another for signals (speech and physiological).

For the investigation of emotion patterns in images, we propose a hybrid architecture based on a pre-trained deep neural network and a Random Forest for feature extraction and classification, respectively. The signals were described by statistical attributes and by features in the time and frequency domains. Next, we evaluated the performance of different configurations of ELM, SVM and Random Forest classifiers in differentiating emotions in these signals.
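A hedged sketch of the hybrid image pipeline is given below: a LeNet-style convolutional backbone produces a feature vector per image, and a Random Forest performs the classification. The layer sizes, the pre-training procedure and the data are placeholders and do not reproduce the exact architecture used in this work.

```python
# Minimal sketch of a LeNet-style feature extractor feeding a Random Forest (illustrative only).
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

images = np.random.rand(100, 32, 32, 1).astype("float32")  # placeholder face images
labels = np.random.randint(0, 7, size=100)                  # 7 placeholder emotion labels

# LeNet-like backbone; in the paper the deep network is pre-trained before being used
# as a feature extractor, which is not reproduced here.
backbone = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, activation="tanh", input_shape=(32, 32, 1)),
    tf.keras.layers.AveragePooling2D(),
    tf.keras.layers.Conv2D(16, 5, activation="tanh"),
    tf.keras.layers.AveragePooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="tanh"),  # one feature vector per image
])

features = backbone.predict(images)                          # deep features
rf_350 = RandomForestClassifier(n_estimators=350).fit(features, labels)
print("Training accuracy:", rf_350.score(features, labels))
```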

The performance assessment of these methods took place in two stages: training-validation and test. In the training-validation stage, we evaluated the different architectures to identify the most suitable setting for each type of data. Then, the chosen model was used in the test stage of each data modality.
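This two-stage protocol can be summarized by the sketch below: a test partition is held out first, model selection is carried out on the remaining training-validation data (for example, with the repeated evaluation illustrated earlier), and only the selected configuration is applied to the test set. Split sizes and data are illustrative, not taken from the paper.

```python
# Sketch of the two-stage (training-validation / test) protocol (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_classes=6, n_informative=8)  # placeholder data
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# ... model selection over X_trval (e.g., the repeated evaluation shown earlier) ...
best_model = RandomForestClassifier(n_estimators=300).fit(X_trval, y_trval)
print("Test accuracy:", best_model.score(X_test, y_test))
```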

In the context of signals, the adopted approach allowed a classification with high accuracy (98.48%), kappa index (0.9817), sensitivity (0.9957), specificity (0.9977) and AUC (0.9999) for the physiological signals. However, the performance with speech signals was much worse, with a maximum accuracy of 43% and kappa of 0.35. In both cases, the Random Forest with 300 trees showed the best performance. The low performance with speech signals can be explained by the high complexity of these data combined with the low number of records per class. Even though the database contains a good amount of signals, there are also many classes of emotions (8). Therefore, the amount of data per class ends up being insufficient to enable the identification of the patterns of each class.

In the image analysis, the architecture combining LeNet and Random Forest with 350 trees presented the best classification performance. This algorithm was able to identify the 7 emotions of the problem with an accuracy of 75.29%, kappa of 0.7117, sensitivity of 0.6116, specificity of 0.9233 and AUC of 0.8858.

It is important to highlight that for both types of data (signals and images) the best classifications were achieved by Random Forests with a large number of trees. This indicates that the identification of emotions in these data modalities requires a certain degree of computational complexity. In the test step, we used the architectures with the best performance for each type of data to classify instances that were not used in the training step. From the test, we noticed a high generalization capacity of the architectures in all three scenarios. With the physiological data we reached an accuracy of 99.16%. For speech signals, the percentage of correct classification was 79.89%. There was also an accuracy of 82.75% in the identification of emotions in images of facial expressions. In addition, a good performance in differentiating emotions was perceived in all scenarios, as is evident in the confusion matrices and in the high values of the sensitivity, specificity and AUC metrics.

The good results in the test stage are encouraging and point to the possibility of adopting the method in the analysis of emotions in multimodal data. These findings are even more interesting given the large amount and variety of emotions. However, some improvements can be incorporated, such as the use of a greater amount and diversity of data, especially for speech signals. The limited availability of data in the speech signal database made it necessary to use strategies to synthesize data from the original dataset, which may introduce a certain degree of redundancy among the instances of each class. In this sense, the incorporation of new data in the architecture training can make the solution more robust and improve its performance in a context of daily use. For the development of a final solution, more in-depth studies are also needed on the computational, time and memory costs associated with processing. Future works may also investigate the performance of the architectures in classifying emotions in multimodal signals coming from the same individual.