
1 Introduction

Emotions play a central role in any type of social setting: humans are social beings, live socially, and most of their actions carry an emotional component. Human emotional states, both their expression and their recognition, have been a focal point of attention in several areas, from neuroscience and psychology to cognitive and computer science. For robots to be accepted by humans, the use of emotions for Human Robot Interaction (HRI) is very significant. A robot that is able to recognize and express emotions can communicate in a lifelike way. Observing different modalities, such as facial expression, gesture, and speech, improves the recognition of the emotional state. Moreover, recognizing emotion is a complicated process, and some studies aim at recognizing genuine emotion. In our previous work, we deployed a group-theoretic approach to recognizing genuine emotion by detecting symmetry patterns in the face [15].

Gu et al. [2] analyzed and explored the importance and the use of the information in each trait that is effective in recognizing human emotional states. They found that, when we wish to recognize emotional states, non-verbal communication, facial expressions, and body posture/motion complement each other. Adolphs [3] showed how the human brain correlates past experiences, motion information in the visual stimuli, and facial expressions. The brain is able to integrate this multimodal information and generate a representation of the visual stimuli based on all of them acting in concert. The simulation of this process in computer systems can be achieved by neural models with specific architectures that provide different types of feature representations, such as Convolutional Neural Networks (CNNs).

CNNs were formally introduced by LeCun et al. [4]. They are inspired by the hierarchical processing of simple and complex cells in the human visual system, which extract and learn different kinds of information from visual stimuli. Each layer of a CNN reacts to different information, and when the layers are stacked together they can create a complex representation of the visual input, as illustrated by the sketch below.
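To make this concrete, the following is a minimal Keras sketch of such a stacked CNN for facial expression classification. The 48x48 grayscale input size, the layer widths, and the seven output classes are illustrative assumptions, not the exact architecture used in this work.

```python
# Minimal sketch of a stacked CNN for 7-class facial expression recognition.
# Input resolution and layer sizes are illustrative assumptions.
from tensorflow.keras import layers, models

def build_emotion_cnn(input_shape=(48, 48, 1), num_classes=7):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Early layers respond to simple local patterns (edges, corners).
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        # Deeper layers combine them into more complex, face-specific features.
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```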

In our recent works we presented a multimodal architecture for emotion in robots and analyzed what it means for a robot to have emotion, distinguishing an emotional state used for communication from an emotional state used as a mechanism for shaping its behavior toward humans and robots, using CNNs [5, 6]. In this paper, we implement the proposed model and compare the results with our previous works.

This paper is organized as follows: The next section reviews the related works. Section 3 describes human and robot emotion. The relation between deep learning and emotion is given in Sect. 4. In Sect. 5 we describe the integrated model. In Sect. 6 we present experimental results, and conclusions and future works are given in the final section.

2 Related Works

The research study by Mehrabian [14] indicated that, in human face-to-face communication, 7 % of the communicated information is transferred by linguistic language, 38 % by paralanguage, and 55 % by facial expressions. Some multimodal databases can be found in [7-9]. Most studies have looked at the integration of facial expressions and speech information, and there have been only a few efforts to fuse data from body movement and gesture in a multimodal framework. Sun et al. [25] designed hidden identity features with deep convolutional networks, learned by classifying about 10,000 face identities, and achieved 97.45 % verification accuracy on the LFW database with only weakly aligned faces. El Kaliouby and Robinson [11] proposed a model to infer mental state information from head movements and facial expressions. Susskind et al. [23] took advantage of deep belief nets to classify facial action units in realistic face images. Krizhevsky et al. [24] used a deep convolutional neural network to classify the 1.2 million images of the ImageNet LSVRC-2010 contest into 1000 different categories and achieved considerably higher accuracy than the previous state of the art. Gunes and Piccardi [10] fused facial expression and body gesture information for bimodal emotion recognition. For identification purposes, almost all types of machine learning techniques have been used in emotion recognition approaches [12, 13]. For many reasons, and mainly for our final goal of creating emotion in robots that is as similar as possible to human emotion, we are looking for a learning method that can satisfy these requirements. Lately, CNNs have shown good results in biometrics, particularly in facial expression and speech recognition. We decided to use them, together with some preprocessing of the data before feeding it to the algorithm, such as locality-sensitive hashing (LSH) to prune the database search space [22, 26]; a minimal LSH sketch is given below. We present a multimodal CNN-based model for automatic emotion recognition and expression. Our model deploys the CNN method and uses it for multimodal emotional state recognition based on facial expression, gesture, and speech. The figures above indicate that facial expressions carry a large amount of information in human communication. Deploying different modalities and multimodal systems, such as body posture, gestures, and speech, improves the determination of the emotional state.
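As a sketch of the kind of preprocessing mentioned above, the following shows a minimal random-projection LSH index that prunes the search space of feature vectors. The 128-dimensional features and the number of hyperplanes are illustrative assumptions, not the configuration used in [22, 26].

```python
# Minimal random-projection LSH for pruning a feature-vector search space.
# Feature dimension and number of hyperplanes are illustrative assumptions.
import numpy as np

class RandomProjectionLSH:
    def __init__(self, dim, n_planes=16, seed=0):
        rng = np.random.default_rng(seed)
        # Each random hyperplane contributes one bit of the hash code.
        self.planes = rng.normal(size=(n_planes, dim))
        self.buckets = {}

    def _hash(self, x):
        bits = (self.planes @ x) > 0
        return bits.tobytes()

    def index(self, vectors):
        for i, v in enumerate(vectors):
            self.buckets.setdefault(self._hash(v), []).append(i)

    def candidates(self, query):
        # Only vectors in the same bucket are compared in detail later.
        return self.buckets.get(self._hash(query), [])

lsh = RandomProjectionLSH(dim=128)
lsh.index(np.random.default_rng(1).normal(size=(1000, 128)))
```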

3 Human and Robot Emotion

The human emotional state has been the focus of attention in several areas, from biology, neuroscience, and psychology to cognitive and computer science, due to its importance in human communication, interaction, and social relations. Here, we briefly explain neuromodulation and cognition and their relation to emotion.

3.1 Neuromodulation

Neuromodulation refers to the action on nerve cells of endogenous substances called neuromodulators. Three main neuromodulator systems involved in emotion are:

  • Dopamine-based communication and motor activation,

  • Serotonin-based regulation of behavior,

  • Opioid-based regulation and relaxation [16, 17].

Emotion can be regarded as continuous patterns of neuromodulation of certain sets of brain structures. All emotion expression (EE) and emotion recognition (ER) functionalities are related to specific activities in the brain; for facial expression, for example, smiles are initiated in the motor cortex and routed via the pyramidal motor system. If we want to simulate EE in robots, detailed knowledge of these parameters and of their weights for the different emotional states can assist us in simulating human emotion. In future work, we plan to utilize these parameters and their weights to make the model more flexible.

3.2 Cognition

The robot learning process (here, for EE and ER) should be very similar to the human one, and it needs to include cognition. There are several integrated cognitive architectures that try to model all aspects of behavior as a single system while remaining consistent across different domains [18]. Some of these cognitive architectures are biologically inspired, while others are inspired by psychological theories, and some of them also include the concept of affect in their design. The interplay of affect (value), motivation (action tendencies), cognition (meaning), and behavior takes place at three levels of information processing:

  • Reactive: a hard-wired release of fixed action patterns and an interrupt generator.

  • Routine: the locus of unconscious well-learned automatized activity and primitive and unconscious emotions.

  • Reflective: the home of higher-order cognitive functions.

Traditional approaches to cognition emphasize information processing and have normally excluded emotion. However, the recent growth of cognitive neuroscience as an inspiration for understanding human cognition has highlighted its interaction with emotion. Investigations of the neural systems underlying human behavior demonstrate that the mechanisms of emotion and cognition are intertwined from early perception to abstract thought. These findings suggest that the classic division between the study of emotion and the study of cognition may be unrealistic and that an understanding of human cognition requires the consideration of emotion. Emotions influence fundamental processes mediating high-level cognition such as:

  • Attention speed, duration and capacity,

  • Working memory speed and capacity,

  • Long term memory recall and encoding.

It is also apparent that cognition is divided into different functional domains, such as memory, attention, and reasoning. Likewise, emotion may rest on a structural architecture that is similarly diverse and complex.

4 CNN and Emotion

Deep learning can be employed in robots to make robot emotions more realistic and to improve HRI and robot-robot interaction (RRI). Deploying different modalities and multimodal systems, such as facial expression, gestures, and speech, improves the determination of the emotional state.

4.1 Facial Expression Recognition

Studies on facial expression recognition have been carried out for more than three decades, starting in the 1970s. Paul Ekman et al. [1] postulated six cross-cultural, basic emotions (anger, disgust, fear, happiness, sadness, and surprise) from a psychological point of view and developed the Facial Action Coding System (FACS) to describe facial micro-expressions [19]. Our work also selects the six basic emotions plus the neutral state as the classes for facial expression classification. In general, a facial expression recognition system has three basic parts (a minimal detection sketch follows the list):

  • Face detection: Most face detection methods can detect only frontal and near-frontal views of the face. Viola and Jones [20, 21] utilized a large set of rectangular features to detect faces in real time.

  • Facial feature extraction: Several kinds of features (geometric features, appearance features, and hybrid geometric/appearance features) are extracted for recognizing facial expressions.

  • Facial expression recognition: Different methods exist for facial expression recognition. Due to the lack of robust features, most facial expression recognition models perform poorly in complex environments [22].
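As referenced above, the following is a minimal sketch of the face detection step using the OpenCV implementation of the Viola-Jones cascade classifier [20, 21]. The image path is a placeholder and the parameters are typical defaults, not tuned settings from our system.

```python
# Minimal face-detection sketch with OpenCV's Viola-Jones cascade.
# The image path is a placeholder; parameters are typical defaults.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(image_path):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Returns (x, y, w, h) boxes for frontal and near-frontal faces.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in faces]
```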

In recent years, deep learning has attracted attention from academia and industry due to its success in computer vision. Our work takes advantage of deep models to extract robust facial features and uses them to recognize facial emotions. FACS analysis [26] has been employed to derive the feature details that are important during the formation of a specific facial expression. There are 13 moving points (11 active points and 2 passive points) and 6 non-moving reference points. The FAUs have been mapped to the corresponding feature-level movements as given in Table 1. We denote vertical-up motion by ↑, vertical-down motion by ↓, horizontal outward stretching by ‘⟷’, horizontal inward compression by ‘↢’, oblique downward stretching by ‘↘’, and oblique upward stretching by ‘↗’. If the expression is symmetric, the superscripts L (left) and R (right) are omitted. Movements that are optional or indicate a higher intensity are enclosed in square brackets ‘[’ … ‘]’. Conjunction is expressed by concatenation and disjunction by a vertical bar ‘|’. Essential feature points are listed within parentheses ‘(’ … ‘)’, separated by ‘,’. The details are presented in our previous research [27].

Table 1. Feature Point Displacements (FDP)

4.2 Speech Recognition

Human language encodes emotional information in two different ways:

  • What is said, and

  • How it is said.

Accordingly, a spoken message can be broken down into two components:

  • A semantic one, and

  • A paralinguistic one.

Several approaches to recognizing emotions from speech have been reported [28-30]. Voice communication systems should be able to handle non-linguistic information, such as emotions, along with the message. For instance, utterances associated with happiness are characterized by longer duration, shorter inter-word silences, and higher pitch and energy values with wider ranges. In sad sentences, the energy and the pitch are usually kept at the same level, so these emotions are hard to separate. We use three important speech characteristics to model emotional speech:

  • The standard deviations and ranges;

  • Maximum, minimum and median values of the pitch; and

  • Energy.

The deep neural network is trained to resolve complex problems based on the available knowledge. Evaluations of the individual feature groups, as well as of the combined set, have led to the following observations: among the acoustic features, duration and energy appear to be the most relevant, while voice quality shows less impact. However, no single group outperformed the pool of all acoustic features. In our experiments we restricted the set of features to those that can be extracted in real time and in a fully automatic mode; a minimal extraction sketch is given below.
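The following sketch shows one possible way to extract the pitch and energy statistics listed above using the librosa toolkit. The file path, sampling rate, and pitch search range are illustrative assumptions, and this is not the exact extraction pipeline used in our experiments.

```python
# Minimal sketch of pitch/energy feature extraction with librosa.
# Path, sampling rate, and pitch range are illustrative assumptions.
import librosa
import numpy as np

def speech_features(path="utterance.wav"):
    y, sr = librosa.load(path, sr=16000)
    # Frame-level fundamental frequency; unvoiced frames are NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    # Frame-level energy (root-mean-square).
    rms = librosa.feature.rms(y=y)[0]
    return {
        "pitch_min": np.nanmin(f0), "pitch_max": np.nanmax(f0),
        "pitch_median": np.nanmedian(f0), "pitch_std": np.nanstd(f0),
        "pitch_range": np.nanmax(f0) - np.nanmin(f0),
        "energy_mean": rms.mean(), "energy_std": rms.std(),
        "energy_range": rms.max() - rms.min(),
    }
```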

4.3 Gesture Recognition

Gestures are expressive and meaningful body motions involving the hands, face, head, shoulders, and/or the complete human body. Gesture recognition has a wide range of applications, such as sign language for communication among the disabled, lie detection, monitoring emotional states or stress levels of subjects, and navigating and/or manipulating virtual environments. Recognition of emotion from gestures is challenging because there is no generic scheme to represent a subject’s emotional state by his or her gestures. Further, gestural patterns vary widely depending on the subject’s geographical origin, acculturation, and the strength and intensity of his or her expressions. Gestures can be static, involving a single pose, or dynamic, with pre-stroke, stroke, and post-stroke phases [31]. Automatic identification of continuous gestures requires temporal segmentation. The most common gestural pattern used in emotion identification is hand movement. Glowinski et al. [32] proposed an interesting technique for hand (and head) gesture analysis for emotion recognition. Camurri et al. [33] classified expressive gestures from full-body movement during the performance of a dance. They identified motion cues and measured overall duration, contraction index, quantity of motion, and motion smoothness. On the basis of these motion cues, they designed an automated classifier to classify four emotions (anger, fear, sadness, and happiness). Castellano et al. [34] employed hand gestures for emotion recognition. A sketch of two of the above motion cues is given below.
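The following is a minimal sketch, under our own simplifying assumptions, of two of the motion cues mentioned above: quantity of motion and contraction index, computed on binary body-silhouette frames. It is an illustration of the idea, not the formulation used in [33].

```python
# Illustrative motion cues computed on binary silhouette frames (numpy arrays).
import numpy as np

def quantity_of_motion(prev_silhouette, silhouette):
    """Fraction of pixels that changed between two consecutive silhouettes."""
    return float(np.mean(prev_silhouette != silhouette))

def contraction_index(silhouette):
    """Silhouette area over bounding-box area; closer to 1 = more contracted posture."""
    ys, xs = np.nonzero(silhouette)
    if xs.size == 0:
        return 0.0
    box_area = (xs.max() - xs.min() + 1) * (ys.max() - ys.min() + 1)
    return float(silhouette.sum()) / float(box_area)
```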

5 Integrated Model

Figure 1 shows the integrated model, which covers both EE and ER for emotion in robots and aims at better HRI and RRI [27]. In the emotion recognition part, the data are fed to the CNNs and the fusion is computed based on the weight of each modality in human-robot interaction, from which the overall accuracy is obtained. For instance, if the system obtains 75 % from speech, 95 % from facial expression, and 80 % from gesture, then, based on their weights (for example, following Mehrabian: 7 % for linguistic language, 38 % for paralanguage, and 55 % for facial expressions), the scores are multiplied by these weights and their weighted average gives the final result; a worked sketch is given below. For emotion expression, on the other hand, based on the recognized emotion and the cognitive appraisal, the system retrieves from the databases the words, gestures, and facial expressions that are most closely linked to the emotional state recognized in the previous stage.
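The worked sketch below reproduces the weighted fusion example above. The text does not specify how gesture maps onto Mehrabian’s three channels, so the weight assignment used here is an illustrative assumption only.

```python
# Illustrative weighted decision-level fusion; the mapping of modalities
# to Mehrabian's weights is an assumption for this example.
def weighted_fusion(scores, weights):
    """Normalized weighted average of per-modality recognition scores (%)."""
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

scores = {"speech": 75.0, "face": 95.0, "gesture": 80.0}
weights = {"speech": 0.38, "face": 0.55, "gesture": 0.07}  # assumed mapping
print(round(weighted_fusion(scores, weights), 2))  # 86.35
```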

Fig. 1. Integrated model

For the ER and EE parts, we used decision-level fusion of the data in the ER part and diffusion in the EE part. In decision-level fusion each modality is first classified independently: each biometric trait is captured, features are extracted from it, and the modality is classified based on those features. The final classification is based on the fusion of the outputs of the different modalities. This is the highest level of fusion with respect to the human interface. In other words, the decisions of the individual biometric systems are combined to construct the final decision [35].

6 Experimental Results

Table 2 shows the confusion matrix of the emotion recognition system based on facial expressions; the overall performance of this classifier was 80.4 %. Table 3 shows the performance of the emotion recognition system with respect to gesture analysis; the overall performance here is 86 %. Table 4 shows the confusion matrix of the emotion recognition system based on speech; the overall performance of this classifier is 83 %. Table 5 shows the performance of the system with decision-level fusion using the best-probability approach, with an overall accuracy of 98.8 % (a sketch of this computation follows the table captions).

Table 2. Confusion matrix for facial expressions
Table 3. Confusion matrix for gesture
Table 4. Confusion matrix for speech
Table 5. Decision level fusion
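For completeness, the sketch below shows how the overall accuracies reported for Tables 2-5 can be computed from a confusion matrix. The 3x3 matrix is a toy example, not data from the tables.

```python
# Overall accuracy from a confusion matrix; the matrix below is a toy example.
import numpy as np

def overall_accuracy(confusion):
    """Correctly classified samples (diagonal) over all samples, in percent."""
    confusion = np.asarray(confusion, dtype=float)
    return 100.0 * np.trace(confusion) / confusion.sum()

toy = [[45, 3, 2],
       [4, 40, 6],
       [1, 5, 44]]
print(f"{overall_accuracy(toy):.1f} %")  # 86.0 %
```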

7 Conclusion

We implemented the model for the six basic emotional states; other emotional states, such as mixed emotions, are much harder for robots to identify. We implemented our multimodal system for automatic emotional state recognition. The proposed model achieves better performance when multimodal information is used, in this case composed of facial expression, speech, and gesture. The model is able to learn from three different data streams: speech, facial expression, and gesture. It deploys CNNs for better learning and identification. The results show better performance compared with the previous method. Our experiments show that a significant improvement in identification accuracy is achieved when we use Convolutional Neural Networks (CNNs) and multimodal information, from 91 % reported in the previous research [27] to 98.8 %. For future work, we plan to address mixed emotions and test the model on them, and then deploy the model in a real-world scenario with a telepresence robot; we plan to port it to and test it on Double [36].