1 Introduction

Emotion plays an important role in daily life due to its effect on people’s behaviour. Several studies have investigated the physiological changes related to emotion in order to find a correspondence between emotional states and biometrics. To this end, automatic systems have been designed to extract individual physiological measures (e.g. electrodermal activity, heart rate, the brain’s electrical activity) obtained in response to external stimuli [1, 7, 11, 12, 41, 44, 46, 55, 66]. These studies have shown that the collected measures are similar across people experiencing the same emotion, thereby allowing emotional states to be inferred from the interpretation of physiological reactions. For example, the polygraph is a widely used tool for identifying physiological reactions to emotion and, as a result, works as an effective lie detector, although its cost is a limiting factor [38].

Aiming to find a cheaper method that can be used in natural human interactions, facial-based systems have been developed [8, 9, 18, 32, 39, 42, 57, 60, 68, 72, 74, 81]. A general system consists of face detection, extraction of facial features, expression recognition, and emotion inference [59]. Developing such an automated facial expression analysis system is a complex task because many factors affect the processing of facial information. Face shape, facial hair, eyeglasses, age, gender, occlusion, differences in expressiveness, and degree of facial plasticity are examples of common constraints which robust algorithms must handle to obtain consistent results. To do so, the automated system requires sufficient information to cover this wide spectrum of ethnic and physiological factors. In addition, the system has to operate in real time, since facial responses can be fast and sudden. There are many methods for performing each step of facial expression analysis, depending on the constraints in face acquisition (e.g. image resolution, illumination, and occlusion) and the desired efficiency of the recognition task (e.g. real-time response and error rate).

The interest in studying facial expressions in order to recognise emotion is not new, and there is a wide spectrum of applications for this specific aim, such as prediction of soft biometrics, health care, human–machine interaction, diagnosis, lie detection, and so on [15]. Knowledge of an individual’s emotional state provides information which can be extremely valuable for interpreting particular scenarios or evaluating human activities. To obtain such knowledge, it is important to understand why and how we react to emotions through facial expressions.

Darwin was the first to claim that emotions and their expressions are biologically innate and evolutionarily adaptive. Ekman and Friesen conducted a study to investigate the universality of emotional facial expressions, whose findings led to the conclusion that people have the innate ability to perform and interpret six facial expressions (happiness, anger, disgust, fear, surprise, and sadness), which are called universal, although their intensity and initiation depend on cultural factors [23]. A more recent study has suggested four basic emotions instead of six [37]. This conclusion was reached by analysing each facial muscle activated in signalling emotions, which showed clear differences between the facial expressions of happiness and sadness, while similarities were found between other emotions (fear and surprise; anger and disgust). Despite the different perspectives on the universality of some emotional facial expressions, there is a general representation of them, which is called prototypic. Most studies on facial expression analysis take prototypic expressions as the basis for training, introducing a bias that does not reflect facial expression analysis in real-world scenarios. For this reason, contemporary studies have been considering spontaneous facial expressions [2, 10, 15, 31, 47, 73, 82, 83], but there are still unresolved issues in this field.

Although the prototypic expressions of basic emotions are universal [22], which makes them easy to identify, natural expressions (expressions elicited spontaneously in daily life) differ mainly in expressiveness, which is not universal [22]; therefore, they can lead us to infer incorrect emotional states.

Some studies have shown the impact of natural and prototypical expressions on the accuracy of classifiers [3, 72, 77]. The results showed a higher accuracy rate when using a database with prototypic facial expressions than using spontaneous ones. For example, Valstar and Pantic [72] obtained a recognition rate of 72% using spontaneous facial expressions.

In this paper, we investigate the use of spontaneous facial expressions in emotion analysis to explore the possibility of developing a system able to infer emotional states, and to determine the extent to which this type of personal characteristic inference is likely to be possible in practice. We have investigated spontaneous facial expressions rather than prototypic ones so that we could build a setting close to natural human interaction, which included sonorous and visual stimuli, facial expression analysis, and emotion inference based on newly collected data.

There are many reasons which encourage the development of an emotion analysis method considering spontaneous facial expressions:

  • Emotion analysis can be applied in different contexts providing essential information which could be helpful for areas such as medicine [13, 16, 58], psychology [29], forensics [20], human-machine interface [80], music [50], graphic animations [67], and so on.

  • As previously mentioned, most studies explore prototypic facial expressions over spontaneous ones and therefore the resultant methods are not suitable for real-world scenarios. Thus, the analysis of spontaneous facial expressions could lead to the development of more robust approaches in facial expression analysis.

  • The use of facial features in emotion analysis is a subject addressed by theorists who discuss the universality of emotions. The existence of similar facial reactions to emotions across different cultures is widely explored in the literature in attempts to prove or disprove this hypothesis, which, once disproved, could lead to different perspectives on emotion inference from the face.

This paper is organised as follows: Sect. 2 describes the emotion models from the literature which were considered during the elaboration of FAMOS; Sect. 3 presents a semi-automatic facial expression analyser employed to identify facial expressions; Sect. 4 presents the stimuli for eliciting spontaneous facial expressions and the database acquisition process; Sect. 5 discusses the results obtained in each experimental scenario; and finally, Sect. 6 presents our conclusions and perspectives for future work.

2 Emotion model

Emotion models are intended to support the recognition of emotions by describing their features. A well-defined emotion model is essential to cover the main parameters related to the affective experience. Among the different emotion models, two-dimensional ones, in which emotions are represented as a valence–arousal vector, are the most common [11, 45, 64, 71]. Despite this, there is increasing interest in exploring new dimensions, since two dimensions are not always enough to describe affective experiences because they cannot represent the rich semantic space of emotion [26, 70]. Besides, some studies have pointed out that the arousal dimension should not be considered an atomic measure, since it is composed of two sub-dimensions, arousal–calmness and tension–relaxation, related to opposite causes [63]. This perspective has encouraged the use of multidimensional models with more than the two traditional dimensions in recent studies [49, 62, 78].

In support of the multidimensional approach, a study demonstrated that emotion models should consider a set of six emotion components, namely appraisal of events, psychophysiological changes, motor expressions, action tendencies, subjective experiences, and emotion regulation, due to their high correlation with emotion experiences related to 24 prototypical emotion terms [26]. The four dimensions highlighted in that study were those which presented the greatest variance in the analysed sample: evaluation–pleasantness, potency–control, activation–arousal, and unpredictability. These dimensions were shown to be significantly correlated with the six emotion components, which in turn have been shown to describe the emotional experience robustly [25, 61]. These results also imply that simple two-dimensional models, which are common in the literature, miss major sources of variation in the emotion domain.

Taking these results into account, these six emotion components were used as a reference for the elaboration of a new multidimensional model called FAMOS (Facial expressions, Appraisal, MOod, and Subjective experiences). The emotion words explored in this model were those related to three basic emotions: happiness, sadness, and fear. They were selected because such emotions can be provoked through aesthetic art forms (music and images) strongly enough to produce visible physical changes and without manipulating people [79]. For this reason, anger/disgust were not explored in this study.

Figure 1 illustrates each of these dimensions and shows how they were obtained. The input consists of both subjective factors and physiological ones (facial expressions). Each subjective dimension is described as a numerical attribute scaled from 1 to 5, obtained from questionnaires, while the facial expressions are described by action units (AUs), obtained by a facial expression analyser (FEA). As a result, it is possible to describe the experienced emotion in terms of these dimensions, grouping the output data into basic emotions.

Fig. 1 FAMOS emotion model design

The FAMOS consists of the following dimensions:

  • Mood (action tendencies component) It represents the individual’s emotional state before their exposure to the eliciting stimulus. Subjects describe their initial emotional state on a 1–5 Likert scale, where 1 is related to low-valence emotions (sadness), 3 to distress feelings (fear), and 5 to high-valence emotions (happiness). It is also possible to choose a middle ground between either sadness and fear or fear and happiness, depending on the subject’s perspective on their mood. Mood can influence the emotion arousal process [40]; therefore, it was considered in the emotional experience analysis in order to find its correlation with the reported appraisal.

  • Appraisal (appraisal of events component) It represents the appraisal associated with the eliciting stimulus. After being exposed to the stimulus, the individual is asked to answer a second questionnaire, in which they report their appraisal on a 1–5 Likert scale, where 1 is related to low-valence emotions (sadness), 3 to distress feelings (fear), and 5 to high-valence emotions (happiness). It is also possible to choose a middle ground between either sadness and fear or fear and happiness, depending on the subject’s perspective on their emotional state. These ratings are used to assess the stimulus efficiency by checking whether the appraisal is consistent with the expected emotion. Furthermore, the correlation between subjective reports and facial responses is evaluated in an attempt to find empirical evidence of the agreement between them.

  • Subjective experiences (subjective experiences component) Episodic memory can produce several different types of emotion in a short period of time [40], which makes personal information about an individual (e.g. previous contact with the eliciting stimulus) highly desirable [79]. Concerning the auditory stimuli, individuals with musical experience, for example, may perceive hidden details that help the cognitive process of understanding musical cues, which is necessary to arouse emotional experiences through music. We also considered people who easily get emotional, since their emotional trigger time is short, and eclectic listeners, who are more receptive to musical cues. Taking these facts into account, this personal information was collected through questionnaires applied before and after the exposure to the eliciting stimuli.

  • Facial expressions (motor component) In response to an emotion-eliciting situation, a person can demonstrate an emotion in many ways, and one of the most significant is through facial expressions. Mehrabian [48] indicated that the verbal stimuli (i.e., spoken words) of a message contribute only \(7\%\) of the effect of the message as a whole, the vocal stimuli (e.g., voice intonation) contribute \(38\%\), while the facial expression of the speaker contributes \(55\%\) of the spoken message effect. Thus, it is possible to infer the emotion experienced by someone through the analysis of their facial expressions. In this work, the facial expressions were obtained from facial pictures extracted from video recordings of people during their exposure to an eliciting stimulus. Then, a FEA was employed to extract AUs from each picture, as presented in Sect. 3, which subsequently allowed us to check the existence of AU patterns associated with the basic emotions.

We have adopted Likert scales for representing Mood and Appraisal since they are more intuitive and easier for individuals to rate than the usual 2-D valence–arousal vectors. In addition, the arousal component is already tied to the emotion terms (e.g. happiness: high valence, average arousal [56]).
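To make the representation concrete, a FAMOS data point can be stored as a simple record, as in the illustrative Python sketch below. The field names, types and example values are our own assumptions derived from the descriptions above, not part of the original model specification.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FamosObservation:
    """One subject's data point in the FAMOS model (illustrative sketch only)."""
    mood: int                  # 1-5 Likert rating before the stimulus (1 = sadness, 3 = fear, 5 = happiness)
    appraisal: int             # 1-5 Likert rating after the stimulus, same anchors as mood
    # Subjective experiences component (questionnaire answers)
    musical_experience: bool   # has musical experience
    eclectic_listener: bool    # receptive to many musical styles
    gets_emotional: bool       # reports getting emotional easily with music
    knew_song: bool            # previous contact with the eliciting track
    # Motor component: AUs detected by the facial expression analyser
    action_units: List[int] = field(default_factory=list)

# Example record for a subject in the happiness scenario (made-up values)
obs = FamosObservation(mood=4, appraisal=5, musical_experience=True,
                       eclectic_listener=True, gets_emotional=False,
                       knew_song=False, action_units=[6, 7, 12, 25])
```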

The subjective dimensions explored in this work were associated with the processing of musical cues in attempting to check links between musical load and emotional experiences. The ratings from individuals were compared to their facial reactions, which were extracted by a FEA. The results obtained by the FEA were compared to human assessment as well, in order to evaluate the accuracy and precision of the method.

The emotion components emotion regulation and psychophysiological changes were not explored in this study, since the emotions were assumed to be spontaneous (without attenuation, amplification, concealment, or substitution), and only motor expressions were considered in an attempt to obtain a low-cost method (physiological data acquisition is quite expensive).

Few studies have considered contextual information to improve the facial expression analysis task [83]; therefore, we have analysed the effect of subjective information on the emotional experience for this particular goal.

3 Facial expression analysis

Facial expressions can be defined as changes in the face provoked by internal emotional states, in response to external stimuli, or to communicate the individual’s intentions [69]. Processing information from the face has helped humans to survive since before the emergence of language. For instance, perceiving someone’s intentions was fundamental to escape from predators or to recognise potential enemies and reproductive partners [14]. This ability is more prominent in women, who, according to studies in this field [17, 34], are better at processing facial information; these studies suggest that the reason is women’s need to look after infants by decoding and detecting distress on their faces, or to protect them against threatening signals from other individuals [51].

In order to describe facial expressions, Ekman and Friesen [24] developed a facial code called facial action coding system (FACS). The system describes the physical expression of emotions through AUs, which are the smallest visually discernible facial movements related to each facial muscle or group of facial muscles. Table 1 shows the upper and lower face AUs used in our study.

Table 1 Upper and lower face AUs [24]

In order to find a unified set of AUs to describe spontaneous emotions, we have employed a method based on a model-driven technique, which depends on prior information about the face (a neutral face) and landmark detection. The presented method infers the AUs from the differences between neutral and expressive faces. We have chosen a local model-based method because it has the advantage of being suitable for both single images and image sequences, and it is not affected by age wrinkles, since a template image is taken into account during the inference of facial expressions. Also, it does not require the extensive prior knowledge about the object of interest demanded by image-based approaches. The proposed FEA is illustrated in Fig. 2.

Fig. 2 Facial expression analyser

Each step of the facial feature extraction can be described in detail as follows:

  1. Face detection It is responsible for extracting the face from the image. For this purpose, we have used the Viola–Jones method [76], which can be described by three key concepts: integral image generation, selection of Haar-like features by the AdaBoost algorithm, and generation of the cascade of classifiers. The first step of the implemented algorithm is contrast adjustment using histogram equalisation [30]. Then, the classifier is loaded from a file containing a cascade of classifiers trained using several positive and negative face images. The OpenCV platform [5] offers these files freely; therefore, it was not necessary to train the classifier. Once the face is detected, the image is cropped, removing unessential information. Figure 3 shows examples of detections performed by this algorithm; even for images with poor lighting, the face was detected. A minimal sketch of this step is given after the Fig. 3 caption.

  2. FCP marking Facial points are marked manually based on the geometric model proposed by Kobayashi and Hara [42] for extracting facial features. This geometric model describes the face through 30 points called facial characteristic points (FCP). The points were chosen based on the key locations of each facial deformation involved in expressive facial expressions. As a result, it is possible to identify expressive changes in the face by analysing the FCP obtained from each facial image. The FCP were marked manually to ensure the accuracy required to identify subtle facial expressions, since such accuracy could be harmed by automatic extraction methods. Finally, the facial points are normalised by applying three transformations to the FCP coordinates: translation, rotation, and scaling.

  3. Calculation of feature values Using the normalised FCP (obtained in Step 2), the feature values were calculated from geometric features of the face [39], as shown in Table 2. For example, ieb_height describes the inner eyebrow height, a feature value which considers the normalised FCP 19, 20, 21, and 22. In the original method, only the upper facial points of one side were considered. Moreover, the calculation of geometric features was not efficient in specific situations; for example, in the case of changes in the eyebrows related to AUs 1 and 2, the feature value \(eb\_height\) combined both outer and inner brow height, and therefore these AUs were always found together. Another problem was found in the feature m_mos, which was not enough to describe some lower-face AUs (e.g. AU 15). In order to handle these problems, three new features have been proposed to describe these changes accurately: ieb_height, oeb_height and lc_height. The reformulated model is shown in Table 2.

Table 2 Feature values adapted from Jongh [39]
  4. Action units inference After the calculation of feature values from the neutral and expressive pictures, the obtained differences are matched to AUs using a rule-based system. The rules were created by evaluating the changes in the mouth, eyes, and eyebrows presented in expressive faces. Table 3 shows the changes with respect to a neutral face and the corresponding AUs. Thresholds (values expressed in pixels) were obtained through training and used to discard changes in facial expressions which are not significant enough to signal a new expression. For instance, AU 1 is found if there is an increase in the inner eyebrow height (measured from FCPs 19 and 20) compared to the values obtained from the neutral face. A minimal sketch of this rule-based matching is given after the training description below.

Table 3 AUs inference—criteria

The classifier was trained using 40 positive samples containing spontaneous facial expressions (obtained during the experiment presented in Sect. 4) and 40 negative samples containing neutral expressions. The changes in feature values found after FCP normalisation were measured in order to find thresholds for each rule presented in Table 3. Afterwards, new images were submitted to the classifier and the results were compared to human assessments.
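To make the rule-based matching concrete, the sketch below shows how thresholded differences between neutral and expressive feature values could be mapped to AUs in Python. The feature names follow Table 2, but the specific rules and threshold values shown here are illustrative assumptions, not the trained thresholds or the full rule set of Table 3.

```python
from typing import Dict, List

# Illustrative thresholds in pixels (assumed values, not the trained ones)
THRESHOLDS: Dict[str, float] = {
    "ieb_height": 2.0,   # inner eyebrow height
    "oeb_height": 2.0,   # outer eyebrow height
    "lc_height": 1.5,    # lip-corner height
    "m_mos": 2.0,        # mouth opening
}

def infer_aus(neutral: Dict[str, float], expressive: Dict[str, float]) -> List[int]:
    """Match feature-value differences (expressive - neutral) to AUs.

    Only a few example rules are shown; the full rule set corresponds to Table 3.
    """
    diff = {name: expressive[name] - neutral[name] for name in neutral}
    aus: List[int] = []
    if diff["ieb_height"] > THRESHOLDS["ieb_height"]:
        aus.append(1)    # AU 1: inner brow raiser
    if diff["oeb_height"] > THRESHOLDS["oeb_height"]:
        aus.append(2)    # AU 2: outer brow raiser
    if diff["lc_height"] > THRESHOLDS["lc_height"]:
        aus.append(12)   # AU 12: lip corner puller (smile)
    elif diff["lc_height"] < -THRESHOLDS["lc_height"]:
        aus.append(15)   # AU 15: lip corner depressor
    if diff["m_mos"] > THRESHOLDS["m_mos"]:
        aus.append(25)   # AU 25: lips part
    return aus           # an empty list means a neutral expression
```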

Fig. 3 Object detection using the Viola–Jones method. The larger square represents the face classifier and the smaller ones represent the facial feature classifiers. a Glasses and poor lighting, b eyes closed and poor lighting, c smile and poor lighting, d upward inclination, e downward inclination and f smile. Source: http://biometrics.idealtest.org/. Access date: 04 April 2015
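A minimal sketch of the face detection step (Step 1 above) is given below, assuming the opencv-python package, which ships the pretrained Haar cascade files mentioned in Sect. 3; the parameter values are illustrative.

```python
import cv2

# Pretrained frontal-face cascade distributed with opencv-python
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_crop_face(image_path: str):
    """Equalise contrast, run the Viola-Jones detector and crop the first face found."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                 # contrast adjustment (histogram equalisation)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                               # no face detected
    x, y, w, h = faces[0]
    return img[y:y + h, x:x + w]                  # cropped face region
```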

4 New emotion database acquisition: technical specifications

4.1 Subjects

Since it is rare to find a freely available database with non-prototypic facial expressions, with certain exceptions [19, 52, 53], a new database with facial expressions of elicited emotions was built as part of our contribution. The participants in the data collection were students from Universidade Federal do Rio Grande do Norte (UFRN) and employees of the SIG Software company; only individuals without facial deformities were considered, 52 females and 49 males aged 18–60. The average time spent per person during the experiment was 7 min. Each participant was exposed to only one emotional stimulus, since after the procedure most people became aware of the nature of the research.

4.2 Apparatus and stimulus materials

The images were collected using a webcam (\(1280 \times 720\)). Since we have proposed a frame-based method for extracting facial features, instead of analysing videos we captured frames from the recorded data at 5 frames per second. The seven most expressive frames from each subject were used in the data analysis stage, in an attempt to avoid attenuation of the results caused by excessive neutral frames.
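One possible way to sample frames from the recorded videos at the stated rate is sketched below; the 5-fps target follows the description above, while the function and variable names are our own.

```python
import cv2

def sample_frames(video_path: str, target_fps: float = 5.0):
    """Yield frames from a recorded video at roughly `target_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(1, round(native_fps / target_fps))    # keep every `step`-th frame
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield frame
        index += 1
    cap.release()
```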

In music, several studies have explored the effects of musical parameters on the individual’s mood [35, 43, 44, 46, 65]. A survey of Music Information Retrieval (MIR) is presented by Hu (2010), who confirms the existence of a mood effect in music [6] and identifies the main musical parameters with some correspondence with listeners’ judgements, along with divergences due to subjective factors. Taking these facts into account, we have proposed sonorous stimuli for eliciting emotional states. The musical tracks were selected based on previous experiments from the literature [21, 27, 44] that reported high emotion levels, measured through physiological changes (e.g. heart rate, blood pressure, skin conductance level, finger temperature, and respiration measures), in individuals who were monitored while listening to these tracks. The selected tracks are listed in Table 4.

Table 4 Tracks used for eliciting emotions

At first, the experiment used only the auditory stimuli presented in Table 4, but the results showed that these stimuli were not sufficient to provoke visible changes in facial expressions, and the absence of a fixed spot on which to focus the individual’s vision generated distraction, which harmed the emotion-eliciting process. Taking these preliminary results into account, it was decided to employ visual stimuli as well. Verbal messages were not used, so that the eliciting stimulus could be presented to anyone without relying on linguistic factors.

Facial pictures can elicit emotion based on intrinsic human features: mimicry, empathy, and cognitive load [33]. On account of this, we selected images of people in particular emotional scenarios, depending on the emotion expected to be provoked. The pictures were chosen considering cultural categories related to the induction of particular emotions. For example, pictures of people smiling were chosen for eliciting happiness; pictures of people crying, alone, ill, victims of violence, or living in poverty were chosen for eliciting sadness; and pictures taken from horror movies were chosen for eliciting fear. Figure 4 shows samples of the images used, labelled by emotion.

Fig. 4 Visuals intended for eliciting emotions. a Happiness (Source: http://www.logueria.com/blog/91-como-ter-clientes-mais-felizes. Access date: 09 March 2014), b sadness (Source: http://www.pictures88.com/comments/sad/. Access date: 09 March 2014), c fear (Source: http://www.dirtyandthirty.com/power/manifestation-monday-fear-what-are-you-scared-of-let-it-go/. Access date: 20 December 2014)

The selected images were submitted to a 30-person jury, who were asked to report which emotion was transmitted by the images, presented in an online questionnaire one image per page. The jury had unlimited time to judge each image before moving to the next one. The musical stimuli used in this study were not evaluated, since they had been evaluated in previous studies. The available alternatives for the 41 images (11 for happiness, 14 for sadness and 16 for fear) were happiness, sadness, fear, or none. The results were evaluated to determine the consistency of judgements about the emotions provoked by the images, thereby supporting the visuals selected for eliciting emotions. Figure 5 shows the ratings given by the jury.

Fig. 5 Reports for visual stimuli regarding the emotions elicited by them. a Happiness images, b sadness images and c fear images

For the purpose of describing the agreement among the jury’s reports, Fleiss’ kappa (\(\kappa \)) values were obtained [75]. Happiness obtained \(\kappa _h = 0.68\), which indicates substantial agreement; sadness and fear had moderate agreement (\(\kappa _s = 0.56\) and \(\kappa _f = 0.52\)); and none had slight agreement (\(\kappa _n = 0.03\)), according to the interpretation of \(\kappa \) by Viera et al. [75]. The overall \(\kappa \) was 0.45, which is considered moderate and therefore implies moderate agreement about the emotions elicited by the selected visual stimuli.
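For reference, Fleiss’ kappa can be computed from the jury’s rating counts as in the sketch below. The rating matrix in the example is made up for illustration and does not reproduce the actual jury data.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a (n_items x n_categories) matrix of rating counts.

    Each row must sum to the number of raters (30 jurors in our setting).
    """
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)          # category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 3 images rated by 30 jurors into [happiness, sadness, fear, none]
ratings = np.array([[26, 2, 1, 1],
                    [3, 22, 4, 1],
                    [1, 5, 23, 1]])
print(round(fleiss_kappa(ratings), 2))
```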

4.3 Procedure

The procedure can be described as follows:

  1. The protocol of each step of the experiment is explained to the participant. The nature of the research is not revealed, so that the results are not affected by unconscious biased behaviour.

  2. The participant is warned about the filming procedure and, in order to enable data acquisition, a consent form is given to formalise the release of the obtained images for research purposes.

  3. The participant is asked to fill in a questionnaire about their musical experience, current feelings, whether they are musically eclectic, and whether they often get emotional while listening to music.

  4. A picture of the individual is taken to serve as a reference neutral facial expression. Participants wearing glasses were asked to remove them during the experiment.

  5. The participant simultaneously hears a track and sees pictures displayed and changed every 5 s, both representing a specific emotion, while their face is being filmed. This process takes about 1 min. At the end, an image is displayed informing the participant that the exposure has ended, and the individual is then directed to a second questionnaire.

  6. The participant is asked to fill in the second questionnaire about their new emotional state and any previous contact with the song played. The latter question was necessary because, in some cases, it is easier to show emotional signals on first contact with a song, since emotions elicited by music and related to violation of expectancy may include anxiety/fear [40].

When facing the eliciting stimulus, participants displayed facial expressions, and the most perceptible ones were obtained with the happiness stimulus, as reported in previous studies [4, 36]. Figure 6 shows samples collected during the experiment of neutral faces—(a), (b) and (c)—and samples from the same users after they were exposed to stimuli of sadness, happiness or fear—(d), (e) and (f)—respectively.

Fig. 6 Samples of expressive and neutral faces obtained during the experiment. a Neutral face, b neutral face, c neutral face, d sadness, e happiness and f fear

5 Results and discussion

5.1 Face action units recognition

In this study, images were obtained from the 101 subjects who participated in the experiment presented in Sect. 4, considering only individuals without facial deformities, 52 females and 49 males, aged 18–60, approximately 33 per emotion. Approximately 270 frames (\(640 \times 480\) pixels) were obtained from each participant, plus an additional frame representing each individual’s neutral expression. Among the collected frames, the seven most expressive ones were selected for facial expression analysis. Forty images were used during the training step to obtain thresholds for each rule of the system (see Table 3). After the training step, the 61 remaining images were submitted to the analyser. Table 5 contains the results for a set of about 700 images (training images + new images). Neutral expression occurrences were not considered in the analysis, except when they caused false positives or false negatives.

Table 5 AU recognition results on 101 subjects

The first column lists the AUs explored in this work according to the categorisation proposed by Ekman et al. [24]. The second column shows the occurrences of each AU, where only one occurrence was counted per subject (i.e. it varied between 0 and 1), without considering the frequency of AUs within frames of the same individual. The third column contains the AUs correctly classified by the analyser, taking human assessment as the ground truth. Columns I, D and S represent insertions, deletions, and substitutions, respectively. Insertions represent false positives, which occur when an AU is found but the ground truth states that the expression was neutral; for instance, AU 4 was found in three expressions considered neutral, therefore 4 insertions were obtained. Deletions represent false negatives, which occur when an AU is not found although, according to the ground truth, the expression was present in the image; for instance, an occurrence of AU 7 was not found by the analyser, which inferred a neutral expression instead. Substitutions represent the occurrence of an AU X that was misclassified as an AU Y; for instance, AU 12 and AU 15 were misclassified as AU 20, hence there are 2 substitutions for them and 2 insertions for the latter. Finally, the accuracy and precision columns represent efficiency metrics grouped by AU. Both metrics were obtained by means of the following equations:

$$\begin{aligned} \begin{aligned} {\mathrm{accuracy}}&= \frac{\mathrm{correct}}{\mathrm{occurrences}}, \\ {\mathrm{precision}}&= \frac{\mathrm{correct}-{\mathrm{incorrect}}}{\mathrm{occurrences}}. \end{aligned} \end{aligned}$$
(1)

The accuracy describes the overall success rate, and its value for the system was 96%. The precision, on the other hand, is a reliability metric which discounts correct results obtained by chance. To do so, the number of misclassified instances (insertions \(+\) deletions \(+\) substitutions) is subtracted from the number of correctly classified ones. The overall precision of the analyser was 92%.
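The per-AU metrics of Eq. (1) can be computed as in the short sketch below. The input numbers are illustrative: they are chosen only so that the output matches the overall rates quoted above, and are not per-AU figures from Table 5.

```python
from typing import Tuple

def au_metrics(occurrences: int, correct: int, insertions: int,
               deletions: int, substitutions: int) -> Tuple[float, float]:
    """Accuracy and precision for one AU, following Eq. (1)."""
    incorrect = insertions + deletions + substitutions
    accuracy = correct / occurrences
    precision = (correct - incorrect) / occurrences
    return accuracy, precision

# Illustrative values only (not taken from Table 5)
acc, prec = au_metrics(occurrences=50, correct=48, insertions=1, deletions=1, substitutions=0)
print(f"accuracy = {acc:.0%}, precision = {prec:.0%}")   # accuracy = 96%, precision = 92%
```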

5.2 Data analysis

We have analysed images of 101 subjects submitted to stimuli intended to induce emotional states and, consequently, unconscious facial responses, which are the focus of this study. Figure 7 shows some of the subjects who participated in the study.

Fig. 7 Fifteen of the 101 participants in the experiment

Assuming the selected stimuli were able to evoke the expected emotions, it was possible to describe each emotion in terms of facial reactions using AUs. Figure 8 shows the histogram of the AUs presented in each emotional scenario. The occurrences of each AU were counted without considering multiple occurrences for the same individual; for example, if an individual presented AU 12 several times, only one occurrence of this AU was counted in the analysis. The extraction of AUs was performed by the developed facial expression analyser jointly with human assessment in order to ensure the accuracy of the results.

Fig. 8 AUs grouped by emotion. a Histogram of AUs, b average representation of happiness, c average representation of sadness and d average representation of fear
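The per-emotion AU counts shown in the histogram of Fig. 8a can be aggregated as sketched below, counting each AU at most once per subject; the per-frame AU lists in the example are made up and do not reproduce the collected data.

```python
from collections import Counter
from typing import Dict, List

def au_histogram(frames_per_subject: Dict[str, List[List[int]]]) -> Counter:
    """Count each AU at most once per subject, as in Fig. 8.

    `frames_per_subject` maps a subject id to the AU lists of their seven
    most expressive frames.
    """
    counts: Counter = Counter()
    for frames in frames_per_subject.values():
        unique_aus = {au for frame in frames for au in frame}
        counts.update(unique_aus)
    return counts

# Made-up example for two subjects in the happiness scenario
happiness = {"s01": [[6, 12], [6, 7, 12], [12, 25]],
             "s02": [[12], [6, 12], [12]]}
print(au_histogram(happiness))   # e.g. Counter({12: 2, 6: 2, 7: 1, 25: 1})
```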

For Happiness, it was possible to find a pattern in the facial reactions presented by the individuals. The majority of them presented AUs 6, 7, 12, and, in some situations, AU 25. The pattern found is consistent with the prototypic representation of this emotion described in the literature, differing only in the occurrence of AU 7, which is associated with lowering the eyelids during spontaneous smiles but is rarely present in artificial smiles.

For Sadness, the predominant AUs were 7, 15 and 4, where AU 15 had an extremely subtle representation, in contrast to the prototypic representation of this emotion. AUs 4 and 7 were presented signalling discomfort caused by the eliciting stimulus. Apathy was also observed during exposure to the sadness stimulus.

The most significant difference was found for Fear. The predominant AUs were 7, 12, 4, 23 and 25 (either single or combined displays), where AU 7 was presented by individuals closing their eyes during image exposure, and AU 12 was presented in subtle smiles. Surprisingly, the smiles presented during Fear have a scientific explanation. Allan and Barbara Pease [54] describe two types of smile, one of which is known as the fear face, performed by primates. This fear smile communicates submission or anxiety in the face of a fear scenario, which is why some people smile when thinking about a frightening situation; it is an involuntary way of protecting themselves by showing submission to the threat. AU 23 is an indicator of anxiety as well, since this action is associated with the dryness of the mouth caused by situations of anxiety related, for example, to fear. AU 25 was observed with some frequency in some individuals, in a subtle representation different from the prototypic display of fear. The most expressive expressions were found for Happiness; the other scenarios presented low-intensity or even neutral expressions.

The data obtained through the questionnaires and the employed emotion stimuli were compared to the appraisal reported by the individuals, aiming to identify whether particular parameters affect emotional experiences.

We counted the occurrences of each answer to each question and compared them to the reported appraisal in order to obtain the correlation between them. In addition, we compared the appraisal to the emotional stimulus to check its validity. The purpose of this study is not to provide a formula for emotion inference, but rather to highlight, statistically, the impact of the proposed dimensions on the final emotional state, which implies that they should not be neglected in emotion assessment scenarios.

The evaluated parameters were the emotion stimulus, mood, musical experience, subject’s preferences, emotionality, previous contact with the musical stimulus, gender, and age. Figure 9 (parameters on the same scale) and Fig. 10 (parameters on different scales) show the scatter plots containing the data distribution according to its occurrence in the analysed sample.

Fig. 9 Scatter plot from emotional parameters: mood, music experience, preferences, and emotionality. a Appraisal \(\times \) mood, b appraisal \(\times \) music experience, c appraisal \(\times \) preferences and d appraisal \(\times \) emotionality

Fig. 10 Scatter plot from emotional parameters: stimulus, known song, gender, and age. a Appraisal \(\times \) stimulus, b appraisal \(\times \) known song, c appraisal \(\times \) gender and d appraisal \(\times \) age

The points represent combinations of appraisal and the explored measure, shown in grey scale, where the darkest points correspond to the highest occurrences in the sample (i.e. several subjects reporting the same values for the emotional parameters) and the lighter ones correspond to the lowest occurrences. The red line represents a linear trend line, which shows how the data are expected to be scattered in the sample given its distribution. The emotional parameters were described by numeric values: the stimulus is represented by 1 for Happiness, 2 for Fear, and 3 for Sadness (see Fig. 10a); Appraisal and Mood are represented on a 1–5 scale, 5 being the highest value (see Fig. 9a); Known song is represented on a 0–1 scale, 0 being the absence of previous contact with the song played during the experiment and 1 the existence of such contact (see Fig. 10b); Gender is represented by 0 for Male and 1 for Female (see Fig. 10c); and Age is represented by 0 for \(\le \,25\), 1 for 25–48 and 2 for 48–60 (see Fig. 10d).

Pearson’s correlation coefficient (\(\rho \)) [28] was obtained to quantify the correlation between the variables presented in Figs. 9 and 10; a moderate correlation was found between the emotion stimulus and appraisal (\(\rho _{s,a} \approx -\,0.62\)) and weak correlations between the other variables. These results show that the lower the stimulus value (in terms of the numeric labels for each emotion, as previously explained), the higher the reported appraisal, which suggests that the emotion stimuli were able to elicit emotional experiences since, in most cases, the reported appraisal was consistent with the expected emotion in each emotional scenario.
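The reported coefficients can be reproduced from the encoded questionnaire data with a standard Pearson correlation, for example as below; the arrays are made-up placeholders following the encoding described above, not the collected data.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder encodings following Sect. 5.2: stimulus (1 = happiness, 2 = fear, 3 = sadness)
# and appraisal on a 1-5 scale. These values are illustrative only.
stimulus = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
appraisal = np.array([5, 4, 5, 3, 3, 2, 1, 2, 1, 2])

rho, p_value = pearsonr(stimulus, appraisal)
print(f"rho = {rho:.2f} (p = {p_value:.3f})")
```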

The analysis of emotional parameters also considered each emotion individually. The results indicated a moderate correlation between mood and appraisal for Happiness and Fear, with correlation coefficients of \(\rho \approx 0.58\) for the happiness stimulus (see Fig. 11a) and \(\rho \approx 0.6\) for the fear stimulus (see Fig. 11b), which means that the higher the mood, the higher the reported appraisal in these emotional scenarios. For the sadness stimulus, the correlation was weak (see Fig. 11c), which indicates that for this emotion the mood did not affect the emotional experience (in terms of reported appraisal). Furthermore, a moderate correlation was found between appraisal and musical preferences for Fear (\(\rho \approx 0.42\)), indicating that in this emotional scenario the subjects who reported higher appraisal were the more musically eclectic ones (see Fig. 11d).

Fig. 11 Scatter plot from emotional parameters grouped by emotion. a Happiness: appraisal \(\times \) mood, b fear: appraisal \(\times \) mood, c sadness: appraisal \(\times \) mood and d fear: appraisal \(\times \) preferences

Considering a gender distinction in the sample, a moderate correlation (\(\rho \approx 0.54\)) was found between Mood and Appraisal for males (see Fig. 12a), while for females the mood did not affect the emotional experience of most individuals (see Fig. 12b). These results suggest that the higher the mood, the higher the appraisal reported by males, thus supporting the superior performance of females over males in emotion analysis, since mood did not affect females’ performance in this task. Concerning the correlation between Appraisal and Stimulus, both males and females presented moderate values (\(\rho \approx -\,0.62\)), which means that the eliciting stimuli provoked the same emotional states regardless of the subject’s gender (see Fig. 12c, d).

Fig. 12 Scatter plot from emotional parameters grouped by gender. a Male: appraisal \(\times \) mood, b female: appraisal \(\times \) mood, c male: appraisal \(\times \) stimulus and d female: appraisal \(\times \) stimulus

In summary, the overall results suggest that the designed stimuli were able to elicit emotions, since more than 65% of the users reported changes in mood (Fig. 9a). In addition, the results showed the effect of mood on the emotional experiences of happiness and fear, and the influence of musical preferences on the appraisal reported by individuals exposed to the fear stimulus. The analysis of emotional parameters suggests that the four dimensions were valid for describing the emotional experience, although some subjective factors could have been disregarded (e.g. music experience, emotionality, and known song). Finally, 79% of the users reported an appraisal consistent with the stimulus used (Fig. 10a).

We understand these are initial results; nevertheless, we believe they point to very significant differences concerning emotion analysis from facial expressions, mainly due to the use of spontaneous facial expressions instead of prototypic ones. In the proposed experimental study, we have presented results for only three basic emotions, but we believe that this method can be extended to other emotions.

6 Final considerations

The presented framework brings a new perspective to facial expression analysis, considering spontaneous emotional experiences in order to build a system that is better suited to human interaction studies. We have demonstrated that there is a link between subjective information and experienced emotions, which encourages the use of such information in emotion analysis studies. We also hope to encourage the use of more realistic data in the area of facial expression analysis by providing a method to elicit spontaneous emotional experiences. In future work, we will cover all four basic emotions by creating a visual-sonorous stimulus for inducing natural facial expressions for each emotion, thereby improving the collected set of images. Additionally, the age range will be expanded to include younger and older individuals, which will allow a more robust data analysis taking into account distinct age groups. Finally, the database will be publicly released to support the development and assessment of further facial expression analysers.