1 Introduction

Human-computer applications have found their way into almost every aspect of human life. Especially in recent years, there has been a tremendous boost in options for tracking human behavior by means of computer systems and algorithms in practically any situation, be it in private or public life. This has motivated scientists, companies, stakeholders and societies to replace or augment human-human interaction with agent-based communication systems, such as expressive personal assistants, embodied conversational agents (ECAs) or chatbots, to name but a few. The underlying computational models of these systems have become quite intelligent in an attempt to recognize the human user’s feelings, sentiments, intentions, motivations and emotions, and to respond to them [1]. Many agent systems integrate artificial intelligence (AI) and machine learning to allow automated interaction with human users in real time.

Fig. 1. Agent interaction with a human user. Upper panel: agent dealing with congruent information conveyed by the user; lower panel: agent dealing with incongruent information conveyed by the user.

Because human interaction is multimodal and heavily dependent on social perception and emotions, many agent-based systems aim to decode, detect and recognize the user’s emotions and his or her current mood state from overt behavior, including text messages, gestures, the voice or the expressive bodily or facial behavior of the user [2,3,4]. Despite vast progress in decoding the user’s emotions from different modalities (voice, text, etc.), human-centered agent-based systems also need to account for the interactions between the different modalities (text, voice, face). Consider a virtual agent-based system (e.g., a chatbot) for recognizing the user’s emotions from his/her smartphone text messages. The chatbot’s task is to try to “understand” the user’s emotions and reply emotionally and empathetically back to him/her (see Fig. 1). The system may also monitor the user’s heart rate, voice or facial expressions to make emotion detection and emotion recognition more versatile. As illustrated in Fig. 1, the system will decode the “correct or true” emotion with high accuracy if the information from all input channels or modalities (e.g., text, voice, face, heart rate) is congruent (Fig. 1, upper panel). However, accuracy will most likely drop if the information from the multimodal input channels is incongruent (Fig. 1, lower panel). In the latter case, reducing the analysis to only one modality or input channel (e.g., text or expressive behavior only) could even lead to false decoding results.
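To make the congruence problem sketched in Fig. 1 more concrete, the following minimal Python sketch illustrates one possible way an agent could fuse per-modality valence estimates while flagging, rather than averaging away, cross-modal conflict. All names, scores and the threshold are hypothetical illustrations and not part of any system described in this paper.

    from dataclasses import dataclass

    @dataclass
    class ModalityReading:
        """Hypothetical output of a per-modality valence classifier."""
        modality: str      # e.g., "text", "face", "voice"
        valence: float     # -1.0 = clearly negative ... +1.0 = clearly positive
        confidence: float  # classifier confidence in [0, 1]

    def fuse_with_congruence_check(readings, incongruence_threshold=1.0):
        """Confidence-weighted fusion that keeps cross-modal conflict as information."""
        total = sum(r.confidence for r in readings)
        fused = sum(r.valence * r.confidence for r in readings) / total if total else 0.0
        # Spread between the most positive and the most negative channel.
        spread = max(r.valence for r in readings) - min(r.valence for r in readings)
        return {
            "valence": fused,
            "spread": spread,
            "incongruent": spread >= incongruence_threshold,  # flag instead of discarding
        }

    # Example analogous to Fig. 1, lower panel: positive text, negative face and voice.
    readings = [
        ModalityReading("text", +0.8, 0.9),
        ModalityReading("face", -0.6, 0.7),
        ModalityReading("voice", -0.2, 0.5),
    ]
    print(fuse_with_congruence_check(readings))

A downstream dialogue policy could then treat the incongruence flag itself as a cue, for instance by asking a clarifying question instead of committing to the fused valence estimate.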

Thus, from a psychological point of view, truly interactive and human-centered computer systems should be emotionally intelligent and smart enough to reason about and understand how information from different modalities (e.g., text, voice, face) interacts and contributes to the user’s self-report, in order to accurately sense the user’s experience. Understanding modality interactions is imperative for developing truly human-aware agent systems that decode, simulate and imitate humans based on data produced by real humans. Importantly, as illustrated in the examples in Fig. 1, these systems need to learn to deal with inconsistencies or incongruence in the data sent by the user and to interpret it correctly. Knowing how to computationally model congruent and incongruent information is especially relevant for certain agent-based applications in the field of psychology, such as negative behavior interventions, mental health prevention and research on inter-individual differences. Certain mental disorders (e.g., depression, anxiety) and developmental disorders (e.g., autism) are characterized by discrepancies in information processing between modalities. In other words, these patients show, and may also experience, incongruence between the information perceived and how it is affectively appraised, subjectively experienced and expressed, e.g., verbally or physically. Moreover, these patients often have difficulties in accurately describing, identifying or expressing their own feelings and in accurately responding to the emotions of other people. For example, they show a preference to neutralize or suppress their emotions, or tend toward a negativity bias in the perception and affective evaluation of the self and of other people, often in combination with blunted or exaggerated facial expressions [5, 6].

Therefore, knowing the basic principles of multimodal emotion processing and of social perception and implementing these principles into agent-based systems will constitute key challenges for an agent technology that aims to best serve its human users. In this paper, evidence from psychology and results from an experimental study on social perception and the role of bodily emotional feedback in emotion perception will be presented and discussed to demonstrate how human appraisal and affective evaluation of the very same input - here, words describing the user’s own or other people’s feelings - can change depending on whether facial expressions are congruent or incongruent with the verbal input during affective evaluation. Furthermore, inter-individual differences in depressive symptoms and in trait and state anxiety, assessed by standardized psychological questionnaires, and their possible influence on emotion processing are explored. Finally, the implications of the results of this experimentally controlled study will be discussed as a potential and a challenge for human-centered approaches in agent-based communication, and recommendations will be given on how information about multimodal incongruence effects can be integrated into user-centered approaches, using psychology-driven Character Computing as an example [7,8,9].

2 Methods

2.1 Participants, Study Design and Methods

The study was conducted at Ulm University, Germany. The study and data collection were hosted by the Department of Applied Emotion and Motivation Psychology. N = 45 volunteers (all university students; n = 36 women, n = 9 men) aged 18 – 32 years took part in the study. All participants were fully informed about the purpose of the study and gave written informed consent prior to participation. After arrival at the psychological laboratory, the volunteers were randomly assigned to one of three experimental conditions/groups. Depending on the experimental group, the facial emotion expression of the participant was either inhibited (Group A: Emotion suppression, n = 14; n = 11 women) or facilitated (Group B: Emotion facilitation, n = 17; n = 14 women) by asking participants to hold a wooden stick either with their lips (Group A) or between their front teeth (Group B); see Fig. 2 for an overview. As shown in Fig. 2, holding the stick with the lips inhibits a smile and consequently any positive emotion expression by the lower part of the face, including mouth, lips and cheeks, controlled by the facial muscles, including the m. zygomaticus involved in smiling. In contrast, holding the stick between the teeth elicits positive emotion expression by stimulating the facial muscles responsible for smiling. In Group C (control group, n = 13; n = 11 women), facial expressions were neither inhibited nor facilitated. Instead, participants received gummy drops (“Myobands”; ProLog GmbH [10]), which they had to keep in their mouth. The drops prevent under- or over-expressive facial expressions but still allow facial mimicry, i.e., spontaneous emotion expression and facial imitation of the emotions perceived. Experimental manipulations of this kind have been used in previous studies as an implicit experimental procedure to inhibit positive facial expression (a smile) or to induce positive emotion expression in a controlled laboratory environment. The manipulation is aimed at testing effects of facial inhibition and of congruent vs. incongruent facial feedback on social and emotion perception [11,12,13].

Fig. 2. Left: Group A (Emotion suppression, i.e., “inhibition of a smile”); right: Group B (Emotion facilitation, i.e., “smile”).

The participants of the three groups received detailed instructions on how to hold the wooden stick or the drop (Group C) with their lips, mouth or tongue. Next, all participants were asked to perform the same experimental task, the so-called HisMine Paradigm, developed by the author in [14] and existing in different experimental variants [e.g., 15, 16, 17]. In the HisMine Paradigm, participants are presented with emotionally positive, negative or neutral nouns. The nouns are matched in terms of linguistic features (e.g., word length, word frequency) and in terms of emotional features, including the big two emotional dimensions of valence and arousal [18]. The nouns are presented together with possessive pronouns of the first person (self) or third person (other person) or with an article (control condition, no person reference). All stimuli were presented in German to native speakers of German. The participants were instructed that the words presented on the screen describe their own emotions, feelings, attitudes or neutral objects, or the emotions, feelings or objects of a third person (e.g., a possible interaction partner), or make no reference to a particular person, including oneself. The pronoun-noun or article-noun pairs were presented randomly on the computer screen and therefore could not be anticipated by the participants; see Fig. 3 for an illustration of the paradigm.

Fig. 3. The HisMine Paradigm [14] for measuring the user’s affective judgments of self- and other-related emotional and neutral concepts in an emotional evaluation task with self- and other-related words and words without person reference (controls).

The participants’ task was to read each word pair attentively and evaluate the feelings the word pair elicited. They were instructed not to think too much about the meaning of the words but to respond spontaneously by following their gut feelings and to decide as spontaneously as possible. The stimuli were presented in trials, and trials were separated by an intertrial interval (ITI) in which a fixation cross was shown (see Fig. 3 for an illustration). The words were presented on the computer screen for 4000 ms (including the ITI). The participants were asked to indicate their valence judgments - positive/pleasant, negative/unpleasant or neutral - by pressing a key on the keyboard (positive/pleasant: key N; negative/unpleasant: key V; neutral: key B). Reaction time and accuracy (number of valence-congruent key presses) were recorded and statistically analyzed. The experiment was programmed with Presentation® software (Neurobehavioral Systems, Inc.). Statistics were performed with Statistica (TIBCO® Data Science). Before and after the experiment (duration: 20 min), participants were asked to fill in standardized psychological questionnaires assessing mood (positive and negative affect; PANAS [19]), state and trait anxiety (STAI [20]) and current depressive symptoms (last two weeks; BDI-2 [21]).
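For illustration only, the trial logic described above can be approximated by the following PsychoPy-style Python sketch; the actual experiment was programmed in Presentation®, the stimulus list is a translated placeholder, the ITI duration and window settings are assumptions, and the timing is simplified.

    from psychopy import visual, core, event

    win = visual.Window(color="grey", units="norm")
    fixation = visual.TextStim(win, text="+")
    clock = core.Clock()
    key_to_valence = {"n": "positive", "v": "negative", "b": "neutral"}

    # Translated placeholder stimuli; the original nouns were German and matched for
    # word length, frequency, valence and arousal.
    trials = [
        ("my happiness", "positive", "self"),
        ("his fear", "negative", "other"),
        ("the shoes", "neutral", "none"),
    ]
    log = []

    for text, valence, reference in trials:
        fixation.draw(); win.flip(); core.wait(1.0)         # ITI with fixation cross (duration assumed)
        visual.TextStim(win, text=text).draw(); win.flip()  # pronoun-/article-noun pair
        clock.reset()
        keys = event.waitKeys(maxWait=4.0, keyList=list(key_to_valence), timeStamped=clock)
        key, rt = keys[0] if keys else (None, None)
        log.append({"stimulus": text, "reference": reference, "rt": rt,
                    "accuracy": int(key is not None and key_to_valence[key] == valence)})

    win.close()
    core.quit()

The logged “accuracy” value follows the paper’s definition of a valence-congruent key press, so a log of this kind could feed directly into the group-level analyses described in the Results section.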

3 Results

Result patterns (accuracy and reaction times as dependent variables) were analyzed for each experimental group separately with within-subject repeated-measures designs (ANOVA), with “emotion” (negative/unpleasant, positive/pleasant or neutral) and “reference” (self, other, no person reference) as within-subject factors. P-values are reported Greenhouse-Geisser corrected in case sphericity assumptions were not met (F-statistics and degrees of freedom are reported uncorrected). No between-group factor was included in the ANOVA designs to avoid biases due to unequal and small sample sizes (n = 17 vs. n = 14 vs. n = 13 per group).

In summary, the following hypotheses were tested. First, the hypothesis (H1a) was tested that participants respond differently to emotional and neutral words, giving emotional content priority over neutral content in processing, as predicted by emotion theories and converging previous empirical evidence [18]. Therefore, affective evaluations of positive and negative words were expected to differ from affective evaluations of neutral words (significant effect of the factor “emotion”). Second, because humans are social perceivers, it was expected that this emotion effect is modulated by the self- vs. other-reference of the words (H1b: interaction effect of the factors “emotion” x “reference”). The “emotion” x “reference” interaction was expected for accuracy and reaction time measures and was expected to be most pronounced in Group C, i.e., the control group, which performed the task without facial expression manipulation. Previous findings suggest that healthy participants differ in their affective evaluation of self and other, showing faster evaluation of content related to their own (positive) emotions compared to content describing other people’s emotions [22], an effect known as the self-positivity bias in social perception [23]. Third, the influence of congruency vs. incongruency and of inhibition of information between modalities was tested (H2). If feedback from the body (here: facial emotion expression) influences affective judgments, and in particular the process of relating one’s affective judgments about self and others to one’s gut feelings, then experimental manipulation of facial expressions should impact emotion processing and particularly its modulation by self- and other-reference. Therefore, Group A and Group B were expected to show different response patterns than Group C, despite performing the same affective judgment task.

To determine the individual effects of the three experimental conditions/groups, the results of the ANOVA designs (accuracy and reaction times as dependent variables) of each of the three experimental groups were further analyzed by means of planned comparisons, including parametric tests (t-tests) where appropriate. Of all possible comparisons across stimulus categories, only a few planned comparisons were made to test H1a, H1b and H2. The planned comparisons compared, per dependent variable, affective judgments of emotional and neutral words within the reference categories “self”, “other” and “article/no person reference”. In addition, they included comparisons within the same emotion category (positive/pleasant, negative/unpleasant or neutral) across the reference categories (self vs. other vs. no person reference). The advantage of this strategy of planned comparisons is that multiple testing can be avoided and only comparisons that can be meaningfully interpreted with respect to the hypotheses are selected and tested. Finally, exploratory analyses were performed to investigate (H3) how inter-individual differences in affect, self-reported depressive symptoms and trait or state anxiety impact affective judgments when facial feedback is blocked (Group A), facilitated (Group B) or not manipulated (Group C). H3 was explored by correlation analyses (Pearson) between the self-report data and the reaction time and accuracy measures. Correlations were performed separately for each experimental group. Results were considered significant at p < .05.
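To make the analysis strategy explicit (the original analyses were run in Statistica), a rough Python sketch using pandas and pingouin might look as follows; the file names and column names are hypothetical, and it is assumed that pingouin reports Greenhouse-Geisser-corrected p-values for this two-way within-subject design.

    import pandas as pd
    import pingouin as pg

    # Hypothetical long-format data: one row per subject x emotion x reference cell,
    # with columns subject, emotion, reference, accuracy, rt (aggregated per cell).
    df = pd.read_csv("group_c_cells.csv")

    # 3 x 3 within-subject ANOVA per dependent variable (here: reaction times).
    aov = pg.rm_anova(data=df, dv="rt", within=["emotion", "reference"],
                      subject="subject", detailed=True)
    print(aov)

    # Planned comparison, e.g., emotional vs. neutral words within the "self" reference.
    self_cells = df[df["reference"] == "self"]
    print(pg.pairwise_tests(data=self_cells, dv="rt", within="emotion", subject="subject"))

    # Exploratory Pearson correlations (H3), per group, between questionnaire scores and
    # task measures; "stai_trait" and "rt_self_negative" are placeholder column names.
    subj = pd.read_csv("group_b_subject_level.csv")
    print(pg.corr(subj["stai_trait"], subj["rt_self_negative"], method="pearson"))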

H1a and H1b:

As predicted, the control group (Group C) showed a significant “emotion” effect and a significant “emotion” x “reference” interaction in accuracy (“emotion”: F(2,24) = 23.14, p < .002; “reference”: F(2,24) = 1.05, p = .36; “emotion” x “reference”: F(4,48) = 3.2, p < .02) and in reaction times (“emotion”: F(2,24) = 4.71, p < .02; “reference”: F(2,24) = 8.26, p < .02; “emotion” x “reference”: F(4,48) = 3.19, p < .05). As illustrated in Fig. 4, participants of Group C had a significantly higher accuracy for negative than for neutral words, irrespective of whether the words described negative emotions of the own person, of a third person or of no particular person (p < .05). For positive words, accuracy was significantly higher for judgments of one’s own emotions (e.g., “my happiness”) than for the emotions of others (e.g., “his happiness”), p < .05. This supports previous findings that healthy subjects find it more difficult to evaluate a positive word as positive when it describes the emotion of another person than when it refers to one’s own emotion or describes the same emotion without a person reference (e.g., “the happiness”). Moreover, participants of Group C were least accurate for self-related neutral words: on average, only about half of the neutral words were judged as neutral when they were self-related, with accuracy dropping below 50% (p < .05). This difference was also reflected in reaction times. Participants were fastest in accessing their feelings when judging positive words that referred to themselves (p < .05) and took significantly longer to decide for neutral words, specifically when these were related to another person or had no person reference (e.g., “his shoes”, “the shoes”), all p < .05. Again, this supports previous findings on social perception, namely that healthy subjects show positive self-evaluations [22, 23].

Fig. 4. Results. Accuracy (left) and reaction times (right). The results are described in detail in the text.

H2:

As predicted and as illustrated in Fig. 4, facial emotion suppression and facial emotion facilitation, here the inhibition of a smile (Group A) or its induction (Group B), had a significant impact on the participants’ response patterns. Group B, who held the stick between their teeth, showed a significant “emotion” effect and a significant “emotion” x “reference” interaction in the accuracy data (“emotion”: F(2,32) = 15.2, p < .001; “reference”: F(2,32) = 1.68, p = .2; “emotion” x “reference”: F(4,64) = 4.24, p < .02). However, producing a smile reduced the influence of self- and other-reference on the speed of the judgments: in the reaction times of Group B, the “emotion” x “reference” interaction was not significant (“emotion”: F(2,32) = 11.44, p < .001; “reference”: F(2,32) = 2.16, p = .13; “emotion” x “reference”: F(4,64) = 2.04, p = .13). As shown in Fig. 4, Group A, in whom positive facial emotion expression was blocked during the affective judgments, showed neither a significant “emotion” effect nor a significant “emotion” x “reference” interaction in reaction times (“emotion”: F(2,26) = 2.87, p = .1; “reference”: F(2,26) = 2.05, p = .14; “emotion” x “reference”: F(4,52) = 1.71, p = .18), indicating that blocking facial feedback reduces the influence of emotion and of social perception on affective judgments by decreasing the speed of discriminating between emotional and neutral content and between self and other. Interestingly, blocking one’s smile seemed to improve accuracy for self-related neutral content (accuracy increased above 50%), as indicated by a significant “emotion” x “reference” interaction in the accuracy data of this group, F(4,52) = 3.57, p = .03 (see Fig. 4, accuracy, left column).

H3:

Correlations between the self-report measures and the accuracy and reaction time measures revealed no significant effects in Group C. However, in Group B, trait as well as state anxiety were significantly positively correlated with the speed of affective judgments for self- and other-related negative words and negatively correlated with accuracy for positive self- and other-related words (Pearson correlations; all p < .05, two-tailed). Similarly, depressive symptoms were negatively correlated with accuracy for other-referenced positive words, and this was also observed when smiling was inhibited in Group A (all p < .05). In addition, in Group A, who inhibited a smile, positive affect was negatively correlated with accuracy for negative self-related words and positively correlated with reaction times for positive self-related words (all p < .05). These results suggest that, in healthy subjects, inter-individual differences in emotion perception, including positive affect and negative emotional experience in terms of anxiety proneness (state and trait) and depressive symptoms, can significantly interact with affective judgments, specifically in conditions in which incongruence between affective and bodily signals occurs.

4 Discussion

Humans send text messages, share private experiences in internet forums, and like or dislike others for their comments, thereby quite intuitively transferring their thoughts, emotions and feelings via the internet from a sender to a receiver, be it a human or a digital agent. Regardless of the sensory modality (e.g., visual, auditory) in which the information is presented and communicated (e.g., text, audio, video), humans decode it rapidly; its content is implicitly appraised for its relevance to the perceiver and affectively evaluated on the basis of his/her gut feelings, often without direct cognitive control or reflection. Spontaneous human information processing is a dynamic and highly complex social and emotional process. A cascade of processes is triggered, and information or signals from different modalities are integrated, while humans perceive information and evaluate it for its social and emotional relevance. Notably, the perception of self and of other people, including the affective evaluation of one’s own emotions and feelings, plays a critical role in human interactions. Psychological research has shown that single words carrying positive or negative emotional connotations are rapidly and preferentially processed during reading, eliciting changes in brain networks responsible for emotion perception and the preparation of action (fight-flight); for an overview see [24]. Moreover, it has been shown that the content of the same words re-enacts bodily emotional and motivational systems (e.g., facial muscle expression, startle reflex) differently depending on whether it relates to the reader’s own feelings or describes other people’s feelings, for an overview see [24]. There seems to be no doubt that humans can emotionally feel what they read and intuitively and affectively differentiate between self and others by embodying or bodily suppressing their feelings. In particular, there is ample evidence that, among bodily signals, facial expressions of emotion significantly shape the perception of self and of other people by influencing how people affectively evaluate information about self and others [27, 28]. For instance, inhibiting facial expression leads to a significant decrease in understanding the emotions of others - be it while reading, while watching videos or pictures of other people, or while interacting with others in real time.

This feedback from facial expressions can be so strong that it might significantly impact the user’s decisions, for instance whether he or she likes or dislikes other users in an internet forum for their comments. Not being aware of the fact that social processing and emotion processing are multimodal processes, in which information from one modality interacts with the information conveyed by other modalities, can lead to wrong predictions in automated emotion recognition tools if these focus on only one modality or lack theoretical or empirical knowledge of how to weight and combine cross-modal congruent and incongruent information. For that reason, user-centered agent-based systems need to be able to deal with incongruence a) between information processing modalities or b) in information between sender and receiver. An example of a) is that what the user says is incongruent with how it is expressed, for example when the user sends a positive text message but displays a negative facial expression. An example of b) is that the message of a sender is incongruent with the emotion/mood of the receiver or with the context in which the receiver decodes the message. This sender-receiver incongruence also holds for messages sent by the virtual agent itself [25], for instance when a virtual agent sends incongruous emotional information to the human receiver of the message (visual: smile; auditory: sad voice). Incongruence effects on the agent’s side can have a number of negative side effects on the user’s behavior and feelings. They can lead to significantly less trust of the user in the agent and to a decrease in the user’s self-disclosure motivation, and in turn produce inconsistent emotional expression on the user’s side, thereby significantly impairing human-agent interaction qualitatively and quantitatively. Whether a) or b), emotion psychology as well as concomitant research, including the results of the current study, demonstrate that information from different modalities, as well as incongruence of information between the modalities, matters and should not be ignored in human-computer interaction. Therefore, focusing on only one decoding signal of the user might lead to inconsistent results, because incongruence effects go unnoticed if they are not tracked across modalities. As a consequence, the user’s true or correct emotional experience cannot be accurately detected. One can think of many instances in virtual agent-based human communication in which the reaction detected from the user does not reflect the user’s true emotional state. This is the case when a person wishes to disguise his/her emotions or hide feelings by faking a smile or putting on a poker face while posting his/her text message. In other instances, the user may implicitly suppress his/her emotions, without even being explicitly aware of it, when posing a different emotion inconsistent with the message sent. Similarly, when a user is in a depressed or negative mood or suffers from depression or anxiety, emotion processing, including describing, identifying and expressing emotions, often becomes difficult. Computationally modeling such discrepancies in perception, feeling and action in the user becomes most relevant for agent systems that aim at supporting e-mental-health interventions. Regardless of which use case one might imagine, multimodal processing, and the question of how to integrate information that signals incongruence between modalities as meaningful information into computational models of user-centered agent systems, is of central relevance, especially in light of novel developments such as multimodal sentiment analysis or “the emotive Internet”.
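A minimal sketch of how the two kinds of incongruence distinguished above could be surfaced as explicit features rather than discarded as noise is given below; it assumes that valence and mood can be expressed on a common [-1, 1] scale, and the function names, thresholds and example values are placeholders.

    # Type a): incongruence between the user's own modalities (e.g., text vs. face).
    def modality_incongruence(text_valence: float, face_valence: float,
                              threshold: float = 1.0) -> bool:
        return abs(text_valence - face_valence) >= threshold

    # Type b): incongruence between a sender's message and the receiver's mood/context.
    def sender_receiver_incongruence(message_valence: float, receiver_mood: float,
                                     threshold: float = 1.0) -> bool:
        return abs(message_valence - receiver_mood) >= threshold

    # Example: positive text message, negative facial expression, receiver in a low mood.
    features = {
        "incongruent_modalities": modality_incongruence(+0.8, -0.6),
        "incongruent_with_receiver": sender_receiver_incongruence(+0.8, -0.5),
    }
    print(features)  # both flags can be passed downstream as meaningful information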

Nevertheless, incongruence effects still remain a challenge and an issue of investigation in human-computer interaction [4, 25]. As pointed out in [25], “much work remains to be done before sophisticated multimodal interaction becomes a common place, indispensable part of computing”. The present study aimed to demonstrate the impact and power of incongruence in emotion processing in the laboratory by experimentally manipulating the participants’ facial expressions, either by having them hold a wooden stick with the lips or between the teeth or by letting them express themselves freely. Furthermore, despite small sample sizes, the present study also showed that inter-individual differences in emotion processing cannot and should not be ignored, as they interact with the strength of such incongruence effects.

5 Conclusion and Future Outlook

Much effort and scientific work has been invested in the fields of multimodal human-computer interaction and human-robot interaction to provide computational solutions for data fusion, data interfacing, data classification or multimodal sensor engineering that allow multimodal tracking of the user’s behavior from practically any device and make virtual agents (including robots) more humanoid and thus “real” as interaction and communication partners for the human user. While there is agreement that enriching agent-based human interaction with multiple modalities can increase the accuracy of a computational system in detecting, predicting and simulating human behavior, multimodality is still a complicated task when it comes to dealing with ambiguity or incongruence in the data while mapping, for instance, the user’s emotional states. Here, the challenge is not to treat these inconsistencies as error or noise in data preprocessing but as meaningful information. The problem of incongruence is receiving more and more attention from researchers in the field, and first solutions involving hybrid theory-driven and data-driven approaches have already been proposed [26]. Character Computing [7,8,9] constitutes one of these ambitious approaches; it places the human user at the center of its investigation and attempts to go beyond simple emotion tracking. The novelty of Character Computing is its joint approach of computer scientists and psychologists. Crucially, its framework as well as its underlying computational architecture and models are psychologically driven and based on psychological reasoning, including both a theory-driven and a data-driven approach. As outlined in [7,8,9], Character Computing tries to close the gap between the knowledge gained from psychological research and its application by computer scientists. It is a multimethod, multimodal approach that, based on well-controlled psychological experiments, has already proven successful in combining multimodal input with automated signal decoding routines. For instance, in [9], a study is discussed in which participants viewed simple character trait and character state words intended to elicit positive or negative attitudes - amongst others about one’s body appearance (in terms of size and shape) - while the participants’ responses were tracked by sensors recording eye blink data and heart rate to measure implicit bodily arousal and approach or withdrawal behavior towards these words, and thereby to detect potential risk for body dissatisfaction and eating disorders. Predictions tested in psychological studies such as the one in [9], as well as the one presented in this paper, can then be directly integrated into an information ontology in order to represent the knowledge gained from the interactions between modalities.

To conclude, this paper asked how to deal with incongruence in agent-based human interaction and demonstrated its relevance by investigating how processes related to social perception and bodily facial feedback influence the user’s affective judgments of emotional and neutral concepts. The research question, the experimental design and the results of this study provide a theoretical framework and empirical support for user-centered research aimed at solving incongruence detection in uni- and multimodal agent-based systems, for the sake of modeling the “true” emotion of their users and of improving the users’ trust in keeping up communication with virtual agents as empathetic and equal conversation partners.