1 Introduction

Creating a feeling of being “present” in virtual reality is essential to the success of many virtual reality applications such as training (Broekens et al. 2011), coaching (Rizzo et al. 2011), therapy (Brinkman et al. 2012) and games (Isbister 2006). A feeling of being “present” in virtual reality may be achieved by making the virtual environment as natural as possible. Human–computer interaction, including human–virtual human interaction, is inherently natural and social (Reeves and Nass 1996), and it is therefore an essential component of the realism of the virtual environment. Without proper behavior of the virtual human, users may not be able to “suspend disbelief”, and the effectiveness of the virtual reality application will decrease.

Considering the importance of emotion in human–human communication, emotion may also help people to establish a better relationship with a virtual human (Reeves and Nass 1996). As Picard et al. (2001) argue, without some emotional skills, machines will not appear intelligent when interacting with people. Therefore, multiple technologies that give virtual humans the ability to generate humanly acceptable expressions have been developed in recent decades (Ersotelos and Dong 2008).

Different applications require different levels of emphasis on how virtual humans express their emotions. Even when implemented in only part of an application, emotional expressions can be effective. In a health coach application, for example, the virtual human might mainly need to speak to motivate the user, so emotional expressions during speaking are most important. In virtual reality exposure therapy for fear of public speaking, the virtual human only needs to listen, so emotional expressions while listening are most important. In other applications, such as a role-playing game or a negotiation simulator, the full range of speaking and listening is used and might benefit from emotional expressions. Studies on generating and evaluating emotional agents normally focus on either listening (Slater et al. 1999; Wong and McGee 2012) or speaking (MacDorman et al. 2010; Qiu and Benbasat 2005). Studies that do include both speaking and listening (e.g., Core et al. 2006; Broekens et al. 2012a; Link et al. 2006) focus mainly on the conversation and communication as a whole and do not investigate the speaking and listening phases of the conversation separately. To our knowledge, no study has directly compared the impact of emotional expressions during speaking and listening in virtual reality. In the current study, the virtual human’s valence state was manipulated from negative to positive while she was speaking and listening, and the impact on the participants’ perception was examined.

Besides the difference between listening and speaking, culture might also be an important factor for a designer to consider, as many applications are nowadays used all across the world. Especially for some virtual reality applications, such as virtual reality exposure therapy for patients with social phobia (Brinkman et al. 2012), it is crucial to understand how people with a different cultural background perceive the affective behavior of virtual humans. Several studies have already focused on the effect of cultural differences on evaluating a virtual human’s emotions. For example, Jack et al. (2012) showed that facial expressions of emotion are culture specific. However, Yun et al. (2009) found that cultural background has little effect on emotion perception. Kleinsmith et al. (2006) evaluated the cultural impact on the perception of emotion and found that emotions are both universal and culture specific. Therefore, similar to studies on perceiving emotional expressions of real humans, the universality of emotion perception of virtual humans also still seems inconclusive. In addition, most research is limited to the investigation of head-only virtual humans with facial expressions, and far less research is devoted to emotional expressions of a 3D virtual human that also expresses its emotional state via gaze, head movement or voice intonation.

In summary, this study involves two research questions: (1) whether emotional expressions of a virtual human are perceived differently depending on the cultural background of the perceiver, and (2) whether a person is more receptive to emotional expressions in one of the two phases (the speaking phase and the listening phase) or treats these two phases as equally important when rating the virtual human’s emotions. To answer these research questions, we designed a virtual human representing a Chinese lady of around 25 years old. She had the ability to show multiple emotional states in multiple nonverbal and verbal ways, i.e., through facial expression, head movement, gaze and voice intonation. During the listening phase, the virtual human’s emotional behavior was expressed by nonverbal communication only, while during the speaking phase, the emotional behavior was expressed by both verbal (i.e., intonation) and nonverbal communication. To avoid a possible emotional bias from the content of the conversation, a relatively neutral topic, i.e., conference attendance, was selected for this experiment. Petrushin (1999) pointed out that humans are not perfect in decoding manifest emotions such as anger and happiness from voice intonation only. Therefore, as a first step, only three basic emotional valence states (positive, neutral and negative) were used in this study. In order to test the effect of cultural influence on the perception of the emotions expressed by the virtual human, two groups of participants were recruited: one from the same culture as the virtual human and one from other cultures. We chose to compare Chinese versus non-Chinese participants because it is known from cultural models (Hofstede 2001) that the difference in cultural values is significant between these two groups, and because two of the authors experienced these differences while living in Europe. Moreover, the background of these authors facilitated the recruitment of Chinese participants.

Based on knowledge available in the literature, we formulated the following two hypotheses related to our research questions.

Hypothesis 1

Individuals with the same cultural background as the virtual human perceive the valence state of the virtual human differently from individuals with a different cultural background.

In particular, as the virtual human spoke Chinese, participants with a different cultural background could not understand what the virtual human said during verbal communication. Hence, participants with a cultural background different from the virtual human’s are expected to perceive her emotion differently from participants with the same cultural background.

Hypothesis 2

The virtual human’s expressed valence is perceived as more intense in the speaking phase than in the listening phase.

Since the speaking phase also allows for verbal expression of the emotion, it seems likely that, compared to the listening phase, the emotion expressed in the speaking phase is perceived as more intense.

The rest of the paper is structured as follows. Section 2 provides theoretical background on how a virtual human can express emotion through facial expression, gaze, head movements and voice intonation. In addition, it discusses cultural differences in emotion recognition and various emotional models needed to understand the rest of the paper. Section 3 provides a description of the apparatus, the validation of the stimulus material and the procedure of the experiment; the results are presented in Sect. 4. Finally, in Sect. 5, the findings of the study are discussed and conclusions are drawn.

2 Theoretical background

No matter what role virtual humans play in a virtual world, they need to elicit an anthropomorphic interaction with their human users. This requires vast knowledge of various human aspects, including facial expression, gaze, head movement, voice expression and their cultural differences, in order to make the virtual human believable, responsive and interpretable.

2.1 Facial expression of a virtual human

Facial expression is one of the options to express human emotion and as such plays a substantial role in depicting human characters. Starting in the early 1970s and 1980s (Parke 1972; Platt and Badler 1981), face modeling and animation have been a continuous research topic for many years. From the early 2000s, more flexible emotion representations were created with MPEG-4-based facial animation (Tsapatsoulis et al. 2002). Recent advances in facial animation that allow producing a rich set of effects on synthetic humans have already had their impact on the industry (Ersotelos and Dong 2008).

Multiple approaches have been proposed to create natural-looking facial expressions; they can be categorized as follows: (1) simulation or physically based models, which try to model the anatomical structure of the face as well as the underlying dynamics (Kahler et al. 2001; Lee et al. 1995; Waters 1987), (2) performance-driven models, which reassemble frames from video footage or motion capture data of a real person to yield the desired facial expression (Brand 1999; Bregler 1997; Chuang and Bregler 2002; Ezzat et al. 2004; Litwinowicz and Williams 1994) and (3) parameterized models, which assign weights to the vertices of the meshes representing the face, such that during animation the vertices are moved according to these weights (Cohen and Massaro 1993; Parke 1974; Zhang et al. 2006). Considering the high computational load required for the simulation or physically based models and the high costs of the motion capture equipment needed for performance-driven models, we chose an easily repeatable facial expression animation based on a parameterized model for this study.
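
As an illustration of the third category, the following minimal Python sketch blends per-vertex displacements of expression targets according to their weights; the mesh data, function name and weights are hypothetical, and the sketch is not the implementation used in this study.

```python
import numpy as np

def blend_face(neutral, targets, weights):
    """Deform a face mesh by weighting per-vertex offsets of expression targets.

    neutral : (V, 3) array of vertex positions of the neutral face.
    targets : dict mapping an expression name to a (V, 3) array of vertex
              positions for that expression at full intensity.
    weights : dict mapping expression names to blend weights in [0, 1].
    Returns the deformed (V, 3) vertex array.
    """
    deformed = neutral.copy()
    for name, w in weights.items():
        # Each weight scales the displacement of the target relative to neutral.
        deformed += w * (targets[name] - neutral)
    return deformed

# Example with a tiny 3-vertex "mesh" (real face meshes have thousands of vertices).
neutral = np.zeros((3, 3))
targets = {"smile": np.array([[0.0, 0.1, 0.0],
                              [0.1, 0.2, 0.0],
                              [0.0, 0.1, 0.0]])}
half_smile = blend_face(neutral, targets, {"smile": 0.5})
```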

2.2 Head movement and gaze of a virtual human

Besides facial expressions, head and eye movements were also implemented in the virtual human used in our experiment. Head movements and eye gaze are two important sources of emotional feedback in interaction (Cassell and Thorisson 1999; Lee and Marsella 2012; Ruttkay and Pelachaud 2005). They are essential in embodied interactive conversational systems (Cassell et al. 1994), and it is relatively simple to create them, primarily nods and glances toward or away from the user; still, the correct timing is essential (Cassell and Thorisson 1999). Research by Lance et al. (2008) and Lee et al. (2009) shows how head movements and gaze can be embedded into a virtual character.

2.3 Voice expression of a virtual human

Along with the nonverbal emotional expressions, emotion can also be expressed by voice intonation when the virtual human is talking. Speech was once considered the main channel carrying most, or even all, of the necessary information in a conversation (Ochsman and Chapanis 1974). This idea has been countered by a growing body of research on believable, lifelike embodied conversational agents (Bates 1994). Still, the importance of the voice in emotion expression cannot be denied (Scherer 1995). Many studies have investigated emotional effects in voice and speech (Bailenson et al. 2006; Petrushin 1999; Scherer 2003), and emotion expressed in the voice of virtual humans (Cerezo and Baldassarri 2008; Moridis and Economides 2012). The intonation of the voice was therefore also considered an important aspect of the virtual human’s emotional expression in this study.

2.4 Cultural difference

Culture, like age, gender, posture and context, is one of the many factors affecting emotion expression (Picard 1998). A long-standing question in the study of human emotion is the extent to which emotional expressions are universal or culturally determined (Elfenbein et al. 2007). Cultural background may influence the rate of emotion recognition (Matsumoto 2002). When the expresser of an emotion and the perceiver of the emotion have the same cultural background, the perceiver’s recognition rate is found to be higher than when the expresser and perceiver have different cultural backgrounds (Elfenbein 2003; Elfenbein and Ambady 2002; Elfenbein et al. 2007). However, Darwin (1872) and Tomkins (1962, 1963) argue that universal emotions do exist, and studies also show universality in the facial expression of emotion and its perception, attributing only a small effect of cultural background to emotion perception from facial expressions (Ekman 1994; Ekman and Friesen 1971; Ekman et al. 1987; Matsumoto 2002, 2007).

The question of the impact of cultural background can be extended to human–virtual human interaction. Although various studies show that people can, in general, correctly identify emotions expressed by embodied agents (Bartneck 2001; Schiano et al. 2000), how well this performance is retained across cultures needs to be considered. Clear indications support the statement that culture can shape the expression and interpretation of emotions (Keltner and Ekman 2000). Culture as a factor has also been studied in the interaction with computers. For example, Dotsch and Wigboldus (2008) and Brinkman et al. (2011) found a difference in emotional reaction to a virtual human with an ethnic appearance that matched or did not match the person’s ethnicity. Endrass et al. (2011) show that in German and Japanese cultures, the user’s perception of an agent conversation can be enhanced by a culturally prototypical performance of gestures and body postures. Kleinsmith et al. (2006) studied cross-cultural differences in recognizing affect from a virtual human’s body posture and suggested considering culture as a specific factor in the implementation of agents. Meanwhile, Jan et al. (2007) mention that in Arabian and US American cultures, gaze, proximity and turn-taking behavior are all culture related. These results reveal that participants perceive behavior that is in line with their own cultural background differently from behavior that is typical for a different cultural background. In the work presented in this paper, cultural background is considered a variable that is expected to influence how people perceive the emotional expression of the virtual human.

2.5 Dimensional emotion model

For facial expressions, six universal basic emotions exist (Ekman et al. 1992). For language, however, people’s categorization of verbal labels to describe their everyday emotions varies between languages and cultures (Russell 1991). Instead of placing these expressed emotions in categories, i.e., a discrete emotional approach, others suggest placing them in a multi-dimensional space, i.e., a dimensional approach (Fox 2008). Three broad dimensions have often been proposed to describe affect (Mehrabian and Russell 1974): valence, arousal and dominance. Valence is variously referred to as positive and negative affect or as pleasant and unpleasant feelings. The arousal dimension ranges from deep sleep to frenetic excitement. Dominance focuses on the expression of social control and aggression and varies between submissive and dominant (Schroder 2004). Compared to the discrete emotional approach, the dimensional approach often uses subjective reports of feelings as its main dependent variable. As such, it has a strong empirical base. Support for the existence of these dimensions has come from research into subjective reports, physiological responses, neural circuits and cognitive appraisal (Barrett 2006; Fox 2008). Furthermore, Wierzbicka (1995) and Church and Katigbak (1998) also investigated the cross-cultural universality of the emotional dimensions. Their results showed the universality of the valence and arousal dimensions. The study presented in this paper focuses on the valence dimension only. Although participants were asked to rate the virtual human’s emotion on all three dimensions, only the valence dimension was used for the data analysis.
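
To make the dimensional approach concrete, the sketch below represents affective states as points in valence–arousal–dominance space and maps an arbitrary point to the nearest discrete label; the coordinates are rough illustrative approximations of our own, not values used or measured in this study.

```python
from dataclasses import dataclass

@dataclass
class PAD:
    """An affective state in valence-arousal-dominance space, each in [-1, 1]."""
    valence: float
    arousal: float
    dominance: float

# Approximate, illustrative coordinates for a few discrete emotion labels.
EMOTIONS = {
    "happy":   PAD(valence=0.8,  arousal=0.5,  dominance=0.4),
    "angry":   PAD(valence=-0.6, arousal=0.7,  dominance=0.6),
    "sad":     PAD(valence=-0.6, arousal=-0.4, dominance=-0.4),
    "neutral": PAD(valence=0.0,  arousal=0.0,  dominance=0.0),
}

def closest_label(state: PAD) -> str:
    """Map a point in PAD space to the nearest labeled emotion (Euclidean distance)."""
    def dist(e: PAD) -> float:
        return ((state.valence - e.valence) ** 2 +
                (state.arousal - e.arousal) ** 2 +
                (state.dominance - e.dominance) ** 2) ** 0.5
    return min(EMOTIONS, key=lambda name: dist(EMOTIONS[name]))

print(closest_label(PAD(0.6, 0.3, 0.2)))  # prints "happy"
```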

3 Experiment

3.1 Participants

Twelve Chinese (7 females and 5 males) and twelve non-Chinese (5 females and 7 males) students from Delft University of Technology participated in the experiment. Their age ranged from 24 to 38 years with a mean of 27.8 (SD = 3.4) years. All participants were naive with respect to the hypotheses. Written informed consent was obtained from all participants. The experiment was approved by the university ethics committee.

3.2 Creating the virtual human

Although Cowell and Stanney (2003) found that users generally prefer to interact with a youthful character matching their ethnicity, they found no significant preference for character gender. Furthermore, Kulms et al. (2011) showed that actual behavior and its evaluation are more important than gender stereotypes. Therefore, a Chinese virtual lady aged around 25 years was specially created for this study.

The model of the virtual human was created with FaceGen and 3ds Max. All main factors considered to contribute to emotion expression were combined: the virtual human’s facial expression, her head and eye movements and her voice intonation were manipulated to express emotion during the conversation. To create facial expressions, an easily repeatable facial expression animation method was used. This method rigged the face mesh into 22 action units with 18 features (Gratch et al. 2002), where each feature was an anchor point attached to a set of vertices of the face. A model for the face dynamics was defined that could control the intensity of the expression, as well as its onset, peak and decay. As such, the virtual human had the ability to show any intensity and any combination of the six basic Ekman facial expressions (Ekman et al. 2002). The validity of this approach was shown by Broekens et al. (2012b). By setting the values for the three emotional dimensions (i.e., valence, arousal and dominance) and for the expression duration, any emotion could be expressed by the virtual human. The facial expressions from neutral to negative and from neutral to positive, as used by the virtual human in our experiment, are shown in Fig. 1.
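
A minimal sketch of such a face dynamics model is shown below, assuming a piecewise-linear onset–peak–decay envelope; the function and parameter names are our own illustrative choices, not the exact dynamics model of the virtual human.

```python
def expression_intensity(t, onset, peak_start, peak_end, decay_end, peak_level=1.0):
    """Piecewise-linear intensity envelope of a facial expression over time t (seconds).

    The expression ramps up during [onset, peak_start], holds peak_level during
    [peak_start, peak_end], and ramps back down during [peak_end, decay_end].
    """
    if t <= onset or t >= decay_end:
        return 0.0
    if t < peak_start:                       # onset ramp
        return peak_level * (t - onset) / (peak_start - onset)
    if t <= peak_end:                        # sustained peak
        return peak_level
    return peak_level * (decay_end - t) / (decay_end - peak_end)  # decay ramp

# Example: a smile that starts at 0.5 s, peaks between 1.0 s and 2.5 s at 70%
# intensity, and has fully faded out by 3.5 s.
samples = [round(expression_intensity(t / 10, 0.5, 1.0, 2.5, 3.5, 0.7), 2)
           for t in range(0, 40, 5)]
```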

Fig. 1 Emotions expressed by moving some action units (i.e., the small squares in the figure) of the face mesh. Left column: emotions changing from neutral to negative; right column: emotions changing from neutral to positive

The participants were asked to judge the emotional state of the virtual human, so there was no interaction between the participant and the virtual human. The participants were told that the scene contained a virtual lady talking with a human, but that the human voice had been removed. Therefore, problems related to timing (i.e., whether the virtual human should or should not show an expression at a certain point in time) were avoided, and the participants could focus on the emotional behavior of the virtual human herself.

Seven conditions were included in the experiment, all varying in the emotional state of the virtual human. Since the scenario was conversation based, two continuously alternating phases could be identified: one in which the virtual human was speaking and one in which she was listening. These phases allowed the virtual human to express her emotion differently in the two phases. In the speaking phase, the virtual human used voice and nonverbal communication to express her emotions, while in the listening phase, she only used nonverbal communication. Three emotional states were created for both phases (i.e., positive, neutral and negative), and they formed the basis for the seven different conditions shown in Fig. 2. As the combinations of positive (negative) listening and negative (positive) speaking would convey contradictory emotional information in the speaking and listening phases, these combinations were considered unnatural and were excluded from the experiment. Taking the neutral attitude in both the speaking and listening phases as the baseline, it was expected that participants would give a higher valence score when the virtual human responded positively in either the listening or the speaking phase. Assuming that there would be no interaction between the speaking and listening phases and that both phases would have a similar impact on the expressed valence intensity, the seven conditions could be ordered into five groups: highly negative (S − L−), lowly negative (S − L0, S0L−), neutral (S0L0), lowly positive (S + L0, S0L+) and highly positive (S + L+). If the intensity of the expressions with a negative or positive valence were equal, these five groups could be projected onto a single valence scale, as is done in Fig. 2 (shown as the predicted valence value axis). Comparing the actual valence values obtained in the experiment to the predicted valence values would make it possible to study hypothesis 2 about the experienced valence intensity in the two phases of the conversation.
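
To make the condition grouping concrete, the sketch below enumerates the seven conditions together with the predicted valence positions of Fig. 2; the numeric values (−2 to +2) are only ordinal placeholders for the five groups, not measured quantities.

```python
# Conditions are labeled S<speaking valence>L<listening valence>, using "-", "0"
# and "+" for negative, neutral and positive expressions, respectively.
PREDICTED_VALENCE = {
    "S-L-": -2,   # highly negative: negative speaking and negative listening
    "S-L0": -1,   # lowly negative: negative speaking, neutral listening
    "S0L-": -1,   # lowly negative: neutral speaking, negative listening
    "S0L0":  0,   # neutral baseline
    "S+L0": +1,   # lowly positive: positive speaking, neutral listening
    "S0L+": +1,   # lowly positive: neutral speaking, positive listening
    "S+L+": +2,   # highly positive: positive speaking and positive listening
}

# Contradictory combinations were considered unnatural and excluded from the experiment.
EXCLUDED = ["S+L-", "S-L+"]
```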

Fig. 2 Seven conditions, consisting of combinations of an emotional state in the speaking and listening phases of a conversation, as used in the experiment, and their corresponding predicted valence intensity

The participants were asked to sit in front of the virtual human (displayed from the chest up on a computer screen), right at the place where the virtual human’s “conversational partner” would sit. With this setup, the participants could clearly perceive the virtual human’s emotional state, expressed by her vocal expression, facial expression and eye and head movements. When expressing a positive emotional state, the virtual human would show a happy facial expression and would nod her head once in a while to agree with her conversation partner. Her eyes would mainly look at her conversational partner and only occasionally look away (Fig. 3c). When expressing a negative emotional state, the virtual human would have an angry facial expression and would continuously look away, showing limited interest in her conversation partner (Fig. 3a). The intensity of both the positive (happy) and negative (angry) emotional expressions was evaluated in a previous study (Broekens et al. 2012b) to ensure that both could be identified by individuals. The neutral expression was the default facial expression of FaceGen, with the six Ekman basic emotion (Ekman et al. 1992) parameters set to zero and with all other morph modifiers removed when generating the face model.

Fig. 3 Different emotional states of the virtual human in her listening phase. a Negative: angry facial expression, only looking at her conversation partner at the beginning, gradually losing interest and starting to look around. b Neutral: neutral facial expression while constantly looking at her conversation partner with slight eye movements. c Positive: happy facial expression while constantly looking at her conversation partner, showing some slight eye movements and occasionally nodding her head

In the speaking phase, the virtual human would look directly at her conversation partner. In the negative speaking condition, she would have a negative facial expression (Fig. 4a), while in the positive speaking condition, she would have a positive facial expression (Fig. 4c). In addition, speech with either a negative or positive intonation was added to the virtual human.

Fig. 4 Different emotional states of the virtual human in her speaking phase. a Negative: angry facial expression while looking at her conversation partner, and speaking with a negative voice intonation. b Neutral: neutral facial expression while constantly looking at her conversation partner, and speaking with a neutral voice intonation. c Positive: happy facial expression while constantly looking at her conversation partner, and speaking with a positive voice intonation

3.3 Emotion validation

As mentioned above, verbal communication was added to the virtual human for the speaking phase. The voice of the virtual human was recorded in Chinese by a Chinese linguistics student. Her voice was recorded three times, each time expressing a different emotional state: positive, neutral and negative. A small separate study, in which 6 Chinese participants (3 males and 3 females with an average age of 27 (SD = 0.5) years) were asked to rate the valence of the recorded voice on a scale from 1 (negative) to 9 (positive), showed that the emotion in the recorded voice was indeed perceived as intended, F(2,10) = 25.29, p < .001. The negative voice was rated significantly lower than the neutral voice, t(5) = 3.87, p = .012, and the positive voice, t(5) = 6.52, p < .001. Further, the positive voice was rated significantly higher than the neutral voice, t(5) = 3.61, p = .015. The means and standard deviations of the scores for the positive, neutral and negative voice were M = 7.8, SD = 1.9; M = 5.7, SD = 2.0; and M = 1.7, SD = 0.8, respectively.
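
A sketch of this kind of analysis is shown below, with a hand-computed one-way repeated-measures ANOVA followed by paired-sample t tests; the rating values are invented for illustration and do not reproduce the reported statistics.

```python
import numpy as np
from scipy import stats

# Hypothetical ratings (rows: 6 participants; columns: negative, neutral, positive voice).
ratings = np.array([
    [1, 4, 6],
    [2, 7, 9],
    [1, 6, 8],
    [3, 5, 9],
    [1, 8, 7],
    [2, 4, 8],
], dtype=float)
n_subj, n_cond = ratings.shape

# One-way repeated-measures ANOVA computed from sums of squares.
grand = ratings.mean()
ss_cond = n_subj * ((ratings.mean(axis=0) - grand) ** 2).sum()
ss_subj = n_cond * ((ratings.mean(axis=1) - grand) ** 2).sum()
ss_total = ((ratings - grand) ** 2).sum()
ss_error = ss_total - ss_cond - ss_subj
df_cond, df_error = n_cond - 1, (n_cond - 1) * (n_subj - 1)
F = (ss_cond / df_cond) / (ss_error / df_error)
p = stats.f.sf(F, df_cond, df_error)
print(f"F({df_cond},{df_error}) = {F:.2f}, p = {p:.3f}")

# Pairwise follow-up comparisons with paired-sample t tests.
neg, neu, pos = ratings.T
print(stats.ttest_rel(neg, neu))   # negative vs. neutral
print(stats.ttest_rel(neg, pos))   # negative vs. positive
print(stats.ttest_rel(pos, neu))   # positive vs. neutral
```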

Making a fair comparison between the listening and speaking phases requires that the intensity of the nonverbal communication is similar in both phases. For example, the virtual human’s facial and body expression in the lowly negative speaking phase and the lowly negative listening phase (see Fig. 2) should have a similar impact on the valence intensity. To test this, an additional small study was conducted. In this study, twelve participants (5 males and 7 females with an average age of 27 years, SD = 1.8) were simultaneously presented with two video clips of the virtual human, each including both the listening and speaking phases. Half of the participants were Chinese. The participants were asked to rate how easily they could see the difference between the two videos on a scale from very easy (0) to very difficult (100). The participants were explicitly asked not to rate the valence, but only the ease with which differences were perceived, representing the intensity of the emotion. The videos were presented without sound. The participants were asked to rate 12 pairs in total (S − L0/S0L0, S0L−/S0L0, S + L0/S0L0, S0L+/S0L0, S − L−/S + L+, S − L−/S0L0, S + L+/S0L0, S0L0/S0L0, S + L0/S + L0, S − L0/S − L0, S0L+/S0L+, S0L−/S0L−), presented to each participant in a different random order. Before they rated the pairs, the participants were shown all the possible behaviors of the virtual human so that they could establish an overall frame of reference.

The first step of the analysis was to examine whether the more intense stimuli were easier to distinguish from the neutral reference video (S0L0) and whether the positive and negative videos were equally distinctive. Therefore, a MANOVA with repeated measures was conducted with the intensity of the video stimuli (high versus low intensity) and the valence direction (positive versus negative) as independent variables. The analysis was conducted on the ratings for the highly positive (S + L+/S0L0) and highly negative (S − L−/S0L0) videos, and on the mean ratings for the lowly positive (S + L0/S0L0 and S0L+/S0L0) and lowly negative (S − L0/S0L0 and S0L−/S0L0) videos across the speaking and listening phases. The analysis found a significant main effect (F(1, 11) = 21.91, p = 0.001) of intensity, in that the highly positive or negative videos (M = 32, SD = 17) were rated as easier to distinguish than the lowly positive or negative videos (M = 44, SD = 15). A significant (F(1, 11) = 15.63, p = 0.002) main effect was also found for direction: the positive videos (M = 25, SD = 15) were rated as easier to distinguish from the neutral video than the negative videos (M = 50, SD = 23). The analysis found no significant (F(1, 11) = 1.60, p = 0.23) two-way interaction effect, which suggests that the two main effects were constant across the conditions.

The next analysis focused on the question whether, compared to the neutral reference video, the positive or negative differences in the listening or speaking phase were equally distinguishable, and whether this was the same for the positive and negative videos. Therefore, a second MANOVA with repeated measures was conducted with the valence direction and the phase (speaking versus listening) as independent variables. The analysis used the ratings for the lowly positive speaking (S + L0/S0L0) and lowly positive listening (S0L+/S0L0) phases and the ratings for the lowly negative speaking (S − L0/S0L0) and lowly negative listening (S0L−/S0L0) phases. The analysis again revealed that the positive videos (M = 28, SD = 16) were rated as significantly (F(1, 11) = 16.91, p = 0.002) easier to distinguish from the neutral reference video than the negative videos (M = 59, SD = 24). No significant difference was found between the listening and speaking phases (F(1, 11) = 0.14, p = 0.71), and no significant two-way interaction effect was found either (F(1, 11) = 0.44, p = 0.52). Figure 5 shows the videos with their predicted valence and the estimated valence. The latter is the z score of the rating for the video subtracted from the rating of the neutral reference video (S0L0/S0L0), whereby the rating of negative videos was multiplied by −1. Both the two lowly negative and the two lowly positive videos are positioned closely together. In other words, the intensity of the nonverbal communication seems similar in the listening and speaking phases. Furthermore, because of the significant difference in rating between negative and positive videos, the neutral reference video seems to be positioned closer to the negative videos than to the positive videos. As illustrated in Fig. 5, the predicted and estimated valence values for the videos do not follow a linear function, but rather a cubic function. By using a fitted and inverted cubic function, intensity-weighted predicted valence values for the videos were calculated from the estimated valence values, thereby creating values of intended valence intensity to be compared with the perceived valence ratings of the videos later in the paper.
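
The sketch below shows one way such a cubic fit and its inversion could be carried out with NumPy; the predicted and estimated valence values are invented placeholders, not the data plotted in Fig. 5.

```python
import numpy as np

# Predicted valence positions of the seven conditions (ordinal placeholders) and
# hypothetical estimated valence values; both are invented for illustration.
predicted = np.array([-2.0, -1.0, -1.0, 0.0, 1.0, 1.0, 2.0])
estimated = np.array([-1.2, -0.30, -0.25, 0.0, 0.90, 1.0, 1.8])

# Fit a cubic polynomial: estimated = f(predicted).
coeffs = np.polyfit(predicted, estimated, deg=3)

def invert_cubic(est_value, lo=-2.0, hi=2.0):
    """Solve f(p) = est_value for p and return the real root within [lo, hi].

    Assumes a single root in the fitted range.
    """
    shifted = coeffs.copy()
    shifted[-1] -= est_value              # move the target into the constant term
    roots = np.roots(shifted)
    real = roots[np.abs(roots.imag) < 1e-9].real
    in_range = real[(real >= lo) & (real <= hi)]
    return float(in_range[0]) if in_range.size else float("nan")

# Intensity-weighted predicted valence for each condition.
weighted_predicted = [round(invert_cubic(e), 2) for e in estimated]
```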

Fig. 5 Predicted valence plotted against the estimated valence, fitted with a cubic function

3.4 Measurements

There are various ways to quantitatively measure the three emotional dimensions (i.e., valence, arousal and dominance). To ensure the reliability of the emotion measurement, two subjective self-reporting instruments were included in this study: the Self-Assessment Manikin Questionnaire (SAM) (Lang 1995) and the AffectButton (AFB) (Broekens and Brinkman 2009).

The SAM questionnaire consists of a series of manikin figures used to judge affective quality (Fig. 6). As a nonverbal rating system, the SAM questionnaire represents the intensity values of the three dimensions of emotion: valence, arousal and dominance (Lang 1995). The first row of SAM manikin figures ranges from unhappy to happy on the valence dimension. The second row represents the arousal dimension, ranging from relaxed to excited. The last row ranges from dominated to controlling, representing the dominance dimension. When instructed on how to use the SAM questionnaire according to the detailed explanation provided in the instruction manual of Lang et al. (2008), participants can select one of the nine figures in each row to express their feelings about the emotional stimulus. The manikin figures were taken from PXLab (Irtel 2007). Various studies show that the SAM questionnaire accurately measures emotional reactions to imagery (Lang et al. 1999; Morris 1995), sounds (Bradley and Lang 2007), robot gesture expression (Haring et al. 2011), etc.

Fig. 6 Self-Assessment Manikin Questionnaire; the three rows represent the valence, arousal and dominance dimensions, respectively. Copyright © 2001–2006, Hans Irtel. Distributed under the MIT License as certified by the Open Source Initiative

The AffectButton (AFB) offers a flexible and dynamic way to collect users’ explicit affective feedback (Broekens and Brinkman 2009, 2013). The AFB is a button-like input interface (Fig. 7). In essence, the AFB can be regarded as a navigation tool through a large set of facial expressions. The user can freely move the cursor over the face to change its affective state. Similar to the SAM questionnaire, the AFB returns feedback on the valence, arousal and dominance dimensions. Designed with the intention of being a quick and user-friendly explicit emotion measurement instrument, the reliability and validity of the AFB have been studied in measuring emotional reactions to words, feelings and music (Broekens and Brinkman 2009; Broekens et al. 2010).

Fig. 7 AffectButton (left) and its corresponding appearance samples (right) while moving the cursor (the cross)

3.5 Procedure

Prior to the experiment, participants were provided with an information sheet, and the procedure was explained to them. They were then asked to sign an informed consent form. The experiment was set up as a within-subject design, comprising seven conditions with different emotional expressions in both the listening and speaking phases. In each condition, the participants were asked to watch a short clip (around 1 min) of a conversation about going to conferences between a Chinese virtual lady and a person. In each clip, the virtual human spoke 10 sentences in total and was silent in between sentences, listening to her conversational partner talking. The total length of the virtual human’s speaking phase was around 15 s, and the remaining 45 s was counted as the virtual human’s listening phase. The conversation was in Chinese, and the participants could hear what the virtual human said during the speaking phase; during the listening phase, there was no sound of the virtual human’s conversational partner. After watching each clip, the participants were asked to rate the virtual human’s emotional state using both the SAM and the AFB. The order in which the video clips were shown was randomized across participants.

4 Results

The experiment had seven conditions (Fig. 2), two different measurements and two groups of participants (Chinese and non-Chinese). The data recorded by the SAM questionnaire were integers ranging from 0 to 8, while the data recorded by the AFB were floating-point numbers ranging from −1 to 1. To compare these two measurements, the data were first normalized into z scores per measurement for each participant across the seven conditions.
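
A minimal sketch of this normalization step is given below, assuming the raw ratings of one measurement are stored in an array of shape (participants, conditions); the array names and values are illustrative.

```python
import numpy as np

def zscore_per_participant(ratings):
    """Normalize each participant's ratings to z scores across the seven conditions.

    ratings: array of shape (n_participants, n_conditions) for one measurement
    (e.g., SAM valence on 0-8, or AFB valence on -1..1).
    """
    means = ratings.mean(axis=1, keepdims=True)
    stds = ratings.std(axis=1, ddof=1, keepdims=True)
    return (ratings - means) / stds

# Illustrative data: 2 participants x 7 conditions.
sam = np.array([[1, 2, 3, 4, 5, 6, 7],
                [0, 2, 2, 4, 6, 6, 8]], dtype=float)
afb = np.array([[-0.8, -0.4, -0.3, 0.0, 0.3, 0.5, 0.9],
                [-1.0, -0.5, -0.4, 0.1, 0.4, 0.6, 1.0]])
sam_z, afb_z = zscore_per_participant(sam), zscore_per_participant(afb)
valence_z = (sam_z + afb_z) / 2   # averaged z score used in the later analyses
```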

The means of the SAM questionnaire and the AFB scores on the valence dimension are shown in Fig. 8. A repeated measures MANOVA was conducted to test the difference between the SAM and AFB scores, using condition, type of measurement and cultural background as three independent variables and the z scores on valence as the dependent variable. The analysis also included all two-way and three-way interactions. The results showed no significant difference between the SAM and AFB measurements, F(1,22) = 1.30, p = .26, and also no significant interaction effects.

Fig. 8 Means and standard deviations of the SAM and AFB z scores for the valence dimension in each of the seven experimental conditions

To test the relationship between these two measurements, a correlation analysis between the SAM and AFB scores on the valence dimension was performed, using the average scores across all participants for the seven conditions. The results showed that SAM and AFB were highly correlated on the valence dimension, r(7) = 0.995, p < .001. The valence scores collected by these two measurements could therefore be regarded as consistent. This made it possible to focus only on the average of the SAM and AFB z scores in the remaining analyses.

4.1 Chinese versus non-Chinese

To test the effect of cultural background on the valence rating of the emotional expressions, a mixed MANOVA was conducted using condition as a within-subjects independent variable, cultural background as a between-subjects independent variable and the averaged valence score of both measurements as the dependent variable. The results showed no significant main effect of cultural background on the valence score, F(1,22) = 1.23, p = .64, and no significant interaction between cultural background and condition, F(6,17) = 0.72, p = .28.

Instead of looking for a difference between participants from different cultural backgrounds, the next step of the analysis focused on the similarity of the ratings between these two groups. To examine the relationship between the ratings of Chinese and non-Chinese participants, we performed a correlation analysis based on the means for the seven conditions. The results showed that the valence scores of the Chinese participants were significantly correlated with those of the non-Chinese participants, r(7) = .98, p < .001. Although a difference due to cultural background was expected, the results showed high consistency in the evaluation of the emotional state between participants from different cultures. Hence, the results of the two groups of participants were pooled in the rest of the data analyses.

4.2 Positive versus neutral versus negative emotional state

Participants were asked to rate seven conditions (i.e., different combinations of a positive, negative and neutral emotional state during the virtual human’s speaking and listening phases). A repeated measures MANOVA was conducted to study the effect of these conditions on the averaged valence score of the SAM and AFB z scores. The results showed a significant effect of condition on the valence rating, F(6,18) = 59.50, p < .001. Next, to run a priori comparisons, paired-sample t tests were performed using the averaged valence scores of the SAM and AFB z scores in all conditions as paired variables. The results are shown in Table 1.

Table 1 Mean, SD and mean difference of the valence rating of the different conditions. H0: μ1 = μ2, * p < 0.05

To test whether the subjective valence score was correlated with the intensity-weighted predicted valence values (see Sect. 3.3 and Fig. 5; hereafter abbreviated as weighted valence values) for each condition, we calculated the Pearson correlation coefficient between the weighted valence values and the subjective scores averaged over the participants across the seven experimental conditions. This correlation was relatively high, r(7) = .93, p = .002. The following step in the analysis was to determine the deviation between the subjective valence scores and their corresponding expected valence values per experimental condition. To do so, we fitted a line through the three data points S + L+, S0L0 and S − L− using least-squares regression. Figure 9 shows this line, including the mean subjective valence scores of the remaining four conditions. Deviations of the perceived valence from this line (for the lowly negative and lowly positive videos) show to what extent the perceived valence differs from what is expected in case of an equal valence intensity between the speaking and listening phases (noted as the expected valence value in Fig. 9).
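
The sketch below illustrates this analysis step, assuming hypothetical weighted valence values, mean subjective scores and per-participant scores; the one-sample t test at the end mirrors the deviation tests reported in the next paragraph.

```python
import numpy as np
from scipy import stats

conditions = ["S-L-", "S-L0", "S0L-", "S0L0", "S+L0", "S0L+", "S+L+"]
# Hypothetical intensity-weighted predicted valence values and mean subjective scores.
weighted = np.array([-1.9, -1.0, -0.9, 0.0, 1.0, 0.9, 1.9])
subjective = np.array([-1.3, -0.9, -0.4, -0.3, 0.8, 0.1, 1.4])

# Pearson correlation between weighted valence values and subjective scores.
r, p = stats.pearsonr(weighted, subjective)

# Least-squares line through the three anchor conditions S-L-, S0L0 and S+L+.
anchors = [conditions.index(c) for c in ("S-L-", "S0L0", "S+L+")]
slope, intercept = np.polyfit(weighted[anchors], subjective[anchors], deg=1)
expected = slope * weighted + intercept

# Per-participant deviations from the expected value can then be tested against
# zero with a one-sample t test, e.g., for the S+L0 condition (hypothetical scores):
splus_l0_scores = np.array([0.9, 0.7, 1.1, 0.8, 0.6, 1.0])
t, p_dev = stats.ttest_1samp(splus_l0_scores - expected[conditions.index("S+L0")], 0.0)
```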

Fig. 9 The relationship between the intensity-weighted predicted valence and the averaged subjective valence

One-sample t tests revealed that when the virtual human showed neutral listening behavior, both a positive (S + L0) and a negative (S − L0) emotional expression during speaking resulted in a more extreme valence than expected: the subjective score was more positive than the expected valence value in case of the positive emotional expression (t(23) = 2.69, p = .013) and more negative than the expected valence value in case of the negative emotional expression during speaking (t(23) = −6.14, p < .001). The opposite was seen for the impact of the listening phase. With a neutral emotional expression in the speaking phase, the subjective valence score for listening with a positive emotional expression (i.e., the S0L+ condition) was significantly less positive than expected (t(23) = −3.08, p = .005). Similarly, the subjective valence score for listening with a negative emotional expression (i.e., the S0L− condition) was significantly less negative than the expected valence value (t(23) = 3.38, p = .003).

Moreover, the subjective valence score for the S0L− condition was almost equal (t(23) = 0.059, p = .95) to the subjective valence score for the S0L0 condition (i.e., speaking and listening both with a neutral emotional expression). Still, the subjective valence score for the S0L+ condition (i.e., speaking with a neutral emotional expression and listening with a positive emotional expression) was significantly more positive than that for the S0L0 condition (t(23) = 2.92, p = .008). A direct comparison of the lowly positive or negative conditions showed a similar pattern. The subjective valence value for the S − L0 condition (i.e., speaking with a negative emotional expression and listening with a neutral emotional expression) was significantly more negative than the subjective valence value for the S0L− condition (i.e., speaking with a neutral emotional expression and listening with a negative emotional expression), t(23) = 4.97, p < .001. Similarly, the subjective valence value for the S + L0 condition (i.e., speaking with a positive emotional expression and listening with a neutral emotional expression) was significantly more positive than the subjective valence value for the S0L+ condition (i.e., speaking with a neutral emotional expression and listening with a positive emotional expression), t(23) = 4.01, p = .001.

Together these observations imply that people do not perceive much difference between the virtual human showing neutral or negative listening behavior, but they do perceive a difference when the virtual human shows positive listening behavior. In conclusion, all these results support hypothesis 2, stating that the valence of the emotional expression during the listening phase of a conversation is perceived as less impactful than the emotional expression during the speaking phase.

Finally, we also compared the more extreme emotional conditions with the S0L0 condition. Both the S + L+ condition (t(23) = −9.00, p < .001) and the S − L− condition (t(23) = 5.16, p < .001), with positive or negative emotional expressions in both the listening and speaking phases, respectively, strongly impacted the perceived valence in the expected direction.

5 Discussion and conclusion

The experiment described in this paper is a human perception study on positive and negative emotions of a virtual human and on how cultural background might affect the perception of these emotions. In a sense, this study can be seen as a re-confirmation in virtual reality of what is known about human–human interaction in the actual world. Still, this is an important validation step, as conversations with virtual humans are increasingly used as part of gaming (e.g., Hudlicka and Delft 2009), training (e.g., Broekens et al. 2012) or psychotherapy (e.g., Opris et al. 2012).

The study found that both Chinese and non-Chinese participants could perceive the valence of the virtual human’s emotional states, and no significant difference between these two groups was found. Instead, the ratings of these two groups were highly correlated. The results show that the valence of the emotional states of the virtual human can be easily recognized by all participants, independent of their cultural background. Hypothesis 1 is therefore not confirmed. On the contrary, our results support the idea of universality of the facial expression of emotion (Ekman 1994; Matsumoto 2007) and question the need for tailor-made virtual reality applications that target different cultural groups or have multi-cultural users. Still, the results of this study may not be generally applicable to all cultures, since we only evaluated possible differences in emotion perception between Chinese and non-Chinese people. Further studies are needed to extend our conclusion of universality of emotion perception of virtual humans to people with other cultural backgrounds.

In addition, comparing the differences between conditions, it seems that the participants’ perception of the valence was more influenced by the emotion of the virtual human while speaking than while listening, which supports Hypothesis 2. Comparing the subjectively perceived valence scores with the expected valence values (Fig. 9), the valence perceived by the participants in the conditions where the listening was neutral and the speaking was performed with a positive or negative emotional expression was significantly more extreme than what was predicted under equal intensity between speaking and listening. Similarly, the perceived valence was less extreme than the weighted valence value when the speaking was neutral but the listening was performed with a positive or negative emotional expression. This shows the additional influence of verbal communication on valence recognition during a human–virtual human conversation. These findings seem to be in contrast to reports of Melo, Carnevale and Gratch (2011), who claim that there is no difference in emotion perception between verbal and nonverbal communication. Their study, however, used text typing as the verbal communication means between human and virtual human, which might explain the different finding. It seems not surprising that the combination of verbal and nonverbal communication transfers more emotional information than nonverbal communication only. Furthermore, the influence of the voice can be regarded as content independent because of the high consistency found between the Chinese and non-Chinese participants in this experiment. In other words, the results suggest that affective aspects can be conveyed in speech even if the language is not understood.

The finding that the perceived valence of the emotion of the virtual human is more intense in the speaking phase than in the listening phase of a conversation may be extended with new research on how to control the level of emotion during these separate phases. Applications such as virtual reality exposure therapy for patients suffering from social phobia may be designed in a way that manipulates the potential phobic stressor using the virtual human’s emotional behavior. Further studies may exploit the difference in valence perception between the speaking and listening phases and explore how to further optimize the persuasive power during these two phases, which may be beneficial for the design of many virtual applications involving human–virtual human conversation. Besides, this study only focused on how individuals perceive the performance of a virtual human. It would also be interesting to test the emotional influence on a human during a human–virtual human conversation. Whether the virtual human’s emotion could lead or alter the content of the conversation could be an appealing topic in the persuasive computing area.

Two main conclusions may be drawn from the experiment, but there are also a number of limitations. First, the virtual human only showed her upper body, and no gestures were used to express emotion. However, in recent decades, more insights have become available on body expression (Gross et al. 2010; Kleinsmith and Bianchi-Berthouze 2013). It would therefore be interesting to examine how our findings would be affected if the virtual human used its full body to express emotions. Second, the position of the virtual human was fixed in the current study. It would be interesting to test the emotional impact of manipulating the virtual human’s position, for example, far away versus nearby (Broekens et al. 2012b). Third, the face model of the virtual human used in this study was generated by FaceGen with the ethnicity parameter set to Southeast Asia; however, no empirical validation was done to confirm the ethnic appearance of the virtual human. Fourth, the study described in this paper only focused on the valence dimension of emotion, neglecting the other two dimensions, namely arousal and dominance. Including these additional two dimensions would allow the study of more complex emotions, for example, fear, surprise, etc. Despite these limitations, the results of this paper suggest a stronger impact of the speaking phase on the perception of the virtual human’s emotional state, and a potential independence of the perceived valence of the virtual human’s emotion from cultural background. These findings could help designers focus their attention on creating and evaluating virtual humans with appropriate emotional expressions, which may help to improve the overall experience of virtual environments.