1 Introduction

Creating a feeling of being “present” in virtual reality is essential to the success of many virtual reality applications such as training (Broekens et al. 2011), coaching (Rizzo et al. 2011), therapy (Brinkman et al. 2012) and games (Isbister 2006). A feeling of being “present” in virtual reality may be achieved by making the virtual environment as natural as possible. Human–computer interaction, including human–virtual human interaction, is inherently natural and social (Reeves and Nass 1996), and it is therefore an essential component of the realism of the virtual environment. Without proper behavior of the virtual human, users may not be able to “suspend disbelief”, and the effectiveness of the virtual reality application will decrease.

Considering the importance of emotion in human–human communication, emotion may also help people to establish a better relationship with a virtual human (Reeves and Nass 1996). As Picard et al. (2001) argue, without some emotional skills, machines will not appear intelligent when interacting with people. Therefore, multiple technologies that give virtual humans the ability to generate humanly acceptable expressions have been developed in recent decades (Ersotelos and Dong 2008).

Different applications require different levels of emphasis on how virtual humans express their emotions. Even when implemented in only part of an application, emotional expressions can be effective. In a health coach application, for example, the virtual human might mainly need to speak to motivate the user, so emotional expressions during speaking are most important. In virtual reality exposure therapy for fear of public speaking, the virtual human only needs to listen, so emotional expressions while listening are most important. In other applications, such as a role-playing game or a negotiation simulator, the full range of speaking and listening is used and might benefit from emotional expressions. Studies on generating and evaluating emotional agents normally focus on either listening (Slater et al. 1999; Wong and McGee 2012) or speaking (MacDorman et al. 2010; Qiu and Benbasat 2005). Studies that do include both speaking and listening (e.g., Core et al. 2006; Broekens et al. 2012a; Link et al. 2006) focus mainly on the conversation and communication as a whole and do not investigate the speaking and listening phases of the conversation separately. To our knowledge, no study has directly compared the impact of emotional expressions during speaking and listening in virtual reality. In the current study, the virtual human’s valence state was manipulated from negative to positive while she was speaking and listening, and the impact on the participants’ perception was examined.

Besides the difference between listening and speaking, culture might also be an important factor for a designer to consider, as many applications are nowadays used all across the world. Especially for some virtual reality applications, such as virtual reality exposure therapy for patients with social phobia (Brinkman et al. 2012), it is crucial to understand how people with a different cultural background perceive the affective behavior of virtual humans. Several studies have already focused on the effect of cultural differences on evaluating a virtual human’s emotions. For example, Jack et al. (2012) showed that facial expressions of emotion are culture specific. However, Yun et al. (2009) found that cultural background has little effect on emotion perception. Kleinsmith et al. (2006) evaluated the cultural impact on the perception of emotion and found that emotions are both universal and culture specific. Therefore, similar to studies on perceiving emotional expressions of real humans, the universality of emotion perception of virtual humans also still seems inconclusive. In addition, most research is limited to the investigation of head-only virtual humans with facial expressions, and far less research is devoted to emotional expressions of a 3D virtual human that also expresses its emotional state via gaze, head movement or voice intonation.

In summary, this study involves two research questions: (1) whether emotional expressions of a virtual human are perceived differently depending on the cultural background of the perceiver, and (2) whether a person is more receptive to emotional expressions in one of the two phases (the speaking phase and the listening phase) or treats these two phases as equally important when rating the virtual human’s emotions. To answer these research questions, we designed a virtual human representing a Chinese lady of around 25 years old. She had the ability to show multiple emotional states in multiple nonverbal and verbal ways, i.e., through facial expression, head movement, gaze and voice intonation. During the listening phase, the virtual human’s emotional behavior was expressed by nonverbal communication only, while during the speaking phase, the emotional behavior was expressed by both verbal (i.e., intonation) and nonverbal communication. To avoid a possible emotional bias from the content of the conversation, a relatively neutral topic, i.e., conference attendance, was selected for this experiment. Petrushin (1999) pointed out that humans are not perfect in decoding manifest emotions such as anger and happiness from voice intonation only. Therefore, as a first step, only three basic emotional valence states (positive, neutral and negative) were used in this study. In order to test the effect of cultural influence on the perception of the emotions expressed by the virtual human, two groups of participants were recruited: one from the same culture as the virtual human and one from other cultures. We chose to compare Chinese versus non-Chinese participants because it is known from cultural models (Hofstede 2001) that the difference in cultural values is significant between these two groups, and because two of the authors experienced these differences while living in Europe. Moreover, the background of these authors facilitated the recruitment of Chinese participants.

Based on knowledge available in the literature, we formulated the following two hypotheses related to our research questions.

Hypothesis 1

Individuals with the same cultural background as the virtual human perceive the valence state of the virtual human differently from individuals with a different cultural background.

In particular, as the virtual human spoke Chinese, participants with a different cultural background could not understand what the virtual human said during verbal communication. Hence, participants with a cultural background different from the virtual human’s are expected to perceive her emotion differently from participants with the same cultural background.

Hypothesis 2

The virtual human’s expressed valence is perceived as more intense in the speaking phase than in the listening phase.

Since the speaking phase also allows for verbal expression of the emotion, it seems likely that, compared to the listening phase, the emotion expressed in the speaking phase is perceived as more intense.

The rest of the paper is structured as follows. Section 2 provides theoretical background on how a virtual human can express emotion through facial expression, gaze, head movements and voice intonation. In addition, it discusses cultural differences in emotion recognition and various emotional models needed to understand the rest of the paper. Section 3 provides a description of the apparatus, the validation of the stimulus material and the procedure of the experiment; the results are presented in Sect. 4. Finally, in Sect. 5, the findings of the study are discussed and conclusions are drawn.

2 Theoretical background

No matter what role virtual humans play in a virtual world, they need to elicit an anthropomorphic interaction with their human users. This requires vast knowledge of various human aspects, including facial expression, gaze, head movement, voice expression and their cultural differences, in order to make the virtual human believable, responsive and interpretable.

2.1 Facial expression of a virtual human

Facial expression is one of the options to express human emotion and as such plays a substantial role in depicting human characters. Starting in the early 1970s and 1980s (Parke 1972; Platt and Badler 1981), face modeling and animation have been a continuous research topic for many years. From the early 2000s, more flexible emotion representations were created with MPEG-4-based facial animation (Tsapatsoulis et al. 2002). Recent advances in facial animation that allow producing a rich set of effects on synthetic humans have already had their impact on the industry (Ersotelos and Dong 2008).

Multiple approaches have been proposed to create natural-looking facial expressions; they can be categorized as follows: (1) simulation or physically based models, which try to model the anatomical structure of the face as well as the underlying dynamics (Kahler et al. 2001; Lee et al. 1995; Waters 1987), (2) performance-driven models, which reassemble frames from video footage or motion capture data of a real person to yield the desired facial expression (Brand 1999; Bregler 1997; Chuang and Bregler 2002; Ezzat et al. 2004; Litwinowicz and Williams 1994) and (3) parameterized models, which assign weights to the vertices of the meshes representing the face, such that during animation the vertices are moved according to these weights (Cohen and Massaro 1993; Parke 1974; Zhang et al. 2006). Considering the high computational load required for the simulation or physically based models and the high costs of the motion capture equipment needed for performance-driven models, we chose an easily repeatable facial expression animation based on a parameterized model for this study.
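
As an illustration of the third category, the following minimal Python sketch blends per-vertex displacements of expression targets according to their weights; the mesh data, function name and weights are hypothetical, and the sketch is not the implementation used in this study.

```python
import numpy as np

def blend_face(neutral, targets, weights):
    """Deform a face mesh by weighting per-vertex offsets of expression targets.

    neutral : (V, 3) array of vertex positions of the neutral face.
    targets : dict mapping an expression name to a (V, 3) array of vertex
              positions for that expression at full intensity.
    weights : dict mapping expression names to blend weights in [0, 1].
    Returns the deformed (V, 3) vertex array.
    """
    deformed = neutral.copy()
    for name, w in weights.items():
        # Each weight scales the displacement of the target relative to neutral.
        deformed += w * (targets[name] - neutral)
    return deformed

# Example with a tiny 3-vertex "mesh" (real face meshes have thousands of vertices).
neutral = np.zeros((3, 3))
targets = {"smile": np.array([[0.0, 0.1, 0.0],
                              [0.1, 0.2, 0.0],
                              [0.0, 0.1, 0.0]])}
half_smile = blend_face(neutral, targets, {"smile": 0.5})
```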

2.2 Head movement and gaze of a virtual human

Besides facial expressions, head and eye movements were also implemented in the virtual human used in our experiment. Head movements and eye gaze are two important sources of emotional feedback in interaction (Cassell and Thorisson 1999; Lee and Marsella 2012; Ruttkay and Pelachaud 2005). They are essential in embodied interactive conversational systems (Cassell et al. 1994), and it is relatively simple to create them, primarily nods and glances toward or away from the user; still, the correct timing is essential (Cassell and Thorisson 1999). Research by Lance et al. (2008) and Lee et al. (2009) shows how head movements and gaze can be embedded into a virtual character.

2.3 Voice expression of a virtual human

Along with the nonverbal emotional expressions, emotion can also be expressed by voice intonation when the virtual human is talking. Speech was once considered the main channel carrying most, or even all, of the necessary information in a conversation (Ochsman and Chapanis 1974). This idea has been countered by a growing body of research on believable, lifelike embodied conversational agents (Bates 1994). Still, the importance of the voice in emotion expression cannot be denied (Scherer 1995). Many studies have investigated emotional effects in voice and speech (Bailenson et al. 2006; Petrushin 1999; Scherer 2003), and emotion expressed in the voice of virtual humans (Cerezo and Baldassarri 2008; Moridis and Economides 2012). The intonation of the voice was therefore also considered an important aspect of the virtual human’s emotional expression in this study.

2.4 Cultural difference

Culture, like age, gender, posture and context, is one of the many factors affecting emotion expression (Picard 1998). A long-standing question in the study of human emotion is the extent to which emotional expressions are universal or culturally determined (Elfenbein et al. 2007). Cultural background may influence the rate of emotion recognition (Matsumoto 2002). When the expresser of an emotion and the perceiver of the emotion have the same cultural background, the perceiver’s recognition rate is found to be higher than when the expresser and perceiver have different cultural backgrounds (Elfenbein 2003; Elfenbein and Ambady 2002; Elfenbein et al. 2007). However, Darwin (1872) and Tomkins (1962, 1963) argue that universal emotions do exist, and studies also show universality in the facial expression of emotion and its perception, attributing only a small effect of cultural background to emotion perception from facial expressions (Ekman 1994; Ekman and Friesen 1971; Ekman et al. 1987; Matsumoto 2002, 2007).

The question of the impact of cultural background can be extended to human–virtual human interaction. Although various studies show that people can, in general, correctly identify emotions expressed by embodied agents (Bartneck 2001; Schiano et al. 2000), how well this performance is retained across cultures needs to be considered. Clear indications support the statement that culture can shape the expression and interpretation of emotions (Keltner and Ekman 2000). Culture as a factor has also been studied in the interaction with computers. For example, Dotsch and Wigboldus (2008) and Brinkman et al. (2011) found a difference in emotional reaction to a virtual human with an ethnic appearance that matched or did not match the person’s ethnicity. Endrass et al. (2011) show that in German and Japanese cultures, the user’s perception of an agent conversation can be enhanced by a culturally prototypical performance of gestures and body postures. Kleinsmith et al. (2006) studied cross-cultural differences in recognizing affect from a virtual human’s body posture and suggested considering culture as a specific factor in the implementation of agents. Meanwhile, Jan et al. (2007) mention that in Arabian and US American cultures, gaze, proximity and turn-taking behavior are all culture related. These results reveal that participants perceive behavior that is in line with their own cultural background differently from behavior that is typical for a different cultural background. In the work presented in this paper, cultural background is considered a variable that is expected to influence how people perceive the emotional expression of the virtual human.

2.5 Dimensional emotion model

For facial expressions, six universal basic emotions exist (Ekman et al. 1992). For language, however, people’s categorization of verbal labels to describe their everyday emotions varies between languages and cultures (Russell 1991). Instead of placing these expressed emotions in categories, i.e., a discrete emotional approach, others suggest placing them in a multi-dimensional space, i.e., a dimensional approach (Fox 2008). Three broad dimensions have often been proposed to describe affect (Mehrabian and Russell 1974): valence, arousal and dominance. Valence is variously referred to as positive and negative affect or as pleasant and unpleasant feelings. The arousal dimension ranges from deep sleep to frenetic excitement. Dominance focuses on the expression of social control and aggression and varies between submissive and dominant (Schroder 2004). Compared to the discrete emotional approach, the dimensional approach often uses subjective reports of feelings as its main dependent variable. As such, it has a strong empirical base. Support for the existence of these dimensions has come from research into subjective reports, physiological responses, neural circuits and cognitive appraisal (Barrett 2006; Fox 2008). Furthermore, Wierzbicka (1995) and Church and Katigbak (1998) also investigated the cross-cultural universality of the emotional dimensions. Their results showed the universality of the valence and arousal dimensions. The study presented in this paper focuses on the valence dimension only. Although participants were asked to rate the virtual human’s emotion on all three dimensions, only the valence dimension was used for the data analysis.
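
To make the dimensional approach concrete, the sketch below represents affective states as points in valence–arousal–dominance space and maps an arbitrary point to the nearest discrete label; the coordinates are rough illustrative approximations of our own, not values used or measured in this study.

```python
from dataclasses import dataclass

@dataclass
class PAD:
    """An affective state in valence-arousal-dominance space, each in [-1, 1]."""
    valence: float
    arousal: float
    dominance: float

# Approximate, illustrative coordinates for a few discrete emotion labels.
EMOTIONS = {
    "happy":   PAD(valence=0.8,  arousal=0.5,  dominance=0.4),
    "angry":   PAD(valence=-0.6, arousal=0.7,  dominance=0.6),
    "sad":     PAD(valence=-0.6, arousal=-0.4, dominance=-0.4),
    "neutral": PAD(valence=0.0,  arousal=0.0,  dominance=0.0),
}

def closest_label(state: PAD) -> str:
    """Map a point in PAD space to the nearest labeled emotion (Euclidean distance)."""
    def dist(e: PAD) -> float:
        return ((state.valence - e.valence) ** 2 +
                (state.arousal - e.arousal) ** 2 +
                (state.dominance - e.dominance) ** 2) ** 0.5
    return min(EMOTIONS, key=lambda name: dist(EMOTIONS[name]))

print(closest_label(PAD(0.6, 0.3, 0.2)))  # prints "happy"
```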

3 Experiment

3.1 Participants

Twelve Chinese (7 females and 5 males) and twelve non-Chinese (5 females and 7 males) students from Delft University of Technology participated in the experiment. Their age ranged from 24 to 38 years with a mean of 27.8 (SD = 3.4) years. All participants were naive with respect to the hypotheses. Written informed consent was obtained from all participants. The experiment was approved by the university ethics committee.

3.2 Creating the virtual human

Although Cowell and Stanney (2003) found that users generally prefer to interact with a youthful character matching their ethnicity, they found no significant preference for character gender. Furthermore, Kulms et al. (2011) showed that actual behavior and its evaluation are more important than gender stereotypes. Therefore, a Chinese virtual lady aged around 25 years was specially created for this study.

The model of the virtual human was created with FaceGen and 3ds Max. All main factors considered to contribute to emotion expression were combined: the virtual human’s facial expression, her head and eye movements and her voice intonation were manipulated to express emotion during the conversation. To create facial expressions, an easily repeatable facial expression animation method was used. This method rigged the face mesh into 22 action units with 18 features (Gratch et al. 2002), where each feature was an anchor point attached to a set of vertices of the face. A model for the face dynamics was defined that could control the intensity of the expression, as well as its onset, peak and decay. As such, the virtual human had the ability to show any intensity and any combination of the six basic Ekman facial expressions (Ekman et al. 2002). The validity of this approach was shown by Broekens et al. (2012b). By setting the values for the three emotional dimensions (i.e., valence, arousal and dominance) and for the expression duration, any emotion could be expressed by the virtual human. The facial expressions from neutral to negative and from neutral to positive, as used by the virtual human in our experiment, are shown in Fig. 1.
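
A minimal sketch of such a face dynamics model is shown below, assuming a piecewise-linear onset–peak–decay envelope; the function and parameter names are our own illustrative choices, not the exact dynamics model of the virtual human.

```python
def expression_intensity(t, onset, peak_start, peak_end, decay_end, peak_level=1.0):
    """Piecewise-linear intensity envelope of a facial expression over time t (seconds).

    The expression ramps up during [onset, peak_start], holds peak_level during
    [peak_start, peak_end], and ramps back down during [peak_end, decay_end].
    """
    if t <= onset or t >= decay_end:
        return 0.0
    if t < peak_start:                       # onset ramp
        return peak_level * (t - onset) / (peak_start - onset)
    if t <= peak_end:                        # sustained peak
        return peak_level
    return peak_level * (decay_end - t) / (decay_end - peak_end)  # decay ramp

# Example: a smile that starts at 0.5 s, peaks between 1.0 s and 2.5 s at 70%
# intensity, and has fully faded out by 3.5 s.
samples = [round(expression_intensity(t / 10, 0.5, 1.0, 2.5, 3.5, 0.7), 2)
           for t in range(0, 40, 5)]
```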

Fig. 1 Emotions expressed by moving some action units (i.e., the small squares in the figure) of the face mesh. Left column: emotions changing from neutral to negative; right column: emotions changing from neutral to positive

The participants were asked to judge the emotional state of the virtual human, so there was no interaction between the participant and the virtual human. The participants were told that the scene contained a virtual lady talking with a human, but that the human voice had been removed. Therefore, problems related to timing (i.e., whether the virtual human should or should not show an expression at a certain point in time) were avoided, and the participants could focus on the emotional behavior of the virtual human herself.

Seven conditions were included in the experiment, all varying in the emotional state of the virtual human. Since the scenario was conversation based, two continuously alternating phases could be identified: one in which the virtual human was speaking and one in which she was listening. These phases allowed the virtual human to express her emotion differently in the two phases. In the speaking phase, the virtual human used voice and nonverbal communication to express her emotions, while in the listening phase, she only used nonverbal communication. Three emotional states were created for both phases (i.e., positive, neutral and negative), and they formed the basis for the seven different conditions shown in Fig. 2. As the combinations of positive (negative) listening and negative (positive) speaking would convey contradictory emotional information in the speaking and listening phases, these combinations were considered unnatural and were excluded from the experiment. Taking the neutral attitude in both the speaking and listening phases as the baseline, it was expected that participants would give a higher valence score when the virtual human responded positively in either the listening or the speaking phase. Assuming that there would be no interaction between the speaking and listening phases and that both phases would have a similar impact on the expressed valence intensity, the seven conditions could be ordered into five groups: highly negative (S − L−), lowly negative (S − L0, S0L−), neutral (S0L0), lowly positive (S + L0, S0L+) and highly positive (S + L+). If the intensity of the expressions with a negative or positive valence were equal, these five groups could be projected onto a single valence scale, as is done in Fig. 2 (shown as the predicted valence value axis). Comparing the actual valence values obtained in the experiment to the predicted valence values would make it possible to study hypothesis 2 about the experienced valence intensity in the two phases of the conversation.
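
To make the condition grouping concrete, the sketch below enumerates the seven conditions together with the predicted valence positions of Fig. 2; the numeric values (−2 to +2) are only ordinal placeholders for the five groups, not measured quantities.

```python
# Conditions are labeled S<speaking valence>L<listening valence>, using "-", "0"
# and "+" for negative, neutral and positive expressions, respectively.
PREDICTED_VALENCE = {
    "S-L-": -2,   # highly negative: negative speaking and negative listening
    "S-L0": -1,   # lowly negative: negative speaking, neutral listening
    "S0L-": -1,   # lowly negative: neutral speaking, negative listening
    "S0L0":  0,   # neutral baseline
    "S+L0": +1,   # lowly positive: positive speaking, neutral listening
    "S0L+": +1,   # lowly positive: neutral speaking, positive listening
    "S+L+": +2,   # highly positive: positive speaking and positive listening
}

# Contradictory combinations were considered unnatural and excluded from the experiment.
EXCLUDED = ["S+L-", "S-L+"]
```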

Fig. 2 Seven conditions, consisting of combinations of an emotional state in the speaking and listening phases of a conversation, as used in the experiment, and their corresponding predicted valence intensity

The participants were asked to sit in front of the virtual human (displayed from the chest up on a computer screen), right at the place where the virtual human’s “conversational partner” would sit. With this setup, the participants could clearly perceive the virtual human’s emotional state, expressed by her vocal expression, facial expression and eye and head movements. When expressing a positive emotional state, the virtual human would show a happy facial expression and would nod her head once in a while to agree with her conversation partner. Her eyes would mainly look at her conversational partner and only occasionally look away (Fig. 3c). When expressing a negative emotional state, the virtual human would have an angry facial expression and would continuously look away, showing limited interest in her conversation partner (Fig. 3a). The intensity of both the positive (happy) and negative (angry) emotional expressions was evaluated in a previous study (Broekens et al. 2012b) to ensure that both could be identified by individuals. The neutral expression was the default facial expression of FaceGen, with the six Ekman basic emotion (Ekman et al. 1992) parameters set to zero and with all other morph modifiers removed when generating the face model.

Fig. 3 Different emotional states of the virtual human in her listening phase. a Negative: angry facial expression, only looking at her conversation partner at the beginning, gradually losing interest and starting to look around. b Neutral: neutral facial expression while constantly looking at her conversation partner with slight eye movements. c Positive: happy facial expression while constantly looking at her conversation partner, showing some slight eye movements and occasionally nodding her head

In the speaking phase, the virtual human would look directly at her conversation partner. In the negative speaking condition, she would have a negative facial expression (Fig. 4a), while in the positive speaking condition, she would have a positive facial expression (Fig. 4c). In addition, speech with either a negative or positive intonation was added to the virtual human.

Fig. 4 Different emotional states of the virtual human in her speaking phase. a Negative: angry facial expression while looking at her conversation partner, and speaking with a negative voice intonation. b Neutral: neutral facial expression while constantly looking at her conversation partner, and speaking with a neutral voice intonation. c Positive: happy facial expression while constantly looking at her conversation partner, and speaking with a positive voice intonation

3.3 Emotion validation

As mentioned above, verbal communication was added to the virtual human for the speaking phase. The voice of the virtual human was recorded in Chinese by a Chinese linguistics student. Her voice was recorded three times, each time expressing a different emotional state: positive, neutral and negative. A small separate study, in which 6 Chinese participants (3 males and 3 females with an average age of 27 (SD = 0.5) years) were asked to rate the valence of the recorded voice on a scale from 1 (negative) to 9 (positive), showed that the emotion in the recorded voice was indeed perceived as intended, F(2,10) = 25.29, p < .001. The negative voice was rated significantly lower than the neutral voice, t(5) = 3.87, p = .012, and the positive voice, t(5) = 6.52, p < .001. Further, the positive voice was rated significantly higher than the neutral voice, t(5) = 3.61, p = .015. The means and standard deviations of the scores for the positive, neutral and negative voice were M = 7.8, SD = 1.9; M = 5.7, SD = 2.0; and M = 1.7, SD = 0.8, respectively.
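
A sketch of this kind of analysis is shown below, with a hand-computed one-way repeated-measures ANOVA followed by paired-sample t tests; the rating values are invented for illustration and do not reproduce the reported statistics.

```python
import numpy as np
from scipy import stats

# Hypothetical ratings (rows: 6 participants; columns: negative, neutral, positive voice).
ratings = np.array([
    [1, 4, 6],
    [2, 7, 9],
    [1, 6, 8],
    [3, 5, 9],
    [1, 8, 7],
    [2, 4, 8],
], dtype=float)
n_subj, n_cond = ratings.shape

# One-way repeated-measures ANOVA computed from sums of squares.
grand = ratings.mean()
ss_cond = n_subj * ((ratings.mean(axis=0) - grand) ** 2).sum()
ss_subj = n_cond * ((ratings.mean(axis=1) - grand) ** 2).sum()
ss_total = ((ratings - grand) ** 2).sum()
ss_error = ss_total - ss_cond - ss_subj
df_cond, df_error = n_cond - 1, (n_cond - 1) * (n_subj - 1)
F = (ss_cond / df_cond) / (ss_error / df_error)
p = stats.f.sf(F, df_cond, df_error)
print(f"F({df_cond},{df_error}) = {F:.2f}, p = {p:.3f}")

# Pairwise follow-up comparisons with paired-sample t tests.
neg, neu, pos = ratings.T
print(stats.ttest_rel(neg, neu))   # negative vs. neutral
print(stats.ttest_rel(neg, pos))   # negative vs. positive
print(stats.ttest_rel(pos, neu))   # positive vs. neutral
```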

Making a fair comparison between the listening and speaking phases requires that the intensity of the nonverbal communication is similar in both phases. For example, the virtual human’s facial and body expression in the lowly negative speaking phase and the lowly negative listening phase (see Fig. 2) should have a similar impact on the valence intensity. To test this, an additional small study was conducted. In this study, twelve participants (5 males and 7 females with an average age of 27 years, SD = 1.8) were simultaneously presented with two video clips of the virtual human, each including both the listening and speaking phases. Half of the participants were Chinese. The participants were asked to rate how easily they could see the difference between the two videos on a scale from very easy (0) to very difficult (100). The participants were explicitly asked not to rate the valence, but only the ease with which differences were perceived, representing the intensity of the emotion. The videos were presented without sound. The participants were asked to rate 12 pairs in total (S − L0/S0L0, S0L−/S0L0, S + L0/S0L0, S0L+/S0L0, S − L−/S + L+, S − L−/S0L0, S + L+/S0L0, S0L0/S0L0, S + L0/S + L0, S − L0/S − L0, S0L+/S0L+, S0L−/S0L−), presented to each participant in a different random order. Before they rated the pairs, the participants were shown all the possible behaviors of the virtual human so that they could establish an overall frame of reference.

The first step of the analysis was to examine whether the more intense stimuli were easier to distinguish from the neutral reference video (S0L0) and whether the positive and negative videos were equally distinctive. Therefore, a MANOVA with repeated measures was conducted with the intensity of the video stimuli (high versus low intensity) and the valence direction (positive versus negative) as independent variables. The analysis was conducted on the ratings for the highly positive (S + L+/S0L0) and highly negative (S − L−/S0L0) videos, and on the mean ratings for the lowly positive (S + L0/S0L0 and S0L+/S0L0) and lowly negative (S − L0/S0L0 and S0L−/S0L0) videos across the speaking and listening phases. The analysis found a significant main effect (F(1, 11) = 21.91, p = 0.001) of intensity, in that the highly positive or negative videos (M = 32, SD = 17) were rated as easier to distinguish than the lowly positive or negative videos (M = 44, SD = 15). A significant (F(1, 11) = 15.63, p = 0.002) main effect was also found for direction: the positive videos (M = 25, SD = 15) were rated as easier to distinguish from the neutral video than the negative videos (M = 50, SD = 23). The analysis found no significant (F(1, 11) = 1.60, p = 0.23) two-way interaction effect, which suggests that the two main effects were constant across the conditions.

The next analysis focused on the question whether, compared to the neutral reference video, the positive or negative differences in the listening or speaking phase were equally distinguishable, and whether this was the same for the positive and negative videos. Therefore, a second MANOVA with repeated measures was conducted with the valence direction and the phase (speaking versus listening) as independent variables. The analysis used the ratings for the lowly positive speaking (S + L0/S0L0) and lowly positive listening (S0L+/S0L0) phases and the ratings for the lowly negative speaking (S − L0/S0L0) and lowly negative listening (S0L−/S0L0) phases. The analysis again revealed that the positive videos (M = 28, SD = 16) were rated as significantly (F(1, 11) = 16.91, p = 0.002) easier to distinguish from the neutral reference video than the negative videos (M = 59, SD = 24). No significant difference was found between the listening and speaking phases (F(1, 11) = 0.14, p = 0.71), and no significant two-way interaction effect was found either (F(1, 11) = 0.44, p = 0.52). Figure 5 shows the videos with their predicted valence and the estimated valence. The latter is the z score of the rating for the video subtracted from the rating of the neutral reference video (S0L0/S0L0), whereby the rating of negative videos was multiplied by −1. Both the two lowly negative and the two lowly positive videos are positioned closely together. In other words, the intensity of the nonverbal communication seems similar in the listening and speaking phases. Furthermore, because of the significant difference in rating between negative and positive videos, the neutral reference video seems to be positioned closer to the negative videos than to the positive videos. As illustrated in Fig. 5, the predicted and estimated valence values for the videos do not follow a linear function, but rather a cubic function. By using a fitted and inverted cubic function, intensity-weighted predicted valence values for the videos were calculated from the estimated valence values, thereby creating values of intended valence intensity to be compared with the perceived valence ratings of the videos later in the paper.
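
The sketch below shows one way such a cubic fit and its inversion could be carried out with NumPy; the predicted and estimated valence values are invented placeholders, not the data plotted in Fig. 5.

```python
import numpy as np

# Predicted valence positions of the seven conditions (ordinal placeholders) and
# hypothetical estimated valence values; both are invented for illustration.
predicted = np.array([-2.0, -1.0, -1.0, 0.0, 1.0, 1.0, 2.0])
estimated = np.array([-1.2, -0.30, -0.25, 0.0, 0.90, 1.0, 1.8])

# Fit a cubic polynomial: estimated = f(predicted).
coeffs = np.polyfit(predicted, estimated, deg=3)

def invert_cubic(est_value, lo=-2.0, hi=2.0):
    """Solve f(p) = est_value for p and return the real root within [lo, hi].

    Assumes a single root in the fitted range.
    """
    shifted = coeffs.copy()
    shifted[-1] -= est_value              # move the target into the constant term
    roots = np.roots(shifted)
    real = roots[np.abs(roots.imag) < 1e-9].real
    in_range = real[(real >= lo) & (real <= hi)]
    return float(in_range[0]) if in_range.size else float("nan")

# Intensity-weighted predicted valence for each condition.
weighted_predicted = [round(invert_cubic(e), 2) for e in estimated]
```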

Fig. 5 Predicted valence plotted against the estimated valence, fitted with a cubic function

3.4 Measurements

There are various ways to quantitatively measure the three emotional dimensions (i.e., valence, arousal and dominance). To ensure the reliability of the emotion measurement, two subjective self-reporting instruments were included in this study: the Self-Assessment Manikin Questionnaire (SAM) (Lang 1995) and the AffectButton (AFB) (Broekens and Brinkman 2009).

The SAM questionnaire consists of a series of manikin figures used to judge affective quality (Fig. 6). As a nonverbal rating system, the SAM questionnaire represents the intensity values of the three dimensions of emotion: valence, arousal and dominance (Lang 1995). The first row of SAM manikin figures ranges from unhappy to happy on the valence dimension. The second row represents the arousal dimension, ranging from relaxed to excited. The last row ranges from dominated to controlling, representing the dominance dimension. When instructed on how to use the SAM questionnaire according to the detailed explanation provided in the instruction manual of Lang et al. (2008), participants can select one of the nine figures in each row to express their feelings about the emotional stimulus. The manikin figures were taken from PXLab (Irtel 2007). Various studies show that the SAM questionnaire accurately measures emotional reactions to imagery (Lang et al. 1999; Morris 1995), sounds (Bradley and Lang 2007), robot gesture expression (Haring et al. 2011), etc.

Fig. 6 Self-Assessment Manikin Questionnaire; the three rows represent the valence, arousal and dominance dimensions, respectively. Copyright © 2001–2006, Hans Irtel. Distributed under the MIT License as certified by the Open Source Initiative

The AffectButton (AFB) offers a flexible and dynamic way to collect users’ explicit affective feedback (Broekens and Brinkman 2009, 2013). The AFB is a button-like input interface (Fig. 7). In essence, the AFB can be regarded as a navigation tool through a large set of facial expressions. The user can freely move the cursor over the face to change its affective state. Similar to the SAM questionnaire, the AFB returns feedback on the valence, arousal and dominance dimensions. Designed with the intention of being a quick and user-friendly explicit emotion measurement instrument, the reliability and validity of the AFB have been studied in measuring emotional reactions to words, feelings and music (Broekens and Brinkman 2009; Broekens et al. 2010).

Fig. 7 AffectButton (left) and its corresponding appearance samples (right) while moving the cursor (the cross)

3.5 Procedure

Prior to the experiment, participants were provided with an information sheet, and the procedure was explained to them. They were then asked to sign an informed consent form. The experiment was set up as a within-subject design, comprising seven conditions with different emotional expressions in both the listening and speaking phases. In each condition, the participants were asked to watch a short clip (around 1 min) of a conversation about going to conferences between a Chinese virtual lady and a person. In each clip, the virtual human spoke 10 sentences in total and was silent in between sentences, listening to her conversational partner talking. The total length of the virtual human’s speaking phase was around 15 s, and the remaining 45 s was counted as the virtual human’s listening phase. The conversation was in Chinese, and the participants could hear what the virtual human said during the speaking phase; during the listening phase, there was no sound of the virtual human’s conversational partner. After watching each clip, the participants were asked to rate the virtual human’s emotional state using both the SAM and the AFB. The order in which the video clips were shown was randomized across participants.

4 Results

The experiment had seven conditions (Fig. 2), two different measurements and two groups of participants (Chinese and non-Chinese). The data recorded by the SAM questionnaire were integers ranging from 0 to 8, while the data recorded by the AFB were floating-point numbers ranging from −1 to 1. To compare these two measurements, the data were first normalized into z scores per measurement for each participant across the seven conditions.
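
A minimal sketch of this normalization step is given below, assuming the raw ratings of one measurement are stored in an array of shape (participants, conditions); the array names and values are illustrative.

```python
import numpy as np

def zscore_per_participant(ratings):
    """Normalize each participant's ratings to z scores across the seven conditions.

    ratings: array of shape (n_participants, n_conditions) for one measurement
    (e.g., SAM valence on 0-8, or AFB valence on -1..1).
    """
    means = ratings.mean(axis=1, keepdims=True)
    stds = ratings.std(axis=1, ddof=1, keepdims=True)
    return (ratings - means) / stds

# Illustrative data: 2 participants x 7 conditions.
sam = np.array([[1, 2, 3, 4, 5, 6, 7],
                [0, 2, 2, 4, 6, 6, 8]], dtype=float)
afb = np.array([[-0.8, -0.4, -0.3, 0.0, 0.3, 0.5, 0.9],
                [-1.0, -0.5, -0.4, 0.1, 0.4, 0.6, 1.0]])
sam_z, afb_z = zscore_per_participant(sam), zscore_per_participant(afb)
valence_z = (sam_z + afb_z) / 2   # averaged z score used in the later analyses
```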

The means of the SAM questionnaire and the AFB scores on the valence dimension are shown in Fig. 8. A repeated measures MANOVA was conducted to test the difference between the SAM and AFB scores, using condition, type of measurement and cultural background as three independent variables and the z scores on valence as the dependent variable. The analysis also included all two-way and three-way interactions. The results showed no significant difference between the SAM and AFB measurements, F(1,22) = 1.30, p = .26, and also no significant interaction effects.

Fig. 8 Means and standard deviations of the SAM and AFB z scores for the valence dimension in each of the seven experimental conditions

To test the relationship between these two measurements, a correlation analysis between the SAM and AFB scores on the valence dimension was performed, using the average scores across all participants for the seven conditions. The results showed that SAM and AFB were highly correlated on the valence dimension, r(7) = 0.995, p < .001. The valence scores collected by these two measurements could therefore be regarded as consistent. This made it possible to focus only on the average of the SAM and AFB z scores in the remaining analyses.

4.1 Chinese versus non-Chinese

To test the effect of cultural background on the valence rating of the emotional expressions, a mixed MANOVA was conducted using condition as a within-subjects independent variable, cultural background as a between-subjects independent variable and the averaged valence score of both measurements as the dependent variable. The results showed no significant main effect of cultural background on the valence score, F(1,22) = 1.23, p = .64, and no significant interaction between cultural background and condition, F(6,17) = 0.72, p = .28.

Instead of looking for a difference between participants from different cultural backgrounds, the next step of the analysis focused on the similarity of the ratings between these two groups. To examine the relationship between the ratings of Chinese and non-Chinese participants, we performed a correlation analysis based on the means for the seven conditions. The results showed that the valence scores of the Chinese participants were significantly correlated with those of the non-Chinese participants, r(7) = .98, p < .001. Although a difference due to cultural background was expected, the results showed high consistency in the evaluation of the emotional state between participants from different cultures. Hence, the results of the two groups of participants were pooled in the rest of the data analyses.

4.2 Positive versus neutral versus negative emotional state

Participants were asked to rate seven conditions (i.e., different combinations of a positive, negative and neutral emotional state during the virtual human’s speaking and listening phases). A repeated measures MANOVA was conducted to study the effect of these conditions on the averaged valence score of the SAM and AFB z scores. The results showed a significant effect of condition on the valence rating, F(6,18) = 59.50, p < .001. Next, to run a priori comparisons, paired-sample t tests were performed using the averaged valence scores of the SAM and AFB z scores in all conditions as paired variables. The results are shown in Table 1.

Table 1 Mean, SD and mean difference of the valence rating of the different conditions. H0: μ1 = μ2, * p < 0.05

To test whether the subjective valence score was correlated with the intensity-weighted predicted valence values (see Sect. 3.3 and Fig. 5; hereafter abbreviated as weighted valence values) for each condition, we calculated the Pearson correlation coefficient between the weighted valence values and the subjective scores averaged over the participants across the seven experimental conditions. This correlation was relatively high, r(7) = .93, p = .002. The following step in the analysis was to determine the deviation between the subjective valence scores and their corresponding expected valence values per experimental condition. To do so, we fitted a line through the three data points S + L+, S0L0 and S − L− using least-squares regression. Figure 9 shows this line, including the mean subjective valence scores of the remaining four conditions. Deviations of the perceived valence from this line (for the lowly negative and lowly positive videos) show to what extent the perceived valence differs from what is expected in case of an equal valence intensity between the speaking and listening phases (noted as the expected valence value in Fig. 9).
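
The sketch below illustrates this analysis step, assuming hypothetical weighted valence values, mean subjective scores and per-participant scores; the one-sample t test at the end mirrors the deviation tests reported in the next paragraph.

```python
import numpy as np
from scipy import stats

conditions = ["S-L-", "S-L0", "S0L-", "S0L0", "S+L0", "S0L+", "S+L+"]
# Hypothetical intensity-weighted predicted valence values and mean subjective scores.
weighted = np.array([-1.9, -1.0, -0.9, 0.0, 1.0, 0.9, 1.9])
subjective = np.array([-1.3, -0.9, -0.4, -0.3, 0.8, 0.1, 1.4])

# Pearson correlation between weighted valence values and subjective scores.
r, p = stats.pearsonr(weighted, subjective)

# Least-squares line through the three anchor conditions S-L-, S0L0 and S+L+.
anchors = [conditions.index(c) for c in ("S-L-", "S0L0", "S+L+")]
slope, intercept = np.polyfit(weighted[anchors], subjective[anchors], deg=1)
expected = slope * weighted + intercept

# Per-participant deviations from the expected value can then be tested against
# zero with a one-sample t test, e.g., for the S+L0 condition (hypothetical scores):
splus_l0_scores = np.array([0.9, 0.7, 1.1, 0.8, 0.6, 1.0])
t, p_dev = stats.ttest_1samp(splus_l0_scores - expected[conditions.index("S+L0")], 0.0)
```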

Fig. 9 The relationship between the intensity-weighted predicted valence and the averaged subjective valence

One-sample t tests revealed that when the virtual human showed neutral listening behavior, both a positive (S + L0) and a negative (S − L0) emotional expression during speaking resulted in a more extreme valence than expected: the subjective score was more positive than the expected valence value in case of the positive emotional expression (t(23) = 2.69, p = .013) and more negative than the expected valence value in case of the negative emotional expression during speaking (t(23) = −6.14, p < .001). The opposite was seen for the impact of the listening phase. With a neutral emotional expression in the speaking phase, the subjective valence score for listening with a positive emotional expression (i.e., the S0L+ condition) was significantly less positive than expected (t(23) = −3.08, p = .005). Similarly, the subjective valence score for listening with a negative emotional expression (i.e., the S0L− condition) was significantly less negative than the expected valence value (t(23) = 3.38, p = .003).

Moreover, the subjective valence score for the S0L− condition was almost equal (t(23) = 0.059, p = .95) to the subjective valence score for the S0L0 condition (i.e., speaking and listening both with a neutral emotional expression). Still, the subjective valence score for the S0L+ condition (i.e., speaking with a neutral emotional expression and listening with a positive emotional expression) was significantly more positive than that for the S0L0 condition (t(23) = 2.92, p = .008). A direct comparison of the lowly positive or negative conditions showed a similar pattern. The subjective valence value for the S − L0 condition (i.e., speaking with a negative emotional expression and listening with a neutral emotional expression) was significantly more negative than the subjective valence value for the S0L− condition (i.e., speaking with a neutral emotional expression and listening with a negative emotional expression), t(23) = 4.97, p < .001. Similarly, the subjective valence value for the S + L0 condition (i.e., speaking with a positive emotional expression and listening with a neutral emotional expression) was significantly more positive than the subjective valence value for the S0L+ condition (i.e., speaking with a neutral emotional expression and listening with a positive emotional expression), t(23) = 4.01, p = .001.

Together these observations imply that people do not perceive much difference between the virtual human showing neutral or negative listening behavior, but they do perceive a difference when the virtual human shows positive listening behavior. In conclusion, all these results support hypothesis 2, stating that the valence of the emotional expression during the listening phase of a conversation is perceived as less impactful than the emotional expression during the speaking phase.

Finally, we also compared the more extreme emotional conditions with the S0L0 condition. Both the S + L+ condition (t(23) = −9.00, p < .001) and the S − L− condition (t(23) = 5.16, p < .001), with positive or negative emotional expressions in both the listening and speaking phases, respectively, strongly impacted the perceived valence in the expected direction.

5 Discussion and conclusion

The experiment described in this paper is a human perception study on positive and negative emotions of a virtual human and on how cultural background might affect the perception of these emotions. In a sense, this study can be seen as a re-confirmation in virtual reality of what is known about human–human interaction in the actual world. Still, this is an important validation step, as conversations with virtual humans are increasingly used as part of gaming (e.g., Hudlicka and Delft 2009), training (e.g., Broekens et al. 2012) or psychotherapy (e.g., Opris et al. 2012).

The study found that both Chinese and non-Chinese participants could perceive the valence of the virtual human’s emotional states, and no significant difference between these two groups was found. Instead, the ratings of these two groups were highly correlated. The results show that the valence of the emotional states of the virtual human can be easily recognized by all participants, independent of their cultural background. Hypothesis 1 is therefore not confirmed. On the contrary, our results support the idea of universality of the facial expression of emotion (Ekman 1994; Matsumoto 2007) and question the need for tailor-made virtual reality applications that target different cultural groups or have multi-cultural users. Still, the results of this study may not be generally applicable to all cultures, since we only evaluated possible differences in emotion perception between Chinese and non-Chinese people. Further studies are needed to extend our conclusion of universality of emotion perception of virtual humans to people with other cultural backgrounds.

In addition, comparing the differences between conditions, it seems that the participants’ perception of the valence was more influenced by the emotion of the virtual human while speaking than while listening, which supports Hypothesis 2. Comparing the subjectively perceived valence scores with the expected valence values (Fig. 9), the valence perceived by the participants in the conditions where the listening was neutral and the speaking was performed with a positive or negative emotional expression was significantly more extreme than what was predicted under equal intensity between speaking and listening. Similarly, the perceived valence was less extreme than the weighted valence value when the speaking was neutral but the listening was performed with a positive or negative emotional expression. This shows the additional influence of verbal communication on valence recognition during a human–virtual human conversation. These findings seem to be in contrast to reports of Melo, Carnevale and Gratch (2011), who claim that there is no difference in emotion perception between verbal and nonverbal communication. Their study, however, used text typing as the verbal communication means between human and virtual human, which might explain the different finding. It seems not surprising that the combination of verbal and nonverbal communication transfers more emotional information than nonverbal communication only. Furthermore, the influence of the voice can be regarded as content independent because of the high consistency found between the Chinese and non-Chinese participants in this experiment. In other words, the results suggest that affective aspects can be conveyed in speech even if the language is not understood.

The finding that the perceived valence of the emotion of the virtual human is more intense in the speaking phase than in the listening phase of a conversation may be extended with new research on how to control the level of emotion during these separate phases. Applications such as virtual reality exposure therapy for patients suffering from social phobia may be designed in a way that manipulates the potential phobic stressor using the virtual human’s emotional behavior. Further studies may exploit the difference in valence perception between the speaking and listening phases and explore how to further optimize the persuasive power during these two phases, which may be beneficial for the design of many virtual applications involving human–virtual human conversation. Besides, this study only focused on how individuals perceive the performance of a virtual human. It would also be interesting to test the emotional influence on a human during a human–virtual human conversation. Whether the virtual human’s emotion could lead or alter the content of the conversation could be an appealing topic in the persuasive computing area.

Two main conclusions may be drawn from the experiment, but there are also a number of limitations. First, the virtual human only showed her upper body, and no gestures were used to express emotion. However, in recent decades, more insights have become available on body expression (Gross et al. 2010; Kleinsmith and Bianchi-Berthouze 2013). It would therefore be interesting to examine how our findings would be affected if the virtual human used its full body to express emotions. Second, the position of the virtual human was fixed in the current study. It would be interesting to test the emotional impact of manipulating the virtual human’s position, for example, far away versus nearby (Broekens et al. 2012b). Third, the face model of the virtual human used in this study was generated by FaceGen with the ethnicity parameter set to Southeast Asia; however, no empirical validation was done to confirm the ethnic appearance of the virtual human. Fourth, the study described in this paper only focused on the valence dimension of emotion, neglecting the other two dimensions, namely arousal and dominance. Including these additional two dimensions would allow the study of more complex emotions, for example, fear, surprise, etc. Despite these limitations, the results of this paper suggest a stronger impact of the speaking phase on the perception of the virtual human’s emotional state, and a potential independence of the perceived valence of the virtual human’s emotion from cultural background. These findings could help designers focus their attention on creating and evaluating virtual humans with appropriate emotional expressions, which may help to improve the overall experience of virtual environments.