Abstract
In human-robot interaction, natural and intuitive communication between robot and human is one of the most important research topics. Emotion plays a crucial role in making interactions natural and social. Research has focused more on robots’ appearances and facial emotional expressions, but little research has investigated how robots’ voices, in combination with robot types and different emotions, affect users’ perception of robots’ emotional states. In this study, anthropomorphic and zoomorphic robots, four different voice types, and seven different emotional voices were used as mixed factors to examine how these influence users’ perception of robots’ emotional expression and other characteristics. Sixteen participants were asked to read fairy tales to robots and determine the robots’ emotional states when the robots verbally responded. Overall, the anthropomorphic robot (Nao) was preferred over the zoomorphic robot (Pleo), but appearance did not influence emotion recognition accuracy or other robot characteristics. Participants showed lower accuracy in recognizing negative emotions with high arousal: anger, fear, and disgust. TTS was rated lower than the human voices on all robot characteristics, such as warmth, honesty, trustworthiness, and naturalness. Implications and design directions are discussed with the results.
1 Introduction
Research on social robots has sharply increased. To design social robots, it is important to consider the key variables that influence robots’ sociability, trust, and acceptance. Previous studies in human-robot social interaction found that users expect natural and intuitive communication with robots [1,2,3]. Emotion is a critical component in enabling this natural and intuitive communication. To express emotions effectively and accurately, robots can utilize many different sensory cues. However, research has focused more on facial expressions [4], and comprehensive study of the relationship between different factors and emotion perception is still rudimentary. In this paper, an exploratory experiment was conducted to study users’ perception of robots’ emotional states with mixed factors, including robot appearances, voice types, and emotions. Participants were asked to read fairy tales to robots and determine the robots’ emotional states when the robots made a comment.
1.1 Related Work
Companion Robots and Natural Communication
The prospect of introducing companion robots into an individual’s daily life has received a significant focus from those working and conducting research in the field of Human-Robot Interaction. The development and integration of effective robot companions hold a tremendous degree of promise, given the potential for robots to assist people or even perform tasks that exceed their capabilities.
Previous studies have indicated broad support for the concept of companions, with participants viewing a robot’s roles as an assistant, machine, or servant as conforming to their expectations of the robot’s function [1]. Much greater support and confidence were expressed for robots performing household tasks than for tasks dealing with children or animals [1]. A robot companion’s ability to communicate is as significant as establishing the social context in which such a companion should operate. Humanlike communication is a desired trait [1]. Studies have similarly shown that the natural language interface of a robot receives more attention than its functionality, suggesting that communicative behavior may be a more critical component of the system [3]. Proposed criteria by which to evaluate communicative behavior include the ability of a robot to detect communication partners and pay attention to them, as well as its comprehension of speech, gestures, and its surrounding environment so as to understand an assigned task [3]. Such criteria revolve around maximizing the social aptness of robot companions so that they may interact and carry out tasks in a natural way.
Expressing one’s own emotions and reading others’ emotions is also critical for facilitating this natural interaction. To express emotions effectively and accurately, a number of verbal (e.g., voice style, accent, gender, and affective prosody) and nonverbal (e.g., appearance, facial expression, gesture, and movement) cues can be used.
In the current study, we explored a scenario in which our participants served as storytellers and our robots empathized with them emotionally and responded to the story. We specifically considered the robots’ ability to convey emotions, which is a critical part of human-like communication.
Form Factor of Robots
In the design of robots, there are two typical forms: anthropomorphic and zoomorphic [5]. Each has its own characteristics and suits different tasks. In studies on robots’ form factor (or appearance) and users’ perception, anthropomorphic and zoomorphic robots were preferred over machine-like ones or imaginary creatures [7,8,9]. Anthropomorphic and zoomorphic robots may have different working scenarios. The more human-like a robot’s appearance is, the more intelligent people think it is [10]. Also, during interaction with humans, anthropomorphic robots may be able to convey emotional expressions more effectively because their appearance is similar to that of humans [5]. However, the influence of anthropomorphic robots’ facial expressions on users’ perceptions is sometimes controversial, possibly due to the Uncanny Valley [6]. In other words, whether a robot should resemble a human or an animal depends heavily on the kind of task it has and the intelligence level its designers want users to perceive. Preference for an anthropomorphic or zoomorphic robot is influenced by many complex factors. In our work, we investigated the effects of voice types on each robot by applying both qualitative and quantitative measures, examining user perception from broader perspectives.
Robots’ Emotion Expression
In terms of emotional expression, emotional vocal expressions can effectively influence the behavior of perceivers [11]. Research explains a robot’s emotion expression process in relation to communication theory: 1) a robot’s internal state drives expressions, 2) specific robot behaviors are related to specific user reactions, and 3) the situation is an important driver of emotion expressions [12]. Emotion perception is an important source of information about the theory of mind, and emotions can be perceived from facial expressions, voices, and whole-body movements [13]. As mentioned, emotion expression and emotion perception play a critical role in human-robot interaction and are widely studied in a range of disciplines. However, previous studies have been dominated by robots’ facial emotions, and other modalities such as vocal and tactile processes have been less frequently considered [14, 15]. The present study focused more on auditory stimuli by including various emotive voices representing seven different emotions, and investigated the differences in users’ emotion perception.
1.2 Research Questions
From this background, we tried to attain a deeper understanding of the effects of robots’ appearances, voices, and emotion types on users’ perception about robots and their emotions. More specifically, we were interested in the following research questions:
- How can robot appearances, voices, emotion types, and their interactions influence people’s perception of robots’ emotional states?
- How can robot appearances and voices, and their interactions influence people’s perception of robots’ characteristics?
- How can robot appearances and voices, and their interactions influence people’s preference on robots?
To answer these research questions, we conducted a preliminary empirical experiment in which young adults (college students) interacted with two robots (human-like and animal-like) using four different voices (regular human, characterized human-like, characterized animal-like, text-to-speech) and seven emotions (six basic emotions + anticipation). We collected our participants’ emotion recognition accuracy and other subjective perception on robots.
2 Method
2.1 Participants
Sixteen university students participated in the study (Age: M = 23.5, SD = 3.97). Six participants identified themselves as male and the other ten as female. Participants were ethnically diverse (3 Asian, 2 Hispanic, 9 Caucasian, 1 Middle Eastern, and 1 African). The experiment lasted at most 2 hours, and participants were compensated with $20 ($10 per hour). All participants agreed to participate after reviewing the consent form approved by the VT IRB.
2.2 Robotic Systems and Stimuli
Two robots, NAO and Pleo, with different appearances and features were employed in the experiment (Fig. 1). We used these two robots, which represent an anthropomorphic robot and a zoomorphic robot respectively, to contrast the effects that robotic appearance has on people’s emotion perception. NAO is a small anthropomorphic robot (height: 22.6 in, length: 10.8 in, width: 12.2 in) that resembles a human, and Pleo is a zoomorphic robot (height: 8 in, length: 15 in, width: 4 in) that looks like a little dinosaur. Both robots played recorded auditory feedback, consisting of emotive utterances, to participants following the storylines. Two different stories (“The three little pigs” and “The boy who cried wolf”) were used in this experiment.
Four voice types were created for seven emotional expressions. We first categorized the voice types as a synthesized voice (text-to-speech or TTS voice) and a recorded human voice. The human voices were provided by two female native speakers in our research group, both speaking American English with American accents. Next, the recorded human voice was subdivided into three categories: a regular voice and a characterized voice for each robot (i.e., characterized NAO voice and characterized Pleo voice). The characterized voices for NAO and Pleo were designed to exaggerate emotional expressions in keeping with the robots’ characters, which the female speakers envisioned from the robots’ appearances.
The TTS voices were generated using text-to-speech [16] engines. Microsoft’s female voice and the iOS female voice were used, which were provided by default with the respective operating systems. These TTS voices included no emotional information beyond the words themselves.
Seven different emotions were presented throughout each story including Ekman’s six basic emotions. The six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) were chosen for their prevalence in psychology. In addition to them, the seventh emotion, anticipation, was chosen for its similarity to fear and surprise [17]. Its inclusion allows us the opportunity to see if participants can discern an emotion that is not traditionally regarded as a basic emotion and to gauge confusion between emotions with subtle differences. The seven emotions were fit into both stories (“The three little pigs” and “The boy who cried wolf”) as depicted in Table 1.
2.3 Design and Procedure
A 2 (robots) × 4 (voice types) × 7 (emotions) within-subject design was applied. Therefore, 8 different combinations of robots and voice types were provided to each participant with all 7 emotions. The presented order and the number of each combination were counterbalanced such that 1) each combination was almost equally presented about 20 times across participants in total and 2) levels of each treatment were presented at least once to each participant. Therefore, each participant interacted with all 8 conditions of robots and voice types and all 7 presented emotions. The 8 conditions were separated into two sessions to help participants recall and compare four different conditions. In each condition, the participant was instructed to read the script in front of a robot and listen to the emotional comment from the robot at various points in the story. The whole procedure including each step and the experiment environment are depicted in Figs. 2, 3 and 4 below.
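The factorial structure described above can be enumerated in a few lines. The following is a minimal sketch, not the authors' tooling: the factor labels are shorthand taken from the text, and the participant count of 16 comes from Sect. 2.1.

```python
from itertools import product

# Factor levels as named in the paper; the short labels are shorthand.
ROBOTS = ["NAO", "Pleo"]
VOICE_TYPES = ["regular", "characterized NAO", "characterized Pleo", "TTS"]
EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "anticipation"]

# Robot x voice-type conditions that each participant experiences.
conditions = list(product(ROBOTS, VOICE_TYPES))

# Full set of within-subject cells: every condition paired with every emotion.
cells = list(product(ROBOTS, VOICE_TYPES, EMOTIONS))

N_PARTICIPANTS = 16
total_responses = N_PARTICIPANTS * len(cells)

print(len(conditions))   # 8
print(len(cells))        # 56
print(total_responses)   # 896
```

The total of 896 emotion responses matches the denominator reported for the inter-rater comparison in Sect. 3.1.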
The participants were asked to fill out several questionnaires after listening to each comment generated by the robot, after finishing reading each full story, and after experiencing four conditions. Specifically, after each response to the seven emotions, each condition, and each session, surveys were conducted to measure the accuracy of emotion recognition, robot characteristics (Warmth, Honesty, Trustworthiness), naturalness (Natural, Human-like, Robot-like), and preferences (Likability, Attractiveness). The questionnaire consisted of open questions, seven-point Likert scales (1: Lowest, 7: Highest), and single-choice questions (Table 2).
Presented orders of emotions in the two stories were different, but the order in each story was fixed to maintain the storylines. To generalize the results, we employed two different stories having the same 7 emotions presented and two different voice groups having the same characteristics but recorded by different female speakers and two different female text-to-speech (TTS) engines. The examples of the presented order are depicted in Table 3. To validate the equivalence in accuracy, clarity, suitability, and preference of the two stories and two voice groups, the results were analyzed as below (Table 4), showing similar results in all categories.
3 Results
3.1 Data Collection
The answers to open questions regarding emotions were interpreted by two examiners. Each examiner categorized all the answers into the seven pre-defined emotions or marked them as ‘indistinguishable’ if the answers did not fall into any category. The two examiners worked independently, and the inter-rater reliability test showed that 87.8% (787/896) of the results were consistent, with a high Cronbach’s alpha (0.96). If the examiners’ interpretations differed, a third examiner reviewed the answers and decided which emotion the answer fell into.
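The percent agreement reported above is a simple proportion of matching labels. A minimal sketch follows; the function names and the four toy labels are hypothetical illustrations, and only the 787/896 figure comes from the paper.

```python
def percent_agreement(labels_a, labels_b):
    """Proportion of items on which two raters assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both raters must label the same items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def disagreements(labels_a, labels_b):
    """Indices of items that would be passed to a third examiner."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]

# Toy labels (hypothetical, not the study data).
rater_a = ["happiness", "sadness", "fear", "indistinguishable"]
rater_b = ["happiness", "sadness", "surprise", "indistinguishable"]

print(percent_agreement(rater_a, rater_b))  # 0.75
print(disagreements(rater_a, rater_b))      # [2]

# The figure reported in the text: 787 consistent answers out of 896.
print(round(100 * 787 / 896, 1))            # 87.8
```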
3.2 Emotion Perception: Accuracy, Clarity, Suitability, and Features
First, the accuracy of emotion perception, defined as the proportion of correct emotion answers, was analyzed. Figure 5 and Table 5 show the inferential statistics of accuracy across presented emotions, voice types, and robots. Regarding presented emotions, anger, disgust, and fear showed significantly lower accuracies (below chance level) than the other emotions. Therefore, we removed these three emotions from further accuracy analyses. Results were analyzed with the aligned rank transform (ART) [43] for nonparametric factorial analyses, since there are three factors (Robots, Voice Types, and Emotions) and the dependent variable (1: correct, 0: wrong) is not normally distributed. The ART allowed us to analyze the aligned-ranked data with a 2 (Robots) × 4 (Voice Types) × 4 (Emotions) repeated measures analysis of variance (ANOVA) and to test all main effects and interaction effects.
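Accuracy as defined above is the mean of a binary correct/wrong score within each level of a factor. The sketch below illustrates that aggregation only; it does not implement the aligned rank transform. The record layout and the six toy records are hypothetical, not the study data.

```python
from collections import defaultdict

def accuracy_by(records, factor):
    """Proportion of correct answers for each level of one factor."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        level = record[factor]
        totals[level] += 1
        correct[level] += record["correct"]  # 1: correct, 0: wrong
    return {level: correct[level] / totals[level] for level in totals}

# Toy records (hypothetical).
records = [
    {"robot": "NAO", "voice": "regular", "emotion": "happiness", "correct": 1},
    {"robot": "NAO", "voice": "TTS", "emotion": "sadness", "correct": 0},
    {"robot": "Pleo", "voice": "regular", "emotion": "surprise", "correct": 1},
    {"robot": "Pleo", "voice": "TTS", "emotion": "happiness", "correct": 0},
    {"robot": "NAO", "voice": "regular", "emotion": "surprise", "correct": 1},
    {"robot": "Pleo", "voice": "regular", "emotion": "sadness", "correct": 0},
]

print(accuracy_by(records, "voice"))  # {'regular': 0.75, 'TTS': 0.0}
print(accuracy_by(records, "robot"))
```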
For accuracy, there was no significant difference between Nao (M = 57.1%, SD correct = 0.5%) and Pleo (M = 53.6%, SD correct = 0.5%). The result revealed a statistically significant difference across voice types. However, there was a significant interaction effect between emotions and voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted. All pairwise comparisons applied a Bonferroni adjustment to control for Type-I error, which meant that we used a more conservative alpha level (critical alpha level = .0083 (0.05/6)). Participants showed significantly lower accuracy with the TTS voice than with the other three voice types.
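The divisor in the Bonferroni adjustment is the number of unordered pairs among the levels being compared. A minimal sketch, with a hypothetical helper name and the voice-type labels used as shorthand:

```python
from itertools import combinations

def bonferroni_alpha(levels, family_alpha=0.05):
    """Return all unordered pairs of levels and the per-comparison alpha."""
    pairs = list(combinations(levels, 2))
    return pairs, family_alpha / len(pairs)

voice_types = ["regular", "characterized NAO", "characterized Pleo", "TTS"]
pairs, alpha = bonferroni_alpha(voice_types)

print(len(pairs))       # 6
print(round(alpha, 4))  # 0.0083
```

The same function applied to seven emotion levels yields 21 pairs, so the per-comparison level becomes 0.05/21, or about 0.00238.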
Table 6 shows how participants misclassified emotions.
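A misclassification table of this kind can be tallied from (presented, perceived) pairs. The sketch below is illustrative only: the helper names and the five toy responses are hypothetical and do not reproduce the values in Table 6.

```python
from collections import Counter, defaultdict

def confusion_matrix(pairs):
    """Count perceived emotions for each presented emotion."""
    matrix = defaultdict(Counter)
    for presented, perceived in pairs:
        matrix[presented][perceived] += 1
    return matrix

def row_percentages(matrix, presented):
    """Percentage of each perceived label for one presented emotion."""
    row = matrix.get(presented, Counter())
    total = sum(row.values())
    if total == 0:
        return {}
    return {perceived: 100 * count / total for perceived, count in row.items()}

# Toy responses (hypothetical): "anger" presented five times.
responses = [
    ("anger", "sadness"),
    ("anger", "anger"),
    ("anger", "sadness"),
    ("anger", "anger"),
    ("anger", "indistinguishable"),
]

matrix = confusion_matrix(responses)
print(row_percentages(matrix, "anger"))
# {'sadness': 40.0, 'anger': 40.0, 'indistinguishable': 20.0}
```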
Second, clarity and suitability of perceived emotions over robots, voice types, and presented emotions were analyzed as shown in Fig. 6. Clarity and suitability were rated on a seven-point Likert scale (1: Lowest, 7: Highest). Again, only answers that correctly recognized emotions were considered. Overall, there were differences found in clarity over emotions and voice types. For robots, there were no significant differences found in both clarity and suitability categories. Results were analyzed with a 2 (Robot) × 4 (Voice Type) × 7 (Emotions) repeated measures analysis of variance (ANOVA). The result revealed a statistically significant difference in clarity ratings over robots, voice types, and presented emotions. Nao showed a significantly higher clarity rating than Pleo. For the multiple comparisons among voice types, paired-samples t-tests were conducted and the result is shown in Table 5. TTS showed a significantly lower clarity rating than the other three voice types. For the multiple comparisons among the seven emotions, paired-samples t-tests were conducted. All pairwise comparisons for this item applied a Bonferroni adjustment to control for Type-I error, with an alpha level of .0023 (0.05/21). Surprise showed a significantly higher clarity rating than disgust and happiness. For suitability ratings, the result revealed that TTS showed a significantly lower score than the other three voice types. No other differences were found.
Finally, the features by which to perceive emotions were analyzed with the results as shown in Table 7. The answers were collected from an open question (“What characteristics of the voice brought to mind that emotion?”) and the number of occurrences of words was counted. Each participant was allowed to provide multiple answers for each comment. Most of the emotions were perceived from tone by 40.9%, pitch by 15.6%, and context by 12.4%.
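Counting feature words in open answers can be sketched as follows. This is an assumption-laden illustration: the keyword list, the rule of counting each feature at most once per answer, and the use of total mentions as the denominator are choices made here, since the paper does not specify them. The four toy answers are hypothetical.

```python
import re
from collections import Counter

def feature_mentions(answers, features):
    """Count how many answers mention each feature keyword."""
    counts = Counter()
    for answer in answers:
        # Lowercase and split into words so "Tone" and "tone" match.
        words = set(re.findall(r"[a-z]+", answer.lower()))
        for feature in features:
            if feature in words:
                counts[feature] += 1
    return counts

def mention_shares(counts):
    """Share of each feature among all counted mentions, in percent."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: 100 * n / total for feature, n in counts.items()}

# Toy answers (hypothetical).
answers = [
    "The tone was sharp",
    "High pitch and harsh tone",
    "The context of the story",
    "Tone",
]
counts = feature_mentions(answers, ["tone", "pitch", "context"])

print(counts["tone"], counts["pitch"], counts["context"])  # 3 1 1
print(mention_shares(counts))
```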
3.3 Characteristics: Warmth, Honesty, and Trustworthiness
Figure 7 and Table 8 show the rating scores in trustworthiness over voice types and robots. Results were analyzed with a 2 (Robot) × 4 (Voice Type) repeated measures analysis of variance (ANOVA). For robots, there were no significant differences found in any of the three categories. The result revealed a statistically significant difference in trustworthiness among voice types. There was no interaction effect between robots and voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted. Warmth and Honesty showed exactly the same pattern as trustworthiness (i.e., no other differences except for voice types, with TTS being significantly lower).
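A paired-samples t statistic compares two conditions rated by the same participants by testing whether the mean of the per-participant differences is zero. The sketch below computes only the statistic and its degrees of freedom with the standard library; the function name and the four toy ratings are hypothetical, and obtaining a p-value would require a t distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Each participant needs a score in both conditions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    # Raises ZeroDivisionError if all differences are identical.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Toy seven-point ratings from four participants (hypothetical).
regular_voice = [5, 6, 4, 5]
tts_voice = [3, 4, 3, 3]

t, df = paired_t(regular_voice, tts_voice)
print(round(t, 2), df)  # 7.0 3
```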
3.4 Naturalness: Natural, Human-like, and Robot-like
Figure 8 and Table 9 show the rating scores in “robot-like” over voice types. Results were analyzed with a 2 (Robot) × 4 (Voice Type) repeated measures analysis of variance (ANOVA). For robots, there were no significant differences found in any of the three categories. The result revealed a statistically significant difference in the rating scores in “robot-like” among voice types. There was no interaction effect between robots and voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted. Participants showed significantly higher rating scores for TTS than for the other three voice types. In addition, characterized Pleo showed a significantly higher robot-likeness rating than the regular voice. Natural and Human-like showed exactly the opposite pattern to Robot-like (i.e., TTS was significantly lower than the others).
3.5 Preferences: Likability and Attractiveness
Figures 9, 10 and Table 10 show the rating scores in “likability” and “attractiveness” over robots and voice types. Results were analyzed with a 2 (Robot) × 4 (Voice Type) repeated measures analysis of variance (ANOVA). For “likability”, participants showed significantly higher rating scores for Nao than for Pleo. The result also revealed a statistically significant difference in the rating scores over voice types. There was no interaction effect between robots and voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted. Participants showed significantly lower rating scores for TTS than for the other three voice types. For “attractiveness”, the result revealed a statistically significant difference in the rating scores over voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted. Participants showed significantly lower rating scores for TTS than for the other three voice types. Also, participants showed significantly lower rating scores for characterized Pleo than for either characterized NAO or the regular voice.
4 Discussion
To get a holistic picture of the effects of robot appearances, voices, and emotions types on users’ perception on robots’ emotions and characteristics, we conducted a preliminary study. Overall, results showed that the effects of voice types (human vs. TTS) seem to be larger than those of robot appearances on multiple dependent variables.
For emotion recognition accuracy, there was no significant difference between the anthropomorphic (Nao) and zoomorphic (Pleo) robots. As expected, TTS showed significantly lower emotion recognition accuracy than the three human voice types. However, there were no differences in accuracy among the three human voice types (characterized Nao, characterized Pleo, and regular). Also, there were no differences among the three human voice types for clarity and suitability. Taken together, this might imply the potential for using a characterized voice for different purposes where appropriate (e.g., for children) without degrading emotion recognition accuracy, as long as it is a human voice. However, the result shows that emotion recognition accuracy varies significantly depending on the expressed emotion. As shown in Fig. 5, happiness, sadness, and surprise showed relatively higher accuracy than anger, disgust, and fear. Anticipation fell in between. This might be because happiness, sadness, and surprise are more common emotional states that participants can expect from fairy tales. Anger, disgust, and fear are all negative, high-arousal emotions. The participants might not expect these types of high-strength, negative emotions from fairy tales. However, the relationship between accuracy and each emotion shown in the present study is not in line with the results of previous studies [e.g., 18]. The difference might stem from different experimental settings (e.g., emotional words, prosody, context given by fairy tales, etc.). Thus, more iterative research is required to unpack the underlying mechanisms. For the misclassified emotions, valence had a large impact. Based on the confusion matrix, anger (negative) was mostly misclassified as sadness (negative) (39.8%) and anticipation (positive) was mostly misclassified as happiness (positive) (35.9%).
Based on the participants’ self-report, most of the emotions were perceived from tone by 40.9%, pitch by 15.6%, and context by 12.4%, which shows that affective prosody is more critical than the content itself.
For robot characteristics, there was no statistically significant difference between the two robots, but there were differences between all human voices and TTS. We can cautiously infer that people did not perceive any differences among the regular, characterized Nao and characterized Pleo in terms of warmth, honesty, and trustworthiness.
Similarly, for naturalness, there was no statistically significant difference between the two robots, even though participants consistently tended to rate Nao as more natural (M = 4.39, SD = 2.18 vs. M = 3.88, SD = 1.93), more human-like (M = 4.73, SD = 2.07 vs. M = 4.2, SD = 1.9), and less robot-like (M = 3.42, SD = 2.22 vs. M = 3.87, SD = 2.16) than Pleo. As expected, there were significant differences in these ratings between all human voices and TTS.
Finally, participants liked Nao significantly more than Pleo. They gave the highest rating to the characterized Nao voice, followed by the regular voice, characterized Pleo, and TTS, even though only TTS was significantly different from the other voice types. Participants also gave a higher attractiveness rating to Nao (M = 4.25, SD = 1.96) than to Pleo (M = 3.83, SD = 1.82), but this difference did not reach statistical significance due to large variance. All three human voices were significantly more attractive than TTS. Also, both characterized Nao and the regular voice were significantly more attractive than characterized Pleo. Again, this shows the potential for use of the characterized voice, at least for anthropomorphic robots.
This exploratory study can provide practical guidelines for the voice design of various robots and for further research studies. People seemed to generally prefer an anthropomorphic robot over a zoomorphic robot, which is in line with the literature [8]. However, using either a characterized or a regular human voice influenced neither people’s emotion recognition nor their perception of robot characteristics, such as warmth, honesty, trustworthiness, and naturalness. Therefore, this study supports using a human voice as a medium to express robots’ emotions, with the voice design choice depending on users, goal, and context.
5 Limitations and Future Work
The results of this experiment are limited by several factors. First, only female voices were used in this study. Depending on the gender of the voice, the results might be different. Second, the sample size was small and not sufficient to draw a firm conclusion. Due to COVID-19, the experiment could not be run with as many participants as planned. In future work, more participants with greater diversity should be recruited to generalize the results. Another limitation is that the questionnaire for emotion recognition was open-ended, which caused considerable confusion and a lower accuracy rate. In future work, a questionnaire with more specific emotion options can be provided along with additional open-ended input. Finally, the different speaker systems of the two robots might also have influenced the results (e.g., clarity) and should be addressed in the next study.
References
1. Dautenhahn, K., Woods, S., Kaouri, C., Walters, M.L., Koay, K.L., Werry, I.: What is a robot companion-friend, assistant or butler? In: Proceedings of the International Conference on Intelligent Robots and Systems 2005, IEEE/RSJ, pp. 1192–1197 (2005)
2. Vu, C., Cross, M., Bickmore, T., Gruber, A., Campbell, L.: U.S. Patent No. 8,935,006. In: U.S. Patent and Trademark Office, Washington, DC (2015)
3. Wrede, B., et al.: Research issues for designing robot companions: BIRON as a case study (2004)
4. Schirmer, A., Adolphs, R.: Emotion perception from face, voice, and touch: comparisons and convergence. Trends Cogn. Sci. 21(3), 216–228 (2017)
5. Lohse, M., Hegel, F., Swadzba, A., Rohlfing, K., Wachsmuth, S., Wrede, B.: What can I do for you? Appearance and application of robots. Proc. AISB 7, 121–126 (2007)
6. Seyama, I., Nagayama, S.: The uncanny valley: effect of realism on the impression of artificial human faces. Teleoperators Virtual Environ. 16(4), 337–351 (2007)
7. Li, D., Rau, P., Li, Y.: A cross-cultural study: effect of robot appearance and task. Int. J. Soc. Robot. 2(2), 175–186 (2010)
8. Hosseini, F., Hilliger, S., Barnes, J., Jeon, M., Park, H., Howard, M.: Love at first sight: mere exposure to robot appearance leaves impressions similar to interactions with physical robots. In: Proceedings of the 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 615–620. IEEE (2017)
9. Barnes, J., FakhrHosseini, M., Jeon, M., Park, H., Howard, A.: The influence of robot design on acceptance of social robots. In: Proceedings of the 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 51–55. IEEE (2017)
10. Hegel, F., Krach, S., Kircher, T., Wrede, B., Sagerer, G.: Understanding social robots: a user study on anthropomorphism. In: Proceedings of the 17th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 574–579. IEEE (2008)
11. Bachorowski, J., Owren, M.: Sounds of emotion: production and perception of affect-related vocal acoustics. Ann. New York Acad. Sci. 1000(1), 244–265 (2003)
12. Fischer, K., Jung, M., Jensen, L.: Emotion expression in HRI – when and why. In: Proceedings of the 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (2019)
13. Frith, D., Frith, U.: The neural basis of mentalizing. Neuron 50(4), 531–534 (2006)
14. Calvo, A., D’Mello, S.: Affect detection: an interdisciplinary review of models, methods, and their applications. IEEE Trans. Affect. Comput. 1(1), 18–37 (2010)
15. Schirmer, A., Adolphs, R.: Emotion perception from face, voice, and touch: comparisons and convergence. Trends Cogn. Sci. 21(3), 216–228 (2017)
16. Williams, G., Watts, N., MacLeod, C., Mathews, A.: Cognitive Psychology and Emotional Disorders. John Wiley & Sons, Oxford (1988)
17. Barnes, J., Richie, E., Lin, Q., Jeon, M., Park, H.: Emotive voice acceptance in human-robot interaction. In: Proceedings of the 24th International Conference on Auditory Display (2018)
18. Jeon, M., Rayan, I.A.: The effect of physical embodiment of an animal robot on affective prosody recognition. In: Jacko, J.A. (ed.) HCI 2011. LNCS, vol. 6762, pp. 523–532. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21605-3_57
© 2020 Springer Nature Switzerland AG
Ko, S. et al. (2020). The Effects of Robot Appearances, Voice Types, and Emotions on Emotion Perception Accuracy and Subjective Perception on Robots. In: Stephanidis, C., Kurosu, M., Degen, H., Reinerman-Jones, L. (eds) HCI International 2020 - Late Breaking Papers: Multimodality and Intelligence. HCII 2020. Lecture Notes in Computer Science(), vol 12424. Springer, Cham. https://doi.org/10.1007/978-3-030-60117-1_13
DOI: https://doi.org/10.1007/978-3-030-60117-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60116-4
Online ISBN: 978-3-030-60117-1