
1 Introduction

Research on social robots has increased sharply. To design social robots, it is important to consider the key variables that influence robots’ sociability, trust, and acceptance. Previous studies of human-robot social interaction have shown that users expect natural and intuitive communication with robots [1,2,3]. Emotion is a critical component in enabling such natural and intuitive communication. To express emotions effectively and accurately, robots can utilize many different sensory cues. However, research has focused largely on facial expressions [4], and comprehensive work on how different factors jointly affect emotion perception is still rudimentary. In this paper, an exploratory experiment was conducted to study users’ perception of robots’ emotional states under mixed factors, including robot appearance, voice type, and emotion. Participants were asked to read fairy tales to robots and to determine the robots’ emotional states when the robots made a comment.

1.1 Related Work

Companion Robots and Natural Communication

The prospect of introducing companion robots into individuals’ daily lives has received significant attention from researchers and practitioners in the field of Human-Robot Interaction. The development and integration of effective robot companions hold tremendous promise, given robots’ potential to assist people or even perform tasks that exceed human capabilities.

Previous studies have indicated broad support for the concept of robot companions, with participants viewing the roles of assistant, machine, or servant as conforming to their expectations of a robot’s function [1]. Much greater support and confidence were expressed for robots charged with performing household tasks than for those dealing with children or animals [1]. A robot companion’s ability to communicate is as significant as establishing the social context under which such a companion should operate, and humanlike communication is a desired trait [1]. Studies have similarly shown that a robot’s natural language interface receives more attention than its functionality, suggesting that communicative behavior may be a more critical component of the system [3]. Proposed criteria for evaluating communicative behavior include a robot’s ability to detect communication partners and pay attention to them, as well as its comprehension of speech, gestures, and the surrounding environment so as to understand an assigned task [3]. Such criteria revolve around maximizing the social aptness of robot companions so that they may interact and carry out tasks in a natural way.

Expressing one’s own emotions and reading others’ emotions are also critical for facilitating this natural interaction. To express emotions effectively and accurately, a number of verbal (e.g., voice style, accent, gender, and affective prosody) and nonverbal (e.g., appearance, facial expression, gesture, and movement) cues can be used.

In the current study, we explored a scenario in which participants served as storytellers and the robots empathized with them emotionally and responded to the story. We specifically considered the robots’ ability to convey emotions, which is a critical part of human-like communication.

Form Factor of Robots

In robot design, there are two typical form factors: anthropomorphic and zoomorphic [5]. Each has unique characteristics and suits different tasks. In studies on robots’ form factor (or appearance) and users’ perception, anthropomorphic and zoomorphic robots were preferred over machine-like ones or imaginary creatures [7,8,9]. Anthropomorphic and zoomorphic robots may also suit different working scenarios. The more human-like a robot’s appearance, the more intelligent people think it is [10]. Also, during interaction with humans, anthropomorphic robots may convey emotional expressions more effectively because their appearance is similar to humans’ [5]. However, the influence of anthropomorphic robots’ facial expressions on users’ perceptions is sometimes controversial, possibly due to the Uncanny Valley [6]. In other words, whether a robot should resemble a human or an animal depends heavily on its task and on the level of intelligence it is meant to convey to users. Preference for an anthropomorphic or zoomorphic robot is influenced by many complex factors. In our work, we investigated the effects of voice types on each robot, applying both qualitative and quantitative measures and examining user perception from a broader perspective.

Robots’ Emotion Expression

In terms of emotional expression, emotional vocal expressions can effectively influence the behavior of perceivers [11]. Research explains a robot’s emotion expression process in relation to communication theory: 1) a robot’s internal state drives expressions, 2) specific robot behaviors are related to specific user reactions, and 3) the situation is an important driver of emotion expressions [12]. Emotion perception is an important source of information for theory of mind, and emotions can be perceived from facial expressions, voices, and whole-body movements [13]. As mentioned, emotion expression and emotion perception play a critical role in human-robot interaction and are widely studied across disciplines. However, previous studies have been dominated by robots’ facial emotions, and other modalities such as vocal and tactile processes have been considered less frequently [14, 15]. The present study focused on auditory stimuli, using emotive voices representing seven different emotions, and investigated differences in users’ emotion perception.

1.2 Research Questions

Against this background, we sought a deeper understanding of the effects of robots’ appearances, voices, and emotion types on users’ perception of robots and their emotions. More specifically, we were interested in the following research questions:

  • How do robot appearances, voices, emotion types, and their interactions influence people’s perception of robots’ emotional states?

  • How do robot appearances, voices, and their interactions influence people’s perception of robots’ characteristics?

  • How do robot appearances, voices, and their interactions influence people’s preference for robots?

To answer these research questions, we conducted a preliminary empirical experiment in which young adults (college students) interacted with two robots (human-like and animal-like) using four voice types (regular human, characterized human-like, characterized animal-like, and text-to-speech) and seven emotions (the six basic emotions plus anticipation). We collected participants’ emotion recognition accuracy and other subjective perceptions of the robots.

2 Method

2.1 Participants

Sixteen university students participated in the study (Age: M = 23.5, SD = 3.97). Six participants identified as male and ten as female. Participants were ethnically diverse (3 Asian, 2 Hispanic, 9 Caucasian, 1 Middle Eastern, and 1 African). The experiment lasted at most 2 h, and participants were compensated with $20 ($10 per hour). All participants agreed to participate after reviewing the consent form approved by the VT IRB.

2.2 Robotic Systems and Stimuli

Two robots, NAO and Pleo, with different appearances and features were employed in the experiment (Fig. 1). We used these two robots, representing an anthropomorphic robot and a zoomorphic robot respectively, to contrast the effects of robot appearance on people’s emotion perception. NAO is a small anthropomorphic robot (Height: 22.6 inch, Length: 10.8 inch, Width: 12.2 inch) resembling a human, and Pleo is a zoomorphic robot (Height: 8 inch, Length: 15 inch, Width: 4 inch) that looks like a small dinosaur. Both robots played recorded auditory feedback, consisting of emotive utterances, to participants following the storylines. Two different stories (“The Three Little Pigs” and “The Boy Who Cried Wolf”) were used in this experiment.

Fig. 1. Pictures of robots (NAO, Pleo)

Four voice types were created for the seven emotional expressions. We first distinguished between a synthesized voice (text-to-speech, or TTS) and a recorded human voice. The human voices were provided by two female native speakers in our research group, and all voices were spoken in American English with American accents. The recorded human voice was then subdivided into three categories: a regular voice and a characterized voice for each robot (i.e., a characterized NAO voice and a characterized Pleo voice). The characterized voices for NAO and Pleo were designed to exaggerate emotional expressions in keeping with the robots’ characters, as the speakers envisioned them from the robots’ appearances.

The TTS voices were generated using text-to-speech engines [16]. Microsoft’s female voice and the iOS female voice, provided by default with the respective operating systems, were used. These TTS voices carried no emotional information beyond the words themselves.
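For illustration, the sketch below shows one way such neutral TTS utterances could be generated offline with the pyttsx3 Python library, which wraps the platform’s default speech engine. This is an assumption for demonstration only; the stimuli in the study were produced with the default Microsoft and iOS voices and not necessarily in this way.

```python
# Minimal, illustrative sketch (not the stimulus-generation code used in the study):
# synthesize a neutral spoken comment with the platform's default TTS backend.
import pyttsx3

engine = pyttsx3.init()                # SAPI5 on Windows, NSSpeechSynthesizer on macOS
engine.setProperty("rate", 170)        # speaking rate in words per minute

# Hypothetical story comment; no emotional markup is added, so any emotion
# must be inferred from the words alone, as in the TTS condition above.
comment = "Oh no, the wolf is coming!"
engine.save_to_file(comment, "tts_comment.wav")
engine.runAndWait()
```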

Seven different emotions were presented throughout each story, including Ekman’s six basic emotions (anger, disgust, fear, happiness, sadness, and surprise), chosen for their prevalence in psychology. A seventh emotion, anticipation, was added for its similarity to fear and surprise [17]. Its inclusion gave us the opportunity to see whether participants could discern an emotion that is not traditionally regarded as basic and to gauge confusion between emotions with subtle differences. The seven emotions were woven into both stories (“The Three Little Pigs” and “The Boy Who Cried Wolf”) as depicted in Table 1.

Table 1. Dialogues in stories for presenting different emotions

2.3 Design and Procedure

A 2 (robots) × 4 (voice types) × 7 (emotions) within-subject design was applied, so that 8 combinations of robot and voice type were presented to each participant with all 7 emotions. The presentation order and the number of occurrences of each combination were counterbalanced such that 1) each combination was presented approximately equally often (about 20 times in total across participants) and 2) every level of each factor was presented at least once to each participant. Thus, each participant experienced all 8 robot-and-voice conditions and all 7 presented emotions. The 8 conditions were divided into two sessions of four, to help participants recall and compare the conditions within a session. In each condition, the participant was instructed to read the script in front of a robot and to listen to the robot’s emotional comments at various points in the story. The whole procedure and the experimental environment are depicted in Figs. 2, 3 and 4 below.
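As a rough illustration of this kind of counterbalancing (not the exact scheme used in the study), the sketch below enumerates the 8 robot × voice-type combinations and rotates their order per participant so that starting positions are balanced, splitting each participant’s conditions into two sessions of four.

```python
# Illustrative sketch of rotation-based counterbalancing over the 8 robot x voice conditions.
from itertools import product

robots = ["NAO", "Pleo"]
voices = ["regular", "characterized NAO", "characterized Pleo", "TTS"]
conditions = list(product(robots, voices))          # 8 robot x voice-type combinations

def presentation_order(participant_id: int):
    """Rotate the condition list so each starting position is used equally often."""
    k = participant_id % len(conditions)
    return conditions[k:] + conditions[:k]

for pid in range(16):                               # 16 participants, as in the study
    order = presentation_order(pid)
    session_1, session_2 = order[:4], order[4:]     # two sessions of four conditions
    print(pid, session_1, session_2)
```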

Fig. 2. The flow diagram of the procedure

Fig. 3. An example of part of the story the participant read (“The Boy Who Cried Wolf”)

Fig. 4. Experimental setting

The participants were asked to fill out several questionnaires: after listening to each comment generated by the robot, after finishing each full story, and after experiencing each set of four conditions. Specifically, after each response to the seven emotions, after each condition, and after each session, surveys measured the accuracy of emotion recognition as well as robot characteristics (Warmth, Honesty, Trustworthiness), naturalness (Natural, Human-like, Robot-like), and preferences (Likability, Attractiveness) for the presented emotions. The questionnaire consisted of open questions, seven-point Likert scales (1: Lowest, 7: Highest), and single-choice questions (Table 2).

Table 2. The list of questions and types in questionnaires

The order in which emotions were presented differed between the two stories, but the order within each story was fixed to maintain the storyline. To help generalize the results, we employed two stories presenting the same 7 emotions and two voice groups with the same characteristics, recorded by different female speakers and generated by two different female text-to-speech (TTS) engines. Examples of the presentation order are given in Table 3. To validate the equivalence of the two stories and two voice groups in accuracy, clarity, suitability, and preference, the results were compared (Table 4) and showed similar results in all categories.

Table 3. Examples of the presented order
Table 4. Accuracy, clarity, suitability, and preference over stories and voice groups

3 Results

3.1 Data Collection

The answers to open questions regarding emotions were interpreted by two examiners. Each examiner categorized every answer into one of the seven pre-defined emotions or marked it as ‘indistinguishable’ if it did not fall into any category. The two examiners worked independently, and the inter-rater reliability check showed that 87.8% (787/896) of the codings were consistent, with a high Cronbach’s alpha computed from variances (= 0.96). When the examiners’ interpretations differed, a third examiner reviewed the answer and decided which emotion it fell into.
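As an illustration of this reliability check, the sketch below computes the percentage agreement between the two examiners and, as a common chance-corrected alternative to the reported Cronbach’s alpha, Cohen’s kappa; the parallel label lists are hypothetical stand-ins for the actual codings.

```python
# Illustrative inter-rater agreement check on the two examiners' emotion codings.
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel codings; in the study there were 896 coded answers.
examiner_1 = ["anger", "sadness", "happiness", "indistinguishable"]
examiner_2 = ["anger", "happiness", "happiness", "indistinguishable"]

agreement = sum(a == b for a, b in zip(examiner_1, examiner_2)) / len(examiner_1)
kappa = cohen_kappa_score(examiner_1, examiner_2)
print(f"percent agreement = {agreement:.3f}, Cohen's kappa = {kappa:.3f}")
```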

3.2 Emotion Perception: Accuracy, Clarity, Suitability, and Features

First, the accuracy of emotion perception, defined as the proportion of correct emotion answers, was analyzed. Figure 5 and Table 5 show the statistics for accuracy across presented emotions, voice types, and robots. Among the presented emotions, anger, disgust, and fear showed significantly lower accuracies (below chance level) than the other emotions; we therefore removed these three emotions from further accuracy analyses. Results were analyzed with the aligned rank transform (ART) [43] for nonparametric factorial analysis, since there were three factors (Robots, Voice Types, and Emotions) and the dependent variable (1: correct, 0: wrong) was not normally distributed. The ART allowed us to analyze the aligned-and-ranked data with a 2 (Robots) × 4 (Voice Types) × 4 (Emotions) repeated measures analysis of variance (ANOVA) and to test all main and interaction effects.
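A minimal sketch of the rank-and-ANOVA step is shown below, assuming the responses have already been aligned for the effect under test (e.g., following the ART procedure) and stored in a long-format table; the file name and column names are hypothetical, and this is not the authors’ analysis code.

```python
# Illustrative sketch: rank aligned responses and run the repeated measures ANOVA.
import pandas as pd
from scipy.stats import rankdata
from statsmodels.stats.anova import AnovaRM

df = pd.read_csv("aligned_accuracy.csv")   # hypothetical: subject, robot, voice, emotion, aligned
df["ranked"] = rankdata(df["aligned"])     # rank the aligned responses

# AnovaRM expects one observation per subject x condition cell, so cell means are taken.
model = AnovaRM(df, depvar="ranked", subject="subject",
                within=["robot", "voice", "emotion"], aggregate_func="mean")
res = model.fit()
print(res.anova_table)
```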

Fig. 5. Accuracy of perceiving emotions over emotions and voice types

Table 5. Statistics for emotion perception (accuracy, clarity, suitability)

For accuracy, there was no significant difference between Nao (M = 57.1%, SD = 0.50) and Pleo (M = 53.6%, SD = 0.50). The analysis revealed a statistically significant difference across voice types, as well as a significant interaction effect between emotions and voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted; all pairwise comparisons applied a Bonferroni adjustment to control for Type-I error, using a more conservative critical alpha level of .0083 (0.05/6). Participants showed significantly lower accuracy with the TTS voice than with all three other voice types.
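The pairwise procedure can be sketched as follows, assuming a hypothetical wide-format table with one mean accuracy per participant and voice type; this is illustrative only.

```python
# Illustrative Bonferroni-corrected pairwise comparisons among the four voice types.
from itertools import combinations
import pandas as pd
from scipy.stats import ttest_rel

acc = pd.read_csv("accuracy_by_voice.csv")     # hypothetical columns: one per voice type
voice_cols = ["regular", "characterized_nao", "characterized_pleo", "tts"]
pairs = list(combinations(voice_cols, 2))      # 6 pairwise comparisons
alpha = 0.05 / len(pairs)                      # Bonferroni-adjusted critical alpha = .0083

for a, b in pairs:
    t, p = ttest_rel(acc[a], acc[b])
    verdict = "significant" if p < alpha else "n.s."
    print(f"{a} vs. {b}: t = {t:.2f}, p = {p:.4f} ({verdict} at alpha = {alpha:.4f})")
```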

Table 6 shows how participants misclassified emotions.

Table 6. The confusion matrix between presented and perceived emotions
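A confusion matrix of this kind can be tabulated as sketched below, assuming a hypothetical long-format file with one row per trial containing the presented and the perceived emotion.

```python
# Illustrative tabulation of a presented vs. perceived emotion confusion matrix.
import pandas as pd

trials = pd.read_csv("emotion_answers.csv")    # hypothetical columns: presented, perceived
confusion = pd.crosstab(trials["presented"], trials["perceived"], normalize="index")
print((confusion * 100).round(1))              # row-wise percentages, as in Table 6
```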

Second, the clarity and suitability of perceived emotions over robots, voice types, and presented emotions were analyzed, as shown in Fig. 6. Clarity and suitability were rated on a 1-to-7 Likert scale (1: Lowest, 7: Highest), and again only answers that correctly recognized the emotion were considered. Results were analyzed with a 2 (Robots) × 4 (Voice Types) × 7 (Emotions) repeated measures analysis of variance (ANOVA). For clarity, the analysis revealed statistically significant differences over robots, voice types, and presented emotions: Nao received a significantly higher clarity rating than Pleo. For the multiple comparisons among voice types, paired-samples t-tests were conducted (Table 5); TTS received a significantly lower clarity rating than the other three voice types. For the multiple comparisons among the seven emotions, paired-samples t-tests with a Bonferroni adjustment (critical alpha = .0023, i.e., 0.05/21) were conducted; surprise received a significantly higher clarity rating than disgust and happiness. For suitability, no significant difference was found between the robots; TTS received a significantly lower suitability rating than the other three voice types, and no other differences were found.

Fig. 6. The rating scores of clarity over robots, voice types, and emotions

Finally, the features participants used to perceive emotions were analyzed, with the results shown in Table 7. The answers were collected from an open question (“What characteristics of the voice brought to mind that emotion?”), and the number of occurrences of each word was counted; participants could give multiple answers for each comment. Emotions were most often perceived from tone (40.9% of mentions), followed by pitch (15.6%) and context (12.4%).

Table 7. Features participants reported using to perceive emotions
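The tallying of cue mentions can be sketched as follows, assuming the open answers have already been reduced to lists of cue terms; the example answers are hypothetical.

```python
# Illustrative count of cue terms mentioned for perceiving emotions.
from collections import Counter

answers = [["tone", "pitch"], ["tone"], ["context", "tone"], ["pitch"]]   # hypothetical
counts = Counter(term for answer in answers for term in answer)

total = sum(counts.values())
for cue, n in counts.most_common():
    print(f"{cue}: {n} mentions ({100 * n / total:.1f}%)")
```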

3.3 Characteristics: Warmth, Honesty, and Trustworthiness

Figure 7 and Table 8 show the trustworthiness ratings over voice types and robots. Results were analyzed with a 2 (Robots) × 4 (Voice Types) repeated measures analysis of variance (ANOVA). For robots, no significant differences were found in any of the three categories. The analysis revealed a statistically significant difference in trustworthiness among voice types, with no interaction effect between robots and voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted; TTS was rated significantly lower than the other three voice types. Warmth and honesty showed exactly the same pattern as trustworthiness (i.e., no differences other than voice type, with TTS rated significantly lower).

Fig. 7. The rating scores of trustworthiness over voice types

Table 8. Statistics for characteristics (trustworthiness)

3.4 Naturalness: Natural, Human-like, and Robot-like

Figure 8 and Table 9 show the “robot-like” ratings over voice types. Results were analyzed with a 2 (Robots) × 4 (Voice Types) repeated measures analysis of variance (ANOVA). For robots, no significant differences were found in any of the three categories. The analysis revealed a statistically significant difference in the “robot-like” ratings among voice types, with no interaction effect between robots and voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted. Participants gave significantly higher robot-like ratings to TTS than to all three other voice types. In addition, the characterized Pleo voice received a significantly higher robot-like rating than the regular voice. Natural and human-like showed exactly the opposite pattern to robot-like (i.e., TTS was rated significantly lower than the others).

Fig. 8. The rating scores of robot-like over voice types

Table 9. Statistics for naturalness (natural, human-like, robot-like)

3.5 Preferences: Likability and Attractiveness

Figures 9 and 10 and Table 10 show the likability and attractiveness ratings over robots and voice types. Results were analyzed with a 2 (Robots) × 4 (Voice Types) repeated measures analysis of variance (ANOVA). For likability, participants gave significantly higher ratings to the characterized Nao voice than to the characterized Pleo voice. The analysis also revealed a statistically significant difference in likability over voice types, with no interaction effect between robots and voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted; participants gave significantly lower likability ratings to TTS than to all three other voice types. For attractiveness, the analysis revealed a statistically significant difference over voice types. For the multiple comparisons among voice types, paired-samples t-tests were conducted; participants gave significantly lower attractiveness ratings to TTS than to all three other voice types, and significantly lower ratings to the characterized Pleo voice than to either the characterized NAO voice or the regular voice.

Fig. 9. The rating scores of likability over robots and voice types

Fig. 10. The rating scores of attractiveness over voice types

Table 10. Statistics for preferences (likability, attractiveness)

4 Discussion

To obtain a holistic picture of the effects of robot appearances, voices, and emotion types on users’ perception of robots’ emotions and characteristics, we conducted a preliminary study. Overall, the results showed that the effects of voice type (human vs. TTS) appear to be larger than those of robot appearance on multiple dependent variables.

For emotion recognition accuracy, robot appearance made no significant difference between the anthropomorphic (Nao) and zoomorphic (Pleo) robots. As expected, TTS showed significantly lower emotion recognition accuracy than the three human voice types; however, there were no accuracy differences among the three human voice types (characterized Nao, characterized Pleo, and regular), nor any differences among them for clarity and suitability. Taken together, this might imply the potential for using a characterized voice for different purposes where appropriate (e.g., for children) without degrading emotion recognition accuracy, as long as it is a human voice. However, the results also show that emotion recognition accuracy varies significantly depending on the expressed emotion. As shown in Fig. 5, happiness, sadness, and surprise showed relatively higher accuracy than anger, disgust, and fear, with anticipation in between. This might be because happiness, sadness, and surprise are emotional states participants more commonly expect in fairy tales, whereas anger, disgust, and fear are all negative, high-arousal emotions that participants might not expect from fairy tales. However, the relationship between accuracy and each emotion found in the present study is not in line with the results of previous work [e.g., 18]. The difference might stem from different experimental settings (e.g., emotional words, prosody, context given by fairy tales, etc.), so more iterative research is required to unpack the underlying mechanisms. For the misclassified emotions, valence appeared to have a large impact: based on the confusion matrix, anger (negative) was most often misclassified as sadness (negative) (39.8%), and anticipation (positive) was most often misclassified as happiness (positive) (35.9%). Based on participants’ self-reports, most emotions were perceived from tone (40.9%), pitch (15.6%), and context (12.4%), suggesting that affective prosody is more critical than the content itself.

For robot characteristics, there was no statistically significant difference between the two robots, but there were differences between all human voices and TTS. We can cautiously infer that people did not perceive differences among the regular, characterized Nao, and characterized Pleo voices in terms of warmth, honesty, and trustworthiness.

Similarly, for naturalness, there was no statistically significant difference between the two robots, even though participants consistently tended to rate Nao as more natural (M = 4.39, SD = 2.18 vs. M = 3.88, SD = 1.93), more human-like (M = 4.73, SD = 2.07 vs. M = 4.20, SD = 1.90), and less robot-like (M = 3.42, SD = 2.22 vs. M = 3.87, SD = 2.16) than Pleo. As expected, there were significant differences in these ratings between all human voices and TTS.

Finally, of the two robots, participants liked Nao significantly more than Pleo. They gave the highest likability rating to the characterized Nao voice, followed by the regular voice, the characterized Pleo voice, and TTS, although only TTS differed significantly from the other voice types. Participants also gave higher attractiveness ratings to Nao (M = 4.25, SD = 1.96) than to Pleo (M = 3.83, SD = 1.82), although this did not reach statistical significance, likely due to large variance. All three human voices were rated significantly more attractive than TTS, and both the characterized Nao and regular voices were rated significantly more attractive than the characterized Pleo voice. Again, this suggests the potential of a characterized voice, at least for anthropomorphic robots.

This exploratory study can provide practical guidelines for the voice design of various robots and for further research. Participants generally seemed to prefer the anthropomorphic robot over the zoomorphic one, which is in line with the literature [8]. However, using either a characterized or a regular human voice influenced neither people’s emotion recognition nor their perception of robot characteristics such as warmth, honesty, and trustworthiness, or of naturalness. Therefore, this study supports using the human voice as a medium for expressing robots’ emotions, with the specific voice design chosen according to the users, goals, and context.

5 Limitations and Future Work

The results of this experiment are limited by several factors. First, only female voices were used; results might differ with voices of other genders. Second, the sample size was small and not sufficient to draw firm conclusions; due to COVID-19, the experiment could not be run with as many participants as planned. In future work, a larger and more diverse sample should be recruited to generalize the results. Another limitation is that the emotion recognition questionnaire was open-ended, which caused considerable confusion and lowered the accuracy rate; in future work, a questionnaire with specific emotion options could be provided alongside an open-ended field. Finally, the robots’ different speaker systems might also have influenced the results (e.g., clarity) and should be addressed in the next study.