1 Introduction

Designers regularly use various robot characteristics, such as appearance, speech, and behavior, to inspire people to engage with robots as social actors. Robot appearance, including the use of familiar human- and animal-like forms, can visually represent a robot’s social capabilities. Similarly, the content of verbal utterances (e.g. saying “Hello”) can explicitly invite people to engage with the robot socially. More subtle cues to a robot’s social abilities can be provided through the nonverbal characteristics of the robot’s voice – implicit age, gender, emotional expression, markers of cultural origin, and individual vocal quirks can evoke a certain type of character and unique social presence for a robot.

Fig. 1. The social robot Haru.

Studies in social robotics and human-robot interaction have explored how various aspects of a robot’s voice can affect people’s perceptions of the robot’s capabilities, personality, and appropriateness for different tasks. People deem certain voices more or less appropriate for specific robots [15]. Robots with higher levels of vocal expressiveness are perceived as having more social presence [12]. The perceived age of a robot’s voice can affect people’s perceptions of its credibility and social presence [8]. Robots with child-like voices can be seen as more extroverted and relaxed [7]. Users also respond to the perceived gender of human-like robot voices, reporting more positive attitudes about robots with which they share a gender [9]. Additionally, the delivery of verbal content, as manipulated through voice pitch, empathy, and humor, has been found to affect not only people’s perceptions of and attitudes towards the robot, but also their enjoyment of the human-robot interaction task [16].

While prior research has identified particular aspects of robot voice that affect users’ interaction experiences and robot evaluations in short-term scenarios, long-term interaction will also require users to see the robot as a believable, relatable, and cohesive social agent. Based on the above-mentioned prior research, a combination of various vocal aspects (e.g. age, gender, vocal style/genre) will need to be employed to create a convincing social presence for robots. With this idea of long-term, companionable social interaction with robots in mind, this paper describes the process of designing an expressive voice for the Haru robot through the selection of diverse voice talent. It then explores how users evaluate different text-to-speech voices, meant to be used with the Haru social robot, in relation to their appropriateness for the robot, as well as their convincingness, expressiveness, and cohesiveness. The broader aim of this work is to inform the design of engaging and persuasive social characters for companion social robots.

As our target robot, we selected the social robot Haru [10, 11], shown in Fig. 1. Haru is an experimental tabletop robot for multimodal communication that uses verbal and non-verbal channels for interactions. Haru’s design is centered on its potential to communicate emotions through richness in expressivity [10]. Haru has five motion degrees of freedom (namely base rotation, neck leaning, eye stroke, eye rotation, and eye tilt) that allow it to perform expressive motions. Furthermore, each eye includes a 3-inch TFT screen on which the robot’s eyes are displayed. Inside the body there is an addressable LED matrix (the mouth). Haru can communicate via text-to-speech (TTS), albeit currently with off-the-shelf voices, as well as through animated routines, a projected screen, etc. Haru’s range of communicative strategies positions the robot as a potent embodied communication agent that can support long-term interaction with people.

This paper is organized as follows: in Sect. 2, we describe our iterative design process of refining our social robot’s personality description while auditioning voice talents; in Sect. 3, we present an elicitation survey that evaluates a select number of voice talent finalists; in Sect. 4, we discuss the findings of our survey; in Sect. 5, we outline relevant work in social robots; and, finally, in Sect. 6, we recap our findings and discuss future work.

Fig. 2. Iterative refinement design process for Haru’s personality and voice concept.

2 Design Process

One of our core research topics is the development of a long-term robotic companion, which can lead to the forging of a bond between a human and a social robot similar to the bond shared by other social creatures [11]. But this goal requires a persuasive character with rich expressivity that is beyond the capability of conventional TTS systems. We identify three characteristics that an ideal TTS voice for a social robot should have:

  • Convincingness. The voice should fit the robot’s character, physical appearance, and application scenarios.

  • Emotiveness. The voice should be capable of conveying a wide range of emotions and vocal delivery styles.

  • Consistency. Throughout its application, the voice should sound like it seamlessly belongs to a single entity.

Fig. 3. Haru’s personality bible.

We adopt a holistic approach to designing Haru’s personality and voice concept based on a process of iterative refinement, where updates to the personality definition feed into recruitment and evaluation of voice talent, and their evaluation informs refinement of the personality definitions. To aid us in organizing Haru’s personality definition, we borrow a practice from screenwriting and keep a personality bible [13, 14] for Haru, recording important personality traits, reference characters, and other relevant background information. The personality bible is a reference document for writers and engineers to check in order to keep Haru’s personality consistent. It outlines information about Haru’s background (his fear of social isolation and magnets, for example). It also outlines how Haru speaks (enthusiastically and informally). An excerpt is shown in Fig. 3. The iterative refinement process is shown in Fig. 2, and we describe it below.

Fig. 4. Haru’s self-introduction script.

  • Identify Reference Characters. Based on our existing vision of Haru’s personality, we brainstorm reference characters that effectively convey some aspect of Haru’s personality. Some examples include Prince Ali from Aladdin, Dory from Finding Dory, and Finn the Human from Adventure Time.

  • Extract personality traits. We summarize the relevant personality traits of the reference characters from Step 1 into keywords. We consider behavioral traits as well as vocal characteristics and speaking mannerisms. For example, Haru has the enthusiasm and empathy of Dory, the trusting and reassuring cadence of Prince Ali, and the energy and childish sense of wonder of Finn the Human.

  • Update Personality Bible. We update Haru’s personality bible with the reference characters from Step 1 and the relevant personality trait keywords from Step 2.

  • Recruit and Audition voice talents. We recruit voice talents online and audition them through interactive table reads using scenario scripts showcasing Haru’s personality. Voice talents are shown Haru’s personality bible and coached to convey our vision of Haru while encouraging creativity in their portrayal.

  • Select voice talents for Finals. Finally, we analyze the results of the audition in Step 4, considering both the quality of the voice talent’s portrayal and how Haru’s personality bible could be refined. If the voice talent is deemed satisfactory, we select them for the final evaluation. We then go back to Step 1 and repeat the process until we have gathered enough voice talents for the final evaluation.

Table 1. Voice talent search finalists selected for the elicitation study.

2.1 Voice Talent Search

We searched for playful, energetic, curious voices from adult males, adult females and young children. The search took place over several months. Roughly 30 voices were researched closely and 10 voice talents were recorded, of which three children and two women were actively considered. While a mastery of English was required, the talent was sourced from all over the world. Given the diversity of Haru’s intended audience, we wanted voice talent from diverse backgrounds to counteract any regional idiosyncrasies.

The desirability of the voice talent was measured across several criteria. The first was a youthful quality. This quality is impossible to achieve by simply raising the pitch of an adult voice. Having smaller vocal cords produces a slightly raspy, at times even nasal quality that is unique to young children. The second criterion was emotiveness. The voice talent had to be capable of conveying a slightly exaggerated degree of emotion. This exaggeration is important because much of the nuance of a performance is ‘lost in translation’ in the voice capture process. The third criterion was technical ability. The voice talent (VT) needed to be of a certain technical reading level to get through the material required of the voice capture process. Given that the VT may be a child, however, a certain amount of stumbling and coaching was expected. The fourth criterion was coachability. The VT needed to be able to take direction well in order to calibrate a performance correctly.

2.2 Audition Process

Each voice talent set aside an hour to go through specially designed audition scripts (see Fig. 4 for an example) that tested the talent’s range of emotion, technical reading ability, endurance, and ability to take direction. The auditions were interactive, with the writer playing Randy, the human, and the voice talent playing Haru. The audio of the dialogues was recorded for later comparison. The auditions took place over a video call, through which the writer was able to give directions to the talent, for example to smile during an upbeat performance or to slightly grit their teeth to convey seriousness. The writer often had to give a line reading for the talent to imitate. This also had the effect of keeping the performances relatively consistent across all the different auditions. Finalists are summarized in Table 1.

Table 2. Target characteristics in the voice talent elicitation study.

3 Elicitation Study

To evaluate the convincingness of our voice talent finalists, we conducted an online survey where participants evaluated them over a variety of characteristics.

3.1 Online Survey

To familiarize themselves with Haru’s appearance and personality, survey participants first watched a video of Haru non-verbally interacting with an off-screen human. Non-verbal interaction was selected to avoid preconceptions about Haru’s voice. Next, for each voice talent, participants listened to a short clip of the talent reading a short Haru self-introduction script designed to be representative of Haru’s personality and desired emotive range (see Fig. 4). Then, they were asked to rate each voice for a series of target characteristics on a scale of 1–5 (see Table 2). Participants also selected all emotions conveyed from a list of 8 emotions from Plutchik’s circumplex model [17]. Finally, they rated the overall suitability of the voice for Haru, and were asked for their free-form opinions on the appropriateness of the voice and about Haru. The survey was conducted over Google Forms, and a sample form with synthetic responses can be seen at this link.

Table 3. Average scores for each VT by characteristic. The highest score for each characteristic is shown in bold. Standard deviations are given in (parentheses).
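The per-characteristic summary reported in Table 3 (a mean with a standard deviation in parentheses, and the highest mean highlighted) can be illustrated with a minimal sketch. The ratings below are invented for illustration and are not study data; the nested dictionary layout, the function names `summarize` and `best_voice`, and the use of the sample standard deviation are all assumptions, since the paper does not state which estimator was used.

```python
from statistics import mean, stdev

# Hypothetical 1-5 ratings (NOT study data): ratings[voice][characteristic] -> scores.
ratings = {
    "A": {"expressiveness": [5, 4, 5, 4], "suitability": [4, 5, 4, 4]},
    "B": {"expressiveness": [3, 4, 3, 2], "suitability": [2, 3, 3, 2]},
}


def summarize(ratings):
    """Return {voice: {characteristic: (mean, sample standard deviation)}}."""
    return {
        voice: {name: (mean(scores), stdev(scores)) for name, scores in chars.items()}
        for voice, chars in ratings.items()
    }


def best_voice(summary, characteristic):
    """Return the voice with the highest mean score for one characteristic."""
    return max(summary, key=lambda voice: summary[voice][characteristic][0])


summary = summarize(ratings)
for voice, chars in summary.items():
    for name, (avg, sd) in chars.items():
        # Same layout as the table: mean followed by the SD in parentheses.
        print(f"VT {voice} {name}: {avg:.2f} ({sd:.2f})")
print("Highest expressiveness:", best_voice(summary, "expressiveness"))
```

Swapping `stdev` for `pstdev` would give the population standard deviation instead, should that be the intended convention.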

3.2 Protocol

We posted an advertisement and recruited participants on the US-based crowdsourcing platform Upwork over three days in July 2021. The platform allows interested workers to send a ‘proposal’ to the client who has posted the job. We recruited participants from those who submitted a proposal on a first-come, first-served basis, with preference given to users with higher ratings and consideration for geographical diversity. We initially collected survey responses from 61 participants, of which 57 were analyzed after filtering the data using a comprehension-check question that could only be answered correctly after watching a brief video to completion. The responses of any participant who answered incorrectly (n = 4) were discarded. The survey was expected to take approximately 30 min to complete, and those who completed the task were offered a fixed amount of $20 for their participation.
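The comprehension-check filtering step can be sketched as follows. This is an illustrative sketch only: the record layout, the field name `check_answer`, and the example answers are hypothetical and do not reflect the actual survey export.

```python
def filter_by_comprehension_check(responses, correct_answer):
    """Keep responses whose comprehension-check answer is correct.

    Returns the kept responses and the number of discarded ones.
    """
    kept = [r for r in responses if r.get("check_answer") == correct_answer]
    return kept, len(responses) - len(kept)


# Hypothetical records (NOT study data); "check_answer" is an assumed field name.
responses = [
    {"participant": "p1", "check_answer": "blue"},
    {"participant": "p2", "check_answer": "red"},
    {"participant": "p3", "check_answer": "blue"},
]

kept, discarded = filter_by_comprehension_check(responses, "blue")
print(f"analyzed: {len(kept)}, discarded: {discarded}")
```

Applied to the study's figures, 61 collected responses with 4 discarded leave the 57 that were analyzed.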

The demographics of the participants were as follows. We had more female participants (n = 34) than male participants (n = 23), and the most common age group was 18–30 (n = 37), followed by 30–40 (n = 16), 40–50 (n = 2), and above 50 (n = 2). In terms of geographic location, the 57 participants came from 27 different countries, with the largest number from Asia (n = 26), followed by Europe (n = 12), Africa (n = 12), North America (n = 4), and South America (n = 3).

We used Upwork to recruit participants for the following reasons: 1) access to a diverse population of participants [2, 4, 6], and 2) an expected level of response quality, based on the platform’s profile-oriented nature, which reveals the names and locations of the workers, and the mutual-rating system between clients and workers.

Table 4. VT preference orderings, where \(\gg\): \(p < 0.05\); \(>\): \(0.05 \le p \le 0.45\); and \(\approx\): \(p > 0.45\), as measured via a one-tailed t-test. Masculinity and femininity are derived from the gender scores.
Table 5. Select comments on voice talent suitability from study participants.
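The ordering notation defined in the Table 4 caption maps each pairwise p-value to a relation symbol. A minimal sketch of that mapping is given below; the function names `relation` and `format_ordering` and the example p-values are hypothetical, and the p-values are taken as given because the exact form of the t-test (paired or unpaired) is not stated here.

```python
def relation(p):
    """Map a p-value to the ordering symbol defined in the Table 4 caption."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p < 0.05:
        return "≫"  # statistically significant preference
    if p <= 0.45:
        return ">"  # weaker preference
    return "≈"      # no clear preference


def format_ordering(ranked, p_values):
    """Join voices ranked from most to least preferred with relation symbols.

    p_values[i] is the p-value for comparing ranked[i] against ranked[i + 1].
    """
    if len(p_values) != len(ranked) - 1:
        raise ValueError("need exactly one p-value per adjacent pair")
    parts = [ranked[0]]
    for voice, p in zip(ranked[1:], p_values):
        parts += [relation(p), voice]
    return " ".join(parts)


# Hypothetical p-values (NOT study results).
print(format_ordering(["A", "C", "B"], [0.01, 0.60]))
```

Only adjacent pairs in the ranking are compared in this sketch, which is one plausible way to produce a single annotated ordering per characteristic.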

4 Discussion

  • Results. Average scores for characteristics are summarized in Table 3. Voice talent preferences for each characteristic are given as orderings annotated with statistical significance in Table 4. Select comments on voice talent suitability are given in Table 5. Finally, recognized emotions are summarized in Table 6.

  • Characteristics. We find that with the exception of overall suitability and demographic-related characteristics (youthfulness, gender), all were ranked positively, confirming their importance.

  • Demographics. Survey participants ranked voices by youthfulness in the same order as the voice talents’ ages. Gender exhibited a similar trend: the single female voice talent was ranked as most feminine, followed by the child voice talent. This is understandable, as children have higher-pitched voices than adults, and are often perceived as more feminine. Likewise, the voices of the men over 40 were ranked as most masculine.

  • Emotions. Voices with higher acceptability tend to have more emotions detected, supporting our theory that expressive voices are preferred for social robots. We also note that overall positive emotions (e.g. surprise, joy, anticipation, trust) are recognized more than negative ones (e.g. anger, fear, sadness, disgust), although this may be due to the contents of Haru’s self-introduction script.

  • Overall. Survey participants overwhelmingly preferred Voice A to all others across all characteristics with statistical significance (see Footnote 1). This supports our intuition that young, energetic voices are ideal and provides confirmation of our design direction for Haru’s voice.

Table 6. Emotions recognized by study participants.
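The emotion tallies summarized in Table 6 come from a multi-select question, so each participant may contribute several emotions per voice. The following sketch shows one way such selections could be counted and split by valence. The example selections are invented, the function names are hypothetical, and the positive/negative grouping simply follows the examples given in the discussion above.

```python
from collections import Counter

# Valence grouping as listed in the discussion of recognized emotions.
POSITIVE = {"surprise", "joy", "anticipation", "trust"}
NEGATIVE = {"anger", "fear", "sadness", "disgust"}


def tally(selections):
    """Count how often each emotion was selected across participants."""
    counts = Counter()
    for chosen in selections:
        unknown = set(chosen) - POSITIVE - NEGATIVE
        if unknown:
            raise ValueError(f"unexpected emotions: {sorted(unknown)}")
        counts.update(set(chosen))  # each participant counts once per emotion
    return counts


def valence_totals(counts):
    """Return (positive selections, negative selections)."""
    positive = sum(counts[e] for e in POSITIVE)
    negative = sum(counts[e] for e in NEGATIVE)
    return positive, negative


# Hypothetical selections for one voice (NOT study data), one set per participant.
selections = [{"joy", "trust"}, {"joy", "fear"}, {"surprise"}]
counts = tally(selections)
print(counts.most_common())
print(valence_totals(counts))
```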

5 Related Research

Much research on expressive TTS has focused on evaluating emotion conveyance with robots or virtual avatars. Breazeal evaluated the expression of emotion in TTS for anthropomorphic robots with an analysis of vocal affect [5]. Tang et al. evaluated emotive TTS with a 3-D virtual avatar [19]. Roehling et al. present a summary of research on vocal correlates in expressive speech and apply the findings to examine available TTS systems for use in their robotic project. The authors suggest that factors including pitch, duration, loudness, spectral energy structure, and voice quality are crucial for expressive speech synthesis [18]. Barnes et al. [3] compared the effectiveness of human voices and synthesized voices for use with humanoid and dinosaur robots and found that monotone synthesized voices were unsuitable for emotion-rich interactions. Through an online evaluation of TTS for three social robots, Alonso-Martin et al. suggest a correlation between the intelligibility and expressiveness of TTS systems, as well as an inverse correlation between these two factors and artificiality [1].

6 Conclusion

In this paper, we described an iterative refinement process for developing a social robot’s personality while auditioning voice talents. Through this process, we selected five finalist voice talents and evaluated them through an online elicitation study. The preferences exhibited by participants toward young, energetic, and expressive voices provided supporting evidence for our design direction.

In future work, we plan to continue to refine our definition of Haru’s personality and to conduct a more detailed survey that includes evaluation of voice appropriateness given the context of specific applications for Haru. We will also limit the survey to the three regions where Haru is being considered for deployment (the US & Canada, Europe, and Japan) in order to account for differences in cultural perception. Finally, we plan to finalize our voice talent selection and carry out the development and evaluation of an expressive TTS for Haru.