1 Introduction

In Human-Robot Interaction (HRI), the focus is on making robots learn to react to users in a social and engaging manner [1]. Such social robots are used for various applications such as education (e.g. [2]), passenger guidance (e.g. [3]) and healthcare (e.g. [1, 4, 5]). Healthcare robotics is the focus of this research study. The healthcare robot (Healthbots project [6]) [7, 8] is an application of human-robot interaction under development at the Centre for Automation and Robotic Engineering Science, the University of Auckland, New Zealand. This project aims to develop social robots that provide support and care to people living in nursing homes. The role of these Healthbots will be to assist the medical staff in aged-care facilities by being companions to the aged people [9]. Currently, the technology is undergoing additional field trials in realistic environments and commercialisation [6]. This paper describes the journey towards developing an empathetic voice for Healthbots. The next two sections explain the motivation for developing an empathetic voice (Sect. 2) and the role of empathy in social robot applications (Sect. 3). Sect. 4 then describes a pilot study conducted to understand whether people prefer an empathetic voice in healthcare robots, and Sect. 5 presents an emotion analysis of the robot's dialogues to identify the emotions required for an empathetic voice. Further, emotional speech synthesis (Sect. 6) and a second experiment (Sect. 7) evaluating the acceptance of the synthesised empathetic voice are described in detail. Section 8 concludes the paper.

2 Motivation: Acceptance of Social Robots

In this section, the motivation for building empathetically speaking social robots is discussed. How humans interact with robots in social situations and the impact that the robot's voice has on their acceptance are discussed in detail. Evidence from past research is used to emphasise the importance of the robot's synthesised voice for its acceptance by humans. Robots that act as companions in social situations are novel to the people who use them, because people have very few preconceptions about their attributes and behaviour. People rationalise this novelty by projecting familiar human-like characteristics, emotions and behaviour onto them [10]. This behaviour is called AnthropomorphismFootnote 1. First, the general factors that improve social robots' acceptance are discussed, leading to the impact of the robot's voice on acceptance.

2.1 Acceptance of Social Robots

When robots serve as companions in social situations, their acceptance among users is a primary design consideration. Some factors that enhance robots' acceptance are their appearance, humanness, personality, expressiveness and adaptability [7]. Many studies have looked into the factors that need to be addressed in order to improve the acceptance of social robots, with a prime focus on the elderlyFootnote 2 [7, 11, 12]. Previous studies have observed that people anthropomorphise robots [10, 13]. A study by Heerink et al. defines the social abilities that humans expect from robots [14]. The results of that study, which identifies the factors that encourage older adults to accept robots, are summarised as the Almere Model [11]. Only the results leading to the development of the Almere model are discussed here, as the model was primarily developed from studies on elderly users of healthcare robots, which is the application considered in this study. These results also lead directly to the relevance of voice type for the user's acceptance of social robots, and they provide a clear indication that humans tend to anthropomorphise robots by expecting social abilities from them. According to the study, the social abilities that humans expect from social robots are that they should: cooperate, express empathy, show assertiveness, exhibit self-control, show responsibility and competence, and gain trust.

Now, the big question is: how can these social abilities be embedded in social robots? The aforementioned social abilities can be expressed during the scenarios in which a human and a robot interact through some means of communication. This communication occurs in multiple modalities, principally the auditory and visual mediums [13]. For the Healthbots considered in this study, the communication is visual, through the information displayed on a screen, and verbal, using the spoken dialogues of the robot. As verbal communication is used by the robot to interact with humans, the synthetic voice of the robot plays a significant role in determining how people anthropomorphise the robot, which in turn impacts the robot's acceptance.

2.2 Impact of Robotic Speech on Anthropomorphism

Speech is a primary mode of communication between robots and humans. People's anthropomorphism of robots is influenced by the type of synthetic voice the robots use to converse, and this also affects the robots' acceptance. To support this statement with evidence from the literature, a summary of studies on the relation between synthesised speech and robot acceptance is presented here. Research on robot voices builds on various studies of the impact that speech from artificial agents has on anthropomorphism and acceptance [15]. Examples of social robots that use synthetic speech to interact with people are Kismet [16], a storyteller robot [17] and reception robots [18].

Experiments reported in [19] indicate that people make judgements about robots' personalities based on their voice. A study reported in 2003 [13] reviews other studies showing the impact of a robot's voice/speech on people's judgements of the robot's perceived intelligence, and attributes this perception of intelligence to the same processes underlying the judgements humans make about other humans. Hence it can be considered a direct effect of anthropomorphism of the social robot driven by speech. During the same period, Goetz et al. [20] experimented on how people's cooperation with a robot varied depending on the speaking style of the robot (synthesised speech) when it was instructing a team to complete a task. One team performed the task instructed by a playfully speaking robot; a different team performed the same task instructed by a robot with a neutral voice. The team that did the task under the playful robotic voice performed better than the other team, which suggests that humans can be motivated by a robotic voice, even though the voice is synthesised. Another experiment, conducted in 2006, investigated the difference when affectFootnote 3 was added to the robot's synthesised speech [21]. Here, a robot guided people to complete a task. In one case, the robot expressed urgency to motivate people to complete the given task; in the other case, the robot spoke with a robotic voice, without motivating the people. The study observed that the team guided by the expressive robotic voice performed better than the team guided by the neutral voiceFootnote 4. In another experiment, conducted in 2012, users listened to a human-like voice type and a robot-like voice type, both spoken by the robot Flobi using its synthesised voice. The choice of vocal cues used to produce the robot-like and human-like voices was based on pre-tests regarding human-likeness vs robot-likeness. The acceptance of the robot with the human-like voice was better than with the robot-like voice [22], although the specifics of the vocal cues are not reported in the paper.

Based on the studies discussed in the above paragraph, it can be seen that people anthropomorphise robotic speech, i.e., they associate human attributes with the robot based on its voice, even though that voice is synthesised. This is evident in the way the robot's acceptance improved with a change in voice and how people performed better when the robot spoke expressively. Having established that the synthesised voice of a social robot is a key factor in its acceptance, it is then necessary to decide what type of voice is suitable for social robots. In the next section, the type of expressive voice needed for healthcare robots is identified based on past studies and perception experiments.

3 Empathy and Emotions Needed in Healthcare Robots

Recently, it was observed that roboticists build robots in anthropomorphic form to improve their acceptance through embodied cognition, but users are disappointed by the lack of reciprocal empathyFootnote 5 from these robots [23]. Because empathy has not been defined for human-robot interaction, the tendency for humans to anthropomorphise robots is used as the key to deriving a definition in [24]. Empathy in human-human interaction is the behaviour that enables one human to experience what another human feels and respond to it; it is an emotional response that is automatically evoked by one's understanding of the other human [25]. When the companion is a robot, empathy in human-robot interaction can be defined as the programmed affective reaction of the robot to the behaviour of the human that it can sense, given the technology embedded in it. It is also called artificial empathy in human-robot interaction studies [24, 26].

3.1 Prosody Component for Empathy Portrayal

Humans portray empathy through various communication modalities, such as facial and vocal (non-verbal and verbal) channels [23], and these modalities also exist for robots. The focus of this research study is on using speech to express empathy. Speech has two components [27] pertinent to empathy:

  1. 1.

    The verbal component, which focuses on the words alone.

  2. 2.

    The prosody component, which can be thought of as the melody and rhythm of speech. Emotions are expressed by variations in the prosody component (such as varying intonation, speech rate and stress) [27, 28]. This prosody component refers to affective prosody.

Empathetic behaviour can be conveyed through speech by a proper choice of words, which is the verbal component, and by the emotions portrayed by the speaker, which is the prosody component. The choice of words determines the lexical features, which contribute to the verbal component; the emotions govern the acoustic features [29] contributing to the prosody component. Often, empathy is incorporated into synthesised speech by including words that convey an affective response (called dialogue modelling). As stated in [30], a robot nurse assistant should be able to greet people, sound happy when informing patients of good results and express sympathy or encouragement when the test results are not satisfactory. A combination of the speech and visual channels of the robot can thus be used to impart empathy; this research focuses on speech alone. Empathy is communicated more via the non-linguistic channel, as stated by [31]. The same study also cites research indicating that a speaker's emotional state can be expressed without the use of words and understood just by listening to the speaker's voice. Hence, along with words that convey empathy, the emotions with which those words are spoken play an essential role in making the listener understand the speaker's empathy towards them. In this study, the aim is to express empathy in synthesised speech for a healthcare robot by:

  1. 1.

    Using the speech alone as the medium for communication between the social robot and the human user.

  2. 2.

    Modelling the prosody component of speech to express empathy.

3.2 Empathy and Emotional Expression in Robotic Speech

Empathy in social robots is a relatively new research area. To date, there are only a few published research studies, and the major findings are discussed here. One study (in 2005) in this area of empathetic social robots [19] showed that robots with empathy received positive ratings for likeability and trustworthiness, and were also perceived as supportive. Further, a study in 2013 showed that robots with empathy reduce frustration and stress among users, as well as improve the users' comfort, satisfaction and performance on a set task [32]. Finally, in 2018, James et al. [24] reported that the positive effects of empathy are produced in users only when the robot's expressions are in congruence with the users' affective state.

Motivated by these research findings, good modelling of empathy and emotional expression is required when building such robots. This modelling avoids a potential mismatch between the users' expectation of the robot's emotion (based on the application) and the actual emotions expressed by the robot, which can otherwise lead to a negative effect. Studies on empathy in healthcare robots are also limited. The study reported in [24] explored empathetic healthcare robots and people's preference for them. Also, [32] addresses the benefits of an empathetic voice (among other voice attributes like pitch and humour) in improving users' ease of interaction with the robot, while stating direct advantages for healthcare robot applications.

Currently, the Healthbots used in this research incorporate a New Zealand English voice, and pilot studies were conducted with regard to the naturalness of the voice [8, 33]. A voice that has no emotional expression can be called a "neutral" voice. It was noted repeatedly that familiarity with the voice and closeness to human-like speech improve the positive attitude towards robots. This positive attitude, in turn, improves the acceptance of the robot. Also, the acceptance level can increase after meeting the robot assistant [8], and if the robot speaks with a local accent [34]. However, the age and native language of the user can impact the perceived intelligibility of the robot voice. Watson et al. [33] found that non-native listeners performed significantly worse than native listeners in a synthetic speech condition. The authors report that the in-depth language model that native speakers have helped them parse the synthetic speech better than the non-native speakers.

The studies discussed here show that a familiar language and factors such as speaking style (including a local accent), emotional expression and empathy are critical to improving the acceptance of robots. As humans anthropomorphise robots, an empathetically interacting robot is expected to increase the level of acceptance of social robots, based on the evidence presented in Sects. 2 and 3. To test whether these findings apply to the healthcare robots used in this study, a large-scale perception experiment was conducted.

4 Study 1 (Pilot)

Footnote 6 This study (discussed in detail in [24]) involves a perception experiment to evaluate whether human subjects perceive empathy in robot speech. For this experiment, empathy is expressed through speech, with prosody being varied with the relevant melody and rhythm; i.e., by adding appropriate emotions to the words in the speech of the robot. A perception test was conducted to address the following research questions:

  1. 1.

    Research question I: Can people perceive empathetic behaviour from a robot when only the emotions in its speech are used to express empathy?

  2. 2.

    Research question II: Do people prefer empathetic voice from robots or a non-empathetic robotic voice?

  3. 3.

    Research question III: What factors of speech can be related to an empathetic voice?

The robot used for the study is the Healthbot. There are three different situations in which the robot speaks to the patient - (1) greeting the user, (2) providing medicine reminders, (3) guiding the user to use the touch interface.

Dialogues were framed for each of these situations, and included dialogues already used by the Healthbots (more details about the dialogues are given in Sect. 5). Each situation had 20-25 dialogues. A professional voice artist produced the dialogues in two variations.

  1. 1.

    One variation used a monotone voice with no variation in prosody features such as intonation and intensity. This voice will be referred to as the robotic voice here.

  2. 2.

    The second variation was spoken like a nurse speaking empathetically to a patient, with changes in emotions. This voice will be referred to as the empathetic voice here.

A professional voice artist was used instead of a synthesised voice because the synthesised voices currently used in the Healthbot lack naturalness and quality, as they are still under development. Also, empathetic synthesised voices could not be created at the time of the study; indeed, one of the points of the study was to ascertain what types of emotions are required for an empathetic voice. This pilot study was designed to understand what type of voice participants prefer. If the empathetic voice were natural-sounding and the robotic voice synthesised, participants might be biased towards the more natural-sounding voice. This bias needed to be avoided, and hence acted-out voices were used in both cases.

Fig. 1
figure 1

The participant taking part in the study watching the Healthbot talking

An illustration of a participant taking the test while watching the Healthbot talking is shown in Fig. 1. A link to the online survey is provided hereFootnote 7, where the video of the robot with the different voices can be seen. In total, 120 participants aged 16-65 (age distribution shown in Fig. 2) completed study 1, the majority of them from the age group 26-35. Based on their self-reporting, all participants had above-average hearing ability; 50 participants were first-language New Zealand English speakers (L1)Footnote 8 and 70 were bilingual speakers (L2). The participants could choose to use headphones or loudspeakers according to their convenience: 20% used loudspeakers and the remaining 80% used headphones. Each participant took approximately 15 minutes for the test. The online survey platform QualtricsFootnote 9 was used. No restriction was placed on recruiting participants other than a minimum age of 16. Such generalised participation was chosen because the Healthbots will be used in applications where the users may not have any knowledge of robotics. The participants went through three parts of the survey, one for each of the research questions.

Fig. 2
figure 2

The distribution of age groups in pilot study and study 2

4.1 Addressing Research Question I - Pilot

4.1.1 Design

To address Research question I, both voice variations spoken by the actor used the same words, with variation only in the prosody component. The robot had a neutral facial expression. The patient's dialogue was spoken by a speaker in the same manner regardless of the variation in the robot's voice. An example of a Healthbot dialogue was "It seems like you are taking a long time to take your medicine.", to which the patient responds "I am lately very slow in all the tasks I do!". The participants could see and hear the Healthbot speaking (as shown in Fig. 1), but the patient speaking to the robot was not shown to them. The patient was not shown so that the participants could feel as if the robot were speaking to them, and hence rate the robot's interaction with them. Each participant was given one scenario (greetings, reminders or instructions) with the two variations (robotic voice and empathetic voice). Both voice variations were shown to each participant one after the other, with the empathetic voice first, followed by the robotic voice. After seeing each scenario, the participants had to rate the voice on an empathy scale.

The questions asked for Research question I were based on the empathy measuring scale from the Motivational Interviewing Treatment Integrity (MITI) module [35] used for human-human interaction, which was extended to human-robot interaction in [24]. The MITI module defines five scales to rate a clinician's empathy. The 5-point scale used in MITI and in the experiment is shown in Table 5 in the Appendix and in Table 1 of [24]; a score of 1 represents the least empathy according to the MITI scale. The dialogues were not randomised, as they were framed as a conversation between the robot and the patient. First, the participants saw and heard the robot speaking with the empathetic voice and were asked to rate the voice on the scale. They then listened to the robotic voice and rated it on the same scale. A within-participants design was used here because the difference between the two voice types may not be captured if the same person does not hear both voice types; the robotic voice might also be perceived as empathetic (although at lower levels) if heard separately.

Fig. 3
figure 3

Participants’ rating of the two voice types. The blue bar (bottom bar of every pair) represents the empathetic voice, and the yellow bar (top bar of every pair) represents the robotic voice. (Color figure online)

4.1.2 Results

Figure 3 shows the empathetic behaviour rating given by the participants to the two voices based on the empathy rating scale. The bar chart shows the percentage of participants who have chosen a particular empathetic behaviour scale (1 to 5) for the robotic voice and empathetic voice.

Approximately 85% of the participants (the sum of the responses for scales 4 and 5, 42.7% + 42.7%, indicating high levels of perceived empathy) felt that the robot with the empathetic voice showed great interest in the patient and tried to engage with them. Half of this group felt that the robot responded well, while the other half felt that it could do better. The authors believe that the reason people felt the robot could do better may be related to people's inhibition that a robot cannot feel the patient's situation; instead, it is just programmed to respond accordingly. Conversely, 75% of the participants (the sum of the responses for scales 1 and 2, 38.7% + 36.3%, indicating low levels of perceived empathy) felt that the robot with the robotic voice had little interest in the patient. Curiously, two participants (1.6%) felt that the robotic voice still showed a high level of empathy. As the empathy rating decreases from 3 to 1, fewer than 15% of the participants gave such low ratings to the empathetic voice, whereas for the robotic voice most participants gave a rating of 1 or 2. This suggests that robotic speech with appropriate words alone is not sufficient for people to perceive empathetic behaviour from the robot.

Table 1 Statistical analysis results of robotic voice and empathetic voice in study 1 (Pilot) and study 2

4.1.3 Statistical Analysis

Because the data are skewed for both voice types, a Wilcoxon signed-rank test was conducted to assess the difference between the robotic voice and the empathetic voice. The empathy scale ratings (1 to 5) given by the participants were used for the analysis. The results (shown in Table 1, Row 3) indicate that the empathetic voice ratings (Median = 4) are significantly higher than the robotic voice ratings (Median = 2), \(p < 0.001\), \(r = 0.7\). An effect size of \(r = 0.7\) indicates a large effect according to Cohen's benchmarks for effect sizes [36]. Hence, it can be summarised that the empathetic voice received higher ratings than the robotic voice, and the result is statistically significant.
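
As a reproducibility aid, the following is a minimal Python sketch of how such an analysis can be run with SciPy, assuming the paired ratings for the two voice types are available as two arrays. The ratings below are randomly generated placeholders, not the study data, and the effect size is recovered from the normal approximation \(r = Z/\sqrt{N}\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder paired ratings (1-5) for illustration only; in the study these are
# the 120 participants' empathy-scale responses for each voice type.
robotic_ratings = rng.integers(1, 4, size=120)
empathetic_ratings = rng.integers(3, 6, size=120)

# Wilcoxon signed-rank test on the paired differences.
w_stat, p_value = stats.wilcoxon(robotic_ratings, empathetic_ratings)

# Effect size r = Z / sqrt(N), with Z recovered from the two-sided p-value via
# the normal approximation (Cohen: ~0.1 small, ~0.3 medium, ~0.5 large).
z = stats.norm.isf(p_value / 2)
r = z / np.sqrt(len(robotic_ratings))

print(f"W = {w_stat:.1f}, p = {p_value:.3g}, r = {r:.2f}")
print("medians:", np.median(robotic_ratings), np.median(empathetic_ratings))
```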

4.2 Addressing Research Question II - Pilot

4.2.1 Design

To evaluate Research question II, both voice variations were shown to each participant, who was then asked to judge which voice they preferred. The dialogues lasted approximately 1-2 min for each variation.

4.2.2 Results

In total, 113 of the 120 participants (about 95%) preferred the empathetic voice over the robotic voice, which is a robust result.

4.3 Addressing Research Question III - Pilot

4.3.1 Design

To evaluate Research question III, participants were asked their reasons (free-response and forced-response) for choosing their preferred voice out of the robotic voice and the empathetic voice. The reasons the participants could choose from are listed on the left of Fig. 4. Each dialogue spoken by the robot was listed, and the emotion/feeling/tone that a patient might expect from a nurse speaking it was associated as a label with that dialogue. For example, the dialogue "It seems like you are taking a long time to take your medicine" would be expected to be said with concern and empathy. The first author assigned such labels to each dialogue, and the options given to the participants were based on these labels. The participants were also asked which emotions they could feel when listening to each of the voices, from the options angry, happy, sad, excited, concerned, anxious, encouraging, assertive, apologetic and other. The first four emotions are primary emotionsFootnote 10 (excluding neutral) and the rest are words indicating secondary emotions.

Fig. 4
figure 4

Participants’ reasons (forced-responses) for choosing the empathetic voice (in blue) and not preferring the robotic voice (yellow)

4.3.2 Results

The forced-response reasons for choosing the empathetic voice (blue) and not preferring the robotic voice (yellow) are given in Fig. 4, along with the percentage of participants who selected each reason. (There were only a few free responses, and no common theme emerged from them.) The most influential factors for preferring the empathetic voice were the tone and emotions in the voice, closely followed by friendliness in the voice. People could also perceive empathy, concern and encouragement in the voice, which contributed to their choice. Looking at the reasons for not choosing the robotic voice, the lack of emotions and the monotony of the voice were the most influential factors (the number of participants who chose these reasons is higher than for the rest), followed by the lack of encouragement and concern in the voice. It is important to restate here that both voices had the same verbal content. This content contained words that portrayed encouragement or concern (for example, "Oh dear! Exercising regularly is very crucial for you" expresses concern in the words and was used for both the empathetic voice and the robotic voice). It was only when these words showing active engagement were spoken expressively, with the appropriate emotions, that participants could perceive the empathetic behaviour of the robot. This suggests that for empathetic artificial agents, the interaction via speech plays a role in influencing people's perception of robots. Even though other modalities like facial expressions are under research, speech synthesis needs to be developed to express more human-like empathy. Communication via speech comprises dialogue modelling along with the synthesis of the required emotions. From this test, it is also evident that proper dialogue modelling alone is not enough; participants perceived higher empathetic behaviour only from the voice where the emotions matched the dialogues spoken by the robot.

Responses to the emotions that participants could perceive from the empathetic voice are consolidated in Fig. 5. The responses that came under other were warm, friendly and engaging. Only the participants who preferred the empathetic voice were required to provide this response, and each participant could provide multiple responses. Here, it can be seen that the emotions perceived by the participants in the empathetic voice are secondary emotions. This indicates that the synthetic voice spoken by the social robot should also be modelled to speak with a selection of secondary emotions.

The conclusions from the pilot study were:

Fig. 5
figure 5

Participants’ responses for the emotions they could perceive from the empathetic voice (pilot study)

Participants can perceive empathy from robots when empathy is portrayed by speech using variations in the prosody component. When the prosody variation is absent in speech (i.e. only the words in the sentences expressed empathy), participants perceived lower levels of empathy.

Participants prefer an empathetic voice from a robotic companion compared to a robotic voice (non-emotional) in a healthcare application.

The main factors that contributed to people’s reason to prefer the empathetic voice are the emotions in the voice and the variations in prosody.

The prosody component needs to be in alignment with the verbal component so that people can perceive empathy from a social robot. In order to correctly model the prosody component, the emotions needed for an empathetic voice and the acoustic features to model them need to be identified. The next section explains this in detail.

5 Emotion Analysis of Social Robot

From the pilot study, it was found that the addition of emotions to the verbal component is essential for people's perception of empathy in robotic speech. To synthesise empathetic speech, proper modelling of emotions is essential to enhance the verbal component. To identify the emotions associated with an empathetic voice, an emotion analysis of the Healthbot was done as described in [24]. Defining an emotional range that can be called "empathetic" was the focus of that study, and is also a prerequisite for synthesising an empathetic voice. Each dialogue spoken by the robot in the empathetic voice was perceptually analysed and marked on the valence-arousal plane to identify the emotional range (details are provided in [24]). This analysis was independent of the responses provided by the participants in the pilot study. Based on the analysis, the emotions needed for an empathetic healthcare robot were identified as: anxious, apologetic, confident, enthusiastic and worried.

The dialogues designed for the Healthbot require emotions that do not fall under the primary emotion categories (marked as green "+" in Fig. 6) but are rather variants of them, namely the secondary emotions (marked as blue "*"). Many studies, including [30, 41,42,43,44,45], have focused on social robots capable of synthesising speech with the primary emotions. As important as these primary emotions are in real life, this study of the dialogues suggests that synthesising these nuanced secondary emotions is essential for an empathetic robot voice.

Fig. 6
figure 6

Valence-arousal plane of emotions showing primary emotions defined by Ekman [37] (marked in green +) and the emotions identified for the healthcare robot based on [24] (marked in blue *). Adapted from [46]

5.1 Discussion Based on the Pilot Study and Emotions for Empathetic Robot

We believe that to improve HRI interactions, synthetic speech needs the capacity to emulate secondary emotions in addition to the primary emotions. This requires knowledge of the acoustic features of these emotions. However, in contrast to the primary emotions, there are very few studies on the acoustics of these secondary emotions. Further, for the specific set of secondary emotions required for our Healthbots, there were no existing resources from which to obtain the acoustic features. To that end, an emotional corpus that includes these secondary emotions needed to be developed.

It is not possible to use existing speech corpora of the primary emotions to determine the acoustic features of the secondary emotions. The primary emotions are well separated in the valence-arousal plane, whereas an inspection of the positions of the secondary emotions (Fig. 6) shows that they are not well separated. This will be a significant challenge when trying to model and synthesise these secondary emotions; hence there is a need for a purpose-built speech corpus.

6 Emotional Speech Synthesis

6.1 Corpus Development

The emotions needed for the social robot were identified as anxious, apologetic, confident, enthusiastic and worried. To study the secondary emotions, a New Zealand English speech corpus with strictly-guided simulated emotions was developed [47]. This corpus, called the JLcorpusFootnote 11, contains five primary (angry, excited, happy, neutral, sad) and five secondary emotions (anxious, confident, worried, apologetic, enthusiastic). The JLcorpus has equal numbers of the four English long vowels /a:/, /o:/, /i:/ and /u:/, to facilitate comparison of emotion-related formant and glottal source features across vowel types. The corpus contains 2400 sentences spoken by two male and two female professional actors. The semantic context of all the sentences in the corpus was kept the same for all primary emotions, while the secondary emotions have 13 emotion-incongruent sentences and two emotion-congruent sentences. The inclusion of emotion-congruent sentences allows analysis of the effect of semantic influence on emotion portrayal and acoustic features, as seen in [48]. The emotion quality of the JLcorpus was evaluated by a large-scale perception test with 120 participants, who evaluated the emotions portrayed in the corpus. The corpus was labelled at the word and phonetic levels by webMAUS [49], with hand correction of wrongly marked boundaries.

Fig. 7
figure 7

Fujisaki parameters for ‘Sound the horn if you need more’ (SAMPA phonetic symbols). \(T_0\), \(T_1\), \(T_2\) marked for first phrase and accent commands

6.2 Features Modelled

Modelling and synthesising the secondary emotions was done using three prosody features: fundamental frequency (\(f_0\)), speech rate and mean intensity. A preliminary analysis of the emotions in the JLCorpus is reported in [47]. A detailed analysis of the \(f_0\) contour based on the Fujisaki model (a method to parameterise the \(f_0\) contour, described below) is reported in [50]. The decision to model the \(f_0\) contour and speech rate is based on these analyses.

The Fujisaki model [51] parameterises the \(f_0\) contour by superimposing (all parameters are marked for a sentence in Fig. 7): (1) the base frequency \(F_b\) (indicated by the horizontal line at the floor of the \(f_0\) pattern), (2) the phrase component, declining phrasal contours accompanying each prosodic phrase, and (3) the accent component, reflecting fast \(f_0\) movements on accented syllables and boundary tones. These commands are specified by the following parameters:

  1. 1.

    Phrase command onset time (\(T_0\)): Onset time of the phrasal contour, typically before the segmental onset of the ensuing prosodic phrase. (Phrase command duration Dur_phr = end of phrase time minus \(T_0\))

  2. 2.

    Phrase command amplitude (\(A_p\)): Magnitude of the phrase command that precedes each new prosodic phrase, quantifying the reset of the declining phrase component.

  3. 3.

    Accent command Amplitude (\(A_a\)): Amplitude of accent command associated with every pitch accent.

  4. 4.

    Accent command onset time (\(T_1\)) and offset time (\(T_2\)): The timing of the accent command that can be related to the timing of the underlying segments. (Accent command duration Dur_acc = \(T_2 - T_1\))

\(A_a\), \(A_p\), \(T_0\), \(T_1\), \(T_2\) and \(F_b\) are referred to as the Fujisaki parameters; Dur_phr and Dur_acc are derived from them. The Fujisaki parameters for each utterance were extracted using the AutoFuji extractor [52], and checks were made to ensure that potential errors in \(f_0\) tracking did not affect the parameters. Analysis of the effect of emotions on the Fujisaki parameters [50] showed that they were affected by the emotions, with the accent command parameters (the smaller units, \(A_a\) and the accent command duration \(T_2 - T_1\)) and \(F_b\) having the most significant effect.
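
To make the model concrete, the sketch below implements the standard Fujisaki superposition of the base frequency, phrase commands and accent commands in Python. The control-curve constants (\(\alpha\), \(\beta\), \(\gamma\)) and the command values are typical illustrative values, not parameters extracted from the JLcorpus.

```python
import numpy as np

def phrase_curve(t, alpha=2.0):
    """Phrase control mechanism: Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0."""
    tc = np.clip(t, 0.0, None)
    return np.where(t >= 0, alpha ** 2 * tc * np.exp(-alpha * tc), 0.0)

def accent_curve(t, beta=20.0, gamma=0.9):
    """Accent control mechanism: Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    tc = np.clip(t, 0.0, None)
    return np.where(t >= 0, np.minimum(1 - (1 + beta * tc) * np.exp(-beta * tc), gamma), 0.0)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0, gamma=0.9):
    """Superimpose Fb, phrase commands (T0, Ap) and accent commands (T1, T2, Aa)
    on the log-frequency scale to reconstruct the f0 contour (in Hz)."""
    ln_f0 = np.full_like(t, np.log(fb), dtype=float)
    for t0, ap in phrase_cmds:
        ln_f0 += ap * phrase_curve(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        ln_f0 += aa * (accent_curve(t - t1, beta, gamma) - accent_curve(t - t2, beta, gamma))
    return np.exp(ln_f0)

# Illustrative command values only (not taken from the corpus analysis):
t = np.linspace(0.0, 2.0, 400)
f0 = fujisaki_f0(t, fb=90.0,
                 phrase_cmds=[(-0.2, 0.5)],
                 accent_cmds=[(0.3, 0.6, 0.4), (1.1, 1.4, 0.3)])
```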

Mean values were obtained for the speech rate (in syllables/s) [47] and intensity (in dB) of the sentences for rule-based modelling of these prosody features for each emotion.

Fig. 8
figure 8

Emotion-based \(f_0\) contour transformation implementation in emotional text-to-speech synthesis system

6.3 Emotional Text-to-Speech Synthesis System

The inputs to an emotional text-to-speech synthesis system are the text to be converted to speech and the emotion to be produced. To facilitate real-time implementation, all the features used for prosody modelling here are based on these two inputs only. Fig. 8 shows the proposed emotional text-to-speech synthesis system. The input text is analysed linguistically to extract context features. A text-to-speech synthesis system for New Zealand English based on MaryTTS [53] has been built [54, 55]. This New Zealand English text-to-speech synthesis system produces speech without any emotion, referred to here as non-emotional speech. The input text is passed through the text-to-speech synthesis system to obtain non-emotional speech. The pitch is extracted from the non-emotional speech (by the Praat autocorrelation function [56]), and label files are obtained from the input text and non-emotional speech using the New Zealand English option of the Munich Automatic Web Segmentation System [49]. The pitch and label files are passed to the AutoFuji extractor to obtain the five derived Fujisaki parameters of the non-emotional speech (\(Ap_N\), \(Aa_N\), Dur_phr\(_N\), Dur_acc\(_N\), \(Fb_N\); the subscript "N" denotes "non-emotional"). The parameters are then time-aligned to the input text at the phonetic level. The speech rate and intensity are decided based on the emotion tag. With the context features, the non-emotional speech Fujisaki parameters, the emotion and the speech rate as features, a transformation is applied to each of the non-emotional speech Fujisaki parameters to obtain the emotional speech parameters (the feature list is given in Table 2). The context features and the non-emotional speech Fujisaki parameters are extracted by automatic algorithms, while the emotion and speaker are tags assigned according to the emotion and speaker to which the conversion is to be done. Feature extraction is done at the phonetic level, as the transformation is phone-based.

Table 2 Features used for \(f_0\) contour transformation

The context features, speaker and emotion tags are categorical, while the non-emotional speech features and speech rate are continuous. Hand-corrected Fujisaki parameters obtained from the natural emotional speech are the target values of the transformation. The transformation is applied to the Fujisaki parameters of non-emotional speech to convert them to the Fujisaki parameters of emotional speech. Ensemble learning using two regressors, Random Forests [57] and AdaBoost [58], is employed; the average of the predicted values from these regressors is the final transformed value. The hyperparameters of these regressors are tuned via grid-search cross-validation. An emotion-dependent model is built for each of the Fujisaki parameters, and all the emotion-dependent models are combined to form the \(f_0\) contour transformation model. The database for modelling contains 7413 phones with their corresponding Fujisaki parameters; 80% of the database is used for training and the rest for testing, using random selection.
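
A minimal scikit-learn sketch of this transformation step is given below. It assumes a numeric feature matrix (with the categorical tags already encoded) and a single target Fujisaki parameter; the hyperparameter grids and the randomly generated data are placeholders rather than the values used in this work.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(7413, 12))   # placeholder phone-level features (context, Fujisaki_N, tags, rate)
y = rng.normal(size=7413)         # placeholder target, e.g. accent amplitude Aa of emotional speech

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune each regressor with grid-search cross-validation.
rf = GridSearchCV(RandomForestRegressor(random_state=0),
                  {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=5)
ada = GridSearchCV(AdaBoostRegressor(random_state=0),
                   {"n_estimators": [50, 200], "learning_rate": [0.1, 1.0]}, cv=5)
rf.fit(X_train, y_train)
ada.fit(X_train, y_train)

# Final transformed value: average of the two regressors' predictions.
y_pred = (rf.predict(X_test) + ada.predict(X_test)) / 2.0
```

In the actual system, one such model is trained per Fujisaki parameter and per emotion, and the individual models are combined into the overall \(f_0\) contour transformation model described above.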

Once the non-emotional Fujisaki parameters are transformed to those of emotional speech, the resynthesis is done (last block in Fig. 8). The predicted Fujisaki parameters are time-aligned to the phones in the sentence and used to reconstruct the \(f_0\) contour by superimposing \(F_b\), the accent commands and the phrase commands based on the Fujisaki model. Once the \(f_0\) contour is reconstructed, the speech is re-synthesised by pitch-synchronous overlap-add using Praat. The re-synthesised speech thus has the \(f_0\) contour obtained from the transformation model, while intensity and speech rate are assigned by rules based on the emotion-dependent mean values.
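
This resynthesis step can be sketched using parselmouth, a Python interface to Praat (the paper uses Praat itself; parselmouth, the file names and the helper function below are assumptions for illustration). The transformed contour is written into a PitchTier and the speech re-synthesised by overlap-add; speech-rate changes would be applied analogously through the manipulation's duration tier.

```python
import numpy as np
import parselmouth
from parselmouth.praat import call

def rebuilt_f0(t):
    # Placeholder for the contour rebuilt from the predicted Fujisaki parameters
    # (see the fujisaki_f0 sketch above); a toy contour is used here.
    return 120.0 + 20.0 * np.sin(2.0 * np.pi * t)

snd = parselmouth.Sound("non_emotional.wav")       # placeholder path: TTS output
dur = snd.get_total_duration()

manipulation = call(snd, "To Manipulation", 0.01, 75, 600)
pitch_tier = call(manipulation, "Extract pitch tier")

# Replace the original pitch points with the transformed contour (10 ms steps).
call(pitch_tier, "Remove points between", 0, dur)
for t in np.arange(0.0, dur, 0.01):
    call(pitch_tier, "Add point", float(t), float(rebuilt_f0(t)))

call([pitch_tier, manipulation], "Replace pitch tier")
resynth = call(manipulation, "Get resynthesis (overlap-add)")

# Emotion-dependent mean intensity rule (the value here is illustrative only).
call(resynth, "Scale intensity", 65.0)
call(resynth, "Save as WAV file", "emotional.wav")
```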

Table 3 Confusion matrix showing hit rates from pair-wise subjective test

6.4 Performance Analysis of Synthesised Speech

Performance analysis of the synthesised speech produced using the method described above was conducted using a perception test. In this perception test, sentences from the JLCorpus such as "Tom beats that farmer" were used, not the robot dialogues as in the pilot study. Hence, this test is independent of the healthcare robot application and tests only the quality of the emotions in the synthesised voice, and thereby the machine learning approach. The synthesised emotional speech was evaluated by a perception test with 29 participants, who evaluated the emotions in the synthesised emotional speech. The participants had an almost even distribution of first- and second-language speakers of English (all variants of English were included). The majority of the participants were from the age group 16-35 (82%), with the remainder distributed over 36-65. All the participants had an average, above-average or excellent (self-reported) hearing ability. In a forced-response emotion classification task, the participants had to choose which emotion they perceived and group the sentences into the two emotion pairs provided. In total, 100 sentences were evaluated by 29 participants, giving 2900 evaluations.

Table 3 shows the confusion matrix obtained from the perception test for each emotion pair. The most confused emotion pairs were enthusiastic vs anxious, confident vs enthusiastic and worried vs apologetic (expected due to their closeness in valence-arousal levels). The confusion between anxious and enthusiastic is the only problematic pair that could result in an inappropriate reaction from the robot to the human user. On average, the perception accuracy for differentiating between the emotion pairs was 87%. The results obtained here are comparable to other emotional synthesis studies that used different techniques to model the \(f_0\) contour. For instance, [59] reported 50% perception accuracy for expressions of good news, bad news and questions, [60] reported 75% perception accuracy for happy, angry and neutral, and [61] reported 65% perception accuracy for joy, sadness, anger and fear. However, no past studies did contour modelling of the secondary emotions we studied; hence a direct comparison is not possible. It is of note that the accuracy rate for the secondary emotions in the perception test in [47] for the JLCorpus was 40%, which is considerably lower than the 87% obtained here. However, that was quite a different test, in which participants had five emotions to select from, rather than two. The secondary emotions are not well separated on the valence-arousal plane, and giving participants a choice of five emotions leads to confusion between emotions close to each other on the plane, as can be seen in Table 4 (comparison between apologetic and anxious vs confident and enthusiastic).
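
For completeness, the hit rates in such a confusion matrix can be tabulated directly from the forced-choice responses, as in the short pandas sketch below; the response labels are illustrative placeholders, not the actual evaluation data.

```python
import pandas as pd

# Placeholder (intended, perceived) labels from a pair-wise forced-choice test.
responses = pd.DataFrame({
    "intended":  ["anxious", "anxious", "enthusiastic", "worried", "apologetic", "confident"],
    "perceived": ["anxious", "enthusiastic", "enthusiastic", "worried", "apologetic", "confident"],
})

# Row-normalised crosstab: each row gives the proportion of times an intended
# emotion was perceived as each alternative; the diagonal entries are the hit rates.
confusion = pd.crosstab(responses["intended"], responses["perceived"], normalize="index")
print(confusion.round(2))
```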

7 Study 2 - Perceived Empathy from Synthesised Emotional Voice

Footnote 12 Based on the results obtained from the pilot study (Sect. 4), the emotions required for an empathetic voice were identified (Sect. 5), then modelled and finally synthesised (Sect. 6). The next step is to find out whether humans can actually perceive empathy from the developed voice. A second perception test, with the synthesised voice spoken by a robot, was conducted. This perception test is a replica of the test conducted in Sect. 4, except that the voices used here are synthesised. The testing setup provided to the participants was also the same as in the pilot test, as shown in Fig. 1. A link to the survey is provided hereFootnote 13, where the robot speaking to the patient can be seen. A total of 51 participants aged 16-55 (age distribution shown in Fig. 2) with average or above-average hearing ability took part in this experiment. The aim is to address the three research questions listed in Sect. 4 with the synthesised voice.

7.1 Addressing Research Question I

7.1.1 Design

The participants saw a video and listened to two sets of dialogues between a Healthbot and a patient. The text associated with both dialogue sets was the same. As before, there are two voices:

  1. 1.

    Synthesised robotic voice - This is the voice synthesised without any emotions: the output of the New Zealand English text-to-speech synthesis system, rendered in a neutral tone.

  2. 2.

    Synthesised empathetic voice - This is the emotional voice produced by the emotion transformation model. Each dialogue spoken by this voice contains one of the five secondary emotions: anxious, apologetic, confident, enthusiastic or worried.

The same voice talent was used to create both the synthesised voices. The participants listened to each of these voices separately (the empathetic voice first, followed by the robotic voice) and rated the voices on a five-point empathy scale, which was also used in the pilot experiment.

7.1.2 Aim

In the pilot study, the participants could perceive empathy from the voice of the robot when the empathy was expressed using changes in prosody only. However, this was done using natural speech, and a synthesised voice will not be as perfect or human-like as a natural voice. Hence, this test is essential to understand whether the expression of empathy in the synthesised voices is perceived by participants.

Fig. 9
figure 9

Participants’ rating of the two voice types

7.1.3 Results

Figure 9 summarises the ratings (in %) given by the participants for the two voice types. It can be seen that 81.5% of the participants gave a rating of 4 or 5 to the empathetic voice, meaning that the participants could perceive higher levels of empathy from the synthesised emotional voice. For the robotic voice, 44.5% of participants gave a rating of 1 or 2; the participants perceived only lower levels of empathy from the synthetic robotic voice. Both voices spoke the same text; the only difference was the prosody modelling in the synthesised empathetic voice to produce emotional speech. This suggests that the addition of prosody modelling contributed to the perception of an empathetic voice from the robot.

7.1.4 Statistical Analysis

Because the data are skewed for both voice types, a Wilcoxon signed-rank test was conducted to assess the difference between the ratings received for the synthesised robotic voice and the synthesised empathetic voice. The empathy scale ratings (1 to 5) given by the participants were used for the analysis. Some interesting findings were obtained from the statistical analysis:

(a) Difference in empathy ratings in study 2: The results (shown in Table 1, Row 4) indicate that the empathetic voice ratings (Median = 4) are higher than the robotic voice ratings (Median = 3.5), \(p = 0.003\), \(r = 0.3\). An effect size of \(r = 0.3\) indicates a small to medium effect according to Cohen's benchmarks for effect sizes.

Fig. 10
figure 10

Boxplot showing participants’ rating of the two voice type for study 1 and study 2

Table 4 Comparison between robotic voice and empathetic voice in study 1 (Pilot) and study 2

The effect size is lower for the synthesised speech (\(r = 0.3\)) than for the acted-out speech (\(r = 0.7\)). To understand what is happening, consider the boxplot in Fig. 10, which shows the participants' ratings of the robotic voice and the empathetic voice from both study 1 and study 2. It can be seen that, for the pilot study, there is no overlap between the ratings of the two voice types (which is also reflected in the effect size \(r = 0.7\) in Table 1). However, in the second study, using the synthesised voices, there is some overlap between the ratings for the robotic voice and the empathetic voice.

(b) Empathetic voice ratings from both studies: The empathetic voice received higher empathy ratings than the robotic voice in both study 1 and study 2 (see boxplots in Fig. 10). The ratings received for the robotic voice and the empathetic voice in both studies are significantly different from each other (Table 1, Columns 6 and 7). Table 4 shows pair-wise Wilcoxon signed-rank test results for both voice types in studies 1 and 2. This comparison helps to understand whether people's responses are statistically different between the two studies. It can be seen that the difference in the ratings received for the empathetic voice in studies 1 and 2 is not statistically significant (\(p = 0.176\), \(r = 0.1\)Footnote 14). This suggests that participants felt that the robot was interested in what the patient was saying and was trying to engage with the patient, regardless of whether the empathetic voice was acted or synthesised. Hence, the aim of this study, to develop a synthesised empathetic voice, was achieved.

(c) Difference in robotic voice ratings between studies 1 and 2: For the robotic voice, there is a statistically significant difference between the two studies (\(p < 0.001\), \(r = 0.4\)Footnote 14; from Table 4). This difference can be observed visually in the box plots in Fig. 10: the synthesised robotic voice received higher ratings than the robotic voice produced by an actor. This could be because people make more allowances for the synthesised voice, as it is not human. They tune their minds to actively listen to the voice and the words (which portrayed empathy), knowing very well that it is synthesised. However, when the participants know that the voice is produced by a human, they probably expect more empathy in the voice. This could be a reason why the acted-out robotic speech was rated poorly on the empathy scale.

7.2 Addressing Research Question II

7.2.1 Design

In this part of the test, the participants were asked which of the two voices they would prefer if they were the patient talking to a healthcare robot. At this stage, they could watch the videos of the two voices any number of times before making their decision.

7.2.2 Aim

This test is to understand which voice the participants prefer in the actual application of the robot.

7.2.3 Results

In total, 83% of the participants preferred the synthesised empathetic voice and 17% preferred the synthesised robotic voice. In the pilot experiment with natural speech (described in Sect. 4.2.2), similar findings were observed, with the majority of the participants (95%) preferring the empathetic voice over the robotic voice. It is clear that in both studies the empathetic voice was preferred over the robotic one. However, the participants' reasons for making a choice may not necessarily be the same. A robotic voice spoken by a human could be perceived as creepy, whereas a synthetic voice with modelled empathy might be considered acceptable for the task; without further study, we can only speculate about the reason. We also need to consider the impact of participant numbers: the pilot experiment had 120 participants, while 51 participants did this second experiment. The larger number of participants in the initial test may also be a reason for the stronger results. Additionally, some participants found the empathy in the synthetic voice to be "not real", which made them choose the robotic voice instead.

7.3 Addressing Research Question III

7.3.1 Design

In this part of the experiment, the participants were asked the reason for preferring one of the two voice types. The participants were provided with a series of options to justify their choice of preferred voice (the same forced-response options as in the pilot study). There was also a free-response section, where the participants could write whatever they wanted. The participants were also asked which emotions they could perceive from the voices. They could watch the videos of the two voices any number of times before making their decision.

7.3.2 Aim

This was designed to understand why people chose either of the two voices.

7.3.3 Results

The forced-response reasons for choosing the synthesised empathetic voice and for not choosing the synthesised robotic voice are given in Fig. 11. The tone of the voice, the emotions, the empathy, the friendliness and the encouragement in the voice are the most frequent reasons for participants' choice of the synthesised empathetic voice. The lack of emotions, the inappropriate tone and the lack of friendliness in the voice are the key reasons for not preferring the synthesised robotic voice.

Fig. 11
figure 11

Participants’ reasons for choosing the synthesised empathetic voice (in blue) and not preferring the synthesised robotic voice (yellow)

The participants also had a free-response section where they could comment on their choices beyond the reasons listed in Fig. 11. Figure 13 presents a mind map of the reasons the participants provided for preferring the synthesised empathetic voice. The key themes that emerged from the thematic analysisFootnote 15 were the voice's suitability for the application (social robots, specifically healthcare robots), the influence of the changes introduced to the affective prosody of the voice, and the properties of the voice. The most quoted reasons were the naturalness perceived in the empathetic voice and the emotions in the voice. Participants also commented that the voice is suitable and appropriate for a healthcare application, and that they felt the robot was engaging and interested in the patient. Another important reason the participants mentioned was the tone of the voice, which is a direct reflection of the \(f_0\) contour modelling done to synthesise the emotions in the empathetic speech. Figure 14 presents a mind map summarising the participants' reasons for not preferring the synthesised robotic voice. All the responses were related to the major themes of voice suitability for a social robot application and the properties of the voice. The most quoted reasons for not preferring the robotic voice were that the voice sounded unnatural and lacked emotions. Many participants also gave the lack of engagement from the robot speaking with the robotic voice as a reason not to prefer it.

Fig. 12
figure 12

Participants’ responses for the emotions they could perceive from the empathetic voice (study 2)

Fig. 13
figure 13

Mind map showing free responses from participants for preferring the Synthesised empathetic voice

Fig. 14
figure 14

Mind map showing free responses from participants for not choosing the Synthesised robotic voice

The participants were also asked to choose the emotions they could perceive from the two voice types from the set angry, happy, sad, excited, concerned, anxious, encouraging, assertive, apologetic and other. The results are summarised in Fig. 12. Similar to the results in the pilot study (Fig. 5), most of the emotions that the participants could perceive were words indicating secondary emotions. Empathy and confidence were the responses that came under the other option.

The major takeaways from study 2 are:

The reasons for choosing the synthesised empathetic voice show that the participants preferred a Healthbot that portrays emotions while expressing empathy. Even though the text content of the dialogues was the same, higher empathy was perceived only when the acoustics of the emotions were added to the voice in congruence with the text. This strongly suggests that the prosody modelling produced by the speech synthesis system must be in alignment with the linguistic content of the sentence for empathy to be correctly perceived by the users.

The synthesised robotic voice was perceived as uninterested or rude, which made participants dislike it in the healthcare scenario. Without proper prosody modelling, robotic dialogues may sound as if the robot is uninterested in the patient, which could reduce the acceptance of social robots.

The participants could perceive many secondary emotions from the synthesised empathetic voice, such as apologetic, concerned (worried), encouraging (enthusiastic) and assertive (confident). These are the emotions modelled by the emotion transformation, and the perception test has shown that participants can perceive the same emotions in the synthesised voice.

8 Discussions and Conclusion

This paper presents two studies conducted in a symmetric fashion to develop acceptable synthetic voices for healthcare robots. Study 1 starts by identifying what type of voice is acceptable for healthcare robots, through a perception test using voices spoken by a professional voice artist. Once the type of voice needed was identified as empathetic, the emotions needed for an empathetic voice were determined based on the application, healthcare robots. The emotions needed for the social robot were found to be the secondary emotions anxious, apologetic, confident, enthusiastic and worried. A corpus containing these emotions was then developed, and a model to synthesise them was built using ensemble regressors. The emotional speech model was perceptually evaluated. Further, to complete the process, a second study was conducted using the synthesised voice as the voice of the healthcare robot. This study was a replica of the initial pilot study, the only difference being the use of the synthesised voice as the voice of the robot. The major contributions of this paper are: (a) the development of an emotional speech model for the secondary emotions identified as needed for a healthcare robot (based on the pilot study), (b) synthesising emotional speech based on the model, and (c) conducting a study similar to the pilot study, but using synthesised speech for the healthcare robot. This second study tested the acceptability of a healthcare robot speaking empathetically using synthesised speech with five secondary emotions.

The major finding of the pilot study is that the emotions needed for healthcare robots are not only the well-researched primary emotions, but also nuanced secondary emotions. Considerable resource development and research are needed to understand the acoustics of these secondary emotions. In this work, the secondary emotions were synthesised via ensemble regression modelling applied to the output of a New Zealand English text-to-speech synthesis system. Participants in study 2 found the emotional speech containing the five secondary emotions to be more empathetic than a robotic voice saying the same textual information. This result further strengthens the motivation to study nuanced emotions and to include them in the voices of social robots, alongside the primary emotions that are already well researched.

The participants preferred the empathetic voice over the robotic voice for a healthcare application. Hence, the second important finding is that people can perceive empathy from the healthcare robot's voice when empathy is expressed only by the prosody component of speech. The texts of the dialogues spoken by the synthesised robotic voice and the synthesised empathetic voice were the same, but only when the emotions (prosody component) matched the text did people perceive empathy in the voice, and this could be perceived even when the voice was synthesised. This result emphasises the importance of the emotions in the speech being congruent with the textual content of what is being said. This congruence is essential for participants to perceive empathy from robots, and this perception of empathy also improves the acceptance of social robots.

This study was based on the Healthbots application, using the dialogue set of the Healthbots for all the analyses. Different applications may require other nuanced emotions, and a similar application-oriented analysis should be done to identify the emotions that are needed. The ensemble regression-based model can then be extended to those emotions.

For the experiments reported in this paper, the participants were shown videos of the Healthbot talking with different voice types. But we cannot extend these results to scenarios when people directly interact with a physical robot. The reaction of people when directly interacting with robots can be affected by the presence of the robot near them [64], perceived age and gender [65], engagement techniques (like gaze, nodding) [66, 67], accent of the robot’s voice [68] and other factors. Hence, direct interaction of people with a robot will have to be studied with the same voice types (as used in this study) to evaluate the empathy perceived from the robot. The authors will be conducting such a study (similar to [68]) in the near future.

This study focused on the prosody component of speech. The verbal component also impacts empathy portrayal by speech. Hence dialogue modelling also needs to be conducted to develop empathetic voices. Along with speech, the other communication channels like facial and para-linguistic channels also contribute to the perception of empathy. These are areas where more research needs to be done to develop social robots that express empathy.