Introduction

Robots are moving into human social spaces and collaborating with humans on different tasks. An intelligent social robot is required to adapt the affective content of its generated behavior to the context of interaction and to the profile of the user in order to increase the credibility and appropriateness of its interactive intents. Speech, facial expressions, and gestures can express synchronized affective information that enhances behavior expressivity [18]. Gestures and facial expressions play an important role in clarifying speech, particularly in case of any deterioration of the speech signal [28].

Different studies in the Human–Robot Interaction (HRI) and Human–Computer Interaction (HCI) literature have discussed synthesizing affective speech [40, 58] and facial expressions [13, 65], in addition to gesture generation [20, 61]. Besides, other studies investigated the effect of multimodal information from speech and facial expressions on emotion recognition, compared to unimodal information [17]. However, to our knowledge, these studies, among others, have not proposed a general framework that bridges between affective speech (Footnote 1) (Sect. “Affective Speech Synthesis”) on one side and both adaptive gestures [1, 3, 4, 5] and facial expressions (Sect. “Facial Expressivity”) on the other side, as illustrated in our current study. The proposed framework allows explicit control over prosody parameters so as to better express emotion. In addition, it considers the relationship between emotion and gestures, which allows adapting the generated robot gestural behavior to the characteristics of the synthesized affective speech (Footnote 2) according to the context of interaction proposed in this study. The system architecture illustrated in Sect. “System Architecture” guarantees a direct human–robot interaction context (Footnote 3), which allows for generating and evaluating affective speech, adaptive gestures, and facial expressions so as to address the effect of the robot behavioral multimodality on interaction with a wide scope (Sect. “Effect of the Robot Behavioral Multimodality on Interaction”). Additionally, we discuss another evaluation of the generated affective robot behavior based on the behavioral determinant factors of the participants: personality extraversion [33] and gender.

The important role that affective speech, gestures, and facial expressions could play in enhancing the robot behavior expressivity during social interaction is investigated through three experimental hypotheses of interaction between the participants and the ALICE robot, where robot behaviors combining at least two modalities of speech, gestures, and/or facial expressions are compared to those with fewer affective cues (Footnote 4) (Sect. “Hypotheses”). During the experiments, each participant watches a set of videos that aims at eliciting specific target emotions, upon which interactive discussions with the robot start and the participant evaluates the characteristics of the generated robot behavior (Sect. “Effect of the Robot Behavioral Multimodality on Interaction”). Moreover, we report personality- and gender-based evaluations of the robot behavior to find out any differences in the way it was perceived by the introverted–extraverted and male–female participants within a human–robot interaction context, so as to bridge between the affective perception of the robot behavior and the human profile (Sects. “Human Personality-Based Evaluation of the Affective Robot Behavior” and “Gender-Based Evaluation of the Affective Robot Behavior”). Last but not least, we discuss the findings of this study and propose research directions for future work (Sect. “Discussion”).

Related Work

The correlation between emotion and speech has been extensively investigated in the related literature [26]. Speech prosody can reflect human emotion through variations in basic features such as pitch, volume, and intensity [59]. The variations in voice prosody characteristics that influence the conveyed affective meaning of speech for different emotions, such as anger, disgust, fear, pleasure, and sadness, were studied in Sauter et al. [72]. Emotion perception and the time needed for emotion recognition using prosodic features were discussed in Pell and Kotz [62]. The evolutionary nature of emotion was investigated in Aly and Tapus [2, 8] through a perceptual model, where mixtures of basic emotions could compose complex emotions (e.g., fear + sadness = desperation).

The literature reveals different approaches towards synthesizing speech so as to improve both Human–Robot Interaction (HRI) and Human–Computer Interaction (HCI). Murray and Arnott [58] discussed an early initiative to synthesize affective speech using a rule-based formant synthesis technique, but the resulting quality was low. Edgington [27] presented a concatenation-based technique that attained limited success in emotion expression. This approach was further developed to employ the unit selection technique, which avoids interference with the recorded voice and yields better speech quality, and it reported some success in expressing anger, happiness, and sadness [40]. Similarly, deep learning approaches for speech synthesis have attracted attention over the last decade [31, 66, 92]; however, these approaches have focused mainly on neutral speech synthesis. Moreover, end-to-end models (e.g., the Tacotron model [90]) have recently been used in affective speech synthesis [51, 86]. However, these systems imitate a generic style of speaking in a few predefined emotions with a limited ability to control the affective expressivity of speech, which, together with the large amount of data required to train them, deprives them of flexibility and ease of use in our study. Generally, the previously discussed techniques, among others, do not provide explicit control over the parameters of speech prosody to better express emotion. Therefore, in this work, we use the well-known pre-trained text-to-speech engine, Mary-TTS [75], to generate affective robot behavior expressed through speech (besides other modalities of communication, such as facial expressions and/or head–arm gestures) during interaction.

The basic definition of gesture was given by Kendon [45] and McNeill [56]. They defined a gesture as a body movement synchronized with speech, related in a parallel or complementary way to the meaning of an utterance. Ekman and Friesen [29] proposed an early categorization of gestures: (1) affect displays (e.g., facial expressions), (2) adaptors (e.g., scratching), (3) regulators (e.g., using arm–hand movements to control turn-taking within a conversation), (4) illustrators (e.g., pointing), and (5) emblems (e.g., waving). Kendon [46] further adapted this categorization, since it neglected language although language is a fundamental interactive phenomenon, and proposed a new gesture categorization: (1) signs (i.e., sign language), (2) pantomime (i.e., sequences of gestures with a narrative structure), (3) emblems, and (4) gesticulation. McNeill [56] named the continuum of Kendon’s gesture categorization ‘Kendon’s Continuum’ in his honor, and proposed another widely cited gesture typology of four categories, which could be considered gesticulations (according to Kendon’s classification): (1) metaphorics (i.e., gestures referring to abstract ideas), (2) beats (e.g., rhythmic finger movements), (3) iconics (i.e., gestures with a close semantic correlation with speech that refer to images of specific entities), and (4) deictics (e.g., pointing). These categories represent the evolution of the images and ideas described in a speaker’s mind.

The related literature in Human–Computer Interaction (HCI) and Human–Robot Interaction (HRI) shows active research towards generating iconic and metaphoric gestures, which constitute a major part of human nonverbal behavior during interaction [56]. Pelachaud [61] introduced the rule-based 3D agent GRETA, which can generate a multimodal synchronized behavior from an input text. It can generate gestures of different categories regardless of the context and domain of interaction, in contrast to other 3D conversational agents (e.g., the MAX agent) [48]. Cassell et al. [20] introduced a rule-based gesture generator, the BEAT toolkit, which can produce an animation script for both virtual agents (e.g., the agent REA) [19] and robots [9] from an input text. This toolkit can synthesize gestures of several categories, such as iconic gestures, but not metaphoric gestures. Le et al. [50] proposed a rule-based framework for generating synchronized multimodal behaviors using the agent GRETA and robots. Generally, the majority of rule-based gesture generation approaches do not consider the effect of emotion on body language, which makes it difficult to adapt the generated robot behavior to human emotion detected through speech prosody [57] and gesture characteristics. Similarly, several deep learning approaches have increasingly focused on gesture synthesis over the last years. Chiu et al. [22] proposed a data-driven framework for predicting gestures from speech; however, the model uses only predefined categories of annotated gesture data, which limits the shape of the produced gestures to those used in training, with their language dependencies. Moreover, the model outputs gesture category labels rather than motion curves; therefore, it cannot be used directly with 3D agents and robots. Hasegawa et al. [34] discussed a data-driven model for metaphoric gesture motion synthesis for a stick figure based on a speech input in Japanese; however, the generated gestures were rated lower than the original gestures in semantic consistency. This model was further improved through motion representation learning to ameliorate gesture motion synthesis [49], but using the same language. Yoon et al. [91] introduced a data-driven end-to-end robot model for generating different categories of gestures (including iconic and metaphoric gestures) based on an input text rather than speech directly, similar to the rule-based gesture generators explained earlier. Besides, this model requires a very large amount of data for training. Therefore, in this paper, we present a human–robot interaction study complementary to our work [5], which discussed a framework for generating arm and head gestures adapted to speech prosody that correlates with emotion. These gestures are modeled on the robot in parallel with affective speech and/or facial expressions to examine the effect of the robot behavioral multimodality on interaction with human users.

The correlation between speech and facial expressions has been extensively investigated in the literature. Kalra et al. [41] showed that speech prosody and the movement of face muscles can change in a synchronous manner to express different emotions. The unimodal perception of human emotion through audio or visual information was discussed in Silva et al. [79]. Additionally, Busso et al. [17] discussed the complementarity and combination of both modalities that can increase the perception of human emotion. Karras et al. [43] presented a Convolutional Neural Network (CNN) model that can synthesize 3D facial animation from speech—in different languages—expressing emotion. Other deep learning approaches have been discussed in Taylor et al. [83] and Vougioukas et al. [88] for facial animation synthesis from speech. These approaches, among others, are mostly limited to animating face models without focusing on generating facial expressions in different affective states.

In robotics and computer-based applications, modeling and synthesizing facial expressions have attracted much attention over the last decades. Platt and Badler [65] discussed a 3D face model that controls the muscular actions responsible for facial expressions following the Facial Action Coding System (FACS). Spencer-Smith et al. [80] presented a realistic 3D face model that can create different stimuli with 16 FACS units. Modeling credible facial expressions on robots has been a rich research topic in recent years due to their mechanical constraints compared to virtual agents, which have a higher flexibility in creating facial expressions. Breazeal [15] presented the robot-head Kismet, which employs eyes, mouth, and ears to model different emotions expressing sadness, surprise, happiness, disgust, and anger. Breemen et al. [16] introduced the robot iCat, which can express fear, anger, sadness, and happiness. Beira et al. [13] developed the iCub robot, which can model different emotions, such as happiness, anger, surprise, and sadness, using gestures and facial expressions. Lutkebohle et al. [52] presented the robot-head Flobi, which can express different emotions, such as fear, anger, surprise, sadness, and happiness. Hoffman et al. [37] developed the conversation companion Kip1, which can reflect emotion using a few degrees of freedom, such as expressing fear through a shivering motion. Similarly, designing facial expressions on android robots has been a subject of extensive research into how to create convincing facial expressions considering the rules of human emotion expression [64]. Vlachos and Schärfe [87] investigated designing facial expressions on an android robot, where the findings showed the robot’s incapability to reproduce the ‘fear’ and ‘disgust’ emotions due to mechanical limitations in the face. These approaches for modeling facial expressions on 3D agents and robots, among others, show serious efforts towards creating expressive facial behaviors with specific emotions, while at the same time reporting some limitations when modeling emotions with a wide scope. This indicates the importance of the robot behavioral multimodality, where each behavior modality enhances the other modalities so as to improve the clarity of the robot behavior during interaction.

The robot behavioral multimodality refers to coordinating and combining different modalities of communication in the robot (agent) behavior, which has been a challenging research topic over the last years [38, 84]. Regarding the coordination of facial expressions and gestures, among others, Clavel et al. [23] discussed the positive effect of facial and bodily expressions on the affective expressivity of a virtual character (and consequently on emotion recognition), and Costa et al. [25] showed that gestures can effectively help in recognizing the facial expressions of a robot. Regarding the coordination of speech and gestures, among others, Salem et al. [71] discussed the positive effect of gesture–speech multimodality on the evaluation of the robot behavior. Regarding the coordination of speech, gestures, and facial expressions, among others, Castellano et al. [21] and Schirmer and Adolphs [73] reported the positive effect of multimodal information on emotion recognition compared to information from fewer modalities. The related literature on the affective expressivity of the robot behavior has largely focused on unimodal (and bimodal) behaviors [38], given the difficulty of generating a synchronized multimodal behavior, compared to virtual agents, with reasonably expressive speech, facial expressions, and gestures. This is due to the limited facial expressivity of robots, which restricts generating a wide range of credible facial expressions, to mechanical limitations, which restrict generating gestures smoothly, and to the inability to synthesize affective speech for a wide range of emotions. In this work, we take a step forward towards creating a multimodal framework for generating affective robot behavior with more than two combined modalities of communication. Besides, we propose designs for modeling affective speech and facial expressions, in addition to gestures (Footnote 5), which can inspire other researchers in social robotics with solutions when examining hard-to-model emotions. Furthermore, we discuss the participants’ evaluations of the generated robot behavior considering their gender and personality, which is useful for future studies in human–robot interaction.

Fig. 1

Overview of the system architecture

In this paper, we use the expressive ALICE robot to model and evaluate a multimodal robot behavior expressed through at least two combined modalities (speech, facial expressions, and/or head–arm gestures), compared to robot behaviors with fewer combined affective cues. The paper is organized as follows: Sect. “System Architecture” discusses the system architecture, Sect. “Experimental Setup” illustrates the experimental hypotheses, design, and scenario of interaction, Sects. “Experimental Results” and “Discussion” provide a description of the experimental results and a discussion of the outcome of the study, and finally, Sect. “Conclusion” concludes the paper.

System Architecture

This study presents a series of interaction experiments between humans and a robot, where the generated gestures and facial expressions of the robot depend on the synthesized affective speech, as indicated in Fig. 1, so as to create a multimodal affective robot behavior. The proposed framework is coordinated through the following subsystems:

  1. Speech Recognition, which is the HTML5 multilingual Google API.

  2. Emotion Detection, where predefined emotion-referring keywords are detected in the recognized speech of the participant, corresponding to his/her opinion about the projected video during each interaction experiment, so as to label the emotional content of each video (Footnotes 6, 7).

  3. Mary-TTS Engine, which converts the story texts (Footnote 8) with the detected emotion labels of the employed videos to affective speech (Sect. “Affective Speech Synthesis”).

  4. Body Gesture Generator, which uses the speech generated by the Mary-TTS engine to generate synchronized head–arm gestures (Footnote 9) [5].

  5. Facial Expressions Modeling, where facial expressions are modeled on the robot face in synchrony with the synthesized speech (Sect. “Facial Expressivity”).

  6. ALICE Robot, which is the test-bed platform in the conducted experiments with the participants (Sect. “Experimental Setup”).

In the following sections of the paper, we illustrate the subsystems of the proposed framework and describe the experimental setup in detail.
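Before detailing the individual subsystems, the following minimal sketch illustrates how they could be chained to realize the data flow of Fig. 1. Every function name, signature, and return value below is a hypothetical placeholder, not the actual implementation; it only shows the coordination between the six subsystems.

```python
"""Minimal, hypothetical sketch of the subsystem chain in Fig. 1."""

def recognize_speech() -> str:
    # Stand-in for the HTML5 multilingual Google speech API.
    return "this video is expressing sadness"

def detect_emotion(comment: str) -> str:
    # Keyword spotting over the recognized comment; a fuller sketch of the
    # keyword matching and video fallback appears in "Experimental Design".
    return "sadness" if "sad" in comment.lower() else "neutral"

def synthesize_affective_speech(story_text: str, emotion: str) -> bytes:
    # Stand-in for the Mary-TTS call (see "Affective Speech Synthesis").
    return b""  # WAV bytes in a real pipeline

def generate_gestures(speech_wav: bytes) -> list:
    # Stand-in for the prosody-driven head-arm gesture generator [5].
    return [("head_nod", 0.0), ("arm_raise", 1.2)]

def model_facial_expressions(emotion: str, speech_wav: bytes) -> list:
    # Stand-in for the FACS-based facial scripts (see "Facial Expressivity").
    return [("brow_lowerer", 0.0)]

def run_trial(story_text: str) -> str:
    comment = recognize_speech()
    emotion = detect_emotion(comment)
    speech = synthesize_affective_speech(story_text, emotion)
    gestures = generate_gestures(speech)
    face = model_facial_expressions(emotion, speech)
    # The ALICE robot would then play speech, gestures, and facial
    # expressions in synchrony.
    return emotion

print(run_trial("A short summary of the projected video."))
```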

Affective Speech Synthesis

The text-to-speech Mary-TTS engine is used for adding prosody and accent cues to a predefined text that summarizes the storyline of the video under discussion [75]. This engine helps the robot engage in conversation with each participant using affective speech adapted to the story displayed in the video. The Mary-TTS engine uses a high-level markup language (SSML: Speech Synthesis Markup Language) to define the vocal pattern of the synthesized speech [82], as it provides useful features such as adding periods of silence between words, in addition to easy control over speech characteristics (i.e., pitch contour and baseline, and speech rate) (Fig. 2). This makes it a helpful tool for the vocal design of the target emotions described in this study. It should be recalled that the Mary-TTS engine is not yet able to synthesize emotional speech in English in a human-like manner (like other TTS engines); however, to our knowledge, it provides better vocal design capabilities and higher flexibility than the other available engines. This makes the proposed vocal design in this work an approximate step towards communicating the meaning of each expressed emotion during interaction. Thus, the robot behavioral multimodality is important for emphasizing the meaning of the expressed behavior, where each modality enhances the expressiveness of the other modalities.

Fig. 2

SSML specification of the ‘sadness’ emotion
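As an illustration of the kind of SSML specification shown in Fig. 2, the sketch below builds a prosody-annotated fragment for a ‘sadness’-like vocal pattern and submits it to a locally running Mary-TTS server through its standard HTTP interface. The prosody values, break durations, voice name, and text are placeholders and do not reproduce the exact parameters of the study; the actual per-emotion pitch contours, speech rates, and break times are summarized in Table 1.

```python
import urllib.parse
import urllib.request

# Illustrative SSML for a 'sadness'-like vocal pattern: lowered pitch, slow
# rate, and intra/inter-sentence breaks. The numeric values are placeholders,
# not the parameters actually used in Table 1.
SSML_SADNESS = """<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody rate="slow" pitch="-15%">
    The story ends <break time="400ms"/> with the loss of a dear friend.
  </prosody>
  <break time="800ms"/>
  <prosody rate="slow" pitch="-20%">
    Nothing could be done to change it.
  </prosody>
</speak>"""

def synthesize(ssml: str, out_path: str = "sadness.wav") -> None:
    """Send SSML to a local Mary-TTS HTTP server (default port 59125) using
    its 'process' endpoint; the voice name is only an example and should be
    replaced by an installed voice."""
    params = urllib.parse.urlencode({
        "INPUT_TEXT": ssml,
        "INPUT_TYPE": "SSML",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": "en_US",
        "VOICE": "cmu-slt-hsmm",  # example voice; an assumption
    }).encode()
    with urllib.request.urlopen("http://localhost:59125/process", data=params) as r:
        with open(out_path, "wb") as f:
            f.write(r.read())

synthesize(SSML_SADNESS)
```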

Table 1 illustrates the proposed vocal patterns of the target emotions, in which pitch contours are characterized by sets of parameters inside parentheses (Footnote 10). Speech rates of the target emotions vary between that of the ‘sadness’ emotion (lowest rate) and that of the ‘anger’ emotion (highest rate). The inter- and intra-sentence break times were set experimentally in the proposed vocal design to enhance the affective expressivity of speech. The inter-sentence break time indicated for each emotion represents the silence periods separating sentences, during which both the lips and jaw of the robot make particular expressions to clarify the expressed emotion (Sect. “Facial Expressivity”). Besides, the intra-sentence break time indicates short silence periods within a sentence, which are necessary to clarify the expressivity of the ‘sadness’ and ‘fear’ emotions. The experimental parameters shown in Table 1 are an example of the prosody patterns of parts of the texts converted to speech for each emotion. The vocal patterns of the remaining parts of the texts differ slightly from the parameters indicated in Table 1 so as to further clarify tonal variation over the text. Some emotions required using interjections (with tonal stress) to enhance their expressivity, like ‘Ugh’ and ‘Yuck’ for the ‘disgust’ emotion, and ‘Oh my God’ for the ‘fear’ emotion.

Table 1 The design of the vocal pattern and contour behavior of each target emotion

Facial Expressivity

The proposed design of facial expressions for the target emotions is grounded in the well-known Facial Action Coding System (FACS) [30]. This design is explained in Table 2, which shows the robot face joints corresponding to each emotion and the gestures designed to clarify the meaning of the facial expressions. The FACS units shown in bold font represent the prototypical units most frequently observed across subjects [76], whereas the other units are observed at lower percentages. The underlined action units are those with corresponding joints in the robot face.

Table 2 The design of facial expressions, modeled on the robot, for each target emotion

The complexity behind modeling emotion on the face of the robot lies in the absence of joints equivalent to specific FACS descriptors (e.g., cheek raiser and nose wrinkler). Therefore, and inspired by the experimental designs of McColl and Nejat [55] and Wallbott [89] (Footnote 11), we experimentally imposed some additional body gestures in order to reduce the negative effect of the absent joints on affective expressivity. These additional gestures include neither head gestures nor arm–hand gestures, which are generated by the gesture generator [5] (except for the italic-font gestures indicated in Table 2, which are required to enhance the affective expressivity of the robot) (Footnote 12). For example, the combination of the additional gestures of neck rotation and raising front-bent arms helps to better express the ‘disgust’ emotion (Fig. 3), which can give the participant the feeling that the robot does not like the interaction context. In a similar way, the emotions of ‘sadness’, ‘fear’, and ‘anger’ are assigned the gestures of bowing the head and covering the eyes with a hand, a mouth-guard with a hand, and downward head-shaking, respectively, to emphasize their affective expressivity (Fig. 3). The main role of the additional right smile and left smile face joints of the ‘fear’ emotion is to depress the corners of the open mouth so as to enhance its affective expressivity; however, neither joint has an equivalent FACS descriptor (Table 2). Generally, modeling persuasive facial expressions on a robot is not a trivial task because of the mechanical limitations of its joints (unlike the case with 3D agents). Therefore, the robot behavioral multimodality can play an important role in enhancing its affective expressivity during interaction, where each behavior modality can clarify the other modalities.
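To illustrate how the kind of design summarized in Table 2 could be encoded in software, the snippet below maps a few target emotions to face-joint targets plus the additional clarifying body gesture discussed above. All joint names and activation values are hypothetical placeholders (the actual ALICE joint interface is not reproduced here); only the extra gestures and the role of the ‘right smile’/‘left smile’ joints follow the text.

```python
# Hypothetical encoding of a Table 2-like mapping: per-emotion face-joint
# targets plus the additional clarifying body gesture. Joint names and
# activation values are placeholders; only the extra gestures follow the text.
FACE_DESIGNS = {
    "disgust": {
        "face_joints": {"brow_lowerer": 0.8, "upper_lip_raiser": 0.6},  # placeholder joints
        "extra_gesture": "neck_rotation_with_raised_front_bent_arms",
    },
    "sadness": {
        "face_joints": {"inner_brow_raiser": 0.7, "lip_corner_depressor": 0.8},  # placeholder joints
        "extra_gesture": "bow_head_and_cover_eyes_with_hand",
    },
    "fear": {
        # 'right_smile'/'left_smile' depress the corners of the open mouth;
        # the negative values are illustrative only.
        "face_joints": {"right_smile": -0.5, "left_smile": -0.5, "jaw_open": 0.6},
        "extra_gesture": "mouth_guard_with_hand",
    },
    "anger": {
        "face_joints": {"brow_lowerer": 1.0, "lid_tightener": 0.7},  # placeholder joints
        "extra_gesture": "downward_head_shaking",
    },
    "happiness": {
        "face_joints": {"lip_corner_puller": 0.9},  # placeholder joint
        "extra_gesture": None,
    },
}

def facial_script(emotion: str) -> dict:
    """Return the face-joint targets and the optional extra body gesture that
    compensates for missing FACS-equivalent joints (e.g., cheek raiser)."""
    return FACE_DESIGNS.get(emotion, {"face_joints": {}, "extra_gesture": None})

print(facial_script("fear"))
```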

Fig. 3

Synthesized facial expressions by ALICE robot

Figure 4 demonstrates the eyelids animation script, where three points of the motion path are described through position and time. In order to achieve a temporal alignment between the eyelids animation and speech, if the synthesized speech duration is longer or shorter than the eyelids animation duration, the model determines new time instants for the animation points based on the speech duration, the animation duration, and the previous time instants of the animation points. The segmentation of human speech is achieved through a voice activity detection algorithm embedded in the speech recognition system, which can efficiently label speech and silence segments. If the silence period represents an inter-sentence break time, as discussed in Sect. “Affective Speech Synthesis”, both the robot’s jaw and lips perform specific animations (e.g., pulling the corners of the lips to express happiness) that can enhance the meaning of the expressed emotion (Fig. 3). This is due to the robot’s mechanical constraints, which prevent synchronizing lip motion with speech while performing an animation with both the jaw and lips at the same time. Meanwhile, if the silence period corresponds to an intra-sentence break time, the jaw of the robot opens to express fear and closes to express sadness during the silence period (Sect. “Affective Speech Synthesis”).
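A minimal sketch of the temporal alignment step described above is given below, under the assumption that the alignment amounts to a proportional rescaling of the keyframe times to the synthesized speech duration; the keyframe format is assumed and does not reproduce the actual animation script of Fig. 4.

```python
def rescale_keyframes(keyframes, animation_duration, speech_duration):
    """Rescale (time, position) animation points so that the eyelids animation
    spans the synthesized speech duration. The keyframe format is assumed."""
    if animation_duration <= 0:
        return list(keyframes)
    scale = speech_duration / animation_duration
    return [(t * scale, position) for t, position in keyframes]

# Example: a three-point eyelid path designed for 2.0 s, stretched to 3.5 s
# of synthesized speech.
eyelid_path = [(0.0, 1.0), (1.0, 0.2), (2.0, 1.0)]  # (time in s, openness)
print(rescale_keyframes(eyelid_path, animation_duration=2.0, speech_duration=3.5))
```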

Fig. 4

Eyelids animation script

Experimental Setup

In this section, we discuss the employed database for emotion induction in the participants. In addition, we present the experimental hypotheses, design, and scenario of interaction between the participant and ALICE robot developed by RoboKind (Footnote 13).

Database

The employed database contains 20 silent videos excerpted from feature films (with durations varying from 29 to 236 s) for inducing 6 target emotions in the participants: neutral, disgust, anger, happiness, fear, and sadness (Footnote 14). Hewig et al. [36] discussed and validated the efficiency of this database in eliciting emotions in humans. Consequently, in this paper, we do not focus on measuring the level of emotion induction in the participants (Footnote 15). During the experiments, we used 12 expressive videos from the database to elicit the target emotions: six main videos were used during the experiments, and six standby videos (i.e., one standby video per emotion) were shown automatically in case any of the main videos failed to elicit the corresponding target emotion (Table 3).

Table 3 The target emotions and their corresponding feature films. The main videos were extracted from the films shown in bold font, while the other films provided the standby videos

Hypotheses

Human emotion experience is generally characterized by different cognitive constructs, such as: (1) emotion clarity, i.e., the clear and definite representation of emotion [24]; (2) emotion differentiation, i.e., the ability to accurately identify and represent emotion in discrete categories (e.g., sadness, disgust, and happiness), which is conceptually correlated with emotion clarity in that each construct could enhance the other [14]; (3) emotional complexity, i.e., the broad range of emotion experiences associated with a tendency to accurately differentiate between emotion categories [42]; and (4) emotional awareness, i.e., the knowledge complexity of emotion, which represents the ability to be aware of emotion [54]. Each of these constructs is measured through indices calculated from subjects’ self-reports [44].

In this research study, the main objective is to generate a well-perceived multimodal robot behavior so as to enhance the interaction with a human user. Consequently, the clarity and differentiation constructs of emotion are directly addressed by investigating the ability of the participants to recognize the affective content of the generated robot behavior (Footnote 16). Besides, the participants evaluate the effect of the robot behavioral multimodality on interaction. The subjective evaluation of the generated multimodal robot behavior investigates mainly the clarity (Footnote 17)/expressivity (Footnote 18) and the recognizability (i.e., emotion differentiation) of the affective robot behavior, in addition to the synchronization between the behavior modalities, etc. The examined hypotheses in this study are:

  • H1: The combination of facial expressions, speech, and arm and head gestures will increase the clarity of the affective content of the robot behavior to the participant compared to the experimental conditions with fewer combined affective cues (i.e., fewer combined modalities of communication).

  • H2: Facial expressions will enhance the recognizability and expressivity of the robot’s emotion by the participant compared to the experimental conditions without facial expressions.

  • H3: The characteristics of the arm and head gestures of the robot (e.g., acceleration) will enhance the expressivity of the robot behavior so as to help the participant in recognizing and distinguishing between emotions compared to the experimental conditions without arm and head gestures.

The effect of emotional speech on interaction was not examined through an independent hypothesis because this would require either:

  • Comparing the robot behavior that employs affective speech to the robot behavior that does not employ affective speech (i.e., using neutral or monotone speech). However, the proposed system in this study uses the synthesized speech as the basis for generating gestures synchronized with facial expressions (Fig. 1). Therefore, synthesizing monotone speech would lead to facial expressions and gestures with characteristics different from those generated using affective speech. Consequently, it is not possible to compare the robot behaviors under similar experimental conditions (e.g., the robot behavior expressed through speech and gestures in the case of affective speech and the same behavior in the case of monotone speech, as the gestures in the two cases will be different).

  • Comparing the robot behavior that employs affective speech to the robot behavior that does not employ speech at all. This condition does not match the context of the non-mute human–robot interaction (Footnote 19).

Consequently, these two cases are excluded from our experimental design. Instead, the important role of speech in enhancing the affective content of interaction would be measured directly through analyzing the post-experiment questionnaires.

Experimental Design

The experimental design is based on a between-subjects design (Footnote 20) within a human–robot interaction context, in which the speech synthesized by the Mary-TTS (text-to-speech) engine is used as an input to the gesture generator [5] so as to synthesize gestures adapted to the synthesized affective speech (Footnote 21). This constitutes an implicit validation of the expressivity of the speech synthesized using the Mary-TTS engine, in that the more natural (i.e., human-like) the synthesized speech is, the more natural the corresponding generated gestures will be (to be evaluated by the participants). Besides, generating adaptive gestures based on speech characteristics is concordant with the cognitive co-production process of synchronized speech and gestures that humans undergo [56]. The synthesized speech and gestures (in addition to facial expressions) are modeled on the robot and evaluated by the participants at the end of each conducted experiment. The proposed design includes the following robot behavior conditions:

  • The robot produces a multimodal affective behavior expressed through facial expressions, speech, and arm and head gestures (i.e., condition C1-SFG).

  • The robot produces a multimodal affective behavior expressed through facial expressions and speech (i.e., condition C2-SF).

  • The robot produces a multimodal affective behavior expressed through arm and head gestures, and speech (i.e., condition C3-SG).

  • The robot produces a unimodal affective behavior expressed through speech (i.e., condition C4-S).

To validate the first hypothesis, the experimental conditions C1-SFG, C2-SF, C3-SG, and C4-S were examined. For the second hypothesis, the conditions C2-SF and C4-S were examined, and for the third hypothesis, the conditions C3-SG and C4-S were examined. We excluded the condition of the robot producing a unimodal behavior expressed through facial expressions or arm and head gestures without using speech, and the condition of the robot producing arm and head gestures combined with facial expressions without using speech (Sect. “Hypotheses”). The condition C3-SG was excluded from validating the second hypothesis and the condition C2-SF was excluded from validating the third hypothesis because the facial expressions of the robot are associated with the additional body gestures detailed in Table 2. Consequently, separating the conditions of facial expressions and gestures (i.e., conditions C2-SF and C3-SG) guarantees differentiating between the gestures accompanying the robot’s facial expressions and the basic head–arm gestures synthesized by the generator, which leads to a better evaluation of the effect of facial expressions and gestures on interaction.

The literature reveals serious efforts to elicit emotion in humans under laboratory conditions. These emotion induction methods include dyadic interaction tasks [70], affective imagery [47], music [69], and pictures and film clips [85]. In this study, the robot and the participant, in each condition, follow an expressive stimulus set of short videos through six experiments that aim to elicit six different target emotions (Fig. 5) after a short preparation phase (Footnotes 22, 23). The scenario of interaction is described as follows:

  • The robot invites the participant to watch some videos and discuss their storylines.

  • The robot asks the participant to express his/her opinion about the content of the projected video. Afterwards, it detects and segments predefined emotion-referring keyword(s) from the recognized comment of the participant, such as “This is disgusting!”, “This video is expressing sadness!”, etc. This helps in detecting the video’s emotional content (from the participant’s point of view) so as to trigger a corresponding adaptive robot behavior (a sketch of this keyword matching and the standby-video fallback follows this list).

  • After listening to the participant’s comment on the video, the robot comments on the content of the video through speech, facial expressions, and/or head–arm gestures.

  • If the displayed video induces in the participant an emotion other than the concerned target emotion, so that the system detects keyword(s) belonging mainly to another category of emotion-referring keywords, the robot comments through a neutral behavior. Thereupon, the robot asks the participant to watch a different video so as to retry inducing the emotion that the first video failed to elicit (Table 4).

  • The experiment terminates for the examined target emotion. Thereupon, the participant evaluates the generated behavior of the robot through a 7-point Likert scale questionnaire. This evaluation focuses on the relevance of the robot behavior to the context of interaction in terms of its emotional content and expressivity, the synchronization between the robot behavior modalities (i.e., speech, facial expressions, and/or gestures according to the examined experimental condition), etc. (Footnote 24). Afterwards, a new interaction experiment starts to examine a different, randomly selected, target emotion.

  • After all the experiments terminate, the experimenter and the robot express gratitude to the participant for his/her time and cooperation.
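The following sketch illustrates the keyword detection and the standby-video fallback described in the scenario above. The keyword lists, video identifiers, and participant comment are illustrative placeholders, not the actual keyword sets or video names used in the study.

```python
# Illustrative keyword detection and standby-video fallback for one trial.
EMOTION_KEYWORDS = {
    "disgust": ("disgusting", "gross", "yuck"),
    "sadness": ("sad", "sadness", "tragic"),
    "happiness": ("happy", "funny", "joy"),
    "fear": ("scary", "frightening", "fear"),
    "anger": ("angry", "anger", "outrageous"),
}

def label_comment(comment: str) -> str:
    """Return the emotion whose keywords appear in the participant comment,
    or 'neutral' if none match."""
    lowered = comment.lower()
    for emotion, words in EMOTION_KEYWORDS.items():
        if any(word in lowered for word in words):
            return emotion
    return "neutral"

def run_emotion_trial(target_emotion, main_video, standby_video, get_comment):
    """Show the main video; if the detected label differs from the target, the
    robot would comment neutrally and retry once with the standby video."""
    for video in (main_video, standby_video):
        detected = label_comment(get_comment(video))
        if detected == target_emotion:
            return video, detected  # the adaptive affective behavior follows
    return standby_video, detected  # the target emotion was not elicited

# Example usage with a stubbed participant comment.
video, label = run_emotion_trial(
    "sadness", "main_sadness_clip", "standby_sadness_clip",
    get_comment=lambda v: "this video is expressing sadness!")
print(video, label)
```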

Fig. 5

Interaction experiments between the robot and two different participants

Table 4 shows that the majority of the target emotions were correctly recognized by the participants after watching the first videos in the four experimental conditions, while the second (standby) videos were rarely needed. This shows that the videos chosen from the employed silent video database (Footnote 25) had convincing emotional contents [36]. Afterwards, the participants were first asked through each post-experiment questionnaire to evaluate the characteristics of the generated robot behavior in terms of each modality of communication (i.e., speech, gestures, and facial expressions) independently, and then they were asked to evaluate and recognize the affective content of the generated combined behavior. We argue that this supports separating the emotional contents of the videos from those of the robot behaviors during evaluation (as supported by the findings of [35], Footnote 26), up to the level that allows for investigating the experimental conditions successfully (Footnote 27).

Table 4 The recognition scores of the videos’ affective contents in the different experimental conditions

Experimental Results

A total of 60 participants were recruited to validate the different hypotheses examined in this study. The participants were equally distributed over the experimental conditions (i.e., 6 females and 9 males for every condition). The participants were undergraduate and postgraduate students and employees at ENSTA-ParisTech (with ages varying from 20 to 57 years old, \(M=29.6\) and \(SD=9.4\)). Overall, \(66.7\%\) of the participants had a technical background and \(33.3\%\) had a non-technical background. Moreover, only \(40\%\) of the participants had previous interaction experience with robots, while \(60\%\) of them had not interacted with robots beforehand. The effect of synthesizing adaptive robot behavior on interaction with the participants, in addition to the personality- and gender-based evaluations of the emotional expressivity of the generated behavior, is illustrated in the following subsections:

Effect of the Robot Behavioral Multimodality on Interaction

For the first hypothesis, ANOVA analysis revealed a significant difference in the clarity of the affective robot behavior expressed through a combination of speech, facial expressions, and head–arm gestures with respect to the robot behaviors, with fewer affective cues, expressed through speech, speech and facial expressions, and speech and head–arm gestures (\(F[3,356]=21.15\), \(p<0.001\)) (Fig. 6). Tukey’s HSD comparisons indicated a significant difference in clarity between the robot behavior expressed through combined speech, facial expressions, and head–arm gestures (i.e., condition C1-SFG) on one side, and the robot behaviors expressed through speech (i.e., condition C4-S) (\(p<0.001\)) (the lowest among the four conditions), speech and facial expressions (i.e., condition C2-SF) (\(p<0.001\)), and speech and head–arm gestures (i.e., condition C3-SG) (\(p<0.001\)) on the other side. Moreover, no significant difference was observed between the conditions C2-SF and C3-SG in the clarity of the robot behavior.
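For reference, this kind of analysis can be reproduced with standard Python tooling, as sketched below. The ratings are random placeholders (90 per condition, i.e., 15 participants by 6 emotions, matching the reported degrees of freedom), not the study data.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Placeholder clarity ratings (1-7 Likert) for the four conditions.
ratings = {
    "C1-SFG": rng.integers(4, 8, 90),
    "C2-SF":  rng.integers(3, 7, 90),
    "C3-SG":  rng.integers(3, 7, 90),
    "C4-S":   rng.integers(2, 6, 90),
}

# One-way ANOVA over the four conditions.
F, p = f_oneway(*ratings.values())
print(f"F = {F:.2f}, p = {p:.4f}")

# Tukey's HSD pairwise comparisons.
scores = np.concatenate(list(ratings.values()))
groups = np.repeat(list(ratings.keys()), [len(v) for v in ratings.values()])
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```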

Fig. 6

Human perception of the emotional expressivity of the robot behavior in the four experimental conditions, where the clarity of behavior refers to the maximum level of expressivity it can show

For the second hypothesis, the robot behavior expressed through facial expressions and speech was found by the participants to be more expressive and better adapted to the context of interaction than the behavior expressed through speech (\(F[1,178]=18.63\), \(p<0.001\)). Moreover, the participants considered that speech and facial expressions were synchronized, with an average score of \(M=5.9\), \(SD=0.9\). Furthermore, they did not find any significant inconsistency in affective content between speech and facial expressions, with an average score of \(M=1.8\), \(SD=1.2\). Over and above, they agreed that speech was less expressive than facial expressions, with an average score of \(M=4.4\), \(SD=1.5\). Table 5 shows that facial expressions improved only the score of recognizing the ‘anger’ emotion in the experimental condition C2-SF with reference to the condition C4-S, which is related to the limitations of the Mary-TTS engine in designing a highly expressive vocal pattern for this particular emotion (Sect. “Affective Speech Synthesis”); the facial expressions thus enhanced the affective content of speech, giving the participants the feeling that the robot was expressing the ‘anger’ emotion persuasively. On the contrary, the facial expressions of the robot had a negative effect on the score of recognizing the ‘disgust’ emotion in the experimental condition C2-SF with reference to the condition C4-S, which is related to the limited affective facial expressivity for this particular emotion (Sect. “Facial Expressivity”).

Table 5 The scores of recognizing the target emotions, modeled on the robot, in different conditions

For the third hypothesis, the affective content of the robot behavior expressed through both arm and head gestures and speech was considered more expressive and observable by the participants than that of the behavior expressed through speech (\(F[1,178]=17.16\), \(p<0.001\)). Furthermore, the participants found that speech and gestures were synchronized, with an average score of \(M=6.1\), \(SD=0.7\), and they agreed that the execution of gestures was fluid, with an average score of \(M=5.35\), \(SD=1.03\). Over and above, the participants found that gestures were more expressive than speech, with an average score of \(M=4.25\), \(SD=1.43\). The affective content of the arm and head gestures of the robot behavior was reasonably recognized by the participants (Table 5). The generated gestures improved only the score of recognizing the ‘anger’ emotion in the experimental condition C3-SG with reference to the condition C4-S, which is related to gesture characteristics, such as velocity and acceleration, that enhanced the robot’s expressivity for this emotion.

Figure 6 illustrates the variation in the affective expressivity of the robot behavior in the experimental conditions C1-SFG, C2-SF, C3-SG, and C4-S. The robot behavioral expressivity in each condition was investigated through a different group of 15 participants. The combination of different affective cues (i.e., speech, facial expressions, and head–arm gestures in the condition C1-SFG) provided clarity to the robot behavior with respect to the other conditions that employ fewer affective cues, as argued in the first hypothesis (Footnote 28). Meanwhile, no significant difference was observed in the robot behavioral expressivity between the conditions C2-SF and C3-SG.

A significant result was found by two-way ANOVA analysis in the perception of the affective robot behavior with the clarity–expressivity of facial expressions (i.e., condition C2-SF) and emotion as independent variables (\(F[2,168]=4.47\), \(p=0.0359\)). However, no significant result was found with the clarity–expressivity of gestures (i.e., condition C3-SG) and emotion as independent variables. After running one-way ANOVA analysis on each emotion individually, the results showed that both the ‘happiness’ and ‘disgust’ emotions were found significantly more clear when expressed through combined speech, facial expressions, and head–arm gestures (i.e., condition C1-SFG) than when expressed through speech and facial expressions (i.e., condition C2-SF) (\(F[1,28]=3.36\), \(p=0.077\); \(F[1,28]=6.133\), \(p=0.0196\)). Meanwhile, no significant differences were found for the ‘neutral’, ‘sadness’, ‘fear’, and ‘anger’ emotions. Over and above, a statistically significant main effect was observed for the experimental conditions (\(F[3,335]=12.738\), \(p<0.001\)) and for the target emotions (\(F[5,335]=5.527\), \(p<0.001\)).
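Similarly, the two-way analysis with experimental condition and emotion as independent variables can be sketched with the statsmodels formula interface. The data frame below is a random placeholder with the same factor structure (four conditions by six emotions, 15 ratings per cell), not the study data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
conditions = ["C1-SFG", "C2-SF", "C3-SG", "C4-S"]
emotions = ["neutral", "happiness", "sadness", "fear", "anger", "disgust"]

# Placeholder ratings: 15 participants per condition rating each emotion.
rows = [{"condition": c, "emotion": e, "rating": int(rng.integers(2, 8))}
        for c in conditions for e in emotions for _ in range(15)]
df = pd.DataFrame(rows)

# Two-way ANOVA on the ratings with condition and emotion as factors.
model = ols("rating ~ C(condition) * C(emotion)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```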

Human Personality-Based Evaluation of the Affective Robot Behavior

Personality is a determinant factor in human social interaction, which has a long-term consistent effect on the generated multimodal human behavior. Reisenzein and Weber [67] defined personality as the coherent and collective pattern of emotion, cognition, behavior, and goals over time and space. Moreover, Revelle and Scherer [68] discussed the strong relationship between personality and emotion. Several research studies in neuroscience discussed the correlation between the neurobiological structure of personality extraversion and the activation in different brain regions involved in emotional responding (which implies perceiving the affective content of interaction) [39]. This potential correlation between personality extraversion and emotion perception would be investigated within a human–robot interaction context so as to study the effect of human personality on perceiving the emotional expressivity of the robot behavior.

Personality Extraversion-Based Evaluation of the Affective Robot Behavior

Table 6 indicates the numbers of introverts and extraverts in each experimental condition, where the personality scores were calculated based on the online Big5 personality model questionnaire [32] (Footnote 29) that each participant filled in at the beginning of the experiments. Figure 7 illustrates the effect of the human extraversion personality trait (in terms of the introversion and extraversion of personality) on the perception of the affective expressivity of the robot behavior. In the four experimental conditions, both the introverts and extraverts showed a similar tendency in evaluating the emotional expressivity of the robot behavior, where the perception of the extraverted participants of the robot behavior was, in general, higher than that of the introverted participants. The difference in evaluating the expressivity of the robot behavior between the introverted and extraverted participants was found statistically significant (through T-Test) in the different conditions: C1-SFG (\(p<0.02\)), C2-SF (\(p<0.03\)), C3-SG (\(p<0.03\)), and C4-S (\(p<0.02\)).

Table 6 The numbers of the introverted and extraverted participants in the four experimental conditions

This evaluation difference between the introverted and extraverted participants is concordant with the findings of Shulman and Hemenover [77], Petrides et al. [63], and Atta et al. [12], who argued that emotional intelligence (Footnote 30) is positively correlated with personality extraversion. Consequently, the extraverted participants are expected to have a relatively higher emotional intelligence than the introverted participants, which may explain why they gave higher ratings to the robot behavior in the four experimental conditions. This evaluation of the affective expressivity of the robot behavior matches the findings illustrated in Fig. 6, where the evaluation of the robot behavior in the condition C1-SFG was higher than that in the other conditions.

Gender-Based Evaluation of the Affective Robot Behavior

Both the female and male participants positively perceived the affective expressivity of the generated robot behavior in the four experimental conditions (Fig. 8). The ratings indicated in the figure show that the perception of the male participants of the affective robot behavior in the four conditions was generally higher than that of the female participants. This relatively higher preference of the male participants for the emotional expressivity of the female ALICE robot matches the findings of Siegel et al. [78] and Park et al. [60], who found that participants considered opposite-sex robots to be more attractive and convincing during interaction.

The difference between the ratings of the male and female participants for the emotional expressivity of the robot behavior indicated in Fig. 8 was found statistically significant (through T-Test) in the different conditions: C1-SFG (\(p<0.02\)), C2-SF (\(p<0.03\)), C3-SG (\(p<0.02\)), and C4-S (\(p<0.001\)). Furthermore, the male participants considered the generated multimodal robot behavior more adapted to the emotional content of the videos, and consequently to the context of interaction, than the female participants did (\(p<0.01\)), which supports the hypothesis of the opposite-sex attraction of human users to robots.

The observable difference between the ratings of the male and female participants in the condition C4-S, compared to those in the conditions C1-SFG, C2-SF, and C3-SG (Fig. 8), could be related to the low affective expressivity of the robot behavior employing only speech in interaction with respect to the behaviors that employ speech combined with facial expressions and/or gestures (Fig. 6). We argue that facial expressions and gestures enhanced the affective content of the robot behavior, which slightly improved the perception of the female participants of the generated behavior in the conditions C1-SFG, C2-SF, and C3-SG, while keeping the opposite-sex attraction hypothesis of human users to robots valid. These findings, however, require a larger number of male and female participants to give a clearer picture of their perceptual differences regarding the robot behavior.

Discussion

We propose an integrated system for generating affective robot behavior expressed through speech, gestures, and facial expressions within a human–robot interaction context. We investigate the multimodality of the generated robot behavior and its positive effect on interaction with the participants through three experimental hypotheses that compare the robot behaviors combining at least two modalities (speech, gestures, and/or facial expressions) to those with fewer affective cues. Moreover, we investigate any potential effect of human personality and gender on the way the robot behavior was perceived during interaction.

The proposed framework (Sect. “System Architecture”) integrates different subsystems for affective speech synthesis, gesture generation based on speech prosody, and an expressive robot with highly credible facial expressions, which allows for studying the effect of the robot behavioral multimodality on interaction with a wide scope. The obtained results demonstrate the positive role that affective cues could play in enhancing the expressivity of the robot behavior so as to help the participants in perceiving its emotional content appropriately. These findings are clearly illustrated in Fig. 6, where the robot behavior that combines speech, facial expressions, and gestures attained a higher level of expressivity (i.e., clarity level) than the other robot behaviors with less affective cues.

When searching the related literature for results concordant with our findings on affect recognition using multimodal information, we found that the majority of the existing approaches were unimodal (or bimodal), employing, among others, gestures and facial expressions, speech and gestures, and speech and physiological signals [38, 93]. Meanwhile, a few studies discussed emotion recognition with more than two modalities of information. Castellano et al. [21] used speech, gestures, and facial expressions to recognize emotions, and reported that using multimodal data for affect recognition highly increased the scores with respect to the cases that use fewer modalities of data [73]. Generally, our proposed system shares the same concept of the positive effect of multimodality on emotion perception and recognition. However, it is designed to generate and embody a multimodal behavior, expressed through speech, gestures, and facial expressions, on the ALICE robot so as to be positively perceived by the participants, which makes it a contribution distinct from the other approaches in the related literature.

Over and above, the results report some differences in the perception of the affective robot behavior by the introverted–extraverted and male–female participants, where the perception of the extraverted and male participants of the robot behavior was generally higher than that of the introverted and female participants in the different conditions of behavior (Figs. 7 and 8). While we tried to explain these findings in light of similar findings in the related literature (Sects. “Personality Extraversion-Based Evaluation of the Affective Robot Behavior” and “Gender-Based Evaluation of the Affective Robot Behavior”) so as to support our results, we believe that a larger number of introverted–extraverted and male–female participants is required in order to characterize their perceptual differences of the robot behavior more precisely. However, we argue that the current results could give useful insights into human perception of the affective robot behavior to other researchers interested in the field of human–robot interaction.

Fig. 7

Human personality-based evaluation (in terms of introversion and extraversion of personality) of the affective expressivity of the robot behavior

Fig. 8

Gender-based evaluation of the affective expressivity of the robot behavior

Conclusion

This paper introduces a framework for generating a multimodal robot behavior, expressed through speech, gestures, and/or facial expressions, adapted to the context of interaction with human users. A set of videos that aim to induce target emotions in the participants is employed during the experiments, upon which interactive discussions with the robot start around their affective contents. Each participant is exposed to only one of the four experimental conditions of multimodal–unimodal robot behaviors during the experiments. The system uses the Mary-TTS engine to generate emotional speech; however, the proposed vocal design requires using interjections and inter/intra-sentence break times in order to enhance the affective content of the synthesized speech. Besides, the gesture generator synthesizes head–arm gestures adapted to the generated speech. The proposed design of facial expressions requires using additional body gestures in order to increase their credibility and expressivity to the participants.

This paper validates the important role of the robot behavioral multimodality in enhancing the clarity of interaction compared to interaction conditions with fewer affective cues. Moreover, it discusses the positive effect of the designed facial expressions and gestures in enhancing the emotional expressivity and recognizability of the robot behavior. Over and above, it demonstrates the perceptual differences between the introverted–extraverted and male–female participants regarding the generated affective robot behavior. For future work, we are considering improving the gestural expressivity of the system through additional gesture generators. Moreover, we are considering ameliorating the affective expressivity of speech and facial expressions to make the generated multimodal robot behavior more persuasive and natural. Besides, we are considering integrating language models that can help the robot understand human language with a wider scope, instead of parsing keywords as in the system employed in this paper [10, 11].