Abstract
To interact and cooperate with humans in their daily-life activities, robots should exhibit human-like “intelligence”. This skill will substantially emerge from the interconnection of all the algorithms used to ensure cognitive and interaction capabilities. While new robotics technologies allow us to extend such abilities, their evaluation for social interaction is still challenging. The quality of a human–robot interaction cannot be reduced to the evaluation of the employed algorithms: we should integrate the engagement information that naturally arises during interaction in response to the robot’s behaviors. In this paper we present a practical approach to evaluating the engagement aroused during interactions between humans and social robots. We introduce a set of metrics, useful in direct, face-to-face scenarios, based on the behavioral analysis of the human partners. We show how such metrics can be used to assess how the robot is perceived by humans and how this perception changes according to the behaviors exhibited by the social robot. We discuss experimental results obtained in two human–robot interaction studies, with the robots Nao and iCub respectively.
1 Introduction
The development of social robots focuses on the design of living machines that humans should perceive as realistic, effective partners, able to communicate and cooperate with them as naturally as possible [13]. To this purpose, robots should be able to express, through their shapes and their behaviors, a certain degree of “intelligence” [31]. This skill entails the whole set of social and cognitive abilities of the robot, which makes the interaction “possible” in a human-like manner: exchanging verbal and nonverbal communication, learning how to predict and adapt to the partner’s response, ensuring engagement during interactions, and so on. The development of such abilities, for a robotics researcher, translates into the implementation of several complex algorithms to endow the robot with different cognitive and social capabilities: multimodal people tracking [5], face recognition [79], gesture recognition [17], speech recognition [18, 21], object learning [40], motor skills learning [7], action synchronization [2, 54], just to name a few. Each algorithm or module is evaluated in the metric space of its specific problem. If we limit ourselves to evaluating their performance or their coordination, we make the mistake of evaluating their efficiency as algorithms [59] rather than their capability to obtain a desired effect once they are used in a human–robot interaction context. If all those modules worked perfectly, would the robot be perceived as intelligent? The answer is not guaranteed: for example, recent studies showed that humans prefer to interact with a “non-perfect” robot that makes mistakes and exhibits some uncertainty or delays [1, 63].
Evaluating the quality of the experiences with a social robot is critical if we want robots to interact socially with humans to provide assistance and enter their private and personal dimension [67]. But how can we evaluate whether a robot is capable of engaging a human in social tasks? Do we have metrics to determine whether different robot behaviors can improve the quality of such human–robot interaction? Most importantly, can we find good metrics that can be retrieved by cheap sensors (e.g., a Kinect) and in “natural” interaction scenarios, without resorting to invasive measuring devices (e.g., eye-trackers or motion capture systems)?
Measuring the quality of the experiences [57] involving social robots can be quite a challenging task, involving the assessment of several aspects of the interaction, such as the user’s expectations, feelings, perceptions and satisfaction [23, 46]. A characterizing feature of the user experience is the ability of the robot to engage users in the social task. As stated by [55]: “Engagement is a category of user experience characterized by attributes of challenge, positive affect, endurability, aesthetic and sensory appeal, attention, feedback, variety/novelty, interactivity, and perceived user control”. The paper will focus on engagement as a characterizing feature of the quality of the experiences with social robots, defining it as “the process by which individuals involved in an interaction start, maintain and end their perceived connection to one another” [65]. In direct, face-to-face scenarios, measurable changes in the human partners’ behaviors will reflect this engagement through both physiological variations [51], such as heart rate [77] or skin conductivity changes [45], and movements [60], such as synchronous and asynchronous motions like nodding [47], gazing [58, 66] or mimicry [11]. Such movements correspond to the non-verbal body language that humans use to communicate with each other [14, 31]. In the light of this observation, it seems possible to infer the engagement of a person involved in a social interaction with the robot through an analysis of his non-verbal cues. By social interactions we do not refer exclusively to cooperative scenarios, in which, for instance, nodding or joint attention can be seen as feedback given to the partners. To some extent, the same holds for competitive and deceptive interactions [72], where the dynamics of non-verbal behaviors are still used as feedback to humans, for instance to communicate boredom, misunderstanding, rejection or surprise.
In any case, variations of non-verbal cues between study groups can inform about the engagement of the partners involved in the social task. This assumption is very general, able to encompass a large variety of social interactions, and it becomes a powerful instrument to evaluate and, in some cases, to manipulate the synergy between the peers.
Several social signals have been proposed in the literature to study engagement. Hall et al. [34] manipulated nonverbal gestures, such as nodding, blinking and gaze aversion, to study the perceived engagement of the human participants, assessed by a post-experimental questionnaire. Significant works focusing on engagement during verbal interactions were proposed by Rich and Sidner: in [58], manually annotated engagement was analyzed through mutual and directed gaze, and correlated with spoken utterances; in [66] and [65], manually labeled gaze signals were used to distinguish between head nods and quick looks; in [38], the authors combined different gaze behaviors, captured using an eye tracker, for conversational agents. Ivaldi et al. [39] also used post-experimental questionnaires to evaluate engagement, but obtained indirect measurements of engagement through the rhythm of interaction, the directional gaze and the timing of the response to the robot’s stimuli, captured using RGB-D data. In [60], engagement is automatically evaluated from videos of interactions with robots, using visual features related to the body posture, specifically the inclination of the trunk/back. Similar measures have been used to evaluate behaviors in medical contexts [12] using audio features and video analysis [10, 61, 62].
In this paper we propose a methodology to evaluate the engagement aroused during interactions between social robots and human partners, based on metrics that can be easily retrieved from off-the-shelf sensors. Such metrics are principally extracted through static and dynamic behavioral analysis of posture and gaze, and have been supporting our research activities in human–robot interaction.
We remark that our study is focused on methodologies that do not require intrusive devices that could make the human–robot interaction unnatural, such as eye-trackers or wearable sensors. We choose to work with cheap sensors like Kinects and microphones, which can be easily placed in indoor environments and are easy for ordinary people to accept. These features are important since we target real applications with users that are not familiar with robotics. Users’ perceptions and needs are elements that must be taken into account by the experimental and robotics setting [71].
2 Material and Methods
To evaluate the engagement, here we address direct, face-to-face interaction scenarios, where a robot is used to elicit behaviors in humans. This is the case, for example, of a robot playing interactive games with a human partner, or of a human tutoring the robot in a cooperative or learning task. The choice of such scenarios, as in Fig. 1, does not severely limit the validity or the use of the proposed methodology: many social interactions between humans usually occur in similar conditions.
In this scenario, we assume the human stands in front of the robot. An RGB-D sensor is placed close to the robot, in the most convenient position to capture (as much as possible) the environment and the interacting partner. The information about the robot status and behavior is continuously logged; at the same time, the RGB-D sensor captures, synchronizes and stores the data perceived about the environment and the human [3]. The human’s posture and head movements are tracked, according to the processing pipeline shown in Fig. 2. Such information is then statically and dynamically analyzed to retrieve: body posture variation, head movements, synchronous events, imitation cues and joint attention (Table 1).
2.1 People Tracking
The presence, position and activity of people interacting with the robot are processed by a posture tracking system based on the data captured by the RGB-D sensor. The humans’ locations and their body parts are tracked within the visible range of the camera to extract gestures, body movements and posture.
Precisely, the depth data perceived by the sensor is processed through a background subtraction algorithm [64]; from each point of the foreground, local features are computed and classified to determine which body part (among 31 possible patches) they belong to. Finally, through a randomized decision forest, a dense probabilistic skeleton is obtained. The joints of the body are calculated according to the position and the density of each patch.
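The depth background subtraction step can be sketched as follows. This is a minimal illustration, not the exact implementation of [64]: the median background model, the 150 mm threshold and the handling of zero-depth dropouts are our assumptions.

```python
import numpy as np

def build_background(depth_frames):
    """Median over a stack of depth frames captured with an empty scene."""
    return np.median(np.stack(depth_frames), axis=0)

def foreground_mask(depth, background, threshold_mm=150):
    """Pixels significantly closer than the background are foreground.

    Zero depth readings (sensor dropouts) are ignored.
    """
    valid = depth > 0
    return valid & ((background - depth) > threshold_mm)
```

The foreground mask feeds the per-pixel body-part classification; only foreground points need to be classified by the decision forest.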
The tracking algorithm provides a complete map of the body, characterized by the position and the orientation of 15 anatomical parts, including the two arms, the two legs, the torso and the head. Concerning the latter, the algorithm is not able to retrieve an accurate orientation of the head: to accomplish this task, we need a different, dedicated approach that we describe in the following.
2.2 Head Movements
Once the presence of the human partner is found and his body is tracked, the information about the head pose can be extracted.
From the 3D information about the body of the person interacting with the robot, the estimated position of the head is back-projected onto the RGB image captured by the sensor, to obtain the coordinates in the image space in which the face should appear. A rectangular area of the camera image, centred on such coordinates, is then cropped to retrieve the face, as shown in Fig. 3. A face tracking algorithm is then applied to retrieve the head pose: our face tracking implementation is based on a Constrained Local Model approach [24]. This class of trackers relies on a statistical model of the face built from a set of constrained landmarks capturing properties such as the face shape, its texture and its appearance. Such landmarks are used to approximate the contours of the face, the lips, the eyes and the nose. The algorithm iteratively adapts the shape defined by the landmarks to find the best fit; its result is the best-fitting set of facial landmarks approximating the actual face. From the facial landmarks, the orientation of the whole head is calculated and integrated into the full body model.
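The back-projection and cropping step can be sketched with a standard pinhole camera model. The intrinsic parameters and the 0.25 m nominal face size used below are hypothetical values for illustration only:

```python
import numpy as np

def project_point(p_xyz, fx, fy, cx, cy):
    """Project a 3D point (camera frame, metres) onto the image plane."""
    x, y, z = p_xyz
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))

def face_roi(image, head_xyz, fx, fy, cx, cy, face_size_m=0.25):
    """Crop a square region around the projected head position.

    The crop size scales inversely with depth, so the face fills
    roughly the same fraction of the patch at any distance.
    """
    u, v = project_point(head_xyz, fx, fy, cx, cy)
    half = int(round(fx * face_size_m / head_xyz[2] / 2))
    h, w = image.shape[:2]
    u0, u1 = max(0, u - half), min(w, u + half)
    v0, v1 = max(0, v - half), min(h, v + half)
    return image[v0:v1, u0:u1]
```

The cropped patch is then handed to the Constrained Local Model tracker, which only has to search a small, face-centred region instead of the whole frame.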
The head pose provides only an approximation of the gazing direction, since it cannot capture the eye movements or their direction. However, it remains an informative estimator when the potential targets of the interaction are displaced with respect to the person’s field of view. In such scenarios, the objects are located so that they are not visible unless the participants turn their heads toward them; participants are thus forced to turn the head toward the targets to gaze at them. In the absence of high resolution cameras that could provide more accurate images of the eyes, the head orientation provides a fair estimate of the human gaze direction. Most importantly, it does not require invasive devices such as wearable eye-trackers.
2.3 Static Analysis
We hereby extract and analyze the information related to the posture and gaze of the human interacting with the robot. Histograms are used to study the distribution of the measured data.
Figure 4 shows the 2D histogram of the position of each joint of a person performing the “Jumping jack” exercise. The time distribution of the joint/body positions is conveniently represented as a heat map overlapped with a snapshot of the person performing the movement. In this exercise, the person jumps from a standing position with the feet together and arms at the sides of the body, to a position with legs spread wide and hands over the head. The heat map shows hot spots over the positions in which the body joints spend more time. In particular, red spots depict the start/stop position of each joint during the jumping jack movement. The heat map allows one to capture, with a simple visualization, the posture information over time, such as the movement of the trunk and its stability. It is also able to show the variability of the trajectories of the arms and legs.
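The occupancy heat map underlying this visualization can be sketched as a normalized 2D histogram of the joint trajectory; bin count and spatial extent are free parameters of the analysis:

```python
import numpy as np

def joint_heatmap(xs, ys, bins=64, extent=None):
    """2D occupancy histogram of a joint trajectory.

    Normalised so each value is the fraction of time spent in that
    spatial bin; hot spots mark the positions held the longest.
    """
    hist, xedges, yedges = np.histogram2d(xs, ys, bins=bins, range=extent)
    return hist / hist.sum(), xedges, yedges
```

The same function applies unchanged to the gaze analysis of the next paragraph, with yaw and pitch angles in place of spatial coordinates.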
A similar analysis can be done for the gaze. Figure 5 shows the 2D histogram of a person gazing at a tutor, standing in front of him, and at objects on his two sides. The resulting heat map is again a very convenient visualization tool: it shows the focus of attention of the person, highlighting the corresponding hot spots of the head gazing towards the tutor and towards the two objects. It must be noted that the gaze direction is projected on the pitch-yaw plane of the head, since the gaze is approximated by the head orientation as described in Sect. 2.2.
One possible way to study the head movements is by applying data mining algorithms. In the bottom-left corner of Fig. 5, we can see the three areas found by applying a clustering algorithm, precisely k-means (\(\mathrm{k}=3\), as the 3 hot spots). The information related to the clusters, such as their mean, barycenter, density and variance, can be used to extract useful descriptive features of the gaze behaviors of the humans interacting with the robot. The analysis of such signals can also provide information about the head stabilization during fixation. As we will show in the next Section, the cluster information can be used, for example, to compare the outcomes of different experimental conditions.
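This clustering step can be sketched with a minimal Lloyd’s k-means over the (yaw, pitch) gaze samples. The naive initialization (k evenly spaced samples) and the choice of per-cluster descriptive features are our assumptions for illustration, not the exact pipeline used in the experiments:

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Minimal Lloyd's k-means clustering gaze samples (yaw, pitch)
    into k fixation areas. Naive init: k evenly spaced samples."""
    pts = np.asarray(points, dtype=float)
    centroids = pts[np.linspace(0, len(pts) - 1, k).astype(int)].copy()
    labels = np.zeros(len(pts), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = pts[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def cluster_stats(points, labels, k):
    """Descriptive features per cluster: occupancy fraction (time share
    of that fixation area) and total variance (fixation stability)."""
    pts = np.asarray(points, dtype=float)
    return [
        {"fraction": float(np.mean(labels == j)),
         "variance": float(pts[labels == j].var(axis=0).sum())
                     if np.any(labels == j) else 0.0}
        for j in range(k)
    ]
```

The occupancy fractions are the quantities compared between groups in Sect. 3, while the per-cluster variance quantifies head stabilization during fixation.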
2.4 Dynamic Analysis
The histogram analysis of the head movements and of the body posture gives only a partial description of the human behaviors, because it does not capture the movement dynamics.
The time analysis can be useful to identify synchronous events and phenomena occurring when the interacting agents synchronize their actions. In particular, it can capture causality, synchronous and delayed events [27]. Figure 6 reports the elicitation of a gaze in response to a robot attention cue: precisely, the robot points its arm twice towards some objects. The figure plots in blue the yaw movement of the human head, and in red the movement of the robot’s shoulder: we can see that the head movement is induced by the robot’s goal-directed action. Figure 7 shows in blue the behavior of the shoulder of a person, and in red the same data from the robot. The plot highlights how the robot fails the first elicitation, while the human imitates it in the second elicitation.
The time between the beginning of the robot’s arm movement and the beginning of the human gaze can be interpreted as a measure of the effectiveness of the nonverbal communication abilities of the robot [9]. Humans could respond as fast as if they were communicating with another human, if the robot were readable and “interactable” [39]. If the robot lacks communication abilities, humans could struggle to understand the communication cue, thus responding more slowly than in the ideal case; this delay, if not controlled, can make the response non-contingent. Lastly, humans may not respond at all, because of a complete inability of the robot to communicate with nonverbal behaviors or gestures.
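One simple way to operationalize this reaction-time measure is to detect each onset as the first threshold crossing of the signal’s deviation from its initial value. The thresholds and signal choice below are assumptions for illustration; any onset detector would fit the same scheme:

```python
import numpy as np

def onset_time(signal, t, threshold):
    """First time the absolute deviation from the initial value
    exceeds `threshold`; None if it never does."""
    dev = np.abs(np.asarray(signal) - signal[0])
    idx = np.flatnonzero(dev > threshold)
    return t[idx[0]] if idx.size else None

def reaction_time(robot_signal, human_signal, t, thr_robot, thr_human):
    """Delay between robot cue onset and human response onset.

    Returns None when either onset is missing, e.g. when the human
    did not respond to the elicitation at all."""
    t_r = onset_time(robot_signal, t, thr_robot)
    t_h = onset_time(human_signal, t, thr_human)
    if t_r is None or t_h is None:
        return None
    return t_h - t_r
```

The `None` case corresponds to the failed elicitations visible in Fig. 7, and should be counted separately rather than averaged in.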
The synchrony of human–robot movements and the contingent response to reciprocal cues are critical features for evaluating the quality of imitation and joint attention [30, 48]. Figure 8 highlights the joint attention elicited by the robot towards a human: in blue the yaw head movement of the human, in red the robot’s. Here, the robot is always able to elicit joint attention, as there is a contingent response to each attention cue—a gaze towards an object on the left or right.
Among the temporal features of an interaction, we can consider the rhythm of interaction [39] or the pace of interaction [58]: this measure relates to the time between consecutive interactions. The closer the pace of human–robot interaction is to that of human–human interaction, the more the interaction is perceived as natural [35].
An important matter is the automatic discovery of events, such as the beginning and end of interactions. This can be relatively easy from the robot’s point of view, since its actions are typically determined by a state machine or some parametrized policy: it is trivial to get the time of the perception events triggering a behavior. On the contrary, it is trickier to retrieve the events describing human behaviors from the flow of RGB-D data. One possible way to discriminate easily between activity and inactivity periods is to analyze the time spectrum of the joint trajectories, and threshold the energy of such signals computed across a sliding window.
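The sliding-window energy thresholding can be sketched as follows; the window length and energy threshold are free parameters to be tuned on the joint-velocity data (the values in the comments are assumptions):

```python
import numpy as np

def activity_segments(velocity, window, energy_thr):
    """Label each sample active/inactive by the mean signal energy
    computed over a centred sliding window (odd `window` length)."""
    energy = np.convolve(velocity ** 2, np.ones(window) / window, mode="same")
    return energy > energy_thr

def segment_bounds(active):
    """Start/end indices (end exclusive) of each contiguous active run."""
    padded = np.concatenate(([0], active.astype(int), [0]))
    d = np.diff(padded)
    starts = np.flatnonzero(d == 1)
    ends = np.flatnonzero(d == -1)
    return list(zip(starts, ends))
```

The resulting segment boundaries provide the human-side events that can be aligned with the robot’s logged state-machine transitions.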
3 Case Studies
The presented methods have been successfully employed in two different human–robot interaction experiments. Both experiments focused on a triadic interaction where the robot tried to engage the human partner during a specific task. The first case study is a social teaching experiment, where a human teaches the color of some objects to the iCub humanoid robot [39]. In the second case study, the Nao robot tries to elicit behaviors in children affected by autism spectrum disorder and in typically developing children. This section presents the two studies and reports the results obtained by applying our evaluation methods to discriminate between behaviors from different conditions in the task and different groups.
3.1 Interactive Behaviors Assessment
In this scenario, the robot interacts with a human partner to improve its knowledge about the environment. The two peers stand in front of each other, as shown in Fig. 9. The robot can interrogate the human about the objects on the table, to discover their color properties. A simple speech recognition system, based on a fixed dictionary [59], is used to retrieve the verbal information from the human. The match between color information and object is possible thanks to the shared attention system: the robot is capable of selecting, among the different objects, the one currently observed by the human. The ability of the robot to retrieve the focus of attention of the human is based on the estimation of the partner’s head orientation provided by the head tracking system. Remarkably, the head orientation is not only used for the post-experiment evaluation of the interaction, but also at runtime, to provide the robot control system with information about the gaze of the human partner. This way, the robot can gaze in the same direction.
In this tutoring scenario, described in detail in [39], the authors investigated whether the initiative of the robot could produce an effect on the engagement of the human partner. The experiments consisted of a teaching phase, where the robot had to learn the colors of all the objects, and a verification phase, where it had to tell the human the colors of all the objects. The authors manipulated the robot’s initiative in the teaching phase, as shown in Fig. 10.
Two conditions were tested. In the first condition (RI) the robot initiates the interaction by selecting an object, gazing at it, and interrogating the human about its properties. In the second condition (HI) the human decides which object to teach, by gazing at it once the robot is ready. The experiments were performed by 13 participants without previous interactions with the robot: 7 people (\(26 \pm 3\) years old) in the RI case, 6 people (\(22 \pm 1\) years old) in the HI case.
Head movements have been analyzed with the methods discussed in the previous section. Figure 11 shows samples of the estimated gaze of some participants. Both static and dynamic features related to the verification stage have been retrieved. The static analysis of the gaze shows four hot spots in both conditions. These hot spots correspond to the head gazing at the robot and at the three objects placed on the table. The differences between the two conditions are highlighted by the clustering of the data using the k-means algorithm, as shown in Fig. 12.
The dynamic analysis of the head movements shows statistically significant differences between the two groups: the reaction time in response to the robot’s attention stimuli over an object in the verification stage is faster if the robot initiates the interaction \((\hbox {p}<0.005)\) than if the human does. This result is confirmed \((\hbox {p}<1.5\hbox {e}{-}5)\) by the analysis of the pace of the interaction, i.e., the time interval between consecutive robot attention stimuli during the verification stage. The pace is faster when the robot manifests proactive behaviors, initiating the interaction.
3.2 Autism Assessment
The proposed evaluation methodology has been used in an interactive scenario to assess differences between children affected by autism spectrum disorder (ASD) and typically developing (TD) children. In this assessment scenario, described in detail in [6], a robot is placed in front of the child and used as an instrument to elicit joint attention. As shown in Fig. 13, two images, of a cat and of a dog, conveniently placed in the environment, are used as attention targets for the two peers. The RGB-D sensor provides the robot with the capability to look at the child and, at the same time, stores all the information related to the behavior of the child, paired and synchronized with the movements of the robot.
The experiment is composed of three stages in which the robot tries to induce joint attention by increasing the informative content it provides to the human. In the first stage the robot gazes at the two focuses of attention; then it gazes and points at them; finally it gazes, points and vocalizes “look at the cat”, “look at the dog”, as shown in Fig. 14.
Thirty-two children have been chosen for the experiments:

- Group ASD: 16 people (13 males, 5 females), \(9.25 \pm 1.87\) years old.
- Group TD: 16 people (9 males, 6 females), \(8.06 \pm 2.49\) years old.
In this case, head movements and posture have been analyzed and compared between the two groups. Using generalized linear mixed models, we found a significantly higher variance of the yaw movements in TD children than in the children with ASD (\(\hbox {b} = 1.66, \hbox {p} = 0.002\)). The analysis also showed a significant effect on the yaw movements according to the induction modality used to stimulate joint attention: higher variance has been found in vocalizing + pointing compared to pointing (\(\hbox {b} = 1.52, \hbox {p} < 0.001\)) and compared to gazing only (\(\hbox {b} = 1.55, \hbox {p} < 0.001\)). At the same time, the analysis of the pitch movements revealed a lower variance in TD children (\(\hbox {b} = -0.84, \hbox {p} = 0.019\)) than in the children affected by ASD.
As highlighted in Fig. 15, both the heat maps of the head pitch and yaw movements show a central hot spot: this area represents the gaze of the child towards the robot. The two lobes corresponding to the two focuses of attention on the sides of the room are less pronounced in ASD children than in TD children. An analysis of the clusters obtained using k-means on the TD children’s data shows that the left and right directions together gathered 30.2 % of all the occurrences. Applying the same k-means model to the ASD children’s data shows that left and right represented just 8.72 % of all the occurrences (Fisher’s exact test, p \(=\) 2.2\(\times 10^{-16}\)): during the joint attention task, TD children gazed at the focuses of attention placed in the room 4.6 times more frequently than the children with ASD (95 % confidence interval 4.4–4.6). These results highlight a higher response to the robot’s elicitation by typically developing children, and a lower stability of the gazing in ASD children.
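The group comparison of cluster-occupancy counts can be sketched as a Fisher exact test on a 2×2 contingency table (target-directed vs. robot-directed gaze occurrences, per group). For self-containedness we sketch the one-sided variant from the hypergeometric distribution; the counts in the usage note are hypothetical, not the experimental data:

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact test on the 2x2 table [[a, b], [c, d]]:
    probability, under the hypergeometric null, of observing a table
    at least as extreme (top-left cell equal to `a` or larger)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        # math.comb returns 0 when the table is infeasible for this x
        p += comb(row1, x) * comb(n - row1, col1 - x) / denom
    return p
```

For example, `fisher_exact_one_sided(30, 70, 9, 91)` would test whether a hypothetical 30 % target-directed rate in one group exceeds 9 % in the other; in practice one would typically use a library routine (e.g. `scipy.stats.fisher_exact`) for the two-sided test reported above.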
A similar analysis has been performed using the body pose (Fig. 16) and the body posture data (Fig. 17). In particular, the displacement of each child from the zero position showed a higher stability in TD children: using multivariate regression, the pose variance was significantly lower than in ASD children along all axes (x, estimate \(=\) 28.1, p \(=\) 0.001; y, estimate \(=\) 7, p \(=\) 0.006; z, estimate \(=\) 12, p \(=\) 0.009). A similar behavior has been found in the analysis of the body posture data, considering the pitch and the yaw of the trunk. Also in this case, the ASD children’s data are less stable than the TD children’s: posture variance was significantly lower in the TD children than in the ASD children along all axes (x, estimate \(=\) 13.9, p \(=\) 0.0016; y, estimate \(=\) 9.2, p \(=\) 0.016; z, estimate \(=\) 1.6, p \(=\) 0.003). Such results highlight a lower stability of the body posture in ASD children than in TD children.
4 Discussion
We proposed in this paper a methodology for analyzing the engagement between humans and social robots in direct, face-to-face, interactive scenarios. The methodology is based on an evaluation of the engagement aroused by a robot during the interaction, focusing on the nonverbal behaviors expressed by the human partner. Both static and dynamic interaction cues have been considered, as they can be used to extract different meaningful measures. The ones described in Sect. 2 were able to characterize different aspects of the social interaction between humans and robots in two different use cases.
In both scenarios, the human and the robot established a mutual communication. In such contexts, a correct comprehension and a proper use of nonverbal behaviors are essential tools to achieve an “optimal” interaction: to provide readable behaviors, and to arouse in human partners the illusion of a social intelligence. The importance of nonverbal behaviors has been highlighted by developmental sciences [50]. Toddlers learn about the world in a social way. They develop communication skills through nonverbal cues, and such skills gradually evolve together with verbal language [70]. Imitation, joint attention, gesticulation and synchrony are all learned in the very first stages of childhood development, and seem to be pivotal traits of the developmental process [43, 69]. In adulthood, these become semi-automatic, almost involuntary behaviors, influenced by culture, used in daily communication, eventually in combination with spoken language to reinforce it or to completely alter its meaning.
The measurement of nonverbal signals during interactions with robots can provide information about the engagement between the peers [27]. The static analysis of the movements of the body articulations can reveal how humans respond to the robot’s stimuli, whether they respond as engaged partners or not. The analysis of the gaze behavior can be used to model the attention system of the human partner and improve joint attention. The dynamic analysis can be used to study motor resonance and the synchrony of movements, and can improve imitation and gestures. A robot that aims to capture the attention of the human partner should leverage all those nonverbal cues to increase the engagement.
4.1 A Practical Set of Measures
Similar measures can be retrieved using motion capture systems. However, such systems usually rely on marker-based technologies: they require passive or active beacons that must be worn by the user. This not only increases the complexity of the system, but also critically reduces the naturalness of the interaction. The proposed system, instead, is based on a simple RGB-D camera, a marker-less technology that can still be used to track human movement [4, 33]. Despite its lower resolution, this system allows researchers to explore the engagement in very natural scenarios, without the restrictions and the complexity imposed by wearable devices and marker holders.
While such measures have been developed to enable studies in naturalistic settings, they can be aggregated with the features obtained from the physiological responses of the participants, in specially designed experiments during which participants would forget the existence of the worn sensors, so as to establish interactions as natural as possible. In such a case, it would be possible to capture a larger range of possible interaction dynamics and, at the same time, to study the neurophysiological bases of the engagement [28, 76].
Several studies in social robotics make use of post-experiment questionnaires to gather information about the engagement after the experiments [41, 42]. Unfortunately, while quick and easy to analyze, questionnaires can be strongly affected by several kinds of biases [22]. Without being exhaustive, it is possible to find at least three important sources of errors in questionnaires: their design, the experimental subjects, and the experimenter. The design of the questionnaire can introduce artifacts due to the complexity, ambiguity or specificity of the questions, or due to the number of answer options, too few or too many (question wording [56]). The subjects can also introduce errors, because of their unconscious will to be positive experimental subjects and to provide socially desirable responses (response bias [32]). Lastly, the researchers can also be a source of error, with their tendency to interpret the answers as a confirmation of their hypothesis (confirmation bias [53]). The measures presented in this paper can be used as a practical and objective tool to explore the interaction with robots; they can also serve as a complement to verify and eventually reinforce the results obtained by questionnaires and surveys.
4.2 Readability of Robot’s Nonverbal Cues
As social animals, humans are extraordinarily able to infer information about their partners, and to build models of the other and of their society. Nonverbal behaviors play a central role for making this inference [29].
In the first scenario, the presented metrics have been used to show that humans react faster to the attention cues of a proactive robot. It is possible to speculate about manipulating the proactive behavior of the robot to strengthen the engagement, regulate the rhythm of interaction, and arouse in people the perception of social intelligence. The engagement, here, comes essentially from the readability provided by the nonverbal cues. This result is also confirmed in the experiment with children affected by ASD and TD children: a significant difference between the two groups has been found according to the amount of information expressed by the nonverbal language of the robot. The more modalities the robot uses to communicate its focus of attention (from gazing, to gazing and pointing, to gazing, pointing and vocalizing), the more readable its behavior becomes to the children.
The results obtained in the two case studies confirm that the proposed measures are effective for studying engagement. These metrics can be used by the robot as a continuous, online feedback signal to evaluate (and eventually manipulate) the engagement with the human partner [5].
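As a minimal illustration of such an online feedback signal (a sketch under our own assumptions, not the implementation used in the studies), the robot could maintain a smoothed estimate of engagement from the human's reaction delays to its attention cues: short, regular delays map to high engagement, long delays to low engagement. The class name, the smoothing factor `alpha`, and the saturation delay `max_delay_s` are all hypothetical choices for this example.

```python
class EngagementEstimator:
    """Toy online engagement proxy built from human reaction delays.

    Hypothetical parameters: `alpha` is an exponential-smoothing factor
    in (0, 1]; `max_delay_s` is the reaction delay (seconds) mapped to
    zero engagement.
    """

    def __init__(self, alpha=0.3, max_delay_s=5.0):
        self.alpha = alpha
        self.max_delay_s = max_delay_s
        self.engagement = 0.5  # start from a neutral estimate

    def update(self, reaction_delay_s):
        """Fold one observed human reaction delay into the estimate."""
        # Clamp the delay and map it linearly to [0, 1]:
        # an instant reaction -> 1.0, a delay >= max_delay_s -> 0.0.
        delay = min(max(reaction_delay_s, 0.0), self.max_delay_s)
        sample = 1.0 - delay / self.max_delay_s
        # An exponential moving average keeps the signal continuous,
        # so the robot can react to trends rather than single events.
        self.engagement = (1 - self.alpha) * self.engagement + self.alpha * sample
        return self.engagement


est = EngagementEstimator()
for delay in [1.2, 0.8, 0.9, 3.5]:  # seconds between robot cue and human response
    level = est.update(delay)  # a value in [0, 1] the robot could act on
```

Such a signal could, for instance, trigger a more proactive robot behavior whenever the estimate drops below a threshold.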
Future studies, however, will focus on the use of the presented metrics in long-term scenarios, in which the novelty effect of the robot becomes less relevant over time. In such settings people will interact with the robot day by day, becoming accustomed to its behaviors; at the same time, human subjects could adapt their own behaviors to the robot.
4.3 The “bias” of the Anthropomorphic Design
People have a natural tendency to project human-like features onto animals and inanimate objects. This is the so-called "anthropomorphism" [49]: as "social machines", we seek in the unknown the same intelligence patterns we are used to recognizing in our peers, projecting our social intelligence. The robot is not perceived as a machine: people frequently have the illusion that the robot understands them, needs their help and wants to communicate. During an interaction, the human can naturally develop a feeling of "partnership" with the robot [15]. The anthropomorphic design of robots can improve the readability of their behaviors, facilitating the interaction with human partners [16].
The robots used in our experiments, iCub and Nao, have a baby-like, humanoid shape, which makes them particularly suited for interaction but also introduces an anthropomorphism bias in their human partners. These robots communicate implicitly, through their design alone, very human-like traits such as personality, emotions and intentions, and arouse a sense of co-presence [78]. The presented metrics can be used to study the perceived engagement with other types of robots. They should also be able to highlight differences due to different types of robot, even if it is difficult to make predictions about the human reaction to non-humanoid or "headless" robots. It would be very interesting to see whether our results with humanoids hold in the case of androids and very human-like robots. A relational analysis with respect to the uncanny valley would not be so obvious [36, 37]. Our intuition is that the presented metrics should be able to highlight the different reactions and behaviors of the human partners in all cases, but it is difficult to imagine how much the results would diverge in similar experiments involving androids (perhaps revealing aversion effects).
We plan future experiments in which we will use the proposed metrics to assess the engagement between humans and different types of robots. Since the robot design directly constrains the span of possible behaviors, we will design such a study carefully, considering the limits and capabilities of each robot and evaluating their "social intelligence" on comparable tasks and desired behaviors.
4.4 Are We Measuring Social Intelligence?
Defining the concept of "intelligence" is a non-trivial problem [44]. Intelligence could intuitively be associated with the ability of humans to understand the world. However, this definition still lacks generality, since different kinds of intelligence can be observed in living beings. The idea of intelligence in humans is context-dependent. The psychometric approach to human intelligence provides a definition according to three points of view [68]: abstract intelligence, the ability to understand and manage ideas and symbols; mechanical intelligence, the capability of working with concrete objects; and social intelligence [20], the "ability to get along with others" [73].
These definitions can also be employed in robotics, with an interesting parallelism [25, 26]. Abstract intelligence can be identified with the capability of robots to learn, reason using symbols, explore their knowledge and deduce new facts. This roughly corresponds to the area of "Artificial Intelligence". Mechanical intelligence can be related to perceptuo-motor or body intelligence, the ability to interact with the physical world, to perceive it and to coordinate proper actions on it. This kind of intelligence comes from the robot's embodiment. The robot should be able to capture the relevant information from the world and to link it to high-level, abstract symbols. Reasoning on these symbols should take into account the physical possibilities and capabilities provided by the robot's embodiment in its physical world.
Finally, social intelligence refers to the ability of robots to act socially with humans, to communicate and interact with them in a human-like way, following the social behaviors and rules attached to their role, and to learn and adapt their own behaviors throughout their lifetime, incorporating shared experiences with other individuals into their understanding of self, of others, and of the relationships they share [16]. The domain of "Social Signal Processing" [74] aims to provide computers and robots with social intelligence, addressing the correct perception, accurate interpretation and appropriate display of social signals [75].
Expressing social intelligence is a key feature for achieving an optimal interaction with humans. The perception of "social intelligence" can be aroused if the robot is capable of exhibiting social cues: accomplishing coherent behaviors, following social rules and communicating with humans in a natural way.
The proposed methodology focuses on the analysis of humans' nonverbal behaviors during interaction with robots. We can speculate about the use of the presented metrics as a feedback signal of the social intelligence perceived by the humans. From this point of view, responsive behaviors produced by a robot will induce in the human partners a perception of intelligence that can be quantitatively captured by the proposed measures, by observing changes in the humans' reactions to the robot's social cues. This interpretation emerges from the experiments we discussed.
In the first experiment, it is possible to speculate about the social intelligence expressed by the robot and perceived by the human partner, according to slight changes in the plot of the interaction. Questionnaires given to the participants of the experiments reported the robot that initiates the interaction as more intelligent. In our view this can be attributed to the increased readability of the proactive case, which makes the human "aware" of the robot's status and creates the illusion of a greater intelligence than in the other case. This illusion could be one of the reasons why humans interact notably faster.
The second experiment is remarkable since autism is characterized by a lack of social intelligence [8, 19]. Here, the behaviors shown by the robot and, at the same time, the plot of the interaction do not vary between the two conditions, so the differences are due to the different abilities of the children to recognize social cues. The proposed engagement metrics highlight the lack of social intelligence in the ASD children, showing behavioral differences between the two groups.
4.5 Conclusions
In this paper a set of metrics has been proposed to evaluate the engagement between humans and robots in direct, face-to-face scenarios. These metrics have been applied to study the interaction in two different use cases, characterized by natural settings and different objectives, and to effectively assess different human responses to robot behaviors. In both scenarios, the metrics confirmed the importance of studying nonverbal cues to improve the interactions between humans and robots. Moreover, thanks to their ease of use in real-world scenarios, due to the employment of non-intrusive sensors, such metrics have a strong potential for scalability and further generalization to different applications and contexts. Limitations of the metrics will be studied in future work, in particular in long-term scenarios, in which human subjects become accustomed to the behaviors of the robot, and with respect to different robotic designs, anthropomorphic and not.
References
Admoni H, Dragan A, Srinivasa SS, Scassellati B (2014) Deliberate delays during robot-to-human handovers improve compliance with gaze communication. In: Proceedings of the 2014 ACM/IEEE international conference on human–robot interaction, HRI ’14, pp 49–56
Andry P, Blanchard A, Gaussier P (2011) Using the rhythm of nonverbal human-robot interaction as a signal for learning. IEEE Trans Auton Ment Dev 3(1):30–42
Anzalone SM, Chetouani M (2013) Tracking posture and head movements of impaired people during interactions with robots. In: New trends in image analysis and processing-ICIAP 2013. Springer, Berlin, pp 41–49
Anzalone SM, Ghidoni S, Menegatti E, Pagello E (2013) A multimodal distributed intelligent environment for a safer home. In: Intelligent autonomous systems 12. Springer, Berlin, pp 775–785
Anzalone SM, Ivaldi S, Sigaud O, Chetouani M (2013) Multimodal people engagement with icub. In: Biologically inspired cognitive architectures 2012. Springer, Berlin, pp 59–64
Anzalone SM, Tilmont E, Boucenna S, Xavier J, Jouen AL, Bodeau N, Maharatna K, Chetouani M, Cohen D (2014) How children with autism spectrum disorder behave and explore the 4-dimensional (spatial 3d+ time) environment during a joint attention induction task with a robot. Res Autism Spectr Disord 8(7):814–826
Argall BD, Browning B, Veloso M (2011) Teacher feedback to scaffold and refine demonstrated motion primitives on a mobile robot. Robot Auton Syst 59(3–4):243–255
Baron-Cohen S (1997) Mindblindness: an essay on autism and theory of mind. MIT press, Cambridge
Bertenthal BI, Boyer TW, Han JM (2012) Social attention is not restricted to the eyes: pointing also automatically orients direction of attention. The Annual Meeting of the Psychonomic Society, Minneapolis, MN
Boucenna S, Anzalone S, Tilmont E, Cohen D, Chetouani M (2014) Learning of social signatures through imitation game between a robot and a human partner. Auton Ment Dev IEEE Trans 6(3):213–225
Boucenna S, Gaussier P, Andry P, Hafemeister L (2014) A robot learns the facial expressions recognition and face/non-face discrimination through an imitation game. Int J Soc Robot 6(4):633–652
Boucenna S, Narzisi A, Tilmont E, Muratori F, Pioggia G, Cohen D, Chetouani M (2014) Interactive technologies for autistic children: a review. Cogn Comput 6(4):1–19
Breazeal C (2003) Toward social robots. Robot Auton Syst 42:167–175
Breazeal C, Kidd CD, Thomaz AL, Hoffman G, Berlin M (2005) Effects of nonverbal communication on efficiency and robustness in human–robot teamwork. In: IEEE/RSJ international conference on intelligent robots and systems, pp 383–388
Breazeal CL (2000) Sociable machines: expressive social exchange between humans and robots. Ph.D. thesis, Massachusetts Institute of Technology
Breazeal CL (2004) Designing sociable robots. MIT press, Cambridge
Brethes L, Menezes P, Lerasle F, Hayet J (2004) Face tracking and hand gesture recognition for human–robot interaction. In: IEEE international conference on robotics and automation, vol 2. IEEE, pp 1901–1906
Brick T, Scheutz M (2007) Incremental natural language processing for hri. In: ACM/IEEE international conference on human–robot interaction, HRI ’07. ACM, New York, pp 263–270
Bruner J, Feldman C (1993) Theories of mind and the problems of autism. In: Baron-Cohen SE, Tager-Flusberg HE, Cohen DJ (eds) Understanding other minds: perspectives from autism. Oxford University Press
Cantor N, Kihlstrom JF (1987) Personality and social intelligence. Prentice-Hall, Englewood Cliffs
Cantrell R, Scheutz M, Schermerhorn P, Wu X (2010) Robust spoken instruction understanding for hri. In: 5th ACM/IEEE international conference on human–robot interaction, pp 275–282
Choi BC, Pak AW (2005) A catalog of biases in questionnaires. Prev Chronic Dis 2(1):A13
Crespi N, Molina B, Palau C et al (2011) Qoe aware service delivery in distributed environment. In: Advanced information networking and applications (WAINA), 2011 IEEE Workshops of International Conference on, pp 837–842. IEEE
Cristinacce D, Cootes T (2006) Feature detection and tracking with constrained local models. In: Proceedings of British machine vision conference, vol 3. pp 929–938
Dautenhahn K (1995) Getting to know each other: artificial social intelligence for autonomous robots. Robot Auton Syst 16(2):333–356
Dautenhahn K (2007) Socially intelligent robots: dimensions of human–robot interaction. Philos Trans R Soc B 362(1480):679–704
Delaherche E, Chetouani M, Mahdhaoui A, Saint-Georges C, Viaux S, Cohen D (2012) Interpersonal synchrony: a survey of evaluation methods across disciplines. IEEE Trans Affect Comput 3(3):349–365
Delaherche E, Dumas G, Nadel J, Chetouani M (2014) Automatic measure of imitation during social interaction: a behavioral and hyperscanning-eeg benchmark. Pattern Recognit Lett. doi:10.1016/j.patrec.2014.09.002
Ekman P, Friesen WV (1981) The repertoire of nonverbal behavior: categories, origins, usage, and coding. In: Kendon A, Sebeok TA, Umiker-Sebeok J (eds) Nonverbal communication, interaction, and gesture: selections from Semiotica. Walter de Gruyter, pp 57–106
Fischer K, Lohan K, Saunders J, Nehaniv C, Wrede B, Rohlfing K (2013) The impact of the contingency of robot feedback on hri. In: Collaboration Technologies and Systems (CTS), 2013 international conference on. IEEE, pp 210–217
Fong T, Nourbakhsh I, Dautenhahn K (2003) A survey of socially interactive robots. Robot Auton Syst 42:143–166
Furnham A (1986) Response bias, social desirability and dissimulation. Personality Individ Differ 7(3):385–400
Ghidoni S, Anzalone SM, Munaro M, Michieletto S, Menegatti E (2014) A distributed perception infrastructure for robot assisted living. Robot Auton Syst 62(9):1316–1328
Hall J, Tritton T, Rowe A, Pipe A, Melhuish C, Leonards U (2014) Perception of own and robot engagement in human–robot interactions and their dependence on robotics knowledge. Robot Auton Syst 62(3):392–399
Harris TK, Banerjee S, Rudnicky AI (2005) Heterogeneous multi-robot dialogues for search tasks. In: Proceedings of the AAAI spring symposium intelligence, Citeseer
Ishiguro H (2006) Interactive humanoids and androids as ideal interfaces for humans. In: Proceedings of the 11th international conference on intelligent user interfaces. ACM, New York, pp. 2–9
Ishiguro H (2007) Android science. In: Robotics research. Springer, Berlin, pp 118–127
Ishii R, Shinohara Y, Nakano T, Nishida T (2011) Combining multiple types of eye-gaze information to predict users' conversational engagement. In: 2nd workshop on eye gaze in intelligent human machine interaction
Ivaldi S, Anzalone SM, Rousseau W, Sigaud O, Chetouani M (2014) Robot initiative in a team learning task increases the rhythm of interaction but not the perceived engagement. Front Neurorobotics 8(5):1–23
Ivaldi S, Nguyen S, Lyubova N, Droniou A, Padois V, Filliat D, Oudeyer PY, Sigaud O (2014) Object learning through active exploration. IEEE Trans Auton Ment Dev 6(1):56–72
Kamide H, Mae Y, Kawabe K, Shigemi S, Hirose M, Arai T (2012) New measurement of psychological safety for humanoid. In: Proceedings of the seventh annual ACM/IEEE international conference on human–robot interaction. ACM, New York, pp. 49–56
Kamide H, Mae Y, Takubo T, Ohara K, Arai T (2010) Development of a scale of perception to humanoid robots: Pernod. In: Intelligent robots and systems (IROS), 2010 IEEE/RSJ International Conference on. IEEE, pp 5830–5835
Kaplan F, Hafner V (2004) The challenges of joint attention. Lund University Cognitive Studies, Lund
Kihlstrom JF, Cantor N (2000) Social intelligence. Handb Intell 2:359–379
Kulic D, Croft EA (2007) Affective state estimation for human–robot interaction. Robot IEEE Trans 23(5):991–1000
Laghari KUR, Connelly K (2012) Toward total quality of experience: a qoe model in a communication ecosystem. Commun Mag IEEE 50(4):58–65
Lee C, Lesh N, Sidner CL, Morency LP, Kapoor A, Darrell T (2004) Nodding in conversations with a robot. In: CHI’04 extended abstracts on human factors in computing systems. ACM, New York, pp 785–786
Lee J, Chao C, Bobick AF, Thomaz AL (2012) Multi-cue contingency detection. Int J Soc Robot 4(2):147–161
Lemaignan S, Fink J, Dillenbourg P (2014) The dynamics of anthropomorphism in robotics. In: Proceedings of the 2014 ACM/IEEE international conference on human–robot interaction. ACM, New york, pp 226–227
Miller PH (2010) Theories of developmental psychology. Macmillan, London
Mower E, Feil-Seifer DJ, Mataric MJ, Narayanan S (2007) Investigating implicit cues for user state estimation in human–robot interaction using physiological measurements. In: The 16th IEEE international symposium on robot and human interactive communication, 2007 (RO-MAN 2007). IEEE, pp 1125–1130
Natale L, Nori F, Metta G, Fumagalli M, Ivaldi S, Pattacini U, Randazzo M, Schmitz A, Sandini G (2012) Intrinsically motivated learning in natural and artificial systems, chap. The iCub platform: a tool for studying intrinsically motivated learning. Springer, Berlin
Nickerson RS (1998) Confirmation bias: a ubiquitous phenomenon in many guises. Rev Gen Psychol 2(2):175
Obhi SS, Sebanz N (2011) Moving together: toward understanding the mechanisms of joint action. Exp Brain Res 211(3):329–336
O’Brien HL, Toms EG (2008) What is user engagement? A conceptual framework for defining user engagement with technology. J Am Soc Inf Sci Technol 59(6):938–955
Payne SL (1951) The art of asking questions. Princeton University Press, Princeton
Raake A, Egger S (2014) Quality and quality of experience. In: Quality of experience. Springer, Berlin, pp 11–33
Rich C, Ponsler B, Holroyd A, Sidner CL (2010) Recognizing engagement in human–robot interaction. In: Proceedings of ACM/IEEE international conference on human–robot interaction (HRI). ACM Press, New York, pp 375–382
Rousseau W, Anzalone SM, Chetouani M, Sigaud O, Ivaldi S (2013) Learning object names through shared attention. In: IROS-Int. workshop on developmental social robotics. pp 1–6
Sanghvi J, Castellano G, Leite I, Pereira A, McOwan PW, Paiva A (2011) Automatic analysis of affective postures and body motion to detect engagement with a game companion. In: 6th ACM/IEEE international conference on human–robot interaction. ACM, New York, pp 305–311
Scassellati B (2005) Quantitative metrics of social response for autism diagnosis. In: IEEE international workshop on robot and human interactive communication, 2005 (ROMAN 2005). IEEE, pp 585–590
Scassellati B (2007) How social robots will help us to diagnose, treat, and understand autism. In: Robotics research. Springer, Berlin, pp 552–563
Short E, Hart J, Vu M, Scassellati B (2010) No fair!! an interaction with a cheating robot. In: 5th ACM/IEEE international conference on human–robot interaction. ACM, New York, pp 219–226
Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, Cook M, Moore R (2013) Real-time human pose recognition in parts from single depth images. Commun ACM 56(1):116–124
Sidner C, Lee C, Kidds C, Lesh N, Rich C (2005) Explorations in engagement for humans and robots. Artif Intell 166(1):140–164
Sidner CL, Kidd CD, Lee C, Lesh N (2004) Where to look: a study of human–robot engagement. In: Proceedings of the 9th international conference on intelligent user interfaces. ACM, New York, pp 78–84
Tapus A, Mataric M, Scasselati B (2007) Socially assistive robotics [grand challenges of robotics]. IEEE Robot Autom Mag 14(1):35–42
Thorndike EL (1920) Intelligence and its uses. Harper’s magazine, New York
Tomasello M (1995) Joint attention as social cognition. In: Moore C, Dunham PJ (eds) Joint attention: its origins and role in development. Lawrence Erlbaum Associates, Inc. pp 103–130
Tomasello M, Farrar MJ (1986) Joint attention and early language. Child Dev 57:1454–1463
Vaussard F, Fink J, Bauwens V, Retornaz P, Hamel D, Dillenbourg P, Mondada F (2014) Lessons learned from robotic vacuum cleaners entering the home ecosystem. Robot Auton Syst 62(3):376–391
Vázquez M, May A, Steinfeld A, Chen WH (2011) A deceptive robot referee in a multiplayer gaming environment. In: International conference on Collaboration Technologies and Systems (CTS), 2011. IEEE, pp 204–211
Vernon PE (1933) Some characteristics of the good judge of personality. J Soc Psychol 4(1):42–57
Vinciarelli A, Pantic M, Bourlard H (2009) Social signal processing: survey of an emerging domain. Image Vis Comput 27(12):1743–1759
Vinciarelli A, Pantic M, Heylen D, Pelachaud C, Poggi I, D’Errico F, Schröder M (2012) Bridging the gap between social animal and unsocial machine: a survey of social signal processing. IEEE Trans Affect Comput 3(1):69–87
Weisman O, Delaherche E, Rondeau M, Chetouani M, Cohen D, Feldman R (2013) Oxytocin shapes parental motion during father-infant interaction. Biol Lett. doi:10.1098/rsbl.2013.0828
Yannakakis GN, Hallam J, Lund HH (2008) Entertainment capture through heart rate activity in physical interactive playgrounds. User Model User-Adapt Inter 18(1–2):207–243
Zhao S (2003) Toward a taxonomy of copresence. Presence 12(5):445–455
Zhao W, Chellappa R, Phillips PJ, Rosenfeld A (2003) Face recognition: a literature survey. ACM Comput Surv 35(4):399–458
Acknowledgments
This work was supported by the Investissements d'Avenir program (SMART ANR-11-IDEX-0004-02) through the Project EDHHI/SMART, the ANR Project Pramad, and by the European Commission, within the projects CoDyCo (FP7-ICT-2011-9, No. 600716) and Michelangelo (FP7-ICT No. 288241).
Anzalone, S.M., Boucenna, S., Ivaldi, S. et al. Evaluating the Engagement with Social Robots. Int J of Soc Robotics 7, 465–478 (2015). https://doi.org/10.1007/s12369-015-0298-7