1 Introduction

The development of social robots focuses on the design of living machines that humans should perceive as realistic, effective partners, able to communicate and cooperate with them as naturally as possible [13]. To this end, robots should be able to express, through their shapes and their behaviors, a certain degree of “intelligence” [31]. This skill entails the whole set of social and cognitive abilities of the robot that make the interaction possible in a human-like manner: exchanging verbal and nonverbal communication, learning to predict and adapt to the partner’s responses, ensuring engagement during interactions, and so on. For a robotics researcher, developing such abilities translates into implementing several complex algorithms that endow the robot with different cognitive and social capabilities: multimodal people tracking [5], face recognition [79], gesture recognition [17], speech recognition [18, 21], object learning [40], motor skills learning [7], action synchronization [2, 54], just to name a few. Each algorithm or module is evaluated in the metric space of its specific problem. If we limit ourselves to evaluating their performance or their coordination, we make the mistake of evaluating their efficiency as algorithms [59] rather than their capability to obtain a desired effect once they are used in a human–robot interaction context. If all those modules worked perfectly, would the robot be perceived as intelligent? The answer is not guaranteed: for example, recent studies showed that humans prefer to interact with a “non-perfect” robot that makes mistakes and exhibits some uncertainty or delays [1, 63].

Evaluating the quality of the experiences with a social robot is critical if we want robots to interact socially with humans, provide assistance and enter their private and personal dimension [67]. But how can we evaluate whether a robot is capable of engaging a human in social tasks? Do we have metrics to determine whether different robot behaviors can improve the quality of such human–robot interaction? Most importantly, can we find good metrics that can be retrieved with cheap sensors (e.g., a Kinect) and in “natural” interaction scenarios, without resorting to invasive measuring devices (e.g., eye-trackers or motion capture systems)?

Measuring the quality of the experiences [57] involving social robots can be quite challenging, as it involves assessing several aspects of the interaction, such as the user’s expectations, feelings, perceptions and satisfaction [23, 46]. A characterizing feature of the user experience is the ability of the robot to engage users in the social task. As stated by [55]: “Engagement is a category of user experience characterized by attributes of challenge, positive affect, endurability, aesthetic and sensory appeal, attention, feedback, variety/novelty, interactivity, and perceived user control”. This paper focuses on engagement as the characterizing feature of the quality of the experiences with social robots, defining it as “the process by which individuals involved in an interaction start, maintain and end their perceived connection to one another” [65]. In direct, face-to-face scenarios, measurable changes in the human partners’ behaviors reflect this engagement through both physiological variations [51], such as heart rate [77] or skin conductivity changes [45], and movements [60], such as synchronous and asynchronous motions like nodding [47], gazing [58, 66] or mimicry [11]. Such movements correspond to the non-verbal body language that humans use to communicate with each other [14, 31]. In the light of this observation, it seems possible to infer the engagement of a person involved in a social interaction with a robot through an analysis of their non-verbal cues. By social interactions we do not refer exclusively to cooperative scenarios, in which, for instance, nodding or joint attention can be seen as feedback given to the partners. To some extent, the same holds for competitive and deceptive interactions [72], where the dynamics of non-verbal behaviors are still used as feedback to humans, for instance to communicate boredom, misunderstanding, rejection or surprise. In any case, variations of non-verbal cues between study groups can inform about the engagement of the partners involved in the social task. This assumption is thus very general, able to include a large variety of social interactions, and it becomes a powerful instrument to evaluate and, in some cases, to manipulate the synergy between the peers.

Several social signals have been proposed in the literature to study engagement. Hall et al. [34] manipulated nonverbal gestures, such as nodding, blinking and gaze aversion, to study the perceived engagement of human participants, retrieved through a post-experimental questionnaire. Significant works focusing on engagement during verbal interactions were proposed by Rich and Sidner. In particular, in [58] manually annotated engagement was analyzed through mutual and directed gaze, and correlated with spoken utterances. In [66] and [65], the authors used manually labeled gaze signals, distinguishing between head nods and quick looks; in [38], the authors combined different gaze behaviors, captured using an eye tracker, for conversational agents. Ivaldi et al. [39] also used post-experimental questionnaires to evaluate engagement, but obtained indirect measurements of engagement through the rhythm of interaction, the directional gaze and the timing of the response to the robot’s stimuli, captured using RGB-D data. In [60], engagement is automatically evaluated from videos of interactions with robots, using visual features related to the body posture, specifically the inclination of the trunk/back. Similar measures have been used to evaluate behaviors in medical contexts [12] using audio features and video analysis [10, 61, 62].

In this paper we propose a methodology to evaluate the engagement aroused during interactions between social robots and human partners, based on metrics that can be easily retrieved from off-the-shelf sensors. Such metrics are principally extracted through static and dynamic behavioral analysis of posture and gaze, and have been supporting our research activities in human–robot interaction.

We remark that our study focuses on methodologies that do not require intrusive devices that could make the human–robot interaction unnatural, such as eye-trackers or wearable sensors. We chose to work with cheap sensors, like Kinects and microphones, that can be easily placed in indoor environments and are easily accepted by ordinary people. These features are important since we target real applications with users that are not familiar with robotics. Users’ perceptions and needs are elements that must be taken into account in the experimental and robotic setting [71].

2 Material and Methods

To evaluate engagement, we here address direct, face-to-face interaction scenarios, where a robot is used to elicit behaviors in humans. This is the case, for example, of a robot playing interactive games with a human partner, or of a human tutoring the robot in a cooperative or learning task. The choice of such scenarios, as in Fig. 1, does not strongly limit the validity or applicability of the proposed methodology: many social interactions between humans usually occur in similar conditions.

Fig. 1
figure 1

Typical human–robot interaction scenarios with the iCub robot [52]

In this scenario, we assume the human is standing in front of the robot. An RGB-D sensor is placed close to the robot, in the most convenient position to capture (as much as possible) the environment and the interacting partner. The information about the robot’s status and behavior is continuously logged; at the same time, the RGB-D sensor captures, synchronizes and stores the data perceived about the environment and the human [3]. The human’s posture and head movements are tracked, according to the processing pipeline shown in Fig. 2. This information is then statically and dynamically analyzed to retrieve: body posture variation, head movements, synchronous events, imitation cues and joint attention (Table 1).

Fig. 2
figure 2

The processing pipeline employed for the extraction of the social measures

Table 1 Static and dynamic metrics for evaluating the quality of dyadic and triadic face-to-face interactions

2.1 People Tracking

The presence, position and activity of people interacting with the robot are estimated by a posture tracking system based on the data captured by the RGB-D sensor. The humans’ locations and body parts are tracked in the visible range of the camera to extract gestures, body movements and posture.

Specifically, the depth data perceived by the sensor are processed through a background subtraction algorithm [64]; from each point of the foreground, local features are computed and classified to assess which body part (among 31 possible patches) they belong to. Finally, through a randomized decision forest, a dense probabilistic skeleton is obtained. The body joints are computed according to the position and the density of each patch.

The tracking algorithm provides a complete map of the body, characterized by the position and the orientation of 15 anatomical parts, including the two arms, the two legs, the torso and the head. Concerning the latter, the algorithm is not able to retrieve an accurate orientation of the head: to accomplish this task, we need a different, dedicated approach, described in the following.
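As an illustration, the snippet below is a minimal sketch (in Python) of how a simple posture descriptor, such as the trunk inclination used in the static analysis, could be derived from the tracker output. The representation of a skeleton frame as a dictionary of 15 named 3D joint positions is an assumption for illustration, not the actual data structure of our implementation.

```python
import numpy as np

# Hypothetical naming of the 15 tracked anatomical parts (illustrative only).
JOINTS = ["head", "neck", "torso", "l_shoulder", "l_elbow", "l_hand",
          "r_shoulder", "r_elbow", "r_hand", "l_hip", "l_knee", "l_foot",
          "r_hip", "r_knee", "r_foot"]

def trunk_inclination(frame):
    """Angle (degrees) between the torso->neck segment and the vertical axis.

    `frame` is assumed to be a dict mapping joint names to np.array([x, y, z])
    positions in meters, in the camera frame; the vertical axis is assumed to
    point along -y in that frame.
    """
    trunk = frame["neck"] - frame["torso"]
    vertical = np.array([0.0, -1.0, 0.0])
    cos_angle = trunk @ vertical / np.linalg.norm(trunk)
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```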

2.2 Head Movements

Once the presence of the human partner is detected and their body is tracked, the information about the head pose can be extracted.

From the 3D information about the body of the person interacting with the robot, the estimated position of the head is back-projected onto the RGB image captured by the sensor, to obtain the coordinates in the image space where the face should appear. A rectangular area of the camera image, centered on these coordinates, is then cropped to retrieve the face, as shown in Fig. 3. A face tracking algorithm is then applied to retrieve the head pose: our face tracking implementation is based on a Constrained Local Model approach [24]. This class of trackers relies on a statistical model of the face defined by a set of constrained landmarks describing the face shape, texture and appearance. These landmarks approximate the contours of the face, the lips, the eyes and the nose. The algorithm iteratively adapts the shape defined by the landmarks to find the best fit; its result is the best fitting set of facial landmarks approximating the actual face. From the facial landmarks, the orientation of the whole head is computed and integrated into the full body model.

Fig. 3
figure 3

The face detection and the head pose extraction
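For reference, the following is a minimal sketch of the back-projection and face-cropping step described above. It assumes registered depth and RGB streams, a standard pinhole camera model with known intrinsics, and an illustrative crop size; the actual implementation may differ in these details.

```python
import numpy as np

def project_to_image(p_xyz, fx, fy, cx, cy):
    """Project a 3D point (camera frame, meters) onto the RGB image plane
    using a pinhole model; fx, fy, cx, cy are sensor-specific intrinsics."""
    x, y, z = p_xyz
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    return u, v

def crop_face(rgb, head_xyz, intrinsics, half_size=80):
    """Crop a square region around the projected head position.

    rgb:        HxWx3 image array
    head_xyz:   3D head position from the skeleton tracker
    intrinsics: (fx, fy, cx, cy) of the RGB camera
    The crop is then passed to the face landmark tracker.
    """
    u, v = project_to_image(head_xyz, *intrinsics)
    h, w = rgb.shape[:2]
    u0, v0 = max(u - half_size, 0), max(v - half_size, 0)
    u1, v1 = min(u + half_size, w), min(v + half_size, h)
    return rgb[v0:v1, u0:u1]
```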

The head pose provides only an approximation of the gazing direction, since it cannot capture the eyes’ movements or their direction. However, it remains an informative estimator when the potential targets of the interaction are displaced with respect to the person’s field of view: in such scenarios, the objects are located so that they are not visible unless the participants turn their heads toward them, so participants are forced to turn their heads to gaze at the targets. In the absence of high-resolution cameras that could provide more accurate images of the eyes, the head orientation provides a fair estimate of the human gaze direction. Most importantly, it does not require invasive devices such as wearable eye-trackers.

2.3 Static Analysis

Here we extract and analyze the information related to the posture and gaze of the human interacting with the robot. Histograms are used to study the distribution of the measured data.

Figure 4 shows the 2D histogram of the position of each joint of a person performing the “jumping jack” exercise. The time distribution of the joint positions is conveniently represented as a heat map overlapped with a snapshot of the person performing the movement. In this exercise, the person jumps from a standing position, with the feet together and the arms at the sides of the body, to a position with the legs spread wide and the hands over the head. The heat map shows hot spots over the positions in which the body joints spend more time. In particular, red spots depict the start/stop position of each joint during the jumping jack movement. The heat map captures, with a simple visualization, the posture information over time, such as the movement of the trunk and its stability; it also shows the variability of the trajectories of the arms and legs.

Fig. 4
figure 4

The histogram heat map during the jumping jack exercise. Red spots highlight the start/stop position of each body articulation. (Color figure online)
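A heat map like the one in Fig. 4 can be obtained with a simple 2D histogram of the joint trajectories. The sketch below illustrates the idea using NumPy and Matplotlib; the number of bins and the color map are illustrative choices, not necessarily those used to produce the figures.

```python
import numpy as np
import matplotlib.pyplot as plt

def joint_heatmap(xs, ys, bins=64):
    """2D histogram ("heat map") of a joint's positions over time.

    xs, ys: arrays of the joint coordinates, one sample per frame.
    """
    hist, xedges, yedges = np.histogram2d(xs, ys, bins=bins)
    plt.imshow(hist.T, origin="lower", cmap="hot",
               extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
    plt.colorbar(label="time spent (frames)")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
```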

A similar analysis can be done for the gaze. Figure 5 shows the 2D histogram of a person gazing at a tutor standing in front of them and at objects on their two sides. The resulting heat map is again a very convenient visualization tool: it shows the focus of attention of the person, highlighting the corresponding hot spots of the head gazing towards the tutor and towards the two objects. It must be noted that the gaze direction is projected on the pitch-yaw plane of the head, since the gaze is approximated by the head orientation, as described in Sect. 2.2.

Fig. 5
figure 5

The histogram heat map of a person’s head movements. Peaks correspond to three different focuses of attention, in the center, on the left and on the right. The k-means classification of the data is overlapped

One possible way to study the head movements is by applying data mining algorithms. In the bottom-left corner of Fig. 5, we can see the three areas found by applying a clustering algorithm, namely k-means (\(\mathrm{k}=3\), as there are 3 hot spots). The information related to the clusters, such as their average, barycenter, density and variance, can be used to extract useful descriptive features of the gaze behaviors of the humans interacting with the robot. The analysis of such signals can also provide information about head stabilization during fixation. As we will show in the next section, the cluster information can be used, for example, to compare the outcomes of different experimental conditions.
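The sketch below illustrates this clustering step on the head yaw/pitch samples, using the k-means implementation from scikit-learn; the choice of k and the returned statistics are illustrative, not the exact implementation used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def gaze_clusters(yaw, pitch, k=3):
    """Cluster head yaw/pitch samples (degrees) into k focuses of attention
    and return simple per-cluster descriptive statistics."""
    data = np.column_stack([yaw, pitch])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    stats = []
    for i in range(k):
        members = data[km.labels_ == i]
        stats.append({
            "center": km.cluster_centers_[i],    # barycenter of the cluster
            "share": len(members) / len(data),   # fraction of all samples
            "variance": members.var(axis=0),     # spread around the center
        })
    return stats
```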

2.4 Dynamic Analysis

The histogram analysis of the head movements and of the body posture gives only a partial description of the human behaviors, because it does not capture the movement dynamics.

The temporal analysis can be useful to identify synchronous events and phenomena occurring when the interacting agents synchronize their actions. In particular, it can capture causality, synchronous and delayed events [27]. Figure 6 reports the elicitation of a gaze in response to a robot attention cue: specifically, the robot points its arm twice towards some objects. The figure plots in blue the yaw movement of the human head, and in red the movement of the shoulder of the robot: we can see that the head movement is induced by the robot’s goal-directed action. Figure 7 shows in blue the behavior of the shoulder of a person, and in red the same data from the robot. The plot highlights how the robot fails the first elicitation, while the human imitates it in the second one.

Fig. 6
figure 6

The evolution over time of the robot’s arm joint, in red, overlapped with the head yaw movements of the human, in blue. This highlights two synchrony events between the pointing gesture of the robot and the gazing behavior of the human. (Color figure online)

Fig. 7
figure 7

The evolution over time of the robot’s arm joint, in red, overlapped with the arm movements of the human, in blue. This highlights a synchrony event in terms of imitation, between the pointing gesture of the robot and the pointing behavior of the human. (Color figure online)

The time between the beginning of the robot’s arm movement and the beginning of the human gaze can be interpreted as a measure of the effectiveness of the nonverbal communication abilities of the robot [9]. Humans could respond as fast as if they were communicating with another human, if the robot were readable and “interactable” [39]. If the robot lacks communication abilities, humans could struggle to understand the communication cue, thus responding more slowly than in the ideal case; this delay, if not controlled, can make the response non-contingent. Lastly, humans may not respond at all, because of a complete inability of the robot to communicate with nonverbal behaviors or gestures.
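One possible way to compute such reaction times is sketched below, assuming the robot cue onsets are available from its logs and the human head yaw from the tracker; the velocity threshold and the search window are hypothetical tuning parameters, not values prescribed by our method.

```python
import numpy as np

def reaction_times(cue_times, t, yaw, vel_threshold=20.0, window=5.0):
    """Estimate the delay between each robot attention cue and the onset of
    the human head response.

    cue_times:     timestamps (s) at which the robot starts its cues
    t, yaw:        timestamps (s) and head yaw (deg) of the human
    vel_threshold: yaw velocity (deg/s) above which the head counts as moving
    window:        how long (s) after a cue to look for a response
    """
    t, yaw = np.asarray(t, dtype=float), np.asarray(yaw, dtype=float)
    vel = np.abs(np.gradient(yaw, t))
    delays = []
    for tc in cue_times:
        mask = (t >= tc) & (t <= tc + window)
        moving = np.nonzero(vel[mask] > vel_threshold)[0]
        delays.append(t[mask][moving[0]] - tc if moving.size else np.nan)
    return np.array(delays)  # NaN marks cues with no contingent response
```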

The synchrony of human–robot movements and the contingent response to reciprocal cues are critical features for evaluating the quality of imitation and joint attention [30, 48]. Figure 8 highlights the joint attention elicited by the robot towards a human: in blue the yaw head movement of the human, in red the robot’s. Here, the robot is always able to elicit joint attention, as there is a contingent response to each attention cue—a gaze towards an object on the left or right.

Fig. 8
figure 8

The evolution over time of the robot’s head yaw, in red, overlapped to the head yaw movements of the human, in blue. This highlights several synchrony events in terms of joint attention, between the head movement of the robot and the gazing behavior of the human. (Color figure online)

Among the temporal features of an interaction, we can count the rhythm of interaction [39] or the pace of interaction [58]: this measure relates to the time between consecutive interactions. The more the pace of human–robot interaction approaches that of human–human interaction, the more the interaction is perceived as natural [35].

An important matter is the automatic discovery of events, such as the beginning and end of interactions. This can be relatively easy from the robot’s point of view, since its actions are typically determined by a state machine or some parametrized policy: it is trivial to get the time of the perception events triggering a behavior. On the contrary, it is trickier to retrieve the events describing human behaviors from the flow of RGB-D data. One possible way to easily discriminate between activity and inactivity periods is to analyze the time spectrum of the joint trajectories and threshold the energy of such signals computed across a sliding window.
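A minimal sketch of this sliding-window activity detection is given below; the window length and the energy threshold are hypothetical values that would need tuning to the units and noise level of the actual data.

```python
import numpy as np

def activity_mask(traj, fs, win_s=1.0, energy_threshold=1e-3):
    """Label each sample of a joint trajectory as active or inactive.

    traj:             1D joint coordinate (or speed) sampled at fs Hz
    win_s:            sliding-window length in seconds
    energy_threshold: tuning parameter (units- and noise-dependent)
    """
    traj = np.asarray(traj, dtype=float)
    win = max(int(win_s * fs), 1)
    # Remove the slow component, then measure the local signal energy.
    detrended = traj - np.convolve(traj, np.ones(win) / win, mode="same")
    energy = np.convolve(detrended ** 2, np.ones(win) / win, mode="same")
    return energy > energy_threshold
```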

3 Case Studies

The presented methods have been successfully employed in two different human–robot interaction experiments. Both experiments focused on a triadic interaction where the robot tried to engage the human partner in a specific task. The first case study is a social teaching experiment, where a human teaches the color of some objects to the iCub humanoid robot [39]. In the second case study, the Nao robot tries to elicit behaviors in children with autism spectrum disorder and in typically developing children. This section presents the two studies and reports the results obtained by applying our evaluation methods to discriminate between behaviors from different conditions in the task and from different groups.

3.1 Interactive Behaviors Assessment

In this scenario, the robot interacts with a human partner to improve its knowledge about the environment. The two peers stand in front of each other, as shown in Fig. 9. The robot can interrogate the human about the objects on the table, to discover their color properties. A simple speech recognition system, based on a fixed dictionary [59], is used to retrieve the verbal information from the human. The match between color information and object is possible thanks to the shared attention system: the robot is capable of selecting, among the different objects, the one currently observed by the human. The ability of the robot to retrieve the focus of attention of the human is based on the estimation of the partner’s head orientation provided by the head tracking system. Remarkably, the head orientation is not only used for the post-experiment evaluation of the interaction, but is also used at runtime to provide the robot control system with information about the gaze of the human partner. This way, the robot can gaze in the same direction.

Fig. 9
figure 9

The iCub robot learns about objects colors from a human partner in a tutoring scenario. (Color figure online)
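The shared attention mechanism can be approximated, for example, by selecting the object whose direction is closest to the estimated head orientation. The sketch below assumes the tracker provides the head position and a unit direction vector, and that the object positions on the table are known; it is an illustrative approximation, not the exact selection rule used on the robot.

```python
import numpy as np

def attended_object(head_pos, head_dir, object_positions):
    """Select the object the human is most likely looking at, approximating
    the gaze by the head orientation.

    head_pos:         np.array([x, y, z]), head position (m)
    head_dir:         np.array([x, y, z]), unit vector of head orientation
    object_positions: list of np.array([x, y, z]) for the known objects
    Returns the index of the object whose direction from the head makes the
    smallest angle with the head direction.
    """
    angles = []
    for obj in object_positions:
        to_obj = obj - head_pos
        to_obj = to_obj / np.linalg.norm(to_obj)
        angles.append(np.arccos(np.clip(head_dir @ to_obj, -1.0, 1.0)))
    return int(np.argmin(angles))
```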

In this tutoring scenario, described in detail in [39], the authors investigated whether the initiative of the robot could produce an effect on the engagement of the human partner. The experiments consisted of a teaching phase, where the robot had to learn the colors of all the objects, and a verification phase, where it had to tell the human the colors of all the objects. The authors manipulated the robot’s initiative in the teaching phase, as shown in Fig. 10.

Fig. 10
figure 10

A schematic representation of the experimental protocol to study the effect of the robot’s initiative in a tutoring scenario. The teaching phase changes according to the partner that begins the interaction: robot initiative (RI) or human initiative (HI). In the verification phase the robot always asks the human to choose an object

Two conditions were tested. In the first condition (RI) the robot initiates the interaction by selecting an object, gazing at it, and interrogating the human about its properties. In the second condition (HI) the human decides which object to teach, by gazing at it once the robot is ready. The experiments were performed by 13 participants with no previous interaction with the robot: 7 people (\(26 \pm 3\) years old) in the RI case, 6 people (\(22 \pm 1\) years old) in the HI case.

Head movements have been analyzed with the methods discussed in the previous section. Figure 11 shows samples of the estimated gaze of some participants. Both static and dynamic features related to the verification stage have been retrieved. The static analysis of the gaze shows four hot spots in both conditions. These hot spots correspond to the head gazing at the robot and at the three objects placed on the table. The differences between the two conditions are highlighted by clustering the data using the k-means algorithm, as shown in Fig. 12.

Fig. 11
figure 11

Examples of gaze behaviors during the experiments. The superposition of human and robot gaze is used to study the reaction time to the robot’s attention stimuli. Each vertical bar marks the beginning of a new interaction

Fig. 12
figure 12

The heat maps of the human gaze (head yaw on the X-axis, pitch on the Y-axis) in the two conditions (HI and RI) highlight differences in the human gazing behavior. We can observe four different areas of focus of attention: the robot (in front of the human) and the three objects. Their location was chosen to conveniently highlight the three areas of gaze. a Human initiative (HI) condition. b Robot initiative (RI) condition

The dynamic analysis of the head movements shows statistically significant differences between the two groups: the reaction time in response to the robot’s attention stimuli over an object in the verification stage is faster if the robot initiates the interaction \((\hbox {p}<0.005)\) than if the human does. This result is confirmed \((\hbox {p}<1.5\hbox {e}{-}5)\) by the analysis of the pace of the interaction, i.e., the time interval between consecutive robot attention stimuli during the verification stage. The pace is faster if the robot manifests proactive behaviors, initiating the interaction.
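As an indication of how such a between-condition comparison can be run, the sketch below compares per-trial reaction times with a non-parametric test; both the data and the choice of test are illustrative and not necessarily the exact analysis performed in [39].

```python
from scipy import stats

# Hypothetical per-trial reaction times (seconds), one list per condition.
rt_robot_initiative = [0.8, 0.9, 1.1, 0.7, 1.0]
rt_human_initiative = [1.5, 1.8, 1.6, 2.0, 1.7]

# Non-parametric comparison, robust to small samples and non-normality;
# the one-sided alternative tests whether RI reaction times are shorter.
u_stat, p_value = stats.mannwhitneyu(rt_robot_initiative,
                                     rt_human_initiative,
                                     alternative="less")
print(f"U = {u_stat}, p = {p_value:.4f}")
```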

3.2 Autism Assessment

The proposed evaluation methodology has been used in an interactive scenario to highlight differences between children with autism spectrum disorder (ASD) and typically developing (TD) children. In this assessment scenario, described in detail in [6], a robot is placed in front of the child and used as an instrument to elicit joint attention. As shown in Fig. 13, two images of a cat and of a dog, conveniently placed in the environment, are used as targets of attention for the two peers. The RGB-D sensor provides the robot with the capability to look at the child and, at the same time, stores all the information related to the behavior of the child, paired and synchronized with the movements of the robot.

Fig. 13
figure 13

A Nao robot tries to elicit joint attention over two focuses of attention in an interactive scenario

The experiment is composed of three stages in which the robot tries to induce joint attention by increasing the informative content it provides to the human. In the first stage the robot gazes at the two focuses of attention; then it gazes and points at them; finally, it gazes, points and vocalizes “look at the cat”, “look at the dog”, as shown in Fig. 14.

Fig. 14
figure 14

In the experimental protocol, the robot tries to elicit joint attention in children in different conditions that mix multimodal social cues: gazing, pointing and vocalization

Thirty-two children took part in the experiments:

  • Group ASD: 16 children (13 males, 5 females), \(9.25 \pm 1.87\) years old.

  • Group TD: 16 children (9 males, 6 females), \(8.06 \pm 2.49\) years old.

In this case, head movements and posture have been analyzed and compared between the two groups. Using generalized linear mixed models, we found a significantly higher variance of the yaw movements in TD children than in children with ASD (\(\hbox {b} = 1.66, \hbox {p} = 0.002\)). The analysis also showed a significant effect of the induction modality used to stimulate joint attention on the yaw movements: a higher variance has been found for vocalizing + pointing compared to pointing (\(\hbox {b} = 1.52, \hbox {p} < 0.001\)) and compared to gazing only (\(\hbox {b} = 1.55, \hbox {p} < 0.001\)). At the same time, the analysis of pitch movements revealed a lower variance in TD children (\(\hbox {b} = -0.84, \hbox {p} = 0.019\)) than in children with ASD.
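A rough sketch of this kind of analysis is given below, fitting a linear mixed model with a random intercept per child on synthetic data; the original study used a generalized linear mixed model, and all group sizes, effect sizes and variable names here are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic long-format table standing in for the real data: one row per
# child and stimulation modality, with the head-yaw variance as response.
rows = []
for group, offset in [("TD", 3.0), ("ASD", 0.0)]:
    for child in range(16):
        for modality, m_off in [("gaze", 0.0), ("gaze+point", 0.5),
                                ("gaze+point+voice", 1.0)]:
            rows.append({
                "child": f"{group}{child:02d}",
                "group": group,
                "modality": modality,
                "yaw_var": 5.0 + offset + m_off + rng.normal(scale=1.0),
            })
df = pd.DataFrame(rows)

# Linear mixed model with a random intercept per child, approximating the
# group and modality effects reported in the text.
result = smf.mixedlm("yaw_var ~ group + modality", df,
                     groups=df["child"]).fit()
print(result.summary())
```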

As highlighted in Fig. 15, both the heat maps of the head pitch and yaw movements show a central hot spot: this area represents the gaze of the child towards the robot. The two lobes corresponding to the two focuses of attention on the sides of the room are less pronounced in ASD children than in TD children. An analysis of the clusters obtained using k-means on the TD children’s data shows that the left and right directions gathered 30.2% of all the occurrences. Applying the same k-means model to the ASD children’s data shows that left and right represented just 8.72% of all the occurrences (Fisher’s exact test, p \(=\) 2.2\(\times 10^{-16}\)): during the joint attention task, TD children gazed at the focuses of attention placed in the room 4.6 times more frequently than the children with ASD (95% confidence interval 4.4–4.6). These results highlight a higher response to the robot’s elicitation by typically developing children, while the gaze is less stable in ASD children.

Fig. 15
figure 15

The heat maps of the children’s head yaw (X-axis) and pitch (Y-axis) in the two conditions highlight differences in their behavior: ASD children present a lower response to the elicitation and less stability of their gazing towards the robot. a Head movements in the TD condition. b Head movements in the ASD condition
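The group comparison of gaze occurrences described above can be tested with Fisher’s exact test on a 2x2 contingency table; the counts below are illustrative placeholders that only roughly match the reported percentages, not the actual data.

```python
from scipy.stats import fisher_exact

# Hypothetical contingency table: gaze samples falling on the lateral
# focuses of attention vs. elsewhere, for TD and ASD children.
#          lateral  elsewhere
table = [[3020,     6980],    # TD
         [ 872,     9128]]    # ASD

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2e}")
```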

A similar analysis has been performed using the body pose (Fig. 16) and body posture data (Fig. 17). In particular, the displacements of each child from the zero position show a higher stability in TD children: using multivariate regression, the pose variance was significantly lower than in ASD children along all axes (x, estimate \(=\) 28.1, p \(=\) 0.001; y, estimate \(=\) 7, p \(=\) 0.006; z, estimate \(=\) 12, p \(=\) 0.009). A similar behavior has been found in the analysis of the body posture data, considering the pitch and the yaw of the trunk. In this case as well, the ASD children’s data are less stable than the TD children’s: the posture variance was significantly lower in TD children than in ASD children along all axes (x, estimate \(=\) 13.9, p \(=\) 0.0016; y, estimate \(=\) 9.2, p \(=\) 0.016; z, estimate \(=\) 1.6, p \(=\) 0.003). These results highlight a lower stability of the body posture in ASD children than in TD children.

Fig. 16
figure 16

Heat maps of the trunk displacement from the zero position of the children in the two conditions highlight differences between the two groups: ASD children’s position in space is less stable than TD children’s. a Displacement in the TD condition (front). b Displacement in the ASD condition (front). c Displacement in the TD condition (top). d Displacement in the ASD condition (top)

Fig. 17
figure 17

Heat maps of the body pose of the children in the two conditions highlight differences between the two groups: ASD children’s posture is less stable than TD children’s. a Trunk pose in the TD condition. b Trunk pose in the ASD condition

4 Discussion

In this paper we proposed a methodology for analyzing the engagement between humans and social robots in direct, face-to-face interactive scenarios. The methodology is based on an evaluation of the engagement aroused by a robot during the interaction, focusing on the nonverbal behaviors expressed by the human partner. Both static and dynamic interaction cues have been considered, as they can be used to extract different meaningful measures. Those described in Sect. 2 were able to characterize different aspects of the social interaction between humans and robots in two different use cases.

In both scenarios, the human and the robot established a mutual communication. In such contexts, a correct comprehension and proper use of nonverbal behaviors are essential to achieve an “optimal” interaction: to provide readable behaviors and to arouse in human partners the illusion of a social intelligence. The importance of nonverbal behaviors has been highlighted by developmental sciences [50]. Toddlers learn about the world in a social way. They develop communication skills through nonverbal cues, and such skills gradually evolve together with verbal language [70]. Imitation, joint attention, gesticulation and synchrony are all learned in the very first stages of childhood development, and seem to be pivotal traits of the developmental process [43, 69]. In adulthood, these become semi-automatic, almost involuntary behaviors, influenced by culture and used in daily communication, eventually in combination with spoken language to reinforce it or to completely alter its meaning.

The measurement of nonverbal signals during interactions with robots can provide information about the engagement between peers [27]. The static analysis of the movements of the body articulations can reveal how humans respond to the robot’s stimuli, and whether they respond as engaged partners or not. The analysis of the gaze behavior can be used to model the attention system of the human partner and improve joint attention. The dynamic analysis can be used to study motor resonance and synchrony of movements, and can improve imitation and gestures. A robot capable of capturing the attention of the human partner should leverage all those nonverbal cues to increase the engagement.

4.1 A Practical Set of Measures

Similar measures can be retrieved using motion capture systems. However, such systems usually rely on marker-based technologies: they require passive or active beacons that must be worn by the user. This not only increases the complexity of the system, but also critically reduces the naturalness of the interaction. The proposed system, instead, is based on a simple RGB-D camera, a marker-less technology that can still be used to track human movement [4, 33]. Despite its lower resolution, this system allows researchers to explore engagement in very natural scenarios, without the restrictions and the complexity imposed by wearable devices and marker holders.

While such measures have been developed to enable studies in naturalistic settings, they can be aggregated with features obtained from the physiological responses of the participants in specially designed experiments, in which participants would forget the existence of the worn sensors so as to establish interactions as natural as possible. In such a case, it would be possible to capture a larger range of possible interactions and, at the same time, to study the neurophysiological bases of engagement [28, 76].

Several studies in social robotics make use of post-experiment questionnaires to gather information about the engagement after the experiments [41, 42]. Unfortunately, while quick and easy to analyze, questionnaires can be strongly affected by several kinds of biases [22]. Without being exhaustive, it is possible to identify at least three important sources of error in questionnaires: their design, the experimental subjects, and the experimenter. The design of the questionnaire can introduce artifacts due to the complexity, ambiguity or specificity of the questions, or due to too few or too many answer options (question wording [56]). The subjects can also introduce errors, because of their unconscious will to be positive experimental subjects and to provide socially desirable responses (response bias [32]). Lastly, the researchers can be a source of error themselves, with their tendency to interpret the answers as a confirmation of their hypothesis (confirmation bias [53]). The measures presented in this paper can be used as a practical and objective tool to explore the interaction with robots; they can also serve as a complement to verify and eventually reinforce the results obtained with questionnaires and surveys.

4.2 Readability of Robot’s Nonverbal Cues

As social animals, humans are extraordinarily able to infer information about their partners, and to build models of the other and of their society. Nonverbal behaviors play a central role in making this inference [29].

In the first scenario, the presented metrics have been used to show that humans react faster to the attention cues of a proactive robot. It is possible to speculate about manipulating the proactive behavior of the robot to strengthen the engagement, regulate the rhythm of interaction, and arouse in people the perception of social intelligence. The engagement, here, comes essentially from the readability provided by the nonverbal cues. This result is also confirmed in the experiment with children with ASD and TD children: a significant difference between the two groups has been found depending on the amount of information expressed by the nonverbal language of the robot. The more modalities the robot uses to communicate its focus of attention (from gazing, to gazing and pointing, to vocalizing), the more its behavior becomes readable by the children.

The results obtained in the two case studies confirm that the proposed measures are effective to study the engagement. These metrics can be used by the robot as a continuous, online feedback signal to evaluate (and eventually manipulate) the engagement with the human partner [5].

Future studies, however, will focus on the use of the presented metrics in long-term scenarios, in which the novelty effect of the robot becomes less relevant over time. In such settings people will interact day by day with the robot, becoming accustomed to its behaviors; at the same time, human subjects could adapt their own behaviors to the robot.

4.3 The “bias” of the Anthropomorphic Design

People have a natural tendency to project human-like features onto animals and inanimate objects. This is the so-called “anthropomorphism” [49]: as “social machines”, we seek in the unknown the same intelligence patterns we are used to recognizing in our peers, projecting our social intelligence. The robot is not perceived as a machine: people frequently have the illusion that the robot understands them, needs their help and wants to communicate. During an interaction, the human can naturally develop a feeling of “partnership” with the robot [15]. The anthropomorphic design of robots can help the readability of their behaviors, facilitating the interaction with human partners [16].

The robots used in our experiments, iCub and Nao, have a baby-like, humanoid shape, which makes them particularly suited for interaction but also introduces an anthropomorphism bias in their human partners. These robots communicate implicitly, just through their design, very human-like traits such as personality, emotions and intentions, and arouse a sense of co-presence [78]. The presented metrics can be used to study the perceived engagement with other types of robots. They should also be able to highlight differences due to different types of robots, even if it is difficult to make predictions about the human reaction to non-humanoid or “headless” robots. It would be very interesting to see whether our results with humanoids hold in the case of androids and very human-like robots. A relational analysis with respect to the uncanny valley would not be so obvious [36, 37]. Our intuition is that the presented metrics should be able to highlight the different reactions and behaviors of the human partners in all cases, but it is difficult to anticipate how much the results would diverge in similar experiments involving androids (maybe revealing aversion effects).

We plan future experiments in which we will use the proposed metrics to assess the engagement between humans and different types of robots. Since the robot design practically impacts the span of its behaviors, we will carefully design such a study, considering the limits and capabilities of each robot and evaluating their “social intelligence” on comparable tasks and desired behaviors.

4.4 Are We Measuring Social Intelligence?

Explaining the concept of “intelligence” is a non-trivial problem [44]. Intelligence could intuitively be associated with the ability of humans to understand the world. However, this definition still lacks generality, given the observation of other kinds of intelligence in living beings. The idea of intelligence in humans is context-dependent. The psychometric approach to human intelligence provides a definition according to three points of view [68]: abstract intelligence, as the ability to understand and manage ideas and symbols; mechanical intelligence, as the capability of working with concrete objects; and social intelligence [20], as the “ability to get along with others” [73].

These definitions can also be employed in robotics, with an interesting parallelism [25, 26]. Abstract intelligence can be identified with the capability of robots to learn, reason using symbols, explore knowledge and deduce new facts. This roughly corresponds to the area of “Artificial Intelligence”. Mechanical intelligence can be related to perceptuo-motor or body intelligence, the ability to interact with the physical world, to perceive it and to coordinate proper actions on it. This kind of intelligence comes from the robot’s embodiment. The robot should be able to capture the relevant information from the world and to link it to high-level, abstract symbols. Reasoning on these symbols should take into account the physical possibilities and capabilities provided by the robot’s embodiment in its physical world.

Finally, social intelligence refers to the ability of robots to act socially with humans, to communicate and interact with them in a human-like way, following the social behaviors and rules attached to their role, and to learn and adapt their own behaviors throughout their lifetime, incorporating shared experiences with other individuals into their understanding of self, of others, and of the relationships they share [16]. The domain of “Social Signal Processing” [74] aims to provide computers and robots with social intelligence, addressing the correct perception, accurate interpretation and appropriate display of social signals [75].

Expressing social intelligence is a key feature to achieve an optimal interaction with humans. The perception of “social intelligence” can be aroused if the robot is capable of exhibiting social cues: accomplishing coherent behaviors, following social rules and communicating with humans in a natural way.

The proposed methodology focuses on the analysis of humans’ non-verbal behaviors during interactions with robots. We can speculate about the use of the presented metrics as a feedback signal of the social intelligence perceived by the humans. From this point of view, responsive behaviors produced by a robot will induce in the human partners a perception of intelligence that can be quantitatively captured by the proposed measures, by observing changes in the humans’ reactions to the robot’s social cues. This interpretation emerges from the experiments we discussed.

In the first experiment, it is possible to speculate about the social intelligence expressed by the robot and perceived by the human partner, according to slight changes in the plot of the interaction. Questionnaires given to the participants of the experiments reported the robot that initiates the interaction as more intelligent. In our view, this can be attributed to the increased readability of the proactive case, which makes the human “aware” of the robot’s status and creates the illusion of a greater intelligence than in the other case. This illusion could be one of the reasons why humans interacted notably faster.

The second experiment is remarkable since autism is characterized by a lack of social intelligence [8, 19]. Here, the behaviors shown by the robot and, at the same time, the plot of the interaction do not vary between the two conditions, so the differences are due to the different ability of the children to recognize social cues. The proposed metrics to evaluate the engagement highlight the lack of social intelligence in the ASD children, showing behavioral differences between the two groups.

4.5 Conclusions

In this paper a set of metrics has been proposed to evaluate the engagement between humans and robots in direct, face-to-face scenarios. These metrics have been applied to study the interaction in two different use cases, characterized by natural settings and different objectives, and to effectively assess different human responses to robot behaviors. In both scenarios, the metrics confirmed the importance of studying non-verbal cues to improve the interactions between humans and robots. Moreover, thanks to their ease of use in real-world scenarios, due to the employment of non-intrusive sensors, such metrics present a strong potential for scalability and further generalization to different applications and contexts. Limitations of the metrics will be studied in future work, in particular in long-term scenarios, in which human subjects become accustomed to the behaviors of the robot, and with different robotic designs, anthropomorphic and not.