1 Introduction

Social robotics [4] aims to provide robots with artificial social intelligence to improve human-machine interaction and to introduce them into complex human contexts. The demand for sophisticated robot behaviors requires modeling and implementing human-like capabilities to sense, process, and act/interact naturally by taking into account emotions, intentions, motivations, and other related cognitive functions.

Talking involves spontaneous gesticulation; postures and movements are relevant for social interactions even if they are subjective and culture dependent. Aiming at building trust and making people feel confident when interacting with them, socially acting humanoid robots should show human-like talking gesticulation. Therefore, they need a mechanism that generates movements that resemble human ones in terms of naturalness. A previous work [24] made use of gestures selected from a previously compiled set of movements. Those gestures were then randomly concatenated and reproduced according to the duration of the speech. That approach was prone to producing repetitive movements and resulted in unnatural, jerky expression.

The goal of the system presented in this paper is to develop a natural talking gesture generation behavior for a humanoid robot. Here we aim to take a step forward by training a Generative Adversarial Network (GAN) based gesture generation system with movements captured directly from humans. A Kinect sensor is used to track the skeleton of the talking person, and a GAN is trained to generate a richer and more natural talking gesticulation.

Gestures (head, hand and arm movements) are used both to reinforce the meaning of words and to express feelings through non-verbal signs. Among the different types of conversational arm and hand movements synchronised with the flow of speech, beats are those not associated with a particular meaning [18]. In the present work, references to talking gestures are limited to beats.

The robotic platform employed in the performed experiments is a Softbank Robotics Pepper robotFootnote 1. Currently, our robot is controlled using the naoqi_driverFootnote 2 package, which wraps the needed parts of the NAOqiFootnote 3 API and makes them available in ROSFootnote 4.
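As an illustration of this setup, the following minimal sketch publishes a single arm pose through ROS. It assumes that naoqi_driver is running and exposes the standard /joint_angles topic (naoqi_bridge_msgs/JointAnglesWithSpeed); the joint values are arbitrary example numbers, not poses from our system.

```python
# Minimal sketch: send one arm pose to Pepper through ROS.
# Assumptions: naoqi_driver exposes the standard /joint_angles topic of
# type naoqi_bridge_msgs/JointAnglesWithSpeed; the angles below are
# arbitrary example values in radians.
import rospy
from naoqi_bridge_msgs.msg import JointAnglesWithSpeed

rospy.init_node('gesture_commander')
pub = rospy.Publisher('/joint_angles', JointAnglesWithSpeed, queue_size=10)
rospy.sleep(1.0)  # give the publisher time to connect to the driver

msg = JointAnglesWithSpeed()
msg.joint_names = ['LShoulderPitch', 'LShoulderRoll',
                   'LElbowYaw', 'LElbowRoll', 'LWristYaw']
msg.joint_angles = [0.5, 0.2, -1.0, -0.5, 0.0]
msg.speed = 0.2  # fraction of the maximum joint speed
pub.publish(msg)
```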

2 Related Work

According to Beck et al. [3], there are three main approaches to robot motion generation: manual motion creation, motion capture, and motion planning. Manual creation requires setting each joint position of the humanoid robot for each key frame (time step). The motion capture-based approach tries to mimic human gestures by recording human movements and mapping these data to a humanoid robot [22]. The motion planning approach relies on kinematics and/or dynamics equations to solve a geometric task. Beck et al. found that the motion capture approach produces the most realistic results, because the robot reproduces previously captured human movements.

Motion capturing and imitation is a challenge because humans and robots have different kinematic and dynamic structures. Motion capture (MoCap) is the process of recording motion data through any type of sensor. Applications of MoCap systems range from animation, biomechanics and medicine to sports science, entertainment, robotics [21, 31] and even the study of animal behavior [27]. MoCap systems rely on optical technologies and can be marker-based (e.g. ViconFootnote 5) or markerless, such as RGB-D cameras. While the former provide more accurate results, the latter are less prone to produce gaps (missing values) that need to be estimated [19, 29]. Many approaches make use of the Kinect sensor due to its availability [1, 9, 20].

No matter which motion capture system is used, there is a need to transfer human motion to the robot joints. This can be done by direct kinematics, adapting the captured human joint angles to the robot, or alternatively by inverse kinematics, which calculates the necessary joint positions given a desired end effector pose.

On the other hand, generative models are probabilistic models capable of generating all the values for a phenomenon. Unlike discriminative models, they are able to generate not only the target variables but also the observable ones [28]. They are used in machine learning to (implicitly or explicitly) acquire the distribution of the data in order to generate new samples. There are many types of generative models; for instance, Bayesian Networks (BNs) [7], Gaussian Mixture Models (GMMs) [8] and Hidden Markov Models (HMMs) [23] are well-known probability density estimators.

Focusing on generative models used for motion generation, in [14] the authors propose the combination of Principal Component Analysis (PCA) [30] and HMMs to encode different movement primitives and generate humanoid motion. Tanwani [28] uses Hidden Semi-Markov Models (HSMMs) to learn robot manipulation skills from humans. Regarding social robotics, several generative approaches have been applied with different objectives. In [17] Manfrè et al. use HMMs for dance creation, and in a later work they use variational auto-encoders for the same purpose [2].

Deep learning techniques have also been applied to generative models, giving rise to deep generative models. A taxonomy of such models can be found in [10]. In particular, Generative Adversarial Networks (GANs) [11] are emerging semi-supervised models that learn to generate synthetic data from the given training data. GANs are deep generative models capable of implicitly acquiring the probability density function of the training data, and they can automatically discover the internal structure of datasets by learning multiple levels of abstraction [15]. Gupta et al. [12] extend the use of GANs to generate socially acceptable motion trajectories in crowded scenes in the scope of self-driving cars. In [26] GANs were shown to outperform other generative approaches such as HMMs and GMMs on the task of motion generation. In that work, the movements used to train the different generative approaches were produced synthetically (using Choregraphe). Instead, in this paper we feed the GAN with movements obtained by observing and capturing human talking gestures.

3 Developed Approach to Enhance Robot Spontaneity

The GAN used in the current approach takes as input only proprioceptive joint position information. In order to feed the GAN with natural motion data, a motion capturing approach is employed. These two aspects are described in detail hereafter.

3.1 Human Motion Capture and Imitation

In [25], direct kinematics was used to teleoperate a NAO robot. The human skeleton obtained with a Kinect was tracked and arm movements were replicated by the robot, while walking motions were commanded using different spatial key movements. As the goal was to teleoperate the robot, there was no need for subtle and continuous motion: the arms only had to reach single poses when demanded by the operator. Gesticulation, on the contrary, requires continuous arm motion and also involves the hands, head and fingers. Although the present work makes use of a similar motion capture and mapping system, the system has been enriched to cover all these aspects.

The Kinect uses structured light (depth map) and machine learning to infer body position [16]. The OpenNI-based skeleton_markersFootnote 6 ROS package extracts in real time the 15 joints associated with the human skeleton.
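A hedged sketch of how the skeleton data can be consumed is shown below. It assumes the tracker broadcasts one TF frame per joint of user 1 (e.g. left_shoulder_1, left_elbow_1), which is the usual OpenNI convention; the exact frame names depend on the tracker configuration.

```python
# Hedged sketch: read skeleton joints through TF.
# Assumption: the tracker broadcasts one TF frame per joint of user 1
# (e.g. left_shoulder_1, left_elbow_1); exact names depend on the setup.
import rospy
import tf

rospy.init_node('skeleton_reader')
listener = tf.TransformListener()
rate = rospy.Rate(30)  # skeleton updates arrive at roughly 30 Hz

while not rospy.is_shutdown():
    try:
        # 3D position of the left elbow expressed in the left shoulder frame
        trans, rot = listener.lookupTransform('left_shoulder_1',
                                              'left_elbow_1', rospy.Time(0))
        rospy.loginfo('shoulder-to-elbow vector: %s', str(trans))
    except (tf.LookupException, tf.ConnectivityException,
            tf.ExtrapolationException):
        pass
    rate.sleep()
```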

Mapping Human’s Arms into Robot’s Space 

Human arms have 7 degrees of freedom (DoF): a spherical joint at the shoulder, a revolute joint at the elbow and a spherical joint at the wrist. In contrast, our humanoid’s arms have 5 DoF: two at the shoulder (pitch and roll), two at the elbow (yaw and roll), and one at the wrist (yaw) (Fig. 1). Thus, the movement configurations of human and robot arms differ.

To transform the Cartesian coordinates obtained from the Kinect into Pepper’s coordinate space, a joint control approach was employed. Note that the transformations are performed with respect to the reference system of each individual joint, not to a global robot reference frame. In the following explanation we focus on the left arm; the analysis of the right arm is analogous and is omitted here.

Pepper’s left arm has five jointsFootnote 7 (see Fig. 1): shoulder roll (\(LS_\alpha \)) and pitch (\(LS_\beta \)), elbow roll (\(LE_\alpha \)) and yaw (\(LE_\gamma \)), and wrist yaw (\(LW_\gamma \)). The skeleton_markers package cannot detect the yaw motion of the operator’s hands and thus the \(LW_\gamma \) joint cannot be reproduced from the skeleton information. We chose another approach for \(LW_\gamma \), which will be explained later on.

Fig. 1.
figure 1

Pepper’s left arm joints and actuators (from Softbank’s official Pepper user guide (see Footnote 7)).

In order to calculate the shoulder’s roll angle (\(LS_\alpha \)) we use the dot product of the vector between both shoulders (\(\overline{LRS}\)) and the vector from the shoulder to the elbow (\(\overline{LSE}\)). Note that, before computing that product, the \(\overline{LRS}\) and \(\overline{LSE}\) vectors must be normalized. The \(LS_\alpha \) angle is calculated in the Kinect’s coordinate space and therefore must be transformed into the robot’s coordinate space by rotating it \(-\frac{\pi }{2}\) radians (Eq. 1).

$$\begin{aligned} \begin{aligned} LS_\alpha = \arccos {(\overline{LRS} \cdot \overline{LSE})}\\ LS^{robot}_\alpha = LS_\alpha - \frac{\pi }{2} \end{aligned} \end{aligned}$$
(1)
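A minimal sketch of this computation, assuming the joint positions are available as 3D numpy arrays in the Kinect’s coordinate space:

```python
# Sketch of the shoulder-roll computation of Eq. 1.
# Assumption: joint positions are 3D numpy arrays in the Kinect's space.
import numpy as np

def left_shoulder_roll(r_shoulder, l_shoulder, l_elbow):
    """Return LS_alpha expressed in the robot's coordinate space (radians)."""
    lrs = l_shoulder - r_shoulder           # vector between both shoulders
    lse = l_elbow - l_shoulder              # shoulder-to-elbow vector
    lrs = lrs / np.linalg.norm(lrs)         # normalize before the dot product
    lse = lse / np.linalg.norm(lse)
    ls_alpha = np.arccos(np.dot(lrs, lse))  # angle in the Kinect's space
    return ls_alpha - np.pi / 2.0           # -pi/2 rotation into robot space
```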

The elbow’s roll angle (\(LE_\alpha \)) is calculated in the same way as the shoulder’s roll angle (\(LS_\alpha \)), but using the vector from the shoulder to the elbow (\(\overline{LSE}\)) and the vector from the elbow to the hand (\(\overline{LEH}\)) instead. Again, those vectors need to be normalized and the angle transformed to the robot’s space, in this case by rotating it \(-\pi \) radians.

With respect to the elbow’s yaw angle (\(LE_\gamma \)), only the y and z components of the \(\overline{LEH}\) vector are used. After normalizing \(\overline{LEH}\), Eq. 2 is applied to obtain \(LE_\gamma \). Lastly, a range conversion is needed to get \(LE^{robot}_\gamma \) (from [\(\frac{\pi }{2}\), \(\pi \)] to [\(-\frac{\pi }{2}\), 0] and from [\(-\pi \), \(-\frac{\pi }{2}\)] to [0, \(\pi \)]).

$$\begin{aligned} \begin{aligned} LE_\gamma = \arctan {\frac{\overline{LEH}_z}{\overline{LEH}_y}}\\ LE^{robot}_\gamma = rangeConv(LE_\gamma ) \end{aligned} \end{aligned}$$
(2)
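The following sketch reproduces Eq. 2 together with one possible reading of the range conversion; the linear mappings in range_conv are an interpretation of the intervals stated above, not code from the original system.

```python
# Sketch of the elbow-yaw computation of Eq. 2 plus the range conversion.
# The mappings in range_conv are one possible reading of the stated intervals.
import numpy as np

def range_conv(angle):
    if np.pi / 2 <= angle <= np.pi:          # [pi/2, pi]   -> [-pi/2, 0]
        return angle - np.pi
    if -np.pi <= angle <= -np.pi / 2:        # [-pi, -pi/2] -> [0, pi]
        return (angle + np.pi) * 2.0
    return angle

def left_elbow_yaw(l_elbow, l_hand):
    """Return LE_gamma mapped to the robot's range (radians)."""
    leh = l_hand - l_elbow
    leh = leh / np.linalg.norm(leh)          # normalize before using y and z
    le_gamma = np.arctan2(leh[2], leh[1])    # arctan of the z over y component
    return range_conv(le_gamma)
```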

To conclude with the arm joints, the shoulder pitch angle (\(LS_\beta \)) is calculated by measuring the angle between the shoulder-to-elbow vector and the z axis. \(z=0\) occurs with the arm extended at \(90^\circ \) with respect to the torso. Thus, lowering the arm produces a negative pitch angle, while raising it above the shoulder produces positive values.

The \(LS_\beta \) can be defined as:

$$\begin{aligned} \begin{aligned} \Vert A \Vert = LSE_z \quad \text{(by definition)}\\ \sin {(LS_\beta )}=\frac{\Vert A \Vert }{\Vert \overline{LSE} \Vert }=\frac{\Vert A \Vert }{1} \\ LS^{robot}_\beta = \arcsin {(LSE_z)} \end{aligned} \end{aligned}$$
(3)

where \(LSE_z\) is the Z coordinate of the shoulder to elbow vector.

Mapping Human’s Hands into Robot’s Space  

Hand movements are common in humans while talking: we rotate the wrists, open and close the hands and constantly move the fingers. Unfortunately, the skeleton capturing system we use cannot detect such movements. It is possible, though, to capture the state of the hands using a different approach.

The developed solution requires the user to wear coloured gloves, green on the palm of the hand and red on the back (Fig. 2). While the human talks, the hand coordinates are tracked, those positions are mapped into the image space and a subimage is obtained for each hand. Angular information is then calculated by measuring the number of pixels (max) of the dominant colour in the subimage. Equation 4 shows the procedure for the left hand, where N is a normalizing constant and \(maxW_\gamma \) stands for the maximum wrist yaw angle of the robot.

$$\begin{aligned} \left\{ \begin{array}{ll} LW^{robot}_\gamma = max/N \times maxW_\gamma &{} \text{ if } max\text { is palm}\\ LW^{robot}_\gamma = \frac{max-N}{N} \times maxW_\gamma &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(4)
Fig. 2.
figure 2

Snapshot of a data capture session.

In addition, \(LE_\gamma \) is modified when the human’s palm is up (the subimage contains only green pixels) to ease the movement of the robot.

Regarding the fingers, as they cannot be tracked, their position is randomly set at each skeleton frame.
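A hedged sketch of the wrist estimate of Eq. 4 is given below; the HSV thresholds, the normalizing constant N and the maximum wrist yaw value are illustrative assumptions rather than the values used in the recording sessions.

```python
# Hedged sketch of the glove-based wrist yaw estimate of Eq. 4.
# HSV thresholds, N and max_wrist_yaw are illustrative assumptions.
import cv2
import numpy as np

def left_wrist_yaw(hand_subimage_bgr, n=1000.0, max_wrist_yaw=1.8):
    hsv = cv2.cvtColor(hand_subimage_bgr, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, np.array([40, 50, 50]), np.array([80, 255, 255]))
    red = cv2.inRange(hsv, np.array([0, 50, 50]), np.array([10, 255, 255]))
    n_green = cv2.countNonZero(green)   # palm side of the glove
    n_red = cv2.countNonZero(red)       # back side of the glove
    max_px = float(max(n_green, n_red))
    if n_green >= n_red:                # palm is the dominant colour
        return max_px / n * max_wrist_yaw
    return (max_px - n) / n * max_wrist_yaw
```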

Mapping Human’s Head into Robot’s Space  

Humans move the head while talking and thus head motion should also be captured and mapped. The robot’s head has 2 DoF that allow it to move left to right (yaw) and up and down (pitch), as shown in Fig. 3.

Fig. 3.
figure 3

Pepper’s head joints and actuators (from Softbank’s official Pepper user guide (see Footnote 7)).

The Kinect skeleton tracking program gives us the (neck and) head 3D positions. The approach taken for mapping the yaw angle to the robot’s head consists of applying a gain \(K_1\) to the human’s yaw value, once transformed into the robot space by a \(-\frac{\pi }{2}\) rotation (Eq. 5).

$$\begin{aligned} H^{robot}_\gamma = K_1 \times H_\beta \end{aligned}$$
(5)

In order to approximate the head’s pitch angle, the head-to-neck vector (\(\overline{HN}\)) is calculated and rotated \(-\frac{\pi }{2}\) radians, and then its angle is obtained (Eq. 6). Note that the robot’s head is an ellipsoid rather than a sphere; to avoid unwanted head movements, a linear gain is applied to the final value.

$$\begin{aligned} H^{robot}_\beta = \arctan {(rotate(\overline{HN}, -\frac{\pi }{2}))} + |K_2*H_\gamma | \end{aligned}$$
(6)
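A possible reading of Eqs. 5 and 6 is sketched below; the gains K1 and K2 and the choice of vector components passed to arctan are assumptions made for illustration, since the equations only fix the \(-\frac{\pi }{2}\) rotation and the correction term.

```python
# One possible reading of the head mapping of Eqs. 5 and 6.
# K1, K2 and the components passed to arctan are illustrative assumptions.
import numpy as np

def rotate_z(v, angle):
    """Rotate a 3D vector around the z axis by the given angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]).dot(v)

def map_head(head, neck, human_yaw, k1=0.8, k2=0.1):
    yaw_robot = k1 * human_yaw                    # Eq. 5
    hn = rotate_z(head - neck, -np.pi / 2.0)      # rotated head-to-neck vector
    pitch_robot = np.arctan2(hn[2], hn[1]) + abs(k2 * human_yaw)  # Eq. 6
    return yaw_robot, pitch_robot
```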

3.2 Generative Model

A GAN is composed of two interconnected networks. The generator (G) network produces candidates intended to be as similar as possible to the training set. The second network, known as the discriminator (D), judges the output of the first network, deciding whether its input data are “real”, i.e. drawn from the training data, or “fake”, i.e. produced by the generator.

The training dataset given to the D network contained 2018 units of movement (UM), where each UM is a sequence of 4 consecutive poses and each pose consists of 14 float values corresponding to the joints of the head, arms, wrists and hands (finger opening). These samples were recorded from 5 different people talking, about 9 min overall.

The D network is trained on that data to learn its distribution; its input dimension is 56 (4 poses \(\times \) 14 joint values). The G network, on the other hand, is seeded with a random input of dimension 100 drawn from a uniform distribution in the range [\(-1, 1\)]. G aims to output gestures that belong to the real data distribution, so that the D network is unable to pick them out as generated. Figure 4 depicts the architecture of the generator and discriminator networks.

The GAN has been trained for 2000 epochs and its hyper-parameters have been tuned experimentally: a batch size of 16, a learning rate of 0.0002, and Adam [13] as the optimization method with \(\beta _1 = 0.5\) and \(\beta _2 = 0.999\).
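The following sketch reproduces this setup in PyTorch. The 100-dimensional uniform noise input, the 56-dimensional unit of movement and the optimizer settings follow the text, while the hidden-layer sizes and activations are placeholders rather than the exact architecture of Fig. 4.

```python
# Hedged GAN sketch: dimensions and Adam settings follow the text;
# hidden-layer sizes and activations are illustrative placeholders.
import torch
import torch.nn as nn

NOISE_DIM, UM_DIM = 100, 56  # 56 = 4 consecutive poses x 14 joint values

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 128), nn.ReLU(),
    nn.Linear(128, UM_DIM), nn.Tanh(),        # joint values scaled to [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(UM_DIM, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),          # probability of being "real"
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

def train_step(real_batch):                    # real_batch: (16, 56) tensor
    batch = real_batch.size(0)
    noise = torch.rand(batch, NOISE_DIM) * 2 - 1  # uniform noise in [-1, 1]
    fake = generator(noise)

    # Discriminator step: real -> 1, fake -> 0
    opt_d.zero_grad()
    d_loss = bce(discriminator(real_batch), torch.ones(batch, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make D label the fakes as real
    opt_g.zero_grad()
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```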

Fig. 4.
figure 4

GAN setup for talking gesture generation.

4 Results

The obtained robot performance can be appreciated in the following videos:

  1. The first videoFootnote 8 shows some instants recorded during the process of generating the database of movements captured through the motion capturing and imitation mechanisms. The left side displays the participant talking and gesticulating, while the right side shows the simulated robot mimicking the movements in real time (without the GAN).

  2. A second videoFootnote 9 shows the evolution of the robot behavior at different steps of the training process. The final number of epochs was empirically set to 2000 for the model integrated into the gesture generation system.

Notice that the temporal length of the audio to be pronounced by the robot determines the number of units of movement requested from the generative model. The execution of those units of movement, one after another, defines the whole movement displayed by the robot.
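As an illustration of this bookkeeping, the sketch below computes how many units of movement to request for a given audio duration; the pose rate is an assumed parameter, as the paper does not state how many poses are executed per second.

```python
# Illustrative calculation of how many units of movement (UM) to request.
# Assumption: the pose rate (poses executed per second) is a free parameter.
import math

POSES_PER_UM = 4

def units_needed(audio_seconds, pose_rate_hz=10.0):
    """Number of 4-pose UMs needed to cover the audio duration."""
    return int(math.ceil(audio_seconds * pose_rate_hz / POSES_PER_UM))
```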

5 Conclusions and Further Work

In this work a talking gesture generation system has been developed using a GAN fed with natural motion data obtained through a motion capturing and imitation system. The suitability of the approach is demonstrated on a real robot. Results show that the obtained robot behavior is appropriate and that, thanks to the movement variability, the robot expresses itself naturally.

As further work, we intend to improve the skeleton capture process by using more robust systems, such as OpenPose [5, 6] or wrnchAIFootnote 10, which allow capturing more detailed movements. In this way, the speakers would not need to wear the gloves, which somehow condition them. Moreover, speakers tend to behave in a constricted way when recorded. A more powerful skeleton tracking system would make it possible to use recorded videos from real talks and to build a more objective database. With respect to the mapping process, in [20] direct kinematics is compared with two inverse kinematics approaches, and the neuro-fuzzy approach seems to improve on the direct one. The use of a more effective method to translate human poses to robot poses could also produce better movements.

The work presented here is intended to be the starting point for acquiring a richer gesture set, including emotion-based gestures or context-related gestures. Moreover, a generator conditioned on the sentence/word itself would better reflect how humans use their gestures to emphasize their communication.