
1 Introduction

1.1 Motivation

The goal of Human Robot Interaction (HRI) research is to bring the performance of human-robot interaction closer to that of human-human interaction, allowing robots to assist people in natural human environments. As in communication between humans, gestural communication is widely used in human-robot interaction, and several approaches have been developed over the last few years. Some approaches are based on data gloves or markers and use mechanical or optical sensors attached to these devices that convert limb flexion into electrical signals in order to determine the posture. These methods exploit various information, such as the angles of the hand joints, which carry position and orientation data. However, these approaches require the user to wear a glove or a cumbersome device with a load of cables connected to the computer, which hinders natural human-robot interaction. On the other hand, computer vision is a non-intrusive technology that allows gesture recognition without any interference between the human and the robot. Vision-based sensors include 2D and 3D sensors. However, gesture recognition based on 2D images has some limitations. First, images cannot always be captured under consistent lighting. Second, background elements can make the recognition task more difficult. With the emergence of the Kinect [1], capturing depth in real time has become very easy and allows us to obtain not only location information but also orientation information. In this paper, we aim to use only the depth information to build a 3D gesture recognition system for human robot interaction.

1.2 Related Work

A gesture recognition system includes several steps: detection of one or more parts of the human body, tracking, gesture extraction and, finally, classification. Hand tracking can be based on skin color, which can be accomplished by color classification in a color space. In [2], skin color is used to extract the hand and then track the center of the corresponding region. The skin region projected into the chrominance plane has an elliptical shape; taking this fact into account, the authors proposed a skin color model called the elliptical contour. This work was extended in [3] to detect and localize the head and hands. The segmentation process is also an important step in tracking: it consists of removing non-relevant objects and keeping only the regions of interest. Segmentation methods based on clustering, especially K-means and expectation maximization, are widely used in hand detection. In [4], the authors combine the advantages of both approaches and propose a new robust technique named KEM (K-means Expectation Maximization). Other detection methods based on 2D/3D template matching have also been developed [5-7]. However, skin color based approaches are greatly affected by illumination changes and background scene complexity. Therefore, recent studies tend to integrate new information such as depth. Indeed, depth information given by depth sensors can improve the performance of gesture recognition systems. Several studies combine color and depth information, either for tracking or for segmentation [8-11]. Other works combine depth information, color and speech [12]. In [10], the authors use a silhouette shape based technique to segment the human body, then combine 3D coordinates and motion to track the human in the scene. Filtering approaches are also used for tracking, such as the Unscented Kalman Filter [13], the Extended Kalman Filter [14] and the Particle Filter [15]. Other methods are based on points of interest, which impose more constraints on the intensity function and are more reliable than contour based approaches [16]; they are robust to the occlusions present in a large majority of images.

The most challenging problem in dynamic gesture recognition is spatio-temporal variability: the same gesture can differ in velocity, shape and duration. These characteristics make the recognition of dynamic hand gestures much more difficult than that of static gestures [17]. As in speech, handwriting and character recognition [18, 19], HMMs have been successfully used in gesture recognition [20-22]. Indeed, HMMs can model spatio-temporal time series and preserve the spatio-temporal identity of a gesture. The authors in [23] developed a dynamic gesture recognition system based on the roll, yaw and pitch orientations of the left arm joints. Other mathematical models such as the Input-Output Hidden Markov Model (IOHMM) [24], Hidden Conditional Random Fields (HCRF) [25] and Dynamic Time Warping [26] are also used to model and recognize sequences of gestures.

Fig. 1 Flowchart of the proposed 3D dynamic gesture recognition technique

In this paper, we propose a robust classification scheme that combines HMM and DTW methods for 3D dynamic gesture recognition. The basic framework of the technique is shown in Fig. 1. The skeletal tracking algorithm provided by the Kinect SDK is used for body tracking, and only depth information is recorded. The 3D joint positions are extracted and used to compute new, more relevant features, namely the angles between joints. Discrete HMMs with a Left-Right Banded topology are used to model and classify gestures. Finally, the output of the HMM stage is given as input to the DTW algorithm in order to measure the distance between the gesture sequence and a reference sequence. The final decision is made by comparing the distance computed by DTW to a fixed threshold. The evaluation experiments show the effectiveness of the proposed technique. Its performance is further demonstrated by a validation step, which yielded good recognition rates even for subjects who did not take part in the training phase.

The rest of the paper is organized as follows: Sect. 2 describes our 3D dynamic gesture approach and the features we used. Section 3 gives some experimental results. Finally, Sect. 4 ends the paper with a conclusion and future work.

2 Proposed Approach

In the context of human robot interaction, the aim of our work is to recognize five 3D dynamic gestures based on depth information. We are interested in deictic gestures. The five gestures we want to recognize are: {come, recede, stop, pointing to the right and pointing to the left}. Figure 2 shows the execution of each gesture to be recognized. Our gesture recognition approach consists of two main parts: (1) Human tracking and data extraction, and (2) gesture classification.

Fig. 2 The five distinct gesture kinds

2.1 Human Tracking and Data Extraction

In order to perform gesture recognition, we first need robust tracking of the human body and arms. Most recent tracking methods use color information. However, color is not a stable cue and is generally influenced by several factors such as brightness changes and occlusions. Hence, color-based tracking approaches often fail and cannot reliably provide 3D human postures. In our work we choose to use a depth sensor (Kinect) in order to extract reliable 3D data. Figure 3 shows the reference coordinate frames associated with the acquisition system.

Fig. 3 Kinect coordinate system

The coordinates x, y and z denote, respectively, the x and y positions and the depth value. Human tracking is performed using the skeletal tracking method provided by the Kinect SDK. This method projects a skeleton onto the human body image so that each joint of the body is related to a joint of the projected skeleton. In this manner, it creates a collection of 20 joints for each detected person. Figure 4 shows the information used in our approach: the depth image (b) and the skeleton tracking (c).

Fig. 4 a RGB image, b depth image, c skeleton tracking

The idea is to estimate in real time the variations of the active angles while the gestures are executed. The considered angles are the elbow angle \(\alpha \), the shoulder angle \(\beta \) and the armpit angle \(\gamma \), as shown in Fig. 5. Each angle is then computed from the 3D coordinates of the three joints that define it (a computation sketch is given after Fig. 5):

  • \(\alpha \) elbow angle is computed from the 3D coordinates of elbow, wrist and shoulder joints.

  • \(\beta \) shoulder angle is computed from the 3D coordinates of shoulder, elbow and shoulder center joints.

  • \(\gamma \) armpit angle is computed from the 3D coordinates of shoulder, elbow and hip joints.

Fig. 5 \(\alpha \), \(\beta \) and \(\gamma \) angles
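To make the angle computation concrete, the following sketch (Python with NumPy; an illustration, not the authors' implementation) computes an angle from the 3D coordinates of three joints, taking the first joint listed above as the vertex: the angle is the arccosine of the normalized dot product of the two vectors joining the vertex to the other two joints. The joint coordinates below are hypothetical values in the Kinect frame of Fig. 3.

    import numpy as np

    def joint_angle(vertex, a, b):
        """Angle (in degrees) at 'vertex' formed by the segments vertex->a and vertex->b."""
        v1 = np.asarray(a, dtype=float) - np.asarray(vertex, dtype=float)
        v2 = np.asarray(b, dtype=float) - np.asarray(vertex, dtype=float)
        cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

    # Hypothetical 3D joint positions (x, y, depth) for one frame.
    joints = {
        "shoulder_center": (0.00, 0.45, 2.10),
        "shoulder":        (0.20, 0.40, 2.10),
        "elbow":           (0.30, 0.15, 2.05),
        "wrist":           (0.32, -0.10, 2.00),
        "hip":             (0.10, -0.20, 2.10),
    }

    alpha = joint_angle(joints["elbow"], joints["wrist"], joints["shoulder"])           # elbow angle
    beta = joint_angle(joints["shoulder"], joints["elbow"], joints["shoulder_center"])  # shoulder angle
    gamma = joint_angle(joints["shoulder"], joints["elbow"], joints["hip"])             # armpit angle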

When performing a gesture, we record the values of each of these three angles and store the results in vectors as follows:

$$\begin{aligned} V_\alpha =[\alpha _1, \alpha _2, \ldots , \alpha _T] \end{aligned}$$
(1)
$$\begin{aligned} V_\beta =[\beta _1, \beta _2, \ldots , \beta _T] \end{aligned}$$
(2)
$$\begin{aligned} V_\gamma =[\gamma _1, \gamma _2, \ldots , \gamma _T] \end{aligned}$$
(3)

where T is the length of the gesture sequence; it varies from one gesture to another and from one person to another. The input vector of our 3D dynamic gesture recognition system is then written as:

$$\begin{aligned} V=[\alpha _1, \alpha _2, \ldots , \alpha _T, \beta _1, \beta _2, \ldots , \beta _T, \gamma _1, \gamma _2, \ldots , \gamma _T] \end{aligned}$$
(4)
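As a small illustration of how this input vector can be assembled (the angle values below are purely illustrative), the per-frame angles recorded during a gesture are simply concatenated:

    import numpy as np

    # One row per frame: (alpha, beta, gamma) angles recorded while the gesture is performed.
    frames = np.array([
        [165.0, 20.0, 15.0],
        [140.0, 22.0, 16.0],
        [110.0, 21.0, 17.0],
        [ 85.0, 23.0, 15.0],
    ])

    V_alpha, V_beta, V_gamma = frames[:, 0], frames[:, 1], frames[:, 2]
    V = np.concatenate([V_alpha, V_beta, V_gamma])   # input vector of Eq. (4), length 3T (here T = 4)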

The gesture description based on angle variations allows us to distinguish between different human gestures. For every canonical gesture, there is one main angle which changes throughout the gesture while the remaining two angles vary only slightly. We consider the five gestures defined previously. The main angle varying for come and recede is \(\alpha \); likewise, \(\gamma \) is the main angle for the stop gesture and \(\beta \) for both pointing gestures. The main angle variations in each gesture are shown in Table 1.

Table 1 The main angle’s variations in each gesture

In this work, we propose to use the sequences of angles variations as an input of our gesture recognition system as explained in the next section.

2.2 Gesture Classification Method

Our recognition method is based on a combination of Hidden Markov Models (HMM) and the Dynamic Time Warping (DTW) method. HMMs are widely used in temporal pattern, speech and handwriting recognition, where they generally yield good results. The difficulty with dynamic gestures is their spatial and temporal variability, which makes their recognition much harder than that of static gestures: the same gesture can vary in speed, shape and length. However, HMMs have the ability to maintain the spatio-temporal identity of a gesture even if its speed and/or duration change. Since we work with time series data, we use the Dynamic Time Warping algorithm to measure the similarity between two sequences that may vary in time and speed. DTW warps the sequences and returns a distance-like quantity between them.

In the first stage, we classify the gesture using HMMs [27]: the gesture kind is recognized as the class with the best probability of membership among the five classes. In the second stage, we use DTW to measure the similarity between the variation sequence of the main angle characterizing the class output by the HMM stage and a reference variation sequence of the same angle. The resulting distance is compared to a precomputed threshold: if the distance is less than the threshold, we keep the result provided by the HMM method; otherwise the gesture is considered unknown and rejected. Therefore a badly performed gesture is rejected instead of being misclassified. Figure 6 shows the steps of our recognition system. First, the HMM method classifies a given gesture (Gtest) into one of the five classes and outputs the gesture type (for example, Come). As mentioned before, the angle that characterizes the gesture Come is the elbow angle, denoted \(\alpha \). Thus, we take the part of the gesture sequence (Gtest) that corresponds to the \(\alpha \) angle variations, and a reference sequence of \(\alpha \) angle variations for a Come gesture from the database. Next, we compute the distance between these two sequences using the DTW method. The resulting distance is compared to the threshold fixed for the gesture Come.
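The two-stage decision can be summarized by the sketch below. It assumes that the five trained HMMs expose a likelihood scoring function, that a DTW distance routine and the per-class reference sequences and thresholds are available (both are described later), and that the angle sequences have been quantized into the discrete symbols the HMMs expect; every name in the sketch is hypothetical.

    # Main angle characterizing each gesture class, as a column index into the
    # per-frame angle data (0: alpha, 1: beta, 2: gamma).
    MAIN_ANGLE = {"come": 0, "recede": 0, "stop": 2, "point_right": 1, "point_left": 1}

    def recognize(frames, hmms, references, thresholds, quantize, dtw_distance):
        """Two-stage recognition: HMM classification, then DTW-based rejection.

        frames       -- (T, 3) array of (alpha, beta, gamma) angles of the test gesture
        hmms         -- dict: class name -> model with a score(symbols) log-likelihood method
        references   -- dict: class name -> reference main-angle sequence
        thresholds   -- dict: class name -> maximum accepted DTW distance
        quantize     -- function mapping angle frames to discrete observation symbols
        dtw_distance -- function returning the DTW distance between two 1-D sequences
        """
        symbols = quantize(frames)
        # Stage 1: the class whose HMM gives the highest likelihood.
        best = max(hmms, key=lambda c: hmms[c].score(symbols))
        # Stage 2: DTW distance between the main-angle sequence and the class reference.
        dist = dtw_distance(frames[:, MAIN_ANGLE[best]], references[best])
        # A badly performed gesture is rejected rather than misclassified.
        return best if dist <= thresholds[best] else "unknown"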

Fig. 6 Our recognition system combining HMM and DTW

Hidden Markov Models. An HMM can be expressed as \(\lambda =(A, B, \pi )\) and described by:

  (a) A set of N states \(S=\{s_1,s_2, \ldots ,s_N\}\).

  (b) An initial probability distribution over the states \(\varPi = \{\pi _j\}\), \(j = 1, 2, \ldots , N\), with \(\pi _j\) = Prob(\(s_j\) at \(t=1\)).

  (c) An N-by-N transition matrix \(A=\{a_{ij}\}\), where \(a_{ij}\) is the probability of a transition from \(s_i\) to \(s_j\), \(1 \le i, j \le N\). The entries in each row of A must sum to 1, since each row gathers the probabilities of making a transition from a given state to every state.

  (d) A sequence of observations \(O=\{o_1,o_2, \ldots , o_T\}\), where T is the length of the longest gesture path.

  (e) A set of k discrete symbols \( V=\{v_1, v_2, \ldots , v_k\}\).

  (f) An N-by-k observation matrix \(B=\{b_{im}\}\), where \(b_{im}\) is the probability of emitting the symbol \(v_m\) from state \(s_i\); the entries in each row of B must also sum to 1, for the same reason as for A.

Fig. 7 HMM topologies. a Left-right banded topology, b left-right topology, c ergodic topology

There are three main problems for HMMs: evaluation, decoding and training, which are solved using the Forward algorithm, the Viterbi algorithm and the Baum-Welch algorithm, respectively [28]. An HMM can also follow one of three topologies: the Fully Connected (Ergodic) model, where each state can be reached from any other state; the Left-Right (LR) model, where each state can loop on itself or move to any following state; and the Left-Right Banded (LRB) model, in which each state can loop on itself or move to the next state only (Fig. 7). We choose the Left-Right Banded model (Fig. 7a) as the HMM topology, because it is well suited to modeling order-constrained time series whose properties change sequentially over time. We build five HMMs, one for each gesture type.
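At recognition time, the evaluation problem is the one solved for each of the five models: computing the likelihood of an observed symbol sequence given \(\lambda \). A minimal forward-algorithm sketch is given below (with scaling to avoid numerical underflow); it is an illustrative implementation under the notation above, not the authors' code.

    import numpy as np

    def forward_log_likelihood(pi, A, B, obs):
        """Scaled forward algorithm: returns log P(obs | lambda) for a discrete HMM.

        pi  -- (N,) initial state distribution
        A   -- (N, N) transition matrix
        B   -- (N, K) emission matrix
        obs -- list of observed symbol indices o_1 .. o_T
        """
        alpha = pi * B[:, obs[0]]               # alpha_1(i) = pi_i * b_i(o_1)
        log_likelihood = 0.0
        for t, o in enumerate(obs):
            if t > 0:
                alpha = (alpha @ A) * B[:, o]   # induction step
            scale = alpha.sum()
            log_likelihood += np.log(scale)
            alpha /= scale                      # rescale to prevent underflow
        return log_likelihood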

Initializing Parameters for LRB Model. We created five HMMs, one for each gesture, and every parameter of each HMM must first be initialized. We start with the number of states. In our case this number is not the same for all five HMMs; it depends on the complexity and duration of the gesture. We use 12 states as the maximum number and 8 as the minimum. For an 8-state model, the HMM initial state vector \(\varPi \) is designed as:

$$\begin{aligned} \varPi =(1\,0\,0\,0\,0\,0\,0\,0) \end{aligned}$$
(5)

To ensure that the HMM begins in the first state, the first element of the vector must be 1. The second parameter to be defined is the matrix A, which can be written as:

$$\begin{aligned} A=\begin{pmatrix} a_{ii} &{} 1-a_{ii} &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} a_{ii} &{} 1-a_{ii} &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} a_{ii} &{} 1-a_{ii} &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} a_{ii} &{} 1-a_{ii} &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} a_{ii} &{} 1-a_{ii} &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} a_{ii} &{} 1-a_{ii} &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} a_{ii} &{} 1-a_{ii} \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} a_{ii} \end{pmatrix} \end{aligned}$$
(6)

where \(a_{ii}\) is initialized with a random value. The matrix B is given by:

$$\begin{aligned} B=\{b_{im}\} \end{aligned}$$
(7)

where each \(b_{im}\) is initialized with a random value.
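The initialization described above can be sketched as follows (assuming, for illustration, 8 states and k = 16 discrete symbols; the rows of B are normalized so that they sum to 1, and the last state of A loops on itself so that every row of A also sums to 1):

    import numpy as np

    def init_lrb_hmm(n_states=8, n_symbols=16, seed=0):
        """Initial (Pi, A, B) for a Left-Right Banded HMM."""
        rng = np.random.default_rng(seed)

        # Pi: the model must start in the first state.
        pi = np.zeros(n_states)
        pi[0] = 1.0

        # A: bidiagonal matrix; each state stays with probability a_ii or moves to the next state.
        A = np.zeros((n_states, n_states))
        for i in range(n_states - 1):
            a_ii = rng.uniform(0.1, 0.9)
            A[i, i], A[i, i + 1] = a_ii, 1.0 - a_ii
        A[-1, -1] = 1.0    # final state loops on itself

        # B: random emission probabilities, rows normalized to sum to 1.
        B = rng.uniform(size=(n_states, n_symbols))
        B /= B.sum(axis=1, keepdims=True)

        return pi, A, B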

Training and Evaluation. Our database is composed of 100 videos for each gesture kind (50 for training and 50 for testing). In the training phase, the Baum-Welch algorithm [28] is used to perform a full training of the initialized HMM parameters \(\lambda =(\varPi , A, B)\). Our system is trained on 50 discrete-vector sequences for each kind of gesture, using the LRB topology with the number of states ranging from 3 to 12. After the training process, we obtain new HMM parameters \((\varPi ', A', B')\) for each type of gesture. Using the forward algorithm with the Viterbi path, the remaining 50 video sequences for each gesture type are then tested with the new parameters. The forward algorithm computes the probability of the discrete vector sequences for all five HMM models with their different state numbers. The gesture path is then recognized as the class whose HMM yields the maximal likelihood over the best path, determined by the Viterbi algorithm. The following steps show how the Viterbi algorithm works on the LRB topology [29]:

  • Initialization:

    for \(1 \le i \le N\),

    \(\delta _1(i)=\varPi _i \cdot b_i(o_1)\)

    \(\phi _1(i)=0\)

  • Recursion:

    for \(2 \le t \le T, 1 \le j \le N\),

    \(\delta _t(j)= \max _i[\delta _{t-1}(i)\cdot a_{ij}]\cdot b_j(o_t)\)

    \(\phi _t(j)=\arg \max _i[\delta _{t-1}(i)\cdot a_{ij}]\)

  • Termination:

    \(p^*=\max _i[\delta _T(i)]\)

    \(q_T^*=\arg \max _i[\delta _T(i)]\)

  • Reconstruction:

    for \(t=T-1, T-2, \ldots , 1\)

    \(q_t^*=\phi _{t+1}(q_{t+1}^*)\)

The resulting trajectory (optimal state sequence) is \(q_1^*, q_2^*, \ldots , q_T^*\), where \(a_{ij}\) is the transition probability from state \(s_i\) to state \(s_j\), \(b_j(o_t)\) is the probability of emitting the observation \(o_t\) in state \(s_j\) at time t, \(\delta _t(j)\) is the maximum probability of reaching state \(s_j\) at time t along a single path, \(\phi _t(j)\) is the index of the best predecessor of \(s_j\) at time t, and \(p^*\) is the optimized likelihood.
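The steps above translate directly to code. The sketch below is a generic log-space Viterbi decoder for a discrete HMM (an illustration, not the authors' implementation); it returns the optimized likelihood \(p^*\) and the optimal state sequence.

    import numpy as np

    def viterbi(pi, A, B, obs):
        """Viterbi decoding for a discrete HMM; returns (log p*, optimal state sequence)."""
        eps = 1e-300                                   # avoids log(0)
        logA, logB = np.log(A + eps), np.log(B + eps)
        N, T = len(pi), len(obs)
        delta = np.zeros((T, N))
        phi = np.zeros((T, N), dtype=int)

        # Initialization
        delta[0] = np.log(pi + eps) + logB[:, obs[0]]
        # Recursion
        for t in range(1, T):
            scores = delta[t - 1][:, None] + logA      # scores[i, j] = delta_{t-1}(i) + log a_ij
            phi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + logB[:, obs[t]]
        # Termination
        q = np.zeros(T, dtype=int)
        q[-1] = delta[-1].argmax()
        log_p_star = delta[-1].max()
        # Reconstruction (backtracking)
        for t in range(T - 2, -1, -1):
            q[t] = phi[t + 1][q[t + 1]]
        return log_p_star, q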

Calculating the Threshold for DTW Distance. We empirically computed five threshold values, one for each gesture class. Each class is first assigned its own reference sequence: for the Come and Recede classes, the reference sequence contains the variations of the \(\alpha \) angle throughout a Come (respectively Recede) gesture; for the Pointing to the right and Pointing to the left classes, it contains the variations of the \(\beta \) angle throughout the corresponding pointing gesture; and for the Stop class, it contains the variations of the \(\gamma \) angle throughout a Stop gesture. The threshold of a gesture class is then the maximum distance between its reference sequence and 50 test sequences, where the distance is given by the DTW algorithm and the test sequences are extracted from the training database.
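A basic DTW distance and the threshold computation described above can be sketched as follows (standard quadratic dynamic programming with an absolute-difference local cost; the training sequences are assumed to be available as 1-D arrays of main-angle values):

    import numpy as np

    def dtw_distance(s, t):
        """Classical DTW distance between two 1-D sequences."""
        n, m = len(s), len(t)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(s[i - 1] - t[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def class_threshold(reference, training_sequences):
        """Threshold of a gesture class: the maximum DTW distance between its
        reference main-angle sequence and the training sequences of that class."""
        return max(dtw_distance(reference, seq) for seq in training_sequences)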

Fig. 8 Cases of detection failure by the Kinect. First image: the distance is greater than 3 m; second and third images: the person is not in front of the Kinect

3 Experimental Results

3.1 Experimental Protocol

Before the experiment, the experimental protocol was given to the subjects; it describes the beginning and the end of each of the five gestures. The gesture duration is not fixed: a person can perform a gesture either slowly or quickly. We used a Kinect sensor, which must remain fixed. The distance between the Kinect and the person should be between 80 cm and 3 m in order to detect the person properly. Figure 8 shows some cases in which the Kinect cannot fully detect the body. The environment is more or less crowded, with no obstacles between the subject and the Kinect. While performing a gesture, the person should be standing and remain in front of the Kinect.

Fig. 9 Angle variations for the come gesture

Fig. 10 Angle variations for the recede gesture

Fig. 11 Angle variations for the pointing to the right gesture

Fig. 12 Angle variations for the pointing to the left gesture

Fig. 13 Angle variations for the stop gesture

Fig. 14 Recognition accuracy when varying the number of HMM states from 3 to 14

3.2 Recognition Results

Angle variations are plotted in Figs. 9, 10, 11, 12 and 13. As shown, each gesture is characterized by the angle that changes the most compared to the other two. We choose the number of HMM states for each gesture according to the experimental results and find that the recognition rate is maximal with 11 states for the gestures come, recede and pointing to the right, 12 states for the gesture pointing to the left, and 8 states for the gesture stop, as shown in Fig. 14. Therefore, we use this setting in the following experiments. A given gesture sequence is recognized in 0.1508 s. The recognition results are listed in Table 2. We can see that the proposed method greatly improves the recognition process, especially for opposed gestures like come and recede, or pointing to the right and pointing to the left. We can also see that there is no confusion between such gestures as come and recede; this is because the angle \(\alpha \), which changes during both gestures, decreases in come and increases in recede.

Table 2 Confusion matrix and recognition accuracy

The same reasoning applies to the two opposed gestures pointing to the right and pointing to the left. As a matter of fact, even if the same angle varies in two different gestures, our method can distinguish between them.

Table 3 Comparison between the performance of our approach and that of Ye and Ha [30]

Table 3 presents a comparison of our approach with that of the authors in [30]. They use the yaw, roll and pitch orientations of the elbow and shoulder joints of the left arm. Their database contains five gestures trained by one person and tested by two. The gesture duration is fixed beforehand. In offline mode, the accuracy of recognizing gestures executed by persons who took part in the training was 85 % with their method and 97.2 % with ours. For persons who did not take part in the training, the recognition accuracy reached 73 % with their method and 82 % with ours. The gestures we have defined for human-robot interaction are natural; they are almost the same as those used daily between people, whereas most state-of-the-art methods are based on constrained gestures using signs that are not natural. Moreover, the proposed gesture recognition approach is based only on depth information, which makes it very robust to environment complexity and illumination variation.

4 Conclusion and Future Work

In this paper we have presented an efficient method for 3D natural and dynamic gesture recognition intended for human robot interaction. The proposed gesture recognition system is able to recognize five deictic gestures described by depth information only. The upper body is tracked using the Kinect camera and the angles are computed from the 3D coordinates of five different joints. Each of our five gestures is represented by a sequence combining the variations of three angles. This sequence is the input of our classification system, which combines the HMM and DTW methods. First, the HMM assigns the given gesture to one of the five classes, corresponding to the maximum probability. Based on this result, DTW measures the similarity between the variation sequence of the main angle characterizing the class output by the HMM method and its reference sequence. The resulting distance is compared to the threshold of that class: if the distance is less than the threshold, we keep the HMM result; otherwise we reject the gesture. The experimental results presented in this paper confirm the effectiveness and efficiency of the proposed approach. On the one hand, the recognition rate reaches up to 100 % for some kinds of gestures, and the combination of HMM and DTW avoids misclassification by rejecting badly performed gestures. On the other hand, the system can recognize gestures even if the distance or the location of the person changes, provided the conditions given in the experimental protocol are respected. Finally, the environment and the brightness do not affect data collection and analysis because we rely on depth only.

Nevertheless, and despite the vast amount of relevant research efforts, the problem of efficient and robust vision based recognition of natural gestures in unprepared environments remains open and challenging, and is expected to remain of central importance in human-robot interaction in the forthcoming years. In this context we intend to continue our research efforts towards enhancing the current system. First, the training and test datasets will be expanded to include richer gesture types in order to recognize different gestures within the same sequence. Then, we intend to introduce other information, such as speech, to improve the proposed recognition system by detecting the beginning and the end of the gesture.