1 Introduction

Human Computer Interaction (HCI) technologies and algorithms have become increasingly important in recent years, as users demand new ways of communicating with computers and of interacting with virtual environments. The user experience of highly technological services is not always optimal, and HCI can help bring these services to the mass market. As mentioned in [21], 3D user interfaces (3D UI) have gained relevance in the console gaming scenario in recent years.Footnote 1, Footnote 2, Footnote 3 Moreover, in desktop computer interfaces, the use of the hand as an input device provides natural human-computer interaction [24]. Usability is a main issue in the development of HCI systems; some of its aspects are pointed out in [11], and [3] presents a study devoted to improving the user experience.

The ultimate goal of this work is to provide the user with natural interaction and a good experience when interacting with a computer in application contexts such as the navigation of maps,Footnote 4 allowing intuitive movements over the Earth's surface. Other application contexts of this approach include the control of multimedia menus [31] or of the point of view in a virtual environment. Recognition of other motion-based gestures could also allow the interpretation of sign languages [9, 13].

The paper is structured as follows: Section 2 reviews the state of the art and points out the innovations of our system, before Section 3 gives an overview of it. Section 4 describes the proposed dictionary of gestures and the compilation of user executions. Section 5 explains the approach followed for gesture detection, Section 6 presents the user-independent evaluation figures, and Section 7 enumerates the conclusions.

2 Related Work

There are several works focused on hand gesture recognition based on range data, as the use of depth information has become recurrent in recent years. Some examples can be found in [7, 28], where stereo-vision systems applied to gesture recognition are presented. In [18] the 3D trajectory of the hand is estimated by using markers. Another approach consists in fitting 3D models to 2D images [1, 32]. A recent research line is the use of Time-of-Flight (TOF) range cameras that supply real-time depth information per pixel [31] at low cost. An example of the use of this technology can be found in [8], where it is used to improve people tracking in a smart room. TOF technology can also present some problems, such as optical noise, unmatched boundaries or temporal inconsistency [16]. The use of depth information enriches the communication between user and machine by means of gestural interfaces. In [22] some advantages are highlighted: robustness to illumination changes and easy segmentation even when there is camera motion. In [2] a 3D hand model is fitted to the cloud of points obtained from the captured depth image. In [17, 25–27, 29, 31] experiments using depth sensors are performed over static hand gesture collections, pointing out the advantages of using depth information. Another technology for obtaining range data is the one proposed in [23], where the scene is illuminated with a colored pattern, captured by a common RGB camera and later processed to infer depth information.

More concretely, several works focus on the detection of motion-pattern-based gestures. In [36] a system for the detection of shape- and motion-based gestures is presented, using 2D images as input; it is evaluated on four different gestures, but only two different trajectories. Yoon et al. [37] recognize 26 alphabetical gestures on the basis of location, angle and velocity features. In [5], digits 0 to 9 drawn in the air are recognized from 3D motion captures obtained with an accelerometer. Kim et al. [15] present a solution based on neural networks fed with spatiotemporal information. In [25] two simple motion patterns are taken into account (i.e. MenuOpen and MenuClose), which correspond to two of the gestures introduced in Section 4 (i.e. N and S). Section 5 of [27] proposes a whole motion-based gesture dictionary, which is the one used in this paper. In [19, 35] the authors perform experiments using the MSRGesture3D dataset,Footnote 5 which includes 12 dynamic American Sign Language gestures. Among these gestures, following the taxonomy proposed in [27], we can find pose-based, pose-motion-based and compound gestures, while the approach proposed in this paper focuses on motion-based ones. There are other datasets, such as the MSRC-12 Kinect gesture data set [6], which includes a collection of gestures based on the movements of human body parts, something out of the scope of this work.

In this paper we present a novel, non-intrusive (i.e. there is no need for gloves or markers as in [13, 14, 34], or accelerometers as in [5]), real-time approach to the detection of intuitive motion-based gestures usable in different application contexts. The learning phase of our approach does not need the capture of ground-truth real data, since the patterns are defined synthetically by using a human arm model (see Section 5.1), which makes it user independent (unlike [5, 15, 36, 37]). During the evaluation, performed with the collaboration of several users, the system worked properly, as the results presented in this paper confirm (see Section 6). Thanks to the proposed normalization (see Section 5.4) and the representativeness of the chosen arm model (see Section 5.1), the system is robust to variations in the distance to the camera, in the height of the user and in the size of the arm and hand. The use of TOF technology, apart from providing an accurate segmentation robust to low illumination conditions (unlike color-camera-based systems [4, 28, 32, 33, 38]), offers a representative point of the hand motion, the one closest to the camera, with no need to apply traditional segmentation techniques.

3 System Overview

In Fig. 1, an overview of the system is presented. First of all, the depth data range is limited to a maximum distance of 3 m, as explained in Section 4. The Point Of Interest (POI) to be tracked is computed, storing its coordinates from frame to frame (i.e. each \(p_i\) represents the 3 coordinates of the POI at frame i), which constitute an estimation of the hand trajectory. More concretely, the proposed POI is the point detected closest to the camera. An alternative POI is also proposed for evaluation purposes: the geodesic center of the segmented hand mask (see Section 5.3). Trajectory segments of five samples (i.e. four translation segments) are compared with synthetically generated motion patterns (i.e. each \(\xi^{a}_{i}\) represents the coordinates of the pattern associated to gesture a at sample i) using the Dynamic Time Warping (DTW) distance, as explained in Section 5.4. Each translation segment is thus locally labeled with the closest synthetic pattern. Along a gesture execution, this results in a collection of labels assigned to several translation segments. The final label of the gesture is the most common assigned label.
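As a minimal illustration of this final labeling step, the sketch below (our own code, not the authors' implementation; the per-window labels are assumed to have already been produced by the DTW comparison of Section 5.4) performs the majority vote, resolving ties as 'Unknown':

```python
from collections import Counter

def final_gesture_label(window_labels):
    """Majority vote over the labels assigned to the 5-sample windows of one
    gesture execution; ties are resolved as 'Unknown' (see Section 5.4)."""
    counts = Counter(window_labels)
    (best, best_count), *rest = counts.most_common()
    if rest and rest[0][1] == best_count:   # two or more labels share the top score
        return 'Unknown'
    return best

print(final_gesture_label(['N', 'N', 'NE', 'N']))   # -> 'N'
print(final_gesture_label(['E', 'SE', 'E', 'SE']))  # -> 'Unknown'
```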

Figure 1

Overview of the system.

4 Data Set

It is very important to have a representative data collection in order to obtain significant evaluation results. For this purpose we use one of the dictionaries described in the dataset proposed in [27]. It is composed of nine gestures (see Fig. 2): slaps in 8 directions (named after the cardinal directions: N, NE, E, SE, S, SW, W and NW) and one slap getting closer to and further from the camera (named IO, Inwards-Outwards). For compiling this collection, 11 users were asked to execute 5 repetitions of each of the 9 gestures, which makes a total of 495 videos.Footnote 6 This collection is entirely used for evaluation purposes, since the knowledge used by the detection system is expressed by the motion patterns defined via the arm model described in Section 5.1. For recording the videos, a TOF camera (SR4000, developed by Mesa Imaging)Footnote 7 was placed 1.5 m above the floor, with a horizontal orientation orthogonal to the user. This camera captures depth images with QCIF resolution (176×144 pixels) and a depth precision of ±1 cm. It was configured to capture 30 fps and to operate in a 3 m depth range (0.3–3.3 m) in order to remove background objects. The recorded users were not asked to keep a certain distance to the camera nor to perform the gestures under any speed restriction. In addition, the users had different heights, which makes the collection representative of the potential users of the system. Some captures of this data set can be found in Fig. 3.
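As an illustration of the depth range limitation applied at capture time, the following minimal sketch (assuming depth frames are available as NumPy arrays in meters; the array names and threshold variables are hypothetical) clips a frame to the 0.3–3.3 m operating range and derives the foreground mask later used in the 2D scenario:

```python
import numpy as np

# Hypothetical 176x144 depth frame in meters (QCIF resolution).
depth = np.random.uniform(0.2, 5.0, size=(144, 176)).astype(np.float32)

NEAR, FAR = 0.3, 3.3  # operating depth range of the camera (meters)

# Keep only pixels inside the configured range; everything else is background.
foreground = (depth >= NEAR) & (depth <= FAR)

# Range-limited depth image: background pixels are set to zero, mirroring the
# "value over zero means foreground" convention of Section 5.3.
depth_limited = np.where(foreground, depth, 0.0)
```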

Figure 2

Gestures observed from user’s point of view.

Figure 3

Depth captures of the proposed gestures for user 1. Notice that the temporal coordinate of the captures evolves from left to right.

This dictionary of gestures was proposed following usability criteria: slaps executed in different directions are an intuitive way of interacting with a virtual environment. Two usability objectives [11] were taken into account in the gesture selection process: learnability and minimization of support requirements. In terms of learnability, none of the users showed difficulties in learning the dictionary, and they required only a brief introduction: they were asked to perform the indicated gestures as if they were interacting with a menu environment. In terms of minimization of support requirements, no user expressed doubts about how to execute the gestures.

5 Methodology

Our approach consists of the definition of synthetic motion patterns, which are compared with the hand motion estimations computed from the videos of the real data set.

5.1 Motion Pattern Modelling

An arm model, consistent with human anatomy, has been proposed for the definition of the considered motion patterns. We consider two arm segments (see Fig. 4): the upper arm, represented by the vector \(\overrightarrow {r_{U}}\), which goes from the shoulder to the elbow, and the lower arm, represented by \(\overrightarrow {r_{L}}\), which goes from the elbow to the wrist. The hand is not considered explicitly in this model, since the variation it could introduce is not significant in comparison with that produced by the arm movements. The upper and lower segments were defined with fixed lengths: \(\left |\overrightarrow {r_{U}}\right |=\left |\overrightarrow {r_{L}}\right |=1\). Finally, the vector that describes the trajectory of the wrist to be analyzed is \(\overrightarrow {r}=\overrightarrow {r_{U}}+\overrightarrow {r_{L}}\). In Fig. 4 some set-ups of the arm model are shown. Notice that for a variation of Δ𝜃 in the angles 𝜃^x and 𝜃^y of the upper segment, the lower segment presents a variation of 2Δ𝜃, thus accumulating the variation of the upper segment. The expressions of the vectors \(\overrightarrow {r_{U}}\) and \(\overrightarrow {r_{L}}\) are the following:

  • For gestures N and S (see Fig. 4a):

    $$\overrightarrow{r_{U}}=\left[0,-sin(\theta^{x}),cos(\theta^{x})\right] $$
    $$\overrightarrow{r_{L}}=\left[0,-sin(2\theta^{x}),cos(2\theta^{x})\right] $$

    where 𝜃^x∈[0,π/2]. For gesture N, 𝜃^x goes from π/2 to 0, while for gesture S it goes from 0 to π/2. Notice that these two motion patterns are contained in the yz plane.

  • For gestures E and W (see Fig. 4b):

    $$\overrightarrow{r_{U}}=-sin(\psi_{0})\left[cos(\theta^{y}),\frac{cos(\psi_{0})}{sin(\psi_{0})},-sin(\theta^{y})\right] $$
    $$\overrightarrow{r_{L}}=\left[-cos(2\theta^{y}-{\pi}/{2}),0,sin(2\theta^{y}-\pi/2)\right] $$

    where 𝜃^y∈[π/4,3π/4] and \(\psi _{0}=\frac {25^{\circ }\times \pi }{180^{\circ }}\) rad. ψ_0 is the angle formed by the upper segment of the arm and \(-\hat {y}\). For gesture E, 𝜃^y goes from 3π/4 to π/4, while for gesture W it goes from π/4 to 3π/4. Notice that these two motion patterns are contained in the xz plane.

  • For gestures NE, SE, SW and NW: a rotation about the z axis is applied to gestures N and S (see Fig. 4c). The rotation matrix, R, is:

    $$R=\left[\begin{array}{cccc} sin\left( {\theta_{0}^{z}}\right) & cos\left( {\theta_{0}^{z}}\right) & 0 & 0\\ -cos\left( {\theta_{0}^{z}}\right) & sin\left( {\theta_{0}^{z}}\right) & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{array}\right] $$

    and so, the homogenous coordinates for vectors \(\overrightarrow {r_{U}}\) and \(\overrightarrow {r_{L}}\) are:

    $$\overrightarrow{r_{U}^{hom}}=R\times\left[0,-sin(\theta^{x}),cos(\theta^{x}),0\right]^{\prime} $$
    $$\overrightarrow{r_{L}^{hom}}=R\times\left[0,-sin(2\theta^{x}),cos(2\theta^{x}),0\right]^{\prime} $$

    where 𝜃^x∈[0,π/2], as for gestures N and S; \({\theta _{0}^{z}}={\pi }/{4}\) for gestures NW and SE, and \({\theta _{0}^{z}}={3\pi }/{4}\) for gestures NE and SW. The application of these rotation matrices implies that the modelled patterns are contained in the xz plane rotated about the z axis.

Figure 4

Model set-ups of the arm model. \({\protect \overrightarrow {{r_{U}}}}\) is a vector that goes from the shoulder to the elbow and \({\protect \overrightarrow {{r_{L}}}}\) from the elbow to the wrist. The angles 𝜃^x and 𝜃^y are variables which define the trajectory of the arm in Fig. 4a, b, while ψ_0 and \({\theta _{0}^{z}}\) are fixed angles that define the position of the elbow at the beginning of the execution of the movement in Fig. 4b and c, respectively. ψ_0 is the angle formed by \({\protect \overrightarrow {{r_{U}}}}\) and \(-\hat {y}\) (see Fig. 4b). \({\theta _{0}^{z}}\) indicates the rotation angle applied to the N and S gestures, which results in the set-up shown in Fig. 4c.

5.2 Motion Pattern Definition

The direction in which the defined intervals are covered depends on the direction of execution of the specific gesture; for example, in the case of gesture N, 𝜃^x for \(\overrightarrow {r_{U}}\) begins at π/2 and ends at 0, while for gesture S it is the other way around. In order to consider different speeds in the execution of the gestures, 6 different patterns per gesture are defined: 1 for the whole arc, 1 for each half and 1 for each third. This makes 6 synthetic patterns per gesture. The selected length for these patterns was 5 samples (i.e. 4 translation segments), which defines the temporal window used for the comparison of synthetic and real patterns (see Fig. 1).
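To make the construction concrete, the following sketch (our own illustrative code under the assumptions above, not the authors' implementation; function and variable names are hypothetical) generates the 5-sample synthetic patterns for gesture N from the arm model of Section 5.1 and converts each one into the 4 translation vectors used for comparison:

```python
import numpy as np

def wrist_position_ns(theta_x):
    """Wrist position r = r_U + r_L for gestures N/S (yz plane, Section 5.1)."""
    r_u = np.array([0.0, -np.sin(theta_x), np.cos(theta_x)])
    r_l = np.array([0.0, -np.sin(2.0 * theta_x), np.cos(2.0 * theta_x)])
    return r_u + r_l

def synthetic_pattern_n(arc, n_samples=5):
    """5-sample wrist trajectory along the given arc of theta_x (gesture N covers
    pi/2 -> 0; halves and thirds of the arc model faster executions)."""
    thetas = np.linspace(arc[0], arc[1], n_samples)
    points = np.array([wrist_position_ns(t) for t in thetas])
    # 4 translation vectors between the 5 consecutive samples.
    return np.diff(points, axis=0)

# Whole arc plus its two halves and three thirds: 6 patterns for gesture N.
arcs = [(np.pi / 2, 0.0),
        (np.pi / 2, np.pi / 4), (np.pi / 4, 0.0),
        (np.pi / 2, np.pi / 3), (np.pi / 3, np.pi / 6), (np.pi / 6, 0.0)]
patterns_n = [synthetic_pattern_n(a) for a in arcs]
```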

For the definition of the IO synthetic patterns, no angles or arm model were considered; a simpler approach was followed: the pattern was defined as a sequence of movements along the z axis. Three kinds of translation segments (i.e. homogeneous motion intervals) were considered: I, translation getting closer to the camera; O, translation moving away from the camera; and S, staticity between two frames (after applying the normalization described in Section 5.4, spurious translations are treated as staticity). Following the same idea of considering different execution speeds, several motion patterns (composed of 4 translation segments) were defined: IIII, IIIS, IISS, SSOO, SOOO, OOOO, IIIO, IIOO and IOOO. For example, if the execution of the gesture is very fast and only 5 samples are captured during it, the expected segment pattern would be IISS or SSOO, whereas if the execution is slower, sequences such as IIII or OOOO could be detected.
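As an illustration of how a captured trajectory can be mapped onto these segment labels, the sketch below (an assumption of ours: the one-third-of-maximum staticity threshold follows Section 5.4 but is computed per window here for simplicity; the names are hypothetical) classifies each z translation of a 5-sample window as I, O or S:

```python
import numpy as np

def io_segment_labels(z_coords):
    """Label the 4 translations of a 5-sample window as I (towards the camera),
    O (away from the camera) or S (static)."""
    dz = np.diff(np.asarray(z_coords, dtype=float))
    # Translations below one third of the largest one are treated as staticity
    # (per-window simplification of the criterion of Section 5.4).
    max_step = np.max(np.abs(dz))
    threshold = max_step / 3.0 if max_step > 0 else 0.0
    labels = []
    for step in dz:
        if abs(step) <= threshold:
            labels.append('S')
        elif step < 0:          # depth decreases: the hand gets closer
            labels.append('I')
        else:                   # depth increases: the hand moves away
            labels.append('O')
    return ''.join(labels)

print(io_segment_labels([2.0, 1.8, 1.6, 1.6, 1.6]))  # -> 'IISS'
```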

5.3 Motion Pattern Capturing

In order to capture a representative trajectory of the hand motion, it is important to choose an easily traceable point. An unstable point would present noisy translations that could produce wrong estimations of the hand motion. The use of range information provides a POI that is robust to illumination and easy to detect: the point closest to the camera. To detect this point it is not even necessary to previously segment the image.

With the intention of showing the advantages of using depth information, we also present an approach that makes no use of depth information (except for the depth range limitation): it extracts the tracking point by considering the segmentation mask resulting from the depth range limitation as a binary image (considering as foreground all the pixels of the depth image with a value over zero). In this case, the chosen tracking POI is the geodesic center of the binary mask, which is estimated by performing the ultimate erosion [20] until a single point remains.
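The two POI choices can be sketched as follows (a minimal illustration under our own assumptions: the closest-point POI is a direct arg-min over the range-limited depth image, and the geodesic center is approximated here by repeated binary erosion, which only approximates the ultimate erosion of [20]; all names are hypothetical):

```python
import numpy as np
from scipy.ndimage import binary_erosion

def closest_point_poi(depth_limited):
    """2.5D POI: pixel with the smallest non-zero depth (closest to the camera)."""
    masked = np.where(depth_limited > 0, depth_limited, np.inf)
    row, col = np.unravel_index(np.argmin(masked), masked.shape)
    return np.array([col, row, depth_limited[row, col]])  # (x, y, z)

def geodesic_center_poi(depth_limited):
    """2D POI: approximate geodesic center of the binary foreground mask,
    obtained by eroding the mask until it is about to disappear."""
    mask = depth_limited > 0
    if not mask.any():
        raise ValueError("empty foreground mask")
    while True:
        eroded = binary_erosion(mask)
        if not eroded.any():
            break
        mask = eroded
    rows, cols = np.nonzero(mask)
    return np.array([cols.mean(), rows.mean(), 0.0])  # z unused in the 2D scenario
```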

5.4 Patterns Comparison

The comparison between two patterns is performed not over the absolute coordinates of the trajectory, but over the translation of the POI between two frames. Before calculating the distance between two patterns, a normalization is performed, consisting of setting to one the length of each displacement of the POI between two successive samples. This solution has been used in problems such as handwriting recognition [10] or motion-based hand gesture detection, as in [36], where the length of the translations is not used as a feature, something equivalent to fixing their length. In order to filter spurious errors in the detection of the tracked point when it is static (for gesture IO), this normalization is only applied when the magnitude of the translation of the POI between consecutive frames is over one third of the maximum one within the gesture execution. This defines a sufficiently wide range of speeds for the proposed gestures, which are intuitively executed in a homogeneous way. The presented normalization makes the system independent of variations in the distance to the camera, in the angle of view, in the height of the user and in the size of the arm.

Once the synthetic (see Section 5.2) and captured (see Section 5.3) motion patterns are normalized, they are compared. The Dynamic Time Warping (DTW) distance has shown good performance when comparing temporal patterns executed at different speeds; in particular, it has been widely applied to the speech recognition problem [30]. An example of its application to hand gesture recognition can be found in [36]. Notice that each new captured motion pattern has four translation vectors, which describe the hand trajectory over five frames. It is compared with each of the synthetic motion patterns present in the collection described in Section 5.1. This way we obtain, along the gesture execution, a histogram of incidence of the closest synthetic patterns. The most common one gives the label to assign to the gesture capture. If there is a tie between labels, the label 'Unknown' is assigned.
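A minimal sketch of the normalization and DTW comparison described above follows (our own illustrative implementation over 3D translation vectors; treating sub-threshold translations as zero vectors is an assumption, and the function names and the structure of the synthetic pattern collection are hypothetical):

```python
import numpy as np

def normalize_translations(translations, static_fraction=1.0 / 3.0):
    """Set each translation to unit length; translations shorter than a third of
    the largest one are treated as static (left as zero vectors, an assumption)."""
    t = np.asarray(translations, dtype=float)
    norms = np.linalg.norm(t, axis=1)
    threshold = norms.max() * static_fraction if norms.max() > 0 else 0.0
    out = np.zeros_like(t)
    moving = norms > threshold
    out[moving] = t[moving] / norms[moving, None]
    return out

def dtw_distance(a, b):
    """Classic DTW between two sequences of 3D translation vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def label_window(captured_translations, synthetic_patterns):
    """Label one captured 5-sample window with the closest synthetic pattern.
    synthetic_patterns: dict mapping a gesture label to a list of 4x3 arrays."""
    window = normalize_translations(captured_translations)
    best_label, best_dist = None, np.inf
    for label, patterns in synthetic_patterns.items():
        for pattern in patterns:
            d = dtw_distance(window, normalize_translations(pattern))
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label
```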

6 Experiments

6.1 Experimental Set-up

This section presents two different evaluation scenarios, both of them user-independent, since the learning process is performed using synthetic data and the evaluation is done with 11 different users (see Section 4):

  1. 2.5D scenario: the tracked POI is the closest point to the camera and its depth coordinate (apart from the x and y coordinates) is used for modelling the trajectory.

  2. 2D information scenario: this second scenario was set up by considering the input images as binary masks, as explained in Section 5.3. The depth information is implicitly used in the set-up of the camera (see Section 4), resulting in a segmentation mask, but it is not used in the estimation of the hand trajectory. In this case, the tracked POI is the geodesic center of the binary mask, obtained with an iterative algorithm [25]. Although the depth information is used for the calculation of this mask, the z coordinate is not used in the comparison of the patterns.

The comparison of the results obtained for these two set-ups allows us to draw conclusions about the utility of depth information in hand gesture recognition.

6.2 Results

This section compiles the results obtained for the two evaluation scenarios introduced in Section 6.1:

  1. 2.5D scenario: the resulting confusion matrix can be found in Table 1. The obtained accuracy rate is 0.951.

  2. 2D information scenario: the obtained accuracy rate is 0.780 (see Table 2).

Table 1 Confusion matrix for the 2.5D scenario.
Table 2 Confusion matrix for the 2D scenario.

From the results compiled in Table 1 there are several aspects to point out:

  • The label IO is the one most often assigned erroneously: it introduces 10 false negatives for executions of other gestures. This is because users tend to introduce the hand into the interaction area (and move it away) with upward and downward trajectories. These patterns are present in the definition of other gestures apart from IO, producing misclassifications.

  • When the labels assigned within an execution result in the same score for 2 or more gestures, the assigned label is Unknown (U). This situation produces 7 misclassifications.

  • Without taking into account the misclassifications produced by the inclusion of the IO gesture (i.e. the only one whose translation fundamentally takes place in the depth coordinate), the obtained accuracy rates are 0.873 for the 2D scenario and 0.977 for the 2.5D one. So, the use of depth information improves the results even when the gestures are apparently detectable using only 2D information. A minimal sketch of this accuracy computation is given after this list.
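The following sketch is illustrative only: the matrix values are made up and do not reproduce Tables 1 or 2, and excluding IO by dropping its row and column is just one plausible reading of the computation above.

```python
import numpy as np

labels = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW', 'IO']

# Hypothetical confusion matrix: rows = executed gesture, columns = assigned label.
confusion = np.diag([5.0] * 9)
confusion[0, 0], confusion[0, 8] = 4.0, 1.0   # e.g. one N execution mislabeled as IO

def accuracy(conf):
    """Fraction of executions whose assigned label matches the executed gesture."""
    return np.trace(conf) / conf.sum()

print('overall accuracy:', round(accuracy(confusion), 3))

# Accuracy ignoring the IO gesture: drop its row and column.
io = labels.index('IO')
reduced = np.delete(np.delete(confusion, io, axis=0), io, axis=1)
print('accuracy without IO:', round(accuracy(reduced), 3))
```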

Table 2 presents worse results, mainly due to the instability of the geodesic center. Since no depth information is considered, the representative point to be tracked has to be estimated on the basis of a segmentation mask that is noisy in its shape and size, so noisy translations are added to the real translations of the hand.

As far as we know, no user-independent evaluations have been performed for motion-based gesture detection; consequently, we enumerate the evaluation figures of some works in which the absence of overlap between training and evaluation corpora is not ensured. In [36] a 0.97 accuracy rate is obtained in separating only two motion patterns. [5] presents results for an intrusive approach based on the use of an accelerometer, obtaining 0.93 for 5-fold cross-validation and 0.98 for 10-fold cross-validation in the detection of the digits 0 to 9. Kim et al. [15] separate 6 gestures on the basis of the posture and motion of the hand, obtaining an accuracy of 0.975 for the best setup. In [37], the highest accuracy rate in the detection of 26 gestures drawn in the air is 0.932. In [25], two of the considered gestures were N and S, obtaining a mean recall of 0.938 in their detection. We can therefore say that our approach achieves results comparable to those of the state of the art, even though the latter do not present user-independent evaluations.

6.3 Computational Cost

We can express the computational cost as a function of the number of translation segments of each motion pattern, \(N\), and the number of synthetic patterns, \(N_{SynPat}\), contained in the collection described in Section 5.1. We consider as significant the times needed to perform a sum, \(T_{S}\), a product, \(T_{P}\), and a square root, \(T_{sqrt}\). The different stages considered in this work present the following computational times per frame:

  1. POI sampling: in the case of the 2.5D scenario, this is the time needed to compute the position of the closest pixel, for which it is necessary to perform \(width\times height-1\) comparisons, so \(T_{A-3D}=(width\times height-1)\times(N+1)\times T_{S}\). In the 2D scenario we have to take into account the time needed to extract the geodesic center of the binary mask as described in [25], \(T_{A-2D}=4.311\) msec.

  2. Trajectory computation: the time needed to calculate the trajectory vector on the basis of the point coordinates, \(T_{B}=3\times N\times T_{S}\).

  3. Trajectory normalization: as described in Section 5.4, \(T_{C}=N\times(5\times T_{S}+6\times T_{P}+T_{sqrt})\).

  4. DTW computation: \(T_{D}=N^{2}\times N_{SynPat}\times(5\times T_{S}+3\times T_{P}+T_{sqrt})\).

Current floating-point units offer a solution for the computation of arithmetic operations with dedicated hardware, achieving computational times of the same order of magnitude for sum, product and square root. On the basis of Pentium speed tests,Footnote 8 we can establish the following relation between \(T_{S}\), \(T_{P}\) and \(T_{sqrt}\), defining \(T_{0}\) as the reference computational time: \(T_{S}=T_{P}=T_{0}\) and \(T_{sqrt}=2\times T_{0}\). Doing so, and on the basis of the presented expressions, we obtain a total computational time of \(T=T_{A}+T_{B}+T_{C}+T_{D}=T_{A}+T_{0}\times N\times(16+10\times N\times N_{SynPat})\). With \(N=4\) and \(N_{SynPat}=54\) we obtain \(T=T_{A}+8704\times T_{0}\). A CPU performance test was run on an Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz with 2.98GB RAM, as in [25], the obtained \(T_{0}\) being below 1 nsec. So \(T_{3D}=T_{A-3D}+8704\times T_{0}=135419\times T_{0}\) (\(T_{3D}<0.136\) msec) and \(T_{2D}=T_{A-2D}+8704\times T_{0}\) (\(T_{2D}<4.321\) msec).
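The arithmetic above can be reproduced with a few lines (a sketch that simply substitutes the stated values in the formulas of this section; the variable names are ours):

```python
# Reproduce the per-frame cost estimate of Section 6.3 in units of T_0.
N = 4                       # translation segments per motion pattern
N_SYN_PAT = 54              # 9 gestures x 6 synthetic patterns each
WIDTH, HEIGHT = 176, 144    # QCIF resolution of the TOF camera

# T_B + T_C + T_D with T_S = T_P = T_0 and T_sqrt = 2 * T_0.
matching_cost = N * (16 + 10 * N * N_SYN_PAT)       # = 8704 T_0

# POI sampling cost for the 2.5D scenario (closest-pixel search).
poi_cost_3d = (WIDTH * HEIGHT - 1) * (N + 1)        # = 126715 T_0

total_3d = poi_cost_3d + matching_cost              # = 135419 T_0
print(total_3d)  # with T_0 < 1 ns this stays below 0.136 ms per frame
```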

As shown in Table 3, the described approaches require much less than 1/25 sec per frame, enabling real-time HCI.

Table 3 Computational Costs per frame and Accuracy for the two considered scenarios.

7 Conclusions

In this paper, a non-intrusive, motion-based hand gesture detection system using range data has been presented. It works in real time, allowing the interaction between a user and a virtual environment or computer menu. It is robust to the relative camera position and to the speed of execution of the gestures. It is also user-independent, being able to work with a collection of gestures executed by users of different heights and arm sizes. A novel definition of the motion patterns, based on human anatomy, has been presented: the obtained results bear witness to its remarkable representation capacity. A significant data set of depth videos has been compiled and made available for research purposes (see Section 4).

From the results we confirm that the use of depth information for the hand trajectory estimation implies a significant increase in the gesture detection accuracy rate. Our approach (the 2.5D scenario) works without the need to apply any segmentation algorithm (apart from limiting the depth range of the capture) or to calculate the geodesic center of the hand mask, as in the 2D scenario, which means a lower computation time (see Table 3). The accuracy rate achieved for the proposed dictionary, in a user-independent evaluation, is 0.951, a very promising value, comparable, as already mentioned, to the results of the state of the art. The experiments performed in this work also show that the 2.5D approach performs better than the 2D one, even without considering the only gesture with a clear translation just in the depth coordinate, the IO gesture.

In the light of the results described in Section 6 we consider two main future work lines:

  • The use of a Hidden Markov Model in order to manage the temporal sequence of detected labels. This could solve some misclassification situations in which the order of the detections is relevant.

  • The use of color-depth registration approaches [12] could improve the quality of the hand motion estimation, and make feasible the detection of more complex gestures.