1 Introduction

Interacting with computers using human motion is commonly employed in human-computer interaction (HCI) applications. One way to incorporate human motion into HCI applications is to use a predefined set of human joint motions, i.e., gestures. Gesture recognition has been an active research area [12, 19, 26, 39] and involves state-of-the-art machine learning techniques in order to work reliably in different environments. A variety of methods have been proposed for gesture recognition, including Dynamic Time Warping (DTW) [26], Hidden Markov Models (HMMs) [12], Finite State Machines [13], hidden Conditional Random Fields (CRFs) [35] and orientation histograms [11]. In addition, there are gesture recognition methods that are not view-based, such as those using the Wii controller (Wiimote) [29] and the DataGlove [23].

DTW measures the similarity between two time sequences, which might be obtained by sampling a source at varying sampling rates or by recording the same phenomenon occurring at varying speeds [37]. The conventional DTW algorithm is basically a dynamic programming algorithm, which iteratively updates the DTW cost by adding the distance between mapped elements of the two sequences at each step. The distance between two elements is oftentimes the Euclidean distance, which gives equal weight to all dimensions of a sequence sample. However, depending on the problem, a weighted distance might perform better in assessing the similarity between a test sequence and a reference sequence. For example, in a typical gesture recognition problem, the body joints used in a gesture vary from gesture class to gesture class. Hence, not all joints are equally important in recognizing a gesture.

We propose a weighted DTW algorithm that uses a weighted distance in the cost computation. The weights are chosen so as to maximize a discriminant ratio based on DTW costs. The weights are obtained from a parametric model which depends on how active a joint is in a gesture class. The model parameter is optimized by maximizing the discriminant ratio. By doing so, some joints will be weighted up and some joints will be weighted down to maximize between-class variance and minimize within-class variance. As a result, irrelevant joints of a gesture class (i.e., parts that are not involved in a gesture class) will contribute to the DTW cost to a lesser extent, while keeping the between-class variances large.

Our system first extracts body-joint features from skeleton data consisting of six joint positions: the left and right hands, wrists and elbows. We have observed that the gestures in our training set, which have quite different motion patterns, can be recognized using all or a subset of these six joints only. The extracted skeleton features are used to recognize gestures by matching them against pre-stored reference sequences. Pre-processing is needed to suppress the noise due to different body and camera orientations and different body sizes. After pre-processing, matching is performed by assigning a test sequence to the reference sequence with the minimum DTW cost. By removing the variations in the data, the DTW cost becomes more reliable for classification, as demonstrated by the increase in the discriminant ratio values.

2 Related work

One commonly used technique for gesture recognition is modeling gesture sequences with HMMs. HMMs are statistical models for sequential data [3, 4], best known for their application to speech recognition, and can likewise be used in gesture recognition [12, 18, 32]. The states of an HMM are hidden, and the state transition probabilities are learned from training data. However, defining states for gestures is not an easy task, since gestures can be formed by a complex interaction of different joints. Also, learning the model parameters, i.e., the transition probabilities, requires large training sets, which may not always be available. DTW, on the other hand, does not require training but needs good reference sequences to align with.

Since its introduction in the 1960s [5], DTW has been used to solve a variety of problems: speech recognition, where speech is warped in time to cope with different speaking speeds [2, 22, 28]; data mining and information retrieval on time-dependent data [1, 24]; curve matching [10]; online handwriting recognition [34]; and hand shape classification [17]. In gesture recognition, DTW time-warps an observed motion sequence of body joints to pre-stored gesture sequences [9, 17, 25, 36]. Although we present the theory of general DTW and its implementation issues, in this paper we focus on its application to gesture recognition. Comprehensive surveys of the general DTW algorithm can be found in [21, 30]. This work is an extended version of our work in [7].

Using a weighting scheme in the DTW cost computation has been proposed for gesture recognition [26]. The method in [26] uses DTW costs to compute between-class and within-class variations and finds a weight for each body joint. These weights are global in the sense that only one weight is computed per body joint. Our proposed method, in contrast, computes a weight for each body joint and for each gesture class. This boosts the discriminative power of DTW costs, since a joint that is active in one gesture class may not be active in another; the weights have to be adjusted accordingly. This especially helps in dealing with within-class variation. To avoid reducing the between-class variance, we compute the weights by optimizing a discriminant ratio using a parametric model that depends on body-joint activity. Another type of weighting in DTW for aligning time series is proposed in [15]. Their goal is to modify DTW so that the similarity between two 1D time series is robust to outliers. An outlier at a particular time instant can create a large error that dominates the distances at other time instants. To avoid this, a robust distance function is used instead of the \(L_1\) norm (i.e., absolute distance) or the \(L_2\) norm (i.e., Euclidean distance). Hence, their weighting applies to the distance values between samples of a 1D time series, whereas we propose to weight each dimension of a multi-dimensional signal, where each dimension is a joint position. Their work is complementary to ours in the sense that a robust distance function, which penalizes outliers to a lesser degree, can also be used in our method.

The goal of dynamic time warping is the alignment of two time sequences via dynamic cost minimization. The final DTW cost, which is a dissimilarity measure between the two sequences, is also used for classification: oftentimes, a test sequence is aligned to a set of templates via DTW and matched to the minimum-cost template. A novel approach that separates the alignment and classification tasks is presented in [20]. First, alignment is performed using DTW; the aligned sequences are then classified using feature generation methods from machine learning. Our proposed method can be used in the alignment phase of the technique proposed in [20].

With Microsoft’s launch of the Kinect in 2010 and the release of the Kinect SDK in 2011, numerous applications and research projects exploring new ways of human-computer interaction have been enabled. Some examples are gesture recognition [26], touch detection using depth data [38], human pose estimation [14], implementation of real-time virtual fixtures [27], real-time robotics control [33] and the physical rehabilitation of young adults with motor disabilities [8]. In the next section we discuss data acquisition and feature pre-processing.

3 Data acquisition and feature pre-processing

We use the Microsoft Kinect sensor [31] to obtain joint positions. The Kinect SDK tracks the 3D coordinates of the 20 body joints shown in Fig. 1 in real time (30 frames per second). The Kinect algorithm predicts joint positions from depth images, and the predicted joint positions are quite robust to color, texture, and background.

Fig. 1 Kinect joints

In our experiments we have focused on hand-arm gestures. Six of the 20 joints in Kinect’s skeleton model are informative for recognizing a hand-arm gesture: the left hand, right hand, left wrist, right wrist, left elbow and right elbow. However, there is no limitation on the number of joints used in our proposed method. For hand-arm gestures the relevant body joints are obvious; for more complex gestures, the most informative joints can be selected using feature selection techniques from the classification literature in machine learning [6]. For example, the Sequential Backward Elimination (SBE) technique, which starts with all 20 joints and eliminates them one by one based on the change in the discriminant ratio, can be utilized, as sketched below.
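SBE reduces to a simple greedy loop. The sketch below is a minimal illustration under stated assumptions: `ratio_of` is a hypothetical caller-supplied routine that evaluates the discriminant ratio of Section 4.2 using only a given subset of joints.

```python
def sbe(all_joints, ratio_of):
    """Sequential Backward Elimination sketch. ratio_of(subset) is assumed
    to return the discriminant ratio obtained when only the joints in
    `subset` are used in the DTW cost."""
    selected = set(all_joints)
    while len(selected) > 1:
        # Find the joint whose removal hurts the discriminant ratio least.
        victim = max(selected, key=lambda j: ratio_of(selected - {j}))
        if ratio_of(selected - {victim}) < ratio_of(selected):
            break                 # every removal degrades the ratio: stop
        selected.remove(victim)
    return selected
```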

In our method, a feature vector consists of the 3D coordinates of these six joints and is of dimension 18, as given below

$$ \mathbf{f}_n=[X_1, Y_1, Z_1, X_2, Y_2, Z_2, \ldots, X_6, Y_6, Z_6], $$
(1)

where n is the index of the skeleton frame at time \(t_n\). A gesture sequence is the concatenation of N such feature vectors.

After the N feature vectors are concatenated to create the gesture sequence, they are pre-processed before the DTW cost computation. The pre-processing consists of three stages. The first stage is normalization, which translates all skeletons to the center of the field of view; this is done by subtracting the hip center joint position from the other joint positions. Note that the reference frames are already recorded at the center of the field of view. The second stage removes the rotational distortion caused by different orientations of human bodies. Unlike the reference gestures, which are performed by trained users, real-life recordings are likely to contain different orientations or positionings of users with respect to the camera. Such cases are problematic for gesture recognition since they result in rotationally distorted skeleton frames (see Fig. 2). To cope with them, our pre-processing system rotates the skeleton frames, if necessary, so that they are orthogonal to the principal axis of the camera. To this end, we define two vectors using the spatial coordinates of the right shoulder, left shoulder and hip center obtained from the Kinect sensor: one from the midpoint of the right and left shoulders to the hip center, and the other from the same midpoint to the right shoulder. Using these two vectors, we calculate the three angles, α, β, θ, of the skeleton with respect to the camera’s coordinate system, and compute the rotation matrices \(\mathbf{R^{\alpha}_{x}}, \mathbf{R^{\beta}_{y}}, \mathbf{R^{\theta}_{z}}\), respectively. The rotation is then applied using these angles in the appropriate order; see an example rotation about the Y axis with \(\mathbf{R^{\beta}_{y}}\) in Fig. 3. The third and last stage eliminates variations in the feature vectors due to different skeleton ratios (broad-shouldered, narrow-shouldered): all feature vectors are normalized by the distance between the left and right shoulders to account for variations in a person’s size. Note that the reference sequences are recorded with people who have average skeleton ratios. A minimal sketch of the three stages is given below; after that, we present a more detailed discussion of DTW.
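The sketch assumes illustrative joint indices and a (20, 3) frame layout, and shows only the Y-axis rotation; the X- and Z-axis rotations follow the same pattern. It is a sketch of the idea, not the exact pipeline.

```python
import numpy as np

# Illustrative joint indices; the actual values depend on the Kinect SDK layout.
HIP_CENTER, LEFT_SHOULDER, RIGHT_SHOULDER = 0, 4, 8

def preprocess_frame(joints):
    """Normalize one skeleton frame given as a (20, 3) array of 3D positions."""
    # Stage 1: translate the skeleton to the center of the field of view.
    joints = joints - joints[HIP_CENTER]

    # Stage 2: remove rotational distortion. Only the Y-axis rotation by the
    # angle beta is shown; the X and Z axes follow the same pattern.
    mid = 0.5 * (joints[LEFT_SHOULDER] + joints[RIGHT_SHOULDER])
    v = joints[RIGHT_SHOULDER] - mid        # midpoint-to-right-shoulder vector
    beta = np.arctan2(v[2], v[0])           # shoulder-line angle in the X-Z plane
    c, s = np.cos(beta), np.sin(beta)
    R_y = np.array([[  c, 0.0,   s],
                    [0.0, 1.0, 0.0],
                    [ -s, 0.0,   c]])
    joints = joints @ R_y.T                 # aligns the shoulder line with the X axis

    # Stage 3: scale by the shoulder-to-shoulder distance to remove
    # variation due to body size.
    width = np.linalg.norm(joints[RIGHT_SHOULDER] - joints[LEFT_SHOULDER])
    return joints / width
```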

Fig. 2 Two skeletons with different orientations (left: ground-truth reference frame, right: rotationally distorted test frame due to improper body orientation)

Fig. 3 Camera A is used to record the ground-truth reference gestures with perpendicular angles; Camera B is used to record a rotationally distorted test sequence. β is the desired angle to rotate the skeleton about the Y axis. After this rotation, the skeleton is rotated about the other axes, if needed, until it is perpendicular to all axes

4 Dynamic time warping for gesture recognition

DTW is a template matching algorithm that finds the best match for a test pattern among a set of reference patterns, where the patterns are represented as time sequences of features. Figure 4 shows an example matching of two sequences.

Fig. 4 DTW used to match two sequences, a reference sequence and a test sequence

Let \(\mathbf{R} = \{r_1, r_2, \ldots, r_N\}\), N ∈ ℕ, and \(\mathbf{T} = \{t_1, t_2, \ldots, t_M\}\), M ∈ ℕ, be the reference and test sequences (sequences of sets of joint positions in our case), respectively. The objective is to align the two sequences in time via a nonlinear mapping. Such a warping path can be written as an ordered set of points as given below

$$ p = (p_1,p_2,\ldots,p_L),\ p_l = (n_l, m_l), $$

where \(p_l = (n_l, m_l)\) denotes the mapping of \(r_{n_l}\) to \(t_{m_l}\), with \(p_l \in [1:N] \times [1:M]\) for \(l \in [1:L]\), where L is the number of mappings. The total cost \(D_p\) of a warping path p between R and T with respect to a distance function \(d(r_i, t_j)\), \(i \in [1:N]\) and \(j \in [1:M]\), is defined as the sum of all distances between the mapped sequence elements

$$ \mathrm{D_p} =\sum\limits_{l=1}^{L} d(r_{n_l}, t_{m_l}), \label{eq:dtw_tc} $$
(2)

where \(D_p\) is the total cost of the path p and \(d(r_i, t_j)\) measures the distance between elements \(r_i\) and \(t_j\). For gesture recognition, the distance can be chosen as the distance between the corresponding joint positions (3D points) of the reference gesture R and the test gesture T.

A mapping can also be viewed as a path on a two-dimensional (2D) grid, also known as the cost matrix, of size N × M (see Fig. 5), where grid node \((r_i, t_j)\) holds the distance between \(r_i\) and \(t_j\). The node \((r_1, t_1)\), which starts the alignment by matching the first sequence elements, is conventionally placed at the bottom-left corner of the grid. Each path p on the 2D grid (i.e., the cost matrix) is associated with a total cost \(D_p\) given in (2). Among all possible paths, we are interested in the one that minimizes the total accumulated cost while satisfying the desired constraints; this optimal path is denoted by \(p^*\). The DTW distance between two sequences is then the total cost (2) of the optimal path, i.e.:

$$ \mathrm{DTW(\mathbf{R},\mathbf{T})} = \mathrm{D_{p^*}}(\mathbf{R},\mathbf{T}). $$
(3)
Fig. 5 Accumulated cost matrix of two sequences R and T with sizes N and M, respectively. The global constraint region (the Sakoe–Chiba band [28]) is shown in gray

Some well-known restrictions on the warping path have been proposed to eliminate unrealistic correspondences between the sequences [21, 28]. The most fundamental constraints, which apply to gesture recognition as well as many other applications, are the following:

(i) Boundary conditions: \(p_1 = (1,1)\) and \(p_L = (N,M)\).

(ii) Step size condition: \(p_{l+1} - p_l \in \{(0,1),\ (1,0),\ (1,1)\}\) for \(l \in [1:L-1]\).

The boundary conditions require the whole reference sequence to be mapped to the whole test sequence; they can be modified if this is not strictly desired. The step size condition ensures that no element of either sequence is skipped: at each cost computation step of Bellman’s principle, the path advances by at most one element in each sequence. Hence, the optimal path can progress only from a restricted set of predecessor nodes, as shown in Fig. 6. Since all elements are ordered in time, the predecessor nodes lie to the left of and below the current node.

Fig. 6 Predecessor nodes used in Bellman’s principle, where \(n_l \in [1:N]\), \(m_l \in [1:M]\) and \(l \in [2:L]\). Note that \((n_{l-1}, m_{l-1}) \in \{(n_l - 1, m_l), (n_l, m_l - 1), (n_l - 1, m_l - 1)\}\)

First, let us define \(C(n_l, m_l)\) as below

$$ \mathrm{C(n_l,m_l)} = \mathrm{DTW}(\mathbf{R}(1:n_l), \mathbf{T}(1:m_l)). $$
(4)

Note that C(N,M) is equal to DTW(R,T). Let us further assume that the total costs of the optimal paths to the three predecessor nodes \((n_l-1, m_l)\), \((n_l, m_l-1)\), and \((n_l-1, m_l-1)\) have been computed. Since the (l − 1)th position of the path, i.e., \((n_{l-1}, m_{l-1})\), is restricted to be one of these three nodes on the 2D grid, Bellman’s principle leads to

$$ C(n_l,m_l) = \min \{ C(n_l,m_l-1),\ C(n_l-1,m_l),\ C(n_l-1,m_l-1) \} + d(r_{n_l},t_{m_l}). \label{eq:dynamicprog} $$
(5)

Finally, the minimum-cost path aligning the two sequences has cost C(N,M) = DTW(R,T), and the test sequence is matched to the reference sequence with the minimum cost among all reference sequences.

Although (5) yields the minimum cost between two sequences, it does not yield the optimal path itself. To find the optimal path, which can be used to map test sequence elements to reference sequence elements, one needs to backtrack from the final node. Note that if the boundary condition is satisfied, i.e., the whole test sequence is mapped to the whole reference sequence, then \((n_L, m_L) = (N,M)\) and \((n_1, m_1) = (1,1)\). A sketch combining the recursion (5) and the backtracking step is given below.
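This is a minimal illustration, not our exact implementation: the distance function d is supplied by the caller (e.g., the weighted distance introduced in Section 4.2), and the optional band argument anticipates the Sakoe–Chiba constraint discussed in Section 4.1.

```python
import numpy as np

def dtw(R, T, d, band=None):
    """Align reference R (length N) and test T (length M) under distance d.
    Returns the DTW cost C(N, M) and the optimal warping path p*.
    If given, `band` is the Sakoe-Chiba half-width (must be >= |N - M|)."""
    N, M = len(R), len(T)
    C = np.full((N + 1, M + 1), np.inf)   # row/column 0 are padding
    C[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            if band is not None and abs(n - m) > band:
                continue                  # node outside the constraint region
            # Bellman's principle, Eq. (5): best predecessor plus local cost.
            C[n, m] = min(C[n, m - 1], C[n - 1, m], C[n - 1, m - 1]) \
                      + d(R[n - 1], T[m - 1])

    # Backtrack from (N, M) to (1, 1) to recover the optimal path.
    path, (n, m) = [], (N, M)
    while (n, m) != (1, 1):
        path.append((n, m))
        n, m = min([(n, m - 1), (n - 1, m), (n - 1, m - 1)],
                   key=lambda node: C[node])
    path.append((1, 1))
    return C[N, M], path[::-1]
```

Classification then amounts to running this alignment against every reference sequence and picking the one with the minimum cost.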

4.1 Boosting the reliability of DTW

Global constraints define a set of nodes on the 2D grid to be searched for the optimal path. Imposing global constraints not only reduces the computational complexity of DTW, but also increases the reliability of its dissimilarity measure by omitting unrealistic paths. We use a well-known global constraint region, the Sakoe–Chiba band [28], shown in Fig. 5. The Sakoe–Chiba band effectively limits the amount of warping, i.e., the slowing down or speeding up of a sequence in time. For example, a gesture can be performed at different speeds depending on the performer, but it is reasonable to expect a limit on how slowly or quickly a gesture is performed.

Another problem that degrades DTW’s reliability in gesture recognition is the unknown beginning and ending times of gesture samples. A gesture in a test sequence can often begin later or end sooner than the gesture in the reference sequence stored for that class. The boundary conditions assume that all gestures start at the beginning of the sequence and finish at its end; imposing them in such cases therefore decreases the reliability of the DTW costs. To boost the reliability, we relax the boundary conditions by changing the total cost given in (2) as below

$$ \mathrm{D_p} =\sum\limits_{l=1}^{L} \alpha_l d(r_{n_l}, t_{m_l}), $$
(6)

where \(\alpha_l\) is a weight equal to 1 everywhere except in the regions close to the starting node (i.e., the bottom-left node \((r_1, t_1)\)) and the ending node (i.e., the top-right node \((r_N, t_M)\)). To infer the proximity of the current node to the starting and ending nodes, the length of the path, \(||p_l||=\sqrt{n_l^2 + m_l^2}\), is utilized. The distance terms coming from the beginning and the end of the sequence are weighted down by computing \(\alpha_l\) from the formula below

$$ \alpha_l = \left\{ \begin{array}{ll} \dfrac{||p_l||}{\tau} & \mbox{if $||p_l|| < \tau$} \\ \dfrac{L- || p_l||}{\tau} & \mbox{if $L - ||p_l|| < \tau$} \\ 1 & \mbox{otherwise,} \end{array} \right. \label{eq:alpha_l} $$
(7)

where L is the length of the longest path and τ is a threshold value.
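A direct transcription of (7) is sketched below; the function name and signature are illustrative.

```python
import numpy as np

def alpha(p_l, L, tau):
    """Boundary-relaxation weight of Eq. (7) for path point p_l = (n_l, m_l),
    where L is the length of the longest path and tau is the threshold."""
    length = np.hypot(p_l[0], p_l[1])    # ||p_l|| = sqrt(n_l^2 + m_l^2)
    if length < tau:                     # close to the starting node
        return length / tau
    if L - length < tau:                 # close to the ending node
        return (L - length) / tau
    return 1.0
```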

4.2 Weighted DTW

The conventional DTW computes the dissimilarity between two time sequences by aligning them based on a sample-wise distance as in (5). If the sequence samples are multi-dimensional (18-dimensional in our gesture recognition problem), using a Euclidean distance gives equal importance to all dimensions. We propose to use a weighted distance in the cost computation based on how relevant a body joint is to a specific gesture class. Relevancy is defined as the contribution of a joint to the motion pattern of that gesture class. To infer a joint’s contribution to a gesture class, we compute its total displacement (i.e., contribution) during the performance of that gesture by a trained user:

$$ C_j^g= \sum\limits_{n=2}^{N} Dist^j (\mathbf{f}_{n-1}^{\,g}, \mathbf{f}_n^{\,g}), $$
(8)

where g is the gesture index, j is the joint index and n is the skeleton frame number. \(Dist^j(\cdot,\cdot)\) computes the displacement of the jth joint between the consecutive feature vectors \(\mathbf{f}_{n-1}^{\,g}\) and \(\mathbf{f}_n^{\,g}\). Summing these consecutive displacements gives the total displacement of a joint over a selected reference gesture.
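As a sketch, (8) can be computed directly from an array of skeleton frames; the (N, 6, 3) layout (N frames, six joints, 3D coordinates) is an assumption for illustration.

```python
import numpy as np

def total_displacement(frames, j):
    """Total displacement C_j^g of joint j over a reference gesture, Eq. (8).
    frames: array of shape (N, 6, 3)."""
    steps = frames[1:, j] - frames[:-1, j]        # consecutive displacements
    return float(np.linalg.norm(steps, axis=1).sum())
```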

After the total displacements are calculated, we filter out the noise (e.g., shaking, trembling) and threshold them from below and above. This prevents our parametric weight model from outputting weights that are too high or too low:

$$ C_{j}^{g} = \left\{ \begin{array}{ll} C_a & \mbox{if $0\leq C_{j}^{g}<T_1$} \\ \dfrac{C_{j}^{g}-T_1}{T_2-T_1}(C_b-C_a)+C_a & \mbox{if $T_1\leq C_{j}^{g} <T_2$} \\ C_b & \mbox{otherwise,} \end{array} \right. $$
(9)

where \(C_a\) and \(C_b\) are threshold values, and \(T_1\) and \(T_2\) are experimentally determined boundary values for the threshold assignment.
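A direct transcription of (9); the names are illustrative.

```python
def clamp_displacement(c, T1, T2, Ca, Cb):
    """Map a raw total displacement c into [Ca, Cb] as in Eq. (9)."""
    if c < T1:
        return Ca                                      # noise floor
    if c < T2:
        return (c - T1) / (T2 - T1) * (Cb - Ca) + Ca   # linear interpolation
    return Cb                                          # ceiling
```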

Using the total displacement to assess the contribution of a joint to performing a gesture, the weights of gesture class g are calculated via

$$ w_{j}^{g} = \frac{ 1 - e^{ - \beta C_{j}^{g} } } {\sum\limits_{k} \big( 1 - e^{ - \beta C_{k}^{g}}\big)}, $$
(10)

where \(w_{j}^{g}\) is joint j’s weight value for gesture class g. Note that in this formulation a joint’s weight can change depending on the gesture class. For example, for the right-hand-push-up gesture, one would expect the right hand, right elbow and right wrist joints to have large weights, while for the left-hand-push-up gesture they would have smaller weights.
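Equation (10) translates into a few lines; the signature is again illustrative.

```python
import numpy as np

def joint_weights(C_g, beta):
    """Per-joint weights for one gesture class, Eq. (10).
    C_g: clamped total displacements, one entry per joint."""
    u = 1.0 - np.exp(-beta * np.asarray(C_g, dtype=float))
    return u / u.sum()        # normalized so the weights sum to 1
```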

To incorporate these weights into the cost, the distance function \(d(r_n, t_m)\) becomes a weighted average of the joint distances between a reference frame and a test frame, and is defined to be

$$ d(r_n,t_m)=\sum\limits_{j} Dist^{j}(r_n,t_m)w^g_{j}, $$
(11)

which gives the distance between the nth skeleton frame of the reference gesture R and the mth skeleton frame of the test gesture T, where R is a sequence known to be in gesture class g and T is an unknown test sequence.
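A sketch of (11), assuming the same (6, 3) frame layout as above; such a function can be passed as the distance d to the DTW sketch in Section 4.

```python
import numpy as np

def weighted_distance(r_n, t_m, w_g):
    """Weighted frame distance of Eq. (11). r_n, t_m: skeleton frames of
    shape (6, 3); w_g: the weights of gesture class g."""
    per_joint = np.linalg.norm(r_n - t_m, axis=1)   # Dist^j for each joint j
    return float(per_joint @ w_g)
```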

The weights are obtained from the model given in (10), which has a single parameter β. Our objective is to choose a β value that minimizes the within-class variation while maximizing the between-class variation. This can be achieved by making irrelevant joints contribute less to the cost (e.g., reducing the weight of the right hand in the left-hand-push-up gesture) while not reducing (or possibly increasing) the weights of joints that help discriminate between gestures. We achieve this by maximizing a discriminant ratio similar to Fisher’s Discriminant Ratio [16]. To this end, we define \(D_{g,h}(\beta)\) as the average weighted DTW cost between all samples of gesture class g and gesture class h, computed using the weights for a given β. The between-class dissimilarity is then the sum of all \(D_{g,h}(\beta)\) with h ≠ g:

$$ D_{B}(\beta)= \sum\limits_{g}\sum\limits_{\substack{h \\ h \neq g}} D_{g,h}(\beta), $$
(12)

which measures the sum of the average distances between gesture classes. This helps us infer, for a given β, the average distance between a gesture class and the rest of the classes.

The within-class dissimilarity is the sum of the within-class distances \(D_{g,g}(\beta)\) over all gesture classes,

$$ D_{W}(\beta)= \sum\limits_{g} D_{g,g}(\beta), $$
(13)

which sums the average distance \(D_{g,g}(\beta)\) between the samples of each gesture class g.

The discriminant ratio of a given β, R(β), is then obtained by

$$ R(\beta) = \frac{D_{B}(\beta)}{D_{W}(\beta)}. $$
(14)

The optimum β, β *, is chosen as the one that maximizes R:

$$ \beta^* = \arg \max_{\beta} R(\beta). $$
(15)

5 Results

Our experiments were performed on our gesture database, which was recorded with 38 participants; the recordings took approximately one week to complete. All participants performed 12 different gestures, with six samples per gesture class. Bad records, due to either a bad gesture performance (e.g., an incomplete gesture) or a failure of Kinect’s human-pose recognition, correspond to approximately 30 % of all recorded gestures; they were manually deleted using an OpenGL-based gesture visualizer. The physical factors (e.g., distance from the Kinect sensor to the user, illumination in the room) were kept constant during all recordings. Each gesture sample includes 20 joint positions per frame, in addition to time stamps for each skeleton frame. The gesture databases used in the experiments, the source code for visualizing gestures, the source code used to produce the results in this paper, and additional results are publicly available (see footnote 1). We hope that the databases can also be used to test other gesture recognition algorithms.

We tested the performance of our feature pre-processing technique and the proposed weighting method on three separate gesture databases, in order to show the improvements separately: (i) Rotationally distorted gesture database: a set of gestures that are noisy in terms of the rotational orientation of the body with respect to the Kinect sensor in the X, Y and Z axes (see Fig. 3). The gestures are performed by trained users. This database is designed to show the effect of pre-processing on the recognition performance; it has 12 gesture classes and 21 gesture samples per class. (ii) Relaxed gesture database: no intentionally generated rotational distortion; instead, the gesture samples are performed in a more relaxed manner in terms of the movement of body parts other than the active joints. For example, in one sample of this database the performer scratches his head with his left hand while performing the right-hand-push-up gesture. This database has 8 gesture classes and 1116 gesture samples in total. (iii) Rotationally distorted and relaxed gesture database: gestures recorded in a relaxed manner in terms of both rotation and body movement. This database has 12 gesture classes and 198 gesture samples in total; we use it to show the overall performance of the system. All three databases were created using the Microsoft Kinect sensor.

In addition to these databases, there is a set of reference samples per gesture class, performed properly by trained users without any rotational distortion or undesired movements. These reference samples are used to learn the total displacement of each joint in each class, which is required by our weight model in (10). Two sample reference gestures are shown in Fig. 7.

Fig. 7 Two sample reference gestures in the gesture database: Right Hand Push Up and Left Hand Wave

In the first experiment, we test our pre-processing method using the rotationally distorted gesture database. We first calculated the discriminant ratios (see (14)) of the 21 samples for each of the 12 gesture classes without using any of the pre-processing methods. Then, we used the same gesture samples to calculate the discriminant ratios again, this time with our proposed pre-processing methods. Note that uniform weights were used in order to see the performance of the pre-processing alone. The improvement achieved by pre-processing can be seen in Fig. 8.

Fig. 8 Discriminant ratios with and without pre-processing of the gesture samples in the rotationally distorted gesture database. Note that the discriminant ratios increase by 42 % on average with the proposed pre-processing method. There are 21 gesture samples in each gesture class. The gesture classes are, namely, Both Hands Pull Down, Both Hands Push Up, Left Hand Pull Down, Left Hand Push Up, Left Hand Swipe Left, Left Hand Swipe Right, Left Hand Wave, Right Hand Pull Down, Right Hand Push Up, Right Hand Swipe Left, Right Hand Swipe Right, and Right Hand Wave, respectively

In the second experiment, we compared our weighted DTW algorithm against the conventional DTW method and the weighted DTW method proposed in [26], using the relaxed gesture database. The confusion matrices of the three algorithms for six chosen gesture classes are given in Tables 1, 2, and 3. The recognition rates for these classes are consistent with those of the other classes (i.e., the classes not presented in the confusion matrices), but only these classes are shown for the sake of brevity.

Table 1 Confusion matrix for the conventional DTW
Table 2 Confusion matrix for the weighted DTW in [26]
Table 3 Confusion matrix for our proposed weighted DTW

After creating the confusion matrices, we computed the overall recognition accuracies according to the following formula:

$$ A=100\cdot\frac{\mathrm{Trace}(C)}{\sum\limits^{m}_{i=1} \sum\limits^{n}_{j=1} C(i,j)}, $$
(16)

where A denotes the accuracy, and C denotes the confusion matrix.
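As a sketch, (16) is a one-liner on a confusion matrix stored as an array:

```python
import numpy as np

def accuracy(C):
    """Overall recognition accuracy of Eq. (16), in percent."""
    return 100.0 * np.trace(C) / C.sum()
```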

Our proposed method outperforms the weighted DTW method in [26] by a large margin, as given in Table 4. The reason is that their weights are global, i.e., a joint’s weight is independent of the gesture class, whereas in our proposed method a joint can have a different weight depending on the gesture class we are trying to align with. This degree of freedom increases the reliability of the DTW cost significantly.

Table 4 Accuracies of the three methods

In the third and last experiment, we tested the overall performance of our system using the rotationally distorted and relaxed gesture database. The purpose is to determine the overall improvement that the pre-processing and the weighting bring to the recognition performance on a larger database. The results, given in Table 5, clearly demonstrate the performance boost provided by the proposed techniques.

Table 5 Overall performance comparison using the rotationally distorted and relaxed gesture database

6 Conclusion

We have developed a weighted DTW method to boost the discrimination capability of the DTW cost, and have shown that the performance increases significantly. The weights are based on a parametric model that depends on the level of a joint’s contribution to a gesture class. The model parameter is optimized by maximizing a discriminant ratio, which helps to minimize within-class variation and maximize between-class variation. We have also developed a pre-processing method to cope with real-life situations, where different body shapes and user orientations with respect to the depth sensor may occur. Because the weights are selected by maximizing between-class variation, our weighted DTW tolerates noise in skeleton joints as long as that noise does not make a gesture of one class similar to a gesture of another class. We hope that the proposed method will enable more natural remote control of different devices using pre-defined commands for a given context or situation: as long as the noise in a joint does not cause an overlap with another gesture class, the user is free to use his or her other joints naturally.