1 Introduction

Interacting with computers using human motion is commonly employed in human-computer interaction (HCI) applications. One way to incorporate human motion into HCI applications is to use a predefined set of human joint motions, i.e., gestures. Gesture recognition has been an active research area [12, 19, 26, 39] and involves state-of-the-art machine learning techniques in order to work reliably in different environments. A variety of methods have been proposed for gesture recognition, including Dynamic Time Warping (DTW) [26], Hidden Markov Models (HMMs) [12], Finite State Machines [13], hidden Conditional Random Fields (CRFs) [35] and orientation histograms [11]. In addition, there are gesture recognition methods that are not view-based, such as those using the Wii controller (Wiimote) [29] and the DataGlove [23].

DTW measures the similarity between two time sequences, which might be obtained by sampling a source at varying sampling rates or by recording the same phenomenon occurring at varying speeds [37]. The conventional DTW algorithm is basically a dynamic programming algorithm, which iteratively updates the DTW cost by adding the distance between mapped elements of the two sequences at each step. The distance between two elements is oftentimes the Euclidean distance, which gives equal weight to all dimensions of a sequence sample. However, depending on the problem, a weighted distance might perform better in assessing the similarity between a test sequence and a reference sequence. For example, in a typical gesture recognition problem, the body joints used in a gesture vary from gesture class to gesture class. Hence, not all joints are equally important in recognizing a gesture.

We propose a weighted DTW algorithm that uses a weighted distance in the cost computation. The weights are chosen so as to maximize a discriminant ratio based on DTW costs. The weights are obtained from a parametric model which depends on how active a joint is in a gesture class. The model parameter is optimized by maximizing the discriminant ratio. By doing so, some joints will be weighted up and some joints will be weighted down to maximize between-class variance and minimize within-class variance. As a result, irrelevant joints of a gesture class (i.e., parts that are not involved in a gesture class) will contribute to the DTW cost to a lesser extent, while keeping the between-class variances large.

Our system first extracts body-joint features from skeleton data consisting of six joint positions: the left and right hands, wrists and elbows. We have observed that the gestures in our training set, which have quite different motion patterns, can be recognized using all or a subset of these six joints only. The extracted skeleton features are used to recognize gestures by matching them against pre-stored reference sequences. Pre-processing is needed to suppress the noise due to different body and camera orientations and different body sizes. After pre-processing, matching is performed by assigning a test sequence to the reference sequence with the minimum DTW cost. By removing the variations in the data, the DTW cost becomes more reliable for classification, as demonstrated by the increase in the discriminant ratio values.

2 Related work

One commonly used technique for gesture recognition is modeling gesture sequences with HMMs. HMMs are statistical models for sequential data [3, 4], best known for their application to speech recognition, and can likewise be used in gesture recognition [12, 18, 32]. The states of an HMM are hidden, and the state transition probabilities are learned from training data. However, defining states for gestures is not an easy task, since gestures can be formed by a complex interaction of different joints. Also, learning the model parameters, i.e., the transition probabilities, requires large training sets, which may not always be available. DTW, on the other hand, does not require training but needs good reference sequences to align with.

Since its introduction in the 1960s [5], DTW has been used to solve a variety of problems: speech recognition, where speech is warped in time to cope with different speaking speeds [2, 22, 28]; data mining and information retrieval on time-dependent data [1, 24]; curve matching [10]; online handwriting recognition [34]; and hand shape classification [17]. In gesture recognition, DTW time-warps an observed motion sequence of body joints to pre-stored gesture sequences [9, 17, 25, 36]. Although we present the theory of general DTW and its implementation issues, in this paper we focus on its application to gesture recognition. Comprehensive surveys of the general DTW algorithm can be found in [21, 30]. This work is an extended version of our work in [7].

Using a weighting scheme in the DTW cost computation has been proposed for gesture recognition [26]. The method in [26] uses DTW costs to compute between-class and within-class variations and finds a weight for each body joint. These weights are global in the sense that only one weight is computed per body joint. Our proposed method, in contrast, computes a weight for each body joint and for each gesture class. This boosts the discriminative power of DTW costs, since a joint that is active in one gesture class may not be active in another; the weights have to be adjusted accordingly. This especially helps in dealing with within-class variation. To avoid reducing the between-class variance, we compute the weights by optimizing a discriminant ratio using a parametric model that depends on body-joint activity. Another type of weighting in DTW for aligning time series is proposed in [15]. Their goal is to modify DTW so that the similarity between two 1D time series is robust to outliers. An outlier at a particular time instant can create a large error that dominates the distances at other time instants. To avoid this, a robust distance function is used instead of the \(L_1\) norm (i.e., absolute distance) or the \(L_2\) norm (i.e., Euclidean distance). Hence, their weighting applies to the distance values between samples of a 1D time series, whereas we propose to weight each dimension of a multi-dimensional signal, where each dimension is a joint position. Their work is complementary to ours in the sense that a robust distance function, which penalizes outliers to a lesser degree, can also be used in our method.

The goal of dynamic time warping is the alignment of two time sequences via dynamic cost minimization. The final DTW cost, which is a dissimilarity measure between the two sequences, is also used for classification: oftentimes, a test sequence is aligned to a set of templates via DTW and matched to the minimum-cost template. A novel approach that separates the alignment and classification tasks is presented in [20]. First, alignment is performed using DTW; the aligned sequences are then classified using feature generation methods from machine learning. Our proposed method can be used in the alignment phase of the technique proposed in [20].

With Microsoft’s launch of the Kinect in 2010 and the release of the Kinect SDK in 2011, numerous applications and research projects exploring new ways of human-computer interaction have been enabled. Some examples are gesture recognition [26], touch detection using depth data [38], human pose estimation [14], implementation of real-time virtual fixtures [27], real-time robotics control [33] and the physical rehabilitation of young adults with motor disabilities [8]. In the next section we discuss data acquisition and feature pre-processing.

3 Data acquisition and feature pre-processing

We use the Microsoft Kinect sensor [31] to obtain joint positions. The Kinect SDK tracks the 3D coordinates of the 20 body joints shown in Fig. 1 in real time (30 frames per second). The Kinect algorithm predicts joint positions from depth images, and the predicted joint positions are quite robust to color, texture, and background.

Fig. 1 Kinect joints

In our experiments we have focused on hand-arm gestures. Six of the 20 joints in Kinect’s skeleton model are informative for recognizing a hand-arm gesture: the left hand, right hand, left wrist, right wrist, left elbow and right elbow. However, there is no limitation on the number of joints used in our proposed method. For hand-arm gestures the relevant body joints are obvious; for more complex gestures, the most informative joints can be selected using feature selection techniques from the classification literature in machine learning [6]. For example, the Sequential Backward Elimination (SBE) technique, which starts with all 20 joints and eliminates them one by one based on the change in the discriminant ratio, can be utilized, as sketched below.
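SBE reduces to a simple greedy loop. The sketch below is a minimal illustration under stated assumptions: `ratio_of` is a hypothetical caller-supplied routine that evaluates the discriminant ratio of Section 4.2 using only a given subset of joints.

```python
def sbe(all_joints, ratio_of):
    """Sequential Backward Elimination sketch. ratio_of(subset) is assumed
    to return the discriminant ratio obtained when only the joints in
    `subset` are used in the DTW cost."""
    selected = set(all_joints)
    while len(selected) > 1:
        # Find the joint whose removal hurts the discriminant ratio least.
        victim = max(selected, key=lambda j: ratio_of(selected - {j}))
        if ratio_of(selected - {victim}) < ratio_of(selected):
            break                 # every removal degrades the ratio: stop
        selected.remove(victim)
    return selected
```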

In our method, a feature vector consists of the 3D coordinates of these six joints and is of dimension 18, as given below

$$ \mathbf{f}_n=[X_1, Y_1, Z_1, X_2, Y_2, Z_2, \ldots, X_6, Y_6, Z_6], $$
(1)

where n is the index of the skeleton frame at time \(t_n\). A gesture sequence is the concatenation of N such feature vectors.

After the N feature vectors are concatenated to create the gesture sequence, they are pre-processed before the DTW cost computation. The pre-processing consists of three stages. The first stage is normalization, which translates all skeletons to the center of the field of view; this is done by subtracting the hip center joint position from the other joint positions. Note that the reference frames are already recorded at the center of the field of view. The second stage removes the rotational distortion caused by different orientations of human bodies. Unlike the reference gestures, which are performed by trained users, real-life recordings are likely to contain different orientations or positionings of users with respect to the camera. Such cases are problematic for gesture recognition since they result in rotationally distorted skeleton frames (see Fig. 2). To cope with them, our pre-processing system rotates the skeleton frames, if necessary, so that they are orthogonal to the principal axis of the camera. To this end, we define two vectors using the spatial coordinates of the right shoulder, left shoulder and hip center obtained from the Kinect sensor: one from the midpoint of the right and left shoulders to the hip center, and the other from the same midpoint to the right shoulder. Using these two vectors, we calculate the three angles, α, β, θ, of the skeleton with respect to the camera’s coordinate system, and compute the rotation matrices \(\mathbf{R^{\alpha}_{x}}, \mathbf{R^{\beta}_{y}}, \mathbf{R^{\theta}_{z}}\), respectively. The rotation is then applied using these angles in the appropriate order; see an example rotation about the Y axis with \(\mathbf{R^{\beta}_{y}}\) in Fig. 3. The third and last stage eliminates variations in the feature vectors due to different skeleton ratios (broad-shouldered, narrow-shouldered): all feature vectors are normalized by the distance between the left and right shoulders to account for variations in a person’s size. Note that the reference sequences are recorded with people who have average skeleton ratios. A minimal sketch of the three stages is given below; after that, we present a more detailed discussion of DTW.
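The sketch assumes illustrative joint indices and a (20, 3) frame layout, and shows only the Y-axis rotation; the X- and Z-axis rotations follow the same pattern. It is a sketch of the idea, not the exact pipeline.

```python
import numpy as np

# Illustrative joint indices; the actual values depend on the Kinect SDK layout.
HIP_CENTER, LEFT_SHOULDER, RIGHT_SHOULDER = 0, 4, 8

def preprocess_frame(joints):
    """Normalize one skeleton frame given as a (20, 3) array of 3D positions."""
    # Stage 1: translate the skeleton to the center of the field of view.
    joints = joints - joints[HIP_CENTER]

    # Stage 2: remove rotational distortion. Only the Y-axis rotation by the
    # angle beta is shown; the X and Z axes follow the same pattern.
    mid = 0.5 * (joints[LEFT_SHOULDER] + joints[RIGHT_SHOULDER])
    v = joints[RIGHT_SHOULDER] - mid        # midpoint-to-right-shoulder vector
    beta = np.arctan2(v[2], v[0])           # shoulder-line angle in the X-Z plane
    c, s = np.cos(beta), np.sin(beta)
    R_y = np.array([[  c, 0.0,   s],
                    [0.0, 1.0, 0.0],
                    [ -s, 0.0,   c]])
    joints = joints @ R_y.T                 # aligns the shoulder line with the X axis

    # Stage 3: scale by the shoulder-to-shoulder distance to remove
    # variation due to body size.
    width = np.linalg.norm(joints[RIGHT_SHOULDER] - joints[LEFT_SHOULDER])
    return joints / width
```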

Fig. 2 Two skeletons with different orientations (left: ground-truth reference frame, right: rotationally distorted test frame due to improper body orientation)

Fig. 3 Camera A is used to record the ground-truth reference gestures with perpendicular angles; Camera B is used to record a rotationally distorted test sequence. β is the desired angle to rotate the skeleton about the Y axis. After this rotation, the skeleton is rotated about the other axes, if needed, until it is perpendicular to all axes

4 Dynamic time warping for gesture recognition

DTW is a template matching algorithm that finds the best match for a test pattern among a set of reference patterns, where the patterns are represented as time sequences of features. Figure 4 shows an example matching of two sequences.

Fig. 4 DTW used to match two sequences, a reference sequence and a test sequence

Let \(\mathbf{R} = \{r_1, r_2, \ldots, r_N\}\), N ∈ ℕ, and \(\mathbf{T} = \{t_1, t_2, \ldots, t_M\}\), M ∈ ℕ, be the reference and test sequences (sequences of sets of joint positions in our case), respectively. The objective is to align the two sequences in time via a nonlinear mapping. Such a warping path can be written as an ordered set of points as given below

$$ p = (p_1,p_2,\ldots,p_L),\ p_l = (n_l, m_l), $$

where \(p_l = (n_l, m_l)\) denotes the mapping of \(r_{n_l}\) to \(t_{m_l}\), with \(p_l \in [1:N] \times [1:M]\) for \(l \in [1:L]\), where L is the number of mappings. The total cost \(D_p\) of a warping path p between R and T with respect to a distance function \(d(r_i, t_j)\), \(i \in [1:N]\) and \(j \in [1:M]\), is defined as the sum of all distances between the mapped sequence elements

$$ \mathrm{D_p} =\sum\limits_{l=1}^{L} d(r_{n_l}, t_{m_l}), \label{eq:dtw_tc} $$
(2)

where \(D_p\) is the total cost of the path p and \(d(r_i, t_j)\) measures the distance between elements \(r_i\) and \(t_j\). For gesture recognition, the distance can be chosen as the distance between the corresponding joint positions (3D points) of the reference gesture R and the test gesture T.

A mapping can also be viewed as a path on a two-dimensional (2D) grid, also known as the cost matrix, of size N × M (see Fig. 5), where grid node \((r_i, t_j)\) holds the distance between \(r_i\) and \(t_j\). The node \((r_1, t_1)\), which starts the alignment by matching the first sequence elements, is conventionally placed at the bottom-left corner of the grid. Each path p on the 2D grid (i.e., the cost matrix) is associated with a total cost \(D_p\) given in (2). Among all possible paths, we are interested in the one that minimizes the total accumulated cost while satisfying the desired constraints; this optimal path is denoted by \(p^*\). The DTW distance between two sequences is then the total cost (2) of the optimal path, i.e.:

$$ \mathrm{DTW(\mathbf{R},\mathbf{T})} = \mathrm{D_{p^*}}(\mathbf{R},\mathbf{T}). $$
(3)
Fig. 5 Accumulated cost matrix of two sequences R and T with sizes N and M, respectively. The global constraint region (the Sakoe–Chiba band [28]) is shown in gray

Some well-known restrictions on the warping path have been proposed to eliminate unrealistic correspondences between the sequences [21, 28]. The most fundamental constraints, which apply to gesture recognition as well as many other applications, are the following:

(i) Boundary conditions: \(p_1 = (1,1)\) and \(p_L = (N,M)\).

(ii) Step size condition: \(p_{l+1} - p_l \in \{(0,1),\ (1,0),\ (1,1)\}\) for \(l \in [1:L-1]\).

The boundary conditions require the whole reference sequence to be mapped to the whole test sequence; they can be modified if this is not strictly desired. The step size condition ensures that no element of either sequence is skipped: at each cost computation step of Bellman’s principle, the path advances by at most one element in each sequence. Hence, the optimal path can progress only from a restricted set of predecessor nodes, as shown in Fig. 6. Since all elements are ordered in time, the predecessor nodes lie to the left of and below the current node.

Fig. 6 Predecessor nodes used in Bellman’s principle, where \(n_l \in [1:N]\), \(m_l \in [1:M]\) and \(l \in [2:L]\). Note that \((n_{l-1}, m_{l-1}) \in \{(n_l - 1, m_l), (n_l, m_l - 1), (n_l - 1, m_l - 1)\}\)

First, let us define \(C(n_l, m_l)\) as below

$$ \mathrm{C(n_l,m_l)} = \mathrm{DTW}(\mathbf{R}(1:n_l), \mathbf{T}(1:m_l)). $$
(4)

Note that C(N,M) is equal to DTW(R,T). Let us further assume that the total costs of the optimal paths to the three predecessor nodes \((n_l-1, m_l)\), \((n_l, m_l-1)\), and \((n_l-1, m_l-1)\) have been computed. Since the (l − 1)th position of the path, i.e., \((n_{l-1}, m_{l-1})\), is restricted to be one of these three nodes on the 2D grid, Bellman’s principle leads to

$$ C(n_l,m_l) = \min \{ C(n_l,m_l-1),\ C(n_l-1,m_l),\ C(n_l-1,m_l-1) \} + d(r_{n_l},t_{m_l}). \label{eq:dynamicprog} $$
(5)

Finally, the minimum-cost path aligning the two sequences has cost C(N,M) = DTW(R,T), and the test sequence is matched to the reference sequence with the minimum cost among all reference sequences.

Although (5) yields the minimum cost between two sequences, it does not yield the optimal path itself. To find the optimal path, which can be used to map test sequence elements to reference sequence elements, one needs to backtrack from the final node. Note that if the boundary condition is satisfied, i.e., the whole test sequence is mapped to the whole reference sequence, then \((n_L, m_L) = (N,M)\) and \((n_1, m_1) = (1,1)\). A sketch combining the recursion (5) and the backtracking step is given below.
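This is a minimal illustration, not our exact implementation: the distance function d is supplied by the caller (e.g., the weighted distance introduced in Section 4.2), and the optional band argument anticipates the Sakoe–Chiba constraint discussed in Section 4.1.

```python
import numpy as np

def dtw(R, T, d, band=None):
    """Align reference R (length N) and test T (length M) under distance d.
    Returns the DTW cost C(N, M) and the optimal warping path p*.
    If given, `band` is the Sakoe-Chiba half-width (must be >= |N - M|)."""
    N, M = len(R), len(T)
    C = np.full((N + 1, M + 1), np.inf)   # row/column 0 are padding
    C[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            if band is not None and abs(n - m) > band:
                continue                  # node outside the constraint region
            # Bellman's principle, Eq. (5): best predecessor plus local cost.
            C[n, m] = min(C[n, m - 1], C[n - 1, m], C[n - 1, m - 1]) \
                      + d(R[n - 1], T[m - 1])

    # Backtrack from (N, M) to (1, 1) to recover the optimal path.
    path, (n, m) = [], (N, M)
    while (n, m) != (1, 1):
        path.append((n, m))
        n, m = min([(n, m - 1), (n - 1, m), (n - 1, m - 1)],
                   key=lambda node: C[node])
    path.append((1, 1))
    return C[N, M], path[::-1]
```

Classification then amounts to running this alignment against every reference sequence and picking the one with the minimum cost.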

4.1 Boosting the reliability of DTW

Global constraints define a set of nodes on the 2D grid to be searched for the optimal path. Imposing global constraints not only reduces the computational complexity of DTW, but also increases the reliability of its dissimilarity measure by omitting unrealistic paths. We use a well-known global constraint region, the Sakoe–Chiba band [28], shown in Fig. 5. The Sakoe–Chiba band effectively limits the amount of warping, i.e., the slowing down or speeding up of a sequence in time. For example, a gesture can be performed at different speeds depending on the performer, but it is reasonable to expect a limit on how slowly or quickly a gesture is performed.

Another problem that degrades DTW’s reliability in gesture recognition is the unknown beginning and ending times of gesture samples. A gesture in a test sequence can often begin later or end sooner than the gesture in the reference sequence stored for that class. The boundary conditions assume that all gestures start at the beginning of the sequence and finish at its end; imposing them in such cases therefore decreases the reliability of the DTW costs. To boost the reliability, we relax the boundary conditions by changing the total cost given in (2) as below

$$ \mathrm{D_p} =\sum\limits_{l=1}^{L} \alpha_l d(r_{n_l}, t_{m_l}), $$
(6)

where \(\alpha_l\) is a weight equal to 1 everywhere except in the regions close to the starting node (i.e., the bottom-left node \((r_1, t_1)\)) and the ending node (i.e., the top-right node \((r_N, t_M)\)). To infer the proximity of the current node to the starting and ending nodes, the length of the path, \(||p_l||=\sqrt{n_l^2 + m_l^2}\), is utilized. The distance terms coming from the beginning and the end of the sequence are weighted down by computing \(\alpha_l\) from the formula below

$$ \alpha_l = \left\{ \begin{array}{ll} \dfrac{||p_l||}{\tau} & \mbox{if $||p_l|| < \tau$} \\ \dfrac{L- || p_l||}{\tau} & \mbox{if $L - ||p_l|| < \tau$} \\ 1 & \mbox{otherwise,} \end{array} \right. \label{eq:alpha_l} $$
(7)

where L is the length of the longest path and τ is a threshold value.
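A direct transcription of (7) is sketched below; the function name and signature are illustrative.

```python
import numpy as np

def alpha(p_l, L, tau):
    """Boundary-relaxation weight of Eq. (7) for path point p_l = (n_l, m_l),
    where L is the length of the longest path and tau is the threshold."""
    length = np.hypot(p_l[0], p_l[1])    # ||p_l|| = sqrt(n_l^2 + m_l^2)
    if length < tau:                     # close to the starting node
        return length / tau
    if L - length < tau:                 # close to the ending node
        return (L - length) / tau
    return 1.0
```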

4.2 Weighted DTW

The conventional DTW computes the dissimilarity between two time sequences by aligning them based on a sample-wise distance as in (5). If the sequence samples are multi-dimensional (18-dimensional in our gesture recognition problem), using a Euclidean distance gives equal importance to all dimensions. We propose to use a weighted distance in the cost computation based on how relevant a body joint is to a specific gesture class. Relevancy is defined as the contribution of a joint to the motion pattern of that gesture class. To infer a joint’s contribution to a gesture class, we compute its total displacement (i.e., contribution) during the performance of that gesture by a trained user:

$$ C_j^g= \sum\limits_{n=2}^{N} Dist^j (\mathbf{f}_{n-1}^{\,g}, \mathbf{f}_n^{\,g}), $$
(8)

where g is the gesture index, j is the joint index and n is the skeleton frame number. \(Dist^j(\cdot,\cdot)\) computes the displacement of the jth joint between the consecutive feature vectors \(\mathbf{f}_{n-1}^{\,g}\) and \(\mathbf{f}_n^{\,g}\). Summing these consecutive displacements gives the total displacement of a joint over a selected reference gesture.
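As a sketch, (8) can be computed directly from an array of skeleton frames; the (N, 6, 3) layout (N frames, six joints, 3D coordinates) is an assumption for illustration.

```python
import numpy as np

def total_displacement(frames, j):
    """Total displacement C_j^g of joint j over a reference gesture, Eq. (8).
    frames: array of shape (N, 6, 3)."""
    steps = frames[1:, j] - frames[:-1, j]        # consecutive displacements
    return float(np.linalg.norm(steps, axis=1).sum())
```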

After the total displacements are calculated, we filter out the noise (e.g., shaking, trembling) and threshold them from below and above. This prevents our parametric weight model from outputting weights that are too high or too low:

$$ C_{j}^{g} = \left\{ \begin{array}{ll} C_a & \mbox{if $0\leq C_{j}^{g}<T_1$} \\ \dfrac{C_{j}^{g}-T_1}{T_2-T_1}(C_b-C_a)+C_a & \mbox{if $T_1\leq C_{j}^{g} <T_2$} \\ C_b & \mbox{otherwise,} \end{array} \right. $$
(9)

where \(C_a\) and \(C_b\) are threshold values, and \(T_1\) and \(T_2\) are experimentally determined boundary values for the threshold assignment.
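A direct transcription of (9); the names are illustrative.

```python
def clamp_displacement(c, T1, T2, Ca, Cb):
    """Map a raw total displacement c into [Ca, Cb] as in Eq. (9)."""
    if c < T1:
        return Ca                                      # noise floor
    if c < T2:
        return (c - T1) / (T2 - T1) * (Cb - Ca) + Ca   # linear interpolation
    return Cb                                          # ceiling
```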

Using the total displacement to assess the contribution of a joint to performing a gesture, the weights of gesture class g are calculated via

$$ w_{j}^{g} = \frac{ 1 - e^{ - \beta C_{j}^{g} } } {\sum\limits_{k} \big( 1 - e^{ - \beta C_{k}^{g}}\big)}, $$
(10)

where \(w_{j}^{g}\) is joint j’s weight value for gesture class g. Note that in this formulation a joint’s weight can change depending on the gesture class. For example, for the right-hand-push-up gesture, one would expect the right hand, right elbow and right wrist joints to have large weights, while for the left-hand-push-up gesture they would have smaller weights.
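Equation (10) translates into a few lines; the signature is again illustrative.

```python
import numpy as np

def joint_weights(C_g, beta):
    """Per-joint weights for one gesture class, Eq. (10).
    C_g: clamped total displacements, one entry per joint."""
    u = 1.0 - np.exp(-beta * np.asarray(C_g, dtype=float))
    return u / u.sum()        # normalized so the weights sum to 1
```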

To incorporate these weights into the cost, the distance function \(d(r_n, t_m)\) becomes a weighted average of the joint distances between a reference frame and a test frame, and is defined to be

$$ d(r_n,t_m)=\sum\limits_{j} Dist^{j}(r_n,t_m)w^g_{j}, $$
(11)

which gives the distance between the nth skeleton frame of the reference gesture R and the mth skeleton frame of the test gesture T, where R is a sequence known to be in gesture class g and T is an unknown test sequence.
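A sketch of (11), assuming the same (6, 3) frame layout as above; such a function can be passed as the distance d to the DTW sketch in Section 4.

```python
import numpy as np

def weighted_distance(r_n, t_m, w_g):
    """Weighted frame distance of Eq. (11). r_n, t_m: skeleton frames of
    shape (6, 3); w_g: the weights of gesture class g."""
    per_joint = np.linalg.norm(r_n - t_m, axis=1)   # Dist^j for each joint j
    return float(per_joint @ w_g)
```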

The weights are obtained from the model given in (10), which has a single parameter β. Our objective is to choose a β value that minimizes the within-class variation while maximizing the between-class variation. This can be achieved by making irrelevant joints contribute less to the cost (e.g., reducing the weight of the right hand in the left-hand-push-up gesture) while not reducing (or possibly increasing) the weights of joints that help discriminate between gestures. We achieve this by maximizing a discriminant ratio similar to Fisher’s Discriminant Ratio [16]. To this end, we define \(D_{g,h}(\beta)\) as the average weighted DTW cost between all samples of gesture class g and gesture class h, computed using the weights for a given β. The between-class dissimilarity is then the sum of all \(D_{g,h}(\beta)\) with h ≠ g:

$$ D_{B}(\beta)= \sum\limits_{g}\sum\limits_{\substack{h \\ h \neq g}} D_{g,h}(\beta), $$
(12)

which measures the sum of the average distances between gesture classes. This helps us infer, for a given β, the average distance between a gesture class and the rest of the classes.

The within-class dissimilarity is the sum of the within-class distances \(D_{g,g}(\beta)\) over all gesture classes,

$$ D_{W}(\beta)= \sum\limits_{g} D_{g,g}(\beta), $$
(13)

which sums the average distance \(D_{g,g}(\beta)\) between the samples of each gesture class g.

The discriminant ratio of a given β, R(β), is then obtained by

$$ R(\beta) = \frac{D_{B}(\beta)}{D_{W}(\beta)}. $$
(14)

The optimum β, β *, is chosen as the one that maximizes R:

$$ \beta^* = \arg \max_{\beta} R(\beta). $$
(15)

5 Results

Our experiments were performed on our gesture database, which was recorded with 38 participants; the recordings took approximately one week to complete. All participants performed 12 different gestures, with six samples per gesture class. Bad records, due to either a bad gesture performance (e.g., an incomplete gesture) or a failure of Kinect’s human-pose recognition, correspond to approximately 30 % of all recorded gestures; they were manually deleted using an OpenGL-based gesture visualizer. The physical factors (e.g., distance from the Kinect sensor to the user, illumination in the room) were kept constant during all recordings. Each gesture sample includes 20 joint positions per frame, in addition to time stamps for each skeleton frame. The gesture databases used in the experiments, the source code for visualizing gestures, the source code used to produce the results in this paper, and additional results are publicly available (see footnote 1). We hope that the databases can also be used to test other gesture recognition algorithms.

We tested the performance of our feature pre-processing technique and the proposed weighting method on three separate gesture databases, in order to show the improvements separately: (i) Rotationally distorted gesture database: a set of gestures that are noisy in terms of the rotational orientation of the body with respect to the Kinect sensor in the X, Y and Z axes (see Fig. 3). The gestures are performed by trained users. This database is designed to show the effect of pre-processing on the recognition performance; it has 12 gesture classes and 21 gesture samples per class. (ii) Relaxed gesture database: no intentionally generated rotational distortion; instead, the gesture samples are performed in a more relaxed manner in terms of the movement of body parts other than the active joints. For example, in one sample of this database the performer scratches his head with his left hand while performing the right-hand-push-up gesture. This database has 8 gesture classes and 1116 gesture samples in total. (iii) Rotationally distorted and relaxed gesture database: gestures recorded in a relaxed manner in terms of both rotation and body movement. This database has 12 gesture classes and 198 gesture samples in total; we use it to show the overall performance of the system. All three databases were created using the Microsoft Kinect sensor.

In addition to these databases, there is a set of reference samples per gesture class, performed properly by trained users without any rotational distortion or undesired movements. These reference samples are used to learn the total displacement of each joint in each class, which is required by our weight model in (10). Two sample reference gestures are shown in Fig. 7.

Fig. 7 Two sample reference gestures in the gesture database: Right Hand Push Up and Left Hand Wave

In the first experiment, we test our pre-processing method using the rotationally distorted gesture database. We first calculated the discriminant ratios (see (14)) of the 21 samples for each of the 12 gesture classes without using any of the pre-processing methods. Then, we used the same gesture samples to calculate the discriminant ratios again, this time with our proposed pre-processing methods. Note that uniform weights were used in order to see the performance of the pre-processing alone. The improvement achieved by pre-processing can be seen in Fig. 8.

Fig. 8 Discriminant ratios with and without pre-processing of the gesture samples in the rotationally distorted gesture database. Note that the discriminant ratios increase by 42 % on average with the proposed pre-processing method. There are 21 gesture samples in each gesture class. The gesture classes are, namely, Both Hands Pull Down, Both Hands Push Up, Left Hand Pull Down, Left Hand Push Up, Left Hand Swipe Left, Left Hand Swipe Right, Left Hand Wave, Right Hand Pull Down, Right Hand Push Up, Right Hand Swipe Left, Right Hand Swipe Right, and Right Hand Wave, respectively

In the second experiment, we compared our weighted DTW algorithm against the conventional DTW method and the weighted DTW method proposed in [26], using the relaxed gesture database. The confusion matrices of the three algorithms for six chosen gesture classes are given in Tables 1, 2, and 3. The recognition rates for these classes are consistent with those of the other classes (i.e., the classes not presented in the confusion matrices), but only these classes are shown for the sake of brevity.

Table 1 Confusion matrix for the conventional DTW
Table 2 Confusion matrix for the weighted DTW in [26]
Table 3 Confusion matrix for our proposed weighted DTW

After creating the confusion matrices, we computed the overall recognition accuracies according to the following formula:

$$ A=100\cdot\frac{\mathrm{Trace}(C)}{\sum\limits^{m}_{i=1} \sum\limits^{n}_{j=1} C(i,j)}, $$
(16)

where A denotes the accuracy, and C denotes the confusion matrix.
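As a sketch, (16) is a one-liner on a confusion matrix stored as an array:

```python
import numpy as np

def accuracy(C):
    """Overall recognition accuracy of Eq. (16), in percent."""
    return 100.0 * np.trace(C) / C.sum()
```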

Our proposed method outperforms the weighted DTW method in [26] by a large margin, as given in Table 4. The reason is that their weights are global, i.e., a joint’s weight is independent of the gesture class, whereas in our proposed method a joint can have a different weight depending on the gesture class we are trying to align with. This degree of freedom increases the reliability of the DTW cost significantly.

Table 4 Accuracies of the three methods

In the third and last experiment, we tested the overall performance of our system using the rotationally distorted and relaxed gesture database. The purpose is to determine the overall improvement that the pre-processing and the weighting bring to the recognition performance on a larger database. The results, given in Table 5, clearly demonstrate the performance boost provided by the proposed techniques.

Table 5 Overall performance comparison using the rotationally distorted and relaxed gesture database

6 Conclusion

We have developed a weighted DTW method to boost the discrimination capability of the DTW cost, and have shown that the performance increases significantly. The weights are based on a parametric model that depends on the level of a joint’s contribution to a gesture class. The model parameter is optimized by maximizing a discriminant ratio, which helps to minimize within-class variation and maximize between-class variation. We have also developed a pre-processing method to cope with real-life situations, where different body shapes and user orientations with respect to the depth sensor may occur. Because the weights are selected by maximizing between-class variation, our weighted DTW tolerates noise in skeleton joints as long as that noise does not make a gesture of one class similar to a gesture of another class. We hope that the proposed method will enable more natural remote control of different devices using pre-defined commands for a given context or situation: as long as the noise in a joint does not cause an overlap with another gesture class, the user is free to use his or her other joints naturally.