1 Introduction

As computers and mobile devices become increasingly important in our daily lives, much research has focused on human-machine interaction. Traditional methods, such as the mouse, keyboard, or button-based remote control, bore young users and, more seriously, trouble the blind and the elderly, who cannot see the buttons on a keyboard or remote control clearly, nor track the trail of a mouse. It is therefore necessary to propose other human-machine interaction methods. Hand gesture, a natural way of communication between individuals, has become one of the most suitable alternatives and has attracted a large number of research groups [15, 24, 35, 37, 41].

With the rapid development of sensors, 3D accelerometers are becoming much cheaper and more widely used in mobile devices. Accelerometer based gesture recognition has been highlighted by a number of research groups. For example, Elisabetta Farella et al. detected the movement of the human body by analyzing an accelerometer network [9]. Wolfgang Hürst et al. chose the accelerometer, together with the compass, for gesture-based interaction in mobile augmented reality [14]. Andrew Wilson et al. implemented a gesture based interaction device called XWand [38]. Cho et al. proposed the Magic Wand to control home electronic appliances [7]. Chen et al. adopted a gesture-aware device as a presentation tool [6]. There has been extensive research on applications of the acceleration sensor on the mobile phone platform [3, 5, 8, 19-21, 36]. Accelerometer based gesture research has also shown great potential for applications in e-Learning/e-Teaching [11-13].

In previous work, gesture recognition algorithms can be divided mainly into two categories: template based methods and model based methods, as shown in Table 1. Among the template based methods, which use training examples as gesture templates, Dynamic Time Warping (DTW) is perhaps the most frequently used algorithm. It is easy to implement and requires only one training sample to initiate. DTW based systems have shown satisfactory performance in a number of user-dependent applications [2, 17, 40]. However, since people perform the same gesture in individually different ways, using gesture samples from one person as templates often results in poor performance when the system is used by others. Thus DTW fails in user-independent applications [39]. In contrast, model based methods learn the parameters describing each gesture from a large quantity of training samples. Thus HMM, the most widely used model based method, performs well in many user-independent applications [16, 23, 27, 30]. Hidden Conditional Random Fields (HCRF) [33], Dynamic Bayesian Networks (DBN) [34], and Self Organizing Networks (SON) [10] have also been implemented in previous gesture recognition systems. However, determining the parameters of a model based method requires a large number of training samples from different users, which leads to a time-consuming and laborious sample collection process; with a limited number of training samples, system performance declines sharply.

Table 1 Traditional gesture recognition methods

Whether template based or model based, the templates or parameters of the gestures are always learnt from training samples first, and then some kind of matching is conducted to output the classification result. For these training-required methods, a small number of training samples results in poor performance, while a large quantity of training samples entails a time-consuming and laborious collection process. Thus, training-free recognition has recently become a hot research topic in many fields [26, 31]. Without useful information about the gestures, training-free gesture recognition seems an impossible task. However, standard gesture trails, a crucial piece of prior information ignored in previous research, are frequently provided in the instructions of user-independent gesture recognition applications. With this prior knowledge, we can do much more.

In this paper, by exploiting gesture trail information, high-performance training-free hand gesture recognition with an accelerometer is achieved. First, we determine the underlying space for gesture generation using the physical meaning of acceleration direction. Then, the template of each gesture in the underlying space is generated directly from the gesture trails, which are frequently provided in the instructions of gesture recognition devices. Since this template generation process requires no training samples, the recognition is training-free. After that, a feature extraction method, which transforms the original acceleration sequence into a sequence of more user-invariant features in the underlying space, and a more robust template matching method based on dynamic programming are presented to complete the recognition process and enhance system performance. To test the algorithm's performance, 2,240 gesture samples are collected from 28 persons for the 8 gestures used in DVD control [18]. Our training-free method, MMDTW, achieves an accuracy of 93.1 % in the user-independent experiment, better even than the standard training-required methods of DTW (88.7 % with the best template, 72.9 % on average) and HMM (91.1 % when using 90 % of the samples for training).

The remainder of the paper is organized as follows. First, the classical algorithms DTW and HMM are discussed in Section 2. In Section 3, key points in designing a high-performance training-free algorithm are addressed. Section 4 then proposes the algorithm for training-free accelerometer based gesture recognition. Section 5 tests the performance of the algorithm, and conclusions are drawn in Section 6.

2 Related work

This section discusses the classical algorithms for gesture recognition: DTW and HMM.

Fig. 1 The time alignment of two sequences varying in length

2.1 DTW

DTW is a well-known algorithm for measuring the similarity between two sequences which may vary in length [22] (Fig. 1). For two sequences \(X := (x_1,x_2,\dots,x_N)\) of length N ∈ ℕ and \(Y := (y_1,y_2,\dots,y_M)\) of length M ∈ ℕ, we define a local cost measure \(c(x_n,y_m) \ge 0\), which depicts the difference between \(x_n\) and \(y_m\), and obtain the cost matrix \(C \in \mathbb{R}^{N \times M}\) defined by \(C(n,m) := c(x_n,y_m)\). An (N,M)-warping path is then a sequence \(p = (p_1,\dots,p_L)\) with \(p_l = (n_l,m_l) \in [1:N] \times [1:M]\) for \(l \in [1:L]\) satisfying the following three conditions.

1. Boundary condition: \(p_1 = (1,1)\) and \(p_L = (N,M)\);

2. Monotonicity condition: \(n_1 \le n_2 \le \dots \le n_L\) and \(m_1 \le m_2 \le \dots \le m_L\);

3. Step size condition: \(p_{l+1} - p_l \in \{(1,0),(0,1),(1,1)\}\) for \(l \in [1:L-1]\).

The total cost \(c_p(X,Y)\) of a warping path p between X and Y with respect to the local cost measure c is defined as

$$ c_p(X,Y):= \sum\limits_{l=1}^L c(x_{n_l},y_{m_l}). $$
(1)

An optimal warping path is a warping path \(p^*\) having minimal total cost among all possible warping paths. The DTW distance DTW(X,Y) between X and Y is then defined as the total cost of \(p^*\):

$$ DTW(X,Y) := c_{p^*} (X,Y) = \min \{c_p(X,Y)| p \mbox{ is an }(N,M)\mbox{-warping path} \}. $$
(2)

With prefix sequences \(X(1:n) := (x_1,\dots,x_n)\) for n ∈ [1:N], and \(Y(1:m)\) defined analogously, we can define:

$$ D(n,m):= DTW(X(1:n),Y(1:m)). $$
(3)

DTW finds an alignment between X and Y with minimal overall cost DTW(X,Y) = D(N,M) by dynamic programming with the following recursive process.

1. Initialization:

   \(D(n,1)=\sum_{k=1}^n c(x_k,y_1)\), where n ≤ N;

   \(D(1,m)=\sum_{k=1}^m c(x_1,y_k)\), where m ≤ M.

2. Iteration:

   $$ D(n,m) = \min \left\{ \begin{array}{l} D(n-1,m) + C(n,m)\\ D(n-1,m-1) + C(n,m)\\ D(n,m-1) + C(n,m) \end{array} \right. $$

   where 1 < n ≤ N and 1 < m ≤ M.

In summary, the key steps of DTW are

1. determine the local cost measure \(c(x_n,y_m) \ge 0\) and obtain the cost matrix \(C \in \mathbb{R}^{N \times M}\);

2. output the overall cost D(N,M) via dynamic programming.

In practice, the above process is performed between the sequence to be classified (X of length N) and the template of each class (\(Y_i\) of length \(M_i\)). The classification result is then given as the best-matched class \(C = \arg\min_i D_i(N,M_i)\).
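For concreteness, the recursion above can be sketched in a few lines of Python; the function names and NumPy layout below are illustrative, not taken from any cited implementation:

```python
import numpy as np

def dtw_distance(X, Y, cost):
    """Plain DTW as described above: X has length N, Y has length M,
    cost(x, y) is the local cost measure c. Returns D(N, M)."""
    N, M = len(X), len(Y)
    C = np.array([[cost(x, y) for y in Y] for x in X])   # cost matrix C(n, m)
    D = np.empty((N, M))
    D[:, 0] = np.cumsum(C[:, 0])                         # initialization
    D[0, :] = np.cumsum(C[0, :])
    for n in range(1, N):                                # iteration
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n-1, m], D[n-1, m-1], D[n, m-1])
    return D[N-1, M-1]

def classify(X, templates, cost):
    """Best-matched class: C = argmin_i D_i(N, M_i)."""
    return min(templates, key=lambda i: dtw_distance(X, templates[i], cost))
```

Here templates maps class labels to their template sequences \(Y_i\); for acceleration data, cost could be, e.g., the Euclidean distance between 3D samples.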

2.2 HMM

2.2.1 Introduction of HMM

HMM is a well-known and widely applied model [28] for dealing with sequential data efficiently. In the model, we assume that the latent variables \(\mathbf{z}_n\) form a Markov chain and give rise to the observations \(\mathbf{x}_n\). As a generative model, shown in Fig. 2, an HMM is specified by three sets of parameters: (1) initial probabilities \(\pi_k \equiv p(z_{1k}=1)\); (2) transition probabilities \(A_{jk} \equiv p(z_{nk}=1 | z_{n-1,j}=1)\), determining the relationship between adjacent latent states under the Markov assumption, where \(z_{nk}\) indicates the kth state of the latent variable at time n; (3) emission probabilities \(p(\mathbf{x}_n|\mathbf{z}_n,\phi)\), determining the relationship between the latent states and the observations at time n, where φ denotes the parameters of the emission distribution.

Fig. 2 Graphical structure of hidden Markov model

According to these relationships, the joint distribution over all variables in the model (both latent variables and observations) is:

$$ p(\mathbf{X},\mathbf{Z}|\theta) =p(\mathbf{z}_1|\mathbf{\pi}) \left[\prod\limits_{n=2}^N p(\mathbf{z}_n|\mathbf{z}_{n-1},\mathbf{A})\right]\prod\limits_{m=1}^N p(\mathbf{x}_m|\mathbf{z}_m,\phi); $$
(4)

where X = {x 1,...,x N } and Z = {z 1,...,z N }, and θ = {π,A,φ} represents all parameters of the model.

2.2.2 Training process of HMM

When using HMM for classification, we must first determine the model parameters of each class i, \(\theta_i = \{\pi_i, A_i, \phi_i\}\), from the training samples of class i. The (locally optimal) maximum likelihood solution can be found with the Expectation Maximization (EM) algorithm:

1. E-step: with the current parameters \(\theta^{old}\), obtain \(q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X},\theta^{old})\);

2. M-step: maximize \(Q(\theta,\theta^{old})=\sum_\mathbf{Z} p(\mathbf{Z}|\mathbf{X},\theta^{old}) \ln p(\mathbf{X},\mathbf{Z}|\theta)\) with respect to the model parameters θ to obtain \(\theta^{new}\).

When adopting the EM algorithm for HMM, we define

$$ \gamma(\boldsymbol z_n)=p\left(\boldsymbol z_n| \boldsymbol X, \boldsymbol \theta^{old}\right), $$
(5)
$$ \xi\left(\boldsymbol z_{n-1},\boldsymbol z_n\right)=p\left(\boldsymbol z_{n-1},\boldsymbol z_n|\boldsymbol X,\boldsymbol \theta^{old}\right), $$
(6)
$$ \gamma(z_{nk})= \mathbb{E}[z_{nk}]=\sum\limits_{\boldsymbol z_n} \gamma(\boldsymbol z_n)\, z_{nk}, $$
(7)
$$ \xi\left(z_{n-1,j},z_{nk}\right)= \mathbb{E}[z_{n-1,j}z_{nk}]=\sum\limits_{\boldsymbol z_{n-1},\boldsymbol z_n} \xi(\boldsymbol z_{n-1},\boldsymbol z_n)\, z_{n-1,j}z_{nk}. $$
(8)

The optimization objective function in the M-step can then be reformulated as

$$ Q\left(\theta,\theta^{old}\right)= \sum\limits_{k=1}^K \gamma(z_{1k}) \ln \pi_k +\sum\limits_{n=2}^N \sum\limits_{j=1}^K \sum\limits_{k=1}^K \xi(z_{n-1,j},z_{nk}) \ln A_{jk} + \sum\limits_{n=1}^N \sum\limits_{k=1}^K \gamma(z_{nk}) \ln p(\boldsymbol x_n|\boldsymbol \phi_k). $$
(9)

The parameter update equations can be obtained using Lagrange multipliers:

$$ \pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^K \gamma(z_{1j})}; $$
(10)
$$ A_{jk}=\frac{\sum_{n=2}^N \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^K \sum_{n=2}^N \xi(z_{n-1,j}, z_{nl})}. $$
(11)

The update equation for φ depends on the form of the emission probabilities \(p(\mathbf{x}_n|\mathbf{z}_n,\phi)\). The quantities \(\gamma(\boldsymbol z_n)\) and \(\xi(\boldsymbol z_{n-1},\boldsymbol z_n)\) in each iteration can be computed efficiently via the sum-product algorithm.

2.2.3 Classification process of HMM

The likelihood of a sequence under each class, \(P(\mathbf{X}|C_i) = P(\mathbf{X}|\theta_i)\), can then be determined efficiently. According to Bayes' rule, the posterior probabilities \(P(C_i|\mathbf{X})\) are obtained and the final recognition result is achieved:

$$ P(C_i|\mathbf{X}) = \frac{P(\mathbf{X}|\theta_i)P(C_i)}{\sum\limits_{j} P(\mathbf{X}|\theta_j)P(C_j)}. $$
(12)
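For illustration, the class likelihoods \(P(\mathbf{X}|\theta_i)\) in (12) can be computed with the forward pass of the sum-product algorithm. The sketch below assumes a discrete HMM (the variant used in the experiments of Section 5); the per-step scaling is a standard numerical-stability device, not something prescribed above:

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Forward algorithm for a discrete HMM. obs: sequence of symbol indices;
    pi: (K,) initial probabilities; A: (K, K) transition matrix;
    B: (K, S) emission matrix. Returns log P(X | theta)."""
    alpha = pi * B[:, obs[0]]                    # alpha_1(k) = pi_k * p(x_1|k)
    log_p = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()                      # rescale to avoid underflow
        log_p += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]
    return log_p + np.log(alpha.sum())

def classify(obs, models, priors):
    """Bayes-rule classification per Eq. (12); the denominator is common to
    all classes and can be dropped for the argmax."""
    scores = {c: log_likelihood(obs, *models[c]) + np.log(priors[c])
              for c in models}
    return max(scores, key=scores.get)
```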

3 High-performance training-free algorithm design

To recognize a gesture, it is important to analyze the generation process of the acceleration data and understand why traditional methods perform well or poorly. Based on this analysis, we can design a high-performance, training-free algorithm.

3.1 Gesture generation process

When performing a gesture, the brain first analyzes the trail of the gesture and divides it into a series of hand operations in turn. Then, individual variation is introduced by habitual hand motion. Finally, the accelerometer senses the motion and outputs the observable acceleration sequences. In this process, illustrated in Fig. 3, the acceleration sequences are observations obtained from the accelerometer, and there are underlying factors determining these observations, namely the motion process in our case. For the same gesture performed by different people, though the observations vary greatly, the underlying factors are roughly the same.

Fig. 3 The generating process of the gesture data

3.2 Analysis of traditional algorithms

DTW chooses one or more acceleration samples as templates for each gesture. The matching is then performed directly in the observable space of acceleration sequences. Since dynamic-programming based matching can capture long-term dependencies of sequences [29], DTW performs well in user-dependent applications [17]. However, in the user-independent case, it is nearly impossible for DTW to get rid of personal features, and thus DTW fails to show satisfactory performance.

HMM, in contrast, depicts the underlying factors of gesture generation via latent variables. Consequently, it displays rather desirable performance in user-independent applications. However, HMM models the relationship between latent variables via a first-order transition matrix. Thus it cannot discriminate gestures with similar transition matrices and fails to model the long-term dependencies (higher-order relations) between underlying factors [25, 32].

3.3 High-performance training-free algorithm design

3.3.1 Key points for high performance

Inspired by the analysis in Sections 3.1 and 3.2, to obtain a high-performance gesture recognition algorithm, the following aspects must be considered during its design:

  • Underlying space: the underlying space should depict the user-invariant underlying factors of gestures.

  • Matching algorithm: the matching should be able to capture long-term dependencies.

3.3.2 Key points for training-free operation

Assuming that gesture trails and observed acceleration sequences are available to the system, to fulfill training-free gesture recognition the following aspects should be considered during the design of the algorithm:

  • Generation of gesture templates in the underlying space: atomic gestures should be defined so that full use can be made of the gesture trails and the gesture templates in the underlying space can be generated without training samples.

  • Observation representation in the underlying space: an unsupervised feature extraction should be designed to map the original acceleration sequences into the underlying space.

  • Training-free matching algorithm: the matching algorithm should have no parameters to learn.

3.3.3 Framework of our high-performance training-free algorithm

It is convenient and suitable to define the underlying space by the physical meaning of acceleration direction, for the following reasons:

  • Given the gesture trails, their corresponding templates of acceleration direction sequences can be derived conveniently;

  • The observed acceleration sequences can be mapped into acceleration direction sequences conveniently;

  • Acceleration direction sequences are largely user-invariant.

Thus, in the underlying space, we define the atomic gestures as accelerations in certain directions. We can then obtain the template of each gesture and the mapped sequences of accelerometer data. After that, a modified DTW matching is conducted to obtain the recognition result, since this dynamic-programming matching strategy captures long-term dependencies better than the first-order transition matrix of HMM [29].

  • The underlying space (physical meaning of acceleration direction)

    • Gesture template generation (no training samples required):

      Gesture trails → Templates of gestures (atomic gesture sequences);

    • Observation feature extraction (unsupervised):

      Acceleration sequences → Mapped sequences (probabilistic atomic gesture sequences);

  • Matching via modified DTW (no parameters to train).

Since the process of our algorithm mainly comprises template generation, mapping, matching, and classification, we call it Mapping based Modified DTW (MMDTW), as shown in Fig. 4.

Fig. 4 Flowchart of the proposed mapping based modified DTW

4 Process of our algorithm

In this section, the process of our algorithm will be discussed in detail.

4.1 Gesture template generation

In this gesture recognition process, we make the following definitions: X indicates the right-left direction (left positive), Y indicates the vertical direction (upward positive), and Z indicates the backward-forward direction (forward positive). As shown in Fig. 5, the Nokia gestures defined in [18] are discussed as an example. Since the gestures are all defined in the XY plane, the atomic gestures are represented by normalized acceleration vectors whose Z component always remains 0: \(\mathbf{z}_1=(1,0,0)\), \(\mathbf{z}_2=(-1,0,0)\), \(\mathbf{z}_3=(0,1,0)\), \(\mathbf{z}_4=(0,-1,0)\), \(\mathbf{z}_5=(\sqrt{2}/2,\sqrt{2}/2,0)\), \(\mathbf{z}_6=(\sqrt{2}/2,-\sqrt{2}/2,0)\), \(\mathbf{z}_7=(-\sqrt{2}/2,\sqrt{2}/2,0)\), \(\mathbf{z}_8=(-\sqrt{2}/2,-\sqrt{2}/2,0)\).

Fig. 5 Template sequences of gestures

Now, under the following assumptions, the gesture templates can be generated by straightforward physical analysis:

1. For basic gestures moving uni-directionally, there exist both an acceleration process and a deceleration process. For example, the left-moving gesture in Fig. 5 can be regarded as a left acceleration (corresponding to underlying state \(\mathbf{z}_1\)) followed by a left deceleration (corresponding to underlying state \(\mathbf{z}_2\)). Thus the template for this gesture is written as [1 2].

2. For complex gestures built from basic gestures, assumption 1 holds during each basic gesture process. For example, the gesture square in Fig. 5 is comprised of up [3 4], right [2 1], down [4 3], and left [1 2]. Since each basic gesture follows assumption 1, the overall template for square is [3 4 2 1 4 3 1 2].

3. Circle gestures contain the following sub-motions:

   (1) a uniform motion during the whole gesture process,

   (2) an acceleration process when the gesture starts,

   (3) a deceleration process at the end of the gesture.

   For example, the gesture clockwise circle in Fig. 5 contains (a) uniform circular motion, approximated by the underlying states [3 7 2 8 4 6 1 5 3]; (b) beginning acceleration: [1]; (c) ending deceleration: [2]. Adding sub-motion (b) to the first state of sub-motion (a), and sub-motion (c) to the final state of sub-motion (a), we obtain the template for clockwise circle: [5 7 2 8 4 6 1 5 7].

Under the above assumptions, we obtain the template sequence \(T = [t_1,\dots,t_m,\dots,t_M]\) of each gesture, as shown in Fig. 5. The mth element of T, \(t_m\), taking the value d indicates that the underlying state at that time is \(\mathbf{z}_d\). These rules are mechanical enough to encode directly, as the sketch below shows.
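The helper names in the following sketch are ours; it reproduces the templates stated in the text for the gestures left, square, and clockwise circle:

```python
import numpy as np

# Atomic gestures z_1..z_8 of Section 4.1: normalized acceleration
# directions in the XY plane (Z component always 0 for this gesture set)
s = np.sqrt(2) / 2
ATOMS = {1: (1, 0, 0), 2: (-1, 0, 0), 3: (0, 1, 0), 4: (0, -1, 0),
         5: (s, s, 0), 6: (s, -s, 0), 7: (-s, s, 0), 8: (-s, -s, 0)}

def basic(accel, decel):
    """Rule 1: a uni-directional stroke is an acceleration state
    followed by the opposite deceleration state."""
    return [accel, decel]

def compose(*strokes):
    """Rule 2: a complex gesture concatenates its basic strokes."""
    return [state for stroke in strokes for state in stroke]

left = basic(1, 2)                    # [1 2]
square = compose(basic(3, 4),         # up
                 basic(2, 1),         # right
                 basic(4, 3),         # down
                 basic(1, 2))         # left -> [3 4 2 1 4 3 1 2]
# Rule 3: uniform circle [3 7 2 8 4 6 1 5 3], with the start acceleration [1]
# merged into the first state and the end deceleration [2] into the last
cw_circle = [5, 7, 2, 8, 4, 6, 1, 5, 7]
```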

4.2 Feature extraction

This subsection describes how to map an acceleration sequence of length N into the underlying space and how to represent the mapped sequence \(L = (l_1,\dots,l_i,\dots,l_N)\). In most accelerometers, the raw acceleration sequence contains the influence of gravity. Here we subtract g in the vertical direction and obtain the sequence \(A = (\mathbf{a}_1,\dots,\mathbf{a}_i,\dots,\mathbf{a}_N)\).

To minimize the information loss in the mapping process, soft mapping is adopted in our algorithm. Given the distance between \(\mathbf{a}_i\) and each underlying state \(\mathbf{z}_d\), the distances are mapped into probabilities with the softmax function:

$$ p(\mathbf{z}_{d}|\mathbf{a}_i)=\frac{\exp(a_{id})}{\sum\limits_{d'} \exp(a_{id'})}; $$
(13)

where \(a_{id}\) is determined by

$$ a_{id}=\frac{1}{\epsilon+dis(\mathbf{a}_i,\mathbf{z}_d)}; $$
(14)

where a small positive number ϵ is added to prevent \(a_{id}\) from becoming infinite.

By assigning \(l_{id} = p(\mathbf{z}_d|\mathbf{a}_i)\), we obtain a probabilistic underlying sequence L, where \(l_{id}\) indicates the probability that \(\mathbf{a}_i\) corresponds to underlying state \(\mathbf{z}_d\).
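A sketch of this soft mapping is given below, assuming the cosine distance that Section 5 reports using for \(dis(\mathbf{a}_i,\mathbf{z}_d)\); the max-subtraction before the softmax is a standard stabilization we add, which leaves the probabilities of (13) unchanged:

```python
import numpy as np

def soft_map(A, atoms, eps=1e-3):
    """Map a gravity-corrected acceleration sequence A (N, 3) into the
    underlying space via Eqs. (13)-(14). atoms: dict of atomic directions
    (e.g. ATOMS above). Returns L (N, D) with l_id = p(z_d | a_i)."""
    Z = np.array([atoms[d] for d in sorted(atoms)])            # (D, 3)
    cos = (A @ Z.T) / (np.linalg.norm(A, axis=1, keepdims=True)
                       * np.linalg.norm(Z, axis=1) + eps)      # cosine similarity
    a = 1.0 / (eps + (1.0 - cos))                              # Eq. (14), cosine distance
    a -= a.max(axis=1, keepdims=True)                          # stabilize the softmax
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)                    # Eq. (13)
```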

4.3 Modified DTW based matching

This subsection describes the matching method, in the underlying space, between the mapping result L and the template sequence T of a gesture.

4.3.1 Cost matrix

The cost matrix C(i,m) indicates the distance between the ith element of the mapped sequence L, \(l_i\), and the mth element of the template sequence T, \(t_m\), and is given by

$$ C(i,m)=w_i \cdot d_{im} $$
(15)

where \(w_i\) is the weight of \(l_i\) and \(d_{im}\) is the distance between \(l_i\) and \(t_m\).

Since points with larger amplitudes are more important for recognition, \(w_i\) is set in proportion to the amplitude of the point:

$$ w_i=|\mathbf{a}_i|_2. $$
(16)

In particular, if \(t_m = d\), the distance between \(t_m\) and \(l_i\) can be measured in proportion to \(l_{id} = p(\mathbf{z}_d|\mathbf{a}_i)\). Thus we define the following piecewise function

$$ d_{im} = \left\{ \begin{array}{ll} 1-l_{id} & \textrm{if } l_{id}>\theta\\ k(1-l_{id}) & \textrm{if } l_{id}<\theta \end{array} \right. $$
(17)

where 0 < θ < 1 and k > 1. The threshold θ encodes that when \(l_{id}\) is very small, it is nearly impossible for \(l_i\) to correspond to \(t_m\), so the distance between \(l_i\) and \(t_m\) should be very large; in this case, we multiply the distance by k.

So that C(i,m) is given by the following function

$$ C(i,m) = \left\{ \begin{array}{ll} |\mathbf{a}_i|_2\cdot(1-l_{id}) & \textrm{if }l_{id}>\theta\\ k\cdot |\mathbf{a}_i|_2 \cdot(1-l_{id}) & \textrm{if }l_{id}<\theta \end{array} \right. $$
(18)

where \(d = t_m\).
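Equations (15)-(18) translate directly into a vectorized cost matrix. In the sketch below, the value of θ is a placeholder, since the paper does not report the value it used:

```python
import numpy as np

def cost_matrix(A, L, T, theta=0.5, k=100):
    """Weighted cost C(i, m) of Eq. (18). A: raw accelerations (N, 3);
    L: mapped probabilities (N, D); T: template as 1-based state indices.
    theta is the threshold (value assumed here), k > 1 the penalty factor."""
    w = np.linalg.norm(A, axis=1)           # w_i = |a_i|_2, Eq. (16)
    lid = L[:, np.asarray(T) - 1]           # l_{i, t_m} for every pair (i, m)
    d = 1.0 - lid                           # base distance, Eq. (17)
    d[lid < theta] *= k                     # penalize unlikely assignments
    return w[:, None] * d                   # C(i, m) = w_i * d_im, Eq. (15)
```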

4.3.2 Modified DTW

Since each element \(t_m\) of a template sequence should last for a period of time, in the optimal DTW alignment at least K adjacent elements of the mapped sequence L should be assigned to each template element \(t_m\). DTW is therefore modified as follows. First, each underlying state in the template is repeated K times; for example, when K equals 2, the original sequence [1 2] extends to [1 1 2 2]. Then the symmetric traditional DTW is modified into an asymmetric one:

  • Initialization:

    \(D(i,1)=\sum_{k=1}^i C(k,1)\), where i ≤ N;

    \(D(1,m)=\sum_{k=1}^m C(1,k)\), where m ≤ KM.

  • Modified DTW:

    $$ D(i,m) = \min \left\{ \begin{array}{l} D(i-1,m-1) + C(i,m)\\ D(i-1,m) + C(i,m) \end{array} \right. $$
    (19)

    where 1 < i ≤ N and 1 < m ≤ KM; N indicates the length of the mapped sequence, M the length of the original template, and thus KM the length of the extended template.
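A sketch of this asymmetric recursion is shown below. Repeating the columns of the cost matrix K times is equivalent to extending the template, since repeated states have identical costs:

```python
import numpy as np

def mdtw(C, K):
    """Modified DTW of Eq. (19). C: (N, M) cost matrix against the
    original template; column repetition implements the K-fold extension."""
    C = np.repeat(C, K, axis=1)                      # (N, K*M)
    N, KM = C.shape
    D = np.empty((N, KM))
    D[:, 0] = np.cumsum(C[:, 0])                     # initialization
    D[0, :] = np.cumsum(C[0, :])
    for i in range(1, N):
        for m in range(1, KM):
            # only the (i-1, m-1) and (i-1, m) predecessors are allowed, so
            # each element of the mapped sequence is consumed exactly once
            D[i, m] = C[i, m] + min(D[i-1, m-1], D[i-1, m])
    return D[N-1, KM-1]
```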

4.3.3 Classification

After matching, we obtain the matching cost between L and the template of each gesture c: \(MDTW(L,T_c) = D(N, K M_c)\). The sequence is then recognized as the most similar gesture:

$$ C= \arg \min_c MDTW(L, T_c) $$
(20)

where C is the recognition result.
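Putting the pieces together, recognition reduces to an argmin over the per-gesture matching costs. This sketch reuses the cost_matrix and mdtw functions above; templates is a hypothetical dict mapping gesture names to their state sequences from Section 4.1:

```python
def recognize(A, L, templates, K, theta=0.5, k=100):
    """Recognition result C = argmin_c MDTW(L, T_c), Eq. (20)."""
    costs = {c: mdtw(cost_matrix(A, L, T, theta, k), K)
             for c, T in templates.items()}
    return min(costs, key=costs.get)
```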

5 Experiment

In this section, the environment and results of the experiment will be discussed in detail.

5.1 Experiment description

5.1.1 Hardware and software environment

We built a real-time recognition system to test the performance of our algorithm. As shown in Fig. 6, the system mainly comprises a Wii remote, a Bluetooth module, and a laptop.

1. Wii remote: collects the acceleration data.

   The Wii remote [1] has a built-in three-axis accelerometer, the ADXL330 from Analog Devices. Operating at 100 Hz, the accelerometer senses accelerations from −3 g to 3 g with noise below 3.5 mg. The acceleration data and button events are sent from the Wii remote to the laptop via the Bluetooth module. When collecting samples, each gesture begins when the participant presses the 'A' button on the Wii remote and finishes when the 'A' button is released.

2. Bluetooth module: communicates between the Wii remote and the laptop.

   IVT BlueSoleil is utilized to connect the laptop and the Wii remote; the sample rate is 10 Hz.

3. Laptop: recognizes the gesture in real time.

   The computer is a ThinkPad laptop with an Intel T5870 processor and 2 GB of memory. Our real-time gesture recognition is implemented in Matlab R2010a on Windows XP. To read the data from the Wii remote over Bluetooth, we use the WiiLab toolbox [4].

Fig. 6 Real time gesture recognition system

5.1.2 Samples description

The algorithm is tested on the eight gestures arising from research undertaken by Nokia [18], shown in Fig. 5. We collected samples from 28 persons, 15 male and 13 female. All are graduate students who had never used gesture recognition devices before. Each person repeated each of the 8 gestures 10 times, yielding a database of 28 persons × 8 gestures × 10 repetitions, i.e., 2,240 samples in total.

5.1.3 Parameters setting and experiment process

In our algorithm, shown in Algorithm 1, no training samples are needed. We use the cosine metric to measure the distance between acceleration vectors. The parameters are set as follows: ϵ = 0.001, k = 100, and \(K=\lfloor \frac{1}{2}\frac{l}{l_{M_{i}}} \rfloor\), where l is the length of the mapped sequence and \(l_{M_i}\) the length of template i. We obtain the recognition results through the real-time gesture recognition system described above.

Algorithm 1 Mapping based Modified DTW (MMDTW)

We also implemented HMM and DTW for comparison. When testing DTW, we take the first sample of each gesture collected by one person as the template. To see the influence of individual style, we adopt the templates of different persons in turn and classify all the other samples according to their similarity to the templates. In the HMM experiment, since discrete and continuous HMMs achieve similar performance [18], we use a discrete HMM. We employ a leave-one-out strategy, using all persons' samples except the test person's as training samples, and vary the percentage of training samples to observe the recognition performance. The training samples of HMM are randomly selected from the data set, and the HMM parameters are set according to [18], with 8 cluster classes and 5 latent states. We use Murphy's HMM toolbox to implement our left-to-right HMM test system.

5.2 Result

As displayed in Fig. 7, the comparison between DTW and MMDTW shows that the recognition accuracy of DTW depends strongly on the choice of template. With the best template, DTW achieves an accuracy of 88.7 %, while with the worst template its accuracy drops sharply to 26.5 %. The average accuracy of DTW is 72.9 %, and 9 out of 28 templates yield an accuracy below 70.0 %, which is intolerable in practical applications. In comparison, the accuracy of MMDTW is 93.1 %, better than DTW regardless of the template. This improvement stems from the unbiased templates and the matching in the underlying space.

Fig. 7 Comparison of MMDTW and DTW. The accuracy of DTW changes when choosing different users' first sample of each gesture as the template and all other users' samples as testing ones. MMDTW needs no samples from users to start and generates the template of each gesture itself

In Fig. 8, we can see the comparison of HMM and MMDTW. The results show that MMDTW performs much better than HMM when the latter uses a relatively small percentage of training samples. As the number of training samples increases, HMM improves; however, its accuracy does not exceed that of MMDTW even when 90 % of the samples are used for training. Compared with HMM, the improvement of MMDTW likely comes from the modified DTW matching, which compares the differences between the template and the mapped sequence more effectively.

Fig. 8 Comparison of HMM and MMDTW. The accuracy of HMM changes with the percentage of training samples. MMDTW needs no training samples

As summarized in Fig. 9, the best accuracy of HMM is 91.1 % and the average accuracy of DTW is 72.9 %. Thus our algorithm not only meets the training-free requirement, but also achieves better accuracy than the training-required algorithms of DTW and HMM.

Fig. 9 Comparison of MMDTW, HMM and DTW

6 Conclusion

In this paper, in order to achieve training-free gesture recognition with an accelerometer for user-independent applications, we make use of the crucial prior knowledge of gesture trails and design the MMDTW algorithm. The algorithm defines an underlying space with the physical meaning of acceleration direction. A template for each gesture is then defined in this space according to its gesture trail. The acceleration sequence output by the accelerometer is mapped into the underlying space, where a modified DTW matching against the templates is conducted. Finally, the best-matched gesture is output as the recognition result. When tested in a user-independent experiment with 2,240 samples, the training-free algorithm even outperforms the traditional training-required algorithms of DTW and HMM, which verifies the effectiveness of MMDTW.

We believe our algorithm is a first step toward training-free gesture recognition. In future work, by exploiting gesture trail information, more sophisticated physical analysis can be adopted to generate templates of general gestures with more general sensors, such as video cameras. Moreover, without the time-consuming and laborious sample collection process, high-performance training-free gesture recognition makes it much more convenient for users to extend the original gesture vocabulary themselves. Thus, this kind of training-free algorithm has great potential in a large number of applications, such as self-defined-vocabulary gesture recognition and gesture devices to support e-Teaching.