1 Introduction

Motion trajectory analysis is a fundamental task in many research areas, such as activity recognition [19, 21, 33], robot action [15], anomaly detection [22], and video surveillance [20].

Studies on motion trajectory classification can be divided into three main categories according to the classification scheme: model based, feature based and trajectory based. Model-based classification [11] builds a parameterized model for a collection of motion trajectories and classifies a new motion trajectory by determining the model that fits it best. Statistical models such as Gaussian, Markov and hidden Markov models [14], random forests [18] and neural networks are frequently used in this category. In feature-based studies [28], high-dimensional motion trajectories are transformed into a set of features by suitable feature extraction methods. For example, Lin and Hsieh [16] develop a kernel-based representation for motion trajectory retrieval and classification, and Bashir et al. [1] propose a principal component analysis-based approach for trajectory indexing and retrieval. Although the methods stated above can yield an optimal set of features, an extensive survey claims that they become infeasible for out-of-core processing, which is usually required for large datasets [6].

Our motion trajectory classification framework falls into the trajectory-based category. In this category, two fundamental problems are trajectory representation and sequence alignment. An effective representation can yield significantly better performance than the raw trajectory data. Moreover, alignment methods can further improve the performance of a classifier by integrating the trajectory representation into the alignment.

A basic way of comparing two trajectories is to use the original data, which rely on the absolute positions of the motion; this is computationally inefficient and not invariant to trajectory translation and scale [8]. Hence, the main challenge in motion trajectory classification is the specific trajectory adjustment in either the alignment or the representation step. To maintain scale and translation invariance during alignment, the raw trajectory data are often converted into geometric invariant signatures such as curvature, torsion, and their derivatives. To develop an effective system for real-world applications, many researchers have sought a compact yet discriminative representation for trajectories. Yang et al. [31] represented trajectories at the segment level rather than the point level, and indexed the trajectories by four classes of segment sequences to recognize them. However, this method can only represent simple shapes and is inefficient for complex and long-term trajectories. Wu et al. [26] developed three geometric invariant signature descriptions for motion characterization. Such flexible descriptions give the signature high functional adaptability to meet various application requirements in trajectory representation, perception and recognition. However, computing three differential features and two global features for each discrete point is computationally expensive. Moreover, reliable finite-difference estimation of higher-order derivatives is difficult because of their high sensitivity to noise. Yang et al. [29, 32] also present a mixed invariant signature descriptor that combines differential features with global invariants for motion perception and recognition; however, it has the same problems as Wu's method.

Dynamic time warping (DTW) [4, 13, 30] is the most representative non-linear mapping method; it handles local time shifting by finding, in a pairwise distance matrix, an optimal path with the minimal sum of distances. Traditionally, pointwise alignment resorts to the local information adjacent to each point [12]. Z. Zhang [34] recently suggested using the shape context [2] as a rich local shape descriptor to replace the raw observed position values in conventional DTW in 2D space. To keep the shape context local, they set its outer radius to one tenth of the trajectory length. This approach generates a more feature-to-feature alignment between motion trajectories and thus serves as a robust similarity measure. However, this matching mechanism cannot handle the situation in which a trajectory is partially similar to its sub-trajectory. In addition, from a shape point of view, descriptors that ignore the relationship of each point to its global sequence context are largely unstable and tend to produce pathological alignments [25]. In our view, effective trajectory recognition benefits from a global shape descriptor at each trajectory point, which we expect to outperform a local descriptor. For this purpose, we propose a 3D shape context with an adaptive outer radius. During trajectory feature representation, global trajectory information is extracted by setting the outermost radius of the shape context descriptor at each point to the largest pointwise distance from that point within the trajectory. In this way, the global information of each trajectory point is included and extracted as a pointwise feature for trajectory alignment.

Another challenge of motion trajectory classification is that appearance variations among trajectories with the same meaning may cause false discrimination. In most motion trajectory applications, such perturbations arise because the trajectories are drawn repeatedly by different users. To the best of our knowledge, most researchers have not addressed this problem. In our previous work [17], a 2D shape order context descriptor combined with DTW was proposed. It is largely insensitive to trajectory perturbations and highly invariant to trajectory scale and translation. However, that work did not consider 3D trajectories.

As stated above, motion trajectory classification in 2D space has been widely studied and already achieves good performance. It should be noted, however, that relying on a 2D trajectory projected from some viewpoint may lose the authentic meaning in 3D space. Compared with projected 2D trajectories, richer information can be drawn from 3D trajectories in the spatiotemporal domain, and therefore better performance can be achieved in trajectory-based schemes. Also, as increasingly complicated behaviors and activities are performed, 3D trajectory analysis deserves further consideration. Based on the 2D shape context descriptor, a 3D extension was proposed in [9]. It relies on a specific subdivision of the spherical volume around the point that needs to be described. Matthias et al. [10] propose using a 3D shape context to recognize the spatial and temporal details inherent in human actions. However, such 3D shape context descriptors are defined not strictly in 3D space but in 2D space plus the time domain. S. H. Zhao et al. [35] generalized a strategy for dynamic 3D depth data matching and applied it to an action retrieval task. They use nine planes to segment the ball-like descriptor into 48 homogeneous regions, resulting in equal probability of the bins capturing the contour points in each static depth frame, and then employ dynamic time warping (DTW) to measure the temporal similarity between two 3D dynamic depth sequences. Most previous work on 3D shape context targets static objects, for example pose estimation [23], 3D object recognition [3] and registration [27], and has rarely been applied to motion trajectory classification.

In this paper, we present a novel feature extraction method for motion trajectories. The proposed approach is straightforward and simple to implement. It takes a motion trajectory of (x, y, z) coordinates as the input data. We then perform a two-stage normalization to obtain compact trajectories. The first stage normalizes the raw trajectories to the same number of points. The second stage, called trajectory span distance normalization, normalizes the point coordinates by dividing them by the largest distance between any pair of points in each trajectory. After preprocessing, we extend the traditional shape context to the spatiotemporal domain to capture a global representation for each 3D trajectory point. The resulting sequence of descriptors serves as the feature for trajectory alignment by dynamic time warping. Finally, the trajectories are classified by a one-nearest-neighbor (1NN) classifier. The proposed method not only markedly reduces the mismatching problem, thereby improving classification accuracy, but also tolerates perturbations in trajectory appearance and is highly invariant to trajectory scale and translation.

The rest of the paper is organized as follows. Section 2 presents the framework of the proposed method. In Section 3, we first normalize the trajectories to the same length, then briefly review the shape context descriptor and describe the proposed 3D shape context in detail. In Section 4, the trajectory alignment method based on DTW is introduced and the combination of 3D-SC and DTW for trajectory recognition is elaborated. Section 5 reports the experiments. The paper is concluded in Section 6.

2 Overview of the framework

The main idea behind our motion trajectory classification approach is to obtain a series of global 3D trajectory shape descriptors in the spatiotemporal domain and to use them in place of the raw observed values when finding the alignment between two motion trajectories. Figure 1 shows the block diagram of the proposed method. Our method for 3D motion trajectories is composed of three functional units: pre-processing, 3D shape context representation and dynamic time warping. In the pre-processing step, trajectories are first normalized to the same length, and then the coordinates of each trajectory point are divided by the maximum pointwise Euclidean distance. The pre-processed trajectories are then represented by the 3D shape context descriptor in the spatiotemporal domain. In the resulting feature space, we integrate the 3D SC descriptor into a dynamic time warping framework for feature-wise alignment and finally use the standard nearest-neighbor algorithm for trajectory classification.

Fig. 1 Block diagram of the 3D shape context based gesture trajectory classification algorithm

3 Trajectory pre-processing and representation

3.1 Pre-processing

Motion trajectory length normalization is important because most trajectory recognition algorithms need to work with trajectories that have the same number of points. Since trajectories are discrete sequences, the number of sampling points may differ from one trajectory to another. Also, the performance of the 3D shape context feature is affected by unequal distances between adjacent trajectory points. To avoid calculating the 3D-SC feature directly on raw data, we normalize the length of the trajectories beforehand. This length normalization, based on sampling several neighboring points, reduces the sensitivity of the 3D-SC feature to the intervals between trajectory points. Also, using a fixed number of point intervals ensures that identical trajectories at different scales are represented by unified histograms.

Given L sampling points along a 3D trajectory, we can represent the set of points as

$$ {\left\{{x}_k,{y}_k,{z}_k\right\}}_{k=1}^L $$
(1)

where \( \left\{{x}_k,{y}_k,{z}_k\right\} \) denote the x, y and z coordinates of the k-th point on the trajectory. Some examples of 3D trajectories in the compact Australian Sign Language dataset are shown in Fig. 2.

Fig. 2 3D gesture trajectories from the ASL dataset

The starting and ending points of each gesture trajectory are manually marked, and the stationary points caused by the signer’s holding behavior are eliminated by considering the relationship between the positions of adjacent points. In our experiments, these stationary points are detected by examining the condition

$$ \left|{x}_k-{x}_{k-\varDelta k}\right|+\left|{y}_k-{y}_{k-\varDelta k}\right|+\left|{z}_k-{z}_{k-\varDelta k}\right|\le \varepsilon $$
(2)

where Δk = 2 and ε = 0.01.
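The stationary-point test of Eq. (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `remove_stationary_points` is hypothetical, and the sketch assumes that each point is compared with the point Δk samples earlier in the original (unfiltered) sequence and that the first Δk points are always kept.

```python
def remove_stationary_points(points, delta_k=2, eps=0.01):
    """Drop points that satisfy the stationarity condition of Eq. (2).

    points: list of (x, y, z) tuples.
    A point k is treated as stationary when the L1 distance to the point
    delta_k samples earlier is at most eps. The first delta_k points have
    no predecessor at that lag and are kept (an assumption of this sketch).
    """
    kept = []
    for k, (x, y, z) in enumerate(points):
        if k >= delta_k:
            px, py, pz = points[k - delta_k]
            if abs(x - px) + abs(y - py) + abs(z - pz) <= eps:
                continue  # stationary point: skip it
        kept.append((x, y, z))
    return kept
```

With the default arguments the sketch uses the values Δk = 2 and ε = 0.01 stated above.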

A parametric spline approximation [5] is applied to smooth the trajectory curve. We then resample the trajectories so that they are of equal length, as shown in Fig. 3. To find the best trajectory length for representation, we evaluate several normalized trajectory lengths in the experiment section. Based on this evaluation, the trajectory length is set to approximately 70 sample points.

Fig. 3 Example of gesture trajectory normalization. The upper row shows the original trajectories with different numbers of points; the bottom row shows the normalized trajectories with the same number of points
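The length normalization step can be illustrated with a short sketch. Note that it deliberately swaps the parametric spline approximation of [5] for plain linear interpolation along the arc length, which is simpler but does not smooth the curve; the function name `resample_trajectory` is hypothetical.

```python
import math


def resample_trajectory(points, num_points=70):
    """Resample a 3D polyline to num_points samples equally spaced in arc length.

    Linear interpolation is used here as a stand-in for the spline
    approximation described in the text (an assumption of this sketch).
    """
    if len(points) < 2 or num_points < 2:
        raise ValueError("need at least two input points and two output points")
    # Cumulative arc length at every input point.
    cum = [0.0]
    for a, b in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    total = cum[-1]
    if total == 0.0:
        return [tuple(points[0])] * num_points
    resampled = []
    seg = 0
    for i in range(num_points):
        target = total * i / (num_points - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else (target - cum[seg]) / span
        t = min(max(t, 0.0), 1.0)
        a, b = points[seg], points[seg + 1]
        resampled.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    return resampled
```

The default of 70 output points follows the length selected in the experiments; any other length can be passed through `num_points`.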

For the second step of pre-processing, we propose a trajectory span distance normalization method, which transforms each 3D trajectory to a common domain.

Given a set of resampled points \( {\left\{{x}_k,{y}_k,{z}_k\right\}}_{k=1}^{L^{\prime }} \) from the previous step, the normalized coordinates are given by

$$ {\left\{{x}_k^{\prime },{y}_k^{\prime },{z}_k^{\prime}\right\}}_{k=1}^{L^{\prime }}={\left\{\frac{x_k}{d_{\max }},\frac{y_k}{d_{\max }},\frac{z_k}{d_{\max }}\right\}}_{k=1}^{L^{\prime }} $$
(3)

where \( {d}_{\max }=\underset{\begin{array}{l}i,j\in \left\{1,\dots ,{L}^{\prime}\right\}\\ {}i\ne j\end{array}}{\max}\left\{\sqrt{{\left({x}_i-{x}_j\right)}^2+{\left({y}_i-{y}_j\right)}^2+{\left({z}_i-{z}_j\right)}^2}\right\} \) is the largest distance between any two points of the trajectory.

In this way, the distance between any two points lies in the range between 0 and 1. As a result, the scale variations contained in the raw trajectory data are effectively removed.
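A minimal sketch of the span distance normalization of Eq. (3) is given below. The function name `span_normalize` is hypothetical, and the brute-force search over all point pairs is chosen for clarity rather than speed.

```python
import math
from itertools import combinations


def span_normalize(points):
    """Divide all coordinates by the largest pairwise distance d_max (Eq. 3).

    Returns the normalized points together with d_max.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    # Largest Euclidean distance between any two distinct points.
    d_max = max(math.dist(p, q) for p, q in combinations(points, 2))
    if d_max == 0.0:
        raise ValueError("all points coincide; the span distance is zero")
    normalized = [tuple(c / d_max for c in p) for p in points]
    return normalized, d_max
```

After this step the largest pairwise distance of the normalized trajectory is 1, so two copies of the same trajectory at different scales map to the same normalized points.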

3.2 Review of shape context

In this section, we first review the shape context descriptor [2], which represents an object point by measuring the distribution of the relative positions of neighboring points. The full set of vectors used as a shape descriptor contains global details, since it describes the configuration of the entire shape relative to the reference point. This set of vectors is a highly discriminative descriptor that represents the shape distribution over relative positions. For a shape sampled at n points, the shape context of a point \( {p}_i \) is defined as a coarse histogram \( {h}_i \) of the relative coordinates of the remaining n − 1 points:

$$ {h}_i(k)=\#\left\{q\ne {p}_i:\left(q-{p}_i\right)\in bin(k)\right\} $$
(4)

The bins are uniform in log-polar space, making the descriptor more sensitive to the positions of nearby sample points than to those of points farther away. The histogram similarity of a pair of points \( \left({p}_i,{q}_j\right) \) from two trajectories is measured by the \( {\chi}^2 \) statistic:

$$ {C}_{ij}=\frac{1}{2}{\displaystyle \sum_{k=1}^K\frac{{\left[{h}_i(k)-{h}_j(k)\right]}^2}{h_i(k)+{h}_j(k)}} $$
(5)

where \( {h}_i(k) \) and \( {h}_j(k) \) denote the K-bin normalized histograms at \( {p}_i \) and \( {q}_j \), respectively.
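The histogram cost of Eq. (5) can be sketched as below. The function name `chi_square_cost` is hypothetical, and skipping bin pairs that are empty in both histograms (to avoid a 0/0 term) is an assumption of this sketch that the text does not spell out.

```python
def chi_square_cost(h_i, h_j):
    """Chi-square cost between two K-bin normalized histograms (Eq. 5)."""
    if len(h_i) != len(h_j):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for a, b in zip(h_i, h_j):
        if a + b > 0:  # bins empty in both histograms contribute nothing
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total
```

For normalized histograms the cost is 0 for identical histograms and 1 for histograms with disjoint support.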

3.3 The adaptive 3D shape context descriptor

Based on the shape context, the 3D shape context descriptor is straightforward. In the 2D shape context, a point histogram is built on a 2D log-polar coordinate system, as shown in Fig. 4 (left). In the 3D shape context, the previous descriptor is extended to 3D space by building a point histogram on a 3D spherical coordinate system, as shown in Fig. 4 (right). Taking one trajectory point as the origin, the 3D shape context captures the 3D spatial and 1D temporal distribution of all other trajectory points around it. Along the radial direction, bins are arranged uniformly in log-polar space, which makes the descriptor more sensitive to the positions of nearby points than to those of points farther away. If there are i bins for the radius, j bins for the azimuth and k bins for the elevation, the 3D shape context is partitioned into i × j × k bins in total.

Fig. 4 Illustration of log-polar coordinates in 2D (left) and 3D (right) space

The traditional coordinate representation denotes the position of each trajectory point as (x, y, z) in Cartesian coordinates. It can be represented in spherical coordinates as follows:

$$ \left\{\begin{array}{l}x=r \sin \theta \cos \varphi \\ {}y=r \sin \theta \sin \varphi \\ {}z=r \cos \theta \end{array}\right. $$
(6)

Generally, a 3D shape context descriptor is described by five parameters: the number of φ bins along the azimuth dimension, the number of log(r) bins along the radial dimension, the number of θ bins along the elevation dimension, the outer radius, and the inner radius. The outer radius is the radius of the outermost sphere of the 3DSC ball, and the inner radius is the radius of the innermost sphere. The (x, y, z) position of each point in Cartesian coordinates can be converted to (r, θ, φ) in spherical coordinates as follows:

$$ \left\{\begin{array}{l}r=\sqrt{x^2+{y}^2+{z}^2}\hfill \\ {}\theta = \arccos \left(\frac{z}{r}\right)\hfill \\ {}\varphi = \arctan \left(\frac{y}{x}\right)\hfill \end{array}\right. $$
(7)
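The coordinate conversion can be checked with a small round-trip sketch, using the standard convention x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ. The function names are hypothetical. As a design choice of this sketch, `math.atan2(y, x)` replaces arctan(y/x) so that the azimuth is resolved in the correct quadrant and x = 0 does not cause a division by zero.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Return (r, theta, phi): radius, polar angle in [0, pi], azimuth in (-pi, pi]."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin; use zeros
    # Clamp guards against rounding pushing z / r slightly outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi


def spherical_to_cartesian(r, theta, phi):
    """Inverse conversion, using the same angle convention."""
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))
```

Converting a point to spherical coordinates and back should reproduce the original Cartesian coordinates up to rounding error.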

The outer radius controls the volume of the 3DSC ball. It restricts the number of trajectory points involved in each 3D shape context and, from the perspective of time series, also indicates the width of the time window. In previous work, there are two main strategies to determine the outer radius of the 2D shape context. One is to set the outer radius to one tenth of the motion trajectory length, which means that at most 10 % of the points are covered by each shape context. The other is to compute the average distance between any two points in the motion trajectory and take it as the maximum size of the outer radius. In our work, instead of using the 3DSC to extract local information from each trajectory, we consider the global information of the whole motion trajectory, as shown in Fig. 5.

Fig. 5 Local 3D shape context (left) vs. global 3D shape context (right) for gesture trajectory representation

In Eq. (7), the origin of the spherical coordinate system coincides with the origin of the Cartesian coordinate system. For 3DSC feature extraction, however, each trajectory point should be treated as a reference point, with the origin translated from (0, 0, 0) to the current point position \( \left({x}_n,{y}_n,{z}_n\right) \). Hence, for the 3DSC descriptor, the parameters relative to each reference point can be expressed as:

$$ \left\{\begin{array}{l}{r}_n=\sqrt{{\left({x}_r-{x}_n\right)}^2+{\left({y}_r-{y}_n\right)}^2+{\left({z}_r-{z}_n\right)}^2}\\ {}\theta = \arccos \left(\frac{z_r-{z}_n}{r_n}\right)\\ {}\varphi = \arctan \left(\frac{y_r-{y}_n}{x_r-{x}_n}\right)\end{array}\right. $$
(8)

where \( \left({x}_r,{y}_r,{z}_r\right) \) denotes any other trajectory point apart from the current point \( \left({x}_n,{y}_n,{z}_n\right) \). Consequently, with the n-th trajectory point \( \left({x}_n,{y}_n,{z}_n\right) \) as the origin of the 3DSC descriptor, the corresponding outer radius can be expressed as follows:

$$ {r}_n^{\max }=\underset{\begin{array}{l}\left({x}_r,{y}_r,{z}_r\right)\in {L}^{\prime },\\ {}\left({x}_r,{y}_r,{z}_r\right)\ne \left({x}_n,{y}_n,{z}_n\right)\end{array}}{\max}\left(\sqrt{{\left({x}_r-{x}_n\right)}^2+{\left({y}_r-{y}_n\right)}^2+{\left({z}_r-{z}_n\right)}^2}\right) $$
(9)

As seen from the above equation, the set of adaptive outer radii \( {\left\{{r}_n^{\max}\right\}}_{n=1}^{L^{\prime }} \) of a trajectory is assembled from the outer radius generated at each trajectory point. We would like to point out that, although different origins of the 3DSC ball yield different outer radii \( {\left\{{r}_n^{\max}\right\}}_{n=1}^{L^{\prime }} \), all trajectory points are still included in each 3D shape context.

In our trajectory representation strategy, each outer radius equals the local maximum trajectory span distance, calculated with respect to the current 3DSC origin. In addition, a spherical grid is defined by subdivisions along the azimuth, elevation and radial dimensions. For generality, the number of subdivisions can differ along each dimension. In our experiments, the azimuth and elevation dimensions are equally divided into 12 and 8 bins, respectively. Typically, the outer radius \( {\left\{{r}_n^{\max}\right\}}_{n=1}^{L^{\prime }} \) in each trajectory is \( {2}^k \) times larger than the inner radius. Hence, the radial dimension is logarithmically divided into 5 bins, which means k = 5. For the 3DSC histogram computation, each bin accumulates a weighted sum of the number of trajectory points falling into it.
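The adaptive descriptor can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name `shape_context_3d` is hypothetical, every point is given unit weight before the histogram is normalized to sum to 1, the inner radius is assumed to be the outer radius divided by 2^5, and points closer than the inner radius are assigned to the first radial bin.

```python
import math


def shape_context_3d(points, n, n_r=5, n_theta=8, n_phi=12, ratio=2 ** 5):
    """Normalized 3D shape context histogram around points[n].

    The outer radius is adaptive: it is the largest distance from points[n]
    to any other point, so every other point falls inside the ball.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    origin = points[n]
    offsets = [tuple(c - o for c, o in zip(p, origin))
               for i, p in enumerate(points) if i != n]
    radii = [math.sqrt(dx * dx + dy * dy + dz * dz) for dx, dy, dz in offsets]
    r_max = max(radii)          # adaptive outer radius
    r_min = r_max / ratio       # assumed inner radius
    hist = [0.0] * (n_r * n_theta * n_phi)
    for (dx, dy, dz), r in zip(offsets, radii):
        if r == 0.0:
            continue  # a coincident point has no direction; skip it
        if r <= r_min:
            r_bin = 0
        else:
            # Fraction of the log-radius range [r_min, r_max], in (0, 1].
            frac = math.log(r / r_min) / math.log(r_max / r_min)
            r_bin = min(int(frac * n_r), n_r - 1)
        theta = math.acos(max(-1.0, min(1.0, dz / r)))     # elevation, [0, pi]
        phi = math.atan2(dy, dx) % (2.0 * math.pi)         # azimuth, [0, 2*pi)
        t_bin = min(int(theta / math.pi * n_theta), n_theta - 1)
        p_bin = min(int(phi / (2.0 * math.pi) * n_phi), n_phi - 1)
        hist[(r_bin * n_theta + t_bin) * n_phi + p_bin] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Because only offsets relative to the reference point and ratios of radii enter the binning, the resulting histogram is unchanged when the whole trajectory is translated or uniformly scaled.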

The benefit of the adaptive outer radius mechanism is that it captures global information, which increases the diversity of pairwise points during distance calculation. Furthermore, the global 3D shape context gives better discrimination when matching a motion trajectory with its sub-trajectory. In this case, pairwise points with global information give a relatively higher matching score for trajectories with the same meaning, and conversely a relatively lower matching score for trajectories with different meanings that partially share the same shape.

From a shape point of view, trajectories of the same class can be seen as similar shapes with small non-rigid deformations, as shown in Fig. 6. 3D shape contexts are rich descriptors in that they tolerate these trivial deformations while remaining sensitive to discriminative deformations. In contrast, trajectories of different classes usually contain deformations large enough to be captured by 3D shape contexts.

Fig. 6 Digit gesture “6” under shape deformations, performed by different signers

4 Gesture trajectory alignment

4.1 Dynamic time warping

In time series analysis, dynamic time warping (DTW) is a well-established algorithm for comparing temporal sequences which may vary in time or speed. DTW addresses the main problem of aligning two sequences in order to get the most suitable distance measure of their overall difference. Compared with Euclidean distance, DTW can overcome the time distortion problem by finding a time-flexible alignment between two given time series, where the total cumulative distance is minimized. Each point of the time series is aligned to at least one point of another time series.

More specifically, suppose \( \mathbf{X}=\left\{{x}_1,{x}_2,\cdots, {x}_m\right\}\in {\mathbb{R}}^m \) and \( \mathbf{Y}=\left\{{y}_1,{y}_2,\cdots, {y}_n\right\}\in {\mathbb{R}}^n \) denote two time series of length m and n, respectively. To align the two sequences using DTW, an m-by-n matrix is constructed. The value of the (i, j)-th cell of the matrix is the base distance between the two feature vectors \( {x}_i \) and \( {y}_j \), namely \( \delta \left({x}_i,{y}_j\right) \).

A warping path W that defines an alignment between X and Y can be formally written as \( W={w}_1,\cdots, {w}_T \), where \( \max \left(m,n\right)\le T\le m+n-1 \). Each \( {w}_t=\left(i,j\right) \) specifies that feature vector \( {x}_i \) of sequence X is matched with feature vector \( {y}_j \) of sequence Y. The warping path is typically subject to several constraints:

Boundary conditions: \( {w}_1=\left(1,1\right) \) and \( {w}_T=\left(m,n\right) \);

Temporal continuity: given \( {w}_t=\left(a,b\right) \) and \( {w}_{t-1}=\left({a}^{\prime },{b}^{\prime}\right) \), then \( a-{a}^{\prime}\le 1 \) and \( b-{b}^{\prime}\le 1 \);

Temporal monotonicity: given \( {w}_t=\left(a,b\right) \) and \( {w}_{t-1}=\left({a}^{\prime },{b}^{\prime}\right) \), then \( a-{a}^{\prime}\ge 0 \) and \( b-{b}^{\prime}\ge 0 \).
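The three constraints can be expressed as a small validity check. The function name `is_valid_warping_path` is hypothetical; the sketch additionally rejects a step that repeats the same cell, which is an assumption consistent with the bound T ≤ m + n − 1 but not stated explicitly in the constraints.

```python
def is_valid_warping_path(path, m, n):
    """Check boundary, continuity and monotonicity for a 1-based warping path.

    path: list of (i, j) index pairs; m, n: lengths of the two sequences.
    """
    # Boundary conditions: start at (1, 1) and end at (m, n).
    if not path or path[0] != (1, 1) or path[-1] != (m, n):
        return False
    for (a_prev, b_prev), (a, b) in zip(path, path[1:]):
        da, db = a - a_prev, b - b_prev
        # Continuity (steps of at most 1) and monotonicity (no going back).
        if not (0 <= da <= 1 and 0 <= db <= 1):
            return False
        if da == 0 and db == 0:
            return False  # repeated cell, assumed to be disallowed
    return True
```

Any path that passes this check has a length T between max(m, n) and m + n − 1.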

Under the above restrictions, an exponential number of warping paths exist; however, DTW computes the optimal path that minimizes the following warping cost:

$$ DTW\left(\mathbf{X},\mathbf{Y}\right)= \min \left\{{\displaystyle \sum_{t=1}^T\delta \left({w}_t\right)}\right\} $$
(10)

To find the optimal path, DTW can be computed recursively by dynamic programming, which obtains the cumulative distance DTW(i, j) as the distance \( \delta \left({x}_i,{y}_j\right) \) found in the current cell plus the minimum of the cumulative distances of the adjacent cells, as follows:

$$ DTW\left(i,j\right)=\delta \left({x}_i,{y}_j\right)+ \min \left\{DTW\left(i-1,j-1\right),DTW\left(i,j-1\right),DTW\left(i-1,j\right)\right\} $$
(11)

In this way, we can find the best warping path W and the global matching score D by back tracing the cumulative distance matrix.
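The recursion and the backtracking step can be sketched as follows. The function name `dtw` and the pluggable `dist` argument are illustrative choices; the sketch assumes both sequences are non-empty and that all base distances are finite.

```python
def dtw(x, y, dist):
    """Return (total cost, warping path) for sequences x and y.

    dist(a, b) is the base distance between two elements.
    The path is a list of 1-based (i, j) pairs from (1, 1) to (m, n).
    """
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # acc[i][j]: minimal cumulative cost of aligning x[:i] with y[:j].
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist(x[i - 1], y[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i][j - 1], acc[i - 1][j])
    # Backtrack from (m, n) to (1, 1) through the cheapest predecessor.
    path = [(m, n)]
    i, j = m, n
    while (i, j) != (1, 1):
        candidates = [(acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i][j - 1], i, j - 1),
                      (acc[i - 1][j], i - 1, j)]
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[m][n], path
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3], lambda a, b: abs(a - b))` aligns the repeated value 2 in the second sequence with the single 2 in the first one at zero total cost.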

4.2 Using 3D shape context in DTW

Standard dynamic time warping typically uses the successive sequence locations as the trajectory feature for computing the cost matrix. In our proposed method, each element of the cost matrix is obtained by computing the histogram similarity between two 3D shape contexts. The new base distance between each pair of points is defined as:

$$ {\delta}_{3D-SC}\left(p,q\right)\equiv {C}_{pq} $$
(12)

where \( {C}_{pq} \) is defined in Eq. (5). \( {\delta}_{3D-SC} \) can be substituted for \( \delta \left(\cdot \right) \) when computing DTW.
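The substitution of the histogram cost into DTW, followed by 1NN classification, can be sketched as below. It assumes each trajectory has already been converted into a sequence of normalized histograms; the names `chi_square`, `dtw_distance` and `classify_1nn` are hypothetical, and only the final DTW cost is computed (no backtracking).

```python
def chi_square(h_p, h_q):
    """Chi-square cost of Eq. (5); bins empty in both histograms are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h_p, h_q) if a + b > 0)


def dtw_distance(seq_p, seq_q, base=chi_square):
    """DTW cost between two histogram sequences, using `base` as delta."""
    n = len(seq_q)
    inf = float("inf")
    prev = [0.0] + [inf] * n          # row 0 of the cumulative matrix
    for h_p in seq_p:
        cur = [inf] * (n + 1)
        for j in range(1, n + 1):
            cur[j] = base(h_p, seq_q[j - 1]) + min(prev[j - 1], cur[j - 1], prev[j])
        prev = cur
    return prev[n]


def classify_1nn(query, training):
    """Return the label of the training sequence with the smallest DTW cost.

    training: list of (label, histogram_sequence) pairs.
    """
    return min(training, key=lambda item: dtw_distance(query, item[1]))[0]
```

Keeping only two rows of the cumulative matrix is sufficient here because the warping path itself is not needed for classification.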

It should be clarified that, unlike SC-DTW [34], which uses the shape context only to generate the alignment, our 3DSC-DTW uses the matching cost of the global 3D shape context feature, rather than the original Euclidean distance, as the accumulated value. The merit is that the result is highly invariant to trajectory translation and scale.

One significant reason for combining the global 3DSC feature with DTW is that it can deal with the sub-trajectory problem. Without loss of generality, we take digit gestures in 2D space as an example. As can be seen in Fig. 7, the digit gesture “2” can be treated as a sub-gesture of the digit gesture “3”. With the local 3DSC feature, the alignment between digit “2” and part of digit “3”, shown in Fig. 7a, coincides with the alignments of digit “2”. Only the end point of digit “2” is left to match the remaining points of digit “3”. As a result, the final decision score may be relatively low and can readily cause misclassification. In contrast, the alignment of digits “2” and “3” in Fig. 7b, which uses the global 3DSC feature, achieves a relatively higher matching score and hence stronger discrimination.

Fig. 7 Alignments of two digit gestures with local and global 3D shape context descriptors. a Alignments with the local 3D-SC descriptor. b Alignments with the global 3D-SC descriptor

Another reason for embedding the 3DSC descriptor into DTW is that it is highly resistant to perturbations in trajectory appearance. Unlike pose models, trajectory data encompass a notion of time flow. Even if trajectories have the same appearance, they may represent different meanings because of different directions of time flow. Typically, all gesture trajectories, whether for training or testing, should be captured under a fixed canonical coordinate frame. However, the rotation of a trajectory should be considered in two situations. (1) All gestures are captured under the fixed canonical coordinate frame. In this case, as mentioned before, even if the gesture trajectories differ slightly in appearance or axis inclination, the 3D shape context descriptor is insensitive to such deformations and can largely suppress the effect of noise. (2) Gestures are captured under different coordinate frames. In this case, we should transform the trajectory into a canonical coordinate frame according to the translation and scale parameters. Otherwise, it is hard to determine whether two gestures have the same meaning even if they have similar shapes.

4.3 Time complexity

Suppose P denotes the number of bins in a 3D shape context histogram. The time complexity of computing a gesture point histogram is P = r ∗ a ∗ b, where r is the number of radial bins, a is the number of θ angular bins and b is the number of φ angular bins. Recall that m and n represent the lengths of the two time series after gesture normalization. DTW has to consider all cells in the warping matrix; thus, it has a time complexity of O(m ⋅ n). In 3DSC-DTW, the base distance is the difference between two histograms instead of two real numbers. Hence, 3DSC-DTW has a time complexity of O(P ⋅ m ⋅ n).
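As a rough worked instance, assuming the bin counts reported for the experiments in Section 3.3 (5 radial, 8 elevation and 12 azimuth bins) and two trajectories that are both normalized to 70 points, the number of bin-level operations for one 3DSC-DTW comparison is on the order of:

$$ P=5\times 8\times 12=480,\kern2em P\cdot m\cdot n=480\times 70\times 70=2{,}352{,}000 $$

This count ignores constant factors and the cost of building the histograms themselves.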

5 Experiments

In this section, we conduct a series of experiments to evaluate the proposed method. Experiments are conducted on three datasets: two versions of the Australian Sign Language (ASL) dataset (compact and large) from the UCI KDD archive [24] and the 3D hand digit dataset [7]. The compact ASL trajectory dataset consists of 95 sign classes (words), with 27 samples captured for each sign. The large ASL dataset also contains 95 signs. Each sign has 70 examples, giving 6650 sign samples in total.

For the evaluation on the ASL datasets (compact and large), we first use the compact ASL dataset to investigate the best trajectory normalization length. Second, we examine the benefits of the adaptive outer radius and of scale invariance in trajectory classification, and then compare our results with state-of-the-art methods. The trajectory recognition performance under varying training size is also tested on the large ASL dataset. Finally, we evaluate on the 3D hand digit dataset to test the capacity to discriminate between sub-trajectories and full trajectories as well as the robustness of the proposed method.

5.1 The benefit of using trajectory length normalization

The purpose of this experiment is to evaluate the impact of different trajectory normalization lengths on the performance of the 3DSC-based trajectory classification technique. As shown in Table 1, the classification accuracy with normalized trajectory lengths is overall higher than with the original trajectory lengths, which verifies the effectiveness of using normalized trajectories. The best results were obtained when the normalization length was approximately 70 sample points. This result suggests that the normalization length should be neither too large nor too small. Consequently, we choose 70 sample points as the normalization length for each trajectory.

Table 1 Classification accuracy for varying sample lengths of motion trajectories

5.2 The benefit of using adaptive outer radius

In this section, the compact ASL dataset is used and a 9-fold cross-validation is conducted for trajectory classification while varying the scale of the adaptive outer radius. One ninth of the trajectories from each category serve as testing samples and the others serve as training samples. We repeated this test 7 times. After the 7 rounds of evaluation, the average classification rate is computed for the final comparison.

The purpose of this experiment is to demonstrate the advantage of using the adaptive outer radius. Figure 8 shows the classification accuracy under varying scales of the adaptive outer radius. The scaled adaptive outer radius is defined as follows:

$$ {\left\{{r_n}^{\prime}\right\}}_{n=1}^{L^{\prime }}=\kappa \cdot {\left\{{r}_n^{\max}\right\}}_{n=1}^{L^{\prime }} $$
(13)

where \( \kappa \) is the scale factor that controls the size of the 3D shape context.

The x axis in Fig. 8 represents the scale factor from 0.1 to 1 in steps of 0.1. For example, 0.1 means that the outer radius of each 3D shape context is set to 10 % of the corresponding local maximum trajectory span distance. We consider scale factors from 0.1 to 0.9 to generate local features and a scale factor of 1 to generate the global feature. As the outer radius increases, the classification accuracy for 4, 8, and 10 classes gradually rises. All three settings reach their maximum classification rates of 95.68 %, 91.43 % and 90.14 %, respectively, at a scale factor of 1, which means that all gesture points are included in each 3D shape context ball and the global 3DSC descriptor is generated for computing the histogram distribution.

Fig. 8 Classification rate under varying adaptive outer radius

5.3 Scale invariance performance

In this part, we evaluate the classification accuracy under changes in trajectory scale and translation for 2, 4, 8, and 10 classes. To apply scaling to the input gestures, we multiply the x and y coordinates of each trajectory point by a set of scale factors ([1.1, 1.3, 1.5, 1.7, 1.9]). To apply translation to the input trajectories, we add a set of small increments ([0.01, 0.03, 0.05, 0.07, 0.09], in meters) to the x and y coordinates of each gesture point. As shown in Figs. 9 and 10, as the scale and translation factors increase, the classification rate of the Euclidean-distance-based DTW method [8] drops rapidly, whereas the proposed method and the Mix signature method [32] remain stable. Moreover, our method outperforms the Mix signature method. This advantage arises because the 3D shape context descriptor with adaptive outer radius captures the global information for each pair of points and thus serves as a robust similarity measure.
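The two perturbations can be written down directly. The sketch below assumes that the same offset is added to both x and y and that z is left unchanged; the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]

# Perturbation levels quoted in the text.
SCALE_FACTORS = [1.1, 1.3, 1.5, 1.7, 1.9]
TRANSLATIONS_M = [0.01, 0.03, 0.05, 0.07, 0.09]


def scale_xy(points: List[Point], factor: float) -> List[Point]:
    """Multiply the x and y coordinates of every point by factor; z is untouched."""
    return [(x * factor, y * factor, z) for x, y, z in points]


def translate_xy(points: List[Point], offset_m: float) -> List[Point]:
    """Add offset_m (in meters) to the x and y coordinates of every point.

    Assumption: the same offset is applied to both axes.
    """
    return [(x + offset_m, y + offset_m, z) for x, y, z in points]
```

A descriptor that is scale and translation invariant should give nearly the same matching cost for a trajectory and its perturbed copies, whereas a Euclidean distance on raw coordinates grows with the perturbation.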

Fig. 9 Classification accuracy vs. gesture scale: a Two classes, b Four classes, c Eight classes, d Ten classes

Fig. 10 Classification accuracy vs. gesture translation: a Two classes, b Four classes, c Eight classes, d Ten classes

5.4 Comparison with other methods

In this experiment, the compact ASL dataset is used to evaluate trajectory recognition performance. Since this database and the chosen classes were used in [26, 31, 32], we run our experiments under the same conditions for comparison. Because trajectory recognition also relies on an efficient recognition engine, we additionally test the adaptive 3DSC descriptor with two other recognition engines: a support vector machine (SVM) and the lock-step measure. The lock-step measure is a one-to-one correspondence matching between time series that compares the i-th point of one time series with the i-th point of the other. From the experimental results in Table 2, we observe that: 1) Among all approaches, the proposed method achieves the highest recognition rate for 2, 4, 8, and 10 classes. 2) The alignment-based methods, DTW [8, 26, 32] and Euclidean distance (Solution 1), outperform the discriminative method SVM (Solution 2). 3) As the number of classes increases, the recognition rate of every method gradually drops; however, the decline of our proposed method from 2 classes to 10 classes is 8.48 %, less than that of all other methods. That is, our method is more flexible for multiclass recognition.
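The difference between the lock-step measure and an alignment-based measure can be made concrete with a small sketch. The local cost defaults to the Euclidean distance between raw points purely as a placeholder; in the proposed method the local cost would instead compare 3DSC descriptors, and the `cost` parameter is a hypothetical hook for that. The DTW shown is the textbook dynamic-programming form, not necessarily the exact variant used in the experiments.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]
Cost = Callable[[Point, Point], float]


def lock_step_distance(a: Sequence[Point], b: Sequence[Point], cost: Cost = math.dist) -> float:
    """Compare the i-th point of a with the i-th point of b (equal lengths required)."""
    if len(a) != len(b):
        raise ValueError("lock-step matching needs sequences of equal length")
    return sum(cost(p, q) for p, q in zip(a, b))


def dtw_distance(a: Sequence[Point], b: Sequence[Point], cost: Cost = math.dist) -> float:
    """Classic dynamic time warping: minimum accumulated cost over monotone alignments."""
    n, m = len(a), len(b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]
```

For two equal-length sequences, the DTW cost never exceeds the lock-step cost, because the diagonal alignment is one of the warping paths DTW considers.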

Table 2 Comparison with the state-of-the-art methods on compact ASL dataset

5.5 Recognition with varying training size

This section uses the large ASL dataset to evaluate trajectory recognition performance under varying training size. For a fair comparison, we follow [26, 31, 32] and extract 2, 4, 8, and 10 classes to evaluate the classification accuracy. Since the dataset has 70 examples for each sign, we adopt a leave-one-out cross-validation strategy. During evaluation, we successively select each gesture example from each category as testing data and treat the remaining examples of that category as training data. To examine the relationship between training size and recognition rate, we randomly draw a given number of examples from the remaining training data. As shown in Fig. 11, as the training size increases, the classification accuracies for 2, 4, 8, and 10 classes improve significantly and reach their maximum with 69 training samples. The proposed method also outperforms all other compared state-of-the-art methods, which indicates that it is suitable for large datasets as well. We also tested other numbers of classes; these results are omitted for lack of space, but the trend in classification accuracy remains the same in general.
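The protocol for one category can be sketched as follows: each of the 70 examples is held out in turn, and a training subset of the requested size is drawn at random from the other 69. The function name and the use of a seeded generator are assumptions for illustration.

```python
import random
from typing import Iterator, List, Tuple


def leave_one_out_splits(n_examples: int, train_size: int, seed: int = 0) -> Iterator[Tuple[int, List[int]]]:
    """Yield (test_index, train_indices) pairs for one category.

    Each example is held out once; train_size examples are sampled
    without replacement from the remaining ones.
    """
    if not 1 <= train_size <= n_examples - 1:
        raise ValueError("train_size must be between 1 and n_examples - 1")
    rng = random.Random(seed)
    for test_idx in range(n_examples):
        remaining = [i for i in range(n_examples) if i != test_idx]
        yield test_idx, sorted(rng.sample(remaining, train_size))
```

With `train_size=69`, the sampled subset is simply all remaining examples, which corresponds to the largest training size reported.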

Fig. 11 The classification accuracy versus training size on Large ASL dataset

5.6 Results on 3D HSD dataset

In this section, experiments are conducted on the 3D Hand-Signed Digit (HSD) dataset, visualized in Fig. 12. This hand gesture dataset is a commonly used benchmark for gesture recognition, with 10 categories performed by 12 different people. For training, 300 digit exemplars (30 per class) were stored in the database. For testing, 440 digit exemplars (44 per class) were captured.

Fig. 12 The visualization of the 3D Hand-Signed Digit

Table 3 compares the performance of different algorithms using 300 training samples and 440 testing samples. As expected, our proposed method (Solution 3) yields a higher recognition rate than the other methods. It is worth noting that the proposed adaptive 3DSC descriptor also achieves relatively high performance when combined with the lock-step measure (Solution 2).

Table 3 Comparison of recognition rate with other state-of-the-art methods

In Table 4, we manually define the full-meaning gestures and their corresponding sub-gestures for hand-signed digit recognition. From Table 4, we can see that gestures “1”, “2”, “5”, and “7” can be defined as sub-gestures of {“7”, “9”}, {“3”}, {“8”}, and {“2”, “3”} respectively.

Table 4 Sub-gesture table for Hand Signed digit from “0” to “9”

In the following tables, we examine the number of mismatches between the super-gestures and their corresponding sub-gestures. For simplicity, we set the outer radius scale factor to 0.5 to represent the local 3DSC descriptor. As Tables 5, 6, 7 and 8 show, the total number of misclassifications caused by sub-gestures is larger than that for other gestures. This explains why the proposed global 3DSC representation is effective at decreasing misclassification and restraining the ambiguity among partially similar gestures.

Table 5 The misclassification numbers between sub-gesture "1" and the corresponding super-gestures
Table 6 The misclassification numbers between sub-gesture "2" and the corresponding super-gesture
Table 7 The misclassification numbers between sub-gesture "5" and the corresponding super-gesture
Table 8 The misclassification numbers between sub-gesture "7" and the corresponding super-gestures
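Counts of this kind can be tallied from a list of (true label, predicted label) pairs. The sketch below uses the sub-gesture relations stated in the text for Table 4 and assumes that a confusion is counted in either direction, sub-gesture predicted as super-gesture or the reverse. That counting rule and the function name are assumptions; the tables may use a different convention.

```python
from typing import Dict, Iterable, Set, Tuple

# Sub-gesture -> super-gestures, as stated for Table 4.
SUB_TO_SUPER: Dict[str, Set[str]] = {
    "1": {"7", "9"},
    "2": {"3"},
    "5": {"8"},
    "7": {"2", "3"},
}


def sub_gesture_confusions(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count misclassifications between each sub-gesture and its super-gestures.

    Assumption: both directions (sub -> super and super -> sub) are counted.
    """
    counts = {sub: 0 for sub in SUB_TO_SUPER}
    for true_label, predicted in pairs:
        if true_label == predicted:
            continue  # correct classification, nothing to count
        for sub, supers in SUB_TO_SUPER.items():
            sub_as_super = true_label == sub and predicted in supers
            super_as_sub = predicted == sub and true_label in supers
            if sub_as_super or super_as_sub:
                counts[sub] += 1
    return counts
```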

6 Conclusions and future work

In this paper, we present a novel motion trajectory classification method in the spatiotemporal domain. We introduce an invariant descriptor, the 3D shape context with adaptive outer radius, which flexibly extracts rich global shape context information for motion trajectory representation. We also propose an alignment algorithm based on Dynamic Time Warping that replaces the raw distance feature with the 3D shape context descriptor when computing matching similarity. We compare the classification performance of the proposed descriptor with previous descriptors on three benchmark datasets. The experimental results show that the proposed method achieves state-of-the-art performance in both accuracy and efficiency for motion trajectory classification in the spatiotemporal domain.

Several tasks remain for future work. A real-time motion trajectory recognition or classification system that automatically segments the motion trajectory by detecting the start and end frames is urgently needed. In addition, recognizing motion trajectories from different viewpoints and applying the proposed 3D motion trajectory strategy to applications such as activity recognition, anomaly detection, and video surveillance are worth further study.