1 Introduction

Monitoring and recognition of human motion based on wearable inertial sensors is an increasingly important research field in pattern recognition and machine learning [6, 9, 31]. With the aid of wearable inertial devices such as accelerometers and gyroscopes, users can obtain rich information about their own motion. Compared to video sensor-based human motion recognition, wearable inertial sensors have a number of advantages in data acquisition [13], such as easy carrying, low cost, unrestricted collection range and strong privacy preservation. Accordingly, this technique can be applied to many fields such as health care and assisted living [35, 38], disease prevention [32], entertainment, and exercise promotion [18, 19].

At present, the study of human motion recognition using wearable inertial sensors is mainly focused on isolated motion recognition. Although a variety of human motions have been researched [22, 26, 41], these motions have been obtained by manual segmentation, which wastes considerable time and energy. More importantly, these motions are isolated and their correlations with each other are removed. Although good recognition results can be obtained on such data, this approach is hard to apply in real life because of the need for manual segmentation and the artificial nature of the data collected. In practical applications, real motion data arrives as a continuous sequence, and human motions are random and disordered, which can make even manual motion identification difficult. In particular, motion data is usually mixed with useless data, so it is important to retrieve only the motion data we need from the time-series data [11]. For these reasons, a set of effective segmentation and recognition strategies for human motion sequences needs to be constructed.

Currently, research on human motion mainly uses video sensors: the motion sequences can be divided frame by frame, and the image associated with each frame can then be studied using a common recognition algorithm [8, 10, 25, 28, 34]. In addition, some more prominent methods using video sensors have been proposed. Wang et al. [39] introduced a method of motion sequence recognition based on characteristic descriptors. The authors translated dynamic silhouette sequences into multivariate time series in a low-dimensional space using tensor subspace analysis (TSA); a support vector machine (SVM) classifier was then used to classify the human motions after extracting motion descriptors from the multivariate time series. Zhao et al. [43] studied gesture recognition from motion data streams based on a dynamic matching approach. The authors mainly solved two problems: one was how to detect motions in a continuous data stream, and the other was how to recognize a gesture performed in different styles. Ofli et al. [29] proposed a new representation of human motions, described as sequences of the most informative joints (SMIJ); they used only the temporal ordering of joints to analyze motion sequences and also constructed several features to develop their method. Li et al. [24] proposed a new framework to predict long-term complex activities by mining human activity sequence patterns; the relationships between activities were established using a probabilistic suffix tree (PST), and sequential pattern mining (SPM) was used to structure the interactive information model.

Compared to recognizing human motion sequences with video sensors, studying human motion using only inertial sensors is more difficult [17]. Studies based on video sensors can easily extract related motion features (such as joint angle, distance, direction and human silhouette), but this is difficult for a study using inertial sensors, where only statistical features of the signals can be extracted, such as mean, variance or median. Accurate trajectories may be obtained using video sensors, which helps in building the motion model, but it is very difficult to generate motion trajectories using only inertial sensors. At present, there has been less research on human motion sequences using inertial sensors. Much of the current work [4, 7, 12, 16] is focused on improving the performance of classification algorithms; although these papers also studied motion sequences, they did not consider whether the motion sequences contain invalid data. In [2, 17], the authors studied only single-motion sequences without considering multiple-motion sequences, and the sophisticated sensors used in their work are difficult to apply in real life. The literature on activity spotting [1, 30] systematically studied human motion sequences using wearable inertial sensors; the first step was to extract the useful motion data from the data stream, and the extracted data was then recognized using traditional classification algorithms. The main inadequacy of these papers was that the accuracy rates at the segmentation stage were relatively low.

In this paper a novel monitoring framework for human motion using wearable inertial sensors is proposed. The monitoring framework can handle human motion sequences automatically and retrieve the motion data to be recognized from the motion sequences. The framework does not require manual processing, saving manpower, material resources and time. The main contributions of our work are as follows:

  (1)

    We propose a systematic and automatic monitoring framework based on wearable inertial sensors. The framework mainly includes three stages: data acquisition, segmentation and recognition. Segmentation is mainly composed of pre-segmentation and fine segmentation. A hidden Markov model (HMM) is used to recognize the segmented human motion data at the recognition stage.

  (2)

    During pre-segmentation, singular value decomposition (SVD) is used to remove as much useless data as possible, reducing the time required for the whole segmentation process.

  (3)

    A novel similarity measure function called the multi-sensor similarity measure function (MSHsim) is proposed to achieve accurate segmentation during fine segmentation.

  (4)

    The monitoring framework can recognize multiple human motions, and because a sliding window is used at the segmentation stage, online recognition can be achieved with the proposed monitoring framework.

The rest of the paper is organized as follows: in Section 2, the monitoring framework is given and a detailed introduction is given of the segmentation and recognition stages. In Section 3 the experiment and evaluation are presented; data acquisition is introduced in detail, and performance measurement and evaluation of the proposed method are discussed. This is all followed by the conclusion in the last section.

2 Methodology

The main components of the monitoring framework of human motion are shown in Fig. 1. It includes three components: data acquisition, segmentation and recognition. Data acquisition will be introduced in Section 3.1. Segmentation consists of pre-segmentation and fine-segmentation. At the recognition stage an HMM is used. In addition, for a motion sequence, the motion data to be recognized is described as labeled data, and the rest is called junk data in this paper.

Fig. 1

Detailed structure of the human activity sequence recognition framework

2.1 Segmentation stage

2.1.1 Pre-segmentation

For segmentation of the human motion sequence, the first step is pre-segmentation as shown in Fig. 1, and its purpose is to remove the junk data. For this purpose, we use SVD combined with a sliding window. Because of this sliding window, our proposed framework has the ability of on-line processing as well.

For a real matrix A ∈ Rm × n, where m ≥ n, there must exist two orthogonal matrices U = [u1, u2, ⋯ , um] ∈ Rm × m and V = [v1, v2, ⋯ , vn] ∈ Rn × n such that,

$$ A=UDV^{T}, $$
(1)

here, D = diag(λ1, λ2, ⋯ , λn), with λ1 ≥ λ2 ≥ ⋯ ≥ λn, and λi (i ∈ {1, 2, ⋯ , n}) represents the eigenvalues of matrix A. In (1), V contains the right eigenvectors of the real matrix A. For two similar data matrices, the difference between them can be measured sensitively through the inner product of the right eigenvectors of the two matrices [23, 37]. Suppose v1 and v2 are the leading right eigenvectors of two similar data matrices, and α is the angle between them; then |v1 ⋅ v2| = |v1||v2||cos(α)|, and because v1 and v2 are unit vectors, |v1| = |v2| = 1, so |v1 ⋅ v2| = |cos(α)|. When α is close to 0, |v1 ⋅ v2| is close to 1. In particular, when α is equal to 0 (which means the two data matrices are the same), |v1 ⋅ v2| = 1. Most importantly, this similarity is mainly determined by the first few larger eigenvalues of the data matrices [23].
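The right-eigenvector similarity test described above can be sketched in a few lines of Python. This is a toy illustration with synthetic windows, not the authors' code; the window sizes, noise level and variable names are our own assumptions:

```python
import numpy as np

# Two hypothetical windows of 6-axis sensor data that are nearly identical,
# standing in for "two similar data matrices".
rng = np.random.default_rng(0)
base = rng.standard_normal((100, 6))
window_a = base + 0.001 * rng.standard_normal((100, 6))
window_b = base + 0.001 * rng.standard_normal((100, 6))

# Rows of vt are the right eigenvectors (right singular vectors).
_, _, vt_a = np.linalg.svd(window_a, full_matrices=False)
_, _, vt_b = np.linalg.svd(window_b, full_matrices=False)

# |v1 . v2| = |cos(alpha)| approaches 1 as the windows become identical.
similarity = abs(np.dot(vt_a[0], vt_b[0]))
```

For the two nearly identical windows above, `similarity` is essentially 1; for unrelated windows it is noticeably lower.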

Suppose there are z sensor nodes, and the motion data from the k th sensor node is defined as the matrix Xk, k ∈ (1, 2, ⋯ , z). \(V_{k}=[{v_{1}^{k}},{v_{2}^{k}},\cdots ,{v_{n}^{k}}]\) is the right eigenvector set of matrix Xk after SVD. In addition, the candidate matrix is Y; suppose the eigenvalues of matrix Y are \((\widehat {\lambda }_{1},\widehat {\lambda }_{2},\cdots ,\widehat {\lambda }_{n})\), with \( \widehat {\lambda }_{1}\geq \widehat {\lambda }_{2}\geq {\cdots } \geq \widehat {\lambda }_{n}\), and \(\widehat {V}=[\widehat {v}_{1},\widehat {v}_{2},\cdots ,\widehat {v}_{n}]\) is the right eigenvector set of matrix Y after SVD. Then the segmentation function δk for the k th sensor node may be defined by the following equation,

$$ \delta_{k}=\frac{1}{\omega}\sum\limits_{i = 1}^{\omega}|{v_{i}^{k}}*\widehat{v}_{i}|, $$
(2)

and ω may be obtained by the following equation,

$$ J(\omega)=\mathop{min}\limits_{\omega} \,\,\left\{\left( \sum\limits_{i = 1}^{\omega}\widehat{\lambda}_{i}\right)/\left( \sum\limits_{i = 1}^{n}\widehat{\lambda}_{i}\right)\geq \sigma\right\}. $$
(3)

Here, σ is a threshold parameter. Suppose that the data matrix obtained from the k th sensor node after using (2) is \(\widetilde {X}_{k}\), and its set of serial numbers during the sampling time is defined as \(L_{pre}^{k}(\widetilde {X}_{k})\); then the final result Lpre over all sensor nodes can be defined as follows,

$$ \begin{array}{ll} L_{pre}&=L_{pre}^{1}(\widetilde{X}_{1})\cap L_{pre}^{2}(\widetilde{X}_{2}) \cap{\cdots} \cap L_{pre}^{z}(\widetilde{X}_{z})\\ &=\bigcap\limits_{i = 1}^{z}L_{pre}^{i}(\widetilde{X}_{i}). \end{array} $$
(4)
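As a sketch of (2) and (3), the per-node segmentation score with ω chosen from the candidate's spectrum might look as follows. This is our own illustrative code, not the authors' implementation, and we follow the paper in using the ordered values returned by the SVD of the candidate matrix as its eigenvalues:

```python
import numpy as np

def choose_omega(eigvals, sigma=0.85):
    """Smallest omega such that the leading eigenvalues of the candidate
    matrix capture at least a fraction sigma of their total, per Eq. (3)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratios = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(ratios, sigma) + 1)

def delta_k(window, candidate, sigma=0.85):
    """Segmentation score of Eq. (2) for one sensor node: the mean of
    |v_i . v_hat_i| over the first omega right eigenvectors."""
    _, s_cand, vt_cand = np.linalg.svd(candidate, full_matrices=False)
    _, _, vt_win = np.linalg.svd(window, full_matrices=False)
    omega = choose_omega(s_cand, sigma)
    return float(np.mean([abs(np.dot(vt_win[i], vt_cand[i]))
                          for i in range(omega)]))
```

Sliding this score along the sequence and intersecting the retained sample indices across the z sensor nodes, as in (4), then yields Lpre. For a window identical to the candidate, `delta_k` returns 1.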
Algorithm 1

The detailed procedure for pre-segmentation is given in Algorithm 1; T1 is an empirical parameter and is set to 0.85 in this paper. For a data matrix A ∈ Rm × n (m ≥ n), the time complexity of the SVD of A is O(mn2), whereas the SVD of the matrix ATA takes O(n3). By matrix theory, the right eigenvectors of A are the eigenvectors of ATA, so in this paper we apply the SVD to ATA instead of A in order to save computation cost. In general, the sensitivity of this SVD-based method is high according to matrix theory, and the segmentation function value (2) is high only when two motions are very similar. In practice, everyone has his or her own style of performing a motion, which makes it impossible for two different people to produce fully consistent motions. Even the same person cannot perform two perfectly consistent motions, due to environmental factors. In addition, if the accuracy of the inertial sensors used is not very high, it is even more difficult to find a correspondence between two very similar trajectories. Two examples are given in Fig. 2. The first example, comprising Fig. 2a and b, represents the same person performing the same motion twice. Figure 2a shows the sensor signals, including the acceleration signal and the angular velocity signal. From Fig. 2a it can be seen that there is a difference between the two sensor signals, although the subject and the motion are the same. Figure 2b shows the corresponding eigenvalues; the deviation of the second eigenvalue is large, one being 313.37 and the other 580.49. For the first example, the segmentation function value is only 0.94 according to (2). The second example, comprising Fig. 2c and d, is obtained from two different subjects performing the same motion. From Fig. 2c it can be seen that the two curves differ. Figure 2d shows the corresponding eigenvalues: the eigenvalues of the first subject are 1367.1, 646.15, 290.95, 12.19, 6.42 and 2.81, and those of the second subject are 1134.5, 746.20, 215.49, 8.69, 4.62 and 1.86. For the second example, the segmentation function value is just 0.89, which falls short of the 0.995 reported in [23]. Although fine segmentation of the human motion sequence is difficult to achieve with this method, it is chosen for pre-segmentation because of its small computational cost, the objective being to reduce the time required for the whole segmentation process as much as possible.
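The computational shortcut mentioned above, taking the eigenvectors of ATA instead of computing the full SVD of A, can be checked numerically; this is a toy verification with made-up data, not part of the authors' pipeline:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 6))  # tall sensor-data matrix, m >> n

# Full SVD of A: rows of vt are the right eigenvectors.
_, _, vt = np.linalg.svd(A, full_matrices=False)

# Eigen-decomposition of the small n x n matrix A^T A, sorted by
# decreasing eigenvalue; its eigenvectors are the same vectors (up to sign).
w, V = np.linalg.eigh(A.T @ A)
V = V[:, np.argsort(w)[::-1]]

# Each pair of vectors agrees up to sign, so |dot| is 1.
agreement = [abs(np.dot(vt[i], V[:, i])) for i in range(6)]
```

Since ATA is only n × n, this reduces the decomposition cost from O(mn2) to O(n3) plus the one-off O(mn2) product.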

Fig. 2

Two examples for pre-segmentation. a The sensor signals of the running activity for the same subject. b The corresponding eigenvalues of the six-axis sensor signals in (a). c The sensor signals for two different subjects. d The corresponding eigenvalues of the six-axis sensor signals in (c)

2.1.2 Fine segmentation

After the pre-segmentation step, the new motion data can be obtained according to (2) and (3). During fine segmentation, our purpose is to extract the required motion data from the data remaining after pre-segmentation. We mainly use a feature similarity search between the sequence data in a specific window and the candidate data, comparing their similarity via feature vectors. It is important to note that a novel similarity measure function, MSHsim, is proposed for this task. This similarity measure function is established mainly on two bases: the similarity measure function (Hsim) proposed in [42] and the characteristics of the sensor data. Hsim is defined by the following equation,

$$ \text{H}_{sim}=\frac{1}{n}\left( \sum\limits_{i = 1}^{n}\frac{1}{(a_{i}-b_{i})^{p}}\right)^{\frac{1}{p}}, $$
(5)

here A = [a1, a2, ⋯ , an] ∈ Rn and B = [b1, b2, ⋯ , bn] ∈ Rn are two vectors. The function in [42] is not suitable for the present research because it does not consider the correlation between two vectors; as a result, only certain features play a major role, even though these features may not be very important.

Algorithm 2

Suppose there are z inertial sensor nodes, each of which measures an acceleration signal and an angular velocity signal. For the candidate data, the corresponding acceleration feature set and angular velocity feature set are \(Y_{acc}=[y_{1}^{acc},y_{2}^{acc},\cdots , y_{p}^{acc}]\) and \(Y_{av}=[y_{1}^{av},y_{2}^{av},\cdots , y_{p}^{av} ]\). The acceleration feature vector is \(y_{i}^{acc}=[y_{i1}^{acc},y_{i2}^{acc},\cdots ,y_{is}^{acc}]^{T}\in R^{s}\) and the angular velocity feature vector is \(y_{i}^{av}=[y_{i1}^{av},y_{i2}^{av},\cdots ,y_{iv}^{av}]^{T}\in R^{v}\), (i ∈ (1, 2, ⋯ , p)). s and v are the dimensions of the acceleration features and angular velocity features; each feature vector is composed of basic features such as mean, variance and energy, obtained based on the fusion of the z sensors' data. For the motion sequence data obtained by pre-segmentation, suppose that for a specific window whose size is greater than the length of all candidate data, the acceleration feature vector is \(x^{acc}=[x_{1}^{acc},x_{2}^{acc}\), \(\cdots ,x_{s}^{acc}]^{T}\in R^{s}\) and the angular velocity feature vector is \(x^{av}=[x_{1}^{av},x_{2}^{av},\cdots \), \(x_{v}^{av}]^{T}\in R^{v}\). Then the measure function Facc for the acceleration feature vector xacc and the measure function Fav for the angular velocity feature vector xav are as follows,

$$ \begin{array}{ll} F_{acc}&=\frac{1}{ps}\sum\limits_{i = 1}^{p}\left( \sum\limits_{j = 1}^{s}\frac{1}{1+\mathrm{(C_{i}^{acc})}^{-1}|x_{j}^{acc}-y_{ij}^{acc}|^{2}}\right)^{\frac{1}{2}}\\ F_{av}&=\frac{1}{pv}\sum\limits_{i = 1}^{p}\left( \sum\limits_{j = 1}^{v}\frac{1}{1+\mathrm{(C_{i}^{av})}^{-1}|x_{j}^{av}-y_{ij}^{av}|^{2}}\right)^{\frac{1}{2}}. \end{array} $$
(6)

Here, \(\mathrm {C_{i}^{acc}}=cov(x^{acc},y_{i}^{acc})\), \(\mathrm {C_{i}^{av}}=cov(x^{av},y_{i}^{av}),\,\,i\in (1,2,\cdots ,p)\), and cov(⋅) represents the covariance function. The multi-sensor similarity measure function MSHsim can then be defined as follows,

$$ \text{MSH}_{sim}=\frac{1}{2}(F_{acc}+F_{av}). $$
(7)

According to (6), the computation involves multiplications of n × n dimensional feature matrices for the z sensors, so the time complexity of MSHsim is O(z3n3). In addition, the similarity measure function MSHsim has the following characteristics: (1) MSHsim reflects the level of similarity between two pieces of data, obtained by comparing each dimension of the feature vectors; (2) the value range of MSHsim is [0, 1]: a higher value indicates that the two pieces of data are more similar, a value of 1 indicates that they are equivalent, and a value close to 0 indicates that their similarity is low; (3) MSHsim considers the correlation between two feature vectors through the covariance function; (4) MSHsim also considers the characteristics of multi-sensor data: the acceleration data and angular velocity data are computed separately, as they are on different scales. The detailed procedure of fine segmentation is presented in Algorithm 2, and in this paper T2 is chosen in the interval [0.96, 0.98].
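A literal transcription of (6) and (7) in Python might read as follows. This is our own sketch, not the authors' code; we assume that each \(C_i\) is the scalar covariance between the window feature vector and the i-th candidate feature vector, and all function names are ours:

```python
import numpy as np

def f_channel(x, Y):
    """Inner score of Eq. (6) for one signal type.
    x: window feature vector, shape (d,); Y: candidate feature vectors (p, d)."""
    p, d = Y.shape
    total = 0.0
    for y_i in Y:
        c = np.cov(x, y_i)[0, 1]  # scalar covariance C_i between the vectors
        total += np.sqrt(np.sum(1.0 / (1.0 + (x - y_i) ** 2 / c)))
    return total / (p * d)

def mshsim(x_acc, Y_acc, x_av, Y_av):
    """MSHsim of Eq. (7): the acceleration and angular-velocity channels are
    scored separately (they live on different scales) and then averaged."""
    return 0.5 * (f_channel(x_acc, Y_acc) + f_channel(x_av, Y_av))
```

Note that under this literal transcription, a window identical to a single candidate of dimension 4 scores 0.5 on each channel.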

2.2 Recognition stage

After segmenting the motion sequences, the motion data with labels needs to be recognized. The classification algorithm we chose is an HMM; HMMs have long been used in the field of human motion recognition based on wearable inertial sensors [40]. We establish an HMM for each human activity, and the classification result is obtained by finding the activity whose model gives the maximum posterior probability. The features used as observations in the HMM include mean, variance, kurtosis, correlation coefficients, energy and entropy [20, 21]; the resulting number of features is 144 (6 features × 4 nodes × 6 axes). The feature selection method we used is robust linear discriminant analysis (RLDA) [15]. RLDA is a modified form of linear discriminant analysis (LDA); it addresses the problem that estimation errors in the smaller eigenvalues of the within-class scatter matrix grow and distort the discriminant analysis, by re-estimating those smaller eigenvalues.
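The maximum-posterior decision over per-activity HMMs can be illustrated with a minimal discrete-observation forward algorithm. This is a toy sketch with two invented activity models over a binary symbol, not the authors' trained HMMs:

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Forward algorithm in log space: log P(obs | HMM) for initial
    distribution pi, transition matrix A and emission matrix B."""
    pi, A, B = map(np.log, (np.asarray(pi), np.asarray(A), np.asarray(B)))
    alpha = pi + B[:, obs[0]]
    for o in obs[1:]:
        # sum over previous states i of alpha_i * A[i, j], in log space
        alpha = B[:, o] + np.logaddexp.reduce(alpha[:, None] + A, axis=0)
    return np.logaddexp.reduce(alpha)

def classify(obs, models):
    """Label of the activity HMM with the highest likelihood, i.e. the
    maximum posterior under equal class priors."""
    return max(models, key=lambda m: log_likelihood(obs, *models[m]))

# Two made-up activity models over a binary feature symbol.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "WA":  (pi, A, [[0.9, 0.1], [0.8, 0.2]]),   # mostly emits symbol 0
    "RUN": (pi, A, [[0.1, 0.9], [0.2, 0.8]]),   # mostly emits symbol 1
}
```

A sequence dominated by symbol 0 is assigned to "WA", and one dominated by symbol 1 to "RUN".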

3 Experiment and evaluation

3.1 Experimental platform and data acquisition

In this paper, data acquisition is based on a multi-sensor experimental platform, which mainly includes the following components: (1) four inertial sensor nodes, each consisting of a triaxial accelerometer (ADXL325) and a triaxial gyroscope (LPR550AL); the size and shape of a sensor node are shown in Fig. 3b; (2) one wireless receiving node, whose shape is shown in Fig. 3c; (3) a personal computer. The four sensor nodes collect the motion data, which is uploaded to the personal computer through the wireless receiving node; the data processing software runs on this computer.

Fig. 3

The equipment in our experiment. a One subject with inertial sensor nodes, positioned as follows: 1 right waist, 2 left arm, 3 right ankle, 4 left thigh. b The size and shape of one inertial sensor node. c The shape of the receiving node

Motion data was collected from nine subjects from our laboratory; training data was collected from five of them, and the remaining four subjects were held out as a test set. Four inertial sensor nodes were placed on each subject at the right waist, left arm, right ankle and left thigh, as shown in Fig. 3a. The sampling frequency is 50 Hz. All subjects performed the same ten basic sports motions: walking (WA), running (RUN), turning-left waist (TLW), turning-right waist (TRW), pressing-left leg (PLL), pressing-right leg (PRL), kicking-left leg (KLL), kicking-right leg (KRL), climbing stairs (CS) and going downstairs (GD). The training subjects were asked to perform one motion at a time, and the execution time for each motion was about 120 s. The test subjects (marked as Subject 1#, Subject 2#, Subject 3# and Subject 4#) were asked to carry out the corresponding motion sequences; the execution order is shown in Fig. 4. Figure 4a shows the actual execution frames of Subject 1#, obtained from video, and Fig. 4b shows the resultant acceleration curve from the right waist sensor node. The testing time is about 660 s for each test subject.

Fig. 4

The sequence of activities. a The actual execution frame of Subject 1#. b The resulting acceleration curve from right waist sensor node

3.2 Performance measurement

In order to evaluate the performance of the proposed method, the following metrics are used [5]: Precision, Recall, Accuracy, and F-score. Suppose that TP (true positive) represents the number of positive sampled points that are classified as positive, FP (false positive) represents the number of negative sampled points that are classified as positive. FN (false negative) represents the number of positive sampled points that are classified as negative, TN (true negative) represents the number of negative sampled points that are classified as negative. The above-mentioned metrics are defined as follows,

$$ \textit{Precision}=\frac{TP}{TP+FP}, $$
(8)
$$ \textit{Recall}=\frac{TP}{TP+FN}, $$
(9)
$$ \textit{Accuracy}=\frac{TP+TN}{TP+FP+FN+TN}, $$
(10)
$$ \textit{F-score}= 2*\frac{Precision*Recall}{Precision+Recall}. $$
(11)

This problem is a multi-class problem, so the averages of recall and precision are used to calculate the F-score.
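The four metrics, together with the macro-averaging used for the multi-class F-score, can be written down directly. This is an illustrative helper of ours, not the authors' evaluation code:

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, accuracy and F-score per Eqs. (8)-(11)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_score

def macro_f_score(per_class):
    """Multi-class F-score: average precision and recall over the classes
    (each given as a (tp, fp, fn, tn) tuple), then combine as in Eq. (11)."""
    ps = [tp / (tp + fp) for tp, fp, fn, tn in per_class]
    rs = [tp / (tp + fn) for tp, fp, fn, tn in per_class]
    p_bar, r_bar = sum(ps) / len(ps), sum(rs) / len(rs)
    return 2 * p_bar * r_bar / (p_bar + r_bar)
```

For example, counts of tp = 90, fp = 10, fn = 5, tn = 95 give a precision of 0.9 and an accuracy of 0.925.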

3.3 Performance evaluation for segmentation

In this section, we evaluate the running time of segmentation for multi-sensor data. Table 1 shows the comparison of running time between MSHsim and SVD+MSHsim for the four subjects. From the table it can be seen that the running time more than doubles if we use only the fine segmentation method MSHsim without SVD. This demonstrates the effectiveness of the SVD-based pre-segmentation method proposed in this paper.

Table 1 The comparison of running time (s) between MSHsim and SVD+MSHsim for four subjects

In order to evaluate the segmentation performance, the four subjects were required to perform the motion sequences mentioned in Section 3.1, and the results are shown in Fig. 5. Figure 5a shows the segmentation result of Subject 1#; the retrieved motions are represented as labeled data, and the rest is defined as junk data. The blue curve in the first box of Fig. 5a represents the real situation, obtained from the video, and the red curve of Fig. 5a is the result of data segmentation using the proposed method. The results for the remaining three subjects are obtained similarly. From the figure, it can be seen that the result of Subject 1# is consistent with expectations. For Subject 2#, however, the result is not good compared with the actual condition. Overall, the figure shows that most of the labeled data can be retrieved using the proposed method, although there are some errors for the four subjects.

Fig. 5

The results of data segmentation for four subjects by using the proposed segmentation method. The blue curves represent the real situations, and the red curves are the segmentation results

In order to show the segmentation effects more clearly, the confusion matrices of the four subjects, combined with precision and recall, are given in Table 2. For Subject 1#, the precision and recall of labeled data are 95.60% and 97.88% respectively, both high. The precision and recall of junk data are also high at 95.73% and 91.15% respectively, which shows that the proposed segmentation method is effective. For Subject 2#, the precision and recall of labeled data are 88.35% and 96.38%, and the precision and recall of junk data are 91.21% and 74.73%. The recall of junk data is not high: more than 20% of the junk data is wrongly assigned as labeled data. The same situation appears for Subject 3#, whose precision and recall of labeled data are 90.86% and 96.90% respectively, with junk-data precision and recall of 91.96% and 77.66%. For Subject 4#, the results are very good, like those of Subject 1#. In summary, from Table 2 it can be seen that the precisions and recalls for all of the subjects are good, which again shows that the segmentation method proposed in this paper is effective. Table 3 gives the detailed segmentation results for the four metrics using SVD+MSHsim.

Table 2 The confusion matrices of four subjects during segmentation. LD represents labeled data, JD represents junk data
Table 3 The detailed results (%) of Precision, Recall, Accuracy and F-score by using SVD+MSHsim at segmentation stage, \(\overline {P}\) represents the average of precision values and \(\overline {R}\) represents the average of recall values

To demonstrate the superiority of the proposed segmentation method based on SVD and MSHsim, we compare our method with other methods, including Hsim and the Euclidean distance [1, 3]. The Euclidean distance is a widely used similarity measure, defined as follows,

$$Euclidean=((x-y)(x-y)^{T})^{\frac{1}{2}}, $$

where x ∈ Rn and y ∈ Rn are two feature vectors. Figure 6 gives the segmentation accuracy rates of the different segmentation methods: Hsim, Euclidean distance and our method. From Fig. 6 it can be seen that MSHsim outperforms the other two algorithms for all four subjects. For Subject 1#, the accuracy rate of MSHsim is 95.58%, while the other two methods achieve 92.38% and 79.50%. For Subject 2#, the result of our method is just 89.13%, but it still exceeds the other methods, whose results are 86.74% and 77.42%. For Subject 3#, the results are 91.05%, 88.93% and 78.79% for MSHsim, Hsim and Euclidean distance respectively. The accuracy rates are 94.89%, 84.32% and 75.98% for Subject 4#. As the results show, both MSHsim and Hsim perform better than the Euclidean distance. That is because the first two methods calculate the level of similarity dimension by dimension over the feature vectors, whereas the similarity measured by the Euclidean distance is calculated holistically; the latter leads to greater error, especially for acceleration data and angular velocity data that have different orders of magnitude. In addition, our method MSHsim is slightly better than Hsim, because we fully consider the characteristics of the sensor data and use the covariance function to account for the correlation between feature vectors; moreover, because the acceleration data and angular velocity data have different dimensions, they are processed separately.

Fig. 6

The accuracy rate of segmentation between different segmentation methods including Hsim, Euclidean distance and MSHsim

3.4 Performance evaluation for recognition

After segmentation, the retrieved data is classified into the ten kinds of human motions introduced in Section 3.1. The recognition algorithm we chose is the HMM because of its ability to process sequences. Figure 7 gives the detailed classification results of the four subjects. The red curves represent the recognized motion types, and the blue curves represent the actual acquisition data. In the four pictures of Fig. 7, the two curves in each picture basically coincide. In detail, the red and blue curves in Fig. 7a are closer than in the other three pictures, except for TLW and TRW, which suggests that the classification result of Subject 1# is the best among the four subjects. The overlap of the two curves in Fig. 7d is not good, especially for KLL, which shows that the classification performance for Subject 4# is poor. The concrete results are given next to show the classification effects in more detail.

Fig. 7

The classification results of four subjects compared with the actual situations, where the red curve of each picture represents the classification result, and the blue curve represents the corresponding actual situation. JD represents junk data

Table 4 gives the corresponding confusion matrices for the four subjects. For Subject 1# (Table 4(a)), it can be seen that the recalls and precisions of TLW and TRW are relatively low compared with the other motions, not exceeding 90%. The reason is that these two motions are easily confused, which is consistent with Fig. 7a. The classification results of the other motions in Table 4(a) are relatively better. For Subject 2# (Table 4(b)), the precisions of four motions, PRL, KLL, KRL and GD, are much worse than those of the other motions, at 79.61%, 79.08%, 60.26% and 81.18%, respectively. The recalls of TRW and KRL are just 79.01% and 76.15%, respectively. This is principally because the segmentation error is larger for Subject 2#. The precision of GD is very poor for Subject 3#, because some data in CS and some junk data are erroneously classified as GD; the results for the other motions are good. For Subject 4#, most of the data in PLL are erroneously classified as PRL. As a result, the recall of PLL is very low at only 8.45%, and the precision of PRL is only 54.43%, which is consistent with Fig. 7d. In short, we can achieve good results for motion sequence classification using an HMM. Table 5 gives the detailed recognition results for the four metrics using an HMM.

Table 4 The confusion matrices of four subjects in recognition stage
Table 5 The detailed results (%) of Precision, Recall, Accuracy, and F-score by using an HMM for the recognition stage

Figure 8 shows the comparison of recognition results between the HMM and four other classification algorithms: K-nearest neighbor (KNN) [36], naive Bayes (NB) [33], softmax regression (SR) [14] and the linear discriminant analysis classifier (LDAC) [27]. From Fig. 8 it can be seen that the HMM achieves the best accuracy rates for the four subjects: 92.75%, 87.75%, 89.42% and 90.21%, respectively. The results of both NB and LDAC are relatively stable, with accuracy rates above 80% for all four subjects, but all are lower than those of the HMM. The KNN result is relatively low for each subject, and the performance of SR is also not satisfactory. This demonstrates that the classification algorithm chosen in this paper is effective.

Fig. 8

The comparison of recognition accuracy rates between the HMM and four other classification algorithms including KNN, NB, SR and LDAC

4 Conclusion

In order to realize the segmentation and recognition of human motion sequences using wearable inertial sensors, a novel recognition framework is proposed, mainly composed of a data acquisition stage, a segmentation stage and a recognition stage. Four inertial sensors are used to collect human motion data. The segmentation stage is divided into two steps: pre-segmentation and fine segmentation. At the pre-segmentation step, SVD is used to delete the junk data in the human motion sequences. We also proposed a novel similarity measure, MSHsim, to realize accurate segmentation of the motion sequence during fine segmentation. An HMM was used in the recognition stage. Motion sequences from four test subjects were used to validate the proposed methods. The experimental results demonstrate the effectiveness of the proposed framework.

In our future work, some problems still need further research. First, on-line recognition should be realized if we want to use this method in real life; the performance of the inertial sensors should be improved to achieve long-term monitoring of human motion, and we also plan to establish a monitoring system based on cloud computing. Second, the time consumption is still large because of multi-sensor fusion, and it needs to be reduced by simplifying the algorithms. Third, some more complex motion sequences need to be studied.