1 Introduction

Monitoring the activities of elderly people who live alone is one of the important issues in modern electronic healthcare [1]. Recognizing and distinguishing such activities can support several applications, including assisted living, rehabilitation, and surveillance. Although video-based monitoring is the simplest way to recognize human activity, this approach has received less attention because of its privacy-invasive nature. In parallel with recent advances in Micro Electro Mechanical Systems (MEMS), low-cost, small-sized sensors have emerged and are now widely used in smartphones, smartwatches, and healthcare devices. Some of these sensors (for example, the accelerometer and gyroscope) enable smartphones and smartwatches to be used as equipment for monitoring human activities [2]. The similarity between signals captured from different activities of a person (for example, walking upstairs and downstairs, or different eating activities) means that discriminating between several human activities remains a challenging classification problem. In addition, the large volume of recorded data increases the computational cost of recognition algorithms.

In several studies, various linear and nonlinear classification schemes have been proposed to address the above limitations. The main objective of these schemes is to obtain acceptable accuracy, especially as the similarity between activities increases. Some early approaches try to distinguish activities using simple features such as the mean [3,4,5] or variance [6]. Although such methods show acceptable results for several simple activities (e.g., standing, sitting, and running), their outcomes for more complicated activities (e.g., stretching and riding an elevator) are not satisfactory.

Some studies make use of the Fourier transform domain to extract features that may distinguish between similar human activities. These features reflect the frequency-based properties of the activities and may therefore differentiate between activities with different frequency contents. Unfortunately, this family of methods does not achieve sufficient accuracy in recognizing activities and also incurs a high computational cost.

In some studies [7], a support vector machine (SVM) with Gaussian kernels is used as the classifier. This type of kernel allows more flexible decision boundaries and therefore ultimately increases the recognition accuracy for several activities. However, selecting the SVM parameters in this method is very challenging, because the resulting accuracy is highly dependent on these parameters.

Some more sophisticated methods classify different activities by constructing a Hidden Markov Model (HMM) [8]. Although this technique has shown better results than many of its older alternatives, its performance is highly dependent on the quantity and quality of the features extracted from the recorded signals.

In parallel with recent advances in processors with high computational power, deep neural networks have attracted a lot of attention as an effective paradigm for overcoming the challenges of human activity recognition. In this approach, a deep neural network extracts non-handcrafted features from its raw input data [9]. Furthermore, deep neural networks learn multiple levels of representation of the data. Such a multi-level representation scheme, together with their deep architecture (several processing layers), enables them to obtain more accurate results [9]. Deep convolutional neural networks (CNNs) originate from deep learning theory and are built on large-scale data and different types of layers. Part of this structure is responsible for extracting discriminative features from the input data, while the rest is responsible for classifying the data based on the extracted features. Owing to these abilities, deep convolutional networks have been widely used in recent years to separate human activities, in standard form as well as in partial or full weight-sharing versions (e.g., partial weight sharing in the first convolutional layer and full weight sharing in the second convolutional layer) [10]. Unfortunately, the temporal dependency that is the main characteristic of human activity signals is not addressed by the classic CNN, which hampers its performance in human activity recognition. Therefore, in complementary research, the temporal dependency of the data has been incorporated into the solution. In some studies, recurrent neural networks (RNNs) were used to account for the time dependence of human activities when constructing deep neural networks [11]. In more complicated solutions, several combinations of CNN and long short-term memory (LSTM) were introduced in order to extract temporal and local features simultaneously [12]. Although these methods enabled researchers to improve the results of human activity recognition systems, accuracy saturation still remains an important limiting factor in this application.

In this paper, a new method is introduced to improve the discrimination of different human activities. The proposed algorithm is based on minimizing the accuracy saturation phenomenon while improving the optimization ability of the LSTM-CNN. In our proposed algorithm, the temporal deep learning scheme is modified by using the concept of a residual network. The resulting architecture utilizes shortcuts to jump over some layers, thanks to the residual connections embedded in its body. The problem of vanishing gradients is therefore addressed by reusing activations from a previous layer until the adjacent layer has learned its weights, and consequently the problem of accuracy saturation is greatly reduced. The paper is organized as follows. In Sect. 2, the proposed approach is presented, including the dataset and pre-processing, the training of the deep neural network, and the classification of activities. In Sect. 3, the performance of the proposed method is evaluated by comparing its results with those of other deep learning methods. In Sect. 4, the results obtained with the proposed scheme are compared with the results of non-deep techniques. Finally, the conclusion is presented in the last section of the paper.

2 Methods

In this section, the details of the proposed method are described. First, the LSTM-CNN deep structure is introduced; then, a residual network is applied to make the deep structure more robust against the vanishing gradient problem and to increase the optimization performance of the network through identity mapping. Finally, classification is performed to distinguish six human activities.

2.1 Convolutional neural network

The convolutional neural network (CNN) is a kind of deep neural network with high potential for extracting high-level features. Feature extraction is performed in the so-called convolutional layers of the CNN [12], thanks to its linear and nonlinear kernels, and independently of the position of the features, which makes the extracted features shift-invariant. Suppose the input activity signal is:

$$x_{i}^{0} = \left[ ax_{1}, \ldots, ax_{N} \right]$$
(1)

where \(x_{i}^{0}\) may be represented by a matrix of size \(3 \times N\), in which N refers to the number of incorporated accelerometers. In the same manner, the input of the k-th convolutional layer consists of the feature maps \(Z_{i,j}\), as demonstrated in Fig. 1. Therefore, the component at location (i, j) of the k-th feature map in the l-th layer may be computed as:

Fig. 1 Illustration of the convolution operation with a kernel of size X × Y

$$z_{i,j}^{l,k} = \sigma \left( \sum_{k^{\prime} = 1}^{K^{\prime}} \sum_{x = 1}^{X} \sum_{y = 1}^{Y} w_{x,y,k^{\prime}}^{l-1,k}\, z_{i+x-1,\, j+y-1}^{l-1,k^{\prime}} + b^{l-1,k} \right)$$
(2)

where σ is the activation function, \(K^{\prime}\) is the number of feature maps in the \((l-1)\)-th layer, and X and Y denote the kernel size. Furthermore, w represents the weight matrix and b the bias.

Finally, all feature maps are mapped onto the target classes by a fully connected layer. For this purpose, a dense layer is used whose number of nodes equals the number of activity classes, together with the softmax function given below.

$$\text{Softmax}\left( z_{i} \right) = \frac{e^{z_{i}}}{\sum_{n=1}^{N} e^{z_{n}}}, \quad i = 1, \ldots, N$$
(3)
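To make the layer structure above concrete, the following is a minimal tf.keras sketch (not the authors' exact configuration) of a 1-D convolutional feature extractor followed by a softmax dense layer. The window length of 90 samples and the three accelerometer axes follow the data description given later in the paper, while the filter counts and kernel sizes are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal sketch of a CNN feature extractor with a softmax classifier.
# Input: one window of 90 accelerometer samples x 3 axes (x, y, z).
# Filter counts and kernel sizes are illustrative, not the paper's exact values.
NUM_CLASSES = 6      # walking, jogging, upstairs, downstairs, sitting, standing
WINDOW_SIZE = 90     # samples per segment (see the segmentation step in Sect. 3)

model = models.Sequential([
    layers.Input(shape=(WINDOW_SIZE, 3)),
    layers.Conv1D(64, kernel_size=5, activation='relu'),  # Eq. (2): local feature maps
    layers.Conv1D(64, kernel_size=5, activation='relu'),
    layers.GlobalAveragePooling1D(),                       # collapse the time axis
    layers.Dense(NUM_CLASSES, activation='softmax'),       # Eq. (3): class probabilities
])
model.summary()
```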

2.2 Long short-term memory

Feedforward networks treat all inputs and outputs as independent elements, which is not a valid assumption for time-sequence phenomena such as human activity signals. To overcome this limitation, recurrent neural networks (RNNs) are used, which have great potential to model temporal dependencies thanks to their recurrent unit, which serves as a memory. The main challenges of the classic RNN are the vanishing gradient [13] and its limited memory capacity [14], which hamper its ability to capture long-term temporal dependencies. This weakness may seriously limit the effectiveness of RNNs in modelling long time series of human activity signals.

Long short-term memory (LSTM) is a type of recurrent network designed to solve the vanishing gradient problem mentioned above [15]. In this network, a memory cell is used to store information instead of the plain recurrent unit. Memory cells are constructed and updated using three main gates: write (controlling input information), read (controlling output information), and reset (forgetting useless information) [15], as demonstrated in Fig. 2.

Fig. 2 LSTM cell including the write, read, and forget gates used to store information

The functionality of the LSTM shown in Fig. 2 may be described in detail by Eqs. (4)–(8).

$$i_{t} = \sigma_{i} \left( w_{zi} z_{t} + w_{hi} h_{t-1} + w_{ci} c_{t-1} + b_{i} \right)$$
(4)
$$f_{t} = \sigma_{f} \left( w_{zf} z_{t} + w_{hf} h_{t-1} + w_{cf} c_{t-1} + b_{f} \right)$$
(5)
$$c_{t} = f_{t} c_{t-1} + i_{t} \sigma_{c} \left( w_{zc} z_{t} + w_{hc} h_{t-1} + b_{c} \right)$$
(6)
$$o_{t} = \sigma_{o} \left( w_{zo} z_{t} + w_{ho} h_{t-1} + w_{co} c_{t} + b_{o} \right)$$
(7)
$$h_{t} = o_{t} \sigma_{h} \left( c_{t} \right)$$
(8)

In the above equations, i, f, o, and c represent the input gate, forget gate, output gate, and cell activation, respectively. A combination of CNN and LSTM may be used to model the local and temporal dependencies of long time-series signals [12].
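As a concrete illustration of Eqs. (4)–(8), the following NumPy sketch implements a single step of the peephole-style LSTM cell described above. The dimensions and randomly initialized weights are placeholders chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(z_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (4)-(8).

    z_t: input at time t; h_prev/c_prev: previous hidden and cell states.
    W and b hold the weight matrices and biases named in the equations.
    """
    i_t = sigmoid(W['zi'] @ z_t + W['hi'] @ h_prev + W['ci'] @ c_prev + b['i'])    # Eq. (4) write gate
    f_t = sigmoid(W['zf'] @ z_t + W['hf'] @ h_prev + W['cf'] @ c_prev + b['f'])    # Eq. (5) forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['zc'] @ z_t + W['hc'] @ h_prev + b['c'])  # Eq. (6) cell update
    o_t = sigmoid(W['zo'] @ z_t + W['ho'] @ h_prev + W['co'] @ c_t + b['o'])       # Eq. (7) read gate
    h_t = o_t * np.tanh(c_t)                                                       # Eq. (8) output
    return h_t, c_t

# Toy dimensions for illustration only
d_in, d_hid = 3, 8
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((d_hid, d_in if k.startswith('z') else d_hid)) * 0.1
     for k in ['zi', 'hi', 'ci', 'zf', 'hf', 'cf', 'zc', 'hc', 'zo', 'ho', 'co']}
b = {k: np.zeros(d_hid) for k in ['i', 'f', 'c', 'o']}
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.standard_normal(d_in), h, c, W, b)
```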

2.3 Batch normalization

Generally, a change in the input distribution may cause several problems in the learning process of deep neural networks [16]. Moreover, any variance present in the input of a layer manifests itself as more intense changes in the next layer, due to the nonlinear and deep structure of the CNN [17]. To reduce the impact of this unfavourable phenomenon, normalization is widely applied between successive layers [9, 18, 19]. In this research, a Batch Normalization (BN) function is applied to reduce the internal covariate shift between layers while increasing the learning speed [20], as illustrated in Eq. (9) and Fig. 3.

Fig. 3 Placement of BN in the CNN structure. Before each CNN layer, a BN layer is placed to reduce the internal covariate shift

$$\mathrm{BN}\left( z \right) = \gamma \frac{z - E\left\{ z \right\}}{\sqrt{\mathrm{Var}\left\{ z \right\} + \varepsilon}} + \beta$$
(9)

In the above equation, \(\gamma\) and \(\beta\) are learnable parameters and \(\varepsilon\) is a small constant that prevents division by zero.
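A direct NumPy rendering of Eq. (9) over a mini-batch is shown below; γ and β are the learnable scale and shift, and the value of ε used here is only a typical default, not taken from the paper.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Batch normalization as in Eq. (9), applied per feature over the batch axis."""
    mean = z.mean(axis=0)                  # E{z}
    var = z.var(axis=0)                    # Var{z}
    z_hat = (z - mean) / np.sqrt(var + eps)
    return gamma * z_hat + beta

# Example: a mini-batch of 32 feature vectors of length 64
z = np.random.randn(32, 64)
out = batch_norm(z, gamma=np.ones(64), beta=np.zeros(64))
```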

2.4 Residual network

Although increasing the depth of a CNN leads to a more closely fitted model between input and output [20], such a deep structure faces some limitations in its training procedure. One important problem is the vanishing/exploding gradient [19, 21], which occurs as more layers are stacked [22]. Degradation with increasing network depth causes the accuracy to saturate and then degrade rapidly. This problem indicates that the network may have difficulty approximating the identity mapping with its stacked nonlinear layers [22].

In this research, a residual network [22] is incorporated into the combined CNN and LSTM structure (i.e., ConvLSTM) to overcome the above-mentioned problem. This solution utilizes parameter-free connections (identity shortcuts) to connect the input of a layer to its output, as shown in Fig. 4.

Fig. 4 Residual network including parameter-free connections (identity shortcuts) that connect the input of a layer to its output

These shortcut connections help the mapping function play its role more effectively. Such direct input–output connections therefore enable the deep network to overcome the accuracy saturation and overfitting problems by skipping some layers. As shown in Fig. 4, the shortcut connections help the solver map the identity function more easily.
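The following tf.keras sketch illustrates one residual block of the kind shown in Fig. 4, with a parameter-free identity shortcut added to the output of two BN + convolution stages. The filter count and kernel size are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64, kernel_size=5):
    """Residual block: two BN + Conv1D stages with an identity shortcut (Fig. 4).

    Assumes the input already has `filters` channels, so the parameter-free
    identity shortcut can be added directly without a projection.
    """
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.Conv1D(filters, kernel_size, padding='same', activation='relu')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Conv1D(filters, kernel_size, padding='same')(y)
    y = layers.Add()([shortcut, y])        # identity shortcut skips the stacked layers
    return layers.Activation('relu')(y)

# Example usage on a (window, channels) input that already has 64 channels
inputs = tf.keras.Input(shape=(90, 64))
outputs = residual_block(inputs)
block = tf.keras.Model(inputs, outputs)
```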

The final structure of the proposed network is shown in Fig. 5, in which the convolutional layers extract local features from the 3-axis mobile accelerometer signal as feature maps. The LSTM layers are used to model the temporal dependency present in these feature maps. As shown in the figure, BN is applied between layers, in parallel with the feature extraction, to reduce the variance of each layer. Finally, the fully connected layer maps the result of the ConvLSTM onto six classes of activities, as shown at the right end of Fig. 5.

Fig. 5 The final structure of the proposed method: three CNN sections, each including two convolution layers and two BN layers. Residual shortcuts connect the output of the first convolution to the output of the last BN layer in each CNN section. Two LSTM layers are added to model temporal dependencies. Finally, a fully connected layer with a softmax function maps the features onto the desired activity classes
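Under the structure described in the Fig. 5 caption, a rough tf.keras rendering of the overall ConvLSTM + ResNet architecture might look as follows; the filter counts, kernel sizes, and LSTM widths are placeholders rather than the exact values reported in Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES, WINDOW_SIZE, AXES = 6, 90, 3

def cnn_section(x, filters=64, kernel_size=5):
    """One CNN section of Fig. 5: two convolution and two BN layers with a residual shortcut."""
    x = layers.Conv1D(filters, kernel_size, padding='same', activation='relu')(x)
    shortcut = x                                   # output of the first convolution
    y = layers.BatchNormalization()(x)
    y = layers.Conv1D(filters, kernel_size, padding='same', activation='relu')(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([shortcut, y])             # shortcut to the output of the last BN

inputs = layers.Input(shape=(WINDOW_SIZE, AXES))
x = inputs
for _ in range(3):                                 # three CNN sections
    x = cnn_section(x)
x = layers.LSTM(128, return_sequences=True)(x)     # two LSTM layers for temporal modelling
x = layers.LSTM(128)(x)
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)

model = models.Model(inputs, outputs)
```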

The complete procedure of the proposed method is summarized in the pseudocode of Fig. 6.

Fig. 6 Pseudocode of the proposed method

3 Results

The proposed method was applied to two datasets from the Wireless Sensor Data Mining (WISDM) lab [23, WISDM2], which include over one million and over fifteen million raw time-series samples, respectively. The first dataset was captured from the smartphone's 3-axis accelerometer of 36 volunteers, while the second contains accelerometer data from a smartphone and a smartwatch collected while 51 subjects performed 18 activities. The smartphone was fixed in the right pocket of each volunteer's pants to achieve maximum robustness in the recorded signals.

The first dataset includes six different activities, consisting of walking, jogging, upstairs, downstairs, standing, and sitting, each captured for about 10 min. From the second dataset, we chose eight activities recorded from the smartwatch's accelerometer: walking, jogging, stairs, standing, sitting, eating soup, eating a sandwich, and eating chips. For the first dataset, the volumes of data belonging to the activities are not equal, because some users did not perform some of the activities due to physical restrictions. Furthermore, some activities (i.e., sitting and standing) were limited to only a few minutes because the data were expected to remain almost constant over time. More information about these datasets may be found in Table 1.

Table 1 The details of two WISDM datasets
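As an illustration of how the raw accelerometer records might be read, the sketch below parses a WISDM-style text file in which each line holds a user id, activity label, timestamp, and the three acceleration components. The file name and exact field layout are assumptions that should be checked against the downloaded dataset.

```python
import pandas as pd

# Assumed file name and column layout for the raw WISDM v1.1 accelerometer file;
# verify both against the actual download before use.
COLUMNS = ['user', 'activity', 'timestamp', 'x', 'y', 'z']

def load_wisdm(path='WISDM_ar_v1.1_raw.txt'):
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip().rstrip(';')
            if not line:
                continue
            parts = line.split(',')
            if len(parts) == len(COLUMNS):
                rows.append(parts)
    df = pd.DataFrame(rows, columns=COLUMNS)
    df[['x', 'y', 'z']] = df[['x', 'y', 'z']].astype(float)
    return df

# df = load_wisdm()
# print(df['activity'].value_counts())
```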

Figures 7 and 8 show some sample signals from both datasets. These examples belong to the upstairs and downstairs activities of the first dataset and to two different eating activities of the second dataset, respectively. The figures clearly show that there is no significant difference between the two recorded signals belonging to the downstairs and upstairs activities of the first dataset, or between eating a sandwich and eating chips in the second one. This illustrates why recognizing these activities may be considered a challenge in the domain of human activity recognition.

Fig. 7 Two recorded signals belonging to a downstairs and b upstairs activities

Fig. 8 Two recorded signals belonging to a eating sandwich and b eating chips activities

However, for the eating soup activity, the signal is more distinguishable from those of the other two eating activities, as shown in Fig. 9.

Fig. 9 Recorded signal belonging to the eating soup activity

The proposed method was implemented in the TensorFlow framework, using the tensor processing unit (TPU) hardware available in Google Colaboratory. Furthermore, to compare with the proposed scheme, three deep learning-based structures were implemented: (a) the basic CNN algorithm, (b) the combination of CNN and LSTM, called ConvLSTM for brevity in the rest of the article, and (c) the ConvLSTM modified with the residual network (ResNet) concept, called ConvLSTM + ResNet for brevity in the rest of the article. The accuracy of each method was obtained on the same datasets and reported to evaluate the performance of the examined algorithms.
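For readers wishing to reproduce a similar setup, a possible TPU initialization in a Colab-style environment (TF 2.x API) is sketched below; this is not taken from the paper, and the code falls back to the default CPU/GPU strategy when no TPU is present.

```python
import tensorflow as tf

# Sketch of TPU initialization in a Colab-style environment (TF 2.x API).
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except (ValueError, tf.errors.NotFoundError):
    strategy = tf.distribute.get_strategy()   # fall back to CPU/GPU

with strategy.scope():
    # build and compile the model inside the strategy scope
    pass
```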

The first step of human activity recognition is data segmentation, and the most common approach is to use a sliding window. In this paper, a window size of 90 samples with a 50% overlap was used.
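A possible implementation of this segmentation step is sketched below: windows of 90 samples with a 50% overlap (a hop of 45 samples) are cut from the signal, and each window is labelled with the majority activity of its samples. The majority-label rule is an assumption, not stated in the paper.

```python
import numpy as np

def sliding_windows(signal, labels, window=90, overlap=0.5):
    """Segment a (num_samples, 3) accelerometer signal into overlapping windows.

    Each window is labelled with the majority activity among its samples
    (an assumed convention for illustration).
    """
    step = int(window * (1 - overlap))          # 45 samples for 50% overlap
    segments, segment_labels = [], []
    for start in range(0, len(signal) - window + 1, step):
        end = start + window
        segments.append(signal[start:end])
        values, counts = np.unique(labels[start:end], return_counts=True)
        segment_labels.append(values[np.argmax(counts)])
    return np.asarray(segments), np.asarray(segment_labels)

# Example: X has shape (num_windows, 90, 3), y holds one label per window
# X, y = sliding_windows(accel_xyz, activity_labels)
```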

In this work, three different structures were examined, and the best configuration (number of layers and hyperparameters) for each of them was determined and is shown in Table 2. The weights were initialized randomly for each training run, and the networks were trained with the stochastic gradient descent (SGD) optimizer [24] using a momentum of 0.9, an initial learning rate of 0.01, and a decay rate of 50% per 10 epochs. Table 2 lists the main parameters of the examined structures.
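The reported training configuration (SGD with momentum 0.9, an initial learning rate of 0.01, and a 50% decay every 10 epochs) could be expressed in tf.keras as follows; the batch size and number of epochs are placeholders not given in the text, and `model` refers to a network such as the one sketched after Fig. 5.

```python
import tensorflow as tf

# SGD with momentum 0.9 and an initial learning rate of 0.01, as reported.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Halve the learning rate every 10 epochs (50% decay per 10 epochs).
def step_decay(epoch, lr=None):
    return 0.01 * (0.5 ** (epoch // 10))

lr_schedule = tf.keras.callbacks.LearningRateScheduler(step_decay)

# `model` is assumed to be the network sketched earlier (Fig. 5 architecture).
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=50, batch_size=64, callbacks=[lr_schedule])
```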

Table 2 Description of the main parameters of the proposed method and its deep learning-based alternatives

First, the performance of the CNN was evaluated in distinguishing the activities of the two datasets. As demonstrated in Tables 3 and 4, this approach obtained entirely acceptable results for the non-challenging activities, whose recorded signals are not very similar to each other.

Table 3 Classification results on the first WISDM dataset using CNN
Table 4 Classification results on the second WISDM dataset using CNN

However, the performance of this network dropped when it was applied to recognize the challenging activities (i.e., downstairs and upstairs for the first dataset, and eating a sandwich and eating chips for the second dataset, which produce similar signals). For the first dataset, the accuracies were 85.53% and 86.99% for upstairs and downstairs, respectively, and for the second dataset, the accuracies were 58.58% and 50.22% for eating a sandwich and eating chips, respectively.

Tables 5 and 6 show the results obtained with the ConvLSTM scheme. These results demonstrate that modifying the CNN with long short-term memory marginally improves the accuracies for the aforementioned similar activities: for the first dataset, improvements of 4.03% and 2.02% were obtained for upstairs and downstairs, and, in a similar manner, for the second dataset the improvements were 2.36% and 8.06% for eating a sandwich and eating chips, respectively. Note that these improvements were not enough to make the results acceptable for these challenging activities. Furthermore, a comparison with Tables 3 and 4 shows that the accuracies for the other activities had no meaningful difference from the results obtained with the basic CNN (the differences were about 1%).

Table 5 Classification results on the first WISDM dataset using ConvLSTM
Table 6 Classification results on the second WISDM dataset using ConvLSTM

Finally, Tables 7 and 8 demonstrate that the proposed method significantly increased the obtained accuracies for the two challenging activities compared with the CNN. The results show that the residual network concept improves the recognition accuracies relative to the basic CNN for the first dataset by 9.16% and 6.96% for the upstairs and downstairs activities, respectively. The corresponding improvements over the ConvLSTM scheme were 5.13% and 4.94%. However, these results also show a tiny accuracy decrement for walking compared with the basic CNN. For the second dataset, the accuracies obtained for eating a sandwich and eating chips were improved by about 7.41% and 9.71% compared with the CNN and by 5.05% and 1.65% compared with the ConvLSTM.

Table 7 Classification results on the first WISDM dataset using ConvLSTM + ResNet
Table 8 Classification results on the second WISDM dataset using ConvLSTM + ResNet

4 Discussion

In the previous section, the superiority of the proposed algorithm over CNN-based schemes was investigated. The common aspect of all those algorithms is that they belong to the deep neural network family; hence, all of them extract features from raw data using their convolutional layers. In this section, the performance of the proposed algorithm is compared with feature-based classifiers as an alternative family to deep methods. To perform this comparison, five feature-based activity recognition methods were applied to the first WISDM dataset. The alternative algorithms include (a) a combination of hand-crafted features and a Random Forest classifier, called Basic features + RF for brevity in this article [23, 25], (b) principal component analysis (PCA) based on the empirical cumulative distribution function, called PCA + ECDF for brevity in this article [26, 27], (c) logistic regression [23, 28], (d) a decision tree algorithm used for classification, called the J48 algorithm in this article [23, 28], and finally (e) a multilayer perceptron [23]. The classification accuracy was calculated for the proposed method and all the above alternatives to compare their effectiveness. Table 9 shows the accuracies obtained for all examined methods. It shows that for the upstairs activity, the recognition accuracy of the proposed algorithm was 33.23%, 35.43%, 67.17%, 35.14%, and 28.02% better than the multilayer perceptron, J48, logistic regression, PCA + ECDF, and Basic features + RF methods, respectively. It also shows that for the downstairs activity, the recognition accuracy of the proposed algorithm was 49.65%, 38.47%, 81.69%, 54.35%, and 44.13% better than the above alternatives.

Table 9 Comparison of classification results on the first WISDM dataset using the proposed method and its feature-based alternatives

For the other three non-challenging activities (i.e., sitting, standing, and jogging), although the proposed algorithm recognized activities better overall than its alternatives, for most of the test items the performances of the examined methods were within an acceptable range, consistent with the results described for the deep learning methods (see Tables 3, 5, 7). However, it is important to note that even for these non-challenging activities, the superiority of the proposed scheme over the alternative methods reached up to 16.03% (e.g., the proposed method vs Basic features + RF for the sitting activity). Finally, for the walking activity, the results of the proposed scheme were 13.89%, 3.87%, 7.55%, and 5.77% higher than those obtained by the Basic features + RF, logistic regression, J48, and multilayer perceptron methods, respectively. However, for this activity, the result of the proposed method was lower than that of PCA + ECDF by 1.09%. The above findings confirm the results obtained with the deep learning-based methods, which showed that the proposed algorithm yields a large accuracy improvement in recognizing the challenging activities (i.e., downstairs and upstairs). For the other, non-challenging activities, although the proposed method showed a slight accuracy increase or decrease compared with the existing methods, its recognition accuracies still remained within the acceptable range.

5 Conclusion

In recent years, deep neural networks have been widely utilized for human activity detection, the most famous among them being the convolutional neural network (CNN). Despite the considerable potential of CNNs in recognizing human activities, such networks unfortunately face the accuracy saturation phenomenon, which hampers their performance in real-world applications. In this paper, a new structure was introduced to address this problem, based on a combination of long short-term memory (LSTM) and residual network structures. The performance of the proposed structure was evaluated on two real datasets. The first dataset contains recorded signals belonging to six human activities: walking, jogging, upstairs, downstairs, sitting, and standing. The second dataset contains walking, jogging, stairs, sitting, standing, eating soup, eating a sandwich, and eating chips. Two different scenarios were adopted to compare the performance of the proposed method with the two main categories of existing techniques. In the first scenario, the proposed method was compared with methods of its own family, all based on deep learning. The obtained results showed that, for the first dataset, the proposed scheme distinguished both downstairs and upstairs (the most challenging activities) almost 5% better than its closest deep-based alternative. For the second dataset, the improvements were about 5% and 1.65% for eating a sandwich and eating chips, respectively. On the other hand, the performances of the proposed method and its deep-based alternatives had no meaningful difference for the four other (i.e., non-challenging) activities of both datasets. The second scenario was dedicated to comparing the performance of the proposed structure with non-deep techniques. The results obtained in this scenario also indicated the superiority of the proposed method over five well-known non-deep techniques in recognizing the challenging activities of the first dataset: the proposed scheme distinguished downstairs and upstairs (the most challenging activities) almost 38% and 28% better than its closest feature-based alternative. Similar to the previous scenario, the performance of the proposed method and its alternatives had no meaningful difference, and both remained within an acceptable range, when examined on the other four non-challenging activities. Based on the above analyses, it may be concluded that the proposed structure has considerable potential to be used as a low-cost and non-invasive diagnostic modality implemented as mobile software.