1 Introduction

Representation of human activity can be categories through body motion and gesture [3], and determine the predict states of action. Time series data created by different sensors of wearable gadgets are capable to give data about the movement as well as posture [2]. Majorly frequency-based features and Statistical features are used to identify the activities from time-series data. Statistical features offer less computational time and complexity as compare to the frequency-based feature [6]. Data received from the accelerometer is in the form of tri-axis (x, y, z) and effective for recognition of activity. The accelerometer can be easily integrated with wearable devices and mobile devices. Tri-axis data of accelerometer is capable to capture the human action in the time domain. Healthcare is one of the major sectors, utilizing sensor data. Many of the healthcare based applications are using the wearable sensors data for prescribing the health recommendation [11]. The present generation of handy mobile devices like a fitness band, smartphone, smartwatches includes a verity of sensors like accelerometer, gyroscope, magnetometer, etc. and capable of analyzing human activity and behavior analysis. The time series is segmented using a sliding window of a fixed length and split each time series into equal segments. The task of action/ activity recognition involves the identification of activity from time-series data, which acts for a certain duration by a human. In early days’ handcraft features and machine learning approaches were used for recognition of activity but deep learning-based approaches are in the great demand among the researchers. Deep learning-based approaches automatically learn the required feature representation directly from the data. Other than high accuracy and decent generalization, deep learning models trained in an end-to-end fashion [15].

Major challenges to identify activity from sensor data is to differentiate the similar activities like walking and jogging, up-stairs and down-stairs. The traditional approach extracted the features and fed into the classifier but accuracy compromise if these features missed the representation. This is a tedious and laborious task that employs a practical method that is not guaranteed to be optimal thus deep learning based approach for tri-axis data of the accelerometer is talk over in this paper. Using deep learning approach featuring engineering process can be avoided. Model itself correlate the features and combine them. Three different deep learning algorithms, namely CNN, LSTM and ConvLSTM are discussed in this paper, which combine orientation invariant and consecutive point trajectory information as additional input along with the tri-axis data of accelerometer. Accuracy of the algorithms radically improve with these additional features. CNN is very powerful without memory which is required while processing the sequential data like time series so LSTM is better option in this case where it keep past input [33].

The rest of this paper is organized as follows: Sect. 2 present the motivation behind, in Sect. 3, we briefly review the literature based on time series data. Section 4, describe accelerometer data. Section 5, describe the proposed model. Experiment and result analysis discuss in Sect. 6 and finally, we conclude in Sect. 7.

2 Motivation

Activity recognition using sensor data is a high demanding research area because of its wide utility. We all are aware of the case of Asha Sahani in Mumbai whose Skelton was found in her flat. Many of such areas like healthcare system, monitoring the old people, monitoring the daily activities, sensor based mechanism is very useful however identification of activity would be a greater challenge for the automated system. An efficient mechanism with accurate result is the need of an hour.

3 Previous work

The task of activity recognition involves the identification of activity from time-series data that act for a certain duration. Deep learning approaches attract the researchers and many deep learning architectures have been proposed to exploit the time series data. Lee et al. [14] proposed a one-dimensional convolutional neural network in which they calculate the magnitude of the tri-axis data of accelerometer captured by their research team. They recorded three activities walking, running, and still. Shoaib et al. [25] proposed a fourth-dimension i.e. magnitude of tri-axis (x, y, z) data. They proved this four-dimension data as an input and used various classifiers to classify the data. Robert et al. [29] proposed a Multi-layer perceptron solution implemented on a mobile phone. They extracted five features from the data and input to a multilayer perceptron. Cesar et al. [28] proposed a hierarchical neural classifier for activity classification. They extracted the time-domain feature from the tri-axis data. Transferring knowledge among the models having different probability distributions was proposed by Abdullah et al. [9]. Random forest, decision tree, and transfer boost algorithms were used for the evaluation. Alvina et al. [1] built an application to trace the physical activates of the user. They collect dataset by their own and implemented the classifier in mobile applications. A convolution neural network-based approach was proposed by Charissa et al. [22]. They experimented on various combinations of feature map and the number of convolution layers. Jinyong [18] proposed a transfer learning model based on the convolution network. He trained the model on the WISDM dataset and utilize this learning to classify the activities of the UCI HAR dataset. Wenchao et al. [8] proposed an idea of transform to time series data into an activity image of time–frequency-spectral. They apply a convolution neural network to learn features and classify the activities. Wanmin et al. [32] combine accelerometer and gyroscope data to extract features and classify activities. Charissa [23] used the accelerometer and gyroscope tri-axial sensor data to perform 6-axes, 1D convolution for constructing a convolution neural network. A hybrid feature selection method was proposed by Wang et al. [31]. This method combined the traditional feature selection method filter and wrapper. The experimental results showed balances between recognition efficiency and accuracy. Yuwen et al. [4] proposed LSTM based feature extraction from the tri-axis data of accelerometer.

4 Accelerometer data

For this research work, we are using the accelerometer dataset used in [13] and the WISDM dataset [12]. In both the dataset, data was collected through an accelerometer. The accelerometer is used to measure the acceleration of an object. It is measured in meter per second square and sampling frequency in Hz. Typical accelerometers are made up of two or three axis-vector components. Most smartphones typically make use of three-axis models. These devices are very sensitive intended to measure even tiny variations in acceleration. The accelerometer in the wearable device delivers the XYZ coordinate values, which is then used to identify the situation and the acceleration of the device. The XYZ coordinate indicates the direction and position of the device at which acceleration occurred (Fig. 1).

5 Proposed methodology

Three different convolutions deep learning network model are proposed. These models use five dimensions, x, y, z, orientation invariant feature and consecutive point trajectory information as input. The result shows all three models work well as compare to the state of art methods. Data received from the accelerometer is delivered to the XYZ coordinate values. Tri-axis data is preprocessed and extract the additional external feature vector. Detail discussion about feature extraction is given below is Sect. 5.1.

5.1 Feature extraction

The Accelerometer data provides the XYZ coordinate values, which are used to measure the position and the acceleration of the device [5]. We add two more dimensions in this three-dimension data i.e. orientation invariant feature of the sensor and the consecutive point trajectory information between the two sensor positions. Orientation invariant feature minimizes the effect of change in orientation [25]. The motion between two consecutive points trajectory can be identify through displacement vector (Fig. 2), it maps the translation between two consecutive movements and boosts the learning process.

The orientation invariant feature can be achieved through the magnitude of the vector. The magnitude of a vector (||\(v\)||) is the length of the line segment that defines it. Magnitude can be obtained by:

$$||v||=\sqrt{{\mathrm{x}}^{2}+{y}^{2}+{z}^{2}}$$
(1)

consecutive point trajectory information can be determined by the displacement vector. The displacement vector describes the motion in the space. To establish a coordinate system and a convention for the axes, the coordinates x, y, and z to locate a particle at point P (x, y, z) in three dimensions. If the particle is moving, the variables x, y, and z are functions of time (t), therefore.

$$x=x(t), \; y=y(t), \; z=z(t)$$

The position vector from the origin to the point B is \(\overrightarrow {v\left( t \right)}\). In the unit vector notation \(\overrightarrow {v\left( t \right)}\) will be-

$$\overrightarrow{v\left(t\right)}=x\left(t\right)\widetilde{i}+y\left(t\right)\widehat{j}+z\left(t\right)$$
(2)

With definition of the position of a particle in three-dimensional space, we the three-dimensional displacement can formulate. Figure shows a particle at time t1 located at position A with position vector \(\overrightarrow{v}\) 1 → (t1). At a later time t2, the particle is located at B with position vector \(\overrightarrow{v}\) 2 → (t2).

The displacement vector \(\Delta v\) is found by

$$\overrightarrow{\Delta v}=\overrightarrow{v}\left({t}_{1}\right)-\overrightarrow{v}\left({t}_{2}\right)$$
(3)

The magnitude of the displacement can be calculated as

$$||\Delta v||=\sqrt{({\mathrm{x}-\text{x}(\mathrm{t}))}^{2}+({\text{y}-\mathrm{y}(\text{t}))}^{2}+({\mathrm{z}-\text{z}(\mathrm{t}))}^{2}}$$
(4)

So now the input data is of five-dimension x, y, z, magnitude, and the magnitude of the displacement vector between two consecutive sensor values (Figs. 3 and 4).

Window size (frame size) plays an important role sometimes not choosing the correct window size may give a penalty and reduce the accuracy of the data. Dataset [13] is captured with the frequency rate 10 Hz and WISDM dataset captured with a frequency rate of 20 Hz, therefore we choose window size of 4 s during the experiment and same for the WISDM dataset [12] as well, also we choose overlapping of one fourth the size of the window.

5.2 CNN model

In this Convolution deep learning model, the data received from the accelerometer are first divided into time-series segments which are the same size as the window frame. The size of the window frame we selected was 4 s. A hop size with 25% overlapping was considered for the experiment. Initially, the tri-axis data of accelerometer is preprocessed and finds the value of magnitude ||\(v\)|| and the Consecutive point trajectory information ||\(\Delta v\)||. Now the whole five-dimensional data provide as an input to the convolution neural network. A batch of segments, each segment sized 1 × 20, was stored in a 5D tensor. These segments are divided into 80% training data and 20% testing data. With these training data, the deep model is trained with a learning rate of 0.001. The trained model was tested with the 20% test data in each epoch. A checkpoint was created in which the model was saved if the performance improved with each epoch in the validation loss. The CNN model contains a total of five convolutional layers with dropout layers for regularization with a probability constant of 0.2 and a dense layer.

Each convolution layer of model X is trained in a similar way and the individual unit is shown as Fig. 5. In the lth layer’s training there is an input x[l−1] with channel cl-1, i.e. cl-1 feature maps in 2-D arrays of \({x}_{1}^{l-1}\) ………….. \({x}_{cl-1}^{l-1}\). The representation of mth feature map is [10]

$$Z_m^{[l]} = \sum\limits_{i = 1}^{{c_{l - 1}}} {X{{_i^{l - 1}}}} * w_{im}^{[l]} + b_m^i$$
(5)

where m = 1, 2,….cl, \(w_{im}^{[l]}\) denote the mth kernel, \(b_m^i\) represent the bias of lth layer and * represent 2-D convolution operation.

Fig. 1
figure 1

Tri axis of accelerometer [5]

Fig. 2
figure 2

Displacement vector

Fig. 3
figure 3

Signal representation of dataset [13] with addition features magnitude and displacement vector

Fig. 4
figure 4

Signal representation of WISDOM dataset with addition features magnitude and displacement vector

Fig. 5
figure 5

Architecture CNN model

Fig. 6
figure 6

Architecture LSTM model

Among the nonlinear activation function ReLU activation function is used in the model to transform output data [17] and softmax for the classification of the activity class. At the end of the training, we had the latest optimized model that shows high performance in the dataset, which is then used for testing against the testing data.

5.3 LSTM model

The initial phase of data preprocessing is same as we did in the CNN model. A batch of segments, each segment was stored in a 5D tensor. These segments are divided into 80% training data and 20% testing data. With these training data, the deep model is trained with a learning rate of 0.001. The trained model was tested with the 20% test data in each epoch.

The LSTM memory is also known as “gated” cell, where the word gate means the ability to make the decision of avoiding or maintaining the memory information. Important features from the input are preserved by an LSTM model over a long period of time. An LSTM has three of these gates, to protect and control the cell state [7].

A deep network was built by stacking multiple layers of LSTM memory units with 2 fully connected layers (Fig. 6). Sparse categorical crossentropy is applied to measure the loss. The major formulation used in LSTM obtained from [7] are mention in Eq. 6.

$$f_t=\sigma\left(w_f*\left[h_{t-1,}X_t\right]+b_f\right)$$
$${I}_{t}=\sigma ({w}_{I}*\left[{h}_{t-1,}{X}_{t}\right]+{b}_{I})$$
$${C}_{t}=tanh({W}_{C}*\left[{h}_{t-1,}{X}_{t}\right]+{b}_{C})$$
$${C}_{t}={f}_{t} *{C}_{t-1}+{I}_{t} * {C}_{t}$$
$${O}_{t}=\sigma ({w}_{O}*\left[{h}_{t-1,}{X}_{t}\right]+{b}_{O})$$
$${h}_{t}={O}_{t} * \text{tanh}({C}_{t})$$

where \({f}_{t}, {I}_{t}, {O}_{t}\) are forget, new cell and output gates. Xt is input and ht is hidden unit.

5.4 ConvLSTM model

The tri-axis data of accelerometer is preprocessed and finds the value of magnitude ||\(v\)|| and the magnitude of displacement vector ||\(\Delta v\)||. Now the whole five-dimensional data provide as an input to the convolution neural network. A batch of segments, each segment was stored in a 5D tensor. ConvLSTM model was first introduced by Satya et al. [26]. The above picture shown in Fig. 7 describes how a general ConvLSTM model work. The CovnLSTM has nice properties for activity recognition as convolution operator may handle local spatial features the LSTM part manage temporal correlation in the sensor data. As discuss in the Sect. 5.3, An LSTM has three of these gates, to protect and control the cell state. The model consists three convolution layer and the output of third convolution layer flatten and input to the LSTM (Fig. 7). Timedistribution function is used to manage the dimensions of the input layer.

Fig. 7
figure 7

Architecture ConvLSTM model

Fig. 8
figure 8

Activity Dataset [13] Activity type and location information

Fig. 9
figure 9

WISDOM dataset Activity type

5.5 Dataset

We test our model in two datasets. The first dataset is used in [13] and the author makes it publically available (Fig. 8). The dataset was collected from two volunteers who performed activities using a smartwatch attached to the wrist of their dominant hand for four weeks. This data was captured with a sampling rate of 10 Hz. The dataset contains activities majorly at three locations office, kitchen, and outdoor. It consists of 11 different activates like office work, reading, writing, cooking, walking, running, etc. Another dataset used for the experiment is the WISDOM [12] dataset, which is publically available and can be download from http://www.cis.fordham.edu/wisdm/. WISDM dataset contains the tri-axis accelerometer data along with the information of time and user performed (Fig. 9). The dataset contains a total of 1,098,207 examples with six classes named walking, jogging, upstairs, downstairs, sitting, and standing. WISDM dataset is highly imbalance therefore during the preprocessing we balanced the dataset. WISDM dataset was captured with a sample rate of 20 Hz.

6 Experiment and result analysis

This section discussed the result of the proposed model. We use two datasets to compared with the-state-of-art methods for human activity recognition: activity dataset [13] and WISDM dataset [12]. We also compare our method with state of art methods [4, 9, 12, 13, 28, 29]. The proposed model is trained and tested on these two datasets for activity recognition. Adding features orientation invariant and consecutive point trajectory information as additional input along with the tri-axis data of accelerometer provide strong support to the proposed approach compare to the other methods.

Tables 1 and 2 shown the efficiency of proposed models on activity dataset. Proposed model 5D-CNN achieve 93.04% accuracy on activity dataset, LSTM model achieve 94.63% accuracy, LSTM without information of the location achieve 96.38% and 5D-ConvLSTM model with location achieve.

Table 1 Activity recognition accuracy over the dataset-1
Table 2 Result of activity dataset

98.04% validation accuracy. Figures 101112 and 13 show the accuracy graph and confusion matrix of the CNN, LSTM, ConvLSTM and ConvLSTM with location model respectively.

Fig. 10
figure 10

Accuracy chart and Confusion matrix of 5D- CNN Model for Activity Dataset

Fig. 11
figure 11

Accuracy chart and Confusion matrix of LSTM Model for Activity Dataset

Fig. 12
figure 12

Accuracy chart and Confusion matrix of ConvLSTM Model for Activity Dataset

Fig. 13
figure 13

Accuracy chart and Confusion matrix of ConvLSTM Model for Activity Dataset with location input

All four models are tests with different hyperparameters like number of epochs, number of hidden layers, learning rate, loss function, optimizer, activation function etc. Although we keep learning rate and loss function function fixed for all models. Sparse categorical crossentropy loss function is used for for all models and keep learning rate 0.001.Various possible combinations of epoch, hidden layers and optimizer analyze during the experiment.

Table 3 represents the comparison of proposed models with the approach used in [13]. we find that accuracy is improved using our proposed model and believed that it can be further improved by using more fusion strategy. Table 3 shows that CNN model is ~ 3% better as compared to [13] and LSTM is ~ 4% better whereas ConvLSTM model batter ~ 6% from the state of art method. We also analyze ConvLSTM model with adding additional input location and found that proposed ConvLSTM model is ~ 3% batter with location input.

Table 3 Comparison of proposed approach with the state of art method
Table 4 Activity recognition accuracy over the WISDM dataset

By analyzing the confusion matrices shown in Figs. 101112 and 13, a remarkable performance can be observed in recognition of every activity. Confusion matrices of ConvLSTM model show that walking and writing achieved 100% accuracy whereas while adding the location as input than eating, walking and taking transport achieve 100% accuracy.

Various parameters are tuned for achieving the optimum result. Five convolution layer are used in the CNN model with a learning rate of 0.001 and using Adam optimizer. Whereas the various possible combination of the layers, epoch and optimizer experimented and optimum result received with 40 epochs,128 layers in LSTM and Adam optimizer, whereas 3 convolution layer and 128 LSTM layer are applied for the ConvLSTM model. The Similar combination uses while adding location input.

Tables 4 and 5 shown the evaluation of proposed approach using CNN, LSTM and ConvLSTM model on the WISDM dataset. CNN model achieves 96.52%, LSTM model achieve 97.00% accuracy and.

Table 5 Result on WISDM Dataset

ConvLSTM model achieved 98.41% accuracy on the WISDM dataset. Figures 14, 15 and 16 show the accuracy graph of all three models for training and testing data under various epochs as well as confusion matrix.

Fig. 14
figure 14

Accuracy chart and Confusion matrix of 5D- CNN Model for WISDOM Dataset

Fig. 15
figure 15

Accuracy chart and Confusion matrix of LSTM Model for WISDOM dataset

Fig. 16
figure 16

Accuracy chart and Confusion matrix of ConvLSTM Model

By analyzing the confusion matrices shown in Figs. 14, 15 and 16, a remarkable performance can be observed in the recognition of every activity. Confusion matrices show that standing is classified with 100% accuracy by all three models, all three models discriminate walking and running with grate accuracy. Models are a bit confused with upstairs and downstairs. Table 5 shows individual activity recognition accuracy achieved by model-1 and model-2.

Table 6 compare the result of proposed models with the state of art methods. we find that accuracy is improved using our proposed approach. CNN model is ~ 2% better as compared to [30] and ~ 4% from [24] and LSTM model is ~ 3% better then [30] and ~ 4% then [24] whereas ConvLSTM model achieved highest accuracy among the all three proposed models and ~ 4% better as compared to [30].

Table 6 Comparison of proposed approach with the state of art method used WISDM dataset

Table 7 represent two types of errors: a false positive or a false negative. A false positive indicates the false alarm or a type I error and misclassification represent as false negative a type II error. Various performance metrics can be calculated through a confusion matrix.

Table 7 Structure of confusion matrix
Table 8 Precision, Recall and F Score and Matthews Correlation Coefficient – Activity Dataset
Table 9 Precision, Recall and F Score and Matthews Correlation Coefficient– WISDOM Dataset

The metric commonly used are accuracy, precision recall and F-score [21]. Precision indicate the proportion of positive identification was actually correct whereas the recall is the ratio of correctly detected positive instances to the total number of positive instances. In proposed model we can think of this as representing the correct classification of the activities (Tables 8 and 9).

The F- Score is used to measure the test accuracy and balance the use of precision and recall and can be calculate via harmonic mean of precision and recall. Mathematical formulation of precision, recall and F- score and MCC are represented by Eqs. 4,5,6 and 7 respectively [16, 27].

$$\mathrm{Precision}=\frac{\text{TP}}{(\mathrm{TP}+\text{FP})}$$
(6)
$$\mathrm{Recall}=\frac{\text{TP}}{(\mathrm{TP}+\text{FN})}$$
(7)
$$\mathrm{F}-\text{Score}=\frac{2\mathrm{ X Precision X Recall}}{(\text{Precision}+\mathrm{Recall})}\times 100$$
(8)
$$MCC=\frac{TP \; X \; TN-FP \; X \; FN}{\sqrt{[TP+FP][TP+FN][TN+FP][TN+FN]}}$$
(9)

where MCC—> Matthews correlation coefficient

TP—> True positive instances

FP- > False Positive instances

TN- > True Negative Instances

7 Conclusion

This article presented the recognition of human activity from accelerometer data. In this work, two additional features, orientation invariant and consecutive point trajectory information are added along with the tri-axis data. Three deep learning architectures CNN, LSTM and ConvLSTM were used for experiment and tested on two different datasets one used in [13] and another one is publically available dataset i.e. WISDM dataset and achieved remarkable results. Also, compared the proposed models with the result of the state of art methods [4, 12, 19, 20, 24, 30] tested on mentioned datasets and found that the proposed models perform well compared to other methods. Results specified that the proposed network can distinguish similar actions with different velocity. In future some more external features can be combined along with tri-axis data and test more complex activity.