1 Introduction

Human activity recognition (HAR) is a rapidly expanding area of research in the field of pervasive computing. It refers to the process of automatically recognizing human activities and has gained importance in recent years owing to its wide range of health-related applications, including ambient assisted living, personal fitness monitoring, child and elderly support, chronic health care, rehabilitation and fall detection. Many approaches have been proposed in the literature for human activity recognition. These approaches can be broadly grouped into three categories: vision-based approaches, environmental sensor-based approaches and wearable sensor-based approaches.

Computer vision-based activity recognition [11, 21, 35, 46] involves recognizing activities from videos captured by cameras under well-controlled laboratory settings. However, these methods fail to produce reliable results in home settings due to issues [1, 36] such as clutter, occlusion, variable illumination and shadows. In addition, these methods require cameras to be fixed at predetermined locations, which limits the coverage area unless a large number of cameras is deployed. Privacy concerns are another major drawback that hinders the deployment of vision-based techniques for action recognition.

Environmental sensor-based approaches recognize human activities based on the interaction between subjects and environmental sensors such as RFID tags [39] and infrared-based motion sensors [26]. This approach is used to recognize daily activities like washing, eating, sleeping, lying and sitting. However, the main drawback of this scheme is that it is confined to indoor scenarios [30].

Unlike vision-based and environmental sensors, wearable sensor-based techniques provide robust action recognition in both indoor and outdoor environments. Owing to their flexibility and miniature size, these sensors can easily be worn by individuals on various parts of the body such as the arm, waist and legs. In addition, individuals can carry more than one sensor device at a time [14]. All these benefits have led to extensive research in the field of action recognition using wearable sensors. Among these, smartphone-based action recognition schemes are currently the most popular: smartphones render powerful context-based information, have wireless communication capabilities, and possess various built-in sensors including an accelerometer, gyroscope, magnetometer, orientation sensor, barometer, proximity sensor, camera, Bluetooth and Wi-Fi modules.

Recently, extensive research has been carried out on high-dimensional sparse signals. A signal is said to be sparse if it can be represented as a linear combination of relatively few base elements in a basis or an overcomplete dictionary [7]. In fact, most real-world signals have an embedded sparsity property. A detailed comparative study of various sparse representation based algorithms and their applications was presented in [50], where sparse representation was categorized into five groups based on the norms used for optimization: minimization using the l0-norm, lp-norm, l1-norm, l2,1-norm and l2-norm. In l0-norm minimization, the optimization problem is framed such that the sparsity, i.e. the number of non-zero coefficients in the sparse vector, is minimized. However, this problem is NP-hard and therefore difficult to solve exactly. This difficulty is overcome by l1-norm minimization, which is a convex problem that can be solved efficiently. The l1-norm refers to the sum of the absolute values of all the coefficients in the sparse coefficient vector. For lp-norm minimization, the value of p is varied between 0 and 1; in particular, the values investigated were p = 0.1, 1/2, 1/3 and 0.9. The l2-norm, also called the Euclidean norm, is calculated as the square root of the sum of squares of all the elements in the sparse coefficient vector. The lp-norm was found to be non-convex and non-smooth, the l1-norm convex, non-smooth and globally non-differentiable, and the l2-norm convex, smooth and globally differentiable. However, l2-norm minimization was found to be only "limitedly sparse" rather than strictly sparse; nevertheless, it was shown to possess good discriminability.

Although the primary aim of exploiting the sparsity of a signal is compression and reconstruction, its discriminative capability has been analysed and widely used in many machine learning applications, including but not limited to image fusion [29], object tracking [42], face recognition [10], human activity recognition [20] and human emotion recognition [51]. Inspired by its applicability across such diverse domains, in this paper we exploit the sparsity of human activity based inertial signals acquired from wearable sensors and propose a sparse representation based action recognition framework. The paper explores a novel methodology based on sparse theory to incorporate and fuse inertial data from sensors such as the accelerometer, gyroscope, magnetometer and orientation sensor to improve the accuracy and reliability of smartphone-based action recognition schemes.
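
As a small illustration of the norms discussed above (a minimal sketch of our own with a hypothetical coefficient vector, not taken from [50]), the following Python snippet computes the l0, l1, l2 and lp measures of a sparse coefficient vector:

```python
import numpy as np

# A toy sparse coefficient vector: only 3 of its 10 entries are non-zero.
alpha = np.array([0.0, 1.2, 0.0, 0.0, -0.7, 0.0, 0.0, 0.0, 2.5, 0.0])

l0 = np.count_nonzero(alpha)                  # l0 "norm": number of non-zero coefficients
l1 = np.sum(np.abs(alpha))                    # l1-norm: sum of absolute values
l2 = np.sqrt(np.sum(alpha ** 2))              # l2-norm (Euclidean): root of the squared sum
p = 0.5
lp = np.sum(np.abs(alpha) ** p) ** (1.0 / p)  # lp quasi-norm for 0 < p < 1

print(f"l0 = {l0}, l1 = {l1:.3f}, l2 = {l2:.3f}, l{p} = {lp:.3f}")
```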

The remainder of this paper is organized as follows. Section 2 surveys state-of-the-art works related to human action recognition. Section 3 provides a detailed description of the proposed framework covering data acquisition, feature extraction and the proposed classification scheme. Section 4 presents the quantitative evaluation of the proposed system. The conclusion of the paper is provided in Section 5.

2 Related works

With recent advances in pervasive computing, sensor network technology can facilitate various fields ranging from healthcare and ambient assisted living to security and surveillance. In particular, researchers have developed various action recognition systems using wearable sensors.

Bao et al. [6] proposed an action recognition framework that utilized five biaxial accelerometers. Features like mean, correlation, energy and frequency-domain entropy were extracted from the acquired accelerometer data. Several classifiers were evaluated, and it was found that decision tree classifiers produced the best performance. The authors of [18] proposed an activity recognition scheme using a tri-axial accelerometer embedded in smartphones. Using the concept of principal component analysis, the fundamental period of the accelerometer data was extracted from the phase trajectory matrix. The classification was done using k-nearest neighbor (k-NN) and neural networks. An accelerometer-based activity recognition system using an ensemble of classifiers was implemented in [9]. This approach combined three classifiers, namely the J48 decision tree, Multi-Layer Perceptrons and Logistic Regression, using a combination rule based on the average of probabilities. Liu et al. [24] presented an automated action recognition system using temporal patterns extracted from different actions. A joint pattern feature space was constructed using the extracted patterns and was used for classification. Varkey et al. [37] presented a window-based algorithm for recognizing activities and fine movements within each activity. Features like mean, standard deviation, peak-to-peak, root mean square, maximum and correlation between axes were extracted from accelerometer and gyroscope data. This system utilized a supervised learning approach based on the support vector machine (SVM) for classification. Zhang and Sawchuk [49] proposed a new framework for activity recognition based on sparse theory and compressive sensing using wearable sensors. Further, it was shown that feature extraction based on random projections achieved the best recognition performance. This system achieved a maximum accuracy of about 96.1% over nine types of activities: forward walking, left side walking, right side walking, upstairs, downstairs, jumping, running, standing and sitting. Fuentes et al. [13] introduced an online motion recognition system using accelerometers embedded in smartphones. The raw accelerometer data was converted into statistical features, which were classified using an SVM classifier. This system produced an overall accuracy of about 93% for recognizing activities like stopping, walking, standing-up and sitting-down. In [47], Yin et al. proposed a high-performance training-free approach for recognizing hand gestures using an accelerometer sensor. A robust template matching technique based on dynamic programming was presented and used in the gesture recognition process.

The authors of [45] proposed a distributed recognition scheme for classifying human actions using wearable sensors. In this scheme, each action class was modeled as a subspace in a mixture subspace model. Using the training data, the sparsest linear representation of the test data was computed, and it was demonstrated that the test class corresponds to the one that produces the dominant coefficients. It was also shown that the proposed system had good sensor energy saving capability. A novel ontology-based sensor selection scheme for wearable action recognition was presented in [38]. The wearable sensors comprised magnetic and inertial measurement units. For any given non-recoverable sensor, a new technique for automatically selecting a suitable replacement was also proposed. This framework was based on a set of heuristic rules used to find candidate sensors for replacement, and the appropriate sensor was finally selected based on iteratively posed queries. Positioning sensors at appropriate locations is important for accurate action recognition. Hence, an investigation of wearable sensor placement on various parts of the body was conducted in [5]. In addition, the most discriminating time-frequency features for effectively classifying different activities were also analyzed in that paper. It was observed that for high-level activities the knee, ear and arm were the optimal body locations for sensor placement, while for transitional activities the chest and knee locations were selected.

A number of score-level fusion schemes have been proposed in the literature for solving classification problems. These schemes attain high accuracy since they allow multiple scores to be integrated in an efficient manner. A weighted fusion scheme for combining various matching scores for face and palmprint recognition was proposed in [44]. This system utilized two different biometric traits, and the proposed score fusion scheme comprised four steps. In the first two steps, the matching scores between the training and testing samples of the two traits were calculated. In the third step, a cross-matching score between the training sample of the first trait and the testing sample of the second trait was computed. In the final step, the three matching scores were normalized and combined using weighted coefficients. An adaptive weighted fusion scheme for classifying images was proposed in [43]. In that work, the optimal weights for fusing various features were determined automatically without any manual settings. Initially, features were extracted from different classes of images. Distance-based scores were then computed between the test sample and all the training samples. These distances were then normalized and sorted. Finally, the scores were fused using a weighted fusion technique in which the weights were adaptively chosen based on the confidence of the scores.

The extensive prevalence of smartphones in today's world has greatly accelerated research in the field of action recognition using smartphones. Smartphones are embedded with various sensors, of which the accelerometer, gyroscope, magnetometer and orientation sensors offer rich and valuable location- and movement-based information, from which the actions performed by individuals can be readily analyzed and discerned. Further, the use of smartphones in action recognition causes little intrusiveness, as they remove the need for additional sensing components to acquire the sensor data. Owing to these benefits, smartphone-based action recognition systems have paved the way for a host of innovative applications including but not limited to fall detection [12], health monitoring [27], classification of construction worker activities [2], personal authentication [19], elderly safety [3], physical activity recognition [25], sporting activities classification [28], context-aware recommendations [31] and emotion recognition [52].

The motivation for investigating action recognition using the built-in sensors of smartphones is the diversity of applications mentioned above. To further improve the recognition accuracy achievable with such built-in sensors, we propose a novel action recognition scheme in this paper. Many studies in the literature have explored the fusion of various sensors like the accelerometer, gyroscope and magnetometer.

Wang et al. explored the use of smartphone-embedded inertial sensors in action recognition. They showed that fusing accelerometer and gyroscope data produces better recognition performance than using data from a single sensor. In addition, a novel feature selection approach was proposed for dimensionality reduction and to simultaneously increase the recognition rate [40]. Huynh et al. [17] proposed a threshold-based fall detection algorithm using a wireless wearable sensor system comprising a tri-axial accelerometer and a tri-axial gyroscope. It was shown that adding gyroscope information increased the overall sensitivity of the system, since it provides information related to angular velocity changes. Lee and Mase presented an activity and location recognition system consisting of a bi-axial accelerometer, a gyroscope and a digital compass. This system could determine the location of the user, detect transitions and classify activities like sitting, walking and standing [23]. Yun et al. proposed a foot motion filtering algorithm for estimating foot kinematics during normal walking [48]. This system was built using input from a tri-axial accelerometer, a tri-axial angular rate sensor and a tri-axial magnetometer. The proposed algorithm estimated foot kinematics parameters like foot orientation, acceleration, position, velocity and gait phase. In addition, an adaptive-gain complementary filter was used for accurate estimation of foot orientation. Ronao and Cho [33] proposed a human activity recognition framework using deep convolutional neural networks (convnets). This system used data from the accelerometer and gyroscope sensors embedded in smartphones, and it was shown that convnets can automatically and adaptively extract robust and relevant features from the sensor data. Altun et al. [4] presented a comparative study of various classification techniques for classifying human activities using wearable inertial and magnetic sensors. Five sensor units were worn on the chest, the legs and the arms, and feature extraction was performed using principal component analysis. It was inferred that the Bayesian decision-making classifier performed better than other classifiers like the decision tree (DT), k-NN, least-squares method, SVM and artificial neural networks. Gravina et al. [15] provided a systematic and comprehensive review of different levels of multi-sensor data fusion in body sensor networks, analyzing data-level, feature-level and decision-level fusion in detail.

The main challenge in using different inertial sensors for action recognition lies in formulating a suitable fusion rule that best incorporates the information from all sensors and ultimately improves the classification accuracy. Thus, the main goal of this paper is to present a novel sparse representation based action recognition scheme that fuses data from various built-in sensors of smartphones to classify human activities, and to compare its performance with various standard machine learning classification algorithms.

3 Methodology

The complete overview of the smartphone-based action recognition scheme is shown in Fig. 1. It comprises two stages, namely the training and testing stages. Data from the four inertial sensors of the mobile phone, namely the accelerometer, gyroscope, magnetometer and orientation sensor, were acquired. From the acquired data, time-domain and frequency-domain features were extracted. These two steps are performed in both the training and testing stages. Two types of dictionaries, i.e. concatenated and class-specific dictionaries, are generated from the features extracted during the training stage. During testing, the features from the test data, along with the two dictionaries generated during training, are used to classify the test activity using the proposed classification algorithm.

Fig. 1 Overview of action recognition scheme using smartphone

3.1 Data acquisition

The smartphone used for data acquisition was a Moto M running Android (version 6.0.1). This device has a wide range of sensors including a tri-axial accelerometer, a tri-axial gyroscope and a magnetometer; in addition, it provides orientation-based measurements. The accelerometer returns the acceleration force applied to the device, measured in meters per second squared (m/s²); the effect of gravity is included in this signal. The gyroscope measures angular velocity in radians per second (rad/s), i.e. the rate of rotation of the device around each axis. The magnetometer measures the magnetic field along three perpendicular axes in microtesla (μT). The orientation sensor gives three rotation angles of the device, namely roll, pitch and azimuth, with respect to each axis; the calculation of these rotation angles utilizes the accelerometer, magnetometer and gyroscope. Thus, four types of sensor data, each having three components, were acquired.

Data was sampled at a rate of 64 Hz. This sampling rate was chosen because it is well above the minimum rate of 20 Hz required for human action recognition. Fifteen healthy subjects (7 males and 8 females) with age 29 ± 4.5 years, height 5.5 ± 0.57 ft and weight 64 ± 5 kg (mean ± standard deviation) were involved in data collection. The subjects were asked to wear a belt-type mobile pouch so that the mobile phone rested at the right waist, as shown in Fig. 2.

Fig. 2 Data acquisition system

Each participant was asked to perform a total of 8 different daily activities: sit, stand, lie-down, walk, jog, jump, upstairs and downstairs. Each activity was performed three times, each for a duration of about 30 s. Thus, the total recording time over all activities and subjects was about 3 h. Prior to feature extraction, each recording was segmented using a non-overlapping window with a length of 2 s.
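
As an illustration of this segmentation step (a minimal sketch; the array layout and the function name `segment` are our assumptions, not the authors' code), a 2-s non-overlapping window at 64 Hz contains 128 samples:

```python
import numpy as np

FS = 64                   # sampling rate (Hz)
WIN_SEC = 2               # window length (s)
WIN_LEN = FS * WIN_SEC    # 128 samples per window

def segment(signal, win_len=WIN_LEN):
    """Split a (num_samples, 3) tri-axial recording into non-overlapping windows.

    Returns an array of shape (num_windows, win_len, 3); trailing samples that
    do not fill a complete window are discarded.
    """
    num_windows = signal.shape[0] // win_len
    return signal[:num_windows * win_len].reshape(num_windows, win_len, signal.shape[1])

# Example: 30 s of (synthetic) tri-axial accelerometer data.
acc = np.random.randn(30 * FS, 3)
windows = segment(acc)
print(windows.shape)      # (15, 128, 3)
```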

3.2 Feature extraction

The extracted features comprised time-domain and frequency-domain features. Time-domain features are widely used to generate distinctive features from sensor data that effectively represent human activities. In our work, 9 popular time-domain features were employed: statistical features such as the mean, standard deviation, first quartile, second quartile, third quartile and the pairwise correlation between the three axes [32], together with the root mean square (RMS), interquartile range (IQR) and zero crossing rate (ZCR), which were included to boost the classification performance [49]. The first, second and third quartiles refer to the 25th percentile, the median and the 75th percentile, respectively. Each feature was computed for every component of the sensor data (with the correlation computed for each pair of axes), so the total number of time-domain features extracted from a 2-s segmented window was 27.
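
The sketch below illustrates one plausible implementation of these time-domain features for a single tri-axial window (our own illustration; the exact formulas used by the original implementation are not specified):

```python
import numpy as np

def time_domain_features(window):
    """Compute 27 time-domain features from a (win_len, 3) tri-axial window."""
    feats = []
    for axis in range(3):
        x = window[:, axis]
        q1, q2, q3 = np.percentile(x, [25, 50, 75])
        zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)   # zero crossing rate
        feats += [x.mean(), x.std(), q1, q2, q3,
                  np.sqrt(np.mean(x ** 2)),              # RMS
                  q3 - q1,                               # IQR
                  zcr]
    # Pairwise correlation between the three axes (x-y, x-z, y-z).
    for a, b in [(0, 1), (0, 2), (1, 2)]:
        feats.append(np.corrcoef(window[:, a], window[:, b])[0, 1])
    return np.array(feats)   # 8 per-axis features x 3 axes + 3 correlations = 27
```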

Frequency-domain features were extracted by performing a fast Fourier transform (FFT) on each window. The FFT allows a time-domain signal to be analyzed in the frequency domain, exploiting the fact that a signal can be decomposed into a sum of weighted sinusoidal functions. A total of eight features were extracted, namely the dominant frequency [49], spectral energy, entropy [6] and the magnitudes of the first five components of the FFT spectrum [32]. The dominant frequency is the frequency with the highest peak in the spectrum. Each of these features was derived for each component of the sensor data, so the total number of frequency-domain features extracted from a 2-s window was 24.
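
A corresponding sketch of the frequency-domain features is given below (again our own illustration; in particular, taking the "first five components" after the DC term is an assumption):

```python
import numpy as np

def frequency_domain_features(window, fs=64):
    """Compute 24 frequency-domain features from a (win_len, 3) tri-axial window."""
    feats = []
    for axis in range(3):
        x = window[:, axis]
        spectrum = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
        psd = spectrum ** 2
        dominant = freqs[np.argmax(spectrum[1:]) + 1]    # peak frequency, DC excluded
        energy = psd.sum() / len(x)                      # spectral energy
        p = psd / psd.sum()                              # normalized power distribution
        entropy = -np.sum(p * np.log2(p + 1e-12))        # spectral entropy
        feats += [dominant, energy, entropy]
        feats += list(spectrum[1:6])                     # magnitudes of first five components
    return np.array(feats)   # 8 features x 3 axes = 24
```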

3.3 Proposed sparse representation based classification scheme

Sparse representation based classification has been used in various vision-based [16] and wearable sensor-based [41, 49] action recognition schemes. Three common approaches are based on shared, class-specific and concatenated dictionaries [16]. In our work, we combine the concepts of class-specific and concatenated dictionary based classification and propose a novel sensor fusion based classification scheme that achieves the best action recognition performance. Classification is initially performed using a majority voting scheme.

Sparse representation is a technique in which a signal is represented as a linear combination of a few atoms from an over-complete dictionary, yielding a compact form of the signal. Let \( f\in {R}^{n\times 1} \) be the input signal vector and \( \phi \in {R}^{n\times m} \) be the over-complete dictionary such that n < m. Then, according to sparse representation theory, the input signal can be represented as f = ϕα, where \( \alpha \in {R}^{m\times 1} \) is the sparse coefficient vector.
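
To make the recovery of α concrete, a minimal orthogonal matching pursuit (OMP) sketch is shown below; OMP is the solver used later in this section, but this simplified implementation is only illustrative, not the authors' code:

```python
import numpy as np

def omp(phi, f, max_atoms=10, eps=0.01):
    """Greedy OMP: find a sparse alpha such that f is approximately phi @ alpha."""
    n, m = phi.shape
    alpha = np.zeros(m)
    residual = f.copy()
    support = []
    for _ in range(max_atoms):
        # Select the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(phi.T @ residual)))
        if k not in support:
            support.append(k)
        # Re-fit the coefficients on the selected support by least squares.
        coeffs, *_ = np.linalg.lstsq(phi[:, support], f, rcond=None)
        residual = f - phi[:, support] @ coeffs
        if np.linalg.norm(residual) < eps:
            break
    alpha[support] = coeffs
    return alpha

# Toy example: an over-complete dictionary (n < m) and a 2-sparse signal.
rng = np.random.default_rng(0)
phi = rng.standard_normal((20, 50))
phi /= np.linalg.norm(phi, axis=0)            # normalize dictionary atoms
alpha_true = np.zeros(50)
alpha_true[[7, 31]] = [1.5, -2.0]
f = phi @ alpha_true
alpha_hat = omp(phi, f, max_atoms=5)
print(np.nonzero(alpha_hat)[0])               # typically recovers atoms 7 and 31
```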

In this work, we have collected data from four different smartphone sensors, namely the accelerometer, gyroscope, magnetometer and orientation sensor. Let C denote the total number of activity classes to be classified; here C = 8. In the proposed scheme, features were extracted from the data of all four sensors for each activity class. Let i denote the class label and j the sensor label, where i ∈ {1, 2, ..., C} and j ∈ {1, 2, 3, 4}. The features extracted from the ith class of the jth sensor form a feature matrix \( {\varphi}_{ij}\in {R}^{n\times m} \), where n is the number of features extracted from a single sensor and m is the number of segmented time frames from a single class. These feature matrices were used to create two types of dictionaries, namely class-specific and concatenated dictionaries. Let \( {\varphi}_i^{cl}\in {R}^{4n\times m} \) denote the class-specific dictionary of class i, formed as \( {\varphi}_i^{cl}={\left[{\varphi}_{i1}^T\;|\;{\varphi}_{i2}^T\;|\;{\varphi}_{i3}^T\;|\;{\varphi}_{i4}^T\right]}^T \), i.e. every class-specific dictionary stacks the features of all four sensors of class i. Let \( {\varphi}_j^{co}\in {R}^{n\times 8m} \) denote the concatenated dictionary of sensor j, formed as \( {\varphi}_j^{co}=\left[{\varphi}_{1j}\;|\;{\varphi}_{2j}\;|\dots |\;{\varphi}_{Cj}\right] \), i.e. each concatenated dictionary includes features from all classes of a particular sensor.
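
Under this reading of the definitions, the two dictionary types can be assembled as follows (a sketch with randomly generated placeholder feature matrices; the dimensions n = 51 and m = 100 are assumptions for illustration):

```python
import numpy as np

C, S = 8, 4        # number of activity classes and sensors
n, m = 51, 100     # features per sensor (27 time + 24 frequency) and training windows per class

rng = np.random.default_rng(0)
# phi[i][j]: (n, m) feature matrix of class i, sensor j (placeholder random data here).
phi = [[rng.standard_normal((n, m)) for j in range(S)] for i in range(C)]

# Class-specific dictionary of class i: features of all four sensors stacked vertically -> (4n, m).
phi_cl = [np.vstack([phi[i][j] for j in range(S)]) for i in range(C)]

# Concatenated dictionary of sensor j: features of all C classes side by side -> (n, C*m).
phi_co = [np.hstack([phi[i][j] for i in range(C)]) for j in range(S)]

print(phi_cl[0].shape, phi_co[0].shape)   # (204, 100) (51, 800)
```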

The proposed algorithm proceeds as follows. Features are extracted from all four sensors. The features from each sensor are first deployed separately to estimate the activity using the sparse representation framework with a concatenated dictionary \( {\varphi}_j^{co} \), where j ∈ {1, 2, 3, 4}. In addition, two initializations are made: an Activity label l is initialized to zero, and an Activity flag \( {A}_f\in {R}^{1\times C} \) is initialized with all zeros. During classification, a sensor-specific test feature vector \( {f}_j^t \), j ∈ {1, 2, 3, 4}, is constructed from the features of that sensor obtained from the test data. For every sensor, the sparse coefficient vector α is obtained using the orthogonal matching pursuit (OMP) algorithm [8] with the concatenated dictionary \( {\varphi}_j^{co} \) and the sensor-specific test feature vector \( {f}_j^t \). The obtained sparse coefficient vector is split according to the number of action classes being considered; in our work it is split into 8 sub-vectors, and the l1-score \( {s}_i^j \) of each action i for sensor j is calculated as the l1-norm of the corresponding sub-vector. The test action class for that sensor is identified as the class with the maximum score, and the Activity flag is incremented at the corresponding location. For example, if the output is 2, then the second location of the Activity flag is incremented from its initial value of 0 to 1. The same process is repeated for all four sensors. If any of the 8 locations of the Activity flag has a value of 3 or 4, a majority of the sensors have produced the same action class as output. In this case, the Activity label l is updated with the index of the location of the Activity flag that contains the highest value. For instance, if the Activity flag ends up as Af = [0 0 0 0 0 3 0 1] after all four iterations, the 6th class has received a majority vote of 3 and the 8th class a vote of 1, meaning that three of the four sensors have indicated the 6th class; the Activity label l is therefore updated to 6. However, if none of the action classes receives a vote of 3 or 4, the sensors disagree; in this case the Activity label l is not updated and remains zero.
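
A sketch of this first classification level is given below (illustrative only; it uses scikit-learn's `orthogonal_mp` in place of the authors' OMP implementation, and the helper names and data layout are ours):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def l1_scores(phi_co_j, f_t_j, C, m, tol=0.01):
    """l1-score of each class for one sensor, using its concatenated dictionary.

    phi_co_j : (n, C*m) concatenated dictionary of sensor j
    f_t_j    : (n,) test feature vector of sensor j
    """
    alpha = orthogonal_mp(phi_co_j, f_t_j, tol=tol)         # sparse coefficient vector
    # Split the coefficients into C class-wise sub-vectors and take their l1-norms.
    return np.array([np.abs(alpha[i * m:(i + 1) * m]).sum() for i in range(C)])

def first_level_vote(phi_co, f_t, C, m):
    """Majority voting over sensors. Returns (label, scores), where label is the
    1-based activity class if at least 3 of the 4 sensors agree, else 0."""
    votes = np.zeros(C, dtype=int)                           # the Activity flag Af
    scores = []                                              # l1-score vectors, one per sensor
    for j in range(len(phi_co)):
        s_j = l1_scores(phi_co[j], f_t[j], C, m)
        scores.append(s_j)
        votes[int(np.argmax(s_j))] += 1
    if votes.max() >= 3:                                     # majority of the four sensors agree
        return int(np.argmax(votes)) + 1, scores
    return 0, scores                                         # label stays 0 -> go to second level
```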

When the Activity label l is still zero, a second level of classification is performed using both the minimum reconstruction error obtained with the class-specific dictionaries \( {\varphi}_i^{cl} \) and the l1-scores obtained from the concatenated dictionaries \( {\varphi}_j^{co} \). A combined test feature vector \( {f}_c^t\in {R}^{4n\times 1} \) is constructed from the features of all four sensors of the test data. Using \( {f}_c^t \) and the class-specific dictionaries \( {\varphi}_i^{cl} \), a sparse coefficient vector is obtained for every class using OMP, and for every class the reconstruction error ri is calculated from the computed sparse coefficient vector and its class-specific dictionary. The l1-scores \( {s}_i^j \) of sensor j are collected into an l1-score vector sj. An l1-score vector with highly varying values indicates strongly differing correlation with the different classes, which increases its capability of distinguishing the test activity class from the other classes, whereas a vector with similar values is least capable of distinguishing the actual test class. Hence, to let the sensors with greater distinguishing capability contribute more to the decision, a weighted score fusion scheme is used: the fusion weight for adaptively fusing the scores of sensor j is the standard deviation σj of its l1-score vector, so that the scores of different sensors are integrated with weights based on their standard deviations. The values of the l1-scores \( {s}_i^j \) and the reconstruction errors ri are then normalized to the range 0 to 1, with the least value mapped to 0 and the highest to 1. The Activity metric ami of each class is formulated so that it favours the maximum l1-score and the minimum reconstruction error, and is computed as \( a{m}_i={s}_i^1{\sigma}^1+{s}_i^2{\sigma}^2+{s}_i^3{\sigma}^3+{s}_i^4{\sigma}^4-{r}_i \), i.e. \( a{m}_i=\sum \limits_{j=1}^4{s}_i^j{\sigma}^j-{r}_i \). Finally, the activity class index i is estimated as the class that produces the maximum activity metric ami, and this index becomes the Activity label l.
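
A sketch of this second level is given below (again illustrative; whether the sensor weights σj are computed before or after score normalization is our assumption, as is the use of scikit-learn's `orthogonal_mp`):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def minmax(v):
    """Normalize a vector to [0, 1]; constant vectors map to zeros."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def second_level(phi_cl, f_t_c, scores, tol=0.01):
    """Weighted score fusion used when the first level fails to reach a majority.

    phi_cl : list of C class-specific dictionaries, each (4n, m)
    f_t_c  : (4n,) combined test feature vector built from all four sensors
    scores : list of 4 l1-score vectors s^j (one per sensor) from the first level
    """
    C = len(phi_cl)
    # Reconstruction error of the test vector under each class-specific dictionary.
    r = np.empty(C)
    for i in range(C):
        alpha_i = orthogonal_mp(phi_cl[i], f_t_c, tol=tol)
        r[i] = np.linalg.norm(f_t_c - phi_cl[i] @ alpha_i)
    r = minmax(r)
    # Activity metric am_i = sum_j s_i^j * sigma^j - r_i, with sigma^j the standard
    # deviation of sensor j's l1-score vector (computed here on the raw scores).
    am = -r
    for s_j in scores:
        s_j = np.asarray(s_j)
        am += minmax(s_j) * np.std(s_j)
    return int(np.argmax(am)) + 1            # 1-based Activity label
```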


In the above algorithm, the value of the error bound ε was empirically set to 0.01.

4 Performance evaluation

The classification results for the C-class classification problem were organized in the form of a confusion matrix \( M\in {R}^{C\times C} \), in which each element Mij indicates the number of observations of class i classified as class j. From the confusion matrix, the true positives tp, true negatives tn, false positives fp and false negatives fn of the system are identified. These measurements are used to calculate standard performance metrics, namely the recall, precision, specificity, F-score and accuracy of the system [22].

Recall (λ): The fraction of correctly estimated positive cases to the total number of positive cases defines the recall of a classifier.

$$ \lambda =\frac{t_p}{t_p+{f}_n} $$
(1)

Precision (ρ): The fraction of correctly estimated positive cases to the total number of cases estimated as positive indicates the precision of a classifier.

$$ \rho =\frac{t_p}{t_p+{f}_p} $$
(2)

Specificity (δ): The fraction of correctly estimated negative cases to the total number of negative cases constitutes the specificity of a classifier.

$$ \delta =\frac{t_n}{t_n+{f}_p} $$
(3)

F-score (μ): The combination of precision and recall into a single metric by means of their harmonic mean gives the F-score of a classifier.

$$ \mu =2\times \frac{\rho \times \lambda }{\rho +\lambda } $$
(4)

Accuracy (α): The fraction of cases that were correctly estimated among all cases is the accuracy of a classifier.

$$ \alpha =\frac{t_n+{t}_p}{t_n+{t}_p+{f}_n+{f}_p} $$
(5)
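
For reference, these per-class metrics can be derived from a confusion matrix in a one-vs-rest fashion as sketched below (our own illustration):

```python
import numpy as np

def per_class_metrics(M):
    """Compute recall, precision, specificity, F-score and accuracy per class from a
    (C, C) confusion matrix M, where M[i, j] counts samples of true class i
    predicted as class j (one-vs-rest for each class)."""
    M = np.asarray(M, dtype=float)
    total = M.sum()
    tp = np.diag(M)
    fn = M.sum(axis=1) - tp          # missed samples of each class
    fp = M.sum(axis=0) - tp          # samples wrongly assigned to each class
    tn = total - tp - fn - fp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    return recall, precision, specificity, f_score, accuracy
```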

To show the importance of using data from four different sensors for improving the recognition rate, we compared the classification accuracy obtained using each sensor individually and using various combinations with feature-level fusion [15]. In feature-level fusion, the features extracted from the data of different sensors are appended to form a new feature vector, which is then used for classification. These comparisons are shown in Table 1 in terms of overall accuracy. From Table 1, it is evident that using data from multiple sensors helps achieve the maximum recognition rate. Among the three standard classifiers, SVM produces the best results. Considering SVM, we observe that using a single sensor, namely the accelerometer alone, produces an overall accuracy of about 76%. Adding the gyroscope raises the accuracy to about 86.2%, and adding the magnetometer to the accelerometer and gyroscope increases it further to about 90.9%. Finally, using the features from all four sensors, namely the accelerometer (a), gyroscope (g), magnetometer (m) and orientation sensor (o), an accuracy of about 94.8% is obtained. These results clearly demonstrate that increasing the number of sensors contributes to the performance of the system, mainly due to the complementary information provided by the different sensors.
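
A minimal sketch of this feature-level fusion baseline is shown below (illustrative only; the data layout, sensor keys and the use of an RBF-kernel SVM are our assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def fuse_features(per_sensor_features, sensors):
    """Feature-level fusion: append the feature vectors of the selected sensors.

    per_sensor_features : dict mapping sensor name -> (num_windows, n) feature array
    sensors             : subset such as ("a",), ("a", "g"), ("a", "g", "m", "o")
    """
    return np.hstack([per_sensor_features[s] for s in sensors])

def evaluate(per_sensor_features, labels, sensors):
    X = fuse_features(per_sensor_features, sensors)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # A proper evaluation would use leave-one-subject-out cross-validation
    # (see the sketch below); fitting and scoring on the same data here only
    # illustrates the fusion plumbing.
    return clf.fit(X, labels).score(X, labels)
```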

Table 1 Comparison of classification accuracy using different sensors and their combinations

To quantitatively demonstrate the superiority of the proposed sparse representation based classification framework, it was compared with various standard classifiers; in particular, DT, k-NN and SVM were used for comparison [2]. For evaluation, the leave-one-subject-out cross-validation technique was used [32]. In this technique, the data from one subject is used for testing and the data from the remaining subjects is used for training; this procedure is repeated until every subject has been used for testing once, and the overall performance is the average over all repetitions.
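
This protocol corresponds to grouped cross-validation with one group per subject, as sketched below (our own illustration using scikit-learn; the SVM pipeline is an assumption):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def loso_accuracy(X, y, subject_ids):
    """Leave-one-subject-out cross-validation: each subject is held out once for
    testing while the remaining subjects are used for training; the overall
    performance is the mean accuracy over all folds."""
    logo = LeaveOneGroupOut()
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    scores = cross_val_score(clf, X, y, groups=subject_ids, cv=logo)
    return scores.mean()
```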

The confusion matrices obtained when classifying activities using data from all four sensors with the standard classifiers and with the proposed sparse representation based classifier described in Section 3.3 are shown in Tables 2, 3, 4 and 5.

Table 2 Confusion matrix for classification using DT
Table 3 Confusion matrix for classification using k-NN
Table 4 Confusion matrix for classification using SVM
Table 5 Confusion matrix for classification using the proposed system

From these matrices, we infer that the performance of the proposed classification scheme is higher than that of the standard classifiers. Unlike the standard classifiers, the proposed scheme achieves recall and precision values greater than 90% for all activities. When classifying activities of similar style, such as upstairs and downstairs, the performance of the standard classifiers is very low; the proposed system, however, achieves an average recall of 92.89% for these two activities, which is 24.97%, 4.29% and 2.81% greater than that achieved using DT, k-NN and SVM, respectively. Similarly, the proposed system achieves an average precision of 93.57% for these two activities, which is 32.40%, 7.68% and 4.68% greater than that achieved using DT, k-NN and SVM, respectively. The overall accuracy obtained using the proposed scheme is 97.13%, which is 15.89%, 4.28% and 2.33% greater than the accuracy achieved using DT, k-NN and SVM, respectively.

To further validate the proposed system, its performance is compared with the standard classifiers and with state-of-the-art sparse representation based algorithms in terms of recall, precision, specificity and F-score averaged over all eight activities, as well as overall accuracy. These values are presented in Table 6.

Table 6 Comparison of standard and state-of-the-art classifiers with the proposed system

The proposed system demonstrates excellent performance, producing values greater than 97% for all the performance measures. As seen from Table 6, SVM produces the best results among the three standard classifiers; however, the proposed system yields 2.33%, 2.31%, 0.33% and 2.32% higher values than SVM in terms of recall, precision, specificity and F-score, respectively. In [49], the features extracted from all the sensors are combined using feature-level fusion to form a single feature vector; the reconstruction error is then computed for each class and the action class that produces the minimum residual is output as the action label. In contrast, classification in our proposed system is based not only on the residual but also on the l1-score. Accordingly, Table 6 shows that the proposed system produces better results than the algorithm presented in [49]. To investigate further, classification was also performed using a concatenated dictionary formed from the features of all classes, based on the maximum l1-norm values of the sparse coefficient vector as in [34]. Again, the proposed system produced recall, precision, specificity and F-score values about 1.35%, 1.36%, 0.19% and 1.35% higher, respectively, than the l1-norm based algorithm of [34].

The bar graphs in Figs. 3 and 4 show the variation in the performance of the various classifiers in terms of specificity and F-score, respectively. Comparing the three standard classifiers, namely DT, k-NN and SVM, the superiority of SVM is noticeable. However, the proposed system outperforms SVM in terms of both specificity and F-score for almost all actions. In addition, the graphs show that the proposed system achieves a better recognition rate than the state-of-the-art classifiers for most of the actions.

Fig. 3 Graphical comparison of standard and state-of-the-art classifiers and the proposed system in terms of specificity

Fig. 4 Graphical comparison of standard and state-of-the-art classifiers and the proposed system in terms of F-score

Hence, the proposed method outperforms all the other methods. This is because, unlike the other methods, classification in the proposed algorithm is done in two levels. First, classification is done using a majority-vote criterion based on the l1-scores obtained from the different sensors. Second, if no action obtains a majority vote, classification is done using a weighted fusion scheme in which the l1-scores are weighted by their standard deviations. Furthermore, the final classification criterion is designed such that the weighted l1-scores are maximized while the reconstruction error is simultaneously minimized. These aspects of the proposed algorithm help produce better classification results.

5 Conclusions

In this paper, we presented a novel framework for accurately determining human activities using data from built-in smartphone sensors. The features extracted from the data acquired by these sensors were used for action recognition through a novel sparse representation based algorithm. This algorithm fuses data from the various sensors effectively and achieved the highest action recognition rate. Moreover, the proposed system was developed using data from mobile devices, which makes it easy to implement in practice since no additional equipment is required for data collection; this also facilitates real-time implementation. Furthermore, the performance of the proposed scheme was quantitatively analyzed using various performance metrics, namely recall, precision, specificity, F-score and accuracy. The proposed system outperformed standard classifiers as well as state-of-the-art sparse representation based algorithms in terms of all the performance metrics considered.