1 Introduction

Recently, artificial intelligence (AI) has been widely used in several applications (e.g., cancer recognition [1, 2], burnout analysis [3], exam correction [4, 5], disease diagnosis [6,7,8], sign language interpretation [9], natural language processing [10], and pattern recognition [11]). The accelerated development of AI has paved the way for human activity recognition (HAR), which is concerned with using sensor data to recognize a specific action (or movement) of a person. HAR has become one of the broadest research topics because of the availability of sensors and accelerometers, their low power consumption, and their low cost. It has been widely employed in smart homes [12], medical care [13, 14], image analysis [15], video surveillance [16], military defense [17], sleep state detection [18], and behavior monitoring [19, 20]. In HAR, movements are activities performed indoors and outdoors (e.g., talking, walking, running, sitting, and standing). They can also be more focused activities, such as those performed in a kitchen or on a factory floor [21]. In short, the fundamental task of HAR is to choose a suitable sensor and use it to observe and capture the activities of the user [22], as shown in Fig. 1. HAR can be classified into sensor-based and visual-based recognition [23, 24]. Sensor-based HAR has become a research focus because of the wide usage of wearable and portable sensors in daily life. HAR sensors mainly include geomagnetic sensors [25], accelerometers [26, 27], and gyroscopes [28].

Fig. 1 The process of human activity recognition (HAR)

Historically, gathering and storing sensor data for activity recognition required custom hardware and was costly. Nowadays, smartphones, smartwatches, and other personal tracking devices, utilized for health and fitness monitoring, are inexpensive and omnipresent. As a result, sensor data collected from these devices are more common and inexpensive to collect, and the activity recognition problem has become a widely studied field. Smartwatches have been beneficial in a broad range of healthcare applications, especially those concentrating on health and fitness monitoring [29, 30]. Compared to other smart devices, smartwatches are truly wearable and do not interrupt the daily lives of their users [31]. The growth of smartwatches in the healthcare field has enabled people to monitor their fitness and health [32]. Unfortunately, data gathered from wearable sensors are time-series data that are complex, noisy, and imbalanced [33]. Hence, HAR is a complex procedure that contains the following steps: pre-process and segment the time-series data, extract the data features, and then classify them using a classification algorithm.

Classical machine learning (ML) algorithms for HAR require manual feature extraction [34]. Dimensionality reduction and feature extraction methods are required for ML algorithms to achieve better performance. These methods aim to find the most informative and compact set of features by generating new ones from the existing features. They represent the most crucial part of classification because performance decreases significantly if the features are not suitable.

Creating classification models that can classify the less common activities is a significant challenge. Classification models trained on imbalanced data are biased toward the more frequently occurring classes. This type of bias happens because the models learn better from classes containing more data. Different methods have been proposed to deal with the class imbalance problem, and they can be split into two main approaches: data-level and classifier-level methods [35]. Traditional ML methods that have been used to perform the HAR task include Naive Bayes and support vector machines (SVM) [36]. Recently, the evolution of deep learning has resulted in its wide utilization in HAR [37]. It learns and extracts features automatically without the complex steps of manual feature extraction; hence, the workload of feature engineering is significantly decreased [38, 39]. Deep neural networks such as recurrent neural networks and convolutional neural networks (CNN) have achieved significant performance across different applications and outperformed many traditional methods. Lately, Long Short-Term Memory (LSTM) networks and CNNs have provided state-of-the-art results on HAR tasks with little or no feature engineering [40].

1.1 Research gap

In the HAR research field, a high-quality benchmark dataset for HAR methods is missing. Most of the publicly available datasets suffer from limited or imbalanced data [33]. Most of the observed activities are simplistic and do not cover the entirety of human actions. Integrating deep architectures for solving HAR using context information can be difficult [41]. Although these methods deliver state-of-the-art performance on benchmark datasets, they are still overconfident in their predictions.

1.2 Research objectives

The major objective of the current study is to suggest an approach for Human Activity Recognition (HAR) through an analysis of machine and deep learning algorithms. Additionally, it addresses two challenging tasks: extracting the most relevant features from raw data and reducing their dimensionality in an efficient manner.

1.3 Research contributions

The contributions of the current study can be summarized as follows:

  • Performing human activity recognition tasks using a detailed comparative analysis of a variety of machine and deep learning algorithms to determine the optimal model.

  • Analyzing balancing and sampling techniques to deal with imbalanced data and determining the best approaches.

  • Reporting state-of-the-art performance metrics and comparing them with different related studies and approaches.

1.4 Paper organization

The rest of the current study is organized as follows: Sect. 2 reviews and summarizes the related literature. Section 3 discusses the background: imbalanced data and oversampling techniques, feature engineering and dimensionality reduction, Topological Data Analysis (TDA), feature scaling, and classification and optimization. Section 4 discusses the methodology: dataset acquisition, the data pre-processing phase, feature engineering and dimensionality reduction techniques, the ML classification and optimization phase, and the DL classification phase. Section 5 presents the details and discussions of the experiments and results. Section 6 presents the study limitations, and finally, Sect. 7 concludes the paper and presents the future work.

2 Literature review

Classical ML algorithms demand extensive domain expertise and feature engineering to transform raw sensor data into features, from which a classifier identifies activities (e.g., SVM [42] and random forest [43]). In Shi et al. [44], an algorithm based on standard deviation trend analysis was utilized to recognize transition activities. For basic activities, SVM was mainly used for recognition. For transition activities, the standard deviation of the data was analyzed to evaluate the overall trend of the data flow and recognize the activity. The accuracy achieved by their proposed model was over 80% on real data.

Garcia et al. [45] introduced a placement-, orientation-, and subject-independent HAR dataset. An SVM algorithm was presented to perform the experiments on the dataset, and an accuracy of 74.39% was obtained. Their proposed model was able to narrow the gap between real-life applications and models. Ahmed et al. [46] proposed a hybrid method that contains a filter and a wrapper for the feature selection process. The process employed a sequential floating forward search to extract the features that would be fed to a multi-class SVM. Their model was validated on a public benchmark dataset [47], and an average accuracy of 98.13% was delivered. Their proposed system provided acceptable activity recognition and operated efficiently with limited hardware resources.

Deep learning algorithms such as recurrent neural networks [48] and convolutional neural networks [49] conduct automatic feature extraction and classification. They have delivered promising results in different sensor-based HAR scenarios [50]. Barut et al. [51] used a single wearable sensor to create a new dataset and utilized a multi-task LSTM model for intensity evaluation and activity recognition to deliver better outcomes. An accuracy of 97.76% and an F1-score of 83.43% were obtained. Wang and Liu [52] suggested a Hierarchical-LSTM model based on the LSTM for human activity recognition. Three public UCI datasets were used to train and evaluate their model, and an accuracy of 99.15% was achieved.

Furthermore, convolutional neural networks are used in HAR tasks to extract temporal features and have produced significant performance advancements [53,54,55]. Zhang et al. [34] utilized the encoder and decoder operations of the U-Net architecture in creating their proposed HAR framework. Rather than sliding-window labeling, dense labeling was used to provide a single label per sample in the time-series data. Additionally, to enhance the performance of the dense prediction outcome, a post-correction algorithm was utilized. Four datasets, including the WISDM dataset [56], the UCI HAPT dataset (HAPT) [47], the UCI OPPORTUNITY Gesture dataset (OPP Gesture) [57], and a self-collected Sanitation dataset, were used to conduct experiments. For the OPP Gesture dataset, an accuracy of 94.78% was obtained by their U-Net_PC model. Teng et al. [58] used a local loss function to achieve the layer-wise training of a convolutional neural network for HAR. Their method was evaluated on five public datasets, namely the UCI HAR dataset [42], Opportunity dataset [59], UniMib-SHAR dataset [60], PAMAP2 dataset [61], and WISDM dataset [36]. The reported accuracy and F1-score were 98.82% and 98.81%, respectively.

To extract powerful features from raw sensor-based data automatically and effectively, Ronao and Cho [62] proposed a model formed of alternating convolution and pooling layers. To predict human activities, the extracted features from the previous layers were passed to the fully connected and SoftMax layers. The dataset proposed in [63], collected from 30 volunteer subjects, was used to train and evaluate their model, reaching an overall performance of 94.79% with raw sensor data and 95.75% with the additional information of the temporal fast Fourier transform of the HAR dataset. Bianchi et al. [64] suggested a CNN model formed of four convolution layers and only one fully connected layer for HAR, which performed well on their small training set. Their system was designed to recognize nine different activities and achieved an accuracy of 97%.

A different design paradigm that was very prevalent among the community of activity recognition was to create hybrid models [50, 65]. In Ordonez and Roggen [66], a DL architecture using a combination of convolutional and recurrent neural networks was proposed to conduct HAR from wearable sensors. Two public datasets, OPPORTUNITY [59] and Skoda [67] were used to evaluate their proposed approach. For the Skoda dataset, an F1-score of 95.8% was obtained. In Xia et al. [68], two LSTM layers and cascading convolutional layers were employed to extract features from time-series data. To maintain the classification performance while reducing the model parameters, a global average pooling layer was utilized rather than the fully connected layer. Three public datasets, UCI [47], WISDM [56], and OPPORTUNITY [57, 59] were used to evaluate the model performance. The overall accuracy of the model for the UCI-HAR dataset was 95.78%, for the WISDM dataset was 95.85%, and for the OPPORTUNITY dataset was 92.63%.

Ignatov et al. [69] used CNN and statistical features to extract features from sensor data. The assembled feature vector was passed to the subsequent layers to identify the activities. The proposed approach was evaluated on two commonly used datasets (i.e., WISDM [56] and UCI [47]). The reported accuracy and F1-score were 97.63% and 97.62%, respectively. The results indicated that their presented model delivered state-of-the-art performance while demanding no manual feature engineering and low computational cost. In Xu et al. [55], the inception module of the GoogLeNet architecture was explored to extract spatial features from sensor data. Furthermore, temporal features obtained using a recurrent neural network were combined with the spatial features. Three benchmark datasets were used to conduct experiments: the OPPORTUNITY dataset [59], the PAMAP2 dataset [61], and the Smartphone database [70]. For the OPPORTUNITY dataset, an F-measure of 94.6% was obtained.

Khan and Ahmad [71] proposed a multi-head attention-based model for HAR. Their framework included three lightweight convolutional heads, each created using a one-dimensional CNN to extract features from input sensor data. Their model was induced with attention to strengthen the representation ability of the CNN. Two publicly available datasets, WISDM [56] and UCI HAR [47], were used to conduct ablation experiments and studies and evaluate the proposed model. The achieved F1-score was 97.20% for the WISDM dataset.

When dealing with HAR data, a problem of imbalanced data may exist. To tackle the imbalance problem, the most intuitive path is to re-sample the data so that the class distribution is balanced, as done in Alani et al. [33]. Experiments were done using an extensive sensor-based multi-modal dataset developed from the Sensor Platform for Healthcare in a Residential Environment [72]. The results showed that when using the SMOTE oversampling technique to correct the class imbalance, CNN-LSTM achieved the highest classification accuracy of 93.67%, followed by CNN with 93.55% and LSTM with 92.98%. Grzeszick et al. [73] used two augmentation techniques, Gaussian noise perturbation and interpolation, to solve the class imbalance problem. In their study, a CNN was utilized on sequential data from multiple inertial measurement units. A dataset introduced in [74] was used to evaluate the proposed model, and a classification accuracy of 73.9% ± 4.6% was obtained.

2.1 Related studies summarization

Table 1 presents a comparison between the related studies and the current study. The related studies are ordered from the oldest to the latest.

Table 1 Comparison between the related studies and the current study

3 Background

3.1 Imbalanced data and oversampling techniques

The imbalance problem appears when one of the target classes has a small number of instances compared to the other classes. Typically, a classifier fails to detect a minority class due to its small number of samples. Recently, there has been significant interest in solving the class imbalance issue. It is considered a challenging issue that requires more attention from researchers [75, 76]. Using resampling techniques to balance the dataset is one of the common procedures. Resampling methods can be applied either by oversampling or undersampling the dataset [77]. Undersampling can be described as the process of decreasing the number of majority target samples (i.e., instances) [78], for example using Tomek's links [79] and cluster centroids [80]. Oversampling can be achieved by increasing the number of minority class samples by repeating some instances or producing new instances [81].

In the current study, only the oversampling approach is used. The utilized techniques are summarized as follows (a minimal usage sketch is given after the list):

  • Synthetic Minority Oversampling Technique (SMOTE): In a classic oversampling technique, the amount of data is increased, but no further variation or information is given to the ML model. Chawla et al. presented SMOTE [82], which operates differently. It creates synthetic data to overcome the overfitting problem posed by random oversampling. It utilizes the K-nearest neighbor algorithm: it starts by choosing a data point randomly from the minority class and then identifies the K nearest neighbors of that point. Synthetic data are then constructed between the randomly selected point and its nearest neighbors.

  • Synthetic Minority Over-sampling Technique for Nominal (SMOTEN) is an extension of SMOTE for nominal features, also proposed by Chawla et al. [82]. In it, the nearest neighbors are calculated using a modified version of the Value Difference Metric [83, 84].

  • Borderline-SMOTE is a variation of the SMOTE. Unlike the SMOTE, it generates synthetic data only along the decision boundary between the two classes [85].

  • Adaptive Synthetic (ADASYN) is a generalized form of the SMOTE algorithm. Similar to SMOTE, it aims to oversample the minority class by generating synthetic instances for it. However, it takes a different approach from Borderline SMOTE: while Borderline SMOTE synthesizes data around the decision boundary, ADASYN creates synthetic data according to the data density [86].

  • K-means SMOTE is an effective and simple oversampling method for class-imbalanced data based on SMOTE and K-means clustering. It aims to aid classification by generating minority class instances in crucial and safe areas of the input space. This method averts noise generation and effectively overcomes the imbalances within and between classes [87].

  • Borderline SMOTE SVM (SVM SMOTE) is another variation of Borderline SMOTE [88]. The primary difference between this technique and the other SMOTE variants is that it incorporates the SVM algorithm to identify misclassifications instead of using K-nearest neighbors. In SVM SMOTE, the support vectors are used to approximate the borderline area after training SVMs on the original training set.
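As a usage illustration, the following is a minimal sketch of applying SMOTE, assuming the imbalanced-learn library; the data, class ratio, and k_neighbors value are placeholders rather than the study's actual configuration.

```python
# Minimal SMOTE sketch with imbalanced-learn; data and parameters are illustrative.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                 # e.g., x/y/z acceleration readings
y = np.array([0] * 900 + [1] * 100)            # imbalanced activity labels

smote = SMOTE(k_neighbors=5, random_state=42)  # K-nearest-neighbor based synthesis
X_res, y_res = smote.fit_resample(X, y)

print(Counter(y))      # Counter({0: 900, 1: 100})
print(Counter(y_res))  # Counter({0: 900, 1: 900}) -- minority class oversampled
```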

3.2 Features engineering and dimensionality reduction techniques

The feature engineering pipeline is the preprocessing step that extracts features from raw data and transforms them into formats that can be ingested by ML algorithms [4, 89]. It helps ML algorithms to determine patterns in data that boost their performance. Feature engineering is an important task in developing predictive solutions [90], but it is challenging and among the least well-studied topics in ML and data mining [91]. Feature engineering is a manual, problem-specific process performed by ML and domain experts [92]. In ML, feature engineering consists of four main steps: feature creation, feature transformation, feature extraction, and feature selection [93, 94].

  • Feature Creation: This step includes specifying the variables that can be useful in the predictive model. It is a subjective process that requires human creativity and intervention. Features are combined by multiplication, addition, and subtraction to construct new derived features with higher predictive power [95]. Usually, ML experts combine features in a trial-and-error manner until the generated features fulfill the expectations [90, 96]. Automated feature generation methods take a long time to produce an outcome and are computationally expensive [97, 98].

  • Feature Transformation: Feature transformation usually implies simpler modifications over the features [99]. Transformation involves manipulating the predictor variables to improve model performance. It is used to ensure that the variables are on the same scale, guarantee the flexibility of the model in which a variety of data can be ingested, make the model easier to understand, avoid computational errors, and improve accuracy. Some of the standard transformations are binning, rounding, scaling, exponential transformations, logarithmic transformations, and power functions [93].

  • Feature Extraction: These techniques aim to find a smaller set of new variables in which each is a combination of the input variables, including the same information as them. Feature extraction is used to develop new variables by extracting them from raw data. It aims to reduce the data volume into a more suitable set for modeling [100]. These methods include text analytics, cluster analysis, principal components analysis, and edge detection algorithms.

  • Feature Selection: When applied to the original dataset, only the most relevant variables are kept. Feature selection algorithms are used to analyze, judge, and rank a subset of features from the pool of available features. Selection is employed to determine which features are irrelevant or redundant and should be removed, and which are the most useful for the model and should be prioritized [101]. These methods can be divided into four high-level categories: filter (e.g., ANOVA, Pearson correlation, variance thresholding), wrapper (e.g., forward, backward, and stepwise selection), embedded (e.g., Lasso, Ridge, Decision Tree), and hybrid methods [102, 103].

3.2.1 Features extraction

In the current work, only feature extraction techniques are applied to perform dimensionality reduction. These techniques are applied to reduce model complexity, overfitting, and generalization error, and to increase the computational efficiency of the model [104]. They are listed below (a minimal usage sketch follows the list):

  • Principal Component Analysis (PCA) is a method for acquiring the important features from a large feature set available in a dataset. It finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with dimensions equal to or fewer than the original ones [105].

  • Linear Discriminant Analysis (LDA) is a supervised feature extraction technique that aims to decrease the scatter within each class and increase the distance between the class means [106, 107].

  • Independent Component Analysis (ICA) is a linear method that takes a mixture of independent components as input data and aims to identify each of them correctly [108].

  • Random Projection (RP) is a technique utilized to perform feature reduction for a set of points lying in Euclidean space [109].

  • Truncated Singular Value Decomposition (T-SVD) is a matrix factorization technique, similar to PCA, used to reduce the dimensionality of the data. Unlike PCA, the data are not centered before calculating the singular value decomposition, which means it can be used efficiently with sparse matrices [110]. The data matrix is factorized by T-SVD such that the number of columns equals the truncation.
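For illustration, the following is a minimal scikit-learn sketch of the five feature extraction techniques; the input data, the target dimensionality of 3, and the class labels are placeholders (note that LDA is supervised and requires the labels).

```python
# Sketch of the five reducers with scikit-learn; shapes are illustrative.
import numpy as np
from sklearn.decomposition import PCA, FastICA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))          # e.g., 26 extracted features per row
y = rng.integers(0, 6, size=500)        # 6 activity classes (needed for LDA)

reducers = {
    "PCA": PCA(n_components=3),
    "ICA": FastICA(n_components=3, random_state=0),
    "T-SVD": TruncatedSVD(n_components=3),
    "RP": GaussianRandomProjection(n_components=3, random_state=0),
    "LDA": LinearDiscriminantAnalysis(n_components=3),  # supervised technique
}
for name, reducer in reducers.items():
    X_red = reducer.fit_transform(X, y) if name == "LDA" else reducer.fit_transform(X)
    print(name, X_red.shape)            # (500, 3) for each technique
```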

3.2.2 Topological data analysis (TDA)

Topology [111] is the study of shapes and their properties. It deals with properties of shapes (e.g., the number of components and loops). Topological Data Analysis (TDA) is an approach to dataset analysis using topological techniques [112]. It exploits the topological and geometrical properties of data such as shape and connectivity. TDA is inspired by the notion that geometry and topology deliver a robust approach to infer strong qualitative information about the data structure [113]. Datasets that are incomplete, high-dimensional, and noisy are challenging to extract information from. TDA provides a general framework to analyze such sets while providing robustness to noise and dimensionality reduction [113]. Additionally, TDA inherits functoriality (i.e., a functor is a mapping between categories), a fundamental concept of modern mathematics, which allows it to adapt to new mathematical tools [114].

Persistent homology is a central tool of TDA to construct multi-scale invariants of data and represent them with barcodes or persistence diagrams [115, 116]. It considers data as a point cloud and tries to find the holes in point clouds using discretization and triangulation of the initial data space with simplicial complexes. TDA offers: (1) A compressed mathematical representation of a dataset: everything from a single data point up to the global structure of a dataset can be studied without bearing a cognitive overload, (2) Missing data and noise resistance: TDA maintains important features of the data, (3) Invariance: the size, orientation, or skew of the data does not change it, as only connectedness matters, (4) An exploration tool for data: fetch answers to questions that have not been asked yet, and (5) A tool for studying the shape of data and manifolds: TDA inherits functoriality and has a robust theoretical foundation.

TDA creates the persistence diagram, a 2D plot that indicates the birth and death of n-dimensional holes in the induced topological spaces. TDA also provides the Mapper [117]. It is considered a combination of clustering, dimensionality reduction, and graph network techniques utilized to get a higher-level understanding of the data structure. It is used to: (1) visualize the shape of data through a particular lens, (2) detect interesting topological structures (i.e., clusters) that cannot be found by traditional methods, and (3) select the best features that discriminate data and support model interpretability.

In biological research fields, several publications have successfully used TDA. These include Type-2 diabetes (T2D) subgrouping using clinical parameters [118], modeling RNA hairpin folding [117], and gene expression patterns-based breast cancer classification [119].

3.3 Feature scaling techniques

Feature scaling, also named data normalization, is an approach employed to normalize the range of data features or independent variables [120]. In this work, five of the most commonly used feature scaling techniques are utilized [121]; a minimal usage sketch is given after the list.

  • Absolute Maximum (Max-Abs) Scaling computes the absolute maximum value of the feature in the dataset and then divides all the values in the column by that value [122]. The output values range between -1 and 1.

  • Minimum Maximum (Min-Max) Scaling is achieved by subtracting the minimum value from all the values in the dataset and then dividing the output by the dataset range (i.e., maximum value - minimum value) [123].

  • Normalization: The maximum value is used to perform normalization [124]. In the previous cases, the range of the data is changed, while in normalization the shape of the data distribution is changed.

  • Standardization (Z-Score Normalization): Z-score is calculated for each data point and replaces the data value with the calculated one [124]. As a result, all features are centered around the mean value with a standard deviation of 1.

  • Robust Scaling is not prone to outliers. In this method, the median value is subtracted from all the data points and then the output values are divided by the Inter Quartile Range (IQR) value [125]. The IQR is the distance between the 25th and the 75th percentile points. Hence, the median value is centered at zero.
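The following is a minimal scikit-learn sketch of the five scalers; here, Normalizer rescales each sample (row) to unit norm, which is one common reading of the normalization described above, and the input matrix is a toy placeholder.

```python
# Sketch of the five scaling techniques via scikit-learn; data are illustrative.
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   StandardScaler, RobustScaler)

X = np.array([[1.0, -20.0], [5.0, 0.0], [9.0, 20.0]])

scalers = {
    "Max-Abs": MaxAbsScaler(),       # divide by the absolute maximum per column
    "Min-Max": MinMaxScaler(),       # (x - min) / (max - min)
    "Normalization": Normalizer(),   # rescale each sample (row) to unit norm
    "Z-Score": StandardScaler(),     # (x - mean) / std
    "Robust": RobustScaler(),        # (x - median) / IQR
}
for name, scaler in scalers.items():
    print(name, scaler.fit_transform(X).round(2).tolist())
```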

3.4 Classification, optimization, and performance evaluation

Machine learning (ML) algorithms have proved to be useful in a broad variety of applications (e.g., email filtering, computer vision, speech recognition, and medicine) in which it is infeasible to design conventional algorithms to accomplish the required tasks [126]. ML techniques are gaining huge attention in data mining, where they are leveraged to recognize historic trends and deliver future models [127].

Deep learning (DL) is a subset of ML in which artificial neural networks learn from large amounts of data [128, 129]. DL algorithms are used to solve complex problems where datasets are diverse, unstructured, and inter-connected. The more DL algorithms learn, the better they perform [130]. Several types of DL algorithms exist, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs). A CNN is an algorithm designed for object detection and image processing. The convolution is a unique filtering process performed over an image to assess every element within it [131, 132]. An RNN has built-in feedback loops that allow the algorithm to remember past data points; it can use this memory to inform its understanding of current events.

3.4.1 Machine learning classifiers

Simple classifiers include Naive Bayes, Decision Tree, Logistic Regression, and K-Nearest Neighbor [133]. There are also ensemble classifiers, which refer to algorithms that merge the predictions from two or more models. Their popularity is due to their ease of implementation and success on an expansive domain of predictive modeling problems [134]. In the current study, seven classifier types are used, six of which are ensemble classifiers:

  • Light Gradient Boosting Machine (LGBM) Classifier is a distributed gradient boosting framework primarily created by Microsoft to be employed in ML [135, 136]. It is built over decision tree [137] algorithms and used for classification, ranking, and further tasks. The development of LightGBM concentrates on scalability and performance [138].

  • XGBoost (XGB) Classifier is a distributed scalable gradient-boosted decision tree algorithm [139]. It is the leading ML algorithm for ranking, classification, and regression problems. It delivers parallel tree boosting, sparse optimization, multiple loss functions, regularization, bagging, and early stopping.

  • Adaptive Boosting (AdaBoost) Classifier is a meta-estimator that begins by fitting a classifier on a dataset and then fits additional copies of the classifier on the same dataset, adjusting the weights of incorrectly classified instances so that subsequent classifiers focus more on the difficult cases [140].

  • Histogram-based Gradient Boosting (HGB) Classifier is a gradient boosting method that bins the continuous input variables into discrete histograms and customizes the training algorithm around this transform. It has native support for missing values. It addresses the main drawback of gradient boosting, namely that training the model is time-consuming on datasets with tens of thousands of examples.

  • Random Forest (RF) Classifier is a method that merges a large number of independent trees trained by equally and randomly distributed subsets of the data [141]. It is one of the most utilized algorithms because of its simplicity and diversity (i.e., can be used for regression, classification, and other tasks that function by building a group of decision trees at the time of training) [142, 143].

  • The Decision Tree (DT) Classifier is a supervised non-parametric learning algorithm used for classification and regression [144]. It is a tree-structured classifier, in which the features of a dataset are represented by internal nodes, the decision rules are represented by branches (i.e., decision nodes), and an outcome is represented by a leaf node. Hence, there are two types of nodes including decision and leaf nodes [145].

  • The Extra Trees (ETs) Classifier is an ensemble learning method in which the results of numerous de-correlated decision trees gathered in a forest are aggregated to output the result of the classification [146].

3.4.2 Deep learning classifiers

In DL, algorithms use the input distribution to extract features and useful data patterns during the training process. Deep learning models include several algorithms. In the current study, three types are used: the 1D Convolutional Neural Network (1D-CNN) [147, 148], the Gated Recurrent Unit (GRU) [149], and the Bi-directional Long Short-Term Memory network (BiLSTM) [48]. They are discussed as follows:

  • 1D Convolutional Neural Network (1D-CNN) Classifier is a recently developed, modified version of the CNN. In a 1D-CNN, the computational complexity is considerably lower than that of a conventional CNN [148]. Hence, 1D-CNNs are suitable for real-time and low-cost applications.

  • Gated Recurrent Unit (GRU) Classifier is a gating mechanism in RNNs. It is similar to an LSTM with a forget gate [150] but with fewer parameters, as the output gate does not exist [151]. GRUs have delivered better performance on smaller and less frequent datasets [152].

  • Bi-Directional Long Short-Term Memory Network (BiLSTM) Classifier is a sequence processing model that includes two LSTMs: the first processes the input in a forward direction, and the other in a backward direction [153]. It effectively increases the amount of information available to the network, enhancing the context available to the algorithm.

3.4.3 K-fold cross-validation

Cross-validation [154] is a resampling technique employed to assess ML models on a limited dataset. Cross-validation is used to detect overfitting (i.e., a model failing to generalize) [155]. This approach has only one parameter, called "K", where the input data are split into K folds (i.e., subsets of data). A minimal sketch is shown below.
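The sketch below illustrates 5-fold cross-validation with scikit-learn (K = 5, as used later in the study); the data and the classifier are placeholders.

```python
# Minimal 5-fold cross-validation sketch; data and classifier are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))              # placeholder features
y = rng.integers(0, 6, size=300)            # placeholder activity labels

clf = RandomForestClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(clf, X, y, cv=5)   # 5 train/validate rotations
print(scores.mean(), scores.std())          # averaged estimate and its spread
```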

3.4.4 Grid search hyperparameter optimization

Hyperparameter optimization (i.e., tuning) is the process of finding the most suitable values of the hyperparameters, and it is one of the most important parts of ML [156, 157]. A model with poor performance and wrong results may result from a wrong choice of hyperparameter values [158]. Hyperparameters are the parameters of a model whose values are specified before training and influence the model behavior. For example, the number of trees in a random forest is a hyperparameter, as its value is set before training. As the optimal values of the hyperparameters are unknown, hyperparameter optimization algorithms are required. The current study utilizes grid search, also referred to as full factorial design [159]. For a model with several hyperparameters, the best combination of hyperparameter values needs to be found by searching in a multi-dimensional space, as sketched below.
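The following is a minimal scikit-learn sketch of grid search; the parameter grid and data are illustrative and not the exact grid of Table 5.

```python
# Grid-search sketch over a random forest; grid values are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))          # placeholder features
y = rng.integers(0, 6, size=200)        # placeholder activity labels

param_grid = {
    "n_estimators": [100, 300],
    "criterion": ["gini", "entropy"],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)                        # evaluates all 2 x 2 x 2 combinations
print(search.best_params_, search.best_score_)
```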

4 Methodology

The suggested methodology is presented and summarized graphically in Fig. 2 and discussed in detail in the next subsections.

Fig. 2 Graphical presentation of the suggested methodology

4.1 Data acquisition phase

The datasets used in the current study are retrieved from different public sources. They are:

  • WIreless Sensor Data Mining (WISDM) v1.1: It consists of 6 columns and 1,098,208 rows (i.e., records). The columns are “user”, “activity”, “timestamp”, “x-acceleration”, “y-acceleration”, and “z-acceleration”. There are 6 activities (i.e., “Walking,” “Jogging,” “Sitting,” “Standing,” “Upstairs,” and “Downstairs”). The data were sampled with a sampling rate of 20 Hz (i.e., one sample per 50 ms). The “user” field ranges from 1 to 36. The “x-acceleration,” “y-acceleration,” and “z-acceleration” fields range from -20 to 20 and are measured by the Android phone’s accelerometer [56]. It can be retrieved from https://www.cis.fordham.edu/wisdm/dataset.php.

  • Human Activity Recognition Using Smartphones Data Set v1.0 (UCI-HAR): It was collected from 30 volunteers aged from 19 to 48 years using a “Samsung Galaxy S II” Android device. There are 6 activities (i.e., “WALKING,” “WALKING_UPSTAIRS,” “WALKING_DOWNSTAIRS,” “SITTING,” “STANDING,” and “LAYING”). The dataset is partitioned into two subsets: 70% for training (i.e., the “train” folder) and 30% for testing (i.e., the “test” folder). The number of training records is 7,352 while the number of testing records is 2,947. There are 561 features, and their names are defined in “features.txt” [70]. It can be retrieved from https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones. The training and testing subsets are merged in the current study and partitioned in a later process.

It is worth noting that the datasets combine static and dynamic activities. Static activities include sitting, standing, and lying while dynamic activities include walking, walking downstairs, and walking upstairs. Table 2 presents a summary of the datasets used in our study.

Table 2 Summary of the datasets used in the current study

4.2 Data pre-processing phase

The datasets are pre-processed before being used in the later phases. Data balancing and sampling are applied.

4.2.1 Data balancing

The datasets are not balanced, and this can lead to overfitting or misclassification issues [160]. The current study utilized the techniques discussed in Sect. 3.1 to determine the most suitable technique to use in the experiments. Each technique is executed for 10 runs on the same dataset to determine its average time. Table 3 shows whether each technique crashes, its average time, and whether it produces balanced data. The techniques are applied to the “WISDM” dataset, as it contains a large volume of records, to check whether each technique is scalable. If a technique crashes, then it is not scalable.

Table 3 Comparison between the oversampling techniques on the “WISDM” dataset

Table 3 shows that the SMOTE method outperforms the other methods as it does not crash, produces balanced datasets, and consumes less time. Hence, synthetic oversampling is applied in the current study to the used datasets with the SMOTE technique [82]. Figure 3 shows the distribution of each category of the “WISDM” dataset before and after SMOTE oversampling. It reached 2,546,394 records after the SMOTE balancing process. Figure 4 shows the distribution of each category of the “Human Activity Recognition Using Smartphones Data Set v1.0 (UCI-HAR)” dataset before and after SMOTE oversampling. It reached 11,664 records after the SMOTE balancing process. In both figures, the x-axis shows the categories and the y-axis shows the count of each category.

Fig. 3 The distribution of each category of the “WISDM” dataset before (left plot) and after (right plot) SMOTE oversampling

Fig. 4 The distribution of each category of the “Human Activity Recognition Using Smartphones Data Set v1.0 (UCI-HAR)” dataset before (left plot) and after (right plot) SMOTE oversampling

4.2.2 Data sampling

The datasets are time-series data, and instead of using them row-by-row in the classification phase, a sampling mechanism is applied. The sampling is performed vertically, which means that the records are stacked row-by-row after the sampling process. It is applied using different configurations, as shown in Table 4. The second column is the sampling size, the third column is the step size, the fourth column is the overlapping percentage, the fifth column is the number of records of the imbalanced dataset, the sixth column is the number of records of the balanced dataset, and the seventh column is the output shape of the balanced dataset after sampling. A minimal sketch of this windowing is given after Table 4.

Table 4 The data sampling criteria
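As a concrete illustration, the following is a minimal sketch of the vertical sliding-window sampling; the window and step sizes are illustrative, not the exact values of Table 4. A step equal to the window size yields 0% overlap, and a step of half the window yields 50% overlap.

```python
# Sliding-window sampling sketch; window/step values are illustrative.
import numpy as np

def sample_windows(data, window, step):
    """Stack fixed-size windows of a (records x channels) time series."""
    starts = range(0, len(data) - window + 1, step)
    return np.stack([data[s:s + window] for s in starts])

signal = np.arange(20, dtype=float).reshape(10, 2)      # 10 records, 2 channels
print(sample_windows(signal, window=4, step=4).shape)   # 0% overlap -> (2, 4, 2)
print(sample_windows(signal, window=4, step=2).shape)   # 50% overlap -> (4, 4, 2)
```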

4.3 Features engineering and dimensionality reduction techniques

4.3.1 Features extraction using TDA

The TDA is used to extract the features from the datasets in nine flavors. They are: (1) persistence entropy in two views, normalized and non-normalized, (2) number of points, (3) bottleneck, (4) Wasserstein, where the p in \(L^p\) is set to 2.0, (5) Betti, where the p in \(L^p\) is set to 2.0 and the number of bins is set to 100, (6) landscape, where the p in \(L^p\) is set to 2.0, the number of bins is set to 100, and the number of landscape layers to consider is set to three values: 1, 2, and 3, (7) persistence image, where the p in \(L^p\) is set to 2.0, the sigma value is set to 0.1, and the number of bins is set to 100, (8) heat, where the p in \(L^p\) is set to 2.0, the sigma value is set to 0.1, and the number of bins is set to 100, and (9) silhouette, where the p in \(L^p\) is set to 2.0, the number of bins is set to 100, and the power is set to two values: 1.0 and 2.0.

The pipeline for extracting the features consists of three stages: (1) cubical persistence, resulting from the filtered cubical complexes, (2) a bottleneck scaler, which makes the lifetime of the most persistent point across all diagrams and homology dimensions equal to two, and (3) a filter to discard the points that have a lifetime less than or equal to a cutoff value (set to 0.01 in the current study). From that, the number of extracted features per row is 26. A sketch of this pipeline is given below.
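The following is a hedged sketch of this three-stage pipeline, assuming the giotto-tda library; the window shape is illustrative, and persistence entropy stands in for the nine feature flavors listed above.

```python
# TDA feature-extraction sketch with giotto-tda; input shapes are illustrative.
import numpy as np
from gtda.homology import CubicalPersistence
from gtda.diagrams import Scaler, Filtering, PersistenceEntropy
from sklearn.pipeline import make_pipeline

windows = np.random.default_rng(0).normal(size=(32, 50, 3))  # sampled windows

pipe = make_pipeline(
    CubicalPersistence(),                 # stage 1: filtered cubical complexes
    Scaler(metric="bottleneck"),          # stage 2: bottleneck-based rescaling
    Filtering(epsilon=0.01),              # stage 3: drop short-lived points
    PersistenceEntropy(normalize=True),   # one of the nine feature flavors
)
features = pipe.fit_transform(windows)
print(features.shape)                     # (32, number of homology dimensions)
```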

4.3.2 Dimensionality reduction

The current study utilized five feature reduction techniques: PCA, LDA, ICA, RP, and T-SVD, as discussed in Sect. 3.2. The features are reduced to 3 for the “WISDM” dataset because the original dataset consists of 3 columns. The features are reduced to 100 for the “UCI-HAR” dataset.

4.4 ML classification and optimization phase

The current study uses different classification algorithms to achieve the state-of-the-art (SOTA) performance metrics. They are the (1) LGBM, (2) XGB, (3) AdaBoost, (4) HGB, (5) ETs, (6) DT, and (7) RF classifiers.

4.4.1 Hyperparameters optimization using grid search (GS)

There are different hyperparameters for each used machine learning algorithm, and hence the GS optimization approach is implemented to find the best combination for each classifier. Table 5 summarizes the hyperparameter values of each classifier to select from. The used machine learning algorithms and their hyperparameters are:

  • LGBM Classifier: The boosting type is set to Gradient Boosting Decision Tree (GBDT). The max depth is the maximum tree depth for the base learners and is set to “None” which means that there are no limitations. The learning rate is the boosting learning rate. The number of estimators is the number of boosted trees to fit and is set to 300.

  • XGB Classifier: The boosting type is set to GBDT. The number of estimators is set to 300. The max depth is set to “None”.

  • AdaBoost Classifier: The number of estimators is set to 300.

  • DT Classifier: The combinations are applied between criteria to measure the quality of a split and the splitting mechanism. The max depth is set to “None.”

  • ETs Classifier: The combinations are applied on the criterion. The max depth is set to “None.” The number of estimators is set to 300.

  • RF Classifier: The combinations are applied on the criterion and class weight. The number of estimators is set to 300. The max depth is set to “None.”

  • HGB Classifier: The max iteration is set to 100 and the learning rate is set to 0.1.

Table 5 The different used hyperparameters of the classifiers

4.4.2 Features scaling

Standardization, normalization, min-max scaling, max-absolute scaling, and robust scaling are used with the grid search to find the best scaler technique. Table 6 summarizes the used feature scaling equations.

Table 6 Summary of the feature scaling equations

where \({X_{\text {scaled}}}\) is the scaled output vector while \(X_{\text {input}}\) is the input vector, IQR is the interquartile range, \(\mu\) is the mean value, and \(\sigma\) is the standard deviation value.

4.4.3 Performance improvement

The train-to-test splitting and K-fold cross-validation are used to improve the estimated performance of the classifiers. The train-to-test splitting partitions the datasets into train and test subsets. The current study uses 85% and 15% for the train and test subsets, respectively, after shuffling them. The current study uses five folds (i.e., \(K = 5\)).

4.5 DL classification phase

The current study utilized three DL approaches, as discussed in Sect. 3.4.2. The first is the GRU model that consists of nine layers: (1) an input layer, (2) three cascaded GRU layers with 32 kernels, 20% dropout, and 20% recurrent dropout, (3) a dense layer with 64 units and the LeakyReLU activation function, (4) a dropout layer with a dropout ratio of 50%, (5) another dense layer with 32 units and the LeakyReLU activation function, (6) another dropout layer with a dropout ratio of 50%, and (7) an output layer with a SoftMax activation function.

The second is the 1D-CNN that consists of 14 layers: (1) input layer, (2) four cascaded 1D convolutional layers with 32, 64, 128, and 256 filters (i.e., kernels), respectively, kernel size of 3, same padding, and LeakyReLU activation function, (3) 1D max-pooling layer with pooling size of 3, a stride of 2, and same padding, (4) dropout layer with a dropout ratio of 50%, (5) dense layer with 256 units and LeakyReLU activation function, (6) dropout layer with a dropout ratio of 50%, (7) flatten layer, (8) another dense layer with 512 units and LeakyReLU activation function, (9) another dropout layer with a dropout ratio of 50%, (10) a third dense layer with 1024 units and LeakyReLU activation function, and (11) output layer with a SoftMax activation function.

The third is the BiLSTM model that consists of 3 layers: (1) an input layer, (2) a bi-directional LSTM layer with 4 units and the LeakyReLU activation function, and (3) an output layer with a SoftMax activation function. All of the DL models used the Adam parameter optimizer, the categorical cross-entropy loss function, and 64 epochs. The models’ architectures came out after a set of trial-and-error experiments, as there are no specific rules to define the architectures because of the dataset dependency. A minimal sketch of the BiLSTM model is shown below.
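As an illustration, the following is a minimal Keras sketch of the BiLSTM model described above; the window length, channel count, class count, and the use of tf.nn.leaky_relu for the LeakyReLU activation are assumptions for demonstration.

```python
# BiLSTM sketch in Keras; input/output shapes are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models

n_timesteps, n_channels, n_classes = 50, 3, 6   # assumed shapes

model = models.Sequential([
    layers.Input(shape=(n_timesteps, n_channels)),
    layers.Bidirectional(layers.LSTM(4, activation=tf.nn.leaky_relu)),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train_onehot, epochs=64, batch_size=256)  # 64 epochs, as stated
```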

4.6 Performance evaluation

The confusion matrix is reported with its TP (i.e., True Positive), TN (i.e., True Negative), FP (i.e., False Positive), and FN (i.e., False Negative) values. Eqs. 1, 2, 3, and 4 show how to calculate the values of TP, FP, FN, and TN, respectively, for multi-class problems.

$$\text{TP}_i = C_{(i, i)}$$
(1)
$$\text{FP}_i = \sum_{l=1}^{n}{C_{(l, i)}} - \text{TP}_i$$
(2)
$$\text{FN}_i = \sum_{l=1}^{n}{C_{(i, l)}} - \text{TP}_i$$
(3)
$$\text{TN}_i = \sum_{l=1}^{n}{\sum_{k=1}^{n}{C_{(l, k)}}} - \text{TP}_i - \text{FP}_i - \text{FN}_i$$
(4)

where C is the confusion matrix, n is the number of classes, and i is the class number.
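For concreteness, the following is a minimal NumPy sketch that computes Eqs. 1-4 per class; the confusion matrix shown is a toy example, with rows as true classes and columns as predicted classes.

```python
# Per-class TP/FP/FN/TN from a multi-class confusion matrix (Eqs. 1-4).
import numpy as np

C = np.array([[50,  2,  1],      # toy confusion matrix: rows = true classes,
              [ 3, 45,  4],      # columns = predicted classes
              [ 0,  5, 40]])

TP = np.diag(C)                  # Eq. 1: C[i, i]
FP = C.sum(axis=0) - TP          # Eq. 2: column sums minus TP
FN = C.sum(axis=1) - TP          # Eq. 3: row sums minus TP
TN = C.sum() - TP - FP - FN      # Eq. 4: all remaining entries
print(TP, FP, FN, TN)
```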

Different performance metrics are calculated from them. They are (1) Accuracy, (2) Balanced Accuracy, (3) Precision (i.e., PPV), (4) Recall (i.e., Sensitivity, Hit Rate, and TPR), (5) Specificity (i.e., TNR), (6) F1-score (i.e., Dice coefficient and Overlap Index), (7) IoU (i.e., Jaccard Index), (8) NPV (Negative Predictive Value), and (9) ROC (i.e., Receiver Operating Characteristic). The corresponding equations are presented in Table 7.

Table 7 Summary of the performance evaluation metrics

To engage all of the calculated metrics together, the weighted sum metric (WSM) is calculated as shown in Eq. 5.

$$\text{WSM} = \frac{1}{9} \times (\text{Accuracy} + \text{Balanced Accuracy} + \text{Precision} + \text{Specificity} + \text{Recall} + \text{F1} + \text{IoU} + \text{ROC} + \text{NPV})$$
(5)

4.6.1 Multi-class averaging

During the process of evaluating the performance of multi-class dataset implementations, it is preferred to use an averaging method. It is used to average the scores to acquire a single number describing the overall performance instead of having multiple scores per class. This includes micro-, macro-, and macro-weighted averaging methods. They are discussed as follows:

  • Micro Averaging: The true positives, false positives, and false negatives are summed across all classes, and the metric is then computed from these aggregate counts. For a balanced dataset, micro averaging is preferred when an understandable metric for overall performance regardless of the class is required. The more samples a class has, the more impact the corresponding class has on the final score, thus favoring majority classes.

  • Macro Averaging is straightforward: the metric is computed independently for each class, and the results are averaged so that all classes contribute equally. Hence, the statistics of the smaller classes are reflected. It is appropriate when the performance of all classes is equally important.

  • Macro-Weighted Averaging is calculated by weighting the score of each class label by its number of true instances. It is applied in the situation of an imbalanced dataset, assigning greater contributions to the majority classes. It is worth mentioning that using this type of averaging with balanced data will yield the same result as macro averaging.

In the current study, since the datasets used in the evaluation phase are balanced, the micro-averaging method is utilized. The following sketch illustrates the three averaging methods.
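The short sketch below uses toy labels and scikit-learn's f1_score to show how the three averaging modes can differ on the same predictions.

```python
# Micro vs. macro vs. weighted averaging; labels are toy placeholders.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2, 1]

for avg in ("micro", "macro", "weighted"):
    print(avg, round(f1_score(y_true, y_pred, average=avg), 3))
```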

5 Experiments and discussion

Figure 5 summarizes the flow of the numerical data and the corresponding experiment category numbers. This facilitates tracing the experiment categories for the reader.

Fig. 5 The flow of the numerical data and the corresponding experiment category numbers

5.1 Experiments configurations, constraints, and assumptions

Table 8 summarizes the experiments configurations.

Table 8 The experiments configurations

In the current study, the constraints and assumptions applied during the sampling and feature reduction processes are: (1) the target is to choose a number of features equal to or less than the initial number of input features for each dataset, (2) the constructed features should be aware of the time-series nature of the data, and (3) the time complexity is taken into account. From that, the current study fixed the number of output features to 3 for the WISDM dataset and 100 for the UCI-HAR dataset. In the reported results tables, “None” in the “Max Depth” column means that there is no limitation on the maximum tree depth, and “None” in the “Class weight” column means that no weights are assigned to the classes.

5.2 First category experiments

The current section presents the experiments applied to both datasets after the dimensionality reduction step, as presented in Fig. 5. The “WISDM” dataset is sampled, and five feature reduction techniques are applied to reduce the number of features to 3, as described earlier. Table 9 shows the best combinations, and Table 10 shows the corresponding performance metrics on the reduced “WISDM” data when applying 50% overlapping \((i.e., 50,926 \times 3)\) using the different mentioned classifiers. It shows the best result of each classifier after the grid searching process.

Table 9 The best combinations on the reduced “WISDM” dataset \((50,926 \times 3)\) using each classifier
Table 10 The corresponding performance metrics using each classifier applied on the reduced “WISDM” dataset \((50,926 \times 3)\)

Table 10 reports the best performance metrics. It shows that the best-reported accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV are 93.92%, 89.06%, 81.76%, 96.35%, 81.76%, 81.76%, 69.15%, 89.36%, and 96.35%, respectively, by the XGB classifier with the RP feature reduction technique. In terms of the elapsed time, the best-reported classifier is DT with ICA as the feature reduction technique, at 20,186 seconds. Table 11 and Fig. 6 summarize the WSM metrics using the “WISDM” dataset when applying 50% overlapping. The highest WSM value is 86.61%, produced by the RF classifier with the RP feature reduction method.

Table 11 Tabular summarization of the WSM metrics using the “WISDM” dataset \((50,926 \times 3)\)
Fig. 6 Graphical summarization of the WSM metrics using the “WISDM” dataset \((50,926 \times 3)\)

Table 12 shows the best combinations, and Table 13 shows the corresponding performance metrics on the reduced “WISDM” data when applying 0% overlapping \((i.e., 25,463 \times 3)\) using the different mentioned classifiers. It shows the best result of each classifier after the grid searching process.

Table 12 The best combinations applied on the reduced “WISDM” dataset \((25,463 \times 3)\) using each classifier
Table 13 The corresponding performance metrics using each classifier applied on the reduced “WISDM” dataset \((25,463 \times 3)\)

Table 13 reports the best performance metrics. It shows that the best-reported accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV are 93.53%, 88.39%, 80.65%, 96.13%, 80.65%, 80.65%, 67.57%, 88.73%, and 96.13%, respectively, by the HGB classifier with the T-SVD feature reduction technique. In terms of the elapsed time, the best-reported classifier is DT with ICA as the feature reduction technique, at 9.4 seconds. Table 14 and Fig. 7 summarize the WSM metrics using the “WISDM” dataset when applying 0% overlapping. The highest WSM value is 85.83%, produced by the LGBM classifier with the T-SVD feature reduction technique.

Table 14 Tabular summarization of the WSM metrics using the “WISDM” dataset \((25,463 \times 3)\)
Fig. 7 Graphical summarization of the WSM metrics using the “WISDM” dataset \((25,463 \times 3)\)

The “UCI-HAR” dataset is sampled, and five feature reduction techniques are applied to reduce the number of features to 100, as described earlier. Table 15 shows the best combinations, and Table 16 shows the corresponding performance metrics on the reduced “UCI-HAR” data when applying 50% overlapping \((i.e., 232 \times 100)\) using the different mentioned classifiers. It shows the best result of each classifier after the grid searching process.

Table 15 The best combinations on the reduced “UCI-HAR” dataset \((232 \times 100)\) using each classifier
Table 16 The corresponding performance metrics using each classifier applied on the reduced “UCI-HAR” dataset \((232 \times 100)\)

Table 16 reports the best performance metrics. It shows that the best-reported accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV are 99.43%, 98.97%, 98.28%, 99.66%, 98.28%, 98.28%, 96.61%, 98.97%, and 99.66%, respectively, by the HGB classifier with the PCA feature reduction technique. In terms of the elapsed time, the best-reported classifier is DT with LDA as the feature reduction technique, at 0.6 seconds. Table 17 and Fig. 8 summarize the WSM metrics using the “UCI-HAR” dataset when applying 50% overlapping. The highest WSM value is 98.68%, produced by the HGB classifier with the PCA feature reduction technique.

Table 17 Tabular summarization of the WSM metrics using the “UCI-HAR” dataset \((232 \times 100)\)
Fig. 8 Graphical summarization of the WSM metrics using the “UCI-HAR” dataset \((232 \times 100)\)

Table 18 shows the best combinations, and Table 19 shows the corresponding performance metrics on the reduced “UCI-HAR” data when applying 0% overlapping \((i.e., 116 \times 100)\) using the different mentioned classifiers. It shows the best result of each classifier after the grid searching process.

Table 18 The best combinations applied on the reduced “UCI-HAR” dataset \((116 \times 100)\) using each classifier
Table 19 The corresponding performance metrics using each classifier applied on the reduced “UCI-HAR” dataset \((116 \times 100)\)

Table 19 reports the best performance metrics. It shows that the best-reported accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV are all 100%, by the LGBM classifier with the T-SVD feature reduction technique. In terms of the elapsed time, the best-reported classifier is DT with LDA as the feature reduction technique, at 0.5 seconds. Table 20 and Fig. 9 summarize the WSM metrics using the “UCI-HAR” dataset when applying 0% overlapping. The highest WSM value is 100%, produced by the LGBM classifier with the T-SVD feature reduction technique.

Table 20 Tabular summarization of the WSM metrics using the “UCI-HAR” dataset \((116 \times 100)\)
Fig. 9 Graphical summarization of the WSM metrics using the “UCI-HAR” dataset \((116 \times 100)\)

5.2.1 First category experiments remarks

Does applying overlapping during the dataset sampling process affect the performance? For the WISDM dataset, according to Tables 11 and 14, applying 50% overlapping increases the best-reported WSM by 0.78%; however, the increase is not considerable. For the UCI-HAR dataset, according to Tables 17 and 20, applying 0% overlapping increases the best-reported WSM by 1.32%. According to Table 20, is it reasonable to obtain a WSM value of 100%? The answer can be “YES”, as this happens because of the high model complexity while the number of records is relatively low (i.e., 116).

5.3 Second category experiments

The current section presents the experiments applied to both datasets after the TDA feature extraction step, as presented in Fig. 5. Table 21 shows the best combinations, and Table 22 shows the corresponding performance metrics on the four sampled datasets after TDA using the different mentioned classifiers. It shows the best result of each classifier after the grid searching process.

Table 21 The best combinations applied on the four sampled datasets after TDA
Table 22 The corresponding performance metrics using each classifier applied on the four sampled datasets after TDA

Table 22 reports the best performance metrics. For the UCI-HAR dataset with 0% overlap, the best-reported accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV are 95.69%, 92.24%, 87.07%, 97.41%, 87.07%, 87.07%, 77.10%, 92.39%, and 97.41%, respectively, by the LGBM classifier; in terms of the elapsed time, the best-reported classifier is DT at 0.8 seconds. For the UCI-HAR dataset with 50% overlap, the best-reported accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV are 96.70%, 94.05%, 90.09%, 98.02%, 90.09%, 90.09%, 81.96%, 94.14%, and 98.02%, respectively, by the LGBM classifier; in terms of the elapsed time, the best-reported classifier is DT at 0.9 seconds. For the WISDM dataset with 0% overlap, the best-reported accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV are 95.07%, 91.12%, 85.20%, 97.04%, 85.20%, 85.20%, 74.22%, 91.31%, and 97.04%, respectively, by the LGBM classifier; in terms of the elapsed time, the best-reported classifier is DT at 54.1 seconds. For the WISDM dataset with 50% overlap, the best-reported accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV are 95.34%, 91.61%, 86.01%, 97.20%, 86.01%, 86.01%, 75.45%, 91.78%, and 97.20%, respectively, by the LGBM classifier; in terms of the elapsed time, the best-reported classifier is DT at 114.8 seconds. Table 23 and Fig. 10 summarize the WSM metrics. The highest WSM values are 90.38%, 92.57%, 89.05%, and 89.62%, produced by the LGBM classifier on UCI-HAR + 0% Overlap, UCI-HAR + 50% Overlap, WISDM + 0% Overlap, and WISDM + 50% Overlap, respectively.

Table 23 Tabular summarization of the WSM metrics using TDA and the four sampled datasets
Fig. 10

Graphical summarization of the WSM metrics using TDA and the four sampled datasets

5.3.1 Second category experiments remarks

Why was the TDA feature extraction not applied to the flattened features instead of directly to the sampled ones? The TDA feature extraction technique accepts data of any dimension (i.e., n-dimensional data where \(n \ge 1\)) as input. Hence, performing two processes (flattening and then extraction) is not preferred, in order to optimize the time. On the contrary, the traditional feature reduction techniques require 2-dimensional data as input, and hence the flattening step is crucial for them. Does applying overlapping during the dataset sampling process affect the performance? According to Table 23, applying 50% overlapping with the UCI-HAR and WISDM datasets increases the best-reported WSM by 2.19% and 0.57%, respectively.
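
As an illustration of why no flattening is needed, the following hedged sketch extracts TDA features directly from the 3-dimensional windows. giotto-tda is assumed as the library here (the paper does not name one), and treating each window as a point cloud in \(\mathbb{R}^3\) is one of several possible formulations.

import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import PersistenceEntropy

# Hypothetical sampled windows of shape (n_windows, window_len, n_axes);
# each (128, 3) window is treated as a point cloud, so no flattening.
X_windows = np.random.default_rng(0).standard_normal((50, 128, 3))

# Persistence diagrams for connected components (H0) and loops (H1).
diagrams = VietorisRipsPersistence(
    homology_dimensions=[0, 1]).fit_transform(X_windows)

# Summarize each diagram into a fixed-length feature vector.
X_tda = PersistenceEntropy().fit_transform(diagrams)
print(X_tda.shape)  # (50, 2): one entropy value per homology dimension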

5.4 Third category experiments

The current section presents the experiments applied on both datasets after the sampling step, as presented in Fig. 5. The deep learning classifiers are used in this category. Table 24 shows the reported performance metrics for the four generated datasets using the 1D-CNN model. It shows that the highest accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, NPV, and WSM values for (1) “UCI-HAR + 50% Overlap” are all 100%, reported with a batch size of 16; (2) “UCI-HAR + 0% Overlap” are all 100%, reported with batch sizes of 64 and 4; (3) “WISDM + 50% Overlap” are 99.90%, 99.81%, 99.70%, 99.94%, 99.68%, 99.69%, 99.38%, 99.81%, 99.94%, and 99.76%, reported with a batch size of 512; and (4) “WISDM + 0% Overlap” are 99.71%, 99.47%, 99.17%, 99.83%, 99.10%, 99.13%, 98.28%, 99.47%, 99.82%, and 99.33%, reported with a batch size of 256.

Table 24 The performance metrics using the 1D-CNN model applied on the four generated datasets
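
For reference, a minimal Keras sketch of a 1D-CNN for windowed HAR data is given below. The layer sizes and depths are illustrative assumptions and do not reproduce the paper's exact architecture.

import tensorflow as tf

def build_1d_cnn(window_len=128, n_axes=3, n_classes=6):
    # Two convolutional blocks over the time axis, then a dense head.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(window_len, n_axes)),
        tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
        tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_1d_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# The batch size is the hyperparameter varied in Table 24, e.g.:
# model.fit(X_train, y_train, epochs=..., batch_size=16)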

Table 25 shows the reported performance metrics for the four generated datasets using the GRU model. It shows that the highest accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, NPV, and WSM values for (1) “UCI-HAR + 50% Overlap” are 99.86%, 99.74%, 99.57%, 99.91%, 99.57%, 99.57%, 99.14%, 99.74%, 99.91%, and 99.67%, reported with a batch size of 32; (2) “UCI-HAR + 0% Overlap” are all 100%, reported with a batch size of 16; (3) “WISDM + 50% Overlap” are 98.83%, 97.75%, 96.81%, 99.37%, 96.13%, 96.47%, 93.18%, 97.76%, 99.23%, and 97.28%, reported with a batch size of 256; and (4) “WISDM + 0% Overlap” are 98.11%, 96.34%, 94.88%, 98.99%, 93.69%, 94.28%, 89.18%, 96.38%, 98.74%, and 95.62%, reported with a batch size of 256.

Table 25 The performance metrics using the GRU model applied on the four generated datasets

Table 26 shows the reported performance metrics for the four generated datasets using the BiLSTM model. It shows that the highest accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, NPV, and WSM values for (1) “UCI-HAR + 50% Overlap” are 99.71%, 99.48%, 99.14%, 99.83%, 99.14%, 99.14%, 98.29%, 99.48%, 99.83%, and 99.34%, reported with a batch size of 8; (2) “UCI-HAR + 0% Overlap” are 99.71%, 99.48%, 99.14%, 99.83%, 99.14%, 99.14%, 98.29%, 99.48%, 99.83%, and 99.34%, also reported with a batch size of 8; (3) “WISDM + 50% Overlap” are 94.98%, 89.59%, 87.53%, 97.68%, 81.50%, 84.41%, 73.02%, 89.95%, 96.35%, and 88.33%, reported with a batch size of 256; and (4) “WISDM + 0% Overlap” are 94.78%, 88.98%, 87.36%, 97.68%, 80.29%, 83.68%, 71.94%, 89.41%, 96.12%, and 87.80%, reported with a batch size of 256.

Table 26 The performance metrics using the BiLSTM model applied on the four generated datasets
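
A companion sketch for the two recurrent baselines follows. As with the 1D-CNN sketch above, the layer widths are illustrative assumptions rather than the paper's exact setup.

import tensorflow as tf

def build_rnn(cell="gru", window_len=128, n_axes=3, n_classes=6):
    # A single recurrent layer reads the whole window, then a dense head
    # classifies the activity; "bilstm" wraps an LSTM in both directions.
    if cell == "gru":
        recurrent = tf.keras.layers.GRU(64)
    else:
        recurrent = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))
    return tf.keras.Sequential([
        tf.keras.Input(shape=(window_len, n_axes)),
        recurrent,
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

gru_model = build_rnn("gru")
bilstm_model = build_rnn("bilstm")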

Figure 11 shows a graphical comparison of the best reported results among the three approaches (i.e., 1D-CNN, GRU, and BiLSTM). It shows that the 1D-CNN approach outperforms the other two. Also, the overlapping mechanism shows better results than the non-overlapping one.

Fig. 11

Graphical comparison of the best reported results between the three approaches (i.e., 1D-CNN, GRU, and BiLSTM)

5.4.1 Third category experiments remarks

Does applying overlapping during the dataset sampling process affect the performance? According to Tables 24, 25, and 26, applying 50% overlapping with the WISDM dataset increases the best-reported WSM values by (1) 0.43% for the 1D-CNN model, (2) 1.66% for the GRU model, and (3) 0.53% for the BiLSTM model. However, it does not increase the best-reported WSM values for the UCI-HAR dataset.

5.5 Overall remarks

From the experiments performed in the current study, partitioned into three categories, the best approach is 1D-CNN for both the “WISDM” and “UCI-HAR” datasets. Additionally, applying 50% overlapping with the “UCI-HAR” and “WISDM” datasets increases the best-reported metrics in most cases. For the “WISDM” dataset, concerning the first category of experiments, the accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV increased by 0.39%, 0.67%, 1.11%, 0.22%, 1.11%, 1.11%, 1.58%, 0.63%, and 0.22%, respectively. Concerning the second category, they increased by 0.27%, 0.49%, 0.81%, 0.16%, 0.81%, 0.81%, 1.23%, 0.47%, and 0.16%, respectively. Concerning the third category, the accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, NPV, and WSM increased by 0.19%, 0.34%, 0.53%, 0.11%, 0.58%, 0.56%, 0.1%, 0.34%, 0.12%, and 0.43%, respectively. For the “UCI-HAR” dataset, concerning the second category of experiments, the accuracy, balanced accuracy, precision, specificity, recall, F1-score, IoU, ROC, and NPV increased by 1.01%, 1.81%, 3.02%, 0.61%, 3.02%, 3.02%, 4.86%, 1.75%, and 0.61%, respectively. However, for the first and third categories of experiments, the reported metrics were not improved by applying the overlapping.

Why were the original datasets not used directly? The original datasets, especially the WISDM dataset, are large and time-demanding (i.e., the training process is time-consuming) and suffer from the imbalanced-data issue. Additionally, the time-series behaviour would not be employed. Why were the datasets resulting from the balancing phase not used directly? As with the original datasets, the balancing-resulting datasets are large and time-demanding; moreover, the time-series behaviour would again not be employed. Is TDA better than the traditional ML feature reduction techniques? From Tables 11, 14, 17, 20, and 23, TDA feature extraction is better than the traditional feature reduction techniques. Concerning the WSM value, this conclusion was reached because (1) for the WISDM dataset with 50% overlapping, TDA outperformed the best reported traditional technique by 3.77%, and (2) for the WISDM dataset with 0% overlapping, TDA outperformed the best reported traditional technique by 6.74%.

According to Tables 10, 13, 16, 19, and 22, why do the precision, recall, and F1-score have the same values, and likewise the specificity and NPV? As mentioned before, the micro-average is used as the averaging method. Under this method, if there is a false positive, there will always also be a false negative and vice versa, because exactly one class is predicted per sample. Hence, increasing only FP or FN but not both is not possible (i.e., the resulting FP and FN counts are equal). According to the equations mentioned in Table 7, precision, recall, and F1-score will therefore always have the same values; the same holds for the specificity and NPV values. Is the DL approach better than the traditional ML approach with the traditional feature reduction techniques? From Tables 11, 14, 17, 20, 24, 25, and 26, the DL approach achieved better WSM values than the traditional ML approach with the traditional feature reduction techniques. Is the DL approach better than the traditional ML approach with the TDA features? From Tables 23, 24, 25, and 26, the DL approach achieved better WSM values than the traditional ML approach with the TDA features. It is worth mentioning that, across the three categories of experiments, the DL algorithms applied to both datasets after the sampling step reported the best results overall.
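
The micro-averaging argument can be verified numerically. The short check below (using scikit-learn, with a hypothetical label vector) shows that micro-averaged precision, recall, and F1-score coincide in single-label multi-class classification:

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 1, 1, 0, 2]  # each mistake is an FP for one class
                                # and an FN for another

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                              average="micro")
print(p, r, f1)  # all three equal 5/7 = 0.714...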

5.6 Related studies comparison

The performance of the proposed algorithms was compared with that of the most recent studies. On both datasets, the suggested approach performed well in terms of classification. Table 27 shows a comparison between the results of the related studies and the presented approach on the two utilized datasets. On the WISDM dataset, the best-reported accuracy and F1-score of the proposed algorithm were 1.08% and 0.88% better than those of Teng’s layer-wise CNN approach. Concerning the UCI-HAR dataset, the accuracy of the proposed approach was 3.02% and 0.85% better than that of Teng’s layer-wise CNN and Wang’s hierarchical deep LSTM network, respectively.

Table 27 Comparison between the results of the current study and the related studies utilizing the same dataset(s)

6 Limitations

Although the current study presented the potential of using machine and deep learning models to perform the Human Activity Recognition (HAR) task, it has some limitations. The main limitation is runtime: the high-dimensional features consume a considerable amount of time in the classifier training stage. Additionally, only TDA feature extraction and five dimensionality reduction techniques were used. Also, grid search (GS) was not utilized with the suggested deep learning models. To overcome the imbalanced-data problem, only oversampling techniques were used.

7 Conclusions

Recently, HAR has earned a lot of interest and emerged as a promising research direction. It has a wide range of possible applications (e.g., intelligent assistance for elderly people and people suffering from cognitive disorders). In this research, a comprehensive analysis for recognizing human activities was conducted with the help of traditional feature reduction techniques, feature extraction, ML, and DL algorithms. The sensor-based data retrieved from two public datasets (i.e., WISDM and UCI-HAR) were used to train and evaluate different machine and deep learning models to recognize several human activities. Nine different oversampling techniques were utilized to deal with the problem of imbalanced data. Additionally, a sampling mechanism with two overlapping percentages (i.e., 50% and 0%) was applied to each balanced dataset to exploit the time-series nature of the data. For feature extraction and dimensionality reduction, five traditional techniques were applied (i.e., PCA, LDA, ICA, RP, and T-SVD), in addition to feature extraction using Topological Data Analysis (TDA). Seven machine learning algorithms were used, six of which are ensemble classifiers (i.e., LGBM, XGB, AdaBoost, HGB, RF, and ETs), plus DT. For the DL experiments, three algorithms were used (i.e., 1D-CNN, GRU, and BiLSTM).

Three categories of experiments were created. The first category was constructed using traditional feature reduction techniques and ML algorithms, while the second was conducted using TDA feature extraction and ML algorithms. For these two categories, grid search was used to perform the hyperparameter optimization process. For the third category, automatic feature extraction was performed using the three chosen DL algorithms. For the first category, the best-reported scores on the WISDM dataset are an accuracy, F1-score, recall, and precision of 93.92%, 81.76%, 81.76%, and 81.76%, respectively, achieved by the XGB classifier with RP as the feature reduction technique; on the UCI-HAR dataset, the best-reported accuracy, F1-score, recall, and precision are all 100%, achieved by the LGBM classifier with T-SVD as the feature reduction technique. For the second category, the best-reported scores on the WISDM dataset are an accuracy, F1-score, recall, and precision of 95.34%, 86.01%, 86.01%, and 86.01%, respectively, achieved by the LGBM classifier; on the UCI-HAR dataset, they are 96.70%, 90.09%, 90.09%, and 90.09%, respectively, also achieved by LGBM. For the third category, the best-reported scores on the WISDM dataset are an accuracy, F1-score, recall, and precision of 99.90%, 99.69%, 99.68%, and 99.70%, respectively, achieved by the 1D-CNN classifier; on the UCI-HAR dataset, all four scores are 100%, also achieved by the 1D-CNN classifier.

The utilized data were gathered through accelerometers worn on various body parts. The gathered information is a time series that depicts the acceleration along all three dimensions; hence, the data have two aspects (i.e., the time steps and the acceleration values along the three axes). In the current study, the used time-series data have a strong time locality that can be recovered by convolutions, hence the 1D-CNN performed the best. This is understandable given that a 1D convolution on a time series roughly computes a moving average (using terms from digital signal processing): it applies a filter to the time series, giving hints about the trend of the data. The reported results were then compared with six prior related works utilizing the same dataset(s). This comparison showed that the current study outperformed all the mentioned prior works.
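
A short numerical illustration of this moving-average view (with hypothetical values) is:

import numpy as np

x = np.array([1., 2., 8., 3., 4., 9., 5.])
kernel = np.ones(3) / 3  # uniform weights = 3-point moving average
print(np.convolve(x, kernel, mode="valid"))
# [3.667 4.333 5.    5.333 6.   ] -- a smoothed version hinting at the
# trend; a learned convolution kernel acts as a more general filter.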

7.1 Future work

In future studies, a meta-heuristic optimizer (e.g., the Aquila Optimizer or the Sparrow Search Algorithm) can be used to optimize the hyperparameters of the deep learning models. The experiments can be applied to different datasets with more features, such as the heart rate. Undersampling techniques can be tested and compared with the oversampling ones on the same datasets. To enhance the used datasets, additional augmentation methods, such as deep learning-based generative models (e.g., conditional generative adversarial networks and variational autoencoders), will be incorporated.