Introduction

Human activity recognition [1] is essential largely because of its effectiveness in safety and security applications. These technologies can be used to monitor potential safety hazards in the workplace, such as workers operating machinery without proper safety precautions. They can also be used in public spaces to detect suspicious behaviour and potential security threats. Additionally, human activity recognition can be used in healthcare to monitor patients with physical or cognitive impairments, allowing for early intervention if necessary. These technologies can also be used in sports and fitness to track and analyze movement [2] patterns, allowing for improved training and injury prevention, as well as for surveillance and security [3]. Human activity recognition research aims to develop algorithms [4] and systems that can automatically recognize, categorize, and understand human behaviour and actions through various datasets. Biometric identification in human activity recognition [5] can be made using physiological or behavioural traits such as walking, jogging, jumping, kicking, sitting down, and so on [6, 7]. With a growing population, many researchers have proposed automated human activity recognition systems that recognize human activity under various conditions, such as healthcare monitoring systems, intelligent home monitoring, and many other Internet of Things-based monitoring systems [8, 9] containing sensors, actuators, etc. Systems for HAR (human activity recognition) are categorized into two classes, sensor-based and vision-based, as depicted in Fig. 1.

Fig. 1 Human activity recognition approaches

In computer vision, routine human actions are captured by cameras and video-based systems, with automated recognition based on image sequencing. Due to advancements in microelectronics and computer systems, there has been significant development in low-power, high-capacity, inexpensive sensors and in wired and wireless communication networks [10, 11]. Although the video-based method frequently yields good results indoors, its precision can degrade outdoors or in realistic conditions [12, 13]. Wearable sensors can monitor physiological characteristics, making them easier to measure. Unlike environmental or video sensors, wearable sensors are attached [14] to the monitored subject and do not depend on external infrastructure.

Difficulties in Human Activity Recognition

Researchers have reported several issues, such as extracting features from the recorded databases, constructing models, and identifying and classifying distinct activities. This research aims to identify human activities in the BML (Bio Motion Lab) video database using various human characteristics. Various sensors [15] are utilized in HAR to obtain raw data, and HAR remains an active and popular research area. These sensors are significant in this domain, and picking the best sensors in the right way can be challenging, as becomes clear after reading about several sensing technologies and examining thirty-five publications from the last decade. Three main categories of sensors are studied: wearable sensor devices such as accelerometers, gyroscopes, magnetometers, and GPS; video sensor devices such as cameras fixed in one place to detect actions; and environmental and radio-based sensors, which detect user interaction with the environment by sensing radio signals such as Bluetooth and Wi-Fi [16, 17], along with infrared sensors, also known as infrared cameras. While preparing the dataset of different volunteers on ideal and unstructured surfaces using video- and sensor-based approaches, various challenges and difficulties were observed, as depicted in Fig. 2, which shows seven general difficulties and challenges such as dissimilarity of age, similar postures of subjects, and composite human activities.

Fig. 2 Challenges in video-based and sensor-based HAR

The authors employed distinct methods, such as segmentation, feature extraction, and visualization, to accomplish activity recognition [18] and to evaluate algorithms for recognizing human activity in terms of the challenges and complexity of activity recognition using sensors. The difficulty of the actions can vary and depends on various factors, such as the number and kinds of activity states, the sensors used, and the protocols used for data gathering. Locomotor activities are categorized into three categories: stationary activity, kinetic activity, and activity involving postural changes, as shown in Fig. 3.

Fig. 3 Categorization of human activity, including the sub-categories

Compared to kinetic activities (running, walking, etc.), static activities (sleeping, sitting, etc.) and activities involving postural changes are easier to identify. However, due to significant feature-space overlap, separating broadly similar postures, such as standing and sitting, presents significant challenges.

Moreover, dynamic activities such as walking upstairs and walking downstairs, with high feature-space similarity, are also challenging to distinguish due to comparable movement patterns. In addition, actions often do not align neatly with each other over the whole activity period, making recognition even more difficult. For instance, while sitting and standing are closely related (and therefore challenging to distinguish), walking is considerably different from both and is easily separable. Figure 3 presents the three main types of human activity [19] with their sub-categories.

The authors of [20] proposed a method based on deep recurrent neural networks that achieves high throughput on raw accelerometer data and faster recognition times than simpler techniques. Reference [21] suggested combining Random Forest classifiers with a post-processing method called the Mode approach for categorizing locomotion and transportation activities.

Objective of the Research

The primary goal of the present work is to offer a method for Human Activity Recognition (HAR): an analysis of deep learning algorithms for HAR, together with the extraction of the most important features from Bio Motion Lab's raw MoVi dataset.

Organization of the Paper

The remainder of this article is structured as follows: the related work is reviewed in the next section. The proposed method and the calculation of measuring parameters are covered in “Proposed Method”. Result evaluation and discussion, with the experimental setup and data preparation, are included in “Result Evaluation and Discussion”. “Conclusion and Future Work” addresses the conclusion and future work of the research article.

Related Work

Many deep learning-based techniques have been presented for human activity identification in the last decade. The authors of [22] presented deep learning techniques and evaluated them on three published datasets; the GRU-based approach produced the best result for human activity classification. The authors of [16] proposed a Hidden Markov Model (HMM) to identify human activity; the HMM is one of the more powerful statistical techniques. They used an available dataset to compute performance parameters such as accuracy and precision.

The authors of [23] provided and demonstrated a 3DCNN + LSTM framework to recognize different human activities, evaluated experimental results on available public datasets, and compared them with existing results.

The authors of [24] suggested a novel method for early detection of health-related issues based on human activity or motion recognition patterns. The initial goal is activity detection using motion patterns and deep learning methods. The suggested technique's architecture consists of pre-processing, feature engineering, and classification layers, with a CNN performing the activity classification. A publicly accessible dataset called Opportunity was used to train and test the proposed approach, which achieved an improved accuracy rate of 88.57%. Applications of this method include smart homes, intelligent monitoring devices, virtual healthcare, and health advisors.

The authors of [25] investigated whether the HAR problem can be solved using machine learning and deep learning techniques, including conventional dimensionality reduction and topological data analysis (TDA) feature extraction methods. The experiments were carried out on the WISDM and UCI-HAR datasets, and various data balancing approaches were used to address the issue of imbalanced datasets. HAR was carried out by seven machine learning (ML) algorithms, alongside deep learning techniques including 1DCNN, BiLSTM, and GRU. Three categories of experiments were run: DL, ML with TDA, and ML with standard features. For the first category, the best reported accuracy and WSM scores are 99.10% and 86.61% on the WISDM dataset and 100% and 100% on the UCI-HAR dataset, respectively. For the second category, the best reported accuracy and WSM are 95.34% and 89.62% on WISDM and 96.70% and 92.57% on UCI-HAR, respectively. For the third category, the best reported accuracy and WSM are 99.90% and 99.76% on WISDM and 100% and 100% on UCI-HAR, respectively. Finally, the proposed method is compared with existing research on the same datasets.

The authors of [26] experimented with classifying and predicting human activities using a supervised machine learning technique (XGBoost), which is the goal of that research paper. The report shows a precision of 97%, a recall of 97%, and a classification accuracy of 97% during the categorization phase. The new model runs quickly and produces highly accurate results compared to previous models.

The authors of [27] provided an overview of human activity recognition while contrasting and analyzing current studies and measures. The methods recognize and classify abnormal activity using CNN and LSTM approaches. A new hybrid deep learning structure combining CNN and LSTM was suggested to fuse the extracted features: first, a CNN pre-processed the video and retrieved the visual elements; then an LSTM learned the temporal features of those visual elements, and an attention mechanism was added to help choose the most crucial elements. An experiment on the standard UMN dataset assessed the model's capacity to detect abnormality.

The authors of [7] provided an overview of extracting and predicting human body motions, frequently occurring indoors, using embedded hardware such as a camera or sensor device, a task known as “human activity recognition.” Before the advent of smartphones and other personal wearable devices with accelerometer-based sensors that monitor our movements, data collection from sensors was quite expensive. HAR has attracted strong interest as a classification method because it allows different movements of the human body, such as sitting, running, jumping, and jogging, to be identified using wearable sensors like accelerometers and gyroscopes and techniques like convolutional neural networks and other deep learning methods. This review examines various human actions, subjects, and techniques for identifying human activity and body posture.

The authors of [28] observed that modern technology can detect, recognize, and monitor human actions thanks to the widespread usage of HAR and sensor-based data. Even though numerous studies and reviews on human activity recognition had already been printed, the picture of the HAR literature needed updating. As a result, their review sheds light on the state of the HAR literature in publications made after 2018. To emphasize application domains, data origins, methodologies, and open investigation issues in human activity recognition, the 95 articles assessed for the study were divided into different categories. Daily living activities have received most of the attention in the literature, followed by user activities centred on individual and group-based activities. Yet more research must be done on real-time tasks, including surveillance, healthcare, and suspicious activity. Previous studies have extensively used data from mobile sensors and closed-circuit television (CCTV) videos. The most popular methods for HAR in the literature examined are CNN and LSTM deep learning and the support vector machine learning technique.

A comparison of the existing research and the current investigation is shown in Table 1.

Table 1 Comparison of the existing study with the proposed investigation

Proposed Method

Deep learning (DL) is a sub-field of machine learning (ML) that uses artificial neural networks (ANN) to learn from data. Neural networks are built from various layers: the input layer receives the data, the output layer produces the prediction or classification, and between them one or more hidden layers transform the inputs. In this paper, two hybrid deep learning approaches are proposed to classify various human activities; each is defined in turn, and the better of the two is determined from the evaluated results.

ConvNet, or convolutional neural network (CNN), is a class of deep neural networks most commonly applied to analyzing visual imagery datasets. ConvNet designs consist of five elements: an input phase, a convolutional phase, a pooling phase, a fully connected phase, and an output phase. A basic rule for ConvNet architectures is to apply convolutional layers to the input in succession, periodically down-sample the spatial dimensions, and use pooling layers to reduce the number of feature maps [34]. Fully connected layers (FCL) in a neural network are those where each activation unit in one layer is connected to every unit in the next layer.

The proposed method modifies the existing ConvNet model by adding extra layers after the existing ConvNet architecture layers. These additional layers are not part of the standard ConvNet architecture, as shown in Fig. 4.

Fig. 4 Proposed architecture of ConvNet model
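As a rough illustration of this pattern, the following Keras sketch stacks per-frame convolution and pooling layers followed by fully connected layers; the filter counts, kernel sizes, and layer depth are assumptions for illustration, not the exact configuration of Fig. 4.

```python
# Minimal sketch of a frame-wise ConvNet classifier in Keras.
# Filter counts, kernel sizes, and layer depth are illustrative
# assumptions, not the exact configuration from the paper.
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH, CHANNELS = 20, 128, 128, 3
NUM_CLASSES = 20  # twenty selected MoVi activities

def build_convnet():
    model = models.Sequential([
        # Convolution + pooling applied independently to each frame
        layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation='relu'),
                               input_shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS)),
        layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
        layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation='relu')),
        layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
        # Flatten across all frames, then classify with fully connected layers
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(NUM_CLASSES, activation='softmax'),
    ])
    return model

model = build_convnet()
model.summary()
```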

The long short-term memory (LSTM) network is often used for modelling long-term dependencies. ConvLSTM2D [35] is an artificial neural network component that gives a time-series character to convolutional layers, resulting in a model that captures both long- and short-term dependencies. A distinctive characteristic of the ConvLSTM, and of our architecture, is that all of its inputs X_iput, cell outputs C_out_t, hidden states Hidden_t, and gates I_gate, Forget, and OP_gate are tensors whose last two dimensions are spatial (rows and columns). We can visualize the inputs and states as vectors standing on a spatial grid. Using the inputs and previous states of its local neighbours, the ConvLSTM predicts the future state of each cell in the grid; this is achieved simply by using a convolution operator in the state-to-state and input-to-state transitions. The main equations of the ConvLSTM are displayed in Eqs. (1)–(5) [36], where ‘∗’ stands for the convolution operator and ‘◦’ for the Hadamard product.

$$\text{I\_gate}_{t} = \sigma \left( W_{\text{xi}} * X_{t} + W_{\text{hi}} * \text{Hidden}_{t-1} + W_{\text{ci}} \circ \text{C\_out}_{t-1} + b_{i} \right)$$
(1)
$$\text{Forget}_{t} = \sigma \left( W_{\text{xf}} * X_{t} + W_{\text{hf}} * \text{Hidden}_{t-1} + W_{\text{cf}} \circ \text{C\_out}_{t-1} + b_{f} \right)$$
(2)
$$\text{C\_out}_{t} = \text{Forget}_{t} \circ \text{C\_out}_{t-1} + \text{I\_gate}_{t} \circ \tanh \left( W_{\text{xc}} * X_{t} + W_{\text{hc}} * \text{Hidden}_{t-1} + b_{c} \right)$$
(3)
$$\text{OP\_gate}_{t} = \sigma \left( W_{\text{xo}} * X_{t} + W_{\text{ho}} * \text{Hidden}_{t-1} + W_{\text{co}} \circ \text{C\_out}_{t} + b_{o} \right)$$
(4)
$$\text{Hidden}_{t} = \text{OP\_gate}_{t} \circ \tanh \left( \text{C\_out}_{t} \right)$$
(5)
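To make the gating in Eqs. (1)–(5) concrete, the following minimal NumPy sketch evaluates one ConvLSTM time step for a single-channel grid with single-filter weights; this simplification (one channel, one filter, random weights) is ours for illustration, whereas a real ConvLSTM layer carries many channels and filters.

```python
# One ConvLSTM cell step, Eqs. (1)-(5), for a single-channel H x W grid.
# '*' is a 2D 'same' convolution; the Hadamard product is elementwise.
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv(x, w):
    # Input-to-state / state-to-state transition as a 'same' convolution
    return convolve2d(x, w, mode='same')

def convlstm_step(X_t, H_prev, C_prev, W, b):
    # Eq. (1): input gate
    I_gate = sigmoid(conv(X_t, W['xi']) + conv(H_prev, W['hi']) + W['ci'] * C_prev + b['i'])
    # Eq. (2): forget gate
    Forget = sigmoid(conv(X_t, W['xf']) + conv(H_prev, W['hf']) + W['cf'] * C_prev + b['f'])
    # Eq. (3): new cell state
    C_t = Forget * C_prev + I_gate * np.tanh(conv(X_t, W['xc']) + conv(H_prev, W['hc']) + b['c'])
    # Eq. (4): output gate (uses the updated cell state)
    OP_gate = sigmoid(conv(X_t, W['xo']) + conv(H_prev, W['ho']) + W['co'] * C_t + b['o'])
    # Eq. (5): new hidden state
    H_t = OP_gate * np.tanh(C_t)
    return H_t, C_t

rng = np.random.default_rng(0)
H, W_, K = 8, 8, 3                 # spatial grid size and kernel size
conv_names = ['xi', 'hi', 'xf', 'hf', 'xc', 'hc', 'xo', 'ho']
W = {n: rng.normal(size=(K, K)) for n in conv_names}
W.update({n: rng.normal(size=(H, W_)) for n in ['ci', 'cf', 'co']})  # Hadamard weights
b = {n: 0.0 for n in 'ifco'}
X_t = rng.normal(size=(H, W_))
H_t, C_t = convlstm_step(X_t, np.zeros((H, W_)), np.zeros((H, W_)), W, b)
print(H_t.shape)  # (8, 8): one hidden state per grid cell
```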

If we consider the states as hidden representations of moving objects, a ConvLSTM [37, 38] with a larger transition kernel should capture faster motions, while one with a smaller kernel can capture slower motions. On the BML dataset, we compared our ConvLSTM network to the ConvNet network to better understand the behaviour of our model, running the model with different numbers of layers and kernel values.

The ConvLSTM2D architecture [39] combines the gating of the LSTM layer with a 2D convolutional layer architecture. ConvLSTM layers perform a task similar to the LSTM [31, 40, 41], but instead of matrix multiplications they use convolution operations and retain the spatial dimensions of the input [35].

The input of the Keras ConvLSTM layer is a 5-dimensional tensor with structure (samples, time, channels, rows, cols) if channels come first, or (samples, time, rows, cols, channels) if channels come last. The proposed ConvLSTM deep learning approach uses channels-last ConvLSTM layers, with the “data_format” parameter set to “channels_last”.

The output of the ConvLSTM layer is a 5D tensor with structure (samples, time, filters, rows, cols) if return_sequences = True, and a 4D tensor with structure (samples, filters, rows, cols) if return_sequences = False. This ConvLSTM deep learning approach uses return_sequences = True in the ConvLSTM layer implementation. The model's input has the shape (None, 20, 128, 128, 3), with 20 frames extracted from each video and each frame of size 128 × 128 in RGB format with three channels. The architecture is depicted in Fig. 5, and the data sampling criteria are presented in Table 2.
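A minimal Keras sketch consistent with these shapes is given below; the filter counts, kernel sizes, and pooling layers are illustrative assumptions rather than the exact configuration of Fig. 5.

```python
# Sketch of a ConvLSTM2D classifier with channels-last, 5D input
# (samples, time, rows, cols, channels) = (None, 20, 128, 128, 3).
# Filter counts, kernel sizes, and pooling are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.ConvLSTM2D(filters=8, kernel_size=(3, 3),
                      data_format='channels_last',
                      return_sequences=True,           # keep the 5D time dimension
                      input_shape=(20, 128, 128, 3)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),          # downsample rows/cols only
    layers.ConvLSTM2D(filters=16, kernel_size=(3, 3),
                      data_format='channels_last',
                      return_sequences=True),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Flatten(),
    layers.Dense(20, activation='softmax'),            # twenty activity classes
])
model.summary()
```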

Fig. 5 Proposed architecture of ConvLSTM2D

Table 2 Sampling criteria of data

Algorithm

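A minimal sketch of the data preparation implied by Table 2 is shown below: 20 evenly spaced frames per video, resized to 128 × 128 and normalized to [0, 1]. The OpenCV usage and all helper names are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of the data preparation described above: 20 evenly
# spaced frames per video, resized to 128 x 128, normalized to [0, 1].
# OpenCV usage and all names here are illustrative assumptions.
import cv2
import numpy as np

SEQ_LEN, SIZE = 20, 128

def extract_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // SEQ_LEN, 1)          # evenly spaced sampling
    frames = []
    for i in range(SEQ_LEN):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * step)
        ok, frame = cap.read()
        if not ok:
            break                             # short videos yield fewer frames
        frame = cv2.resize(frame, (SIZE, SIZE))
        frames.append(frame.astype('float32') / 255.0)  # normalize to [0, 1]
    cap.release()
    return np.asarray(frames)                 # shape (20, 128, 128, 3)

# An 80/20 split could then be made over the stacked clips X and labels y,
# e.g. with sklearn.model_selection.train_test_split(X, y, test_size=0.2).
```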

These deep learning architectures and concepts are relevant to tracking and recognizing human activities. Using these architectures and techniques makes it possible to achieve high accuracy in recognizing and classifying complex human activities from raw sensor data. Finally, ConvLSTM2D handles the BML dataset better than ConvNet, delivering significantly better results.

Calculation of Measuring Parameters

Equations 6–9 describe how the evaluation of results depends on different performance measures. True Positive: the deep learning model accurately [42, 43] identified the correct activity label for the test sample. True Negative: the model correctly excluded the test sample from a specific label. False Positive: the model incorrectly predicted a label for a test sample that does not carry that label. False Negative: the model failed to match a predicted sample to its original label [44].

$$\text{Accuracy} = \frac{\text{True}_{\text{Positive}} + \text{True}_{\text{Negative}}}{\text{True}_{\text{Positive}} + \text{True}_{\text{Negative}} + \text{False}_{\text{Positive}} + \text{False}_{\text{Negative}}}$$
(6)
$$\text{Measurement\_Precision} = \frac{\text{True}_{\text{Positive}}}{\text{True}_{\text{Positive}} + \text{False}_{\text{Positive}}}$$
(7)
$$\text{Measurement\_Recall} = \frac{\text{True}_{\text{Positive}}}{\text{True}_{\text{Positive}} + \text{False}_{\text{Negative}}}$$
(8)
$$\text{Measurement\_F1\_Score} = 2 \times \frac{\text{Measurement\_Precision} \times \text{Measurement\_Recall}}{\text{Measurement\_Precision} + \text{Measurement\_Recall}}$$
(9)
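As a quick worked check of Eqs. (6)–(9), the following sketch computes the four measures from a toy set of counts (the counts are made up for illustration):

```python
# Toy check of Eqs. (6)-(9) with made-up counts (illustrative only).
tp, tn, fp, fn = 90, 85, 10, 15

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (6)
precision = tp / (tp + fp)                                  # Eq. (7)
recall    = tp / (tp + fn)                                  # Eq. (8)
f1_score  = 2 * precision * recall / (precision + recall)   # Eq. (9)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, F1={f1_score:.3f}")
# accuracy=0.875, precision=0.900, recall=0.857, F1=0.878
```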

In convolutional neural networks (CNNs), the cross-entropy loss (logistic loss) is a regularly used loss function that measures the gap between the predicted probability distribution and the true probability distribution of the output. The goal of training the CNN is to minimize the cross-entropy loss; the smaller the loss, the better the model, and a perfect model has zero cross-entropy loss. In the case of the ConvLSTM2D technique, a variant of the LSTM (long short-term memory) model that includes convolutional layers, the cross-entropy loss can likewise be used to train the network for classification tasks such as image or video classification. Equations 10 and 11 describe how the model loss is evaluated.

$$\text{Measurement of loss} = -\sum_{k=1}^{n} X_{k} \log(\text{softmax\_Prob}_{k})$$
(10)
$$\text{Measurement of loss} = -\left[ X_{k} \log(\text{softmax\_Prob}_{k}) + (1 - X_{k}) \log(1 - \text{softmax\_Prob}_{k}) \right]$$
(11)

where n is the number of classes, X_k is the true indicator for class k, and softmax_Prob_k is the softmax probability of the kth class; Eq. (11) is the two-class (binary) special case of Eq. (10).
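For example, with a one-hot target over n = 3 classes, Eq. (10) reduces to the negative log of the probability assigned to the true class, as this small NumPy sketch (with made-up probabilities) shows:

```python
# Categorical cross-entropy, Eq. (10), for a one-hot target (illustrative).
import numpy as np

X = np.array([0.0, 1.0, 0.0])                 # one-hot target, true class k = 2
softmax_prob = np.array([0.1, 0.7, 0.2])      # model's softmax output

loss = -np.sum(X * np.log(softmax_prob))      # Eq. (10)
print(loss)  # 0.3567 = -log(0.7); a perfect model (prob 1.0) would give loss 0
```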

Result Evaluation and Discussion

This section explains the database utilized, the results of the experiments, and a comparison with other HAR systems.

Experiment Setup

The experimental setup uses Google Colaboratory with Python version 3.7.15 to run the Python scripts. TensorFlow 2.9.2 and Keras 2.9.0 are the primary libraries for building the deep learning neural network model, and GPU support in Google Colab is used to speed up the training and testing procedures. Google Colab provides an Intel® Xeon® CPU @ 2.00 GHz with a total RAM of 12 GB and runs a Linux-based operating system. The Tesla T4 GPU provided by Colab supports CUDA version 11.2 and 15 GB of graphics memory, which reduces the time needed for training and testing our deep learning model.
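As a hedged sketch, an environment of this kind can be verified in Colab as follows (the printed values will vary with the runtime assigned):

```python
# Quick environment check in Google Colab (illustrative; values vary).
import sys
import tensorflow as tf

print(sys.version)                             # Python 3.7.x in the reported setup
print(tf.__version__)                          # TensorFlow 2.9.2 in the reported setup
print(tf.config.list_physical_devices('GPU'))  # e.g. a Tesla T4 when GPU runtime is enabled
```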

Dataset and Result Analysis

The dataset used in this experiment is the BML MoVi dataset produced by BIO MOTION LAB. It contains synchronized poses, body meshes, and video recordings. The dataset has videos of 44 different activities, captured from four distinct points of view, of 60 female and 30 male candidates performing day-to-day actions and sporting acts such as kicking, sitting down, and walking, as shown in Table 3. Twenty of the 44 activities were selected; each selected activity has more than 50 videos available for better training of the model [2], and Table 4 presents the full details of the MoVi dataset. Stick-figure video recordings of each participant performing each activity were made from the available sensor movement data and then fed into our deep learning model for activity classification. After pre-processing, the human activity dataset is partitioned into two parts, of which 80% is used for model training and 20% for model testing. The general measures (accuracy, precision, recall, and F1_score) of the suggested approach are evaluated by combining different optimizers, namely Adam, Adamax, RMSprop, and Nadam, as shown in Figs. 6 and 7 and Table 1, which clearly show the increased accuracy of ConvLSTM2D with RMSprop: 97% during training and 82% during testing, greater than the proposed ConvNet approach with the various optimizers.

Table 3 Description of the forty-four activities, including the videos of each activity
Table 4 The databases that were used in this paper
Fig. 6 Histogram of the optimizers' accuracy during training and testing

Fig. 7 Percentage of accuracy (training and testing) of the ConvLSTM2D model using the RMSprop optimizer

Using ConvNet, we achieved a maximum accuracy of 84%, but with ConvLSTM2D we achieved an average accuracy of 97% using the RMSprop optimizer.

In implementing the ConvLSTM2D model, we used various optimizers to achieve better results on the various measurements (accuracy, precision, recall, and F1_Score); a sketch of such a comparison loop is given below. The results of this analysis are presented in Figs. 8 and 9.
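The following minimal sketch illustrates an optimizer comparison of this kind; the tiny stand-in model, dummy data, epochs, and batch size are illustrative assumptions, not the actual training configuration.

```python
# Hedged sketch of an optimizer comparison over Adam, Adamax,
# RMSprop, and Nadam. The tiny model and dummy data are stand-ins.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam, Adamax, RMSprop, Nadam

def make_model():
    # Reduced input size (32 x 32) keeps this demo fast; the real model uses 128 x 128
    return models.Sequential([
        layers.ConvLSTM2D(4, (3, 3), input_shape=(20, 32, 32, 3)),
        layers.Flatten(),
        layers.Dense(20, activation='softmax'),
    ])

X = np.random.rand(8, 20, 32, 32, 3).astype('float32')   # dummy video clips
y = np.eye(20)[np.random.randint(0, 20, 8)]               # dummy one-hot labels

for name, opt in [('Adam', Adam()), ('Adamax', Adamax()),
                  ('RMSprop', RMSprop()), ('Nadam', Nadam())]:
    model = make_model()
    model.compile(optimizer=opt, loss='categorical_crossentropy',
                  metrics=['accuracy'])
    hist = model.fit(X, y, epochs=1, batch_size=4, verbose=0)
    print(name, hist.history['accuracy'][-1])
```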

Fig. 8 Accuracy and loss represented as line graphs for each optimizer

Fig. 9 Line and histogram plots for training accuracy and loss, training maximum accuracy and minimum loss, validation accuracy and loss, and validation maximum accuracy and minimum loss

Figure 10a represents the confusion matrix for all 20 activities in the ConvLSTM2D model (training), and Fig. 10b represents the confusion matrix for all 20 activities in the ConvLSTM2D model (testing).

Fig. 10 a Training data (confusion matrix). b Testing data (confusion matrix)

The results are obtained using the video dataset of each activity. The model accuracies achieved with the various optimizers (Adam, Adamax, RMSprop, and Nadam) are displayed in Table 5, and the corresponding precision, recall, and F1_score values are shown in Table 6.

Table 5 Comparative outcome for the proposed model using different optimizers
Table 6 Comparative outcomes of measuring parameters for the proposed model

Conclusion and Future Work

This article presents deep learning techniques for tracking and recognizing human activities. The extraction and recognition of features from 20 human activities are investigated, and the proposed methods are tested on the MoVi dataset, which contains 60 female and 30 male volunteers. Participants were randomly chosen when creating the dataset to test the strategy's effectiveness. The proposed technique can deal with different terrain, inclination, and viewpoint challenges. The final accuracy achieved by the ConvLSTM2D network is 97%, while the measuring parameters (precision, recall, and F1_score) still need to be improved. The proposed techniques address the problem of interclass variability, and the points below summarize the novel contributions of the research work.

  1. It is necessary to reduce the computational cost of models regarding memory, CPU, sensors, and battery utilization.

  2. The proposed approach has achieved the best results for recognition accuracy, precision, and resource utilization.

  3. A classifier-based approach can also be used to accurately identify highly similar activities, such as standing and sitting, or walking, walking upstairs, and walking downstairs. Most previous investigations have struggled to distinguish such similar activities.

  4. Future investigation can address integrating sensory, video, and similar activity data with a shorter execution time than the existing method.