1 Introduction

Human motion analysis is a topic that receives much attention in robotics and medicine. Research on ambulatory activities is being conducted in rehabilitation science to improve the quality of life and context awareness in designing human-machine interfaces. For example, in [1], an intelligent system for elderly and disabled people is proposed where the user can communicate with a robot via gesture recognition and recognition of everyday activities. These technologies help monitor the health status of patients and older people. In [2], a multi-sensor system is proposed to allow continuous rehabilitation monitoring. Diagnosing diseases such as multiple sclerosis, Parkinson’s disease, and stroke [3] has been performed using human gait analysis.

Moreover, human gait has been utilized to develop indoor pedestrian navigation systems that can lead users to a specific area or track their daily activity level [4]. Multimodal systems have been designed for gait analysis for biometric applications [5]. Upper and lower limb motion analyses are also helpful for the development of prosthetic limbs for amputees [6].

Recognizing lower limb movements is essential for the daily care of the elderly, the weak, and the disabled. It is widely accepted that approaches for identifying lower limb movement can be divided into three types [7]: computer vision-based, ambient device-based, and wearable sensor-based. Computer vision-based types can monitor activities by analyzing video footage captured by cameras with multiple viewpoints placed at the desired location [8]. The implementation of computer vision-based technology is restricted by the space required to install the sensors [9]. The ambient device-based type provides for installing ambient sensors to measure the frequency of vibrations caused by regular activities for motion detection [10].

Nevertheless, activity monitoring can be severely affected by various environmental conditions. Aside from that, privacy concerns may arise with this approach [11]. The wearable sensor-based type uses multiple compact, wireless, and low-cost wearable sensor devices to record lower limb activity information [12]. The wearable sensor is suitable for outdoor use and compatible with the physical environment, and is primarily used for lower limb motion detection [13].

This work was motivated by the goal of developing a highly accurate method for recognizing lower limb movement that can extract useful information from inertial signals. To exploit the multi-dimensional information contained in the inertial signal, a multi-resolution convolutional neural network (MR-CNN) is introduced to extract high-level features and efficiently identify lower limb movements. The recognition performance of the proposed model is assessed using training and testing data from HARTH, a publicly available reference dataset. Finally, the evaluation metrics are compared with those of three baseline deep learning (DL) models.

The remainder of this article is structured as follows. Section 2 presents recent related work on DL approaches for lower limb movement. Section 3 describes in detail the multi-resolution CNN model utilized in this study. Section 4 presents our experimental results on a publicly available benchmark dataset and compares the outcomes of the proposed model with those of the baseline DL models. Section 5 concludes this work and identifies areas for potential future research.

2 Related Works

2.1 Types of Sensor Modalities

Even though many human activity recognition (HAR) techniques can be generalized to all sensor modalities, most are specialized and have a limited scope. Modalities can be divided into three categories: body-worn sensors, ambient sensors, and object sensors.

One of the most common HAR modalities is the use of body-worn sensors. Examples of body-worn sensors include gyroscopes, magnetometers, and accelerometers. These devices can collect information about human activity by analyzing angular velocity and acceleration variations. Several studies on DL for lower limb movements have used body-worn sensors; nevertheless, most studies have concentrated on the data gathered from accelerometers. Gyroscopes and magnetometers are commonly used in conjunction with accelerometers to detect lower limb movements [14]. Ambient sensors are often embedded in a user’s smart environment and consist of sound sensors, pressure sensors, temperature sensors, and radar. They are commonly used in data collection to study people’s interactions with their environment. The movement of objects can be measured with various object sensors, while ambient sensors can detect changes in the surrounding environment. Several research papers have investigated ambient sensors for HAR in activities of daily living (ADL) and hand movements [15]. Some experiments have used accelerometers or other sensors in combination with ambient sensors to optimize the HAR accuracy. This shows that adopting hybrid sensors that collect different types of data from multiple sources can considerably boost research in HAR and encourage applications such as commercial smart home systems [16].

2.2 Deep Learning Approaches

The challenges associated with feature extraction in conventional machine learning (ML) can potentially be solved by DL [17]. Figure 1 illustrates how DL can improve HAR performance using different network configurations. In DL, the features are extracted and the models are trained simultaneously. The network learns the features automatically instead of relying on manually hand-crafted features as in conventional ML approaches.

Fig. 1. DL-based HAR pipeline.

3 The Sensor-Based HAR Framework

The sensor-based HAR framework consists of four main processes: (1) data acquisition, (2) data pre-processing, (3) data generation, and (4) training models and classification, as shown in Fig. 2.

Fig. 2. The sensor-based HAR framework used in this work.

3.1 HARTH Dataset

The Human Activity Recognition Trondheim dataset (HARTH) is publicly available [18]. Twenty-two participants were recorded for 90 to 120 min during their regular working hours using two triaxial accelerometers attached to the lower back and thigh and a camera attached to the chest. Experts annotated the data independently using the camera’s video signal and labeled twelve activities. The data were collected with two triaxial Axivity AX3 accelerometers [19]. The AX3 is a compact sensor that weighs only 11 g. Its configurable parameters include the sampling rate (between 12.5 and 3,200 Hz), the measurement range (±2/4/8/16 g), and the resolution (up to 13 bits).

A total of twelve different types of physical activities were recorded for the dataset over two sessions. In the first session, 15 participants (six women) were asked to perform their daily activities as usual for 1.5 to 2 h while being recorded. They were asked to perform each of the following activities for a minimum of two to three minutes: sitting, standing, lying, walking, and running (including jogging). During this time, the two sensors collected acceleration data at a sampling rate of 100 Hz (later reduced to 50 Hz) and a measurement range of ±8 g. At the start of the recordings, each participant conducted three heel drops (i.e., dropped their heels firmly on the ground), which later assisted in synchronizing the acceleration and video signals. The duration of the first recording session was approximately 1,804 min (≈30 h). The average recording time was around 120 ± 21.6 min. After the recording was completed, videos were down-sampled to 640 × 360 pixels at a frame rate of 25 frames per second and annotated frame by frame. In addition to the five activities listed above, participants performed other activities, which were labeled as follows: climbing stairs (up), climbing stairs (down), shuffling (standing with leg movement), cycling (standing), cycling (sitting), transportation (sitting) (e.g., in a car), and transportation (standing) (e.g., in a bus). This resulted in a total of twelve different labels.

3.2 Data Pre-processing

In the data pre-processing step, the raw sensor data were processed in two stages: noise removal and normalization. In this work, an average smoothing filter was applied to the gyroscope and accelerometer signals in all three dimensions to remove noise. Then, the sensor data were normalized, which helps model learning by bringing all data values into a similar range. As a result, gradient descent can converge faster. Next, the normalized data were segmented using a sliding window with a fixed width of two seconds and an overlap of 50%.
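The pre-processing steps described above can be sketched as follows. This is a minimal illustration rather than the exact implementation used in this work: the filter length `k=5`, the z-score normalization scheme, and the synthetic six-channel input (two triaxial sensors) are assumptions, and the helper names are hypothetical. At 50 Hz, a two-second window with 50% overlap corresponds to 100 samples with a step of 50 samples.

```python
import numpy as np

FS_HZ = 50        # HARTH sampling rate after down-sampling
WINDOW_S = 2.0    # window width stated in the text
OVERLAP = 0.5     # 50% overlap stated in the text


def moving_average(signal, k=5):
    """Smooth each channel with a length-k moving average (k is an assumption)."""
    kernel = np.ones(k) / k
    # mode="same" keeps the number of samples unchanged
    return np.column_stack(
        [np.convolve(signal[:, c], kernel, mode="same")
         for c in range(signal.shape[1])]
    )


def zscore(signal):
    """Normalize each channel to zero mean and unit variance (assumed scheme)."""
    mean = signal.mean(axis=0)
    std = signal.std(axis=0)
    return (signal - mean) / np.where(std == 0, 1.0, std)


def sliding_windows(signal, fs=FS_HZ, window_s=WINDOW_S, overlap=OVERLAP):
    """Cut a (samples, channels) array into overlapping fixed-width windows."""
    width = int(round(window_s * fs))            # 100 samples
    step = int(round(width * (1.0 - overlap)))   # 50 samples
    starts = range(0, signal.shape[0] - width + 1, step)
    return np.stack([signal[s:s + width] for s in starts])


rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 6))   # synthetic stand-in: 20 s of 6-channel data
windows = sliding_windows(zscore(moving_average(raw)))
print(windows.shape)               # (19, 100, 6)
```

Each window would additionally need a single activity label (for example, the most frequent label within the window); that step is omitted here because the text does not specify it.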

3.3 The Proposed Multi-resolution CNN Model

The multi-resolution CNN is a convolutional neural network that applies filters with different kernel sizes within each layer, so that relevant information can be extracted at several temporal scales. Nafea et al. [16] demonstrated encouraging HAR results with multi-resolution modules based on the inception modules introduced by Szegedy et al. [20], which inspired us to investigate them in more detail. In contrast to the standard CNN practice of using only a single kernel size in a layer, multiple kernel sizes are used in parallel and their outputs are combined. As a result, a single layer extracts features at various scales. Figure 3 shows the proposed multi-resolution CNN.
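To illustrate the idea of combining several kernel sizes in one layer, the following NumPy sketch applies three parallel 1-D convolutions to one input window and concatenates their outputs. It is not the architecture of Fig. 3: the kernel sizes (3, 5, 7), the number of filters, and the random weights standing in for learned ones are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def conv1d_same(x, kernels):
    """Apply 1-D filters along time with zero padding ("same" output length).

    x:       (time, in_channels)
    kernels: (kernel_size, in_channels, out_channels)
    returns: (time, out_channels)
    """
    k = kernels.shape[0]
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)))
    out = np.zeros((x.shape[0], kernels.shape[2]))
    for t in range(x.shape[0]):
        # weighted sum over the k time steps and all input channels
        out[t] = np.einsum("kc,kco->o", padded[t:t + k], kernels)
    return out


def multi_resolution_block(x, kernel_sizes=(3, 5, 7), filters=8, seed=0):
    """Run one convolution branch per kernel size and concatenate the results."""
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernel_sizes:
        # random weights stand in for weights that would be learned
        w = rng.normal(scale=0.1, size=(k, x.shape[1], filters))
        branches.append(np.maximum(conv1d_same(x, w), 0.0))  # ReLU activation
    return np.concatenate(branches, axis=1)


window = np.random.default_rng(1).normal(size=(100, 6))  # one 2 s window
features = multi_resolution_block(window)
print(features.shape)  # (100, 24)
```

Because every branch keeps the time dimension unchanged, the branch outputs can be stacked along the channel axis, which gives the following layer access to short-range and longer-range patterns at the same time step.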

3.4 Performance Measurement Criteria

Four standard evaluation metrics, namely accuracy, precision, recall, and F1-score, are calculated using 5-fold cross-validation to evaluate the effectiveness of the suggested DL model. The mathematical formulas for the four metrics are given below:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} $$
$$ \text{Precision} = \frac{TP}{TP + FP} \tag{2} $$
$$ \text{Recall} = \frac{TP}{TP + FN} \tag{3} $$
$$ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} $$

These four metrics were used to quantify the effectiveness of HAR. For the class under consideration, a correct recognition counts as a true positive (TP), and a correct recognition of any other class counts as a true negative (TN). Sensor data from another class that are misclassified as the class under consideration yield a false positive (FP). Sensor data from the class under consideration that are misclassified as another class yield a false negative (FN).
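As a worked example of Eqs. (1) to (4), the sketch below derives the one-vs-rest counts TP, FP, FN, and TN for every class from a confusion matrix and evaluates the four metrics. The 3-class confusion matrix is a made-up toy example, not a result from this study, and `per_class_metrics` is a hypothetical helper name.

```python
import numpy as np


def per_class_metrics(confusion):
    """One-vs-rest metrics from a confusion matrix (rows: true, cols: predicted).

    Assumes every class is predicted and present at least once, so that no
    denominator is zero.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp   # predicted as the class, belongs elsewhere
    fn = confusion.sum(axis=1) - tp   # belongs to the class, predicted elsewhere
    tn = confusion.sum() - tp - fp - fn
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (1)
    precision = tp / (tp + fp)                            # Eq. (2)
    recall = tp / (tp + fn)                               # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (4)
    return accuracy, precision, recall, f1


toy = [[8, 2, 0],
       [1, 7, 2],
       [0, 1, 9]]
acc, prec, rec, f1 = per_class_metrics(toy)
print(np.round(rec, 3))   # [0.8 0.7 0.9]
print(np.round(f1, 3))    # [0.842 0.7   0.857]
```

For the first class, TP = 8, FP = 1, FN = 2, and TN = 19, which gives a precision of 8/9, a recall of 8/10, and an F1-score of 16/19 ≈ 0.842.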

4 Experiments and Results

This section describes the experimental setup and presents the experimental results for three baseline DL models (CNN, LSTM, and CNN-LSTM) and the proposed multi-resolution CNN.

4.1 Experiments

All experiments were conducted on Google Colab Pro with a Tesla V100 GPU. The experiments were implemented in Python 3.6.9 with TensorFlow 2.2.0 and Keras 2.3.1. NumPy 1.18.5 was used to work with matrices, Pandas 1.0.5 was used to work with CSV files, and Scikit-Learn was used to divide examples evenly by class among the training, validation, and testing datasets.
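Dividing examples evenly by class across folds can be sketched as below. The text states only that Scikit-Learn was used for this purpose, without naming a function, so this NumPy-only helper (`stratified_folds`, a hypothetical name) is an illustrative stand-in and not the code used in the experiments; Scikit-Learn provides `StratifiedKFold` for the same task. The class sizes in the example are made up.

```python
import numpy as np


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each example a fold id so every fold keeps the class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(labels.shape[0], dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # deal the shuffled examples of this class round-robin across folds
        fold_of[idx] = np.arange(idx.size) % n_splits
    return fold_of


labels = np.repeat([0, 1, 2], [50, 30, 20])   # toy labels with imbalanced classes
fold_of = stratified_folds(labels)
for fold in range(5):
    test_idx = np.flatnonzero(fold_of == fold)
    train_idx = np.flatnonzero(fold_of != fold)
    # a model would be trained on train_idx and evaluated on test_idx here
    print(fold, np.bincount(labels[test_idx]))   # every fold holds [10 6 4]
```

Averaging the metrics over the five held-out folds then gives the cross-validated scores.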

4.2 Experimental Results

The recognition performance of the DL models on wearable sensor data is shown in Table 1. According to the experimental results, the proposed MR-CNN model achieved the highest performance, with an F1-score of 94.76%.

Table 1. Performance metrics of DL models using sensor data.

Table 2 shows the classification results obtained with the MR-CNN. For the sitting activities in the HARTH dataset, the MR-CNN model achieved an F1-score of 1.00, as these activities do not involve movement. The walking and running activities were also recognized well, with F1-score values greater than 0.95.

Fig. 3. A detailed description of the proposed multi-resolution CNN.

Table 2. Performance metrics of DL models using sensor data of lower limb movement.

5 Conclusions

This research proposed a new architecture that uses multiple convolutional layers with different kernel dimensions to extract features at different resolutions. The proposed multi-resolution convolutional neural network (MR-CNN) model outperformed the baseline DL models on the public HARTH dataset without relying on hand-crafted features. The experimental results show that the MR-CNN model achieved the highest performance in activity differentiation, with an F1-score of 94.76%.

In our future work, we intend to apply other types of DL networks, including ResNeXt, InceptionTime, and Temporal Transformer, to heterogeneous human activity recognition. Moreover, data augmentation is a promising technique for improving models trained on imbalanced datasets and could be applied to this problem.