1 Introduction

In recent years, wearable robots have shown great promise in physically assisting human locomotion [1]. In particular, lower-limb exoskeletons (LLEs) are attracting increasing public attention. Although wearable technology is promising, its control systems need further development. Humans and robots must work together to perform repetitive activities in applications such as robot-assisted rehabilitation and leg exoskeletons [2, 3]. The human–robot system can pose serious safety problems: even an occasional wrong action while wearing an LLE can cause irreversible damage to the human body [4]. Therefore, automatic recognition of the current state of human movement is a prerequisite for LLEs.

However, most current commercial exoskeletons, such as the Össur Power Knee prosthesis and the ReWalk and Indego exoskeletons [5, 6], generally communicate movement intention by pressing a control button or performing exaggerated body movements. This approach is not real-time capable and usually carries a risk of physical harm. An effective control approach that is robust to interruptions from the user and the environment can reduce the wearer's metabolic cost [7]. A complete, intelligent intention detection system therefore needs to be applied to LLEs to achieve safe performance.

Intention recognition predicts upcoming movements from data collected during the wearer's motion. Several works have combined intention recognition with sensor fusion to improve performance [8, 9]. However, these methods rely on manual feature extraction and therefore on expert knowledge.

Deep learning (DL) can extract features automatically and is well suited to intention detection. In many studies, cameras are used to collect environmental data, to which image classification algorithms are then applied [4, 10]. While this approach can achieve the desired classification, it consumes substantial computational resources, requires expensive hardware, and takes a long time to train. Time series data, by contrast, support fast classification and save memory. Therefore, in this study, time series data are used to predict human movement intentions, and DL methods are used to improve prediction accuracy.

The contributions of this work include:

(1) Processing unbalanced label data.

(2) Applying the ResNet model to time series datasets.

(3) Comparing the performance of different models and identifying the best-performing one for classifying human motion intentions.

The rest of the paper is organized as follows: Section 2 describes the related work. The proposed methods and experiments are described in Sects. 3 and 4, respectively. Section 5 discusses the results, and the conclusion is presented in Sect. 6.

2 Related Work

Most research on human intention recognition uses collected images for classification, human motion signals for time series classification, fusion of the two modalities to achieve higher accuracy, or conversion between them (images into time series or vice versa).

Laschowski et al. [10] collected the "ExoNet" image dataset, which contains real indoor and outdoor walking environments, and trained and tested more than ten state-of-the-art deep CNNs on it. Zhang et al. [24] developed an end-to-end unsupervised cross-subject adaptation method for time series datasets. Based on the MCD [25], their feature generator aligns the features of the source and target domains to fool the domain classifier until it can no longer detect which domain the features originated from. In [26], a depth camera and an IMU were used together to stabilize a point cloud of the environment; the original 3D point cloud was reduced to 2D and classified by a neural network. Hur et al. [27] used a novel encoding technique to convert inertial sensor signals into images with minimal distortion, along with a CNN model for image-based intention classification.

Time series datasets (human motion signals) have also been used to study intention recognition for LLEs, with a variety of DL network models used for training and testing. In general, LSTM or CNN-LSTM models are applied for prediction. Table 1 shows that LSTM is widely used for human motion signals, while ResNet is mostly applied to image data. We found that few studies have applied ResNet to LLE time series datasets. Therefore, the following experiments apply ResNet to the intention recognition of LLEs and compare it with CNN and CNN-LSTM.

Table 1 Previous recognition systems used algorithms for classification

3 Proposed Methods

In this paper, we analyze several common network structures: CNN, CNN-LSTM, ResNet, and its extension ResNet-Att. CNN-LSTM is currently the most widely used for time series processing [17, 28], but ResNet can also be applied to time series data. We select the model with the highest prediction accuracy by comparing these algorithms.

3.1 Overall Framework

Our proposed framework is illustrated in Fig. 1. We take the processed time series data as input and select CNN, CNN-LSTM, ResNet, and ResNet-Att as models for comparison. Although ResNet is not commonly used for time series data, it achieved surprisingly good results. Recently, the attention mechanism has gained great popularity in DL; it mimics the human ability to scan data and focus on the relevant part of the target field. However, we found that stacking additional layers can lead to overfitting, and the attention mechanism is not suitable for all network models.

Fig. 1

Experimental network framework structure

3.2 Data Processing

In this paper, the public dataset from [4] is processed. Zhong et al. recruited seven healthy subjects and one trans-tibial amputee for their study. Wearable cameras at different body locations collected images of the environment, and IMU signals were collected from sensors attached to the lower limbs. For the healthy subjects, the device was attached to the shin area; for the amputee, it was attached to the top of the pants around the prosthetic socket of a passive lower-limb prosthesis. A time series dataset containing accelerometer and gyroscope readings with timestamps was used to predict human locomotion intentions.

From this dataset we obtained four categories: standing/walking on level ground, walking on grass (a special terrain), and walking up and down stairs. These scenes are representative and widely applicable, covering not only LLE rehabilitation scenarios but also assistive situations. Table 2 lists the terrain labels and the distribution of the data points.

Table 2 Labels of terrains and data points

Table 2 shows that the label distribution is unbalanced: the S/WOLG label has a high proportion, which could dilute some features and degrade the experimental results. To ensure that each label contains the same number of data points (100,000), we performed balanced resampling of the accelerometer and gyroscope signals. The window size and sliding stride were set to 60 and 4, respectively; thus, one prediction is made every four data points.
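The windowing step above can be sketched as follows. This is a minimal illustration, not the paper's code; the `segment` helper and the convention of labeling each window by its last sample are our assumptions.

```python
import numpy as np

def segment(signal, labels, window=60, stride=4):
    """Slice a multi-channel signal into overlapping windows.

    signal: (T, C) array of IMU samples; labels: (T,) array of terrain labels.
    Each window is labeled by its last sample (our assumption), so one
    prediction is emitted every `stride` (= 4) data points.
    """
    xs, ys = [], []
    for start in range(0, len(signal) - window + 1, stride):
        xs.append(signal[start:start + window])
        ys.append(labels[start + window - 1])
    return np.stack(xs), np.array(ys)
```

With a 100-sample, 2-channel signal, this yields windows of shape (60, 2) starting every 4 samples.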

Figure 2 shows that the accelerometer and gyroscope data for S/WOLG (Label 0) are relatively stable, while the US data (Label 1) are the most volatile, consistent with the larger motion amplitude when walking up stairs. DS (Label 2) is much more stable than US. Compared to S/WOLG, WOG (Label 3) produces noisy data when the road surface becomes uneven and unstable.

Fig. 2

Detailed presentation of the raw sensor data for each label

3.3 Model Proposed

(1) CNN-Based

The CNN consists of several layers. We designed a structure suitable for this experiment, with three 1D convolutional layers. Given the input signal \(x\left(n\right)\), the output \(y\left(n\right)\) is obtained by convolving \(x\left(n\right)\) with a convolution kernel \(\omega (n)\) of size \(l\) [29].

$$y\left(n\right)=x\left(n\right)*\omega (n)=\sum_{m=0}^{l-1}\omega (m)\cdot x(n-m)$$
(1)
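Eq. (1) can be checked numerically; the short sketch below evaluates the sum directly and compares it against NumPy's built-in convolution. The concrete signal and kernel values are illustrative.

```python
import numpy as np

# Direct evaluation of Eq. (1): y(n) = sum_m w(m) * x(n - m),
# with a kernel w of size l = 2. Indices outside the signal are
# treated as zero ('full' convolution).
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25])

y_manual = np.array([
    sum(w[m] * x[n - m] for m in range(len(w)) if 0 <= n - m < len(x))
    for n in range(len(x) + len(w) - 1)
])

assert np.allclose(y_manual, np.convolve(x, w))
```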

To reduce overfitting, we added a dropout layer; a batch normalization layer helps stabilize the network during training. The convolutional architecture is shown in Table 3.

Table 3 Layer parameters of the CNN architecture
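A PyTorch sketch of such a three-layer 1D CNN is given below. The channel widths, kernel sizes, and the global-average-pooling head are illustrative guesses on our part; the exact values are those in Table 3.

```python
import torch
import torch.nn as nn

class CNN1D(nn.Module):
    """Three 1D conv layers with batch norm and dropout (Sect. 3.3).
    Layer widths here are assumptions, not the paper's Table 3 values."""
    def __init__(self, in_channels=6, n_classes=4, dropout=0.1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                     # x: (batch, channels, window)
        h = self.features(x).mean(dim=-1)     # global average over time
        return self.head(h)
```

Here 6 input channels correspond to the 3-axis accelerometer plus 3-axis gyroscope, and the window length is 60 samples.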
(2) CNN-LSTM-Based

The CNN-LSTM combines a CNN with an LSTM. The gate structure of the LSTM [30] consists mainly of a forget gate, an input gate, and an output gate, as shown in Fig. 3. The forget gate, loosely modeled on the human brain, determines which information is discarded.

Fig. 3

LSTM cell structure

Using the input parameters, each layer is calculated by the following functions:

$${i}_{t}=\sigma ({\omega }_{ii}{x}_{t}+{b}_{ii}+{\omega }_{hi}{h}_{t-1}+{b}_{hi})$$
(2)
$${f}_{t}=\sigma ({\omega }_{if}{x}_{t}+{b}_{if}+{\omega }_{hf}{h}_{t-1}+{b}_{hf})$$
(3)
$${\widetilde{c}}_{t}=tanh({\omega }_{i\widetilde{c}}{x}_{t}+{b}_{i\widetilde{c}}+{\omega }_{h\widetilde{c}}{h}_{t-1}+{b}_{h\widetilde{c}})$$
(4)
$${o}_{t}=\sigma ({\omega }_{io}{x}_{t}+{b}_{io}+{\omega }_{ho}{h}_{t-1}+{b}_{ho})$$
(5)
$$ c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot {\widetilde{c}}_{t} $$
(6)
$$ h_{t} = o_{t} \odot \tanh \,\left( {c_{t} } \right) $$
(7)

CNN-LSTM takes advantage of CNN to extract spatial features, and LSTM promotes the extraction of input information [31]. The network structure we designed is shown in Table 4.

Table 4 Layer parameters of the CNN-LSTM structure
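The CNN-LSTM arrangement described above can be sketched as follows: convolutional features are extracted first, then fed through time into an LSTM. Layer widths are our assumptions; the exact configuration is in Table 4.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """CNN front-end for spatial features followed by an LSTM over the
    time axis (Sect. 3.3). Widths here are illustrative assumptions."""
    def __init__(self, in_channels=6, n_classes=4, hidden=64, dropout=0.1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                    # x: (batch, channels, window)
        h = self.conv(x).transpose(1, 2)     # -> (batch, window, features)
        _, (h_n, _) = self.lstm(h)           # last hidden state
        return self.head(h_n[-1])
```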
(3) ResNet-Based

ResNet [32] is widely used for feature extraction. As a plain CNN grows deeper, its convergence can degrade, its accuracy can deteriorate, and overfitting can occur; ResNet was proposed to solve this degradation problem. In this work, we selected ResNet-50 for recognizing human locomotion intention.

Attention mechanisms [33] have been used by many researchers to improve network performance. In this paper, we therefore also investigate whether an attention mechanism can improve model accuracy. However, we found that adding attention causes overfitting once the original model already reaches high accuracy, so it is not applicable to this experiment; the corresponding evidence is provided below.

The proposed framework is described in Fig. 4.

Fig. 4

The basic structure of ResNet-50 with the added attention mechanism. The green module is the bottleneck of the ResNet, and each layer is constructed from a number of blocks. The dotted box marks where the attention mechanism is inserted
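A 1D adaptation of the ResNet-50 bottleneck block (the green module in Fig. 4) might look as follows. This is our sketch of how the image-oriented block transfers to time series, not the paper's implementation.

```python
import torch
import torch.nn as nn

class Bottleneck1D(nn.Module):
    """1D bottleneck: 1x1 -> 3x3 -> 1x1 convolutions plus a shortcut.
    The shortcut carries the input past the stacked convolutions, which
    is what lets very deep ResNets train without degradation."""
    expansion = 4

    def __init__(self, in_ch, width):
        super().__init__()
        out_ch = width * self.expansion
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, width, 1, bias=False),
            nn.BatchNorm1d(width), nn.ReLU(),
            nn.Conv1d(width, width, 3, padding=1, bias=False),
            nn.BatchNorm1d(width), nn.ReLU(),
            nn.Conv1d(width, out_ch, 1, bias=False),
            nn.BatchNorm1d(out_ch),
        )
        # Project the shortcut when channel counts differ.
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv1d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```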

A Channel Attention Module (CAM) [34], shown in Fig. 5, was added to ResNet, forming ResNet-Att. Channel attention is a mechanism that lets a network weight feature maps according to context in order to achieve better performance [35, 36]. We restrict ourselves to channel attention in this work because it typically requires less computation.

Fig. 5

The structure of the CAM. The input features undergo maximum and average pooling, respectively; the results are fed into a shared MLP layer, added, and activated with a sigmoid. The resulting per-channel weights are multiplied by the features

As shown in Fig. 5, given an input F ∈ RC×H×W, the max-pooling and average-pooling branches Mc and Ms are passed through a shared MLP and computed by the following functions:

$${M}_{c}(F)=BN(MLP(MaxPool(F)))=BN({\omega }_{1}({\omega }_{0}MaxPool(F)+{b}_{0})+{b}_{1})$$
(8)
$${M}_{s}(F)=BN(MLP(AvgPool(F)))=BN({\omega }_{1}({\omega }_{0}AvgPool(F)+{b}_{0})+{b}_{1})$$
(9)
$$M(F)=\sigma ({M}_{c}(F)+{M}_{s}(F))$$
(10)

where \({\omega }_{0}\in {R}^{\frac{C}{r}\times C}, {b}_{0}\in {R}^\frac{C}{r}, {\omega }_{1}\in {R}^{C\times \frac{C}{r}}, {b}_{1}\in {R}^{C},\) r is the reduction ratio, BN is defined as a batch normalization operation, and \(\upsigma \) is a sigmoid function.

The input F is the feature map extracted by ResNet, which the CAM weights directly. The overall process is:

$$ F{\prime} = M_{c} \left( F \right) \otimes F $$
(11)
$$ F^{^{\prime\prime}} = M_{s} \left( F \right) \otimes F{\prime} $$
(12)

where \(\otimes\) represents element-wise multiplication and \({F}^{\prime\prime}\) is the final refined output.
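Eqs. (8)–(10) can be sketched in PyTorch as below, adapted to 1D features. As a simplification on our part, the batch normalization from the equations is omitted, and the reduction ratio r = 16 is an assumed value.

```python
import torch
import torch.nn as nn

class ChannelAttention1D(nn.Module):
    """Channel attention (Eqs. 8-10): max- and average-pooled channel
    descriptors share one MLP, are summed, and pass through a sigmoid
    to give per-channel weights that rescale the input features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: (batch, C, T)
        mx = self.mlp(x.max(dim=-1).values)     # MaxPool branch, Eq. (8)
        av = self.mlp(x.mean(dim=-1))           # AvgPool branch, Eq. (9)
        w = torch.sigmoid(mx + av)              # Eq. (10)
        return x * w.unsqueeze(-1)              # reweight channels
```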

4 Experiments

4.1 Experimental Setting

In contrast to Zhong et al. [4], we did not use the image data; instead, we used the IMU signals and timestamps. The train-test split is 8:2: data from the healthy subjects form the training set, while the test set includes the trans-tibial amputee. The performance of CNN, CNN-LSTM, ResNet, and ResNet-Att was compared.

The networks were trained with the Adam optimizer. The number of epochs and the batch size were 20 and 64, respectively, and the dropout rate was set to 0.1. The networks were implemented in PyTorch and tested on a computer with an AMD Ryzen 7 5800H CPU with Radeon Graphics, 16 GB of memory, and an NVIDIA GeForce RTX 3070 graphics card.
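The training setup can be sketched as follows with the stated hyperparameters (Adam, 20 epochs, dropout handled inside the models). The learning rate is our assumption; the paper does not state it.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=20, lr=1e-3):
    """Minimal training loop: Adam optimizer, cross-entropy loss.
    `loader` yields (signal_batch, label_batch) pairs, e.g. with
    batch size 64 as in the paper."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model
```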

4.2 Results on Datasets

A comparative experiment was performed with CNN, CNN-LSTM, ResNet, and ResNet-Att. The confusion matrices are shown in Fig. 6a–d; accuracy is high in Fig. 6a–c. ResNet outperformed the other algorithms, achieving per-class accuracies of 99%, 99%, 99%, and 98%, respectively. WOG is easily misclassified as S/WOLG; we included this terrain because prior research [37] pointed out that such special terrains have not yet been studied. We believe LLEs need to detect special terrain, which is of great help for assistance: reliable recognition on special terrain can help an LLE adapt to different walking speeds. Accuracy, precision, recall, F1 score, and loss were chosen as evaluation criteria.

Fig. 6

The confusion matrix test accuracy of the CNN (a), CNN-LSTM (b), ResNet (c), and ResNet-Att (d)

As shown in Fig. 6d, ResNet-Att performed worse: the attention module made the network and the optimization process more complex. ResNet alone performed well, and the additional attention mechanism can lead to overfitting.

Figure 7 shows that the loss curve of ResNet decreases faster than those of the other models and reaches the best value, with a final loss of almost 0. The loss of ResNet-Att, however, remains around 1.4, so adding an attention mechanism is not suitable for this experiment. ResNet achieved the highest classification accuracy for LLE intention recognition; therefore, after this experimental demonstration, we will use the ResNet network for subsequent deployment on the embedded (lower-level) controller.

Fig. 7

The test loss of CNN, CNN-LSTM, ResNet, and ResNet-Att

As shown in Table 5, precision is the fraction of samples predicted as a class that actually belong to it, while recall is the fraction of a class's actual samples that are correctly predicted, i.e., how many actual positives the model can recall. The F1 score combines precision and recall and is high only when both are high.

Table 5 Classification report of models
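The metrics defined above follow directly from a confusion matrix; a small sketch (ours, not the paper's evaluation code):

```python
import numpy as np

def report(cm):
    """Per-class precision, recall, and F1 from a confusion matrix
    whose rows are true labels and columns are predictions."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # divide by predicted-positive totals
    recall = tp / cm.sum(axis=1)      # divide by actual-positive totals
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```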

According to the above evaluation criteria, ResNet performs outstandingly in this dataset, which confirms the future applications of intention recognition in LLE. Compared to most current studies using CNN and CNN-LSTM networks, ResNet also has a great advantage in recognition.

However, future studies need further improvements. We should not rely on kinematic data alone, which may be limiting under complex real-world conditions. Research is needed on multi-sensor data fusion, where data from a vision system can complement automatic motion-pattern control decisions based on mechanical, inertial, and/or neuromuscular sensors, because a single environmental feature does not clearly express the user's motion intention in real life. Fusing camera data with kinematic data could improve performance. Although our data fusion is currently insufficient, we will investigate this aspect further in the future.

5 Discussion

In this work, an offline dataset collected from seven healthy subjects and one trans-tibial amputee was used to train and evaluate the framework. The training and testing procedure is as follows: (1) divide the offline dataset into a training set and a test set; (2) balance the training data so that each label has the same number of data points (100,000); (3) set the window size (60) and sliding stride (4), and normalize the data to the interval (0, 0.5); (4) train the locomotion prediction network on the training set, with the human motion signal as input and the locomotion category as output; (5) perform dropout sampling to obtain predictions from the trained network on the test set; (6) evaluate the trained framework on the test set.

We performed balanced resampling of the data to avoid the large impact on accuracy after the softmax layer caused by label imbalance, and applied a series of normalizations to map the data to the same range.
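The normalization to the interval (0, 0.5) mentioned above can be sketched as a min-max scaling; applying it per channel is our assumption.

```python
import numpy as np

def normalize(x, lo=0.0, hi=0.5):
    """Min-max scale each channel of x (shape (T, C)) into [lo, hi],
    matching the paper's target interval (0, 0.5)."""
    mn = x.min(axis=0, keepdims=True)
    mx = x.max(axis=0, keepdims=True)
    return lo + (x - mn) / (mx - mn) * (hi - lo)
```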

As shown in Fig. 6, the highest classification accuracy of the algorithms presented in this paper reaches almost 99% in the experiments, about 3% higher than the CNN classifier. Compared with a plain CNN, ResNet adds shortcut connections (residual units), which allow the network to be deepened without degradation. ResNet-Att, however, made the network more complex and reduced the accuracy by approximately 30% compared to ResNet. CNN-LSTM, often used to classify time series into motion patterns, performed 1% lower than ResNet. ResNet therefore appears to have great potential.

According to Table 6, compared to [38, 41], CNN performs similarly to our experiments in recognizing human motion intention, and the CNN-LSTM fusion algorithm performs well in detection; its performance is indeed significantly better than CNN's, but still inferior to ResNet's. Therefore, our proposed application of ResNet to human activity recognition is successful.

Table 6 Compared with previous research that used different neural networks for human locomotion classification

We believe that applying ResNet to intention recognition is feasible: ResNet is an improved algorithm with better performance than CNN. Most current studies, however, focus on improving CNN-LSTM, which inevitably complicates the network. This is undesirable: since real-time performance is essential for intention recognition, model complexity greatly reduces the real-time capability of the recognition system. We therefore proposed that ResNet is more suitable for intention recognition than CNN and CNN-LSTM, and our results confirm its better performance. We had also assumed that an attention mechanism would improve performance; contrary to our expectation, the attention module added to ResNet reduced accuracy and led to overfitting. In future experiments, we will reconsider and further discuss the addition of attention mechanisms. In this work, we conclude that ResNet performs best on time series intention recognition, and we will adopt it as the algorithm of our intention recognition system in an actual prototype experiment.

In addition, the four common types of locomotion intention were accurately estimated in this paper, with most estimation errors below 4%. Data fusion could improve the system further. In [4], a lower-limb camera is combined with an on-glasses camera to facilitate the prediction of distant terrain, and the results showed significantly improved accuracy. However, since the camera is not self-contained in the wearable robot, adding an on-glasses camera may increase the number of frames to be processed, lowering system efficiency. In [44], two cameras therefore captured images simultaneously and their feature vectors were concatenated by feature-level fusion, but this is not the best solution: the two cameras could instead be activated asynchronously to dynamically combine their advantages in different scenarios. Cameras can also raise privacy issues, and most current research does not provide an effective way to handle this in data fusion. Some studies simply splice datasets to achieve low-level data fusion, while the mainstream is feature-level fusion, which may cause relatively large feature loss in practice; decision-level fusion is therefore gaining increasing attention compared to the other two approaches. In the future, we will conduct comparative experiments on these three fusion methods to identify the appropriate one and improve the performance of the algorithm. We believe that data fusion will have a positive impact on recognition accuracy, and we will continue to explore its limitations.

6 Conclusion

A time series-based locomotion recognition framework was developed for LLEs. In this study, we used a dataset of seven healthy subjects and one trans-tibial amputee, and analyzed four locomotion modes: S/WOLG, US, DS, and WOG. To enable comparative experiments, four models were evaluated, one with an added attention mechanism. We conclude that ResNet has great potential for processing time series datasets. The promising results are expected to significantly improve decision making in LLE locomotion recognition, and the high classification accuracy provides a good theoretical basis for LLE intention recognition. In subsequent experiments on the embedded (lower-level) controller, we will also use ResNet for experimental demonstration. Although it has not yet been verified on an actual prototype, we will use it in follow-up experiments to prove its performance.

We recognize that in intention recognition, the realistic environment of an exoskeleton is complex, and processing only homogeneous data is not enough; reality is composed of multi-source, heterogeneous data [45]. Therefore, in the future, we will develop a series of multi-source information acquisition devices in our laboratory, not limited to kinematic data, and conduct in-depth research on multi-source heterogeneous fusion methods and algorithms [46, 47], an area that many current studies also lack.