1 Introduction

With scientific and technological advancement, people are enjoying an ever more convenient life. Smart wearable devices can identify human actions through embedded sensors; in current smart home systems in particular, they have become an integral part of human-device interaction, so demand for them keeps growing. Smart wearable devices can be connected to the smart home system through the Internet and controlled through voice, applications, and gestures (Katariya et al. 2012). Although smart wearable devices have a wide range of applications, their input modes are limited by hardware and software configuration, which affects the user experience. Gesture is an important way for humans to interact with the world, and smart wearable devices can intelligently control home appliances by sensing and tracking human behavior (Ban and Lee 2019). Deep learning extracts deep rules and representations from raw information; one-dimensional (1D) and two-dimensional (2D) data of the raw information are analyzed throughout the process, which is one of the main learning capabilities of machines. Identification and interpretation of human upper limb activities are the basis for human behavior tracking. Chronologically, the development of machine learning can be divided into two stages: shallow learning and deep learning (Jankovi et al. 2018).

Shin and Cha (2018) proposed a human behavior recognition method based on sensor measurement and artificial intelligence (AI) and developed a multi-mode sensor composed of acceleration, gyroscope, and height sensing. Human behavior is identified by combining the sensors with deep learning in real time: human activity data are collected and human behaviors are classified. Their LSTM network can learn human posture and activities from the collected data, and experiments prove the practicability and effectiveness of the proposed system (Shin and Cha 2018). Zhang et al. (2020) used a convolutional neural network (CNN) to analyze and identify human behavior data collected through sensors and retrieved the hidden-state data of human behavior as the discriminant feature of motion. Their experimental results show that the deep learning algorithm performs well in human behavior classification (Zhang et al. 2020). Yun and Woo (2019) proposed combining quantitative analysis, deep learning, and machine learning to detect human thermal infrared signals, and simulated and tested the infrared data of 30 subjects. The results show that machine learning performs better in real-time detection, while the deep learning model only achieves 90% detection accuracy (Yun and Woo 2019).

Here, the CNN and recurrent neural network (RNN) are combined to study the application of human motion recognition based on deep learning and smart wearable devices in sports. First, the process of human motion recognition is defined, including data acquisition, data pre-processing, model establishment (feature extraction by convolution), and classification. The innovations can be summarized as follows: (1) three convolution feature extraction methods are compared with each other to guide future experimental selection; (2) feature extraction based on the 1D CNN + RNN (LSTM) model achieves a lower loss and higher accuracy; (3) motion behaviors are classified through two kinds of data collection sensors in smart wearable devices.

2 Methodology

2.1 Human motion behavior recognition process

The CNN is applied in many fields, such as image classification and speech recognition. The features of the original sensor data are automatically extracted through the convolutional layer, pooling layer, and fully connected layer (Wang et al. 2020). The RNN is used to model dynamic sequences. Here, these two networks are combined to study the application of human motion recognition based on deep learning and smart wearable devices in sports. Shallow learning includes machine learning algorithms such as support vector machines (SVM), random forests (RF), conditional random fields, and shallow neural networks. Shallow learning algorithms usually contain one or two nonlinear transformations, which perform well on simple and limited problems. Shallow learning lacks representation and modeling capability, so it cannot handle complex applications such as speech processing and visual images. Deep learning transforms data from the original data space to a new feature space layer by layer, so it can automatically learn hierarchical features of the data. These features can support tasks such as classification, regression, and feature visualization (Zeng et al. 2019). Compared with shallow learning, deep learning can learn more abstract concepts and fit more complex functions. Smart wearable devices are compact and convenient and can be applied in various experiments. The data are collected by smart wearable devices and then pre-processed (Xu and Yan 2020). The model is constructed through the CNN. Finally, the model is trained and tested, and the output is obtained (Fig. 1).

Fig. 1 Experimental process of behavior recognition

2.2 Overall framework for data acquisition

The acceleration and gyroscope sensor motion data acquisition system can be treated as two interlaced parts, an upper part and a lower part (Zhao et al. 2020).

The upper part can be divided into two functional modules: the motion data display module and the motion data acquisition module. The data display module records the acceleration and gyroscope readings in each direction for the device wearer. The individual's motion at each time node is labeled when a motion is completed; meanwhile, the data of the acceleration and gyroscope transducers in the x, y, and z directions are recorded. The motion data acquisition module is the core of the overall system design (Shen et al. 2020). Every time the wearer starts a motion, the start button is pressed and data recording begins; the timer is activated at the same time, and once the time runs out, the system stops recording. Then the type of motion is selected in the system. Hence, the front-end module acts as a user interface and is responsible for user interaction.

The lower part is the data acquisition module, which collects data in the background and is invisible to users. It works continuously during the entire human motion and transmits the collected data to the data display module (Liu et al. 2019).

2.3 Data collection flow

The deep CNN is utilized to obtain a more accurate human motion classification model, which requires multiple initial data collected through sensors. Relevant databases have been established by researchers around the world (Dubey and Jain 2020). For example, human action databases based on acceleration sensors have been established by domestic researchers, and a database has been established by the Wireless Sensor Data Mining laboratory of Fordham University in the United States.

A data collection process is designed as in Fig. 2. First, the volunteers' information is collected. Then the placement of the smart wearable device is confirmed (Hsu et al. 2019). After the devices are put on, the type of motion behavior is determined and data collection commences. Smart wearable devices with the required sensor types are selected for data collection. After the data collection, the data are checked: if the data are complete, the process ends; if not, the above process is repeated.

Fig. 2 Sensor data collection process

2.4 Data pre-processing

Data pre-processing can be divided into data denoising, data segmentation, and data labeling.

Data denoising is needed since the initial data from the sensors are inevitably mixed with external noise. Noise harms feature extraction and decreases the accuracy of data analysis (Gao et al. 2020), so the first pre-processing step filters the initial data to remove it. Generally, noise can be filtered through two kinds of filters: analog filters and digital filters. A digital filter analyzes and corrects the initial data through mathematical methods to remove noise; it requires no extra hardware, has high stability and reliability (Gao et al. 2019), and can greatly reduce noise. Median filtering is the most commonly used digital filtering technique. In a sequence of samples or an image, the values within a neighborhood are sorted, and the median of the left and right neighborhood values represents the value of that point, which makes the adopted values more accurate and eliminates redundant isolated noise points (Huang and Zhang 2020). Here, the median filtering method filters the original sensor data to remove noise.
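A minimal sketch of this denoising step is given below, using SciPy's median filter on a single sensor channel; the window length is an assumed value, since the exact setting used in the experiments is not reported.

```python
import numpy as np
from scipy.signal import medfilt

def denoise_channel(raw: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Median-filter one sensor channel (e.g., the accelerometer x-axis).

    Each sample is replaced by the median of its sorted neighborhood,
    which removes isolated noise spikes while preserving the signal shape.
    """
    return medfilt(raw, kernel_size=kernel_size)

# Example: a short acceleration trace with one isolated spike
raw = np.array([0.1, 0.2, 5.0, 0.2, 0.1, 0.0, 0.1])
print(denoise_channel(raw, kernel_size=3))  # the 5.0 outlier is suppressed
```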

Some data cannot be denoised and classified directly and must first be pre-processed through segmentation. For example, data collected continuously over a long period should be split into shorter segments (e.g., fixed-length sliding windows) before further processing.
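The sliding-window segmentation below is a hedged illustration of this step; the window length and overlap are assumptions, not the reported settings.

```python
import numpy as np

def sliding_windows(signal: np.ndarray, window: int = 128, step: int = 64) -> np.ndarray:
    """Split a long multi-channel recording of shape (T, D) into
    overlapping segments of shape (n_windows, window, D)."""
    starts = range(0, len(signal) - window + 1, step)
    return np.stack([signal[s:s + window] for s in starts])

# Example: 10 s of 6-channel data sampled at 50 Hz -> shape (500, 6)
recording = np.random.randn(500, 6)
segments = sliding_windows(recording)
print(segments.shape)  # (6, 128, 6): six 128-sample windows with 50% overlap
```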

Here, in the proposed human motion recognition system for smart wearable devices, data samples are labeled after denoising and segmentation. The One-Hot encoding method, also known as the one-bit-effective code, assigns each state an independent bit in a register and ensures that only one bit is valid at a given time (Hsieh et al. 2020). If a categorical feature with m possible values is one-hot encoded, it becomes m mutually exclusive binary features of which exactly one can be active at a time, so the dense categorical data become sparse. One-hot encoding thus makes otherwise awkward categorical features usable and also expands the feature space.
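As a small illustration, the activity labels used later in the experiments (walking, race walking, sprinting) could be one-hot encoded as follows; the label ordering is an arbitrary assumption.

```python
import numpy as np

LABELS = ["walking", "race_walking", "sprinting"]  # assumed ordering

def one_hot(label: str) -> np.ndarray:
    """Encode an activity label as a vector with exactly one active bit."""
    vec = np.zeros(len(LABELS), dtype=np.float32)
    vec[LABELS.index(label)] = 1.0
    return vec

print(one_hot("race_walking"))  # [0. 1. 0.]
```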

2.5 Convolution feature extraction

First, the one-dimensional (1D) convolution feature extraction method is introduced. Acceleration sensors and gyroscope sensors produce 1D dynamic series, and the 1D convolution method extracts features in 1D space. A convolution kernel can act as a filter that suppresses abnormal values and noisy data (Termritthikun et al. 2019). The convolution kernel can also act as a feature detector that responds most strongly to specific segments of the dynamic series. In data processing, the 1D convolution is applied separately to each channel of the sensor.
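A minimal Keras sketch of such a 1D convolutional feature extractor is shown below; the window length of 128 samples, the 6 sensor channels, and the filter count are assumed values, not the configuration used in the experiments.

```python
import tensorflow as tf

# One Conv1D layer slides a learned kernel along the time axis of each
# sensor window, acting as a filter / feature detector; pooling then
# condenses the resulting feature map.
inputs = tf.keras.Input(shape=(128, 6))           # 128 time steps, 6 channels
x = tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation="relu")(inputs)
x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)
feature_extractor = tf.keras.Model(inputs, x)
feature_extractor.summary()                       # feature map: (None, 62, 64)
```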

Second, two-dimensional (2D) convolution feature extraction is introduced. The concept of the CNN was proposed in the early 1980s (Duan et al. 2020; Nguyen and Nguyen 2020). The input information determines how features are extracted in convolution operations; 2D images in daily life are processed through the 2D convolution method. The LeNet model is a classical CNN model that identifies images well, so it is widely used. The following is a model with two convolutional layers (Fig. 3):

Fig. 3 Deep CNN model

The model consists of five parts from left to right: the input layer, convolutional layer, pooling layer, fully connected layer, and classified output layer. The two vital layers are the convolutional layer and the pooling layer (Kim et al. 2020). The convolution calculation process is as follows:

$$ X_{i}^{l,j} = f\left( {b_{j} + \sum\limits_{a = 1}^{m} {W_{a}^{j} X_{i + a - 1}^{l - 1,j} } } \right) $$
(1)

In the equation, l in \({X}_{i}^{l,j}\) represents the lth convolutional layer, i indexes the elements of the convolution output matrix, and j numbers the output matrix from 0 to N from left to right, where N is the number of convolution output matrices. \({W}_{a}^{j}\) is the ath weight of the jth convolution kernel, whose length is m, \({b}_{j}\) is the corresponding bias, \({X}_{i+a-1}^{l-1,j}\) is the input taken from the previous layer, and f represents a nonlinear function, namely the Sigmoid function (Ascioglu and Senol 2020).
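The snippet below implements Eq. (1) directly in NumPy for a single feature map, using made-up numbers purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_feature_map(x_prev: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Eq. (1): X_i = f(b + sum_{a=1}^{m} W_a * X_{i+a-1}) with f = sigmoid,
    where x_prev is the previous layer's output and w a kernel of length m."""
    m = len(w)
    out_len = len(x_prev) - m + 1
    return sigmoid(np.array([b + np.dot(w, x_prev[i:i + m]) for i in range(out_len)]))

# Toy example (values are arbitrary)
x_prev = np.array([0.0, 1.0, 0.0, -1.0, 0.5])
w = np.array([0.2, -0.1, 0.3])
print(conv_feature_map(x_prev, w, b=0.05))  # three output activations
```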

The data are further condensed through the pooling layer. The input of the pooling layer comes from the convolutional layer, and its output becomes the input of the next convolutional layer. The average pooling algorithm reduces the dimensionality, and the calculation process is shown in Eq. (2).

$$ X_{i}^{l,j} = \frac{1}{N}\left( {\sum\limits_{a = 1}^{N} {X_{(i - 1) \times N + a}^{l - 1,j} } } \right) $$
(2)

In Eq. (2), \({X}_{i}^{l,j}\) represents the ith local output of the pooling layer in layer l, i.e., the mean of a small region of size N in the previous layer's convolution output. After several convolution and pooling operations, a fully connected layer is added. Finally, the probability of each behavior category is output through Softmax classification (Lv et al. 2021; Shi et al. 2020).
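Putting these pieces together, a LeNet-style 2D CNN over sensor windows could look like the hedged sketch below; the window length, channel count, filter numbers, and class count are assumptions rather than the exact configuration used here.

```python
import tensorflow as tf

NUM_CLASSES = 3  # assumed: walking, race walking, sprinting

# Sensor windows are treated as a 2D "image" of time steps x channels and
# passed through two convolution + average-pooling stages (Eqs. 1 and 2),
# a fully connected layer, and a Softmax classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 6, 1)),            # time x channels x 1
    tf.keras.layers.Conv2D(16, (5, 3), activation="sigmoid", padding="same"),
    tf.keras.layers.AveragePooling2D(pool_size=(2, 1)),  # window mean, as in Eq. (2)
    tf.keras.layers.Conv2D(32, (5, 3), activation="sigmoid", padding="same"),
    tf.keras.layers.AveragePooling2D(pool_size=(2, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```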

Third, the 1D CNN + LSTM model is proposed. In addition to the 1D CNN and 2D CNN, the deep convolutional LSTM (DeepConvLSTM) model combining a 1D CNN with LSTM layers can also extract features. The convolutional layers extract features from the sensor input data, and the recurrent layers accumulate the extracted features over the time sequence through feature maps. The model is illustrated in Fig. 4.

Fig. 4 DeepConvLSTM model structural diagram

The figure shows that the input information is a dynamic series; these data are fed into the input layer through a sliding window that iterates from top to bottom. The input layer is composed of several channels extracted from the sensors. The number of channels is denoted as D, and the sample length in layer l is denoted by \({S}^{l}\); different convolutional layers have different lengths \({S}^{l}\). In the convolution operation, only the part where the input completely overlaps with the convolution kernel is calculated (Wasimuddin et al. 2021). Thus, the length of the convolution feature map is calculated as follows:

$$ S^{l + 1} = S^{l} - P^{l} + 1 $$
(3)

In the equation, \({P}^{l}\) represents the length of the convolution kernel in layer l. The kernel length is the same for every convolutional layer and is defined as \({P}^{l}=5\), with l taking integer values between 2 and 5.
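As a quick check of Eq. (3), the loop below propagates a window through the four convolutional layers; the starting window length of 128 samples is an assumption.

```python
S = 128                      # assumed input window length
for layer in range(2, 6):    # layers 2-5 are convolutional, P^l = 5
    S = S - 5 + 1            # Eq. (3): S^{l+1} = S^l - P^l + 1
    print(f"feature-map length after layer {layer}: {S}")
# -> 124, 120, 116, 112
```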

In Fig. 4, the leftmost part is the input data. First, the input data are processed by four convolutional layers, and the extracted features are then passed through two recurrent hidden layers. Finally, the behavior classification results are output through Softmax. The data size of layer l is D × \({S}^{l}\), where D represents the number of sensor data channels and \({S}^{l}\) represents the length of the feature map of layer l. Layers 2–5 are convolutional, \({K}^{l}\) represents the convolution kernels of layer l, and \({F}^{l}\) stands for the number of feature maps of layer l. In these layers, \({a}_{t,j}^{l}\) represents the activation of the jth feature map of layer l at time t. Layers 6 and 7 are the two hidden layers, where \({a}_{t,j}^{l}\) represents the activation of the jth unit in layer l at time t, and the dynamic series runs along the Y-axis.
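A hedged Keras sketch of this DeepConvLSTM structure is given below: four 1D convolutional layers with kernel length 5, two recurrent (LSTM) hidden layers, and a Softmax output. The filter counts, LSTM sizes, window length, and class count are assumptions, not the hyperparameters reported in Table 2.

```python
import tensorflow as tf

D, WINDOW, NUM_CLASSES = 6, 128, 3          # assumed channels, window length, classes

inputs = tf.keras.Input(shape=(WINDOW, D))
x = inputs
for _ in range(4):                          # layers 2-5: 1D convolution, P^l = 5
    x = tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu")(x)
x = tf.keras.layers.LSTM(128, return_sequences=True)(x)   # layer 6
x = tf.keras.layers.LSTM(128)(x)                           # layer 7
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

deep_conv_lstm = tf.keras.Model(inputs, outputs)
deep_conv_lstm.compile(optimizer="adam",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])
deep_conv_lstm.summary()  # feature-map length shrinks 128 -> 124 -> 120 -> 116 -> 112
```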

2.6 Introduction of experimental data set and experimental environment

The experiments are conducted with a HUAWEI WATCH 1. Every testee wears the smart bracelet, and their motion data are collected for the experiments through its linear acceleration sensors, gyroscopes, and geomagnetic sensors. After the behaviors of the testees are tracked and recorded through the smart wearable devices, these behaviors are classified into different motion behaviors (Xu et al. 2020). Sheng Jingwan Road is chosen as the experimental site, and the network model is trained in advance with data collected there from five testees to obtain the experimental training set. Finally, the algorithm is trained and tested on the obtained test data. The experimental hardware and software environments are shown in Table 1, and the hyperparameters of the neural network are listed in Table 2.

Table 1 Experimental environment settings
Table 2 Neural network hyperparameter setting

3 Results and discussion

3.1 Analysis on feature extraction of human motion recognition

Based on the theory in the second part, the original data are first collected and pre-processed, and then the current motion behavior state is identified through convolution feature extraction. The loss value and accuracy of convolution feature extraction directly determine how well human motion behavior is recognized: the lower the loss value, the higher the accuracy, and the more reliably the motion state is recognized. Consequently, several common behaviors in sports, such as walking, running, and jumping, can be identified more accurately, providing better application value. Here, the feature extraction results of different algorithms are compared, and the optimal feature extraction method is selected for human motion behavior recognition.

For each model, the loss value and accuracy during training, as well as the final loss value and accuracy, are compared. Hence, the human motion recognition effects of three different algorithms, the 1D CNN, 2D CNN, and 1D CNN + LSTM algorithms, are obtained, as illustrated in Figs. 5 and 6.

(1) Loss value and accuracy curve of each model during training

Fig. 5 Loss value curves of the different feature extraction algorithms

Fig. 6 Accuracy curves of the different feature extraction algorithms

The human motion behavior recognition performance of the 1D CNN, 2D CNN, and 1D CNN + LSTM algorithms is compared. Figure 5 shows the loss value trend of each feature extraction method, and Fig. 6 shows the accuracy curves.

Figure 5 shows that, for all feature extraction methods, the loss value decreases as the number of iterations increases. Among them, the loss value of the 1D CNN algorithm is the highest, and that of the 2D CNN algorithm is smaller. The best result is obtained with the 1D CNN + LSTM algorithm, whose loss value decreases significantly as the iteration number increases.

Figure 6 shows that the 2D CNN algorithm is more accurate than the 1D CNN algorithm. The accuracy of the 1D CNN + LSTM algorithm increases the fastest at the beginning of training; as the iteration number grows, the increase in accuracy slows down, and a maximum accuracy of 0.96 is reached. The convergence of the 1D CNN and 2D CNN algorithms is clearly slower.

(2) Final loss value and accuracy of each method

Table 3 shows the final loss value and accuracy of the 1D CNN, 2D CNN, and 1D CNN + LSTM algorithms.

Table 3 Average loss value and recognition accuracy of the feature extraction algorithms on the test sets

Table 3 shows that the average loss value of the 1D CNN feature extraction algorithm is the largest, and its average recognition accuracy is the lowest. The average loss value of the 2D CNN feature extraction algorithm is smaller than that of the 1D CNN algorithm, and its average recognition accuracy is correspondingly higher. Since the sensors provide continuous dynamic series data in the x, y, and z directions, the current state depends on both the current moment and the previous states, which the LSTM can exploit. Hence, the 1D CNN + LSTM feature extraction algorithm performs best, with an average loss value of 0.346 and an average recognition accuracy of almost 90%.

Different algorithms are compared to analyze human motion recognition through deep learning based on related research. First, different convolution feature extraction algorithms, namely the 1D CNN, 2D CNN, and 1D CNN + LSTM feature extraction algorithms, are introduced, and the classification model of human motion behaviors is constructed on this basis. Then the results are compared and analyzed. They show that the 1D CNN + LSTM algorithm can better process and extract the action information from wearable device sensors. The final conclusion is obtained by observing how the loss value and accuracy curves change as the number of iterations increases. Accordingly, the 1D CNN + LSTM extraction algorithm should be chosen in smart devices to process the relevant data when identifying sports behavior.

3.2 Experimental analysis of human motion recognition based on deep learning and smart wearable devices in sports

(1) The selection of sensors in smart wearable devices. To design the overall framework for sensor acquisition and complete the data acquisition process, smart wearable devices are equipped with motion sensors, such as acceleration and gyroscope sensors, and environmental sensors, such as gravity and distance sensors. In this experiment, the 1D CNN + LSTM algorithm is selected to extract the data, and the acceleration and gyroscope sensors are chosen.

(2) Selection of sports types and wearing position of smart devices. There are 28 sports events in the Summer Olympic Games, including track and field, gymnastics, badminton, and swimming. Many sub-events, such as race walking, sprinting, and long-distance running in track and field, require much smaller and simpler venues and equipment. Race walking and sprinting are chosen to be compared with walking. Smart wearable devices are very easy to carry. The most vulnerable parts of the human body are the knee, ankle, and elbow joints, where the muscles are very active and the data collected are more useful; therefore, the smart wearable devices are placed on these parts of the body.

(3) Data collection and analysis of sports events. Fifty healthy young people in their twenties, 25 males and 25 females, volunteered for the data collection. Their height is between 155 and 185 cm, and their weight is between 45 and 75 kg. Experimental process: each volunteer completes three actions, walking, race walking, and sprinting, one by one within a given time (1 min). The school playground is chosen as the data collection site. Data are collected in three directions: x, y, and z. The data acquisition diagrams are illustrated in Figs. 7 and 8.

Fig. 7 Three-axis data of gyroscope sensor, including walking, race walking, and sprinting

Fig. 8 Three-axis data of acceleration sensor, including walking, race walking, and sprinting

Figures 7 and 8 show that the volunteers' motion and acceleration signals change obviously among walking, race walking, and sprinting, and the boundaries between the activities are very clear. The volunteer is walking during the first part of the trip, race walking in the middle, and sprinting in the last part.

Smart wearable devices are used here for data collection in sports events. The 1D CNN + LSTM extraction algorithm is chosen as the optimal data processing algorithm, and the model is established based on the analysis in the second part and the experimental comparison with the other algorithms. The process of motion behavior recognition includes information collection, input, data pre-processing, model establishment, and motion behavior classification. Besides, smart wearable devices can collect both single-behavior and multi-behavior activities. Consequently, human motion behaviors are identified and classified through the established smart wearable device model.

3.3 Performance evaluation of algorithms

The performances of the 1D CNN, 2D CNN, and 1D CNN + LSTM algorithms are evaluated through comparison with the LSTM network algorithm and with conventional neural network algorithms, namely the BPNN and RNN algorithms. Their performances on five sets of initial data are shown in Fig. 9.

Fig. 9 Performance comparison results

Figure 9 illustrates that different neural network models yield different prediction accuracies on the same initial data. The overall prediction accuracies of the 1D CNN, 2D CNN, and 1D CNN + LSTM algorithms are much higher than those of the other algorithms. Meanwhile, the prediction results of the 1D CNN + LSTM algorithm show high stability; the model parameters can still be optimized in subsequent research to further improve the prediction accuracy.

4 Discussion

The development of the Internet and the information age has witnessed the impact of deep learning on online and offline trades and professions in recent years. Nowadays, deep learning is widely used in research on human motion behavior recognition with smart wearable devices. The CNN, as a deep learning algorithm, can be combined with smart wearable devices to redefine and refresh traditional sports events; for example, the proposed action recognition technology can evaluate how standard athletes' movements are. Besides, human motion recognition based on smart wearable devices has many advantages, such as low energy consumption, low equipment cost, high security, easy replacement, and high comfort, and it can also be applied to daily life, adding more fun to sports. Based on previous studies, several human motion behavior recognition algorithms are analyzed and compared. The results of the 1D CNN, 2D CNN, and 1D CNN + LSTM algorithms on the data collected through the smart wearable devices show that these algorithms differ considerably in effectiveness and accuracy. Consequently, the optimal algorithm, the 1D CNN + LSTM algorithm, is selected for human motion behavior recognition.

5 Conclusion

A CNN feature extraction algorithm based on 1D and 2D features is introduced to study human motion recognition based on deep learning and smart wearable devices. Then, the 1D CNN + LSTM algorithm is proposed by combining the CNN feature extraction algorithm with LSTM networks. The experimental results show that the proposed 1D CNN + LSTM algorithm is more efficient and accurate than the traditional methods, so it can be better applied in future sports events. Although a solution is provided for human motion recognition based on deep learning, the complexity of real life and external interference mean that deeper research is needed before extended application. In future research, more multi-angle sensors with higher measurement accuracy will be required for data collection. Meanwhile, it is hoped that various sports events beyond track and field can be identified, providing a new direction for follow-up research.