1 Introduction

Rapid socio-economic development and urbanization are accelerating the pace of life, and ever more people are under stress from work, school, and society. Under such circumstances, fitness and exercise are of great significance for improving people's physical and mental condition so that they can better cope with pressure. Fitness activities encompass a diverse range of sports in which an effective and scientific training strategy is crucial, since improper training may lead to workout injuries, such as sprained ankles and muscle pulls and strains [1, 2]. Recent decades have witnessed the emergence and development of Deep Learning (DL), a new Artificial Intelligence (AI) technology, and many studies have been conducted on its application to fitness motion detection.

Meanwhile, the Chinese government has proposed strengthening national fitness through new technological means, including the Internet of Things (IoT), big data, Remote Monitoring (RMON), remote services, and AI technology. AI is now deeply integrated into sports; for example, various intelligent HardWare (HW) and SoftWare (SW) products have been designed for sports training. The integration of AI technology in sports and fitness training has promoted sports intellectualization [3]. Under professional guidance and AI-aided training, the fitness training process becomes more scientific. AI equipment can be used to process and visually demonstrate real-time fitness images through advanced Computer Technology (CT), thus allowing people to fully grasp their fitness and physical status [4, 5]. The DL algorithm is an AI technology that is widely used in computer vision (such as face recognition), Natural Language Processing (NLP), Data Mining (DM), and Machine Translation (MT). Image Processing (IP), as a subset of computer vision, is the means of translation between the human visual system and digital imaging devices to get an enhanced image or to extract some useful information from it [6, 7]. Therefore, the application of DL to fitness motion image detection and recognition is of great significance.

To sum up, an IoT-based intelligent fitness system can help people comprehensively master their physical status to train themselves more scientifically, and the application of AI technology to fitness motion image detection and recognition has practical significance. Innovatively, the DL algorithm is introduced to detect and recognize real-time fitness images and build an intelligent fitness detection system to provide real-time training standards and avoid workout injuries in fitness activities. Then, the system performance is verified through a simulation experiment. The results provide references for the digital and intelligent development of the sports field.

The research framework is organized as follows. The first section, the introduction, explains the current development status of fitness, the research background, and the innovation points, highlighting the significance of this study. The second section, related works, analyzes the application status of AI technology in fitness and summarizes the advantages and disadvantages of related research, highlighting the focus of this research. The third section analyzes the needs and objectives of intelligent fitness and implements a real-time intelligent fitness IP model based on DL. The fourth section, results and discussion, discusses the results of the simulation experiment, highlighting the advantages of this research. The fifth section, the conclusion, summarizes the results of this study and analyzes its shortcomings and prospects.

2 Related works

2.1 The application situation of DL in real-time IP

With the advancement of CT, various algorithms have been introduced. In particular, DL has been favored by most researchers in IP applications due to its strong prediction performance. Chen et al. (2018) proposed a multi-scale robust image (semantic) segmentation algorithm based on Atrous Spatial Pyramid Pooling (ASPP) and an improved Deep Convolutional Neural Network (DCNN). ASPP could resample a given feature layer at multiple rates before convolution to capture the objects and useful image context at multiple scales and improve the positioning performance through qualitative and quantitative methods [8]. Nasir et al. (2019) proposed a distributed Dynamic Power Distribution (DPD) scheme based on Reinforcement Learning (RL) and allocated the approximate optimal power through delayed Channel State Information (CSI) of the agent. The proposed scheme was particularly suitable for inaccurate system models with non-negligible CSI delay [9]. To ensure network security, Sultana et al. (2019) explored the application of DL in Software-Defined Network (SDN)-based Network Intrusion Detection Systems (NIDS) and introduced NIDS model development tools in the SDN environment [10]. Wang et al. (2020) put forward a new lightweight Automatic Modulation Classification (AMC) method via DL by introducing a scale factor to each Convolutional Neural Network (CNN) neuron. The scale factor sparsity was enhanced through compressive sensing. The simulation results showed that the proposed light AMC method could effectively reduce the model size and accelerate the calculation, with a slight performance loss [11]. He et al. (2021) implemented a novel deep end-to-end Neural Network (NN) model based on a modified Recursive Neural Network (RNN) for sports posture image recognition. The recognition accuracy of motion posture was improved by about 3% compared with the existing NN model [12].

2.2 The application of DL in sports

Recent years have seen the wide application of AI technologies and the popularization of smartphones and smart wearable devices, which allow people to record their psychophysiological data, such as heart rate, and their geographical locations in real-time. In particular, DL has become the focus of research in sports applications. Cust et al. (2019) reviewed the application of DL in sports motion recognition and found that DL could be used to extract features from target motions and implement more objective automatic detection models for sport-specific movements [13]. Ba (2020) proposed a medical rehabilitation DL system for sports injury based on Magnetic Resonance Imaging (MRI) analysis. Through human motion analysis, the proposed system could improve the cerebral cortex analysis and judgment ability to coordinate human motion, thereby effectively preventing sports injuries [1]. Khan et al. (2020) put forward an automatic human action recognition method based on a Deep Neural Network (DNN) and multi-view features and selected the best features through relative entropy, mutual information, and strong correlation. Consequently, the recognition accuracy of the proposed model was significantly better than that of other existing methods [14]. To enhance the efficiency of motion behavior monitoring, Hu et al. (2021) implemented a motion monitoring model based on feature similarity and an optimization-driven DL framework for image enhancement. Experimental verification showed that the model had practical value [15].

In summary, AI technologies such as DL are widely applied in image identification and analysis, but very few studies have addressed their application to the real-time IP of fitness motions. Thus, the DL algorithm is innovatively introduced to detect fitness motions, which is of great significance for formulating real-time training standards and preventing workout injuries.

3 Real-time IP of intelligent fitness system based on DL

3.1 Real-time IP demand and functional analysis of fitness motion

Since the beginning of the 21st century, with growing material wellbeing, longer leisure time, and increasing health awareness, various fitness activities and clubs have emerged, developed, and expanded. To provide high-quality services and build unique brands, fitness club operators have to constantly update their sports facilities, which increases operating and management costs. One expedient measure is to sell long-term membership cards, such as seasonal and annual cards, which, however, place a heavy financial burden on members and exclude some potential customers. Meanwhile, within the club, members might suffer workout injuries due to unscientific training standards [16, 17]. To address these problems, many club owners purchase AI-based fitness equipment with an intelligent fitness motion detection and recognition system to guide their members to train more scientifically, which has shown great practical value and proven to be effective. Thus, intelligent fitness motion detection and recognition are of great significance.

Specifically, this paper aims to design real-time data services that can monitor the physiological indicators and the equipment parameters and provide real-time training standards for fitness members, as well as a universal data upload interface for sports equipment manufacturers, and standard protocols for service platform DM system [18]. The functional and non-functional objectives of the intelligent fitness system are shown in Fig. 1.

Fig. 1 The functional objectives and non-functional objectives of the intelligent fitness system

The intelligent fitness system realizes data acquisition, real-time sharing, image display, and user management functions through a three-tier network: the fitness terminal (or data transmission unit) → central machine → remote server system. The intelligent fitness system also considers non-functional objectives, including system performance, reliability, stability, adaptability, and security. That is, the performance of each functional module should be stable and complete, and the underlying sensing system should fully perceive and report fitness data in real-time, while data security must be safeguarded. The central machine collects, integrates, and uploads data efficiently; the user interface is friendly and beautiful, and it responds to the data request of the underlying sensing system and the agreed instructions of the remote server in real-time.

3.2 Design of intelligent fitness monitoring system

In the intelligent fitness monitoring system, the perception layer can collect and summarize fitness information, comprehensively perceive fitness equipment, fitness people, and fitness process through IoT, identify dangerous fitness motions, and help train fitness users through the DL algorithm. Afterward, the perception layer reports sports and fitness information. The perception layer is divided into multiple independent data collection centers according to specific applications and geographic locations, and each collection center contains the central computer, Wireless Sensor Network (WSN), fitness terminal, and transmission unit, as shown in Fig. 2.

Fig. 2 The organizational structure of the system perception layer

Further, the collected fitness image sequences are processed, and their features are extracted using the CNN algorithm. Apart from CNN, common IP algorithms also include RNN and LSTM. CNN is a feedforward NN with many layers, such as convolution layers, pooling layers, and fully connected layers [19, 20]. Figure 3 illustrates the application of CNN to real-time IP and detection.

Fig. 3 Flowchart of real-time IP and detection through CNN

In Fig. 3, the nonlinear transformation layer can enhance the decision function nonlinearity and improve the network generalization. Nonlinear transformation functions used for DL include the ReLU function, Sigmoid function, and TanH function [21]:

$${\text{Sigmoid}}(l) = \frac{1}{{1 + e^{ - l} }}$$
(1)
$${\text{TanH}}(l) = \frac{2}{{1 + e^{ - 2l} }} - 1$$
(2)
$${\text{ReLU}}(l) = \begin{cases} 0, & l < 0 \\ l, & l \ge 0 \end{cases}$$
(3)
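As a quick numerical check, the following minimal Python sketch evaluates the three activation functions in their standard forms, \(\text{TanH}(l) = 2/(1 + e^{-2l}) - 1\) and \(\text{ReLU}(l) = \max(0, l)\). The function names are illustrative and are not taken from the system's implementation.

```python
import math


def sigmoid(l: float) -> float:
    """Sigmoid: squashes l into (0, 1). No overflow guarding for very negative l."""
    return 1.0 / (1.0 + math.exp(-l))


def tanh(l: float) -> float:
    """TanH written as 2 / (1 + e^{-2l}) - 1, i.e. 2 * sigmoid(2l) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * l)) - 1.0


def relu(l: float) -> float:
    """ReLU: passes non-negative inputs through and zeroes the rest."""
    return l if l >= 0 else 0.0


if __name__ == "__main__":
    for l in (-2.0, 0.0, 2.0):
        print(l, sigmoid(l), tanh(l), relu(l))
```

Sigmoid and TanH saturate for large |l|, whereas ReLU stays linear for positive inputs, which is one common reason it is preferred in deep CNNs.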

Usually, CNN performs convolution operations on multiple dimensions. The convolution of a two-dimensional (2D) input matrix I with a 2D kernel K reads:

$$S(i,j) = (I * K)(i,j) = \sum\limits_{m} {\sum\limits_{n} {I\left( {m,n} \right)K(i - m,j - n)} }$$
(4)

In Eq. (4), (i, j) is the position of an element in the output matrix S, and (m, n) are the summation indices that run over the elements of the input. Convolution is commutative, so Eq. (4) can be equivalently expressed as Eq. (5).

$$S(i,j) = (K * I)(i,j) = \sum\limits_{m} {\sum\limits_{n} {I(i - m,j - n)K\left( {m,n} \right)} }$$
(5)

Commutativity arises because the kernel is flipped relative to the input: as the index into the input increases, the index into the kernel decreases. Although commutativity is useful for proofs, it is not an important property in the application of NN. Instead, many NN libraries implement a related function, called the Cross-Correlation (CC) function [22], which is the same as the convolution operation except that the kernel is not flipped. It is denoted here by \(\star\) and expressed in Eq. (6).

$$S(i,j) = (I \star K)(i,j) = \sum\limits_{m} {\sum\limits_{n} {I(i + m,j + n)K\left( {m,n} \right)} }$$
(6)
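The relation between Eqs. (5) and (6) can be illustrated with a small, dependency-free Python sketch. It assumes "valid" mode (no padding), so output indices are offset relative to the equations; the function names are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def cross_correlate2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Eq. (6), valid mode: S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def flip(kernel: Matrix) -> Matrix:
    """Rotate the kernel by 180 degrees (reverse rows, then columns)."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Eq. (5), valid mode: convolution is cross-correlation with a flipped kernel."""
    return cross_correlate2d(image, flip(kernel))


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, -1]]
    print(cross_correlate2d(image, kernel))  # kernel applied as-is
    print(convolve2d(image, kernel))         # kernel flipped first
```

For a kernel that is symmetric under a 180-degree rotation the two operations coincide, and since a CNN learns its kernels, either operation can be used as long as it is applied consistently.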

The CNN is used to classify fitness image pixels, and the image is reconstructed with collected pixels and enlarged through an up-sampling operation to its original size for output. Thus, each pixel in the output image is predicted through the calculation of the maximum pixel value at the position in all the obtained images.

3.3 Implementation of intelligent fitness real-time IP system based on DL

The fitness videos contain detailed and critical fitness motion data, from which suspicious workout injury information and motion features can be extracted and used for fitness motion diagnosis. However, processing every single frame is computationally heavy, which reduces diagnosis efficiency and generates redundancy. Therefore, only the keyframes containing suspicious injury information are selected from the fitness videos, and a corresponding keyframe extraction function is designed to improve the diagnosis efficiency and reduce the calculation redundancy; the keyframes are stored in real-time and used for fitness motion feature extraction and injury diagnosis. Specifically, the CNN is first trained with both standard and non-standard fitness motion images; the trained CNN model is then loaded and used for feature extraction and fitness motion prediction on the extracted and segmented keyframes, thus realizing the real-time IP of the fitness motion images. The flowchart of intelligent fitness real-time IP based on DL is shown in Fig. 4; the main steps are image sequence loading, image keyframe extraction, attention mechanism, image feature analysis, and image understanding. The steps are closely linked, and each one determines the next.

Fig. 4 Flowchart of intelligent fitness real-time IP system based on DL

The proposed DL-based intelligent fitness real-time IP system monitors fitness motions and diagnoses possible injuries through data acquisition from the front-end video surveillance. Based on the wireless network, video and thermal sensor nodes on fitness equipment are used for real-time image data acquisition, processing, and diagnosis, thereby providing visualized management and intelligent decision-making for real-time fitness motion IP and diagnosis.

The first step is real-time image acquisition and preprocessing. Collected fitness images often contain noise and interference from illumination, temperature, and equipment, which degrade image quality and hamper motion detection, classification, and tracking. Thus, denoising the acquired keyframes is a critical step for fitness motion detection. To this end, an improved Adaptive Median Filter (AMF) algorithm is proposed, which proceeds as follows: (1) non-noise signal points are detected within a small filtering window; (2) if the window contains non-noise signal points, these pixels are regarded as candidate signals; otherwise, the filtering window is enlarged, so that image details are preserved as much as possible by keeping the window as small as possible; (3) the candidate signal pixels from the previous step are judged again: pixels confirmed as signal are output unchanged, while noise pixels are replaced by the median of the filtering window.
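The details of the improved AMF are not given here, so the following Python sketch shows only the classic adaptive median filter as a baseline for the same idea: start with a small window, enlarge it while the window median itself looks like an impulse, and replace a pixel only when it is a suspected impulse. The parameter `max_half`, the edge-replication border handling, and all names are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]


def _window(img: Image, r: int, c: int, half: int) -> List[int]:
    """Collect the (2*half+1)^2 neighbourhood of (r, c), replicating edge pixels."""
    h, w = len(img), len(img[0])
    return [
        img[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
        for dr in range(-half, half + 1)
        for dc in range(-half, half + 1)
    ]


def adaptive_median_filter(img: Image, max_half: int = 3) -> Image:
    """Classic adaptive median filter (baseline, not the improved variant)."""
    out = [row[:] for row in img]
    for r in range(len(img)):
        for c in range(len(img[0])):
            z = img[r][c]
            for half in range(1, max_half + 1):
                win = sorted(_window(img, r, c, half))
                z_min, z_med, z_max = win[0], win[len(win) // 2], win[-1]
                if z_min < z_med < z_max:
                    # The median is not an extreme value. Keep the pixel if it
                    # is not an extreme either; otherwise use the median.
                    out[r][c] = z if z_min < z < z_max else z_med
                    break
            else:
                # Even the largest window has an extreme median: fall back to it.
                out[r][c] = z_med
    return out
```

Note that this baseline treats any pixel equal to the window minimum or maximum as a suspected impulse, so a few clean pixels in flat or border regions may also be replaced by the median.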

Subsequently, the Recurrent Attention CNN (RA-CNN) algorithm in CNN is optimized to extract features and classify real-time fitness images. The RA-CNN model does not train or test the model through the detailed annotation information but recursively learns to discriminate salient regions and region-based feature representation in a mutually reinforcing manner and encode the complete input image into multi-scale fine-grained local regions [23]. This paper further improves the accuracy of the RA-CNN model by network recursion. After the real-time keyframes are denoised, the deep hybrid attention network is used as the first network for image feature extraction. Then, the clipping and amplification module is added after the last residual structure unit of the first network. The clipping and amplification module can clip the original image according to the regions with high spatial response features in the last convolution layer of the first network (the force applying method at different joints) and amplify the clipped image. Afterward, the clipped and enlarged image is sent to the second deep hybrid convolution network to further extract more refined features. Finally, the extracted features from the two-tier networks are used for classification.

Overall, the network architecture of the proposed DL-based intelligent fitness real-time IP system is also a specific application of the attention mechanism. Locating the key area is equivalent to assigning a weight distribution to the original fitness image. The rectangular region is located according to the spatial response of the convolution features of the first network, and the weights inside and outside the rectangular region are set to 1 and 0, respectively. Thus, the clipping operation amounts to applying this weight distribution to the original image.
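The equivalence between clipping and a 0/1 weight map can be sketched in a few lines of Python. How the rectangle is derived from the convolution response and how the clipped image is amplified are not specified in the text, so the box is passed in directly and nearest-neighbour enlargement is used purely as an assumption; all names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def box_mask(height: int, width: int, box: Box) -> Image:
    """Weight map: 1 inside the rectangle, 0 outside."""
    top, left, bottom, right = box
    return [[1.0 if top <= r < bottom and left <= c < right else 0.0
             for c in range(width)] for r in range(height)]


def apply_mask(image: Image, mask: Image) -> Image:
    """Element-wise product of the image and the weight map."""
    return [[p * m for p, m in zip(row, mrow)] for row, mrow in zip(image, mask)]


def crop(image: Image, box: Box) -> Image:
    """Clip the rectangle out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image: Image, out_h: int, out_w: int) -> Image:
    """Enlarge (or shrink) by nearest-neighbour sampling (assumed method)."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


if __name__ == "__main__":
    image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    box = (1, 1, 3, 3)
    weighted = apply_mask(image, box_mask(4, 4, box))
    # Cropping the weighted image gives the same patch as cropping the original.
    patch = crop(weighted, box)
    print(resize_nearest(patch, 4, 4))
```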

3.4 Simulation analysis of intelligent fitness real-time IP system

Further, a Matlab model is constructed for simulation analysis and verification of the proposed DL-based intelligent fitness real-time IP system. The standard public data set PAMAP2 is selected for the simulation experiment [24]; it is a body-movement data set hosted by the University of California, Irvine (UCI) Machine Learning Repository and contains 12 activities, including walking, running, cycling, rope skipping, and daily activities. Acceleration, angular velocity, and magnetic field direction data are recorded by hand-, chest-, and foot-mounted Inertial Measurement Units (IMUs) for over 10 h, with a sampling interval of 0.01 s (a sampling frequency of 100 Hz). In total, 5,000 samples are selected for each fitness motion, and they are split into training and test sets at a ratio of 4:1. Firstly, images with noise densities of 10%, 30%, 50%, 70%, and 90% are selected to verify the Noise Reduction (NR) effect of the model. Secondly, the NR performance of the proposed model is compared with that of the Standard Median Filter (SMF) algorithm [25] and the Ranked-order Based Adaptive Median Filter (RAMF) algorithm [26]. Thirdly, the proposed model is trained on the training sample images under different model parameters, and its accuracy is verified on the test set through comparison with other algorithms from the literature, including RA-CNN [27], AlexNet [28], LSTM [29], CNN [30], and RNN [31]. Table 1 displays the simulation environment.

Table 1 Table of modeling tools for simulation experiment
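Test images at a chosen noise density can be generated with a sketch like the one below. It assumes salt-and-pepper impulse noise on 8-bit grayscale images, with each corrupted pixel set to 0 or 255 with equal probability; the noise model, the seed parameter, and the function name are assumptions for illustration and may differ from the setup used in the experiment.

```python
import random
from typing import List

Image = List[List[int]]


def add_impulse_noise(img: Image, density: float, seed: int = 0) -> Image:
    """Return a copy in which each pixel is replaced, with probability
    `density`, by 0 or 255 (chosen with equal odds)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [
        [rng.choice((0, 255)) if rng.random() < density else p for p in row]
        for row in img
    ]


if __name__ == "__main__":
    clean = [[128] * 8 for _ in range(8)]
    for density in (0.1, 0.3, 0.5, 0.7, 0.9):
        noisy = add_impulse_noise(clean, density)
        changed = sum(p != 128 for row in noisy for p in row)
        print(density, changed / 64)
```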

Peak Signal-to-Noise Ratio (PSNR) and Mean Absolute Error (MAE) can quantitatively and objectively measure the performance of the intelligent fitness monitoring system, as expressed in Eqs. (7) and (8), respectively.

$${\text{PSNR}} = 10\log_{10} \frac{{M \times N \times 255^{2} }}{{\sum\nolimits_{i = 1}^{M} {\sum\nolimits_{j = 1}^{N} {\left[ {Z\left( {i,j} \right) - F(i,j)} \right]^{2} } } }}$$
(7)
$${\text{MAE}} = \frac{{\sum\nolimits_{i = 1}^{M} {\sum\nolimits_{j = 1}^{N} {\left| {Z\left( {i,j} \right) - F(i,j)} \right|} } }}{M \times N}$$
(8)

\(F(i,j)\) refers to the gray level at the coordinate \((i,j)\) of the noise image, \(Z(i,j)\) denotes the gray level at the same coordinate of the filtered output image, and \(M \times N\) stands for image height × image width.
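Eqs. (7) and (8) translate directly into code. The sketch below is a minimal, dependency-free version for 8-bit images stored as nested lists; the peak value of 255 comes from Eq. (7), while the function names and the handling of identical images are illustrative choices.

```python
import math
from typing import List

Image = List[List[float]]


def psnr(z: Image, f: Image, peak: float = 255.0) -> float:
    """Eq. (7): 10 * log10(M * N * peak^2 / sum of squared differences)."""
    m, n = len(z), len(z[0])
    sq_err = sum((z[i][j] - f[i][j]) ** 2 for i in range(m) for j in range(n))
    if sq_err == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(m * n * peak ** 2 / sq_err)


def mae(z: Image, f: Image) -> float:
    """Eq. (8): mean absolute difference over all M * N pixels."""
    m, n = len(z), len(z[0])
    return sum(abs(z[i][j] - f[i][j]) for i in range(m) for j in range(n)) / (m * n)
```

A higher PSNR and a lower MAE both indicate that the two images are closer to each other.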

Next, hyperparameters are set to analyze the detection accuracy of the CNN-based DL algorithm: the number of epochs is 60, and the simulation time is 2,000 s. The learning rate is reduced by a constant factor: it is initially set to 0.01 so that the network learns quickly and is then divided by 10 whenever the loss function stops decreasing. The CNN is trained until the learning rate has been reduced to 0.0001. The batch size is set to 128. The prediction results are evaluated through accuracy, precision, recall, and F-score; accuracy, precision, and recall are expressed as:

$${\text{Acc}} = \frac{{\sum\nolimits_{i = 1}^{l} {\frac{{{\text{TP}}_{i} + {\text{TN}}_{i} }}{{{\text{TP}}_{i} + {\text{FP}}_{i} + {\text{TN}}_{i} + {\text{FN}}_{i} }}} }}{l}$$
(9)
$${\text{Precision}} = \frac{{\sum\nolimits_{i = 1}^{l} {\frac{{{\text{TP}}_{i} }}{{{\text{TP}}_{i} + {\text{FP}}_{i} }}} }}{l}$$
(10)
$${\text{Recall}} = \frac{{\sum\nolimits_{i = 1}^{l} {\frac{{{\text{TP}}_{i} }}{{{\text{TP}}_{i} + {\text{FN}}_{i} }}} }}{l}$$
(11)

In Eqs. (9)–(11), the subscript i indexes the classes and l is the number of classes. TP denotes the number of positive samples predicted to be positive, FP represents the number of negative samples predicted to be positive, and FN is the number of positive samples predicted to be negative. TN stands for the number of negative samples predicted to be negative. Accuracy (ACC) measures the overall classification accuracy, namely, the proportion of correctly predicted samples. Recall (Rec) measures the coverage of positive samples, namely, the proportion of correctly classified positive samples among all the positive samples. Precision (Pre) represents the proportion of samples classified as positive that are actually positive, and the F-measure, the weighted harmonic mean of precision and recall, balances the two.
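A minimal Python sketch of the class-averaged metrics in Eqs. (9)–(11) is given below, computed from a confusion matrix whose rows are true classes and whose columns are predicted classes. No formula is given for the F-score, so the sketch assumes the F1 score is computed per class as \(2 \cdot \text{Pre} \cdot \text{Rec} / (\text{Pre} + \text{Rec})\) and then averaged; that choice, the zero-division handling, and the function name are assumptions.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Class-averaged accuracy, precision, recall and F1 from a confusion matrix
    (rows: true class, columns: predicted class)."""
    l = len(cm)
    total = sum(sum(row) for row in cm)
    acc = prec = rec = f1 = 0.0
    for i in range(l):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                          # true class i, predicted otherwise
        fp = sum(cm[r][i] for r in range(l)) - tp     # predicted i, true class differs
        tn = total - tp - fn - fp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        acc += (tp + tn) / total
        prec += p
        rec += r
        f1 += 2 * p * r / (p + r) if p + r else 0.0   # assumed per-class F1
    return {"accuracy": acc / l, "precision": prec / l,
            "recall": rec / l, "f1": f1 / l}


if __name__ == "__main__":
    print(macro_metrics([[8, 2], [1, 9]]))  # hypothetical two-class example
```

With per-class averaging, the averaged F1 score need not lie between the averaged precision and recall; in the hypothetical matrix above it is slightly below both.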

4 Results and discussion

4.1 Image NR Effect of different algorithms

This section evaluates the proposed filtering algorithm through comparative analysis with the SMF and RAMF algorithms based on the PSNR and the MAE, as shown in Figs. 5 and 6.

Fig. 5 The PSNR of each algorithm under different noise densities

Fig. 6 The MAE (%) of each algorithm under different noise densities

In Figs. 5 and 6, at every tested noise density (10%, 30%, 50%, 70%, and 90%), the proposed filtering algorithm has the best performance in terms of PSNR and MAE, which indicates that it filters noise and protects image details better than the other algorithms. Meanwhile, when the noise density exceeds 50%, the PSNR of both the SMF and RAMF algorithms drops dramatically. However, the PSNR of the proposed filtering algorithm for seriously noisy images (with noise density up to 90%) remains above 20 dB. Thus, the proposed filtering algorithm shows good robustness and can better filter out image impulse noise while preserving details such as the nose, eyes, and hair of the exercisers as much as possible.

4.2 Real-time IP performance analysis of different algorithms

Further, the proposed algorithm is compared with RA-CNN, AlexNet, LSTM, CNN, and RNN in terms of accuracy, precision, recall, and F1 score, as shown in Figs. 7–10.

Fig. 7 Accuracy curves of different algorithms

Fig. 8 The precision of different algorithms

Fig. 9 Recall of different algorithms

Fig. 10 F1 scores of different algorithms

In Figs. 7–10, the proposed algorithm is compared with other DL algorithms in terms of accuracy, precision, recall, and F1 score, respectively. The proposed algorithm outperforms the other DL algorithms (RA-CNN, AlexNet, LSTM, CNN, and RNN) by over 2.24%, with a detection accuracy of 97.80%. Meanwhile, the precision, recall, and F1 score of the proposed algorithm are the highest. Note that the F1 score does not necessarily lie between precision and recall and may be smaller than both. Therefore, compared with other DL algorithms, the proposed DL-based intelligent fitness monitoring system has higher detection accuracy and better safety performance.

Figure 11 illustrates that the transmission delay is positively correlated with the number of real-time images collected, while the proposed algorithm shows the smallest increase: less than 1 s within 750 real-time images. Further, the detection performance of the proposed model on real-time keyframes is comprehensively analyzed. The results show that the proposed real-time IP algorithm can accurately detect moving targets and calculate the information of moving regions, which meets the design requirements. Meanwhile, the proposed algorithm can adapt to background variation and accurately detect dynamic fitness motions without marker points or tight-fitting trousers, even against a complex background (Fig. 12). Therefore, the proposed algorithm has high real-time performance, which provides a solid foundation for follow-up operations, such as human body modeling and limb posture analysis, and it can monitor fitness motions in real-time and give early warnings under laboratory conditions.

Fig. 11 Image transmission delay of different algorithms

Fig. 12 The keyframe detection result of real-time fitness image

5 Conclusion

In the context of national fitness development and sports informatization, the objective is to provide real-time training standards and prevent workout injuries in fitness activities. An intelligent real-time fitness IP system is constructed based on the improved DL algorithm and keyframe extraction from real-time image sequences. The simulation experiment indicates that the system can collect and process fitness images in real-time, shows excellent NR performance for seriously noisy images, and has strong robustness. On the collected real-time fitness keyframes, the system adapts to background changes and accurately detects limb movement, which provides an experimental reference for real-time monitoring and intelligent development of fitness activities. Still, there are some limitations. For example, the proposed real-time fitness motion detection system can only track single-target limb motion, so multi-target limb motion detection will be explored in a follow-up study. Besides, in future research, three-dimensional and more diversified data from the induction coil, surveillance video, and broadcasting will be integrated into the prediction model. The proposed system only realizes some simple functions on the remote server of the system perception layer, such as authentication, data uploading, and a data query interface, which will be further refined in future research.