1 Introduction

Recent years have seen tremendous growth in the use of machine learning for sports science and healthcare applications. This is mainly due to the increased usage of wearable sensors and video-based tracking devices (Ahmadi et al. 2014; O’Reilly et al. 2015, 2017, 2018; Fawaz et al. 2019; Kwon et al. 2020; Choutas et al. 2018) to capture data that is used for rehabilitation or to assess the performance of athletes (Richter et al. 2021).

Human exercise performance classification is a sub-field of human activity recognition (HAR) where the goal is to classify the execution of an exercise into predetermined classes. Most research in this field has focused on inertial sensors for data capture (Ahmadi et al. 2014; O’Reilly et al. 2015, 2017, 2018; Fawaz et al. 2019), which commonly involves extracting domain-specific or predefined statistical features from sensor data and applying supervised machine learning methods. However, sensor-based data collection has notable limitations: it is error-prone and time-consuming, as sensors require careful positioning on the body as well as calibration for the specific task (Whelan et al. 2016; Kwon et al. 2020).

This work focuses on classifying physical exercise execution using video data. Video helps to alleviate some of the above problems, as videos can be easily captured with available smartphones and data capture does not require multiple specialized sensor devices to be worn on the body, thus avoiding issues such as discomfort and impeding the ease of movement (Kwon et al. 2020). In this paper, we work with video recordings of participants executing the Military Press (MP) exercise. MP is an important exercise in strength and conditioning, injury risk screening, and rehabilitation (Whelan et al. 2016). The main objective is to classify exercise performance in terms of differentiating between correct and different aberrant executions of the exercise. Incorrect execution may lead to musculoskeletal injuries and impede performance (Baechle and Earle 2008); therefore, automated and accurate feedback on execution is important to avoid injuries and maximize the performance of the user. While this is an important exercise, it is also a difficult one to classify, with human inter-rater agreement at about 60% (Whelan et al. 2019).

Our previous work (Singh et al. 2020) proposed an approach for interpretable classification of Military Press exercises using videos as time series. We showed that a body pose estimation method, OpenPose (Cao et al. 2019), combined with multivariate time series classifiers (MTSC) can be used to accurately classify and interpret correct and incorrect executions. We henceforth name this approach BodyMTS (for Body tracking Multivariate Time Series). Figure 1 shows the overall flow of BodyMTS: (1) pose estimation identifies and tracks multiple body parts over the video frames, (2) the (X, Y) location coordinates of body parts for each frame are extracted, resulting in a multivariate time series, (3) a multivariate time series classifier is trained to classify the execution of the exercise into pre-defined classes.

In this paper, we extend our prior work with an extensive analysis of the robustness of BodyMTS to different sources of data noise, as well as a side-by-side comparison with state-of-the-art deep learning methods for human activity classification directly from videos. Our hypothesis is that body pose estimation provides a strong prior for the classifier, which then focuses on the pose information important for the task, rather than on other details in the video, e.g., the background. This contrasts with direct end-to-end video classification with deep learning, where noisy data may affect model robustness and accuracy, and generalisation beyond benchmarks is known to be a challenge (Azulay and Weiss 2019). In our experiments, we show that deep learning models require pre-training on large amounts of data, with a gap of 60 percentage points in accuracy between training from scratch and pre-trained models.

Although BodyMTS in its current form is a proof-of-concept, we demonstrate its applicability and feasibility by considering the key factors that may influence performance, such as the impact of realistic noise types on the classifier accuracy and running time, as well as the computational resources and storage space used by the data and models. We focus our attention on noise coming from changes in video quality, pose estimation quality, or time series data pre-processing.

Fig. 1 Overview of the BodyMTS approach for the Military Press strength and conditioning exercise classification. Left-to-right flow: raw video, extracting and tracking body parts using human pose estimation, preparing the resulting data for multivariate time series classification and interpretation

While research on assessing the performance of athletes using sensors has been successfully deployed, there are currently few approaches to classify the execution of strength and conditioning exercises using videos. In our search, we have identified software such as Kinovea (Adnan et al. 2018; Puig-Diví et al. 2019) and DartFish (Fathallah Elalem 2016; Faro and Rui 2016), which seem to work through manual analysis at a very low frame rate. Despite providing a vast number of features, these systems are not equipped with automatic classification of physical exercises (Adnan et al. 2018; Puig-Diví et al. 2019).

Existing research on human activity recognition from videos is based on applying complex deep learning architectures (Ji et al. 2010; Simonyan and Zisserman 2014; Tran et al. 2015; Feichtenhofer et al. 2019). Despite competitive performance on benchmarks, this is achieved at the cost of heavy computational resources, such as several hours of training and testing on high-end GPU hardware. Besides the need for high-end hardware, this also has a negative environmental effect. Furthermore, these models are trained and tested on datasets such as UCF-101 (Soomro et al. 2012) and Kinetics-400 (Kay et al. 2017), which contain long videos and a wide range of activities. For instance, in Kinetics-400 the average clip duration is 10 s and the number of samples is around 300k. In our setting, a single clip is 3 s long on average and the differences between the classes are subtle, which makes the classification task more challenging: distinguishing cycling from walking is far coarser-grained than detecting whether the MP is executed with or without an arch in the back. Our dataset is also small (a few thousand samples for training and validation) when compared to these large benchmarks. We have found no prior work that uses videos for strength and conditioning exercise classification and works at this smaller data scale with such fine-grained classes.

Our main contributions in this paper can be summarized as follows:

  • We present and extensively evaluate BodyMTS, an end-to-end video-as-timeseries human exercise performance classification method. We study the impact of improvements in body pose estimation methods (e.g., OpenPose (Cao et al. 2019)) and recent multivariate time series classifiers (e.g., ROCKET (Dempster et al. 2020) and MiniROCKET (Dempster et al. 2021)) on the overall classification accuracy. We show improvements in accuracy, reaching an average classification accuracy of 87% for the Military Press exercise.

  • We analyze the robustness of BodyMTS against different types of realistic noise and measure the impact on the classifier performance. We consider three common sources of noise in our application setting: video capture quality, pose estimation quality and time series pre-processing steps.

  • We conduct an extensive empirical study comparing BodyMTS to state-of-the-art deep learning approaches for human activity recognition directly from videos. We compare all methods in terms of accuracy, training/testing time and computation resources. We show that BodyMTS is robust to lower quality data captured at prediction time and has fast training and prediction.

  • To support our paper, all of our code, data and detailed results are available at: https://github.com/mlgig/BodyMTS_2021.git.

The paper is organized as follows. In Sect. 2 we discuss the application and technical requirements of BodyMTS. In Sect. 3, we give an overview of the related literature on human activity recognition, human pose estimation, strength and conditioning exercises and multivariate time series classification. Section 4 describes the data collection process and the Military Press dataset. Section 5 presents our methodology for classifying MP exercises from videos and Sect. 6 describes the main data mining challenges. In Sect. 7, we analyze the robustness of BodyMTS against different sources of noise and compare its performance with state-of-the-art deep learning methods. In Sect. 8, we describe the lessons learned from this study, as well as limitations and future work. In Sect. 9 we summarise our recommendations for practitioners working on similar tasks and we conclude in Sect. 10.

2 Application requirements

In this section, we discuss the required BodyMTS features and the corresponding application and technical requirements. We note that BodyMTS is currently a proof-of-concept and the actual deployment scenario and requirements may change depending upon the business case and the end-user requirements.

The aim of BodyMTS is to provide a scalable system that can accurately measure and evaluate end-user performance of strength and conditioning (S&C) exercises, with a view to providing feedback in near real-time. This, in turn, can guide physiotherapists, trainers, and elite and recreational athletes to perform exercises correctly and therefore minimise injury risk and enhance performance. We devised the following list of application requirements based on previous research in which we consulted with end users, clinicians, and strength and conditioning experts on the design, implementation and evaluation of interactive feedback systems for exercise (Brennan et al. 2020; Argent et al. 2019, 2018; Giggins and Caulfield 2015; O’Reilly et al. 2017):

  • Be able to accurately monitor the movement of body parts, accounting for the critical body segments involved in the exercise in question.

  • Detect when deviations from the normal movement profile have occurred, and which kind of deviation has occurred in each case.

  • Provide clear and simple feedback to the end user, in near real-time.

  • Simple data capture based on ubiquitous sensor technology (e.g., a single phone).

  • Coverage of a wide range of S&C or rehabilitation exercises.

Table 1 summarizes the application features and the corresponding application and technical requirements for such a system.

Table 1 Application features, application requirements and associated technical requirements

There are two main components of BodyMTS:

  • Client-side mobile application, which the end-user uses to record their execution. The recorded videos are pre-processed before being sent to the server side running on the cloud.

  • Server-side application, which stores the pre-trained model. Each repetition in the clip is classified separately using the stored model. The final results are then returned to the client-side mobile application.

Figure 1 shows the overall workflow of BodyMTS. Before the user starts using the client side mobile component of BodyMTS, we expect the following requirements to be fulfilled:

  • The mobile camera used for recording the execution should be placed on a static surface before the start of the execution.

  • The view of the camera will vary depending upon the type of exercise (e.g., front view for Military Press). This is static information that will be already stored within the application for each type of exercise.

  • The mobile application will use a bounding box to centralize the user with respect to each frame.

These are quality control requirements that are evaluated before and during the execution: the video and pose estimation have to be of sufficient quality, otherwise the data will be rejected by the application. After the above conditions are fulfilled, the user activates the application and starts recording the video. At the end of the workout, the client-side application pre-processes the recorded video to centralize the participant and to remove the audio. The video is then compressed to reduce its total size and sent to the server-side application, where the compressed video undergoes pose estimation followed by segmentation and classification. Finally, the classification results are returned to the client-side mobile application.

3 Related work

In this section we present an overview of existing approaches for strength and conditioning exercise classification, human action recognition from videos, human pose estimation and multivariate time series classification.

3.1 Strength and conditioning exercise classification

The purpose of S&C exercises is to improve the performance of athletes in terms of strength, speed, flexibility and agility (Trejo and Yuan 2018; O’Reilly et al. 2015, 2018; Chu et al. 2019). S&C exercises span multiple types of exercises or movement sequences that target different parts of the body and different functional goals. In some cases, the person interacts with a weight or mechanical apparatus, whereas in others the person performs a free body movement without any interaction with an external system or force (e.g., a jump). Recent advances in technology have spurred the usage of high-tech solutions to maximize the performance of athletes. These can be divided into three broad categories: optical motion capture, wearable inertial sensors and video (Singh et al. 2020; Puig-Diví et al. 2019; Faro and Rui 2016; Fathallah Elalem 2016).

The most popular optical motion capture system is Microsoft Kinect. Its use for rehabilitation exercises, movement quality assessment and gait analysis has been investigated in prior work (Trejo and Yuan 2018; Zerpa et al. 2015; Ressman et al. 2020; Decroos et al. 2018; Dajime et al. 2020). However, despite their high performance, these systems are expensive, need high maintenance, require significant time to set up and are mostly limited to controlled clinical trials.

Wearable inertial sensor-based approaches consist of fitting Inertial Measurement Units (IMU) (O’Reilly et al. 2018; Chu et al. 2019; Espinosa et al. 2015) on different parts of the body. The sensor data is analyzed to evaluate performance using supervised machine learning methods, visualization or manual techniques. The number of inertial sensors required and their positions vary from exercise to exercise (Espinosa et al. 2015; Whelan et al. 2016; O’Reilly et al. 2017, 2018). Both research methods and commercial systems have been deployed using such inertial sensors. Still, sensors can be expensive, they may hinder the ease of movement, particularly when applied over many body parts and over longer periods of time, and the annotation process can be time-consuming (Whelan et al. 2016; Kwon et al. 2020; Dajime et al. 2020).

The third category uses video-based devices such as dedicated cameras (DSLR) or smartphone cameras to capture data. Proprietary software such as Dartfish (Fathallah Elalem 2016; Faro and Rui 2016) and open-source software such as Kinovea (Adnan et al. 2018; Puig-Diví et al. 2019; Moral-Muñoz et al. 2015) are used to analyze performance by providing the option of slow-motion replay at a very low frame rate. However, these systems are less accurate and require fitting body markers whose color contrasts with the background. Recent work (Slembrouck et al. 2020; Nakano et al. 2020a; Stamm and Heimann-Steinert 2020) utilizing pose estimation for motion tracking has paved the way for alternative approaches to IMUs and optical motion capture systems. We found no prior work that utilizes video to classify S&C exercises.

3.2 Human activity recognition

Video-based Human Activity Recognition (HAR) is a core area of computer vision. HAR methods can be broadly classified into two categories. First are methods based on handcrafted features such as bags of visual words (Wang and Schmid 2013; Dalal et al. 2006; Peng et al. 2014; Sánchez et al. 2013). These include finding local spatio-temporal features such as motion boundary histograms (Dalal et al. 2006) and trajectories (Wang and Schmid 2013), which are then fed to a classifier. These methods provided competitive performance on benchmark datasets (Carreira et al. 2018; Sigurdsson et al. 2016; Soomro et al. 2012) before the emergence of deep learning methods. The second category includes deep learning methods, in particular convolutional neural networks. The success of 2D-CNNs (Krizhevsky et al. 2012a) in image classification has motivated researchers to employ these models for action recognition in video. Several models, e.g., 3D-CNN (Ji et al. 2010), two-stream convolutional networks (Simonyan and Zisserman 2014), I3D (Carreira and Zisserman 2017) and SlowFast (Feichtenhofer et al. 2019), have achieved state-of-the-art performance on benchmark datasets. However, these models are computationally expensive, and we found no studies evaluating them for strength and conditioning exercise classification under specific application constraints. It is also not clear how well these methods work on real use cases.

3.3 Human pose estimation

Pose estimation refers to recognizing the postures of humans by detecting the body parts from images. It is considered one of the hardest problems in computer vision due to challenges such as occlusion, complex motion dynamics, interactions and background (Cao et al. 2019; Papandreou et al. 2017; Huang et al. 2017). Traditional approaches (Andriluka et al. 2009; Gkioxari et al. 2013; Sapp and Taskar 2013; Dantone et al. 2013) were based on extracting handcrafted features. Current methods based on deep learning architectures (Cao et al. 2019; Papandreou et al. 2017; He et al. 2017; Newell et al. 2017) have achieved remarkable results on this task.

Recent approaches include methods such as OpenPose (Newell et al. 2017; Insafutdinov et al. 2016; Cao et al. 2019), which work by first finding the body joints and associating them using affinity fields, and DeepCut (Pishchulin et al. 2015), which uses a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors.

OpenPose can detect and track multiple body points in real-time and with high accuracy. The most recent version of OpenPose (Cao et al. 2019) can detect 25 body parts in an input image in under 1 s, with average accuracy ranging from 75.6% to 79% on recent 2D pose estimation benchmarks.

3.4 Multivariate time series classification

Time series classification is a form of supervised classification where the data is ordered. For multivariate time series classification (MTSC), each sample has multiple dimensions and a class label.

We can group existing methods for MTSC into five broad categories (Ruiz et al. 2021; Dhariyal et al. 2020): distance-based, feature-based, ensemble-based, linear models and deep learning. These methods have mostly been evaluated on the UEA MTSC archive (Bagnall et al. 2018), which contains 30 multivariate datasets. Among the methods evaluated, linear classifiers and deep learning methods have achieved high accuracy, with low running time and excellent scalability, hence we focus on this subset here.

Linear classifiers ROCKET (Dempster et al. 2019b) (RandOm Convolutional KErnel Transform) is the current state-of-the-art for both univariate and multivariate TSC in terms of accuracy and scalability. It uses a large number of random convolutional kernels in conjunction with a linear classifier. MINIROCKET (Dempster et al. 2021) is a recent extension of ROCKET which is deterministic, faster and more efficient. Unlike ROCKET, MINIROCKET implicitly normalizes the time series and is thus scale-invariant, which proves to be a weakness for our application: for MP, the magnitude of the signal plays an important role in the classification task, so the signal should not be normalized.

Deep learning classifiers Recent success in image classification (Simonyan and Zisserman 2015; Krizhevsky et al. 2012b) has motivated researchers to use deep learning methods to classify time series data. Fawaz et al. (2019) presented a comprehensive study of 9 deep learning models for classifying univariate and multivariate time series. Fully Convolutional Networks (FCN) and ResNet have shown state-of-the-art performance without suffering from high time and memory complexity.

Fig. 2 Single frames depicting the induced MP deviations for class A, Arch and R (left to right)

4 Data collection

Crossfit workout dataset The data used for evaluating our approach consists of video recordings of executions of the Military Press (MP) exercise. During this exercise, the barbell is lifted to shoulder height and then smoothly pressed overhead by extending the elbows. The amount of weight lifted and the time taken for each repetition may vary from participant to participant. MP is an important exercise in strength and conditioning, injury risk screening, and rehabilitation (Whelan et al. 2016).

Participants were asked to complete fixed repetitions of normal and aberrant forms for this exercise.

Figure 2 shows some examples for the execution of the MP exercise.

Participants 53 healthy volunteers (31 males and 22 females; age: 26 ± 5 years; height: 1.73 ± 0.09 m; body mass: 72 ± 15 kg) were recruited for the study. Participants did not have a current or recent musculoskeletal injury that would impair performance of multi-joint upper limb exercises. The Human Research Ethics Committee at University College Dublin approved the study protocol, and written informed consent was obtained from all participants before the start of the study.

Experiment protocol The testing protocol was explained to participants upon their arrival at the laboratory. Participants completed 10 repetitions of the normal form and 10 repetitions with induced deviations. To ensure standardization, the technique was considered acceptable if it was completed as defined by the National Strength and Conditioning Association (NSCA) guidelines. The induced forms were chosen based on common deviations listed in the NSCA guidelines (Baechle and Earle 2008) and through discussion with sports physiotherapists and strength and conditioning coaches. Participants were allowed to familiarize themselves with the exercise by completing practice repetitions.

All performances were observed and labelled by an expert. Any performance that degraded due to fatigue was excluded at source. Likewise, each repetition was observed, and any repetition inconsistent with its label (based on the domain expert’s observation) was excluded from the data.

Two cameras (Sony Action Camera, Sony, Tokyo, Japan) were set up in front of and to the side of the participants to allow simultaneous recording in the frontal and lateral planes. The data was recorded at 30 frames per second with 720p resolution. Each individual video clip was then labeled according to participant number, the exercise completed, and whether it was completed in an acceptable or aberrant manner. Each participant completed the set at their desired tempo.

Exercise technique and deviations The induced forms were further sub-categorized depending on the exercise. Completing the exercise with an aberrant technique undermines performance, means strength gains are not made as efficiently, and increases the likelihood of injury. Below we describe the four classes of normal and deviated execution forms for the MP exercise.

Exercise classes:

  • Normal (N): This class refers to the correct execution of the exercise. The participant starts by lifting the bar from near shoulder level all the way above the head until the arms are fully stretched, and then brings it back to shoulder level with no arch in the back. The bar must be stable and parallel to the ground and the back should be straight.

  • Asymmetrical (A): This form refers to an execution where the bar is lopsided and asymmetrical.

  • Reduced Range (R): This form refers to an execution where the bar is not brought down completely to shoulder level.

  • Arch (Arch): This type of execution indicates that the participant arches their back.

5 Methods

In this section we present the BodyMTS pipeline and provide details about its individual components and data pre-processing steps. We describe OpenPose and why it is preferred over other pose estimation libraries. We also briefly explain the process of obtaining multivariate time series data from videos using OpenPose, segmenting the long time series to obtain an individual time series for each repetition, and the methods chosen for classification.

5.1 Methodology

BodyMTS is a novel end-to-end approach to classify video-based S&C exercises. It consists of two main steps. The first step applies human pose estimation to extract multivariate time series data from video: pose estimation tracks the location coordinates of multiple body parts over the video frames. The second step applies multivariate time series classification methods.

Body pose estimation We select OpenPose (Cao et al. 2019) over other frameworks such as R-CNN (Girshick et al. 2013) or Alpha-Pose (Fang et al. 2017) for the following reasons: (1) it is robust against possible occlusions, including during human-object interaction; (2) it is a full-fledged system that does not require manual steps such as extracting all frames from a video, setting up a display to visualize the results, or saving the results in a desired format; (3) it runs on different platforms, including Ubuntu, Windows, Mac OSX, and embedded systems; (4) its inference time outperforms all state-of-the-art methods, while preserving high-quality results. OpenPose also provides a confidence score for each body part and each frame. The confidence score ranges from 0.0 to 1.0, where a higher value indicates a higher probability of detecting a particular body part at that location. This score can be used as a proxy to assess the accuracy of OpenPose.

Table 2 The 25 body parts tracked by OpenPose (v1.7) in a video frame (Cao et al. 2019)

Table 2 shows the list of body parts tracked by OpenPose. For each detected body part in a given input frame, it outputs the X and Y coordinates and a detection confidence.

The coordinates are relative to an origin at the top-left corner of the image. The video is fed to OpenPose to obtain a sequence of X and Y location coordinates for each body part and each video frame.
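To make this extraction step concrete, the following is a minimal sketch of stacking OpenPose's per-frame JSON output (one file per frame when the binary is run with --write_json) into a channels-by-time array. It assumes a single detected person per frame, as in our recording setup; the directory layout and function name are illustrative.

```python
import glob
import json

import numpy as np

def video_to_mts(json_dir, n_parts=25):
    """Stack per-frame OpenPose JSON files into an array of shape
    (2 * n_parts, n_frames): X series for all parts, then Y series."""
    frames = []
    # OpenPose zero-pads frame numbers, so a lexicographic sort is chronological.
    for path in sorted(glob.glob(f"{json_dir}/*_keypoints.json")):
        with open(path) as f:
            people = json.load(f)["people"]
        if not people:  # no person detected in this frame
            frames.append(np.zeros(2 * n_parts))
            continue
        # BODY_25 layout: [x0, y0, conf0, x1, y1, conf1, ...]
        kp = np.asarray(people[0]["pose_keypoints_2d"]).reshape(n_parts, 3)
        frames.append(kp[:, :2].T.ravel())  # keep X and Y, drop confidence
    return np.stack(frames, axis=1)  # channels x time
```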

Each frame is considered a single time point in the output time series data. The original videos require 813 MB of storage; after cropping and removing the audio this reduces to 213 MB. After applying OpenPose and extracting the body-part time series, the data size reduces to 30 MB, roughly a 7-fold reduction. Figure 3 shows the use of pose estimation to track the coordinates of body parts over all the video frames. The plot at the end shows the raw Y-coordinates of 8 upper body parts for a single video of class N. The time series obtained for lower body parts such as the ankles and hips do not show much variability throughout the clip, as these body parts mostly remain static during the execution of the Military Press.

Fig. 3 Extraction of time series data from video using OpenPose. Each frame in the video is considered as a single time point in the resulting time series. Each tracked body part results in a single time series that captures the movement of that body part. The whole motion is captured as a multivariate time series with 50 channels, two (X, Y) channels for each body part tracked (only 8 body parts with Y coordinate shown above). A class label is associated with each such multivariate time series

Multivariate time series data Each video records 10 repetitions of an exercise execution, resulting in a time series capturing the body point movements over 10 repetitions. Since each repetition is the record of a single exercise execution, the long time series must be segmented to obtain the sequence for a single repetition. Each repetition forms a single time series sample for training and evaluating a classifier. We use peak detection methods to segment the pose estimation time series data: we find the locations of local maxima in the signal using the scipy package. The body parts considered for finding the peaks are the elbows or wrists, as these are the only body parts showing regularity in their patterns, as shown in Fig. 3. We only keep the upper body part time series, as suggested by the domain experts who carried out the data collection.
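A minimal sketch of this segmentation step with scipy.signal.find_peaks is given below. Because the image origin is at the top-left, the wrist Y value is largest at shoulder level, so local maxima mark the boundaries between repetitions. The distance and prominence values are illustrative guards against pose-estimation jitter, not the exact settings of our pipeline.

```python
import numpy as np
from scipy.signal import find_peaks

def segment_repetitions(wrist_y, fps=30, min_rep_seconds=1.0):
    """Split a multi-repetition Y-coordinate trace into per-repetition slices."""
    peaks, _ = find_peaks(
        wrist_y,
        distance=int(min_rep_seconds * fps),  # each repetition lasts at least ~1 s
        prominence=0.25 * np.ptp(wrist_y),    # ignore jitter-sized bumps
    )
    bounds = np.concatenate(([0], peaks, [len(wrist_y) - 1]))
    return [slice(s, e + 1) for s, e in zip(bounds[:-1], bounds[1:])]
```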

The data for some body parts (nose, eyes) are ignored as OpenPose fails to track these since the participant is not facing the camera. We also present results with different subsets of dimensions to understand the impact of different body parts on accuracy. We investigate using all the channels or using automated channel selection methods (Sect. 7).

The time series obtained after this step have variable length, since the time taken to complete each repetition differs from participant to participant. Since the current implementations of ROCKET and the deep learning methods cannot handle variable-length time series, all time series were re-sampled to the length of the longest time series (161 time points). We use a 1D interpolation function with a cubic spline fit to interpolate each time series to the same length (Virtanen et al. 2020).
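The re-sampling step amounts to a few lines with scipy; the sketch below uses interp1d with a cubic spline fit to stretch one repetition (a channels-by-time array) to the fixed target length of 161.

```python
import numpy as np
from scipy.interpolate import interp1d

def resample_to_length(series, target_len=161):
    """Cubic-spline re-sampling of a (channels, time) array to a fixed length."""
    t_old = np.linspace(0.0, 1.0, series.shape[1])
    t_new = np.linspace(0.0, 1.0, target_len)
    return interp1d(t_old, series, kind="cubic", axis=1)(t_new)
```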

The final data contains time series corresponding to 8 body parts (elbows, shoulders, wrists and hips) with 16 channels (X and Y coordinates). Lastly, as observed in our prior work (Singh et al. 2020), the time series data is not normalized, as normalization leads to a substantial drop in accuracy. Through this application we learned that most state-of-the-art time series classifiers only work with fixed-length time series and also have an implicit step of normalizing the time series. While these algorithmic constraints seem harmless on clean TSC benchmarks, they prove problematic in real use cases: re-sampling the length changes the meaning of the time series and, similarly, default or implicit normalization within the algorithm changes the meaning of the data and affects the accuracy of the classifier.

Among the classifiers evaluated, only ROCKET exposes the normalization option to the user, and for our application this makes a difference of 10 percentage points in accuracy.
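For illustration, a minimal sketch of this configuration with the sktime implementation of ROCKET follows. The Rocket transformer exposes a normalise flag, which we disable; parameter names may vary slightly across sktime versions, and X_train/X_test are assumed to be arrays of shape (n_samples, 16, 161).

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sktime.transformations.panel.rocket import Rocket

# normalise=False is essential here: the magnitude of the coordinate
# signals carries class information for the MP exercise.
clf = make_pipeline(
    Rocket(num_kernels=10_000, normalise=False, random_state=0),
    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```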

Train/test split We perform repeated 70:30 splits of the full data set to obtain training and test data. Each split is done on the unique participant IDs to avoid leaking information into the test data: by splitting at the ID level we ensure that all samples from a given participant go into either the training or the test data. The data is overall balanced across the classes. Table 3 shows the number of samples across the four classes for a single train/test split; there are roughly 1400 samples in the training data and 600 in the test data.
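A sketch of this participant-level splitting with scikit-learn's GroupShuffleSplit is shown below; X, y and participant_ids are assumed to hold one entry per segmented repetition.

```python
from sklearn.model_selection import GroupShuffleSplit

# groups carries the participant ID of every repetition, so all samples
# from one participant land entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in splitter.split(X, y, groups=participant_ids):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```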

Table 3 Total number of samples per class in the train and test datasets for one 70:30 split

6 Data mining challenges and solutions in the context of BodyMTS

In this section we present the data mining challenges posed by this application’s requirements. We discuss the challenges and solutions for each stage of BodyMTS, as shown in Fig. 1.

  • Data size Video data is large and requires extensive memory and storage resources, as well as high-end computation machines.

    Solution: We investigate approaches to reduce the video size, such as frame cropping and centering, increasing the video compression ratio (using CRF) and reducing the video to time series. We find that there are settings that give a significant data reduction (70%) while still preserving classification accuracy. Experiments investigating this challenge are presented in Sect. 7.2.

  • Noisy data The steps of video data capture and reduction, pose estimation and time series pre-processing can reduce the quality of the data by introducing noise. BodyMTS, as well as deep learning methods that work directly with video, is affected by the level of data noise (e.g., blurred, poor quality videos). The accuracy of OpenPose is also directly impacted by video quality.

    Solution: We examine the impact of data reduction on accuracy. We evaluate different settings for training and predicting on high quality data, as well as training on high quality data and predicting on lower quality (noisier) data, to simulate realistic use cases. We also evaluate different settings for pose estimation and time series processing. We find that there are data and model settings that are robust to noise, preserve classification accuracy and allow near real-time prediction. Experiments investigating this challenge are presented in Sect. 7.2.

  • High dimensionality of multivariate time series and scalability of existing classifiers Pose estimation libraries such as OpenPose track multiple key-points on a human body (25 for OpenPose), which may lead to large-scale multi-dimensional time series data.

    Solution: We overcome this challenge by consulting domain experts, as well as by evaluating recent approaches to automatically select useful channels. Out of 25 body parts, we show that using only 8 upper body parts achieves an accuracy of 87% on average for the Military Press. We also explored techniques such as skipping frames in the input video during pose estimation. We examine recent scalable multivariate time series classifiers and find that ROCKET performs best among state-of-the-art methods with regard to both accuracy and training/prediction time. Experiments investigating this challenge are presented in Sect. 7.3.

  • Segmentation of time series The input video is a sequence of 10 repetitions of one exercise, which simulates actual use cases. The classification is done on each individual repetition, so a segmentation step is required to break down the data into single repetitions. The current method uses peak detection on the pose estimation time series to obtain the data for each repetition. This approach is prone to noisy fluctuations (due to pose estimation errors) and may lead to incomplete repetitions. Additionally, it is not known in advance which body parts to use for segmentation.

    Solution: We investigate different channel selection methods to capture the subset of body parts most relevant for the target exercise. We also analyse simple segmentation techniques applied directly to the video, versus segmentation based on the pose estimation time series. We find that directly splitting the video into equal parts works reasonably well, although it is less accurate than using peak detection on the pose estimation time series. Experiments investigating this challenge are presented in Sect. 7.1.3.

  • Data privacy and security Capturing video data comes with implicit challenges such as privacy and data security. It is critical to design approaches that maintain individual anonymity.

    Solution: We turn videos into pose time series, which removes any visual cues about the identity of users. Pose estimation could in principle be used for height estimation, but this reveals little about a user’s identity.

7 Experiments

This section presents an empirical evaluation of our method and is organized around the data mining challenges discussed in the previous section. In Sect. 7.1 we compare the performance of different deep learning methods for video classification versus BodyMTS with regard to accuracy and computational efficiency. We also evaluate the impact of two segmentation techniques on the performance of BodyMTS and the best deep learning method. We further compare the impact of video quality on the best deep learning method versus BodyMTS. Section 7.2 addresses challenges raised by noisy data and data size. We analyze the robustness of BodyMTS against different sources of noise which can be broadly grouped into 3 categories: (1) video data capture; (2) OpenPose parameters; (3) time series data pre-processing.

Section 7.3 addresses high dimensionality and scalability for time series classification. We present the accuracy and compare the total execution time for different classifiers. We further evaluate the impact of utilizing different subsets of body parts on BodyMTS accuracy. The best results are highlighted in bold in the tables.

We have not included any experiments to address the issue of data privacy and security as BodyMTS works directly on the multivariate time series data and hence reasonably safeguards the identity of the user.

7.1 BodyMTS versus direct human action recognition classifiers

In this section, we compare BodyMTS with state-of-the-art methods for human activity recognition from videos. These methods employ deep learning architectures and have shown good performance on several benchmark datasets such as UCF101, Kinetics-400 and Kinetics-600. We selected a few methods based on their performance, execution time and resources required. The following section provides a brief overview of the selected methods.

  • C2D (Fan et al. 2020) stands for a 2D convolution based model. All convolutions are performed on each frame independently. ResNet-50 and ResNet-101 can be used as the backbone architectures for C2D.

  • I3D (Carreira and Zisserman 2017) stands for Inflated 3D convolutional network. These models work by inflating the kernels of C2D models in order to capture temporal information, which makes them computationally expensive due to the increased number of computations.

  • SlowFast (Feichtenhofer et al. 2019) uses two pathways to perform activity recognition from videos. The slow pathway captures spatial information at a low frame rate and the fast pathway captures temporal information at a high frame rate. The fast pathway requires less computation because it uses a backbone network with reduced channel sizes, normally 8 times smaller than the slow pathway backbone. The information from the two pathways is fused by lateral connections. The backbone architecture of SlowFast can be a 3D ResNet-50, a 3D ResNet-101, a Non-local Network or a combination of these.

  • Non-local Networks (Wang et al. 2017) are used to capture long-range dependencies by enhancing large receptive fields. They can be integrated as a generic building block into most deep learning architectures. In the experiments of Feichtenhofer et al. (2019), they were combined with existing standard backbone models such as ResNet-50 or ResNet-101.

  • X3D (Feichtenhofer 2020) progressively expands a 2D CNN along multiple axes: space, time, width and depth. It uses progressive forward expansion followed by backward contraction; the axis to expand is selected based on the performance of the model.

7.1.1 Deep model architectures

We use the SlowFast library (Fan et al. 2020) to evaluate the above mentioned models. Table 4 shows the number of frames, sampling rate and the frame cropping size for these architectures. The column “Model config” refers to the name of the config file in the SlowFast repo. All models are initialized with weights pre-trained on Kinetics-400. In the “Model config” column, R50 indicates that ResNet-50 has been used as the backbone architecture; the numeric values have the format \(X \times Y\), where X indicates the number of frames and Y the sampling rate, e.g., C2D_8\(\times \)8_R50 is a C2D model which uses a total of 8 frames with a sampling rate of 8 from a video clip. All experiments were executed on an Ubuntu machine with a single GPU (NVIDIA TITAN XP 12 GB, Ubuntu 18.04.5 LTS, AMD Ryzen 7 1700X Eight-Core Processor). We evaluated different batch sizes to maximize GPU memory utilization; after multiple iterations we found a batch size of 5 suitable for all architectures to avoid out-of-memory errors.

We use 10% of the training data as a validation set, chosen such that the participant IDs of training and validation do not overlap. Table 4 shows the selected models: SlowFast, C2D, X3D, and I3D. For testing, 10 frames are uniformly sampled along the temporal axis from each clip consisting of a single repetition. Each frame undergoes top/left, middle and bottom/right cropping, giving a total of 30 frames for each video. The final softmax score is averaged over the 30 frames to give the final prediction. We report the training and testing time for each of these architectures. Please refer to Feichtenhofer et al. (2019) for more details on the data pre-processing and configurations. Note that directly running these models with the default parameters given in the SlowFast repository does not lead to good results; a considerable amount of engineering effort was spent on tuning the hyperparameters: learning rate, batch size, epochs, warm-up epochs and weight decay. All other hyperparameters remain unchanged. All models are trained with Stochastic Gradient Descent (SGD) with momentum 0.9 for 10 epochs.
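The sketch below illustrates the 30-view test-time averaging described above, treating each sampled view as a single model input for brevity (in SlowFast each temporal sample is in fact a multi-frame clip). The softmax, crop helper and model call are simplified assumptions, not the SlowFast library API.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def three_crops(frame, size):
    """Top/left, middle and bottom/right crops along the longer side."""
    h, w = frame.shape[:2]
    if w >= h:
        xs = [0, (w - size) // 2, w - size]
        return [frame[:, x:x + size] for x in xs]
    ys = [0, (h - size) // 2, h - size]
    return [frame[y:y + size] for y in ys]

def predict_clip(model, frames, size=224, n_samples=10):
    """Average softmax scores over 10 uniform samples x 3 crops = 30 views."""
    idx = np.linspace(0, len(frames) - 1, n_samples).astype(int)
    views = [softmax(model(c)) for i in idx for c in three_crops(frames[i], size)]
    return int(np.mean(views, axis=0).argmax())
```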

Table 4 Selected deep learning activity recognition models and their configurations
Table 5 Average accuracy, total testing time, time per testing clip and total training time for different architectures over 3 train/test splits

7.1.2 Results of BodyMTS versus deep learning models

Table 5 reports the average accuracy and average running time of the deep learning models evaluated. Note that this accuracy was obtained with significant model engineering effort and with pre-trained weights from the Kinetics-400 benchmark for all the deep models. As seen from Table 5, C2D performed worst, whereas SlowFast and SlowFast + NL achieve the highest accuracy, followed by I3D and X3D-M. We select the best model, which is SlowFast in this case, and compare its performance with BodyMTS; we use the short name SlowFast for this model in the subsequent sections. The test data contains a total of 60 clips, each with 10 repetitions on average. These clips were fed one by one to the pre-trained model. The testing time for each model is the sum of the time taken for data pre-processing (which includes segmentation), model loading and classification. Table 5 shows the total testing time over the 60 clips for each model. We also report the average testing time over single clips of 10 repetitions. The training time shown is the sum of the time taken for data pre-processing and model training over all clips in the training data.

Accuracy Table 6 reports the average accuracy over three splits for SlowFast (0.83) and BodyMTS (0.87). BodyMTS achieves higher accuracy with minimal model engineering effort. We note that the default architecture of SlowFast is meant for classifying large-scale video datasets such as UCF-101 or Kinetics-600, so there is a higher chance of overfitting. This is substantiated by the very significant drop in the accuracy of SlowFast when removing the pre-trained weights, as shown in Table 6: the accuracy drops by almost 60 percentage points, from 0.83 to 0.25. It is only after using the weights pre-trained on Kinetics-400 and other model engineering steps that SlowFast reaches an average accuracy of 0.83.

Execution time We report the total training and testing time for both models in Tables 5 and 6. The total duration of all the videos (both training and test) is 95 min. The combined train/test time of BodyMTS (OpenPose + data pre-processing + training/testing) is around 74 min, whereas the combined train/test time of SlowFast is around 86 min. Prediction time is 22 min for BodyMTS and 27 min for SlowFast. This shows that BodyMTS is faster than SlowFast for both training and prediction. Additionally, using a frame step of 3, the combined train/test time of BodyMTS goes down to 38 min, which is significantly faster than SlowFast. Excluding the execution time of OpenPose, BodyMTS takes a total of only 2 min for both training and testing the classifier. Furthermore, for a single clip consisting of 10 repetitions, BodyMTS takes a total of 12 s, whereas SlowFast takes 29 s on average. Hence BodyMTS is overall faster than SlowFast and can deliver near real-time predictions.

Table 6 Comparison of SlowFast and BodyMTS approaches in terms of time taken and resources required

Cost All deep learning methods require the use of a GPU. In our case, a single GPU was used to compare the performance of the deep learning models. Increasing the number of GPUs will reduce execution time, but at the cost of expensive infrastructure, while running deep learning models without any GPU leads to a significant increase in training/testing time. The CPU version of OpenPose takes roughly 15 s/frame (Cao et al. 2019); however, recent lightweight implementations of OpenPose make it possible to reach real-time inference on a CPU with a negligible accuracy drop (Osokin 2018). We also note that libraries such as OpenVINO make it possible to execute OpenPose on CPU machines, and TensorFlow Lite supports running pose estimation models directly on a mobile phone in real-time.

This is interesting for future work, but constitutes a different workflow than the one we use here. It is therefore possible to reduce the computation footprint of BodyMTS well below that of the corresponding deep learning models for video-based exercise classification.

Storage space Table 6 shows the initial data size for SlowFast and BodyMTS. Because of the large size of the videos, there is a high cost involved in storing them. For BodyMTS, however, the videos do not need to be stored once the time series are extracted. Since the data is just a sequence of numbers (time series), there can be large savings in storing this data.

Even when the data is sent to the cloud server, there can be large savings in terms of the bandwidth required to transfer it.

Practical aspects From Tables 5 and 6, we see that BodyMTS has higher accuracy than the deep learning methods. Deep models may suffer from high training and test times depending on the data size, and require high engineering effort to tune hyperparameters such as the number of epochs, learning rate, etc. Increasing the number of GPUs may decrease the execution time of the deep learning models, but this may not always be possible due to cost. In contrast, due to the lightweight nature of the time series classifier, BodyMTS does not require GPU resources (when using lightweight OpenPose) and can be trained and tested within a fraction of the runtime of SlowFast, on a single CPU machine (Dempster et al. 2019b; Osokin 2018).

7.1.3 Impact of segmentation

In this section we analyze the impact of two segmentation techniques on the accuracy of BodyMTS and SlowFast. As stated earlier, in the data pre-processing step, segmentation of the video data is required to obtain an individual sample for each repetition in the train/test data. We consider two scenarios here:

  • The start and end time of each repetition are known in advance. This information can be used to segment the video data into individual repetitions; to obtain it, we use pose estimation and peak detection techniques.

  • Only the number of repetitions is known in advance. Dividing the total duration of the video clip by the number of repetitions approximately segments the individual repetitions, assuming that the participant takes a consistent time to complete each repetition. This approach does not require pose estimation information (a minimal sketch follows this list).
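The second scenario can be implemented in a few lines; the sketch below divides a clip of n_frames frames into n_reps equal segments, under the stated constant-tempo assumption.

```python
import numpy as np

def equal_split(n_frames, n_reps=10):
    """Approximate per-repetition boundaries by dividing the clip equally."""
    bounds = np.linspace(0, n_frames, n_reps + 1).astype(int)
    return [slice(s, e) for s, e in zip(bounds[:-1], bounds[1:])]
```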

Table 7 Average accuracy obtained by SlowFast and BodyMTS for two different segmentation approaches for three train/test splits

Table 7 shows the average accuracy of SlowFast and BodyMTS using the two segmentation approaches over three train/test splits. BodyMTS achieves 4 percentage points higher accuracy than SlowFast when the pose estimation data is used for segmentation. When the segmentation is done by equally dividing the total duration by the number of repetitions, BodyMTS still achieves 3 percentage points higher accuracy than SlowFast. For both methods, accuracy is higher when the repetitions are correctly segmented using pose estimation than when segmentation is performed by equally dividing the total duration. This suggests that repetitions may not be fully captured by the approximate approach: idle time at the start or end of the video clip, as well as variation in the duration of each repetition, can affect the segmentation.

7.1.4 Impact of video quality noise

We now analyze the impact of video quality noise on the deep learning models, as well as on BodyMTS. We do this by changing the CRF video property, discussed in detail in Sect. 7.2; a higher CRF value lowers the quality of the video and vice versa. We compare the accuracy at different values of CRF: the default is 23, and we test both a higher video quality at CRF 16 and quality degraded all the way down to CRF 34.

Table 8 Average accuracy obtained by SlowFast and BodyMTS for varying video quality (CRF from 16 to 34) over 3 train/test splits

Table 8 shows the average accuracy obtained by SlowFast and BodyMTS at different CRF values over three train/test splits. At the default CRF of 23, BodyMTS achieves accuracy 4 percentage points higher than SlowFast. Reducing the video quality by increasing the CRF affects both methods, with accuracy decreasing but remaining above 80%, which is desirable as described in Sect. 2.

For future work, it would be interesting to study whether video quality metrics such as VMAF (Aaron et al. 2015) could also be used to identify an application-specific threshold beyond which the video quality is too poor for inclusion in this task. In the Appendix, we investigate a few video quality metrics and the corresponding BodyMTS accuracy.

Takeaway The previous experiments suggest that BodyMTS is more accurate, significantly faster and more cost-efficient than the best deep learning method, SlowFast.

7.2 Robustness analysis: impact of noise on BodyMTS

In this section, we analyze the robustness of BodyMTS against different sources of noise that may occur in this application. These sources of noise can be broadly classified into 3 categories: (1) video data capture; (2) OpenPose parameters; (3) time series data pre-processing.

While studying the impact of noise we address the following questions:

  1. How does noise from different sources, such as video capture quality, OpenPose estimation and data pre-processing, affect the classifier accuracy?

  2. Is it possible to reduce the quality of the videos but still keep the same accuracy? Are there possible benefits in terms of saving storage space by reducing the data size?

We address these questions by generating noisy videos with varying levels of noise: we degrade or enhance the quality of the videos by changing their resolution and bit-rate. We also analyze the impact of the OpenPose parameters most responsible for estimation quality, and explore OpenPose parameters that can be tuned to reduce the overall execution time of BodyMTS. We further analyze how much video quality can be reduced by changing the compression level without sacrificing accuracy.

We consider two scenarios: (1) adding noise to both training and testing data; (2) adding noise to the testing data while keeping the training data intact.

7.2.1 Data-capture noise

In this section, we study the impact of noise coming from data capture. This can be further categorized into two types: video quality and the recording conditions. We describe each of them in detail below.

Video quality The motivation behind studying video quality is that videos captured in the wild can range from very poor quality to high-definition quality. Modern smartphone cameras are far better than those of 10 years ago. Nevertheless, recorded videos can still have low quality because of the compression required to send the data to the cloud service where further processing takes place. We generate video data with different levels of noise by tweaking video properties such as the bit-rate and resolution, using FFmpeg (Tomar 2006), an open-source, widely used tool for manipulating and modifying videos.

Additionally, OpenPose estimation confidence is directly proportional to the quality of the video: lower quality videos lead to lower confidence in the body-part location estimates, which may ultimately affect the classifier accuracy.

  • Resolution refers to the number of pixels in each image. The higher the resolution, the more pixels and hence the better the video quality. In our experiments, we downscale the original resolution of the videos to measure the impact on classifier accuracy. Note that the original videos of the Military Press were recorded at HD quality with 720p resolution.

  • Bit-rate refers to the amount of information processed per second to represent a video. A higher bit-rate means more information and hence better quality, but also a larger video file. We use the constant rate factor (CRF) (Tomar 2006), a rate control mode, to change the number of bits per frame. We examine the impact of bit-rate by changing the CRF property of the videos; a higher CRF leads to lower quality videos (a minimal FFmpeg sketch is given after the next paragraph).

In addition to the above properties, we also alter the color space and frame rate to analyze their impact on BodyMTS accuracy.
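As an illustration, the sketch below drives FFmpeg from Python to produce the degraded variants used in these experiments. It assumes an H.264 output, where -crf controls the rate factor and -vf scale changes the resolution; file names are placeholders.

```python
import subprocess

def transcode(src, dst, crf=28, scale=None):
    """Re-encode a clip at a given CRF (and optional resolution); -an drops audio."""
    cmd = ["ffmpeg", "-y", "-i", src, "-an", "-crf", str(crf)]
    if scale:  # e.g. "iw/2:ih/2" for half the original resolution
        cmd += ["-vf", f"scale={scale}"]
    cmd.append(dst)
    subprocess.run(cmd, check=True)

# Example: simulate the degraded test condition of Sect. 7.2
transcode("clip_720p.mp4", "clip_crf34.mp4", crf=34)
```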

Recording conditions Several factors during video recording can affect how a video is recorded, which can ultimately affect the confidence of OpenPose. We categorize common recording conditions into different types and provide a brief description below.

  • Camera settings These may include factors such as the orientation, viewing angle and zoom of the camera, the distance between the participant and the camera, or whether the participant is centered in the frame.

  • Background Conditions such as whether the participant is executing the exercise indoors or outdoors may influence the quality of the video. For instance, in a gym setting, factors like multiple people in view, background clutter (e.g., pictures containing humans), clothing (background color matching the clothing) and lighting may affect the final output.

Apart from the above, there may be other unaccounted-for factors that influence the recording conditions. In our data, the distance to the camera and the background vary, and these variations did not affect the accuracy of BodyMTS. Nevertheless, we note that BodyMTS expects certain conditions to be met before deployment (e.g., a stable camera), as listed in Sect. 2.

7.2.2 OpenPose parameters

BodyMTS uses OpenPose to obtain the coordinate information for the major key-points of a human body. OpenPose is a full-fledged 2D pose estimation system with many parameters that affect accuracy, optimization, display and output format. Here, the objective is to evaluate and tune the parameters that influence the accuracy and efficiency of OpenPose, which ultimately affect the running time and accuracy of BodyMTS. A short description of these parameters is provided below:

  • Frame-step an integer value indicating the number of frames to skip during the estimation.

  • Net-resolution increasing this may increase the accuracy while also increasing the execution time.

  • Scale-number this parameter indicates the number of scales to average.

The parameters net-resolution and scale-number are directly responsible for the accuracy of OpenPose.
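For reference, a typical invocation of the OpenPose binary with these parameters might look as follows; this is a sketch, with placeholder paths and illustrative values for flags documented in OpenPose v1.7.

```python
import subprocess

# Run the OpenPose demo on one clip, writing per-frame JSON keypoints.
subprocess.run([
    "./build/examples/openpose/openpose.bin",
    "--video", "clip.mp4",
    "--write_json", "out_json/",
    "--net_resolution", "-1x368",  # higher values: more accurate but slower
    "--scale_number", "1",         # number of scales to average
    "--frame_step", "1",           # values > 1 skip frames to cut running time
    "--display", "0", "--render_pose", "0",
], check=True)
```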

OpenPose fails to detect body parts when the person is not facing the camera, when a body part is accidentally cropped, or when the camera does not capture the whole body. In the Military Press recordings, the participant is not facing the camera, so OpenPose fails to detect the coordinates of the eyes and nose. Nonetheless, these body parts are not involved in the physical movement and thus have no impact on accuracy.

All the experiments are performed using the sktime (Löning et al. 2019) version of ROCKET on an Ubuntu 18.04 system (16 GB RAM, Intel i7-4790 CPU @ 3.60 GHz).
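For reference, a minimal version of this classification step with sktime's ROCKET wrapper is sketched below. The array shapes and class labels are random placeholders standing in for the segmented OpenPose time series (50 channels corresponding to the x/y traces of 25 body parts), not the actual data.

```python
# Minimal sketch of the ROCKET classification step using sktime.
import numpy as np
from sklearn.metrics import accuracy_score
from sktime.classification.kernel_based import RocketClassifier

rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 50, 161))   # (samples, channels, timepoints)
y_train = rng.choice(["Normal", "Aberrant"], size=100)
X_test = rng.normal(size=(40, 50, 161))
y_test = rng.choice(["Normal", "Aberrant"], size=40)

clf = RocketClassifier(num_kernels=10_000)  # the default number of kernels
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```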

7.2.3 Results for the impact of video quality on BodyMTS

In this section we analyze the impact of changing the video quality on BodyMTS accuracy. We start by changing the bit-rate and resolution, both of which determine the quality of a video, and report the total size of the video data obtained after altering each property. Note that in the following experiments both the train and the test data are impacted by the same change in video quality.

Reducing the resolution We downscale the videos to one-half and one-third of the original resolution and evaluate the impact on classifier accuracy. Table 9 shows the impact of the different resolutions on classifier accuracy and data size.

Table 9 Average accuracy of BodyMTS on test data over three train/test splits for different video frame resolution

Reducing the bit-rate We alter the CRF in order to modify the bit-rate. CRF ranges from 0 to 51 and its default value is 23. Starting from 16, we change the CRF in steps of 6, as suggested in (Tomar 2006); an increase of 6 roughly halves the bit-rate. The resolution remains unchanged when changing the CRF. Table 10 shows the impact of changing the CRF (and hence the bit-rate) on classifier accuracy.

Table 10 Average accuracy of BodyMTS on test data over three train/test splits for different values of CRF

Results and discussion Table 9 shows that reducing the resolution has a negative impact on classifier accuracy: average accuracy drops by more than 10 percentage points when the original resolution is reduced to one-third. This confirms that degrading video quality by reducing the resolution leads to a significant drop in accuracy. We are therefore interested in trade-offs that save storage space while maintaining, or only slightly sacrificing, accuracy.

Next, we alter the CRF in order to change the bit-rate. Increasing the CRF lowers the bit-rate, which degrades the quality of the videos, and vice versa. Table 10 shows that increasing the CRF leads to a drop in classifier accuracy, whereas decreasing it has no effect on accuracy. Figure 4 shows a single frame of class Normal at CRF 23 (default) and at CRF 40, where the image becomes too distorted to be usable.

Fig. 4 Single frame of class Normal at the default resolution of 420 × 460, encoded at CRF 23 (default, left) versus CRF 40 (lower quality, distorted, right). The frame at CRF 40 is much more strongly compressed and hence looks blurrier than the frame at CRF 23

The change in video quality between CRF 22 and 23 is insignificant and hence the accuracy remains consistent. We observe that setting the CRF to 28 has no major impact on classifier accuracy, which suggests that the total storage space of the original videos can be reduced while maintaining accuracy. The total size of the videos is 213 MB at CRF 23 but only 76 MB at CRF 28, a saving in storage space of about 65% ((213 − 76)/213 ≈ 0.64). Additionally, the final time series data occupies only 28 MB, suggesting further savings compared to storing the original videos.

Takeaway Degrading the quality of the videos by setting the CRF to 28 makes it possible to satisfy the minimum accuracy requirements (e.g., above 80%) listed in Table 1, with about 65% savings in storage space.

7.2.4 Results for the impact of noise due to OpenPose parameters

In this section, we analyze the impact of changing the OpenPose parameters discussed in the previous section. Table 11 shows the accuracy for different parameter values, together with the total training and testing time of BodyMTS. Training time includes the time taken to run OpenPose, pre-process the data and train the model; testing time similarly includes the time to run OpenPose, pre-process the data and test the model. Note that the impact of changing the OpenPose parameters is evaluated on the original dataset (default video settings for resolution and CRF). There are a total of 205 clips (2053 repetitions) of the Military Press, with a total combined duration of 1 h 35 min.

Table 11 Average accuracy on test data over three train/test splits for different OpenPose parameters

Results and discussion From Table 11 we observe that increasing the frame-step from 1 (using every frame) to 3 (using every third frame) leads to a small drop in accuracy of 2 percentage points, but a significant reduction in the run-time of OpenPose.

We further observe that increasing the values of net-resolution and scale-number, which are mainly responsible for the confidence of OpenPose, produces no improvement in accuracy; instead it increases the overall run-time and even lowers the accuracy. These results suggest that the default values of these parameters are sufficient for reasonable accuracy with OpenPose. Table 11 also shows the total running time of OpenPose: increasing the frame-step makes OpenPose faster, since skipping frames means fewer frames to process. The accuracy obtained with a frame-step of 3 is still above the minimum desired accuracy of 80% listed in Table 1, which means that pose estimation does not need to consider every frame; every third frame is enough to capture the movement of the body parts relevant to this classification task.

Takeaway Using a frame-step of 3 along with default values for the remaining OpenPose parameters noticeably reduces the training and testing time without a major drop in accuracy, while still satisfying the minimum accuracy requirements in Table 1.

Table 12 Average accuracy on poor quality test data at CRF 28 over three train/test splits for different OpenPose parameters

We further tried tuning the net-resolution on lower quality video at CRF 28. Table 12 shows the accuracy for different OpenPose parameters at this video quality. We observe no major improvement over the previous accuracy of 0.85. This also suggests that video quality plays a crucial role in determining the accuracy of BodyMTS, as tuning the OpenPose parameters alone is not sufficient to achieve good performance.

7.2.5 Training on good quality videos and testing on poor quality videos

The previous results, shown in Tables 9, 10 and 12, considered the scenario where both the train and test video data are impacted by the same level of noise. In this section, we consider the scenario where training is performed on the original high quality videos whereas testing is performed on poor quality videos, generated by altering the CRF value. Table 13 shows the accuracy of BodyMTS when trained on high quality videos and tested on poor quality videos.

Table 13 Average accuracy of BodyMTS on test data for different values of CRF over three train/test splits

Results and discussion We observe from Table 13 that, when trained on high quality videos, BodyMTS accuracy drops once the test videos are degraded beyond CRF 30. From these results it is clear that a threshold of CRF 30 can be chosen to reduce the data size while still satisfying the minimum accuracy requirements listed in Table 1. As previously observed in Table 10, we save about 65% in storage space at CRF 28 without compromising accuracy.
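The evaluation protocol of this section can be summarized in a few lines: fit the classifier once on series extracted from the original videos, then score it on series extracted from re-encoded copies. The sketch below uses random placeholder arrays and illustrative CRF values; in a real pipeline each test array would come from the FFmpeg and OpenPose steps sketched earlier.

```python
# Hedged sketch of the train-on-clean / test-on-degraded protocol.
import numpy as np
from sklearn.metrics import accuracy_score
from sktime.classification.kernel_based import RocketClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50, 161))
y_train = rng.choice(["Normal", "Aberrant"], size=100)
clf = RocketClassifier().fit(X_train, y_train)  # fit once, on original quality

for crf in (23, 26, 30, 34, 40):
    # Placeholder: in practice, extract these from videos re-encoded at `crf`.
    X_test = rng.normal(size=(40, 50, 161))
    y_test = rng.choice(["Normal", "Aberrant"], size=40)
    print(f"CRF {crf}: accuracy {accuracy_score(y_test, clf.predict(X_test)):.2f}")
```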

7.2.6 Discussion on impact of video quality and OpenPose parameters on BodyMTS

In the above experiments, we studied the impact of two major sources of noise, video quality and the OpenPose parameters, on the accuracy of BodyMTS. We observed that video quality has a large impact on classifier accuracy. We also found that degrading the videos with a small amount of noise (at CRF 28) can lead to large savings in storage space (from 213 to 76 MB) at a very small drop in accuracy (from 87 to 85%). This can be essential for applications deployed on low-memory devices such as mobile phones, or where bandwidth is constrained. On the pose estimation side, we observed that the default OpenPose parameter values are sufficient for good accuracy. The total duration of the original videos is 1 h 38 min; OpenPose took 1 h 12 min with default parameters, which was further reduced to 38 min by using every third frame during estimation, with a very small drop in accuracy (from 87 to 85%). Thus we can save both storage space and run-time, which is very promising given the constraints and requirements of this application. We note that there are many other types of video noise we could investigate, but we focused here on a subset of the most relevant sources for this application.

7.3 Robustness analysis: time series classification in BodyMTS

In this section, we evaluate several multivariate time series classifiers and report the average accuracy over 3 train/test splits.

We present results using both the previous version of OpenPose (v1.4) and the latest version at the time of writing (v1.7). We are interested in whether the improvements in OpenPose lead to a significant improvement in classification accuracy. For FCN and ResNet, we did not tune any hyperparameters and used the defaults from the original papers (Fawaz et al. 2019). For ROCKET (Dempster et al. 2020), changing the number of kernels (default 10,000) did not produce any significant change in accuracy, hence we kept the defaults recommended by the authors. Where an algorithm exposes this option to the user, we disable the time series normalisation step. Detailed results for varying data pre-processing are given in the Appendix.

Table 14 Average accuracy on test data over 3 splits for multivariate time series classifiers trained with time series extracted with OpenPose version 1.4 and version 1.7
Table 15 Average time taken for training and testing over 3 splits for the selected methods

We also present results for different subsets of the time series dimensions (i.e., body parts). Given the large number of possible combinations from the 25 body parts (50 dimensions), we use the ECP method (Dhariyal et al. 2021) for automated dimension selection. We use both the left and right sides of each body part unless otherwise stated. We further compare this with the subset of dimensions suggested by the domain experts who carried out the data collection.

Table 16 Average accuracy of ROCKET using OpenPose v1.7 on test data for different subsets of body parts over 3 train/test splits

Results and discussion Table 14 shows the average accuracy and standard deviation obtained on the test data over three data splits. ROCKET achieved the highest accuracy for both versions of OpenPose, followed by the deep learning models. The standard deviations are generally small, meaning accuracy remains consistent across splits. We further observe a notable increase of 6 percentage points in the accuracy of ROCKET when moving from the older version of OpenPose to the latest one (v1.7 at the time of writing). We expect that further improvements in OpenPose and in time series classifiers can improve the performance on this task even more. With an average accuracy of 87%, OpenPose coupled with ROCKET surpasses the minimum accuracy requirement of 80% listed in Table 1.

Table 15 shows the training and testing time for the selected methods. MINIROCKET takes the least amount of time for training and testing, followed by ROCKET and the deep learning methods. However, despite being faster than ROCKET in total time, MINIROCKET is less accurate, as shown in Table 14.

Based on the above results, we choose ROCKET as the classifier in BodyMTS, as it is the most accurate and scalable method among the classifiers investigated. Table 16 shows the results for different subsets of body parts selected from the multivariate time series. Careful selection of the body parts has a significant impact on classifier accuracy, with the upper body parts recommended by the domain expert achieving the highest accuracy among the subsets considered. The automated selection method of Dhariyal et al. (2021) also selects 8 body parts, but appears to be affected by data noise, selecting low-information body parts such as the toes, which play no role in accurately classifying this movement.
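To make the channel-selection step concrete, the sketch below restricts a multivariate series to the x/y traces of a chosen set of body parts. The part names follow the beginning of OpenPose's BODY_25 ordering, but the channel layout (x and y interleaved per part) is one plausible flattening of the OpenPose output, not necessarily the one used in our pipeline, and the selected subset is illustrative rather than the expert's exact list.

```python
# Hedged sketch: keep only the x/y channels of selected body parts.
import numpy as np

# First few BODY_25 part names (the full model has 25), per the OpenPose docs.
PARTS = ["Nose", "Neck", "RShoulder", "RElbow", "RWrist",
         "LShoulder", "LElbow", "LWrist", "MidHip"]
UPPER = ["Neck", "RShoulder", "RElbow", "RWrist", "LShoulder", "LElbow", "LWrist"]

def select_channels(X: np.ndarray, keep: list) -> np.ndarray:
    """Assumes channels are ordered (x_0, y_0, x_1, y_1, ...) per part."""
    idx = []
    for part in keep:
        i = PARTS.index(part)
        idx.extend([2 * i, 2 * i + 1])  # x and y channel of this part
    return X[:, idx, :]

X = np.random.randn(100, 2 * len(PARTS), 161)
print(select_channels(X, UPPER).shape)  # -> (100, 14, 161)
```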

8 Lessons learned and limitations of the BodyMTS approach

BodyMTS achieves an accuracy of about 87% on the Military Press video dataset. Based on these results, it is possible to use videos as an alternative to sensor-based approaches for human exercise classification. However, further work is needed to analyze the generalizability of BodyMTS to other strength and conditioning exercises. We analyzed the robustness of BodyMTS to common sources of noise, in particular video quality and the OpenPose parameters, and observed that video quality plays a critical role in determining classification accuracy. We showed that video quality can be degraded to a CRF value of 28 without a significant drop in accuracy, whilst achieving savings in storage space. We observed that the subset of channels (i.e., body parts) has a large impact on classifier accuracy, which warrants further investigation. We have also seen that improvements in the body pose estimation method (from OpenPose v1.4 to v1.7) result in higher accuracy, which is encouraging. Furthermore, in the data pre-processing, segmentation plays a crucial role: incorrect peak detection due to a noisy signal may lead to incorrectly captured repetitions, which in turn affects classifier accuracy. Lastly, owing to the choice of ROCKET as classifier, BodyMTS requires few parameters to be set or tuned. Finally, we compared BodyMTS with state-of-the-art deep learning methods and observed that BodyMTS achieves better accuracy while requiring fewer computational resources, whereas considerable engineering effort is needed to tune the deep learning models.

Below we discuss some of the limitations and mitigations for this approach.

  • Types of exercises and pose estimation BodyMTS may fail for exercises where participants lie down, e.g., push-ups and sit-ups. In such cases, the video may fail to capture the full motion, or the pose estimation may fail due to body-part occlusion. Further, we observed in previous experiments (Singh et al. 2020) that the classifier struggles with some samples from the classes Normal and Arch for the Military Press, because the front view cannot fully capture the lateral motion. Techniques that extend the 2D information to 3D through triangulation (Kwon et al. 2020; Nakano et al. 2020b) can be used to capture depth information. BodyMTS is also limited by the set of body parts detected by OpenPose: exercises that require tracking body parts not supported by OpenPose may not be well suited to this approach. For instance, an exercise that requires tracking every vertebra of the spine is not possible with this approach, as OpenPose does not currently track individual vertebrae. However, most types of strength and conditioning exercises mainly involve upper or lower body movements, which are tracked well by OpenPose.

  • Video data capture Factors such as lighting, viewing angle, stability and position of the camera, camera quality, clothing of the participants, background and location can influence the video recordings of the exercise. OpenPose may fail to detect body parts that are not fully captured in the recording, and its confidence is directly related to video quality: lower quality videos decrease the accuracy of OpenPose and hence the quality of the final data. However, video quality issues can easily be overcome by using a smartphone with a reasonable camera, whereas accidental cropping of body parts, positioning and viewing angle can be avoided by following the recording requirements mentioned in Sect. 2. OpenPose is robust to occlusions, including during human-object interaction (Cao et al. 2019). Moreover, lightweight versions of OpenPose (Osokin 2018) and TensorFlow Lite make it possible to run pose estimation frameworks on resource-constrained devices such as smartphones, or on a single CPU, without sacrificing accuracy. Lastly, recent work on preserving the privacy and security of participants (Hinojosa et al. 2021) does not require participants to be fully visible to the camera, thus preserving their privacy, in contrast to deep learning methods that may require full visibility.

  • Data pre-processing BodyMTS uses peak detection to segment time series containing multiple repetitions into single repetitions. Due to noise and jitter, and depending on the accuracy of OpenPose, segmentation may not always be correct and can lead to the loss of some repetitions. Additionally, the duration of a single repetition cannot be assumed fixed, as it varies from participant to participant. In our experiments only a few samples were dropped due to incorrect segmentation, without a substantial impact on accuracy.

Despite these limitations, our experiments show that BodyMTS is a very promising approach. Advances in body pose estimation and time series classification may further improve its performance. An interesting future direction is the development of time series classification algorithms that are more flexible and do not require strict pre-processing steps such as length re-sampling or normalisation. Additionally, the data produced by OpenPose comes with pose estimation confidence values, which raises interesting research questions on how a classifier may exploit knowledge of the uncertainty in the data to improve accuracy.

9 Recommendations for practitioners

We divide this section into three parts: video data, pose estimation and time series classification. We provide recommendations for each of these based on the experiments performed and our findings.

  • Using videos as data source for S&C exercise classification We recommend the use of videos as an alternative to sensors for strength and conditioning (S&C) exercise classification. The large storage and computation requirements of video can be addressed through the bit-rate, which reduces video quality and size. In the case of the Military Press, a CRF of about 28 to 30 reduces the data size by about 65% without affecting the accuracy.

    Certain application requirements on the video quality and the data capture process need to be met. For example, a stable camera and a participant fully captured in each frame are important to ensure the quality of the follow-up steps; we have detailed such requirements in Sect. 2. We recommend pre-processing the recorded videos to remove any audio and background and to centralize the participant using a bounding box, followed by changing the video CRF value to save storage space.

  • Pose estimation We recommend the use of OpenPose for human pose estimation. It can process video faster than real-time, and has parameters such as frame-step which allow skipping frames during processing. We found that skipping 1-2 frames further improves speed without affecting the accuracy of the classification step. The body-part detection and tracking is sufficiently accurate to enable the follow-up classification step, and the improvements in pose estimation in v1.7 also led to improvements in classification.

    Libraries such as OpenVINO make it possible to execute OpenPose on a CPU, and TensorFlow Lite supports running pose estimation models directly on a mobile phone in real-time, so current developments in pose estimation hold a lot of promise.

  • Data pre-processing and multivariate time series classifiers We recommend using the pose estimation output to obtain the human motion time series and then segmenting the multiple repetitions of the exercise. This is more accurate than other heuristics for segmenting the repetitions.

    Before segmentation, it helps to smooth the full signal using a Savitzky-Golay (Savgol) filter, followed by peak detection; a short sketch of this step follows this list. We recommend not normalising the data of each repetition, as this can change the meaning of the data and also decreases the classification accuracy.

    We found that length re-sampling does not impact the final classification accuracy. We recommend ROCKET for the multivariate time series classification step, as it is very fast and the most accurate among the classifiers we evaluated.
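A minimal sketch of the smoothing-plus-peak-detection segmentation recommended above is given below, using SciPy. The filter window, polynomial order and minimum peak distance are illustrative values under an assumed 30 fps frame rate, not the exact settings used in our experiments.

```python
# Hedged sketch of repetition segmentation via Savitzky-Golay smoothing and
# peak detection; all parameter values are illustrative.
import numpy as np
from scipy.signal import find_peaks, savgol_filter

def segment_repetitions(signal: np.ndarray, fps: int = 30) -> list:
    """Split a single coordinate trace (e.g. a wrist y-coordinate) into reps."""
    smooth = savgol_filter(signal, window_length=11, polyorder=3)
    peaks, _ = find_peaks(smooth, distance=fps)   # at most one peak per second
    # Cut midway between consecutive peaks; each segment is one repetition.
    cuts = [0] + [(a + b) // 2 for a, b in zip(peaks, peaks[1:])] + [len(signal)]
    return [signal[s:e] for s, e in zip(cuts, cuts[1:])]

t = np.linspace(0, 10, 300)                       # 10 s at 30 fps
reps = segment_repetitions(np.sin(2 * np.pi * 0.5 * t))
print(len(reps))                                  # ~5 repetitions
```

In practice, cutting valley-to-valley around each detected peak (rather than at midpoints) may align better with the start and end of a repetition; the midpoint rule above is simply the shortest correct illustration.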

10 Conclusion

In this work we analyzed the performance and robustness of BodyMTS, an approach for exercise classification using video as time series. We presented the required features and associated technical requirements for this kind of application. We evaluated BodyMTS on a real-world dataset for the Military Press exercise and achieved an average accuracy of 87%. We further showed that the latest improvements in body pose estimation with OpenPose improve the performance of BodyMTS. We observed that the subset of channels (i.e., body parts) has a large impact on classifier accuracy, which warrants further investigation in future work. We compared the robustness of BodyMTS against different sources of noise, particularly variations in video quality and OpenPose parameters, and observed that BodyMTS achieves good performance at different levels of video quality.

We showed that by decreasing the quality of the videos, a major portion of the storage cost (about 65%) can be saved with a very small drop in accuracy (2 percentage points). This leads to less computation as well as savings in storage cost. We further noticed that changing the OpenPose parameters has little impact on classifier accuracy, but can lead to large savings in running time. Lastly, we compared the BodyMTS approach with deep learning methods for human activity recognition from videos, considering aspects such as performance, storage space and practicality. We observed that BodyMTS achieves better performance than the deep learning methods in terms of total time and accuracy, without the use of heavy computing resources. For future work, we plan to carry out extensive experiments to analyze how BodyMTS generalises to more strength and conditioning exercises. Additionally, we plan to compare the performance of BodyMTS with sensor-based approaches. Furthermore, we plan to work on interpreting the classification, where the goal is to provide useful prediction explanations as feedback to the end-user.