
1 Introduction

With the rapid development of image processing techniques, computing hardware, and cameras, there have been many investigations into human daily action recognition [1,2,3,4]. Recognizing these activities enables many practical applications in smart buildings, including security surveillance, detection of abnormal activity of elderly people or children, and safety monitoring [3]. Numerous datasets and methods for action recognition have been published [3]. The datasets can be based on 2D or 3D images, interest points, trajectories, depth, pose, motion, and shape, while the processing techniques range from traditional machine learning, for example graph-based methods, to deep learning techniques such as convolutional neural networks [3].

The majority of published pose datasets consist of three-dimensional keypoints captured with RGB-D cameras or combined position sensor systems [5]. Using 3D or stereo cameras for action recognition is, in practice, inconvenient because complex infrastructure is required. In addition, the field of view of 3D cameras is very limited, which reduces the range of possible applications. In practical deployments, 2D cameras are therefore often used, so datasets and processing algorithms for this type of camera are needed. Existing public datasets still lack the challenging conditions that appear in real-life applications, which should be further considered. The typical processing steps of action recognition are human detection, human pose estimation, and action classification; running these steps as separate, consecutive stages can be computationally inefficient, which should also be improved.

In this work, a human daily action recognition system for smart-building applications is developed. Unlike traditional methods, Yolov7-Pose [6] is used to detect humans while simultaneously estimating their poses. For action classification, a model based on the CTR-GC method [7] is trained. The training and testing datasets are collected from NTU [5] and our self-built testbeds. This dataset consists of many videos of six action classes: standing, sitting, walking, standing up, sitting down, and falling. The overall processing pipeline is developed, and the start-to-end duration of an action in the video sequence is predicted using the sliding-window concept [8]. Accuracy, robustness against challenging conditions, and computational efficiency are all evaluated. The contribution of this work is thus a new system that recognizes human actions efficiently and accurately under challenging conditions.

2 System Description and Method

The overall processing pipeline of the daily action recognition system is shown in Fig. 1 and consists of a sequence of processing steps. First, image frames captured from the camera system are pushed into the human detection, tracking, and pose estimation modules. A pretrained Yolov7-Pose model [6], a branch of the official Yolov7 implementation, is used for the detection and pose estimation tasks, and the SORT method [9] is used for tracking. After this step, each detected and tracked person in the video frames is assigned a tracking ID, which corresponds to a sequence of poses. As depicted in Fig. 1, there are N video frames and M detected and tracked persons. The pose sequence of each ID is continuously pushed into the buffering system according to the First-In-First-Out (FIFO) rule. Second, the data sampling module acquires data from the buffering system and pushes it to the action recognition module. For each ID, a given number of consecutive poses is sampled using the so-called "sliding window" method [8]. The sliding window allows us to find the start-to-finish duration of an action, since an action can appear at any time in the video sequence. Without the sliding window, the consecutive poses of an action can be incorrectly acquired or overlapped with other actions, which reduces the accuracy of the action prediction module. The action recognition module employs the Channel-wise Topology Refinement Graph Convolution (CTR-GC) method [7], which outputs a prediction over the set of six actions of interest with corresponding confidences. In this work, six classes of daily activities are considered: standing up, sitting down, falling, standing, sitting, and walking.
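As a concrete illustration of the buffering step, the following is a minimal C++ sketch that maintains one FIFO pose queue per tracking ID; the Pose type and the PoseBuffer class are illustrative names introduced here and are not part of the actual implementation.

```cpp
#include <array>
#include <cstddef>
#include <deque>
#include <unordered_map>

// One 2D pose: 17 COCO keypoints, each stored as (x, y, confidence).
using Pose = std::array<float, 17 * 3>;

// Illustrative per-ID FIFO buffer: keeps at most `capacity` poses per
// tracked person and drops the oldest pose first (FIFO rule).
class PoseBuffer {
public:
    explicit PoseBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Push the pose of one tracked person for the current frame.
    void push(int track_id, const Pose& pose) {
        auto& queue = buffers_[track_id];
        queue.push_back(pose);
        if (queue.size() > capacity_) queue.pop_front();  // FIFO eviction
    }

    // Pose sequence of one ID, oldest first (throws if the ID is unknown).
    const std::deque<Pose>& sequence(int track_id) const {
        return buffers_.at(track_id);
    }

private:
    std::size_t capacity_;
    std::unordered_map<int, std::deque<Pose>> buffers_;
};
```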

Fig. 1. Processing pipeline of the whole activity recognition system.

Details of the sliding window method are illustrated in Fig. 2. In the illustration, a sliding window has three sliders of different lengths (numbers of poses), which were chosen empirically. The window is moved to a new location with a given stride. As can be seen, slider "2" of sliding window "2" captures the start-to-finish poses of the occurring action very well. In contrast, none of the sliders in sliding window "1" can hold the entire duration of the displayed action. In sliding window "2", slider "1" is too short, while slider "3" is too long and may overlap with other actions.

The choice of the length of each slider as well as the stride is very important for correctly capturing the start-to-finish duration of an action. The duration varies with the type of action, so the slider lengths and the stride must be chosen empirically. Thus, an empirical model is proposed: based on the collected dataset of the six action classes, histograms of the action durations are plotted for all classes. From these histograms, the lengths of the three sliders are empirically set to 25, 40, and 55 poses, respectively, the stride to 15 poses, and the buffer length to 55 poses.
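A minimal sketch of the sampling step with these empirical values is given below; it assumes the per-ID buffer from the previous sketch and simply cuts the three most recent sub-sequences of 25, 40, and 55 poses out of it, with the caller expected to invoke it once per stride, i.e. every 15 frames.

```cpp
#include <array>
#include <cstddef>
#include <deque>
#include <vector>

using Pose = std::array<float, 17 * 3>;  // 17 COCO keypoints, (x, y, conf)

// Cut the three sliders (25, 40 and 55 most recent poses) out of one
// per-ID buffer. Sliders that do not yet have enough history are skipped.
std::vector<std::vector<Pose>> sample_sliders(const std::deque<Pose>& buffer) {
    const std::size_t slider_lengths[] = {25, 40, 55};
    std::vector<std::vector<Pose>> sliders;
    for (std::size_t len : slider_lengths) {
        if (buffer.size() < len) continue;  // not enough poses yet
        sliders.emplace_back(buffer.end() - static_cast<std::ptrdiff_t>(len),
                             buffer.end());
    }
    return sliders;
}
```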

All three sliders of each sliding window are pushed to the trained CTR-GC model for action recognition. The prediction is run with a batch size of three to improve computational efficiency. After the prediction step, the action class with the highest mean confidence over the three predictions is selected as the recognized action.
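The voting rule can be summarised by the short sketch below; the ClassScores type stands for the six-class confidence vector that the trained model is assumed to return for each slider.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Confidence vector over the six action classes for one slider.
using ClassScores = std::array<float, 6>;

// Select the action class with the highest mean confidence over the
// (typically three) slider predictions of one sliding window.
std::size_t vote(const std::vector<ClassScores>& predictions) {
    ClassScores mean{};  // zero-initialised accumulator
    for (const ClassScores& p : predictions)
        for (std::size_t c = 0; c < mean.size(); ++c)
            mean[c] += p[c] / predictions.size();

    std::size_t best = 0;
    for (std::size_t c = 1; c < mean.size(); ++c)
        if (mean[c] > mean[best]) best = c;
    return best;  // index of the recognised action class
}
```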

Fig. 2. Illustration of the sliding window for detecting the start-to-finish duration of an action [8].

The CTR-GC backbone [7] was trained for the action model using our collected dataset. Unlike many published works that use 3D poses, only 2D poses are used in our work. Using 2D poses is meaningful because, in most practical applications, 2D cameras are used; they offer a wider field of view and do not require the complex processing that 3D cameras do. Table 1 shows the statistics of our self-generated dataset. The dataset was created from videos of NTU [5] as well as our own videos recorded on two different testbeds. The videos are recorded at full-HD resolution (1920x1080) and 15 FPS. Each of our testbeds has four cameras arranged at different viewpoints and heights, and each testbed is set up in a different location to add variety to the dataset. The number of videos for each action class is listed in Table 1. Depending on the type of action, a video takes 2–5 s to complete.

Table 1. Statistics of pose dataset for fine-tuning the CTR-GC model.

The generated videos are then fed into the Yolov7-Pose model, which detects humans and estimates their poses. It should be noted that the poses are 2D skeletons of 17 keypoints following the COCO dataset format [10]. After this step, a pose dataset is obtained and used to train the CTR-GC model. The model is trained from scratch, where the data obtained from NTU and testbed 1 are used for training and the data from testbed 2 are used for testing. The trained model is customized so that the input aligns with the MS-COCO keypoint format and the output covers the six action classes.
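As an aside, graph-convolution action models of the ST-GCN family, to which CTR-GC belongs, typically consume pose sequences in a channel-first (C, T, V) layout; the sketch below shows one plausible way to flatten a slider of 2D COCO poses into such a buffer. The exact input layout expected by the trained model is an assumption here.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Pose = std::array<float, 17 * 3>;  // 17 COCO keypoints, (x, y, conf)

// Flatten a slider of T poses into a channel-first (C=3, T, V=17) buffer,
// the layout commonly used by ST-GCN-style models such as CTR-GC.
std::vector<float> to_model_input(const std::vector<Pose>& slider) {
    const std::size_t C = 3, V = 17, T = slider.size();
    std::vector<float> input(C * T * V, 0.0f);
    for (std::size_t t = 0; t < T; ++t)
        for (std::size_t v = 0; v < V; ++v)
            for (std::size_t c = 0; c < C; ++c)
                input[c * T * V + t * V + v] = slider[t][v * 3 + c];
    return input;
}
```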

The overall processing pipeline is deployed on a workstation with 16 GB of RAM, an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, and an NVIDIA RTX 3060 GPU with 12 GB of memory. The models are converted to a suitable format so that they can be accelerated on NVIDIA hardware via the TensorRT framework. The processing pipeline is implemented in C++ using the OpenCV library.
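The deployed pipeline can be summarised by the skeleton below. The three stage functions are stubs standing in for the TensorRT-accelerated Yolov7-Pose/SORT front end and the CTR-GC back end; only the OpenCV capture calls are actual library API, everything else is illustrative.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Stub stages standing in for the TensorRT-accelerated modules described
// above; they are placeholders for illustration, not real library calls.
std::vector<int> detect_track_and_estimate(const cv::Mat&) { return {}; }
void push_to_buffers(const std::vector<int>&) {}
void recognise_actions() {}

int main() {
    cv::VideoCapture cap(0);  // camera stream (device 0)
    if (!cap.isOpened()) return 1;

    cv::Mat frame;
    long frame_index = 0;
    while (cap.read(frame)) {
        auto tracked = detect_track_and_estimate(frame);  // Yolov7-Pose + SORT
        push_to_buffers(tracked);                         // per-ID FIFO buffers
        if (frame_index % 15 == 0)                        // sliding-window stride
            recognise_actions();                          // CTR-GC on sampled sliders
        ++frame_index;
    }
    return 0;
}
```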

3 Results and Discussions

The use of Yolov7-Pose has the advantage of detecting people and estimating their corresponding poses in a single inference, which significantly reduces the computational time. In traditional methods, two separate steps, human detection and pose estimation, must be performed. An evaluation of such a traditional method was also carried out, in which Yolov5s [11] was used for human detection and HRNet-w48 [12] for pose estimation. The two approaches are evaluated on the MS-COCO dataset, which has 5000 images of different types of objects. The evaluation results are shown in Table 2. The object detection models are evaluated with the average precision (AP) benchmark at intersection-over-union (IoU) thresholds ranging from 0.5 to 0.95. Similarly, pose estimation is evaluated with average precision based on object keypoint similarity (OKS) over the same range of thresholds.
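For reference, the IoU measure underlying the AP benchmark reduces to the simple box-overlap computation sketched below; the Box struct is an illustrative corner-format representation.

```cpp
#include <algorithm>

// Axis-aligned bounding box in corner format: (x1, y1) top-left, (x2, y2) bottom-right.
struct Box { float x1, y1, x2, y2; };

// Intersection over Union between two detection boxes, the overlap measure
// behind the AP evaluation at thresholds 0.5 to 0.95.
float iou(const Box& a, const Box& b) {
    const float ix1 = std::max(a.x1, b.x1);
    const float iy1 = std::max(a.y1, b.y1);
    const float ix2 = std::min(a.x2, b.x2);
    const float iy2 = std::min(a.y2, b.y2);
    const float inter  = std::max(0.0f, ix2 - ix1) * std::max(0.0f, iy2 - iy1);
    const float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (area_a + area_b - inter + 1e-6f);  // guard against division by zero
}
```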

The results show that, for the pose estimation task at a threshold of 0.5, the discrepancy between the two approaches is very small (89.2 vs. 90.4). In contrast, over thresholds from 0.5 to 0.95, the mAP of HRNet-w48 is significantly higher. This is due to the different input sizes of the two approaches: HRNet-w48 operates on person crops resized to 384x288, while Yolov7-Pose processes the whole image at 640x640, which leads to higher keypoint errors for Yolov7-Pose. However, for the object detection task, Yolov7-Pose outperforms Yolov5s.

It should be noted that the results in Table 2 are obtained on the MS-COCO dataset, in which objects and their poses are relatively easy to detect and estimate. In practical applications, several challenging conditions must be overcome, so the two approaches are also evaluated on challenging data to verify their robustness and accuracy. In our self-generated dataset from testbeds 1 and 2, challenging conditions are added. As illustrated in Fig. 3, images with very poor lighting conditions (infra-red mode) and humans with partial or occluded appearances are included. Furthermore, scenarios with different shooting distances and angles, including far and close views, diagonal and hard-to-see angles, as well as a variety of ways of performing the actions, are considered. On this challenging dataset, Yolov7-Pose is demonstrated to be far more robust than HRNet-w48. As can be seen in Fig. 4, HRNet-w48 fails to estimate the pose, whereas Yolov7-Pose performs much better. Another key advantage of Yolov7-Pose is its computational efficiency: it requires only 10 ms to estimate all poses in an image, whereas HRNet-w48 needs 20 ms for a single pose, so its computational time grows with the number of poses in the image. Compared to other works [13,14,15], Yolov7-Pose shows the best performance in terms of accuracy and computational efficiency, especially under challenging conditions such as occluded and partial bodies.

Table 2. Comparison of average precision between Yolov7-Pose and Yolov5s with HRNet-w48.
Fig. 3. Illustration of the challenging conditions of the self-generated dataset.

Fig. 4. Performance comparison between Yolov5s-HRNet-w48 and Yolov7-Pose.

Table 3 shows the evaluation results of the CTR-GC action recognition model. The model is evaluated in a cross-view manner: all videos of testbed 1 are used for training, and all videos of testbed 2 are used for testing. Accuracy is calculated as the number of true positives divided by the total number of test samples. The standing action has the highest accuracy of 95.7%, followed by the falling action with 89.5%; the other actions have lower accuracy. This can be explained by the fact that standing is a very visible action and is therefore easier to recognize. For the falling action, our dataset contains more data, both in volume and in variety, than for the other classes, so its accuracy is higher and more robust. The reason for this class imbalance is that we primarily aim to detect falls of elderly people in smart-building applications. The average accuracy over all classes is 85.6%, and the computation time of the model is only 4 ms. With these results, our system can be used in practical applications.
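A minimal sketch of the accuracy computation behind Table 3 is given below; it assumes per-class counts of true positives and test samples, which is our reading of the evaluation protocol described above.

```cpp
#include <array>
#include <cstddef>

// Per-class accuracy: true positives of a class divided by the number of
// test samples of that class, for the six action classes.
std::array<float, 6> per_class_accuracy(const std::array<std::size_t, 6>& true_positives,
                                        const std::array<std::size_t, 6>& test_counts) {
    std::array<float, 6> accuracy{};
    for (std::size_t c = 0; c < 6; ++c)
        accuracy[c] = test_counts[c] == 0
                          ? 0.0f
                          : static_cast<float>(true_positives[c]) / test_counts[c];
    return accuracy;
}
```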

Table 3. Accuracy of the action recognition model CTR-GC.

4 Conclusions and Outlook

In this work, a daily action recognition system for smart-building applications has been successfully developed, and the obtained results show high potential for practical applications. The use of Yolov7-Pose has advantages in terms of accuracy, robustness, and computational efficiency, whereas HRNet-w48 is less robust and requires much more computational effort; Yolov7-Pose is therefore the better choice. Publicly available datasets still lack challenging conditions as well as variety, which makes them hard to use for practical applications, so a self-generated dataset containing challenging conditions and real-life scenarios should be created. The sliding window parameters have a significant impact on accuracy and, depending on the action classes of interest, must be optimized empirically.

In the future, we intend to carry out a thorough evaluation of the pose estimation model on challenging datasets and to optimize the sliding window parameters.