
1 Introduction

With the rapid development of image processing techniques, computing hardware, and cameras, there have been many investigations into human daily action recognition [1,2,3,4]. Recognizing these activities enables many practical applications in smart buildings, including security surveillance, detection of abnormal activity of elderly people or children, and safety monitoring [3]. Numerous datasets and methods for action recognition have been published [3]. The datasets can be based on 2D or 3D images, interest points, trajectories, depth, pose, motion, and shape, while the processing techniques range from traditional machine learning, for example graph-based methods, to deep learning techniques such as convolutional neural networks [3].

The majority of published pose datasets consist of three-dimensional keypoints captured with RGB-D cameras or combined position sensor systems [5]. Using 3D or stereo cameras for action recognition is, in practice, inconvenient because complex infrastructure is required. In addition, the field of view of 3D cameras is very limited, which reduces the range of possible applications. In practical deployments, 2D cameras are therefore often used, so datasets and processing algorithms for this type of camera are needed. Existing public datasets still lack the challenging conditions that appear in real-life applications, which should be further considered. The typical processing steps of action recognition are human detection, human pose estimation, and action classification; running these steps as separate, consecutive stages can be computationally inefficient, which should also be improved.

In this work, a human daily action recognition system for smart-building applications is developed. Unlike traditional methods, Yolov7-Pose [6] is used to detect humans while simultaneously estimating their poses. For action classification, a model based on the CTR-GC method [7] is trained. The training and testing datasets are collected from NTU [5] and our self-built testbeds. This dataset consists of many videos of six action classes: standing, sitting, walking, standing up, sitting down, and falling. The overall processing pipeline is developed, and the start-to-end duration of an action in the video sequence is predicted using the sliding-window concept [8]. Accuracy, robustness against challenging conditions, and computational efficiency are all evaluated. The contribution of this work is thus a new system that recognizes human actions efficiently and accurately under challenging conditions.

2 System Description and Method

The overall processing pipeline of the daily action recognition system is shown in Fig. 1 and consists of a sequence of processing steps. First, image frames captured from the camera system are pushed into the human detection, tracking, and pose estimation modules. A pretrained Yolov7-Pose model [6], a branch of the official Yolov7 implementation, is used for the detection and pose estimation tasks, and the SORT method [9] is used for tracking. After this step, each detected and tracked person in the video frames is assigned a tracking ID, which corresponds to a sequence of poses. As depicted in Fig. 1, there are N video frames and M detected and tracked persons. The pose sequence of each ID is continuously pushed into the buffering system according to the First-In-First-Out (FIFO) rule. Second, the data sampling module acquires data from the buffering system and pushes it to the action recognition module. For each ID, a given number of consecutive poses is sampled using the so-called "sliding window" method [8]. The sliding window allows us to find the start-to-finish duration of an action, since an action can appear at any time in the video sequence. Without the sliding window, the consecutive poses of an action can be incorrectly acquired or overlapped with other actions, which reduces the accuracy of the action prediction module. The action recognition module employs the Channel-wise Topology Refinement Graph Convolution (CTR-GC) method [7], which outputs a prediction over the set of six actions of interest with corresponding confidences. In this work, six classes of daily activities are considered: standing up, sitting down, falling, standing, sitting, and walking.
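As a concrete illustration of the buffering step, the following is a minimal C++ sketch that maintains one FIFO pose queue per tracking ID; the Pose type and the PoseBuffer class are illustrative names introduced here and are not part of the actual implementation.

```cpp
#include <array>
#include <cstddef>
#include <deque>
#include <unordered_map>

// One 2D pose: 17 COCO keypoints, each stored as (x, y, confidence).
using Pose = std::array<float, 17 * 3>;

// Illustrative per-ID FIFO buffer: keeps at most `capacity` poses per
// tracked person and drops the oldest pose first (FIFO rule).
class PoseBuffer {
public:
    explicit PoseBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Push the pose of one tracked person for the current frame.
    void push(int track_id, const Pose& pose) {
        auto& queue = buffers_[track_id];
        queue.push_back(pose);
        if (queue.size() > capacity_) queue.pop_front();  // FIFO eviction
    }

    // Pose sequence of one ID, oldest first (throws if the ID is unknown).
    const std::deque<Pose>& sequence(int track_id) const {
        return buffers_.at(track_id);
    }

private:
    std::size_t capacity_;
    std::unordered_map<int, std::deque<Pose>> buffers_;
};
```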

Fig. 1. Processing pipeline of the whole activity recognition system.

Details of the sliding window method are illustrated in Fig. 2. In the illustration, a sliding window has three sliders of different lengths (numbers of poses), which were chosen empirically. The window is moved to a new location with a given stride. As can be seen, slider "2" of sliding window "2" captures the start-to-finish poses of the occurring action very well. In contrast, none of the sliders in sliding window "1" can hold the entire duration of the displayed action. In sliding window "2", slider "1" is too short, while slider "3" is too long and may overlap with other actions.

The choice of the length of each slider as well as the stride is very important for correctly capturing the start-to-finish duration of an action. The duration varies with the type of action, so the slider lengths and the stride must be chosen empirically. Thus, an empirical model is proposed: based on the collected dataset of the six action classes, histograms of the action durations are plotted for all classes. From these histograms, the lengths of the three sliders are empirically set to 25, 40, and 55 poses, respectively, the stride to 15 poses, and the buffer length to 55 poses.
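A minimal sketch of the sampling step with these empirical values is given below; it assumes the per-ID buffer from the previous sketch and simply cuts the three most recent sub-sequences of 25, 40, and 55 poses out of it, with the caller expected to invoke it once per stride, i.e. every 15 frames.

```cpp
#include <array>
#include <cstddef>
#include <deque>
#include <vector>

using Pose = std::array<float, 17 * 3>;  // 17 COCO keypoints, (x, y, conf)

// Cut the three sliders (25, 40 and 55 most recent poses) out of one
// per-ID buffer. Sliders that do not yet have enough history are skipped.
std::vector<std::vector<Pose>> sample_sliders(const std::deque<Pose>& buffer) {
    const std::size_t slider_lengths[] = {25, 40, 55};
    std::vector<std::vector<Pose>> sliders;
    for (std::size_t len : slider_lengths) {
        if (buffer.size() < len) continue;  // not enough poses yet
        sliders.emplace_back(buffer.end() - static_cast<std::ptrdiff_t>(len),
                             buffer.end());
    }
    return sliders;
}
```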

All three sliders of each sliding window are pushed to the trained CTR-GC model for action recognition. The prediction is run with a batch size of three to improve computational efficiency. After the prediction step, the action class with the highest mean confidence over the three predictions is selected as the recognized action.
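The voting rule can be summarised by the short sketch below; the ClassScores type stands for the six-class confidence vector that the trained model is assumed to return for each slider.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Confidence vector over the six action classes for one slider.
using ClassScores = std::array<float, 6>;

// Select the action class with the highest mean confidence over the
// (typically three) slider predictions of one sliding window.
std::size_t vote(const std::vector<ClassScores>& predictions) {
    ClassScores mean{};  // zero-initialised accumulator
    for (const ClassScores& p : predictions)
        for (std::size_t c = 0; c < mean.size(); ++c)
            mean[c] += p[c] / predictions.size();

    std::size_t best = 0;
    for (std::size_t c = 1; c < mean.size(); ++c)
        if (mean[c] > mean[best]) best = c;
    return best;  // index of the recognised action class
}
```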

Fig. 2. Illustration of the sliding window for detecting the start-to-finish duration of an action [8].

The CTR-GC backbone [7] was trained for the action model using our collected dataset. Unlike many published works that use 3D poses, only 2D poses are used in our work. Using 2D poses is meaningful because, in most practical applications, 2D cameras are used; they offer a wider field of view and do not require the complex processing that 3D cameras do. Table 1 shows the statistics of our self-generated dataset. The dataset was created from videos of NTU [5] as well as our own videos recorded on two different testbeds. The videos are recorded at full-HD resolution (1920x1080) and 15 FPS. Each of our testbeds has four cameras arranged at different viewpoints and heights, and each testbed is set up in a different location to add variety to the dataset. The number of videos for each action class is listed in Table 1. Depending on the type of action, a video takes 2–5 s to complete.

Table 1. Statistics of pose dataset for fine-tuning the CTR-GC model.

The generated videos are then fed into the Yolov7-Pose model, which detects humans and estimates their poses. It should be noted that the poses are 2D skeletons of 17 keypoints following the COCO dataset format [10]. After this step, a pose dataset is obtained and used to train the CTR-GC model. The model is trained from scratch, where the data obtained from NTU and testbed 1 are used for training and the data from testbed 2 are used for testing. The trained model is customized so that the input aligns with the MS-COCO keypoint format and the output covers the six action classes.
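As an aside, graph-convolution action models of the ST-GCN family, to which CTR-GC belongs, typically consume pose sequences in a channel-first (C, T, V) layout; the sketch below shows one plausible way to flatten a slider of 2D COCO poses into such a buffer. The exact input layout expected by the trained model is an assumption here.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Pose = std::array<float, 17 * 3>;  // 17 COCO keypoints, (x, y, conf)

// Flatten a slider of T poses into a channel-first (C=3, T, V=17) buffer,
// the layout commonly used by ST-GCN-style models such as CTR-GC.
std::vector<float> to_model_input(const std::vector<Pose>& slider) {
    const std::size_t C = 3, V = 17, T = slider.size();
    std::vector<float> input(C * T * V, 0.0f);
    for (std::size_t t = 0; t < T; ++t)
        for (std::size_t v = 0; v < V; ++v)
            for (std::size_t c = 0; c < C; ++c)
                input[c * T * V + t * V + v] = slider[t][v * 3 + c];
    return input;
}
```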

The overall processing pipeline is deployed on a workstation with 16 GB of RAM, an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, and an NVIDIA RTX 3060 GPU with 12 GB of memory. The models are converted to a suitable format so that they can be accelerated on NVIDIA hardware via the TensorRT framework. The processing pipeline is implemented in C++ using the OpenCV library.
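The deployed pipeline can be summarised by the skeleton below. The three stage functions are stubs standing in for the TensorRT-accelerated Yolov7-Pose/SORT front end and the CTR-GC back end; only the OpenCV capture calls are actual library API, everything else is illustrative.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Stub stages standing in for the TensorRT-accelerated modules described
// above; they are placeholders for illustration, not real library calls.
std::vector<int> detect_track_and_estimate(const cv::Mat&) { return {}; }
void push_to_buffers(const std::vector<int>&) {}
void recognise_actions() {}

int main() {
    cv::VideoCapture cap(0);  // camera stream (device 0)
    if (!cap.isOpened()) return 1;

    cv::Mat frame;
    long frame_index = 0;
    while (cap.read(frame)) {
        auto tracked = detect_track_and_estimate(frame);  // Yolov7-Pose + SORT
        push_to_buffers(tracked);                         // per-ID FIFO buffers
        if (frame_index % 15 == 0)                        // sliding-window stride
            recognise_actions();                          // CTR-GC on sampled sliders
        ++frame_index;
    }
    return 0;
}
```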

3 Results and Discussions

The use of Yolov7-Pose has the advantage of detecting people and estimating their corresponding poses in a single inference, which significantly reduces the computational time. In traditional methods, two separate steps, human detection and pose estimation, must be performed. An evaluation of such a traditional method was also carried out, in which Yolov5s [11] was used for human detection and HRNet-w48 [12] for pose estimation. The two approaches are evaluated on the MS-COCO dataset, which has 5000 images of different types of objects. The evaluation results are shown in Table 2. The object detection models are evaluated with the average precision (AP) benchmark at intersection-over-union (IoU) thresholds ranging from 0.5 to 0.95. Similarly, pose estimation is evaluated with average precision based on object keypoint similarity (OKS) over the same range of thresholds.
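For reference, the IoU measure underlying the AP benchmark reduces to the simple box-overlap computation sketched below; the Box struct is an illustrative corner-format representation.

```cpp
#include <algorithm>

// Axis-aligned bounding box in corner format: (x1, y1) top-left, (x2, y2) bottom-right.
struct Box { float x1, y1, x2, y2; };

// Intersection over Union between two detection boxes, the overlap measure
// behind the AP evaluation at thresholds 0.5 to 0.95.
float iou(const Box& a, const Box& b) {
    const float ix1 = std::max(a.x1, b.x1);
    const float iy1 = std::max(a.y1, b.y1);
    const float ix2 = std::min(a.x2, b.x2);
    const float iy2 = std::min(a.y2, b.y2);
    const float inter  = std::max(0.0f, ix2 - ix1) * std::max(0.0f, iy2 - iy1);
    const float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (area_a + area_b - inter + 1e-6f);  // guard against division by zero
}
```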

The results show that, for the pose estimation task at a threshold of 0.5, the discrepancy between the two approaches is very small (89.2 vs. 90.4). In contrast, over thresholds from 0.5 to 0.95, the mAP of HRNet-w48 is significantly higher. This is due to the different input sizes of the two approaches: HRNet-w48 operates on person crops resized to 384x288, while Yolov7-Pose processes the whole image at 640x640, which leads to higher keypoint errors for Yolov7-Pose. However, for the object detection task, Yolov7-Pose outperforms Yolov5s.

It should be noted that the results in Table 2 are obtained on the MS-COCO dataset, in which objects and their poses are relatively easy to detect and estimate. In practical applications, several challenging conditions must be overcome, so the two approaches are also evaluated on challenging data to verify their robustness and accuracy. In our self-generated dataset from testbeds 1 and 2, challenging conditions are added. As illustrated in Fig. 3, images with very poor lighting conditions (infra-red mode) and humans with partial or occluded appearances are included. Furthermore, scenarios with different shooting distances and angles, including far and close views, diagonal and hard-to-see angles, as well as a variety of ways of performing the actions, are considered. On this challenging dataset, Yolov7-Pose is demonstrated to be far more robust than HRNet-w48. As can be seen in Fig. 4, HRNet-w48 fails to estimate the pose, whereas Yolov7-Pose performs much better. Another key advantage of Yolov7-Pose is its computational efficiency: it requires only 10 ms to estimate all poses in an image, whereas HRNet-w48 needs 20 ms for a single pose, so its computational time grows with the number of poses in the image. Compared to other works [13,14,15], Yolov7-Pose shows the best performance in terms of accuracy and computational efficiency, especially under challenging conditions such as occluded and partial bodies.

Table 2. Comparison of average precision between Yolov7-Pose and Yolov5s with HRNet-w48.
Fig. 3. Illustration of the challenging conditions of the self-generated dataset.

Fig. 4. Performance comparison between Yolov5s-HRNet-w48 and Yolov7-Pose.

Table 3 shows the evaluation results of the CTR-GC action recognition model. The model is evaluated in a cross-view manner: all videos of testbed 1 are used for training, and all videos of testbed 2 are used for testing. Accuracy is calculated as the number of true positives divided by the total number of test samples. The standing action has the highest accuracy of 95.7%, followed by the falling action with 89.5%; the other actions have lower accuracy. This can be explained by the fact that standing is a very visible action and is therefore easier to recognize. For the falling action, our dataset contains more data, both in volume and in variety, than for the other classes, so its accuracy is higher and more robust. The reason for this class imbalance is that we primarily aim to detect falls of elderly people in smart-building applications. The average accuracy over all classes is 85.6%, and the computation time of the model is only 4 ms. With these results, our system can be used in practical applications.
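A minimal sketch of the accuracy computation behind Table 3 is given below; it assumes per-class counts of true positives and test samples, which is our reading of the evaluation protocol described above.

```cpp
#include <array>
#include <cstddef>

// Per-class accuracy: true positives of a class divided by the number of
// test samples of that class, for the six action classes.
std::array<float, 6> per_class_accuracy(const std::array<std::size_t, 6>& true_positives,
                                        const std::array<std::size_t, 6>& test_counts) {
    std::array<float, 6> accuracy{};
    for (std::size_t c = 0; c < 6; ++c)
        accuracy[c] = test_counts[c] == 0
                          ? 0.0f
                          : static_cast<float>(true_positives[c]) / test_counts[c];
    return accuracy;
}
```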

Table 3. Accuracy of the action recognition model CTR-GC.

4 Conclusions and Outlook

In this work, a daily action recognition system for smart-building applications has been successfully developed, and the obtained results show high potential for practical applications. The use of Yolov7-Pose has advantages in terms of accuracy, robustness, and computational efficiency, whereas HRNet-w48 is less robust and requires much more computational effort; Yolov7-Pose is therefore the better choice. Publicly available datasets still lack challenging conditions as well as variety, which makes them hard to use for practical applications, so a self-generated dataset containing challenging conditions and real-life scenarios should be created. The sliding window parameters have a significant impact on accuracy and, depending on the action classes of interest, must be optimized empirically.

In the future, we intend to carry out a thorough evaluation of the pose estimation model on challenging datasets and to optimize the sliding window parameters.