
1 Introduction

Human activity recognition is an actively developing area of computer vision, with many applications such as video surveillance, human–computer interaction, and automated driving. Earlier studies commonly relied on the bag-of-words approach or predefined motion attributes for human activity recognition [10, 11, 24, 36]. Recent deep learning-based representation methods, such as 3D convolutional neural networks (CNNs) [8], two-stream CNNs [27], and multi-stream CNNs [31], have shown promising results for the human activity recognition problem. However, recognizing human activity accurately remains challenging compared to other problems in computer vision and machine learning. Relying on RGB information limits extensibility and versatility because it is strongly influenced by recording conditions such as illumination, size, resolution, and occlusion.

With the advent of depth sensors such as Microsoft Kinect, Asus Xtion, and Intel RealSense, action recognition using 3D skeleton sequences instead of RGB cameras has attracted substantial research attention, and many advanced approaches have been proposed [4, 9, 18, 20, 32, 35]. Human actions can be represented by a combination of movements of skeletal joints in 3D space. In addition, there have been major advances in skeleton-based human activity recognition research [2, 3, 25, 29, 34], which models what happens between two or more people based on their joint information. Although the human skeleton can provide sophisticated information about human behavior, most depth sensors are currently limited to indoor use at close range; these conditions are necessary to estimate articulated poses accurately. However, human activity recognition using articulated poses outdoors could have many more practical applications. Therefore, we address such settings: namely, activity recognition problems where articulated poses are estimated from RGB videos.

In recent studies, deep learning-based approaches have achieved excellent results in estimating human body joints from RGB videos through pose estimation [6, 7, 26]. It has become possible to extract accurate poses with joint information for multiple people from RGB video in real time. Because pose estimation and action recognition are closely related problems, some studies address the two tasks simultaneously. A multi-task deep learning approach performed joint 2D and 3D pose estimation from still images and human action recognition from video in a single framework [19]. An AND-OR graph-based action recognition approach utilizes hierarchical part composition analysis [33]. Even though such end-to-end approaches have advantages for task optimization, they have limited extensibility to videos recorded in varying real-world environments. Furthermore, research on interactions, rather than single-person actions, is methodologically distinct; another problem is that it requires a large amount of training data.

In this paper, we propose a novel framework for human activity recognition from RGB video based on spatio-temporal weights of active joints. The proposed framework extracts individual human body joints using a publicly available pose estimation method, and recognizes human interaction based on joint motion, local patch images, and full-body images with spatio-temporal weights of the active region. The framework thus selectively focuses on the informative joints in each frame of an unconstrained RGB video. Figure 1 shows that the interaction regions differ across human activities. In a handshake, the interaction occurs between hands; a punch can be understood as an interaction between a hand and the head, and a hug as an interaction between a hand and the torso.

Fig. 1. An example of human body joints with spatio-temporal active region analysis: the stretched right hand interacts with a different body part of the other person in each activity.

Our contributions are as follows. First, the proposed framework operates on RGB video, so activity recognition can be performed in the wild without constraints. Second, the spatio-temporal weight of the active region is assigned to activity-relevant motions or poses, which allows the model to focus on important cues of human activity. Third, the experimental results show the effectiveness of the proposed method for human behavior understanding. This framework allows us to develop highly extensible applications. Furthermore, by not performing separate learning for the estimation, detection, and tracking tasks, the proposed framework can be extended to varying datasets in unconstrained environments.

2 Proposed Method

2.1 Preprocessing

In most recent studies, video representation through CNN-based approaches has shown good results. We first normalize the RGB pixel data and extract feature vectors from the images by processing them through a CNN. We perform human object detection using Faster-RCNN [22] with the Inception-resnet-v2 network [30]. The detection result provides (x, y) coordinates with height and width. We also perform joint estimation using Part Affinity Fields (PAF) [7] on the same images.

Fig. 2. An illustration of the joint indices from the estimated human pose and the five corresponding body parts (torso, left hand, right hand, left leg, and right leg).

The composition of the joints estimated using PAF is shown in Fig. 2. PAF provides 18 joints for each human object. In addition, the average of the joint 8 and joint 11 coordinates is designated as joint 18 to exploit the torso information; this is referred to as the hip. For each human subject, we denote the joints as \(j_{i} = \{j_{0}, ... , j_{18}\}\). Pose estimation in an RGB frame often yields missing joints. Thus, if a joint has failed to be estimated in the previous n frames, the value in the current frame is used for interpolation and restoration. We use the bounding box to filter out bad results using constraints. First, both the head and torso of each object must be included in the bounding box. If estimation of the head (index 0) fails, the average coordinate of \(j_{14}, ... , j_{17}\) is used as the head position. In this way, noisy objects and poorly estimated joints are removed before interaction analysis.
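To make the preprocessing concrete, the following is a minimal sketch of the hip synthesis, joint restoration, head fallback, and bounding-box filtering described above, assuming joints are stored as NumPy arrays with NaN marking missed estimates; the function names and the NaN convention are ours, not the original implementation.

```python
import numpy as np

N_JOINTS = 19                      # PAF's 18 joints plus the synthesized hip (index 18)
HEAD, NECK, R_HIP, L_HIP, HIP = 0, 1, 8, 11, 18
FACE_JOINTS = [14, 15, 16, 17]     # eyes and ears, used as a head fallback

def add_hip(joints):
    """joints: (18, 2) array of (x, y); returns (19, 2) with the hip appended."""
    hip = (joints[R_HIP] + joints[L_HIP]) / 2.0
    return np.vstack([joints, hip])

def restore_missing(prev_frames, current):
    """Fill joints missed in the previous n frames with the current estimate
    (NaN marks a missed joint)."""
    restored = [f.copy() for f in prev_frames]
    for f in restored:
        missing = np.isnan(f).any(axis=1)
        f[missing] = current[missing]
    return restored

def head_position(joints):
    """Average of the face joints when the head itself is missing."""
    if not np.isnan(joints[HEAD]).any():
        return joints[HEAD]
    return np.nanmean(joints[FACE_JOINTS], axis=0)

def inside_box(point, box):
    if box is None or np.isnan(point).any():
        return False
    x, y, w, h = box               # Faster-RCNN style (x, y, width, height)
    return x <= point[0] <= x + w and y <= point[1] <= y + h

def keep_subject(joints, box):
    """Discard a detection unless both head and torso fall inside its box."""
    return inside_box(head_position(joints), box) and inside_box(joints[HIP], box)
```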

In order to consider the local image associated with the body parts in the active region, we extract an \((n {\times } n)\)-size image feature at each joint location of index 0 (head), 3 (right elbow), 4 (right hand), 9 (right knee), 10 (right foot), 6 (left elbow), 7 (left hand), 12 (left knee), and 13 (left foot). The last fully connected layer of the Inception-resnet-v2 network is used to extract its feature vector, \(\mathbf{{pf}}_j^t\). The input image patches are extracted as \([x-n/2:x+n/2], [y-n/2:y+n/2]\), centered at position (x, y).
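A small sketch of how the local patches could be cropped around the selected joints before the CNN feature extraction; the joint index list follows the text, while the default patch size, the zero-padding at image borders, and the function names are assumptions.

```python
import numpy as np

PATCH_JOINTS = [0, 3, 4, 9, 10, 6, 7, 12, 13]   # head, elbows, hands, knees, feet

def crop_patch(frame, center, n):
    """Crop an (n x n) patch centered at (x, y), zero-padding at the image borders."""
    h, w = frame.shape[:2]
    x, y = int(round(center[0])), int(round(center[1]))
    half = n // 2
    patch = np.zeros((n, n, frame.shape[2]), dtype=frame.dtype)
    x0, x1 = max(0, x - half), min(w, x + half)
    y0, y1 = max(0, y - half), min(h, y + half)
    if x1 > x0 and y1 > y0:
        patch[y0 - (y - half):y1 - (y - half),
              x0 - (x - half):x1 - (x - half)] = frame[y0:y1, x0:x1]
    return patch

def joint_patches(frame, joints, n=32):
    """One patch per selected joint; a missing joint yields a blank patch.
    Each patch is then passed through the CNN to obtain pf_j^t."""
    blank = np.zeros((n, n, frame.shape[2]), dtype=frame.dtype)
    return np.stack([blank if np.isnan(joints[j]).any()
                     else crop_patch(frame, joints[j], n)
                     for j in PATCH_JOINTS])
```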

2.2 Body Joint Exploitation

We extract four types of joint-based body part features to express the behavior of an individual human. At each time step, for each subject, the 2D coordinates of the 19 body joints are obtained. To capture the characteristics of different behaviors, motion features are extracted according to the status of the joints. First, we create five body parts from joints 0 to 18 in each frame: right arm \(p_1\) (2, 3, 4), left arm \(p_2\) (5, 6, 7), right leg \(p_3\) (8, 9, 10), left leg \(p_4\) (11, 12, 13), and torso \(p_5\) (0, 1, 18). Each number denotes a joint index, and the 2D coordinates are denoted \(j_{i^n,(x,y)}\). We then calculate the spatio-temporal weight of the five body parts and extract a motion feature from each body part. After defining the five body parts, we calculate the inner angle of each part, \(\theta _{in}\), as follows:

$$\begin{aligned} \begin{aligned} {\mathbf {a}} = (j_{i^1,x} - j_{i^2,x}, j_{i^1,y} - j_{i^2,y}), \\ {\mathbf {b}} = (j_{i^1,x} - j_{i^3,x}, j_{i^1,y} - j_{i^3,y}), \\ \theta =\arccos \left( \frac{\mathbf {a}\cdot \mathbf {b}}{|\mathbf {a}||\mathbf {b}|}\right) \\ \end{aligned} \end{aligned}$$
(1)

We also calculate the angle between parts using (1). The outer angle \(\theta _{out}\) denotes the angle between connected parts, computed in each frame using the following joint index triples as input: (1, 2, 3), (1, 5, 6), (1, 8, 9), and (1, 11, 12). The inner angle represents the relative positions of the joints inside a body part, and the outer angle represents the shape formed by connected body parts. Together, \(\theta _{in}\) and \(\theta _{out}\) express scale-invariant posture information for the five body parts, and their change over time is an important cue for the movement of the body parts. For all points in a body part, the average difference between the previous and current points over the sequence, normalized by its length n, is used to calculate the motion velocity \(v_p^t\) and acceleration \(\hat{v}_j^t\).
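The part definitions, the angles of Eq. (1), and the motion terms can be computed directly from the joint coordinates; the sketch below follows our reading of the text (velocity and acceleration as frame differences normalized by the sequence length n) and should be taken as illustrative rather than the authors' exact formulation.

```python
import numpy as np

# Body parts as joint-index triples (p1..p5) and the outer-angle triples.
PARTS = {'r_arm': (2, 3, 4), 'l_arm': (5, 6, 7),
         'r_leg': (8, 9, 10), 'l_leg': (11, 12, 13), 'torso': (0, 1, 18)}
OUTER = [(1, 2, 3), (1, 5, 6), (1, 8, 9), (1, 11, 12)]

def angle(joints, triple):
    """Angle at the first joint of the triple, as in Eq. (1)."""
    i1, i2, i3 = triple
    a = joints[i1] - joints[i2]
    b = joints[i1] - joints[i3]
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def pose_angles(joints):
    """Scale-invariant posture descriptor: five inner and four outer angles."""
    inner = [angle(joints, t) for t in PARTS.values()]
    outer = [angle(joints, t) for t in OUTER]
    return np.array(inner), np.array(outer)

def part_motion(seq, part, n=None):
    """Average frame-to-frame displacement (velocity) and its change
    (acceleration) for one part over a sequence of (19, 2) joint arrays."""
    n = n or len(seq)
    pts = np.stack([frame[list(part)] for frame in seq])   # (T, 3, 2)
    diffs = np.linalg.norm(np.diff(pts, axis=0), axis=2)   # (T-1, 3)
    velocity = diffs.sum() / n
    acceleration = np.abs(np.diff(diffs, axis=0)).sum() / n if len(diffs) > 1 else 0.0
    return velocity, acceleration
```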

2.3 Full-Body Image Representation

We also construct a full-body image-based activity descriptor to capture overall appearance changes. The method exploits the SCM descriptor used for human interaction recognition [15, 16]. Extracting a feature vector from the full-body image has proved useful: since joint estimation from RGB images can fail, the full-body image can compensate for missing parts.

From the bounding box of the human object region, we extract the activations of the last fully connected layer of the Inception-resnet-v2 network. Then we generate a sub-volume for each object, \(\mathbf {f}_{oi}^{t} = [p, \delta {x}, \delta {y}]\), where p denotes the average of the feature vectors in the sub-volume; that is, the frame-level image feature vectors of object oi at time t over l consecutive frames are averaged into a single feature vector. Then, K-means clustering is performed on the training set to generate codewords \(\{w_k\}_{k=1}^{K}\), where K denotes the number of clusters. Each sub-volume feature \(\mathbf {f}_{oi}^{t}\) is assigned to the corresponding cluster \(w_k\) following the BoW paradigm. The index of the corresponding cluster, \(k_{oi}^t\), is the codeword index, which also serves as the row and column index of the descriptor. Note that we use the joint coordinates from pose estimation to obtain more precise object positions.
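A brief sketch of the sub-volume construction and codeword assignment, using scikit-learn's KMeans as a stand-in for the codebook; the sub-volume length l and the cluster count K are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_subvolumes(frame_features, l=5):
    """Average every l consecutive frame-level CNN features of one person
    into a single sub-volume feature f_oi^t."""
    T = (len(frame_features) // l) * l
    feats = np.asarray(frame_features[:T])
    return feats.reshape(-1, l, feats.shape[1]).mean(axis=1)

def train_codebook(all_subvolumes, K=64):
    """Learn the K codewords {w_k} on the training set."""
    return KMeans(n_clusters=K, n_init=10).fit(np.vstack(all_subvolumes))

def assign_codeword(codebook, subvolume):
    """Return the codeword index k_oi^t used to address the SCM descriptor."""
    return int(codebook.predict(subvolume.reshape(1, -1))[0])
```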

A descriptor using sub-volume features is constructed from each sub-volume of an object, \(v_{oi}^{t}= (\mathbf {f}, x, y, k)\). We measure the Euclidean distance between sub-volumes oi and oj. The overall spatial distance between sub-volume oi and every other sub-volume oj in segment t, over all pairs with \(oj\ne oi\), is aggregated as follows:

$$\begin{aligned} r^t = \frac{1}{2}\sum _{oi}\sum _{oj\ne oi}dist_{oi,oj}^t. \end{aligned}$$
(2)

The participation ratio of a pair in segment t is represented by the distance difference between sub-volumes oi and oj relative to the global motion activation. The feature scoring function based on sub-volume clustering is calculated as follows:

$$\begin{aligned} f_p = \log \left( \frac{||w_{oi}^t -\mathbf {f}_{oi}^t||+||w_{oj}^t-\mathbf {f}_{oj}^t||}{2} + \psi \right) . \end{aligned}$$
(3)

After computing all required values between all sub-volumes, we finally construct the SCM descriptor, as follows:

$$\begin{aligned} M^t(k_{oi}^t,k_{oj}^t) =\frac{1}{N}\sum _{oi,oi\ne oj}\sum _{1:t} \frac{s_{oi}^t}{\epsilon ^t}\frac{r^t}{dist_{oi,oj}^t}f_p(\mathbf {f}_{oi}^t, \mathbf {f}_{oj}^t), \end{aligned}$$
(4)

where N is the normalization term. The value for each pair oi, oj is assigned to the SCM descriptor at the corresponding cluster indices, \(k_{oi}^t\) and \(k_{oj}^t\), of the two sub-volumes. A descriptor is generated for every non-overlapping time step, so the descriptor is constructed cumulatively.
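The accumulation of Eqs. (2)–(4) into the \(K \times K\) descriptor could look like the following sketch; the motion-activation ratio \(s_{oi}^t/\epsilon ^t\) is defined in the cited SCM papers and is therefore left as a user-supplied weight here, so this is illustrative rather than the exact implementation.

```python
import numpy as np

def scm_update(M, subvols, codebook, psi=1.0, activation=None):
    """One cumulative update of the K x K SCM descriptor M for a segment.

    subvols: per-person dicts with 'f' (sub-volume feature), 'xy' (position),
    and 'k' (codeword index).  `activation` stands in for the s_oi^t / eps^t
    motion-activation ratio from the cited SCM papers; the normalization by N
    is applied after all segments have been accumulated."""
    if len(subvols) < 2:
        return M
    dist = lambda a, b: np.linalg.norm(np.asarray(a['xy']) - np.asarray(b['xy']))
    # Eq. (2): overall spatial distance aggregated over all ordered pairs.
    r_t = 0.5 * sum(dist(a, b) for a in subvols for b in subvols if a is not b)
    for a in subvols:
        for b in subvols:
            if a is b:
                continue
            # Eq. (3): feature score of the pair against its codewords.
            f_p = np.log(0.5 * (np.linalg.norm(codebook[a['k']] - a['f'])
                                + np.linalg.norm(codebook[b['k']] - b['f'])) + psi)
            w = activation(a) if activation is not None else 1.0
            # Eq. (4): contribution accumulated at (k_oi^t, k_oj^t).
            M[a['k'], b['k']] += w * (r_t / (dist(a, b) + 1e-8)) * f_p
    return M
```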

2.4 Spatio-temporal Weight for Classification

In this section, we present the joint-based spatio-temporal weight of the active region. The basic idea is the assumption that, when a human interaction occurs, the body parts that constitute the action have different importance. The spatio-temporal weights of the body parts differ between the person who leads the action and the other person, depending on the interactive motion. For example, when person 1 punches person 2, person 1 reaches out to person 2's head, and person 2 is pushed back without moving towards person 1. If person 1 performs a push action, person 2's response looks similar to that of a punch, but person 1 reaches out to person 2's torso with both hands. We try to capture these subtle differences between similar activities and reflect them in the weights. The weight of the body parts between persons is calculated as follows:

$$\begin{aligned} \begin{aligned} A_{p,t}=S\times {\frac{\sum _{q=1}^{5}|d_{q,t}-d_{q,t-1}|}{|d_{p,t}-d_{p,t-1}|}} \end{aligned} \end{aligned}$$
(5)

where d denotes the relative distance between the corresponding body parts of the interacting persons. The calculated part weight \(A_{p,t}\) is multiplied by the velocity and acceleration to obtain the weighted motion terms \(wv_p^t = A_{p,t}\times {v_p^t}\) and \(w\hat{v}_j^t = A_{p,t}\times {\hat{v}_j^t}\). The motion feature \(m_{p}^{t}\) is created by concatenating \(\theta _{in}\), \(\theta _{out}\), the weighted \(wv_p^t\), and \(w\hat{v}_j^t\). The weight is also applied to the image patch feature vector extracted at each joint: since an interacting body part with a high weight plays an important role in the activity, the joint-based image feature extracted at that part's position is weighted accordingly as \(\mathbf{{wf}} = \mathbf{{pf}} \otimes A_p\).
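A sketch of Eq. (5) and the weighting step, assuming \(d_{p,t}\) is the inter-person distance of part p at frame t and that each patch joint inherits the weight of its body part; the scaling constant S and the joint-to-part mapping for the head are assumptions.

```python
import numpy as np

# Map each patch joint to its body part (0: r_arm, 1: l_arm, 2: r_leg, 3: l_leg, 4: torso).
PATCH_TO_PART = {0: 4, 3: 0, 4: 0, 6: 1, 7: 1, 9: 2, 10: 2, 12: 3, 13: 3}

def part_weights(d_t, d_prev, S=1.0):
    """Eq. (5): spatio-temporal weight of the five parts from the change in
    their inter-person distances between frames t-1 and t."""
    delta = np.abs(np.asarray(d_t) - np.asarray(d_prev)) + 1e-8   # shape (5,)
    return S * delta.sum() / delta

def motion_feature(A, theta_in, theta_out, v, v_hat):
    """m_p^t: angles concatenated with the weighted velocity and acceleration."""
    return np.concatenate([theta_in, theta_out,
                           A * np.asarray(v), A * np.asarray(v_hat)])

def weighted_patches(patch_feats, A, patch_joints=(0, 3, 4, 9, 10, 6, 7, 12, 13)):
    """wf = pf (x) A_p: scale each joint's patch feature by its part weight."""
    w = np.array([A[PATCH_TO_PART[j]] for j in patch_joints])
    return patch_feats * w[:, None]
```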

Fig. 3. Illustration of the overall framework combining the spatio-temporal weights, joint features, and image features. The joint estimation stage at the beginning denotes human body joint extraction from the RGB input.

The overall framework is illustrated in Fig. 3. From a given video, the estimated joints are processed through three streams: joint patch feature extraction, body part motion features with spatio-temporal weights, and full-body image feature extraction. At each step, the generated motion features \(m_{p}^{t}\), the joint-based weighted image patch features \(\mathbf{{wf}}_{p}^{t}\), and the SCM descriptor passed through a multilayer perceptron are concatenated and used as the LSTM input. The activity classification is obtained as the output of the LSTM after processing t segments.
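As a rough illustration of the fusion, the following Keras sketch concatenates the three per-segment streams and classifies with an LSTM; the layer sizes and the MLP used for the SCM descriptor are placeholders, not the authors' configuration.

```python
import tensorflow as tf

def build_classifier(t_segments, motion_dim, patch_dim, scm_k, n_classes,
                     hidden=256):
    """Concatenate motion features, weighted patch features, and the
    MLP-projected SCM descriptor per segment, then classify with an LSTM."""
    motion = tf.keras.Input((t_segments, motion_dim))
    patches = tf.keras.Input((t_segments, patch_dim))
    scm = tf.keras.Input((t_segments, scm_k * scm_k))   # flattened K x K descriptor

    # Multilayer perceptron applied to the SCM descriptor of each segment.
    scm_proj = tf.keras.layers.TimeDistributed(
        tf.keras.Sequential([tf.keras.layers.Dense(512, activation='relu'),
                             tf.keras.layers.Dense(128, activation='relu')]))(scm)

    fused = tf.keras.layers.Concatenate()([motion, patches, scm_proj])
    h = tf.keras.layers.LSTM(hidden)(fused)             # output after t segments
    out = tf.keras.layers.Dense(n_classes, activation='softmax')(h)
    return tf.keras.Model([motion, patches, scm], out)
```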

3 Experiment

In this section, we validate the effectiveness of the proposed method on the BIT-Interaction dataset [11] and the UT-Interaction dataset [23], which are widely used in human interaction recognition research. The performance of the proposed method is compared with that of competing methods. In this experiment, joint estimation was done using PAF [7]. To extract joint patch features and full-body image features, we use the weights of the Inception-resnet-v2 network [30], implemented in TensorFlow [1].

Fig. 4. Sample frames of the BIT-Interaction dataset (a, b) and the UT-Interaction dataset set #1 (c) and set #2 (d).

The BIT-Interaction dataset used in the experimental evaluation consists of eight classes of human interactions: bow, boxing, handshake, high-five, hug, kick, pat, and push. Each class contains 50 clips. The videos were captured in very realistic environments, including partial occlusion, camera movement, complex backgrounds, varying subject sizes, viewpoint changes, and lighting changes. Sample images from this dataset are shown in Fig. 4(a) and 4(b); both exhibit environmental difficulties such as occlusion and complex backgrounds. For this dataset, we used clips with indices 1 to 34 of each class for training (272 clips in total) and the remaining clips with indices 35 to 50 as the test set (128 clips), following the standard protocol in the literature [5, 10, 13].

Table 1. Comparison of the recognition results on the BIT-Interaction dataset

The quantitative results on the BIT-Interaction dataset, compared with the competing methods, are shown in Table 1. The table lists the average classification accuracy over the eight classes. The proposed method achieved the best overall performance, with 92.67% recognition accuracy for human interaction recognition. In particular, it outperforms the SCM-based technique [15], which only considers full-body images. This indicates that using joint-based high-level motion information is better than relying on low-level image features alone.

The UT-Interaction dataset used in the experimental evaluation consists of six classes of human interactions: push, kick, hug, point, punch, and handshake. Each class contains 10 clips per set. The dataset is composed of two sets of videos captured in different environments: set #1 and set #2. The set #1 videos were captured against a parking lot background, whereas the set #2 backgrounds consist of grass and jittering twigs, which can introduce noise into local patches. We performed leave-one-out cross validation for the results in Tables 2 and 3, as done in previous studies [5, 10, 16, 21, 24, 28].
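For reference, leave-one-out cross validation over the clips of each set can be run with scikit-learn's splitter; this is a generic sketch, not the authors' evaluation code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_accuracy(features, labels, train_and_eval):
    """Average accuracy over leave-one-out splits.
    features: (num_clips, D) array; labels: (num_clips,) array.
    train_and_eval(train_X, train_y, test_X, test_y) -> accuracy in [0, 1]."""
    loo = LeaveOneOut()
    scores = [train_and_eval(features[tr], labels[tr], features[te], labels[te])
              for tr, te in loo.split(features)]
    return float(np.mean(scores))
```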

Table 2. Comparison of the recognition results on the UT-Interaction dataset (set #1).
Table 3. Comparison of the recognition results on the UT-Interaction dataset (set #2).

Table 2 compares the classification accuracy on UT-Interaction set #1, where the proposed method achieved 91.70% recognition accuracy. The performance of MMAPM was very high, and the proposed method obtained the second-highest accuracy. On the other hand, our method achieved the highest performance on set #2, as shown in Table 3. This is because set #2 has a noisier background than set #1, so the proposed method, which exploits human structural characteristics through joint estimation, works better than competing methods based on image features only. Considering the environmental variability of real-world scenarios, the proposed method is therefore highly effective.

4 Conclusion and Future Work

Despite numerous studies, recognizing the complex activity of people in video remains challenging. In particular, the complex activity of two or more people interacting with each other requires a higher level of scene understanding than robust image representation alone. In this study, we showed that robust activity recognition results can be obtained from human joint information estimated from RGB videos. Assigning spatio-temporal weights to actively interacting body parts improves recognition accuracy over RGB-only methods, which indicates that the relationship between subjects plays a key role in complex activity recognition. In addition, the proposed method is highly practical, in the sense that it overcomes the limitations of existing depth sensors used to obtain skeleton information and enables the use of a common RGB camera. In future research, we intend to extend this work to show robust performance in interactions involving more people or non-human objects.