1 Introduction

Automatically recognizing human actions is receiving increasing attention due to its wide range of applications, such as video retrieval, human-computer interaction and activity monitoring. A large number of methods [1, 2] for human action recognition have been proposed, ranging from trajectory-based methods [3] and local descriptor-based methods [4] to attribute-based methods [5, 6].

However, most of these previous approaches to human action recognition are constrained to well-controlled environments and fail to achieve the desired results in complex scenes. Human action recognition in complex scenes is an extremely difficult task due to several challenges, such as background clutter, camera motion, occlusions and illumination variations. To address these challenges, several methods have been proposed, such as tree-based template matching [7], tensor canonical correlation [8] and prototype-based action matching [9]. Most of these methods are complex, time consuming and require preprocessing such as segmentation, tree data structure building, target tracking or background subtraction. Other methods [10–12] for human action recognition in complex scenes apply spatio-temporal interest point detectors and local descriptors to characterize and encode the action video, demanding little or no preprocessing; this kind of method thus achieves promising recognition accuracy. However, interest points are often falsely detected in uncontrolled environments. Fig. 58.1 shows the result of interest point detection in a complex scene: the interest points outside the green rectangle are invalid, since the actor is inside the green rectangle.

Fig. 58.1 Detecting interest points in complex scenes

In this paper, we propose a novel method to classify human actions in complex scenes. As is well known, the interest points inside or around the actor are the ones beneficial to classification. Therefore, we utilize salient regions to select interest points: concretely, we retain interest points with high saliency values and discard those with low saliency values. After selecting the interest points, we apply the CLC coding strategy [13] to capture the spatial and temporal relationships among interest points. Finally, we train the classification model using an SVM.

The rest of this paper is organized as follows. Section 58.2 introduces the proposed method in detail. Section 58.3 demonstrates that our experimental results are more accurate than those of state-of-the-art methods on the UCF Sports dataset and the YouTube dataset. Finally, section "Conclusion" concludes this paper.

2 Approach

2.1 Method Overview

The proposed method consists of four stages: (a) detecting spatio-temporal interest points in each action video; (b) detecting salient regions in each frame of the video; (c) selecting significant spatio-temporal interest points according to the salient regions; (d) generating feature histograms and training the classifier. The flowchart of the proposed method is shown in Fig. 58.2, and the corresponding details are presented in the following sections.

Fig. 58.2 Flowchart of the proposed method. (a) Original action video; (b) detecting spatio-temporal interest points; (c) detecting salient regions; (d) selecting significant spatio-temporal interest points; (e) generating feature histograms

2.2 Detection of Spatio-Temporal Interest Points

To detect interest points, we first apply the Harris3D corner detector [14] to each action video, as shown in Fig. 58.2b. For each interest point, we characterize the local appearance using histograms of oriented gradients (HOG) and histograms of optical flow (HOF) [15]. As a result, we obtain a set of interest points for an action video, \( v={\left\{\left({\mathbf{x}}_i,{\mathbf{s}}_i\right)\right\}}_{i=1,\dots, N} \), where N is the number of interest points in the video v, \( {\mathbf{x}}_i \) is the feature vector of the i-th interest point and \( {\mathbf{s}}_i=\left(x,y,t\right) \) is its location, with x, y and t the horizontal, vertical and temporal coordinates, respectively.
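To illustrate the detection stage, the following is a minimal sketch of a Harris3D-style response, assuming a grayscale video stored as a numpy volume of shape (T, H, W). The scale parameters, the constant k and the threshold are illustrative choices, not the settings of [14], which additionally operates over multiple scales; the HOG/HOF description step is omitted.

```python
# A minimal, single-scale sketch of Harris3D-style interest point detection.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris3d_response(video, sigma=2.0, tau=1.5, k=0.005):
    """Spatio-temporal corner response H = det(mu) - k * trace(mu)^3."""
    v = gaussian_filter(video.astype(np.float64), sigma=(tau, sigma, sigma))
    gt, gy, gx = np.gradient(v)  # temporal, vertical, horizontal derivatives
    # Entries of the second-moment matrix, integrated over a Gaussian window.
    s = (2 * tau, 2 * sigma, 2 * sigma)
    Mxx = gaussian_filter(gx * gx, s); Myy = gaussian_filter(gy * gy, s)
    Mtt = gaussian_filter(gt * gt, s); Mxy = gaussian_filter(gx * gy, s)
    Mxt = gaussian_filter(gx * gt, s); Myt = gaussian_filter(gy * gt, s)
    det = (Mxx * (Myy * Mtt - Myt ** 2)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    trace = Mxx + Myy + Mtt
    return det - k * trace ** 3

def detect_points(video, thresh=1e-4):
    """Return interest point locations s_i = (t, y, x) as local maxima of H."""
    H = harris3d_response(video)
    peaks = (H == maximum_filter(H, size=5)) & (H > thresh)
    return np.argwhere(peaks)
```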

2.3 Salient Region Detection

We detect the salient regions of each frame using region contrast (RC) [16]. First, we segment the frame into regions with a graph-cut-based segmentation method [17]. Then, a color histogram is built for each region. For each region \( a_k \), we calculate its saliency value by comparing its color contrast with all other regions:

$$ S\left({a}_k\right)=\sum_{{a}_i\ne {a}_k}\exp \left(-{D}_s\left({a}_k,{a}_i\right)/{\sigma}^2\right)w\left({a}_i\right){D}_r\left({a}_k,{a}_i\right) $$
(58.1)

where \( w\left({a}_i\right) \) is the weight of region \( a_i \), \( D_r \) is the color distance between the two regions, \( D_s \) is the spatial distance between the two regions, and σ controls the strength of the spatial weighting. From the above equation we can see that all pixels in one region share the same saliency value. The weight \( w\left({a}_i\right) \) is given by the number of pixels in \( a_i \), and the spatial distance is defined as the Euclidean distance between the centroids of the regions. The color distance between two regions \( a_1 \) and \( a_2 \) is defined as:

$$ {D}_r\left({a}_1,{a}_2\right)=\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}f\left({c}_{1,i}\right)f\left({c}_{2,j}\right)D\left({c}_{1,i},{c}_{2,j}\right) $$
(58.2)

where \( f\left({c}_{k,i}\right) \) is the probability of the i-th color \( c_{k,i} \) among all \( n_k \) colors in the k-th region \( a_k \), \( k=1,2 \). The result of salient region detection is shown in Fig. 58.2c.
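The following is a simplified sketch of RC saliency following Eqs. (58.1) and (58.2), assuming a precomputed segmentation label map; the 64-color quantization and the value of σ² are illustrative assumptions, not the exact settings of [16].

```python
# A simplified region-contrast (RC) saliency sketch for one frame.
import numpy as np

def region_contrast_saliency(image, labels, sigma2=0.4):
    """image: (H, W, 3) floats in [0, 1]; labels: (H, W) integer region ids.
    Returns a per-pixel saliency map scaled to [0, 255]."""
    h, w = labels.shape
    regions = np.unique(labels)
    # Quantize RGB to 4 bins per channel -> a 64-color palette.
    quant = np.minimum((image * 4).astype(int), 3)
    codes = quant[..., 0] * 16 + quant[..., 1] * 4 + quant[..., 2]
    palette = np.stack(np.meshgrid(*[np.arange(4)] * 3, indexing="ij"),
                       axis=-1).reshape(-1, 3) / 3.0
    color_dist = np.linalg.norm(palette[:, None] - palette[None], axis=2)
    hists, weights, centroids = [], [], []
    for r in regions:
        mask = labels == r
        hists.append(np.bincount(codes[mask], minlength=64) / mask.sum())
        weights.append(mask.sum())                        # w(a_i): pixel count
        ys, xs = np.nonzero(mask)
        centroids.append((xs.mean() / w, ys.mean() / h))  # normalized centroid
    S = np.zeros(len(regions))
    for k in range(len(regions)):
        for i in range(len(regions)):
            if i == k:
                continue
            Dr = hists[k] @ color_dist @ hists[i]         # Eq. (58.2)
            Ds = np.hypot(centroids[k][0] - centroids[i][0],
                          centroids[k][1] - centroids[i][1])
            S[k] += np.exp(-Ds / sigma2) * weights[i] * Dr  # Eq. (58.1)
    # Every pixel of a region shares that region's saliency value.
    smap = S[np.searchsorted(regions, labels)]
    return 255.0 * (smap - smap.min()) / (smap.ptp() + 1e-12)
```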

2.4 Selective Interest Points

Since human actions are usually recorded in complex scenes, with cluttered backgrounds, illumination variations, camera motion and occluded bodies, many noisy interest points are detected. These noisy interest points are usually located in the background and are therefore harmful to classification. To address this problem, we apply the salient regions to suppress them. For each interest point (x, s), we compute the maximum saliency value around its spatial location:

$$ {S}_m=\max_{\left(x,y\right)\in {R}_l}S\left(x,y\right) $$
(58.3)

where \( R_l \) is the local region around the interest point (see the red rectangle in Fig. 58.3b) and \( S\left(x,y\right) \) is the saliency value at (x, y). If \( S_m > T_s \), where \( T_s \) is the saliency threshold, we retain the interest point because it lies in a salient region; otherwise we discard it. The result of interest point selection is illustrated in Fig. 58.2d.
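To make the selection rule concrete, the following is a minimal sketch of Eq. (58.3). Interpreting the reported size 15 for \( R_l \) as the window radius is our assumption; \( T_s = 180 \) follows the setting in Sect. 58.3, with saliency maps scaled to [0, 255].

```python
# A minimal sketch of interest point selection by saliency (Eq. 58.3).
def select_interest_points(points, saliency_maps, r=15, t_s=180):
    """points: iterable of (feature, (x, y, t)); saliency_maps: per-frame
    (H, W) arrays in [0, 255]. Returns only the retained interest points."""
    kept = []
    for feat, (x, y, t) in points:
        smap = saliency_maps[t]
        h, w = smap.shape
        # Local region R_l around the point's spatial location, clipped
        # to the frame boundaries.
        window = smap[max(0, y - r):min(h, y + r + 1),
                      max(0, x - r):min(w, x + r + 1)]
        if window.max() > t_s:   # S_m > T_s: the point lies in a salient region
            kept.append((feat, (x, y, t)))
    return kept
```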

Fig. 58.3 Selective interest points

2.5 Feature Coding

After selecting the interest points, we cluster the interest points from all action videos using the k-means algorithm to generate the dictionary. Then we use the CLC coding strategy [13] to encode the features; it not only considers the spatio-temporal relationships among interest points but also alleviates quantization error through linear coding. Each action video is then represented by a histogram (see Fig. 58.2e). Finally, we use these feature histograms to train a multi-class SVM.
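For orientation, the following is a minimal bag-of-words baseline for this final stage, substituting hard vector quantization for the CLC coding of [13] (which additionally encodes spatio-temporal context among interest points); the codebook size follows Sect. 58.3, while the SVM kernel and C are illustrative.

```python
# A hard-quantization bag-of-words baseline for the coding/training stage.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import SVC

def train_action_classifier(video_features, labels, k=4000):
    """video_features: list of (N_i, D) arrays of selected interest point
    descriptors, one per video; labels: per-video action labels."""
    all_feats = np.vstack(video_features)
    codebook = MiniBatchKMeans(n_clusters=k, n_init=3).fit(all_feats)
    hists = []
    for feats in video_features:
        words = codebook.predict(feats)                 # assign visual words
        hist = np.bincount(words, minlength=k).astype(float)
        hists.append(hist / (hist.sum() + 1e-12))       # normalized histogram
    clf = SVC(kernel="rbf", C=10.0).fit(np.array(hists), labels)
    return codebook, clf
```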

3 Experimental Results

To evaluate the proposed method for human action recognition, we conduct a series of experiments on two publicly available human action datasets: the UCF Sports dataset [18] and the YouTube action dataset [19]. These two datasets are challenging because the action videos are recorded in realistic scenes and suffer from cluttered backgrounds, illumination variations, camera motion and so on. The codebook is constructed with the k-means algorithm, and the codebook size is empirically set to 4,000 [20]. The saliency threshold \( T_s \) is set to 180 and the size of the local region \( R_l \) is set to 15.

We present the results of interest point selection in Fig. 58.4. The first row shows the originally detected interest points, the second row the salient regions, and the last row the results after suppressing the noisy interest points. From Fig. 58.4 we can see that most of the interest points in the background are discarded while those on the human body are retained, which benefits the subsequent classification. Next, we objectively evaluate the proposed method on the UCF Sports and YouTube action datasets.

Fig. 58.4 Performance of the selective interest points in complex scenes. The action videos (a, b) are from the UCF Sports dataset and the action videos (c, d) are from the YouTube action dataset

The UCF Sports dataset [18] contains ten types of sports actions: swinging (on the pommel horse and on the floor), diving, kicking (a ball), weight-lifting, horse-riding, running, skateboarding, swinging (at the high bar), golf swinging and walking. The dataset consists of 150 real videos covering a wide range of viewpoints and scene backgrounds. To increase the number of training samples, we extend the dataset by adding a horizontally flipped version of each video sequence, as suggested in [20]. Table 58.1 compares our method with other state-of-the-art methods; our method achieves the highest recognition accuracy, 88 %. Figure 58.5 shows the confusion table of the recognition results on the UCF Sports dataset. As the figure shows, "horse-riding" is prone to be misclassified as "running" due to their similar appearance.

Table 58.1 Recognition results of different methods on the UCF Sports dataset
Fig. 58.5 Confusion table of our method on UCF Sports dataset

The YouTube dataset [19] is a collection of 1,168 complex and challenging YouTube videos covering 11 human action categories: basketball shooting, volleyball spiking, trampoline jumping, soccer juggling, horseback riding, cycling, diving, swinging, golf swinging, tennis swinging and walking (with a dog). The dataset has the following properties: a mix of steady and shaky cameras, cluttered backgrounds, low resolution, and variations in object scale, viewpoint and illumination. Our method achieves 88.65 % recognition accuracy on this dataset; Table 58.2 compares our result with other state-of-the-art methods.

Table 58.2 Recognition results of different methods on the YouTube dataset

Conclusion

In this paper, a novel method has been proposed to classify human actions in complex scenes. We select interest points using salient regions; the selected interest points benefit the subsequent classification because they lie inside or around the actors. The proposed method has been validated on two challenging datasets, and the experimental results clearly demonstrate the superiority of our method over previous methods for human action recognition.