1 Introduction

With the tremendous growth of video capturing devices and storage capacity, video data have increased explosively. Meanwhile, human action recognition in videos has attracted much attention in the computer vision community due to its wide applications in video surveillance, multimedia analysis, human-computer interaction and healthcare. Although many efforts have been devoted to this domain, action recognition remains challenging for two main reasons: (1) the low quality of video, such as low resolution, camera motion and cluttered background; (2) the large intra-class variance caused by different motion speeds, illumination intensities and viewpoints. The crucial step in dealing with these obstacles is to design robust feature extraction methods. To the best of our knowledge, there are two main categories of video feature representation in action recognition: hand-crafted features and deep-learned ones.

In the first category, researchers use hand-crafted local features such as Cuboids [6], Space Time Interest Points [17] and improved Dense Trajectories [31]. The extraction of these features usually consists of two steps: key point detection and feature extraction. However, the dimension of these local features is relatively high, which increases the computational complexity. Besides, these hand-crafted local features may lack discriminative capacity for action recognition and are not optimal for visual representation.

In the second category, researchers develop various deep neural network architectures to extract features, which have achieved great success in visual recognition recently. Based on the type of network architecture, we can further divide these works into three types. The first type uses 2D convolutional neural networks [2, 8, 13, 25, 32]. These architectures can exploit models pre-trained on image recognition, but cannot capture spatial and temporal information simultaneously. The second type uses 3D convolutional neural networks [4, 11, 29]. They extend the 2D convolutional filters to 3D and apply them to action tubes to capture spatio-temporal information simultaneously. Although this architecture naturally suits the video data structure for modeling spatio-temporal information, its parameters cannot be initialized from existing models pre-trained on large-scale labelled image datasets. The third type is a hybrid [7, 38] of convolutional neural networks and recurrent neural networks, since recurrent neural networks model temporal information better. However, the joint training procedure is complex and the optimal solution is hard to obtain.

In this paper, we propose a novel hybrid spatio-temporal convolutional network that combines a dynamic image stream with spatial and temporal streams to take full advantage of spatio-temporal information. To illustrate the information these three streams contain, we give an example of an RGB image, optical flow images and a dynamic image in Fig. 1. Our work is mainly inspired by Two-Stream networks [25]. Although the combination of a spatial stream and a temporal stream fuses appearance and motion information to obtain better performance, it is still limited in temporal modeling: the spatial stream is trained only on single still frames, and the temporal stream, which uses optical flow, loses many appearance details. To better incorporate appearance and motion information, we introduce a novel dynamic image stream into the whole architecture. By using a ranking machine [2] to encode the temporal evolution of frames in a video, a dynamic image can simultaneously preserve the details of objects as well as the motion information over a relatively long time period.

Fig. 1 Examples of an RGB image, optical flow images and a dynamic image. a: The RGB image contains scene and object information. b, c: The optical flow (x, y directions) reflects the motion information. d: The dynamic image contains both appearance and motion information

This paper has three main contributions:

(1) An improved dynamic image network is proposed and evaluated to show that dynamic images can capture spatio-temporal information simultaneously.

(2) We propose a novel hybrid spatio-temporal convolutional network by combining a dynamic image stream with the spatial and temporal streams to explore spatio-temporal information.

(3) Our approach obtains state-of-the-art performance on the HMDB51 dataset (70.4 %) and comparable performance on the UCF101 dataset (94.1 %).

The rest of this paper is organized as follows. In Section 2, related work is introduced and discussed. Section 3 presents the proposed approach, including the network architecture and the training and testing details. The experimental results are presented and discussed in Section 4. Finally, we draw conclusions in Section 5.

2 Related work

Researchers have devoted much effort to designing discriminative feature representations and effective classifiers for action recognition over the past decades. Many local image features have been generalized to videos, such as 3D SIFT [21], extended SURF [37] and 3D HOG [14]. These local features, extracted around detected interest points, represent 3D volumes. Recently, improved Dense Trajectories [31] have proven successful on a number of challenging datasets. Specifically, the information along each trajectory is encoded by HOG, HOF and MBH to represent the action in a video. However, these local features are designed for specific settings, are hard to generalize to other scenarios, and lack high-level semantic information. To deal with these issues, Actons [48] and Action Bank [20] were proposed. In order to utilize the relations between action categories, Yang et al. [45] proposed a method based on a multi-task learning framework with super-categories. Alfaro et al. [1] proposed a scheme to quantify relative intra- and inter-class similarities among local temporal patterns. For action recognition on RGBD datasets, several methods have been proposed, such as the multi-modal multi-part learning framework [22], the discriminative multi-instance multi-task learning framework (MIMTL) [43], the bilinear heterogeneous information machine [15], and the latent max-margin multi-task learning framework [44]. There are also interesting works related to surveillance systems, such as video structural description (VSD) [39-42], which represents and organizes the content of videos, and correspondence structure learning [24] for person re-identification. These hand-crafted methods depend heavily on expert knowledge and cannot be trained in an efficient end-to-end manner.

The great success of deep convolutional neural networks has encouraged researchers to utilize deep features for action recognition. Ji et al. [11] extended 2D convolutional filters to 3D and applied them directly to video tubes. The C3D model [29] used 3D convolutions and 3D pooling to explore spatio-temporal information simultaneously. However, these 3D models cannot utilize existing models pre-trained on large-scale labelled image datasets. Some works use recurrent neural networks to model the temporal evolution, such as LRCN [7] and a hybrid CNN-LSTM framework [38], but the complex joint training procedure and the large number of parameters make it hard to obtain an optimal solution. With the introduction of attention mechanisms into action recognition, a simple soft attention mechanism [23], the Video LSTM model [18] and a two-stream hierarchical attention model [36] were proposed; however, these methods need specifically designed regularizers to guide the attention mechanism. In order to overcome the problem of limited temporal modeling, Feichtenhofer et al. [8] proposed several spatio-temporal fusion methods for video snippets. There are also excellent works [4, 47] on accelerating action recognition while preserving acceptable performance. Among these deep-learned methods, the most representative work is the Two-Stream ConvNets [25], which uses two individually trained and complementary streams, i.e., a spatial stream and a temporal stream. This method was the first to obtain performance comparable with hand-crafted features, and we design our model based on it in this paper.

Recently, researchers have devoted much effort to temporal modeling, because temporal evolution information is more discriminative for action recognition. The Temporal Segment Networks [34] segment the video into several clips and sparsely sample from each clip to model long-term temporal information. Their experimental results show that the model can focus on useful information across the whole video. Bilen et al. [2] proposed a novel compact representation of videos, the dynamic image, which is generated by encoding the order of the frames in a video to capture its dynamic evolution.

Among these approaches, the Temporal Segment Networks [34] and Dynamic Image Networks [2] are the closest to ours. Both focus on modeling long-term temporal evolution and exploit appearance and motion information to improve performance. However, Temporal Segment Networks only divide the video into several clips, so it is still hard to capture the dynamic information between RGB frames in the spatial stream. Dynamic images, in contrast, naturally capture relatively long-term dynamics because each one is generated by encoding L (e.g., L=20) consecutive frames, and they convey information complementary to both still images and optical flow images. In this paper, we adopt a dynamic image stream as the third stream and combine it with the original two streams to take full advantage of spatio-temporal information.

3 Proposed approach

In this section, we describe the proposed hybrid spatio-temporal convolutional network for action recognition in detail. Firstly, the overall framework is presented in Section 3.1. Then we describe the network architecture and training details in Section 3.2. Finally, the testing details are introduced in Section 3.3.

3.1 Hybrid spatio-temporal convolutional neural network

In this section, we propose a novel hybrid spatio-temporal convolutional network for action recognition; the framework is illustrated in Fig. 2. It contains three individual and complementary streams: a spatial stream, a temporal stream and a dynamic image stream. The spatial stream processes RGB still images, which contain the appearance information. Similarly, the temporal stream processes optical flow images, which contain the motion information. The dynamic image stream processes dynamic images, which capture the correlation between space and time. In the following, we give a detailed description of each stream and summarize the advantages of the proposed scheme.

Fig. 2 Hybrid spatio-temporal convolutional network: the network has three inputs, a single RGB image, stacked optical flow and a single dynamic image. The class scores are fused at the softmax layer to combine the motion and appearance information. Notice that each Inception block here contains more than one Inception layer

Spatial stream

In this stream, all frames of a video share the same label even though they differ from each other. We input still images and obtain class scores at the softmax layer of this stream. Still RGB images contain static appearance information such as color, texture, and particular scenes and objects. This information is strongly associated with the performed action; for example, a bow always appears in the archery action and a horse always appears in the horse riding action. However, the limits of using still RGB images for action recognition are obvious. The cluttered background of a video degrades performance, and different actions may have similar patterns in RGB images, for example smiling and laughing, or walking and jogging. Using static appearance information alone can therefore cause confusion about which action is actually performed.

Temporal stream

This stream is intended to model the motion evolution of an action. We use stacked optical flow fields to represent the motion pattern during a period of time and take them as the inputs of the stream. All stacked optical flow fields of a video share the same class label, and they are fed forward to obtain class scores at the softmax layer. Here we give a detailed description of optical flow. Assuming the intensity of light is basically constant in corresponding regions, the optical flow is calculated from the relative movement between two consecutive frames. As the example in Fig. 3 shows, we use \(\mathbf{d}_{t}(x,y)\) to denote the displacement vector at the point (x,y) in the t-th frame, which reflects the movement from the current point to the corresponding point in the following (t+1)-th frame. The vector \(\mathbf{d}_{t}(x,y)\) is composed of the horizontal component \(d_{t}^{h}(x,y)\) and the vertical component \(d_{t}^{v}(x,y)\). Based on the brightness consistency between two consecutive frames, we obtain the equation below:

$$ I(x,y,t) = I\left( x+d_{t}^{h},y+d_{t}^{v},t+1\right), $$
(1)

where I(x,y,t) represents the image function (i.e., gray value) of the pixel at location (x,y) at time t. The linearized version of the equation, obtained by a first-order Taylor approximation, is

$$\begin{array}{@{}rcl@{}} I(x,y,t) &\approx& I(x,y,t+1) + \nabla I(x,y,t+1)^{T} \mathbf{d}_{t}(x,y) \\ 0 &=& \underbrace{I(x,y,t+1) - I(x,y,t)}_{I_{t}(x,y,t+1)} + \nabla I(x,y,t+1)^{T} \mathbf{d}_{t}(x,y). \end{array} $$
(2)

Then we obtain the optical flow constraint (OFC) equation below:

$$ OFC(d_{t}^{h}, d_{t}^{v}): \quad 0 = I_{t} + I_{x} {d_{t}^{h}} + I_{y} {d_{t}^{v}}, $$
(3)

where the partial derivatives of the image function (i.e., gray value) are denoted as \(I_{t}\), \(I_{x}\), and \(I_{y}\). Finally, this constraint, combined with various estimation methods, is used to obtain the optical flow components \({d_{t}^{h}}\) and \({d_{t}^{v}}\). We do not detail these methods here, since we only use optical flow as one of the feature representations.
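To make the constraint concrete, the following minimal NumPy sketch (our own illustration, not part of the original work; the function name `ofc_residual` is hypothetical) evaluates the residual of (3) for a candidate displacement field on a smooth synthetic frame pair:

```python
import numpy as np

def ofc_residual(frame_t, frame_t1, flow_h, flow_v):
    """Evaluate the optical flow constraint (3) residual per pixel.

    frame_t, frame_t1 : (H, W) grayscale intensities I(.,.,t) and I(.,.,t+1)
    flow_h, flow_v    : (H, W) candidate horizontal / vertical displacements d_t^h, d_t^v
    Returns I_t + I_x * d^h + I_y * d^v, which is near zero for a flow that
    satisfies the linearized brightness-constancy assumption.
    """
    I_t = frame_t1 - frame_t                      # temporal derivative
    I_y, I_x = np.gradient(frame_t1)              # gradients along rows (y) and columns (x)
    return I_t + I_x * flow_h + I_y * flow_v

# Toy usage: a smooth pattern shifted right by one pixel should be explained
# well by a constant flow (d^h, d^v) = (1, 0).
ys, xs = np.mgrid[0:64, 0:64]
frame_t = np.sin(xs / 8.0) + np.cos(ys / 11.0)
frame_t1 = np.sin((xs - 1) / 8.0) + np.cos(ys / 11.0)
residual = ofc_residual(frame_t, frame_t1, np.ones((64, 64)), np.zeros((64, 64)))
print(np.abs(residual).mean())                    # small value, limited by the Taylor approximation
```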

Fig. 3 Optical flow. a, b: Two consecutive frames, with the area around a moving golf club outlined with a red rectangle. c: A close-up of the dense optical flow in the outlined area. d, e: The horizontal and vertical components of the displacement vector field

The experimental results show that optical flow information is more discriminative than still RGB appearance information. However, it can be ambiguous, since a single optical flow field only characterizes the instantaneous motion of the current frame, such as its violently moving regions. Besides, it can also be affected by subtle camera motion.

Dynamic image stream

This stream is an important component of our proposed work. We utilize the appearance and long-term dynamics encoded in dynamic images to model the correlation of space and time. In this stream, the dynamic images are treated as RGB images for training and testing, and they are fed forward to obtain class scores. We give an example in Fig. 4, where we can clearly observe that the background is removed and the motion pattern is revealed in the dynamic image.

Fig. 4 Generation of dynamic images. a, b, c: RGB frames during a period of time. d: The generated dynamic image. We can clearly observe that the background is removed and the motion pattern is shown in the dynamic image

In this section, we first give the formulation used to generate dynamic images. Then we present the derivation of the approximate rank pooling method, which offers a good balance between efficiency and accuracy. Finally, we describe the generation pipeline. Note that the generation process basically follows the original work [2].

The core idea of the dynamic image is to encode the order of frames into the video representation. The objective function for obtaining an optimal dynamic image d is presented below:

$$ E(d) = \frac{\lambda}{2} {\Vert d \Vert}^{2} + \frac{2}{T(T-1)} \times \sum\limits_{q>t} \max \{ 0, 1-S(q|d) + S(t|d)\}, $$
(4)

where the first term is the usual quadratic regularizer and the second term is a hinge-loss penalty for incorrectly ranked pairs. Here,

$$ q > t \Rightarrow S(q|d) > S(t|d), $$
(5)

and

$$ S(i|d) = \langle d, V_{i} \rangle, \quad i=q,t , $$
(6)

\(V_{i}\) is the representation at time i, which accumulates the information of all frames up to time i, and q, t are time steps. The derivation of approximate rank pooling is as follows:

$$\begin{array}{@{}rcl@{}} \nabla E(\mathbf{0}) &\propto& \sum\limits_{q>t} \nabla \max \{ 0, 1-S(q|d)+S(t|d)\}|_{d=\mathbf{0}} \\ &\propto& \sum\limits_{q>t} \nabla \max \{ 0, 1- \langle d, V_{q} \rangle + \langle d, V_{t} \rangle\}|_{d=\mathbf{0}} \\ &=& \sum\limits_{q>t} \nabla \langle d, V_{t} - V_{q} \rangle = \sum\limits_{q>t} V_{t} - V_{q}. \end{array} $$
(7)

Taking a single step from d = 0 in the direction of the negative gradient (i.e., \(d^{*} \propto -\nabla E(\mathbf{0})\)), (7) can be rewritten as (8),

$$ d^{*} \propto \sum\limits_{q>t} V_{q} -V_{t} = \sum\limits_{q>t} \left[ \frac{1}{q} \sum\limits_{i=1}^{q} \phi_{i} - \frac{1}{t}\sum\limits_{j=1}^{t} \phi_{j} \right] = \sum\limits_{t=1}^{T} \alpha_{t} \phi_{t}, $$
(8)

where \(\phi_{i}\) is the feature vector extracted from the i-th frame and \(V_{t}\) is the time-varying mean of the frame representations up to time t, i.e., \(V_{t} = \frac{1}{t}{\sum }_{i=1}^{t} \phi_{i}\). The final coefficient \(\alpha_{t}\) is given below:

$$ \alpha_{t} = 2(T-t+1) - (T+1)(H_{T} - H_{t-1}) , $$
(9)

where \(H_{t} = {\sum }_{i=1}^{t} 1/i\) and we set \(H_{0}=0\).

For the generation process, we first choose T consecutive frames and apply a non-linear transformation (e.g., the square root operation) to each frame. Then we use the coefficients calculated above to compute a weighted sum of the transformed frames, which yields the initial dynamic image. Finally, a min-max normalization is performed for each color channel and the channels are merged into the final dynamic image.
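The following NumPy sketch (our own illustration rather than the authors' code; the function name `dynamic_image` and the normalization details are assumptions) implements the coefficients of (9) and the generation pipeline described above:

```python
import numpy as np

def dynamic_image(frames):
    """Approximate rank pooling, (8)-(9): collapse T frames into one dynamic image.

    frames : array-like of T frames, each (H, W, 3), uint8 or float.
    Returns a uint8 dynamic image of the same spatial size.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]

    # Non-linear transform of each frame (square root, as suggested in the text)
    phi = np.sqrt(frames)

    # Harmonic numbers H_t = sum_{i=1}^{t} 1/i, with H_0 = 0
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))

    # Coefficients alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), for t = 1..T
    t = np.arange(1, T + 1)
    alpha = 2.0 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])

    # Weighted sum over time, then per-channel min-max normalization to [0, 255]
    d = np.tensordot(alpha, phi, axes=(0, 0))          # (H, W, 3)
    for c in range(d.shape[-1]):
        ch = d[..., c]
        lo, hi = ch.min(), ch.max()
        d[..., c] = 255.0 * (ch - lo) / (hi - lo + 1e-8)
    return d.astype(np.uint8)
```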

The combination of the spatial stream, the temporal stream and the dynamic image stream models the whole action better by capturing both appearance and motion information. The added dynamic image stream plays an important role as a richer feature representation that models appearance and long-term motion information simultaneously. It can alleviate the limitations of the spatial stream (which is trained only on single still images) and the temporal stream (which only captures relatively short-term dynamics).

3.2 Network training

In this paper, we adopt the Inception network [28] with Batch Normalization [10] as the building block. Each Inception unit has three convolution subunits and one pooling subunit, and the 5 × 5 filter is replaced with two 3 × 3 filters. Adding Batch Normalization accelerates convergence. To further improve the capability of modeling temporal information, we adopt the temporal segment strategy [34], dividing the whole input video into several clips and sparsely sampling from each clip at the training stage. The fusion of the three streams is performed at the softmax layer. The generated dynamic images are treated as RGB images for training and testing. To improve generalization, a dropout layer is added, with the dropout ratio set to 0.8 for the spatial stream network, 0.7 for the temporal stream network and 0.8 for the dynamic image stream network.
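As a rough illustration of the sparse segment sampling borrowed from [34] (a sketch of our own; the function and argument names are hypothetical and the exact sampling rule may differ from the original implementation):

```python
import numpy as np

def sample_segment_indices(num_frames, num_segments=3, training=True, rng=None):
    """Split the frame range into `num_segments` equal clips and pick one frame
    index per clip: a random one during training, the clip centre otherwise."""
    if rng is None:
        rng = np.random.default_rng()
    edges = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    indices = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        hi = max(hi, lo + 1)                       # guard against empty clips
        indices.append(int(rng.integers(lo, hi)) if training else (lo + hi) // 2)
    return indices

# e.g. a 150-frame video divided into 3 segments
print(sample_segment_indices(150, num_segments=3, training=True))
```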

3.3 Network testing

The network has three types of inputs: RGB images \(x_{a}\), which contain appearance information; stacks of optical flow fields \(x_{m}\), which contain motion information; and dynamic images \(x_{d}\), which contain both appearance and motion information. These three inputs go through their convolutional neural networks to obtain per-stream class scores. For each example \(x=\{x_{a},x_{m},x_{d}\}\) with label k∈{1,2,...,K}, each stream \(s \in \{a,m,d\}\) computes the class probability \(p(k|x_{s}) = {\exp (z_{k})} / {{\sum }_{i=1}^{K} \exp (z_{i})}\), where \(z_{i}\) are its unnormalized log probabilities. Then, based on these three sets of class scores, a weighted fusion is performed

$$ p(k|x) = w_{a} p(k|x_{a}) + w_{m} p(k|x_{m}) + w_{d} p(k|x_{d}) $$
(10)

to obtain the final class scores. The effect of fusing three streams differs from that of the original two-stream fusion, because the three-stream fusion can be seen as three pairwise combinations: spatial stream with dynamic image stream, temporal stream with dynamic image stream, and the original spatial stream with temporal stream. The fusion is illustrated in Fig. 5. In our view, this fusion subsumes three weighted two-stream combinations, and the experimental results show the superiority of our approach.
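A minimal NumPy sketch of this late fusion (our own helper names; the default weights below are placeholders, while the tuned weights are reported in Section 4):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(z_rgb, z_flow, z_dyn, w=(1.0, 1.0, 1.0)):
    """Weighted late fusion of the three streams, as in (10).

    z_* are the unnormalized class scores (logits) of each stream."""
    w_a, w_m, w_d = w
    return w_a * softmax(z_rgb) + w_m * softmax(z_flow) + w_d * softmax(z_dyn)

# prediction = argmax of the fused class probabilities (101 classes for UCF101)
scores = fuse_scores(np.random.randn(101), np.random.randn(101), np.random.randn(101))
print(int(scores.argmax()))
```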

Fig. 5 Original two-stream fusion vs. three-stream fusion: compared with two-stream fusion, our approach utilizes dynamic images to better explore spatio-temporal information

4 Experiments

In this section, we first introduce two large and popular action recognition datasets, UCF101 [27] and HMDB51 [16], in Section 4.1. The training and testing details for each stream are presented in Section 4.2. The effect of using a deeper network is evaluated in Section 4.3. Then the combination of dynamic images with single RGB frames and stacked optical flow is evaluated in Section 4.4; the results show that dynamic images capture not only appearance information but also motion information. In Section 4.5 we compare our proposed approach with the state of the art to show its superiority, and we also give some examples to show why our approach improves performance.

4.1 Datasets and evaluation protocol

In order to verify the effectiveness of the proposed method, extensive experiments are performed on the following challenging datasets, both of which are widely used in action recognition. We follow the standard setup and report the average accuracy over splits.

UCF101

The dataset contains 13320 videos annotated with 101 action categories, which cover sports, human motion, human-object interaction and so on. The videos are collected from YouTube, and within each category they are further divided into 25 groups; in each group the background or viewpoint is similar.

HMDB51

The dataset includes 6766 videos divided into 51 action categories. The videos are collected from movies, Google videos, YouTube and other sources. Due to the different sources, the backgrounds are cluttered and the viewpoints are variable.

We follow the evaluation protocol provided by the organizers of the datasets: each of them has three splits for training and testing, and the final accuracy is averaged across the splits. The details of the datasets are presented in Table 1, and the mean class accuracy is defined as below:

$$ P = \frac{1}{C} \sum\limits_{i=1}^{C} P_{i}, $$
(11)

where \(P_{i}\) is the accuracy of the i-th category and C is the total number of categories.
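For clarity, a small NumPy sketch of (11) (the helper name `mean_class_accuracy` is our own):

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, num_classes):
    """Mean per-class accuracy as in (11): average of P_i over the C categories."""
    per_class = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            per_class.append((y_pred[mask] == c).mean())
    return float(np.mean(per_class))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(mean_class_accuracy(y_true, y_pred, num_classes=3))   # (0.5 + 1.0 + 0.5) / 3
```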

Table 1 Details of datasets

4.2 Experiments settings

Training stage

In the spatial stream, the size of the frames extracted from the videos varies from video to video, so we follow the original two-stream work and rescale the frames so that their smallest side equals 256. Then a 224 × 224 region is randomly cropped. We also adopt random horizontal flipping and scale jittering. At the training stage, the learning rate starts from 0.001, is reduced by a factor of 10 every 1000 iterations, and training stops at 2500 iterations. We use an Inception network model pre-trained on ImageNet [5] to initialize the parameters of the spatial network. The batch size is 32 and the number of segments is 3.
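A rough torchvision approximation of this augmentation pipeline (a sketch of our own; the ImageNet normalization statistics are assumptions, and the scale jittering step mentioned in the text is omitted here):

```python
from torchvision import transforms

# Approximate spatial-stream training augmentation: rescale the smallest side
# to 256, randomly crop a 224x224 region, and randomly flip horizontally.
train_transform = transforms.Compose([
    transforms.Resize(256),                              # smallest side -> 256
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```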

In the temporal stream, we use a stack of 10 consecutive optical flow fields as input. The optical flow used in this paper is TVL1 [46] optical flow, which can be computed in OpenCV with GPU support. The optical flow fields are linearly re-scaled to the range [0,255] and stored as JPEG images rather than floating-point values, which significantly saves storage space. The learning rate starts from 0.005, is reduced by a factor of 10 at 12k and 18k iterations, and training stops at 20k iterations. The batch size is 30, and we use a modified Inception network model pre-trained on ImageNet to initialize the parameters of the temporal network. Specifically, the weights of the first layer of the pre-trained model are averaged over the channel dimension and then copied N times, where N denotes the number of input channels of the first layer of the temporal network.
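A PyTorch sketch of this cross-modality first-layer initialization (our own code; ResNet-18 is used only as a stand-in backbone since BN-Inception is not bundled with torchvision, the 20-channel input is an illustrative choice, and the weights API assumes a recent torchvision version):

```python
import torch
import torch.nn as nn
from torchvision import models

def adapt_first_conv(pretrained_conv: nn.Conv2d, in_channels: int) -> nn.Conv2d:
    """Adapt an RGB-pretrained first conv layer to N-channel optical flow input:
    average the pretrained weights over the RGB channel dimension and replicate
    the result `in_channels` times, as described in the text."""
    new_conv = nn.Conv2d(in_channels,
                         pretrained_conv.out_channels,
                         kernel_size=pretrained_conv.kernel_size,
                         stride=pretrained_conv.stride,
                         padding=pretrained_conv.padding,
                         bias=pretrained_conv.bias is not None)
    with torch.no_grad():
        mean_w = pretrained_conv.weight.mean(dim=1, keepdim=True)    # (out, 1, k, k)
        new_conv.weight.copy_(mean_w.repeat(1, in_channels, 1, 1))
        if pretrained_conv.bias is not None:
            new_conv.bias.copy_(pretrained_conv.bias)
    return new_conv

# Example: 20 input channels (e.g., x and y components of 10 stacked flow fields).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.conv1 = adapt_first_conv(backbone.conv1, in_channels=20)
```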

In the dynamic image stream, we generate the dynamic images following the approximate rank pooling procedure of the original Dynamic Image Networks. The window size used for extracting dynamic images is 20 for the UCF101 dataset and 10 for the HMDB51 dataset, with a stride of 1 for both datasets. These window size and stride settings are adopted from the Discriminative Hierarchical Dynamic Image Networks [9]. For the UCF101 dataset, the learning rate starts from 0.001, is reduced by a factor of 10 every 3000 iterations, and training stops at 6500 iterations. For the HMDB51 dataset we use the same learning rate, number of iterations and batch size as the spatial stream.
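The sliding-window extraction can be sketched as follows (reusing the hypothetical `dynamic_image` function from the sketch in Section 3.1; the helper name is ours):

```python
import numpy as np

def extract_dynamic_images(frames, window=20, stride=1):
    """Slide a temporal window over the video and pool each window into one
    dynamic image. window=20, stride=1 matches the UCF101 setting quoted in
    the text; HMDB51 uses window=10."""
    frames = np.asarray(frames)
    # dynamic_image(...) is the approximate rank pooling sketch from Section 3.1
    return [dynamic_image(frames[s:s + window])
            for s in range(0, len(frames) - window + 1, stride)]
```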

Testing stage

Notice that we treat dynamic images as RGB images for training and testing. At the testing stage, we extract 25 samples from each video at equal temporal intervals and average the scores of all 25 samples to compute the final class scores in the spatial and dynamic image streams. For the temporal stream, we also extract 25 samples, but each sample consists of 10 consecutive flow fields (the x and y components of 5 consecutive optical flow frames).
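A minimal sketch of this test-time sampling and score averaging (our own helper names, with placeholder per-sample scores):

```python
import numpy as np

def test_frame_indices(num_frames, num_samples=25):
    """Indices of `num_samples` frames taken at equal temporal intervals."""
    return np.linspace(0, num_frames - 1, num_samples).astype(int)

def video_score(sample_scores):
    """Average the per-sample class scores into a single video-level score."""
    return np.mean(sample_scores, axis=0)

# e.g. a 300-frame video: 25 evenly spaced samples, then average their scores
idx = test_frame_indices(300)
scores = np.stack([np.random.rand(101) for _ in idx])   # placeholder per-sample scores
print(video_score(scores).shape)                        # (101,)
```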

4.3 Exploration of deeper network for dynamic images

In the original Dynamic Image Networks [2], the researchers used CaffeNet [12], which has only five convolutional layers and three fully connected layers, and did not use any data augmentation, so the performance is relatively low. In the Discriminative Hierarchical Dynamic Image Networks [9], the researchers used the VGG16 [26] network to extract frame features and then performed hierarchical rank pooling to obtain higher-order dynamic images; however, the process of extracting dynamic images becomes much more complex due to the hierarchical operation. We use Inception with Batch Normalization as the building block due to its good balance between accuracy and efficiency. We observe that the result on UCF101 increases from 72.2 % to 83.3 % and on HMDB51 from 40.9 % to 53.6 %. Compared with vanilla rank pooling and hierarchical rank pooling, the experimental results show that using a deeper network with more data augmentation improves the performance significantly; the results are illustrated in Table 2. Table 3 shows the most significant differences between the original dynamic image network and our improved dynamic image network.

Table 2 Evaluation of using deeper network for dynamic images
Table 3 Comparison between original dynamic image network and our improved version

It should be noted that the results are compared without combining with hand-crafted features.

4.4 Effectiveness analysis

In this section, we combine the dynamic image stream with the spatial stream and the temporal stream respectively; the results are illustrated in Table 4. The performance decrease when each of the three streams is removed is presented in Table 5. The results show that motion information is crucial for action recognition, and that the proposed dynamic image stream effectively improves performance by capturing both appearance and motion information.

Table 4 Evaluation of combination between three streams
Table 5 Performance decrease of removing each of three streams

When we combine the spatial stream and the dynamic image stream as a two-stream model, the result increases both on the UCF101 dataset (3.0 % over the spatial stream and 4.4 % over the dynamic image stream) and on the HMDB51 dataset (5.9 % over the spatial stream and 5.3 % over the dynamic image stream). This proves that dynamic images capture motion information that improves on the spatial stream. When we combine the temporal stream with the dynamic image stream, we observe a similar improvement on the UCF101 dataset (2.7 % over the temporal stream and 8.9 % over the dynamic image stream) and on the HMDB51 dataset (4.4 % over the temporal stream and 12.9 % over the dynamic image stream), which shows that the dynamic image is complementary to optical flow. Note that when combining the spatial and dynamic image streams, we use weights of 1 for the spatial stream and 1 for the dynamic image stream on UCF101, and weights of 1 and 1.2 respectively on HMDB51. For the combination of the temporal stream and the dynamic image stream, we use equal weights on HMDB51 and a ratio of 1:0.7 on UCF101. The weight ratio is 1:1.5 when combining the spatial stream and the temporal stream on both datasets. From these weight ratios and the per-stream improvements, we observe that the dynamic image stream performs better on HMDB51, because the backgrounds of HMDB51 are more cluttered than those of UCF101.

4.5 Comparison with the state-of-the-arts

As observed in Section 4.4, dynamic images model not only motion information but also appearance information. In the original two-stream model, the spatial stream is trained only on single frames and the length L of the stacked optical flow is relatively small (e.g., 10). These two shortcomings degrade action recognition performance, so we combine the dynamic image stream with the original two streams to propose a novel hybrid spatio-temporal convolutional network. The results presented in Table 6 show that our approach outperforms the state-of-the-art method by 1.0 % on the HMDB51 dataset. The weights of the spatial, temporal and dynamic image streams are set to 0.8, 1.0 and 0.1 on the UCF101 dataset, and to 1.1, 2.0 and 0.7 on the HMDB51 dataset. We compare our approach with both hand-crafted-feature-based and deep-feature-based methods. Specifically, we choose improved Dense Trajectories (iDT) [31] and MoFAP [33] among hand-crafted features, and Two-Stream ConvNets [25], Convolutional Fusion of Two Streams [8], Temporal Segment Networks (TSN) [34], trajectory-pooled deep convolutional descriptors (TDD) [32], long-term convolution networks (LTC) [30], transformation CNN [35] and the key volume mining framework (KVMF) [49] among deep-learned methods.

Table 6 Comparison with the state-of-the-arts

Besides recognition accuracy, we want to gain further insight into why our approach improves performance. From Fig. 6, we observe that in the spatial stream the appearance of smoking is similar to the appearance of laughing. Since the weight of the spatial stream is relatively high, this similarity may cause the failure of the two-stream model. However, the dynamic image stream strengthens the motion information alongside the spatial stream, so the action is classified correctly by our method.

Fig. 6 In the first row, smoking is incorrectly classified as laughing by the two-stream model but correctly classified by our method. In the second row, punching is incorrectly classified as kicking by the two-stream model but correctly classified by our method. In the third row, standing is incorrectly classified as sitting by the two-stream model but correctly classified by our method. In the fourth row, hitting is incorrectly classified as fencing by the two-stream model but correctly classified by our method

5 Conclusion

In this paper, we first explore a deeper network for dynamic images and show that dynamic images capture not only appearance information but also motion information. Based on this, we propose a novel hybrid spatio-temporal convolutional network that combines a dynamic image stream with the original two streams to explore spatio-temporal information. The fusion of the three streams shows superiority over several state-of-the-art methods.