1 Introduction

With the explosive growth in the amount of video on the Internet, action recognition [1, 15, 27] has attracted increasing attention in recent years. It has potential applications in fields such as abnormal event detection [3], human-computer interaction [28], video retrieval [33, 45] and robot perception [6]. Numerous researchers are dedicated to this area to deal with challenges such as occlusions, low resolution, background clutter and camera motion.

Feature extraction is the fundamental and critical step in the framework of image and video analysis [21, 23, 24, 26, 32, 42, 46, 47]. Video representations are usually motivated by image features. Compared with images, videos carry additional temporal information, and how to exploit the motion information contained in the time domain is the core issue when designing or learning video representations. There are two main types of features for video description, hand-crafted descriptors and learning-based descriptors, which are reviewed comprehensively in Section 2.

Hand-crafted and learning-based descriptors have their own advantages, which are complementary to each other: the design of hand-crafted descriptors reflects the researcher's observation of visual data and they are easy to explain, while learning-based descriptors usually have higher discriminative capacity but are hard to interpret. How to combine the benefits of these two kinds of features to design good descriptors has been an active research area. On the one hand, the experience gained in designing hand-crafted features can be used to guide the design of deep neural networks. For instance, 3D ConvNets [16, 38] can be seen as borrowing ideas from HOG3D [18] or 3D SIFT [30]. On the other hand, some techniques from hand-crafted descriptors are used to post-process deep descriptors. In [41], the trajectory-pooled deep-convolutional descriptor (TDD) united the dense trajectories of iDT descriptors with two-stream ConvNets [32] and achieved good performance. Motivated by TDD, in this paper we focus on integrating the trajectory pooling method with C3D descriptors and present a novel multi-scale trajectory-pooled 3D convolutional descriptor (MTC3D) for action recognition, as shown in Fig. 1. Specifically, multi-scale dense trajectories and the C3D feature maps of the conv4b and conv5b layers are first computed from the input videos. Then we conduct max pooling on the conv4b and conv5b feature maps of C3D to shrink their temporal dimensions to one. In this way, a 16-frame-long trajectory is mapped to 16 points on one corresponding pooled feature map. After two types of normalization, we perform trajectory pooling on the normalized feature maps and obtain the proposed MTC3D descriptors.

Fig. 1

The process of extracting MTC3D. The MTC3D framework contains three steps: extracting C3D feature maps and dense trajectories from raw videos, and conducting trajectory-constrained pooling on the extracted feature maps. Finally, MTC3Ds with C × K dimensions are obtained, where C is the number of channels of the feature maps and K is the number of trajectories of the input video

The proposed descriptors have two merits: 3D ConvNets extract discriminative and shift-invariant features from videos, while the trajectory pooling method captures the temporal information contained in the multi-scale trajectories of objects. Note that MTC3D differs from the TDD pipeline. In TDD, the feature maps of two-stream ConvNets have the same number of frames as the input videos and trajectory pooling is conducted on different frames of the feature maps, whereas in the proposed MTC3D we pool points on feature maps whose temporal dimensions have been reduced to one. After the MTC3Ds are obtained, we employ the Fisher vector to encode them and feed the encoding results into a linear SVM classifier. We evaluate the performance of MTC3D on two challenging action datasets: HMDB51 [20] and UCF101 [34]. MTC3D alone achieves 56.0% and 86.6% on HMDB51 and UCF101, and outperforms C3D (one net) by 4.3% on UCF101 with the same pre-trained model. When combined with iDT, MTC3D obtains accuracies of 65.0% and 90.4% on HMDB51 and UCF101.

A preliminary version of the trajectory-pooled 3D convolutional descriptor (TC3D) was first introduced in our previous work [25]. In this paper we make the following three improvements: (1) we add Section 2 to review hand-crafted and learning-based descriptors in detail; (2) we integrate multi-scale motion information into TC3D and put forward MTC3D; (3) we conduct more comparative experiments, carry out error analysis of the results, and discuss the advantages and disadvantages of MTC3D.

The remainder of this paper is organized as follows. In Section 2, we give an introduction to hand-crafted and learning-based descriptors. In Section 3, the proposed multi-scale trajectory-pooled 3D convolutional descriptor is introduced in detail. We report the experimental results on the HMDB51 and UCF101 datasets in Section 4. Finally, the paper is concluded in Section 5.

2 Related work

There are two main types of features for video description: hand-crafted descriptors and learning-based descriptors, as presented below.

Hand-crafted descriptors

Hand-crafted descriptors include Histograms of Oriented Gradients (HOG) [4], Histograms of Optical Flow (HOF) [21], Motion Boundary Histograms (MBH) [5], HOG3D [18], 3D SIFT [30], and so on. The process of extracting hand-crafted descriptors can mainly be divided into two steps: first, interest points are detected using an interest-point detector [13] or dense sampling; then the local information (e.g., pixel values, optical flow) or its gradient in the neighborhood of the detected points is aggregated into a histogram. HOG [4], HOF [21] and MBH [5] describe image gradients, optical flow and motion boundaries (i.e., gradients of optical flow) respectively. HOG3D [18] and 3D SIFT [30] imitate the process of HOG [4] and SIFT [24] and compute histograms of 3D spatio-temporal gradients. The most successful hand-crafted descriptor for action recognition so far is Improved Dense Trajectories (iDT) [39]. In essence, it extracts special spatio-temporal interest areas using dense trajectories, over which HOG, HOF, and MBH descriptors are calculated. iDT outperforms other hand-crafted descriptors on almost all public action datasets. After obtaining hand-crafted descriptors, feature encoding methods such as the bag-of-words (BoW) model [10], sparse coding [9, 48] or the Fisher vector [29] are applied to learn higher-level features and enhance the recognition results. These hand-crafted descriptors embody researchers' observation and experience and thus have achieved great success in this area. However, designing a good hand-crafted feature is difficult and time-consuming and relies on expert knowledge. Moreover, hand-crafted descriptors are usually only applicable to certain applications and do not generalize well.

Learning-based descriptors

For learning-based descriptors, a transformation from the raw input to the representation is learned by machine learning methods. At first, shallow learning techniques were used in this domain. In [22], a stacked convolutional Independent Subspace Analysis network was proposed to learn invariant spatio-temporal features from videos. With the development of deep learning, convolutional neural networks (CNNs) [19], which are inspired by the behavior of the animal visual cortex, have proved to be an effective feature learning technique in action recognition [17, 32]. CNNs can be used to process videos in an end-to-end way or to provide deep features as the input of feature encoding approaches and classifiers. Some deep descriptors (e.g., Deep ConvNets [17]) learn high-level features from raw videos directly by 2D convolutions. Convolutional 3D descriptors (C3D) [38] employ 3D convolution and 3D pooling operations to better model the temporal information of videos and give superior results. Two-stream ConvNets [32] use two networks to handle the spatial and temporal information separately; the inputs of the spatial and temporal streams are RGB frames and optical flow fields respectively. Driven by its success in speech translation [12] and machine translation [36], Long Short-Term Memory (LSTM) [14], a special type of recurrent neural network, has recently been applied to model video sequences. In [8], Donahue et al. utilized LSTM to learn long-term dependencies in videos and developed Long-term Recurrent Convolutional Networks for three vision tasks (i.e., activity recognition, image description, and video description). Ng et al. [44] compared different convolutional temporal feature pooling architectures and LSTM to explore a better way of aggregating features in the time domain. In [35], the LSTM encoder-decoder framework was used to learn video representations in an unsupervised way. Sharma et al. [31] merged a soft-attention-based model into a multi-layered LSTM, which learned to focus on the spatial areas in each frame that were relevant to the recognition task. These deep descriptors work well due to the high discriminative capacity and good generalization ability of deep neural networks. But, in a sense, deep learning techniques are a black box and the features they learn are not easy to interpret.

3 Multi-scale trajectory-pooled 3D convolutional descriptors

In this section, we elaborate on the new multi-scale trajectory-pooled 3D convolutional descriptor for video representation, as shown in Fig. 1. We first explain the extraction of dense trajectories and 3D convolutional feature maps from raw videos. Then, the feature map normalization and trajectory pooling steps are described in detail. We finally introduce the multi-scale strategy we use.

3.1 Dense trajectories

We adopt improved trajectories [39], originally used to compute the iDT descriptor, to extract dense trajectories due to their good performance. Improved trajectories are a modified version of dense trajectories [40]. In dense trajectories, feature points are first sampled on a grid spaced by 5 pixels. Then each point is tracked by median filtering in a dense optical flow field \(w_{t}=(u_{t},v_{t})\):

$$P_{t+1}=(x_{t+1},y_{t+1})=(x_{t},y_{t})+(M*w)|_{(\overline{x}_{t},\overline{y}_{t})} \tag{1}$$

where \(P_{t}=(x_{t},y_{t})\) represents the feature point at frame t, M is the kernel for median filtering, and \((\overline{x}_{t},\overline{y}_{t})\) is the rounded position of \((x_{t},y_{t})\). After the dense optical flow field is calculated, points of adjacent frames are linked to form the trajectories. To avoid the drifting problem, the length of a trajectory is limited to 15 frames in [39]. Static trajectories and trajectories with sudden large displacements are also removed to make the obtained dense trajectories more robust.
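To make the tracking step of (1) concrete, the following is a minimal numpy/scipy sketch of advancing one point through a dense flow field. It is not the authors' implementation (which builds on the released dense-trajectory code); the 3 × 3 median kernel and the constant toy flow are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import median_filter

def track_point(point, flow, kernel_size=3):
    """Advance one trajectory point by Eq. (1): median-filter the dense
    optical flow and sample it at the rounded point position.
    `flow` has shape (H, W, 2) holding (u, v) per pixel."""
    x, y = point
    fx = median_filter(flow[..., 0], size=kernel_size)  # M * w, horizontal
    fy = median_filter(flow[..., 1], size=kernel_size)  # M * w, vertical
    xi, yi = int(round(x)), int(round(y))               # rounded position
    return x + fx[yi, xi], y + fy[yi, xi]               # P_{t+1}

# toy usage: a constant flow of (1, 0.5) pixels per frame
flow = np.zeros((240, 320, 2), dtype=np.float32)
flow[..., 0], flow[..., 1] = 1.0, 0.5
trajectory = [(50.0, 60.0)]
for _ in range(15):                                     # a 16-point trajectory
    trajectory.append(track_point(trajectory[-1], flow))
```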

Compared with dense trajectories, improved trajectories take camera motion into account to enhance performance. Camera motion is estimated under the assumption that two adjacent frames are related by a homography [37]. To compute the homography matrix, two complementary methods (i.e., SURF descriptors [2] and dense optical flow) are combined to find matches between the two frames. Afterward, the RANSAC approach [11] is applied to estimate the homography. Eventually, the camera motion is removed to obtain a better optical flow that focuses on foreground moving objects. In this way, trajectories generated by background camera motion exhibit small displacements and are removed by thresholding. In the proposed descriptor, the length of a trajectory is set to 16 frames to match the temporal length of the input clips of C3D. Given a video V, we obtain the dense trajectories

$$T(V)=\{T_{1},T_{2},\cdots,T_{K}\} \tag{2}$$

where \(T_{k}\) represents the k-th trajectory of the video V:

$$T_{k}=\left\{\left({h^{k}_{1}},{w^{k}_{1}},{d^{k}_{1}}\right),\left({h^{k}_{2}},{w^{k}_{2}},{d^{k}_{2}}\right),\cdots,\left({h^{k}_{P}},{w^{k}_{P}},{d^{k}_{P}}\right)\right\} \tag{3}$$

where \(({h^{k}_{p}},{w^{k}_{p}},{d^{k}_{p}})\) denotes the p-th point in trajectory \(T_{k}\) and P represents the length of a trajectory.

3.2 Convolutional feature maps

We employ 3D ConvNets [16, 38] to learn features from videos in MTC3D. 3D convolution and 3D pooling operations are adopted in 3D ConvNets. 3D convolution is the natural extension of 2D convolution. Both can take multi-dimensional inputs; the difference lies in the outputs. The output of a 2D convolution is a two-dimensional feature map, whether its input has two or more dimensions, as shown in Fig. 2a and b. In contrast, the output volume of a 3D convolution keeps its temporal dimension, as illustrated in Fig. 2c. In other words, 3D convolution preserves the temporal information of the input videos. Hence, we can utilize multiple 3D convolutional layers to handle the spatial and temporal information of the inputs simultaneously in a hierarchical way.

Fig. 2

2D and 3D convolution. a 2D convolution on two-dimensional input. b 2D convolution on multidimensional input. c 3D convolution on multidimensional input. The outputs of 2D convolution are always two-dimensional feature maps, while 3D convolution has multidimensional outputs
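To make this distinction concrete, here is a deliberately naive single-channel 3D convolution in Python; the 'valid' boundary handling and the random data are assumptions for illustration only, not part of C3D.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution of a single-channel volume (H x W x D)
    with an s x s x t kernel. Unlike 2D convolution over a stack of frames,
    the output keeps a (shrunken) temporal axis."""
    H, W, D = volume.shape
    s1, s2, t = kernel.shape
    out = np.zeros((H - s1 + 1, W - s2 + 1, D - t + 1), dtype=volume.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+s1, j:j+s2, k:k+t] * kernel)
    return out

clip = np.random.rand(112, 112, 16).astype(np.float32)   # one channel of a clip
out = conv3d_valid(clip, np.random.rand(3, 3, 3).astype(np.float32))
print(out.shape)   # (110, 110, 14): the temporal dimension is preserved
```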

The architecture of C3D is illustrated in Tables 1 and 2. 3D convolution and pooling kernels with a size of S × S × T are used, where S and T represent the spatial and temporal size of the kernels. The C3D net has 8 convolutional layers with 3 × 3 × 3 filters and stride 1 × 1 × 1. The kernel size of the pool1 layer is 2 × 2 × 1, with stride 2 × 2 × 1. The other 4 max-pooling layers have 2 × 2 × 2 pooling kernels, with stride 2 × 2 × 2. In our experiments the C3D net is used as a convolutional feature extractor rather than in an end-to-end way. Specifically, we compute the feature maps of the conv4b and conv5b layers from the input videos, and the fully-connected layers are discarded.

Table 1 The convolutional layers of the C3D Architecture
Table 2 The pooling layers of the C3D Architecture
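For reference, the convolutional trunk described above can be sketched in PyTorch as follows. The channel widths (64, 128, 256, 256, 512, 512, 512, 512) follow the original C3D paper [38] and are an assumption here, since Tables 1 and 2 are not reproduced; the fully-connected layers are omitted because only the conv4b/conv5b maps are used.

```python
import torch
import torch.nn as nn

def conv3x3x3(cin, cout):
    # 3x3x3 convolution with stride 1 and padding 1, followed by ReLU
    return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                         nn.ReLU(inplace=True))

# pool1 keeps the temporal size (kernel/stride 1 in time), the rest halve it;
# PyTorch orders kernel dimensions as (T, H, W)
c3d_trunk = nn.Sequential(
    conv3x3x3(3, 64),    nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),  # pool1
    conv3x3x3(64, 128),  nn.MaxPool3d(2, 2),                         # pool2
    conv3x3x3(128, 256), conv3x3x3(256, 256), nn.MaxPool3d(2, 2),    # pool3
    conv3x3x3(256, 512), conv3x3x3(512, 512), nn.MaxPool3d(2, 2),    # conv4a/b, pool4
    conv3x3x3(512, 512), conv3x3x3(512, 512),                        # conv5a/b
)

clip = torch.randn(1, 3, 16, 112, 112)      # (batch, channels, T, H, W)
print(c3d_trunk(clip).shape)                # torch.Size([1, 512, 2, 7, 7])
```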

We denote the size of the inputs or feature maps by H × W × D × C, where H and W are the height and width in the spatial dimensions, D is the depth in the temporal dimension, and C is the number of channels. The size of the input clips of the C3D net is then 112 × 112 × 16 × 3. The conv4b and conv5b feature maps have sizes of 14 × 14 × 4 × 512 and 7 × 7 × 2 × 512 respectively. We then conduct a max-pooling operation to reduce the temporal size of the conv4b and conv5b feature maps to one. Finally, given a clip V, the representation \(F_{v} \in \mathbb{R}^{{H} \times {W} \times {C}}\) is obtained, where H and W are 7 or 14 and C is 512.
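A minimal sketch of this temporal max pooling is given below, assuming the pooled map is laid out as H × W × D × C (the actual memory layout depends on the C3D implementation used).

```python
import numpy as np

def collapse_temporal(feature_map):
    """Max-pool a C3D feature map over its temporal axis:
    (H, W, D, C) -> (H, W, C)."""
    return feature_map.max(axis=2)

# hypothetical conv5b output for one 16-frame clip: 7 x 7 x 2 x 512
conv5b = np.random.rand(7, 7, 2, 512).astype(np.float32)
F_v = collapse_temporal(conv5b)      # shape (7, 7, 512)
```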

3.3 Feature map normalization and trajectory pooling

Given the representation \(F_{v}\), two types of normalization (not shown in Fig. 1) are adopted, as in TDD. The first is spatiotemporal normalization. The output of a convolutional layer for each channel can be viewed as a spatiotemporal block. Spatiotemporal normalization divides the feature map values by the maximum value of the spatiotemporal block of each channel:

$$\widetilde{F}_{st}(h,w,c)=F(h,w,c)/\max_{h,w}F(h,w,c) \tag{4}$$

The second normalization method is channel normalization, in which the feature map values are divided by the maximum value at the same spatio-temporal position across channels:

$$\widetilde{F}_{ch}(h,w,c)=F(h,w,c)/\max_{c}F(h,w,c) \tag{5}$$

After normalization, the values of the points on the feature maps are mapped into the same interval. In the experiments, these two normalization approaches are used separately and their results \(\widetilde{F}_{st}(h,w,c)\) and \(\widetilde{F}_{ch}(h,w,c)\) are fused to further enhance the performance.
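The two normalizations of (4) and (5) reduce to simple broadcasting over the pooled map, as sketched below; the small epsilon guarding against division by zero is an assumption, not part of the paper.

```python
import numpy as np

def spatiotemporal_norm(F, eps=1e-12):
    """Eq. (4): divide each channel by its own maximum over (h, w)."""
    return F / (F.max(axis=(0, 1), keepdims=True) + eps)

def channel_norm(F, eps=1e-12):
    """Eq. (5): divide each position by its maximum across channels."""
    return F / (F.max(axis=2, keepdims=True) + eps)

F = np.random.rand(14, 14, 512).astype(np.float32)   # pooled conv4b map
F_st, F_ch = spatiotemporal_norm(F), channel_norm(F)
```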

In the C3D net, spatial and temporal padding are applied in the convolutional layers so that their inputs and outputs have the same size. The effect of the padding is that it creates a mapping between points in the videos and points on the feature maps. For example, the point with coordinate (h,w,d) in clip V corresponds to the point with coordinate (r × h, r × w) on the obtained representation \(F_{v}\), where r is the spatial map size ratio calculated in advance, as shown in Tables 1 and 2. In this way, the points on the trajectories are mapped directly to points on the current representations when conducting trajectory pooling.

Given a normalized feature map \(\widetilde {F}\) and a trajectory T k , trajectory pooling is carried out as follows:

$$D(T_{k},\widetilde{F})=\max\limits_{p}{\widetilde{F}\left(\overline{\left(r \times {h^{k}_{p}}\right)},\overline{\left(r \times {w^{k}_{p}}\right)},c\right)} \tag{6}$$

where r is the spatial map size ratio, \(\left(r \times {h^{k}_{p}}, r \times {w^{k}_{p}}\right)\) is mapped from the corresponding p-th point \(\left({h^{k}_{p}}, {w^{k}_{p}}, {d^{k}_{p}}\right)\) of the original video in trajectory \(T_{k}\), and \(\overline{(\cdot)}\) is the rounding operation. \(D(T_{k}, \widetilde{F}) \in \mathbb{R}^{{C} \times {K}}\) is the designed trajectory-pooled 3D convolutional descriptor (TC3D), where C is the number of channels and K is the number of trajectories.
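A sketch of (6) over a set of trajectories is given below. The clipping of the rounded coordinates to the map boundary is an assumption added for robustness, and the toy trajectory and the map ratio (7/112 for conv5b) are illustrative only.

```python
import numpy as np

def trajectory_pool(F_norm, trajectories, ratio):
    """Eq. (6): for each trajectory, max-pool the normalized map F_norm
    (shape H x W x C) over the trajectory's points, scaled by `ratio`."""
    H, W, C = F_norm.shape
    descs = np.empty((len(trajectories), C), dtype=F_norm.dtype)
    for k, traj in enumerate(trajectories):            # traj: list of (h, w, d)
        hs = np.clip(np.round([p[0] * ratio for p in traj]).astype(int), 0, H - 1)
        ws = np.clip(np.round([p[1] * ratio for p in traj]).astype(int), 0, W - 1)
        descs[k] = F_norm[hs, ws, :].max(axis=0)        # max over the P points
    return descs                                        # K x C (TC3D)

# toy usage with a hypothetical map and the ratio for conv5b (7 / 112 = 1/16)
F_norm = np.random.rand(7, 7, 512).astype(np.float32)
trajs = [[(float(t * 7), float(t * 7), t) for t in range(16)]]
tc3d = trajectory_pool(F_norm, trajs, ratio=1.0 / 16)
```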

3.4 Multi-scale extension

Above we introduced the process of extracting TC3D at a single scale. Following the idea of iDT, we compute trajectories at multiple scales and put forward the multi-scale extension of TC3D, that is, the proposed multi-scale trajectory-pooled 3D convolutional descriptor (MTC3D). Specifically, we first densely sample feature points at 8 spatial scales related by a factor of \(1/\sqrt{2}\). The feature points are then tracked at each scale over 16 frames. Hence, given a video V, we acquire the multi-scale dense trajectories

$$\widehat{T}(V)=\{T_{1},T_{2},\cdots,T_{K_{1}},\cdots,T_{1},T_{2},\cdots,T_{K_{M}}\} \tag{7}$$

where \(\{T_{1},T_{2},\cdots,T_{K_{m}}\}\) represents the trajectories computed at the m-th scale, and \(K_{m}\) is their number. The proposed multi-scale trajectory-pooled 3D convolutional descriptor is then \(\widehat{D}(T_{k}, \widetilde{F}) \in \mathbb{R}^{{C} \times {\widehat{K}}}\), where \(\widehat{K}={\sum}_{m=1}^{M}{K_{m}}\).
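As a small sketch under the above notation, the per-scale descriptor sets are simply stacked; the scale factors and the toy trajectory counts below are assumptions for illustration.

```python
import numpy as np

# 8 spatial scales related by a factor of 1/sqrt(2), as in iDT
scales = (1.0 / np.sqrt(2)) ** np.arange(8)

def multi_scale_tc3d(per_scale_descriptors):
    """Stack the K_m x C descriptor matrices from the M scales into a single
    K_hat x C matrix (the MTC3D set), with K_hat = sum_m K_m."""
    return np.concatenate(per_scale_descriptors, axis=0)

# hypothetical example: three scales yielding 40, 25 and 10 trajectories
mtc3d = multi_scale_tc3d([np.random.rand(k, 512) for k in (40, 25, 10)])
print(mtc3d.shape)    # (75, 512)
```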

4 Experiments

In this section, we test the proposed TC3D and MTC3D on two public datasets: HMDB51 [20] and UCF101 [34]. We first introduce the datasets and the implementation details. Afterward, the exploration experiments and the comparisons with other methods are presented in turn. Finally, we conduct an error analysis and discuss the merits and demerits of the proposed descriptors.

4.1 Datasets

Two challenging action datasets are employed in our experiments: HMDB51 and UCF101, as shown in Fig. 3. The HMDB51 dataset has 6766 video clips taken from movies, YouTube, Google videos, etc. It contains five types of actions, from general facial actions to body movements for human interaction, and covers 51 action categories. We use the three training/testing splits and report the average accuracy over the splits as in [20]. The UCF101 dataset consists of realistic videos collected from YouTube. It includes 13320 video sequences and has 101 action classes. Each class has 25 groups, and the sequences in the same group may share common characteristics (e.g., a similar background). We use the three training/testing splits as in [34] and also report the average accuracy.

Fig. 3

Sample frames from HMDB51 (first row) and UCF101 datasets (second row)

4.2 Implementation details

In the experiments, the feature maps of the conv4b and conv5b layers of the C3D net are extracted, whose sizes are 14 × 14 × 4 × 512 and 7 × 7 × 2 × 512. When computing feature maps from the HMDB51 and UCF101 datasets, we employ a C3D model pre-trained on Sports-1M and released by Tran et al. [38]. After the max pooling operation in the temporal dimension, a representation of size 7 × 7 × 512 or 14 × 14 × 512 is obtained. We apply spatiotemporal and channel normalization to the representation, and the two normalized representations and the original representation are used to compute different MTC3Ds, which are fused to boost the experimental results. We extract 16-frame-long dense trajectories from the videos because the input of the C3D net is a 16-frame clip. We then obtain MTC3Ds of size 512 × K by trajectory pooling, where K is the number of trajectories in the video. Next, PCA is used to reduce the MTC3Ds to 128 dimensions to cut down the time and space overhead. After the MTC3Ds are extracted from the videos, the Fisher vector [29] is applied to encode them. We first build a dictionary of visual words with a GMM of G = 256 mixtures. The MTC3Ds of each video are then encoded against the learned visual words, yielding a vector with 2 × 128 × 256 dimensions. Finally, a linear SVM is employed as the classifier.
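The encoding stage can be sketched as follows. This is a generic improved-Fisher-vector implementation for a diagonal-covariance GMM, not the authors' code; the power and L2 normalization, the epsilon, and the SVM regularization constant C = 100 are assumptions following common practice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_vector(X, gmm):
    """Encode local descriptors X (N x d) as a Fisher vector (first- and
    second-order statistics) under a fitted diagonal-covariance GMM."""
    N, _ = X.shape
    q = gmm.predict_proba(X)                                  # N x G posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (X[:, None, :] - mu[None]) / np.sqrt(var)[None]    # N x G x d
    d_mu = (q[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    d_var = (q[..., None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([d_mu.ravel(), d_var.ravel()])        # 2 * G * d dims
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalization

# toy check: 500 random 128-D descriptors, 8-component GMM -> 2048-D vector;
# in the paper's setting G = 256 and d = 128, giving 2 x 128 x 256 dimensions
X = np.random.rand(500, 128)
fv = fisher_vector(X, GaussianMixture(8, covariance_type='diag').fit(X))

# pipeline components as described above (fit on training data in practice)
pca = PCA(n_components=128)                 # reduce MTC3Ds to 128 dimensions
clf = LinearSVC(C=100)                      # linear SVM classifier
```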

4.3 Exploration experiments

In this section, TC3D is used to explore the impact of different settings in the steps of the proposed pipeline, owing to its lower time and space costs compared with MTC3D. We first evaluate the performance of the sum pooling and max pooling methods in the trajectory pooling step on the three splits of HMDB51. TC3D with conv5 features and spatiotemporal normalization is used in these experiments and the results are summarized in Table 3. The average accuracy of max pooling is 0.4 higher than that of sum pooling, so max trajectory pooling is chosen in the proposed descriptor.

Table 3 The performance of different trajectory pooling methods (with conv5 features and spatiotemporal normalization) on three splits of the HMDB51 dataset

We employ TC3D with conv5 features and spatiotemporal normalization and investigate the impact of different PCA dimensions in Fig. 4a. Dimension 128 gives the best performance. Thus, TC3Ds and MTC3Ds are reduced to 128 dimensions before being fed into the Fisher vector in all experiments. In Fig. 4b, we use TC3D with conv5 features and show the average accuracy of the different normalization methods. St_Norm and Cha_Norm denote spatiotemporal normalization and channel normalization respectively, and No_Norm stands for the original representation without normalization. Their combination is 3.1% better than No_Norm, which demonstrates the effectiveness of the normalization methods.

Fig. 4

The recognition results of different PCA dimensions and normalization methods with conv5 features on HMDB51. Dimension 128 and the fusion of the normalization methods obtain the best results respectively

Table 4 reports the recognition results of TC3D with different convolutional layers. We see that the combination of conv4 and conv5 improves the average accuracy, which indicates that TC3Ds from different layers are complementary to each other. Table 5 shows the average accuracy of TC3D and MTC3D. MTC3D computes multi-scale dense trajectories, captures richer motion information, and outperforms TC3D on both datasets.

Table 4 The average accuracy of TC3D with different convolutional layers on HMDB51 and UCF101
Table 5 The average accuracy of TC3D and MTC3D on HMDB51 and UCF101

4.4 Comparison to the state of the art

We compare the proposed TC3D and MTC3D with other algorithms and summarize the action recognition accuracies in Table 6. The upper part shows recognition methods whose inputs are only RGB videos. The lower part presents algorithms that take both RGB frames and precomputed optical flow fields as inputs. We observe that TC3D and MTC3D, combined with the Fisher vector and a linear SVM, perform much better than the HOG descriptor and other deep neural networks based on RGB videos, including Deep networks [17], the Spatial stream network [32], LRCN [8] and the LSTM composite model [35]. TC3D and MTC3D also outperform C3D [38] and the conv4 and conv5 spatial layers of TDD [41]. MTC3D and C3D (1 net) use the same pre-trained model in all experiments, and MTC3D performs 4.3% better than C3D (1 net) on UCF101. The results indicate that the trajectory pooling method captures the inherent nature of the temporal dimension and promotes the recognition accuracy. When united with iDT descriptors, MTC3D performs better than other deep learning methods whose inputs are RGB frames and optical flow fields and achieves state-of-the-art results. The confusion matrices of the recognition results using TC3D on Split 1 of HMDB51 and UCF101 are shown in Fig. 5 to give an intuitive view.

Table 6 Action recognition results on HMDB51 and UCF101
Fig. 5

The confusion matrices of the recognition results of TC3D on Split1 of HMDB51 and UCF101 datasets

4.5 Error analysis

Some misclassified samples from the HMDB51 dataset are displayed in Fig. 6. We identify four main reasons for the misclassifications. The first is that a video may contain multiple actions, which is an inherent problem of the action recognition task. Two example videos are shown on the top left; in the first one a girl is brushing her hair and laughing at the same time. It is hard to classify such a video correctly, even for human labelers. The second reason is camera motion, as shown on the top right. Camera motions (e.g., pan, tilt and zoom) produce background motion and also interfere with the foreground motion, which degrades the recognition results. Shot changes can roughly be placed in this category as well; they sometimes result in unpredictable classification outputs.

Fig. 6

Misclassified samples of the HMDB51 dataset. The first line under the examples shows the true labels, and the second line shows the predicted labels. There are four causes of the misclassifications: multiple actions (top left), camera motion (top right), motion similarity (bottom left), and appearance similarity (bottom right)

Motion similarity is the third reason, as shown on the bottom left of Fig. 6. Some actions share similar body motions. For example, both swinging a baseball bat and throwing may contain the motions of holding an object over the head and throwing it out. These motions generate similar motion-based features, which makes classification extremely difficult. The last reason is appearance similarity, shown on the bottom right. Two actions can have the same scene, background or objects, which leads to similar appearance-based features. For instance, both dribbling and shooting a ball occur on the basketball court and involve the basketball and the basketball stand.

Camera motion increases the intra-class variation, while motion and appearance similarity reduce the inter-class distance; the first reason, multiple actions in one video, plays both roles. The four reasons mentioned above pose great difficulties and challenges for the action recognition task. From the misclassified examples, we can see that the mistakes are reasonable and that the proposed descriptor indeed "understands" the video samples. These recognition errors also indicate potential directions for further improving the discriminative ability and robustness of descriptors.

4.6 Discussion

The proposed MTC3D extracts discriminative deep features from the inputs and meanwhile captures the temporal information of videos through the trajectory pooling method. Furthermore, compared with TDD [41], there is no need to train a temporal network on optical flow frames when extracting MTC3D.

However, MTC3D performs worse than some recent works, such as TSN [43] and TLE [7]. A primary reason is that new techniques are used in these works. For example, the main idea of TSN is to divide the input video into several segments, process them with separate spatial and temporal stream ConvNets, and fuse the class scores of the segments to obtain a video-level prediction. TLE follows the idea of TSN and additionally adds a temporal encoding layer. These techniques (i.e., segmenting videos and adding a feature encoding layer to the network) could also be incorporated into MTC3D to further improve the recognition accuracy. In this paper, we focus on integrating the trajectory pooling method with C3D descriptors and thus do not use these techniques. Another reason is that we use the C3D model pre-trained on Sports-1M directly, owing to the limits of our computing power and storage capacity. For example, the spatial sizes of the conv4b and conv5b feature maps in the C3D net are only 14 × 14 and 7 × 7, which affects the performance of MTC3D. Adopting new techniques in our pipeline and training new 3D ConvNets will thus be our future work.

5 Conclusion and future work

In this paper, we combine C3D with dense trajectories and present a new multi-scale trajectory-pooled 3D convolutional descriptor for action recognition. We take advantage of both 3D ConvNets, which extract high-level features from videos, and the trajectory pooling strategy, which exploits important motion information. Experiments validate the superior performance of the proposed descriptor on two challenging datasets. Based on the discussion above, in the future we will design our own 3D ConvNets, modeled on the C3D net, that is better suited to the trajectory pooling method, and add a feature encoding layer to the network.