1 Introduction

With the explosive growth in the amount of video on the Internet, action recognition [1, 15, 27] has attracted increasing attention in recent years. It has potential applications in fields such as abnormal event detection [3], human-computer interaction [28], video retrieval [33, 45] and robot perception [6]. Numerous researchers are dedicated to this area to deal with challenges such as occlusions, low resolution, background clutter and camera motion.

Feature extraction is the fundamental and critical step in the framework of image and video analysis [21, 23, 24, 26, 32, 42, 46, 47]. Video representations are usually motivated by image features. Compared with images, videos carry additional temporal information, and how to exploit the motion information contained in the time domain is the core issue when designing or learning video representations. There are two main types of features for video description, hand-crafted descriptors and learning-based descriptors, which are reviewed comprehensively in Section 2.

Hand-crafted and learning-based descriptors have their own advantages, which are complementary to each other: the design of hand-crafted descriptors reflects the researcher's observation of visual data and they are easy to explain, while learning-based descriptors usually have higher discriminative capacity but are hard to interpret. How to combine the benefits of these two kinds of features to design good descriptors has been an active research area. On the one hand, the experience gained in designing hand-crafted features can be used to guide the design of deep neural networks. For instance, 3D ConvNets [16, 38] can be seen as borrowing ideas from HOG3D [18] or 3D SIFT [30]. On the other hand, some techniques from hand-crafted descriptors are used to post-process deep descriptors. In [41], the trajectory-pooled deep-convolutional descriptor (TDD) united the dense trajectories of iDT descriptors with two-stream ConvNets [32] and achieved good performance. Motivated by TDD, in this paper we focus on integrating the trajectory pooling method with C3D descriptors and present a novel multi-scale trajectory-pooled 3D convolutional descriptor (MTC3D) for action recognition, as shown in Fig. 1. Specifically, multi-scale dense trajectories and the C3D feature maps of the conv4b and conv5b layers are first computed from the input videos. Then we conduct max pooling on the conv4b and conv5b feature maps of C3D to shrink their temporal dimensions to one. In this way, a 16-frame-long trajectory is mapped to 16 points on one corresponding pooled feature map. After two types of normalization, we perform trajectory pooling on the normalized feature maps and obtain the proposed MTC3D descriptors.

Fig. 1

The process of extracting MTC3D. The MTC3D framework contains three steps: extracting C3D feature maps and dense trajectories from raw videos, and conducting trajectory-constrained pooling on the extracted feature maps. Finally, MTC3Ds with C × K dimensions are obtained, where C is the number of channels of the feature maps and K is the number of trajectories of the input video

The proposed descriptors have two merits: 3D ConvNets extract discriminative and shift-invariant features from videos, while the trajectory pooling method captures the temporal information contained in the multi-scale trajectories of objects. Note that MTC3D differs from the TDD pipeline. In TDD, the feature maps of two-stream ConvNets have the same number of frames as the input videos and trajectory pooling is conducted on different frames of the feature maps, whereas in the proposed MTC3D we pool points on feature maps whose temporal dimensions have been reduced to one. After the MTC3Ds are obtained, we employ the Fisher vector to encode them and feed the encoding results into a linear SVM classifier. We evaluate the performance of MTC3D on two challenging action datasets: HMDB51 [20] and UCF101 [34]. MTC3D alone achieves 56.0% and 86.6% on HMDB51 and UCF101, and outperforms C3D (one net) by 4.3% on UCF101 with the same pre-trained model. When combined with iDT, MTC3D obtains accuracies of 65.0% and 90.4% on HMDB51 and UCF101.

A preliminary version of the trajectory-pooled 3D convolutional descriptor (TC3D) was first introduced in our previous work [25]. In this paper we make the following three improvements: (1) we add Section 2 to review hand-crafted and learning-based descriptors in detail; (2) we integrate multi-scale motion information into TC3D and put forward MTC3D; (3) we conduct more comparative experiments, carry out error analysis of the results, and discuss the advantages and disadvantages of MTC3D.

The remainder of this paper is organized as follows. In Section 2, we give an introduction to hand-crafted and learning-based descriptors. In Section 3, the proposed multi-scale trajectory-pooled 3D convolutional descriptor is introduced in detail. We report the experimental results on the HMDB51 and UCF101 datasets in Section 4. Finally, the paper is concluded in Section 5.

2 Related work

There are two main types of features for video description: hand-crafted descriptors and learning-based descriptors, as presented below.

Hand-crafted descriptors

Hand-crafted descriptors include Histograms of Oriented Gradients (HOG) [4], Histograms of Optical Flow (HOF) [21], Motion Boundary Histograms (MBH) [5], HOG3D [18], 3D SIFT [30], and so on. The process of extracting hand-crafted descriptors can mainly be divided into two steps: first, interest points are detected using an interest-point detector [13] or dense sampling; then the local information (e.g., pixel values, optical flow) or its gradient in the neighborhood of the detected points is aggregated into a histogram. HOG [4], HOF [21] and MBH [5] describe image gradients, optical flow and motion boundaries (i.e., gradients of optical flow) respectively. HOG3D [18] and 3D SIFT [30] imitate the process of HOG [4] and SIFT [24] and compute histograms of 3D spatio-temporal gradients. The most successful hand-crafted descriptor for action recognition so far is Improved Dense Trajectories (iDT) [39]. In essence, it extracts special spatio-temporal interest areas using dense trajectories, over which HOG, HOF, and MBH descriptors are calculated. iDT outperforms other hand-crafted descriptors on almost all public action datasets. After obtaining hand-crafted descriptors, feature encoding methods such as the bag-of-words (BoW) model [10], sparse coding [9, 48] or the Fisher vector [29] are applied to learn higher-level features and enhance the recognition results. These hand-crafted descriptors embody researchers' observation and experience and thus have achieved great success in this area. However, designing a good hand-crafted feature is difficult and time-consuming and relies on expert knowledge. Moreover, hand-crafted descriptors are usually only applicable to certain applications and do not generalize well.

Learning-based descriptors

For learning-based descriptors, a transformation from the raw input to the representation is learned by machine learning methods. At first, shallow learning techniques were used in this domain. In [22], a stacked convolutional Independent Subspace Analysis network was proposed to learn invariant spatio-temporal features from videos. With the development of deep learning, convolutional neural networks (CNNs) [19], which are inspired by the behavior of the animal visual cortex, have proved to be an effective feature learning technique in action recognition [17, 32]. CNNs can be used to process videos in an end-to-end way or to provide deep features as the input of feature encoding approaches and classifiers. Some deep descriptors (e.g., Deep ConvNets [17]) learn high-level features from raw videos directly by 2D convolutions. Convolutional 3D descriptors (C3D) [38] employ 3D convolution and 3D pooling operations to better model the temporal information of videos and give superior results. Two-stream ConvNets [32] use two networks to handle the spatial and temporal information separately; the inputs of the spatial and temporal streams are RGB frames and optical flow fields respectively. Driven by its success in speech translation [12] and machine translation [36], Long Short-Term Memory (LSTM) [14], a special type of recurrent neural network, has recently been applied to model video sequences. In [8], Donahue et al. utilized LSTM to learn long-term dependencies in videos and developed Long-term Recurrent Convolutional Networks for three vision tasks (i.e., activity recognition, image description, and video description). Ng et al. [44] compared different convolutional temporal feature pooling architectures and LSTM to explore a better way of aggregating features in the time domain. In [35], the LSTM encoder-decoder framework was used to learn video representations in an unsupervised way. Sharma et al. [31] merged a soft-attention-based model into a multi-layered LSTM, which learned to focus on the spatial areas in each frame that were relevant to the recognition task. These deep descriptors work well due to the high discriminative capacity and good generalization ability of deep neural networks. But, in a sense, deep learning techniques are a black box and the features they learn are not easy to interpret.

3 Multi-scale trajectory-pooled 3D convolutional descriptors

In this section, we elaborate on the new multi-scale trajectory-pooled 3D convolutional descriptor for video representation, as shown in Fig. 1. We first explain the extraction of dense trajectories and 3D convolutional feature maps from raw videos. Then, the feature map normalization and trajectory pooling steps are described in detail. We finally introduce the multi-scale strategy we use.

3.1 Dense trajectories

We adopt improved trajectories [39], originally used to compute the iDT descriptor, to extract dense trajectories due to their good performance. Improved trajectories are a modified version of dense trajectories [40]. In dense trajectories, feature points are first sampled on a grid spaced by 5 pixels. Then each point is tracked by median filtering in a dense optical flow field \(w_{t}=(u_{t},v_{t})\):

$$P_{t+1}=(x_{t+1},y_{t+1})=(x_{t},y_{t})+(M*w)|_{(\overline{x}_{t},\overline{y}_{t})} \tag{1}$$

where \(P_{t}=(x_{t},y_{t})\) represents the feature point at frame t, M is the kernel for median filtering, and \((\overline{x}_{t},\overline{y}_{t})\) is the rounded position of \((x_{t},y_{t})\). After the dense optical flow field is calculated, points of adjacent frames are linked to form the trajectories. To avoid the drifting problem, the length of a trajectory is limited to 15 frames in [39]. Static trajectories and trajectories with sudden large displacements are also removed to make the obtained dense trajectories more robust.
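To make the tracking step of (1) concrete, the following is a minimal numpy/scipy sketch of advancing one point through a dense flow field. It is not the authors' implementation (which builds on the released dense-trajectory code); the 3 × 3 median kernel and the constant toy flow are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import median_filter

def track_point(point, flow, kernel_size=3):
    """Advance one trajectory point by Eq. (1): median-filter the dense
    optical flow and sample it at the rounded point position.
    `flow` has shape (H, W, 2) holding (u, v) per pixel."""
    x, y = point
    fx = median_filter(flow[..., 0], size=kernel_size)  # M * w, horizontal
    fy = median_filter(flow[..., 1], size=kernel_size)  # M * w, vertical
    xi, yi = int(round(x)), int(round(y))               # rounded position
    return x + fx[yi, xi], y + fy[yi, xi]               # P_{t+1}

# toy usage: a constant flow of (1, 0.5) pixels per frame
flow = np.zeros((240, 320, 2), dtype=np.float32)
flow[..., 0], flow[..., 1] = 1.0, 0.5
trajectory = [(50.0, 60.0)]
for _ in range(15):                                     # a 16-point trajectory
    trajectory.append(track_point(trajectory[-1], flow))
```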

Compared with dense trajectories, improved trajectories take camera motion into account to enhance performance. Camera motion is estimated under the assumption that two adjacent frames are related by a homography [37]. To compute the homography matrix, two complementary methods (i.e., SURF descriptors [2] and dense optical flow) are combined to find matches between the two frames. Afterward, the RANSAC approach [11] is applied to estimate the homography. Eventually, the camera motion is removed to obtain a better optical flow that focuses on foreground moving objects. In this way, trajectories generated by background camera motion exhibit small displacements and are removed by thresholding. In the proposed descriptor, the length of a trajectory is set to 16 frames to match the temporal length of the input clips of C3D. Given a video V, we obtain the dense trajectories

$$T(V)=\{T_{1},T_{2},\cdots,T_{K}\} \tag{2}$$

where \(T_{k}\) represents the k-th trajectory of the video V:

$$T_{k}=\left\{\left({h^{k}_{1}},{w^{k}_{1}},{d^{k}_{1}}\right),\left({h^{k}_{2}},{w^{k}_{2}},{d^{k}_{2}}\right),\cdots,\left({h^{k}_{P}},{w^{k}_{P}},{d^{k}_{P}}\right)\right\} \tag{3}$$

where \(({h^{k}_{p}},{w^{k}_{p}},{d^{k}_{p}})\) denotes the p-th point in trajectory \(T_{k}\) and P represents the length of a trajectory.

3.2 Convolutional feature maps

We employ 3D ConvNets [16, 38] to learn features from videos in MTC3D. 3D convolution and 3D pooling operations are adopted in 3D ConvNets. 3D convolution is the natural extension of 2D convolution. Both can take multi-dimensional inputs; the difference lies in the outputs. The output of a 2D convolution is a two-dimensional feature map, whether its input has two or more dimensions, as shown in Fig. 2a and b. In contrast, the output volume of a 3D convolution keeps its temporal dimension, as illustrated in Fig. 2c. In other words, 3D convolution preserves the temporal information of the input videos. Hence, we can utilize multiple 3D convolutional layers to handle the spatial and temporal information of the inputs simultaneously in a hierarchical way.

Fig. 2

2D and 3D convolution. a 2D convolution on two-dimensional input. b 2D convolution on multidimensional input. c 3D convolution on multidimensional input. The outputs of 2D convolution are always two-dimensional feature maps, while 3D convolution has multidimensional outputs
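To make this distinction concrete, here is a deliberately naive single-channel 3D convolution in Python; the 'valid' boundary handling and the random data are assumptions for illustration only, not part of C3D.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution of a single-channel volume (H x W x D)
    with an s x s x t kernel. Unlike 2D convolution over a stack of frames,
    the output keeps a (shrunken) temporal axis."""
    H, W, D = volume.shape
    s1, s2, t = kernel.shape
    out = np.zeros((H - s1 + 1, W - s2 + 1, D - t + 1), dtype=volume.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+s1, j:j+s2, k:k+t] * kernel)
    return out

clip = np.random.rand(112, 112, 16).astype(np.float32)   # one channel of a clip
out = conv3d_valid(clip, np.random.rand(3, 3, 3).astype(np.float32))
print(out.shape)   # (110, 110, 14): the temporal dimension is preserved
```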

The architecture of C3D is illustrated in Tables 1 and 2. 3D convolution and pooling kernels with a size of S × S × T are used, where S and T represent the spatial and temporal size of the kernels. The C3D net has 8 convolutional layers with 3 × 3 × 3 filters and stride 1 × 1 × 1. The kernel size of the pool1 layer is 2 × 2 × 1, with stride 2 × 2 × 1. The other 4 max-pooling layers have 2 × 2 × 2 pooling kernels, with stride 2 × 2 × 2. In our experiments the C3D net is used as a convolutional feature extractor rather than in an end-to-end way. Specifically, we compute the feature maps of the conv4b and conv5b layers from the input videos, and the fully-connected layers are discarded.

Table 1 The convolutional layers of the C3D Architecture
Table 2 The pooling layers of the C3D Architecture
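For reference, the convolutional trunk described above can be sketched in PyTorch as follows. The channel widths (64, 128, 256, 256, 512, 512, 512, 512) follow the original C3D paper [38] and are an assumption here, since Tables 1 and 2 are not reproduced; the fully-connected layers are omitted because only the conv4b/conv5b maps are used.

```python
import torch
import torch.nn as nn

def conv3x3x3(cin, cout):
    # 3x3x3 convolution with stride 1 and padding 1, followed by ReLU
    return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                         nn.ReLU(inplace=True))

# pool1 keeps the temporal size (kernel/stride 1 in time), the rest halve it;
# PyTorch orders kernel dimensions as (T, H, W)
c3d_trunk = nn.Sequential(
    conv3x3x3(3, 64),    nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),  # pool1
    conv3x3x3(64, 128),  nn.MaxPool3d(2, 2),                         # pool2
    conv3x3x3(128, 256), conv3x3x3(256, 256), nn.MaxPool3d(2, 2),    # pool3
    conv3x3x3(256, 512), conv3x3x3(512, 512), nn.MaxPool3d(2, 2),    # conv4a/b, pool4
    conv3x3x3(512, 512), conv3x3x3(512, 512),                        # conv5a/b
)

clip = torch.randn(1, 3, 16, 112, 112)      # (batch, channels, T, H, W)
print(c3d_trunk(clip).shape)                # torch.Size([1, 512, 2, 7, 7])
```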

We denote the size of the inputs or feature maps by H × W × D × C, where H and W are the height and width in the spatial dimensions, D is the depth in the temporal dimension, and C is the number of channels. The size of the input clips of the C3D net is then 112 × 112 × 16 × 3. The conv4b and conv5b feature maps have sizes of 14 × 14 × 4 × 512 and 7 × 7 × 2 × 512 respectively. We then conduct a max-pooling operation to reduce the temporal size of the conv4b and conv5b feature maps to one. Finally, given a clip V, the representation \(F_{v} \in \mathbb{R}^{{H} \times {W} \times {C}}\) is obtained, where H and W are 7 or 14 and C is 512.
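A minimal sketch of this temporal max pooling is given below, assuming the pooled map is laid out as H × W × D × C (the actual memory layout depends on the C3D implementation used).

```python
import numpy as np

def collapse_temporal(feature_map):
    """Max-pool a C3D feature map over its temporal axis:
    (H, W, D, C) -> (H, W, C)."""
    return feature_map.max(axis=2)

# hypothetical conv5b output for one 16-frame clip: 7 x 7 x 2 x 512
conv5b = np.random.rand(7, 7, 2, 512).astype(np.float32)
F_v = collapse_temporal(conv5b)      # shape (7, 7, 512)
```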

3.3 Feature map normalization and trajectory pooling

Given the representation \(F_{v}\), two types of normalization (not shown in Fig. 1) are adopted, as in TDD. The first is spatiotemporal normalization. The output of a convolutional layer for each channel can be viewed as a spatiotemporal block. Spatiotemporal normalization divides the feature map values by the maximum value of the spatiotemporal block of each channel:

$$\widetilde{F}_{st}(h,w,c)=F(h,w,c)/\max_{h,w}F(h,w,c) \tag{4}$$

The second normalization method is channel normalization, in which the feature map values are divided by the maximum value at the same spatio-temporal position across channels:

$$\widetilde{F}_{ch}(h,w,c)=F(h,w,c)/\max_{c}F(h,w,c) \tag{5}$$

After normalization, the values of the points on the feature maps are mapped into the same interval. In the experiments, these two normalization approaches are used separately and their results \(\widetilde{F}_{st}(h,w,c)\) and \(\widetilde{F}_{ch}(h,w,c)\) are fused to further enhance the performance.
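The two normalizations of (4) and (5) reduce to simple broadcasting over the pooled map, as sketched below; the small epsilon guarding against division by zero is an assumption, not part of the paper.

```python
import numpy as np

def spatiotemporal_norm(F, eps=1e-12):
    """Eq. (4): divide each channel by its own maximum over (h, w)."""
    return F / (F.max(axis=(0, 1), keepdims=True) + eps)

def channel_norm(F, eps=1e-12):
    """Eq. (5): divide each position by its maximum across channels."""
    return F / (F.max(axis=2, keepdims=True) + eps)

F = np.random.rand(14, 14, 512).astype(np.float32)   # pooled conv4b map
F_st, F_ch = spatiotemporal_norm(F), channel_norm(F)
```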

In the C3D net, spatial and temporal padding are applied in the convolutional layers so that their inputs and outputs have the same size. The effect of the padding is that it creates a mapping between points in the videos and points on the feature maps. For example, the point with coordinate (h,w,d) in clip V corresponds to the point with coordinate (r × h, r × w) on the obtained representation \(F_{v}\), where r is the spatial map size ratio calculated in advance, as shown in Tables 1 and 2. In this way, the points on the trajectories are mapped directly to points on the current representations when conducting trajectory pooling.

Given a normalized feature map \(\widetilde {F}\) and a trajectory T k , trajectory pooling is carried out as follows:

$$D(T_{k},\widetilde{F})=\max\limits_{p}{\widetilde{F}\left(\overline{\left(r \times {h^{k}_{p}}\right)},\overline{\left(r \times {w^{k}_{p}}\right)},c\right)} \tag{6}$$

where r is the spatial map size ratio, \(\left(r \times {h^{k}_{p}}, r \times {w^{k}_{p}}\right)\) is mapped from the corresponding p-th point \(\left({h^{k}_{p}}, {w^{k}_{p}}, {d^{k}_{p}}\right)\) of the original video in trajectory \(T_{k}\), and \(\overline{(\cdot)}\) is the rounding operation. \(D(T_{k}, \widetilde{F}) \in \mathbb{R}^{{C} \times {K}}\) is the designed trajectory-pooled 3D convolutional descriptor (TC3D), where C is the number of channels and K is the number of trajectories.
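A sketch of (6) over a set of trajectories is given below. The clipping of the rounded coordinates to the map boundary is an assumption added for robustness, and the toy trajectory and the map ratio (7/112 for conv5b) are illustrative only.

```python
import numpy as np

def trajectory_pool(F_norm, trajectories, ratio):
    """Eq. (6): for each trajectory, max-pool the normalized map F_norm
    (shape H x W x C) over the trajectory's points, scaled by `ratio`."""
    H, W, C = F_norm.shape
    descs = np.empty((len(trajectories), C), dtype=F_norm.dtype)
    for k, traj in enumerate(trajectories):            # traj: list of (h, w, d)
        hs = np.clip(np.round([p[0] * ratio for p in traj]).astype(int), 0, H - 1)
        ws = np.clip(np.round([p[1] * ratio for p in traj]).astype(int), 0, W - 1)
        descs[k] = F_norm[hs, ws, :].max(axis=0)        # max over the P points
    return descs                                        # K x C (TC3D)

# toy usage with a hypothetical map and the ratio for conv5b (7 / 112 = 1/16)
F_norm = np.random.rand(7, 7, 512).astype(np.float32)
trajs = [[(float(t * 7), float(t * 7), t) for t in range(16)]]
tc3d = trajectory_pool(F_norm, trajs, ratio=1.0 / 16)
```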

3.4 Multi-scale extension

Above we introduced the process of extracting TC3D at a single scale. Following the idea of iDT, we compute trajectories at multiple scales and put forward the multi-scale extension of TC3D, that is, the proposed multi-scale trajectory-pooled 3D convolutional descriptor (MTC3D). Specifically, we first densely sample feature points at 8 spatial scales related by a factor of \(1/\sqrt{2}\). The feature points are then tracked at each scale over 16 frames. Hence, given a video V, we acquire the multi-scale dense trajectories

$$\widehat{T}(V)=\{T_{1},T_{2},\cdots,T_{K_{1}},\cdots,T_{1},T_{2},\cdots,T_{K_{M}}\} \tag{7}$$

where \(\{T_{1},T_{2},\cdots,T_{K_{m}}\}\) represents the trajectories computed at the m-th scale, and \(K_{m}\) is their number. The proposed multi-scale trajectory-pooled 3D convolutional descriptor is then \(\widehat{D}(T_{k}, \widetilde{F}) \in \mathbb{R}^{{C} \times {\widehat{K}}}\), where \(\widehat{K}={\sum}_{m=1}^{M}{K_{m}}\).
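As a small sketch under the above notation, the per-scale descriptor sets are simply stacked; the scale factors and the toy trajectory counts below are assumptions for illustration.

```python
import numpy as np

# 8 spatial scales related by a factor of 1/sqrt(2), as in iDT
scales = (1.0 / np.sqrt(2)) ** np.arange(8)

def multi_scale_tc3d(per_scale_descriptors):
    """Stack the K_m x C descriptor matrices from the M scales into a single
    K_hat x C matrix (the MTC3D set), with K_hat = sum_m K_m."""
    return np.concatenate(per_scale_descriptors, axis=0)

# hypothetical example: three scales yielding 40, 25 and 10 trajectories
mtc3d = multi_scale_tc3d([np.random.rand(k, 512) for k in (40, 25, 10)])
print(mtc3d.shape)    # (75, 512)
```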

4 Experiments

In this section, we test the proposed TC3D and MTC3D on two public datasets: HMDB51 [20] and UCF101 [34]. We first introduce the datasets and the implementation details. Afterward, the exploration experiments and the comparisons with other methods are presented in turn. Finally, we conduct an error analysis and discuss the merits and demerits of the proposed descriptors.

4.1 Datasets

Two challenging action datasets are employed in our experiments: HMDB51 and UCF101, as shown in Fig. 3. The HMDB51 dataset has 6766 video clips taken from movies, YouTube, Google videos, etc. It contains five types of actions, from general facial actions to body movements for human interaction, and covers 51 action categories. We use the three training/testing splits and report the average accuracy over the splits as in [20]. The UCF101 dataset consists of realistic videos collected from YouTube. It includes 13320 video sequences and has 101 action classes. Each class has 25 groups, and the sequences in the same group may share common characteristics (e.g., a similar background). We use the three training/testing splits as in [34] and also report the average accuracy.

Fig. 3

Sample frames from HMDB51 (first row) and UCF101 datasets (second row)

4.2 Implementation details

In the experiments, the feature maps of the conv4b and conv5b layers of the C3D net are extracted, whose sizes are 14 × 14 × 4 × 512 and 7 × 7 × 2 × 512. When computing feature maps from the HMDB51 and UCF101 datasets, we employ a C3D model pre-trained on Sports-1M and released by Tran et al. [38]. After the max pooling operation in the temporal dimension, a representation of size 7 × 7 × 512 or 14 × 14 × 512 is obtained. We apply spatiotemporal and channel normalization to the representation, and the two normalized representations and the original representation are used to compute different MTC3Ds, which are fused to boost the experimental results. We extract 16-frame-long dense trajectories from the videos because the input of the C3D net is a 16-frame clip. We then obtain MTC3Ds of size 512 × K by trajectory pooling, where K is the number of trajectories in the video. Next, PCA is used to reduce the MTC3Ds to 128 dimensions to cut down the time and space overhead. After the MTC3Ds are extracted from the videos, the Fisher vector [29] is applied to encode them. We first build a dictionary of visual words with a GMM of G = 256 mixtures. The MTC3Ds of each video are then encoded against the learned visual words, yielding a vector with 2 × 128 × 256 dimensions. Finally, a linear SVM is employed as the classifier.
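The encoding stage can be sketched as follows. This is a generic improved-Fisher-vector implementation for a diagonal-covariance GMM, not the authors' code; the power and L2 normalization, the epsilon, and the SVM regularization constant C = 100 are assumptions following common practice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_vector(X, gmm):
    """Encode local descriptors X (N x d) as a Fisher vector (first- and
    second-order statistics) under a fitted diagonal-covariance GMM."""
    N, _ = X.shape
    q = gmm.predict_proba(X)                                  # N x G posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (X[:, None, :] - mu[None]) / np.sqrt(var)[None]    # N x G x d
    d_mu = (q[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    d_var = (q[..., None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([d_mu.ravel(), d_var.ravel()])        # 2 * G * d dims
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalization

# toy check: 500 random 128-D descriptors, 8-component GMM -> 2048-D vector;
# in the paper's setting G = 256 and d = 128, giving 2 x 128 x 256 dimensions
X = np.random.rand(500, 128)
fv = fisher_vector(X, GaussianMixture(8, covariance_type='diag').fit(X))

# pipeline components as described above (fit on training data in practice)
pca = PCA(n_components=128)                 # reduce MTC3Ds to 128 dimensions
clf = LinearSVC(C=100)                      # linear SVM classifier
```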

4.3 Exploration experiments

In this section, TC3D is used to explore the impact of different settings in the steps of the proposed pipeline, owing to its lower time and space costs compared with MTC3D. We first evaluate the performance of the sum pooling and max pooling methods in the trajectory pooling step on the three splits of HMDB51. TC3D with conv5 features and spatiotemporal normalization is used in these experiments and the results are summarized in Table 3. The average accuracy of max pooling is 0.4 higher than that of sum pooling, so max trajectory pooling is chosen in the proposed descriptor.

Table 3 The performance of different trajectory pooling methods (with conv5 features and spatiotemporal normalization) on three splits of the HMDB51 dataset

We employ TC3D with conv5 features and spatiotemporal normalization and investigate the impact of different PCA dimensions in Fig. 4a. Dimension 128 gives the best performance. Thus, TC3Ds and MTC3Ds are reduced to 128 dimensions before being fed into the Fisher vector in all experiments. In Fig. 4b, we use TC3D with conv5 features and show the average accuracy of the different normalization methods. St_Norm and Cha_Norm denote spatiotemporal normalization and channel normalization respectively, and No_Norm stands for the original representation without normalization. Their combination is 3.1% better than No_Norm, which demonstrates the effectiveness of the normalization methods.

Fig. 4

The recognition results of different PCA dimensions and normalization methods with conv5 features on HMDB51. Dimension 128 and the fusion of the normalization methods obtain the best results respectively

Table 4 reports the recognition results of TC3D with different convolutional layers. We see that the combination of conv4 and conv5 improves the average accuracy, which indicates that TC3Ds from different layers are complementary to each other. Table 5 shows the average accuracy of TC3D and MTC3D. MTC3D computes multi-scale dense trajectories, captures richer motion information, and outperforms TC3D on both datasets.

Table 4 The average accuracy of TC3D with different convolutional layers on HMDB51 and UCF101
Table 5 The average accuracy of TC3D and MTC3D on HMDB51 and UCF101

4.4 Comparison to the state of the art

We compare the proposed TC3D and MTC3D with other algorithms and summarize the action recognition accuracies in Table 6. The upper part shows recognition methods whose inputs are only RGB videos. The lower part presents algorithms that take both RGB frames and precomputed optical flow fields as inputs. We observe that TC3D and MTC3D, combined with the Fisher vector and a linear SVM, perform much better than the HOG descriptor and other deep neural networks based on RGB videos, including Deep networks [17], the Spatial stream network [32], LRCN [8] and the LSTM composite model [35]. TC3D and MTC3D also outperform C3D [38] and the conv4 and conv5 spatial layers of TDD [41]. MTC3D and C3D (1 net) use the same pre-trained model in all experiments, and MTC3D performs 4.3% better than C3D (1 net) on UCF101. The results indicate that the trajectory pooling method captures the inherent nature of the temporal dimension and promotes the recognition accuracy. When united with iDT descriptors, MTC3D performs better than other deep learning methods whose inputs are RGB frames and optical flow fields and achieves state-of-the-art results. The confusion matrices of the recognition results using TC3D on Split 1 of HMDB51 and UCF101 are shown in Fig. 5 to give an intuitive view.

Table 6 Action recognition results on HMDB51 and UCF101
Fig. 5

The confusion matrices of the recognition results of TC3D on Split1 of HMDB51 and UCF101 datasets

4.5 Error analysis

Some misclassified samples from the HMDB51 dataset are displayed in Fig. 6. We identify four main reasons for the misclassifications. The first is that a video may contain multiple actions, which is an inherent problem of the action recognition task. Two example videos are shown on the top left; in the first one a girl is brushing her hair and laughing at the same time. It is hard to classify such a video correctly, even for human labelers. The second reason is camera motion, as shown on the top right. Camera motions (e.g., pan, tilt and zoom) produce background motion and also interfere with the foreground motion, which degrades the recognition results. Shot changes can roughly be placed in this category as well; they sometimes result in unpredictable classification outputs.

Fig. 6

Misclassified samples of the HMDB51 dataset. The first line under the examples shows the true labels, and the second line shows the predicted labels. There are four causes of the misclassifications: multiple actions (top left), camera motion (top right), motion similarity (bottom left), and appearance similarity (bottom right)

Motion similarity is the third reason, as shown on the bottom left of Fig. 6. Some actions share similar body motions. For example, both swinging a baseball bat and throwing may contain the motions of holding an object over the head and throwing it out. These motions generate similar motion-based features, which makes classification extremely difficult. The last reason is appearance similarity, shown on the bottom right. Two actions can have the same scene, background or objects, which leads to similar appearance-based features. For instance, both dribbling and shooting a ball occur on the basketball court and involve the basketball and the basketball stand.

Camera motion increases the intra-class variation, while motion and appearance similarity reduce the inter-class distance; the first reason, multiple actions in one video, plays both roles. The four reasons mentioned above pose great difficulties and challenges for the action recognition task. From the misclassified examples, we can see that the mistakes are reasonable and that the proposed descriptor indeed "understands" the video samples. These recognition errors also indicate potential directions for further improving the discriminative ability and robustness of descriptors.

4.6 Discussion

The proposed MTC3D extracts discriminative deep features from the inputs and meanwhile captures the temporal information of videos through the trajectory pooling method. Furthermore, compared with TDD [41], there is no need to train a temporal network on optical flow frames when extracting MTC3D.

However, MTC3D performs worse than some recent works, such as TSN [43] and TLE [7]. A primary reason is that new techniques are used in these works. For example, the main idea of TSN is to divide the input video into several segments, process them with separate spatial and temporal stream ConvNets, and fuse the class scores of the segments to obtain a video-level prediction. TLE follows the idea of TSN and additionally adds a temporal encoding layer. These techniques (i.e., segmenting videos and adding a feature encoding layer to the network) could also be incorporated into MTC3D to further improve the recognition accuracy. In this paper, we focus on integrating the trajectory pooling method with C3D descriptors and thus do not use these techniques. Another reason is that we use the C3D model pre-trained on Sports-1M directly, owing to the limits of our computing power and storage capacity. For example, the spatial sizes of the conv4b and conv5b feature maps in the C3D net are only 14 × 14 and 7 × 7, which affects the performance of MTC3D. Adopting new techniques in our pipeline and training new 3D ConvNets will thus be our future work.

5 Conclusion and future work

In this paper, we combine C3D with dense trajectories and present a new multi-scale trajectory-pooled 3D convolutional descriptor for action recognition. We take advantage of both 3D ConvNets, which extract high-level features from videos, and the trajectory pooling strategy, which exploits important motion information. Experiments validate the superior performance of the proposed descriptor on two challenging datasets. Based on the discussion above, in the future we will design our own 3D ConvNets, modeled on the C3D net, that is better suited to the trajectory pooling method, and add a feature encoding layer to the network.