1 Introduction

Human action recognition in videos has substantial academic value and broad application prospects, which have quickly made it a focus, and a challenge, of computer vision and artificial intelligence research, attracting great interest from researchers and institutions. However, action recognition remains challenging on real-world data obtained from web videos [1, 2], movies [3], etc. Extracting effective features is therefore crucial for action recognition.

1.1 Background and motivation

Extracting dynamic features is one of the important research directions for action recognition. Early works, including the spatiotemporal interest point (STIP) [4] and cuboids [5], usually adopt interest point detectors to capture pixels with salient changes of intensity or gradient in a spatiotemporal video volume, and then describe these interest points or small regions using statistics computed from neighboring pixels, so as to obtain the motion information of the action subject. Subsequently, some methods [6, 7] extend 2D image features to 3D features in videos to acquire spatiotemporal features for action recognition. In addition, quite a few studies show that the motion information of trajectories can achieve impressive performance, such as dense trajectories [8, 9] obtained by tracking densely sampled points using optical flow fields. The above interest-point-based methods have proven successful in the field of action recognition. However, they depend heavily on localized statistics within a small spatiotemporal neighborhood [5, 9] and cannot describe the global characteristics of motion as a whole. Some scholars [10, 11] have also built deep networks, such as two-stream convolutional networks [10] and 3D convolutional networks [11], to acquire spatiotemporal features for action recognition. However, these networks are not only difficult to train, but also unable to achieve performance equivalent to hand-crafted features.

As an important tool for describing the dynamic properties of videos, optical flow has been widely applied to action recognition. However, many features based on local optical flow simply summarize the flow with histograms of its orientations [12, 13], thus arguably ignoring other potentially discriminative properties. In fact, the optical flow can be regarded as a flow field, so some of its characteristics can be extracted using fluid dynamics theory. By exploring the dynamic characteristics of the optical flow field, optical flow can be described in a richer way that captures the physical characteristics of the flow pattern. However, such characteristics are rarely exploited in existing features for action recognition.

With the increasing number of action classes, motion features alone are not discriminative enough for dependable action recognition. In fact, the appearance information of the action scene and of discriminative objects in a video also plays a significant role. Recently, owing to its favorable learning and abstraction abilities, deep learning has come to dominate the image processing field and has been widely used in various applications [14,15,16]. For this reason, some scholars have attempted to extract important static features from images for action recognition by constructing deep networks. Wang et al. [17] first captured the spatial relationships and high-order correlations between parts, and then constructed a hierarchical spatial sum-product network (HS-SPN) to extract static deep features for action recognition. Kwak et al. [18] introduced triplet-based rank constraints into a deep convolutional network to automatically capture pose embedding information from still images for action recognition. Subsequently, Qi et al. [19] integrated pose hints into a convolutional neural network (CNN) framework by defining a joint loss function, thus obtaining static deep features containing pose information for action recognition. However, these methods feed the whole image into the deep network for feature extraction without focusing on the discriminative objects in the background.

To overcome this problem, some scholars attempt to extract features from the discriminative regions of video frames to improve recognition performance. Peng et al. [20] first divided the whole human body region [21] into multiple regions, and then used a deep CNN to extract features from individual discriminative regions for action recognition. Ni et al. [22] proposed a network composed of two connected deep convolutional neural networks (DCNNs): the first DCNN takes video frames as inputs and creates response maps indicating the locations of body parts; these maps are then fed into the second DCNN to learn discriminative and semantically aligned action representations of each body part for action recognition. However, these methods usually need to construct additional networks to obtain the discriminative regions, which makes training difficult and increases computational cost. Moreover, they assume that the discriminative information always lies in the regions around the human body, and therefore often focus on the human or its parts. In fact, some actions may be easier to distinguish using the appearance of the action scene, such as the ocean wave in a “surfing” action, while others require close attention to the discriminative object that interacts with the human, such as the bicycle in a “bike riding” action.

1.2 Overview of DKD–DAD

Motivated by the above methods, a framework with a discriminative kinematic descriptor and a deep attention-pooled descriptor (DKD–DAD) is proposed for action recognition, as shown in Fig. 1. Firstly, the optical flow field is transformed into a set of more discriminative kinematic fields to construct two kinematic features; a discriminative fusion method is then proposed, by which a discriminative kinematic descriptor is obtained to depict the dynamic characteristics of the action subject. Secondly, a prediction-attentional pooling method is proposed to automatically locate the discriminative local regions in a video frame, and a deep attention-pooled descriptor is presented to capture the discriminative static information in the action scene. Finally, the DKD–DAD framework combines the discriminative kinematic descriptor and the deep attention-pooled descriptor for action recognition, and accuracy is thereby improved.

Fig. 1

Overview of the proposed DKD–DAD framework for action recognition

1.3 Working flow of DKD–DAD

In this section, the overall working flow of DKD–DAD, illustrated in Fig. 2, is described; it includes two branches. Specifically, given an input action video, the left branch obtains the proposed discriminative kinematic descriptor, which depicts the dynamic information of the action video, while the right branch acquires the proposed deep attention-pooled descriptor, which describes the static information.

Fig. 2

Working flow of the proposed DKD–DAD framework for action recognition

The left branch includes the following steps. Firstly, the optical flow field is extracted, and the divergence and curl fields are calculated. The first-order divergence and curl fields are then acquired, and similarly the second-order divergence and curl fields are obtained. The multi-order divergence and curl fields are then jointly encoded to acquire the multi-order divergence and curl features, which are discriminatively fused to obtain the proposed discriminative kinematic descriptor.

The right branch proceeds as follows. Firstly, one frame is randomly picked as the key-frame and input into the Inception-v3 [23] network to acquire static deep features. The deep local and global features are then extracted from Inception-v3 and input into the multiple-channel attentions to obtain the local and global attentional heatmaps, which are fused using the predictions of the deep network as weight coefficients. The fused attentional heatmaps are used to conduct weighted pooling on the deep local and global features, respectively, yielding the proposed deep attention-pooled descriptor. Finally, the discriminative kinematic descriptor and the deep attention-pooled descriptor are concatenated as the feature representation of the action video and input into a support vector machine (SVM) classifier, which outputs the action label.

In summary, the major contributions of this paper are as follows: (1) Two kinematic features, multi-order divergence and multi-order curl, are constructed, which more accurately depict the dynamic characteristics of the action subject. (2) A novel fusion method is proposed, which ensures the discriminativeness of the two kinematic features; on this basis, a discriminative kinematic descriptor is constructed. (3) A prediction-attentional pooling method is presented, from which a deep attention-pooled descriptor is constructed. (4) A DKD–DAD framework for action recognition is proposed, which improves recognition accuracy. Experimental results demonstrate that the proposed methods provide promising performance compared with several state-of-the-art methods on two challenging datasets.

The rest of this paper is structured as follows. Section 2 constructs two kinematic features, and meanwhile presents the discriminative fusion method as well as discriminative kinematic descriptor. Section 3 proposes the prediction-attentional pooling method and deep attention-pooled descriptor. The related experiments and analysis of the proposed methods are shown in Sect. 4, followed by the conclusions with future work in Sect. 5.

2 Discriminative kinematic descriptor

Human action videos contain rich motion information that characterizes the intrinsic patterns of different actions. Most recent works use the optical flow field to describe motion information. However, optical flow only records the displacement vectors of pixels between two successive frames. By calculating the kinematics of the optical flow field, the physical properties of the flow pattern can be captured, which describe motion in a richer way and contain more details of the motion and its precise variations, such as local expansion, local spin, velocity and acceleration. Therefore, studying the flow field is equivalent to exploring the motion itself. In order to better obtain the dynamic characteristics of the action subject, this section first transforms the optical flow field into a set of more discriminative kinematic fields and then constructs two kinematic features. Finally, the discriminative fusion method is presented to obtain the proposed discriminative kinematic descriptor, as shown in Fig. 3.

Fig. 3

Schematic of the proposed discriminative kinematic descriptor

2.1 Construction of kinematic features

In this section, two kinematic features, namely the multi-order divergence feature and the multi-order curl feature, are constructed. To do this, the optical flow is computed first [24]. Specifically, given a video, any point \(q\) at time \(t\) is denoted as \(q_{t}\), and its optical flow vector is denoted as \(w\left( {q_{t} } \right) = \left( {u\left( {q_{t} } \right),v\left( {q_{t} } \right)} \right)\), where \(u\left( {q_{t} } \right)\) and \(v\left( {q_{t} } \right)\) are the horizontal and vertical components of \(w\left( {q_{t} } \right)\), respectively. By calculating the optical flow vector between adjacent frames at every pixel position, the optical flow field is acquired. A minimal sketch of this step is given below; the detailed construction process of the kinematic features then follows.
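The following sketch computes such a per-pixel flow field with OpenCV's Farnebäck algorithm. This is only a stand-in for the optical flow method of [24] used in the paper; the function name and parameter values are illustrative assumptions.

```python
import cv2
import numpy as np

def optical_flow_field(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Return w(q_t) = (u(q_t), v(q_t)) for every pixel as an array of shape (H, W, 2).

    Farneback flow is used purely as a stand-in for the flow method of [24].
    """
    # Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow  # flow[..., 0] = u(q_t), flow[..., 1] = v(q_t)
```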

1. Extraction of divergence and curl

The divergence and curl are both local first-order differential scalar quantities of the optical flow field. They describe the physical pattern of the flow from different perspectives and thus depict different characteristics of the motion in videos. In this paper, the divergence and curl of \(q_{t}\) are computed as follows:

$$div\left( {q_{t} } \right) = \frac{{\partial u\left( {q_{t} } \right)}}{\partial x} + \frac{{\partial v\left( {q_{t} } \right)}}{\partial y}$$
(1)
$$curl\left( {q_{t} } \right) = \frac{{\partial v\left( {q_{t} } \right)}}{\partial x} - \frac{{\partial u\left( {q_{t} } \right)}}{\partial y}$$
(2)

By calculating \(div\left( \cdot \right)\) and \(curl\left( \cdot \right)\) for each point in a video frame, the divergence field \(Field^{div}\) and curl field \(Field^{curl}\) corresponding to that frame are obtained. The physical meaning of \(Field^{div}\) is that it measures the amount of local expansion occurring in the optical flow field and can depict the regions of local expansion caused by the action subject. The physical meaning of \(Field^{curl}\) is that it delineates the local spin around the axis perpendicular to the plane of the optical flow field and can describe the dynamic characteristics resulting from human body motion.
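A minimal NumPy sketch of Eqs. (1) and (2) follows; the finite-difference discretization of the partial derivatives is an assumption, as the paper does not specify one.

```python
import numpy as np

def divergence_curl(flow: np.ndarray):
    """Compute Field^div and Field^curl (Eqs. 1-2) from a flow field of shape (H, W, 2)."""
    u, v = flow[..., 0], flow[..., 1]
    du_dy, du_dx = np.gradient(u)   # np.gradient returns derivatives along (rows = y, cols = x)
    dv_dy, dv_dx = np.gradient(v)
    div_field = du_dx + dv_dy       # Eq. (1): amount of local expansion
    curl_field = dv_dx - du_dy      # Eq. (2): amount of local spin
    return div_field, curl_field
```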

2. Acquisition of the first-order derivatives of divergence and curl

In fact, divergence and curl alone are not sufficient to describe the optical flow field generated by the action subject. Therefore, the first-order derivatives of divergence and curl are calculated to capture the precise variations of local expansion and local spin caused by the motion of the action subject. Given a spatiotemporal point \(q_{t}\), its first-order derivatives of divergence and curl along the \(x\), \(y\) and \(t\) directions are computed by Eqs. (3) and (4), respectively.

$$\left( {div_{x} \left( {q_{t} } \right),div_{y} \left( {q_{t} } \right),div_{t} \left( {q_{t} } \right)} \right)^{{T}} = \nabla div\left( {q_{t} } \right)$$
(3)
$$\left( {curl_{x} \left( {q_{t} } \right),curl_{y} \left( {q_{t} } \right),curl_{t} \left( {q_{t} } \right)} \right)^{{T}} = \nabla curl\left( {q_{t} } \right)$$
(4)

where \(\nabla = \left( {{\partial \mathord{\left/ {\vphantom {\partial {\partial x}}} \right. \kern-0pt} {\partial x}},{\partial \mathord{\left/ {\vphantom {\partial {\partial y}}} \right. \kern-0pt} {\partial y}},{\partial \mathord{\left/ {\vphantom {\partial {\partial t}}} \right. \kern-0pt} {\partial t}}} \right)^{{T}}\) represents the gradient operator.

By calculating \(div_{x} \left( \cdot \right)\), \(div_{y} \left( \cdot \right)\), \(curl_{x} \left( \cdot \right)\) and \(curl_{y} \left( \cdot \right)\) for each point \(q_{t}\) in a video frame, a set of first-order spatial kinematic fields is obtained, including the first-order spatial divergence fields \(Field^{{div_{x} }}\) and \(Field^{{div_{y} }}\) as well as the first-order spatial curl fields \(Field^{{curl_{x} }}\) and \(Field^{{curl_{y} }}\), which capture the relative motion between pixels along the \(x\) and \(y\) directions and meanwhile remove the camera motion. Similarly, by calculating \(div_{t} \left( \cdot \right)\) and \(curl_{t} \left( \cdot \right)\) for each point \(q_{t}\), a set of first-order temporal kinematic fields is obtained, including the first-order temporal divergence field \(Field^{{div_{t} }}\) and the first-order temporal curl field \(Field^{{curl_{t} }}\), which capture the velocities of divergence and curl and meanwhile directly remove the slowly changing background through the subtraction of two consecutive kinematic fields. A sketch of these derivative fields is given below.
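The sketch below again uses finite differences as an assumed discretization: stacking the per-frame divergence (or curl) fields over time and differentiating along each axis yields the first-order spatial and temporal kinematic fields, and applying the same function to its own outputs gives the second-order fields of Eqs. (5) and (6).

```python
import numpy as np

def first_order_kinematic_fields(fields: np.ndarray):
    """fields: per-frame divergence (or curl) fields stacked as an array of shape (T, H, W).

    Returns (Field^{._x}, Field^{._y}, Field^{._t}), i.e. the derivatives of Eq. (3)/(4).
    Calling this again on each returned volume produces the second-order
    fields Field^{._xx}, Field^{._yy}, Field^{._tt} of Eqs. (5)-(6).
    """
    d_dt, d_dy, d_dx = np.gradient(fields)   # derivatives along t, y and x, respectively
    return d_dx, d_dy, d_dt
```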

3. Acquisition of the second-order derivatives of divergence and curl

In order to describe the kinematic characteristics of the optical flow field in more detail, the second-order derivatives of divergence and curl for \(q_{t}\) along the \(x\), \(y\) and \(t\) directions are further computed, as shown in Eqs. (5) and (6).

$$\left( {div_{xx} \left( {q_{t} } \right),div_{yy} \left( {q_{t} } \right),div_{tt} \left( {q_{t} } \right)} \right)^{T} = \nabla \odot \left( {div_{x} \left( {q_{t} } \right),div_{y} \left( {q_{t} } \right),div_{t} \left( {q_{t} } \right)} \right)^{T}$$
(5)
$$\left( {curl_{xx} \left( {q_{t} } \right),curl_{yy} \left( {q_{t} } \right),curl_{tt} \left( {q_{t} } \right)} \right)^{T} = \nabla \odot \left( {curl_{x} \left( {q_{t} } \right),curl_{y} \left( {q_{t} } \right),curl_{t} \left( {q_{t} } \right)} \right)^{T}$$
(6)

where \(\odot\) denotes the element-wise multiplication.

In the above formulas, the second-order derivatives \(div_{xx} \left( {q_{t} } \right)\), \(div_{yy} \left( {q_{t} } \right)\), \(curl_{xx} \left( {q_{t} } \right)\) and \(curl_{yy} \left( {q_{t} } \right)\) describe the change rates of the first-order derivatives of divergence and curl along the \(x\) and \(y\) directions. They form the second-order spatial kinematic fields, including the second-order spatial divergence fields \(Field^{{div_{xx} }}\) and \(Field^{{div_{yy} }}\) as well as the second-order spatial curl fields \(Field^{{curl_{xx} }}\) and \(Field^{{curl_{yy} }}\), which depict more detailed motion information. The second-order derivatives \(div_{tt} \left( {q_{t} } \right)\) and \(curl_{tt} \left( {q_{t} } \right)\) correspond to the change rates of the first-order derivatives of divergence and curl along the \(t\) direction. Thus, the second-order temporal kinematic fields, including the second-order temporal divergence field \(Field^{{div_{tt} }}\) and curl field \(Field^{{curl_{tt} }}\), capture the accelerations of divergence and curl.

4. Joint coding

The set of kinematic fields obtained above usually has high dimensionality and strong correlation, which poses great challenges for the subsequent joint feature coding. Therefore, the above kinematic fields are first reduced in dimension and then jointly encoded to obtain the proposed multi-order divergence feature and multi-order curl feature. The specific process is as follows, taking the divergence fields as an example.

(a) Feature dimension reduction. For the \(j{\text{-th}}\) frame in the \(i{\text{-th}}\) video, its fields \(Field_{i,j}^{div}\), \(Field_{i,j}^{{div_{x} }}\), \(Field_{i,j}^{{div_{y} }}\), \(Field_{i,j}^{{div_{t} }}\), \(Field_{i,j}^{{div_{xx} }}\), \(Field_{i,j}^{{div_{yy} }}\) and \(Field_{i,j}^{{div_{tt} }}\) are each reduced in dimension by two-dimensional principal component analysis (2DPCA) [25], so as to obtain the corresponding low-dimensional representation \(FIELD_{i,j}^{div} = \left[ {\hat{F}ield_{i,j}^{div} ,\hat{F}ield_{i,j}^{{div_{x} }} ,\hat{F}ield_{i,j}^{{div_{y} }} ,\hat{F}ield_{i,j}^{{div_{t} }} ,\hat{F}ield_{i,j}^{{div_{xx} }} ,\hat{F}ield_{i,j}^{{div_{yy} }} ,\hat{F}ield_{i,j}^{{div_{tt} }} } \right] \in R^{d}\), where \(i = 1,2, \ldots ,G\) and \(G\) is the number of videos; \(j = 1,2, \ldots ,Q_{i}\) and \(Q_{i}\) is the number of frames in the \(i{\text{-th}}\) video; \(d\) is the dimension of \(FIELD_{i,j}^{div}\).

(b) Feature coding. The Fisher vector [26] is used to jointly encode the above low-dimensional representations: a Gaussian mixture model (GMM) with \(K\) components is used to create the Fisher vectors, and L2 normalization is then applied to obtain the multi-order divergence feature set \(Ms^{div} = \left\{ {Ms_{1}^{div} ,Ms_{2}^{div} , \ldots ,Ms_{G}^{div} } \right\} \in R^{G \times O}\) for all videos, where \(O = 2dK\). In the same way, the multi-order curl feature set \(Ms^{curl} = \left\{ {Ms_{1}^{curl} ,Ms_{2}^{curl} , \ldots ,Ms_{G}^{curl} } \right\} \in R^{G \times O}\) is also obtained. A coding sketch is given after this list.

2.2 Construction of discriminative kinematic descriptor

The proposed \(Ms^{div}\) and \(Ms^{curl}\) depict the dynamic characteristics of the action subject at multiple levels and from different perspectives, and there is a certain complementarity between them. Fusing them therefore yields a more complete feature representation that delineates the action subject in complex environments more precisely. This section proposes a discriminative neural network fusion method to fuse \(Ms^{div}\) and \(Ms^{curl}\); consequently, the proposed discriminative kinematic descriptor is obtained, as shown in Fig. 3. The specific process is as follows.

1. Introduction of a single tight-loose constraint term

Given a feature set \(\varvec{Z} = f\left( {Ms^{div} ,Ms^{curl} ;{\varvec{\Theta}}} \right)\), where \(f\left( \cdot \right)\) denotes the feature projection function and \({\varvec{\Theta}} = \left\{ {\Theta_{1} ,\Theta_{2} , \ldots ,\Theta_{\omega } } \right\}\) is the model parameter set with \(\omega\) parameters, a single tight-loose constraint \(\varvec{TL}_{Z}\) is first introduced, as shown in Eq. (7).

$$\varvec{TL}_{Z} = \frac{1}{G}\sum\limits_{\xi = 1}^{C} {\sum\limits_{\eta = 1}^{{C^{\xi } }} {\sum\limits_{\gamma = 1,\gamma \ne \xi }^{C} {\frac{{\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\xi } } \right\|}}{{\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\gamma } } \right\|}}} } }$$
(7)

where \(C\) is the number of action classes; \(C^{\xi }\) represents the number of features of the \(\xi {\text{-th}}\) class; \(\varvec{Z}_{\eta }^{\xi } \in R^{1 \times 2O}\) denotes the \(\eta {\text{-th}}\) feature of the \(\xi {\text{-th}}\) class in \(\varvec{Z}\); \(\bar{\varvec{Z}}^{\gamma } \in R^{1 \times 2O}\) and \(\bar{\varvec{Z}}^{\xi } \in R^{1 \times 2O}\) represent the feature centers of the \(\gamma {\text{-th}}\) and \(\xi {\text{-th}}\) classes, respectively, namely the mean values of the corresponding features.

2. Introduction of an anti-confusion constraint term

It is known that there are usually many outliers in the feature space. The distances from these outliers to the feature centers of their own classes are often larger than their distances to the feature centers of other classes, which seriously affects the discriminativeness of features. In order to reduce the confusion caused by outliers, an anti-confusion constraint \(\varvec{AC}_{Z}\) is introduced as a penalty term to measure the degree of confusion between different classes of features, as shown in Eq. (8).

$$\varvec{AC}_{Z} = \frac{1}{G}\sum\limits_{\xi = 1}^{C} {\sum\limits_{\eta = 1}^{{C^{\xi } }} {\sum\limits_{\gamma = 1,\gamma \ne \xi }^{C} {relu\left( {\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\xi } } \right\| - \left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\gamma } } \right\|} \right)} } }$$
(8)

where \(relu\left( \cdot \right)\) represents the rectified linear unit (ReLU) [27].

3. Construction of the objective function for the proposed fusion method

By introducing both of the constraint terms \(\varvec{TL}_{Z}\) and \(\varvec{AC}_{Z}\) into the cross-entropy loss function, the objective function of the proposed fusion method is obtained, as shown in Eq. (9).

$$\begin{aligned} \min \, J & = - \sum\limits_{g = 1}^{G} {\sum\limits_{c = 1}^{C} {y^{\prime}_{g} \log \left( {p\left( {y_{g} = \left. c \right|\varvec{Z}_{g} } \right)} \right)} } + \varvec{TL}_{Z} + \varvec{AC}_{Z} \\ & = - \sum\limits_{g = 1}^{G} {\sum\limits_{c = 1}^{C} {y^{\prime}_{g} \log \left( {p\left( {y_{g} = \left. c \right|\varvec{Z}_{g} } \right)} \right)} } \\ & \quad + \frac{1}{G}\sum\limits_{\xi = 1}^{C} {\sum\limits_{\eta = 1}^{{C^{\xi } }} {\sum\limits_{\gamma = 1,\gamma \ne \xi }^{C} {\frac{{\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\xi } } \right\|}}{{\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\gamma } } \right\|}}} } } \\ & \quad + \frac{1}{G}\sum\limits_{\xi = 1}^{C} {\sum\limits_{\eta = 1}^{{C^{\xi } }} {\sum\limits_{\gamma = 1,\gamma \ne \xi }^{C} {relu\left( {\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\xi } } \right\| - \left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\gamma } } \right\|} \right)} } } \\ \end{aligned}$$
(9)

where \(y_{g}\) and \(y^{\prime}_{g}\) are, respectively, the predicted label and true label of the \(g{\text{-th}}\) sample; \(\varvec{Z}_{g}\) represents the \(g{\text{-th}}\) feature in \(\varvec{Z}\).

As can be seen from Eq. (9), during optimization the proposed \(\varvec{TL}_{Z}\), by computing the relative distances between each feature and its own feature center as well as the feature centers of other classes, pulls each feature point closer to its own feature center and pushes it farther from the feature centers of other classes. That is, the within-class compactness is enhanced and the between-class separability is increased simultaneously, so the discriminativeness of the features is improved. Further, the proposed \(\varvec{AC}_{Z}\), by accumulating the sum of error distances in the feature space, reduces the between-class confusion caused by outliers.

4. Acquisition of the proposed discriminative kinematic descriptor

Here, a three-layer neural network, called the discriminative fusion network, is constructed to achieve the fusion of \(Ms^{div}\) and \(Ms^{curl}\). This network takes training samples from \(Ms^{div}\) and \(Ms^{curl}\) as inputs and uses Eq. (9) as the objective function. By minimizing Eq. (9) with the stochastic gradient descent (SGD) algorithm, the optimal model parameter set \({\varvec{\Theta}}^{*}\) is acquired and the discriminative fusion of \(Ms^{div}\) and \(Ms^{curl}\) is achieved. That is, the proposed discriminative kinematic descriptor \(\varvec{F}_{kinematic} = f\left( {Ms^{div} ,Ms^{curl} ;{\varvec{\Theta}}^{*} } \right)\) is obtained; a minimal sketch of such a fusion network and its objective is given below.
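The following PyTorch sketch is illustrative only: the layer widths, batch construction and toy dimensions are assumptions, and the constraint terms are estimated from batch-wise class centers, whereas the paper does not specify how the centers are computed during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeFusionNet(nn.Module):
    """Three-layer fusion network: (Ms^div, Ms^curl) -> fused feature Z -> class scores."""
    def __init__(self, feat_dim: int, fused_dim: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim, fused_dim), nn.ReLU(),
                                  nn.Linear(fused_dim, fused_dim), nn.ReLU())
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, ms_div, ms_curl):
        z = self.fuse(torch.cat([ms_div, ms_curl], dim=1))   # fused feature Z
        return z, self.classifier(z)

def tl_ac_terms(z, labels, num_classes):
    """Batch estimates of the tight-loose term TL_Z (Eq. 7) and anti-confusion term AC_Z (Eq. 8)."""
    # Per-class feature centers; assumes every class has several samples in the batch.
    centers = torch.stack([z[labels == c].mean(0) for c in range(num_classes)])
    d_own = (z - centers[labels]).norm(dim=1)                 # ||Z_eta^xi - Zbar^xi||
    d_all = torch.cdist(z, centers)                           # ||Z_eta^xi - Zbar^gamma|| for all gamma
    own = F.one_hot(labels, num_classes).bool()
    d_other = d_all.masked_fill(own, float('inf'))            # exclude gamma == xi from the ratio
    tl = (d_own[:, None] / d_other).sum() / z.size(0)         # Eq. (7)
    ac = F.relu(d_own[:, None] - d_all).sum() / z.size(0)     # Eq. (8); own-class column contributes 0
    return tl, ac

# One SGD step on a toy batch (O and C are small stand-ins; in the paper O = 2dK, C = 101 or 51).
O, C, B = 512, 8, 64
model = DiscriminativeFusionNet(feat_dim=O, fused_dim=256, num_classes=C)
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
ms_div, ms_curl = torch.randn(B, O), torch.randn(B, O)
labels = torch.arange(B) % C                                  # guarantees every class is present
z, logits = model(ms_div, ms_curl)
tl, ac = tl_ac_terms(z, labels, C)
loss = F.cross_entropy(logits, labels) + tl + ac              # Eq. (9)
opt.zero_grad(); loss.backward(); opt.step()
# After training, the fused feature z produced for each video plays the role of F_kinematic.
```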

Overall, the multi-order divergence feature \(Ms^{div}\) and multi-order curl feature \(Ms^{curl}\) are first obtained from a set of kinematic fields; they possess better discriminativeness and meanwhile remove the camera motion and slowly changing background. Then, in order to acquire a more complete feature representation, the discriminative fusion method is proposed to fuse \(Ms^{div}\) and \(Ms^{curl}\). Consequently, the discriminative kinematic descriptor is obtained, which possesses better within-class compactness and between-class separability and is robust to outliers. Moreover, no additional interest point detection is needed, so the computational cost is significantly reduced and the negative effects of inaccurate interest point detection on action recognition are avoided. All of these properties are very useful for action recognition.

3 Deep attention-pooled descriptor

When performing action recognition, both dynamic and static information are very significant clues. In fact, when recognizing action classes that are closely related to specific objects or action scenes, static features play a crucial role. This section aims to obtain the discriminative static information in the background for action recognition; for this purpose, a deep attention-pooled descriptor is constructed.

Firstly, the architecture of Inception-v3 network is briefly introduced. Then, the prediction-attentional pooling method is proposed. Subsequently, it is applied to both lower layer and higher layer of Inception-v3 for acquiring the proposed deep local attentional feature and deep global attentional feature. Finally, by concatenating the two attentional features, the proposed deep attention-pooled descriptor is constructed, as shown in Fig. 4.

Fig. 4

Schematic of the proposed deep attention-pooled descriptor

3.1 Introduction of architecture of Inception-v3

The Inception-v3 deep neural network was developed by Google. It is a 42-layer deep convolutional neural network consisting of multiple Inception modules. Each Inception module contains four convolutional branches, whose kernel sizes can be freely selected from \(5 \times 5\), \(3 \times 3\) and \(1 \times 1\), so that information at different scales can be synthesized. Compared with the Inception-v2 network [28], Inception-v3 adopts a combination of \(1 \times n\) and \(n \times 1\) convolutional kernels instead of the original \(n \times n\) kernel, which significantly reduces the number of parameters. In addition, Inception-v3 adopts global average pooling (GAP), rather than a traditional fully connected layer, to obtain the feature vector at the end of the network. A sketch of the factorized convolution is given below.
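As an illustration of the \(1 \times n\)/\(n \times 1\) factorization mentioned above, the generic PyTorch sketch below shows the idea; it is not the paper's implementation, and the channel counts and \(n\) are arbitrary.

```python
import torch
import torch.nn as nn

class FactorizedConv(nn.Module):
    """An n x n convolution replaced by a 1 x n convolution followed by an n x 1 convolution,
    reducing the weights roughly from n*n*C_in*C_out to n*C_out*(C_in + C_out)."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, n), padding=(0, n // 2)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(n, 1), padding=(n // 2, 0)))

    def forward(self, x):
        return self.conv(x)

print(FactorizedConv(192, 192)(torch.randn(1, 192, 17, 17)).shape)  # torch.Size([1, 192, 17, 17])
```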

3.2 Extraction of deep local and global features

In fact, a deep network learns different features at each level of its layer hierarchy. Specifically, activations in lower layers have smaller receptive fields and are more sensitive to edge-like patterns and corners, while activations in higher layers have larger receptive fields, learn more global and high-level feature representations, and capture more complex invariances. However, Inception-v3 only uses the top layer of the network, which is not enough for describing fine-grained detail.

In order to obtain a more complete feature representation, a local feature and a global feature are extracted from a lower layer and a higher layer of Inception-v3, respectively, which lays the foundation for the proposed deep local attentional feature and deep global attentional feature. Specifically, (1) the Mixed_5c layer of Inception-v3, with size \(35 \times 35 \times 288\), is selected as the deep local feature \(\varvec{X}^{L}\), where \(35 \times 35\) is the number of regions in a video frame and 288 is the dimension of the feature vector for each region. The \(35 \times 35\) scale is chosen because classical hand-crafted features usually adopt an \(8 \times 8\) region for local feature extraction, and when this is mapped to Inception-v3, the most similar window scale is \(35 \times 35\). (2) The Mixed_7a layer of Inception-v3, with size \(8 \times 8 \times 1280\), is taken as the deep global feature \(\varvec{X}^{H}\), where \(8 \times 8\) is the number of regions in a video frame and \(1280\) is the dimension of the feature vector for each region. Mixed_7a is chosen instead of the last layer, Mixed_7c, because the feature maps of Mixed_7c have very large receptive fields, meaning that each pixel in the feature maps corresponds to the whole input image [29]; thus different locations cannot be assigned different weights, and it becomes impossible to highlight the discriminative regions, which would be disadvantageous for acquiring the proposed deep global attentional feature. A feature-extraction sketch is given below.
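The sketch below uses torchvision's Inception-v3 implementation, whose Mixed_5c and Mixed_7a modules produce the same \(35 \times 35 \times 288\) and \(8 \times 8 \times 1280\) tensors, rather than the TensorFlow model fine-tuned in the paper; weight loading and preprocessing are omitted and assumed.

```python
import torch
from torchvision.models import inception_v3

# In practice the network would carry pretrained and fine-tuned weights;
# random weights are enough to demonstrate the tensor shapes.
model = inception_v3(weights=None, aux_logits=True).eval()

features = {}
def grab(name):
    def hook(_module, _inputs, output):
        features[name] = output.detach()
    return hook

model.Mixed_5c.register_forward_hook(grab('local'))    # (N, 288, 35, 35)  -> X^L
model.Mixed_7a.register_forward_hook(grab('global'))   # (N, 1280, 8, 8)   -> X^H

key_frame = torch.randn(1, 3, 299, 299)                # a preprocessed key-frame (placeholder)
with torch.no_grad():
    model(key_frame)

x_local = features['local'].flatten(2).transpose(1, 2)    # (1, 35*35, 288): N regions x D
x_global = features['global'].flatten(2).transpose(1, 2)  # (1, 8*8, 1280)
```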

3.3 Proposed prediction-attentional pooling method

It is well known that current deep networks usually adopt GAP, rather than a fully connected layer, to compress the feature maps at the end of the network and obtain global features. However, GAP treats all regions inside the feature maps as equally important, which may reduce the discriminativeness of the features [30]. Therefore, some methods [31, 32] use the attention mechanism to highlight the discriminative regions, and others extend the single-channel attention mechanism to multiple channels to enhance the discriminativeness of features. Yan et al. [33] proposed a multi-branch attention network that obtains attention maps from scene-level and region-level context perspectives, respectively, and then integrates the two context branches to acquire the final attentional regions. Girdhar et al. [29] utilized low-rank second-order pooling to obtain multiple attention maps from bottom-up and top-down perspectives, respectively, and then combined these attention maps to acquire the final attentional regions. However, the above methods fuse different attention maps in simple ways without highlighting the more discriminative attention maps, so the final attentional regions are not accurate enough. To address this problem, a novel prediction-attentional pooling method is proposed, which aims to focus more accurately on the significant discriminative regions while suppressing irrelevant background interference. The details are as follows.

Given an extracted deep feature map \(\varvec{X} = \left[ {\varvec{X}_{1}^{T} , \ldots ,\varvec{X}_{i}^{T} , \ldots ,\varvec{X}_{N}^{T} } \right]^{T} \in R^{N \times D}\), where each \(\varvec{X}_{i} \in R^{1 \times D}\) maps to a distinct overlapping region of the input space, \(N\) denotes the number of regions in a video frame and \(D\) is the dimension of the feature vector for each region, the proposed prediction-attentional pooling method is briefly summarized as follows. Firstly, attentions with \(C\) channels are constructed, where the number of channels equals the number of classes and a single weight is learned for each channel, so that distinct aspects of the deep feature are attended to. The attentional heatmap of \(\varvec{X}\) in each channel is then calculated to obtain the attentional heatmap set \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\), where \(\varvec{M}^{c} \in R^{1 \times N}\) is the \(c{\text{-th}}\) attentional heatmap. Secondly, the predictions of the deep network are taken as weights to fuse \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\), yielding the weighted fusion attentional heatmap \(\varvec{M}^{fuse} \in R^{1 \times N}\). Thirdly, \(\varvec{M}^{fuse}\) is used as the weight of \(\varvec{X}\) to further enhance the effect of important local regions. Consequently, a more accurate and discriminative deep feature is obtained. The specific calculation process is shown below.

1. Acquisition of attentional heatmaps for \(C\) channels. Firstly, in order to obtain \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\), a convolutional kernel \(\varvec{a}^{c} \in R^{1 \times D}\) is applied on each channel to acquire attentional heatmaps from different perspectives. Specifically, a softmax function that generates the attention distribution over the regions of the image is adopted for each channel to obtain \(M_{i}^{c}\), as shown in Eq. (10).

    $$M_{i}^{c} = \frac{{\exp \left( {\varvec{a}^{c} \varvec{X}_{i}^{T} } \right)}}{{\sum\nolimits_{j = 1}^{N} {\exp \left( {\varvec{a}^{c} \varvec{X}_{j}^{T} } \right)} }}$$
    (10)

where \(M_{i}^{c}\) represents the \(i{\text{-th}}\) element of \(\varvec{M}^{c}\), namely the attentional weight of the \(i{\text{-th}}\) vector \(\varvec{X}_{i}\) in the \(c{\text{-th}}\) channel. The larger \(M_{i}^{c}\) is, the more important \(\varvec{X}_{i}\) is in the \(c{\text{-th}}\) channel. Applying Eq. (10) to each channel yields the attentional heatmap set \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\).

2. Acquisition of the weighted fusion attentional heatmap \(\varvec{M}^{fuse}\) for \(C\) channels. The motivation is that different actions activate different attentional heatmaps. In fact, different channels of the attentional heatmaps capture different regions related to the action subject, discriminative objects and background, and in certain circumstances some channels are more important than others. Therefore, higher weights should be assigned to the discriminative channels that play more significant roles in action recognition.

To highlight the contributions of the discriminative channels related to \(\varvec{X}\), the prior probability of \(\varvec{X}\) belonging to the \(c{\text{-th}}\) class is adopted as the weight of \(M_{i}^{c}\) for the weighted fusion of \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\), so as to obtain the weighted fusion attentional heatmap \(\varvec{M}^{fuse}\). The \(i{\text{-th}}\) element \(M_{i}^{fuse}\) of \(\varvec{M}^{fuse}\) is calculated as shown in Eq. (11).

$$M_{i}^{fuse} = \sum\limits_{c = 1}^{C} { \, p\left( {y = \left. c \right|\varvec{X}} \right)M_{i}^{c} }$$
(11)

where \(y\) is the class label and \(p\left( {y = \left. c \right|\varvec{X}} \right)\) is the prior probability of \(\varvec{X}\) belonging to the \(c{\text{-th}}\) class. As can be seen from Eq. (11), the larger \(p\left( {y = \left. c \right|\varvec{X}} \right)\) is, the larger the weight of \(M_{i}^{c}\) and hence its contribution to \(M_{i}^{fuse}\); that is, Eq. (11) assigns larger weights to the more discriminative channels.

Furthermore, regarding the calculation of \(p\left( {y = \left. c \right|\varvec{X}} \right)\), the structure of the deep network implies that, during forward propagation, the feature map of each layer is obtained deterministically from the feature map of the preceding layer through basic matrix operations. That is, the conditional probability \(p\left( {\varvec{b}\left| \varvec{X} \right.} \right) = 1\) holds, where \(\varvec{b}\) is the bottleneck vector of the deep network. Thereby, the following derivation holds:

$$\begin{aligned} p\left( {y = c\left| \varvec{X} \right.} \right) & = {{p\left( {y = c,\varvec{X}} \right)} / {p\left( \varvec{X} \right)}} \\ & = {{p\left( {y = c,\varvec{b},\varvec{X}} \right)} / {p\left( {\varvec{b},\varvec{X}} \right)}} \\ & = p\left( {y = c\left| {\varvec{b},\varvec{X}} \right.} \right) \\ & = p\left( {y = c\left| \varvec{b} \right.} \right) \\ \end{aligned}$$
(12)

where \(p\left( {y = \left. c \right|\varvec{b}} \right)\) represents the probability of \(\varvec{b}\) belonging to the \(c{\text{-th}}\) class, namely the prediction of deep network.

It can be seen from Eq. (12) that the probability of \(\varvec{X}\) belonging to the \(c{\text{-th}}\) class equals the prediction of the deep network, which can be obtained by fine-tuning the network on the video dataset. Thus, Eq. (11) is transformed as follows:

$$M_{i}^{fuse} = \sum\limits_{c = 1}^{C} { \, p\left( {y = \left. c \right|\varvec{b}} \right)M_{i}^{c} }$$
(13)

At this point, the weighted fusion attentional heatmap \(\varvec{M}^{fuse}\) is acquired.

3. Acquisition of the more accurate and discriminative deep feature \(\varvec{Atte}\). \(\varvec{M}^{fuse}\) is used to conduct weighted pooling on \(\varvec{X}\), yielding the attentional feature \(\varvec{Atte}\), as shown in Eq. (14).

$$\varvec{Atte} = \varvec{M}^{fuse} \varvec{X} = \sum\limits_{i = 1}^{N} {\sum\limits_{c = 1}^{C} { \, p\left( {y = \left. c \right|\varvec{b}} \right)M_{i}^{c} \varvec{X}_{i} } }$$
(14)

In order to obtain \(\varvec{Atte}\) automatically, the SGD algorithm is utilized to minimize the objective function of the network, as shown in Eq. (15).

$$\min \, J = - \sum\limits_{g = 1}^{G} {\sum\limits_{c = 1}^{C} {y^{\prime}_{g} \log \left( {p\left( {y_{g} = c\left| {\varvec{Atte}_{g} } \right.} \right)} \right)} } + \zeta_{1} \sum\limits_{i = 1}^{N} {\left( {M_{i}^{fuse} } \right)^{2} } + \zeta_{2} \sum\limits_{c = 1}^{C} {\left\| {\varvec{a}^{c} } \right\|_{2} }$$
(15)

where \(\varvec{Atte}_{g}\) is the deep attentional feature of the \(g{\text{-th}}\) sample; \(p\left( {y_{g} = \left. c \right|\varvec{Atte}_{g} } \right)\) denotes the probability of the \(g{\text{-th}}\) sample belonging to the \(c{\text{-th}}\) class; \(\zeta_{1}\) and \(\zeta_{2}\) denote the attentional regularization coefficient and weight decay coefficient, respectively; \(\left\| {\, \cdot \,} \right\|_{2}\) is the \(l_{2}{\text{-norm}}\).

In summary, the proposed prediction-attentional pooling method adopts the network predictions as weights for the weighted fusion of the attentional heatmaps of multiple channels, yielding the weighted fusion attentional heatmap \(\varvec{M}^{fuse}\), which highlights the contributions of the discriminative channels while suppressing irrelevant background interference. Furthermore, \(\varvec{M}^{fuse}\) is used as the weight of the deep feature map \(\varvec{X}\) to enhance the effect of local regions that are important for action recognition. Consequently, the more accurate and discriminative deep feature \(\varvec{Atte}\) is obtained; a minimal sketch of the whole pooling procedure follows.
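The NumPy sketch below strings Eqs. (10), (13) and (14) together for one feature map; the attention kernels \(\varvec{a}^{c}\) and the prediction vector are random placeholders here, whereas in the paper they are learned jointly with the network via Eq. (15).

```python
import numpy as np

def prediction_attentional_pooling(X, A, p):
    """X: (N, D) region features; A: (C, D) attention kernels a^c; p: (C,) predictions p(y=c|b).

    Returns the pooled attentional feature Atte of shape (D,).
    """
    scores = A @ X.T                                                # (C, N): a^c X_i^T
    scores -= scores.max(axis=1, keepdims=True)                     # numerical stability
    M = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # Eq. (10): heatmaps M^c
    M_fuse = p @ M                                                  # Eq. (13): prediction-weighted fusion, (N,)
    return M_fuse @ X                                               # Eq. (14): weighted pooling over regions

# Toy usage with the local-feature sizes (N = 35*35 regions, D = 288) and C classes.
rng = np.random.default_rng(0)
N, D, C = 35 * 35, 288, 101
X = rng.standard_normal((N, D))
A = 0.01 * rng.standard_normal((C, D))
p = np.full(C, 1.0 / C)                                 # a flat prediction vector, for illustration only
atte_local = prediction_attentional_pooling(X, A, p)    # Atte^L, shape (288,)
```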

3.4 Construction of deep attention-pooled descriptor

In this section, the proposed deep local attentional feature \(\varvec{Atte}^{L}\) and deep global attentional feature \(\varvec{Atte}^{H}\) are firstly obtained. Then, the proposed deep attention-pooled descriptor is constructed.

Specifically, the proposed prediction-attentional pooling method is applied to both the deep local feature \(\varvec{X}^{L}\) and the deep global feature \(\varvec{X}^{H}\), yielding \(\varvec{Atte}^{L}\) and \(\varvec{Atte}^{H}\). Obviously, \(\varvec{Atte}^{L}\) mainly focuses on detail information such as texture and edge orientation, while \(\varvec{Atte}^{H}\) usually contains global body information and gives a holistic, abstract description of the action. Therefore, in order to comprehensively depict the discriminative information of the action scene, \(\varvec{Atte}^{L}\) and \(\varvec{Atte}^{H}\) are concatenated to construct the proposed deep attention-pooled descriptor \(\varvec{F}_{{attention - pooled}} = \left[ {\varvec{Atte}^{L} ,\varvec{Atte}^{H} } \right]\).

In summary, by combining \(\varvec{Atte}^{L}\) and \(\varvec{Atte}^{H}\), the proposed deep attention-pooled descriptor can more comprehensively and accurately depict the static visual appearance information of action scene and discriminative object in a video, which improves the discriminativeness of features, and is very useful for action recognition.

4 Experiments and analysis

In this section, the comparisons and analysis on experimental results of the proposed methods for action recognition are reported on two challenging video datasets, namely UCF101 and HMDB51. The illustrations of their representative frames are provided in Fig. 5.

Fig. 5

Representative frames from videos in UCF101 and HMDB51 datasets

4.1 Datasets and experimental settings

4.1.1 Introduction of datasets

The UCF101 [1] dataset includes 13,320 videos of 101 action classes. It exhibits the largest diversity in terms of actions, with large variations in camera motion, viewpoint, object appearance and pose, illumination conditions, cluttered background, object scale, etc. The videos of each action class are divided into 25 groups, where videos from the same group may share a similar background and viewpoint. The standard protocol of three train-test splits [34] is used in our experiments, and the average accuracy is adopted as the final performance measure.

The HMDB51 [2] dataset is collected from various sources and covers a wide variety of lighting conditions, surroundings and situations in which actions occur. The camera motion includes traveling shots, camera shaking, zooming, etc. In total, the dataset contains 6766 video clips divided into 51 action classes, each with at least 101 clips. The original evaluation scheme of three train-test splits [2] is adopted in our experiments; each split includes 70 training videos and 30 testing videos per class. The average result over the three splits is used to evaluate the final performance.

4.1.2 Experimental settings

(1) Parameter setting for the proposed discriminative kinematic descriptor. The number of Gaussians \(K\) in the Fisher vector is set to 128. (2) Parameter settings for Inception-v3. Inception-v3 is utilized in this paper and is pre-trained on the ILSVRC2012 dataset [35]. The TensorFlow open-source library [36] provided by Google is used to build the CNN framework, and the parameters of Inception-v3 are fine-tuned on the UCF101 and HMDB51 datasets using 4 NVIDIA Titan X GPUs. The SGD algorithm is adopted for training the network, with a batch size of 50, a momentum of 0.9, a learning rate of 0.0001, a weight decay of 0.0005 and a dropout ratio of 0.9. (3) Parameter settings for the proposed deep attention-pooled descriptor. Similarly, SGD is adopted to train the proposed deep network, with a batch size of 200 and a learning rate of 0.001. (4) Classifier settings. For the proposed discriminative kinematic descriptor and the DKD–DAD framework, a linear SVM is used as the classifier. For the proposed deep attention-pooled descriptor, the output of the softmax layer of the network is directly used for action recognition.

4.2 Experiment on parameter selection

In this section, the UCF101 dataset is taken as an example, and the regularization coefficient \(\zeta_{1}\) and weight decay \(\zeta_{2}\) are optimized to show their significance for action recognition with the proposed deep attention-pooled descriptor. Specifically, \(\zeta_{1}\) and \(\zeta_{2}\) are set by searching the grids \(\left\{ {0.0005,\;0.005,\;0.05,\;0.5} \right\}\) and \(\left\{ {0.00005,\;0.0005,\;0.005,\;0.05} \right\}\), respectively. Figure 6 shows the recognition results under the different parameter settings.

Fig. 6

Accuracy of the proposed deep attention-pooled descriptor versus regularization coefficient and weight decay parameters on UCF101 dataset

Figure 6 shows that the best performance is achieved when \(\zeta_{1} = 0.05\) and \(\zeta_{2} = 0.0005\). Further, it can be observed that the accuracy begins to decline when \(\zeta_{1}\) and \(\zeta_{2}\) become larger or smaller, so a trade-off on \(\zeta_{1}\) and \(\zeta_{2}\) is necessary. In fact, when \(\zeta_{1}\) and \(\zeta_{2}\) deviate from these values, the deep attention-pooled descriptor can no longer accurately highlight the discriminative regions of the key-frame, the discriminability of the static features is weakened, and the recognition performance becomes worse. Consequently, \(\zeta_{1}\) and \(\zeta_{2}\) are set to \(0.05\) and \(0.0005\), respectively, on the UCF101 dataset in subsequent experiments. Similar results are observed on the HMDB51 dataset.

4.3 Action recognition with kinematic features

In this section, the proposed kinematic features are applied for action recognition to verify their effectiveness. Tables 1 and 2, respectively, show the recognition results of the proposed kinematic features, namely multi-order divergence feature and multi-order curl feature, as well as contrastive methods on UCF101 and HMDB51.

Table 1 Recognition result of the proposed kinematic features and contrastive methods on UCF101 dataset
Table 2 Recognition result of the proposed kinematic features and contrastive methods on HMDB51 dataset

As can be seen from Tables 1 and 2, the proposed kinematic features achieve better accuracy than all contrastive methods. The reason is that both features, by transforming the optical flow field into a set of more discriminative kinematic fields, acquire different dynamic characteristics of the optical flow field at multiple levels and from various perspectives. In fact, they capture the spatiotemporal characteristics of the action subject, so they can more accurately depict the detailed motion information of the subject while removing the camera motion and slowly changing background. Consequently, the accuracies are improved.

4.4 Action recognition with discriminative kinematic descriptor

This section aims to demonstrate the effectiveness of the proposed discriminative kinematic descriptor, namely the validity of the proposed discriminative fusion method. Taking the UCF101 and HMDB51 datasets as examples, Tables 3 and 4, respectively, show the recognition results of the proposed discriminative kinematic descriptor obtained by the proposed discriminative fusion method, together with the results of simple concatenation and linear weighted fusion of the proposed multi-order divergence feature and multi-order curl feature, as well as the results of the contrastive methods. The weight coefficients of the linear weighted fusion are obtained by grid search.

Table 3 Recognition result of the proposed discriminative kinematic descriptor and contrastive methods on UCF101 dataset
Table 4 Recognition result of the proposed discriminative kinematic descriptor and contrastive methods on HMDB51 dataset

It can be observed from Tables 3 and 4 that: (1) The proposed discriminative kinematic descriptor outperforms all contrastive methods. (2) The result of concatenating the multi-order divergence feature and multi-order curl feature is superior to the results of using any of them alone, which indicates that there is really some complementary information between them. (3) The performance of linear weighted fusion is better than that of concatenation, which indicates that the proposed multi-order divergence feature and multi-order curl feature have different contribution degrees to action recognition. Based on the above observations, the best performance of the proposed discriminative kinematic descriptor is owing to the following contributions. The proposed discriminative fusion method, by introducing the tight-loose constraint term, reduces the within-class variations while also increasing the between-class differences. That is, the proposed discriminative kinematic descriptor possesses better within-class compactness and between-class separability simultaneously. In addition, by further introducing the anti-confusion constraint term, the confusion caused by outliers is reduced, which enhances the discriminativeness and robustness of the proposed kinematic descriptor. Consequently, the performance is improved effectively.

4.5 Action recognition with prediction-attentional pooling method

This section aims to verify the effectiveness of the proposed prediction-attentional pooling method. Taking the split 1 of UCF101 dataset for example, Fig. 7 demonstrates the recognition accuracies with applying the proposed pooling method, GAP, max pooling (MAX) and classical attention pooling method on the extracted deep local feature \(\varvec{X}^{L}\) and deep global feature \(\varvec{X}^{H}\).

Fig. 7

Recognition accuracy of different pooling methods on UCF101 dataset (split 1)

It can be seen from Fig. 7 that: (1) For both the deep local and global features, the classical attention pooling method achieves better accuracy than GAP and MAX, because the attention mechanism highlights the contribution of discriminative local regions. (2) The proposed pooling method outperforms the classical attention pooling method, because it adopts the network predictions as weights to fuse the attentions of different channels, which further highlights the contributions of the discriminative channels. Consequently, the discriminative regions are accurately located and the accuracy is effectively improved.

4.6 Action recognition with deep attention-pooled descriptor

In order to verify the effectiveness of the proposed deep attention-pooled descriptor, Tables 5 and 6, respectively, show the recognition results of this descriptor and contrastive methods on UCF101 and HMDB51 datasets.

Table 5 Recognition result of the proposed deep attention-pooled descriptor and contrastive methods on UCF101 dataset
Table 6 Recognition result of the proposed deep attention-pooled descriptor and contrastive methods on HMDB51 dataset

As shown in Tables 5 and 6, the proposed deep attention-pooled descriptor performs better than all contrastive methods. The reason is that, by combining the proposed deep local attentional feature and global attentional feature, the descriptor more accurately depicts the static visual appearance information of the action scene and the discriminative object in a video and enhances the discriminativeness of the static deep features. Consequently, the accuracies are effectively improved.

4.7 Heatmap visualization of prediction-attentional pooling method

To intuitively demonstrate the validity of the proposed prediction-attentional pooling method, Fig. 8 illustrates the visualization examples of heatmaps obtained by the proposed pooling method. For comparative analysis, the visualization examples of heatmaps obtained by classical attention pooling method are given simultaneously.

Fig. 8

Heatmap visualization of the proposed prediction-attentional pooling method and classical attention pooling method on UCF101 dataset. Row 1: original video frames; row 2: heatmaps obtained by classical attention pooling method; row 3: deep global attentional heatmaps obtained by the proposed pooling method; row 4: heatmaps obtained by superimposing the deep local attentional heatmaps on deep global attentional heatmaps

It can be seen from Fig. 8a that, for the “biking” action, compared with the visualization result of the classical attention pooling method, the deep global attentional heatmap obtained by the proposed pooling method highlights the discriminative region (the bicycle) while suppressing other irrelevant regions. Furthermore, by superimposing the deep local attentional heatmap on the deep global attentional heatmap, the more discriminative local regions (the bicycle wheels) are further highlighted. Similarly, for “riding horse” in Fig. 8b, “swinging on the pommel horse” in Fig. 8c and “playing violin” in Fig. 8d, the proposed pooling method also focuses more accurately on the discriminative objects “horse,” “pommel horse” and “violin” in the video frames. The same conclusions hold for “table tennis shot” in Fig. 8e and “surfing” in Fig. 8f.

4.8 Action recognition with DKD–DAD framework

This section aims to demonstrate the effectiveness of the proposed DKD–DAD framework. Tables 7 and 8, respectively, show the recognition results of the DKD–DAD and contrastive methods on UCF101 and HMDB51 datasets.

Table 7 Recognition result of the proposed DKD–DAD framework and contrastive methods on UCF101 dataset
Table 8 Recognition result of the proposed DKD–DAD framework and contrastive methods on HMDB51 dataset

From the above experimental results, it can be seen that the proposed DKD–DAD achieves better accuracy than all contrastive methods. This is because DKD–DAD combines the discriminative kinematic descriptor and the deep attention-pooled descriptor for action recognition, sharing the benefits of both hand-crafted and deep features and thus comprehensively acquiring the important dynamic characteristics and discriminative static information in a video. Consequently, the accuracies are effectively improved.

4.9 Experiments on running time

Running time plays a significant role in performance assessment, so the time consumption of the proposed methods is briefly reported, taking the UCF101 dataset as an example. (1) For a video containing 55 frames, extracting the discriminative kinematic descriptor takes approximately 51.50 s; since the proposed kinematic descriptor does not require interest point detection or trajectory tracking, the time is spent chiefly on computing optical flow. (2) The proposed deep attention-pooled descriptor takes about 46.14 ms per frame. (3) For the proposed DKD–DAD framework, the overall processing time for a 55-frame video is about 53.00 s. These experiments are run on a workstation with a 2.60 GHz CPU.

5 Conclusions

The following conclusions are drawn from this paper. Firstly, by transforming the optical flow field into a set of more discriminative kinematic fields, the dynamic characteristics hidden within the optical flow field are captured, and two kinematic features are constructed that more accurately depict the dynamic characteristics of the action subject from the multi-order divergence and curl fields while removing the camera motion and slowly changing background. Secondly, a discriminative fusion method is proposed: introducing a single tight-loose constraint guarantees better within-class compactness and between-class separability, while the additional anti-confusion constraint reduces the confusion caused by outliers. On this basis, the discriminative kinematic descriptor is constructed, which possesses better discriminativeness and robustness. Thirdly, a prediction-attentional pooling method is proposed, which adopts the predictions of the deep network as weights for the weighted fusion of the attention information of different channels and thus highlights the contributions of the discriminative channels; consequently, its attention is focused more accurately on discriminative regions while irrelevant background interference is suppressed. Furthermore, the deep attention-pooled descriptor is constructed, which captures the significant static visual appearance information of the action scene and the discriminative object in a video. Finally, a DKD–DAD framework is constructed by combining the proposed discriminative kinematic descriptor and deep attention-pooled descriptor, which comprehensively obtains the dynamic characteristics and static information and further improves the accuracy of action recognition. The proposed methods are extensively evaluated on two challenging datasets, UCF101 and HMDB51, where superior performance is achieved in comparison with a number of state-of-the-art methods. Future work will focus on designing deeper networks and more effective pooling methods to handle complex video concepts.