1 Introduction

Human action recognition in videos has substantial academic value and broad application prospects, which have quickly made it a focus, and a challenge, of computer vision and artificial intelligence research, attracting great interest from researchers and institutions. However, action recognition remains challenging on real-world data obtained from web videos [1, 2], movies [3], etc. Extracting effective features is therefore crucial for action recognition.

1.1 Background and motivation

Extracting dynamic features is one of the important research directions for action recognition. Early works, including the spatiotemporal interest point (STIP) [4] and cuboids [5], usually adopt interest point detectors to capture pixels with salient changes of intensity or gradient in a spatiotemporal video volume, and then describe these interest points or small regions using statistics computed from neighboring pixels, so as to obtain the motion information of the action subject. Subsequently, some methods [6, 7] extend 2D image features to 3D features in videos to acquire spatiotemporal features for action recognition. In addition, quite a few studies show that the motion information of trajectories can achieve impressive performance, such as dense trajectories [8, 9] obtained by tracking densely sampled points using optical flow fields. The above interest-point-based methods have proven successful in the field of action recognition. However, they depend heavily on localized statistics within a small spatiotemporal neighborhood [5, 9] and cannot describe the global characteristics of motion as a whole. Some scholars [10, 11] have also built deep networks, such as two-stream convolutional networks [10] and 3D convolutional networks [11], to acquire spatiotemporal features for action recognition. However, these networks are not only difficult to train, but also unable to achieve performance equivalent to hand-crafted features.

As an important tool for describing the dynamic properties of videos, optical flow has been widely applied to action recognition. However, many features based on local optical flow simply summarize the flow with histograms of its orientations [12, 13], thus arguably ignoring other potentially discriminative properties. In fact, the optical flow can be regarded as a flow field, so some of its characteristics can be extracted using fluid dynamics theory. By exploring the dynamic characteristics of the optical flow field, optical flow can be described in a richer way that captures the physical characteristics of the flow pattern. However, such characteristics are rarely exploited in existing features for action recognition.

With the increasing number of action classes, motion features alone are not discriminative enough for dependable action recognition. In fact, the appearance information of the action scene and of discriminative objects in a video also plays a significant role. Recently, owing to its favorable learning and abstraction abilities, deep learning has come to dominate the image processing field and has been widely used in various applications [14,15,16]. For this reason, some scholars have attempted to extract important static features from images for action recognition by constructing deep networks. Wang et al. [17] first captured the spatial relationships and high-order correlations between parts, and then constructed a hierarchical spatial sum-product network (HS-SPN) to extract static deep features for action recognition. Kwak et al. [18] introduced triplet-based rank constraints into a deep convolutional network to automatically capture pose embedding information from still images for action recognition. Subsequently, Qi et al. [19] integrated pose hints into a convolutional neural network (CNN) framework by defining a joint loss function, thus obtaining static deep features containing pose information for action recognition. However, these methods feed the whole image into the deep network for feature extraction without focusing on the discriminative objects in the background.

To overcome this problem, some scholars attempt to extract features from the discriminative regions of video frames to improve recognition performance. Peng et al. [20] first divided the whole human body region [21] into multiple regions, and then used a deep CNN to extract features from individual discriminative regions for action recognition. Ni et al. [22] proposed a network composed of two connected deep convolutional neural networks (DCNNs): the first DCNN takes video frames as inputs and creates response maps indicating the locations of body parts; these maps are then fed into the second DCNN to learn discriminative and semantically aligned action representations of each body part for action recognition. However, these methods usually need to construct additional networks to obtain the discriminative regions, which makes training difficult and increases computational cost. Moreover, they assume that the discriminative information always lies in the regions around the human body, and therefore often focus on the human or its parts. In fact, some actions may be easier to distinguish using the appearance of the action scene, such as the ocean wave in a “surfing” action, while others require close attention to the discriminative object that interacts with the human, such as the bicycle in a “bike riding” action.

1.2 Overview of DKD–DAD

Motivated by the above methods, a framework with a discriminative kinematic descriptor and a deep attention-pooled descriptor (DKD–DAD) is proposed for action recognition, as shown in Fig. 1. Firstly, the optical flow field is transformed into a set of more discriminative kinematic fields to construct two kinematic features; a discriminative fusion method is then proposed, by which a discriminative kinematic descriptor is obtained to depict the dynamic characteristics of the action subject. Secondly, a prediction-attentional pooling method is proposed to automatically locate the discriminative local regions in a video frame, and a deep attention-pooled descriptor is presented to capture the discriminative static information in the action scene. Finally, the DKD–DAD framework combines the discriminative kinematic descriptor and the deep attention-pooled descriptor for action recognition, and accuracy is thereby improved.

Fig. 1

Overview of the proposed DKD–DAD framework for action recognition

1.3 Working flow of DKD–DAD

In this section, the overall working flow of DKD–DAD, illustrated in Fig. 2, is described; it includes two branches. Specifically, given an input action video, the left branch obtains the proposed discriminative kinematic descriptor, which depicts the dynamic information of the action video, while the right branch acquires the proposed deep attention-pooled descriptor, which describes the static information.

Fig. 2

Working flow of the proposed DKD–DAD framework for action recognition

The left branch includes the following steps. Firstly, the optical flow field is extracted, and the divergence and curl fields are calculated. The first-order divergence and curl fields are then acquired, and similarly the second-order divergence and curl fields are obtained. The multi-order divergence and curl fields are then jointly encoded to acquire the multi-order divergence and curl features, which are discriminatively fused to obtain the proposed discriminative kinematic descriptor.

The right branch proceeds as follows. Firstly, one frame is randomly picked as the key-frame and input into the Inception-v3 [23] network to acquire static deep features. The deep local and global features are then extracted from Inception-v3 and input into the multiple-channel attentions to obtain the local and global attentional heatmaps, which are fused using the predictions of the deep network as weight coefficients. The fused attentional heatmaps are used to conduct weighted pooling on the deep local and global features, respectively, yielding the proposed deep attention-pooled descriptor. Finally, the discriminative kinematic descriptor and the deep attention-pooled descriptor are concatenated as the feature representation of the action video and input into a support vector machine (SVM) classifier, which outputs the action label.

In summary, the major contributions of this paper are as follows: (1) Two kinematic features, multi-order divergence and multi-order curl, are constructed, which more accurately depict the dynamic characteristics of the action subject. (2) A novel fusion method is proposed, which ensures the discriminativeness of the two kinematic features; on this basis, a discriminative kinematic descriptor is constructed. (3) A prediction-attentional pooling method is presented, from which a deep attention-pooled descriptor is constructed. (4) A DKD–DAD framework for action recognition is proposed, which improves recognition accuracy. Experimental results demonstrate that the proposed methods provide promising performance compared with several state-of-the-art methods on two challenging datasets.

The rest of this paper is structured as follows. Section 2 constructs two kinematic features, and meanwhile presents the discriminative fusion method as well as discriminative kinematic descriptor. Section 3 proposes the prediction-attentional pooling method and deep attention-pooled descriptor. The related experiments and analysis of the proposed methods are shown in Sect. 4, followed by the conclusions with future work in Sect. 5.

2 Discriminative kinematic descriptor

Human action videos contain rich motion information that characterizes the intrinsic patterns of different actions. Most recent works use the optical flow field to describe motion information. However, optical flow only records the displacement vectors of pixels between two successive frames. By calculating the kinematics of the optical flow field, the physical properties of the flow pattern can be captured, which describe motion in a richer way and contain more details of the motion and its precise variations, such as local expansion, local spin, velocity and acceleration. Therefore, studying the flow field is equivalent to exploring the motion itself. In order to better obtain the dynamic characteristics of the action subject, this section first transforms the optical flow field into a set of more discriminative kinematic fields and then constructs two kinematic features. Finally, the discriminative fusion method is presented to obtain the proposed discriminative kinematic descriptor, as shown in Fig. 3.

Fig. 3

Schematic of the proposed discriminative kinematic descriptor

2.1 Construction of kinematic features

In this section, two kinematic features, namely the multi-order divergence feature and the multi-order curl feature, are constructed. To do this, the optical flow is computed first [24]. Specifically, given a video, any point \(q\) at time \(t\) is denoted as \(q_{t}\), and its optical flow vector is denoted as \(w\left( {q_{t} } \right) = \left( {u\left( {q_{t} } \right),v\left( {q_{t} } \right)} \right)\), where \(u\left( {q_{t} } \right)\) and \(v\left( {q_{t} } \right)\) are the horizontal and vertical components of \(w\left( {q_{t} } \right)\), respectively. By calculating the optical flow vector between adjacent frames at every pixel position, the optical flow field is acquired. A minimal sketch of this step is given below; the detailed construction process of the kinematic features then follows.
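The following sketch computes such a per-pixel flow field with OpenCV's Farnebäck algorithm. This is only a stand-in for the optical flow method of [24] used in the paper; the function name and parameter values are illustrative assumptions.

```python
import cv2
import numpy as np

def optical_flow_field(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Return w(q_t) = (u(q_t), v(q_t)) for every pixel as an array of shape (H, W, 2).

    Farneback flow is used purely as a stand-in for the flow method of [24].
    """
    # Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow  # flow[..., 0] = u(q_t), flow[..., 1] = v(q_t)
```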

1. Extraction of divergence and curl

The divergence and curl are both local first-order differential scalar quantities of the optical flow field. They describe the physical pattern of the flow from different perspectives and thus depict different characteristics of the motion in videos. In this paper, the divergence and curl of \(q_{t}\) are computed as follows:

$$div\left( {q_{t} } \right) = \frac{{\partial u\left( {q_{t} } \right)}}{\partial x} + \frac{{\partial v\left( {q_{t} } \right)}}{\partial y}$$
(1)
$$curl\left( {q_{t} } \right) = \frac{{\partial v\left( {q_{t} } \right)}}{\partial x} - \frac{{\partial u\left( {q_{t} } \right)}}{\partial y}$$
(2)

By calculating \(div\left( \cdot \right)\) and \(curl\left( \cdot \right)\) for each point in a video frame, the divergence field \(Field^{div}\) and curl field \(Field^{curl}\) corresponding to that frame are obtained. The physical meaning of \(Field^{div}\) is that it measures the amount of local expansion occurring in the optical flow field and can depict the regions of local expansion caused by the action subject. The physical meaning of \(Field^{curl}\) is that it delineates the local spin around the axis perpendicular to the plane of the optical flow field and can describe the dynamic characteristics resulting from human body motion.
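A minimal NumPy sketch of Eqs. (1) and (2) follows; the finite-difference discretization of the partial derivatives is an assumption, as the paper does not specify one.

```python
import numpy as np

def divergence_curl(flow: np.ndarray):
    """Compute Field^div and Field^curl (Eqs. 1-2) from a flow field of shape (H, W, 2)."""
    u, v = flow[..., 0], flow[..., 1]
    du_dy, du_dx = np.gradient(u)   # np.gradient returns derivatives along (rows = y, cols = x)
    dv_dy, dv_dx = np.gradient(v)
    div_field = du_dx + dv_dy       # Eq. (1): amount of local expansion
    curl_field = dv_dx - du_dy      # Eq. (2): amount of local spin
    return div_field, curl_field
```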

2. Acquisition of the first-order derivatives of divergence and curl

In fact, divergence and curl alone are not sufficient to describe the optical flow field generated by the action subject. Therefore, the first-order derivatives of divergence and curl are calculated to capture the precise variations of local expansion and local spin caused by the motion of the action subject. Given a spatiotemporal point \(q_{t}\), its first-order derivatives of divergence and curl along the \(x\), \(y\) and \(t\) directions are computed by Eqs. (3) and (4), respectively.

$$\left( {div_{x} \left( {q_{t} } \right),div_{y} \left( {q_{t} } \right),div_{t} \left( {q_{t} } \right)} \right)^{{T}} = \nabla div\left( {q_{t} } \right)$$
(3)
$$\left( {curl_{x} \left( {q_{t} } \right),curl_{y} \left( {q_{t} } \right),curl_{t} \left( {q_{t} } \right)} \right)^{{T}} = \nabla curl\left( {q_{t} } \right)$$
(4)

where \(\nabla = \left( {{\partial \mathord{\left/ {\vphantom {\partial {\partial x}}} \right. \kern-0pt} {\partial x}},{\partial \mathord{\left/ {\vphantom {\partial {\partial y}}} \right. \kern-0pt} {\partial y}},{\partial \mathord{\left/ {\vphantom {\partial {\partial t}}} \right. \kern-0pt} {\partial t}}} \right)^{{T}}\) represents the gradient operator.

By calculating \(div_{x} \left( \cdot \right)\), \(div_{y} \left( \cdot \right)\), \(curl_{x} \left( \cdot \right)\) and \(curl_{y} \left( \cdot \right)\) for each point \(q_{t}\) in a video frame, a set of first-order spatial kinematic fields is obtained, including the first-order spatial divergence fields \(Field^{{div_{x} }}\) and \(Field^{{div_{y} }}\) as well as the first-order spatial curl fields \(Field^{{curl_{x} }}\) and \(Field^{{curl_{y} }}\), which capture the relative motion between pixels along the \(x\) and \(y\) directions and meanwhile remove the camera motion. Similarly, by calculating \(div_{t} \left( \cdot \right)\) and \(curl_{t} \left( \cdot \right)\) for each point \(q_{t}\), a set of first-order temporal kinematic fields is obtained, including the first-order temporal divergence field \(Field^{{div_{t} }}\) and the first-order temporal curl field \(Field^{{curl_{t} }}\), which capture the velocities of divergence and curl and meanwhile directly remove the slowly changing background through the subtraction of two consecutive kinematic fields. A sketch of these derivative fields is given below.
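The sketch below again uses finite differences as an assumed discretization: stacking the per-frame divergence (or curl) fields over time and differentiating along each axis yields the first-order spatial and temporal kinematic fields, and applying the same function to its own outputs gives the second-order fields of Eqs. (5) and (6).

```python
import numpy as np

def first_order_kinematic_fields(fields: np.ndarray):
    """fields: per-frame divergence (or curl) fields stacked as an array of shape (T, H, W).

    Returns (Field^{._x}, Field^{._y}, Field^{._t}), i.e. the derivatives of Eq. (3)/(4).
    Calling this again on each returned volume produces the second-order
    fields Field^{._xx}, Field^{._yy}, Field^{._tt} of Eqs. (5)-(6).
    """
    d_dt, d_dy, d_dx = np.gradient(fields)   # derivatives along t, y and x, respectively
    return d_dx, d_dy, d_dt
```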

3. Acquisition of the second-order derivatives of divergence and curl

In order to describe the kinematic characteristics of the optical flow field in more detail, the second-order derivatives of divergence and curl for \(q_{t}\) along the \(x\), \(y\) and \(t\) directions are further computed, as shown in Eqs. (5) and (6).

$$\left( {div_{xx} \left( {q_{t} } \right),div_{yy} \left( {q_{t} } \right),div_{tt} \left( {q_{t} } \right)} \right)^{T} = \nabla \odot \left( {div_{x} \left( {q_{t} } \right),div_{y} \left( {q_{t} } \right),div_{t} \left( {q_{t} } \right)} \right)^{T}$$
(5)
$$\left( {curl_{xx} \left( {q_{t} } \right),curl_{yy} \left( {q_{t} } \right),curl_{tt} \left( {q_{t} } \right)} \right)^{T} = \nabla \odot \left( {curl_{x} \left( {q_{t} } \right),curl_{y} \left( {q_{t} } \right),curl_{t} \left( {q_{t} } \right)} \right)^{T}$$
(6)

where \(\odot\) denotes the element-wise multiplication.

In the above formulas, the second-order derivatives \(div_{xx} \left( {q_{t} } \right)\), \(div_{yy} \left( {q_{t} } \right)\), \(curl_{xx} \left( {q_{t} } \right)\) and \(curl_{yy} \left( {q_{t} } \right)\) describe the change rates of the first-order derivatives of divergence and curl along the \(x\) and \(y\) directions. They form the second-order spatial kinematic fields, including the second-order spatial divergence fields \(Field^{{div_{xx} }}\) and \(Field^{{div_{yy} }}\) as well as the second-order spatial curl fields \(Field^{{curl_{xx} }}\) and \(Field^{{curl_{yy} }}\), which depict more detailed motion information. The second-order derivatives \(div_{tt} \left( {q_{t} } \right)\) and \(curl_{tt} \left( {q_{t} } \right)\) correspond to the change rates of the first-order derivatives of divergence and curl along the \(t\) direction. Thus, the second-order temporal kinematic fields, including the second-order temporal divergence field \(Field^{{div_{tt} }}\) and curl field \(Field^{{curl_{tt} }}\), capture the accelerations of divergence and curl.

4. Joint coding

The set of kinematic fields obtained above usually has high dimensionality and strong correlation, which poses great challenges for the subsequent joint feature coding. Therefore, the above kinematic fields are first reduced in dimension and then jointly encoded to obtain the proposed multi-order divergence feature and multi-order curl feature. The specific process is as follows, taking the divergence fields as an example.

(a) Feature dimension reduction. For the \(j{\text{-th}}\) frame in the \(i{\text{-th}}\) video, its fields \(Field_{i,j}^{div}\), \(Field_{i,j}^{{div_{x} }}\), \(Field_{i,j}^{{div_{y} }}\), \(Field_{i,j}^{{div_{t} }}\), \(Field_{i,j}^{{div_{xx} }}\), \(Field_{i,j}^{{div_{yy} }}\) and \(Field_{i,j}^{{div_{tt} }}\) are each reduced in dimension by two-dimensional principal component analysis (2DPCA) [25], so as to obtain the corresponding low-dimensional representation \(FIELD_{i,j}^{div} = \left[ {\hat{F}ield_{i,j}^{div} ,\hat{F}ield_{i,j}^{{div_{x} }} ,\hat{F}ield_{i,j}^{{div_{y} }} ,\hat{F}ield_{i,j}^{{div_{t} }} ,\hat{F}ield_{i,j}^{{div_{xx} }} ,\hat{F}ield_{i,j}^{{div_{yy} }} ,\hat{F}ield_{i,j}^{{div_{tt} }} } \right] \in R^{d}\), where \(i = 1,2, \ldots ,G\) and \(G\) is the number of videos; \(j = 1,2, \ldots ,Q_{i}\) and \(Q_{i}\) is the number of frames in the \(i{\text{-th}}\) video; \(d\) is the dimension of \(FIELD_{i,j}^{div}\).

(b) Feature coding. The Fisher vector [26] is used to jointly encode the above low-dimensional representations: a Gaussian mixture model (GMM) with \(K\) components is used to create the Fisher vectors, and L2 normalization is then applied to obtain the multi-order divergence feature set \(Ms^{div} = \left\{ {Ms_{1}^{div} ,Ms_{2}^{div} , \ldots ,Ms_{G}^{div} } \right\} \in R^{G \times O}\) for all videos, where \(O = 2dK\). In the same way, the multi-order curl feature set \(Ms^{curl} = \left\{ {Ms_{1}^{curl} ,Ms_{2}^{curl} , \ldots ,Ms_{G}^{curl} } \right\} \in R^{G \times O}\) is also obtained. A coding sketch is given after this list.

2.2 Construction of discriminative kinematic descriptor

The proposed \(Ms^{div}\) and \(Ms^{curl}\) depict the dynamic characteristics of the action subject at multiple levels and from different perspectives, and there is a certain complementarity between them. Fusing them therefore yields a more complete feature representation that delineates the action subject in complex environments more precisely. This section proposes a discriminative neural network fusion method to fuse \(Ms^{div}\) and \(Ms^{curl}\); consequently, the proposed discriminative kinematic descriptor is obtained, as shown in Fig. 3. The specific process is as follows.

1. Introduction of a single tight-loose constraint term

Given a feature set \(\varvec{Z} = f\left( {Ms^{div} ,Ms^{curl} ;{\varvec{\Theta}}} \right)\), where \(f\left( \cdot \right)\) denotes the feature projection function and \({\varvec{\Theta}} = \left\{ {\Theta_{1} ,\Theta_{2} , \ldots ,\Theta_{\omega } } \right\}\) is the model parameter set with \(\omega\) parameters, a single tight-loose constraint \(\varvec{TL}_{Z}\) is first introduced, as shown in Eq. (7).

$$\varvec{TL}_{Z} = \frac{1}{G}\sum\limits_{\xi = 1}^{C} {\sum\limits_{\eta = 1}^{{C^{\xi } }} {\sum\limits_{\gamma = 1,\gamma \ne \xi }^{C} {\frac{{\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\xi } } \right\|}}{{\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\gamma } } \right\|}}} } }$$
(7)

where \(C\) is the number of action classes; \(C^{\xi }\) represents the number of features of the \(\xi {\text{-th}}\) class; \(\varvec{Z}_{\eta }^{\xi } \in R^{1 \times 2O}\) denotes the \(\eta {\text{-th}}\) feature of the \(\xi {\text{-th}}\) class in \(\varvec{Z}\); \(\bar{\varvec{Z}}^{\gamma } \in R^{1 \times 2O}\) and \(\bar{\varvec{Z}}^{\xi } \in R^{1 \times 2O}\) represent the feature centers of the \(\gamma {\text{-th}}\) and \(\xi {\text{-th}}\) classes, respectively, namely the mean values of the corresponding features.

2. Introduction of an anti-confusion constraint term

It is known that there are usually many outliers in the feature space. The distances from these outliers to the feature centers of their own classes are often larger than their distances to the feature centers of other classes, which seriously affects the discriminativeness of features. In order to reduce the confusion caused by outliers, an anti-confusion constraint \(\varvec{AC}_{Z}\) is introduced as a penalty term to measure the degree of confusion between different classes of features, as shown in Eq. (8).

$$\varvec{AC}_{Z} = \frac{1}{G}\sum\limits_{\xi = 1}^{C} {\sum\limits_{\eta = 1}^{{C^{\xi } }} {\sum\limits_{\gamma = 1,\gamma \ne \xi }^{C} {relu\left( {\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\xi } } \right\| - \left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\gamma } } \right\|} \right)} } }$$
(8)

where \(relu\left( \cdot \right)\) represents the rectified linear unit (ReLU) [27].

3. Construction of the objective function for the proposed fusion method

By introducing both of the constraint terms \(\varvec{TL}_{Z}\) and \(\varvec{AC}_{Z}\) into the cross-entropy loss function, the objective function of the proposed fusion method is obtained, as shown in Eq. (9).

$$\begin{aligned} \min \, J & = - \sum\limits_{g = 1}^{G} {\sum\limits_{c = 1}^{C} {y^{\prime}_{g} \log \left( {p\left( {y_{g} = \left. c \right|\varvec{Z}_{g} } \right)} \right)} } + \varvec{TL}_{Z} + \varvec{AC}_{Z} \\ & = - \sum\limits_{g = 1}^{G} {\sum\limits_{c = 1}^{C} {y^{\prime}_{g} \log \left( {p\left( {y_{g} = \left. c \right|\varvec{Z}_{g} } \right)} \right)} } \\ & \quad + \frac{1}{G}\sum\limits_{\xi = 1}^{C} {\sum\limits_{\eta = 1}^{{C^{\xi } }} {\sum\limits_{\gamma = 1,\gamma \ne \xi }^{C} {\frac{{\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\xi } } \right\|}}{{\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\gamma } } \right\|}}} } } \\ & \quad + \frac{1}{G}\sum\limits_{\xi = 1}^{C} {\sum\limits_{\eta = 1}^{{C^{\xi } }} {\sum\limits_{\gamma = 1,\gamma \ne \xi }^{C} {relu\left( {\left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\xi } } \right\| - \left\| {\varvec{Z}_{\eta }^{\xi } - \bar{\varvec{Z}}^{\gamma } } \right\|} \right)} } } \\ \end{aligned}$$
(9)

where \(y_{g}\) and \(y^{\prime}_{g}\) are, respectively, the predicted label and true label of the \(g{\text{-th}}\) sample; \(\varvec{Z}_{g}\) represents the \(g{\text{-th}}\) feature in \(\varvec{Z}\).

As can be seen from Eq. (9), during optimization the proposed \(\varvec{TL}_{Z}\), by computing the relative distances between each feature and its own feature center as well as the feature centers of other classes, pulls each feature point closer to its own feature center and pushes it farther from the feature centers of other classes. That is, the within-class compactness is enhanced and the between-class separability is increased simultaneously, so the discriminativeness of the features is improved. Further, the proposed \(\varvec{AC}_{Z}\), by accumulating the sum of error distances in the feature space, reduces the between-class confusion caused by outliers.

4. Acquisition of the proposed discriminative kinematic descriptor

Here, a three-layer neural network, called the discriminative fusion network, is constructed to achieve the fusion of \(Ms^{div}\) and \(Ms^{curl}\). This network takes training samples from \(Ms^{div}\) and \(Ms^{curl}\) as inputs and uses Eq. (9) as the objective function. By minimizing Eq. (9) with the stochastic gradient descent (SGD) algorithm, the optimal model parameter set \({\varvec{\Theta}}^{*}\) is acquired and the discriminative fusion of \(Ms^{div}\) and \(Ms^{curl}\) is achieved. That is, the proposed discriminative kinematic descriptor \(\varvec{F}_{kinematic} = f\left( {Ms^{div} ,Ms^{curl} ;{\varvec{\Theta}}^{*} } \right)\) is obtained; a minimal sketch of such a fusion network and its objective is given below.
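The following PyTorch sketch is illustrative only: the layer widths, batch construction and toy dimensions are assumptions, and the constraint terms are estimated from batch-wise class centers, whereas the paper does not specify how the centers are computed during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeFusionNet(nn.Module):
    """Three-layer fusion network: (Ms^div, Ms^curl) -> fused feature Z -> class scores."""
    def __init__(self, feat_dim: int, fused_dim: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim, fused_dim), nn.ReLU(),
                                  nn.Linear(fused_dim, fused_dim), nn.ReLU())
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, ms_div, ms_curl):
        z = self.fuse(torch.cat([ms_div, ms_curl], dim=1))   # fused feature Z
        return z, self.classifier(z)

def tl_ac_terms(z, labels, num_classes):
    """Batch estimates of the tight-loose term TL_Z (Eq. 7) and anti-confusion term AC_Z (Eq. 8)."""
    # Per-class feature centers; assumes every class has several samples in the batch.
    centers = torch.stack([z[labels == c].mean(0) for c in range(num_classes)])
    d_own = (z - centers[labels]).norm(dim=1)                 # ||Z_eta^xi - Zbar^xi||
    d_all = torch.cdist(z, centers)                           # ||Z_eta^xi - Zbar^gamma|| for all gamma
    own = F.one_hot(labels, num_classes).bool()
    d_other = d_all.masked_fill(own, float('inf'))            # exclude gamma == xi from the ratio
    tl = (d_own[:, None] / d_other).sum() / z.size(0)         # Eq. (7)
    ac = F.relu(d_own[:, None] - d_all).sum() / z.size(0)     # Eq. (8); own-class column contributes 0
    return tl, ac

# One SGD step on a toy batch (O and C are small stand-ins; in the paper O = 2dK, C = 101 or 51).
O, C, B = 512, 8, 64
model = DiscriminativeFusionNet(feat_dim=O, fused_dim=256, num_classes=C)
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
ms_div, ms_curl = torch.randn(B, O), torch.randn(B, O)
labels = torch.arange(B) % C                                  # guarantees every class is present
z, logits = model(ms_div, ms_curl)
tl, ac = tl_ac_terms(z, labels, C)
loss = F.cross_entropy(logits, labels) + tl + ac              # Eq. (9)
opt.zero_grad(); loss.backward(); opt.step()
# After training, the fused feature z produced for each video plays the role of F_kinematic.
```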

Overall, the multi-order divergence feature \(Ms^{div}\) and multi-order curl feature \(Ms^{curl}\) are first obtained from a set of kinematic fields; they possess better discriminativeness and meanwhile remove the camera motion and slowly changing background. Then, in order to acquire a more complete feature representation, the discriminative fusion method is proposed to fuse \(Ms^{div}\) and \(Ms^{curl}\). Consequently, the discriminative kinematic descriptor is obtained, which possesses better within-class compactness and between-class separability and is robust to outliers. Moreover, no additional interest point detection is needed, so the computational cost is significantly reduced and the negative effects of inaccurate interest point detection on action recognition are avoided. All of these properties are very useful for action recognition.

3 Deep attention-pooled descriptor

When performing action recognition, both dynamic and static information are very significant clues. In fact, when recognizing action classes that are closely related to specific objects or action scenes, static features play a crucial role. This section aims to obtain the discriminative static information in the background for action recognition; for this purpose, a deep attention-pooled descriptor is constructed.

Firstly, the architecture of Inception-v3 network is briefly introduced. Then, the prediction-attentional pooling method is proposed. Subsequently, it is applied to both lower layer and higher layer of Inception-v3 for acquiring the proposed deep local attentional feature and deep global attentional feature. Finally, by concatenating the two attentional features, the proposed deep attention-pooled descriptor is constructed, as shown in Fig. 4.

Fig. 4

Schematic of the proposed deep attention-pooled descriptor

3.1 Introduction of architecture of Inception-v3

The Inception-v3 deep neural network was developed by Google. It is a 42-layer deep convolutional neural network consisting of multiple Inception modules. Each Inception module contains four convolutional branches, whose kernel sizes can be freely selected from \(5 \times 5\), \(3 \times 3\) and \(1 \times 1\), so that information at different scales can be synthesized. Compared with the Inception-v2 network [28], Inception-v3 adopts a combination of \(1 \times n\) and \(n \times 1\) convolutional kernels instead of the original \(n \times n\) kernel, which significantly reduces the number of parameters. In addition, Inception-v3 adopts global average pooling (GAP), rather than a traditional fully connected layer, to obtain the feature vector at the end of the network. A sketch of the factorized convolution is given below.
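As an illustration of the \(1 \times n\)/\(n \times 1\) factorization mentioned above, the generic PyTorch sketch below shows the idea; it is not the paper's implementation, and the channel counts and \(n\) are arbitrary.

```python
import torch
import torch.nn as nn

class FactorizedConv(nn.Module):
    """An n x n convolution replaced by a 1 x n convolution followed by an n x 1 convolution,
    reducing the weights roughly from n*n*C_in*C_out to n*C_out*(C_in + C_out)."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, n), padding=(0, n // 2)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(n, 1), padding=(n // 2, 0)))

    def forward(self, x):
        return self.conv(x)

print(FactorizedConv(192, 192)(torch.randn(1, 192, 17, 17)).shape)  # torch.Size([1, 192, 17, 17])
```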

3.2 Extraction of deep local and global features

In fact, a deep network learns different features at each level of its layer hierarchy. Specifically, activations in lower layers have smaller receptive fields and are more sensitive to edge-like patterns and corners, while activations in higher layers have larger receptive fields, learn more global and high-level feature representations, and capture more complex invariances. However, Inception-v3 only uses the top layer of the network, which is not enough for describing fine-grained detail.

In order to obtain a more complete feature representation, a local feature and a global feature are extracted from a lower layer and a higher layer of Inception-v3, respectively, which lays the foundation for the proposed deep local attentional feature and deep global attentional feature. Specifically, (1) the Mixed_5c layer of Inception-v3, with size \(35 \times 35 \times 288\), is selected as the deep local feature \(\varvec{X}^{L}\), where \(35 \times 35\) is the number of regions in a video frame and 288 is the dimension of the feature vector for each region. The \(35 \times 35\) scale is chosen because classical hand-crafted features usually adopt an \(8 \times 8\) region for local feature extraction, and when this is mapped to Inception-v3, the most similar window scale is \(35 \times 35\). (2) The Mixed_7a layer of Inception-v3, with size \(8 \times 8 \times 1280\), is taken as the deep global feature \(\varvec{X}^{H}\), where \(8 \times 8\) is the number of regions in a video frame and \(1280\) is the dimension of the feature vector for each region. Mixed_7a is chosen instead of the last layer, Mixed_7c, because the feature maps of Mixed_7c have very large receptive fields, meaning that each pixel in the feature maps corresponds to the whole input image [29]; thus different locations cannot be assigned different weights, and it becomes impossible to highlight the discriminative regions, which would be disadvantageous for acquiring the proposed deep global attentional feature. A feature-extraction sketch is given below.
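The sketch below uses torchvision's Inception-v3 implementation, whose Mixed_5c and Mixed_7a modules produce the same \(35 \times 35 \times 288\) and \(8 \times 8 \times 1280\) tensors, rather than the TensorFlow model fine-tuned in the paper; weight loading and preprocessing are omitted and assumed.

```python
import torch
from torchvision.models import inception_v3

# In practice the network would carry pretrained and fine-tuned weights;
# random weights are enough to demonstrate the tensor shapes.
model = inception_v3(weights=None, aux_logits=True).eval()

features = {}
def grab(name):
    def hook(_module, _inputs, output):
        features[name] = output.detach()
    return hook

model.Mixed_5c.register_forward_hook(grab('local'))    # (N, 288, 35, 35)  -> X^L
model.Mixed_7a.register_forward_hook(grab('global'))   # (N, 1280, 8, 8)   -> X^H

key_frame = torch.randn(1, 3, 299, 299)                # a preprocessed key-frame (placeholder)
with torch.no_grad():
    model(key_frame)

x_local = features['local'].flatten(2).transpose(1, 2)    # (1, 35*35, 288): N regions x D
x_global = features['global'].flatten(2).transpose(1, 2)  # (1, 8*8, 1280)
```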

3.3 Proposed prediction-attentional pooling method

It is well known that current deep networks usually adopt GAP, rather than a fully connected layer, to compress the feature maps at the end of the network and obtain global features. However, GAP treats all regions inside the feature maps as equally important, which may reduce the discriminativeness of the features [30]. Therefore, some methods [31, 32] use the attention mechanism to highlight the discriminative regions, and others extend the single-channel attention mechanism to multiple channels to enhance the discriminativeness of features. Yan et al. [33] proposed a multi-branch attention network that obtains attention maps from scene-level and region-level context perspectives, respectively, and then integrates the two context branches to acquire the final attentional regions. Girdhar et al. [29] utilized low-rank second-order pooling to obtain multiple attention maps from bottom-up and top-down perspectives, respectively, and then combined these attention maps to acquire the final attentional regions. However, the above methods fuse different attention maps in simple ways without highlighting the more discriminative attention maps, so the final attentional regions are not accurate enough. To address this problem, a novel prediction-attentional pooling method is proposed, which aims to focus more accurately on the significant discriminative regions while suppressing irrelevant background interference. The details are as follows.

Given an extracted deep feature map \(\varvec{X} = \left[ {\varvec{X}_{1}^{T} , \ldots ,\varvec{X}_{i}^{T} , \ldots ,\varvec{X}_{N}^{T} } \right]^{T} \in R^{N \times D}\), where each \(\varvec{X}_{i} \in R^{1 \times D}\) maps to a distinct overlapping region of the input space, \(N\) denotes the number of regions in a video frame and \(D\) is the dimension of the feature vector for each region, the proposed prediction-attentional pooling method is briefly summarized as follows. Firstly, attentions with \(C\) channels are constructed, where the number of channels equals the number of classes and a single weight is learned for each channel, so that distinct aspects of the deep feature are attended to. The attentional heatmap of \(\varvec{X}\) in each channel is then calculated to obtain the attentional heatmap set \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\), where \(\varvec{M}^{c} \in R^{1 \times N}\) is the \(c{\text{-th}}\) attentional heatmap. Secondly, the predictions of the deep network are taken as weights to fuse \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\), yielding the weighted fusion attentional heatmap \(\varvec{M}^{fuse} \in R^{1 \times N}\). Thirdly, \(\varvec{M}^{fuse}\) is used as the weight of \(\varvec{X}\) to further enhance the effect of important local regions. Consequently, a more accurate and discriminative deep feature is obtained. The specific calculation process is shown below.

1. Acquisition of attentional heatmaps for \(C\) channels. Firstly, in order to obtain \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\), a convolutional kernel \(\varvec{a}^{c} \in R^{1 \times D}\) is applied on each channel to acquire attentional heatmaps from different perspectives. Specifically, a softmax function that generates the attention distribution over the regions of the image is adopted for each channel to obtain \(M_{i}^{c}\), as shown in Eq. (10).

    $$M_{i}^{c} = \frac{{\exp \left( {\varvec{a}^{c} \varvec{X}_{i}^{T} } \right)}}{{\sum\nolimits_{j = 1}^{N} {\exp \left( {\varvec{a}^{c} \varvec{X}_{j}^{T} } \right)} }}$$
    (10)

where \(M_{i}^{c}\) represents the \(i{\text{-th}}\) element of \(\varvec{M}^{c}\), namely the attentional weight of the \(i{\text{-th}}\) vector \(\varvec{X}_{i}\) in the \(c{\text{-th}}\) channel. The larger \(M_{i}^{c}\) is, the more important \(\varvec{X}_{i}\) is in the \(c{\text{-th}}\) channel. Applying Eq. (10) to each channel yields the attentional heatmap set \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\).

2. Acquisition of the weighted fusion attentional heatmap \(\varvec{M}^{fuse}\) for \(C\) channels. The motivation is that different actions activate different attentional heatmaps. In fact, different channels of the attentional heatmaps capture different regions related to the action subject, discriminative objects and background, and in certain circumstances some channels are more important than others. Therefore, higher weights should be assigned to the discriminative channels that play more significant roles in action recognition.

To highlight the contributions of the discriminative channels related to \(\varvec{X}\), the prior probability of \(\varvec{X}\) belonging to the \(c{\text{-th}}\) class is adopted as the weight of \(M_{i}^{c}\) for the weighted fusion of \(\left\{ {\varvec{M}^{1} , \ldots ,\varvec{M}^{c} , \ldots ,\varvec{M}^{C} } \right\}\), so as to obtain the weighted fusion attentional heatmap \(\varvec{M}^{fuse}\). The \(i{\text{-th}}\) element \(M_{i}^{fuse}\) of \(\varvec{M}^{fuse}\) is calculated as shown in Eq. (11).

$$M_{i}^{fuse} = \sum\limits_{c = 1}^{C} { \, p\left( {y = \left. c \right|\varvec{X}} \right)M_{i}^{c} }$$
(11)

where \(y\) is the class label and \(p\left( {y = \left. c \right|\varvec{X}} \right)\) is the prior probability of \(\varvec{X}\) belonging to the \(c{\text{-th}}\) class. As can be seen from Eq. (11), the larger \(p\left( {y = \left. c \right|\varvec{X}} \right)\) is, the larger the weight of \(M_{i}^{c}\) and hence its contribution to \(M_{i}^{fuse}\); that is, Eq. (11) assigns larger weights to the more discriminative channels.

Furthermore, regarding the calculation of \(p\left( {y = \left. c \right|\varvec{X}} \right)\), the structure of the deep network implies that, during forward propagation, the feature map of each layer is obtained deterministically from the feature map of the preceding layer through basic matrix operations. That is, the conditional probability \(p\left( {\varvec{b}\left| \varvec{X} \right.} \right) = 1\) holds, where \(\varvec{b}\) is the bottleneck vector of the deep network. Thereby, the following derivation holds:

$$\begin{aligned} p\left( {y = c\left| \varvec{X} \right.} \right) & = {{p\left( {y = c,\varvec{X}} \right)} / {p\left( \varvec{X} \right)}} \\ & = {{p\left( {y = c,\varvec{b},\varvec{X}} \right)} / {p\left( {\varvec{b},\varvec{X}} \right)}} \\ & = p\left( {y = c\left| {\varvec{b},\varvec{X}} \right.} \right) \\ & = p\left( {y = c\left| \varvec{b} \right.} \right) \\ \end{aligned}$$
(12)

where \(p\left( {y = \left. c \right|\varvec{b}} \right)\) represents the probability of \(\varvec{b}\) belonging to the \(c{\text{-th}}\) class, namely the prediction of deep network.

It can be seen from Eq. (12) that the probability of \(\varvec{X}\) belonging to the \(c{\text{-th}}\) class equals the prediction of the deep network, which can be obtained by fine-tuning the network on the video dataset. Thus, Eq. (11) is transformed as follows:

$$M_{i}^{fuse} = \sum\limits_{c = 1}^{C} { \, p\left( {y = \left. c \right|\varvec{b}} \right)M_{i}^{c} }$$
(13)

At this point, the weighted fusion attentional heatmap \(\varvec{M}^{fuse}\) is acquired.

3. Acquisition of the more accurate and discriminative deep feature \(\varvec{Atte}\). \(\varvec{M}^{fuse}\) is used to conduct weighted pooling on \(\varvec{X}\), yielding the attentional feature \(\varvec{Atte}\), as shown in Eq. (14).

$$\varvec{Atte} = \varvec{M}^{fuse} \varvec{X} = \sum\limits_{i = 1}^{N} {\sum\limits_{c = 1}^{C} { \, p\left( {y = \left. c \right|\varvec{b}} \right)M_{i}^{c} \varvec{X}_{i} } }$$
(14)

In order to obtain \(\varvec{Atte}\) automatically, the SGD algorithm is utilized to minimize the objective function of the network, as shown in Eq. (15).

$$\min \, J = - \sum\limits_{g = 1}^{G} {\sum\limits_{c = 1}^{C} {y^{\prime}_{g} \log \left( {p\left( {y_{g} = c\left| {\varvec{Atte}_{g} } \right.} \right)} \right)} } + \zeta_{1} \sum\limits_{i = 1}^{N} {\left( {M_{i}^{fuse} } \right)^{2} } + \zeta_{2} \sum\limits_{c = 1}^{C} {\left\| {\varvec{a}^{c} } \right\|_{2} }$$
(15)

where \(\varvec{Atte}_{g}\) is the deep attentional feature of the \(g{\text{-th}}\) sample; \(p\left( {y_{g} = \left. c \right|\varvec{Atte}_{g} } \right)\) denotes the probability of the \(g{\text{-th}}\) sample belonging to the \(c{\text{-th}}\) class; \(\zeta_{1}\) and \(\zeta_{2}\) denote the attentional regularization coefficient and weight decay coefficient, respectively; \(\left\| {\, \cdot \,} \right\|_{2}\) is the \(l_{2}{\text{-norm}}\).

In summary, the proposed prediction-attentional pooling method adopts the network predictions as weights for the weighted fusion of the attentional heatmaps of multiple channels, yielding the weighted fusion attentional heatmap \(\varvec{M}^{fuse}\), which highlights the contributions of the discriminative channels while suppressing irrelevant background interference. Furthermore, \(\varvec{M}^{fuse}\) is used as the weight of the deep feature map \(\varvec{X}\) to enhance the effect of local regions that are important for action recognition. Consequently, the more accurate and discriminative deep feature \(\varvec{Atte}\) is obtained; a minimal sketch of the whole pooling procedure follows.
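The NumPy sketch below strings Eqs. (10), (13) and (14) together for one feature map; the attention kernels \(\varvec{a}^{c}\) and the prediction vector are random placeholders here, whereas in the paper they are learned jointly with the network via Eq. (15).

```python
import numpy as np

def prediction_attentional_pooling(X, A, p):
    """X: (N, D) region features; A: (C, D) attention kernels a^c; p: (C,) predictions p(y=c|b).

    Returns the pooled attentional feature Atte of shape (D,).
    """
    scores = A @ X.T                                                # (C, N): a^c X_i^T
    scores -= scores.max(axis=1, keepdims=True)                     # numerical stability
    M = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # Eq. (10): heatmaps M^c
    M_fuse = p @ M                                                  # Eq. (13): prediction-weighted fusion, (N,)
    return M_fuse @ X                                               # Eq. (14): weighted pooling over regions

# Toy usage with the local-feature sizes (N = 35*35 regions, D = 288) and C classes.
rng = np.random.default_rng(0)
N, D, C = 35 * 35, 288, 101
X = rng.standard_normal((N, D))
A = 0.01 * rng.standard_normal((C, D))
p = np.full(C, 1.0 / C)                                 # a flat prediction vector, for illustration only
atte_local = prediction_attentional_pooling(X, A, p)    # Atte^L, shape (288,)
```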

3.4 Construction of deep attention-pooled descriptor

In this section, the proposed deep local attentional feature \(\varvec{Atte}^{L}\) and deep global attentional feature \(\varvec{Atte}^{H}\) are firstly obtained. Then, the proposed deep attention-pooled descriptor is constructed.

Specifically, the proposed prediction-attentional pooling method is applied to both the deep local feature \(\varvec{X}^{L}\) and the deep global feature \(\varvec{X}^{H}\), yielding \(\varvec{Atte}^{L}\) and \(\varvec{Atte}^{H}\). Obviously, \(\varvec{Atte}^{L}\) mainly focuses on detail information such as texture and edge orientation, while \(\varvec{Atte}^{H}\) usually contains global body information and gives a holistic, abstract description of the action. Therefore, in order to comprehensively depict the discriminative information of the action scene, \(\varvec{Atte}^{L}\) and \(\varvec{Atte}^{H}\) are concatenated to construct the proposed deep attention-pooled descriptor \(\varvec{F}_{{attention - pooled}} = \left[ {\varvec{Atte}^{L} ,\varvec{Atte}^{H} } \right]\).

In summary, by combining \(\varvec{Atte}^{L}\) and \(\varvec{Atte}^{H}\), the proposed deep attention-pooled descriptor can more comprehensively and accurately depict the static visual appearance information of action scene and discriminative object in a video, which improves the discriminativeness of features, and is very useful for action recognition.

4 Experiments and analysis

In this section, the comparisons and analysis on experimental results of the proposed methods for action recognition are reported on two challenging video datasets, namely UCF101 and HMDB51. The illustrations of their representative frames are provided in Fig. 5.

Fig. 5

Representative frames from videos in UCF101 and HMDB51 datasets

4.1 Datasets and experimental settings

4.1.1 Introduction of datasets

The UCF101 [1] dataset includes 13,320 videos of 101 action classes. It exhibits the largest diversity in terms of actions, with large variations in camera motion, viewpoint, object appearance and pose, illumination conditions, cluttered background, object scale, etc. The videos of each action class are divided into 25 groups, where videos from the same group may share a similar background and viewpoint. The standard protocol of three train-test splits [34] is used in our experiments, and the average accuracy is adopted as the final performance measure.

The HMDB51 [2] dataset is collected from various sources and covers a wide variety of lighting conditions, surroundings and situations in which actions occur. The camera motion includes traveling shots, camera shaking, zooming, etc. In total, the dataset contains 6766 video clips divided into 51 action classes, each with at least 101 clips. The original evaluation scheme of three train-test splits [2] is adopted in our experiments; each split includes 70 training videos and 30 testing videos per class. The average result over the three splits is used to evaluate the final performance.

4.1.2 Experimental settings

(1) Parameter setting for the proposed discriminative kinematic descriptor. The number of Gaussians \(K\) in the Fisher vector is set to 128. (2) Parameter settings for Inception-v3. Inception-v3 is utilized in this paper and is pre-trained on the ILSVRC2012 dataset [35]. The TensorFlow open-source library [36] provided by Google is used to build the CNN framework, and the parameters of Inception-v3 are fine-tuned on the UCF101 and HMDB51 datasets using 4 NVIDIA Titan X GPUs. The SGD algorithm is adopted for training the network, with a batch size of 50, a momentum of 0.9, a learning rate of 0.0001, a weight decay of 0.0005 and a dropout ratio of 0.9. (3) Parameter settings for the proposed deep attention-pooled descriptor. Similarly, SGD is adopted to train the proposed deep network, with a batch size of 200 and a learning rate of 0.001. (4) Classifier settings. For the proposed discriminative kinematic descriptor and the DKD–DAD framework, a linear SVM is used as the classifier. For the proposed deep attention-pooled descriptor, the output of the softmax layer of the network is directly used for action recognition.

4.2 Experiment on parameter selection

In this section, the UCF101 dataset is taken as an example, and the regularization coefficient \(\zeta_{1}\) and weight decay \(\zeta_{2}\) are optimized to show their significance for action recognition with the proposed deep attention-pooled descriptor. Specifically, \(\zeta_{1}\) and \(\zeta_{2}\) are set by searching the grids \(\left\{ {0.0005,\;0.005,\;0.05,\;0.5} \right\}\) and \(\left\{ {0.00005,\;0.0005,\;0.005,\;0.05} \right\}\), respectively. Figure 6 shows the recognition results under the different parameter settings.

Fig. 6

Accuracy of the proposed deep attention-pooled descriptor versus regularization coefficient and weight decay parameters on UCF101 dataset

Figure 6 shows that the best performance is achieved when \(\zeta_{1} = 0.05\) and \(\zeta_{2} = 0.0005\). Further, it can be observed that the accuracy begins to decline when \(\zeta_{1}\) and \(\zeta_{2}\) become larger or smaller, so a trade-off on \(\zeta_{1}\) and \(\zeta_{2}\) is necessary. In fact, when \(\zeta_{1}\) and \(\zeta_{2}\) deviate from these values, the deep attention-pooled descriptor can no longer accurately highlight the discriminative regions of the key-frame, the discriminability of the static features is weakened, and the recognition performance becomes worse. Consequently, \(\zeta_{1}\) and \(\zeta_{2}\) are set to \(0.05\) and \(0.0005\), respectively, on the UCF101 dataset in subsequent experiments. Similar results are observed on the HMDB51 dataset.

4.3 Action recognition with kinematic features

In this section, the proposed kinematic features are applied for action recognition to verify their effectiveness. Tables 1 and 2, respectively, show the recognition results of the proposed kinematic features, namely multi-order divergence feature and multi-order curl feature, as well as contrastive methods on UCF101 and HMDB51.

Table 1 Recognition result of the proposed kinematic features and contrastive methods on UCF101 dataset
Table 2 Recognition result of the proposed kinematic features and contrastive methods on HMDB51 dataset

As can be seen from Tables 1 and 2, the proposed kinematic features achieve better accuracy than all contrastive methods. The reason is that both features, by transforming the optical flow field into a set of more discriminative kinematic fields, acquire different dynamic characteristics of the optical flow field at multiple levels and from various perspectives. In fact, they capture the spatiotemporal characteristics of the action subject, so they can more accurately depict the detailed motion information of the subject while removing the camera motion and slowly changing background. Consequently, the accuracies are improved.

4.4 Action recognition with discriminative kinematic descriptor

This section aims to demonstrate the effectiveness of the proposed discriminative kinematic descriptor, namely the validity of the proposed discriminative fusion method. Taking the UCF101 and HMDB51 datasets as examples, Tables 3 and 4, respectively, show the recognition results of the proposed discriminative kinematic descriptor obtained by the proposed discriminative fusion method, together with the results of simple concatenation and linear weighted fusion of the proposed multi-order divergence feature and multi-order curl feature, as well as the results of the contrastive methods. The weight coefficients of the linear weighted fusion are obtained by grid search.

Table 3 Recognition result of the proposed discriminative kinematic descriptor and contrastive methods on UCF101 dataset
Table 4 Recognition result of the proposed discriminative kinematic descriptor and contrastive methods on HMDB51 dataset

It can be observed from Tables 3 and 4 that: (1) The proposed discriminative kinematic descriptor outperforms all contrastive methods. (2) The result of concatenating the multi-order divergence feature and multi-order curl feature is superior to the results of using any of them alone, which indicates that there is really some complementary information between them. (3) The performance of linear weighted fusion is better than that of concatenation, which indicates that the proposed multi-order divergence feature and multi-order curl feature have different contribution degrees to action recognition. Based on the above observations, the best performance of the proposed discriminative kinematic descriptor is owing to the following contributions. The proposed discriminative fusion method, by introducing the tight-loose constraint term, reduces the within-class variations while also increasing the between-class differences. That is, the proposed discriminative kinematic descriptor possesses better within-class compactness and between-class separability simultaneously. In addition, by further introducing the anti-confusion constraint term, the confusion caused by outliers is reduced, which enhances the discriminativeness and robustness of the proposed kinematic descriptor. Consequently, the performance is improved effectively.

4.5 Action recognition with prediction-attentional pooling method

This section aims to verify the effectiveness of the proposed prediction-attentional pooling method. Taking the split 1 of UCF101 dataset for example, Fig. 7 demonstrates the recognition accuracies with applying the proposed pooling method, GAP, max pooling (MAX) and classical attention pooling method on the extracted deep local feature \(\varvec{X}^{L}\) and deep global feature \(\varvec{X}^{H}\).

Fig. 7

Recognition accuracy of different pooling methods on UCF101 dataset (split 1)

It can be seen from Fig. 7 that: (1) For both the deep local and global features, the classical attention pooling method achieves better accuracy than GAP and MAX, because the attention mechanism highlights the contribution of discriminative local regions. (2) The proposed pooling method outperforms the classical attention pooling method, because it adopts the network predictions as weights to fuse the attentions of different channels, which further highlights the contributions of the discriminative channels. Consequently, the discriminative regions are accurately located and the accuracy is effectively improved.

4.6 Action recognition with deep attention-pooled descriptor

In order to verify the effectiveness of the proposed deep attention-pooled descriptor, Tables 5 and 6, respectively, show the recognition results of this descriptor and contrastive methods on UCF101 and HMDB51 datasets.

Table 5 Recognition result of the proposed deep attention-pooled descriptor and contrastive methods on UCF101 dataset
Table 6 Recognition result of the proposed deep attention-pooled descriptor and contrastive methods on HMDB51 dataset

As shown in Tables 5 and 6, the proposed deep attention-pooled descriptor performs better than all contrastive methods. The reason is that, by combining the proposed deep local attentional feature and global attentional feature, the descriptor more accurately depicts the static visual appearance information of the action scene and the discriminative object in a video and enhances the discriminativeness of the static deep features. Consequently, the accuracies are effectively improved.

4.7 Heatmap visualization of prediction-attentional pooling method

To intuitively demonstrate the validity of the proposed prediction-attentional pooling method, Fig. 8 illustrates the visualization examples of heatmaps obtained by the proposed pooling method. For comparative analysis, the visualization examples of heatmaps obtained by classical attention pooling method are given simultaneously.

Fig. 8

Heatmap visualization of the proposed prediction-attentional pooling method and classical attention pooling method on UCF101 dataset. Row 1: original video frames; row 2: heatmaps obtained by classical attention pooling method; row 3: deep global attentional heatmaps obtained by the proposed pooling method; row 4: heatmaps obtained by superimposing the deep local attentional heatmaps on deep global attentional heatmaps

It can be seen from Fig. 8a that, for the “biking” action, compared with the visualization result of the classical attention pooling method, the deep global attentional heatmap obtained by the proposed pooling method highlights the discriminative region (the bicycle) while suppressing other irrelevant regions. Furthermore, by superimposing the deep local attentional heatmap on the deep global attentional heatmap, the more discriminative local regions (the bicycle wheels) are further highlighted. Similarly, for “riding horse” in Fig. 8b, “swinging on the pommel horse” in Fig. 8c and “playing violin” in Fig. 8d, the proposed pooling method also focuses more accurately on the discriminative objects “horse,” “pommel horse” and “violin” in the video frames. The same conclusions hold for “table tennis shot” in Fig. 8e and “surfing” in Fig. 8f.

4.8 Action recognition with DKD–DAD framework

This section aims to demonstrate the effectiveness of the proposed DKD–DAD framework. Tables 7 and 8, respectively, show the recognition results of the DKD–DAD and contrastive methods on UCF101 and HMDB51 datasets.

Table 7 Recognition result of the proposed DKD–DAD framework and contrastive methods on UCF101 dataset
Table 8 Recognition result of the proposed DKD–DAD framework and contrastive methods on HMDB51 dataset

From the above experimental results, it can be seen that the proposed DKD–DAD achieves better accuracy than all contrastive methods. This is because DKD–DAD combines the discriminative kinematic descriptor and the deep attention-pooled descriptor for action recognition, sharing the benefits of both hand-crafted and deep features and thus comprehensively acquiring the important dynamic characteristics and discriminative static information in a video. Consequently, the accuracies are effectively improved.

4.9 Experiments on running time

Running time plays a significant role in performance assessment, so the time consumption of the proposed methods is briefly reported, taking the UCF101 dataset as an example. (1) For a video containing 55 frames, extracting the discriminative kinematic descriptor takes approximately 51.50 s; since the proposed kinematic descriptor does not require interest point detection or trajectory tracking, the time is spent chiefly on computing optical flow. (2) The proposed deep attention-pooled descriptor takes about 46.14 ms per frame. (3) For the proposed DKD–DAD framework, the overall processing time for a 55-frame video is about 53.00 s. These experiments are run on a workstation with a 2.60 GHz CPU.

5 Conclusions

The following conclusions are drawn from this paper. Firstly, by transforming the optical flow field into a set of more discriminative kinematic fields, the dynamic characteristics hidden within the optical flow field are captured, and two kinematic features are constructed that more accurately depict the dynamic characteristics of the action subject from the multi-order divergence and curl fields while removing the camera motion and slowly changing background. Secondly, a discriminative fusion method is proposed: introducing a single tight-loose constraint guarantees better within-class compactness and between-class separability, while the additional anti-confusion constraint reduces the confusion caused by outliers. On this basis, the discriminative kinematic descriptor is constructed, which possesses better discriminativeness and robustness. Thirdly, a prediction-attentional pooling method is proposed, which adopts the predictions of the deep network as weights for the weighted fusion of the attention information of different channels and thus highlights the contributions of the discriminative channels; consequently, its attention is focused more accurately on discriminative regions while irrelevant background interference is suppressed. Furthermore, the deep attention-pooled descriptor is constructed, which captures the significant static visual appearance information of the action scene and the discriminative object in a video. Finally, a DKD–DAD framework is constructed by combining the proposed discriminative kinematic descriptor and deep attention-pooled descriptor, which comprehensively obtains the dynamic characteristics and static information and further improves the accuracy of action recognition. The proposed methods are extensively evaluated on two challenging datasets, UCF101 and HMDB51, where superior performance is achieved in comparison with a number of state-of-the-art methods. Future work will focus on designing deeper networks and more effective pooling methods to handle complex video concepts.