1 Introduction

Videos captured from realistic scenes usually contain various types of human actions, ranging from simple movements such as running and clapping to complex activities such as ice dancing and playing basketball. Recognizing the actions performed in videos is an important task in computer vision, with applications in intelligent video surveillance, patient monitoring, human-machine interaction, etc. Although extensive research has been devoted to this problem, extracting discriminative features from realistic videos for action recognition remains challenging.

The dense tracking technique, which extracts trajectory features by tracking densely sampled points across frames [44], has been shown to be successful in action recognition. However, traditional dense tracking does not discriminate between action-relevant regions (i.e., the regions corresponding to the body parts or objects that participate in the action) and background, and may produce dense features in both, especially in the presence of camera motion. In contrast, when humans recognize actions in a video, they instinctively focus on the action-relevant regions and only glance at the surroundings occasionally. Thus, the features most useful for action recognition are those from action-relevant regions. Background features may encode some contextual information, but they are generally far less informative than those in action-relevant regions, and incorporating them may degrade recognition performance.

An intuitive improvement is to perform dense tracking only in action-relevant regions, where the central issue is how to differentiate these regions from background. Since action-relevant regions are usually more salient than background, many methods [33, 39, 40, 53] rely on saliency detection to assign high saliency to action-relevant regions so that they can be highlighted and discriminated. However, action-relevant regions are not always equivalent to salient regions. For example, some large action-relevant regions (e.g., the torso of a runner) may not be salient because of the uniform motion within them, while some background regions may obtain high saliency because of irrelevant object motion (e.g., swinging leaves). In such cases, a generic saliency detection method may detect the wrong action-relevant regions. Thus, to recover action-relevant regions, a saliency detection method should be designed to exploit the difference between these regions and background more precisely. Partly inspired by the multi-stage idea used in many saliency detection methods [8, 14, 21, 38], we deduce that, if some definite background regions with very low motion saliency are segmented first, the action-relevant regions become easier to discriminate, because they tend to differ from the definite background regions more clearly than other regions do. In Fig. 1, we reconstruct some video frames using a linear combination of definite background regions in terms of motion information and find that the reconstruction errors of action-relevant regions are much larger than those of other regions. Based on this observation, we propose a three-stage saliency detection method to detect action-relevant regions for action recognition. Since saliency does not vary considerably within a small spatiotemporal area, we temporally split a video into sub-videos and detect the saliency of each sub-video at the patch level.

Fig. 1 Reconstruction error produced by reconstructing video frames using the linear combination of definite background regions in terms of motion information

The framework is shown in Fig. 2. In a realistic video, the background motion caused by camera movement is usually uniform and lies in a low-rank subspace, while action motion is often irregular and sparse. In the first stage, we therefore employ low-rank matrix recovery (LRMR) [48] to decompose the overall motion of a sub-video into a low-rank part and a sparse part, and use the latter to compute initial saliency. Some definite background regions thus gain very low saliency and can be segmented first. Among the remaining (candidate foreground) regions, the action-relevant ones tend to differ from the definite background regions more clearly than the background ones. In the second stage, we therefore learn a sparse representation (SR) for the candidate foreground regions using a dictionary formed by the definite background regions based on motion and appearance information, and use the reconstruction error to compute refined saliency, so that the action-relevant regions are more likely to be highlighted. Since saliency should be spatially coherent and some incorrect saliency values can be remedied by their neighbors, in the third stage we spatially update the saliency based on motion and appearance similarity to increase coherence. By comparing the saliency with a given threshold, a binary saliency map is created to indicate the action-relevant regions. Finally, we fuse the binary saliency maps into dense tracking to extract action-relevant trajectory features for action recognition.

Fig. 2 Framework of the three-stage action-relevant region detection method

The contributions of the proposed method are three-fold: 1) We propose to first segment some definite background regions based on the initial saliency obtained from the sparse part of the motion produced by the LRMR algorithm. 2) By learning a sparse representation for the candidate foreground regions over the dictionary formed by the definite background regions and computing refined saliency from the reconstruction error, the action-relevant regions are more likely to be highlighted. 3) The saliency is updated in the spatial domain to increase coherence, so that some incorrect saliency values can be rectified using their neighboring values.

It is worth noting that, although the action-relevant region detection method is proposed to improve the dense tracking technique for action recognition, it is essentially a standalone technique and can also be applied to many other existing action recognition frameworks, such as those based on dense sampling [42], context modeling [5] and deep learning [4], to detect the actors and related objects or to emphasize action-relevant features.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 details the three-stage saliency detection method for recovering action-relevant regions. Section 4 introduces saliency-based dense tracking for action recognition. Section 5 describes the experimental settings, datasets and results. Section 6 concludes the paper.

2 Related work

2.1 Action recognition

Over the past decades, extensive research has been devoted to human action recognition and related tasks, including recognizing actions in videos [7, 13, 26, 29, 33, 41, 44], recognizing activities from sensor data [18,19,20, 22], and recognizing and detecting human actions in untrimmed videos [9, 51]. We focus only on the works most related to the proposed method.

In early research, interest point tracking played an important role in action recognition. To obtain long-term information, Matikainen et al. [26] and Messing et al. [27] track a set of interest points through a video sequence with the KLT tracker [23] to extract trajectories. Sun et al. [36] extract trajectories by tracking SIFT salient points over consecutive frames via pairwise SIFT matching. Sun et al. [37] and Bregonzio et al. [1] combine the KLT tracker and SIFT matching to increase trajectory duration and density. However, recent research shows that extracting features at a finer granularity by dense tracking is more successful. Wang et al. [44] extract dense trajectories by tracking densely sampled points across video frames via dense optical flow to describe actions and significantly improve recognition performance.

Because of the harmful impact of background features, improved dense tracking techniques have drawn much attention in action recognition. To emphasize action-relevant features, some works improve dense tracking through saliency detection. Wang et al. [40] recover action-relevant regions in a video by detecting saliency through LRMR and perform dense tracking in these regions to extract action-relevant trajectories. Vig et al. [39] employ saliency-mapping algorithms to find informative regions and highlight the descriptors in these regions. Yi et al. [53] compute saliency maps from color, space and optical flow to obtain trajectories containing only foreground motion. Because background trajectory features are mostly induced by camera motion, other works aim to eliminate them by removing camera motion from videos. Wang et al. [41] prune background trajectories by estimating a homography with RANSAC to cancel out camera motion from the optical flow. Jain et al. [10] estimate the dominant camera motion with a 2D affine motion model and separate it from the optical flow. Wu et al. [49] use low-rank optimization to decompose dense trajectories into a camera-induced component and an object-induced component and use the latter to describe actions.

2.2 Saliency detection

Our action-relevant region detection method is partly inspired by related works on image and video saliency detection that adopt a multi-stage idea. Gao et al. [8] use a first-pass RPCA to identify likely foreground regions, and then employ a second-pass block-sparse RPCA with a fine-tuned parameter λ for finer saliency detection. Tong et al. [38] first compute saliency using a bottom-up method to build background and foreground dictionaries, and then obtain the saliency map using the LLC algorithm [43]. Li et al. [14] first extract boundary segment features to form background templates, then reconstruct the image with dense and sparse appearance models based on the templates and use the reconstruction error to measure saliency.

Similar to our method, some works [35, 52] also employ a combination of LRMR and SR for saliency detection. Yan et al. [52] first transform image patches into a feature space via SR using a learned dictionary, and then compute saliency from the sparse part produced by performing LRMR on the transformed features. A similar pipeline is adopted in [35] for video saliency detection: the feature matrix of video cuboids is first obtained by group SR and then decomposed into a low-rank part and a sparse part, the latter of which is used to compute saliency.

Although we also use a multi-stage idea and a combination of LRMR and SR, the pipeline of our method is quite different from the above saliency detection methods. We first compute initial saliency from the sparse part of the motion produced by LRMR to determine the candidate foreground and definite background, then calculate refined saliency by learning an SR that reconstructs the candidate foreground regions over a dictionary formed by the definite background, and finally improve the saliency with an updating operation. This pipeline is specially designed to compute saliency by exploiting the difference between action-relevant regions and background, which yields high saliency contrast and is therefore better suited to action-relevant region detection.

3 Action-relevant region detection

Since action-relevant regions do not vary considerably over a short time, we split the temporal duration of a video into a number of short sub-videos and compute a saliency map for each sub-video to indicate the action-relevant regions of all its frames. Moreover, since saliency also does not vary significantly within a small spatial region, we spatially divide each sub-video into small patches and detect the action-relevant regions at the patch level.

Assume a video with T frames is denoted by V = [I1, I2,⋯ , IT], where It is the frame at time t. By splitting the temporal duration evenly, we divide the video into K non-overlapping sub-videos, i.e., V = [V1, V2,⋯, VK], where \({{\mathbf {V}}_{k}} = \left [ {{{\mathbf {I}}_{(k - 1)w + 1}},{{\mathbf {I}}_{(k - 1)w + 2}}, {\cdots } ,{{\mathbf {I}}_{kw}}} \right ]\) and w is the temporal length of a sub-video. Each sub-video is then further spatially divided into MN non-overlapping patches of equal spatial size s × s. The divided Vk can be represented by the block matrix

$$ {{\mathbf{V}}_{k}} = \left[ {\begin{array}{*{20}{c}} {{{\mathbf{P}}_{1}}}&{{{\mathbf{P}}_{2}}}& {\cdots} &{{{\mathbf{P}}_{N}}} \\ {{{\mathbf{P}}_{N + 1}}}&{{{\mathbf{P}}_{N + 2}}}& {\cdots} &{{{\mathbf{P}}_{2N}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {{{\mathbf{P}}_{\left( {M - 1} \right)N + 1}}}&{{{\mathbf{P}}_{\left( {M - 1} \right)N + 2}}}& {\cdots} &{{{\mathbf{P}}_{MN}}} \end{array}} \right] $$
(1)

where Pn is the n-th patch and its size is s × s × w. In the following, we take Vk as an example to describe how the action-relevant regions of each sub-video are recovered through three-stage saliency detection.
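As a concrete illustration of this spatiotemporal division, the NumPy sketch below splits a grayscale video array into sub-videos of w frames and s × s × w patches. The function name and array layout are our own choices for illustration, not taken from the paper.

```python
import numpy as np

def split_into_patches(video, w=5, s=4):
    """Split a grayscale video of shape (T, H, W) into sub-videos of w frames
    and non-overlapping s x s x w patches, following Eq. (1).

    Returns a list with one array per sub-video, each of shape (M, N, s, s, w)."""
    T, H, W = video.shape
    K, M, N = T // w, H // s, W // s
    sub_videos = []
    for k in range(K):
        clip = video[k * w:(k + 1) * w, :M * s, :N * s]        # (w, M*s, N*s)
        # reorder to (M, N, s, s, w): patch grid first, then patch content
        patches = clip.reshape(w, M, s, N, s).transpose(1, 3, 2, 4, 0)
        sub_videos.append(patches)
    return sub_videos
```

Under this layout, patch Pn of Eq. (1) corresponds to `patches[(n-1)//N, (n-1) % N]` with 1-based n.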

3.1 Initial saliency detection

In this stage, we compute the initial saliency to differentiate definite background regions from candidate foreground, which can be accomplished based on the motion difference between action-relevant regions and background. If a video is captured with a moving camera, the background motion is spatially uniform and lies in a low-rank subspace, while the action motion is spatially irregular and can be seen as sparse error. For this reason, we use LRMR to separate the overall motion of a sub-video into a low-rank part and a sparse error part, and use the latter to measure the initial saliency, which is then used to classify the patches in the sub-video into a candidate foreground set and a definite background set.

Before utilizing LRMR to detect the saliency of sub-video Vk, we arrange the motion in it on patch level to form a motion matrix. To this end, we first obtain the motion vector of each patch by vectorizing the optical flow of all points in the patch in the order of spatiotemporal position. The motion vector of Pn is formed by \({\mathbf {x}}^{n} = [\begin {array}{*{20}{c}}{{u_{1}^{n}}}&{{v_{1}^{n}}}& {\cdots } &{{u_{i}^{n}}}&{{v_{i}^{n}}}& {\cdots } &{u_{s \times s \times w}^{n}}&{v_{s \times s \times w}^{n}} \end {array}]\), where \({u_{i}^{n}}\) and \({v_{i}^{n}}\) are respectively the horizontal and vertical components of the optical flow at the i th point in Pn. When the motion vectors of all patches in Vk are obtained, we stack them column-wise in the order of spatial position to obtain a motion matrix \({\mathbf {X}} = \left [ {\begin {array}{*{20}{c}}{{{\mathbf {x}}^{1}}}&{{{\mathbf {x}}^{2}}}& {\cdots } &{{{\mathbf {x}}^{MN}}} \end {array}} \right ]\), where X is a 2D matrix and its size is (2s2w) × (MN).
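A minimal sketch of how the motion matrix X might be assembled is given below. It assumes the dense optical flow of the sub-video is already available as a (w, H, W, 2) array of (u, v) components per pixel and frame (e.g., computed with an off-the-shelf routine such as OpenCV's `cv2.calcOpticalFlowFarneback`); the helper name is illustrative.

```python
import numpy as np

def build_motion_matrix(flow, s=4):
    """Build the (2*s*s*w) x (M*N) motion matrix X of one sub-video from its
    dense optical flow, stacking the interleaved (u, v) values of every point
    in each patch as one column.

    flow: array of shape (w, H, W, 2) holding (u, v) per pixel and frame."""
    w, H, W, _ = flow.shape
    M, N = H // s, W // s
    cols = []
    for m in range(M):
        for n in range(N):
            patch = flow[:, m * s:(m + 1) * s, n * s:(n + 1) * s, :]  # (w, s, s, 2)
            cols.append(patch.reshape(-1))                            # length 2*s*s*w
    return np.stack(cols, axis=1)                                     # (2*s*s*w, M*N)
```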

The overall motion of sub-video Vk is encoded in X. We decompose X into a low-rank matrix B and a sparse error matrix F by solving the following relaxed LRMR problem [48],

$$ \arg \min \limits_\mathbf{B,F} {\| \mathbf{B}\|_{*}} + \lambda \| \mathbf{F} \|_{1} \quad \textrm{s.t.}\quad \mathbf{X = B + F} $$
(2)

where \(\left \| \cdot \right \|_{*}\) denotes the nuclear norm of a matrix, \(\left \| \cdot \right \|_{1}\) the \(\ell _{1}\) norm, and λ trades off rank against sparsity. According to [16, 48, 49], the setting of λ depends strongly on the dimensions of the matrix to be decomposed. Following the recommended settings in these works, we empirically set \(\lambda = 1.1 \slash {\sqrt {\max (2{s^{2}}w,MN)}}\). The optimization problem in (2) is solved with the inexact Augmented Lagrange Multiplier (IALM) algorithm [16].
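For reference, a compact NumPy re-implementation of this decomposition in the spirit of the standard IALM scheme of [16] might look as follows. This is a simplified sketch with typical default parameters, not the authors' code.

```python
import numpy as np

def rpca_ialm(X, lam=None, tol=1e-7, max_iter=500):
    """Solve Eq. (2): decompose X into a low-rank part B and a sparse part F
    with a simplified inexact Augmented Lagrange Multiplier (IALM) scheme."""
    d1, d2 = X.shape
    if lam is None:
        lam = 1.1 / np.sqrt(max(d1, d2))     # lambda = 1.1 / sqrt(max(2*s^2*w, M*N))
    norm_fro = np.linalg.norm(X, 'fro')
    norm_two = np.linalg.norm(X, 2)
    Y = X / max(norm_two, np.abs(X).max() / lam)         # dual variable initialization
    mu, mu_bar, rho = 1.25 / norm_two, 1.25 / norm_two * 1e7, 1.5
    B, F = np.zeros_like(X), np.zeros_like(X)
    for _ in range(max_iter):
        # B-step: singular value thresholding of (X - F + Y/mu)
        U, sig, Vt = np.linalg.svd(X - F + Y / mu, full_matrices=False)
        B = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # F-step: element-wise soft thresholding of (X - B + Y/mu)
        G = X - B + Y / mu
        F = np.sign(G) * np.maximum(np.abs(G) - lam / mu, 0.0)
        Z = X - B - F
        Y = Y + mu * Z
        mu = min(rho * mu, mu_bar)
        if np.linalg.norm(Z, 'fro') / norm_fro < tol:
            break
    return B, F
```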

After decomposition, the uniform background motion in sub-video Vk is mostly deposited in the low-rank matrix B, while the irregular action motion is mostly conveyed in the sparse error matrix F. Each column in F encodes the action motion of a patch in the sub-video. We compute the \(l_{\infty }\) norms for all columns in F and obtain a vector \({{\hat {\mathbf {s}}}_{k}} = \left [ {\begin {array}{*{20}{c}} {{{\hat s}_{1}}}&{{{\hat s}_{2}}}& {\cdots } &{{{\hat s}_{MN}}} \end {array}} \right ]\), where \({\hat s_{n}}\) is used to measure the initial saliency of patch Pn in sub-video Vk.

In general, the patches in action-relevant regions contain irregular action motion and produce high saliency values in \({\hat {\mathbf {s}}}_{k}\), while the patches in background contain uniform background motion and produce small values. However, in some cases, the action motion in large action-relevant regions is also uniform and is deposited in matrix B, while some background motion may contain subtle spatial changes and be included in matrix F. Thus, the initial saliency cannot always correctly distinguish all action-relevant regions from background. We therefore compare it with a relatively small threshold Ts to classify all patches into a candidate foreground set Sf and a definite background set Sb. More specifically,

$$ {\begin{array}{*{20}{c}} {S_{f}} = \left\{ {{{\mathbf{P}}_{i}}\left| {{\hat s_{i}}} \right. \geqslant {T_{s}},{\text{ }}i = 1,2, {\cdots} ,MN} \right\}\\ {S_{b}} = \left\{ {{{\mathbf{P}}_{i}}\left| {{\hat s_{i}}} \right. < {T_{s}},{\text{ }}i = 1,2, {\cdots} ,MN} \right\} \end{array}} $$
(3)

The threshold Ts in (3) is small enough that nearly all possible foreground patches are included in Sf, while the patches retained in Sb almost all come from the true background. If the saliency of every patch in the sub-video is smaller than Ts, the sub-video contains no action-relevant regions, and the following steps are skipped.
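Continuing from the sparse part F returned above, a minimal sketch of the initial saliency and the split of Eq. (3) is shown below; the default Ts = 10 is the value reported in Section 5.1, and the helper name is our own.

```python
import numpy as np

def initial_saliency_split(F, T_s=10.0):
    """Initial saliency as the l-infinity norm of every column of the sparse
    part F, then split patch indices into candidate foreground S_f and
    definite background S_b (Eq. (3))."""
    s_hat = np.abs(F).max(axis=0)            # l_inf norm of each column
    S_f = np.flatnonzero(s_hat >= T_s)       # candidate foreground patch indices
    S_b = np.flatnonzero(s_hat < T_s)        # definite background patch indices
    if S_f.size == 0:
        return s_hat, None, None             # no action-relevant region: skip later stages
    return s_hat, S_f, S_b
```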

3.2 Saliency refinement

As a small threshold is used in (3), some background patches with subtle motion changes are retained in the candidate foreground set Sf. In this stage, to recover the true action-relevant patches from Sf, we compute refined saliency for each patch in Sf so that the patches in action-relevant regions can be distinguished more clearly from those in background.

Generally, even though the motion in some large action-relevant regions is spatially uniform, it still tends to be distinct from the motion in definite background. On the other hand, even though the motion in some background regions contains subtle changes, it is still apt to resemble the motion in definite background. Thus, we learn a sparse representation for the motion vector of each patch in the candidate foreground set Sf based on the dictionary constructed by the motion vectors of all patches in the definite background set Sb, and use the reconstruction error to measure patch saliency. With this technique, the action-relevant patches in Sf are difficult to reconstruct and thus obtain high saliency, while the background patches in Sf are easy to reconstruct and thus obtain low saliency. As such, the saliency contrast between action-relevant regions and background is sharpened.

In some cases, the motion of an action-relevant patch happens to resemble that of the background patches, but its appearance (e.g., color) may still be distinct from theirs. To give action-relevant patches a better chance of producing large reconstruction errors, we therefore combine appearance with motion to refine the saliency of each patch in Sf. To this end, we form a color matrix for the centered frame of the sub-video at the patch level. When sub-video Vk is spatially divided into patches, the centered frame \({\mathbf {I}}_{k}^{c}\) in it is also divided as

$$ {{\mathbf{I}}_{k}^{c}} = \left[ {\begin{array}{*{20}{c}} {{{\mathbf{p}}_{1}}}&{{{\mathbf{p}}_{2}}}& {\cdots} &{{{\mathbf{p}}_{N}}} \\ {{{\mathbf{p}}_{N + 1}}}&{{{\mathbf{p}}_{N + 2}}}& {\cdots} &{{{\mathbf{p}}_{2N}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {{{\mathbf{p}}_{\left( {M - 1} \right)N + 1}}}&{{{\mathbf{p}}_{\left( {M - 1} \right)N + 2}}}& {\cdots} &{{{\mathbf{p}}_{MN}}} \end{array}} \right] $$
(4)

where pn is the n th patch and its size is s × s. We obtain the color vector of each patch by vectorizing the color values of all pixels in it. The color vector of pn is formed by \({{\mathbf {a}}^{n}} = {\left [ {\begin {array}{*{20}{c}} {{L_{1}^{n}}}&{{a_{1}^{n}}}&{{b_{1}^{n}}}& \cdots &{{L_{i}^{n}}}&{{a_{i}^{n}}}&{{b_{i}^{n}}}& {\cdots } &{L_{s \times s}^{n}}&{a_{s \times s}^{n}}&{b_{s \times s}^{n}} \end {array}} \right ]^{T}}\), where \({{L_{i}^{n}}}\), \({{a_{i}^{n}}}\) and \({{b_{i}^{n}}}\) are the color values of the i th pixel in pn in Lab space. By stacking the color vectors of all patches in the order of spatial position, we obtain the color matrix of \({\mathbf {I}}_{k}^{c}\), i.e., \({\mathbf {A}} = \left [ {\begin {array}{*{20}{c}}{{{\mathbf {a}}^{1}}}&{{{\mathbf {a}}^{2}}}& {\cdots } &{{{\mathbf {a}}^{MN}}} \end {array}} \right ]\), where A is a 2D matrix and its size is (3s2) × (MN).

The motion and appearance information of Vk can be combined by concatenating the motion and color matrices into a feature matrix G

$$ {{\mathbf{G}}} = \left[ {\begin{array}{*{20}{c}} {\mathbf{X^{\prime}}} \\ {\mathbf{A^{\prime}}} \end{array}} \right] $$
(5)

where \(\mathbf {X^{\prime }}\) and \(\mathbf {A^{\prime }}\) are respectively formed by the normalized columns in X and A. Each column in G is the feature vector of a patch in Vk, whose length is (2s2w + 3s2). The feature vectors of all patches in Sf are stacked column by column in given order to form the candidate foreground feature matrix \({{\mathbf {X}}_{f}} = \left [ {\begin {array}{*{20}{c}}{{\mathbf {x}}_{1}^{f}}&{{\mathbf {x}}_{2}^{f}}& {\cdots } &{{\mathbf {x}}_{R}^{f}} \end {array}} \right ]\), where \({\mathbf {x}}_{r}^{f}\) is the feature vector of the r th patch in Sf, and R is the number of patches in Sf. Similarly, the feature vectors of all patches in Sb form the definite background feature matrix \({{\mathbf {X}}_{b}} = \left [ {\begin {array}{*{20}{c}}{{\mathbf {x}}_{1}^{b}}&{{\mathbf {x}}_{2}^{b}}& {\cdots } &{{\mathbf {x}}_{S}^{b}} \end {array}} \right ]\), where \({\mathbf {x}}_{s}^{b}\) is the feature vector of the s th patch in Sb, and S is the number of patches in Sb.
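A small sketch of how the normalized feature matrix G of Eq. (5) and the matrices Xf and Xb might be assembled is given below. It assumes the Lab color matrix A has been built from the centered frame exactly as X was built from the optical flow (e.g., after converting the frame with `cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)`); the function name is our own.

```python
import numpy as np

def build_feature_matrices(X, A, S_f, S_b, eps=1e-8):
    """Concatenate column-normalized motion (X) and Lab color (A) matrices
    into the feature matrix G of Eq. (5), then gather the candidate-foreground
    matrix X_f and the definite-background dictionary X_b."""
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + eps)
    An = A / (np.linalg.norm(A, axis=0, keepdims=True) + eps)
    G = np.vstack([Xn, An])                  # shape (2*s^2*w + 3*s^2, M*N)
    return G[:, S_f], G[:, S_b]              # X_f, X_b
```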

Then, the refined saliency of each patch in Sf is computed by learning a sparse representation for the corresponding vector in Xf using Xb as the dictionary. Taking the r-th patch in Sf as an example, the sparse representation of \({\mathbf {x}}_{r}^{f}\) is obtained by solving the following optimization problem

$$ \underset{{{\mathbf{\alpha }}_{r}}}{\min } \left\| {{\mathbf{x}}_{r}^{f} - {{\mathbf{X}}_{b}}{{\mathbf{\alpha }}_{r}}} \right\|_{2}^{2} + \lambda {\left\| {{{\mathbf{\alpha }}_{r}}} \right\|_{1}} $$
(6)

where λ is a regularization parameter that trades off reconstruction error versus sparsity, and αr is the reconstruction coefficient vector.

Since the background patches in Sf are more correlated with the definite background patches in Sb that are spatially close to them, we assign a weight to each atom in Xb so that such patches can be reconstructed even more easily. Specifically, the weight of the i-th atom \({\mathbf {x}}_{i}^{b}\) is computed by

$$ {w_{i}} = \exp \left( {\frac{{dist\left( {{c_{r}},{c_{i}}} \right)}}{\sigma_{w} }} \right) $$
(7)

where cr and ci are the centers of the patches corresponding to the reconstructed \({\mathbf {x}}_{r}^{f}\) and atom \({\mathbf {x}}_{i}^{b}\) respectively; \({dist\left ({{c_{r}},{c_{i}}} \right )}\) is the normalized Euclidean distance between them; σw is a parameter adjusting the decay speed of weight, which is set to 0.05. The weights of all atoms form a weight vector \({{\mathbf {w}}_{r}} = \left [ {\begin {array}{*{20}{c}}{{w_{1}}}&{{w_{2}}}& {\cdots } &{{w_{S}}} \end {array}} \right ]\). Introducing wr in (6), we obtain the weighted sparse representation optimization problem [3]

$$ \underset{{{\mathbf{\alpha }}_{r}}}{\min } \left\| {{\mathbf{x}}_{r}^{f} - {{\mathbf{X}}_{b}}{{\mathbf{\alpha }}_{r}}} \right\|_{2}^{2} + \lambda {\left\| {diag({{\mathbf{w}}_{r}}){{\mathbf{\alpha }}_{r}}} \right\|_{1}} $$
(8)

We solve (8) using the optimization toolbox of SPArse Modeling Software (SPAMS) [24], and empirically set λ = 0.15. When the sparse representation is obtained, we compute the reconstruction error to measure the saliency of the r th patch in Sf

$$ {s_{r}} = {\left\| {{\mathbf{x}}_{r}^{f} - {{\mathbf{X}}_{b}}{{\mathbf{\alpha }}_{r}}} \right\|_{2}} $$
(9)

Since (7) produces smaller weights for shorter distances, the background patches in Sb that are spatially close to the reconstructed patch are emphasized in (8), which helps the background patches in Sf obtain lower reconstruction errors.
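The paper solves (8) with the SPAMS toolbox; the self-contained sketch below instead folds the weights into the dictionary (substituting β = diag(wr)αr turns (8) into an ordinary lasso) and runs a plain ISTA loop, which is slower but dependency-free. The helper names, the fixed iteration count and the step-size choice are our own; σw = 0.05 and λ = 0.15 follow the paper.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_sr_saliency(x_f, X_b, center_f, centers_b, lam=0.15,
                         sigma_w=0.05, n_iter=200):
    """Refined saliency of one candidate-foreground patch (Eqs. (7)-(9)).

    x_f       : feature vector of the patch to reconstruct.
    X_b       : definite-background dictionary (features x S atoms).
    center_f  : normalized (row, col) center of the reconstructed patch.
    centers_b : (S, 2) normalized centers of the dictionary patches."""
    # Eq. (7): spatial weights, smaller for closer background patches
    dist = np.linalg.norm(centers_b - center_f, axis=1)
    w = np.exp(dist / sigma_w)
    # Fold weights into the dictionary: beta = diag(w) * alpha turns Eq. (8)
    # into an ordinary lasso on D = X_b * diag(1/w).
    D = X_b / w
    # ISTA loop for  min ||x - D beta||_2^2 + lam * ||beta||_1
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12)
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ beta - x_f)
        beta = soft_threshold(beta - step * grad, step * lam)
    # Eq. (9): reconstruction error as refined saliency (D @ beta == X_b @ alpha)
    return np.linalg.norm(x_f - D @ beta)
```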

Similarly, we calculate the saliency values of all patches in Sf. Combining them with those of the patches in Sb, which are obtained with LRMR in the first stage, we construct a patch-level saliency matrix Sk according to the patch position in the divided sub-video Vk

$$ {{\mathbf{S}}_{k}} = \left[ {\begin{array}{*{20}{c}} {{s_{1}}}&{{s_{2}}}& {\cdots} &{{s_{N}}} \\ {{s_{N + 1}}}&{{s_{N + 2}}}& {\cdots} &{{s_{2N}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {{s_{\left( {M - 1} \right)N + 1}}}&{{s_{\left( {M - 1} \right)N + 2}}}& {\cdots} &{{s_{MN}}} \end{array}} \right] $$
(10)

where the element sn is the saliency of patch Pn in the divided Vk. If all the elements in Sk are smaller than a small threshold, no patch in Vk contains action motion, and Sk is set to zero. Otherwise, the values in it are rescaled to [0,255].
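Assembling and normalizing the patch-level saliency matrix Sk of Eq. (10) can then be sketched as follows; the small threshold `eps` stands in for the threshold mentioned above and is an illustrative value, as is the function name.

```python
import numpy as np

def assemble_saliency(s_hat, S_f, refined, M, N, eps=1.0):
    """Assemble S_k of Eq. (10): refined values for candidate-foreground
    patches, first-stage values elsewhere, then rescale to [0, 255]
    unless everything is negligibly small."""
    s = s_hat.copy()
    s[S_f] = refined                              # refined saliency from stage two
    S_k = s.reshape(M, N)                         # row-major order matches Eq. (10)
    if S_k.max() < eps:                           # no action motion in this sub-video
        return np.zeros((M, N))
    return 255.0 * (S_k - S_k.min()) / (S_k.max() - S_k.min() + 1e-12)
```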

3.3 Saliency updating

Through saliency refinement, the action-relevant patches are more likely to gain high saliency contrast against background, but some outliers still cannot be correctly highlighted. In this stage, we further improve the saliency detection results through a spatial updating operation.

When the saliency is computed patch by patch in the first two stages, the spatial coherence of saliency between neighboring patches is ignored. In fact, the saliency values of the adjacent patches located in the same body part or background should be spatially coherent. If the spatial saliency coherence is enhanced, some wrongly suppressed action-relevant patches may be highlighted due to the high saliency of their adjacent action-relevant patches, and on the other hand, some wrongly highlighted background patches may be suppressed due to the low saliency of their adjacent background patches. Based on this observation, we update the saliency value of each patch by adding a weighted sum of the saliency values of its spatially adjacent patches. The weight is the probability that two patches belong to the same body part or background, which is measured by the motion and appearance similarity between the two patches in the centered frame of the sub-video.

To update the saliency value of Pn, we first compute the motion and appearance similarity between pn and each 8-neighborhood patch pi in the centered frame of Vk

$$ {d_{n,i}} = \exp \left( { - \frac{{dist\left( {{H_{n}},{H_{i}}} \right)}}{\sigma_{d} }} \right) $$
(11)

where Hn and Hi are the feature vectors formed by concatenating the HOG and HOF features of patch pn and pi respectively, \({dist\left ({{H_{n}},{H_{i}}} \right )}\) is the Euclidean distance between Hn and Hi, and σd is a parameter adjusting the decay speed of similarity, which is empirically set to 0.5. Based on the patch similarity, the saliency value of Pn is updated by

$$ {s^{\prime}_{n}} = {s_{n}} + \frac{{\sum\limits_{{{\mathbf{p}}_{i}} \in N\left( {{{\mathbf{p}}_{n}}} \right)} {{d_{n,i}}*{s_{i}}} }}{{\sum\limits_{{{\mathbf{p}}_{i}} \in N\left( {{{\mathbf{p}}_{n}}} \right)} {{d_{n,i}}} }} $$
(12)

where \({N\left ({{{\mathbf {p}}_{n}}} \right )}\) is the neighborhood of pn in the centered frame. When all the saliency values in Sk are updated, we resize it to the original spatial size of the video through nearest-neighbor interpolation to generate a pixel-level saliency matrix \({\tilde {\mathbf {S}}_{k}}\). By thresholding each element in \({\tilde {\mathbf {S}}_{k}}\) with a given threshold \(T^{\prime }_{s}\), we obtain the binary saliency map Mk for sub-video Vk

$$ {{\mathbf{M}}_{k}}\left( {x,y} \right) = \left\{ {\begin{array}{*{20}{l}} 1&{{\tilde{\mathbf{S}}_{k}}\left( {x,y} \right) \geqslant {T^{\prime}_{s}}} \\ 0&{otherwise} \end{array}} \right. $$
(13)

where \({\tilde {\mathbf {S}}_{k}}\left ({x,y} \right )\) is the saliency value of the pixel (x, y) in any frame of the sub-video and \({\mathbf {M}_{k}}\left ({x,y} \right )\) is the action-relevant region indicator of (x, y). If \({\mathbf {M}_{k}}\left ({x,y} \right )=1\), the pixel belongs to action-relevant regions, otherwise it belongs to background.
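A sketch of the updating and binarization steps (Eqs. (11)-(13)) is given below, assuming the concatenated HOG/HOF descriptor of every patch in the centered frame has already been computed. σd = 0.5 and the final threshold follow the paper; the function name, the argument layout and the stop at detection-scale resolution (the paper resizes back to the original frame size) are our own simplifications.

```python
import numpy as np

def update_and_binarize(S_k, feats, s=4, sigma_d=0.5, T_s_final=100.0):
    """Spatially update the patch-level saliency (Eqs. (11)-(12)), upsample it
    to pixel level by nearest-neighbor interpolation and threshold it (Eq. (13)).

    S_k   : (M, N) patch-level saliency matrix.
    feats : (M, N, d) concatenated HOG/HOF feature of each patch in the
            centered frame (assumed precomputed)."""
    M, N = S_k.shape
    S_new = S_k.copy()
    for m in range(M):
        for n in range(N):
            num, den = 0.0, 0.0
            for dm in (-1, 0, 1):
                for dn in (-1, 0, 1):
                    if dm == 0 and dn == 0:
                        continue
                    mm, nn = m + dm, n + dn
                    if 0 <= mm < M and 0 <= nn < N:
                        # Eq. (11): motion/appearance similarity to the neighbor
                        d = np.exp(-np.linalg.norm(feats[m, n] - feats[mm, nn]) / sigma_d)
                        num += d * S_k[mm, nn]
                        den += d
            if den > 0:
                S_new[m, n] = S_k[m, n] + num / den      # Eq. (12)
    # nearest-neighbor upsampling from patch level to pixel level
    S_pix = np.repeat(np.repeat(S_new, s, axis=0), s, axis=1)
    return (S_pix >= T_s_final).astype(np.uint8)         # Eq. (13)
```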

4 Saliency-based dense tracking and action recognition

To evaluate recognition performance, we fuse the binary saliency maps into the dense tracking technique [44] to extract trajectories in action-relevant regions. Feature points are densely sampled on a grid spaced by 5 pixels and tracked from one frame to the next using filtered dense optical flow. When the points are tracked into the next frame, we treat them as candidate succeeding trajectory points and use the binary saliency map to determine whether they lie in action-relevant regions. Only those located in action-relevant regions are regarded as valid trajectory points. To ensure dense coverage of trajectories, if no tracked point is found in a 5 × 5 neighborhood within the action-relevant regions, a new point is sampled and added to the tracking process. After tracking is finished, all points on the same track form a trajectory. We remove static trajectories and those with sudden large displacements, which are mostly irrelevant to the action of interest.
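The full dense-tracking pipeline of [44] is beyond the scope of a short example, but the saliency-based point filtering described above reduces to a simple lookup in the binary map, sketched below under the assumption that the point coordinates and the map Mk live in the same frame coordinate system; the function name is illustrative.

```python
import numpy as np

def filter_tracked_points(points, M_k):
    """Keep only candidate succeeding trajectory points that fall inside the
    action-relevant regions given by the binary saliency map M_k.

    points : (P, 2) array of (x, y) coordinates in the current frame.
    M_k    : (H, W) binary saliency map (1 = action-relevant)."""
    H, W = M_k.shape
    x = np.clip(np.round(points[:, 0]).astype(int), 0, W - 1)
    y = np.clip(np.round(points[:, 1]).astype(int), 0, H - 1)
    return points[M_k[y, x] == 1]
```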

Four types of trajectory descriptors (Shape, HOG, HOF and MBH) are computed for all trajectories. To reduce the influence of camera motion, we compute the motion descriptors (Shape, HOF and MBH) from the rectified optical flow, in which camera motion is removed through the technique used in the improved dense trajectory method [41]. Each type of descriptors is encoded independently using Fisher vector [31, 32] with the same settings as [41] to obtain the final video-level representation. We use the SVM classifier and multiple kernel learning to predict action class, where four linear kernels are used, each corresponding to one type of video-level representation.
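As a rough sketch of this final classification step, the per-descriptor Fisher vectors can be combined through a weighted sum of linear kernels and fed to a precomputed-kernel SVM. Equal kernel weights and C = 100 are placeholders standing in for the multiple kernel learning actually used in the paper; the function name and data layout are our own.

```python
import numpy as np
from sklearn.svm import SVC

def fuse_and_classify(fv_train, fv_test, y_train, weights=None):
    """Combine per-descriptor Fisher-vector kernels (Shape, HOG, HOF, MBH) by a
    weighted sum of linear kernels and classify with a precomputed-kernel SVM.

    fv_train, fv_test : dicts mapping descriptor name -> (n_samples, dim) array."""
    names = list(fv_train.keys())
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}   # equal weights in place of MKL
    K_train = sum(weights[n] * fv_train[n] @ fv_train[n].T for n in names)
    K_test = sum(weights[n] * fv_test[n] @ fv_train[n].T for n in names)
    clf = SVC(kernel='precomputed', C=100).fit(K_train, y_train)
    return clf.predict(K_test)
```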

5 Experiments

5.1 Experimental settings

We carry out experiments on four benchmark datasets, namely YouTube, Hollywood2, HMDB51 and UCF101, to evaluate the proposed method. When detecting action-relevant regions, we resize the video frames by a factor of 1/4 for all datasets to reduce computational cost. The sub-video length is set to w = 5 frames and the spatial size of patches to s = 4 pixels, which were shown to be optimal by the experiments in [40]. We set the initial saliency threshold Ts = 10 and the final saliency threshold \(T^{\prime }_{s}=100\), which the experiments in Section 5.5.2 show to be the most appropriate.

5.2 Action datasets

Hollywood2

dataset [25] is composed of videos extracted from 69 movies, which fall into 12 action classes: answering phone, driving car, eating, fighting person, getting out of car, hugging person, hand shaking, kissing, running, sitting down, sitting up and standing up. In the experiments, the action videos are split into two sets: 823 training samples and 884 test samples.

YouTube

dataset [17] consists of 1168 videos divided into 11 action categories: basketball shooting, biking, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking and walking (with a dog). The videos in each category are organized into 25 groups. A leave-one-out (LOO) scheme on a group basis is employed in the experiments, using one group of each category for testing and the others for training in each run. Recognition performance is measured by averaging the recognition accuracy over the 25 runs.

HMDB51

dataset [12] contains 6,766 video clips extracted from commercial movies as well as YouTube. In our experiments the original videos are used, which are divided into 51 action categories, each containing a minimum of 101 clips. We follow the original experimental setup [12] with three train-test splits; for each split, 70 videos per class are used for training and 30 for testing. Recognition performance is measured by the average recognition accuracy over the three splits.

UCF101

dataset [34] consists of 13,320 video clips collected from YouTube, which are divided into 101 categories. The videos in each category are grouped into 25 groups. In our experiments, the protocol with three train-test splits [34] is followed and recognition performance is reported as the average accuracy over these splits.

5.3 Saliency and action-relevant region detection results

5.3.1 Saliency detection results of different stages

We visualize the saliency detection results of the different stages for some action videos from the four datasets in Fig. 3. The videos are all contaminated with various types of camera motion, such as shaking, travelling and zooming. As indicated in Fig. 3b, the initial saliency obtained in the first stage cannot highlight all the action-relevant regions. The candidate foreground set and the definite background set are displayed in white and black, respectively, in Fig. 3c. Clearly, the former contains not only nearly all the action-relevant patches but also some background patches, while the patches in the latter almost all come from background. As indicated in Fig. 3d, after the saliency is refined in the second stage, the saliency contrast is greatly sharpened and the action-relevant regions are significantly highlighted. As shown in Fig. 3e, after the spatial updating of the third stage, the saliency in action-relevant regions becomes even more prominent and continuous, which clearly distinguishes the action-relevant regions from background.

Fig. 3 Saliency detection results of different stages: a sample frame, b initial saliency of the first stage, c candidate foreground set (white) and definite background set (black), d refined saliency of the second stage, and e updated saliency of the third stage

5.3.2 Final saliency visualization and comparison

We compare the final saliency detection results with those of two recent video saliency detection methods [21, 45], which are not specially designed for recovering action-relevant regions but have shown excellent performance in common video saliency detection. To this end, we randomly select some videos from the four action datasets, shown in Fig. 4a. All of them contain various degrees of background motion except the last two. The saliency is computed by the three methods under their default parameter settings, and the binary saliency maps are all obtained with threshold \(T^{\prime }_{s}=100\). From Fig. 4b and c, we can see that, whether the background is dynamic or static, the final saliency obtained by our method highlights the action-relevant regions well, and the binary saliency maps offer good coverage of the action-relevant regions in most cases. In contrast, as shown in Fig. 4d and f, the saliency contrast obtained by [45] and [21] is not as sharp as ours, and in their binary saliency maps, shown in Fig. 4e and g, background regions are frequently segmented as action-relevant regions while action-relevant regions are often mistaken for background. Clearly, our method is better suited to action-relevant region detection than common video saliency detection methods.

Fig. 4 Comparison of the saliency detection results of different methods: a sample frame, b saliency of our method, c binary saliency map of our method, d saliency of [45], e binary saliency map of [45], f saliency of [21], and g binary saliency map of [21]

5.3.3 Evaluation of appearance information in saliency refinement

In the second stage of saliency detection, we introduce static appearance information (color) and combine it with motion information to refine saliency, so that the patches in action-relevant regions have a better chance of obtaining high saliency. To evaluate the influence of appearance information on saliency detection, we remove it from the feature matrix G and use only motion information for saliency refinement. The saliency detection results on some videos are visualized and compared with those obtained under the default settings in Fig. 5. When both appearance and motion information are employed, higher saliency is gained in the inner parts of action-relevant regions and the binary saliency maps cover the action-relevant regions more completely, which demonstrates that incorporating appearance information is beneficial to saliency refinement.

Fig. 5 Comparison of the saliency detection results with or without static appearance information: a sample frame, b saliency of default settings, c binary saliency map of default settings, d saliency without appearance information, and e binary saliency map without appearance information

However, as shown in the first two rows of Fig. 5, some regions within large action-relevant areas still cannot be recovered even when appearance information is incorporated. One possible reason is that these regions are wrongly classified as definite background in the first stage because of the very uniform motion within them, and thus cannot be highlighted in the second stage. Another possible reason is that these regions happen to resemble the background in both appearance and motion. Neither case can be handled by the proposed method.

5.3.4 Evaluation of computational complexity

In this section, the computational complexity of each stage of the saliency detection method is evaluated on the 150 videos of the BaseballPitch class in the UCF101 dataset, whose average size is 320 × 240 × 100. The experiment is conducted in MATLAB on a personal computer with a 4.2 GHz CPU and 16 GB RAM, and the time consumption per frame for each stage is shown in Table 1. The first stage is clearly the most time consuming. Although it has been shown in [16] that IALM can be empirically more than five times faster than other algorithms, its computational cost is still non-negligible. In contrast, the second stage is much more efficient, because the dictionary for SR is formed directly by stacking the feature vectors of the patches in the definite background set, which bypasses time-consuming dictionary learning. The updating operation in the third stage involves only a weighted summation of the saliency values in the 8-neighborhood and can also be implemented very efficiently. Overall, the total processing time for action-relevant region detection is 1122.11 ms per frame, so the method cannot yet run in real time on a typical personal computer.

Table 1 Time consumption per frame for different stages of saliency detection

5.4 Results of saliency-based dense tracking

The trajectories extracted by saliency-based dense tracking for sample videos from the four datasets are visualized in Fig. 6c. By fusing the binary saliency maps into dense tracking, the trajectories in action-relevant regions are well extracted: they are dense, continuous, and offer good coverage. At the same time, the trajectories in background are well suppressed. For comparison, we also extract dense trajectories using traditional dense tracking. As visualized in Fig. 6d, besides those in action-relevant regions, a large number of trajectories are also produced in background.

Fig. 6 Trajectory visualization and comparison: a sample frame, b binary saliency map, c trajectories produced by saliency-based dense tracking, and d trajectories produced by traditional dense tracking

5.5 Action recognition performance

5.5.1 Overall recognition performance

The overall recognition results obtained with the default settings, based on the trajectories extracted from the action-relevant regions detected by the proposed method, are shown in the last row of Table 2, indicated by ‘Saliency-DT’: 64.8% on Hollywood2, 91.4% on YouTube, 58.9% on HMDB51 and 86.1% on UCF101. To quantitatively evaluate each stage of saliency detection, we also compute the recognition results based on the action-relevant regions obtained from the initial saliency produced by the first stage and the refined saliency produced by the second stage, indicated by ‘Initial-saliency-DT’ and ‘Refined-saliency-DT’ in Table 2, respectively. Comparing the results in the three rows, we observe that the refined saliency always outperforms the initial saliency, and a non-negligible further improvement is obtained when the saliency is spatially updated.

Table 2 Recognition performance of different stages of saliency detection

The default recognition performance is computed from the trajectories in the detected action-relevant regions with motion descriptors obtained from the rectified optical flow. To validate the superiority of the proposed method, we also compute the results of the counterpart whose motion descriptors are computed in the original optical flow field, and the results of dense trajectories extracted by conventional dense tracking, indicated by ‘Saliency-DT-original’ and ‘DT’ in Table 3, respectively. Even when the original optical flow is used for the motion descriptors, the trajectories in the detected action-relevant regions clearly outperform the dense trajectories: the improvements are 2.2% on Hollywood2, 2.9% on YouTube, 3.7% on HMDB51 and 3.4% on UCF101. Comparing ‘Saliency-DT-original’ with ‘Saliency-DT’, we find that the recognition performance is further improved when the rectified optical flow is used for the motion descriptors.

Table 3 Comparison of recognition performance of dense trajectories (DT), trajectories in action-relevant regions with motion descriptors in original optical flow field (Saliency-DT-original) and default settings (Saliency-DT)

5.5.2 Influence of saliency thresholds on recognition performance

Two saliency thresholds are involved in the action-relevant region detection method, i.e., the initial saliency threshold Ts and the final saliency threshold \(T^{\prime }_{s}\). We evaluate the impact of these two thresholds on recognition performance on the Hollywood2 and YouTube datasets. When evaluating one threshold, we vary it with a given step while fixing the other to its default. As shown in Fig. 7a, the recognition performance fluctuates as Ts varies, reaching its highest value at Ts = 12 on Hollywood2 and Ts = 10 on YouTube. As shown in Fig. 7b, the recognition performance improves with \(T^{\prime }_{s}\) at first but deteriorates once \(T^{\prime }_{s}\) exceeds a certain value on both datasets. The highest performance is reached at \(T^{\prime }_{s}=100\) on Hollywood2 and \(T^{\prime }_{s}=110\) on YouTube.

Fig. 7 Influence of saliency thresholds on recognition performance: a initial saliency threshold, b final saliency threshold

From Fig. 7a, we observe that the initial saliency threshold Ts should be small, but not too small. A reasonable Ts should, on the one hand, be small enough to guarantee that all true action-relevant patches are enclosed in the candidate foreground set, but, on the other hand, not be so small that the definite background set lacks enough patches to form a dictionary. From Fig. 7b, it can be seen that the final saliency threshold \(T^{\prime }_{s}\) should also be set appropriately: too large a \(T^{\prime }_{s}\) tends to cause action-relevant regions to be wrongly classified as background, whereas too small a \(T^{\prime }_{s}\) increases the risk of wrongly classifying background regions as action-relevant.

5.5.3 Comparison with the state of the art

The best recognition results of our method are compared with those of state-of-the-art trajectory-based methods in Table 4. On Hollywood2, although [11] yields the best result by using trajectory clusters as global points to alleviate camera motion, our method outperforms all the other methods, among which [44] is the traditional dense tracking method and [6, 10, 39,40,41] improve dense tracking by suppressing background trajectory features through saliency detection or other techniques. On YouTube, our method performs best among all the methods compared, which include the dense tracking method [44], its improved versions [6, 30, 40, 47, 53], and the methods employing improved VLAD [50] or multi-view SC [54] to encode dense trajectory features. On HMDB51, our method performs competitively with the ordered trajectories [28] and is superior to all other trajectory-based methods. On UCF101, our method achieves the highest performance among all methods except [11].

Table 4 Comparison with the state-of-the-art trajectory-based methods

6 Conclusion

To improve conventional dense tracking techniques, which do not discriminate between action-relevant regions and background when extracting features for action recognition, this paper proposes a three-stage saliency detection technique to recover action-relevant regions so that only action-related features are extracted to describe actions. Initial saliency is computed in the first stage based on LRMR to determine the candidate foreground and definite background. Saliency is refined using SR in the second stage so that the true action-relevant regions are highlighted with sharper contrast against background. Saliency is spatially updated in the third stage to rectify some wrong saliency values by enhancing spatial saliency coherence. A binary saliency map is then created to indicate the action-relevant regions and incorporated into dense tracking to extract action-relevant trajectory features. Experimental results show that the proposed method detects action-relevant regions more accurately than common video saliency detection methods, and that its recognition performance is superior to conventional dense tracking and competitive with existing improved versions.