1 Introduction

Human action recognition has become an important research area in computer vision since it concerns several key applications such as video indexing, video surveillance or human–computer interaction. The typical setup for this task involves the extraction of highly discriminant features localized in both space and time. A wide variety of such descriptors have been introduced recently [23, 32, 38, 39] and have become essential tools of the action classification framework. These descriptors are then aggregated into a single vector using the extension to video of the well-known “Bag of Words” image signature approaches [35]. To further improve the results, most action classification systems combine several complementary descriptors.

This paper is a revised and extended version of earlier work presented in [19], which introduced the Series of Polynomial Approximation of Flow (SoPAF) space–time motion descriptor, based on a polynomial decomposition of the optical flow. The descriptor is localized spatially and temporally in a space–time tube, so as to capture characteristic atoms of motion.

We propose to extend this descriptor by coding the vector field with the half-wave rectification proposed by Efros et al. [9]. Moreover, we study two different functional bases (polynomial and sine) for modeling the temporal evolution of spatial polynomial coefficients.

The paper is organized as follows. In Sect. 2 we present the most popular space–time feature descriptors in the literature. Then, in Sect. 3 we present the SoPAF descriptor and our extension. Finally, in Sect. 4 we carry out experiments on two well-known action classification datasets.

2 Related work

The recognition of human action and activity is an important area in several fields such as computer vision, machine learning and signal processing. A popular way of comparing videos is to extract a set of descriptors from video, to find a transformation that maps the set of descriptors into a single vector, and then to measure the similarity between the obtained vectors.

We first present several works related to descriptor extraction and then the most popular signature approaches.

2.1 Video descriptors

In early work on action recognition, silhouette-based descriptors, also called motion appearance models, were used. These descriptors are computed from the evolution of a silhouette obtained by background subtraction methods or by taking the difference of frames (DOF). From a sequence of binary images, Bobick and Davis [8] propose two descriptors: the Motion Energy Image (MEI), representing the energy of movement, and the Motion History Image (MHI), providing information about the chronology of motion. These two descriptors are modeled by seven Hu moments. Kellokumpu et al. [18] use histograms of Local Binary Patterns (LBP) to model the MHI and MEI images. In [17], they propose an extension of the LBP applied directly to the image pixels, with successful results. Wang and Suter [41] use two other descriptors, namely the Average Motion Energy (AME) and the Mean Motion Shape (MMS). The AME is a descriptor close to the MHI, representing the average image of silhouettes. The MMS is defined from boundary points of the silhouette in complex coordinates, with the origin placed at the centroid of the 2D shape. As time is important information in video, Gorelick et al. study the silhouettes as space–time volumes [4, 12]. Space–time volumes are modeled with Poisson equations. From these, they extract seven spatio-temporal characteristic components.

The main drawback of all these methods is the computation of silhouettes. Indeed, this computation is not very robust, making these methods only relevant in controlled environments such as the Weizmann dataset [4] or the KTH dataset [32]. Moreover, they tend to fail on more realistic datasets such as UCF11 [24] or Hollywood2 [23].

Assuming that action recognition is closely linked to the notion of movement, many authors have proposed descriptors based on the modeling of optical flow. The optical flow encodes the displacement of pixels between two consecutive frames. The result can be represented by a vector field with two components \(\mathcal {U}\) and \(\mathcal {V}\), where \(\mathcal {U}\) denotes the horizontal component of motion and \(\mathcal {V}\) the vertical component. Early work in this direction was proposed by Polana and Nelson [30]. The vector field is first decomposed according to a spatial grid. Then, in each cell of the grid, the magnitude of motion is accumulated. This method can only process periodic actions such as running or walking.

Efros et al. [9] propose a descriptor computed on a figure-centric spatio-temporal volume for each person in a video. The vector field representing the motion between two consecutive frames of the volume is computed with the Lucas and Kanade optical flow algorithm [26]. The two components \(\mathcal {U}\) and \(\mathcal {V}\) of the vector field are decomposed with a half-wave rectification technique. The resulting four components are blurred using a Gaussian filter and normalized. They are directly used as a descriptor. The obtained descriptors are compared using the normalized correlation measure. This descriptor is used and/or extended by several authors in [7, 10].

Tran et al. [36] have proposed the motion context descriptor. It is also a figure-centric descriptor based on silhouette extraction. They use the two components of the vector field and the binary silhouette as three components. The components of the field are blurred with a median filter. Then, the three components are subdivided with a grid of \(2 \times 2\) cells. Each cell is decomposed into 18 radial bins, each covering 20\(^\circ \). Inside the radial bins, the sum of each component is computed. This provides, for each component, 4 histograms of 18 bins each. The concatenation of these histograms gives a 216-dimensional vector which is the movement pattern of a given field. From this pattern, the Motion Context is created. It is composed of the 216-dimensional vector of the current frame plus the first 10 vectors of the PCA models of the 5 previous frames, the first 50 vectors of the PCA models of the 5 current frames and finally the first 10 vectors of the PCA models of the 5 next frames.

Ali and Shah first compute many kinematic features on the field, and then compute kinematic modes with a spatio-temporal principal component analysis to create a figure-centric descriptor [1].

Figure-centric descriptors are dependent on the person detector associated with them. Moreover, they do not take into account the context in the video, which can add relevant information for action recognition. Consequently, these methods tend to fail on more realistic datasets such as UCF11 [24] or Hollywood2 [23].

Another approach, proposed in [33, 34], computes the similarity between motions of video segments without computing motion fields. This method does not require background subtraction. However, it requires a set of training videos centered on the action to recognize.

Finally, the descriptors that have emerged in recent years are extensions to video of still image descriptors [38]. The most commonly used are SIFT [25], SURF [3] and the histogram of oriented gradients (HOG) [6]. SIFT and SURF are both interest point detectors and local image descriptors; in this paper, we only consider the descriptors. SIFT and HOG descriptors rely on a histogram of gradient orientations. Locally, the orientation of the gradient is quantized into \(o\) orientations (typically 8). For a given spatial window, a HOG (or SIFT) descriptor is computed by decomposing the window into a grid of \(N \times N\) cells. Each cell contains the histogram of gradient orientations. The descriptor is obtained by concatenating the \(N \times N\) histograms. The histogram of optical flow (HOF) is computed like HOG but on the optical flow instead of the gradient. The motion boundary histogram (MBH) models the spatial derivatives of each component of the optical flow vector field with a HOG.
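To make the scheme concrete, here is a toy sketch of such a histogram-of-orientations descriptor on a single window (a minimal illustration, not the implementation of [6] or [25]; the cell layout, bin count and normalization are assumptions). Applied to a flow component instead of the image, the same scheme yields a HOF-like descriptor.

```python
import numpy as np

def hog_like_descriptor(window, n_cells=2, n_orient=8):
    """Toy HOG-style descriptor: quantize gradient orientations into
    n_orient bins over an n_cells x n_cells grid and concatenate the
    magnitude-weighted histograms (illustrative sketch only)."""
    gy, gx = np.gradient(window.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)                 # orientation in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * n_orient).astype(int), n_orient - 1)
    h, w = window.shape
    hists = []
    for ci in range(n_cells):
        for cj in range(n_cells):
            ys = slice(ci * h // n_cells, (ci + 1) * h // n_cells)
            xs = slice(cj * w // n_cells, (cj + 1) * w // n_cells)
            hists.append(np.bincount(bins[ys, xs].ravel(),
                                     weights=mag[ys, xs].ravel(),
                                     minlength=n_orient))
    desc = np.concatenate(hists)
    return desc / (np.linalg.norm(desc) + 1e-12)
```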

Recently, Wang et al. [38] proposed to compute these usual descriptors along dense trajectories. The time evolution of trajectories, HOG, HOF and MBH is modeled using a space–time grid following pixel trajectories. The use of dense trajectories for descriptor extraction increases the performance of these popular descriptors.

2.2 Signatures

Once a set of descriptors is obtained from the video, a popular way of comparing images (or videos) is to map the set of descriptors into a single vector and then to measure the similarity between the obtained vectors (for example in [31, 38, 39]). The most common method for such embeddings is inspired by the text retrieval community and is called the “Bag of Words” (BoW) approach [35]. It consists in computing a dictionary of descriptor prototypes (usually by clustering a large number of descriptors) and then computing the histogram of occurrences of these prototypes (called “Visual Words”) within the set.

In still image classification, these approaches have been formalized in [40] as a decomposition of the mapping into two steps. The first step, namely the “coding step”, consists in mapping each descriptor into a codeword using the aforementioned dictionary. The second step, called the “pooling step”, aggregates the codewords into a single vector. Structural constraints such as sparsity [42] or locality [40] can be added to the coding process to ensure that most of the information is retained during the pooling step. Common pooling processes include averaging the codewords or retaining the entry-wise maximum among the codewords (max pooling). Extensions of the BoW model have recently been proposed to include more precise statistical information. In [2], the authors propose to model the distribution of distances of descriptors to the cluster centers. In the coding/pooling framework, each descriptor is coded by 1 in the bin corresponding to its distance to the center of the cluster to which it belongs, and 0 otherwise. The pooling is simply the averaging over all codewords.
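As an illustration, here is a minimal sketch of this coding/pooling view of the BoW signature, assuming the dictionary has been learned beforehand (e.g. by k-means); the function name and the hard-assignment coding are our choices, not the exact pipelines of [35] or [40].

```python
import numpy as np

def bow_signature(descriptors, dictionary, pooling="avg"):
    """Hard-assignment coding followed by average or max pooling.
    descriptors: (n, d) array; dictionary: (C, d) array of visual words."""
    # coding step: one-hot code of the nearest visual word for each descriptor
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    codes = np.zeros((len(descriptors), len(dictionary)))
    codes[np.arange(len(descriptors)), d2.argmin(axis=1)] = 1.0
    # pooling step: aggregate all codewords into a single signature vector
    return codes.mean(axis=0) if pooling == "avg" else codes.max(axis=0)
```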

In [15], the authors proposed a coding process where the deviation between the mean of the descriptors of the set and the center of the cluster to which they belong is computed. The whole mapping process can be seen as the deviation between a universal model (i.e. the dictionary) and a local realization (i.e. the set of descriptors). Using this model deviation approach, higher-order statistics have been proposed, like “super-vectors” in [43], “Fisher Vectors” in [16] or “VLAT” in [28, 29]. Fisher vectors are known to achieve state-of-the-art performance in image classification challenges [5].

To compare the performance of descriptors, we consider in this paper a compressed version of VLAT, which is known to achieve near state-of-the-art performance in still image classification with very large sets of descriptors [27]. In our case, the dense sampling in both the spatial and temporal directions leads to highly populated sets, which is consistent with the statistics computed in VLAT signatures. Given a clustering of the descriptor space with \(C\) clusters computed on some training set, the first- and second-order moments \(\mu _c\) and \(\tau _c\) are computed for each cluster \(c\):

$$\begin{aligned} \mu _c&= \frac{1}{|c|} \sum _{i} \sum _{r} \mathbf {\nu }_{rci}\end{aligned}$$
(1)
$$\begin{aligned} \tau _c&= \frac{1}{|c|} \sum _{i} \sum _{r} (\mathbf {\nu }_{rci}-\mu _c)(\mathbf {\nu }_{rci}-\mu _c)^\mathrm{T} \end{aligned}$$
(2)

with \(\vert c \vert \) being the number of descriptors \(\mathbf {\nu }_{rci}\) of video \(i\) in cluster \(c\), for all videos in the training set. The eigendecomposition of the covariance matrix \(\tau _c\) is then performed for each cluster \(c\):

$$\begin{aligned} \tau _c = \mathbf {V}_c \mathbf {D}_c \mathbf {V}_c^\top \end{aligned}$$
(3)

Using this decomposition, descriptors are projected on the subspace generated by the eigenvectors \(\mathbf {V}_c\).

The compressed VLAT signature \(\tau _{i,c}\) of video \(i\) is computed for each cluster \(c\) with the following equation:

$$\begin{aligned} \tau _{i,c} = \sum _r (\mathbf {V}_c (\mathbf {\nu }_{rci} - \mu _c))(\mathbf {V}_c (\mathbf {\nu }_{rci} - \mu _c))^\top - \mathbf {D}_c \end{aligned}$$
(4)

\(\tau _{i,c}\) are then flattened into vectors \(\mathbf {v}_{i,c}\). The unnormalized VLAT signature \(\mathbf {v}_i\) of video \(i\) is obtained by concatenation of \(\mathbf {v}_{i,c}\) for all clusters \(c\):

$$\begin{aligned} \mathbf {v}_i = (\mathbf {v}_{i,1} \dots \mathbf {v}_{i,C}) \end{aligned}$$
(5)

It is advisable to perform a normalization step for best performance:

$$\begin{aligned} \forall j,\,\, \mathbf {v}_i'[j]&= \mathrm{sign}(\mathbf {v}_i[j]) |\mathbf {v}_i[j]|^\alpha , \end{aligned}$$
(6)
$$\begin{aligned} \mathbf {x}_i&= \frac{\mathbf {v}_i'}{\Vert \mathbf {v}_i' \Vert } \end{aligned}$$
(7)

with \(\alpha =0.5\) typically. The size of the compressed VLAT signature depends on the number \(d_c\) of eigenvectors retained in each cluster, and is equal to \(\sum _c \frac{d_c(d_c+1)}{2}\) (since the matrices \(\tau _{i,c}\) are symmetric, only half of the coefficients are kept).
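For illustration, here is a minimal sketch of the compressed VLAT computation for one video, assuming the per-cluster statistics of Eqs. (1)–(3) (cluster centers, means \(\mu _c\), projections \(\mathbf {V}_c\) and \(\mathbf {D}_c\)) have been estimated offline on the training set; the function and argument names are ours.

```python
import numpy as np

def compressed_vlat(descriptors, centers, mu, V, D, alpha=0.5):
    """Compressed VLAT signature of one video (Eqs. 4-7).
    descriptors: (n, d); centers, mu: (C, d); V[c]: (d_c, d) eigenvectors; D[c]: (d_c, d_c)."""
    assign = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    parts = []
    for c in range(len(centers)):
        X = descriptors[assign == c] - mu[c]        # deviations to the cluster mean
        P = X @ V[c].T                              # projection on the d_c retained eigenvectors
        T = P.T @ P - D[c]                          # Eq. (4), outer products summed over descriptors
        iu = np.triu_indices(T.shape[0])            # T is symmetric: keep the upper triangle only
        parts.append(T[iu])
    v = np.concatenate(parts)                       # Eq. (5)
    v = np.sign(v) * np.abs(v) ** alpha             # power normalization, Eq. (6)
    return v / (np.linalg.norm(v) + 1e-12)          # L2 normalization, Eq. (7)
```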

3 Series of polynomial approximation of flow (SoPAF)

We propose to extend the SoPAF descriptor [19]. The SoPAF descriptor models the vector field of motion between two frames using a projection on an orthogonal basis of polynomials. This polynomial model is used in [21] to recognize movements in a video. The modeling is applied to the entire field and each frame is processed separately. In another context, this polynomial model is used locally to detect singularities such as vortices or saddle points in fluid motion [20]. Since motion can successfully be modeled by polynomials, we propose to use such models on a local neighborhood to obtain densely extracted local motion descriptors. We use two successive polynomial models. At first, the spatial vector field is modeled with a two-dimensional polynomial basis. Then, the time evolution of the spatial coefficients is modeled with a one-dimensional basis. We propose to extend the descriptor using the half-wave rectification technique proposed by Efros et al. [9]. Moreover, we propose to evaluate a sine function basis in addition to the polynomial one.

3.1 Spatial modeling using a polynomial basis

Let us consider the descriptor \(\mathbf {M}(i,j,t)\) located at coordinates \((i,j)\) in the frame and at time \(t\) in the video stream. Descriptors are computed using a space–time neighborhood around location \((i,j,t)\), denoted as window \(W(i,j,t)\). An example of \(W(i,j,t)\) is shown in Fig. 1a. We propose to model the vector field of motion inside the window \(W(i,j,t)\) by a finite expansion of orthogonal polynomials. Let us define the family of polynomial functions with two real variables as follows:

$$\begin{aligned} P_{K,L}(x_1,x_2) = \sum _{k=0}^{K} \sum _{l=0}^{L} \ a_{k,l} \ x_1^k \ x_2^l \end{aligned}$$
(8)

where \(k \in \{0..K\}\), \(l \in \{0..L\}\), \(K \in \mathbb {N}^+\) and \(L \in \mathbb {N}^+\) are respectively the maximum degrees of the variables \(x_1\) and \(x_2\), and \(\{a_{k,l}\}\) are the polynomial coefficients. The global degree of the polynomial is \(D=K+L\).

Fig. 1 Localization in space and space–time domains: a localization in the space domain, b localization example in the space–time domain with \(\tau =3\)

Let \( \mathcal {B}= \{P_{k,l}\}_{k\in \{0..K\},l\in \{0..L\}}\) be an orthogonal basis of polynomials. A basis of degree \(D\) is composed of \(n\) polynomials with \(n = (D+1)(D+2)/2\) as follows:

$$\begin{aligned} \mathcal {B} =&\{P_{0,0},P_{0,1},\cdots ,P_{0,L},P_{1,0},\cdots \nonumber \\&\cdots , P_{1,L-1},\cdots ,P_{K-1,0},P_{K-1,1},P_{K,0}\} \end{aligned}$$
(9)

We can create an orthogonal basis using the following three-term recurrence:

$$\begin{aligned} \left\{ \begin{array}{lll} P_{-1,l}(\mathbf {x}) = 0 \\ P_{k,-1}(\mathbf {x}) = 0 \\ P_{0,0}(\mathbf {x}) = 1 \\ P_{k+1,l}(\mathbf {x}) = (x_1 - \lambda _{k+1,l}) P_{k,l}(\mathbf {x}) -\mu _{k+1,l}P_{k-1,l}(\mathbf {x}) \\ P_{k,l+1}(\mathbf {x}) = (x_2 - \lambda _{k,l+1}) P_{k,l}(\mathbf {x}) -\mu _{k,l+1}P_{k,l-1}(\mathbf {x}) \\ \end{array} \right. \end{aligned}$$
(10)

where \(\mathbf {x}=(x_1,x_2)\) and the coefficients \(\lambda _{k,l}\) and \(\mu _{k,l}\) are given by

$$\begin{aligned}&\lambda _{k+1,l}= \frac{\langle x_1 P_ {k,l}(\mathbf {x}) \mid P_{k,l}(\mathbf {x}) \rangle }{\Vert P_ {k,l}(\mathbf {x}) \Vert ^2} \quad \lambda _{k,l+1}= \frac{\langle x_2 P_ {k,l}(\mathbf {x}) \mid P_{k,l}(\mathbf {x}) \rangle }{\Vert P_ {k,l}(\mathbf {x}) \Vert ^2}\nonumber \\&\mu _{k+1,l}= \frac{\langle P_ {k,l}(\mathbf {x}) \mid P_{k,l}(\mathbf {x}) \rangle }{\Vert P_ {k-1,l}(\mathbf {x}) \Vert ^2} \quad \mu _{k,l+1}= \frac{\langle P_ {k,l}(\mathbf {x}) \mid P_{k,l}(\mathbf {x}) \rangle }{\Vert P_ {k,l-1}(\mathbf {x}) \Vert ^2} \end{aligned}$$
(11)

and \(\langle \cdot \mid \cdot \rangle \) is the usual inner product for polynomial functions:

$$\begin{aligned} \langle P_1 \mid P_2\rangle = \iint _{\Omega } P_1(\mathbf {x})P_2(\mathbf {x})w(\mathbf {x}) \,\mathrm{d}\mathbf {x} \end{aligned}$$
(12)

with \(w\) the weighting function that determines the polynomial family and \(\Omega \) the spatial domain covered by the window \(W(i,j,t)\). We use Legendre polynomials (\(w(\mathbf {x})=1, \forall \mathbf {x}\)).
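Since \(w(\mathbf {x})=1\), products of one-dimensional Legendre polynomials form such an orthogonal basis on a rectangular window. The following sketch evaluates this basis numerically instead of running the recurrence (10) explicitly; the separable construction, the mapping of the window to \([-1,1]^2\) and the function names are our assumptions.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_basis_2d(h, w, D):
    """Evaluate an orthogonal 2D polynomial basis {P_{k,l}}, k + l <= D,
    on an h x w window mapped to [-1, 1] x [-1, 1], as products of 1D
    Legendre polynomials (orthogonal since the weight w(x) = 1)."""
    x1 = np.linspace(-1, 1, h)
    x2 = np.linspace(-1, 1, w)
    basis, index = [], []
    for k in range(D + 1):
        for l in range(D + 1 - k):
            Pk = legendre.legval(x1, np.eye(D + 1)[k])     # L_k(x1)
            Pl = legendre.legval(x2, np.eye(D + 1)[l])     # L_l(x2)
            basis.append(np.outer(Pk, Pl))                 # P_{k,l}(x) = L_k(x1) L_l(x2)
            index.append((k, l))
    return np.stack(basis), index                          # shape ((D+1)(D+2)/2, h, w)
```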

Using this basis, the approximation of the horizontal motion component \(\mathcal {U}\) is:

$$\begin{aligned} \tilde{\mathcal {U}} = \sum _{k = 0}^{D} \sum _{l = 0}^{D-k} \tilde{u}_{k,l} \frac{ P_{k,l}(\mathbf {x}) }{ \Vert P_{k,l}(\mathbf {x}) \Vert } \end{aligned}$$
(13)

The polynomial coefficients \(\tilde{u}_{k,l}\) are given by the projection of component \(\mathcal {U}\) onto normalized \(\mathcal {B}\) elements:

$$\begin{aligned} \tilde{u}_{k,l} = \frac{ \langle \mathcal {U} \mid P_{k,l}(\mathbf {x}) \rangle }{ \Vert P_{k,l}(\mathbf {x}) \Vert } \end{aligned}$$
(14)

Similarly, the vertical motion polynomial coefficients \(\tilde{v}_{k,l}\) are given by projecting the vertical component \(\mathcal {V}\) onto the \(\mathcal {B}\) elements. Using the polynomial basis \(\mathcal {B}\) of degree \(D\), the vector field associated with the window \(W(i,j,t)\) is thus modeled by \((D+1) \times (D+2)\) coefficients.
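A minimal sketch of the projections (13)–(14), reusing legendre_basis_2d from the sketch above and approximating the inner product (12) by a sum over the window pixels (this discretization is an assumption):

```python
import numpy as np

def spatial_coefficients(U, V, basis):
    """Project the two flow components of one window onto the normalized
    basis elements (Eqs. 13-14); U, V: (h, w) arrays, basis: (n, h, w)."""
    norms = np.sqrt((basis ** 2).sum(axis=(1, 2)))     # ||P_{k,l}|| on the window grid
    u_coeffs = (basis * U).sum(axis=(1, 2)) / norms    # <U | P_{k,l}> / ||P_{k,l}||
    v_coeffs = (basis * V).sum(axis=(1, 2)) / norms
    return u_coeffs, v_coeffs

# usage on a 32 x 32 window of flow, with a basis of global degree D = 4:
# basis, index = legendre_basis_2d(32, 32, D=4)
# u_c, v_c = spatial_coefficients(U_window, V_window, basis)
```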

3.2 Time modeling using a polynomial basis

Since an action spans more than two frames, we propose to model motion information in longer space–time volumes.

Let us consider the descriptor located at coordinates \((i,j)\) in the frame and at time \(t_0\) in the video stream.

We consider the same spatial domain as previously defined (see Fig. 1a). Moreover, we now consider the space–time tube defined by all the windows \(W(i, j, t_0)\) to \(W(i, j, t_0+\tau )\), with \(\tau \) being the length of the descriptor's temporal domain (see Fig. 1b). For each frame at time \(t\) between \(t_0\) and \(t_0 + \tau \), we propose to model the vector field of motion inside the window \(W(i,j,t)\) of the tube by the coefficients \(\tilde{u}_{k,l}\) and \(\tilde{v}_{k,l}\), as defined in the previous section.

Then all coefficients \(\tilde{u}_{k,l}(i,j,t)\) (respectively \(\tilde{v}_{k,l}(i,j,t)\)) for \(t=t_0\) to \(t=t_0 + \tau \) are grouped in a vector defined as

$$\begin{aligned} \mathbf {u}_{k,l}(i,j,t_0) = [ \tilde{u}_{k,l}(i,j,t_0), \dots , \tilde{u}_{k,l}(i,j,t_0+\tau ) ] \end{aligned}$$
(15)

We model the time evolution of the coefficients \(\tilde{u}_{k,l}(i,j,t)\) (resp. \(\tilde{v}_{k,l}(i,j,t)\)) by projecting \(\mathbf {u}_{k,l}(i,j,t_0)\) (resp. \(\mathbf {v}_{k,l}(i,j,t_0)\)) onto a one-dimensional orthogonal function basis. In [19], we use a Legendre polynomial basis of degree \(d\) defined by

$$\begin{aligned} \left\{ \begin{aligned} P_{-1}(t)&= 0 \\ P_0(t)&= 1 \\ T_n(t)&= (t- \langle t P_{n-1}(t) \mid P_{n-1}(t) \rangle )P_{n-1}(t)-P_{n-2}(t) \\ P_n(t)&= \frac{T_n(t)}{\Vert T_n \Vert } \end{aligned} \right. \end{aligned}$$
(16)

In this work, we also use a sine basis for time evolution modeling. Using such a basis (polynomial or sine) of degree \(d\), the approximation of \(\mathbf {u}_{k,l}(i,j,t)\) is:

$$\begin{aligned} \tilde{\mathbf {u}}_{k,l}(i,j,t) = \sum _{n = 0}^{d} \tilde{u}_{k,l,n}(i,j,t) \frac{ P_{n}(t) }{ \Vert P_{n}(t) \Vert } \end{aligned}$$
(17)

The model has \(d+1\) coefficients \(\tilde{u}_{k,l,n}(i,j,t)\), \(n=0,\dots ,d\), given by

$$\begin{aligned} \tilde{u}_{k,l,n}(i,j,t) = \frac{ \langle \mathbf {u}_{k,l}(i,j,t) \mid P_{n}(t) \rangle }{ \Vert P_{n}(t) \Vert } \end{aligned}$$
(18)

The time evolution of a given coefficient \(\tilde{u}_{k,l}(i,j)\) (respectively \(\tilde{v}_{k,l}(i,j)\)) is given by the vector \(\mathbf {m}_{k,l}(i,j,t_0)\) (respectively \(\mathbf {n}_{k,l}(i,j,t_0)\)) as defined in Eq. (19)

$$\begin{aligned} \mathbf {m}_{k,l}(i,j,t_0) = [\tilde{u}_{k,l,0}(i,j,t_0), \tilde{u}_{k,l,1}(i,j,t_0), \dots ,\tilde{u}_{k,l,d}(i,j,t_0) ] \end{aligned}$$
(19)

The feature descriptor \(\mathbf {\nu }(i,j,t_0)\) for the whole space–time volume beginning at time \(t_0\) and centered at position \((i,j)\) is given by

$$\begin{aligned}&\mathbf {\nu }(i,j,t_0)\! = [ \mathbf {m}_{0,0}, \mathbf {m}_{0,1},\cdots , \mathbf {m}_{0,L},\mathbf {m}_{1,0},\cdots ,\mathbf {m}_{1,L-1},\cdots \nonumber \\&\quad \cdots , \mathbf {m}_{K-1,0},\mathbf {m}_{K-1,1},\mathbf {m}_{K,0}, \mathbf {n}_{0,0},\mathbf {n}_{0,1},\cdots \nonumber \\&\quad \cdots ,\mathbf {n}_{0,L}, \mathbf {n}_{1,0},\cdots ,\mathbf {n}_{1,L-1},\cdots \nonumber \\&\quad \mathbf {n}_{K-1,0},\mathbf {n}_{K-1,1},\mathbf {n}_{K,0}] \end{aligned}$$
(20)

Here, \(\mathbf {m}_{k,l}(i,j,t_0)\) and \(\mathbf {n}_{k,l}(i,j,t_0)\) are written as \(\mathbf {m}_{k,l}\) and \(\mathbf {n}_{k,l}\) for clarity reasons. The size of the descriptor \(\mathbf {\nu }(i,j,t_0)\) is \((D+1) \times (D+2) \times (d+1)\).

We keep the name Series of Polynomial Approximation of Flow (SoPAF) for the descriptor as defined in [19]. If a sine basis is used to model the time evolution of the motion vector field coefficients, we name the descriptor SoPAF + Sine. Note that, for the spatial modeling of the vector field, only the polynomial basis is used.
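A minimal sketch of the temporal modeling and of the assembly of Eq. (20), assuming the spatial coefficients of each window of the tube have already been computed (e.g. with the sketches of Sect. 3.1); the sampling of the Legendre basis, the particular sine family and the function names are our assumptions.

```python
import numpy as np
from numpy.polynomial import legendre

def time_basis(length, d, kind="legendre"):
    """d + 1 normalized basis functions sampled over the tube's temporal domain."""
    t = np.linspace(-1, 1, length)
    if kind == "legendre":
        B = np.stack([legendre.legval(t, np.eye(d + 1)[n]) for n in range(d + 1)])
    else:                                   # assumed sine family
        B = np.stack([np.sin(np.pi * (n + 1) * (t + 1) / 2) for n in range(d + 1)])
    return B / np.linalg.norm(B, axis=1, keepdims=True)

def sopaf_descriptor(u_series, v_series, d, kind="legendre"):
    """u_series, v_series: (tau + 1, n_spatial) spatial coefficients over the tube.
    Returns the concatenated descriptor of Eq. (20)."""
    B = time_basis(u_series.shape[0], d, kind)
    m = B @ u_series                        # (d + 1, n_spatial): Eqs. (17)-(19)
    n = B @ v_series
    return np.concatenate([m.T.ravel(), n.T.ravel()])
```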

3.3 Series of local polynomial approximation of rectified flow

We propose an extension of the SoPAF descriptor using the half-wave rectification coding proposed by Efros et al. in [9] and used in several works. The half-wave rectification coding produces a four-dimensional code from the horizontal component \(\mathcal {U}\) and the vertical component \(\mathcal {V}\) of the vector field. The code is defined as:

$$\begin{aligned} \mathcal {U}^+(\mathbf x ) = \left\{ \begin{array}{l} \mathcal {U}(\mathbf x ) \quad \text { if }\,\, \mathcal {U}(\mathbf x ) > 0 \\ 0 \quad \text { else } \end{array} \right. \end{aligned}$$
(21)
$$\begin{aligned} \mathcal {U}^-(\mathbf x ) = \left\{ \begin{array}{l} \mathcal {U}(\mathbf x ) \quad \text { if } \,\, \mathcal {U}(\mathbf x ) < 0 \\ 0 \quad \text { else } \end{array} \right. \end{aligned}$$
(22)
$$\begin{aligned} \mathcal {V}^+(\mathbf x ) = \left\{ \begin{array}{l} \mathcal {V}(\mathbf x ) \quad \text { if }\,\, \mathcal {V}(\mathbf x ) > 0 \\ 0 \quad \text { else } \end{array} \right. \end{aligned}$$
(23)
$$\begin{aligned} \mathcal {V}^-(\mathbf x ) = \left\{ \begin{array}{l} \mathcal {V}(\mathbf x ) \quad \text { if }\,\, \mathcal {V}(\mathbf x ) < 0 \\ 0 \quad \text { else } \end{array} \right. \end{aligned}$$
(24)

This coding is applied to the motion vector field before the modeling steps of the SoPAF descriptor. This preprocessing doubles the dimension of the obtained descriptor. We show in Fig. 2 an example of half-wave rectification coding. When the half-wave rectification coding step is used, we name the descriptor Series of Polynomial Approximation of Rectified Flow (SoPARF). If a sine basis is used to model the time evolution of the motion vector field coefficients, we name the descriptor SoPARF + Sine.
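The rectification of Eqs. (21)–(24) amounts to keeping the positive and negative parts of each component separately, as in this short sketch (the function name is ours):

```python
import numpy as np

def half_wave_rectify(U, V):
    """Half-wave rectification (Eqs. 21-24): split each flow component
    into its positive and negative parts, zero elsewhere."""
    U_pos, U_neg = np.where(U > 0, U, 0.0), np.where(U < 0, U, 0.0)
    V_pos, V_neg = np.where(V > 0, V, 0.0), np.where(V < 0, V, 0.0)
    return U_pos, U_neg, V_pos, V_neg
```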

Fig. 2 Half-wave rectification: a video at time \(t\), b video at time \(t + 1\), c \(\mathcal {U}\) component, d \(\mathcal {V}\) component, e \(\mathcal {U}^{+}\) component, f \(\mathcal {U}^{-}\) component, g \(\mathcal {V}^{+}\) component, h \(\mathcal {V}^{-}\) component

3.4 Trajectories

As proposed in [38], we use trajectories to follow the spatial position of the window along the time axis.

In our case, the window \(W(i_1,j_1,t_0+1)\) at time \(t_0+1\) is selected as the best matching block with respect to the window \(W(i_0,j_0,t_0)\) at time \(t_0\). This matching is performed using the three-step search block matching method from [22]. The temporal evolution of the spatial coefficients is thus modeled on tubes instead of volumes.
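A minimal sketch of such a block-matching step (a generic three-step search; the exact search schedule of [22], the block size, the initial step and the SAD cost are assumptions):

```python
import numpy as np

def three_step_search(prev, cur, i, j, block=32, step=4):
    """Track the block of `prev` whose top-left corner is (i, j) into `cur`
    with a three-step search; returns the matched top-left position."""
    h, w = prev.shape
    ref = prev[i:i + block, j:j + block].astype(np.float64)

    def sad(y, x):                          # sum of absolute differences cost
        if y < 0 or x < 0 or y + block > h or x + block > w:
            return np.inf
        return np.abs(cur[y:y + block, x:x + block] - ref).sum()

    best = (i, j)
    while step >= 1:                        # refine the search around the current best match
        candidates = [(best[0] + dy, best[1] + dx)
                      for dy in (-step, 0, step) for dx in (-step, 0, step)]
        best = min(candidates, key=lambda p: sad(*p))
        step //= 2
    return best
```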

4 Experiments

We carry out experiments on two well-known human action recognition datasets. The first one is the UCF11 dataset [24], and the second one is the Hollywood2 Human Actions dataset [23].

In this section, we first introduce the two datasets. Second, we evaluate the parameters of our descriptor on the UCF11 dataset. Third, we compare our descriptor to the literature results on the UCF11 and Hollywood2 datasets, using the best parameters obtained in the UCF11 evaluation.

We use the Horn and Schunck optical flow algorithm [13] for motion extraction, with 25 iterations and the regularization parameter \(\lambda \) set to 0.1. We extract the motion fields at 5 scales for UCF11 and 7 scales for Hollywood2, with a scale factor of 0.8.
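For reference, a minimal sketch of the classic Horn–Schunck iteration with these settings (the discretization of the derivatives and the exact role of \(\lambda \) in [13] may differ, and the multi-scale extraction is not shown):

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, lam=0.1, n_iter=25):
    """Classic Horn-Schunck optical flow between two grayscale frames;
    returns the horizontal (U) and vertical (V) motion components."""
    I1, I2 = I1.astype(np.float64), I2.astype(np.float64)
    Iy, Ix = np.gradient((I1 + I2) / 2.0)          # spatial derivatives (rows -> y, cols -> x)
    It = I2 - I1                                   # temporal derivative
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]]) / 12.0   # neighbourhood average kernel
    U, V = np.zeros_like(I1), np.zeros_like(I1)
    for _ in range(n_iter):
        U_bar, V_bar = convolve(U, avg), convolve(V, avg)
        num = Ix * U_bar + Iy * V_bar + It
        den = lam ** 2 + Ix ** 2 + Iy ** 2
        U = U_bar - Ix * num / den                 # Horn-Schunck update
        V = V_bar - Iy * num / den
    return U, V
```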

For the experiments, we use the VLAT indexing method to obtain signatures from the descriptors. We train a linear SVM for classification.

4.1 Datasets

4.1.1 UCF11 dataset

The UCF11 [24] dataset is an action recognition dataset with 11 action categories, consisting of realistic videos taken from YouTube (Fig. 3). The dataset is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background and illumination conditions. The videos are grouped into 25 groups, where each group consists of more than 4 action clips. The video clips in the same group may share some common features, such as the same person, similar background or similar viewpoint. The experimental setup is a leave-one-group-out cross-validation.

Fig. 3 Example videos from the UCF11 dataset

4.1.2 Hollywood2 dataset

The Hollywood2 [23] dataset consists of a collection of video clips and extracts from 69 films, covering 12 classes of human actions (Fig. 4). It accounts for approximately 20 h of video and contains about 150 video samples per action. It contains a variety of spatial scales, camera zooms, deleted scenes and compression artifacts, which allows a more realistic assessment of human action classification methods. We use the official train and test splits for the evaluation.

Fig. 4 Example videos from the Hollywood2 dataset

4.2 Evaluation of our descriptor

In this section, we evaluate our descriptor. The spatial size of the space–time volumes is set to \(32 \times 32\) pixels and the length is set to 15 frames. These parameters are chosen according to the evaluation of the HOG, HOF and MBH parameters in [38]. The spatial step for dense extraction is set to 10 pixels and the time step is set to five frames. In Fig. 5, we show the results of our evaluation. In Fig. 5a, we show the results for SoPAF with the spatial degree varied from 2 to 4 and the time degree varied from 0 to 5. The best result is obtained for spatial degree 4 and time degree 1. In Fig. 5b, we show the results for SoPAF + Sine over the same ranges of degrees. The best result is obtained for spatial degree 3 and time degree 1. In Fig. 5c, we show the results for SoPARF. The best result is obtained for spatial degree 4 and time degree 2, and is clearly better than the results of SoPAF and SoPAF + Sine. In Fig. 5d, we show the results for SoPARF + Sine. The best result is obtained for spatial degree 4 and time degree 1; it is slightly lower than SoPARF but clearly better than SoPAF and SoPAF + Sine.

We now compare our descriptors with HOF, since it models the same information as ours. To do so, we evaluate HOF for spatial grids from \(2\times 2\) to \(4\times 4\) cells and time grids from 1 to 4 cells. We show the results of this evaluation in Fig. 6. Note that we obtain at best \(80.4~\%\), which is better than the result of Wang et al. in [38], albeit with a different configuration of the HOF descriptor. Our best setup is obtained for a spatial grid of \(3\times 3\) cells and a time grid of two cells. With the SoPARF and SoPARF + Sine descriptors, we obtain significantly better results (\(83.9\) and \(83.6\,\%\) respectively) than with HOF (\(80.4\,\%\)).

Fig. 5 Evaluation of space and time degrees for our descriptor on the UCF11 dataset; the horizontal axis represents the degree of the temporal function basis and the vertical axis represents the average accuracy. a Series of polynomial approximation of flow (SoPAF), b series of polynomial approximation of flow with sine function approximation along the time axis (SoPAF \(+\) Sine), c series of polynomial approximation of rectified flow (SoPARF), d series of polynomial approximation of rectified flow with sine function approximation along the time axis (SoPARF \(+\) Sine)

4.3 Comparison of descriptors' computational time

We compare the computation time of our four best setups to that of the best HOF descriptor from the previous evaluation. The computation of descriptors is performed on an Intel(R) Xeon(R) E5-2620 0 \(@\) 2.00 GHz processor. We compute all the descriptors of the video called “\(\mathrm{v}\_\mathrm{biking}\_01\_01\)” of the UCF11 dataset.

For a fair comparison, we use the same space–time dimensions for all descriptors: \(30 \times 30\) pixels spatially and 14 frames temporally. The spatial step between descriptors is set to 10 pixels and the temporal step is set to 5 frames.

The computation times per frame (in seconds) are reported in Table 1. As one can see, the descriptors that do not use rectification coding are comparable to HOF in computational time. When rectification coding is used, the computational time clearly increases. However, the best descriptor in our evaluation is SoPARF, and its computational time remains acceptable for real datasets.

4.4 Experimental results

In this section we compare our descriptors to the literature on the two datasets. For each dataset, we show the results with our SoPARF and SoPARF + Sine descriptors alone and in combination with HOG and MBH descriptors. Let us note that our approach uses linear classifiers, and thus is more efficient both for training classifiers and for classifying video shots, contrary to the methods in [38] and [11].

In Table 2, we show the results obtained on the UCF11 dataset and compare them to recent results from the literature. We obtain good results using only the proposed SoPARF or SoPARF + Sine descriptors. SoPARF improves on the HOF descriptor of Wang et al. by \(11\,\%\) and on our implementation of HOF by \(3\,\%\). SoPARF provides the same results as the MBH of Wang et al. and improves on SoPAF by \(4\,\%\). When using the SoPARF, HOG and MBH combination or the SoPARF + Sine, HOG and MBH combination, we obtain \(86.0\,\%\) average accuracy, which is above state-of-the-art performance while using a linear classifier and combining fewer descriptors.

Fig. 6 Evaluation of the HOF descriptor on the UCF11 dataset; the horizontal axis represents the number of cells along the time axis and the vertical axis represents the average accuracy

Table 1 Computational time of the best descriptors from the evaluation presented in Sect. 4.2
Table 2 Classification average accuracy on the UCF11 dataset

In Table 3, we show the results obtained on the Hollywood2 dataset. With our SoPARF descriptor, we obtain better results than the related HOG, HOF and MBH descriptors of [38] and than our implementations of the HOG, HOF and MBH descriptors. In particular, we improve on the HOF of Wang et al. by \(4\,\%\) and on our implementation of HOF by \(6\,\%\). SoPARF improves on SoPAF by \(3\,\%\), although this comes at the price of slightly increasing the computational time and the dimension of the resulting descriptor. When combining SoPARF with HOG and HOF, we obtain a mAP of \(58.6\,\%\) with a linear classifier, slightly better than the results obtained by combining four descriptors in [38].

Table 3 Mean average precision on the Hollywood2 dataset

5 Conclusion

In this paper, we introduced a novel family of local motion descriptors using polynomial approximations of the optical flow and time evolution modeling.

For a given spatial window, after projecting the components of the optical flow on an orthogonal bivariate polynomial basis, we model the temporal evolution of the spatial coefficients with a one-dimensional polynomial basis. To model homogeneous motion patterns, our space–time volumes follow the trajectories of the associated image patches. The use of the half-wave rectification coding improves the results of the SoPAF descriptor. Moreover, we show the possibility of using other bases for modeling the time evolution of the spatial coefficients.

We carry out experiments on the well-known UCF11 and Hollywood2 datasets, using a recent signature method from image classification. We obtain improved results over popular descriptors such as HOG, HOF and MBH, which highlights the soundness of the approach.

Further improvements would be to use this framework to model the gradient field of images or of the optical flow, as in HOG and MBH, and to extend the coding step with other approaches.