
1 Introduction

Human action recognition (HAR) is a computer vision task that seeks to monitor, understand, and characterize humans in videos [7]. This task has a wide pool of applications, including automatic surveillance, video indexing and retrieval, and virtual reality [5]. The conventional pipeline for action recognition can be divided into three stages: (i) feature extraction from raw videos, (ii) data representation, and (iii) classification into predefined categories [20].

Regarding feature extraction, the literature exhibits two trends: hand-crafted features and Convolutional Neural Network (CNN) features. Both intend to describe the local space, codify the motion information, and then combine these sources to allow a proper transcription of human activity [15]. To date, the two-stream CNN is the most effective framework for action recognition, employing two deep networks and fusion techniques to take advantage of both appearance and motion cues [13]. However, CNN-based methods hamper in-depth action analysis and understanding, as they lack visual interpretability [22]. Moreover, deep learning requires large amounts of training data, which in many applications are not available [1]. In contrast, the most popular hand-crafted feature estimation technique is known as Improved Dense Trajectories (iDT) [20]. The method describes the local space around trajectories generated by tracking a dense grid of points, employing descriptors such as Histograms of Oriented Gradients (HOG) for codifying appearance through color gradients, Histograms of Optical Flow (HOF) for describing movement, and Motion Boundary Histograms (MBH) for codifying changes in motion [3].

For data representation, authors have proposed feature encoding and relevance analysis for highlighting salient patterns and enabling the codification of visual information [12]. Super-vector-based methods such as Fisher Vector (FV) and Vector of Locally Aggregated Descriptors (VLAD) are the most well-known approaches for feature encoding in action recognition tasks [19]. On the other hand, non-linear relevance analysis using kernel methods has shown promising results in recent research [7]. Nevertheless, kernel evaluation requires computing and storing large distance matrices, as well as tuning free parameters, which increases computational complexity [7]. Lastly, it is conventional to employ Support Vector Machines (SVM) for classification [17].

Both FV and VLAD are supported by the Gaussian Mixture Model (GMM) to generate a codebook of visual words [21]. These methods quantify the similarity between a video sample and a previously computed codebook, encoding visual information by calculating Gaussian responsibilities [6]. However, GMMs trained by optimization-based methods, e.g., Expectation Maximization (EM), require extensive cross-validation for selecting the number of visual words in the codebook [8]. Moreover, the initialization required by these training methods can make the models fall into local minima [2]. Therefore, using a conventional GMM implies a large number of operations and high memory requirements, which increases the computational burden of conventional recognition systems.

In this paper, we introduce a novel data encoding framework using Bayesian inference and Dirichlet processes to support video-based HAR. Our approach is fully automatic, allowing every parameter in the model to be updated hierarchically through Gibbs sampling, a Markov Chain Monte Carlo (MCMC) algorithm. Specifically, our approach includes an Infinite Gaussian Mixture Model (IGMM) for revealing a set of discriminant visual words, trained through an MCMC-based optimization that evades local minima. In fact, the infinite limit on the number of components avoids estimating this parameter through extensive cross-validation. Attained results on both the UCF50 and HMDB51 databases demonstrate that our proposal obtains promising recognition performance and computational savings, favoring HAR tasks.

The rest of the paper is organized as follows: Sect. 2 presents the main theoretical background. Section 3 describes the experimental setup. Section 4 introduces results and discussions. Finally, Sect. 5 presents conclusions and future work.

2 Infinite Gaussian Model for Fisher Vector Encoding

Let \(\{{\varvec{Z}}_n \!\!\in \!\!\mathbb {R}^{T_n \times D}, y_n \!\!\in \!\!\mathbb {N}\}_{n=1}^{N}\) be an input-output pair set holding N human action videos. Each sample \({\varvec{Z}}_n\) is represented by \(T_n\) observations. The local space of every observation is characterized by a D-dimensional descriptor, as in [20]. The output label \(y_n\) denotes the specific human action of video n. From \({\varvec{Z}}\!\!\in \!\!\mathbb {R}^{T\times D}\), where \(T\,\!\!=\!\!\,\sum _{n=1}^{N}T_n\), we aim to train a generative model using the IGMM. The procedure is as follows [4]:

The likelihood of observation \({\varvec{z}}_t \!\!\in \!\!\mathbb {R}^D\) under a GMM with \(k_{\text {rep}}\) components is:

$$\begin{aligned} p({\varvec{z}}_t|\{{\varvec{\mu }}_{j},{\varvec{S}}_{j},\pi _{j}\}_{j=1}^{k_{\text {rep}}}) = \sum _{j=1}^{k_{\text {rep}}} \pi _j \mathcal {N}({\varvec{\mu }}_j,{\varvec{S}}_{j}^{-1}) \end{aligned}$$
(1)

where \({\varvec{\mu }}_j\!\!\in \!\!\mathbb {R}^D\) are mean vectors, \({\varvec{S}}_{j}\!\!\in \!\!\mathbb {R}^{D\times D}\) are precision matrices, and \(\pi _j\) are the mixing proportions. Variable \(k_{\text {rep}}\) denotes the number of Gaussian components that have data associated with them, known as represented classes [14].
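
For illustration, a minimal NumPy/SciPy sketch of Eq. 1 is given below; the function and variable names are ours, and the sketch is not part of the reported implementation (which was written in C++ and MATLAB):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_likelihood(z_t, mus, Ss, pis):
    """Evaluate Eq. (1): likelihood of one descriptor z_t under the mixture.

    z_t : (D,) observation; mus : list of (D,) mean vectors;
    Ss  : list of (D, D) precision matrices; pis : mixing proportions.
    """
    return sum(pi_j * multivariate_normal.pdf(z_t, mean=mu_j, cov=np.linalg.inv(S_j))
               for pi_j, mu_j, S_j in zip(pis, mus, Ss))
```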

2.1 Component Parameters

The component means \({\varvec{\mu }}_j\) and precisions \({\varvec{S}}_j\) are given by Gaussian and Wishart priors, respectively:

$$\begin{aligned} p({\varvec{\mu }}_j|{\varvec{\lambda }},{\varvec{R}})\sim \mathcal {N}({\varvec{\lambda }},{\varvec{R}}^{-1}) \qquad p({\varvec{S}}_j|\beta ,{\varvec{W}})\sim \mathcal {W}(\beta ,{\varvec{W}}^{-1}) \end{aligned}$$
(2)

where \({\varvec{\lambda }} \!\!\in \!\!\mathbb {R}^{D}\) is a mean vector, \({\varvec{R}}\!\!\in \!\!\mathbb {R}^{D\times D}\) and \({\varvec{W}} \!\!\in \!\!\mathbb {R}^{D\times D}\) are precision matrices, and \(\beta \) is the degrees of freedom. These hyper-parameters are common to all components. The conditional posterior on \({\varvec{\mu }}_j\) is obtained by conjugating its prior:

$$\begin{aligned} p({\varvec{\mu }}_j|{\varvec{\lambda }}, {\varvec{R}},&\{{\varvec{z}}_t:c_{t,j}\,\!\!=\!\!\,1\}, {\varvec{S}}_j)\propto \prod _{t:c_{\,t,j}= 1} p({\varvec{z}}_t|{\varvec{\mu }}_j,{\varvec{S}}_j)\times p({\varvec{\mu }}_j|{\varvec{\lambda }},{\varvec{R}}) \nonumber \\&\sim \mathcal {N}\left( (T_j\,\overline{{\varvec{z}}}_j\,{\varvec{S}}_j + {\varvec{\lambda }}{\varvec{R}})(T_j\,{\varvec{S}}_j+{\varvec{R}})^{-1},(T_j\,{\varvec{S}}_j + {\varvec{R}})^{-1}\right) \end{aligned}$$
(3)

\({\varvec{c}}_t \!\!\in \!\!\mathbb {R}^{k}\) is a latent variable with 1-of-k notation, where k includes both represented and unrepresented classes. Unrepresented classes are virtually infinite [18]. \(T_j\) is the number of observations belonging to class j. Likewise, \(\overline{{\varvec{z}}}_j\) is the average vector of these observations:

$$\begin{aligned} \overline{{\varvec{z}}}_j = \frac{1}{T_j} \sum _{t:c_{\,t,j}=1} {\varvec{z}}_t, \quad T_j = \sum _{t=1}^{T} c_{\,t,j} \end{aligned}$$
(4)
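
A minimal sketch of the Gibbs update implied by Eqs. 3 and 4 could read as follows (illustrative names; a column-vector convention is used, so the mean is written as \((T_j{\varvec{S}}_j+{\varvec{R}})^{-1}(T_j{\varvec{S}}_j\overline{{\varvec{z}}}_j+{\varvec{R}}{\varvec{\lambda }})\), which is equivalent for symmetric precisions):

```python
import numpy as np

def sample_mu_j(Z_j, S_j, lam, R, rng):
    """Draw mu_j from its conditional posterior, Eq. (3).

    Z_j : (T_j, D) observations currently assigned to component j;
    S_j : (D, D) component precision; lam, R : prior mean and precision.
    """
    T_j = Z_j.shape[0]
    z_bar = Z_j.mean(axis=0)                      # Eq. (4)
    post_cov = np.linalg.inv(T_j * S_j + R)       # posterior covariance
    post_mean = post_cov @ (T_j * S_j @ z_bar + R @ lam)
    return rng.multivariate_normal(post_mean, post_cov)
```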

The conditional posterior on \({\varvec{S}}_j\) is obtained by conjugating its prior:

$$\begin{aligned} p(&{\varvec{S}}_j|\beta ,{\varvec{W}},\{{\varvec{z}}_t:c_{t,j}\,\!\!=\!\!\,1\}, {\varvec{\mu }}_j)\propto \prod _{t:c_{\,t,j}= 1} p({\varvec{z}}_t|{\varvec{\mu }}_j,{\varvec{S}}_j)\times p({\varvec{S}}_j|\beta ,{\varvec{W}}) \nonumber \\&\sim \mathcal {W}(\beta +T_j,[\frac{1}{\beta + T_j}(\beta \,{\varvec{W}}+\sum _{t:c_{\,t,j}=1} ({\varvec{z}}_t-{\varvec{\mu }}_j)^{\top }({\varvec{z}}_t-{\varvec{\mu }}_j))]^{-1}) \end{aligned}$$
(5)
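
A companion sketch for Eq. 5 follows. Note that, in the paper's convention, \(\mathcal {W}(a,{\varvec{B}})\) denotes a Wishart draw with mean \({\varvec{B}}\); mapping this onto SciPy's parameterization (mean = df \(\times\) scale) is our assumption:

```python
import numpy as np
from scipy.stats import wishart

def sample_S_j(Z_j, mu_j, beta, W, rng):
    """Draw S_j from its conditional posterior, Eq. (5).

    Paper convention W(a, B) has mean B; scipy.stats.wishart has mean
    df * scale, so we pass df = a and scale = B / a (our assumption).
    """
    T_j = Z_j.shape[0]
    diffs = Z_j - mu_j
    scatter = diffs.T @ diffs                     # sum of outer products
    df = beta + T_j
    B = df * np.linalg.inv(beta * W + scatter)    # posterior scale of Eq. (5)
    return wishart.rvs(df=df, scale=B / df, random_state=rng)
```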

2.2 Hyper-parameters

For hyper-parameters \({\varvec{\lambda }},{\varvec{R}}\), and \({\varvec{W}}\) the priors are defined as follows:

$$\begin{aligned} p({\varvec{\lambda }})\sim \mathcal {N}({\varvec{\mu }}_{Z},\mathbf{cov}_{Z}) \quad p({\varvec{R}})\sim \mathcal {W}(1,\mathbf{cov}_{Z}^{-1}) \quad p({\varvec{W}})\sim \mathcal {W}(1,\mathbf{cov}_{Z}) \end{aligned}$$
(6)

Variables \({\varvec{\mu }}_{Z} \!\!\in \!\!\mathbb {R}^{D}\) and \(\mathbf{cov}_{Z} \!\!\in \!\!\mathbb {R}^{D\times D}\) are, respectively, the mean and covariance of \({\varvec{Z}}\). Following the procedure exposed in Sect. 2.1, the posterior distributions on the hyper-parameters are obtained straightforwardly by using the mean and precision priors, Eq. 2, as likelihoods in each case:

$$\begin{aligned}&p({\varvec{\lambda }}|\{{\varvec{\mu }}_j\}_{j=1}^{k_{\text {rep}}},{\varvec{R}})\propto \prod _{j=1}^{k_{\text {rep}}} p({\varvec{\mu }}_j|{\varvec{\lambda }},{\varvec{R}}) \times p({\varvec{\lambda }})\nonumber \\&\sim \mathcal {N}\left( ({\varvec{\mu }}_{Z}{} \mathbf{cov}_{Z}^{-1}+{\varvec{R}}\sum _{j=1}^{k_{\text {rep}}}{\varvec{\mu }}_j)(\mathbf{cov}_{Z}^{-1}+k_{\text {rep}}{\varvec{R}})^{-1},(\mathbf{cov}_{Z}^{-1}+k_{\text {rep}}{\varvec{R}})^{-1}\right) \end{aligned}$$
(7)
$$\begin{aligned} p({\varvec{R}}|&\{{\varvec{\mu }}_j\}_{j=1}^{k_{\text {rep}}},{\varvec{\lambda }})\propto \prod _{j=1}^{k_{\text {rep}}} p({\varvec{\mu }}_j|{\varvec{\lambda }},{\varvec{R}})\times p({\varvec{R}}) \nonumber \\&\sim \mathcal {W}\left( k_{\text {rep}}+1,\left[ \frac{\mathbf{cov}_{Z}+\sum _{j=1}^{k_{\text {rep}}}({\varvec{\mu }}_j-{\varvec{\lambda }})^{\top }({\varvec{\mu }}_j-{\varvec{\lambda }})}{k_{\text {rep}}+1}\right] ^{-1}\right) \end{aligned}$$
(8)
$$\begin{aligned} p({\varvec{W}}|\{{\varvec{S}}_j\}_{j=1}^{k_{\text {rep}}},\beta )&\propto \prod _{j=1}^{k_{\text {rep}}} p({\varvec{S}}_j|\beta ,{\varvec{W}})\times p({\varvec{W}}) \nonumber \\&\sim \mathcal {W}\left( k_{\text {rep}} \beta +1,\left[ \frac{\mathbf{cov}_{Z}^{-1}+\sum _{j=1}^{k_{\text {rep}}}{\varvec{S}}_j}{k_{\text {rep}} \beta +1}\right] ^{-1}\right) \end{aligned}$$
(9)
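
The three hyper-parameter updates of Eqs. 7–9 can be sketched in the same fashion (illustrative names; the same Wishart-parameterization assumption as above applies):

```python
import numpy as np
from scipy.stats import wishart

def sample_hyperparameters(mus, Ss, beta, R, mu_Z, cov_Z, rng):
    """One Gibbs sweep over lambda, R, and W (Eqs. 7-9).

    mus, Ss : means and precisions of the represented components.
    """
    k_rep = len(mus)
    cov_Z_inv = np.linalg.inv(cov_Z)
    # Eq. (7): conditional posterior on lambda
    post_cov = np.linalg.inv(cov_Z_inv + k_rep * R)
    post_mean = post_cov @ (cov_Z_inv @ mu_Z + R @ np.sum(mus, axis=0))
    lam = rng.multivariate_normal(post_mean, post_cov)
    # Eq. (8): conditional posterior on R
    scatter = sum(np.outer(m - lam, m - lam) for m in mus)
    R_new = wishart.rvs(df=k_rep + 1,
                        scale=np.linalg.inv(cov_Z + scatter), random_state=rng)
    # Eq. (9): conditional posterior on W
    W_new = wishart.rvs(df=k_rep * beta + 1,
                        scale=np.linalg.inv(cov_Z_inv + sum(Ss)), random_state=rng)
    return lam, R_new, W_new
```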

Parameter \(\beta \) remains scalar after conjugation. According to Rasmussen [16], it has a gamma prior of the form:

$$\begin{aligned} g&= \beta - D + 1 \end{aligned}$$
(10)
$$\begin{aligned} p(g^{-1}) \sim \mathcal {G}(1,\frac{1}{D}) \quad&\rightarrow \quad p(g) \propto g^{-\frac{3}{2}} \exp \{-\frac{D}{2\,g}\} \end{aligned}$$
(11)

For this parameter the posterior distribution takes the following form:

$$\begin{aligned}&p(g|\{{\varvec{S}}_j\}_{j=1}^{k_{\text {rep}}},{\varvec{W}}) \propto \prod _{j=1}^{k_{\text {rep}}} p({\varvec{S}}_j|\beta ,{\varvec{W}}) \times p(g)\nonumber \\&\propto (\frac{\beta }{2})^{\frac{k_{\text {rep}}\,\beta \,D}{2}}\,g^{-\frac{3}{2}}\,\Gamma _{D}(\frac{\beta }{2})^{-k_{\text {rep}}}\,\exp \{-\frac{D}{2\,g}\}\,\prod _{j=1}^{k_{\text {rep}}} |{\varvec{W}}\,{\varvec{S}}_j|^{\frac{\beta }{2}}\,\exp \{-\frac{1}{2}\,\beta \,\mathbf{tr}({\varvec{W}}{\varvec{S}}_j)\} \end{aligned}$$
(12)

The latter density is not of standard form. However, \(p(\log (g)|\{{\varvec{S}}_j\}_{j=1}^{k_{\text {rep}}},{\varvec{W}})\) is log-concave, so we may generate independent samples using the Adaptive Rejection Sampling (ARS) technique and transform these samples to obtain values of \(\beta \).
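
For reference, the unnormalized log of Eq. 12, which is the target handed to the ARS routine, may be sketched as follows (sampling is actually carried out over \(\log (g)\), which adds a \(+\log g\) Jacobian term; names are illustrative):

```python
import numpy as np
from scipy.special import multigammaln

def log_post_g(g, Ss, W, D, k_rep):
    """Unnormalized log of Eq. (12) as a function of g = beta - D + 1."""
    beta = g + D - 1
    WS = [W @ S_j for S_j in Ss]
    return (0.5 * k_rep * beta * D * np.log(beta / 2.0)
            - 1.5 * np.log(g)
            - k_rep * multigammaln(beta / 2.0, D)
            - D / (2.0 * g)
            + 0.5 * beta * sum(np.linalg.slogdet(M)[1] for M in WS)
            - 0.5 * beta * sum(np.trace(M) for M in WS))
```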

2.3 Mixing Proportions and Latent Variables

In this section, k is not limited to represented classes. For the mixing proportions \(\pi _j\), the prior is a symmetric Dirichlet distribution with concentration parameter \(\alpha /k\):

$$\begin{aligned} p(\{\pi _j\}_{j=1}^{k}|\alpha ) \sim \text {Dir}(\{\alpha /k\}_{j=1}^{k}) = \frac{\Gamma (\alpha )}{\Gamma (\alpha /k)^k} \prod _{j=1}^{k} \pi _{j}^{\alpha /k-1}, \end{aligned}$$
(13)

where \(\Gamma (\cdot )\) is the gamma function. Likewise, the joint distribution for the latent variable \({\varvec{c}}_t\) has the following form:

$$\begin{aligned} p(\{c_{\,t,j}\}_{j=1}^{k}|\{\pi _j\}_{j=1}^{k}) = \prod _{j=1}^{k} \pi _{j}^{c_{\,t,j}}, \quad \{\forall t: \prod _{j=1}^{k} \pi _{j}^{T_j}\}, \end{aligned}$$
(14)

Using the Dirichlet integral of type I, the mixing proportions can be integrated out, so that the prior is written directly in terms of the latent variables:

$$\begin{aligned} p(\{c_{j}\}_{j=1}^{k}|\alpha ) =&\int p(\{c_j\}_{j=1}^{k}|\{\pi _j\}_{j=1}^k)\;p(\{\pi _j\}_{j=1}^{k}|\alpha )\,d\pi _1 \cdots d\pi _k \nonumber \\ =&\frac{\Gamma (\alpha )}{\Gamma (T+\alpha )} \prod _{j=1}^{k} \frac{\Gamma (T_j + \alpha /k)}{\Gamma (\alpha /k)}. \end{aligned}$$
(15)

Estimating variable \({\varvec{c}}_t\) requires the prior for a single indicator given all the others. This is obtained from Eq. 15 by keeping all but a single indicator fixed:

$$\begin{aligned} p(c_{t,j}\,\!\!=\!\!\,1|{\varvec{c}}_{-t},\alpha ) = \frac{T_{-t,j} + \alpha /k}{T-1+\alpha }. \end{aligned}$$
(16)

where the subscript \(-t\) indicates all the indexes except t and \(T_{-t,j}\) is the number of observations, excluding \({\varvec{z}}_t\), that are associated with component j.

Lastly, an inverse Gamma prior is chosen for parameter \(\alpha \):

$$\begin{aligned} p(\alpha ^{-1}) \sim \mathcal {G}(1,1) \rightarrow p(\alpha )\propto \alpha ^{-3/2} \exp \{-\frac{1}{2\alpha }\}. \end{aligned}$$
(17)

The likelihood for \(\alpha \) is derived from Eq. 15, and its posterior distribution takes the following form:

$$\begin{aligned} p(\alpha |\{T_j\}_{j=1}^k,k)&\propto p(\{T_j\}_{j=1}^{k}|\alpha )\times p(\alpha ) \nonumber \\&\propto \frac{\alpha ^{k-3/2}\exp \{-\frac{1}{2\alpha }\}\Gamma (\alpha )}{\Gamma (T+\alpha )} \end{aligned}$$
(18)
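
As with \(g\), the quantity handed to ARS is the unnormalized log of Eq. 18; a minimal sketch (illustrative names, with \(k\) taken as the number of represented components) follows:

```python
import numpy as np
from scipy.special import gammaln

def log_post_alpha(alpha, k_rep, T):
    """Unnormalized log of Eq. (18), the conditional posterior of alpha."""
    return ((k_rep - 1.5) * np.log(alpha)
            - 1.0 / (2.0 * alpha)
            + gammaln(alpha) - gammaln(T + alpha))
```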

Sampling from the latter density requires employing ARS. In the limit where \(k\rightarrow \infty \), the conditional prior for \({\varvec{c}}_{\,t}\), Eq. 16, becomes:

$$\begin{aligned} \text {Components where}\; T_{-t,j} > 0: \quad p(c_{\,t,j} = 1|{\varvec{c}}_{-t},\alpha ) \qquad&= \quad \frac{T_{-t,j}}{T-1+\alpha }, \nonumber \\ \text {else:} \qquad p({\varvec{c}}_{\,t}\ne {\varvec{c}}_{\,t'},\,\{\forall t \ne t'\}|{\varvec{c}}_{-t},\alpha )\quad&= \quad \frac{\alpha }{T-1+\alpha }. \end{aligned}$$
(19)

The posterior is obtained by multiplying the corresponding likelihood terms, Eq. 1, by the conditional prior on the latent variables, Eq. 19:

$$\begin{aligned} \text {Components where}\; T_{-t,j}&> 0: \quad p(c_{\,t,j} = 1|{\varvec{c}}_{-t},{\varvec{\mu }}_j,{\varvec{S}}_j,\alpha ) \nonumber \\&\propto \frac{T_{-t,j}}{T-1+\alpha }\,|{\varvec{S}}_j|^{\frac{1}{2}}\,\exp \{-\frac{1}{2} ({\varvec{z}}_t - {\varvec{\mu }}_j)\,{\varvec{S}}_j\,({\varvec{z}}_t - {\varvec{\mu }}_j)^{\top }\}, \end{aligned}$$
(20)
$$\begin{aligned} \text {else:} \quad p({\varvec{c}}_{\,t}\ne {\varvec{c}}_{\,t'},&\{\forall t \ne t'\}|{\varvec{c}}_{-t},{\varvec{\lambda }},{\varvec{R}},\beta ,{\varvec{W}},\alpha ) \nonumber \\&\propto \frac{\alpha }{T-1+\alpha }\int p({\varvec{z}}_t|{\varvec{\mu }}_j,{\varvec{S}}_j)\,p({\varvec{\mu }}_j,{\varvec{S}}_j|{\varvec{\lambda }},{\varvec{R}},\beta ,{\varvec{W}})\; d{\varvec{\mu }}_j\,d{\varvec{S}}_j. \end{aligned}$$
(21)

The likelihood for components with observations other than \({\varvec{z}}_t\) is Gaussian with parameters \({\varvec{\mu }}_j\) and \({\varvec{S}}_j\). On the other hand, for unrepresented classes the likelihood parameters are obtained by sampling from the component priors, as the marginalization over the parameters is not analytically tractable [16]. When an unrepresented class is chosen, a new class is introduced into the model. Likewise, when a class becomes empty, it is removed from the model.
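
Putting Eqs. 19–21 together, the Gibbs step for a single indicator can be sketched as below. The integral in Eq. 21 is approximated with a single draw from the priors in Eq. 2, as described above; all names are illustrative:

```python
import numpy as np

def sample_assignment(z_t, counts_minus_t, mus, Ss, alpha, T, prior_sampler, rng):
    """Gibbs step for one indicator c_t in the limit k -> infinity (Eqs. 19-21).

    counts_minus_t : T_{-t,j} for every represented component;
    prior_sampler  : callable drawing (mu, S) from the priors in Eq. (2),
                     a single-draw stand-in for the integral in Eq. (21).
    The common (2*pi)^{-D/2} factor cancels across components and is omitted.
    """
    log_w = []
    # represented components, Eq. (20)
    for T_j, mu_j, S_j in zip(counts_minus_t, mus, Ss):
        d = z_t - mu_j
        log_w.append(np.log(T_j) - np.log(T - 1 + alpha)
                     + 0.5 * np.linalg.slogdet(S_j)[1] - 0.5 * d @ S_j @ d)
    # unrepresented component, Eq. (21)
    mu_new, S_new = prior_sampler()
    d = z_t - mu_new
    log_w.append(np.log(alpha) - np.log(T - 1 + alpha)
                 + 0.5 * np.linalg.slogdet(S_new)[1] - 0.5 * d @ S_new @ d)
    log_w = np.asarray(log_w)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    j = rng.choice(len(w), p=w)
    new_params = (mu_new, S_new) if j == len(counts_minus_t) else None
    return j, new_params
```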

3 Experimental Setup

Database. To test our Infinite Gaussian Fisher Vector encoding approach (IGFV), we employ both the UCF50 [17] and HMDB51 [11] databases. The UCF50 database contains realistic videos taken from YouTube, with substantial variation in camera motion, object appearance, and illumination. For our experiments, we use \(N\,\!\!=\!\!\,5967\) videos covering 46 human action categories. Following the standard procedure, we perform a leave-one-group-out cross-validation scheme and report the average accuracy over 25 predefined groups [17]. On the other hand, the HMDB51 database is collected from a variety of sources. For the sake of simplicity, we use \(N\,\!\!=\!\!\,6510\) video sequences covering 51 action categories. Following the proposed protocol, we perform 3-fold cross-validation and report the average accuracy over three predefined train-test splits [11].

Settings. For each video sample, we employ the hand-crafted Improved Dense Trajectory (iDT) feature estimation technique, with the code provided by the authors in [20]. Using the default settings, we extract the following trajectory-aligned descriptors: Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histograms (MBHx and MBHy). All descriptors are extracted along all valid trajectories, and the resulting dimensionality D is 96 for HOG, MBHx, and MBHy, and 108 for HOF.

In practice, using the standard Wishart distribution for sampling the model precisions \({\varvec{S}}_j\), \({\varvec{R}}\), and \({\varvec{W}}\) may generate matrices that are not symmetric positive semidefinite (SPD). To avoid this inconvenience, we employ the Frobenius-norm positive approximation from [10]: for an arbitrary matrix \({\varvec{A}}\!\!\in \!\!\mathbb {R}^{N\,\times \, N}\), its nearest SPD Frobenius approximation is \(\widehat{{\varvec{A}}}_F\,\!\!=\!\!\,({\varvec{B}}+{\varvec{H}})/2\), where \({\varvec{H}}\) is the symmetric polar factor of \({\varvec{B}}\,\!\!=\!\!\,({\varvec{A}}+{\varvec{A}}^{\top })/2\).
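
A compact sketch of this repair is given below; computing the symmetric polar factor through an SVD matches [10] but is our implementation choice:

```python
import numpy as np

def nearest_spd(A):
    """Nearest SPD Frobenius approximation A_hat = (B + H) / 2, as in [10].

    B is the symmetric part of A and H the symmetric polar factor of B,
    obtained here from the SVD of B.
    """
    B = (A + A.T) / 2.0
    U, s, Vt = np.linalg.svd(B)
    H = Vt.T @ np.diag(s) @ Vt            # symmetric polar factor of B
    A_hat = (B + H) / 2.0
    return (A_hat + A_hat.T) / 2.0        # enforce exact symmetry numerically
```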

We use the ARS algorithm for sampling the scalar parameters \(\beta \) and \(\alpha \). In brief, the algorithm employs piecewise exponential functions to approximate any univariate log-concave density h(x) through an envelope (upper hull) and a squeezing function (lower hull). Both touch the density function at m sampled points, known as abscissae (\(x_1,\ldots ,x_m\)). Conventionally, the starting point \(x_1\) is chosen such that \(h'(x_1)>0\), and the final point \(x_m\) is chosen such that \(h'(x_m)<0\), where \(h'(x)\,\!\!=\!\!\,dh(x)/dx\). Even though the method is adaptive and the approximating curves converge to the density function, an erroneous initialization generates ill-posed samples that hamper the proper operation of the algorithm. We solve this issue through an iterative, trial-and-error correction. Finally, the samples are obtained as stated in [9].
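
The trial-and-error correction can be summarized as follows (the 50% growth factor follows Sect. 4; the slope threshold is an assumption of this sketch):

```python
def correct_final_abscissa(h_prime, x_m, slope_tol=-1.0, growth=1.5, max_iter=50):
    """Iteratively push the final ARS abscissa until the upper hull is constrained.

    h_prime   : callable returning h'(x), the derivative of the log-density;
    slope_tol : how negative h'(x_m) must be (threshold value is an assumption).
    """
    for _ in range(max_iter):
        if h_prime(x_m) < slope_tol:
            break
        x_m *= growth                     # increase x_m by 50%, as in Fig. 2
    return x_m
```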

Training. Initially, we randomly select a subsample of 5000 trajectories per category from the training set. Then, using PCA, we select the most relevant attributes until 90% of the input variability is preserved. Later, we employ the spatio-temporal pyramid technique for distributing the training partition into cells. For each spatio-temporal cell, we estimate an IGMM codebook using the procedure exposed in Sect. 2. The model starts with a single component; then 1000 iterations of Gibbs sampling are performed to update all parameters and hyper-parameters iteratively from their posterior distributions, with 800 burn-in iterations. From the remaining 200 iterations, we use the Bayesian Information Criterion (BIC) to choose the best available mixture model. Afterward, the conventional FV encoding technique is employed for representing locally described samples as super-vectors [7]. In short, the method quantifies the similarity between a video sample and the trained IGMM codebook. To the resulting super-vector, we apply Power Normalization (PN) followed by L2-normalization. The above procedure is performed per descriptor. Ultimately, all four IGFV representations are concatenated together.
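
The post-processing of each super-vector can be sketched as follows (the signed square-root is the usual choice for Power Normalization, but the exponent is an assumption here):

```python
import numpy as np

def normalize_fisher_vector(fv, power=0.5):
    """Power Normalization followed by L2-normalization of an IGFV super-vector."""
    fv = np.sign(fv) * np.abs(fv) ** power        # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv          # L2 normalization
```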

For the classification step, we use a one-vs-all Linear SVM with regularization parameter equal to 100. Figure 1 summarizes the IGFV training pipeline. It is worth noting that the feature extraction was performed in C++ and the remaining experiments in MATLAB.
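
For completeness, an equivalent classification step in scikit-learn could look as follows (a stand-in sketch; the reported experiments were run in MATLAB):

```python
from sklearn.svm import LinearSVC

def train_classifier(X_train, y_train):
    """One-vs-rest linear SVM with regularization parameter C = 100.

    X_train : (N, K) concatenated IGFV super-vectors; y_train : (N,) labels.
    LinearSVC trains one binary classifier per class by default.
    """
    return LinearSVC(C=100).fit(X_train, y_train)
```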

4 Results and Discussions

ARS Correction. Figure 2 shows the proposed correction for the ARS initialization when sampling parameter \(\alpha \). In Fig. 2a, we see that abscissae \(x_1\) and \(x_3\) are chosen according to the conventional criteria (i.e., \(h'(x_1)>0\) and \(h'(x_3)<0\)). Although \(x_3\,\!\!=\!\!\,2.38\) satisfies the restriction, the slope \(m_3\,\!\!=\!\!\,-0.002\) creates a wide upper hull that may generate ill-posed samples. In this case, \(\widehat{\alpha }\,\!\!=\!\!\,3019\), whereas the abscissae range (most probable values) is around [1.5, 2.4]. To solve this problem, we iteratively increase \(x_3\) by 50% until the upper hull is properly constrained. Figure 2b is obtained through this procedure. Here, \(x_3\,\!\!=\!\!\,3.6\), \(m_3\,\!\!=\!\!\,-2.18\), and the sampled parameter is \(\widehat{\alpha }\,\!\!=\!\!\,2.05\).

Fig. 1. Sketch of the proposed IGFV data encoding technique.

Fig. 2. Adaptive Rejection Sampling with initialization correction: concave log-density, lower hull, and upper hull. Circular points indicate the initial three abscissae (\(x_1,x_2,x_3\)).

Confusion Matrices. Figure 3 shows the confusion matrices obtained with the linear SVM for both databases. The proposal achieves \(88.5\pm 4.07\)% and \(57.6\pm 1.93\)% mean accuracy within the cross-validation scheme of each database. From a visual inspection of Fig. 3a, the system demonstrates the ability to discriminate among human actions, with slight errors in a few categories. On the other hand, the visual inspection of Fig. 3b shows how difficult it is for the system to classify actions from the HMDB51 dataset. When reviewed in detail, the bounding boxes provided with some videos do not correspond, partially or entirely, to the reported activity. Thus, the lower performance of the system may be explained by this issue, considering the importance of an effective human detection within the iDT feature estimation.

Comparison with the State of the Art. In turn, Table 1 presents a comparative study among similar feature encoding approaches for human action recognition. In this study, we analyze and compare the properties of encoding methodologies; thus, for feature extraction and classification we use standard methods for the sake of comparison. The employed benchmarks have the following in common: (i) they are tested on both the UCF50 and HMDB51 databases, (ii) they employ the iDT feature estimation technique, and (iii) they perform classification through a linear SVM. In particular, ST-VLAD [5] and SFV-STP [20] follow all requirements. However, they require extensive cross-validation for estimating the number of components k in their codebooks. Moreover, ST-VLAD also needs to search for the number of spatio-temporal groups, which for both SFV-STP [20] and our IGFV is fixed to 8 cells, one division for each spatial and temporal axis.

Our main contribution is the automation of a methodology that conventionally requires extensive cross-validation. Thus, the slight drop in accuracy of our method, when compared to the benchmarks, is compensated by computational savings, since it reduces both the number of operations and the memory requirements. In particular, conventional GMM codebooks perform maximum likelihood estimation of the model parameters through EM. Such optimization performs a large number of iterations (in our case 500), and its convergence depends on the initialization. Furthermore, this method requires fixing the number of components k before optimizing all other parameters (means, precisions, and priors). The authors in [20] proposed searching for k within a set of 10 different values and then selecting the best value according to classification performance. This approach requires cross-validating k, an operation that increases the number of iterations by 10 times the number of folds in the cross-validation. In terms of memory, it requires storing all parameters 10 times until the best k is chosen.

The drop in accuracy of our method could be attributed to the precisions sampled by the IGMM. In our case, the IGMM samples full precision matrices that consider the correlation between attributes. For us, this is a disadvantage, as the IGMM codebook suffers from low resolution, i.e., full Gaussians can explain massive data clusters. Meanwhile, the benchmark approaches constrain covariances to diagonal or spherical matrices. Thus, the estimated number of components from the IGMM is in the order of tens, whereas the exhaustive search of the benchmarks converges to hundreds of components. This is an interesting result that demonstrates the quality of the IGMM-estimated codebooks, as fewer components allow the codification of discriminant visual information.

Fig. 3. Confusion matrices from Human Action Recognition in the UCF50 and HMDB51 databases.

Table 1. Comparison with similar approaches on UCF50 and HMDB51 datasets.

5 Conclusions

We introduced a novel Infinite Gaussian Fisher Vector (IGFV) feature encoding framework to support video-based Human Action Recognition. Our approach is fully automatic, allowing every parameter in the model to be updated hierarchically through Gibbs sampling, an MCMC algorithm. The IGFV encoding reveals a set of discriminant local spatio-temporal features that enable the precise codification of visual information, with competitive recognition results and computational savings. In particular, the infinite limit on the number of Gaussian components evades estimating this parameter through extensive cross-validation, which drastically reduces the number of operations and memory requirements for performing HAR. Attained results on both the UCF50 and HMDB51 databases showed that our proposal correctly classified 88.5% and 57.6% of human actions under the specific cross-validation scheme of each dataset. Our IGFV obtained promising results that are comparable with state-of-the-art encoding approaches. Furthermore, it outperforms those approaches when considering the trade-off between accuracy and computational complexity, as our proposal reduces both the number of operations and the memory requirements.

As future work, the authors will evaluate alternatives for enhancing the resolution of IGMM codebooks, such as placing diagonal or spherical constraints on the sampled precisions. We are convinced that the aforementioned benefits of Bayesian inference and Dirichlet processes, combined with an enhanced model resolution, will yield better performance in human action recognition tasks.