Keywords

1 Introduction

In recent years, a tremendous amount of human action video recordings has been made available. As a consequence, human action recognition has become a very popular task in computer vision with a wide range of applications such as visual surveillance systems, human-robot interaction, video retrieval, and sports video analysis [21]. The recognition of human actions in videos is a challenging task due to anthropometric differences (size, gender, shape) among subjects and variations in the way, spread, and speed of the action.

Hidden conditional random fields (HCRF) [13] are a generalization of conditional random fields (CRF) [9] and appear to be a very promising approach in many application domains, due to their ability of relaxing strong independence assumptions and exploiting spatio-temporal variations, via a graphical structure. A hybrid model that consists of a combination of generative and discriminative models to improve the performance of the classical models has been proposed in the literature [1]. Motivated by this approach, Soullard et al. [17] introduced an HMM-based weighting in the conditional probability of the HCRF, which constrains the discriminative learning, yielding improved accuracy. On the other hand, Zhang et al. [25] used HMMs to make hidden variables “observable” to HCRF so the objective function can be convex. Multi-modal action recognition (i.e., combination of audio and visual information) using HCRFs has also been given great focus [16, 19, 22].

However, previous works define the number of hidden variables in a intuitive manner or with exhaustive search, which is a computationally expensive and time-consuming task. Bousmalis et al. [2] introduced infinite hidden conditional random fields (iHCRF), a nonparametric model that estimates the number of hidden variables to recognize human behaviors. The model assumes that the HCRF potentials are sampled directly from a set of hierarchical Dirichlet processes and its hyper-parameters are learned using the sampling that removes hidden variables that are not presented in the samples. Moreover, Bousmalis et al. [3] proposed an extension of the previous model, called variational HCRF, which is a generalized framework for infinite HCRF modeling and variational inference in order to converge faster and reduce the computational cost.

Recently, deep leaning methods have shown outstanding results [5]. Although important progress has been made in the fields such as object detection and image classification, the understanding of human actions is still a difficult task due to pose variability, short duration of actions, and ambiguity in human annotations. Sigurdsson et al. [15] tried to answer a set of fundamental questions regarding the problem of action recognition to address the aforementioned limitations. Convolutional neural networks (CNNs or ConvNets) [7] are the most widely used approach to simultaneously learn spatio-temporal dynamics of human actions. A novel architecture that uses spatio-temporal fusion by combining the ConvNet towers was introduced by Feichtenhofer et al. [8]. Finally, an on-line frame-pooling method, which extracts only the most important frames that best describe human actions, was proposed by Wang et al. [23].

In this work, a robust incremental hidden conditional random field (RI-HCRF) is proposed, which addresses two major issues in standard HCRFs. At first, the proposed model incrementally estimates the optimal number of hidden variables using a Viterbi-based procedure. Additionally, it uses a mixture of Student’s t-distributions as prior to the parameters of the model that leads to a model robust to outliers.

2 Model Formulation

The proposed RI-HCRF model is defined by a linear chained structured undirected graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E})\). Each variable, observed (video frame) and unobserved (hidden variable), is a node \(\mathcal {V}\) in the graphical model \(\mathcal {G}\) and any dependencies between them are presented by an edge \(\mathcal {E}\). We consider a dataset \(\mathcal {D}=\{\mathbf {x}^{k},y^{k}\}_{k=1}^{N}\) of N labeled observations, where an observation \(\mathbf {x}^{k}=\{\mathbf {x}_1, \mathsf {x}_2, \ldots , \mathbf {x}_T\}\) corresponds to the \(k^{\text {t}h}\) video sequence that consists of T frames, and \(y^{k}\) is the \(k^{\text {t}h}\) class label defined in a finite label set \(\mathcal {Y}\). Each observation \(\mathbf {x}_{i}^{k}\) can be represented by a feature vector \(\phi (\mathbf {x}_i^{k})\in \mathbb {R}^d\), which is a collection of several features extracted from the \(i^{\text {t}h}\) frame in the video sequence.

The goal of the proposed model is, for a given observation \(\mathbf {x}\) and the model’s parameter vector \(\varvec{\theta }\), to find the most probable label y by maximizing the conditional probability \(P(y|\mathbf {x};\varvec{\theta })\). The RI-HCRF model is a member of the exponential family and the conditional probability of the class label given an observation sequence is defined as:

$$\begin{aligned} P(y|\mathbf {x};\varvec{\theta }) = \displaystyle \sum _{h} P(y,\mathbf {h}|\mathbf {x};\varvec{\theta }) = \frac{\displaystyle \sum _{\mathbf {h}}\exp {\varPsi (y,\mathbf {h},\mathbf {x};\varvec{\theta })}}{\displaystyle \sum \limits _{{\begin{array}{c} {y',\mathbf {h}} \end{array}}} \exp {\varPsi (y',\mathbf {h},\mathbf {x};\varvec{\theta })}}\, , \end{aligned}$$
(1)

where \(\mathbf {h}\) is a set of hidden variables \(\mathbf {h}^{k}=\{h_1, h_2, \ldots , h_T\}\), with \(\mathbf {h}^{k}\in \mathcal {H}\), and \(\varPsi (\cdot )\) is the potential function that specifies dependencies between different nodes in the model given by:

(2)

The functions \(\varvec{\theta }_{1} \cdot \psi _{1}(\cdot )\) and \(\varvec{\theta }_{2} \cdot \psi _{2}(\cdot )\) are the node potentials \(N_{p}\), which model the relationship between the hidden variable \(h_j\) and the feature vector \(x_j\), and the relationship between the class label y and the hidden variable \(h_j\), respectively, and \(\varvec{\theta }_{3} \cdot \psi _{3}(\cdot )\) is the edge potential \(E_{p}\), which models the relationship between the class label y and the hidden variables \(h_i\) and \(h_j\).

3 Estimation of the Number of Hidden States

To estimate the optimal number of hidden variables, we propose an iterative method, which seeks for a sequence of hidden variables, where each variable participates in a maximum number of optimal paths. The variables with low participation in the optimal paths are rejected.

For a given label \(y=\alpha \), the node potentials for all observations and all possible hidden variables can be represented in a matrix form:

$$\begin{aligned} \displaystyle N_{p}=[n_{p_{ij}}]_{S \times T}= \begin{bmatrix} \theta _{1_{11}}\cdot x_1+\theta _{2_1\alpha }&\dots&\theta _{1_{1T}}\cdot x_T+\theta _{2_1\alpha } \\ \theta _{1_{21}}\cdot x_1+\theta _{2_2a}&\dots&\theta _{1_{2T}}\cdot x_T+\theta _{2_2\alpha } \\ \vdots&\ddots&\vdots \\ \theta _{1_{S1}}\cdot x_1+\theta _{2_S\alpha }&\dots&\theta _{1_{ST}}\cdot x_T+\theta _{2_T\alpha } \end{bmatrix} \, , \end{aligned}$$
(3)

where S is the number of hidden variables and T is the number of frames in the video sequence. The edge potentials, for a given label \(y=\alpha \), express the compatibility between a pair of hidden variables and they can be represented by the following square matrix:

$$\begin{aligned} \displaystyle E_{p}=[e_{p_{ij}}]_{S \times S}= \begin{bmatrix} \theta _{3_{11}}&\theta _{3_{12}}&\dots&\theta _{3_{1S}} \\ \theta _{3_{21}}&\theta _{3_{22}}&\dots&\theta _{3_{2S}} \\ \vdots&\vdots&\ddots&\vdots \\ \theta _{3_{S1}}&\theta _{3_{S2}}&\dots&\theta _{3_{SS}} \end{bmatrix} . \end{aligned}$$
(4)

All values of the node potentials \(N_{p}\) matrix are transformed into the range [0, 1] using min-max normalization, so that all input variables equally contribute in the model and the parameters of the node potential \(N_{p}\) are not scaled with respect to the units of the inputs. As a result, we end up with smaller standard deviations, which can suppress the effect of outliers. Also, we construct a stochastic matrix based on the edge potential \(E_{p}\), with each row summing to one.

To determine the number of hidden variables, we employ multiple HMMs. Specifically, an HMM is defined by the set of hidden variables \(\mathbf {h}\) and a set of parameters \(\varvec{\varLambda }=\{\varvec{\pi },\mathbf {A},\mathbf {B}\}\), where \(\varvec{\pi }\) is a vector that collects the prior probabilities of \(h_i\) being the first hidden variable of a state sequence, \(\mathbf {A}\) is a matrix that collects the transition probabilities of moving from one hidden variable to another, and \(\mathbf {B}\) is the matrix that collects the emission probabilities, which characterize the likelihood of a certain observation \(\mathbf {x}\), if the model is in hidden variable \(h_i\).

Let us consider that the normalized node potentials are the entries of the emission probability matrix, the edge potentials are the entries of the transition probability matrix, and there is a vector of initial probabilities \(\varvec{\pi }\), where each hidden variable is equally probable to be first in the sequence \(\pi _{1\times S}=\frac{1}{S}\cdot \mathbbm {1}_{S}\).

Given the above definitions, for a given label \(y=\alpha \), we can determine an HMM and the optimal hidden variable sequence using the Viterbi algorithm to estimate the probability of the most probable path ending \(\delta _t(i)\) and keep track of the best path ending  \(\phi _t(i)\) in the \(i^\text {th}\) hidden state at time t:

$$\begin{aligned} \displaystyle \delta _t(i)=\max _{h_1,\ldots ,h_{T-1}}{P(h_1,\ldots ,h_{T-1},h_T=i,x_1,\ldots ,x_T|\pi ,E_{p},N_{p})}\, ,\end{aligned}$$
(5)
$$\begin{aligned} \displaystyle \phi _t(i)=\underset{h_1,\ldots ,h_{T-1}}{\mathrm {argmax}}{P(h_1,\ldots ,h_{T-1},h_T=i,x_1,\ldots ,x_T|\pi ,E_{p},N_{p})}\, . \end{aligned}$$
(6)
Fig. 1.
figure 1

Illustration of the iterative and incremental method for the estimation of the optimal number of the hidden variables. The grey nodes are the observed features (\(x_{i}\)), and the unknown labels (y). The white nodes are the hidden variables (h). At each iteration, an HCRF is built with an initial number of hidden states and then the Viterbi algorithm is used to estimate the optimal paths in the HMM setting. Then, the frequency of each hidden variable is computed and if the termination criterion is satisfied the number of hidden states is decided by majority voting, otherwise the number of hidden states is increased by one and the process is repeated.

Figure 1 depicts the flow of the proposed method for the estimation of the optimal number of hidden variables. The proposed method learns the optimal number of hidden variables following an incremental learning approach. It starts by setting an initial number \(S=|\mathcal {H}|\ge 1\) for the hidden variables and the maximum number of iterations. In each iteration, all optimal paths, for every video sequence and for every label, are estimated using the Viterbi algorithm and the frequency of appearance of each hidden state in all paths is calculated. The termination criterion is reached when the frequency of each hidden variable is lower than a predefined threshold \(\tau \). If this criterion is not satisfied, the number of hidden variables is increased by one and the process is repeated. If the termination criterion is satisfied, we move to the next iteration and a voting for the most probable number of hidden variables in the current iteration is performed. Finally, when the maximum number of iterations is reached, the optimal number of hidden variables is the one with the majority of votes.

4 A Student’s T-Mixture Prior on the Model Parameters

Let us assume that the parameters \(\varvec{\theta }\) of the proposed RI-HCRF follow a mixture model with three Student’s-t components. Taking into consideration that the parameter vector \(\varvec{\theta }\) describes three different relationships among observations, hidden variables and labels, we expect that each component corresponds to one of these relationships. To this end, the proposed method relies on partitioning the parameter vector using a Student’s t-mixture model to identify, preserve, and enhance the different characteristic of each partition and improve the classification model. The use of Student’s t-mixture model is justified by the fact that it has a heavier tails pdf and provides smaller weights to the observations that lie in the tail area. Thus, it provides robustness to outliers and less extreme estimates of the posterior probabilities of the mixture model [11]. Additionally, each Student’s t-component originates from a wider class of an elliptically symmetric distribution with an additional robustness tuning parameter that corresponds to the degrees of freedom \(\nu \).

However, by making the above assumption the problem of setting the mixture weights arises. To estimate the best weights for the mixture model, one might perform an exhaustive search to check multiple combinations of probable values of mixture weights. To avoid the prohibitive quadratic computational cost of this approach, we dynamically estimate the best fitted mixture model for parameters \(\varvec{\theta }\) at the end of each iteration of the training process.

Let each parameter \(\varvec{\theta }\) follows a univariate t-distribution with mean \(\mu \), variance \(\sigma ^2\), and \(\nu \in [0,\infty )\) degrees of freedom then, given the weight u that follows a Gamma distribution (\(\varGamma \)) parameterized by \(\nu \), the parameter \(\varvec{\theta }\) has the univariate normal with mean \(\mu \) and variance \(\sigma ^2/u\), with u being a random variable distributed as \(u \sim \varGamma (\nu /2, \nu /2)\).

By integrating out the weights from the joint density leads to the density function of the marginal distribution:

$$\begin{aligned} p(\varvec{\theta }; \nu ,\mu ,\lambda ) =\frac{\varGamma (\frac{\nu +1}{2})}{\varGamma (\frac{\nu }{2})}\Big (\frac{\lambda }{\pi \nu }\Big )^{\frac{1}{2}}\Big ( 1+ \frac{\lambda (\theta -\mu )^2}{\nu }\Big )^{-\frac{\nu +1}{2}} \, , \end{aligned}$$
(7)

where the inverse scaling parameter \(\lambda \) (similar to precision) is the reciprocal of variance (\(\lambda =(\sigma ^2)^{-1}\)). Also, it can be shown that for \(\nu \rightarrow \infty \) the Student’s t-distribution tends to be a Gaussian distribution. Moreover, for \(\nu >1\), \(\mu \) is the mean of \(\varvec{\theta }\) and for \(\nu >2\), the variance of \(\varvec{\theta }\) is \(\nu (\lambda (\nu -2))^{-1}\). The t-distribution is used to estimate probabilities based on incomplete data or small samples. A K-component mixture of t-distributions is given by:

$$\begin{aligned} \phi (\varvec{\theta }, \varOmega )=\sum _{i=1}^K\pi _ip\varvec{(}\theta ; \nu _i,\mu _i,\lambda _i)\, , \end{aligned}$$
(8)

where \(\varvec{\theta }\) denotes the observed data vector, \(\varOmega =\{\varOmega _i\}_{i=1}^K\) is the mixture parameter set with \(\varOmega _i=\{\pi _i,\nu _i,\mu _i,\lambda _i\}\), and \(\pi _i\) are the \(i^{th}\) mixing proportions that satisfy the following constraints: \(\sum _{i=1}^{K} \pi _{i}=1\) and \(0\le \pi _{i}\le 1\). The best fitted mixture with Student’s t-components can be obtained by maximizing the likelihood function using the EM algorithm [11].

During training a maximum likelihood approach is followed to estimate the parameters \(\varvec{\theta }\) of the model by maximizing the following loss function:

$$\begin{aligned} \mathcal {L}(\varvec{\theta })= \displaystyle \sum _{i=1}^N\log P(y_i|x_i; \varvec{\theta }) + \log \Big (\displaystyle \sum _{k=1}^K\pi _kp(\varvec{\theta }; \nu _k,\mu _k,\lambda _k)\Big ) \, , \end{aligned}$$
(9)

where the first term is the conditional log-likelihood of the input data and the second term represents the best fitted Student’st-mixture model on parameter vector \(\varvec{\theta }\), obtained by the EM algorithm. The optimal weights \(\varvec{\theta }^\star \) are learned by maximizing the objective function. The optimization of Eq. (9) is performed using the limited-memory BFGS (LBFGS) method [4], since the value and the derivative of the objective function may be calculated. Then, the corresponding label is estimated by maximizing the posterior probability:

$$\begin{aligned} y^\star = \underset{y\in \mathcal {Y}}{\mathrm {argmax}}{P(y|\mathbf {x};\varvec{\theta }^\star )}\, . \end{aligned}$$
(10)

5 Experimental Results

To demonstrate the ability of the proposed method to recognize human actions, we compared it with several state-of-the-art methods in three publicly available benchmark datasets. First, we used the Parliament dataset [20], which consists of 228 video sequences of political speeches categorized in three behavioral classes: friendly, aggressive, and neutral. The TV human interaction (TVHI) dataset [12], is a collection of 300 video sequences depicting four kinds of interactions such as handshakes, high fives, hugs, and kisses. Finally, the SBU Kinect Interaction (SBU) dataset [24] is used, which contains approximately 300 video sequences of two-person interactions captured by a Microsoft Kinect sensor. Each video is labeled with one of the following actions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands.

For all datasets, we used spatio-temporal interest points (STIP) [10] as our basic representation. Also, for the Parliament and TVHI datasets, we extracted the mel-frequency cepstral coefficients (MFCC) [14] features along with their first and second order derivatives, while for the TPI dataset, we used the provided by the dataset poses. Additional to the hand-crafted features, we used CNNs both for end-to-end classification and for feature extraction by employing the pre-trained model of Tran et al. [18], which is a 3D ConvNet. Finally, we assessed the performance of the RI-HCRF by comparing with the following baseline methods: SVM [6], CRF [9], HCRF [13], and CNN (end-to-end) [18].

The threshold for the automatic learning of the optimal number of hidden variables was set to take values from a discrete set \(\tau \in \{0.001,0.005,0.01,0.02,0.05\}\) and the maximum number of iterations were set to 30. The number of components for the mixture of Student’s t-distribution was set to \(K=3\). The model parameters were randomly initialized and the experiments were repeated five times, while 5-fold cross validation was used to split into training and test sets. To examine the performance of the RI-HCRF model against the standard HCRF, we varied the number of hidden variables from 3 to 18.

Table 1. Comparison of the classification accuracies (\(\%\)) for the Parliament [20], TVHI [12], and SBU [24] datasets. The results were averaged for all different configurations (mean ± standard deviation).

The average recognition accuracies and the corresponding standard deviations for all datasets, for both hand-crafted and CNN features, are presented in Table 1. It can be observed that the proposed RI-HCRF method outperforms all the state-of-the-art methods for all datasets. It is worth noting that RI-HCRF show very small deviation compared with the rest of the methods, which indicates that the use of Student’s t-distribution provides robustness and reduces the miss classification errors. It is also interesting to observe that for the Parliament and TVHI datasets, the absolute improvement of RI-HCRF over the CNN model is very high (\(11\%\) and \(33\%\), respectively). This improvement can be explained by the fact that the CNN model uses a linear classifier in the softmax layer, while RI-HCRF is more suitable to encode sequential data by modeling dependencies between consecutive frames in a more principled way.

Fig. 2.
figure 2

Confusion matrices for the classification results using CNN features for the (a) Parliament [20], (b) TVHI [12], and (c) SBU [24] datasets.

Since the best results, for the RI-HCRF method, were achieved when CNN features were employed, the corresponding confusion matrices are depicted in Fig. 2. It can be observed that the misclassification errors are quite small for the three datasets indicating that the proposed method can efficiently recognize human actions with high accuracy and small intra-class classification errors.

Fig. 3.
figure 3

Estimation of the optimal number of hidden variables by the proposed RI-HCRF algorithm using the CNN features for the (a) Parliament [20], (b) TVHI [12], and (c) SBU [24] datasets. The number of hidden states that do not appear in the horizontal axis received zero votes.

Fig. 4.
figure 4

Classification accuracies with respect to the number of hidden states using exhaustive search for the (a) Parliament [20], (b) TVHI [12], and (c) SBU [24] datasets. The results from exhaustive search are in-line with the optimal number of hidden states as predicted by RI-HCRF in Fig. 3.

Figure 3 depicts the results for the prediction of the optimal number of hidden states for all three datasets, when CNN features are used for the classification. For the Parliament dataset, number 4 is the most suitable candidate, for TVHI number 7 seems to be the most probable case, and for the SBU dataset, number 9 turns out to be the candidate with the most votes estimated by the RI-HCRF model. The estimated number of hidden states obtained from the proposed model is in fully agreement with the results from the exhaustive search of the number of hidden states (Fig. 4). Also, we may observe that the number of candidates is much lower compared to the exhaustive search.

6 Conclusion

In this paper, a video-based action recognition method using a graphical representation of human actions is introduced. The proposed approach is an extension of standard HRCFs, which can automatically infer the number of hidden states from the input data. The proposed model is a mixture of three Student’s t-components coupled to the RI-HCRF model as a prior to the parameters providing robustness to outliers. An extented experimental evaluation demonstrated that the proposed method achieved promising results, while reduced the search space for the estimation of the number of hidden states.