
1 Introduction

Human action recognition has become an important research area in the computer vision field due to its wide range of applications, including automatic video analysis, video indexing and retrieval, video surveillance, and virtual reality [5]. As a result of the increasing amount of video data available in both internet repositories and personal collections, there is a strong demand for understanding the content of complex real-world data. However, different challenges arise for action recognition in realistic video data [13]. First, there is large intra-class variation caused by factors such as the style and duration of the performed action, scale changes, dynamic viewpoint, and sudden motion. Second, background clutter, occlusions, and low-quality video data are known to affect robust recognition as well. Finally, for large-scale datasets, data processing represents a crucial computational challenge to be addressed [3].

The most popular framework for action recognition is the Bag of visual Words (BOW) with its variations [11, 12]. The BOW pipeline contains three main stages: feature estimation, feature encoding, and classification. In addition, there are several pre-processing and post-processing stages, such as relevance analysis and feature embedding, to enhance data decorrelation, separability, and interpretability [2]. Furthermore, different normalization techniques have been introduced to improve the performance of the recognition system. For the feature estimation step, the recent success of local space-time features like Dense Trajectories (DT) and Improved Dense Trajectories (iDT) has led researchers to use them on a variety of datasets, obtaining excellent recognition performance [12, 13]. Regarding the feature encoding step, super-vector based encoding methods such as the Fisher Vector (FV) and the Vector of Locally Aggregated Descriptors (VLAD) are presented as the state-of-the-art approaches for feature encoding in action recognition tasks [5, 11]. Lastly, the classification stage has usually been performed by Support Vector Machines (SVM) in most recognition frameworks [8, 10].

The feature encoding method that provides the final video representation is crucial for the performance of an action recognition system, as it directly influences the ability of the classifier to predict the class labels. However, video representations generated by methods such as FV or VLAD are known to produce high-dimensional encoding vectors, which increase the computational requirements of the classification stage [5, 13]. Moreover, the high dimensionality of the input data can adversely affect classifier accuracy, since redundant information and even noise do not enhance data separability. Therefore, Dimensionality Reduction (DR), comprising feature selection and feature embedding methods, is imperative to lighten the burden associated with the encoding stage, eliminate redundant information, and project samples into new spaces that increase separability [1]. Conventional methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) have been proposed to decorrelate the individual features of descriptors and reduce their length in a pre-processing stage prior to encoding [6]. Nevertheless, these methods are specifically designed to work with real-valued vectors coming from flat Euclidean spaces. Thus, driven by real-world data and models, there has been growing interest in modern computer vision to go beyond the extensively studied Euclidean spaces and analyse more realistic non-linear scenarios for better representation of the data [7].

In this work, we introduce a new human action recognition system using kernel relevance analysis. The system, based on a non-linear representation of the super-vector obtained by the FV encoding technique, seeks to reduce the input space dimensionality as well as to enhance the separability and interpretability of video data. Specifically, our approach includes a centered kernel alignment (CKA) technique to recognize relevant descriptors related to action recognition. Hence, we match trajectory-aligned descriptors with the output labels (action categories) through non-linear representations [2]. Also, the CKA algorithm allows computing a linear projection matrix, whose columns quantify the required number of dimensions to preserve 90% of the input data variability. Therefore, by projecting the video samples into the CKA-generated space, the class separability is preserved and the number of dimensions is reduced. Attained results on the UCF50 database demonstrate that our proposal favors the interpretability of commonly employed descriptors in action recognition, and yields a system able to obtain competitive recognition accuracy using a drastically reduced input space dimensionality at the classification stage.

The rest of the paper is organized as follows: Sect. 2 presents the main theoretical background, Sect. 3 describes the experimental set-up, and Sect. 4 introduces the results and discussion. Finally, Sect. 5 presents the conclusions.

2 Kernel-Based Descriptor Relevance Analysis and Feature Embedding

Let \(\{({\varvec{Z}}_n, y_n)\}_{n=1}^{N}\) be an input-output pair set holding N video samples, each of them represented by T trajectories generated while tracking a dense grid of pixels, whose local space is characterized by a descriptor of dimensionality D, as presented in [13]. Here, the samples are related to a set of human action videos, while the descriptor in turn is one of the following trajectory-aligned measures: trajectory positions (Trajectory), Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histograms (MBHx and MBHy), yielding a total of \(F=5\) descriptors. Likewise, the output label \(y_n\) denotes the human action being performed in the corresponding video representation. From \({\varvec{Z}}_n\), we aim to encode the T described trajectories with respect to a Gaussian Mixture Model (GMM), trained to be a generative model of the descriptor in turn. Therefore, the Fisher Vector (FV) feature encoding technique is employed, as follows [9]:

Let \({\varvec{Z}}_n\) be a matrix holding T described trajectories \({\varvec{z}}_t\in \mathbb {R}^{D}\), and \(\upsilon ^{\lambda }\) be a GMM with parameters \(\lambda =\{w_i,{\varvec{\mu }}_i,{\varvec{\sigma }}_i\}_{i=1}^{K}\), which are respectively the mixture weight, mean vector, and diagonal covariance matrix of the K Gaussians. We assume that \({\varvec{z}}_t\) is generated independently by \(\upsilon ^{\lambda }\). Therefore, the gradient of the log-likelihood describes the contribution of the parameters to the generation process:

$$\begin{aligned} {\varvec{x}}^{\lambda }_n = \frac{1}{T} \sum _{t=1}^{T}\nabla _{\lambda } \log \upsilon _{\lambda }({\varvec{z}}_t) \end{aligned}$$
(1)

where \(\nabla _{\lambda }\) is the gradient operator w.r.t \(\lambda \). Mathematical derivations lead \({\varvec{x}}^{\mu ,i}_n\) and \({\varvec{x}}^{\sigma ,i}_n\) to be the D-dimensional gradient vectors w.r.t the mean and standard deviation of the Gaussian i, that is:

$$\begin{aligned} {\varvec{x}}^{\mu ,i}_n&= \frac{1}{T\sqrt{w_i}} \sum _{t=1}^{T} {\varvec{\gamma }}_t(i) \left( \frac{{\varvec{z}}_t-{\varvec{\mu }}_i}{\sigma _i}\right) ,\end{aligned}$$
(2)
$$\begin{aligned} {\varvec{x}}^{\sigma ,i}_n&= \frac{1}{T\sqrt{2 w_i}} \sum _{t=1}^{T} {\varvec{\gamma }}_t(i) \left[ \frac{({\varvec{z}}_t-{\varvec{\mu }}_i)^2}{\sigma _i^2}-1\right] \end{aligned}$$
(3)

where \({\varvec{\gamma }}_t(i)\) is the soft assignment of trajectory \({\varvec{z}}_t\) to the Gaussian i, that is:

$$\begin{aligned} {\varvec{\gamma }}_t(i) = \frac{w_i \upsilon _i({\varvec{z}}_t)}{\sum _{j=1}^{K}w_j \upsilon _j({\varvec{z}}_t)} \end{aligned}$$
(4)

The final gradient vector \({\varvec{x}}_{n}^{\lambda }\) is a concatenation of the \({\varvec{x}}^{\mu ,i}_n\) and \({\varvec{x}}^{\sigma ,i}_n\) vectors for \(i=1,\ldots ,K\) and is 2KD-dimensional.
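As an illustration, the FV encoding of Eqs. (1)-(4) for a single video can be sketched in NumPy as below. This is a minimal didactic version under the assumption of a pre-trained diagonal GMM; variable names are illustrative, and production systems typically rely on optimized implementations such as VLFeat.

```python
import numpy as np

def fisher_vector(Z, w, mu, sigma):
    """Sketch of FV encoding (Eqs. 1-4) for one video.

    Z: (T, D) trajectory descriptors; w: (K,) mixture weights;
    mu, sigma: (K, D) means and std deviations of a diagonal GMM.
    Returns the 2KD-dimensional concatenation of the gradients
    w.r.t. the means and standard deviations.
    """
    T, D = Z.shape
    K = w.shape[0]
    # Log-density of each trajectory under each diagonal Gaussian
    log_p = np.stack([
        -0.5 * np.sum(((Z - mu[i]) / sigma[i]) ** 2
                      + np.log(2 * np.pi * sigma[i] ** 2), axis=1)
        for i in range(K)], axis=1)                          # (T, K)
    # Soft assignments gamma_t(i), Eq. (4), via a stable softmax
    log_w = np.log(w) + log_p
    gamma = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradients w.r.t. means and standard deviations, Eqs. (2)-(3)
    parts = []
    for i in range(K):
        diff = (Z - mu[i]) / sigma[i]                        # (T, D)
        g_mu = (gamma[:, i:i + 1] * diff).sum(0) / (T * np.sqrt(w[i]))
        g_sig = (gamma[:, i:i + 1] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w[i]))
        parts += [g_mu, g_sig]
    return np.concatenate(parts)                             # (2*K*D,)
```

With \(K=256\) and, e.g., the HOG dimensionality \(D=96\), this yields a 49152-dimensional vector per descriptor, which explains the large concatenated encoding discussed in Sect. 3.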

Assuming that the same procedure is performed for each descriptor, the concatenation of the resulting vectors generates the set \(\{({\varvec{x}}_n\in \mathbb {R}^{S}, y_n)\}_{n=1}^{N}\). Afterwards, a Centered Kernel Alignment (CKA) approach is performed to compute a linear projection matrix and to determine the relevance weight of each individual feature of the trajectory-aligned descriptors, as follows [2]:

Let \(\kappa _X:\mathbb {R}^{S}\times \mathbb {R}^{S}\rightarrow \mathbb {R}\) be a positive definite kernel function, which reflects an implicit mapping \(\phi :\mathbb {R}^{S}\rightarrow \mathcal {H}_X\), associating an element \({\varvec{x}}_n\in \mathbb {R}^{S}\) with the element \(\phi ({\varvec{x}}_n)\), which belongs to the Reproducing Kernel Hilbert Space (RKHS) \(\mathcal {H}_X\). In particular, the Gaussian kernel is preferred since it seeks an RKHS with universal approximation capability, as follows [4, 14]:

$$\begin{aligned} \kappa _{X}({\varvec{x}}_n,{\varvec{x}}_{n'};\sigma ) = \exp \left( -\frac{d^{2}({\varvec{x}}_n,{\varvec{x}}_{n'})}{2\sigma ^{2}}\right) \end{aligned}$$
(5)

where \(d(\cdot ,\cdot )\) is a distance function in the input space, and \(\sigma \in \mathbb {R}^{+}\) is the kernel bandwidth that rules the observation window within the assessed similarity metric. Likewise, for the output label space, we also set a positive definite kernel \(\kappa _L\). In this case, the pairwise similarity between samples is defined as \(\kappa _L(y_n,y_{n'})=\delta (y_n-y_{n'})\), being \(\delta (\cdot )\) the Dirac delta function. Each of the above defined kernels reflects a different notion of similarity and yields the elements of the matrices \(\mathbf{K }_{X},\mathbf{K }_{L}\in \mathbb {R}^{N\times N}\), respectively. In turn, to evaluate how well the kernel matrix \(\mathbf{K }_{X}\) matches the target \(\mathbf{K }_{L}\), we use the statistical alignment between those two kernel matrices as [2]:

$$\begin{aligned} \hat{\rho }(\mathbf{K }_{X},\mathbf{K }_{L})= \frac{\langle \bar{\mathbf{K }}_{X},\bar{\mathbf{K }}_{L}\rangle _{\text {F}}}{\sqrt{\langle \bar{\mathbf{K }}_{X},\bar{\mathbf{K }}_{X}\rangle _{\text {F}}\langle \bar{\mathbf{K }}_{L},\bar{\mathbf{K }}_{L}\rangle _{\text {F}}}}, \end{aligned}$$
(6)

where the notation \(\bar{\mathbf{K }}\) stands for the centered kernel matrix calculated as \(\bar{\mathbf{K }} = \tilde{{\varvec{I}}}\mathbf{K }\tilde{{\varvec{I}}}\), being \(\tilde{{\varvec{I}}}= {\varvec{I}}-{\varvec{1}}^{\top }{\varvec{1}}/N\) the empirical centering matrix, \({\varvec{I}}\in \mathbb {R}^{N\times N}\) the identity matrix, and \({\varvec{1}}\in \mathbb {R}^{1\times N}\) the all-ones row vector. The notation \(\langle \cdot ,\cdot \rangle _{\text {F}}\) represents the matrix Frobenius inner product. Hence, Eq. (6) is a data-driven estimator that quantifies the similarity between the input feature space and the output label space [2]. In particular, for the Gaussian kernel \(\kappa _X\), the Mahalanobis distance is selected to perform the pairwise comparison between samples:

$$\begin{aligned} \upsilon _{{\varvec{A}}}^2({\varvec{x}}_n,{\varvec{x}}_{n'}) = ({\varvec{x}}_n-{\varvec{x}}_{n'}) {\varvec{A}}{\varvec{A}}^{\top }({\varvec{x}}_n-{\varvec{x}}_{n'})^{\top }, \; n,n'\in \{1,2,\dots ,N\}, \end{aligned}$$
(7)

where the matrix \({\varvec{A}}\in \mathbb {R}^{S\times P}\) holds the linear projection in the form \({\varvec{x}}'_n={\varvec{x}}_n{\varvec{A}}\), with \({\varvec{x}}'_n\in \mathbb {R}^{P}\), being P the required number of dimensions to preserve 90% of the input data variability, and \({\varvec{A}}{\varvec{A}}^{\top }\) the corresponding inverse covariance matrix in Eq. (7), assuming \(P\le S\). Therefore, intending to compute the projection matrix \({\varvec{A}}\), a CKA-based objective function can be formulated as the following kernel-based learning algorithm:

$$\begin{aligned} \hat{{\varvec{A}}} = \text {arg} \max _{{\varvec{A}}} \log \left( \hat{\rho }(\mathbf{K }_{X}({\varvec{A}};\sigma ), \mathbf{K }_{L}) \right) \!, \end{aligned}$$
(8)

where the logarithm function is employed for mathematical convenience. The optimization problem in Eq. (8) is solved recursively based on the well-known gradient descent approach. After estimating the projection matrix \(\hat{{\varvec{A}}}\), we assess the relevance of the S input features. To this end, the most contributing features are assumed to have the highest similarity relationship with the provided output labels. Specifically, the CKA-based relevance analysis calculates the relevance vector index \({\varvec{r}}\in \mathbb {R}^{S}\), holding elements \(r_s\) that measure the contribution of the s-th input feature in building the projection matrix \(\hat{{\varvec{A}}}\). Hence, to calculate those elements, a stochastic measure of variability is utilized as \(r_s=\mathbb {E}\{|a_{s,p}|;\forall p\in [1,P]\}\), where \(a_{s,p}\) is the (s,p)-th element of \(\hat{{\varvec{A}}}\), and \(\mathbb {E}\{\cdot \}\) denotes the expectation operator.
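To make the above pipeline concrete, the sketch below computes the Mahalanobis-based Gaussian kernel over a projection matrix, the centered alignment estimator of Eq. (6), and a per-feature relevance index taken as the average absolute weight of each row of the learned projection, one common choice of the stochastic variability measure in CKA-based relevance analysis [2]. Function names are illustrative, and the gradient-based maximization of Eq. (8) over \({\varvec{A}}\) is omitted.

```python
import numpy as np

def gaussian_kernel(X, A, sigma):
    """Gaussian kernel with Mahalanobis distance (Eqs. 5 and 7): the
    distance induced by A A^T equals the squared Euclidean distance
    between the projected samples X @ A."""
    Xp = X @ A                                            # (N, P)
    sq = np.sum(Xp ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Xp @ Xp.T, 0.0)
    return np.exp(-d2 / (2 * sigma ** 2))

def centered_alignment(Kx, Kl):
    """Empirical CKA estimator of Eq. (6); values near 1 indicate that
    Kx closely matches the label kernel Kl."""
    N = Kx.shape[0]
    I_t = np.eye(N) - np.ones((N, N)) / N                 # centering matrix
    Kx_c, Kl_c = I_t @ Kx @ I_t, I_t @ Kl @ I_t
    num = np.sum(Kx_c * Kl_c)                             # Frobenius inner product
    return num / np.sqrt(np.sum(Kx_c ** 2) * np.sum(Kl_c ** 2))

def relevance_index(A_hat):
    """Contribution of each input feature s: average absolute weight of
    row s across the P columns of the learned projection, normalized."""
    r = np.mean(np.abs(A_hat), axis=1)
    return r / r.max()
```

The Dirac-delta label kernel is simply `Kl = (y[:, None] == y[None, :]).astype(float)`; maximizing `centered_alignment(gaussian_kernel(X, A, sigma), Kl)` over `A` by gradient ascent would yield \(\hat{{\varvec{A}}}\).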

3 Experimental Set-Up

Database. To test our video-based human action recognition using kernel relevance analysis (HARK), we employ the UCF50 database [10]. This database contains realistic videos taken from Youtube, with large variations in camera motion, object appearance and pose, illumination conditions, scale, etc. For concrete testing, we use \(N=5967\) videos concerning the 46 human action categories for which the human bounding box file was available [13]. The video frame size is \(320\times 240\) pixels, and the length varies from around 70 to 200 frames. The dataset is divided into 25 predefined groups. Following the standard procedure, we perform a leave-one-group-out cross-validation scheme and report the average classification accuracy over all 25 folds.
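The evaluation protocol can be sketched as a minimal group iterator (scikit-learn's `LeaveOneGroupOut` offers an equivalent); the reported accuracy is the average over the 25 resulting folds:

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs: each predefined group is held
    out once as the test fold while the remaining groups form the
    training set (25 folds for the UCF50 splits)."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        yield np.where(groups != g)[0], np.where(groups == g)[0]
```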

HARK Training. Initially, for each video sample in the dataset, we employ the Improved Dense Trajectories (iDT) feature estimation technique, with the code provided by the authors of [13], keeping the default parameter settings, to extract \(F=5\) different descriptors: Trajectory (x, y normalized positions along 15 frames), HOG, HOF, MBHx, and MBHy. The iDT technique is an improved version of the earlier Dense Trajectories technique by the same authors, which removes the trajectories generated by camera motion and the inconsistent matches due to humans. Thus, human detection is a challenging requirement of this technique, as people in action datasets appear in many different poses and may be only partially visible due to occlusion or to being partially out of scene. These five descriptors are extracted along all valid trajectories, and the resulting dimensionality \(D_f\) is 30 for Trajectory; 96 for HOG, MBHx, and MBHy; and 108 for HOF.

We then randomly select a subsample of trajectories from the training set to estimate a GMM codebook with \(K = 256\) Gaussians, and the FV encoding is performed as explained in Sect. 2. Afterwards, we apply to the resulting vector a Power Normalization (PN, \(\text {sign}(x)|x|^{\alpha }\), where \(0\le \alpha \le 1\) is the normalization parameter) followed by L2-normalization. The above procedure is performed per descriptor, fixing \(\alpha =0.1\). Next, all five normalized FV representations are concatenated, yielding an encoding dimension of \(S = 218112\). The linear projection matrix and the relevance vector index are computed as explained in Sect. 2, where \(P=104.8\) is the average number of dimensions required, through the 25 leave-one-group-out iterations, to preserve 90% of the input data variability.
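The per-descriptor normalization step can be sketched as follows (a minimal version of PN followed by L2-normalization; in the experiments \(\alpha =0.1\)):

```python
import numpy as np

def power_l2_normalize(x, alpha=0.1):
    """Power normalization sign(x)|x|^alpha followed by L2 normalization,
    applied to each descriptor's FV before concatenation."""
    x = np.sign(x) * np.abs(x) ** alpha
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x
```

The power step dampens the bursty, peaky components typical of FV super-vectors, while the L2 step makes the representations comparable across videos of different lengths.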

For the classification step, we use a one-vs-all linear SVM with regularization parameter equal to 1, and a Gaussian kernel SVM, varying the kernel bandwidth within the range \([0.1\sigma _o, \sigma _o]\), being \(\sigma _o\) the median of the input space Euclidean distances, and searching the regularization parameter within the set \(\{0.1,1,100,500,1000\}\) by nested cross-validation with the same leave-one-group-out scheme. Figure 1 summarizes the HARK training pipeline. It is worth noting that all experiments were performed using Matlab on a Debian server with 230 GB of RAM and 40 cores. The FV code is part of the open-source library VLFeat and is publicly available. The CKA code was developed by Alvarez-Meza et al. in [2] and is also publicly available.
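The reference bandwidth \(\sigma _o\) (the median of input-space Euclidean distances) can be estimated as below. This is a sketch: with \(N=5967\) samples the full pairwise matrix is still tractable, but subsampling pairs is a common practical shortcut for larger sets.

```python
import numpy as np

def median_bandwidth(X):
    """Median of pairwise Euclidean distances, used as the reference
    bandwidth sigma_o for the RBF SVM grid [0.1*sigma_o, sigma_o]."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    d = np.sqrt(d2)
    # Keep only the strict upper triangle (each pair counted once)
    return np.median(d[np.triu_indices_from(d, k=1)])
```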

Fig. 1. Sketch of the proposed HARK-based action recognition system.

4 Results and Discussions

Figure 2 shows a visual example of feature estimation and encoding using trajectory-aligned descriptors and BOW. From the colored points, where different colors represent the assignment of a given trajectory to one of the prototype vectors generated by the k-means algorithm, we can appreciate the hard assignment of trajectory descriptors in the BOW encoding. Also, the different point sizes represent the scale at which the trajectory is generated. In contrast, this paper uses the soft assignment of the GMM-based FV encoding, which is not as straightforward to depict in a figure. It is worth noting that, due to the human segmentation performed before the trajectory-based feature estimation, the encoding points are mainly grouped around the player, which constrains the zone of interest to characterize only the player information. This strategy helps to reduce the uncertainty of the video representation, as the influence of the background is decreased.
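The contrast between the BOW hard assignment and the FV soft assignment described above can be sketched as follows (illustrative NumPy; k-means centroids and GMM parameters are assumed to be given):

```python
import numpy as np

def hard_assignment(Z, centroids):
    """BOW-style hard assignment: each trajectory descriptor is mapped
    to the index of its nearest k-means prototype vector."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (T, K)
    return d2.argmin(axis=1)                                     # (T,)

def soft_assignment(Z, w, mu, sigma):
    """FV-style soft assignment: posterior probability gamma_t(i) of
    each trajectory under each diagonal Gaussian of the GMM (Eq. 4)."""
    log_p = np.stack([
        -0.5 * np.sum(((Z - mu[i]) / sigma[i]) ** 2
                      + np.log(2 * np.pi * sigma[i] ** 2), axis=1)
        for i in range(w.shape[0])], axis=1)                     # (T, K)
    log_w = np.log(w) + log_p
    g = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    return g / g.sum(axis=1, keepdims=True)                      # (T, K)
```

The hard assignment discards how close a trajectory is to the other prototypes, whereas each row of the soft assignment is a full distribution over the K Gaussians, which is what the FV gradients of Eqs. (2)-(3) aggregate.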

Fig. 2. Feature estimation and encoding using trajectory-aligned descriptors and BOW.

Figure 3(a) shows the normalized relevance values of the Trajectory, HOG, HOF, MBHx, and MBHy descriptors; this figure is generated by averaging the components of \({\varvec{r}}\) that correspond to each descriptor, so the mean and standard deviation represent each descriptor's relevance. As seen, the HOG descriptor exhibits the highest relevance value regarding our HARK criteria; this descriptor quantifies the local appearance and shape within the trajectory-aligned space window through the distribution of intensity gradients. Notably, all the other descriptors, which mainly quantify the human local motion (Trajectory normalized positions, HOF, MBHx, and MBHy), are very close regarding their relevance values. Hence, the trajectory-aligned descriptors match the human action labels similarly under the CKA-based analysis presented in Sect. 2, as they are all local measures of appearance, shape, and motion that are equally important to support action recognition. Remarkably, the relevance value in Fig. 3(a) mainly depends upon the discrimination capability of the Gaussian kernel in Eq. (5) and the local measure being performed by the descriptor. Now, as seen in Fig. 3(b), the CKA embedding in its first two projections provides an insight into the data overlapping. The observed overlapping between classes (human actions) can be attributed to similar intra-class variations in several categories, as videos with realistic scenarios have inherent attributes, such as background clutter, scale changes, dynamic viewpoint, and sudden motion, that may adversely affect class separability.

Furthermore, as evidenced by the confusion matrix of the test set in Fig. 3(c), an RBF SVM over the CKA feature embedding obtains \(90.97 \pm 2.64\)% accuracy in classifying human actions on the employed dataset. From this matrix, classes 22 and 23, which correspond to Nunchucks and Pizza Tossing respectively, generate classification problems because the human movements performed in both are similar. As expected, the RBF SVM achieves more reliable recognition than a linear SVM, as the data problem in Fig. 3(b) is non-linear; see the results presented for this paper in Table 1. Notably, our approach requires only 104.8 dimensions on average, through 25 leave-one-group-out iterations, to classify the 46 actions of the UCF50 dataset with competitive accuracy. This is very useful because more elaborate classifiers (once discarded due to the data dimension) can be employed to increase the recognition rate further.

Fig. 3. Human action recognition on the UCF50 database. (a) Feature relevance values. (b) 2D input data projection from 46 action categories using CKA. (c) Confusion matrix for the test set under a nested leave-one-group-out validation scheme using an RBF SVM classifier.

In turn, Table 1 presents a comparative study of the results achieved by our HARK and similar state-of-the-art approaches for human action recognition on the UCF50 database. To build this comparative analysis, approaches with a similar experimental set-up are employed, specifically those using the iDT representation and similar descriptors. Primarily, the compared results exhibit a trade-off between data dimension and accuracy. More elaborate procedures, such as the one presented in [5], use Time Convolutional Network (TCN) and Spatial Convolutional Network (SCN) descriptors along with iDT descriptors, and Spatio-temporal VLAD (ST-VLAD) encoding, to enhance class separability. Thus, the mentioned approach obtains a very high mean accuracy of 97.7%. However, its data dimensionality is considerably high, which limits the usage of many classifiers. On the other hand, the approach presented in [13] enhances the spatial resolution of the iDT descriptors by using a strategy called spatio-temporal pyramids (STP) along with Spatial Fisher Vector (SFV) encoding. The accuracy results of [13] are comparable to ours; nonetheless, the data dimension is drastically higher.

Table 1. Comparison with similar approaches in the state-of-the-art on the UCF50 dataset.

5 Conclusions

In this paper, we introduced a video-based human action recognition system using kernel relevance analysis (HARK). Our approach highlights the primary descriptors for predicting the output labels of human action videos using a trajectory representation. HARK quantifies the relevance of \(F=5\) trajectory-aligned descriptors through a CKA-based algorithm that matches the input space with the output labels, enhancing descriptor interpretability, as it allows determining the importance of local measures (appearance, shape, and motion) to support action recognition. Also, the CKA algorithm computes a linear projection matrix, through a non-linear representation, whose columns quantify the required number of dimensions to preserve 90% of the input data variability. Hence, by projecting the video samples into the generated CKA space, the class separability is preserved and the number of dimensions is reduced. Attained results on the UCF50 database show that our proposal correctly classified 90.97% of the human action samples using an average input data dimension of 104.8 in the classification stage, through 25 folds under a leave-one-group-out cross-validation scheme. In particular, according to the performed relevance analysis, the most relevant descriptor is HOG, which quantifies the local appearance and shape through the distribution of intensity gradients. Remarkably, HARK outperforms state-of-the-art results concerning the trade-off between the achieved accuracy and the required data dimension (Table 1). As future work, the authors plan to employ other descriptors, such as the deep features presented in [5]. Also, a HARK improvement based on the enhancement of spatial and temporal resolution, as presented in [13], could be an exciting research line.