
1 Introduction

Research on human action recognition can be divided into two areas: (i) single-action recognition, and (ii) multi-action recognition. Most of the computer vision literature has concentrated on single-action recognition, where each video to be classified contains only one action. However, when observing realistic human behaviour in natural settings, the fundamental problem is segmenting and recognising actions from a multi-action sequence [1]. This is challenging due to the high variability of appearances and shapes, possible occlusions, large variability in the temporal scale and periodicity of human actions, the complexity of articulated motion, the exponential nature of all possible movement combinations, as well as the prevalence of irrelevant background [2, 3].

Hoai et al. [2] address joint segmentation and classification by classifying temporal regions using a multi-class Support Vector Machine (SVM) and performing segmentation using dynamic programming. A similar approach is presented in [4], where the temporal relationship between actions is considered. Borzeshi et al. [5] proposed the use of hidden Markov models (HMMs) with irregular observations (termed HMM-MIO) to perform multi-action recognition. More recently, Carvajal et al. [6] proposed to model each action using a Gaussian mixture model (GMM) approach where classification decisions are made using overlapping temporal windows. A drawback of [2, 4, 5] is that they have a large number of parameters to optimise. Furthermore, [5] requires an extra dimensionality reduction stage due to its use of very high-dimensional feature vectors, while [2, 4] require fully labelled annotations for training. One downside of [6] is that for each action and scenario a model with a large number of Gaussians is required, making the system computationally expensive.

Typically, the aforementioned approaches used for the multi-action recognition task can be classified as either generative or discriminative models. The approaches presented in [5, 6] are generative models, while those presented in [2, 4] are discriminative models. Generative and discriminative models have complementary strengths. Generative models can easily deal with variable length sequences and missing data, while also being easier to design and implement [7, 8]. In contrast, discriminative models often achieve superior classification and generalisation performance [7, 8]. An ideal recognition system would hence combine these two separate but complementary approaches.

The Fisher vector (FV) approach [7, 9, 10] allows for the fusion of both generative and discriminative models. In contrast to the popular Bag of Words (BoW) approach [11] which describes images by histograms of visual words, the FV approach describes images by deviations from a probabilistic visual vocabulary model. The resulting vectors can then be used by an SVM for final classification. Recently, FV has been successfully applied to the single-action recognition problem [12, 13].

A reliable low-level feature descriptor is crucial for the success of an action recognition system. One popular descriptor for action recognition is Spatio-Temporal Interest Points (STIPs) [14]. However, STIP based descriptors have several drawbacks [15, 16]: they are computationally expensive, unstable, imprecise, and can result in unnecessarily sparse detections. See Fig. 1 for a demonstration of STIP based detection. Other feature extraction techniques used for action recognition include gradients [6] and optical flow [16, 17]. The gradient image carries relevant information at each pixel, e.g. edges (see Fig. 1). Since action recognition is based on a sequence of frames, optical flow provides an efficient way of capturing the local dynamics [16].

Fig. 1. Top row: feature extraction based on Spatio-Temporal Interest Points (STIPs) is often unstable, imprecise and overly sparse. Bottom row: interest pixels (marked in red) obtained using magnitude of gradient. (Color figure online)

Contributions. In this paper we propose a novel hierarchical system to perform multi-action segmentation and recognition. A given video is processed as a sequence of overlapping temporal windows. Each frame in a temporal window is represented through selective low-level spatio-temporal features, based on a combination of gradients with optical flow. Interesting features from each temporal window are then pooled and processed to form a Fisher vector. Instead of directly classifying each Fisher vector, a multi-class SVM is employed to generate a vector of probabilities, with one probability per class. The final classification decision (action label) for each frame is then obtained by integrating the class probabilities at the frame level, which exploits the overlapping of the temporal windows. The proposed system hence combines the benefits of generative and discriminative models.

To the best of our knowledge, the combination of probabilistic integration with Fisher vectors is novel for the multi-action segmentation and recognition problem. In contrast to [2, 4, 5], the proposed system requires fewer parameters to be optimised. We also avoid the need for a custom dynamic programming definition as in [2, 4]. Lastly, unlike the GMM approach proposed in [6], the proposed method requires only one GMM for all actions, making it considerably more efficient.

2 Proposed Method

The proposed system has a hierarchical nature, stemming from progressive reduction and transformation of information, starting at the pixel level. The system is comprised of four main components:

  1. Division of a given video into overlapping multi-frame temporal windows, followed by extracting interesting low-level spatio-temporal features from each frame in each window.

  2. Pooling of the interesting features from each temporal window to generate a sequence of Fisher vectors.

  3. Conversion of each Fisher vector into a vector of class probabilities with the aid of a multi-class SVM.

  4. Integration of the class probabilities at the frame level, leading to the final classification decision (action label) for each frame.

Each of the components is explained in more detail in the following subsections.

2.1 Overlapping and Selective Feature Extraction

A video \({\mathcal {V}} = \left( {\varvec{I}}_t \right) _{t=1}^T\) is an ordered sequence of T frames. We divide \({\mathcal {V}}\) into a set of overlapping temporal windows \(\left( \mathcal {W}_s \right) _{s=1}^{S}\), with each window having a length of L frames. To achieve overlapping, each window starts one frame after the start of the preceding window. Each temporal window is hence defined by its frame indices: \(\mathcal {W}_s = \left( t_{\mathrm {start}},~ \ldots ,~ t_{\mathrm {start}+L-1}\right) \).
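
The windowing scheme can be sketched in a few lines of Python; this is only an illustration (zero-based frame indices and the function name are our own, not part of the paper):

```python
def temporal_windows(num_frames, window_length):
    """Enumerate overlapping windows: each window holds window_length
    consecutive frame indices and starts one frame after the previous one."""
    return [tuple(range(start, start + window_length))
            for start in range(num_frames - window_length + 1)]

# Example: a 10-frame video with 4-frame windows yields 7 windows,
# (0, 1, 2, 3), (1, 2, 3, 4), ..., (6, 7, 8, 9)
windows = temporal_windows(10, 4)
```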

Each frame \(\varvec{I}_t \in \mathbb {R}^ {r\times c}\) can be represented by a set of feature vectors \(F_t = \{\varvec{f}_p\}_{p=1}^{N_t}\) (with \(N_t < r \cdot c\)) corresponding to interesting pixels. Following [6], we first extract the following \(D=14\) dimensional feature vector for each pixel in a given frame t:

$$\begin{aligned} \varvec{f} = \left[ \; x, \; y, \; \varvec{g}, \; \varvec{o} \; \right] \end{aligned}$$
(1)

where x and y are the pixel coordinates, while \(\varvec{g}\) and \(\varvec{o}\) are defined as:

$$\begin{aligned} \varvec{g} = \left[ \; |J_x|, \; |J_y|, \; |J_{yy}|, \; |J_{xx}|, \; \sqrt{J_x^2 + J_y^2}, \; \text {atan} \frac{|J_y|}{|J_x|} \; \right] \end{aligned}$$
(2)
$$\begin{aligned} \varvec{o} = \left[ \; u, \;\; v, \;\; \frac{\partial {u}}{\partial {t}}, \;\; \frac{\partial {v}}{\partial {t}}, \;\; \left( \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} \right) , \;\; \left( \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y} \right) \; \right] \end{aligned}$$
(3)

The first four gradient-based features in Eq. (2) represent the first and second order intensity gradients at pixel location (x, y). The last two gradient features represent the gradient magnitude and gradient orientation. The optical flow based features in Eq. (3) represent, in order: the horizontal and vertical components of the flow vector, their first order derivatives with respect to time, and the divergence and vorticity of the optical flow [17].

Typically only a subset of the pixels in a frame correspond to the object of interest. As such, we are only interested in pixels with a gradient magnitude greater than a threshold \(\tau \) [18]. We discard feature vectors from locations with a small magnitude. In other words, only feature vectors corresponding to interesting pixels are kept. This typically results in a variable number of feature vectors per frame. See the bottom part in Fig. 1 for an example of the retained pixels.
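
For illustration only, the following sketch computes the 14-dimensional per-pixel features and keeps the interesting pixels. It is not the authors' implementation: OpenCV's Sobel filters and Farneback optical flow are assumed stand-ins for whichever gradient and flow estimators were actually used, and the default threshold value follows Sect. 3.

```python
import cv2
import numpy as np

def frame_features(prev_gray, curr_gray, prev_flow, tau=40.0):
    """Per-pixel 14-D feature vectors for one grayscale (uint8) frame, keeping
    only 'interesting' pixels whose gradient magnitude exceeds tau."""
    # First and second order intensity gradients (Eq. 2)
    Jx  = cv2.Sobel(curr_gray, cv2.CV_64F, 1, 0, ksize=3)
    Jy  = cv2.Sobel(curr_gray, cv2.CV_64F, 0, 1, ksize=3)
    Jxx = cv2.Sobel(curr_gray, cv2.CV_64F, 2, 0, ksize=3)
    Jyy = cv2.Sobel(curr_gray, cv2.CV_64F, 0, 2, ksize=3)
    mag = np.sqrt(Jx ** 2 + Jy ** 2)
    ori = np.arctan2(np.abs(Jy), np.abs(Jx))

    # Dense optical flow between consecutive frames (Eq. 3); prev_flow is the
    # flow of the previous frame pair (zeros of shape H x W x 2 for the first
    # pair), used for the temporal derivatives of u and v
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]
    du_dt, dv_dt = u - prev_flow[..., 0], v - prev_flow[..., 1]
    du_dy, du_dx = np.gradient(u)   # np.gradient returns d/dy first for 2-D arrays
    dv_dy, dv_dx = np.gradient(v)
    div, vort = du_dx + dv_dy, dv_dx - du_dy   # divergence and vorticity

    ys, xs = np.nonzero(mag > tau)  # keep only interesting pixels
    feats = np.stack([xs, ys,
                      np.abs(Jx)[ys, xs], np.abs(Jy)[ys, xs],
                      np.abs(Jyy)[ys, xs], np.abs(Jxx)[ys, xs],
                      mag[ys, xs], ori[ys, xs],
                      u[ys, xs], v[ys, xs], du_dt[ys, xs], dv_dt[ys, xs],
                      div[ys, xs], vort[ys, xs]], axis=1)   # shape: N_t x 14
    return feats, flow
```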

2.2 Representing Windows as Fisher Vectors

Given a set of feature vectors, the Fisher vector approach encodes the deviations from a probabilistic visual dictionary, which is typically a diagonal GMM. The parameters of a GMM with K components can be expressed as \(\{w_k,\varvec{\mu }_k,\varvec{\sigma }_k\}_{k=1}^{K}\), where \(w_k\) is the weight, \(\varvec{\mu }_k\) is the mean vector, and \(\varvec{\sigma }_k\) is the diagonal covariance matrix of the k-th Gaussian. The parameters are learned via the Expectation-Maximisation algorithm [19] on training data.

For each temporal window \(\mathcal {W}_s\), the feature vectors are pooled into set X containing \(N {~=~} \sum _{t \in \mathcal {W}_s} N_t\) vectors. The deviations from the GMM are then accumulated via [10]:

$$\begin{aligned} \mathcal {G}_{\varvec{\mu }_{k}}^{X} = \frac{1}{N\sqrt{w_k}} \sum \nolimits _{n=1}^{N} \gamma _n(k)\left( \frac{\varvec{f}_n - \varvec{\mu }_k}{\varvec{\sigma }_k} \right) \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {G}_{\varvec{\sigma }_{k}}^{X} = \frac{1}{N\sqrt{2w_k}} \sum \nolimits _{n=1}^{N} \gamma _n(k)\left[ \frac{\left( \varvec{f}_n - \varvec{\mu }_k\right) ^2}{\varvec{\sigma }_k^2} -1 \right] \end{aligned}$$
(5)

where vector division indicates element-wise division and \(\gamma _n(k)\) is the posterior probability of \(\varvec{f}_n\) for the k-th component: \(\gamma _n(k) = {w_k\mathcal {N}(\varvec{f}_n|\varvec{\mu }_k, \varvec{\sigma }_k)} / { \left( \sum \nolimits _{i=1}^{K}w_i\mathcal {N}(\varvec{f}_n|\varvec{\mu }_i, \varvec{\sigma }_i) \right) }\).

The Fisher vector for window \(\mathcal {W}_s\) is represented as the concatenation of \(\mathcal {G}_{\varvec{\mu }_{k}}^{X}\) and \(\mathcal {G}_{\varvec{\sigma }_{k}}^{X}\) (for k = \(1, \ldots , K\)) into vector \(\varvec{\varPhi }_s\). As \(\mathcal {G}_{\varvec{\mu }_{k}}^{X}\) and \(\mathcal {G}_{\varvec{\sigma }_{k}}^{X}\) are D-dimensional, \(\varvec{\varPhi }_s\) has the dimensionality of 2DK. Note that we have omitted the deviations for the weights as they add little information [10].
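
A minimal sketch of this encoding is given below, assuming the visual vocabulary is a fitted diagonal-covariance GaussianMixture from scikit-learn; the function and the concatenation order are our own illustration, not the authors' code. One such Fisher vector is computed per temporal window from its pooled feature vectors.

```python
import numpy as np

def fisher_vector(X, gmm):
    """Encode pooled feature vectors X (N x D) against a fitted
    sklearn GaussianMixture(covariance_type='diag') via Eqs. (4)-(5);
    returns a 2*D*K dimensional Fisher vector (weight deviations omitted)."""
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                      # N x K posteriors gamma_n(k)
    w, mu, sigma = gmm.weights_, gmm.means_, np.sqrt(gmm.covariances_)

    G_mu, G_sigma = [], []
    for k in range(gmm.n_components):
        diff = (X - mu[k]) / sigma[k]                 # element-wise division
        G_mu.append((gamma[:, k:k+1] * diff).sum(axis=0) / (N * np.sqrt(w[k])))
        G_sigma.append((gamma[:, k:k+1] * (diff ** 2 - 1)).sum(axis=0)
                       / (N * np.sqrt(2 * w[k])))
    return np.concatenate(G_mu + G_sigma)
```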

2.3 Generation of Probability Vectors

For each Fisher vector we generate a vector of probabilities, with one probability per action class. First, a multi-class SVM [20] is used to predict class labels, outputting a set of raw scores. The scores are then transformed into a probability distribution over classes by applying Platt scaling [21]. The final probability vector derived from Fisher vector \(\varvec{\varPhi }_s\) is expressed as:

$$\begin{aligned} \varvec{q}_s = \left[ P(l=1|\varvec{\varPhi }_s), ~ \cdots , ~ P(l=A | \varvec{\varPhi }_s) \right] \end{aligned}$$
(6)

where l indicates an action class label and A is the number of action classes. The parameters for the multi-class SVM are learned using Fisher vectors obtained from pre-segmented actions in training data.
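
As an illustrative sketch (not necessarily the authors' tooling), scikit-learn's SVC with probability=True trains a multi-class SVM and applies Platt scaling internally; the array names below are assumptions:

```python
from sklearn.svm import SVC

# Training data: Fisher vectors of pre-segmented single actions
# (assumed arrays: train_fvs of shape M x 2DK, train_labels of shape M)
svm = SVC(kernel='linear', probability=True)   # probability=True enables Platt scaling
svm.fit(train_fvs, train_labels)

# q has shape S x A: one probability vector q_s (Eq. 6) per temporal window,
# with columns ordered as in svm.classes_
q = svm.predict_proba(window_fvs)
```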

2.4 Integrating Probability Vectors to Label Frames

As the temporal windows are overlapping, each frame is present in several temporal windows. We exploit the overlapping to integrate the class probabilities at the frame level. The total contribution of the probability vectors to each frame t is calculated by:

$$\begin{aligned} \varvec{Q}_t = \sum \nolimits _{s=1}^{S} \mathbf {1}_{\mathcal {W}_s} (t) \cdot \varvec{q}_s \end{aligned}$$
(7)

where \(\mathbf {1}_{\mathcal {W}_s} (t)\) is an indicator function, resulting in 1 if \(t \in \mathcal {W}_s\), and 0 otherwise. The estimated action label for frame t is then calculated as:

$$\begin{aligned} \widehat{l}_t = \underset{l=1,\ldots ,A}{\mathrm {arg~max}} ~ \varvec{Q}_t^{[l]} \end{aligned}$$
(8)

where \(\varvec{Q}_t^{[l]}\) indicates the l-th element of \(\varvec{Q}_t\).
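
Equations (7) and (8) amount to summing the probability vectors of all windows covering a frame and taking the arg max per frame. A minimal sketch, reusing the window representation from the Sect. 2.1 sketch (our own illustration):

```python
import numpy as np

def label_frames(q, windows, num_frames, num_classes):
    """Accumulate window-level class probabilities into per-frame scores Q_t
    (Eq. 7) and pick the highest-scoring class per frame (Eq. 8)."""
    Q = np.zeros((num_frames, num_classes))
    for s, window in enumerate(windows):   # windows: iterable of frame-index tuples
        for t in window:
            Q[t] += q[s]                   # frame t is covered by window s
    return Q.argmax(axis=1)                # estimated action label per frame
```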

3 Experiments

We evaluated our proposed method for joint action segmentation and recognition on two datasets: (i) a stitched version of the KTH dataset [22], and (ii) the challenging Carnegie Mellon University Multi-Modal Activity Dataset (CMU-MMAC) [23]. The results are reported in terms of frame-level accuracy, computed as the ratio of correctly labelled frames to the total number of frames.

The s-KTH (stitched KTH) dataset is obtained by simply concatenating existing single-action instances into sequences [5]. The KTH dataset contains 25 subjects performing 6 types of human actions under 4 scenarios. The scenarios cover indoor and outdoor settings, scale variations, and varying clothes. Each original video of the KTH dataset [22] contains an individual performing the same action. This action is performed four times, and each subdivision or action-instance (in terms of start-frame and end-frame) is provided as part of the dataset. The dataset contains 2391 action-instances, with a length between 1 and 14 s [24]. The image size is 160\(\times \)120 pixels, and the temporal resolution is 25 frames per second.

The action-instances (each video contains four instances of the action) were picked randomly, alternating between the two groups of {boxing, hand-waving, hand-clapping} and {walking, jogging, running} to accentuate action boundaries. See Fig. 2 for an example. The dataset was divided into two sets as in [5, 6]: one for training and one for testing. In total, 64 and 36 multi-action videos were used for training and testing, respectively. We used 3-fold cross-validation.

Fig. 2. Example of a multi-action sequence in the stitched version of the KTH dataset (s-KTH): boxing, jogging, hand clapping, running, hand waving and walking.

Fig. 3. Example of a challenging multi-action sequence in the CMU-MMAC kitchen dataset: crack, read, stir, and switch-on.

The CMU-MMAC dataset is considerably more challenging as it contains realistic multi-action videos [23]. A kitchen was built to record subjects preparing and cooking food according to five recipes. This dataset has occlusions, a cluttered background, and many distractors such as objects being deliberately moved. For our experiments we have used the same subset as per [5], which contains 12 subjects making brownies. The subjects were asked to make brownies in a natural way (no instructions were given). Each subject making the brownie is partially seen, as shown in Fig. 3.

The videos have a high resolution and are longer than in s-KTH. The image size is 1024\(\times \)768 pixels, and the temporal resolution is 30 frames per second. The average duration of a video is approximately 15,000 frames and the average length of an action instance is approximately 230 frames (7.7 s), with a minimum length of 3 frames (0.1 s) and a maximum length of 3,269 frames (108 s) [5]. The dataset was annotated using 14 labels: 13 action labels (close, crack, open, pour, put, read, spray, stir, switch-on, take, twist-off, twist-on, walk), plus the label none covering the remaining frames (e.g. frames in between two distinct actions) [25]. We used 12-fold cross-validation, using one subject for testing on a rotating basis.

All videos were converted into gray-scale. Additionally, the videos from the CMU-MMAC dataset were re-scaled to 128\(\times \)96 to reduce computational requirements. Based on [6] and preliminary experiments on both datasets, we used \(\tau = 40\), where \(\tau \) is the threshold used for selection of interesting low-level feature vectors (Sect. 2.1). Although the interesting feature vectors are calculated in all frames, we only use the feature vectors extracted from every second frame in order to speed up processing.

Parameters for the visual vocabulary (GMM) were learned using a large set of descriptors obtained from training videos. Specifically, we randomly sampled 100,000 feature vectors for each action and then pooled all the resultant feature vectors from all actions for training. Experiments were performed with three separate GMMs with varying number of components: \(K=\{64,128,256\}\). We have not evaluated larger values of K due to increased computational complexity and hence the exorbitant amount of time required to process the large CMU-MMAC dataset.
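
A minimal sketch of this vocabulary-learning step, assuming the per-action training descriptors have already been collected into NumPy arrays (the variable names and the use of scikit-learn are our assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# feats_per_action: assumed dict mapping action label -> (N_a x 14) array of
# low-level descriptors from that action's training videos (N_a >= 100,000)
rng = np.random.default_rng(0)
sampled = [f[rng.choice(len(f), size=100_000, replace=False)]
           for f in feats_per_action.values()]

# Diagonal GMM learned via EM on the pooled samples; K = 256 is one of the
# evaluated dictionary sizes
vocab = GaussianMixture(n_components=256, covariance_type='diag', max_iter=100)
vocab.fit(np.vstack(sampled))
```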

To learn the parameters of the multi-class SVM, we used video segments containing single actions. For s-KTH this process is straightforward as the videos have been previously segmented. The CMU-MMAC dataset contains continuous multi-actions. For this reason, to train our system we obtain one Fisher vector per action in each video, using the low-level feature vectors belonging to that specific action.

3.1 Effect of Window Length and Dictionary Size

We have evaluated the performance of two variants of the proposed system: (1) probabilistic integration with Fisher vectors (PI-FV), and (2) probabilistic integration with BoW histograms (PI-BoW), where the Fisher vector representation is replaced with BoW representation. We start our experiments by studying the influence of the segment length L, expressed in terms of seconds. The results are reported in Figs. 4 and 5, in terms of average accuracy over the folds.

Using the PI-FV variant (Fig. 4), we found that \(L=1\,s\) and \(K=256\) leads to the best performance on the s-KTH dataset. For the CMU-MMAC dataset, the best performance is obtained with \(L=2.5\,s\) and \(K=64\). Note that using larger values of K (128 and 256) leads to worse performance. We attribute this to the large variability of appearance in the dataset, where the training data may not be a good representative of the test data. As such, using a large value of K may lead to overfitting to the training data.

The optimal segment length for each dataset is different. We attribute this to the s-KTH dataset containing short videos whose duration is between 1 s and 7 s, while CMU-MMAC has a large range of action durations between 0.1 s and 108 s. While the optimal values of L and K differ across the datasets, the results also show that relatively good overall performance across both datasets can be obtained with \(L=1\,s\) and \(K=64\).

The results for the PI-BoW variant are shown in Fig. 5. The best performance for the PI-BoW variant on the s-KTH dataset is obtained using \(L=1\,s\) and \(K=256\), while on the CMU-MMAC dataset it is obtained with \(L=2.5\,s\) and \(K=256\). These are the same values of L and K as for the PI-FV variant. However, the performance of the PI-BoW variant is consistently worse than the PI-FV variant on both datasets. This can be attributed to the better representation power of FV. Note that the visual dictionary size K for BoW is usually higher in order to achieve performance similar to FV. However, due to the large size of the CMU-MMAC dataset, and for direct comparison purposes, we have used the same range of K values throughout the experiments.

Fig. 4. Performance of the proposed PI-FV approach for varying the segment length on the s-KTH and CMU-MMAC datasets, in terms of average frame-level accuracy over the folds.

Fig. 5. As per Fig. 4, but showing the performance of the PI-BoW variant (where the Fisher vector representation is replaced with BoW representation).

3.2 Confusion Matrices

Figures 6 and 7 show the confusion matrices for the PI-FV and PI-BoW variants on the CMU-MMAC dataset. The confusion matrices show that for 50 % of the actions, the PI-BoW variant is unable to recognise the correct action. Furthermore, the PI-FV variant on average obtains better action segmentation than PI-BoW.

For five actions (crack, open, read, spray, twist-on), PI-FV has accuracies of \(0.5\,\%\) or lower. The action crack involves cracking and pouring eggs into a bowl, but it is annotated only as crack, leading to confusion between crack and pour. We suspect that the actions read and spray are poorly modelled due to a lack of training data, as they are performed by only a small number of subjects. The action twist-on is confused with twist-off, which is essentially the same action.

Fig. 6. Confusion matrix for the PI-FV variant on the CMU-MMAC dataset.

Fig. 7. As per Fig. 6, but using the PI-BoW variant.

3.3 Comparison with GMM and HMM-MIO

We have compared the performance of the PI-FV and PI-BoW variants against the HMM-MIO [5] and stochastic modelling [6] approaches previously used for multi-action recognition. The comparative results are shown in Table 1.

The proposed PI-FV method obtains the highest accuracy of 85.0 % and 40.9 % for the s-KTH and CMU-MMAC datasets, respectively. In addition to higher accuracy, the proposed method has other advantages over previous techniques. There is just one global GMM (representing the visual vocabulary). This is in contrast to [6] which uses one GMM (with a large number of components) for each action, leading to high computational complexity. The HMM-MIO method in [5] requires the search for many optimal parameters, whereas the proposed method has just two parameters (L and K).

Table 1. Comparison of the proposed methods (PI-FV and PI-BoW) against several recent approaches on the stitched version of the KTH dataset (s-KTH) and the challenging CMU-MMAC dataset.

3.4 Wall-Clock Time

Lastly, we provide an analysis of the computational cost (in terms of wall-clock time) of our system and the stochastic modelling approach. The wall-clock time is measured under optimal configuration for each system, using a Linux machine with an Intel Core processor running at 2.83 GHz.

On the s-KTH dataset, the stochastic modelling system takes on average 228.4 min to segment and recognise a multi-action video. In comparison, the proposed system takes 5.6 min, which is approximately 40 times faster.

4 Conclusions and Future Work

In this paper we have proposed a hierarchical approach to multi-action recognition that performs joint segmentation and classification in videos. Videos are processed through overlapping temporal windows. Each frame in a temporal window is represented using selective low-level spatio-temporal features which efficiently capture relevant local dynamics and do not suffer from the instability and imprecision exhibited by STIP descriptors [14]. Features from each window are represented as a Fisher vector, which captures the first and second order statistics. Rather than directly classifying each Fisher vector, it is converted into a vector of class probabilities. The final classification decision for each frame (action label) is then obtained by integrating the class probabilities at the frame level, which exploits the overlapping of the temporal windows.

The proposed approach has a lower number of free parameters than previous methods which use dynamic programming or HMMs [5]. It is also considerably less computationally demanding compared to modelling each action directly with a GMM [6].

Experiments were done on two datasets: s-KTH (a stitched version of the KTH dataset to simulate multi-actions), and the more challenging CMU-MMAC dataset (containing realistic multi-action videos of food preparation). On s-KTH, the proposed approach achieves an accuracy of 85.0 %, considerably outperforming two recent approaches based on GMMs and HMMs which obtained 78.3 % and 71.2 %, respectively. On CMU-MMAC, the proposed approach achieves an accuracy of 40.9 %, outperforming the GMM and HMM approaches which obtained 33.7 % and 38.4 %, respectively. Furthermore, the proposed system is on average 40 times faster than the GMM based approach.

Possible future areas of exploration include the use of the fast Fisher vector variant proposed in [26], where for each sample the deviations for only one Gaussian are calculated. This can deliver a large speed up in computation, at the cost of a small drop in accuracy [27].