1 Introduction

Recognition of single-person human actions is a central function of modern camera-based computer systems for understanding humans, with many applications such as surveillance, human–computer interaction (HCI) and motion retrieval.

1.1 Motivation

Over the last two decades, the majority of approaches (e.g., learning-based approaches including deep learning methods [1,2,3,4,5], instance matching-based methods [6,7,8] and sparse representation-based approaches [9, 10]) have focused on classifying a query video after collecting a large number of (or, ideally, a complete set of) labeled training samples. In other words, these methods rest on the assumption that a sufficient number of training samples is available per class, and their performance deteriorates when only a few training samples are available. Unfortunately, in some intelligent systems, users often cannot provide sufficient training samples for action modeling. For instance, in vision-based surveillance applications such as safety protection and terrorism/crime deterrence, abnormal actions/activities are often defined as those that rarely occur at the monitored sites, so users cannot collect sufficient training samples for designing detectors [11]. To address this problem, some researchers have attempted to collect extensive training samples from web data, e.g., [7, 12]. Collecting such volumes of data is, however, expensive and time-consuming in practice.

Another group of studies has taken a different route, performing action recognition with only a limited number of training samples. In particular, Seo and Milanfar [13] proposed a method that uses a single example of an action as a query to find similar matches by measuring the likeness of a voxel to its surroundings, based on novel space-time descriptors computed from the query video; Rodriguez et al. [14] proposed a method based on a maximum average correlation height (MACH) filter, which captures intra-class variability by synthesizing a single-action MACH filter for a given action class; Neverova et al. [15] presented a training strategy for settings where the number of labeled samples is not at the web scale of static image datasets, exploiting careful initialization of individual modalities and gradual fusion of modalities from the strongest to the weakest cross-modality structure. These approaches mostly enhance action recognition by improving the classifier training phase, and their performance therefore still lags behind methods that use a larger number of samples.

On the other hand, since training samples in many other applications contain considerable variation, methods that apply structural analysis to the original data before processing have been popular, as seen in [16,17,18]. For example, Ahmadi et al. [19] proposed to recover an accurate surgical workflow by averaging signals recorded in different operations of the same type, taking advantage of an enhanced version of the dynamic time warping (DTW) algorithm; Boudaoud et al. [20] presented statistical tools for shape dispersion analysis based on a mean shape curve learned according to the degree of specific polynomial time functions; Morlini and Zani [21] proposed a new method to estimate the structural mean of a sample of curves by modifying classical DTW, which they demonstrated on air pollutant data analysis; Xie et al. [22] introduced a method for clustering and averaging the tracks of people obtained in a multi-camera network using DTW and random sampling to optimize work cycles. In these works, structural mean/average learning has proven to be a promising strategy for enhancing model training when the training samples vary in amplitude and phase/timing.

In the area of action recognition, Cherla et al. [18] also proposed a fast and view-invariant average-template action model called the "action basis", built by eigen-analysis of training sequences from different people. The model shows great potential for action recognition with fewer training samples, but it uses empirical eigenvalues to construct the average template, which requires further quantitative investigation and experimental validation. Additionally, the action basis is only appropriate for unimodal classes, where the samples are expected to gather around their class center. In complex action recognition tasks, however, unimodality is a very strong assumption that does not hold: even the simplest action (e.g., walking) differs considerably across persons, views, scales, etc.

Fig. 1 Given a training set containing a limited number of samples, rather than directly using them for action modeling and recognition, we propose to learn structural average sequences from each sample pair in every class and then form a new set of training samples by combining the original samples and the average sequences for action modeling and recognition

In this paper, in line with structural mean/average analysis methods, we focus on further extending and validating average templates for action recognition when only a few training samples are available. Notably, unlike [18], which uses PCA to generate the average template, the method proposed in this paper uses structural average curves analysis (SACA) to generate average templates, taking into account the variations in timing and amplitude between sample sequences within each action class. Our method is complementary to methods for action recognition from limited samples, e.g., [13,14,15, 18], and could potentially be integrated with some of them to further improve their recognition performance.

1.2 Overview and contribution

As illustrated in Fig. 1, rather than directly using the original training samples for action modeling and recognition, we propose to learn structural average samples from these original samples using SACA, and then combine the resulting average samples with the original ones to form a new training set. Based on this new set, the statistical distribution of human actions can be extracted using, e.g., a bag-of-words (BoW) model. A query action can then be recognized with conventional classification strategies such as ANN, SVM and k-NNC. The main contributions of this paper to the field are:

  • SACA has been successfully applied to speech recognition. Here, we introduce SACA to the problem of action recognition. To the best of our knowledge, this is the first work that uses SACA to analyze human motions.

  • Instead of directly using the original training samples for action modeling, we propose to use the average samples extracted by SACA together with the original ones to model human actions, which takes into account the variations in timing and amplitude between video sequences within an action class.

  • The proposed method of action modeling is extended and validated on benchmark datasets by comparison with baselines that rely on the original samples alone. In addition, it could potentially be integrated with existing approaches to further improve their recognition performance.

The remainder of this paper is organized as follows. Section 2 details the SACA-based approach for the recognition of human actions. Experimental results are presented in Sect. 3, followed by discussions. Section 4 concludes this paper.

2 Methodology

2.1 Frame feature extraction

As the first step of video analysis, for a given query video F to be recognized, containing n frames, we extract features from each frame and concatenate the resulting features into a time-sequential feature set representing the video, \(F=\{f_i\}, i\in \{1,2,\ldots ,n\}\), where \(f_i\) denotes the features of the ith frame. It is worth mentioning that feature extraction plays an important role in video description and thus directly influences the subsequent action recognition. Further discussion of this procedure is, however, beyond the scope of the paper, since our focus here is to design an enhanced recognition framework using limited action samples. In other words, our framework does not rely on specific action features and is workable with any features that describe the video effectively and informatively.
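
As a minimal illustration (our function names, not part of the method itself), the following Python sketch stacks per-frame descriptors into the time-sequential set \(F\), with `extract_features` standing in for any concrete descriptor such as the LTSS features used in Sect. 3:

```python
import numpy as np

def video_to_sequence(frames, extract_features):
    """Represent a video as a time-sequential feature set F = {f_1, ..., f_n}.

    `extract_features` is a placeholder (hypothetical) per-frame descriptor
    mapping a frame to a fixed-length 1-D feature vector of dimension H.
    """
    return np.stack([extract_features(frame) for frame in frames])  # shape (n, H)
```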

2.2 Structural average curves analysis for action modeling

2.2.1 Problem formulation

Let \(\{F_i^c: i=1,2,\ldots ,N\}\) be the collected training samples for action class c, where i in \(F_i^c\) indexes the ith action video and N is the number of video samples for this class. Suppose each observed frame \(f_i(j)\) in a video \(F_i\) (i.e., \(f_i(j)\in F_i, 1\le j \le n_i\), where \(n_i\) is the number of frames in \(F_i\)) fits the following model:

$$\begin{aligned} f_i(j)=\mathcal {G}(t_{i,j})+\varepsilon _{i,j}, \quad j=1, 2, \ldots , n_i, \end{aligned}$$
(1)

where \(\mathcal {G}\) is a smoothing function, \(t_{i,j}\in [0, 1]\) are the timings (rescaled to a closed interval) of the ith video sequence and \(\{\varepsilon _{i,j}\}\) are independent and identically distributed (i.i.d.) errors with zero mean, i.e., \(\textsf {E}[\varepsilon _{i,j}]=0\).

The problem of learning average sequences is equivalent to estimating the smoothing function \(\mathcal {G}\). When all video samples in the class have the same number of frames, i.e., \(\forall i, n_i=n\), the expectation of \(f_i(j)\) is given by

$$\begin{aligned} \textsf {E}[f_i(j)]=\mathcal {G}(t_{i,j})+\textsf {E}[\varepsilon _{i,j}]=\mathcal {G}(t_{i,j}). \end{aligned}$$
(2)

Assuming ergodicity over the samples \(i\in \{1,\ldots ,m\}\) at each frame index, the law of large numbers lets us approximate each element g(j) of \(\mathcal {G}\) by the sample mean

$$\begin{aligned} g(j)\simeq \overline{f_i}(j)=\frac{1}{m}\sum _{i=1}^m f_i(j). \end{aligned}$$
(3)

This approach, however, accounts only for amplitude variations, not for timing variations. In real-life scenarios, action videos often have greatly different numbers of frames because of different performing paces/intensities between individuals, or sometimes even within the same individual. Similar timing variation is common in automatic speech recognition, where the processed speech sequences often vary in time or speed [23, 24]. To address this issue, an intuitive and natural alternative is to find the best match between every video sample \(f_i\) and an average sequence candidate \(\mathcal {G}=\{g(j'):j'=1,2,\ldots ,m\}\) through alignments \(\mathcal {W}\) that minimize a cost function based on the accumulated error:

$$\begin{aligned} \inf _{\mathcal {W}}\sum _{i=1}^{N}\sum _{(j,j')\in \mathcal {W}}||f_i(j)-g(j')||, \end{aligned}$$
(4)

where \(||\cdot ||\) is a distance metric. Thanks to dynamic programming, we can obtain \(\mathcal {W}=\{(j,j')\}\) as a warping path connecting (1, 1) and \((n_i,m)\). The problem addressed in this paper is then how to learn structural average sequences from \(\mathcal {W}=\{(j,j')\}\). The following section gives the procedure.

2.2.2 Averaging sequences

Sequentially optimizing Eq. (4) for each average sequence candidate is extremely time-consuming, or even infeasible. Therefore, some studies (see [19,20,21,22] for example) solve this problem on the basis of structural averaging analysis. Motivated by these works, given two arbitrary action video samples \(F=\{f(1), f(2), \ldots , f(n)\}\) and \(F'=\{f'(1), f'(2), \ldots , f'(n')\}\), we learn the structural average sequences as follows (a code sketch is given after the steps):

  • Step 1: Compute the distances for all frame pairs between F and \(F'\) (i.e., \(\{(f(i),f'(j)): i=1,2,\ldots ,n; j=1,2,\ldots ,n'\}\)) to form a two-dimensional lattice, and then extract the optimal warping path \(\mathcal {W}=\{w(k)\rightarrow (i(k),j(k)): k=1,2,\ldots ,K; i(1)=j(1)=1; i(K)=n, j(K)=n'\}\) from the resulting lattice using dynamic programming with respect to the cost function in Eq. (4);

  • Step 2: The warping path \(\mathcal {W}\) obtained in Step 1 has length K, which generally differs from the original timings (sampling rates) of the two sequences. We therefore normalize \(\mathcal {W}\) to a common timing \(\overline{K}\) by interpolation and averaging, where the common timing is produced by averaging the timings of F and \(F'\), i.e., \(\overline{K}=(n+n')/2\);

  • Step 3: The normalized warping path \(\mathcal {U}=\{u(k): k=1,2,\ldots ,\overline{K}\}\) indicates the best-matching pairs between the two video sequences (as shown in Fig. 2). We finally construct the average sequence \(\mathcal {F}\) as

    $$\begin{aligned} \begin{aligned}&\mathcal {F}=\{f^c(k): k=1,2,\ldots ,\overline{K}\},\\&f^c(k)=\left( f(\mathcal {U}^-(k))+f'(\mathcal {U}^-(k))\right) /2, \end{aligned} \end{aligned}$$
    (5)

    where \(\mathcal {U}^{-}\) is the inverse of \(\mathcal {U}\), which exists since \(\mathcal {U}\) is strictly increasing in time.
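
To make Steps 1–3 concrete, the following Python/NumPy sketch learns one pairwise average sequence, assuming each sample is an \((n, H)\) array of frame features and `frame_dist` is a frame-pair distance (see Sect. 2.3). The function names are ours; for simplicity, the interpolated path indices are rounded to the nearest observed frames before averaging, which is one of several reasonable ways to realize the interpolation of Step 2.

```python
import numpy as np

def dtw_path(F, Fp, frame_dist):
    """Step 1: dynamic programming over the lattice of all frame-pair
    distances; returns the optimal warping path from (0, 0) to (n-1, n'-1)."""
    n, m = len(F), len(Fp)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = frame_dist(F[i - 1], Fp[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m           # backtrack for W = {(i(k), j(k))}
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def average_pair(F, Fp, frame_dist):
    """Steps 2-3: resample the warping path to the common timing
    K_bar = (n + n') / 2 and average the aligned frames (Eq. (5))."""
    path = np.asarray(dtw_path(F, Fp, frame_dist), dtype=float)
    t_old = np.linspace(0.0, 1.0, len(path))
    K_bar = int(round((len(F) + len(Fp)) / 2))
    t_new = np.linspace(0.0, 1.0, K_bar)
    idx_F = np.interp(t_new, t_old, path[:, 0]).round().astype(int)
    idx_Fp = np.interp(t_new, t_old, path[:, 1]).round().astype(int)
    return (F[idx_F] + F[idx_Fp]) / 2.0   # average sequence, shape (K_bar, H)
```

Here `frame_dist` could be, e.g., `lambda a, b: np.abs(a - b).sum()`, matching Eq. (6) in Sect. 2.3.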

Fig. 2 Learning average sequences: a extracting the optimal warping path from the two-dimensional lattice obtained by computing the distances for all frame pairs between the two compared video sequences; b normalizing the original warping path to a common timing by interpolation and averaging

2.3 Practical issues

In Step 1, the distances for all frame pairs between the two compared video sequences F and \(F'\) (i.e., \(\{(f(i),f'(j)): i=1,2,\ldots ,n; j=1,2,\ldots ,n'\}\)) have to be computed to synchronize the two sequences. It is worth mentioning that the human actions studied in this paper are almost always represented by multiple features from different measurements, and each feature may carry a different weight/cue for action discrimination. For this reason, a classical distance metric, typically the Euclidean distance, is not well suited to such multi-dimensional sequences. To address this problem, our implementation computes the distance between each frame pair with the following procedure (a minimal sketch follows the list):

  • Normalize each dimension of F and \(F'\) separately to zero mean and unit variance, and smooth each dimension with a Gaussian filter;

  • Compute the distance matrix \(\mathcal {D}\) by:

    $$\begin{aligned} \mathcal {D}(i,j)=\sum _{h=1}^{H}|f(i,h)-f'(j,h)| \end{aligned}$$
    (6)

    where \(f(i,h)\) and \(f'(j,h)\) are the hth features of f(i) and \(f'(j)\), respectively, and H is the number of feature dimensions;

  • Use \(\mathcal {D}\) to find the optimal warping path with the Viterbi algorithm.
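
A minimal sketch of this frame-pair distance computation is given below (the function names and the smoothing width `sigma` are our choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def preprocess(F, sigma=1.0):
    """Per-dimension z-normalization followed by temporal Gaussian smoothing."""
    F = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-8)  # zero mean, unit variance
    return gaussian_filter1d(F, sigma=sigma, axis=0)

def distance_matrix(F, Fp):
    """Eq. (6): L1 distance over the H feature dimensions for every frame pair."""
    return np.abs(F[:, None, :] - Fp[None, :, :]).sum(axis=2)  # shape (n, n')
```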

2.4 Action modeling and recognition

Assuming we have learnt average sequences from every pair of action samples in each action class by the procedures described above, we now have a set of average sequences \(\{\mathcal {F}^c_i: i=1,2,\ldots ,N^c\}\) for each class c, whose size is \(N^c=C_N^2=N(N-1)/2\). By collecting the average sequences and the original action samples together (as shown in Fig. 3), we obtain a new set \(\mathcal {S}^c=\{F^c_i\}\cup \{\mathcal {F}^c_i\}\) for action modeling. We then use the bag-of-words (BoW) model to represent each sample in the new set \(\mathcal {S}^c\) for modeling the action of class c as follows (a sketch of the full pipeline is given after the list):

  • The codebook (i.e., vocabulary of words) is first constructed by clustering \(\{\mathcal {S}^c: c=1,2,\ldots ,C\}\) (C is the total number of classes) using the k-means algorithm, where codewords are defined as the centers of the resulting clusters;

  • Each frame in a video sample is assigned to one codeword by minimizing the Euclidean distance over all codewords in the codebook;

  • Last, each video sample is described as a histogram of assigned codewords. The effect of the codebook size K on action recognition was investigated in the experiments (see Fig. 4).
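
The sketch below assembles this pipeline, reusing `average_pair` from the sketch in Sect. 2.2.2; the clustering backend (SciPy's `kmeans2`) and the histogram normalization are implementation choices of ours:

```python
import numpy as np
from itertools import combinations
from scipy.cluster.vq import kmeans2

def extend_class_set(samples, frame_dist):
    """Form S^c: the N original samples plus the N(N-1)/2 pairwise averages."""
    averages = [average_pair(F, Fp, frame_dist)
                for F, Fp in combinations(samples, 2)]
    return list(samples) + averages

def bow_histograms(all_sequences, K=100):
    """Cluster all frames into K codewords and describe each video as a
    normalized histogram of codeword assignments."""
    codebook, _ = kmeans2(np.vstack(all_sequences), K, minit='++')
    hists = []
    for seq in all_sequences:
        d = np.linalg.norm(seq[:, None, :] - codebook[None, :, :], axis=2)
        h = np.bincount(d.argmin(axis=1), minlength=K).astype(float)
        hists.append(h / h.sum())          # histogram of assigned codewords
    return np.array(hists), codebook
```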

Fig. 3 Given an original set of action samples \(\{F^c\}\) for class c, we learn average sequences \(\{\mathcal {F}^c\}\) from every sample pair in the set. Action modeling is then performed on the newly formed set \(\mathcal {S}^c\) obtained by combining \(\{F^c\}\) and \(\{\mathcal {F}^c\}\)

Let us assume that we have a set of histograms of codewords with action labels \(c\in \{1,2,\ldots ,C\}\). A newly arrived query action video \(F^*\), also represented by a histogram over the learnt codewords, can then be classified using, for example, a k-nearest neighbors classifier (k-NNC) or a support vector machine (SVM).

3 Experimental validation

3.1 Dataset

Since our focus is on enhancing action recognition with a small number of action samples, we chose four small-scale benchmark datasets for our evaluation:

The Weizmann Dataset consists of 90 video sequences including 10 categories of human action: bend, jack, jump, pjump, run, side, skip, walk, wave1 and wave2, performed by each of nine subjects.

The UT-Tower Dataset consists of 108 video sequences from 9 types of actions: pointing, standing, digging, walking, carrying, running, wave1, wave2 and jumping. Each action is performed 12 times by 6 individuals.

The UC-3D Motion Database consists of 11 different activities, including 6 interactive actions and 5 single actions. Since this paper focuses on individual actions, we chose the 5 single actions for our investigation: bend, jumping, running, walking and sitting/standing cycle, performed 15 times by 5 individuals.

The UTD Multimodal Human Action Dataset (UTD MHAD) was released recently [25]. In this dataset, each action is performed by 8 subjects. We tested 15 actions in our investigation: swipe left, swipe right, wave, clap, throw, arm cross, basketball shoot, draw X, draw circle (clockwise), draw circle (counter-clockwise), draw triangle, bowling, boxing, baseball swing and tennis swing.

3.2 Experimental implementation

As stated previously, frame feature extraction is the first step of video analysis. In the experiments, we used local temporal self-similarities (LTSS) extracted from difference images for frame representation [26], owing to its relative simplicity of implementation and the fact that it requires neither bounding-box annotation nor subject detection. We used the same parameter settings as described in [26], which yield a total of 240 features per frame. Note that, in the Weizmann dataset, the two actions wave1 and wave2 have very similar flow and are easily confused by flow-based approaches; in the experiments we therefore tested only wave1, as done in [26].

For all datasets, in the following experiments, we tested codebook sizes K from 50 to 150 in steps of 5. We tested two widely used classification methods, k-NNC and SVM, for performing action recognition. They were operated as follows:

k-NNC: we compared \(F^*\) with the k nearest action samples in \(\mathcal {S}^c\) for each action class c, i.e., \(\{F^c_1,F^c_2,\ldots ,F^c_k\}\subset \mathcal {S}^c\), under a distance metric dist, typically the Euclidean distance. The most similar class was then chosen as

$$\begin{aligned} F^*\rightarrow \arg \min _c \sum _{i=1}^k dist (F^*,F^c_i). \end{aligned}$$
(7)
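
A direct NumPy rendering of this decision rule is sketched below (function names are ours; `class_hists` maps each class label to the BoW histograms of its extended set \(\mathcal {S}^c\)):

```python
import numpy as np

def knn_classify(query_hist, class_hists, k=3):
    """Eq. (7): assign the query to the class whose k nearest training
    histograms have the smallest accumulated Euclidean distance."""
    best_c, best_score = None, np.inf
    for c, H in class_hists.items():          # H: (N_c, K) histogram array
        d = np.linalg.norm(H - query_hist, axis=1)
        score = np.sort(d)[:k].sum()          # sum over the k nearest in S^c
        if score < best_score:
            best_c, best_score = c, score
    return best_c
```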

SVM: we trained an SVM with an RBF kernel in a one-against-all framework to handle multi-class classification. The LIBSVM library was used in MATLAB to implement the SVM-based action classification.

We also compared these classification methods with a recently proposed deep learning (DL)-based method. We implemented the DL method with a convolutional neural network (CNN). More specifically, the DeepLearnToolbox was used in MATLAB to accomplish this task, where we trained a 6c-2s-12c-2s CNN for multi-class classification. In this method, we fed the extracted LTSS features directly to the DL classifier.
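
For readers unfamiliar with the 6c-2s-12c-2s shorthand, the sketch below gives a rough PyTorch analogue: two convolutional layers with 6 and 12 feature maps, each followed by 2×2 average-pooling subsampling, and a linear classifier head. The kernel size, activations and the reshaping of the 240-D LTSS vector into a 16×15 map are our assumptions for illustration, not details of the MATLAB implementation used in the experiments.

```python
import torch.nn as nn

class SmallCNN(nn.Module):
    """Hypothetical 6c-2s-12c-2s analogue: conv(6) -> pool(2) -> conv(12) -> pool(2)."""
    def __init__(self, num_classes, in_size=(16, 15)):  # 16 * 15 = 240 LTSS features
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(2),
            nn.Conv2d(6, 12, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(2),
        )
        h, w = in_size[0] // 4, in_size[1] // 4   # spatial size after two 2x2 poolings
        self.classifier = nn.Linear(12 * h * w, num_classes)

    def forward(self, x):                         # x: (batch, 1, 16, 15)
        return self.classifier(self.features(x).flatten(1))
```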

Additionally, for all the above classification methods, we compared the recognition performance obtained with the original training samples against that obtained with the proposed scheme, to investigate its effectiveness and advantage. Leave-one-person-out cross-validation was used for classification evaluation.
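
A minimal sketch of the leave-one-person-out protocol is given below (function names are ours; `subject_ids` assigns a subject label to each sample, and in each fold the extended set \(\mathcal {S}^c\) would be built from the training split):

```python
import numpy as np

def leave_one_person_out(subject_ids):
    """Yield (train_idx, test_idx) index arrays, holding out all samples
    of one subject per fold."""
    subject_ids = np.asarray(subject_ids)
    for person in np.unique(subject_ids):
        test = np.flatnonzero(subject_ids == person)
        train = np.flatnonzero(subject_ids != person)
        yield train, test
```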

3.3 Results and analysis

Figure 4 shows the recognition rates over the tested values of the codebook size K using k-NNC and SVM classification. For all datasets, the recognition performance improves significantly for almost all tested values of K with our proposed method compared with using the original training samples. Figure 5 summarizes the average recognition rates obtained with k-NNC and SVM classification, as well as the recognition rate obtained with DL. For all datasets, the recognition rates of each classification method with the extended samples are higher than those with the original prototypical samples.

Fig. 4 Recognition rates on the four datasets: a Weizmann dataset; b UT-Tower dataset; c UC-3D Motion dataset; d UTD MHAD dataset

Fig. 5 Average recognition rates using k-NNC and SVM classification, and recognition rates of the DL method

Table 1 Summary of experimental results

More details are provided in Table 1. On the Weizmann dataset, our method achieved a recognition rate of 98.77% (SVM, \(K=145\)), versus 93.83% with the original samples (5-NNC, \(K=75\)). On the UT-Tower dataset, our method achieved 75% (3-NNC, \(K=100\) and SVM, \(K=135\)) versus 70.37% (SVM, \(K=140\)) with the original samples. On the UC-3D Motion dataset, our method achieved 93.33% (SVM, \(K=150\)) versus 81.33% (SVM, \(K=70\)) with the original samples. Last, on the UTD MHAD dataset, our method achieved 91.67% (SVM, \(K=120\)) versus 84.17% (1-NNC, \(K=115\)) with the original samples. One interesting observation is that, on every testing dataset, the recognition rate of DL is lower than those of k-NNC and SVM classification. This is not surprising, because the performance of DL classification relies heavily on the number of training samples, and the number of samples extended by our method is still somewhat limited on these datasets. In addition, several parameters can significantly affect DL performance; for example, as reported in [27], the recognition rate on the Weizmann dataset can reach 96.67% using a 3D CNN, which is higher than the 88.89% reported in our experiment. In this regard, we believe that the performance of the DL method integrated with our proposed scheme could be further improved by optimizing these settings. Further discussion is, however, beyond the scope of this paper, as our focus is on the extension of training samples.

In our method, we use the extended training set derived from SACA, instead of the original training set, for action modeling and recognition. An example with two randomly selected actions from the UT-Tower dataset is shown in Fig. 6. Intuitively, the samples are denser within each action class, and the two actions have a more distinguishable classification boundary in the extended training set than in the original training set. These are likely the main reasons why the proposed method achieves better recognition performance.

In addition, we derived structural average sequences by learning from each pair of video samples in every action class, based on the features obtained previously, and we chose only two conventional classification methods and one deep learning method for comparison. In fact, since our proposed method extends the training samples prior to action modeling and recognition, other feature extraction and classification methods can also be integrated with our method to further improve their recognition performance, especially when only a limited number of samples is available.

Fig. 6 2-D PCA projection of two randomly selected actions from the UT-Tower dataset

4 Conclusion

In this paper, we have proposed a new scheme for modeling human actions by virtue of SACA when only a limited number of training samples is available. Rather than directly using the original training set for action modeling, we derived structural average sequences by learning from each pair of video samples in every action class and then combined them with the original video samples to form a new training set. Extensive experiments and methodological analysis on the new training set demonstrate the advantages of the proposed method. In addition, the proposed method can potentially be integrated with other approaches to further improve their recognition performance.