
1 Introduction

Human activities fall into several categories: gestures (elementary body movements executed over a short time), actions (multiple temporally organized gestures performed by a single person, such as walking, running, bending or waving), interactions (human-human or human-object activities such as carrying, abandoning or stealing an object) and group activities (i.e. activities performed by groups of multiple subjects) [1]. If silhouettes are used for action classification, human movement can be represented as a continuous change of pose, and silhouettes extracted from consecutive video frames can be used to derive action descriptors for traditional classification approaches [2]. The approach presented in this paper addresses the problem of recognising the action of a single person using the information contained in a sequence of binary silhouettes. Each silhouette is first processed individually using a particular shape description algorithm. The resulting shape representations are then compared within a sequence to obtain a sequence representation, which is further processed into a final representation. These final representations are subjected to classification based on a template matching approach.

The variety of surveillance system applications is reflected in the diversity of activity recognition approaches. In [1] methods for activity recognition were classified into hierarchical (statistical, syntactic and description-based) and non-hierarchical (spatio-temporal and sequential). Hierarchical methodologies enable the recognition of complicated and complex human activities, including interactions and group activities. In turn, non-hierarchical solutions aim to recognize short, primitive actions and repetitive activities: the recognition process is based on analysing unknown sequences with an algorithm that matches the data to predefined activity classes. One of the common spatio-temporal techniques using accumulated silhouettes was proposed in [3] and is based on the motion energy image (MEI, showing where the movement occurs) and the motion history image (MHI, showing how the object is moving). MEI and MHI are used to create a static vector image (a temporal template); a scale- and shift-invariant descriptor is used for representation, and the Mahalanobis distance is used for matching. Another space-time approach was introduced in [4]. It utilizes the Poisson equation for feature extraction, and human actions are represented as three-dimensional shapes (silhouettes accumulated in the space-time volume). In [5] an action sequence is represented by a History Trace Template composed of the set of Trace Transforms extracted for each silhouette. In [6] an action is represented by a set of SAX (Symbolic Aggregate approXimation) vectors based on one-dimensional representations of each silhouette. Some other action recognition approaches utilize only characteristic frames: an action is recognized based on selected key poses, e.g. [7,8,9].

The rest of the paper is organized as follows: Sect. 2 describes the developed approach, Sect. 3 presents algorithms used for shape representation, Sect. 4 presents experimental results on action classification, and Sect. 5 concludes the paper.

2 Proposed Approach

The developed approach is composed of several steps, starting from representing the shape information of each silhouette in the dataset and ending with the preparation of the final representation of each sequence, which is then used for action classification. In our approach it is assumed that one video sequence represents one action and corresponds to a set of binary images: one image contains one silhouette and can be understood as a foreground mask.

In step 1 each silhouette is represented by one shape descriptor using information about its contour or region. In many cases, a single shape description algorithm can produce representations of different sizes, e.g. by calculating various orders of moments for Zernike Moments or by taking various subparts of a Fourier Transform-based descriptor. Thanks to this we can investigate many shape representations and select the smallest one that simultaneously carries the most information. If needed, the resultant shape representation is transformed into a vector.
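The paper does not tie step 1 to a specific implementation, so the following is only a minimal Python sketch, assuming the mahotas library for Zernike Moments; the radius choice and the chosen order are illustrative assumptions:

```python
import numpy as np
import mahotas

def silhouette_descriptor(mask: np.ndarray, degree: int = 13) -> np.ndarray:
    """Step 1: represent one binary silhouette by a vector of
    Zernike Moment magnitudes; higher degrees yield longer vectors,
    so the descriptor size can be tuned as described above."""
    # Circumscribe the whole frame with the unit disk (an assumption;
    # the paper does not specify how the radius is chosen).
    radius = max(mask.shape) // 2
    return mahotas.features.zernike_moments(mask, radius, degree=degree)
```

For a fixed order this yields a fixed-length vector per frame, so stacking the per-frame vectors row-wise gives the input expected by step 2.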

Step 2, for a single sequence, consists of calculating the dissimilarity between the first frame and each of the remaining frames using the Euclidean distance. The resulting vector of distance values (normalized to the interval [0, 1]) is a one-dimensional descriptor of the sequence (a distance vector). The number of its elements equals the number of silhouettes in the input sequence. This vector can be plotted and analysed visually in terms of similarities between actions and their characteristics.
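A minimal sketch of step 2, assuming the per-frame descriptors of one sequence are stacked into a single array (one row per frame):

```python
import numpy as np

def distance_vector(descriptors: np.ndarray) -> np.ndarray:
    """Step 2: Euclidean distance of every frame's descriptor to the
    first frame's descriptor, rescaled to [0, 1]; the output length
    equals the number of silhouettes in the sequence."""
    d = np.linalg.norm(descriptors - descriptors[0], axis=1)
    return d / d.max() if d.max() > 0 else d
```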

Step 3 aims to convert the distance vectors into a form and size that enable the calculation of similarity between them. The distance vectors were therefore treated as signals, and it turned out that the best way to transform such a signal was to use the magnitude of the fast Fourier Transform and a periodogram. A periodogram is an estimate of the spectral density of a signal and can reveal hidden periodicities in data [10]. Moreover, the periodogram helped to equalize the size of the final representations.
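The text does not spell out how the FFT magnitude and the periodogram are combined, so the sketch below shows one plausible reading that concatenates both into a fixed-size vector; the nfft value is an assumption (it must be at least the length of the longest sequence, here 125 frames):

```python
import numpy as np
from scipy.signal import periodogram

def sequence_representation(dist_vec: np.ndarray, nfft: int = 128) -> np.ndarray:
    """Step 3: fixed-length spectral descriptor of a distance vector.

    Zero-padding the FFT to `nfft` bins equalises the representation
    size across sequences of different lengths."""
    mag = np.abs(np.fft.fft(dist_vec, n=nfft))   # FFT magnitude
    _, psd = periodogram(dist_vec, nfft=nfft)    # spectral density estimate
    return np.concatenate([mag[: nfft // 2], psd])
```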

Step 4 includes the process of action classification based on the sequence descriptors, using a template matching approach and the correlation coefficient. Here template matching is understood as a process that compares each test object with all templates and indicates the most similar one, which corresponds to the probable class of the particular test object. This is a traditional classification solution when only one template set is used. However, some initial tests showed that the final results depend on the choice of templates. Therefore, we decided to perform the experiment several times using the k-fold cross-validation technique [11] with a different set of templates in each iteration; the final recognition effectiveness is then the average over all iterations. For instance, in the first iteration objects with numbers from 1 to k are used as templates and objects with numbers from \(k+1\) to n as test objects; in the second iteration objects with numbers from \(k+1\) to \(2k\) are used as templates and the remaining objects for testing, and so on. The results can then be interpreted and analysed in three different ways (considering only correct classifications, i.e. ‘true positives’); a sketch of the matching step is given after the list:

  1. Recognition effectiveness for each shape descriptor, averaged over all classes and all iterations;

  2. Recognition effectiveness for each iteration and each shape descriptor, averaged over all classes;

  3. Classification accuracy for each class and each shape descriptor, averaged over all iterations (or for one selected iteration only).
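A minimal sketch of the matching step, assuming each sequence has already been reduced to its final representation; the Pearson correlation coefficient selects the most similar template:

```python
import numpy as np

def classify(test_repr: np.ndarray, templates: np.ndarray,
             template_labels: list) -> object:
    """Step 4: template matching with the correlation coefficient.
    The test object receives the label of the template whose final
    representation correlates most strongly with its own."""
    corrs = [np.corrcoef(test_repr, t)[0, 1] for t in templates]
    return template_labels[int(np.argmax(corrs))]

def iteration_accuracy(reprs: np.ndarray, labels: list, i: int, k: int) -> float:
    """One iteration of the rotation scheme described above: objects
    i*k .. (i+1)*k - 1 serve as templates, all others as test objects."""
    idx = range(i * k, (i + 1) * k)
    templates = reprs[list(idx)]
    t_labels = [labels[j] for j in idx]
    tests = [j for j in range(len(reprs)) if j not in idx]
    hits = sum(classify(reprs[j], templates, t_labels) == labels[j]
               for j in tests)
    return hits / len(tests)
```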

3 Selected Shape Descriptors

3.1 Zernike Moments

The Zernike Moments are orthogonal moments which can be derived using Zernike orthogonal polynomials and the following formula [12]:

$$\begin{aligned} V_{nm} \left( x,y\right) =V_{nm}\left( r\cos \theta ,r\sin \theta \right) =R_{nm}(r)\exp \left( jm\theta \right) , \end{aligned}$$
(1)

where \(R_{nm}(r)\) is the orthogonal radial polynomial [12]:

$$\begin{aligned} R_{nm}(r)=\sum \limits _{s=0}^{(n-|m|)/2} \left( -1\right) ^{s} \frac{\left( n-s \right) !}{s!\, \left( \frac{n-2s+|m|}{2}\right) ! \left( \frac{n-2s-|m|}{2} \right) !} \, r^{n-2s} , \end{aligned}$$
(2)

where \(n=0,1,2,\ldots \); \(0 \le |m| \le n\); \(n-|m|\) is even.

The Zernike polynomials are a complete set of functions orthogonal over the unit disk \(x^{2}+y^{2}<1\). The Zernike Moments are rotation invariant and resistant to noise and minor variations in shape. The Zernike Moments of order n and repetition m of a region shape f(x, y) are derived using the following formula [12]:

$$\begin{aligned} Z_{nm}= \frac{n+1}{\pi } \sum \limits _{r} \sum \limits _{\theta } f(r\cos \theta ,r\sin \theta ) \cdot R_{nm}(r) \cdot \exp (-jm\theta ), \; r \le 1. \end{aligned}$$
(3)
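For reference, a direct (unoptimised) Python transcription of Eqs. (1)-(3); the mapping of pixel coordinates onto the unit disk is an implementation assumption:

```python
import numpy as np
from math import factorial

def radial_poly(n: int, m: int, r: np.ndarray) -> np.ndarray:
    """Orthogonal radial polynomial R_nm(r) of Eq. (2); n - |m| must
    be even, so the factorial arguments below are integers."""
    m = abs(m)
    R = np.zeros_like(r)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s)
                * factorial((n - 2 * s + m) // 2)
                * factorial((n - 2 * s - m) // 2)))
        R += c * r ** (n - 2 * s)
    return R

def zernike_moment(f: np.ndarray, n: int, m: int) -> complex:
    """Discrete Zernike Moment Z_nm of Eq. (3) for a binary mask f."""
    h, w = f.shape
    y, x = np.mgrid[:h, :w].astype(float)
    x = (2 * x - (w - 1)) / (w - 1)    # map columns to [-1, 1]
    y = (2 * y - (h - 1)) / (h - 1)    # map rows to [-1, 1]
    r, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = r <= 1                    # keep only pixels on the unit disk
    V = radial_poly(n, m, r[inside]) * np.exp(-1j * m * theta[inside])
    return (n + 1) / np.pi * np.sum(f[inside] * V)
```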

3.2 Moment Invariants

Moment Invariants can be applied to grayscale images or binary objects (both regions and contours) and are described below based on [13,14,15]. First, the general geometric moments are calculated using the following formula (discrete version):

$$\begin{aligned} m_{pq} = \sum \limits _{x}\sum \limits _{y} x^p y^q f(x,y). \end{aligned}$$
(4)

The value of the function f(x, y) equals 1 if a pixel belongs to the object (silhouette) and 0 for the background. Then, to make the representation invariant to translation, the centroid is calculated:

$$\begin{aligned} x_c=\frac{m_{10}}{m_{00}}, \,\,\,\, y_c=\frac{m_{01}}{m_{00}}. \end{aligned}$$
(5)

In the next step, Central Moments are calculated using the centroid:

$$\begin{aligned} \mu _{pq} = \sum \limits _x \sum \limits _y (x-x_c)^p (y-y_c)^q f(x,y). \end{aligned}$$
(6)

Then, invariance to scaling is obtained with the central normalised moments:

$$\begin{aligned} \eta _{pq}=\frac{\mu _{pq}}{\mu _{00}^{\frac{p+q+2}{2}}}. \end{aligned}$$
(7)

Ultimately, the Moment Invariants are derived; in pattern recognition applications usually the first seven values (the Hu invariants) are used:

$$\begin{aligned} \phi _1&=\eta _{20}+\eta _{02}, \\ \phi _2&=\left( \eta _{20}-\eta _{02}\right) ^{2}+4\eta _{11}^{2}, \\ \phi _3&=\left( \eta _{30}-3\eta _{12}\right) ^{2}+\left( 3\eta _{21}-\eta _{03}\right) ^{2}, \\ \phi _4&=\left( \eta _{30}+\eta _{12}\right) ^{2}+\left( \eta _{21}+\eta _{03}\right) ^{2}, \\ \phi _5&=\left( \eta _{30}-3\eta _{12}\right) \left( \eta _{30}+\eta _{12}\right) \left[ \left( \eta _{30}+\eta _{12}\right) ^{2}-3\left( \eta _{21}+\eta _{03}\right) ^{2}\right] \\ &\quad +\left( 3\eta _{21}-\eta _{03}\right) \left( \eta _{21}+\eta _{03}\right) \left[ 3\left( \eta _{30}+\eta _{12}\right) ^{2}-\left( \eta _{21}+\eta _{03}\right) ^{2}\right] , \\ \phi _6&=\left( \eta _{20}-\eta _{02}\right) \left[ \left( \eta _{30}+\eta _{12}\right) ^{2}-\left( \eta _{21}+\eta _{03}\right) ^{2}\right] +4\eta _{11}\left( \eta _{30}+\eta _{12}\right) \left( \eta _{21}+\eta _{03}\right) , \\ \phi _7&=\left( 3\eta _{21}-\eta _{03}\right) \left( \eta _{30}+\eta _{12}\right) \left[ \left( \eta _{30}+\eta _{12}\right) ^{2}-3\left( \eta _{21}+\eta _{03}\right) ^{2}\right] \\ &\quad -\left( \eta _{30}-3\eta _{12}\right) \left( \eta _{21}+\eta _{03}\right) \left[ 3\left( \eta _{30}+\eta _{12}\right) ^{2}-\left( \eta _{21}+\eta _{03}\right) ^{2}\right] . \end{aligned}$$
(8)
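In practice the seven invariants can be obtained directly from a binary mask; a short sketch assuming OpenCV, whose HuMoments routine implements Eq. (8) on top of Eqs. (4)-(7):

```python
import cv2
import numpy as np

def hu_invariants(mask: np.ndarray) -> np.ndarray:
    """The seven Moment Invariants of Eq. (8) for a binary silhouette;
    cv2.moments computes m_pq, mu_pq and eta_pq internally."""
    m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
    return cv2.HuMoments(m).ravel()  # shape (7,)
```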

3.3 Contour Sequence Moments

Another moment-based shape descriptor is Contour Sequence Moments (based on the shape contour only). The method is described below based on [16]. In the first step, the contour is represented as an ordered sequence z(i) whose elements are the Euclidean distances from the centroid to the N points of the shape contour. The one-dimensional contour sequence moments are calculated using the following formulas:

$$\begin{aligned} m_r=\frac{1}{N} \sum \limits _{i=1}^{N} [z(i)]^r, \end{aligned}$$
(9)
$$\begin{aligned} \mu _r=\frac{1}{N} \sum \limits _{i=1}^{N} [z(i)-m_1]^r. \end{aligned}$$
(10)

The r-th normalised contour sequence moment and the r-th normalised central contour sequence moment are:

$$\begin{aligned} \bar{m_r}=\frac{m_r}{(\mu _2)^{r/2}}, \,\,\,\, \bar{\mu _r}=\frac{\mu _r}{(\mu _2)^{r/2}}. \end{aligned}$$
(11)

The final shape description consists of four values:

$$\begin{aligned} F_1=\frac{(\mu _2)^{1/2}}{m_1}, \,\,\,\, F_2=\frac{\mu _3}{(\mu _2)^{3/2}}, \,\,\,\, F_3=\frac{\mu _4}{(\mu _2)^2}, \,\,\,\, F_4=\bar{\mu _5}. \end{aligned}$$
(12)
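A sketch of the Contour Sequence Moments computation, assuming the contour is available as an ordered N x 2 array of (x, y) points (e.g. from a contour tracing routine); using the mean of the contour points as the centroid is an approximation:

```python
import numpy as np

def contour_sequence_features(contour: np.ndarray) -> np.ndarray:
    """F1..F4 of Eq. (12) from an ordered contour point sequence."""
    centroid = contour.mean(axis=0)
    z = np.linalg.norm(contour - centroid, axis=1)  # sequence z(i)
    m1 = z.mean()                                   # first moment m_1, Eq. (9)
    mu = lambda r: np.mean((z - m1) ** r)           # central moments, Eq. (10)
    mu2, mu3, mu4, mu5 = mu(2), mu(3), mu(4), mu(5)
    F1 = np.sqrt(mu2) / m1
    F2 = mu3 / mu2 ** 1.5
    F3 = mu4 / mu2 ** 2
    F4 = mu5 / mu2 ** 2.5                           # normalised central moment of order 5
    return np.array([F1, F2, F3, F4])
```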

4 Experimental Conditions and Results

4.1 Data and Conditions

Several experiments were carried out to verify the effectiveness and accuracy of the proposed approach with moment-based shape descriptors. Each experiment consisted of the four steps described in Sect. 2, with a different shape description algorithm used in each experiment: Moment Invariants, Contour Sequence Moments or Zernike Moments (orders from 1st to 15th). The experiments were performed on a part of the Weizmann dataset [17]. The original Weizmann dataset contains 90 low-resolution (180 \(\times \) 144, 50 fps) video sequences of 9 actors performing 10 actions. The corresponding binary masks, extracted using background subtraction, are available and were used as input data. We selected five types of actions for the experiments: run, walk, bend, jump and one-hand wave (see Fig. 1 for exemplary frames and Fig. 2 for exemplary silhouettes). Our experimental database therefore consisted of 45 silhouette sequences of 9 actors performing 5 actions. The number of frames (silhouettes) in a sequence varied from 28 to 125. During the experiment each subgroup of 5 action sequences was iteratively used as the template set. Experimental results are presented and described in the next subsection.

Fig. 1. Exemplary video frames from the Weizmann dataset [17]: images in rows correspond to (from the top) bending, jumping, running, walking and waving actions, respectively.

Fig. 2. Exemplary silhouettes extracted from the video sequences presented in Fig. 1: silhouettes in rows correspond to (from the top) bending, jumping, running, walking and waving actions (images come from the Weizmann dataset [17]).

4.2 Results

In this subsection initial experimental results are provided. The goal of the first experiment was to identify the best variant for further work. The results below are the average recognition effectiveness values for each shape descriptor, over all classes and all iterations:

  • 38.05% for Moment Invariants;

  • 40.55% for Contour Sequence Moments;

  • 32.5%, 37.22%, 29.44%, 33.61%, 30.55%, 34.16%, 32.22%, 40.55%, 43.05%, 45.83%, 47.5%, 46.66%, 50.27%, 47.77% and 48.88% for Zernike Moments of orders from 1st to 15th, respectively.

For Zernike Moments the 13th order gives the best result (50.27%), so only this order is analysed further. Table 1 contains the results of the second experiment: the percentage recognition effectiveness obtained in each iteration, for each shape descriptor, averaged over all classes. The classification accuracy varies between iterations, and the best result is obtained in iteration no. 7. This can be interpreted to mean that the templates used in this iteration are represented by the most distinctive features, enabling proper class indication.

Table 1. Recognition effectiveness for each iteration, each shape descriptor and averaged for all classes

In Table 2 the results of the third experiment are provided: the classification accuracy values averaged over all iterations. It can be clearly seen that the ‘bend’ action is the most recognizable one, while the ‘jump’ action is the least distinctive. Based on Table 2, Zernike Moments can be selected as the best shape descriptor, except for classification into the ‘wave’ class, which was more effective when the Contour Sequence Moments descriptor was used. The same dependencies can be observed in Table 3, where the classification accuracy values for iteration no. 7 are given.

Table 2. Classification accuracy averaged for all iterations
Table 3. Classification accuracy for iteration no. 7

There is another interesting element of the approach: the normalized distance vectors can be plotted and compared visually. Figure 3 contains five plots of distance vectors corresponding to the five silhouette sequences whose exemplary frames are shown in Fig. 2. The differences between the plots are clearly visible, which reflects the variability of silhouettes within a sequence and reveals periodicities in the actions. Low peaks correspond to silhouettes that are most similar (in terms of Euclidean distance) to the first silhouette in the sequence. Some distinctive features of the plots could be further exploited to improve the classification results; for instance, the faster the action, the more densely arranged the low and high peaks.

Fig. 3. Exemplary plots of distance vectors for five actions performed by the same actor: bending, jumping, running, walking and waving, respectively. The distance vectors were obtained using Zernike Moments of the 13th order.

5 Summary and Conclusions

In this paper, an approach to action recognition based on silhouette sequences has been presented. It uses various shape description algorithms to represent the silhouettes and the Euclidean distance to estimate the dissimilarity between the first frame and each of the remaining frames. The normalized distance vectors are further processed using the fast Fourier Transform and a periodogram in order to obtain the final sequence representations. These representations are compared using a template matching approach and the correlation coefficient. The best experimental results in terms of classification accuracy were obtained using Zernike Moments of the 13th order.

Overall, the initial results are promising, although the proposed approach requires improvements and should be examined on more data. To make the approach more effective, future work includes the experimental verification of other shape representation algorithms and matching measures. Moreover, the problems that cause lower classification accuracy should be identified and solved: the use of an additional processing step, another shape feature or a different classification process should be investigated. Also, features other than shape descriptors may be used, such as the centroid of an object, which may help to distinguish, e.g., jumping from running. Any modifications should be verified for their influence on the final effectiveness of the approach and on its time consumption.