
1 Introduction

Human activities fall into several categories: gestures (elementary body movements executed over a short time), actions (multiple temporally organized gestures performed by a single person, such as walking, running, bending or waving), interactions (human-human or human-object activities such as carrying, abandoning or stealing an object) and group activities (i.e. activities performed by groups of multiple subjects) [1]. If silhouettes are used for action classification, human movement can be represented as a continuous change of pose, and silhouettes extracted from consecutive video frames can be used to derive action descriptors for traditional classification approaches [2]. The approach presented in this paper addresses the problem of recognising the action of a single person using the information contained in a sequence of binary silhouettes. Each silhouette is first processed individually using a particular shape description algorithm. The resulting shape representations are then compared within a sequence to obtain a sequence representation, which is further processed into a final representation. These final representations are subjected to classification based on a template matching approach.

The variety of surveillance system applications is reflected in the diversity of activity recognition approaches. In [1] methods for activity recognition were classified into hierarchical (statistical, syntactic and description-based) and non-hierarchical (spatio-temporal and sequential). Hierarchical methodologies enable the recognition of complicated and complex human activities, including interactions and group activities. In turn, non-hierarchical solutions aim to recognize short, primitive actions and repetitive activities: the recognition process is based on analysing unknown sequences with an algorithm that matches the data to predefined activity classes. One of the common spatio-temporal techniques using accumulated silhouettes was proposed in [3] and is based on the motion energy image (MEI, showing where the movement occurs) and the motion history image (MHI, showing how the object is moving). MEI and MHI are used to create a static vector image (a temporal template); a scale- and shift-invariant descriptor is used for representation, and the Mahalanobis distance is used for matching. Another space-time approach was introduced in [4]. It utilizes the Poisson equation for feature extraction, and human actions are represented as three-dimensional shapes (silhouettes accumulated in the space-time volume). In [5] an action sequence is represented by a History Trace Template composed of the set of Trace Transforms extracted for each silhouette. In [6] an action is represented by a set of SAX (Symbolic Aggregate approXimation) vectors based on one-dimensional representations of each silhouette. Some other action recognition approaches utilize only characteristic frames: an action is recognized based on selected key poses, e.g. [7,8,9].

The rest of the paper is organized as follows: Sect. 2 describes the developed approach, Sect. 3 presents algorithms used for shape representation, Sect. 4 presents experimental results on action classification, and Sect. 5 concludes the paper.

2 Proposed Approach

The developed approach is composed of several steps, starting from representing the shape information of each silhouette in the dataset and ending with the preparation of the final representation of each sequence, which is then used for action classification. In our approach it is assumed that one video sequence represents one action and corresponds to a set of binary images: one image contains one silhouette and can be understood as a foreground mask.

In step 1 each silhouette is represented by one shape descriptor using information about its contour or region. In many cases, a single shape description algorithm can produce representations of different sizes, e.g. by calculating various orders of moments for Zernike Moments or by taking various subparts of a Fourier Transform-based descriptor. Thanks to this we can investigate many shape representations and select the smallest one that simultaneously carries the most information. If needed, the resultant shape representation is transformed into a vector.
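The paper does not tie step 1 to a specific implementation, so the following is only a minimal Python sketch, assuming the mahotas library for Zernike Moments; the radius choice and the chosen order are illustrative assumptions:

```python
import numpy as np
import mahotas

def silhouette_descriptor(mask: np.ndarray, degree: int = 13) -> np.ndarray:
    """Step 1: represent one binary silhouette by a vector of
    Zernike Moment magnitudes; higher degrees yield longer vectors,
    so the descriptor size can be tuned as described above."""
    # Circumscribe the whole frame with the unit disk (an assumption;
    # the paper does not specify how the radius is chosen).
    radius = max(mask.shape) // 2
    return mahotas.features.zernike_moments(mask, radius, degree=degree)
```

For a fixed order this yields a fixed-length vector per frame, so stacking the per-frame vectors row-wise gives the input expected by step 2.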

Step 2, for a single sequence, consists of calculating the dissimilarity between the first frame and each of the remaining frames using the Euclidean distance. The resulting vector of distance values (normalized to the interval [0, 1]) is a one-dimensional descriptor of the sequence (a distance vector). The number of its elements equals the number of silhouettes in the input sequence. This vector can be plotted and analysed visually in terms of similarities between actions and their characteristics.
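A minimal sketch of step 2, assuming the per-frame descriptors of one sequence are stacked into a single array (one row per frame):

```python
import numpy as np

def distance_vector(descriptors: np.ndarray) -> np.ndarray:
    """Step 2: Euclidean distance of every frame's descriptor to the
    first frame's descriptor, rescaled to [0, 1]; the output length
    equals the number of silhouettes in the sequence."""
    d = np.linalg.norm(descriptors - descriptors[0], axis=1)
    return d / d.max() if d.max() > 0 else d
```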

Step 3 aims to convert the distance vectors into a form and size that enable the calculation of similarity between them. The distance vectors were therefore treated as signals, and it turned out that the best way to transform such a signal was to use the magnitude of the fast Fourier Transform and a periodogram. A periodogram is an estimate of the spectral density of a signal and can reveal hidden periodicities in data [10]. Moreover, the periodogram helped to equalize the size of the final representations.
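The text does not spell out how the FFT magnitude and the periodogram are combined, so the sketch below shows one plausible reading that concatenates both into a fixed-size vector; the nfft value is an assumption (it must be at least the length of the longest sequence, here 125 frames):

```python
import numpy as np
from scipy.signal import periodogram

def sequence_representation(dist_vec: np.ndarray, nfft: int = 128) -> np.ndarray:
    """Step 3: fixed-length spectral descriptor of a distance vector.

    Zero-padding the FFT to `nfft` bins equalises the representation
    size across sequences of different lengths."""
    mag = np.abs(np.fft.fft(dist_vec, n=nfft))   # FFT magnitude
    _, psd = periodogram(dist_vec, nfft=nfft)    # spectral density estimate
    return np.concatenate([mag[: nfft // 2], psd])
```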

Step 4 includes the process of action classification based on the sequence descriptors, using a template matching approach and the correlation coefficient. Here template matching is understood as a process that compares each test object with all templates and indicates the most similar one, which corresponds to the probable class of the particular test object. This is a traditional classification solution when only one template set is used. However, some initial tests showed that the final results depend on the choice of templates. Therefore, we decided to perform the experiment several times using the k-fold cross-validation technique [11] with a different set of templates in each iteration; the final recognition effectiveness is then the average over all iterations. For instance, in the first iteration objects with numbers from 1 to k are used as templates and objects with numbers from \(k+1\) to n as test objects; in the second iteration objects with numbers from \(k+1\) to \(2k\) are used as templates and the remaining objects for testing, and so on. The results can then be interpreted and analysed in three different ways (considering only correct classifications, i.e. ‘true positives’); a sketch of the matching step is given after the list:

  1. Recognition effectiveness for each shape descriptor, averaged over all classes and all iterations;

  2. Recognition effectiveness for each iteration and each shape descriptor, averaged over all classes;

  3. Classification accuracy for each class and each shape descriptor, averaged over all iterations (or for one selected iteration only).
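A minimal sketch of the matching step, assuming each sequence has already been reduced to its final representation; the Pearson correlation coefficient selects the most similar template:

```python
import numpy as np

def classify(test_repr: np.ndarray, templates: np.ndarray,
             template_labels: list) -> object:
    """Step 4: template matching with the correlation coefficient.
    The test object receives the label of the template whose final
    representation correlates most strongly with its own."""
    corrs = [np.corrcoef(test_repr, t)[0, 1] for t in templates]
    return template_labels[int(np.argmax(corrs))]

def iteration_accuracy(reprs: np.ndarray, labels: list, i: int, k: int) -> float:
    """One iteration of the rotation scheme described above: objects
    i*k .. (i+1)*k - 1 serve as templates, all others as test objects."""
    idx = range(i * k, (i + 1) * k)
    templates = reprs[list(idx)]
    t_labels = [labels[j] for j in idx]
    tests = [j for j in range(len(reprs)) if j not in idx]
    hits = sum(classify(reprs[j], templates, t_labels) == labels[j]
               for j in tests)
    return hits / len(tests)
```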

3 Selected Shape Descriptors

3.1 Zernike Moments

The Zernike Moments are orthogonal moments which can be derived using Zernike orthogonal polynomials and the following formula [12]:

$$\begin{aligned} V_{nm} \left( x,y\right) =V_{nm}\left( r\cos \theta ,r\sin \theta \right) =R_{nm}(r)\exp \left( jm\theta \right) , \end{aligned}$$
(1)

where \(R_{nm}(r)\) is the orthogonal radial polynomial [12]:

$$\begin{aligned} R_{nm}(r)=\sum \limits _{s=0}^{(n-|m|)/2} \left( -1\right) ^{s} \frac{\left( n-s \right) !}{s!\, \left( \frac{n-2s+|m|}{2}\right) ! \left( \frac{n-2s-|m|}{2} \right) !} \, r^{n-2s} , \end{aligned}$$
(2)

where \(n=0,1,2,\ldots \); \(0 \le |m| \le n\); \(n-|m|\) is even.

The Zernike polynomials are a complete set of functions orthogonal over the unit disk \(x^{2}+y^{2}<1\). The Zernike Moments are rotation invariant and resistant to noise and minor variations in shape. The Zernike Moments of order n and repetition m of a region shape f(x, y) are derived using the following formula [12]:

$$\begin{aligned} Z_{nm}= \frac{n+1}{\pi } \sum \limits _{r} \sum \limits _{\theta } f(r\cos \theta ,r\sin \theta ) \cdot R_{nm}(r) \cdot \exp (-jm\theta ), \; r \le 1. \end{aligned}$$
(3)
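For reference, a direct (unoptimised) Python transcription of Eqs. (1)-(3); the mapping of pixel coordinates onto the unit disk is an implementation assumption:

```python
import numpy as np
from math import factorial

def radial_poly(n: int, m: int, r: np.ndarray) -> np.ndarray:
    """Orthogonal radial polynomial R_nm(r) of Eq. (2); n - |m| must
    be even, so the factorial arguments below are integers."""
    m = abs(m)
    R = np.zeros_like(r)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s)
                * factorial((n - 2 * s + m) // 2)
                * factorial((n - 2 * s - m) // 2)))
        R += c * r ** (n - 2 * s)
    return R

def zernike_moment(f: np.ndarray, n: int, m: int) -> complex:
    """Discrete Zernike Moment Z_nm of Eq. (3) for a binary mask f."""
    h, w = f.shape
    y, x = np.mgrid[:h, :w].astype(float)
    x = (2 * x - (w - 1)) / (w - 1)    # map columns to [-1, 1]
    y = (2 * y - (h - 1)) / (h - 1)    # map rows to [-1, 1]
    r, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = r <= 1                    # keep only pixels on the unit disk
    V = radial_poly(n, m, r[inside]) * np.exp(-1j * m * theta[inside])
    return (n + 1) / np.pi * np.sum(f[inside] * V)
```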

3.2 Moment Invariants

Moment Invariants can be applied to grayscale images or binary objects (both regions and contours) and are described below based on [13,14,15]. First, the general geometric moments are calculated using the following formula (discrete version):

$$\begin{aligned} m_{pq} = \sum \limits _{x}\sum \limits _{y} x^p y^q f(x,y). \end{aligned}$$
(4)

The value of the function f(x, y) equals 1 if a pixel belongs to the object (silhouette) and 0 for the background. Then, to make the representation invariant to translation, the centroid is calculated:

$$\begin{aligned} x_c=\frac{m_{10}}{m_{00}}, \,\,\,\, y_c=\frac{m_{01}}{m_{00}}. \end{aligned}$$
(5)

In the next step, Central Moments are calculated using the centroid:

$$\begin{aligned} \mu _{pq} = \sum \limits _x \sum \limits _y (x-x_c)^p (y-y_c)^q f(x,y). \end{aligned}$$
(6)

Then, invariance to scaling is obtained with the central normalised moments:

$$\begin{aligned} \eta _{pq}=\frac{\mu _{pq}}{\mu _{00}^{\frac{p+q+2}{2}}}. \end{aligned}$$
(7)

Ultimately, the Moment Invariants are derived; in pattern recognition applications usually the first seven values (the Hu invariants) are used:

$$\begin{aligned} \phi _1&=\eta _{20}+\eta _{02}, \\ \phi _2&=\left( \eta _{20}-\eta _{02}\right) ^{2}+4\eta _{11}^{2}, \\ \phi _3&=\left( \eta _{30}-3\eta _{12}\right) ^{2}+\left( 3\eta _{21}-\eta _{03}\right) ^{2}, \\ \phi _4&=\left( \eta _{30}+\eta _{12}\right) ^{2}+\left( \eta _{21}+\eta _{03}\right) ^{2}, \\ \phi _5&=\left( \eta _{30}-3\eta _{12}\right) \left( \eta _{30}+\eta _{12}\right) \left[ \left( \eta _{30}+\eta _{12}\right) ^{2}-3\left( \eta _{21}+\eta _{03}\right) ^{2}\right] \\ &\quad +\left( 3\eta _{21}-\eta _{03}\right) \left( \eta _{21}+\eta _{03}\right) \left[ 3\left( \eta _{30}+\eta _{12}\right) ^{2}-\left( \eta _{21}+\eta _{03}\right) ^{2}\right] , \\ \phi _6&=\left( \eta _{20}-\eta _{02}\right) \left[ \left( \eta _{30}+\eta _{12}\right) ^{2}-\left( \eta _{21}+\eta _{03}\right) ^{2}\right] +4\eta _{11}\left( \eta _{30}+\eta _{12}\right) \left( \eta _{21}+\eta _{03}\right) , \\ \phi _7&=\left( 3\eta _{21}-\eta _{03}\right) \left( \eta _{30}+\eta _{12}\right) \left[ \left( \eta _{30}+\eta _{12}\right) ^{2}-3\left( \eta _{21}+\eta _{03}\right) ^{2}\right] \\ &\quad -\left( \eta _{30}-3\eta _{12}\right) \left( \eta _{21}+\eta _{03}\right) \left[ 3\left( \eta _{30}+\eta _{12}\right) ^{2}-\left( \eta _{21}+\eta _{03}\right) ^{2}\right] . \end{aligned}$$
(8)
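In practice the seven invariants can be obtained directly from a binary mask; a short sketch assuming OpenCV, whose HuMoments routine implements Eq. (8) on top of Eqs. (4)-(7):

```python
import cv2
import numpy as np

def hu_invariants(mask: np.ndarray) -> np.ndarray:
    """The seven Moment Invariants of Eq. (8) for a binary silhouette;
    cv2.moments computes m_pq, mu_pq and eta_pq internally."""
    m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
    return cv2.HuMoments(m).ravel()  # shape (7,)
```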

3.3 Contour Sequence Moments

Another moment-based shape descriptor is Contour Sequence Moments (based on the shape contour only). The method is described below based on [16]. In the first step, the contour is represented as an ordered sequence z(i) whose elements are the Euclidean distances from the centroid to the N points of the shape contour. The one-dimensional contour sequence moments are calculated using the following formulas:

$$\begin{aligned} m_r=\frac{1}{N} \sum \limits _{i=1}^{N} [z(i)]^r, \end{aligned}$$
(9)
$$\begin{aligned} \mu _r=\frac{1}{N} \sum \limits _{i=1}^{N} [z(i)-m_1]^r. \end{aligned}$$
(10)

The r-th normalised contour sequence moment and the r-th normalised central contour sequence moment are:

$$\begin{aligned} \bar{m_r}=\frac{m_r}{(\mu _2)^{r/2}}, \,\,\,\, \bar{\mu _r}=\frac{\mu _r}{(\mu _2)^{r/2}}. \end{aligned}$$
(11)

The final shape description consists of four values:

$$\begin{aligned} F_1=\frac{(\mu _2)^{1/2}}{m_1}, \,\,\,\, F_2=\frac{\mu _3}{(\mu _2)^{3/2}}, \,\,\,\, F_3=\frac{\mu _4}{(\mu _2)^2}, \,\,\,\, F_4=\bar{\mu _5}. \end{aligned}$$
(12)
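A sketch of the Contour Sequence Moments computation, assuming the contour is available as an ordered N x 2 array of (x, y) points (e.g. from a contour tracing routine); using the mean of the contour points as the centroid is an approximation:

```python
import numpy as np

def contour_sequence_features(contour: np.ndarray) -> np.ndarray:
    """F1..F4 of Eq. (12) from an ordered contour point sequence."""
    centroid = contour.mean(axis=0)
    z = np.linalg.norm(contour - centroid, axis=1)  # sequence z(i)
    m1 = z.mean()                                   # first moment m_1, Eq. (9)
    mu = lambda r: np.mean((z - m1) ** r)           # central moments, Eq. (10)
    mu2, mu3, mu4, mu5 = mu(2), mu(3), mu(4), mu(5)
    F1 = np.sqrt(mu2) / m1
    F2 = mu3 / mu2 ** 1.5
    F3 = mu4 / mu2 ** 2
    F4 = mu5 / mu2 ** 2.5                           # normalised central moment of order 5
    return np.array([F1, F2, F3, F4])
```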

4 Experimental Conditions and Results

4.1 Data and Conditions

Several experiments were carried out to verify the effectiveness and accuracy of the proposed approach with moment-based shape descriptors. Each experiment consisted of the four steps described in Sect. 2, with a different shape description algorithm used in each experiment: Moment Invariants, Contour Sequence Moments or Zernike Moments (orders from 1st to 15th). The experiments were performed on a part of the Weizmann dataset [17]. The original Weizmann dataset contains 90 low-resolution (180 \(\times \) 144, 50 fps) video sequences of 9 actors performing 10 actions. The corresponding binary masks, extracted using background subtraction, are available and were used as input data. We selected five types of actions for the experiments: run, walk, bend, jump and one-hand wave (see Fig. 1 for exemplary frames and Fig. 2 for exemplary silhouettes). Our experimental database therefore consisted of 45 silhouette sequences of 9 actors performing 5 actions. The number of frames (silhouettes) in a sequence varied from 28 to 125. During the experiment each subgroup of 5 action sequences was iteratively used as the template set. Experimental results are presented and described in the next subsection.

Fig. 1. Exemplary video frames from the Weizmann dataset [17]: images in rows correspond to (from the top) bending, jumping, running, walking and waving actions, respectively.

Fig. 2. Exemplary silhouettes extracted from the video sequences presented in Fig. 1: silhouettes in rows correspond to (from the top) bending, jumping, running, walking and waving actions (images come from the Weizmann dataset [17]).

4.2 Results

In this subsection initial experimental results are provided. The goal of the first experiment was to identify the best variant for further work. The results below are the average recognition effectiveness values for each shape descriptor, over all classes and all iterations:

  • 38.05% for Moment Invariants;

  • 40.55% for Contour Sequence Moments;

  • 32.5%, 37.22%, 29.44%, 33.61%, 30.55%, 34.16%, 32.22%, 40.55%, 43.05%, 45.83%, 47.5%, 46.66%, 50.27%, 47.77% and 48.88% for Zernike Moments of orders from 1st to 15th, respectively.

For Zernike Moments the 13th order gives the best result (50.27%), so only this order is analysed further. Table 1 contains the results of the second experiment: the percentage recognition effectiveness obtained in each iteration, for each shape descriptor, averaged over all classes. The classification accuracy varies between iterations, and the best result is obtained in iteration no. 7. This can be interpreted to mean that the templates used in this iteration are represented by the most distinctive features, enabling proper class indication.

Table 1. Recognition effectiveness for each iteration, each shape descriptor and averaged for all classes

In Table 2 the results of the third experiment are provided: the classification accuracy values averaged over all iterations. It can be clearly seen that the ‘bend’ action is the most recognizable one, while the ‘jump’ action is the least distinctive. Based on Table 2, Zernike Moments can be selected as the best shape descriptor, except for classification into the ‘wave’ class, which was more effective when the Contour Sequence Moments descriptor was used. The same dependencies can be observed in Table 3, where the classification accuracy values for iteration no. 7 are given.

Table 2. Classification accuracy averaged for all iterations
Table 3. Classification accuracy for iteration no. 7

There is another interesting element of the approach: the normalized distance vectors can be plotted and compared visually. Figure 3 contains five plots of distance vectors corresponding to the five silhouette sequences whose exemplary frames are shown in Fig. 2. The differences between the plots are clearly visible, which reflects the variability of silhouettes within a sequence and reveals periodicities in the actions. Low peaks correspond to silhouettes that are most similar (in terms of Euclidean distance) to the first silhouette in the sequence. Some distinctive features of the plots could be further exploited to improve the classification results; for instance, the faster the action, the more densely arranged the low and high peaks.

Fig. 3. Exemplary plots of distance vectors for five actions performed by the same actor: bending, jumping, running, walking and waving, respectively. The distance vectors were obtained using Zernike Moments of the 13th order.

5 Summary and Conclusions

In this paper, an approach to action recognition based on silhouette sequences has been presented. It uses various shape description algorithms to represent the silhouettes and the Euclidean distance to estimate the dissimilarity between the first frame and each of the remaining frames. The normalized distance vectors are further processed using the fast Fourier Transform and a periodogram in order to obtain the final sequence representations. These representations are compared using a template matching approach and the correlation coefficient. The best experimental results in terms of classification accuracy were obtained using Zernike Moments of the 13th order.

Overall, the initial results are promising, although the proposed approach requires improvements and should be examined on more data. To make the approach more effective, future work includes the experimental verification of other shape representation algorithms and matching measures. Moreover, the problems that cause lower classification accuracy should be identified and solved: the use of an additional processing step, another shape feature or a different classification process should be investigated. Also, features other than shape descriptors may be used, such as the centroid of an object, which may help to distinguish, e.g., jumping from running. Any modifications should be verified for their influence on the final effectiveness of the approach and on its time consumption.