Keywords

1 Introduction

Due to the popularization of surveillance cameras and personal video devices, video-based human motion analysis and recognition have become a highly active area in computer vision. Human action recognition is difficult for many reasons, such as high dimension of video data, intra-class variability caused by scale, viewpoint and illumination changes, low resolution, and video quality. Human action recognition is the process of identifying human actions that occur in the video sequences. Its application is involved in surveillance footage, user interfaces, robotics, automatic video organization, patient monitoring systems, athletic performance analysis, etc. Even though this topic has been studied for several years, it has some problem in classifying the actions performed by human accurately.

2 Review of Related Work

Various surveys have been done in this area on recognition methodologies and approach-based and general overview-based taxonomies [1, 2]. Tracking motion of human and action recognition from the sequence of images is analyzed [3]. The extension of [3] in the development of motion analysis has been surveyed by Wang et al. [4]. This paper is organized in hierarchical manner which describes issues caused during detection, tracking, and behavior understanding within human motion analysis system. Kim et al. [5] proposed the scheme over an ordinal measure of accumulated motion in a single query action video without considering whole silhouette of human body as in the earlier [6].

A set of activities is described using motion template system in [7]. Various methods which employ MHI have been proposed in [8] and provide variants and its applications of this method. The combined prompt of MHI and MEI is proposed in [9] for representation of motion based on optical flow-based directional history and energy image templates. To extract the judicial information from given video sequences, Shao et al. [10] proposed a system using wavelet transform-based descriptors for representation and recognition of human action. A multiscale representation has been proposed in [11] by comparing the efficiency with Gabor filter and discrete wavelet transform (DWT) which applied prior to dimensionality reduction. It results Gabor has the highest efficiency in recognition.

The local binary pattern (LBP) has been used from past 10 years, and it is widely used for texture classification proposed in [1214].

3 Proposed Work

The proposed method consists of the following stages: accumulated motion image (AMI), motion history image (MHI), DWT, LBP, feature extraction, training, and classification. The overall procedure is shown in the flow diagram Fig. 1.

Fig. 1
figure 1

Flow diagram

3.1 Accumulated Motion Image (AMI)

In the proposed system to represent the spatiotemporal features of human actions, we define AMI and it is computed by using frame differences. AMI is computed using frame differences as in Eq. 1

$$ {\text{AMI}} = \frac{1}{T}\sum\limits_{t = 1}^{T} {\left| {{D}(x,y,t)} \right|} $$
(1)

where D(x, y, t) = I(x, y, t) − I(x, y, t − 1) and T denotes the total number of frames present in a single action video.

Figure 2 shows the AMI computed for a jacking video and one-hand wave action.

Fig. 2
figure 2

AMI for jack and one-hand wave action

3.2 Motion History Image (MHI)

MHI is extensively used in action recognition research areas. It provides motion shape information of a video to recognize actions.

To represent how the motion occurs in a video, form a motion history image (MHI) is computed by using a simple replacement and decay operator as in Eq. 2

$$ H_{\tau } \left( {x,y,t} \right) = \left\{ {\begin{array}{*{20}l}\uptau \hfill & {{\text{if}}\;D(x,y,t) = 1} \hfill \\ {\hbox{max} (0,H_{\tau } (x,y,t-1) - 1)} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(2)

where τ is the current time stamp and D is the absolute value of silhouette difference between frames t and t − 1. The result will be the scalar-valued image where brighter pixel shows the most recently occurred action.

Figure 3 shows the MHI computed for a jacking action and one-hand wave action video

Fig. 3
figure 3

MHI for jacking and one-hand wave action

3.3 Discrete Wavelet Transform (DWT)

DWT is used to extract the spatial features and for dimensionality reduction. In the proposed system, we use Haar wavelet filter for decomposition of the MHI image.

Figure 4 shows the output of the DWT.

Fig. 4
figure 4

DWT of MHI in different directions

3.4 Local Binary Pattern (LBP)

LBP is applied over AMI to extract the edge information. In texture analysis, the sampling is done on equally spaced sampling points P on a circular neighborhood with radius R. The value of the LBP code of the pixel is given in P-bit binary number as in Eq. 3

$$ {\text{LBP}}_{P,R} { = }\sum\limits_{P = 0}^{P} {S(g_{p} - g_{c} )2^{p} \quad s(x) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}}\;x \ge 0} \hfill \\ {0,} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.{\kern 1pt} } $$
(3)

where

g c :

gray value of center pixel (x c , y c , t c )

g p :

gray value at p plane sampling points

Figure 5 shows the output of the LBP.

Fig. 5
figure 5

LBP of AMI

3.5 Feature Extraction

In the present work, the moment invariants are evaluated using central moments of the image function I(x, y) up to third order. Regular moments are defined as in Eq. 4

$$ M_{pq} = \sum\limits_{x} {\sum\limits_{y} {x^{p} y^{q} } } $$
(4)

where for p, q = 0, 1, 2, … and is M pq the order moment (p + q)th of the continuous image function I(x, y). The central moments of order up to 3 are calculated using Eq. 4 which is invariant under translation.

3.6 Training and Classification

The system has to be trained with some classification algorithm which helps and finds the exact match of test actions. In the proposed system, K-nearest neighbor (KNN) classification method is applied. KNN algorithm is a part of supervised learning that has been used in many applications in the field of data mining, statistical pattern recognition, and many others.

KNN is a method for classifying objects based on closest training examples in the feature vector. An object is classified by a majority vote of its neighbors. K is always a positive integer. If K = 1, it is then called as nearest neighbor (NN). The neighbors are taken from a set of objects for which the correct classification is known.

3.7 Weizmann Dataset

The Weizmann dataset has been used in the proposed system which consists of large number of subjects and actions. It includes 81 low-resolution videos from nine different people, each performing 10 natural actions. A sample is shown in Fig. 6.

Fig. 6
figure 6

Some sample of 10 action classes in Weizmann dataset

4 Results and Discussion

The proposed system is worked out with Weizmann dataset consists of 8 actions of different persons. The actions including bend, jumping jack, jumping, walk, run, skip, gallop side, and wave.

Number of actions taken for testing = N

The total classification rate of the proposed system is calculated as follows

$$ C \, = \frac{{({\kern 1pt} N - {\text{Actions}}{\kern 1pt} {\kern 1pt} {\text{wrongly}}{\kern 1pt} {\kern 1pt} {\text{recognized}})}}{N} $$

The percentage of video giving correct output is 90 %, and the percentage of video giving wrong output is 10 %. The accuracy given above is obtained by using both accumulated motion image and MHI with LBP and DWT, respectively. It gives better result when both the techniques are concatenated rather than applied separately. This proposed method is able to recognize 9 out of 10 actions.

5 Conclusions

AMI evaluation helps us in handling the variations of appearance and clearly shows the moving pixels. MHI is defined, and its determination gives direction of motion. LBP gives shape and texture information of AMI and also helped to improve recognition accuracy. DWT aids in extracting spatial features of MHI. Then, the feature vectors of LBP and DWT obtained by Hu seven moments and distance metric calculated using KNN classifier and the actions are labeled accordingly. The proposed system gives the high performance when compared with previous proposals which uses KNN. The computation is simpler and less time consuming.