
1 Introduction

Human activity recognition from video has been an active area of research for more than a decade. Detecting humans in video streams is challenging due to variations in pose, appearance, clothing, background clutter and illumination, and camera movement makes it even more difficult. Potential applications include surveillance, assisted care for the elderly, monitoring of children in daycare, crowd monitoring, sports training, detection of abnormal activities and content-based video retrieval. Although image-based features have made considerable advances in recent years [1,2,3,4], they are not yet mature enough for many practical applications. On the other hand, most human actions are characterized by distinctive motion, so classification accuracy can potentially be improved by paying more attention to motion information. Many researchers in this area have assumed that the camera and the background scene are essentially static. This greatly simplifies the problem, because the mere presence of motion can then help identify the action class. Towards this end, Colque et al. [13] proposed a model for capturing anomalies in human activities using orientation, velocity and entropy, computing a histogram feature from optical flow. Viola et al. [5] point out that including motion features markedly increases the overall performance of their system.

A considerable amount of work has been done on activity detection from RGB videos based on pose estimation and motion components. Ni et al. [6] analyzed human actions by discovering the most discriminative groups of dense trajectories. Vrigkas et al. [7] clustered motion trajectories and used the resulting clusters to represent human actions.

Ma et al. [8] extracted video segments corresponding to partial or complete human motion and constructed a tree-based vocabulary of similar actions. Fernando et al. [9] exploited the temporal ordering in videos to enumerate human actions in chronological order, using a rank-learning framework to summarize the relevant information. Zhang et al. [10] modeled human actions with a Gaussian mixture model and used a transfer ranking approach to recognize unseen classes.

In this paper, we propose a set of features based on local estimation of significant motion in RGB videos. Many researchers split video frames into a grid and compute per-cell histograms to generate feature descriptors [12, 14, 15, 16]. To obtain features that are less sensitive to the relative positioning of the camera and the human in the scene, we instead divide the motion matrix into independent horizontal and vertical strips and use the histogram of each strip as part of the feature descriptor. We show that this helps to better discriminate between various human actions. We use the random forest classification technique as the machine learning tool, and present results on a publicly available dataset to illustrate the effectiveness of our method. The rest of the paper is organized as follows: Sect. 2 summarizes related work in the area, Sect. 3 describes the specific descriptors we propose to extract from RGB video, Sect. 4 presents the experimental results and a comparison with other state-of-the-art methods, and Sect. 5 concludes the work.

2 Related Work

Chun and Lee [14] estimate motion flow using dense optical flow, divide the estimates into grid cells, and compute a histogram for each cell as a feature descriptor. Zhang and Parker [15] proposed CoDe4D features using multi-channel orientation histograms for RGB-D data. Luo et al. [16] proposed to model motion dynamics with robust linear dynamical systems and histograms of oriented gradients (HOG). Cheng et al. [17] proposed a framework for activity awareness using surface electromyography and accelerometer (ACC) signals; they used a histogram of negative entropy to detect the starting and end points of an activity. Mukherjee et al. [18] proposed a graph-theoretic technique for recognizing human actions, computing descriptors from histograms of oriented optical flow with a bag-of-words approach. Zhou and Zhang [19] proposed to encode the movements of local parts involved in a human action, using a multiple-instance formulation to discover elementary actions with stable states. Dogan et al. [20] proposed 3D volume motion templates (VMTs); to make the method view-independent, they rotate the templates to a canonical orientation. Colque et al. [13] proposed Histograms of Optical Flow Orientation and Magnitude (HOFM) to detect anomalous events in videos. Tripathi et al. [12] used histograms of oriented gradients (HOG) for detecting abnormal activity in ATM cabins.

In this paper we divide the video volume into row volumes and column volumes separately and compute the corresponding intensity histograms, which reduces the sensitivity of the feature vector to the relative position of camera and object. This is described in the next section.

3 Proposed Method

RGB color frames are extracted from the action video clip and converted to gray scale, as depicted in Fig. 1a, b. The frames are grouped into small bundles of B frames each. To capture significant motion information, the difference between consecutive frames is computed at each pixel. For each bundle, the maximum absolute difference at each pixel is stored in a matrix P, hereafter called the motion projection matrix. The gray values are appropriately scaled to depict the range of motion within the bundle: regions with no motion appear completely black, while areas with significant motion appear bright. For an M × N frame, the difference matrix is computed from pairs of consecutive frames as shown below:

Fig. 1.

(a) RGB video frames for the pour activity, (b) gray-scaled video frames with bundling and (c) motion projection matrix for one bundle of frames

Fig. 2.

Splitting of P for feature creation and matching (a) grid splitting and matching of features, (b) column splitting and matching of features

$$ d_{i} \left( {m,n} \right) = \left| {f_{i + 1} \left( {m,n} \right) - f_{i} \left( {m,n} \right)} \right| $$
(1)

where m = 1, 2, …, M; n = 1, 2, …, N; and i takes on values 1, 2, …, B − 1.

Next, we consider the differences at each pixel across the bundle and select the maximum d i to create the Motion projection matrix P:

$$ P\left( {m,n} \right) = \max \left( {d_{1} \left( {m,n} \right), d_{2} \left( {m,n} \right), \ldots , d_{B - 1} \left( {m,n} \right)} \right) $$
(2)

Figure 1c depicts a typical Motion Projection Matrix.
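For concreteness, a minimal Python/NumPy sketch of Eqs. (1) and (2) is given below; the function name and the final gray-value rescaling are illustrative assumptions rather than part of the original specification.

```python
import numpy as np

def motion_projection_matrix(frames):
    """Sketch of Eqs. (1)-(2): P(m, n) for one bundle of B grayscale frames.

    `frames` is assumed to be a list of B equally sized M x N arrays.
    """
    stack = np.stack(frames).astype(np.float32)   # shape (B, M, N)
    diffs = np.abs(stack[1:] - stack[:-1])        # d_i(m, n), i = 1 .. B-1
    P = diffs.max(axis=0)                         # pixel-wise maximum difference
    # Illustrative rescaling so motion-free regions appear black and
    # strong-motion regions appear bright, as described in the text.
    if P.max() > 0:
        P = P * (255.0 / P.max())
    return P.astype(np.uint8)
```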

To capture region-wise movements of the video bundle, the motion projection matrix P is examined independently along the horizontal and vertical directions. First, P is segmented into R horizontal strips (rows) \( r_{1} , r_{2} , \ldots ,r_{R} \), where the height of each strip is chosen as 5 or 10 pixels. The intensity histogram of each horizontal strip is computed and binned into 15 bins. The histogram of the ith horizontal strip is denoted by \( H_{{r_{i} }} \), a vector of size 15. For the R strips, we thus obtain 15 * R feature values.

In a similar manner, the motion projection matrix P is segmented into C vertical strips (columns) \( c_{1} , c_{2} , \ldots ,c_{C} \), each of width 5 or 10 pixels. The histogram of each vertical strip is computed and binned into 15 groups, with \( H_{{c_{i} }} \) denoting the histogram of the ith vertical strip, as depicted in Fig. 3. Each histogram \( H_{{r_{i} }} \) or \( H_{{c_{i} }} \) is a vector of size 15, so this yields a further set of 15 * C feature values.

Fig. 3.

Histogram calculation for horizontal and vertical regions in motion projection matrix
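The strip-wise histogram computation can be sketched as follows; this is an illustrative implementation under the stated choices (strip width of 5 or 10 pixels, 15 bins), and the helper name is ours rather than the authors'.

```python
import numpy as np

def strip_histograms(P, strip_size=10, bins=15, axis=0):
    """Per-strip intensity histograms of the motion projection matrix P.

    axis=0 gives horizontal strips (rows r_1..r_R),
    axis=1 gives vertical strips (columns c_1..c_C).
    """
    hists = []
    for start in range(0, P.shape[axis], strip_size):
        if axis == 0:
            strip = P[start:start + strip_size, :]
        else:
            strip = P[:, start:start + strip_size]
        h, _ = np.histogram(strip, bins=bins, range=(0, 256))  # 15-bin histogram
        hists.append(h)
    return np.concatenate(hists)   # length 15 * R (axis=0) or 15 * C (axis=1)
```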

We could instead have divided P into an R × C grid and computed a histogram for each cell. However, cell-to-cell feature matching under grid splitting turns out to be highly sensitive to the relative position of camera and object: if an object performs the same activity ‘near to’ or ‘away from’ the camera, grid-based feature matching does not perform well, because the movements fall in one set of grid cells in the ‘near to camera’ case and in a different set in the ‘away from camera’ case. The same phenomenon occurs for left/right and up/down shifts and other combinations of relative position.

Row-by-row feature matching, in contrast, is less sensitive to horizontal relative positions while being more sensitive to vertical relative movements, and column-by-column feature matching is less sensitive to vertical relative positions while being more sensitive to horizontal relative movements of camera and object. The proposed approach also handles the ‘near to’ and ‘away from’ cases, as shown in Fig. 2a and b.

The proposed feature vector H is formed by concatenating horizontal and vertical histogram bins as shown below

$$ H = [H_{{r_{1} }} , H_{{r_{2} }} , \ldots , H_{{r_{R} }} , H_{{c_{1} }} , H_{{c_{2} }} , \ldots , H_{{c_{C} }} ] $$
(3)
Algorithm 1 (shown as a figure)

The size of proposed feature vector H is \( 15*\left( {C + R} \right) \).
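Reusing the strip_histograms sketch above, the feature vector H of Eq. (3) can be assembled as follows; this is again an illustrative sketch for a motion projection matrix P computed as in the earlier snippet, not the authors' code.

```python
import numpy as np

H_rows = strip_histograms(P, strip_size=10, bins=15, axis=0)  # 15 * R values
H_cols = strip_histograms(P, strip_size=10, bins=15, axis=1)  # 15 * C values
H = np.concatenate([H_rows, H_cols])                          # length 15 * (R + C)
```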

The output of Algorithm 1 is a set of feature vectors associated with their corresponding activity labels. The outputs for the various bundles are concatenated column-wise to form the training dataset of features.

While the lengths of the video clips in the dataset differ across activities and the bundle size may be chosen arbitrarily, the proposed method ensures that the number of feature values remains fixed at 15 * (C + R) for each bundle, since it essentially extracts only the histogram information.

4 Classification

A support vector machine (SVM) could be used for classification. However, for a large number of classes, the one-against-all technique creates an unbalanced dataset: the positive class typically has a very small share (5%–6%) while the negative class has the lion's share (94%–95%). This may cause the SVM to underperform, since it tries to minimize the overall error. To circumvent this issue, we use a random forest for activity classification.

The method builds a number of classification trees, each trained on a randomly selected subset of the features; bagging is used to decrease the correlation between the trees, which makes the ensemble more robust to noise. At test time, a query sample is run through all the trees of the forest and the final classification is established by voting over their outputs. We have chosen a publicly available dataset, JHMDB [11], for evaluating the proposed method.
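An illustrative sketch of this classification step using scikit-learn's RandomForestClassifier is shown below; X_train, y_train and X_test are assumed to hold the bundle-wise feature vectors and activity labels, and the tree count is a placeholder rather than the paper's tuned setting.

```python
from sklearn.ensemble import RandomForestClassifier

# X_train: array of shape (num_bundles, 15 * (R + C)); y_train: activity labels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)  # bagged trees on random feature subsets
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)  # final label via majority vote over the trees
```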

This dataset is a joint-annotated human motion database consisting of 21 activities: (a) brush hair, (b) catch, (c) clap, (d) climb stairs, (e) golf, (f) jump, (g) kick ball, (h) pick, (i) pour, (j) pull-up, (k) push, (l) run, (m) shoot ball, (n) shoot bow, (o) shoot gun, (p) sit, (q) stand, (r) swing baseball, (s) throw, (t) walk, and (u) wave. The dataset contains 36–55 clips per action class, with each clip containing 15–40 frames, for a total of 31,838 annotated frames. Figure 4 illustrates some of the activities of the JHMDB dataset; the columns of the figure show the catch, jump, wave and push activities respectively, while the first row shows the 10th frame and the second row the 20th frame of the corresponding activities.

Fig. 4.

Frames of the catch, jump, wave and push activities (column-wise) from the JHMDB dataset. The first row shows the 10th frame and the second row the 20th frame of the corresponding activities.

We have performed experiments on the JHMDB dataset for different values of B (number of frames in a bundle), R (number of horizontal strips in matrix P), C (number of vertical strips in matrix P) and the number of trees in the random forest. Typically R = C was chosen in our experiments. The overall classification accuracies of the various experiments are shown in Table 1.

Table 1. Classification accuracy for various parameter settings of the experiments on the JHMDB dataset

In the rest of the paper, our discussion is based on the parameter values chosen in experiment 1 of Table 1.

JHMDB is a challenging dataset, since its scenes are taken from movies, YouTube channels, etc., without imposing any constraints on illumination, camera movement, object orientation or the relative position of object and camera.

The proposed method performs well for many activities, while some activities are classified with low accuracy. Pull-up and shoot bow are classified with accuracies of 89% and 83% respectively, and four more activities are classified with accuracies above 70%. The sit, run, walk and kick ball activities are classified with lower accuracies, which can be attributed to the high similarity among some of the activities, resulting in a greater number of misclassifications. The overall classification accuracy of the proposed method is 51.75%.

Activity-wise classifier accuracy is shown in Fig. 5, and the confusion matrix for activity recognition on the JHMDB dataset is shown in Fig. 6.

Fig. 5.

Activity-wise classifier accuracy for the JHMDB dataset

Fig. 6.

Confusion matrix for JHMDB dataset

The results of the proposed method are compared with other state-of-the-art techniques in Table 2. It can be seen that our approach performs better than all the histogram-based and trajectory-based approaches of Jhuang et al. [11].

Table 2. Comparison with other methods.

5 Conclusion

In this paper we have proposed a method for human action recognition based on local estimation of motion in RGB videos. Differencing of consecutive frame pairs is used to determine local motion, and, for a small bundle of frames, the maximum magnitude of motion at each pixel is stored to create a motion projection matrix. The matrix is segmented into horizontal and vertical strips, and the binned histograms of each strip serve as feature descriptors. We have used these descriptors in a random forest based classification scheme and evaluated the performance on a publicly available human action RGB dataset.