1 Introduction

Human action recognition is one of the most active research topics in computer vision and pattern recognition. Behavioral biometrics [16, 17, 37, 40], interactive user interfaces [33], automatic visual surveillance [10, 13, 44, 45], video summarization [11, 29] and content-based video analysis and retrieval [3, 7, 11, 12, 29] are some real-life applications of action recognition.

In recent years, researchers have used a variety of terms and definitions for the problem of action recognition. Different terms such as behavior [15], activity [14], action [46, 54], event [6, 21] and gesture [4] have been used with the same meaning. In this work, we use the term action and define it as a sequence of movements performed by a human agent while carrying out a task. For example, running is an action consisting of a sequence of hand, leg and body movements in a specific order. The action label is an infinitive that describes the action well enough for an ordinary person to understand it. Action recognition is the process of assigning a label to an action based on the information extracted from the input video sequences, even when the action is performed by different people in different styles and at different speeds.

The variety of gestures, movements and styles makes human action recognition one of the most challenging problems in computer vision. The variation in gesture and movement in a simple action such as walking is so high that it can serve as a biometric identifier for individual people [40]. Environmental variations such as lighting conditions, dynamic backgrounds and occlusion are further challenges in human action recognition. Most previous approaches to action recognition ignore the problem of viewpoint variation and assume that the actor performs the action at a pre-specified angle to the camera. For example, in [4, 8, 58] the actors have to face the camera or remain parallel to the view plane while performing the action. Such approaches are impractical in most real-life applications, such as automatic surveillance, where the actor performs the action at an arbitrary angle and position. Viewpoint variation has attracted more attention in recent years, and several approaches [5, 19, 26, 27, 32, 38, 39, 42, 50, 51, 55] have been proposed for view-independent action recognition. Some recent works [43, 53, 56, 60] use 3D information such as depth maps to overcome problems such as lighting changes, occlusion and view change. In a depth map, the value of each pixel is proportional to its distance from the viewpoint. Depth maps can be captured by somatosensory equipment such as the Microsoft Kinect, or generated by stereo matching algorithms from multiple images captured simultaneously by calibrated cameras [25, 59]. Since depth maps are not available in most real-life applications, such as automatic visual surveillance, we focus on traditional 2D approaches that use a single camera for action recognition.

Weinland [52] classified recent view-independent action recognition approaches into three main strategies: view normalization, exhaustive search and view invariance. View normalization approaches map video frames from different views into a common canonical coordinate frame, and matching is then performed in this canonical setting. Exhaustive search approaches take all possible view transformations into account and search for the optimal match among them. View invariance approaches use features that do not depend on the view transformation, so the result of matching is the same for any viewpoint. The approach proposed in this paper falls under the view invariance strategy and combines a view-dependent representation with a view-independent one. Our view-independent representation is based on the view-invariant temporal self-similarity matrix introduced in [19]. For the view-dependent representation, we use the entropy of the silhouette's distance transform. Experimental results show that the proposed method outperforms recent action recognition approaches.

The rest of this paper is organized as follows. First we present an overview of recent work on action recognition. Next we describe the proposed approach for view-independent action recognition. In Section 4 we demonstrate the efficiency and practicality of the proposed method and compare it with recent action recognition methods on two public datasets. Section 5 concludes the paper.

2 Related works

Previous works on recognizing human actions from video sequences are numerous. In this paper we focus on action recognition methods that are invariant to viewpoint variations. We use Weinland's taxonomy [52] to classify view-independent action recognition approaches into three strategies: view normalization, exhaustive search, and view invariance. In the following we introduce these strategies and discuss view-independent action recognition approaches within each of them.

In view-normalization approaches, each frame of the video is transformed into a common canonical coordinate frame and action recognition is applied to the transformed video. To perform the normalization, the transformation matrix from the canonical frame to the current frame has to be estimated. This transformation must remove global scale, translation variation and camera rotation. To remove global scale and translation, most recent works extract a ROI around the actor and translate it to a unit frame. To estimate the camera rotation, the 3D orientation of the actor must be estimated; the walking direction or the direction of the face can be used to estimate the body orientation. Cuzzolin [9] estimated the motion direction of the actor by interpolating the sequence of centers of mass over time with a spline. Zhao [61] inferred the 3D orientation of a human by assuming that he faces in the direction of motion. Rogez et al. [38] assumed that people usually walk on a planar ground with a vertical posture and, based on this assumption, used the direction of motion to define the canonical views. In [5], non-orthogonal views taken from several cameras are used to build a new virtual view orthogonal to the motion direction, and the view-dependent action recognition method of [28] is applied to recognize the action from this orthogonal virtual view. In [39], the Volume Motion Template (VMT) was introduced as a virtual 3D template based on the disparity maps of stereo input sequences, and the projection of the VMT at an optimal virtual view was used for gesture recognition. The optimal virtual view is obtained by rotating the VMT around the Y-axis to match the motion direction.

Exhaustive search approaches extract features from several samples of an action captured from different viewpoints to learn the action. For recognition, features extracted from the target video sequence are matched against the features of each training sample, and the label of the best match is chosen as the action label.

Among exhaustive search approaches, Bobick et al. [4] combined the motion-energy image (MEI) and the motion history image (MHI) to build a vector-valued image in which each pixel represents the motion at that location. The estimated templates are matched against stored templates of known actions in different views to find the best match. Ogale et al. [32] defined an action as a short sequence of key poses and represented each pose by a collection of silhouettes extracted from different views. They built a single HMM over poses and views from the training set and used it to find the best sequence of poses for a single-camera video in the test set. Ahmad [2] used human silhouettes and the Cartesian components of optical flow velocity as features to build a set of HMMs for each action in different views; experimental results show that this method is robust to view variation.

Lv et al. [26] modeled an action as a chain of 3D key poses extracted from a small set of motion capture sequences. They used 90 cameras around the human model in the POSER [34] software to render each key pose from different viewpoints. For recognition, the shape context of the human silhouette extracted from each frame is matched against the shape context of the rendered key poses. Similar to [26], Natarajan [30] rendered poses from Mocap data [1] of various actions in multiple viewpoints using POSER and represented them in a conditional random field (CRF). The observation probability is computed from shape similarity and the transition probability from flow similarity. They used the body poses of all frames of the action templates instead of key poses, because key poses with large differences would make the flow matching inaccurate. Weinland [50] modeled an action as a sequence of 3D exemplars, represented as visual hulls computed from 5 calibrated cameras. To match observations and exemplars, the visual hulls are projected into 2D and the match between the resulting silhouettes is computed.

View invariance approaches use features and matching functions that are independent of the view transformation. In these approaches, view-dependent features are removed during the feature extraction stage and a set of view-independent features is estimated for each action. View invariance approaches need neither an estimate of the actor's orientation, as view normalization does, nor several viewpoints for training, as exhaustive search approaches do. However, removing view-dependent features results in a loss of discriminative information for action classification.

Yilmaz et al. [57] proposed to model an action based on the shape and motion of the actor. They treated the spatiotemporal volume (STV) of the actor as a 3D object in (x, y, t) and generated an action sketch by analyzing the differential geometric surface properties of the STV, such as peaks, pits and valleys. The resulting action sketch is used as a view-independent feature for action recognition. Weinland introduced the motion history volume (MHV) in [51] as a free-viewpoint representation of human actions, extending 2D motion history image (MHI) templates to 3D. In [51] MHVs are transformed into cylindrical coordinates around the vertical axis and view-independent features are extracted using the Fourier transform. Shen et al. [42] introduced fundamental ratios as view-independent features for action recognition. They decomposed human gestures into sets of point triplets and computed the similarity between actions from the fundamental ratios of associated point triplets during the motion.

Wang et al. [48] introduced a representation based on dense trajectories and motion boundaries. For the trajectory-based representation, they used point coordinates, histograms of oriented gradients and histograms of optical flow to embed the shape, appearance and motion information of each cell of a spatio-temporal grid. For the motion-boundary-based representation, they used the motion boundary histogram (MBH) [47], which relies on differential optical flow. Wang et al. [49] improved this representation by removing trajectories generated by camera motion and keeping only trajectories related to human action. To remove camera-induced trajectories, they matched feature points between frames using SURF descriptors and dense optical flow, and discarded matches in the human regions found by a state-of-the-art human detector [35].

Shuiwang et al. [18] argued that traditional action recognition approaches, which consist of two separate stages of feature extraction and classification, are not practical in all applications, because finding appropriate features is difficult and highly application dependent. To address this problem, they suggested using deep learning models to learn high-level features from low-level ones. They extended the convolutional neural network (CNN) [22], a 2D deep learning model, to 3D by using 3D convolutions in the convolution stages, so that features are computed from both the spatial and temporal dimensions. Their experiments show superior performance compared with several baseline methods.

Junejo et al. [19] introduced the temporal self-similarity matrix (SSM) for view-independent action recognition. Each element of the SSM represents the difference between the low-level features of the frames corresponding to its row and column indices. Junejo claimed that the self-similarity matrix is stable under view changes of an action. He used optical flow and histograms of oriented gradients as low-level features to build self-similarity matrices, extracted the structure of the SSM with a log-polar descriptor, and classified this descriptor using the bag-of-words method. Our view-independent representation is based on the self-similarity matrix introduced in [19], but instead of optical flow and histograms of oriented gradients we use the trajectories of feature points as the low-level feature to build the temporal self-similarity matrix.

3 The proposed action recognition method

The use of view-invariant features for view-independent representation of actions has grown in recent years. There are two main problems with view-invariant features: first, they are few in number and may not be sufficient for classification algorithms; second, high-level information about the action is lost when view-dependent features are discarded. Thus, the remaining features may not be discriminative enough to distinguish different activities.

To address these problems, we propose a hybrid method consisting of a view-dependent representation, called the alpha representation, and a view-independent representation, called the beta representation. The view-dependent representation is used to reduce the number of possible categories for each video sequence: it places similar activities in the same clusters by clustering the training samples during the training phase. In the test phase, the cluster of the target sample is first determined, and the action label is then predicted using only the actions within that cluster. The experiments show that the proposed method has reasonable accuracy for view-independent action recognition and that its results are comparable with those of recently proposed action recognition algorithms [19, 26].

Figure 1 shows the diagram of the proposed method. In this paper, we use the self-similarity matrix introduced in [19] as the view-independent representation and the entropy of the silhouette's distance transform as the view-dependent representation. It should be noted that the proposed framework is independent of both of these choices, so other combinations of view-dependent and view-independent representation methods can be used. The components of the proposed framework are described in the following.

Fig. 1 Diagram of proposed framework for action recognition

3.1 Preprocessing

A video sequence is the result of mapping a three-dimensional scene onto two-dimensional images. These images contain a large amount of information, much of it of little value for our goal, so the first step of preprocessing is to remove useless information and prepare the video sequences for feature extraction. In the preprocessing stage, the actor is isolated from the background using background subtraction. We used a mean-filter background model in the RGB and HSV color spaces. First, the mean filter in the RGB color space is used to separate the foreground from the background. As shown in Fig. 2b, the resulting image is very noisy; in addition, the actor's shadow on the ground is detected as foreground. To solve this problem, the image noise is reduced using morphological operations and keeping the biggest blob (Fig. 2c, d). Then, the actor's shadow on the ground is removed using the mean filter in the HSV color space within the bounding box around the actor.
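As a rough illustration of this chain, the sketch below performs mean-background subtraction, morphological clean-up, biggest-blob selection and a simple HSV shadow test with OpenCV. The thresholds, the kernel size and the shadow criterion are assumptions of this sketch, not the exact values of our implementation.

```python
import cv2
import numpy as np

def extract_silhouette(frame_bgr, bg_mean_bgr, bg_mean_hsv, thresh=30, sat_thresh=40):
    """Isolate the actor from a static mean-background model (illustrative values)."""
    # 1) Mean-background subtraction in the RGB (BGR) space -> noisy foreground mask
    diff = cv2.absdiff(frame_bgr, bg_mean_bgr)
    mask = (diff.max(axis=2) > thresh).astype(np.uint8) * 255

    # 2) Morphological opening/closing to suppress noise
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # 3) Keep only the biggest blob (assumed to be the actor plus shadow)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n > 1:
        biggest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
        mask = np.where(labels == biggest, 255, 0).astype(np.uint8)

    # 4) Drop shadow pixels: a shadow keeps roughly the background hue and
    #    saturation and mainly darkens the value channel (simplified test)
    hsv_diff = cv2.absdiff(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV), bg_mean_hsv)
    shadow = (hsv_diff[:, :, 0] < 10) & (hsv_diff[:, :, 1] < sat_thresh)
    mask[shadow] = 0
    return mask
```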

Fig. 2 Stages of background subtraction: (a) original frame, (b) RGB subtraction, (c) morphological operation, (d) finding biggest blob, (e) HSV subtraction

3.2 Alpha representation

For the view-dependent representation we use the entropy of the silhouette's distance transform. The silhouette is one of the most common view-dependent representations of the human body. Its advantage is that it captures the external structure of the body with reasonable accuracy and low computational cost; its disadvantage is the noise and uncertainty at the boundaries caused by background subtraction. To reduce the effect of boundary noise, we use the distance transform, in which the value of each pixel equals its minimum distance from the boundary. Besides reducing boundary noise, the distance transform represents the silhouette's internal structure. Figure 3 shows the silhouette and distance transform of three different scenes from the IXMAS dataset.
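A minimal sketch of this step with OpenCV is given below; the per-frame normalization is an assumption added here to make the entropy binning that follows easier, not a requirement of the method.

```python
import cv2
import numpy as np

def silhouette_distance_transform(silhouette_mask):
    """Distance transform of a binary silhouette: each foreground pixel gets its
    Euclidean distance to the nearest background (boundary) pixel."""
    binary = (silhouette_mask > 0).astype(np.uint8)
    dist = cv2.distanceTransform(binary, cv2.DIST_L2, 3)
    # Per-frame normalization to [0, 1] (an assumption of this sketch, added so
    # that the entropy binning described next is comparable across frames).
    return dist / (dist.max() + 1e-6)
```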

Fig. 3 Silhouette and distance transform of three different scenes in the alpha representation

Recall that the parts of the body involved in the action cause more variation in the pixels of the corresponding region. Thus, we use entropy to measure the contribution of each pixel over the video sequence of each action. Entropy is a measure of the uncertainty of a random variable. The concept was introduced by Shannon in 1948 in the paper "A Mathematical Theory of Communication" [41]. Shannon defined the entropy H of a discrete random variable X with probability mass function P(X) as:

$$ H(X)=E\left[-\ln\left(P(X)\right)\right] $$
(1)

Shannon's entropy represents the information content of a random variable: random variables with more uncertainty carry more information and have higher entropy than those with less uncertainty. We calculate the entropy of each pixel's intensity over the video sequence of each sample. If a pixel does not change, or changes only slightly, during the sequence of frames, its entropy is close to zero and it carries little information. If a pixel shows larger changes in intensity with a more uniform distribution, its information content, and hence its entropy, increases. Figure 4 shows the entropy values for four different actions of the IXMAS dataset; brighter pixels have higher entropy.

Fig. 4 Entropy of the silhouette's distance transform in the alpha representation for four different actions

As illustrated in Fig. 4, the pixels of body parts that move more during the action are brighter than the others. For example, in the wave and punch actions, the pixels of the hands are brighter and carry more information than the other pixels.
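The per-pixel entropy of Eq. (1) can be estimated by quantizing each pixel's values over the frames of one sample and accumulating a histogram, as in the sketch below; the number of bins is an illustrative assumption.

```python
import numpy as np

def pixel_entropy(dt_stack, n_bins=16):
    """Shannon entropy of each pixel over time, Eq. (1).

    dt_stack: (T, H, W) stack of per-frame distance-transform images with values
    in [0, 1]. Pixels that barely change get near-zero entropy; pixels on moving
    body parts spread over many bins and get high entropy."""
    T = dt_stack.shape[0]
    levels = np.clip((dt_stack * n_bins).astype(int), 0, n_bins - 1)
    entropy = np.zeros(dt_stack.shape[1:])
    for b in range(n_bins):
        p = (levels == b).sum(axis=0) / T                 # P(X = b) per pixel
        entropy -= p * np.log(np.where(p > 0, p, 1.0))    # -sum p ln p
    return entropy
```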

3.3 Clustering

The goal of the clustering phase is to reduce the search space of the classification phase by reducing the number of possible labels for each action. To this end, we partition the action instances and place instances with similar alpha representations in the same clusters. In this work, we use k-means to cluster the actions based on their alpha representation. After clustering, the labels of the samples in each cluster are chosen as the possible labels for that cluster. If the samples of one action are placed in more than one cluster, the label of that action is kept for all of those clusters. In the classification phase, the label of a test sample is selected from the labels of the actions belonging to the cluster nearest to the sample.
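A minimal sketch of this clustering step, assuming flattened entropy images as the alpha feature vectors and scikit-learn's k-means, is given below; the number of clusters shown is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_clusters(alpha_features, action_labels, n_clusters=4, seed=0):
    """Cluster training samples by their (flattened) alpha representation and
    collect the set of action labels occurring in each cluster. An action whose
    samples are split over several clusters keeps its label in all of them."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    assignments = km.fit_predict(alpha_features)          # (N,) cluster id per sample

    candidate_labels = {c: set() for c in range(n_clusters)}
    for cluster_id, action in zip(assignments, action_labels):
        candidate_labels[cluster_id].add(action)
    return km, candidate_labels

# Test time: find the nearest cluster, then classify only among its labels.
# cluster_id = km.predict(alpha_test.reshape(1, -1))[0]
# allowed = candidate_labels[cluster_id]
```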

3.4 Beta representation

In the view-independent representation, image features with less dependency on viewpoint changes are used. In this work, we use the temporal self-similarity matrix proposed by Junejo [19] as the view-independent representation. For a sequence of images $I = \{I_1, I_2, \dots, I_T\}$, the SSM of $I$ is a square symmetric matrix of size $T \times T$ whose $(i, j)$-th element $d_{ij}$ is the distance between the low-level features of frames $i$ and $j$. We use the Euclidean distance between feature vectors as the measure of frame difference.

$$ \left[{d}_{ij}\right]_{i,j=1,2,\dots,T}=\left[\begin{array}{ccccc} 0 & d_{1,2} & d_{1,3} & \cdots & d_{1,T} \\ d_{2,1} & 0 & d_{2,3} & \cdots & d_{2,T} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d_{T,1} & d_{T,2} & d_{T,3} & \cdots & 0 \end{array}\right] $$
(2)

Junejo et al. [19] claimed that the self-similarity matrix has a similar structure across different instances of the same action, while self-similarity matrices of distinct actions have different structures. They used HOG and optical flow as base features and demonstrated that the structures of the self-similarity matrices obtained from these two features for the same action are very similar. However, computing optical flow and HOG is computationally expensive. To avoid this cost, we use the trajectories of points distributed almost uniformly over the body surface. In [19], the trajectory-based SSM is computed by tracking a set of M points $\{p_1, p_2, \dots, p_M\}$ distributed over the human body and taking the mean Euclidean distance between the corresponding points at any two instants $i$ and $j$ of the sequence:

$$ {d}_{ij}=\frac{1}{M}\sum_{m=1}^{M}{\left\Vert {p}_i^m-{p}_j^m\right\Vert}_2 $$
(3)
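Given the tracked point trajectories, Eqs. (2) and (3) amount to the following straightforward computation (a sketch that assumes the points are already in correspondence across frames):

```python
import numpy as np

def trajectory_ssm(points):
    """Temporal self-similarity matrix of Eqs. (2)-(3) from point trajectories.

    points: (T, M, 2) image coordinates of M tracked points in each of T frames,
    assumed to be in correspondence across frames."""
    T = points.shape[0]
    ssm = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            # d_ij: mean Euclidean distance between corresponding points (Eq. 3)
            d = np.linalg.norm(points[i] - points[j], axis=1).mean()
            ssm[i, j] = ssm[j, i] = d
    return ssm
```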

The main problem with the trajectory-based SSM is the non-uniform distribution of points over the human body. When feature points are selected for tracking, they may be denser on some body parts, and those parts then dominate the shape of the self-similarity matrix. To resolve this problem, the feature points can be distributed uniformly over the body surface or along a contour. However, if one part of the body is not visible in a frame, no point will be placed on it even if that part plays an important role in the action, and its effect will not appear in the self-similarity matrix. We resolve these problems by tabulating the image: the interior of the bounding box around the actor is divided into fixed-size cells, as shown in Fig. 5, and the minimum number of feature points for each cell is determined from the number of silhouette boundary pixels in that cell. The tabulation is performed for every frame of the video sequence. Whenever the number of feature points in a cell falls below the required minimum, new feature points are detected in that cell. This procedure ensures a uniform distribution of points over the body surface throughout the performance of the action.
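A rough sketch of this per-cell re-seeding with OpenCV's corner detector is shown below; the cell size and the per-cell point quota are illustrative assumptions.

```python
import cv2
import numpy as np

def reseed_points(gray, silhouette, points, bbox, cell=16, pts_per_cell=2):
    """Keep the tracked feature points roughly uniform over the actor.

    gray: current frame (grayscale); silhouette: binary actor mask;
    points: (N, 2) array of currently tracked points; bbox: (x, y, w, h) box
    around the actor. Cell size and per-cell quota are illustrative values."""
    x0, y0, w, h = bbox
    kept = [points] if len(points) else []
    for cy in range(y0, y0 + h, cell):
        for cx in range(x0, x0 + w, cell):
            cell_mask = silhouette[cy:cy + cell, cx:cx + cell]
            if cell_mask.sum() == 0:                 # no body pixels in this cell
                continue
            inside = [p for p in points
                      if cx <= p[0] < cx + cell and cy <= p[1] < cy + cell]
            if len(inside) >= pts_per_cell:
                continue
            # Detect new corners restricted to this cell of the silhouette
            mask = np.zeros_like(gray, dtype=np.uint8)
            mask[cy:cy + cell, cx:cx + cell] = (cell_mask > 0).astype(np.uint8) * 255
            corners = cv2.goodFeaturesToTrack(gray, pts_per_cell - len(inside),
                                              0.01, 3, mask=mask)
            if corners is not None:
                kept.append(corners.reshape(-1, 2))
    return np.vstack(kept) if kept else points

# The re-seeded points are then tracked to the next frame with the KLT tracker,
# e.g. cv2.calcOpticalFlowPyrLK(prev_gray, gray,
#                               pts.reshape(-1, 1, 2).astype(np.float32), None).
```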

Fig. 5 Uniform distribution of feature points on the human body is guaranteed using image tabulation

Figure 6 shows the self-similarity matrices calculated for the "Picking" and "Butterfly" actions performed by two different actors. In this figure, blue indicates values close to zero and red indicates the maximum values. As can be seen, the self-similarity matrix of an action has a consistent structure across different viewpoints and different actors.

Fig. 6 Self-similarity matrices of point trajectories for the "Picking" and "Butterfly" actions in the WVU dataset

Another issue to consider in temporal self-similarity is the difference between motion along the horizontal and vertical axes. In most action recognition scenarios, the rotation of the camera around the roll axis is negligible, so the upward and downward directions for the actor and the camera coincide. However, since the actor can turn around, the left and right directions of the actor do not necessarily match the left and right directions in the image. Thus, motion along the vertical axis is a view-independent feature, while motion along the horizontal axis is view dependent and should not be mixed with the view-independent channels. Therefore, we decompose the motion of each feature point into two orthogonal components, x and y, along the coordinate axes. We then half-wave rectify the vertical motion into two non-negative channels $y^+$ and $y^-$ such that $y = y^+ - y^-$, and compute self-similarity matrices for the feature vectors $y^+$, $y^-$ and $x$. Figure 7 shows the self-similarity matrices of the two channels $y^+$ and $y^-$ alongside the original self-similarity matrices.
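The rectification itself is a simple element-wise operation, sketched below for per-point displacement vectors:

```python
import numpy as np

def half_wave_rectify_vertical(displacements):
    """Split per-point motion vectors into x, y+ and y- channels.

    displacements: (T, M, 2) frame-to-frame motion of the M tracked points.
    Returns three (T, M) arrays such that y = y_plus - y_minus."""
    x = displacements[..., 0]
    y = displacements[..., 1]
    y_plus = np.maximum(y, 0.0)      # upward component (view independent)
    y_minus = np.maximum(-y, 0.0)    # downward component (view independent)
    return x, y_plus, y_minus
```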

Fig. 7 Results of half-wave rectification on temporal self-similarity matrices for two different actions

To extract the structure of the self-similarity matrix we use the local log-polar descriptor proposed in [19]. As shown in Fig. 8, the log-polar descriptor is semicircular and divided into 11 blocks, with its center placed on the main diagonal entries. For each block $a$ of the descriptor centered at diagonal entry $i$, the normalized histogram of the gradient directions of the pixels within that block is computed as $h_i^a = [h_{i,b}^a]_{b=1:8}$.

Fig. 8 Extracting the local descriptor from the self-similarity matrix

The histogram vectors of all 11 blocks are concatenated to form the local descriptor vector $h_i = [h_i^a]_{a=1:11}$. For blocks that fall outside the self-similarity matrix, the vector $h_i^a$ is set to the zero vector. When more than one self-similarity matrix is extracted, a local descriptor $h_i^f$ is computed for each feature channel and the descriptors of the different matrices are concatenated into a single descriptor $h_i = [h_i^f]_{f=1:F}$. Finally, a video sequence is represented by the set of local descriptors $H(I) = (h_1, h_2, \dots, h_T)$ computed along the main diagonal of the self-similarity matrix.
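The sketch below illustrates the bookkeeping of this descriptor: gradient-direction histograms over the blocks of a semicircular support centered on each diagonal entry, normalized per block and concatenated over blocks and feature channels. The block geometry used here (a central disc plus two rings of five sectors) is an assumption of the sketch; the exact layout follows [19].

```python
import numpy as np

def block_of(di, dj, radius, inner=0.25):
    """Map an offset (di, dj) from the diagonal entry to one of 11 log-polar
    blocks: a central disc plus 2 rings x 5 angular sectors (assumed geometry)."""
    r = np.hypot(di, dj)
    if r > radius or dj < 0:                          # outside the semicircular support
        return -1
    if r <= inner * radius:
        return 0                                      # central block
    ring = 1 if r <= np.sqrt(inner) * radius else 2   # log-spaced ring boundary
    sector = min(int(np.arctan2(di, dj) / np.pi * 5 + 2.5), 4)
    return 1 + (ring - 1) * 5 + sector

def diagonal_descriptors(ssm_list, radius=28, n_blocks=11, n_bins=8):
    """Local descriptors h_1..h_T along the main diagonal of one or more SSMs.

    ssm_list: list of (T, T) self-similarity matrices, one per feature channel."""
    T = ssm_list[0].shape[0]
    angles = []
    for ssm in ssm_list:
        gy, gx = np.gradient(ssm)                     # gradient field of the SSM
        angles.append(np.arctan2(gy, gx))             # gradient direction
    descriptors = []
    for i in range(T):
        per_channel = []
        for ang in angles:
            hist = np.zeros((n_blocks, n_bins))
            for di in range(-radius, radius + 1):
                for dj in range(0, radius + 1):       # semicircular ("future") half
                    ii, jj = i + di, i + dj
                    if not (0 <= ii < T and 0 <= jj < T):
                        continue                      # outside the SSM -> stays zero
                    b = block_of(di, dj, radius)
                    if b < 0:
                        continue
                    a = int((ang[ii, jj] + np.pi) / (2 * np.pi) * n_bins) % n_bins
                    hist[b, a] += 1
            norms = hist.sum(axis=1, keepdims=True)   # normalize each block histogram
            hist = np.divide(hist, norms, out=np.zeros_like(hist), where=norms > 0)
            per_channel.append(hist.ravel())          # h_i^f for this channel
        descriptors.append(np.concatenate(per_channel))   # h_i = [h_i^f], f = 1..F
    return descriptors                                    # H(I) = (h_1, ..., h_T)
```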

3.5 Classification

After extracting the self-similarity matrices, we construct the view-independent representation as a set of local descriptors $H(I) = (h_1, h_2, \dots, h_T)$. In the classification phase we use the bag-of-features (BOF) approach [20, 31]. In BOF, k cluster centers are first selected as keywords by running k-means on the local descriptors extracted from the training set. Then, every local descriptor of the training and test samples is assigned to its nearest keyword (cluster center). Next, the histogram of keywords is computed, and each video sequence is described by a normalized keyword histogram H(I). These histograms are the inputs of a nearest neighbor classifier, which assigns to each test sample the action label of the training sample I* whose histogram H(I*) has the minimum distance to H(I) over all training samples. Note that in the nearest neighbor classification, only training samples belonging to the alpha-representation cluster of the target video are considered.
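A minimal sketch of this classification stage, with an illustrative codebook size and the cluster-restricted nearest-neighbour search, is given below:

```python
import numpy as np
from sklearn.cluster import KMeans

def bof_histogram(local_descriptors, codebook):
    """Assign each local descriptor of a sequence to its nearest keyword and
    return the normalized keyword histogram describing the whole sequence."""
    words = codebook.predict(np.asarray(local_descriptors))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

def classify(test_hist, train_hists, train_labels, allowed_labels):
    """Nearest-neighbour classification restricted to the action labels of the
    alpha cluster the test sample was assigned to (Section 3.3)."""
    best_label, best_dist = None, np.inf
    for hist, label in zip(train_hists, train_labels):
        if label not in allowed_labels:
            continue
        d = np.linalg.norm(test_hist - hist)
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label

# Codebook construction over all training descriptors (size 200 is illustrative):
# codebook = KMeans(n_clusters=200, n_init=10).fit(np.vstack(all_training_descriptors))
```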

4 Experiments

In this section we present experimental results for our view-independent action recognition approach and compare it with other action recognition approaches. In our implementation we used a background model based on the mean and variance of pixel values for background subtraction. Noise caused by lighting changes and the acquisition process was removed using morphological operators. To calculate the entropy, the distance transform of the silhouette image was first extracted; then the entropy of each pixel over time was computed and the resulting entropy images were clustered with the k-means algorithm. We used the KLT method to extract point trajectories and guaranteed the uniform distribution of points on the body by tabulating the silhouette image. The self-similarity matrices were constructed separately for the horizontal and vertical components of the trajectories. Finally, we used nearest neighbor classification to recognize the actions. We tested our implementation on two datasets: INRIA XMAS [51] and WVU [36] (Fig. 9).

Fig. 9 (a) Sample frames from the WVU dataset, (b) positions and orientations of the cameras in the WVU dataset, (c) sample frames from the INRIA XMAS dataset

The WVU dataset, provided by West Virginia University, includes 12 actions performed by 48 actors. In this dataset, the position and orientation of all actors in the scene are the same. Eight cameras were used for capturing the dataset; the videos have a resolution of 640 × 480 pixels at 20 frames per second. Figure 9b shows the arrangement of cameras in the scene. In the WVU dataset, the start and end time of each action is precisely specified and the actions are performed by the actors in a controlled manner with as little variation as possible.

The INRIA XMAS dataset [51] includes 13 daily actions, each performed 3 times by 11 actors who choose their position and orientation freely. The actions are recorded simultaneously by five cameras at 23 fps with a resolution of 640 × 480 pixels. Similar to [19], we performed our experiments on 9 actors and 10 actions.

In our implementation, without any optimization, feature extraction on the INRIA XMAS dataset [51] runs at 20 frames per second (fps) on an Intel Core i7 computer, which is faster than [19] (10 fps) and close to the real-time rate of 30 fps. We used k-fold cross-validation to evaluate our experiments, making sure that actions of the same person do not appear in the training and test sets simultaneously. Each experiment was performed 10 times and the mean results are reported. Since the proposed method uses neither camera calibration information nor a combination of multiple cameras, the results are compared only with action recognition approaches that use a single camera for recognition.
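One way to enforce this actor-disjoint split is grouped cross-validation, sketched below with scikit-learn's GroupKFold; the fold count and the run_experiment callback are illustrative assumptions, not part of the original protocol.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def leave_actors_out_eval(features, labels, actors, run_experiment, n_splits=5):
    """Cross-validation in which every sample of one actor stays on the same side
    of the split. run_experiment(train_X, train_y, test_X, test_y) -> accuracy is
    a hypothetical callback; the fold count is an illustrative choice."""
    accuracies = []
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(features, labels, groups=actors):
        acc = run_experiment(features[train_idx], labels[train_idx],
                             features[test_idx], labels[test_idx])
        accuracies.append(acc)
    return float(np.mean(accuracies))
```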

4.1 Action recognition from same-view

In this experiment, a single camera was used for both training and testing. The test was performed on all cameras of the WVU and IXMAS datasets except for the top camera of IXMAS. Tables 1 and 2 show the results for same-view action recognition on the WVU and IXMAS datasets. When a single camera is used for training and testing, the proposed method shows a significant improvement over the preceding approach [19], because the view-dependent representation provides additional discriminative information in same-view action recognition.

Table 1 Comparison of action recognition accuracy in the single-view scenario on the WVU dataset with Junejo's method [19]
Table 2 Comparison of action recognition results in the single-view scenario on the IXMAS dataset with alternative methods

The improvement in accuracy is less pronounced on the IXMAS dataset, because the actors selected their orientation freely. Table 2 shows that the proposed method achieves higher accuracy than the other methods except [26], which is expected since [26] is a view-dependent approach and should perform better in same-view situations. The proposed method outperforms [19] on three cameras and in average accuracy, while [19] has better accuracy on the second camera. Furthermore, the results of the proposed method vary less across views than all methods except [26]. It should be noted that the result of [19] in Table 2 is the one claimed in [19]; the results of our own implementation of [19] are lower than the claimed results.

4.2 Cross-view action recognition

In cross-view action recognition, separate cameras were used for training and testing. Figure 10 shows the cross-view accuracy matrix on the WVU dataset: each element gives the action recognition accuracy when the camera associated with the row number is used for training and the camera associated with the column number is used for testing. Near the main diagonal of the matrix, where the angle between the training and testing cameras is small, the accuracy increases. The accuracy is also higher when the training and testing cameras face opposite directions; for example, high accuracy is obtained for the camera pairs 1 and 5 and 3 and 7. When the training and testing cameras are orthogonal, as for cameras 1 and 3 or cameras 1 and 7, the accuracy is low.

Fig. 10 Cross-view action recognition accuracy of the proposed method on the WVU dataset

The results improve as the number of training cameras increases. Table 3 shows the results when 1 to 7 cameras are used for training and the remaining cameras are used for testing. In this experiment, the training cameras were selected randomly; each scenario was repeated 20 times and the mean accuracy is reported in Table 3. According to the results, a suitable accuracy is obtained when more than 4 cameras are used for training.

Table 3 Cross-view action recognition accuracy of the proposed method on the WVU dataset when more than one camera is used for training

4.3 The impact of the distribution of training cameras on action recognition accuracy

To study the effect of camera distribution around the actor, we used three cameras for training and five cameras for testing. There are 56 possible ways of splitting the 8 cameras into 3 training and 5 testing cameras. The experiment was repeated 10 times for each of the 56 combinations and the mean accuracy was recorded for each. To group similar combinations according to the distribution of the training cameras, we used the following equation to compute φ as a criterion of camera spread ($\theta_1$, $\theta_2$ and $\theta_3$ are the angles between the training cameras):

$$ \varphi =1-\frac{1}{360}\left(\left|120-{\theta}_1\right|+\left|120-{\theta}_2\right|+\left|120-{\theta}_3\right|\right) $$
(4)

In this equation, φ increases when the training cameras are distributed more uniformly around the actor. In Table 4 we divide the 56 possible combinations into 5 groups according to the value of φ and calculate the mean accuracy of each group separately. As shown in Table 4, the accuracy of action recognition increases as the distribution of training cameras around the actor becomes more uniform, and using three uniformly distributed cameras for training can provide accuracy close to the situations in the previous experiments where 6 or 7 cameras were used for training.
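Equation (4) reduces to a few absolute differences; a small sketch with a worked example on the 8-camera WVU rig follows (the example angles are illustrative):

```python
def camera_spread(theta1, theta2, theta3):
    """Uniformity criterion phi of Eq. (4): the closer the three angular gaps
    between the training cameras are to 120 degrees, the closer phi is to 1."""
    return 1.0 - (abs(120 - theta1) + abs(120 - theta2) + abs(120 - theta3)) / 360.0

# Perfectly uniform training cameras:
#   camera_spread(120, 120, 120) == 1.0
# Three adjacent cameras of an 8-camera rig (gaps of 45, 45 and 270 degrees):
#   camera_spread(45, 45, 270) == 1 - (75 + 75 + 150) / 360 ≈ 0.17
```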

Table 4 Action recognition results for different camera distributions when 3 cameras are used for training and 5 cameras are used for testing on the WVU dataset

When the training cameras are distributed more uniformly, their captured frames overlap less and therefore contain less redundant information. Furthermore, a uniform distribution of cameras decreases the maximum possible angular distance between training and testing cameras, and according to the second experiment the recognition accuracy increases as this angular distance decreases. Thus, the action recognition accuracy increases when the training cameras are distributed uniformly around the actor.

4.4 The effect of alpha representation on accuracy of action recognition

In this experiment, we study the effect of the view-dependent representation on the accuracy of the proposed method by changing the number of clusters in the alpha representation. With a single cluster, all action samples lie in the same cluster and the effect of the view-dependent representation disappears. As the number of clusters increases, the action samples are partitioned into more clusters, each cluster contains fewer samples, and the effect of the view-dependent representation grows. Figure 11 plots the accuracy of same-view and cross-view action recognition for different numbers of clusters on the WVU dataset.

Fig. 11 Action recognition accuracy on the WVU dataset for same-view and cross-view situations for various numbers of clusters

As illustrated in Fig. 11, increasing the role of the view-dependent representation improves the accuracy of both same-view and cross-view action recognition. As expected, the effect of the view-dependent representation is greater for same-view than for cross-view recognition: as the number of clusters increases, the accuracy of same-view action recognition rises more than that of cross-view recognition.

4.5 Comparison with other action recognition methods

In this section, the proposed method is tested on the IXMAS dataset and compared with other action recognition methods. In this experiment, for every pair of cameras, one camera was used for training and the other for testing. We report the average accuracy over all experiments, the average accuracy when the same camera is used for training and testing, and the average accuracy when two different cameras are used for training and testing. Table 5 shows the results of the various action recognition methods.

Table 5 Comparison of action recognition accuracy on the IXMAS dataset with alternative methods

According to the results, the proposed algorithm shows higher accuracy than that of Junejo et al. [19] in all three scenarios. Despite its slightly lower accuracy than the method of Lv et al. [26] in the same-view experiments, the proposed algorithm outperforms it in the cross-view experiments.

As mentioned in the first experiment, the higher accuracy of [26] in same-view action recognition is due to its use of a view-dependent representation. This experiment shows that while the proposed method has appropriate accuracy in both same-view and cross-view situations, the method in [26] performs very poorly in cross-view situations.

Figure 12 shows the confusion matrix for the WVU dataset. As shown, the proposed method has high accuracy for all actions. The lowest recognition accuracy is observed for very similar actions such as "Waving 1 hand" and "Waving 2 hands": in 17 % of the experiments, the proposed algorithm confuses these two actions. The highest accuracies, 94.84 % and 92.95 %, are observed for the "butterfly" and "running" actions, respectively.

Fig. 12 Confusion matrix of the proposed method on the WVU dataset

Figure 13 shows the confusion matrix of the proposed method on the IXMAS dataset. These results indicate that the proposed method recognizes "cross arms", "sitting down", "getting up", "turning", "walking" and "kicking" with very high accuracy, and is more accurate than [19] for these actions. This difference comes from the use of motion direction in the construction of the self-similarity matrix as well as the use of the alpha representation to reduce the set of possible action labels. The results also show that the proposed algorithm cannot accurately distinguish similar actions such as "scratching head" and "waving", which are frequently confused with each other. However, this error rate is acceptable given the high similarity of these two actions.

Fig. 13 Confusion matrix of the proposed method on the IXMAS dataset

5 Conclusion

In this paper, we proposed a new framework for view-independent action recognition in which a combination of view-dependent and view-independent representations is used. We used the entropy of the silhouette's distance transform as the view-dependent representation and the self-similarity matrix obtained from tracking uniformly distributed feature points on the human body as the view-independent representation. The main contributions of the proposed method are: (1) a new framework for combining view-dependent and view-independent representations in action recognition; (2) the entropy of the silhouette's distance transform over time as a view-dependent representation; and (3) ensuring a uniform distribution of feature points on the actor's body by tabulating the body silhouette in each frame, and using feature point trajectories instead of optical flow and HOG as the low-level feature for building SSMs.

In all experiments the proposed approach demonstrates accuracy superior or equal to that of other recent action recognition methods. Only when the same camera is used for training and testing does the proposed method show lower accuracy than [26], which is a view-dependent approach. The proposed method is always more accurate than that of Junejo [19], in which the combination of self-similarity matrices of optical flow and histograms of oriented gradients is used as the view-independent representation.

View-independent action recognition algorithms are often less accurate in same-view recognition than view-dependent algorithms. However, thanks to the view-dependent representation, the proposed method shows significantly higher accuracy than other view-independent action recognition methods when the same camera is used for training and testing. The most important benefits of the proposed method are its independence from viewpoint and its good performance in same-view situations compared to other view-independent approaches. Furthermore, as mentioned earlier, the proposed method is twice as fast as the preceding approach [19], and its speed is close to the 30 fps of real-time video. Nevertheless, it has some limitations in real-world applications. The proposed method has only been tested in fully controlled environments, such as the INRIA studio with a static background. In real-world applications, dynamic background variations would reduce the quality of the trajectories and of the resulting temporal self-similarity matrices, and hence the overall recognition accuracy; such applications would require accurate background subtraction algorithms to remove the effect of dynamic background variations. Furthermore, the accuracy of the proposed method depends on the quality of the point tracking algorithm, and we believe that better point tracking algorithms would increase its accuracy. We did not test the method on group actions in which more than one person participates. Applying it to group actions raises further questions, such as whether temporal SSMs should be built for each person separately or a single SSM should be built for the entire scene.