1 Introduction

Action recognition has been a widely researched topic in computer vision for more than two decades. Its applications in real-time surveillance and security make it both challenging and interesting. Various approaches have been taken to solve the problem of action recognition [20]; however, the majority of current approaches fail to address the issues of a large number of action categories and highly unconstrained videos taken from the web.

Most state-of-the-art methods developed for action recognition are tested on datasets like KTH, IXMAS, and Hollywood (HOHA), which are largely limited to a few action categories and are typically recorded in constrained settings. The KTH and IXMAS datasets are unrealistic; they are staged, have little camera motion, and are limited to fewer than 13 very distinct actions. The Hollywood dataset [9], which is taken from movies, addresses the issue of unconstrained videos to some extent, but it involves actors, contains some camera motion and clutter, and is shot by a professional camera crew under good lighting conditions. The UCF YouTube Action (UCF11) dataset [10] consists of unconstrained videos taken from the web and is very challenging, but it has only 11 action categories, all of which are very distinct actions. The UCF50 dataset, an extension of UCF11, also contains videos downloaded from YouTube and has 50 action categories. The recently released HMDB51 dataset [8] has 51 action categories, but after excluding facial actions like smile, laugh, chew, and talk, which are not articulated actions, it has 47 categories compared to 50 in UCF50. Most current methods would fail to detect an action/activity in datasets like UCF50 and HMDB51, where the videos are taken from the web. These videos contain random camera motion, poor lighting, clutter, and changes in scale, appearance, and viewpoint, and occasionally there is no focus on the action of interest. Table 1 lists these action datasets.

Table 1 Action datasets

In this paper, we study the effect of large datasets on performance, and propose a framework that addresses the issues posed by real-life action recognition datasets such as UCF50. The main contributions of this paper are as follows:

  1. We provide an insight into the challenges of large and complex datasets like UCF50.

  2. We propose the use of moving and stationary pixel information obtained from optical flow to construct our scene context descriptor.

  3. We show that, as the number of actions to be categorized increases, the scene context plays a more important role in action classification.

  4. We propose an early fusion scheme for the descriptors obtained from moving and stationary pixels to capture the scene context, followed by a probabilistic fusion of the scene context descriptor and the motion descriptor.

To the best of our knowledge, no one has attempted action/activity recognition on such a large-scale dataset (50 action categories) consisting of videos taken from the web (unconstrained videos) using only visual information.

The rest of the paper is organized as follows. Section 2 deals with the related work. Section 3 gives an insight into working with large datasets. In Sects. 4 and 5, we introduce our proposed scene context descriptor and the fusion approach. In Sect. 6, we present the proposed approach, followed by the experiments and results with discussions in Sect. 7. Finally, we conclude our work in Sect. 8.

2 Related work

Over the past two decades, a wide variety of approaches have been explored to solve the problem of action recognition. Template-based methods [1], modeling the dynamics of human motion using finite state models [6] or hidden Markov models [21], and Bag-of-Features (BOF) models [4, 10, 11, 22] are a few well-known approaches to action recognition. Most of the recent work has focused on BOF in one form or another; however, most of this work is limited to small and constrained datasets.

Categorizing large numbers of classes has always been a bottleneck for many approaches in image classification and action recognition. Deng et al. [3] demonstrated the challenges of performing image classification on 10,000 categories. Recently, Song et al. [17] and Zhao et al. [19] attempted to categorize videos into a large number of categories by using text, speech, and static and motion features. Song et al. [17] used visual features such as color histograms, edge features, face features, SIFT, and motion features, and showed that text and audio features outperform visual features by a significant margin.

With the increase in the number of action categories, motion features alone are not discriminative enough for reliable action recognition. Marszalek et al. [13] introduced the concept of context in action recognition by modeling the scenes: a 2D Harris detector is used to detect salient regions, SIFT descriptors are extracted from those regions, and a bag-of-features framework is used to obtain the static appearance descriptor. Han et al. [5] detect the person, body parts, and objects involved in an action and use knowledge of their spatial locations to design a contextual scene descriptor. Recently, Choi et al. [2] introduced the concept of “Crowd Context” to classify activities involving interactions between multiple people. In all of these methods [2, 5, 13], the performance depends on the detectors used.

Extracting reliable features from unconstrained web videos has been a challenge. In recent years, action recognition in realistic videos has been addressed by Laptev et al. [9] and Liu et al. [10, 11]. Liu et al. [10] proposed pruning the static features using PageRank and the motion features using motion statistics; fusion of these pruned features showed a significant increase in performance on the UCF11 dataset. Ikizler-Cinbis et al. [7] used multiple features from the scene, object, and person, and combined them using a multiple MIL (multiple instance learning) approach. Fusion of multiple features extracted from the same video has gained significant interest in recent years; work by Snoek et al. [14] compares early and late fusion of descriptors.

To date, no action recognition work has been done on very large datasets using only visual features. In this paper, we propose a method that can handle these challenges.

3 Analysis on large-scale dataset

UCF50 is the largest publicly available action recognition dataset, after excluding the non-articulated actions from the HMDB51 dataset. UCF50 has 50 action categories with a total of 6,676 videos and a minimum of 100 videos per action class. Sample video screenshots from UCF50 are shown in Fig. 1. This dataset is an extension of UCF11. In this section, we perform a baseline experiment on UCF50 by extracting the motion descriptor and using the bag of video words approach. We use the following two classification approaches (a minimal sketch of this baseline follows the list):

  1. BoVW-SVM: support vector machines (SVM) for classification.

  2. BoVW-NN: a nearest neighbor approach using an SR-Tree for classification.
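The baseline can be summarized by the sketch below, assuming the per-video motion descriptors have already been extracted. The function names, the RBF kernel, and the use of scikit-learn's default neighbor search in place of an SR-Tree are illustrative choices, not the paper's exact implementation.

```python
# Minimal bag-of-video-words baseline (sketch); descriptor extraction is assumed done.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def build_codebook(all_descriptors, k=500):
    """Cluster descriptors pooled from all training videos into a k-word codebook."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)

def bovw_histogram(descriptors, codebook):
    """Quantize one video's descriptors and return a normalized word histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# train_descs / test_descs: lists of (num_features x dim) arrays, one per video
def run_baseline(train_descs, train_labels, test_descs):
    codebook = build_codebook(np.vstack(train_descs), k=500)
    X_train = np.array([bovw_histogram(d, codebook) for d in train_descs])
    X_test = np.array([bovw_histogram(d, codebook) for d in test_descs])
    svm = SVC(kernel="rbf", probability=True).fit(X_train, train_labels)   # BoVW-SVM
    nn = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)    # BoVW-NN stand-in
    return svm.predict(X_test), nn.predict(X_test)
```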

Which motion descriptor do we use?

Fig. 1
figure 1

Screenshots from videos in the UCF50 dataset showing the diverse action categories

Due to the large scale of the dataset, we prefer a motion descriptor that is fast to compute and reasonably accurate. To select the motion descriptor, we performed experiments on the smaller KTH dataset with different motion descriptors, each extracted at interest points detected using Dollar’s detector [4]. At every interest point location \((x,y,t)\), we extract the following motion descriptors:

  • Gradient: At any given interest point location in a video \((x,y,t)\), a 3D cuboid is extracted. The brightness gradient is computed in this 3D cuboid, which gives rise to three channels \((G_{x},G_{y},G_{t})\) that are flattened into a vector, and later PCA is applied to reduce the dimension.

  • Optical flow: Similarly, Lucas–Kanade optical flow [12] is computed between consecutive frames in the 3D cuboid at \((x,y,t)\) location to obtain two channels \((V_{x},V_{y})\). The two channels are flattened and PCA is utilized to reduce the dimension.

  • 3D-SIFT: Three-dimensional SIFT proposed by Scovanner et al. [15] is an extension of SIFT descriptor to spatio-temporal data. We extract 3D-SIFT around the spatio-temporal region of a given interest point \((x,y,t)\).

All of the above descriptors are extracted from the same locations in the video, and the experimental setup is identical. We use the BOF paradigm and SVM to evaluate the performance of each descriptor. From Table 2, one can see that 3D-SIFT outperforms the other two descriptors for a codebook of size 500, whereas the gradient and optical flow descriptors perform about the same. Computationally, the gradient descriptor is the fastest and 3D-SIFT the slowest. Given the computational cost, we use the gradient descriptor as our motion descriptor in all further experiments.
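A minimal sketch of the gradient-based motion descriptor follows. The 11×11×17 cuboid size matches the one used later on UCF11, but the PCA dimensionality and helper names are assumptions for illustration.

```python
# Gradient motion descriptor at an interest point (x, y, t), sketched with NumPy.
import numpy as np
from sklearn.decomposition import PCA

def gradient_descriptor(video, x, y, t, half=(5, 5, 8)):
    """video: (T, H, W) grayscale array; assumes (x, y, t) is far enough from the borders.
    Returns the flattened (Gx, Gy, Gt) channels of an 11 x 11 x 17 cuboid."""
    hy, hx, ht = half
    cuboid = video[t - ht:t + ht + 1, y - hy:y + hy + 1, x - hx:x + hx + 1].astype(float)
    gt, gy, gx = np.gradient(cuboid)          # brightness gradient along t, y, x
    return np.concatenate([gx.ravel(), gy.ravel(), gt.ravel()])

def reduce_dim(descriptors, n_components=100):
    """PCA is fit on descriptors pooled from the training videos, then applied to all."""
    pca = PCA(n_components=n_components).fit(descriptors)
    return pca, pca.transform(descriptors)
```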

Table 2 Performance of different motion descriptors on the KTH dataset

We also tested our framework on the recently proposed motion descriptor MBH by Wang et al. [18]. The MBH descriptor encodes the motion boundaries along the trajectories obtained by tracking densely sampled points using optical flow fields. Using the code provided by the authors [18], MBH descriptors are extracted for UCF11 and UCF50 datasets and used in place of the above-mentioned motion descriptor for comparison of results with [18].

3.1 Effect of increasing the action classes

In this experiment, we show that increasing the number of action classes affects the recognition accuracy of a particular action class. Since the UCF11 dataset is a subset of UCF50, we start with the 11 actions of UCF11 and randomly add new actions from the remaining 39 actions of UCF50. Each time a new action is added, a complete leave-one-out cross validation is performed on the incremented dataset, using the bag of video words approach on the motion descriptor with a 500-dimension codebook and SVM classification. The performance on the initial 11 actions is 55.46 % using BoVW-SVM and 37.09 % using BoVW-NN. Even as the number of actions in the dataset increases, SVM performs significantly better than the nearest neighbor approach.

Figure 2 shows the change in performance of BoVW-SVM on the initial 11 actions as we add the 39 new actions, one at a time. Increasing the number of actions in the dataset affects some actions more than others. Actions like “soccer juggling” and “trampoline jumping” were affected the most, with standard deviations of \(\sim \)7.08 and \(\sim \)5.84 %, respectively. Actions like “golf swing” and “basketball” were affected the least, with very small standard deviations of \(\sim \)1.35 and \(\sim \)2.03 %, respectively. Overall, the performance on the 11 actions from UCF11 dropped by \(\sim \)13.18 %, i.e., from 55.45 to 42.27 %, after adding the 39 new actions from UCF50. From Fig. 2, one can also see that 8 of the 11 actions have a standard deviation of more than \(\sim \)4.10 %. Analysis of the confusion table shows significant confusion of the initial 11 actions with the newly added actions. This shows that the motion feature alone is not discriminative enough to handle more action categories.
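The incremental protocol can be sketched as follows; `evaluate_loocv` is a placeholder for the full BoVW-SVM leave-one-out evaluation described above, and the random ordering of the added actions is an assumption consistent with the text.

```python
# Incremental protocol: start with the 11 UCF11 actions and add the remaining
# UCF50 actions one at a time, rerunning leave-one-out cross validation each time.
import random

def incremental_evaluation(initial_actions, remaining_actions, evaluate_loocv):
    """evaluate_loocv(action_set) is assumed to return per-action accuracies."""
    current = list(initial_actions)
    to_add = list(remaining_actions)
    random.shuffle(to_add)                       # new actions are added in random order
    history = [evaluate_loocv(current)]          # performance on the original 11 actions
    for action in to_add:
        current.append(action)
        history.append(evaluate_loocv(current))  # full LOOCV after each added action
    return history
```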

Fig. 2
figure 2

The effect of increasing the number of actions on the UCF YouTube Action dataset’s 11 actions by adding new actions from UCF50 using only the motion descriptor. Standard deviation (SD) and mean are also shown next to the action name. The performance on the initial 11 actions decreases as new actions are added

To address the above concerns, we propose a new scene context descriptor which is more discriminative and performs well in very large action datasets with a high number of action categories. From the experiments on UCF50, we show that the confusion between actions is drastically reduced and the performance of the individual categories increased by fusing the proposed scene context descriptor.

4 Scene context descriptor

In order to overcome the challenges of unconstrained web videos and handle a large dataset with many easily confused actions, we propose using information about the scene context in which the action happens. For example, skiing and skateboarding, horse riding and biking, and indoor rock climbing and rope climbing have similar motion patterns and high mutual confusion, but these actions take place in different scenes and contexts. Skiing happens on snow, which looks very different from where skateboarding is done; similarly, horse riding and biking happen in very different locations. Furthermore, scene context also plays an important role in increasing the performance on individual actions. Actions are generally associated with places, e.g., diving and breast stroke occur in water, and golf and javelin throw are outdoor sports. To increase the classification rate of a single action, or to reduce the confusion between similar actions, the scene information is therefore crucial, along with the motion information. We refer to these places or locations as the scene context in this paper.

As the number of categories increases, the scene context becomes important, as it helps reduce the confusion with other actions having similar kinds of motion. In our work, we define scene context as the place where a particular motion happens (stationary pixels), and also include the object that creates this motion (moving pixels).

Humans have an extraordinary ability to perform object detection, tracking, and recognition. We assume that humans tend to focus on objects that are salient or that move in their field of view. We mimic this by treating groups of moving pixels as rough approximations of salient regions and groups of stationary pixels as approximations of non-salient regions in a given video.

Moving and stationary pixels: Optical flow gives a rough estimate of the velocity at each pixel given two consecutive frames. We compute the optical flow \((u,v)\) at each pixel using the Lucas–Kanade method [12] and apply a threshold to the flow magnitude to decide whether the pixel is moving or stationary. Figure 3 shows the moving and stationary pixels in several sample key frames. We extract dense CSIFT [14] at pixels from both groups and use the BOF paradigm to obtain a histogram descriptor for each group separately. We performed experiments using CSIFT descriptors extracted from a dense sampling of the moving pixels \(\mathrm{MP}_{v}\) and the stationary pixels \(\mathrm{SP}_{v}\). For a 200-dimension codebook, the moving-pixel CSIFT histogram alone gives 56.63 % on UCF11, while the stationary-pixel CSIFT histogram gives 56.47 %. If we ignore the moving/stationary split and treat the whole image as one region, we obtain 55.06 %. Our experiments show that the concatenation of the moving- and stationary-pixel CSIFT histograms gives the best performance of 60.06 %. From these results, we conclude that concatenating \(\mathrm{MP}_{v}\) and \(\mathrm{SP}_{v}\) into one descriptor \(\mathrm{SC}_{v}\) is an effective way to encode the scene context. For example, in a diving video the moving pixels come mostly from the person diving and the stationary pixels mostly from the water (pool); diving occurs only in water, and this distinctive scene context helps detect the action.
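A sketch of the moving/stationary split and the concatenated scene context histogram is shown below. Farneback dense flow stands in for the per-pixel Lucas–Kanade flow used in the paper, `extract_dense_sift` is a placeholder for the dense CSIFT extraction, and the threshold value is an assumption.

```python
# Moving/stationary pixel split from optical flow, and SC_v = [MP_v SP_v] (sketch).
import cv2
import numpy as np

def moving_stationary_masks(prev_gray, curr_gray, thresh=1.0):
    """Label each pixel as moving or stationary from the optical-flow magnitude."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    moving = mag > thresh
    return moving, ~moving

def scene_context_descriptor(frame, moving, stationary, codebook, extract_dense_sift):
    """Quantize descriptors from each group separately and concatenate the two
    BOF histograms. extract_dense_sift(frame, mask) is a hypothetical helper."""
    mp_desc = extract_dense_sift(frame, mask=moving)
    sp_desc = extract_dense_sift(frame, mask=stationary)
    mp_hist = np.bincount(codebook.predict(mp_desc), minlength=codebook.n_clusters)
    sp_hist = np.bincount(codebook.predict(sp_desc), minlength=codebook.n_clusters)
    return np.concatenate([mp_hist / max(mp_hist.sum(), 1),
                           sp_hist / max(sp_hist.sum(), 1)])
```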

Fig. 3
figure 3

Moving and stationary pixels obtained using optical flow

Why CSIFT? Liu et al. [10] report 58.1 % on the UCF11 dataset using SIFT. Our experiments on the same dataset using GIST gave a much lower performance of 43.89 %. Our scene context descriptor using CSIFT achieves 63.75 %, \(\sim \)2.5 % better than the motion feature and \(\sim \)5.6 % better than SIFT. It is evident that color information is very important for capturing the scene context.

Key frames: Instead of computing the moving and stationary pixels and their corresponding descriptors on every frame of the video, we uniformly sample k frames from the video, as shown in Fig. 4. This reduces the time needed to compute the descriptors, since the majority of frames in a video are redundant. We did not implement explicit key frame detection, which could be done by computing color histograms of the frames and treating a sufficiently large change in the histogram as a key frame. We tested different numbers of key frames, sampled evenly along the video, on the UCF11 dataset. Figure 5 shows that the performance is almost stable beyond three key frames. In our final experiments on the datasets, we therefore use three key frames sampled evenly along the video to speed up the experiments. A codebook of dimension 500 is used in this experiment.
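A minimal sketch of the uniform key frame sampling; spacing the samples from the first to the last frame is an assumption about how "sampled evenly along the video" is realized.

```python
# Uniform key-frame sampling (sketch): pick k frames evenly spaced along the video.
import numpy as np

def sample_key_frames(num_frames, k=3):
    """Return k frame indices spread evenly over [0, num_frames - 1]."""
    return np.linspace(0, num_frames - 1, num=k, dtype=int)

# e.g. a 120-frame clip with k = 3 yields frame indices [0, 59, 119]
```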

Fig. 4
figure 4

Key frame selection from a given video

Fig. 5
figure 5

Performance of scene context descriptor on different number of key frames

4.1 How discriminative is the scene context descriptor?

In this experiment, the proposed scene context descriptors are extracted, and the bag of video words paradigm followed by SVM classification is employed to study the proposed descriptor. As in the experiment in Sect. 3.1, new actions from UCF50 are added to UCF11 one at a time, and leave-one-out cross-validation is performed at each increment. The average performance on the initial 11 actions of UCF11 is 60.09 %; after adding the 39 new actions from UCF50, the performance on these 11 actions drops to 52.36 %, i.e., a \(\sim \)7.72 % decrease, compared to a \(\sim \)13.18 % decrease for the motion descriptor. The average standard deviation of the performance on the initial 11 actions over the entire experiment is \(\sim \)2.25 %, compared to \(\sim \)4.18 % for the motion descriptor. Figure 6 clearly shows that the scene context descriptor remains more stable and discriminative than the motion descriptor as the number of action categories increases.

Fig. 6
figure 6

Effect of increasing the number of actions on the UCF YouTube Action dataset’s 11 actions by adding new actions from UCF50, using only the scene context descriptor. Standard deviation (SD) and mean are shown next to the action name. The performance on the initial 11 actions decreases as new actions are added, but with significantly less standard deviation compared to using the motion descriptor, as shown in Fig. 2

5 Fusion of descriptors

A wide variety of visual features can be extracted from a single video, such as motion features (e.g., 3D-SIFT, spatio-temporal features), scene features (e.g., GIST), or color features (e.g., color histograms). To perform classification using all these different features, the information has to be fused at some point. According to Snoek et al. [16], fusion schemes can be classified into early fusion and late fusion, based on when the information is combined.

Early fusion: In this scheme, the information is combined before training a classifier, for example by concatenating the different types of descriptors and then training a single classifier.

Late fusion: In this scheme, a classifier is trained for each type of descriptor, and the classification results are then fused. Classifiers such as SVMs can provide a probability estimate for every class rather than a hard classification decision. Fusing these probability estimates is called probabilistic fusion [23]. Probabilistic fusion assumes the different descriptors to be conditionally independent, which is a fair assumption for the visual features used in this paper, i.e., the gradient-based motion descriptor and color SIFT. In probabilistic fusion, the individual probabilities are multiplied and normalized. For d sets of descriptors \(\left\{ X_{j}\right\} ^{d}_{1}\) extracted from a video, the probability of the video being classified as action \(a\), i.e., \(p( a \mid \{X_{j}\}^{d}_{1})\), obtained using probabilistic fusion is:

$$\begin{aligned} p\left( a \mid \{X_{j}\}^{d}_{1}\right) = \frac{1}{N} \prod ^{d}_{j=1} p\left( a \mid X_{j}\right) , \end{aligned}$$
(1)

where \(N\) is a normalizing factor which we consider to be 1. In late fusion, the individual strengths of the descriptors are retained.
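With \(N = 1\), Eq. (1) reduces to an element-wise product of the per-descriptor probability vectors, as in the following sketch.

```python
# Probabilistic (late) fusion of Eq. (1) with N = 1 (sketch).
import numpy as np

def probabilistic_fusion(prob_estimates):
    """prob_estimates: a list of length-a probability vectors, one per descriptor type."""
    fused = np.prod(np.vstack(prob_estimates), axis=0)   # element-wise product of probabilities
    return fused                                         # argmax of this vector gives the action

# e.g. fused = probabilistic_fusion([p_motion, p_scene]); label = fused.argmax()
```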

5.1 Probabilistic fusion of motion and scene context descriptor

Probabilistic fusion: Late fusion via probabilistic fusion combines the probability estimates of the two descriptors obtained from their separately trained SVMs, i.e.,

$$\begin{aligned} \max _{i=1,\ldots ,a}\left( P_\mathrm{SC}(i)\,P_\mathrm{M}(i)\right) , \end{aligned}$$

where \(a\) is the number of actions to classify, and \(P_\mathrm{SC}(i)\) and \(P_\mathrm{M}(i)\) are the probability estimates for action \(i\), obtained from SVMs trained separately on the scene context and motion descriptors. We also tested early fusion of the motion and scene context features, i.e., \(\left[ M_{v}\ \mathrm{SC}_{v} \right]\), followed by training an SVM, which performed \(\sim \)5 % better than the individual descriptors on UCF50, as expected. However, early fusion after normalization, i.e., \(\left[ M_{v}/\mathrm{max}\left(M_{v}\right),\mathrm{SC}_{v}/\mathrm{max}\left(\mathrm{SC}_{v}\right) \right] \), gave a remarkable increase in performance of \(\sim \)14 %. It is evident from Fig. 7 that, on average across all codebook sizes, late fusion (probabilistic fusion) performs best. Therefore, in all of our experiments on the KTH, HMDB51, UCF YouTube (UCF11), and UCF50 datasets, we perform probabilistic fusion of the scene context and motion descriptors.
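The two early-fusion variants compared above can be sketched as follows; the small epsilon guard against division by zero is an addition for robustness and is not part of the paper's description.

```python
# Early fusion with and without per-descriptor normalization (sketch).
import numpy as np

def early_fusion(motion_hist, scene_hist):
    """Plain concatenation [M_v SC_v]."""
    return np.concatenate([motion_hist, scene_hist])

def early_fusion_normalized(motion_hist, scene_hist, eps=1e-12):
    """Each descriptor is divided by its own maximum before concatenation."""
    return np.concatenate([motion_hist / max(motion_hist.max(), eps),
                           scene_hist / max(scene_hist.max(), eps)])
```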

Fig. 7
figure 7

Performance of different methods to fuse scene context and motion descriptors on the UCF50 dataset

6 System overview

To perform action recognition, we extract the following information from each video: (1) scene context information from key frames and (2) motion features from the entire video, as shown in Fig. 8. The probability estimates of the individual SVMs are fused to obtain the final classification.

Fig. 8
figure 8

Proposed approach

In the training phase, from each training video we extract spatio-temporal features \(\left\{ m_{1},m_{2},\ldots ,m_{x}\right\} \) at the \(x\) interest points detected using the interest point detector proposed by Dollar et al. [4]. We also extract CSIFT features on moving pixels \(\left\{ mp_{1},mp_{2},\ldots ,mp_{y}\right\} \) and stationary pixels \(\left\{ sp_{1},sp_{2},\ldots ,sp_{z}\right\} \) from \(k\) frames uniformly sampled in the video, where \(y\) and \(z\) are the numbers of CSIFT features extracted from moving and stationary regions, respectively. A codebook of size \(p\) is generated from the spatio-temporal features of all training videos. Similarly, a codebook of size \(q\) is generated from the CSIFT features of moving and stationary pixels combined. For a given video \(v\), we compute the histogram descriptors \(M_{v}\), \(\mathrm{MP}_{v}\), and \(\mathrm{SP}_{v}\) using their respective codebooks, from the \(x\) spatio-temporal features of the entire video, the \(y\) CSIFT features of the moving pixels, and the \(z\) CSIFT features of the stationary pixels in the key frames. We perform an early fusion of \(\mathrm{MP}_{v}\) and \(\mathrm{SP}_{v}\) before training a classifier using a support vector machine (SVM), i.e., \(\mathrm{SC}_{v} = \left[\mathrm{MP}_{v}\ \mathrm{SP}_{v}\right]\), which we call the scene context descriptor. We train an SVM classifier \(\mathrm{SVM}_{M}\) on all the motion descriptors \(M_{v}\) and a separate SVM classifier \(\mathrm{SVM}_{C}\) on all the scene context descriptors \(\mathrm{SC}_{v}\), where \(v = \left[1,2,\ldots ,tr\right] \) and \(tr\) is the number of training videos. Since all the descriptors \(M_{v}\), \(\mathrm{MP}_{v}\), and \(\mathrm{SP}_{v}\) are histograms, we use a histogram intersection kernel in the SVM classifiers.
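A sketch of the training step with a histogram intersection kernel follows, implemented here through scikit-learn's precomputed-kernel interface; this is one possible realization under those assumptions, not the authors' code.

```python
# Separate SVMs for motion and scene context descriptors with a histogram
# intersection kernel, supplied to scikit-learn as a precomputed Gram matrix (sketch).
import numpy as np
from sklearn.svm import SVC

def histogram_intersection(A, B):
    """Gram matrix K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def train_svm(train_histograms, labels):
    K = histogram_intersection(train_histograms, train_histograms)
    return SVC(kernel="precomputed", probability=True).fit(K, labels)

# svm_M = train_svm(M, labels)    # motion descriptors M_v
# svm_C = train_svm(SC, labels)   # scene context descriptors SC_v
```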

Given a query video \(q\), we extract the motion descriptor \(M_{q}\) and the scene context descriptor \(\mathrm{SC}_{q}\) as described in the training phase. We then perform a probabilistic fusion of the probability estimates of the motion descriptor \(\left[P_\mathrm{M}(1),P_\mathrm{M}(2),\ldots ,P_\mathrm{M}(a)\right]\) and the scene context descriptor \(\left[P_\mathrm{SC}(1),P_\mathrm{SC}(2), \ldots , P_\mathrm{SC}(a)\right]\), obtained from \(\mathrm{SVM}_{M}\) and \(\mathrm{SVM}_{C}\) trained on the motion and scene context descriptors, respectively, for the \(a\) action classes, i.e.,

$$\begin{aligned}&\left[P(1),P(2),\ldots ,P(a)\right] \\&\qquad =\left[P_\mathrm{M}(1)P_\mathrm{SC}(1),P_\mathrm{M}(2)P_\mathrm{SC}(2),\ldots ,P_\mathrm{M}(a)P_\mathrm{SC}(a)\right]. \end{aligned}$$

We use the fused probabilities as confidence scores to perform the action classification.
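The query-time classification then reduces to the sketch below, assuming the SVMs were trained on precomputed histogram intersection kernels as above; `svm_M` and `svm_C` denote the trained classifiers and `M_train`, `SC_train` the training histograms.

```python
# Query-time probabilistic fusion: multiply the per-class probabilities of the two
# SVMs and take the maximum (sketch).
import numpy as np

def histogram_intersection(A, B):
    """Same helper as in the training sketch: K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def classify_query(M_q, SC_q, svm_M, svm_C, M_train, SC_train):
    K_m = histogram_intersection(M_q[None, :], M_train)    # 1 x tr test kernel row
    K_c = histogram_intersection(SC_q[None, :], SC_train)
    p_m = svm_M.predict_proba(K_m)[0]                       # [P_M(1), ..., P_M(a)]
    p_sc = svm_C.predict_proba(K_c)[0]                      # [P_SC(1), ..., P_SC(a)]
    fused = p_m * p_sc
    return svm_M.classes_[np.argmax(fused)], fused
```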

7 Experiments and results

Experiments were performed on the following datasets: KTH, UCF11, UCF50, and HMDB51. The KTH dataset consists of six actions performed by 25 actors in a constrained environment, with a total of 598 videos. The HMDB51 dataset has 51 action categories, with a total of 6,849 clips, further grouped into five types; the general facial action type is not considered articulated motion, which leaves 47 action categories. The UCF11 dataset includes 1,100 videos covering 11 actions collected from YouTube under challenging conditions, such as low quality, poor illumination, and camera motion. The UCF50 dataset has 50 actions with a minimum of 100 videos per category, also taken from YouTube; it covers a wide variety of actions in different contexts and poses the same challenges as the UCF YouTube Action dataset.

In all of our experiments, we use three key frames from each video to extract scene context features, as explained before; however, we use all the frames of the video to compute motion features, without any pruning. We do not use the audio, text, etc. contained in the video files to compute any of our features; our method uses only visual features. All experiments are performed under leave-one-out cross validation unless specified otherwise.

7.1 UCF11 dataset

UCF11 is a very challenging dataset. We extract 400 cuboids of size \(11\times 11\times 17\) for the motion descriptor and a scene context descriptor from three key frames. We evaluate using leave-one-out cross validation. Our approach achieves 73.20 % (Fig. 9) with a codebook of size 1,000. The motion descriptor alone achieves 59.89 % (Fig. 10), and the scene context descriptor alone achieves 60.06 % (Fig. 11). The scene context plays a very important role in the performance of our approach. For example, the performance of the motion descriptor on the biking action is 49 %, with 21 % confusion with horse riding. After fusion with the scene context descriptor, which has 12 % confusion with horse riding, the performance increases to 67 % and the confusion with horse riding is reduced to 10 %: the confusion decreases by 11 % and the performance increases by 18 %. This is due to the complementary nature of probabilistic fusion, in which the individual strengths of the descriptors are preserved. The same effect is observed for “basketball” and “tennis swing”, as shown in Fig. 9.

Fig. 9
figure 9

Confusion table for UCF11 dataset using the proposed framework, i.e., probabilistic fusion of motion descriptor (dollar-gradient) and scene context descriptor. Average performance 73.20 %

Fig. 10
figure 10

Confusion table for UCF11 dataset using motion descriptor (dollar-gradient). Average performance 59.89 %

Fig. 11
figure 11

Confusion table for UCF11 dataset using scene context descriptor. Average performance 60.06 %

The performance reported by Liu et al. [11] using hybrid features, obtained by pruning the motion and static features, is 71.2 %; we perform \(\sim \)2 % better. Recently, Ikizler-Cinbis et al. [7] reported 75.21 %, which is \(\sim \)2.1 % better than our approach; however, they perform computationally intensive steps such as video stabilization, person detection, and tracking, which our approach does not require. By replacing the motion feature with MBH (4,096-dimension codebook) [18] and following exactly the same experimental setup (SVM with a \(\chi^{2}\) kernel) [18], the motion (MBH) and scene context descriptors give 83.13 and 46.57 %, respectively. Combined in the multi-channel approach of [18], this gives 85.34 %, which is \(\sim \)1 % better than the best known result on UCF11 reported by Wang et al. [18].

7.2 UCF50 dataset

This is a very large and challenging dataset with 50 action categories. In this experiment, 1,000-dimension codebooks are used for both the motion and the scene context descriptors. The individual performance of the motion descriptor is 53.06 %, and that of our new scene context descriptor is 47.56 %. After fusing both descriptors, we obtain a performance of 68.20 %, a \(\sim \)15 % increase (Fig. 12).

Fig. 12
figure 12

Confusion table for UCF50 using our approach. Average performance 68.20 %

The performance on indoor rock climbing using the motion descriptor is 28 %; 11 % of the time it is confused with rope climbing, and 10 % of the time rope climbing is confused with indoor rock climbing. This is understandable because of the similar motion patterns of these actions. The performance of the scene context descriptor on indoor rock climbing is 71 % with only 1 % confusion with rope climbing, and its performance on rope climbing is 10 % with 4 % confusion with indoor rock climbing. The confusion is low because the two actions happen in very different locations. Using our approach, we obtain 80 % on indoor rock climbing and 42 % on rope climbing. The complete confusion table is shown in Fig. 12. In some cases, the scene context descriptor performs worse than the motion descriptor; for example, on bench press the scene context performance is 54 % with 15 % confusion with pizza tossing, because both actions are usually performed indoors. However, the motion descriptor shows no such confusion, which raises the final performance on bench press to 71 %.

Figure 13 shows the performance obtained by incrementally adding one action at a time from UCF50 to UCF11. The overall performance on the initial 11 actions using our approach is 70.56 %, and on all 50 actions it is 66.74 %, a drop of only 3.8 % despite the addition of 39 new actions. The fusion of both descriptors consistently adds 15.5 % over the motion descriptor, with a variance of 1 %, and 17.3 % over the scene context descriptor, with a variance of 9.3 % (Table 3).

Table 3 Performance comparison on KTH dataset
Fig. 13
figure 13

Performance as new actions are added to UCF YouTube (UCF11) dataset from the UCF50 dataset

It is interesting to note that substituting MBH (2,048-dimension codebook) as the motion descriptor in the above experimental setup gives the best performance of 76.90 %, where the MBH and scene context descriptors give 71.86 and 47.28 %, respectively.

7.3 HMDB51 dataset

The proposed approach has been tested on all 51 categories of the HMDB51 dataset using the original videos, with the experimental setup kept similar to [8] for comparison. We used the HOG/HOF features provided by the authors [8], which gave 19.96 % for a codebook of size 2,000. The scene context descriptor, computed by extracting dense CSIFT on three key frames and quantizing with a codebook of size 2,000, gave 17.91 %. The proposed probabilistic fusion achieves 27.02 %, which is \(\sim \)3.84 % higher than the best results reported by Kuehne et al. [8].

7.4 KTH dataset

We applied our proposed method to the KTH dataset. Although the idea of scene context is not useful on this dataset, we conducted the experiment to compare the performance of our method with other state-of-the-art results on KTH. The experimental setup is leave-one-out cross validation with a 1,000-dimension codebook. Our approach achieves 89.79 %, whereas the scene context feature alone achieves 64.20 % and the motion feature alone achieves 91.30 %. The performance drops by only 1.51 % due to the scene context features, despite the 25.95 % gap between the scene context and motion features. This shows the robustness of the probabilistic fusion of the scene context and motion descriptors.

8 Conclusion

In this paper, we proposed an approach to perform action recognition on large datasets like UCF50 and HMDB51. The proposed approach achieves the best performance on the UCF11 (87.19 %), UCF50 (76.90 %), and HMDB51 (27.02 %) datasets. We showed that, as the number of categories increases, motion descriptors become less discriminative. We also showed that the proposed scene context descriptor is more discriminative and, when properly fused with motion descriptors, gives \(\sim \)15 and \(\sim \)4 % improvements on UCF50. Our approach does not require pruning of motion or static features, video stabilization, or person detection and tracking. The proposed method is able to perform action recognition on highly unconstrained videos and on very large datasets.