
1 Introduction

In the present era, human activity recognition in videos [1,2,3,4,5] has become a prominent area of research in the field of computer vision. It has many daily-living applications such as patient monitoring, object tracking, threat detection, and security and surveillance [6,7,8,9]. The motivation for working in this field is to recognize human gestures, actions and interactions in videos. Recognizing human activities in video involves several steps, such as preprocessing, segmentation, feature extraction, dimension reduction and classification. Accurate knowledge of the publicly available datasets [10, 11] saves time: there is no need to generate a new dataset, it becomes easier for researchers to identify suitable datasets, and the key focus can shift to developing new algorithms rather than gathering information about datasets. With the advancement of labelling algorithms, it has become possible to label dense video datasets for activity recognition, object tracking, and scene reconstruction [12,13,14]. This work covers datasets for gesture recognition, daily-living actions or activities, sports actions, human–human interactions and human–object interactions. It includes both RGB and RGB-D publicly available datasets and provides dataset specifications such as year of publication, frame rate, spatial resolution, total number of actions, number of actors (subjects) performing in the videos, and state-of-the-art results on existing benchmarks. Tables 1 and 2 provide the details of the RGB and RGB-D datasets, respectively. Before 2010, a large number of RGB video datasets were available to this community [15,16,17]. With the advent of low-cost depth sensors, e.g. the Microsoft Kinect, there has been a drastic increase in 3D and multi-modal video datasets. Because these sensors are low-cost and lightweight, datasets are recorded with multiple modalities such as depth frames, accelerometer data, IR sensor frames, acoustic data, and skeleton information. Compared with traditional RGB datasets, RGB-D datasets with multiple modalities reduce the chance of information loss in videos, at the cost of increased complexity [18, 19].

Table 1 RGB (2D) video dataset
Table 2 RGB-D (3D) video dataset

2 Related Work

Chaquet et al. [20] surveyed 28 publicly available RGB datasets of human actions and activities. They discussed dataset characteristics such as ground truth, number of actions/actors, views and areas of application, but their work does not cover the RGB-depth datasets available at that time. Edwards et al. [3] focused on pose-based methods and presented a novel high-level activity dataset; their work gives no information about state-of-the-art accuracies on existing datasets. Wang et al. [21] discussed specific novel techniques for RGB-D-based motion recognition. T. Hassner [22] focused on action recognition and the accuracies reported on most of the RGB datasets; the main limitation of this work is that it does not cover depth datasets or their areas of application. M. Firman [23] analysed depth datasets for tasks such as semantics, identification, face/pose recognition and object tracking. Borges et al. [24] discussed the advantages and shortcomings of various methods for human action understanding. Zhang et al. [25] focused on RGB-D action benchmarks but did not consider pose-based or human-interaction activities; they also intended to cover state-of-the-art accuracies and classification techniques on specific benchmarks. Compared with the existing surveys, the primary aim of this work is to provide an accessible platform for readers.

3 Challenges in HAR Dataset

In this section, we discuss the challenges involved in RGB and RGB-D datasets. It can be noticed that dataset videos face limitations in at least one aspect, such as similarity of actions, cluttered backgrounds, viewpoint variations, illumination variations and occlusions.

3.1 Background and Environmental Conditions

The background in videos may be of several types, such as slowly or highly dynamic, static, occluded, airy, rainy or densely populated. It can be observed that the KTH dataset is more challenging than the Weizmann dataset because of its changing background. The UT-Interaction, BEHAVE and BIT-Interaction datasets were recorded in large outdoor areas under changing natural background conditions. Datasets such as UCF Sports, UIUC, Olympic Sports, Hollywood1, HMDB51, THUMOS, ActivityNet and YouTube-8M, collected from online sources such as YouTube, Google and various movies, are challenging because both the objects and the backgrounds are dynamic.

3.2 Similarity and Dissimilarity of Actions

The similarity between action classes in a dataset poses a fundamental challenge to researchers. Many actions appear similar in videos, such as jogging, running and walking, and classification accuracy suffers when action classes look alike. The same action performed by different actors also increases the complexity of a dataset; for example, the YouTube Sports-1M dataset has thousands of videos of the same action class.

3.3 Occlusion

Occlusion occurs when another object hides the object of interest. For human action and activity recognition, occlusion can be categorized as self-occlusion and occlusion by another object (partial occlusion). Depth sensors are severely affected by internal sensor noise and by self-occlusion of the performing subjects, as in the CAD-60, 50 Salads, Berkeley MHAD, UWA3D Activity, LIRIS, MSR Action Pairs, UTD-MHAD, M2I, SYSU-3D HOI, NTU RGB+D and PKU-MMD datasets.

3.4 View Variations

The viewpoint from which an activity is recorded is a key attribute of a human activity recognition system. Multiple views carry more robust information than a single view and make recognition less dependent on the capture angle. However, multiple views increase complexity, since more training as well as test data is required for classification. KTH, Weizmann, Hollywood, UCF Sports, MSR Action 3D and Hollywood 3D are single-view datasets, whereas CAD-60, CAD-120, UWA3D, Northwestern-UCLA, LIRIS, UTD-MHAD, NTU RGB+D, IXMAS, CASIA Action, UT-Interaction, BEHAVE, BIT-Interaction and Breakfast Actions are multi-view datasets.

4 Approaches for Human Action Recognition

Based on the methodologies used in recent years to recognize human actions and activities, existing solutions can be grouped into two major categories: handcrafted feature descriptors and deep learning approaches.

4.1 Local and Global Approaches

Early work on human action recognition was largely limited to pose or gesture recognition. The first step towards recognizing human actions in videos was introduced by Bobick and Davis [26]. They represented human actions using Motion History Images (MHI) and Motion Energy Images (MEI). The global MEI template is given by

$$ E_{\tau } \left( {x, y, t} \right) = \mathop \sum \limits_{i = 0}^{\tau - 1} B\left( {x,y,t - i} \right), $$
(1)

where \( E_{\tau } \) is the MEI obtained over a temporal window of duration \( \tau \), and \( B\left( {x,y,t - i} \right) \) is the binary image sequence representing the detected object pixels.
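To make Eq. (1) concrete, the following is a minimal NumPy sketch that builds an MEI from binary foreground masks, together with the standard Bobick–Davis MHI update for comparison; the frame-differencing step and the threshold value are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def motion_energy_image(masks, tau):
    """Eq. (1): aggregate the last tau binary masks B(x, y, t - i)."""
    window = np.stack(masks[-tau:])            # i = 0 .. tau - 1
    return np.any(window, axis=0).astype(np.uint8)

def motion_history_image(masks, tau):
    """Standard Bobick-Davis MHI: a recency-weighted motion template."""
    h = np.zeros_like(masks[0], dtype=np.float32)
    for b in masks:
        # Set to tau where motion is present, otherwise decay towards zero.
        h = np.where(b > 0, float(tau), np.maximum(0.0, h - 1.0))
    return h

# Toy usage: binary masks from simple frame differencing of a random "video".
frames = np.random.rand(12, 64, 64)
masks = [(np.abs(frames[t] - frames[t - 1]) > 0.5).astype(np.uint8)
         for t in range(1, len(frames))]
mei = motion_energy_image(masks, tau=8)
mhi = motion_history_image(masks, tau=8)
```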

Local space-time interest points (STIPs) for action recognition were introduced by Laptev et al. [27]. The local 3D Harris operator [23] shows good performance in recognizing 3D data objects with a small number of interest points and is widely used in computer vision applications. It is based on the local autocorrelation function, defined as

$$ e\left( {x,y} \right) = \mathop \sum \limits_{{x_{i} ,y_{i} }} W\left( {x_{i} ,y_{i} } \right)\left[ {I\left( {x_{i} +\Delta x,\, y_{i} +\Delta y} \right) - I\left( {x_{i} ,y_{i} } \right)} \right]^{2} , $$
(2)

where \( I\left( { \cdot , \cdot } \right) \) is the image function and \( \left( {x_{i} ,y_{i} } \right) \) are the points within the Gaussian window \( W \) centred on \( \left( {x, y} \right) \), which defines the neighbourhood under analysis.
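As a rough illustration of Eq. (2), the sketch below approximates the windowed autocorrelation with the usual structure-tensor formulation of the Harris response; the Gaussian width sigma and the constant k are illustrative choices rather than values from the cited work.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(image, sigma=1.5, k=0.04):
    """Harris corner measure from the Gaussian-weighted autocorrelation matrix."""
    iy, ix = np.gradient(image.astype(np.float64))   # image derivatives
    # Entries of the second-moment (autocorrelation) matrix, window W = Gaussian.
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy ** 2                       # determinant of the 2x2 matrix
    trace = ixx + iyy
    return det - k * trace ** 2                      # large at corner-like points

# Interest points are local maxima of the response above a threshold.
img = np.random.rand(128, 128)
r = harris_response(img)
points = np.argwhere(r > 0.01 * r.max())
```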

4.2 Deep Learning Approaches

After 2012, deep architectures achieved their initial successes with supervised approaches, which overcame the vanishing gradient problem by using ReLU activations and reduced training times through GPUs. Deep learning is data-driven and suffers when training samples are scarce; hence, for small activity datasets, local and global feature extractors remain effective and efficient for classification.

Li et al. [28] showed that 3D convolutional networks outperform their 2D frame-based counterparts by a noticeable margin. The 3D convolution value at position \( \left( {x, y,z} \right) \) on the \( j^{\rm th} \) feature map in the \( i^{\rm th} \) layer is defined as

$$ v_{ij}^{xyz} = { \tanh }\left( {b_{ij} + \mathop \sum \limits_{m} \mathop \sum \limits_{p = 0}^{{P_{i} - 1}} \mathop \sum \limits_{q = 0}^{{Q_{i} - 1}} \mathop \sum \limits_{r = 0}^{{R_{i} - 1}} w_{ijm}^{pqr} v_{{\left( {i - 1} \right)m}}^{{\left( {x + p} \right)\left( {y + q} \right)\left( {z + r} \right)}} } \right), $$
(3)

where \( R_{i} \) is the size of the 3D kernel along the temporal dimension and \( w_{ijm}^{pqr} \) is the \( \left( {p,q,r} \right)^{\rm th} \) value of the kernel connected to the \( m^{\rm th} \) feature map in the previous layer. Karpathy et al. [29] proposed the concept of slow fusion to increase the temporal awareness of a convolutional network. Donahue et al. [30] addressed action recognition through a cascaded CNN and a class of recurrent neural networks, commonly instantiated as Long Short-Term Memory (LSTM) networks; the basic recurrent update is given as

$$ h^{\left( t \right)} = \sigma \left( {w_{x} x^{\left( t \right)} + w_{h} h^{{\left( {t - 1} \right)}} } \right) $$
(4)
$$ z^{\left( t \right)} = \sigma \left( {w_{z} h^{\left( t \right)} } \right) $$
(5)
$$ w_{x} { \in {\mathbb{R}}}^{r \times d} ,\quad w_{h} { \in {\mathbb{R}}}^{r \times r} ,\quad w_{z} { \in {\mathbb{R}}}^{m \times r} $$
(6)

Here, \( x^{\left( t \right)} { \in {\mathbb{R}}}^{d} \) is the external input signal, \( z^{\left( t \right)} { \in {\mathbb{R}}}^{m} \) is the output signal, and \( h^{\left( t \right)} { \in {\mathbb{R}}}^{r} \) is the hidden state. Recurrent neural networks have proven to be among the best models for video activity analysis.
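To tie Eqs. (3)–(5) together, here is a minimal NumPy sketch in which a single 3D convolution unit (with the tanh nonlinearity of Eq. (3)) feeds a simple recurrent update per time step (Eqs. (4) and (5)); all shapes, weights and the toy data are illustrative assumptions, not the architectures of [28, 29, 30].

```python
import numpy as np

def conv3d_unit(v_prev, w, b):
    """Eq. (3): out[x,y,z] = tanh(b + sum_m sum_pqr w[m,p,q,r] * v_prev[m,x+p,y+q,z+r])."""
    m_, p_, q_, r_ = w.shape
    _, X, Y, Z = v_prev.shape
    out = np.zeros((X - p_ + 1, Y - q_ + 1, Z - r_ + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                patch = v_prev[:, x:x + p_, y:y + q_, z:z + r_]
                out[x, y, z] = np.tanh(b + np.sum(w * patch))
    return out

def rnn_step(x_t, h_prev, w_x, w_h, w_z, sigma=np.tanh):
    """Eqs. (4)-(5): hidden-state update and output projection."""
    h_t = sigma(w_x @ x_t + w_h @ h_prev)
    z_t = sigma(w_z @ h_t)
    return h_t, z_t

# Toy cascade: convolve each clip, flatten to x^(t), run the recurrence.
rng = np.random.default_rng(0)
w3d, b3d = rng.standard_normal((2, 3, 3, 3)) * 0.1, 0.0
d = 6 * 6 * 6                                   # flattened conv output size
r, m = 16, 4                                    # hidden and output sizes
w_x, w_h, w_z = (rng.standard_normal(s) * 0.1 for s in ((r, d), (r, r), (m, r)))
h = np.zeros(r)
for t in range(5):                              # five clips of the "video"
    clip = rng.standard_normal((2, 8, 8, 8))    # m = 2 input feature maps
    x_t = conv3d_unit(clip, w3d, b3d).ravel()   # x^(t), dimension d
    h, z = rnn_step(x_t, h, w_x, w_h, w_z)      # h^(t), z^(t)
```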

5 Discussion

In this section, we briefly discuss the advantages and disadvantages of both 2D and 3D datasets.

5.1 Advantages of RGB and RGB-D Dataset

It can be observed from Table 3 that traditional human activity datasets were recorded with a small number of actions in segmented videos under somewhat controlled conditions. Some benchmarks collected from online media such as YouTube, movies and social video-sharing sites represent realistic action scenes, which are more practical for real-life applications. The UCF101 dataset is the largest in terms of number of classes and video clips compared with the UCF11, UCF50, Olympic Sports and HMDB51 datasets. ActivityNet is a large-scale RGB video dataset captured with complete annotation labels and bounding boxes. The 3D datasets have an advantage over visual 2D datasets in being less sensitive to illumination, because they are captured with multi-sensor systems comprising visual, acoustic and inertial sensors. It can be observed from Table 4 that fusing information from different sensors increases recognition accuracy on depth datasets at the cost of increased complexity. The 3D Online RGB-D Action dataset was recorded in a living-room environment and is used for cross-environment and real online action recognition. The NTU RGB+D dataset has the largest number of actions/actors among existing datasets and was captured with multiple modalities and different camera views. PKU-MMD is a large-scale benchmark focused on continuous, multi-modality 3D complex human activities with complete annotation information, and it is suitable for deep learning methods.

Table 3 Technical specification RGB and RGB-D dataset
Table 4 RGB and RGB-D dataset with state-of-the-art accuracy and techniques

5.2 Disadvantages of RGB and RGB-D Dataset

Currently, many video datasets exist; despite this, there are limitations in automatically recognizing and classifying human activities. The main reasons for these limitations, in at least one form, are the number of samples per action, the length of the clips, the capture conditions, background clutter, viewpoint changes and the nature of some activities. The 2D datasets range from a small number of actions to complex actions with a broad range of applications. They face more challenges, such as view variations, intra-class variations, cluttered backgrounds, partial occlusions and camera movement, than depth datasets. The RGB-D datasets face limitations of low resolution, few training samples, a limited number of camera views and actions, few subjects and low precision. Early RGB-D datasets captured single-action video frames under controlled indoor or lab environments. MSR Action 3D is restricted to depth frames of gaming actions only. The Northwestern-UCLA dataset was recorded with more than one Kinect sensor at the same time to collect multi-view representations, and it becomes a challenge to handle and synchronize all sensor data simultaneously.

6 Conclusion

A review of the various state-of-the-art datasets on human action has been presented. Human action datasets have been categorized into two major categories: RGB and RGB-D datasets. The challenges involved and the specifications of these datasets have been discussed. Conventional RGB datasets face the problems of cluttered backgrounds, illumination variations, camera motion, viewpoint changes and occlusions. Designing feature descriptors for activity recognition that cope with changing real-world environments remains a challenge. Robust evaluation techniques for cross-dataset validation are required, which will be useful for realistic application scenarios.