Abstract
Vision-based human activity recognition is becoming a prominent area of research owing to its broad applications, such as security and surveillance, human–computer interaction, patient monitoring, and robotics. Various approaches have been developed for recognizing human activity, and their performance is tested on video datasets. The objective of this survey is therefore to outline the available video datasets and to highlight their merits and demerits under practical considerations. We categorize these datasets into two parts: the first consists of two-dimensional (2D-RGB) datasets, and the second of three-dimensional (3D-RGB-D) datasets. The most prominent challenges in these datasets are occlusion, illumination variation, view variation, annotation, and fusion of modalities. The key specifications of these datasets are resolution, frame rate, number of actions/actors, background, and application domain. All specifications, challenges, and comparisons are presented in tabular form. We also present the state-of-the-art algorithms that achieve the highest accuracy on these datasets.
1 Introduction
In the present era, human activity recognition in videos [1,2,3,4,5] has become a prominent area of research in the field of computer vision. It has many daily-living applications such as patient monitoring, object tracking, threat detection, and security and surveillance [6,7,8,9]. The motivation for working in this field is to recognize human gestures, actions, and interactions in videos. Recognizing human activities in video involves several steps, such as preprocessing, segmentation, feature extraction, dimension reduction, and classification. Accurate knowledge of the publicly available datasets [10, 11] saves time: there is no need to generate a new dataset, researchers can more easily identify suitable datasets, and the focus can shift to developing new algorithms rather than gathering information about datasets. With advances in labelling algorithms, it has become feasible to label dense video datasets for activity recognition, object tracking, and scene reconstruction [12,13,14]. This work covers gesture recognition, daily-living actions and activities, sports actions, human–human interactions, and human–object interaction datasets, including both publicly available RGB and RGB-D datasets. It provides dataset specifications such as year of publication, frame rate, spatial resolution, total number of actions, number of actors (subjects) performing in the videos, and state-of-the-art results on existing benchmarks. Tables 1 and 2 provide the details of the RGB and RGB-D datasets, respectively. Before 2010, a large number of RGB video datasets were already available to this community [15,16,17]. After the advent of low-cost depth sensors, e.g. the Microsoft Kinect, there has been a drastic increase in 3D and multi-modal video datasets.
Owing to low-cost and lightweight sensors, datasets are now recorded with multiple modalities such as depth frames, accelerometer data, IR sensor frames, acoustic data, and skeleton information. RGB-D datasets with multiple modalities reduce the chance of information loss compared with traditional RGB datasets, at the cost of increased complexity [18, 19].
2 Related Work
Chaquet et al. [20] focused on 28 publicly available RGB datasets of human actions and activities. Dataset characteristics such as ground truth, number of actions/actors, views, and application areas are discussed, but their work does not cover the RGB-D datasets available at that time. Edwards et al. [3] focused on pose-based methods and presented a novel high-level activity dataset; their work gives no information about state-of-the-art accuracies on existing datasets. Wang et al. [21] discussed specific novel techniques for RGB-D-based motion recognition. Hassner [22] focused on action recognition and on the accuracy reported for most RGB datasets; the main limitation of this work is its lack of coverage of depth datasets and application areas. Firman [23] analysed depth datasets for tasks such as semantics, identification, face/pose recognition, and object tracking. Borges et al. [24] discussed the advantages and shortcomings of various methods for human action understanding. Zhang et al. [25] concentrated on RGB-D action benchmarks but did not consider pose-based or human interaction activities; in addition, they intended to cover state-of-the-art accuracies and classification techniques on specific benchmarks. Compared with the existing surveys, the primary aim of this work is to provide an accessible platform for readers.
3 Challenges in HAR Dataset
In this section, we discuss the challenges involved in RGB and RGB-D datasets. It can be noticed that dataset videos face limitations in at least one aspect, such as similarity of actions, cluttered backgrounds, viewpoint variations, illumination variations, and occlusions.
3.1 Background and Environmental Conditions
The background in videos may be of different types, such as slowly/highly dynamic, static, occluded, airy, rainy, and densely populated. It can be observed that the KTH dataset is more challenging than the Weizmann dataset owing to its changing background. The UT-Interaction, BEHAVE, and BIT-Interaction datasets were recorded in larger outdoor areas under changing natural background conditions. Datasets such as UCF Sports, UIUC, Olympic Sports, Hollywood, HMDB51, THUMOS, ActivityNet, and YouTube-8M, collected from online sources such as YouTube, Google, and various movies, are challenging because they contain both dynamic objects and dynamic backgrounds.
3.2 Similarity and Dissimilarity of Actions
The similarity between action classes in a dataset poses a fundamental challenge to researchers. Many actions appear similar in videos, such as jogging, running, and walking, and classification accuracy suffers from such near-identical actions. The same action performed by different actors also increases the complexity of a dataset; for example, the YouTube Sports-1M dataset has thousands of videos of the same action class.
3.3 Occlusion
Occlusion occurs when another object hides the object of interest. For human action and activity recognition, occlusion can be categorized as self-occlusion and occlusion by another object (partial occlusion). Depth sensors are severely affected by internal sensor noise and by self-occlusion of the performing subjects, as in the CAD-60, 50 Salads, Berkeley MHAD, UWA3D Activity, LIRIS, MSR Action Pairs, UTD-MHAD, M2I, SYSU 3D HOI, NTU RGB+D, and PKU-MMD datasets.
3.4 View Variations
The viewpoint from which an activity is recorded is a key attribute of a human activity recognition system. Multiple views carry more robust information than a single view and allow recognition that is independent of the captured view angle. However, multiple views increase complexity, since more training and test data are required for classification. KTH, Weizmann, Hollywood, UCF Sports, MSR Action 3D, and Hollywood 3D are single-view datasets. The multi-view datasets include CAD-60, CAD-120, UWA3D, Northwestern-UCLA, LIRIS, UTD-MHAD, NTU RGB+D, IXMAS, CASIA Action, UT-Interaction, BEHAVE, BIT-Interaction, and Breakfast Actions.
4 Approaches for Human Action Recognitions
Based on the methodologies used in recent years to recognize human actions and activities, existing solutions can be categorized into two major classes: handcrafted feature descriptors and deep learning approaches.
4.1 Local and Global Approaches
Early work on human action recognition was limited to pose or gesture recognition. A first step towards recognizing human actions in videos was introduced by Bobick and Davis [26]. They represented human actions using Motion History Images (MHI) and Motion Energy Images (MEI). The global MEI template is given by

\( E_{\tau } \left( {x,y,t} \right) = \bigcup\nolimits_{i = 0}^{\tau - 1} {B\left( {x,y,t - i} \right)} \)

where \( E_{\tau } \) is the MEI obtained at a particular time instant τ, and \( B\left( {x,y,t - i} \right) \) is the binary image sequence representing the detected object pixels.
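As an illustration, both templates can be computed from binary foreground masks. The following NumPy sketch (our own minimal illustration, not Bobick and Davis's implementation) builds the MEI as a union of masks and the MHI with the standard linear decay:

```python
import numpy as np

def motion_energy_image(masks, tau):
    """MEI: union (logical OR) of the last tau binary foreground masks."""
    return np.any(masks[-tau:], axis=0).astype(np.uint8)

def motion_history_image(masks, tau):
    """MHI: recent motion is bright (value tau), older motion decays by 1 per frame."""
    h = np.zeros(masks[0].shape, dtype=float)
    for b in masks:
        h = np.where(b, float(tau), np.maximum(h - 1.0, 0.0))
    return h

# Toy sequence: a 1-pixel blob moving right across a 5x5 frame
masks = np.zeros((4, 5, 5), dtype=bool)
for t in range(4):
    masks[t, 2, t] = True

mei = motion_energy_image(masks, tau=4)   # where motion occurred at all
mhi = motion_history_image(masks, tau=4)  # how recently it occurred
```

The MEI records *where* motion happened, while the MHI additionally encodes *when*: the most recent blob position holds the maximum value and earlier positions fade.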
The local representation based on space-time interest points (STIPs) for action recognition was introduced by Laptev [27]. The local 3D Harris operator [23] shows good performance in recognizing 3D data objects with a small number of interest points and is widely used in computer vision applications. It is based on the local autocorrelation function, defined as

\( f\left( {x,y} \right) = \sum\nolimits_{{\left( {x_{i} ,y_{i} } \right) \in W}} {W\left( {x_{i} ,y_{i} } \right)\left( {I\left( {x_{i} ,y_{i} } \right) - I\left( {x,y} \right)} \right)^{2} } \)

where \( I\left( { \cdot , \cdot } \right) \) is the image function and \( \left( {x_{i} ,y_{i} } \right) \) are the points in the Gaussian window W centred on \( \left( {x, y} \right) \), which defines the neighbourhood under analysis.
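To make the idea concrete, the classic 2D Harris response can be sketched in plain NumPy. This is our own minimal illustration (window size and corner constant `k` are hypothetical choices, and a box window stands in for the Gaussian weighting for brevity):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_filter(a, k=5):
    """Mean over a k x k window at every pixel (edge-padded)."""
    pad = k // 2
    ap = np.pad(a, pad, mode="edge")
    return sliding_window_view(ap, (k, k)).mean(axis=(-1, -2))

def harris_response(image, k=0.04, win=5):
    """Harris corner response R = det(M) - k*trace(M)^2, where M is the
    windowed second-moment matrix of the image gradients."""
    iy, ix = np.gradient(image.astype(float))
    ixx = box_filter(ix * ix, win)
    iyy = box_filter(iy * iy, win)
    ixy = box_filter(ix * iy, win)
    return ixx * iyy - ixy ** 2 - k * (ixx + iyy) ** 2

# A white square on a black background: corners respond strongly,
# straight edges respond weakly (negative R).
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
r = harris_response(img)
```

Laptev's STIP detector extends exactly this second-moment-matrix idea from the spatial domain (x, y) to the spatio-temporal domain (x, y, t).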
4.2 Deep Learning Approaches
After 2012, deep architectures achieved their initial successes with supervised approaches, overcoming the vanishing-gradient problem through ReLU activations and GPUs (which reduced time complexity). Deep learning is data-driven and struggles when training samples are scarce; hence, for small activity datasets, local and global feature extractors remain good and efficient choices for classification.
Li et al. [28] showed that 3D convolutional networks outperform their 2D frame-based counterparts by a noticeable margin. The 3D convolution value at position \( \left( {x, y,z} \right) \) on the \( j^{\rm th} \) feature map in the \( i^{\rm th} \) layer is defined as

\( v_{ij}^{xyz} = \tanh \left( {b_{ij} + \sum\nolimits_{m} {\sum\nolimits_{p = 0}^{{P_{i} - 1}} {\sum\nolimits_{q = 0}^{{Q_{i} - 1}} {\sum\nolimits_{r = 0}^{{R_{i} - 1}} {w_{ijm}^{pqr} v_{{\left( {i - 1} \right)m}}^{{\left( {x + p} \right)\left( {y + q} \right)\left( {z + r} \right)}} } } } } } \right) \)
where \( R_{i} \) is the size of the 3D kernel along the temporal dimension, \( w_{ijm}^{pqr} \) is the \( \left( {p,q,r} \right)^{\rm th} \) value of the kernel connected to the \( m^{\rm th} \) feature map in the previous layer, and \( b_{ij} \) is the bias of the feature map. Karpathy et al. [29] proposed the concept of slow fusion to increase the temporal awareness of a convolutional network. Donahue et al. [30] addressed action recognition through a cascaded CNN and a class of recurrent neural networks (RCNN), also known as Long Short-Term Memory (LSTM) networks, whose underlying recurrence is given as
\( h^{\left( t \right)} = \sigma \left( {W^{hx} x^{\left( t \right)} + W^{hh} h^{\left( {t - 1} \right)} + b_{h} } \right), \quad z^{\left( t \right)} = \sigma \left( {W^{zh} h^{\left( t \right)} + b_{z} } \right) \)

Here, \( x^{\left( t \right)} { \in {\mathbb{R}}}^{d} \) is the external input signal, \( z^{\left( t \right)} { \in {\mathbb{R}}}^{m} \) the output signal, and \( h^{\left( t \right)} { \in {\mathbb{R}}}^{r} \) the hidden state, with σ a sigmoidal non-linearity. Recurrent neural networks have been found to be among the best models for video activity analysis.
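A vanilla-RNN step consistent with the state definitions above can be sketched in NumPy. This is a minimal illustration with tanh non-linearities and arbitrary dimensions, not Donahue et al.'s LSTM implementation (an LSTM adds gating to this basic recurrence):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, W_zh, b_h, b_z):
    """One recurrence step: update the hidden state from the input and
    the previous state, then emit an output from the new state."""
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)  # hidden state h^(t)
    z_t = np.tanh(W_zh @ h_t + b_z)                  # output signal z^(t)
    return h_t, z_t

rng = np.random.default_rng(0)
d, r, m = 4, 8, 3                        # input, hidden, output dimensions
W_hx = rng.standard_normal((r, d)) * 0.1
W_hh = rng.standard_normal((r, r)) * 0.1
W_zh = rng.standard_normal((m, r)) * 0.1
b_h, b_z = np.zeros(r), np.zeros(m)

h = np.zeros(r)                          # initial hidden state
for t in range(5):                       # unroll over a short frame sequence
    x = rng.standard_normal(d)           # per-frame feature (e.g. CNN output)
    h, z = rnn_step(x, h, W_hx, W_hh, W_zh, b_h, b_z)
```

In a cascaded CNN–RNN video model, each `x` would be the CNN feature vector of one frame, so `h` accumulates temporal context across the clip.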
5 Discussion
In this section, we briefly discuss the advantages and disadvantages of both 2D and 3D datasets.
5.1 Advantages of RGB and RGB-D Dataset
It can be observed from Table 3 that traditional human activity datasets were recorded with a small number of actions in segmented videos under somewhat controlled conditions. Some benchmarks, collected from online media such as YouTube, movies, and social video-sharing sites, represent realistic action scenes that are more practical for real-life applications. UCF101 is the largest dataset in terms of number of classes and video clips, ahead of UCF11, UCF50, Olympic Sports, and HMDB51. ActivityNet is a large-scale RGB video dataset captured with complete annotation labels and bounding boxes. The 3D datasets have an advantage over visual 2D datasets in being less sensitive to illumination, as they are captured with multiple sensor systems such as visual, acoustic, and inertial sensors. It can be observed from Table 4 that fusing information from different sensors increases recognition accuracy on depth datasets at the cost of increased complexity. The Online RGB-D Action dataset was recorded in a living-room environment and is used for cross-environment and real-time online action recognition. The NTU RGB+D dataset has the largest number of actions/actors among existing datasets and was captured with multiple modalities and camera views. PKU-MMD is a large-scale benchmark focused on continuous, multi-modal 3D complex human activities with complete annotation information, making it suitable for deep learning methods.
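The accuracy gain from fusing sensor modalities is commonly obtained by combining per-modality classifier scores. The following is a minimal late-fusion sketch with hypothetical class scores (not tied to any specific dataset or published method):

```python
import numpy as np

def late_fusion(score_lists, weights=None):
    """Score-level fusion: weighted average of per-modality class
    probabilities; the predicted class is the argmax of the fused scores."""
    scores = np.stack(score_lists)                    # (modalities, classes)
    if weights is None:
        weights = np.full(len(score_lists), 1.0 / len(score_lists))
    fused = np.average(scores, axis=0, weights=weights)
    return fused, int(np.argmax(fused))

# Hypothetical softmax outputs for 3 action classes from two streams
rgb_scores   = np.array([0.5, 0.3, 0.2])  # RGB stream: favours class 0
depth_scores = np.array([0.2, 0.6, 0.2])  # depth stream: favours class 1
fused, pred = late_fusion([rgb_scores, depth_scores])
```

Here the fused scores are [0.35, 0.45, 0.2], so the depth evidence overturns the RGB-only decision; feature-level (early) fusion is the main alternative, trading this simplicity for richer cross-modal interactions.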
5.2 Disadvantages of RGB and RGB-D Dataset
Currently, many video datasets exist; nevertheless, there are limitations in automatically recognizing and classifying human activities. The main reasons, in at least one form, are the number of samples per action, the length of the clips, the capture conditions, background clutter, and viewpoint changes. The 2D datasets range from a small number of simple actions to complex actions with a broad range of applications, and they face more challenges than depth datasets, such as view variations, intra-class variations, cluttered backgrounds, partial occlusions, and camera movement. The RGB-D datasets face limitations of low resolution, few training samples, a limited number of camera views, limited variety of actions and subjects, and low precision. Early RGB-D datasets captured single-action video frames under controlled indoor or lab environments; MSR Action 3D, for example, is restricted to depth frames of gaming actions only. The Northwestern-UCLA dataset was recorded with multiple Kinect sensors simultaneously to collect multi-view representations, and handling and synchronizing all the sensor data simultaneously is itself a challenge.
6 Conclusion
A review of the various state-of-the-art human action datasets has been presented. The datasets have been categorized into two major groups, RGB and RGB-D, and their specifications and the challenges they pose have been discussed. Conventional RGB datasets face the problems of cluttered backgrounds, illumination variations, camera motion, viewpoint changes, and occlusions. It remains a challenge to design feature descriptors for activity recognition that cope with changing real-world environments. Robust evaluation techniques for cross-dataset validation are required, which will be useful for realistic application scenarios.
References
Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: a review. ACM Comput. Surv. 43, 1–43 (2011)
Vishwakarma, S., Agrawal, A.: A survey on activity recognition and behavior understanding in video surveillance. Vis. Comput. 29, 983–1009 (2013)
Edwards, M., Deng, J., Xie, X.: From pose to activity: surveying datasets and introducing CONVERSE. Comput. Vis. Image Underst. 144, 73–105 (2016)
Dawn, D.D., Shaikh, S.H.: A comprehensive survey of human action recognition with spatiotemporal interest point (STIP) detector. Vis. Comput. 32, 289–306 (2016)
Bux, A., Angelov, P., Habib, Z.: Vision-based human activity recognition: a review. Adv. Comput. Intell. Syst. 513, 341–371 (2016)
Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: Tenth IEEE International Conference on Computer Vision. Beijing (2005)
Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Proceedings of the European Conference on Computer Vision (2006)
Xu, W., Miao, Z., Zhang, X.P., Tian, Y.: A hierarchical spatio-temporal model for human activity recognition. IEEE Trans. Multimedia 99, 1 (2017)
Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. Boston (2015)
Ryoo, M.S., Chen, C.C., Aggarwal, J., Chowdhury, A.R.: An overview of contest on semantic description of human activities. In: Recognizing Patterns in Signals, Speech, Images and Videos. vol. 6388 (2010)
Vishwakarma, D.K., Singh, K.: Human activity recognition based on spatial distribution of gradients at sub-levels of average energy silhouette images. IEEE Trans. Cogn. Dev. Syst. 9(4), 316–327 (2017)
Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco (2010)
Kong, Y., Liang, W., Dong, Z., Jia, Y.: Recognizing human interaction from videos by a discriminative model. IET Comput. Vis. 8, 277–286 (2014)
Ni, B., Moulin, P., Yang, X., Yan, S.: Motion part regularization: Improving action recognition via trajectory group selection. In: IEEE Conference on Computer Vision and Pattern Recognition. Boston (2015)
Aggarwal, J.K., Xia, L.: Human activity recognition from 3D data: a review. Pattern Recogn. Lett. 48 (2013)
Lun, R., Zhao, W.: A survey of applications and human motion recognition with Microsoft Kinect. Int. J. Pattern Recogn. Artif. Intell. 29 (2015)
Presti, L.L., Cascia, M.L.: 3D skeleton-based human action classification: a survey. Pattern Recogn. 53, 130–147 (2016)
Zhang, J., Li, W., Ogunbona, P.O., Wang, P., Tang, C.: RGB-D based action recognition datasets: a survey. Pattern Recogn. 60, 86–105 (2016)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Proceedings of the Advances in Neural Information Processing Systems. (2014)
Chaquet, J.M., Carmona, E.J., Caballero, A.F.: A survey of video datasets for human action and activity recognition. Comput. Vis. Image Underst. 117, 633–659 (2013)
Wang, P., Li, W., Ogunbona, P.O., Escalera, S.: RGB-D-based motion recognition with deep learning: a survey. Int. J. Comput. Vis. (2017)
Hassner, T.: A critical review of action recognition benchmarks. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. Portland (2013)
Firman, M.: RGBD datasets: past, present and future. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2016)
Borges, P.-V.K., Conci, N., Cavallaro, A.: Video-based human behavior understanding: a survey. IEEE Trans. Circuits Syst. Video Technol. 23, 1993–2008 (2013)
Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23, 257–267 (2001)
Laptev, I.: On space-time interest points. Int. J. Comput. Vision 64, 107–123 (2005)
Li, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 221–231 (2013)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition. Columbus (2014)
Donahue, J., Hendricks, L., Guadarrama, S., Rohrbach, M.V., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
Sipiran, I., Bustos, B.: Harris 3D: a robust extension of the Harris operator for interest point detection on 3D meshes. Vis. Comput. 27 (2011)
© 2019 Springer Nature Singapore Pte Ltd.
Singh, T., Vishwakarma, D.K. (2019). Human Activity Recognition in Video Benchmarks: A Survey. In: Rawat, B., Trivedi, A., Manhas, S., Karwal, V. (eds) Advances in Signal Processing and Communication . Lecture Notes in Electrical Engineering, vol 526. Springer, Singapore. https://doi.org/10.1007/978-981-13-2553-3_24
Print ISBN: 978-981-13-2552-6
Online ISBN: 978-981-13-2553-3