
1 Introduction

In the present era, human activity recognition in videos [1,2,3,4,5] has become a prominent area of research in the field of computer vision. It has many daily-living applications such as patient monitoring, object tracking, threat detection, and security and surveillance [6,7,8,9]. The motivation for working in this field is to recognize human gestures, actions and interactions in videos. Recognizing human activities in video involves several steps, such as preprocessing, segmentation, feature extraction, dimension reduction and classification. Accurate knowledge of the publicly available datasets [10, 11] saves time: there is no need to generate a new dataset, it becomes easier for researchers to identify suitable datasets, and the key focus can shift to developing new algorithms rather than gathering information about datasets. With the advancement of labelling algorithms, it has become possible to label dense video datasets for activity recognition, object tracking, and scene reconstruction [12,13,14]. This work covers datasets for gesture recognition, daily-living actions or activities, sports actions, human–human interactions and human–object interactions. It includes both RGB and RGB-D publicly available datasets and provides dataset specifications such as year of publication, frame rate, spatial resolution, total number of actions, number of actors (subjects) performing in the videos, and state-of-the-art results on existing benchmarks. Tables 1 and 2 provide the details of the RGB and RGB-D datasets, respectively. Before 2010, a large number of RGB video datasets were available to this community [15,16,17]. With the advent of low-cost depth sensors, e.g. the Microsoft Kinect, there has been a drastic increase in 3D and multi-modal video datasets. Because these sensors are low-cost and lightweight, datasets are recorded with multiple modalities such as depth frames, accelerometer data, IR sensor frames, acoustic data, and skeleton information. Compared with traditional RGB datasets, RGB-D datasets with multiple modalities reduce the chance of information loss in videos, at the cost of increased complexity [18, 19].

Table 1 RGB (2D) video dataset
Table 2 RGB-D (3D) video dataset

2 Related Work

Chaquet et al. [20] surveyed 28 publicly available RGB datasets of human actions and activities. They discussed dataset characteristics such as ground truth, number of actions/actors, views and areas of application, but their work does not cover the RGB-depth datasets available at that time. Edwards et al. [3] focused on pose-based methods and presented a novel high-level activity dataset; their work gives no information about state-of-the-art accuracies on existing datasets. Wang et al. [21] discussed specific novel techniques for RGB-D-based motion recognition. T. Hassner [22] focused on action recognition and the accuracies reported on most of the RGB datasets; the main limitation of this work is that it does not cover depth datasets or their areas of application. M. Firman [23] analysed depth datasets for tasks such as semantics, identification, face/pose recognition and object tracking. Borges et al. [24] discussed the advantages and shortcomings of various methods for human action understanding. Zhang et al. [25] focused on RGB-D action benchmarks but did not consider pose-based or human-interaction activities; they also intended to cover state-of-the-art accuracies and classification techniques on specific benchmarks. Compared with the existing surveys, the primary aim of this work is to provide an accessible platform for readers.

3 Challenges in HAR Dataset

In this section, we discuss the challenges involved in RGB and RGB-D datasets. It can be noticed that dataset videos face limitations in at least one aspect, such as similarity of actions, cluttered backgrounds, viewpoint variations, illumination variations and occlusions.

3.1 Background and Environmental Conditions

The background in videos may be of several types, such as slowly or highly dynamic, static, occluded, airy, rainy or densely populated. It can be observed that the KTH dataset is more challenging than the Weizmann dataset because of its changing background. The UT-Interaction, BEHAVE and BIT-Interaction datasets were recorded in large outdoor areas under changing natural background conditions. Datasets such as UCF Sports, UIUC, Olympic Sports, Hollywood1, HMDB51, THUMOS, ActivityNet and YouTube-8M, collected from online sources such as YouTube, Google and various movies, are challenging because both the objects and the backgrounds are dynamic.

3.2 Similarity and Dissimilarity of Actions

The similarity between action classes in a dataset poses a fundamental challenge to researchers. Many actions appear similar in videos, such as jogging, running and walking, and classification accuracy suffers when action classes look alike. The same action performed by different actors also increases the complexity of a dataset; for example, the YouTube Sports-1M dataset has thousands of videos of the same action class.

3.3 Occlusion

Occlusion occurs when another object hides the object of interest. For human action and activity recognition, occlusion can be categorized as self-occlusion and occlusion by another object (partial occlusion). Depth sensors are severely affected by internal sensor noise and by self-occlusion of the performing subjects, as in the CAD-60, 50 Salads, Berkeley MHAD, UWA3D Activity, LIRIS, MSR Action Pairs, UTD-MHAD, M2I, SYSU-3D HOI, NTU RGB+D and PKU-MMD datasets.

3.4 View Variations

The viewpoint from which an activity is recorded is a key attribute of a human activity recognition system. Multiple views carry more robust information than a single view and make recognition less dependent on the capture angle. However, multiple views increase complexity, since more training as well as test data is required for classification. KTH, Weizmann, Hollywood, UCF Sports, MSR Action 3D and Hollywood 3D are single-view datasets, whereas CAD-60, CAD-120, UWA3D, Northwestern-UCLA, LIRIS, UTD-MHAD, NTU RGB+D, IXMAS, CASIA Action, UT-Interaction, BEHAVE, BIT-Interaction and Breakfast Actions are multi-view datasets.

4 Approaches for Human Action Recognition

Based on the methodologies used in recent years to recognize human actions and activities, existing solutions can be grouped into two major categories: handcrafted feature descriptors and deep learning approaches.

4.1 Local and Global Approaches

Early work on human action recognition was largely limited to pose or gesture recognition. The first step towards recognizing human actions in videos was introduced by Bobick and Davis [26]. They represented human actions using Motion History Images (MHI) and Motion Energy Images (MEI). The global MEI template is given by

$$ E_{\tau } \left( {x, y, t} \right) = \mathop \sum \limits_{i = 0}^{\tau - 1} B\left( {x,y,t - i} \right), $$
(1)

where \( E_{\tau } \) is the MEI obtained over a temporal window of duration \( \tau \), and \( B\left( {x,y,t - i} \right) \) is the binary image sequence representing the detected object pixels.
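To make Eq. (1) concrete, the following is a minimal NumPy sketch that builds an MEI from binary foreground masks, together with the standard Bobick–Davis MHI update for comparison; the frame-differencing step and the threshold value are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def motion_energy_image(masks, tau):
    """Eq. (1): aggregate the last tau binary masks B(x, y, t - i)."""
    window = np.stack(masks[-tau:])            # i = 0 .. tau - 1
    return np.any(window, axis=0).astype(np.uint8)

def motion_history_image(masks, tau):
    """Standard Bobick-Davis MHI: a recency-weighted motion template."""
    h = np.zeros_like(masks[0], dtype=np.float32)
    for b in masks:
        # Set to tau where motion is present, otherwise decay towards zero.
        h = np.where(b > 0, float(tau), np.maximum(0.0, h - 1.0))
    return h

# Toy usage: binary masks from simple frame differencing of a random "video".
frames = np.random.rand(12, 64, 64)
masks = [(np.abs(frames[t] - frames[t - 1]) > 0.5).astype(np.uint8)
         for t in range(1, len(frames))]
mei = motion_energy_image(masks, tau=8)
mhi = motion_history_image(masks, tau=8)
```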

Local space-time interest points (STIPs) for action recognition were introduced by Laptev et al. [27]. The local 3D Harris operator [23] shows good performance in recognizing 3D data objects with a small number of interest points and is widely used in computer vision applications. It is based on the local autocorrelation function, defined as

$$ e\left( {x,y} \right) = \mathop \sum \limits_{{x_{i} ,y_{i} }} W\left( {x_{i} ,y_{i} } \right)\left[ {I\left( {x_{i} +\Delta x,\, y_{i} +\Delta y} \right) - I\left( {x_{i} ,y_{i} } \right)} \right]^{2} , $$
(2)

where \( I\left( { \cdot , \cdot } \right) \) is the image function and \( \left( {x_{i} ,y_{i} } \right) \) are the points within the Gaussian window \( W \) centred on \( \left( {x, y} \right) \), which defines the neighbourhood under analysis.
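As a rough illustration of Eq. (2), the sketch below approximates the windowed autocorrelation with the usual structure-tensor formulation of the Harris response; the Gaussian width sigma and the constant k are illustrative choices rather than values from the cited work.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(image, sigma=1.5, k=0.04):
    """Harris corner measure from the Gaussian-weighted autocorrelation matrix."""
    iy, ix = np.gradient(image.astype(np.float64))   # image derivatives
    # Entries of the second-moment (autocorrelation) matrix, window W = Gaussian.
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy ** 2                       # determinant of the 2x2 matrix
    trace = ixx + iyy
    return det - k * trace ** 2                      # large at corner-like points

# Interest points are local maxima of the response above a threshold.
img = np.random.rand(128, 128)
r = harris_response(img)
points = np.argwhere(r > 0.01 * r.max())
```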

4.2 Deep Learning Approaches

After 2012, deep architectures achieved their initial successes with supervised approaches, which overcame the vanishing gradient problem by using ReLU activations and reduced training times through GPUs. Deep learning is data-driven and suffers when training samples are scarce; hence, for small activity datasets, local and global feature extractors remain effective and efficient for classification.

Li et al. [28] showed that 3D convolutional networks outperform their 2D frame-based counterparts by a noticeable margin. The 3D convolution value at position \( \left( {x, y,z} \right) \) on the \( j^{\rm th} \) feature map in the \( i^{\rm th} \) layer is defined as

$$ v_{ij}^{xyz} = { \tanh }\left( {b_{ij} + \mathop \sum \limits_{m} \mathop \sum \limits_{p = 0}^{{P_{i} - 1}} \mathop \sum \limits_{q = 0}^{{Q_{i} - 1}} \mathop \sum \limits_{r = 0}^{{R_{i} - 1}} w_{ijm}^{pqr} v_{{\left( {i - 1} \right)m}}^{{\left( {x + p} \right)\left( {y + q} \right)\left( {z + r} \right)}} } \right), $$
(3)

where \( R_{i} \) is the size of the 3D kernel along the temporal dimension and \( w_{ijm}^{pqr} \) is the \( \left( {p,q,r} \right)^{\rm th} \) value of the kernel connected to the \( m^{\rm th} \) feature map in the previous layer. Karpathy et al. [29] proposed the concept of slow fusion to increase the temporal awareness of a convolutional network. Donahue et al. [30] addressed action recognition through a cascaded CNN and a class of recurrent neural networks, commonly instantiated as Long Short-Term Memory (LSTM) networks; the basic recurrent update is given as

$$ h^{\left( t \right)} = \sigma \left( {w_{x} x^{\left( t \right)} + w_{h} h^{{\left( {t - 1} \right)}} } \right) $$
(4)
$$ z^{\left( t \right)} = \sigma \left( {w_{z} h^{\left( t \right)} } \right) $$
(5)
$$ w_{x} { \in {\mathbb{R}}}^{r \times d} ,\quad w_{h} { \in {\mathbb{R}}}^{r \times r} ,\quad w_{z} { \in {\mathbb{R}}}^{m \times r} $$
(6)

Here, \( x^{\left( t \right)} { \in {\mathbb{R}}}^{d} \) is the external input signal, \( z^{\left( t \right)} { \in {\mathbb{R}}}^{m} \) is the output signal, and \( h^{\left( t \right)} { \in {\mathbb{R}}}^{r} \) is the hidden state. Recurrent neural networks have proven to be among the best models for video activity analysis.
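To tie Eqs. (3)–(5) together, here is a minimal NumPy sketch in which a single 3D convolution unit (with the tanh nonlinearity of Eq. (3)) feeds a simple recurrent update per time step (Eqs. (4) and (5)); all shapes, weights and the toy data are illustrative assumptions, not the architectures of [28, 29, 30].

```python
import numpy as np

def conv3d_unit(v_prev, w, b):
    """Eq. (3): out[x,y,z] = tanh(b + sum_m sum_pqr w[m,p,q,r] * v_prev[m,x+p,y+q,z+r])."""
    m_, p_, q_, r_ = w.shape
    _, X, Y, Z = v_prev.shape
    out = np.zeros((X - p_ + 1, Y - q_ + 1, Z - r_ + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                patch = v_prev[:, x:x + p_, y:y + q_, z:z + r_]
                out[x, y, z] = np.tanh(b + np.sum(w * patch))
    return out

def rnn_step(x_t, h_prev, w_x, w_h, w_z, sigma=np.tanh):
    """Eqs. (4)-(5): hidden-state update and output projection."""
    h_t = sigma(w_x @ x_t + w_h @ h_prev)
    z_t = sigma(w_z @ h_t)
    return h_t, z_t

# Toy cascade: convolve each clip, flatten to x^(t), run the recurrence.
rng = np.random.default_rng(0)
w3d, b3d = rng.standard_normal((2, 3, 3, 3)) * 0.1, 0.0
d = 6 * 6 * 6                                   # flattened conv output size
r, m = 16, 4                                    # hidden and output sizes
w_x, w_h, w_z = (rng.standard_normal(s) * 0.1 for s in ((r, d), (r, r), (m, r)))
h = np.zeros(r)
for t in range(5):                              # five clips of the "video"
    clip = rng.standard_normal((2, 8, 8, 8))    # m = 2 input feature maps
    x_t = conv3d_unit(clip, w3d, b3d).ravel()   # x^(t), dimension d
    h, z = rnn_step(x_t, h, w_x, w_h, w_z)      # h^(t), z^(t)
```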

5 Discussion

In this section, we briefly discuss the advantages and disadvantages of both 2D and 3D datasets.

5.1 Advantages of RGB and RGB-D Dataset

It can be observed from Table 3 that traditional human activity datasets were recorded with a small number of actions in segmented videos under somewhat controlled conditions. Some benchmarks collected from online media such as YouTube, movies and social video-sharing sites represent realistic action scenes, which are more practical for real-life applications. The UCF101 dataset is the largest in terms of number of classes and video clips compared with the UCF11, UCF50, Olympic Sports and HMDB51 datasets. ActivityNet is a large-scale RGB video dataset captured with complete annotation labels and bounding boxes. The 3D datasets have an advantage over visual 2D datasets in being less sensitive to illumination, because they are captured with multi-sensor systems comprising visual, acoustic and inertial sensors. It can be observed from Table 4 that fusing information from different sensors increases recognition accuracy on depth datasets at the cost of increased complexity. The 3D Online RGB-D Action dataset was recorded in a living-room environment and is used for cross-environment and real online action recognition. The NTU RGB+D dataset has the largest number of actions/actors among existing datasets and was captured with multiple modalities and different camera views. PKU-MMD is a large-scale benchmark focused on continuous, multi-modality 3D complex human activities with complete annotation information, and it is suitable for deep learning methods.

Table 3 Technical specification RGB and RGB-D dataset
Table 4 RGB and RGB-D dataset with state-of-the-art accuracy and techniques

5.2 Disadvantages of RGB and RGB-D Dataset

Currently, many video datasets exist; despite this, there are limitations in automatically recognizing and classifying human activities. The main reasons for these limitations, in at least one form, are the number of samples per action, the length of the clips, the capture conditions, background clutter, viewpoint changes and the nature of some activities. The 2D datasets range from a small number of actions to complex actions with a broad range of applications. They face more challenges, such as view variations, intra-class variations, cluttered backgrounds, partial occlusions and camera movement, than depth datasets. The RGB-D datasets face limitations of low resolution, few training samples, a limited number of camera views and actions, few subjects and low precision. Early RGB-D datasets captured single-action video frames under controlled indoor or lab environments. MSR Action 3D is restricted to depth frames of gaming actions only. The Northwestern-UCLA dataset was recorded with more than one Kinect sensor at the same time to collect multi-view representations, and it becomes a challenge to handle and synchronize all sensor data simultaneously.

6 Conclusion

A review of the various state-of-the-art datasets on human action has been presented. Human action datasets have been categorized into two major categories: RGB and RGB-D datasets. The challenges involved and the specifications of these datasets have been discussed. Conventional RGB datasets face the problems of cluttered backgrounds, illumination variations, camera motion, viewpoint changes and occlusions. Designing feature descriptors for activity recognition that cope with changing real-world environments remains a challenge. Robust evaluation techniques for cross-dataset validation are required, which will be useful for realistic application scenarios.