Abstract
Human Action Recognition (HAR) involves monitoring human activity in diverse areas such as medicine, education, entertainment, visual surveillance, video retrieval, and abnormal activity identification, to name a few. Due to the increasing use of cameras, automated systems are in demand for classifying such activities using computationally intelligent techniques such as Machine Learning (ML) and Deep Learning (DL). In this survey, we discuss various ML and DL techniques for HAR for the years 2011–2019. The paper discusses the characteristics of public datasets used for HAR. It also surveys various action recognition techniques along with HAR applications, namely content-based video summarization, human–computer interaction, education, healthcare, video surveillance, abnormal activity detection, sports, and entertainment. The advantages and disadvantages of action representation, dimensionality reduction, and action analysis methods are also provided. Finally, the paper discusses challenges and future directions for HAR.
1 Introduction
Action in Human Action Recognition (HAR) consists of an entity that can be observed using either the human eye or some sensing technology. For example, an action such as walking requires a person in the field of view to be continuously observed. Depending on the engaged body parts for action, human activities can be grouped into four categories (Aggarwal and Ryoo 2011).
-
Gesture: It is based on the movement of the hands, face, or other body parts, and does not require verbal communication.
-
Action: It consists of movements conducted by a person such as walking or running.
-
Interaction: It involves actions executed by two actors; it may consist of interaction with an object or with another person.
-
Group activity: It can be a mixture of gestures, actions, or interactions. The number of performers is at least two, possibly with interacting objects.
HAR is considered an active research area due to applications such as content-based video analysis and retrieval, visual surveillance, Human–Computer Interface (HCI), education, medicine, and abnormal activity recognition. Further discussion of these applications is provided in Sect. 4. In HAR, the action recognition task can be divided into action representation and action analysis. Actions are acquired using different types of sensors such as RGB, range, radar, or wearable sensors. Manual HAR, for instance identifying abnormal activity from a video recording, requires a substantial amount of time. Such tasks are expensive and difficult, as human operators are necessary across multi-camera views (Singh et al. 2010). Moreover, round-the-clock monitoring of an area of interest is tedious and may introduce human errors. To address these issues, automated modeling of human actions can be used.
Automated modeling of action(s) involves the process of mapping a particular action to a label that describes an instance of that action. Such actions may be performed by different agents (i.e., humans) under varying speed, lighting conditions, and diverse viewpoints. On the other hand, the fully-automated HAR systems have several challenges such as clutter in the background, occlusion, variation in viewpoint, scale, appearance, as well as external conditions for video recording (Thi et al. 2010). For instance, the task of person localization i.e., determining the location and size of a person, would be difficult in the dynamic recording condition (Thi et al. 2010). A considerable amount of research work has been carried out for HAR. To study the developments and recent updates, we conduct this survey focusing on video-based HAR; we provide a complete process of action representation, dimensionality reduction, and action analysis techniques; we also discuss the datasets and remarkable applications of HAR. The primary motivation of this comprehensive survey is to analyze different aspects of HAR along with the Machine Learning (ML) and Deep Learning (DL) techniques, the significance of the datasets and their potential applications. We discuss the challenges associated with HAR and provide potential future directions.
1.1 Prior survey
HAR has been a research interest for various groups over the past years. For the action recognition task, action representation techniques include feature extraction methods and feature descriptors; action analysis may be carried out using traditional ML and/or DL techniques. While we conduct a survey on the existing approaches of HAR based on different applications, we compare our survey with the existing surveys based on categories such as feature, dimensionality reduction, and action classification as shown in Table 1. This section groups prior surveys and discusses the applied field of HAR.
1.1.1 Still image-based action recognition
The main focus of still image-based HAR is on identifying the action of a person from a single image without considering temporal information for action characterization.
One of the surveys on still image-based action recognition is presented in Guo and Lai (2014). Here, different methods such as ML and DL are discussed for low-level feature extraction and high-level representation of actions; various datasets along with their characteristics are also presented. On the other hand, Vrigkas et al. (2015) presented a survey on HAR using still-image representation, wherein HAR techniques are divided into two categories, namely unimodal and multimodal activity recognition, depending on the type of modality used by the data.
1.1.2 Action representation and analysis-based HAR
For HAR, a step-by-step strategy can be used which involves feature representation using feature extraction techniques and action classification techniques. In Poppe (2010), HAR is discussed by considering actions involving full-body movement whereas excluding environment and interactions with other humans. Also, action representation and classification tasks are presented.
In Turaga et al. (2008), action classification task is discussed by considering representation and recognition of the actions or activities. Different mechanisms to learn the actions from the video are presented; the study has separately defined terms “action” and “activity” and presented an overview of classification techniques. Authors have also discussed approaches for modeling atomic action classes. Moreover, methods to model actions with more complex dynamics are discussed.
The study of handcrafted and learning-based representations is presented in Zhu et al. (2016a), and various advances in handcrafted representation techniques are discussed. The paper covers different features including Spatio-Temporal Volume-based, depth image-based, and trajectory-based methods. The architecture of the 3D Convolutional Neural Network (CNN, also known as ConvNet) is also presented.
In Herath et al. (2017), various action recognition methods based on handcrafted and DL techniques are reviewed, along with their architectures. Subsequently, in Aggarwal and Ryoo (2011), feature extraction methods from input video are presented; multi-person action recognition is reviewed using hierarchical recognition methods including statistical (state-based models), syntactic (grammar-based methods), and description-based approaches (describing activities and sub-activities). On the other hand, Presti and La Cascia (2016) highlight work on 3D skeleton-based approaches along with their challenges. Preprocessing methods, descriptors for skeleton-based data, datasets, and validation methods with performance evaluation techniques are also discussed.
1.1.3 Abnormal activity detection
Video surveillance can be used by organizations to manage gatherings, prevent crime, or inspect crime scenes. Visual surveillance systems depend on anomalous event detection. One of the early works, Popoola and Wang (2012), discussed abnormal activity detection for crowd monitoring; this survey also covers action recognition and event detection.
Survey on abnormal activity detection is presented in Mabrouk and Zagrouba (2018). The survey is divided into two parts including behavior representation (features and descriptors) and behavior modeling (training and learning methods). Datasets and performance measures for abnormal activity detection based methods are also presented in Mabrouk and Zagrouba (2018).
1.1.4 Sensor-based activity recognition
Sensor-based HAR focuses on data received from accelerometers, gyroscopes, and Bluetooth. Action classification can be treated as a pattern recognition problem (Wang et al. 2018). The modality of the data is characterized by the different modes of activity or occurrence it captures.
In Wang et al. (2018), a survey of sensor-based activity recognition methods using DL techniques is presented. High-level features are automatically learned using DL techniques for sensor data. In Nweke et al. (2019), fusion of data from mobile and wearable devices with multiple classifiers is discussed. DL techniques for HAR are discussed with applications and open research issues.
1.2 Motivation
With a motivation to discuss state-of-the-art HAR techniques, we have considered classification approaches, their advantages, challenges, datasets, and applications of HAR. This paper discusses ML and DL techniques for HAR and gives a brief description of various features. We have also included potential future work on HAR in terms of ML, DL, and hybrid techniques. The significance of this survey can be seen from Table 1; to cover different aspects of HAR, we discuss the holistic approach of action recognition that includes action representation and action analysis for various modalities such as RGB, depth, and skeleton. In this paper, we also discuss recent datasets that depict the actions of daily living. To the best of our knowledge, recent datasets for HAR have been explored to a limited extent in the field of action recognition. A summary of the existing surveys based on their highlights and the important inferences gained from each of them is provided in Table 2; we also mention the expected highlights and inferences of our survey that can be helpful to the reader. It must be noted that, to provide a focused review on HAR, we have restricted our survey to trimmed video sequences.
The major contributions of our paper are as follows.
-
The paper discusses various feature extraction and encoding techniques for HAR including shape, texture, trajectory, depth, as well as others.
-
Various dimensionality reduction methods for the extracted features are described. ML and DL techniques for action analysis are also presented.
-
The considerable advantages and disadvantages of different methods for action representation, dimensionality reduction, and action analysis for HAR are provided.
-
The paper summarizes the recent advances in HAR along with various applications, challenges, and future directions.
The roadmap of the paper is organized as follows. Section 2 presents HAR as a complete process including action representation, dimensionality reduction, and action analysis; datasets used in action classification; their properties along with a discussion of the recent datasets are discussed in Sect. 3; Sect. 4 covers applications of HAR; challenges and future directions are discussed in Sects. 5 and 6, respectively; concluding remarks are given in Sect. 7.
2 HAR: a complete process
HAR is used for analyzing activities from video. Once video data is captured, it is processed to meet the requirements of the underlying application. A generic system for HAR is graphically represented in Fig. 1; it provides an overview of the general steps including data collection, preprocessing, feature extraction and/or encoding, optional dimensionality reduction, and dataset preparation for training and testing. The resulting data samples can be provided to one or more ML or DL approaches for action classification, and the predicted class labels can be analyzed and evaluated on the test samples. Action representation and dimensionality reduction techniques are useful for ML-based techniques, whereas for DL-based techniques these steps may be skipped. The existing approaches considered for action representation, dimensionality reduction, and action analysis for HAR are discussed in Sects. 2.1 to 2.3, respectively.
2.1 Action representation
Action representation provides the low-level processing of human action. It can consist of two steps namely, interest detection and description of the interest boundary. The important features can be extracted and encoding can be carried out using different techniques.
2.1.1 Feature extraction and encoding
The overall procedure of action representation involves the extraction of a set of features from local as well as global features. The goal of the action representation task is to find features that are robust to occlusion, background variation, and viewpoint change (Zhu et al. 2016a). An overview of various features-based action representation is shown in Fig. 2; properties of these features are further described and existing work is reviewed in the following sections.
Space–time interest point-based techniques To represent an image using local features, Space–Time Interest Points (STIPs) can be used. STIP features encode an image by adding an extra dimension, temporal information: temporal-domain information is added to the spatial domain, so the encoded representation can provide additional information about the contents and structure of the action scene. STIPs can be converted to saliency regions by using clustering algorithms. These features are translation- and scale-invariant; however, they are not rotation-invariant (Laptev 2005).
For the recognition of human actions, these positions are considered to be the most informative ones (Laptev 2005). An extension of the salient point detector based on spatio-temporal features is proposed in Oikonomopoulos et al. (2005), where an image sequence is represented in terms of salient points in space and time. The relationship between different features is established by calculating the Chamfer distance (Oikonomopoulos et al. 2005). A scale- and translation-invariant representation of the extracted features is obtained using an iterative space–time-warping technique, and the features are converted to zero mean. The proposed model is evaluated using sequences of images of aerobic exercises.
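The Chamfer distance used to relate feature sets can be sketched in a few lines; the following is a minimal pure-NumPy illustration of the symmetric form, not the authors' implementation:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two 2D point sets of shape
    (N, 2) and (M, 2): for each point in one set, take the distance to
    its nearest neighbour in the other set, then average both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

The brute-force pairwise matrix is adequate for the small salient-point sets used in such feature matching; a k-d tree would be preferable for large sets.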
In Chakraborty et al. (2011), STIPs in multi-view images are detected in a selective manner by surround suppression and imposing local spatio-temporal constraints. Intensity-based STIP is sensitive to shadows and highlights, which are disturbing photometric phenomena. Therefore, in Everts et al. (2014), color STIP is proposed by reformulating the detectors over multiple channels; color STIP performs better than intensity-based STIP on datasets such as UCF Sports (CRCV 2010), UCF11 (Liu et al. 2009), and UCF50 (CRCV 2012).
On the other hand in Zhu et al. (2014), feature extraction is performed on depth maps using STIP features and Histogram of Visual Word (HoVW) is created using the quantization of extracted local features. Subsequently in Nazir et al. (2018), feature representation is performed by combining STIP and Scale-Invariant Feature Transform (SIFT), and HoVW-based technique is used for action representation.
STIP methods do not require preprocessing such as background segmentation or human detection. The features are robust to scale, rotation, and occlusion; however, they are not viewpoint-invariant (Laptev 2005). For frames of actions such as boxing, hand-clapping, hand waving, and jogging in the KTH dataset (NADA 2004), such features can be efficiently localized in both space and time, as each video is represented using a set of spatial and temporal interest points (Nazir et al. 2018). It is also observed that STIP features adapt to changes in illumination and scale; however, they may not be able to distinguish between event and noise in some scenarios (Nazir et al. 2018).
Trajectory-based techniques Trajectories for actions are computed by tracking joints or interest points along the input video using optical-flow fields. Densely sampled points are tracked through the optical-flow field to obtain trajectories (Wang et al. 2011). Trajectories are useful in scenarios where long-duration information must be captured (Wang et al. 2011).
In Wang and Schmid (2013), dense trajectories are extracted by sampling and tracking dense points at multiple scales in each frame. Here, feature representation uses the Histogram of Oriented Gradient (HOG), Histogram of Optical Flow (HOF), and Histogram of Motion Boundary (HoMB) descriptors, which capture shape, appearance, and motion information along the trajectory, respectively. HoMB gives improved results compared to SIFT and HOG due to its robustness to camera motion (Dalal et al. 2006); it is based on derivatives of the optical flow and is used to suppress camera motion.
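The camera-motion suppression property of motion-boundary descriptors follows from differentiating the flow field: a roughly constant, camera-induced flow has near-zero spatial derivatives, while the edges of an independently moving person do not. A minimal NumPy sketch of this motion-boundary cue (the full HoMB descriptor additionally bins the resulting gradient orientations into histograms):

```python
import numpy as np

def motion_boundary_magnitude(flow):
    """Spatial derivatives of each optical-flow component (the cue behind
    motion-boundary histograms). flow has shape (H, W, 2) holding the
    horizontal (u) and vertical (v) flow per pixel."""
    u, v = flow[..., 0], flow[..., 1]
    du_y, du_x = np.gradient(u)   # derivatives of the horizontal component
    dv_y, dv_x = np.gradient(v)   # derivatives of the vertical component
    return np.hypot(du_x, du_y), np.hypot(dv_x, dv_y)
```

A constant flow (pure camera translation) yields zero magnitude everywhere, whereas a motion boundary between person and background produces a strong response.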
The handcrafted features have limited discriminative power for HAR, while efficient feature extraction with DL-based methods requires a large amount of training data. Therefore, in Wang et al. (2015), the advantages of handcrafted and DL-based features are combined using improved trajectories; the resulting descriptor, which pools two-stream ConvNet features along trajectories, is known as the Trajectory-pooled Deep-Convolutional Descriptor (TpDD). To construct an effective descriptor, the deep architecture learns multi-scale convolutional feature maps. As explained in Wang et al. (2015), for the multi-scale TpDD extension, optical flow is initially computed and single-scale tracking is performed, followed by construction of multi-scale pyramid representations of the video frames and optical flow. The pyramid representation acts as input to the ConvNets for constructing convolutional feature maps at multiple scales. Subsequently, to enhance the power of dense trajectories for characterizing long-term motion, three-stream networks are used in Shi et al. (2017). Here, dense trajectories are extracted from multiple consecutive frames, resulting in trajectory texture images. The extracted descriptor, known as the sequential Deep Trajectory Descriptor (sDTD), characterizes motion. A three-stream framework, namely spatial, temporal, and sDTD streams, learns the spatial and temporal domains with a CNN-Recurrent Neural Network (RNN) architecture.
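The point-tracking step shared by these trajectory methods can be sketched as follows. This is a hypothetical, minimal version that takes precomputed dense flow fields as NumPy arrays (in practice the flow would come from an optical-flow estimator such as Farneback or TV-L1) and propagates sampled points with a nearest-pixel flow lookup:

```python
import numpy as np

def track_points(points, flows):
    """Propagate sampled points through a sequence of dense optical-flow
    fields: p_{t+1} = p_t + flow_t(p_t), with nearest-pixel flow lookup.
    Returns trajectories of shape (T+1, N, 2) in (x, y) order."""
    traj = [np.asarray(points, dtype=float)]
    for flow in flows:                   # flow: (H, W, 2), (u, v) per pixel
        p = traj[-1]
        xs = np.clip(np.round(p[:, 0]).astype(int), 0, flow.shape[1] - 1)
        ys = np.clip(np.round(p[:, 1]).astype(int), 0, flow.shape[0] - 1)
        traj.append(p + flow[ys, xs])
    return np.stack(traj)
```

Dense-trajectory pipelines additionally resample points each frame, prune static or erratic tracks, and limit trajectory length (e.g. 15 frames); those refinements are omitted here.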
Depth-based techniques A depth image is captured by computing the distance between the image plane and objects in the scene. Captured with a low-cost depth sensor such as Kinect, 3D depth images are invariant to lighting conditions (Taha et al. 2014). The advantages of depth sensors over RGB cameras include calibrated scale estimation, invariance to color and texture, and a simpler background subtraction task (Shotton et al. 2011).
An action can be recognized using Depth Motion Map (DMM) feature as it provides shape and structure information in 3D from depth maps. These maps are projected on three orthogonal planes namely, front, side, and top. To identify motion regions, a map of motion energy is calculated for each projected map. For each projection view, DMM is formed by stacking motion energy for the entire video (Yang et al. 2012).
In Yang et al. (2012), depth maps are projected on three orthogonal planes. Here, for feature representation, HOG is computed after DMM construction to obtain compact and discriminative features. In Chen et al. (2015b), DMM-based gestures are used for extracting motion information, and feature encoding is performed using the Local Binary Pattern (LBP), which performs better than DMM-HOG-based feature extraction. LBP enhances performance when applied to overlapped blocks in DMMs, which increases the discriminative power for action recognition. A DMM computed over the entire depth sequence cannot capture detailed motion; moreover, when a new action occurs, the old motion history may get overwritten.
To increase the accuracy of HAR, data are generated simultaneously from depth and inertial sensors in Chen et al. (2015a). Here, fused features are formed by directly concatenating features from the depth and inertial sensors. On the other hand, in Chen et al. (2016), a depth sequence is divided into overlapping segments and multiple sets of DMMs are generated. To decrease the intra-class variability due to action speed variations, depth segments of different temporal lengths are considered. For intra-class action classification, DMMs are not robust to action speed variations. Therefore, an improvement to DMM is proposed in Chen et al. (2017) by accumulating motion regions over the three planes, namely front, top, and side. Afterwards, patch-based LBP is used to extend the feature representation from the pixel level to a texture-level representation (Chen et al. 2016).
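A simplified version of the DMM computation described above can be sketched as follows, assuming quantised integer depth frames: each frame is projected onto the front (x–y), side (y–z), and top (z–x) planes, and the thresholded frame-to-frame motion energy of each projection is stacked over the sequence, in the spirit of Yang et al. (2012). Binary occupancy projections are used here for brevity:

```python
import numpy as np

def project_views(depth, z_bins):
    """Binary occupancy projections of a quantised depth frame onto the
    front (x-y), side (y-z), and top (z-x) planes."""
    h, w = depth.shape
    front = (depth > 0).astype(np.uint8)
    side = np.zeros((h, z_bins), np.uint8)
    top = np.zeros((z_bins, w), np.uint8)
    ys, xs = np.nonzero(depth)
    zs = depth[ys, xs]          # depth values index the z axis
    side[ys, zs] = 1
    top[zs, xs] = 1
    return front, side, top

def depth_motion_maps(frames, z_bins, eps=0):
    """Accumulate thresholded frame-to-frame differences of each
    projection over the whole sequence, yielding one DMM per view."""
    prev = project_views(frames[0], z_bins)
    dmms = [np.zeros(p.shape, np.int32) for p in prev]
    for f in frames[1:]:
        cur = project_views(f, z_bins)
        for i in range(3):
            dmms[i] += np.abs(cur[i].astype(int) - prev[i].astype(int)) > eps
        prev = cur
    return dmms
```

Note how a horizontal movement is visible in the front and top DMMs but invisible in the side (y–z) projection, which is exactly why the three orthogonal views are combined.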
Pose-based techniques RGB-D sensors such as Kinect provide measurements of skeleton joints. However, these sensors have drawbacks with respect to pose estimation: the Kinect sensor operates over a limited distance, has a limited field of view, and cannot work in sunlight (Zhang 2012). Both 2D and 3D pose estimation have been explored, as follows.
2D Pose-based Techniques
In 2D pose estimation, deformable part models can be used, wherein a collection of templates is matched to recognize the object. However, deformable part models have limited expressiveness and do not take global context into account (Yang and Ramanan 2012). Pose estimation can be efficiently reformulated with CNNs, by two means, namely detection-based and regression-based methods. Detection-based methods can use powerful CNN-based part detectors, which can be further combined using a graphical model (Chen and Yuille 2014). In the detection formulation, pose estimation produces a heat map wherein each pixel represents the detection score of a joint (Bulat and Tzimiropoulos 2016). Nevertheless, joint coordinates are not directly provided by detection approaches; poses are recovered in (x, y) coordinates by applying a max function as a post-processing step. Regression-based methods, in contrast, use a nonlinear function that maps the image directly to the desired output, which can be the joint coordinates (Bulat and Tzimiropoulos 2016). In Toshev and Szegedy (2014), poses are estimated using CNN-based regression towards body joints, and a cascade of such regressors is used to refine the pose estimates. Iterative Error Feedback (IEF) is proposed in Carreira et al. (2016): instead of predicting the output in one shot, a self-correcting model progressively changes an initial solution by feeding back error predictions. The mapping function learned by regression is sub-optimal; hence, regression-based methods tend to perform worse than detection-based techniques.
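The arg-max post-processing step that recovers (x, y) joint coordinates from detection heat maps is straightforward; a minimal NumPy sketch:

```python
import numpy as np

def heatmaps_to_joints(heatmaps):
    """Recover (x, y) joint coordinates from per-joint detection heat
    maps by taking the arg-max of each map."""
    joints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        joints.append((int(x), int(y)))
    return joints
```

This hard arg-max is exactly the non-differentiable step that regression-based methods avoid by predicting coordinates directly.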
3D Pose-based Techniques
On the other hand, given an image of a person, 3D pose estimation is the task of producing a 3D pose that matches the spatial position of the depicted person. Accurate reconstruction of 3D poses from real images, in both indoor and outdoor scenarios, enables numerous applications in entertainment and HCI. Early approaches required feature engineering, whereas the current state-of-the-art methods are based on deep neural networks (Zhou et al. 2018a). 3D pose estimation is considered more complex than 2D as it handles a larger 3D pose space and more ambiguities. In Nunes et al. (2017), skeleton extraction is performed using depth images, wherein skeleton joints are inferred frame by frame. An APJ3D representation is constituted from 15 manually selected skeleton joints (Gan and Chen 2013) using relative positions and local spherical angles; these 15 informative joints are selected to build a compact representation of the human posture. Spatial features are encoded based on joint-joint distances, joint-joint orientations, joint-joint vectors, joint-line distances, and line-line angles to provide rich texture features (Chen 2015), and a CNN is trained to identify the corresponding actions. On the other hand, a Kinect sensor is used in Xu et al. (2016) to obtain human body images. A body part-based skeletal representation is constructed for action recognition, wherein the relative geometry between various body parts is identified; body rotations and translations in 3D space are members of the Lie group. In Liu et al. (2017b), the skeleton input is represented using several color images; here, in the color image generation process, emphasis is given to motion in the skeleton joints to improve the discriminative power of the color images. The multi-stream convolutional network involves ten AlexNets, and each generated color image is input to a CNN.
Due to the discriminative power of multi-stream convolutional networks, combining handcrafted information consisting of skeleton joints with a multi-stream convolutional network increases the recognition performance of HAR. Subsequently, one of the recent advances, given in Huynh-The et al. (2019), maps 3D skeleton data to chromatic RGB values. This technique is termed the Pose Feature to Image (PoF2I) encoding technique; it can efficiently deal with action appearances of varying length. A deep learning framework for HAR is also presented in Pham et al. (2020), where, for feature representation, skeletons are extracted from RGB video sequences; these poses are then converted to an image-based representation and fed to a deep CNN.
Motion-based techniques The motion information of a moving target can be captured using an intelligent system to classify objects efficiently. Motion tracking can be performed for high-level analysis of the classified objects (Paul et al. 2013). The detection process consists of object detection and classification. For object detection, background subtraction, optical flow, and spatio-temporal filtering can be used. Background subtraction detects moving objects by differencing the current frame against a background frame in a pixel-by-pixel or block-by-block fashion; here, motion is characterized by a 3D spatio-temporal data volume. This method has low computational complexity but is susceptible to noise (Paul et al. 2013). To detect moving regions in images, the optical-flow technique computes flow vectors of moving objects; however, these methods have large computational requirements. To recognize humans, the periodic property of the images can be used in motion-based approaches (Paul et al. 2013).
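The pixel-by-pixel background subtraction described above reduces to thresholded absolute differencing; a minimal NumPy sketch (the threshold value is an illustrative choice):

```python
import numpy as np

def frame_difference_mask(frame, background, thresh=25):
    """Pixel-wise background subtraction: mark pixels whose absolute
    difference from the background frame exceeds a threshold."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > thresh).astype(np.uint8)
```

The cast to a signed type avoids unsigned-integer wraparound when the frame is darker than the background; the noise susceptibility noted above comes directly from this per-pixel thresholding.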
In Cutler and Davis (2000), a view-based approach is applied for the recognition of human movements using vector image templates such as the Motion Energy Image (MEI) and Motion History Image (MHI). The MEI feature is a binary template that highlights image regions where motion is present; the shape of the region can be used to suggest both the action occurring and the viewing angle in the scene. The MHI indicates how motion in the image evolves over time. MEI and MHI are prone to background subtraction errors. Replacement and decay operators are used to compute the MHI (Bobick and Davis 2001). Space–time silhouette shapes contain spatial information about human poses, such as location and the orientation of actions, to name a few; they also include the aspect ratio of the different body parts at any point in time.
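The replacement and decay operators of the MHI can be sketched directly: moving pixels are replaced with the maximum duration tau, while stationary pixels decay by one step. A minimal NumPy version, with the MEI obtained as the binary support of the MHI:

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau):
    """One MHI update step: replace moving pixels with tau, decay the
    rest by one (floored at zero)."""
    return np.where(motion_mask > 0, tau, np.maximum(mhi - 1, 0))

def mei_from_mhi(mhi):
    """MEI: binary template of all regions with recent motion."""
    return (mhi > 0).astype(np.uint8)
```

The motion mask fed into the update would typically come from background subtraction, which is why MHI and MEI inherit its errors.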
Shape-based techniques Shape-based features provide the human body structure and its dynamics for HAR, whereas texture-based features characterize motion information in videos using templates. For silhouette extraction, the background subtraction method may not be suitable; therefore, in Vishwakarma and Kapoor (2015), a texture-based segmentation method is used for silhouette extraction. A silhouette representation is used to obtain the Region of Interest (ROI) of a person in shape-based action representation (Vishwakarma and Kapoor 2015). Human silhouettes can be obtained from RGB video frames or depth videos. Silhouette features are sensitive to occlusion and differing viewpoints (Vishwakarma and Kapoor 2015). Silhouette features are extracted from videos in Khan and Sohn (2011) to identify abnormal activities in elderly people; the R-transform is then applied to obtain features that are robust to scale and translation. In Chaaraoui and Flórez-Revuelta (2014b), after background processing, binary segmentation is applied to extract the contour points of the human silhouette; a radial scheme then summarizes single-view features by aligning silhouettes independently of shape and contour length. Further dimensionality reduction can be obtained by keeping a single summary value for each radial bin (Chaaraoui and Flórez-Revuelta 2014a).
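The radial scheme can be illustrated as follows: contour points are assigned to angular bins around the silhouette centroid, and each bin is reduced to one summary value, giving a fixed-length descriptor independent of the contour length. This is a minimal NumPy sketch, with mean radius per bin as an illustrative choice of summary value (normalising by the maximum radius would add scale invariance):

```python
import numpy as np

def radial_bin_summary(contour, n_bins=8):
    """Assign contour points (N, 2) to angular bins around the centroid
    and keep one summary value (mean radius) per bin."""
    centroid = contour.mean(axis=0)
    rel = contour - centroid
    angles = np.arctan2(rel[:, 1], rel[:, 0])            # in [-pi, pi]
    radii = np.linalg.norm(rel, axis=1)
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    summary = np.zeros(n_bins)
    for b in range(n_bins):
        if np.any(bins == b):
            summary[b] = radii[bins == b].mean()
    return summary
```

Because the descriptor length is fixed by the number of bins rather than the number of contour points, silhouettes of different sizes and sampling densities become directly comparable.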
To suppress noise in the shape variations of silhouettes, key poses of silhouettes are divided into cells (Vishwakarma and Kapoor 2015). For silhouette extraction, a texture-based segmentation method is used. To eliminate frames not containing any information, a key-frame extraction method based on the energy of the frame is used. Due to the similar postures and motions in some human activities, skeleton joint features alone are not sufficient to discriminate human activities. To overcome this limitation, Depth Differential Silhouettes (DDS) are used, represented by the HOG format of DDS projections onto three orthogonal Cartesian planes.
Other action representation techniques Apart from the methods above, other action representation techniques include radar-based, gait-based, and Electroencephalography (EEG)-based approaches. The goal of radar-based HAR is to recognize human motion automatically using spectrograms (Craley et al. 2017). The modulation of radar echoes by each activity produces unique micro-Doppler signatures. In Craley et al. (2017), a normalized spectrogram-based method is used and outperforms methods based on skeleton data produced by the Kinect sensor. Another feature representation for HAR is gait-based action representation.
An HAR framework based on the Radon transform of binary silhouettes, which are used for computing a template, is proposed in Boulgouris and Chi (2007). 3D gait analysis can be performed using depth images and human body joint tracking from the available gait sequences (Boulgouris and Chi 2007). An advantage of gait is that, like automatic face recognition, it requires no physical contact, and it is less likely to be obscured than other biometrics (Boulgouris et al. 2005). However, gait-based recognition is performed in environments where the background is as uniform as possible; moreover, recognition algorithms based on gait are not view-invariant (Boulgouris et al. 2005).
Activity recognition using EEG involves electrophysiological monitoring to analyze brain states by capturing the voltage fluctuations of ionic currents within the neurons of the brain (Zhang et al. 2019a). The use of EEG signals for activity recognition is often termed a cognitive activity recognition system, which bridges the gap between the cognitive world and the physical world (Zhang et al. 2019b). EEG signals have excellent temporal resolution, meaning that events occurring within a small fraction of a second (milliseconds) can be captured. However, the disadvantage of EEG is its low spatial resolution, meaning that EEG signals are highly correlated spatially (Roy et al. 2019).
Hybrid action representation techniques The performance of HAR can be improved by using hybrid action representation techniques. Hybrid representations are useful in scenarios where, for example, human activities contain similar postures and motions; in such cases, skeleton joint features alone are not enough to discriminate between different activities. An activity recognition system using silhouette- and skeleton-based features is presented in Jalal et al. (2017), where multi-fused features such as skeleton joints and body-shape-based features such as HOG and DDS are extracted from the input videos. The shape of the full body is represented by DDS, and action classification is performed by a Hidden Markov Model (HMM). However, these features are suitable only for simple actions.
Shape and motion information is combined in Vishwakarma et al. (2016) to handle occlusion: binary silhouette extraction is performed using the Spatial Distribution of Edge Gradients (SDEG), and temporal information is extracted with the R-transform. The R-transform produces features that are robust to scaling and translation, but not to rotational changes. In Shao et al. (2012), shape and motion information is combined and action recognition is performed using temporal segmentation; the Motion History Image (MHI) is used for describing shape, and the Pyramid Correlogram of Oriented Gradients (PCOG) is used as the feature descriptor.
For abnormal activity detection, texture, shape, and motion features are fused in Miao and Song (2014): the Grey Level Co-occurrence Matrix (GLCM), Hu-invariant moments, and HOG are used as the texture, shape, and motion features, respectively. To obtain better performance, data normalization and parameter optimization are performed using an Adaptive Simulated Annealing Genetic Algorithm (ASAGA). Appearance and motion features are combined in Amraee et al. (2018) using HOG-LBP and HOF, respectively. The extraction of accurate silhouettes is difficult in the case of camera movement and complex scenes; hence, human appearance cannot be identified using silhouettes when the human body is occluded.
In Patel et al. (2018), various features are fused to improve the performance of the network: the average of the HOG feature over 10 overlapping frames, the Discrete Wavelet Transform, the displacement of the object centroid, the velocity of the object, and LBP. On the other hand, a feature fusion scheme combining classical descriptors and 3D convolutional networks is proposed in Qin et al. (2017). Descriptors are used such as HOG, which provides good invariance to geometric and optical deformation; HOF, which provides invariance to scale changes; and SIFT, which provides invariance to viewpoint. These features are fused with the features learned by a 3D CNN into a special fusion feature, which is then fed to the classification task.
2.1.2 Discussion
Handcrafted representations influence learning-based representations. Traditional ML techniques depend on handcrafted feature representations. These features are local, follow a dense sampling strategy, and incur high computational complexity for both training and testing.
On the other hand, STIP features are suitable for simple actions and gestures; such features may not perform well when multiple persons are in the scene. Trajectory-based approaches can analyze movements in a view-invariant manner; however, these techniques are not efficient for localizing joint positions. Depth sensors such as Kinect provide the additional capability of human localization and skeleton extraction, so action detection based on them is simpler and more effective than that using RGB data. Moreover, sensors like Kinect and advanced human pose estimation algorithms make it easier to obtain accurate 3D skeleton data.
Skeleton data also captures spatial information, and strong correlations exist between joint nodes and their adjacent nodes; therefore, structural information related to the body can be found in skeleton data. Across frames, strong temporal correlations may also exist. Skeleton data is thus a popular representation among researchers. In Table 3, we provide an overview of the advantages and disadvantages of various feature extraction methods. We also provide a detailed summary of action representation techniques in Table 4.
2.2 Dimensionality reduction techniques
In an action recognition framework, a large number of features are collected to capture all the possible events. This leads to the presence of redundant data, which complicates the learning process. Hence, to enhance the learning task of the classification model, uncorrelated data should be considered. In dimensionality reduction, the original features are transformed by removing redundant information. Such techniques can be categorized as unsupervised or supervised (Saini and Sharma 2018): the unsupervised dimensionality reduction techniques include Principal Component Analysis (PCA) (Chen et al. 2015b), the autoencoder (Ullah et al. 2019), and Reduced Basis Decomposition (RBD) (Arunraj et al. 2018), while the supervised techniques include Linear Discriminant Analysis (LDA) (Khan and Sohn 2011) and Kernel Discriminant Analysis (KDA) (Khan and Sohn 2011).
PCA is an unsupervised dimensionality reduction method that represents the input data in terms of eigenvectors; the feature dimensionality is reduced by retaining the components with maximum variance. In Chen et al. (2015b), PCA is applied after extracting DMM-LBP features to map the data to a lower-dimensional space. PCA can also be used to provide discriminative capability to local features (Thi et al. 2010), which are then given as input to the classifier. On the other hand, part-based feature representation is used for HAR in Xu et al. (2016), where a linear combination of these features is obtained using PCA. Linear PCA will not always detect all of the structure in a dataset; more information can be extracted by using suitable nonlinear features, and kernel PCA is suited to extracting nonlinear structures in the data (Mika et al. 1999). Kernel-based PCA (KPCA) is used in Hassan et al. (2018), wherein PCA is combined with a kernel function that computes dot products between vectors, allowing nonlinear structures in the data to be identified.
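To make the variance-retention idea concrete, the following is a minimal PCA sketch in NumPy via SVD of the centered data; the toy feature matrix is invented for illustration and is not taken from any of the cited works:

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA sketch: project X onto the top principal components."""
    # Center the data: principal axes are eigenvectors of the covariance matrix.
    Xc = X - X.mean(axis=0)
    # SVD of the centered data yields the principal axes in the rows of Vt.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Variance explained by each component (eigenvalues of the covariance).
    var = (S ** 2) / (len(X) - 1)
    return Xc @ Vt[:n_components].T, var[:n_components]

# Toy "feature matrix": 100 samples, 5 correlated features (rank-2 structure).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
Z, var = pca(X, n_components=2)  # components ordered by decreasing variance
```

Because the toy data has rank-2 structure, two components suffice to capture essentially all of its variance.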
RBD is a linear dimensionality reduction technique based on the reduced basis method, which reduces high dimensionality by relying on truth approximations at a sampled set of optimal parameter values (Chen 2015). In Arunraj et al. (2018), the RBD method is used to reduce the dimensionality of the input features, where the error-determining norm for RBD is implemented with different norms such as the identity norm (I), the all-ones norm (J), the Symmetric Positive Definite (SPD) norm, and the diagonal norm (D). The selection of an efficient error-estimation norm depends on the subjects or application under consideration. The accuracy of the RBD method is lower than that of PCA, but it is faster (Arunraj et al. 2018).
The deep autoencoder is an unsupervised dimensionality reduction method in which the learning is data-driven (Baldi 2012). In Ullah et al. (2019), dimensionality reduction is performed using a four-layer stacked autoencoder: the initial layers capture changes in the raw input data, whereas the intermediate layers learn patterns of second-order features. In Boulgouris and Chi (2007), the gait sequence consists of templates constructed from the Radon transform; over several cycles, multiple gait templates can be constructed for an action such as walking, and LDA is used to reduce the feature dimension. In LDA, dimensionality reduction is achieved by projecting onto a lower-dimensional subspace that retains the information relevant for gait recognition.
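As a sketch of the supervised projection idea behind LDA, the two-class Fisher discriminant direction can be computed in a few lines of NumPy; the Gaussian class clouds below are hypothetical and only serve to illustrate the projection:

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Two-class Fisher LDA sketch: direction maximizing class separation."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the (unnormalized) per-class covariances.
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))
    # Optimal projection direction: Sw^{-1} (m1 - m0).
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

# Two hypothetical feature clouds standing in for two gait/action classes.
rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))
w = fisher_lda_direction(X0, X1)  # 1D subspace separating the two classes
```

Projecting either class onto `w` collapses the features to one dimension while keeping the class means well separated.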
Another method for dimensionality reduction is Kernel Discriminant Analysis (KDA), the non-linear extension of LDA. In KDA (Khan and Sohn 2011), the input data is mapped into a high-dimensional feature space using a Radial Basis Function (RBF) kernel; for abnormal activity detection, KDA is more effective than LDA (Khan and Sohn 2011). Using this non-linear mapping, the R-transformed data is mapped by KDA to the feature space (Khan and Sohn 2011). In Table 5, we present the advantages and disadvantages of these methods and summarize the dimensionality reduction techniques in Table 6.
2.3 Action analysis-based HAR
The action analysis task is performed on top of the action representation method(s). The low-level steps of action recognition may identify object movement in the scene; however, these descriptors do not provide an understanding of the action label. To label action sequences, action classification techniques are used, which mainly include traditional ML as well as DL techniques. A taxonomy of the ML- and DL-based action classification methods reviewed in this section for HAR is shown in Fig. 3.
2.3.1 Traditional machine learning-based methods
Various ML-based techniques have been proposed for HAR (Kim et al. 2016; Gan and Chen 2013; Nunes et al. 2017; Singh and Mohan 2017). We discuss ML-based action analysis methods such as graph-based methods, SVM, nearest neighbor, HMM, ELM, and hybrid methods.
Graph-based methods Graph-based methods used to classify input features for HAR include Random Forest (RF) and the Geodesic Distance Isograph (GDI), to name a few. To increase the robustness of the action recognition system, a graph of a local action is used, where STIP features form the vertices of the graph and an edge represents a possible interaction (Singh and Mohan 2017).
RF classifier is a tree-based ML technique that leverages the power of multiple decision trees for making decisions. For HAR, the RF classifier can efficiently handle thousands of inputs (Ar and Akgul 2013). Ensemble learning algorithms attract high interest due to their higher accuracy and greater robustness to noise compared to a single classifier. In an RF classifier, each tree contributes to the assignment of the most frequent class for an input vector x, as given by Eq. 1 (Rodriguez-Galiano et al. 2012):

$$\hat{C}_{rf}^{B}(x) = \operatorname{majority\ vote}\left\{\hat{C}_b(x)\right\}_{b=1}^{B} \quad (1)$$

where \(\hat{C}_b(x)\) is the class prediction of the bth random forest tree and B is the number of trees. RF has properties such as low error rates, guaranteed convergence (i.e., no over-fitting), faster training (because each tree works on a subset of features), robustness to noise, and simplicity (Nunes et al. 2017). In Gan and Chen (2013), an RF classifier and randomized decision trees are trained on a depth-based feature, namely the APJ3D feature, which includes the positions and angles of joints. In Xu et al. (2017), an RF classifier is used to classify activities from a dataset collected from accelerometer sensors.
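The majority-vote rule of Eq. 1 reduces to a mode over the individual tree predictions; a minimal sketch (the tree predictions and class labels are invented for illustration) is:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Eq. 1 sketch: the final class is the most frequent prediction
    over the B individual trees of the forest."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions of B = 5 trees for a single input vector x.
preds = ["walk", "run", "walk", "walk", "jump"]
label = majority_vote(preds)  # -> "walk" (3 of 5 trees agree)
```

In a real forest each `C_b(x)` comes from a decision tree grown on a bootstrap sample and a random feature subset; only the aggregation step is shown here.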
Support vector machine The concept of the Support Vector Machine (SVM) is to separate data points using a hyperplane (Cortes and Vapnik 1995). For the classification task, SVM maps the points into a high-dimensional space; this mapping provides a linear decision surface for the input data, which aids the classification task (Cortes and Vapnik 1995). As shown in Eq. 2 (Cortes and Vapnik 1995), the weight vector l and bias term p define the position of the separating hyperplane in SVM:

$$l \cdot x + p = 0 \quad (2)$$
SVM uses the kernel trick to work with high-dimensional data and to reduce the computational burden. For HAR, SVM can be used when the number of samples is small (Qian et al. 2010). In Chakraborty et al. (2011), a BoVW model of local N-jet descriptors is built, with vocabulary construction performed by merging spatial pyramids and vocabulary compression, and a \(\chi ^2\) kernel-based SVM classifier performs the human action classification. In Everts et al. (2014), a \(\chi ^2\) kernel SVM is trained on codebooks of sequences containing quantized HOG3D descriptors, and leave-n-out cross-validation is used to evaluate the learned classifiers. In Zhu et al. (2014), quantized 3D data is given as input to a \(\chi ^2\) kernel-based SVM. A GDI graph is used in Kim et al. (2016) for optimizing and localizing human body parts within a given ROI. Instead of classifying each pixel of the input, feature points are randomly generated on the GDI graph, which gives the summation of the costs of the edges along the shortest path between two points. Thereafter, a graph-cut algorithm is applied along with the SVM classifier, which removes falsely labeled feature points using the previously generated GDI graph (Kim et al. 2016).
For smooth functions, the RBF kernel is preferable, whereas for handling discrete features, as needed for example by Bag of Visual Words (BoVW) models, the \(\chi ^2\) kernel is used due to its capability to model overlapping features (Cortes and Vapnik 1995). In Foggia et al. (2013), an SVM classifier is used to classify a codebook constructed using BoVW. With the one-against-rest technique, each SVM learns the discriminant words for a particular event and ignores the others, and N such separate classifiers are constructed; for classification of the kth class against the rest, the kth classifier is trained on the training data set. In Shao et al. (2012), the PCOG feature, a shape descriptor calculated from MHI and MEI, is given as input to the SVM. For offline training, a multi-class SVM with RBF kernel is used; moreover, to improve the training procedure, the input training sequences are divided into cycles for the duration of each movement. Non-linear data is classified by performing multi-class learning using a one-versus-one SVM classifier with a polynomial kernel (Nazir et al. 2018).
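For intuition, one common form of the exponential \(\chi ^2\) kernel for non-negative histogram features (such as BoVW counts) can be computed directly; this is a sketch of one standard convention, with the tiny histograms below invented for illustration:

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0):
    """Exponential chi-squared kernel sketch for histogram features:
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)), with 0/0 := 0."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            denom = x + y
            # Safe division: bins empty in both histograms contribute zero.
            d = np.divide((x - y) ** 2, denom,
                          out=np.zeros_like(denom, dtype=float),
                          where=denom > 0)
            K[i, j] = np.exp(-gamma * d.sum())
    return K

# Two tiny BoVW-style histograms.
X = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])
K = chi2_kernel(X, X)  # symmetric Gram matrix with ones on the diagonal
```

Such a precomputed Gram matrix can then be handed to a kernel SVM as its similarity measure.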
Nearest neighbor Non-parametric classifiers provide a classification decision based on the data without a training phase. The most commonly used non-parametric classifier is Nearest Neighbor (NN) estimation (Boiman et al. 2008). A variant of NN, namely NN-image, is used for image classification by comparing the image to the nearest class image; the classification results of the NN-image classifier are inferior to learning-based classifiers such as SVM and DT (Boiman et al. 2008). In Oikonomopoulos et al. (2005), the k-Nearest Neighbor (kNN) classifier is used with a Relevance Vector Machine (RVM), a kernel-based sparse model with functionality similar to SVM. In the RVM, learning is performed using a Bayesian approach, and a Gaussian prior is placed on the model weights, since maximum-likelihood estimation of the weights can lead to overfitting. Positive values of these weights correspond to the relevance vectors of the class representing the human action.
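The kNN decision rule itself requires no training, only distances to the stored samples; a minimal sketch (the feature vectors and action labels are invented for illustration) is:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Non-parametric kNN sketch: label x by majority vote among the k
    nearest stored training samples (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)   # distances to all stored samples
    nearest = np.argsort(d)[:k]               # indices of the k closest ones
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Hypothetical 2D action features with their labels.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = ["walk", "walk", "run", "run"]
pred = knn_predict(X_train, y_train, np.array([4.8, 5.1]), k=3)  # -> "run"
```

All the work happens at query time, which is why such classifiers are called non-parametric: the "model" is the training set itself.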
Hidden Markov model An HMM models movement from one state to another according to given transition probabilities. Different hidden states are specified in the training stage of the HMM; for a given problem, the state-transition and output probabilities are optimized during training (Gavrila 1999). Output symbols are produced by optimization based on the HMM matching image features for a motion class (Gavrila 1999). The motivation for using HMMs in HAR is that they easily model the temporal evolution of features extracted from videos (Vezzani et al. 2010). However, in an HMM, the selection of parameters such as the number of states and the number of output symbols requires trial and error (Yamato et al. 1992). In Jalal et al. (2012), an HMM is used to classify silhouette features obtained by applying the R transform to the input, where a depth-based silhouette is used as the input feature.
To obtain robustness to initial conditions, an improved version of the HMM called the Coupled HMM (CHMM) is used in Brand et al. (1997). The current state in a CHMM is determined by the state of the chain and the neighboring states at the previous timestamp. Actions with coordinated movements, such as moving both hands, are effectively classified using the CHMM (Brand et al. 1997). Another variant of the HMM, the Layered HMM (LHMM) (Oliver et al. 2002), enhances the robustness of the system by segmenting the model into different layers operating at different temporal granularities.
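The temporal modeling that motivates HMM-based HAR rests on the forward algorithm, which scores an observation sequence under a given model; a minimal sketch with an invented two-state toy model is:

```python
import numpy as np

def forward(pi, A, B, obs):
    """HMM forward algorithm sketch: likelihood of an observation sequence.
    pi: initial state probs (N,), A: transition matrix (N, N),
    B: emission matrix (N, M), obs: list of observed symbol indices."""
    alpha = pi * B[:, obs[0]]             # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate one step, then emit
    return alpha.sum()                    # P(observations | model)

# Toy 2-state model with 2 observation symbols (values invented).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
p = forward(pi, A, B, obs=[0, 1, 0])
```

For classification, one such model is trained per action class and a test sequence is assigned to the class whose model gives the highest likelihood.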
Extreme learning machine The ELM was originally developed as a Single-hidden-Layer Feed-forward Network (SLFN) and is computationally faster at optimizing its parameters than gradient-based learning methods (Huang et al. 2004). In ELM, the input-layer weights and hidden-layer biases are chosen randomly (Iosifidis et al. 2014); the principle of ELM is that the network learns without iteratively tuning the hidden neurons of the SLFN (Huang et al. 2004). Kernel-based ELM (KELM) is a variant of ELM that depends only on the input data (Chen et al. 2015b) and can be used in situations where the number of features is larger than the number of samples.
An RBF kernel-based ELM with a single hidden layer is applied in Chen et al. (2015b). DMM-LBP features are input to the ELM network either by fusing the projections from the top, side, and front views, or by decision-level fusion using a logarithmic opinion pool on the scores of the classifiers for the different projections. In Chen et al. (2017), multi-temporal DMM and patch-based LBP features are classified using KELM; the extracted DMM and LBP features are selected using the LDA method, and the parameters of KELM are chosen using 5-fold cross-validation to validate the performance of the network.
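The "random hidden layer plus closed-form output weights" recipe that defines a basic ELM fits in a few lines; this is a sketch on an invented, well-separated toy problem, not any of the cited configurations:

```python
import numpy as np

def elm_train(X, Y, n_hidden=30, rng=None):
    """Basic ELM sketch: random, never-tuned hidden layer followed by
    a least-squares solve for the output weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, Y, rcond=None)  # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy binary problem with one-hot targets (two well-separated Gaussians).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
Y = np.vstack([np.tile([1, 0], (50, 1)), np.tile([0, 1], (50, 1))])
W, b, beta = elm_train(X, Y, n_hidden=30, rng=rng)
acc = (elm_predict(X, W, b, beta).argmax(1) == Y.argmax(1)).mean()
```

Because only the linear output layer is fitted, training reduces to a single least-squares solve, which is the source of ELM's speed advantage over gradient-based training.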
Zero-Shot Learning Supervised classification techniques are popular in computer vision research, and their popularity has increased with the introduction of deep networks. Supervised techniques require an abundant amount of labeled data for training, and a learned classifier can only deal with instances of the classes it was trained on; it has no ability to deal with unseen classes. In zero-shot learning, the labeled training instances cover a set of seen classes, while the testing instances belong to unseen classes. Zero-shot learning is widely used in video-related problems: in Zero-Shot Action Recognition (ZSAR), it is used to recognize videos of unseen actions. Popular datasets for action recognition are UCF101 and HMDB51, on which zero-shot learning has demonstrated promising results. Newly observed activity types can be detected by ZSAR using the semantic similarity between the activity and other embedded words in the semantic space (Al Machot et al. 2020), and large-scale ZSAR can be modeled using the visual and linguistic attributes of action verbs (Zellers and Choi 2017).
To narrow the knowledge gap between existing methods and humans, an end-to-end ZSAR framework based on a structured knowledge graph is proposed in Gao et al. (2019), which generates classifiers for new categories. The graph is designed using a Two-Stream Graph Convolutional Network (TS-GCN) consisting of a classifier branch and an instance branch; the classifier branch takes the semantic-embedding vectors of all the concepts as input and generates the classifiers for the action categories (Gao et al. 2019). This two-stream GCN model effectively captures action-attribute, attribute-attribute, and action-action relationships, and a self-attention mechanism is adopted to model temporal information across video segments. Zero-shot learning can be applied in settings where the classes of the training and test instances are disjoint; alternatively, in generalized Zero-Shot Learning (GZSL), overlap between the training and test classes may occur (Norouzi et al. 2013). GZSL is considered much harder than the standard setting, as the learned models can be biased towards the classes seen during training. The generative framework for zero-shot action recognition proposed in Mishra et al. (2018) can be applied to both the generalized and the standard case; each action class is modeled using a probability distribution whose parameters are functions of the attribute vector representing the action class.
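The core zero-shot decision rule, matching a visual feature to the nearest class embedding in a shared semantic space, can be sketched minimally; the "word embeddings" and the projected video vector below are entirely invented stand-ins, not real embeddings from any cited system:

```python
import numpy as np

# Hypothetical semantic embeddings for action labels (values are made up).
semantic = {
    "walking":  np.array([0.9, 0.1, 0.0]),
    "running":  np.array([0.8, 0.3, 0.1]),
    "swimming": np.array([0.1, 0.9, 0.5]),  # an unseen class at training time
}

def zero_shot_classify(visual_embedding, label_embeddings):
    """ZSAR sketch: assign the label whose semantic embedding has the highest
    cosine similarity with the visual feature mapped into the shared space."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(label_embeddings,
               key=lambda name: cos(visual_embedding, label_embeddings[name]))

# A video feature assumed to be already projected into the semantic space.
video_vec = np.array([0.15, 0.85, 0.45])
label = zero_shot_classify(video_vec, semantic)  # -> "swimming"
```

The hard part in practice is learning the projection from video features into the semantic space; only the final nearest-embedding decision is shown here.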
Hybrid methods To enhance the performance of HAR, hybrid methods can be used. In Vishwakarma and Kapoor (2015), key poses are extracted as silhouettes and classified using a hybrid SVM and kNN algorithm called SVM–NN. The procedure of the SVM–NN method is depicted in Fig. 4 (Vishwakarma and Kapoor 2015): feature extraction is performed via silhouette extraction, PCA is used for dimensionality reduction, and samples misclassified by the SVM are further fed into the kNN classifier. In Xu et al. (2016), for behavior recognition, skeleton features are mapped to a Lie group and, after preprocessing with PCA, an SVM is used to classify the PCA-optimized features. Optimizing the error and radius values of the SVM provides better classification accuracy.
The Naïve Bayes (NB) classifier is based on Bayes' theorem. The conditional probability that an event belongs to a class can be calculated from the conditional probabilities of the particular events in each class, for \(x \in X\) and C classes, where X denotes a random variable. The conditional probability that x belongs to class k is given by Eq. 3:

$$P(C_k \mid x) = \frac{P(x \mid C_k)\,P(C_k)}{P(x)} \quad (3)$$
It can be seen that Eq. 3 describes a pattern classification problem (Jalal and Kim 2014): it gives the probability that the given data belongs to a class. The optimum class is selected as the class with the highest probability among all the possible classes C, which minimizes the classification error. The NB method assumes that the input features are statistically independent given the class. The hybrid NBNN is used for video classification (Yang and Tian 2014), wherein a direct Image-to-Class distance is computed; to compute the separation of one image from another, the kernel matrix of an SVM is used. Image-based NBNN classification is extended to NBNN-based video classification in Yang and Tian (2014), where eigen-joints are used as frame descriptors without quantization and a Video-to-Class distance is computed over the frame descriptors. Experimental results in the literature show that the Image-to-Class distance tends to provide better generalization ability than the Image-to-Image distance when applied to the kernel matrix of an SVM.
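The Bayes rule of Eq. 3, posterior proportional to prior times likelihood, normalized by the evidence, can be worked through numerically; the priors and likelihood values below are invented for illustration:

```python
import numpy as np

def nb_posteriors(priors, likelihoods):
    """Eq. 3 sketch: P(C_k | x) = P(x | C_k) P(C_k) / P(x),
    where P(x) is the sum of the per-class joint probabilities."""
    joint = priors * likelihoods     # numerator of Bayes' theorem per class
    return joint / joint.sum()       # normalize by the evidence P(x)

# Two classes with equal priors and hypothetical class-conditional likelihoods.
priors = np.array([0.5, 0.5])
likelihoods = np.array([0.02, 0.08])  # P(x | C_1), P(x | C_2)
post = nb_posteriors(priors, likelihoods)  # -> [0.2, 0.8]
best = post.argmax()                       # class index 1 is selected
```

In a full NB classifier, each `P(x | C_k)` would itself factor into a product over features, by the conditional-independence assumption.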
Pose information of the human body provides important cues about actions (Liu et al. 2013). In Liu et al. (2013), pose-based HAR is performed using the Weighted Local NBNN (WLNBNN) method, an improved version of NBNN. Weights are assigned to each query descriptor, and the Euclidean distance is calculated using a nearest-exemplar search. Input poses are transformed into pyramidal features using a Gabor filter, a Gaussian pyramid, and the wavelet transform, inspired by multiresolution analysis in image processing (Liu et al. 2013).
Discussion Discriminative classifiers learn a direct mapping that links inputs to their corresponding class labels. Due to their high performance and simplicity, supervised techniques such as SVM and NN are frequently used for action classification. However, when dealing with high-volume datasets, traditional ML techniques may not achieve efficient performance. The advantages and disadvantages of using ML techniques for action recognition are presented in Table 7.
It can be noticed that real-life actions are likely to be more complicated than the actions in the datasets. Besides, new samples may not contain labels, which makes supervised methods inappropriate; ZSAR has emerged as an attempt to overcome these limitations. We also present a summary of the reviewed traditional ML-based classification techniques in Table 8.
2.3.2 Deep learning-based methods
DL is a family of techniques that enables computers to perform tasks in a manner inspired by the human brain. In this survey, we review CNN, RNN, Long Short-Term Memory (LSTM), Deep Belief Network (DBN), and Generative Adversarial Network (GAN) architectures, which are widely used for the action recognition task.
In a CNN, feature maps are created using local neighborhood information. The CNN architecture is composed of multiple feature extraction stages, each consisting of three basic operations: convolution, non-linear activation, and pooling (e.g., average, min, or max). To capture spatial and temporal features for video analysis, 3D convolution is proposed in Ji et al. (2013); it is performed by convolving a 3D kernel over multiple stacked frames. The 3D convolution method has a high computational cost, and the training time increases in the absence of supporting hardware such as a GPU (Ji et al. 2013). The basic architecture of a deep CNN is depicted in Fig. 5 (Weimer et al. 2016).
A CNN is denoted as deep when multiple feature extraction stages are connected together (Weimer et al. 2016). In Baccouche et al. (2011), the convolution in the CNN is performed in both the space and time domains: a 3D CNN takes space-time volumes as input, and an LSTM is then trained on the features extracted by the 3D CNN. Spatio-temporal information can thus be extracted from the input video by the 3D CNN. Due to the layer-by-layer stacking of 3D convolutions, 3D CNN models have high training complexity as well as high memory requirements (Zhou et al. 2018b). The Mixed Convolutional Tube (MiCT) network is proposed in Zhou et al. (2018b), wherein the feature maps of a 3D input are coupled serially with a 2D convolution block. Cross-domain residual connections are added along the temporal dimension to reduce the computational complexity of the model. The advantage of the residual connections is that the corresponding 2D convolutions extract static 2D features, so the 3D convolutions only need to learn the residual information.
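The "3D kernel over stacked frames" idea can be made concrete with a naive valid-mode 3D filtering sketch in NumPy (written as cross-correlation, i.e., without kernel flipping, as is conventional in CNNs; the tiny video volume is invented for illustration):

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 3D filtering sketch (valid mode) over a stacked-frame volume
    of shape (T, H, W) with a spatio-temporal kernel of shape (t, h, w)."""
    t, h, w = kernel.shape
    T, H, W = volume.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value is a weighted sum over a 3D neighborhood
                # spanning both space and time.
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# A stack of 4 frames of size 5x5, filtered by a 2x3x3 averaging kernel.
video = np.ones((4, 5, 5))
kernel = np.ones((2, 3, 3)) / 18.0
feat = conv3d_valid(video, kernel)  # output shape (3, 3, 3)
```

The triple loop makes the cost visible: every output value touches t·h·w inputs, which is why 3D CNNs are so much more expensive than their 2D counterparts.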
In Huang et al. (2018), pose-based features are extracted from a 3D CNN, wherein 3D pose, 2D appearance, and motion streams are fused. Since extracting color joint features with the 3D CNN would result in high complexity, a 15-channel heatmap is constructed and convolution is performed on each map. In skeleton-based HAR, the pairwise distances between skeleton joints are computed in Li et al. (2017a): Joint Distance Maps (JDM) are given as inputs to four CNNs, after which ConvNet training and late fusion are performed. On the other hand, skeleton-based input is classified by a multi-stream CNN in Liu et al. (2017b), which involves a modified AlexNet (Krizhevsky et al. 2012); color-encoded input data is given to each CNN, and the probabilities generated by each CNN are fused to calculate the final class score. The study shows the robustness of the multi-stream CNN against view changes, noisy input skeletons, and similarity of skeleton inputs across different classes, and demonstrates the superiority of the proposed network over LSTM-based methods.
A deep CNN, namely a ConvNet, is used to perform efficient HAR with the accelerometer and gyroscope of a smartphone (Ronao and Cho 2016), in which the local dependency of 1D time-series signals is exploited; features are extracted automatically by the CNN without the need for advanced pre-processing techniques, as handcrafted features cannot be transferred to activities with a similar pattern. To convert the output of the CNN into a probability distribution, the fully connected layer is combined with a softmax. To incorporate both spatial and temporal streams, a two-stream convolutional network is proposed in Feichtenhofer et al. (2016), wherein RGB information (spatial) and optical flow (motion) are modeled independently and the predictions are averaged in the last layers. This network is not able to capture long-term motion from optical flow; another drawback of the spatial CNN stream is that its performance depends on a single image randomly selected from the input video. Therefore, complications arise due to background clutter and viewpoint variation (Feichtenhofer et al. 2016).
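The softmax step that turns final-layer scores into class probabilities is a one-liner worth seeing; the logits below are hypothetical:

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into a probability distribution.
    Subtracting the max is a standard numerical-stability trick that
    does not change the result."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical final-layer outputs
probs = softmax(scores)             # positive, sums to 1, order-preserving
```

Because softmax is monotone, the predicted class (the argmax) is unchanged; only the scores are rescaled into probabilities.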
In Wang et al. (2016), the Temporal Segment Network (TSN) is proposed. Since consecutive frames are highly redundant, dense temporal sampling is unnecessary, as it yields highly similar frames; TSN therefore exploits sparse sampling from long input videos, using the Inception with Batch Normalization (BN-Inception) network architecture. In addition to the RGB and optical flow images used by two-stream networks, this approach employs the RGB difference between two frames (to model variation in appearance) and warped optical flow fields (to suppress background motion).
To enhance the performance of skeleton-joint-based HAR, another two-stream network is proposed in Shi et al. (2019): streams corresponding to joint information and bone information are passed through an Adaptive Graph Convolutional Network (AGCN). The network consists of a stack of these basic blocks, and the final output is passed through a softmax layer. In Li et al. (2019), an actional-graph-based CNN structure is proposed, which stacks multiple convolutions over the action graph as well as temporal convolutions; the graph structure is learned from data in order to capture dependencies among joints. In Ullah et al. (2019), HAR is performed on a system with real-time video captured from a non-stationary camera, and a CNN is used to extract frame-level features automatically. In Fig. 6 (Ullah et al. 2019), the video stream is given as input to a pre-trained model; temporal changes in human actions are learned in a low-dimensional space by connecting the CNN to a deep autoencoder, and human actions are classified using a quadratic SVM classifier. In Huynh-The et al. (2019), the Pose Feature to Image (PoF2I) encoding scheme uses distances and orientations to represent skeleton data as an image; these images are used to fine-tune the Inception-v3 deep ConvNet, which reduces overfitting.
An approach to extract the ROI using a Fully Convolutional Network (FCN) is presented in Jian et al. (2019). A CNN is used to identify the pose probability of each frame, and key-frame extraction is performed using the probability differences between neighboring frames. This variation-aware key-frame extraction method selects the frame with the maximum key-pose probability calculated by the CNN; if different frames yield the same key-pose probability, the center frame is selected. An LSTM contains a memory cell that is tuned by input, output, and forget gates; the gates determine the information flow entering or leaving the memory cell, and information is stored in the internal states of the cell. LSTM provides an automatic understanding of actions in videos. On the other hand, an attention-enhanced graph-based CNN is proposed in Si et al. (2019) to focus on the joint positions in the skeleton, which helps to enhance key node features; this Attention Enhanced Graph Convolutional LSTM (AGC-LSTM) network is able to capture discriminative features.
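The gating behavior of the LSTM memory cell can be sketched for a single step; this is a simplified sketch (per-gate biases are omitted, i.e., set to zero, for brevity) with random, untrained weights purely to show the data flow:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wi, Wf, Wo, Wg):
    """One LSTM step: the input (i), forget (f), and output (o) gates control
    what enters, stays in, and leaves the memory cell c. Biases omitted."""
    z = np.concatenate([x, h])     # gates see the input and previous hidden state
    i = sigmoid(Wi @ z)            # how much new information to write
    f = sigmoid(Wf @ z)            # how much old memory to keep
    o = sigmoid(Wo @ z)            # how much memory to expose
    g = np.tanh(Wg @ z)            # candidate cell update
    c_new = f * c + i * g          # updated memory cell (internal state)
    h_new = o * np.tanh(c_new)     # new hidden state
    return h_new, c_new

# Tiny cell: 3-dim input, 2-dim hidden state, random (untrained) weights.
rng = np.random.default_rng(0)
Wi, Wf, Wo, Wg = (rng.normal(size=(2, 5)) for _ in range(4))
h, c = np.zeros(2), np.zeros(2)
for x in rng.normal(size=(4, 3)):  # run the cell over a 4-step sequence
    h, c = lstm_step(x, h, c, Wi, Wf, Wo, Wg)
```

The forget gate multiplying `c` is what lets the cell retain information over long sequences, which is why LSTMs suit the temporal modeling of video features.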
In general, most LSTM- and RNN-based methods consider skeleton sequences as low-level features and use the raw skeleton coordinates as their inputs; hence, these networks cannot extract effective high-level features (Si et al. 2019). CNN-based methods, in contrast, are efficient for image-based recognition tasks (Akilan et al. 2017): they can efficiently preserve spatio-temporal information and directly convert raw skeleton data to images (Kim and Reiter 2017). However, due to variations in viewpoint and appearance, the performance of such networks may not be accurate. To incorporate both spatial and temporal behavior, a CNN can be combined with an LSTM; for 3D datasets, the LSTM-CNN combination is better than the LSTM-LSTM combination (Li et al. 2017b). Figure 7 (Li et al. 2017b) depicts feature extraction, network training, and score fusion for an action recognition task. Skeleton-based features of the spatial and temporal domains are input to the CNN and LSTM, respectively: spatial features correspond to the relative positions of and distances between joints, while temporal features correspond to JDM and joint trajectories. The scores of these features are combined by late fusion.
A model named Differential RNN (DRNN) is proposed in Veeriah et al. (2015), wherein actions are represented using a spatio-temporal representation and the network is trained using the Back-Propagation-Through-Time (BPTT) algorithm. Cross-validation accuracy in Veeriah et al. (2015) is reported by training on 16 randomly chosen subjects and testing on the rest. A deep LSTM network can provide end-to-end action recognition in which feature co-occurrences are learned from the skeleton joints (Zhu et al. 2016b).
DBN is a DL-based network that uses the Restricted Boltzmann Machine (RBM) for training. In Hassan et al. (2018), DBN is used for smartphone-based HAR. Training in DBN is divided into two phases termed pre-training and fine-tuning. To improve the performance of HAR, an RBM with two hidden layers is used for network initialization. To obtain rotation-, translation-, and scale-invariant features, Motion History Image (MHI), Average Depth Image (ADI), Depth Differential Image (DDI), Hu invariant moments, and R-transform methods are used in Foggia et al. (2014). DBN is used to generate a robust representation of the samples as well as to build hierarchical features from low-level features (Foggia et al. 2014).
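The RBM pre-training phase mentioned above can be sketched as a single contrastive-divergence (CD-1) weight update; the sizes, learning rate, and toy binary sample are illustrative, and bias terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_visible, n_hidden, lr = 6, 3, 0.1            # toy layer sizes
W = rng.standard_normal((n_visible, n_hidden)) * 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W):
    """One CD-1 step: up, sample, down, up; return updated weights."""
    p_h0 = sigmoid(v0 @ W)                      # hidden probs given the data
    h0 = (rng.random(n_hidden) < p_h0) * 1.0    # sample a hidden state
    p_v1 = sigmoid(h0 @ W.T)                    # reconstruction of the visibles
    p_h1 = sigmoid(p_v1 @ W)                    # hidden probs given the recon
    grad = np.outer(v0, p_h0) - np.outer(p_v1, p_h1)  # positive - negative phase
    return W + lr * grad

v = rng.integers(0, 2, n_visible).astype(float) # a toy binary training sample
for _ in range(50):                             # a few CD-1 sweeps
    W = cd1_update(v, W)
```

In a DBN, each RBM is trained this way layer by layer (greedy pre-training), and the stacked weights then initialize the network before supervised fine-tuning.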
In Gowda (2017), a hybrid model of DL techniques is used for extracting features and identifying interest features: DBN extracts motion and static image features, its output is given to KPCA, and the result is passed to CNN to classify the action. Another approach based on combining spatial and temporal information was proposed in Qi et al. (2019). A semantic graph is constructed from each input video frame, and node-RNN and edge-RNN are used to train the model. The constructed model can label the whole scene as well as individual actions or interactions involving different persons. Subsequently, in Ahsan et al. (2018), a GAN is used for discriminator network training, and the learned discriminator provides initialized weights. The unsupervised pre-training provides the advantages of automated feature engineering and frame sampling (Lee et al. 2017). A typical GAN consists of a generator and a discriminator: the generator aims to create data similar to the training data, while the discriminator aims to maximize the probability of assigning the correct label to both training examples and samples produced by the generator.
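A minimal sketch of the adversarial objectives described above, computing the discriminator and generator losses from hypothetical discriminator outputs; the probabilities below are invented for illustration, not drawn from any cited system.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy of probabilities p against labels y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical discriminator outputs D(x) on real samples and D(G(z)) on fakes.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.2, 0.1, 0.3])

# Discriminator wants real -> 1 and fake -> 0.
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))
# Generator wants the discriminator to label its samples as real.
g_loss = bce(d_fake, np.ones_like(d_fake))
```

Here the discriminator is doing well (real scored high, fake scored low), so its loss is small while the generator loss is large; training alternates updates to drive these losses against each other.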
In Yang et al. (2019), openGAN is used for recognizing actions in an open-set setting, where the training and testing sets contain different categories. The components of openGAN perform feature extraction and feature combination using dense blocks; each dense block is built from sub-blocks that combine two convolutional layers with a concatenation layer, and the dense blocks are connected by a stack of layers. As shown in Fig. 8 (Yang et al. 2019), convolutional and de-convolutional layers are used in the generator along with a dense block. In Fig. 9 (Yang et al. 2019), convolutional and pooling layers are used in the DenseNet-based discriminator network of the GAN. Feature maps are projected to an \(n+1\)-dimensional vector for n classes. The last layer uses a softmax classifier, and the Mean Squared Error (MSE) loss function is used.
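The channel growth produced by the concatenation layers in a dense block can be sketched as follows; the convolution is replaced by a 1x1 stand-in, and all sizes (channels, growth rate, number of sub-blocks) are illustrative rather than those of openGAN.

```python
import numpy as np

def conv_stub(x, out_channels):
    """Stand-in for a pair of convolutional layers (1x1 conv over channels)."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((out_channels, x.shape[0])) * 0.1
    return np.tensordot(w, x, axes=([1], [0]))

def dense_block(x, growth_rate, n_sub_blocks):
    """Each sub-block's output is concatenated with its input along channels."""
    for _ in range(n_sub_blocks):
        y = conv_stub(x, growth_rate)
        x = np.concatenate([x, y], axis=0)       # the concatenation layer
    return x

x = np.zeros((16, 8, 8))                          # (channels, height, width)
out = dense_block(x, growth_rate=12, n_sub_blocks=3)
# channels grow as 16 + 3 * 12 = 52, so every sub-block sees all earlier features
```

This concatenation pattern is what lets later layers reuse all earlier feature maps, which is the defining property of DenseNet-style blocks.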
Although deep networks are dominant in the action recognition field, shallow ML-based methods should also be considered before blindly applying deep networks: shallow techniques perform efficiently on small datasets compared to deep networks. In some cases, transfer learning can be applied when features are general across the base and target datasets, and fine-tuning the DL models can further improve their performance. In Das et al. (2018), spatial layout and temporal encoding are modeled for daily activity recognition. Skeleton data is used to capture long-term dependencies using a 3-layer stacked LSTM, and pose-based static features are extracted using CNN. From each frame, body-region features are extracted for the left hand, right hand, upper body, and full body. A pre-trained ResNet-152 is used for deep feature extraction. The extracted features are fed into an SVM, which provides classification scores on the cross-validation set.
For an action recognition problem, it is shown in Rensink (2000) that humans cannot focus their attention on an entire scene at once; instead, relevant information is extracted by sequentially focusing on different parts of the scene. When performing a particular task, the focus of a model can be identified using attention models, which add a dimension of interpretability (Sharma et al. 2015). In Sharma et al. (2015), input videos are processed using GoogLeNet and features are extracted from the last convolutional layer; a three-layer LSTM predicts the class labels. A cross-entropy loss function with attention regularization is used, forcing the model to look at each region of the frame. The attention mechanism can be used in HAR to focus on a particular body part. In Das et al. (2019a), an end-to-end action recognition method is proposed using a 3D skeleton and spatial attention from an I3D pre-trained model with 3D CNN, wherein temporal features are extracted using a three-layer stacked LSTM. The introduced attention-based mechanism focuses on the relevant parts of the action.
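A sketch of soft attention over frame regions, with a regularizer in the spirit of Sharma et al. (2015) that encourages the total attention each region receives over time to approach 1, so no region is ignored; all dimensions, weights, and features below are toy values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
T, K = 8, 8                      # time steps and spatial regions (illustrative)
logits = rng.standard_normal((T, K))
alpha = softmax(logits)          # attention weights: each frame's weights sum to 1

# Attention regularizer: penalize regions whose summed attention over time
# deviates from 1, pushing the model to look at every region at least once.
lam = 0.1
attn_reg = lam * np.sum((1.0 - alpha.sum(axis=0)) ** 2)

# The attended context vector per frame: a weighted sum of region features.
features = rng.standard_normal((T, K, 16))
context = (alpha[:, :, None] * features).sum(axis=1)
```

The term `attn_reg` would be added to the cross-entropy loss during training; at inference, inspecting `alpha` shows which regions the model attended to, which is the interpretability benefit mentioned above.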
Discussion Many opportunities are open in HAR with DL models due to available computing facilities, for example, GPUs. DL-based HAR methods focus on learning motion features and utilizing them to classify actions.
CNN-based networks provide good results and identify spatial relationships from RGB data, whereas LSTM is a promising network for exploiting temporal dependencies in input videos. Due to the complementary properties of these networks, model performance can be greatly improved by late score fusion of CNN and LSTM. However, CNN requires a large amount of training data; otherwise, overfitting may occur, although dropout or data augmentation techniques can be applied to alleviate this problem. We briefly discuss the advantages and disadvantages of action classification using DL-based techniques in Table 9. We also provide a summary of DL-based techniques for HAR: action recognition frameworks, datasets used, and their corresponding results are summarized in Table 10. A summary of action analysis techniques based on traditional ML and DL is provided in Table 11.
3 Datasets
Datasets play a key role in comparing different algorithms applied to a particular objective, and task-specific evaluation depends on the parameters of each dataset. It is computationally economical to capture two-dimensional (2D) color image sequences in real time. However, with the introduction of inexpensive 3D sensors, such as Kinect, depth-based processing has become a subject of interest to researchers (Li et al. 2010).
In this survey, we discuss different types of RGB and RGBD datasets in detail. Recording can also be performed with non-visual sensors, such as wearable on-body accelerometers and gyroscopes, as well as radar; sensor-based datasets have been reviewed in De-La-Hoz-Franco et al. (2018). The categorization of datasets is shown in Fig. 10. In this paper, we review camera-based datasets with RGB, depth, and skeleton modalities.
In the KTH dataset (NADA 2004), human actions are performed several times under different situations. This dataset has few action classes, a resolution of \(160 \times 120\) pixels, and does not provide background models. The Ballet dataset (Wang and Mori 2009) contains eight actions from a ballet DVD, each performed by three subjects; the dataset contains variation in speed, scale (spatial and temporal), and clothing.
The I3DPost dataset (Gkalelis et al. 2009) contains eight actions including two-person interaction. All the cameras are set up to provide a 360-degree view of the captured scene. Unusual Crowd Activity dataset (University of Minnesota 2010) contains normal and abnormal crowd videos. The dataset comprises 11 scenarios of the escape scene in videos having indoor and outdoor scenes.
The NATOPS dataset (Song et al. 2011) contains 24 aircraft handling signals routinely practiced in the deck environment. The signals were repeated by twenty subjects 20 times, and images have a resolution of \(320 \times 240\) pixels. The CAVIAR dataset (Fisher 2012) includes 9 actions. The data is captured at the INRIA Labs and a shopping centre in Lisbon, with an image resolution of \(384 \times 288\) pixels.
The Hollywood2 dataset (Laptev 2012) contains human actions in 12 classes and scenes in 10 classes, comprising 3669 video clips generated from movies. In the Florence 3D action dataset (MICC 2012), videos are captured using a Kinect camera; the dataset includes nine activities, each performed by ten subjects two to three times. The DHA dataset (M. C. Laboratory 2012) contains 23 actions performed by 21 actors; actions are classified across three different scenes, and background information is removed from the depth data. The MHAD dataset (Berkeley 2014) contains a set of activities with dynamic body movements, some involving both the upper and lower extremities; its image resolution is \(640 \times 480\).
The HMDB51 dataset (Jhuang 2013) contains 51 action categories, with videos from movies, YouTube, and Google videos. In addition to action labels, meta-labels are provided to describe each input video. The UCF Sports dataset (CRCV 2010) contains sports actions featured on channels such as BBC and ESPN. It contains 150 sequences with \(720 \times 480\) resolution. The dataset is challenging in terms of its wide variety of scenes and viewpoints, thus encouraging research on unconstrained environments.
The UCF50 dataset (CRCV 2012) contains 50 action categories from YouTube. The goal of the UCF101 dataset (CRCV 2013) is to perform template matching in the temporal domain. The UCF YouTube action dataset (CRCV 2013) was created for recognizing actions from videos; videos in this dataset are typical uploads by amateur users recording with hand-held cameras.
The MuHAVi dataset (YACVID 2014) contains videos observed at varying angles and distances from the subject. Its actions are filmed using eight surveillance cameras, which are not calibrated before capturing the videos. The Sports-1M dataset (Karpathy 2014) is composed of 1,133,158 video URLs from YouTube, annotated automatically with 487 sports labels.
While dividing the dataset into training and testing sets, in some cases, similar videos can occur in both the training and testing sets (Karpathy 2014). The UCSD Anomaly Detection dataset (Statistical Visual Computing Lab 2014) was acquired with a stationary camera overlooking pedestrian walkways. Peds1 contains clips of groups of people walking towards and away from the camera, and Peds2 contains scenes of pedestrian movement.
The major challenges in the dataset arise from the similarity between some of the actions. The Northwestern-UCLA Multiview Action3D dataset (Wanqing Li 2014) contains 10 actions with RGB, depth, and skeleton joint information. The Weizmann dataset (Blank et al. 2005) comprises 10 action categories; all action sequences are recorded with a static camera against a plain background at an image resolution of \(180 \times 144\). The Johns Hopkins University multimodal action (JHUMMA) dataset (Murray et al. 2015) contains actions performed by ten actors, recorded using three ultrasound sensors and an RGB-D sensor. The dataset was captured indoors in an auditorium with curtains.
The IXMAS dataset (INRIA 2016) models human actions by incorporating viewpoint-invariant data and different body sizes. Five cameras are used to capture multiple views. Thirteen daily-life activities were performed, with variation arising from clothing styles, body sizes, and execution rates.
The MSR action dataset (Liu 2016) contains 16 video sequences covering three types of actions; these sequences are captured with some clutter (Chaquet et al. 2013). The MSR Action3D dataset (Li 2017b) contains twenty actions performed by ten subjects at an image resolution of \(320 \times 240\). The depth maps were captured using a depth camera, and the dataset provides color, depth, and skeleton information for each action. Ten actions in the dataset are missing due to erroneous information. The MSRDailyActivity3D dataset (Li 2017a) contains 16 activities, with subjects usually performing actions in two poses: "sitting on sofa" and "standing".
The Kinetics dataset (Kay et al. 2017) is a large-scale dataset containing 300,000 video clips in 400 classes, sourced from YouTube videos. In Yan et al. (2018), the locations of 18 joints are estimated on every frame of the clips. The SBU Kinect Interaction dataset (Computer-Vision-Lab 2012) consists of 21 pairs of two-person interactions of eight types, each having two sets. The videos were captured using the Kinect sensor, and each frame contains color and depth features. The UTKinect-Action3D dataset (Xia 2016) contains human actions recorded indoors (Xia et al. 2012) and provides depth, color, and skeleton information. RGB images have a resolution of \(480 \times 640\) and depth images \(320 \times 240\); the dataset also contains frames with occlusion.
The NTU-RGBD action recognition dataset (Rapid-Rich-Object-Search Lab 2016) contains 56,880 action samples of 60 classes performed by 40 subjects across 80 views, with RGB video, skeleton, and depth modalities. Two protocols are popularly used for evaluation, namely CS and CV. In the CS evaluation, the forty subjects are split into training and testing groups: samples from subject IDs 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35, 38 are used for training, while the remaining subjects are reserved for testing (Rapid-Rich-Object-Search Lab 2016). In the CV evaluation, all samples from camera 1 are used for testing, and samples captured by cameras 2 and 3 are used for training. In other words, the training set consists of the front and two side views of the actions, while the testing set includes the left and right 45-degree views of the action performances. For this evaluation, the training and testing sets have 37,920 and 18,960 samples, respectively.
The MIVIA Action dataset (MIVIA-Lab 2017) is composed of seven types of actions. All subjects performed each action twice, and the duration of each action varies with its nature. A Kinect sensor is used to acquire the depth images and background. The CAD-60 dataset (Robot-Learning-Lab 2017) contains 60 human activity videos with RGB, depth, and tracked skeleton sequences, captured using the Microsoft Kinect sensor at an RGBD resolution of \(240 \times 320\).
The Dongguk Activities and Actions database (CGCV-Laboratory 2017) is produced for an indoor surveillance environment. The database consists of three scenarios, named straight-line movement, corner movement, and standing still, for 20 people. To improve the performance of the action recognition task, a better understanding of the input data and its characteristics is required. The Toyota Smarthome dataset (Das et al. 2019b) captures daily activities and incorporates challenges of action recognition tasks such as high class imbalance, composite activities containing sub-activities, and activities of variable duration with similar motion. This dataset was captured with elderly people, and no script was given to the subjects while performing actions throughout the day. Unlike other datasets, it comprises actions with variable distance between camera and subject, and it provides three modalities, namely RGB, depth, and skeleton.
A brief understanding of the advantages and disadvantages of such datasets is provided in Table 12. We also summarize various dataset attributes such as background, number of participants, number of cameras, movement of the camera, number of male and female participants, number of actions, modality, type of view, occlusion, and whether an action is scripted or not in Table 13.
3.1 Discussion
One of the important aspects of mapping to real-world complexity is that datasets should contain occlusion and intra- and inter-class variations. Most of the datasets discussed in this survey provide actions based on daily activities, while some have no specific focus. Other discussed datasets come under the gaming category (for example, the MSRAction3D dataset), and the CAVIAR dataset contains actions related to surveillance applications.
In RGB-based HAR techniques, two popular datasets, namely KTH and Weizmann, are primarily used. With the majority of techniques, these datasets achieve 100% accuracy; although they contain limited intra-class variation, they provide a good evaluation criterion for new methods. Moreover, the KTH dataset contains a limited number of activities and a similar background. To meet real-world challenges and scale up the complexity of the data, datasets containing videos downloaded from the Internet are also considered; for example, Sports-1M and HMDB include background clutter and scale variation to increase complexity.
Datasets such as Hollywood2 contain a limited number of labeled videos, and for 3D action analysis there is a lack of large-sized datasets. Therefore, the NTU-RGBD dataset, with 56,880 RGB+D video samples from 40 different human subjects, was captured using Microsoft Kinect v2.
To the best of our knowledge, there are no sources of public 3D videos for the unconstrained environment. The recording of NTU-RGBD was also performed in a restricted environment, such as a laboratory, where the activities were performed under strict guidance. Therefore, Activities of Daily Living (ADL) datasets have only a partial capability to reflect real-world scenarios.
4 Applications
HAR can be used in a variety of applications such as content-based video analysis and retrieval, visual surveillance, HCI, education, medical, as well as abnormal activity recognition; this section discusses the significance of HAR in respective applications.
4.1 Content-based video summarization
In the current era, the rapid growth in video content is due to the immense use of multimedia devices. Retrieving this information manually would be a laborious and time-intensive task. The main goal of the content retrieval task is to provide users with the content of their interest, a concept known as Content-Based Video Retrieval (CBVR). In Kim and Park (2002), key-frames of the video are compared with the target videos, but the computational cost of the key-frame method is high.
On the other hand, color and texture features are used for video summarization in Shereena and David (2014), where the authors also demonstrate the advantage of combining the two. Real-time video summarization is demonstrated in Bhaumik et al. (2015), wherein a threshold based on a probability distribution is used to generate the video summary, and duplicate features are removed through redundancy elimination.
4.2 Human–computer interaction
The HCI-based system aims to make human–computer interaction as natural as daily human interaction. A gesture recognition system was proposed in Sharma and Verma (2015) to recognize static hand gesture images with simple backgrounds. Skin segmentation is performed and fingers are detected by counting white-colored objects in the segmented mask, while morphological filters are used to improve image quality. In Czuszynski et al. (2017), pose classification is performed in a gesture recognition system, and gesture information is stored as a time-stamped sequence. Data is represented in three forms: raw data, a detailed feature-based description of data frames, and a high-level feature representation depicting the hand pose. A two-layer ANN recognizes the extracted features and outputs a number denoting the type of hand pose. Also, a cost-effective gesture recognition system based on data captured from a laptop camera is proposed in Haria et al. (2017), wherein a Haar cascade classifier is used to classify palm and fist gestures.
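Counting white-colored objects in a skin-segmented mask, as in the finger-detection step above, amounts to counting connected components; the sketch below uses a toy binary mask with three separated stripes standing in for fingers.

```python
import numpy as np
from collections import deque

def count_blobs(mask):
    """Count 4-connected white regions in a binary mask (e.g. a skin-segmented hand)."""
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                count += 1                     # found a new blob; flood-fill it
                q = deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
    return count

# Toy mask with three separated "finger" stripes.
mask = np.zeros((6, 9), dtype=bool)
mask[1:5, 1] = mask[1:5, 4] = mask[1:5, 7] = True
n_fingers = count_blobs(mask)   # -> 3
```

In a real pipeline the mask would come from skin-color thresholding followed by the morphological filtering mentioned above, which removes small noise blobs before counting.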
4.3 Education
Recognizing human actions from videos plays a crucial role in education and learning. Analyzing human actions from video in educational institutes can provide behavior recognition and automatic attendance monitoring during class. The manual procedure for taking student attendance can be time-consuming, and during this process the instructor may not be able to observe the students.
Nowadays, due to technological advancements, automated real-time attendance monitoring systems can be deployed in the classroom. In Chintalapati and Raghunadh (2013), an automated attendance monitoring system is proposed using the Viola-Jones algorithm. A comparative analysis of feature extraction algorithms using PCA, LDA, and LBP Histograms (LBPH) shows that the LBPH method performs best. To capture videos, the camera is placed at the classroom entrance, and students are registered while entering the classroom.
In Lim et al. (2017), students and their activities, such as leaving and entering the classroom, are identified. The system performs action recognition and identification through face recognition and motion analysis. A Haar cascade classifier is used for detecting faces, and a combination of the eigenfaces and fisherfaces algorithms is used for training. For motion analysis, three sub-modules are used, namely body detection, tracking, and motion recognition. To perform attendance monitoring, assumptions are made about the brightness and size of the classroom.
4.4 Healthcare
Healthcare of elderly people has been a major concern, as elderly people are prone to disease. Continuous monitoring using automatic surveillance systems is required to detect falls or abnormal behavior in elderly patients. An approach for representing the behavior of dementia (Alzheimer's and Parkinson's disease) patients is presented in Arifoglu and Bouchachia (2017), where abnormal activity in elderly patients with dementia is detected using RNN variants: vanilla RNNs, LSTM, and Gated Recurrent Units (GRU).
Real-time monitoring of abnormal patient behavior can be performed using smartphone-based sensors. A smartphone-based Wireless Body Sensor Network is used in which physiological data is collected by body sensors in a smart shirt. Temperature, ECG, BP, BG, and \({\hbox {SpO}}_2\) are continuously monitored, and an alert message is issued in real time in case of an abnormal sign (You et al. 2018). Subsequently, the position and velocity of the person are extracted using a Kinect sensor in Nizam et al. (2017) for fall detection. Within the sensor range, the velocity of the body is continuously monitored to detect abnormal activities, and the subject's position in subsequent frames is used to confirm whether a fall has occurred. To compute velocity, skeleton joints from the Kinect sensor are used, and accuracy, sensitivity, and specificity are calculated for fall and non-fall actions. Depth maps can also be used for fall detection (Panahi and Ghods 2018). Feature extraction is based on ellipse fitting, in which the orientation of the ellipse is calculated for pose identification (Yu et al. 2013); another feature is the distance from the ellipse center to the floor (represented as a plane) in 3D space. An SVM is applied to classify the pose-based features.
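The velocity-based fall detection described above can be sketched by thresholding the frame-to-frame speed of a tracked joint; the trajectory, frame rate, and threshold below are invented for illustration and are not the values used in the cited works.

```python
import numpy as np

def joint_velocity(positions, dt):
    """Frame-to-frame speed of a tracked joint; positions is (T, 3) in metres."""
    return np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt

# Toy trajectory of a hip joint: steady height, then a rapid downward drop.
dt = 1.0 / 30.0                              # assumed 30 fps capture
t = np.arange(30)
z = np.where(t < 20, 1.0, 1.0 - 0.08 * (t - 19))   # height collapses after frame 20
positions = np.stack([np.zeros_like(z), np.zeros_like(z), z], axis=1)

speed = joint_velocity(positions, dt)
FALL_SPEED = 1.5                             # m/s threshold, an illustrative choice
fall_detected = bool(np.any(speed > FALL_SPEED))
```

A real system would then, as described above, check the subject's position in the following frames (for example, remaining near floor level) to confirm the fall before raising an alert.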
4.5 Video surveillance
A video surveillance system offers visual surveillance while the observer is not directly at the recording site. The surveillance task may be performed in real time by analyzing the video stream, or the video may be stored and evaluated later as required. Video surveillance can also be used to detect abnormal activities and to analyze player behavior in gaming videos (Wang et al. 2015).
4.6 Abnormal activity recognition
Abnormal behavior recognition can be used to ensure security in places such as railway stations, airports, and outdoor spaces. Recognizing such events is challenging due to the large number of surveillance cameras.
Abnormal behavior in three categories, namely person, group, and vehicle, is identified using a single Dynamic Oriented Graph (Varadarajan and Odobez 2009). Abnormal behavior can be identified even for objects following the same paths: for example, a person crossing a railway line is considered unusual, whereas a train passing along the railway line is considered usual activity. The anomaly event detection task is divided into global and local anomalies (Miao and Song 2014), wherein the global anomaly task performs emergency clustering and individual behavior is assessed under the local anomaly task. For global anomaly detection, the UMN dataset (CRCV 2020) is used, in which people suddenly leaving the scene is considered a global anomaly; for local anomaly detection, the UCSD dataset (Statistical_Visual_Computing_Lab 2014) is used, in which samples of people walking form the training set and abnormal behavior includes a single person cycling or skating.
A graph-based method for abnormal activity detection is presented in Duque et al. (2007), wherein graph nodes are represented by STIPs and connections between nodes are given by a fuzzy membership function. The anomaly detection task is divided into two subtasks, local and global, and both local and global abnormal activities are classified using SVM. An intelligent system for crowded scenes is presented in Feng et al. (2017) using a deep Gaussian Mixture Model. Multi-layer nonlinear input transformation is performed adaptively for feature extraction from sensors; this transformation improves the performance of the network with few parameters.
4.7 Sports
Motion in sports videos is difficult for trainers to analyze, and continuously observing long matches can be difficult for the audience to follow (Thomas et al. 2017). Recent research includes the analysis of player movement, individually and in groups, for training as well as for finding key stages in the game. In YACVID (2014), sports video highlight classification is performed using a DNN: players' actions are used to acquire a higher-level representation through a two-stream CNN combining skeleton-joint-based and RGB-based features, and LSTM is used to model temporal dependencies in the video.
In Ullah et al. (2019), the pre-trained deep CNN model VGG-16 is used to extract frame-level features for identifying player actions. A deep autoencoder learns temporal changes, and human actions are classified using SVM. Graph-based models are also popular for recognizing group activities: in Qi et al. (2019), sports videos are classified based on scene content using a semantic graph, and a structural RNN extends the semantic graph model to the temporal dimension.
4.8 Entertainment
The HAR field has been widely applied to identifying actions in movies or dance-related movements. In Laptev et al. (2008), an action retrieval task is presented using a text-based classifier (regularized perceptron), and action classification from movie scripts is shown using space–time features and non-linear SVMs. In Wang et al. (2017), movie actions are classified using 3D CNN; to minimize the loss of information during learning, the study introduced an encoding module and a temporal pyramid pooling layer, and a feature concatenation layer combines motion and appearance information. Two movie datasets, namely HMDB51 (Jhuang 2013) and Hollywood2 (Laptev 2012), are used for experimentation. Another application of HAR is identifying dance movements from videos: in Kumar et al. (2018), the authors propose a multi-class AdaBoost classifier with fused features, using a dataset of online and offline videos of different Indian classical dance forms.
In video classification, motion information between frames plays a crucial role in the performance of action classification. In Castro et al. (2018), the authors identified that for motion-intensive videos, visual information alone is not sufficient to classify actions efficiently; their analysis of the action recognition task uses video, optical flow, and multi-person pose data.
5 Challenges
Despite the progress made in the field of HAR, state-of-the-art algorithms still misclassify actions due to several major challenges, which we discuss in this section. For an action recognition task, there can be differences among actions performed by the same subject, and actions from different classes may appear similar; for example, jogging can be mistaken for fast running. HAR models should therefore handle intra-class variation as well as inter-class similarity. In Lu et al. (2018), sports action classification is performed, where a model trained on one sport does not provide good results when tested on another sport.
For handcrafted representations, the high dimensionality of the training dataset may incur substantial computation, so various dimensionality reduction techniques are used. Furthermore, at different times an action may be performed at varying speeds by the same or different subjects, and this variation in action speed must be taken into consideration by a HAR system.
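As one example of the dimensionality reduction mentioned above, PCA can compress high-dimensional handcrafted descriptors via the SVD of the centred data; the feature matrix below is random and purely illustrative.

```python
import numpy as np

def pca_reduce(X, k):
    """Project feature vectors X (n_samples, n_dims) onto the top-k principal axes."""
    Xc = X - X.mean(axis=0)                  # centre each feature dimension
    # SVD of the centred data: rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 50))           # e.g. 50-dim handcrafted descriptors
Z = pca_reduce(X, k=10)                      # compact 10-dim representation
```

The projected features retain the directions of maximum variance, so a classifier trained on `Z` sees a far smaller input at little cost in discriminative information.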
In Chen et al. (2016), action speed variation is handled using a three-level multi-temporal representation of the DMM feature. Action recognition is also heavily affected by background clutter, where unwanted background motion may create ambiguities; such problems can be handled by applying background subtraction before the action recognition task (Kalaivani and Vimala 2015). Depth-based techniques are robust to environment and background changes (Jalal et al. 2012).
5.1 ML-based HAR
Conventional ML for action classification may be limited for large-scale actions performed in challenging environments (for example, transformations applied to single-actor actions, interaction-based actions, and actions involving multiple subjects). ML-based classifiers also cannot easily handle large amounts of data.
A further challenge for traditional ML-based methods is handling imbalanced data. Moreover, training with ML techniques can suffer from a slow learning rate, which worsens for large-scale training data, and a low recognition rate. In ML-based HAR, the majority of work is conducted in supervised learning; although this has provided promising solutions, labeling all the activities requires a large effort.
5.2 DL-based HAR
DNNs are said to perform better with a large amount of training data (Sze et al. 2017). To learn hierarchical features from input videos, approaches based on RNN and LSTM have been used, improving performance on action recognition tasks that involve temporal dependencies; however, these models increase network complexity.
CNN-based networks are also popular DNNs for HAR, but they too face challenges such as occlusion and viewpoint variation. It is also difficult to interpret the deep features a CNN extracts: deep CNNs generally behave as black boxes, lacking interpretability and explainability, and are therefore sometimes difficult to verify. In addition, CNN-based methods rely on large amounts of data, yet many realistic scenarios lack sufficient data for training, even though some large-scale datasets have been developed to make fine-tuning of CNN architectures possible.
5.3 Hybrid HAR
Hybrid approaches can combine features and preprocessing steps; however, the computational complexity of the resulting system is high, which hampers both real-time processing of continuous video streams and the processing of lengthy videos. A further challenge of hybrid HAR is the computational cost of training the model.
6 Future directions
Although HAR approaches have made remarkable progress so far, applying them in real-world systems or applications is still nontrivial. This section discusses future directions for traditional ML-based, DL-based, and hybrid HAR.
6.1 Traditional ML-based HAR
The HAR task can be extended to identify actions together with emotions, such as happy-sitting or angry-running. Another direction is to design models tailored to specific applications. Moreover, ML algorithms should be made capable of operating on massive data, and ML methods, which currently target trimmed action sequences, should be extended to untrimmed ones.
6.2 DL-based HAR
To improve CNN performance, 3D CNNs may be applied, as they can exploit spatiotemporal features. Ensemble learning is another prospective direction: model performance can be improved by combining multiple architectures. Similarly, concepts such as batch normalization, dropout, and new activation functions are worth exploring. To improve generalization, reinforcement learning or active learning techniques can be used. In future work, gait parameters can be computed from walk detection to assess fall risk and to monitor disease, and multi-person recognition can be addressed. Methodologies should also be provided for classifying videos containing overlapping actions. Daily-activity HAR applications require actions to be identified continuously from untrimmed video streams, which is known as online action recognition; a further direction in this field is therefore to extend action recognition methods to the online case.
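A minimal sketch of why a 3D CNN can exploit spatiotemporal structure: a 3D kernel convolves jointly over time and space, so it can respond to change across frames rather than only to per-frame appearance. The naive "valid" convolution below (all shapes and the temporal-difference kernel are illustrative, not a real network layer) demonstrates this:

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution over a (T, H, W) clip."""
    t, h, w = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each response mixes spatial AND temporal neighbors.
                out[i, j, k] = np.sum(
                    video[i:i + kt, j:j + kh, k:k + kw] * kernel)
    return out

# Clip whose intensity grows by 25 per pixel from frame to frame.
clip = np.arange(4 * 5 * 5, dtype=float).reshape(4, 5, 5)
# Temporal-difference kernel: averages frame t+1 minus frame t.
temporal_edge = np.zeros((2, 3, 3))
temporal_edge[0] = -1.0 / 9
temporal_edge[1] = 1.0 / 9
resp = conv3d_valid(clip, temporal_edge)
print(resp.shape)  # (3, 3, 3)
```

Every response here is the mean inter-frame change (25.0), something no purely 2D spatial kernel applied frame by frame could measure.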
6.3 Hybrid HAR
A further direction is multimodal perception for action recognition: the current trend in the HAR field is toward RGB-D-based methods (such as skeleton and depth sensors). The Kinect is a low-cost sensor for capturing depth data, but it usually does not work properly in sunlight (Pagliari and Pinto 2015), which may hinder the performance of a HAR system. For this reason, multimodal fusion of RGB, skeleton, and depth data can be used to improve system performance.
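A minimal sketch of such multimodal fusion, assuming each modality already yields class probabilities and fusing them late by weighted averaging (the scores, weights, and modality names below are hypothetical):

```python
import numpy as np

def late_fusion(score_dict, weights=None):
    """Weighted late fusion of per-modality class probabilities.

    score_dict maps modality name -> (n_classes,) probability vector,
    e.g. softmax outputs of RGB, skeleton, and depth classifiers.
    """
    names = sorted(score_dict)
    scores = np.stack([score_dict[m] for m in names])
    if weights is None:
        weights = np.full(len(names), 1.0 / len(names))
    fused = weights @ scores
    return fused / fused.sum()  # renormalize to a distribution

rgb = np.array([0.6, 0.3, 0.1])       # confident in class 0
skeleton = np.array([0.2, 0.5, 0.3])  # prefers class 1
depth = np.array([0.5, 0.4, 0.1])
fused = late_fusion({"rgb": rgb, "skeleton": skeleton, "depth": depth})
print(int(np.argmax(fused)))  # 0
```

Here the depth and RGB streams outvote the skeleton stream; per-modality weights could instead be tuned on validation data so that, for example, depth is downweighted in outdoor (sunlit) footage.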
7 Concluding remarks
Automated HAR is considered a domain for understanding human behavior. This review surveys existing HAR techniques for trimmed videos. We have discussed the general framework of an action recognition task, comprising feature extraction, feature encoding, dimensionality reduction, and action classification. Feature extraction methods are categorized based on STIP, shape, texture, and trajectory. Owing to the large size of the extracted features, dimensionality reduction techniques are used, which can be divided into two types: supervised and unsupervised. We have discussed action classification methods involving ML and DL, along with the advantages and disadvantages of action representation, dimensionality reduction, and action classification methods. The datasets used by all the surveyed approaches consist of segmented videos with a known set of action labels, and the different datasets used for HAR have been discussed. Finally, application areas such as content-based video retrieval, video surveillance, HCI, education, medical, and abnormal activity detection are also covered in the paper.
Abbreviations
- ABC: Artificial Bee Colony
- ADI: Average Depth Image
- ADL: Activities of Daily Living
- AGC: Adaptive Graph Convolution
- AGCN: Adaptive Graph Convolutional Network
- ANN: Artificial Neural Network
- ARA: Average Recognition Accuracy
- ASAGA: Adaptive Simulated Annealing Genetic Algorithm
- BN: Batch Normalization
- BoVW: Bag of Visual Words
- BPTT: Back-Propagation-Through-Time
- CAE: Convolutional Autoencoder
- CHMM: Coupled Hidden Markov Model
- CNN: Convolutional Neural Network
- CS: Cross-Subject
- CV: Cross-View
- DBN: Deep Belief Network
- DDI: Depth Difference Image
- DDS: Depth Differential Silhouettes
- DE: Differential Evolution
- DL: Deep Learning
- DMM: Depth Motion Map
- DNN: Deep Neural Network
- DRNN: Differential Recurrent Neural Network
- DT: Decision Tree
- DTW: Dynamic Time Warping
- ELM: Extreme Learning Machine
- FCN: Fully Convolutional Network
- FTP: Fourier Temporal Pyramid
- GA: Genetic Algorithm
- GAN: Generative Adversarial Network
- GDI: Geodesic Distance Iso
- GLCM: Grey Level Co-occurrence Matrix
- GRU: Gated Recurrent Unit
- HAR: Human Action Recognition
- HCI: Human–Computer Interface
- HMM: Hidden Markov Model
- HOF: Histogram of Optical Flow
- HOG: Histogram of Oriented Gradient
- HoMB: Histogram of Motion Boundary
- HoVW: Histogram of Visual Word
- IEF: Iterative Error Feedback
- JDM: Joint Distance Map
- KDA: Kernel Discriminant Analysis
- KELM: Kernel Extreme Learning Machine
- kNN: k-Nearest Neighbor
- KPCA: Kernel PCA
- LBP: Local Binary Pattern
- LBPH: LBP Histogram
- LDA: Linear Discriminant Analysis
- LHMM: Layered Hidden Markov Model
- LOAO: Leave One Actor Out
- LOSO: Leave One Sequence Out
- LSTM: Long Short-Term Memory
- MAP: Mean Average Precision
- MEI: Motion Energy Image
- MHI: Motion History Image
- MiCT: Mixed Convolution Neural Network
- ML: Machine Learning
- MSE: Mean Squared Error
- NBNN: Naïve Bayes Nearest Neighbor
- PCA: Principal Component Analysis
- PCOG: Pyramid Correlogram of Oriented Gradients
- PoF2I: Pose Feature to Image
- PSO: Particle Swarm Optimization
- PSO-WC: PSO-Weight Class
- PSO-WV: PSO-Weight Views
- RBD: Reduced Basis Decomposition
- RBF: Radial Basis Function
- RBM: Restricted Boltzmann Machine
- RF: Random Forest
- RNN: Recurrent Neural Network
- ROI: Region of Interest
- RVM: Relevance Vector Machine
- SDEG: Spatial Edge Distribution of Gradients
- SDK: Software Development Kit
- sDTD: sequential Deep Trajectory Descriptor
- SIFT: Scale Invariant Feature Transform
- SPD: Symmetric Positive Definite
- SSM: Self-Similarity Matrix
- STIP: Space–Time Interest Point
- STM: Spatio-Temporal Matrix
- SVM: Support Vector Machine
- TDD: Two-stream Deep Convolution Descriptor
- TpDD: Trajectory-pooled Deep-Convolutional Descriptor
- TS-GCN: Two-Stream Graph Convolutional Network
- TSN: Temporal Segment Network
- WLNBNN: Weighted Local NBNN
- ZSAR: Zero-Shot Action Recognition
References
Abdul-Azim HA, Hemayed EE (2015) Human action recognition using trajectory-based representation. Egypt Inform J 16(2):187–198
Aggarwal JK, Ryoo MS (2011) Human activity analysis: a survey. ACM Comput Surv (CSUR) 43(3):16
Ahsan U, Sun C, Essa I (2018) Discrimnet: Semi-supervised action recognition from videos using generative adversarial networks. ArXiv preprint arXiv:1801.07230
Akilan T, Wu QJ, Safaei A, Jiang W (2017) A late fusion approach for harnessing multi-CNN model high-level features. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 566–571
Al Machot F, Elkobaisi MR, Kyamakya K (2020) Zero-shot human activity recognition using non-visual sensors. Sensors 20(3):825
Amraee S, Vafaei A, Jamshidi K, Adibi P (2018) Abnormal event detection in crowded scenes using one-class SVM. Signal Image Video Process 12:1115–1123
Angelini F, Fu Z, Long Y, Shao L, Naqvi SM (2019) 2D pose-based real-time human action recognition with occlusion-handling. IEEE Trans Multimedia 22(6):1433–1446
Ar I, Akgul YS (2013) Action recognition using random forest prediction with combined pose-based and motion-based features. In: 2013 8th international conference on electrical and electronics engineering (ELECO). IEEE, pp 315–319
Arifoglu D, Bouchachia A (2017) Activity recognition and abnormal behaviour detection with recurrent neural networks. Procedia Comput Sci 110:86–93
Arunraj M, Srinivasan A, Juliet AV (2018) Online action recognition from RGB-D cameras based on reduced basis decomposition. J Real-Time Image Process 17:341–356
Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. In: International workshop on human behavior understanding. Springer, pp 29–39
Baldi P (2012) Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, pp 37–49
Berkeley (2014) Multimodal human action dataset. Last Accessed 11 Dec 2019
Bhaumik H, Bhattacharyya S, Nath MD, Chakraborty S (2015) Real-time storyboard generation in videos using a probability distribution based threshold. In: 2015 fifth international conference on communication systems and network technologies (CSNT). IEEE, pp 425–431
Bhoomika Rathod SB, Pandya D, Patel R (2017) A survey on human activity analysis techniques. Int J Future Revolut Comput Sci Commun Eng 3:462–471
Blank M, Gorelick L, Shechtman E, Irani M, Basri R (2005) Actions as space–time shapes. In: Tenth IEEE international conference on computer vision (ICCV’05) Volume 1, vol 2. IEEE, pp 1395–1402
Bobick AF, Davis JW (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267
Boiman O, Shechtman E, Irani M (2008) In defense of nearest-neighbor based image classification. In: 2008 IEEE conference on computer vision and pattern recognition. IEEE, pp 1–8
Boulgouris NV, Chi ZX (2007) Gait recognition using radon transform and linear discriminant analysis. IEEE Trans Image Process 16(3):731–740
Boulgouris NV, Hatzinakos D, Plataniotis KN (2005) Gait recognition: a challenging signal processing technology for biometric identification. IEEE Signal Process Mag 22(6):78–90
Brand M, Oliver N, Pentland A (1997) Coupled hidden Markov models for complex action recognition. In: Proceedings of the computer vision and pattern recognition, 1997. IEEE, pp 994–999
Bulat A, Tzimiropoulos G (2016) Human pose estimation via convolutional part heatmap regression. In: European conference on computer vision. Springer, pp 717–732
Cao J, Lin Z, Huang G-B (2012) Self-adaptive evolutionary extreme learning machine. Neural Process Lett 36(3):285–305
Carreira J, Agrawal P, Fragkiadaki K, Malik J (2016) Human pose estimation with iterative error feedback. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4733–4742
Castro D, Hickson S, Sangkloy P, Mittal B, Dai S, Hays J, Essa I (2018) Let’s dance: learning from online dance videos. ArXiv preprint arXiv:1801.07388
CGCV-Laboratory (2017) Dongguk activities and actions database. Last Accessed 11 Dec 2019
Chaaraoui AA, Flórez-Revuelta F (2014a) A low-dimensional radial silhouette-based feature for fast human action recognition fusing multiple views. International scholarly research notices, vol 2014
Chaaraoui AA, Flórez-Revuelta F (2014b) Optimizing human action recognition based on a cooperative coevolutionary algorithm. Eng Appl Artif Intell 31:116–125
Chakraborty B, Holte MB, Moeslund TB, Gonzalez J, Roca FX (2011) A selective spatio-temporal interest point detector for human action recognition in complex scenes. In: 2011 IEEE international conference on computer vision (ICCV). IEEE, pp 1776–1783
Chaquet JM, Carmona EJ, Fernández-Caballero A (2013) A survey of video datasets for human action and activity recognition. Comput Vis Image Underst 117(6):633–659
Chen Y (2015) Reduced basis decomposition: a certified and fast lossy data compression algorithm. Comput Math Appl 70(10):2566–2574
Chen X, Yuille AL (2014) Articulated pose estimation by a graphical model with image dependent pairwise relations. In: Advances in neural information processing systems, pp 1736–1744
Chen C, Jafari R, Kehtarnavaz N (2015a) Improving human action recognition using fusion of depth camera and inertial sensors. IEEE Trans Hum Mach Syst 45(1):51–61
Chen C, Jafari R, Kehtarnavaz N (2015b) Action recognition from depth sequences using depth motion maps-based local binary patterns. In: 2015 IEEE winter conference on applications of computer vision (WACV). IEEE, pp 1092–1099
Chen C, Liu M, Zhang B, Han J, Jiang J, Liu H (2016) 3D action recognition using multi-temporal depth motion maps and fisher vector. In: IJCAI, pp 3331–3337
Chen C, Liu M, Liu H, Zhang B, Han J, Kehtarnavaz N (2017) Multi-temporal depth motion maps-based local binary patterns for 3-D human action recognition. IEEE Access 5:22590–22604
Chintalapati S, Raghunadh M (2013) Automated attendance management system based on face recognition algorithms. In: 2013 IEEE international conference on computational intelligence and computing research (ICCIC). IEEE, pp 1–5
Computer-Vision-Lab (2012) SBU Kinect interaction dataset. Last Accessed 11 Dec 2019
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Craley J, Murray TS, Mendat DR, Andreou AG (2017) Action recognition using micro-Doppler signatures and a recurrent neural network. In: 2017 51st annual conference on information sciences and systems (CISS). IEEE, pp 1–5
CRCV (2010) UCF Sports Action dataset. Last Accessed 11 Dec 2019
CRCV (2012) UCF50 dataset. Last Accessed 11 Dec 2019
CRCV (2013) UCF101 dataset. Last Accessed 1 Feb 2020
CRCV (2020) UMN video dataset. Last Accessed 1 Feb 2020
Cutler R, Davis LS (2000) Robust real-time periodic motion detection, analysis, and applications. IEEE Trans Pattern Anal Mach Intell 22(8):781–796
Czuszynski K, Ruminski J, Wtorek J (2017) Pose classification in the gesture recognition using the linear optical sensor. In: 2017 10th international conference on human system interactions (HSI). IEEE, pp 18–24
Dai C, Liu X, Lai J, Li P, Chao H-C (2019) Human behavior deep recognition architecture for smart city applications in the 5G environment. IEEE Netw 33(5):206–211
Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: European conference on computer vision. Springer, pp 428–441
Das S, Koperski M, Bremond F, Francesca G (2018) Deep-temporal lstm for daily living action recognition. In: 2018 15th IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, pp 1–6
Das S, Chaudhary A, Bremond F, Thonnat M (2019a) Where to focus on for human action recognition? In: 2019 IEEE winter conference on applications of computer vision (WACV). IEEE, pp 71–80
Das S, Dai R, Koperski M, Minciullo L, Garattoni L, Bremond F, Francesca G (2019b) Toyota smarthome: real-world activities of daily living. In: Proceedings of the IEEE international conference on computer vision, pp 833–842
De-La-Hoz-Franco E, Ariza-Colpas P, Quero JM, Espinilla M (2018) Sensor-based datasets for human activity recognition: a systematic review of literature. IEEE Access 6:59192–59210
D’Orazio T, Marani R, Renó V, Cicirelli G (2016) Recent trends in gesture recognition: how depth data has improved classical approaches. Image Vis Comput 52:56–72
Duque D, Santos H, Cortez P (2007) Prediction of abnormal behaviors for intelligent video surveillance systems. In: IEEE symposium on computational intelligence and data mining, 2007. CIDM 2007. IEEE, pp 362–367
Everts I, Van Gemert JC, Gevers T (2014) Evaluation of color spatio-temporal interest points for human action recognition. IEEE Trans Image Process 23(4):1569–1580
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941
Feng Y, Yuan Y, Lu X (2017) Learning deep event models for crowd anomaly detection. Neurocomputing 219:548–556
Fisher PR (2012) CAVIAR dataset. Last Accessed 1 Feb 2020
Foggia P, Percannella G, Saggese A, Vento M (2013) Recognizing human actions by a bag of visual words. In: 2013 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 2910–2915
Foggia P, Saggese A, Strisciuglio N, Vento M (2014) Exploiting the deep learning paradigm for recognizing human actions. In: 2014 11th IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, pp 93–98
Gan L, Chen F (2013) Human action recognition using APJ3D and random forests. JSW 8(9):2238–2245
Gao J, Zhang T, Xu C (2019) I know the relationships: zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 8303–8311
Gavrila DM (1999) The visual analysis of human movement: a survey. Comput Vis Image Underst 73(1):82–98
Gkalelis N, Kim H, Hilton A, Nikolaidis N, Pitas I (2009) The i3DPost multi-view and 3D human action/interaction database. In: 2009 conference for visual media production. IEEE, pp 159–168
Gowda SN (2017) Human activity recognition using combinatorial deep belief networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 1–6
Guo G, Lai A (2014) A survey on still image based human action recognition. Pattern Recogn 47(10):3343–3361
Gupta JP, Singh N, Dixit P, Semwal VB, Dubey SR (2013) Human activity recognition using gait pattern. Int J Comput Vis Image Process (IJCVIP) 3(3):31–53
Haria A, Subramanian A, Asokkumar N, Poddar S, Nayak JS (2017) Hand gesture recognition for human computer interaction. Procedia Comput Sci 115:367–374
Hassan MM, Uddin MZ, Mohamed A, Almogren A (2018) A robust human activity recognition system using smartphone sensors and deep learning. Future Gener Comput Syst 81:307–313
Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. Image Vis Comput 60:4–21
Huang G-B, Zhu Q-Y, Siew C-K (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In: Proceedings of the 2004 IEEE international joint conference on neural networks, 2004, vol 2. IEEE, pp 985–990
Huang Z, Wan C, Probst T, Van Gool L (2017) Deep learning on lie groups for skeleton-based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6099–6108
Huang Y, Lai S-H, Tai S-H (2018) Human action recognition based on temporal pose CNN and multi-dimensional fusion. In: Proceedings of the European conference on computer vision (ECCV)
Huynh-The T, Hua-Cam H, Kim D-S (2019) Encoding pose features to images with data augmentation for 3D action recognition. IEEE Trans Industr Inform 16:3100–3111
Ijjina EP, Chalavadi KM (2016) Human action recognition using genetic algorithms and convolutional neural networks. Pattern Recogn 59:199–212
INRIA (2016) IXMAS dataset. Last Accessed 1 Feb 2020
Iosifidis A, Tefas A, Pitas I (2014) Regularized extreme learning machine for multi-view semi-supervised action recognition. Neurocomputing 145:250–262
Jalal A (2017) IM-daily depth activity dataset. Last Accessed 1 Feb 2020
Jalal A, Kim Y (2014) Dense depth maps-based human pose tracking and recognition in dynamic scenes using ridge data. In: 2014 11th IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, pp 119–124
Jalal A, Uddin MZ, Kim T-S (2012) Depth video-based human activity recognition system using translation and scaling invariant features for life logging at smart home. IEEE Trans Consum Electron 58:3
Jalal A, Kim Y-H, Kim Y-J, Kamal S, Kim D (2017) Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recogn 61:295–308
Jhuang H (2013) HMDB dataset. Last Accessed 11 Dec 2019
Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
Jian M, Zhang S, Wu L, Zhang S, Wang X, He Y (2019) Deep key frame extraction for sport training. Neurocomputing 328:147–156
Jiang Z, Lin Z, Davis L (2012) Recognizing human actions by learning and matching shape-motion prototype trees. IEEE Trans Pattern Anal Mach Intell 34(3):533–547
Kalaivani P, Vimala D (2015) Human action recognition using background subtraction method. Int Res J Eng Technol (IRJET) 2(3):1032–1035
Kang SB, Szeliski R (2004) Extracting view-dependent depth maps from a collection of images. Int J Comput Vis 58(2):139–163
Karpathy A (2014) Sports-1M dataset. Last Accessed 11 Dec 2019
Kastaniotis D, Theodorakopoulos I, Theoharatos C, Economou G, Fotopoulos S (2015) A framework for gait-based recognition using Kinect. Pattern Recogn Lett 68:327–335
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, et al (2017) The kinetics human action video dataset. ArXiv preprint arXiv:1705.06950
Ke Y, Sukthankar R, Hebert M (2007) Event detection in crowded videos. In: 2007 IEEE 11th international conference on computer vision. IEEE, pp 1–8
Khan ZA, Sohn W (2011) Abnormal human activity recognition system based on R-transform and kernel discriminant technique for elderly home care. IEEE Trans Consum Electron 57:4
Kim SH, Park R-H (2002) An efficient algorithm for video sequence matching using the modified hausdorff distance and the directed divergence. IEEE Trans Circuits Syst Video Technol 12(7):592–596
Kim TS, Reiter A (2017) Interpretable 3D human action analysis with temporal convolutional networks. In: 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW). IEEE, pp 1623–1631
Kim H, Lee S, Kim Y, Lee S, Lee D, Ju J, Myung H (2016) Weighted joint-based human behavior recognition algorithm using only depth information for low-cost intelligent video-surveillance system. Expert Syst Appl 45:131–141
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105
Kumar K, Kishore P, Kumar DA, Kumar EK (2018) Indian classical dance action identification using adaboost multiclass classifier on multifeature fusion. In: 2018 conference on signal processing and communication engineering systems (SPACES). IEEE, pp 167–170
Laptev I (2005) On space–time interest points. Int J Comput Vis 64(2–3):107–123
Laptev I (2012) Hollywood2 dataset. Last Accessed 11 Dec 2019
Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: IEEE conference on computer vision and pattern recognition, 2008. CVPR 2008. IEEE, pp 1–8
Lee LH, Wan CH, Yong TF, Kok HM (2010) A review of nearest neighbor-support vector machines hybrid classification models. J Appl Sci 10:1841–1858
Lee H-Y, Huang J-B, Singh M, Yang M-H (2017) Unsupervised representation learning by sorting sequences. In: Proceedings of the IEEE international conference on computer vision, pp 667–676
Li W (2017a) MSR daily activity 3D dataset. Last Accessed 11 Dec 2019
Li W (2017b) MSR-action3D dataset. Last Accessed 1 Feb 2020
Li W, Zhang Z, Liu Z (2010) Action recognition based on a bag of 3D points. In: 2010 IEEE computer society conference on computer vision and pattern recognition-workshops. IEEE, pp 9–14
Li C, Hou Y, Wang P, Li W (2017a) Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Process Lett 24(5):624–628
Li C, Wang P, Wang S, Hou Y, Li W (2017b) Skeleton-based action recognition using LSTM and CNN. In: 2017 IEEE international conference on multimedia and expo workshops (ICMEW). IEEE, pp 585–590
Li M, Chen S, Chen X, Zhang Y, Wang Y, Tian Q (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3595–3603
Lim JH, Teh EY, Geh MH, Lim CH (2017) Automated classroom monitoring with connected visioning system. In: Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), 2017. IEEE, pp 386–393
Liu DZ (2016) MSR action dataset. Last Accessed 1 Feb 2020
Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos in the wild. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 1996–2003
Liu L, Shao L, Zhen X, Li X (2013) Learning discriminative key poses for action recognition. IEEE Trans Cybern 43(6):1860–1870
Liu L, Shao L, Li X, Lu K (2016) Learning spatio-temporal representations for action recognition: a genetic programming approach. IEEE Trans Cybern 46(1):158–170
Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE (2017a) A survey of deep neural network architectures and their applications. Neurocomputing 234:11–26
Liu M, Liu H, Chen C (2017b) Enhanced skeleton visualization for view invariant human action recognition. Pattern Recogn 68:346–362
Lu K, Chen J, Little JJ, He H (2018) Lightweight convolutional neural networks for player detection and classification. Comput Vis Image Underst 172:77–87
Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: a review. Expert Syst Appl 91:480–491
M. C. Laboratory (2012) DHA video dataset. Last Accessed 1 Feb 2020
Miao Y, Song J (2014) Abnormal event detection based on SVM in video surveillance. In: 2014 IEEE workshop on advanced research and technology in industry applications (WARTIA). IEEE, pp 1379–1383
MICC (2012) Florence 3D actions dataset. Last Accessed 11 Dec 2019
Mika S, Schölkopf B, Smola AJ, Müller K-R, Scholz M, Rätsch G (1999) Kernel PCA and de-noising in feature spaces. In: Advances in neural information processing systems, pp 536–542
Mishra A, Verma VK, Reddy MSK, Arulkumar S, Rai P, Mittal A (2018) A generative approach to zero-shot and few-shot action recognition. In: 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, pp 372–380
MIVIA-Lab (2017) MIVIA Dataset. Last Accessed 11 Dec 2019
Moya Rueda F, Grzeszick R, Fink G, Feldhorst S, ten Hompel M (2018) Convolutional neural networks for human activity recognition using body-worn sensors. In: Informatics, vol 5. Multidisciplinary Digital Publishing Institute, p 26
Murray TS, Mendat DR, Pouliquen PO, Andreou AG (2015) The Johns Hopkins University multimodal dataset for human action recognition. In: Radar sensor technology XIX; and active and passive signatures VI, vol 9461. International Society for Optics and Photonics, p 94611U
NADA (2004) KTH dataset. Last Accessed 1 Feb 2020
Nazir S, Yousaf MH, Velastin SA (2018) Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition. Comput Electr Eng 72:660–669
Neha TK (2020) A review on PSO-SVM based performance measurement on different datasets. Int J Res Appl Sci Eng Technol 8:444–448
Nizam Y, Mohd MNH, Jamil MMA (2017) Human fall detection from depth images using position and velocity of subject. Procedia Comput Sci 105:131–137
Norouzi M, Mikolov T, Bengio S, Singer Y, Shlens J, Frome A, Corrado GS, Dean J (2013) Zero-shot learning by convex combination of semantic embeddings. ArXiv preprint arXiv:1312.5650
Nunes UM, Faria DR, Peixoto P (2017) A human activity recognition framework using max-min features and key poses with differential evolution random forests classifier. Pattern Recogn Lett 99:21–31
Nweke HF, Teh YW, Mujtaba G, Al-Garadi MA (2019) Data fusion and multiple classifier systems for human activity detection and health monitoring: review and open research directions. Inf Fusion 46:147–170
Ohlberger M, Rave S (2015) Reduced basis methods: success, limitations and future challenges. ArXiv preprint arXiv:1511.02021
Oikonomopoulos A, Patras I, Pantic M (2005) Spatiotemporal salient points for visual recognition of human actions. IEEE Trans Syst Man Cybern Part B Cybern 36(3):710–719
Oliver N, Horvitz E, Garg A (2002) Layered representations for human activity recognition. In: Proceedings of the 4th IEEE international conference on multimodal interfaces. IEEE Computer Society, p 3
Oreifej O, Liu Z (2013) HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 716–723
Pagliari D, Pinto L (2015) Calibration of Kinect for xbox one and comparison between the two generations of microsoft sensors. Sensors 15(11):27569–27589
Panahi L, Ghods V (2018) Human fall detection using machine vision techniques on RGB-D images. Biomed Signal Process Control 44:146–153
Patel CI, Garg S, Zaveri T, Banerjee A, Patel R (2018) Human action recognition using fusion of features for unconstrained video sequences. Comput Electr Eng 70:284–301
Paul M, Haque SM, Chakraborty S (2013) Human detection in surveillance videos and its applications: a review. EURASIP J Adv Signal Process 2013(1):176
Peng X, Zou C, Qiao Y, Peng Q (2014) Action recognition with stacked fisher vectors. In: European conference on computer vision. Springer, pp 581–595
Pham HH, Salmane H, Khoudour L, Crouzil A, Velastin SA, Zegers P (2020) A unified deep framework for joint 3D pose estimation and action recognition from a single RGB camera. Sensors 20(7):1825
Popoola OP, Wang K (2012) Video-based abnormal human behavior recognition: a review. IEEE Trans Syst Man Cybern Part C Appl Rev 42(6):865–878
Poppe R (2010) A survey on vision-based human action recognition. Image Vis Comput 28(6):976–990
Prasnthi Mandha SVR, Lavanya Devi G (2017) A random forest based classification model for human activity recognition. Int J Adv Sci Technol Eng Manag Sci 3:294–300
Presti LL, La Cascia M (2016) 3D skeleton-based human action classification: a survey. Pattern Recogn 53:130–147
Qi M, Wang Y, Qin J, Li A, Luo J, Van Gool L (2019) stagNet: an attentive semantic RNN for group activity and individual action recognition. IEEE Trans Circuits Syst Video Technol 30:549–565
Qian H, Mao Y, Xiang W, Wang Z (2010) Recognition of human activities using svm multi-class classifier. Pattern Recogn Lett 31(2):100–111
Qin Y, Mo L, Xie B (2017) Feature fusion for human action recognition based on classical descriptors and 3D convolutional networks. In: 2017 eleventh international conference on sensing technology (ICST). IEEE, pp 1–5
Rapid-Rich-Object-Search Lab (2016) NTU RGB+D action recognition dataset. Last Accessed 11 Dec 2019
Razzak MI, Naz S, Zaib A (2018) Deep learning for medical image processing: overview, challenges and the future. In: Classification in BioApps. Springer, pp 323–350
Rensink RA (2000) The dynamic representation of scenes. Vis Cognit 7(1–3):17–42
Robot-Learning-Lab (2017) Cornell activity dataset (CAD-60). Last Accessed 11 Dec 2019
Rodriguez-Galiano VF, Ghimire B, Rogan J, Chica-Olmo M, Rigol-Sanchez JP (2012) An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J Photogramm Remote Sens 67:93–104
Ronao CA, Cho S-B (2016) Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst Appl 59:235–244
Roy Y, Banville H, Albuquerque I, Gramfort A, Falk TH, Faubert J (2019) Deep learning-based electroencephalography analysis: a systematic review. J Neural Eng 16(5):051001
Saini O, Sharma S (2018) A review on dimension reduction techniques in data mining. Comput Eng Intell Syst 9:7–14
Shao L, Ji L, Liu Y, Zhang J (2012) Human action segmentation and recognition via motion and shape analysis. Pattern Recogn Lett 33(4):438–445
Sharma RP, Verma GK (2015) Human computer interaction using hand gesture. Procedia Comput Sci 54:721–727
Sharma S, Kiros R, Salakhutdinov R (2015) Action recognition using visual attention. ArXiv preprint arXiv:1511.04119
Shereena V, David JM (2014) Content based image retrieval: classification using neural networks. Int J Multimedia Appl 6(5):31
Shi Y, Tian Y, Wang Y, Huang T (2017) Sequential deep trajectory descriptor for action recognition with three-stream CNN. IEEE Trans Multimedia 19(7):1510–1520
Shi L, Zhang Y, Cheng J, Lu H (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12026–12035
Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time human pose recognition in parts from single depth images. In: CVPR 2011. IEEE, pp 1297–1304
Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1227–1236
Singh D, Mohan CK (2017) Graph formulation of video activities for abnormal activity recognition. Pattern Recogn 65:265–272
Singh S, Velastin SA, Ragheb H (2010) MuHAVi: a multicamera human action video dataset for the evaluation of action recognition methods. In: Seventh IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, pp 48–55
Song Y, Demirdjian D, Davis R (2011) NATOPS aircraft handling signals database. Last Accessed 11 Dec 2019
Statistical Visual Computing Lab (2014) UCSD anomaly detection dataset. Last Accessed 11 Dec 2019
Sze V, Chen Y-H, Yang T-J, Emer JS (2017) Efficient processing of deep neural networks: a tutorial and survey. Proc IEEE 105(12):2295–2329
Taha A, Zayed HH, Khalifa M, El-Horbaty E-S (2014) Human action recognition based on MSVM and depth images. Int J Comput Sci Issues (IJCSI) 11(4):42
Thakkar A, Lohiya R (2020) Attack classification using feature selection techniques: a comparative study. J Ambient Intell Humaniz Comput. https://doi.org/10.1007/s12652-020-02167-9
Thi TH, Zhang J, Cheng L, Wang L, Satoh S (2010) Human action recognition and localization in video using structured learning of local space–time features. In: 2010 seventh IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, pp 204–211
Thomas G, Gade R, Moeslund TB, Carr P, Hilton A (2017) Computer vision for sports: current applications and research topics. Comput Vis Image Underst 159:3–18
Toshev A, Szegedy C (2014) DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1653–1660
Turaga P, Chellappa R, Subrahmanian VS, Udrea O (2008) Machine recognition of human activities: a survey. IEEE Trans Circuits Syst Video Technol 18(11):1473–1488
Ullah A, Muhammad K, Haq IU, Baik SW (2019) Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments. Future Gener Comput Syst 96:386–397
University of Minnesota (2010) Unusual crowd activity dataset. Last Accessed 11 Dec 2019
Varadarajan J, Odobez J-M (2009) Topic models for scene analysis and abnormality detection. In: 2009 IEEE 12th international conference on computer vision workshops (ICCV workshops). IEEE, pp 1338–1345
Veeriah V, Zhuang N, Qi G-J (2015) Differential recurrent neural networks for action recognition. In: Proceedings of the IEEE international conference on computer vision, pp 4041–4049
Vezzani R, Baltieri D, Cucchiara R (2010) HMM-based action recognition with projection histogram features. In: International conference on pattern recognition. Springer, pp 286–293
Vishwakarma DK, Kapoor R (2015) Hybrid classifier based human activity recognition using the silhouette and cells. Expert Syst Appl 42(20):6957–6965
Vishwakarma DK, Kapoor R, Dhiman A (2016) A proposed unified framework for the recognition of human activity by exploiting the characteristics of action dynamics. Robot Auton Syst 77:25–38
Vrigkas M, Nikou C, Kakadiaris IA (2015) A review of human activity recognition methods. Front Robot AI 2:28
Wang Y, Mori G (2009) Human action recognition by semilatent topic models. IEEE Trans Pattern Anal Mach Intell 31(10):1762–1774
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558
Wang H, Kläser A, Schmid C, Liu C-L (2011) Action recognition by dense trajectories. In: CVPR 2011. IEEE, pp 3169–3176
Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision. Springer, pp 20–36
Wang P, Cao Y, Shen C, Liu L, Shen HT (2017) Temporal pyramid pooling-based convolutional neural network for action recognition. IEEE Trans Circuits Syst Video Technol 27(12):2613–2622
Wang J, Chen Y, Hao S, Peng X, Hu L (2018) Deep learning for sensor-based activity recognition: a survey. Pattern Recogn Lett 119:3–11
Wang W, Zheng VW, Yu H, Miao C (2019) A survey of zero-shot learning: settings, methods, and applications. ACM Trans Intell Syst Technol (TIST) 10(2):1–37
Wanqing Li XN (2014) Northwestern-UCLA multiview action 3D dataset. Last Accessed 11 Dec 2019
Weimer D, Scholz-Reiter B, Shpitalni M (2016) Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection. CIRP Ann Manuf Technol 65(1):417–420
Xia L (2016) UT Kinect-action 3D dataset. Last Accessed 11 Dec 2019
Xia L, Chen C-C, Aggarwal J (2012) View invariant human action recognition using histograms of 3D joints. In: 2012 IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW). IEEE, pp 20–27
Xu D, Xiao X, Wang X, Wang J (2016) Human action recognition based on Kinect and PSO-SVM by representing 3D skeletons as points in Lie group. In: 2016 international conference on audio, language and image processing (ICALIP). IEEE, pp 568–573
Xu L, Yang W, Cao Y, Li Q (2017) Human activity recognition based on random forests. In: 2017 13th international conference on natural computation, fuzzy systems and knowledge discovery (ICNC-FSKD). IEEE, pp 548–553
YACVID (2014) MuHAVi dataset. Last Accessed 11 Dec 2019
Yamato J, Ohya J, Ishii K (1992) Recognizing human action in time-sequential images using hidden Markov model. In: Proceedings of the 1992 IEEE computer society conference on computer vision and pattern recognition (CVPR'92). IEEE, pp 379–385
Yang Y, Ramanan D (2012) Articulated human detection with flexible mixtures of parts. IEEE Trans Pattern Anal Mach Intell 35(12):2878–2890
Yang X, Tian Y (2014) Effective 3D action recognition using EigenJoints. J Vis Commun Image Represent 25(1):2–11
Yang X, Zhang C, Tian Y (2012) Recognizing actions using depth motion maps-based histograms of oriented gradients. In: Proceedings of the 20th ACM international conference on Multimedia. ACM, pp 1057–1060
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI conference on artificial intelligence
Yang Y, Hou C, Lang Y, Guan D, Huang D, Xu J (2019) Open-set human activity recognition based on micro-Doppler signatures. Pattern Recogn 85:60–69
Yao A, Gall J, Fanelli G, Van Gool L (2011) Does human action recognition benefit from pose estimation? In: BMVC 2011-proceedings of the British machine vision conference 2011
You D, Hamsici OC, Martinez AM (2010) Kernel optimization in discriminant analysis. IEEE Trans Pattern Anal Mach Intell 33(3):631–638
You I, Choo K-KR, Ho C-L et al (2018) A smartphone-based wearable sensors for monitoring real-time physiological data. Comput Electr Eng 65:376–392
Yu M, Yu Y, Rhuma A, Naqvi SM, Wang L, Chambers JA et al (2013) An online one-class support vector machine-based person-specific fall detection system for monitoring an elderly individual in a room environment. IEEE J Biomed Health Inform 17(6):1002–1014
Zellers R, Choi Y (2017) Zero-shot activity recognition with verb attribute induction. ArXiv preprint arXiv:1707.09468
Zhang Z (2012) Microsoft Kinect sensor and its effect. IEEE Multimedia 19(2):4–10
Zhang X, Yao L, Wang X, Monaghan J, Mcalpine D, Zhang Y (2019a) A survey on deep learning based brain computer interface: recent advances and new frontiers. ArXiv preprint arXiv:1905.04149
Zhang X, Yao L, Wang X, Zhang W, Zhang S, Liu Y (2019b) Know your mind: adaptive cognitive activity recognition with reinforced CNN. In: 2019 IEEE international conference on data mining (ICDM). IEEE, pp 896–905
Zhou X, Zhu M, Pavlakos G, Leonardos S, Derpanis KG, Daniilidis K (2018a) Monocap: monocular human motion capture using a CNN coupled with a geometric prior. IEEE Trans Pattern Anal Mach Intell 41(4):901–914
Zhou Y, Sun X, Zha Z-J, Zeng W (2018b) MiCT: mixed 3D/2D convolutional tube for human action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 449–458
Zhu Y, Chen W, Guo G (2014) Evaluating spatiotemporal interest point features for depth-based action recognition. Image Vis Comput 32(8):453–464
Zhu F, Shao L, Xie J, Fang Y (2016a) From handcrafted to learned representations for human action recognition: a survey. Image Vis Comput 55:42–52
Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X et al (2016b) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: AAAI, vol 2, p 8
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Pareek, P., Thakkar, A. A survey on video-based Human Action Recognition: recent updates, datasets, challenges, and applications. Artif Intell Rev 54, 2259–2322 (2021). https://doi.org/10.1007/s10462-020-09904-8