1 Introduction

An action in Human Action Recognition (HAR) is an observable entity that can be captured using either the human eye or a sensing technology. For example, an action such as walking requires a person in the field of view to be continuously observed. Depending on the body parts engaged in the action, human activities can be grouped into four categories (Aggarwal and Ryoo 2011).

  • Gesture: It is based on the movement of the hands, face, or other body parts, and does not require verbal communication.

  • Action: It consists of movements conducted by a single person, such as walking or running.

  • Interaction: It involves actions executed by two actors; it may be a human–object interaction or an interaction between two persons.

  • Group activity: It can be a mixture of gestures, actions, or interactions, involving at least two performers, possibly with interacting objects.

HAR is considered an active research area due to applications such as content-based video analysis and retrieval, visual surveillance, Human–Computer Interface (HCI), education, medicine, and abnormal activity recognition. Further discussion of these applications is provided in Sect. 4. In HAR, the action recognition task can be decomposed into action representation and action analysis. Actions are acquired using different types of sensors such as RGB, range, radar, or wearable sensors. Manual HAR, for instance, identifying abnormal activity from a video recording, requires a substantial amount of time. Such tasks are expensive and difficult, as human operators are necessary across multi-camera views (Singh et al. 2010). Moreover, round-the-clock monitoring of an area of interest is tedious and may introduce human errors. To address these issues, automated modeling of human actions can be used.

Automated modeling of actions involves mapping a particular action to a label that describes an instance of that action. Such actions may be performed by different agents (i.e., humans) under varying speeds, lighting conditions, and diverse viewpoints. However, fully-automated HAR systems face several challenges such as background clutter, occlusion, variation in viewpoint, scale, and appearance, as well as external conditions of the video recording (Thi et al. 2010). For instance, the task of person localization, i.e., determining the location and size of a person, is difficult under dynamic recording conditions (Thi et al. 2010). A considerable amount of research has been carried out for HAR. To study these developments and recent updates, we conduct this survey focusing on video-based HAR; we cover the complete process of action representation, dimensionality reduction, and action analysis; we also discuss the datasets and notable applications of HAR. The primary motivation of this comprehensive survey is to analyze different aspects of HAR along with Machine Learning (ML) and Deep Learning (DL) techniques, the significance of the datasets, and their potential applications. We also discuss the challenges associated with HAR and provide potential future directions.

Table 1 Comparison between the existing surveys and our survey for HAR

1.1 Prior survey

HAR has been a research interest for various groups over the past years. For the action recognition task, action representation techniques include feature extraction methods and feature descriptors; action analysis may be carried out using traditional ML and/or DL techniques. While we survey the existing approaches to HAR across different applications, we compare our survey with the existing surveys based on categories such as features, dimensionality reduction, and action classification, as shown in Table 1. This section groups prior surveys and discusses the applied fields of HAR.

Table 2 Summary of surveys studied

1.1.1 Still image-based action recognition

The main focus of still image-based HAR is on identifying the action of a person from a single image without considering temporal information for action characterization.

One of the surveys on still image-based action recognition is presented in Guo and Lai (2014). Here, different methods such as ML and DL are discussed for low-level feature extraction and high-level representation of actions; various datasets along with their characteristics are also presented. On the other hand, Vrigkas et al. (2015) presented a survey on HAR using still-image representation, wherein HAR techniques are divided into two categories, namely unimodal and multimodal activity recognition, depending on the type of data modality used.

1.1.2 Action representation and analysis-based HAR

For HAR, a step-by-step strategy can be used that involves feature representation using feature extraction techniques followed by action classification. In Poppe (2010), HAR is discussed by considering actions involving full-body movement while excluding the environment and interactions with other humans. Action representation and classification tasks are also presented.

In Turaga et al. (2008), the action classification task is discussed by considering representation and recognition of actions or activities. Different mechanisms to learn actions from video are presented; the study separately defines the terms “action” and “activity” and provides an overview of classification techniques. The authors also discuss approaches for modeling atomic action classes, as well as methods to model actions with more complex dynamics.

The study of handcrafted and learning-based representations is presented in Zhu et al. (2016a), where various advances in handcrafted representation techniques are also discussed. The paper covers different features including spatio-temporal volume-based, depth image-based, and trajectory-based methods. The architecture of the 3D Convolutional Neural Network (CNN, also known as ConvNet) is also presented.

In Herath et al. (2017), various action recognition methods based on handcrafted and DL techniques are reviewed, along with their architectures. Subsequently, in Aggarwal and Ryoo (2011), feature extraction methods from input video are presented; multi-person action recognition is reviewed using hierarchical recognition methods including statistical (state-based models), syntactic (grammar-based methods), and description-based approaches (describing activities and sub-activities). On the other hand, Presti and La Cascia (2016) highlight work on 3D skeleton-based approaches along with their challenges. Preprocessing methods, descriptors for skeleton-based data, datasets, and validation methods with performance evaluation techniques are also discussed.

1.1.3 Abnormal activity detection

Video surveillance can be used by organizations to manage gatherings, prevent crime, or inspect crime scenes. Visual surveillance systems depend on anomalous event detection. One of the early works, Popoola and Wang (2012), discusses abnormal activity detection for crowd monitoring; this survey also covers action recognition and event detection.

A survey on abnormal activity detection is presented in Mabrouk and Zagrouba (2018). The survey is divided into two parts: behavior representation (features and descriptors) and behavior modeling (training and learning methods). Datasets and performance measures for abnormal activity detection methods are also presented in Mabrouk and Zagrouba (2018).

1.1.4 Sensor-based activity recognition

Sensor-based HAR focuses on data received from accelerometers, gyroscopes, and Bluetooth devices. Action classification can be treated as a pattern recognition problem (Wang et al. 2018). The data may come in several different modalities depending on the mode of activity or occurrence it captures.

In Wang et al. (2018), a survey of sensor-based activity recognition methods using DL techniques is presented; high-level features are automatically learned from sensor data using DL. In Nweke et al. (2019), the fusion of data from mobile and wearable devices with multiple classifiers is discussed, and DL techniques for HAR are presented along with applications and open research issues.

1.2 Motivation

With a motivation to discuss state-of-the-art HAR techniques, we consider classification approaches, their advantages, challenges, datasets, and applications of HAR. This paper discusses ML and DL techniques for HAR and gives a brief description of various features. We also include potential future work on HAR in terms of ML, DL, and hybrid techniques. The significance of this survey can be gauged from Table 1; to cover different aspects of HAR, we discuss the holistic process of action recognition, including action representation and action analysis for various modalities such as RGB, depth, and skeleton. We also discuss recent datasets that depict daily living actions; to the best of our knowledge, recent datasets for HAR have been explored only to a limited extent in the field of action recognition. A summary of the existing surveys, their highlights, and the important inferences gained from each is provided in Table 2, where we also state the expected highlights and inferences of our survey that can be helpful to the reader. It must be noted that, to provide a focused review of HAR, we restrict our survey to trimmed video sequences.

The major contributions of our paper are as follows.

  • The paper discusses various feature extraction and encoding techniques for HAR including shape, texture, trajectory, depth, as well as others.

  • Various dimensionality reduction methods for the extracted features are described. ML and DL techniques for action analysis are also presented.

  • The notable advantages and disadvantages of different methods for action representation, dimensionality reduction, and action analysis for HAR are provided.

  • The paper summarizes the recent advances in HAR along with various applications, challenges, and future directions.

The remainder of the paper is organized as follows. Section 2 presents HAR as a complete process including action representation, dimensionality reduction, and action analysis; the datasets used in action classification and their properties, along with a discussion of recent datasets, are covered in Sect. 3; Sect. 4 covers applications of HAR; challenges and future directions are discussed in Sects. 5 and 6, respectively; concluding remarks are given in Sect. 7.

2 HAR: a complete process

Fig. 1 A general overview of a Human Action Recognition system

HAR is used for analyzing activities from video. Once video data is captured, it is processed to meet the requirements of the underlying application. A generic system for HAR is graphically represented in Fig. 1; it provides an overview of the general steps including data collection, preprocessing, feature extraction and/or encoding, optional dimensionality reduction, and dataset preparation for training and testing; the resulting data samples can be provided to one or more ML or DL approaches for action classification, and the predicted class labels can be analyzed and evaluated for the test samples. Action representation and dimensionality reduction techniques are useful for ML-based techniques, whereas for DL-based techniques these steps may be skipped. The existing approaches considered for action representation, dimensionality reduction, and action analysis for HAR are discussed in Sects. 2.1 to 2.3, respectively.

2.1 Action representation

Action representation provides the low-level processing of human actions. It can consist of two steps, namely interest point detection and description of the detected interest region. The important features can be extracted and encoded using different techniques.

2.1.1 Feature extraction and encoding

The overall procedure of action representation involves the extraction of a set of local as well as global features. The goal of the action representation task is to find features that are robust to occlusion, background variation, and viewpoint change (Zhu et al. 2016a). An overview of various feature-based action representations is shown in Fig. 2; the properties of these features are described and existing work is reviewed in the following sections.

Fig. 2 Various features for HAR

Space–time interest point-based techniques To represent an image using local features, Space–Time Interest Points (STIPs) can be used. STIP features encode the image by adding an extra dimension carrying temporal information: temporal-domain information is added to the spatial domains, and the encoded image can provide additional information about the contents and structure of the action scene. STIPs can be converted to saliency regions using clustering algorithms. These features are translation- and scale-invariant; however, they are not rotation-invariant (Laptev 2005).

For the recognition of human actions, these positions are considered the most informative ones (Laptev 2005). An extension of the salient point detector based on spatio-temporal features is proposed in Oikonomopoulos et al. (2005), where an image sequence is represented in terms of salient points in space and time. The relationship between different features is established by calculating the Chamfer distance (Oikonomopoulos et al. 2005). A scale- and translation-invariant representation of the extracted features is obtained using an iterative space–time-warping technique, and the features are normalized to zero mean. The proposed model is evaluated using sequences of images of aerobic exercises.

In Chakraborty et al. (2011), STIPs in multi-view images are detected in a selective manner by surround suppression and by imposing local spatio-temporal constraints. Intensity-based STIPs are sensitive to disturbing photometric phenomena such as shadows and highlights. Color STIP performs better than intensity-based STIP; therefore, in Everts et al. (2014), color STIP is proposed by reformulating the detectors over multiple color channels, and is evaluated on datasets such as UCF sports (CRCV 2010), UCF11 (Liu et al. 2009), and UCF50 (CRCV 2012).

On the other hand, in Zhu et al. (2014), feature extraction is performed on depth maps using STIP features, and a Histogram of Visual Words (HoVW) is created by quantizing the extracted local features. Subsequently, in Nazir et al. (2018), feature representation is performed by combining STIP and the Scale-Invariant Feature Transform (SIFT), and an HoVW-based technique is used for action representation.

STIP methods do not require preprocessing such as background segmentation or human detection. The features are robust to scale, rotation, and occlusion; however, they are not viewpoint-invariant (Laptev 2005). For frames of actions such as boxing, hand-clapping, hand-waving, and jogging in the KTH dataset (NADA 2004), such features can be efficiently localized in both space and time, as each video is represented using a set of spatial and temporal interest points (Nazir et al. 2018). It is also observed that STIP features adapt to changes in illumination and scale; however, they may not be able to distinguish between an event and noise in some scenarios (Nazir et al. 2018).

Trajectory-based techniques Trajectories for actions are computed by tracking joints or interest points along the input videos using optical flow fields. Densely sampled points are tracked to obtain trajectories using the optical flow field (Wang et al. 2011). Trajectories are useful in scenarios where long-duration information must be captured (Wang et al. 2011).
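To make the idea concrete, the following is a minimal sketch (not the pipeline of Wang et al. 2011) that grid-samples points and propagates them with a Farneback optical flow field, assuming OpenCV and a list of grayscale frames:

```python
import cv2
import numpy as np

def track_dense_points(frames, step=16, track_len=15):
    """Track grid-sampled points through consecutive frames using dense optical flow."""
    h, w = frames[0].shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    tracks = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)[None]
    for prev, curr in zip(frames, frames[1:track_len + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        pts = tracks[-1]
        rows = np.clip(pts[:, 1].astype(int), 0, h - 1)   # y indexes rows
        cols = np.clip(pts[:, 0].astype(int), 0, w - 1)   # x indexes columns
        tracks = np.concatenate([tracks, (pts + flow[rows, cols])[None]], axis=0)
    return tracks  # shape (track_len + 1, num_points, 2): one (x, y) path per point
```

Descriptors such as HOG, HOF, and HoMB are then computed in a spatio-temporal tube around each resulting trajectory.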

In Wang and Schmid (2013), dense trajectories are extracted by sampling and tracking dense points at multiple scales in each frame. Here, feature representation is introduced using the Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Histogram of Motion Boundary (HoMB), which capture shape, appearance, and motion information along the trajectory, respectively, and multiple descriptors are used for the same. HoMB gives improved results compared to SIFT and HOG due to its robustness to camera motion (Dalal et al. 2006); it is based on derivatives of the optical flow and is used to remove camera motion.

Handcrafted features have less discriminative power for HAR, while efficient feature extraction using DL-based methods requires a large amount of training data. Therefore, in Wang et al. (2015), the advantages of handcrafted and DL-based features are combined using improved trajectories; the resulting descriptor, computed from two-stream ConvNets, is known as the Trajectory-pooled Deep-Convolutional Descriptor (TpDD). To construct an effective descriptor, the deep architecture learns multi-scale convolutional feature maps. As explained in Wang et al. (2015), for the multi-scale TpDD extension, optical flow is initially computed and single-scale tracking is performed, followed by the construction of multi-scale pyramid representations of the video frames and optical flow. The pyramid representation acts as input to the ConvNets for constructing convolutional feature maps at multiple scales. Subsequently, to enhance the power of dense trajectories for characterizing long-term motion, three-stream networks are used in Shi et al. (2017). Here, dense trajectories are extracted from multiple consecutive frames, resulting in trajectory texture images. The extracted descriptor, known as the sequential Deep Trajectory Descriptor (sDTD), characterizes motion. The three-stream framework, namely spatial, temporal, and sDTD streams, learns the spatial and temporal domains with a CNN-Recurrent Neural Network (RNN) architecture.

Depth-based techniques A depth image is captured by computing the distance between the image plane and the objects in the scene. With the use of a low-cost depth sensor, for example the Kinect, 3D depth images are invariant to lighting conditions (Taha et al. 2014). The advantages of depth-based sensors over RGB cameras include calibrated scale estimation, color and texture invariance, and a simpler background subtraction task (Shotton et al. 2011).

An action can be recognized using the Depth Motion Map (DMM) feature, as it provides 3D shape and structure information from depth maps. These maps are projected onto three orthogonal planes, namely front, side, and top. To identify motion regions, a motion energy map is calculated for each projected map. For each projection view, the DMM is formed by stacking the motion energy over the entire video (Yang et al. 2012).
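A rough NumPy illustration of this accumulation for a single (front) projection view, assuming already-projected depth frames and a hypothetical noise threshold eps, could look as follows:

```python
import numpy as np

def depth_motion_map(depth_seq, eps=10):
    """Front-view DMM: accumulate thresholded frame differences over a depth video.

    depth_seq: array of shape (T, H, W) holding the projected depth frames.
    eps: motion-energy threshold; differences below it are treated as noise.
    """
    diffs = np.abs(np.diff(depth_seq.astype(np.int32), axis=0))  # (T-1, H, W)
    motion_energy = (diffs > eps).astype(np.uint8)
    return motion_energy.sum(axis=0)  # per-pixel motion count over the video
```

The side- and top-view DMMs are obtained in the same way after projecting the depth maps onto the other two orthogonal planes.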

In Yang et al. (2012), depth maps are projected onto three orthogonal planes. Here, for feature representation, HOG is computed after the DMM to construct compact and discriminative features. In Chen et al. (2015b), DMM-based features are used for extracting motion information, and feature encoding is performed using the Local Binary Pattern (LBP), which performs better than DMM-HOG-based feature extraction. LBP enhances performance when applied to overlapping blocks in DMMs, which increases the discriminative power for action recognition. A DMM computed over the entire depth sequence cannot capture detailed motion; moreover, when a new action occurs, the old motion history may get overwritten.

To increase the accuracy of HAR, data are generated simultaneously from depth and inertial sensors in Chen et al. (2015a). Here, fused features are formed by directly concatenating features from the depth and inertial sensors. On the other hand, in Chen et al. (2016), a depth sequence is divided into overlapping segments and multiple sets of DMMs are generated. To decrease the intra-class variability due to action speed variations, depth segments of different temporal lengths are considered. For intra-class action classification, DMMs are not robust to action speed variations; therefore, an improvement to the DMM is proposed in Chen et al. (2017) by accumulating motion regions over three planes, namely front, top, and side. Afterwards, patch-based LBP is used to extend the feature representation from the pixel level to a texture-level representation (Chen et al. 2016).

Pose-based techniques RGB-D sensors such as the Kinect provide measurements of skeleton joints. However, these sensors have drawbacks with respect to pose estimation: the Kinect sensor operates at a limited distance, has a limited field of view, and cannot work in sunlight (Zhang 2012). Both 2D and 3D pose estimation have been explored, as discussed below.

2D Pose-based Techniques

In 2D pose estimation, deformable part models can be used, wherein a collection of templates is matched to recognize the object. However, deformable part models have limited expressiveness and do not take the global context into account (Yang and Ramanan 2012). Pose estimation can be efficiently reshaped by CNNs, by two means: detection-based and regression-based methods. Detection-based methods can use powerful CNN-based part detectors, which can be further combined using a graphical model (Chen and Yuille 2014). In the detection formulation, pose estimation produces a heat map wherein each pixel represents the detection score of a joint (Bulat and Tzimiropoulos 2016). Nevertheless, joint coordinates are not directly provided by detection approaches; poses are recovered in (x, y) coordinates by applying a max function as a post-processing step. Regression-based methods instead use a nonlinear function that maps the input directly to the desired output, which can be the joint coordinates (Bulat and Tzimiropoulos 2016). In Toshev and Szegedy (2014), poses are estimated using CNN-based regression towards body joints, and a cascade of such regressors is used to refine the pose estimates. Iterative Error Feedback (IEF) is proposed in Carreira et al. (2016): instead of predicting outputs in one shot, a self-correcting model progressively changes an initial solution by feeding back error predictions. The mapping function used in regression is sub-optimal; hence, regression results in lower performance compared to detection-based techniques.
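The heat-map post-processing step mentioned above reduces to a per-joint argmax; a minimal NumPy sketch, assuming heat maps of shape (J, H, W):

```python
import numpy as np

def heatmaps_to_joints(heatmaps):
    """Recover (x, y) coordinates from per-joint heat maps of shape (J, H, W)."""
    J, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(J, -1).argmax(axis=1)   # peak index per joint
    ys, xs = np.unravel_index(flat_idx, (H, W))
    scores = heatmaps.reshape(J, -1).max(axis=1)        # detection confidence per joint
    return np.stack([xs, ys], axis=1), scores
```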

3D Pose-based Techniques

On the other hand, given an image of a person, 3D pose estimation is the task of producing a 3D pose that matches the spatial position of the depicted person. Accurate reconstruction of 3D poses from real images, both indoors and outdoors, enables numerous applications in entertainment and HCI. Early approaches required feature engineering, whereas the current state-of-the-art methods are based on deep neural networks (Zhou et al. 2018a). 3D pose estimation is considered more complex than 2D, as it handles a larger 3D pose space and more ambiguities. In Nunes et al. (2017), skeleton extraction is performed using depth images, wherein skeleton joints are inferred frame by frame. An APJ3D representation is constituted from 15 manually selected skeleton joints (Gan and Chen 2013) using relative positions and local spherical angles; these informative joints are selected to build a compact representation of human posture. Spatial features are encoded based on joint-joint distances, joint-joint orientations, joint-joint vectors, joint-line distances, and line-line angles to provide rich texture features (Chen 2015), and a CNN is trained to identify the corresponding actions. On the other hand, the Kinect sensor is used in Xu et al. (2016) to obtain human body images; a body part-based skeletal representation is constructed for action recognition, wherein the relative geometry between various body parts is identified. Body rotations and translations in 3D space are members of the Lie group. In Liu et al. (2017b), the skeleton input is represented using several color images; in the color image generation process, emphasis is given to motion in the skeleton joints to improve the discriminative power of the color images. The multi-stream convolutional network involves ten AlexNet networks, with a generated color image as input to each CNN. Owing to the discriminative power of multi-stream convolutional networks, combining handcrafted information consisting of skeleton joints with a multi-stream convolutional network increases the recognition performance of HAR. Subsequently, a recent advance given in Huynh-The et al. (2019) maps 3D skeleton data to chromatic RGB values; this technique, termed Pose Feature to Image (PoF2I) encoding, can efficiently deal with action appearances of varying length. Also, a deep learning framework for HAR is presented in Pham et al. (2020), wherein, for feature representation, skeletons are extracted from RGB video sequences; these poses are then converted to an image-based representation and fed to a deep CNN.

Motion-based techniques The motion information of a moving target can be captured using an intelligent system for classifying objects efficiently. Motion tracking can be performed for high-level analysis of the classified objects (Paul et al. 2013). The detection process consists of object detection and classification. For object detection, background subtraction, optical flow, and spatio-temporal filtering can be used. With background subtraction, moving objects are detected by differencing the current frame and a background frame in a pixel-by-pixel or block-by-block fashion; here, motion is characterized by a 3D spatio-temporal data volume. This method has low computational complexity but is susceptible to noise (Paul et al. 2013). To detect moving regions in images, the optical flow technique computes flow vectors of moving objects; however, these methods have large computational requirements. To recognize humans, the periodic property of the images can be used in motion-based approaches (Paul et al. 2013).

In Cutler and Davis (2000), a view-based approach is applied for the recognition of human movements using vector image templates such as the Motion Energy Image (MEI) and Motion History Image (MHI). The MEI feature is a binary template that highlights image regions where motion is present; the shape of the region can suggest both the action occurring and the viewing angle in the scene. The MHI shows how the motion in the image evolves over time. MEI and MHI are prone to background subtraction errors. Replacement and decay operators are used to compute the MHI (Bobick and Davis 2001). Space–time silhouette shapes contain spatial information about human poses, such as the location and orientation of actions, and also include the aspect ratio of the different body parts at any point in time.
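The replacement-and-decay update for the MHI can be sketched in a few lines (a simplified form of the operators described above, assuming a binary motion mask per frame):

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=30):
    """One MHI step: set moving pixels to tau (replacement), decay the rest by 1.

    mhi: float array (H, W) from the previous frame; motion_mask: bool array (H, W).
    """
    return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))
    # the corresponding MEI is simply (mhi > 0)
```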

Shape-based techniques Shape-based features provide the human body structure and its dynamics for HAR, whereas texture-based features characterize motion information in videos using templates. For silhouette extraction, the background subtraction method may not be suitable; therefore, in Vishwakarma and Kapoor (2015), a texture-based segmentation method is used for silhouette extraction. A silhouette representation is used to obtain the Region of Interest (ROI) of a person in shape-based action representation techniques (Vishwakarma and Kapoor 2015). Human silhouettes can be obtained from RGB video frames or depth videos. Silhouette features are sensitive to occlusion and to different viewpoints (Vishwakarma and Kapoor 2015). Silhouette features are extracted from videos in Khan and Sohn (2011) to identify abnormal activities in elderly people; thereafter, the R-transform is applied to obtain features that are robust to scale and translation. In Chaaraoui and Flórez-Revuelta (2014b), after background processing, binary segmentation is applied to extract the contour points of the human silhouette; a radial scheme aligns the silhouettes independently of shape and contour length, summarizing feature extraction from a single view. Further dimensionality reduction can be obtained by representing each radial bin with a summary value (Chaaraoui and Flórez-Revuelta 2014a).

To suppress noise in the shape variations of silhouettes, key poses of silhouettes are divided into cells (Vishwakarma and Kapoor 2015). For silhouette extraction, a texture-based segmentation method is used. To eliminate frames not containing any information, a key-frame extraction method based on the energy of the frame is used. Due to similar postures and motions in some human activities, skeleton joint features alone are not sufficient to discriminate between them. To overcome this limitation, Depth Differential Silhouettes (DDS) are used, represented by the HOG format of DDS projections onto three orthogonal Cartesian planes.

Other action representation techniques Apart from the above methods, other action representation techniques include radar-based, gait-based, and Electroencephalography (EEG)-based approaches. The goal of radar-based HAR is to recognize human motion automatically using spectrograms (Craley et al. 2017). The modulation of radar echoes by each activity produces unique micro-Doppler signatures. In Craley et al. (2017), a normalized spectrogram-based method is used, which outperforms the skeleton data produced by the Kinect sensor. Another feature representation for HAR is gait-based action representation.

An HAR framework based on the Radon transform of binary silhouettes, which are used to compute a template, is proposed in Boulgouris and Chi (2007). 3D gait analysis can be performed using depth images and human body joint tracking from the available gait sequences (Boulgouris and Chi 2007). An advantage of gait is that, like automatic face recognition, it requires no physical contact, and it is less likely to be obscured than other biometrics (Boulgouris et al. 2005). However, gait-based recognition is typically performed in environments where the background is as uniform as possible; moreover, recognition algorithms based on gait are not view-invariant (Boulgouris et al. 2005).

Activity recognition using EEG involves electrophysiological monitoring to analyze brain states by capturing the voltage fluctuations of ionic currents within the neurons of the brain (Zhang et al. 2019a). The use of EEG signals for activity recognition is often termed a cognitive activity recognition system, which bridges the gap between the cognitive world and the physical world (Zhang et al. 2019b). EEG signals have an excellent temporal resolution, meaning that events occurring within a small fraction of time (a millisecond) can be captured. However, the disadvantage of EEG is its low spatial resolution, meaning that EEG signals are highly correlated spatially (Roy et al. 2019).

Hybrid action representation techniques The performance of HAR can be improved by using hybrid action representation techniques. Hybrid action representation is useful in scenarios where, for example, human activities contain similar postures and motions; in such cases, the skeleton joint feature alone is not enough to discriminate between different activities. An activity recognition system using silhouette- and skeleton-based features is presented in Jalal et al. (2017), where multi-fused features such as skeleton joints and body shape-based features such as HOG and DDS are extracted from the input videos. The shape of the full body is represented by DDS, and action classification is performed by the Hidden Markov Model (HMM). However, these features are only incorporated for simple actions.

Shape and motion information is combined in Vishwakarma et al. (2016) to handle occlusion, where binary silhouette extraction is performed using the Spatial Edge Distribution of Gradients (SDEG) and temporal information is extracted by the R-transform. The R-transform produces features robust to scaling and translation, but not to rotational changes. In Shao et al. (2012), shape and motion information is combined and action recognition is performed using temporal segmentation; the Motion History Image (MHI) is used for describing shape, and the Pyramid Correlogram of Oriented Gradients (PCOG) is used as a feature descriptor.

For abnormal activity detection, a fusion of texture, shape, and motion features is performed in Miao and Song (2014). Here, the Grey Level Co-occurrence Matrix (GLCM), Hu-invariant moments, and HOG are used for the texture, shape, and motion features, respectively. To obtain better performance, data normalization and parameter optimization are performed using an Adaptive Simulated Annealing Genetic Algorithm (ASAGA). Appearance and motion features are combined in Amraee et al. (2018) using HOG-LBP and HOF, respectively. The extraction of accurate silhouettes is difficult in the case of camera movements and complex scenes; hence, human appearance cannot be identified using silhouettes when the human body is occluded.

In Patel et al. (2018), various features are fused to improve the performance of the network: the average of the HOG feature over 10 overlapping frames, the Discrete Wavelet Transform, the displacement of the object centroid, the velocity of the object, and LBP. On the other hand, a feature fusion scheme combining classical descriptors with 3D convolutional networks was proposed in Qin et al. (2017). Descriptors such as HOG (providing invariance to geometric and photometric deformation), HOF (providing invariance to scale changes), and SIFT (providing invariance to viewpoint) are used. These features are fused with the learned features from the 3D CNN into a special fusion feature, which is then fed to the classification task.

2.1.2 Discussion

Handcrafted representations influence learning-based representations. Traditional ML techniques depend on handcrafted feature representations. These features are local features and follow a dense sampling strategy; they also have high computational complexity for both training and testing.

On the other hand, STIP features are suitable for simple actions and gestures; such features may not perform well when multiple persons are in the scene. It can be observed that trajectory-based approaches can analyze movements in a view-invariant manner; however, these techniques are not efficient for localizing joint positions. Depth sensors, for example the Kinect, provide the additional capabilities of human localization and skeleton extraction, so action detection based on them is simpler and more effective than that using RGB data. Moreover, sensors like the Kinect and advanced human pose estimation algorithms make it easier to obtain accurate 3D skeleton data.

Skeleton data can also capture spatial information, and strong correlations exist between joint nodes and their adjacent nodes; therefore, structural information related to the body can be found in skeleton data. Across frames, strong temporal correlations may also exist. Skeleton data is thus a popular representation among researchers. In Table 3, we provide an overview of the advantages and disadvantages of various feature extraction methods. We also provide a detailed summary of action representation techniques in Table 4.

Table 3 Advantages and disadvantages of different feature extraction methods

2.2 Dimensionality reduction techniques

In an action recognition framework, a large number of features are collected to capture all possible events. This leads to the presence of redundant data, which complicates the learning process. Hence, to enhance the learning task of the classification model, uncorrelated data should be considered. In dimensionality reduction, the original features are transformed by removing the redundant information. Such techniques can be categorized as unsupervised or supervised (Saini and Sharma 2018): unsupervised dimensionality reduction techniques include Principal Component Analysis (PCA) (Chen et al. 2015b), the autoencoder (Ullah et al. 2019), and Reduced Basis Decomposition (RBD) (Arunraj et al. 2018), while supervised techniques include Linear Discriminant Analysis (LDA) (Khan and Sohn 2011) and Kernel Discriminant Analysis (KDA) (Khan and Sohn 2011).

PCA is an unsupervised dimensionality reduction method that represents the input data in terms of eigenvectors. The feature dimension is reduced by retaining the features having maximum variance. In Chen et al. (2015b), PCA is applied after extracting DMM-LBP features to map the data to a lower-dimensional space. PCA can also be used to provide discriminative capability to local features (Thi et al. 2010); these features are then given as input to the classifier. On the other hand, part-based feature representation is used for HAR in Xu et al. (2016), where a linear combination of these features is obtained using PCA. Linear PCA is not always able to detect all the structure in a dataset; more information can be extracted by using suitable nonlinear features, and kernel PCA is suited to extracting nonlinear structures in the data (Mika et al. 1999). Kernel-based PCA (KPCA) is used in Hassan et al. (2018), wherein PCA is combined with a kernel function that computes dot products between vectors in the original space to identify nonlinear structures in the data.
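As a brief illustration, assuming extracted feature vectors are stacked row-wise in a matrix X, a PCA projection retaining 95% of the variance can be obtained with scikit-learn (KernelPCA offers the nonlinear variant):

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))        # stand-in for, e.g., 500 DMM-LBP feature vectors

pca = PCA(n_components=0.95)           # keep enough components for 95% of the variance
X_linear = pca.fit_transform(X)

kpca = KernelPCA(n_components=32, kernel="rbf")  # nonlinear structure via the kernel trick
X_nonlinear = kpca.fit_transform(X)
```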

RBD is a linear dimensionality reduction technique based on the reduced basis method. For reducing high dimensions, the reduced basis method depends on truth approximations at a sampled set of optimal parameter values (Chen 2015). In Arunraj et al. (2018), the RBD method is used to reduce the dimensionality of the input features, where the error-determining norm for RBD is implemented with different norms such as the identity norm (I), all-ones norm (J), Symmetric Positive Definite (SPD) norm, and diagonal norm (D). The selection of an efficient error estimation norm depends on the subjects or application under consideration. The accuracy of the RBD method is lower than that of PCA, but it is faster (Arunraj et al. 2018).

Table 4 Summary of action representation techniques
Table 5 Advantages and disadvantages of different dimensionality reduction techniques
Table 6 Summary of dimensionality reduction techniques

The deep autoencoder is an unsupervised dimensionality reduction method in which learning is based on the data (Baldi 2012). In Ullah et al. (2019), dimensionality reduction is performed using a four-layer stacked autoencoder: changes in the raw input data are captured by the initial layers of the autoencoder, whereas patterns based on second-order features are learned by the intermediate layers. In Boulgouris and Chi (2007), the gait sequence consists of templates constructed from the Radon transform. Multiple gait templates can be constructed over several cycles of an action such as walking, and LDA is used to reduce the feature dimension. In LDA, dimensionality reduction is performed by projecting onto a lower-dimensional subspace that retains the information relevant to gait recognition.
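A minimal stacked-autoencoder sketch in PyTorch follows (an illustrative layer layout, not the exact architecture of Ullah et al. 2019); training minimizes reconstruction error, after which the bottleneck code serves as the reduced feature:

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    def __init__(self, in_dim=512, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        code = self.encoder(x)            # low-dimensional representation
        return self.decoder(code), code

model = StackedAutoencoder()
x = torch.randn(8, 512)                   # a batch of hypothetical feature vectors
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruction objective
```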

Another method for dimensionality reduction is Kernel Discriminant Analysis (KDA), the non-linear extension of LDA. In KDA (Khan and Sohn 2011), the input data is mapped to a high-dimensional feature space using a Radial Basis Function (RBF) kernel; for abnormal activity detection, KDA is more effective than LDA (Khan and Sohn 2011). Using this non-linear mapping, R-transformed data is mapped by KDA to the feature space (Khan and Sohn 2011). In Table 5, we present the advantages and disadvantages of these methods and summarize the dimensionality reduction techniques in Table 6.

2.3 Action analysis-based HAR

The action analysis task is performed on top of the action representation-based method(s). The low-level steps of action recognition may allow identifying object movement in the scene; however, these descriptors do not provide an understanding of the action label. To label action sequences, action classification techniques are used, which mainly include traditional ML as well as DL techniques. A taxonomy of the action classification methods based on ML and DL techniques reviewed in this section is shown in Fig. 3.

Fig. 3 A taxonomy of action classification methods

2.3.1 Traditional machine learning-based methods

Various techniques using ML have been proposed for HAR (Kim et al. 2016; Gan and Chen 2013; Nunes et al. 2017; Singh and Mohan 2017). We discuss ML-based action analysis methods such as graph-based methods, SVM, nearest neighbor, HMM, ELM, and hybrid methods.

Graph-based methods Graph-based methods used to classify input features for HAR include the Random Forest (RF) and the Geodesic Distance Isograph (GDI), to name a few. To increase the robustness of an action recognition system, a graph of a local action can be used, where STIP features are the vertices of the graph and an edge represents a possible interaction (Singh and Mohan 2017).

The RF classifier is a tree-based ML technique that leverages the power of multiple decision trees for making decisions. For HAR, the RF classifier can efficiently handle thousands of inputs (Ar and Akgul 2013). There is high interest in ensemble learning algorithms due to their higher accuracy and robustness to noise compared to a single classifier. In the RF classifier, each tree contributes to the assignment of the most frequent class for an input vector x, as given by Eq. 1 (Rodriguez-Galiano et al. 2012).

$$\begin{aligned} \hat{C}^B_{rf}(x)=\text {majority vote}\left\{ \hat{C}_b(x)\right\} _{b=1}^{B} \end{aligned}$$
(1)

where \(\hat{C}_b(x)\) is the class predicted by the bth random forest tree. RF has properties such as low error rates, guaranteed convergence (i.e., no over-fitting), faster training because each tree works on a subset of features (and hence better performance), robustness to noise, as well as simplicity (Nunes et al. 2017). In Gan and Chen (2013), the RF classifier and randomized decision trees are used for training a depth-based feature, namely the APJ3D feature, which includes the positions and angles of joints. In Xu et al. (2017), the RF classifier is used to classify activities from a dataset collected from accelerometer sensors.
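Equation 1 is what a standard random-forest implementation computes; a small scikit-learn sketch with stand-in features (not a specific paper's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=60, n_classes=4,
                           n_informative=20, random_state=0)  # stand-in action features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", n_jobs=-1)
rf.fit(X_tr, y_tr)                   # each tree votes; the forest takes the majority
print("accuracy:", rf.score(X_te, y_te))
```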

Support vector machine The concept of the Support Vector Machine (SVM) is to separate the data points using a hyperplane (Cortes and Vapnik 1995). For the classification task, SVM separates points in a high-dimensional space through a mapping; such a mapping provides a linear decision surface for the input data, which helps in the classification task (Cortes and Vapnik 1995). As shown in Eq. 2 (Cortes and Vapnik 1995), the weight vector l and bias term p define the position of the separating hyperplane in SVM.

$$\begin{aligned} f(x) = l \cdot x + p = 0 \end{aligned}$$
(2)

SVM uses the kernel trick to work with high-dimensional data and to reduce the computational burden. For HAR, SVM can be used when the number of samples is small (Qian et al. 2010). In Chakraborty et al. (2011), a BoVW model of local N-jet descriptors is built, and vocabulary building is performed by merging a spatial pyramid with vocabulary compression, where a \(\chi ^2\) kernel-based SVM classifier performs the human action classification. In Everts et al. (2014), a \(\chi ^2\) kernel SVM is used for training codebooks of sequences containing quantized HOG3D descriptors; for evaluation of the learned classifiers, leave-n-out cross-validation is used. In Zhu et al. (2014), quantized 3D data is given as input to the \(\chi ^2\) kernel-based SVM. The GDI graph is used in Kim et al. (2016) for optimizing and localizing human body parts in a given ROI. Instead of classifying each pixel of the input, feature points are randomly generated on the GDI graph, which gives the summed cost of the edges along the shortest path between two points. Thereafter, a graph-cut algorithm is applied along with the SVM classifier, which removes falsely labeled feature points using the previously generated GDI graph (Kim et al. 2016).

For smooth functions, the RBF kernel is preferable, whereas for handling discrete features, for example as needed by Bag of Visual Words (BoVW), the \(\chi ^2\) kernel is used due to its capability of modeling overlapping features (Cortes and Vapnik 1995). In Foggia et al. (2013), an SVM classifier is used to classify a codebook constructed using HoVW. SVM with the one-against-rest technique learns discriminant words for a particular event and ignores the others, and N such separate classifiers are constructed; for classification of the kth class against the rest, the kth classifier is trained on the training data set. In Shao et al. (2012), the PCOG feature, a shape descriptor calculated from MHI and MEI, is given as input to an SVM. For offline training, a multi-class SVM with an RBF kernel is used; moreover, to improve the training procedure, the input training sequences are divided into cycles for the duration of each movement. Non-linear data is classified by performing multi-class learning using a one-versus-one SVM classifier with a polynomial kernel (Nazir et al. 2018).
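A sketch of a \(\chi ^2\)-kernel SVM over BoVW-style histograms, using a precomputed kernel in scikit-learn (the histograms here are synthetic placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((200, 500)); X_train /= X_train.sum(axis=1, keepdims=True)
X_test = rng.random((50, 500));   X_test /= X_test.sum(axis=1, keepdims=True)
y_train = rng.integers(0, 6, 200)                 # labels for 6 action classes

K_train = chi2_kernel(X_train, X_train, gamma=0.5)
K_test = chi2_kernel(X_test, X_train, gamma=0.5)  # rows: test, cols: train

svm = SVC(kernel="precomputed", C=10.0)           # one-vs-one multi-class by default
svm.fit(K_train, y_train)
pred = svm.predict(K_test)
```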

Nearest neighbor Non-parametric classifiers provide a classification decision based on the data without performing a training task. The commonly used non-parametric classifier is Nearest Neighbor (NN) estimation (Boiman et al. 2008). A variant of NN, namely NN-image, is used for image classification by comparing the image to the nearest class image; the classification result of the NN-image classifier is inferior to learning-based classifiers such as SVM and DT (Boiman et al. 2008). In Oikonomopoulos et al. (2005), the k-Nearest Neighbor (kNN) classifier is used with a Relevance Vector Machine (RVM), a kernel-based sparse model with functionality similar to SVM. In RVM, learning is performed using a Bayesian approach, and a Gaussian prior is used for the model weights, since overfitting can result from maximum-likelihood estimation of the weights. Positive values of these weights correspond to the relevance vectors of the class representing the human action.

Hidden Markov model An HMM models movement from one state to another according to given transition probabilities. Different hidden states are specified in the training stage of the HMM; for a given problem, the probabilities corresponding to state transitions and outputs are optimized during training (Gavrila 1999). Output symbols are produced by optimization based on the HMM matching image features to a motion class (Gavrila 1999). The motivation for integrating HMMs into HAR is that they easily model the temporal evolution of features extracted from videos (Vezzani et al. 2010). However, in HMMs, the selection of parameters such as the number of states and the number of symbols requires a trial-and-error approach (Yamato et al. 1992). An HMM is used to classify the silhouette feature obtained by applying the R-transform to the input(s) (Jalal et al. 2012); here, depth-based silhouettes are used as input features.
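A common classification scheme is one HMM per action class, assigning a test sequence to the class whose model yields the highest log-likelihood; a sketch assuming the hmmlearn library and per-frame feature sequences:

```python
import numpy as np
from hmmlearn import hmm

def fit_action_hmm(sequences, n_states=5):
    """Fit a Gaussian HMM to a list of per-frame feature sequences of one action class."""
    X = np.concatenate(sequences)            # (sum of T_i, feature_dim)
    lengths = [len(s) for s in sequences]    # boundaries of the individual sequences
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify(seq, models):
    """Pick the action class whose HMM assigns the highest log-likelihood to `seq`."""
    scores = {label: m.score(seq) for label, m in models.items()}
    return max(scores, key=scores.get)
```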

To obtain robustness to initial conditions, an improved version of HMM called the Coupled HMM (CHMM) is used in Brand et al. (1997). The current state in a CHMM is determined by the state of its own chain and the neighboring chains at the previous timestamp. Actions with coordinated movements, such as moving both hands, are effectively classified using the CHMM (Brand et al. 1997). Another variant of HMM, called the Layered HMM (LHMM) (Oliver et al. 2002), is used to enhance the robustness of the system; the LHMM segments the model into different layers operating at different temporal granularities.

Extreme learning machine The ELM was originally developed as a feed-forward network with a single hidden layer and is computationally faster at optimizing its parameters than gradient-based learning methods (Huang et al. 2004). In ELM, the weights of the input layer and the biases in the hidden layer(s) are randomly chosen (Iosifidis et al. 2014). The principle of ELM is that the Single-hidden-Layer Feedforward Network (SLFN) learns without iteratively tuning the hidden neurons; the hidden node parameters are generated randomly (Huang et al. 2004). Kernel-based ELM (KELM) is a variant of ELM that depends only on the input data (Chen et al. 2015b); KELM can be used in situations where the number of features is larger than the number of samples.

An RBF kernel-based ELM method is applied in Chen et al. (2015b) using a single hidden layer. DMM-LBP features are input to the ELM network either by fusing the projections from the top, side, and front, or by decision-level fusion using a logarithmic opinion pool over the classifier scores on the different projections. In Chen et al. (2017), multi-temporal DMM and patch-based LBP features are classified using KELM; the extracted DMM and LBP features are selected using the LDA method. The parameters of KELM are chosen using a 5-fold cross-validation technique to validate the performance of the network.
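Since ELM training reduces to a single least-squares solve, the basic (non-kernel) variant can be sketched directly in NumPy:

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=500, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, Y):
        """X: (N, d) features; Y: (N, C) one-hot labels."""
        d = X.shape[1]
        self.W = self.rng.normal(size=(d, self.n_hidden))   # random input weights
        self.b = self.rng.normal(size=self.n_hidden)        # random hidden biases
        H = np.tanh(X @ self.W + self.b)                    # hidden-layer activations
        self.beta = np.linalg.pinv(H) @ Y                   # closed-form output weights
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return (H @ self.beta).argmax(axis=1)               # class with the largest score
```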

Zero-Shot Learning In computer vision research, supervised classification techniques are popular, and their popularity has increased with the introduction of deep networks. Supervised techniques require an abundant amount of labeled data for training; a learned classifier can deal with instances belonging to the trained classes, but cannot deal with unseen classes. In zero-shot learning, labeled training instances belong to a set of seen classes, whereas testing instances belong to unseen classes. Zero-shot learning is widely used in problems related to videos; in Zero-Shot Action Recognition (ZSAR), it is used to recognize videos of unseen actions. Popular datasets for the action recognition task are UCF101 and HMDB51. Zero-shot learning has demonstrated promising results: newly observed activity types can be detected by ZSAR using the semantic similarity between the activity and other embedded words in the semantic space (Al Machot et al. 2020), and large-scale ZSAR can be modeled using the visual and linguistic attributes of action verbs (Zellers and Choi 2017).

To narrow the knowledge gap between existing methods and humans, an end-to-end ZSAR framework based on a structured knowledge graph is proposed in Gao et al. (2019). To exploit the graph, a Two-Stream Graph Convolutional Network (TS-GCN) is used, consisting of a classifier branch and an instance branch; the classifier branch takes the semantic-embedding vectors of all the concepts as input and generates the classifiers for the action categories (Gao et al. 2019). By designing a two-stream GCN model with a classifier branch and an instance branch, this approach is able to effectively model action-attribute, attribute-attribute, and action-action relationships; in addition, a self-attention mechanism is adopted to model the temporal information across video segments. Zero-shot learning can be applied in the standard setting, wherein the classes of the training and test instances are disjoint, or in generalized Zero-Shot Learning (GZSL), where the training and test classes may overlap (Norouzi et al. 2013). GZSL is considered much harder than the standard setting, as the learned models can be biased towards the classes seen during training. A generative framework for zero-shot action recognition, proposed in Mishra et al. (2018), can be applied to both the generalized and the standard cases; each action class is modeled using a probability distribution whose parameters are functions of the attribute vector representing the action class.
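Stripped to its core, zero-shot classification compares a projected visual feature with the semantic embeddings of the unseen class names; a minimal cosine-similarity sketch, where the projection matrix W is assumed to have been learned on the seen classes:

```python
import numpy as np

def zero_shot_classify(visual_feat, W, class_embeddings, class_names):
    """visual_feat: (d,) video feature; W: (d, e) learned projection to semantic space;
    class_embeddings: (K, e) word/attribute vectors for K unseen action classes."""
    z = visual_feat @ W                                   # project into semantic space
    z = z / np.linalg.norm(z)
    C = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    return class_names[int(np.argmax(C @ z))]             # nearest class by cosine
```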

Fig. 4 A hybrid classification model for HAR using SVM–NN (Vishwakarma and Kapoor 2015)

Hybrid methods To enhance the performance of HAR, hybrid methods can be used. In Vishwakarma and Kapoor (2015), key poses are extracted as silhouettes and classified using a hybrid SVM and kNN algorithm called SVM–NN. As shown in Fig. 4 (Vishwakarma and Kapoor 2015), the SVM–NN method performs feature extraction via silhouette extraction and uses PCA for dimensionality reduction; samples misclassified by the SVM are further fed into the kNN classifier. In Xu et al. (2016), for behavior recognition, skeleton features are mapped to the Lie group and, after preprocessing using PCA, an SVM is used to classify the PCA-optimized features. Optimizing the error and radius values in the SVM provides better classification accuracy.

The Naïve Bayes (NB) classifier is based on Bayes' theorem. The conditional probability that an event belongs to a class can be calculated from the conditional probabilities of the particular events in each class, for a sample \(x \in X\) and C classes, where X denotes a random variable. The conditional probability that x belongs to class \(c_k\) is given by Eq. 3.

$$\begin{aligned} P(c_k \mid x)=\frac{P(c_k)\,P(x \mid c_k)}{P(x)} \end{aligned}$$
(3)

Equation 3 describes a pattern classification problem (Jalal and Kim 2014): it gives the probability that the given data belongs to a class. The optimal class is the one with the highest probability among all C possible classes, which minimizes the classification error. The NB method assumes that the input features are statistically independent. The hybrid Naïve Bayes Nearest Neighbor (NBNN) is used for video classification (Yang and Tian 2014), wherein a direct Image-to-Class distance is computed; to compute the separation between images, the kernel matrix of an SVM is used. The NBNN image classification scheme is extended to video classification in Yang and Tian (2014), where eigen-joints are used as frame descriptors without quantization, and a Video-to-Class distance is computed over the frame descriptors. Experimental results in the literature show that the Image-to-Class distance tends to provide better generalization ability than the Image-to-Image distance when applied to the kernel matrix of an SVM.
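Equation 3 can be evaluated directly once per-class priors and likelihoods are estimated; a small NumPy sketch under the common Gaussian class-conditional assumption (not tied to a specific cited work):

```python
import numpy as np

def gaussian_nb_posterior(x, means, vars_, priors):
    """Evaluate Eq. 3 with Gaussian class-conditionals and independent features.

    x: (d,) sample; means, vars_: (C, d) per-class statistics; priors: (C,) P(c_k).
    """
    log_lik = -0.5 * (np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_).sum(axis=1)
    log_post = np.log(priors) + log_lik          # log P(c_k) + log P(x | c_k)
    log_post -= np.logaddexp.reduce(log_post)    # normalize by log P(x)
    return np.exp(log_post)                      # posterior P(c_k | x); argmax = class
```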

Pose information of the human body provides important cues about actions (Liu et al. 2013). In Liu et al. (2013), pose-based HAR is performed using the Weighted Local NBNN (WLNBNN) method, an improved version of NBNN: weights are assigned to the query descriptors, and Euclidean distances are calculated using a nearest-exemplar search. Input poses are transformed into pyramidal features using a Gabor filter, a Gaussian pyramid, and the wavelet transform, inspired by multiresolution analysis in image processing (Liu et al. 2013).

Discussion Discriminative classifiers learn a direct mapping that links inputs to their corresponding class labels. Due to their high performance and simplicity, supervised techniques such as SVM and NN are frequently used for action classification. However, when dealing with high-volume datasets, traditional ML techniques may not achieve efficient performance. The advantages and disadvantages of using ML techniques for action recognition are presented in Table 7.

It can be noticed that real-life actions are likely to be more complicated than the actions in the datasets. Besides that, new samples may not contain labels, which makes supervised methods inappropriate; ZSAR has therefore emerged as an attempt to overcome these limitations. We also present a summary of the reviewed traditional ML-based classification techniques in Table 8.

Table 7 Advantages and disadvantages of traditional ML-based techniques for action classification

2.3.2 Deep learning-based methods

DL is a set of techniques that enable computers to perform tasks in a manner similar to the way a human brain naturally does. In this survey, we review the CNN, RNN, Long Short-Term Memory (LSTM), Deep Belief Network (DBN), as well as the Generative Adversarial Network (GAN), which are widely used networks for the action recognition task.

In a CNN, feature maps are created using local neighborhood information. The CNN architecture contains three basic operations for feature extraction: convolution, non-linear activation, and pooling (average, min, or max). To capture spatial and temporal features for video analysis, 3D convolution is proposed in Ji et al. (2013); the 3D convolution method convolves a 3D kernel over multiple stacked frames. The 3D convolution method has a high computational cost, and the training time is higher in the absence of supporting hardware such as GPUs (Ji et al. 2013). The architecture of a CNN is composed of multiple such feature extraction stages; the basic architecture of a deep CNN is depicted in Fig. 5 (Weimer et al. 2016).
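A hedged PyTorch sketch of a small 3D CNN over a stack of frames follows (an illustrative architecture, not the network of Ji et al. 2013):

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """3D convolutions over a clip of shape (batch, channels, frames, height, width)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                 # pool spatially, keep time
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                         # pool space and time together
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, clip):
        f = self.features(clip).flatten(1)
        return self.classifier(f)

model = Simple3DCNN()
logits = model(torch.randn(2, 3, 16, 112, 112))      # 2 clips of 16 RGB frames each
```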

Table 8 Summary of traditional ML-based techniques for HAR

A CNN is denoted as deep when multiple layers of feature extraction stages are connected together (Weimer et al. 2016). In Baccouche et al. (2011), the convolution in CNN is performed in both the space and time domains: a 3D CNN takes space–time volumes as input, and an LSTM is then trained on the features extracted by the 3D CNN. In this way, spatio-temporal information can be extracted from the input video. Due to the layer-by-layer stacking of 3D convolutions, 3D CNN models have higher training complexity as well as higher memory requirements (Zhou et al. 2018b). The Mixed Convolutional Tube (MiCT) network is proposed in Zhou et al. (2018b), wherein the feature maps of a 3D input are coupled serially with a 2D convolution block. Cross-domain residual connections are added along the temporal dimension to reduce the computational complexity of the model. The advantage of these residual connections is that the 2D convolutions extract the static 2D features, so the 3D convolutions only need to learn the residual information.

In Huang et al. (2018), pose-based features are extracted from a 3D CNN, wherein 3D pose, 2D appearance, and motion streams are fused. Since extracting color joint features with a 3D CNN would result in high complexity, a 15-channel heatmap is constructed and convolution is performed on each map. In skeleton-based HAR, the pairwise distances between skeleton joints are computed in Li et al. (2017a): Joint Distance Maps (JDM) are given as inputs to four ConvNets, which are trained and combined by late fusion (a simplified sketch of JDM construction follows). On the other hand, skeleton-based input is classified by a multi-stream CNN in Liu et al. (2017b), which involves a modified AlexNet (Krizhevsky et al. 2012); color-encoded input data is given to each CNN, and the probabilities generated by each CNN are fused for the final class score. The study shows the robustness of the multi-stream CNN against changes in view, noisy input skeletons, and similar skeletons across different classes, and also demonstrates the superiority of the proposed network over LSTM-based methods.
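The simplified sketch below computes a JDM-style descriptor: pairwise Euclidean distances between skeleton joints per frame, stacked over time into an image-like array. This is a loose illustration of the idea rather than the exact encoding of Li et al. (2017a).

```python
# Hedged sketch of a Joint Distance Map (JDM)-style descriptor.
import numpy as np

def joint_distance_map(skeleton):        # skeleton: (T frames, J joints, 3)
    T, J, _ = skeleton.shape
    diff = skeleton[:, :, None, :] - skeleton[:, None, :, :]  # (T, J, J, 3)
    dist = np.linalg.norm(diff, axis=-1)                      # (T, J, J)
    iu = np.triu_indices(J, k=1)         # keep unique joint pairs only
    return dist[:, iu[0], iu[1]]         # (T, J*(J-1)/2) image-like rows

seq = np.random.rand(30, 20, 3)          # 30 frames, 20 joints, xyz
jdm = joint_distance_map(seq)            # could be color-coded and fed to a CNN
```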

Fig. 5 Architecture of Deep Convolutional Neural Network (Weimer et al. 2016)

Fig. 6 Human Action Recognition System using pre-trained CNN model (Ullah et al. 2019)

Fig. 7 A HAR system depicting blending of LSTM and CNN (Li et al. 2017b)

A deep CNN, namely ConvNets, is used to perform efficient HAR with a smartphone's accelerometer and gyroscope (Ronao and Cho 2016). The local dependency of the 1D time-series signals is exploited, and features are extracted automatically by the CNN without advanced pre-processing, since handcrafted features do not transfer well to activities with similar patterns. To convert the CNN output into a probability distribution, the fully connected layer is combined with softmax. To incorporate both spatial and temporal streams, a two-stream convolutional network is proposed in Feichtenhofer et al. (2016), wherein RGB information (spatial) and optical flow (motion) are modeled independently and the predictions are averaged in the last layers (a minimal fusion sketch follows). This network is not able to capture long-term motion with optical flow; another drawback is that the performance of the spatial CNN stream depends on a single randomly selected image from the input video. Therefore, complications arise due to background clutter and viewpoint variation (Feichtenhofer et al. 2016).
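The late-fusion step of a two-stream network can be sketched as follows: per-stream class scores are converted to probabilities and averaged. The logits here are random placeholders; a real system would obtain them from trained spatial and temporal networks.

```python
# Minimal sketch of two-stream late fusion (spatial RGB stream and
# temporal optical-flow stream); scores are placeholder arrays.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

spatial_logits = np.random.randn(1, 101)    # RGB stream, e.g. 101 classes
temporal_logits = np.random.randn(1, 101)   # optical-flow stream

fused = 0.5 * softmax(spatial_logits) + 0.5 * softmax(temporal_logits)
prediction = fused.argmax(axis=1)           # final class decision
```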

In Wang et al. (2016), the Temporal Segment Network (TSN) is proposed. Since consecutive frames are highly redundant, dense temporal sampling is unnecessary, as it yields highly similar frames; TSN therefore exploits sparse sampling from long input videos (sketched below). The Inception with Batch Normalization (BN-Inception) network architecture is used. In addition to the RGB and optical flow images of two-stream networks, this approach employs the RGB difference between two frames (to model variation in appearance) and warped optical flow fields (to suppress background motion).
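The segment-based sampling idea fits in a few lines: the video is divided into a fixed number of segments and one snippet index is drawn from each. This is a simplified illustration assuming the frame count exceeds the segment count; frame decoding and network inference are omitted.

```python
# Hedged sketch of TSN-style sparse sampling: one random snippet per segment.
import random

def sparse_sample(num_frames, num_segments=3):
    # assumes num_frames >= num_segments
    seg_len = num_frames // num_segments
    return [random.randrange(k * seg_len, (k + 1) * seg_len)
            for k in range(num_segments)]

indices = sparse_sample(num_frames=300, num_segments=3)
# e.g. [42, 157, 233]: one snippet per segment instead of dense sampling
```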

To enhance the performance of skeleton-joint-based HAR, another two-stream network is proposed in Shi et al. (2019). Two streams, corresponding to joint information and bone information, are passed through an Adaptive Graph Convolutional Network (AGCN); the network contains a stack of these basic blocks, and the final output is passed through a softmax layer. In Li et al. (2019), an actional-graph-based CNN structure is proposed, which stacks multiple convolutions over an action graph as well as temporal convolutions; the graph structure is learned from data in order to capture dependencies among joints. In Ullah et al. (2019), HAR is performed on real-time video captured from a non-stationary camera, and a CNN is used to extract frame-level features automatically. In Fig. 6 (Ullah et al. 2019), the video stream is given as input to a pre-trained model; temporal changes in human actions are learned in a low-dimensional space by connecting the CNN to a deep autoencoder, and human actions are classified using a quadratic SVM classifier. In Huynh-The et al. (2019), the Pose Feature to Image (PoF2I) encoding scheme uses distance and orientation to represent skeleton data as an image; these images are used to fine-tune an Inception-v3 deep ConvNet, which reduces overfitting.

An approach to extract the ROI using a Fully Convolutional Network (FCN) is presented in Jian et al. (2019). A CNN is used to identify the pose probability of each frame, and key-frame extraction is performed using the probability difference between neighboring frames. The variation-aware key-frame extraction method selects the frame with the maximum key-pose probability calculated by the CNN; if several frames share the same key-pose probability, the center frame is selected. An LSTM contains a memory cell that is regulated by input, output, and forget gates; the gates determine the information flow entering or leaving the memory cell, and information is stored in the internal states of the cell (a standard formulation is given below). LSTM provides an automatic understanding of actions in videos. On the other hand, an attention-graph-based CNN is proposed in Si et al. (2019) to focus on joint positions in the skeleton, which helps to enhance key node features. The Attention Enhanced Graph-based Convolutional Neural Network with LSTM (AGCN-LSTM) is able to capture discriminative features.
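The gating mechanism can be summarized by the standard LSTM update equations (given here as general background with generic notation, not as the formulation of any specific cited paper): \( i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \), \( f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \), and \( o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \) for the input, forget, and output gates, respectively; the memory cell is updated as \( c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \) and the hidden state as \( h_t = o_t \odot \tanh(c_t) \), where \( \sigma \) denotes the sigmoid function and \( \odot \) element-wise multiplication.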

In general, most LSTM- and RNN-based methods consider skeleton sequences as low-level features and use raw skeleton coordinates as their inputs; hence, these networks cannot extract effective high-level features (Si et al. 2019). CNN-based methods, in contrast, are efficient for image-based recognition tasks (Akilan et al. 2017): they can efficiently preserve spatio-temporal information and can directly convert raw skeleton data to images (Kim and Reiter 2017). However, due to variations in viewpoint and appearance, the performance of such networks may degrade. To incorporate both spatial and temporal behaviors, a CNN can be combined with an LSTM; for 3D datasets, the LSTM and CNN combination is better than the LSTM and LSTM combination (Li et al. 2017b). Figure 7 (Li et al. 2017b) depicts feature extraction, network training, and score fusion for an action recognition task. Skeleton-based features of the spatial and temporal domains are input to the CNN and LSTM, respectively; spatial features correspond to the relative positions of and distances between joints, while temporal features correspond to JDM and trajectories. The scores of these features are combined by late fusion.

A model named Differential RNN (DRNN) is proposed in Veeriah et al. (2015), wherein actions are given a spatio-temporal representation and the network is trained using the Back-Propagation-Through-Time (BPTT) algorithm. Cross-validation accuracy in Veeriah et al. (2015) is reported by training on 16 randomly selected subjects and testing on the rest. A deep LSTM network can provide end-to-end action recognition in which feature co-occurrences are learned from the skeleton joints (Zhu et al. 2016b).

DBN is a DL-based network that uses the Restricted Boltzmann Machine (RBM) for training. In Hassan et al. (2018), a DBN is used for smartphone-based HAR; training is divided into two phases, termed pre-training and fine-tuning, and an RBM with 2 hidden layers is used for network initialization to improve HAR performance. To obtain rotation-, translation-, and scale-invariant features, the Motion History Image (MHI), Average Depth Image (ADI), Depth Differential Image (DDI), Hu invariant moments, and R transform methods are used in Foggia et al. (2014). The DBN is used to generate a robust representation of the samples as well as to build hierarchical features from low-level features (Foggia et al. 2014).

Fig. 8 Generator in openGAN (Yang et al. 2019)

Fig. 9 Discriminator in openGAN (Yang et al. 2019)

In Gowda (2017), a hybrid model of DL techniques is used for extracting features and identifying interest features: a DBN extracts motion and static image features, its output is passed to KPCA, and the result is given to a CNN to classify the action. Another approach combining spatial and temporal information was proposed in Qi et al. (2019): a semantic graph is constructed from each video frame, and node-RNNs and edge-RNNs are used to train the model; the constructed model can label the whole scene, individual actions, or interactions involving different persons. Subsequently, in Ahsan et al. (2018), a GAN is used for discriminator network training, and the learned discriminator provides initialized weights. This unsupervised pre-training provides the advantages of automated feature engineering and frame sampling (Lee et al. 2017). A typical GAN consists of a generator and a discriminator: the generator aims to create data resembling the training data, while the discriminator aims to maximize the probability of correctly distinguishing training samples from samples produced by the generator.
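This adversarial interplay can be written as the standard GAN minimax objective (stated here as general background, not as the exact formulation of the cited works): \( \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \), where \( G \) maps noise \( z \) to synthetic samples and \( D(x) \) estimates the probability that \( x \) comes from the training data.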

In Yang et al. (2019), openGAN is used for recognizing actions in an open-set setting, where the training and testing sets contain different categories. The components of openGAN perform feature extraction and feature combination using dense blocks connected by stacks of layers; each dense block is built from sub-blocks combining two convolutional layers with a concatenation layer. As shown in Fig. 8 (Yang et al. 2019), a convolutional and a de-convolutional layer are used in the generator together with a dense block. In Fig. 9 (Yang et al. 2019), convolutional and pooling layers are used in the DenseNet-based discriminator of the GAN. Feature maps are projected to an \(n+1\)-dimensional vector for n classes; in the last layer, a softmax classifier is used with the Mean Squared Error (MSE) loss function.

In the action recognition field, deep networks are dominant, but shallow ML-based methods should also be considered before blindly applying deep networks, as shallow techniques tend to perform efficiently on small datasets compared to deep networks. In some cases, transfer learning can be applied when features are general across the base and target datasets; it is also possible to fine-tune DL models, which can improve their performance. In Das et al. (2018), spatial layout and temporal encoding are modeled for daily activity recognition: skeleton data is used to capture long-term dependencies with a 3-layer stacked LSTM, and pose-based static features are extracted using a CNN. From each frame, body-region features are represented by the left hand, right hand, upper body, and full body. A pre-trained ResNet-152 is used for deep feature extraction, and the extracted features are fed into an SVM, which provides classification scores on a cross-validation set.
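The described pattern of pre-trained deep features feeding a shallow classifier can be sketched as follows; the layer choice, input crops, and labels are illustrative assumptions rather than the exact Das et al. (2018) configuration.

```python
# Hedged sketch: pre-trained ResNet feature extractor + SVM classifier.
import torch
import torchvision.models as models
from sklearn.svm import SVC

resnet = models.resnet152(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()           # drop the classifier, keep features
resnet.eval()

frames = torch.randn(8, 3, 224, 224)      # 8 body-region crops (placeholder)
with torch.no_grad():
    feats = resnet(frames).numpy()        # (8, 2048) deep features

labels = [0, 0, 1, 1, 2, 2, 3, 3]         # dummy action labels
svm = SVC(kernel="linear").fit(feats, labels)  # shallow classifier on top
```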

For an action recognition problem, it is shown in Rensink (2000) that humans cannot focus their attention on an entire scene at once; instead, relevant information is extracted by sequentially focusing on different parts of the scene. When performing a particular task, the focus of a model can be identified using attention models, which add a dimension of interpretability (Sharma et al. 2015). In Sharma et al. (2015), the input videos are processed using GoogleNet and features are extracted from the last convolutional layer; a three-layer LSTM is used to predict class labels, and a cross-entropy loss with attention regularization forces the model to look at each region of the frame. The attention mechanism can be used in HAR to focus on a particular body part. In Das et al. (2019a), an end-to-end action recognition method is proposed using 3D skeletons and spatial attention over an I3D pre-trained 3D CNN, wherein temporal features are extracted using a three-layer stacked LSTM; the attention mechanism focuses on the relevant parts of the action.
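A minimal sketch of soft spatial attention in this spirit is given below: region features are scored against a recurrent hidden state, normalized with softmax, and used to compute a weighted summary. All shapes and parameters are placeholders, not the cited architecture.

```python
# Hedged sketch of soft spatial attention over CNN feature-map regions.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 49, 512)     # 7x7 = 49 regions, 512-dim each
h = torch.randn(1, 512)            # LSTM hidden state (placeholder)
W = torch.randn(512, 512)          # learned projection (placeholder)

scores = (feat @ W) @ h.unsqueeze(-1)            # region relevance: (1, 49, 1)
alpha = F.softmax(scores.squeeze(-1), dim=1)     # attention weights sum to 1
context = (alpha.unsqueeze(-1) * feat).sum(dim=1)  # weighted region summary
```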

Discussion Many opportunities are open in HAR with DL models due to available computing facilities such as GPUs. DL-based HAR methods focus on learning motion features and utilizing them to classify actions.

CNN-based networks provide good results and identify spatial relationships from RGB data; however, to exploit temporal dependencies from input videos, LSTM is a promising network. Due to the complementary properties of these networks, model performance can be greatly improved by applying late score fusion of CNN and LSTM. Moreover, a CNN requires a large amount of training data; otherwise, overfitting may occur, although dropout or data augmentation techniques can be applied to mitigate this problem. We briefly discuss the advantages and disadvantages of action classification using DL-based techniques in Table 9 and provide a summary of DL-based techniques for HAR. Action recognition frameworks, the datasets used, and their corresponding results are summarized in Table 10, and a summary of action analysis techniques based on traditional ML and DL is given in Table 11.

Table 9 Advantages and disadvantages of DL-based techniques for action classification

3 Datasets

Datasets play a key role in comparing different algorithms applied to a particular objective. Evaluation of task-specific algorithms depends on parameters that vary across datasets. It is computationally economical to capture two-dimensional (2D) color image sequences in real-time. However, with the introduction of inexpensive 3D sensors, such as the Kinect, depth-based processing has become a subject of interest for researchers (Li et al. 2010).

In this survey, we discuss different types of RGB and RGBD datasets in detail. Recording can also be performed with non-visual sensors, such as wearable on-body sensors (for example, accelerometers and gyroscopes) and radar. Sensor-based datasets have been reviewed in De-La-Hoz-Franco et al. (2018). The categorization of datasets is shown in Fig. 10. In this paper, we review camera-based datasets with RGB, depth, and skeleton modalities.

In the KTH dataset (NADA 2004), human actions are performed several times under different conditions. This dataset has few action classes, a resolution of \(160 \times 120\) pixels, and does not provide background models. The Ballet dataset (Wang and Mori 2009) contains eight actions from a ballet DVD, each performed by three subjects; the dataset contains variations due to speed, scale (spatial and temporal), and clothing.

The I3DPost dataset (Gkalelis et al. 2009) contains eight actions including a two-person interaction. All the cameras are set up to provide a 360-degree view of the captured scene. The Unusual Crowd Activity dataset (University of Minnesota 2010) contains normal and abnormal crowd videos, comprising 11 escape scenarios filmed in indoor and outdoor scenes.

The NATOPS dataset (Song et al. 2011) contains 24 aircraft handling signals routinely practiced in a deck environment; each of the twenty subjects repeated the signals 20 times, and the images have a resolution of \(320 \times 240\) pixels. The CAVIAR dataset (Fisher 2012) includes 9 actions. The data was captured at the INRIA Labs and a shopping centre in Lisbon, with an image resolution of \(384 \times 288\) pixels.

Fig. 10 Dataset categorization

The Hollywood2 dataset (Laptev 2012) contains human actions with 12 classes and scenes with 10 classes, for a total of 3669 video clips generated from movies. In the Florence 3D action dataset (MICC 2012), videos are captured using a Kinect camera. The DHA dataset (M. C. Laboratory 2012) contains 23 actions performed by 21 actors, classified across three different scenes; in the depth data, background information is removed. The MHAD dataset (Berkeley 2014) contains a set of activities with dynamic body movements, some involving dynamics in both the upper and lower extremities; the image resolution is \(640 \times 480\).

The HMDB51 dataset (Jhuang 2013) contains 51 action categories, with videos taken from movies, YouTube, and Google videos. In addition to action labels, meta-labels are provided to describe the input video. The UCF Sports dataset (CRCV 2010) contains sports actions featured on channels such as BBC and ESPN, with 150 sequences at \(720 \times 480\) resolution. The dataset is challenging in terms of its wide variety of scenes and viewpoints, thereby encouraging research on unconstrained environments.

The UCF50 dataset (CRCV 2012) contains 50 action categories from YouTube. The goal of the UCF101 dataset (CRCV 2013) is to support template matching in the temporal domain. The UCF YouTube action dataset (CRCV 2013) was created for recognizing actions from videos, which are typical uploads from amateur users recorded with hand-held cameras.

Table 10 Summary of DL-based techniques for HAR
Table 11 Summary of action analysis techniques based on ML and DL

The MuHAVi dataset (YACVID 2014) contains videos observed at varying angles and distances from the subject; the actions are filmed using eight surveillance cameras, which are not calibrated before capturing the videos. The Sports-1M dataset (Karpathy 2014) is composed of 1,133,158 video URLs from YouTube, annotated automatically with 487 sports labels.

When dividing this dataset into training and testing sets, it is possible in some cases that similar videos occur in both sets (Karpathy 2014). The UCSD Anomaly Detection dataset (Statistical Visual Computing Lab 2014) was acquired with a stationary camera overlooking pedestrian walkways: Peds1 contains clips of people walking towards and away from the camera, and Peds2 contains scenes of pedestrian movement.

The major challenges in a dataset arise from similarities among some of the actions. The Northwestern-UCLA Multiview Action3D dataset (Wanqing Li 2014) contains 10 actions with RGB, depth, and skeleton joint information. The Weizmann dataset (Blank et al. 2005) comprises 10 action categories; all action sequences are captured with a static camera against a plain background at an image resolution of \(180 \times 144\). The Johns Hopkins University multimodal action (JHUMMA) dataset (Murray et al. 2015) features ten actors performing actions recorded with three ultrasound sensors and an RGB-D sensor; it was captured indoors in a curtained auditorium.

The IXMAS dataset (INRIA 2016) models human actions by incorporating viewpoint-invariant data and different body sizes. Five cameras are used to capture different views. 13 daily-life activities were performed, with variation across activities due to differing clothing styles, body sizes, and execution rates.

The MSR action dataset (Liu 2016) contains 16 video sequences covering three types of actions; these sequences are captured with some clutter (Chaquet et al. 2013). The MSR Action3D dataset (Li 2017b) contains twenty actions performed by ten subjects at an image resolution of \(320 \times 240\); the depth maps were captured using a depth camera, and the dataset provides color, depth, and skeleton information for each action. In this dataset, ten recordings are missing due to erroneous information. The MSRDailyActivity3D dataset (Li 2017a) contains 16 activities, usually performed in two poses: "sitting on sofa" and "standing".

The Kinetics dataset (Kay et al. 2017) is a large-scale dataset containing 300,000 video clips across 400 classes, sourced from YouTube. In Yan et al. (2018), the locations of 18 joints are estimated on every frame of the clips. It includes nine activities performed by 10 subjects, with each action repeated 2 to 3 times. The SBU Kinect Interaction dataset (Computer-Vision-Lab 2012) consists of 21 pairs of two-person interactions of eight types, each having two sets; the videos were captured using the Kinect sensor, and each frame contains color and depth features. The UTKinect-Action3D dataset (Xia 2016) contains human actions recorded in an indoor setting (Xia et al. 2012); it provides depth, color, and skeleton information, with RGB images at \(480 \times 640\) and depth images at \(320 \times 240\) resolution, and it also contains frames with occlusion.

The NTU-RGBD action recognition dataset (Rapid-Rich-Object-Search Lab 2016) contains 56,880 action samples across 60 classes, performed by 40 subjects and captured from 80 views, with RGB video, skeleton, and depth modalities. Two protocols are popularly used for evaluation, namely Cross-Subject (CS) and Cross-View (CV). In CS evaluation, the forty subjects are split into training and testing groups: the samples of subject IDs 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35, and 38 are used for training, while the remaining subjects are reserved for testing (Rapid-Rich-Object-Search Lab 2016). In CV evaluation, all samples from camera 1 are used for testing, and samples captured by cameras 2 and 3 are used for training; in other words, the training set consists of the front and two side views of the actions, while the testing set includes the left and right 45-degree views. For this evaluation, the training and testing sets have 37,920 and 18,960 samples, respectively.
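The CS protocol can be sketched directly from the published subject list; the sample structure below is a hypothetical placeholder for however the dataset is loaded.

```python
# Hedged sketch of the NTU-RGBD cross-subject (CS) split: samples are
# routed to train/test by performer ID using the published training list.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18,
                  19, 25, 27, 28, 31, 34, 35, 38}

def cross_subject_split(samples):
    # each sample is assumed to carry its performer's subject ID
    train = [s for s in samples if s["subject_id"] in TRAIN_SUBJECTS]
    test = [s for s in samples if s["subject_id"] not in TRAIN_SUBJECTS]
    return train, test
```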

The MIVIA Action dataset (MIVIA-Lab 2017) is composed of seven types of actions. All subjects performed each action twice, with duration varying according to the nature of the action; a Kinect sensor is used to acquire the depth images and the background. The CAD-60 dataset (Robot-Learning-Lab 2017) contains 60 human activity videos with RGB, depth information, and tracked skeleton sequences, captured using the Microsoft Kinect sensor; the RGBD data has a resolution of \(240 \times 320\).

The Dongguk Activities and Actions database (CGCV-Laboratory 2017) is produced for indoor surveillance environments. The database covers three scenarios, namely straight-line movement, corner movement, and standing still, performed by 20 people. Improving the performance of the action recognition task requires a better understanding of the input data and its characteristics. The Toyota Smarthome dataset (Das et al. 2019b) captures daily activities and incorporates key challenges of action recognition, such as high intra-class variation and class imbalance, composite activities containing sub-activities, and activities of variable duration with similar motion. This dataset was captured with elderly people, and no script was given to the subjects, who performed actions over the entire day. Unlike other datasets, it comprises actions at variable distances between camera and subject, and it provides three modalities, namely RGB, depth, and skeleton.

A brief understanding of the advantages and disadvantages of such datasets is provided in Table 12. We also summarize various dataset attributes such as background, number of participants, number of cameras, movement of the camera, number of male and female participants, number of actions, modality, type of view, occlusion, and whether an action is scripted or not in Table 13.

Table 12 Advantages and disadvantages of various datasets for action recognition

3.1 Discussion

One important aspect of mapping to real-world complexity is that datasets should contain occlusion and intra- and inter-class variations. Most of the datasets discussed in this survey provide actions based on daily activities, while some have no particular focus. Other datasets we have discussed come under the gaming category (for example, the MSRAction3D dataset), and the CAVIAR dataset contains actions related to surveillance applications.

In RGB-based HAR techniques, two popular datasets, namely KTH and Weizmann, are primarily used. With the majority of techniques, these datasets achieve close to 100% accuracy; although they contain intra-class variations, they provide a good evaluation criterion for the methods. Moreover, the KTH dataset contains a limited number of activities and similar backgrounds. To meet real-world challenges and scale up the complexity of the data, datasets containing videos downloaded from the Internet are also considered; for example, Sports-1M and HMDB provide background clutter and scale variation to increase complexity.

In datasets such as Hollywood2, only a limited number of labeled videos is available. In the case of 3D action analysis, there is a lack of large-sized datasets; therefore, the NTU-RGBD dataset, with 56,880 RGB+D video samples from 40 different human subjects, was captured using the Microsoft Kinect v2.

To the best of our knowledge, there are no public sources of 3D videos for unconstrained environments. The recording of NTU-RGBD was also performed in a restricted environment, such as a laboratory, where the activities were performed under strict guidance. Therefore, Activities of Daily Living (ADL) datasets have only a partial capability to reflect real-world scenarios.

4 Applications

HAR can be used in a variety of applications such as content-based video analysis and retrieval, visual surveillance, HCI, education, medical, as well as abnormal activity recognition; this section discusses the significance of HAR in respective applications.

4.1 Content-based video summarization

In the current era, the rapid growth in video content is due to the immense use of multimedia devices. Retrieving this information manually would be a laborious and time-intensive task. The main goal of the content retrieval task is to provide users with content of their interest; this concept is known as Content-Based Video Retrieval (CBVR). In Kim and Park (2002), key-frames of the video are compared with the target videos, but the computational cost of the key-frame method is very high.

On the other hand, color and texture features are used for video summarization in Shereena and David (2014), where the authors also demonstrate the advantage of combining the two. Real-time video summarization is demonstrated in Bhaumik et al. (2015), wherein a threshold based on a probability distribution is used for generating the video summary, and duplicate features are removed through redundancy elimination.

4.2 Human–computer interaction

HCI-based systems aim to make human–computer interaction as natural as daily human interaction. A gesture recognition system was proposed in Sharma and Verma (2015) to recognize hand gesture images; the recognized images are static and have simple backgrounds. Skin segmentation is performed and fingers are detected by counting white-colored objects, while morphological filters are used to increase image quality. In Czuszynski et al. (2017), pose classification is performed in a gesture recognition system where gesture information is stored as a sequence of timestamps. Data is represented in three forms: raw data, a detailed description of data frames using features, and a high-level feature representation depicting the hand pose. A two-layer ANN is used to recognize the extracted features and outputs a number denoting the type of hand pose. Also, a cost-effective gesture recognition system based on data captured from a laptop is proposed in Haria et al. (2017), wherein a Haar cascade classifier is used to classify palm and fist gestures.

Table 13 Summary of various datasets and their properties (Note: S: Skeleton, D: Depth, U: Unspecified, F: Female, M: Male, Occ: Occlusion, Act: Acted, Y: Yes, N: No)

4.3 Education

Recognizing human actions from videos has a crucial role in education and learning. Analyzing human actions from video in educational institutes can provide behavior recognition and automatic monitoring of attendance during class. The manual procedure for taking student attendance can be time-consuming, and during this process the instructor may not be able to observe the students.

Nowadays, due to technological advancements, automated real-time attendance monitoring systems can be deployed in the classroom. In Chintalapati and Raghunadh (2013), an automated attendance monitoring system is proposed using the Viola-Jones algorithm. A comparative analysis of feature extraction algorithms using PCA, LDA, and the LBP Histogram (LBPH) is performed, among which the LBPH method performs best. To capture videos, the camera is placed at the classroom entrance, and students are registered while entering the classroom.
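A hedged sketch of such a pipeline, combining Haar cascade (Viola-Jones) face detection with OpenCV's LBPH recognizer, is given below; the training faces, IDs, and camera frame are placeholders, and the LBPH recognizer requires the opencv-contrib package.

```python
# Hedged sketch: Viola-Jones face detection + LBPH recognition for attendance.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
recognizer = cv2.face.LBPHFaceRecognizer_create()  # needs opencv-contrib

# faces: grayscale face crops (placeholders); ids: matching student IDs
faces = [np.random.randint(0, 255, (100, 100), np.uint8) for _ in range(4)]
ids = np.array([1, 1, 2, 2])
recognizer.train(faces, ids)

frame = np.random.randint(0, 255, (480, 640), np.uint8)  # entrance camera
for (x, y, w, h) in detector.detectMultiScale(frame, 1.3, 5):
    student_id, distance = recognizer.predict(frame[y:y + h, x:x + w])
    # lower distance means a more confident match; mark student_id as present
```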

In Lim et al. (2017), students and their activities, such as leaving and entering the classroom, are identified. The system performs action recognition and identification through face recognition and motion analysis. A Haar cascade classifier is used for detecting faces, and a combination of the eigenfaces and fisherfaces algorithms is used for training. For motion analysis, three sub-modules are used, namely body detection, tracking, and motion recognition. To perform attendance monitoring, assumptions are made about the brightness and size of the classroom.

4.4 Healthcare

The healthcare of elderly people is a major concern, as they are prone to disease. Continuous monitoring using automatic surveillance systems is required to identify falls or abnormal behavior in elderly patients. An approach for representing the behavior of dementia patients (Alzheimer's and Parkinson's disease) is presented in Arifoglu and Bouchachia (2017), where abnormal activity in elderly patients with dementia is detected using RNN variants: vanilla RNNs, LSTM, and the Gated Recurrent Unit (GRU).

Real-time monitoring of abnormal patient behavior can be performed using smartphone-based sensors. A smartphone-based Wireless Body Sensor Network is used in which physiological data is collected by body sensors in a smart shirt: temperature, ECG, BP, BG, and \({\hbox {SpO}}_2\) are monitored continuously, and an alert message is issued in real-time in case of an abnormal sign (You et al. 2018). Subsequently, the position and velocity of the person are extracted using a Kinect sensor in Nizam et al. (2017) for fall detection. Within the sensor range, the velocity of the body is monitored continuously to detect abnormal activities; to distinguish a fall from other abnormal activity, the subject's position is checked in subsequent frames. Velocity is computed from the skeleton joints provided by the Kinect sensor, and accuracy, sensitivity, and specificity are calculated for fall and non-fall actions. For fall detection, depth maps can also be used (Panahi and Ghods 2018). Feature extraction is based on an ellipse-fitting method in which the orientation of the ellipse is calculated for pose identification (Yu et al. 2013); another feature is the distance from the ellipse center to the floor (represented by a plane) in 3D space. An SVM is applied to classify the pose-based features.
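A minimal sketch of the velocity-based cue for fall detection is shown below; the joint choice, frame rate, and threshold are illustrative assumptions, not values from the cited papers.

```python
# Hedged sketch: velocity-based fall cue from Kinect skeleton joints.
import numpy as np

def vertical_velocity(joint_y, fps=30.0):
    # joint_y: per-frame height (y-coordinate) of a torso joint
    return np.diff(joint_y) * fps            # units per second

heights = np.array([1.0, 1.0, 0.95, 0.7, 0.4, 0.35])  # sudden drop
v = vertical_velocity(heights)
FALL_THRESHOLD = -5.0                         # assumed downward speed limit
possible_fall = (v < FALL_THRESHOLD).any()
# a real system would confirm by checking the subject's position
# in the subsequent frames, as described above
```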

4.5 Video surveillance

A video surveillance system offers visual surveillance while the observer is not directly at the recording site. The surveillance task may be performed in real-time by analyzing video, or the video may be stored and evaluated later as required. Video surveillance can also be used to detect abnormal activity and to analyze player behavior in gaming videos (Wang et al. 2015).

4.6 Abnormal activity recognition

Abnormal behavior recognition can be used to ensure security in places such as railway stations, airports, and outdoor spaces. Recognizing such events is challenging due to the large number of surveillance cameras involved.

Abnormal behavior in three categories, namely person, group, and vehicle, is identified using a single Dynamic Oriented Graph (Varadarajan and Odobez 2009). Even when objects follow the same paths, abnormal behavior can be identified; for example, a person crossing a railway line is considered unusual, whereas a train crossing the railway line is considered usual. The anomaly event detection task is divided into global and local anomalies (Miao and Song 2014): the global anomaly task performs emergency clustering, while individual behavior is assessed under the local anomaly task. For global anomaly detection, the UMN dataset (CRCV 2020) is used, in which people suddenly running out of the scene is considered a global anomaly; for local anomalies, the UCSD dataset (Statistical Visual Computing Lab 2014) is used, in which training samples contain walking people and abnormal behavior includes a single person cycling or skating.

A graph-based method for abnormal activity detection is presented in Duque et al. (2007), wherein graph nodes are given by STIPs and the connections among nodes are given by a fuzzy membership function. The anomaly detection task is divided into two subtasks, local and global, and both local and global abnormal activities are classified using SVM. An intelligent system for crowded scenes is presented in Feng et al. (2017) using a deep Gaussian Mixture Model; a multi-layer nonlinear transformation of the input is performed adaptively for feature extraction from the sensors, which improves the performance of the network with few parameters.

4.7 Sports

Motion in sports videos is difficult for trainers to analyze, and continuously observing long matches can be difficult for the audience to follow (Thomas et al. 2017). Recent research includes the analysis of player movement, individually and in groups, for training as well as for finding key stages of a game. In YACVID (2014), sports video highlight classification is performed using a DNN: players' actions are used to acquire a higher-level representation with a two-stream CNN combining skeleton-joint-based and RGB-based streams, and an LSTM is used to model temporal dependencies in the video.

In Ullah et al. (2019), the pre-trained deep CNN model VGG-16 is used to extract frame-level features for identifying player actions; a deep autoencoder learns temporal changes, and human actions are classified with an SVM. Graph-based models are popular for recognizing group activities. In Qi et al. (2019), sports videos are classified based on scene content using a semantic graph, and a structural RNN extends the semantic graph model to the temporal dimension.

4.8 Entertainment

The HAR field is widely used to identify actions in movies and dance-related movements. In Laptev et al. (2008), an action retrieval task is presented using a text-based classifier (regularized perceptron), and action classification from movie scripts is shown using space–time features and non-linear SVMs. In Wang et al. (2017), movie actions are classified using a 3D CNN; to minimize the loss of information while learning, the study introduces two modules, an encoding layer and a temporal pyramid pooling layer, and incorporates a feature concatenation layer to combine motion and appearance information. Two movie datasets, namely HMDB51 (Jhuang 2013) and Hollywood2 (Laptev 2012), are used for experimentation. Another application of HAR is identifying dance movements from videos: in Kumar et al. (2018), the authors propose a multi-class AdaBoost classifier with fused features, using an Indian classical dance dataset consisting of online and offline videos of different dance forms.

In video classification, motion information between frames plays a crucial role in the performance of the action classification task. In Castro et al. (2018), the authors observe that for motion-intensive videos, visual information alone is not sufficient for classifying actions efficiently; the analysis of the action recognition task is therefore performed using video, optical flow, and multi-person pose data.

5 Challenges

Despite the progress made in the field of HAR, state-of-the-art algorithms still misclassify actions due to several major challenges, which we discuss in this section. For an action recognition task, there can be differences between performances of the same action by the same subject, and actions of different classes may appear similar; for example, jogging can be mistaken for fast running. HAR models should be able to handle variations within one class as well as similarities across classes. In Lu et al. (2018), sports action classification is performed, and a model trained on one sport does not provide good results when tested on another sport.

For handcrafted representations, the high dimensionality of the training dataset may incur substantial computation, so various dimensionality reduction techniques are used. At different points in time, an action may be performed at varying speeds by the same subject or by different subjects; such variation in action speed must be taken into consideration by the HAR system.

In Chen et al. (2016), action speed variation is handled by a three-level multi-temporal representation of the DMM feature. Action recognition also suffers heavily from background clutter, wherein unwanted background motion may create ambiguities; such problems can be handled by applying background subtraction before the action recognition task (Kalaivani and Vimala 2015). Depth-based techniques are more stable with respect to environment changes and background (Jalal et al. 2012).
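As an illustration of such pre-processing, the sketch below applies OpenCV's MOG2 background subtractor to a video stream; the input path and parameter values are placeholders.

```python
# Hedged sketch: background subtraction as a HAR pre-processing step.
import cv2

cap = cv2.VideoCapture("input_video.avi")   # hypothetical input path
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)          # foreground (moving) pixels
    # the masked frame would then feed the action recognition pipeline
cap.release()
```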

5.1 ML-based HAR

Conventional ML for action classification may be limited when faced with large-scale actions performed in challenging environments (for example, transformed single-actor actions, interaction-based actions, and actions involving multiple subjects). Machine-learning-based classifiers also struggle to handle large amounts of data.

A further challenge for traditional ML-based methods is the handling of imbalanced data. Moreover, training with ML techniques can suffer from a slow learning rate, which gets even worse for large-scale training data, and from a low recognition rate. In ML-based HAR, the majority of the work is conducted in a supervised setting; although this has provided promising solutions, labeling all the activities requires a large effort.

5.2 DL-based HAR

DNNs are said to perform better given a large amount of training data (Sze et al. 2017). To learn hierarchical features from input videos, approaches based on RNN and LSTM have been used, improving performance on actions with temporal dependencies; however, these models increase network complexity.

CNN-based networks are also popular DNNs for HAR, but they face challenges such as occlusion and variation in viewpoint. It is also difficult to interpret the deep features extracted by a CNN: deep CNNs are generally treated as black boxes and may thus lack interpretation and explanation, making them sometimes difficult to verify. In addition, CNN-based methods rely on large amounts of data, yet many realistic scenarios lack sufficient training data, even though some large-scale datasets have been developed to make fine-tuning of CNN architectures possible.

5.3 Hybrid HAR

Hybrid approaches can combine features and preprocessing steps; however, the computational complexity of the resulting system is high, which may hinder real-time video processing as well as the processing of lengthy videos. The main challenge of hybrid HAR is thus the computational cost of training the model, which causes difficulty for lengthy videos and real-time applications with continuous video streaming.

6 Future directions

Although ongoing HAR approaches have made incredible progress so far, applying current HAR approaches in real-world frameworks or applications is still nontrivial. In this section, future directions for traditional ML-based, DL-based, and hybrid HAR are discussed.

6.1 Traditional ML-based HAR

The HAR task can be extended to identify actions with emotions, such as happy-sitting or angry-running. Another line of future work is to design models specific to applications. Moreover, ML algorithms should be made able to operate on massive data, and ML methods should be extended beyond trimmed action sequences.

6.2 DL-based HAR

To improve performance, 3D CNNs may be applied, as they have the capability to exploit spatio-temporal features. Another prospective direction is ensemble learning, since model performance can be improved by combining multiple architectures; concepts such as batch normalization, dropout, and new activation functions are also worth mentioning. To improve generalization, reinforcement learning or active learning techniques can be used. In the future, gait parameters can be calculated from walk detection for assessing fall risk and for disease monitoring, and multi-person recognition can be addressed. Methodologies should also be provided for classifying videos containing overlapped actions. Daily-activity-based HAR applications require actions to be identified continuously from untrimmed video streams, which is known as online action recognition; therefore, a key future direction in this field is to apply action recognition methods to the online case.

6.3 Hybrid HAR

A promising future direction is multimodal perception for action recognition, as RGB-D-based methods (using skeleton and depth sensors) are currently popular in the HAR field. The Kinect is a low-cost sensor for capturing depth data, but it usually does not work properly in sunlight (Pagliari and Pinto 2015), which may hinder the performance of a HAR system. For this purpose, multimodal fusion of RGB, skeleton, and depth data can be used to improve the performance of the system.

7 Concluding remarks

Automated HAR is considered a key domain for understanding human behavior. This review surveys existing techniques for HAR on trimmed videos. We have discussed the general framework of an action recognition task, comprising feature extraction, feature encoding, dimensionality reduction, and action classification. Feature extraction methods are categorized based on STIP, shape, texture, and trajectory. Due to the large size of the extracted features, dimensionality reduction techniques are used, which can be divided into two types: supervised and unsupervised. We have discussed action classification methods involving ML and DL, along with the advantages and disadvantages of action representation, dimensionality reduction, and action classification methods. The datasets used by all the approaches consist of segmented videos with a known set of action labels, and we have reviewed the different datasets used for HAR. Application areas such as content-based video retrieval, video surveillance, HCI, education, medical applications, and abnormal activity detection are also discussed in the paper.