
1 Introduction

Human activity recognition is currently a central focus of computer vision research. Action recognition has long been attempted using video sequences recorded by conventional cameras, and spatiotemporal features are frequently used to recognize human behaviors, e.g., [1]. Advances in imaging technology now make real-time depth data acquisition possible. Depth maps are less susceptible to changes in lighting conditions than traditional images, and they provide 3D information for distinguishing behaviors that are difficult to characterize using standard images. Figure 1 shows two examples, a golf swing and a forward kick, each represented by nine depth maps. Since the introduction of inexpensive depth sensors, in particular the ASUS Xtion and Microsoft Kinect, numerous studies on human action recognition using depth images have been conducted, e.g., [2]. As has been observed, the 3D skeleton joint positions of a person, estimated from depth images, provide additional information for action recognition.

Fig. 1

Actions of a golf swing and a kick forward are examples of depth map sequences [3]

Classification techniques must answer two key questions: “Which action?” (the recognition problem) and “Where in the video?” (the localization problem). To recognize human activity effectively, the computer must know the kinetic states of the person being observed.

The aim of human activity recognition is to examine actions from still images or video sequences; such systems seek to correctly classify the input data into the relevant activity category. Human behavior is divided into six categories of increasing complexity: gestures, atomic actions, human-to-object and human-to-human interactions, group actions, behaviors, and events. Figure 2 shows this decomposition of human activities by complexity.

Fig. 2

Decomposition of human activities

The remainder of the paper has the following structure. Section 2 gives background. The depth motion map features are explained in Sect. 3. Human activity categorization is described in Sect. 4. Unimodal and multimodal approaches are covered in Sects. 5 and 6, respectively. The collected dataset is considered in Sect. 7, and Sect. 8 presents concluding remarks.

2 Background

Techniques based on space-time volumes, spatiotemporal features, and trajectories are widely used to recognize human actions in video recorded by conventional RGB cameras. In [4], spatiotemporal interest points were combined with an SVM classifier to recognize human actions, with cuboid descriptors used to represent the actions. Activities in video sequences have also been identified using SIFT-feature trajectories described at three levels of abstraction. In other work, local motion features were assembled into a spatiotemporal bag-of-features (BoF) representation for action classification [3]. Motion energy images (MEIs) and motion-history images (MHIs) were introduced in [5] as motion templates to characterize the spatial and temporal properties of human motion in video; a hierarchical extension that computes dense motion flow from the MHI later improved accuracy. A significant drawback of approaches that depend on hue or intensity is their sensitivity to illumination changes, which restricts recognition robustness. Research on action recognition using depth data has expanded with the introduction of RGB-D sensors. Skeleton-based techniques retrieve skeletal joint locations from depth images. A view-invariant posture representation was created using histograms of 3D joint positions (HOJ3D) in a modified spherical coordinate system; the HOJ3D features were reprojected with LDA, clustered into K visual words, and the sequential evolution of these visual words was modeled with a continuous hidden Markov model. A Naive Bayes Nearest-Neighbor (NBNN) classifier based on EigenJoints (i.e., variations in joint position), integrating offset, motion, and static posture information, has also been used to identify human behavior. Skeleton-based techniques are limited by errors in skeleton estimation, and skeleton information is not always available in practice.
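To make the motion-template idea concrete, the following minimal sketch computes an MHI from a list of grayscale frames; the frame-difference threshold and temporal window tau are illustrative assumptions, not parameters taken from [5]:

```python
import numpy as np

def motion_history_image(frames, tau=30, threshold=25):
    """Minimal motion-history image (MHI): pixels with recent motion
    are bright, and older motion decays linearly over tau frames."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
        moving = diff > threshold                  # assumed motion test
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # normalized to [0, 1]; the MEI is simply mhi > 0
```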

Several techniques extract spatiotemporal information from the complete set of points in a depth map sequence to discriminate between actions [6]. An action graph over a bag of 3D points has been used to describe body postures and the dynamics of actions; however, the 3D point sampling produced a large amount of data, necessitating a time-consuming training phase. To efficiently capture both body shape and motion information for distinguishing actions, histograms of oriented gradients (HOG) computed on depth motion maps have been used. Random occupancy pattern (ROP) features were extracted from depth images with a weighted sampling strategy and shown to be robust to occlusion, with a sparse coding scheme used to encode them effectively for action recognition. To preserve spatial and temporal context while managing intra-class variability, 4D patterns have also been used as features, with a simple classifier based on cosine distance applied for action recognition. Hybrid systems combining depth and skeleton data have likewise been employed: 3D joint positions and local occupancy patterns were used as features to characterize each action, and an actionlet ensemble model was learned to account for intra-class variance.
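As an illustration of the HOG-on-depth idea mentioned above, the sketch below extracts a HOG descriptor from a precomputed depth motion map (see Sect. 3), assuming scikit-image is available; the cell, block, and bin settings are illustrative defaults rather than values from the cited work:

```python
import numpy as np
from skimage.feature import hog

def dmm_hog_descriptor(dmm, bins=9, cell=(8, 8), block=(2, 2)):
    """HOG descriptor of a depth motion map, normalized to [0, 1]
    first so gradient magnitudes are comparable across sequences."""
    dmm = dmm.astype(np.float32)
    if dmm.max() > 0:
        dmm = dmm / dmm.max()
    return hog(dmm, orientations=bins, pixels_per_cell=cell,
               cells_per_block=block, block_norm='L2-Hys')
```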

3 Depth Motion Maps as Features

A depth map records 3D structure and shape information. Alemayoh et al. [7] suggested characterizing the motion of an action by projecting the depth images onto three orthogonal Cartesian planes. Because it is computationally straightforward, the same strategy is used in this work, with a modified method for obtaining the DMMs. In more detail, each 3D depth frame is used to create three 2D projected maps, denoted map_v, corresponding to the front, side, and top views,

$${\text{map}}_{v} ,\quad v \in \left\{ {f,s,t} \right\}$$
(1)

For a point (x, y, z) in a depth frame, where z denotes the depth value in the orthogonal coordinate system, the pixel values in the three projected maps map_f, map_s, and map_t are z, x, and y, respectively.
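A minimal sketch of this projection step is given below; it assumes depth values in millimeters, a fixed depth range max_depth, and the axis conventions stated above (pixel values z, x, and y for the front, side, and top maps, respectively):

```python
import numpy as np

def project_depth_frame(depth, max_depth=4000):
    """Project one (H, W) depth frame onto the three orthogonal
    planes, returning map_f, map_s, map_t of Eq. (1)."""
    h, w = depth.shape
    map_f = depth.astype(np.float32)                # front: value is z
    map_s = np.zeros((h, max_depth), np.float32)    # side: rows y, cols z
    map_t = np.zeros((max_depth, w), np.float32)    # top: rows z, cols x
    ys, xs = np.nonzero(depth)                      # foreground pixels
    zs = depth[ys, xs].astype(int).clip(0, max_depth - 1)
    map_s[ys, zs] = xs                              # side: value is x
    map_t[zs, xs] = ys                              # top: value is y
    return map_f, map_s, map_t
```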

Departing from earlier formulations, the absolute difference between two consecutive projected maps is computed without thresholding to obtain the motion energy of each projected map. For a depth video sequence of N frames, the depth motion map DMM_v is created by stacking the motion energies across the full sequence as follows:

$${\text{DMM}}_{v} = \sum\limits_{i = 2}^{N} {\left| {{\text{map}}_{v}^{i} - {\text{map}}_{v}^{i - 1} } \right|}$$
(2)

where i denotes the frame index.
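Equation (2) translates directly into the accumulation below, which reuses a per-frame projection function such as the one sketched after Eq. (1); this is a minimal illustration rather than the exact implementation used here:

```python
import numpy as np

def depth_motion_map(frames, project):
    """Accumulate the DMM of one view over a depth video, where
    `project` maps a depth frame to its 2D view, as in Eq. (2)."""
    maps = [project(f) for f in frames]
    dmm = np.zeros_like(maps[0])
    for prev, curr in zip(maps[:-1], maps[1:]):
        dmm += np.abs(curr - prev)        # motion energy between frames
    return dmm

# e.g., front-view DMM:
# dmm_f = depth_motion_map(frames, lambda f: project_depth_frame(f)[0])
```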

4 Human Activity Categorization

Over the past two decades, the categorization of human activities has remained a difficult problem in computer vision, and earlier studies on describing human behavior show considerable potential in this field. We first divide human action recognition techniques into two broad categories according to the type of sensor data they use: (i) unimodal and (ii) multimodal approaches. Each of these is then further divided into sub-categories according to how the activities are represented. As a result, we suggest a hierarchical classification of human activity recognition techniques, as shown in Figs. 3 and 4.

Fig. 3

Proposed hierarchical categorization of human activity recognition methods

Fig. 4
Representative frames of the main human action classes (diving, golf swing, kicking, lifting, riding horse, running, skateboarding, swing-bench, swing-side, and walking) for various datasets [8]

5 Unimodal-Based Methods

Unimodal human action recognition methods identify human activities using data from a single modality. Most current methods represent human activity as a set of visual features extracted from still images or video and classify the underlying activity label using various classification models. Unimodal techniques are appropriate for identifying human activities from motion features. On the other hand, it can be difficult to identify the underlying class from motion alone. The biggest challenge is maintaining the continuity of motion over the duration of an action, which may unfold uniformly or unevenly across a video sequence. Some approaches employ short-term motion velocities; others track optical-flow features to exploit the full length of the motion curves.
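As an example of the optical-flow cue mentioned above, the following sketch computes dense flow between two consecutive grayscale frames, assuming OpenCV is available; the Farneback parameters are illustrative defaults:

```python
import cv2
import numpy as np

def dense_flow(prev_gray, curr_gray):
    """Per-pixel (dx, dy) motion via Farneback's dense optical flow,
    plus the flow magnitude as a simple motion-energy map."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow, np.linalg.norm(flow, axis=2)
```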

We categorize unimodal methods into four basic classes: (i) space-time, (ii) stochastic, (iii) rule-based, and (iv) shape-based methods. Each of these sub-categories describes particular characteristics of human activity recognition strategies, depending on the type of representation each approach employs (Fig. 5).

Fig. 5
Representative stochastic approaches for action recognition [9]: (a) a graphical model relating image, hidden parts, and class label; (b) a hierarchical model with latent variables, super observations, and recursion

6 Multimodal-Based Methods

Multimodal activity recognition techniques have received a lot of interest lately. An event can be defined by a variety of components that offer additional, complementary information. A number of multimodal strategies build on this observation, and feature fusion can be performed either as early fusion or as late fusion. The simplest method is to concatenate the individual feature sets into a larger feature vector and learn the underlying action from the combined features. Although the resulting feature vector has a significantly larger dimension, this fusion strategy can improve recognition performance.
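A minimal sketch of this early (feature-level) fusion is shown below; the per-modality L2 normalization is an illustrative choice to keep any single modality from dominating, not a step prescribed above:

```python
import numpy as np

def early_fusion(feature_vectors):
    """Concatenate per-modality feature vectors (e.g., depth HOG,
    skeleton joints, audio) into one larger descriptor."""
    normalized = []
    for v in feature_vectors:
        v = np.asarray(v, dtype=np.float32).ravel()
        n = np.linalg.norm(v)
        normalized.append(v / n if n > 0 else v)
    return np.concatenate(normalized)
```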

Since multimodal cues are typically correlated in time, the temporal relationship between the underlying activity and the various modalities is crucial for understanding the data. In that setting, audiovisual analysis serves purposes beyond synchronizing audio and video, such as monitoring and activity identification. Three groups of multimodal techniques are distinguished: (i) affective techniques, (ii) behavioral techniques, and (iii) social networking-based techniques. Affective multimodal approaches define atomic actions or interactions that may relate to the affective states of a communicator's counterpart and depend on emotions and/or physical movements (Fig. 6).

Fig. 6
Flow chart of multimodal emotion recognition, covering pre-processing, segmentation, feature extraction, and continuous prediction [9]

7 Performance of Collected Dataset

The dataset exhibits high intra-class variability and high inter-class similarity. The resulting values are shown in Table 1.

Table 1 Precision and recall of the tested data

In Tables 1 and 2, we report the precision and recall of the tested data, comparing some data [10] on precision, recall, and accuracy with the latest relevant data. We also categorized age into ranges, from 1–10 up to a final range of 40–50, for monitoring human activity. Results are better for ages 25–40, i.e., middle age. This dataset can be used in further studies that consider image [11,12,13,14] or pattern [15,16,17,18,19] recognition.

Table 2 Scaling with tested data and random data
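For reference, the precision and recall values reported in Tables 1 and 2 follow the standard definitions; the counts in the usage example below are purely hypothetical:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for a single activity class:
p, r = precision_recall(tp=42, fp=6, fn=9)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.88, recall=0.82
```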

8 Conclusion

Human activity recognition enables real-time predictive models, so in this paper we conducted a thorough analysis of contemporary techniques for identifying human activity and developed a hierarchical taxonomy for grouping these techniques. We surveyed many methodologies and, according to the channel of origin of the data they use, divided them into two major categories (unimodal and multimodal). The motion properties of an action sequence were captured using depth motion maps created from three projection views. In future work, motion monitoring, image classification, and video classification may prove useful for exascale computing with fast computing techniques.