1 Introduction

Human action recognition is a widely studied area [1] with applications in e-health [2], video surveillance systems [3], and military situational awareness [4], among others. Despite significant research effort over the past few decades, action recognition remains a challenging problem because of the dynamic, embodied nature of human movement.

Skeletal joints are known to be useful for understanding actions [5]. Modern depth sensors, combined with real-time skeleton estimation algorithms, have made action detection considerably easier: the locations of skeletal joints within a frame, and their motion across frames, can be used to recognize actions. These developments have led to renewed interest in skeleton-based human action recognition.

Many existing approaches to skeleton-based action recognition use individual joints, or relations between joints, to identify the specifics of an activity [9,10,11,12,13,14,15,16,17,18,19,20,21,22]. These methods do not generalize well because they fail to capture the fine-grained relationships across the various body joints that reflect the dynamic spatial variations of human behavior.

Two-dimensional skeletal body joints are used as the input of the proposed method, so the action recognition task is accomplished with a minimal amount of data. First, a super-joint is identified on the body, which is an essential reference for detecting behavior. Features are then computed with respect to the super-joint using standard normal, slope, and parameter space functions. In particular, the standard normal and slope capture pairwise relations between joints, while the parameter space features capture their combined details. Together, these form an action descriptor called SNSP (Super-joint, standard Normal, Slope, and Parameter space features). The resulting representation captures the complex spatial details of all skeleton joint points. The proposed SNSP is evaluated on three widely used, publicly available action datasets: the UTD Multimodal Human Action Dataset [6], the KARD Kinect Activity Recognition Dataset [7], and the SBU Kinect Interaction Dataset [8]. Classification is performed using the K-nearest neighbor classifier [9]. SNSP outperforms the aforementioned approaches because it captures detailed and necessary spatial information between all skeleton joints. The contributions of this paper are as follows:

  • All necessary spatial details of the body joints are captured using only 2D joint coordinates.

  • A super-joint is introduced as a reference for tracking the behavior of the remaining joints.

  • Both pairwise and combined spatial information is acquired using SNSP.

The remainder of this article is organized as follows: Sect. 2 reviews related human action recognition approaches. Section 3 describes the proposed SNSP descriptor, Sect. 4 presents the experimental setup, and Sect. 5 evaluates the proposed descriptor. Results are discussed in Sect. 6, and Sect. 7 concludes the paper.

2 Related Work

Several valuable surveys compare skeleton-based approaches to human action recognition, and many researchers have addressed the task with a variety of methods. This section discusses state-of-the-art techniques based on skeletal joint points [5, 9,10,11,12,13,14,15,16,17,18,19,20,21,22] as well as other related approaches [23,24,25,26,27,28,29,30,31].

Skeletal joints are used frequently in the literature for action recognition. Traditional skeleton-based HAR methods include: a graph-based approach for 3D human skeletal action recognition [5]; recognition of actions using 3D posture data and hidden Markov models [9]; skeleton-based human action recognition with global context-aware attention LSTM networks [10]; and capturing the micro-temporal relations between joints using a CNN and then exploiting their macro-temporal connections by computing the Fourier temporal pyramid [11]. A human activity recognition system using skeleton data from RGB-D sensors is proposed in [12]. Enhanced trajectories are proposed using a codebook and local bags of words [13], co-occurrence features are obtained using a deep LSTM network [14], and an end-to-end spatial-temporal attention model is presented in [15]. Global context-aware attention using LSTM is presented in [16], and pose-conditioned spatio-temporal attention in [17]. In [18], the authors performed action recognition using a skeleton net and deep learning. Temporal information is obtained from frames using a deep CNN and multi-task learning networks [19]. Action recognition has also been performed using a hand gesture recognition system [20], CNN-based joint distance maps [21], and joint trajectory maps [22].

Action recognition has also been performed using a two-fold transformation model [1], hexagonal volume local binary patterns [2], and deep learning with wearable sensors [23]. The S2DDI model was developed using depth sequences and a CNN, collecting motion from regional to fine-grained levels [24]. Activity is recognized using the image history technique suggested in [25]. Spatial rank pooling was developed to aggregate heatmap evolution as an image of body shape [26]. In [27], spatial-temporal activations created by a stack of pose estimation layers are reprojected in space and time using a stack of three-dimensional convolutions learned directly from the data. Collaborative body-part contrast mining is used for human interaction recognition [28]. In [29], each depth frame is projected onto three Cartesian orthogonal planes in a vague boundary sequence, and the absolute differences between consecutive projected maps are accumulated. Human action recognition has been performed through silhouettes and silhouette skeletons [30], and through geometrical patterns and optical flow [31]. Yu et al. performed image recognition using click prediction for web image reranking with multimodal sparse coding [32], learning to rank with user clicks and visual features [33], and hierarchical deep click feature prediction [34]. Recently developed methods for human action recognition include a hierarchical learning approach [35], an uncovering deep learning approach [36], a positioning sensor and deep learning technique [37], a robust framework for abnormal-act detection [38], a Kinect sensor-based interaction monitoring system using a BLSTM neural network [39], a skeletal data-based system [40], compressive sensing-based recognition of human upper-limb motions with Kinect skeletal data [41], a unified deep framework for joint 3D pose estimation and action recognition [42], global co-occurrence feature learning with an active coordinate system [43], a multi-stream spatio-temporal fusion network [44], and an image representation of pose-transition features for 3D skeleton-based action recognition [45].

3 Proposed Action Descriptor

The goal is to perform human action recognition with as little data as possible, so two-dimensional skeletal joints are used. Our feature vector is constructed so that joints exhibiting similar movement characteristics are grouped together, yielding a more discriminative video representation for action recognition. This is accomplished using the information provided by the human pose: skeleton joints are linked together to form the SNSP descriptor. Following these considerations, we propose the action descriptor described below.

3.1 Super-Joint

The center of gravity is the point at which the weight of a body or system may be considered to act [46]. For skeleton-based action recognition, the analogous reference point on the human body is the neck. When identifying an action sequence, it is important to observe the neck joint because it serves as a frame of reference for the other joints. Its normal range of motion is 40 to 80 degrees [47], making the neck one of the least mobile parts of the body. We call this unique and vital joint the super-joint; all other joints change position relative to it, much as the planets move around the sun. The super-joint is denoted \(j_{n} \left( {x_{n} , y_{n} } \right)\) and is shown as the red joint in Fig. 1. The features described in Sects. 3.2, 3.3, and 3.4 are all computed with respect to the super-joint.

Fig. 1 Super-joint detected on the human body
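To make the reference concrete, a hypothetical sketch of selecting the super-joint is shown below; the joint indices are placeholders of our own and must be taken from each dataset's documented ordering (see Figs. 5, 6, and 7), not from this snippet.

```python
import numpy as np

# Hypothetical 0-based neck indices; the true orderings are dataset-specific
# and are given in the datasets' documentation.
SUPER_JOINT_INDEX = {"UTD-MHAD": 2, "KARD": 1, "SBU": 2}

def super_joint(joints, dataset):
    """Return the super-joint j_n = (x_n, y_n) from a (J, 2) joint array."""
    return joints[SUPER_JOINT_INDEX[dataset]]
```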

3.2 Standard Normal Feature

The standard normal distribution is a bell-shaped curve with a mean of zero and a standard deviation of one. It is centered at zero, and the degree to which a given measurement deviates from the mean is expressed in units of the standard deviation, as shown in Eq. 1.

$$ P\left( x \right) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2} $$
(1)

We modify the standard normal distribution formula so that it encodes a pairwise feature between two joints. The transformed expression extracts the coordinate differences between the super-joint and any other joint. Suppose we have a particular \(i\)th joint \(j_{i} \left( {x_{i} , y_{i} } \right)\) and wish to compute the standard normal feature between \(j_{n}\) and \(j_{i}\). Equation 2 describes this association between \(j_{n}\) and \(j_{i}\), denoted \(P\left( {x_{n} ,x_{i} , y_{n} ,y_{i} } \right)\).

$$ P\left( x_{n}, x_{i}, y_{n}, y_{i} \right) = \frac{1}{\sqrt{2\pi}}\, e^{-\left( (x_{i} - x_{n})^{2} + (y_{i} - y_{n})^{2} \right)/2} $$
(2)
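For illustration, a minimal Python sketch of Eq. 2 follows (the paper provides no code; the function name and coordinate conventions are ours):

```python
import numpy as np

def standard_normal_feature(j_n, j_i):
    """Standard normal feature between super-joint j_n and joint j_i (Eq. 2).

    j_n, j_i: (x, y) coordinate pairs of the super-joint and the i-th joint.
    """
    dx = j_i[0] - j_n[0]  # x_i - x_n
    dy = j_i[1] - j_n[1]  # y_i - y_n
    return np.exp(-(dx**2 + dy**2) / 2.0) / np.sqrt(2.0 * np.pi)
```

Note that the value decays toward zero as \(j_{i}\) moves away from the super-joint, so joints near the super-joint produce larger responses.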

3.3 Slope Feature

The slope, or Cartesian coordinate angle \(\theta_{s}\), is calculated between the super-joint \(j_{n} \left( {x_{n} , y_{n} } \right)\) and a particular \(i\)th joint \(j_{i} \left( {x_{i} , y_{i} } \right)\); it captures the angle connecting the two joints. Measuring the slope is valuable because it serves both as a standalone feature and as an input for computing the parameter space features of Sect. 3.4. Equation 3 gives the slope between \(j_{n}\) and \(j_{i}\).

$$ \theta_{s} = \tan^{-1} \left( \frac{y_{i} - y_{n}}{x_{i} - x_{n}} \right) $$
(3)
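A corresponding sketch of Eq. 3 follows; we use arctan2 rather than a bare inverse tangent so the vertical case \(x_{i} = x_{n}\) does not divide by zero, which is an implementation choice of ours rather than part of the paper.

```python
import numpy as np

def slope_feature(j_n, j_i):
    """Slope angle theta_s between super-joint j_n and joint j_i (Eq. 3)."""
    # arctan2 also preserves the quadrant of j_i relative to j_n.
    return np.arctan2(j_i[1] - j_n[1], j_i[0] - j_n[0])
```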

3.4 Parameter Space Features

Parameter space features are used to compute the individual components of \(j_{n}\) and \(j_{i}\). The slope angle helps incorporate mutual information between the body joints. The joints \(j_{n} \left( {x_{n} , y_{n} } \right)\) and \(j_{i} \left( {x_{i} , y_{i} } \right)\) are transformed into parameter space using Eqs. 4 and 5, which yield \(\rho_{n}\) and \(\rho_{i}\), respectively. Figure 2a shows the joints in the (x, y) plane and Fig. 2b their transformation into the (\(\rho , \theta\)) plane. The resulting spatial statistics are also incorporated into our feature vector.

$$ \rho_{n} = x_{n} \cos \theta_{s} + y_{n} \sin \theta_{s} $$
(4)
$$ \rho_{i} = x_{i} \cos \theta_{s} + y_{i} \sin \theta_{s} $$
(5)
Fig. 2 Body joints illustration on the parameter plane
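Equations 4 and 5 have the normal form of a line familiar from the Hough transform; a sketch under the same conventions as the snippets above (names ours):

```python
import numpy as np

def parameter_space_features(j_n, j_i):
    """Project j_n and j_i into the (rho, theta) parameter plane (Eqs. 4, 5)."""
    theta_s = np.arctan2(j_i[1] - j_n[1], j_i[0] - j_n[0])       # Eq. 3
    rho_n = j_n[0] * np.cos(theta_s) + j_n[1] * np.sin(theta_s)  # Eq. 4
    rho_i = j_i[0] * np.cos(theta_s) + j_i[1] * np.sin(theta_s)  # Eq. 5
    return rho_n, rho_i
```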

3.5 Action Descriptor: SNSP

The relationship between each individual body joint and the super-joint is drawn and called a correlation of skeletal joints and super-joint. The proposed correlations are central to collecting action details and are represented in Fig. 3. For the dataset representations with 20, 15, and 30 joints, eight (8), five (5), and six (6) correlations are shown in Fig. 3a–c, respectively; the number of correlations depends on the number of joints.

Fig. 3 Correlation of skeletal joints and super-joint

The proposed SNSP uses minimal joint information to perform the action recognition task: 2D skeletal joint points are used instead of 3D. For each correlation, the standard normal, slope, and parameter space features are calculated. In particular, the standard normal and slope capture pairwise relations, while the parameter space features capture the combined details of the joints. Concatenating all calculated features yields the action descriptor, termed SNSP. In this way, comprehensive spatial action information is obtained from the 2D skeletal joints; the action recognition algorithm is given in Table 1.
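Since Table 1 is not reproduced here, the following Python sketch shows one plausible per-frame assembly of the SNSP features, under our own assumptions about the super-joint index and the feature ordering:

```python
import numpy as np

def snsp_frame_descriptor(joints, n=2):
    """Per-frame SNSP features from a (J, 2) array of 2D joints.

    n is the super-joint (neck) index, a dataset-dependent assumption.
    """
    j_n = joints[n]
    features = []
    for i, j_i in enumerate(joints):
        if i == n:
            continue  # skip the super-joint itself
        dx, dy = j_i[0] - j_n[0], j_i[1] - j_n[1]
        p = np.exp(-(dx**2 + dy**2) / 2.0) / np.sqrt(2.0 * np.pi)    # Eq. 2
        theta_s = np.arctan2(dy, dx)                                 # Eq. 3
        rho_n = j_n[0] * np.cos(theta_s) + j_n[1] * np.sin(theta_s)  # Eq. 4
        rho_i = j_i[0] * np.cos(theta_s) + j_i[1] * np.sin(theta_s)  # Eq. 5
        features.extend([p, theta_s, rho_n, rho_i])
    return np.asarray(features)
```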

The SNSP algorithm computes the respective features at each frame; stacking the frame-by-frame features forms a feature matrix. Classification of the feature matrices is performed using the K-nearest neighbor classifier [37]. Figure 4 shows the system diagram of our proposed SNSP descriptor.
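As a sketch, classification could proceed as below. Since the K-nearest neighbor classifier expects fixed-length vectors, we assume each video's feature matrix is reduced to a single vector by temporal averaging; the paper does not specify this step, and the value of k is our choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_videos(train_matrices, y_train, test_matrices, k=3):
    """KNN over temporally averaged SNSP feature matrices (frames x features)."""
    X_train = np.stack([m.mean(axis=0) for m in train_matrices])
    X_test = np.stack([m.mean(axis=0) for m in test_matrices])
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    return knn.predict(X_test)
```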

Fig. 4 SNSP system diagram

4 Experimental Analysis

This section describes the system specifications used and provides information on the human action recognition datasets, including the joint configurations of the three datasets.

4.1 System Stipulations

A desktop computer was used to facilitate numerical processing; its specifications are as follows:

4.2 Dataset Joint's Orientations

The UTD Multimodal Human Action Dataset represents the human body with twenty joints [6]. Figure 5 shows the ordering of the twenty joints and their names.

Fig. 5 UTD Multimodal Human Action Dataset skeletal joints

The KARD Kinect Activity Recognition Dataset represents the human body with 15 joints [7]. Figure 6 shows the arrangement of these 15 joints.

Fig. 6 KARD Kinect Activity Recognition Dataset skeletal joints

The SBU Kinect Interaction Dataset contains thirty (30) skeletal joints per frame [8]: each of the two interacting persons is described by fifteen joints. The joints of person A are shown in Fig. 7a and those of person B in Fig. 7b.

Fig. 7 SBU Kinect Interaction Dataset skeletal joints

5 Evaluation

Our proposed SNSP is compared in terms of accuracy (%) with state-of-the-art approaches on the UTD Multimodal Human Action Dataset, the KARD Kinect Activity Recognition Dataset, and the SBU Kinect Interaction Dataset.

Figure 8 compares our proposed system with eleven existing methods on the UTD Multimodal Human Action Dataset. According to the reported results, the authors of [6] obtained the lowest accuracy among the selected methods at 79%, while the authors of [27] attained the highest at 90%. Our SNSP achieved 91.8% accuracy.

Fig. 8 SNSP evaluation against state-of-the-art approaches on the UTD Multimodal Human Action Dataset

Figure 9 compares the proposed SNSP with existing methodologies on the KARD Kinect Activity Recognition Dataset. According to the reported values, the authors of [11] obtained the highest accuracies at 96.3 and 97.41%, whereas the authors of [7] attained the lowest at 84.8 and 84.5% among all related methods. The proposed SNSP reached 97.8% on the KARD Kinect Activity Recognition Dataset, improving significantly on the previous best.

Fig. 9 SNSP evaluation against state-of-the-art approaches on the KARD Kinect Activity Recognition Dataset

Figure 10 compares our proposed method with 16 existing methods on the SBU Kinect Interaction Dataset. Among these, the authors of [8] achieved the lowest accuracy, while our previously proposed approach, Huynh et al. [45], attained the highest at 99.3%. The proposed SNSP reached 99.9% on the SBU Kinect Interaction Dataset, surpassing the previous best accuracy.

Fig. 10 SNSP evaluation against state-of-the-art approaches on the SBU Kinect Interaction Dataset

6 Results and Discussion

The preceding section showed that SNSP outperforms state-of-the-art approaches on all three datasets. Here, the per-class results underlying Figs. 8, 9, and 10 are examined. The K-nearest neighbor classifier's predictions were used to produce a confusion matrix for each dataset. The confusion matrices are rendered with ten intensity levels: the lowest intensity indicates the least probability of error within a class, and the highest intensity indicates the greatest.
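A row-normalized confusion matrix of the kind shown in Figs. 11, 12, and 13 can be produced from the classifier's predictions; a minimal sketch (scikit-learn assumed, as above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def normalized_confusion(y_true, y_pred):
    """Row-normalize so each diagonal entry is that class's accuracy."""
    cm = confusion_matrix(y_true, y_pred).astype(float)
    return cm / cm.sum(axis=1, keepdims=True)
```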

The UTD Multimodal Human Action Dataset consists of twenty-seven action sequences [6]. The per-class performance is shown as a confusion matrix in Fig. 11. The proposed action descriptor produced excellent within-class results: the lowest accuracy, 76.6%, occurs in the "Tennis right-hand forehand swing" class, while the highest, 97.9%, occurs in the "Right hand pick up and throw" class. With SNSP, the average accuracy across classes is 91.8%.

Fig. 11 SNSP results on the UTD Multimodal Human Action Dataset

Figure 12 shows the confusion matrix for the KARD Kinect Activity Recognition Dataset, which includes 18 separate activities. Using the K-nearest neighbor classifier, the highest numbers of correct decisions occur in the "Take Umbrella" and "Stand up" classes, while the lowest accuracy, 91.1%, occurs in the "Two hand wave" activity.

Fig. 12 SNSP results on the KARD Kinect Activity Recognition Dataset

On the SBU Kinect Interaction Dataset, which includes eight classes, SNSP reached 99.9%. Figure 13 shows the confusion matrix for the SNSP descriptor: the "Punching," "Pushing," and "Shaking Hands" classes reach 99.8%, and SNSP classified all other classes perfectly. The SNSP algorithm is given in Table 1, and a computational time analysis in Table 2.

Fig. 13 SNSP results on the SBU Kinect Interaction Dataset

Table 1 Proposed action descriptor, SNSP algorithm
Table 2 Computational time analysis

7 Conclusion

In this work, we have presented an innovative skeleton-based approach to 2D human skeleton action recognition. We developed a novel SNSP descriptor from skeleton joints: features are extracted using the super-joint together with the standard normal, slope, and parameter space features. Our experimental results show that the proposed SNSP descriptor outperforms several state-of-the-art systems on three different action datasets, i.e., the UTD Multimodal Human Action Dataset, the KARD Kinect Activity Recognition Dataset, and the SBU Kinect Interaction Dataset, attaining accuracies of 91.8%, 97.8%, and 99.9%, respectively.