1 Introduction

Human action recognition (HAR) has become a prominent and active research area in computer vision and image processing, covering the classification and recognition of normal and abnormal human activities of daily living. It concerns the automated recognition of human activity (normal and abnormal) in various application areas by analyzing sequences of observations. Nowadays, crowded places where normal and abnormal activities coexist are commonplace due to population growth, which increases the likelihood of suspicious activities. HAR has therefore become an essential part of the automatic interpretation of human–environment interaction in various online and offline applications such as autonomous driving, intelligent surveillance [1,2,3,4,5], smart-gadget analysis [6], object detection and tracking [7], video retrieval [8], and assisted daily living. Other HAR applications are closely coupled with daily activities, such as motion analysis [9,10,11,12], pose motion analysis [13, 14], health monitoring [15], classification or detection of actions or motions [16], and understanding human action behavior [17]. By recognizing and analyzing human actions from videos, one can clearly distinguish between normal and abnormal behaviors, which can bring significant improvements in public safety. Nevertheless, HAR remains a challenging problem owing to cluttered backgrounds, slight inter-class separation, and wide intra-class variation. The key to achieving high accuracy and efficiency is to capture both the static appearance within each frame of a video and the temporal relationships across the multiple frames of the video. Applications such as monitoring for suspicious behavior and early reporting for fall detection are also considered within human activity recognition. Several techniques exist for representing human action based on motion, such as RGB-based videos [18,19,20], RGBD-based videos [21,22,23,24], and skeleton-based videos [25,26,27,28,29,30]. Skeleton-based action recognition has become more prominent in recent times as it offers a focused and concise approach. The representation of human skeletons in videos typically involves a series of joint coordinates, which can be obtained through pose estimators or action prediction methods. By focusing solely on the action poses and disregarding contextual factors like background variations and lighting changes, skeleton sequences provide a compact and robust way to capture action information. Compared with the other modalities, skeleton-based methods (i) represent human motion via the 3D coordinate positions of key body points and (ii) are more robust to problems such as background clutter, observation viewpoints, and illumination or intensity conditions. These advantages of skeleton motion sequences motivate researchers to develop new techniques for extracting informative features for human action recognition. These methods are gaining importance in HAR since skeletons are a compact data form that captures the dynamics of human body movements [31]. Regarding annotation, i.e., capturing effective motion-action representations from large numbers of unlabeled skeleton samples, manual annotation proves to be expensive and challenging, and remains an under-explored area. These days, sensors are commonly used for data collection owing to their low cost and high mobility.
Some approaches are used for tracking and estimating the skeleton keyjoints with features invariant to human key body points, observation point, camera viewpoint, and so on. The skeletal features within the human body are responsible for recognizing all the normal and abnormal activities. Besides this, they have also been used for evaluating certain activities (such as falling, or discriminating between jogging and running) through the variation in keyjoint coordinates: the center of mass of the body, acceleration, velocity, and, for movement, the angles between keyjoint points within the skeleton. Newer methods such as ST-LSTM and ST-GCN (spatiotemporal graph convolutional network) are also used to extract these features. The movement of body parts and the execution of various actions are made possible by the human skeletal system. In terms of data modality, skeleton-based information aligns with the structure of the human anatomy, which enhances the interpretability of ConvLSTM learning. This modality specifically focuses on the 3D coordinates of keyjoints in the human body. By analyzing these skeleton sequences, the model is capable of recognizing and understanding human movements. Another advantage of the skeleton modality is its emphasis on privacy, as it is considered more privacy-friendly than other modalities.

In this work, the prime objective is to efficiently combine the important cues of CNNs (convolutional neural networks) and LSTMs on spatiotemporal data with skeleton-based recognition approaches, referred to as ST-LSTM. Here, a set of extracted skeleton features, together with the skeletal keyjoints, is fed as input to the model. Skeletal tracking algorithms were used for detecting the keyjoints, followed by feature extraction from RGB frame data (extracted from videos) to improve the efficiency of the model. Standard features, such as the angle between keyjoint coordinates, velocity, acceleration, the position of the body's center of mass, and, for movement, the angles between the joint key points, have been extracted from the keyjoint coordinates of the human skeleton.

Once feature extraction and preprocessing are complete, the preprocessed data, consisting of 17 features extracted from the 25 skeleton coordinates, are fed to the model. The overall pipeline of the proposed ConvST-LSTM-Net model is illustrated in Fig. 1. The model exploits a spatiotemporal network consisting of CNNs, ST-LSTMs, and fully connected dense layers. The model first detects the skeleton keyjoints of the persons using the skeleton-based recognition method. These keyjoints are fed to the CNN layers, followed by ST-LSTMs, for the extraction of spatial–temporal features. The output from a hidden layer of the ST-LSTMs is then passed through a fully connected (FC) dense layer for classification.

Fig. 1

Process informatics pipeline of ConvST-LSTM-Net. First, the video frames are passed through the skeleton-based recognition method to extract the skeleton keyjoint coordinates. The obtained keyjoint coordinates are then fed to a modified ST-LSTM cell followed by ST-LSTM layers to evaluate the spatiotemporal features. The outputs are then passed to FC dense layers. Finally, SoftMax gives the framewise prediction scores of human action behaviors

The key contributions of the research work can be summarized as follows:

  1.

    A spatiotemporal ConvST-LSTM-Net model has been proposed that utilizes human body keyjoint coordinates from skeletal data obtained from RGB videos. The keyjoints are fed as input to CNN layers for extracting spatial–temporal features, followed by ST-LSTM, and the output is passed to the time-distributed FC dense layer.

  2.

    Motivated by the advances in CNN, ConvLSTM, and ST-LSTM, we have seamlessly combined the ideas of these models and integrated them to propose a new paradigm for skeleton-based action recognition, termed ConvST-LSTM-Net. The model directs attention toward improving efficacy by focusing only on the informative keyjoint coordinates.

  3.

    Among the 25 keyjoints, a set of 17 extracted skeleton features along with 21 skeleton keyjoint coordinates is fed to the model, since not all skeleton keyjoints are informative for recognizing the action classes.

  4.

    The proposed ConvST-LSTM-Net model shows better performance than existing models using different modalities over various benchmarks, viz. the NTU RGB + D dataset [32], UT-Kinect dataset [33], UP-Fall Detection dataset [34], UCF101 [35], and HMDB51 dataset [36].

The rest of this paper is organized as follows: Sect. 2 provides an overview of relevant studies in the literature. Section 3 briefly introduces the key terms and techniques along with the proposed methodology. Section 4 presents the experimental results and analysis. Finally, Sect. 5 discusses the conclusion and future perspectives.

2 Related work

This section briefly discusses work related to skeleton-based RNN, LSTM, and ST-LSTM approaches for human action recognition.

2.1 Skeleton-based action recognition

Earlier, conventional skeleton-based recognition in HAR relied on handcrafted features [37, 38], a well-known approach for describing and classifying image features. However, it fails to capture adequate semantic labeling information of the human body. Deep learning approaches such as convolutional neural networks (CNNs) [39,40,41,42,43], recurrent neural networks (RNNs) [44,45,46,47,48,49], and graph convolutional networks (GCNs) [50] have achieved remarkable performance in learning more informative features from skeleton sequences, which aids HAR. Many works have been introduced to achieve high performance on skeleton sequence modeling. Li et al. [41] introduced a convolutional co-occurrence feature learning framework that hierarchically aggregates contextual information at various levels. Vemulapalli et al. [37] designed a rolling map based on relative 3D rotations among different human body parts. Liu et al. [44] extended the RNN-based technique into a spatial–temporal model to revisit results based on action-related human performance. Zhu et al. [43] introduced a cuboid CNN for skeleton actions, ultimately characterizing a human's normal keyjoint movements. Zhang et al. [51] implemented a view-adaptive model to automatically regulate the viewing angle during motion and obtain different viewpoint observations of human actions. However, for skeleton sequences, these models fail to extract the temporal–spatial correlation structure and even fail to exploit the graphical aspects of the human body structure. Owing to the popularity of graph-based techniques, Yan et al. [52] introduced a GCN-based approach for skeleton-based HAR and then introduced the ST-GCN method for modeling the spatial and temporal dynamics of human skeleton keyjoints synchronously. Song et al. [53] worked on solving occlusion issues and implemented a multi-stream GCN for extracting qualified features from activated skeleton keyjoints in human action. Furthermore, they proposed a non-local technique [54] using a two-stream GCN approach (2s-AGCN) to improve recognition accuracy. Also, Shi et al. [55] worked on GCN feature fusion and proposed a multi-stream architecture at the output layer. Cheng et al. [56] worked on graph-based shift operations and used point-wise convolution layers to lower the computational complexity. Ye et al. [57] introduced novel work on DGCN (dynamic graph convolutional network), an approach for skeleton-based action recognition built on two-stream AGCN, which captures global dependencies and achieves excellent accuracy. Zhang et al. [58] also worked on GCN in a spatially attentive–temporally dilated network for feature extraction from skeleton frame sequences using distant spatial attention weights and temporal scales. In a two-stream network, Shi et al. [59] focused on a bone stream and a joint stream, although the two remain entirely independent of each other. Furthermore, directed graph neural networks (DGNNs) [60] and graph edge convolutional neural networks (GECNNs) [61] were introduced, which depict the relation among joints and bones in terms of action, but they fail to represent the various ways of combining features in the motion-action transmission field.

2.2 Skeleton-based action recognition using LSTM’s and RNN’s approaches

Currently, the deep learning literature focuses heavily on recurrent neural network (RNN)-based techniques, given their success in skeleton-based action detection. The basic principle of the proposed ConvST-LSTM network is derived from the ST-LSTM approach, i.e., a sequential fusion of CNNs followed by a spatiotemporal method with LSTM, which is itself an extension of RNNs. In this subsection, a brief survey of RNN and LSTM approaches is provided, since they are the basic building blocks of the proposed methodology. Veeriah et al. [62] worked on LSTM and introduced a differential gating method to capture the rate of information change. Du et al. [63] worked on the HRNN (hierarchical recurrent neural network) approach for depicting the skeleton structure of the human body along with the temporal dynamics of its keyjoint coordinates in 2D. For LSTM networks, Zhu et al. [64] implemented a mixed regularization technique for learning the co-occurrence of skeletal joint features; meanwhile, they introduced a training mechanism termed 'in-depth dropout.' Shahroudy et al. [32] used LSTM to learn long-term contextual representations of different body parts individually, termed the part-aware LSTM model. Liu et al. [44, 65] proposed a framework based on a 2D spatiotemporal LSTM covering both the temporal and spatial domains to explore the context information of the hidden input layers for the human body. For the 3D coordinates of skeletal joints, they proposed a 'trust-gate mechanism' [44] to handle imprecise 3D coordinates provided by depth sensors. Nowadays, skeleton-based RNN and LSTM approaches have also been adapted toward action forecasting and detection [66, 67].

3 ConvST-LSTM-Net: the proposed methodology

This section briefly describes the key terms and approaches used in the proposed ConvST-LSTM model, which is composed of three components, namely CNN, ST-LSTM, and ConvST-LSTM-Net. The proposed methodology is presented together with a review of CNNs, the construction of the skeleton body with its coordinates, and ST-LSTM, respectively.

Initially, for dataset preparation, the human body frames from raw RGB videos are used to train the network while tracking the 3D skeleton keyjoint coordinates. For preprocessing, a 3D joint normalization method is applied, which helps in drawing a bounding box around the tracked human body. Several features have been extracted for discriminating different activities, such as velocity motion (ν), acceleration motion (α), width (w), depth, height (H), and the angle (θ) between consecutive skeleton joints. After training, feature extraction is performed for the input human activity behavior. The following subsections elucidate the whole network architecture of the ConvST-LSTM-Net model.

3.1 Keypoint detection and preprocessing

In the preprocessing step for RGB videos, the frames are input into the ST-LSTM network to evaluate the keyjoint locations of the human body from the skeleton coordinates. Figure 2 shows the 25 skeleton keyjoint coordinates, which are tracked at each joint. Only 17 skeletal keyjoints are considered (since they are the informative keyjoints for specifying normal and abnormal human activities): the right knee, right hip, left knee, left hip, left foot, left ankle, right foot, right ankle, head, spine mid, left wrist, spine base, right shoulder, left shoulder, right wrist, right elbow, and left elbow. In each frame, the tracked human skeleton comprises the X, Y, Z coordinates of the body keyjoints. After obtaining the 3D skeletal coordinates, a normalization technique is applied to the 3D keyjoints to generate bounding boxes over the tracked human skeleton, which may vary with the person's movement in the video.

Fig. 2

The 25-skeleton keyjoints for the human body track detection and preprocessing
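As a concrete illustration of this preprocessing step, the sketch below selects the informative keyjoints and rescales each frame into the bounding box of the tracked body. The joint indices are hypothetical placeholders; the actual mapping depends on the skeleton tracker and follows the layout in Fig. 2.

```python
import numpy as np

# Hypothetical indices of the 17 informative keyjoints within the 25-joint layout
# (the real index mapping depends on the tracker used; see Fig. 2).
INFORMATIVE_JOINTS = [0, 1, 3, 4, 5, 6, 8, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21]

def select_and_normalize(skeleton):
    """skeleton: array of shape (num_frames, 25, 3) holding X, Y, Z per keyjoint.

    Keeps only the informative keyjoints and rescales each frame into the
    bounding box of the tracked body, as described above.
    """
    joints = skeleton[:, INFORMATIVE_JOINTS, :]            # (frames, 17, 3)
    box_min = joints.min(axis=1, keepdims=True)            # per-frame bounding box
    box_max = joints.max(axis=1, keepdims=True)
    extent = np.maximum(box_max - box_min, 1e-6)           # avoid division by zero
    return (joints - box_min) / extent                     # coordinates scaled to [0, 1]
```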

Afterward, a feedforward network based on multiple CNN layers followed by ST-LSTM takes as input the keyjoint coordinates from the video frames obtained through skeleton-based recognition. It learns the relationships among the body parts of individuals within the frames. Table 1 presents the details of the tracked skeletal keyjoints, the set of derived features, and the action classes. In this work, to normalize the data and aid the convergence of the loss function, the minimum–maximum normalization technique (min–max norm) has been used. If X denotes the training data, normalization is computed as:

Table 1 Detail of tracked skeletal keyjoint coordinates, derived features, and action class
$$X_{\mathrm{normalized}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
(1)
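A minimal sketch of Eq. (1) applied feature-wise, assuming the training data are arranged as a 2D array with one column per feature:

```python
import numpy as np

def min_max_normalize(X):
    """Apply Eq. (1) column-wise so every feature is scaled into [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / np.maximum(x_max - x_min, 1e-8)   # guard against constant features
```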

3.2 Construction & evaluation of feature vector: geometric & kinematic features

The skeleton keyjoint coordinates are used for constructing and calculating the feature vectors. The keyjoint coordinates of the human body, tracked for different human activities, determine these feature vectors, and different features are utilized for particular activities. The features and their evaluation are as follows:

$\angle\theta$ (angle between skeleton keyjoint coordinates):

Among the 25 keyjoint coordinates, we consider those coordinates that are connected via straight lines, and a skeleton structure of the tracked human body is drawn using these coordinates as shown in Fig. 3. Accordingly, only 10 keyjoints turn out to be the most relevant ones, viz. the left shoulder, spine mid, right shoulder, spine base, left knee, right knee, left hip, right hip, left ankle, and right ankle, and they are used for calculating the angle θ. Figure 3 illustrates the evaluation of the keyjoint angle between the left shoulder, neck, and mid hip. If A, B, C denote the coordinate differences, with $A_1 = x_1 - y_1$, $B_1 = x_2 - y_2$, and $C_1 = x_3 - y_3$, then θ can be evaluated as follows:

$$\theta = \frac{ABC}{AB \cdot BC}$$
(2)

where $ABC = A_1 A_2 + B_1 B_2 + C_1 C_2$

Fig. 3

Illustration for calculation of angle from the left side of the skeleton between left shoulder, neck, and mid-hip

$$AB = \sqrt{A_1^2 + B_1^2 + C_1^2} \quad \text{and} \quad BC = \sqrt{A_2^2 + B_2^2 + C_2^2}$$
(3)
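The sketch below is one plausible reading of Eqs. (2)–(3): the ratio of the dot product ABC to the segment lengths AB and BC is the cosine of the angle at the middle keyjoint, so the angle itself is recovered with an arccosine. The argument names are illustrative only.

```python
import numpy as np

def joint_angle(p_outer, p_middle, p_inner):
    """Angle (in degrees) at the middle keyjoint, e.g., shoulder-neck-mid-hip (Fig. 3)."""
    a = np.asarray(p_outer) - np.asarray(p_middle)     # vector with components (A1, B1, C1)
    b = np.asarray(p_inner) - np.asarray(p_middle)     # vector with components (A2, B2, C2)
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Example: shoulder, neck, and mid-hip forming a right angle
print(joint_angle([1, 0, 0], [0, 0, 0], [0, 1, 0]))    # approximately 90.0
```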

Velocity motion estimation ($\nu$):

Velocity motion is calculated from the displacement of the tracked person's position between time frames t and t + 1 in the x, y, z dimensions. The velocity of the tracked person is given by:

$$\nu = \frac{d}{t}$$
(4)

where $d$ indicates the distance of the tracked person between frames and t indicates the frame time.

Acceleration motion estimation ($\alpha$):

Acceleration is the rate of change of velocity between consecutive frames in the x, y, z directions at time frame t. It is given by:

$$\alpha = \frac{\nu}{t}$$
(5)

where α is the acceleration of the tracked person's motion, $\nu$ indicates the velocity of the tracked person between frames, and t indicates the frame time.
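A minimal sketch of Eqs. (4)–(5) for a single tracked keyjoint, assuming a fixed frame rate (the actual frame time depends on the capture device):

```python
import numpy as np

def velocity_and_acceleration(positions, fps=30.0):
    """positions: array (num_frames, 3) holding the x, y, z of one keyjoint per frame."""
    dt = 1.0 / fps                                                       # assumed frame time t
    displacement = np.linalg.norm(np.diff(positions, axis=0), axis=1)    # distance d per frame pair
    velocity = displacement / dt                                          # Eq. (4)
    acceleration = np.diff(velocity) / dt                                 # Eq. (5)
    return velocity, acceleration
```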

Head–floor distance ($h_d$):

It measures the distance between the head keyjoint coordinate and the floor at the tracked person's location.

Head–depth distance:

Depth is the distance measured from the camera view to the nearest object. Head–depth is therefore calculated from the head keyjoint's coordinate in the z-dimension of the tracked person within the frame.

Width (w):

Width is defined as the absolute difference between the extreme right and extreme left keyjoint coordinates $(R_{j,\max} - L_{j,\max})$ of the tracked person. The extreme left keyjoint is estimated using the left elbow, left hip, left knee, left shoulder, left ankle, left foot, and head keyjoint values. In the same way, the extreme right keyjoint is computed using all the right-side keyjoint coordinates. Width is calculated as follows:

$$W = \left| R_{j,\max} - L_{j,\max} \right|$$
(6)

where $R_j$ indicates the right keyjoint coordinates and $L_j$ indicates the left keyjoint coordinates.

Height (H):

Height is the distance between the bottommost and topmost keyjoints of the body coordinates. The bottommost group includes keyjoint coordinates such as the left ankle, right ankle, left knee, right knee, left foot, and right foot, and the topmost group includes keyjoint coordinates such as the head, left elbow, and right elbow.

$$H = \left| T_j - B_j \right|$$
(7)

where $T_j$ indicates the top keyjoint coordinates and $B_j$ indicates the bottom keyjoint coordinates.
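As a rough sketch of Eqs. (6)–(7), the width and height of the tracked body can be read off the extreme keyjoints of one frame; the index groups below are hypothetical placeholders standing in for the left/right and top/bottom keyjoint sets listed above.

```python
import numpy as np

# Hypothetical index groups; the actual mapping follows Table 1 and Fig. 2.
LEFT_JOINTS = [4, 5, 6, 12, 13, 14, 3]     # left elbow, hip, knee, shoulder, ankle, foot, head
RIGHT_JOINTS = [8, 9, 10, 16, 17, 18, 3]   # right-side counterparts plus head
TOP_JOINTS = [3, 5, 9]                      # head and elbows
BOTTOM_JOINTS = [13, 14, 17, 18]            # ankles and feet

def width_and_height(frame_joints):
    """frame_joints: array (25, 3) of keyjoint coordinates for a single frame."""
    width = abs(frame_joints[RIGHT_JOINTS, 0].max() -
                frame_joints[LEFT_JOINTS, 0].min())       # Eq. (6): extreme right minus extreme left
    height = abs(frame_joints[TOP_JOINTS, 1].max() -
                 frame_joints[BOTTOM_JOINTS, 1].min())    # Eq. (7): topmost minus bottommost
    return width, height
```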

3.3 ConvST-LSTM: the proposed model

In this stage, the final preprocessed 3D keyjoint coordinates are input into the proposed deep learning network. We use a sequential fusion of CNNs, ConvLSTM, and ST-LSTM to build the ConvST-LSTM network.

3.3.1 Convolutional neural network architecture

Initially, human action recognition is performed by applying the CNN approach [68]. Consider $X_t^0 = \left[X_1, X_2, X_3, \dots, X_n\right]$ as the input vector, where n indicates the number of input samples; the output of a convolutional layer can be defined as follows:

$$C_i^{l,j} = \sigma\left(B_j + \sum_{m=1}^{M} W_m^j\, X_{i+m-1}^{0,j}\right)$$
(8)

where l corresponds to the index of the convolutional layer; σ denotes the nonlinear sigmoid activation function; $B_j$ represents the bias of the jth feature map; M denotes the CNN filter size; $W_m^j$ denotes the weight matrix of the jth feature map; and m is the filter index.
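As a worked example of Eq. (8), the sketch below computes one feature map of the 1D convolution with a sigmoid activation in plain NumPy (the names and the toy input are illustrative):

```python
import numpy as np

def conv_feature_map(x, weights, bias):
    """One feature map of Eq. (8): slide a size-M filter over x, add the bias, apply sigmoid.

    x:       input vector of length n
    weights: filter of length M (the W_m^j of the j-th feature map)
    bias:    scalar bias B_j
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    M = len(weights)
    return np.array([
        sigmoid(bias + np.dot(weights, x[i:i + M]))    # sum over m = 1..M
        for i in range(len(x) - M + 1)
    ])

# Example: a length-10 signal and a size-3 filter give 8 output positions
print(conv_feature_map(np.arange(10.0), np.array([0.2, 0.5, 0.3]), bias=-1.0).shape)
```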

The input frames of the proposed model consist of three input channels, namely sequences, keyjoints, and coordinates, which correspond to the x, y, and z directions, respectively. Each input frame has a size of 125 × 25 × 3 and contains information about the movement sequences, keyjoint positions, and spatial coordinates. In the convolutional layer, 6 filters are applied together with a configured kernel size, padding, and SoftMax functions in the hidden layer in order to avoid the vanishing gradient problem. Max-pooling is used as the pooling operation to take the maximum value of the feature map and reduce the processing time by lowering the dimensionality of the frame. The output from the hidden layer is then passed to FC dense layers. Finally, the SoftMax function gives the prediction scores of the action classes.

3.3.2 Spatiotemporal LSTM

Before moving to ST-LSTM, let us recap the LSTM [69], which consists of three gates controlling a memory cell and avoids the vanishing gradient issue. These are: (a) the forget gate, which decides how much information to pass through; (b) the input gate, which decides whether the current information is stored in the cell; and (c) the output gate, which contains a sigmoid-activated gate that decides which information to expose as output. Finally, a tanh layer is applied to the cell state and multiplied with the output of the output gate to obtain the final output.

The equations which define the activity of each cell can be formulated as follows:

$$i_t = \sigma\left(W_{X_i} X_t + W_{H_i} H_{t-1} + W_{C_i} C_{t-1} + B_i\right)$$
(9)
$$f_t = \sigma\left(W_{X_f} X_t + W_{H_f} H_{t-1} + W_{C_f} C_{t-1} + B_f\right)$$
(10)
$$o_t = \sigma\left(W_{X_o} X_t + W_{H_o} H_{t-1} + W_{C_o} C_t + B_o\right)$$
(11)
$$C_t = f_t\, C_{t-1} + i_t \tanh\left(W_{X_c} X_t + W_{H_c} H_{t-1} + B_c\right)$$
(12)
$$H_t = o_t \tanh\left(C_t\right)$$
(13)

Here, $W_i$, $W_f$, $W_o$ denote the weight matrices of the input (i), forget (f), and output (o) gates, respectively; $X_t$ is the input fed to the LSTM cell at time t; σ denotes the sigmoid activation function and tanh the hyperbolic tangent function (both nonlinear); $C_t$ denotes the memory cell state within the LSTM; and $B_i$, $B_f$, $B_o$, and $B_c$ denote the bias vectors of the input, forget, and output gates and of the memory cell c, respectively. The internal inputs of each cell in the ST-LSTM model are represented in Fig. 4. The skeletal keyjoints are arranged as a chain in the spatial direction, whereas for the temporal direction the corresponding keyjoints are input sequentially over consecutive frames. Specifically, each ST-LSTM cell is fed a new input ($x_{j,t}$), the 3D position of body keyjoint j at frame time t, the hidden state ($h_{j,t-1}$) of the same keyjoint j at the previous frame, and the hidden state ($h_{j-1,t}$) of the previous keyjoint j−1 in the same frame t, where j indicates the keyjoint index, j ∈ {1, …, J}, and t indicates the frame index, t ∈ {1, …, T}. An ST-LSTM unit cell consists of an input gate ($i_{j,t}$), two forget gates corresponding to the two sources of context information, i.e., the temporal dimension ($f_{j,t}^{T}$) and the spatial domain ($f_{j,t}^{S}$), together with an output gate ($o_{j,t}$).

Fig. 4

Illustration of the ST-LSTM cell [44]. In the spatial domain, the skeletal keyjoints in each frame are aligned and fed sequentially. In the temporal domain, the skeletal keyjoints are fed sequentially over the frames

The equations for ST-LSTM are formulated as introduced in [44]:

$$\begin{pmatrix} i_{j,t} \\ f_{j,t}^{(S)} \\ f_{j,t}^{(T)} \\ o_{j,t} \\ u_{j,t} \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( W \begin{pmatrix} x_{j,t} \\ h_{j-1,t} \\ h_{j,t-1} \end{pmatrix} \right)$$
(14)
$$c_{j,t} = i_{j,t} \odot u_{j,t} + f_{j,t}^{(S)} \odot c_{j-1,t} + f_{j,t}^{(T)} \odot c_{j,t-1}$$
(15)
$$h_{j,t} = o_{j,t} \odot \tanh\left(c_{j,t}\right)$$
(16)

where $c_{j,t}$ denotes the cell state; $h_{j,t}$ denotes the hidden state of the ST-LSTM unit at the spatiotemporal step for keyjoint j and frame time t; $u_{j,t}$ denotes the modulated input; ⊙ denotes the element-wise product for each unit; and W denotes an affine transformation with the corresponding weights.
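A minimal NumPy sketch of one ST-LSTM update following Eqs. (14)–(16); the stacked weight matrix W and the bias b (which Eq. (14) leaves implicit) are assumptions of this illustration:

```python
import numpy as np

def st_lstm_step(x_jt, h_prev_joint, h_prev_time, c_prev_joint, c_prev_time, W, b):
    """One ST-LSTM update for keyjoint j at frame t.

    x_jt:          input feature of keyjoint j at frame t, shape (d_in,)
    h_prev_joint:  hidden state of keyjoint j-1 in the same frame, shape (d_h,)
    h_prev_time:   hidden state of keyjoint j in the previous frame, shape (d_h,)
    c_prev_joint, c_prev_time: the corresponding cell states, shape (d_h,)
    W: stacked weights, shape (5 * d_h, d_in + 2 * d_h);  b: bias, shape (5 * d_h,)
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    d_h = h_prev_time.shape[0]
    z = W @ np.concatenate([x_jt, h_prev_joint, h_prev_time]) + b
    i, f_s, f_t, o = [sigmoid(z[k * d_h:(k + 1) * d_h]) for k in range(4)]   # gates of Eq. (14)
    u = np.tanh(z[4 * d_h:])                                                 # modulated input
    c = i * u + f_s * c_prev_joint + f_t * c_prev_time                       # Eq. (15)
    h = o * np.tanh(c)                                                       # Eq. (16)
    return h, c
```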

3.3.3 ConvST-LSTM-Net architecture

Several works [53, 54] have demonstrated that each action sequence has a subset of informative keyjoints. In contrast, some keyjoints may be irrelevant for recognizing the action classes. Therefore, to obtain high accuracy in human action recognition, the informative skeletal keyjoints are identified by focusing on their feature vectors. At the same time, to recognize human behavior, the model should preferentially concentrate on the informative keyjoints (the coordinates being fed) while ignoring the features of the irrelevant keyjoints.

This model is built as a sequential fusion of CNN, ST-LSTM (a combination of LSTM and spatiotemporal recognition), and FC layers. Here, CNNs are used for feature extraction, ST-LSTMs are used for sequence prediction over the spatiotemporal features, and the dense layers are used for mapping the features to classes. For classification, the outputs from the CNN's hidden layers are fed to the ST-LSTM layers, a GAP (global average pooling) layer is then used to flatten the data, and the result is passed to the FC layers of the model.

The transformation equations for ConvST-LSTM-Net can be given as:

$$F_{j,t}^{(T)} = \sigma\left(W_{X_F} X_{j,t} + W_{H_F} H_{j,t-1} + B_F\right)$$
(17)
$$F_{j,t}^{(S)} = \sigma\left(W_{X_F} X_{j,t} + W_{H_F} H_{j,t-1} + B_F\right)$$
(18)
$$\tilde{I}_{j,t} = \sigma\left(W_{X_{\tilde{I}}} X_{j,t} + W_{H_{\tilde{I}}} H_{j,t-1} + B_{\tilde{I}}\right)$$
(19)
$$\tilde{O}_{j,t} = \sigma\left(W_{X_{\tilde{O}}} X_t + W_{H_{\tilde{O}}} H_{t-1} + B_o\right)$$
(20)
$$C_{j,t} = f_{j,t}\, C_{t-1} + i_{j,t} \tanh\left(W_{X_c} X_t + W_{H_c} H_{j,t-1} + B_c\right)$$
(21)
$$u_{j,t} = \tanh\left(W_{X_u} X_t + W_{H_u} H_{t-1} + B_u\right)$$
(22)
$$H_t = o_t \odot \tanh\left(C_t\right)$$
(23)

where $X_{j,t}$, $C_{j,t}$, $H_{j,t}$, $F_{j,t}$, $\tilde{I}_{j,t}$ denote the input state, cell state, hidden state, forget gate, and input gate for keyjoint j at frame time t; $u_{j,t}$ is the input modulation gate and $\tilde{O}_{j,t}$ the output gate; and $C_t$ is the memory cell used for aggregating the state information controlled by the gates. Figure 5 depicts the ST-LSTM layer for each unit cell.

Fig. 5

The ConvST-LSTM network for ST-LSTM layer in each unit cell

3.3.3.1 Model training

ConvST-LSTM-Net was trained on the frame samples obtained from the videos, followed by keypoint recognition and the fusion of the spatiotemporal model consisting of ConvLSTMs. The model was trained for 150 epochs on a machine with an AMD Ryzen 7 5800H processor, 8 GB RAM, and an NVIDIA GeForce RTX 3050 Mobile GPU, with a learning rate of 0.001 obtained after repeated hyperparameter tuning.

For the setup, Keras API version 2.3 for Python along with TensorFlow version 2.3.0 has been used in the backend to build the spatiotemporal model. To increase the code's reusability and readability, some helper functions are first defined using the Python libraries, and optimal values are set for the user-defined hyperparameters such as input size, number of layers, iterations, epochs, batch size, and learning rate. The training data are fed to the model in batches of various sizes and trained over 150 epochs. In the first time-distributed CNN layer, we use 32 filters with kernel size 3, and its output is then regularized to attain faster convergence. Max-pooling is then added to reduce the computational cost. A dropout layer helps to avoid overfitting, with 50% of the weights dropped randomly. The next time-distributed CNN layers use different filter sizes; after performing feature extraction, we apply an additional dropout layer with a rate of 40%. In the third step, we use a GAP layer through which the output of the CNN layers is flattened to a 1 × 56 dimension.

Further, the ST-LSTM handles the sequential action data of the tracked person's keyjoint coordinates. The ST-LSTM layer's output is passed to the time-distributed FC dense layer. Finally, a SoftMax layer gives the framewise probabilities for each action class. The architecture of the proposed ConvST-LSTM model is illustrated in Fig. 6. The Adam optimizer is used to optimize the cost function, and gradient clipping is applied in the code. Hyperparameters such as the checkpoint path, saver function, epochs, iterations, filter size, kernel size, and the test and train splits have been set for training. Moreover, the proposed model uses an early stopping criterion with a patience of 50, i.e., training is terminated if there is no improvement in the monitored performance measure for 50 consecutive epochs. This helps to prevent the model from falling into a local optimum. It has been found that the performance is substantially improved by this sequential design.
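The sketch below assembles the described pipeline in Keras under stated assumptions: ST-LSTM is not a stock Keras layer, so a standard LSTM stands in for it here, and the 56-filter second convolution, the LSTM width, and the batch size are illustrative choices not fixed by the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

NUM_CLASSES = 5                              # e.g., stand, sit, run, walk, fall
FRAMES, JOINTS, CHANNELS = 125, 25, 3        # assuming frames x keyjoints x (x, y, z)

inputs = layers.Input(shape=(FRAMES, JOINTS, CHANNELS))

# First time-distributed CNN block: 32 filters, kernel size 3, regularized output.
x = layers.TimeDistributed(layers.Conv1D(32, 3, padding="same", activation="relu"))(inputs)
x = layers.TimeDistributed(layers.BatchNormalization())(x)
x = layers.TimeDistributed(layers.MaxPooling1D(2))(x)        # reduce computational cost
x = layers.Dropout(0.5)(x)                                    # 50% dropout against overfitting

# Further time-distributed CNN layers with a different filter size, then 40% dropout.
x = layers.TimeDistributed(layers.Conv1D(56, 3, padding="same", activation="relu"))(x)
x = layers.Dropout(0.4)(x)

# GAP flattens each frame's feature map to a 1 x 56 vector.
x = layers.TimeDistributed(layers.GlobalAveragePooling1D())(x)

# Stand-in for the ST-LSTM block: a recurrent layer over the frame sequence.
x = layers.LSTM(128, return_sequences=True)(x)

# Time-distributed FC dense layer; SoftMax gives framewise class probabilities.
outputs = layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax"))(x)

model = models.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Early stopping with a patience of 50 epochs, as described above.
early_stop = callbacks.EarlyStopping(monitor="val_accuracy", patience=50,
                                     restore_best_weights=True)
# model.fit(train_x, train_y, validation_data=(val_x, val_y),
#           epochs=150, batch_size=16, callbacks=[early_stop])
```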

Fig. 6

Block architectural diagram of the ConvST-LSTM-Net model for human action recognition. From left to right: the input frames clipped from videos; time-distributed convolutional layers including max-pooling; GAP; ST-LSTMs; and an FC dense layer followed by a SoftMax layer that outputs the action prediction

4 Experimental results and analysis

This section discusses the implementation details of the proposed model on various benchmarks along with its training hyperparameters. We have evaluated the performance of the ConvST-LSTM-Net model on five publicly available benchmarks, i.e., the NTU RGB + D 60 dataset [32], UT-Kinect dataset [33], UP-Fall Detection dataset [34], UCF101 [35], and HMDB51 dataset [36] for skeleton-based data.

4.1 Experiments on NTU RGB + D 60 dataset

The NTU RGB + D 60 dataset [32] is a publicly available dataset for human action recognition consisting of 56,880 samples spanning 60 activity classes collected from 40 subjects. The activities are classified into three categories: 40 daily living activities (drinking, standing, reading, hopping, etc.), 9 health-related activities (sneezing, staggering, falling, vomiting, etc.), and 11 mutual activities (punching, kicking, hugging, etc.), with multimodal information for daily action characterization including 3D skeletal keyjoints, RGB videos, masked depth maps, full depth maps, and infrared sequences. The annotations provide the 3D location (x, y, z) of each keyjoint in the camera coordinate system. There are 25 key points per subject, and each clip contains up to 2 subjects. The evaluation follows two protocols: Cross-Subject (CS) and Cross-View (CV).

For the experiments, we chose 5 action classes (stand, sit, run, walk, fall), each containing 150 clips. The two evaluation benchmarks are set as follows: (1) Cross-Subject (CS), with 400 clips from 5 subjects used for training and 100 clips for validation; and (2) Cross-View (CV), with 450 clips from 5 subjects used for training and 150 clips for validation. The proposed ConvST-LSTM-Net model surpasses the ConvLSTM network in [68] by 4.3% under the CS evaluation protocol and by 3.1% under the CV evaluation protocol. This demonstrates that spatiotemporal skeleton-based recognition approaches in LSTM networks bring significant improvement. The comparative results of the proposed ConvST-LSTM-Net model against state-of-the-art approaches are listed in Table 2.

Table 2 Experimental Results on NTU RGB + D 60 Dataset for skeletal sequence data

The trade-off curves for training accuracy and loss and validation accuracy and loss on the NTU RGB + D 60 dataset for its two evaluation protocols, CS and CV, are illustrated in Fig. 7. Training and validation accuracy increase over time, as depicted in Fig. 7a and c, and finally reach a steady-state value. Figure 7b and d depict the loss curves, illustrating the gradual decrease of the validation loss with increasing epochs. For testing, the weights from the epochs with the maximum validation accuracy are saved.

Fig. 7

Trade-off curves for model’s Training and Validation Accuracy versus Training and Validation Loss on the NTU RGB + D 60 benchmark dataset

4.2 Experiments on the UT-Kinect dataset

The UT-Kinect dataset [33] is publicly available and was captured with a single stationary Kinect; it comprises 10 subjects performing 10 action types (walk, stand up, pick up, carry, sit down, throw, push, pull, wave hands, clap hands). Each subject performs each action twice. Three channels were captured: (i) RGB, (ii) depth, and (iii) skeleton keyjoint locations. Only the frames in which the human body skeleton was tracked are recorded. To assess the proposed method on this dataset, the standard leave-one-out cross-validation protocol has been followed. Table 3 provides the comparative results of the proposed ConvST-LSTM-Net model against state-of-the-art approaches. The trade-off curves for the model accuracy and loss on the UT-Kinect dataset are illustrated in Fig. 8. It is observed from these curves that the proposed methodology offers excellent accuracy during training and moderate accuracy during validation. The model incurs low loss during training and moderate loss during validation.

Table 3 Experimental Results on UT-Kinect
Fig. 8

Trade-off curves for Model Training and Validation Accuracy & Model Training and Validation Loss on the UT-Kinect dataset

4.3 Experiments on the UP-fall detection dataset

The UP-Fall Detection dataset [34] is a large-scale multimodal dataset collected using vision, wearable, and ambient sensors. It includes activities of daily living (ADLs, 850 GB) collected from 17 healthy subjects (9 male and 8 female). It contains 11 actions in total: 6 basic daily living actions (walking, sitting, standing, picking up an object, laying, jumping) and 5 fall actions (falling forward onto the knees, falling forward onto the hands, falling while sitting onto an empty chair, falling backward, and falling sideways). Two cameras were set up to capture the subject's front and side views. A total of 589,418 sample image frames were taken from both cameras, and the total size of this vision dataset is 277 GB. For the experiments, we chose 5 action classes (stand, sit, run, walk, fall), each containing 1000 clips, of which 800 clips are used for training and 200 clips for validation. Table 4 gives the comparative results of the proposed ConvST-LSTM-Net against various state-of-the-art methods.

Table 4 Experimental Results on UP-Fall Detection Dataset

Figure 9 illustrates the trade-off curves for (a) training and validation accuracy and (b) training and validation loss. It is observed from these curves that the proposed methodology offers excellent accuracy during training and moderate accuracy during validation. The model incurs low loss during training and moderate loss during validation.

Fig. 9

Trade-off curves for Model Training and Validation Accuracy versus Model Training and Validation Loss on UP-Fall Detection dataset

4.4 Experiments on the UCF101 dataset

The UCF101 dataset [35] is a popular action recognition dataset that contains 13,320 video clips from 101 action categories. The action videos are clustered into 25 groups, where each group contains 4–7 videos of an action. The action categories can be classified into five distinct types: (a) human–object interaction, (b) body motion, (c) human–human interaction, (d) playing musical instruments, and (e) sports. For the experiments, we chose 5 action classes from the body-motion categories, comprising a total of 17 body-motion clips. Table 5 gives the comparative results of the proposed ConvST-LSTM-Net against various state-of-the-art methods.

Table 5 Experimental Results of ConvST-LSTM on the UCF101 Dataset

The trade-off curves for the model accuracy and loss on the UCF101 dataset are illustrated in Fig. 10, i.e., (a) training and validation accuracy versus (b) training and validation loss. The model incurs low loss during training and moderate loss during validation.

Fig. 10

Trade-off curves for Model Training and Validation Accuracy versus Model Training and Validation Loss on UCF101 Dataset

4.5 Experiments on the HMDB51 dataset

The HMDB51 dataset [36] is a commonly used benchmark for action recognition in videos, consisting of video clips from various sources such as movies and YouTube. It comprises about 2 GB of data with a total of 7000 clips distributed over 51 action classes. The action categories can be divided into five types: (a) general facial actions, (b) facial actions with object manipulation, (c) general body movements, (d) body movements with object interaction, and (e) body movements for human interaction. The video clips have varying durations and resolutions. For the experiments, we select the general body movements category, from which 5 action classes are taken (stand up, sit down, run, walk, fall). Each action class contains a minimum of 101 clips, of which 80% are used for training and 20% for validation. Table 6 gives the comparative results of the proposed ConvST-LSTM-Net against various methods.

Table 6 Experimental Results on the HMDB51 Dataset

Figure 11 illustrates the trade-off curves for (a) training and validation accuracy versus (b) training and validation loss on the HMDB51 dataset. From these, it is observed that the proposed methodology achieves outstanding accuracy during training and moderate accuracy during validation. The model exhibits low loss during training and moderate loss during validation.

Fig. 11

Trade-off curves for Model Training and Validation Accuracy versus Model Training and Validation Loss on the HMDB51 Dataset

4.6 Multimodal analysis over standard performance measures

This section discusses the results obtained with the proposed ConvST-LSTM-Net. The performance of the model has been measured with different metrics, i.e., precision, recall, F1-score, and accuracy. Figure 12 displays the accuracy, precision, recall, and F1-score on the various benchmarks. The accuracies and losses are plotted over 150 epochs. The proposed ConvST-LSTM-Net achieves better accuracy. The effectiveness of the proposed model is verified on various benchmarks, i.e., the NTU RGB + D 60, UT-Kinect, UP-Fall Detection, UCF101, and HMDB51 datasets, where the model outperforms state-of-the-art methods. Figure 13 illustrates the human action recognition results obtained on the different benchmark datasets, with bounding boxes framed over the tracked humans. We observe that the performance of the model is sufficiently high.
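A minimal sketch of how the reported measures can be computed with scikit-learn, assuming framewise class predictions and macro averaging over the action classes:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def report_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score over the predicted action labels."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1_score":  f1_score(y_true, y_pred, average="macro"),
    }

# Example with five framewise predictions
print(report_metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```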

Fig. 12

Comparative Stats of Standard Performance Measure over different datasets

Fig. 13

Illustration of the Human Action Recognition on various benchmarks. Starting from left–right a NTU RGB + D 60 Dataset: Sitting, Standing b UT-Kinect Dataset: Standing, Walking c UP-Fall Detection Dataset: Fall d UCF101 Dataset: Walking, Running e HMDB51 Dataset: Running

5 Conclusions and future prospective

Human action recognition has gained large prominence in today's era, but limitations remain in its application areas despite networks that can achieve good results. In this paper, we improved the internal cell structure of the ST-LSTM unit and proposed ConvST-LSTM-Net, which achieves high accuracy and reliability. The model is based on a spatiotemporal LSTM module, uses video frames and skeleton-based features, and has the robust capability of selecting the informative keyjoints in each frame while ignoring the irrelevant keyjoints of the skeleton sequence. The model is independent of camera orientation, clothing, background noise, etc., and can effectively recognize suspicious actions related to human activity. Finally, the experimental results show better performance and good accuracy for skeleton-based anomaly activity recognition. However, the consequences of a growing population and the rise of ever more challenging activities foster the need for a more promising predictive methodology for recognizing human behavior, one that offers a practical solution for the security and protection of people from daily risks. As a future perspective, graph-oriented spatiotemporal data could be used to represent humans and objects. Moreover, GCNs could also be used for the classification and detection of suspicious activity in human behavior.