1 Introduction

As a major research hotspot in the field of computer vision, action recognition has important research significance and broad application prospects in many fields, such as intelligent monitoring [1, 2], human–computer interaction [3, 4], and virtual reality [5]. Methods based on traditional handcrafted features [6, 7] struggle with human action recognition in complex scenes. With the great success of deep learning in image classification [8], applying deep learning to human action recognition has gradually become a development trend, although difficulties and challenges remain. There are two kinds of approaches to action recognition based on deep learning: 1) video recognition methods that extract and classify spatial-temporal features; 2) pose estimation methods that extract skeleton information for subsequent training. Since neural networks can learn features from data, and this form of learning is consistent with the way humans become aware of the world, semantic features learned by neural networks can also be used for action recognition.

Among the various deep learning models, the most common is the convolutional neural network (CNN) [8] used for image recognition tasks. Recurrent neural networks (RNN) [9] have been widely used in natural language processing (NLP) [10, 11] due to their superiority in time-series modeling. Wang et al. proposed a relatively simple method [12] that encodes joint trajectories (distances) and their dynamic information into a texture pattern [13], called joint trajectory maps (JTM). Donahue et al. put forward a network [14] that combines a CNN with an LSTM: the pre-processed depth image data is first fed into a purpose-designed CNN to obtain spatial features, and the optical flow information in the video data is fed into the LSTM to obtain temporal features. Finally, the spatial and temporal features are fused and mapped to categories by a Softmax layer. The representation of 3D human skeleton motion [15] has received more and more attention. Shao et al. proposed a hierarchical model [16] of body-part motion recognition, which decomposes a skeleton into multiple moving rigid bodies according to the motion characteristics of the human body; a rotation- and speed-invariant descriptor, RRV (Rotation and Relative Velocity), is proposed to represent the rotation and velocity of each rigid body in the skeleton, from which the motion representation is obtained.

In the case of skeleton-based action recognition, graph convolutional network (GCN) approaches [17,18,19,20] were proposed and gained attention due to their high performance. GCNs are often used for learning tasks such as graph classification [21,22,23], graph regression [24], and node classification [25,26,27]. According to the topological structure of the human body itself, we can construct graph-structured data in a human shape, treat action recognition as a graph classification task, and process it with graph convolution. Yan et al. proposed the spatial-temporal graph convolutional network (ST-GCN) [28], which was the first to apply graph convolution to 3D human action recognition. The model constructs not only a spatial graph according to the human topological structure but also a temporal graph along the time dimension, and then extracts the spatial-temporal features of human body movements by graph convolution. To reduce the redundant frames in skeleton videos and extract the most discriminative frames, the deep progressive reinforcement learning graph convolutional network (DPRL-GCN) model was proposed by Tang et al. [29]; the keyframes are then sent to a GCN for recognition and classification. To exploit the node associations within local blocks of the human body structure, Kalpit et al. proposed a part-based graph convolutional network (PB-GCN) [30] that divides the human skeleton graph into several subgraphs according to certain principles, performs the graph convolution operation on each subgraph separately, and finally fuses the results with a fusion function. To make full use of the potential connections in the skeleton graph topology, Ye et al. designed a joints relation inference network (JRIN) [31] to automatically explore the relationships between skeleton nodes; the resulting relation matrix is applied to the adjacency matrix of the original skeleton data to supplement the potential relationships between nodes of the original skeleton topology and better understand the human action. Subsequently, many GCN methods representing more appropriate spatial graphs have been proposed, and performance has improved dramatically.

The contributions of this paper are summarized as follows:

  • We propose a focus on temporal graph convolutional module (FTGCN) which can capture more temporal information and properly balance it for each action.

  • To better integrate channel, spatial, and temporal information, we propose a unified attention model over the channel, spatial, and temporal dimensions (CSTA).

  • Compared with 17 methods on NTU-RGB+D and 8 methods on Kinetics-Skeleton, our method achieves the best performance.

2 Related work

2.1 Skeleton-based action recognition

There are two lines of work in skeleton-based human action recognition: early handcraft-feature-based methods and the now-standard deep-learning-based methods. The accuracy of handcraft-based methods is unsatisfactory, so deep learning has become the dominant methodology in this field owing to its robustness and superior performance. There are basically three types of networks for recognizing human actions with deep learning: RNNs, CNNs, and GCNs. (1) RNN-based methods [32] represent the joint coordinates of the skeleton sequence as a vector sequence and feed it into the network. (2) CNN-based methods [33] convert the skeleton sequence into a corresponding 2D pseudo-image that is input into the network, similar to image classification. (3) GCN-based methods take the joints of the human body as vertices and the natural connections of the human body as edges to represent the skeleton sequence as a graph. At the same time, they take temporal information into account and are widely used due to their superior performance. The first work to use a GCN in this area was ST-GCN [28], which made great progress at the time. Spatially, the joints are connected according to the human body structure to form a spatial graph; temporally, the same joints in adjacent frames are connected. Together these spatial and temporal connections form a spatial-temporal graph that is fed into the network. However, the spatial graph of ST-GCN [28] is fixed, so it cannot model, for example, the relation between the two hands while clapping. Therefore, the two-stream adaptive graph convolutional network (2s-AGCN) [34] was proposed.

2.2 Graph convolutional networks

The essential purpose of GCN is to extract the spatial features of a topological graph. There are two main types of graph convolutional neural networks: one is based on the spatial (vertex) domain, the other on the frequency (spectral) domain [35]. The spatial methods directly define the convolution operation on the connection relationships of each node, which is more similar to the convolution in traditional convolutional neural networks. Different from the spatial perspective, the spectral methods use the eigenvalues and eigenvectors of the graph Laplacian matrix to study the properties of the graph.

3 Approach

In this section, we first introduce the basic background knowledge of this work. Then we describe in detail our proposed focus on temporal graph convolutional (FTGCN) module and the unified channel-spatial-temporal attention model (CSTA).

3.1 Graph convolution

We use G = (V, E) to describe the human structure graph, in which V is the set of human joints and E is the set of edges connecting these joints, that is, the human bones. The adjacency matrix of the human structure graph is defined as \(A \in \left\{0,1\right\}^{V\times V}\), where \(A_{i,j} = 0\) if the ith and the jth joints are unconnected and 1 otherwise. Let \(D \in R^{V\times V}\) be the diagonal degree matrix, where \(D_{i,i}= {\sum }_{j}A_{i,j}\); the elements on the diagonal are the degrees of the vertices in turn. We use x to represent the multidimensional coordinates of the joints of the human body structure. The adjacency matrix A can be used to aggregate the information of adjacent nodes. The graph convolution operation of each layer can be expressed as follows:

$$ x^{(l+1)}=\sigma \left( \hat{A}x^{l}W^{l}\right) $$
(1)

where \(\hat {A}=D^{-\frac {1}{2}}(A + I_{v})D^{-\frac {1}{2}}\) is the adjacency matrix A plus the self-connection matrix \(I_{v}\), normalized. \(x^{l}\) is the output tensor of layer l, and \(x^{0} = x\). \(W^{l}\) is the weight matrix learned during training. σ(⋅) is the activation function, used to increase the nonlinearity of the neural network. Following ST-GCN [28], we apply a three-partition strategy on the human skeleton graph, that is, the neighbor set is divided into three subsets. The first subset is the root node itself. The second subset contains the adjacent nodes closer to the center of gravity of the human skeleton than the root node. The third subset contains the adjacent nodes farther from the center of gravity of the human skeleton than the root node. In this way, A is accordingly split into the root-node set \(A_{root}\), the centripetal group \(A_{centripetal}\), and the centrifugal group \(A_{centrifugal}\), which corresponds to the movement of body parts. Then \({\sum }_{k=1}^{3}A_{k}=A\), where \(k \in \left \{root, centripetal, centrifugal \right \}\).
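For concreteness, a minimal NumPy sketch of the normalization \(\hat {A}=D^{-\frac {1}{2}}(A + I_{v})D^{-\frac {1}{2}}\) is given below; the 3-joint toy graph is only a placeholder for the real 25-joint skeleton, the function name is our own, and computing the degrees from \(A + I_{v}\) (rather than from A) is a common implementation convention.

```python
import numpy as np

def normalize_adjacency(A):
    """A_hat = D^{-1/2} (A + I_v) D^{-1/2}, following Eq. (1)."""
    v = A.shape[0]
    A_tilde = A + np.eye(v, dtype=A.dtype)   # add the self-connection matrix I_v
    deg = A_tilde.sum(axis=1)                # vertex degrees (here including self-loops)
    D_inv_sqrt = np.diag(deg ** -0.5)        # D^{-1/2}
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Toy 3-joint chain graph (placeholder for the real skeleton graph)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=np.float32)
A_hat = normalize_adjacency(A)               # used in x^{l+1} = sigma(A_hat x^l W^l)
```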

ST-GCN [28] is composed of 10 basic blocks. In order to alternately extract spatial and temporal information, each basic block is composed of a spatial GCN and a temporal GCN. In the spatial layer, the convolution operation on the human skeleton structure graph is:

$$ x_{out}=\sum\limits_{k=1}^{3}\hat{A}_{k}x_{in}W_{k} $$
(2)

where k indexes the three partitions mentioned above and each partition performs the same convolution operation. \(W_{k}\in R^{c_{in}{\times }c_{out}}\) is a weight matrix learned during training, \(x_{in}\in R^{v{\times }c_{in}}\) is the input feature of the spatial layer, and \(x_{out} \in R^{v{\times }c_{out}}\) is the output feature of the spatial layer. \(c_{in}\) is the channel dimension of the input feature and \(c_{out}\) is the channel dimension of the output feature. \(\hat {A}_{k} = {D_{k}}^{-\frac {1}{2}} (A_{k}+I_{vk}) {D_{k}}^{-\frac {1}{2}} \in R^{v{\times }v}\) is the normalized adjacency matrix of each partition. The adjacency matrix A in ST-GCN [28] only considers the natural connections of the human skeleton graph, but completing some actions requires the interaction of non-adjacent joints. To solve this problem, for instance so that the two non-adjacent hand joints can be connected in the clapping action, AGCN [34] was proposed, with the following formulation:

$$ x_{out}=\sum\limits_{k=1}^{3}(\hat{A}_{k}+B_{k}+C_{k})x_{in}W_{k} $$
(3)

where \(B_{k}\in R^{v{\times }v}\) is initially set as a V × V matrix whose parameters change continually during training. It serves as a supplement to the adjacency matrix A, so that different joints become connected for different actions during training. \(C_{k}\in R^{v{\times }v}\) is a data-dependent V × V matrix whose entries are normalized to values between 0 and 1. A non-zero value means that the two joints are connected, otherwise they are not. The larger the value, the stronger the connection between the two joints and the higher the correlation with the action.

For the temporal layer, only the same joint in consecutive frames is connected, so a standard convolution is performed along the time dimension, specifically with a \(K_{t} \times 1\) kernel, where \(K_{t}\) is the kernel size in the temporal dimension and is set to 9.
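To make the spatial and temporal layers concrete, the following PyTorch sketch implements the spatial graph convolution of (2) and the \(K_{t} \times 1\) temporal convolution; the class names, the stacked-partition layout of \(\hat {A}_{k}\), and the use of an einsum are our own illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

class SpatialGCN(nn.Module):
    """Spatial graph convolution of Eq. (2): x_out = sum_k A_hat_k x_in W_k."""
    def __init__(self, in_channels, out_channels, A_hat):
        super().__init__()
        # A_hat: tensor of shape (3, V, V) stacking the normalized partition matrices
        self.register_buffer('A_hat', A_hat)
        self.num_partitions = A_hat.size(0)
        # the W_k are realized as one 1x1 convolution producing K * out_channels maps
        self.conv = nn.Conv2d(in_channels, out_channels * self.num_partitions,
                              kernel_size=1)

    def forward(self, x):                       # x: (N, C_in, T, V)
        n, c, t, v = x.size()
        x = self.conv(x).view(n, self.num_partitions, -1, t, v)
        # aggregate the neighbours of each partition with A_hat_k, then sum over k
        return torch.einsum('nkctv,kvw->nctw', x, self.A_hat)   # (N, C_out, T, V)

class TemporalConv(nn.Module):
    """K_t x 1 convolution along the time axis (K_t = 9)."""
    def __init__(self, channels, kernel_size=9, stride=1):
        super().__init__()
        pad = (kernel_size - 1) // 2
        self.conv = nn.Conv2d(channels, channels, (kernel_size, 1),
                              stride=(stride, 1), padding=(pad, 0))

    def forward(self, x):                       # x: (N, C, T, V)
        return self.conv(x)
```

In AGCN [34], the fixed \(\hat {A}_{k}\) buffer would be replaced by \(\hat {A}_{k}+B_{k}+C_{k}\), with \(B_{k}\) a learnable parameter and \(C_{k}\) computed from the input as in (3).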

3.2 The architecture of the networks

In the early days, the input of this task was only joint information, but the human skeleton graph contains both joints and bones, which makes a two-stream network structure with joints and bones as inputs appropriate. Since the completion of an action is accompanied by the movement of both joints and bones, using the two kinds of information as inputs improves recognition performance. In our work, as in 2s-AGCN [34], joint information is the first-order information and bone information is the second-order information, and the two kinds of information are used as complementary inputs to improve recognition accuracy. The two-stream architecture is shown in Fig. 1; the two streams are trained independently. Given a sample, we first calculate the bone data from the joint data (see the sketch below). Then, the joint data and bone data are fed into the joint stream and bone stream, respectively. Finally, the softmax scores of the two streams are added to obtain the fused score and predict the action label. The details of a basic block (FTC-GCN) are listed in Table 1.
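A minimal sketch of the bone-stream data preparation and the score fusion described above; the bone pairs listed are only an illustrative subset of the dataset's skeleton topology, and `joint_model` / `bone_model` stand for the two independently trained FTC-GCN streams.

```python
import torch

# Illustrative subset of (child, parent) joint pairs; the full list follows the
# skeleton topology of the dataset (25 joints for NTU RGB+D, 18 for Kinetics).
BONE_PAIRS = [(1, 0), (2, 1), (3, 2)]          # placeholder pairs

def joints_to_bones(joints):
    """joints: (N, C, T, V) coordinate tensor -> bone vectors of the same shape."""
    bones = torch.zeros_like(joints)
    for child, parent in BONE_PAIRS:
        bones[:, :, :, child] = joints[:, :, :, child] - joints[:, :, :, parent]
    return bones

def fuse_two_streams(joint_model, bone_model, joints):
    """Add the softmax scores of the two independently trained streams."""
    bones = joints_to_bones(joints)
    joint_scores = torch.softmax(joint_model(joints), dim=-1)
    bone_scores = torch.softmax(bone_model(bones), dim=-1)
    fused = joint_scores + bone_scores
    return fused.argmax(dim=-1)                # predicted action labels
```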

Fig. 1

The illustration of two-stream fusion architecture. The top is the bone flow and the bottom is the joint flow. The two-stream scores are fused to get the final prediction result

Table 1 The backbone network of FTC-GCN, which includes ten FTC-GCN blocks

A basic block (FTC-GCN) is shown in Fig. 2. In Fig. 1, the joint stream and bone stream both contain 10 basic blocks (FTC-GCN) with exactly the same structure. The output channels of the first to fourth layers are 64, those of the fifth to seventh layers are 128, and those of the eighth to tenth layers are 256. The strides of the fifth and eighth layers are 2 and those of the other layers are 1. Finally, an FC layer is used to generate the final recognition score. This layer configuration is summarized in the sketch below.
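The channel and stride configuration of the ten blocks can be written compactly as follows (the input channel count of 3, i.e. the coordinate dimension, and the tuple layout are our own notation):

```python
# (in_channels, out_channels, stride) for the ten FTC-GCN blocks
BLOCK_CONFIG = [
    (3,   64, 1), (64,  64, 1), (64,  64, 1), (64,  64, 1),   # layers 1-4: 64 channels
    (64, 128, 2), (128, 128, 1), (128, 128, 1),               # layers 5-7: 128 channels, stride 2 at layer 5
    (128, 256, 2), (256, 256, 1), (256, 256, 1),              # layers 8-10: 256 channels, stride 2 at layer 8
]
```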

Fig. 2

The illustration of a basic block (FTC-GCN). The FTGCN is used to concentrate more temporal information in the spatial layer, and TCN denotes the 9 × 1 temporal convolution. The BN and ReLU layers are regular operations. CSTA represents the unified attention module. Residual connections are used to facilitate training and gradient propagation

3.3 The FTC graph convolutional networks

As shown in Fig. 2, one basic block is the series of one spatial GCN (FTGCN), one unified attention module (CSTA), and one temporal GCN (TCN). The FTGCN is used to concentrate more temporal information in the spatial layer. The BN and ReLU layers are regular operations. The TCN performs a \(K_{t} \times 1\) convolution along the time dimension on the C × T × V feature map obtained from the FTGCN, where C denotes the number of channels, T the number of keyframes, and V the number of joints. Residual connections are used to facilitate training and gradient propagation.

3.4 Focus on temporal graph convolutional module

Many existing GCN models pay attention to spatial information and neglect temporal information. To solve this problem, we propose a focus on temporal graph convolutional module which can capture more temporal information and properly balance it for each action.

In (3), the Ak, Bk, and Ck matrices only focus on the possible connections and connection strengths between joints for a certain action without considering time. To pay more attention to the temporal information, we modify (3) into the following form:

$$ x_{out}=\sum\limits_{k=1}^{3}(\hat{A}_{k}+B_{k}+S_{k}+\lambda T_{k})x_{in}W_{k} $$
(4)

As illustrated in (3), \(x_{in} \in R^{C_{in}\times T\times V}\) denotes the input feature and \(x_{out} \in R^{C_{out}\times T\times V}\) denotes the output feature. Here \(C_{in}\) and \(C_{out}\) denote the numbers of channels, V denotes the number of joints, and T denotes the length of the skeleton action sequence. As mentioned earlier, in (3), \(A_{k} \in R^{V\times V}\) represents the adjacency matrix of the human body structure graph. Before being added to \(B_{k}\), \(A_{k}\) is normalized to \(\hat {A}_{k} = {D_{k}}^{-\frac {1}{2}} (A_{k}+I_{vk}) {D_{k}}^{-\frac {1}{2}} \in R^{V{\times }V}\), and \(B_{k}\) is initially set to a V × V matrix based on \(A_{k}\). The parameters in \(B_{k}\) change continually during training; it serves as a supplement to the adjacency matrix \(A_{k}\), so that different joints become connected for different actions during training. \(S_{k}\in R^{V\times V}\) is a data-dependent V × V matrix whose entries are normalized to values between 0 and 1. A non-zero value means that the two joints are connected, otherwise they are not; the larger the value, the stronger the connection between the two joints and the higher the correlation with the action. \(T_{k}\in R^{V\times V}\) is a dynamic graph that supplements the temporal information; we use \(T_{k}\) to concentrate as much temporal information as possible. The spatial information changes greatly at the beginning of an action, and as the action progresses the temporal information becomes more and more important. According to this characteristic, we use a factor λ to adjust the importance of the temporal graph at different layers. Convs denotes the 1 × 1 convolution operation; the \(S_{k}\) matrix is obtained through the two Convs in Fig. 3. Convt denotes the 9 × 1 convolution operation; the \(T_{k}\in R^{V\times V}\) matrix is obtained through the two Convt:

$$ S_{k}=softmax\left( \left( x_{in}^{T}w_{\theta k}^{T}\right)\left( w_{\phi k}x_{in}\right)\right) $$
(5)
$$ T_{k}=softmax\left( \left( x_{in}^{T}w_{\eta k}^{T}\right)\left( w_{\xi k}x_{in}\right)\right) $$
(6)

where \(w_{\theta k}\in R^{c_{in}\times v}\) and \(w_{\phi k}\in R^{c_{in}\times v}\) correspond to the two Convs in Fig. 3; \(w_{\theta k}\) and \(w_{\phi k}\) denote 1 × 1 convolution operations with C convolution kernels. \(w_{\eta k}\in R^{c_{in}\times v}\) and \(w_{\xi k}\in R^{c_{in}\times v}\) correspond to the two Convt in Fig. 3; \(w_{\eta k}\) and \(w_{\xi k}\) denote 9 × 1 convolution operations with C convolution kernels.
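A hedged PyTorch sketch of how \(S_{k}\) in (5) and \(T_{k}\) in (6) could be computed for a single partition k is shown below; the embedding width, the flattening of the time axis into the similarity, and all module names are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FTGCNUnit(nn.Module):
    """Dynamic graphs S_k (Eq. 5) and T_k (Eq. 6) for one partition k (a sketch)."""
    def __init__(self, in_channels, embed_channels=16, lam=1.0):
        super().__init__()
        self.lam = lam                                   # lambda in Eq. (4)
        # Convs: two 1x1 convolutions for the spatial graph S_k
        self.theta = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
        # Convt: two 9x1 convolutions (along time) for the temporal graph T_k
        self.eta = nn.Conv2d(in_channels, embed_channels, (9, 1), padding=(4, 0))
        self.xi = nn.Conv2d(in_channels, embed_channels, (9, 1), padding=(4, 0))

    @staticmethod
    def _graph(a, b):
        # a, b: (N, C_e, T, V) -> joint-to-joint similarity of shape (N, V, V)
        n, c, t, v = a.size()
        a = a.permute(0, 3, 1, 2).reshape(n, v, c * t)   # (N, V, C_e*T)
        b = b.reshape(n, c * t, v)                       # (N, C_e*T, V)
        return F.softmax(torch.bmm(a, b), dim=-1)        # row-normalized graph

    def forward(self, x):                                # x: (N, C_in, T, V)
        S_k = self._graph(self.theta(x), self.phi(x))    # data-dependent spatial graph
        T_k = self._graph(self.eta(x), self.xi(x))       # data-dependent temporal graph
        return S_k + self.lam * T_k                      # added to A_hat_k + B_k in Eq. (4)
```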

Fig. 3

The illustration of the Focus on Temporal Graph Convolutional Network (FTGCN). Ak is a fixed graph; Bk, Sk and Tk are dynamic graphs. Convs indicates the 1 × 1 convolution operation and Convt denotes the 9 × 1 convolution operation. ⊕ denotes element-wise addition and ⊗ denotes matrix multiplication. λ is used to adjust the importance of the temporal graph at different layers. The residual connection is only needed when Cin differs from Cout

3.5 Attention module

The channel, spatial, and temporal dimensions often contain redundant information. To better integrate channel, spatial, and temporal information, we propose a unified channel-spatial-temporal attention model (CSTA).

As shown in Fig. 4, channel attention focuses on important channels: the adaptive average pooling area is 1 × 1, and the dimension of the feature map after passing the channel attention module remains unchanged. Spatial attention focuses on important spatial information: the spatial information is first averaged and then a convolution is performed, whose kernel size is set according to the number of joints in each dataset. After spatial attention, the feature map dimension changes from C × T × V to 1 × 1 × V. Temporal attention helps the model pay different levels of attention to each frame; after temporal attention, the feature map dimension changes from 1 × 1 × V to 1 × T × 1. Finally, the dimension becomes C × 1 × 1 through two fully connected layers. A possible realization is sketched below.
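Based on this description, one possible (assumed) realization of CSTA is sketched below; the ordering of the three attention branches, the reduction ratio, and the temporal kernel size are our own choices, with the spatial kernel set by the dataset's joint count as stated above.

```python
import torch
import torch.nn as nn

class CSTAttention(nn.Module):
    """Unified channel-spatial-temporal attention (a sketch under assumptions)."""
    def __init__(self, channels, num_joints=25, reduction=4, t_kernel=9):
        super().__init__()
        # channel attention: global average pooling followed by two FC layers
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        # spatial attention: 1D conv over the joint axis, kernel set by joint count
        self.conv_v = nn.Conv1d(channels, 1, kernel_size=num_joints,
                                padding=num_joints // 2)
        # temporal attention: 1D conv over the frame axis
        self.conv_t = nn.Conv1d(channels, 1, kernel_size=t_kernel,
                                padding=t_kernel // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                 # x: (N, C, T, V)
        n, c, t, v = x.size()
        # channel weights of shape (N, C, 1, 1)
        ca = x.mean(dim=(2, 3))                           # adaptive avg-pool to 1x1
        ca = self.sigmoid(self.fc2(torch.relu(self.fc1(ca)))).view(n, c, 1, 1)
        x = x * ca
        # spatial (joint) weights of shape (N, 1, 1, V)
        sa = self.sigmoid(self.conv_v(x.mean(dim=2))).view(n, 1, 1, -1)
        x = x * sa
        # temporal (frame) weights of shape (N, 1, T, 1)
        ta = self.sigmoid(self.conv_t(x.mean(dim=3))).view(n, 1, -1, 1)
        x = x * ta
        return x                                          # attended feature map
```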

Fig. 4

Illustration of the unified attention module. ⊙ denotes the element-wise multiplication. ⊕ denotes the element-wise addition

4 Experiments

We first introduce the two large-scale datasets in this field, NTU-RGB+D [36] and Kinetics-Skeleton [37], and then describe the experimental details and some hyperparameters. We then conduct extensive experiments on the two datasets, comparing against 17 methods on NTU-RGB+D [36] and 8 methods on Kinetics-Skeleton [37]. Finally, we present ablation studies and the results of the individual streams.

4.1 Datasets

NTU-RGB+D: NTU RGB+D [36] is a large-scale, multi-modality, indoor-captured dataset for skeleton-based action recognition containing four modalities of data. Here we use only the skeleton data, which was captured simultaneously by three Microsoft Kinect v2 cameras providing 3D coordinates of 25 marked joints. There are 56,880 action clips in 60 classes, performed by 40 volunteers aged from 10 to 35. The cameras are mounted at the same height but at different horizontal angles. There are two benchmarks for this dataset: (1) Cross-Subject (CS), where the subjects are divided into two groups of 20 people each; the training set contains 40,320 samples from 20 subjects and the test set contains 16,560 samples from the remaining 20 subjects. (2) Cross-View (CV), divided by camera angle: the training set consists of 37,920 samples captured by cameras 2 (0°) and 3 (45°), while the test set consists of 18,960 samples captured by camera 1 (−45°).

Kinetics-Skeleton: Kinetics400 [37] is a large-scale human action recognition dataset containing about 300,000 video clips in 400 classes collected from YouTube. The skeleton data is obtained from Kinetics400 with the OpenPose toolbox, which predicts 18 2D joints and a confidence score for each person. The data is divided into training and test sets at a ratio of about 12:1, with each clip cut to 300 frames. We report top-1 and top-5 accuracies on this benchmark.

4.2 Implementation details

Our framework is implemented in PyTorch [38] and the code will be released later (https://github.com/dongle329/FTC-GCN). All experiments use stochastic gradient descent with a Nesterov momentum of 0.9. We use two NVIDIA GeForce 1080Ti GPUs for model training; the batch size is 16, the weight decay is 0.0001, and the initial learning rate is 0.1.
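For reference, the optimizer described above corresponds roughly to the following PyTorch configuration; the placeholder model and the NTU milestone epochs are shown only for illustration, and the milestones differ per dataset as detailed below.

```python
import torch

model = torch.nn.Linear(3, 60)        # placeholder for the FTC-GCN network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=0.0001)
# learning rate divided by 10 at the milestone epochs
# (30 and 40 for NTU RGB+D; 45 and 55 for Kinetics-Skeleton)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 40],
                                                 gamma=0.1)
```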

In our experiments, we trained the two streams successively, and each stream occupied both GPUs during training. Finally, we fused the joint stream and the bone stream.

For NTU RGB+D [36], the learning rate is divided by 10 at the 30th and 40th epochs, with 60 epochs in total. The training time of the joint stream on the Cross-Subject benchmark is about 41 hours and that of the bone stream is about 41 hours; on the Cross-View benchmark, the joint stream takes about 40 hours and the bone stream about 40 hours.

For Kinetics-Skeleton [37], the learning rate is divided by 10 at the 45th and 55th epochs, with 65 epochs in total. The training time of the joint stream is about 203 hours and that of the bone stream is about 202 hours.

4.3 Comparisons to the state-of-the-arts

The results are listed in Tables 2 and 3, respectively. We performed extensive experiments on the two large-scale datasets, comparing against 17 methods on NTU-RGB+D [36] and 8 methods on Kinetics-Skeleton [37]. For skeleton-based human action recognition, our method achieves the best performance, which demonstrates the superiority of our model.

Table 2 Comparison with the state-of-the-art methods on NTU-RGB + D dataset
Table 3 Comparison with the state-of-the-art methods on Kinetics dataset

4.4 Ablation study

As shown in Table 4, we conducted ablation experiments on FTGCN, λ, and CSTA on NTU-RGB+D [36].

Table 4 The importance of FTGCN and unified Attention (CSTA) were evaluated on NTU-RGB + D dataset

The accuracy of AGCN is 93.83%. First, we replace the spatial GCN with our proposed FTGCN without the λ parameter, which yields an accuracy of 94.34%; the increase in accuracy proves the effectiveness of focusing on as much temporal information as possible. FTGCN with the λ parameter reaches 94.59%, which proves the effectiveness of λ, i.e., the deeper the layer, the more important the temporal information. Finally, adding the CSTA module (FTC-GCN) yields an accuracy of 95.14%, an increase of 0.55%. Combining the scores of the joint stream and the bone stream, the final accuracy of our network is 96.50%.

4.5 The results of two-stream fusion

In this section, we show the results of the two streams fused under different benchmarks on the NTU-RGB+D [36] and Kinetics [37] datasets.

Table 5 shows the two-stream fusion results for the different benchmarks on the NTU-RGB+D [36] dataset. For the CS benchmark, the accuracy of the joint stream is 87.9%, the accuracy of the bone stream is 88.2%, and the accuracy after fusion is 90.4%. For the CV benchmark, the accuracy of the joint stream is 95.1%, the accuracy of the bone stream is 95.0%, and the accuracy after fusion is 96.5%.

Table 5 Two-stream fusion results on NTU-RGB + D dataset

Table 6 shows the two-stream fusion results on the Kinetics [37] dataset. For Top-1 accuracy, the joint stream achieves 36.1%, the bone stream achieves 35.6%, and the fused result is 37.8%. This shows that two-stream fusion can further boost the performance of the proposed method.

Table 6 Two-stream fusion results of Top-1 accuracy on Kinetics dataset

5 Conclusions

We design a novel focus on temporal graph convolutional module (FTGCN) which can capture more temporal information and properly balance it for each action. This approach increases the flexibility and generalization capacity of the model. It also confirms that a graph incorporating temporal information is more suitable for the action recognition task than a graph based only on the human body structure. To integrate channel, spatial, and temporal information, we propose a unified attention (CSTA) module, which helps the model pay more attention to the important joints, frames, and features. In addition, both the FTGCN module and the CSTA module can be easily incorporated into adaptive graph convolutional networks (AGCN) and significantly improve their performance. Owing to the contribution of these two modules, our FTC-GCN achieves the best performance among the listed methods on the two large-scale action recognition datasets.