
1 Introduction

In recent years, action recognition has become one of the most attractive topics in artificial intelligence. In particular, human action recognition (HAR) is widely used in fields such as human-object interaction, video surveillance, and healthcare systems [1], where it provides machines with accurate analysis and understanding of human actions and plays a crucial role in the development and progress of artificial intelligence.

Early research on skeleton-based human action recognition mainly relied on deep neural network models to learn the spatial and temporal correlations of human actions. Among these models, methods based on graph convolutional networks (GCN) [2] outperform those based on recurrent neural networks (RNN) [3] and convolutional neural networks (CNN) [4]. GCN methods construct a spatiotemporal topology graph from the 3D positions of skeleton joints by treating joints as graph vertices, the natural topological connections between adjacent joints as spatial edges, and the temporal correlations between adjacent frames as temporal edges. The resulting skeleton topology graph sequence is then fed into the network for learning, finally achieving action classification. GCN-based methods have been proven to be an effective solution for human action recognition.

To further improve performance, subsequent works [5,6,7,8] focus on introducing adaptive graph residual masks to capture the relationships between different joints, that is, on extracting more hidden information, such as bone and velocity, from the original skeleton data. To enhance the feature representation of each action, they train this information through multiple network streams and fuse all trained features to obtain a score for each action and complete the classification task. However, the extra information causes redundancy and doubles the model size, sacrificing storage space and computational efficiency, which greatly hinders deployment in practical applications.

In response, SGN [9] achieves strong performance with a much smaller model; however, it still suffers from insufficient data mining and incomplete use of semantic information. Guided by [10, 11], we combine channel attention with dilated convolution attention to strengthen feature connections between the frame dimension and the channels of the model. The main contributions of this paper can be summarized as follows:

  • This paper introduces multiple types of hidden skeleton information after data preprocessing and fuses them effectively in the early stage of the model, enhancing the feature representation of each stream and yielding a richer topology graph.

  • To make full use of the two types of semantic information, we integrate them into the graph convolution modules by adjusting the graph convolution layers and channel widths, overcoming the drawbacks of separate spatial-temporal processing.

  • In the temporal module, we design a time multi-scale dilated convolution kernel attention (T-MDKA), which obtains a large receptive field by replacing large-kernel convolutions with dilated convolutions, thereby modeling long-range dependencies. In addition, we construct two branch temporal convolution blocks to learn the temporal features of actions more robustly.

Fig. 1. Skeleton diagrams of 5 frames from three action sequences.

2 Related Works

2.1 Attention Mechanism

The attention mechanism can be seen as simulating the degree of attention people pay to a certain part of the input by adjusting weights, and it is now widely used in many fields [10,11,12]. SENet [10] proposes a squeeze-and-excitation block that learns global channel information, enabling the model to focus on more useful features. VAN [12] improves channel adaptability by using large kernels. To maximize the benefit of large convolution kernels, MAN [11] adopts a transformer-style structure and introduces GSAU to replace the MLP structure, obtaining multi-scale long-range modeling dependencies while improving representation ability and reducing parameters and computational complexity.
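For reference, the squeeze-and-excitation idea can be written down as a minimal PyTorch sketch (the reduction ratio and layer layout below are illustrative, not the exact SENet implementation):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: global pooling, bottleneck MLP, channel re-weighting."""
    def __init__(self, channels, reduction=16):  # reduction ratio is illustrative
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                        # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):                        # x: (N, C, H, W)
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                             # re-weight channels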

2.2 Lightweight Models

In the image classification [13] and object detection [14] fields, depthwise-separable convolution and grouped convolution have been proposed, respectively, to replace traditional convolution and greatly reduce model parameters. Zhang et al. [9] build on graph convolutional networks and introduce high-order semantic information to enhance feature expression, achieving a low parameter count while maintaining high recognition accuracy. Cheng et al. [15] construct a lightweight network framework using dynamic displacement graph convolution instead of traditional convolution; to simplify it further, they [16] use edge \(\textrm{RELU}\) distillation, which also improves recognition performance. In addition, Song et al. [17] embed separable convolution layers into an early multi-information fusion module, which makes the parameter size extremely small and the model more lightweight.

Fig. 2. (a) The overall framework of the proposed model, consisting of three parts; (b) the pyramid split attention module applied to TCN1 and TCN2; (c) the dilated convolution attention module proposed in this paper.

3 Method

In this section, we describe the composition of our proposed MDKA-GCN in detail. Figure 2(a) shows the overall model framework.

3.1 Multi-Branch Fusion Module

Earlier research [17] has shown that richer skeleton topology graphs play a key role in model performance. In this work, we mine three types of input features from the skeleton data: 1) joint stream, 2) velocity stream, 3) bone stream.

Specifically, this paper represents a skeleton sequence as a set of joints \({\textbf {S}}_{t,k}=\{x_{t,k}\,|\,t=1,2,...,T;\ k=1,2,...,J \}\), where T is the total number of frames and J is the total number of human joints.

The joint stream \({\textbf {S}}_{t,k}\in \mathbb {R}^{C \times T\times V}\) consists of the original 3D coordinates provided by the datasets, where the channel number C equals 3. From it, the relative joint position is obtained as \(\textrm{r}_{t,k}=\textrm{x}_{t,k}-\textrm{x}_{t,m}\), where \(\textrm{x}_{t,m}\) denotes the position of the human skeleton’s center of gravity. Considering that many subtle actions are concentrated on the hands, such as “play Rubik’s cube” in Fig. 1(c), we select three central joints for the action sequence: the upper spine, the lower spine, and the palm/wrist joints.

Similarly, the joint velocity can be defined from the joint positions as the positional change between adjacent frames, \(\textrm{v}_{t,k}=\textrm{x}_{t,k}-\textrm{x}_{t-1,k}\).

Like the relative position, bone information can be defined as the position difference between two adjacent joints on the skeleton, \(\textrm{b}_{t,k}=\textrm{x}_{t,k}-\textrm{x}_{t,i}\), where joint \(\textrm{x}_{t,k}\) is farther from the body’s center of gravity and joint \(\textrm{x}_{t,i}\) is its adjacent joint closer to the center of gravity. Analogously to the joint velocity formula, we can easily obtain the bone velocity. Since acceleration information is crucial for capturing some subtle actions, we also derive bone acceleration from the bone velocity as an additional input. Finally, each input stream is encoded through two fully connected layers (\(\textrm{FC}\)),

$$\begin{aligned} \textrm{P}_{t,k} = \sigma (\textrm{FC}(\sigma (\textrm{FC}(\textrm{x}_{t,k})))) \end{aligned}$$
(1)

where \(\textrm{P}_{t,k}\) represents the joint information encoded by the fully connected layers; throughout this paper, \(\sigma \) denotes the \(\textrm{RELU}\) activation function.
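As a minimal sketch of the stream construction and the embedding of Eq. (1) (the tensor layout, the choice of central joints, and the use of 1×1 convolutions as FC layers are our assumptions, not the authors’ released code):

import torch
import torch.nn as nn

def build_streams(x, center_idx, parent_idx):
    """x: (N, 3, T, V) joint coordinates; center_idx: indices of the assumed central joints;
    parent_idx: for each joint, the index of its neighbor closer to the body center."""
    center = x[:, :, :, center_idx].mean(dim=-1, keepdim=True)
    rel_pos = x - center                             # r_{t,k} = x_{t,k} - x_{t,m}
    velocity = torch.zeros_like(x)
    velocity[:, :, 1:] = x[:, :, 1:] - x[:, :, :-1]  # v_{t,k} = x_{t,k} - x_{t-1,k}
    bone = x - x[:, :, :, parent_idx]                # b_{t,k} = x_{t,k} - x_{t,i}
    return rel_pos, velocity, bone

class StreamEmbed(nn.Module):
    """Two FC layers with ReLU, Eq. (1), realized as 1x1 convolutions over (T, V)."""
    def __init__(self, in_channels=3, out_channels=64):   # output width is assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):                            # x: (N, 3, T, V)
        return self.net(x)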

In the feature fusion module, we obtain different channel attention weights through two convolution layers, further enhancing the feature representation of each information stream. The three information streams are then fused and fed into the graph convolution modules,

$$\begin{aligned} \hat{\textrm{P}}_{t,k} = Conv(\sigma (bn(\textrm{P}_{t,k}))) \end{aligned}$$
(2)
$$\begin{aligned} f_\textrm{in} = Cat[\hat{\textrm{P}}_{t,k},\hat{\textrm{V}}_{t,k},\hat{\textrm{B}}_{t,k}] \end{aligned}$$
(3)

where \(\hat{\textrm{V}}_{t,k}\) and \(\hat{\textrm{B}}_{t,k}\) represent the velocity and bone information encoded by the fully connected layers, respectively; bn and Cat denote the normalization function and the concatenation operation, respectively. Conv is a pointwise convolution that reduces the channel dimension to avoid the large number of parameters that would otherwise result from the high dimensionality after concatenation.
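A hedged sketch of this early fusion step (Eqs. (2)–(3)); the channel widths and the use of a 1×1 convolution to produce the per-stream weights are assumptions:

import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    """Re-weight each embedded stream with Conv(ReLU(BN(.))), concatenate, then reduce channels."""
    def __init__(self, channels=64, out_channels=64):     # widths assumed
        super().__init__()
        self.att = nn.ModuleList([
            nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                          nn.Conv2d(channels, channels, 1))
            for _ in range(3)
        ])
        self.reduce = nn.Conv2d(3 * channels, out_channels, 1)  # pointwise conv after concat

    def forward(self, p, v, b):                            # each stream: (N, C, T, V)
        streams = [att(s) for att, s in zip(self.att, (p, v, b))]   # Eq. (2) per stream
        return self.reduce(torch.cat(streams, dim=1))      # Eq. (3): f_in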

Fig. 3. (a) and (b) are the two branch convolution modules that integrate the Pyramid Split Attention (PSA) module.

3.2 Semantic Information

In this work, we use a one-hot vector \(J'_{k}\) to represent the kth skeleton joint and, similarly, a one-hot vector \(T'_{i}\) to represent the ith frame index. This paper concatenates the two types of semantic information in a low dimension and then transforms them through a multi-layer perceptron (\(\textrm{MLP}\)). Finally, the result is input into each graph convolution layer,

$$\begin{aligned} \textrm{G}_0 = Cat[\textrm{J}'_{k},\textrm{T}'_{i}] \end{aligned}$$
(4)
$$\begin{aligned} \textrm{G}_j = \textrm{MLP}(\textrm{G}_{j-1}) \end{aligned}$$
(5)

where \(\textrm{J}'_{k}\) and \(\textrm{T}'_{i}\) refer to the joint-type and frame-index semantic information, respectively, and \(j = 1,2,3,4\). The purpose of the \(\textrm{MLP}\) is to increase the channel dimension to match the input dimension of each graph convolution layer.
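A minimal sketch of the semantic channel graph (Eqs. (4)–(5)); treating \(G_j\) as a stack of 25×25 channel graphs (T is set to 25, see Sect. 4.2) and using a 1×1 convolution as the MLP are our assumptions:

import torch
import torch.nn as nn

def semantic_graph_init(T=25, V=25):
    """G_0: joint-type and frame-index one-hot matrices stacked as a 2-channel graph."""
    joint = torch.eye(V)                  # J'_k, one-hot per joint
    frame = torch.eye(T)                  # T'_i, one-hot per frame
    return torch.stack([joint, frame])    # (2, 25, 25)

class SemanticLift(nn.Module):
    """MLP (1x1 convolution) that lifts G_{j-1} to the channel width of GCN layer j, Eq. (5)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))

    def forward(self, g):                 # g: (C_in, 25, 25) -> (C_out, 25, 25)
        return self.net(g.unsqueeze(0)).squeeze(0)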

3.3 Graph Convolution Module

The operation in this paper’s graph convolution module differs from previous work [9], which typically extracts spatial features through adjacency matrices composed of the natural human joint connections or deformations thereof. Considering the importance of the two types of semantic information, this paper inputs channel graphs fused with semantics into the graph convolution modules,

$$\begin{aligned} f_\textrm{out} = \sigma (bn(Conv(f_\textrm{in}\otimes \textrm{G}_j)+Conv(f_\textrm{in}))) \end{aligned}$$
(6)

where \(f_\textrm{in}\) and \(f_\textrm{out}\) are the input and output of the graph convolution, respectively, \(\otimes \) denotes matrix multiplication, and each Conv is a \(1\times 1\) convolution with its own training weights.
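A hedged sketch of Eq. (6); reading \(\otimes \) as a per-channel matrix product between the feature map and the semantic channel graph is our interpretation:

import torch
import torch.nn as nn

class SemanticGCN(nn.Module):
    """f_out = ReLU(BN(Conv(f_in (x) G_j) + Conv(f_in))), two 1x1 convolutions with separate weights."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_graph = nn.Conv2d(in_ch, out_ch, 1)
        self.conv_res = nn.Conv2d(in_ch, out_ch, 1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_in, g):           # f_in: (N, C, T, V); g: (C, V, V)-like channel graph
        agg = torch.einsum('nctv,cvw->nctw', f_in, g)     # f_in (x) G_j, aggregated per channel
        return self.relu(self.bn(self.conv_graph(agg) + self.conv_res(f_in)))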

3.4 Time Convolution Module

The temporal processing of SGN uses only a single fixed convolution operation, which is not sufficient to distinguish some similar actions, such as those in Fig. 1(a) and Fig. 1(b). Therefore, we propose a temporal convolution module composed of the two branch convolution blocks in Fig. 3 and the multi-scale dilated attention in Fig. 2(c).

Branch Convolution Block. To obtain receptive fields at different scales while controlling the model’s parameter and computation volume, and inspired by [18], we introduce the PSA module and design two types of pyramid split convolution modules, shown in Fig. 2(b), to extract multi-scale temporal features of action sequences,

$$\begin{aligned} F_i = Conv_i(\sigma (bn(chunk(\textrm{X}_\textrm{in})))) \end{aligned}$$
(7)
$$\begin{aligned} F = F_{0} \oplus F_{1} \end{aligned}$$
(8)

where chunk is the split operator, which divides the input into two equal parts along the channel dimension; \(Conv_i\) denotes convolution operations with kernel sizes of \(1\times 1\) and \(1\times 3\); and \(\oplus \) is the concatenation operator. Then, the \(\textrm{SEweight}\) module is used to obtain attention weights from the input feature maps at different scales,

$$\begin{aligned} Z_{i} = \textrm{SEweight}(F_{i}) \end{aligned}$$
(9)
$$\begin{aligned} Z = \textrm{Softmax}(Z_{0} \oplus Z_{1}) \end{aligned}$$
(10)
$$\begin{aligned} Out = F \odot Z \end{aligned}$$
(11)

where Softmax is used to obtain the re-calibrated weights and \(\odot \) denotes the element-wise (dot) product. Different from TCN1, TCN2 uses grouped convolutions and dilated convolutions with a dilation rate of 2 and a kernel size of \(1\times 3\).
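A hedged sketch of the TCN1-style branch convolution block (Eqs. (7)–(11)); the SEweight implementation, reduction ratio, and kernel orientation are assumptions, and TCN2 would additionally use grouped and dilated convolutions as described above:

import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """Returns per-channel attention weights of shape (N, C, 1, 1), SENet-style."""
    def __init__(self, channels, reduction=4):                  # reduction ratio assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

class BranchConvBlock(nn.Module):
    """Chunk channels into two scales, convolve each, then softmax-re-weight across scales."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.pre = nn.ModuleList([
            nn.Sequential(nn.BatchNorm2d(half), nn.ReLU(inplace=True)) for _ in range(2)
        ])
        self.conv0 = nn.Conv2d(half, half, kernel_size=(1, 1))                   # 1x1 branch
        self.conv1 = nn.Conv2d(half, half, kernel_size=(3, 1), padding=(1, 0))   # 1x3 branch along time
        self.se = nn.ModuleList([SEWeight(half), SEWeight(half)])

    def forward(self, x):                                        # x: (N, C, T, V)
        x0, x1 = torch.chunk(x, 2, dim=1)                        # split on the channel dimension
        f0 = self.conv0(self.pre[0](x0))                         # Eq. (7)
        f1 = self.conv1(self.pre[1](x1))
        f = torch.cat([f0, f1], dim=1)                           # Eq. (8)
        z = torch.cat([self.se[0](f0), self.se[1](f1)], dim=1)   # Eq. (9)
        z = torch.softmax(z.reshape(x.size(0), 2, -1, 1, 1), dim=1).reshape_as(z)  # Eq. (10)
        return f * z                                             # Eq. (11)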

Time Multi-scale Dilated Attention. Inspired by some large-kernel works [8, 11], which obtain the long-term modeling advantages of attention mechanisms without adding too much computational burden, we propose a time multi-scale dilated kernel attention that replaces large-kernel convolutions with dilated convolutions, focusing on extracting temporal dependencies between action sequences. It is shown in Fig. 2(c), and the formulas are as follows:

$$\begin{aligned} x_i = Split(x) \end{aligned}$$
(12)
$$\begin{aligned} DKA(x_i) = PWConv(Conv_{DWD}(x_i)) \end{aligned}$$
(13)
$$\begin{aligned} MDKA(x) = Cat(DKA(x_i) \odot x_i) \end{aligned}$$
(14)

where Split is the split operation, \(i= 1, 2, 3, 4\); \(Conv_{DWD}\) is a dilated separable convolution with a kernel size of \(1\times 3\), whose dilation rate can be 2, 3, or 4; and PWConv is an ordinary pointwise convolution.
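A hedged sketch of T-MDKA (Eqs. (12)–(14)); splitting into four equal channel groups and giving the first group dilation 1 are assumptions:

import torch
import torch.nn as nn

class TMDKA(nn.Module):
    """Split channels into 4 groups, build a depthwise dilated conv + pointwise conv
    attention map per group, gate each group by its map, then concatenate."""
    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        g = channels // 4
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(g, g, kernel_size=(3, 1), padding=(d, 0),
                          dilation=(d, 1), groups=g),            # Conv_DWD: depthwise dilated 1x3 over time
                nn.Conv2d(g, g, kernel_size=1),                   # PWConv
            ) for d in dilations
        ])

    def forward(self, x):                                         # x: (N, C, T, V)
        parts = torch.chunk(x, 4, dim=1)                          # Eq. (12): Split
        out = [branch(p) * p for branch, p in zip(self.branches, parts)]  # Eq. (13): DKA(x_i) . x_i
        return torch.cat(out, dim=1)                              # Eq. (14)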

4 Experiment

4.1 Dataset

NTU-RGB+D 60 Dataset [19]: One of the current mainstream skeleton-based action recognition datasets, containing 56,880 skeleton sequences of 60 action categories performed by 40 different subjects and captured simultaneously by 3 Microsoft Kinect V2 depth cameras. Each skeleton sequence contains the three-dimensional spatial coordinates of 25 joints. The dataset provides two evaluation benchmarks: Cross-Subject (C-Sub) and Cross-View (C-View). In C-Sub, half of the 40 subjects are used for training and the rest for testing. In C-View, samples captured by cameras 2 and 3 are used for training and the rest for testing.

NTU-RGB+D 120 Dataset [20]: This dataset expands NTU RGB+D 60 in both action categories and number of actors, containing 114,480 action videos performed by 106 actors and 120 action categories in total, including 82 daily actions, 12 medical conditions, and 26 two-person interaction actions. The dataset has two evaluation benchmarks: Cross-Subject (C-Sub120) and Cross-Setup (C-Set120). C-Sub120 divides the dataset into a training set (63,026 videos) and a validation set (50,919 videos) according to the actors in the videos. C-Set120 divides the dataset according to the parity of the collection setup numbers: 54,468 even-numbered videos are used for training, and 59,477 odd-numbered videos for testing.

4.2 Experimental Details

Our setup is similar to [9]; the difference is that, to facilitate the operation of the channel graphs composed of the two types of semantic information in the graph convolution module, we adjust the time dimension to 25. The model is trained for 120 epochs with a batch size of 64. The initial learning rate is 0.001 and decays during training: at epochs 80 and 100, it drops by a factor of ten. We use the Adam optimizer with a weight decay of 0.0001.
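A minimal training-loop sketch reflecting these settings (the model, the data loader, and the use of cross-entropy loss are placeholders/assumptions):

import torch

def train(model, train_loader, epochs=120):
    # model and train_loader stand in for MDKA-GCN and the NTU data pipeline
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0001)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 100], gamma=0.1)
    criterion = torch.nn.CrossEntropyLoss()          # standard classification loss (assumed)
    for epoch in range(epochs):
        for skeletons, labels in train_loader:       # batch size 64
            optimizer.zero_grad()
            loss = criterion(model(skeletons), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()                             # lr drops tenfold at epochs 80 and 100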

4.3 Ablation Experiment

In this part, we mainly discuss the contributions of the different components of our model, including the multi-branch fusion module, the high-order semantic information, the temporal convolution module, and the necessity of the attention module.

Table 1. Comparison of the accuracy of different input branches.

The experimental results in Table 1 verify two conclusions: first, fusing the three input branches yields the clearly highest recognition accuracy on C-Sub and C-View; second, our information fusion method increases the parameter volume by only about 1/7 while significantly improving model performance.

Table 2. Verification of the accuracy of two types of semantic information.

The experimental results in Table 2 show that the channel graph composed of the two types of semantic information plays an important role in the graph convolution operation. It is worth noting that without the channel graph, the original graph convolution degenerates into an ordinary pointwise convolution and the model’s ability to aggregate features of different joints weakens, resulting in a decline in performance.

Table 3 verifies the effectiveness of the two branch convolution blocks and the multi-scale dilated attention module. From the table, it can be seen that the proposed attention module effectively improves model performance while adding very few parameters, and our combination of the two branch convolution modules makes the model well balanced between performance and parameter volume.

4.4 Comparison with State-of-the-Art

From Table 4, it can be seen that the best performance of our single-stream network MDKA-GCN (1s) on the two benchmarks is \(91.2\%\) and \(96.2\%\), respectively, while the multi-stream networks achieve even better performance; in particular, MDKA-GCN (4s) reaches \(92.1\%\) and \(96.8\%\) on the two benchmarks, outperforming the other SOTA models.

Table 3. Verification of the effectiveness of the two branch convolution modules and the attention mechanisms.
Table 4. Comparison of accuracy (%) with some recent SOTA methods.

Randomness in training, such as sample shuffling and random frame extraction, can lead to incomplete feature learning and unstable training results. To mitigate this randomness and enhance the robustness of the network, we design a multi-stream structure in which every single-stream sub-network is identical. The output scores of the single-stream sub-networks are fused by summation and used as the final output of the multi-stream network.
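A minimal sketch of this late score fusion, assuming every sub-network outputs class scores of the same shape:

import torch

def fuse_streams(stream_scores):
    """Sum the class scores of several identically structured single-stream networks.
    stream_scores: list of tensors, each of shape (N, num_classes)."""
    return torch.stack(stream_scores, dim=0).sum(dim=0)

# usage sketch: predictions = fuse_streams([net(x) for net in stream_nets]).argmax(dim=1)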

In addition, Table 4 shows that on C-Sub120 and C-Set120, compared with SGN [9], our single-stream method improves accuracy by 7.6 and 6.8 percentage points, respectively. Among the multi-stream networks (2s-AGCN [5], 4s-AGE-Ens [6], SMotif-GCN [7]), our multi-stream method MDKA-GCN (4s), although lower in accuracy on C-Sub120 than 4s-AGE-Ens and SMotif-GCN, reaches the highest accuracy on the C-Set120 benchmark.

Table 5. Comparisons with SOTA methods.

Comparison with Lightweight SOTA. To verify the overall performance of our model, we compare it with recent SOTA methods in terms of accuracy, model parameter volume, and computational complexity, as shown in Table 5. Compared to the lightweight GCN method EfficientB4, our single-stream model has a smaller parameter size and lower computational complexity while achieving higher accuracy. We also compare the training process of MDKA-GCN and SGN on the two datasets in Fig. 4; for a fair comparison, the hyperparameter settings and data preprocessing of the two models are kept consistent. Overall, our method reaches an advanced level and, compared to most lightweight SOTA methods, is better suited to resource-limited mobile devices and practical application scenarios.

Fig. 4. Comparison of the accuracy and convergence of MDKA-GCN and the baseline model SGN.

Fig. 5. Qualitative examples from NTU-RGB+D 60; six frames are selected from each action.

5 Action Visualization

To display the action process more intuitively, this paper selects some similar or difficult-to-distinguish actions, such as “wear headphones” and “drink water”, and visualizes a few frames of their skeleton sequences in Fig. 5. These actions are mainly completed with both hands and are extremely similar in spatial configuration and temporal dynamics, requiring long-term observation to distinguish.

6 Conclusion

In this paper, we introduce two types of semantic information into multiple graph convolution layers, improving model performance while reducing parameters. Our branch convolution blocks emphasize salient motion features, and the proposed time multi-scale dilated convolution attention module enlarges the receptive field and enriches the representation of temporal features. Extensive experiments on mainstream action recognition datasets show that MDKA-GCN outperforms most mainstream methods in terms of computational cost and performance, giving it broad application prospects.