1 Introduction

Human action recognition has become one of the most important tasks in computer vision, as it has a wide range of applications in intelligent video surveillance [7, 67], human–machine interaction [32, 36], medical services [53], etc. However, the accuracy of video-based action recognition is limited by the video quality, and large amounts of video data are costly to store and transmit. In contrast to video-based human action recognition, skeleton-based human action recognition has attracted great attention due to its robustness against changes in body scales, camera viewpoints, and background interference. At the same time, more and more human skeleton data are generated by depth cameras (e.g., Microsoft Kinect) and pose estimation algorithms [5, 64], which provides abundant data for deep models, and skeleton data can be stored and transmitted efficiently. In addition, CNN-based methods have achieved good performance in processing RGB video, which lies in Euclidean space, for action recognition [12, 20, 53]. However, unlike RGB video, skeleton data is graph-structured and lies in non-Euclidean space: the joints of the human skeleton are correlated with one another, but each joint is related to a varying number of other joints, and there is no clear order among these related joints. Therefore, how to effectively extract spatial–temporal information in non-Euclidean space has become a key problem for skeleton-based action recognition, which is the core topic of this work.

Naturally, the skeleton data represents a human action as a sequence of 2D or 3D coordinates of the main body joints, and these joints are connected according to the physical structure of the human body to construct a skeleton graph [51]. The skeleton data thus expresses human action through the motion information of the skeleton graph, such as movement speed, positional relationships, and joint angles. Traditional deep-model-based methods convert the skeleton data into Euclidean space according to predefined rules and then employ a CNN or RNN to learn deep features of the skeleton sequence [4, 25]. Currently, the GCN-based strategy has become the mainstream for skeleton-based human action recognition [10, 21, 41, 46, 62, 69, 71, 73], as it can alternately perform convolution on spatial and temporal edges to jointly learn the spatial–temporal information of the skeleton graph sequence. Compared with CNNs and RNNs, GCNs have the advantage of processing graph-structured data directly, so they can preserve the non-Euclidean characteristics of the skeleton graph. However, a core problem of GCNs is that the prior adjacency matrix, which guides information aggregation in the non-Euclidean space and limits the perception field, is difficult to define manually. Many works focus on designing suitable adjacency matrices for skeleton-based action recognition [41, 45, 62], but these incremental modules increase the complexity of the model and are not easy to follow. In this work, we abandon the traditional GCN-based model and exploit an exquisite transformer-based model to capture features from the skeleton sequence, which is fully data-driven and free of complex incremental modules. Similar to a GCN, the transformer-based model can process graph-structured data in non-Euclidean space instead of converting the skeleton data into Euclidean space in a manually defined way (as a CNN or RNN would). The transformer has achieved remarkable performance in natural language processing [54] and some basic visual tasks [6, 17], but few studies have discussed its applicability to skeleton-based action recognition. While keeping the elegant structure of the transformer, we carefully design a graph-aware transformer (GAT) that makes the transformer more suitable for processing graph-structured skeleton data without adding any trainable parameters. The GAT takes two important structures of the human body (i.e., the link graph structure and the part graph structure) as prior masks and uses an improved multi-head attention mechanism to extract deep features of the skeleton sequence.

A recent work, 2s-AGCN [45], utilizes the first-order spatial difference of the skeleton data (i.e., bone vectors) to construct a two-stream GCN and significantly enhances GCN-based models. However, few GCN-based works can effectively use the first-order temporal difference information (i.e., the velocity vectors), and they cannot fully exploit this velocity feature. Worse still, simply adopting a multi-stream strategy to utilize the velocity information multiplies the number of parameters and the computation cost of the model. To address this issue, in the GAT we combine the multi-head attention mechanism with the first-order temporal difference of the skeleton data and propose a velocity-driven correlation, which makes full use of the correlation between joint velocities. Thus, our GAT can learn both position-driven and velocity-driven attention maps to capture motion features effectively. Notably, unlike 2s-AGCN, we do not use additional streams but merge the joint positions and the velocity features into one stream. In this way, we make full use of the velocity vectors without increasing the model size.

In summary, the main contributions of this paper are threefold: (i) A graph-aware transformer (GAT) is carefully designed to extract the spatial–temporal information of the skeleton sequence; it takes two important human body structures as prior masks and uses an improved multi-head attention mechanism to achieve data-driven feature extraction. (ii) The first-order temporal difference of the skeleton sequence is fully utilized by combining the velocity vectors with the guidance of the multi-head attention mechanism, which effectively learns a velocity-driven attention map to extract motion features. (iii) Extensive experiments on three large-scale datasets demonstrate that our GAT obtains remarkable performance for skeleton-based action recognition and significantly outperforms the GCN baseline.

2 Related works

2.1 Skeleton-based action recognition

Conventional skeleton-based action recognition methods usually employ handcrafted features [1, 55, 56] or utilize RNNs [4, 23, 32, 49, 76] and CNNs [25, 28, 29, 35] to learn features of the skeleton sequence. Vemulapalli et al. [55] designed rolling maps to represent the relative 3D rotations between various body parts, which is a key motion feature of skeleton sequences. Liu et al. [32] extended RNN-based methods to the spatial–temporal domain to analyze action-related information. Zhu et al. [80] proposed a cuboid CNN to fully exploit the local movements of human joints in skeleton actions. These methods can neither effectively extract the spatial–temporal correlation among the joints of the skeleton graph nor fully exploit the human body structure. Yan et al. first proposed a GCN-based method, ST-GCN [62], which significantly boosts the performance of skeleton-based action recognition. Since then, GCN-based methods have become the mainstream. Based on ST-GCN, many variants have been explored [10, 30, 46, 69, 71, 73, 76], which typically introduce incremental modules, e.g., an attention module [10], a context-aware module [73], or a semantics-guided module [71], to enhance the network capacity. Shi et al. [45] took the first-order spatial difference of joints (i.e., bone vectors) as a second stream and designed a two-stream adaptive GCN. Wen et al. [59] introduced a motif-based graph convolution to encode the hierarchical spatial structure. Zhang et al. [69] explored a spatial attentive and temporally dilated GCN to extract features of skeleton sequences with different spatial attention weights and temporal scales. Peng et al. [41] turned to neural architecture search (NAS) and proposed the first automatically designed GCN, which further strengthens the representation ability of the adjacency matrix in GCN. Different from the above works, to completely overcome the limitation of the adjacency matrix, we propose a transformer-based model, i.e., the GAT, to replace GCN as the backbone for extracting features of the skeleton sequence. The GAT not only retains the adjacency matrices of the skeleton graph as prior knowledge, but also has a correlation-driven global perception field and a more powerful capability to learn spatial–temporal deep features. Recently, Plizzari et al. [42] also proposed to use the transformer to process skeleton data, but they still employed GCN layers when extracting low-level features and ignored the velocity features. In contrast, with the help of the graph-aware masks, our GAT can effectively extract both low-level and high-level features with rich velocity information. More importantly, we do not integrate a GCN-based model or any other incremental modules with the transformer, so our GAT is more concise and easy to implement.

Fig. 1 The architecture of the graph-aware self-attention module

2.2 Visual transformer

Transformers, which have been widely used in natural language processing (NLP) tasks, are models that rely on the multi-head self-attention mechanism to draw global correlations from the input features. Vaswani et al. [54] first proposed the transformer based on the multi-head attention mechanism for machine translation. Devlin et al. [16] introduced a new language representation model called BERT (Bidirectional Encoder Representations from Transformers), which pre-trains a transformer on unlabeled text so that the model learns the context of each word. Inspired by the major success of transformer architectures in NLP, using transformers for vision tasks has recently become a trend, e.g., object detection [6, 13, 14, 52], image enhancement [8, 63], image segmentation [8, 58], image generation [38], video processing [65, 79], and 3D point cloud processing [75]. For image classification, Dosovitskiy et al. proposed the vision transformer (ViT) [17], which divides an image into \(16 \times 16\) pixel patches, feeds these patches into a standard transformer, and obtains remarkable performance. Wu et al. represented images as semantic visual tokens and ran a transformer to densely model token relationships [60]. For object detection, Carion et al. [6] combined the transformer framework with a CNN and proposed a simple, fully end-to-end object detector named DETR. Zhu et al. [13] proposed Deformable DETR, which has become a popular method that significantly improves detection performance. For video processing, Zhou et al. designed an end-to-end transformer model to encode videos into appropriate representations [79]. Zeng et al. simultaneously filled missing regions in all input video frames with a self-attention module for video inpainting. For skeleton-based action recognition, Plizzari et al. [42] proposed to improve GCN by combining spatial and temporal attention modules to explore the spatial–temporal correlation of the skeleton graph sequences. These works demonstrate that transformers have strong visual feature extraction capability and tremendous potential compared with CNNs. However, in the field of skeleton-based action recognition, few studies discuss the applicability of the transformer to extracting low-level and high-level skeleton features. Moreover, how to effectively use the transformer to process graph-structured data is also a meaningful topic.

3 Background

3.1 Problem formulation

In this paper, we use \({\mathcal {G}}=({\mathcal {V}}, {\mathcal {E}})\) to represent the skeleton graph, where \({\mathcal {V}}\) denotes the vertices and \({\mathcal {E}}\) denotes the edges. For the joint graph, \({\mathcal {V}}\) is the set of n joints and \({\mathcal {E}}\) is the set of m bones. For the bone graph, conversely, \({\mathcal {V}}\) is the set of m bones and \({\mathcal {E}}\) is the set of n joints. We define the adjacency matrix of the skeleton graph as \(A \in \{0,1\}^{n\times n}\), where \(A_{i,j}=1\) if the ith and the jth vertices are connected, and \(A_{i,j}=0\) otherwise. The initial position feature of the skeleton joints is their 3D (or 2D) coordinates. By taking the first-order spatial difference of the joints, we obtain the representation of the bones, which is a sequence of 3D (or 2D) vectors. Let \(X_j \in {\mathbb {R}}^{n\times 3\times T}\) be the 3D joint positions across T frames and \(X_b \in {\mathbb {R}}^{n\times 3\times T}\) be the 3D bone vectors.
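To make the joint and bone representations concrete, the sketch below derives bone vectors as the first-order spatial difference between each joint and its parent joint. The `joints_to_bones` helper and the parent list are illustrative assumptions, not the authors' code; any skeleton topology with one parent index per joint works the same way.

```python
import torch

def joints_to_bones(x_j: torch.Tensor, parents: list) -> torch.Tensor:
    """Bone vectors as the first-order spatial difference of the joints.

    x_j: joint positions of shape (n, 3, T).
    parents[i]: index of the joint that joint i is attached to (letting the
    root point to itself yields a zero bone for it).
    """
    parent_idx = torch.tensor(parents, dtype=torch.long)
    return x_j - x_j[parent_idx]  # child position minus parent position

# Toy example: a 3-joint chain (0 -> 1 -> 2) over T = 10 frames.
x_j = torch.randn(3, 3, 10)
x_b = joints_to_bones(x_j, parents=[0, 0, 1])
print(x_b.shape)  # torch.Size([3, 3, 10])
```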

3.2 Multi-head self-attention

The self-attention function proposed in [54] can be described as mapping a query and a set of key-value pairs to an output, where the query (Q), key (K), value (V), and output are all feature vectors. The output is computed as a weighted sum of V, where the weight assigned to each V is computed by a correlation function of Q with the corresponding K. In practice, the self-attention function is defined as:

$$\begin{aligned} Attention(Q, K, V) = softmax\left( \frac{QK^{T}}{\sqrt{d_{k}}}\right) V, \end{aligned}$$
(1)

where \(d_{k}\) is the dimension of K. Integrating multiple self-attention heads, the multi-head self-attention can be formulated as:

$$\begin{aligned} Multihead(Q, K, V) = Concat(head_{1},\ldots , head_{h})W^{O}, \end{aligned}$$
(2)

where \(head_{i}=Attention(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})\). The projections are parameter matrices \(W_{i}^{Q} \in {\mathbb {R}}^{d\times d_k}\), \(W_{i}^{K} \in {\mathbb {R}}^{d\times d_k}\), \(W_{i}^{V} \in {\mathbb {R}}^{d\times d_v}\) and \(W^{O} \in {\mathbb {R}}^{hd_v\times d}\). Multi-head self-attention allows the model to jointly attend to information from different representation sub-spaces.
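For reference, a minimal PyTorch sketch of Eqs. 1 and 2 is given below. It follows the standard formulation in [54]; the class and variable names are ours, and the feature sizes in the example are illustrative.

```python
import math
import torch
import torch.nn as nn

def attention(q, k, v):
    # Eq. (1): softmax(Q K^T / sqrt(d_k)) V
    d_k = k.size(-1)
    w = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return w @ v

class MultiHeadSelfAttention(nn.Module):
    # Eq. (2): run h heads in parallel, concatenate them, project with W^O.
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # stacks W_i^Q of all heads
        self.w_k = nn.Linear(d_model, d_model)   # stacks W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacks W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, x):                        # x: (batch, tokens, d_model)
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        out = attention(q, k, v)                 # (b, h, n, d_k)
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(out)

# Example: 25 skeleton joints as tokens, 64-dimensional features, 4 heads.
mhsa = MultiHeadSelfAttention(d_model=64, num_heads=4)
print(mhsa(torch.randn(2, 25, 64)).shape)        # torch.Size([2, 25, 64])
```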

4 Method

4.1 Graph-aware self-attention

GCN-based models have achieved remarkable performance in skeleton-based action recognition. An important reason is that a GCN can fully explore the physical structure of the human body through the prior adjacency matrix. Peng et al. [41] showed that the first-order adjacency matrix of the human body plays a key role in extracting the low-level features of the skeleton sequence, whereas higher-order adjacency matrices are more important for extracting high-level features. Although existing studies have designed a variety of learnable adjacency matrices to improve the performance of GCN [41, 45, 62], higher-order adjacency matrices are difficult to design and their robustness is poor. Different from GCN, the transformer employs the attention map (see Eq. 1) to aggregate information. This process is completely data-driven and does not require any prior knowledge, so the transformer is well suited to extracting high-level motion features from the skeleton sequence. To fully exploit the potential of the transformer in extracting high-level features, and to give it the same ability as GCN to use the prior knowledge of the skeleton graph for extracting low-level features, we propose a graph-aware self-attention module, whose structure is shown in Fig. 1. We add a temporal convolutional network (TCN) layer [62] at the end of the module and employ a residual connection [22]. More importantly, based on the multi-head self-attention mechanism, we propose a multi-head graph-aware self-attention, whose detailed structure is shown in the right part of Fig. 1. We use a variety of prior graph-aware (G-A) masks to constrain the attention map (\(QK^T\)) in this module. Specifically, let \(W^A\) and \(M^i\) denote the attention map and the graph-aware mask, respectively; the graph-aware attention map is then calculated as \(W^A \odot M^i\), where \(\odot \) represents element-wise multiplication. Next, we introduce how to construct the G-A masks M.
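A minimal sketch of one graph-aware attention head is given below, assuming (consistently with Eq. 5 later) that the mask is applied to the softmax-normalized attention map by element-wise multiplication; the function name and the per-frame formulation are illustrative, not the authors' implementation.

```python
import math
import torch

def graph_aware_attention(q, k, v, ga_mask=None):
    """One attention head with an optional graph-aware (G-A) mask.

    q, k, v: (n_joints, d) features of the joints.
    ga_mask: binary (n_joints, n_joints) matrix M^i (M^L, M^P, or None
    for a free head without any mask).
    """
    w_a = torch.softmax(q @ k.t() / math.sqrt(k.size(-1)), dim=-1)  # W^A
    if ga_mask is not None:
        w_a = w_a * ga_mask  # W^A ⊙ M^i: keep only graph-allowed correlations
    return w_a @ v

# A free attention head is simply the same call with ga_mask=None.
```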

Fig. 2 Illustration of the Link Graph (red) and the Part Graph (yellow)

Fig. 3 The architecture of the graph-aware transformer (GAT) model

There are two kinds of natural connecting structures among the joints of the human body. The first is the link graph, which connects all the joints of the human body according to the body's physical structure (see the Link Graph in Figs. 2 and 3). During the movement of the human body, physically adjacent joints have a significant influence on each other. We use the same representation rule as the adjacency matrix to represent this link-graph mask \(M^L\): that is, \(M^{L}_{i,j}=1\) if the ith and the jth vertices are connected, and 0 otherwise. The second kind of natural connecting structure is the part graph, which connects subsets of joints according to the part-based body structure (see the Part Graph in Figs. 2 and 3). Different parts of the human body show different motion characteristics during movement. We divide the human body into 5 parts, namely the upper left limb, upper right limb, lower left limb, lower right limb, and torso. The joints contained in one part are regarded as a fully connected subgraph. We use the part-graph mask \(M^P\) to represent this partial connecting structure. By combining the attention head with these two kinds of masks, we obtain the link-aware attention head and the part-aware attention head, respectively. The link-aware attention head focuses on extracting motion information based on the physical structure of the human body, while the part-aware attention head is more interested in extracting the motion correlation of human body parts. To avoid losing global information, we also use a free attention head without any mask. By fusing the above three kinds of attention heads, we construct a graph-aware transformer (GAT), which makes full use of the prior graph structure and effectively extracts discriminative motion features of the skeleton graph.
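The two masks can be built directly from the skeleton definition, as in the sketch below. The edge list and the part grouping are a toy example rather than the actual NTU joint indexing, and keeping self-connections in \(M^L\) is our assumption.

```python
import torch

def build_link_mask(edges, n_joints):
    """Link-graph mask M^L: M^L[i, j] = 1 iff joints i and j are connected."""
    m = torch.eye(n_joints)               # keep self-connections (assumption)
    for i, j in edges:
        m[i, j] = m[j, i] = 1.0
    return m

def build_part_mask(parts, n_joints):
    """Part-graph mask M^P: joints inside the same body part are treated
    as a fully connected subgraph (5 parts: two arms, two legs, torso)."""
    m = torch.zeros(n_joints, n_joints)
    for joints in parts:
        idx = torch.tensor(joints)
        m[idx.unsqueeze(1), idx.unsqueeze(0)] = 1.0
    return m

# Toy skeleton with 6 joints; edges and parts are illustrative only.
edges = [(0, 1), (1, 2), (1, 3), (0, 4), (0, 5)]
parts = [[1, 2], [1, 3], [0, 4], [0, 5], [0, 1]]
M_L = build_link_mask(edges, n_joints=6)
M_P = build_part_mask(parts, n_joints=6)
```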

Fig. 4 The flowchart of the velocity-driven correlation mechanism

4.2 Graph-aware transformer

The architecture of the designed graph-aware transformer (GAT) is shown in Fig. 3. First, we use a normalization layer (Norm) and a single-head self-attention layer (SSA) to preprocess the input skeleton data. The dimension of the input skeleton data is \({\mathbb {R}}^{n\times 3\times T}\); after preprocessing, the feature dimension becomes \({\mathbb {R}}^{n\times 54\times T}\), which is used as the input of the GAT. The GAT consists of two parts: the first part has 6 layers and the second part has 3 layers. Each layer is a multi-head attention module with h attention heads, and the multi-head attention operation is formulated as Eq. 2. In the first part, we use three kinds of attention heads, namely the link-aware attention head (LG-A), the part-aware attention head (PG-A), and the free attention head (without a graph-aware mask, Free-A). The LG-A is the self-attention head with the link-graph-aware mask, and the PG-A is the self-attention head with the part-graph-aware mask. At this stage, the model can extract rich low-level features with the help of the prior knowledge provided by the various attention heads. The output feature dimension of the first part is \({\mathbb {R}}^{n\times 192\times T/2}\). In the second part, we only use the free attention head to extract high-level features; this is completely data-driven, and no prior graph structure limits its perception field. The output feature dimension of the second part is \({\mathbb {R}}^{n\times 276\times T/4}\). Finally, the classifier, which consists of a global average pooling layer (Avg-Pooling), a fully connected layer (FC), and a softmax function, classifies the human action of the skeleton sequence based on the features extracted by the GAT.
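For clarity, the layer and channel layout described above can be summarized as a configuration sketch. The dictionary format and key names are ours, not the authors' configuration file; the depths, head types, feature widths, and temporal lengths follow the text.

```python
# Summary of the GAT stack described in the text; purely illustrative.
gat_config = {
    "preprocess": {"layers": ["Norm", "SSA"], "out_channels": 54},
    "stage1": {
        "num_layers": 6,
        "heads_per_layer": "h",
        "head_types": ["LG-A", "PG-A", "Free-A"],
        "out_channels": 192,
        "temporal_length": "T/2",
    },
    "stage2": {
        "num_layers": 3,
        "heads_per_layer": "h",
        "head_types": ["Free-A"],          # purely data-driven, no masks
        "out_channels": 276,
        "temporal_length": "T/4",
    },
    "classifier": ["Avg-Pooling", "FC", "Softmax"],
}
```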

4.3 Velocity-driven correlation

Referring to Eq. 1, the self-attention mechanism is driven by the correlation between joints: the larger the product \(QK^T\) between two joints, the stronger their correlation and the larger the corresponding attention weight. Intuitively, during human motion, joints with similar velocities have a strong correlation. For example, when running, there is a significant correlation between the speeds of the hands and feet. Another example is clapping, where the correlated speed and direction of the two hands are the key features that distinguish this action. Based on the above analysis, and combined with the characteristics of the self-attention mechanism, we use the first-order temporal difference (velocity vector) of the joint coordinates to enhance the joint features, which is formulated as:

$$\begin{aligned} \begin{aligned} S_j = X_{j}^{:,:,1:T} - X_{j}^{:,:,0:T-1}, \quad X_j^{'} = Concat(X_j, S_j), \end{aligned} \end{aligned}$$
(3)

where \(S_j\) is the velocity vector of the joints and \(X_j^{'}\) is the enhanced feature. To make the dimension of \(S_j \in {\mathbb {R}}^{n\times 3\times (T-1)}\) match that of \(X_j\), we pad \(S_j\) with zeros so that \(S_j \in {\mathbb {R}}^{n\times 3\times T}\). Among GCN-based models, no prior work uses this velocity vector to enhance the joint features, because a GCN cannot fully exploit the correlation between joint velocities, so the performance of the model does not significantly improve after adding this feature. In contrast, our model is based on the transformer, which can effectively use the correlation between joint velocities and learn a velocity-driven attention map to extract motion features. Furthermore, unlike existing approaches that use an additional stream to process the first-order spatial difference, we directly concatenate the velocity vector of each joint with its initial position vector and use a single-stream network to process this enhanced feature, so there is almost no increase in computation cost. The dimension of the enhanced input skeleton data is \({\mathbb {R}}^{n\times 6\times T}\).
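The velocity enhancement of Eq. 3 reduces to a few tensor operations, sketched below. Padding the last frame (rather than the first) with zeros is our assumption, since the text only states that \(S_j\) is zero-filled to length T.

```python
import torch
import torch.nn.functional as F

def velocity_enhance(x_j: torch.Tensor) -> torch.Tensor:
    """Eq. (3): concatenate joint positions with their first-order temporal
    difference (velocity).  x_j: (n, 3, T)  ->  X'_j: (n, 6, T)."""
    s_j = x_j[:, :, 1:] - x_j[:, :, :-1]   # velocity vectors, (n, 3, T-1)
    s_j = F.pad(s_j, (0, 1))               # zero-pad the last frame to length T
    return torch.cat([x_j, s_j], dim=1)    # X'_j = Concat(X_j, S_j)

x_j = torch.randn(25, 3, 300)              # 25 joints, 300 frames
print(velocity_enhance(x_j).shape)         # torch.Size([25, 6, 300])
```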

To further utilize the correlation between joint velocities, we use the position vector and the velocity vector to calculate two different attention maps (i.e., the position-driven attention map and the velocity-driven attention map, as shown in Fig. 4). We take the average of the two attention maps as the final attention map. In summary, our velocity-driven correlation is formulated as:

$$\begin{aligned} W^A = \frac{1}{2}(softmax(\frac{Q_X{K_X}^{T}}{\sqrt{d_{k}}}) + softmax(\frac{Q_S{K_S}^{T}}{\sqrt{d_{k}}})). \end{aligned}$$
(4)

The graph-aware velocity-driven self-attention function can be formulated as:

$$\begin{aligned} Attention(X) = (W^A \odot M^i)X^{'}, \end{aligned}$$
(5)

where \(W^A\) and \(M^i\) denote the attention map and the graph-aware mask, respectively, and \(X^{'}\) is the velocity-enhanced feature.
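Putting Eqs. 4 and 5 together, one graph-aware velocity-driven attention head can be sketched as follows. How \(Q_X, K_X\) and \(Q_S, K_S\) are projected from the position and velocity parts of \(X^{'}\) is not detailed in the text, so passing them as separate inputs here is an assumption.

```python
import math
import torch

def velocity_driven_attention(q_x, k_x, q_s, k_s, v, ga_mask=None):
    """Eqs. (4)-(5): average the position-driven and velocity-driven attention
    maps, optionally constrain them with a graph-aware mask, and aggregate
    the velocity-enhanced values.

    q_x, k_x: queries/keys from the position part; q_s, k_s: from the velocity
    part; v: values from the enhanced feature X'. All shapes: (n_joints, d).
    """
    d_k = k_x.size(-1)
    w_pos = torch.softmax(q_x @ k_x.t() / math.sqrt(d_k), dim=-1)
    w_vel = torch.softmax(q_s @ k_s.t() / math.sqrt(d_k), dim=-1)
    w_a = 0.5 * (w_pos + w_vel)            # Eq. (4)
    if ga_mask is not None:
        w_a = w_a * ga_mask                # W^A ⊙ M^i
    return w_a @ v                         # Eq. (5)
```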

Our velocity-driven correlation mechanism can fully exploit the speed relationship of the joints and can obtain the attention map containing rich motion information, which helps the model extract the spatial–temporal motion information of the skeleton sequence effectively.

5 Experiments

5.1 Datasets and implementation details

Three popular skeleton action datasets, i.e., NTU RGB+D 60 (NTU60) [43], NTU RGB+D 120 (NTU120), and Kinetics-Skeleton (KS) [62] are selected for our experiments.

5.1.1 NTU-RGB+D 60

NTU60 [43] is a large-scale dataset with annotated 3D joint coordinates of the human body for human action recognition. It contains 56,000 action videos covering 60 action classes. These videos were captured indoors from 40 volunteers aged between 10 and 35. For each action, the videos are obtained by 3 cameras from different viewpoints, and the 3D annotations of the human body joints are given in the camera coordinate system. Each action video has no more than 2 subjects, and there are 25 key joints for each subject in the skeleton sequences. The dataset includes two settings: (1) the Cross-Subject (CS) benchmark, which contains 40,320 videos for training and 16,560 videos for testing; in this setting, the training set comes from a subset of 20 subjects and the remaining 20 subjects are used for evaluation. (2) The Cross-View (CV) benchmark, which includes 37,920 videos for training and 18,960 videos for testing; in this setting, the training samples come from camera viewpoints 2 and 3, while camera viewpoint 1 is used for evaluation. We follow the conventional settings in [43] and report the top-1 accuracy on both benchmarks.

5.1.2 NTU-RGB+D 120

NTU120 [34] is an extension of NTU60, which adds 57,367 new skeleton sequences representing 60 new actions, for a total of 113,945 videos covering 120 classes from 106 subjects under 32 camera setups. It includes two settings: (1) the Cross-Subject (X-Sub) benchmark, in which the 106 subjects are split into training and testing groups of 53 subjects each; (2) the Cross-Setup (X-Set) benchmark, in which the training data comes from samples with even setup IDs and the testing data comes from samples with odd setup IDs.

Fig. 5 Visualization of the velocity-driven attention maps and the position-driven attention maps for different human action classes on the NTU60 dataset. The values in each attention map are averaged over all attention heads without graph-aware masks in the second layer of the model; brighter areas indicate larger attention weights, i.e., stronger correlations between the two joints

5.1.3 Kinetics-Skeleton

Kinetics [24] consists of 300,000 videos covering 400 action classes. The video clips of Kinetics are abundant and diverse, sourced from YouTube, but the dataset only provides raw videos without skeleton annotations. Yan et al. [62] used the OpenPose toolbox to estimate the locations of 18 joints on every frame of the videos and released the Kinetics-Skeleton dataset. In Kinetics-Skeleton, all videos are converted to a frame rate of 30 fps and resized to \(340\times 256\) resolution. The OpenPose toolbox generates the 2D coordinates and a confidence score for the 18 joints of each human body from the processed videos. For multi-person clips, the two main people are selected according to the average joint confidence. Each joint is represented by its 2D coordinates and confidence score, which form a three-element feature vector. Following the evaluation protocol of Yan et al. [62], we train the models on the training set and report the top-1 and top-5 accuracies on the testing set.

We implement our model in the PyTorch deep learning framework [39]. We apply the stochastic gradient descent (SGD) algorithm with Nesterov momentum (0.9) as the optimizer. We use 4 Nvidia GTX 1080Ti GPUs for model training and set the batch size to 48. For the NTU60 and NTU120 datasets, the number of training epochs is set to 60 and the initial learning rate is 0.1; the learning rate is decayed by a factor of 0.1 at the 30th, 40th, and 50th epochs. For the Kinetics-Skeleton dataset, the number of training epochs is set to 70 and the initial learning rate is 0.1; the learning rate is decayed by a factor of 0.1 at the 40th, 50th, and 60th epochs.
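The reported optimizer and schedule translate directly into standard PyTorch calls, as sketched below for the NTU schedule. The model and data loader are placeholders, and no weight decay is used since none is reported.

```python
import torch
import torch.nn as nn

# Placeholders standing in for the GAT model and the skeleton data loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(25 * 6 * 300, 60))
train_loader = [(torch.randn(48, 25, 6, 300), torch.randint(0, 60, (48,)))]

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, nesterov=True)
# NTU60/NTU120: 60 epochs, learning rate decayed by 0.1 at epochs 30, 40, 50
# (Kinetics-Skeleton: 70 epochs with milestones 40, 50, 60).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 40, 50], gamma=0.1)

for epoch in range(60):
    for skeletons, labels in train_loader:          # batch size 48
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(skeletons), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```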

5.2 Ablation study

We present an ablative analysis on the NTU60 CV benchmark to evaluate the effectiveness of the proposed model. We analyze the effect of the transformer-based model, LG-A mask, PG-A mask, and the velocity-driven correlation. ST-GCN [62] is our baseline.

Table 1 Comparison of the top-1 and top-5 accuracy on the NTU60 CV benchmark with different model configurations. V-D means velocity-driven. h is the number of the attention heads

The results in Table 1 show that our transformer-based model performs better than the GCN-based baseline. The free transformer outperforms the baseline by 2.61% (90.91% vs 88.30%), and the velocity-driven free transformer outperforms the velocity-enhanced baseline by 4.04% (92.49% vs 88.45%). These results demonstrate that the transformer-based model is significantly stronger than the GCN-based model at extracting spatial–temporal motion features of the skeleton sequence. Moreover, the transformer-based model can fully exploit the velocity correlation of human body joints: adding the velocity-driven correlation improves the free transformer by 1.58% (92.49% vs 90.91%), whereas the GCN-based baseline with the velocity-enhanced feature improves by only 0.15% (88.45% vs 88.30%). Figure 5 shows the velocity-driven attention maps and the position-driven attention maps for different human action classes on the NTU60 dataset. The values in each attention map are averaged over all attention heads without graph-aware masks in the second layer of the model; brighter areas indicate larger attention weights, i.e., stronger correlations between the two joints. We can see that for different actions, the bright areas of the velocity-driven attention map are more concentrated than those of the position-driven attention map. Taking the action “clapping” as an example, the activation values of the velocity-driven attention map are concentrated on a few hand joints with salient motions, which means the velocity-driven attention map highlights the significant joints of a human action better than the position-driven attention map. Therefore, the velocity information better reflects the motion correlation of the joints during the action. Our GAT, which fuses the velocity-driven and position-driven correlation mechanisms, can exploit rich motion information from the skeleton sequence.

Extensive experiments are performed to test the impact of the graph-aware masks. Table 1 shows the results of the GAT with the LG-A mask, the GAT with the PG-A mask, and the GAT with both the LG-A and PG-A masks. For the GAT with only the LG-A mask or only the PG-A mask, half of the attention heads carry the mask and the others are free attention heads. For the GAT with both masks, h/3 of the attention heads carry the LG-A mask, h/3 carry the PG-A mask, and the others are free attention heads (see Fig. 3). The total number of attention heads h is 6. The LG-A mask brings a 1.29% (93.78% vs 92.49%) improvement to the transformer-based model, and the PG-A mask brings a 0.89% (93.38% vs 92.49%) improvement. Combining the LG-A and PG-A masks brings a 1.43% (93.92% vs 92.49%) improvement in top-1 accuracy. These results show that the prior graph-aware masks help the transformer extract motion details from the physical and part-based structures of the human body. On the other hand, when we use graph-aware attention heads in all transformer layers without any free attention head, the performance of the model decreases (see the last row of Table 1; the results are 0.34% and 0.10% lower than those of the GAT (\(h=6\), the 7th row of Table 1) in top-1 and top-5 accuracy, respectively). This is because high-level skeleton motion features require the model to have a global perception field, but the graph-aware masks limit the perception field. Therefore, the free attention head without any limitation is necessary for the transformer to extract multi-scale and multi-granularity features. Consequently, in our model, we combine the graph-aware attention heads and the free attention head in the first 6 layers and only use the free attention head in the last 3 layers (see Fig. 3). The influence of the number of attention heads h in each transformer layer is also shown in Table 1: compared with \(h=3\) and \(h=9\), the GAT with \(h=6\) achieves the best performance.

Table 2 Comparison of the top-1 accuracy with the state of the arts on the NTU60 dataset
Fig. 6 The confusion matrix on the NTU60 CV benchmark (left) and the comparison of the classification accuracy over the 60 action categories on the NTU60 CV benchmark (right). The accuracy of the GAT is shown as the red line and the accuracy of ST-GCN as the blue dotted line

Table 3 Comparison of the top-1 accuracy with the state of the arts on the NTU120 dataset
Table 4 Comparison with the state of the arts on the KS dataset

5.3 Comparison with state of the arts

We compare the proposed GAT model with state-of-the-art skeleton-based action recognition methods on the NTU60, NTU120, and KS datasets. The methods selected for comparison include CNN-based methods [2, 26, 28, 29], a 3D-CNN-based method [19], RNN-based methods [18, 33], GCN-based methods [11, 30, 44, 45, 62, 73], and a transformer-based method [42]. In this experiment, we use a two-stream (joint stream + bone stream) GAT model: the final classification score is the sum of the two stream scores, and the reported number of parameters covers both streams. The results on the NTU60 dataset are shown in Table 2. Our GAT outperforms the CNN-based, RNN-based, and GCN-based methods on both the CS and CV benchmarks, which proves that the transformer has great advantages in dealing with skeleton data. Our results outperform ST-GCN [62] by 6.9% (95.2% vs 88.3%) on the CV benchmark and 7.5% (89.0% vs 81.5%) on the CS benchmark with fewer parameters (5.86M vs 6.20M). Figure 6 shows the confusion matrix of our GAT on the NTU60 CV benchmark (left) and the comparison of the classification accuracy with ST-GCN over the 60 action categories (right), where the accuracy of the GAT is shown as the red line and the accuracy of ST-GCN as the blue dotted line. It can be seen that our GAT improves the classification accuracy on all action categories, because our velocity-driven correlation fully exploits the correlation between the motion speeds of joints, which helps distinguish human actions. The results in the red boxes of Fig. 6 also reveal that the main difficulty for skeleton-based action recognition is distinguishing confusing human actions, such as “writing” and “type on a keyboard,” which needs to be further studied. The results on the NTU120 dataset are shown in Table 3. Our GAT performs better than the CNN-based and RNN-based methods. Compared with the state-of-the-art GCN-based method Shift-GCN [11], our GAT also obtains competitive results (86.1% vs 86.6% on the X-Set benchmark and 84.0% vs 85.3% on the X-Sub benchmark). Results on the Kinetics-Skeleton dataset are shown in Table 4. Our GAT surpasses the other competitive methods in both top-1 and top-5 accuracy, which demonstrates that our GAT model is more robust to noisy 2D skeleton data in real-world videos. Using only the spatial attention mechanism, our GAT outperforms S-TR [42] on all three datasets, NTU60, NTU120, and KS.

6 Conclusions

In this work, we proposed a novel graph-aware transformer (GAT), which can fully utilize the velocity correlation of human joints to extract motion features of the skeleton sequence. The link-aware attention and the part-aware attention are the core modules of the GAT; they are designed by fusing the graph-aware masks with the attention map to effectively make use of the prior skeleton graph structures. Extensive experiments are conducted on three large-scale datasets to evaluate the performance of our method. The results verify that the proposed transformer-based model outperforms the GCN-based baseline by a large margin and that the GAT obtains remarkable performance in extracting spatial–temporal deep features for skeleton-based action recognition. In the future, incorporating context information and using the attention mechanism to learn temporal features more effectively need to be further investigated.