
1 Introduction

Given its potential applications in video surveillance [13] and virtual reality [11], human action recognition has garnered significant interest from both academia and industry. Compared to raw video data, skeleton data offers several advantages, including robustness to complex background interference and adaptability to dynamic changes. Consequently, researchers have developed various skeleton-based action recognition methods. While existing methods are diverse, there is a consensus that extracting sufficient spatial-temporal information is crucial. Traditional approaches commonly use handcrafted features to model the spatial arrangement of human joints and the dynamic information in the temporal dimension. However, these carefully designed features are tailored to specific data and applications and generalize poorly. Deep learning techniques have evolved rapidly in recent years and are widely used for automatic feature extraction. Representative networks include convolutional neural networks (CNNs) for processing static images and recurrent neural networks (RNNs) for modeling long-term contextual information in sequential data, such as joint coordinate sequences. Because the skeleton's natural physical connections form a non-Euclidean data structure, Yan et al. [32] pioneered graph-based approaches that model joints and their connections with graph convolutional networks (GCNs) and temporal convolution. Since then, GCNs have become the dominant deep neural network architecture for skeleton-based action recognition. Despite their success, GCNs still struggle to establish long-term temporal dependencies and often overlook cooperative relationships between joints during motion. For instance, the “clapping” motion relies heavily on the cooperation between the left and right hands, yet explicitly modeling every pairwise joint relationship introduces a considerable computational burden.

In this paper, we propose a novel framework for skeleton-based action recognition called the Multi-Stream Graph Transformer (MS-GTR), illustrated in Fig. 1. This framework enables effective multi-scale processing of skeleton information and extraction of representative spatio-temporal features. Concretely, we improve the transformer not only to model sequence context dependencies but also to incorporate the graph structure of the skeleton into action recognition. Additionally, we extract diverse information from the joint trajectories to enrich the range of motion representations. To limit computational complexity, we divide the data into a main branch and auxiliary branches. While the main branch is always involved in feature extraction, the extra streams, e.g., self-similarity matrices (SSM) and frame differences, provide short-range dynamic information to support the main feature extractor. We employ cross-attention for information exchange between the branches to facilitate efficient integration of motion features across different scales.

As depicted in Fig. 1, the main unit provides only a token representing global information, which interacts with the features conveyed by the auxiliary branch through cross-attention. After absorbing this supplemental information, the token returns to the main unit and undergoes the subsequent operations. We conducted experiments on several human action datasets, including HDM05, NTU RGB+D, and NTU RGB+D 120, and the results validate the value of our approach for improving action recognition performance.

Our main contributions to this work are summarized as follows:

  1. A novel graph Transformer architecture is proposed to represent the higher-order spatial-temporal features of action sequences and to eliminate the redundant dependencies associated with fixed body connectivity.

  2. We propose a multi-stream model called MS-GTR that consists of two distinct branches: the main branch extracts long-term dynamic features directly from the joint data, while the auxiliary branch provides short-term information.

Fig. 1. Illustration of the proposed Multi-Stream Graph Transformer.

2 Related Work

2.1 Vision Transformer

The transformer [29] is a well-known attention-based neural network architecture originally proposed for natural language processing. Beyond its success in NLP, the transformer has also proven effective for many fundamental computer vision tasks, e.g., classification [2, 8, 35], detection [1, 12], and segmentation [30, 36]. In particular, Zhang et al. [35] introduced the Video Transformer (VidTr) with spatio-temporal separable attention, which outperformed convolution-based approaches for video classification. Sun et al. [27] built a multi-stream transformer network to model motion at different scales, taking advantage of the transformer's ability to capture long-range temporal dependencies. Chen et al. [2] applied a dual-branch vision transformer to multi-scale feature extraction and image classification, and proposed a simple and effective cross-attention scheme for exchanging information between branches. Inspired by their work, we design a novel network that supplies supplementary information about motion sequences at different scales through cross-attention. EAPT [47] proposes Deformable Attention, which learns offsets for each position in patches to obtain non-fixed attention information that can cover various visual elements.

2.2 Skeleton-Based Action Recognition

Skeleton-based action recognition aims to identify actions from human skeleton sequences. Sequence-modeling networks such as RNNs and temporal convolutional networks have the significant advantage of fully accounting for the long-term contextual associations within an action. For example, Du et al. [9] fed the hierarchical structure of the human skeleton into an end-to-end hierarchical RNN, where body parts were reused and spliced together as the number of layers increased. The two-stream temporal convolution network proposed by Jia et al. [15] made full use of inter-frame and intra-frame action characteristics. Xie et al. [18] proposed a temporal-then-spatial recalibration scheme that introduced an attention mechanism to recalibrate the temporal attention of frames before further processing with a convolutional neural network.

The above methods for skeleton-based action recognition primarily focus on capturing temporal features from human skeleton sequences, but they struggle to extract spatial characteristics from the topology of joint connections. Graph convolutional networks (GCNs) have emerged as a promising solution to this challenge. Yan et al. [32] were the first to apply GCNs to model dynamic skeletons for this task, but the expressiveness of their model was heavily constrained by the predefined graph topology. Rather than manually fixing the graph topology, Shi et al. [25] developed an adaptive GCN that learns the topology either uniformly or individually. Cheng et al. [5] used a parameterized topology for channel groups, but the resulting model was heavily parameterized. Going a step further, Chen et al. [4] proposed a channel-wise graph convolution that shares a learnable topology as a generic prior for all channels and refines a channel-specific topology for each channel, overcoming the inflexibility of previous methods such as 2s-AGCN [25]. The adaptive graph convolutional block used in our proposed model to capture spatial features is similar to these channel-wise methods. GAT [48] utilizes velocity information in a data-driven manner to learn discriminative spatial-temporal motion features from the sequence of skeleton graphs.

3 Method

3.1 Graph Relative Transformer

Motivation. Expressing higher-order spatial topology, adequately capturing contextual relationships, and effectively modeling spatial-temporal dependencies are essential for a distinctive representation of human action. However, modeling long-term joint relationships is often constrained by reliance on the fixed natural connections of the human body, which introduces redundant dependencies: because the model focuses excessively on the genuine physical links of the body, potential interactions between joints are easily overlooked. At the same time, temporal feature extraction tends to over-rely on the temporal convolution module; a fixed convolution kernel cannot adapt to feature changes across different periods, resulting in inadequate local feature extraction. As shown in Fig. 2, we aim to develop a graph topology that goes beyond the natural connectivity of the human body and can represent latent information about human pose. We use this topology, together with the improved Transformer, to extract spatial-temporal features. The goal is to capture long-term dependencies while retaining constraints on the higher-order spatial information of the skeleton in a lightweight manner.

Fig. 2. Illustration of the proposed graph relative transformer.

Implementation of Graph. We seek a reasonable and relatively accessible graph topology to guide the construction of the skeleton's spatial information. Our model involves two forms of graph convolution units for gathering spatial details within a single frame. In the first approach, we follow the key design of the spatial graph convolutional network proposed in [32]. The difference is that the sampling function is redefined using attention scores instead of inter-joint connections, and the partition strategy is redesigned by manually setting thresholds to extract joint information at different scales. We compute the attention score between vertices as follows:

$$\begin{aligned} a_{ij} = {\textbf {x}}_i{\textbf {w}}_i \cdot ({\textbf {x}}_j{\textbf {w}}_j)^T \end{aligned}$$
(1)

where \({\textbf {w}}\) is a weight parameter. After normalization we obtain an \(N \times N\) attention score matrix and partition each vertex's neighborhood by thresholds. This partition strategy filters the vertices most related to a given vertex:

$$\begin{aligned} N(v_i)=\{v_j|threshold_{low} <a_{ij} \le threshold_{up}\} \end{aligned}$$
(2)

where \(threshold_{up}\) and \(threshold_{low}\) are the upper and lower limits of a threshold band, respectively. However, computing the attention scores dramatically increases the computational complexity of the model, wasting computational resources, and this approach does not yield the expected improvement in the experiments reported in Sect. 4.3.
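For concreteness, the following PyTorch sketch illustrates the attention-score sampling and threshold partition of Eqs. (1)-(2). It is not the authors' released code: the embedding dimension, the threshold values, and the use of a softmax for normalization are our own assumptions.

```python
# Minimal sketch of Eqs. (1)-(2): per-joint attention scores replace the fixed
# skeleton adjacency, and threshold bands carve out multi-scale neighborhoods.
import torch
import torch.nn as nn

class AttentionPartition(nn.Module):
    def __init__(self, in_channels, embed_dim=16, thresholds=(0.0, 0.5, 1.0)):
        super().__init__()
        self.w_i = nn.Linear(in_channels, embed_dim, bias=False)
        self.w_j = nn.Linear(in_channels, embed_dim, bias=False)
        self.thresholds = thresholds  # assumed band boundaries

    def forward(self, x):
        # x: (batch B, joints N, channels) joint features of a single frame
        a = self.w_i(x) @ self.w_j(x).transpose(-1, -2)   # Eq. (1): (B, N, N)
        a = a.softmax(dim=-1)                             # normalize the scores
        # Eq. (2): one binary neighborhood mask per (low, up] threshold band
        masks = [(a > lo) & (a <= up)
                 for lo, up in zip(self.thresholds[:-1], self.thresholds[1:])]
        return a, torch.stack(masks, dim=1)               # (B, bands, N, N)

# usage: scores, neighborhoods = AttentionPartition(3)(torch.randn(8, 25, 3))
```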

Inspired by [4], we aim to capture potential dependencies between joints in a manner that extends beyond the channels of the original spatial coordinates. We therefore design a learnable matrix to represent the degree of association between joints; unlike the attention score matrix, it is parameterized and obtained through model optimization. Because operating globally in this way may lose some graph structure, we also add the adjacency matrix as a guide to the natural connections. Another advantage of our approach is that the channels are grouped and globally average-pooled within each group, which helps simplify the network. As shown in Fig. 2, relying on the joint associations provided by the learnable and adjacency matrices, we apply a graph convolution to the features extracted by a depthwise separable convolution block to update the nodal features. The resulting feature vectors are then directly involved in extracting inter-frame dependencies.
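The sketch below shows one way such a unit could be assembled from the ingredients named above (fixed adjacency, learnable association matrix, channel grouping, depthwise separable convolution). The group count, tensor layout, and placeholder adjacency are assumptions rather than the paper's exact design.

```python
# Hedged sketch of the second graph unit: a learnable association matrix B is
# added to the fixed adjacency A, channels are grouped, and a depthwise
# separable 1x1 convolution refines features before graph aggregation.
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    def __init__(self, channels, num_joints=25, groups=8, adjacency=None):
        super().__init__()
        A = adjacency if adjacency is not None else torch.eye(num_joints)
        self.register_buffer("A", A)                 # fixed natural connections
        self.B = nn.Parameter(torch.zeros(groups, num_joints, num_joints))  # learnable topology
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.groups = groups                          # channels must divide evenly by groups (assumed)

    def forward(self, x):
        # x: (batch B, channels C, frames T, joints N)
        feat = self.pointwise(self.depthwise(x))
        B_, C, T, N = feat.shape
        feat = feat.view(B_, self.groups, C // self.groups, T, N)
        topo = self.A + self.B                              # (groups, N, N)
        out = torch.einsum("bgctn,gnm->bgctm", feat, topo)  # per-group graph aggregation
        return out.reshape(B_, C, T, N)
```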

Improved Transformer. The standard transformer is the backbone of our model, which we improve to achieve better performance and establish a baseline for subsequent experiments. To incorporate the spatial details of the skeleton, we reconstruct the query vector in the following way:

$$\begin{aligned} {\textbf {q}}^t_i = \sum _{v^t_j \in N(v^t_i)} {a^t_{ij}}\phi _q({\textbf {x}}^t_j) \end{aligned}$$
(3)

where \(a^t_{ij}\) denotes the generalized weight coefficient between vertex i and vertex j at time t. After fully considering the spatial feature map, we obtain each frame's corresponding key-value vector pairs by transforming the channel features using Eqs. (4) and (5):

$$\begin{aligned} {\textbf {k}}^t_i = \phi _k({\textbf {x}}^t_i) \end{aligned}$$
(4)
$$\begin{aligned} {\textbf {v}}^t_i = \phi _v({\textbf {x}}^t_i) \end{aligned}$$
(5)

In the temporal dimension, the focus is on modeling the frame-to-frame dependencies to obtain a representative feature map that incorporates the attention mechanism of the transformer:

$$\begin{aligned} \alpha ^{ij}_n = {\textbf {q}}^i_n \cdot {{\textbf {k}}^j_n}^T \end{aligned}$$
(6)

where i, j index frames of the action sequence and n indexes the joint node. The features of a given joint in one frame are updated with information shared by the same joint in other frames, which establishes strong long-term dependencies. A global information token is introduced to summarize the entire time sequence, similar to the class token used in natural language processing. The update equation for a joint's features in a given frame is:

$$\begin{aligned} {\textbf {y}}^i_n = \sum _j \sigma \left( \frac{\alpha ^{ij}_n}{\sqrt{d_k}}\right) {\textbf {v}}^j_n \end{aligned}$$
(7)

where \(\sigma \) is an activation function that normalizes the input. After all frames have been aggregated, a subsequent feed-forward layer adjusts the dimensions of the output and adds further representational capacity to the model.
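The following sketch ties Eqs. (3)-(7) together under our own shape conventions; implementing \(\phi_q, \phi_k, \phi_v\) as linear projections and omitting multi-head splitting and the global token are simplifications, not the paper's exact implementation.

```python
# Compact sketch of the graph relative attention: graph-aggregated features
# serve as queries, keys and values are plain channel projections, and
# attention runs across frames independently for every joint.
import torch
import torch.nn as nn

class GraphRelativeAttention(nn.Module):
    def __init__(self, channels, d_k=64):
        super().__init__()
        self.phi_q = nn.Linear(channels, d_k)
        self.phi_k = nn.Linear(channels, d_k)
        self.phi_v = nn.Linear(channels, d_k)
        self.d_k = d_k

    def forward(self, x, a):
        # x: (batch B, frames T, joints N, channels C)
        # a: (B, T, N, N) generalized weight coefficients a^t_{ij}
        q = torch.einsum("btij,btjd->btid", a, self.phi_q(x))   # Eq. (3)
        k, v = self.phi_k(x), self.phi_v(x)                     # Eqs. (4)-(5)
        # Eq. (6): per-joint attention between frame i and frame j
        alpha = torch.einsum("bind,bjnd->bnij", q, k) / self.d_k ** 0.5
        alpha = alpha.softmax(dim=-1)
        # Eq. (7): update each joint from the same joint in all frames
        return torch.einsum("bnij,bjnd->bind", alpha, v)        # (B, T, N, d_k)
```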

3.2 Multi-stream Model Architecture

Based on GTR, we construct a robust MS-GTR model that integrates a variety of dynamic information streams for skeleton-based action recognition. The overall architecture is shown in Fig. 1. The proposed algorithm operates on an action sequence \(S=\{s^1,\dots , s^t, \dots , s^T\}\) consisting of T frames, where each element \(s^t \in \mathbb {R}^{N \times 3}\) contains the 3D coordinates of all N captured joints at a particular frame. We introduce a main branch and auxiliary branches to capture a broader range of action details.

The main branch is concerned with representing long-term dynamic information in the joint and bone data. Long-term dynamic information refers to motion changes over long periods, usually captured by modeling the contextual relationships of the sequence. Specifically, we introduce the bone representation as an interpretation of inter-joint connections that can be obtained directly from the original joint coordinates; it is fed into the main branch along with the underlying joint features. To calculate bone vectors, which describe the relationship between two joints, we adopt the same approach as [25]: given a pair of head joint \(J_{i}=\{x_{i}, y_{i}, z_{i}\}\) and tail joint \(J_{j}=\{x_{j}, y_{j}, z_{j}\}\), the second-order information is \(B_{i, j} = \{x_{j}-x_{i}, y_{j}-y_{i}, z_{j}-z_{i}\}\).
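As a minimal illustration, the bone stream can be derived from the joint stream as follows; the list of (head, tail) pairs is dataset-specific and assumed to be given.

```python
# Sketch of the second-order bone representation: each bone vector is the
# coordinate difference between a tail joint and its head joint, B_{i,j} = J_j - J_i.
import torch

def bones_from_joints(joints, pairs):
    # joints: (frames T, joints N, 3); pairs: list of (head i, tail j) indices
    heads = joints[:, [i for i, _ in pairs]]   # (T, len(pairs), 3)
    tails = joints[:, [j for _, j in pairs]]
    return tails - heads                       # (T, len(pairs), 3)
```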

The auxiliary branch, in contrast, captures short-term features such as the self-similarity matrix (SSM) and the difference between frames (also known as velocity). Given a set of joint features \(J =\{J_1, J_2, \dots , J_N\}\), we construct the self-similarity matrix \(M_{ssm} \in \mathbb {R}^{N \times N}\) by comparing all elements of the joint feature set with each other, \(M(i,j) = SSM(J_i, J_j)\); the dot product between elements is the simplest choice of similarity. This SSM serves as an input to the auxiliary branch and provides the main branch with an additional information stream describing joint tightness. Likewise, the motion velocity of the joints contains a wealth of action features. The velocity at a particular frame is \(\nu ^t = s^{t+1}-s^t\), a vector reflecting the difference between two consecutive frames of the original action sequence. This velocity information can also be fed into an auxiliary branch to supply the main component with speed characteristics.
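A hedged sketch of both auxiliary streams, using the dot-product similarity mentioned above and zero-padding the last velocity step so the stream keeps T frames (a padding choice we assume, since the paper does not specify one):

```python
# Short-term auxiliary streams: per-frame dot-product self-similarity matrix
# and frame-to-frame difference (velocity) of the joint coordinates.
import torch

def self_similarity(seq):
    # seq: (frames T, joints N, 3) -> (T, N, N), M(i, j) = J_i . J_j per frame
    return torch.einsum("tnc,tmc->tnm", seq, seq)

def velocity(seq):
    # v^t = s^{t+1} - s^t; pad the final step with zeros to keep T frames
    diff = seq[1:] - seq[:-1]
    return torch.cat([diff, torch.zeros_like(seq[:1])], dim=0)
```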

To keep the design simple and efficient, we limit the role of the auxiliary branch to information transfer and rely on the main branch to capture the most useful action features. We employ cross-attention to facilitate interaction between the main and auxiliary branches: the auxiliary streams supplement the main branch through a token carrying global information. Specifically, the main-branch token, \(token_{main}\), is concatenated with the sequence data \(S_{auxiliary} = \{s^1_{aux}, \dots , s^T_{aux}\}\) produced by the auxiliary branch. A self-attention mechanism is then applied to the updated sequence \(S_{fresh} = \{token_{main}, s^1_{aux}, \dots , s^T_{aux}\}\), allowing \(token_{main}\) to also absorb the characteristics of the auxiliary branch.
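The branch interaction can be sketched as follows; using `nn.MultiheadAttention` and returning only the refreshed token are our simplifications of the cross-attention described above.

```python
# Simplified sketch of the branch interaction: the main-branch token is
# concatenated with the auxiliary sequence, attention is applied to S_fresh,
# and only the updated token is handed back to the main branch.
import torch
import torch.nn as nn

class CrossBranchAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, token_main, seq_aux):
        # token_main: (B, 1, dim); seq_aux: (B, T, dim)
        s_fresh = torch.cat([token_main, seq_aux], dim=1)   # {token, s^1_aux, ..., s^T_aux}
        out, _ = self.attn(s_fresh, s_fresh, s_fresh)       # self-attention on S_fresh
        return out[:, :1]                                   # updated main-branch token
```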

4 Experiments

4.1 Datasets

HDM05. HDM05 [22] is captured using optical marker-based technology, which helps reduce noise in the motion capture data. It contains trajectories of 31 joints from 130 motion classes performed by five actors. Some of these 130 categories describe the same action and can be merged, yielding 65 action categories for our experiments.

NTU RGB+D. NTU RGB+D [23] captures motion sequences using three synchronized Microsoft Kinect v2 devices. The dataset contains 56,880 clips from 40 subjects, with each action assigned to one of 60 categories (including 11 two-person interaction categories). The skeleton data comprises the 3D coordinates of 25 major joints per frame. The dataset offers two evaluation criteria: Cross-View splits the data by camera viewpoint; the training set consists of 37,920 samples captured from 45-degree left and right views, while the test set contains 18,960 samples captured from the front view. Cross-Subject validates the model across different subjects; the 40 subjects are divided into training and test groups of 20 actors each, giving 40,320 training samples and 16,560 test samples.

NTU RGB+D 120. NTU RGB+D 120 [40] is a large-scale dataset extended from NTU RGB+D. In addition to the 60 categories of the previous dataset, it adds another 60 classes (i.e., 120 classes in total). The dataset comprises 114,480 action clips captured from 155 camera views with 106 subjects. Its authors likewise recommend two benchmarks: Cross-Subject, as in the previous dataset, groups the data by subject, with 53 subjects in each group (63,026 samples for training and 50,922 clips for validation). Cross-Set constructs the training and test sets according to the camera setups, which differ in height and distance to the subjects; the training set consists of 54,471 samples and the test set contains 59,477 samples.

4.2 Implementation Details

Our model is implemented in PyTorch and trained on an NVIDIA GeForce RTX 3090 GPU. We update the model parameters with gradient descent, specifically a stochastic gradient descent (SGD) optimizer with momentum 0.9 and weight decay 0.0004. We set the maximum number of training epochs to 120 and the batch size to 64. The initial learning rate is 0.01 and is multiplied by 0.1 at epochs 45, 75, and 90. We employ the cross-entropy loss with label smoothing to alleviate over-fitting and improve the model's generalization ability.
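For reference, the training configuration above corresponds roughly to the following PyTorch setup; the model placeholder and the label-smoothing factor are assumptions, since the paper does not report the smoothing value.

```python
# Sketch of the described training setup: SGD with momentum and weight decay,
# step-wise learning-rate decay, and cross-entropy with label smoothing.
import torch
import torch.nn as nn

model = nn.Linear(25 * 3, 60)  # hypothetical placeholder for MS-GTR
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0004)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[45, 75, 90], gamma=0.1)   # decay to 0.1x at these epochs
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # assumed smoothing factor

for epoch in range(120):
    # ... per-batch forward/backward passes with batch size 64 go here ...
    scheduler.step()
```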

4.3 Ablation Study

We demonstrate the contributions of the proposed components of GTR through a series of ablation experiments on the Cross-View benchmark of the NTU RGB+D dataset.

Graph Relative Transformer Block. First, we verify whether it is necessary to use graph convolution to encode the skeleton's spatial structure rather than applying a transformer directly to the time series. As a baseline, we feed the features extracted by a standard transformer to the classifier and compare the results. We also examine the effect of how graph nodes are aggregated and updated. As shown in Table 1, combining the initial transformer with the graph-related blocks significantly improves performance. We therefore select the most effective variant, the channel-associated GTR, as the foundational component for subsequent model upgrades.

Table 1. Comparisons of the action recognition accuracy on Transformer with and without graph dependencies.
Table 2. Comparison of the effect of the presence or absence of auxiliary branch and different action input modalities on recognition results.

Multi-stream Framework. To improve the representation and generalization ability of the model, we introduce an auxiliary branch that supplies different forms of action information to supplement the main unit. We conducted experiments in which the auxiliary streams were blocked and compared the results with those of the main branch alone. The results in Table 2 show that fusing information between branches provides better semantic support for recognition. We also measured how much each auxiliary dynamic representation contributes to recognition. When only the frame difference is used, model performance is the least satisfactory. Although the joint stream alone reaches 90.33%, adding the self-similarity matrix as an auxiliary information stream to the main branch improves the recognition accuracy by 0.86%. This indicates that the collaboration of multi-stream information is more beneficial for the final action recognition.

Fig. 3. Confusion matrix of skeleton-based action recognition with MS-GTR on the Cross-View validation set of the NTU RGB+D dataset.

Table 3. Performance comparison on HDM05.
Table 4. Performance comparison on NTU RGB+D dataset.
Table 5. Performance comparison on NTU RGB+D 120 dataset.

4.4 Confusion Matrix Analysis

As shown in Fig. 3, we visualized the confusion matrix of the Cross-View benchmark results on the NTU RGB+D dataset to identify the categories that cause substantial interference and lead to false recognition. Two situations cause confusion between categories. The first set includes categories where the inability to capture the manipulated object leads to inaccurate recognition, including “A11: reading”, “A12: writing”, “A29: playing with phone or tablet”, and “A30: type on a keyboard”; these actions all involve hand manipulation, and only the specific object differs between categories. The second set of confusing categories consists of action pairs that are temporal reverses of each other, such as “A16: put on a shoe” and “A17: take off a shoe”. A possible explanation is that, as the network goes deeper, positional information becomes less significant.

4.5 Comparison to the State of the Art

To verify the feasibility and effectiveness of our model for action recognition, we conducted experiments on the HDM05 dataset (Table 3), the NTU RGB+D dataset (Table 4), and the NTU RGB+D 120 dataset (Table 5).

Notably, on the HDM05 dataset we achieve a leading result of 99.34%. On both NTU RGB+D and its extended version, our model consistently outperforms recurrent approaches in recognition accuracy, which indicates that our baseline model extracts superior features when establishing temporal dependencies. However, a gap remains with respect to several graph convolution variants. Although our model is slightly less effective than 2s-AGCN, we obtain a more substantial improvement when we introduce the adaptive graph convolutional block, which demonstrates the value of embedding the topology into a Transformer baseline. STAR [24] was designed with the same intention as our baseline model; in our case we use a graph structure to compensate for the limitations of a purely self-attention mechanism in capturing spatial features. Compared to that model, our recognition accuracy improves by 3.25% on Cross-View and 0.75% on Cross-Subject. Taking KA-AGTN as another example, this model is also positioned as a graph transformer, but it takes 2s-AGCN as its baseline and interpolates attention layers to strengthen the dependence on local neighboring joints while preserving the spatio-temporal graph convolution layers. Our model, in contrast, starts from the most basic Transformer rather than building on existing models, which means we do not fully exploit available computational resources to showcase the model's capabilities. Therefore, incorporating existing pre-trained models into the task to improve generalization is a promising direction for future work.

5 Conclusion

In this work, we propose a novel approach called GTR, which utilizes a transformer to efficiently capture the temporal features of action progression instead of relying solely on graph convolutional networks. GTR incorporates a graph based on the natural connections of body parts into the feature update, which diversifies the model's representation with motion features at different scales and makes the results more credible. We also introduce motion features with various expressive meanings while reducing the complexity of model computation. In contrast to directly fusing action information from different scales, MS-GTR incorporates auxiliary inputs under the guidance of the main branch without introducing additional computational cost. Our proposed MS-GTR achieves state-of-the-art performance on datasets captured by motion capture devices of widely varying accuracy and notably achieves leading recognition accuracy on the HDM05 dataset.