1 Introduction

As a major research topic in computer vision, human action recognition plays an important role in applications such as video surveillance, human-computer interaction and abnormal behavior detection [1, 3, 8, 14, 27]. Skeleton data is a compact and expressive modality that carries far less data than RGB or depth, and it is insensitive to complex backgrounds and dynamic camera perspectives. Skeleton-based human action recognition has therefore received widespread attention [12, 13, 25, 31, 32].

Early deep learning-based action recognition methods manually arrange human skeleton coordinates into vector sequences or pseudo-images and feed them into a recurrent neural network (RNN) or convolutional neural network (CNN) to predict the action [5, 7, 10, 36]. Kim et al. [11] used a one-dimensional residual CNN to recognize skeleton sequences from directly concatenated joint coordinates. Li et al. [18] constructed an adaptive tree-structured RNN, and Si et al. [28] proposed an attention-enhanced graph convolutional LSTM network (AGC-LSTM) for skeleton-based action recognition. However, these methods ignore the inherent topological structure of the human skeleton and struggle to capture the spatio-temporal dependencies between joints.

Graph Convolutional Networks (GCNs) can efficiently handle non-Euclidean data such as graphs, generalizing convolution from images to graphs of arbitrary size and shape. In recent years, more and more skeleton-based action recognition models have adopted GCN-based methods to extract spatio-temporal features [4, 7, 16, 22, 25, 26, 34]. Yan et al. [34] manually defined the human body topology, while Shi et al. [25] learned it dynamically through adaptive graph convolution. Both focus on graph convolution over the global body topology and ignore body part information. For many actions, such as clapping and throwing, the motion characteristics of individual parts are more important. Thakkar et al. [30] were the first to split the human skeleton into different parts for graph convolution. Wang et al. [33] proposed adaptive multi-part graph convolution to learn the spatial correlation between parts with a self-attention mechanism. However, the topology of the human skeleton has still not been fully exploited; we therefore construct a more refined local topology to extract more detailed features.

In this paper, we further model the human skeleton topology along the spatial, temporal and spatio-temporal dimensions based on human body parts. We then propose a novel network named Combined Part-wise Topology Graph Convolutional Network (CPT-GCN), which focuses on exploring fine-grained features and capturing intrinsic spatio-temporal correlations. Specifically, we propose three modules, SPT-GC, TPT-GC and STPT-GC, to perform graph convolution based on locally refined topologies. SPT-GC establishes specific global and local topologies in different channels, taking both global and local information into account to capture the spatial connections of joints in more detail. TPT-GC adjusts the receptive field of the temporal convolution to extract both the global motion trends and the local motion details of an action. STPT-GC focuses on extracting the implicit spatio-temporal associations in the skeleton sequence and establishes a part-enhanced spatio-temporal association topology. Combining the three modules, our network dynamically aggregates high-dimensional features and achieves excellent performance on large-scale datasets.

Our main contributions are summarized as follows:

  • Our proposed SPT-GC refines the spatial topology based on body parts by fusing global and local topology, which extracts more fine-grained spatial features.

  • We propose the spatio-temporal module, including TPT-GC and STPT-GC, which establishes a specific temporal correlation topology and spatio-temporal correlation topology, and effectively extracts the temporal and spatio-temporal correlation of parts and joints.

  • We propose a novel action recognition model CPT-GCN based on skeleton data. It accurately captures the relationship between and within parts, and effectively aggregates the spatial, temporal and spatio-temporal information of skeleton data.

  • We conduct experiments on two widely-used datasets: NTU RGB+D [24] and NTU RGB+D 120 [19], on which our proposed method outperforms state-of-the-art approaches.

2 Related Work

2.1 Skeleton-Based Action Recognition

With the development of deep learning, learned representations have gradually replaced traditional hand-crafted features. The mainstream methods can be divided into three categories according to the network architecture: convolutional neural networks (CNNs), recurrent neural networks (RNNs) and graph convolutional networks (GCNs).

Fig. 1. The overview of the proposed CPT-GCN model. The entire combined part-wise topology graph convolutional block is denoted \(B_i(C_{in}, C_{out}, S)\), where \(C_{in}\), \(C_{out}\) and S are the number of input channels, the number of output channels and the stride, respectively. There are 10 blocks in total. GAP denotes global average pooling.

CNN-based methods usually convert the skeleton data into a pseudo-image according to manually designed conversion rules. RNN-based methods usually extract frame-level skeleton features and represent the skeleton data as sequences following predefined traversal rules [4, 18]. However, the human skeleton is a natural graph structure, and GCNs have clear advantages in processing graph-structured data. Yan et al. were the first to model the human skeleton with a GCN, proposing the Spatio-temporal Graph Convolutional Network (ST-GCN). They build joint connection edges based on the natural connections of the human body and add temporal edges between the same joints in consecutive frames, constructing a skeletal spatio-temporal graph [34]. Shi et al. proposed the adaptive graph convolutional network (AGCN), which uses a self-attention mechanism to change the skeleton topology and adaptively learns connections between originally disconnected joints [25, 26]. Liu et al. introduced a multi-scale graph topology to model joint relationships at multiple scales [21]. Cheng et al. proposed Shift-GCN [7], replacing the traditional convolution operator with a shift convolution operator. The CTR-GCN proposed by Chen et al. [5] designs channel-wise topology graphs to explore more possibilities for feature learning in different channels.

2.2 Partial Graph Convolution in Skeleton-Based Action Recognition

A complete action can be regarded as a composition of different postures of human body parts. For example, during clapping the palms play the key role while the swinging arms play an auxiliary role. Previous studies [7, 21, 25, 26, 34] mostly learn global action features from the whole skeleton, ignoring the important contribution of local features. Thakkar et al. [30] were the first to split the human skeleton into different parts for graph convolution, which effectively improves recognition performance. Wang et al. [33] proposed adaptive multi-part graph convolution to learn the spatial correlation between parts with a self-attention mechanism. Zhu et al. [38] focused on fusing global and local features from a spatial perspective, effectively aggregating multi-level joint features by constructing a topology based on body parts.

3 Methods

In this section, we first introduce the construction of the skeletal spatio-temporal graph and conventional graph convolution. Then we elaborate on the modeling strategies for part-wise spatial topology and spatio-temporal topology. Finally, as shown in Fig. 1, we present the full structure of the proposed Combined Part-wise Topology Graph Convolutional Network (CPT-GCN).

3.1 Preliminaries

Graph Construction. A full action sample consists of multiple consecutive frames. We construct a spatio-temporal skeleton graph to describe the structured information between nodes along the spatial and temporal dimensions. The complete spatio-temporal skeleton graph is built from the natural connections of the human body and the links between consecutive frames, so it contains both intra-frame joint edges and inter-frame edges. The graph is defined as \(\mathcal {G}=(\mathcal {X}, \mathcal {V}, \mathcal {E})\). \(\mathcal {X}\) denotes the feature set of vertices, represented as a matrix \(X\in R^{C \times V \times T}\), where V is the number of vertices, T the number of frames and C the number of channels. \(\mathcal {V}=\left\{ v_1,v_2,..., v_V\right\} \) denotes the vertex set. \(\mathcal {E}\) is the set of edges, reflecting the connection strength between vertices.

Graph Convolution. After the spatio-temporal skeleton graph is constructed, each joint in the input feature map is updated by a weighted sum over the features of its neighboring joints to obtain the output feature map. Graph convolution over the feature map can be formulated as:

$$\begin{aligned} f_{out} = \sum \limits _{s}^{S}W_s \cdot f_{in} \cdot A_s \end{aligned}$$
(1)

where \(f_{in}\) and \(f_{out}\) denote the input and output feature maps, S denotes the set of spatial sampling areas, and \(A_s\) and \(W_s\) denote the adjacency matrix and the weight function for sampling area s.
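To make Eq. 1 concrete, the following is a minimal PyTorch sketch of this conventional graph convolution; the module name, the use of \(1 \times 1\) convolutions for \(W_s\), and the default joint and subset counts are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """Eq. 1: f_out = sum_s W_s * f_in * A_s, with one 1x1 conv per sampling area."""
    def __init__(self, c_in, c_out, num_subsets=3, num_joints=25):
        super().__init__()
        # A_s: one learnable adjacency per sampling area (initialized to identity here)
        self.A = nn.Parameter(torch.eye(num_joints).repeat(num_subsets, 1, 1))
        # W_s: implemented as 1x1 convolutions over the channel dimension
        self.convs = nn.ModuleList(nn.Conv2d(c_in, c_out, 1) for _ in range(num_subsets))

    def forward(self, f_in):                               # f_in: (N, C_in, T, V)
        out = 0
        for s, conv in enumerate(self.convs):
            # aggregate neighbor features with A_s, then transform with W_s
            out = out + conv(torch.einsum('nctv,vw->nctw', f_in, self.A[s]))
        return out

# usage: x = torch.randn(8, 3, 64, 25); y = SimpleGraphConv(3, 64)(x)  # -> (8, 64, 64, 25)
```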

3.2 Part-Wise Spatial Modeling

Almost any action is composed of sub-actions of different parts; the differences mainly lie in the correlations between parts and in each part's contribution to the whole action. For example, clapping can be decomposed into the motion of the two palms and arms, and nodding can be regarded as the motion of the head. Thus, optimizing the skeleton topology based on human body parts can capture the dependencies between joints more accurately.

Most previous studies explore global skeleton features and learn spatial relationships from the natural connections of the human body or through attention mechanisms [25, 26, 32, 34]; this introduces considerable redundant information, and a spatial topology shared across all channels is also suboptimal. Existing part-based models usually extract features from body parts individually or only focus on discovering the importance of different body parts [29, 35]. In contrast, we take full account of inter-part dependencies and intra-part differences, and construct a refined part-wise topology for each channel.

Before performing graph convolution, the correlations between body parts need to be modeled. Specifically, we divide the human body into 8 parts: head, torso, two arms, two palms and two legs. The input feature \(X \in R^{C \times V \times T}\) is aggregated according to the proposed part division strategy, which is formulated as:

$$\begin{aligned} X^{part}_i = Concat(\left\{ X_j \ | \ j \in L(i)\right\} ) \quad i=1,2,...,P \end{aligned}$$
(2)

where P denotes the number of parts, \(Concat(\cdot )\) denotes the concatenation function, L(i) denotes the set of joint indices belonging to the i\(_{th}\) part, and \(X_i^{part}\) denotes the aggregated feature of the i\(_{th}\) part.
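As an illustration, Eq. 2 amounts to simple index selection along the joint dimension; the 8-part grouping of the 25 NTU joints below is a hypothetical split, since the exact joint indices are not listed in the text.

```python
import torch

# hypothetical 0-based joint indices per body part for the 25-joint NTU skeleton
PARTS = {
    'head':       [2, 3, 20],
    'torso':      [0, 1, 4, 8, 12, 16],
    'left_arm':   [5, 6],          'right_arm':  [9, 10],
    'left_palm':  [7, 21, 22],     'right_palm': [11, 23, 24],
    'left_leg':   [13, 14, 15],    'right_leg':  [17, 18, 19],
}

def split_parts(x):
    """Eq. 2: gather the joints L(i) of each part along the joint dimension.

    x: (N, C, T, V) -> list of P tensors of shape (N, C, T, |L(i)|).
    """
    return [x[..., idx] for idx in PARTS.values()]
```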

Parts Correlation Modeling. In order to obtain the optimal dependencies between parts, we propose the modeling strategy \(\mathcal {M}(\cdot )\) to model part dependencies.

Since every joint contributes to its body part, we perform average pooling over the joints inside each part. In addition, to reduce the computational cost, we use linear transformations \(\psi (\cdot )\) and \(\varphi (\cdot )\) to reduce the feature dimension before the local topology modeling. \(\mathcal {M}(\cdot )\) computes the channel-wise distance between different parts and applies a nonlinear transformation to this distance to represent the correlation between parts, which is formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {M}(i,j) = \sigma (\psi (AvgPool_{ST}(X^{part}_i)) - \varphi (AvgPool_{ST}(X^{part}_j))) \end{aligned} \end{aligned}$$
(3)

where \(AvgPool_{ST}(\cdot )\) denotes average pooling over both the spatial and temporal dimensions, and \(\mathcal {M}(i, j)\) denotes the correlation between parts i and j.

Part-wise Topology Modeling. The part correlation graph obtained by \(\mathcal {M}(\cdot )\) describes correlations between parts and cannot be applied to the human skeleton graph directly, so it must be mapped to a joint relation graph through a mapping function. Given the relationships between parts, the part correlation features are first assembled into a complete vertex matrix, which is formulated as:

$$\begin{aligned} G_{part} = Concat(\left\{ Concat(\left\{ \mathcal {M}(i,j) \ | \ j=1,2,...,P \right\} ) \ | \ i=1,2,...,P \right\} ) \end{aligned}$$
(4)

where \(G_{part}\) denotes the assembled inter-part relationship graph, which expresses a different part correlation in each channel. In practice, however, the joints within a part do not share weights, so the topology needs to be refined during mapping. We optimize the topology through a learnable bias and a linear transformation, which is formulated as:

$$\begin{aligned} G_s^{local} = \phi (\mathcal {R}(G_{part}) + B_0) \quad s=1,2,...,S \end{aligned}$$
(5)

where \(\mathcal {R}(\cdot )\) denotes the mapping function that maps the part association graph to a joint association graph, \(\phi (\cdot )\) denotes the linear transformation function, \(B_0\) denotes a learnable positional bias over channels and joints, and \(G_s^{local}\) is the resulting part-wise local topology.
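A compact sketch of Eqs. 3-5 is given below. The choice of sigmoid for \(\sigma\), linear layers for \(\psi\), \(\varphi\) and \(\phi\), and the implementation of \(\mathcal {R}(\cdot )\) as a broadcast that copies each part-pair value to all of its joint pairs are assumptions made for illustration; `part_of_joint` can be built from the `PARTS` dictionary of the previous sketch.

```python
import torch
import torch.nn as nn

class PartwiseTopology(nn.Module):
    """Sketch of Eqs. 3-5: part pooling, pair-wise channel differences, joint mapping."""
    def __init__(self, c_in, c_mid, c_out, part_of_joint):
        super().__init__()
        # part_of_joint[v] = index of the part containing joint v (defines R(.))
        self.register_buffer('p_of_v', torch.as_tensor(part_of_joint))
        self.psi = nn.Linear(c_in, c_mid)              # dimension reduction in Eq. 3
        self.varphi = nn.Linear(c_in, c_mid)
        self.phi = nn.Linear(c_mid, c_out)             # phi(.) in Eq. 5
        V = len(part_of_joint)
        self.B0 = nn.Parameter(torch.zeros(c_mid, V, V))   # learnable bias over channels/joints

    def forward(self, x_parts):                        # list of (N, C_in, T, |L(i)|) tensors
        # Eq. 3: pool each part over time and its joints, then reduce channels
        pooled = torch.stack([p.mean(dim=(2, 3)) for p in x_parts], dim=1)      # (N, P, C_in)
        m = torch.sigmoid(self.psi(pooled).unsqueeze(2)
                          - self.varphi(pooled).unsqueeze(1))                   # (N, P, P, c_mid)
        # Eq. 5: R(.) broadcasts each part-pair relation to all of its joint pairs
        g = m[:, self.p_of_v][:, :, self.p_of_v].permute(0, 3, 1, 2) + self.B0  # (N, c_mid, V, V)
        g = self.phi(g.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)                 # (N, c_out, V, V)
        return g
```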

Fig. 2. Model architecture of the CPT-GCN block. It consists of three modules: SPT-GC, TPT-GC and STPT-GC. \(D_1(\cdot )\) and \(D_2(\cdot )\) denote the part partition functions; \(M_1(\cdot )\) and \(M_2(\cdot )\) denote the part correlation modeling functions. FC denotes a fully connected layer, BN denotes batch normalization, and ReLU is the activation function.

Spatial Part-wise Topology Graph Convolution (SPT-GC). The local topology captures both part correlations and intra-part differences. On this basis, a global topology is introduced and learned adaptively from the data to capture the global spatial characteristics of actions. Our proposed SPT-GC is more flexible in that it combines global and local topologies to model the correlations between joints more accurately. A gating coefficient \(\alpha \) is introduced when fusing the global graph and the individually refined graph to control the differing contributions of parts and joints in different sampling regions. Finally, the graph convolution is completed by an Einstein summation of the part-wise topology and the input features over the spatial dimension.

The GCN dynamically updates the global and local topologies during inference to capture dependencies between joints that are not physically connected. Therefore, Eq. 1 is modified into the following form:

$$\begin{aligned} f_{out} = \sum \limits _{s}^{S}W_s \cdot f_{in} \cdot (G_s^{global} + \alpha G_s^{local}) \end{aligned}$$
(6)

where \(G^{global}_s\) is the global topology, initialized with the natural connections of the human skeleton and updated by adaptively learning the correlations of actions.

The complete SPT-GC module is shown in Fig. 2 (a). We first divide the input skeleton feature \(X_{in}\) into parts and then apply adaptive average pooling to the aggregated features. The pooled features are fed into two \(1 \times 1\) convolutional layers for dimensionality reduction. After part-wise modeling, the part association topology is obtained; it is then mapped to a joint topology and fused with the global topology. In addition, multiple sampling regions S are used to learn semantic information at different levels.
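Putting the pieces together, a minimal sketch of the SPT-GC forward pass (Eq. 6) could look as follows. Treating the local topology as channel-wise and implementing the fusion with an Einstein summation reflects one reading of the text; `make_local_topo` is a placeholder for any builder such as the `PartwiseTopology` sketch above.

```python
import torch
import torch.nn as nn

class SPTGC(nn.Module):
    """Sketch of Eq. 6: f_out = sum_s W_s * f_in * (G_s^global + alpha * G_s^local)."""
    def __init__(self, c_in, c_out, A_global, make_local_topo, num_subsets=3):
        super().__init__()
        # global topology (S, V, V), initialized from the natural skeleton connections
        self.G_global = nn.Parameter(A_global.clone())
        self.alpha = nn.Parameter(torch.zeros(num_subsets))        # gating in Eq. 6
        self.topos = nn.ModuleList(make_local_topo() for _ in range(num_subsets))
        self.convs = nn.ModuleList(nn.Conv2d(c_in, c_out, 1) for _ in range(num_subsets))

    def forward(self, x, x_parts):                                 # x: (N, C_in, T, V)
        out = 0
        for s in range(len(self.convs)):
            g_local = self.topos[s](x_parts)                       # (N, C_out, V, V)
            g = self.G_global[s] + self.alpha[s] * g_local         # broadcast over channels
            out = out + torch.einsum('nctv,ncvw->nctw', self.convs[s](x), g)
        return out
```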

3.3 Part-Wise Spatio-Temporal Modeling

The skeleton feature map of a human action sequence contains rich spatio-temporal semantics, and its spatial and temporal information are inherently related. We propose the novel TPT-GC and STPT-GC modules to extract the temporal and spatio-temporal semantic information of actions.

From the temporal perspective, a complete action is composed of multiple sub-actions; for example, squatting, bouncing, jumping forward and standing together constitute a complete long jump. TCN [2] learns the associations between sub-actions, or the trajectory of a complete action, by setting convolution kernels of different sizes. However, the sub-actions that compose different actions have different durations: some actions depend more on long-term motion trends, while others can only be distinguished by short-term motion details. Our TPT-GC therefore uses different convolutional dilation coefficients to capture long-term motion trends and short-term motion details, respectively.

Most previous methods extract spatial and temporal features separately, ignoring the intrinsic relationship between space and time within an action. If the correlations between non-corresponding joints across frames can be extracted, recognition accuracy can be further improved. Our proposed STPT-GC captures such spatio-temporal correlation features; its effectiveness is verified in the ablation experiments shown in Table 2.

In addition, different human body parts perform different sub-actions: the arms and thighs may dominate the motion trend of one action, while the hands may control the motion details of another. Clearly, adding part information helps the learning of motion patterns. We therefore also introduce the notion of parts into the spatio-temporal modeling and construct refined temporal and spatio-temporal topologies, achieving a part-enhanced effect.

Temporal Part-wise Topology Graph Convolution (TPT-GC). Inspired by multi-scale temporal convolution [21], we design a part-based temporal modeling module for finer-grained extraction of joint motion trends and motion details. The part division strategy of Eq. 2 is used to aggregate the joint features of body parts. To reduce the computational complexity, we use the linear transformation \(\psi (\cdot )\) to reduce the feature dimension. We then set two parallel convolution branches with different dilation coefficients to enlarge the neighborhood covered by the graph convolution and extract semantic information at different levels of the action. The TPT-GC module is shown in Fig. 2 (b) and formulated as:

$$\begin{aligned} f^{1}_{out}(i) = \sum \limits _{k}^{K}{W_{1} \cdot \psi (f_{in}(i + k))} \quad i=1,2,...,T \end{aligned}$$
(7)
$$\begin{aligned} f^{2}_{out}(i) = \sum \limits _{k}^{K}{W_{2} \cdot \psi (f_{in}(i + 2k))} \quad i=1,2,...,T \end{aligned}$$
(8)

where \(f^{1}_{out}\) and \(f^{2}_{out}\) denote the output features of the two branches, \(W_{1}\) and \(W_{2}\) denote the corresponding convolution weights, and K is the size of the convolution kernel in the temporal dimension.
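The two branches of Eqs. 7-8 correspond to dilated temporal convolutions applied after the channel reduction \(\psi (\cdot )\); the sketch below shows one possible realization, where the kernel size and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class TPTGCBranch(nn.Module):
    """One TPT-GC branch: 1x1 channel reduction followed by a dilated temporal conv."""
    def __init__(self, c_in, c_out, kernel_size=5, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation                    # keep temporal length T
        self.reduce = nn.Sequential(nn.Conv2d(c_in, c_out, 1),     # psi(.) in Eqs. 7-8
                                    nn.BatchNorm2d(c_out), nn.ReLU())
        self.tconv = nn.Conv2d(c_out, c_out, (kernel_size, 1),
                               padding=(pad, 0), dilation=(dilation, 1))

    def forward(self, x):                                          # x: (N, C_in, T, V)
        return self.tconv(self.reduce(x))

# Eq. 7 (dilation 1, short-term details) and Eq. 8 (dilation 2, long-term trends)
branches = nn.ModuleList([TPTGCBranch(64, 16, dilation=1),
                          TPTGCBranch(64, 16, dilation=2)])
```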

Spatio-temporal Part-wise Topology Graph Convolution (STPT-GC). To obtain the inherent spatio-temporal correlations of an action, we design a novel spatio-temporal modeling module, likewise guided by part information, to establish a more refined spatio-temporal correlation topology. The STPT-GC module is shown in Fig. 2 (c). STPT-GC relies on the spatio-temporal correlation topology to obtain spatio-temporal correlation information, and therefore first constructs a spatial correlation graph and a temporal correlation graph. It uses the same part division strategy to aggregate the joint features of the parts, aggregates the temporal and spatial information separately through average pooling, and then applies linear transformations to reduce the temporal and spatial feature dimensions. It is formulated as:

$$\begin{aligned} G^{S}_{out} = W_S \cdot \sigma (\phi _1(AvgPool_S(f_{in}))) \end{aligned}$$
(9)
$$\begin{aligned} G^{T}_{out} = W_T \cdot \sigma (\phi _2(AvgPool_T(f_{in}))) \end{aligned}$$
(10)

where \(G^{S}_{out}\) and \(G^{T}_{out}\) denote the spatial and temporal correlation graphs, respectively. \(AvgPool_S(\cdot )\) and \(AvgPool_T(\cdot )\) denote the average pooling operation on the spatial and temporal dimensions, respectively. \(\phi _1(\cdot )\) and \(\phi _2(\cdot )\) denote the linear transformation function. \(\sigma (\cdot )\) denotes the activation function. We add learnable parameters \(W_S\) and \(W_T\) to assist in learning the spatio-temporal features of actions, and then use the Kronecker product to model the spatio-temporal correlation topology. It is formulated as:

$$\begin{aligned} G^{ST}_{out} = \sigma (G^{S}_{out} \otimes G^{T}_{out}) \end{aligned}$$
(11)

where \(G^{ST}_{out}\) denotes the resulting spatio-temporal correlation topology. Our proposed STPT-GC runs in parallel with TPT-GC; the output features of TPT-GC and STPT-GC are concatenated after the spatio-temporal topological graph convolution. This is formulated as:

$$\begin{aligned} f^{3}_{out} = W \cdot \phi _3(f_{in}) \cdot G^{ST}_{out} \end{aligned}$$
(12)
$$\begin{aligned} f_{out} = Concat(f^{(i)}_{out}) \quad i=1,2,...,N^{branch} \end{aligned}$$
(13)

where \(f^{3}_{out}\) denotes the output feature of the STPT-GC module, \(\phi _3(\cdot )\) denotes the linear transformation function, \(f^{(i)}_{out}\) denotes the output feature of the \(i_{th}\) branch, and \(f_{out}\) denotes the output features after the \(N^{branch}\) branches are concatenated. Intuitively, the first group of channels represents the temporal characteristics of the action, and the remaining channels represent its spatio-temporal correlation characteristics. The joints of each part are restored to the original feature dimension through the mapping and concatenation strategies.
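A sketch of Eqs. 9-12 follows. Because the tensor shapes are not fixed in the text, this reading pools over time to obtain a joint-wise descriptor and over joints to obtain a frame-wise descriptor, takes their outer (Kronecker-style) product as \(G^{ST}_{out}\), and applies it to the compressed features as an element-wise weighting; the sigmoid activations and shape conventions are assumptions.

```python
import torch
import torch.nn as nn

class STPTGC(nn.Module):
    """Sketch of Eqs. 9-12: separate spatial/temporal pooling, Kronecker-style fusion."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.phi1 = nn.Conv1d(c_in, c_out, 1)          # reduction for the spatial graph
        self.phi2 = nn.Conv1d(c_in, c_out, 1)          # reduction for the temporal graph
        self.phi3 = nn.Conv2d(c_in, c_out, 1)          # feature compression in Eq. 12
        self.w_s = nn.Parameter(torch.ones(1))         # learnable scales W_S, W_T
        self.w_t = nn.Parameter(torch.ones(1))

    def forward(self, x):                              # x: (N, C_in, T, V)
        g_s = self.w_s * torch.sigmoid(self.phi1(x.mean(dim=2)))   # joint-wise graph, (N, C_out, V)
        g_t = self.w_t * torch.sigmoid(self.phi2(x.mean(dim=3)))   # frame-wise graph, (N, C_out, T)
        # Eq. 11: outer product over (T, V) yields a per-channel spatio-temporal topology
        g_st = torch.sigmoid(g_t.unsqueeze(-1) * g_s.unsqueeze(2)) # (N, C_out, T, V)
        return self.phi3(x) * g_st                     # Eq. 12: weight compressed features
```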

3.4 Model Architecture

We combine the three modules SPT-GC, TPT-GC and STPT-GC to construct a powerful graph convolutional network, CPT-GCN, for skeleton-based action recognition. The overall architecture is shown in Fig. 1 (a); it consists of 10 basic blocks and a classification layer. The output channels of the ten blocks are 64, 64, 64, 64, 128, 128, 128, 256, 256 and 256. Residual connections are used between blocks [9]; finally, global average pooling and a softmax classifier produce the action prediction.
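The 10-block stack can be written down schematically as below; `CPTGCNBlock` (passed as `block_cls`) is a placeholder for the combined block of Fig. 2, and the strides at the channel-increasing stages are an assumption consistent with the notation \(B_i(C_{in}, C_{out}, S)\) in Fig. 1.

```python
import torch.nn as nn

def build_cpt_gcn(block_cls, num_classes=60, in_channels=3):
    """Assemble the 10 CPT-GCN blocks of Fig. 1(a) followed by GAP and a classifier."""
    cfg = [(in_channels, 64, 1), (64, 64, 1), (64, 64, 1), (64, 64, 1),
           (64, 128, 2), (128, 128, 1), (128, 128, 1),
           (128, 256, 2), (256, 256, 1), (256, 256, 1)]        # (C_in, C_out, stride)
    blocks = [block_cls(ci, co, s) for ci, co, s in cfg]        # B_i(C_in, C_out, S)
    return nn.Sequential(*blocks,
                         nn.AdaptiveAvgPool2d(1),               # global average pooling
                         nn.Flatten(),
                         nn.Linear(256, num_classes))           # softmax applied via the CE loss
```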

Specifically, each block contains a spatial model and a spatio-temporal model, responsible for extracting the spatial features and the joint spatio-temporal features of the skeleton information, respectively, as shown in Fig. 1 (b).

Spatial Modeling. The spatial model is composed of SPT-GC modules; three SPT-GCs are used in parallel to extract semantic information at different levels between parts and joints, as shown in Fig. 2 (a). A single SPT-GC first uses the channel reduction rate r1 to compact the representations and applies temporal and intra-part spatial pooling to aggregate features. It then conducts pair-wise subtraction and activation, and fuses the resulting local topology with the global graph. Finally, graph convolution is performed to obtain the output feature map, as shown in Eq. 6.

Spatio-temporal Modeling. We demonstrate through ablation experiments that the spatio-temporal model with three branches has better performance. Among them, TPT-GC occupies two branches and STPT-GC occupies one branch, as shown in Fig. 2 (b) and (c).

TPT-GC first uses the channel reduction rate r2 to compress the channel information, and constructs two temporal convolutional layers of different scales to increase the receptive field, which are used to extract the motion trend and motion details of the action respectively.

STPT-GC aggregates temporal and spatial information through average pooling operations, and uses the channel reduction rate of r3 to reduce computational complexity. We use the Kronecker product to model spatio-temporal association topology. Finally, it performs a dot product of the compressed input features with the spatio-temporal correlation topology to complete the graph convolution, which can extract the spatio-temporal correlation information of the action.

4 Experiments

4.1 Datasets

NTU RGB+D. NTU RGB+D (NTU-60) [24] is currently the most widely used large-scale action recognition dataset, containing 60 action categories and 56,000 action clips. The clips were captured by three KinectV2 cameras with different perspectives and performed by 40 volunteers. Each sample contains one action and is guaranteed to have at most 2 subjects. The skeleton information consists of the 3D coordinates of 25 body joints and the corresponding action category labels. NTU-60 recommends two benchmarks [24]: Cross-View Evaluation (X-View) split according to different camera views and Cross-Subject Evaluation (X-Sub) split according to different subjects.

NTU RGB+D 120. NTU RGB+D 120 (NTU-120) [19] is a larger-scale extension of NTU-60. It contains 120 action categories and 114,480 action clips, performed by 106 volunteers in 32 camera setups. NTU-120 also recommends two benchmarks [19]: the first is Cross-Subject Evaluation (X-Sub), the same cross-subject protocol as in NTU-60; the other is Cross-Setup Evaluation (X-Set), which splits training and test samples based on the parity of the camera setup IDs.

4.2 Training Details

All experiments are conducted on one RTX 3070 Ti GPU with the PyTorch deep learning framework. We use stochastic gradient descent (SGD) with Nesterov momentum (0.9) as the optimizer and cross-entropy as the loss function. Weight decay is 0.0004. The initial learning rate is set to 0.1, and a warm-up strategy [9] is used in the first 5 epochs to make training more stable. The batch size is 32. The learning rate is divided by 10 at the 35\(_{th}\) and 55\(_{th}\) epochs, and training ends at the 70\(_{th}\) epoch. Since the number of frames varies across samples, we uniformly downsample each sequence to 64 frames. In addition, we adopt the data preprocessing strategy of [21] for the input skeleton features.
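For reference, the schedule above corresponds to the following PyTorch training skeleton; `model` and `train_loader` are placeholders, and the linear form of the warm-up is an assumption.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=0.0004)
criterion = torch.nn.CrossEntropyLoss()

def lr_at(epoch, base_lr=0.1, warmup=5):
    if epoch < warmup:                                       # warm-up over the first 5 epochs
        return base_lr * (epoch + 1) / warmup
    return base_lr * 0.1 ** ((epoch >= 35) + (epoch >= 55))  # divide by 10 at epochs 35 and 55

for epoch in range(70):                                      # training ends at the 70th epoch
    for group in optimizer.param_groups:
        group['lr'] = lr_at(epoch)
    for x, y in train_loader:                                # x: batches of 64-frame skeleton clips
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```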

4.3 Ablation Studies

In this section, we use the X-Sub benchmark of the NTU-60 to verify the effectiveness of proposed modules in CPT-GCN.

Effectiveness of TPT-GC and STPT-GC. To evaluate the spatio-temporal model proposed in Sect. 3.3 and determine its optimal branch configuration, we conduct experiments on TPT-GC and STPT-GC with different numbers of branches. We adopt ST-GCN [34] as the baseline and replace its temporal module with the proposed spatio-temporal model. The configurations and results are shown in Table 1, which indicates that the spatio-temporal model with two TPT-GC branches and one STPT-GC branch performs best.

Table 1. Comparison of the validation accuracy of spatio-temporal model with different settings.
Table 2. Comparison of the validation accuracy of CPT-GCN with different settings.

Model Configuration Exploration. As mentioned in Sect. 3.4, our proposed CPT-GCN contains three different modules, namely SPT-GC, TPT-GC and STPT-GC. We manually remove or retain individual modules to test the parameter cost and performance of different CPT-GCN configurations. Additionally, we adopt ST-GCN [34] as the baseline method, which uses none of the three modules.

The specific configurations and results are shown in Table 2. They show that although the SPT-GC module introduces some additional parameters, it effectively improves model performance, while the TPT-GC and STPT-GC modules bring significant gains with only a small number of extra parameters. The combination of SPT-GC, TPT-GC and STPT-GC is the optimal configuration of the model; under this configuration, CPT-GCN brings an improvement of +5.2% over the baseline on the X-Sub benchmark.

Table 3. Recognition accuracy comparison against state-of-the-art methods on the NTU RGB+D dataset.
Table 4. Recognition accuracy comparison against state-of-the-art methods on the NTU RGB+D 120 dataset.

4.4 Comparison with the State-of-the-Art

Most state-of-the-art methods employ a multi-stream fusion framework to enrich semantic information. Our method adopts the same strategy as [5, 7, 26] to generate four data modalities, namely joint, bone, joint motion and bone motion, and fuses the prediction scores of the four modalities.

We compare the final model with state-of-the-art skeleton-based action recognition methods on the NTU-60 and NTU-120 datasets. The results are shown in Tables 3 and 4. These methods for comparison include RNN-based methods [17, 20, 24], CNN-based methods [2, 15, 37] and GCN-based methods [6, 7, 16, 21, 25, 34].

Our model achieves significant improvements of +1.4% and +1.0% over MST-GCN on the X-Sub and X-Set benchmarks of NTU-120, respectively. Overall, CPT-GCN achieves better performance than the other methods on both datasets, which demonstrates the superiority of our model.

5 Conclusion

In this work, we present a novel Combined Part-wise Topology Graph Convolutional Network (CPT-GCN) for skeleton-based action recognition. SPT-GC accurately learns the joint correlations of actions by combining global and local topologies. TPT-GC adjusts the receptive field of the temporal convolution to extract both the global motion trends and the local motion details of an action. STPT-GC focuses on extracting the implicit spatio-temporal associations in the skeleton sequence and establishes a part-enhanced spatio-temporal association topology. The combination of the three modules shows powerful correlation modeling capability. We evaluate the proposed model on two large-scale datasets; the results demonstrate that CPT-GCN outperforms other graph convolutional methods and that the final model has excellent performance and generalization ability.