
1 Introduction

Human action recognition, which plays a significant role in applications such as video surveillance and man-machine interaction [2], has attracted great attention in recent years.

Many approaches attempt to analyze actions under dynamic circumstances and complicated backgrounds. In many cases, background information deserves serious consideration. For example, when a person’s hand moves to his mouth, it is difficult to tell what he is doing; the question becomes easy if there is a cup in his hand, because the background provides additional information.

However, in certain cases incorporating background information is ineffective or even harmful. For instance, in figure skating a performer shows a wide range of exaggerated movements, and the changing background disturbs the action analysis. Skeleton data without background information is therefore more appropriate for pure action recognition.

Earlier methods [5, 15] treat skeleton data as vector sequences, which cannot fully express the interdependency among joints. Unlike recurrent neural networks (RNNs) and convolutional neural networks (CNNs), graph convolutional networks (GCNs) treat skeleton data as graphs and can thus fully exploit the relationships between correlated joints.

GCNs show excellent performance in skeleton-based action recognition. However, most previous works [16, 19] pay little attention to the feature maps output by the network, and there is room for improvement on datasets with unbalanced categories. We therefore propose a new approach to address these issues. We use the focal loss [10] instead of the cross-entropy (CE) loss to adapt to unbalanced categories; the focal loss assigns different weights to examples according to their difficulty of recognition. On top of that, we add the center loss [18] on the feature maps to make deep features more discriminative and the network more robust.

In this paper, 1) we replace the CE loss with the focal loss to make the network more adaptable to datasets with unbalanced categories; 2) we add the center loss on deep features to make them more discriminative; and 3) on two datasets for skeleton-based action recognition, our method exceeds the state of the art.

2 Related Work

2.1 RGB-D Based Action Recognition

RGB-D based human action recognition has attracted plenty of interest in recent years. Thanks to RGB-D sensors such as Kinect, RGB data and depth data, which encode rich 3D structural information, can easily be obtained. Previous works [8, 17, 20] explore the information contained in visual and depth features. Instead of treating the two modalities as separate channels, SFAM [17] proposed the use of scene flow, which extracts the real 3D motion while preserving structural information. [20] proposed a binary local representation for RGB-D video fusion. BHIM [8] represents both features in the form of a matrix that encodes spatiotemporal structural relationships. These RGB-D based methods focus on finding an appropriate way to fuse the two features.

2.2 Skeleton Based Action Recognition

With the development of deep learning, many network-based methods that learn features automatically have been proposed. RNN-based methods [11, 15] and CNN-based methods [7, 12] have achieved high performance on action recognition. Unlike these methods, GCN-based methods [9, 16, 19] treat skeleton data as graphs, which better exploit the relationships between correlated joints. ST-GCN [19] was the first to apply GCNs to skeleton-based action recognition. 2s-AGCN [16] adaptively learns the topology of the graph. AS-GCN [9] attempts to capture richer dependencies among nodes. These GCNs automatically learn from the location and structure of nodes.

3 The Proposed Approach

3.1 Graph Construction

Depending on the capture device and algorithm, skeleton data are usually represented as sequences of 2D/3D joint coordinates. A graph connects each joint to others along both the spatial and temporal dimensions. We construct the graph of joints according to the method in ST-GCN [19]. As shown in the left sketch of Fig. 1, a spatiotemporal graph is composed of a node set N and an edge set E. The node set N contains all the joint coordinates in a sequence. The edge set E, composed of the spatial edge set \(E_{S}\) and the temporal edge set \(E_{F}\), describes how the nodes are connected. In the spatial dimension, nodes are connected according to their natural connections within a frame. For the second subset \(E_{F}\), nodes are connected across frames: the temporal edges link the same node in adjacent frames.
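To make the construction concrete, the following is a minimal Python/NumPy sketch of how such a spatiotemporal graph could be encoded as an adjacency matrix; the joint count and edge list are illustrative placeholders, not the actual skeleton layout used in our experiments.

```python
import numpy as np

# Illustrative skeleton: 5 joints and their natural (spatial) connections.
# The real joint count and edge list depend on the pose-estimation toolkit used.
num_joints = 5
spatial_edges = [(0, 1), (1, 2), (1, 3), (1, 4)]  # hypothetical E_S for one frame

# Spatial adjacency matrix A for one frame (undirected), plus self-loops I.
A = np.zeros((num_joints, num_joints), dtype=np.float32)
for i, j in spatial_edges:
    A[i, j] = A[j, i] = 1.0
A_with_self_loops = A + np.eye(num_joints, dtype=np.float32)

# The temporal edge set E_F connects each joint to the same joint in adjacent
# frames, so it needs no explicit matrix: a convolution along the frame axis
# (Sect. 3.2) realizes these connections.
```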

Fig. 1. (a) The spatiotemporal graph used. (b) The spatial configuration partitioning strategy.

3.2 Skeleton Oriented GCN

A deep graph convolutional network can be constructed on the graph above. ST-GCN [19] consists of stacked ST-GCN blocks, each containing a spatial graph convolution and a temporal graph convolution.

The spatial graph convolution on node \(v_{i}\) can be formulated as [19]

$$\begin{aligned} f_{out}(v_{i}) = \sum _{v_{j}\in B_{i}} \frac{1}{Z_{ij}} f_{in}(v_{j}) \cdot w(l_{i}(v_{j})) \end{aligned}$$
(1)

where \(f_{out}\) denotes the output feature, \(f_{in}\) denotes the input feature, and \(B_{i}\) denotes the set of nodes connected with node \(v_{i}\). The weight function w plays the same role as the weights in an ordinary convolution, but differs in that the number of nodes in the neighbor set \(B_{i}\) is not fixed. To handle this, we use the spatial configuration partitioning strategy proposed in ST-GCN [19]. As shown in the right sketch of Fig. 1, the cross mark represents the gravity center of the skeleton; according to the distance to this center, the strategy divides the set \(B_{i}\) into three subsets, and \(l_{i}(v_{j})\) maps each neighbor to its subset. The normalizing term \(Z_{ij}\) denotes the cardinality of the subset containing node \(v_{j}\). In practice, the feature map of the network is represented as a \(C \times T \times N\) tensor, where C denotes the number of channels, T the length of the frame sequence, and N the number of nodes in a frame. With the spatial configuration partitioning strategy, Eq. 1 is transformed into

$$\begin{aligned} f_{out} = \sum _{j} \varLambda _{j} ^{-\frac{1}{2}} A_{j} \varLambda _{j} ^{-\frac{1}{2}} f_{in} W_{j} \otimes M_{j} \end{aligned}$$
(2)

where \(A_{j}\), an \(N \times N\) matrix, denotes the adjacency matrix of the j-th subset. Note that \(\sum _{j} A_{j} = A + I\), where A denotes the adjacency matrix and I is the identity matrix. \(\varLambda _{j}\) is a diagonal normalization matrix with \(\varLambda _{j}^{ii}=\sum _{k}(A_{j}^{ik}) + \alpha \), where \(\alpha \) is set to 0.001 to avoid \(\varLambda _{j}^{ii}\) being zero. \(W_{j}\) is the weight matrix, realizing the weight function w. \(M_{j}\) is an attention matrix that denotes the importance of each node, and \(\otimes \) denotes the element-wise product.
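As an illustration, a compact PyTorch-style sketch of Eq. 2 is given below. It assumes the K = 3 partitioned adjacency matrices \(A_{j}\) are precomputed, realizes \(W_{j}\) as a 1 × 1 convolution, and applies the mask \(M_{j}\) element-wise to the normalized adjacency, as is common in ST-GCN implementations; the class name and shapes are ours, not part of the original code.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Sketch of Eq. 2. A: (K, N, N) partitioned adjacency (K = 3 subsets).
    Input feature map: (B, C_in, T, N); output: (B, C_out, T, N)."""

    def __init__(self, in_channels, out_channels, A, alpha=0.001):
        super().__init__()
        K, N, _ = A.shape
        # Lambda_j^{-1/2} A_j Lambda_j^{-1/2}, with alpha avoiding zero degrees.
        deg = A.sum(dim=-1) + alpha                        # (K, N)
        d = deg.pow(-0.5)
        self.register_buffer('A_norm', d.unsqueeze(-1) * A * d.unsqueeze(-2))
        self.M = nn.Parameter(torch.ones(K, N, N))         # attention mask M_j
        # W_j realized as a 1x1 convolution producing K * out_channels maps.
        self.conv = nn.Conv2d(in_channels, K * out_channels, kernel_size=1)
        self.K, self.out_channels = K, out_channels

    def forward(self, x):                                  # x: (B, C_in, T, N)
        B, _, T, N = x.shape
        y = self.conv(x).view(B, self.K, self.out_channels, T, N)
        A = self.A_norm * self.M                           # element-wise product
        # Aggregate neighbours within each partition j, then sum over j.
        return torch.einsum('bkctn,kmn->bctm', y, A)
```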

In the temporal dimension, the graph convolution can be applied like a traditional convolution. We choose a fixed number of frames before and after the current frame, so each node has a fixed number of temporal neighbors. The temporal kernel size is thus determined, and the convolution can be applied along the temporal dimension.
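Because the number of temporal neighbors is fixed, the temporal graph convolution reduces to an ordinary 2D convolution with kernel size \((k_{t}, 1)\) on the \(C \times T \times N\) feature map, as sketched below; the kernel size of 9 is the value used in ST-GCN and is assumed here.

```python
import torch.nn as nn

def temporal_conv(channels, kernel_t=9, stride=1):
    """Temporal convolution: a (kernel_t x 1) kernel slides along the frame
    axis only, connecting each joint to the same joint in neighbouring frames."""
    pad = (kernel_t - 1) // 2
    return nn.Conv2d(channels, channels, kernel_size=(kernel_t, 1),
                     stride=(stride, 1), padding=(pad, 0))
```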

3.3 Loss Function

Focal Loss. The focal loss, an improved version of the CE loss, aims to overcome the difficulties caused by the imbalance among categories. For binary classification, the CE loss is given by Eq. 3:

$$\begin{aligned} CE(p, y) = \begin{cases} -\log (p) &{} \text {if } y = 1 \\ -\log (1-p) &{} \text {otherwise,} \end{cases} \end{aligned}$$
(3)

and we define \(p_{t}\) as

$$\begin{aligned} p_{t} = \begin{cases} p &{} \text {if } y = 1 \\ 1-p &{} \text {otherwise,} \end{cases} \end{aligned}$$
(4)

and \(CE(p, y)\) can be written as

$$\begin{aligned} CE(p_{t}) = -\log (p_{t}) \end{aligned}$$
(5)

Based on the CE loss, [10] proposed the focal loss:

$$\begin{aligned} L_{F}(p_{t}) = -\alpha _{t}(1-p_{t})^{\gamma }\log (p_{t}) \end{aligned}$$
(6)

where \(\alpha _{t}\) is a weighting factor that addresses the imbalance among categories. The factor \((1-p_{t})^{\gamma }\) dynamically scales the loss: with \(\gamma > 0\), it automatically down-weights easy examples and up-weights hard ones. We therefore consider the focal loss more suitable for small-scale datasets, and our experiments confirm this.

Center Loss. To make the deeply learned features more discriminative, as shown in Fig. 2, and the network more robust, we add the center loss [18] to our method. The center loss, formulated in Eq. 7, makes features discriminative by minimizing the intra-class variance.

$$\begin{aligned} L_{C} = \frac{1}{2} \sum _{i = 1}^{m} \Vert x_{i} - c_{y_{i}}\Vert _{2}^{2} \end{aligned}$$
(7)

where \(c_{y_{i}}\) denotes the center of the deep features of the \(y_{i}\)-th class, m is the mini-batch size, and \(x_{i}\) is the deep feature of the i-th sample; \(c_{y_{i}}\) is updated dynamically on each mini-batch as the deep features change. The center loss was proposed for face recognition, where separable features are not enough and discriminative features are needed. We consider that it works for action recognition as well, and our experiments confirm this.
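A minimal sketch of the two losses and their combination \(L_{S}=L_{F} + \lambda L_{C}\) (used in Sect. 4) is given below. The multi-class form of Eq. 6, the choice \(\gamma = 2\), and the implementation of the class centers as learnable parameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def focal_loss(logits, target, alpha=1.0, gamma=2.0):
    """Multi-class form of Eq. 6: L_F = -alpha * (1 - p_t)^gamma * log(p_t)."""
    log_pt = F.log_softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

class CenterLoss(nn.Module):
    """Eq. 7: L_C = 1/2 * sum_i ||x_i - c_{y_i}||_2^2. Keeping the centers as
    learnable parameters is one simple way to realize the mini-batch update."""

    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, target):        # features: (B, feat_dim)
        # Averaged over the batch for scale stability; Eq. 7 uses a plain sum.
        return 0.5 * (features - self.centers[target]).pow(2).sum(dim=1).mean()

# Combined objective L_S = L_F + lambda * L_C, e.g.:
#   L_S = focal_loss(logits, y) + lam * center_loss(deep_features, y)
```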

Fig. 2. The center loss function makes deep features more discriminative.

4 Experiments

In this section, we evaluate the performance of our approach and compare it with state-of-the-art methods on two human action recognition datasets: FSD-10 [14] and the RGB-D human video-emotion dataset [13]. Performance is measured by top-1 classification accuracy on the validation set.

We use SGD as the optimizer in all models, with a batch size of 64. The learning rate is set to 0.1 and divided by 10 at epochs 150 and 225. We use \(L_{S}=L_{F} + \lambda L_{C}\) as the loss function in our method, and use the CE loss for comparison.
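For reference, the optimizer and schedule described above could be configured as in the following sketch; `model` is a placeholder for the actual skeleton GCN, and the momentum and weight-decay values are common defaults not stated here.

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 10)  # placeholder for the actual skeleton GCN

# SGD with batch size 64, lr 0.1, divided by 10 at epochs 150 and 225.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)  # assumed defaults
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150, 225], gamma=0.1)
```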

4.1 Evaluation on FSD-10

FSD-10. FSD-10 [14] is a figure skating dataset consisting of 1484 manually labeled videos covering 10 different actions. The clips are segmented from performance videos of high-level figure skaters; each clip ranges from 3 s to 30 s and is captured by a camera focusing on the skater. Compared with other current action recognition datasets, FSD-10 focuses on the action itself rather than the background, where background information can even have a negative effect. We divide FSD-10 into a training set (989 videos) and a validation set (425 videos), train models on the training set, and report accuracy on the validation set.

Comparisons and Analysis. To show that the loss function \(L_{S}=L_{F} + \lambda L_{C}\) is more suitable for FSD-10 than the CE loss, we run two groups of comparative experiments on FSD-10. The first is based on ST-GCN [19]: we first train ST-GCN with the CE loss and then train it again with the loss function \(L_{S}\). The second group applies the same procedure to DenseNet [4]. We also compare the accuracy with I3D [1], STM [6], and KTSN [14]. Table 1 gives the results of our experiments. On both ST-GCN and DenseNet, the loss function \(L_{S}\) gives better performance than the CE loss on FSD-10.

Table 1. The results of our experiments on FSD-10.

4.2 Evaluation on RGB-D Human Video-Emotion Dataset

RGB-D Human Video-Emotion Dataset. The RGB-D human video-emotion dataset [13] consists of over 4,000 RGB video clips and 4,000 depth video clips covering 7 emotion categories. Each clip is around 6 s long and contains the whole body of the actor. The background is green and carries no information useful for recognition. The training set contains 741 skeleton sequences and the validation set 644. We train models on the training set and report accuracy on the validation set.

Comparisons and Analysis. We perform experiments on the video-emotion dataset to show that the loss function \(L_{S}\) is suitable for small-scale action recognition datasets. As with the comparative experiments on FSD-10, we run two groups of experiments based on ST-GCN [19] and DenseNet [4]. We also compare the accuracy with MvLE [13] and MvLLS [3], multi-view methods for recognition on this dataset. Table 2 gives the results. The loss function works on the video-emotion dataset as well, and our method performs better than the state of the art on this dataset.

Table 2. The results of our experiments on the video-emotion dataset.

5 Conclusion

In this paper, we adapt the center loss and the focal loss to human action recognition. The focal loss is used to overcome the difficulties caused by the imbalance among categories, and we consider it more suitable for small-scale datasets with unbalanced categories. The center loss is added to learn more discriminative deep features. We perform experiments on FSD-10 [14] and the RGB-D human video-emotion dataset [13], and our method achieves state-of-the-art performance.