1 Introduction

Multi-person human pose estimation is an important component of many applications, such as video surveillance and sports video analytics. Although great progress has been made in this field [3, 4] thanks to the development of convolutional neural networks (CNNs), human pose estimation remains challenging due to complex poses, diverse appearance, different scales, severe occlusion, and crowds. For tracking in videos, strong camera motion and the extreme proximity of people [1] make the task even more difficult.

Similar to other computer vision tasks dominated by deep learning, large-scale training data are crucial for exploiting the representation power of CNNs for human pose estimation. There exist several large-scale datasets, such as the COCO dataset [13], the MPII dataset [2], and the PoseTrack dataset [1]. These datasets differ in image distribution, pose distribution, and annotation standards. To improve model performance, many methods [7, 11, 18] train on multiple datasets. Most of them train models on the COCO dataset first and then fine-tune on the PoseTrack dataset [18] or the MPII dataset [11]. However, the best practice for learning a model from multiple datasets for human pose estimation remains unclear.

In this paper, we treat training on multiple datasets as multi-domain learning [14, 19] and propose a CNN architecture named Multi-Domain Pose Network (MDPN). The network has a common backbone that shares representations across domains and separate prediction heads for dataset-specific pose estimation. During training, we first jointly optimize on all datasets to learn a generic pose embedding; then each head is fine-tuned on its own domain to further improve localization accuracy. We also investigate prediction strategies for better performance. Evaluated on the PoseTrack dataset, our method with a simple network structure significantly improves multi-dataset learning over the baseline. Moreover, our method is the runner-up of the PoseTrack ECCV 2018 pose estimation challenge and achieves the best performance among entries that use no training data beyond the MPII and COCO datasets.

Fig. 1. Network overview (Left) and multi-domain prediction (Right).

2 Methods

Overall, we adopt the top-down approach [4, 18] to human pose estimation using only single-frame information: a person detector finds all people in the image, and a single-person pose estimator (SPPE) then predicts the pose for each box. For the SPPE, we exploit information from multiple datasets to train a multi-domain network. After that, we use simple matching [18] on adjacent frames to associate individuals into tracklets.

2.1 Multi-Domain Pose Network (MDPN)

For training on multiple datasets, there are three simple solutions:

Mixed. All datasets are merged into one, and all joint sets are merged into a single set of 21 keypoints. During training, gradients are back-propagated only to the joints annotated in each sample (see the masked-loss sketch after this list). Mixing the datasets makes full use of all available information, but different annotation standards for the same joint across datasets may distract the training procedure.

Transfer Learning. As done in [11, 18], we can also train the model on one dataset first to learn a generic representation, then fine-tune on the target dataset. The COCO dataset [13] is often selected for pre-training because its data distribution is well suited to learning a generic pose estimation model. Transfer learning speeds up learning and often achieves good results, but the learned embedding is suboptimal for pose estimation, and the knowledge from the first dataset is easily lost when training on the target dataset for a long time, as observed in [21].

Multi-domain Learning. Another approach is to view training on multiple datasets as a multi-domain learning task, as in [14]: a common backbone network learns a shared pose representation, and several prediction heads learn domain-specific pose estimation. Compared with mixing datasets, multi-domain learning avoids the problem of differing annotation distributions. However, to balance the different domains, the learned prediction heads may not be optimal. Moreover, prediction in a multi-domain network uses only a single head, which wastes the information from the other datasets.
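For the mixed setting, the per-joint gradient masking can be implemented with a masked loss. Below is a minimal PyTorch sketch under our own assumptions about tensor shapes; the 21-joint union and the mask layout are illustrative, not the exact implementation:

```python
import torch

def masked_mse_loss(pred, target, joint_mask):
    """MSE over heatmaps that back-propagates only to annotated joints.

    pred, target: (N, 21, H, W) heatmaps over the merged 21-joint set.
    joint_mask:   (N, 21) float tensor, 1 for joints annotated in the
                  sample's source dataset, 0 otherwise.
    """
    per_joint = ((pred - target) ** 2).mean(dim=(2, 3))  # (N, 21)
    masked = per_joint * joint_mask                      # zero out unannotated joints
    return masked.sum() / joint_mask.sum().clamp(min=1)
```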

Based on the analysis above, we propose the Multi-Domain Pose Network to address these problems. We first apply multi-domain learning on all datasets. Then we fine-tune the full model on the COCO dataset to optimize the embedding and the COCO head. Finally, we fix the backbone together with the COCO head and fine-tune the remaining heads on the combination of the MPII and PoseTrack datasets. Figure 1 illustrates the whole network structure; the details are explained in Sect. 2.2.
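As a concrete illustration, a shared-backbone, multi-head network might be sketched in PyTorch as below. This is only a minimal sketch: ResNet-50 stands in for the actual ResNet-152 backbone, and the per-dataset joint counts (COCO 17, MPII 16, PoseTrack 15) are assumptions:

```python
import torch.nn as nn
import torchvision

class MDPN(nn.Module):
    """Common backbone with one prediction head per dataset (sketch)."""

    def __init__(self, num_joints=None):
        super().__init__()
        num_joints = num_joints or {"coco": 17, "mpii": 16, "posetrack": 15}
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop pool/fc
        # three deconvolution layers, as in [18]
        layers, in_ch = [], 2048
        for _ in range(3):
            layers += [nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1),
                       nn.BatchNorm2d(256), nn.ReLU(inplace=True)]
            in_ch = 256
        self.deconv = nn.Sequential(*layers)
        # one lightweight 1x1 prediction head per domain
        self.heads = nn.ModuleDict(
            {name: nn.Conv2d(256, k, kernel_size=1) for name, k in num_joints.items()})

    def forward(self, x, domain):
        feat = self.deconv(self.backbone(x))
        return self.heads[domain](feat)  # heatmaps for the requested domain
```

During the joint stage all heads are trained; in the later fine-tuning stages the backbone (and, in the last stage, the COCO head) can be frozen by setting `requires_grad = False` on their parameters.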

For prediction, one simple strategy is to use the output of the head corresponding to the test dataset. To exploit all the information from the different datasets, we instead combine the predictions from different heads to form the final estimate (Fig. 1). There are several ways to combine them, which are discussed and compared in Sect. 3.2. Such methods can also be viewed as a lightweight multi-dataset ensemble implemented by multi-branch predictions, as in [8, 12].

2.2 Implementations

Model Structure. We use ResNet-152 [10] with three deconvolution layers as the backbone [18]. To mitigate vanishing gradients, we add an intermediate prediction after the conv3 stage for supervision and add it back to the output of the second deconvolution layer as a skip connection. The input image size is 384 \(\times \) 288.
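The exact wiring of this skip connection is not fully specified; our reading of it can be sketched as follows: predict heatmaps from the conv3 feature map (stride 8, which matches the output resolution of the second deconvolution layer), supervise them against the ground truth, and project them back into the deconvolution stream. The channel sizes and the 1 \(\times \) 1 projection are assumptions:

```python
import torch.nn as nn

class IntermediateSkip(nn.Module):
    """Intermediate supervision after conv3, added back as a skip (sketch)."""

    def __init__(self, num_joints=17, conv3_ch=512, deconv_ch=256):
        super().__init__()
        self.inter_pred = nn.Conv2d(conv3_ch, num_joints, kernel_size=1)
        self.back_proj = nn.Conv2d(num_joints, deconv_ch, kernel_size=1)

    def forward(self, c3_feat, deconv2_feat):
        inter = self.inter_pred(c3_feat)              # supervised with GT heatmaps
        fused = deconv2_feat + self.back_proj(inter)  # skip connection
        return inter, fused
```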

Training. Cropping and augmentation are the same as in [4]. Gaussian maps with sigma 9 are used as targets. The ResNet backbone is initialized with ImageNet pre-trained weights. The base learning rate is 0.001 with batch size 128 and the Adam optimizer. The joint training stage of MDPN runs for 120 epochs, with the learning rate dropped to 0.0001 at epoch 90. We then fine-tune on the COCO dataset for 15 epochs, and finally fine-tune on the MPII and PoseTrack datasets for 20 epochs (with the learning rate dropped to 0.00001 at epoch 10). To improve performance on hard keypoints, we switch from the L2 loss to the Online Hard Keypoints Mining (OHKM) [4] loss with the top 8 keypoints at epoch 100. For the other models, we follow the training scheme in [18].
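The OHKM loss itself is simple to express: compute the per-joint L2 error, keep only the hardest keypoints, and average over them. A minimal sketch following the description in [4]:

```python
import torch

def ohkm_l2_loss(pred, target, topk=8):
    """Online Hard Keypoints Mining loss (sketch).

    pred, target: (N, K, H, W) heatmaps; keeps the topk joints with the
    largest per-joint L2 error in each sample.
    """
    per_joint = ((pred - target) ** 2).mean(dim=(2, 3))  # (N, K)
    hard, _ = per_joint.topk(topk, dim=1)                # hardest topk per sample
    return hard.mean()
```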

Testing. We follow the common practice in [4] with flip testing and quarter offsets. After prediction, we also re-score each box with the product of the box score and the average keypoint score [4].
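The quarter-offset heuristic shifts each heatmap peak a quarter pixel toward its higher neighbor before mapping back to image coordinates. A minimal NumPy sketch (the flip-averaging step is omitted):

```python
import numpy as np

def decode_with_quarter_offset(heatmap):
    """Argmax decoding with quarter-pixel offset refinement (sketch).

    heatmap: (K, H, W) array. Returns (K, 2) (x, y) coords and (K,) scores.
    """
    K, H, W = heatmap.shape
    coords = np.zeros((K, 2), dtype=np.float32)
    scores = np.zeros(K, dtype=np.float32)
    for k in range(K):
        y, x = divmod(int(heatmap[k].argmax()), W)
        scores[k] = heatmap[k, y, x]
        coords[k] = (x, y)
        if 0 < x < W - 1:  # shift toward the higher horizontal neighbor
            coords[k, 0] += 0.25 * np.sign(heatmap[k, y, x + 1] - heatmap[k, y, x - 1])
        if 0 < y < H - 1:  # shift toward the higher vertical neighbor
            coords[k, 1] += 0.25 * np.sign(heatmap[k, y + 1, x] - heatmap[k, y - 1, x])
    return coords, scores
```

The box re-scoring step is then just `box_score * scores.mean()`.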

Detection. We use four public person detectors trained on the COCO dataset [13]: Faster R-CNN [17], Mask R-CNN [9], YOLO [16], and DCN [5]. We merge all their boxes with NMS at an IoU threshold of 0.6 and use the result as our detections.
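Merging the detector outputs amounts to concatenating all boxes and running one NMS pass. A sketch using `torchvision.ops.nms`:

```python
import torch
from torchvision.ops import nms

def merge_detections(boxes_list, scores_list, iou_thresh=0.6):
    """Merge person boxes from several detectors with a single NMS pass.

    boxes_list:  list of (Ni, 4) tensors in (x1, y1, x2, y2) format.
    scores_list: list of (Ni,) confidence tensors, one per detector.
    """
    boxes = torch.cat(boxes_list, dim=0)
    scores = torch.cat(scores_list, dim=0)
    keep = nms(boxes, scores, iou_thresh)  # indices of surviving boxes
    return boxes[keep], scores[keep]
```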

Tracking. We follow the flow-based tracking pipeline [18] with four modifications. First, we apply OKS-NMS [15] with a threshold of 0.4 after pose estimation. Second, we use Hungarian matching instead of greedy matching. Third, after tracking we prune short tracklets containing fewer than 2 frames to reduce false positives. Finally, we do not employ box propagation, because the detector ensemble is strong enough. For multi-frame flow tracking, we use at most the 8 preceding frames.
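Hungarian matching replaces the greedy assignment with a globally optimal one over a similarity matrix. A minimal sketch using SciPy; the similarity measure (e.g. box IoU or pose OKS between adjacent frames) and the cutoff threshold are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(similarity, min_sim=0.1):
    """Match M previous-frame poses to N current-frame poses (sketch).

    similarity: (M, N) matrix of pairwise similarities.
    Returns a list of (prev_idx, cur_idx) pairs above min_sim.
    """
    rows, cols = linear_sum_assignment(-similarity)  # negate to maximize
    return [(r, c) for r, c in zip(rows, cols) if similarity[r, c] >= min_sim]
```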

3 Experiments

3.1 Datasets and Evaluation

We train our models on three datasets: the COCO-2017 dataset [13], the MPII dataset [2], and the PoseTrack-2018 dataset [1], and evaluate our methods on the PoseTrack-2018 validation set. For multi-person pose estimation, we use the mean Average Precision (mAP) metric; for multi-person tracking, the Multiple Object Tracking Accuracy (MOTA) metric. To compare with state-of-the-art methods, we also evaluate on the PoseTrack-2017 validation set. For ablation studies, we construct a min-val set from the PoseTrack-2018 validation set by uniformly sub-sampling 15 of its 75 sequences.

3.2 Ablation Study

For simplicity, ResNet-50 with input size 256 \(\times \) 192 and without the skip connection is used here.

Table 1. Different training (Left) and testing (Right; on MDPN without fine-tuning) methods on the PoseTrack-2018 min-val set with ResNet-50.

Testing. We tried different combination methods on the multi-domain model: (1) COCO branch: use the COCO branch and interpolate the head positions from the other keypoints. (2) PoseTrack branch: use the PoseTrack branch. (3) COCO + PoseTrack branch: use the COCO branch with the head position from the PoseTrack branch. (4) COCO + MPII branch: use the COCO branch with the head position from the MPII branch. (5) Voting: average the heatmaps of the common keypoints across all branches. As shown in the right part of Table 1, the last two methods achieve the best performance, so we use only these two in the remaining experiments and refer to them as methods A and B, respectively.
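A sketch of the voting combination, assuming each head comes with a mapping from its channels to indices in the merged 21-joint set (the mapping layout is our assumption):

```python
import torch

def vote_heatmaps(head_outputs, joint_maps, num_union=21):
    """Average heatmaps of common keypoints across all branches (sketch).

    head_outputs: dict head_name -> (K_head, H, W) heatmap tensor.
    joint_maps:   dict head_name -> list mapping each head channel to an
                  index in the merged 21-joint set.
    """
    first = next(iter(head_outputs.values()))
    acc = first.new_zeros(num_union, *first.shape[-2:])
    cnt = first.new_zeros(num_union, 1, 1)
    for name, hm in head_outputs.items():
        for ch, uidx in enumerate(joint_maps[name]):
            acc[uidx] += hm[ch]
            cnt[uidx] += 1
    return acc / cnt.clamp(min=1)  # joints seen by one head pass through unchanged
```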

Training. We compare different training strategies for multiple datasets: (1) MPII: training on MPII. (2) COCO: training on COCO. (3) PoseTrack: training on PoseTrack. (4) COCO\(\rightarrow \) PoseTrack: training on COCO and fine-tuning on PoseTrack [7, 18]. (5) COCO\(\rightarrow \) PoseTrack + MPII: training on COCO and fine-tuning on the mixture of MPII and PoseTrack [11]. (6) Mixed: training on the mixed dataset. (7) MDPN-B w/o FT: multi-domain training without fine-tuning, tested with method B. (8) MDPN-B: multi-domain training with fine-tuning, tested with method B.

The left part of Table 1 shows that MDPN-B achieves the best performance among all approaches, and that fine-tuning after multi-domain training is important for the final performance (+3.0 mAP). Among the single-dataset results, training on COCO performs best even without head annotations, while PoseTrack gives the worst accuracy. This is because PoseTrack images come from a limited number of videos and contain much duplicate information, which effectively makes the dataset smaller. Another observation is that fine-tuning does not always improve performance on the target dataset, due to knowledge forgetting.

Post-processing. Table 2 indicates that every post-processing step is necessary for the final performance. OKS-NMS is crucial for mAP because too many false positive part detections can mislead the matching stage of the evaluation.
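OKS-NMS works like box NMS but uses Object Keypoint Similarity between poses as the overlap measure. A sketch of the OKS computation following its standard definition [15]; the uniform fall-off constant is a simplification of the per-joint constants used by COCO:

```python
import numpy as np

def compute_oks(kpts_a, kpts_b, area, k=None):
    """Object Keypoint Similarity between two poses (sketch).

    kpts_a, kpts_b: (K, 3) arrays of (x, y, visibility).
    area: instance scale (e.g. box area); k: per-joint fall-off constants.
    """
    if k is None:
        k = np.full(kpts_a.shape[0], 0.1)  # simplification: uniform constant
    vis = (kpts_a[:, 2] > 0) & (kpts_b[:, 2] > 0)
    if not vis.any():
        return 0.0
    d2 = np.sum((kpts_a[vis, :2] - kpts_b[vis, :2]) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2 * area * k[vis] ** 2))))
```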

Table 2. Different post-processing methods for pose estimation (Left) and tracking (Right) on the PoseTrack-2018 min-val set.

3.3 Results on PoseTrack Datasets

We evaluate our methods on the PoseTrack 2017 [7, 18, 20] and 2018 [6] datasets. We use the AlphaPose [6] model as the baseline and apply branch combination, OKS-NMS, and re-scoring to it (AlphaPose++).

Table 3. mAP on the PoseTrack 2017 and 2018 datasets. * means with tracking.
Table 4. MOTA on the PoseTrack 2017 and 2018 datasets. * means with tracking.
Table 5. Results on the PoseTrack ECCV 2018 Challenge without (Top) and with (Bottom) extra training datasets. * means with tracking. Our methods are in bold.

Tables 3 and 4 show the results on the validation sets. On the 2017 dataset, our methods are comparable to the state-of-the-art method [18] and outperform the other methods. On the 2018 dataset, our methods also surpass the baselines by a large margin. Meanwhile, for ResNet-152, testing with the COCO-MPII combination is better than testing with voting.

On the test set (Table 5), our MDPN method beats all other methods trained only on COCO, MPII, and PoseTrack by a large margin (7.0 mAP without tracking and 3.2 mAP with tracking). The no-tracking version also achieves the second-best performance among all methods. For tracking, our method obtains the best MOTA among all methods without extra datasets, again by a large margin (3.6 MOTA), and the third-best accuracy overall.

4 Conclusions

In conclusion, we investigated strategies for training on multiple datasets and presented the Multi-Domain Pose Network to improve human pose estimation. It surpasses the baselines and achieves state-of-the-art results on the PoseTrack benchmarks. Given its simplicity, we hope the proposed method can help improve performance when training on multiple datasets.