Abstract
Multi-person human pose estimation and tracking in the wild is important and challenging. For training a powerful model, large-scale training data are crucial. While there are several datasets for human pose estimation, the best practice for training on multi-dataset has not been investigated. In this paper, we present a simple network called Multi-Domain Pose Network (MDPN) to address this problem. By treating the task as multi-domain learning, our methods can learn a better representation for pose prediction. Together with prediction heads fine-tuning and multi-branch combination, it shows significant improvement over baselines and achieves the best performance on PoseTrack ECCV 2018 Challenge without additional datasets other than MPII and COCO.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Multi-person human pose estimation is an important component in many applications, such as video surveillance and sports video analytics. Though great progress has been made in this field [3, 4] thanks to the development of convolutional neural networks (CNNs), human pose estimation remains a challenging problem due to complex poses, diverse appearance, different scales, severe occlusion and crowds. For tracking in videos, the strong camera motions and extreme proximity of people [1] make it even more difficult.
Similar to other computer vision tasks dominated by deep learning, large-scale training data are crucial to exploit the representation power of CNNs for human pose estimation. There exists several extensive datasets such as COCO Dataset [13], MPII Dataset [2], and PoseTrack Dataset [1]. These datasets differ from each other about the distributions of images, poses and annotation standards. To promote the performance of models, many methods [7, 11, 18] choose to utilize multiple datasets for training. Most of them trained the models on COCO dataset first and then fine-tuned them on PoseTrack dataset [18] or MPII dataset [11]. However, it is still unclear what is the best practice to learn a model from multiple datasets for human pose estimation.
In this paper, we treat the task of training on multi-datasets as multi-domain learning [14, 19] and propose a CNN architecture named Multi-Domain Pose Network (MDPN). The network has a common backbone to share the representation from multiple domains and separate prediction heads for dataset-specific pose estimation. During training, we first jointly optimize on all datasets to learn the generic pose embedding. Then each heads are fine-tuned on each domains to further improve the accuracy of localization. We also investigate the prediction strategies for better performance. Evaluated on PoseTrack dataset, our methods with simple network structures significantly improve the learning on multi-dataset over baseline. Moreover, our methods is runner-up of the PoseTrack ECCV 2018 challenge of pose estimation but achieved the best performance without using extra training data other than MPII and COCO datasets.
2 Methods
Overall, we adopt the top-down approach [4, 18] to human pose estimation using only single frame information, which employs person detector to detect all the people in the image and then use single person pose estimator (SPPE) to obtain the human poses for all the boxes. For SPPE, we take advantage of multiple dataset information to train a multi-domain network. After that, we use simple matching [18] on adjacent frames to associate individuals into track-lets.
2.1 Multi-Domain Pose Network (MDPN)
For training on multiple datasets, there are three simple solutions:
Mixed. All the datasets are merged into one single dataset. We also merge all the joint sets into one single joint set with a total number of 21 keypoints. During training, gradients are only back-propagated to the annotated joints for each sample. Mixing the datasets can make full use of all the information from all datasets, but different annotation standards for datasets on the same joint may distract the training procedure.
Transfer Learning. As done in [11, 18], we can also train the model on one dataset first to learn the generic representation. Then fine-tuning is performed on the target dataset. COCO dataset [13] is often selected to pre-train the models because its data distribution is good to train a generic pose estimation model. Transfer learning can speed up the learning and often achieves good results. But the learnt embedding is suboptimal for pose estimation and it is easy to lose the knowledge from the first dataset when training on the target dataset for a long time as observed by [21].
Multi-domain Learning. Another approach to train on multi-dataset is to view it as a multi-domain learning task like [14]. It uses a common backbone network to learn a common pose representation and several prediction heads to learn domain-specific pose estimation. Compared with mixed datasets, multi-domain learning addresses the different annotation distribution problem. But to balance between different domains, the learnt prediction heads may not be optimal. Moreover, the predictions of multi-domain network only use one single head, which is a waste of information from other datasets.
According to the analysis above, we propose Multi-Domain Pose Network to solve such problems. We first apply multi-domain learning on all datasets. Then we fine-tune the full model on COCO dataset to optimize the embedding and COCO head. Finally we fix the backbone together with COCO head and fine-tune our network on the combination of MPII and PoseTrack dataset for the remain heads. Figure 1 illustrates the whole network structure. The details of structure will be explained in Sect. 2.2.
For prediction, one simple strategy is to use the predictions from corresponding dataset head. In order to exploit all the information from different datasets, we combine the predictions from different heads to form the final estimation (Fig. 1). There are several ways of combination, which will be discussed and compared in Sect. 3.2. Such methods can also be viewed as a lightweight multi-dataset ensemble implemented by multi-branch predictions like [8, 12].
2.2 Implementations
Model Structure. We use ResNet-152 [10] with three deconvolution layers as backbone [18]. To address vanishing gradients, we add an intermediate prediction after conv3 layers for supervision, and add it back to the second deconvolution layer as skip connection. The size of input image is 384 \(\times \) 288.
Training. The cropping and augmentations are the same as [4]. The Gaussian maps with sigma 9 are used as targets. We use the pre-trained models on ImageNet for ResNet backbone. The base learning rate is 0.001 with batch size 128 and Adam optimizer. For the jointly training stage of MDPN, we use 120 epochs. The learning rate is dropped to 0.0001 at 90 epochs. Then we perform 15 epochs for fine-tuning on COCO dataset. Finally we fine-tune the model on MPII and PoseTrack datasets for 20 epochs (The learning rate is dropped to 0.00001 at 10 epochs). To improve the performance of hard keypoints, we change the L2 loss to Online Hard Keypoints Mining (OHKM) [4] loss with 8 top keypoints at 100 epochs. For other models, we follow the training scheme in [18].
Testing. We follow the common practice in [4] with flipping testing and quarter offsets. We also re-score the box with the production of box score and average keypoint scores [4] after predictions.
Detection. We use four public person detectors trained on COCO dataset [13], including Faster R-CNN [17], Mask R-CNN [9], YOLO [16], and DCN [5]. Then we merge all the boxes with NMS of 0.6 and use them as detection results.
Tracking. We follow the pipelines of flow-based tracking [18] with four modifications. First, we apply OKS-NMS [15] of 0.4 after pose estimation. Second, we use Hungarian matching instead of greedy matching. Third, after tracking we prune short track-lets that contain less than 2 frames to reduce the false positive cases. Finally, we do not employ box propagation because the detector ensemble is strong enough. For multi-frame flow tracking, we use at most 8 frames before.
3 Experiments
3.1 Datasets and Evaluation
We train our models on three datasets: COCO-2017 dataset [13], MPII Dataset [2], and PoseTrack-2018 Dataset [1]. Then we evaluate our methods on the PoseTrack-2018 validation dataset. For multi-person pose estimation, we use mean Average Precision (mAP) metric. For multi-person tracking, we use Multiple Object Tracking Accuracy (MOTA) metric. To compare with state-of-the-art methods, we also evaluate our methods on PoseTrack-2017 validation dataset. For ablation study, we construct a min-val dataset from PoseTrack-2018 validation dataset by uniformly sub-sampling 15 sequences out of 75 sequences.
3.2 Ablation Study
ResNet-50 of input 256 \(\times \) 192 without skip connection is used here for simplicity.
Testing. We have tried different combination methods on the multi-domain model: (1) COCO branch: Using the COCO branch and interpolating the head positions from other keypoints. (2) PoseTrack branch: Using the PoseTrack branch. (3) COCO + PoseTrack branch: Using the COCO branch with the head position from PoseTrack branch. (4) COCO + MPII branch: Using the COCO branch with the head position from MPII branch. (5) Voting: Averaging the heatmaps from common keypoints from all branches. From the right part of Table 1, the last two methods achieve the best performance. So we will only use these two methods for remain testing and refer them as method A and B.
Training. We compare different training strategies for multi-dataset: (1) MPII: Training on MPII. (2) COCO: Training on COCO. (3) PoseTrack: Training on PoseTrack. (4) COCO\(\rightarrow \) PoseTrack: Training on COCO and fine-tuning on PoseTrack [7, 18]. (5) COCO\(\rightarrow \) PoseTrack + MPII: Training on COCO and fine-tuning on mixed datasets of MPII and PoseTrack [11]. (6) Mixed: Training on mixed dataset. (7) MDPN-B w/o FT: Training with multi-domain learning without fine-tuning and testing with method B. (8) MDPN-B: Training with multi-domain learning with fine-tuning and testing with method B.
Left part of Table 1 shows that among all approaches, the MDPN-B achieves the best performance. And fine-tuning after multi-domain training is important for the final performance (+3.0 mAP). As for the results of single dataset, training on COCO performs the best even without head annotations, while the accuracy of PoseTrack is worst. This is because the images in PoseTrack are obtained from limited videos and contain duplicate information, which leads to a smaller dataset. Another conclusion is that fine-tuning does not always improve the performance on the target dataset due to the knowledge forgetting problems.
Post-processing. Table 2 indicates that all post-processing is necessary for final performance. The OKS-NMS is crucial for mAP because too many false positive part detections may mislead the matching stage of evaluation.
3.3 Results on PoseTrack Datasets
We evaluate our methods on PoseTrack 2017 [7, 18, 20] and 2018 dataset [6]. We use AlphaPose [6] model as baseline and apply branch combination, OKS-NMS and re-scoring (AlphaPose++).
Tables 3 and 4 show all the results on validation sets. For 2017 dataset, our methods show comparable performance with state-of-the-art method [18] and outperform other methods. For 2018 dataset, our methods also surpass the baselines with large margin. Meanwhile, testing with COCO-MPII combination is better than that with voting for ResNet-152.
For test set (Table 5), our MDPN methods beats all the other methods trained only on COCO, MPII and PoseTrack by a large margin (7.0 mAP for no-tracking and 3.2 mAP for tracking). The no-tracking version also achieves the second best performance among all methods. For tracking, our method also performs the best among all methods without extra datasets for MOTA by a large margin (3.6 MOTA) and achieves the third best accuracy in all methods.
4 Conclusions
In conclusion, we investigate the strategies for training on multi-dataset and present Multi-Domain Pose Network to improve human pose estimation. It surpasses the baselines and achieves state-of-the-art results on PoseTrack benchmarks. Because of the simplicity, we hope proposed methods can help improve the performance for training on multiple datasets.
References
Andriluka, M., et al.: Posetrack: a benchmark for human pose estimation and tracking (2017)
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Computer Vision and Pattern Recognition, pp. 3686–3693 (2014)
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1302–1310 (2017)
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation (2017)
Dai, J., et al.: Deformable convolutional networks. In: IEEE International Conference on Computer Vision, pp. 764–773 (2017)
Fang, H., Xie, S., Tai, Y.W., Lu, C.: RMPE: regional multi-person pose estimation. In: The IEEE International Conference on Computer Vision (ICCV), vol. 2 (2017)
Girdhar, R., Gkioxari, G., Torresani, L., Paluri, M., Du, T.: Detect-and-track: efficient pose estimation in videos (2018)
Guo, H., Wang, G., Chen, X., Zhang, C., Qiao, F., Yang, H.: Region ensemble network: improving convolutional network for hand pose estimation. In: IEEE International Conference on Image Processing, pp. 4512–4516 (2017)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Jin, S., et al.: Towards multi-person pose tracking: bottom-up and top-down methods (2017)
Li, H., Li, Y., Porikli, F.: Convolutional neural net bagging for online visual tracking. Comput. Vis. Image Understand. 153, 120–129 (2016)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4293–4302 (2016)
Papandreou, G., et al.: Towards accurate multi-person pose estimation in the wild, pp. 3711–3719 (2017)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement (2018)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2015)
Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 472–487. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_29
Xiao, T., Li, H., Ouyang, W., Wang, X.: Learning deep feature representations with domain guided dropout for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1249–1258 (2016)
Xiu, Y., Li, J., Wang, H., Fang, Y., Lu, C.: Pose flow: Efficient online pose tracking. arXiv preprint arXiv:1802.00977 (2018)
Zhu, X., Jiang, Y., Luo, Z.: Multi-person pose estimation for posetrack with enhanced part affinity fields. In: ICCV PoseTrack Workshop (2017)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, H., Tang, T., Luo, G., Chen, R., Lu, Y., Wen, L. (2019). Multi-Domain Pose Network for Multi-Person Pose Estimation and Tracking. In: Leal-Taixé, L., Roth, S. (eds) Computer Vision – ECCV 2018 Workshops. ECCV 2018. Lecture Notes in Computer Science(), vol 11130. Springer, Cham. https://doi.org/10.1007/978-3-030-11012-3_17
Download citation
DOI: https://doi.org/10.1007/978-3-030-11012-3_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11011-6
Online ISBN: 978-3-030-11012-3
eBook Packages: Computer ScienceComputer Science (R0)