
1 Introduction

Keypoint estimation is a computer vision task that involves localizing points of interest in images. It has emerged as one of the most highly researched topics in the computer vision literature [1, 9, 11, 15, 18, 19, 37, 38, 39, 40, 47, 49, 55, 60, 63, 67, 70]. The most common method for estimating keypoint locations involves generating target fields, referred to as heatmaps, that center 2D Gaussians on the target keypoint coordinates. Deep convolutional neural networks [26] are then used to regress the target heatmaps on the input images, and keypoint predictions are made via the arguments of the maxima of the predicted heatmaps [57].

While strong empirical results have positioned heatmap regression as the de facto standard method for detecting and localizing keypoints [3, 5, 6, 7, 12, 24, 35, 37, 43, 54, 57, 66, 68], there are several known drawbacks. First, these methods suffer from quantization error: the precision of a keypoint prediction is inherently limited by the spatial resolution of the output heatmap. Larger heatmaps are therefore advantageous, but require additional upsampling operations and costly processing at higher resolution [3, 7, 12, 35, 37]. Even when large heatmaps are used, special post-processing steps are required to refine keypoint predictions, slowing down inference [6, 7, 35, 43]. Second, when two keypoints of the same type (i.e., class) appear in close proximity to one another, the overlapping heatmap signals may be mistaken for a single keypoint. Indeed, this is a common failure case [5]. For these reasons, researchers have started to investigate alternative, heatmap-free keypoint detection methods [27, 29, 30, 38, 67].

Fig. 1. Accuracy vs. inference speed: KAPAO compared to state-of-the-art single-stage multi-person human pose estimation methods DEKR [12], HigherHRNet [7], HigherHRNet + SWAHR [35], and CenterGroup [3] without test-time augmentation (TTA), i.e., excluding multi-scale testing and horizontal flipping. The raw data are provided in Table 1. Circle size is proportional to the number of model parameters.

In this paper, we introduce a new heatmap-free keypoint detection method and apply it to single-stage multi-person human pose estimation. Our method builds on recent research showing how keypoints can be modeled as objects within a dense anchor-based detection framework by representing keypoints at the center of small keypoint bounding boxes [38]. In preliminary experimentation with human pose estimation, we found that this keypoint detection approach works well for human keypoints that are characterized by local image features (e.g., the eyes), but the same approach is less effective at detecting human keypoints that require a more global understanding (e.g., the hips). We therefore introduce a new pose object representation to help detect sets of keypoints that are spatially related. Furthermore, we detect keypoint objects and pose objects simultaneously and fuse the results using a simple matching algorithm to exploit the benefits of both object representations. By virtue of detecting pose objects, we unify person detection and keypoint estimation and provide a highly efficient single-stage approach to multi-person human pose estimation.

As a result of not using heatmaps, KAPAO compares favourably against recent single-stage human pose estimation models in terms of accuracy and inference speed, especially when not using test-time augmentation (TTA), which represents how such models are deployed in practice. As shown in Fig. 1, KAPAO achieves an AP of 70.6 on the Microsoft COCO Keypoints validation set without TTA while having an average latency of 54.4 ms (forward pass + post-processing time). Compared to the state-of-the-art single-stage model HigherHRNet + SWAHR [35], KAPAO is 5.1\(\times \) faster and 3.3 AP more accurate when not using TTA. Compared to CenterGroup [3], KAPAO is 3.1\(\times \) faster and 1.5 AP more accurate. The contributions of this work are summarized as follows:

  • A new pose object representation is proposed that extends the conventional object representation by including a set of keypoints associated with the object.

  • A new approach to single-stage human pose estimation is developed by simultaneously detecting keypoint objects and pose objects and fusing the detections. The proposed heatmap-free method is significantly faster and more accurate than state-of-the-art heatmap-based methods when not using TTA.

2 Related Work

Heatmap-free Keypoint Detection. DeepPose [58] regressed keypoint coordinates directly from images using a cascade of deep neural networks that iteratively refined the keypoint predictions. Shortly thereafter, Tompson et al. [57] introduced the notion of keypoint heatmaps, which have since remained prevalent in human pose estimation [5, 6, 7, 12, 24, 37, 43, 54, 65, 66, 68] and other keypoint detection applications [9, 15, 18, 59, 63]. Noting the computational inefficiencies associated with generating heatmaps, Li et al. [30] disentangled the horizontal and vertical keypoint coordinates such that each coordinate was represented using a one-hot encoded vector. This saved computation and permitted an expansion of the output resolution, thereby reducing the effects of quantization error and eliminating the need for refinement post-processing. Li et al. [27] introduced the residual log-likelihood (RLE), a novel loss function for direct keypoint regression based on normalizing flows [53]. Direct keypoint regression has also been attempted using Transformers [29].

Outside the realm of human pose estimation, Xu et al. [67] regressed anchor templates of facial keypoints and aggregated them to achieve state-of-the-art accuracy in facial alignment. In sports analytics, McNally et al. [38] encountered the issue of overlapping heatmap signals in the development of an automatic scoring system for darts and therefore opted to model keypoints as objects using small square bounding boxes. This keypoint representation proved to be highly effective and serves as the inspiration for this work.

Single-stage Human Pose Estimation. Single-stage human pose estimation methods predict the poses of every person in an image using a single forward pass [5, 7, 12, 14, 25, 42, 44]. In contrast, two-stage methods [6, 10, 24, 27, 37, 46, 54, 66] first detect the people in an image using an off-the-shelf person detector (e.g., Faster R-CNN [52], YOLOv3 [51], etc.) and then estimate poses for each detection. Single-stage methods are generally less accurate, but usually perform better in crowded scenes [28] and are often preferred because of their simplicity and efficiency, which becomes particularly favourable as the number of people in the image increases. Single-stage approaches vary more in their design compared to two-stage approaches. For instance, they may: (i) detect all the keypoints in an image and perform a bottom-up grouping into human poses [3, 5, 7, 16, 17, 22, 25, 35, 42, 48]; (ii) extend object detectors to unify person detection and keypoint estimation [14, 36, 64, 70]; or (iii) use alternative keypoint/pose representations (e.g., predicting root keypoints and relative displacements [12, 44, 45]). We briefly summarize the most recent state-of-the-art single-stage methods below.

Cheng et al. [7] repurposed HRNet [54] for bottom-up human pose estimation by adding a transpose convolution to double the output heatmap resolution (HigherHRNet) and using associative embeddings [42] for keypoint grouping. They also implemented multi-resolution training to address the scale variation problem. Geng et al. [12] predicted person center heatmaps and 2K offset maps representing offset vectors for the K keypoints of a pose candidate centered on each pixel using an HRNet backbone. They also disentangled the keypoint regression (DEKR) using separate regression heads and adaptive convolutions. Luo et al. [35] used HigherHRNet as a base and proposed scale and weight adaptive heatmap regression (SWAHR), which scaled the ground-truth heatmap Gaussian variances based on the person scale and balanced the foreground/background loss weighting. Their modifications provided significant accuracy improvements over HigherHRNet and comparable performance to many two-stage methods. Again using HigherHRNet as a base, Brasó et al. [3] proposed CenterGroup to match keypoints to person centers using a fully differentiable self-attention module that was trained end-to-end together with the keypoint detector. Notably, all of the aforementioned methods suffer from costly heatmap post-processing and as such, their inference speeds leave much to be desired.

Extending Object Detectors for Human Pose Estimation. There is significant overlap between the tasks of object detection and human pose estimation. For instance, He et al. [14] used the Mask R-CNN instance segmentation model for human pose estimation by predicting keypoints using one-hot masks. Wei et al. [64] proposed Point-Set Anchors, which adapted the RetinaNet [32] object detector using pose anchors instead of bounding box anchors. Zhou et al. [70] modeled objects using heatmap-based center points with CenterNet and represented poses as a 2K-dimensional property of the center point. Mao et al. [36] adapted the FCOS [56] object detector with FCPose using dynamic filters [21]. While these methods based on object detectors provide good efficiency, their accuracies have not competed with state-of-the-art heatmap-based methods. Our work is most similar to Point-Set Anchors [64]; however, our method does not require defining data-dependent pose anchors. Moreover, we simultaneously detect individual keypoints and poses and fuse the detections to improve the accuracy of our final pose predictions.

3 KAPAO: Keypoints and Poses as Objects

KAPAO uses a dense detection network to simultaneously predict a set of keypoint objects \(\{\hat{\mathcal {O}}^k\in \hat{\textbf{O}}^k\}\) and a set of pose objects \(\{\hat{\mathcal {O}}^p\in \hat{\textbf{O}}^p\}\), collectively \(\hat{\textbf{O}} = \hat{\textbf{O}}^k\cup \hat{\textbf{O}}^p\). We introduce the concept behind each object type and the relevant notation below. All units are assumed to be in pixels unless stated otherwise.

A keypoint object \(\mathcal {O}^k\) is an adaptation of the conventional object representation in which the coordinates of a keypoint are represented at the center \((b_x, b_y)\) of a small bounding box \(\textbf{b}\) with equal width \(b_w\) and height \(b_h\): \(\textbf{b} = (b_x, b_y, b_w, b_h)\). The hyperparameter \(b_s\) controls the keypoint bounding box size (i.e., \(b_s\) = \(b_w\) = \(b_h\)). There are K classes of keypoint objects, one for each type in the dataset [38].
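To make the representation concrete, the following minimal sketch (in Python, not taken from the released code) constructs a keypoint object bounding box from a keypoint coordinate and \(b_s\):

```python
def keypoint_to_object(x, y, bs):
    """Represent a keypoint as a small square bounding box (Sec. 3).

    The keypoint coordinate becomes the box center (b_x, b_y), and the
    box width and height are both set to the hyperparameter bs (pixels).
    """
    return (x, y, bs, bs)  # (b_x, b_y, b_w, b_h) with b_w = b_h = b_s

# Example: a left-eye keypoint at (412.0, 103.5) with bs = 32
box = keypoint_to_object(412.0, 103.5, 32)
```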

Generally speaking, a pose object \(\mathcal {O}^p\) is considered to be an extension of the conventional object representation that additionally includes a set of keypoints associated with the object. While we expect pose objects to be useful in related tasks such as facial and object landmark detection [20, 67], they are applied herein to human pose estimation via detection of human pose objects, comprising a bounding box of class “person,” and a set of keypoints \(\textbf{z} = \{(x_k, y_k)\}_{k=1}^K\) that coincide with anatomical landmarks.

Both object representations possess unique advantages. Keypoint objects are specialized for the detection of individual keypoints that are characterized by strong local features. Examples of such keypoints that are common in human pose estimation include the eyes, ears, and nose. However, keypoint objects carry no information regarding the concept of a person or pose. If used on their own for multi-person human pose estimation, a bottom-up grouping method would be required to parse the detected keypoints into human poses. In contrast, pose objects are better suited for localizing keypoints with weak local features as they enable the network to learn the spatial relationships within a set of keypoints. Moreover, they can be leveraged for multi-person human pose estimation directly without the need for bottom-up keypoint grouping.

Recognizing that keypoint objects exist in a subspace of pose objects, the KAPAO network was designed to simultaneously detect both object types with minimal computational overhead using a single shared network head. During inference, the more precise keypoint object detections are fused with the human pose detections using a simple tolerance-based matching algorithm that improves the accuracy of the human pose predictions without sacrificing any significant amount of inference speed. The following sections provide details on the network architecture, the loss function used to train the network, and inference.

3.1 Architectural Details

A diagram of the KAPAO pipeline is provided in Fig. 2. It uses a deep convolutional neural network \(\mathcal {N}\) to map an RGB input image \(\textbf{I}\in \mathbb {R}^{h\times w\times 3}\) to a set of four output grids \(\hat{\textbf{G}} = \{\hat{\mathcal {G}}^s\mid s\in \{8, 16, 32, 64\}\}\) containing the object predictions \(\hat{\textbf{O}}\), where \(\hat{\mathcal {G}}^s\in \mathbb {R}^{\frac{h}{s}\times \frac{w}{s}\times N_a \times N_o}\):

$$\begin{aligned} \mathcal {N}(\textbf{I}) = \hat{\textbf{G}}. \end{aligned}$$
(1)

\(N_a\) is the number of anchor channels and \(N_o\) is the number of output channels for each object. \(\mathcal {N}\) is a YOLO-style feature extractor that makes extensive use of Cross-Stage-Partial (CSP) bottlenecks [62] within a feature pyramid [31] macroarchitecture. To provide flexibility for different speed requirements, three sizes of KAPAO models were trained (i.e., KAPAO-S/M/L) by scaling the number of layers and channels in \(\mathcal {N}\).
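For concreteness, the snippet below enumerates the four output grid shapes for a \(1280\times 1280\) input; the COCO keypoint count is taken from Sect. 4, while the number of anchor channels (\(N_a = 3\)) is a YOLO-style assumption, not a value confirmed by the text:

```python
h = w = 1280          # input resolution used for COCO training (Sec. 4.1)
K = 17                # number of keypoint types (COCO)
Na = 3                # anchor channels per grid cell (assumed, YOLO-style)
No = 3 * K + 6        # output channels per object (Sec. 3.1)

for s in (8, 16, 32, 64):
    print(f"stride {s:2d}: grid {h // s} x {w // s} x {Na} x {No}")
```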

Fig. 2. KAPAO uses a dense detection network \(\mathcal {N}\) trained using the multi-task loss \(\mathcal {L}\) to map an RGB image \(\textbf{I}\) to a set of output grids \(\hat{\textbf{G}}\) containing the predicted pose objects \(\hat{\textbf{O}}^p\) and keypoint objects \(\hat{\textbf{O}}^k\). Non-maximum suppression (NMS) is used to obtain candidate detections \(\hat{\textbf{O}}^{p\prime }\) and \(\hat{\textbf{O}}^{k\prime }\), which are fused together using a matching algorithm \(\varphi \) to obtain the final human pose predictions \(\hat{\textbf{P}}\). The \(N_a\) and \(N_o\) dimensions in \(\hat{\textbf{G}}\) are not shown for clarity.

Due to the nature of strided convolutions, the features in an output grid cell \(\hat{\mathcal {G}}^s_{i,j}\) are conditioned on the image patch \(\textbf{I}_p=\textbf{I}_{si:s(i+1), sj:s(j+1)}\). Therefore, if the center of a target object \((b_x, b_y)\) is situated in \(\textbf{I}_p\), the output grid cell \(\hat{\mathcal {G}}^s_{i,j}\) is responsible for detecting it. The receptive field of an output grid increases with s, so smaller output grids are better suited for detecting larger objects.

The output grid cells \(\hat{\mathcal {G}}^s_{i,j}\) contain \(N_a\) anchor channels corresponding to anchor boxes \(\textbf{A}^s = \{(A_{w_a}, A_{h_a})\}_{a=1}^{N_a}\). A target object \(\mathcal {O}\) is assigned to an anchor channel via tolerance-based matching of the object and anchor box sizes. This provides redundancy such that the grid cells \(\hat{\mathcal {G}}^s_{i,j}\) can detect multiple objects and enables specialization for different object sizes and shapes. Additional detection redundancy is provided by also allowing the neighbouring grid cells \(\hat{\mathcal {G}}^s_{i\pm 1,j}\) and \(\hat{\mathcal {G}}^s_{i,j\pm 1}\) to detect an object in \(\textbf{I}_p\) [23, 61].
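The following sketch illustrates this assignment scheme; the ratio test and its threshold are assumptions carried over from the YOLO-style detectors [23, 61] that KAPAO builds on, not constants confirmed by the text:

```python
def assign_targets(bx, by, bw, bh, s, anchors, ratio_thresh=4.0):
    """Sketch of target assignment at stride s (Sec. 3.1).

    The grid cell containing the object center (bx, by) is responsible for
    the detection, along with its nearest vertical and horizontal neighbour
    cells (detection redundancy, as in [23, 61]). The object is assigned to
    an anchor channel if the box and anchor sizes agree within a tolerance.
    """
    i, j = int(by // s), int(bx // s)         # cell indices (row, column)
    di = 1 if (by / s) - i > 0.5 else -1      # nearest vertical neighbour
    dj = 1 if (bx / s) - j > 0.5 else -1      # nearest horizontal neighbour
    cells = [(i, j), (i + di, j), (i, j + dj)]
    matched = [a for a, (Aw, Ah) in enumerate(anchors)
               if max(bw / Aw, Aw / bw, bh / Ah, Ah / bh) < ratio_thresh]
    return cells, matched

# Example: a 40x80 px box at (300, 220) on the stride-16 grid
cells, anchor_ids = assign_targets(300, 220, 40, 80, 16,
                                   anchors=[(30, 60), (60, 120), (120, 240)])
```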

The \(N_o\) output channels of \(\hat{\mathcal {G}}^s_{i,j,a}\) contain the properties of a predicted object \(\hat{\mathcal {O}}\), including the objectness \(\hat{p}_{o}\) (the probability that an object exists), the intermediate bounding box \(\hat{\textbf{t}}'= (\hat{t}'_x, \hat{t}'_y, \hat{t}'_w, \hat{t}'_h)\), the object class scores \(\mathbf {\hat{c}} = (\hat{c}_1, ..., \hat{c}_{K+1})\), and the intermediate keypoints \(\hat{\textbf{v}}'= \{(\hat{v}'_{xk}, \hat{v}'_{yk})\}_{k=1}^K\) for the human pose objects. Hence, \(N_o = 3K + 6\).
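A possible layout of the \(N_o\) channels is sketched below; the ordering of the slices is an assumption for illustration and may differ from the released code:

```python
import torch

K = 17
No = 3 * K + 6                       # objectness + box (4) + classes (K+1) + keypoints (2K)

raw = torch.randn(No)                # one grid cell, one anchor channel
p_o   = raw[0]                       # objectness logit
t_raw = raw[1:5]                     # intermediate box (t'_x, t'_y, t'_w, t'_h)
c     = raw[5:5 + K + 1]             # class scores ("person" + K keypoint classes)
v_raw = raw[5 + K + 1:].view(K, 2)   # intermediate keypoints (v'_xk, v'_yk)
assert p_o.numel() + t_raw.numel() + c.numel() + v_raw.numel() == No
```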

Following [23, 61], an object’s intermediate bounding box \(\hat{\textbf{t}}\) is predicted in the grid coordinates and relative to the grid cell origin \((i, j)\) using:

$$\begin{aligned} \hat{t}_x = 2\sigma (\hat{t}'_x) - 0.5 \quad \quad \hat{t}_y = 2\sigma (\hat{t}'_y) - 0.5 \end{aligned}$$
(2)
$$\begin{aligned} \hat{t}_w = \frac{A_w}{s}(2\sigma (\hat{t}'_w))^2 \quad \quad \hat{t}_h = \frac{A_h}{s}(2\sigma (\hat{t}'_h))^2. \end{aligned}$$
(3)

This detection strategy is extended to the keypoints of a pose object. A pose object’s intermediate keypoints \(\hat{\textbf{v}}\) are predicted in the grid coordinates and relative to the grid cell origin \((i, j)\) using:

$$\begin{aligned} \hat{v}_{xk} = \frac{A_w}{s}(4\sigma (\hat{v}'_{xk}) - 2) \quad \quad \hat{v}_{yk} = \frac{A_h}{s}(4\sigma (\hat{v}'_{yk}) - 2). \end{aligned}$$
(4)

The sigmoid function \(\sigma \) facilitates learning by constraining the ranges of the object properties (e.g., \(\hat{v}_{xk}\) and \(\hat{v}_{yk}\) are constrained to \(\pm 2\frac{A_w}{s}\) and \(\pm 2\frac{A_h}{s}\), respectively). To learn \(\hat{\textbf{t}}\) and \(\hat{\textbf{v}}\), losses are applied in the grid space. Sample targets \(\textbf{t}\) and \(\textbf{v}\) are shown in Fig. 3.
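Equations (2)–(4) transcribe directly to code; the sketch below assumes the raw outputs have been sliced as in the earlier snippet:

```python
import torch

def decode(t_raw, v_raw, Aw, Ah, s):
    """Decode the intermediate box and keypoints in grid coordinates,
    relative to the grid cell origin (i, j), per Eqs. (2)-(4)."""
    sig = torch.sigmoid
    tx = 2 * sig(t_raw[0]) - 0.5
    ty = 2 * sig(t_raw[1]) - 0.5
    tw = (Aw / s) * (2 * sig(t_raw[2])) ** 2
    th = (Ah / s) * (2 * sig(t_raw[3])) ** 2
    vx = (Aw / s) * (4 * sig(v_raw[:, 0]) - 2)   # constrained to (-2*Aw/s, 2*Aw/s)
    vy = (Ah / s) * (4 * sig(v_raw[:, 1]) - 2)   # constrained to (-2*Ah/s, 2*Ah/s)
    return torch.stack([tx, ty, tw, th]), torch.stack([vx, vy], dim=1)
```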

Fig. 3. Sample targets for training, including a human pose object (blue), keypoint object (red), and no object (green). The “?” values are not used in the loss computation. (Color figure online)

3.2 Loss Function

A target set of grids \(\textbf{G}\) is constructed and a multi-task loss \(\mathcal {L}(\hat{\textbf{G}}, \textbf{G})\) is applied to learn the objectness \(\hat{p}_o\) (\(\mathcal {L}_{obj})\), the intermediate bounding boxes \(\hat{{\textbf {t}}}\) (\(\mathcal {L}_{box}\)), the class scores \(\hat{\textbf{c}}\) (\(\mathcal {L}_{cls}\)), and the intermediate pose object keypoints \(\hat{\textbf{v}}\) (\(\mathcal {L}_{kps}\)). The loss components are computed for a single image as follows:

$$\begin{aligned} \mathcal {L}_{obj} = \sum _s \frac{\omega _s}{n(\mathcal {G}^s)}\sum _{\mathcal {G}^s}\textrm{BCE}(\hat{p}_o, p_o\cdot \textrm{IoU}(\hat{\textbf{t}}, \textbf{t})) \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {L}_{box} = \sum _s \frac{1}{n(\mathcal {O}\in \mathcal {G}^s)}\sum _{\mathcal {O}\in \mathcal {G}^s}\left( 1 - \textrm{IoU}(\hat{\textbf{t}}, \textbf{t})\right) \end{aligned}$$
(6)
$$\begin{aligned} \mathcal {L}_{cls} = \sum _s \frac{1}{n(\mathcal {O}\in \mathcal {G}^s)}\sum _{\mathcal {O}\in \mathcal {G}^s}\textrm{BCE}(\hat{\textbf{c}}, \textbf{c}) \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{kps} = \sum _s \frac{1}{n(\mathcal {O}^p\in \mathcal {G}^s)}\sum _{\mathcal {O}^p\in \mathcal {G}^s} \sum _{k=1}^K \delta (\nu _k > 0)\,||\hat{\textbf{v}}_k - {\textbf{v}}_k||_2 \end{aligned}$$
(8)

where \(\omega _s\) is the grid weighting, \(\textrm{BCE}\) is the binary cross-entropy, \(\textrm{IoU}\) is the complete intersection over union (CIoU) [69], and \(\nu _k\) are the keypoint visibility flags. When \(\mathcal {G}^s_{i,j,a}\) represents a target object \(\mathcal {O}\), the target objectness \(p_o\) = 1 is multiplied by the \(\textrm{IoU}\) to promote specialization amongst the anchor channel predictions [50]. When \(\mathcal {G}^s_{i,j,a}\) is not a target object, \(p_o\) = 0. In practice, the losses are applied over a batch of images using batched grids. The total loss \(\mathcal {L}\) is the weighted summation of the loss components scaled by the batch size \(N_b\):

$$\begin{aligned} \mathcal {L} = N_b(\lambda _{obj}\mathcal {L}_{obj} + \lambda _{box}\mathcal {L}_{box} + \lambda _{cls}\mathcal {L}_{cls} + \lambda _{kps}\mathcal {L}_{kps}). \end{aligned}$$
(9)
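For illustration, Eqs. (8) and (9) might be implemented as follows (a simplified sketch; the batched, multi-grid bookkeeping of the actual training code is omitted):

```python
import torch

def kps_loss_grid(v_hat, v, vis):
    """Eq. (8) for a single grid: L2 distance summed over visible keypoints,
    averaged over the pose objects assigned to the grid.

    v_hat, v: (n, K, 2) predicted/target intermediate keypoints (grid coords)
    vis:      (n, K) visibility flags nu_k; only keypoints with nu_k > 0 count
    """
    d = torch.linalg.norm(v_hat - v, dim=-1)           # (n, K) distances
    return (d * (vis > 0).float()).sum() / max(v_hat.shape[0], 1)

def total_loss(L_obj, L_box, L_cls, L_kps, Nb, lam):
    """Eq. (9): weighted sum of the loss components scaled by batch size Nb.
    lam holds the four weights lambda_{obj,box,cls,kps} (tuned; see Sec. 4.1)."""
    return Nb * (lam["obj"] * L_obj + lam["box"] * L_box
                 + lam["cls"] * L_cls + lam["kps"] * L_kps)
```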

3.3 Inference

The predicted intermediate bounding boxes \(\hat{\textbf{t}}\) and keypoints \(\hat{\textbf{v}}\) are mapped back to the original image coordinates using the following transformation:

$$\begin{aligned} \hat{\textbf{b}} = s(\hat{\textbf{t}} + [i, j, 0, 0]) \quad \quad \hat{\textbf{z}}_k = s(\hat{\textbf{v}}_k + [i, j]). \end{aligned}$$
(10)

\(\hat{\mathcal {G}}^s_{i,j,a}\) represents a positive pose object detection \(\hat{\mathcal {O}}^p\) if its confidence \(\hat{p}_o\cdot \max (\hat{\textbf{c}})\) is greater than a threshold \(\tau _{cp}\) and \(\textrm{arg}\,\textrm{max}(\hat{\textbf{c}})=1\). Similarly, \(\hat{\mathcal {G}}^s_{i,j,a}\) represents a positive keypoint object detection \(\hat{\mathcal {O}}^k\) if \(\hat{p}_o\cdot \max (\hat{\textbf{c}}) > \tau _{ck}\) and \(\textrm{arg}\,\textrm{max}(\hat{\textbf{c}}) > 1\), where the keypoint object class is \(\textrm{arg}\,\textrm{max}(\hat{\textbf{c}}) - 1\). To remove redundant detections and obtain the candidate pose objects \(\hat{\textbf{O}}^{p\prime }\) and the candidate keypoint objects \(\hat{\textbf{O}}^{k\prime }\), the sets of positive pose object detections \(\hat{\textbf{O}}^p\) and positive keypoint object detections \(\hat{\textbf{O}}^k\) are filtered using non-maximum suppression (NMS) applied to the object bounding boxes with the \(\textrm{IoU}\) thresholds \(\tau _{bp}\) and \(\tau _{bk}\):

$$\begin{aligned} \hat{\textbf{O}}^{p\prime } = \textrm{NMS}(\hat{\textbf{O}}^p, \tau _{bp}) \quad \quad \hat{\textbf{O}}^{k\prime } = \textrm{NMS}(\hat{\textbf{O}}^k, \tau _{bk}). \end{aligned}$$
(11)
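Equation (10) and the confidence tests above amount to the following sketch, using 0-indexed classes (the paper's indexing is 1-based):

```python
import torch

def to_image_coords(t_hat, v_hat, i, j, s):
    """Eq. (10): map decoded grid-space box/keypoints back to image pixels,
    where (i, j) is the grid cell origin and s is the grid stride."""
    b_hat = s * (t_hat + torch.tensor([i, j, 0.0, 0.0]))
    z_hat = s * (v_hat + torch.tensor([i, j], dtype=v_hat.dtype))
    return b_hat, z_hat

def detection_type(p_o, c, tau_cp, tau_ck):
    """Confidence tests from Sec. 3.3. Here class 0 is "person" and classes
    1..K are the keypoint classes (shifted down by one vs. the paper)."""
    conf, cls = p_o * c.max(), int(c.argmax())
    if cls == 0 and conf > tau_cp:
        return "pose", conf
    if cls >= 1 and conf > tau_ck:
        return ("keypoint", cls - 1), conf   # keypoint object class
    return None, conf
```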

It is noted that \(\tau _{ck}\) and \(\tau _{bk}\) are scalar thresholds used for all keypoint object classes. Finally, the human pose predictions \(\hat{\textbf{P}} = \{\hat{\textbf{P}}_i\in \mathbb {R}^{K \times 3}\}\) for \(i \in \{1...n(\hat{\textbf{O}}^{p\prime })\}\) are obtained by fusing the candidate keypoint objects with the candidate pose objects using a distance tolerance \(\tau _{fd}\). To promote correct matches of keypoint objects to poses, the keypoint objects are only fused to pose objects with confidence \(\hat{p}_o\cdot \max (\hat{\textbf{c}}) > \tau _{fc}\):

$$\begin{aligned} \hat{\textbf{P}} = \varphi (\hat{\textbf{O}}^{p\prime }, \hat{\textbf{O}}^{k\prime }, \tau _{fd}, \tau _{fc}). \end{aligned}$$
(12)

The keypoint object fusion function \(\varphi \) is defined in Algorithm 1, where the following notation is used to index an object’s properties: \(\hat{x} = \hat{\mathcal {O}}_x\) (e.g., a pose object’s keypoints \(\hat{\textbf{z}}\) are referenced as \(\hat{\mathcal {O}}^{p}_{\textbf{z}}\)).

Algorithm 1. The keypoint object fusion function \(\varphi \).
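The listing for Algorithm 1 is not reproduced here; the following is a minimal sketch of the tolerance-based matching it describes, under assumed array layouts, which greedily assigns each candidate keypoint object to the nearest matching pose keypoint within \(\tau _{fd}\), considering only poses with confidence above \(\tau _{fc}\):

```python
import numpy as np

def fuse(poses, pose_conf, kp_xy, kp_cls, kp_conf, tau_fd, tau_fc):
    """Sketch of the fusion function phi (Eq. 12): replace pose keypoints
    with nearby keypoint-object detections of the same class.

    poses:     (n, K, 2) candidate pose keypoints z_hat in image coordinates
    pose_conf: (n,) pose object confidences p_o * max(c)
    kp_xy:     (m, 2) candidate keypoint object centers
    kp_cls:    (m,) keypoint object classes in 0..K-1
    kp_conf:   (m,) keypoint object confidences
    Returns:   (n, K, 3) poses; the third column holds the confidences
               of the fused keypoint objects (zero elsewhere).
    """
    n, K, _ = poses.shape
    P = np.concatenate([poses, np.zeros((n, K, 1))], axis=-1)
    if n == 0:
        return P
    for c, xy, conf in zip(kp_cls, kp_xy, kp_conf):
        d = np.linalg.norm(poses[:, c] - xy, axis=-1)  # distance to each pose
        d[pose_conf <= tau_fc] = np.inf                # only confident poses
        i = int(d.argmin())
        if d[i] < tau_fd:                              # within the tolerance
            P[i, c, :2], P[i, c, 2] = xy, conf         # replace the keypoint
    return P
```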

3.4 Limitations

A limitation of KAPAO is that pose objects do not include individual keypoint confidences, so the human pose predictions typically contain a sparse set of keypoint confidences \(\hat{\textbf{P}}_i[:,3]\) populated by the fused keypoint objects (see Algorithm 1 for details). If desired, a complete set of keypoint confidences can be induced by only using keypoint objects, which is realized when \(\tau _{ck} \rightarrow 0\). Another limitation is that training requires a considerable amount of time and GPU memory due to the large input size used.

4 Experiments

We evaluate KAPAO on two multi-person human pose estimation datasets: COCO Keypoints [33] (\(K=17\)) and CrowdPose [28] (\(K=14\)). We report the standard AP/AR detection metrics based on Object Keypoint Similarity [33] and compare against state-of-the-art methods. All hyperparameters are provided in the source code.

4.1 Microsoft COCO Keypoints

Training. KAPAO-S/M/L were all trained for 500 epochs on COCO train2017 using stochastic gradient descent with Nesterov momentum [41], weight decay, and a learning rate decayed over a single cosine cycle [34] with a 3-epoch warm-up period [13]. The input images were resized and padded to \(1280\times 1280\), keeping the original aspect ratio. Data augmentation used during training included mosaic [2], HSV color-space perturbations, horizontal flipping, translations, and scaling. Many of the training hyperparameters were inherited from [23, 61], including the anchor boxes \(\textbf{A}\), the grid weights \(\omega _s\), and the loss weights \(\lambda _{obj}\), \(\lambda _{box}\), and \(\lambda _{cls}\). Others, including the keypoint bounding box size \(b_s\) and the keypoint loss weight \(\lambda _{kps}\), were manually tuned using a small grid search. The models were trained on four V100 GPUs with 32 GB memory each using batch sizes of 128, 72, and 48 for KAPAO-S, M, and L, respectively. Validation was performed after every epoch, saving the model weights that provided the highest validation AP.

Testing. The six inference parameters (\(\tau _{cp}\), \(\tau _{ck}\), \(\tau _{bp}\), \(\tau _{bk}\), \(\tau _{fd}\), and \(\tau _{fc}\)) were manually tuned on the validation set using a coarse grid search to maximize accuracy. The results were not overly sensitive to the inference parameter values. When using TTA, the input image was scaled by factors of 0.8, 1, and 1.2, and the unscaled image was horizontally flipped. During post-processing, the multi-scale detections were concatenated before running NMS. When not using TTA, rectangular input images were used (i.e., 1280 px on the longest side), which marginally reduced the accuracy but increased the inference speed.

Table 1. Accuracy and speed comparison with state-of-the-art single-stage human pose estimation models on COCO val2017, including the forward pass (FP) and post-processing (PP). Latencies (Lat.) averaged over val2017 using a batch size of 1 on a TITAN Xp GPU.

Results. Table 1 compares the accuracy, forward pass (FP) time, and post-processing (PP) time of KAPAO with state-of-the-art single-stage methods HigherHRNet [7], HigherHRNet + SWAHR [35], DEKR [12], and CenterGroup [3] on val2017. Two test settings were considered: (1) without any test-time augmentation (using a single forward pass of the network), and (2) with multi-scale and horizontal flipping test-time augmentation (TTA). It is noted that with the exception of CenterGroup, no inference speeds were reported in the original works. Rather, FLOPs were used as an indirect measure of computational efficiency. FLOPs are not only a poor indication of inference speed [8], but they are also only computed for the forward pass of the network and thus do not provide an indication of the amount of computation required for post-processing.

Due to expensive heatmap refinement, the post-processing times of HigherHRNet, HigherHRNet + SWAHR, and DEKR are at least an order of magnitude greater than KAPAO-L when not using TTA. The post-processing time of KAPAO depends less on the input size so it only increases by approximately 1 ms when using TTA. Conversely, HigherHRNet and HigherHRNet + SWAHR generate and refine large heatmaps during multi-scale testing and therefore require more than two orders of magnitude more post-processing time than KAPAO-L.

CenterGroup requires significantly less post-processing time than HigherHRNet and DEKR because it skips heatmap refinement and directly encodes pose center and keypoint heatmaps as embeddings that are fed to an attention-based grouping module. When not using TTA, CenterGroup-W48 provides an improvement of 2.5 AP over HigherHRNet-W48 and has a better accuracy-speed trade-off. Still, KAPAO-L is 3.1\(\times \) faster than CenterGroup-W48 and 1.5 AP more accurate due to its efficient network architecture and near cost-free post-processing. When using TTA, KAPAO-L is 1.7 AP less accurate than CenterGroup-W48, but 4.9\(\times \) faster. KAPAO-L also achieves state-of-the-art AR, which is indicative of better detection rates.

We suspect that KAPAO is more accurate without TTA compared to previous methods because it uses larger input images; however, we emphasize that KAPAO consumes larger inputs while still being faster than previous methods due to its well-designed network architecture and efficient post-processing. For the same reason, TTA (multi-scale testing in particular) does not provide as much of a benefit; input sizes greater than 1280 are less effective because the dataset images are limited to 640 px.

Table 2. Accuracy comparison with two-stage (\(\dagger \)) and single-stage methods on COCO test-dev. Best results reported (i.e., including TTA). DEKR results use a model-agnostic rescoring network [12]. Latencies (Lat.) taken from Table 1. *Latencies reported in original papers [4, 36] and measured using an NVIDIA GTX 1080Ti GPU.

In Table 2, the accuracy of KAPAO is compared to single-stage and two-stage methods on test-dev. KAPAO-L achieves state-of-the-art AR and falls within 1.7 AP of the best performing single-stage method HigherHRNet-W48 + SWAHR while being 7.4\(\times \) faster. Notably, KAPAO-L is more accurate than the early two-stage methods G-RMI [46] and RMPE [10] and popular single-stage methods like OpenPose [4, 5], Associative Embeddings [42], and PersonLab [45]. Compared to other single-stage methods that extend object detectors for human pose estimation (Mask R-CNN [14], CenterNet [70], Point-Set Anchors [64], and FCPose [36]), KAPAO-L is considerably more accurate. Among all the single-stage methods, KAPAO-L achieves state-of-the-art AP at an OKS threshold of 0.50, which is indicative of better detection rates but less precise keypoint localization. This is an area to explore in future work.

4.2 CrowdPose

KAPAO was trained on the trainval split with 12k images and was evaluated on the 8k images in test. The same training and inference settings as on COCO were used except the models were trained for 300 epochs and no validation was performed during training. The final model weights were used for testing. Table 3 compares the accuracy of KAPAO against state-of-the-art methods. It was found that KAPAO excels in the presence of occlusion, achieving competitive results across all metrics compared to previous single-stage methods and state-of-the-art accuracy for AP\(^{.50}\). The proficiency of KAPAO in crowded scenes is clear when analyzing AP\(^E\), AP\(^M\), and AP\(^H\): KAPAO-L and DEKR-W48 [12] perform equally on images with easy Crowd Index (less occlusion), but KAPAO-L is 1.1 AP more accurate for both medium and hard Crowd Indices (more occlusion).

Table 3. Comparison with single-stage and two-stage (\(\dagger \)) methods on CrowdPose test, including TTA. DEKR results use a model-agnostic rescoring network [12]. HigherHRNet + SWAHR [35] not included due to issues reproducing the results reported in the paper using the source code. Latencies (Lat.) taken from Table 1. *Latency reported in original paper [4] and measured using NVIDIA GTX 1080Ti GPU on COCO.
Fig. 4. Left: the influence of the keypoint object bounding box size on learning; each KAPAO-S model was trained for 50 epochs. Right: keypoint object fusion rates for each keypoint type, evaluated on COCO val2017 using KAPAO-S without TTA.

4.3 Ablation Studies

The influence of the keypoint bounding box size \(b_s\), one of KAPAO’s important hyperparameters, was empirically analyzed. Five KAPAO-S models were trained on COCO train2017 for 50 epochs using normalized keypoint bounding box sizes \(b_s/\max (w,h)\) \(\in \{0.01, 0.025, 0.05, 0.075, 0.1\}\). The validation AP is plotted in Fig. 4 (left). The results are consistent with the prior work of McNally et al. [38]: \(b_s/\max (w,h)<\) 2.5% destabilizes training, leading to poor accuracy, and optimal \(b_s/\max (w,h)\) is observed around 5% (the value used for the experiments in the previous sections). In contrast to McNally et al., the accuracy in this study degrades quickly for \(b_s/\max (w,h)>\) 5%. It is hypothesized that large \(b_s\) in this application interferes with pose object learning.

The accuracy improvements resulting from fusing the keypoint objects with the pose objects are provided in Table 4. Keypoint object fusion adds no less than 1.0 AP and over 3.0 AP in some cases. Moreover, keypoint object fusion is fast; the added post-processing time per image is \(\le \) 1.7 ms on COCO and \(\le \) 4.5 ms on CrowdPose. Relative to the time required for the forward pass of the network (see Table 1), these are small increases.

Table 4. Accuracy improvement when fusing keypoint object detections with human pose detections. Latencies averaged over each dataset using a batch size of 1 on a TITAN Xp GPU.

The fusion of keypoint objects by class is also studied. Figure 4 (right) plots the fusion rates for each keypoint type for KAPAO-S with no TTA on COCO val2017. The fusion rate is equal to the number of fused keypoint objects divided by the number of keypoints of that type in the dataset. Because the number of human pose predictions is generally greater than the actual number of person instances in the dataset, the fusion rate can be greater than 1. As originally hypothesized, keypoints that are characterized by distinct local image features (e.g., the eyes, ears, and nose) have higher fusion rates as they are detected more precisely as keypoint objects than as pose objects. Conversely, keypoints that require a more global understanding (e.g., the hips) are better detected using pose objects, as evidenced by lower fusion rates.

5 Conclusion

This paper presents KAPAO, a heatmap-free keypoint estimation method based on modeling keypoints and poses as objects. KAPAO is effectively applied to the problem of single-stage multi-person human pose estimation by detecting human pose objects. Moreover, fusing jointly detected keypoint objects improves the accuracy of the predicted human poses with minimal computational overhead. When not using test-time augmentation, KAPAO is significantly faster and more accurate than previous single-stage methods, which are impeded greatly by heatmap post-processing and bottom-up keypoint grouping. Moreover, KAPAO performs well in the presence of heavy occlusion as evidenced by competitive results on CrowdPose.