
1 Introduction

Keypoint estimation is a computer vision task that involves localizing points of interest in images. It has emerged as one of the most highly researched topics in the computer vision literature [1, 9, 11, 15, 18, 19, 37, 38, 39, 40, 47, 49, 55, 60, 63, 67, 70]. The most common method for estimating keypoint locations involves generating target fields, referred to as heatmaps, that center 2D Gaussians on the target keypoint coordinates. Deep convolutional neural networks [26] are then used to regress the target heatmaps on the input images, and keypoint predictions are made via the arguments of the maxima of the predicted heatmaps [57].

While strong empirical results have positioned heatmap regression as the de facto standard method for detecting and localizing keypoints [3, 5, 6, 7, 12, 24, 35, 37, 43, 54, 57, 66, 68], there are several known drawbacks. First, these methods suffer from quantization error: the precision of a keypoint prediction is inherently limited by the spatial resolution of the output heatmap. Larger heatmaps are therefore advantageous, but require additional upsampling operations and costly processing at higher resolution [3, 7, 12, 35, 37]. Even when large heatmaps are used, special post-processing steps are required to refine keypoint predictions, slowing down inference [6, 7, 35, 43]. Second, when two keypoints of the same type (i.e., class) appear in close proximity to one another, the overlapping heatmap signals may be mistaken for a single keypoint. Indeed, this is a common failure case [5]. For these reasons, researchers have started to investigate alternative, heatmap-free keypoint detection methods [27, 29, 30, 38, 67].

Fig. 1. Accuracy vs. inference speed: KAPAO compared to state-of-the-art single-stage multi-person human pose estimation methods DEKR [12], HigherHRNet [7], HigherHRNet + SWAHR [35], and CenterGroup [3] without test-time augmentation (TTA), i.e., excluding multi-scale testing and horizontal flipping. The raw data are provided in Table 1. Circle size is proportional to the number of model parameters.

In this paper, we introduce a new heatmap-free keypoint detection method and apply it to single-stage multi-person human pose estimation. Our method builds on recent research showing how keypoints can be modeled as objects within a dense anchor-based detection framework by representing keypoints at the center of small keypoint bounding boxes [38]. In preliminary experimentation with human pose estimation, we found that this keypoint detection approach works well for human keypoints that are characterized by local image features (e.g., the eyes), but the same approach is less effective at detecting human keypoints that require a more global understanding (e.g., the hips). We therefore introduce a new pose object representation to help detect sets of keypoints that are spatially related. Furthermore, we detect keypoint objects and pose objects simultaneously and fuse the results using a simple matching algorithm to exploit the benefits of both object representations. By virtue of detecting pose objects, we unify person detection and keypoint estimation and provide a highly efficient single-stage approach to multi-person human pose estimation.

As a result of not using heatmaps, KAPAO compares favourably against recent single-stage human pose estimation models in terms of accuracy and inference speed, especially when not using test-time augmentation (TTA), which represents how such models are deployed in practice. As shown in Fig. 1, KAPAO achieves an AP of 70.6 on the Microsoft COCO Keypoints validation set without TTA while having an average latency of 54.4 ms (forward pass + post-processing time). Compared to the state-of-the-art single-stage model HigherHRNet + SWAHR [35], KAPAO is 5.1\(\times \) faster and 3.3 AP more accurate when not using TTA. Compared to CenterGroup [3], KAPAO is 3.1\(\times \) faster and 1.5 AP more accurate. The contributions of this work are summarized as follows:

  • A new pose object representation is proposed that extends the conventional object representation by including a set of keypoints associated with the object.

  • A new approach to single-stage human pose estimation is developed by simultaneously detecting keypoint objects and pose objects and fusing the detections. The proposed heatmap-free method is significantly faster and more accurate than state-of-the-art heatmap-based methods when not using TTA.

2 Related Work

Heatmap-free Keypoint Detection. DeepPose [58] regressed keypoint coordinates directly from images using a cascade of deep neural networks that iteratively refined the keypoint predictions. Shortly thereafter, Tompson et al. [57] introduced the notion of keypoint heatmaps, which have since remained prevalent in human pose estimation [5, 6, 7, 12, 24, 37, 43, 54, 65, 66, 68] and other keypoint detection applications [9, 15, 18, 59, 63]. Noting the computational inefficiencies associated with generating heatmaps, Li et al. [30] disentangled the horizontal and vertical keypoint coordinates such that each coordinate was represented using a one-hot encoded vector. This saved computation and permitted an expansion of the output resolution, thereby reducing the effects of quantization error and eliminating the need for refinement post-processing. Li et al. [27] introduced the residual log-likelihood (RLE), a novel loss function for direct keypoint regression based on normalizing flows [53]. Direct keypoint regression has also been attempted using Transformers [29].

Outside the realm of human pose estimation, Xu et al. [67] regressed anchor templates of facial keypoints and aggregated them to achieve state-of-the-art accuracy in facial alignment. In sports analytics, McNally et al. [38] encountered the issue of overlapping heatmap signals in the development of an automatic scoring system for darts and therefore opted to model keypoints as objects using small square bounding boxes. This keypoint representation proved to be highly effective and serves as the inspiration for this work.

Single-stage Human Pose Estimation. Single-stage human pose estimation methods predict the poses of every person in an image using a single forward pass [5, 7, 12, 14, 25, 42, 44]. In contrast, two-stage methods [6, 10, 24, 27, 37, 46, 54, 66] first detect the people in an image using an off-the-shelf person detector (e.g., Faster R-CNN [52], YOLOv3 [51], etc.) and then estimate poses for each detection. Single-stage methods are generally less accurate, but usually perform better in crowded scenes [28] and are often preferred because of their simplicity and efficiency, which becomes particularly favourable as the number of people in the image increases. Single-stage approaches vary more in their design compared to two-stage approaches. For instance, they may: (i) detect all the keypoints in an image and perform a bottom-up grouping into human poses [3, 5, 7, 16, 17, 22, 25, 35, 42, 48]; (ii) extend object detectors to unify person detection and keypoint estimation [14, 36, 64, 70]; or (iii) use alternative keypoint/pose representations (e.g., predicting root keypoints and relative displacements [12, 44, 45]). We briefly summarize the most recent state-of-the-art single-stage methods below.

Cheng et al. [7] repurposed HRNet [54] for bottom-up human pose estimation by adding a transpose convolution to double the output heatmap resolution (HigherHRNet) and using associative embeddings [42] for keypoint grouping. They also implemented multi-resolution training to address the scale variation problem. Geng et al. [12] predicted person center heatmaps and 2K offset maps representing offset vectors for the K keypoints of a pose candidate centered on each pixel using an HRNet backbone. They also disentangled the keypoint regression (DEKR) using separate regression heads and adaptive convolutions. Luo et al. [35] used HigherHRNet as a base and proposed scale and weight adaptive heatmap regression (SWAHR), which scaled the ground-truth heatmap Gaussian variances based on the person scale and balanced the foreground/background loss weighting. Their modifications provided significant accuracy improvements over HigherHRNet and comparable performance to many two-stage methods. Again using HigherHRNet as a base, Brasó et al. [3] proposed CenterGroup to match keypoints to person centers using a fully differentiable self-attention module that was trained end-to-end together with the keypoint detector. Notably, all of the aforementioned methods suffer from costly heatmap post-processing and as such, their inference speeds leave much to be desired.

Extending Object Detectors for Human Pose Estimation. There is significant overlap between the tasks of object detection and human pose estimation. For instance, He et al. [14] used the Mask R-CNN instance segmentation model for human pose estimation by predicting keypoints using one-hot masks. Wei et al. [64] proposed Point-Set Anchors, which adapted the RetinaNet [32] object detector using pose anchors instead of bounding box anchors. Zhou et al. [70] modeled objects using heatmap-based center points with CenterNet and represented poses as a 2K-dimensional property of the center point. Mao et al. [36] adapted the FCOS [56] object detector with FCPose using dynamic filters [21]. While these methods based on object detectors provide good efficiency, their accuracies have not competed with state-of-the-art heatmap-based methods. Our work is most similar to Point-Set Anchors [64]; however, our method does not require defining data-dependent pose anchors. Moreover, we simultaneously detect individual keypoints and poses and fuse the detections to improve the accuracy of our final pose predictions.

3 KAPAO: Keypoints and Poses as Objects

KAPAO uses a dense detection network to simultaneously predict a set of keypoint objects \(\{\hat{\mathcal {O}}^k\in \hat{\textbf{O}}^k\}\) and a set of pose objects \(\{\hat{\mathcal {O}}^p\in \hat{\textbf{O}}^p\}\), collectively \(\hat{\textbf{O}} = \hat{\textbf{O}}^k\cup \hat{\textbf{O}}^p\). We introduce the concept behind each object type and the relevant notation below. All units are assumed to be in pixels unless stated otherwise.

A keypoint object \(\mathcal {O}^k\) is an adaptation of the conventional object representation in which the coordinates of a keypoint are represented at the center \((b_x, b_y)\) of a small bounding box \(\textbf{b}\) with equal width \(b_w\) and height \(b_h\): \(\textbf{b} = (b_x, b_y, b_w, b_h)\). The hyperparameter \(b_s\) controls the keypoint bounding box size (i.e., \(b_s\) = \(b_w\) = \(b_h\)). There are K classes of keypoint objects, one for each type in the dataset [38].
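To make the representation concrete, the following minimal sketch (in Python, not taken from the released code) constructs a keypoint object bounding box from a keypoint coordinate and \(b_s\):

```python
def keypoint_to_object(x, y, bs):
    """Represent a keypoint as a small square bounding box (Sec. 3).

    The keypoint coordinate becomes the box center (b_x, b_y), and the
    box width and height are both set to the hyperparameter bs (pixels).
    """
    return (x, y, bs, bs)  # (b_x, b_y, b_w, b_h) with b_w = b_h = b_s

# Example: a left-eye keypoint at (412.0, 103.5) with bs = 32
box = keypoint_to_object(412.0, 103.5, 32)
```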

Generally speaking, a pose object \(\mathcal {O}^p\) is considered to be an extension of the conventional object representation that additionally includes a set of keypoints associated with the object. While we expect pose objects to be useful in related tasks such as facial and object landmark detection [20, 67], they are applied herein to human pose estimation via detection of human pose objects, comprising a bounding box of class “person,” and a set of keypoints \(\textbf{z} = \{(x_k, y_k)\}_{k=1}^K\) that coincide with anatomical landmarks.

Both object representations possess unique advantages. Keypoint objects are specialized for the detection of individual keypoints that are characterized by strong local features. Examples of such keypoints that are common in human pose estimation include the eyes, ears, and nose. However, keypoint objects carry no information regarding the concept of a person or pose. If used on their own for multi-person human pose estimation, a bottom-up grouping method would be required to parse the detected keypoints into human poses. In contrast, pose objects are better suited for localizing keypoints with weak local features as they enable the network to learn the spatial relationships within a set of keypoints. Moreover, they can be leveraged for multi-person human pose estimation directly without the need for bottom-up keypoint grouping.

Recognizing that keypoint objects exist in a subspace of pose objects, the KAPAO network was designed to simultaneously detect both object types with minimal computational overhead using a single shared network head. During inference, the more precise keypoint object detections are fused with the human pose detections using a simple tolerance-based matching algorithm that improves the accuracy of the human pose predictions without sacrificing any significant amount of inference speed. The following sections provide details on the network architecture, the loss function used to train the network, and inference.

3.1 Architectural Details

A diagram of the KAPAO pipeline is provided in Fig. 2. It uses a deep convolutional neural network \(\mathcal {N}\) to map an RGB input image \(\textbf{I}\in \mathbb {R}^{h\times w\times 3}\) to a set of four output grids \(\hat{\textbf{G}} = \{\hat{\mathcal {G}}^s\mid s\in \{8, 16, 32, 64\}\}\) containing the object predictions \(\hat{\textbf{O}}\), where \(\hat{\mathcal {G}}^s\in \mathbb {R}^{\frac{h}{s}\times \frac{w}{s}\times N_a \times N_o}\):

$$\begin{aligned} \mathcal {N}(\textbf{I}) = \hat{\textbf{G}}. \end{aligned}$$
(1)

\(N_a\) is the number of anchor channels and \(N_o\) is the number of output channels for each object. \(\mathcal {N}\) is a YOLO-style feature extractor that makes extensive use of Cross-Stage-Partial (CSP) bottlenecks [62] within a feature pyramid [31] macroarchitecture. To provide flexibility for different speed requirements, three sizes of KAPAO models were trained (i.e., KAPAO-S/M/L) by scaling the number of layers and channels in \(\mathcal {N}\).
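For concreteness, the snippet below enumerates the four output grid shapes for a \(1280\times 1280\) input; the COCO keypoint count is taken from Sect. 4, while the number of anchor channels (\(N_a = 3\)) is a YOLO-style assumption, not a value confirmed by the text:

```python
h = w = 1280          # input resolution used for COCO training (Sec. 4.1)
K = 17                # number of keypoint types (COCO)
Na = 3                # anchor channels per grid cell (assumed, YOLO-style)
No = 3 * K + 6        # output channels per object (Sec. 3.1)

for s in (8, 16, 32, 64):
    print(f"stride {s:2d}: grid {h // s} x {w // s} x {Na} x {No}")
```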

Fig. 2. KAPAO uses a dense detection network \(\mathcal {N}\) trained using the multi-task loss \(\mathcal {L}\) to map an RGB image \(\textbf{I}\) to a set of output grids \(\hat{\textbf{G}}\) containing the predicted pose objects \(\hat{\textbf{O}}^p\) and keypoint objects \(\hat{\textbf{O}}^k\). Non-maximum suppression (NMS) is used to obtain candidate detections \(\hat{\textbf{O}}^{p\prime }\) and \(\hat{\textbf{O}}^{k\prime }\), which are fused together using a matching algorithm \(\varphi \) to obtain the final human pose predictions \(\hat{\textbf{P}}\). The \(N_a\) and \(N_o\) dimensions in \(\hat{\textbf{G}}\) are not shown for clarity.

Due to the nature of strided convolutions, the features in an output grid cell \(\hat{\mathcal {G}}^s_{i,j}\) are conditioned on the image patch \(\textbf{I}_p=\textbf{I}_{si:s(i+1), sj:s(j+1)}\). Therefore, if the center of a target object \((b_x, b_y)\) is situated in \(\textbf{I}_p\), the output grid cell \(\hat{\mathcal {G}}^s_{i,j}\) is responsible for detecting it. The receptive field of an output grid increases with s, so smaller output grids are better suited for detecting larger objects.

The output grid cells \(\hat{\mathcal {G}}^s_{i,j}\) contain \(N_a\) anchor channels corresponding to anchor boxes \(\textbf{A}^s = \{(A_{w_a}, A_{h_a})\}_{a=1}^{N_a}\). A target object \(\mathcal {O}\) is assigned to an anchor channel via tolerance-based matching of the object and anchor box sizes. This provides redundancy such that the grid cells \(\hat{\mathcal {G}}^s_{i,j}\) can detect multiple objects and enables specialization for different object sizes and shapes. Additional detection redundancy is provided by also allowing the neighbouring grid cells \(\hat{\mathcal {G}}^s_{i\pm 1,j}\) and \(\hat{\mathcal {G}}^s_{i,j\pm 1}\) to detect an object in \(\textbf{I}_p\) [23, 61].
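The following sketch illustrates this assignment scheme; the ratio test and its threshold are assumptions carried over from the YOLO-style detectors [23, 61] that KAPAO builds on, not constants confirmed by the text:

```python
def assign_targets(bx, by, bw, bh, s, anchors, ratio_thresh=4.0):
    """Sketch of target assignment at stride s (Sec. 3.1).

    The grid cell containing the object center (bx, by) is responsible for
    the detection, along with its nearest vertical and horizontal neighbour
    cells (detection redundancy, as in [23, 61]). The object is assigned to
    an anchor channel if the box and anchor sizes agree within a tolerance.
    """
    i, j = int(by // s), int(bx // s)         # cell indices (row, column)
    di = 1 if (by / s) - i > 0.5 else -1      # nearest vertical neighbour
    dj = 1 if (bx / s) - j > 0.5 else -1      # nearest horizontal neighbour
    cells = [(i, j), (i + di, j), (i, j + dj)]
    matched = [a for a, (Aw, Ah) in enumerate(anchors)
               if max(bw / Aw, Aw / bw, bh / Ah, Ah / bh) < ratio_thresh]
    return cells, matched

# Example: a 40x80 px box at (300, 220) on the stride-16 grid
cells, anchor_ids = assign_targets(300, 220, 40, 80, 16,
                                   anchors=[(30, 60), (60, 120), (120, 240)])
```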

The \(N_o\) output channels of \(\hat{\mathcal {G}}^s_{i,j,a}\) contain the properties of a predicted object \(\hat{\mathcal {O}}\), including the objectness \(\hat{p}_{o}\) (the probability that an object exists), the intermediate bounding box \(\hat{\textbf{t}}'= (\hat{t}'_x, \hat{t}'_y, \hat{t}'_w, \hat{t}'_h)\), the object class scores \(\mathbf {\hat{c}} = (\hat{c}_1, ..., \hat{c}_{K+1})\), and the intermediate keypoints \(\hat{\textbf{v}}'= \{(\hat{v}'_{xk}, \hat{v}'_{yk})\}_{k=1}^K\) for the human pose objects. Hence, \(N_o = 3K + 6\).
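A possible layout of the \(N_o\) channels is sketched below; the ordering of the slices is an assumption for illustration and may differ from the released code:

```python
import torch

K = 17
No = 3 * K + 6                       # objectness + box (4) + classes (K+1) + keypoints (2K)

raw = torch.randn(No)                # one grid cell, one anchor channel
p_o   = raw[0]                       # objectness logit
t_raw = raw[1:5]                     # intermediate box (t'_x, t'_y, t'_w, t'_h)
c     = raw[5:5 + K + 1]             # class scores ("person" + K keypoint classes)
v_raw = raw[5 + K + 1:].view(K, 2)   # intermediate keypoints (v'_xk, v'_yk)
assert p_o.numel() + t_raw.numel() + c.numel() + v_raw.numel() == No
```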

Following [23, 61], an object’s intermediate bounding box \(\hat{\textbf{t}}\) is predicted in the grid coordinates and relative to the grid cell origin \((i, j)\) using:

$$\begin{aligned} \hat{t}_x = 2\sigma (\hat{t}'_x) - 0.5 \quad \quad \hat{t}_y = 2\sigma (\hat{t}'_y) - 0.5 \end{aligned}$$
(2)
$$\begin{aligned} \hat{t}_w = \frac{A_w}{s}(2\sigma (\hat{t}'_w))^2 \quad \quad \hat{t}_h = \frac{A_h}{s}(2\sigma (\hat{t}'_h))^2. \end{aligned}$$
(3)

This detection strategy is extended to the keypoints of a pose object. A pose object’s intermediate keypoints \(\hat{\textbf{v}}\) are predicted in the grid coordinates and relative to the grid cell origin \((i, j)\) using:

$$\begin{aligned} \hat{v}_{xk} = \frac{A_w}{s}(4\sigma (\hat{v}'_{xk}) - 2) \quad \quad \hat{v}_{yk} = \frac{A_h}{s}(4\sigma (\hat{v}'_{yk}) - 2). \end{aligned}$$
(4)

The sigmoid function \(\sigma \) facilitates learning by constraining the ranges of the object properties (e.g., \(\hat{v}_{xk}\) and \(\hat{v}_{yk}\) are constrained to \(\pm 2\frac{A_w}{s}\) and \(\pm 2\frac{A_h}{s}\), respectively). To learn \(\hat{\textbf{t}}\) and \(\hat{\textbf{v}}\), losses are applied in the grid space. Sample targets \(\textbf{t}\) and \(\textbf{v}\) are shown in Fig. 3.
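Equations (2)–(4) transcribe directly to code; the sketch below assumes the raw outputs have been sliced as in the earlier snippet:

```python
import torch

def decode(t_raw, v_raw, Aw, Ah, s):
    """Decode the intermediate box and keypoints in grid coordinates,
    relative to the grid cell origin (i, j), per Eqs. (2)-(4)."""
    sig = torch.sigmoid
    tx = 2 * sig(t_raw[0]) - 0.5
    ty = 2 * sig(t_raw[1]) - 0.5
    tw = (Aw / s) * (2 * sig(t_raw[2])) ** 2
    th = (Ah / s) * (2 * sig(t_raw[3])) ** 2
    vx = (Aw / s) * (4 * sig(v_raw[:, 0]) - 2)   # constrained to (-2*Aw/s, 2*Aw/s)
    vy = (Ah / s) * (4 * sig(v_raw[:, 1]) - 2)   # constrained to (-2*Ah/s, 2*Ah/s)
    return torch.stack([tx, ty, tw, th]), torch.stack([vx, vy], dim=1)
```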

Fig. 3. Sample targets for training, including a human pose object (blue), keypoint object (red), and no object (green). The “?” values are not used in the loss computation. (Color figure online)

3.2 Loss Function

A target set of grids \(\textbf{G}\) is constructed and a multi-task loss \(\mathcal {L}(\hat{\textbf{G}}, \textbf{G})\) is applied to learn the objectness \(\hat{p}_o\) (\(\mathcal {L}_{obj})\), the intermediate bounding boxes \(\hat{{\textbf {t}}}\) (\(\mathcal {L}_{box}\)), the class scores \(\hat{\textbf{c}}\) (\(\mathcal {L}_{cls}\)), and the intermediate pose object keypoints \(\hat{\textbf{v}}\) (\(\mathcal {L}_{kps}\)). The loss components are computed for a single image as follows:

$$\begin{aligned} \mathcal {L}_{obj} = \sum _s \frac{\omega _s}{n(\mathcal {G}^s)}\sum _{\mathcal {G}^s}\textrm{BCE}(\hat{p}_o, p_o\cdot \textrm{IoU}(\hat{\textbf{t}}, \textbf{t})) \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {L}_{box} = \sum _s \frac{1}{n(\mathcal {O}\in \mathcal {G}^s)}\sum _{\mathcal {O}\in \mathcal {G}^s}\left( 1 - \textrm{IoU}(\hat{\textbf{t}}, \textbf{t})\right) \end{aligned}$$
(6)
$$\begin{aligned} \mathcal {L}_{cls} = \sum _s \frac{1}{n(\mathcal {O}\in \mathcal {G}^s)}\sum _{\mathcal {O}\in \mathcal {G}^s}\textrm{BCE}(\hat{\textbf{c}}, \textbf{c}) \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{kps} = \sum _s \frac{1}{n(\mathcal {O}^p\in \mathcal {G}^s)}\sum _{\mathcal {O}^p\in \mathcal {G}^s} \sum _{k=1}^K \delta (\nu _k > 0)\,||\hat{\textbf{v}}_k - {\textbf{v}}_k||_2 \end{aligned}$$
(8)

where \(\omega _s\) is the grid weighting, \(\textrm{BCE}\) is the binary cross-entropy, \(\textrm{IoU}\) is the complete intersection over union (CIoU) [69], and \(\nu _k\) are the keypoint visibility flags. When \(\mathcal {G}^s_{i,j,a}\) represents a target object \(\mathcal {O}\), the target objectness \(p_o\) = 1 is multiplied by the \(\textrm{IoU}\) to promote specialization amongst the anchor channel predictions [50]. When \(\mathcal {G}^s_{i,j,a}\) is not a target object, \(p_o\) = 0. In practice, the losses are applied over a batch of images using batched grids. The total loss \(\mathcal {L}\) is the weighted summation of the loss components scaled by the batch size \(N_b\):

$$\begin{aligned} \mathcal {L} = N_b(\lambda _{obj}\mathcal {L}_{obj} + \lambda _{box}\mathcal {L}_{box} + \lambda _{cls}\mathcal {L}_{cls} + \lambda _{kps}\mathcal {L}_{kps}). \end{aligned}$$
(9)
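For illustration, Eqs. (8) and (9) might be implemented as follows (a simplified sketch; the batched, multi-grid bookkeeping of the actual training code is omitted):

```python
import torch

def kps_loss_grid(v_hat, v, vis):
    """Eq. (8) for a single grid: L2 distance summed over visible keypoints,
    averaged over the pose objects assigned to the grid.

    v_hat, v: (n, K, 2) predicted/target intermediate keypoints (grid coords)
    vis:      (n, K) visibility flags nu_k; only keypoints with nu_k > 0 count
    """
    d = torch.linalg.norm(v_hat - v, dim=-1)           # (n, K) distances
    return (d * (vis > 0).float()).sum() / max(v_hat.shape[0], 1)

def total_loss(L_obj, L_box, L_cls, L_kps, Nb, lam):
    """Eq. (9): weighted sum of the loss components scaled by batch size Nb.
    lam holds the four weights lambda_{obj,box,cls,kps} (tuned; see Sec. 4.1)."""
    return Nb * (lam["obj"] * L_obj + lam["box"] * L_box
                 + lam["cls"] * L_cls + lam["kps"] * L_kps)
```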

3.3 Inference

The predicted intermediate bounding boxes \(\hat{\textbf{t}}\) and keypoints \(\hat{\textbf{v}}\) are mapped back to the original image coordinates using the following transformation:

$$\begin{aligned} \hat{\textbf{b}} = s(\hat{\textbf{t}} + [i, j, 0, 0]) \quad \quad \hat{\textbf{z}}_k = s(\hat{\textbf{v}}_k + [i, j]). \end{aligned}$$
(10)

\(\hat{\mathcal {G}}^s_{i,j,a}\) represents a positive pose object detection \(\hat{\mathcal {O}}^p\) if its confidence \(\hat{p}_o\cdot \max (\hat{\textbf{c}})\) is greater than a threshold \(\tau _{cp}\) and \(\textrm{arg}\,\textrm{max}(\hat{\textbf{c}})=1\). Similarly, \(\hat{\mathcal {G}}^s_{i,j,a}\) represents a positive keypoint object detection \(\hat{\mathcal {O}}^k\) if \(\hat{p}_o\cdot \max (\hat{\textbf{c}}) > \tau _{ck}\) and \(\textrm{arg}\,\textrm{max}(\hat{\textbf{c}}) > 1\), where the keypoint object class is \(\textrm{arg}\,\textrm{max}(\hat{\textbf{c}}) - 1\). To remove redundant detections and obtain the candidate pose objects \(\hat{\textbf{O}}^{p\prime }\) and the candidate keypoint objects \(\hat{\textbf{O}}^{k\prime }\), the sets of positive pose object detections \(\hat{\textbf{O}}^p\) and positive keypoint object detections \(\hat{\textbf{O}}^k\) are filtered using non-maximum suppression (NMS) applied to the object bounding boxes with the \(\textrm{IoU}\) thresholds \(\tau _{bp}\) and \(\tau _{bk}\):

$$\begin{aligned} \hat{\textbf{O}}^{p\prime } = \textrm{NMS}(\hat{\textbf{O}}^p, \tau _{bp}) \quad \quad \hat{\textbf{O}}^{k\prime } = \textrm{NMS}(\hat{\textbf{O}}^k, \tau _{bk}). \end{aligned}$$
(11)
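Equation (10) and the confidence tests above amount to the following sketch, using 0-indexed classes (the paper's indexing is 1-based):

```python
import torch

def to_image_coords(t_hat, v_hat, i, j, s):
    """Eq. (10): map decoded grid-space box/keypoints back to image pixels,
    where (i, j) is the grid cell origin and s is the grid stride."""
    b_hat = s * (t_hat + torch.tensor([i, j, 0.0, 0.0]))
    z_hat = s * (v_hat + torch.tensor([i, j], dtype=v_hat.dtype))
    return b_hat, z_hat

def detection_type(p_o, c, tau_cp, tau_ck):
    """Confidence tests from Sec. 3.3. Here class 0 is "person" and classes
    1..K are the keypoint classes (shifted down by one vs. the paper)."""
    conf, cls = p_o * c.max(), int(c.argmax())
    if cls == 0 and conf > tau_cp:
        return "pose", conf
    if cls >= 1 and conf > tau_ck:
        return ("keypoint", cls - 1), conf   # keypoint object class
    return None, conf
```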

It is noted that \(\tau _{ck}\) and \(\tau _{bk}\) are scalar thresholds used for all keypoint object classes. Finally, the human pose predictions \(\hat{\textbf{P}} = \{\hat{\textbf{P}}_i\in \mathbb {R}^{K \times 3}\}\) for \(i \in \{1...n(\hat{\textbf{O}}^{p\prime })\}\) are obtained by fusing the candidate keypoint objects with the candidate pose objects using a distance tolerance \(\tau _{fd}\). To promote correct matches of keypoint objects to poses, the keypoint objects are only fused to pose objects with confidence \(\hat{p}_o\cdot \max (\hat{\textbf{c}}) > \tau _{fc}\):

$$\begin{aligned} \hat{\textbf{P}} = \varphi (\hat{\textbf{O}}^{p\prime }, \hat{\textbf{O}}^{k\prime }, \tau _{fd}, \tau _{fc}). \end{aligned}$$
(12)

The keypoint object fusion function \(\varphi \) is defined in Algorithm 1, where the following notation is used to index an object’s properties: \(\hat{x} = \hat{\mathcal {O}}_x\) (e.g., a pose object’s keypoints \(\hat{\textbf{z}}\) are referenced as \(\hat{\mathcal {O}}^{p}_{\textbf{z}}\)).

Algorithm 1. The keypoint object fusion function \(\varphi \).
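The listing for Algorithm 1 is not reproduced here; the following is a minimal sketch of the tolerance-based matching it describes, under assumed array layouts, which greedily assigns each candidate keypoint object to the nearest matching pose keypoint within \(\tau _{fd}\), considering only poses with confidence above \(\tau _{fc}\):

```python
import numpy as np

def fuse(poses, pose_conf, kp_xy, kp_cls, kp_conf, tau_fd, tau_fc):
    """Sketch of the fusion function phi (Eq. 12): replace pose keypoints
    with nearby keypoint-object detections of the same class.

    poses:     (n, K, 2) candidate pose keypoints z_hat in image coordinates
    pose_conf: (n,) pose object confidences p_o * max(c)
    kp_xy:     (m, 2) candidate keypoint object centers
    kp_cls:    (m,) keypoint object classes in 0..K-1
    kp_conf:   (m,) keypoint object confidences
    Returns:   (n, K, 3) poses; the third column holds the confidences
               of the fused keypoint objects (zero elsewhere).
    """
    n, K, _ = poses.shape
    P = np.concatenate([poses, np.zeros((n, K, 1))], axis=-1)
    if n == 0:
        return P
    for c, xy, conf in zip(kp_cls, kp_xy, kp_conf):
        d = np.linalg.norm(poses[:, c] - xy, axis=-1)  # distance to each pose
        d[pose_conf <= tau_fc] = np.inf                # only confident poses
        i = int(d.argmin())
        if d[i] < tau_fd:                              # within the tolerance
            P[i, c, :2], P[i, c, 2] = xy, conf         # replace the keypoint
    return P
```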

3.4 Limitations

A limitation of KAPAO is that pose objects do not include individual keypoint confidences, so the human pose predictions typically contain a sparse set of keypoint confidences \(\hat{\textbf{P}}_i[:,3]\) populated by the fused keypoint objects (see Algorithm 1 for details). If desired, a complete set of keypoint confidences can be induced by only using keypoint objects, which is realized when \(\tau _{ck} \rightarrow 0\). Another limitation is that training requires a considerable amount of time and GPU memory due to the large input size used.

4 Experiments

We evaluate KAPAO on two multi-person human pose estimation datasets: COCO Keypoints [33] (\(K=17\)) and CrowdPose [28] (\(K=14\)). We report the standard AP/AR detection metrics based on Object Keypoint Similarity [33] and compare against state-of-the-art methods. All hyperparameters are provided in the source code.

4.1 Microsoft COCO Keypoints

Training. KAPAO-S/M/L were all trained for 500 epochs on COCO train2017 using stochastic gradient descent with Nesterov momentum [41], weight decay, and a learning rate decayed over a single cosine cycle [34] with a 3-epoch warm-up period [13]. The input images were resized and padded to \(1280\times 1280\), keeping the original aspect ratio. Data augmentation used during training included mosaic [2], HSV color-space perturbations, horizontal flipping, translations, and scaling. Many of the training hyperparameters were inherited from [23, 61], including the anchor boxes \(\textbf{A}\), the grid weights \(\omega _s\), and the loss weights \(\lambda _{obj}\), \(\lambda _{box}\), and \(\lambda _{cls}\). Others, including the keypoint bounding box size \(b_s\) and the keypoint loss weight \(\lambda _{kps}\), were manually tuned using a small grid search. The models were trained on four V100 GPUs with 32 GB memory each using batch sizes of 128, 72, and 48 for KAPAO-S, M, and L, respectively. Validation was performed after every epoch, saving the model weights that provided the highest validation AP.

Testing. The six inference parameters (\(\tau _{cp}\), \(\tau _{ck}\), \(\tau _{bp}\), \(\tau _{bk}\), \(\tau _{fd}\), and \(\tau _{fc}\)) were manually tuned on the validation set using a coarse grid search to maximize accuracy. The results were not overly sensitive to the inference parameter values. When using TTA, the input image was scaled by factors of 0.8, 1, and 1.2, and the unscaled image was horizontally flipped. During post-processing, the multi-scale detections were concatenated before running NMS. When not using TTA, rectangular input images were used (i.e., 1280 px on the longest side), which marginally reduced the accuracy but increased the inference speed.

Table 1. Accuracy and speed comparison with state-of-the-art single-stage human pose estimation models on COCO val2017, including the forward pass (FP) and post-processing (PP). Latencies (Lat.) averaged over val2017 using a batch size of 1 on a TITAN Xp GPU.

Results. Table 1 compares the accuracy, forward pass (FP) time, and post-processing (PP) time of KAPAO with state-of-the-art single-stage methods HigherHRNet [7], HigherHRNet + SWAHR [35], DEKR [12], and CenterGroup [3] on val2017. Two test settings were considered: (1) without any test-time augmentation (using a single forward pass of the network), and (2) with multi-scale and horizontal flipping test-time augmentation (TTA). It is noted that with the exception of CenterGroup, no inference speeds were reported in the original works. Rather, FLOPs were used as an indirect measure of computational efficiency. FLOPs are not only a poor indication of inference speed [8], but they are also only computed for the forward pass of the network and thus do not provide an indication of the amount of computation required for post-processing.

Due to expensive heatmap refinement, the post-processing times of HigherHRNet, HigherHRNet + SWAHR, and DEKR are at least an order of magnitude greater than KAPAO-L when not using TTA. The post-processing time of KAPAO depends less on the input size so it only increases by approximately 1 ms when using TTA. Conversely, HigherHRNet and HigherHRNet + SWAHR generate and refine large heatmaps during multi-scale testing and therefore require more than two orders of magnitude more post-processing time than KAPAO-L.

CenterGroup requires significantly less post-processing time than HigherHRNet and DEKR because it skips heatmap refinement and directly encodes pose center and keypoint heatmaps as embeddings that are fed to an attention-based grouping module. When not using TTA, CenterGroup-W48 provides an improvement of 2.5 AP over HigherHRNet-W48 and has a better accuracy-speed trade-off. Still, KAPAO-L is 3.1\(\times \) faster than CenterGroup-W48 and 1.5 AP more accurate due to its efficient network architecture and near cost-free post-processing. When using TTA, KAPAO-L is 1.7 AP less accurate than CenterGroup-W48, but 4.9\(\times \) faster. KAPAO-L also achieves state-of-the-art AR, which is indicative of better detection rates.

We suspect that KAPAO is more accurate without TTA compared to previous methods because it uses larger input images; however, we emphasize that KAPAO consumes larger inputs while still being faster than previous methods due to its well-designed network architecture and efficient post-processing. For the same reason, TTA (multi-scale testing in particular) does not provide as much of a benefit; input sizes greater than 1280 are less effective because the dataset images are limited to 640 px.

Table 2. Accuracy comparison with two-stage (\(\dagger \)) and single-stage methods on COCO test-dev. Best results reported (i.e., including TTA). DEKR results use a model-agnostic rescoring network [12]. Latencies (Lat.) taken from Table 1. *Latencies reported in original papers [4, 36] and measured using an NVIDIA GTX 1080Ti GPU.

In Table 2, the accuracy of KAPAO is compared to single-stage and two-stage methods on test-dev. KAPAO-L achieves state-of-the-art AR and falls within 1.7 AP of the best performing single-stage method HigherHRNet-W48 + SWAHR while being 7.4\(\times \) faster. Notably, KAPAO-L is more accurate than the early two-stage methods G-RMI [46] and RMPE [10] and popular single-stage methods like OpenPose [4, 5], Associative Embeddings [42], and PersonLab [45]. Compared to other single-stage methods that extend object detectors for human pose estimation (Mask R-CNN [14], CenterNet [70], Point-Set Anchors [64], and FCPose [36]), KAPAO-L is considerably more accurate. Among all the single-stage methods, KAPAO-L achieves state-of-the-art AP at an OKS threshold of 0.50, which is indicative of better detection rates but less precise keypoint localization. This is an area to explore in future work.

4.2 CrowdPose

KAPAO was trained on the trainval split with 12k images and was evaluated on the 8k images in test. The same training and inference settings as on COCO were used except the models were trained for 300 epochs and no validation was performed during training. The final model weights were used for testing. Table 3 compares the accuracy of KAPAO against state-of-the-art methods. It was found that KAPAO excels in the presence of occlusion, achieving competitive results across all metrics compared to previous single-stage methods and state-of-the-art accuracy for AP\(^{.50}\). The proficiency of KAPAO in crowded scenes is clear when analyzing AP\(^E\), AP\(^M\), and AP\(^H\): KAPAO-L and DEKR-W48 [12] perform equally on images with easy Crowd Index (less occlusion), but KAPAO-L is 1.1 AP more accurate for both medium and hard Crowd Indices (more occlusion).

Table 3. Comparison with single-stage and two-stage (\(\dagger \)) methods on CrowdPose test, including TTA. DEKR results use a model-agnostic rescoring network [12]. HigherHRNet + SWAHR [35] not included due to issues reproducing the results reported in the paper using the source code. Latencies (Lat.) taken from Table 1. *Latency reported in original paper [4] and measured using NVIDIA GTX 1080Ti GPU on COCO.
Fig. 4. Left: the influence of the keypoint object bounding box size on learning; each KAPAO-S model was trained for 50 epochs. Right: keypoint object fusion rates for each keypoint type, evaluated on COCO val2017 using KAPAO-S without TTA.

4.3 Ablation Studies

The influence of the keypoint bounding box size \(b_s\), one of KAPAO’s important hyperparameters, was empirically analyzed. Five KAPAO-S models were trained on COCO train2017 for 50 epochs using normalized keypoint bounding box sizes \(b_s/\max (w,h)\) \(\in \{0.01, 0.025, 0.05, 0.075, 0.1\}\). The validation AP is plotted in Fig. 4 (left). The results are consistent with the prior work of McNally et al. [38]: \(b_s/\max (w,h)<\) 2.5% destabilizes training, leading to poor accuracy, and optimal \(b_s/\max (w,h)\) is observed around 5% (the value used for the experiments in the previous sections). In contrast to McNally et al., the accuracy in this study degrades quickly for \(b_s/\max (w,h)>\) 5%. It is hypothesized that large \(b_s\) in this application interferes with pose object learning.

The accuracy improvements resulting from fusing the keypoint objects with the pose objects are provided in Table 4. Keypoint object fusion adds no less than 1.0 AP and over 3.0 AP in some cases. Moreover, keypoint object fusion is fast; the added post-processing time per image is \(\le \) 1.7 ms on COCO and \(\le \) 4.5 ms on CrowdPose. Relative to the time required for the forward pass of the network (see Table 1), these are small increases.

Table 4. Accuracy improvement when fusing keypoint object detections with human pose detections. Latencies averaged over each dataset using a batch size of 1 on a TITAN Xp GPU.

The fusion of keypoint objects by class is also studied. Figure 4 (right) plots the fusion rates for each keypoint type for KAPAO-S with no TTA on COCO val2017. The fusion rate is equal to the number of fused keypoint objects divided by the number of keypoints of that type in the dataset. Because the number of human pose predictions is generally greater than the actual number of person instances in the dataset, the fusion rate can be greater than 1. As originally hypothesized, keypoints that are characterized by distinct local image features (e.g., the eyes, ears, and nose) have higher fusion rates as they are detected more precisely as keypoint objects than as pose objects. Conversely, keypoints that require a more global understanding (e.g., the hips) are better detected using pose objects, as evidenced by lower fusion rates.

5 Conclusion

This paper presents KAPAO, a heatmap-free keypoint estimation method based on modeling keypoints and poses as objects. KAPAO is effectively applied to the problem of single-stage multi-person human pose estimation by detecting human pose objects. Moreover, fusing jointly detected keypoint objects improves the accuracy of the predicted human poses with minimal computational overhead. When not using test-time augmentation, KAPAO is significantly faster and more accurate than previous single-stage methods, which are impeded greatly by heatmap post-processing and bottom-up keypoint grouping. Moreover, KAPAO performs well in the presence of heavy occlusion as evidenced by competitive results on CrowdPose.