
1 Introduction

3D human pose estimation plays a crucial role in unlocking widespread applications in human-computer interaction, robotics, surveillance, and virtual reality. Compared with multi-view methods [19, 41, 43, 52, 61], monocular 3D human pose estimation is more flexible to deploy in outdoor environments. However, given its ill-posed nature, estimating 3D human poses from a single RGB image remains a challenging problem. Thanks to Convolutional Neural Networks (CNNs), many effective approaches have been proposed, formulating the problem as joint coordinate regression [28, 47] or heat map learning [57, 65]. Recently, many approaches [39, 40, 48, 62] have followed a popular paradigm of predicting a per-voxel likelihood for each human joint and achieved competitive performance.

Fig. 1. Motivation. Most of the previous methods employ a single network architecture to deal with intrinsically heterogeneous human body parts (as shown in (a)). Instead, we are motivated to search for a suitable network architecture for a group of parts and estimate their 3D locations with a part-specific architecture (as shown in (b)).

In most previous approaches, as shown in Fig. 1(a), CNNs share the same network architecture for predicting all human body parts despite their different degrees of freedom (DOFs), ranging from parts with higher DOFs like the wrists to parts with lower DOFs like the torso. However, a single network architecture might be sub-optimal for handling such varied body parts: since different parts have distinct movement patterns and shapes, estimating their locations might require different network topologies (e.g., different kernel sizes and distinct receptive fields). A recent effort [54] also demonstrates that it is effective to estimate different body parts by explicitly taking their DOFs into account.

As shown in Fig. 1(b), we approach the problem from a different angle and propose to estimate different body parts with part-specific network architectures. However, finding optimal architectures for various body parts is an intractable and time-consuming job even for an expert. Therefore, instead of designing them manually, we turn to the literature on neural architecture search (NAS) [4, 14, 17, 23, 31, 49, 56] and propose to search for part-specific network architectures for different parts. In fact, the idea of searching network architectures for specific tasks is not new; it has been applied in semantic segmentation [7, 30, 60] and object detection [8, 13, 42].

However, applying NAS to 3D human pose estimation is non-trivial, because current NAS approaches mainly focus on 2D visual tasks. Unlike them, 3D human poses are commonly estimated in a higher-order volumetric space [11, 40, 48, 52], which consists of 2D spatial axes plus a depth axis and greatly increases the uncertainty during optimization. More importantly, how to use prior information about the human body structure to facilitate the architecture search and achieve a trade-off between accuracy and complexity is another open issue.

To deal with these issues, we introduce the fusion cell in the context of NAS to increase the resolution of feature maps and generate the desired volumetric heat maps efficiently. The fusion cell contains multiple head networks, each a distinct convolutional architecture composed of different kernels and operations. To improve the part-awareness of our model, we generate the volumetric heat map for each part with a specially optimized head network. Considering the symmetry prior of the human body structure, it is inefficient to search a different head network for every single part. Instead, our approach classifies all body parts into several groups and assigns each group a part-specific architecture. In the search stage, all the architectures, including the fusion cell, are optimized by gradient descent. Then, we stack these optimized computational cells to construct our part-aware 3D pose estimator. In the evaluation stage, our part-aware 3D human pose estimator selects the optimized head networks encoded in the fusion cell to estimate different groups of body parts.

Through extensive experiments, we show that our approach achieves a good trade-off between complexity and performance. With 62% fewer parameters and 24% fewer FLOPs (multiply-adds), our approach outperforms the model using a ResNet-50 backbone and achieves 53.6 mm in Mean Per Joint Position Error (MPJPE). By stacking more computational cells, it further advances the state-of-the-art accuracy on Human3.6M by 2.3 mm with 41% fewer parameters.

Our contributions can be summarized as follows:

  • Our work shows that it might be sub-optimal to estimate 3D poses of all body parts with a single network architecture. To the best of our knowledge, we make the first attempt to search part-specific architectures for different parts.

  • We introduce the fusion cell to generate volumetric heat maps efficiently. In the fusion cell, we classify all body parts into several groups and estimate each group of parts with a distinct head network.

  • Our part-aware 3D pose estimator is both compact and efficient. It achieves state-of-the-art accuracy on both single-person and multi-person 3D human pose benchmarks with far fewer parameters and FLOPs.

2 Related Work

3D human pose estimation has been studied widely in the past. In this section, we focus on previous works that are most relevant to ours.

Estimate 3D poses from 2D Joints: Some approaches divide the task of 3D human pose estimation into first predicting 2D joint locations and then back-projecting them to estimate 3D human poses. The practice of inferring 3D human poses from their 2D projections can be traced back to the classic work [27]: given the bone lengths, the problem boils down to a binary decision tree where each branch corresponds to the two possible states of a joint with respect to its parent. Jiang et al. [20] generate a set of 3D pose hypotheses using Taylor’s algorithm [50] and use them to query a large database of motion capture data for the nearest neighbor. The idea of exploiting nearest neighbor queries has been revisited by [15]. Chen et al. [6] also share the idea of using the detected 2D pose to query a large database of exemplar poses. Another common approach [3, 63] is to learn an over-complete dictionary of basis 3D poses from a large database of motion capture data. Moreno-Noguer et al. [36] employ the pair-wise distance matrix of 2D joints to learn a distance matrix for 3D joints. Martinez et al. [32] design a fully-connected network to estimate 3D joint locations relative to the pelvis from 2D poses. Hossain et al. [16] exploit temporal information to estimate a sequence of 3D poses from a sequence of 2D joint locations. Ci et al. [10] combine the advantages of graph convolutional networks and fully-connected networks, equipping the model with strong generalization power. Cai et al. [5] introduce a graph-based local-to-global network to recover 3D poses from 2D pose sequences. These methods focus on estimating 3D poses from 2D poses, whereas we estimate 3D poses directly from monocular images.

Estimate 3D poses from Monocular Images: Recently, many approaches have been proposed to estimate 3D poses from monocular images in an end-to-end fashion. Li et al. [28] and Park et al. [38] exploit 2D pose information to benefit 3D pose estimation. Rogez et al. [44] and Varol et al. [53] augment the training data with synthetic images and train CNNs to predict 3D poses from real images. Sun et al. [47] adopt a reparameterized pose representation using bones instead of joints. Pavlakos et al. [40] extend 2D heat maps to 3D volumetric heat maps and predict a per-voxel likelihood for each joint. Tome et al. [51] generalize the Convolutional Pose Machine (CPM) [55] to monocular 3D human pose estimation. Chen et al. [9] propose to decompose the volumetric representation into 2D depth-aware heat maps and joint depth estimation. Zhou et al. [65] propose a weakly-supervised transfer learning method that uses mixed 2D and 3D labels in a unified deep neural network. By introducing a simple integral operation, Sun et al. [48] unify heat map learning and regression learning for pose estimation. Kocabas et al. [25] propose to train the 3D pose estimator with multi-view triangulation in a self-supervised manner. Instead of estimating root-relative 3D poses, Moon et al. [35] propose to estimate 3D poses in the camera coordinate system directly. More recent works [1, 21, 22, 26, 37] tend to focus on reconstructing fine-grained 3D human shapes. Nevertheless, all of these works are limited to estimating all body parts with a single head network, whereas we attempt to search part-specific head networks for different body parts.

3 The Proposed Approach

In the NAS literature, differentiable architecture search (DARTS) [30] is a representative method that can search effective network architectures with modest computing resources. Therefore, we build our model on DARTS. First, we introduce the basics of DARTS. Then, we describe our approach to searching part-specific head networks for intrinsically heterogeneous body parts.

3.1 Preliminaries: Differentiable Architecture Search (DARTS)

DARTS decomposes the searched network architecture into a number (L) of computational cells. There are two types of cells: the normal cell and the reduction cell. Both contain typical convolutional architectures that transform feature maps; in addition, the reduction cell downsamples the feature map. Each computational cell can be represented as a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes (\(\mathcal {N}=\{x^{(i)}|i=1,...,N\}\)). In the DAG, each node \(x^{(i)}\) (\(i\in \{1,...,N\}\)) is a hidden representation (i.e., a feature map), and each edge \(o^{(i,j)}(\cdot )\) denotes the transformation from \(x^{(i)}\) to \(x^{(j)}\) and is associated with an operation (e.g., pooling or convolution). Each cell has two input nodes (\(x^{(1)}\) and \(x^{(2)}\), which receive the outputs of the previous two cells) and one output node \(x^{(N)}\) (the concatenation of all intermediate nodes \(x^{(3)}, x^{(4)},...,x^{(N-1)}\)). The output of an intermediate node \(x^{(j)}\) is computed as:

$$\begin{aligned} \begin{aligned}&x^{(j)} = \sum _{i<j}o^{(i,j)}(x^{(i)}) \end{aligned} \end{aligned}$$
(1)

where the node \(x^{(i)}\) is a predecessor of the node \(x^{(j)}\). There is a pre-defined space of operations, denoted by \(\mathcal {O}\), each element of which is a fixed operation (e.g., identity connection, convolution, or max pooling). In the search stage, the goal is to automatically select one operation from \(\mathcal {O}\) and assign it to \(o^{(i,j)}(\cdot )\) for each pair of nodes.
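For concreteness, the following minimal PyTorch sketch evaluates a cell according to Eq. 1 once the operation on each edge has been fixed (as it is after the search). The class name, the `edge_ops` dictionary, and the channel handling are illustrative assumptions, not the implementation used in this work.

```python
import torch
import torch.nn as nn

class DiscreteCell(nn.Module):
    """A DARTS-style cell as a DAG (Eq. 1), with one fixed operation per edge.

    `edge_ops` maps an (i, j) index pair to an nn.Module that transforms
    node i into a contribution to node j. Nodes 0 and 1 are the two input
    nodes; the output concatenates the intermediate nodes channel-wise.
    """
    def __init__(self, edge_ops: dict, num_nodes: int):
        super().__init__()
        self.edges = list(edge_ops.keys())
        self.ops = nn.ModuleDict({f"{i}_{j}": op for (i, j), op in edge_ops.items()})
        self.num_nodes = num_nodes

    def forward(self, x_prev_prev, x_prev):
        nodes = [x_prev_prev, x_prev]                  # outputs of the previous two cells
        for j in range(2, self.num_nodes):
            # Eq. 1: sum the transformed outputs of all predecessors of node j
            preds = [self.ops[f"{i}_{j}"](nodes[i]) for (i, jj) in self.edges if jj == j]
            nodes.append(sum(preds))
        return torch.cat(nodes[2:], dim=1)             # output node: concat of intermediates
```

A toy instantiation could, for example, connect node 0 to node 2 with a 3×3 convolution and node 1 to node 2 with max pooling, provided the channel dimensions match.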

The core idea of DARTS is to make the search space continuous, and formulate the choice of an operation as a softmax over all possible operations:

$$\begin{aligned} \begin{aligned}&\bar{o}^{(i,j)}(x) = \sum _{o \in \mathcal {O}} \frac{exp(\alpha _{i,j}^{o})}{\sum _{o'\in \mathcal {O}}exp(\alpha _{i,j}^{o'})}o(x) \end{aligned} \end{aligned}$$
(2)

where \(\alpha _{i,j}^{o}\) denotes the learnable score of the operation \(o(\cdot )\) on the edge from \(x^{(i)}\) to \(x^{(j)}\), and \(\alpha _{i,j}\in \mathbb {R}^{|\mathcal {O}|}\) collects the scores of all candidate operations on that edge. The architecture of a cell is then denoted as \(\alpha = \{\alpha _{i,j}\}\), i.e., the scores over all edges connecting pairs of nodes.
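As a concrete illustration of Eq. 2, the sketch below implements a single relaxed edge in PyTorch. The class name `MixedOp`, the storage of \(\alpha _{i,j}\) inside the module, and the toy candidate list are assumptions for exposition; DARTS itself keeps the architecture parameters separate from the network weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One relaxed edge (Eq. 2): a softmax-weighted sum over candidate operations."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # alpha_{i,j}: one learnable score per candidate operation on this edge
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Toy usage on a 16-channel feature map (the candidates are illustrative):
edge = MixedOp([nn.Identity(),
                nn.Conv2d(16, 16, kernel_size=3, padding=1),
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1)])
y = edge(torch.randn(1, 16, 8, 8))
```

With this relaxation in place, DARTS formulates architecture search as finding \(\alpha \) that minimizes the loss function on the validation set: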

$$\begin{aligned}&\min _{\alpha }~L_{val}(w^{*}(\alpha ), \alpha ) \end{aligned}$$
(3)
$$\begin{aligned}&\mathrm{s.t.}~w^{*}(\alpha ) = \mathrm{argmin}_{w}~L_{train}(w, \alpha ) \end{aligned}$$
(4)

where \(w^{*}(\alpha )\) denotes the network weights associated with the architecture \(\alpha \), which are optimized on the training set. The architecture parameters \(\alpha \) can be optimized via gradient descent by approximating Eq. 3 as:

$$\begin{aligned} \nabla _{\alpha }L_{val}(w^{*}(\alpha ), \alpha ) \approx \nabla _{\alpha }L_{val}(w-\xi \nabla _{w}L_{train}(w,\alpha ),\alpha ) \end{aligned}$$
(5)

where w denotes the current network weights, \(w-\xi \nabla _{w}L_{train}(w,\alpha )\) is the result of one gradient step on w, and \(\xi \) is the learning rate of that step. After optimizing \(\alpha \) in the search stage, we assign \(o^{(i,j)}(\cdot )\) the most likely candidate operation according to \(\alpha _{i,j}\). For each intermediate node in a computational cell, DARTS retains its two strongest predecessors.
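The alternating optimization of Eqs. 3-5 can be sketched as follows. This uses the first-order approximation (i.e., \(\xi =0\), so the gradient with respect to \(\alpha \) is taken at the current weights) and assumes the weight and architecture parameters have been registered with separate optimizers; it is a simplified sketch, not the exact DARTS implementation.

```python
def search_step(model, w_optimizer, alpha_optimizer, train_batch, val_batch, loss_fn):
    """One alternating update of the bilevel problem in Eqs. 3-5."""
    # Inner problem (Eq. 4): update the network weights w on the training set.
    x_train, y_train = train_batch
    w_optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    w_optimizer.step()

    # Outer problem (Eq. 3): update the architecture scores alpha on the validation set.
    # With the first-order approximation of Eq. 5, the gradient w.r.t. alpha is simply
    # evaluated at the current weights rather than the one-step unrolled weights.
    x_val, y_val = val_batch
    alpha_optimizer.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    alpha_optimizer.step()
```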

3.2 DARTS for Monocular 3D Human Pose Estimation

Since DARTS is originally designed for image classification, neither the normal cell nor the reduction cell can increase the resolution of feature maps. However, it is common practice for 3D pose estimators to upsample feature maps consecutively, e.g., from \(8\times 8\) to \(64\times 64\), and generate volumetric heat maps for all body parts. To this end, as shown in Fig. 2, we introduce another type of cell, namely the fusion cell, in the context of DARTS. It upsamples and transforms feature maps propagated from previous cells. Just as the reduction cell performs downsampling at its input nodes, the fusion cell upsamples feature maps at its input nodes as a preprocessing step. Then, the operations on the edges between nodes (i.e., convolution, pooling, etc.) transform the upsampled feature maps and produce volumetric heat maps for all parts at the output node. As shown in Fig. 2, the output node is the concatenation of all intermediate nodes, and each intermediate node represents the volumetric heat maps for a certain group of body parts. Through the intermediate nodes of the fusion cell, we automatically divide all body parts into several groups; the number of groups equals the number of intermediate nodes in the fusion cell. As shown in Fig. 2(a), there are many candidate operations between nodes in the search stage, and we obtain the optimized architecture once the search process finishes. In the optimized architecture shown in Fig. 2(b), each intermediate node is transformed by a different set of operations. In other words, we learn part-specific architectures in the search stage and employ them to estimate different groups of body parts in the evaluation stage.
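To make the output structure of the fusion cell concrete, the sketch below mimics its layout: the input node is upsampled, each intermediate node is produced by its own head, and the output node concatenates the per-group volumetric heat maps. The two plain convolutional heads, the split into two groups of parts, and all names are illustrative placeholders rather than the searched operations.

```python
import torch
import torch.nn as nn

class FusionCellSketch(nn.Module):
    """Output structure of a fusion cell with two intermediate nodes (illustrative)."""
    def __init__(self, in_channels, depth, parts_per_group=(6, 12)):
        super().__init__()
        # Preprocessing at the input node: upsample the incoming feature map.
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # One head per group of parts; each predicts depth * k heat-map channels
        # (k parts, `depth` depth slices per part).
        self.heads = nn.ModuleList([
            nn.Conv2d(in_channels, depth * k, kernel_size=3, padding=1)
            for k in parts_per_group
        ])

    def forward(self, x):
        x = self.upsample(x)
        group_maps = [head(x) for head in self.heads]   # part-specific intermediate nodes
        return torch.cat(group_maps, dim=1)             # output node: heat maps for all parts
```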

Fig. 2. An illustration of the fusion cell. Node 0 is the input node, and Nodes 1, 2, and 3 are intermediate nodes. Node 4 is the output node and concatenates all intermediate nodes. Each edge represents one operation between two nodes. For simplicity, we only draw one input node here instead of two.

We follow a popular baseline [48] to build our part-aware 3D pose estimator. It predicts a per-voxel likelihood for each part and uses the soft-argmax operator to extract the 3D coordinate from the volumetric heat map. Instead of using a ResNet-50 backbone and deconvolution layers, we search the whole network architecture. In the search stage, we stack normal cells, reduction cells, and fusion cells to construct our model with a total of \(N_{c}\) cells. We fix the numbers of reduction cells and fusion cells to \(N_{r}\) and \(N_{f}\), respectively. Because the fusion cells are designed to generate the volumetric heat maps at the end, we first interweave \((N_{c}-N_{r}-N_{f})\) normal cells and \(N_{r}\) reduction cells. Following the original DARTS, we place the reduction cells at positions:

$$\begin{aligned} P_{r}^{i} = \mathrm{floor}(\frac{N_{c}-N_{f}}{N_{r}+1})\times i+1 \end{aligned}$$
(6)
Fig. 3. An overview of our network architecture. We take a \(256 \times 256\) input image as an example. The network consists of ten computational cells: two normal cells, five reduction cells, and three fusion cells. The architectures of all types of cells are optimized in the search stage, and each cell receives inputs from the outputs of the previous two cells.

where \(i \in \{1,2,...,N_{r}\}\) indexes the \(i^{th}\) reduction cell, \(P_{r}^{i}\) denotes its position, and \(\mathrm{floor}(\cdot )\) discards the fractional part of a given number. After arranging normal cells and reduction cells, we append \(N_{f}\) fusion cells behind them. In the search stage, our model has a total of ten cells, and we set \(N_{r}\) and \(N_{f}\) to 5 and 3, respectively. As illustrated in Fig. 3, among the first seven cells, we interweave two normal cells and five reduction cells. Then, we append three consecutive fusion cells behind them to generate volumetric heat maps for all parts. We employ an \(\mathrm{L1}\) loss to supervise the estimated 3D poses, and we alternately update the network parameters w on the training set and the architectures \(\alpha \) of all cell types on the validation set.
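For the setting used here (\(N_{c}=10\), \(N_{r}=5\), \(N_{f}=3\)), Eq. 6 gives \(\mathrm{floor}(7/6)=1\), so the reduction cells sit at positions 2-6. A short check of this arithmetic:

```python
def reduction_positions(n_cells=10, n_reduction=5, n_fusion=3):
    """Positions of the reduction cells according to Eq. 6 (floor division)."""
    step = (n_cells - n_fusion) // (n_reduction + 1)
    return [step * i + 1 for i in range(1, n_reduction + 1)]

# Normal cells occupy the remaining positions 1 and 7, and the three fusion
# cells are appended at positions 8-10.
print(reduction_positions())   # -> [2, 3, 4, 5, 6]
```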

When the search process finishes, we obtain the optimized normal cell, reduction cell, and fusion cell, as in Fig. 2(b). To evaluate the effectiveness of the searched architectures, we re-train a model constructed from these optimized cells. When the model is built with ten computational cells, the overview of its architecture is the same as in the search stage. As shown in Fig. 3, an input image first goes through a \(3\times 3\) convolution layer and a normal cell to generate the feature map. Then, five consecutive reduction cells downsample the feature map and double its channels, with a total stride of \(2^{5}\). After this series of reduction cells, the feature map is \(8\times 8\times 2048\) in size, and a normal cell refines it further. To generate the volumetric heat map, we use the proposed fusion cells to upsample the feature map. Except for the last one, we set the output channels of the remaining fusion cells to 256, as is common practice. Three consecutive fusion cells upsample the feature map with a total stride of \(2^{3}\) and generate a volumetric heat map of size \(64 \times 64 \times 64\) for all body parts. For each part, we extract its 3D coordinate from the corresponding volumetric heat map via the differentiable soft-argmax operation [48]. As in the search stage, we train our model with an \(\mathrm{L1}\) loss.
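For reference, a minimal sketch of a differentiable soft-argmax for reading 3D coordinates out of volumetric heat maps is given below. The tensor layout (batch, parts, depth, height, width) and the function name are assumptions; the exact normalization follows [48].

```python
import torch

def soft_argmax_3d(heatmap):
    """Differentiable soft-argmax over a volumetric heat map.

    `heatmap` has shape (B, K, D, H, W) for K parts; returns (B, K, 3)
    coordinates (x, y, z) in voxel units.
    """
    b, k, d, h, w = heatmap.shape
    probs = torch.softmax(heatmap.reshape(b, k, -1), dim=-1).reshape(b, k, d, h, w)
    zs = torch.arange(d, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    z = (probs.sum(dim=(3, 4)) * zs).sum(dim=-1)   # expectation over the depth axis
    y = (probs.sum(dim=(2, 4)) * ys).sum(dim=-1)   # expectation over the vertical axis
    x = (probs.sum(dim=(2, 3)) * xs).sum(dim=-1)   # expectation over the horizontal axis
    return torch.stack([x, y, z], dim=-1)
```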

4 Experimental Evaluation

In this section, we present a detailed evaluation of our proposed approach. First, we introduce the main benchmarks and our experimental settings. Then, we conduct a rigorous ablation analysis of our approach. Finally, we build our strongest part-aware estimator upon the knowledge obtained from the ablation studies and compare it with the state of the art.

4.1 Main Benchmarks and Evaluation Metrics

Human3.6M Dataset [18]: It is captured in a calibrated multi-view studio and consists of 3.6 million video frames. Eleven subjects are recorded from four camera viewpoints while performing 15 activities. Previous works widely use two evaluation metrics. The first is the mean per joint position error (MPJPE), which first aligns the pelvis joints of the estimated and ground-truth 3D poses and then computes the average error over all human joints. The second metric additionally applies Procrustes Analysis (PA) to align the poses before computing the error, and is called PA MPJPE.
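A small sketch of how MPJPE is computed after pelvis alignment is shown below (in millimetres). The joint layout and pelvis index are assumptions that depend on the skeleton definition; PA MPJPE additionally applies a Procrustes (similarity) alignment before the same error is computed.

```python
import numpy as np

def mpjpe(pred, gt, pelvis_idx=0):
    """Mean Per Joint Position Error after root (pelvis) alignment.

    `pred` and `gt` are (K, 3) arrays of 3D joint coordinates in millimetres.
    The pelvis index is illustrative and depends on the joint ordering.
    """
    pred_aligned = pred - pred[pelvis_idx]   # make the prediction root-relative
    gt_aligned = gt - gt[pelvis_idx]         # make the ground truth root-relative
    return np.linalg.norm(pred_aligned - gt_aligned, axis=-1).mean()
```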

MuCo-3DHP and MuPoTS-3D Datasets [34]: These datasets are designed for multi-person 3D pose estimation. The training set is the MuCo-3DHP dataset, which is generated by compositing the MPI-INF-3DHP dataset [33]. The MuPoTS-3D dataset acts as the test set and contains 20 in-the-wild scenes. The evaluation metric is the 3D percentage of correct keypoints (3DPCK).

4.2 Experimental Settings and Implementation Details

Human3.6M Dataset: Two evaluation protocols are widely used. Protocol 1 uses six subjects (S1, S5, S6, S7, S8, S9) for training and reports results on every \(64^{th}\) frame of Subject 11’s videos using PA MPJPE. Protocol 2 uses five subjects (S1, S5, S6, S7, S8) for training and reports results on every \(64^{th}\) frame of two subjects (S9, S11) using MPJPE. In the evaluation stage of our approach, we additionally use MPII [2] 2D pose data during training.

In the search stage, we train the network only with Human3.6M data. We use three subjects (S1, S5, S6) as the training set to update the network parameters w and two subjects (S7, S8) as the validation set to update the network architecture \(\alpha \). We include the following eight operations in the pre-defined space \(\mathcal {O}\): \(3\times 3\) and \(5\times 5\) separable convolutions, \(3\times 3\) and \(5\times 5\) dilated separable convolutions, \(3\times 3\) max pooling, \(3\times 3\) average pooling, identity, and zero.
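The eight candidate operations can be written down roughly as follows. The ReLU/BatchNorm wrapping and the stride handling used in DARTS are omitted for brevity, so this dictionary is a simplified sketch of the search space \(\mathcal {O}\) rather than the exact implementation.

```python
import torch.nn as nn

class Zero(nn.Module):
    """The 'zero' operation drops the connection by outputting zeros."""
    def forward(self, x):
        return x * 0.0

def candidate_ops(channels):
    """Simplified candidate operations of the search space O (stride 1)."""
    def sep_conv(kernel, dilation=1):
        padding = dilation * (kernel // 2)
        return nn.Sequential(
            nn.Conv2d(channels, channels, kernel, padding=padding,
                      dilation=dilation, groups=channels, bias=False),  # depth-wise
            nn.Conv2d(channels, channels, 1, bias=False),               # point-wise
        )
    return {
        'sep_conv_3x3': sep_conv(3),
        'sep_conv_5x5': sep_conv(5),
        'dil_conv_3x3': sep_conv(3, dilation=2),
        'dil_conv_5x5': sep_conv(5, dilation=2),
        'max_pool_3x3': nn.MaxPool2d(3, stride=1, padding=1),
        'avg_pool_3x3': nn.AvgPool2d(3, stride=1, padding=1),
        'identity': nn.Identity(),
        'zero': Zero(),
    }
```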

MuCo-3DHP and MuPoTS-3D Datasets: We create 400K composite frames of the MuCo-3DHP dataset, of which half are without appearance augmentation. We use additional COCO  [29] 2D pose data during training.

Table 1. Quantitative evaluation of the number of intermediate nodes within each fusion cell on Human3.6M using Protocol 2. \(N_{i}\) denotes the number of intermediate nodes within each fusion cell. Lower is better, best in bold, second-best underlined.
Fig. 4. Cells found on the Human3.6M dataset when we set \(N_{i}\) to 2. Our model uses two intermediate nodes encoded in the fusion cell to estimate different groups of body parts.

Implementation Details: In the search stage, to save GPU memory, we set the sizes of the input image and the volumetric heat map to \(128\times 128\) and \(32\times 32\times 32\), respectively. The total number of training epochs is 25, and the parameters w are updated by the Adam optimizer [24] with a batch size of 40. The initial learning rate is \(1\times 10^{-3}\) and is reduced by a factor of ten at the \(15^{th}\) and \(20^{th}\) epochs. We start to optimize the network architecture \(\alpha \) at the \(8^{th}\) epoch; its learning rate and weight decay are \(8\times 10^{-4}\) and \(3\times 10^{-4}\), respectively. The search process lasts two days on a single NVIDIA TITAN RTX GPU. In the evaluation stage, the sizes of the input image and the volumetric heat map are \(256\times 256\) and \(64\times 64\times 64\), respectively. The total number of epochs is 20. We train our network with Adam and a batch size of 64. The initial learning rate is \(1\times 10^{-3}\) and is reduced by a factor of ten at the \(12^{th}\) and \(16^{th}\) epochs. Training samples are augmented via rotation (\(\pm 30^{\circ }\)), horizontal flip, color jittering, and synthetic occlusion [46]. The training process takes two days on four NVIDIA P100 GPUs. We run each experiment three times with different random seeds, and the confidence interval is about \(\pm 0.3\) mm.
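As a sketch of the evaluation-stage optimization schedule described above (Adam, batch size 64, 20 epochs, L1 loss, learning rate \(1\times 10^{-3}\) decayed by a factor of ten at epochs 12 and 16); the model construction and data loading are elided, and the placeholder parameter is purely illustrative.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

params = [torch.nn.Parameter(torch.zeros(1))]   # stands in for the estimator's parameters
optimizer = Adam(params, lr=1e-3)
scheduler = MultiStepLR(optimizer, milestones=[12, 16], gamma=0.1)
criterion = torch.nn.L1Loss()                   # used inside the elided epoch loop

for epoch in range(20):
    # ... one epoch over the training set: forward pass, L1 loss on the
    # estimated 3D poses, backward pass, optimizer.step() ...
    scheduler.step()
```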

4.3 Ablation Experiments

The Number of Intermediate Nodes in the Fusion Cell

As explained in Sect. 3, the number of intermediate nodes in the fusion cell equals the number of groups into which we divide all body parts. In this set of experiments, by adjusting the number of intermediate nodes, we explore the optimal number of groups. In the search stage, we optimize network architectures whose fusion cells have \(N_{i}\in \{1,2,3,4\}\) intermediate nodes, with a total of ten computational cells, as in Fig. 3. In Table 1, we observe that the model with two intermediate nodes outperforms all the others on every action. Compared to dividing all parts into more or fewer groups, it achieves a better trade-off between performance and computational complexity. With only 13.0M parameters and 10.7G FLOPs, it encouragingly reduces MPJPE to 53.6 mm.

Fig. 5. Illustration of the equivalence between shuffling the part order and shuffling the heat map order. The number in the box denotes the part id. There are a total of eighteen parts. As shown in Fig. 4(d), within the last fusion cell, orange boxes indicate parts estimated by Node 0, and pink boxes indicate ones estimated by Node 1. (Color figure online)

To investigate what makes our architecture efficient when \(N_{i}\) is 2, we visualize the searched architectures in Fig. 4. For comparison, when \(N_{i}\) is 1, our model estimates all body parts with a single head network. It is computationally intensive, with 14.7M parameters and 22.9G FLOPs, but its performance is not satisfactory. In the better solution shown in Fig. 4(d), we employ the two intermediate nodes encoded in the fusion cell to estimate the torso and the limbs, respectively. Specifically, Node 0 is transformed by pooling layers and is robust for estimating parts with relatively low DOFs. On the other hand, dilated convolutional layers empower Node 1 to capture long-range context information, which is helpful for estimating parts with higher DOFs, such as the wrists and ankles. The normal cell, shown in Fig. 4(a), consists of many dilated convolutional layers, which greatly increase the receptive field of our model and are critical to the performance improvement. As shown in Table 1, if we remove dilated convolutions from our search space \(\mathcal {O}\), the searched model has more parameters and FLOPs, and its performance drops from 53.6 mm to 59.9 mm. The reduction cell employs many depth-wise convolution layers to fuse multi-scale features efficiently. Similarly, we validate their importance by removing these operations from \(\mathcal {O}\), which leads to a 5.1 mm drop in performance.

Table 2. Quantitative evaluation of the shuffled part order on Human3.6M using Protocol 2. We set \(N_{c}\) and \(N_{i}\) to 10 and 2 respectively. We compute part-wise MPJPE to report performance. Bold values indicate parts estimated by Node 0 and italic values denote ones estimated by Node 1.
Table 3. Quantitative evaluation of the importance of the fusion cell on Human3.6M using Protocol 2. BS and WS denote the backbone search and the whole architecture search, respectively. We compute action-wise MPJPE to report the network performance. Lower is better, best in bold, second-best underlined.

The Part-Awareness of Our Model

We validate the part-awareness of our approach from two perspectives. First, to investigate whether the searched head networks are part-specific, we would like to shuffle the order of parts when we re-train our model in the evaluation stage. However, doing so directly is cumbersome, since we would have to modify the data augmentation policy according to the shuffled order. Alternatively, as shown in Fig. 5, we shuffle the order of the heat maps produced in the last fusion cell. The implementation of the shuffle operation is the same as in ShuffleNet [59], which is efficient and GPU-friendly. If the model trained with a shuffled order behaves clearly worse than the original one, we can conclude that our optimized head networks are part-aware. We run the experiment three times, training our model with different shuffled orders. As shown in Table 2, all models trained with shuffled orders suffer a significant drop in performance of more than 3 mm in MPJPE. Looking more closely, the decline is also reflected in every individual part, especially parts with higher DOFs (e.g., ankle, knee), whose estimation accuracy can drop by more than 5 mm. By comparing models trained with shuffled orders, we confirm that our approach learns part-specific head networks for specific body parts in the search stage.
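A possible implementation of the heat-map shuffle is sketched below. The function name, the tensor layout, and the assumption of an equal number of depth channels per part are ours; only the idea (a ShuffleNet-style channel reordering [59]) follows the text.

```python
import torch

def shuffle_heatmap_order(heatmaps, num_parts, permutation):
    """Reorder per-part volumetric heat maps produced by the last fusion cell.

    `heatmaps` has shape (B, num_parts * D, H, W), where each part owns D
    consecutive depth channels; `permutation` is a list of part indices.
    """
    b, c, h, w = heatmaps.shape
    d = c // num_parts
    per_part = heatmaps.reshape(b, num_parts, d, h, w)  # split channels by part
    shuffled = per_part[:, permutation]                 # reorder the parts
    return shuffled.reshape(b, c, h, w)
```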

Table 4. Quantitative evaluation of the number of cells on Human3.6M using Protocol 2. \(N_{c}\) denotes the number of computational cells. We compute action-wise MPJPE to report the network performance. Lower is better, best in bold, second-best underlined.
Table 5. Comparison with state-of-the-art methods on Human3.6M using Protocol 1. S denotes our small part-aware model with ten cells, and L denotes our large model with twenty cells. Lower is better, best in bold, second-best underlined.

In our model, the fusion cell plays a pivotal role in learning part-specific head networks. To evaluate its importance, we replace the fusion cells with deconvolution layers and only search the backbone network, which consists solely of normal cells and reduction cells. For a fair comparison, all constructed networks have two normal cells and five reduction cells; their only difference is whether they have fusion cells. As shown in Table 3, compared to searching only the backbone, searching the whole network architecture improves performance by 3.5 mm while using 37% fewer parameters and 14% fewer FLOPs. Compared with the model built on the commonly used ResNet-50 backbone, we improve estimation accuracy by 0.3 mm with 62% fewer parameters and 24% fewer FLOPs. These experiments show that the fusion cells contribute significantly to the compactness and efficiency of our approach and yield more competitive performance than models using a ResNet-50 backbone.

The Number of Computational Cells

Instead of stacking only ten computational cells, we construct deeper part-aware 3D pose estimators according to Eq. 6. As shown in Table 4, as we increase the number of computational cells, our model improves in performance but has more parameters and FLOPs. When \(N_{c}\) is 20, our model achieves the best performance of 47.3 mm in MPJPE. Increasing \(N_{c}\) from 10 to 20 raises the number of parameters (from 13.0M to 20.4M) and FLOPs (from 10.7G to 14.1G) but also improves performance (from 53.6 mm to 47.3 mm). This also demonstrates that the network architecture optimized during the search process is computationally efficient.

4.4 Comparison with the State-of-the-Art

To demonstrate the effectiveness and generalization ability of our approach, we conduct experiments on both single-person and multi-person 3D pose estimation benchmarks. Previous works use different experimental settings, so we summarize the comparison results in Tables 5, 6, and 7, respectively. In Fig. 6, we show qualitative results produced by our model with ten cells. It generalizes well to in-the-wild images, even for challenging poses and crowded scenes.

Table 6. Comparison with state-of-the-art methods on Human3.6M using Protocol 2. S denotes our small part-aware model with ten cells, and L denotes our large model with twenty cells. Lower is better, best in bold, second-best underlined.
Table 7. Comparison with state-of-the-art methods on MuPoTS-3D using all ground truths. S denotes our small part-aware model with ten cells, and L denotes our large model with twenty cells. Higher is better, best in bold, second-best underlined.
Fig. 6. Qualitative results on different datasets. Our small model produces convincing results even on challenging poses and crowded scenes.

Single-Person 3D Human Pose Estimation: We compare our approach with state-of-the-art methods on Human3.6M in Tables 5 and 6. With about 40% fewer parameters, our large part-aware model advances the state-of-the-art accuracy by 1.3 mm and 2.3 mm under Protocol 1 and Protocol 2, respectively. If we add supervision on intermediate feature maps, the performance of our small model improves significantly, reaching 50.4 mm under Protocol 2. Moreover, our method is also compatible with several efficient learning frameworks [19, 25, 62].

Multi-person 3D Human Pose Estimation: For multi-person 3D pose estimation, we use RootNet [35] to estimate the absolute depth of each person’s root joint. As shown in Table 7, we compare our model with previous state-of-the-art multi-person pose estimation methods on MuPoTS-3D, and our large part-aware 3D pose estimator achieves superior performance on every sequence.

5 Conclusion and Future Works

In this work, we propose to estimate the 3D poses of different body parts with part-specific neural architectures. In the search stage, we optimize the architectures of the different types of cells via gradient descent. Then, we interweave the optimized computational cells to construct our part-aware 3D pose estimator, which is compact and efficient. Our model advances the state-of-the-art accuracy on both single-person and multi-person 3D human pose estimation benchmarks. In the future, we plan to explore other NAS methods to search 3D pose estimators in a larger space, which may open up the possibility of global optimization.