1 Introduction

Neural scene representations have emerged as a major trend in computer vision over the last few years. They represent scenes through neural networks operating on 3D space and enable tasks such as novel view synthesis of static content, generalization over object and scene classes, body, hand and face modelling, as well as relighting and material editing. For a detailed overview, please refer to [53, 56]. While most such methods rely on time- and memory-intensive volumetric rendering, [50] proposes a method that renders with a single network evaluation per ray. In this paper, we propose a single-evaluation rendering formulation based on keypoints in order to represent dynamically deforming objects, such as humans and animals. This allows us to render not only novel views of a known shape, but also new and unseen poses by adjusting the positions of 3D keypoints. We apply this flexible approach to prevalent tasks in computer vision: the representation and reconstruction of pose for humans and animals. More specifically, we propose a silhouette-based 3D keypoint detector built on inverse neural rendering. We deliberately factor out appearance variation for keypoint detection in order to ease the bridging of domain gaps and to handle animals for which little data is available. Despite relying only on silhouettes, our proposed approach shows promising results when compared to the state-of-the-art 3D keypoint detector [17]. Furthermore, our approach is capable of zero-shot synthetic-to-real generalization, see Fig. 1.

Fig. 1.
figure 1

Zero-Shot Synthetic to Real Experiment: Giraffe. Reconstruction of a giraffe in different poses. From left to right: input image, segmentation mask used as input for keypoint estimation, estimated keypoints and shape, rendering from a different perspective (cf. supplementary for more real-world examples).

Contributions. Our contribution is a flexible keypoint based neural scene representation and neural rendering framework called Neural Puppeteer (NePu).

We demonstrate that NePu provides valuable gradients as a differentiable forward map in an inverse rendering approach for 3D keypoint detection. Since we formulate the inverse rendering exclusively on 2D silhouettes, the resulting 3D keypoint detector is inherently robust with respect to appearance transformations and domain shifts. Note that common 3D keypoint estimators require a huge number of training samples with varying texture in order to predict keypoints of the same class of shapes. Another advantage of texture independence is that it eases generalization from training on synthetic data. This is particularly useful when obtaining a sufficient amount of real-world annotations is highly challenging, such as for wild animals (cf. Fig. 1). For animal shapes, we outperform a state-of-the-art 3D multi-view keypoint estimator [17] in terms of Mean Per Joint Position Error (MPJPE).

Unlike common practice, we shift rendering from the 3D to the 2D domain, requiring only a single neural network evaluation per ray. In this sense, our approach can be interpreted as a locally conditioned version of Light Field Networks (LFNs) [50]. Our formulation learns the inter-pose variations of a single instance (constant shape and appearance) under constant lighting conditions, similar to [52], whereas LFNs learn a prior over inter-object variations. We retain the rendering efficiency of LFNs and render color, depth and occupancy simultaneously at \(20\,\text {ms}\) per \(256^2\) image. This is significantly faster than NeRF-like approaches such as [52], which typically achieve less than 1 fps. Thanks to this fast renderer, fitting a target pose by inverse rendering runs at \({\sim }1\) fps using 8 cameras. Furthermore, we show that our keypoint-based local conditioning significantly improves the neural rendering of articulated objects, with visibly more detail and quantitative improvements in PSNR and MAE over LFNs.

Code and data sets [12] to reproduce the results in the paper are publicly available at https://urs-waldmann.github.io/NePu/. We hope to inspire further work on animal pose and shape reconstruction, where our synthetic dataset can serve as a controlled environment for evaluation purposes and to experiment with novel ideas.

2 Related Work

2.1 3D Keypoint Estimation

Human keypoint estimation is a vast field with many applications [2, 6, 55, 57]. For further reading, we refer the reader to [9, 19, 25, 54]. The current state-of-the-art methods for 3D human keypoint estimation trained on a single data set, namely the well-known Human3.6M data set [16], are [14, 17, 45], with an average MPJPE of 18.7 mm, 20.8 mm and 26.9 mm, respectively. At the time of writing, no code was available for [45]; we therefore choose LToHP [17] as the baseline for our quantitative comparison. With the huge success of human keypoint estimation, the prediction of 3D keypoints for animals has become a sub-branch of its own [3, 13, 20, 21, 34]. We note that all these 3D frameworks for animals combine 2D keypoints with standard 3D reconstruction techniques. We therefore also choose LToHP [17] as our baseline for animals, since it uses a learnable approach for the 3D reconstruction part.

Please keep in mind that our pose estimation relies only on multi-view silhouettes, while the above methods require RGB images. Using silhouettes provides robustness to changes in texture and lighting. The only other work we are aware of that extracts keypoints from silhouettes is [5]. While [5] extracts 2D keypoints from silhouettes of quadrupeds using a template mesh and a stacked hourglass network [35], we predict 3D coordinates for arbitrary shapes.

2.2 Morphable Models

The seminal morphable models for animals and humans are SMAL [65] and SMPL [26], respectively. An extension of SMPL, called SMPL-X [41], includes hands [46] and the face [24]. These models have been used to estimate 3D pose and shape from a single image [4, 47, 62], from multiple unconstrained images in the wild [49, 64], or in an unsupervised manner from a sparse set of landmarks [29]. Because creating these models is tedious, the authors of [63] present an unsupervised disentanglement of pose and 3D mesh shape. Neural Body [43] uses implicit neural representations in which different sparse multi-view frames of a video share the same set of latent codes anchored to the vertices of the SMPL model. In this way, the SMPL model provides a geometric prior with which they can reconstruct 3D geometry and appearance and synthesize novel views.

While these representations are comparatively easy to handle, the need for a parametric template mesh limits them to a pre-defined class of shapes. In particular, models for quadrupeds can fit a large variety of relatively similar animals like cats and horses, but run into problems if uncommon shapes are present [65]. For example, fitting elephants or giraffes poses significant problems due to their additional features (trunk) or different proportions (long neck). Similar problems arise for humans with clothing, e.g. a free-flowing coat.

In contrast, [23, 58] exploit implicit representations and remove manual intervention entirely. They disentangle dynamic objects into a signed distance field defined in a canonical space and a latent pose code, represented by the flow field from the canonical pose to a given posed shape of the same identity.

2.3 Neural Fields

Neural fields are an emerging research area that has become a popular representation for neural rendering [32, 33, 50, 53, 60], 3D reconstruction and scene representation [31, 38, 51], geometry-aware generative modelling [7, 36, 48] and many more. For a detailed overview, we refer the reader to [56].

While these representations often rely on a global latent vector to represent the information of interest [31, 38, 50, 51], the importance of locally conditioning the neural field has been demonstrated in [8, 11, 33, 36, 44]. By relying on local information drawn from geometrically aligned latent vectors, these methods often obtain higher quality reconstructions and generalize better due to their translation equivariance.

Neural fields have had an especially strong impact on neural rendering with the formulation of radiance-based integration over rays introduced in NeRF [32]. While NeRF and many follow-ups [33, 60] achieve high-quality renderings of single scenes, [59] matches their performance without relying on neural networks, implying that NeRF’s core competence is to provide meaningful gradients for optimization. [52] combines the well-known NeRF pipeline [32] with a keypoint-based skeleton, which allows them to reconstruct articulated 3D representations of humans by representing the pose through the 3D positions of the keypoints. However, [52] optimizes 3D coordinates in the camera system obtained from fitting the SMPL model [26], while we predict 3D world coordinates from multi-view silhouettes.

In contrast to the volumetric rendering of NeRF, which typically requires hundreds of evaluations of the neural network per ray, LFNs [50] propose an alternative that directly predicts a pixel’s color given the corresponding ray’s origin and direction. This results in much faster rendering than NeRF-like approaches. In this work, we embrace the idea of single-evaluation rendering [50] and employ ideas from [11] for an efficient representation and architecture. Our rendering is thus much faster than [52], which also allows for faster solutions to the inverse rendering problem of estimating keypoint locations from images. Since [50] is the only other single-evaluation rendering method we are aware of, we quantitatively compare against it in Table 1 in terms of color and depth.

3 Neural Puppeteer

We will describe Neural Puppeteer in three parts. First, we discuss the encoding of the pose and latent codes, as can be seen on the upper half of Fig. 2. Second, we describe our keypoint-based neural rendering, as is depicted in the lower half of the figure. Third and last, we discuss how our pipeline can be inverted to perform keypoint estimation by fitting the pose generated from 3D keypoints to input silhouette data.

3.1 3D Keypoint Encoding

Given 3D keypoint coordinates \(\textbf{x} \in \mathbb {R}^{K \times 3}\) of a subject with K keypoints, we aim to learn an encoder network

$$\begin{aligned} \text {enc}: \mathbb {R}^{K \times 3} \rightarrow \mathbb {R}^{d_z},~ \textbf{x} \mapsto \textbf{z} \end{aligned}$$
(1)

that encodes a pose \(\textbf{x}\) as a \(d_z\)-dimensional global representation \(\textbf{z}\), as well as to learn a decoder network

$$\begin{aligned} \text {dec}: \mathbb {R}^{d_z} \rightarrow \mathbb {R}^{K \times 3} \times \mathbb {R}^{K \times d_f},~ \textbf{z} \mapsto (\hat{\textbf{x}}, \textbf{f}) \end{aligned}$$
(2)

that reconstructs the pose \(\hat{\textbf{x}}\) and obtains local features \(\textbf{f}_k \in \mathbb {R}^{d_f}\) for each keypoint. Subsequently, we use the representation \(( \textbf{z},\textbf{f})\) to locally condition our neural rendering on the pose \(\textbf{x}\), as explained in Sect. 3.2. Please note that we do not require a skeleton model, i.e. connectivity between keypoints, since the pose is interpreted purely as a point cloud.

Fig. 2.
figure 2

Neural Puppeteer: Our method takes 3D keypoints (top left) and learns individual latent codes for each keypoint (top right). We project them into arbitrary camera views and perform neural rendering to reconstruct 2D RGB-D images (bottom right). The pipeline can be used to perform pose estimation by inverse rendering, optimizing the 3D keypoint positions by performing gradient descent in the latent code \({\textbf {z}}\in \mathbb {R}^{d_z}\). For rendering, we use only the closest keypoints, illustrated by the yellow circle around the point in question (yellow dot). Connections between keypoints are shown just for visualization and not used by the network. (Color figure online)

We build our encoder upon the vector self-attention

$$\begin{aligned} {\text {VSA}}: \mathbb {R}^{K \times 3} \times \mathbb {R}^{K \times d_f} \rightarrow \mathbb {R}^{K \times d_f},~ (\textbf{x}, \textbf{f}) \mapsto \textbf{f}^{\prime } \end{aligned}$$
(3)

introduced in [61] as a neural geometric point cloud operator. Our encoder, which consists of L layers, consequently produces features

$$\begin{aligned} \textbf{f}^{(l+1)} = \text {ET}_l( \text {BN}_l( \textbf{f}^{(l)} + {\text {VSA}}_l(\textbf{x}, \textbf{f}^{(l)}) ) ), l\in \{0,\cdots ,L-1\}, \end{aligned}$$
(4)

where BN denotes a BatchNorm layer [15] and ET denotes an element-wise transformation consisting of a 2-layer MLP, a residual connection and another BatchNorm. The initial features \(\textbf{f}^{(0)} \in \mathbb {R}^{K \times d_f}\) are learned as free parameters. The final global representation \(\textbf{z}\) is obtained via dimension-wise global max-pooling,

$$\begin{aligned} \textbf{z} = \underset{k=1, \dots , K}{{\text {max}}}~ \textbf{f}^{(L)}_k. \end{aligned}$$
(5)

We decode the latent vector \(\textbf{z}\) using two 3-layer MLPs: the keypoints are reconstructed via \(\hat{\textbf{x}} = \text {MLP}_{\text {pos}}(\textbf{z})\), and additional features \(\tilde{\textbf{f}} = \text {MLP}_{\text {feats}}(\textbf{z})\), holding information for the subsequent rendering, are extracted. Finally, these features are refined to \(\textbf{f}\) using three further VSA layers, which completes the decoder \(\text {dec}(\textbf{z}) = (\hat{\textbf{x}}, \textbf{f})\). For more architectural details we refer to the supplementary.
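To make the data flow of Eqs. (3)–(5) concrete, the following PyTorch sketch outlines one possible implementation of the encoder. The class passed as `vsa_layer_cls` stands in for the vector self-attention of [61] with an assumed call signature `vsa(x, f)`, and all layer sizes are illustrative rather than the values used in our experiments.

```python
import torch
import torch.nn as nn

class NePuEncoder(nn.Module):
    """Sketch of the pose encoder (Eqs. 3-5): L blocks of vector self-attention
    over the K keypoints, followed by dimension-wise global max-pooling."""

    def __init__(self, vsa_layer_cls, num_keypoints: int, d_f: int, num_layers: int = 4):
        super().__init__()
        # Initial per-keypoint features f^(0) are free, learnable parameters.
        self.f0 = nn.Parameter(torch.randn(num_keypoints, d_f))
        self.vsa = nn.ModuleList(vsa_layer_cls(d_f) for _ in range(num_layers))
        self.bn = nn.ModuleList(nn.BatchNorm1d(d_f) for _ in range(num_layers))
        # Element-wise transformation ET: 2-layer MLP with residual + BatchNorm.
        self.et_mlp = nn.ModuleList(
            nn.Sequential(nn.Linear(d_f, d_f), nn.ReLU(), nn.Linear(d_f, d_f))
            for _ in range(num_layers))
        self.et_bn = nn.ModuleList(nn.BatchNorm1d(d_f) for _ in range(num_layers))

    def forward(self, x):                        # x: (B, K, 3) 3D keypoints
        f = self.f0.expand(x.shape[0], -1, -1)   # (B, K, d_f)
        for vsa, bn, mlp, bn2 in zip(self.vsa, self.bn, self.et_mlp, self.et_bn):
            # Eq. (4): f <- ET( BN( f + VSA(x, f) ) ); BatchNorm acts on the feature dim.
            h = bn((f + vsa(x, f)).transpose(1, 2)).transpose(1, 2)
            f = bn2((h + mlp(h)).transpose(1, 2)).transpose(1, 2)
        # Eq. (5): dimension-wise max over all keypoints gives the global code z.
        return f.max(dim=1).values               # (B, d_f); here d_z = d_f for simplicity
```

The decoder then maps \(\textbf{z}\) back to \(\hat{\textbf{x}}\) and \(\tilde{\textbf{f}}\) with two MLP heads and refines \(\tilde{\textbf{f}}\) with further VSA layers, as described above.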

3.2 Keypoint-Based Neural Rendering

Contrary to many recent neural rendering approaches, we do not rely on costly NeRF-style volumetric rendering [32]. Instead, we adopt an approach similar to LFNs [50] that predicts a pixel’s color with a single neural network evaluation.

While LFNs use a global latent vector and the ray’s origin and direction as network inputs, we operate directly in pixel coordinates. Perspective information is incorporated by projecting the keypoints \(\textbf{x}\) and the corresponding local features \(\textbf{f}\) into pixel coordinates. More specifically, given camera extrinsics \(\textbf{E}\), intrinsics \(\textbf{K}\) and the resulting projection \(\pi _{\textbf{E}, \textbf{K}}\), we obtain 2D keypoints

$$\begin{aligned} \textbf{x}_{2D} = \pi _{\textbf{E}, \textbf{K}}(\textbf{x}). \end{aligned}$$
(6)

Additionally, we append the depth values \(\textbf{d}\), such that the keypoints’ positions become \(\textbf{x}_{2D}^* = (\textbf{x}_{2D}, \textbf{d}) \in \mathbb {R}^{K \times 3}\). Using this positional information, we further refine the features in pixel coordinates using \(L=3\) layers of VSA. Specifically, we define \(\textbf{f}^{(0)}_{2D} = \textbf{f}\) and

$$\begin{aligned} \textbf{f}_{2D}^{(l+1)} = {\text {VSA}}_l(\textbf{x}_{2D}^*, \textbf{f}_{2D}^{(l)}), l\in \{0,1,2 \}. \end{aligned}$$
(7)

The resulting refined features \(\textbf{f}_{2D}^{(L)}\) and coordinates \(\textbf{x}_{2D}^*\) are the basis for our single-evaluation rendering, as described next.
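A minimal sketch of the projection step (Eq. (6)) and the 2D feature refinement (Eq. (7)) could look as follows; the OpenCV-style conventions, the \(4\times 4\) extrinsics matrix and the `vsa` call signature are assumptions of this illustration, not part of the original implementation.

```python
import torch

def project_keypoints(x, E, K):
    """Eq. (6): pinhole projection of world-space keypoints x (B, Kp, 3) with
    4x4 extrinsics E and 3x3 intrinsics K (OpenCV-style conventions assumed).
    Returns x*_2D = (u, v, depth) per keypoint, shape (B, Kp, 3)."""
    ones = torch.ones_like(x[..., :1])
    x_h = torch.cat([x, ones], dim=-1)        # homogeneous world coordinates
    x_cam = (x_h @ E.T)[..., :3]              # world -> camera frame
    depth = x_cam[..., 2:3]                   # z-coordinate in the camera frame
    x_pix = (x_cam @ K.T)[..., :2] / depth    # perspective division to pixels
    return torch.cat([x_pix, depth], dim=-1)

def refine_2d_features(x2d_star, f, vsa_layers):
    """Eq. (7): refine the projected keypoint features with three VSA layers,
    now attending in (u, v, depth) coordinates."""
    for vsa in vsa_layers:
        f = vsa(x2d_star, f)
    return f
```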

Given a pixel coordinate \(\textbf{q} \in \mathbb {R}^2\), we follow [11] to interpolate meaningful information from nearby features \(\textbf{f}_{2D}^{(L)}\) and the global representation \(\textbf{z}\). To be more specific, the relevant information is

$$\begin{aligned} \textbf{y} = {\text {VCA}}( (\textbf{q}, 0), \textbf{z}, \textbf{x}_{2D}^*, \textbf{f}_{2D}^{(L)} ) \in \mathbb {R}^{d_f}, \end{aligned}$$
(8)

where VCA denotes the vector cross-attention from [11] and we set the depth for \(\textbf{q}\) to zero.

Finally, the color \(\hat{\textbf{c}}\), depth \(\hat{{\textbf {d}}}\) and 2D occupancy \(\hat{{\textbf {o}}}\) are predicted using three feed-forward network heads

$$\begin{aligned} \text {FFN}_{\text {col}}: \mathbb {R}^{d_f} \rightarrow [0, 1]^3,\, \text {FFN}_{\text {dep}}: \mathbb {R}^{d_f} \rightarrow [0, 1],\, \text { and } \text {FFN}_{\text {occ}}: \mathbb {R}^{d_f} \rightarrow [0, 1], \end{aligned}$$
(9)

respectively, using the same architecture as in [44]. For convenience we define the color rendering function

$$\begin{aligned} \mathcal {C}_{\textbf{E}, \textbf{K}}: \mathbb {R}^{K \times 3} \times \mathbb {R}^{K \times d_f} \times \mathbb {R}^{d_z} \rightarrow [0,1]^{H \times W \times 3}, \end{aligned}$$
(10)

which renders an image seen with extrinsics \(\textbf{E}\) and intrinsics \(\textbf{K}\), conditioned on keypoints \(\textbf{x}\), encoded features \(\textbf{f}\) and global representation \(\textbf{z}\), by evaluating \(\text {FFN}_{\text {col}}\) for all pixels \(\textbf{q}\). For the depth and silhouette modalities we analogously define \(\mathcal {D}_{\textbf{E}, \textbf{K}}\) and \(\mathcal {S}_{\textbf{E}, \textbf{K}}\), respectively. Note that our silhouettes contain the probability of a pixel lying on the object.
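The per-pixel decoding of Eqs. (8)–(9) can be sketched as follows. Here `vca` stands in for the vector cross-attention of [11] with an assumed signature, and the sigmoid output layers are simply one way to realize the \([0, 1]\) output ranges; the actual head architecture follows [44].

```python
import torch
import torch.nn as nn

class RenderHeads(nn.Module):
    """Eq. (9): three small feed-forward heads mapping the interpolated feature
    y to color, depth and 2D occupancy, each squashed into [0, 1]."""
    def __init__(self, d_f: int, hidden: int = 128):
        super().__init__()
        def head(out_dim):
            return nn.Sequential(nn.Linear(d_f, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim), nn.Sigmoid())
        self.col, self.dep, self.occ = head(3), head(1), head(1)

    def forward(self, y):
        return self.col(y), self.dep(y), self.occ(y)

def render_pixels(q, z, x2d_star, f2d, vca, heads):
    """Eq. (8): append depth 0 to the query pixels q (N, 2), gather local
    information with vector cross-attention, then decode all three modalities
    with a single evaluation per pixel."""
    q_star = torch.cat([q, torch.zeros_like(q[..., :1])], dim=-1)
    y = vca(q_star, z, x2d_star, f2d)          # (N, d_f)
    return heads(y)                            # (color, depth, occupancy)
```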

3.3 Training

We consider a dataset consisting of M poses \(\textbf{x}_m\in \mathbb {R}^{K\times 3}\) captured by C cameras with extrinsics \(\textbf{E}_c\) and intrinsics \(\textbf{K}_c\). For each view c and pose m we have 2D observations

$$\begin{aligned} \textbf{I}_{m,c},\; \textbf{D}_{m,c},\; \textbf{S}_{m,c},\; m\in \{1,\dots ,M\},\; c\in \{1,\dots ,C\}, \end{aligned}$$
(11)

corresponding to color, depth and silhouette, respectively.

All model parameters are trained jointly to minimize the composite loss

$$\begin{aligned} \mathcal {L} = \lambda _{\text {pos}}\mathcal {L}_{\text {pos}} + \lambda _{\text {col}}\mathcal {L}_{\text {col}} + \lambda _{\text {dep}}\mathcal {L}_{\text {dep}} + \lambda _{\text {sil}}\mathcal {L}_{\text {sil}} + \lambda _{\text {reg}}\Vert \textbf{z}\Vert _2, \end{aligned}$$
(12)

where the positive weights \(\lambda \) are hyperparameters that balance the influence of the individual losses. The keypoint reconstruction loss

$$\begin{aligned} \mathcal {L}_{\text {pos}} = \sum _{m=1}^M \Vert \textbf{x}_{m} - \hat{\textbf{x}}_{m} \Vert _2 \end{aligned}$$
(13)

minimizes the mean Euclidean distance between the ground truth and reconstructed keypoint positions. The color rendering loss

$$\begin{aligned} \mathcal {L}_{\text {col}} = \sum _{m=1}^M\sum _{c=1}^C \Vert \mathcal {C}_{\textbf{E}_{c}, \textbf{K}_{c}}(\textbf{x}_m, \textbf{f}_m, \textbf{z}_m) - \textbf{I}_{m,c}\Vert _2^2 \end{aligned}$$
(14)

is the squared pixel-wise difference summed over all color channels; the depth loss \(\mathcal {L}_{\text {dep}}\) is given by a structurally identical formula. Finally, the silhouette loss

$$\begin{aligned} \mathcal {L}_{\text {sil}} = \sum _{m=1}^M\sum _{c=1}^C \text {BCE}\left( \mathcal {S}_{\textbf{E}_{c}, \textbf{K}_{c}}(\textbf{x}_m, \textbf{f}_m, \textbf{z}_m), \textbf{S}_{m,c}\right) \end{aligned}$$
(15)

measures the binary cross entropy \(\text {BCE}(\hat{o}, o) = -[o\cdot \text {log}(\hat{o}) + (1-o)\cdot \text {log}(1-\hat{o})]\) over all pixels. Hence the silhouette renderer is trained to classify pixels into inside and outside points, similar to [31].
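Putting Eqs. (12)–(15) together, a per-batch loss computation might look like the following sketch. The dictionary layout of the per-pixel predictions and targets is an assumption of this illustration; the \(\lambda \) values are those reported in Sect. 4.2.

```python
import torch
import torch.nn.functional as F

def nepu_loss(x_gt, x_hat, z, preds, targets,
              lambdas=dict(pos=2.0, col=1.0, dep=1.0, sil=3.0, reg=1.0 / 16)):
    """Eq. (12): composite training loss for one batch. preds/targets hold the
    per-pixel color, depth and silhouette values at the sampled pixel locations
    (this layout is an assumption of the sketch)."""
    # Eq. (13): Euclidean distance between GT and reconstructed keypoints.
    l_pos = (x_gt - x_hat).norm(dim=-1).mean()
    # Eq. (14): squared color error; the depth loss is structurally identical.
    l_col = F.mse_loss(preds["color"], targets["color"])
    l_dep = F.mse_loss(preds["depth"], targets["depth"])
    # Eq. (15): binary cross entropy on the rendered silhouette probabilities.
    l_sil = F.binary_cross_entropy(preds["sil"], targets["sil"])
    l_reg = z.norm(dim=-1).mean()               # ||z||_2 regularizer
    return (lambdas["pos"] * l_pos + lambdas["col"] * l_col +
            lambdas["dep"] * l_dep + lambdas["sil"] * l_sil +
            lambdas["reg"] * l_reg)
```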

3.4 Pose Reconstruction and Tracking

While the proposed model learns a prior over poses along with their appearance and geometry, and can thus be used to render from 3D keypoints, we can also infer 3D keypoints by solving an inverse problem, using NePu as a differentiable forward map from keypoints to images. We are especially interested in silhouette-based inverse rendering in order to obtain robustness against transformations that leave silhouettes unchanged.

Given observed silhouettes \(\textbf{S}_c\), extrinsics \(\textbf{E}_c\) and intrinsics \(\textbf{K}_c\) for cameras \(c \in \{1, \dots , C\}\), we optimize for

$$\begin{aligned} \hat{\textbf{z}} = \underset{\textbf{z}}{{\text {argmin}}}~ \sum _{c=1, \dots , C}\text {BCE}\left( \mathcal {S}_{\textbf{E}_{c}, \textbf{K}_{c}}(\text {dec}(\textbf{z}), \textbf{z}), \textbf{S}_{c}\right) + \lambda \Vert \textbf{z} \Vert _2. \end{aligned}$$
(16)

Following [41], we use the PyTorch [39] implementation of the limited-memory BFGS (L-BFGS) [37] to solve the optimization problem. Once \(\hat{\textbf{z}}\) has been obtained, we recover the corresponding pose as \(\hat{\textbf{x}} = \text {MLP}_{\text {pos}}(\hat{\textbf{z}})\).

Since the above optimization problem does not assume any prior knowledge about the pose, the choice of the initial value is critical. For initial values too far from the ground truth pose, we observe unstable optimization behavior that frequently diverges or gets stuck in local minima. To overcome this obstacle, we run the optimization from I different starting points \(\textbf{z}^{(1)}, \dots , \textbf{z}^{(I)}\), which we obtain by clustering the 2-dimensional t-SNE [28] embedding of the global latent vectors over the training set using affinity propagation [10]. Minimizing Eq. (16) yields different solutions \(\hat{\textbf{z}}_i\), from which we choose as the final optimum the one with the best IoU to the target silhouettes,

$$\begin{aligned} \hat{\textbf{z}} = \underset{\hat{\textbf{z}}_i}{{\text {argmax}}}~ \sum _{c=1, \dots , C}\text {IoU}\left( \mathcal {S}_{\textbf{E}_{c}, \textbf{K}_{c}}(\text {dec}(\hat{\textbf{z}}_i), \hat{\textbf{z}}_i), \textbf{S}_{c}\right) . \end{aligned}$$
(17)

While such a procedure carries a significant overhead, the workload is amortized in a tracking scenario. When provided with a sequence of silhouettes \(\textbf{S}_{c,1}, \dots , \textbf{S}_{c, T}, c \in \{1, \dots , C\}\) of T frames, we use the above method to determine \(\hat{\textbf{z}}_1\). For subsequent frames we initialize \(\hat{\textbf{z}}_{t+1} = \hat{\textbf{z}}_{t}\) and fine-tune by minimizing Eq. (16) using a few steps of gradient descent. Our unoptimized implementation of the tracking approach runs at roughly 1 s per frame using 8 cameras.
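The silhouette-based fitting of Eqs. (16)–(17) can be sketched with PyTorch’s L-BFGS as follows. The callable `render_silhouette(z, c)` stands in for \(\mathcal {S}_{\textbf{E}_{c}, \textbf{K}_{c}}(\text {dec}(\textbf{z}), \textbf{z})\), the initial codes are assumed to be the precomputed cluster centers, and the optimizer settings are illustrative rather than the exact values used in our experiments.

```python
import torch
import torch.nn.functional as F

def iou_score(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    return inter / union.clamp(min=1)

def fit_pose(init_codes, target_sils, render_silhouette, lam=1e-3, lbfgs_steps=30):
    """Eqs. (16)-(17): optimize the latent code z against multi-view target
    silhouettes from several initializations and keep the solution with the
    best summed IoU. render_silhouette(z, c) returns the predicted silhouette
    probabilities for camera c (assumed interface)."""
    best_z, best_iou = None, -1.0
    for z0 in init_codes:
        z = z0.clone().requires_grad_(True)
        opt = torch.optim.LBFGS([z], max_iter=lbfgs_steps)

        def closure():
            opt.zero_grad()
            loss = sum(F.binary_cross_entropy(render_silhouette(z, c), s)
                       for c, s in enumerate(target_sils))
            loss = loss + lam * z.norm()        # Eq. (16)
            loss.backward()
            return loss

        opt.step(closure)
        with torch.no_grad():
            iou = sum(iou_score(render_silhouette(z, c) > 0.5, s > 0.5)
                      for c, s in enumerate(target_sils))
        if iou > best_iou:                      # Eq. (17): select by IoU
            best_z, best_iou = z.detach(), iou
    return best_z
```

In a tracking scenario, \(\hat{\textbf{z}}_{t+1}\) would simply be initialized with \(\hat{\textbf{z}}_{t}\) and refined with a few gradient steps instead of running the full multi-start procedure.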

Table 1. Quantitative results of the rendering on the test sets. Comparison of the color PSNR [dB] between the reconstructed and ground-truth RGB images, and of the mean absolute error (MAE) [mm] of the reconstructed depth, for LFN* and NePu. The local conditioning significantly improves reconstruction accuracy. See text for a discussion of the results.

4 Experiments

In this section, we evaluate different aspects of our method. To evaluate pose estimation on previously unseen data, we compare our method to the state-of-the-art multi-view keypoint detector [17], which we train for each individual subject using the same dataset as for NePu.

The fundamental claims of our methodology are supported by rendering novel views and poses, quantitatively evaluating color and depth estimates, and comparing against a version of our framework that uses the LFN [50] formulation for neural rendering. For visual evidence that our method produces temporally consistent views, and for additional qualitative results, we refer to the videos and experiments in our supplementary material. In addition, we compare NePu to AniNeRF [42] on Human3.6M [16].

Fig. 3.
figure 3

Novel view synthesis for novel poses. Our method is capable of generating realistic renderings of the captured subject. In this figure, we show novel poses from new perspectives. See text for a discussion of the results.

4.1 Datasets

Our method can be trained for any kind of shape data that can be described by keypoints or skeletons. Connectivity between the keypoints neither needs to be known, nor do we make use of it in any form. We perform a thorough evaluation of our method on multiple datasets, for two types of shapes where a keypoint description often plays a major role: humans and animals.

For the human data we use the SMPL-X Blender add-on [41]. Here, we evaluate both on individual poses and on captured motion sequences. The poses were obtained from the AGORA dataset [40] and the animations from the AMASS dataset, originally captured by ACCAD and BMLmovi [1, 22, 30]. We describe the human pose with 33 keypoints; see the supplementary material for details.

For the animal data we use hand-crafted, animated 3D models from different sources. To capture a variety of animal shapes, we render datasets with a cow, a giraffe, and a pigeon. The giraffe in particular is a challenging animal for template-mesh-based methods, as its neck is much longer than that of most other quadrupeds. For each animal we created animations that include idle movement (e.g. looking around), walking, trotting, running, cleaning, and eating, and render between 910 and 1900 timesteps. We use between 19 and 26 keypoints to describe the poses.

The keypoints for all shapes were created by tracking individual vertices. Since keypoints in the interior of the shape are typically preferred, we average the positions of two vertices on opposing sides of the body part. All datasets were rendered using Blender (www.blender.org) and include multi-view data from 24 cameras placed on three rings at different heights around the object. For each view and time step we generate ground-truth RGB-D data as well as silhouettes of the main object. The camera parameters, as well as the 3D and 2D keypoints for each view and timestep, are also included.

4.2 Implementation Details

We implement all our models in PyTorch [39] and train using the AdamW optimizer [27] with a weight decay of 0.005 and a batch size of 64 for a total of 2000 epochs. We set the initial learning rate to \(5\times 10^{-4}\), which is decayed by a factor of 0.2 every 500 epochs. We weight the training loss in Eq. (12) with \(\lambda _{\text {pos}} = 2\), \(\lambda _{\text {col}}=\lambda _{\text {dep}}=1\), \(\lambda _{\text {sil}}=3\) and \(\lambda _{\text {reg}}= 1/16\). Other hyperparameters and architectural details are presented in the supplementary.
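In PyTorch, this training setup corresponds to roughly the following; the use of `StepLR` to realize the decay schedule is an assumption of this sketch.

```python
import torch

def make_optimizer(model):
    """Training setup of this section: AdamW with weight decay 0.005 and an
    initial learning rate of 5e-4, decayed by a factor of 0.2 every 500 epochs
    (sketched here with a StepLR schedule)."""
    opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.005)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=500, gamma=0.2)
    return opt, sched
```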

Instead of rendering complete images during training to compute \(\mathcal {L}_{\text {col}}\), \(\mathcal {L}_{\text {dep}}\) and \(\mathcal {L}_{\text {sil}}\), we only render a randomly sampled subset of pixels. For both color and depth we sample uniformly from all pixels inside the ground-truth mask; hence the color and depth rendering is unconstrained outside of the silhouette. To compute \(\mathcal {L}_{\text {sil}}\) we sample areas near the boundary of the silhouette more densely, similar to [8].
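A possible implementation of this sampling strategy is sketched below. The boundary band extracted via max-pool dilation and erosion, as well as the sample counts and mixing ratio, are illustrative choices rather than the exact scheme of [8].

```python
import torch
import torch.nn.functional as F

def sample_pixels(mask, n_rgb=1024, n_sil=1024, boundary_frac=0.7):
    """Sample training pixel locations for one view.
    mask: (H, W) boolean ground-truth silhouette (assumed non-empty).
    Color/depth pixels are drawn uniformly inside the mask; silhouette pixels
    mix a boundary band with uniform samples over the image."""
    H, W = mask.shape
    inside = mask.nonzero()                                    # (N, 2) (row, col)
    rgb_idx = inside[torch.randint(len(inside), (n_rgb,))]

    # Boundary band = dilated mask minus eroded mask (5x5 max-pooling).
    m = mask.float()[None, None]
    dil = F.max_pool2d(m, 5, stride=1, padding=2)
    ero = 1 - F.max_pool2d(1 - m, 5, stride=1, padding=2)
    band = ((dil - ero) > 0)[0, 0].nonzero()

    n_band = int(boundary_frac * n_sil)
    band_idx = band[torch.randint(len(band), (n_band,))]
    rand_idx = torch.stack([torch.randint(H, (n_sil - n_band,)),
                            torch.randint(W, (n_sil - n_band,))], dim=-1)
    sil_idx = torch.cat([band_idx, rand_idx], dim=0)
    return rgb_idx, sil_idx
```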

Fig. 4.
figure 4

3D point cloud reconstruction. The 2D depth estimations from our method can be used to perform 3D reconstruction of the captured shape. The raw 3D point clouds generated by projecting the estimates from all cameras into a common reference frame already yield a good representation. The outliers originate from depth estimates at occlusion boundaries, as the \(L_2\) loss encourages smoothed depth maps; they could easily be removed by filtering the point cloud.

4.3 Baselines

We use different baselines: LToHP [17] for our pose estimation and tracking approach and [50] for our keypoint-based locally conditioned rendering results. In addition we compare NePu to AniNeRF [42] on Human3.6M [16].

LToHP. LToHP [17] presents two solutions for multi-view 3D human pose estimation: an algebraic one and a volumetric one. For details see our supplementary material and [17]. We use their implementation from [18] with the provided configuration file for hyperparameters. To obtain the quantitative results in Table 2, we individually fine-tune [17] on every animal, using the same data that we trained NePu on. In animal pose estimation it is common practice to fine-tune a pretrained network, as in state-of-the-art animal pose estimators like DeepLabCut [34], due to the lack of animal pose data. For all models, including ours, we select the epoch with the minimum validation error for testing.

LFNs. For the comparison to [50], we integrate their rendering formulation into our framework, resulting in two differences. First, their rendering operates on Plücker coordinates instead of pixel coordinates. Second, and more importantly, this baseline does not use the local features \(\textbf{f}\) for conditioning, but global conditioning via concatenation of \(\textbf{z}\). In the following, we denote this model by \(\text {LFN}^*\).

AniNeRF. We trained NePu using the same training regime as AniNeRF [42]: we trained on a single subject, using every 5th frame of the first 1300 frames of the sequence “Posing” of subject “S9” for training and every 5th frame of the following 665 frames for testing. Like AniNeRF, we also only used cameras 0–2 for training and camera 3 for testing.

Table 2. Quantitative results for 3D keypoint estimation on the test sets. Comparison to LToHP [17]. Values are given as MPJPE [mm] and its median over all samples [mm]. Delta shows the difference to [17]. alg., vol. and \(\text {vol.}^\text {gt}\) indicate that [17] is trained with its algebraic model, its volumetric model with the root joint taken from the algebraic results, and its volumetric model with the root joint taken from ground truth, respectively.
Fig. 5.
figure 5

Results for pose estimation. We show projected keypoints of the ground truth, NePu and LToHP [17] in the first, second and third row, respectively, for a human, a giraffe, a pigeon and a cow. Color coded dots indicate the keypoint location and its 3D error to the ground truth. Red dots indicate a high, green dots a low 3D error. The connections between keypoints are just for visualization and not used by the networks. (Color figure online)

4.4 3D Keypoint and Pose Estimation

Quantitative results for 3D keypoint estimation are shown in Table 2. We report the MPJPE [mm] and its median [mm] over all test samples. For LToHP [17] we evaluate both the algebraic and the volumetric model and report the better result. In addition, we report the average MPJPE and median over all objects. We achieve a better average MPJPE and median (32 mm and 17 mm, respectively) over all objects than LToHP (88 mm and 73 mm, respectively). Note, however, that [17] achieves better results for humans. We hypothesize two reasons for this. First, the human-specific pre-training of LToHP transfers well to the human data we evaluate on. Second, the extremities of the human body (especially arms and hands) vanish more often in silhouettes than those of the animals we worked with. Example qualitative results can be found in Fig. 5.

4.5 Keypoint-Based Neural Rendering

Quantitative results for the keypoint-based neural rendering part are shown in Table 1. For color comparison we report the PSNR [dB] over all test samples, while for depth comparison we report the MAE [mm]. We achieve a better average PSNR and MAE (23.98 dB and 20.3 mm respectively) over all objects than LFN* (18.16 dB and 55.1 mm respectively).

Qualitative results for color rendering and depth reconstruction can be found in Fig. 3 and Sect. 3.2 of our supplementary material, respectively. Comparing our method to the implementation without local conditioning shows the importance of local conditioning: the renderings are much more detailed, with more precise boundaries. Figure 4 shows projections of multiple depth maps from different viewing directions as 3D point clouds. The individual views align nicely, yielding good input for further processing and analysis. The results of the novel view and pose comparison with AniNeRF [42] on Human3.6M [16] are shown in Fig. 6. Even though our rendering formulation is fundamentally different and much less mature, our results look promising, but cannot match the quality of AniNeRF. While AniNeRF leverages the SMPL parameters to reduce the problem to computing blend weight fields, our method has to solve a more complex problem. In principle, our rendering could also be formulated in canonical space and leverage SMPL to model deformations. In addition, part segmentation maps, as well as the relation between view direction and body orientation, could further help to reduce ambiguities in the 2D rendering.

Compared to A-NeRF [52] and AniNeRF [42], the fundamental differences in the rendering pipeline result in a significant speed increase. They render at 0.25–1 fps and 1 fps at \(512^2\) px, respectively, while we render at 50 fps at \(256^2\) px. In contrast to both methods, we do not make use of optimization techniques that constrain the rendering to a bounding box of the 3D keypoints; employing such techniques would result in a total speed-up of 50–200\(\times \) at \(512^2\) px.

Fig. 6.
figure 6

Novel view and pose comparison with AniNeRF. Novel view (a–c) and novel view + novel pose (d–f) rendering on Human3.6M dataset.

5 Limitations and Future Work

In the future we plan to extend our model to account for instance-specific shape and color variations by incorporating the respective information in additional latent spaces. Moreover, our 3D keypoint encoder architecture (cf. Sect. 3.1) could, for example, be further improved for humans in a fashion similar to [52]. Such an approach would be intrinsically rotation-equivariant and better capture the piece-wise rigidity of skeletons. Finally, while 2D rendering is much faster, it also means that we have no guaranteed consistency between the generated views of a scene, in the sense that they are not necessarily exact renderings of the same 3D shape. This is an inherent limitation that cannot easily be circumvented, but our quantitative results indicate that it does not noticeably impact rendering quality.

6 Conclusions

In this paper we present a neural rendering framework called Neural Puppeteer that projects keypoints and features into a 2D view for local conditioning. We demonstrate that NePu can detect 3D keypoints with an inverse rendering approach that takes only 2D silhouettes as input. In contrast to common 3D keypoint estimators, this is by design robust with respect to changes in texture, lighting or domain shifts (e.g. synthetic vs. real-world data), provided that silhouettes can be detected. Due to our single-evaluation neural rendering, inverse rendering for downstream tasks becomes feasible. For animal shapes, we outperform a state-of-the-art 3D multi-view keypoint estimator [17] in terms of MPJPE, despite relying only on silhouettes.

In addition, we render color, depth and occupancy simultaneously at \(20\,\text {ms}\) per \(256^2\) image, significantly faster than NeRF-like approaches, which typically achieve less than 1 fps. The proposed keypoint-based local conditioning significantly improves neural rendering of articulated objects quantitatively and qualitatively, compared to a globally conditioned baseline.