1 Introduction

Reconstruction and animation of clothed human avatars is a rising topic in computer vision research. It is of particular interest for various applications in AR/VR and the future metaverse. A range of sensors can be used to create clothed human avatars, from 4D scanners and depth sensors to simple RGB cameras. Among these data sources, RGB videos are by far the most accessible and user-friendly choice. However, they also provide the least supervision, making this setup the most challenging for the reconstruction and animation of clothed humans.

Fig. 1.

Detailed Geometry and Generalization to Extreme Poses. Given sparse multi-view videos with SMPL fittings and foreground masks, our approach synthesizes animatable clothed avatars with realistic pose-dependent geometry and appearance. While existing works, e.g. Neural Body [56] and Ani-NeRF [54], struggle with generalizing to unseen poses, our approach enables avatars that can be animated in extreme out-of-distribution poses.

Traditional works in clothed human modeling use explicit meshes [1, 2, 6, 7, 17, 18, 29, 33, 52, 63, 68, 78, 83] or truncated signed distance fields (TSDFs) of fixed grid resolution [34, 35, 66, 76, 81] to represent human geometry. Textures are often represented by vertex colors or UV-maps. With the recent success of neural implicit representations, significant progress has been made towards modeling articulated clothed humans. PIFu [60] and PIFuHD [61] are among the first works to model clothed humans as continuous neural implicit functions. ARCH [24] extends this idea and builds animatable clothed human avatars from monocular images. However, this line of work does not handle dynamic pose-dependent cloth deformations. Further, these methods require ground-truth geometry for training. Such ground-truth data is expensive to acquire, limiting their generalization.

Another line of work removes the need for ground-truth geometry by utilizing differentiable neural rendering. These methods aim to reconstruct humans from a sparse set of multi-view videos with only image supervision. Many of them use NeRF [46] as the underlying representation and achieve impressive visual fidelity on novel view synthesis tasks. However, these approaches have two fundamental drawbacks: (1) The NeRF-based representation lacks proper geometric regularization, leading to inaccurate geometry. This is particularly detrimental in a sparse multi-view setup and often results in artifacts in the form of erroneous color blobs under novel views or poses. (2) Existing approaches condition their NeRF networks [56] or canonicalization networks [54] on inputs in observation space. Thus, they cannot generalize to unseen out-of-distribution poses.

In this work, we address these two major drawbacks of existing approaches. (1) We represent clothed human bodies with an articulated signed distance field (SDF), which better captures their geometry and improves rendering quality. (2) In order to render the SDF, we develop an efficient joint root-finding algorithm for the conversion from observation space to canonical space. Specifically, we represent clothed human avatars as a combination of a forward linear blend skinning (LBS) network, an implicit SDF network, and a color network, all defined in canonical space and none conditioned on inputs in observation space. Given these networks and camera rays in observation space, our novel joint root-finding algorithm efficiently finds the iso-surface points in observation space and their correspondences in canonical space. This enables us to perform efficient sampling on camera rays around the iso-surface. All network modules can be trained with a photometric loss in image space and regularization losses in canonical space.

We validate our approach on the ZJU-MoCap [56] and H36M [25] datasets. Our approach generalizes well to unseen poses, enabling robust animation of clothed avatars even under out-of-distribution poses where existing works fail, as shown in Fig. 1. We achieve significant improvements over the state of the art for novel pose synthesis and geometry reconstruction, while also outperforming state-of-the-art methods in novel view synthesis on training poses. Code and data are available at https://neuralbodies.github.io/arah/.

2 Related Works

Clothed Human Modeling with Explicit Representations: Many explicit mesh-based approaches represent cloth deformations as deformation layers [1, 2, 6,7,8] added to minimally clothed parametric human body models [5, 20, 27, 37, 50, 53, 75]. Such approaches enjoy compatibility with parametric human body models but have difficulties in modeling large garment deformations. Other mesh-based approaches model garments as separate meshes [17, 18, 29, 33, 52, 63, 68, 78, 83] in order to represent more detailed and physically plausible cloth deformations. However, such methods often require accurate 3D-surface registration, synthetic 3D data or dense multi-view images for training and the garment meshes need to be pre-defined for each cloth type. More recently, point-cloud-based explicit methods [38, 39, 82] also showed promising results in modeling clothed humans. However, they still require explicit 3D or depth supervision for training, while our goal is to train using sparse multi-view RGB supervision alone.

Clothed Humans as Implicit Functions: Neural implicit functions [12, 41, 42, 51, 57] have been used to model clothed humans from various sensor inputs, including monocular images [21, 22, 24, 31, 59,60,61, 65, 73, 86], multi-view videos [28, 36, 48, 54, 56, 74], sparse point clouds [6, 13, 15, 70, 71, 87], or 3D meshes [10, 11, 14, 44, 45, 62, 67]. Among the image-based methods, [4, 22, 24] obtain animatable reconstructions of clothed humans from a single image. However, they do not model pose-dependent cloth deformations and require ground-truth geometry for training. [28] learns generalizable NeRF models for human performance capture and only requires multi-view images as supervision, but it needs images as input to synthesize novel poses. [36, 48, 54, 56, 74] take multi-view videos as inputs and do not need ground-truth geometry during training. These methods generate personalized per-subject avatars and only need 2D supervision. Our approach follows this line of work and also learns a personalized avatar for each subject.

Neural Rendering of Animatable Clothed Humans: Differentiable neural rendering has been extended to model animatable human bodies by a number of recent works [48, 54, 56, 58, 65, 74]. Neural Body [56] proposes to diffuse latent per-vertex codes associated with SMPL meshes in observation space and to condition NeRF [46] on these latent codes. However, since the conditional inputs of Neural Body live in observation space, it does not generalize well to out-of-distribution poses. Several recent works [48, 54, 65] propose to model the radiance field in canonical space and use a pre-defined or learned backward mapping to map query points from observation space to this canonical space. A-NeRF [65] uses a deterministic backward mapping defined by piecewise rigid bone transformations. This mapping is very coarse and the model has to use a complicated bone-relative embedding to compensate. Ani-NeRF [54] trains a backward LBS network that does not generalize well to out-of-distribution poses, even when it is fine-tuned with a cycle-consistency loss for each test pose. Further, all aforementioned methods utilize a volumetric radiance representation and hence suffer from noisy geometry [49, 69, 79, 80]. In contrast to these works, we improve geometry by combining an implicit surface representation with volume rendering and improve pose generalization via iterative root-finding. H-NeRF [74] achieves large improvements in geometric reconstruction by co-training SDF and NeRF networks. However, code and models of H-NeRF are not publicly available. Furthermore, H-NeRF’s canonicalization process relies on imGHUM [3] to predict an accurate signed distance in observation space. Therefore, imGHUM needs to be trained on a large corpus of posed human scans, and it is unclear whether the learned signed distance fields generalize to out-of-distribution poses beyond the training set. In contrast, our approach does not need to be trained on any posed scans and can generalize to extreme out-of-distribution poses.

Concurrent Works: Several concurrent works extend NeRF-based articulated models to improve novel view synthesis, geometry reconstruction, or animation quality [9, 23, 26, 30, 43, 55, 64, 72, 77, 85]. [85] proposes to jointly learn forward blending weights, a canonical occupancy network, and a canonical color network using differentiable surface rendering for head avatars. In contrast to human heads, human bodies exhibit much more articulation. Abrupt changes in depth also occur more frequently when rendering human bodies, which is difficult to capture with surface rendering [69]. Furthermore, [85] uses the secant method to find surface points; each secant step requires solving a root-finding problem from scratch. Instead, we use volume rendering of SDFs and formulate the surface-finding task of articulated SDFs as a joint root-finding problem that only needs to be solved once per ray. We remark that [26] proposes to formulate surface-finding and correspondence search as a joint root-finding problem to tackle geometry reconstruction from photometric and mask losses. However, they use pre-defined skinning fields and surface rendering. They also require estimated normals from PIFuHD [61], while our approach achieves detailed geometry reconstructions without such supervision.

Fig. 2.

Overview of Our Pipeline. (a) Given a ray \(( \textbf{c}, \textbf{v} )\) with camera center \(\textbf{c}\) and ray direction \(\textbf{v}\) in observation space, we jointly search for its intersection with the SDF iso-surface and the correspondence of the intersection point via a novel joint root-finding algorithm (Sect. 3.3). We then sample near/far surface points \(\{ \bar{\textbf{x}} \}\). (b) The sampled points are mapped into canonical space as \(\{ \hat{\textbf{x}} \}\) via root-finding. (c) In canonical space, we run an SDF-based volume rendering with canonicalized points \(\{ \hat{\textbf{x}} \}\), local body poses and shape \((\theta , \beta )\), an SDF network feature \(\textbf{z}\), surface normals \(\textbf{n}\), and a per-frame latent code \(\mathcal {Z}\) to predict the corresponding pixel value of the input ray (Sect. 3.4). (d) All network modules, including the forward LBS network \(LBS_{\sigma _{\omega }}\), the canonical SDF network \(f_{\sigma _f}\), and the canonical color network \(f_{\sigma _c}\), are trained end-to-end with a photometric loss in image space and regularization losses in canonical space (Sect. 3.5).

3 Method

Our pipeline is illustrated in Fig. 2. Our model consists of a forward linear blend skinning (LBS) network (Sect. 3.1), a canonical SDF network, and a canonical color network (Sect. 3.2). When rendering a specific pixel of the image in observation space, we first find the intersection of the corresponding camera ray and the observation-space SDF iso-surface. Since we model a canonical SDF and a forward LBS, we propose a novel joint root-finding algorithm that can simultaneously search for the ray-surface intersection and the canonical correspondence of the intersection point (Sect. 3.3). Such a formulation does not condition the networks on inputs in observation space. Consequently, it can generalize to unseen poses. Once the ray-surface intersection is found, we sample near/far surface points on the camera ray and find their canonical correspondences via forward LBS root-finding. The canonicalized points are used for volume rendering to compose the final RGB value at the pixel (Sect. 3.4). The predicted pixel color is then compared to the observation using a photometric loss (Sect. 3.5). The model is trained end-to-end using the photometric loss and regularization losses. The learned networks represent a personalized animatable avatar that can robustly synthesize new geometries and appearances under novel poses (Sect. 4.1).

3.1 Neural Linear Blend Skinning

Traditional parametric human body models [5, 20, 37, 50, 53, 75] often use linear blend skinning (LBS) to deform a template model according to rigid bone transformations and skinning weights. We follow the notation of [71] to describe LBS. Given a set of N points in canonical space, \(\hat{\textbf{X}} = \{ \hat{\textbf{x}}^{(i)} \}_{i=1}^{N}\), LBS takes a set of rigid bone transformations \(\{ \textbf{B}_b \}_{b=1}^{24}\) as inputs, each \(\textbf{B}_b\) being a \(4 \times 4\) rotation-translation matrix. We use 23 local transformations and one global transformation with an underlying SMPL [37] model. For a 3D point \(\hat{\textbf{x}}^{(i)} \in \hat{\textbf{X}}\), a skinning weight vector is defined as \(\textbf{w}^{(i)} \in [ 0, 1 ]^{24}, \text {s.t.} \sum _{b=1}^{24} \textbf{w}_b^{(i)} = 1\). This vector indicates the affinity of the point \(\hat{\textbf{x}}^{(i)}\) to each of the bone transformations \(\{ \textbf{B}_b \}_{b=1}^{24}\). Following recent works [11, 45, 62, 71], we use a neural network \(f_{\sigma _{\omega }} (\cdot ): \mathbb {R}^3 \mapsto [ 0, 1 ]^{24}\) with parameters \(\sigma _{\omega }\) to predict the skinning weights of any point in space. The set of transformed points \(\bar{\textbf{X}} = \{ \bar{\textbf{x}}^{(i)} \}_{i=1}^{N}\) is related to \(\hat{\textbf{X}}\) via:

$$\begin{aligned}&\bar{\textbf{x}}^{(i)} = LBS_{\sigma _{\omega }}\left( \hat{\textbf{x}}^{(i)}, \{ \textbf{B}_b \} \right) , \quad \forall i = 1, \ldots , N \nonumber \\ \Longleftrightarrow&\bar{\textbf{x}}^{(i)} = \left( \sum _{b=1}^{24} f_{\sigma _{\omega }} (\hat{\textbf{x}}^{(i)})_{b} \textbf{B}_b\right) \hat{\textbf{x}}^{(i)}, \quad \forall i = 1, \ldots , N \end{aligned}$$
(1)

where Eq. (1) is referred to as the forward LBS function. The process of applying Eq. (1) to all points in \(\hat{\textbf{X}}\) is often referred to as forward skinning. For brevity, for the remainder of the paper, we drop \(\{ \textbf{B}_b \}\) from the LBS function and write \(LBS_{\sigma _{\omega }} (\hat{\textbf{x}}^{(i)}, \{ \textbf{B}_b \})\) as \(LBS_{\sigma _{\omega }} (\hat{\textbf{x}}^{(i)})\).
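
As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of forward skinning with a learned skinning-weight MLP. The architecture and the names `SkinningNet` and `forward_lbs` are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Eq. (1): forward linear blend skinning with a learned weight field.
# Assumed/hypothetical: SkinningNet architecture, forward_lbs helper.
import torch
import torch.nn as nn

class SkinningNet(nn.Module):
    """MLP f_{sigma_w}: R^3 -> [0, 1]^24 (softmax-normalized skinning weights)."""
    def __init__(self, hidden=128, n_bones=24):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, n_bones),
        )

    def forward(self, x_hat):                              # x_hat: (N, 3) canonical points
        return torch.softmax(self.mlp(x_hat), dim=-1)      # (N, 24), rows sum to 1

def forward_lbs(x_hat, bone_transforms, skinning_net):
    """Eq. (1): x_bar = (sum_b w_b(x_hat) * B_b) applied to x_hat (homogeneous coords).

    x_hat:            (N, 3) canonical points
    bone_transforms:  (24, 4, 4) rigid transforms B_b
    """
    w = skinning_net(x_hat)                                 # (N, 24)
    T = torch.einsum('nb,bij->nij', w, bone_transforms)     # (N, 4, 4) blended transforms
    x_hom = torch.cat([x_hat, torch.ones_like(x_hat[:, :1])], dim=-1)  # (N, 4)
    return torch.einsum('nij,nj->ni', T, x_hom)[:, :3]

# Usage: deform 100 canonical points with identity bone transforms.
net = SkinningNet()
B = torch.eye(4).expand(24, 4, 4).contiguous()
x_bar = forward_lbs(torch.randn(100, 3), B, net)            # (100, 3)
```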

3.2 Canonical SDF and Color Networks

We model an articulated human as a neural SDF \(f_{\sigma _f} (\hat{\textbf{x}}, \mathbf {\theta }, \mathbf {\beta }, \mathcal {Z})\) with parameters \(\sigma _f\) in canonical space, where \(\hat{\textbf{x}}\) denotes the canonical query point, \(\mathbf {\theta }\) and \(\mathbf {\beta }\) denote local poses and body shape of the human which capture pose-dependent cloth deformations, and \(\mathcal {Z}\) denotes a per-frame optimizable latent code which compensates for time-dependent dynamic cloth deformations. For brevity, we write this neural SDF as \(f_{\sigma _f} (\hat{\textbf{x}})\) in the remainder of the paper.

Similar to the canonical SDF network, we define a canonical color network with parameters \(\sigma _c\) as \(f_{\sigma _c} (\hat{\textbf{x}}, \textbf{n}, \textbf{v}, \textbf{z}, \mathcal {Z}): \mathbb {R}^{9+|\textbf{z}|+|\mathcal {Z}|} \mapsto \mathbb {R}^3\). Here, \(\textbf{n}\) denotes a normal vector in the observation space. \(\textbf{n}\) is computed by transforming the canonical normal vectors using the rotational part of forward transformations \(\sum _{b=1}^{24} f_{\sigma _{\omega }} (\hat{\textbf{x}}^{(i)})_{b} \textbf{B}_b\) (Eq. (1)). \(\textbf{v}\) denotes viewing direction. Similar to [69, 79, 80], \(\textbf{z}\) denotes an SDF feature which is extracted from the output of the second-last layer of the neural SDF. \(\mathcal {Z}\) denotes a per-frame latent code which is shared with the SDF network. It compensates for time-dependent dynamic lighting effects. The outputs of \(f_{\sigma _c}\) are RGB color values in the range [0, 1].
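
The sketch below illustrates one possible realization of the two canonical networks described above. Layer widths, activations, and input dimensions (e.g. 69-dimensional local poses) are assumptions; the paper defers its exact architectures to the Supp. Mat.

```python
# Sketch of the canonical SDF and color networks (Sect. 3.2); all sizes are assumptions.
import torch
import torch.nn as nn

class CanonicalSDF(nn.Module):
    """f_{sigma_f}(x_hat, theta, beta, Z) -> (sdf, feature z)."""
    def __init__(self, pose_dim=69, shape_dim=10, latent_dim=128, feat_dim=256):
        super().__init__()
        in_dim = 3 + pose_dim + shape_dim + latent_dim
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 256), nn.Softplus(beta=100),
            nn.Linear(256, feat_dim), nn.Softplus(beta=100),
        )
        self.sdf_head = nn.Linear(feat_dim, 1)

    def forward(self, x_hat, theta, beta, frame_code):
        h = self.backbone(torch.cat([x_hat, theta, beta, frame_code], dim=-1))
        return self.sdf_head(h).squeeze(-1), h              # SDF value and feature z

class CanonicalColor(nn.Module):
    """f_{sigma_c}(x_hat, n, v, z, Z) -> RGB in [0, 1]; input dim 9 + |z| + |Z|."""
    def __init__(self, feat_dim=256, latent_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(9 + feat_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 3), nn.Sigmoid(),
        )

    def forward(self, x_hat, normal, view_dir, z, frame_code):
        return self.mlp(torch.cat([x_hat, normal, view_dir, z, frame_code], dim=-1))
```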

3.3 Joint Root-Finding

While surface rendering [47, 80] could be used to learn the network parameters introduced in Sects. 3.1 and 3.2, it cannot handle abrupt changes in depth, as demonstrated in [69]. We also observe severe geometric artifacts when applying surface rendering to our setup; we refer readers to the Supp. Mat. for such an ablation. On the other hand, volume rendering can better handle abrupt depth changes in articulated human rendering. However, volume rendering requires multi-step dense sampling on camera rays [69, 79], which, when combined naively with the iterative root-finding algorithm [11], requires significantly more memory and becomes prohibitively slow to train and test. We thus employ a hybrid method similar to [49]: we first search for the ray-surface intersection and then sample near/far surface points on the ray. In practice, we initialize our SDF network with [71]; this allows us to fix the sampling depth interval around the surface to \([-5\,\text {cm}, +5\,\text {cm}]\).

A naive way of finding the ray-surface intersection is to use sphere tracing [19] and map each point to canonical space via root-finding [11]. In this case, we need to solve the costly root-finding problem during each step of the sphere tracing, which becomes prohibitively expensive when the number of rays is large. Thus, we propose an alternative solution. We take the skinning weights of the nearest neighbor of the query point \(\bar{\textbf{x}}\) on the registered SMPL mesh and use the inverse of the linearly combined forward bone transforms to map \(\bar{\textbf{x}}\) to a rough canonical correspondence. Combining this approximate backward mapping with sphere tracing, we obtain rough estimates of the intersection points. Then, starting from these rough estimates, we apply a novel joint root-finding algorithm to search for the precise intersection points and their correspondences in canonical space. In practice, we found that using a single initialization for our joint root-finding already works well. Adding more initializations incurs drastic memory and runtime overhead without yielding noticeable improvements. We hypothesize that this is because our initialization is obtained using inverse transformations with SMPL skinning weights rather than rigid bone transformations (as was done in [11]).
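
The following sketch illustrates the approximate backward mapping used to initialize joint root-finding: each observation-space query borrows the skinning weights of its nearest registered SMPL vertex and is mapped back by the inverse of the blended bone transform. Function and variable names are hypothetical.

```python
# Sketch of the approximate backward mapping used as root-finding initialization.
# Assumed/hypothetical: function name and tensor layout.
import torch

def approx_canonical_init(x_bar, smpl_verts_posed, smpl_weights, bone_transforms):
    """
    x_bar:            (N, 3) query points in observation space
    smpl_verts_posed: (V, 3) posed SMPL vertices
    smpl_weights:     (V, 24) SMPL skinning weights
    bone_transforms:  (24, 4, 4) forward bone transforms B_b
    returns:          (N, 3) rough canonical correspondences
    """
    # Nearest posed SMPL vertex for each query point.
    d = torch.cdist(x_bar, smpl_verts_posed)               # (N, V)
    nn_idx = d.argmin(dim=-1)                              # (N,)
    w = smpl_weights[nn_idx]                               # (N, 24) borrowed weights

    # Blend forward transforms with the borrowed weights, then invert.
    T = torch.einsum('nb,bij->nij', w, bone_transforms)    # (N, 4, 4)
    x_hom = torch.cat([x_bar, torch.ones_like(x_bar[:, :1])], dim=-1)
    return torch.einsum('nij,nj->ni', torch.linalg.inv(T), x_hom)[:, :3]
```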

Formally, we define a camera ray as \(\textbf{r} = (\textbf{c}, \textbf{v})\), where \(\textbf{c}\) is the camera center and \(\textbf{v}\) is a unit vector that defines the direction of this camera ray. Any point on the camera ray can be expressed as \(\textbf{c} + \textbf{v} \cdot d\) with \(d \ge 0\). Joint root-finding aims to find a canonical point \(\hat{\textbf{x}}\) and a depth d along the ray in observation space, such that:

$$\begin{aligned} f_{\sigma _f} (\hat{\textbf{x}})&= 0 \nonumber \\ LBS_{\sigma _{\omega }}(\hat{\textbf{x}}) - (\textbf{c} + \textbf{v} \cdot d)&= \textbf{0} \end{aligned}$$
(2)

where \(\textbf{c}, \textbf{v}\) are constants per ray. Denoting the joint vector-valued function by \(g_{\sigma _f, \sigma _{\omega }} (\hat{\textbf{x}}, d)\), we write the joint root-finding problem as:

$$\begin{aligned} g_{\sigma _f, \sigma _{\omega }} (\hat{\textbf{x}}, d) = \begin{bmatrix} f_{\sigma _f} (\hat{\textbf{x}}) \\ LBS_{\sigma _{\omega }}(\hat{\textbf{x}}) - (\textbf{c} + \textbf{v} \cdot d) \end{bmatrix} = \textbf{0} \end{aligned}$$
(3)

We solve this system with Newton’s method:

$$\begin{aligned} \begin{bmatrix} \hat{\textbf{x}}_{k+1} \\ d_{k+1} \end{bmatrix} = \begin{bmatrix} \hat{\textbf{x}}_{k} \\ d_{k} \end{bmatrix} - \textbf{J}^{-1}_{k} \cdot g_{\sigma _f, \sigma _{\omega }} (\hat{\textbf{x}}_{k}, d_{k}) \end{aligned}$$
(4)

where:

$$\begin{aligned} \textbf{J}_{k} = \begin{bmatrix} \frac{\partial f_{\sigma _f}}{\partial \hat{\textbf{x}}} (\hat{\textbf{x}}_{k}) & 0 \\[6pt] \frac{\partial LBS_{\sigma _{\omega }}}{\partial \hat{\textbf{x}}} (\hat{\textbf{x}}_{k}) & -\textbf{v} \end{bmatrix} \end{aligned}$$
(5)

Following [11], we use Broyden’s method to avoid computing \(\textbf{J}_{k}\) at each iteration.
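
For illustration, the sketch below implements Eqs. (3)-(5) for a single ray with plain Newton iterations, recomputing the Jacobian via autograd at every step; as noted above, the actual method uses Broyden updates to avoid this cost. The callables `sdf` and `lbs` are assumed to map a canonical point to a scalar SDF value and an observation-space point, respectively.

```python
# Newton-iteration sketch of the joint root-finding problem (Eqs. (3)-(5)) for one ray.
# Assumed/hypothetical: the callables sdf(x)->scalar and lbs(x)->(3,), names, step counts.
import torch

def joint_root_find(sdf, lbs, x_init, d_init, c, v, n_steps=10, eps=1e-6):
    x, d = x_init.clone(), d_init.clone()
    for _ in range(n_steps):
        g = torch.cat([sdf(x).reshape(1), lbs(x) - (c + v * d)])    # Eq. (3), shape (4,)
        if g.norm() < eps:
            break
        # Jacobian blocks of Eq. (5); top-right entry df/dd is zero.
        df_dx = torch.autograd.functional.jacobian(sdf, x).reshape(1, 3)
        dlbs_dx = torch.autograd.functional.jacobian(lbs, x).reshape(3, 3)
        J = torch.zeros(4, 4)
        J[0, :3] = df_dx
        J[1:, :3] = dlbs_dx
        J[1:, 3] = -v
        delta = torch.linalg.solve(J, g)                             # Newton step, Eq. (4)
        x, d = x - delta[:3], d - delta[3]
    return x, d
```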

Amortized Complexity: Let N denote the number of sphere-tracing steps and M the number of root-finding steps. The amortized complexity of joint root-finding is O(M), whereas naively alternating between sphere tracing and root-finding costs O(MN). In practice, this results in a roughly \(5 \times \) speed-up of joint root-finding over the naive alternation. We also note that, from a theoretical perspective, our proposed joint root-finding converges quadratically, while the secant-method-based root-finding in the concurrent work [85] converges only superlinearly.

We describe how to compute implicit gradients wrt. the canonical SDF and the forward LBS in the Supp. Mat. In the main paper, we use volume rendering which does not need to compute implicit gradients wrt. the canonical SDF.

3.4 Differentiable Volume Rendering

We employ a recently proposed SDF-based volume rendering formulation [79]. Specifically, we convert SDF values into density values \(\sigma \) using the scaled CDF of the Laplace distribution with the negated SDF values as input

$$\begin{aligned} \sigma (\hat{\textbf{x}}) = \frac{1}{b} \left( \frac{1}{2} + \frac{1}{2} \, \text {sign}\big (-f_{\sigma _f} (\hat{\textbf{x}})\big ) \left( 1 - \exp \left( -\frac{| f_{\sigma _f} (\hat{\textbf{x}}) |}{b} \right) \right) \right) \end{aligned}$$
(6)

where b is a learnable parameter. Given the surface point found by solving Eq. (3), we sample 16 points around the surface point and another 16 points between the near scene bound and the surface point, and map them to canonical space along with the surface point. For rays that do not intersect any surface, we uniformly sample 64 points for volume rendering. With N sampled points on a ray \(\textbf{r} = (\textbf{c}, \textbf{v})\), we use standard volume rendering [46] to render the pixel color

$$\begin{aligned} \hat{C}(\textbf{r})&= \sum _{i=1}^{N} T^{(i)} \left( 1 - \exp (-\sigma (\hat{\textbf{x}}^{(i)})\delta ^{(i)})\right) f_{\sigma _c} (\hat{\textbf{x}}^{(i)}, \textbf{n}^{(i)}, \textbf{v}, \textbf{z}, \mathcal {Z}) \end{aligned}$$
(7)
$$\begin{aligned} T^{(i)}&= \exp \left( -\sum _{j < i} \sigma (\hat{\textbf{x}}^{(j)}) \delta ^{(j)} \right) \end{aligned}$$
(8)

where \(\delta ^{(i)} = |d^{(i+1)} - d^{(i)}|\).
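
A minimal sketch of Eqs. (6)-(8) is given below: SDF values are converted to densities with the scaled Laplace CDF and then alpha-composited along the ray. Tensor shapes and variable names are assumptions for illustration.

```python
# Sketch of SDF-based volume rendering along one ray (Eqs. (6)-(8)).
# Assumed/hypothetical: tensor layout, function names, depth convention.
import torch

def sdf_to_density(sdf_vals, b):
    """Eq. (6): sigma(x) = (1/b) * Laplace-CDF(-sdf)."""
    s = -sdf_vals
    return (0.5 + 0.5 * torch.sign(s) * (1.0 - torch.exp(-s.abs() / b))) / b

def render_ray(sdf_vals, rgb_vals, depths, b):
    """
    sdf_vals: (N,)   SDF at the N canonicalized samples of one ray
    rgb_vals: (N, 3) color-network outputs at those samples
    depths:   (N+1,) sample depths along the ray (last entry bounds delta^(N))
    """
    sigma = sdf_to_density(sdf_vals, b)                      # (N,)
    delta = (depths[1:] - depths[:-1]).abs()                 # (N,) segment lengths
    alpha = 1.0 - torch.exp(-sigma * delta)                  # per-segment opacity
    # Transmittance T^(i) = exp(-sum_{j<i} sigma_j * delta_j), Eq. (8).
    trans = torch.exp(-torch.cumsum(
        torch.cat([torch.zeros(1), sigma * delta])[:-1], dim=0))
    weights = trans * alpha                                  # compositing weights, Eq. (7)
    return (weights.unsqueeze(-1) * rgb_vals).sum(dim=0)     # rendered RGB
```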

3.5 Loss Function

Our loss consists of a photometric loss in observation space and multiple regularizers in canonical space

$$\begin{aligned} \mathcal {L} = \lambda _{C} \cdot \mathcal {L}_{C} + \lambda _{E} \cdot \mathcal {L}_{E} + \lambda _{O} \cdot \mathcal {L}_{O} + \lambda _{I} \cdot \mathcal {L}_{I} + \lambda _{S} \cdot \mathcal {L}_{S} \end{aligned}$$
(9)

\(\mathcal {L}_{C}\) is the L1 loss for color predictions. \(\mathcal {L}_{E}\) is the Eikonal regularization [16]. \(\mathcal {L}_{O}\) is an off-surface point loss, encouraging points far away from the SMPL mesh to have positive SDF values. Similarly, \(\mathcal {L}_{I}\) regularizes points inside the canonical SMPL mesh to have negative SDF values. \(\mathcal {L}_{S}\) encourages the forward LBS network to predict similar skinning weights to the canonical SMPL mesh. Different from [26, 74, 80], we do not use an explicit silhouette loss. Instead, we utilize foreground masks and set all background pixel values to zero. In practice, this encourages the SDF network to predict positive SDF values for points on rays that do not intersect with foreground masks. For detailed definitions of loss terms and model architectures, please refer to the Supp. Mat.
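
As a rough illustration of Eq. (9), the sketch below assembles the five terms from precomputed network outputs. The concrete regularizer definitions and loss weights are assumptions on our part; the paper's exact formulations are given in its Supp. Mat.

```python
# Sketch of the total training loss in Eq. (9); regularizer forms are assumptions.
import torch

def total_loss(pred_rgb, gt_rgb, canon_grads, off_sdf, in_sdf,
               pred_skin, smpl_skin, lambdas):
    l_c = (pred_rgb - gt_rgb).abs().mean()                       # L1 photometric loss
    l_e = ((canon_grads.norm(dim=-1) - 1.0) ** 2).mean()         # Eikonal regularizer
    l_o = torch.relu(-off_sdf).mean()    # off-surface points should have SDF > 0
    l_i = torch.relu(in_sdf).mean()      # points inside the SMPL mesh should have SDF < 0
    l_s = (pred_skin - smpl_skin).abs().mean()                   # match SMPL skinning weights
    return (lambdas['C'] * l_c + lambdas['E'] * l_e + lambdas['O'] * l_o
            + lambdas['I'] * l_i + lambdas['S'] * l_s)
```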

4 Experiments

We validate the generalization ability and reconstruction quality of our proposed method against several recent baselines [54, 56, 65]. As in [56], we consider a setup with 4 cameras spaced evenly around the human subject. For an ablation study on different design choices of our model, including the ray sampling strategy, LBS networks, and the number of initializations for root-finding, we refer readers to the Supp. Mat.

Datasets: We use the ZJU-MoCap [56] dataset as our primary testbed because its setup includes 23 cameras, which allows us to extract pseudo-ground-truth geometry to evaluate our model. More specifically, the dataset consists of 9 sequences captured with 23 calibrated cameras. We use the training/testing splits from Neural Body [56] for both the cameras and the poses. As one of our goals is to learn detailed geometry, we collect pseudo-ground-truth geometry for the training poses. We use all 23 cameras and apply NeuS with a background NeRF model [69], a state-of-the-art method for multi-view reconstruction. Note that we refrain from using the masks provided by Neural Body [56] as these masks are noisy and insufficient for accurate static scene reconstruction. We observe that geometry reconstruction with NeuS [69] fails when subjects wear black clothes or the environmental light is not bright enough. Therefore, we manually exclude bad reconstructions and discard sequences with fewer than 3 valid reconstructions. For completeness, we also test our approach on the H36M dataset [25] and report a quantitative comparison to [48, 54] in the Supp. Mat.

Baselines: We compare against three major baselines: Neural Body [56] (NB), Ani-NeRF [54] (AniN), and A-NeRF [65] (AN). Neural Body diffuses per-SMPL-vertex latent codes into observation space as additional conditioning for NeRF models and achieves state-of-the-art novel view synthesis results on training poses. Ani-NeRF learns a canonical NeRF model and a backward LBS network which predicts residuals to the deterministic SMPL-based backward LBS. Consequently, the LBS network needs to be re-trained for each test sequence. A-NeRF employs a deterministic backward mapping with bone-relative embeddings for query points and only uses keypoints and joint rotations instead of a surface model (i.e. the SMPL surface). For the detailed setups of these baselines, please refer to the Supp. Mat.

Benchmark Tasks: We benchmark our approach on three tasks: generalization to unseen poses, geometry reconstruction, and novel-view synthesis. To analyze generalization ability, we evaluate the trained models on unseen testing poses. Due to the stochastic nature of cloth deformations, we quantify performance via perceptual similarity to the ground-truth images with the LPIPS [84] metric. We report PSNR and SSIM in the Supp. Mat. We also encourage readers to check out qualitative comparison videos at https://neuralbodies.github.io/arah/.

For geometry reconstruction, we evaluate our method and the baselines on the training poses. We report point-based L2 Chamfer distance (CD) and normal consistency (NC) wrt. the pseudo-ground-truth geometry. During the evaluation, we only keep the largest connected component of the reconstructed meshes. Note that this is in favor of the baselines, as they are more prone to producing floating blob artifacts. We also remove any ground-truth or predicted mesh points that are below an estimated ground plane to exclude ground-plane outliers from the evaluation. For completeness, we also evaluate novel-view synthesis with PSNR, SSIM, and LPIPS using the poses from the training split.
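
For reference, a simple sketch of the two geometry metrics computed on sampled surface points is given below; the preprocessing described above (largest-component filtering, ground-plane removal) is assumed to have been applied already, and the exact evaluation protocol may differ in its details.

```python
# Sketch of L2 Chamfer distance and normal consistency between two point samplings.
# Assumed/hypothetical: function name, sampling density, symmetric averaging.
import torch

def chamfer_and_normal_consistency(p_pred, n_pred, p_gt, n_gt):
    """p_*: (N, 3)/(M, 3) surface samples, n_*: matching unit normals."""
    d = torch.cdist(p_pred, p_gt)                            # (N, M) pairwise distances
    nn_pg, nn_gp = d.argmin(dim=1), d.argmin(dim=0)          # nearest-neighbor indices
    cd = (d.min(dim=1).values ** 2).mean() + (d.min(dim=0).values ** 2).mean()
    nc = 0.5 * ((n_pred * n_gt[nn_pg]).sum(-1).abs().mean()
                + (n_gt * n_pred[nn_gp]).sum(-1).abs().mean())
    return cd, nc
```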

Fig. 3.

Generalization to Unseen Poses on the testing poses of ZJU-MoCap. A-NeRF struggles with unseen poses due to the limited training poses and the lack of a SMPL surface prior. Ani-NeRF produces noisy images as it uses an inaccurate backward mapping function. Neural Body loses details, e.g. wrinkles, because its conditional NeRF is learned in observation space. Our approach generalizes well to unseen poses and can model fine details like wrinkles.

Fig. 4.

Geometry Reconstruction. Our approach reconstructs more fine-grained geometry than the baselines while preserving high-frequency details such as wrinkles. Note that we remove an estimated ground plane from all meshes.

Table 1. Generalization to Unseen Poses. We report LPIPS [84] on synthesized images under unseen poses from the testset of the ZJU-MoCap dataset [56] (i.e. all views except 0, 6, 12, and 18). Our approach consistently outperforms the baselines by a large margin. We report PSNR and SSIM in the Supp. Mat.
Table 2. Geometry Reconstruction. We report L2 Chamfer Distance (CD) and Normal Consistency (NC) on the training poses of the ZJU-MoCap dataset [56]. Note that AniN and AN occasionally produce large background blobs that are connected to the body resulting in large deviations from the ground truth.
Table 3. Novel View Synthesis. We report PSNR, SSIM, and LPIPS [84] for novel views of training poses of the ZJU-MoCap dataset [56]. Due to better geometry, our approach produces more consistent rendering results across novel views than the baselines. We include qualitative comparisons in the Supp. Mat. Note that we crop slightly larger bounding boxes than Neural Body [56] to better capture loose clothes, e.g. sequence 387 and 390. Therefore, the reported numbers vary slightly from their evaluation.

4.1 Generalization to Unseen Poses

We first analyze the generalization ability of our approach in comparison to the baselines. Given a trained model and a pose from the test set, we render images of the human subject in the given pose. We show qualitative results in Fig. 3 and quantitative results in Table 1. We significantly outperform the baselines both qualitatively and quantitatively. The training poses of the ZJU-MoCap dataset are extremely limited, usually comprising just 60–300 frames of repetitive motion. This limited training data results in severe overfitting for the baselines. In contrast, our method generalizes well to unseen poses, even when training data is limited.

We additionally animate our models trained on the ZJU-MoCap dataset using extreme out-of-distribution poses from the AMASS [40] and AIST++ [32] datasets. As shown in Fig. 5, even under extreme pose variation, our approach produces plausible geometry and rendering results, while all baselines show severe artifacts. We attribute the large improvement on unseen poses to our root-finding-based backward skinning: the learned forward skinning weights are constant per subject, and root-finding is a deterministic optimization process that does not rely on networks conditioned on inputs from the observation space. More comparisons can be found in the Supp. Mat.

4.2 Geometry Reconstruction on Training Poses

Next, we analyze the geometry reconstructed with our approach against reconstructions from the baselines. We compare to the pseudo-ground-truth obtained from NeuS [69]. We show qualitative results in Fig. 4 and quantitative results in Table 2. Our approach consistently outperforms existing NeRF-based human models on geometry reconstruction. As evidenced in Fig. 4, the geometry obtained with our approach is much cleaner compared to NeRF-based baselines, while preserving high-frequency details such as wrinkles.

4.3 Novel View Synthesis on Training Poses

Lastly, we analyze our approach for novel view synthesis on training poses. Table 3 provides a quantitative comparison to the baselines. While not the main focus of this work, our approach also outperforms existing methods on novel view synthesis. This suggests that more faithful modeling of geometry is also beneficial for the visual fidelity of novel views. Particularly when few training views are available, NeRF-based methods produce blob/cloud artifacts. By removing such artifacts, our approach achieves high image fidelity and better consistency across novel views. Due to space limitations, we include further qualitative results on novel view synthesis in the Supp. Mat.

Fig. 5.

Qualitative Results on Out-of-distribution Poses from the AMASS [40] and AIST++ [32] datasets. From top to bottom row: Neural Body, Ani-NeRF, our rendering, and our geometry. Note that Ani-NeRF requires re-training its backward LBS network for each novel pose sequence. We do not show A-NeRF results as it already exhibits severe overfitting on the ZJU-MoCap test poses. For more qualitative comparisons, please refer to the Supp. Mat.

5 Conclusion

We propose a new approach to create animatable avatars from sparse multi-view videos. We substantially improve geometry reconstruction over existing approaches by modeling the geometry as articulated SDFs. Further, our novel joint root-finding algorithm enables generalization to extreme out-of-distribution poses. We discuss limitations of our approach in the Supp. Mat.