1 Introduction

Digitally reconstructing, modeling, and photo-realistically synthesizing a human from a video sequence, such that the subject can be rendered in any pose from any viewpoint, is an important problem that enables applications ranging from character animation for games and movies to immersive virtual conferencing. The problem is extremely challenging because human geometry, appearance, and dynamic motion must be recovered jointly from RGB video alone, especially for monocular videos where multi-view concurrency is unavailable.

Fig. 1. Left: given a monocular video sequence of a human performance and an initial posed 3D human mesh obtained with off-the-shelf tools, our method jointly reconstructs a mesh-guided neural radiance field (NeRF) and refined per-frame human meshes. Right: the trained mesh-guided NeRF is rigged with the 3D mesh model and enables novel pose and view synthesis.

Because of the difficulty of jointly modeling the shape, pose, and appearance of 3D humans from monocular videos, many previous approaches focus on only part of the problem, such as skeleton-based human pose estimation [4, 10] or human shape reconstruction [14, 32] based on parametric 3D models [3, 18]. These methods exploit sophisticated pose and shape priors and can thus partially counteract the geometric ambiguity; however, lacking appearance information, their results may not align perfectly with the input observations in certain frames. Texture extracted from the estimated surface is usually blurry and cannot be used for photo-realistic synthesis (Fig. 1).

Recently proposed volumetric neural rendering methods, i.e., NeRF and its variants [2, 19, 28], have shown great advances in high-quality free-view synthesis of static objects. NeRF models a static object with an implicit radiance field function parameterized by multi-layer perceptron (MLP) networks. Inspired by NeRF, recent works [17, 21, 24, 25] attempt to model 3D humans by conditioning the radiance field on 3D poses or parametric meshes. While promising human reconstruction and view synthesis results have been achieved, these methods focus only on modeling the conditional radiance field itself and require accurate 3D poses or meshes as a prior. This assumption is often too strong for practical capture setups, especially with only a monocular video.

To this end, we propose a novel paradigm for modeling an animatable 3D human representation from a monocular video sequence of a single person. Our goal is to build a reconstruction pipeline with few non-trivial requirements such as accurate 3D human poses and/or geometry. To achieve this goal, we jointly optimize per-frame human mesh reconstruction and a dynamic neural radiance field (NeRF) conditioned on mesh information. Given a monocular video sequence as input observation, the optimization is driven by the re-rendering error of the neural rendering output with respect to both the NeRF and the human poses, which are updated via back-propagation. To constrain the optimization space of the human mesh, we exploit the widely-used parametric human body model [3] and initialize the optimization with poses provided by monocular pose estimation solutions [14, 32]. Our joint optimization strategy connects the previously separate problems of 3D geometry estimation and NeRF-based appearance optimization, and eliminates the requirement of accurate 3D geometry information as a prior, making the modeling pipeline more applicable to monocular video scenarios.

A key property of a good neural human representation is that it generalizes well to unseen human poses after training on limited observations. This is non-trivial: previous NeRF-based works for human modeling [24, 25] suffer from degraded quality to varying degrees when generalizing to unseen poses. Our observation is that the key to better pose generalization lies in how the NeRF query input is embedded. Intrinsically, a dynamic NeRF-based human representation can be regarded as a static NeRF under the rest pose equipped with a 3D volume deformation conditioned on the mesh deformation from the rest pose to an arbitrary target pose. Thus, a good embedding for querying a dynamic NeRF under arbitrary poses should "reverse" the pose deformation in an injective way to find the correct point in the static NeRF. As the "correct" deformation mapping is only available on the surface mesh, the reverse deformation at off-surface regions must be constrained with additional priors. Otherwise, the deformation mapping becomes distorted and collapsed, failing to generalize to unseen poses.

Fig. 2. Our pipeline. A 3D mesh is generated from the SMPL model with the target pose \(\theta \), followed by a mesh-guided NeRF that takes the query embedding of 3D points and renders an image via volume rendering. The query embedding encodes both a surface deformation constraint, using information from the nearest mesh vertices under the rest pose, and a distance-preserving prior, using distances to mesh vertices in a local region under the target pose. During training, the poses are initialized with off-the-shelf tools and jointly refined with the mesh-guided NeRF.

Based on this observation, we propose a new embedding method for querying the mesh-guided dynamic NeRF that encodes the input position through its relationship to a local nearby surface region. Specifically, given a query point and a human mesh corresponding to a target pose, we project the query point onto the mesh and find a set of nearest-neighbor mesh vertices locally; we then construct the input embedding from the distances to these vertices in the target space as well as the normalized positions of these vertices in the canonical space under the rest pose, eliminating pose deformation and view transformation.

Our proposed embedding method guides the volume deformation at off-surface points with the nearby surface deformation (since we provide the inverse-transformed nearest-neighbor vertices on the mesh). It has two key properties that are essential for improving generalization. First, the embedding is based locally on a small connected region of the guiding mesh. This locality is crucial because it prevents the network from inadvertently relating the output to irrelevant articulated parts, which is known to hurt generalization to poses unseen during training [21, 34]. Second, since we provide the distances to all nearest-neighboring vertices in the target space, the embedding encourages a locally distance-preserving prior that keeps the deformation from collapsing.

Our method requires only a monocular video of a single person captured with a fixed camera, and does not rely on dedicated capture devices or accurate human pose information. Extensive experimental results demonstrate the superiority of our model on a variety of data exhibiting diverse human shapes and poses. To summarize, our contributions are as follows:

  • We propose a novel paradigm for building a neural human representation that can be rendered in unseen poses and views with monocular video inputs.

  • We propose a novel input embedding representation for querying mesh-guided NeRF which improves the generalization ability on novel poses.

  • We develop a pipeline for joint optimization of 3D human meshes and mesh-guided dynamic NeRF supervised by the reconstruction loss only.

2 Related Works

Human Reconstruction. The digital reconstruction of humans is a long-standing problem in computer vision and computer graphics. Traditional methods usually achieve high quality with complicated capture setups such as multi-view capture studios [8, 35, 36] or RGB-D camera arrays [30, 33]. To reduce capture effort, recent methods leverage deep neural networks to directly reconstruct 3D humans from as little as a single image [7, 14, 20, 27]. These methods often estimate the coefficients of parametric models of 3D human shape and pose [18]. Such a parametric model is typically constructed from a large database of scanned shapes of different humans in a variety of poses and then rigged with a pre-defined skeleton to animate the human mesh.

Neural 3D Representations. Recently, neural representations of 3D scenes have attracted considerable attention in the literature [2, 5, 6, 19, 22, 28]. These methods exploit a neural network (usually multi-layer perceptrons) to represent implicit fields such as signed distance functions for surfaces or volumetric radiance fields, thus inherently encoding 3D information in a view-consistent manner. Among these neural representations, NeRF [19] (and its variants) has surpassed previous state-of-the-art methods on novel view synthesis for static objects. Some works also extend NeRF to handle general space-time dynamic scenes [23, 26, 31]. Our method extends NeRF to model a dynamic representation of 3D humans with the help of a parametric 3D body mesh model.

Rigging NeRF. A prevalent approach for representing dynamic humans with NeRF is to rig NeRF with articulated models. Common articulation choices are 3D pose skeletons [21, 29] and parametric 3D mesh models [9, 17, 24, 25]. Our method utilizes a parametric 3D mesh model [3] for articulation. While we share with previous and concurrent works [17, 21, 24, 29] the goal of modeling a dynamic human body with an articulated NeRF representation, our method differs from them in two aspects. First, we simplify the input to a monocular video as opposed to multi-view video inputs [17, 24], and relax the dependence on accurate 3D geometry input [21] as a prior. Second, we propose a new embedding method for querying an articulated dynamic NeRF with locality and distance-preserving constraints. Noguchi et al. [21] propose to learn the most relevant articulated part for any given query point. The concurrent work of Su et al. [29] proposes a similar framework that jointly optimizes NeRF and human poses, but uses a skeleton as the human shape representation and directly relates the input query to all articulated skeleton joints. We focus on improving the generalization ability of NeRF-based animatable 3D human reconstruction with novel embedding designs. Our method preserves locality via nearest-neighbor projection, and encourages local distance preservation to avoid collapse of the deformation in the whole volume.

3 Method

Given a monocular video sequence \(\{\textbf{I}_i\}_{i=1}^K\) as input, we aim to construct a neural human representation that encodes both appearance and geometry and can be rendered under an arbitrary pose \(\theta \). In particular, we model our representation with a neural radiance field (NeRF). Our NeRF is dynamically controlled by an underlying parametric mesh model (Sect. 3.1). Given an observation-space pose, the mesh surface is deformed from its rest pose accordingly. We design a novel query embedding (Sect. 3.2) for the input, which encodes both surface deformation information and additional constraints. Based on the proposed mesh-guided NeRF, we propose an analysis-by-synthesis method to jointly estimate the per-frame 3D mesh from the input video and train the NeRF (Sect. 3.3), using off-the-shelf tools for mesh initialization.

3.1 Mesh-Guided NeRF

In NeRF, the rendered color \(\bar{\textbf{C}}(u,v)\) at image pixel (u, v) is generated by querying and blending the radiance along the corresponding camera ray according to the volume density values:

$$\begin{aligned} {\bar{\textbf{C}}}(u,v) = \sum _{i=1}^N{T_i(1-\exp (-\sigma _i\delta _i))\textbf{c}_i}, \end{aligned}$$
(1)

where

$$\begin{aligned} T_i = \exp \Big (-\sum _{j=1}^{i-1}{\sigma _j\delta _j}\Big ), \end{aligned}$$
(2)

and

$$\begin{aligned} (\textbf{c}_i, \sigma _i) = F(\textbf{x}_i). \end{aligned}$$
(3)

\(\textbf{c}_i \in \mathbb {R}^3\) and \(\sigma _i\) are the color and volume density of the i-th sampled point \(\textbf{x}_i\) along the ray, and \(\delta _i\) is the distance between adjacent samples. \(F(\textbf{x})\) is usually parameterized with an MLP network.
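
For concreteness, the compositing of Eqs. 1-3 can be written as a short function. The sketch below is a minimal NumPy version that assumes the per-sample colors, densities, and segment lengths along a single ray have already been computed; all names are illustrative.

```python
import numpy as np

def composite_ray(colors, sigmas, deltas):
    """Blend N samples along one ray into a pixel color (Eqs. 1-3).

    colors: (N, 3) radiance c_i at each sample
    sigmas: (N,)   volume density sigma_i at each sample
    deltas: (N,)   distance delta_i between adjacent samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity contributed by each segment
    # T_i: transmittance, i.e. how much light survives up to sample i (Eq. 2)
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas[:-1] * deltas[:-1])]))
    weights = trans * alphas                         # T_i (1 - exp(-sigma_i delta_i))
    return (weights[:, None] * colors).sum(axis=0)   # final pixel color C(u, v), Eq. 1
```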

We extend NeRF to handle the dynamic, articulated human body with the mesh-based parametric 3D model SMPL [18]. An SMPL model \(S(\theta , \beta )\) takes as input a human 3D pose \(\theta \), given as skeleton joint rotations, and a low-dimensional shape vector \(\beta \), and outputs a 3D mesh. As we mainly focus on synthesizing humans under different poses, we omit the shape parameter \(\beta \) in the following.

Formally, given a pose input \(\theta \), the radiance color \(\textbf{c}(\textbf{x})\) and volume density \(\sigma (\textbf{x})\) of our mesh-guided NeRF at point \(\textbf{x}\) are computed as follows:

$$\begin{aligned} (\textbf{c}(\textbf{x}), \sigma (\textbf{x})) = F_{\Phi }(q(\textbf{x}, S(\theta ))), \end{aligned}$$
(4)

where the query embedding q is the most important part, as it directly relates the output of the NeRF to the underlying deformable mesh, as we discuss next.
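
To make the data flow of Eq. 4 concrete, the sketch below traces one rendered pixel through the pipeline. It reuses `composite_ray` from the sketch above; `smpl`, `query_embedding`, `nerf_mlp`, and `ray_mesh_bounds` are hypothetical placeholders for the SMPL model \(S(\theta )\), the embedding of Sect. 3.2, the NeRF MLP \(F_{\Phi }\), and a ray-mesh intersection routine used to bound the sampling interval (Sect. 4.1).

```python
import numpy as np

def render_pixel(ray_o, ray_d, theta, smpl, query_embedding, nerf_mlp, n_samples=64):
    """Render one pixel of the mesh-guided NeRF for a target pose theta (Eq. 4)."""
    mesh = smpl(theta)                                   # posed SMPL mesh S(theta)
    z_near, z_far = ray_mesh_bounds(ray_o, ray_d, mesh)  # nearest/farthest hit on the body mesh
    z_vals = np.linspace(z_near - 0.04, z_far + 0.04, n_samples)
    points = ray_o + z_vals[:, None] * ray_d             # sample points x_i along the ray

    embeds = np.stack([query_embedding(x, mesh) for x in points])  # q(x, S(theta))
    colors, sigmas = nerf_mlp(embeds)                    # (c_i, sigma_i) = F_Phi(q(...))
    deltas = np.append(np.diff(z_vals), 1e10)            # last segment treated as semi-infinite
    return composite_ray(colors, sigmas, deltas)         # Eq. 1
```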

3.2 Query Embedding for NeRF

The input for querying the radiance value at point \(\textbf{x}\) in NeRF is given by its 3D location (x, y, z) and 2D viewing direction \((\theta , \phi )\) in the world space. A natural extension of the input query to dynamic scenes is to define a deformation field that transforms observation-space points to the rest space. Directly estimating a general deformation field together with the NeRF, as in [23, 26, 31], is highly ill-posed and prone to local minima. Inspired by [17, 24], we leverage the deformable SMPL model as a human prior to guide the transformation of input queries. The underlying SMPL model defines reasonable deformation fields on its surface; however, a radiance field from NeRF is defined on the full 3D volume, and we still need to determine the deformation at unconstrained off-surface points. Naively projecting off-surface points to their nearest vertex on the mesh is not optimal because the off-surface deformation collapses, as illustrated in Fig. 3.

Fig. 3. An illustration of our distance-preserving query embedding. (a) Naively embedding the query with the nearest-neighbor vertex on the mesh (red line) leads to indistinguishable embeddings of different surface deformation patterns. (b) With additional geometric K-NN distance information (purple lines), different deformation patterns are clearly separated.

We address this issue from another perspective: instead of feeding an inverse-transformed point through an explicitly defined deformation field to query the NeRF, we construct a query embedding of the input point that encodes two types of information: (1) information that guides roughly what the deformation field should be (denoted Deformation Guidance), and (2) priors that prevent the deformation field from collapsing into local minima (denoted Deformation Priors). The NeRF then implicitly learns a radiance field based on this input embedding. Figure 2 illustrates our design of the query embedding.

Deformation Guidance. Our deformation guidance is based on the underlying SMPL model. For the SMPL model, the transformation between a canonical-space surface point \(\textbf{v}\) and its observation-space counterpart \(\textbf{v}'\) is given by the linear blend skinning (LBS) algorithm [15]:

$$\begin{aligned} \textbf{v}' = (\sum _{k=1}^{K}w(\textbf{v})_kG_k)\textbf{v}, \end{aligned}$$
(5)

where K is the number of human parts, \(G_k \in SE(3)\) is the transformation matrix of the k-th part on the human skeleton, and \(w(\textbf{v})\) is the blend weight.
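
A minimal sketch of this forward skinning step, assuming 4×4 homogeneous part transforms and per-vertex blend weights as provided by SMPL:

```python
import numpy as np

def lbs_transform(v_rest, weights, part_transforms):
    """Linear blend skinning (Eq. 5): map one rest-pose vertex to the observation space.

    v_rest:          (3,)      vertex position v in the rest pose
    weights:         (K,)      blend weights w(v), summing to 1
    part_transforms: (K, 4, 4) rigid transform G_k of each body part
    """
    G = np.tensordot(weights, part_transforms, axes=1)   # blended 4x4 transform
    v_h = np.append(v_rest, 1.0)                         # homogeneous coordinates
    return (G @ v_h)[:3]                                 # observation-space vertex v'
```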

Intuitively, the guidance information from the SMPL model should neither be too global, such that the network inadvertently relates the output to irrelevant articulated parts [21, 34], nor collapse to a single nearest-neighboring point, in which case the deformation field remains unconstrained (Fig. 3). To this end, we build the deformation guidance part of the input query from the nearest projected vertex on the mesh as well as the k nearest adjacent vertices of the projected vertex, mapped to the rest space via inverse LBS:

$$\begin{aligned} q_{g}(\textbf{x}) = (\textbf{x}_{dir}, \textbf{v}_0,\textbf{v}_1,...,\textbf{v}_k), \end{aligned}$$
(6)

where \(\textbf{v}_k = (\sum _{l=1}^{K}w(\textbf{v}_k)_lG_l)^{-1}\textbf{v}'_k\) and \(\textbf{v}'_k\) is the k-th nearest neighboring mesh vertex in the observation space. Note that we additionally provide the relative direction from the query point \(\textbf{x}\) to its projected point \(\textbf{v}_0\):

$$\begin{aligned} \textbf{x}_{dir} = \mathcal {R}((\sum _{k=1}^{K}w(\textbf{v})_kG_k)^{-1})\frac{\textbf{v}_0 - \textbf{x}}{\Vert \textbf{v}_0 - \textbf{x}\Vert _2}. \end{aligned}$$
(7)

Here \(\mathcal {R}\) denotes the rotational part of the transformation matrix.
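
To make the construction of \(q_g\) concrete, the sketch below projects a query point onto the posed mesh, gathers the surface neighbors of the projected vertex, inverse-skins them to the rest pose, and appends the rotated relative direction (Eqs. 6-7). It assumes per-vertex blend weights and part transforms are available from the SMPL model; `geodesic_knn` is a hypothetical helper returning the k nearest vertices along the mesh surface.

```python
import numpy as np

def deformation_guidance(x, mesh_obs, blend_weights, part_transforms, k=4):
    """Build q_g(x): inverse-skinned nearest vertices plus a relative direction (Eqs. 6-7)."""
    # 1. Project the query point onto the posed mesh (nearest vertex v'_0).
    idx0 = int(np.argmin(np.linalg.norm(mesh_obs - x, axis=1)))

    # 2. Gather the k nearest neighbors of the projected vertex on the mesh surface.
    nn_idx = geodesic_knn(idx0, k)            # hypothetical helper (geodesic K-NN)

    # 3. Map each neighbor back to the rest pose with inverse LBS: v_k = G(v_k)^-1 v'_k.
    rest_vertices = []
    for i in [idx0] + list(nn_idx):
        G = np.tensordot(blend_weights[i], part_transforms, axes=1)
        v_h = np.append(mesh_obs[i], 1.0)
        rest_vertices.append((np.linalg.inv(G) @ v_h)[:3])

    # 4. Relative direction from x to the projected vertex, rotated into the rest frame (Eq. 7).
    G0 = np.tensordot(blend_weights[idx0], part_transforms, axes=1)
    direction = (mesh_obs[idx0] - x) / (np.linalg.norm(mesh_obs[idx0] - x) + 1e-8)
    x_dir = np.linalg.inv(G0)[:3, :3] @ direction

    return np.concatenate([x_dir] + rest_vertices)
```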

Deformation Priors. Our deformation guidance embedding \(q_g\) is based on the mesh surface only and is insufficient to ensure a well-defined deformation field in the whole volume. We therefore add another part to the query embedding by equipping the input query with the Euclidean distances to its nearest mesh vertices in the observation space:

$$\begin{aligned} q_p(\textbf{x}) = (d_0,d_1,...,d_k), \end{aligned}$$
(8)

where \(d_k = \Vert \textbf{v}'_{k} - \textbf{x}\Vert _2\). Using the distance in the observation space is important, as this information preserves local differences under different poses and leads to a distance-preserving deformation field.
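
A minimal sketch of this part of the embedding, assuming the same neighbor indices as used for \(q_g\) above:

```python
import numpy as np

def deformation_prior(x, mesh_obs, neighbor_idx):
    """Build q_p(x) (Eq. 8): Euclidean distances d_k to the K-NN vertices in observation space."""
    return np.linalg.norm(mesh_obs[neighbor_idx] - x, axis=1)
```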

Appearance Latent Code. To capture geometry and appearance details that cannot be explained by the surface mesh deformation alone, we additionally provide a learnable latent code \(\textbf{l}_k\) defined on each mesh vertex:

$$\begin{aligned} q_a(\textbf{x}) = (\textbf{l}_0,\textbf{l}_1,...,\textbf{l}_k). \end{aligned}$$
(9)

The complete query embedding for the NeRF input is generated by feeding the concatenated vectors into a tiny 3-layer MLP network \(\psi \):

$$\begin{aligned} q(\textbf{x}) = \psi (\mathbf {\gamma }(q_g(\textbf{x})),\mathbf {\gamma }(q_p(\textbf{x})), q_a(\textbf{x})), \end{aligned}$$
(10)

where \(\gamma \) denotes positional encoding as used in the original NeRF [19].
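
A possible PyTorch realization of Eq. 10 is sketched below. The 3-layer, 128-channel MLP follows the network description in Sect. 4.1; the exact input dimensionality and fusion details are our assumptions.

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=10):
    """gamma(x): sin/cos encoding with 10 frequencies, as in the original NeRF."""
    feats = []
    for i in range(n_freqs):
        feats += [torch.sin(2.0 ** i * x), torch.cos(2.0 ** i * x)]
    return torch.cat(feats, dim=-1)

class QueryEmbedding(nn.Module):
    """psi: fuse q_g, q_p and the latent codes q_a into the final NeRF input q(x) (Eq. 10)."""

    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, q_g, q_p, q_a):
        feat = torch.cat([positional_encoding(q_g), positional_encoding(q_p), q_a], dim=-1)
        return self.mlp(feat)
```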

3.3 Joint Mesh Estimation and NeRF Training

Training the mesh-guided NeRF from monocular video input requires paired data of input frames \(\{\textbf{I}_i\}\) and human meshes \(\{\textbf{M}_i\}\). State-of-the-art monocular-video-based human mesh reconstruction methods such as [13, 14] produce plausible human mesh estimates; however, they are still not accurate enough for training our NeRF, as mesh parts misaligned with the image give incorrect guidance and make the NeRF overfit to misaligned training poses. Hence we use the plausible mesh estimates provided by prior solutions as initialization and jointly fine-tune the meshes during NeRF training. In practice, we optimize the pose parameters \(\theta ^i\) for each training frame instead of per-vertex mesh offsets, as this gives enough capacity to refine mesh-image misalignment without so much flexibility that the optimization overfits to local minima.

Training Objective. Our training is guided by the reconstruction error between the mesh-guided NeRF renderings and the ground-truth frames over the whole video sequence, as well as a regularization term penalizing large deviations from the initial pose estimates \(\theta _0\):

$$\begin{aligned} L = \sum _{i}\sum _{u,v}{ L _i(u,v)} + \lambda _p\sum _{i}\Vert \theta ^i - \theta _0^i\Vert _2^2, \end{aligned}$$
(11)

and

$$\begin{aligned} L _i(u,v) = \Vert \bar{\textbf{C}}(u,v) - \textbf{I}_i(u,v)\Vert _2^2, \end{aligned}$$
(12)

where \(\textbf{I}_i(u,v)\) is the ground-truth pixel value at (u, v) in the i-th frame. \(\bar{\textbf{C}}(u,v)\) is computed using Eq. 1, Eq. 4 and the proposed query embedding (Eq. 10).
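
A minimal sketch of this objective over a batch of sampled rays, with \(\lambda _p = 2.0\) as in Sect. 4.1; tensor shapes and names are illustrative only.

```python
def training_loss(rendered, gt_pixels, poses, poses_init, lambda_p=2.0):
    """Reconstruction loss over the sampled rays plus pose regularization (Eqs. 11-12).

    rendered, gt_pixels: (R, 3) rendered and ground-truth colors of the sampled rays
    poses, poses_init:   (F, P) current and initial per-frame pose parameters
    """
    recon = ((rendered - gt_pixels) ** 2).sum()              # sum of L_i(u, v) over sampled rays
    pose_reg = lambda_p * ((poses - poses_init) ** 2).sum()  # keep theta^i close to theta_0^i
    return recon + pose_reg
```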

4 Experiments

4.1 Experimental Setup

Datasets. We conduct experiments on different datasets as follows:

  • People-Snapshot [1]: This dataset contains monocular videos of 24 subjects turning around. Among them, we choose female-1-casual, female-3-casual, male-1-sport and male-9-plaza for training. We remove the background of the video frames using the provided ground-truth silhouettes and resize the videos to half size (1080p \(\rightarrow \) 540p). An initial mesh is provided with the data.

  • DoubleFusion [33]: This dataset contains only a single sequence of one man, in which the actor performs more complex actions while turning around. We therefore consider it unsuitable for a quantitative benchmark and only use it for qualitative comparisons on novel pose synthesis. The initial mesh is provided in the dataset using additional depth information.

  • ZJU-MoCap [25]: This dataset contains multi-view video sequences of 9 subjects captured with 21 cameras. We choose a single view (subjects 313 and 386 from camera 7) for training.

  • Human3.6M [11]: This dataset consists of a large number of 3D human poses and corresponding multi-view video sequences. We follow the same protocol as [29], extracting every 64th frame of the videos. We train the model on subjects 9 and 11. For each video, we select camera 2 as the input view and employ SPIN [14] to estimate the initial mesh from the video frames.

Network Structure. The network \(\psi \) in the query embedding module is implemented as a 3-layer MLP with 128 channels. The NeRF network \(F_{\Phi }\) is an 8-layer MLP with 256 channels. We apply a positional encoding with 10 frequencies to the query embedding features, except the latent codes.
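
A sketch of the NeRF backbone \(F_{\Phi }\) consistent with this description; the skip connection used in the original NeRF is omitted for brevity, and the output activations are our assumptions.

```python
import torch
import torch.nn as nn

class MeshGuidedNeRF(nn.Module):
    """F_Phi: 8-layer MLP with 256 channels mapping the query embedding to (color, density)."""

    def __init__(self, embed_dim=128, width=256, depth=8):
        super().__init__()
        layers, dim = [], embed_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, width), nn.ReLU()]
            dim = width
        self.trunk = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(width, 1)
        self.color_head = nn.Linear(width, 3)

    def forward(self, q):
        h = self.trunk(q)
        sigma = torch.relu(self.sigma_head(h))     # non-negative volume density
        color = torch.sigmoid(self.color_head(h))  # RGB in [0, 1]
        return color, sigma
```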

Training Details. We use the Adam optimizer [12] with a learning rate of 1e−4 for optimizing the NeRF and the latent codes. The learning rate for body poses is set to 5e−4 and \(\lambda _p\) is set to 2.0. For volumetric rendering we employ the coarse-to-fine ray sampling strategy of [19]. We also focus the sampled rays on the human region by sampling rays within a 1.2\(\times \)-padded bounding box of the 2D keypoints with 70% probability, and uniformly over the whole image with 30% probability. Each sampled ray is discretized within \([z_{near} - 0.04,z_{far}+0.04]\), where \(z_{near}\) and \(z_{far}\) denote the nearest and farthest ray intersections with the body mesh, respectively. Our model is trained on a single Nvidia Tesla V100 32 GB GPU, and training takes approximately 60 h to converge. For datasets without a background mask available, we either apply an off-the-shelf matting algorithm [16] or jointly model the background during training. Please see the supplemental material for details.
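
The keypoint-focused ray sampling described above could look roughly as follows; the function name and argument layout are hypothetical.

```python
import numpy as np

def sample_ray_pixels(keypoints_2d, image_hw, n_rays, p_fg=0.7, pad=1.2):
    """Sample pixel locations: 70% inside a 1.2x-padded keypoint box, 30% anywhere in the image."""
    h, w = image_hw
    x0, y0 = keypoints_2d.min(axis=0)
    x1, y1 = keypoints_2d.max(axis=0)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    bw, bh = (x1 - x0) * pad / 2.0, (y1 - y0) * pad / 2.0

    n_fg = int(n_rays * p_fg)
    fg_x = np.random.uniform(max(cx - bw, 0), min(cx + bw, w), n_fg)
    fg_y = np.random.uniform(max(cy - bh, 0), min(cy + bh, h), n_fg)
    bg_x = np.random.uniform(0, w, n_rays - n_fg)
    bg_y = np.random.uniform(0, h, n_rays - n_fg)
    return np.stack([np.concatenate([fg_x, bg_x]), np.concatenate([fg_y, bg_y])], axis=1)
```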

Evaluation Metrics. Peak-Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) are used to evaluate image quality.

Table 1. Ablation studies on (a) the type of distance, (b) the type of direction, and (c) the type of neighborhood selection for embedding construction.

4.2 Ablation Studies

To validate the influence of our proposed query embedding, we conduct the ablation study on the People-Snapshot dataset and report quantitative results on both training and test (unseen) poses, from the following aspects:

Neighborhood Range: As discussed in Sect. 3.2, the deformation guidance from the SMPL model should be neither too global nor too local. We verify this by training with different ranges of the mesh neighborhood. The results are shown in Table 1c. Either increasing the range (2-hop neighbors) or using only the nearest projected vertex degrades performance, both for training and novel poses. We also test a variant of our method that samples the K-NN points based on Euclidean distance (spatial K-NN) instead of geodesic distance. The results also degrade, as this variant is unaware of human part connectivity (i.e., two points adjacent in Euclidean space might belong to distinct human parts).

Distance Prior: We validate the importance of the distance information in Table 1a. We remove the distance feature in the w/o distance model, and replace the observation-pose distance with the rest-pose distance in the canonical distance model. Without distance information, the results are significantly degraded and the gap between training and novel poses increases.

Relative Direction: The impact of the relative direction embedding is shown in Table 1b, where w/o direction denotes an embedding without direction, and w/o inverse denotes embedding the direction in the observation space. Notably, w/o inverse greatly reduces generalization to novel poses.

Pose Refinement: Our joint pose refinement during NeRF training is crucial when the initial mesh is not accurate enough. To validate this, we conduct experiments on both the Human3.6M and People-Snapshot datasets. The People-Snapshot dataset provides a fairly accurate initial mesh; yet we still observe minor artifacts without pose refinement, and our joint training further improves the result, both quantitatively (Table 2) and qualitatively (Figs. 4 and 5).

Fig. 4. Qualitative comparison between the original and optimized mesh. The final result corrects the initial human mesh, e.g., the alignment error on the arms.

Table 2. The effect of using joint pose refinement.
Table 3. Quantitative comparison with AniNeRF and A-NeRF.

4.3 Comparisons

As there are very few (formally peer-reviewed and published) NeRF-based works that share the same succinct monocular input with a mesh-based geometry proxy as ours, we compare with the following methods:

AniNeRF (ICCV 2021). AniNeRF [24] is a NeRF-based method for dynamic human modeling. AniNeRF also uses a mesh as geometry guidance, but has stricter input requirements, namely multi-view video. It produces high-quality results with typically 3 to 4 synchronized views. For a fair comparison, we re-train AniNeRF in the same single-view setting with the same training data, and report the comparison in Fig. 7. We emphasize that this monocular setup is not intended to produce the best-quality results, but to demonstrate the challenge of the monocular video scenario as well as the benefit of our proposed method. Compared with AniNeRF, our method generates complete skin and clothing, whereas AniNeRF is unable to model the whole body from the limited view. The quantitative results reported in Table 3 also show that our method outperforms AniNeRF under the same settings. We also refer to the supplemental material for a comparison with NeuralBody [25], the precursor of AniNeRF.

A-NeRF (NeurIPS 2021). A-NeRF [29] is a recent work that models 3D humans with NeRF from monocular video input. A-NeRF jointly optimizes the NeRF and the human skeleton. An apples-to-apples comparison with A-NeRF is difficult, as it differs from our method in many implementation aspects that affect result quality, from the underlying parametric body representation (skeleton-based vs. mesh-based) to the backbone capacity. Nevertheless, our result on the Human3.6M dataset is quantitatively comparable with A-NeRF (Table 3).

Non-NeRF Methods. Regarding non-NeRF methods, we also compare our method with an SMPL-model-based method, VideoAvatar [1]. Qualitative results are shown in Fig. 6. Given the same monocular video as input, the NeRF-based method generates results with more natural and realistic colors.

Fig. 5. The effect of pose refinement on the People-Snapshot dataset. Joint refinement contributes to clearer geometry and eliminates outliers. The improvement brought by refinement is highlighted (enlarged) in red.

Fig. 6. A qualitative comparison with the mesh-based method VideoAvatar.

Fig. 7. Qualitative comparison of AniNeRF [24] and ours under a novel view. Both methods are trained with a single-view sequence.

4.4 Applications

Novel Pose Synthesis. Our trained representation enables character animation with novel unseen poses. We evaluate generalization by comparing test data and our renderings driven by the same set of unseen poses on the People-Snapshot and DoubleFusion datasets. Qualitative results are shown in Fig. 8a. Our model successfully disentangles background and foreground pixels and faithfully reconstructs the human body on the DoubleFusion dataset (first row). For side and back views, our model still generates high-quality images, as shown on the Human3.6M dataset (second row). We also provide quantitative results in Fig. 8b on the People-Snapshot and Human3.6M datasets.

Fig. 8. Qualitative and quantitative results of novel pose synthesis on multiple datasets. (a) Top row: novel pose rendering (left) and ground truth (right) on DoubleFusion. Bottom row: rendering (odd columns) and ground truth (even columns) on Human3.6M. (b) Quantitative results of novel pose synthesis on the Human3.6M and People-Snapshot datasets.

Fig. 9. Human animation driven by DoubleFusion poses. The synthesized human is trained on the People-Snapshot dataset.

Pose Retargeting. The generalization ability of our model is further evaluated by pose retargeting experiments. The results are shown in Fig. 9, where the driving poses come from the DoubleFusion dataset and the training body comes from the People-Snapshot dataset. Our model generates realistic human bodies in various poses, which demonstrates the generalization of the proposed method. We refer to the supplemental material for more novel pose synthesis results, including animation videos.

5 Conclusion

We presented a new method for building animatable neural 3D human representations from only monocular video input. Our representation is based on a dynamic neural radiance field guided by a parametric 3D human mesh. We designed a novel input query embedding for the mesh-guided NeRF. We train the representation by first initializing per-frame 3D meshes using off-the-shelf tools and then jointly optimizing the 3D meshes and the dynamic NeRF. The learned neural representation generalizes well to unseen views and poses.

Limitations. Our method is not without limitations. The input query embedding is related to a local region on the mesh surface with a restricted receptive field; thus the joint optimization may fail if the initial pose deviates too much from the ground truth. Due to resolution constraints and the limited expressiveness of the mesh model we use, our method still struggles to recover high-resolution details such as human faces.

Future Work. In future work, we plan to explore different kinds of deformation priors and their effects on rigging dynamic NeRF, to improve the recovery of sharp details, and to extend the method to general, non-articulated dynamic objects.