1 Introduction

High-fidelity free-viewpoint synthesis of the human body is important for many applications such as virtual reality, telepresence and games. Some recent works [12, 22, 25, 36] deploy a Neural Radiance Fields (NeRF) [18] pipeline, which achieves fairly realistic synthesis of the human body. However, these works usually require dense-view captures of the human body and have to train a separate model for each person to render new views. The limited generalization ability as well as the high computational cost severely hinder their application in real-world scenarios.

Fig. 1.

Our method can better handle self-occlusion (a) and high computational cost (b) than previous methods [10, 24]. In (a), our multi-view integration can extract high-quality geometry information from \(V_3\) for the red SMPL vertex. In (b), our progressive rendering pipeline leverages the geometric volume and the predicted density values to progressively reduce the number of sampling points and speed up rendering, while previous methods [10, 24] waste a large amount of computation on redundant empty regions. The efficiency comparison shown in (c) further verifies our high efficiency.

In this work, we aim to enable high-fidelity free-viewpoint human body synthesis with a generalizable and efficient NeRF framework based only on single-frame images from sparse camera views. To pursue such a high-standard framework, two main challenges need to be tackled. First, the human body is highly non-rigid and commonly exhibits self-occlusions over body parts, which may lead to ambiguous results with only sparse-view captures. This ambiguity can drastically degrade the rendering quality without proper regularization, and cannot be easily resolved by simply sampling features from multi-view images as in [26, 33, 35]. The problem becomes worse when one model is used to synthesize unseen scenarios without per-scene training. Second, the high computation and memory cost of NeRF-based methods severely hinders human synthesis with accurate details at high resolution. For example, when rendering one \(512\times 512\) image, existing methods need to process millions of sampling points through the neural network, even when using the bounds of the geometry prior to remove empty regions.

To address these challenges, we propose a geometry-guided progressive NeRF, called GP-NeRF, for generalizable and efficient free-view human synthesis. Specifically, to regularize the learned 3D human representation, we propose a geometry-guided multi-view feature integration approach that more effectively exploits the information in the sparse input views. For the geometry prior, we adopt a coarse 3D body model, i.e., SMPL [17], which serves as the base estimate of our algorithm. We attach multi-view image features to the base geometry model using an adaptive multi-view aggregation layer. We then obtain an enhanced geometry volume by refining the base model with the attached image features, which substantially reduces the ambiguities in learning a high-fidelity 3D neural field. It is worth noting that our multi-view enhanced geometry prior differs significantly from related methods that also utilize human body priors [10, 14, 23, 24]. NB [24] learns a per-scene geometry embedding, which is hard to generalize to unseen human bodies; NHP [10] relies on temporal information to complement the base geometry model, which is less effective for regions occluded throughout the input video. In contrast, our approach adaptively combines the geometry prior and multi-view features to enhance the 3D estimation, and thus can better handle the self-occlusion problem and achieve higher generalization capacity even without using videos (see Fig. 1 (a)). By integrating the multi-view information to form a complete geometry volume adapted to the target human body, we can also compensate for some limitations of the geometry prior (e.g., inaccurate body shape or lack of clothing information, as in [14, 23]) and support the following efficient progressive pipeline.

Furthermore, to tackle the high computation and memory cost, we introduce a geometry-guided progressive rendering pipeline. As shown in Fig. 1 (b), different from previous methods [10, 24], our pipeline decouples the density and color prediction processes, leveraging the geometry volume as well as the predicted density values to progressively reduce the number of sampling points for rendering. By simply deploying our progressive rendering pipeline with the same data and model parameters, we can remove 76.4% of the points for density prediction (with the Density MLP in Fig. 1 (b)) and 94% of the points for color prediction (with the Appearance MLP in Fig. 1 (b)), reducing the total forwarding time of this part over all points by 85%. Later experiments verify that our progressive pipeline causes no performance decline while requiring a shorter training time, which we credit to learning density and appearance separately.

Our main contributions are threefold:

  • We propose a novel geometry-guided progressive NeRF (GP-NeRF) for generalizable and efficient human body rendering, which significantly reduces the computational cost of rendering and also achieves higher generalization capacity using only single-frame sparse views.

  • We propose an effective geometry-guided multi-view feature integration approach, in which each view compensates for the low-quality occluded information of the other views under the guidance of the geometry prior.

  • Our GP-NeRF achieves state-of-the-art performance on the ZJU-MoCap dataset, taking only 175 ms on an RTX 3090 and reducing the per-image rendering time by over \(70\%\), which verifies the effectiveness and efficiency of our framework.

2 Related Work

Human Performance Capture. Previous works [2, 5, 8, 20] apply traditional modeling and rendering pipelines for novel view synthesis of human performance, relying on either a dense camera setup [4, 8] or depth sensors [2, 5, 31] to ensure photo-realistic reconstruction. Follow-up improvements introduce neural networks into the rendering pipeline to alleviate geometric artifacts. To enable human performance capture in a sparse multi-view setup, template-based methods [1, 3, 6, 30] adopt pre-scanned human models to track human motion. However, these approaches require per-scene optimization, and the pre-scanned human models are hard to collect in practice, which hinders their real-world application. Instead of performing per-scene optimization, recent methods [19, 27, 28, 37] adopt neural networks to learn human priors from ground-truth 3D data, and hence can reconstruct detailed 3D human geometry and texture from a single image. However, due to the limited diversity of training data, it is difficult for them to generate photo-realistic view synthesis or generalize to human poses and appearances that are very different from the training ones. Some other methods [11, 38] sample points from the generated 3D feature space and then determine the human body opacity for later rendering, but without human body geometry constraints they may generate results that violate normal human body structure.

Neural 3D Representations. Recently, researchers have proposed implicit function-based approaches [13, 15, 21, 29] that learn a fully-connected network to translate a 3D positional feature into a local feature representation. The recent NeRF [18] achieves high-fidelity novel view synthesis by learning implicit fields of color and density along with a volume rendering technique. Later, several works extend NeRF to dynamic scene modeling [12, 22, 25, 36] by optimizing NeRF and dynamic deformation fields jointly. Despite impressive performance, learning both NeRF and dynamic deformation fields together is an extremely under-constrained problem. NB [24] combines NeRF with the parametric human body model SMPL [17] to regularize the training process. It requires a lengthy optimization for each scene and hardly generalizes to unseen scenarios. To avoid such expensive per-scene optimization, generalizable NeRFs [10, 26, 33, 35] condition the network on pixel-aligned image features. However, directly extending such methods to complex and dynamic 3D human modeling is highly non-trivial due to self-occlusion, especially when modeling unseen humans under sparse views. Besides, these approaches suffer from low efficiency since they need to process a large number of sampling points for volumetric rendering, harming their real-world applicability. Different from existing methods, we carefully design a multi-view information aggregation approach and a progressive rendering technique to improve model robustness and generalization to unseen scenarios under sparse views, and also to speed up rendering.

3 Methodology

Given a set of M sparse source views {\(\textbf{I}_m|m = 1, 2, ..., M\)} of an arbitrary human body, captured by M pre-calibrated cameras, we aim to synthesize the novel view \(\textbf{I}_t\) of the human body from an arbitrary target camera.

To this end, we propose a geometry-guided progressive NeRF (GP-NeRF) framework for efficient and generalizable free-view human synthesis under very sparse views (e.g., \(M = 3\)). Figure 2 illustrates the overview of our framework. Firstly, a CNN backbone is used to extract image features \(\textbf{F}_m\) for each of the views \(\textbf{I}_m\). Then our GP-NeRF framework integrates these multi-view features to synthesize the novel-view image through three modules progressively, leveraging the geometry prior from SMPL [17] as guidance. The three modules are 1) geometry-guided multi-view feature integration (GMI) module (Sect. 3.1); 2) density network (Sect. 3.2); and 3) appearance network (Sect. 3.3). Details of the whole progressive human rendering pipeline are elaborated in Sect. 3.4, and the training method is described in Sect. 3.5.

Fig. 2.

Overview of our proposed framework. Our progressive pipeline mainly contains three parts. (a) Geometry-guided multi-view feature integration. We first learn a query embedding \(Q_l\) for each SMPL vertex to adaptively integrate the multi-view pixel-aligned image features \(\textbf{F}_m(\pi ({v}_{lm}))\) through the geometry-guided attention module. Based on this, we utilize the SparseConvNet to construct a denser geometry feature volume \(\tilde{\textbf{F}}^v\). (b) Density Network. For each point \(p_i\) within \(\tilde{\textbf{F}}^v\), we concatenate its geometry feature \(\tilde{\textbf{F}}^v_i\) with the mean (\(\boldsymbol{\mu }\)) and variance (\(\textbf{v}\)) of its pixel-aligned image features \(\textbf{F}_m(\pi (p_i))\), and predict its density value \(\sigma _i\) through the density MLP. Points \(p_i\) with positive \(\sigma _i\) form the valid density volume. (c) Appearance Network. For each point \(p_i\) within the valid density volume, we utilize \(\textbf{F}_m(\pi (p_i))\) to predict its color value \(c_i\) through the appearance MLP. Finally, we conduct volume rendering to render the target image.

3.1 Geometry-Guided Multi-view Integration

The geometry-guided multi-view feature integration module, shown in Fig. 2 (a), enhances the coarse geometry prior with multi-view image features by adaptively aggregating these features via a geometry-guided attention module. Then it constructs a complete geometry feature volume that adapts to the target human body.

Firstly, we use the SMPL model [17] as the geometry prior and obtain the pixel-aligned image features for each of the 6,890 SMPL vertices \({v}_l\) from each source image \(\textbf{I}_m\). Specifically, we apply each source camera pose \([\textbf{R}_m|\textbf{t}_m]\) to the coordinates of \({v}_l\) to transform it into the source camera coordinate system, obtaining \({v}_{lm}\), and then utilize the intrinsic matrix \(\textbf{K}_m\) to obtain the projected coordinate \(\pi ({v}_{lm})\) on the corresponding image plane. We denote the pixel-aligned features from the image features \(\textbf{F}_m\) at the pixel location \(\pi ({v}_{lm})\) as \(\textbf{F}_m(\pi ({v}_{lm}))\). We use bilinear interpolation to obtain the corresponding features when the projected location is fractional.
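To make the projection and feature lookup concrete, the following is a minimal PyTorch-style sketch of projecting SMPL vertices into each source view and gathering pixel-aligned features with bilinear interpolation. The tensor shapes, the function name, and the assumption that the feature maps are aligned with the full image resolution are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(verts, R, t, K, feat, img_hw):
    """verts: (L, 3) SMPL vertices in world coordinates.
    R: (M, 3, 3), t: (M, 3), K: (M, 3, 3) extrinsics/intrinsics of the M source views.
    feat: (M, C, Hf, Wf) image feature maps; img_hw: (H, W) of the source images.
    Returns (M, L, C) pixel-aligned features F_m(pi(v_lm))."""
    H, W = img_hw
    # Transform each vertex into every camera frame: v_lm = R_m v_l + t_m
    v_cam = torch.einsum('mij,lj->mli', R, verts) + t[:, None, :]      # (M, L, 3)
    # Perspective projection with the intrinsics K_m
    v_img = torch.einsum('mij,mlj->mli', K, v_cam)                     # (M, L, 3)
    xy = v_img[..., :2] / v_img[..., 2:].clamp(min=1e-6)               # (M, L, 2) pixel coords
    # Normalize to [-1, 1] for grid_sample, which performs the bilinear interpolation
    grid = torch.stack([2 * xy[..., 0] / (W - 1) - 1,
                        2 * xy[..., 1] / (H - 1) - 1], dim=-1)         # (M, L, 2)
    sampled = F.grid_sample(feat, grid[:, None], align_corners=True)   # (M, C, 1, L)
    return sampled[:, :, 0].permute(0, 2, 1)                           # (M, L, C)
```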

After obtaining \(\textbf{F}_m(\pi ({v}_{lm}))\) from the M source views, we integrate them to represent the geometry information at vertex \({v}_l\) through a geometry-guided attention module. Concretely, we learn an embedding \(\textbf{Q}_l\) for each \({v}_l\), and then take \(\textbf{Q}_l\) as a query embedding to calculate the correspondence score \(s_{lm}^v\) with each \(\textbf{F}_m(\pi ({v}_{lm}))\) respectively:

$$\begin{aligned} \begin{aligned} s_{lm}^v= \frac{(\boldsymbol{W}_1 \textbf{Q}_l+b_1)(\boldsymbol{W}_{2m} \textbf{F}_{lm}^v+b_{2m})^{\top }}{\sqrt{d}}, \\ \end{aligned} \end{aligned}$$
(1)

where we denote \(\textbf{F}_m(\pi ({v}_{lm}))\) as \(\textbf{F}_{lm}^v\) for simplicity, d is the channel dimension of \(\textbf{F}_{lm}^v\), and \(\boldsymbol{W}\) denotes a linear projection layer. After that, we compute a weighted sum of the M pixel-aligned feature embeddings \(\textbf{F}_{lm}^v\) based on the scores \(s_{lm}^v\) to obtain the aggregated geometry-related feature \(\textbf{F}_{l}^v\) for vertex \({v}_l\):

$$\begin{aligned} \begin{aligned} \textbf{F}_{l}^v= \sum _{m=1}^{M} s_{lm}^v \textbf{F}_{lm}^v. \end{aligned} \end{aligned}$$
(2)
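The geometry-guided attention of Eqs. (1) and (2) can be sketched as below. The module and layer names are our own, and the softmax normalization of the scores over views is an assumption added for numerical stability rather than part of the stated formulation.

```python
import torch
import torch.nn as nn

class GeometryGuidedAttention(nn.Module):
    def __init__(self, num_verts=6890, dim=32, num_views=3):
        super().__init__()
        self.query = nn.Embedding(num_verts, dim)             # Q_l, one per SMPL vertex
        self.proj_q = nn.Linear(dim, dim)                     # W_1, b_1
        self.proj_k = nn.ModuleList([nn.Linear(dim, dim)      # W_2m, b_2m (one per view)
                                     for _ in range(num_views)])
        self.scale = dim ** -0.5

    def forward(self, feats):                                 # feats: (M, L, d) = F_lm^v
        M, L, d = feats.shape
        q = self.proj_q(self.query.weight)                    # (L, d)
        k = torch.stack([self.proj_k[m](feats[m]) for m in range(M)])  # (M, L, d)
        scores = (q[None] * k).sum(-1) * self.scale           # (M, L), s_lm^v in Eq. (1)
        weights = scores.softmax(dim=0)                       # normalize over views (assumption)
        return (weights[..., None] * feats).sum(dim=0)        # (L, d), F_l^v in Eq. (2)
```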

Considering that the 6,890 SMPL vertices with their corresponding features are not dense enough to represent the whole human body volume, we further learn to extend and fill the holes of the sparse geometry feature volume \(\textbf{F}^v = \{\textbf{F}_{l}^v, l=1, 2, ..., 6890\}\) through the SparseConvNet [7], and thus obtain a denser geometry feature volume, denoted as \(\tilde{\textbf{F}}^v\). In our method, we take the geometry volume \(\tilde{\textbf{F}}^v\) as a more reliable basis to indicate the occupancy of the human body in the whole space volume. Compared with the coarse SMPL model, \(\tilde{\textbf{F}}^v\) leverages the multi-view image-conditioned features to enhance the coarse geometry prior and adapts to the shape of the target human body. \(\tilde{\textbf{F}}^v\) only preserves the effective volume regions with body contents, including clothed regions, because the SparseConvNet learns during training to extend features toward regions with content, based on the image-conditioned features that carry instructive context information at each feature point. Besides, the geometry volume also benefits our progressive rendering pipeline, which will be detailed in Sect. 3.4.
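As an illustration of the step that turns the 6,890 per-vertex features into a sparse volume, the sketch below voxelizes the aggregated vertex features before the sparse convolution stage. The voxel size and the mean-pooling of vertices that fall into the same voxel are assumptions; the SparseConvNet itself (e.g., from a sparse convolution library) is omitted.

```python
import torch

def voxelize_vertex_features(verts, vert_feats, voxel_size=0.005):
    """verts: (L, 3) SMPL vertices; vert_feats: (L, C) aggregated features F_l^v.
    Returns integer voxel coordinates (V, 3) and mean-pooled features (V, C),
    ready to be fed to a sparse 3D convolution network."""
    mins = verts.min(dim=0).values
    coords = ((verts - mins) / voxel_size).long()                      # (L, 3) voxel indices
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    pooled = torch.zeros(uniq.shape[0], vert_feats.shape[1], dtype=vert_feats.dtype)
    pooled.index_add_(0, inverse, vert_feats)                          # sum features per voxel
    counts = torch.bincount(inverse, minlength=uniq.shape[0]).to(vert_feats.dtype)
    return uniq, pooled / counts.unsqueeze(1)                          # mean pooling
```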

3.2 Density Network

The density network predicts the opacity of each sampling point \(\textbf{p}_i\), which is highly related to the geometry of the human body, such as its posture and shape. Through the geometry-guided multi-view integration module in Sect. 3.1, we construct a geometry feature volume \(\tilde{\textbf{F}}^v\) that provides sufficiently reliable geometry information of the target human body. As shown in Fig. 2 (b), for each sampling point \(\textbf{p}_i\), we obtain its corresponding geometry-related feature \(\tilde{\textbf{F}}^v_i\) from \(\tilde{\textbf{F}}^v\) based on its coordinates. Though the feature volume provides the geometry information of the human body, such geometry-related features are coarse and may lose some fine image-conditioned details that benefit high-fidelity rendering. To compensate for this information loss, we combine these two kinds of features at each sampling point to predict its density value more accurately. Therefore, we concatenate \(\tilde{\textbf{F}}^v_i\) with the mean (\(\boldsymbol{\mu }\)) and variance (\(\textbf{v}\)) feature embeddings of its corresponding pixel-aligned image features \(\{\textbf{F}_m(\pi (\textbf{p}_{i})), m=1, 2, ..., M\}\), which contain more detailed information, and process the concatenated feature through a density MLP to predict the density value at this point.
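A minimal sketch of this density branch is given below: the geometry feature sampled at \(\textbf{p}_i\) is concatenated with the mean and variance of its pixel-aligned image features and passed through a small MLP. The layer widths and the ReLU used to keep the density non-negative are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DensityMLP(nn.Module):
    def __init__(self, geo_dim=32, img_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(geo_dim + 2 * img_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, geo_feat, pix_feats):
        """geo_feat: (P, geo_dim) features sampled from the geometry volume.
        pix_feats: (M, P, img_dim) pixel-aligned image features F_m(pi(p_i))."""
        mu = pix_feats.mean(dim=0)                       # (P, img_dim)
        var = pix_feats.var(dim=0, unbiased=False)       # (P, img_dim)
        x = torch.cat([geo_feat, mu, var], dim=-1)
        return F.relu(self.net(x)).squeeze(-1)           # sigma_i >= 0 (assumption)
```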

3.3 Appearance Network

The appearance network aims to predict the RGB color value for each sampling point \(\textbf{p}_i\). Since the RGB value is more related to the appearance details of the human body, we utilize the image-conditioned features as input to the appearance network for more detailed information. As shown in Fig. 2 (c), we first aggregate the pixel-aligned image features from the input views for each color sampling point \(\textbf{p}_i\). Specifically, similar to obtaining the pixel-aligned image features for each SMPL vertex, we project the coordinate of \(\textbf{p}_i\) to the image plane of each source view and obtain the pixel-aligned feature embedding, denoted as \(\textbf{F}_m(\pi (\textbf{p}_{i}))\). We then concatenate \(\textbf{F}_m(\pi (\textbf{p}_{i}))\) from the M source views with their mean (\(\boldsymbol{\mu }\)) and variance (\(\textbf{v}\)) feature embeddings. Afterwards, based on the concatenated feature embedding, an appearance MLP is deployed to predict the RGB value \(\hat{\textbf{c}}_i = (\hat{r}_i, \hat{g}_i, \hat{b}_i)\) for the corresponding point \(\textbf{p}_i\).
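Analogously, the appearance branch can be sketched as below, concatenating the per-view pixel-aligned features with their mean and variance and mapping them to RGB. The hidden width, the sigmoid output activation, and the fixed number of views are assumptions.

```python
import torch
import torch.nn as nn

class AppearanceMLP(nn.Module):
    def __init__(self, img_dim=32, num_views=3, hidden=128):
        super().__init__()
        in_dim = img_dim * (num_views + 2)               # M per-view features + mean + variance
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())          # RGB in [0, 1] (assumption)

    def forward(self, pix_feats):                        # pix_feats: (M, P, img_dim)
        M, P, C = pix_feats.shape
        mu = pix_feats.mean(dim=0)
        var = pix_feats.var(dim=0, unbiased=False)
        x = torch.cat([pix_feats.permute(1, 0, 2).reshape(P, M * C), mu, var], dim=-1)
        return self.net(x)                               # (P, 3) predicted colors c_i
```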

3.4 Geometry-Guided Progressive Rendering

We render the human body in the target view through volumetric rendering, following previous NeRF-based methods [10, 18, 24]. Instead of sampling many redundant points for rendering, we introduce an efficient geometry-guided progressive rendering pipeline for the inference process. Our pipeline leverages the geometry volume from Sect. 3.1 as well as the predicted density values from Sect. 3.2 to progressively reduce the number of sampling points.

Specifically, we first preserve the sampling points that fall inside the geometry volume \(\tilde{\textbf{F}}^v\) as valid density sampling points \(\textbf{p}_i^d\). Compared to the smallest pillar containing the human body used by previous methods [10, 24], the geometry volume is closer to the human body shape and contains far fewer redundant void sampling points. We then predict the density values for \(\textbf{p}_i^d\) through the density network, and the sampling points with positive density values form a valid density volume. As shown in Fig. 2, the valid density volume is very close to the 3D mesh of the target human body and further removes many empty regions compared to the geometry volume. We take the sampling points in the valid density volume as the new valid sampling points \(\textbf{p}_i^c\) and further predict their color values through the appearance network in Sect. 3.3.
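The two-stage point reduction described above can be sketched as follows: points are first filtered by an occupancy mask derived from the geometry volume, and only the surviving points with positive predicted density are forwarded to the appearance network. The occupancy-grid representation and tensor layouts are assumptions.

```python
import torch

def progressive_filter(points, occupancy, grid_min, voxel_size, density_fn):
    """points: (P, 3) uniformly sampled points along the marched rays.
    occupancy: (X, Y, Z) bool mask of voxels covered by the geometry volume.
    density_fn: maps (K, 3) points to (K,) densities.
    Returns the density-stage mask p_i^d, the color-stage mask p_i^c, and sigma."""
    idx = ((points - grid_min) / voxel_size).long()
    shape = torch.tensor(occupancy.shape, device=points.device)
    in_bounds = ((idx >= 0) & (idx < shape)).all(dim=-1)
    valid_d = in_bounds.clone()                          # stage 1: inside the geometry volume
    ib = idx[in_bounds]
    valid_d[in_bounds] = occupancy[ib[:, 0], ib[:, 1], ib[:, 2]]
    sigma = torch.zeros(points.shape[0], device=points.device)
    sigma[valid_d] = density_fn(points[valid_d])         # density MLP runs only on p_i^d
    valid_c = sigma > 0                                  # stage 2: appearance MLP only on p_i^c
    return valid_d, valid_c, sigma
```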

We conduct volume rendering based on the density and color predictions to synthesize the target view \(\textbf{I}_t\). Traditional volume rendering methods march rays \(\textbf{r}\) from the target camera through the pixels of the target-view image, and then sample N points on each \(\textbf{r}\). Denoting the distance between two adjacent sampling points on \(\textbf{r}\) as \(\delta \), we can formulate the color rendering process for each \(\textbf{r}\) as:

$$\begin{aligned} \begin{aligned}&\hat{C}(\textbf{r})=\sum _{i=1}^{N} T_{i}\left( 1-\exp \left( -\sigma _{i} \delta _{i}\right) \right) \hat{\textbf{c}}_{i}, \\&\text {where}~T_{i}=\exp \left( -\sum _{j=1}^{i-1} \sigma _{j} \delta _{j}\right) . \end{aligned} \end{aligned}$$
(3)
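A direct PyTorch implementation of Eq. (3) is given below for reference; the tensor shapes (rays \(\times \) samples) are assumptions.

```python
import torch

def volume_render(sigma, rgb, deltas):
    """sigma: (R, N) densities; rgb: (R, N, 3) colors; deltas: (R, N) distances
    between adjacent samples. Returns the rendered colors C_hat(r), shape (R, 3)."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                           # (R, N)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j), via an exclusive cumulative sum
    trans = torch.exp(-torch.cumsum(sigma * deltas, dim=-1))
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alpha                                            # (R, N)
    return (weights[..., None] * rgb).sum(dim=1)                       # (R, 3)
```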

For our progressive rendering pipeline, we use projection to bind the sampling points to the rays \(\textbf{r}\). Concretely, we project the points within the geometry volume to the target view, take the four pixels nearest to each projected point as valid pixels from which to march rays, and then uniformly sample N points between the near and far bounds as in [10, 24]. We only process the sampling points within the valid volume regions and then conduct volume rendering along the rays \(\textbf{r}\).

Experiments in Sect. 4.4 verify that our geometry-guided progressive rendering pipeline reduces the memory and time consumption during rendering significantly, and that our performance is even improved by removing noisy, unnecessary sampling points.

3.5 Training

During training, we do not deploy the progressive rendering pipeline in Sect. 3.4, because it is useful only when the density network is reliable. Instead, we march rays from the target camera to pixels randomly sampled on the image, ensuring that no fewer than half of the pixels lie on the human body. We uniformly sample points on the rays to predict the corresponding density and color values. By performing the volume rendering in Eq. (3), we obtain the predicted color \(\hat{C}(\textbf{r})\) for each \(\textbf{r}\). To supervise the network, we compute the mean squared error between \(\hat{C}(\textbf{r})\) and the corresponding ground-truth color \({C}(\textbf{r})\) as our training loss \(\mathcal {L}_{rgb}\).
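The training objective can be sketched as below: pixels are sampled so that at least half lie on the body mask, and the rendered ray colors are supervised with a mean squared error. The ray count and the exact body/background split are assumptions.

```python
import torch

def rgb_loss(pred_rgb, gt_rgb):
    """pred_rgb, gt_rgb: (R, 3) per-ray colors; returns the scalar L_rgb."""
    return ((pred_rgb - gt_rgb) ** 2).mean()

def sample_ray_pixels(body_mask, num_rays=1024):
    """Sample pixel coordinates so that at least half lie on the human body.
    body_mask: (H, W) bool foreground mask; returns (num_rays, 2) (row, col) indices."""
    body = body_mask.nonzero()                      # pixels on the body
    bg = (~body_mask).nonzero()                     # background pixels
    n_bg = num_rays // 2
    n_body = num_rays - n_bg                        # at least half on the body
    pick_body = body[torch.randint(len(body), (n_body,))]
    pick_bg = bg[torch.randint(len(bg), (n_bg,))]
    return torch.cat([pick_body, pick_bg], dim=0)
```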

4 Experiments

We study four questions in the experiments. 1) Does GP-NeRF improve the fitting and generalization performance of human synthesis on seen and unseen scenarios (Sect. 4.3)? 2) Is GP-NeRF effective at reducing the time and memory cost of rendering (Sect. 4.4)? 3) How does each individual design choice affect model performance (Sect. 4.5)? 4) Can GP-NeRF provide promising results for both human rendering and 3D reconstruction (Sect. 4.6)? We describe the datasets and evaluation metrics in Sect. 4.1, and our default implementation setting in Sect. 4.2.

Table 1. Synthesis performance comparison. Our proposed method outperforms existing methods in all settings.

4.1 Datasets and Metrics

We train and evaluate our method on the ZJU-MoCap dataset [24] and the THUman 1.0 dataset [37]. ZJU-MoCap contains 10 sequences captured with 21 synchronized cameras. We split the 10 sequences into a training set of 7 sequences and a test set of the remaining 3 sequences, following [10] for a fair comparison. THUman contains 202 3D scans of human bodies. \(80\%\) of the scans are taken as the training set, and the remaining ones form the test set. We render images for each scan from 24 virtual cameras, which are uniformly placed on the horizontal plane.

To evaluate the rendering performance, we choose two metrics: peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) following [18, 24]. For the 3D reconstruction, we only provide the qualitative results since the corresponding ground truth is not available.
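For reference, a minimal implementation of PSNR for images normalized to [0, 1] is shown below; SSIM (e.g., from scikit-image) is omitted.

```python
import torch

def psnr(pred, gt):
    """pred, gt: image tensors with values in [0, 1]; higher is better."""
    mse = ((pred - gt) ** 2).mean()
    return -10.0 * torch.log10(mse)
```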

4.2 Implementation Details

In our implementation, we perform training and inference with an image size of \(512 \times 512\) under \(M=3\) camera views, where the horizontal angle interval is around \(120^\circ \) (Uniform). We utilize a U-Net-like architecture [33] as our backbone to extract the image features \(\textbf{F}\) in Sect. 3, with a feature dimension of 32. We sample \(N=64\) points uniformly between the near and far bounds on each ray. For training, we utilize the Adam optimizer [9], and the learning rate starts at \(1e-4\) and decays exponentially over 180k steps. We use one RTX 3090 GPU with a batch size of 1 for both training and inference.
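A sketch of this optimization setup is shown below; the final learning rate and the per-step granularity of the exponential decay are assumptions.

```python
import torch

def build_optimizer(model, total_steps=180_000, lr_init=1e-4, lr_final=1e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_init)
    # Per-step multiplicative factor so the LR decays from lr_init to lr_final
    gamma = (lr_final / lr_init) ** (1.0 / total_steps)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler
```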

4.3 Synthesis Performance Analysis

In Table 1, we compare our human rendering results to previous state-of-the-art methods. To evaluate the capacity of fitting and generalization on different levels, we train our framework on the first 300 frames of 7 training video sequences of ZJU-MoCap (ZJU-7), and test on 1) the training frames, 2) unseen frames of ZJU-7, and 3) test frames from the 3 test sequences (ZJU-3), respectively. The results in Table 1 verify our advanced generalization capacity on the unseen scenarios. We also achieve competitive fitting performance on the training frames, even comparable to the per-scene optimization methods [24, 32, 34].

Notably, our method outperforms the state-of-the-art NHP [10], which utilizes the geometry prior together with features from multi-view videos. Specifically, for unseen poses and unseen bodies, we outperform NHP by 0.98 and 1.21 dB in PSNR, and by \(0.5\%\) and \(1.5\%\) in SSIM respectively, using only single-frame input. We also conduct generalization experiments across two datasets with different domains. We train our model on 7 random human bodies from the THUman dataset (THUman-7) and on all 202 human bodies (THUman-all) separately, and test the synthesis performance on the test frames of ZJU-3. From Table 1, we observe that our method outperforms NHP by a large margin under the cross-dataset evaluation setup, i.e., around 7.7 dB and 13.6% improvements in PSNR and SSIM respectively. All these results demonstrate the effectiveness of our geometry-guided multi-view information integration approach.

4.4 Efficiency Analysis

In Table 2, we analyze the efficiency improvements gained from our progressive pipeline on the first 300 frames of the 315 (Taichi) sequence of the ZJU-MoCap dataset.

Table 2. Computation and memory cost comparison. GP-NeRF\(^{\dag }\) has the same structure as our GP-NeRF but adopts the vanilla rendering technique. \(\times N\) indicates that the sampling points are split into N chunks to be processed. #\(\textbf{r}\) denotes the number of sampling rays; #\(\textbf{p}^d\) and #\(\textbf{p}^c\) denote the numbers of sampling points processed by the density network and the appearance network, respectively. T\(^d\)-total indicates the total time from the backbone output to the density volume, including T\(^d\)-MLP, the forwarding time of the density MLP. T\(^c\)-total is the time from the density volume to the color prediction, and T\(^c\)-MLP is the time for the appearance MLP.

Considering the limited GPU memory, our final GP-NeRF can process all the sampling points in one run, while GP-NeRF\(^{\dag }\) and NB [24] require at least two. As shown in the upper panel of Table 2, compared to NB, NHR and NHP, which also use the SMPL bounds to remove redundant marched rays, our GP-NeRF can further remove \(38.1\%\) of the rays and \(76.4\%\) of #\(\textbf{p}^d\) by referring to the constructed geometry volume, and remove \(94.0\%\) of #\(\textbf{p}^c\) based on the valid density volume. Compared to NB, NHR and NHP, our GP-NeRF \(2\times \) costs \(60\%-79\%\) less time with lower memory. For a fair comparison to GP-NeRF\(^{\dag }\) \(2\times \), we also test the speed of GP-NeRF with 2 chunks, and our progressive pipeline still reduces the time cost by \(57\%\) and the memory cost by \(52.4\%\), which verifies the significant efficiency improvement brought by the proposed rendering pipeline.

In the bottom panel of Table 2, we compare the time cost of each component in GP-NeRF to that of GP-NeRF\(^{\dag }\) without progressive point reduction. The results show that we reduce over \(74\%\) and \(63\%\) of the time cost for density MLP forwarding and for the total density-related time T\(^d\)-total respectively, simply by using our progressive rendering pipeline on the same network structure. Our pipeline also reduces over \(92\%\) of the time cost for appearance MLP forwarding. Moreover, our progressive pipeline improves efficiency significantly while even improving the PSNR metric by \(0.4\%\), as it ignores some noisy sampling points during rendering that might degrade performance.

Table 3. Ablations: feature integration. G, Q, P are different approaches to obtain input features for the shared density and appearance network. G: geometry feature volume; Q: integrating multi-view information at each geometry vertex; P: pixel-aligned image features.
Table 4. Ablations: progressive structure. G, Q, P have the same meanings as in Table 3. Disentangle indicates whether the density and appearance networks are arranged in a progressive pipeline. Steps denotes the number of training steps. The Density and Appearance columns list the components of the input features.

4.5 Ablation Studies

We conduct ablation studies under the uniform camera setting of Sect. 4.2 to verify the effectiveness of our main designed components on generalization capacity. We train our model on the 7 training sequences of the ZJU-MoCap dataset for 35k steps and validate it on the remaining 3 sequences.

Feature Integration. In Table 3, we explore the effectiveness of the proposed geometry-guided feature integration mechanism on the baseline GP-NeRF, i.e., GP-NeRF without the progressive rendering pipeline. As shown in Table 3, adaptively aggregating multi-view image features under the guidance of the geometry prior to construct the geometry feature volume (QG) achieves better performance (i.e., 0.21 dB and \(0.5\%\) improvements in PSNR and SSIM respectively) than the baseline that simply uses the mean of the multi-view image features (G), as the proposed geometry-guided attention module helps focus more on the views that correspond to the geometry prior. We also observe that the baseline using only pixel-aligned image features (P) gains 2.41 dB PSNR and \(3\%\) SSIM over the baseline using only the geometry feature (G), as it captures more detailed appearance features from the images for high-fidelity rendering. Moreover, by combining the geometry feature with its corresponding detailed image features (QG+P), we improve upon P by 0.6 dB PSNR and \(0.9\%\) SSIM. This indicates that the geometry and the pixel-aligned image features compensate each other for better generalization performance on unseen scenarios.

Progressive Structure. Our progressive rendering pipeline in Sect. 3.4 requires a progressive structure of the density and appearance networks. Under the same experimental settings, we further decouple the density and appearance networks to form a progressive pipeline as in Fig. 2 and evaluate the performance. As shown in Table 4, the progressive structure does not harm the performance and even reaches relatively high performance faster. This is because it allows the two networks to learn their different focuses, thus improving the performance more quickly during training. For the density network, involving more detailed image features P can enhance the relatively coarse geometry feature QG and brings around \(0.5\%\) improvement in SSIM. The results also show that the geometry feature QG is much more impactful on the geometry-related density prediction than on the appearance-related color prediction.

Fig. 3.

Visualization comparisons on human rendering. Compared to other methods, ours synthesizes more high-fidelity details such as clothing wrinkles and reconstructs the body shape more accurately. Our synthesis adheres to normal human body geometry better than methods without geometry priors such as NT and NHR. We also recover more accurate lighting conditions than the previous video-based generalizable method NHP on unseen bodies (as in (b) and (c)).

4.6 Visualization

We visualize our human rendering results under three uniform camera views in different experimental settings (Fig. 3). As Fig. 3 (a), (b) and (c) show, compared with other approaches, our method achieves better quality on unseen poses or bodies by synthesizing more high-fidelity details such as clothing wrinkles and reconstructing the body shape more accurately. In Fig. 3 (d), we show rendering results on unseen bodies of the THUman dataset after training on it. Our method generalizes well within the THUman dataset and synthesizes accurate details.

Fig. 4.

Visualization of our 3D reconstruction results. The color of the mesh is only for clearer visualization. By integrating multi-view information to form a complete geometry volume adapted to the target human body, our method can compensate for some limitations of SMPL (e.g., inaccurate shape or lack of clothing information) and can generally reconstruct a very close human body shape and even clothing details such as hoods and folds on unseen human bodies (as in (b)). We generalize better on unseen human bodies than previous image-based 3D reconstruction methods such as PIFuHD, which predicts incomplete or redundant body parts in its reconstruction results (as in (b)).

In Fig. 4, we visualize the density volume from the density MLP in Sect. 3.2 as the mesh result of our 3D reconstruction. Different from previous methods that densely sample points within the bounds of the geometry prior and determine the inside points through the density network for mesh construction, our progressive pipeline directly takes the sampling points from the geometry volume in Sect. 3.1, which contains far fewer redundant points and thus is more efficient for 3D reconstruction. We then construct the mesh based on the points with higher density values. As Fig. 4 (b) shows, on unseen human bodies, previous image-based 3D reconstruction methods such as PIFuHD [28] do not generalize well. Besides their lower efficiency due to making predictions for a large number of redundant sampling points, they are more likely to predict body parts that do not conform to a normal human body structure, because they cannot integrate and adapt the given geometry information as well as we do. As shown in Fig. 4, by integrating multi-view information to form a complete geometry volume adapted to the target human body, our method can generally reconstruct a very close human body shape and even clothing details such as folds, even on unseen human bodies (Fig. 4 (b)).
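As an illustration of this reconstruction step, the sketch below scatters the predicted densities of the valid points into a dense grid and extracts a mesh with marching cubes from scikit-image; the density threshold and grid layout are assumptions.

```python
import numpy as np
from skimage import measure

def density_to_mesh(coords, sigmas, grid_shape, level=5.0):
    """coords: (P, 3) integer voxel indices of the valid sampling points;
    sigmas: (P,) predicted densities; grid_shape: (X, Y, Z) of the geometry volume."""
    grid = np.zeros(grid_shape, dtype=np.float32)
    grid[coords[:, 0], coords[:, 1], coords[:, 2]] = sigmas
    # Threshold the density field and extract the iso-surface
    verts, faces, _, _ = measure.marching_cubes(grid, level=level)
    return verts, faces
```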

5 Conclusion

We propose a geometry-guided progressive NeRF model for generalizable and efficient free-viewpoint human rendering under sparse camera settings. With our geometry-guided multi-view feature aggregation approach, the geometry prior is effectively enhanced with the integrated multi-view information, forming a complete geometry volume adapted to the target human body. The geometry feature volume combined with the detailed image-conditioned features benefits the generalization performance on unseen scenarios. We also introduce a progressive rendering pipeline for higher efficiency, which reduces the rendering time cost by over \(70\%\) without performance degradation. Experimental results on two datasets verify that our model outperforms previous methods significantly in generalization capacity and efficiency.