
1 Introduction

Human relighting aims to relight a target human with correct shadow effects under a desired illumination. Relighting enables seamless replacement of the background while keeping the lighting of the foreground and background consistent, and has broad application prospects in film-making, video chat and Virtual Reality. Single-image human relighting for a general person is convenient and promising for amateur photographers, but also more challenging because of the difficulty of decoupling lighting, geometry, and surface material from a single image.

Most existing single-image human relighting methods focus on portrait relighting [5, 16, 18, 22, 32, 33, 37, 38, 40, 43, 44, 46, 47, 50,51,52, 58, 59] and only a few works [11, 25, 28] address single-image full-body relighting. Since the style, texture and material of human clothes vary widely and the geometry and poses of clothed humans are usually complex, decoupling albedo and lighting is highly ill-posed. Moreover, mutual occlusion of limbs produces self-shadows, which are not only difficult to remove in de-lighting tasks, but also challenging to generate under the target lighting conditions in relighting tasks. Previous full-body relighting methods [11, 25, 28] have attempted to address these issues, but their results still have the following drawbacks: (i) inability to disambiguate albedo and lighting, and (ii) inability to model high-frequency shadows due to reliance on the spherical harmonics representation of lighting.

Fig. 1.

Given a single human image and an arbitrary high dynamic range lighting environment, our framework estimates a high-quality albedo map and generates photorealistic relit images of the target human subject under the desired lighting conditions.

We propose a novel geometry-aware single-image human relighting framework that builds on SOTA single-image geometry reconstruction [19, 42] to exploit traditional graphics rendering and neural rendering techniques simultaneously. First of all, accurate intrinsic decomposition for albedo estimation (de-lighting) is the cornerstone of producing high-quality relighting results. Previous methods [11, 25, 28] use a UNet to infer albedo and suffer from severe entanglement between lighting and albedo. We find that the skip connections in the UNet are the culprit and that HRNet [49] performs much better for de-lighting. In addition, we further improve de-lighting performance by: i) removing the aggressive down-sampling operations in the early stage, ii) eliminating skip connections and iii) fusing multi-scale features while maintaining high-resolution representations in the HRNet, finally achieving better disentanglement between lighting and albedo on both synthetic and real images.

More importantly, even with high-quality albedo, we still need photo-realistic shadows to produce realistic relit results. Limited by the expressive ability of the spherical harmonics (SH) lighting model and the lack of 3D geometry information, previous methods [11, 25, 28] can only produce low-frequency shadows. Low-resolution environment maps [50, 52] and pre-filtered environment maps [40] have also been explored but struggle to generate hard cast shadows. With the development of single-image 3D human reconstruction technology, obtaining a high-quality 3D human model from a single image has become possible, which provides additional 3D prior information for human relighting. Hence, we propose a geometry-aware relighting method, which consists of a ray tracing-based per-pixel lighting representation that explicitly models high-frequency shadows and a learning-based shading refinement module that restores realistic shadows. The key idea of our method is that, with the estimated 3D human model, we can render photo-realistic shading maps that encode full-band lighting information under target lighting conditions using ray tracing. This lighting representation preserves high-frequency shadows and significantly improves the quality of the relit results, which remains difficult for learning-based methods. However, the ray-traced shadows still suffer from artifacts due to errors in geometry estimation, which prevents direct composition of the relit results. Thus, we further propose a learning-based refinement module that takes the ray-traced shading maps and the inferred ambient occlusion map as input to restore photo-realistic high-frequency shadows with rich local details and clear shadow boundaries. The proposed framework dramatically improves relighting performance and produces photo-realistic relit images with high-frequency self-shadowing effects under the target lighting conditions.

To conclude, our main contributions are the following:

  • A novel geometry-aware single-image human relighting framework that combines single-image 3D human reconstruction, ray tracing and neural rendering technologies.

  • We demonstrate that a UNet with skip connections is not suitable for the de-lighting task and propose a modified HRNet that achieves better disentanglement between lighting and albedo on both synthetic images and real photographs.

  • We propose a ray tracing-based per-pixel lighting representation and a learning-based shading refinement module that utilizes the inferred ambient occlusion map as auxiliary input to restore photo-realistic shadows while preserving rich local details.

2 Related Work

2.1 Person-Specific Human Relighting

Whitted [55] first used recursive ray tracing to simulate global illumination and render realistic images under target lighting conditions. However, this forward rendering technique requires a high-precision 3D model and corresponding PBR textures of the target object, which are unavailable or costly for real-world objects. Debevec et al. [13] proposed collecting OLAT (one-light-at-a-time) images using a LightStage to synthesize a specific person’s face under novel illuminations with no 3D model required. Subsequent works improved the LightStage to capture higher-quality images and a larger range of human bodies [8, 12, 14, 20, 21, 53, 54, 57]. Guo et al. [20] combined image-based rendering techniques with geometry and materials estimated by the LightStage and achieved unprecedented quality and photorealism for free-viewpoint videos. However, collecting OLAT data, gradient data and object geometry requires professional equipment and is a time-consuming and complicated process for person-specific relighting. Li et al. [31] attempted to relight a target human from multi-view video recorded under unknown illumination conditions, and Imber et al. [23] extended this approach to scene relighting by introducing intrinsic textures. These two methods can produce high-quality relit results, but multi-view video must be collected for every target human or scene.

2.2 General Human Relighting

Deep learning makes general human relighting possible. Most existing general human relighting methods [5, 16, 18, 22, 32, 33, 35, 37, 38, 40, 43, 44, 46, 47, 50,51,52, 58, 59] focus on human portrait relighting. [40, 50, 57] achieved impressive single-image portrait relighting by utilizing datasets synthesized from OLAT images. For human full-body relighting, Meka et al. [36] combined traditional geometry pipelines with neural rendering to generate relit results using gradient images and estimated human geometry. Kanamori et al. [25] proposed an occlusion-aware single-image full-body human relighting method that infers albedo, geometry and illumination based on the spherical harmonics (SH) lighting model. On the basis of [25], Lagunas et al. [28] added specular reflectance and light-dependent residual terms to explicitly handle highlights, and Tajima et al. [11] used a residual network to restore neglected light effects. However, limited by the expressive ability of the SH lighting model, [11, 25, 28] can only produce low-frequency shadows and may fail under harsh illuminations.

2.3 Inverse Rendering

Ramachandran et al. [41] proposed the shape-from-shading method to estimate shape from shading given an input image with known lighting conditions. Subsequent works [10, 34, 39] assume simple light source models such as directional or point light sources. Based on Retinex theory [29], intrinsic image decomposition [4, 6, 7, 15, 27, 30, 45, 56] aims to decompose an input image into reflectance and shading. Recent single-image human relighting studies [11, 25, 28] have drawn on the idea of intrinsic images to decompose an input image into albedo, shape and illumination using a UNet. We further improve the intrinsic image decomposition results in the de-lighting stage by utilizing a modified HRNet [49] and achieve better disentanglement between albedo and lighting.

3 Overview

Following recent work [52], our framework consists of two stages: de-lighting and relighting. Figure 2 shows the architecture of the whole framework. For the de-lighting stage, we first use two geometry networks (Geometry Module) to infer the per-pixel normal map \(\widehat{N}\) and ambient occlusion map \(\widehat{AO}\) separately. Similar to [40], the inferred normal map and the input image are concatenated as the input of the Albedo Module to infer the albedo \(\widehat{A}\). For the relighting stage, we first render the 3D human models estimated by [42] and [19] (3D Recon Module) using ray tracing to obtain the coarse full-body shading map \(S_{coarse}^{body}\) and the coarse face shading map \(S_{coarse}^{face}\). Then the Refine Module takes these coarse shading maps and the inferred ambient occlusion map as input to produce the refined full-body shading map \(\widehat{S_{fine}^{body}}\) and refined face shading map \(\widehat{S_{fine}^{face}}\), which are composited together to obtain the final shading map \(\widehat{S}\). Details regarding the specific architecture and implementations of the networks are provided in the supplementary materials. Finally, the inferred albedo and the final shading map are multiplied pixel-wise (Hadamard product) to produce the relit image \(I_{relit}\) using the following formula:

$$\begin{aligned} I_{relit}=\widehat{A} \odot \widehat{S} \end{aligned}$$
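As a concrete illustration of this composition step, the following minimal NumPy sketch applies the pixel-wise product of the albedo and shading maps; the float image arrays and the final clipping (for LDR display only) are assumptions, not part of the method description above.

```python
import numpy as np

def compose_relit(albedo: np.ndarray, shading: np.ndarray) -> np.ndarray:
    """Pixel-wise (Hadamard) composition I_relit = A * S.

    albedo, shading: float arrays of shape (H, W, 3). The shading map may
    exceed 1.0 under bright HDR illumination, so clipping is for display only.
    """
    relit = albedo * shading              # element-wise product per pixel and channel
    return np.clip(relit, 0.0, 1.0)
```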
Fig. 2.

Illustration of our framework architecture. There are two stages in our method: de-lighting and relighting. The de-lighting stage takes the input image and outputs estimated albedo \(\widehat{A}\) (Sect. 4). For the relighting stage (Sect. 5), a full-body 3D model and a face 3D model are estimated by 3D Recon Module and then are sent to the renderer to render coarse shading maps (Sect. 5.1). The Refine Module takes the coarse shading maps and the inferred ambient occlusion map as input and produces the final shading map (Sect. 5.2).

4 De-lighting

One of the most important factors for achieving better de-lighting results is the network architecture. Although the UNet architecture has been widely used in previous de-lighting networks [25, 28, 40, 43, 52], we find that it usually produces results in which lighting and albedo are heavily entangled, possibly because the lighting and albedo features remain coupled in the UNet architecture.

To guarantee that lighting and albedo can be disentangled at the network architecture level, we modify the HRNet [49] architecture and achieve better de-lighting performance than the UNet, as shown in Fig. 3. Note that the vanilla HRNet can only produce an oversmoothed albedo map due to the aggressive downsampling operations in its early stage. Even with skip connections between the downsampled input features and the upsampled output features, the results of the vanilla HRNet remain poor, and unremoved shadows even appear in the inferred albedo map, especially in outdoor scenarios. We assume that skip connections have a negative impact on the final de-lighting results because high-resolution features from the input still contain environmental lighting information. Based on this assumption, we modify the vanilla HRNet by directly removing the downsampling operations (which also avoids skip connections) to fuse multi-scale features while maintaining high-resolution representations. Moreover, we modify the HRNet to output the lighting prediction at the lowest-resolution layer of the final stage. This strategy guarantees a better decomposition between lighting and albedo at both the architecture and the feature representation level. Details regarding the specific architecture and implementations are provided in the supplementary materials.
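To make these architectural changes concrete, below is a hypothetical two-branch PyTorch sketch of the idea: a stride-1 stem (no aggressive early downsampling), no skip connections from the input, mutual multi-scale fusion, and a lighting head attached to the lowest-resolution branch. The layer widths, depths and number of branches are illustrative assumptions only; the actual architecture is described in the supplementary materials.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchDelightingNet(nn.Module):
    """Hypothetical two-branch sketch of the modified-HRNet idea."""

    def __init__(self, ch_hi: int = 32, ch_lo: int = 64):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, ch_hi, 3, 1, 1), nn.ReLU(inplace=True))
        self.hi_branch = nn.Sequential(nn.Conv2d(ch_hi, ch_hi, 3, 1, 1), nn.ReLU(inplace=True))
        self.down = nn.Conv2d(ch_hi, ch_lo, 3, stride=2, padding=1)      # creates the low-res branch
        self.lo_branch = nn.Sequential(nn.Conv2d(ch_lo, ch_lo, 3, 1, 1), nn.ReLU(inplace=True))
        self.lo_to_hi = nn.Conv2d(ch_lo, ch_hi, 1)                       # fuse low -> high
        self.hi_to_lo = nn.Conv2d(ch_hi, ch_lo, 3, stride=2, padding=1)  # fuse high -> low
        self.albedo_head = nn.Conv2d(ch_hi, 3, 3, 1, 1)                  # high-res albedo output
        self.light_head = nn.Sequential(                                 # 16x32x3 lighting output
            nn.AdaptiveAvgPool2d((16, 32)), nn.Conv2d(ch_lo, 3, 1))

    def forward(self, x: torch.Tensor):
        hi = self.hi_branch(self.stem(x))           # full-resolution features kept throughout
        lo = self.lo_branch(self.down(hi))
        # multi-scale fusion while maintaining the high-resolution representation
        hi = hi + F.interpolate(self.lo_to_hi(lo), size=hi.shape[-2:],
                                mode='bilinear', align_corners=False)
        lo = lo + self.hi_to_lo(hi)
        albedo = torch.sigmoid(self.albedo_head(hi))
        light = self.light_head(lo)                 # lighting predicted at the lowest resolution
        return albedo, light
```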

Fig. 3.

Estimated albedo of a real image. The UNet comes from RH [25] and is retrained on our dataset. We zoom in on the area marked by the red box and place it to the right of the corresponding image. (Color figure online)

To estimate the albedo of an input human image, we employ losses as follows:

$$\begin{aligned} \begin{aligned} L_{DL}&= \lambda _{input}\left\| \widehat{A}-A \right\| _{1} +\lambda _{vgg} Vgg(\widehat{A},A) \\ {}&+\lambda _{light}\left\| w\odot \left( \log (1+\widehat{I_l})-\log (1+I_l)\right) \right\| _{2}^2 \end{aligned} \end{aligned}$$

where \(\widehat{A}\) is the inferred albedo and A is the ground truth albedo. Similar to portrait relighting [50], we use a weighted log-\(L_2\) loss for lighting. \(\widehat{I_l}\) and \(I_l\) are latitude-longitude representations of the environmental illumination and describe the estimated lighting and ground truth lighting, respectively. The ground truth lighting map is downsampled to \(16\times 32\times 3\) using a Gaussian pyramid. w is the solid angle of each “pixel”. \(L_{DL}\) is the total loss for the de-lighting stage, and \(\lambda _{input}\), \(\lambda _{vgg}\) and \(\lambda _{light}\) are weight factors. Empirically, we find that \(\lambda _{input}=500\), \(\lambda _{vgg}=100\) and \(\lambda _{light}=0.025\) achieve the best performance.
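A hedged PyTorch sketch of this loss is given below. `vgg_features` stands in for an assumed perceptual feature extractor returning a list of feature maps, and whether each norm is summed or averaged over pixels is an implementation choice not specified above.

```python
import math
import torch
import torch.nn.functional as F

def solid_angle_weights(h: int = 16, w: int = 32) -> torch.Tensor:
    """Per-texel solid angle of a lat-long environment map (rows near the poles
    cover less solid angle), shaped (1, 1, h, w) for broadcasting."""
    theta = (torch.arange(h, dtype=torch.float32) + 0.5) / h * math.pi   # polar angle per row
    per_row = torch.sin(theta) * (math.pi / h) * (2.0 * math.pi / w)
    return per_row.view(1, 1, h, 1).expand(1, 1, h, w)

def delighting_loss(albedo_pred, albedo_gt, light_pred, light_gt, vgg_features,
                    lam_input=500.0, lam_vgg=100.0, lam_light=0.025):
    """Sketch of L_DL: L1 albedo term + perceptual term + weighted log-L2 lighting term.
    `vgg_features` is an assumed callable returning a list of feature maps."""
    l_albedo = F.l1_loss(albedo_pred, albedo_gt)
    l_vgg = sum(F.l1_loss(fp, fg)
                for fp, fg in zip(vgg_features(albedo_pred), vgg_features(albedo_gt)))
    w = solid_angle_weights(*light_gt.shape[-2:]).to(light_gt.device)
    l_light = (w * (torch.log1p(light_pred) - torch.log1p(light_gt))).pow(2).mean()
    return lam_input * l_albedo + lam_vgg * l_vgg + lam_light * l_light
```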

5 Relighting

To render photorealistic shadows, we divide the relighting stage into two sub-stages: ray tracing and shading refinement. We first estimate 3D models of the target human because the classical ray tracing algorithm requires a 3D model of the target object.

For geometry estimation, given a single human image, we use PIFuHD [42] to estimate a complete 3D human model. Then we crop the human face from the input image and align it with the dense 3D face model following [19] by regressing the 3DMM parameters. Both the full-body 3D model and face 3D model have no texture because we only need to render shading maps of the target human.

5.1 Ray Tracing

For photorealistic rendering, we use the Cycles rendering engine in Blender [1] and a Principled BSDF shader. We set the camera mode to orthographic to ensure that the rendered shading map is pixel-aligned with the input image. Due to the limited generation capacity of the PIFuHD [42] network and memory constraints, the surface of the estimated full-body 3D model is not smooth enough, which may produce checkered artifacts after ray tracing. A schematic diagram of the artifacts is provided in the supplementary materials. To solve this problem, we apply Laplacian smoothing [48] to the full-body 3D model with cotangent weights and 10 smoothing steps. The smoothed full-body model is then sent to the renderer to render the full-body shading map under the target lighting environment using ray tracing. In addition, the generated 3D face model is sent directly to the renderer for ray tracing without smoothing.
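The following NumPy sketch illustrates the smoothing step on an estimated mesh. For brevity it uses uniform (umbrella) Laplacian weights rather than the cotangent weights used in Blender, so it is an approximation of the actual preprocessing rather than the exact operator.

```python
import numpy as np

def laplacian_smooth(vertices: np.ndarray, faces: np.ndarray,
                     iterations: int = 10, lam: float = 0.5) -> np.ndarray:
    """Uniform-weight Laplacian smoothing of a triangle mesh.
    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices."""
    num_v = len(vertices)
    neighbors = [set() for _ in range(num_v)]       # vertex adjacency from triangle edges
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    verts = vertices.astype(np.float64).copy()
    for _ in range(iterations):
        centroids = np.array([verts[list(n)].mean(axis=0) if n else verts[i]
                              for i, n in enumerate(neighbors)])
        verts += lam * (centroids - verts)          # move each vertex toward its neighbors' mean
    return verts
```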

In summary, we render a coarse full-body shading map \(S_{coarse}^{body}\) and a coarse face shading map \(S_{coarse}^{face}\) under the target lighting environment using ray tracing. The rendered coarse shading maps are pixel-aligned with the input image.

5.2 Shading Refinement

After ray tracing under the target lighting environment, we obtain roughly correct shadows (especially self-occlusions and hard shadows such as cast shadows) of the target human. However, the full-body 3D model reconstructed from a single image is not completely accurate, and thus the ray-traced shading maps contain unnatural shadows and obvious geometry errors. To enhance the realism of the ray-traced full-body shading map, we introduce two refinement networks to compensate for shadow-rendering errors and restore a high-quality shading map. The first refinement network is designed to inpaint the ray-traced full-body shading map and improve the overall quality of the shading details. The second refinement network is designed to refine facial shading details so that we can fully leverage the geometry priors of human faces. Figure 4 shows the entire shading refinement process.

Fig. 4.

Left: Illustration of the Refine Module. A refined full-body shading map is inferred by the full-body refinement network. Then the face cropped from the refined full-body shading map and the coarse face shading map are concatenated as the input of the face refinement network, which outputs the refined face shading map. Finally, the refined face shading map and the refined full-body shading map are composited to generate the final relit shading map. Right: Refined shading maps. (a) 3D model estimated by PIFuHD [42] (b) Ray-traced shading map (c) Refined shading map without the inferred ambient occlusion map (d) Inferred normal map (e) Refined shading map with the inferred normal map (f) Inferred ambient occlusion map (g) Refined shading map with the inferred ambient occlusion map. Cropped faces of the corresponding shading maps are placed at the top.

Full-Body Shading Refinement. The full-body refinement network takes the coarse full-body shading \(S_{coarse}^{body}\) and the inferred ambient occlusion map \(\widehat{AO}\) as input and outputs the refined full-body shading residual. We choose an ambient occlusion map instead of a normal map for the following three reasons. First, the ambient occlusion map can supplement part of the self-shadows lost due to geometry prediction errors. Second, compared with inferring a normal map, inferring an ambient occlusion map is more robust under various lighting environments even with extreme lighting distributions. Third, existing 3D human reconstruction methods such as [24, 42] are highly dependent on the surface normal to predict 3D geometry surface details, which means that the normal prediction errors are consistent with the geometry errors and \(S_{coarse}^{body}\) cannot obtain extra correct geometry information from the normal map to compensate for existing shading errors. Figure 4 shows the refined results with the normal and ambient occlusion maps as auxiliary inputs, respectively.

The architecture of the full-body refinement network is similar to MIMO-UNet [9], and we deepen the network to 4 downsampling operations. The coarse full-body shading map \(S_{coarse}^{body}\) and the inferred ambient occlusion map \(\widehat{AO}\) are concatenated as the original-scale input, and the downsampled \(\widehat{AO}\) is used as the multi-scale input. We use multi-scale outputs for supervision. We add an adversarial loss on the final highest-resolution output layer to help the network generate plausible shading effects and use a PatchGAN as the discriminator. The training loss consists of a content loss, an FFT loss and a PatchGAN loss, and is defined as follows:

$$\begin{aligned} \begin{aligned} L_{fb}&= \lambda _{content}\sum _{k=1}^{K}\left\| \widehat{S}_{fine}^{body^k}-S^k \right\| _{1}+\left\| P(\widehat{S}_{fine}^{body})-P(S) \right\| _{2}^2 \\ {}&+\lambda _{fft}\sum _{k=1}^{K}\left\| F(\widehat{S}_{fine}^{body^k})-F(S^k) \right\| _{1} \end{aligned} \end{aligned}$$

where K is the number of levels and \(K=5\); \(\widehat{S}_{fine}^{body^k}\) is the \(k\)-th level output and \(S^k\) is the \(k\)-th downsampled ground truth shading map. P is the PatchGAN discriminator, and F() denotes the fast Fourier transform (FFT). \(\lambda _{content}\) and \(\lambda _{fft}\) are weight factors, and we empirically set \(\lambda _{content}=10\) and \(\lambda _{fft}=0.001\).
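The PyTorch sketch below illustrates how the three terms could be combined, under the assumption that `preds` and `gts` are the K multi-scale shading maps (coarse to fine) and `patch_disc` is an assumed PatchGAN discriminator whose responses are compared with the L2 term of the formula above.

```python
import torch
import torch.nn.functional as F

def fullbody_refine_loss(preds, gts, patch_disc, lam_content=10.0, lam_fft=0.001):
    """Sketch of L_fb: multi-scale content loss + FFT loss + adversarial term."""
    l_content = sum(F.l1_loss(p, g) for p, g in zip(preds, gts))
    # FFT loss: L1 distance between the 2-D Fourier spectra at every scale
    l_fft = sum(torch.mean(torch.abs(torch.fft.fft2(p) - torch.fft.fft2(g)))
                for p, g in zip(preds, gts))
    # adversarial term applied to the full-resolution output only
    l_adv = F.mse_loss(patch_disc(preds[-1]), patch_disc(gts[-1]))
    return lam_content * l_content + lam_fft * l_fft + l_adv
```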

Face Shading Refinement. Although the above-mentioned refinement network produces realistic shading with rich details, the human eye is very sensitive to the details of the face and is able to distinguish small geometry and shadow errors. Therefore, we continue to refine the face region on the basis of the refined full-body shading map \(\widehat{S_{fine}^{body}}\). The face refinement network takes \(S_{crop}^{face}\) cropped from \(\widehat{S_{fine}^{body}}\) and \(S_{coarse}^{face}\) as the input and outputs the refined face shading residual. The training loss can be expressed as follows:

$$\begin{aligned} L_{ff}=\lambda _{face}\left\| \widehat{S_{fine}^{face}}-S^{face} \right\| _{1}+\left\| L(\widehat{S_{fine}^{face}})-L(S^{face}) \right\| _{2}^2 \end{aligned}$$

where L is the LSGAN discriminator and \(\lambda _{face}\) is the weight factor, set to \(\lambda _{face}=5\). \(S^{face}\) is the ground truth face shading map. We use a UNet architecture as the backbone of the face refinement network; details regarding the architecture are provided in the supplementary materials.
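The following sketch illustrates the crop-refine-composite flow of the face refinement step under simplifying assumptions: `face_box` is an assumed axis-aligned face region (e.g. from the face alignment of [19]) and `face_net` an assumed residual-predicting network; the actual compositing may additionally blend the seam.

```python
import torch

def refine_face(shading_body: torch.Tensor, shading_face_coarse: torch.Tensor,
                face_box, face_net) -> torch.Tensor:
    """Crop the face from the refined full-body shading map, refine it together
    with the coarse ray-traced face shading, and composite it back."""
    t, l, h, w = face_box
    face_crop = shading_body[..., t:t + h, l:l + w]              # S_crop^face
    inp = torch.cat([face_crop, shading_face_coarse], dim=1)     # channel-wise concatenation
    face_refined = face_crop + face_net(inp)                     # residual prediction
    out = shading_body.clone()
    out[..., t:t + h, l:l + w] = face_refined                    # paste the refined face back
    return out
```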

6 Implementation Details

We carefully select 811 scanned 3D human figures with good lighting conditions from Twindom [2], of which 700 figures are used for training and 111 figures for testing. The use of the dataset has been officially approved by Twindom. We collect 480 panoramic lighting environments sourced from www.HDRIHaven.com [3] and rotate each of them in 36\(^\circ \) increments to generate a total of 4800 HDR environment maps. We allocate 4600 HDR environment maps for training and 200 HDR environment maps for testing. To balance the amount of lighting in indoor and outdoor scenes, we add an extra 150 indoor HDR environment maps from the Laval Indoor Dataset [17] to the testing dataset. None of the test lighting conditions appear in the training dataset. Details regarding the specific data rendering, training and testing are provided in the supplementary materials.
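A minimal sketch of this environment-map augmentation: rotating an equirectangular HDR map about the vertical axis amounts to a horizontal circular shift of its columns. The map resolution here is an arbitrary assumption.

```python
import numpy as np

def rotate_envmap(env: np.ndarray, degrees: float) -> np.ndarray:
    """Rotate an equirectangular HDR map (H, W, 3) about the vertical axis by
    circularly shifting its columns."""
    shift = int(round(env.shape[1] * degrees / 360.0))
    return np.roll(env, shift, axis=1)

# 480 source maps x 10 rotations (0, 36, ..., 324 degrees) -> 4800 maps in total
angles = range(0, 360, 36)
```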

Fig. 5.

Relit results on synthetic and real images. The first column: synthetic images from our testing dataset. The second column: real images. For real images, “Reference” denotes the rendered image of a virtual 3D human model under the target lighting conditions and is used to indicate the position of shadows. The target HDR environment map is placed under the ground truth image or reference image.

7 Experiments

In this section, we first compare our method with previous state-of-the-art methods quantitatively and qualitatively to show that our method performs better on the de-lighting task and produces more photorealistic relit results under challenging lighting conditions. Then we evaluate the key contributions of our proposed method and demonstrate the effectiveness of the entire framework and of each module. For quantitative comparisons, we adopt the MSE, PSNR and SSIM metrics to compare the inferred albedo, relit shading and relit images with the corresponding ground truth images in our testing dataset. For qualitative comparisons, we show intrinsic image decomposition results and relit results under target lighting conditions on both real images and synthetic images.

7.1 Comparisons with SOTA Methods

We compare our method with the state-of-the-art single-image human relighting methods RH [25], SFHR [28] and RHW [11]. All of them are based on intrinsic image decomposition and, as in our approach, require only a single human image and the target lighting for relighting. Details regarding the specific comparison settings are provided in the supplementary materials.

Table 1. Quantitative comparisons of our single-image human relighting framework against prior works. Shading indicates the estimated shading map under the target lighting condition. MSE values are scaled by \(10^{3}\).

Comparison on Synthetic Data. We first perform quantitative and qualitative comparisons on the testing dataset, where ground truth images are available. Table 1 shows a quantitative comparison of intrinsic image decomposition performance and single-image human relighting quality. Our method outperforms the competing methods on every metric for both tasks. To restrict the comparison to decomposition and relighting quality, all metrics are computed on the foreground region only for all methods.

Qualitative comparisons for relighting are shown in Fig. 5. To improve the visual quality, we use MODNet [26] to infer the alpha channel of the input image and replace the background of the relit images with the corresponding part of the environment lighting map. Benefiting from the combination of classical forward rendering and deep learning, our method can produce photorealistic shadows under arbitrary lighting conditions. As a comparison, RH [25], SFHR [28] and RHW [11] can only produce low-frequency shadows. Moreover, they also fail to relight images under outdoor lighting conditions and produce overly bright or dark relit results. As shown in Fig. 6, our method can produce plausible hard cast shadows and achieves better disentanglement between albedo and shading, which remain challenging for previous methods. Since all methods rely on albedo estimation first for relighting, we also compare de-lighting performance. As shown in Fig. 7, compared with RH [25], SFHR [28] and RHW [11], our method achieves better disentanglement between lighting and albedo and is able to remove large areas of shadows (even self-shadows from concave regions of the human body, e.g., armpits, crotch, the neck under the chin or folds on clothing).
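For reference, the background replacement used here for visualization is plain alpha compositing; the sketch below assumes the matte and the background crop have already been resized to the image resolution.

```python
import numpy as np

def composite_on_background(relit: np.ndarray, alpha: np.ndarray,
                            background: np.ndarray) -> np.ndarray:
    """Alpha-composite the relit foreground over a background crop taken from the
    environment map, for visualization only. `alpha` is the (H, W, 1) matte
    inferred by the matting network."""
    return alpha * relit + (1.0 - alpha) * background
```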

Comparison on Real-World Images. Although our method is trained on synthetic datasets, it is generalizable to real data. The second column of Fig. 6 and Fig. 7 show shading estimation and de-lighting results on real images, respectively, and the second column of Fig. 5 shows the relit results of images photographed in the real world under arbitrary and complex illumination conditions. RH [25], SFHR [28] and RHW [11] still suffer from the entanglement of lighting and albedo and struggle to produce high-frequency shadows. By comparison, our method performs better at removing detailed shadows from the original images and generating photorealistic shadows on the relit images.

Fig. 6.

Qualitative results for shading estimation under target lighting conditions. The first column: synthetic images from our testing dataset. The second column: real images.

Table 2. Quantitative results for ablation study of the full-body shading refinement.

7.2 Ablation Study

To demonstrate the effectiveness of our de-lighting network, full-body refinement network and face refinement network, we conduct comprehensive ablation studies both quantitatively and qualitatively.

First, we compare our de-lighting network with the vanilla UNet, the vanilla HRNet and the vanilla HRNet with skip connections. Table 3 shows the quantitative results of the different networks and Fig. 3 presents de-lighting results on a real image. The vanilla HRNet is HRNet-W32 [49] with two transposed convolution layers of stride 2 appended to ensure that the output size matches the input size. Based on the vanilla HRNet, the vanilla HRNet with skip connections further adds skip connections between the downsampled features of the first stage and the transposed convolution features of the output. “HRNet(w extra params)” means the vanilla HRNet with skip connections plus extra stages, modules and blocks. “Ours(5 stages)” indicates an architecture similar to that of our de-lighting network but with only 5 stages. The proposed network “Ours” outperforms the other networks on all metrics and achieves better disentanglement between lighting and albedo. By contrast, the vanilla HRNet fails to produce high-resolution results, and removing some self-shadows, such as those caused by clothing folds, remains difficult even with skip connections and extra parameters. Compared with the UNet, our method improves PSNR by 3 dB and halves the MSE.

Fig. 7.

De-lighting results on synthetic and real images. The first column: synthetic images from our testing dataset. The second column: real-world images.

Table 3. Quantitative results for the ablation study of the de-lighting network. The vanilla UNet comes from RH [25] and is trained on our dataset.

Second, we verify the effectiveness of the ambient occlusion map used by the full-body refinement network. “Refinement Net(w/o ambient)” means that the refinement network takes only the coarse full-body shading map as input and removes the SCM and FAM modules of MIMO-UNet [9]. “Refinement Net(w normal)” means that the refinement network takes the normal map rather than the ambient occlusion map as the auxiliary input. Table 2 shows quantitative results and Fig. 4 shows qualitative results. Without an ambient occlusion map, the refinement network cannot fill in missing geometry details and shadows. Moreover, when the input image is captured under extreme lighting conditions, normal map inference may fail around shadow boundaries. By contrast, the inferred ambient occlusion map is unaffected by shadows and recovers more geometry and occlusion details, thus restoring better shading maps.

Fig. 8.

Left: Ablation of the face refinement module. (a) Top: face cropped from the refined full-body shading map; Bottom: corresponding relit result. (b) Top: refined face shading map; Bottom: corresponding relit result. Right: Comparison with existing portrait relighting methods.

Finally, we evaluate the face refinement network. To highlight the role of this module, we present the cropped face shading for comparison. Figure 8 shows the qualitative results of face refinement. The face without refinement contains unnatural facial details such as a twisted nose and asymmetric eyes. By contrast, thanks to the geometry priors provided by 3DMM templates, the refined face has clearer and more natural facial features. Compared with DSPR [59] and SMFR [22], our method can generate plausible hard cast shadows, especially around the nose and neck, whereas SMFR [22] may produce patchy shadows and DSPR [59] may produce overexposed results.

8 Discussion

Conclusion. We propose a geometry-aware single-image human relighting framework that leverages 3D geometry priors to produce higher-quality relit results. Our framework contains two stages: de-lighting and relighting. For the de-lighting stage, we use a modified HRNet as the de-lighting network and achieve better disentanglement between lighting and albedo. For the relighting stage, we use ray tracing to render the shading map of the target human and further refine it with learning-based refinement networks. Extensive experiments demonstrate that our framework can produce photorealistic high-frequency shadows with clear boundaries under challenging lighting conditions and outperforms existing SOTA methods on both synthetic and real images.

Limitations. Due to the limitations of the dataset, we adopt a Lambertian material assumption for clothed humans, which fails to produce specular reflectance in the relit results. For the same reason, our de-lighting network struggles to remove highlights on the face. Moreover, inaccurate single-image geometry reconstruction may lead to unnatural refined shading results.