
1 Introduction

Hair is a critical component of human subjects. Rendering virtual 3D hair models into realistic images has long been studied in computer graphics, due to the extremely complicated geometry and material of human hair. Traditional graphics rendering pipelines try to simulate every aspect of natural hair appearance, including surface shading, light scattering, semi-transparent occlusions, and soft shadowing. This is usually achieved by leveraging physics-based shading models of hair fibers, global illumination rendering algorithms, and artistically designed material parameters. Given the extreme complexity of the geometry and the associated lighting effects, such a direct approximation of physical hair appearance requires a highly detailed 3D model, carefully tuned material parameters, and a huge amount of rendering computation. For interactive applications that require efficient feedback and user-friendly interaction, such as games and photo-editing software, this cost is often unaffordable.

With the recent advances in generative adversarial networks, it has become natural to formulate hair rendering as a special case of conditional image generation, with the hair structure controlled by the 3D model and the realistic appearance synthesized by neural networks. In the context of image-to-image translation, one of the major challenges is how to bridge the source and target domains for proper translation. Most existing hair generation methods fall into the supervised category, which demands enough training image pairs to provide direct supervision. For example, sketch-based hair generation methods [28, 34, 49] construct training pairs by synthesizing user sketches from real images. While several such methods have been introduced, rendering 3D hair models with the help of neural networks has not received similar treatment. The existing work on this topic [66] requires that the real and fake domains overlap considerably, such that a common structure is present in both. This is achieved at the cost of a complicated strand-level high-quality model, which allows extracting edge and orientation maps that serve as the common representation of hair structure between real photos and fake models. However, preparing such a high-quality hair model is itself expensive and non-trivial even for a professional artist, which significantly restricts the application scope of this method.

In this paper, we propose a generic neural-network-based hair rendering pipeline that provides efficient and realistic rendering of a generic low-quality 3D hair model, borrowing material features extracted from an arbitrary reference hair image. Instead of using a complicated strand-level model to match real-world hair as in [66], we allow users to use any type of hair model, requiring only that the isotropic structure of hair strands be properly represented. In particular, we adopt sparse polygon strip meshes, which are much more widely used in interactive applications [65]. Given the dramatic difference between such a coarse geometry and real hair, we cannot design common structure representations at the model level; therefore, supervised image translation methods are infeasible due to the lack of paired data.

To bridge the domains of real hair images and low-quality virtual hair models in an unsupervised manner, we propose to construct a latent space shared between the real and fake domains, which encodes a common structural representation from the distinct inputs of both domains and renders the realistic hair image from this latent space with the appearance conditioned by an extra reference. This is achieved by 1) domain-specific structure encodings used as the network inputs, to pre-disentangle geometric structure and chromatic appearance for both real hair images and 3D models; 2) a UNIT [39]-like architecture that enables the common latent space by partially sharing encoder weights between two auto-encoder branches trained with in-domain supervision; 3) a structure discriminator introduced to further match the distributions of the encoded structure features; and 4) supervised reconstruction enforced on both branches to guarantee that all necessary structure information is kept in the shared feature space. In addition, to enable temporally smooth animation rendering, we introduce a simple yet effective temporal conditioning method trained with single images only, utilizing the exact motion fields of the hair model. We demonstrate the effectiveness of the pipeline and each key component by extensive testing on a large number of diverse human portraits and various hair models. We also compare our method with general unsupervised image translation methods, and show that, due to the limited sampling ability on the synthetic hair domain, all existing methods fail to produce convincing results.

2 Related Work

Image-to-image translation aims at converting images from one domain to another while keeping the structure of the source image unchanged. The literature contains various methods performing this task in different settings. Paired image-to-image translation methods [27, 64] operate when pairs of images in both domains are available, e.g., semantic labels to scenes [8, 48, 64], edges to objects [54], and image super-resolution [29, 33]. However, paired data are not always available. Unsupervised image-to-image translation tackles the setting in which paired data are not available, while sampling from the two domains is possible [12, 26, 39, 40, 55, 58, 73]. Clearly, unpaired image-to-image translation is an ill-posed problem, for there are numerous ways an image can be transformed into a different domain. Hence, recently proposed methods introduce constraints to limit the number of possible transformations. Some studies enforce certain domain properties [1, 55], while other concurrent works apply cycle-consistency to transform images between different domains [31, 69, 73]. Our work differs from existing studies in that we focus on a specific and challenging problem, realistic hair generation, where we want to translate manually designed hair models from the domain of rendered images to the domain of real hair. For the purpose of controllable hair generation, we leverage the rendered hair structure and an arbitrary hair appearance to synthesize diverse realistic hairstyles. A further difference of our work from the image-to-image translation literature is the unbalanced data: the domain of images containing real hair is far more diverse than that of rendered hair, making the problem even more challenging for classical image-to-image translation methods.

Neural style transfer is related to image-to-image translation in that the image style is changed while the content is maintained [9, 16, 20, 25, 36,37,38, 63]. Style in this case is represented by the unique style of an artist [16, 63] or is copied from an example image provided by the user. Our work follows the idea of example-guided style transfer, in which the hairstyle is obtained from a real reference image. However, instead of changing the style of the whole image, our aim is to keep the appearance of the human face and background unchanged while having full control over the hair region. Therefore, instead of following existing works that inject style features into image generation networks directly [25, 48], we propose a new architecture that combines only hair appearance features and latent features that encode the image content and the adapted hair structure for image generation. This way, only the style of the hair region is manipulated according to the provided exemplar image.

Domain Adaptation addresses the domain-shift problem that widely exists between source and target domains [53]. Various feature-based methods have been proposed to tackle the problem [13, 17, 18, 32, 62]. Recent works on adversarial learning for embedded feature alignment between source and target domains achieve better results than previous studies [14, 15, 22, 41, 60, 61]. Efforts using domain adaptation for both classification and pixel-level prediction tasks have made significant progress [1, 10, 60]. In this work, we follow the challenging setting of unsupervised domain adaptation, in which there is no corresponding annotation between source and target domains. We aim at learning an embedding space that contains only hair structure information for both the rendered and the real domains. Considering the domain gap, instead of using original images as input, we use rendered and real structure maps as inputs to the encoders, which contain both domain-specific layers and shared layers, to obtain latent features. The adaptation is achieved by adversarial training and image reconstruction.

Hair Modeling, Rendering, and Generation share a similar goal with our paper, which is synthesizing photo-realistic hair images. With 3D hair models manually created [65, 70], captured [19, 23, 42, 47, 71], or reconstructed from images [3,4,5,6, 24, 72], traditional graphical hair rendering methods focus on improving rendering quality and performance, either by more accurately modeling the special hair material and lighting behaviours [11, 43, 44, 68], or by approximating certain aspects of the rendering pipeline to reduce the computational complexity [45, 50, 52, 67, 74]. However, the extremely high computation cost of realistic hair rendering usually prohibits direct application in real-time scenarios. Utilizing the latest advances in GANs, recent works [28, 34, 46, 49, 59] have achieved impressive progress on conditioned hair image generation as supervised image-to-image translation. A GAN-based hair rendering method [66] performs conditioned 3D hair rendering by starting with a common structure representation and progressively enforcing various conditions. However, it requires the hair model to generate a representation (strand orientation map) consistent with real images, which is challenging for low-quality mesh-based models, and it cannot achieve temporally smooth results.

Fig. 1.

The overall pipeline of our neural hair rendering framework. We use two branches to encode hair structure features, one for the real domain and the other for the fake domain. A domain discriminator is applied to the outputs of both encoders to achieve domain-invariant features. We also use two decoders to reconstruct images for the two domains. The decoder in the real domain differs from the one in the fake domain, for it is conditioned on a reference image. Additionally, to generate consistent videos, we apply a temporal condition on the real branch. During inference, we use the encoder in the fake branch to obtain hair structure features from a 3D hair model and the generator in the real branch to synthesize an appearance-conditioned image.

3 Approach

Let \(\textit{\textbf{h}}\) be the target 3D hair model, with camera parameters \(\textit{\textbf{c}}\) and hair material parameters \(\textit{\textbf{m}}\); we formulate the traditional graphics rendering pipeline as \(\text {R}_{t}(\textit{\textbf{h}}, \textit{\textbf{m}}, \textit{\textbf{c}})\). Likewise, our neural-network-based rendering pipeline is defined as \(\text {R}_n(\textit{\textbf{h}}, \textit{\textbf{r}}, \textit{\textbf{c}})\), with a low-quality hair model \(\textit{\textbf{h}}\) and material features extracted from an arbitrary reference hair image \(\textit{\textbf{r}}\).

3.1 Overview of Network Architecture

The overall system pipeline is shown in Fig. 1. It consists of two parallel branches, one for the domain of real photos (i.e., real) and one for synthetic renderings (i.e., fake).

On the encoding side, the structure adaptation subnetwork, which includes a real encoder \(E_r\) and a fake encoder \(E_f\), achieves cross-domain structure embedding \(\textit{\textbf{e}}\). Similar to UNIT [39], we share the weights of the last few ResNet layers in \(E_r\) and \(E_f\) to extract a consistent structural representation from the two domains. In addition, a structure discriminator \(D_s\) is introduced to match the high-level feature distributions between the two domains, further enforcing the shared latent space to be domain-invariant.

On the decoding side, the appearance rendering subnetwork, consisting of \(G_r\) and \(G_f\) for the real and fake domain respectively, is attached after the shared latent space \(\textit{\textbf{e}}\) to reconstruct the images in the corresponding domain. Besides the reconstruction losses, each decoder has its own domain discriminator, \(D_r\) and \(D_f\), to ensure that the reconstruction matches the domain distribution. The hair appearance is conditioned in an asymmetric way: \(G_r\) accepts the extra condition of material features extracted from a reference image \(\textit{\textbf{r}}\) by the material encoder \(E_m\), while the unconditional decoder \(G_f\) is asked to memorize the appearance; this is by design, for training data generation (Sect. 4.1).

At the training stage, all these networks are jointly trained using two sets of image pairs \((\textit{\textbf{s}}, \textit{\textbf{x}})\), one for each domain, where \(\textit{\textbf{s}}\) is a domain-specific structure representation of the corresponding hair image \(\textit{\textbf{x}}\). Each branch tries to reconstruct its image \(\textit{\textbf{x}}\) as \(G(E(\textit{\textbf{s}}))\) from the paired structure image \(\textit{\textbf{s}}\) independently through its own encoder-decoder network, while the shared structural features are enforced to match each other consistently by the structure discriminator \(D_s\). We set the appearance reference \(\textit{\textbf{r}} = \textit{\textbf{x}}\) in the real branch to fully reconstruct \(\textit{\textbf{x}}\) in a paired manner.

At the inference stage, only the fake branch encoder \(E_f\) and the real branch decoder \(G_r\) are activated. \(G_r\) generates the final realistic rendering using structural features encoded by \(E_f\) on the hair model. The final rendering equation \(\text {R}_n\) can be formulated as:

$$\begin{aligned} \text {R}_n(\textit{\textbf{h}},\textit{\textbf{r}},\textit{\textbf{c}})=G_r(E_f(\text {S}_f(\textit{\textbf{h}},\textit{\textbf{c}})),E_m(\textit{\textbf{r}})), \end{aligned}$$
(1)

where the function \(\text {S}_f(\textit{\textbf{h}},\textit{\textbf{c}})\) renders the structure encoded image \(\textit{\textbf{s}}_f\) of the model \(\textit{\textbf{h}}\) in camera setting \(\textit{\textbf{c}}\).
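For concreteness, the inference path of Eq. 1 can be sketched as follows; the module names (structure_renderer for \(\text {S}_f\), E_f, E_m, G_r) are illustrative placeholders rather than the actual implementation.

```python
import torch

def render_neural(hair_model, reference_img, camera,
                  structure_renderer, E_f, E_m, G_r):
    """Sketch of Eq. 1: R_n(h, r, c) = G_r(E_f(S_f(h, c)), E_m(r)).

    All callables are placeholders for the corresponding modules in Fig. 1.
    """
    with torch.no_grad():
        s_f = structure_renderer(hair_model, camera)  # structure map S_f(h, c)
        e = E_f(s_f)                                  # shared structure features
        a = E_m(reference_img)                        # appearance features of r
        return G_r(e, a)                              # realistic rendering
```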

3.2 Structure Adaptation

The goal of the structure adaptation subnetwork, formed by the encoding parts of both branches, is to encode cross-domain structural features to support the final rendering. Since the inputs to both encoders are manually disentangled structure representations (Sect. 4.1), the encoded features \(E(\textit{\textbf{s}})\) contain only structural information of the target hair. Moreover, since the appearance information is either provided as an extra decoder condition in a non-spatially-varying way, so that no structural information is leaked (the real branch), or simple enough to be memorized by the decoder (the fake branch) (Sect. 3.3), the encoded features should also include all the structural information necessary to reconstruct \(\textit{\textbf{x}}\).

\(E_r\) and \(E_f\) share a similar network structure: five downsampling convolutional layers followed by six ResBlks, where the last two ResBlks share weights to enforce the shared latent space. \(D_s\) follows PatchGAN [27] to distinguish between the latent feature maps of the two domains:

$$\begin{aligned} \mathcal {L}_{D_s}=\mathbb {E}_{\textit{\textbf{s}}_r}[\log (D_s(E_r(\textit{\textbf{s}}_r)))]+\mathbb {E}_{\textit{\textbf{s}}_f}[\log (1-D_s(E_f(\textit{\textbf{s}}_f)))]. \end{aligned}$$
(2)
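A minimal sketch of this weight-sharing scheme is shown below, assuming plain residual blocks; the channel widths and block design are illustrative and not taken from the paper.

```python
import torch.nn as nn

class ResBlk(nn.Module):
    """Plain residual block used for illustration."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class StructureEncoder(nn.Module):
    """Five downsampling convs + six ResBlks; the last two ResBlks are shared."""
    def __init__(self, shared_resblks, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(5):
            layers += [nn.Conv2d(ch, base, 4, stride=2, padding=1), nn.ReLU(True)]
            ch, base = base, min(base * 2, 512)
        self.down = nn.Sequential(*layers)
        self.private_resblks = nn.Sequential(*[ResBlk(ch) for _ in range(4)])
        self.shared_resblks = shared_resblks   # same module instance in E_r and E_f

    def forward(self, s):
        return self.shared_resblks(self.private_resblks(self.down(s)))

# Both domain encoders reuse the same shared tail, realizing the shared latent space.
shared_tail = nn.Sequential(ResBlk(512), ResBlk(512))
E_r, E_f = StructureEncoder(shared_tail), StructureEncoder(shared_tail)
```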

3.3 Appearance Rendering

The hair appearance rendering subnetwork decodes the shared cross-domain hair features into images in their respective domains. The decoders \(G_r\) and \(G_f\) have different network structures and do not share weights, since neural hair rendering is a unidirectional translation that aims to map the rendered 3D model in the fake domain to real images in the real domain. Therefore, \(G_f\) is only required to make sure the latent features \(\textit{\textbf{e}}\) encode all necessary information from the input 3D model, instead of learning to render various appearances. On the other hand, \(G_r\) is designed to accept arbitrary appearance inputs for realistic image generation.

Specifically, the unconditional decoder \(G_f\) starts with two ResBlks, followed by five consecutive upsampling transposed convolutional layers and one final convolutional layer. \(G_r\) adopts a similar structure to \(G_f\), with each transposed convolutional layer replaced by a SPADE [48] ResBlk, which uses appearance feature maps \(\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r}\) at different scales to condition the generation. Assuming the binary hair masks of the reference and the target images are \(\textit{\textbf{m}}_{\textit{\textbf{r}}}\) and \(\textit{\textbf{m}}_{\textit{\textbf{s}}}\), the appearance encoder \(E_m\) extracts the appearance feature vector from \(\textit{\textbf{r}}\times \textit{\textbf{m}}_{\textit{\textbf{r}}}\) with five downsampling convolutional layers and an average pooling. This feature vector \(E_m(\textit{\textbf{r}})\) is then used to construct the feature map \(\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r}\) by duplicating it spatially within the target hair mask \(\textit{\textbf{m}}_{\textit{\textbf{s}}}\) as follows:

$$\begin{aligned} \textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r}(p)={\left\{ \begin{array}{ll} E_m(\textit{\textbf{r}}), &{} \hbox {if }\textit{\textbf{m}}_{\textit{\textbf{s}}_r}(p)=1, \\ 0, &{} \hbox {if }\textit{\textbf{m}}_{\textit{\textbf{s}}_r}(p)=0. \end{array}\right. } \end{aligned}$$
(3)
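Equation 3 amounts to broadcasting a single appearance vector over the target hair mask; a minimal sketch, with assumed tensor shapes, is:

```python
import torch

def build_appearance_map(e_m: torch.Tensor, hair_mask: torch.Tensor) -> torch.Tensor:
    """Broadcast the appearance vector E_m(r) over the target hair mask (Eq. 3).

    e_m:       (B, C) appearance feature vector from the material encoder.
    hair_mask: (B, 1, H, W) binary hair mask of the target structure image.
    Returns:   (B, C, H, W) map equal to e_m inside the mask and zero outside.
    """
    a = e_m[:, :, None, None].expand(-1, -1, *hair_mask.shape[-2:])
    return a * hair_mask
```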

To make sure the reconstructed real image \(G_r(E_r(\textit{\textbf{s}}_r),\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r})\) and the reconstructed fake image \(G_f(E_f(\textit{\textbf{s}}_f))\) belong to their respective distributions, we apply two domain-specific discriminators, \(D_r\) and \(D_f\), for the real and fake domain respectively. The adversarial losses are:

$$\begin{aligned} \mathcal {L}_{D_r}=\mathbb {E}_{\textit{\textbf{x}}_r}[\log (D_r(\textit{\textbf{x}}_r))]+\mathbb {E}_{\textit{\textbf{s}}_r,\textit{\textbf{r}}}[\log (1-D_r(G_r(E_r(\textit{\textbf{s}}_r),\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r})))], \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {L}_{D_f}=\mathbb {E}_{\textit{\textbf{x}}_f}[\log (D_f(\textit{\textbf{x}}_f))]+\mathbb {E}_{\textit{\textbf{s}}_f}[\log (1-D_f(G_f(E_f(\textit{\textbf{s}}_f))))]. \end{aligned}$$
(5)

We also adopt perceptual losses to measure high-level feature distance utilizing the paired data:

$$\begin{aligned} \mathcal {L}_{p}=\sum _{l}\big (\Vert \varvec{\varPsi }_l(G_r(E_r(\textit{\textbf{s}}_r),\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r}))-\varvec{\varPsi }_l(\textit{\textbf{x}}_r)\Vert _1+\Vert \varvec{\varPsi }_l(G_f(E_f(\textit{\textbf{s}}_f)))-\varvec{\varPsi }_l(\textit{\textbf{x}}_f)\Vert _1\big ), \end{aligned}$$
(6)

where \(\varvec{\varPsi }_l(\textit{\textbf{i}})\) computes the activation feature map of input image \(\textit{\textbf{i}}\) at the lth selected layer of VGG-19 [56] pre-trained on ImageNet [51].
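A sketch of such a VGG-19 perceptual loss is given below; the selected layers and the use of the L1 distance are common choices and may differ from the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """L1 distance between VGG-19 activations of a generated and a target image.

    The layer indices below (relu1_1, relu2_1, relu3_1, relu4_1) are a common
    choice, not necessarily the ones selected in the paper.
    """
    def __init__(self, layer_ids=(1, 6, 11, 20)):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, generated, target):
        loss = 0.0
        for f_g, f_t in zip(self.features(generated), self.features(target)):
            loss = loss + nn.functional.l1_loss(f_g, f_t)
        return loss
```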

Finally, we have the overall training objective as:

$$\begin{aligned} \min _{E,G}\max _{D}\ (\lambda _s\mathcal {L}_{D_s}+\lambda _g(\mathcal {L}_{D_r}+\mathcal {L}_{D_f})+\lambda _p\mathcal {L}_p). \end{aligned}$$
(7)
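The helper below shows this weighted combination with the weights from Sect. 4.2; in practice, the min-max game is optimized by alternating updates, with the discriminators trained on the adversarial terms and the encoders/decoders trained on the full weighted sum.

```python
def total_loss(loss_ds, loss_dr, loss_df, loss_p,
               lambda_s=1.0, lambda_g=1.0, lambda_p=10.0):
    """Weighted combination from Eq. 7; the weights follow Sect. 4.2."""
    return lambda_s * loss_ds + lambda_g * (loss_dr + loss_df) + lambda_p * loss_p
```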

3.4 Temporal Conditioning

The aforementioned rendering network is able to generate plausible single-frame results. However, although the hair structure is controlled by the smoothly varying input \(\textit{\textbf{s}}_f\) and the appearance is conditioned by a fixed feature map \(\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r}\), the spatially varying appearance details are still generated in a somewhat arbitrary manner, which tends to flicker in time (Fig. 5). Fortunately, with the availability of the 3D model, we can calculate the exact hair motion flow \(\textit{\textbf{w}}^t\) for each pair of frames \(t-1\) and t, which can be used to warp an image \(\textit{\textbf{i}}\) from \(t-1\) to t as \(\text {W}(\textit{\textbf{i}},\textit{\textbf{w}}^t)\). We utilize these dense correspondences to enforce temporal smoothness.

Let \(\textit{\textbf{I}}=\{\textit{\textbf{i}}^0,\textit{\textbf{i}}^1,\ldots ,\textit{\textbf{i}}^T\}\) be the generated result sequence. We achieve this temporal conditioning by simply using the warped result of the previous frame, \(\text {W}(\textit{\textbf{i}}^{t-1},\textit{\textbf{w}}^t)\), as an additional condition, stacked with the appearance feature map \(\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r}\), for the real branch decoder \(G_r\) when generating the current frame \(\textit{\textbf{i}}^t\).

We achieve temporal consistency by fine-tuning only the real branch decoder with this temporal condition. During temporal training, we fix all other networks and use the same objective as Eq. 7, but randomly (with a \(50\%\) chance) concatenate \(\textit{\textbf{x}}_r\) into the condition inputs to the SPADE ResBlks of \(G_r^t\). The generation pipeline of the real branch now becomes \(G_r^t(E_r(\textit{\textbf{s}}_r),\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_r},\textit{\textbf{x}}_r)\), so that the network learns to preserve consistency when the previous frame is provided as the temporal condition, or to generate from scratch when the condition is zero.

Finally, we have the rendering equation for sequential generation:

$$\begin{aligned} \begin{aligned} \textit{\textbf{i}}^t=\text {R}_n(\textit{\textbf{h}},\textit{\textbf{r}},\textit{\textbf{c}}^t)&={\left\{ \begin{array}{ll} G_r(E_f(\textit{\textbf{s}}_f^t),\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_f^t}),&{}\hbox {if }t=0,\\ G_r^t(E_f(\textit{\textbf{s}}_f^t),\textit{\textbf{a}}_{\textit{\textbf{r}},\textit{\textbf{s}}_f^t},\text {W}(\textit{\textbf{i}}^{t-1},\textit{\textbf{w}}^t)),&{}\hbox {if }t>0, \end{array}\right. }\\ \textit{\textbf{s}}_f^t&=\text {S}_f(\textit{\textbf{h}},\textit{\textbf{c}}^t). \end{aligned} \end{aligned}$$
(8)
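The sequential renderer of Eq. 8 can be sketched as the loop below; warp is a generic grid_sample-based backward warp, and all module names are illustrative placeholders, since the implementation details are not specified in the paper.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B, C, H, W) by a dense flow field (B, 2, H, W) in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    grid = base + flow
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0                      # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1), align_corners=True)

def render_sequence(structure_maps, flows, a_ref, E_f, G_r, G_r_t):
    """Sketch of Eq. 8: frame 0 is generated from scratch; later frames are
    additionally conditioned on the warped previous result."""
    results = []
    with torch.no_grad():
        for t, s_f in enumerate(structure_maps):
            e = E_f(s_f)
            if t == 0:
                frame = G_r(e, a_ref)
            else:
                prev_warped = warp(results[-1], flows[t])
                frame = G_r_t(e, a_ref, prev_warped)
            results.append(frame)
    return results
```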

4 Experiments

Fig. 2.

Training data preparation. For the fake domain (a), we use the hair model and the input image to generate the fake rendering and the model structure map. For the real domain (b), we generate an image structure map for each image.

4.1 Data Preparation

To train the proposed framework, we generate a dataset that includes image pairs \((\textit{\textbf{s}}, \textit{\textbf{x}})\) for both the real and fake domains. In each domain, \(\textit{\textbf{s}}\rightarrow \textit{\textbf{x}}\) indicates the mapping from structure to image, where \(\textit{\textbf{s}}\) encodes only the structure information, and \(\textit{\textbf{x}}\) is the corresponding image that conforms to the structure condition.

Fig. 3.

Results for the hair models used in this study (two rows per model). We visualize examples where the input and the reference image are the same (left), and where the input and the reference are different images (right). In the latter case, the method copies the appearance from another image.

Real Domain. We adopt the widely used FFHQ [30] portrait dataset to generate the training pairs for the real branch, given that it contains hairstyles diverse in both shape and appearance. To prepare real data pairs, we use the original portrait photos from FFHQ as \(\textit{\textbf{x}}_r\), and generate \(\textit{\textbf{s}}_r\) to encode only the structure information of the hair. However, obtaining \(\textit{\textbf{s}}_r\) is a non-trivial process, since a hair image contains material information in addition to structural knowledge. To fully disentangle structure and material, and to construct a universal structural representation \(\textit{\textbf{s}}\) of all real hair, we apply a dense pixel-level orientation map in the hair region, formulated as \(\textit{\textbf{s}}_r = \text {S}_r(\textit{\textbf{x}}_r)\) and calculated with oriented filter kernels [47]. Thus, we obtain \(\textit{\textbf{s}}_r\) that consists only of local hair strand flow structures. Example generated pairs are presented in Fig. 2b.
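For illustration, such a dense orientation map can be computed with a bank of oriented filters as sketched below; the Gabor parameters are assumptions, and the exact kernels of [47] may differ.

```python
import cv2
import numpy as np

def orientation_map(image_gray: np.ndarray, hair_mask: np.ndarray,
                    num_orientations: int = 32) -> np.ndarray:
    """Dense per-pixel strand orientation inside the hair mask.

    A common recipe (bank of oriented Gabor filters, per-pixel argmax response);
    the exact filter kernels used in [47] may differ.
    """
    h, w = image_gray.shape
    img = image_gray.astype(np.float32)
    responses = np.zeros((num_orientations, h, w), dtype=np.float32)
    for k in range(num_orientations):
        theta = np.pi * k / num_orientations
        kern = cv2.getGaborKernel(ksize=(17, 17), sigma=2.0, theta=theta,
                                  lambd=4.0, gamma=0.5)
        responses[k] = np.abs(cv2.filter2D(img, -1, kern))
    best = responses.argmax(axis=0).astype(np.float32)
    orientation = best * np.pi / num_orientations        # angle in [0, pi)
    return orientation * (hair_mask > 0)                 # zero outside the hair
```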

For training and validation, we randomly select 65,000 images from FFHQ for training and use the remaining 5,000 images for testing. For each image \(\textit{\textbf{x}}_r\), we perform hair segmentation using an off-the-shelf model [4], and calculate \(\textit{\textbf{s}}_r\) for the hair region.

Fake Domain. There are multiple ways to model and render virtual hair. From coarse to fine, typical virtual hair models range from a single rigid shape, to coarse polygon strips representing detached hair wisps, to a large number of thin hair fibers that mimic real-world hair behaviors. Due to the varying granularity of the geometry, the structural representation can hardly be shared across model types or with real hair images. In our experiments, all the hair models we use are based on polygon strips, since this type of hair model is widely adopted in real-time scenarios: it is efficient to render and flexible to animate. To generate \(\textit{\textbf{s}}_f\) for a given hair model \(\textit{\textbf{h}}\) and specified camera parameters \(\textit{\textbf{c}}\), we use a smoothly varying color gradient as the texture to render \(\textit{\textbf{h}}\) into a color image that embeds the structure information of the hair geometry, such that \(\textit{\textbf{s}}_f = \text {S}_f(\textit{\textbf{h}}, \textit{\textbf{c}})\). As for \(\textit{\textbf{x}}_f\), we use a traditional graphics rendering pipeline to render \(\textit{\textbf{h}}\) with a uniform appearance color and simple diffuse shading, so that the final synthetic renderings have a consistent appearance that can be easily disentangled without any extra condition, while keeping all structural information necessary to verify the effectiveness of the encoding step. Example pairs are shown in Fig. 2a.

For the 3D hair used for the fake data pairs, we create five models (leftmost column in Fig. 2). The first four models are used for training, and the last one is used to evaluate the generalization capability of the network, since the network never sees it during training. All these models consist of 10 to 50 polygon strips, which is sparse enough for real-time applications. We use the same training set as the real domain to form training pairs. Each image is overlaid with one of the four 3D hair models according to the head position and pose. Then the image with the fake hair model is used to generate \(\textit{\textbf{x}}_f\), by rendering the hair model with simple diffuse shading, and \(\textit{\textbf{s}}_f\), by exporting color textures that encode the surface tangent of the mesh. We strictly use the same shading parameters, including lighting and color, to enforce a uniform hair appearance that can be easily disentangled by the networks.

4.2 Implementation Details

We apply a two-stage learning strategy. During the first stage, all networks are trained jointly following Eq. 7 for the single-image renderer \(\text {R}_n\). After that, we temporally fine-tune the decoder \(G_r\) of the real branch to obtain the temporally smooth renderer \(\text {R}_n^t\), by introducing the additional temporal condition as detailed in Sect. 3.4. To keep the networks of both stages consistent, we use the same condition input dimensions, including appearance and temporal conditions, but set the temporal condition to zero during the first stage. During the second stage, we set it to zero with a \(50\%\) chance. The network architecture discussed in Sect. 3 is implemented in PyTorch. We adopt the Adam solver with a learning rate of 0.0001 for the first stage and 0.00001 for the fine-tuning stage. The training resolution of all images is \(512\times 512\), with a mini-batch size of 4. For the loss functions, the weights \(\lambda _p\), \(\lambda _s\), and \(\lambda _g\) are set to 10, 1, and 1, respectively. All experiments are conducted on a workstation with 4 Nvidia Tesla P100 GPUs. At test time, rendering a single frame takes less than 1 second, with structure encoding taking less than 200 ms and final generation less than 400 ms.
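A minimal sketch of the two-stage optimizer setup follows; anything not stated above (e.g., Adam's momentum parameters) is left at library defaults and should be treated as an assumption.

```python
import itertools
import torch

def make_optimizers(E_r, E_f, E_m, G_r, G_f, D_s, D_r, D_f):
    """Optimizer setup for the two training stages (learning rates from Sect. 4.2)."""
    gen_params = itertools.chain(E_r.parameters(), E_f.parameters(),
                                 E_m.parameters(), G_r.parameters(),
                                 G_f.parameters())
    dis_params = itertools.chain(D_s.parameters(), D_r.parameters(),
                                 D_f.parameters())
    opt_G = torch.optim.Adam(gen_params, lr=1e-4)   # stage 1: all networks
    opt_D = torch.optim.Adam(dis_params, lr=1e-4)
    # Stage 2: temporal fine-tuning updates only the real-branch decoder G_r.
    opt_G_ft = torch.optim.Adam(G_r.parameters(), lr=1e-5)
    return opt_G, opt_D, opt_G_ft
```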

4.3 Qualitative Results

We present visual hair rendering results for two settings in Fig. 3. The left three columns in Fig. 3 show the case where the reference image \(\textit{\textbf{r}}\) is the same as \(\textit{\textbf{x}}_r\): by applying a hair model, we can modify the hair shape while keeping the original hair appearance and orientation. The right four columns show the case where the reference image differs from \(\textit{\textbf{x}}_r\); here, both the structure and the appearance of the hair in \(\textit{\textbf{x}}_r\) can be changed at the same time to render the hair with a new style. We also demonstrate our video results in Fig. 5 (please click the image to watch the video results online), where we adopt 3D face tracking [2] to guide the rigid position of the hair model, and a physics-based hair simulation method [7] to generate secondary hair motion. These flexible applications demonstrate that our method can be easily applied to modify hair and generate novel high-quality hair images.

4.4 Comparison Results

To the best of our knowledge, there is no previous work that tackles the problem of neural hair rendering, so a direct comparison is not feasible. However, since our method aims to bridge two different domains without ground-truth image pairs, which is related to unsupervised image translation, we compare our network with state-of-the-art unpaired image translation studies. It is important to stress that although our hair rendering task falls into the range of image translation problems, there are fundamental differences compared to generic unpaired image translation formulations, for the following two reasons.

First and foremost, unlike translation between two domains such as painting styles or seasons/times of day, which have roughly the same number of images in each domain and enough representative training images to provide nearly uniform domain coverage, our real and fake domains have dramatically different sizes: it is easy to collect a huge number of real human portrait photos with diverse hairstyles to form the real domain, but it is impossible to reach the same variety in the fake domain, since that would require manually designing every possible hair shape and appearance to cover the whole distribution of rendered fake hair. Therefore, we adopt the realistic assumption that only a limited set of such models is available for training and testing: we use four 3D models for training and one for testing, which is far from being able to produce variety in the fake domain.

Second, as a deterministic process, hair rendering should be conditioned strictly on both geometric shape and chromatic appearance, which can hardly be achieved with unconditioned image translation frameworks.

With those differences in mind, we show the comparison between our method and three unpaired image translation studies: CycleGAN [73], DRIT [35], and UNIT [39]. For the training of these methods, we use the same sets of images, \(\textit{\textbf{x}}_r\) and \(\textit{\textbf{x}}_f\), for the real and fake domains, and the default hyperparameters reported in the original papers. Additionally, we compare with the images generated by the traditional graphics rendering pipeline, denoted as Graphic Renderer. Finally, we report two ablation studies to evaluate the soundness of the network and the importance of each component: 1) we first remove the structural discriminator (termed w/o SD); 2) we then additionally remove the shared latent space (termed w/o SL and SD).

Table 1. Quantitative comparison results. We compare our method against commonly adopted image-to-image translation frameworks, reporting the Fréchet Inception Distance (FID, lower is better), Intersection over Union (IoU, higher is better), and pixel accuracy (Accuracy, higher is better). Additionally, we report ablation studies by first removing the structural discriminator (w/o SD) and then removing both the structural discriminator and the shared latent space (w/o SL and SD).

Quantitative Comparison. For quantitative evaluation, we adopt the Fréchet Inception Distance (FID) [21] to measure the distribution distance between the two domains. Moreover, inspired by the evaluation protocol of existing work [8, 64], we apply a pre-trained hair segmentation model [57] to the generated images to obtain the hair mask, and compare it with the ground truth. Intuitively, for realistic synthesized images, the segmentation model should predict a hair mask similar to the ground truth. To measure the segmentation accuracy, we use both Intersection-over-Union (IoU) and pixel accuracy (Accuracy).
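The segmentation-based metrics reduce to standard binary-mask comparisons, sketched below:

```python
import numpy as np

def mask_metrics(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """IoU and pixel accuracy between a predicted and a ground-truth binary hair mask."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = intersection / union if union > 0 else 1.0
    accuracy = (pred == gt).mean()
    return iou, accuracy
```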

The quantitative results are reported in Table 1. Our method significantly outperforms the state-of-the-art unpaired image translation works and the graphics rendering approach by a large margin on all three evaluation metrics. The low FID score shows that our method can generate high-fidelity hair images whose appearance distribution is similar to that of images from the real domain. The high IoU and Accuracy demonstrate the ability of the network to minimize the structure gap between the real and fake domains, so that the synthesized images follow the manually designed structure. Furthermore, the ablation analysis in Table 1 shows that both the shared encoder layers and the structural discriminator are essential parts of the network: the shared encoder layers help the network find a common latent space that embeds hair structural knowledge, while the structural discriminator forces the hair structure features to be domain-invariant.

Fig. 4.

Visual comparisons. We show selected visual comparisons against commonly adopted image-to-image translation methods as well as visualize ablation results. Our method synthesizes more realistic hair images compared to other approaches.

Qualitative Comparison. The qualitative comparison of different methods is shown in Fig. 4. Our generated images have much higher quality than those synthesized by other state-of-the-art unpaired image translation methods: they have clearer hair masks, follow the hair appearance of the reference images, maintain the structure of the hair models, and look like natural hair. Compared with the ablation methods (Fig. 4c and d), our full method (Fig. 4b) follows the appearance of the reference images (Fig. 4a) by generating hair with similar orientation.

We also show the importance of temporal conditioning (Sect. 3.4) in Fig. 5. Temporal conditioning helps us generate consistent and smooth video results, where hair appearance and orientation are similar between consecutive frames. Without temporal conditioning, the hair texture can differ between frames, as indicated by the blue and green boxes, which results in flickering in the synthesized video. Please refer to the supplementary video for more examples.

Fig. 5.

Video results and comparisons. Top row: the first image is the appearance reference image and the others are consecutive input frames; middle row: generated hair images with temporal conditioning; bottom row: generated hair images without temporal conditioning. We show two zoomed-in hair regions for each result. With temporal conditioning, our model synthesizes hair images with consistent appearance, while omitting it leads to hair appearance flickering, as indicated by the blue and green boxes. Click the image to play the video results and comparisons. (Color figure online)

5 Conclusions

We propose a neural-based rendering pipeline for general virtual 3D hair models. The key idea of our method is that, instead of enforcing model-level representation consistency to enable supervised paired training, we relax the strict requirements on the model and adopt an unsupervised image translation framework. To bridge the gap between the real and fake domains, we construct a shared latent space that encodes a common structure feature space for both domains, even though their inputs are dramatically different. In this way, we can encode a virtual hair model into such a structure feature and pass it to the real-domain generator to produce a realistic rendering. The conditional real generator not only allows flexible appearance conditioning but can also be used to introduce temporal conditioning to generate smooth sequential results.

Our method has several limitations. First, the current method does not alter the input image outside the generated hair region; a smaller fake hairstyle cannot fully occlude the original hair in the input image. Face inpainting could be applied to remove the excess hair regions and fix this issue. Second, when the lighting or material of the appearance reference is dramatically different from the input, the result may look unnatural; better reference selection would improve the results. Third, the current method simply blends the generated hair onto the input, which causes blending artifacts in some results, especially when the background is complicated. A simple solution is to train a supervised boundary refinement network to achieve better blending quality.