
1 Introduction

Deep image relighting has multiple applications both in research and in practice, and has recently witnessed increased interest. A single-image relighting method would enable aesthetic enhancement applications, such as photo montage of images taken under different illuminations, and illumination retouching without the work of a human expert. Very importantly, in computer vision research, image relighting can be leveraged for data augmentation, making the trained methods robust to changes in light source position or color temperature. It could also serve domain adaptation, by normalizing input images to a unique set of illumination settings that the down-stream computer vision method was trained on. The relighting task comprises multiple sub-tasks, namely, illumination estimation and manipulation, shadow removal (effectively inpainting of poorly lit areas), and geometric understanding for shadow recasting. The combination of these tasks makes relighting very challenging.

Recently, datasets limited to interior scenes [33], underexposed images enhanced by professionals [48], and rendered images with randomized light directions [54] have been proposed, but none serve the benchmarking needs of image relighting, namely, having all \(M\times N\) combinations of \(M\) scenes and \(N\) illumination settings. Further datasets are used in the literature on style transfer or intrinsic image decomposition. For instance, IIW [6] and SAW [27] contain human-labeled reflectance and shading annotations, and BigTime [29] contains time-lapse data of scenes illuminated under varying light conditions. Multiple methods have recently been developed for relighting [12, 34, 42], and the prior literature on intrinsic images, which disentangle surface reflectance from lighting, is rich [5, 6, 18, 39, 44, 51], notably for applications such as relighting [7] and normalization [32].

The aim of this challenge, and of the novel dataset Virtual Image Dataset for Illumination Transfer (VIDIT), is to gauge the current state-of-the-art in image relighting. The virtual dataset offers a well-controlled setup that enables full-reference evaluation, which is ideal for benchmarking purposes and is an important step towards real-image relighting. Such virtual datasets have proven useful in multiple applications, even to augment training datasets containing real images, for instance the vKitti data [9]. There could be differences relative to real images, such as the distribution of textures that can vary from man-made to natural scenes [8, 45], the specifics of the capturing device like chromatic aberrations [15, 31, 58], or the presence of multiple light sources. VIDIT itself is described in the following section. The goal of the challenge is thus to provide a benchmark on this dataset for future research on image relighting.

This challenge is one of the AIM 2020 associated challenges on: scene relighting and illumination estimation [17], image extreme inpainting [36], learned image signal processing pipeline [24], rendering realistic bokeh [25], real image super-resolution [50], efficient super-resolution [56], video temporal super-resolution [41] and video extreme super-resolution [19].

2 Scene Relighting and Illumination Estimation Challenge

2.1 Dataset

The challenge, whose 3 tracks are described in the following section, is based on a novel dataset: VIDIT [16]. VIDIT contains 300 virtual scenes used for training, where every scene is captured 40 times in total: from 8 equally-spaced azimuthal angles, each lit with 5 different illuminants. Every image has a \(1024\times 1024\) resolution; the images are additionally downsampled by a factor of 2, with bicubic interpolation over \(4\times 4\) windows, to ease computations for track 3. The dataset is publicly available (https://github.com/majedelhelou/VIDIT).

2.2 Tracks and Competition

Track 1: One-to-one Relighting

Description: the relighting task is pre-determined and fixed for all validation and test samples. In other words, the objective is to manipulate an input image from one pre-defined set of illumination settings (namely, North, 6500K) to another pre-defined set (East, 4500K). The images are in \(1024\times 1024\) resolution, both input and output, and nothing other than the input image is provided.

Evaluation Protocol: We evaluate the results using the PSNR and SSIM [49] metrics, and the self-reported run-times and implementation details are also provided. For the final ranking, we define a Mean Perceptual Score (MPS) as the average of the normalized SSIM and LPIPS [57] scores, themselves averaged across the entire test set of each submission:

$$\begin{aligned} \mathrm{MPS} = 0.5\cdot (S + (1-L)), \end{aligned}$$
(1)

where S is the SSIM score and L is the LPIPS score. We note that normalizing S and \((1-L)\), by dividing each by its maximum value across all of the track’s submissions before averaging the two, does not affect the final ranking. We therefore omit this normalization, which also makes external comparisons simpler.
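For clarity, the sketch below shows how the MPS of Eq. (1) can be computed from per-image scores; the function name and the per-image averaging convention are illustrative assumptions, not the official evaluation code.

```python
import numpy as np

def mean_perceptual_score(ssim_scores, lpips_scores):
    """Compute the MPS of Eq. (1) from per-image SSIM and LPIPS values.

    Both inputs are iterables of per-image scores over a submission's test set.
    """
    S = float(np.mean(ssim_scores))   # average SSIM over the test set
    L = float(np.mean(lpips_scores))  # average LPIPS over the test set
    return 0.5 * (S + (1.0 - L))

# Example: a submission with mean SSIM 0.70 and mean LPIPS 0.40
# obtains an MPS of 0.5 * (0.70 + 0.60) = 0.65.
```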

Track 2: Illumination Settings Estimation

Description: the goal of this track is to estimate, from a single input image, the illumination settings that were used to render it. Given the input image, the output should estimate the color temperature of the illuminant as well as its orientation, i.e., the position of the light source. The input images are also \(1024\times 1024\), and no input other than the 2D image is given.

Evaluation Protocol: The evaluation of track 2 is based on the accuracy of the predictions, following this formula for the loss:

$$\begin{aligned} \sqrt{ \sum _{i=0}^{N-1} \left( \frac{|\hat{\phi}_i-\phi _i| \bmod 180}{180} \right) ^2 + \left(\hat{T}_i-T_i\right)^2 } \end{aligned}$$
(2)

where \(\hat{\phi}_i\) is the predicted angle (in degrees, 0–360) for test sample i and \(\phi_i\) is the ground-truth value for that sample, and \(\hat{T}_i\) is the temperature prediction for test sample i and \(T_i\) is the ground-truth value for that sample. \(T_i\) takes values in \(\{0, 0.25, 0.5, 0.75, 1\}\), which correspond to the color temperature values \(\{2500K, 3500K, 4500K, 5500K, 6500K\}\).
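The sketch below evaluates Eq. (2) on arrays of predictions; it assumes both per-sample error terms are accumulated inside the sum over the N test samples, and is illustrative rather than the official scoring script.

```python
import numpy as np

def track2_loss(phi_pred, phi_gt, T_pred, T_gt):
    """Sketch of the Eq. (2) evaluation loss.

    phi_pred, phi_gt: predicted / ground-truth angles in degrees (0-360).
    T_pred, T_gt: predicted / ground-truth temperatures encoded in [0, 1]
                  (0, 0.25, 0.5, 0.75, 1 for 2500K ... 6500K).
    """
    phi_pred, phi_gt = np.asarray(phi_pred), np.asarray(phi_gt)
    T_pred, T_gt = np.asarray(T_pred), np.asarray(T_gt)
    angle_err = (np.abs(phi_pred - phi_gt) % 180) / 180.0  # normalized angular error
    temp_err = T_pred - T_gt                                # temperature error in [0, 1] encoding
    return float(np.sqrt(np.sum(angle_err ** 2) + np.sum(temp_err ** 2)))
```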

Track 3: Any-to-any Relighting

Description: this track is a generalization of the first track. The objective is to relight an input image (manipulating both color temperature and light source position) from any arbitrary input illumination settings to any arbitrary target illumination settings. The target settings are dictated by a second input guide image, as in style transfer applications. The participants were allowed to make use of their solutions to the first two tracks to develop a solution for this track. The images are in \(512\times 512\) resolution to ease computations, as this track is very challenging.

Evaluation Protocol: We carry out a similar evaluation as for track 1. As the inputs are pairs of possible test images, they cover a larger span of candidate options. For that reason, we double the number of data samples in the validation and test sets for this track.

Challenge Phases for all Tracks. (1) Development: registered teams were given access to the training input and target data, as well as the input validation set data. An online validation server with a leader board provided automated feedback for the submitted image results on the validation set, which was made up of 45 images for tracks 1 and 2, and 90 image pairs for track 3; (2) Testing: registered teams were given access to the input test sets, which are of the same size as the validation ones, and could submit their test results to a private test server. For a submission to be accepted, open-source code and a fact sheet detailing the implemented method needed to be submitted along with the test results. Test results were kept hidden from participating teams, to avoid any chances of test over-fitting, and were only revealed at the end of the challenge.

Table 1. AIM 2020 Image Relighting Challenge Track 1 (One-to-one relighting) results. The MPS, used to determine the final ranking, is computed following Eq. (1). \(^*\)CET_CVLab and CET_SP are merged into one entry, due to the large similarity between the proposed solutions. We also note that normalizing the SSIM and (1-LPIPS) scores by their maxima in the track, for computing the MPS, does not affect the ranking.

3 Challenge Results

The results of the three tracks are collected in Tables 1, 2, and 3, respectively. The top solutions are described in the following sections, and the remaining ones are presented in the supplementary material.

Visual results of some top submissions, along with input and ground-truth images for track 1, are shown in Fig. 1. We notice that most of the outputs render the relit image with the correct color temperature; however, the shadows are harder to estimate. For instance, lyl and YorkU struggle with shadow removal. Both CET_SP and CET_CVLab tend to remove the unnecessary shadows, although not perfectly, which underlines the difficulty of the shadow-relighting sub-task. We show visual results of some submissions to track 3 in Fig. 2. Among the top 3 submissions, only NPU-CVPG is able to successfully relight the bottom-right part and produce the closest color temperature to the ground-truth.

Table 2. AIM 2020 Image Relighting Challenge Track 2 (Illumination settings estimation) results. The loss is computed based on the angle and color temperature predictions, following Eq. (2), and is used to determine the final ranking.
Table 3. AIM 2020 Image Relighting Challenge Track 3 (Any-to-any relighting) results. The MPS, used to determine the final ranking, is computed following Eq. (1). We also note that normalizing SSIM and (1-LPIPS) scores by the maximum in the track, for computing the MPS, does not affect the ranking.
Fig. 1. Sample visual results from top submissions in track 1, with MPS scores. We observe that relighting previous shadows is the most difficult sub-task.

Fig. 2. Sample visual results from top submissions in track 3, with MPS scores.

4 Track 1 Methods

Fig. 3. Architecture of the Wavelet Decomposed RelightNet (WDRN).

4.1 CET_CVLab: Wavelet Decomposed RelightNet (WDRN)

The architecture of the proposed Wavelet Decomposed RelightNet (WDRN) [37] is shown in Fig. 3. The network structure is similar to that of an encoder-decoder U-Net. The downsampling operation used in the contraction path is a discrete wavelet transform (DWT) based decomposition instead of a downsampling convolution or pooling. Similarly, in the expansion path, the inverse discrete wavelet transform (IDWT) is used instead of an upsampling convolution. In the wavelet-based decomposition, the information from all channels is combined in the downsampling process such that there is minimal information loss compared to convolutional subsampling. For the given task, it can be deduced that the network must learn to re-calibrate the illumination gradient within the image. To this end, the network should be able to establish relations between distant pixels. The proposed WDRN can achieve a large receptive field, and hence establish such relations, through the multi-scale wavelet decomposition. This methodology is also computationally efficient and is inspired by the multi-level wavelet-CNN (MWCNN) proposed by Liu et al. [30]. The training loss used in this work is a weighted sum of the SSIM loss, the MAE loss, and a gray loss (the gray loss term is used in the CET_SP submission, and omitted in that of CET_CVLab). The gray loss is the \(\ell_1\) distance between the grayscale version of the restored image and that of the ground-truth image.
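To illustrate the wavelet-based downsampling idea (this is not the team’s released code), the sketch below implements a single-level Haar DWT that maps C channels to 4C channels at half resolution, together with its inverse and the gray loss; the exact filters, normalization, and grayscale weights are assumptions.

```python
import torch

def haar_dwt(x):
    """Single-level 2D Haar DWT: (B, C, H, W) -> (B, 4C, H/2, W/2).

    The four sub-bands (LL, LH, HL, HH) are stacked along the channel axis,
    so no information is discarded by the downsampling.
    """
    x00, x01 = x[:, :, 0::2, 0::2], x[:, :, 0::2, 1::2]
    x10, x11 = x[:, :, 1::2, 0::2], x[:, :, 1::2, 1::2]
    ll = (x00 + x01 + x10 + x11) / 2
    lh = (-x00 - x01 + x10 + x11) / 2
    hl = (-x00 + x01 - x10 + x11) / 2
    hh = (x00 - x01 - x10 + x11) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)

def haar_idwt(y):
    """Inverse of haar_dwt: (B, 4C, H/2, W/2) -> (B, C, H, W)."""
    c = y.shape[1] // 4
    ll, lh, hl, hh = y[:, :c], y[:, c:2*c], y[:, 2*c:3*c], y[:, 3*c:]
    b, _, h, w = ll.shape
    x = y.new_zeros((b, c, h * 2, w * 2))
    x[:, :, 0::2, 0::2] = (ll - lh - hl + hh) / 2
    x[:, :, 0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[:, :, 1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[:, :, 1::2, 1::2] = (ll + lh + hl + hh) / 2
    return x

def gray_loss(pred, target):
    """l1 distance between grayscale versions of the prediction and the target.

    Standard luma weights are assumed; the team's exact grayscale conversion is unspecified.
    """
    w = pred.new_tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
    return torch.mean(torch.abs((pred * w).sum(1) - (target * w).sum(1)))
```

Because the four sub-bands fully determine the input, such a DWT/IDWT pair can replace strided convolutions and transposed convolutions without losing spatial information, which is the motivation given above.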

Fig. 4. Architecture diagram of the Coarse-to-Fine Relighting Net (CFRN).

4.2 lyl: Coarse-to-Fine Relighting Net (CFRN)

The proposed Coarse-to-Fine Relighting Net (CFRN) is illustrated in Fig. 4. The solution consists of two networks: (1) a progressive coarse network, and (2) a fine network that merges the output of the coarse network, with channel attention, to correct the input at each level. Such a progressive process follows a guiding principle for image relighting: high-level information is a good guide towards a better relit image. In the proposed method, there are three indispensable parts: (1) tying the loss at each level, (2) using the FineNet structure, and (3) providing a lower-level extracted feature input to ensure the availability of low-level information. To make full use of the training data, the team augments the data in three ways: (1) scaling: randomly downscaling between [0.5, 1.0], (2) rotation: randomly rotating the image by 90, 180, or 270 degrees, and (3) flipping: randomly flipping images horizontally or vertically with equal probability.
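A minimal sketch of the described augmentation pipeline is given below; the exact parameters, sampling probabilities, and order of operations in the team’s code may differ.

```python
import random
import torch
import torch.nn.functional as F

def augment_pair(inp, tgt):
    """Apply the same random scaling, rotation, and flip to an (input, target) pair.

    inp, tgt: tensors of shape (C, H, W) with matched content.
    """
    # random downscaling by a factor in [0.5, 1.0]
    s = random.uniform(0.5, 1.0)
    size = (int(inp.shape[1] * s), int(inp.shape[2] * s))
    inp = F.interpolate(inp[None], size=size, mode='bicubic', align_corners=False)[0]
    tgt = F.interpolate(tgt[None], size=size, mode='bicubic', align_corners=False)[0]

    # random rotation by 90, 180, or 270 degrees
    k = random.choice([1, 2, 3])
    inp, tgt = torch.rot90(inp, k, dims=(1, 2)), torch.rot90(tgt, k, dims=(1, 2))

    # random horizontal / vertical flips with equal probability
    if random.random() < 0.5:
        inp, tgt = torch.flip(inp, dims=(2,)), torch.flip(tgt, dims=(2,))
    if random.random() < 0.5:
        inp, tgt = torch.flip(inp, dims=(1,)), torch.flip(tgt, dims=(1,))
    return inp, tgt
```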

Fig. 5. Overview of the YorkU team’s NRUNet framework.

4.3 YorkU: Norm-Relighting-U-Net (NRUNet)

The method adopts a U-Net architecture [38] as the main backbone of the proposed framework. The solution consists of two networks: (1) the normalization network, which is responsible for producing uniformly-lit white-balanced images, and (2) the relighting network, which performs the one-to-one image relighting. Instance normalization [46] is applied after each stage in the encoder of the normalization network, while batch normalization is used in the encoder of the relighting network. The relighting network is fed the input image and the latent representations of the uniformly-lit image produced by the normalization network. The team uses the white-balance augmenter in [2] to augment the training data. To produce the ground-truth of the normalization network, the team uses the training data provided for tracks 2 and 3, which include a set of images taken from each scene under different lighting directions. The team exploits their solution for the illumination settings estimation task (see Sect. 5.2) to predict the target scene settings for the one-to-one mapping. Hence, the team increases the number of training images by including the training images provided for tracks 2 and 3. The team pre-trains the normalization network, then fixes its weights while the entire framework is trained jointly. The training uses the Adam optimizer [26] with an \(\ell_1\) loss. At inference, the team processes a resized version of the input image, then guided up-sampling [10] is applied to obtain the full-resolution image. The team ensembles the final results by utilizing their one-to-any framework (more details on the one-to-any framework are given in Sect. 6.2). To relight the image using the one-to-any framework, the team randomly selects six images with the predicted illumination settings of the current track to use as targets. This procedure generates six relit images that are used along with the result produced by the one-to-one framework to generate the final result. Figure 5-(a) shows an overview of the proposed one-to-one mapping framework. The source code for the three tracks is available at https://github.com/mahmoudnafifi/image_relighting.
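The two-network design can be summarized structurally as follows; this is only a sketch in which the encoder/decoder modules and their interfaces are placeholders, not the team’s released code (linked above).

```python
import torch
import torch.nn as nn

class NRUNetSketch(nn.Module):
    """Structural sketch of the NRUNet one-to-one framework.

    norm_net        : U-Net producing a uniformly-lit, white-balanced image
                      (instance normalization in its encoder), pre-trained then frozen.
    relight_encoder : encoder of the relighting U-Net (batch normalization).
    relight_decoder : decoder of the relighting U-Net.
    """
    def __init__(self, norm_net, relight_encoder, relight_decoder):
        super().__init__()
        self.norm_net = norm_net
        for p in self.norm_net.parameters():   # normalization net is pre-trained, then frozen
            p.requires_grad = False
        self.relight_encoder = relight_encoder
        self.relight_decoder = relight_decoder

    def forward(self, x):
        with torch.no_grad():
            norm_latent = self.norm_net.encode(x)        # latent code of the uniformly-lit image
        feats = self.relight_encoder(x)                  # encode the original input
        fused = torch.cat([feats, norm_latent], dim=1)   # inject illumination-normalized context
        return self.relight_decoder(fused)
```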

Fig. 6. Diagram illustration of the DRNIR network architecture.

4.4 IPCV_IITM: Deep Residual Network for Image Relighting (DRNIR)

Figure 6 shows the structure of the proposed residual network with skip connections, based on the hourglass network [59]. The network has an encoder-decoder structure with skip connections [23]. Residual blocks are used in the skip connections, with batch normalization and ReLU non-linearities in each block. The encoder features are concatenated with the decoder features of the same level. The network takes the input image and directly produces the target image. The team converts the input RGB images to the LAB color space for better processing. To reduce memory consumption without harming performance, the team uses a pixel-shuffle block [40] to downsample the image. The network is first trained using the \(\ell_1\) loss, then fine-tuned with the MSE loss. Note that experiments with an adversarial loss did not lead to stable training. The Adam optimizer is used with a learning rate of 0.0001, a decay cycle of 200 epochs, and a \(512\times 512\) patch size for training. Data augmentation is used to make the network more robust.
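The memory-saving pixel-shuffle trick can be sketched as a space-to-depth rearrangement before the network and its inverse at the output; the use of PixelUnshuffle/PixelShuffle with a factor of 2 is an assumption for illustration.

```python
import torch.nn as nn

class ShuffleWrapper(nn.Module):
    """Wrap a network so that it operates on a 2x space-to-depth downsampled input.

    The wrapped net must preserve the channel count (C*f*f in, C*f*f out) so that
    PixelShuffle can recover a C-channel, full-resolution output.
    """
    def __init__(self, net, factor=2):
        super().__init__()
        self.down = nn.PixelUnshuffle(factor)  # (B, C, H, W) -> (B, C*f*f, H/f, W/f)
        self.up = nn.PixelShuffle(factor)      # inverse rearrangement at the output
        self.net = net

    def forward(self, x):
        return self.up(self.net(self.down(x)))
```

Because the rearrangement is lossless, the network sees quarter-size feature maps (lower memory) while the output retains the original resolution.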

4.5 Other Submitted Solutions

The DeepRelight team addresses the one-to-one relighting task by recovering the structure information of the scene and the target illumination information, and rendering the output with a GAN strategy [47]. Another solution makes use of two pairs of encoder-decoder networks, such that the encoding and decoding are illumination-specific, and the learning is also supervised with discriminators. Transforming an image then becomes equivalent to encoding it with the first encoder and decoding it with the second. Hertz tackle the problem using a multi-scale hierarchical network: the image is encoded at multiple resolutions, and feature information is transferred from lower to higher levels to obtain the final transformation. Lastly, Image Lab [35] build on the multilevel hyper vision net [14], adding convolution block attention [52] in their skip connections. Further details of each of these submitted solutions can be found in the supplementary material.

5 Track 2 Methods

5.1 AiRiA_CG: Dual Path Ensemble Network (DPENet)

The proposed DPENet has two sub-networks, one for angle prediction and one for temperature classification [13]. The full DPENet is shown in Fig. 7. ResNeXt-101_32\(\times 4\)d [53] is adopted for the angle prediction sub-network. The temperature classification sub-network is based on ResNet-50 [20]. The two sub-networks are pre-trained on ImageNet [11]. The solution adopts random flipping and random rotation for data augmentation.
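A minimal sketch of the two sub-networks is given below. Note that torchvision ships the ResNeXt-101 32x8d variant, used here as a stand-in for the 32x4d model reported by the team, and the class counts (8 directions, 5 temperatures) are assumptions taken from the dataset description.

```python
import torch.nn as nn
import torchvision.models as models

def build_dpenet_heads(num_angles=8, num_temps=5):
    """Sketch of DPENet's two ImageNet-pre-trained sub-networks."""
    # angle prediction branch (stand-in backbone; newer torchvision uses the `weights` argument)
    angle_net = models.resnext101_32x8d(pretrained=True)
    angle_net.fc = nn.Linear(angle_net.fc.in_features, num_angles)

    # temperature classification branch
    temp_net = models.resnet50(pretrained=True)
    temp_net.fc = nn.Linear(temp_net.fc.in_features, num_temps)
    return angle_net, temp_net
```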

Fig. 7. The structure of the Dual Path Ensemble Network (DPENet).

5.2 YorkU: Illuminant-ResNet (I-ResNet)

The team treats the task as two independent classification tasks: (1) illuminant temperature classification and (2) illuminant angle classification. The team adopts the ResNet-18 model [20] trained on ImageNet [11]. The last fully-connected layer is replaced with a new layer of \(n\) neurons, where \(n\) is the number of output classes of each task. The Adam optimizer [26] is used with a cross-entropy loss. For angle classification, the team applies the white-balance augmenter proposed in [2] to augment the training data. For temperature classification, the team follows previous work [1, 3, 4] that uses image histogram features instead of the 2D input image. Specifically, the team feeds the network with 2D RGB-uv projected histogram features [1, 3], instead of the original training images. This histogram-based training, rather than image-based training, improves the model’s generalization. Figure 8 shows an overview of the team’s solution, including the white-balance augmentation process.
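To give an intuition for the histogram-based input, below is a simplified log-chroma histogram feature. The exact per-channel projections, binning, and intensity weighting of the official RGB-uv histogram [1, 3] differ, so this is only indicative.

```python
import numpy as np

def log_chroma_histogram(img, bins=61, eps=1e-6):
    """Simplified log-chroma histogram of an HxWx3 RGB image in [0, 1].

    u = log(R/G), v = log(R/B); the 2D histogram of (u, v) is intensity-weighted
    and normalized, yielding an illumination-sensitive, texture-invariant feature.
    """
    rgb = img.reshape(-1, 3).astype(np.float64) + eps
    u = np.log(rgb[:, 0] / rgb[:, 1])
    v = np.log(rgb[:, 0] / rgb[:, 2])
    weights = np.sqrt((rgb ** 2).sum(axis=1))          # brighter pixels count more
    hist, _, _ = np.histogram2d(u, v, bins=bins,
                                range=[[-3, 3], [-3, 3]], weights=weights)
    return hist / (hist.sum() + eps)
```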

Fig. 8. Overview of the YorkU solution, with the white-balance augmentation [2].

5.3 Image Lab: Virtual Image Illumination Estimation (LightNet)

As shown in Fig. 9, the team adopts a DenseNet [22] architecture for the task. The team trains ten different pre-trained networks and also creates a custom network with selective blocks [28]. Among these networks, DenseNet121 achieves the best performance. DenseNet121 consists of fifty-eight dense layers, with three transition blocks and three fully-connected layers. The global average pooling and fully-connected layers are removed from the pre-trained network and replaced with new global average pooling and fully-connected layers leading to degree and temperature output layers. From the training dataset, the team creates a random split, with 67% of the samples taken for training and the rest for validation. The training images are normalized to [0, 1]. The Adam optimizer, with a learning rate decaying from 0.001 to 0.00001 over 500 epochs, is used to train the model with a categorical (cross-entropy) loss. Attention layers [52] were tested in the development phase but did not yield any improvement.
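A minimal sketch of a DenseNet121 backbone with new pooling and two classification heads (degree and temperature) is shown below; the head sizes (8 directions, 5 temperatures) follow the dataset description and are assumptions about the team’s exact configuration.

```python
import torch.nn as nn
import torchvision.models as models

class LightNetSketch(nn.Module):
    """Sketch of a DenseNet121 backbone with separate angle and temperature heads."""
    def __init__(self, num_angles=8, num_temps=5):
        super().__init__()
        backbone = models.densenet121(pretrained=True)    # newer torchvision uses `weights=...`
        self.features = backbone.features                 # keep the convolutional trunk
        self.pool = nn.AdaptiveAvgPool2d(1)               # new global average pooling
        feat_dim = backbone.classifier.in_features        # 1024 for DenseNet121
        self.angle_head = nn.Linear(feat_dim, num_angles)
        self.temp_head = nn.Linear(feat_dim, num_temps)

    def forward(self, x):
        f = self.pool(self.features(x)).flatten(1)
        return self.angle_head(f), self.temp_head(f)
```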

Fig. 9. Overview of the LightNet model’s architecture.

5.4 Other Submitted Solution

The debut_kele team proposes to use a single EfficientNet [43] backbone, pre-trained on ImageNet. Further details of this submitted solution can be found in the supplementary material.

6 Track 3 Methods

6.1 NPU-CVPG: Self-Attention AutoEncoder (SA-AE)

As shown in Fig. 10, the team presents the novel Self-Attention AutoEncoder (SA-AE) [21] model for generating a relit image from a source image to match the illumination settings of a guide image. In order to reduce the learning difficulty, the team adopts an implicit scene representation [59] learned by the encoder to render the relit images using the decoder. Based on the learned scene representation, an illumination estimation network is designed as a classifier to predict the illumination settings of the guide image. A lighting-to-feature network is also designed to recover the corresponding implicit scene representation from the illumination settings, similar to the inverse of the illumination estimation process. In addition, a self-attention [55] mechanism is introduced in the decoder to focus on the rendering of the regions requiring relighting in the source images.
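A minimal sketch of a self-attention block of the kind proposed in [55], which can be inserted in a decoder, is shown below; the channel reduction factor and the zero-initialized residual scale follow common practice and are not necessarily the team’s exact settings.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Self-attention over the spatial positions of a feature map."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual scale, starts at 0

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (B, HW, C/r)
        k = self.key(x).flatten(2)                          # (B, C/r, HW)
        attn = torch.softmax(q @ k, dim=-1)                 # (B, HW, HW) attention map
        v = self.value(x).flatten(2)                        # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)   # aggregate values per position
        return self.gamma * out + x                         # residual connection
```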

Fig. 10. Overview of the proposed SA-AE network.

6.2 YorkU: Norm-Relighting-U-Net (NRUNet)

As in the one-to-one mapping solution (Sect. 4.3), the U-Net architecture [38] is used as the main backbone of the any-to-any relighting framework, and two networks are used for normalization and relighting, as shown in Fig. 5-(b). The relighting network is fed the input image, the latent representation of the guide image, and the uniformly-lit image produced by the normalization network. The team uses the white-balance augmentation [2] on the training data of the normalization network. The team trains two frameworks: one on \(256\!\times \!256\) random patches and one on \(256\!\times \!256\) resized images. The final result is generated by taking the mean of the two relit images and applying guided up-sampling [10].

6.3 IPCV_IITM: Deep Residual Network for Image Relighting (DRNIR)

Figure 11 shows the structure of the proposed residual network with skip connections, based on the hourglass network [59]. The network has an encoder-decoder structure similar to [23]. The team also uses residual blocks in the skip connections. The encoder features are concatenated with the decoder features of the same level. Along with the input image, the network is given a guide image that is used in two places. First, the input and the guide image are concatenated. Second, the team adds a separate loss to match the illumination properties between the guide image and the predicted image. A separate network predicts the illumination settings of an image, and is trained with the provided ground-truth labels. The team passes both the guide image and the predicted image through this network and minimizes the distance between their intermediate feature representations. The feature representation of the guide image is further concatenated with the encoder output and fed to the decoder. The team converts the input RGB images to the LAB color space for better processing. To reduce memory consumption, pixel-shuffle blocks [40] are used as in track 1.
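The guide-image illumination constraint can be sketched as a feature-distance loss against the pre-trained illumination-settings network; the `features(x)` interface, the choice of layer, and the \(\ell_1\) distance are illustrative assumptions rather than the team’s exact formulation.

```python
import torch
import torch.nn as nn

def illumination_feature_loss(illum_net, pred_img, guide_img):
    """Match intermediate illumination-network features of the prediction and the guide.

    illum_net: illumination-settings network pre-trained on the provided labels,
               assumed to expose a `features(x)` call returning an intermediate representation.
    """
    with torch.no_grad():
        guide_feat = illum_net.features(guide_img)    # target illumination representation
    pred_feat = illum_net.features(pred_img)          # gradients flow back into the relighting net
    return nn.functional.l1_loss(pred_feat, guide_feat)
```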

Fig. 11. Network architecture of the DRNIR method.

6.4 lyl: Coarse-to-Fine Relighting Net (CFRN)

The proposed Coarse-to-Fine Relighting Net (CFRN) is shown in Fig. 4, as in track 1. Training is divided into two stages: incomplete training and full training. During the incomplete training, the fine network is trained with a batch size of 16 for 200 epochs. The Adam optimizer (\({\beta _{1}=0.9}\), \({\beta _{2}=0.999}\)) is used to minimize the \(\ell_1\) loss between the generated relit images and the ground-truth. The learning rate is initialized to \(10^{-4}\) and kept unchanged. After the incomplete training of the fine network, the whole CFRN is fully trained. In each full-training batch, the team randomly samples 64 patches, training for 20k epochs.

6.5 Other Submitted Solution

The AiRiA_CG team proposes a creative solution consisting of a dual encoder and a single decoder [13]. The input image is encoded, and so is the target image. However, the encoder of the target image is mirrored to match the decoder of the input image’s latent representation, and the feature layers of the former are transferred, layer by layer, to the decoder of the latter. This allows the illumination information to be transferred from the guide image to the input image during the decoding process. Further details of this submitted solution can be found in the supplementary material.