
1 Introduction

When taking photographs through transparent material such as glass or windows, undesired reflections often ruin the images. To obtain clear images, users may darken the scene or change the camera position, but these workarounds are often ineffective because of space limitations. Reflections not only degrade image quality but also harm the results of downstream applications such as segmentation. Thus, removing reflections from an image is an important task in computer vision. An example of the single-image reflection removal task is shown in Fig. 1.

Separating the background layer and the reflection layer is an ill-posed problem because the capture conditions are not fixed. The thickness of the glass, the number of glass panes, the transmittance and the reflectance can all vary, and we cannot model them exactly. To cope with this ill-posedness, many previous methods use multiple input images [1, 10, 24, 25] or a video [29]. Multiple input images make the problem easier to solve, but in practice it is difficult to prepare them, so single-image reflection removal remains an important research topic. Recently, several methods have been proposed to remove reflections without using multiple images [2, 11]. In particular, Convolutional Neural Network (CNN) based methods [3, 5, 7] have shown good results in the past few years. Generative Adversarial Networks (GAN) have produced outstanding results in computer vision tasks such as inpainting [32] and super-resolution [15]. Reflection removal is no exception, and GAN-based methods [27, 30, 33] have also achieved good results.

In this paper, we propose a novel single-image reflection removal method based on generative adversarial networks, as shown in Fig. 2. To preserve texture information while separating the background and reflection layers, four kinds of losses are adopted to train the network: a pixel loss, a feature loss, an adversarial loss and a gradient constraint loss. The gradient constraint loss is a novel loss we propose, which keeps the correlation between the background and reflection layers low. Since the background and reflection layers are unrelated, it is important that they do not share information. In addition, the feature loss is applied to both the background and the reflection layer, so the network can separate the image into two layers while retaining image features. When training the network, we use several reflection models and apply a wide range of conditions to the synthetic reflection images. This enables our network to remove real-world reflections captured under complex conditions.

Fig. 1. A visualization of single-image reflection removal. (a) is a synthetic input image which includes reflections, (b) is the background image generated by our method, and (c) is the reflection layer.

The contributions of our paper are summarized below:

  • We propose a new Gradient Constrained Network (GCNet) for single-image reflection removal. The network is trained with four types of losses: a pixel loss, a feature loss, an adversarial loss and a gradient constraint loss.

  • Since the gradient constraint keeps the correlation between the background and reflection layers low, the output background layer preserves texture information well and has high visual quality.

  • We apply a wide variety of conditions to the training dataset, which enables our trained network to remove reflections under many challenging real-world conditions.

2 Related Work

Since reflection separation is an ill-posed problem, many methods use multiple images [1, 9, 10, 20, 21, 24, 25] or video [29] as input. Multiple images make the ill-posed problem easier to solve, but they are difficult to obtain and require additional capture effort. Thus, this paper mainly considers single-image reflection removal methods.

2.1 Optimization Based Methods

Several methods use optimization to suppress the reflection in a single image. To solve the optimization problem, an additional prior such as gradient sparsity [2, 31] or a Gaussian mixture model [22] is needed. These methods can suppress reflections effectively when the input image follows their assumptions, but when the assumptions do not hold, the results degrade badly.

2.2 Deep Learning Based Methods

The first method that used deep convolutional neural networks for reflection removal was proposed by Fan et al. [7]. Two networks are cascaded: the first CNN predicts the edges of the background layer, and the predicted edges guide the second CNN when reconstructing the background layer. Since only a pixel-wise loss is used during training, the semantic structure is not considered. Generative Adversarial Network (GAN) based methods have produced outstanding results in reflection removal, as in other computer vision tasks such as inpainting [32] and super-resolution [15]. Zhang et al. proposed PL Net [33], which is trained with a loss composed of a feature loss, an adversarial loss and an exclusion loss. The network architecture and loss function are tuned to capture both low-level and high-level image information, but the network struggles with overexposed images because its training does not address them. Yang et al. [30] proposed a network that predicts the background and reflection layers alternately, trained with an \(L_2\) loss and an adversarial loss. Wei et al. proposed ERR Net [27], which can be trained with misaligned data; image features obtained by a pre-trained VGG19 network [23] are used as input and also for computing the feature loss. Since these two methods do not consider the correlation between the background and reflection layers, they sometimes generate images with an unnatural color tone.

3 Supporting Methods

3.1 Synthetic Reflection Image

In this paper, we denote by I an image with reflections. The background layer is denoted B and the reflection layer R. I is then modeled as a linear combination of B and R:

$$\begin{aligned} I = B + R. \end{aligned}$$
(1)

In previous work, several reflection models have been used to synthesize R. In our training method, we use three of them. Denoting the original reflection image by \(R_o\), the first reflection model can be expressed as:

$$\begin{aligned} R_1 = \alpha K *R_o, \end{aligned}$$
(2)

where \(R_1\) is the synthetic reflection layer, K is a Gaussian kernel and \(\alpha \) is the reflection rate. The second model can be expressed as:

$$\begin{aligned} R_2 = \beta K *H *R_o, \end{aligned}$$
(3)

where \(R_2\) is the synthetic reflection layer, H is a random kernel with two pulses and \(\beta \) is the reflection rate. Convolving with H models the ghosting effect caused by the thickness of the glass [5]. The third model can be expressed as:

$$\begin{aligned} R_3 = K *R_o - \gamma , \end{aligned}$$
(4)

where \(R_3\) is the synthetic reflection layer and \(\gamma \) is an amount of shift. In our method, \(\gamma \) is computed in the same way as described in [7]. Recovering B from I is the final goal of single-image reflection removal, and it is difficult because the problem is ill-posed.
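As an illustration, a minimal NumPy/SciPy sketch of the three reflection models is given below; the kernel widths, reflection rates, ghost shift and the heuristic used for \(\gamma \) are placeholder assumptions rather than the exact values of our implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reflection_model_1(R_o, alpha=0.5, sigma=3.0):
    """R_1 = alpha * K * R_o : blurred, attenuated reflection (Eq. 2)."""
    return alpha * gaussian_filter(R_o, sigma=(sigma, sigma, 0))

def reflection_model_2(R_o, beta=0.5, sigma=3.0, shift=10):
    """R_2 = beta * K * H * R_o : ghosting via a two-pulse kernel H (Eq. 3)."""
    ghost = 0.6 * R_o + 0.4 * np.roll(R_o, shift, axis=1)   # H: two shifted pulses
    return beta * gaussian_filter(ghost, sigma=(sigma, sigma, 0))

def reflection_model_3(R_o, sigma=3.0):
    """R_3 = K * R_o - gamma : blur, then subtract a shift (Eq. 4)."""
    blurred = gaussian_filter(R_o, sigma=(sigma, sigma, 0))
    gamma = max(blurred.mean() - 0.3, 0.0)                   # assumed heuristic for gamma
    return np.clip(blurred - gamma, 0.0, 1.0)
```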

Fig. 2. The overview of our method.

3.2 Generative Adversarial Networks

Generative adversarial networks (GAN) [8] are a learning framework which maps noise to an image. The generator G is trained to create realistic images and the discriminator D is trained to judge whether its input is real or generated. Training proceeds as a min-max game between the generator and the discriminator, which can be expressed as:

$$\begin{aligned} \min _{G}\max _{D}V(G,D) = \mathbb {E}_x[\log D(x)] + \mathbb {E}_z[\log (1-D(G(z)))], \end{aligned}$$
(5)

where x is an image and z is a noise variable. When GAN is applied to image restoration tasks, z is replaced with the degraded image. Since GANs are good at solving inverse problems, they have shown remarkable results in image processing tasks such as inpainting [32], colorization [18], denoising [4] and super-resolution [15].

4 Proposed Method

Our method removes reflections from a single image with a trained GAN-based algorithm. We show that our GAN-based method with a gradient constraint removes reflections effectively. The overview of our method is shown in Fig. 2.

Fig. 3. The architecture of our proposed generator. It is based on UNet++ \(\text {L}^4\) [34].

4.1 Network Model

We illustrate the proposed generator architecture in Fig. 3. The network structure is based on UNet++ \(\text {L}^4\) [34]. It is a combination of convolutional layers, batch normalization layers [13], leaky ReLU layers [28], max pooling layers and bilinear interpolation layers. Since we adopt the deep supervision structure [34], there are four outputs. We use \(\hat{B}\) as the main output and the other outputs are used only for computing the pixel loss. The filter size of the convolutional layers is set to \(3\times 3\). The number of channels in the convolutional layers in \(C^{x,y}\) is set to \(2^{x+5}\).

Our discriminator is composed of a stack of convolutional, batch normalization and leaky ReLU layers. The stride of the convolutional layers is set to 2 in every second convolutional layer. Since the final output size of our discriminator is \(16\times 16\), the \(L_2\) difference is used to compute the adversarial loss.
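A minimal PyTorch sketch of such a discriminator is given below; the channel widths and kernel sizes are assumptions, and only the conv-BN-leaky-ReLU stacking, the stride-2-every-second-layer rule and the \(16\times 16\) output map follow the description above.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class Discriminator(nn.Module):
    """Conv-BN-LeakyReLU stack; stride 2 in every second layer shrinks a
    256x256 input to a 16x16 score map."""
    def __init__(self):
        super().__init__()
        chs = [3, 64, 64, 128, 128, 256, 256, 512, 512]   # assumed channel widths
        blocks = []
        for i in range(8):
            stride = 2 if (i % 2 == 1) else 1             # stride 2 every two layers
            blocks.append(conv_block(chs[i], chs[i + 1], stride))
        self.features = nn.Sequential(*blocks)
        self.score = nn.Conv2d(chs[-1], 1, kernel_size=3, padding=1)  # 16x16 map

    def forward(self, x):
        return self.score(self.features(x))
```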

4.2 Loss Functions for Generator

In our method, we apply four kinds of losses to separate the background and reflection layers effectively. Let G, D, F be the generator, the discriminator, and the feature extractor, respectively. The generated background image \(\hat{B}_i\) is obtained by feeding the image \(I_i\) into the generator G. We do not estimate the reflection layer directly; instead, the reflection layer \(\hat{R_i}\) is estimated by subtracting the generated background image from the input image. Thus, \(\hat{B}_i\) and \(\hat{R_i}\) are computed as:

$$\begin{aligned} \hat{B_i}&= G(I_i;\theta _G) \nonumber \\ \hat{R_i}&= I_i - \hat{B}_i \end{aligned}$$
(6)

where \(\theta _G\) is the set of weights of the generator G. The goal of training is to minimize the loss \(\mathcal {L}_G(\theta _G)\), which is a combination of four losses and is defined as:

$$\begin{aligned} \mathcal {L}_G (\theta _G) = \mu _1 \mathcal {L}_\text {MSE} + \mu _2 \mathcal {L}_\text {feat} + \mu _3 \mathcal {L}_\text {adv} + \mu _4 \mathcal {L}_\text {GC}. \end{aligned}$$
(7)

\(\mathcal {L}_\text {MSE}\) is a pixel loss which computes the \(L_2\) difference, \(\mathcal {L}_\text {feat}\) is a feature loss applied in the feature domain, \(\mathcal {L}_\text {adv}\) is an adversarial loss and \(\mathcal {L}_\text {GC}\) is our novel loss, which is effective for separating the background and reflection layers.

Pixel Loss. The pixel loss compares the pixel-wise difference between the generated image and the ground truth image. Since minimizing the mean squared error (MSE) helps avoid the vanishing gradient problem when training GANs [17], we use the MSE loss to calculate the pixel loss. Our generator produces four images: one main output \(\hat{B}\) and three supporting outputs \(\hat{B_1}\), \(\hat{B_2}\), and \(\hat{B_3}\). The supporting outputs are used only for calculating the pixel loss, which benefits the training process [34]. To emphasize the optimization of the main output, the four output images are weight-averaged when the pixel loss is computed; additional details are shown in Fig. 3. The pixel loss is therefore computed as an \(L_2\) difference:

$$\begin{aligned} \hat{B}_{\text {1}i} = G_1(I_i;\theta _G), \hat{B}_{\text {2}i} = G_2(I_i;\theta _G), ~\hat{B}_{\text {3}i} = G_3(I_i;\theta _G) \nonumber \\ \mathcal {L}_\text {MSE} = \sum _{i}^{N} || \frac{1}{8} (5*\hat{B}_i + \hat{B}_{\text {1}i} + \hat{B}_{\text {2}i} + \hat{B}_{\text {3}i}) - B_i ||_2 \end{aligned}$$
(8)

where \(G_1\), \(G_2\), \(G_3\) are sub-networks of the generator G and \(B_i\) is the ground truth background image.
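A short PyTorch-style sketch of this weighted pixel loss (Eq. 8), assuming the four outputs are already available as tensors, is:

```python
import torch

def pixel_loss(B_hat, B1_hat, B2_hat, B3_hat, B_gt):
    """Weighted MSE over the four deep-supervision outputs (Eq. 8):
    the main output B_hat gets weight 5/8, the side outputs 1/8 each."""
    fused = (5.0 * B_hat + B1_hat + B2_hat + B3_hat) / 8.0
    return torch.mean((fused - B_gt) ** 2)
```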

Feature Loss. In reflection removal, it is important to preserve the structure of the image. Since the pixel loss cannot capture the semantic features of the image, we adopt a feature loss. A pretrained VGG-19 network [23] is used as the feature extractor and the output of the layer ‘conv5_2’ is used for the computation. We calculate the \(L_1\) difference between the feature vectors of the generated image and the ground truth image. Since the background and reflection layers have different image structures, the feature loss is applied to both layers. Our feature loss \(\mathcal {L}_\text {feat}\) is expressed as:

$$\begin{aligned} \mathcal {L}_\text {feat} = \sum _{i}^{N} (|| F(\hat{B}_i) - F(B_i) ||_1 + || F(\hat{R}_i) - F(R_i) ||_1). \end{aligned}$$
(9)
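A possible PyTorch implementation of this feature loss is sketched below; the torchvision slice index used to reach ‘conv5_2’ reflects our assumption about the layer ordering.

```python
import torch.nn as nn
from torchvision import models

class FeatureLoss(nn.Module):
    """L1 distance between VGG-19 'conv5_2' activations of the estimates and
    the ground truths, applied to both layers (Eq. 9)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features[:31].eval()  # up to conv5_2
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg
        self.l1 = nn.L1Loss()

    def forward(self, B_hat, B_gt, R_hat, R_gt):
        return (self.l1(self.vgg(B_hat), self.vgg(B_gt)) +
                self.l1(self.vgg(R_hat), self.vgg(R_gt)))
```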

Adversarial Loss. Simple CNN-based networks trained with an MSE loss tend to generate blurry and unnatural images, because such images are the average of several natural solutions [15]. To avoid this problem, the adversarial loss was proposed in [8]. It encourages the generator to produce images that follow the natural image distribution. In reflection removal, deterioration of the color tone is a common problem, and applying the adversarial loss mitigates it in our method. The adversarial loss is expressed as:

$$\begin{aligned} \mathcal {L}_\text {adv} = \sum _{i}^{N} ||1 - D(\hat{B_i};\theta _D)||_2. \end{aligned}$$
(10)
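In a least-squares form, this term can be sketched as follows; averaging over the \(16\times 16\) score map is an implementation assumption.

```python
import torch

def adversarial_loss(D, B_hat):
    """Eq. 10: push the discriminator's score map for the generated
    background toward 1."""
    return torch.mean((1.0 - D(B_hat)) ** 2)
```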

Gradient Constraint Loss. The main task in single-image reflection removal is to separate a single image into a background layer and a reflection layer. In most cases, the two layers have no correlation, so minimizing the correlation between them is effective for this task. To do so, we apply a novel loss function called the gradient constraint loss. It operates in the gradient domain, which makes the separation easier. The loss is composed of two terms: \(\mathcal {L}_\text {GCM}\), which keeps the correlation between the two layers low, and \(\mathcal {L}_\text {GCS}\), which acts as a constraint on \(\mathcal {L}_\text {GCM}\). However, we found that in the early stage of training the effect of the gradient constraint loss is too strong and the network cannot be trained effectively. We therefore multiply the loss by the epoch index to keep its effect weak in the early stage of training. Finally, the gradient constraint loss is described as:

$$\begin{aligned} \mathcal {L}_\text {GC} = (\text {epoch} - 1) * (\mathcal {L}_\text {GCM} + \mathcal {L}_\text {GCS}). \end{aligned}$$
(11)

Since the edge information of the background and reflection layers should be independent, \(\mathcal {L}_\text {GCM}\) penalizes the element-wise product of the two gradient layers. \(\mathcal {L}_\text {GCS}\) gives a constraint on \(\mathcal {L}_\text {GCM}\) and helps the network to separate the layers effectively. \(\mathcal {L}_\text {GCM}\) and \(\mathcal {L}_\text {GCS}\) can be expressed as:

$$\begin{aligned} \hat{B}_{\text {g}i}&= \text {Tanhshrink}(\nabla _x \hat{B_i} + \nabla _y \hat{B_i}) \nonumber \\ \hat{R}_{\text {g}i}&= \text {Tanhshrink}(\nabla _x \hat{R_i} + \nabla _y \hat{R_i}) \nonumber \\&\mathcal {L}_\text {GCM} = \sum _{i}^{N}||\hat{B}_{\text {g}i} \odot \hat{R}_{\text {g}i}||_1 \end{aligned}$$
(12)
$$\begin{aligned}&\mathcal {L}_\text {GCS} = \sum _{i}^{N}|| (\hat{B}_{\text {g}i}+ \hat{R}_{\text {g}i}) - (\nabla _x I_i + \nabla _y I_i)||_1. \end{aligned}$$
(13)

The basic idea of minimizing the correlation between two layers was proposed in [33], but in our method we apply a new activation function and add a constraint. The purpose of our gradient constraint loss is to focus on large edges and separate the layers effectively. Since the input and ground truth images are normalized to the range \([-2.5,2.5]\) to stabilize training, the conventional Tanh function is not suitable for our network. In addition, to separate the layers mainly by their large edges, we want to reduce the impact of small-edge regions. Thus, we use the Tanhshrink function as the activation function. Tanh and Tanhshrink are defined as:

$$\begin{aligned} \text {Tanh}(x)&= \frac{\exp (x)-\exp (-x)}{\exp (x)+\exp (-x)} \nonumber \\ \text {Tanhshrink}(x)&= x - \text {Tanh}(x). \end{aligned}$$
(14)
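The full gradient constraint loss (Eqs. 11–13) can be sketched as below; the forward-difference gradient operator and the use of mean-reduced \(L_1\) norms are assumptions, since the text does not fix them.

```python
import torch
import torch.nn.functional as F

def spatial_gradient(x):
    """Forward differences in x and y, summed (nabla_x + nabla_y)."""
    gx = F.pad(x[:, :, :, 1:] - x[:, :, :, :-1], (0, 1, 0, 0))
    gy = F.pad(x[:, :, 1:, :] - x[:, :, :-1, :], (0, 0, 0, 1))
    return gx + gy

def gradient_constraint_loss(B_hat, R_hat, I, epoch):
    """Eqs. 11-13: Tanhshrink-compressed gradients of the two layers should have
    a near-zero element-wise product (GCM) while still summing to the gradient
    of the input (GCS); the loss is ramped up linearly with the epoch index."""
    Bg = F.tanhshrink(spatial_gradient(B_hat))
    Rg = F.tanhshrink(spatial_gradient(R_hat))
    gcm = torch.mean(torch.abs(Bg * Rg))
    gcs = torch.mean(torch.abs((Bg + Rg) - spatial_gradient(I)))
    return (epoch - 1) * (gcm + gcs)
```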

Using Tanhshrink also provides robustness against blown-out highlights. When overexposure occurs, the structure of the reflection layer is mixed into the background layer, and the correlation between the two layers does not become zero. Since Eq. 12 encourages the element-wise product of the gradient layers to be zero, training would not perform well in this situation. Tanhshrink overcomes the problem because it compresses small gradients: with Tanhshrink as the activation function, the ground truth of the element-wise product layer stays close to zero even when overexposure occurs. This is important when applying Eq. 12 during training.

We also apply \(\mathcal {L}_\text {GCS}\) as a constraint on \(\mathcal {L}_\text {GCM}\). Since we use Tanhshrink as the activation function, small gradients are compressed into even smaller values. This can hinder training when a large gradient is wrongly divided into two small gradients, which often happens when the global tone of the generated image changes. Thus, we apply \(\mathcal {L}_\text {GCS}\) to help the generator separate the layers by focusing on the structure of the image rather than by deteriorating the color tone (Fig. 4).

The effectiveness of the gradient constraint loss in processing real images is shown in Sect. 5.2 and in Fig. 7.

4.3 Loss Function for Discriminator

Since our method is based on GAN, the discriminator is trained alongside the generator. The discriminator is trained by minimizing the loss \(\mathcal {L}_D(\theta _D)\), described as:

$$\begin{aligned} \mathcal {L}_D (\theta _D) = \sum _{i}^{N} (||D(\hat{B_i};\theta _D)||_2 + ||V - D(B_i;\theta _D)||_2), \end{aligned}$$
(15)

where V is a random-valued matrix whose entries follow a Gaussian distribution with a mean of 1.
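A sketch of this discriminator loss is given below; the standard deviation of the Gaussian noise in V is not specified above and is chosen arbitrarily here.

```python
import torch

def discriminator_loss(D, B_hat, B_gt, noise_std=0.1):
    """Eq. 15: the fake score map is pushed toward 0 and the real score map
    toward a matrix V of Gaussian noise with mean 1 (soft real labels)."""
    fake = D(B_hat.detach())          # stop gradients into the generator
    real = D(B_gt)
    V = 1.0 + noise_std * torch.randn_like(real)
    return torch.mean(fake ** 2) + torch.mean((V - real) ** 2)
```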

4.4 Training Dataset

To create the training dataset, we use the PASCAL VOC 2012 dataset [6], which includes 17K images. We exclude grayscale and pale-colored images since they affect training. The images are first resized to \(256\times 256\) using bicubic interpolation. After that, the images are randomly flipped; one image is used as the background layer B and another as the reflection layer R. The background image is randomly darkened to account for dark real-world situations. The color reflection image is randomly converted to grayscale, blurred with a Gaussian filter (\(\sigma \in [0.2,7]\) in the grayscale case, \(\sigma \in [2,4]\) in the RGB case), and its tone is modified randomly. The reflection model is selected randomly from Eq. (2)–(4):

$$\begin{aligned} R = {\left\{ \begin{array}{ll} R_1 &{} \text {with probability 0.1}\\ R_2 &{} \text {with probability 0.1}\\ R_3 &{} \text {otherwise} \end{array}\right. }\!\!. \end{aligned}$$
(16)

Finally, the synthetic reflection images I are generated using Eq. (1) and clipped to the range [0, 1].
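The synthesis pipeline of Sect. 4.4 can be summarized by the following sketch, which reuses the reflection-model functions sketched in Sect. 3.1; the darkening range and the grayscale probability are illustrative assumptions, and random flipping and tone modification are omitted for brevity.

```python
import random
import numpy as np

def synthesize_training_pair(B, R_o):
    """Assemble one synthetic training sample (Sect. 4.4):
    darken B, degrade R_o, pick a reflection model via Eq. 16, then apply Eq. 1."""
    B = B * random.uniform(0.6, 1.0)                        # assumed darkening range
    if random.random() < 0.5:                               # assumed grayscale probability
        R_o = np.repeat(R_o.mean(axis=2, keepdims=True), 3, axis=2)
        sigma = random.uniform(0.2, 7.0)
    else:
        sigma = random.uniform(2.0, 4.0)
    p = random.random()                                     # Eq. 16: pick a model
    if p < 0.1:
        R = reflection_model_1(R_o, sigma=sigma)
    elif p < 0.2:
        R = reflection_model_2(R_o, sigma=sigma)
    else:
        R = reflection_model_3(R_o, sigma=sigma)
    I = np.clip(B + R, 0.0, 1.0)                            # Eq. 1, clipped to [0, 1]
    return I, B, R
```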

Fig. 4. A visualization of the types of synthetic reflection.

4.5 Training

Since removing grayscale reflections is easier than removing color reflections, we first train our network using images that contain only grayscale reflections. After training for 50 epochs, we initialize a new network with the trained weights. The new network is trained for 100 epochs using the dataset described in Sect. 4.4. We train our generator by minimizing Eq. (7), where \(\mu _1\), \(\mu _2\), \(\mu _3\) and \(\mu _4\) are set to 2, 1, 0.001 and 0.01, respectively. Our model is implemented in PyTorch [19] and optimized with the Adam solver [14]. The initial learning rate is set to 0.0002 and the batch size to 8. Training the network takes about 40 h on a single GeForce GTX 1080 Ti.
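The optimizer setup and the weighted combination of Eq. (7) with these values can be written as the sketch below; the Adam momentum parameters are left at the PyTorch defaults, which is an assumption.

```python
import torch

def build_optimizers(generator, discriminator, lr=2e-4):
    """Adam optimizers with the learning rate from Sect. 4.5."""
    opt_G = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_D = torch.optim.Adam(discriminator.parameters(), lr=lr)
    return opt_G, opt_D

def generator_loss(l_mse, l_feat, l_adv, l_gc, w=(2.0, 1.0, 0.001, 0.01)):
    """Weighted sum of the four terms (Eq. 7) with mu_1..mu_4 from Sect. 4.5."""
    return w[0] * l_mse + w[1] * l_feat + w[2] * l_adv + w[3] * l_gc
```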

4.6 Rotate Averaging Process

Since the proposed network is not rotationally invariant, the results change when a rotated image is processed. We therefore propose a rotate averaging process, which averages the output images generated from rotated versions of the input. Images in the \(\text {SIR}^2\) benchmark dataset [26] are used for the evaluation. We prepared four kinds of input images: (a) the unrotated image, (b) the image rotated by 90\(^{\circ }\), (c) by 180\(^{\circ }\) and (d) by 270\(^{\circ }\). The comparison of generated results in PSNR and SSIM is shown in Fig. 5. When all four images are used, the recovered image quality is the highest. Thus, in our method, we feed the four rotated images to the network and average all of the output images to obtain the final image.
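A minimal sketch of this rotate averaging process, assuming the generator returns only the main output \(\hat{B}\), is:

```python
import torch

@torch.no_grad()
def rotate_average(generator, I):
    """Sect. 4.6: run the generator on the input rotated by 0, 90, 180 and 270
    degrees, rotate each prediction back, and average the four estimates."""
    outputs = []
    for k in range(4):
        rotated = torch.rot90(I, k, dims=(-2, -1))
        B_hat = generator(rotated)
        outputs.append(torch.rot90(B_hat, -k, dims=(-2, -1)))
    return torch.stack(outputs).mean(dim=0)
```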

Fig. 5. The comparison of generated results in PSNR and SSIM. (a)–(c) means that we use (a), (b) and (c) as the input images of our network and average all of the output images to obtain the final image.

5 Experimental Results

In this section, we compare our Gradient Constrained Network (GCNet) with other notable methods, including CEIL Net [7], BDN [30], PL [33] and ERR [27]. Images in the \(\text {SIR}^2\) benchmark dataset [26] are used for the objective evaluation, with PSNR [12] and SSIM [12] as metrics. PSNR measures the numerical differences between two images and SSIM measures their structural differences. Since reflection removal is an ill-posed problem and the transmittance rate cannot be determined, SSIM is the more important measure of background image quality. We use real images provided by the authors of [7, 16] for the subjective evaluation. All comparison methods use the original authors' implementations.

Table 1. Comparison of restoration results in PSNR and SSIM. Images in the \(\text {SIR}^2\) benchmark dataset [26] are used for the benchmark.

5.1 Results

Images in the \(\text {SIR}^2\) benchmark dataset [26] are used for the benchmark. \(\text {SIR}^2\) includes two types of real reflection images: controlled scenes and wild scenes. The controlled scenes are collected in a controlled environment, such as a laboratory; postcards and daily solid objects are used as subjects, and the two subsets include 199 and 200 images, respectively. The wild-scene images are collected in the real world outside a lab. Since the wild scenes involve complex reflectance, varying distances and different illumination, removing reflections from them is more difficult than from the controlled scenes.

Table 1 shows the comparison of restoration results in PSNR and SSIM. Our proposed method achieves a much higher SSIM than the conventional methods on all datasets. In particular, our method performs well on the wild scenes, because it uses a wide variety of synthetic reflection images for training. In addition, the high SSIM shows that the gradient constraint is effective for separating the background and reflection layers while preserving the image structure. The subjective evaluation is shown in Fig. 6. BDN and ERR are not good at removing real reflections; CEIL Net and PL can remove some reflections effectively, but the global tone of their images is changed.

Fig. 6. Reflection removal results on real images. Images are from [16, 27] and [26]. Best viewed on screen with zoom.

5.2 The Effectiveness of Our Loss Function and Training Method

When training our generator, a combination of four losses is minimized, as described in Sect. 4.2, and the network is trained in two steps, as described in Sect. 4.5. To show the effectiveness of our loss function and training method, we trained our network under several settings. First, to validate the two-step training of Sect. 4.5, we trained the network without dividing the training into two steps, using a single dataset; this result is denoted "Single training". We also trained our generator while ablating some loss functions. "No \(\mathcal {L}_\text {GC}\)" indicates that the gradient constraint loss is removed from the loss function during training. "No \(\mathcal {L}_\text {GC}\), \(\mathcal {L}_\text {feat}\)" indicates that both the gradient constraint loss and the feature loss are removed; in other words, the loss function is composed of the pixel loss and the adversarial loss only.

The comparison of restoration results is shown in Table 2 and Fig. 7. The proposed training method achieves the highest PSNR and SSIM in most situations. The visual results of "No \(\mathcal {L}_\text {GC}\)" and "No \(\mathcal {L}_\text {GC}\), \(\mathcal {L}_\text {feat}\)" show that the gradient constraint loss is effective for separating the background and reflection layers: the texture of the background layer should not appear in the reflection layer, yet in Fig. 7g and i the texture of the face appears in the reflection layer. Hence, minimizing the correlation between the background and reflection layers with the gradient constraint loss is effective. The result of "Single training" shows that considering the characteristics of the reflection is important during training. In other words, when solving a challenging problem with a learning-based method, fine-tuning the network based on the behavior of the problem is an effective training strategy.

Table 2. Comparison of restoration results of our method variants. Images in the \(\text {SIR}^2\) benchmark dataset [26] and synthetic reflection images generated using Eq. (4) are used for the benchmark. (a) Our proposed method. (b) Trained without pre-training. (c) Trained without using \(\mathcal {L}_\text {GC}\). (d) Trained without using \(\mathcal {L}_\text {GC}\) and \(\mathcal {L}_\text {feat}\).
Fig. 7. Comparison of reflection removal results. Background layers \(\hat{B}\) and reflection layers \(\hat{R}\) are shown. (a) Input image. (b–c) Our proposed method. (d–e) Trained without pre-training. (f–g) Trained without using \(\mathcal {L}_\text {GC}\). (h–i) Trained without using \(\mathcal {L}_\text {GC}\) and \(\mathcal {L}_\text {feat}\).

6 Conclusion

In this paper, we have proposed a novel Gradient Constrained Network (GCNet) for single-image reflection removal. Four kinds of loss functions are combined to train the network, and the gradient constraint loss is a new loss function that we have proposed. Since the independence between the background and reflection layers should be enforced, the gradient constraint loss, which minimizes the correlation between these two layers, improves reflection removal performance. Owing to the novel loss, the new synthetic dataset and the training method, our method removes reflections more cleanly than state-of-the-art methods. Both quantitative and qualitative evaluation results show that our proposed network preserves background textures well without corrupting the image structure.