1 Background

When photographing through near-transparent objects such as glass, reflected light can appear in the image, as shown in Fig. 1. Because this loss of image information occurs frequently, single image reflection removal (SIRR) is a challenging problem that has attracted considerable attention in the computer vision community. Mathematically, an image containing reflected light is modeled as the image \(I\), a linear combination of a transmitted image layer \(T\) and a reflected image layer \(R\), as in Eq. (1).

$$ I = T + R $$
(1)

Hence, reflection removal can be achieved by estimating the transmitted image layer \(T\). Many researchers have tackled this technical challenge, and many solutions have been proposed; however, most still have limitations in performance, robustness, and versatility.

Early statistical models of SIRR avoided removing the reflective layer from a single image, since the transmission layer cannot be separated from the reflective layer without additional information. Instead, multiple images were used to estimate the transmission layer \(T\) [1,2,3], with the problem made tractable by formulating additional constraints on the images. Even when only a single image is available, the transmission layer \(T\) has been estimated using such formulated constraints [4,5,6].

Fig. 1. Examples of images containing reflected light

However, it is difficult to construct a versatile reflection removal model simply by adding such constraints to image processing, because a wide variety of situations must be assumed. Against this background, research on deep learning models has been active in recent years [7,8,9].

SIRR with deep learning models faces two problems [10, 11]. The first is that extracting a reflection-free background image is an ill-posed task, which limits model performance. The second is that training data are very scarce, because paired datasets of images with and without reflections are difficult to obtain: reflections are a rare imaging condition, unlike the subjects of general-purpose datasets such as ImageNet or MNIST. Therefore, since ground-truth values are hard to acquire, SIRR often relies on synthetic images obtained by merging images.

2 Methods

2.1 Proposed Model

The network model proposed in this study is shown in Fig. 2. Six-layer convolution is performed in the encoder part. The bottleneck part employs DeepLabv3+ [12], followed by six-layer convolution in the decoder part to obtain the estimated transmitted image layer \(\hat{T}\) and the estimated reflected image layer \(\hat{R}\) as outputs.

The bottleneck part, the DeepLabv3+ module, uses MobileNetv2 [13] as the backbone. This is followed by ASPP (Atrous Spatial Pyramid Pooling), which applies atrous convolutions with rates of 1, 6, 12, and 18 in parallel, together with image pooling (Figs. 2 and 3).
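A minimal PyTorch sketch of this architecture is given below. It is an illustration under assumptions, not the authors' implementation: the layer widths and kernel sizes are placeholders, and a plain ASPP block stands in for the full DeepLabv3+ module with its MobileNetv2 backbone.

```python
# Sketch of the proposed encoder / DeepLabv3+-style bottleneck / decoder,
# assuming 3-channel RGB input; widths and kernel sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: rates 1, 6, 12, 18 plus image pooling."""
    def __init__(self, cin, cout):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(cin, cout, 3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r, bias=False)
            for r in (1, 6, 12, 18)
        ])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(cin, cout, 1, bias=False))
        self.project = conv_block(cout * 5, cout)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

class ReflectionRemovalNet(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        # Six-layer convolutional encoder.
        self.encoder = nn.Sequential(*[conv_block(3 if i == 0 else width, width)
                                       for i in range(6)])
        # Bottleneck standing in for the DeepLabv3+ module.
        self.bottleneck = ASPP(width, width)
        # Six-layer convolutional decoder.
        self.decoder = nn.Sequential(*[conv_block(width, width) for _ in range(6)])
        # Two heads: estimated transmitted layer T_hat and reflected layer R_hat.
        self.head_t = nn.Conv2d(width, 3, 1)
        self.head_r = nn.Conv2d(width, 3, 1)

    def forward(self, x):
        z = self.decoder(self.bottleneck(self.encoder(x)))
        return self.head_t(z), self.head_r(z)
```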

Fig. 2. Proposed network

Fig. 3. DeepLabv3+ module

2.2 Loss Function

As in many neural networks for image restoration, the network is generally optimized using a loss function based on the mean squared error (MSE) between the output and the ground truth.

$$ L_{MSE} = \left\| F\left( I \right) - T \right\|_2^2 $$
(2)

where \(F(I)\) is the network output. However, models optimized using only \(L_{MSE}\) often fail to retain high-frequency content. In reflection removal, both the reflective and the transmitted layers are natural images with different characteristics, so to obtain the best restoration results the network needs to learn the perceptual properties of the transmission layer. Therefore, we adopt a loss function defined on high-level feature abstractions. The VGG loss is calculated as the difference between the layer representations of the restored transmission and the ground-truth transmission image on the pre-trained 19-layer VGG network proposed by Simonyan and Zisserman [14].

$$ L_{VGG} = \sum_{i = 1}^{M} \frac{1}{W_i H_i} \left\| \varphi_i \left( T \right) - \varphi_i \left( F\left( I \right) \right) \right\|_2^2 $$
(3)

where \(\varphi_i\) is the feature map obtained from the \(i\)-th convolutional layer (after activation) of the VGG19 network, \(M\) is the number of convolutional layers used, and \(W_i\) and \(H_i\) are the dimensions of the \(i\)-th feature map.

This study uses a loss function consisting of the weighted sum of these two losses, expressed as follows.

$$ L = L_{MSE} + \lambda L_{VGG} $$
(4)

where \(\lambda\) is a weighting parameter for \(L_{VGG}\) and is set to 0.1 in this study.
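A sketch of this combined loss is shown below, assuming PyTorch and the pre-trained VGG19 from torchvision; which post-activation layers are tapped is an assumption, not taken from the paper.

```python
# Sketch of L = L_MSE + lambda * L_VGG (Eqs. (2)-(4)), lambda = 0.1.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

class VGGLoss(torch.nn.Module):
    # Indices of post-ReLU feature maps in VGG19 (assumed choice of layers).
    def __init__(self, layer_ids=(3, 8, 17, 26, 35)):
        super().__init__()
        self.features = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.layer_ids = set(layer_ids)

    def forward(self, pred, target):
        # (ImageNet normalization of the inputs is omitted for brevity.)
        loss, x, y = 0.0, pred, target
        for i, layer in enumerate(self.features):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                # Eq. (3): squared L2 distance, normalized by the spatial
                # size W_i * H_i (and batch-averaged for mini-batch training).
                b, c, h, w = x.shape
                loss = loss + (x - y).pow(2).sum() / (b * h * w)
            if i >= max(self.layer_ids):
                break
        return loss

def total_loss(pred_t, target_t, vgg_loss, lam=0.1):
    # Eq. (4): L = L_MSE + lambda * L_VGG, with lambda = 0.1 as in the text.
    return F.mse_loss(pred_t, target_t) + lam * vgg_loss(pred_t, target_t)
```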

2.3 Dataset Creation

This study also builds on Eq. (1). Scaling the transmitted layer of Eq. (1) by the transmittance \(\alpha\) gives the following.

$$ I = \alpha T + R $$
(5)

In this study, two formulations of \(R\) in Eq. (5) are defined and used.

The first equation is,

$$ R = \beta G*R^{\prime} $$
(6)

where \(R^{\prime}\) is the reflective image layer, \(\beta\) is the reflectance, and \(G\) is a Gaussian kernel.

The second equation is,

$$ R = G*R^{\prime} - \gamma $$
(7)

where \(R^{\prime}\) is the reflective image layer, \(\gamma\) is a constant, and \(G\) is a Gaussian kernel.

The values of \(\alpha\), \(\beta\), \(\gamma\), and \(G\) are varied, because reflected light in real images exhibits various patterns. In some cases, gamma correction is used to darken \(R\).
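A minimal sketch of this synthesis is given below, assuming NumPy/OpenCV and float images in [0, 1]; the parameter values and ranges are illustrative placeholders, not the paper's settings.

```python
# Sketch of the reflection formation models of Eqs. (5)-(7).
import numpy as np
import cv2

def synthesize(t, r_prime, alpha=0.8, beta=0.4, gamma=0.1,
               ksize=11, sigma=3.0, mode="eq6", darken_gamma=None):
    """Blend a transmitted layer t with a blurred reflected layer r_prime."""
    g_r = cv2.GaussianBlur(r_prime, (ksize, ksize), sigma)  # G * R'
    if mode == "eq6":
        r = beta * g_r                        # Eq. (6): R = beta * (G * R')
    else:
        r = np.clip(g_r - gamma, 0.0, 1.0)    # Eq. (7): R = G * R' - gamma
    if darken_gamma is not None:
        r = r ** darken_gamma  # optional gamma correction to darken R
    return np.clip(alpha * t + r, 0.0, 1.0)   # Eq. (5): I = alpha * T + R
```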

3 Experiments and Results

3.1 Dataset

As the training dataset, 3,890 images from the MIT-67 Dataset [15] and 17,000 images from the PASCAL VOC 2012 Dataset [16] were collected, and images with pseudo-reflected light were generated from them using Eqs. (5), (6), and (7). The datasets used for evaluation were Object, Postcard, and Wild from the SIR2 Dataset [17], together with the Real20 Dataset [18]. These datasets consist of real images containing reflections, not pseudo-synthesized composites. The number of images of each type used in the evaluation is shown in Table 1.

Table 1. Number of data used in the evaluation.

3.2 Experimental Procedure

In this study, all models were trained with the same learning setup under the same conditions to allow controlled comparison. Training used a batch size of 16, 100 epochs, a learning rate of 0.0003, and Adam as the optimizer. PSNR and SSIM were used as evaluation metrics [19,20,21]. Training ran on an NVIDIA GeForce RTX 3090 GPU with an Intel Core i9 CPU, 64 GB of RAM, and Ubuntu 20.04 [22]. The combinations used in the experiments are listed in Table 2.
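A minimal sketch of this training loop under the stated hyperparameters is shown below; `ReflectionRemovalNet`, `VGGLoss`, and `total_loss` refer to the earlier sketches, and `train_dataset` is a hypothetical placeholder, not the authors' code.

```python
# Sketch of the training setup: batch size 16, 100 epochs, lr 3e-4, Adam.
import torch
from torch.utils.data import DataLoader

model = ReflectionRemovalNet().cuda()
vgg_loss = VGGLoss().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# train_dataset is assumed to yield (synthetic image I, ground-truth T) pairs.
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

for epoch in range(100):
    for image, target_t in loader:
        image, target_t = image.cuda(), target_t.cuda()
        pred_t, pred_r = model(image)
        # Only the transmitted layer is supervised here, per Eqs. (2)-(4).
        loss = total_loss(pred_t, target_t, vgg_loss, lam=0.1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```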

Table 2. Proposed methods.

3.3 Experimental Results

The experimental results are shown in Fig. 4, and the evaluation indices are given in Table 3. Note that the values shown in the table are averages.
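For reference, the per-image metrics behind these averages can be computed with scikit-image; a minimal sketch, assuming float images in [0, 1] and a hypothetical `pairs` iterable of (output, ground truth) images:

```python
# Sketch of the PSNR/SSIM evaluation and averaging.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

psnrs, ssims = [], []
for output, truth in pairs:  # pairs of (restored T_hat, ground-truth T)
    psnrs.append(peak_signal_noise_ratio(truth, output, data_range=1.0))
    ssims.append(structural_similarity(truth, output, channel_axis=-1,
                                       data_range=1.0))
print(f"PSNR: {np.mean(psnrs):.2f} dB, SSIM: {np.mean(ssims):.4f}")
```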

Fig. 4. Comparison of output images

Table 3. Results of experiments.

4 Discussion

In this section, we discuss the results in Fig. 4 and Table 3, as well as each of the proposed methods listed in Table 2.

First, a comparison is made between the conventional method and proposed method 1. This is a controlled comparison: the network models differ while the reflection formation model is the same. Table 3 shows that proposed method 1 is more accurate, and the resulting images in Fig. 4 show that reflections are removed without a large drop in pixel values. We attribute this largely to the network model (DeepLabv3+) of proposed method 1, and in particular to the ASPP within the DeepLabv3+ structure.

Next, we compare proposed method 1 with proposed method 2. This is a controlled comparison in which the network models are the same and the reflection formation models differ. From Table 3, neither method is clearly better, and the resulting images in Fig. 4 likewise differ from image to image. These differences are caused by the reflection formation model, which affects the results through its similarity to real images.

The proposed method is superior to the conventional methods [23]. However, even with the proposed method, reflections could not be removed from some images. There are two possible reasons for this. The first is the network model: as this study shows, results vary greatly with the network model, so a network model better suited to reflection removal must be constructed. The second is the reflection formation model: since the images it creates are used for training, it must closely approximate the real data.

Moreover, as the results show, reflected light in real data exhibits various patterns [24], and images must be created for each of them. Addressing these two causes will lead to better reflection removal in the future. Furthermore, although learning in this study was carried out using only synthetic data, learning with real data is also a promising measure.

5 Conclusion

In this paper, DeepLabv3+ is proposed as a deep learning model for single-image reflection removal. Reflection formation models are also proposed, and synthetic training images are generated with them.

Experiments are conducted on four datasets commonly used in the SIRR field to compare the proposed methods. The experiments show that the proposed method outperforms the conventional methods. They also confirm that the results are affected by the reflection formation model used to create the synthetic data. Although the proposed method was trained only on synthetic data, it gave excellent results on real data.

Future tasks are to construct a model better suited to removing reflections, to study a reflection formation model closer to real data, and to study learning with real data. Learning with real data is the most effective measure, and transfer learning and meta-learning can be used for this purpose.