
1 Introduction

It is often the case that the subject we are trying to photograph is on the other side of a glass pane, and we end up taking the photograph through the glass because it is simply unavoidable or removing it is not worth the effort. Photographs taken this way contain undesirable reflections that degrade the visibility of the scene by blurring, obstructing or deforming the background, and they may cause computer-vision algorithms, such as object detection, event detection, object recognition, image segmentation and video tracking, to fail or perform poorly. Obtaining reflection-free images taken through glass is therefore of great interest in the image processing and computer vision community and has practical demand.

$$ \boldsymbol{I}=\boldsymbol{R}+\boldsymbol{B} $$
(8.1)
  • where

  • I: n × m × 3 matrix which represents the reflection-contaminated image

  • R: n × m × 3 matrix which represents the reflection layer

  • B: n × m × 3 matrix which represents the background layer

The goal of the work is to approximate the background layer B from the acquired image I. Figure 8.1 illustrates the reflection-contaminated image as well as the ground truth for the background and the reflection layers.

Fig. 8.1

Reflection-contaminated image “I”, background layer “B” and reflection layer “R”

The problem of removing reflection from a single image is ill-posed: for a given reflection-contaminated image, there are infinitely many possible decompositions into a background layer and a reflection layer, as illustrated with an example image in Fig. 8.2 (a small numerical illustration follows the figure). In addition, the lack of sufficient labelled training data, together with the fact that both the background and the reflection layers contain natural-scene content, adds to the ill-posedness of the problem.

Fig. 8.2

Three possible separations of a reflection-contaminated image into the background and the reflection layers
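To make the non-uniqueness concrete, the following minimal NumPy sketch (an illustration added here, not part of the original experiments) shows that any perturbation D which keeps both layers within a valid intensity range yields another decomposition of the same contaminated image I.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.uniform(0.3, 0.7, size=(4, 4))    # "true" background layer
R = rng.uniform(0.0, 0.3, size=(4, 4))    # "true" reflection layer
I = B + R                                 # Eq. (8.1)

D = rng.uniform(0.0, 0.1, size=(4, 4))    # arbitrary perturbation
B_alt, R_alt = B - D, R + D               # a different, equally valid split
assert np.allclose(B_alt + R_alt, I)      # both decompositions explain I
```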

Most existing methods for removing reflections use specialized hardware or multiple images to make the problem less ill-posed and produce better results. Recently, some works have used deep learning methods, which outperform the earlier approaches, but they rely on very complex architectures, tend to blur or degrade the images, and fail when the background and reflection layers are very similar in brightness and structural appearance.

Our contributions to address the above-mentioned issues are as follows:

  • We have trained an end-to-end neural network with a relatively simple architecture to estimate the background layer.

  • We have designed a loss function based on the SSIM score, which is better suited to comparing the similarity between images.

  • We have created a larger labelled training dataset using data from multiple sources.

2 Literature Survey

The problem of removing reflection artefacts from an image has been widely researched in the image processing and computer vision community. Existing work can be classified into two categories based on the number of inputs required to produce a single reflection-free image: methods requiring multiple inputs (such as multiple images or specialized hardware to capture the image) and methods requiring a single image as input. Single-image methods can be further classified by the approach used to solve the problem: conventional mathematical approaches or learning-based approaches.

2.1 Multi-image Methods

Multiple related images can make the problem of reflection removal less ill-posed and easier to solve, but they make the process of capturing images difficult. Guo et al. [1] and Li and Brown [2] use images taken from slightly different angles or a video sequence. Agrawal et al. [3] use image pairs taken with and without flash. Schechner et al. [4] use a polarizer to obtain multiple polarized images. Kong et al. [5] use image pairs with the subject in and out of focus. These methods produce state-of-the-art results but are of limited practicability due to the complex process of capturing the images.

2.2 Single Input Methods

Compared with multi-image approaches, suppressing or removing reflection artefacts from a single image is difficult because of the limited data available.

2.3 Traditional Approaches

The following methods use conventional mathematical techniques to remove the reflection.

Levin et al. [6] proposed an approach based on local features (corners and edges) under a gradient sparsity prior. Their method decomposes the reflection-contaminated image into two images such that the total number of corners and edges is minimized. However, it performs poorly as the texture complexity of the image rises. Levin et al. [7] rely on user assistance to simplify the problem. Although this method manages to separate the reflections from a single image to a certain degree, manually marking the image for the presence of reflection is difficult and is only practical for a small number of images. Shih et al. [8] reduce the ill-posedness by using ghosting cues and exploit a Gaussian mixture model (GMM) to learn image priors. Ghosting cues are double reflections shifted by some distance, arising because light is reflected at both surfaces of a glass pane. They occur mostly with double-pane or fairly thick glass, so this method works only on a small subset of reflection-contaminated images. Wan et al. [9] assume as a prior that the background layer contains sharp, well-defined edges while the reflection layer is relatively smoother, and use this relative difference in smoothness to build a depth-of-field (DoF) confidence map, which is then used to classify edges as belonging to either the background layer or the reflection layer. This method cannot remove reflections from images with fine textures or small reflection artefacts.

2.4 Learning-Based Approaches

Recent works have leveraged deep learning capabilities to solve the reflection removal problem.

Fan et al. [10] follow the same prior assumption as [9], i.e. the reflection layer is out of focus and blurry. They created a synthetic dataset that mimics the assumed prior and proposed a two-stage cascaded network: the first stage predicts the edges of the background layer, and the predicted edges guide the recovery of the background layer in the second stage. Wan et al. [11] merged the two stages of [10] into a single end-to-end concurrent network that predicts the edges and separates the layers. Zhang et al. [12] combined three losses (feature loss, adversarial loss and exclusion loss) to train their end-to-end network. The network and the losses are tuned to exploit both low-level and high-level information, yet the method still performs poorly on images with high exposure. Recently, GAN (generative adversarial network)-based methods [13, 14] have yielded good results, but they still have issues handling images with extreme exposures, and [13] fails to produce images with natural colours, as the colour tone is altered where parts of the reflection appear in the background. A further problem inherent to GANs is their complexity, both in terms of network architecture and the time and parameter tuning required to train the network. Several other related articles [15,16,17,18,19,20] proposed deep learning-based IoT methods to solve different problems.

3 Proposed Method

3.1 Training Dataset

All existing learning-based single image reflection removal methods fail to take full advantage of their proposed models due to the lack of labelled training data. This is a common problem in the computer vision community, and although there are some workarounds, they are limited in the cases to which they can be applied. The most common workaround is to create a synthetic dataset. The problem with synthetic datasets is that they usually fail to mimic the wide range and variety of content present in natural images, which in turn limits the method's ability to deal with naturally occurring images. Another workaround is to assume priors and design the method around them. Priors usually restrict the scope of the approach by placing bounds on the input, making the approach more tailored to inputs from a smaller range; such methods may or may not perform equally well for inputs outside this range.

We have followed the first approach, i.e. expanding the dataset with synthetic images. We have used data from multiple sources for this purpose and merged it with images from already available datasets to train the reflection removal model.

The PASCAL Visual Object Classes (VOC) dataset [15] is used to create synthetic images with reflection artefacts. To synthesize one image, two images are selected from the dataset and cropped into 256 × 256 patches. One patch is selected as the background and the other as the reflection, and the two are merged using the following equation:

$$ \boldsymbol{I}=\boldsymbol{\alpha}\ast \boldsymbol{B}+\boldsymbol{\beta}\ast \left(\boldsymbol{G}\otimes \boldsymbol{R}\right) $$
(8.2)
  • where

  • I, B and R are n × m × 3 matrices representing the resulting synthetic image, the background patch, and the reflection patch, respectively

  • α: blending weight for the background patch

  • β: blending weight for the reflection patch and β = (1 – α)

  • G: represents the Gaussian blur operation applied on reflection patch

The reflection patch is blurred using a Gaussian blur, and then a blending weight α ∈ [0.6, 0.8] is used to combine the two patches. The generated dataset contains 50,000 synthetic images. An image triplet (the background, the reflection and the final blended result) from the training dataset is shown in Fig. 8.3; a minimal code sketch of this synthesis step is given after the figure.

Fig. 8.3

Image triplet (B, R, I) from the training dataset
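The synthesis step of Eq. (8.2) can be sketched as follows. This is a minimal illustration assuming OpenCV and NumPy; the function name, the Gaussian kernel size and sigma, and the patch-cropping strategy are our assumptions, as the text only specifies 256 × 256 patches, a Gaussian blur on the reflection patch and α ∈ [0.6, 0.8].

```python
import cv2
import numpy as np

def synthesize_triplet(background_path, reflection_path,
                       patch_size=256, ksize=11, sigma=3.0):
    """Create one synthetic triplet (B, R, I) following Eq. (8.2)."""
    def random_patch(path):
        img = cv2.imread(path).astype(np.float32) / 255.0
        h, w = img.shape[:2]
        y = np.random.randint(0, h - patch_size + 1)
        x = np.random.randint(0, w - patch_size + 1)
        return img[y:y + patch_size, x:x + patch_size]

    B = random_patch(background_path)                    # background patch
    R = random_patch(reflection_path)                    # reflection patch
    R_blur = cv2.GaussianBlur(R, (ksize, ksize), sigma)  # G (x) R

    alpha = np.random.uniform(0.6, 0.8)                  # blending weight for B
    beta = 1.0 - alpha                                   # blending weight for R
    I = np.clip(alpha * B + beta * R_blur, 0.0, 1.0)     # Eq. (8.2)
    return B, R, I
```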

3.2 Model Description (Table 8.1)

Table 8.1 Model architecture

3.3 Loss Function

The loss value measures how far the predictions are from the true values; the loss function therefore reflects the performance of the model and provides a quantitative measure of accuracy. Since the objective realized during training is to minimize the loss, the loss function largely determines how good a solution the trained model is. It must be chosen such that minimizing the loss drives the model's predictions towards the true values, which means tailoring the loss function to the problem being solved. As images are at the centre of the reflection removal problem, we first choose a metric that can precisely measure the similarity between two images and use it inside the loss function to compute the loss value.

The structural similarity (SSIM) index provides a quantitative measure of the structural similarity between images and is formulated on a basis similar to the way the human visual system assesses the similarity between two scenes. Our visual system has evolved to extract structural information from a scene; therefore, measuring the structural resemblance between two images provides a decent approximation of their actual similarity.

SSIM provides a better similarity estimate for images than measures such as peak signal-to-noise ratio (PSNR) and mean squared error (MSE), which weight every pixel equally regardless of whether a change in its value is noticeable to a human observer. This can lead to large variations in MSE and PSNR scores when the contrast or brightness of one image in a pair changes, even though such modifications have little effect on a human observer assessing image similarity, as can be seen in Fig. 8.4. The SSIM index, which is computed over local windows of the image, is more likely to rate such image pairs as similar because their structural information remains closely matched.

Fig. 8.4

Original image, increased contrast, blurred image

The SSIM index ranges from 0 to 1, where 0 means the images share no structural similarity and 1 means perfect structural similarity, which is only possible for identical images. Three components, namely luminance, contrast and structure, are used to calculate the SSIM index for two perfectly aligned images x and y of the same size.

Luminance comparison l(x, y) is given by:

$$ \boldsymbol{l}\left(\boldsymbol{x},\boldsymbol{y}\right)=\frac{\textbf{2}{\boldsymbol{\mu}}_{\boldsymbol{x}}{\boldsymbol{\mu}}_{\boldsymbol{y}}+{\boldsymbol{C}}_{\textbf{1}}}{{\boldsymbol{\mu}}_{\boldsymbol{x}}^{\textbf{2}}+{\boldsymbol{\mu}}_{\boldsymbol{y}}^{\textbf{2}}+{\boldsymbol{C}}_{\textbf{1}}} $$
(8.3)

Contrast comparison c(x, y) is given by:

$$ \boldsymbol{c}\left(\boldsymbol{x},\boldsymbol{y}\right)=\frac{\textbf{2}{\boldsymbol{\sigma}}_{\boldsymbol{x}}{\boldsymbol{\sigma}}_{\boldsymbol{y}}+{\boldsymbol{C}}_{\textbf{2}}}{{\boldsymbol{\sigma}}_{\boldsymbol{x}}^{\textbf{2}}+{\boldsymbol{\sigma}}_{\boldsymbol{y}}^{\textbf{2}}+{\boldsymbol{C}}_{\textbf{2}}} $$
(8.4)

Structure comparison s(x, y) is given by:

$$ \boldsymbol{s}\left(\boldsymbol{x},\boldsymbol{y}\right)=\frac{{\boldsymbol{\sigma}}_{\boldsymbol{x}\boldsymbol{y}}+{\boldsymbol{C}}_{\textbf{3}}}{{\boldsymbol{\sigma}}_{\boldsymbol{x}}{\boldsymbol{\sigma}}_{\boldsymbol{y}}+{\boldsymbol{C}}_{\textbf{3}}} $$
(8.5)
  • where

  • μx is the mean intensity of x and μy is the mean intensity of y,

  • σx² is the variance of intensities of x and σy² is the variance of intensities of y,

  • σxy is the covariance of intensities of x and y, and

  • C1, C2 and C3 are small constants used to avoid instability when the denominators are close to zero:

  • C1 = (K1·L)², C2 = (K2·L)², C3 = C2/2,

  • K1 ≪ 1 and K2 ≪ 1, and L is the dynamic range of the pixel values.

Using the above-mentioned three components, SSIM index is calculated as follows:

$$ \textbf{SSIM}\left(\boldsymbol{x},\boldsymbol{y}\right)=\boldsymbol{l}\left(\boldsymbol{x},\boldsymbol{y}\right)\bullet \boldsymbol{c}\left(\boldsymbol{x},\boldsymbol{y}\right)\bullet \boldsymbol{s}\left(\boldsymbol{x},\boldsymbol{y}\right) $$
(8.6)

Substituting the values of l(x,y), c(x,y) and s(x,y) in the above equation, we get

$$ \textbf{SSIM}\left(\boldsymbol{x},\boldsymbol{y}\right)=\frac{\left(\textbf{2}{\boldsymbol{\mu}}_{\boldsymbol{x}}{\boldsymbol{\mu}}_{\boldsymbol{y}}+{\boldsymbol{C}}_{\textbf{1}}\right)\left(\textbf{2}{\boldsymbol{\sigma}}_{\boldsymbol{x}\boldsymbol{y}}+{\boldsymbol{C}}_{\textbf{2}}\right)}{\left({\boldsymbol{\mu}}_{\boldsymbol{x}}^{\textbf{2}}+{\boldsymbol{\mu}}_{\boldsymbol{y}}^{\textbf{2}}+{\boldsymbol{C}}_{\textbf{1}}\right)\left({\boldsymbol{\sigma}}_{\boldsymbol{x}}^{\textbf{2}}+{\boldsymbol{\sigma}}_{\boldsymbol{y}}^{\textbf{2}}+{\boldsymbol{C}}_{\textbf{2}}\right)} $$
(8.7)
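For reference, a simplified single-window NumPy sketch of Eqs. (8.3)–(8.7) is shown below. The constants K1 = 0.01 and K2 = 0.03 are the customary choices from the SSIM literature (the text only requires K1, K2 ≪ 1), and a full SSIM implementation would average this quantity over local windows of the image rather than compute it globally.

```python
import numpy as np

def ssim_single_window(x, y, L=1.0, K1=0.01, K2=0.03):
    """SSIM of Eq. (8.7) computed over a single window.

    x, y: greyscale images (or windows) as float arrays with values in [0, L].
    """
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    C1 = (K1 * L) ** 2
    C2 = (K2 * L) ** 2

    mu_x, mu_y = x.mean(), y.mean()            # luminance terms
    var_x, var_y = x.var(), y.var()            # contrast terms (sigma^2)
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # structure term (sigma_xy)

    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
```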

Equation (8.7) can be converted into a loss function to calculate the loss between the estimated background layer and the actual background layer as follows:

$$ \textbf{los}{\textbf{s}}_{\textbf{SSIM}}\left({\boldsymbol{y}}_{\boldsymbol{true}},{\boldsymbol{y}}_{\boldsymbol{pred}}\right)=\textbf{1}-\textbf{SSIM}\left({\boldsymbol{y}}_{\boldsymbol{true}},{\boldsymbol{y}}_{\boldsymbol{pred}}\right) $$
(8.8)
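In our implementation (Keras with a TensorFlow backend, see Sect. 4.2), Eq. (8.8) can be realized as a custom loss roughly as follows; the use of tf.image.ssim and the assumption that image tensors are scaled to [0, 1] are implementation details not fixed by the text.

```python
import tensorflow as tf

def loss_ssim(y_true, y_pred):
    """1 - SSIM loss of Eq. (8.8); assumes image tensors scaled to [0, 1]."""
    ssim = tf.image.ssim(y_true, y_pred, max_val=1.0)  # one score per image
    return 1.0 - tf.reduce_mean(ssim)
```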

4 Experiment and Results

In this section, we present the details of the experiments performed and their evaluation. Detailed discussion on the impact of various parameters of the proposed approach on the overall performance is also included.

4.1 Training Details

The network was trained with the following parameters:

  • Number of epochs: 65

  • Batch size: 32

  • Validation split: 0.2

  • Shuffle: True

  • Optimizer: Adam (α = 0.0001, β1 = 0.9, β2 = 0.999)

  • Loss function: MSE, loss_SSIM

A combination of MSE and our custom SSIM-based loss function was used during training. Training was carried out in two phases: in the first phase, we used the entire training dataset with MSE as the loss function and trained the network for 40 epochs; in the second phase, the network was trained for a further 25 epochs on smaller subsets of the training dataset with the SSIM-based loss function (Figs. 8.5 and 8.6). A Keras sketch of this two-phase schedule is given after Fig. 8.6.

Fig. 8.5

Training and validation loss vs epoch graph – Phase I

Fig. 8.6

Training and validation loss vs epoch graph – Phase II
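A minimal Keras sketch of the two-phase schedule described above follows. Here build_model, X_train, Y_train and the Phase II subset size are placeholders and assumptions; the text specifies only the optimizer settings, batch size, validation split, epoch counts and the switch from MSE to the SSIM-based loss (loss_ssim is the function sketched in Sect. 3.3).

```python
import numpy as np
import tensorflow as tf

model = build_model()  # network of Table 8.1 (hypothetical helper)

def make_adam():
    # Adam with alpha = 0.0001, beta1 = 0.9, beta2 = 0.999 as listed above
    return tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)

# Phase I: full synthetic dataset, MSE loss, 40 epochs
model.compile(optimizer=make_adam(), loss="mse")
model.fit(X_train, Y_train, batch_size=32, epochs=40,
          validation_split=0.2, shuffle=True)

# Phase II: smaller subset (size assumed here), SSIM-based loss (Eq. 8.8), 25 epochs
idx = np.random.choice(len(X_train), size=len(X_train) // 5, replace=False)
model.compile(optimizer=make_adam(), loss=loss_ssim)
model.fit(X_train[idx], Y_train[idx], batch_size=32, epochs=25,
          validation_split=0.2, shuffle=True)
```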

4.2 Experimental Set-Up

Experiments were carried out on a system with the following configuration:

  • CPU: Intel Xeon Silver 4114

  • Memory: 64 GB DDR4

  • GPU: NVIDIA Quadro P5000

  • GPU Memory: 16 GB GDDR5X

  • Storage: 4 TB

  • Operating system: Ubuntu 18.04.4 LTS (Bionic Beaver)

Deep learning libraries used: Keras with TensorFlow backend, TensorFlow 2.1.0, CUDA 10.1, cuDNN 7.6.

Programming language and major libraries used: Python 3.6, NumPy, OpenCV, Matplotlib.

4.3 Performance Evaluation Metrics

For performance evaluation, the most common metrics for comparing two images are the PSNR value and the SSIM score (refer to Sect. 3.3 for a detailed description of the SSIM index). Peak signal-to-noise ratio (PSNR) is the ratio between a signal's maximum power and the power of the corrupting noise that affects the quality of images and videos. PSNR is generally expressed on a logarithmic decibel scale. The PSNR between the original image and the noisy image is given by the following equation:

$$ \textrm{PSNR}=20\ast {\log}_{10}\frac{{\mathit{\max}}_f}{\sqrt{\textrm{MSE}}} $$
(8.9)
  • where

  • max_f – the maximum signal value present in the original image

Mean squared error (MSE) is given by:

$$ \textrm{MSE}=\frac{1}{mn}\sum \limits_{i=0}^{m-1}\sum \limits_{j=0}^{n-1}{\left\Vert \textrm{f}\left(i,j\right)-\textrm{g}\left(i,j\right)\right\Vert}^2 $$
(8.10)
  • where

  • f – original image in matrix form

  • g – predicted image in matrix form

  • m – number of rows in input images

  • n – number of columns in input images

  • i, j – co-ordinates of a current pixel location in input images
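A straightforward NumPy sketch of Eqs. (8.9) and (8.10), assuming 8-bit images (max_f = 255), might look as follows.

```python
import numpy as np

def mse(f, g):
    """Mean squared error of Eq. (8.10) for two same-sized images."""
    f = f.astype(np.float64)
    g = g.astype(np.float64)
    return np.mean((f - g) ** 2)

def psnr(f, g, max_f=255.0):
    """PSNR of Eq. (8.9) in decibels; max_f is the peak signal value."""
    err = mse(f, g)
    if err == 0:
        return float("inf")  # identical images
    return 20.0 * np.log10(max_f / np.sqrt(err))
```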

4.4 Testing Dataset

The benchmark SIR2 dataset [16], containing images of real scenes, is used to assess the performance and capabilities of the trained network. The SIR2 dataset was released by the Rapid-Rich Object Search (ROSE) Lab, NTU, Singapore. It provides a large number of diverse reflection-contaminated images along with the corresponding ground truth for their reflection and background layers. It covers both indoor (controlled) scenes and outdoor (wild) scenes. Indoor scenes include postcards and solid everyday objects, such as fruits, toys and mugs. Outdoor scenes contain real-world entities, such as trees, gardens, cars and buildings, with varying illumination, scale and distance. The SIR2 dataset contains a total of 500 image triplets: 200 triplets each for the postcard and solid object subsets and 100 triplets for the wild scene subset (Tables 8.2 and 8.3 and Fig. 8.7).

Table 8.2 Comparison of average SSIM scores for proposed and competing methods
Table 8.3 Comparison of average PSNR values for proposed and competing methods
Fig. 8.7

Reflection removal results on test images by the proposed model

5 Conclusion and Future Work

In this chapter, we have studied the single image reflection removal problem and proposed a method to suppress the reflection and recover the background layer. Our approach focuses mainly on using a simple network architecture along with a loss function tailored to the demands of the problem. To address the lack of labelled training data, we have created and used a synthetic dataset for training the network.

Experimental results validate the efficacy and efficiency of our approach. A similar approach could be applied to problems such as super-resolution, where current methods use complex network architectures, including autoencoders and generative adversarial networks.

Our method produces decent results but still falls short of the state-of-the-art methods. Future work can focus on using the ground truth of the reflection layer, in addition to that of the background layer, to further improve the effectiveness of the approach.