
1 Introduction

Shadow detection and removal is a fundamental and challenging task in computer vision and computer graphics. In an image, a shadow is the direct result of occluding a light source. Shadows can affect the accuracy of several computer vision tasks, such as object segmentation [20], object recognition [2], and object tracking [13], because shadows share visual characteristics with objects and can therefore be misclassified as part of an object.

In computer vision, the problem of shadow detection and removal has received much attention. Early works on this task [5, 6, 21, 26, 28, 29] used physical models of features such as intensity, color, gradient, and texture. However, these hand-crafted, feature-based methods struggle to capture high-level features and the related semantic content. In recent years, deep learning-based approaches for learning this mapping have made significant progress in the field. Khan et al. [11, 12] used convolutional neural networks (CNNs) for shadow detection and a Bayesian model for shadow removal. The model of Qu et al. [19] is based on an end-to-end multi-context embedding framework that extracts essential characteristics from multiple aspects and accumulates them to determine the shadow matte. Fan et al. [4] employed a deep CNN containing an encoder-decoder and a refinement model for extracting features with local detail correction and learning the shadow matte. Bansal et al. [3] developed a deep learning model to extract features and directly detect the shadow mask. Hu et al. [7] presented a direction-aware spatial context (DSC) module, used with CNNs, to detect and remove the shadow.

The generative adversarial network (GAN) [1] and its extensions, presented in recent years, are dominant strategies for dealing with diverse image-to-image translation challenges. Conditional GANs (CGANs) [15] are significant GAN extensions that incorporate conditioning information into the generator and the discriminator. Nguyen et al. [17] demonstrated the first method of shadow detection with adversarial learning and constructed a CGAN-based architecture to output a shadow mask that can realistically correspond to the ground-truth mask. A shadow image with an adjustable sensitivity factor is used as the conditioning information to the generator and the discriminator. Wang et al. [27] presented a supervised model based on two Stacked-CGANs to tackle shadow detection and removal problems simultaneously in an end-to-end manner. Nagae et al. [16] developed a model based on the method in [27], with minor changes in the shadow removal CGAN, that estimates the illumination ratio and uses that estimation to produce the output. Although these approaches [7, 27] effectively remove the shadow, they tend to generate artifacts and inconsistent colors in the non-shadow area. Hu et al. [8] presented a Mask-ShadowGAN framework that enforces cycle consistency by the guidance of masks and learns a bidirectional mapping between the shadow and shadow-free domains. Tan et al. [24] developed a target-consistency GAN (TCGAN) for shadow removal that aims to learn a unidirectional mapping to translate shadow images into shadow-free images. These methods [8, 24] remove the shadow by maintaining a non-shadow region with cycle and target consistency but suffer from overexposure problems and random artifacts. Also, they require unpaired shadow and shadow-free datasets with the same statistical distribution for better learning.

In this paper, we propose a novel method based on GANs with cycle constraints, and introduce an adaptive exposure correction module to handle the over-exposure problem. Figure 1 shows a shadow removal result of the proposed method compared with Mask-ShadowGAN [8], which suffers from over-exposure, particularly in the shadow area. In contrast, our approach handles that problem and generates a result close to the ground-truth. The key contributions of this work are as follows.

  • We present a framework that removes the shadow using generative adversarial constraints along with cycle consistency and content constraints.

  • We introduce an adaptive exposure correction module for handling the over-exposure problem.

  • We introduce a method for enhancing the quality of benchmark datasets and subsequently improving the shadow removal results.

The rest of the paper is organized as follows. Section 2 describes the proposed framework. Section 3 presents experimental results along with the ablation study, and we conclude the work in Sect. 4.

Fig. 1. Shadow removal results comparing the Mask-ShadowGAN [8] method with the proposed method.

2 Proposed Method

The overall scheme of the proposed method is depicted in Fig. 2. The method is based on CycleGAN [30], in which each adversarial generator learns a mapping to the other domain and the corresponding discriminator guides the learning procedure. Apart from the adversarial and cycle constraints, we also employ content and identity constraints as guidance for better learning. Compared to the baseline Mask-ShadowGAN [8], which requires unpaired data with an equal statistical distribution of the shadow and shadow-free domains, our method utilizes the available shadow, shadow-free, and shadow mask images to learn a better mapping for shadow removal.

Fig. 2. Illustration of the architecture of the proposed method.

2.1 Generator and Discriminator Learning

The proposed method learns from both the shadow domain \(\mathbb {D}_x\) and the shadow-free domain \(\mathbb {D}_y\). While learning from domain \(\mathbb {D}_x\), the generator network \(G_f\) takes a real shadow image \(I_s \in \mathbb {D}_x\) as input, and generates a shadow-free image \(\hat{I}_{f*}\). The discriminator network \(D_f\) is used to differentiate whether the produced shadow-free image \(\hat{I}_{f*}\) is a real shadow-free image or not. To achieve the cycle-consistency, another generator \(G_s\) is used to reconstruct the shadow image \(\hat{I}_s\) from the generated shadow-free image \(\hat{I}_{f*}\) using a ground-truth shadow mask \(M_{gt*}\) for the image \(I_s\) as a guide.

In the process of learning from the shadow-free domain \(\mathbb {D}_y\), the generator network \(G_s\) takes a real shadow-free image \(I_f \in \mathbb {D}_y\) as input and a ground-truth shadow mask \(M_{gt}\) for the image \(I_f\) as a guide, and generates a shadow image \(\hat{I}_{s*}\). The discriminator network \(D_s\) determines if the created shadow image \(\hat{I}_{s*}\) is a real shadow image or not. To formulate the cycle-consistency loop, the generator \(G_f\) reconstructs the shadow-free image \(\hat{I}_f\) from the generated shadow image \(\hat{I}_{s*}\).

To summarize, the discriminator network \(D_s\) takes either a real sample \(I_s\) or a fake sample \(\hat{I}_{s*}\) as input and discriminates whether the input is from \(\mathbb {D}_x\) or not. Similarly, the discriminator \(D_f\) takes either a real sample \(I_f\) or a fake sample \(\hat{I}_{f*}\) as input and discriminates whether the input is from \(\mathbb {D}_y\) or not. We discuss the corresponding loss functions in Sect. 2.3.
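A minimal PyTorch-style sketch of this bidirectional learning pass is given below. The generator modules, the two-argument call \(G_s(\text{image}, \text{mask})\), and the all-zero null mask are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Hedged sketch of one learning pass over both domains (Sect. 2.1). G_f and
# G_s are the generator modules; how the guiding mask is fed to G_s and the
# all-zero null mask are assumptions made for illustration only.
import torch

def forward_pass(G_f, G_s, I_s, I_f, M_gt_star, M_gt):
    # Learning from the shadow domain D_x.
    fake_sf = G_f(I_s)                       # generated shadow-free image
    rec_s = G_s(fake_sf, M_gt_star)          # reconstructed shadow image (cycle)
    # Learning from the shadow-free domain D_y.
    fake_s = G_s(I_f, M_gt)                  # generated shadow image
    rec_sf = G_f(fake_s)                     # reconstructed shadow-free image (cycle)
    # Identity mappings used by Eqs. (5)-(6); null mask assumed all-zero.
    idt_s = G_s(I_s, torch.zeros_like(M_gt_star))
    idt_sf = G_f(I_f)
    return fake_sf, fake_s, rec_s, rec_sf, idt_s, idt_sf
```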

2.2 Adaptive Exposure Correction Module

Given a shadow image, the generator network \(G_f\) is trained to produce a shadow-free image. However, in the absence of any constraint, the generated shadow-free images are sometimes much brighter in the shadow area. To handle this over-exposure problem in the resulting shadow-free images, we propose an adaptive exposure correction module that takes the generated shadow-free image \(\hat{I}_{f*}\) and an intermediate shadow mask \(\hat{M}_{*}\) as inputs, and produces the final shadow-free result \(\hat{I}_{fc*}\). The shadow mask \(\hat{M}_{*}\) for the input shadow image \(I_s\) is obtained as \(\mathbb {B}(\hat{I}_{f*}-I_s, t)\), where the binarization operation \(\mathbb {B}\) is performed on the difference between \(\hat{I}_{f*}\) and the real input shadow image \(I_s\), and t is a threshold obtained by Otsu’s algorithm [18]. \(\mathbb {B}\) sets each pixel to zero or one, where zero indicates the non-shadow region (difference \(\le t\)) and one indicates the shadow region (difference \(>t\)). In the adaptive exposure correction module, we extract the shadow and non-shadow areas using \(\hat{M}_{*}\) and apply gamma correction (a power-law transformation) in the shadow area. First, we transform the extracted shadow area to the HSV color space, then perform gamma correction on the value channel and convert it back to the RGB color space. Finally, we combine the gamma-corrected shadow area with the non-shadow area to generate \(\hat{I}_{fc*}\), the final shadow-free image with exposure correction. To estimate the gamma value, we calculate the mean difference between the shadow and non-shadow areas and map it to the gamma range 0 to 2. Ideally, for a non-overexposed image, the gamma value will be 1, and no correction will be applied. The final shadow mask \(\hat{M}_{c*}\) is then obtained as \(\mathbb {B}(\hat{I}_{fc*}-I_s, t)\).
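A minimal sketch of this correction step is given below (OpenCV and NumPy). The exact mapping from the mean shadow/non-shadow brightness difference to the gamma value is not fully specified above, so a simple linear mapping clipped to [0, 2] is assumed; all function and variable names are illustrative.

```python
# Minimal sketch of the adaptive exposure correction module (OpenCV/NumPy).
# The linear mapping from the mean brightness difference to a gamma in [0, 2]
# is an assumption; names are illustrative, not the authors' code.
import cv2
import numpy as np

def exposure_correct(shadow_free, shadow_input):
    """shadow_free: generated shadow-free image (uint8 RGB);
    shadow_input: real input shadow image (uint8 RGB)."""
    # Binarize the difference with Otsu's threshold to obtain the mask M*.
    diff = cv2.cvtColor(cv2.absdiff(shadow_free, shadow_input), cv2.COLOR_RGB2GRAY)
    _, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = mask.astype(bool)                      # True = shadow region

    # Estimate gamma from the mean brightness difference between the shadow
    # and non-shadow areas of the generated image (assumed linear mapping).
    gray = cv2.cvtColor(shadow_free, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
    mean_diff = gray[mask].mean() - gray[~mask].mean()
    gamma = float(np.clip(1.0 + mean_diff, 0.0, 2.0))  # gamma = 1 => no change

    # Gamma-correct the V channel of the shadow area only.
    hsv = cv2.cvtColor(shadow_free, cv2.COLOR_RGB2HSV).astype(np.float32)
    v = hsv[..., 2] / 255.0
    v[mask] = np.power(v[mask], gamma)
    hsv[..., 2] = v * 255.0
    corrected = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

    # Combine the corrected shadow area with the untouched non-shadow area.
    result = shadow_free.copy()
    result[mask] = corrected[mask]
    return result, mask
```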

2.3 Objectives and Loss Functions

Adversarial Losses: The primary principle behind adversarial learning is that the discriminator differentiates between real and generated results for both domains, encouraging the corresponding generator to deliver outputs of better image quality. The shadow-free adversarial loss and the shadow adversarial loss are given as:

$$\begin{aligned} \mathcal {L}_{gan\text{- }sf(G)}=MSE(P,D_{f} (\hat{I}_{f*})), \mathcal {L}_{gan\text{- }s(G)}=MSE(P,D_{s}(\hat{I}_{s*})) \end{aligned}$$
(1)
$$\begin{aligned} \begin{array}{r} \mathcal {L}_{gan\text{- }sf(D)}=MSE(P,D_{f}(I_{f}))+ MSE(Q,D_{f}(\hat{I}_{f*})),\\ \mathcal {L}_{gan\text{- }s(D)}=MSE(P,D_{s}(I_{s}))+ MSE(Q,D_{s}(\hat{I}_{s*})) \end{array} \end{aligned}$$
(2)

where \(\hat{I}_{f*}\) (generated as \(G_{f}(I_{s})\)) and \(\hat{I}_{s*}\) (generated as \(G_{s}(I_{f}, M_{gt})\)) are the generated shadow-free and shadow images, respectively, with \(I_{s}\) and \(I_{f}\) being the input shadow and shadow-free images, respectively, and \(P=1\), \(Q=0\).

Cycle Consistency Losses: Cycle consistency \(L_1\) losses defined in Eq. (3) and Eq. (4) are applied to encourage the reconstructed images to be comparable to the original input images and to effectively improve the bidirectional mapping in the \(G_f\) and \(G_s\) networks.

$$\begin{aligned} \mathcal {L}_{cyc\text{- }s}=\Vert \hat{I}_{s}-I_{s}\Vert _{1} \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {L}_{cyc\text{- }sf}=\Vert \hat{I}_{f}-I_{f}\Vert _{1} \end{aligned}$$
(4)

Here, \(\hat{I}_{s}\) (generated as \(G_{s}(G_{f}(I_{s}), M_{gt*})\)) and \(\hat{I}_{f}\) (generated as \(G_{f}(G_{s}(I_{f}, M_{gt}))\)) are the reconstructed shadow and shadow-free images, respectively.

Identity Losses: The identity \(L_1\) losses defined in Eq. (5) and Eq. (6) motivate generators \(G_{s}\) and \(G_{f}\) not to change the input image (a shadow image and a shadow-free image, respectively), and maintain color consistency.

$$\begin{aligned} \mathcal {L}_{idt\text{- }s}=\Vert \hat{I}_{si}-I_{s}\Vert _{1} \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {L}_{idt\text{- }sf}=\Vert \hat{I}_{fi}-I_{f}\Vert _{1} \end{aligned}$$
(6)

where \(\hat{I}_{si}\) is the image generated by \(G_{s}\) from \(I_{s}\) and a null mask \(M_x\), and \(\hat{I}_{fi}\) is the image generated by \(G_{f}\) from \(I_{f}\).

Content Losses: The \(L_1\) content losses defined in Eq. (7) and Eq. (8) encourage the generators to produce images that are closer to the ground-truth images.

$$\begin{aligned} \mathcal {L}_{cont\text{- }s}=\Vert \hat{I}_{s*}-I_{s*}\Vert _{1} \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{cont\text{- }sf}=\Vert \hat{I}_{f*}-I_{f*}\Vert _{1} \end{aligned}$$
(8)

Here, \(I_{s*}\) and \(I_{f*}\) are the ground-truth shadow and shadow-free images, respectively, and \(\hat{I}_{f*}\) (generated as \(G_{f}(I_{s})\)) and \(\hat{I}_{s*}\) (generated as \(G_{s}(I_{f}, M_{gt})\)) are the generated shadow-free and shadow images, respectively.

Loss Function for Generators: The total generator loss for the proposed method is obtained as a weighted sum of the adversarial losses, cycle consistency losses, identity losses, and content losses, given as:

$$\begin{aligned} \begin{array}{r} \mathcal {L}_{G}=\lambda _{1}(\mathcal {L}_{gan\text{- }s(G)}+\mathcal {L}_{gan\text{- }sf(G)})+\lambda _{2}(\mathcal {L}_{cyc\text{- }s}+\mathcal {L}_{cyc\text{- }sf})\\ +\lambda _{3}(\mathcal {L}_{idt\text{- }s}+\mathcal {L}_{idt\text{- }sf})+\lambda _{4}(\mathcal {L}_{cont\text{- }s}+\mathcal {L}_{cont\text{- }sf}) \end{array} \end{aligned}$$
(9)

where \(\lambda _{1},\lambda _{2},\lambda _{3},\lambda _{4}\) are appropriately chosen weights.

Loss Function for Discriminators: The discriminator losses for the shadow-free discriminator \(D_f\) and the shadow discriminator \(D_s\) are given in Eq. (10) and Eq. (11), respectively.

$$\begin{aligned} \mathcal {L}_{D_{f}}=\lambda _{5}(\mathcal {L}_{gan\text{- }sf(D)}) \end{aligned}$$
(10)
$$\begin{aligned} \mathcal {L}_{D_{s}}=\lambda _{5}(\mathcal {L}_{gan\text{- }s(D)}) \end{aligned}$$
(11)

Here, \(\lambda _{5}\) is an appropriately chosen weight.
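As a compact illustration, the following PyTorch-style sketch assembles the generator objective of Eq. (9) and the discriminator objectives of Eqs. (10)–(11) from the individual terms. Module and tensor names are illustrative (matching the sketch in Sect. 2.1), and the weights follow the values reported in Sect. 3.

```python
# Illustrative PyTorch assembly of the objectives in Eqs. (1)-(11); tensor
# names follow the sketch in Sect. 2.1 and the weights follow Sect. 3.
import torch
import torch.nn.functional as F

lambda1, lambda2, lambda3, lambda4, lambda5 = 1.0, 10.0, 5.0, 5.0, 0.5

def generator_loss(D_f, D_s, fake_sf, fake_s, rec_s, rec_sf,
                   idt_s, idt_sf, I_s, I_f, gt_s, gt_sf):
    pred_sf, pred_s = D_f(fake_sf), D_s(fake_s)
    l_gan = (F.mse_loss(pred_sf, torch.ones_like(pred_sf))        # Eq. (1), P = 1
             + F.mse_loss(pred_s, torch.ones_like(pred_s)))
    l_cyc = F.l1_loss(rec_s, I_s) + F.l1_loss(rec_sf, I_f)        # Eqs. (3)-(4)
    l_idt = F.l1_loss(idt_s, I_s) + F.l1_loss(idt_sf, I_f)        # Eqs. (5)-(6)
    l_cont = F.l1_loss(fake_s, gt_s) + F.l1_loss(fake_sf, gt_sf)  # Eqs. (7)-(8)
    return (lambda1 * l_gan + lambda2 * l_cyc                     # Eq. (9)
            + lambda3 * l_idt + lambda4 * l_cont)

def discriminator_loss(D, real, fake):
    # Eqs. (2), (10)-(11): real samples labeled P = 1, fake samples Q = 0.
    pred_real, pred_fake = D(real), D(fake.detach())
    return lambda5 * (F.mse_loss(pred_real, torch.ones_like(pred_real))
                      + F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))
```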

2.4 Network Architecture and Training Strategy

We use the model of Johnson et al. [10] as the generator network, which consists of 3 convolutional layers, 9 residual blocks, and 2 deconvolution layers. After each convolution and deconvolution operation, the network employs instance normalization and the ReLU (rectified linear unit) activation function. For the discriminator network, we use PatchGAN [9], which classifies image patches as real or fake; it uses 4 convolutional layers with instance normalization and the leaky ReLU activation function (slope = 0.2). Adam optimization [14] is adopted during training with a learning rate of 0.0002 and first- and second-order momentum terms of 0.5 and 0.999. The network parameters are initialized from a zero-mean Gaussian distribution with a standard deviation of 0.02. For data augmentation during training, images are resized to \(286\times 286\) and randomly cropped to \(256\times 256\). The network is trained for 200 epochs with a mini-batch size of 1, implemented in PyTorch on an NVIDIA GeForce RTX 2080 Ti GPU.
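The hyper-parameters above translate into the following illustrative PyTorch setup. The initialization, augmentation, and optimizer configuration follow the reported values, while the builder function and its module arguments are placeholders rather than the authors' implementation.

```python
# Illustrative PyTorch setup matching the reported training strategy; the
# builder function and its module arguments (G_f, G_s, D_f, D_s) are
# placeholders rather than the authors' code.
import itertools
import torch
import torch.nn as nn
from torchvision import transforms

def init_weights(module):
    # Zero-mean Gaussian initialization with standard deviation 0.02.
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# Data augmentation: resize to 286x286, then randomly crop to 256x256.
train_transform = transforms.Compose([
    transforms.Resize((286, 286)),
    transforms.RandomCrop(256),
    transforms.ToTensor(),
])

def build_optimizers(G_f, G_s, D_f, D_s):
    # Adam with lr = 0.0002 and momentum terms (0.5, 0.999), as reported above.
    opt_G = torch.optim.Adam(itertools.chain(G_f.parameters(), G_s.parameters()),
                             lr=2e-4, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(itertools.chain(D_f.parameters(), D_s.parameters()),
                             lr=2e-4, betas=(0.5, 0.999))
    return opt_G, opt_D
```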

2.5 Benchmark Dataset Adjustment

Ideally, in a benchmark dataset for the shadow removal task, the non-shadow area of a shadow image and of the corresponding shadow-free image should be identical. However, there are significant differences in color, brightness, and contrast, since the shadow and shadow-free images were captured at different times of the day. Over the whole testing split of ISTD [27], the root mean square error (RMSE) in the LAB color space between the shadow and shadow-free images in the non-shadow area is 6.83, whereas it should ideally be close to 0. Figure 3 shows sample triplets from the ISTD dataset, where the difference in the non-shadow area is clearly visible. Supervised models are trained, with correspondingly defined loss functions, to produce an output close to the ground-truth shadow-free image; as a result, these methods yield outputs whose color, brightness, and contrast are inconsistent with the non-shadow area of the input shadow image. Hence, it is essential to adjust the ground-truth shadow-free images to achieve better results.

Fig. 3. ISTD triplets, showing the issue in the non-shadow area.

To achieve this, we process each image individually, adjusting the ground-truth shadow-free image using a regression technique. The following steps are used for this correction task.

  • The non-shadow areas of the shadow and shadow-free images are extracted using the shadow mask.

  • A regressor is fitted on the extracted non-shadow area, learning to transform the non-shadow pixel values of the shadow-free image into the corresponding pixel values of the shadow image.

  • Finally, the trained regressor takes the shadow-free image as input and generates an adjusted shadow-free image.

We conducted various experiments using three well-known regressors: the Linear Regressor (LR), the Decision-Tree Regressor (DTR), and the K-Nearest-Neighbor Regressor (KNNR). Further, we considered both the RGB and LAB color spaces. We also compared single-output regression, where regression is performed on each color channel individually, with multi-output regression, where regression is performed on the three color channels jointly. Finally, we used the best-performing decision-tree multi-output regressor in the RGB color space for the benchmark dataset adjustment. The following steps describe the decision-tree regressor algorithm; a code sketch of the full adjustment step is given after the algorithm.

  • Given a training vector x and a label vector y, the decision tree divides the feature space in a recursive fashion, such that the samples with similar labels are grouped together.

  • Let the data at node n be denoted by \(D_n\), containing \(m_n\) samples. For each candidate split \(\delta = (i,t_n)\), where i is a feature and \(t_n\) is a threshold, partition the data into subsets \(D_{n}^{left}(\delta )\) and \(D_{n}^{right}(\delta )\) according to the following equations.

    $$\begin{aligned} D_{n}^{left}(\delta ) = \{(x,y)|x_i \le t_n\} \end{aligned}$$
    (12)
    $$\begin{aligned} D_{n}^{right}(\delta ) = \{(x,y)|x_i>t_n\} \end{aligned}$$
    (13)
  • The quality of a candidate split of node n is then measured using a loss function H and an impurity function G, defined in Eq. (14) and Eq. (15), respectively. Here, \(\bar{y}_{n}\) is the mean label value at node n, and the mean squared error is used as the loss function.

    $$\begin{aligned} H(D_{n})=\frac{1}{m_{n}} \sum _{y \in D_{n}}(y-\bar{y}_{n})^{2}, \bar{y}_{n}=\frac{1}{m_{n}} \sum _{y \in D_{n}} y \end{aligned}$$
    (14)
    $$\begin{aligned} G(D_{n}, \delta )=\frac{m_{n}^{left}}{m_{n}} H(D_{n}^{left}(\delta ))+\frac{m_{n}^{right}}{m_{n}} H(D_{n}^{right}(\delta )) \end{aligned}$$
    (15)
  • Parameters that minimize the impurity are selected for splitting, as follows:

    $$\begin{aligned} \delta ^{*}=\arg \min _{\delta } G(D_{n}, \delta ) \end{aligned}$$
    (16)
  • The algorithm recurses on the subsets \(D_{n}^{left}(\delta ^{*})\) and \(D_{n}^{right}(\delta ^{*})\) until \(m_n = 1\).
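The per-image adjustment can be realized with scikit-learn's decision-tree regressor, which supports multi-output targets natively, as in the following minimal sketch; the function name, boolean mask convention, and uint8 handling are illustrative assumptions.

```python
# Minimal sketch of the per-image ground-truth adjustment with a multi-output
# decision-tree regressor (scikit-learn); names and conventions are assumed.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adjust_shadow_free(shadow_img, shadow_free_img, shadow_mask):
    """shadow_img, shadow_free_img: HxWx3 uint8 RGB arrays;
    shadow_mask: HxW boolean array (True = shadow pixel)."""
    non_shadow = ~shadow_mask
    # Fit on non-shadow pixels: shadow-free RGB -> corresponding shadow-image RGB.
    X = shadow_free_img[non_shadow].astype(np.float32)   # (N, 3)
    y = shadow_img[non_shadow].astype(np.float32)        # (N, 3)
    regressor = DecisionTreeRegressor().fit(X, y)
    # Apply the learned mapping to every pixel of the shadow-free image.
    flat = shadow_free_img.reshape(-1, 3).astype(np.float32)
    adjusted = regressor.predict(flat).reshape(shadow_free_img.shape)
    return np.clip(adjusted, 0, 255).astype(np.uint8)
```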

3 Experimental Results

Database Description: To analyze the performance of the proposed framework, we experimented with the dataset containing image shadow triplets, termed ISTD [27], and trained our models accordingly. ISTD contains 1870 triplets of shadow image, shadow mask, and shadow-free image, with 1330 triplets in the training split and 540 in the testing split.

Evaluation Parameters: We followed [17, 25, 27] and used the balance error rate (BER) for the quantitative comparison of shadow detection. The balance error rate is calculated as:

$$\begin{aligned} \textrm{BER}=1-\frac{1}{2}\left( \frac{T P}{T P+F N}+\frac{T N}{T N+F P}\right) \end{aligned}$$
(17)

where

  • True Positive (TP) denotes the number of pixels labeled as shadow by the predictive model that are actually shadow.

  • False Positive (FP) denotes the number of pixels labeled as shadow by the predictive model that are actually non-shadow.

  • True Negative (TN) denotes the number of pixels labeled as non-shadow by the predictive model that are actually non-shadow.

  • False Negative (FN) denotes the number of pixels labeled as non-shadow by the predictive model that are actually shadow.
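For concreteness, a small NumPy sketch of this metric computed from binary masks is given below; the function name and the boolean mask convention are assumptions.

```python
# Illustrative NumPy computation of BER from binary prediction and
# ground-truth masks; the convention True = shadow pixel is assumed.
import numpy as np

def balance_error_rate(pred_mask, gt_mask):
    tp = np.logical_and(pred_mask, gt_mask).sum()
    tn = np.logical_and(~pred_mask, ~gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    return 1.0 - 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```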

For the quantitative assessment of shadow removal, we followed recent procedures [7, 8, 24, 27] and used the root mean square error (RMSE) in the LAB color space, computed between the ground-truth and produced shadow-free images. We resized all images to \(286\times 286\) for a fair comparison. Additionally, we calculated the RMSE in four scenarios: comparing the resulting shadow-free image \(\hat{I}_{fc*}\) with the ground-truth shadow-free image \({I}_{f*}\) (i) over all pixels (denoted ‘O’), (ii) over pixels in the shadow region (denoted ‘S’), (iii) over pixels in the non-shadow region (denoted ‘SF’), and (iv) comparing \(\hat{I}_{fc*}\) with the input shadow image \({I}_{s}\) over pixels in the non-shadow region (denoted ‘SF-I’). In the experiments, the hyper-parameters \(\lambda _{1}, \lambda _{2}, \lambda _{3}, \lambda _{4}, \lambda _{5}\) are set to 1, 10, 5, 5, and 0.5, respectively. In the tables, the best and second-best results are highlighted in bold and blue, respectively.
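The metric can be computed as in the following illustrative sketch (scikit-image is assumed for the LAB conversion); the optional region mask covers the O/S/SF/SF-I variants, and all names are assumptions rather than the authors' evaluation code.

```python
# Hedged sketch of the RMSE metric in the LAB color space; scikit-image is
# assumed available for the color conversion, and names are illustrative.
import numpy as np
from skimage.color import rgb2lab

def lab_rmse(img_a, img_b, region_mask=None):
    """img_a, img_b: HxWx3 uint8 RGB images; region_mask: optional HxW
    boolean array selecting the pixels to evaluate (e.g. shadow region)."""
    diff = rgb2lab(img_a) - rgb2lab(img_b)
    if region_mask is not None:
        diff = diff[region_mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```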

Evaluation on Removal: We compare the shadow removal performance of the proposed method with the methods in [5,6,7,8, 24, 27, 28] on the ISTD test dataset. The results are shown in Table 1. Our method achieves the best performance in the O and SF scenarios, and the second-best performance in the S and SF-I scenarios. TCGAN [24] achieves the best result in SF-I but performs poorly in S. Similarly, DSC [7] achieves the best result in S but performs poorly in SF and SF-I. Our approach achieves comparable results in all aspects and yields the best overall value O compared to the other methods. Figure 4 shows a visual comparison with ST-CGAN [27] and Mask-ShadowGAN [8]. While ST-CGAN [27] suffers from color inconsistency and artifacts, and Mask-ShadowGAN [8] from over-exposure, our approach handles those issues and produces better output.

Table 1. Quantitative results of removal with RMSE on ISTD test dataset.
Fig. 4. Visual comparison of shadow removal results on the ISTD test dataset.

Evaluation on Detection: We evaluate the shadow detection performance against the recent methods [8, 17, 25, 27] on the ISTD test dataset. The quantitative results are shown in Table 2. The proposed method outperforms the baseline Mask-ShadowGAN [8] as well as CGAN [17] and StackedCNN [25]. SCGAN [17] and ST-CGAN [27] achieve better results since these methods specifically train their networks for the detection task. As our goal is shadow removal, we do not train any separate network for detection; instead, we extract the shadow mask from the final shadow-free image and the input image as discussed in Sect. 2.2. Figure 5 shows a visual comparison with the state-of-the-art Mask-ShadowGAN [8]. Our approach produces a shadow mask close to the ground-truth shadow mask.

Benchmark Dataset Adjustment: To adjust the ground-truth shadow-free images, we experimented with the Linear Regressor (LR), Decision-Tree Regressor (DTR), and K-Nearest-Neighbor Regressor (KNNR) in the RGB and LAB color spaces. When performing regression in the LAB color space, both the shadow and shadow-free images are transformed from RGB to LAB; after regression and correction, they are transformed back to RGB. We also performed experiments using a separate regressor for each color channel (three one-input, one-output regressors) and using a combined-channel regressor (one three-input, three-output multi-output regressor). For implementation, we used the regression methods from the scikit-learn Python library [22]. The results of these experiments are shown in Table 3.

Table 2. Quantitative results of detection with BER(%) on ISTD test dataset.
Fig. 5. Visual comparison of shadow detection results on ISTD test dataset.

Table 3. Quantitative results of ISTD test dataset adjustment task with RMSE.

Experimentally, we observed that the combined-channel decision-tree regressor in the RGB color space yields the lowest RMSE values in the O and SF scenarios. We therefore used this method to create an adjusted ISTD training and testing dataset. Figure 6 shows visual results of this dataset adjustment task using the selected method.

Fig. 6. Visual results of ISTD dataset adjustment task.

Evaluation on Removal with Adjusted Benchmark ISTD Dataset: We compare the shadow removal performance of the proposed method with the methods in [8, 27], trained and tested on the adjusted ISTD dataset. Since the official code for the ST-CGAN method [27] is not available, we use the community code [23] for evaluation purposes. The results are shown in Table 4. The proposed method achieves the best performance in the O, S, and SF scenarios compared to the state-of-the-art methods.

Table 4. Quantitative shadow removal results with RMSE, trained and tested on adjusted ISTD dataset.

Ablation Study: We conducted an ablation study on the presented framework by removing the exposure correction module (denoted -c), by not using ground-truth shadow and shadow-free images (denoted -gt), and by not using ground-truth masks (denoted -gtm). For the -gt experiment, we ignored the content losses; for the -gtm experiment, we first generated masks from the ground-truth shadow and shadow-free images as described in Sect. 2.2. Removal and detection results for all the experiments are shown in Table 5, and the visual performance for -c is shown in Fig. 4. Our full approach achieves the best overall performance for removal and detection, which shows the importance of the ground-truth data and the correction module in achieving the best result.

Table 5. Ablation study.

4 Conclusion

We proposed a GAN-based method for the shadow removal task in images. We used different constraints to effectively learn the bidirectional relationship between the shadow and shadow-free domains in the paired setting. We also presented a novel post-training process to handle the over-exposure problem. As a result, the proposed method with the exposure correction module achieves the best or comparable performance compared to existing state-of-the-art methods, both quantitatively and visually. We further explored an issue in benchmark datasets and introduced a technique for adjusting them to additionally improve the shadow removal results. Finally, we conducted various experiments to analyze the importance of the ground-truth data and the exposure correction module in generating better quality output.