1 Introduction

Artefacts are undesirable and unintended elements that appear in an image, causing a decline in the overall image quality. These artefacts can emerge in digital images due to various factors within the digital camera’s mechanisms, such as blooming, aliasing, compression, and noise. They can also originate naturally as shadows, fog, haze, smoke, and smog. When such a degraded image is subjected to further processing or input into deep learning or computer vision models, it can diminish the effectiveness of the algorithms [8]. Consequently, the likelihood of errors in tasks like object detection, classification or regression by deep learning models increases, given their high sensitivity to the data on which they were trained [40]. These models might inadvertently learn these undesirable traits, or if such traits are encountered in real-time during the inference phase, they could significantly reduce the models’ efficiency and accuracy [3]. Thus, it is imperative to conduct pre-processing on images, and these anomalies must be eradicated to attain favourable performance by deep learning models.

Shadows represent natural phenomena in which certain areas of an image are darker due to partial or complete obstruction of the light source [1]. Depending on their intensity, shadows can be categorised into three main types: hard shadows, soft shadows, and umbra/penumbra [16]. A hard shadow is formed when the light source is entirely blocked, causing the surface texture to disappear completely. In contrast, a soft shadow emerges when the light source is partially blocked, resulting in a partial disappearance of surface texture. As the two fundamental components of a shadow, the umbra is the darker, inner portion of the shadow, while the penumbra is the lighter region typically found at its boundaries.

The presence of shadows in images can significantly impact the accuracy and effectiveness of various algorithms based on deep learning and computer vision. These shadows can lead to issues such as object merging, object loss, and misinterpretations of remote sensing images. Consequently, the removal of shadows from images becomes a crucial task to enhance visual quality. This improvement facilitates the pragmatic application of these algorithms, including tasks like object detection, object classification, and object tracking with the expectation of high accuracy. Various approaches have been proposed in the literature for eliminating shadows from RGB images. These can be divided into two broad categories, i.e., traditional image processing techniques and deep-learning based models.

Prior to the advent of cutting-edge deep-learning approaches, conventional image processing methods were employed for identifying and eliminating shadows from images [13, 39]. However, these traditional algorithms had limitations, being non-scalable and only effective for specific scenarios [14]. Some of these techniques relied on user input, which was impractical for real-time applications [1]. Over the past few years, deep-learning models have gained prominence due to their exceptional performance on vision-related tasks [36]. Generative Adversarial Networks (GANs) have become the preferred choice for shadow detection and removal. Other approaches involve Convolutional Neural Networks (CNNs), like those in [3] and [8], which focus on shadow removal. Additionally, [42] employs transformer-based concepts, particularly spatial attention maps, for shadow detection and removal.

Deep-learning models excel but are heavily reliant on the training dataset. However, the currently available datasets for shadow detection and removal are, in general, limited in size, containing only a few thousand images and thus representing a small portion of all possible shadow variations. With access to millions of images in the training dataset, deep-learning models have the potential to achieve outstanding results as generic feature extractors [42]. This lack of data compromises the performance of deep learning models. To address this problem to some extent, this study introduces a novel triplet dataset for the purpose of shadow detection and removal. The proposed approach harnesses the capabilities of deep-learning models, specifically GANs, followed by post-processing steps. These steps amalgamate traditional image processing techniques to generate state-of-the-art outcomes.

1.1 Literature review

Accurately detecting shadows and subsequently applying post-processing to the identified regions can play a pivotal role in creating the ultimate shadow-free image. Previously, the primary focus centred on precisely identifying shadows through conventional image processing techniques. A series of effective approaches, as demonstrated in [10] and [19], prove proficient in detecting shadow boundaries within individual RGB images. Leveraging properties extracted from shadow samples, the methodology presented in [21] dynamically formulates a feature space and computes decision parameters. This is followed by a sequence of transformations to generate the shadow mask. In [15], the authors introduce a groundbreaking deep CNN for shadow detection. This model, composed of a 7-layer network architecture, efficiently produces shadow masks. Another recent advancement in this domain is the introduction of an innovative instance single-stage detector [37]. This detector incorporates a bidirectional relation learning module that grasps the interplay between object instances and their corresponding shadow instances. Additionally, the instance single-stage detector employs a deformable maskIoU head to enhance the precision of generated shadow masks.

While these proposed methodologies adeptly generate accurate shadow masks, their limitation lies in their ability only to detect shadows through mask generation. Once shadows are accurately pinpointed, the subsequent step involves reconstructing the areas concealed by the shadow regions. This reconstruction can be accomplished using either traditional image processing techniques or deep-learning-based models.

Conventional image processing methods offer a route to obtaining images devoid of shadows. In the studies conducted by [39] and [14], a manual approach involves drawing three color lines on the image to execute shadow detection and removal. Using these input lines, the algorithm engages in region matching, ultimately generating a shadow-free image. An alternate approach proposed by [4] capitalises on the YCbCr color space. By utilising the luminance Y-channel, this method endeavours to produce a shadow-free image. Likewise, [1] and [13] employ multi-channel thresholding for shadow detection, and utilise shadow-matting techniques for subsequent shadow removal. On the other hand, various pre- and post-processing approaches have also been proposed from which shadow removal may benefit by suppressing unwanted artefacts. For example, an efficient histogram equalisation method is presented in [32] for uniform and non-uniform backgrounds, followed by its improvements in [34] and [33], where object edges and the image's natural structural information are preserved using novel 2D histogram equalisation.

Nonetheless, traditional image processing techniques have their limitations. These methods are not easily adaptable to different situations and tend to yield shadow-free images solely under specific circumstances. Furthermore, some of these traditional techniques rely on user inputs, which pose impractical challenges in real-time scenarios and applications.

In recent years, deep learning CNN models have undergone exponential advancements, offering remarkable outcomes in terms of accuracy and efficiency. An instance is the Background Estimation Document Shadow Removal Network (BEDSR-Net) [20], which presents an approach for identifying and eliminating shadows in document images. This methodology comprises two sub-networks: one estimating background color and generating an attention map, and the other producing the shadow-free image. Another deep learning-based approach, DeshadowNet [26], adeptly detects and removes shadows from RGB images end-to-end. This method employs a multi-scale feature extractor to glean information from the input image, while a multi-context embedding module generates the shadow-free image utilising these extracted features.

Table 1 Comparison of recent shadow removal methods

Within the realm of shadow detection and removal, GANs in [5, 36, 40], and [25] have garnered significant attention. GANs utilise generators to predict shadow-free images and discriminators to ascertain whether the generated image is indeed a shadow-free rendition of the original. A notable instance is Mask-ShadowGAN [9], which employs the concept of cycle GANs to learn from both shadow and shadow-free samples concurrently. Another recent approach by [29] introduces a three-layer CNN architecture for shadow detection, feature extraction, and final shadow-free image generation. Additionally, there are CNN models such as Dual Hierarchical Aggregation Networks (DHAN) [3] and spatial Recurrent Neural Networks (RNNs) with Direction-Aware Spatial Context (DSC) features [8], both designed to predict shadow-free images. Building on the transformer concept, [42] and [35] present a methodology employing transformers followed by an encoder-decoder structure to produce the shadow-free image. The advantages and limitations of the most recent studies are presented in Table 1. Nonetheless, a major limitation of deep-learning models lies in their need for extensive training datasets, often in the millions, to perform optimally. Existing publicly available datasets, however, comprise only a limited number of samples, leading to a narrow variety across the dataset. Moreover, current deep-learning models demonstrate proficiency on monochromatic color images but struggle to yield satisfactory outcomes in the case of multi-color contrast images.

Datasets hold paramount importance for deep learning models, as these models necessitate extensive training data to effectively acquire desired features and perform with efficiency. Publicly accessible datasets for shadow detection and removal can be categorised into four types: unpaired, paired (shadow-detection), paired (shadow-removal), and triplet datasets. An unpaired dataset comprises shadow and shadow-free images taken from diverse scenarios within a single training sample. The USR dataset [9] is an example of an unpaired dataset, encompassing 4,215 samples. Notably complex, this dataset demands intricate and computationally intensive models. Paired datasets for shadow detection involve shadow and shadow-mask samples. This type is used exclusively for shadow detection purposes. SBU [30] and UCF [44] are examples of such datasets, containing 4,727 and 245 training samples, respectively. Paired datasets tailored for shadow removal contain shadow and shadow-free samples, serving the sole purpose of shadow elimination. In this category, SRD [26] consists of 3,088 samples, while both UIUC [7] and LRSS [6] hold fewer than a hundred samples each. The triplet dataset, encompassing shadow, shadow-mask, and shadow-free samples, proves versatile, facilitating both shadow detection and removal tasks. ISTD [36] stands as the sole publicly accessible triplet dataset, housing 1,870 samples. Table 2 lists popular publicly available datasets.

Table 2 Comparison of publicly available datasets

1.2 Our contribution

The presented study aims to bridge the existing research gaps by tackling certain shortcomings in the current methods for shadow detection and removal. The key contributions are as follows:

  • This research proposes the creation of a novel benchmark dataset called Extended ISTD dataset. This dataset comprises 5,352 triplet samples, marking it as the largest dataset intended for shadow detection and removal. It notably incorporates samples featuring dark/hard shadows and multi-color contrast shadows. This strategic inclusion of diverse samples enhances the overall distribution of variability across the dataset.

  • A stacked conditional GAN similar to Stacked Conditional Generative Adversarial Networks (ST-CGAN) [36] has been proposed, featuring modified encoders and decoders in the generator modules. Spatial attention has been utilised in encoders, followed by nearest neighbour upsampling in decoders, to address the boundary irregularities associated with standard GANs.

  • A post-processing stage follows the outcome of the proposed GAN architecture. This stage encompasses a combination of various traditional image processing techniques, such as histogram matching, custom filters, detection and estimation of shadow boundaries. The collective utilisation of these techniques refines the shadow-free images generated by the deep learning model.

2 Methodology

The overall methodology (see Fig. 1) includes dataset preparation, the proposed GAN model, which comprises two stacked GANs for effective shadow removal, and a specially crafted post-processing stage for result enhancement. In general, when the initial shadow image is input into GAN-based systems, the resultant prediction is not a completely shadow-free image. There might be residual lighter shadows, imprecisely preserved shadow boundaries, and instances where dark shadow areas remain unaltered. To address these concerns, two measures are undertaken: the introduction of the Extended ISTD dataset for robust model training and the application of a comprehensive post-processing stage that amalgamates diverse image processing techniques. The proposed dataset, Extended ISTD, exhibits an augmented image distribution by encompassing a variety of multi-color shadow images and intensively dark shadow regions. The model output, which is the shadow-free image predicted by the GANs, undergoes further refinement by harnessing the capabilities of established traditional image processing techniques. These techniques incorporate morphology operations, histogram matching, custom filters, and shadow edge detection. Consequently, the proposed methodology, combining GANs with subsequent traditional image processing techniques as a post-processing step, demonstrates the capacity to yield seamless, error-free shadow-free images.

Numerous deep-learning models have emerged for the purpose of shadow detection and removal. These existing models, though data-driven and proficient in specific contexts like monochromatic or soft shadow images, encounter challenges when dealing with multi-color contrast images and regions of intense darkness. In this context, the proposed approach employs a GAN-based system similar to ST-CGAN as the deep-learning model of choice, with modifications to the generators’ encoders and decoders. The advantage of this selection lies in ST-CGAN’s ability to predict both shadow masks and shadow-free images. The original model is trained on the ISTD dataset [36] with 1,870 triplet images, thereby catering to both shadow detection and removal objectives. Nonetheless, the ISTD dataset is hampered by its limited number of training samples and their skewed distribution, often concentrating on monochromatic shadow images. Consequently, the trained model’s performance on unseen data, encompassing multi-color or darker shadow images, tends to fall short.

Fig. 1

Proposed methodology flow

2.1 Extended ISTD Triplet Dataset

To aid in the assessment of shadow understanding techniques, we propose a new extended version of the ISTD dataset by adding 3,482 triplet samples to the original ISTD dataset [36]. The proposed extended version, i.e., Extended ISTD, comes with 5,352 triplet samples. To the best of our knowledge, this is the first large-scale benchmark dataset that can be used simultaneously for both shadow detection and shadow removal. To increase the variety in the training dataset, the following three steps are adopted, which are discussed below:

Fig. 2

A few samples from the Extended ISTD dataset comprising images with shadow (first row), shadow binary masks (second row) generated using Algorithm 1, and shadow-free images (third row) available with the original SRD dataset

Algorithm 1

Generating shadow mask from shadow and shadow-free image pair.

2.1.1 Extension with shadow and shadow-free images

The SRD dataset, a few samples of which are shown in Fig. 2, is a paired dataset containing 3,088 pairs of shadow and shadow-free images designated for shadow removal tasks. This dataset notably boasts a comparably large sample count when contrasted with other publicly available datasets. However, the SRD dataset lacks shadow masks, making it incompatible with the proposed methodology, which necessitates the inclusion of shadow, shadow-mask, and shadow-free images.

To acquire shadow-mask images for the SRD dataset, Algorithm 1 is proposed, which takes shadow and shadow-free samples from various scenarios and generates shadow masks. Initially, pixel-wise subtraction is conducted between the shadow and shadow-free images. The resultant image’s single-channel component is selected, then subjected to thresholding to generate a binary image. A \(3\times 3\) median filter is applied to the binary image, followed by a morphological closing operation, ultimately producing the desired shadow mask. This collection of triplets, encompassing shadow, shadow-free, and the newly derived shadow-mask images, is subsequently incorporated into the Extended ISTD dataset.
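For illustration, a minimal OpenCV sketch of Algorithm 1 is given below; the threshold value, kernel size, and function name are illustrative choices rather than the exact parameters used in the paper.

```python
import cv2
import numpy as np

def generate_shadow_mask(shadow_img, shadow_free_img, threshold=30):
    """Sketch of Algorithm 1: derive a binary shadow mask from a
    shadow / shadow-free image pair (threshold value is illustrative)."""
    # Pixel-wise (absolute) difference between the two images
    diff = cv2.absdiff(shadow_free_img, shadow_img)
    # Keep a single-channel component of the difference image
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    # Threshold to obtain a binary image
    _, binary = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    # 3x3 median filter to suppress isolated noise pixels
    binary = cv2.medianBlur(binary, 3)
    # Morphological closing to fill small holes inside the mask
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    return mask

# Example usage (paths are placeholders)
# shadow = cv2.imread("shadow.jpg")
# shadow_free = cv2.imread("shadow_free.jpg")
# mask = generate_shadow_mask(shadow, shadow_free)
```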

Fig. 3

A few samples from the Extended ISTD dataset comprising images with shadow (first row), shadow masks (second row) generated using Algorithm 1, and manually generated shadow-free images (third row)

2.1.2 Extension with shadow-only images

Internet-sourced random images (Public Domain Licensed), as shown in Fig. 3, offer a diverse array of shadow scenarios with varying contrasts, illuminations, and background textures. Integrating 394 such vibrant images with hard and soft shadows into the dataset presents the opportunity to significantly augment the variability. However, a key constraint arises in situations where only the shadow image is accessible, mandating the efficient generation of shadow-free and shadow-mask counterparts. The process of generating shadow-free images entails manual selection of the shadow region, followed by background estimation while eliminating the present shadows. For shadow images whose shadow-free counterparts can be adeptly generated, Algorithm 1 is applied to produce the corresponding shadow masks. This collection of triplets, consisting of the shadow image, the manually generated shadow-free image, and the produced shadow mask, is subsequently integrated into the pre-existing dataset.

2.1.3 Data augmentation

This process aims to enhance the variability within the dataset. The existing triplet samples in the dataset are subjected to augmentation, wherein various augmentation techniques such as rotation, flipping, and adjustments to contrast and brightness are randomly applied to the set of triplet images. These augmented images might appear visually similar to the human eye, yet they constitute distinct and singular samples for a CNN model. The triplet samples produced through augmentation are subsequently integrated into the dataset during proposed model training.
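The sketch below illustrates this kind of triplet-consistent augmentation, where the same randomly chosen transform is applied to all three images of a triplet; the transform set and parameter ranges shown are assumptions for illustration, not the exact values used in training.

```python
import random
import numpy as np
import cv2

def augment_triplet(shadow, mask, shadow_free):
    """Apply the same randomly chosen transforms to every image of a triplet
    so that the shadow, mask, and shadow-free samples stay spatially aligned.
    Parameter ranges are illustrative."""
    # Random rotation by a multiple of 90 degrees
    k = random.choice([0, 1, 2, 3])
    shadow, mask, shadow_free = (np.rot90(im, k) for im in (shadow, mask, shadow_free))
    # Random horizontal flip
    if random.random() < 0.5:
        shadow, mask, shadow_free = (np.fliplr(im) for im in (shadow, mask, shadow_free))
    # Random brightness/contrast jitter on the colour images only (mask untouched)
    alpha = random.uniform(0.8, 1.2)   # contrast gain
    beta = random.uniform(-20, 20)     # brightness offset
    shadow = cv2.convertScaleAbs(np.ascontiguousarray(shadow), alpha=alpha, beta=beta)
    shadow_free = cv2.convertScaleAbs(np.ascontiguousarray(shadow_free), alpha=alpha, beta=beta)
    return shadow, np.ascontiguousarray(mask), shadow_free
```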

In summary, the Extended ISTD dataset diverges from the original ISTD dataset in three key aspects: (a) by extending the SRD dataset through the utilisation of shadow-masks (via Algorithm 1) derived from the available shadow and shadow-free image pairs, (b) by integrating publicly accessible internet images containing both hard and soft shadows across various scenarios and lighting conditions; in this process, manual delineation of shadow regions is followed by software-based [11] shadow removal to obtain shadow-free images, which are then utilised in conjunction with Algorithm 1 to generate shadow-masks, and (c) by subsequently applying data augmentation techniques to the pool of 5,352 triplet samples, thereby enriching the dataset’s variability to enhance the training of deep neural networks. It is worth mentioning that the Extended ISTD dataset is an extension of the training set of the original ISTD dataset; the test sets of both datasets remain the same to allow fair comparative analysis of various shadow removal approaches. The original ISTD dataset [36] is available with already segregated training and test sets.

Fig. 4

The provided shadow image (\(S_I\)) is applied to the generator (G1) to generate the corresponding shadow mask (\(S_M\)). This generated shadow mask, along with the actual ground truth shadow mask (\(G_M\)), is then forwarded to the Discriminator (D1) to determine the accuracy of the generated shadow mask. Subsequently, the input shadow image and the produced shadow mask are combined and passed into the generator (G2), which aims to generate the shadow-free image (\(S_F\)). This generated shadow-free image undergoes scrutiny by the discriminator (D2) using ground truth shadow-free image (\(G_F\)), which evaluates whether the image is genuinely devoid of shadows or not

2.2 Proposed architecture

The chosen training strategy of ST-CGAN is grounded in generative adversarial learning, comprising two generators (G1, G2) and two discriminators (D1, D2) as illustrated in Fig. 4. The primary function of generator G1 is to take the shadow image as input and generate the corresponding shadow mask. Discriminator D1’s role is to assess the accuracy of the shadow mask generated by G1. The generated shadow mask from G1 is then combined with the initial shadow image and forwarded to generator G2 which is tasked with predicting the ultimate shadow-free image. Simultaneously, discriminator D2 evaluates whether the predicted shadow-free image is genuinely devoid of shadows or not.

Fig. 5

Proposed generator architecture (G1) to generate the shadow mask using the image with shadow as input. The encoder comprises the first three blocks followed by the bottleneck block (top layer), while the decoder also has three blocks (bottom layer). The mentioned feature sizes appear at the end of each block of the encoder and decoder. The generator G2, as mentioned in Fig. 4, has a similar architecture except for the input layer with the size of \(\text {256}\times \text {256}\times \text {4}\) and the output generating a shadow-free image with the size of \(\text {256}\times \text {256}\times \text {3}\)

In ST-CGAN, the vanilla U-Net architecture [27] is used, which is challenged by notorious checkerboard artefacts [18]. These result in deterioration of the boundary region of shadows, and the output shadow-free image appears with unnatural shadow residue. To address this issue to some extent, we have modified the generators G1 and G2 to include a spatial attention mechanism in the encoders. Moreover, in the decoders, nearest-neighbour 2D upsampling followed by a simple 2D convolution is used instead of a transposed 2D convolution (see Fig. 5). The functional relations governing the proposed generators are given in (1)-(7). Therefore, in both encoders of the U-Net architectures of generators G1 and G2, the first convolution block becomes

$$\begin{aligned} C_{1}(x,y)= & {} \Sigma _{x,y}\Big (\Sigma _{i,j}I_{0}(i,j)W_{1}(x-i,y-j)\Big ), \text {for all { i},\,{ j} in the filter { W}},\end{aligned}$$
(1)
$$\begin{aligned} A_{Q,K}(x,y)= & {} \Phi \Big (QK^T/\sqrt{|K|}\Big ),\end{aligned}$$
(2)
$$\begin{aligned} C_{2}(x,y)= & {} \Sigma _{x,y}\Big (\Sigma _{i,j}A_{Q,K}(i,j)W_{2}(x-i,y-j)\Big ), \text {for all { i},\,{ j} in the filter { W}},\end{aligned}$$
(3)
$$\begin{aligned} Y_{k}(x,y)= & {} \max _{(p,q)\in R(i,j)}C_{2}(p,q), \end{aligned}$$
(4)

where, for pixels \((x, y)\) in the image, \(C_{1}(x,y)\) and \(C_{2}(x,y)\) are two convolution layers, \(I_{0}(x,y)\) is the input image, and \(I_{1,2,3}\) is the output of the preceding encoder block’s maxpool layer. \(A_{Q,K}(x,y)\) is the attention layer with queries Q and keys K each set to \(C_{1}(x,y)\), while \(|\cdot |\) is the matrix dimension size and \(\Phi (\cdot )\) is the softmax function. \(Y_{k}(x,y)\) is the maxpooling layer with a \(k\times k\) pooling window, and \(R(i,j)\) represents the set of indices \((p,q)\) in the input feature map. It is worth mentioning that the rest of the encoder blocks do not house attention layers.
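A minimal Keras sketch of such an attention-equipped encoder block is given below, assuming 3×3 kernels, "same" padding, and ReLU activations (the exact per-block configuration is listed in Table 3); the attention of (2) is computed over flattened spatial positions with Q = K = C1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(c1):
    """Spatial self-attention of Eq. (2): softmax(Q K^T / sqrt(d)) with
    Q = K = C1, computed over flattened spatial positions."""
    shape = tf.shape(c1)
    b, h, w, ch = shape[0], shape[1], shape[2], shape[3]
    q = tf.reshape(c1, tf.stack([b, h * w, ch]))          # queries (= keys)
    scores = tf.matmul(q, q, transpose_b=True)            # Q K^T
    scores = scores / tf.sqrt(tf.cast(ch, c1.dtype))      # scale by sqrt of feature dim
    weights = tf.nn.softmax(scores, axis=-1)              # softmax Phi(.)
    out = tf.matmul(weights, q)                           # attended features
    return tf.reshape(out, shape)

def attention_encoder_block(x, filters):
    """First encoder block of G1/G2 per Eqs. (1)-(4): convolution, spatial
    attention, convolution, then 2x2 max-pooling. Filter counts, kernel sizes,
    and activations are illustrative."""
    c1 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)   # Eq. (1)
    a = layers.Lambda(spatial_attention)(c1)                               # Eq. (2)
    c2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(a)   # Eq. (3)
    skip = c2                                  # kept for the decoder skip connection
    return layers.MaxPooling2D(pool_size=2)(c2), skip                      # Eq. (4)
```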

Each decoder block is modified as follows:

$$\begin{aligned} I_{u}^{l}(x,y)= & {} C_{l}\Big ([x/s_x],[y/s_y]\Big )^\frown C_{2}^{L-l}(x,y), l = 0, 1,..., L-1,\end{aligned}$$
(5)
$$\begin{aligned} C_{1}(x,y)= & {} \Sigma _{x,y}\Big (\Sigma _{i,j}I_{u}^l(i,j)W_{1}(x-i,y-j)\Big ), \text {for all { i},\,{ j} in the filter { W}},\end{aligned}$$
(6)
$$\begin{aligned} C_{2}(x,y)= & {} \Sigma _{x,y}\Big (\Sigma _{i,j}C_{l+1}(i,j)W_{2}(x-i,y-j)\Big ), \text {for all { i},\,{ j} in the filter { W}}, \end{aligned}$$
(7)

where, \(s_x\) and \(s_y\) are the scale factors in x and y dimensions. In the proposed architecture, \(L=3\) and \((^\frown )\) represents concatenation operation with corresponding second convolution layer of the \((L-l)^{th}\) encoder block via skip connection. \(C_0(x,y)\) represents last bottleneck layer, \(I_{u}^0\) is the upsampling layer of the first (\(l+1\)) decoder block. The architecture of the proposed generators is given in Table 3. It is worth mentioning that the discriminator architecture of ST-CGAN is followed in this study.
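A corresponding Keras sketch of the modified decoder block of (5)-(7) follows, again with illustrative kernel sizes and activations; nearest-neighbour upsampling plus plain convolution replaces the transposed convolution to suppress checkerboard artefacts.

```python
from tensorflow.keras import layers

def nn_upsample_decoder_block(x, skip, filters):
    """Decoder block of Eqs. (5)-(7): nearest-neighbour 2x upsampling,
    concatenation with the matching encoder skip connection, then two plain
    convolutions in place of a transposed convolution. Filter counts and
    kernel sizes are illustrative."""
    up = layers.UpSampling2D(size=2, interpolation="nearest")(x)               # upsampling in Eq. (5)
    merged = layers.Concatenate()([up, skip])                                  # skip concatenation in Eq. (5)
    c1 = layers.Conv2D(filters, 3, padding="same", activation="relu")(merged)  # Eq. (6)
    c2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(c1)      # Eq. (7)
    return c2
```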

Table 3 The architecture of generators (G1/G2) of the proposed GAN
Fig. 6

Visual illustration of Algorithm 2 to generate shadow-free image (\(I_{sha-free}\)) from raw image with shadow (\(I_{sha}\)) and intermediate shadow-free image (\(I_{int-sha-free}\)) which is the output of the proposed GAN architecture

While excelling with monochromatic shadow images, the chosen model falls short in delivering desirable outcomes for multi-color and deep shadow scenarios, as mentioned earlier. To address this limitation, the model undergoes training using the newly introduced Extended ISTD dataset. This dataset is meticulously designed to augment the shadow variation, achieved by deliberately incorporating multi-color shadow samples and dark shadow samples. During the training process, the dataset is partitioned to comprise 80% of samples in the training set and 20% in the validation set.

2.3 Post-processing

The proposed GAN, once trained on the Extended ISTD dataset, exhibits the ability to eliminate shadows from images. However, certain scenarios still result in residual lighter shadows or the absence of preserved shadow boundaries. To enhance image quality and address potential artefacts, the intermediate shadow-free image produced by the deep learning model undergoes a subsequent post-processing phase. This phase capitalises on various combined image processing techniques to generate a refined shadow-free image. The overall process is depicted in Fig. 6. Within this proposed post-processing step, shadow detection holds pivotal significance. The emphasis of the post-processing stage centres on predicting and refining the underlying texture solely within the shadow region. If shadow detection is executed accurately, signifying the efficient generation of a shadow mask, the post-processing step proves effective in generating a shadow-free version of the image. The steps are outlined in Algorithm 2.

Algorithm 2

Proposed post-processing step to refine the intermediate shadow free image generated by the proposed GAN model.

The shadow mask produced by the deep-learning model undergoes refinement through the application of image processing techniques. This post-processing takes the input shadow image (\(I_{sha}\)) and the intermediate shadow-free image (\(I_{int-sha-free}\)) generated by the proposed GAN as inputs, and Algorithm 1 is employed on this pair to generate the shadow mask (\(I_{mask}\)). Concurrently, an inverted shadow mask (\(I_{inv-mask}\)) is created by taking the inverse of the shadow mask (\(I_{mask}\)) generated by Algorithm 1. The shadow mask (\(I_{mask}\)) and its inverted counterpart (\(I_{inv-mask}\)) are subsequently utilised for extracting the shadow region and the shadow-free region.

The intermediate shadow-free image (\(I_{int-sha-free}\)) is subsequently divided into two distinct images: the shadow region (\(I_{sha-reg}\)), comprising solely the pixels belonging to the shadowed area, and the shadow-free region (\(I_{sha-free-reg}\)), encompassing only the pixels within the unshadowed area. The shadow region (\(I_{sha-reg}\)) is derived by subtracting the inverted shadow mask (\(I_{inv-mask}\)) from the intermediate shadow-free image (\(I_{int-sha-free}\)). Similarly, the shadow-free region (\(I_{sha-free-reg}\)) is obtained by subtracting the generated shadow mask (\(I_{mask}\)) from the intermediate shadow-free image (\(I_{int-sha-free}\)) on a pixel-by-pixel basis.
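A short OpenCV sketch of this region-splitting step is shown below; file paths are placeholders, and the mask is assumed to be an 8-bit binary image with shadow pixels set to 255, so that saturated subtraction of the expanded mask zeroes out the unwanted region.

```python
import cv2

# Placeholders: GAN output and the mask produced by Algorithm 1
I_int_sha_free = cv2.imread("intermediate_shadow_free.png")           # I_int-sha-free
I_mask = cv2.imread("shadow_mask.png", cv2.IMREAD_GRAYSCALE)          # I_mask

I_inv_mask = cv2.bitwise_not(I_mask)                                  # I_inv-mask

# Saturated subtraction of the (3-channel) mask keeps one region and
# zeroes the other: pixels under a 255 mask value are clipped to 0.
I_sha_reg = cv2.subtract(I_int_sha_free, cv2.cvtColor(I_inv_mask, cv2.COLOR_GRAY2BGR))
I_sha_free_reg = cv2.subtract(I_int_sha_free, cv2.cvtColor(I_mask, cv2.COLOR_GRAY2BGR))
```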

After the isolation of the shadow region (\(I_{sha-reg}\)) and the shadow-free region (\(I_{sha-free-reg}\)), a channel-wise histogram matching procedure is implemented exclusively on the extracted shadow region (\(I_{sha-reg}\)). Notably, this histogram matching operation is conducted with the extracted shadow-free region (\(I_{sha-free-reg}\)) serving as the reference image.

The filtered image (\(I_{filt}\)) is obtained through pixel-wise weighted averaging between the histogram-matched image (\(I_{hist-match}\)) and the input shadow image (\(I_{sha}\)). In this context, \(n_{1}\) and \(n_{2}\) represent the weights assigned to \(I_{hist-match}\) and \(I_{sha}\), respectively, where the sum of \(n_{1}\) and \(n_{2}\) equals 1.
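The sketch below illustrates the channel-wise histogram matching and the weighted averaging, assuming scikit-image 0.19 or newer (for the channel_axis argument) and reusing I_sha_reg, I_sha_free_reg, and I_sha from the steps above; the weights n1 = 0.7 and n2 = 0.3 are illustrative values only.

```python
import numpy as np
from skimage.exposure import match_histograms

# Channel-wise histogram matching of the shadow region against the
# shadow-free region used as reference
I_hist_match = match_histograms(I_sha_reg, I_sha_free_reg, channel_axis=-1)

# Pixel-wise weighted averaging with the input shadow image
n1, n2 = 0.7, 0.3        # illustrative weights with n1 + n2 = 1
I_filt = np.clip(n1 * I_hist_match.astype(np.float32)
                 + n2 * I_sha.astype(np.float32), 0, 255).astype(np.uint8)
```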

The filtered image (\(I_{filt}\)) is then concatenated with the shadow-free region (\(I_{sha-free-reg}\)) produced with the help of the output of the proposed GAN model to get \(I_{filt-conc}\). This approach ensures that the histogram matching and averaging filters are exclusively employed on the shadow region.

The shadow boundaries (\(I_{sha-bound}\)) are derived through the utilisation of the shadow mask. Using the shadow mask, the detection of shadow edges is facilitated, and the Canny edge detector is employed to identify these edges. This results in a shadow-free image, efficiently generated with preserved shadow boundaries.

The resulting \(I_{sha-free}\) represents the ultimate shadow-free image, refined from the intermediate shadow-free image produced by the proposed GAN. Through comprehensive qualitative and quantitative assessments, it is evident that the presented methodology consistently surpasses other state-of-the-art approaches in terms of shadow removal performance.

3 Results and discussion

3.1 Evaluation metrics

To assess the efficacy of shadow removal, the Root Mean Square Error (RMSE) is calculated between the predicted shadow-free image and the ground truth shadow-free image as given in (8), with n being the total number of pixels in each image. The RMSE quantifies the disparity between the two images by gauging their variations. In essence, a smaller RMSE value signifies a reduced discrepancy between the images, thereby indicating superior algorithm performance.

$$\begin{aligned} \text {RMSE} = \sqrt{\frac{1}{n} \Sigma _{i=1}^{n}{\Big ({ I_{sha-free} - I_{gnd-truth}}\Big )^2}}. \end{aligned}$$
(8)

PSNR in (9) denotes the peak signal-to-noise ratio existing between two images. This ratio serves as an assessment criterion to gauge the likeness between the generated shadow-free image (\(I_{sha-free}\)) and the ground truth shadow-free image (\(I_{gnd-truth}\)).

$$\begin{aligned} \text {PSNR} = 10\text {log}_{10}\Big (\frac{r^2}{\text {MSE}}\Big ). \end{aligned}$$
(9)

In the above equation, r represents the maximum possible pixel value of the image, while MSE signifies the mean square error computed between the generated shadow-free image (\(I_{sha-free}\)) and the ground truth shadow-free image (\(I_{gnd-truth}\)). The measurement unit of PSNR is expressed in decibels (dB).

Another similarity metric i.e., structural similarity index, SSIM, as given in (10), is calculated. It takes into account the perceived change in structural information, luminance, and contrast of the images. When comparing a shadow-free image to a ground truth image, the SSIM metric provides a quantitative measure of how close the refined image is to the ground truth in terms of visual quality.

$$\begin{aligned} \text {SSIM}(X, Y) = \frac{{(2 \mu _X \mu _Y + c_1)(2 \sigma _{XY} + c_2)}}{{(\mu _X^2 + \mu _Y^2 + c_1)(\sigma _X^2 + \sigma _Y^2 + c_2)}}, \end{aligned}$$
(10)

where, \(X\) and \(Y\) are the two images being compared, in our case \(I_{sha-free}\) and \(I_{gnd-truth}\) respectively. \(\mu _X\) and \(\mu _Y\) are the means of \(X\) and \(Y\), \(\sigma _X\) and \(\sigma _Y\) are their standard deviations, \(\sigma _{XY}\) is the covariance between \(X\) and \(Y\), and \(c_1\) and \(c_2\) are constants to stabilise the division with a weak denominator.

The variables \(\mu _X\), \(\mu _Y\), \(\sigma _X\), \(\sigma _Y\), and \(\sigma _{XY}\) are calculated over a sliding window in the images. Typically, a Gaussian weighting function is applied to the window to give higher weights to the central pixels. The constants \(c_1\) and \(c_2\) are used to avoid instability when the denominator is close to zero. They are defined in (11) as:

$$\begin{aligned} c_1 = (k_1L)^2, \quad c_2 = (k_2L)^2, \end{aligned}$$
(11)

where, \(L\) is the dynamic range of pixel values (e.g., 255 per RGB channel for 8-bit images), and \(k_1\) and \(k_2\) are constants used to stabilise the division. They are usually set to small values, such as \(k_1 = 0.01\) and \(k_2 = 0.03\), to ensure numerical stability.

SSIM provides a value between -1 and 1, where 1 indicates perfect similarity.
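For reference, the three metrics of (8)-(10) can be computed as in the sketch below, assuming scikit-image 0.19 or newer (for the channel_axis argument) and 8-bit images (data range 255).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt):
    """RMSE, PSNR and SSIM between a predicted shadow-free image and its
    ground truth (both uint8 RGB arrays of identical shape)."""
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    rmse = np.sqrt(np.mean(diff ** 2))                                       # Eq. (8)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)                 # Eq. (9)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)  # Eq. (10)
    return rmse, psnr, ssim
```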

3.2 Experimental protocol

The training of the proposed model and subsequent post-processing are conducted on Google Colab Pro graphical processing units (GPUs). The environmental setup for training and validating the proposed methodology involves Python 3.7.14, TensorFlow (Keras), and TorchVision 0.6.1. During training, hyperparameters are meticulously fine-tuned, involving learning rate scheduling and a progressive increase in the number of samples per batch. Learning rate scheduling and the batch-size increments are performed systematically over the course of specific epochs, transitioning from 8 to 16, 32, and eventually 64 samples per batch. Moreover, the Adam solver [17] is used to train the proposed GAN, and weights are initialised with zero mean and unit variance for scratch training on the ISTD dataset. To boost learning capability, on-the-fly data augmentation with flipping and rotation (45\(^{\circ }\), 90\(^{\circ }\), and 180\(^{\circ }\)) is used. The multi-objective loss function balanced by triple scaling factors (\(\lambda \)s) is kept similar to that adopted in [36] for training ST-CGAN. This careful adjustment of hyperparameters aims to prevent the model from converging to local minima, which would otherwise limit the performance of both the generator and the discriminator. For generators G1 and G2, binary cross-entropy and L1 [31] losses, respectively, are used in training. In summary, an optimum parametric setting is ensured for all algorithms, including the proposed one, for the best performance outcome.
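A hedged sketch of this training configuration is shown below; the learning-rate rule, stage boundaries, and the gan/train_data placeholders are illustrative and do not reproduce the exact experimental values.

```python
import tensorflow as tf

# Adam optimiser; the learning rate and beta_1 shown here are illustrative.
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)

def lr_schedule(epoch, lr):
    # Halve the learning rate every 70 epochs (illustrative rule)
    return lr * 0.5 if epoch > 0 and epoch % 70 == 0 else lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)

# Progressive batch-size increase: successive training stages on the same model.
# `gan` is a placeholder for a compiled Keras GAN with a custom train_step, and
# `train_data` for a tf.data.Dataset of Extended ISTD triplets.
stages = [(8, 70), (16, 70), (32, 70), (64, 70)]     # (batch_size, epochs) per stage
for batch_size, epochs in stages:
    gan.fit(train_data.batch(batch_size), epochs=epochs, callbacks=[lr_callback])
```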

The source codes of all approaches in the comparative study are taken from their authors’ repositories, and for fair comparison, the optimal experimental settings are kept the same as suggested by their respective published articles to reproduce the shadow removal results on the ISTD dataset. On the Extended ISTD dataset, all models, including the proposed one, are retrained until convergence using transfer learning from network weights previously trained on the ISTD dataset.

3.3 Quantitative and qualitative comparison

This section delves into a comprehensive comparison of the shadow removal outcomes yielded by the presented methodology against those of other state-of-the-art approaches, including ST-CGAN [36], DHAN [3], SpA-Former [42], LG-SNet [23], DSC [8], and M-SGAN [9].

For the purpose of quantitative comparison, the ISTD test set is chosen, comprising 540 triplet samples. It is worth mentioning that the test set of the ISTD dataset is mutually exclusive with the training and validation sets of the Extended ISTD dataset on which all the competing architectures, including the proposed one, are trained. Table 4 contrasts the RMSE, PSNR, and SSIM of the proposed methodology with those of other deep learning-based models when both training and testing are performed on the original ISTD dataset. Here, ‘S’ denotes the shadow region, ‘NS’ signifies the non-shadow region or shadow-free area, and ‘E’ represents the entire image. The scores for other approaches are extracted from their authors’ published articles and are also reproduced with the same outcomes to keep comparisons fair and validate the experimental protocol followed in this study. This comparison reveals that the proposed methodology surpasses other existing approaches, achieving an RMSE of 6.01 and a PSNR of 29.43 across the entire image. It is evident from Table 4 that the proposed architecture yields favourable scores compared to all other approaches, with better RMSE and PSNR metrics. This advocates the effectiveness of the proposed changes to the standard ST-CGAN architecture. Similarly, Table 5 presents the comparison when all the algorithms are trained on the Extended ISTD dataset while the test set remains the same, i.e., the original ISTD test set. Here, again, the proposed methodology surpasses the rest with state-of-the-art performance, highlighting the benefits of versatile training data, a robust architecture, and effective post-processing.

Table 4 Quantitative comparison for the shadow removal of proposed methodology with other techniques when training is done on ISTD dataset
Table 5 Quantitative comparison for the shadow removal of the proposed methodology with other techniques when training is done on Extended ISTD dataset

Figure 7 showcases the visual juxtaposition of the proposed shadow removal approach against other methodologies using the ISTD test dataset. Meanwhile, Fig. 8 illustrates the outcomes of shadow removal on freely available, randomly selected internet-sourced shadow images apart from ISTD test set. The selected test images encompass scenarios containing dark/hard shadow regions as well as those with multi-color contrast shadow effects. In both representations, all models are trained on Extended ISTD training set. This visual assessment of the generated shadow-free images reveals the clear superiority of the proposed methodology over existing alternatives. The method’s effectiveness is particularly evident in its ability to produce high-quality shadow-free images, even in the presence of intricate backgrounds, challenging dark/hard shadow conditions, and multi-color contrast shadow settings. As a result, the proposed methodology excels in comparison to the established state-of-the-art techniques.

Fig. 7

Shadow removal performance on ISTD test set

Fig. 8

Shadow removal performance on internet-sourced random multi-color contrast shadow images. In this case the ground truth images are not available

Table 6 Performance comparison of the proposed model (trained on Extended ISTD dataset) with and without modified GAN’s generator and post-processing

ST-CGAN is relatively better on soft shadows but leaves residue when hard shadows are present in an image, especially on vibrant color backgrounds. This is attributed to the lack of fully capturing shadow-dependent features in the generator CNNs, where attention to edge details is imperative. On the other hand, DHAN comprises a dual hierarchical aggregation network and shadow-matting GANs. Despite reasonable performance as compared to ST-CGAN and LG-SNet in Fig. 7, such a setup is critically dependent on large data availability and its variability [38]. Moreover, matting GANs are prone to artefacts in synthesised images, especially in regions with complex textures or structures [28] or with dynamic scenes when lighting conditions or color shades change abruptly [24]. These artefacts can degrade the visual quality of the output and may require additional post-processing steps to mitigate. This is consistent with the visual findings in this study. SpA-Former presents a promising approach to shadow removal by leveraging transformer architectures and spatial attention mechanisms. While transformers excel at modelling sequential data, they may struggle to efficiently capture spatial relationships in images. This limitation can affect the model’s ability to understand spatial contexts and dependencies critical for tasks like shadow removal [2]. LG-SNet is a deep learning approach designed specifically for shadow removal tasks [23]. While this architecture has demonstrated effectiveness in certain scenarios, its performance may degrade when faced with diverse shadow types, illumination conditions, and environmental settings that are not adequately represented in the training data [43]. On the other hand, DSC is a method designed to detect shadows in images by using directional information and spatial context features. However, the method may produce unnatural effects or false positives in regions that are not shadows or at the shadow edges (see Fig. 8 for DSC results in the first and third columns), especially when processing images with complex textures, structures, or varying lighting [41]. LG-SNet and DSC yield superior results in the qualitative analysis as compared to ST-CGAN, SpA-Former, and DHAN. The challenges of shadows on diverse backgrounds, lighting variation, complex structure, and abrupt color change require careful attention to acquire robust features and the ability to overcome irregularities and anomalies in the shadow and non-shadow regions of an image. The proposed architectural changes in the ST-CGAN generators using spatial attention and nearest-neighbour 2D upsampling mitigate shadow residue through their attention to edge details. Moreover, such modifications result in model robustness in contrast with architectures (e.g., DHAN and LG-SNet) where extensive data with variation in shadow patterns is required. Furthermore, textural information is favourably preserved by better understanding spatial contexts, a major problem in DSC [8]. Nonetheless, the quantitative and qualitative findings affirm the necessity of effective post-processing to address the limitations inherent in deep learning-based generative techniques. Table 6 presents an ablation study in which the efficacy of the proposed GAN model with modified generators is justified and which also advocates the use of the proposed post-processing approach.
It is evident that Model-A, with a similar architecture to ST-CGAN (see Table 5), also produces similar results. On the other hand, Model-B, with the proposed generator but without the post-processing, attains better results, followed by further enhancement by the final Model-C.

Despite its relatively superior performance, the proposed algorithm still shows minor artefacts at the edges of hard shadows (see Fig. 7, first and second columns, and Fig. 8, last column). The main reason behind this limitation is the unavailability of pixel colour information, which is blocked by the hard shadow, resulting in the failure of the GAN-based system and post-processing to recover it. Although the attention mechanism in the proposed approach helps to mitigate this issue to some extent, the challenge remains valid in general. Moreover, shadow removal operations over a textured surface leave unnatural smoothing. The proposed algorithm depicts fewer visual anomalies in comparison to the other approaches, but careful observation reveals texture deformation over the shadow regions, as shown in Fig. 8, first and second columns.

The Extended ISTD dataset, which is generated using the standard ISTD, SRD, and further internet-sourced images, exhibits diversity in shadow types, lighting conditions, and background complexities, but it also has certain limitations and potential biases. In terms of shadow types, the dataset includes various kinds of shadows, including cast shadows, self-shadows, and object shadows, ensuring a representative sample for training and evaluation. Lighting conditions encompass a range of scenarios, from indoor to outdoor environments, different times of day, and varying weather conditions, providing variability for robust model training. Similarly, background complexities vary from simple backgrounds like plain walls to complex scenes with multiple objects and textures, enabling models to learn to remove shadows effectively in diverse contexts. However, potential biases may exist in these datasets. For instance, there might be an overrepresentation of certain types of shadows or lighting conditions compared to others, leading to imbalances in the training data and potentially affecting model performance on underrepresented scenarios. Additionally, the dataset may lack sufficient variability in terms of cultural or geographic contexts, which could impact model generalisation to real-world applications across different regions or demographics. Furthermore, there are limitations in the quality and resolution of the images in this dataset, which could affect the realism and applicability of trained models in practical settings.

To mitigate these biases and limitations, future efforts could focus on collecting more diverse datasets that include a broader range of shadow types, lighting conditions, and background complexities. Incorporating data augmentation techniques to simulate additional variations and addressing biases through careful curation and sampling strategies can enhance the representativeness and robustness of datasets for shadow removal tasks. Additionally, efforts to ensure inclusivity and diversity in dataset collection processes can help improve the generalisation of trained models to diverse real-world scenarios and populations.

Shadow removal techniques play a crucial role across various real-world scenarios, including digital photography enhancement, surveillance imagery analysis, and autonomous driving systems. In digital photography, shadow removal enhances image quality by reducing distractions and improving overall clarity. Similarly, in surveillance imagery analysis, shadow removal aids in identifying objects and individuals obscured by shadows, thus enhancing security and investigative processes. In the context of autonomous driving systems, accurate shadow removal contributes to improved object detection and scene understanding, ultimately enhancing safety and decision-making algorithms. However, modern deep learning-based approaches, such as GANs and vision transformers, face challenges in handling dynamic lighting conditions and complex backgrounds, which can lead to inaccuracies in shadow removal. Mitigation strategies could involve incorporating additional contextual information, such as scene geometry or temporal data, to improve robustness. The improvements may focus on developing adaptive GAN architectures capable of dynamically adjusting to varying lighting conditions and complex environments, as well as exploring techniques for scalable training on large and diverse datasets to enhance generalisation capabilities. By addressing these challenges, such methods could further advance the effectiveness and applicability of shadow removal techniques in diverse real-world scenarios.

Deep learning algorithms like the ones presented in Tables 4 and 5 can effectively remove shadows from images, leading to enhanced image quality and better visual appeal. This can be particularly useful in applications like photography, surveillance, and autonomous driving, where clear and well-lit images are crucial. Once trained, these models can automate the process of shadow removal, reducing the need for manual intervention and saving time for users. This automation can lead to increased efficiency in various image processing workflows. Nonetheless, such models require large and diverse datasets for training to generalise well. However, obtaining such datasets with a wide variety of shadow types and scenes can be challenging, leading to potential biases in the trained models. Training deep learning models, especially complex ones like GANs, requires significant computational resources, including high-end GPUs and large amounts of memory. This can make deployment and scaling of these models costly and impractical for some applications. This highlights the importance of a comprehensive dataset for tasks like shadow detection and removal, covering the maximum possible patterns of shadows across simple to complex environments. Failure to provide diverse data may result in overfitting by deep architectures, where they memorise the training data instead of learning informative features. This can lead to poor performance on unseen data, especially if the training dataset is limited or not representative of the real-world scenarios the model will encounter. Real-world images often contain various complexities such as different types of shadows, varying lighting conditions, and occlusions. Ensuring that the trained models are robust enough to handle such variations is a significant challenge. In terms of resource utilisation, Table 7 tabulates the number of parameters, training time, and inference time taken by the approaches compared in Tables 4 and 5 on Google Colab Pro. Although different approaches are trained on the Extended ISTD dataset for different numbers of epochs, Table 7 mentions the training time against 280 epochs for a standardised comparison.

Table 7 Comparison of model parameters, training time on Extended ISTD dataset for 280 epochs and inference time per image on Google Colab Pro

Future research can focus on designing more robust and efficient deep learning architectures tailored specifically for shadow removal tasks. This includes exploring novel network structures, attention mechanisms, and fusion strategies to handle various types of shadows and lighting conditions effectively. Investigating and exploring techniques for uncertainty estimation can provide insights into model confidence and help identify potential failure cases. Moreover, developing techniques to enhance the robustness of deep learning models against adversarial attacks and perturbations can improve their reliability in real-world scenarios. This includes adversarial training, robust optimisation methods, and adversarial defence mechanisms tailored for shadow removal tasks. Customising and fine-tuning pre-trained models for specific application domains can improve their performance and efficiency. This involves understanding the unique characteristics and challenges of the target domain and adapting the model accordingly through domain-specific loss functions and training strategies. Enhancing the interpretability and explainability of deep learning models can facilitate their pragmatic adoption. Research can focus on developing methods for visualising model decisions, understanding feature importance, and providing insights into the underlying shadow removal process.

Another important factor that is crucial for the adoption of deep architectures for shadow removal is their processing capability in real-time applications like autonomous vehicles, which require quick and video-based shadow removal procedures. Current state-of-the-art methods are generally complex and computationally expensive. Therefore, achieving promising performance with lightweight and efficient deep architectures will be the focus of attention in upcoming studies. To achieve this, network pruning and model quantisation approaches can be adopted. Network pruning and quantisation present promising avenues for making complex and computationally heavy deep learning algorithms like GANs and transformer-based systems more efficient without sacrificing performance. Pruning involves identifying and removing redundant or insignificant parameters from neural networks, leading to reduced model size and computational requirements while preserving accuracy. Techniques such as magnitude-based pruning, weight pruning, and structured pruning can effectively reduce model complexity without compromising performance. Additionally, quantisation techniques aim to represent network parameters with reduced bit precision, thereby decreasing memory usage and improving inference speed. Methods like uniform quantisation, non-uniform quantisation, and quantisation-aware training enable efficient representation of model parameters while maintaining performance. By employing network pruning and quantisation, researchers can enhance the efficiency of advanced deep learning algorithms for shadow removal tasks, making them more viable for resource-constrained environments without compromising performance. Studies like [22] and [12] provide insights into the effectiveness of these techniques in optimising deep learning models for real-world deployment.

4 Conclusion

This study introduces an innovative approach for the detection and removal of shadows from RGB images, all the while preserving the underlying background texture. To facilitate this objective, a substantial triplet dataset (Extended ISTD) is introduced. This dataset, comprising shadow, shadow-mask, and shadow-free images, serves to enhance the distribution of varying samples across the dataset. This augmentation is achieved by incorporating dark/hard shadow samples and multi-color contrast shadow samples into the training dataset. For robust shadow removal, a stacked GAN architecture with generators equipped with spatial attention and nearest-neighbour upsampling is proposed. Additionally, a post-processing phase is incorporated, harnessing the capabilities of diverse traditional image processing techniques, such as histogram matching, custom filters, and shadow boundary detection and estimation. This post-processing step contributes to refining the intermediate shadow-free image produced by the specially crafted GAN-based deep learning model. Following training on the Extended ISTD dataset and subsequent post-processing, the proposed deep learning model is adept at accurately generating high-quality shadow-free images. With the proposed methodology addressing shadow removal, future efforts will be directed toward extending its capabilities beyond single RGB images to encompass video shadow detection and removal.