1 Introduction

Image restoration, the process of transforming a degraded picture back into its high-quality state, remains a substantial challenge in computer vision. The degradation may take the form of noise, haze, blur, or elements such as rain streaks, among other factors. The task is difficult because the solution space is practically infinite, with countless plausible restorations for any given degraded image. Traditional restoration techniques [1,2,3,4,5,6,7] have relied on explicitly defined image priors, handcrafted from empirical observations. Such an approach, however, is inherently limited by the difficulty of designing these priors and by their lack of generalizability across different scenarios.

With the advancement of machine learning technologies, recent cutting-edge techniques have turned towards the use of convolutional neural networks (CNNs) [8,9,10,11,12,13,14,15,16]. CNNs offer the ability to implicitly learn more adaptable and general image priors by analyzing the statistical characteristics of natural images across extensive datasets.

The superior performance achieved by CNN-based approaches is largely due to the careful design and integration of numerous network elements and functional modules developed for image restoration [8, 11, 15, 17,18,19,20,21,22,23].

Fig. 1

Image deraining on the Rain100H [18], Rain100L [18], Test100 [24], Test1200 [25] and Test2800 [26] datasets. Despite having far fewer parameters than the underlying model, our optimized multi-stage technique outperforms the original cutting-edge MPRNet [27,28,29] in terms of PSNR and SSIM, suggesting better image restoration quality from a human visual perception standpoint

Inspired by the success of the MPRNet architecture [27], we introduce a derivative model, named RainDNet, which further advances the image restoration task, specifically for de-raining. The MPRNet [27] model incorporated an innovative multi-stage design [30,31,32,33], contrasting with the traditional single-stage architectures prevalent in low-level vision tasks [30, 34,35,36]. Our RainDNet model preserves this powerful multi-stage design and introduces further refinements.

To reduce the computational complexity and memory footprint of the network, RainDNet employs depthwise separable convolutions [37,38,39,40,41,42], a variant of standard convolutions that decouples the learning of spatial and depth-wise features. This change significantly reduces the number of parameters in the model without sacrificing performance. Furthermore, RainDNet enhances the original loss function of MPRNet [27] by integrating perceptual and SSIM losses in addition to the standard L1 and edge losses. These additions promote the preservation of perceptual quality and structural similarity in the restored images, thus further improving the visual quality of the derained outputs.

In this paper, we conduct an extensive evaluation of the proposed RainDNet model. Our comparative study (Fig. 1) shows that RainDNet mostly achieves better PSNR values than the original MPRNet [27] while significantly surpassing it in terms of SSIM, thus offering an improved trade-off between accuracy and perceptual quality. Our findings establish RainDNet as a promising approach for deraining tasks, opening avenues for further improvements in image restoration.

The key contributions presented in this study include:

  • We introduce a unique framework, RainDNet, for the restoration of images. This design draws inspiration from the multi-stage structure of MPRNet [27] and incorporates depthwise separable convolutions [37,38,39,40,41,42] to lessen the computational demand.

  • We introduce a modified loss function that incorporates perceptual and SSIM losses in addition to L1 and edge losses, enhancing the quality of restored images.

  • We demonstrate the effectiveness of our model by conducting comprehensive experiments, comparing our model with the cutting-edge MPRNet [27] on multiple datasets. Our results mostly exhibit better PSNR performance, significantly better SSIM results, and better BRISQUE results, thus confirming the efficacy of our approach.

2 Related work

Throughout the last few decades, image-capturing technology has seen a significant transformation. We are transitioning from traditional high-end DSLR cameras towards more compact and user-friendly smartphone cameras. Early restoration approaches were anchored in mathematical and empirical methods like total variation [6, 43], sparse coding [44,45,46], self-similarity [47, 48], and gradient prior [49, 50]. These methods relied heavily on handcrafted features and were not always generalizable to diverse image degradation scenarios.

Convolutional neural networks (CNNs) in image restoration With the advent of deep learning, the focus shifted towards CNNs. CNN-based restoration methods have outperformed traditional methods [10, 11, 13, 15, 51, 52], providing a more robust and generalizable approach to image restoration. Among CNN-based methods, single-stage approaches currently dominate the field. They often repurpose architectural components developed for high-level vision tasks.

The advent of depthwise separable convolutions Another significant evolution in the field of deep learning is the introduction of depthwise separable convolutions [37,38,39,40,41,42], an efficient variant of standard convolutions. This efficient approach has been adopted in a variety of domains, showing great promise in enhancing the performance and efficiency of deep learning models.

Multi-stage approaches In contrast to the prevalent single-stage methods, multi-stage approaches [51, 53,54,55,56,57,58] aim to tackle the image restoration problem in a more structured manner. These methods progressively restore the clean image by incorporating a lightweight subnetwork at each stage. However, one common practice in these methods that could lead to suboptimal results is the use of identical subnetworks for each stage.

The use of attention mechanisms A more recent innovation in deep learning that has found its way into the domain of image restoration is the attention mechanism [51, 54, 55]. These modules record extensive mutual dependencies along spatial [59] and channel [60] dimensions, allowing for better context-aware processing of features [61].

Introduction of RainDNet In the current work, we introduce a novel variant of the well-established Multi-Stage Progressive Restoration Network (MPRNet [27]) for image deraining. In our proposed model, RainDNet, we replace some of the standard convolutions with depthwise separable convolutions [37,38,39,40,41,42], yielding a significant reduction in the model’s parameter count. Furthermore, we introduce perceptual and structural similarity (SSIM) losses in addition to the conventional L1 and edge losses, contributing to the model’s enhanced performance in capturing perceptually important image details. The revamped model, RainDNet, exhibits superior Peak Signal-to-Noise Ratio (PSNR) values and Structural Similarity Index Measure (SSIM) values when compared with the existing MPRNet [27] model. This indicates that our proposed solution, while being more computationally efficient, does not compromise image restoration performance.

Fig. 2

RainDNet, our proposed architecture for image deraining, employs depthwise separable convolutions [37,38,39,40,41,42] in certain stages to optimize parameter utilization. The stages and their operations remain consistent with the original MPRNet [27] design, preserving the progressive restoration capability

Fig. 3

Illustration of the depthwise separable convolution [37,38,39,40,41,42] Operation. This diagram demonstrates the two-step process of depthwise convolution followed by pointwise convolution, showcasing its efficiency in capturing spatial and cross-channel information with significantly fewer parameters compared to standard convolutions. This key alteration in our RainDNet architecture allows for similar restoration performance with a leaner model footprint

3 Gradual multi-stage enhancement

The framework we propose for image restoration, depicted in Fig. 2, proceeds in three stages that progressively refine the image. As in the predecessor’s architecture, the first two stages leverage encoder–decoder subnetworks to acquire broad contextual information via large receptive fields. Acknowledging that image restoration is an intrinsically position-sensitive operation, the final stage of our model operates at the original input resolution without any downsampling. This design choice preserves fine textures and spatial details in the output image, which is critical for the modest gain in PSNR and the large improvement in SSIM observed in our model.

Rather than simply stringing several stages together, we weave a supervised attention component between each pair of consecutive stages. This component refines the feature maps from the previous stage before feeding them into the next one, under supervision from ground-truth images. This process optimizes the information transfer between stages, which contributes to the improved performance of our model.

In addition to these modifications, we present a cross-stage feature fusion technique. It allows intermediate, multi-scale, context-aware features from earlier subnetworks to reinforce the intermediate features of subsequent subnetworks. This interplay among stages and the efficient reuse of learned features across the network not only slightly improves PSNR over the original model but also raises SSIM scores significantly, positioning our model as a competent and improved variant of the original MPRNet [27] architecture.

Despite RainDNet having multiple stages, each stage can access the input image directly. Following recent restoration methods [51], we apply a multi-patch hierarchy to the input image, dividing it into non-overlapping patches: four for stage 1, two for stage 2, and the original image for the final stage, as illustrated in Fig. 2.
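As a concrete illustration, the following minimal PyTorch sketch splits an input batch into such a multi-patch hierarchy. It assumes an MPRNet-style layout (top/bottom halves for stage 2, quadrants for stage 1); the exact split used in RainDNet follows Fig. 2.

```python
import torch

def multi_patch_hierarchy(img: torch.Tensor):
    """Split a batch of images (B, C, H, W) into non-overlapping patches:
    four quadrants for stage 1, two halves for stage 2, and the full
    image for the final stage. A sketch under assumed split conventions."""
    _, _, H, W = img.shape
    top, bottom = img[:, :, :H // 2, :], img[:, :, H // 2:, :]
    # Stage 2: two non-overlapping halves
    stage2 = [top, bottom]
    # Stage 1: four non-overlapping quadrants
    stage1 = [top[:, :, :, :W // 2], top[:, :, :, W // 2:],
              bottom[:, :, :, :W // 2], bottom[:, :, :, W // 2:]]
    # Final stage: the original-resolution image
    stage3 = img
    return stage1, stage2, stage3
```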

Fig. 4

a The encoder–decoder subnetwork, where specific convolution layers are replaced with depthwise separable convolutions [37,38,39,40,41,42] for a more efficient model. b A comprehensive depiction of the modified Original Resolution Block (ORB) in our ORSNet subnetwork is provided. Each ORB consists of several depthwise separable convolutions, in addition to channel attention blocks. The acronym GAP represents Global Average Pooling [66]. c Cross Stage Feature Fusion (CSFF) between the first and second phases is demonstrated. d Demonstration of CSFF across the second and final phases, highlighting the flow and fusion of features in our RainDNet architecture

At any stage S, we propose a combined loss function for the rain-streak removal task, defined as:

$$\begin{aligned} L_{total} = \lambda _{L1} \cdot L_{L1} + \lambda _{perc} \cdot L_{perc} + \lambda _{edge} \cdot L_{edge} + \lambda _{ssim} \cdot L_{ssim} \end{aligned}$$
(1)

where \(L_{total}\) is the total loss, \(L_{L1}\) is the L1 loss, \(L_{perc}\) is the perceptual loss, \(L_{edge}\) is the edge loss, and \(L_{ssim}\) is the SSIM loss. The coefficients \(\lambda _{L1}\), \(\lambda _{perc}\), \(\lambda _{edge}\), and \(\lambda _{ssim}\) balance these loss components. Each component of the combined loss is described below; a minimal implementation sketch follows the list.

  • \(L_{L1}\): The L1 loss [62] calculates the absolute difference between the target and output images pixel-wise. It is defined as:

    $$\begin{aligned} L_{L1}(o, t) = \frac{1}{N} \sum _{i=1}^{N} |o_{i} - t_{i}| \end{aligned}$$
    (2)

    where o and t are the output and target images respectively, and N is the total number of pixels. This loss encourages the model to focus on all discrepancies, big or small, in the restored and target images.

  • \(L_{perc}\): The perceptual loss [63] uses a pre-trained VGG16 model to extract feature maps from the restored and target images. The L1 loss is then applied to these feature maps, defined as:

    $$\begin{aligned} L_{perc}(o, t) = \frac{1}{W \times H \times C} \sum _{w,h,c} |F_{o}^{w,h,c} - F_{t}^{w,h,c}| \end{aligned}$$
    (3)

    where \(F_{o}\) and \(F_{t}\) are the feature maps of output and target images extracted by the VGG16 model, W, H, and C are the width, height, and number of channels of the feature maps. This loss ensures the model produces a restored image that is not only pixel-wise accurate but also shares similar high-level features (i.e., texture and content) with the target image.

  • \(L_{edge}\): The Edge Loss [64] first applies the Sobel filter to the restored and target images to highlight the edges in the images. The L1 loss is then applied to these edge maps. The edge loss is defined as:

    $$\begin{aligned} L_{edge}(o, t) = \frac{1}{N} \sum _{i=1}^{N} |E_{o_{i}} - E_{t_{i}}| \end{aligned}$$
    (4)

    where \(E_{o}\) and \(E_{t}\) are the edge maps of output and target images created using the Sobel operator. This loss encourages the model to pay attention to the edges in the image, which is crucial in maintaining the structure and details of the scene.

  • \(L_{ssim}\): The Structural Similarity Index Measure (SSIM) loss [65] is used to ensure that the restored image shares structural similarity with the target image. For each color channel, the SSIM index is defined as:

    $$\begin{aligned} SSIM_{c}(o, t) = \frac{(2 \mu _{o,c} \mu _{t,c} + c_{1}) (2 \sigma _{ot,c} + c_{2})}{(\mu _{o,c}^{2} + \mu _{t,c}^{2} + c_{1}) (\sigma _{o,c}^{2} + \sigma _{t,c}^{2} + c_{2})} \end{aligned}$$
    (5)

    and the SSIM loss is defined as:

    $$\begin{aligned} L_{ssim}(o, t) = \frac{1}{C} \sum _{c=1}^{C} (1 - SSIM_{c}(o, t)) \end{aligned}$$
    (6)

    where \(\mu _{o,c}\) and \(\mu _{t,c}\) are the means of o and t for color channel c, \(\sigma _{o,c}^{2}\) and \(\sigma _{t,c}^{2}\) are the variances of o and t for channel c, \(\sigma _{ot,c}\) is the covariance of o and t for channel c, and \(c_{1}\) and \(c_{2}\) are two constants that stabilize the division when the denominator is weak.
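As a concrete illustration of Eq. (1), the sketch below assembles the combined loss in PyTorch. This is a minimal sketch rather than the exact training code: the choice of VGG16 feature layer, the Sobel-based edge map, the global-statistics SSIM of Eq. (5), and the weighting coefficients in LAMBDAS are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision

# Hypothetical loss weights; the coefficients actually used to train RainDNet are not reproduced here.
LAMBDAS = {"l1": 1.0, "perc": 0.1, "edge": 0.05, "ssim": 0.4}

# Frozen VGG16 features (up to an assumed mid-level layer) for the perceptual loss.
# ImageNet normalization of the inputs is omitted for brevity.
_vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def edge_map(x):
    """Sobel edge magnitude, computed per channel."""
    kx = torch.tensor([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]], device=x.device)
    ky = kx.t()
    c = x.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(x, kx, padding=1, groups=c)
    gy = F.conv2d(x, ky, padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def ssim_per_channel(o, t, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global-statistics SSIM per channel, following Eq. (5), for images in [0, 1]."""
    mu_o, mu_t = o.mean(dim=(2, 3)), t.mean(dim=(2, 3))
    var_o, var_t = o.var(dim=(2, 3)), t.var(dim=(2, 3))
    cov = ((o - mu_o[..., None, None]) * (t - mu_t[..., None, None])).mean(dim=(2, 3))
    return ((2 * mu_o * mu_t + c1) * (2 * cov + c2)) / (
        (mu_o ** 2 + mu_t ** 2 + c1) * (var_o + var_t + c2))

def combined_loss(output, target):
    """Weighted sum of L1, perceptual, edge, and SSIM losses, as in Eq. (1)."""
    l1 = F.l1_loss(output, target)
    perc = F.l1_loss(_vgg(output), _vgg(target))
    edge = F.l1_loss(edge_map(output), edge_map(target))
    ssim = (1.0 - ssim_per_channel(output, target)).mean()
    return (LAMBDAS["l1"] * l1 + LAMBDAS["perc"] * perc
            + LAMBDAS["edge"] * edge + LAMBDAS["ssim"] * ssim)
```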

Depth-wise separable convolution block Depth-wise separable convolution blocks [37,38,39,40,41,42], visualized in Fig. 3, are incorporated for feature extraction and several other tasks, replacing the standard convolution blocks in the proposed model. The idea is to factorize the standard convolution operation into a depth-wise convolution and a point-wise convolution, which significantly reduces the computational burden without compromising the network’s ability to capture complex patterns in the data. Depth-wise separable convolutions exploit spatial and cross-channel correlations separately, enabling the model to maintain a satisfactory level of representation learning with fewer parameters and lower computational complexity. They offer computational efficiency and parameter reduction, which makes the model lighter, faster, and more suitable for tasks where computational resources are constrained. Furthermore, the reduced complexity helps alleviate overfitting, potentially improving the model’s performance on unseen data. These advantages make depth-wise separable convolution blocks a preferred choice for our network architecture in the image restoration task.
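To make the factorization concrete, the following minimal PyTorch sketch shows a depthwise separable convolution block of the kind substituted for standard convolutions; the kernel size and bias setting are illustrative assumptions rather than the exact RainDNet configuration.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A k x k convolution factorized into a depthwise convolution
    (spatial filtering, one filter per input channel) followed by a
    1 x 1 pointwise convolution (cross-channel mixing)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, bias=False):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_ch, bias=bias)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=bias)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for in_ch = out_ch = 64, k = 3 (bias-free):
#   standard convolution:   64 * 64 * 3 * 3 = 36,864 weights
#   depthwise + pointwise:  64 * 3 * 3 + 64 * 64 = 4,672 weights
```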

Fig. 5

Modified supervised attention module: this illustration depicts our refined version of the Supervised Attention Module, which emphasizes feature refinement at each stage of the RainDNet architecture

Table 1 Overview of the image deraining dataset

3.1 Processing of complementary features

Modern single-stage CNN models for image restoration mostly follow one of two architectural designs: (1) an encoder–decoder framework or (2) a single-scale feature pipeline. Encoder–decoder structures [22, 23, 67] first map the input to low-resolution representations and then apply a reverse mapping to recover the original resolution.

Conversely, strategies that operate on a single-scale feature pipeline are proficient at producing images with precise spatial details [13, 15, 68, 69]. Although these models maintain spatial precision, their outputs often lack semantic richness due to their restricted receptive field. These observations underscore the inherent constraints of conventional architectural designs, which can generate either spatially precise or contextually reliable outputs, but often struggle to attain both.

Aiming to capitalize on the benefits of both design strategies, we propose a multi-tier framework. In our model, the initial stages utilize encoder–decoder networks, with the last stage operating directly on the original input resolution. Furthermore, we integrate depthwise separable convolutions [37,38,39,40,41,42] into our framework. This integration considerably lowers the model’s complexity, while preserving its competitive performance levels.

Subnetwork with encoder–decoder configuration As depicted in Fig. 4a, our encoder–decoder subnetwork, derived from the standard U-Net [28], includes several adjustments to accommodate our specific requirements. We mainly use channel attention blocks (CABs) [29] to collect multi-scale features. The feature maps at the U-Net’s skip connections are processed by a CAB (refer to Fig. 4b). Finally, instead of using transposed convolutions [70] to increase the spatial scale of features in the decoder, we employ bilinear upsampling followed by a convolution layer.
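The decoder’s upsampling choice can be sketched as follows; the kernel size and channel handling are illustrative assumptions. The point is that a learned transposed convolution is replaced by bilinear interpolation plus an ordinary convolution, a combination commonly used to avoid checkerboard artifacts.

```python
import torch.nn as nn

class UpsampleConv(nn.Module):
    """Bilinear upsampling followed by a convolution, used in place of
    a transposed convolution in the decoder."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(self.up(x))
```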

Table 2 Results from image deraining evaluations

In our proposed architecture, we make significant modifications by employing depthwise separable convolutions [37,38,39,40,41,42]. This modification allows more efficient feature extraction at each scale while decreasing the model’s complexity. As a result, the balance between performance and computational efficiency improves, yielding a more lightweight model with superior performance.

Subnetwork of original resolution We modify the final stage of the architecture to retain the fine-grained information of the source image in the output. This change introduces a subnetwork that operates at the original image resolution (refer to Fig. 2). Termed the original-resolution subnetwork (ORSNet), this module avoids any downsampling and generates high-resolution features rich in spatial detail. ORSNet is made up of several original-resolution blocks (ORBs), which in turn incorporate CABs. The structural outline of an ORB is shown in Fig. 4b.

This novel inclusion of the ORSNet in the final stage ensures that the fine spatial details of the image are preserved, contributing to a more detailed output. The ORB, with its multiple CABs and absence of downsampling, focuses on enhancing the structural similarity of the output image, thereby significantly contributing to the improved SSIM score achieved by our model.

3.2 Supervised attention module

Cutting-edge multi-stage frameworks for image restoration [51, 58] employ a straightforward strategy in which each stage produces an image prediction that is then forwarded to the next stage. We depart from this routine by incorporating a supervised attention module (SAM) between each pair of consecutive stages, which contributes to a substantial improvement in performance.

A structural diagram of the SAM is shown in Fig. 5, and it offers two advantages. First, it incorporates ground-truth supervisory signals that aid the progressive restoration of images at every stage. Second, we generate attention maps by exploiting the locally supervised predictions. These maps suppress less useful information at the current stage, allowing only the most significant features to propagate to the subsequent stage.

Notably, this strategy of selectively preserving the most impactful features from one stage to the next directly influences the overall structural similarity of the final result.

SAM operates on the incoming features of the preceding stage, \(F_{in} \in \mathbb {R}^{H \times W \times C}\), and produces a residual image, \(R_S \in \mathbb {R}^{H \times W \times C}\), using a simple \(1 \times 1\) convolution. The spatial dimensions are denoted by \(H \times W\) and the channel count by C. This residual image is added to the degraded input image I to produce the restored image, \(X_S \in \mathbb {R}^{H \times W \times C}\).

The predicted image \(X_S\) is then given explicit supervision using the ground-truth image. Subsequently, attention masks \(M \in \mathbb {R}^{H \times W \times C}\) are constructed from \(X_S\) by applying a \(1 \times 1\) convolution and a sigmoid activation. These masks recalibrate the transformed local features \(F_{in}\) (obtained after a \(1 \times 1\) convolution), producing attention-guided features that are then added to the identity-mapping path. Finally, the output of SAM, an attention-enhanced feature representation \(F_{out}\), is forwarded to the subsequent stage for further refinement.
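The description above can be condensed into the following minimal PyTorch sketch of a SAM-style module; the number of image channels and the exact convolution configuration are assumptions for illustration, not the verbatim RainDNet implementation.

```python
import torch
import torch.nn as nn

class SupervisedAttentionModule(nn.Module):
    """SAM-style block: predicts a residual image, forms a restored image
    X_S = R_S + I (which receives ground-truth supervision), then derives
    attention masks from X_S to gate the features passed to the next stage."""
    def __init__(self, n_feat, img_ch=3):
        super().__init__()
        self.to_residual = nn.Conv2d(n_feat, img_ch, kernel_size=1)
        self.to_feat = nn.Conv2d(n_feat, n_feat, kernel_size=1)
        self.to_mask = nn.Conv2d(img_ch, n_feat, kernel_size=1)

    def forward(self, f_in, img):
        r_s = self.to_residual(f_in)              # residual image R_S
        x_s = r_s + img                           # restored image X_S (supervised)
        mask = torch.sigmoid(self.to_mask(x_s))   # attention masks M
        f_att = self.to_feat(f_in) * mask         # attention-guided features
        f_out = f_att + f_in                      # identity-mapping path
        return f_out, x_s
```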

4 Experiments and analysis

We put our proposed technique to the test for a single image restoration task, namely image deraining, across five different datasets.

4.1 Datasets and evaluation protocol

Following the latest research for image deraining, we train our architecture using 13,712 clean rain-image pairs sourced from a diverse set of datasets [18, 24,25,26, 71], as outlined in Table 1. With this universally developed model, we proceed to evaluations on several testing sets, such as Rain100H [18], Rain100L [18], Test100 [24], Test2800 [26], and Test1200 [25].

Table 3 BRISQUE score results from image deraining evaluations

We carry out numerical evaluations using the PSNR, SSIM [72], and BRISQUE [73] metrics. The Peak Signal-to-Noise Ratio (PSNR) between two images (original and reconstructed) is defined as:

$$\begin{aligned} PSNR = 20 \cdot \log _{10}\left( \frac{MAX_{I}}{\sqrt{MSE}}\right) \end{aligned}$$
(7)

where \(MAX_{I}\) is the maximum possible pixel value of the image. For an 8-bit grayscale image, the maximum possible pixel value is 255. MSE represents the Mean Squared Error, which measures the average squared differences between the original and the reconstructed images.
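A minimal sketch of Eq. (7), assuming images stored as floating-point arrays in the [0, 1] range (so \(MAX_{I} = 1\)):

```python
import numpy as np

def psnr(original: np.ndarray, restored: np.ndarray, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio, Eq. (7)."""
    mse = np.mean((original - restored) ** 2)
    return 20.0 * np.log10(max_val / np.sqrt(mse))
```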

Table 4 Comparison of trainable parameters

The Structural Similarity Index Measure (SSIM) compares the similarity of two images (say x and y). It is calculated as:

$$\begin{aligned} SSIM(x, y) = \frac{(2\mu _x\mu _y + c_1)(2\sigma _{xy} + c_2)}{(\mu _x^2 + \mu _y^2 + c_1)(\sigma _x^2 + \sigma _y^2 + c_2)} \end{aligned}$$
(8)

where \(\mu _x\) is the average of x, \(\mu _y\) is the average of y, \(\sigma _x^2\) is the variance of x, \(\sigma _y^2\) is the variance of y, \(\sigma _{xy}\) is the covariance of x and y, and \(c_1\) and \(c_2\) are two variables to stabilize the division with weak denominator.
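Eq. (8) in its global-statistics form can be sketched as follows; practical SSIM implementations typically apply it over local windows and average the results, which this sketch omits for brevity.

```python
import numpy as np

def ssim_index(x: np.ndarray, y: np.ndarray,
               c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """Global-statistics SSIM, Eq. (8), for images scaled to [0, 1]."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```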

BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator) [73] is a no-reference image quality score. It uses a pre-trained SVM model to compute the final score from five attributes obtained from the given image: the MSCN (Mean Subtracted Contrast Normalized) image and its four shifted versions. A lower BRISQUE value implies better image quality.

To compute the MSCN coefficients, the image intensity I(i, j) at pixel (i, j) is transformed to \({\widehat{I}}(i, j)\) as:

$$\begin{aligned} {\widehat{I}}(i, j) = \frac{I(i,j)-\mu (i,j)}{(\sigma (i,j)+C)}\end{aligned}$$
(9)

where \(i \in \{1, 2, \ldots , M\}\) and \(j \in \{1, 2, \ldots , N\}\) (M and N are the image height and width, respectively), and \(\mu (i, j)\) and \(\sigma (i, j)\) are the local mean field and local variance field, respectively.
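A minimal sketch of the MSCN transform in Eq. (9), assuming a 2-D grayscale array and Gaussian-weighted local mean and deviation fields; the window parameter and the constant C are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(image: np.ndarray, sigma: float = 7.0 / 6.0, c: float = 1.0) -> np.ndarray:
    """Mean Subtracted Contrast Normalized coefficients, Eq. (9)."""
    image = image.astype(np.float64)
    mu = gaussian_filter(image, sigma)                    # local mean field
    var = gaussian_filter(image ** 2, sigma) - mu ** 2    # local variance field
    sigma_field = np.sqrt(np.clip(var, 0.0, None))        # local deviation
    return (image - mu) / (sigma_field + c)
```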

We display the relative decrease in error for each approach compared to the top performer by converting PSNR to RMSE (\(\mathrm {RMSE} \propto \sqrt{10^{-\mathrm {PSNR}/10}}\)) and SSIM to DSSIM (\(\mathrm {DSSIM} = (1 - \mathrm {SSIM})/2\)).
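These conversions amount to the following one-liners (a sketch; for images in the [0, 1] range the proportionality for RMSE becomes an equality):

```python
import math

def psnr_to_rmse(psnr_db: float) -> float:
    """RMSE implied by a PSNR value (exact when MAX_I = 1)."""
    return math.sqrt(10.0 ** (-psnr_db / 10.0))

def ssim_to_dssim(ssim: float) -> float:
    """Structural dissimilarity."""
    return (1.0 - ssim) / 2.0
```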

Fig. 6

Results of image deraining using the RainDNet model. Marking a significant leap in performance, our RainDNet model expertly removes rain streaks and yields images that are not only realistic and devoid of artifacts, but also visually far more similar to the ground truth compared to prior models, particularly its precursor, the MPRNet [27]

4.2 Implementation

Our proposed RainDNet model is built for end-to-end training and requires no pretraining steps. A distinguishing feature of our model is the use of depthwise separable convolutions [37,38,39,40,41,42], which manage computational resources efficiently while preserving the network’s representational capacity. This approach is well suited to our task and is applied at various scales of our encoder–decoder network.

To facilitate the extraction of salient features at every scale, we incorporate two Channel Attention Blocks (CABs), using \(2\times 2\) max-pooling with a stride of 2 for downsampling. The final stage of our model features an Original Resolution Subnetwork (ORSNet) composed of three Original Resolution Blocks (ORBs), each embedded with eight CABs.
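For reference, a generic channel attention block of the kind referred to here (GAP followed by a bottleneck of 1 x 1 convolutions and a sigmoid gate) can be sketched as follows; the reduction ratio, activation choice, and body convolutions are assumptions rather than the exact RainDNet block.

```python
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Generic CAB: two convolutions followed by a squeeze-and-excitation
    style channel attention branch (GAP -> 1x1 conv -> ReLU -> 1x1 conv -> sigmoid)."""
    def __init__(self, n_feat, reduction=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feat, n_feat, 3, padding=1), nn.PReLU(),
            nn.Conv2d(n_feat, n_feat, 3, padding=1),
        )
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # GAP over H x W
            nn.Conv2d(n_feat, n_feat // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n_feat // reduction, n_feat, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        res = self.body(x)
        res = res * self.attention(res)   # channel-wise recalibration
        return res + x                    # residual connection
```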

To tailor the network to the specifics of the deraining task, we set the network width to 40 channels. Training is performed on patches of size \(256\times 256\) with a batch size of 2 per GPU on two NVIDIA RTX 3090 GPUs, for an effective batch size of 4. The training process lasts for 15k iterations.

For optimization, we use the AdamW [77] optimizer with a learning rate of \(3\times 10^{-4}\). Notably, AdamW applies weight decay directly to the weights rather than folding it into the gradient update, an approach rooted in early neural network practice where weight decay was implemented as direct weight shrinkage.
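The optimizer setup reduces to a few lines (a sketch; the weight decay value is an assumption, as only the learning rate is specified above):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # placeholder module; substitute the RainDNet model here
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
```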

4.3 Image deraining results

In accordance with previous studies, specifically reference [31], we compute the quality metrics for the image deraining task on the Y channel (of the YCbCr color space). As presented in Table 2, our approach surpasses the current cutting-edge model, yielding mostly superior PSNR and SSIM results across the five datasets. The BRISQUE scores of the proposed model and the state-of-the-art models are shown in Table 3; the proposed model outperforms the other models in terms of BRISQUE on all five datasets. Compared to the most recent top-performing algorithm, MPRNet [27], our method achieves an average performance gain of 0.68 dB across all datasets. Moreover, our model is more efficient, having 3.5x fewer parameters than MPRNet [27], as is evident from Table 4, which lists the number of trainable parameters of the different models. Fig. 6 provides visual comparisons on challenging images. Our RainDNet effectively eradicates rain streaks of different orientations and intensities and generates images that are visually pleasing and closely aligned with the ground truth. In contrast, other techniques compromise structural content (first row), generate artifacts (second row), or fail to completely remove rain streaks (third row).

Table 5 Examination of the individual elements of the proposed RainDNet through an ablation study

4.4 Ablation studies

In this section, we conduct a variety of experiments to understand the impact of each element of our RainDNet model. Our analysis uses the Rain100H [18] dataset, with the deraining models trained on image patches of size \(256\times 256\). The findings are presented in Table 5.

Number of stages Increasing the number of stages in RainDNet improves its performance, reinforcing the effectiveness of our multi-stage design.

Selection of subnetworks In our model, different types of subnetworks can be utilized in each stage. Hence, we experimented with several alternatives. Our observations reveal that deploying the encoder–decoder subnetwork in the initial stages and the ORSNet in the final stage yields superior outcomes as compared to employing a uniform design across all stages.

SAM, CSFF, and depthwise separable convolutions We also wanted to understand the impact of the Supervised Attention Module (SAM), the Cross Stage Feature Fusion (CSFF) mechanism, and depthwise separable convolutions [37,38,39,40,41,42] on the performance of our model. When we removed the SAM from our model, there was a significant drop in the PSNR. The same thing happened when we removed the CSFF. Removing both components led to an even bigger drop in performance. The introduction of depthwise separable convolutions, however, resulted in a significant boost to the model’s performance, confirming its importance in our architecture.

5 Conclusion

In this research, we present RainDNet, an enhanced multi-stage framework for image restoration. Extending the core principles of MPRNet, our model systematically enhances impaired inputs by embedding supervised attention within every stage. We define key tenets to shape our design, emphasizing the amalgamation of feature processing across several stages, coupled with a flexible exchange of information between stages.

RainDNet brings in stages that are rich in contextual information and spatial precision, working in unison to encode a wide array of features. We have implemented depthwise separable convolutions, thereby improving computational efficiency while preserving the model’s effectiveness. To facilitate fruitful cooperation between interconnected stages, we have designed a unique feature integration process across stages, along with a supervised attention module that navigates the exchange of outputs from preceding stages to the ones that follow.

Our advancements yield considerable performance improvements in PSNR, SSIM, and BRISQUE scores, even when compared against the strong baseline of MPRNet. Demonstrated across various benchmark datasets, RainDNet not only exhibits superior restoration capabilities but also offers a desirable trade-off between model size and restoration quality. This advantage makes RainDNet especially fitting for devices with limited resources, without compromising the quality of the restored images.

As we move forward, we envision a promising scope for the continued development and optimization of the RainDNet model. By virtue of its sophisticated design and superior performance, RainDNet has the potential to pioneer new directions in image restoration and even in broader fields of computer vision.

One potential area of future exploration is the application of RainDNet in Advanced Driver-Assistance Systems (ADAS). Given its proficiency in enhancing degraded images, RainDNet could play a pivotal role in improving the accuracy and reliability of such systems, especially under adverse weather conditions. Its ability to accurately eliminate rain and haze from images could significantly improve the visual perception capabilities of ADAS, thus enhancing the safety and efficiency of automated driving.

Furthermore, we anticipate potential modifications in the RainDNet model that could accommodate other types of image degradation, like snow, dust, or fog. Future research might also investigate the application of RainDNet’s depthwise separable convolution strategy to other architectures, potentially sparking advancements in computational efficiency across a range of computer vision tasks.

Finally, we recognize the value of continued refinements to our supervised attention module and feature fusion techniques. These enhancements could further boost the model’s performance and establish RainDNet as a robust standard in the domain of image restoration. In conclusion, the future seems bright for RainDNet, with numerous avenues for exploration and expansion.