1 Introduction

Various image restoration tasks such as dehazing [33], inpainting [1] and image deraining [17] improve image quality, which in turn improves the accuracy of high-level CV tasks such as object classification and detection [25]. Image deraining has therefore attracted considerable attention in the low-level CV community. Although many traditional algorithms have been proposed to remove rain streaks from rainy images, the task remains complex and difficult, as no temporal information is available in a single captured image [39].

Consider a rainy image X, which can be expressed as the sum of a rain layer R and a background image B; the physical model is given as:

$$ X = R + B $$
(1)

Single image deraining is therefore an ill-posed problem: only X is known, and there are infinitely many solutions for the unknowns B and R. Most existing networks cast rain streak removal as an optimization problem. Accordingly, existing deraining networks fall into one of two categories: traditional model-driven prior-based approaches and deep learning-based data-driven approaches.

Earlier researchers developed traditional model-driven prior-based approaches such as sparse coding [24], decomposition [17] and Gaussian mixture models (GMM) [20] to remove rain streaks from rainy images. However, these traditional methods rely on handcrafted features and are therefore very sensitive to image variations. With the rapid growth and improvement of deep learning technology, researchers have moved to data-driven approaches such as convolutional neural networks (CNNs) [42] and transformers [34] for removing rain streaks. Compared to traditional model-driven approaches, deep learning-based data-driven approaches are more robust and achieve excellent results. Currently, most data-driven algorithms use a CNN as their backbone to remove rain streaks. However, CNNs have limited receptive fields: they capture only local spatial information and fail to capture broad contextual information.

To resolve this problem, some deraining networks introduce dilated convolutions [18, 38] or construct deeper networks [8, 22, 37] to enlarge the CNN receptive field. However, the resulting features remain local, since convolution merely slides a window and computes a local weighted sum. Stacking additional convolutional layers only increases network complexity, which can lead to overfitting.

The transformer [31] was initially used for NLP tasks and has recently been adopted in high-level CV tasks [25], achieving impressive performance. Whereas a CNN can only model local information, a transformer models the entire image and adapts to the input content. Following this tremendous success in high-level CV tasks [9], transformers have been adopted in low-level CV tasks such as dehazing and deraining [5, 28, 45]. Wang et al. proposed a U-shaped transformer [34] by refining the Swin Transformer [23], and a nested U-shaped transformer [35] was proposed by increasing the number of transformer layers. However, these transformers cannot be adopted directly for single image deraining, for several reasons: (a) the transformer lacks the ability to model local features; (b) the transformer processes the input image with a fixed patch size, so pixels at patch edges cannot use the local features of surrounding pixels; and (c) although a hierarchical encoder was incorporated in the U-shaped transformer, it was unable to integrate multi-level features.

We therefore propose a transformer-based deraining network named DeTformer, which explores and exploits long-range contextual information during the complex single image deraining process. We introduce multi-scale features so that the transformer can be utilized fully: they enable the transformer to use variable patch sizes and also mitigate patch boundary defects. Extensive experiments illustrate that our network not only generates clean images, but also improves the performance of subsequent high-level CV tasks.

The contributions can be summarized as follows:

  1. We propose a novel, efficient transformer-based multi-scale structure for deraining single rainy images, enabling the network to model long-range inter-pixel contextual information and remove heavy, long rain streaks.

  2. We incorporate a “gated depth-wise convolution feed-forward network” (GDWCFN) into DeTformer, which exploits local features to generate better rain-free images.

  3. We design a “multi-head depth-wise convolution transposed attention” (MDWCTA) module that effectively integrates the extracted multi-scale features and performs feature interaction along the channel dimension rather than the spatial dimensions.

  4. Experimental results demonstrate that the DeTformer network outperforms SOTA networks on synthetic and real-world rain datasets.

2 Related Work

This section provides a brief review of deraining methods, which fall under either traditional model-driven or deep learning-based data-driven prior-based approaches. We also review previous work on multi-scale approaches and transformers.

2.1 Traditional Model-Driven Prior-based Approaches

Traditional model-driven prior-based approaches solve the image deraining problem using prior knowledge. In [17], rainy images are decomposed into low- and high-frequency components, and dictionary learning is adopted to remove the high-frequency rain components. Li et al. [20] proposed Gaussian mixture models for single image deraining. Discriminative sparse coding [24] learns dictionaries for the rain streak and background layers during deraining. Chen et al. [7] proposed a low-rank model-based traditional method to remove rain streaks, assuming that rain streaks in a local patch have low rank. Zhang et al. [43] proposed a filter-based sparsity and low-rank representation model for rain streak removal. In [32], a single image deraining model was proposed that employs a proximal gradient descent technique and applies a convolutional dictionary learning mechanism for rain representation. Although these prior-based approaches strive for better results, they fail to remove rain streaks completely and are time-consuming.

2.2 Deep Learning-based Data-Driven Approaches

Due to the wide success of deep learning in deraining [34, 42, 44], CNN-based approaches have replaced traditional model-driven prior-based approaches for removing rain streaks. Researchers have designed many CNN-based network structures and proposed various loss functions to improve deraining performance.

Wang et al. [38] proposed a deep learning architecture to remove streaks from heavy rain. Their model contains two components: one representing rain streak accumulation and one representing the various shapes and directions of overlapping rain streaks. In [42], a density-aware deraining network was proposed, which identifies rain streak densities and processes them effectively. A generative adversarial network (GAN) [44] was designed to remove rain streaks and generate derained images directly. Fu et al. [10] introduced a deep CNN referred to as DerainNet to remove rain streaks. Wang et al. [40] adopted an image enhancement technique for deraining and incorporated a GAN to generate high-quality rain patterns. To remove heavy rain streaks effectively, recursive networks [18, 26, 27] were adopted for single image deraining, removing rain streaks progressively and recursively.

A semi-supervised transfer learning technique [35] was adopted for the single image deraining problem; it adds real rainy images without ground truth into the network during training. Recursive operations were introduced on top of a progressive ResNet to exploit deep features across multiple stages, forming the progressive recurrent network (PReNet) [27]. Yasarla et al. [41] proposed an over- and under-complete CNN that pays special attention to learning local structures through the receptive fields of its filters. In [19], a rain-to-rain autoencoder was proposed, with rain embedding introduced in the encoder to improve deraining performance; the authors also proposed a layered LSTM for recursive recurrent deraining, with feature refinement performed at multiple scales by a fine-grained encoder. Fu et al. [13] proposed rain streak removal via a graph CNN to model long-range contextual information. Existing deraining networks embed low-quality features into the network directly, so Chen et al. [4] replaced low-quality features with high-quality ones, adopting a closed-loop feedback control system to obtain latent high-quality features.

CNN-based deraining networks have achieved unprecedented success compared to traditional model-driven prior-based networks. However, all CNN-based deraining networks are constructed by stacking multiple convolutional layers and, owing to their limited receptive fields, model only local information.

2.3 Vision Transformers

The transformer achieved spectacular success when adopted in the NLP field. Recently, transformers [9] have been employed for image classification and achieved better results than SOTA CNNs. The transformer splits images into patch sequences and applies the attention mechanism [31] to learn long-range inter-pixel dependencies between the sequences. Because transformers possess long-range modelling capability and adaptability to the input content, they have been adopted in various high-level CV tasks such as object classification, detection, tracking, segmentation and pose estimation. For image restoration, networks adopting the transformer include Restormer [46], Uformer [34], SwinIR [23], U2-former [16] and TransWeather [30]. However, these networks perform poorly on real rain images affected by high-density rainfall. In addition, transformer-based image deraining networks incur huge computational complexity and a large number of parameters when processing high-resolution images.

2.4 Multi-Scale Pyramidal Architecture

With multi-scale learning, feature extraction can be improved to a certain extent, since different features can be extracted from images at multiple scales. A lightweight pyramidal network [12] was developed using Gaussian–Laplacian image pyramid decomposition, performing image deraining at each pyramid scale. Jiang et al. [15] constructed a pyramidal structure to improve the network's capability to encode rain streaks. Deep CNN-based recurrent neural networks [18] were constructed to remove heavy rain streaks; they adopted dilated convolutions to acquire a large receptive field, since contextual information plays a vital role in deraining, and incorporated a squeeze-and-excitation network, decomposing rain removal into multiple stages assigned different alpha values.

In [26], a combination of multi-scale feature fusion and a progressive structure was introduced to separate heavy rain streaks; a U-Net extracts contextual information from the shallow layers, and an original-resolution network at the last stage generates accurate derained images. A multi-stage architecture [47] was proposed that progressively learns restoration functions for degraded inputs, with a supervised attention module that reweights local features using a per-pixel adaptive design. A “deep feature interactive aggregation network” [3] was proposed to improve long-range pixel dependencies among the captured features and to build channel correlations among features for image deraining.

We therefore add multi-scale information to a transformer, so that our network can exploit the advantages of global connectivity while learning feature-map representations of rain streaks.

3 Proposed Method

This section first describes the efficient transformer architecture, followed by a brief description of the individual components used in our network. To reduce the computational complexity of a single-scale network [23], we make key changes to the multi-scale hierarchical module and the multi-head self-attention (SA) layer. The overall pipeline of the DeTformer architecture is shown in Fig. 1. The core components of the transformer block (TB) are:

  (a) the “multi-head depth-wise convolution transposed attention” (MDWCTA) module, and

  (b) the “gated depth-wise convolution feed-forward network” (GDWCFN).

Finally, the progressive training scheme and loss function details are provided.

Fig. 1 Architecture of DeTformer

First, the degraded rainy image \(X \in \mathbb{R}^{H \times W \times 3}\) is fed to a 3 × 3 convolution layer to obtain low-level features in \(\mathbb{R}^{H \times W \times C}\) (H × W denotes the spatial dimensions and C the number of channels), which are then flattened into tokens. These tokens, i.e. shallow features, pass through a four-stage symmetrical encoder–decoder and are transformed into deep features in \(\mathbb{R}^{H \times W \times 2C}\). Each encoder–decoder stage contains a series of transformer blocks (TB); to maintain efficiency, we gradually increase the number of transformer blocks from the top level to the bottom. The encoder hierarchically reduces the spatial dimensions of the input while expanding the channel capacity. Down-sampling is performed with a 4 × 4 convolution with stride 2, which doubles the number of channels and halves the feature map resolution. The decoder takes the low-resolution latent features in \(\mathbb{R}^{H/8 \times W/8 \times 8C}\) and progressively recovers high-resolution features. Up-sampling is performed with a 2 × 2 transposed convolution with stride 2, which halves the number of channels and doubles the feature map resolution.

We apply pixel-shuffle and pixel-unshuffle operations [28] for feature up-sampling and down-sampling. To ease the recovery process, skip connections concatenate encoder features with decoder features. After concatenation, a 1 × 1 convolution halves the number of channels at all stages except the top level. At stage 1, the low-level image features of the encoder transformer block are aggregated with the high-level features of the decoder transformer block, which helps preserve the textural details and fine structures in the derained output. The deep features are then enriched further in the refinement stage, which operates on high-spatial-resolution features. Finally, the refined feature map is fed to a 3 × 3 convolution layer to generate a residual map \(R \in \mathbb{R}^{H \times W \times 3}\), which is added to the original rainy image X to reconstruct the derained image: \(D = X + R\).
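
The text mentions both strided (transposed) convolutions and pixel-(un)shuffle [28] for resampling; as one concrete reading, the PyTorch sketch below implements the pixel-(un)shuffle variant, with layer shapes chosen so the channel doubling/halving matches the description above. The 3 × 3 kernel size and the module names are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Halve spatial size, double channels (H x W x C -> H/2 x W/2 x 2C)."""
    def __init__(self, channels: int):
        super().__init__()
        # 3x3 conv first halves the channels; PixelUnshuffle(2) then trades
        # each 2x2 spatial block for a 4x channel expansion: net effect is 2C.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1, bias=False),
            nn.PixelUnshuffle(2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

class Upsample(nn.Module):
    """Double spatial size, halve channels (H x W x C -> 2H x 2W x C/2)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * 2, kernel_size=3, padding=1, bias=False),
            nn.PixelShuffle(2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)
```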

3.1 Transformer Block (TB)

Each transformer block (TB) consists of two layer normalization (LN) layers [2], a “multi-head depth-wise convolution transposed attention” (MDWCTA) module and a “gated depth-wise convolution feed-forward network” (GDWCFN) module, as shown in Fig. 2. LN is applied before the MDWCTA and GDWCFN modules, and both modules are wrapped in residual skip connections with element-wise addition. This can be formulated as follows:

$$ \mathrm{Feat}_{1} = \mathrm{MDWCTA}\left(\mathrm{LN}\left(\mathrm{Feat}_{0}\right)\right) + \mathrm{Feat}_{0} $$
(2)
$$ \mathrm{Feat}_{2} = \mathrm{GDWCFN}\left(\mathrm{LN}\left(\mathrm{Feat}_{1}\right)\right) + \mathrm{Feat}_{1} $$
(3)

where LN denotes layer normalization, and Feat0, Feat1 and Feat2 denote the input feature map of the TB and the output feature maps of the MDWCTA and GDWCFN modules, respectively.
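
A minimal PyTorch sketch of Eqs. (2)–(3) follows. The pre-norm residual wiring is from the text; using GroupNorm(1, C) as a channel-wise LayerNorm over (B, C, H, W) maps and injecting the two sub-modules via the constructor are our implementation assumptions.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm residual block implementing Eqs. (2)-(3)."""
    def __init__(self, dim: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        # GroupNorm with one group acts as LayerNorm across channels
        self.norm1 = nn.GroupNorm(1, dim)
        self.norm2 = nn.GroupNorm(1, dim)
        self.attn = attn   # MDWCTA (Sect. 3.2)
        self.ffn = ffn     # GDWCFN (Sect. 3.3)

    def forward(self, x):                    # x: (B, C, H, W)
        x = x + self.attn(self.norm1(x))     # Eq. (2)
        x = x + self.ffn(self.norm2(x))      # Eq. (3)
        return x
```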

Fig. 2 a Architecture of transformer block. b Gated depth-wise convolution feed-forward network. c Multi-head depth-wise convolution transposed attention

The original transformer [9, 11] calculates self-attention globally, which increases the computational complexity of the model. We adopt the “multi-head depth-wise convolution transposed attention” (MDWCTA) module [24] in the TB in order to process high-resolution images while removing heavy rain streaks in single image deraining. Earlier works [21, 36] adopted transformers and showed that they are deficient in processing local contextual information. We therefore replace the feed-forward network (FFN) [9, 23] in the TBs with the proposed “gated depth-wise convolution feed-forward network” (GDWCFN) module, compensating for the transformer's inability to capture local feature information with convolutional layers.

3.2 Multi-head Depth-wise Convolution Transposed Attention (MDWCTA)

The transformer's computational burden mainly comes from the self-attention (SA) layer. In the original transformer [9, 11], the memory and time complexity of the key–query dot-product interaction grows quadratically with the spatial resolution of the input, i.e. \(\mathcal{O}(W^{2}H^{2})\). It is therefore infeasible to apply this form of SA to image deraining, which often involves high-resolution images.

To resolve this issue, we propose the MDWCTA module, which has linear complexity, as shown in Fig. 2c. This module applies SA along the channel dimension instead of the spatial dimensions, i.e. cross-covariance is computed across channels, generating an attention map that encodes the global context implicitly. One key change we made was to introduce a 3 × 3 depth-wise convolution to highlight local contextual information before the feature covariance is computed to produce the global attention map.

To further reduce the computational burden, self-attention is performed with a non-overlapping window-based technique. Given an input feature map \(F \in \mathbb{R}^{H \times W \times C}\), the layer-normalized tensor is partitioned by M × M local windows into \((HW/M^{2}) \times C\) local feature maps, where \(HW/M^{2}\) is the number of windows. The local feature maps are enriched by a 1 × 1 convolution, which aggregates pixel-wise cross-channel context, followed by 3 × 3 depth-wise convolutions, which encode channel-wise spatial context and yield the normalized feature maps. The query (Q), key (K) and value (V) matrices are given by:

$$ Q = FX_{Q} P_{Q},\quad K = FX_{K} P_{K},\quad V = FX_{V} P_{V} $$
(4)

where \(P_{(\cdot)}\) and \(X_{(\cdot)}\) denote 1 × 1 point-wise convolution and 3 × 3 depth-wise convolution, respectively. We do not use bias in the convolutional layers. By reshaping the query and key projections, a transposed attention map \(A \in \mathbb{R}^{C \times C}\) is generated instead of the much larger regular attention map \(A \in \mathbb{R}^{HW \times HW}\). The overall MDWCTA process is formulated as follows:

$$ \mathrm{Feat}_{1} = P \cdot \mathrm{Attention}\left(\hat{Q}, \hat{K}, \hat{V}\right) + \mathrm{Feat}_{0} $$
(5)
$$ \mathrm{Attention}\left(\hat{Q}, \hat{K}, \hat{V}\right) = \hat{V} \cdot \mathrm{Softmax}\left(\hat{Q} \cdot \hat{K}/\alpha\right) $$
(6)

where Feat0 and Feat1 are the input and output feature maps, and α is a learnable scaling parameter that controls the magnitude of \(\hat{Q} \cdot \hat{K}\) before the softmax is applied. As in conventional multi-head SA [9], the channels are divided into heads, and separate attention maps are learned in parallel.
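
The following PyTorch sketch shows the channel-wise (“transposed”) attention of Eqs. (4)–(6): point-wise then depth-wise convolutions produce Q, K and V, and the C × C cross-covariance attention keeps the cost linear in H·W. Window partitioning is omitted for brevity; the L2 normalization of Q and K and the per-head learnable temperature α are assumptions modelled on Restormer-style implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDWCTA(nn.Module):
    """Multi-head depth-wise convolution transposed attention (Eqs. 4-6)."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.alpha = nn.Parameter(torch.ones(num_heads, 1, 1))  # temperature in Eq. (6)
        self.qkv_pw = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)   # P: 1x1 point-wise
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, kernel_size=3, padding=1,
                                groups=dim * 3, bias=False)                # X: 3x3 depth-wise
        self.project_out = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv_pw(x)).chunk(3, dim=1)   # Eq. (4)
        # split channels into heads, flatten the spatial dimension
        q = q.reshape(b, self.num_heads, c // self.num_heads, h * w)
        k = k.reshape(b, self.num_heads, c // self.num_heads, h * w)
        v = v.reshape(b, self.num_heads, c // self.num_heads, h * w)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        # C x C transposed attention map instead of HW x HW (Eq. 6)
        attn = torch.softmax((q @ k.transpose(-2, -1)) / self.alpha, dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out)                 # projection P in Eq. (5)
```

The residual addition of Feat0 in Eq. (5) is handled by the enclosing transformer block sketched in Sect. 3.1.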

3.3 Gated Depth-Wise Convolution Feed-forward Network (GDWCFN)

Figure 2b shows the GDWCFN architecture. A regular FFN [9] operates on each pixel separately and identically while transforming the image features: two 1 × 1 convolutions are used, one to expand the feature channels and the other to reduce them back to the original dimension, with a nonlinearity applied in the hidden layer. To improve representation learning, we make two modifications to the regular FFN: we incorporate a gating mechanism and adopt depth-wise convolution.

The gating mechanism is formulated as the element-wise product of two parallel linear-transformation paths, one of which is activated with the nonlinear GELU [14]. As in MDWCTA, we adopt 3 × 3 depth-wise convolutions in GDWCFN to encode information from spatially neighbouring pixel positions, which is useful for learning local image structure.

For an input tensor \(Y \in \mathbb{R}^{H \times W \times C}\), GDWCFN is formulated as follows:

$$ \hat{Y} = P \cdot \mathrm{Gating}\left(Y\right) + Y $$
(7)
$$ \mathrm{Gating}\left(Y\right) = \mu\left(W_{d}^{1} W_{p}^{1}\left(\mathrm{LN}\left(Y\right)\right)\right) \odot W_{d}^{2} W_{p}^{2}\left(\mathrm{LN}\left(Y\right)\right) $$
(8)

where µ represents the nonlinear GELU function, \(\odot\) denotes element-wise multiplication and LN is layer normalization. GDWCFN controls the flow of information through the hierarchical levels, allowing each stage to focus on fine details complementary to the other stages. This module therefore plays an even more vital role than the MDWCTA module, since its focus is enriching features with contextual information.
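
A PyTorch sketch of Eq. (8) follows: two parallel point-wise + depth-wise branches, one passed through GELU and multiplied element-wise with the other. The channel expansion factor is an assumption (the paper does not state it), and the LN and residual of Eq. (7) are handled by the enclosing transformer block sketched in Sect. 3.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDWCFN(nn.Module):
    """Gated depth-wise convolution feed-forward network (Eq. 8)."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion        # expansion factor is an assumption
        # both branches computed at once, then split: W_p^1/W_p^2 and W_d^1/W_d^2
        self.project_in = nn.Conv2d(dim, hidden * 2, kernel_size=1, bias=False)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3, padding=1,
                                groups=hidden * 2, bias=False)
        self.project_out = nn.Conv2d(hidden, dim, kernel_size=1, bias=False)

    def forward(self, x):               # x: (B, C, H, W), already layer-normed
        x1, x2 = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        return self.project_out(F.gelu(x1) * x2)   # GELU(branch 1) gates branch 2
```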

3.4 Progressive Learning

Many existing CNN-based deraining networks train on fixed-size image patches. However, a transformer model [29] trained only on small cropped patches cannot achieve optimal performance on full-resolution images during restoration. We therefore implement a progressive learning strategy: DeTformer is trained on small image patches in the early epochs, and the patch size is gradually increased in later epochs. Because of this mixed-size patch training strategy, the network achieves better results at test time even for high-resolution images, and it preserves fine image structures and texture while removing rain streaks, as training follows a curriculum learning fashion. Since training on large patches takes longer, we reduce the batch size as the patch size increases, keeping the overall training time similar to that of fixed-patch training.
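
As an illustration only, a patch/batch schedule for this curriculum might look like the sketch below; the patch sizes match those reported in Sect. 4, but the iteration boundaries and the later batch sizes are hypothetical.

```python
# (start_iteration, patch_size, batch_size): batch shrinks as patches grow,
# keeping per-iteration cost roughly constant. Boundaries are illustrative.
SCHEDULE = [
    (0,      128, 8),
    (40_000, 160, 4),
    (80_000, 192, 2),
]

def patch_and_batch(iteration: int) -> tuple[int, int]:
    """Return the (patch_size, batch_size) active at a training iteration."""
    patch, batch = SCHEDULE[0][1], SCHEDULE[0][2]
    for start, p, b in SCHEDULE:
        if iteration >= start:
            patch, batch = p, b
    return patch, batch
```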

3.5 Loss Function

Widely adopted loss functions for training deep deraining networks include mean absolute error (L1) loss, mean squared error (L2) loss, negative SSIM loss, Charbonnier loss, attention loss, edge loss, adversarial loss and perceptual loss. We adopt the Charbonnier loss, as it makes the model converge faster and tolerates small errors. The total loss function is expressed as:

$$ L = \sum_{S = 1}^{4} \left[ L_{char}\left(X_{S}, Y\right) + \lambda L_{edge}\left(X_{S}, Y\right) \right] $$
(9)

where \(X_{S}\) is the derained image at scale \(S\), \(Y\) represents the ground truth and \(L_{char}\) denotes the Charbonnier loss:

$$ L_{char} = \sqrt{\left\| X_{S} - Y \right\|^{2} + \varepsilon^{2}} $$
(10)

In addition, the edge loss \(L_{edge}\) is defined as:

$$ L_{edge} = \sqrt{\left\| \Delta\left(X_{S}\right) - \Delta\left(Y\right) \right\|^{2} + \varepsilon^{2}} $$
(11)

where \(\Delta\) denotes the Laplacian edge operator. The hyperparameter \(\lambda\), which controls the relative importance of the two loss terms in Eq. (9), was set to 0.05, and the constant \(\varepsilon\) was set to \(10^{-3}\).
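
A PyTorch sketch of Eqs. (9)–(11) follows, assuming a fixed 3 × 3 Laplacian kernel as the edge operator \(\Delta\) and a per-pixel-averaged Charbonnier form; these are common implementation choices, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharbonnierEdgeLoss(nn.Module):
    """Charbonnier loss plus Laplacian edge loss, Eqs. (9)-(11)."""
    def __init__(self, lam: float = 0.05, eps: float = 1e-3):
        super().__init__()
        self.lam, self.eps = lam, eps
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        # one Laplacian filter per RGB channel (applied depth-wise)
        self.register_buffer("kernel", lap.view(1, 1, 3, 3).repeat(3, 1, 1, 1))

    def charbonnier(self, x, y):
        return torch.sqrt((x - y) ** 2 + self.eps ** 2).mean()

    def edges(self, x):
        return F.conv2d(x, self.kernel, padding=1, groups=3)

    def forward(self, derained, gt):          # one scale; sum over scales outside
        l_char = self.charbonnier(derained, gt)                         # Eq. (10)
        l_edge = self.charbonnier(self.edges(derained), self.edges(gt)) # Eq. (11)
        return l_char + self.lam * l_edge     # single-scale term of Eq. (9)

# total loss over the four scales of Eq. (9):
#   loss = sum(criterion(x_s, gt) for x_s in multi_scale_outputs)
```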

4 Experimental Results and Discussion

Here we provide details of our experimental setup, datasets and performance metrics, and we evaluate the performance and effectiveness of the DeTformer network on benchmark synthetic and real rain datasets.

(a) Experimental setup

Our proposed network was implemented in the PyTorch 1.7 deep learning framework. The AdamW optimizer was applied, and the network was trained for \(10^{5}\) iterations with a fixed learning rate of 3 × 10−4. The batch size was set to 8, and the progressive patch sizes were set to 128 × 128, 160 × 160 and 192 × 192, respectively. To make the network more robust, augmentation techniques such as horizontal and vertical flips were applied during training. In all TBs, the window size was fixed to 8. All experiments were carried out on Google Colab Pro+ with a Tesla V100 GPU. We employed a four-level encoder–decoder hierarchy; the number of TBs was (4, 6, 6, 8), the number of channels was (32, 48, 64, 192), the number of attention heads in MDWCTA was (1, 2, 4, 8), and the refinement module (TRM) used 4 blocks.
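
For reference, these hyperparameters can be collected in code as follows; `model` here is a placeholder module standing in for DeTformer, not the actual implementation.

```python
import torch
import torch.nn as nn

# Hyperparameters transcribed from Sect. 4(a); the model is a stand-in.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)           # placeholder for DeTformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # fixed learning rate
NUM_ITERATIONS = 100_000        # 10^5 training iterations
BATCH_SIZE = 8
PATCH_SIZES = (128, 160, 192)   # progressive patch schedule
WINDOW_SIZE = 8                 # window size in all TBs
NUM_TBS = (4, 6, 6, 8)          # transformer blocks per level
NUM_HEADS = (1, 2, 4, 8)        # MDWCTA heads per level
CHANNELS = (32, 48, 64, 192)    # channels per level
```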

(b) Datasets

The effectiveness of our network was evaluated on the synthetic paired rain datasets Rain100L [38], Rain100H [38], Rain800 [44], Rain1200 [42], Rain12 [20] and Rain14000 [11], whose test sets are referred to as Rain100L, Rain100H, Test100, Test1200 and Test2800, and on a real rain dataset [12]. Table 1 gives a brief summary of the datasets used in this work.

Table 1 Summary of used datasets

(c) Evaluation Metrics

To show the effectiveness and performance of the DeTformer network, we evaluated the derained image quality using two metrics: “peak signal-to-noise ratio” (PSNR) and the “structural similarity index measure” (SSIM), computed on the derained images. For both metrics, larger values indicate a better deraining effect.
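
As a hypothetical illustration, both metrics can be computed with scikit-image as below; whether the paper evaluates on RGB or on the luminance channel is not stated, so this sketch uses RGB.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(derained: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """PSNR/SSIM on uint8 HxWx3 images; higher is better for both."""
    psnr = peak_signal_noise_ratio(gt, derained, data_range=255)
    ssim = structural_similarity(gt, derained, channel_axis=-1, data_range=255)
    return psnr, ssim
```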

4.1 Comparison with the State-of-the-Art Networks

We comprehensively compared the performance of the DeTformer network with several state-of-the-art (SOTA) deraining networks: JORDER [38], DID-MDN [42], RESCAN [18], SSTL [29], PReNet [27], DerainNet [10], UMRL [40], MSPNet [15], SAPNet [45], SEMI [35], OUCD [41], ECNet [19], PMSDNet [26], RCDNet [32], DualGCN [13], MPRNet [46], RLNet [4] and DFIANet [3].

The quantitative results of the DeTformer network on synthetic rain datasets are shown in Table 2. It is clear from the table that our network achieves superior performance over the state-of-the-art (SOTA) networks on all synthetic datasets. In particular, on the Rain100L and Rain100H datasets, DeTformer obtains 38.99 and 31.45 dB PSNR, which is +3.79 and +1.97 dB higher than DFIANet [3], clearly showing that our network removes heavy and complex rain streaks more effectively. Table 2 also shows that DeTformer achieves the highest PSNR and SSIM values on the Rain100L, Rain100H, Test100, Test1200 and Test2800 synthetic datasets. This is because our network combines the benefits of the transformer with better modelling of long-range contextual information.

Table 2 Quantitative comparison of the proposed network with SOTA networks on synthetic datasets

The qualitative results of the DeTformer network on synthetic rain datasets are shown in Figs. 3, 4 and 5, respectively. Although PReNet, ECNet and DFIANet remove heavy rain streaks, visible artefacts and blurred details can nevertheless be observed in their derained outputs, as shown in Fig. 3.

Fig. 3 Qualitative comparison of the proposed network with SOTA networks on synthetic datasets; DeTformer generates the best visual results

Fig. 4 Visual qualitative results of the proposed network on the Rain100L synthetic dataset

Fig. 5 Visual qualitative results of the proposed network on the Rain100H synthetic dataset

As observed in the derained images in Fig. 3, this occurs in the clouds, sky and roof for the JORDER [38], RESCAN [18], SEMI [35] and DFIANet [3] networks. When the background colour is similar to the rain streaks, some networks derain excessively and remove fine details of similar colour, as in the second row of Fig. 3. When the test images contain denser objects, it is difficult to remove rain streaks completely while simultaneously recovering fine details, as is clear from the telephone booth and black fence in the third and fourth rows for SEMI [35], PReNet [27], ECNet [19], JORDER [38] and DFIANet [3]. The OUCD [41] network combines global information but pays attention only to local features, and it fails to remove heavy rain streaks completely. Compared to all these SOTA networks, our network avoids these problems and restores derained images that are highly similar to the ground truth.

Figures 4 and 5 show that our network achieves impressive recovery while removing diverse light and heavy rain from rainy images. The restored images have clear details and appropriate contrast, similar to the ground truth images. Additional sample deraining results on the Rain100H synthetic dataset, along with their mean squared error (MSE), PSNR and SSIM, are shown in Fig. 6.

Fig. 6 Visual qualitative deraining results for sample images from Rain100H

To show the robustness and efficiency of the DeTformer network, we also compared it with SOTA networks on the real rain dataset [12]. Figure 7 shows the derained results of the proposed network alongside PReNet [27], MPRNet [47], ECNet [19], SAPNet [45] and DFIANet [3]. Many of these networks produce artefacts during deraining, and their outputs are not as clear as the images restored by our network. Our network removes unevenly distributed rain streaks, achieves impressive performance on heavy rain streaks, and outputs clear, detailed results. Despite the complexity of natural rain scenes, our network generates excellent results under realistic conditions.

Fig. 7 Visual qualitative derained results of our network on the real-world rain dataset, compared with SOTA networks

We also report the number of parameters and the number of floating-point operations (FLOPS) on the Rain100H dataset, compared with the SOTA networks in Table 3. The number of parameters in our network is reduced because general convolution is replaced by the transformer. On a 256 × 256 test image, our network runs in just 270 ms and generates a noise-free image. Figure 8 compares the number of model parameters, FLOPS and runtime (ms) at 256 × 256 image resolution for various SOTA networks.

Table 3 Comparison of FLOPS, model parameters and runtime of SOTA networks
Fig. 8 Comparison of SOTA networks in terms of FLOPS, parameters and runtime

4.2 Ablation Studies

A series of ablation studies was conducted to show the impact of various factors on the DeTformer network and to evaluate its deraining ability. All ablation studies use Rain100H for both training and testing.

4.2.1 Effect of Basic Composition

Table 4 shows the ablation study results for the importance of each component separately. As seen from the table, when the GDWCFN module is replaced with a regular FFN, PSNR drops by 0.88 dB. This proves the effectiveness of GDWCFN in enhancing and preserving local feature information, alleviating the original transformer's weakness in extracting local features. If the MDWCTA module is removed, PSNR drops by 1.31 dB, showing that multi-scale feature fusion improves network performance. When all the up-sampling and down-sampling layers are removed, PSNR drops drastically by 1.52 dB, which shows the effectiveness of the designed U-shaped transformer structure. We also report the number of parameters and the FLOPS for each configuration of the proposed network.

Table 4 Effects of basic composition in the proposed network

We also performed experiments on the number of scales to employ in the encoder–decoder structure for effective rain streak removal during the deraining process.

4.2.2 Effect of Number of Scales

Table 5 shows the impact of the number of scales employed in the proposed network, demonstrating the effectiveness of the multi-scale structure. When S = 1, PSNR drops by 0.24 dB, since multi-resolution features help DeTformer remove heavy and complex rain streaks more effectively. With S = 4, we achieve the highest PSNR and SSIM values.

Table 5 Impact of number of scales in the proposed network

4.2.3 Effect of λ Hyperparameter

In Eq. (9), the total weighted loss depends on the hyperparameter λ. To obtain the best network performance, we performed an ablation study to fix λ. Table 6 shows the influence of λ on PSNR and SSIM. Based on these observations, we set λ = 0.05 in the weighted loss function, as it achieves the highest PSNR and SSIM.

Table 6 Impact of λ parameter on total loss function

4.2.4 Effect of Number of Transformer Blocks in Encoder–Decoder Network

We performed an ablation study to decide the number of transformer blocks (N) to employ in the encoder–decoder network. Table 7 shows the impact of N on the deraining performance and computational burden of the proposed network. To balance deraining performance and efficiency, we adopt N = 2 in our network.

Table 7 Impact of N in the proposed network

4.2.5 Effect of Different Loss Functions in Our Network

An ablation study was conducted to show the effectiveness of the Charbonnier loss compared with the popular L1 and L2 loss functions. Table 8 confirms its effectiveness, so we adopt this loss function for reconstructing the derained image.

Table 8 Effect of loss function for improving deraining performance

4.2.6 Impact of Progressive Learning

Table 9 shows the ablation study on the impact of progressive learning in our network. We achieved better results with progressive learning than with fixed-patch learning while maintaining a similar training time.

Table 9 Impact of progressive learning on the proposed network

4.3 Limitation

Although our DeTformer deraining network achieves superior performance over SOTA networks, it has certain limitations. When fed a raindrop image at test time, it behaves inconsistently and is unable to remove the raindrops, as shown in Fig. 9. This is because DeTformer was not trained on raindrop images.

Fig. 9 DeTformer network failure scenario

5 Conclusion

We presented a transformer-based deraining network referred to as DeTformer. To process complex, realistic rain images and restore fine details, we proposed an efficient network and compared it with SOTA deraining networks. DeTformer's superior performance stems from a series of improvements to the transformer structure adopted for single image deraining. We designed the “gated depth-wise convolution feed-forward network” (GDWCFN), whose depth-wise convolutions improve the modelling of local features and suppress less informative features. We incorporated multi-resolution features into the transformer, allowing the network to use patches of variable scales and thus enabling edge pixels to utilize local features. Furthermore, we designed the “multi-head depth-wise convolution transposed attention” (MDWCTA) module, which effectively integrates the extracted multi-scale features and performs feature interaction across the channel dimension. Extensive experiments demonstrate that DeTformer achieves superior performance on both synthetic paired and real rain datasets.