
1 Introduction

Rain streaks with various shapes and sizes occlude images captured under different rainy conditions, seriously damaging texture details and causing loss of image content, which hampers further applications [3, 13, 19]. It is therefore highly necessary to design an efficient single image deraining algorithm that removes diverse rain streaks while preserving image details, especially in complicated outdoor scenes. In the past few years, deraining research has drawn considerable attention, mainly revolving around rain removal in video and in single images [10, 23, 25]. Compared with video rain removal, which exploits the temporal correlations between successive frames to restore a clean background video, single image deraining [6, 9, 12] is more challenging due to the lack of temporal information.

The widespread success of deep learning in other visual tasks [13, 19, 20] has promoted the application of convolutional neural networks (CNNs) to single image deraining. In [5], Fu et al. observe that it is difficult to separate the background from a rainy image by directly applying a convolutional network, so they use a CNN on the high-frequency feature map rather than on the original rainy image. Besides, joint detection [26], density estimation [29], and residual learning [11] have also been introduced for rain streak detection and removal. Zhang et al. [29] propose a two-stage algorithm that first predicts the rain streak distribution and then removes the streaks from the background. Wang et al. [21] utilize an attention network to guide the deraining process and generate a clear background. Yang et al. [27] focus on the hierarchy of local features, which strongly influences the deraining effect. Although these methods have achieved considerable performance improvements, existing deep learning-based deraining algorithms are still limited in restoring the details of derained photos. From the perspective of human visual perception, the restoration results of some methods are not satisfactory: they either fail to remove rain streaks completely or over-derain, distorting the original image content. For example, some methods tend to blur the background or remove image details along with the rain streaks, because background textures and rain streaks overlap to different degrees. Besides, most deep learning-based methods are trained on synthetic datasets, which limits their generalization to real-life rainy images.

To cope with the limitations of prior frameworks, we propose a novel Multi-scale Gated Feature Enhancement Network (MGFE-Net) built on a typical encoder-decoder framework. More specifically, the receptive field block (RFB) [13] is embedded into the encoder and decoder to handle diverse rain streak removal and clean background restoration. Furthermore, we design a multi-scale gated module (MGM) to control the propagation of multi-scale features, which not only selectively combines multi-scale features acquired from different layers of the encoder and decoder, but also keeps high-level semantics consistent with low-level detail features. Finally, several coarse deraining results are obtained by subtracting the feature maps generated by the decoder from the original rainy image, and the final refined restored image is obtained by fusing these coarse results. The proposed MGFE-Net removes diverse rain streaks while preserving background content details. The comparison results validate that MGFE-Net achieves the best performance among recently designed deraining methods.

In summary, our paper makes the following three contributions:

  1.

    We propose a novel network, the Multi-scale Gated Feature Enhancement Network (MGFE-Net), based on a typical encoder-decoder framework, to deal with rain streaks falling in different directions with various shapes and sizes, while ensuring that background content details are well preserved.

  2.

    In MGFE-Net, we introduce the receptive field block (RFB) into the encoder and decoder to enhance multi-scale feature extraction. Besides, we design the multi-scale gated module (MGM) to selectively combine multi-scale features and keep high-level semantics consistent with low-level detail features for satisfactory rain-free image restoration.

  3.

    Comparison results on several synthetic benchmark datasets and real-world photos indicate that MGFE-Net delivers excellent deraining performance and generalizes well to real-life photos, significantly improving the deraining effect and the human visual perception quality of restored images.

2 The Proposed Method

2.1 Network Architecture

In this paper, the Multi-scale Gated Feature Enhancement Network (MGFE-Net), based on a typical encoder-decoder framework [26], is designed for the single image deraining task. Figure 1 presents the overall architecture. First, we embed the receptive field block (RFB) into different layers to enlarge the receptive field of the filtering kernels in the encoder and to enhance the deep features extracted by the decoder. Then, unlike the normal skip connections in the U-Net framework, we design a gated module to selectively concatenate shallow and deep features, which helps keep shallow detail content consistent with deep semantics. Finally, to generate a refined rain-free image, a fusion strategy integrates the coarse deraining results obtained from the different decoder layers.

Fig. 1.

Illustration of the MGFE-Net. The designed receptive field block (RFB) and multi-scale gated module (MGM) are embedded in the encoder and decoder. The final deraining result is obtained by fusing several coarse outputs of the decoder.

2.2 Enhanced Feature Extraction with Receptive Field Block

The shapes, sizes, and falling directions of real-life rain streaks vary randomly, which makes single image deraining a challenging problem. The performance of typical single image deraining methods is often restricted by the limited receptive field of simple cascaded convolution filters. To handle this issue, we integrate the RFB to strengthen the model's ability to extract sufficient information by leveraging multi-scale features between adjacent convolution layers. As illustrated in Fig. 1, an RFB is embedded after each layer of the encoder and decoder. More specifically, the RFB contains multiple forward paths with different kernel sizes, as shown in Fig. 2(a). For an input feature map \(F_{I} \in \mathbb {R}^{H \times W \times C}\) from the previous layer of the encoder or decoder, the RFB applies filtering kernels of different sizes followed by different dilation rates [3] to effectively extract rain streak features in complex scenes. The feature maps from the multiple forward paths are finally concatenated to produce the output feature map \(F_{O} \in \mathbb {R}^{H \times W \times C}\).
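To make the multi-branch structure concrete, below is a minimal PyTorch sketch of an RFB-style block. The branch count, kernel sizes, dilation rates, and per-branch width are illustrative assumptions, not the exact configuration of [13] or of our network.

```python
import torch
import torch.nn as nn

class RFB(nn.Module):
    """Sketch of a receptive-field-block-style module: parallel branches with
    different kernel sizes and dilation rates, concatenated and projected back
    to the input channel count. Branch settings are illustrative assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 4  # per-branch width (assumption)
        self.branches = nn.ModuleList([
            self._branch(channels, mid, k=1, dilation=1),
            self._branch(channels, mid, k=3, dilation=3),
            self._branch(channels, mid, k=5, dilation=5),
        ])
        # 1x1 conv fuses the concatenated branch outputs back to `channels`
        self.fuse = nn.Conv2d(3 * mid, channels, kernel_size=1)

    @staticmethod
    def _branch(c_in, c_out, k, dilation):
        # A k x k conv sets the branch scale; the dilated 3x3 conv enlarges
        # the effective receptive field without extra parameters.
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        # Concatenate multi-scale branch features along the channel axis,
        # fuse, and keep a residual path so the output stays H x W x C.
        out = torch.cat([b(f_in) for b in self.branches], dim=1)
        return self.fuse(out) + f_in
```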

Fig. 2.

The schematic illustration of the designed modules in our MGFE-Net. (a) The integrated receptive field block (RFB) for enhanced feature extraction. (b) Our proposed multi-scale gated module (MGM), which keeps high-level semantics consistent with low-level details.

2.3 Multi-scale Gated Module

Besides the inability to completely remove rain streaks, another common shortcoming of most rain removal methods is over-deraining, which damages the original image content and seriously affects the visual perception quality of restored images. We therefore design a gated module to control the propagation of multi-scale features, which not only selectively combines multi-scale features acquired from different layers of the encoder and decoder but also keeps high-level image semantics consistent with low-level texture details. By adding the gated module between the layers of the encoder and decoder, the model achieves a good deraining effect while keeping background content details well preserved.

As described in Fig. 2(b), \(F_{i}\) and \(G_{i+1}\) denote the corresponding shallow and deep features of the encoder and decoder, respectively. We first employ an upsampling layer \(UP_{2 \times }\) and a \(1 \times 1\) convolution layer to make the spatial size of \(G_{i+1}\) match that of \(F_{i}\). The resulting feature maps are then stacked with \(F_{i}\) by a concatenation operator Conc along the channel dimension. The concatenated feature maps can be denoted as:

$$\begin{aligned} U_{i}={\text {Conc}}\left( F_{i}, {\text {Conv}}_{1 \times 1}\left( UP_{2 \times }\left( G_{i+1}\right) \right) \right) , \quad i=H-1, \ldots , 1 \end{aligned}$$
(1)

where \(U_{i} \in \mathbb {R}^{H \times W \times C}\) denotes the concatenation result of \(F_{i}\) and \(G_{i+1}\), and H denotes the total number of convolution layers.

As shown in Fig. 2(b), the right branch consists of an average pooling layer and two fully connected layers, and a weight map for the gated feature is generated by applying a sigmoid function after the fully connected layers. In the left branch, a \(1 \times 1\) convolution layer and a ReLU activation function [1] are adopted to change the channel number of \(U_{i}\). The gated feature is then generated by multiplying the outputs of the two branches, and it contains consistent low-level detail information and abstract semantic features. The whole process can be denoted as follows:

$$\begin{aligned} F_{g, i}=\left( f_{Right}\left( U_{i}\right) \otimes f_{Left}\left( U_{i}\right) \right) \oplus U_{i}, i=H-1, \ldots , 1 \end{aligned}$$
(2)

where \(F_{g, i}\) denotes the gated feature in the \(i^{th}\) layer, and \(\otimes \) and \(\oplus \) are the element-wise product and sum operations, respectively. Before being sent into the deeper layers of the decoder, the gated feature \(F_{g, i}\) is refined by a dense block:

$$\begin{aligned} G_{i, 1}=f_{\text{ Dense }}\left( F_{g, i}\right) , i=H-1, \ldots , 1 \end{aligned}$$
(3)

where \(G_{i, 1}\) denotes the final output feature of the \(i^{th}\) decoder layer and \(f_{Dense}\) denotes a dense block (DB) [7], which consists of three consecutive convolution layers with dense connections. The predicted derained image \(Y_{i}\) at the \(i^{th}\) layer is obtained by subtracting the decoder output feature map from the original rainy image I,

$$\begin{aligned} Y_{i}=I-G_{i, 1}, i=H, \ldots , 1 \end{aligned}$$
(4)

In the end, we fuse the coarse deraining results (i.e., \(Y_{H}, \ldots , Y_{1}\)) to obtain the final refined deraining image \(\hat{Y}\), which can be denoted as follows:

$$\begin{aligned} \hat{Y}={\text {Conv}}_{1 \times 1}\left( {\text {Conc}}\left( Y_{H}, \ldots , Y_{1}\right) \right) \end{aligned}$$
(5)
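For reference, the following is a minimal PyTorch sketch of the gated fusion in Eqs. (1)-(3). The channel widths, the dense-block design, and the \(1 \times 1\) projection used to make the residual addition \(\oplus U_{i}\) shape-compatible are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    """Sketch of f_Dense: three 3x3 convolutions with dense connections."""
    def __init__(self, c):
        super().__init__()
        self.c1 = nn.Conv2d(c, c, 3, padding=1)
        self.c2 = nn.Conv2d(2 * c, c, 3, padding=1)
        self.c3 = nn.Conv2d(3 * c, c, 3, padding=1)

    def forward(self, x):
        d1 = F.relu(self.c1(x))
        d2 = F.relu(self.c2(torch.cat([x, d1], dim=1)))
        return self.c3(torch.cat([x, d1, d2], dim=1))

class MGM(nn.Module):
    """Sketch of the multi-scale gated module (Eqs. 1-3). Assumes the shallow
    feature F_i and the deep feature G_{i+1} both carry c channels."""
    def __init__(self, c):
        super().__init__()
        self.reduce = nn.Conv2d(c, c, 1)  # 1x1 conv applied after UP_{2x}
        # Left branch: 1x1 conv + ReLU adjusts the channel number of U_i
        self.left = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.ReLU(inplace=True))
        # Right branch: global average pooling + two FC layers + sigmoid
        # produce a per-channel gating weight
        self.right = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * c, c), nn.ReLU(inplace=True),
            nn.Linear(c, c), nn.Sigmoid())
        # 1x1 projection so the residual addition with U_i matches shapes
        # (an assumption; the paper writes the skip simply as "+ U_i")
        self.skip = nn.Conv2d(2 * c, c, 1)
        self.dense = DenseBlock(c)

    def forward(self, f_i, g_next):
        # Eq. (1): upsample G_{i+1}, apply a 1x1 conv, concatenate with F_i
        g_up = self.reduce(F.interpolate(g_next, scale_factor=2,
                                         mode='bilinear', align_corners=False))
        u = torch.cat([f_i, g_up], dim=1)
        # Eq. (2): gate the left-branch feature with the right-branch weights
        w = self.right(u)[:, :, None, None]  # B x c x 1 x 1
        gated = w * self.left(u) + self.skip(u)
        # Eq. (3): refine the gated feature with a dense block
        return self.dense(gated)
```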

2.4 Loss Function

To guarantee a satisfactory deraining effect and good visual perception of the restored image, the proposed MGFE-Net is optimized with a combination of content loss, SSIM loss, and gradient loss. Specifically, the content loss measures the differences between restored images and the corresponding rain-free images with an \(L_{1}\) norm, formulated as follows:

$$\begin{aligned} L_{1}=\sum _{i=1}^{H}\left\| Y_{i}-Y\right\| _{1}+\Vert \hat{Y}-Y\Vert _{1} \end{aligned}$$
(6)

where H represents the number of coarse deraining outputs of the decoder, Y is the ground-truth image, and \(Y_{i}\) and \(\hat{Y}\) denote the restored image obtained from the \(i^{th}\) decoder layer and the final predicted deraining image, respectively.

Besides, the SSIM loss is utilized to evaluate the structural similarity between restored images and rain-free images, which helps preserve content textures and is formulated as follows:

$$\begin{aligned} L_{ssim}=\sum _{i=1}^{H}\left( 1- {\text {SSIM}}\left( Y_{i}, Y\right) \right) +(1-{\text {SSIM}}(\hat{Y}, Y)) \end{aligned}$$
(7)

Furthermore, inspired by the effectiveness of the Sobel operator for edge prediction in image reconstruction [2, 24], we compare the derained images with their rain-free counterparts in the gradient domain to keep the gradient distributions consistent. The gradient loss is thus defined as:

$$\begin{aligned} L_{grad}=\left\| \nabla _{x}(\hat{Y})-\nabla _{x}(Y)\right\| _{1}+\left\| \nabla _{y}(\hat{Y})-\nabla _{y}(Y)\right\| _{1} \end{aligned}$$
(8)

Finally, the total loss function for MGFE-Net is defined as follows:

$$\begin{aligned} L_{\text{ total }}=L_{1}+\lambda _{g} L_{grad}+\lambda _{s} L_{ssim} \end{aligned}$$
(9)

where \(\lambda _{g}\) and \(\lambda _{s}\) are coefficients that balance the different loss terms.
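A compact PyTorch sketch of Eqs. (6)-(9) is given below. The `ssim` callable is assumed to be supplied externally (e.g., an off-the-shelf differentiable SSIM implementation), the Sobel kernels stand in for the gradient operators \(\nabla _{x}\) and \(\nabla _{y}\), and the loss weights are placeholder values rather than our tuned settings.

```python
import torch
import torch.nn.functional as F

# Sobel kernels stand in for the gradient operators in Eq. (8)
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def image_grads(img):
    """Channel-wise Sobel gradients of a B x C x H x W image."""
    c = img.shape[1]
    gx = F.conv2d(img, SOBEL_X.expand(c, 1, 3, 3).to(img), padding=1, groups=c)
    gy = F.conv2d(img, SOBEL_Y.expand(c, 1, 3, 3).to(img), padding=1, groups=c)
    return gx, gy

def total_loss(coarse, y_hat, y, ssim, lam_g=0.1, lam_s=1.0):
    """Eqs. (6)-(9): `coarse` is the list of decoder outputs [Y_H, ..., Y_1],
    `y_hat` the fused prediction, `y` the ground truth. `ssim` is an external
    differentiable SSIM routine; lam_g and lam_s are placeholder weights."""
    l_content = sum(F.l1_loss(yi, y) for yi in coarse) + F.l1_loss(y_hat, y)
    l_ssim = sum(1 - ssim(yi, y) for yi in coarse) + (1 - ssim(y_hat, y))
    gx_p, gy_p = image_grads(y_hat)
    gx_t, gy_t = image_grads(y)
    l_grad = F.l1_loss(gx_p, gx_t) + F.l1_loss(gy_p, gy_t)
    return l_content + lam_g * l_grad + lam_s * l_ssim
```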

3 Experiment

3.1 Experiment Setup

Implementation Details. MGFE-Net is implemented in the PyTorch [17] framework. The training images are cropped into patches of size \(256 \times 256\), and each patch is horizontally flipped with a probability of 0.5. The Adam optimizer is used with a batch size of 10; the learning rate is \(2 \times 10^{-4}\) initially and is decreased to \(1 \times 10^{-5}\) after 50,000 training iterations. During testing, the input rainy images keep their original sizes without any data augmentation.
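The optimizer and learning-rate schedule described above can be set up as follows. This is a sketch under stated assumptions: the `MGFENet` constructor name is assumed, and in practice the random crop and flip must be applied identically to each rainy/clean pair.

```python
import torch
import torchvision.transforms as T

# 256x256 patches with a 0.5-probability horizontal flip (training only);
# paired rainy/clean images must share the same crop and flip decision.
augment = T.Compose([T.RandomCrop(256), T.RandomHorizontalFlip(p=0.5)])

model = MGFENet()  # constructor name assumed; the network of Sect. 2
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Single decay step after 50,000 iterations: 2e-4 * 0.05 = 1e-5;
# scheduler.step() is called once per training iteration, not per epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50_000], gamma=0.05)
```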

Datasets. We compare MGFE-Net with other recent deraining algorithms on three synthetic benchmark datasets and a real-world rainy image set. Specifically, Rain1200 [29] contains 24,000 pairs of rainy/rain-free images in total, of which 12,000 pairs are used for training and 12,000 for testing; the pairs in Rain1200 are synthesized at three levels of rain density. Rain1400 [5] collects 1,000 clean images, each of which is transformed into 14 different rainy images, yielding 12,600/1,400 sample pairs for the training/testing sets. Rain1000 [21], covering a wide range of realistic scenes, is the largest single image deraining dataset and includes 28,500/1,000 pairs for the training/testing sets, respectively. In addition, we collect 146 realistic rainy photos from [21, 26], in which the rain streaks vary in content, intensity, and orientation.

Evaluation Metrics. We adopt two standard measures, PSNR [8] and SSIM [22], to compare the performance of MGFE-Net with recent methods. For the real-world set, which lacks corresponding ground truth, we use two no-reference indicators, NIQE [15] and BRISQUE [14], to evaluate the visual quality of derained photos; smaller NIQE and BRISQUE values mean better restoration and better perceptual quality. The recent deraining models we compare with are Clear [4], JORDER [26], DID-MDN [29], DualCNN [16], RESCAN [11], SPANet [21], UMRL [28] with cycle spinning, and PReNet [18].
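For the full-reference metrics, PSNR and SSIM can be computed per image with scikit-image; the sketch below assumes 8-bit RGB arrays and is illustrative rather than our exact evaluation script.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(derained, gt):
    """PSNR/SSIM for one H x W x 3 uint8 image pair (full-reference only).
    `channel_axis` requires scikit-image >= 0.19."""
    psnr = peak_signal_noise_ratio(gt, derained, data_range=255)
    ssim = structural_similarity(gt, derained, channel_axis=2, data_range=255)
    return psnr, ssim
```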

3.2 Comparison with the State-of-the-Art Methods

Comparison Results on Synthetic Datasets. Table 1 summarizes the quantitative comparison of different single image deraining methods; our MGFE-Net outperforms previous methods on all benchmark datasets. Note that the performance of PReNet [18] is very close to that of MGFE-Net; we consider the main reason to be that PReNet adopts frequent image cropping to expand the dataset several times. Specifically, on the Rain1200, Rain1400, and Rain1000 datasets, our method improves the PSNR by 0.28 dB, 0.68 dB, and 2.78 dB over the second-best result on each dataset. It is remarkable that our method performs particularly well on Rain1000, which collects images of various natural scenes and contains many real rain streaks.

We then qualitatively compare MGFE-Net with other methods by examining details of the restored images. As shown in Fig. 3, MGFE-Net is the only model that successfully handles all the different rainy situations. For the first two rows of Fig. 3, where the rain streaks are densely distributed or differ significantly in shape, the three recent methods cannot remove the rain streaks completely, while our method generates a clean result. For the last two rows, the other methods either leave obvious artifacts in the restored images or blur the original background, while our method achieves a better visual effect and keeps the content details well preserved.

Table 1. Comparison results in PSNR and SSIM between our MGFE-Net and other recent methods for single image deraining on three synthetic datasets.
Fig. 3.

Qualitative comparison of SPANet [21], UMRL [28], PReNet [18] and our proposed MGFE-Net on three synthetic datasets.

Comparison Results on Real-World Dataset. Since most deraining models are trained with synthetic rainy images, it is necessary to evaluate their generalization on realistic rainy photos. As shown in Table 2, MGFE-Net performs better than previous methods according to the reference-free NIQE and BRISQUE metrics. We also present several restored images for qualitative comparison in real-world rainy situations in Fig. 4. Whether in heavy rain with dense rain streaks or in sparse rain streak distributions with complicated shapes, MGFE-Net generalizes better and removes rain streaks in realistic scenes more effectively than the other methods.

Table 2. Comparison results in NIQE and BRISQUE on the real-world photos.
Fig. 4.

Comparison results in real-world rainy situations of SPANet [21], DualCNN [16], UMRL [28], PReNet [18] and our proposed MGFE-Net. Intuitively, our MGFE-Net performs better than recent deraining methods.

3.3 Ablation Study

To verify the effectiveness of the designed modules in MGFE-Net, we evaluate four experimental settings on Rain1200 [29]. As shown in Table 3, the four settings isolate the effects of the receptive field block (RFB), the multi-scale gated module (MGM), and the gradient loss (GL). Note that Backbone denotes the plain encoder-decoder framework optimized with only the \(L_{1}\) and \(L_{ssim}\) losses. It is evident that supervising rain-free image generation with the additional gradient loss noticeably improves performance. By sequentially integrating the RFB and MGM into setting \(M_{b}\), the fourth setting \(M_{d}\) (i.e., our proposed MGFE-Net) enhances the extracted deep features with larger receptive fields and effectively exploits multi-scale features, further improving model capability and generating rain-free images with the best visual effects.

Table 3. Ablation study on four experimental settings of MGFE-Net. Performances are evaluated on Rain1200 [29] dataset.

4 Conclusion

In this paper, a novel Multi-scale Gated Feature Enhancement Network (MGFE-Net) is proposed for the single image deraining task. In MGFE-Net, we leverage the receptive field block (RFB) to strengthen multi-scale feature extraction and use the multi-scale gated module (MGM) to selectively combine multi-scale features while keeping high-level image semantics consistent with low-level texture details. By embedding the two modules into a typical encoder-decoder framework, the proposed MGFE-Net not only generates a clean derained image but also keeps the background content well preserved. Extensive comparison results demonstrate that MGFE-Net both achieves excellent performance and generalizes well to real-life photos, significantly improving the deraining effect and the human visual perception quality of derained images.