1 Introduction

Images captured on rainy days suffer from noticeable degradation of scene visibility. For example, raindrops inevitably adhere to camera lenses or windscreens on a rainy day, occluding and deforming some image areas and significantly degrading the performance of many algorithms in vision systems (such as object detection, tracking, and recognition). The goal of single image deraining algorithms is to generate sharp images from a rainy input, which can potentially benefit both human visual perceptual quality and many computer vision applications, such as intelligent vehicles and outdoor surveillance systems (Sheng et al. 2020; Tokuda et al. 2020).

Recent years have witnessed significant progress in single image deraining. The progress in this field can be attributed to various natural image priors (Sun et al. 2014; Kang et al. 2012; Chen and Hsu 2013; Bossu et al. 2011) and deep convolutional neural network (CNN)-based models (Fu et al. 2017b; Qian et al. 2018; Zhang and Patel 2018). However, a fair and comprehensive study of the problem, the existing algorithms, and the performance metrics has been absent so far, which is the goal of this paper. In this work, we focus on image deraining techniques and how they have been extended or applied to high-level vision systems, based on our proposed new benchmark. To the best of our knowledge, this is the first comprehensive benchmark and the first review in the literature that focuses on image deraining and its corresponding applications.

This work is organized as follows. First, Sect. 2 reviews the rainy image models and explains important background concepts that will be necessary throughout the rest of the paper. Next, Sect. 3 surveys the model-based and learning-based single-image deraining approaches and the existing datasets used in the rain removal literature. Then, Sect. 4 provides a comprehensive description and analysis of the proposed benchmark for multi-purpose image deraining (MPID). Section 5 analyzes typical metrics and evaluation protocols for the deraining methods and provides quantitative results for them on the proposed benchmark. Finally, Sect. 6 summarizes the paper with a brief discussion of the presented benchmark and enumerates potential future research directions.

1.1 Our Contribution

Image deraining is a heavily ill-posed problem. Despite many impressive methods published in recent years, the lack of a large dataset and algorithm benchmarking makes it difficult to evaluate the progress made, and how practically useful those algorithms are. There are several unclear and unsatisfactory aspects of current deraining algorithm development, including but not limited to: (1) the modeling of rain is simplified, i.e., each method considers and is evaluated with one type of rain only (e.g., Kang et al. 2012; Chen and Hsu 2013; Li et al. 2016, 2017; Jiang et al. 2017; Lei et al. 2017; Wei et al. 2019; Ren et al. 2019 focus on rain streak removal, while Qian et al. 2018; You et al. 2016 concentrate on removing raindrops); (2) most quantitative results are reported on synthetic images, which often fail to capture the complexity and characteristics of real rain. Although some real deraining datasets have been proposed, these databases lack sufficient real-world images and provide no semantic annotations for diverse evaluations; (3) as a result of the last point, the evaluation metrics have been mostly limited to the full-reference PSNR and SSIM for image restoration purposes. These metrics may correlate poorly with other task purposes, such as human perceptual quality (Lai et al. 2016; Li et al. 2019a) or high-level computer vision utility (Dai et al. 2016, 2020; Sakaridis et al. 2018; Hahner et al. 2019).

In this paper, we aim to systematically evaluate state-of-the-art single image deraining methods in a comprehensive and fair setting. To this end, we construct a large-scale benchmark called Multi-Purpose Image Deraining (MPID). An overview of MPID can be found in Table 3, and image examples are displayed in Fig. 1. Compared with existing synthetic sets, the MPID dataset covers a much larger diversity of rain models (rain streak, raindrop, and rain and mist), includes both synthetic and real-world images for evaluation, and features diverse contents and sources (for the real rainy images). In addition, in a first-of-its-kind effort in image deraining, we have annotated two sets of real-world rainy images with object bounding boxes, from autonomous driving and video surveillance scenarios respectively, for task-specific evaluation.

Fig. 1

Example images from the MPID dataset. The proposed dataset contains both synthetic and real-world rainy images of the rain streak, raindrop, and rain and mist types. In addition, we also annotate two sets of real-world images with object bounding boxes from autonomous driving and video surveillance scenarios

Using the MPID benchmark, we evaluate eight state-of-the-art single image deraining algorithms. We adopt a wide range of full-reference metrics (PSNR and SSIM), no-reference metrics (NIQE, BLIINDS-II, and SSEQ), as well as human subjective scores, to thoroughly examine the performance of image deraining methods. Furthermore, as image deraining is often expected to serve as a preprocessing step for mid- and high-level computer vision tasks, we also evaluate current algorithms in terms of their impact on subsequent object detection, as a "task-specific" evaluation criterion. We reveal the performance gap in various aspects when these algorithms are applied to synthetic and real images. By extensively comparing the state-of-the-art single image deraining algorithms on the MPID dataset, we gain insights into new research directions for image deraining.

In this paper, we extend our preliminary work (Li et al. 2019c) in the following aspects.

  • Evaluations of more image deraining algorithms In Li et al. (2019c), we evaluated six different deraining methods on the proposed multi-purpose image deraining (MPID) dataset. In this manuscript, we additionally evaluate two very recent image deraining methods, DAF-Net (Hu et al. 2019) and STL (Wei et al. 2019), which perform better than conventional deraining approaches on existing deraining datasets. In particular, STL (Wei et al. 2019) is the first semi-supervised learning network for the image deraining task.

  • Extension of detection methods In Li et al. (2019c), we used Faster R-CNN (FRCNN) (Ren et al. 2015), YOLO-V3 (Redmon and Farhadi 2018), SSD-512 (Liu et al. 2016), and RetinaNet (Lin et al. 2018) to detect objects after applying a deraining algorithm. In this paper, we add a new state-of-the-art detection model, CenterNet (Zhou et al. 2019), to conduct the task-driven comparisons. As a result, the employed detection methods include two-stage, one-stage anchor-based, and one-stage anchor-free detection algorithms. In addition, we find that the recent CenterNet performs better than the conventional deep detection models.

  • Detailed results of object detection In addition to the mAP results reported in Li et al. (2019c), we further show the per-class AP results for the different deraining algorithms, enabling a more detailed comparative analysis.

  • Datasets survey In this paper, we summarize the existing image deraining datasets used to measure and compare the performance of image deraining algorithms. We find that existing datasets are either too small in scale, limited to one rain type, or lacking sufficient real-world images for diverse evaluations. In addition, none of them has any semantic annotation or considers any subsequent task performance.

  • More analysis We add more analysis of the different deraining algorithms in terms of various evaluation criteria (full- and no-reference objective, subjective, and task-specific metrics) to expose the current challenge of the performance gap between synthetic and real-world images. Based on the comprehensive results, we further suggest possible future research directions for image deraining.

2 Rainy Image Formulation Models

In this section, we review the commonly-used rain synthesis models in the literature. As a complicated atmospheric process, rain can cause several different types of visibility degradation, due to a multitude of environmental factors including raindrop size, rain density, and wind velocity. When a rainy image is taken, the visual effects of rain on the digital image further hinge on many camera parameters, such as exposure time, depth of field, and resolution (Garg and Nayar 2005). Most existing deraining works assume one rain model (usually rain streak), which may oversimplify the problem. We group the existing rain models in the literature into three major categories: rain streak, raindrop, and rain and mist.

A rain streak image \(\mathbf {R}_s\) can be modeled as a linear superimposition of the clean background scene \(\mathbf {B}\) and the sparse, line-shape rain streak component \(\mathbf {S}\):

$$\begin{aligned} \mathbf {R}_s =\mathbf {B} + \mathbf {S}. \end{aligned}$$
(1)

Rain streaks \(\mathbf {S}\) accumulated throughout the scene reduce the visibility of the background \(\mathbf {B}\). This is the most common model assumed by the majority of deraining algorithms.

Adherent raindrops (You et al. 2016) that fall and flow on camera lenses or window glass can obstruct and/or blur the background scene. The raindrop-degraded image \(\mathbf {R}_d\) can be modeled as the combination of the clean background \(\mathbf {B}\) and the blurring or obstruction effect of the raindrops \(\mathbf {D}\) in scattered, small local coherent regions:

$$\begin{aligned} \mathbf {R}_d =\left( 1-\mathbf {M}\right) \odot \mathbf {B} + \mathbf {D}. \end{aligned}$$
(2)

\(\mathbf {M}\) is a binary mask and \(\odot \) means element-wise multiplication. In the mask, a pixel x is part of a raindrop region if \(\mathbf {M}(x)=1\), and otherwise belongs to the background.

Further, real rainy images often contain both rain and mist. In addition, distant rain streaks accumulated throughout the scene reduce visibility in a manner similar to fog, creating a mist-like phenomenon in the image background. Accordingly, we define the rain and mist model for the captured image \(\mathbf {R}_m\), based on a composition of the rain streak model and the atmospheric scattering haze model (McCartney 1976):

$$\begin{aligned} \mathbf {R}_m =\mathbf {B} \odot t + A\left( 1-t\right) + \mathbf {S}, \end{aligned}$$
(3)

where \(\mathbf {S}\) is the rain streak component, and t and A are the transmission map and atmospheric light that determine the fog/mist component, respectively.
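To make the three formulations concrete, the following minimal NumPy sketch composes rainy images from the quantities defined above. All inputs (the background \(\mathbf {B}\), streak layer \(\mathbf {S}\), raindrop layer \(\mathbf {D}\), mask \(\mathbf {M}\), transmission t, and atmospheric light A) are assumed to be given as float arrays in [0, 1]; how they are rendered or estimated is method-specific and not shown.

```python
import numpy as np

def rain_streak(B, S):
    """Eq. (1): linear superimposition of background and streak layer."""
    return np.clip(B + S, 0.0, 1.0)

def raindrop(B, D, M):
    """Eq. (2): the raindrop effect D replaces the background inside the binary mask M."""
    return np.clip((1.0 - M) * B + D, 0.0, 1.0)

def rain_and_mist(B, S, t, A):
    """Eq. (3): rain streaks composed with the atmospheric scattering haze model."""
    return np.clip(B * t + A * (1.0 - t) + S, 0.0, 1.0)
```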

There are two main drawbacks of existing evaluation approaches. First, synthetic rainy images usually fail to capture the characteristics of real degradation on rainy days. For example, the models in (1) and (2) only consider one factor. Halder et al. (2019) recently proposed a physically-based rendering method to improve the realism of synthetic rainy images. They used a more complex pipeline to simulate and insert rain streaks, taking the rain amount into account, to generate a more convincing visual result. This method achieves the goal of creating visually appealing rainy images; however, it still only generates rain streaks without considering the mist effects in a rainy image. Second, existing deraining approaches use PSNR and SSIM to evaluate image restoration performance, which do not correlate well with human perception (Lai et al. 2016) or high-level visual algorithms (Li et al. 2019a). The lack of human and machine perceptual studies makes it difficult to compare the performance of deraining algorithms. While numerous full- and no-reference image quality metrics have been proposed, it is unclear whether these metrics can be applied to measure the quality of derained images.

3 Related Work

3.1 Overview of Deraining Algorithms

Early methods often require multiple frames for deraining (Ren et al. 2017; Santhaseelan and Asari 2015; Jiang et al. 2017; You et al. 2016). Garg and Nayar (2004) proposed a method that detects and removes rain streaks from a video by taking the average intensity of the detected rain streaks in the previous and subsequent frames. Garg and Nayar (2005) further improved the performance by selecting camera parameters without appreciably altering the scene appearance. However, those methods are not applicable to single image deraining.

Compared to multi-frame deraining approaches, which exploit temporally redundant knowledge, deraining from a single image is more challenging since less information is available. To address this problem, the design of single image deraining algorithms has attracted increasing research attention. The existing single image deraining methods can be roughly divided into two categories: model-based (non-deep-learning) and data-driven (deep-learning) approaches. Table 1 summarizes the single image rain removal methods.

3.1.1 Model-Driven Algorithms

The model-driven methods focus on encoding physical properties of rain and prior knowledge of background scenes into an optimization problem, and on designing rational algorithms to solve it. These algorithms can be divided into three main categories: filter based methods, low-rank and sparse-coding based algorithms, and Gaussian Mixture Model (GMM) based approaches.

Filter based algorithms Zheng et al. (2013) presented a multiple guided filter based method using the low-frequency part of a single image. Ding et al. (2016) designed a guided L0 smoothing filter based on L0 gradient minimization to remove rain streaks from a rainy image. Santhaseelan and Asari (2015) first detected rain streaks based on phase congruence features in input rainy videos, and then exploited the frame-to-frame variation of these features to remove rain from the videos.

Sparse coding based algorithms Many deraining methods capitalize on clean-image or rain-type priors to remove rain (Sun et al. 2014; Luo et al. 2015; Barnum et al. 2010). Kang et al. (2012) decomposed an input image into its low- and high-frequency components, and then separated the rain streak frequencies from the high-frequency layer via sparse coding. Zhu et al. (2017) introduced a rain removal method based on the prior that rain streaks typically span a narrow range of directions. Chen and Hsu (2013) decomposed the background and rain streak layers based on low-rank priors.

GMM based algorithms Li et al. (2016) used patch-based priors for both the clean background and rain layers in the form of Gaussian mixture models. Building on Li et al. (2016), Li et al. (2017) further introduced a structure residue recovery step to separate the background residues and improve the decomposition quality for image deraining.

However, all of the above approaches rely on handcrafted image priors, which may not hold in some real-world scenes. As a result, these model-driven algorithms tend to perform unsatisfactorily and generate artifacts on real-world images with complicated scenes and rain forms.

3.1.2 Data-Driven Algorithms

Recent methods often adopt data-driven algorithms, designing specific network architectures whose parameters are learned to attain complex rain removal functions. Most of these methods target particular aspects of rain removal and have their applicability and advantages in specific scenarios. We briefly discuss the popular deep neural networks employed for image deraining in this section.

CNN models A CNN architecture typically includes convolutional layers, pooling layers, and fully connected layers. CNNs are powerful in learning feature representations of different abstraction levels from large-scale data.

Recently, CNNs have achieved dominant success in image restoration (Ren et al. 2016; Zhang et al. 2017), including single image deraining (Fu et al. 2017a; Eigen et al. 2013). Fu et al. (2017b) proposed a deep detail network (DDN) for removing rain from single images while preserving details. Yang et al. (2017) presented a CNN-based method to jointly detect and remove rain streaks, using a multi-stream network to capture the rain streak component. A density-aware multi-stream densely connected convolutional neural network was introduced in Zhang and Patel (2018) for joint rain density estimation and image deraining. Hu et al. (2019) formulated a depth-guided attention mechanism to learn depth-attentional features and regress a residual map, and prepared a new dataset, RainCityscapes, for rain removal. However, existing deep networks usually have an enormous number of parameters. To remedy this, Fu et al. (2020) proposed a lightweight deep network based on the classical Gaussian-Laplacian pyramid for single image deraining.
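As an illustration of the residual-learning idea shared by several of these CNN derainers (regressing a rain layer and subtracting it from the input, following the additive model in Eq. (1)), the following is a minimal PyTorch sketch. The architecture is deliberately generic and does not reproduce any particular published network.

```python
import torch
import torch.nn as nn

class ResidualDerainer(nn.Module):
    """Toy residual CNN: predict the rain layer S and return B = R - S."""
    def __init__(self, channels=64, depth=6):
        super().__init__()
        layers = [nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 3, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, rainy):
        residual = self.body(rainy)   # estimated rain layer S
        return rainy - residual       # derained image B = R - S

model = ResidualDerainer()
out = model(torch.rand(1, 3, 128, 128))  # toy forward pass
```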

GAN models GANs are trained as generative models through a two-player game between a generator and a discriminator. Specifically, the generator aims to synthesize data from the same distribution as real data and tries to fool the discriminator, while the discriminator is trained to distinguish synthesized data from real samples. During training, the generator and the discriminator compete with each other and improve themselves, ultimately enabling the generator to produce realistic derained images (Qian et al. 2018; Li et al. 2019b).

Qian et al. (2018) addressed the different problem of removing raindrops from single images by using visual attention with a generative adversarial network (GAN). Zhang et al. (2019) proposed a single image deraining method based on a conditional generative adversarial network (CGAN), which incorporates quantitative, visual, and discriminative performance into the objective function. Li et al. (2019b) proposed an integrated two-stage neural network with a novel streak-aware decomposition to adaptively separate the image into a high-frequency component containing rain streaks and a low-frequency component containing rain accumulation. Yu et al. (2020) proposed a fully end-to-end image dehazing algorithm, FD-GAN, which directly outputs haze-free images without estimating intermediate parameters.
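The following is a hedged sketch of one adversarial training step for a deraining GAN of the kind described above, combining the adversarial term with a pixel-wise fidelity term. `generator` and `discriminator` are placeholder nn.Modules; published methods add further losses (e.g., attention or perceptual terms) on top of this basic scheme.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, rainy, clean):
    # --- Discriminator step: real clean images vs. generated derained images.
    with torch.no_grad():
        fake = generator(rainy)
    real_logits = discriminator(clean)
    fake_logits = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits,
                                                 torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits,
                                                   torch.zeros_like(fake_logits)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- Generator step: fool the discriminator plus an L1 fidelity term.
    derained = generator(rainy)
    logits = discriminator(derained)
    g_loss = (F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
              + F.l1_loss(derained, clean))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```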

Semi/Unsupervised models Semi-supervised learning is a learning paradigm concerned with how computers and natural systems such as humans learn in the presence of both labeled and unlabeled data, whereas unsupervised learning means learning from unlabeled data only, i.e., using real captured rainy images without the corresponding ground truths. Wei et al. (2019) first proposed a semi-supervised transfer learning framework for single image rain removal. They formulate the residual between the expected clean output images and their original rainy inputs through a likelihood term imposed on a parameterized distribution, designed based on domain understanding of the residuals. Jin et al. (2019) proposed an unsupervised generative adversarial network (UD-GAN) with self-supervised constraints for image deraining.
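The following is a simplified sketch of such a semi-supervised objective: a supervised loss on synthetic pairs plus an unsupervised prior on the residuals of unpaired real rainy images. The total-variation prior below is only a stand-in for the parameterized residual likelihood used in STL, chosen here for brevity.

```python
import torch
import torch.nn.functional as F

def tv(x):
    """Total variation of the residual, encouraging smooth/sparse rain layers."""
    return ((x[..., 1:, :] - x[..., :-1, :]).abs().mean()
            + (x[..., :, 1:] - x[..., :, :-1]).abs().mean())

def semi_supervised_loss(model, syn_rainy, syn_clean, real_rainy, w=0.1):
    sup = F.mse_loss(model(syn_rainy), syn_clean)    # labeled synthetic pairs
    real_residual = real_rainy - model(real_rainy)   # unlabeled real images
    unsup = tv(real_residual)                        # stand-in residual prior
    return sup + w * unsup
```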

Despite the progress of deep-learning-based approaches over prior-based rain removal, their performance hinges on the synthetic training data, which may become problematic if real rainy images exhibit a domain mismatch.

Table 1 An overview of single-image deraining methods

3.2 Datasets

In the computer vision field, widely accepted and commonly used databases have enabled objective comparisons and promoted scientific progress (Katrin et al. 2016; Szeliski et al. 2008; Schops et al. 2017). Several rainy image datasets have also been used to measure and compare the performance of deraining algorithms. Li et al. (2016) introduced 12 images using photo-realistic rendering techniques. Zhang et al. (2019) synthesized a set of training and testing images with rain streaks in the same way as Li et al. (2016); the training set consists of 700 images and the testing set of 100 images. In addition, Zhang et al. (2019) also collected a dataset of 92 real-world rainy images downloaded from the web for qualitative visual comparison. Qian et al. (2018) released a set of clean and raindrop-corrupted image pairs, captured using special lens equipment. To address the heavy rain removal problem, Li et al. (2019b) created a new synthetic rain dataset named NYU-Rain and another outdoor rain dataset built on a set of outdoor clean images, denoted Outdoor-Rain. Specifically, they provided a new synthetic data generation pipeline that synthesizes the mist effect according to the scene depth. To make the synthesized images more realistic, they also added Gaussian blur to both the transmission map and the background to simulate the effect of scattering in heavy rain scenarios. Meanwhile, Wang et al. (2019) constructed a large-scale real-world paired rain and clean dataset via a semi-automatic method that incorporates temporal priors and human supervision.

We note that the recent works of Li et al. (2019b) and Wang et al. (2019) provide two large-scale rain removal datasets with more realism than conventional deraining datasets. However, the data from Li et al. (2019b) only include synthesized images, while the generated ground truths in Wang et al. (2019) may contain some noise, blur, and shaking due to the misalignment between neighboring frames in the captured videos. We summarize the most used datasets for image deraining in Table 2. As shown, existing datasets are either too small in scale, limited to one rain type (rain streaks or raindrops), or lacking sufficient real-world images for diverse evaluations. Although the recently proposed Weather Kitti dataset (Halder et al. 2019) includes a large number of images, none of the existing databases has any semantic annotation or considers subsequent task performance. In contrast, our dataset contains synthetic, real-world, as well as annotated rainy images for a comprehensive evaluation of single image deraining algorithms. The images in our dataset cover various rain types and scenarios and include actual challenges and variations from the real world.

Table 2 Summary of the most used datasets for image deraining
Table 3 Overview of the proposed MPID dataset

4 New Benchmark: Multi-purpose Image Deraining (MPID)

We present a new benchmark as a comprehensive platform for evaluating single image deraining algorithms from a variety of perspectives. Our evaluation angles range from the traditional PSNR/SSIM, to no-reference perception-driven metrics and human subjective quality, to "task-driven metrics" (Li et al. 2019a; Kupyn et al. 2018) indicating how well a target computer vision task can be performed on the derained images. Fitting those purposes, we generate/collect images at large scale, from both synthetic and real-world sources, covering diverse real-life scenes, and annotate them when needed. The new benchmark, dubbed Multi-Purpose Image Deraining (MPID), is introduced below in detail. An overview of MPID can be found in Table 3.

4.1 Training Sets: Three Synthesis Models

Following the three rain models in Sect. 2, we create three training sets, named Rain streak (T), Rain drop (T), and Rain and mist (T) (T short for "training"), respectively. All three sets are synthesized in controlled settings from clean images.Footnote 1 All clean images used are collected from the web, and we specifically pick outdoor rain-free, haze-free photos taken in cloudy daylight, so that the synthesized rainy images look more realistic in terms of lighting condition (for example, there will be no rainy photo against a sunny daylight background). Specifically, we synthesize rainy images according to the following two aspects. First, we follow the common protocol used in Li et al. (2016), Zhang et al. (2019) to generate rain streaks. We also noticed the wet ground/overcast sky issue during data synthesis, and manually inspected/selected clear overcast images on which we synthesized rain. Second, we follow the widely-accepted routine in Li et al. (2019a), Sakaridis et al. (2018), Ren et al. (2016, 2018a, 2018b, 2020) to generate mist: we first estimate depth from clear overcast outdoor images, and then synthesize mist images as in Ren et al. (2016).

The Rain streak (T) set contains 2,400 pairs of clean and rainy images, where the rainy images are generated from the clean ones using (1), with protocol and hyperparameters identical to Li et al. (2016), Zhang et al. (2019). The Rain drop (T) set was borrowed from the released training set of Qian et al. (2018), consisting of 861 pairs of clean and raindrop-corrupted images, with the authors' consent. The Rain and mist (T) set is synthesized by first adding haze using the atmospheric scattering model: for each clean image, we estimate depth using the algorithm in Liu et al. (2016), Li et al. (2018) as recommended by Li et al. (2017), set different atmospheric lights A by choosing each channel uniformly at random in [0.7, 1.0], and select \(\beta \) uniformly at random in [0.6, 1.8]. Then, from the synthesized hazy version, we further add rain streaks in the same way as for Rain streak (T). We end up with 700 pairs for the Rain and mist (T) set.
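Assuming the standard transmission form t = exp(-βd) from the atmospheric scattering literature, the Rain and mist (T) synthesis above can be sketched as follows; `depth` and `streaks` are assumed to come from the depth estimation and streak rendering steps referenced in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_rain_and_mist(clean, depth, streaks):
    """clean: HxWx3 image in [0, 1]; depth: HxW normalized scene depth;
    streaks: HxWx3 rain streak layer rendered as for Rain streak (T)."""
    A = rng.uniform(0.7, 1.0, size=3)       # per-channel atmospheric light
    beta = rng.uniform(0.6, 1.8)            # scattering coefficient
    t = np.exp(-beta * depth)[..., None]    # transmission map (assumed exp(-beta*d))
    hazy = clean * t + A * (1.0 - t)        # atmospheric scattering model
    return np.clip(hazy + streaks, 0.0, 1.0)
```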

4.2 Testing Sets: From Synthetic to Real

Corresponding to the three training sets, we generate three synthetic testing sets in the same way, denoted Rain streak (S), Rain drop (S), and Rain and mist (S) (S short for "synthetic testing"), consisting of 200, 149, and 70 pairs, respectively. On each testing set, we evaluate the restoration performance of deraining algorithms using the classical PSNR and SSIM metrics. Further, to predict the derained image's perceptual quality for human viewers, we introduce the usage of three no-reference IQA models: the naturalness image quality evaluator (NIQE) (Mittal et al. 2013), spatial-spectral entropy-based quality (SSEQ) (Liu et al. 2014), and the blind image integrity notator using DCT statistics (BLIINDS-II) (Saad et al. 2012), to complement the shortcomings of PSNR/SSIM. NIQE is a well-known no-reference image quality score indicating the perceived "naturalness" of an image: a smaller score indicates better perceptual quality. The SSEQ and BLIINDS-II scores we use range from 0 (worst) to 100 (best).Footnote 2
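For the full-reference part of this protocol, a minimal sketch using scikit-image (version 0.19 or later for the `channel_axis` argument) is shown below; the no-reference scores (NIQE, SSEQ, BLIINDS-II) come from the metrics' reference implementations and are not reproduced here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(derained, ground_truth):
    """Both images as float arrays in [0, 1], shape HxWx3."""
    psnr = peak_signal_noise_ratio(ground_truth, derained, data_range=1.0)
    ssim = structural_similarity(ground_truth, derained,
                                 channel_axis=-1, data_range=1.0)
    return psnr, ssim

# Toy usage: score a slightly perturbed copy of a random "ground truth".
gt = np.random.rand(128, 128, 3)
print(full_reference_scores(np.clip(gt + 0.02, 0, 1), gt))
```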

Besides the three synthetic test sets above, we collect three sets of real-world images, falling into each of the three defined rain categories, to evaluate the deraining algorithms' real-world generalization. The three sets, denoted Rain streak (R), Raindrop (R), and Rain and mist (R) (R short for "real-world testing"), are collected from the Internet and carefully inspected to ensure that the images in each set fit the pre-defined rain type well. Due to the unavailability of ground truth clean images in the real world, we evaluate NIQE, SSEQ, and BLIINDS-II on the three real-world sets. In addition, we also pick a small set of real-world images for human subjective rating of derained results.

4.3 Task-Driven Evaluation Sets

As pointed out by plenty of recent works (Wang et al. 2016; Liu et al. 2018, 2019, 2020; Scheirer et al. 2020; Yang et al. 2020; Hahner et al. 2019), the performance of high-level computer vision tasks, such as object detection and recognition, deteriorates in the presence of various sensory and environmental degradations. In particular, Sakaridis et al. (2018) studied the effect of image dehazing on semantic segmentation using a synthesized Foggy Cityscapes dataset with 20,550 images. This work carefully investigated the practicability of image dehazing for semantic foggy scene understanding (SFSU) and found that image dehazing marginally advances SFSU in most cases. Dai et al. (2016) evaluated several image super-resolution methods on high-level vision tasks and concluded that super-resolution approaches are usually helpful for other vision tasks. In these cases, the low-level image processing methods improved the performance of the high-level tasks. While deraining could be used as pre-processing for many computer vision tasks executed in rainy conditions, there has been no systematic study of deraining algorithms' impact on those target tasks. The recent work of Halder et al. (2019) evaluated the robustness of high-level tasks in rainy conditions; however, the evaluation did not include a study of the usefulness of deraining algorithms for the high-level tasks. We consider the resulting task performance after deraining as an indirect indicator of deraining quality. Such a "task-driven" evaluation has received little attention and can have great implications for outdoor applications.

To conduct such task-driven evaluations, realistic annotated datasets are necessary. To the best of our knowledge, no dataset has been available that serves the purpose of evaluating deraining algorithms in task-driven ways. We therefore collect two sets of our own: a Rain in Driving (RID) set collected from car-mounted cameras while driving in rainy weather, and a Rain in Surveillance (RIS) set collected from networked traffic surveillance cameras on rainy days.

For each set, we annotate object bounding boxes and evaluate object detection performance after applying deraining. A summary with object statistics for both the RID and RIS sets can be found in Table 4. The two sets differ in many ways: rain type, image quality, object size and angle, and so on. They are representative of real application scenarios where deraining may be desired.

Rain in Driving (RID) Set This set contains 2495 real rainy images from high-resolution driving videos. As we observe, its rain effect is closest to "raindrops" on the camera lens. The images were captured in diverse real traffic locations and scenes during multiple drives. We label bounding boxes for selected traffic objects that commonly appear on the roads in all images: car, person, bus, bicycle, and motorcycle. Most images have a resolution of 1920 \(\times \) 990, with a few exceptions of 4023 \(\times \) 3024.

Rain in Surveillance (RIS) Set This set contains 2048 real rainy images from relatively lower-resolution surveillance video cameras. They were extracted from a total of 154 surveillance cameras in daytime, ensuring diversity in content (for example, we do not consider frames too close in time). As we observe, its rain effect is closest to "rain and mist" (many cameras suffer mist condensation during rain, and the low resolution also causes more foggy effects). Notably, we found very few bicycles in the RIS set, which is consistent with the common sense that people rarely go cycling when it rains; we therefore annotated trucks rather than bicycles in the RIS dataset. Finally, we selected and annotated the most common objects in traffic surveillance scenes: car, person, bus, truck, and motorcycle. The vast majority of cameras have a resolution of 640 \(\times \) 368, with a few exceptions of 640 \(\times \) 480.

We carefully selected images containing these objects in the scene. We observed that rainy images tend to present fewer objects in the scene, which is natural given that people usually avoid going out on the street when it is raining. These efforts result in a rich base of outdoor images in rainy and sunny weather conditions with the most common objects annotated.

Adverse weather conditions like rain and haze affect the visual quality of images. Images captured in such conditions are intrinsically degraded, which causes computer vision systems to suffer decreased performance. Thus, tasks like deraining and dehazing are extremely challenging and important. Recently, many efforts have been made to remove rain and haze effects or, at least, attenuate their impairments. Despite the success of recent algorithms, real rainy scenarios continue to constitute a demanding problem. We believe there is a gap between the synthetic rainy datasets used so far for training current models and real rainy images. Hence, to improve on this task, we need to consider real information from rainy scenes. Motivated by this, we present a new dataset containing real rainy images from surveillance video cameras. We further provide a sunny set of images for evaluation and comparison in the object detection task. In this way, deraining strategies might benefit from the promising results that image-to-image translation has shown for domain adaptation.

Table 4 Object statistics in RID and RIS sets
Table 5 Average full- and no-reference evaluations results on synthetic rainy images
Table 6 Average no-reference evaluations results of derained results on real rainy images
Fig. 2

Visual comparisons of derained results on real images: rain streak (first image), raindrop (second image), and rain and mist (third image)

Table 7 Average subjective scores of derained results on 27 real images

5 Experimental Comparison

We evaluate eight representative state-of-the-art algorithms on MPID: the Gaussian mixture model prior (GMM) (Li et al. 2016), joint rain detection and removal (JORDER) (Yang et al. 2017), the deep detail network (DDN) (Fu et al. 2017b), the conditional generative adversarial network (CGAN) (Zhang et al. 2019), the density-aware image deraining method using a multi-stream dense network (DID-MDN) (Zhang and Patel 2018), the depth-attentional features network (DAF-Net) (Hu et al. 2019), the semi-supervised transfer network (STL) (Wei et al. 2019), and DeRaindrop (Qian et al. 2018). All except GMM are state-of-the-art CNN-based deraining algorithms.

Evaluation Protocol The first seven models are specifically developed for removing rain streaks, while the last one targets raindrop removal; we therefore compare the former on the rain streak sets. Since DeRaindrop is the only recently published method for raindrop removal, to provide more baselines for its performance, we also re-train and evaluate the other five models on the raindrop training dataset. In addition, we create a cascaded pipeline by first running each of the five rain streak removal algorithms and then feeding the output into a dehazing model, as in Yang et al. (2016), Li et al. (2019b). Based on the rain and mist model in (3), in theory we can first remove the additive rain streaks without interference from the haze. Therefore, we first remove rain from the rain and mist input, then restore the clean image by feeding the derained result to MSCNN (Ren et al. 2016), which is trained on synthesized hazy images based on the Middlebury stereo database (Scharstein and Szeliski 2002). The very recent work of Li et al. (2019b) also demonstrates that deraining first and then dehazing performs better than dehazing first and then deraining. We choose the MSCNN dehazing algorithm since recent dehazing studies (Li et al. 2019a; Liu et al. 2018) endorsed it both for producing the most human-favorable, artifact-free dehazing results and for benefiting subsequent high-level tasks in haze the most. Such a cascaded pipeline can be tuned end to end, and we freeze the MSCNN part during tuning in order to focus on comparing the deraining components; a minimal sketch of this cascade is given below. All models are re-trained on the corresponding MPID training set when evaluated on a certain rain type.
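A minimal PyTorch sketch of the cascade, with the dehazing stage frozen so that tuning only updates the deraining component, might look as follows; both stages are placeholder modules rather than the actual published networks.

```python
import torch
import torch.nn as nn

class DerainThenDehaze(nn.Module):
    """Cascade: derain first, then dehaze; the dehazer is kept frozen."""
    def __init__(self, derainer: nn.Module, dehazer: nn.Module):
        super().__init__()
        self.derainer, self.dehazer = derainer, dehazer
        for p in self.dehazer.parameters():   # freeze the dehazing stage
            p.requires_grad = False

    def forward(self, rainy_misty):
        return self.dehazer(self.derainer(rainy_misty))

# Toy stand-ins for the two stages; only the derainer receives gradients.
pipeline = DerainThenDehaze(nn.Conv2d(3, 3, 3, padding=1),
                            nn.Conv2d(3, 3, 3, padding=1))
out = pipeline(torch.rand(1, 3, 64, 64))
```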

5.1 Objective Comparison

We first compare the derained results on the synthetic images using two full-reference metrics (PSNR and SSIM) and three no-reference metrics (NIQE, Mittal et al. 2012; SSEQ, Liu et al. 2014; and BLIINDS-II, Saad et al. 2012).

Table 8 Detection results on the RID sets
Table 9 Detection results on the RIS set

As seen from Table 5, the results show a high level of consensus on synthetic data. First, DDN (Fu et al. 2017b) is the obvious winner on the Rain streak (S) set, followed by JORDER (Yang et al. 2017). Second, DeRaindrop (Qian et al. 2018) performs best on the Rain drop (S) set, significantly surpassing the others in terms of the full-reference PSNR and SSIM as well as the no-reference BLIINDS-II, showing that its specific structure indeed suits the raindrop removal problem. Other rain streak removal models even seem to hurt PSNR, SSIM, and BLIINDS-II compared to the input rainy images. For example, CGAN (Zhang et al. 2019) decreases both PSNR and SSIM on the Rain drop (S) set; the main reason may be that GANs tend to generate unrealistic details in the scenes. Finally, for the rain and mist images, DDN (Fu et al. 2017b) again performs consistently best according to PSNR and SSIM. Since STL (Wei et al. 2019) is trained to adapt to diverse real rain types through transferring from supervised synthesized rain, it achieves the highest SSEQ and BLIINDS-II values although it performs worse in terms of the full-reference metrics, which aligns with the emerging trend of semi-supervised or unsupervised learning methods that use real-world training images.

The effectiveness of the winners can be ascribed to the two-step strategy of rain detection and removal, i.e., first estimating a mask of rain streaks or raindrops, then removing rain artifacts capitalizing on that mask. We note that DDN (Fu et al. 2017b) focuses on high-frequency details during the training stage, while JORDER (Yang et al. 2017) first detects the locations of rain streaks and then removes rain based on the estimated rain streak regions. Coincidentally, DeRaindrop (Qian et al. 2018) also uses an attentive generative network to first learn about raindrop regions and their surroundings, then derains images using the information of the learned masks. Therefore, removing background interference and attentively focusing on rain regions seems to be the main reason for the winners' success in Table 5. In addition, unlike conventional deep learning methods that only use supervised image pairs, the recent work of STL (Wei et al. 2019) puts real rainy images into the network training process and therefore obtains the best performance in terms of no-reference metrics. A minimal sketch of the shared mask-then-remove design is given below.
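In the sketch, both subnetworks are toy placeholders; the published methods (e.g., JORDER, DeRaindrop) use far richer architectures and recurrent attention, so this only illustrates the structural idea of conditioning removal on a predicted rain-region mask.

```python
import torch
import torch.nn as nn

class MaskGuidedDerainer(nn.Module):
    """Two-step strategy: predict a rain mask, then remove rain given the mask."""
    def __init__(self, mask_net: nn.Module, removal_net: nn.Module):
        super().__init__()
        self.mask_net, self.removal_net = mask_net, removal_net

    def forward(self, rainy):
        mask = torch.sigmoid(self.mask_net(rainy))   # where the rain is
        x = torch.cat([rainy, mask], dim=1)          # condition removal on the mask
        return self.removal_net(x), mask

mask_net = nn.Conv2d(3, 1, 3, padding=1)             # toy stand-ins
removal_net = nn.Conv2d(4, 3, 3, padding=1)
model = MaskGuidedDerainer(mask_net, removal_net)
derained, mask = model(torch.rand(1, 3, 64, 64))
```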

We then show the derained results on the real-world images in Table 6, using the three no-reference metrics (NIQE, SSEQ, and BLIINDS-II). Figure 2 shows three corresponding visual comparison examples. The Raindrop (R) and Rain and mist (R) sets show results consistent with their synthetic cases: DeRaindrop (Qian et al. 2018) and STL (Wei et al. 2019) rank top two on the raindrop dataset, while STL (Wei et al. 2019) still dominates the rain and mist set. In particular, DeRaindrop (Qian et al. 2018) ranks first in terms of all three no-reference metrics, thanks to the raindrop attention map learned by its attentive-recurrent network. However, a different tendency is observed on the Rain streak (R) set: although DDN (Fu et al. 2017b) still obtains the highest BLIINDS-II value, it performs worse according to the SSEQ and NIQE metrics. In contrast, STL (Wei et al. 2019) becomes the dominant winner on those real images, as in the rain and mist case, outperforming DDN (Fu et al. 2017b) by a large margin in terms of SSEQ. As we observed, since CGAN (Zhang et al. 2019) is the most free of physical priors or rain type assumptions, it has the largest flexibility for re-training to fit different data. Its results are also the most photo-realistic due to the adversarial loss, as shown in Fig. 2, especially for the rain streak and the rain and mist cases in the first and third images. Additionally, the result might also suggest a larger domain gap between synthetic and real rain and mist data.

From Tables 5 and 6, we observe that despite certain discrepancies (e.g., when it comes to "bad performers"), the metrics agree reasonably well on ranking the top performers. For example, DeRaindrop is the clear winner, winning two full-reference metrics on synthetic raindrop images in Table 5 and three no-reference metrics on real-world raindrop images in Table 6. In addition, the semi-supervised method STL (Wei et al. 2019) becomes the dominant winner on real images, especially in the rain and mist case, which demonstrates that employing some real images as training data helps deal with rain in real-world cases.

Fig. 3

Visualization of object detection results with YOLO-V3 after applying different deraining algorithms on two images from the RID dataset

Fig. 4

Visualization of object detection results with YOLO-V3 after applying different deraining algorithms on two images from the RIS dataset

5.2 Subjective Comparison

We next conduct a human subjective survey to evaluate the performance of image deraining algorithms. We follow a standard setting that fits a Bradley–Terry model (Bradley and Terry 1952) to estimate a subjective score for each method so that they can be ranked, with exactly the same routine as described in previous similar works (Li et al. 2019a). We select 10 images from Rain streak (R), 6 images from Rain drop (R), and 11 images from Rain and mist (R), taking all possible care to ensure that they have very diverse contents and quality. Each rain streak or rain and mist image is processed with each of the seven deraining algorithms (all except DeRaindrop, Qian et al. 2018), and the seven deraining results, together with the original rainy image, are sent for pairwise comparison to construct the winning matrix. For a raindrop image, the procedure is the same except that it is processed by all eight methods: GMM (Li et al. 2016), JORDER (Yang et al. 2017), DDN (Fu et al. 2017b), CGAN (Zhang et al. 2019), DID-MDN (Zhang and Patel 2018), DeRaindrop (Qian et al. 2018), DAF-Net (Hu et al. 2019), and STL (Wei et al. 2019). We collect the pairwise comparison results from 11 human raters, i.e., each human subject is asked to choose the preferred image from a pair of derained images. Despite the relatively small number of raters, we observed good consensus and small inter-person variance among raters on the same pairs' comparison results, which makes the scores trustworthy.
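For reference, Bradley–Terry scores can be fitted from such a winning matrix with the classical minorization-maximization (Zermelo) iteration; the sketch below is a generic implementation, not the exact routine of Li et al. (2019a). Scores are identifiable only up to scale, matching the rank-only reading of Table 7.

```python
import numpy as np

def bradley_terry(wins, iters=100):
    """wins[i, j] = number of raters who preferred method i over method j."""
    n = wins.shape[0]
    games = wins + wins.T            # total comparisons per pair
    p = np.ones(n)                   # initial (uniform) preference scores
    for _ in range(iters):
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            p[i] = wins[i].sum() / max(denom, 1e-12)
        p /= p.sum()                 # fix the arbitrary scale
    return p

# Toy winning matrix for three methods.
wins = np.array([[0, 8, 5], [3, 0, 6], [6, 5, 0]])
print(bradley_terry(wins))
```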

The subjective scores are reported in Table 7. Note that we did not normalize the scores, so it is the score rank rather than the absolute score values that matters here. On the rain streak images, most human viewers prefer CGAN first, and then DDN. As shown in the first row of Fig. 2, the derained result generated by CGAN is smoother than the others. The main reason is that CGAN does not focus on designing a good prior or framework, but on ensuring that the derained result is indistinguishable from its corresponding clear image to a given discriminator; CGAN therefore generates derained results that are consistent with human vision. On the raindrop images, somewhat to our surprise, DeRaindrop, as well as the other deep learning-based models, is not favored by users; instead, the non-CNN-based GMM method, which showed no advantage under the previous objective metrics, was highly preferred. We conjecture that the patch-based Gaussian mixture prior can treat and remove both rain streaks and raindrops as "outliers", and is less sensitive to the domain difference between training and testing data. Finally, on the rain and mist images, DID-MDN receives the highest scores, with CGAN next. This is mainly thanks to incorporating the rain-density subnetwork or the GAN, which can provide more information about the scene context and hence improve generalization to complex rain conditions.

Table 10 Average no-reference evaluations results of derained results on RID images
Table 11 Average no-reference evaluations results of derained results on RIS images

From Tables 5, 6 and 7, we find that the off-the-shelf no-reference perceptual metrics (SSEQ, NIQE, BLIINDS-II) do not align well with the real human perceptual quality of deraining results. In fact, recent works (Choi et al. 2015) already discovered similar misalignments when applying standard no-reference metrics to estimating defogging perceptual quality, and proposed fog-specific metrics. Similar efforts have not yet been made for deraining, and we expect this worthy effort to take place in the near future.

5.3 Task-Driven Comparison

We first apply all deraining algorithms except GMM,Footnote 3 to pre-process the two task-driven testing sets, RID and RIS. Due to their different rain characteristics, for the RID set we use deraining algorithms trained on the raindrop case, and for the RIS set we use deraining algorithms trained on the rain and mist case. We visually inspected the derained results and found the rain to be visually attenuated after applying the selected deraining algorithms.

We then study object detection performance on the derained sets, using several state-of-the-art object detection models: Faster R-CNN (FRCNN) (Ren et al. 2015), YOLO-V3 (Redmon and Farhadi 2018), SSD-512 (Liu et al. 2016), RetinaNet (Lin et al. 2018), and CenterNet (Zhou et al. 2019). FRCNN is a two-stage detection model, which recomputes features for each potential box and then classifies those features. YOLO-V3, SSD-512, and RetinaNet are anchor-based one-stage detection models, which slide a complex arrangement of candidate anchor boxes over the image and classify them directly without specifying the box content. CenterNet is an anchor-free one-stage detection model, which represents each object by the center point of its bounding box and regresses the box size directly. These are representative detection models of their respective families. We compare all deraining algorithms via the mean Average Precision (mAP) achieved. It is important to note that our primary goal is not to optimize detection performance on rainy days, but to use a strong detection model as a fixed, fair metric for comparing deraining performance from a complementary perspective. Accordingly, the object detectors are not adapted for rainy or derained images, and we use the authors' pre-trained models on the MS-COCO (Lin et al. 2014) dataset.

The underlying hypothesis behind this evaluation protocol is: (1) an object detector trained on clean natural images will perform best when the input is also from, or close to, the clean image domain; (2) for detection in rain, the better the rain is removed, the better an object detection model (trained on clean images) will perform. Such a task-specific evaluation philosophy follows Kupyn et al. (2018), Li et al. (2019a).
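To make the scoring side of this protocol concrete, the sketch below computes single-class average precision at an IoU threshold of 0.5 via greedy matching; it is a simplification of the actual mAP tooling (e.g., COCO-style evaluation over multiple thresholds) used for Tables 8 and 9.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, thr=0.5):
    """preds: list of (confidence, box) for one class; gts: list of boxes."""
    preds = sorted(preds, key=lambda p: -p[0])   # highest confidence first
    matched, hits = set(), []
    for _, box in preds:
        ious = [iou(box, g) for g in gts]
        best = int(np.argmax(ious)) if ious else -1
        ok = best >= 0 and best not in matched and ious[best] >= thr
        if ok:
            matched.add(best)                    # each ground truth used once
        hits.append(1.0 if ok else 0.0)
    tp = np.cumsum(hits)
    recall = tp / max(len(gts), 1)
    precision = tp / (np.arange(len(hits)) + 1)
    return float(np.trapz(precision, recall))    # area under the PR curve
```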

Tables 8 and 9 report the mAP results and the per-class AP results for the different deraining algorithms, achieved using the five detection models, on both the RID and RIS sets. We find that quite aligned conclusions can be drawn from the two sets.

Perhaps surprisingly at first glance, we find that almost all existing deraining algorithms deteriorate the detection performance of YOLO-V3, SSD-512, and RetinaNet compared to directly using the rainy images. Our observation concurs with the conclusion of another recent study on dehazing (Pei et al. 2018): since those deraining algorithms were not trained/optimized towards the end goal of object detection, they do not necessarily help this goal, and the deraining process itself might lose discriminative, semantically meaningful information.

The two exceptions are FRCNN and CenterNet, where deraining algorithms can help detection slightly, particularly on the RID dataset. However, the overall mAP results of FRCNN are often the worst or second worst. That implies a strong domain mismatch, suggesting that FRCNN results might not be as reliable an indicator of deraining performance as the others. In contrast, when combined with a deraining algorithm, CenterNet is almost always the best detection method on both the RID and RIS datasets. In particular, cascading DAF-Net before CenterNet achieves better detection results on the RID set than the other combinations, and cascading DID-MDN before CenterNet obtains the best detection results on the RIS set. This demonstrates that a rain-density-aware densely connected convolutional neural network can be applied to surveillance images in rain.

Both the RID and RIS results in Tables 8 and 9 show that YOLO-V3 achieves the best detection performance, independently of the deraining algorithm applied. Figures 3 and 4 show detections using YOLO-V3 on the respective rainy images and their derained results for all deraining algorithms considered in this comparison. Since both RID and RIS contain many small objects due to their relatively long distance from the camera, we believe that YOLO-V3 benefits here from its multi-scale prediction structure, which is known to improve small object detection dramatically (Redmon and Farhadi 2018).

We further notice a weak correlation when comparing the mAP results with the full- and no-reference evaluation results on the RID (Tables 8, 10) and RIS (Tables 9, 11) images. Taking STL (Wei et al. 2019) as an example: despite obtaining the highest SSEQ, NIQE, and BLIINDS-II scores on the RIS dataset in Table 11, STL has almost the lowest mAP values among all deraining approaches across all detection models in Table 9. The main reason may be that the unsupervised training strategy in Wei et al. (2019) outputs images with sharp edge and contrast information close to real-world images, matching the features (e.g., high-frequency details and image edges) favored by no-reference metrics. However, the derained results by STL contain some blocking artifacts, as shown in Fig. 4g, which lead to lower detection results. Besides, the two best deraining competitors in terms of the detection metric (DAF-Net and DID-MDN) did not achieve the best result on any no-reference evaluation metric.

All the results of the real-world data experiments, in terms of both the no-reference and the proposed task-specific metrics, demonstrate that deraining can be further complicated when entangled with other practical degradations. There is no single metric in perfect agreement with the human subjective scores. Therefore, when designing a deraining algorithm, one needs to be clear about the end purpose: no-reference metrics are more appropriate for measuring the visual quality of real-world images, while the proposed task-specific metric is more reliable for high-level machine task performance.

6 Conclusions and Future Work

This paper proposes a new large-scale benchmark and presents a thorough survey of state-of-the-art single image deraining methods. Based on our evaluation and analysis, we present overall remarks and hypotheses below, which we hope can shed some light on future deraining research:

  • Rain types are diverse and call for specialized models. Certain models or components have proven promising for specific rain types, e.g., rain detection/attention, GANs, and priors like the patch-level GMM. We also advocate a combination of appropriate priors and data-driven methods. While the state-of-the-art image deraining methods can recover satisfactory sharp images on the standard benchmark datasets, they tend to fail on real-world rainy images. The main reason is that real-world images are often degraded by several factors beyond a single rain type, such as low resolution, low light, noise, and blur (Kupyn et al. 2019). To deal with real, complicated, varying rain, one might need to consider a mixture-of-experts model. Another practically useful direction is to develop scene-specific deraining, e.g., for traffic surveillance views.

  • There is no single best deraining algorithm under all metrics. The most popular evaluation metrics for image deraining are still PSNR and SSIM. They directly compare the pixel differences between derained images and the ground-truths when available. However, PSNR and SSIM cannot measure the perceptual quality precisely. Therefore, when designing a deraining algorithm, one needs to be clear about its end purpose. In addition, since the classical perceptual metrics themselves might be problematic to evaluate deraining, developing new metrics could be as important as new algorithms.

  • Algorithms trained on synthetic paired data may generalize poorly to real data, especially on complicated rain types such as rain and mist. Semi-supervised learning (Wei et al. 2019), domain generalization (Chen et al. 2020), or unpaired training (Zhu et al. 2017; Jiang et al. 2019) can take advantage of real data even without clean ground truth. They can potentially boost no-reference metrics and could be interesting to explore. A recent work (Yasarla et al. 2020) seems to make meaningful progress along this direction.

  • Existing deraining algorithms are ineffective in dealing with different rain types due to domain gaps between synthetic training images and real-world rainy images, as the rain models (e.g., rain streak, raindrop, and rain and mist) are oversimplified. Therefore, we advocate more research attention on better model designs to handle rain in complex, mixed scenes.

  • No existing deraining method seems to directly help detection. This may encourage the community to develop new robust algorithms that account for high-level vision problems on real-world rainy images. On the other hand, robust detection in rain does not have to rely on a deraining pre-processing step; there are other options of the domain adaptation type, e.g., Chen et al. (2018), which we will discuss in future work.
