1 Introduction

Object detection in hazy scenes is important for outdoor vision systems, e.g. video surveillance and autonomous driving, yet it is an extremely challenging task. The challenges mainly come from two aspects. On the one hand, hazy images usually suffer from poor visual quality caused by low contrast, color distortion, blur, etc. [1], making it more difficult to discriminate objects of interest from background clutter. On the other hand, haze density varies tremendously in real-world applications, leading to variations in visual quality; these non-negligible intra-domain gaps make it hard for object detectors to converge.

A straightforward solution to hazy object detection is to first apply image dehazing and then perform object detection on the dehazed images. Most previous works follow this strategy, isolating dehazing from detection [2, 3]. Since dehazing methods are unable to fully recover the latent clear images, there is no guarantee that the dehazed images are optimal for object detection [2, 4]. From this perspective, it is favorable to solve the two tasks jointly, so as to obtain detection-friendly dehazed images and more accurate detection results. In [5], a unified pipeline is first proposed for hazy object detection. However, in their method, each model is designed to process one fixed density level and cannot handle density variations.

Another line of work uses domain adaptation techniques to tackle the task. These methods take clear images as the source domain and hazy images as the target domain, and then try to lift the target-domain performance to the source-domain level by closing the domain gap via feature alignment [6, 7]. However, in practice the domain gap is too large to handle, and the problem becomes even more complex when significant intra-domain gaps also exist within the hazy domain.

In this paper, we deal with the two challenges mentioned above in one coherent framework by taking advantage of both lines of work. We perform image dehazing to reduce the clear-hazy domain gap and then use the simplest domain adaptation method, fine-tuning, to adapt a detector trained on the clear domain to the dehazed domain. Throughout the procedure, we account for the intra-domain differences of hazy images by separating feature extraction for different haze density levels. Specifically, we propose a Unified density-aware Dehazing and Detection (UDnD) framework to solve image dehazing and object detection jointly. First, a modified VGG-Net [8] is introduced to predict haze density from hazy residuals. Then, we design a density switch module to multiplex different haze levels. For dehazing, we modify HRNetV2 [9] by up-sampling with transposed convolutions [10] and summing up features from different scales. The object detector then takes the dehazed image as input and switches to the branch dictated by the density level.

The contributions of this work are as follows:

  • We propose the UDnD framework to jointly solve dehazing and detection. It is the first to deal with both the inter-domain and intra-domain gaps in image dehazing and hazy object detection, allowing the two tasks to mutually benefit.

  • We build a residual-aware classifier that predicts haze density levels to assist image dehazing and object detection. To the best of our knowledge, we are the first to explicitly predict haze density as prior knowledge for Convolutional Neural Networks (CNNs).

  • A novel dehazing method HRDN is introduced, which sums multi-resolution representations to recover finer details. Guided by haze levels, HRDN is able to integrate density-specific knowledge into the network so as to divide and conquer single image dehazing.

  • Experiments are conducted on two real-world hazy datasets, where the proposed UDnD outperforms the vanilla detector and the density-unaware counterparts. We also evaluate our dehazing method on two synthetic datasets, showing better performance than previous state-of-the-art methods. These results demonstrate that our unified framework can handle the two types of domain gaps and give more accurate detection results in real hazy conditions.

2 Related Work

Since we address the problem of hazy object detection by unifying single image dehazing and multi-domain learning, we review related work in these three areas.

2.1 Hazy Object Detection

The performance of object detection has been greatly boosted by deep learning. Many CNN-based detectors have been proposed during the past few years, including Faster R-CNN (FRCNN) [11], FPN [12], YOLO [13] and SSD [14]. Although these models obtain satisfactory performance under clear-weather conditions, none of them works seamlessly in hazy scenes without some kind of adaptation.

An intuitive idea for solving hazy object detection is to adopt a two-stage approach, i.e. performing dehazing and detection separately. Following this strategy, Li et al. [2] study the effect of dehazing on various detectors. They find that applying image dehazing as pre-processing is not very helpful and sometimes even harms performance. In [3] and [4], similar conclusions are drawn for semantic segmentation and image classification. The main reason is that existing dehazing methods are not good enough to reconstruct high-quality clear images for subsequent high-level vision tasks [4]. To address this issue, Li et al. [5] jointly optimize dehazing and detection, achieving better results than traditional two-stage approaches on synthetic images. Though our method is also trained on synthetic images, we demonstrate end-to-end performance on real-world data, and our haze-density-specific gating improves on their results.

On the other hand, some methods adapt a detector from the clear domain to the hazy domain for hazy object detection. They typically measure the distance between the feature distributions of the two domains and then train a feature extractor to minimize that distance. Inspired by [15], recent works measure the distance by learning a domain classifier in an adversarial manner [6, 7, 16,17,18,19]. Chen et al. [6] present a Domain Adaptive Faster R-CNN (DA-FRCNN) to tackle image-level and instance-level domain shifts. [7] proposes to align the features of regions containing objects. However, these methods do not consider the intra-domain gaps in the target hazy domain, which are induced by density variations.

Our unified density-aware framework integrates ideas from both sides. The dehazing and the detection sub-networks are jointly optimized. In particular, we alleviate the intra-domain gaps by utilizing density levels. Following previous methods, we use FRCNN as the baseline detector in our experiments, but in principle our method can be applied to any CNN-based detector.

2.2 Single Image Dehazing

Early dehazing methods stick to the standard optical model [20] and rely on hand-crafted priors [21,22,23,24,25,26]. Instead of manually designing features, CNN-based methods learn mappings directly from synthetic data. They usually estimate the transmission map and the atmospheric light, separately or jointly, as intermediate results, and then invert the optical model [5, 27,28,29,30,31,32,33,34,35]. However, estimating transmission in hazy scenes is an ill-posed problem, and it gets even worse when the colors of objects are similar to that of the atmospheric light [36]. Therefore, some methods [36,37,38] recover haze-free images directly via end-to-end frameworks, without relying on the optical model.

The intra-domain gaps, i.e. haze density variations, cannot be ignored [3]. Some efforts have been made to incorporate haze density analysis into the dark channel prior [39, 40]. Dai et al. [41] train an AlexNet [42] to regress the attenuation coefficient. Recently, [30] uses multiple network stages to progressively estimate the transmission map and fuses the outputs from different stages, each of which is supervised by synthetic transmission maps of a fixed density level.

Instead, we handle the intra-domain gaps by explicitly predicting the haze density level and using it as prior knowledge for our end-to-end dehazing network.

Fig. 1. An overview of the proposed unified density-aware image dehazing and object detection framework. It takes a real-world hazy image as input and first predicts its haze density level via a residual-aware classifier. The predicted density is then fed into a density switch module, which is used for multi-domain learning in the subsequent tasks. The whole network is optimized end-to-end.

2.3 Multi-Domain Learning

Multi-domain learning refers to learning effective representations for data from distinct domains [43]. It can be achieved by setting shared and domain-specific parameters, which resembles domain adaptation [44, 45]. Previous works build domain-specific Batch Normalization (BN) layers [46] on otherwise shared networks [47,48,49]. Inspired by Squeeze-and-Excitation (SE) networks [50], [51] introduces a data-driven SE adapter to adjust network activations.

In this work, we consider different haze levels as distinct domains and propose a density switch module to recalibrate features based on the haze density.

3 Proposed Method

In this section, we will first provide an overview of our proposed method and then explain each component in more detail.

3.1 Overview

We propose a coherent framework UDnD to jointly optimize image dehazing and object detection. Our method consists of three parts: a haze density classifier f, a dehazing module \( DH \) and a detection module \( DT \). The classifier assigns each hazy image \(x^{h}\) a density level \(\hat{d}=f(x^{h})\). The dehazing module maps \(x^{h}\) to the latent clear image \(\hat{x^{c}} = DH (x^{h}, \hat{d}; \theta _{ DH })\), with \(\hat{d}\) as domain knowledge. The detector takes \(\hat{x^{c}}\) and \(\hat{d}\) as input and outputs a structured prediction \(\hat{y} = DT (\hat{x^{c}}, \hat{d}; \theta _{ DT })\). Overall, our pipeline can be formulated as

$$\begin{aligned} \hat{y} = DT ( DH (x^{h}, f(x^{h}); \theta _{ DH }), f(x^{h}); \theta _{ DT }), \end{aligned}$$
(1)

where \(x^{h}\) is the hazy image and \(\hat{y}\) is the detection result. We only presume \( DH \) and \( DT \) to be differentiable and assume nothing about f beyond providing discrete labels. The entire architecture is illustrated in Fig. 1.
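For concreteness, the composition in Eq. 1 can be sketched in PyTorch-style code as below. This is a minimal sketch under our own naming: the constructor arguments are placeholders for the components of Sects. 3.2-3.4, not the exact classes of our implementation.

```python
import torch
import torch.nn as nn

class UDnD(nn.Module):
    """Sketch of the unified pipeline of Eq. 1 (module names are illustrative)."""

    def __init__(self, classifier: nn.Module, dehazer: nn.Module, detector: nn.Module):
        super().__init__()
        self.f = classifier   # residual-aware haze density classifier (frozen, Sect. 3.2)
        self.DH = dehazer     # density-aware dehazing network HRDN (Sect. 3.3)
        self.DT = detector    # density-aware multi-domain detector (Sect. 3.4)

    def forward(self, x_hazy: torch.Tensor):
        # f only supplies a discrete domain label; no gradients flow into it.
        with torch.no_grad():
            d_hat = self.f(x_hazy).argmax(dim=1)   # density level in {0, ..., C-1}
        x_clear_hat = self.DH(x_hazy, d_hat)       # dehazed image
        y_hat = self.DT(x_clear_hat, d_hat)        # structured detection output
        return x_clear_hat, y_hat
```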

Let \(x^{h}\) be a hazy image from the training set, with clear ground truth \(x^{c}\) and object detection annotations y. The overall loss function for our UDnD is

$$\begin{aligned} \begin{aligned} \mathcal {L}(x^{h}, x^{c}, y; \theta _{ DH }, \theta _{ DT })&= \lambda \mathcal {L}_{ dehazing }(x^{c}, DH (x^{h}, f(x^{h}); \theta _{ DH })) \\&+ \mu \mathcal {L}_{ detection }(y, DT (\hat{x^{c}}, f(x^{h}); \theta _{ DT })), \end{aligned} \end{aligned}$$
(2)

where \(\hat{x^{c}} = DH (x^{h}, f(x^{h}); \theta _{ DH })\) is the dehazed result of \(x^{h}\). We use two weights \(\lambda \) and \(\mu \) to balance the reconstruction term (\(\mathcal {L}_{ dehazing }\)) and the task-driven term (\(\mathcal {L}_{ detection }\)), which are described in Sect. 3.3 and Sect. 3.4, respectively. Note that the term \(\hat{x^{c}}\) guarantees that the dehazing sub-network is supervised by the detection loss as long as \(\mu \) is non-zero, while the dehazing loss does not directly affect the detection sub-network. The haze density classifier is used only for extracting prior knowledge in our setting and is therefore not updated by Eq. 2; this also means our framework is compatible with prior-based density estimation methods [52].
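A single training step under Eq. 2 then looks as follows. This sketch assumes the UDnD module above; dehazing_loss and detection_loss are hypothetical stand-ins for Eq. 5 and Eq. 6.

```python
def training_step(model, optimizer, x_hazy, x_clear, targets, lam=1.0, mu=1.0):
    """One joint update of theta_DH and theta_DT (Eq. 2); the classifier stays frozen."""
    x_clear_hat, y_hat = model(x_hazy)
    # The detection term back-propagates into DH through x_clear_hat,
    # whereas the dehazing term does not touch DT.
    loss = lam * dehazing_loss(x_clear, x_clear_hat) + mu * detection_loss(targets, y_hat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```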

3.2 Residual-Aware Haze Density Classifier

The standard optical model [20] formulates the hazing process as

$$\begin{aligned} x^{h}(i) = x^{c}(i)t(i) + L(1 - t(i)), \end{aligned}$$
(3)

where \(x^{h}(i)\) is the observed hazy image at pixel location i, \(x^{c}(i)\) is the clear scene radiance, and L is the atmospheric light. The transmission map t(i) is obtained from the distance \(\ell (i)\) between the scene and the camera lens by \(t(i) = \exp (- \beta \ell (i))\). A larger attenuation coefficient \(\beta \) indicates denser haze. For homogeneous haze, the Meteorological Optical Range (MOR) [53], i.e. visibility in meters, depends on \(\beta \) through \( MOR = \frac{2.996}{\beta }\). Since hazy conditions correspond to MOR below 1 km by definition, it follows that \(\beta \ge 2.996 \times 10^{-3}\,\mathrm{m}^{-1}\), where the equality holds for the lightest haze.
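For illustration, Eq. 3 also prescribes how hazy training images can be synthesized from clear images and depth; a minimal sketch, assuming images scaled to [0, 1] and depth given in meters:

```python
import numpy as np

def synthesize_haze(x_clear: np.ndarray, depth_m: np.ndarray,
                    beta: float = 0.01, L: float = 1.0) -> np.ndarray:
    """Apply the optical model of Eq. 3: x_h(i) = x_c(i) * t(i) + L * (1 - t(i))."""
    t = np.exp(-beta * depth_m)   # transmission t(i) = exp(-beta * l(i))
    t = t[..., None]              # broadcast over the RGB channels
    return x_clear * t + L * (1.0 - t)

# Larger beta means denser haze; e.g. FC-DBF uses beta in {0.005, 0.01, 0.02} (Sect. 4.1).
```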

We formulate haze density estimation as a classification problem. The predicted density should satisfy \(\hat{d} \in \{1, \dots , C\}\), where C is the total number of predefined density levels. Following [41], we set three levels: light, moderate and heavy; but our method can be extended to finer granularity given proper datasets. The haze density serves as domain label, guiding the update of domain-specific parameters in the subsequent dehazing and detection networks.

Inspired by [54], we observe that the residual of a hazy image, i.e. its difference from the clear counterpart, is informative, because the hazy image is a weighted sum of the clear image and the atmospheric light according to Eq. 3. Therefore, we propose a residual-aware haze density classifier that exploits the hazy residual. The details are depicted in Fig. 1. We stack 3 residual blocks [55] to estimate the residual, which is concatenated with the original hazy image to yield the 6-channel input of a modified VGG-16 [8].

Loss Function. The density classifier is optimized through a joint loss function:

$$\begin{aligned} \mathcal {L}_{ classification } = \alpha \mathcal {L}_{ res } + \mathcal {L}_{ cls }, \end{aligned}$$
(4)

where \(\mathcal {L}_{ res }\) is the \(L_{1}\) loss for residual regression and \(\mathcal {L}_{ cls }\) is the cross-entropy loss for density classification. Additionally, \(\alpha \) is used to balance the two tasks and is set to 0.2 in our experiments.
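A minimal sketch of the classifier and the joint loss of Eq. 4 is given below. The residual head is shown as a plain convolutional stack standing in for the 3 residual blocks, and the VGG-16 surgery is illustrative of, not identical to, our modification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ResidualAwareClassifier(nn.Module):
    """Estimates the hazy residual, then classifies density from the 6-channel stack."""

    def __init__(self, num_levels: int = 3):
        super().__init__()
        # Stand-in for the 3 residual blocks that regress the residual (hazy - clear).
        self.res_head = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),
        )
        vgg = torchvision.models.vgg16(weights=None)
        vgg.features[0] = nn.Conv2d(6, 64, kernel_size=3, padding=1)  # 6-channel input
        vgg.classifier[-1] = nn.Linear(4096, num_levels)              # C density levels
        self.vgg = vgg

    def forward(self, x_hazy):
        residual = self.res_head(x_hazy)
        logits = self.vgg(torch.cat([x_hazy, residual], dim=1))
        return residual, logits

def classification_loss(residual, logits, residual_gt, level_gt, alpha: float = 0.2):
    """Eq. 4: L1 residual regression plus cross-entropy density classification."""
    return alpha * F.l1_loss(residual, residual_gt) + F.cross_entropy(logits, level_gt)
```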

3.3 Density-Aware High-Resolution Dehazing Network

As illustrated in Fig. 1, we propose a density-aware High-Resolution Dehazing Network (HRDN). The backbone is based on HRNetV2 [9], which maintains high-resolution representations and conducts repeated fusion to encourage interaction between multi-scale features. Different from HRNetV2, we use only 4 basic residual blocks in each network stage to prevent overfitting. The down-sampling operations in the stem are removed because dehazing is a pixel-level dense regression task. Moreover, we replace all the bilinear up-sampling units with transposed convolutions [10] to recover more details. To enforce a coarse-to-fine reconstruction process, we fuse the up-sampled features by summing them instead of performing channel-wise concatenation as in HRNetV2, which also reduces the computation cost as a side effect.
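The two backbone changes, transposed-convolution up-sampling and summation fusion, amount to the following sketch for a single pair of branches (HRDN fuses more scales in the same fashion):

```python
import torch.nn as nn

class UpsampleFuse(nn.Module):
    """Fuse a low-resolution branch into a high-resolution one (simplified sketch).

    HRNetV2 up-samples bilinearly and concatenates channel-wise; HRDN instead
    up-samples with a transposed convolution and fuses by element-wise summation.
    """

    def __init__(self, low_ch: int, high_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(low_ch, high_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x_high, x_low):
        return x_high + self.up(x_low)   # sum instead of channel-wise concat
```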

Density Switch. With the predicted haze density level from Sect. 3.2 as prior knowledge, we handle density variations via multi-domain learning. Inspired by [51], we design a density switch module with multiple SE adapters, each corresponding to one density level. As Fig. 1 shows, the estimated haze level controls the density switches by specifying which branch to take and which parameters to update, thus separating feature extraction for different densities. We add density switches before the 2nd, 3rd and 4th stages of HRDN, enabling the network to divide and conquer the intra-domain gaps.
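A minimal sketch of the density switch, with one SE adapter per density level and per-sample routing by the predicted level (only the selected adapter is exercised and hence updated):

```python
import torch
import torch.nn as nn

class SEAdapter(nn.Module):
    """Squeeze-and-excitation channel recalibration [50]."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze: global average pooling
        return x * w[:, :, None, None]     # excite: channel-wise re-weighting

class DensitySwitch(nn.Module):
    """Routes each image through the SE adapter of its predicted density level."""

    def __init__(self, channels: int, num_levels: int = 3):
        super().__init__()
        self.adapters = nn.ModuleList(SEAdapter(channels) for _ in range(num_levels))

    def forward(self, x, d_hat):
        out = torch.empty_like(x)
        for i, adapter in enumerate(self.adapters):
            mask = d_hat == i              # samples predicted as density level i
            if mask.any():
                out[mask] = adapter(x[mask])
        return out
```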

Loss Function. Existing dehazing methods utilize various loss functions, such as \(L_{1}\) loss [37], MSE loss [5], smooth \(L_{1}\) loss [31], perceptual loss [35], and adversarial loss [38], and weighted combinations of them are widely adopted. Despite the performance improvements, complicated loss functions increase the burden of hyper-parameter tuning and make the model harder to converge. Inspired by [56], we empirically find that a single SSIM loss works well:

$$\begin{aligned} \mathcal {L}_{ dehazing }(x^{c}, \hat{x^{c}}) = 1 - SSIM (x^{c}, \hat{x^{c}}), \end{aligned}$$
(5)

where \(x^{c}\) denotes the ground truth clear image and \(\hat{x^{c}}\) denotes the dehazed image. The constant 1 ensures that the loss value is non-negative.
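A one-line sketch of Eq. 5, delegating the SSIM computation to the third-party pytorch_msssim package (assuming images scaled to [0, 1]; any differentiable SSIM implementation would do):

```python
import torch
from pytorch_msssim import ssim  # pip install pytorch-msssim

def dehazing_loss(x_clear: torch.Tensor, x_clear_hat: torch.Tensor) -> torch.Tensor:
    """Eq. 5: 1 - SSIM; non-negative because SSIM is at most 1."""
    return 1.0 - ssim(x_clear_hat, x_clear, data_range=1.0)
```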

Fig. 2. Shift of mean values over convolutional activations of a vanilla FRCNN on various haze density levels. The vertical axis shows the difference between hazy (dehazed) and clear images. The horizontal axis gives the layer index. "*" indicates the dehazed images produced by our HRDN. The detector is trained on clear images. After dehazing, the activations become more similar to those of clear images, but certain intra-domain gaps remain.

3.4 Density-Aware Multi-Domain Object Detector

Although the hazy images have been processed by our dehazing network to reduce the inter-domain gaps, non-negligible intra-domain gaps, caused by haze density variations, still exist among the dehazed images. We provide evidence by observing the convolutional activations of a vanilla FRCNN detector on the validation set of Foggy Cityscapes-DBF [41]. We collect the mean activations [51] for images of different densities, compute their differences from those of clear images before and after dehazing, and take these differences as domain gap measurements. A comparison is shown in Fig. 2. We make the following observations: (1) Prior to dehazing, the inter-domain gaps increase monotonically with haze density. (2) The inter-domain gaps are significantly reduced by dehazing, so dehazing serves as an effective pre-processing step. (3) Even after dehazing, intra-domain gaps remain between images of different density levels; these gaps need to be handled by the object detector. (4) The differences vary across layers. The first layers, which learn basic feature detectors such as edges and corners, exhibit a considerable amount of shift. In other words, the domain gaps are not properly handled at the very beginning; they propagate forward and result in poor final detection results.
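The measurement behind Fig. 2 can be reproduced with forward hooks; a sketch of collecting per-layer mean activations (the choice of hooked layers is illustrative):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def layer_mean_activations(model: nn.Module, images: torch.Tensor) -> list:
    """Record the mean activation of every conv layer for one batch of images."""
    means, hooks = [], []

    def hook(_module, _inputs, output):
        means.append(output.mean().item())

    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            hooks.append(m.register_forward_hook(hook))
    model(images)
    for h in hooks:
        h.remove()
    return means

# The per-layer gap in Fig. 2 is the difference between these means on hazy
# (or dehazed) images and on their clear counterparts, per density level.
```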

In this work, we address the intra-domain gaps in the dehazed domain via multi-domain learning. Following [5, 6], we build on FRCNN. As shown in Fig. 1, we introduce a density-aware multi-domain object detector by appending a density switch module to the ResNet-50 [55] backbone and FPN [12]. Through the density switch, the detector routes images of different densities to the desired branches, where different channel weights are computed to adjust the features. The weighted features are then fed into the Region Proposal Network (RPN) with density-specific information encoded.

Loss Function. We employ RoI Align [57] to obtain the corresponding feature vector for each proposal from the RPN. Finally, the category label is predicted by an RoI-wise classifier. The loss function of our multi-domain detector is inherited from the vanilla FRCNN [11] for simplicity:

$$\begin{aligned} \mathcal {L}_{ detection } = \mathcal {L}_{ rpn } + \mathcal {L}_{ roi }. \end{aligned}$$
(6)

Both the RPN loss (\(\mathcal {L}_{ rpn }\)) and the RoI loss (\(\mathcal {L}_{ roi }\)) consist of a classification term and a localization term, which are cross-entropy loss and smooth \(L_{1}\) loss, respectively.

4 Experiments

In this section, we first briefly introduce the datasets and evaluation metrics used in our experiments, followed by the implementation details. After that, we provide a comparison against other hazy object detection methods to demonstrate the effectiveness of our UDnD framework. We also report the performance of the proposed HRDN compared with state-of-the-art dehazing methods. Finally, we conduct ablation studies on domain adaptation techniques, loss functions, and unified training strategies.

4.1 Datasets

The object detectors are trained on synthetic hazy images generated using Eq. 3 but evaluated on real hazy images. The dehazing methods, in contrast, are evaluated on synthetic data, for which alone ground truth is available.

Synthetic Datasets

OTS and SOTS-outdoor. RESIDE [2] contains both indoor and outdoor hazy scenes. We adopt the Outdoor Training Set (OTS) and the outdoor subset of the Synthetic Objective Testing Set (SOTS-outdoor), and ensure through data cleaning [31] that the ground truth clear images in OTS do not overlap with those in SOTS-outdoor. The cleaned OTS has 296,695 hazy images, generated from 8,477 clear images with atmospheric light \(L \in [0.8, 1.0]\) and attenuation coefficient \(\beta \in [0.04, 0.2]\). SOTS-outdoor has 500 hazy images.

Foggy Cityscapes-DBF. Foggy Cityscapes-DBF (FC-DBF) [41] derives from Cityscapes [58] and consists of a large and diverse set of urban street hazy scenes. There are a total of 8,925 images for training and 1,500 images for validation, both equally divided into three density levels (\(\beta \in \{0.005, 0.01, 0.02\}\)). We follow the screening criteria in [3] and use the selected 1,650 (\(550 \times 3\)) high-quality synthetic hazy images to fine-tune the object detectors. This dataset is denoted as FC-DBF-refine. The bounding box annotations of these hazy images are automatically inherited from their clear-weather counterparts.

Fig. 3. Haze density distributions and class distributions of Foggy Driving (FD) and Foggy Zurich-test* (FZ*). Both datasets are composed of real-world hazy images. The density levels in (a) are predicted by our residual-aware haze density classifier. In (b), the relatively low number of instances in classes except car and person is not a surprise because hazy weather discourages road traffic.

Real-World Datasets

Foggy Driving. Foggy Driving (FD) [3] is a collection of 101 hazy images of driving scenes, among which 51 are captured in various areas of Zurich with a cell phone camera and the others are collected from the Web.

Foggy Zurich. Foggy Zurich [41] comprises 3,808 video frames depicting hazy road scenes in Zurich and its suburbs. Different from FD, these images are collected with a GoPro Hero 5 camera. We manually select 400 images of diverse scenes and haze densities and annotate them carefully to create a new test set, namely Foggy Zurich-test* (FZ*). The statistics of FD and FZ* are shown in Fig. 3; both include various haze density levels. In particular, FZ* has significantly more annotated objects than FD and can thus serve as a more convincing test set.

4.2 Evaluation Metrics

For hazy object detection, we adopt Average Precision (AP) and mean Average Precision (mAP), the average of AP over all classes. Additionally, the mean of the AP scores over the two most frequent classes (car and person) is reported as mAP* to evaluate the detectors from a more practical perspective.

For dehazing, Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) [59] are used as standard quality measures. In addition, we measure the effect of dehazing on detection with mAP.

Table 1. Comparison of different hazy object detection methods on Foggy Driving (FD) and Foggy Zurich-test* (FZ*) w.r.t. mAP (%). For training, "Clear" is clear-weather Cityscapes, "Syn. Hazy" is FC-DBF-refine, and "Real Hazy" is a subset of unlabelled Foggy Zurich. DA, UT and DL denote domain adaptation, unified training and density levels, respectively. Bold indicates the best results.

4.3 Implementation Details

We take the FRCNN model pre-trained on Cityscapes by MMDetection [60, 61] as the baseline and initialization. The dehazing sub-network is pre-trained on OTS. First, we implement an improved version of the two-stage approach by freezing the dehazing part of our UDnD framework. We name this pipeline Dehazing and Detection (DnD) and fine-tune it on FC-DBF-refine for 9 epochs. The SGD algorithm [62] is employed with a mini-batch size of 1 and an initial learning rate of 0.001, which decays polynomially. For UDnD, the input images are randomly cropped and resized to \(512 \times 512\). The dehazing and the detection sub-networks are jointly optimized, starting from the DnD model, with the same strategy. We set \(\lambda = 1\) and \(\mu = 1\) in Eq. 2 via cross-validation.

For dehazing, the models are trained from scratch with \(256 \times 256\) image patches. Adam [63] is used for optimization with a mini-batch size of 4. The initial learning rate is 0.0001. We train the models for up to 100 epochs on FC-DBF and adopt the cosine annealing schedule [64]. Consistent with [31], we train the networks on OTS with a patch size of \(240 \times 240\) for 10 epochs and decay the learning rate by half every 2 epochs. Our code and trained models are available at https://github.com/xiqi98/UDnD.
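For reference, the two optimization set-ups map onto the stock PyTorch optimizers and schedulers roughly as below; the momentum value and the placeholder models are assumptions made for the sake of a runnable sketch.

```python
import torch
import torch.nn as nn

udnd_model = nn.Linear(8, 8)   # placeholder for the full UDnD network
hrdn_model = nn.Linear(8, 8)   # placeholder for the dehazing sub-network

# Detection fine-tuning: SGD, mini-batch 1, lr 0.001 with polynomial decay, 9 epochs.
det_opt = torch.optim.SGD(udnd_model.parameters(), lr=1e-3, momentum=0.9)  # momentum assumed
det_sched = torch.optim.lr_scheduler.PolynomialLR(det_opt, total_iters=9)

# Dehazing on FC-DBF: Adam, mini-batch 4, lr 0.0001, cosine annealing, 100 epochs.
dh_opt = torch.optim.Adam(hrdn_model.parameters(), lr=1e-4)
dh_sched = torch.optim.lr_scheduler.CosineAnnealingLR(dh_opt, T_max=100)
```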

4.4 Comparison on Object Detection in Real-World Hazy Scenes

We choose three methods for comparison: vanilla FRCNN [11], a baseline trained on clear images only; DA-FRCNN [6], a state-of-the-art method using domain adaptation; and JAOD-FRCNN [5], the only previous method jointly solving image dehazing and object detection.

Fig. 4. Examples of object detection results in real-world hazy scenes. Green boxes are true positives and red ones are missed objects. In addition to detection, JAOD-FRCNN and the proposed UDnD also produce dehazed images. JAOD-FRCNN tends to over-dehaze and produce artifacts that affect detection, while the vanilla FRCNN and DA-FRCNN miss some difficult objects. Our UDnD achieves a higher recall for hazy object detection.

We conduct evaluations on FD and FZ* and report the results in Table 1, from which we make the following observations: (1) The vanilla FRCNN obtains the lowest mAP on both test sets, as it is trained on clear images only, without any access to hazy images; this indicates that involving hazy images in training helps. (2) Adapting the detector from clear to hazy images boosts performance: for DA-FRCNN, the result on FD improves by more than 4 points over the baseline when both synthetic and real-world data are used for training. Note that domain adaptation methods need real hazy images to reach high performance. (3) JAOD-FRCNN achieves performance comparable to DA-FRCNN, showing that dehazing also serves as an effective way to reduce the inter-domain gaps. (4) Our method UDnD, which incorporates haze density information, obtains the best results. Specifically, it outperforms the vanilla FRCNN by \(\sim \)6 and \(\sim \)5 points on FD and FZ*, respectively, and it also improves on the results of DA-FRCNN, showing the benefit of a unified pipeline. Compared to JAOD-FRCNN, the \(\sim \)3-point gain on FZ* demonstrates that the density switch module helps deal with the intra-domain differences and thus leads to better detection performance.

The qualitative comparison is illustrated in Fig. 4. We can see UDnD is better at handling small objects, occluded objects and dense haze, which demonstrates its effectiveness in real-world hazy object detection.

Table 2. Comparison with state-of-the-art dehazing methods. To evaluate the effect of dehazing on object detection, we use a vanilla FRCNN based on Cityscapes to process the dehazed images produced by each model on OTS, and report mAP (%). Bold indicates the best results.
Table 3. Effect of different domain adaptation techniques w.r.t. mAP and mAP* (%), where mAP* is the mean of AP over car and person. These experiments are conducted using our DnD pipeline. Bold indicates the best results.

4.5 Comparison with State-of-the-Art Dehazing Methods

Our residual-aware haze density classifier achieves 97.60% accuracy on the validation set of FC-DBF, improving upon [41] by 3.33 points.

The proposed HRDN is evaluated against state-of-the-art dehazing methods [5, 21, 23, 31, 35, 37, 38] in terms of dehazing and detection performance. The results are shown in Table 2. HRDN achieves the best performance on the two dehazing datasets, with PSNR margins of \(+1.12\) dB and \(+2.01\) dB over the second-best methods on SOTS-outdoor and the validation set of FC-DBF, respectively.

Meanwhile, we use mAP as an additional task-driven metric. Hazy images from FD and FZ* are pre-processed by each dehazing model trained on OTS before being fed into the vanilla FRCNN. From the last two columns of Table 2, we can see that our HRDN obtains the best mAP on both test sets. Some methods, e.g. FFA-Net, tend to overfit the synthetic training data and thus obtain relatively low mAP on the two real-world datasets; other methods like DCPDN fall short on image dehazing itself.

To summarize, our dehazing method HRDN not only outperforms previous methods in terms of PSNR and SSIM, but is also more helpful for high-level vision tasks, such as object detection.

4.6 Ablation Study

Domain Adaptation. We study three techniques for tackling domain gaps, namely dehazing, fine-tuning and the density switch, based on our DnD pipeline. Table 3 shows their effects applied sequentially. Dehazing and fine-tuning deal with the clear-hazy gaps, and the density switch module handles haze density variations. Each brings a performance gain of 1-2 points, confirming that UDnD can handle both types of domain gaps.

Table 4. Comparison of loss functions on the validation set of FC-DBF. The density switch modules in HRDN are disabled. Bold indicates the best results.
Fig. 5. Effect of different dehazing loss functions and \(\lambda : \mu \) ratios on the detection performance of UDnD. We report mAP (%) on Foggy Zurich-test*.

Loss Function. In Fig. 5, we study the effect of different dehazing loss functions on the final detection performance and find that a single SSIM loss works well. However, when we investigate the dehazing task with the density switches in HRDN disabled, we observe that a combination of \(L_{1}\) loss and SSIM loss achieves the best results, as shown in Table 4. This indicates a mismatch between image dehazing and object detection w.r.t. the optimization goal, which explains why traditional two-stage methods fail in hazy object detection. We argue that haze mainly affects color, which explains the wide application of \(L_{1}\) loss and MSE loss in dehazing methods; but for detection, structure matters more than color. Hence, SSIM loss stands out as a good objective for both tasks because of its emphasis on structural information.

Table 5. Effect of updating different sub-networks of UDnD. We report mAP and APs (%) over all classes of Foggy Zurich-test*. Bold indicates the best results.

Unified Training. We evaluate our unified training strategy by freezing the weights of different pre-trained sub-networks while keeping the same loss. The results in Table 5 show that disabling the joint optimization of the dehazing and the detection sub-networks leads to a performance drop of \(\sim \)2 points in mAP, demonstrating the importance of a unified framework.

5 Conclusion

We have presented a Unified density-aware Dehazing and Detection (UDnD) framework for image reconstruction and object detection in hazy conditions, motivated by the ideas of jointly optimizing the two tasks and exploiting haze density as prior knowledge. We propose a residual-aware classifier to estimate haze density, a density-aware High-Resolution Dehazing Network (HRDN) to divide and conquer various hazy scenarios, and a density-aware multi-domain object detector to tackle the final detection task. These collectively constitute a unified pipeline for hazy object detection. Experiments demonstrate the effectiveness of each module and of the entire framework in real-world hazy scenes.