
1 Introduction

Particles in the atmosphere absorb and scatter the light reflected from objects, resulting in poor image visibility, which hinders the performance of high-level computer vision tasks [1]. Hence, as a key prerequisite, single image dehazing has been widely studied over the last decade; existing methods can be roughly divided into model-based methods and model-free methods [2].

Traditional model-based methods estimate the unknown atmospheric light and transmission maps from statistical regularities of haze-free images, including the dark channel prior (DCP) [3], color-lines prior (CLP) [4], color attenuation prior (CAP) [5], and non-local dehazing (NLD) [6]. These methods achieve favorable dehazing results and generalize well, but tend to introduce color distortion and artifacts, since a single hypothesis cannot keep the parameter estimates accurate across diverse scenes. To this end, recent model-based methods utilize convolutional neural networks (CNNs) to estimate the atmospheric light and transmission maps separately [7, 8] or jointly [9, 10]. These learning-based methods estimate the parameters in a data-driven manner rather than with hand-crafted priors, and thus produce more visually pleasing images. However, the atmospheric scattering model is only an idealized model, which affects the convergence of the networks and limits the final dehazing performance [11].

More recently, learning-based methods [12,13,14,15,16] tend to avoid the atmospheric scattering model and adopt an end-to-end training strategy (directly building the mapping between hazy and haze-free images) to acquire high-quality results. However, owing to the large gap between the features of hazy images and their haze-free counterparts, model-free methods usually enlarge network depth and scale to strengthen feature extraction, which leads to heavy computational cost. Moreover, these methods often fail on real scenes, mainly because networks trained on synthetic datasets cannot handle the uneven haze distribution and complex illumination found in the real world. To this end, some works [17,18,19] combine prior-based and model-free methods to narrow the gap between the synthetic and real domains, achieving better dehazing results on real-world images.

In this paper, we propose a multi-priors guided dehazing network (MGDNet) based on knowledge distillation. Different from a recent work [20], we pretrain two teacher networks by minimizing the losses between hazy images and their supervision images (the dehazed results of the dark channel prior and non-local dehazing), and then teach a student network to learn their features by minimizing both feature-level and pixel-level distillation losses. Considering that the supervisions of the teacher networks contain some color distortion, we utilize the discrete wavelet transform (DWT) to separate the high-frequency and low-frequency components of the teacher outputs, and only use the high-frequency part to build the pixel-level distillation loss.

Comparative experiments on real-world hazy images show that our MGDNet performs favorably against the state of the art, which validates that guidance from partially correct teacher networks (whose supervisions are the dehazed images of DCP and NLD rather than ground truths) can effectively improve dehazing ability in real scenes. In addition, the negative information introduced by the teacher networks is refined during training, and the student network finally produces dehazed images with more vivid color.

2 Related Work

2.1 Model-Based Methods

Model-based methods estimate the atmospheric light and transmission maps, and then restore dehazed images via the atmospheric scattering model. Early model-based methods, also called prior-based methods, adopt statistical assumptions drawn from haze-free images to estimate the atmospheric light and transmission maps. For example, the dark channel prior (DCP) [3] assumes that clear RGB images have low intensity in at least one channel, and quickly obtains the two parameters from this observation. The color-lines prior (CLP) [4] constructs a local formation model that recovers the transmission map from the offset of color lines. The color attenuation prior (CAP) [5] builds a linear relationship among color, haze concentration, and scene depth to estimate the atmospheric light and transmission maps. Differently, NLD [6] estimates the transmission map using hundreds of distinct colors. The above prior-based methods dehaze favorably and generalize well to real scenes, but tend to cause artifacts, halos, and color distortion, since a single assumption cannot yield accurate atmospheric light and transmission estimates in all scenes. To this end, recent model-based methods estimate the atmospheric light and transmission maps with convolutional neural networks (CNNs). For example, some works estimate transmission maps with a stacked CNN [7] or a multiscale CNN [8]. Moreover, to avoid the cumulative error of two separate estimations, AOD-Net [9] uses a linear reformulation to combine the atmospheric light and transmission map into a single parameter \(K(x)\). DCPDN [10] embeds the atmospheric scattering model into a CNN and directly obtains dehazed images from the joint estimation of atmospheric light and transmission maps. However, the atmospheric scattering model, as a simplified mathematical model, cannot fully describe the formation of haze. Hence, model-based methods cannot produce high-quality results and still suffer from color and illumination changes.

2.2 Model-Free Methods

Model-free methods, also called end-to-end methods, directly establish the mapping between hazy and clear images instead of relying on the atmospheric scattering model. Owing to the large gap between the features of hazy and clear images, model-free methods often increase network depth and scale to strengthen feature extraction. For example, FFA [12] and DuRN [13] build deep networks based on residual blocks and directly recover dehazed images by merging features from convolutional layers at different depths. GFN [14] applies white balance (WB), contrast enhancing (CE), and gamma correction (GC) to derive three sub-images of the hazy input, and recovers dehazed images by fusing these sub-images with learned confidence maps. EPDN [15] obtains high-contrast results through adversarial training between a multiscale generator and a discriminator. MSBDN [16] adopts back-projection feedback to connect non-adjacent layers, which reduces the loss of spatial information during sampling and improves the resolution of the restored results. However, lacking knowledge of real-world haze, the above networks perform poorly in real scenes. To this end, DANet [17] builds a bidirectional translation network to address the domain adaptation problem and obtains visually pleasing results on both synthetic and real scenes. RefineDNet [18] embeds DCP into a CNN-based method and adopts adversarial training on unpaired real-world images to improve dehazing in both synthetic and real scenes. PSD [19] uses multiple prior losses to guide training, which yields high-contrast results in real scenes but tends to over-enhance images. Differently, KDDN [20] pretrains a reconstruction network for clear images and uses its intermediate features to guide the training of a dehazing network.

3 Proposed Method

As shown in Fig. 1, considering that the dark channel prior (DCP) and non-local dehazing (NLD) dehaze favorably in real scenes, we dehaze images with these two prior-based methods and use the results as fake ground truths to pretrain two teacher networks. During the training of the student network, the features of the teacher networks guide the student network so that it achieves a favorable dehazing effect in real scenes; a minimal sketch of this pretraining idea is given after Fig. 1.

Fig. 1. The architecture of the proposed MGDNet.
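The following is a minimal PyTorch sketch of the pretraining idea described above. The names pretrain_teacher, prior_dehaze, and loader, as well as the epoch count and learning rate, are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def pretrain_teacher(teacher, prior_dehaze, loader, epochs=30, lr=2e-4):
    """Pretrain one teacher network against 'fake ground truths' produced by a
    prior-based dehazer (DCP or NLD) instead of real clear images."""
    opt = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(epochs):
        for hazy in loader:
            fake_gt = prior_dehaze(hazy)              # J_DCP or J_NLD as supervision
            loss = F.l1_loss(teacher(hazy), fake_gt)  # L1 loss between teacher output and fake GT
            opt.zero_grad()
            loss.backward()
            opt.step()
    return teacher
```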

3.1 Teacher Networks

The DCP and NLD teacher networks share the same architecture, which is based on a classic encoder-decoder design. As shown in Fig. 1, we first extract features at four scales with an encoder E containing four convolutions. The first convolution preliminarily extracts features from the hazy input and changes its shape from \(256 \times 256 \times 3\) to \(256 \times 256 \times 64\), and the following three convolutions sequentially adjust the feature shape to \(128 \times 128 \times 128\), \(64 \times 64 \times 256\), and \(32 \times 32 \times 512\). Since [21] has shown that applying dilated convolutions in the bottleneck layers of an encoder-decoder structure can effectively alleviate artifacts, we design a smoothed dilated residual block (SDRB) and place two SDRBs in the bottleneck of each teacher network. After that, a decoder D, consisting of four deconvolutions, upsamples the features to the shapes of the corresponding encoder layers and finally outputs the dehazed image.
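One way to realize these layer shapes is with stride-2 convolutions and transposed convolutions, as in the PyTorch sketch below. The kernel sizes and the exact up/downsampling operators are assumptions (they are not specified above), and the bottleneck SDRBs are left as placeholders until they are described in the next paragraph.

```python
import torch
import torch.nn as nn

class TeacherNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder E: 256x256x3 -> 256x256x64 -> 128x128x128 -> 64x64x256 -> 32x32x512
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Bottleneck: two SDRBs (see the SDRB sketch below);
        # nn.Identity() keeps this snippet self-contained.
        self.bottleneck = nn.Sequential(nn.Identity(), nn.Identity())
        # Decoder D: four deconvolutions restoring the encoder resolutions.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 3, stride=1, padding=1),   # outputs the dehazed image
        )

    def forward(self, x):
        return self.decoder(self.bottleneck(self.encoder(x)))

# Shape check: TeacherNet()(torch.randn(1, 3, 256, 256)) has shape (1, 3, 256, 256).
```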

As shown in Fig. 2(a), the SDRB consists of two smoothed dilated convolutions (SDC) [22] and a residual connection [23], and each SDC contains a ShareSepConv, a \(3 \times 3\) convolution, and a ReLU function. The ShareSepConv (separable and shared convolution) acts as a preprocessing module that builds connections between non-adjacent regions and resolves the spatial discontinuity caused by the expanded receptive field; its theory and details can be found in [22]. The \(3 \times 3\) convolution uses a dilation rate of 2 to enlarge the receptive field and enhance the perception of global features. Finally, the ReLU function increases network nonlinearity, and the residual connection after the two SDCs enhances feature flow.
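A sketch of the SDRB under this description follows. The ShareSepConv here is a simplified stand-in (a single \(3 \times 3\) kernel shared across all channels); the exact formulation should follow [22].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShareSepConv(nn.Module):
    """Simplified stand-in for the separable-and-shared convolution of [22]:
    one 3x3 kernel shared across all channels (depthwise, weight-tied)."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        weight = torch.zeros(1, 1, kernel_size, kernel_size)
        weight[0, 0, kernel_size // 2, kernel_size // 2] = 1.0   # identity initialization
        self.weight = nn.Parameter(weight)

    def forward(self, x):
        c = x.size(1)
        k = self.weight.expand(c, 1, self.kernel_size, self.kernel_size).contiguous()
        return F.conv2d(x, k, padding=self.kernel_size // 2, groups=c)

class SDC(nn.Module):
    """Smoothed dilated convolution: ShareSepConv -> 3x3 conv (dilation 2) -> ReLU."""
    def __init__(self, channels=512):
        super().__init__()
        self.pre = ShareSepConv(3)
        self.conv = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)

    def forward(self, x):
        return F.relu(self.conv(self.pre(x)), inplace=True)

class SDRB(nn.Module):
    """Two SDCs followed by a residual connection, used in the teacher bottleneck."""
    def __init__(self, channels=512):
        super().__init__()
        self.body = nn.Sequential(SDC(channels), SDC(channels))

    def forward(self, x):
        return x + self.body(x)
```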

Fig. 2. The structure of SDRB and RDB.

3.2 Student Network

The student network is a dehazing network trained on synthetic hazy images, with a structure similar to the teacher networks. As shown in Fig. 1, the student network is still based on an encoder-decoder structure, but two residual dense blocks (RDBs) are applied in the bottleneck layers. The RDB [24] combines the advantages of the residual connection and the densely connected network [25], extracting structures effectively and helping feature backpropagation. As shown in Fig. 2(b), each RDB contains four \(3 \times 3\) convolutions and one \(1 \times 1\) convolution. All \(3 \times 3\) convolutions are densely connected to avoid losing the structural information extracted by shallower layers, and the \(1 \times 1\) convolution then merges these abundant features to provide clear texture perception.
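The RDB could be sketched as follows. The growth rate (number of feature maps added by each \(3 \times 3\) convolution) is an assumption, since it is not specified above.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual dense block: four densely connected 3x3 convolutions,
    a 1x1 fusion convolution, and a local residual connection."""
    def __init__(self, channels=512, growth=64):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = channels
        for _ in range(4):  # four densely connected 3x3 convolutions
            self.convs.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1), nn.ReLU(inplace=True)))
            in_ch += growth  # each layer sees all previously produced feature maps
        self.fuse = nn.Conv2d(in_ch, channels, 1)  # 1x1 convolution merges the features

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))  # residual connection
```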

3.3 Overall Loss Function

Recent research [26] has shown that combining pixel-wise and feature-wise losses can effectively accelerate network training. Hence, for training MGDNet, the overall loss function contains an L1 loss, a perceptual loss, and a distillation loss, expressed as Eq. (1):

$$ L_{loss} = L_1 + L_{per} + \lambda L_{diss} $$
(1)

where \(L_1\), \(L_{per}\), and \(L_{diss}\) denote the L1 loss, perceptual loss, and distillation loss, respectively. \(\lambda\) is a trade-off coefficient that balances the influence of the learning-based and prior-based components. As shown in Fig. 3, our method effectively improves the dehazing effect while maintaining color fidelity when \(\lambda\) is set to 1.

Fig. 3. The results when setting different \(\lambda\).

L1 Loss.

The L1 loss (mean absolute error) rapidly minimizes the per-pixel differences between dehazed and clear images, so we include it for network training. Compared with the L2 loss (mean squared error), the L1 loss trains networks more stably. It is expressed as Eq. (2):

$$ L_1 = \left\| {J - G(I)} \right\|_1 $$
(2)

where \(J\) represents haze-free images and \(G(I)\) represents the dehazed images produced by the student network.

Perceptual Loss.

The perceptual loss [27] compares two images in terms of perceptual and semantic differences, which helps the network restore more vivid images. In this paper, we use a VGG19 network pretrained on ImageNet and extract features from its 2nd, 7th, 15th, 21st, and 28th convolutional layers (the last convolution of each scale) to compute the loss, expressed as Eq. (3):

$$ L_{p{\text{er}}} = \sum_{i = 1}^5 {\frac{1}{C_i H_i W_i }\left\| {\Phi_i \left( J \right) - \Phi_i \left( {G(I)} \right)} \right\|_1 } $$
(3)

where \(J\) represents clear images and \(G(I)\) represents the dehazed images generated by the student network. \(\Phi_i \left( J \right)\) and \(\Phi_i \left( {G(I)} \right)\) represent the perceptual features of the clear images and the dehazed images, respectively, extracted at five scales from the pretrained VGG19 network. \(C_i\), \(H_i\), and \(W_i\) denote the number of channels, the height, and the width of the feature maps.
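A possible PyTorch realization of Eq. (3) is sketched below, assuming a torchvision VGG19 pretrained on ImageNet. The tap indices here correspond to the last convolution of each scale in torchvision's layer numbering and may differ from the layer numbers quoted above, which likely follow a different counting scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    def __init__(self, tap_indices=(2, 7, 16, 25, 34)):   # last conv of each VGG19 scale
        super().__init__()
        self.vgg = models.vgg19(pretrained=True).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad = False   # VGG is fixed; only the dehazing network is trained
        self.taps = set(tap_indices)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.taps:
                feats.append(x)
        return feats

    def forward(self, dehazed, clear):
        loss = 0.0
        for fd, fc in zip(self._features(dehazed), self._features(clear)):
            # L1 distance normalized by C_i * H_i * W_i, as in Eq. (3)
            loss = loss + F.l1_loss(fd, fc)
        return loss
```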

Distillation Loss.

As shown in Fig. 1, to give the trained student network strong generalization to real scenes, we pretrain the two teacher networks by minimizing the L1 loss between \(N(I)\) (\(D(I)\)) and the dehazed images of NLD (DCP), denoted \(J_{NLD}\) (\(J_{DCP}\)), respectively. The pretrained teachers then guide the training of the student network through both feature-level losses (\(L_{N1}, L_{D1}, L_{N2}, L_{D2}\)) and pixel-level distillation losses (\(L_{N3}, L_{D3}\)). For feature-level guidance, the features after each SDRB (RDB) in the teacher networks (student network) are output through an extra \(3 \times 3\) convolution. For pixel-level guidance, considering that the supervisions (dehazed images of NLD and DCP) contain negative information such as color and illumination distortion, we adopt the discrete wavelet transform (DWT) of [28] to separate the high-frequency and low-frequency parts of the teacher outputs, and only the high-frequency images are used to guide the training of the student network. The whole distillation loss is expressed as Eq. (4):

$$ \begin{aligned} L_{diss} & = \left\| {F_{NLD1} - F_{S1} } \right\|_1 + \left\| {F_{DCP1} - F_{S1} } \right\|_1 + \left\| {F_{NLD2} - F_{S2} } \right\|_1 + \left\| {F_{DCP2} - F_{S2} } \right\|_1 \\ & + \,\left\| {DWT_h (N(I)) - DWT_h (G(I))} \right\|_1 + \left\| {DWT_h (D(I)) - DWT_h (G(I))} \right\|_1 \\ \end{aligned} $$
(4)

where \(F_{DCP1}\), \(F_{DCP2}\) and \(F_{NLD1}\), \(F_{NLD2}\) denote the features extracted after each SDRB of the DCP and NLD teacher networks, respectively, and \(F_{S1}\), \(F_{S2}\) denote the features extracted after each RDB of the student network. \(DWT_h ( \cdot )\) is a high-pass DWT that extracts the structures and textures of the input images for pixel-level guidance, allowing the student network \(G( \cdot )\) to avoid the negative low-frequency information (color and illumination distortion) in the dehazed images \(N(I)\) and \(D(I)\).
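The sketch below illustrates Eq. (4), using hand-written Haar filters as one possible high-pass DWT (the paper follows [28], which may use a different wavelet); the feature tensors are assumed to have matching shapes after the extra \(3 \times 3\) convolutions mentioned above.

```python
import torch
import torch.nn.functional as F

def haar_highpass(img):
    """Single-level Haar DWT that keeps only the high-frequency sub-bands
    (LH, HL, HH) and discards the low-frequency LL sub-band."""
    c = img.size(1)
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    k = torch.stack([lh, hl, hh]).unsqueeze(1).to(img)   # (3, 1, 2, 2)
    k = k.repeat(c, 1, 1, 1)                             # one filter bank per channel
    return F.conv2d(img, k, stride=2, groups=c)

def distillation_loss(f_nld, f_dcp, f_s, n_out, d_out, g_out):
    """f_nld, f_dcp, f_s: lists of two feature maps from the NLD teacher, the
    DCP teacher and the student; n_out, d_out, g_out: their output images."""
    feat = sum(F.l1_loss(fn, fs) + F.l1_loss(fd, fs)
               for fn, fd, fs in zip(f_nld, f_dcp, f_s))
    pix = (F.l1_loss(haar_highpass(n_out), haar_highpass(g_out))
           + F.l1_loss(haar_highpass(d_out), haar_highpass(g_out)))
    return feat + pix
```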

4 Experiments

In this section, we conduct experiments on real-world images to show that the proposed MGDNet outperforms several state-of-the-art methods: DCP [3], NLD [6], DANet [17], RefineDNet [18], PSD [19], and KDDN [20]. All of these are learning-based methods except DCP and NLD, which are prior-based. Moreover, DANet, RefineDNet, and PSD are prior-combined methods, and KDDN is a dehazing network using knowledge distillation, similar to our MGDNet.

4.1 Dataset

To train our MGDNet, we adopt the Indoor Training Set (ITS) of Realistic Single Image Dehazing (RESIDE) [29], a synthetic indoor training set containing 13990 hazy images and the corresponding haze-free images. We test MGDNet and all comparative methods on IHAZE [30] and OHAZE [31], which contain 5 paired indoor and 5 paired outdoor testing images, respectively. Moreover, some real-world images from [17] and [32] are adopted to further verify the generalization of MGDNet in real scenes.

4.2 Implementation Details

The proposed MGDNet is trained and tested in the PyTorch framework. During training, we randomly crop local regions (\(256 \times 256\)) of the paired inputs and randomly flip or rotate them to enhance the diversity of the training data. The batch size is set to 4, and MGDNet is trained for 30 epochs. We use the Adam optimizer [33] with the default decay rates \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\). The initial learning rate is set to 0.0002 and halved every five epochs. All experiments are implemented on a PC with two RTX 2080Ti GPUs.
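The optimizer and schedule described above correspond to the following PyTorch setup; the placeholder model stands in for the student network, and the training-loop body (data loading and the loss of Eq. (1)) is omitted.

```python
import torch
import torch.nn as nn

student = nn.Conv2d(3, 3, 3, padding=1)   # placeholder for the MGDNet student network
optimizer = torch.optim.Adam(student.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve lr every 5 epochs

for epoch in range(30):
    # one epoch over 256x256 randomly cropped/flipped/rotated ITS pairs, batch size 4
    scheduler.step()
```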

4.3 Comparisons with State-of-the-Art Methods

Results on IHAZE and OHAZE. The comparison results are shown in Table 1, where the values are the average PSNR and SSIM over the five indoor and five outdoor testing images of IHAZE [30] and OHAZE [31], respectively. On IHAZE, the proposed MGDNet achieves the second-best performance in terms of both PSNR and SSIM. On OHAZE, the proposed MGDNet also achieves the second-best PSNR and improves SSIM by 0.02 compared with the second-best method DANet. These results show that DANet and our MGDNet achieve better dehazing performance on these two datasets. We also note that the prior-based methods DCP and NLD perform poorly in terms of both PSNR and SSIM, indicating that the artifacts and color changes they introduce seriously reduce the quality of the dehazed images.

Table 1. Comparison of the state-of-the-art dehazing methods on IHAZE and OHAZE. Numbers in red and blue represent the best and second-best results, respectively.

Results on Natural Hazy Images.

Considering that hazy images acquired with a haze machine may not fully reflect dehazing ability in real scenes, we further test all methods on real-world images from [17] and [32]. As shown in Fig. 4, the prior-based methods DCP and NLD dehaze favorably, showing excellent generalization in real scenes, but tend to cause color distortion and artifacts. By contrast, learning-based methods tend to produce under-dehazed results owing to the lack of knowledge about real-world hazy images. KDDN fails to dehaze these scenes, and a large amount of residual haze degrades the visibility of its results. More importantly, although DANet performs favorably on IHAZE and OHAZE, it cannot remove haze thoroughly and produces obvious color changes, indicating that it does not generalize to natural scenes. The results of PSD suffer from severe color and illumination distortion, and some local regions retain residual haze. RefineDNet dehazes effectively in these scenes and produces visually pleasing results, but some residual haze remains in local regions, especially for images with colorful textures. In contrast to the above methods, the proposed MGDNet recovers high-quality results with discriminative structures and vivid color, showing strong dehazing ability in real scenes thanks to the guidance of the DCP and NLD methods. Moreover, compared with the dehazed images of DCP and NLD, the results of MGDNet alleviate color distortion and artifacts through training on synthetic images.

4.4 Ablation Study

To demonstrate the effectiveness of each module, we conduct an ablation study over four factors: the DCP teacher network (DCP), the NLD teacher network (NLD), the pixel-wise distillation loss (PDL), and the discrete wavelet transform (DWT). We construct the following variants with different component combinations: (1) Student: only the student network is used; (2) Student+DCP: only the DCP teacher network guides the student network through feature-wise distillation losses; (3) Student+NLD: only the NLD teacher network guides the student network through feature-wise distillation losses; (4) Student+DCP+NLD: both the DCP and NLD teacher networks guide the student network through feature-wise distillation losses; (5) Student+DCP+NLD+PDL: the two teacher networks guide the student network through both feature-wise and pixel-wise distillation losses; (6) Student+DCP+NLD+PDL+DWT (Ours): a DWT module is applied before pixel-wise guidance, so only the high-frequency parts of the DCP and NLD teacher outputs are used during pixel-wise comparison.

The results are shown in Table 2, which demonstrates that the proposed MGDNet achieves the best dehazing performance in terms of PSNR and SSIM. Adding the DCP teacher network improves PSNR from 19.68 dB to 20.32 dB and SSIM by 0.01, and adding the NLD teacher network improves the metrics by 0.56 dB and 0.007, respectively. These results show that incorporating prior-based methods improves performance in outdoor scenes even though the network is trained only on indoor images, and that DCP is more effective than NLD for our method. Adding both teacher networks provides a further small gain, which means that combining multiple priors further improves generalization, since a single prior cannot hold in all scenes. Additionally, we notice that adding PDL alone degrades performance, showing that directly using pixel-wise distillation losses can harm the final results, since prior-based dehazed images always contain negative information such as color shifts and artifacts. Fortunately, with the help of DWT, the network alleviates the distortions in the teacher outputs and further improves the metrics to 20.49 dB and 0.904, respectively.

Fig. 4. Comparison of the state-of-the-art dehazing methods on the real-world images.

Table 2. Comparison of variants with different components on the outdoor dataset of SOTS.

5 Conclusion

In this paper, we propose a multi-priors guided dehazing network (MGDNet) based on knowledge distillation, which combines the complementary merits of prior-based and learning-based dehazing methods. Specifically, we pretrain two teacher networks to exploit the partially correct features of dark channel prior (DCP) and non-local dehazing (NLD) results through both feature-level and pixel-level distillation losses. Moreover, we adopt a high-pass discrete wavelet transform (DWT) before pixel-level guidance to suppress the negative information in the prior-dehazed images, such as color shifts and artifacts, and the features added from the two teacher networks are refined during the supervised training of the student network. Experiments on real-world images demonstrate that the proposed MGDNet achieves favorable dehazing results in both quantitative and qualitative comparisons.