1 Introduction

Haze is a common natural phenomenon that degrades captured images and further hinders the recognition capability of computer vision systems. Thus, as a key prerequisite for high-level computer vision tasks, single-image dehazing has been studied extensively in recent years [1].

Single-image dehazing methods can be roughly divided into model-based and model-free approaches. Model-based methods achieve dehazing via the following atmospheric scattering model [2]:

$$ I(x) = J(x)\,t(x) + A\left( 1 - t(x) \right) $$
(1)

where \(I(x)\) denotes the hazy image captured under hazy conditions and \(J(x)\) denotes the restored haze-free image. \(A\) and \(t(x)\) denote the atmospheric light and the transmission map, respectively. Moreover, \(t(x) = e^{-\beta d(x)}\), where \(\beta\) and \(d(x)\) are the atmospheric scattering coefficient and the scene depth, respectively. Equation (1) is ill-posed: a clear image \(J(x)\) cannot be recovered directly from a hazy image \(I(x)\), since both \(A\) and \(t(x)\) are unknown.
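For illustration, the forward direction of Eq. (1) can be sketched in a few lines of NumPy; this is how synthetic hazy training pairs are typically produced from a clear image and a depth map. The function below is a minimal sketch, not the authors' code, and the values of \(\beta\) and \(A\) are arbitrary examples.

```python
import numpy as np

def synthesize_haze(J, depth, beta=1.0, A=0.8):
    """Apply the atmospheric scattering model of Eq. (1).

    J: clear image in [0, 1], shape (H, W, 3); depth: scene depth d(x), shape (H, W).
    beta and A are illustrative values, not taken from the paper.
    """
    t = np.exp(-beta * depth)[..., None]   # transmission map t(x) = exp(-beta * d(x))
    I = J * t + A * (1.0 - t)              # Eq. (1)
    return np.clip(I, 0.0, 1.0), t

# Toy example: a random 4x4 "clear" image with linearly increasing depth
J = np.random.rand(4, 4, 3)
d = np.linspace(0.0, 3.0, 16).reshape(4, 4)
I_hazy, t = synthesize_haze(J, d)
```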

To this end, early model-based methods (also called prior-based methods) use priors derived from observations of clear images to estimate the atmospheric light and transmission map. These methods include the color-lines prior (CLP) [3], boundary constraint and contextual regularization (BCCR) [4], the dark channel prior (DCP) [5], the color attenuation prior (CAP) [6], and non-local dehazing (NLD) [7]. Prior-based methods generalize well, but they often cause color shifts and artifacts, since a single hand-crafted hypothesis cannot estimate the parameters accurately, especially in complex scenes. Thus, recent model-based methods tend to use a designed convolutional neural network (CNN) to estimate the atmospheric light and transmission map [8,9,10,11]. However, as a simplified haze model, the atmospheric scattering model cannot describe the haze formation process thoroughly, which may restrict the final dehazing performance [12].

Consequently, more model-free works [13,14,15,16,17,18] have been proposed, which build an end-to-end CNN that directly maps hazy images to the corresponding haze-free images. These end-to-end methods show strong dehazing ability on synthetic scenes, but they often produce under-dehazed images when applied to real scenes due to domain shift: synthetic hazy images cannot represent the uneven haze distribution and complex illumination of natural conditions, so the trained models do not hold in such scenes. Hence, some more recent works [19,20,21,22] combine traditional priors (e.g., the dark channel prior) with learning-based methods to achieve better dehazing in both synthetic and real scenes. However, existing prior-combined methods cannot effectively alleviate the distortions introduced by the prior-based component, and a more efficient feature aggregation mechanism is needed to combine the complementary advantages of the two categories.

In this paper, we resort to knowledge distillation to address this problem and propose a prior-combined dehazing network dubbed PCD. Knowledge distillation [23,24,25] is a widely used technique for parameter reduction, in which a cumbersome network (the teacher) guides the learning of a designed light-weight network (the student). Building on this idea, recent dehazing works [23, 24] adopt the features of ground truths to enhance image restoration, and [25] proposes a mutual distillation mechanism to improve the accuracy of a detection task. Inspired by these works, we propose a mutual learning mechanism to combine the complementary merits of prior-based and learning-based methods. Specifically, we build two sub-networks that achieve dehazing through supervised and unsupervised learning, respectively. The supervised sub-network is optimized with ground truths, while the unsupervised sub-network is optimized with DCP dehazed images. Hence, the outputs of the supervised sub-network provide color fidelity, since their supervision contains fully correct information, while the outputs of the unsupervised sub-network generalize better to real scenes, since the DCP reflects a statistical law of clear images. A novel mutual learning mechanism then combines these complementary merits adaptively and produces two preliminary dehazed images. Moreover, since either of them may be better than the other in some local regions, a feature fusion module (FFM) based on perceptual differences is further proposed. The FFM merges the preliminary dehazed images and thereby achieves a better and more realistic dehazing result.

The main contributions of this paper are summarized as follows:

1. We introduce a prior-combined dehazing (PCD) network based on mutual learning to combine the merits of prior-based and learning-based methods.

2. We propose a novel mutual learning mechanism to achieve the joint optimization of the supervised and unsupervised sub-networks.

3. We propose a feature fusion module based on perceptual differences to aggregate the outputs of the two sub-networks, yielding final dehazed images with clearer textures.

2 Network architecture

2.1 General architecture

Since it is hard to collect a large number of real hazy images together with their haze-free counterparts, existing learning-based methods still train their models on synthetic images. Synthetic hazy images differ noticeably from real hazy images in haze distribution, so the CNN model lacks knowledge of natural scenes; as a result, learning-based methods produce under-dehazed images in real scenes. Considering that traditional priors (e.g., the dark channel prior) are statistical laws of clear images, recent works tend to combine prior-based methods with CNN-based methods to achieve better dehazing in real scenes. However, since the prior-dehazed images contain severe distortions such as artifacts, illumination changes, color shifts, and halos, these prior-combined methods also suffer from image distortions due to insufficient feature aggregation. Thus, as shown in Fig. 1, we propose a prior-combined dehazing network based on mutual learning, which consists of a supervised sub-network, an unsupervised sub-network, and a feature fusion module.

Fig. 1

The general architecture of our PCD, which consists of a supervised sub-network and an unsupervised sub-network optimized by the ground truths and the DCP dehazed images, respectively. The outputs of the two sub-networks have complementary merits and are fused based on perceptual differences to acquire the final dehazed images

Fig. 2

The architecture of the perceptual feature fusion. The dehazed images of the two sub-networks are converted to the LMN color space to estimate their similarity to the ground truths; two weight maps \(W_{D1}\) and \(W_{D2}\) are then generated by a softmax function to weight the dehazed images \(D_{1}\) and \(D_{2}\) adaptively

2.2 Supervised sub-network

The supervised sub-network achieves dehazing in an end-to-end manner, directly building the mapping between synthetic hazy images and ground truths. As shown in Fig. 1, it is based on a three-scale autoencoder structure. Differently, we replace traditional convolutions with residual blocks for feature extraction, since the residual structure has proven effective for feature flow. Specifically, we first extract the features of the synthetic hazy image with a convolutional layer, which changes the channel number from 3 to 64. Two residual blocks then enhance the feature representation, and a Down-Conv layer downsamples the feature maps into a higher-level semantic space. We downsample the features twice, forming three scales of feature maps, and the features of the bottleneck layer are sent to the decoder D to restore high-resolution results. The decoder D mirrors the encoder E: an Up-Conv layer restores the resolution, and two residual blocks then enhance the upsampled features. During decoding, the features of encoder E are sent to the corresponding decoder layers through skip connections to avoid the loss of spatial information. Thus, except for the first scale of decoder D, the features from encoder E and the features of the previous decoder scale are concatenated as the input of the current scale, until the resolution of the input hazy image is restored. Finally, the decoder output passes through a convolutional layer with a Tanh activation to produce the dehazed image. Since the supervised sub-network is supervised by ground truths, its dehazed images achieve high information fidelity, although some regions remain under-dehazed.
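The description above can be summarized by the simplified PyTorch sketch below of the shared three-scale residual encoder-decoder; the same class can be instantiated once for each sub-network. This is a sketch under assumptions, not the authors' implementation: kernel sizes, strides, the transposed-convolution Up-Conv, and the 1×1 skip-fusion convolutions are illustrative choices.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)          # residual connection eases feature flow

class SubNetwork(nn.Module):
    """Shared three-scale architecture of the supervised and unsupervised sub-networks."""
    def __init__(self, base=64):
        super().__init__()
        self.head = nn.Conv2d(3, base, 3, padding=1)                       # 3 -> 64 channels
        self.enc1 = nn.Sequential(ResBlock(base), ResBlock(base))
        self.down1 = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)     # Down-Conv
        self.enc2 = nn.Sequential(ResBlock(base * 2), ResBlock(base * 2))
        self.down2 = nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1)
        self.bottleneck = nn.Sequential(ResBlock(base * 4), ResBlock(base * 4))
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1)  # Up-Conv
        self.fuse2 = nn.Conv2d(base * 4, base * 2, 1)   # merge skip connection from enc2
        self.dec2 = nn.Sequential(ResBlock(base * 2), ResBlock(base * 2))
        self.up1 = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)
        self.fuse1 = nn.Conv2d(base * 2, base, 1)       # merge skip connection from enc1
        self.dec1 = nn.Sequential(ResBlock(base), ResBlock(base))
        self.tail = nn.Sequential(nn.Conv2d(base, 3, 3, padding=1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(self.head(x))
        e2 = self.enc2(self.down1(e1))
        b = self.bottleneck(self.down2(e2))
        d2 = self.dec2(self.fuse2(torch.cat([self.up2(b), e2], dim=1)))
        d1 = self.dec1(self.fuse1(torch.cat([self.up1(d2), e1], dim=1)))
        return self.tail(d1)
```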

2.3 Unsupervised sub-network

To exploit the features of the DCP method, we build an unsupervised sub-network. As shown in Fig. 1, it has the same structure as the supervised sub-network. Differently, we train it under the supervision of images dehazed by the dark channel prior (DCP) rather than ground truths. Since the supervision consists of fake ground truths (the DCP dehazed images), we refer to this as unsupervised learning in this paper. The DCP method has proven effective for real haze removal, although it may introduce distortions, especially in sky regions. As shown in Fig. 1, the outputs of the unsupervised sub-network share the characteristics of DCP dehazed images: they have more discriminative textures, although the sky regions suffer from illumination oversaturation. Since the output images of the two sub-networks have complementary advantages, we apply a mutual learning mechanism that optimizes them adaptively through two extra distillation losses. The details of the distillation losses are given in Sect. 2.5.
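For completeness, the fake ground truths can be generated offline with the classical DCP pipeline [5]. The sketch below follows the commonly used formulation (dark channel, atmospheric light from the brightest 0.1% of dark-channel pixels, transmission with \(\omega = 0.95\), and a lower bound \(t_{0} = 0.1\)); these settings and the omission of guided-filter refinement are assumptions for illustration, not details taken from this paper.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    """Minimum over color channels followed by a local minimum filter."""
    return minimum_filter(img.min(axis=2), size=patch)

def dcp_dehaze(I, patch=15, omega=0.95, t0=0.1):
    """I: hazy image in [0, 1], shape (H, W, 3); returns a DCP dehazed image."""
    dark = dark_channel(I, patch)
    # Atmospheric light A: brightest pixels among the top 0.1% of the dark channel
    n = max(1, int(dark.size * 0.001))
    idx = np.argsort(dark.ravel())[-n:]
    A = I.reshape(-1, 3)[idx].max(axis=0)
    # Transmission estimate and scene radiance recovery
    t = 1.0 - omega * dark_channel(I / A, patch)
    J = (I - A) / np.maximum(t, t0)[..., None] + A
    return np.clip(J, 0.0, 1.0)
```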

2.4 Feature fusion

In our method, \(D_{1}\) and \(D_{2}\) are the dehazed images produced under the supervision of ground truths and DCP dehazed images, respectively. Since \(D_{1}\) and \(D_{2}\) are obtained in different ways, it is highly possible that either of them is better than the other in some local regions. Hence, if the better regions of \(D_{1}\) and \(D_{2}\) are assigned larger weights, a better result can be obtained. Since \(D_{1}\) offers good fidelity and \(D_{2}\) offers good visibility, the fusion should preserve the realness of \(D_{1}\) while maintaining the visibility of \(D_{2}\). Thus, we fuse them with reference to the ground truths, through the following steps (a code sketch covering the whole process follows Eq. (5)):

(1) Feature Extraction: Recent IQA research [22, 26] has shown that color distortions and chrominance shifts can be estimated easily in the LMN color space. Consequently, to objectively estimate the realness of the dehazed images \(D_{1}\) and \(D_{2}\), we first transform the images into the LMN color space:

$$ \begin{bmatrix} L \\ M \\ N \end{bmatrix} = \begin{bmatrix} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} $$
(2)

(2) Similarity Calculation: We calculate the similarity in the LMN space to evaluate the realness of the dehazed images. Take the similarity between the dehazed image \(D_{1}\) and the ground truth as an example. Suppose that \(L_{1}(x)\), \(M_{1}(x)\) and \(N_{1}(x)\) are computed from the dehazed image \(D_{1}\), and \(L_{2}(x)\), \(M_{2}(x)\) and \(N_{2}(x)\) are computed from the ground truth. The similarity \(S_{D1}^{\text{LMN}}\) at pixel \(x\) is:

$$ S_{D1}^{\text{LMN}}(x) = \frac{2L_{1}(x)L_{2}(x) + C_{1}}{L_{1}^{2}(x) + L_{2}^{2}(x) + C_{1}} \times \frac{2M_{1}(x)M_{2}(x) + C_{1}}{M_{1}^{2}(x) + M_{2}^{2}(x) + C_{1}} \times \frac{2N_{1}(x)N_{2}(x) + C_{1}}{N_{1}^{2}(x) + N_{2}^{2}(x) + C_{1}} $$
(3)

where \(C_{1}\) is a constant set to 130 as suggested in [26].

(3) Weight generation and feature fusion:

To make the final result contain more realistic information, we convert the similarities into weights for the fusion process. Let \(S_{D1}^{\text{LMN}}(x)\) denote the similarity between the dehazed image \(D_{1}\) and the ground truth at pixel \(x\), and \(S_{D2}^{\text{LMN}}(x)\) the similarity between the dehazed image \(D_{2}\) and the ground truth. The weights of \(D_{1}\) and \(D_{2}\) at pixel \(x\) are:

$$ \begin{bmatrix} W_{D1}(x) \\ W_{D2}(x) \end{bmatrix} = \text{Softmax}\left( \begin{bmatrix} S_{D1}^{\text{LMN}}(x) \\ S_{D2}^{\text{LMN}}(x) \end{bmatrix} \right) $$
(4)

where \(\text{Softmax}\) denotes the softmax function, which generates the weights adaptively from the similarities \(S_{D1}^{\text{LMN}}\) and \(S_{D2}^{\text{LMN}}\). Note that \(W_{D1}(x) + W_{D2}(x) = 1\).

Finally, we aggregate the preliminary dehazed images \(D_{1}\) and \(D_{2}\) according to their weights, and the final result is:

$$ D_{\text{Fin}} = W_{D1} \otimes D_{1} + W_{D2} \otimes D_{2} $$
(5)

where \(D_{\text{Fin}}\) denotes the final dehazed image, \(W_{D1}\) and \(W_{D2}\) are the generated weights of the dehazed images \(D_{1}\) and \(D_{2}\), respectively, and \(\otimes\) denotes the pixel-wise product.
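Putting Eqs. (2)-(5) together, the whole perceptual feature fusion can be sketched in a few lines of PyTorch. The sketch assumes batched RGB tensors of shape (B, 3, H, W) in [0, 1] and uses \(C_{1} = 130\); it is an illustrative reading of the module, not the authors' code.

```python
import torch
import torch.nn.functional as F

# 3x3 LMN transform matrix of Eq. (2)
_LMN = torch.tensor([[0.06, 0.63, 0.27],
                     [0.30, 0.04, -0.35],
                     [0.34, -0.60, 0.17]])

def rgb_to_lmn(x):
    """x: RGB tensor (B, 3, H, W) -> LMN tensor of the same shape."""
    return torch.einsum('lc,bchw->blhw', _LMN.to(x.device, x.dtype), x)

def lmn_similarity(d, gt, c1=130.0):
    """Per-pixel similarity of Eq. (3) between a dehazed image and the ground truth."""
    a, b = rgb_to_lmn(d), rgb_to_lmn(gt)
    s = (2 * a * b + c1) / (a ** 2 + b ** 2 + c1)   # channel-wise similarity terms
    return s.prod(dim=1, keepdim=True)              # product over L, M, N -> (B, 1, H, W)

def perceptual_fusion(d1, d2, gt):
    """Eqs. (4)-(5): softmax weighting and pixel-wise blending of D1 and D2."""
    s1, s2 = lmn_similarity(d1, gt), lmn_similarity(d2, gt)
    w = F.softmax(torch.cat([s1, s2], dim=1), dim=1)   # W_D1(x) + W_D2(x) = 1
    return w[:, :1] * d1 + w[:, 1:] * d2               # D_Fin of Eq. (5)
```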

2.5 Loss function

It has been shown in [27] that combining a pixel-wise loss with a feature-wise loss effectively captures the differences between two images. Thus, we use an L1 loss, a perceptual loss, and a distillation loss to train the proposed PCD. The total loss functions of the supervised and unsupervised sub-networks are:

$$ L_{\text{Sup}} = L_{1} + \lambda_{1} L_{\text{Per1}} + \lambda_{2} L(D_{1} \parallel D_{2}) $$
(6)
$$ L_{\text{Uns}} = L_{2} + \lambda_{1} L_{\text{Per2}} + \lambda_{2} L(D_{2} \parallel D_{1}) $$
(7)

where \(L_{\text{Sup}}\) and \(L_{\text{Uns}}\) are the losses of the supervised and unsupervised sub-networks, respectively, \(L_{1}\) and \(L_{2}\) denote the L1 losses, and \(L_{\text{Per1}}\) and \(L_{\text{Per2}}\) denote the perceptual losses. \(\lambda_{1}\) and \(\lambda_{2}\) are weight coefficients, both set to 1. Moreover, \(L(D_{1} \parallel D_{2})\) and \(L(D_{2} \parallel D_{1})\) denote the distillation losses, which make the supervised (unsupervised) sub-network mimic the features of the unsupervised (supervised) sub-network, respectively.

2.5.1 L1 loss

The L1 loss (mean absolute error) minimizes the per-pixel differences between the dehazed images and their supervisions, so we adopt it for network training. Compared with the L2 loss (mean squared error), the L1 loss makes training more stable. The two L1 losses can be expressed as:

$$ L_{1} = \left\| J - D_{1} \right\|_{1} $$
(8)
$$ L_{2} = \left\| J_{\text{dcp}} - D_{2} \right\|_{1} $$
(9)

where \(J\) and \(J_{\text{dcp}}\) represent the ground truths and the DCP dehazed images, respectively, and \(D_{1}\) and \(D_{2}\) represent the outputs of the supervised and unsupervised sub-networks, respectively. \(\left\| \cdot \right\|_{1}\) denotes the L1 norm.

2.5.2 Perceptual loss

Perceptual loss [28] compares two images in terms of perceptual and semantic differences, which effectively helps the network restore more vivid images. In this paper, we use a VGG19 network [29] pretrained on ImageNet [30] and extract the features after the convolutional layers with indices 2, 7, 12, 21 and 30 to calculate the losses. The perceptual losses used in the supervised and unsupervised sub-networks are denoted as \(L_{\text{Per1}}\) and \(L_{\text{Per2}}\), respectively:

$$ L_{\text{Per1}} = \sum\limits_{i = 1}^{5} \frac{1}{C_{i} H_{i} W_{i}} \left\| \Phi_{i}\left( J \right) - \Phi_{i}\left( D_{1} \right) \right\|_{1} $$
(10)
$$ L_{\text{Per2}} = \sum\limits_{i = 1}^{5} \frac{1}{C_{i} H_{i} W_{i}} \left\| \Phi_{i}\left( J_{\text{dcp}} \right) - \Phi_{i}\left( D_{2} \right) \right\|_{1} $$
(11)

where \(\Phi_{i}(\cdot)\) (\(i = 1, 2, \ldots, 5\)) denotes the perceptual features extracted at the five scales of the pretrained VGG19 network, and \(C_{i}\), \(H_{i}\) and \(W_{i}\) represent the channel number, height, and width of the corresponding feature maps.
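A minimal PyTorch module for this loss is sketched below, taking features after the layers indexed 2, 7, 12, 21 and 30 of torchvision's pretrained VGG19. The assumption that these indices in torchvision.models.vgg19().features correspond to the convolutional layers named in the text, and the use of a mean (i.e., 1/(C·H·W)-normalized) L1 distance, are ours.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    def __init__(self, layer_ids=(2, 7, 12, 21, 30)):
        super().__init__()
        # Frozen ImageNet-pretrained VGG19 (older torchvision: vgg19(pretrained=True))
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.layer_ids = set(layer_ids)

    def forward(self, x, y):
        loss, fx, fy = 0.0, x, y
        for i, layer in enumerate(self.vgg):
            fx, fy = layer(fx), layer(fy)
            if i in self.layer_ids:
                # mean over C*H*W gives the 1/(C_i H_i W_i) normalization of Eqs. (10)-(11)
                loss = loss + torch.abs(fx - fy).mean()
            if i >= max(self.layer_ids):
                break
        return loss
```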

2.5.3 Distillation loss

Due to limited feature aggregation, recent prior-combined methods cannot cope with the distortions caused by prior-based methods. Since the supervised and unsupervised sub-networks show complementary merits for image dehazing, we aggregate their features through the mutual learning mechanism with two designed distillation losses:

$$ L(D_{1} \parallel D_{2}) = L(D_{2} \parallel D_{1}) = \left\| D_{2} - D_{1} \right\| $$
(12)
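To summarize Sect. 2.5, the two total losses of Eqs. (6)-(7) could be assembled as in the sketch below, which reuses the PerceptualLoss module above and sets \(\lambda_{1} = \lambda_{2} = 1\). Treating the distillation term as an L1 distance and detaching the peer sub-network's output in that term are our assumptions, not details stated in the paper.

```python
import torch.nn.functional as F

def total_losses(d1, d2, gt, j_dcp, perceptual_loss, lam1=1.0, lam2=1.0):
    """d1/d2: outputs of the supervised/unsupervised sub-networks; gt: ground truth;
    j_dcp: DCP dehazed image used as the fake ground truth."""
    l_sup = (F.l1_loss(d1, gt)                       # Eq. (8)
             + lam1 * perceptual_loss(d1, gt)        # Eq. (10)
             + lam2 * F.l1_loss(d1, d2.detach()))    # distillation L(D1 || D2), Eq. (12)
    l_uns = (F.l1_loss(d2, j_dcp)                    # Eq. (9)
             + lam1 * perceptual_loss(d2, j_dcp)     # Eq. (11)
             + lam2 * F.l1_loss(d2, d1.detach()))    # distillation L(D2 || D1), Eq. (12)
    return l_sup, l_uns
```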

3 Experiment and analysis

In Sects. 3.1 and 3.2, we introduce the datasets and the experimental details, respectively. In Sect. 3.3, we compare the proposed PCD with state-of-the-art methods on both synthetic and real-world datasets and analyze the results qualitatively and quantitatively.

3.1 Datasets

For training, we use the Indoor Training Set (ITS) of Realistic Single Image Dehazing (RESIDE) [31], which contains 13,990 synthetic indoor hazy images and the corresponding haze-free images. For testing, we use the Synthetic Objective Testing Set (SOTS) of RESIDE, which contains 500 paired indoor images and 500 paired outdoor images. For quantitative comparison, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [32] are used. Moreover, to show the generalization to natural scenes, some real-world images from the URHI dataset [19] are also used. Since the real-world images have no ground truths, we only compare these results qualitatively.
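For reproducibility, PSNR and SSIM can be computed with scikit-image as sketched below; the function names are the standard skimage.metrics API (recent versions), and the [0, 1] value range is an assumption about how the dehazed outputs are stored.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(dehazed, gt):
    """dehazed, gt: float images in [0, 1], shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, dehazed, data_range=1.0)
    ssim = structural_similarity(gt, dehazed, data_range=1.0, channel_axis=-1)
    return psnr, ssim

# Averaging the per-image scores over the test set gives the reported metrics.
```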

3.2 Implementation details

We implement our PCD in the PyTorch framework. For training, we resize all training images to 256 × 256, set the batch size to 4, and train for 20 epochs in total. To optimize the network and accelerate training, the Adam [33] optimizer is used with momentum coefficients \(\beta_{1} = 0.9\) and \(\beta_{2} = 0.999\). The initial learning rate is set to 0.0002 and halved every two epochs.
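This optimization setup maps directly onto a standard PyTorch training skeleton, as sketched below. Here model, train_loader and perceptual_loss are placeholders (a wrapper holding the two sub-networks, a loader yielding hazy/ground-truth/DCP triplets, and the module from Sect. 2.5.2), and summing the two losses into a single backward pass is our simplification rather than a detail from the paper.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)  # halve lr every 2 epochs

for epoch in range(20):
    for hazy, gt, j_dcp in train_loader:        # 256x256 images, batch size 4
        d1, d2 = model(hazy)                    # outputs of the two sub-networks
        l_sup, l_uns = total_losses(d1, d2, gt, j_dcp, perceptual_loss)
        optimizer.zero_grad()
        (l_sup + l_uns).backward()
        optimizer.step()
    scheduler.step()
```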

3.3 Experimental results

3.3.1 Results on synthetic images

To show the effectiveness of our PCD, we test it on both the indoor and outdoor images of SOTS. Figure 3 shows the results. We can see that prior-based methods (DCP, NLD and CAP) dehaze effectively in both indoor and outdoor scenes but suffer from halos, color shifts and artifacts. By contrast, learning-based methods (GridDehazeNet and MSBDN) dehaze effectively in indoor scenes because they are trained on an indoor dataset; unfortunately, GridDehazeNet introduces many artifacts in outdoor scenes, which reveals its poor generalization ability. Prior-combined methods (PSD, RefineDNet and ours) perform more stably in both indoor and outdoor scenes. However, some residual haze remains in the results of PSD, and the results of RefineDNet suffer from color shifts. Only our PCD dehazes effectively while preserving color fidelity.

Fig. 3

Results on images of SOTS. The upper three rows show the results on synthetic indoor images, and the bottom three rows show the results on synthetic outdoor images

To compare the results quantitatively, we calculate the average PSNR and SSIM. Table 1 shows the results. For indoor scenes, our PCD achieves the third-best results, with PSNR and SSIM of 27.34 dB and 0.971, respectively. For outdoor scenes, however, our PCD improves PSNR from 20.46 to 23.72 dB and increases SSIM by 0.018 compared with the second-best method, MSBDN. The results show that learning-based methods (PSD and RefineDNet) drop in performance when applied to outdoor scenes, whereas our PCD alleviates this by adopting the efficient mutual learning mechanism to combine the prior-based and learning-based ways. In addition, the comparison of FLOPs shows that our PCD achieves dehazing with the lowest computational overhead.

Table 1 Quantitative comparison on SOTS

3.3.2 Results on real hazy images

To verify the generalization ability to natural scenes, we further evaluate the dehazing performance on real-world images from URHI [19]. As shown in Fig. 4, prior-based methods still dehaze effectively and restore most textures in these scenes, although some color shifts, artifacts, or residual haze may remain. By contrast, learning-based methods fail in these scenes, and a large amount of haze remains in the results of GridDehazeNet and MSBDN. This further verifies that prior-based methods have more stable dehazing performance than learning-based methods, since the latter are restricted by their training data. Among the prior-combined methods, PSD acquires visually pleasing results with some illumination changes, while RefineDNet darkens the images although it also removes most of the haze. Better than both, our PCD removes more of the dense haze in the sky regions.

Fig. 4

Results on images in URHI; our method still dehazes effectively in these natural scenes

3.4 Ablation study

3.4.1 Ablation study on the overall architecture

To show the effectiveness of the overall architecture, we conduct ablation studies on the following four key factors: supervised learning (SL), unsupervised learning (UL), the mutual learning mechanism (MLM) and the feature fusion module (FFM). We construct the following variants: (1) SL, which trains the network with supervised learning only; (2) SL + UL, which trains the network with both supervised and unsupervised learning and combines the outputs by channel-wise concatenation; (3) SL + UL + MLM, which trains the network with both supervised and unsupervised learning under the mutual learning mechanism; (4) SL + UL + MLM + FFM (Ours), which replaces the channel-wise concatenation with the feature fusion module. We train these variants on the ITS dataset for 20 epochs and test on the outdoor set of SOTS. Table 2 shows the results: the proposed PCD achieves the best metrics, with PSNR and SSIM of 23.72 dB and 0.934, respectively. Specifically, adding UL significantly improves PSNR from 20.15 to 22.46 dB and increases SSIM by 0.021. Moreover, adding MLM further combines the merits of the prior-based and learning-based methods, improving the two metrics by 0.96 dB and 0.013. Finally, adding the FFM provides a further, smaller gain and yields the best result.

Table 2 Results of different variants about the overall architecture

3.4.2 Comparison for different prior

In our PCD, we use the DCP dehazed images as fake ground truths to achieve unsupervised learning and improve the generalization ability. Hence, it is important to compare the effectiveness of different prior-based methods as guidance. To this end, we generate the dehazed images of DCP [5], CAP [6] and NLD [7] and use each of them to train the network. The quantitative comparisons on the 500 outdoor images of SOTS are shown in Table 3: the DCP-combined network acquires better results than the NLD-combined and CAP-combined networks, which suggests that the DCP method may have the best generalization across various scenes. Because it is trained on indoor images, GridDehazeNet acquires poor metrics. Moreover, although PSD combines multiple prior-based methods, its insufficient fusion mechanism causes severe color shifts and lowers the metrics.

Table 3 Results of our PCD with different prior dehazed images as guidance

4 Conclusion

In this paper, we propose a prior-combined dehazing (PCD) network based on mutual learning. The PCD uses two sub-networks, optimized by the ground truths and by prior-dehazed images, to acquire two preliminary dehazed images, and utilizes a novel mutual learning strategy to further aggregate the complementary features. In addition, a perceptual feature fusion module is proposed to maintain the dehazing ability while alleviating distortions. Experimental results on both synthetic and real-world images show that our PCD achieves better results in real scenes, although it only acquires the third-best results on synthetic indoor scenes. A more efficient prior-combined strategy will be studied in our future work.