1 Introduction

Image dehazing is a fundamental task in computer vision and image processing that aims to recover the underlying clear image from a hazy image corrupted by atmospheric scattering. It has important practical applications in outdoor surveillance, autonomous driving, and remote sensing, where the visibility of the image is often impaired by haze or mist. However, image dehazing is a challenging problem due to the complex physical models underlying haze formation, which makes the problem ill-posed and nonlinear.

To model the physical process of haze formation, the atmospheric scattering model (ASM) was proposed [1, 2]. The ASM assumes that the observed hazy image is a linear combination of a clear image and atmospheric light attenuated by the scattering medium. This can be represented mathematically as:

$$ I(x) = J(x)\,t(x) + A\left( 1 - t(x) \right) $$
(1)

where \(I(x)\) is the observed hazy image, \(J(x)\) is the clear image, \(t(x)\) is the transmission map, \(A\) is the atmospheric light, and \(x\) is the pixel coordinate.
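
As a concrete illustration, the NumPy sketch below synthesizes a hazy image from a clear one via Eq. (1). The exponential transmission \(t(x) = e^{-\beta d(x)}\) is the standard choice in the ASM literature, but the depth map, scattering coefficient \(\beta\), and scalar atmospheric light used here are purely illustrative.

```python
import numpy as np

def synthesize_haze(J: np.ndarray, t: np.ndarray, A: float = 0.9) -> np.ndarray:
    """Apply the atmospheric scattering model of Eq. (1):
    I(x) = J(x) t(x) + A (1 - t(x)).

    J: clear image in [0, 1], shape (H, W, 3)
    t: transmission map in [0, 1], shape (H, W)
    A: atmospheric light (a scalar here for simplicity)
    """
    t = t[..., None]                 # broadcast t over the color channels
    return J * t + A * (1.0 - t)

# Transmission is commonly modeled as t(x) = exp(-beta * d(x)), where d is
# scene depth and beta the scattering coefficient; both are illustrative here.
J = np.random.rand(240, 320, 3)      # stand-in for a clear image
d = np.tile(np.linspace(1.0, 5.0, 320), (240, 1))
t = np.exp(-0.6 * d)                 # beta = 0.6
I = synthesize_haze(J, t)
```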

Early dehazing methods relied on handcrafted priors such as the dark channel prior (DCP) [3], the haze-line prior (HLP) [4] and the color attenuation prior (CAP) [6]. The DCP, for example, uses prior knowledge to estimate \(t(x)\) and \(A\) and thereby recover \(J(x)\). However, because these priors oversimplify the underlying physics, prior-based physical models often perform poorly on real-world images. Fusion-based methods such as AMEF [29] have also been explored, but their dehazing quality is likewise unsatisfactory.

Recently, deep learning-based methods have shown promising results in image dehazing by learning the underlying mapping between hazy and clear images; [7–9, 30, 31, 36–38] are some of the most representative examples. These methods either use convolutional neural networks (CNNs) to learn the complex nonlinear mapping between hazy and clear images directly, or estimate the unknown variables of the atmospheric scattering model with a network and obtain the dehazed image indirectly. Other recent works explore further techniques for improving image dehazing, such as generative adversarial networks (GANs) [16] and contrastive loss [24]. Although end-to-end deep learning-based methods have achieved great success, they mainly model the mapping from the input hazy image to the output haze-free image and give little consideration to the intermediate steps of the dehazing process. Moreover, paired hazy and clear images are difficult to obtain, and the information in clear images is not fully exploited. Finally, attention mechanisms built only on fixed-size convolutional kernels and a single perceptual scale give the network little flexibility, which hinders learning the mapping from hazy to clear images.

At the same time, as models become more cumbersome and complex, it is difficult to deploy these deep models on devices with limited resources. Knowledge distillation can effectively learn a small student model from a large teacher model, which provides the possibility of deploying these models on resource-limited devices such as mobile phones and embedded devices. According to the survey [39], the learning of student networks in knowledge distillation is influenced by three key factors: knowledge types, distillation strategies and teacher-student architectures. Knowledge types refer to the different kinds of information that the teacher model can provide to the student model, which can be categorized into response-based knowledge, feature-based knowledge and relation-based knowledge. Distillation strategies can be categorized into offline distillation, online distillation, and self-distillation, based on whether the teacher model and the student model are updated simultaneously. The teacher-student structure serves as the fundamental framework for knowledge distillation, playing a crucial role in the acquisition and transfer of knowledge. The selection or design of suitable teacher-student structures remains an area that requires further exploration.

Knowledge distillation was formally proposed by Hinton et al. [20] and has since been progressively applied to diverse domains, including image classification [40], face recognition [41], object detection [42], and image dehazing [43,44,45,46,47]. Regarding image dehazing, [43] employs knowledge distillation to transfer knowledge from a pre-trained dehazing network (RefineDNet) to an unsupervised learning branch, improving the dehazing effect. [44] proposes a single image dehazing network that combines a physical model with self-distillation, effectively exploiting the advantages of both model-based and model-free methods. [45] utilizes transmission map information to guide the image transformation process and leverages features extracted from the transformed clear images to aid transmission map estimation. [46] proposes an online knowledge distillation network (OKDNet) for single image dehazing that combines the advantages of model-based and model-free dehazing methods through online knowledge distillation, achieving excellent results. [47] proposes an image dehazing network based on multiple priors, namely the dark channel prior and the non-local dehazing prior, together with offline knowledge distillation to guide the learning and optimization of the dehazing network.

Based on the above analysis, we propose a knowledge-guided multi-perception attention network (KMAN) for image dehazing. Overall, we employ an autoencoder-like structure, combining network deconvolution (ND) and shallow and deep feature fusion (SDF) modules to form the basic architecture of the network [22]. By incorporating the concept of knowledge distillation, we not only focus on the output of the network but also use the teacher network to provide guidance for the image decoding process of the student network. To give the network more flexible adjustment abilities, we design a multi-perception attention mechanism. We also use perceptual loss and contrastive loss as regularization terms, which constrain the training process in multiple respects and enhance the robustness of the model.

In summary, our contributions are as follows:

1. We propose a knowledge-guided multi-perception attention network for image dehazing. We train a teacher network that learns from clear images to guide the decoding process of the student network, leveraging the teacher's strong prior knowledge to help gradually restore hazy images to haze-free images.

2. We employ a multi-perception attention (MPA) module that fuses information from different receptive fields and combines channel-level global and local information. Through adaptive learning, MPA assigns higher weights to more important features.

3. Our method (KMAN) provides a new paradigm for designing teacher-student network architectures for image dehazing, while achieving a good trade-off between reducing the parameter count and maintaining high-quality dehazing results, which may provide satisfactory performance on platforms with limited computing resources (such as mobile devices).

2 Related work

In this section, we review related work focusing on two main topics: attention mechanisms (channel, spatial) and their application in image dehazing, and knowledge distillation and its application in image dehazing.

2.1 Attention mechanism

The attention mechanism enables a deep learning model to focus strongly on informative features while ignoring irrelevant ones. In recent years, attention mechanisms have achieved significant results in computer vision tasks. They are mainly divided into channel attention, spatial attention, and self-attention [17]. We introduce two of these here: channel attention and spatial attention.


Channel Attention Channel attention focuses on learning the correlations between different channels. Hu et al. proposed the squeeze-and-excitation network (SENet) [18], whose channel attention module improves model performance by learning the dependencies between channels.


Spatial Attention The spatial attention mechanism emphasizes the importance of local features in images. Wang et al. proposed the non-local neural network [19], a spatial attention design that captures long-range dependencies in images.

Attention-based methods aim to selectively attend to the most informative regions of the input hazy image to guide the dehazing process. Zhang et al. proposed a densely connected pyramid dehazing network (DCPDN) [9], which incorporates a perceptual attention module to extract and fuse multiscale, multilevel features of hazy images. SADnet [13] combines channel attention, spatial attention, and self-attention into a channel-spatial self-attention (CSSA) mechanism. MSAFF-Net [14] uses channel and spatial attention at different scales to achieve feature fusion, while MARG-UNet [15] combines channel attention and pixel attention in a multimodal attention residual group (MARG) that serves as one of the basic modules of its UNet structure. These methods have achieved good results in their applications.

Various attention mechanisms give feature-map adjustment greater flexibility, yet image dehazing remains a very challenging problem because of the complexity of real scenes. To improve the flexibility and perceptual ability of the mapping from hazy images to clear images, we propose a multi-perception attention module, applied at the bottleneck of the network to learn the mapping from the latent features of hazy images to the latent features of clear images.

2.2 Knowledge distillation

Knowledge distillation trains a small model (the student) to mimic the performance of a large model (the teacher) by transferring the teacher's knowledge [20]. It has achieved success in many computer vision tasks, such as image classification and object detection [21].

In recent years, dehazing algorithms based on knowledge distillation have fallen mainly into two categories: online distillation and offline distillation. Self-parameter distillation [11] can be regarded as a special form of online distillation in which the teacher and student are the same network. It first trains the network on haze-free images to extract scene content features and then continues training on hazy images, using the previously learned scene content to remove the haze; during training, it performs self-distillation via parameter interpolation. The other category comprises offline distillation methods such as [10, 12], which train a teacher network on clear images and align the features at the bottlenecks of the teacher and student networks when training the dehazing student. Whether self-parameter distillation or bottleneck feature alignment is used, these methods lack sufficient guidance for the image decoding process near the output layer.

In our method, the student and teacher networks are designed heterogeneously: the more complex teacher network learns a clear-to-clear image mapping and provides hierarchical guidance to the student network during decoding to obtain the dehazed image. The multi-level features of the clear image supply key guidance while the student network reconstructs the dehazed image, so the student produces the restored image gradually under the teacher's guidance rather than in a single step, making it easier to recover a clear dehazed result.

3 Method

3.1 Architecture

In this section, we introduce the overall structure of the network and several key components, including knowledge guidance, the two designs of the multi-perception attention module, and the loss function.

Knowledge Guidance If a network can effectively reconstruct clear images, we consider it to have learned the information and structure of clear images, and it can guide the dehazing network by transferring its learned prior knowledge and feature representations. Based on this, we design a dual-network dehazing structure in which both networks use a pyramid encoding–decoding framework [22], as shown in Fig. 1. First, we train a teacher network on clear images to assist the dehazing network in achieving better haze removal. Then, we train a dehazing student network that is guided step-by-step by the teacher network during the decoding process to obtain clear images. Specifically, we use the pyramid encoder–decoder structure with the shallow and deep feature fusion (SDF) module (shown in Fig. 2) [22] and the network deconvolution (ND) module [5] as the basic framework of the network. At the bottleneck, we design a multi-perception attention module as the fundamental component for mapping between the latent space representations of hazy and clear images. Regarding the loss function, in addition to the knowledge-guided regularization loss, the student network adds a contrastive learning regularization loss. The two networks also have different tasks: the teacher network takes clear images as input and focuses on providing knowledge guidance for the dehazing network, while the student network takes hazy images as input and focuses on the dehazing task itself.

Fig. 1 The network architecture of the proposed KMAN

Fig. 2 The structure of the SDF module, where α is a learnable parameter

Multi-perception Attention Module Haze leads to decreased visibility and loss of detail in an image. To adapt to the complexity of dehazing scenarios, the network must effectively capture and utilize information from different levels and scales of the image. We therefore propose the multi-perception module, which uses dilated convolutions to capture information from different receptive fields and employs channel-wise average pooling and max pooling to capture global and local information, respectively. This enables the algorithm to better understand the structure and content of the scene, thereby improving the dehazing results.

The MPA modules in the two networks have slightly different structures: the multi-perception attention module in the student network (MPA-S) is lighter than that in the teacher network (MPA-T), which yields a more lightweight model. The structure of the teacher's module is illustrated in Fig. 3a. First, the features undergo parallel 1 × 1 convolutional transformations. Then, they pass through the multi-perception module and the channel attention module, and the results are multiplied together to obtain the final attention weights \(att^{t}\). These weights attentively weight the original features \(x^{t}\) of the teacher network, producing the final output \(y_{mpa}^{t}\). The entire process is depicted in Eqs. (2) and (3).

$$ att^{t} = mp\left( conv_{1 \times 1} \left( x^{t} \right) \right) * ca\left( conv_{1 \times 1} \left( x^{t} \right) \right) $$
(2)
$$ y_{mpa}^{t} = x^{t} * att^{t} $$
(3)
Fig. 3 Multi-perception attention module

\(mp(\cdot)\) and \(ca(\cdot)\) in the equations correspond to the multi-perception module and channel attention module shown in Fig. 3. The structure of the multi-perception attention module in the student network is illustrated in Fig. 3b. First, the features undergo a 1 × 1 convolutional transformation. Then, they pass through the multi-perception module, yielding the final attention weights \(att^{s}\). These weights attentively weight the 1 × 1-transformed features of the student network, producing the final output \(y_{mpa}^{s}\). The entire process is depicted in Eqs. (4) and (5).

$$ att^{s} = mp\left( conv_{1 \times 1} \left( x^{s} \right) \right) $$
(4)
$$ y_{mpa}^{s} = conv_{1 \times 1} \left( x^{s} \right) * att^{s} $$
(5)

The main difference between the two MPA modules is that the student's module (MPA-S) omits the channel attention branch used in the teacher's module (MPA-T), which makes the attention weighting process more concise and simplifies the network structure to some extent.
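
The sketch below is a minimal PyTorch rendering of Eqs. (2)–(5). The internal wiring of the multi-perception module \(mp(\cdot)\) (how the dilated-convolution branches and the channel-wise pooled maps are fused) and the SE-style form of \(ca(\cdot)\) are our assumptions, since the text names the ingredients but not their exact arrangement.

```python
import torch
import torch.nn as nn

class MultiPerception(nn.Module):
    """mp(.): parallel dilated convolutions fuse several receptive fields;
    channel-wise average/max pooling add global and local statistics.
    The branch layout here is an assumption, not the paper's exact design."""
    def __init__(self, c: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in (1, 2, 3)
        ])
        self.fuse = nn.Conv2d(3 * c + 2, c, 1)       # +2 for the pooled maps

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        avg = x.mean(dim=1, keepdim=True)            # channel-wise average pooling
        mx, _ = x.max(dim=1, keepdim=True)           # channel-wise max pooling
        return torch.sigmoid(self.fuse(torch.cat(feats + [avg, mx], dim=1)))

class ChannelAttention(nn.Module):
    """ca(.): SE-style channel weights in the spirit of SENet [18]."""
    def __init__(self, c: int, r: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // r, 1), nn.ReLU(),
            nn.Conv2d(c // r, c, 1), nn.Sigmoid())

    def forward(self, x):
        return self.mlp(x)                           # (B, c, 1, 1), broadcast later

class MPA_T(nn.Module):
    """Teacher variant, Eqs. (2)-(3): att = mp(conv1x1(x)) * ca(conv1x1(x))."""
    def __init__(self, c: int):
        super().__init__()
        self.conv_mp, self.conv_ca = nn.Conv2d(c, c, 1), nn.Conv2d(c, c, 1)
        self.mp, self.ca = MultiPerception(c), ChannelAttention(c)

    def forward(self, x):
        att = self.mp(self.conv_mp(x)) * self.ca(self.conv_ca(x))
        return x * att

class MPA_S(nn.Module):
    """Student variant, Eqs. (4)-(5): no channel attention branch."""
    def __init__(self, c: int):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 1)
        self.mp = MultiPerception(c)

    def forward(self, x):
        z = self.conv(x)
        return z * self.mp(z)
```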

3.2 Loss function

The loss function consists of four components: the reconstruction loss \(L_{rec}\), perceptual loss \(L_{perc}\), contrastive loss \(L_{contrast}\), and knowledge-guided loss \(L_{kg}\), as shown in Eq. (6).

$$ L_{total} = L_{rec} + \alpha L_{perc} + \beta L_{contrast} + \gamma L_{kg} $$
(6)

Reconstruction Loss We use the L1 loss as the reconstruction term, as shown in Eq. (7). In image dehazing, we aim to preserve detail and texture information while avoiding color and texture distortion. Compared with the L2 loss, the L1 loss is generally less sensitive to outliers and therefore better preserves image details, including texture and color information. Moreover, [32] demonstrated that training with the L1 loss achieves better PSNR and SSIM in many image restoration tasks.

$$ L_{rec} = \left| {y - D\left( I \right)} \right|_{1} $$
(7)

\(y\) represents the ground truth, \(I\) is the hazy image input to the network, and \(D(\cdot)\) denotes the dehazing network.
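
In PyTorch, the reconstruction term of Eq. (7) is simply the mean absolute error; a one-line sketch:

```python
import torch.nn.functional as F

def reconstruction_loss(output, target):
    # Eq. (7): L1 distance between the dehazed output D(I) and ground truth y
    return F.l1_loss(output, target)
```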


Perceptual loss To obtain outputs that are similar in style to the ground truth, we should focus not only on pixelwise differences but also on differences in features and style. We therefore introduce perceptual loss [23] as a regularization term in the image dehazing task, measuring the similarity between the dehazed result and the real image in feature space, as shown in Eq. (8).

$$ L_{perc} = \frac{1}{C_{j} H_{j} W_{j}} \left\| \varphi_{j}\left( \hat{y} \right) - \varphi_{j}\left( y \right) \right\|_{2}^{2} $$
(8)

\(\hat{y}\) represents the network’s output dehazed image, \(y\) represents the ground truth, \(\varphi\) is a pretrained network (VGG16), and the subscript \(j\) denotes the j-th layer of the network, with the corresponding feature shape of \(C_{j} \times H_{j} \times W_{j}\).
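
A minimal sketch of Eq. (8) with a frozen VGG16. The paper specifies VGG16 but not which layer \(j\) it uses, so the relu3_3 cut-off below is an illustrative choice.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    """Eq. (8): squared L2 distance between VGG16 features; mean reduction
    provides the 1/(C_j H_j W_j) normalization (averaged over the batch)."""
    def __init__(self, cut: int = 16):             # layers 0..15 end at relu3_3
        super().__init__()
        vgg = vgg16(weights="IMAGENET1K_V1").features[:cut].eval()
        for p in vgg.parameters():
            p.requires_grad = False                # phi is a fixed pretrained network
        self.vgg = vgg

    def forward(self, y_hat, y):
        return F.mse_loss(self.vgg(y_hat), self.vgg(y))
```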


Contrastive Loss From the perspective of contrastive learning, we can regard hazy images as negative samples and clear images as positive samples. To push the network output toward the positive samples and away from the negative samples, [24] proposed a contrastive regularization that uses a VGG network to extract deep features from the network output, the hazy image, and the corresponding clear image, as shown in Eq. (9). We introduce this contrastive loss \(L_{contrast}\) as one of the regularization terms in the loss function, further enhancing the robustness of the model.

$$ L_{contrast} = \sum_{i = 1}^{n} \omega_{i} \cdot \frac{Dis\left( \varphi_{i}\left( J \right), \varphi_{i}\left( D\left( I \right) \right) \right)}{Dis\left( \varphi_{i}\left( I \right), \varphi_{i}\left( D\left( I \right) \right) \right)} $$
(9)

\(\varphi_{i}\), \(i = 1, 2, \ldots, n\), are the hidden-layer features extracted by a fixed pretrained model \(\varphi\) at the \(i\)-th layer, and \(\omega_{i}\) are the corresponding coefficients. \(Dis(a, b)\) computes the L1 distance between features \(a\) and \(b\).
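
A minimal sketch of Eq. (9), assuming the per-layer features have already been extracted by a fixed pretrained network; the small epsilon guarding the denominator is our addition for numerical stability.

```python
import torch.nn.functional as F

def contrastive_loss(feats_clear, feats_hazy, feats_out, weights, eps=1e-7):
    """Eq. (9): pull D(I) toward the clear image J (positive) and push it
    away from the hazy input I (negative) in feature space. Each argument
    is a list of per-layer features phi_i(.); `weights` are the omega_i."""
    loss = 0.0
    for w, f_j, f_i, f_o in zip(weights, feats_clear, feats_hazy, feats_out):
        num = F.l1_loss(f_j, f_o)   # Dis(phi_i(J), phi_i(D(I)))
        den = F.l1_loss(f_i, f_o)   # Dis(phi_i(I), phi_i(D(I)))
        loss = loss + w * num / (den + eps)
    return loss
```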


Knowledge-Guided Loss If we regard image reconstruction as an encoding–decoding process, most deep learning-based dehazing methods use only the clear image itself as the ground truth in the loss design. To guide the dehazing process, we train an autoencoder on clear images as a teacher network that provides guidance at the feature level of the clear image decoder. Specifically, when training the dehazing student network on paired hazy and clear images, we also feed the clear images into the pre-trained teacher network and then use the knowledge-guided regularization term to reduce the distance between the decoder features of the student and teacher networks. In addition, we assign a weight coefficient to each feature level to further increase the flexibility of the guidance, as shown in Eq. (10).

$$\begin{aligned} L_{kg} &= \alpha_{1} \left| {T_{mpa} - S_{mpa} } \right|_{1} + \alpha_{2} \left| {T_{up1/2} - S_{up1/2} } \right|_{1} \\ &\quad + \alpha_{3} \left| {T_{up1} - S_{up1} } \right|_{1}\end{aligned} $$
(10)

In this equation, the subscripts \(mpa\), \(up1/2\), and \(up1\) denote the features of the last multi-perception attention module and of the first and second upsampling modules, which upsample the features to half of and to the full input resolution, respectively. \(\alpha_{1}\), \(\alpha_{2}\) and \(\alpha_{3}\) are the corresponding coefficients.
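
A minimal sketch of Eq. (10), assuming the three feature pairs are collected in the order (MPA output, half-resolution upsampling, full-resolution upsampling); the default coefficients follow Sect. 4.2, and the `detach()` reflects that the pre-trained teacher is not updated.

```python
import torch.nn.functional as F

def knowledge_guided_loss(teacher_feats, student_feats,
                          alphas=(0.125, 1.0, 1.0)):
    """Eq. (10): weighted L1 distances between teacher and student decoder
    features at the MPA output and the two upsampling stages. Gradients
    flow only into the student."""
    return sum(a * F.l1_loss(s, t.detach())
               for a, t, s in zip(alphas, teacher_feats, student_feats))
```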

4 Experiment

In this section, we describe the experiments conducted to evaluate our proposed method. We first introduce the datasets used, including the division into training and testing sets. We then describe the experimental setup, including the training environment and the training parameters. We compare our results with those of currently popular methods, both qualitatively and quantitatively, and compare the parameter counts of the deep learning methods. Finally, we conduct ablation experiments to demonstrate the effectiveness of each module in the network.

4.1 Datasets

We train our model on the following datasets: NYU [26], RESIDE-SOTS [25], I-HAZE/O-HAZE [34, 35], NH-HAZE [27, 28] and Dense-HAZE [33].

We tested the performance of different algorithms on the following synthetic hazy image datasets. The NYU dataset contains 1449 pairs of hazy and clear images, of which 1159 randomly selected pairs are used for training and the remaining 290 pairs for testing. For the SOTS dataset, the outdoor set consists of 500 paired images, of which 450 randomly selected pairs are used for training and the remaining 50 pairs for testing; we additionally use 500 clear images from SOTS-outdoor to train our teacher network. The indoor set also contains 500 paired images, with 450 pairs used for training and the remaining 50 for testing.

To test the performance of different algorithms on real hazy images, we evaluated them on the following real hazy image datasets: I-HAZE/O-HAZE, which contain images with homogeneous haze; NH-HAZE, which contains images with nonhomogeneous haze; Dense-HAZE, which contains images with dense haze; and RESIDE-HSTS, which contains unpaired real hazy images. We combined 30 pairs of hazy and haze-free images from the I-HAZE dataset and 45 pairs of hazy and clear images from the O-HAZE dataset, obtaining a total of 75 pairs of images. We randomly selected 60 pairs for training and 15 pairs for testing. The NH-HAZE dataset comprises 55 pairs of real nonhomogeneous hazy and haze-free images. We randomly selected 44 pairs for training and 11 pairs for testing. The Dense-HAZE dataset comprises 55 pairs of real dense hazy and haze-free images. We randomly selected 44 pairs for training and 11 pairs for testing.

4.2 Training

We train our proposed KMAN end-to-end by minimizing the loss function in Eq. (6). Input images are resized to 256 × 256 before being fed into the network. The model is trained for 100 epochs on an NVIDIA RTX 3090 GPU. The initial learning rate is set to 0.001 and updated with a cosine annealing schedule. The regularization coefficients α, β, and γ are all set to 0.1, and the knowledge guidance coefficients \(\alpha_{1}\), \(\alpha_{2}\), and \(\alpha_{3}\) are set to 0.125, 1, and 1, respectively. The batch size is fixed at 10. All algorithms use the same test sets, and all supervised learning algorithms use the same training sets. In addition, to compare the generalization abilities of different algorithms across datasets, we train the supervised learning algorithms on the SOTS-outdoor dataset and test them on the other datasets.
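
The training setup can be summarized as the sketch below. `StudentNet`, `teacher`, `train_loader`, and the loss helpers are hypothetical stand-ins for the components described in Sects. 3.1–3.2, and the choice of Adam as the optimizer is our assumption; the stated learning rate, epoch count, batch size, and cosine annealing schedule come from the paper.

```python
import torch

model = StudentNet()                                # hypothetical student network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # optimizer assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for hazy, clear in train_loader:                # 256 x 256 inputs, batch size 10
        out, s_feats = model(hazy)                  # dehazed image + decoder features
        with torch.no_grad():
            _, t_feats = teacher(clear)             # frozen pre-trained teacher
        loss = (reconstruction_loss(out, clear)          # Eq. (6), alpha=beta=gamma=0.1
                + 0.1 * perceptual_loss(out, clear)
                + 0.1 * contrastive_term(clear, hazy, out)
                + 0.1 * knowledge_guided_loss(t_feats, s_feats))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                # cosine annealing per epoch
```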

4.3 Evaluation experiments

We conducted quantitative and qualitative comparisons on multiple datasets to evaluate the performance of our proposed image dehazing algorithm in terms of dehazing quality, performance, and parameter count. We used PSNR and SSIM as standard quantitative evaluation metrics and compared the parameter count of our algorithm with those of other deep learning methods. Moreover, our method exhibited significant advantages in visual effects, indicating its capability to remove haze while preserving more image details.
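
For reference, the two metrics can be computed per image pair as below; the scikit-image (≥ 0.19) helpers shown here are a common choice, though the paper does not state which implementation it used.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(dehazed: np.ndarray, gt: np.ndarray):
    """PSNR and SSIM for one image pair; both arrays are floats in [0, 1]
    with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, dehazed, data_range=1.0)
    ssim = structural_similarity(gt, dehazed, data_range=1.0, channel_axis=2)
    return psnr, ssim
```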

To fully test the dehazing abilities of different algorithms, we designed the following experiments. First, we split each dataset into training and testing sets and evaluated the performance of each algorithm on each testing set. Then, we trained all the supervised learning algorithms on the SOTS-outdoor dataset and tested them on the other datasets. The results are shown in Tables 1 and 2.

Table 1 Quantitative comparison results on different datasets
Table 2 The results of using SOTS-outdoor as the training set

According to Table 1, our algorithm achieves state-of-the-art results on most of the artificially synthesized image dehazing datasets and the real-world image dehazing datasets. This indicates that our algorithm is better suited to handling relatively complex scenes, exhibiting favorable performance on real hazy images with varying levels of homogeneous, nonhomogeneous, and heavy haze. From Table 2, our algorithm demonstrates superior generalization performance on both the artificially synthesized dataset and the datasets with homogeneous and nonhomogeneous haze concentrations. Considering the results from Tables 1 and 2, we hypothesize that the difficulty in generalizing our algorithm to dense haze scenarios stems from the training on artificially synthesized hazy images with homogeneous concentrations in the SOTS-outdoor dataset. Furthermore, we compared the parameter counts of different algorithms. The results show that our algorithm has a parameter count of 1.51 million, which, although not the lowest, strikes a good balance between performance and parameter count. Compared to other algorithms, our method achieves better dehazing effects and visual perception while maintaining fewer parameters.

The dehazing results on paired image dehazing datasets are illustrated in Fig. 4. The dehazing results of our algorithm exhibit superior performance in terms of texture clarity and haze removal. The algorithm does not produce significant texture damage or color distortion, and it performs well in scenes with homogeneous, nonhomogeneous, and high-density haze in real-world scenarios.

Fig. 4 Qualitative comparison on different datasets. From the first row to the sixth row, the images correspond to the NYU, RESIDE-SOTS, I-HAZE/O-HAZE, NH-HAZE and Dense-HAZE datasets, respectively

In addition to comparing the performance on datasets, we further demonstrate the superiority of our method in terms of generalization on real-world hazy images. As shown in Figs. 5 and 6, we selected unpaired real hazy images from RESIDE-HSTS and several unpaired real-world hazy images as test samples. The results show that our algorithm is better able to restore image details, colors, and contrast while effectively removing haze.

Fig. 5 Qualitative comparison on HSTS real-world images

Fig. 6 Qualitative comparison on real-world images

Based on the comprehensive results of our experiments, our algorithm demonstrates outstanding performance in various respects, including superior dehazing effects, performance, and visual perception.

4.4 Ablation experiments

Our image dehazing algorithm consists of a multi-perception attention module and a knowledge guidance mechanism provided by a teacher network for the decoding process. To validate the impact of these two components on performance, we conducted ablation experiments on the multi-perception attention module and the teacher network, as shown in Table 3.

Table 3 Ablation results

Backbone We removed both the multi-perception attention module and the knowledge guidance mechanism and used the remaining modules as the backbone network for image dehazing. The experiments were conducted on the SOTS-outdoor dataset with PSNR and SSIM as evaluation metrics. The results showed a significant performance drop compared to our complete algorithm: PSNR decreased from 28.37 to 27.57, and SSIM decreased from 0.964 to 0.961. This indicates that the multi-perception attention module and the teacher network play a crucial role in improving dehazing performance.


Backbone + KG In the ablation experiment on the multi-perception attention module, we removed the module from our algorithm while retaining the knowledge guidance mechanism. Experiments were again conducted on the SOTS-outdoor dataset with PSNR and SSIM as evaluation metrics. The results showed a noticeable performance drop after removing the multi-perception attention module: PSNR decreased from 28.37 to 27.75, and SSIM decreased from 0.964 to 0.963. This demonstrates the crucial role of the multi-perception attention module in improving dehazing performance.


Backbone + MPA In the ablation experiment on the knowledge guidance loss, we removed the teacher network while retaining the multi-perception attention module. Again, experiments were conducted on the SOTS-outdoor dataset with PSNR and SSIM as evaluation metrics. The results showed a slight performance drop after removing the teacher network: PSNR decreased from 28.37 to 28.15. This indicates that the teacher network plays an important role in providing knowledge guidance for the decoding process, helping the decoder better restore the original image.


Loss Function We recorded the PSNR and SSIM values over 100 training epochs on the SOTS-outdoor dataset. The results in Fig. 7 show that model performance gradually improves as the regularization terms are progressively added to the loss function.

Fig. 7 Ablation experiment results for the loss function. a and b show the PSNR and SSIM values over 100 epochs: "psnr0" and "ssim0" correspond to retaining only the L1 loss; "psnr1" and "ssim1" to adding perceptual loss on top of the L1 loss; "psnr2" and "ssim2" to adding perceptual and contrastive losses; and "psnr3" and "ssim3" to adding perceptual, contrastive and knowledge-guided losses

In addition, to further demonstrate the effectiveness of the modules and methods we propose, we conducted a qualitative analysis of the results on real-world images, as shown in Fig. 8. It can be seen that both the MPA module and knowledge guidance significantly improve the haze removal performance of the images, and they can adapt to different complex scenes.

Fig. 8 The results of ablation experiments on real images

In summary, both the multi-perception attention module and the teacher network are key components of our algorithm, and their combination is crucial for improving dehazing performance. The multi-perception attention module helps the algorithm focus on useful information and features in the image while adapting to different scenes and haze densities. The teacher network provides valuable knowledge guidance for the decoding process, aiding the algorithm in better restoring the original image.

5 Conclusion

In this paper, we propose a knowledge-guided multi-perception attention network (KMAN) for image dehazing. We conducted experiments on multiple datasets to evaluate our algorithm in terms of dehazing effect, performance, and parameter size. The results show that our algorithm performs well in all respects, achieving a good balance between performance and parameter size while outperforming other deep learning methods in average PSNR, SSIM, and visual quality. Furthermore, our algorithm performs particularly well in challenging scenes, demonstrating its generalization ability on real-world hazy images. In the future, we plan to further optimize our algorithm and continue to explore more effective approaches to image restoration.