1 Introduction

Image dehazing has long been a hot topic in low-level vision, because haze is closely related to daily life and travel. The main task of image dehazing is to obtain a clearer view on foggy days. In addition, image dehazing, as a basic vision task, is often a prerequisite for high-level vision tasks and is commonly used as a preprocessing step to obtain clear images. For example, object detection [1,2,3], semantic segmentation [4,5,6] and stereo matching [7,8,9] are all adversely affected by hazy inputs. Therefore, how to obtain dehazed images has aroused widespread concern in both industry and academia.

The main task of image dehazing is to restore hazy images to clear images for subsequent high-level vision tasks or observation. The atmospheric scattering model [10,11,12] explains the formation process of hazy images, and researchers often invert this process to dehaze images:

$$ I\left( x \right) = J\left( x \right)t\left( x \right) + A\left( {1 - t\left( x \right)} \right) $$
(1)

where \(x\) represents the image pixel position, \(I\left( x \right)\) represents the hazy image, \(J\left( x \right)\) represents the clear image after dehazing, \(A\) represents the global atmospheric light value, and \(t\left( x \right)\) represents the transmittance at different positions in the atmosphere, which can be further expressed as:

$$ t\left( x \right) = e^{{ - \beta d\left( x \right)}} $$
(2)

where \(\beta\) is the atmospheric scattering coefficient, and \( d \left( x \right)\) is the distance from the object to the imaging plane of the camera.
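As a concrete illustration, Eqs. (1) and (2) can be applied directly to synthesize a hazy image from a clean image and a depth map. The short sketch below assumes images normalized to [0, 1]; the values of \(\beta\) and \(A\) are illustrative and not taken from this paper.

```python
import numpy as np

def synthesize_haze(J: np.ndarray, depth: np.ndarray,
                    beta: float = 1.0, A: float = 0.9) -> np.ndarray:
    """Synthesize a hazy image I(x) from a clean image J(x) via Eqs. (1)-(2).

    J     : clean image, shape (H, W, 3), values in [0, 1]
    depth : scene depth d(x), shape (H, W)
    beta  : atmospheric scattering coefficient (illustrative value)
    A     : global atmospheric light (illustrative value)
    """
    t = np.exp(-beta * depth)[..., None]   # Eq. (2): t(x) = exp(-beta * d(x))
    return J * t + A * (1.0 - t)           # Eq. (1): I(x) = J(x) t(x) + A (1 - t(x))
```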

Researchers have proposed a series of methods to solve the image dehazing problem described by Eq. (1). These methods can be divided into two categories: prior-based physical methods [13,14,15,16] and neural-network-based deep learning methods [17,18,19,20]. In the early literature, haze was removed mainly by constructing a physical model: under prior assumptions, the transmission map \(t\left( x \right)\) and the global atmospheric light value \(A\) were estimated, and the dehazed image was then recovered [13, 14]. The dark channel prior algorithm for image dehazing was proposed through statistical analysis of a large number of haze-free images [15]. The color attenuation prior was proposed, which estimates the transmission and restores the scene radiance through the atmospheric scattering model to effectively remove haze from a single image [16]. However, these methods perform well only under their prior assumptions, and image distortion occurs when the scene does not satisfy the prior conditions.

With the development of deep learning in computer vision, researchers have used convolutional neural networks to dehaze images. An end-to-end feature fusion attention network was proposed to directly restore haze-free images [22]. An attention-based multi-scale estimation module was proposed, which can efficiently exchange information across scales and thus effectively alleviate the bottleneck of multi-scale estimation [19]. A gating mechanism was designed in [23] to capture finer image differences. It was demonstrated in [22] that channel attention can better encode the global atmospheric light value \(A\), while pixel attention, which is related to image pixel positions, can effectively encode the transmission value \(t\left( x \right)\). Residual connections can implicitly express the relationship between hazy and haze-free images.

Although the above models achieve good results, the uneven distribution of haze and noise leads to color imbalance and artifacts in the dehazed image. Considering also the efficiency of image dehazing, we propose a transmission-map-guided multi-feature fusion network. From Eq. (1) we can see that the encoding of the transmission map is the key factor affecting the model results, and learning to encode \(t\left( x \right)\) is very difficult because of the uneven haze distribution. As is well known, haze is related to scene depth: the greater the depth, the denser the haze. According to Eq. (2), the transmission map contains the atmospheric scattering coefficient and depth information; different image contents correspond to different depths and therefore to different haze densities. Using the transmission map as a guide, the network can better capture the haze information of regions with different haze thicknesses. At the same time, considering dehazing efficiency, we do not construct a very complex network structure.

Specifically, we first adopt the U-Net [21] as our backbone and combine local residuals [24] with global residuals [25] to fuse local and global multi-scale information. The hazy image and the transmission map are used as the network inputs. In the encoding stage, we propose a new multi-weight fusion module that uses the transmission map as a guide to enhance the features of the dehazed image. In the decoding stage, we also use a transmission-map guidance mechanism to guide the reconstruction of the dehazed image, since the transmission map contains the haze density corresponding to different image contents. Moreover, residual and adaptive decoding modules are adopted. Different from adding the parameter α as in ACER [26], our network can spontaneously select appropriate weights for the features of different channels and thus better integrate features of different depths and scales. The proposed method significantly improves the dehazing effect, and extensive experiments show that the model achieves better performance on benchmark datasets.

Our main contributions can be summarized as follows:

  1. We design a WA module that is embedded in the encoding network. It can fuse multiple feature weights, which allows the network to better encode the transmission value \(t\left( x \right)\) and the global atmospheric light value \(A\), and the dehazing features can be fused and re-fitted.

  2. Our proposed Mix module can dynamically combine feature maps of different scales for up-sampling and, together with the local residual module, allows the network to skip thin-haze areas and pay more attention to thick-haze areas. The spatial attention module SA and the transmission-map-guided pixel enhancement module reconstruct the dehazed image more finely.

  3. Our proposed dehazing network can effectively remove haze and performs well on the OTS and Haze4K datasets.

The rest of this paper is organized as follows: the second section reviews related work, the third section introduces the proposed dehazing network, the fourth section presents comparative and ablation experiments, and the fifth section draws conclusions.

2 Related work

Image dehazing methods can be roughly divided into two categories: physical methods based on handcrafted priors and deep learning methods based on neural networks. When an image satisfies the prior conditions, prior-based methods give good results, but distortion occurs when the prior conditions do not hold. With the development of artificial intelligence, neural-network-based methods have come to dominate.

2.1 Image dehazing based on prior conditions

Through statistics, the dark channel prior [15] found that in the non-sky regions of most haze-free images, at least one color channel of each pixel has a very low intensity. The transmission map is then estimated, and a clear haze-free image is obtained by combining it with the global atmospheric light value \(A\). Fattal et al. [27] inferred the color information of the original image by assuming that the surface shading in the scene is locally statistically uncorrelated with the transmission map. Tan et al. [28] constructed a Markov random field with an energy function. Salazar-Colores et al. [28] combined DCP with morphological operations such as erosion and dilation to compute the transmission map. Meng et al. [29] added boundary constraints and contextual regularization to DCP to obtain a better transmission map. Liu et al. [30] refined the transmission map by regularization to obtain a more accurate estimate. Zhu et al. [16] obtained the transmission map by analyzing the relationship between scene depth and the differences of hazy images in different channels. Sheng et al. [48] proposed a depth-aware motion blur model to enhance images. However, prior-based dehazing distorts the result when the prior conditions are not satisfied.

2.2 Image dehazing based on deep learning

Owing to the rapid development of deep learning, convolutional neural networks have been applied to image dehazing. DehazeNet [17] and MSCNN [32], pioneers in this field, applied neural networks to estimate the transmission value \(t\left( x \right)\) and handcrafted priors to estimate the atmospheric light value \(A\); their performance did not clearly surpass that of prior-based methods. DCPDN [33] then used two networks to estimate the transmission value \(t\left( x \right)\) and the atmospheric light value \(A\) respectively. AOD-Net [18] uses a lightweight backbone network to predict the variable \(k\) and finally applies a variant of the physical dehazing model to obtain the dehazed image. Ren et al. [19] adopted a gated fusion idea and took the fusion of results generated by white balance, contrast enhancement and gamma correction as the final dehazed image. Liu et al. [20] used attention and multi-scale methods to directly predict dehazed images. Qin et al. [22] proposed an end-to-end dehazing network based on channel attention and pixel attention, which can effectively encode the transmission value \(t\left( x \right)\) and the atmospheric light value \(A\). Lu et al. [35] proposed that a multi-scale large convolution module and a parallel attention module can better remove non-uniform haze. Zhao et al. [36] proposed a two-stage dehazing network combining the advantages of priors and learning. Fan et al. [37] combined scene depth with multi-scale residual connections, which performs well in real scenes. Wen et al. [49] enhanced images by modifying the CycleGAN network. Zhou et al. [50] proposed a feedback spatial attention dehazing network with good performance on the SOTS dataset. C2PNet [51] is a deep learning network that enhances dehazing performance with cascaded channel and spatial attention mechanisms. RIDCP [52] (Revitalizing Real Image Dehazing via High-Quality Codebook Priors) addresses the challenges that existing methods face with real-world hazy images. This paper focuses on encoding the transmission value \(t\left( x \right)\) with multi-scale information and attention mechanisms. These dehazing models achieve good results by learning an end-to-end mapping from hazy to dehazed images, but they do not consider the relationship between image content and haze density. The transmission map can directly express this relationship; therefore, we propose a transmission-guided multi-feature fusion image dehazing network.

3 Proposed methods

The proposed Transmission-Guided Multi-Feature fusion network (TGMF-Net) does not adopt a complex structure and is based only on a simple U-Net architecture. The model architecture is shown in Fig. 1; it has four down-sampling stages and four up-sampling stages. Each down-sampling stage is composed of a WA module. Different from the up-sampling stage of U-Net, an adaptive weight adjustment method is used to connect the down-sampling and up-sampling stages, so that the network can independently allocate weights to the feature maps of different paths. Given \(\left\{ {I\left( x \right),T\left( x \right),J\left( x \right)} \right\}\) as the input of the network, where \(I\left( x \right)\) is a hazy image, \(T\left( x \right)\) is a transmission map, and \(J\left( x \right)\) is a clear image, we use SSIM, perceptual and \(L_{1}\) losses to train the network model.

Fig. 1 TGMF-Net network structure. TGMF-Net consists of four up-sampling structures and four down-sampling structures

3.1 Multi-layer attention feature fusion module

Our design of the WA module is inspired by FFA-Net [22]: we consider that channel attention and pixel attention can effectively encode the global variable \(A\) and the local variable \(t(x)\). Therefore, two pixel attention modules are connected in parallel after the Channel Attention module, and the different features are fused by an MLP module. Assuming the feature map and the transmission map are \(\left\{ {x,T_{r} } \right\}\), we concatenate them along the channel dimension as the input and then use the Channel Attention module to encode the global variable \(A\). The Channel Attention module can effectively extract global information, change the channel dimension of the feature, and assign different weights to different channels, making the network pay more attention to dense-haze areas, high-frequency texture information and color fidelity.

Channel Attention can be expressed as:

$$ \hat{x} = {\text{AvgPool}}2d\left( x \right) $$
(3)
$$ x_{ca} = x \otimes {\text{Sigmoid}}\left( {{\text{Conv}}2d\left( {{\text{Relu}}\left( {{\text{Conv}}2d\left( {\hat{x}} \right)} \right)} \right)} \right) $$
(4)

where \({\text{AvgPool}}2d\) is global average pooling, \(\hat{x}\) represents the feature map after global average pooling, \({\text{Conv}}2d\) is a convolution layer, \({\text{Relu}}\) is an activation function, and \({\text{Sigmoid}}\) assigns a different weight to each channel. The structure of the Channel Attention network is shown in Fig. 2.

Fig. 2 The structure of WA module. MLP means Multilayer Perceptron. Conv2d means Convolution Network. \(\oplus\) means Point-wise Addition, \(\otimes\) means Hadamard Product
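A minimal PyTorch sketch of the Channel Attention branch in Eqs. (3) and (4) is given below; the reduction ratio of the two 1×1 convolutions is an assumption, since it is not specified above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as in Eqs. (3)-(4): global average pooling followed by
    two 1x1 convolutions and a sigmoid gate (the reduction ratio is assumed)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # AvgPool2d over the whole feature map
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(self.pool(x))                      # per-channel weights in (0, 1)
        return x * w                                   # Hadamard product with the input
```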

Then, for the local variable \(t\left( x \right)\), we use the transmission map as a pixel-level weight to encode \(t\left( x \right)\) and as a weight to guide the feature map, enhancing the features of hazy areas. We consider that the spatial attention module can improve the feature expression of key areas and encode the local variable \(t\left( x \right)\) effectively.

WA module can be formulated as:

$$ x_{T} = x_{ca} \oplus (T_{r} \otimes x_{ca} ) $$
(5)
$$\begin{aligned} x_{sp} = &\,x_{ca} \otimes {\text{Sigmoid}}\left( {\text{Conv}}2d\left({\text{MaxPool}}2d\left( {x_{ca} } \right)\right.\right.\\&\,\left.\left. \oplus {\text{AvgPool}}2d\left( {x_{ca} } \right) \right) \right)\end{aligned}$$
(6)
$$ x_{{{\text{con}}}} = x_{T} \oplus x_{sp} $$
(7)

where \(x_{T}\) represents the feature map guided by the transmission map and \(x_{sp}\) represents the feature map after spatial attention. The network structure of the Spatial Attention module is shown in Fig. 2.

The transmission map contains the atmospheric scattering coefficient and the depth information at different locations. We use it as weight information to guide the network to learn the haze information in different regions of the image features, assigning larger weights to regions with denser haze.

We add \(x_{T}\) and \(x_{sp}\) point-wise and then feed the result into a multilayer perceptron, which restores the number of feature channels to the initial value. The multilayer perceptron contains two convolutions with ReLU as the activation function. We believe that the multilayer perceptron can not only fuse different features but also re-fit the dehazing features.

MLP can be expressed as:

$$ x_{{{\text{out}}}} = {\text{Conv}}2d({\text{Relu}}({\text{Conv}}2d(x_{{{\text{con}}}} ))) $$
(8)

where \(x_{{{\text{out}}}}\) is the output of the MLP module, and the overall structure of the WA module is shown in Fig. 2.
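Putting Eqs. (5)-(8) together, a hedged sketch of the WA module could look as follows. It reuses the ChannelAttention sketch above, interprets the MaxPool2d/AvgPool2d in Eq. (6) as channel-wise max/average pooling (as in standard spatial attention), and assumes a single-channel transmission map \(T_{r}\) resized to the feature resolution; kernel sizes are likewise assumptions.

```python
import torch
import torch.nn as nn

class WAModule(nn.Module):
    """Sketch of the WA block, Eqs. (5)-(8): channel attention, a transmission-guided
    branch, a spatial-attention branch, and an MLP (two convs + ReLU) that fuses them."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)               # Eqs. (3)-(4), sketched above
        self.sa_conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)  # kernel size assumed
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor, t_map: torch.Tensor) -> torch.Tensor:
        x_ca = self.ca(x)                                        # Eq. (4)
        x_t = x_ca + t_map * x_ca                                # Eq. (5): transmission-guided branch
        max_map = x_ca.max(dim=1, keepdim=True).values           # channel-wise max pooling
        avg_map = x_ca.mean(dim=1, keepdim=True)                 # channel-wise average pooling
        x_sp = x_ca * torch.sigmoid(self.sa_conv(max_map + avg_map))  # Eq. (6)
        return self.mlp(x_t + x_sp)                              # Eqs. (7)-(8)
```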

3.2 Mix module

Different from the up-sampling module of the U-Net, our up-sampling structure adopts an adaptive method that lets the network spontaneously select appropriate weights and assign them to the feature maps of different paths, so as to effectively fuse shallow and deep features. Channel attention can effectively extract global information, change the channel dimension of features, and assign different weights to different channels, making the network pay more attention to dense-haze areas, high-frequency texture information and color fidelity. Assuming that \(\left\{ {x_{a} ,x_{b} } \right\}\) are the two input feature maps, our up-sampling adaptive weight allocation module can be expressed as:

$$ \begin{aligned} & X_{{{\text{fusion}}}} \\ & = x_{a} \otimes {\text{Sigmoid}}\left( {{\text{Conv}}2d\left( {{\text{Relu}}\left( {{\text{Conv}}2d\left( {\left( {x_{a} \oplus x_{b} } \right)} \right)} \right)} \right)} \right)\left[ {x_{a} } \right] \\ & \quad + x_{b} \otimes {\text{Sigmoid}}\left( {{\text{Conv}}2d\left( {{\text{Relu}}\left( {{\text{Conv}}2d\left( {\left( {x_{a} \oplus x_{b} } \right)} \right)} \right)} \right)} \right)\left[ {x_{b} } \right] \\ \end{aligned} $$
(9)
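One possible reading of Eq. (9) is sketched below: the two path features are summed, passed through two convolutions and a sigmoid, and the output is split into one weight map per path (the \(\left[ {x_{a} } \right]\), \(\left[ {x_{b} } \right]\) indexing). The kernel sizes and the exact way the weights are split are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch of the adaptive weight allocation in Eq. (9)."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        w = self.weight_net(x_a + x_b)          # weights predicted from the summed paths
        w_a, w_b = torch.chunk(w, 2, dim=1)     # one weight map per path: [x_a], [x_b]
        return x_a * w_a + x_b * w_b            # X_fusion
```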

After the adaptive weight assignment module, the transmission map and spatial attention are used to highlight the importance of different spatial positions in the feature maps, which makes the network pay more attention to hazy pixels and high-frequency image areas, thus emphasizing the output features. A local residual structure combines the output with the input feature map of the Mix module, which lets the network bypass less important information such as thin-haze and low-frequency areas and pay more attention to effective information. Moreover, the combination of local and global residuals not only avoids training difficulties but also transmits shallow information deep into the network, so that the network obtains more detailed spatial features and semantic information. In the Mix module, we still use the transmission map as a guide and fuse it with the spatial feature map, so that the network can encode \(t\left( x \right)\) to the maximum extent and reconstruct relatively clear images from haze regions of different densities. Finally, an MLP integrates the dehazing features by adjusting the number of network channels (Fig. 3).

Fig. 3 Mix module structure diagram. The Mix structure combines adaptive up-sampling with the attention module and MLP structure

Mix module can be formulated as:

$$\begin{aligned} X_{{{\text{dehaze}}}} =&\,{\text{Conv}}2d\left( {\text{Relu}}\left( {\text{Conv}}2d\left(\left( {X_{{{\text{fusion}}}} \otimes T_{r} } \right) \right.\right.\right.\\&\,\left.\left.\left.\oplus \left({x_{a} \oplus x_{b} } \right) \oplus x_{sp} \right) \right)\right) \end{aligned}$$
(10)
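Combining Eqs. (9) and (10), a hedged sketch of the Mix module follows. It reuses the AdaptiveFusion sketch above, keeps the same spatial-attention interpretation as in the WA sketch, and assumes the transmission map has already been resized to the current feature resolution; the bilinear alignment of the two paths is also an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixModule(nn.Module):
    """Sketch of the Mix block, Eq. (10): adaptive fusion, a transmission-guided term,
    a spatial-attention term, and a two-convolution MLP."""
    def __init__(self, channels: int):
        super().__init__()
        self.fusion = AdaptiveFusion(channels)                     # Eq. (9), sketched above
        self.sa_conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)   # kernel size assumed
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x_a, x_b, t_map):
        if x_b.shape[-2:] != x_a.shape[-2:]:                       # align the two path resolutions
            x_b = F.interpolate(x_b, size=x_a.shape[-2:], mode="bilinear", align_corners=False)
        x_fusion = self.fusion(x_a, x_b)                           # X_fusion, Eq. (9)
        max_map = x_fusion.max(dim=1, keepdim=True).values         # channel-wise max pooling
        avg_map = x_fusion.mean(dim=1, keepdim=True)               # channel-wise average pooling
        x_sp = x_fusion * torch.sigmoid(self.sa_conv(max_map + avg_map))
        return self.mlp(x_fusion * t_map + (x_a + x_b) + x_sp)     # Eq. (10)
```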

3.3 Training loss

Given a pair of images \(\left\{ {I,J} \right\}\), \(I\) represents the hazy image and \(J\) the corresponding clear image. TGMF-Net predicts the dehazed image \(\hat{J}\). The \(L_{1}\), perceptual and SSIM losses are used to train our model.

In order to restore a more realistic dehazed image and to prevent the loss of details and texture information, we adopt the \(L_{1}\) loss.

\(L_{1}\) loss can be expressed as:

$$ L_{1} = \frac{1}{H*W}\mathop \sum \limits_{m = 1}^{H} \mathop \sum \limits_{n = 1}^{W} \left| {J_{{\left( {m,n} \right)}} - \hat{J}_{{\left( {m,n} \right)}} } \right| $$
(11)

where \(J\) represents the clear image, \(\hat{J}\) represents the predicted image, and \(H\) and \(W\) represent the height and width of the image. \(\left| \cdot \right|\) denotes the absolute value.

In order to obtain a dehazed image close to the ground truth, we should focus not only on the differences between pixels but also on the feature and style differences between the predicted dehazed image and the ground truth. We therefore adopt the perceptual loss.

The perceptual loss can be expressed as:

$$ L_{per} = \mathop \sum \limits_{m = 1}^{H} \mathop \sum \limits_{n = 1}^{W} \left| {\phi_{j} \left( {\hat{J}} \right)\left( {m,n} \right) - \phi_{j} \left( J \right)\left( {m,n} \right)} \right|. $$
(12)

where \(\phi_{j}\) denotes the feature map of the \(j\)-th layer, \(H\) and \(W\) its height and width, and \(m\), \(n\) the spatial positions.

Compared with the \(L_{1}\) loss, the SSIM loss pays more attention to structural similarity and is more consistent with the human visual system's judgment of the similarity of two images.

The SSIM loss can be expressed as:

$$ L_{{{\text{ssim}}}} = 1 - \mathop \sum \limits_{m = 1}^{M} \frac{{\left( {2\mu_{{Y_{m} }} \mu_{{Y_{m}{\prime} }} + \theta_{1} } \right)\left( {2\sigma_{{Y_{m} Y_{m}{\prime} }} + \theta_{2} } \right)}}{{\left( {\mu_{{Y_{m} }}^{2} + \mu_{{Y_{m}{\prime} }}^{2} + \theta_{1} } \right)\left( {\sigma_{{Y_{m} }}^{2} + \sigma_{{Y_{m}{\prime} }}^{2} + \theta_{2} } \right)}} $$
(13)

where \(\mu_{{Y_{m} }}\) is the mean of \(Y_{m}\), \(\mu_{{Y_{m}{\prime} }}\) is the mean of \(Y_{m}{\prime}\), \(\sigma_{{Y_{m} }}^{2}\) and \(\sigma_{{Y_{m}{\prime} }}^{2}\) are the variances of \(Y_{m}\) and \(Y_{m}{\prime}\), \(\sigma_{{Y_{m} Y_{m}{\prime} }}\) is the covariance of \(Y_{m}\) and \(Y_{m}{\prime}\), and \(\theta_{1}\) and \(\theta_{2}\) are constants used to prevent the instability produced by a zero denominator. The value range of \(L\) is [− 1,0].

The total loss function can be expressed as:

$$ L_{{{\text{total}}}} = L_{1} + L_{{{\text{perceptual}}}} + \lambda L_{{{\text{ssim}}}} $$
(14)

where \(\lambda\) is a hyperparameter set to \(\lambda = 0.1\).
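A hedged sketch of the training objective in Eq. (14) follows. The choice of VGG16 layer for the perceptual loss and the simplified, window-free SSIM computation are assumptions made only for illustration; any standard SSIM implementation could be substituted.

```python
import torch
import torch.nn as nn
import torchvision

class DehazeLoss(nn.Module):
    """L_total = L1 + L_perceptual + lambda * L_ssim (Eq. 14), with lambda = 0.1.
    The VGG16 feature layer and the simplified global SSIM are assumptions."""
    def __init__(self, lam: float = 0.1):
        super().__init__()
        self.lam = lam
        vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)                              # frozen feature extractor
        self.vgg = vgg

    def ssim(self, x, y, c1=0.01 ** 2, c2=0.03 ** 2):
        # Simplified SSIM computed over the whole image (no sliding window).
        mu_x, mu_y = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
        var_x = x.var(dim=(1, 2, 3), unbiased=False)
        var_y = y.var(dim=(1, 2, 3), unbiased=False)
        cov = ((x - mu_x.view(-1, 1, 1, 1)) * (y - mu_y.view(-1, 1, 1, 1))).mean(dim=(1, 2, 3))
        return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

    def forward(self, pred, target):
        l1 = (pred - target).abs().mean()                         # Eq. (11)
        l_per = (self.vgg(pred) - self.vgg(target)).abs().mean()  # Eq. (12)
        l_ssim = (1.0 - self.ssim(pred, target)).mean()           # Eq. (13)
        return l1 + l_per + self.lam * l_ssim                     # Eq. (14)
```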

4 Experiments

In this section, we describe the experimental details, including dataset preparation, evaluation metrics and experimental settings, and report comparative and ablation experiments to evaluate the performance of our model.

4.1 Experimental settings

4.1.1 Datasets

To demonstrate the performance of the model, the Haze4K [38] and RESIDE [39] datasets are used for comparative experiments. In addition, to further demonstrate the model's performance, we compare the methods on real-world images.

RESIDE is a large-scale benchmark that includes both synthetic and real hazy images, such as RESIDE-IN (indoor training set), RESIDE-OUT (outdoor training set) and the Synthetic Objective Testing Set (SOTS). Following the setting of FFA-Net [21], we adopt RESIDE-OUT (Outdoor Training Set) as our training dataset, which contains 313,950 image pairs. In addition, we take 500 paired images from SOTS as our test set.

Haze4K is a synthetic dataset containing 4000 hazy images, where each hazy image has a corresponding clean image and transmission map. We follow the setting of PMNet [40]: 3000 images are used for training and 1000 images for testing. Compared with the RESIDE dataset, Haze4K mixes indoor and outdoor scenes, and its synthesis pipeline is more realistic.
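For reproducibility, a simple loading sketch for Haze4K-style triplets (hazy image, transmission map, clean image) is given below; the directory layout, file naming and resizing are assumptions for illustration, not the official dataset structure.

```python
import os
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T

class HazeTripletDataset(Dataset):
    """Loads (hazy, transmission, clean) triplets from three parallel folders.
    Folder names and matching-by-filename are assumptions for illustration."""
    def __init__(self, root: str, size: int = 256):
        self.hazy_dir = os.path.join(root, "hazy")
        self.trans_dir = os.path.join(root, "trans")
        self.clean_dir = os.path.join(root, "clean")
        self.names = sorted(os.listdir(self.hazy_dir))
        self.to_tensor = T.Compose([T.Resize((size, size)), T.ToTensor()])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        hazy = self.to_tensor(Image.open(os.path.join(self.hazy_dir, name)).convert("RGB"))
        trans = self.to_tensor(Image.open(os.path.join(self.trans_dir, name)).convert("L"))
        clean = self.to_tensor(Image.open(os.path.join(self.clean_dir, name)).convert("RGB"))
        return hazy, trans, clean
```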

4.1.2 Training details

We trained TGMF-Net on an Nvidia RTX 4090 GPU. We use the AdamW optimizer with a learning rate of 0.0003 and exponential decay rates \({\beta }_{1}=0.5\), \({\beta }_{2}=0.99\), combined with a cosine annealing strategy to gradually decrease the learning rate. The training batch size is set to 8, and the total number of training epochs is 800.
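These optimizer settings translate directly into PyTorch; the stand-in model and the use of the total epoch count as the cosine annealing period are assumptions made for illustration.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in for the TGMF-Net model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.5, 0.99))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)

for epoch in range(800):
    # ... forward / backward passes over batches of size 8 go here ...
    scheduler.step()                                 # cosine annealing of the learning rate
```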

Because our network takes hazy images and transmission maps as inputs, the transmission maps of hazy images need to be generated. We do not optimize the hazy images before generating depth maps as in MSCDN [37]; instead, we directly use GDCP [41] to generate the transmission maps of hazy images. Although we believe that a transmission map generated after optimizing the hazy image would be better, this would not directly reflect the role of our network. Figure 4 shows some of the generated transmission maps.

Fig. 4 Hazy image and corresponding transmission map
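The paper uses GDCP [41] to obtain the transmission maps. As a rough, hedged stand-in, the classical dark-channel estimate below only illustrates how a transmission map can be derived from a hazy image; the window size, ω and the crude atmospheric light estimate are illustrative, and this is not the GDCP algorithm itself.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def estimate_transmission(hazy: np.ndarray, omega: float = 0.95, win: int = 15) -> np.ndarray:
    """Dark-channel-style transmission estimate for a hazy image in [0, 1], shape (H, W, 3)."""
    dark = minimum_filter(hazy.min(axis=2), size=win)           # dark channel of the hazy image
    # crude atmospheric light: mean color of the brightest 0.1% dark-channel pixels
    flat = dark.ravel()
    idx = np.argsort(flat)[-max(1, flat.size // 1000):]
    A = hazy.reshape(-1, 3)[idx].mean(axis=0)
    norm_dark = minimum_filter((hazy / A[None, None, :]).min(axis=2), size=win)
    return np.clip(1.0 - omega * norm_dark, 0.1, 1.0)           # t(x), clipped away from zero
```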

4.1.3 Evaluation metrics

To better evaluate the effectiveness and superiority of the algorithm, we use the Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR) as objective evaluation metrics. The skimage library is used to compute SSIM and PSNR, which avoids significant differences in results caused by different implementations.
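The metrics are computed with skimage as described; the short sketch below assumes float images in [0, 1] and a recent skimage version (with the channel_axis argument).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt):
    """pred, gt: float images in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=-1)
    return psnr, ssim
```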

4.2 Comparison with state-of-the-art methods

To verify the effectiveness of the TGMF-Net model, we compare TGMF-Net with several other advanced methods, including DCP [15], MSCNN [32], AOD-Net [18], Dehaze-Net [17], GFN [20], MSBDN [31], DMT-Net [34], PGCGA [43], KMAN [44], Transweather [45], PSPAN [46] and C2PNet [51]. In this section, the first, second and third best results in Table 1 are marked in bold, with single underline and with dashed underline, respectively.

Table 1 Quantitative comparison of different defogging algorithms in different datasets

The comparison of TGMF-Net with the other baselines demonstrates that our proposed TGMF-Net is a better and more efficient dehazing network. On the RESIDE-OUT dataset, TGMF-Net ranks second in PSNR. On the Haze4K dataset, TGMF-Net ranks first in both PSNR and SSIM; the second best is DMT-Net, which is 7.1 dB lower than TGMF-Net. Compared with C2PNet, our TGMF-Net shows no large gap on either dataset, and its results on both datasets are still better than those of the other algorithms. In terms of model size, our model has only 17.307 M parameters. Although this is larger than C2PNet, it is still feasible for edge devices. In the inference stage, processing an image takes only 5.969 ms, which is 3 times faster than MSBDN and 5 times faster than DMT-Net. Although MSCNN, AOD-Net, Dehaze-Net and GFN achieve higher speed, they are not competitive because of their insufficient dehazing ability. Therefore, our TGMF-Net can undertake time-critical image dehazing tasks.

In Fig. 5, we select four images from the SOTS test set for comparison. Both DCP and AOD-Net suffer from sky over-exposure, and MSBDN shows weak dehazing ability on the last image and does not fully restore the image details and colors.

Fig. 5 Visual comparison of different dehazing algorithms on the RESIDE-OTS dataset. The red areas represent local details

The dehazing effect of FFA-Net is the closest to that of our TGMF-Net, but FFA-Net still falls short on image details. Compared with the real haze-free image, it is obvious that TGMF-Net has the best dehazing effect and can effectively restore image details and color information.

Figure 6 shows the visual comparison of different algorithms on the Haze4K dataset, which confirms that our proposed TGMF-Net has a more obvious dehazing effect. DCP is prone to over-exposure in sky areas, the color recovery of AOD-Net is insufficient, and the dehazing effects of MSBDN and FFA-Net on Haze4K are poor. Our proposed TGMF-Net removes haze more effectively. Compared with the other algorithms, TGMF-Net pays more attention to image details and color restoration, such as color restoration in sky areas, and the images processed by TGMF-Net are closest to the GT images.

Fig. 6 Visual comparison of different dehazing algorithms on the Haze4K dataset. The red boxes indicate the significant differences

Figure 7 shows the dehazing effect in the real world. The first column shows the hazy images, and the second column shows the results of the DCP algorithm, where artifacts and color deviation can be seen. The third column uses AOD-Net; its dehazing effect is not obvious and the image becomes dim. Compared with TGMF-Net, FFA-Net has a poorer dehazing effect, especially at short distances. MSBDN is close to our result, but its effect degrades at different scene depths.

Fig. 7 Visual comparison of different dehazing algorithms on real-world images

4.3 Ablation study

To better show the performance of the model and the effectiveness of each component, we conduct ablation experiments. All models in the ablation study use the same training settings as the final model, and both objective metrics and visual comparisons are presented.

Firstly, we conducted ablation experiments on the function of each module: (1) the basic U-Net architecture, with the same number of up-sampling and down-sampling stages as our TGMF-Net; (2) the WA module added to the basic U-Net architecture; (3) the Mix module added to the basic U-Net architecture. The experiments are carried out on the Haze4K dataset and the results are shown in Table 2.

Table 2 Ablation experiments on the functions of each module

Secondly, we conducted ablation experiments on whether the input contains the transmission map: (1) only the hazy image is input; (2) both the hazy image and the transmission map are input. The experimental results are shown in Table 3.

Table 3 Ablation experiment of different input images

Thirdly, we conducted ablation experiments on different loss functions to verify the effectiveness of the three losses we use: (1) only the \(L_{1}\) loss; (2) the \(L_{1}\) and perceptual losses; (3) the \(L_{1}\), perceptual and SSIM losses.

Table 2 shows that the SSIM of the basic TGMF-Net-Base network is 0.949 and the PSNR is 28.101. Compared with TGMF-Net-Base, PSNR and SSIM increase by about 3 dB and 0.019 respectively after adding the Mix module in the decoding stage, which reflects the effectiveness of our adaptive up-sampling module. Compared with TGMF-Net-Base, PSNR and SSIM increase by about 4 dB and 0.022 respectively after adding the WA module in the encoding stage, which shows that our proposed WA module can encode the global atmospheric light value \(A\) and the transmission value \(t(x)\) more effectively. Finally, with both modules added, our model reaches an SSIM of 0.975 and a PSNR of 35.64 on the Haze4K dataset.

Table 3 shows that the SSIM is 0.972 and the PSNR is 34.862 when the input does not contain the transmission map. When the input combines the transmission map and the hazy image, SSIM and PSNR increase by approximately 0.003 and 0.784, respectively. It can be seen that taking the transmission map and hazy image as inputs benefits the dehazing network.

Table 4 shows that using only the \(L_{1}\) loss yields an SSIM of 0.965 and a PSNR of 32.72; using the \(L_{1}\) and perceptual losses yields an SSIM of 0.968 and a PSNR of 33.67; and using the \(L_{1}\), perceptual and SSIM losses yields an SSIM of 0.975 and a PSNR of 35.646. We can see that the combination of the three losses works best for our network.

Table 4 Ablation experiment of different loss function

5 Conclusion

In order to further improve the quality of image dehazing and the performance of downstream high-level vision tasks, a transmission-map-guided multi-feature fusion dehazing network, TGMF-Net, is proposed in this study. It includes a transmission-guided module, a down-sampling encoding module, an adaptive multi-scale up-sampling module and residual structures. The down-sampling encoding module encodes \(t(x)\) better by combining channel and pixel attention, while the transmission map is used as a weight to guide the network to learn haze information at different scene depths. The adaptive multi-scale up-sampling module enables the network to spontaneously combine information of different depths, so as to better fit the dehazing features. The combination of local and global residuals not only lets the network focus more specifically on dense-haze and semantically rich regions, but also predicts the underlying relationship between haze-free and hazy images. At the same time, we embed the transmission-map guidance module in the up-sampling process, so that the predicted haze-free image performs better across different depth contents. Extensive experiments show that our proposed TGMF-Net is superior to other algorithms on benchmark datasets and in real scenarios.