1 Introduction

Various particulate matter such as dust, water droplets, and aerosols in the atmosphere often obscures the clarity of vision-based applications in outdoor environments. The most common such phenomenon caused by inclement weather conditions is haze [25]. With a rise in the number of vision-based applications such as object classification [14], autonomous driving [31], remote sensing [5], etc., outdoor scene enhancement has become increasingly desirable for obtaining a clear scene. In the literature, various methodologies have been developed to address this problem.

1.1 Motivation

Haze is a signal-dependent, non-linear noise that increasingly attenuates an image as scene depth grows [25]. Thus, different pixel locations in a scene image suffer different amounts of degradation. Single image dehazing has lately gained more popularity than methods requiring additional data such as multiple images [24] or different degrees of polarization [32]. Since acquiring additional information is not feasible for real-time applications, the process of single image dehazing becomes more challenging. A notable problem in dehazing is the absence of datasets with natural pairs of hazy and haze-free images, as it is unlikely that the atmospheric conditions remain the same on a hazy and a clear day. Consequently, synthetic hazy images are used for training, and the dehazing methods are then tested on natural hazy scene images. Although significant work has been done to remove haze using deep learning, the difficulty still lies in the complicated architectures and the rigorous training they require.

1.2 Contributions

This paper proposes a deep learning-based single image dehazing network named “Compact Single Image Dehazing Network (CSIDNet)” for outdoor scene enhancement. The contributions of this paper are three-fold and summarized as follows:

  • As the name implies, CSIDNet is a more compact network than the existing deep learning-based dehazing models and consists of only three convolutional layers.

  • CSIDNet is trained on a much smaller dataset, with far fewer images, without compromising performance. Thus, it is easy to train, runs faster, and is more suitable for real-time applications.

  • The dehazed images obtained using CSIDNet are visually appealing and outperform the benchmarked deep learning-based dehazing models in terms of peak signal-to-noise ratio and structural similarity index measures.

Most of the dehazing models, to the best of our knowledge, are either computationally expensive, leading to an increased run-time, or require substantial resources for implementation. Contrary to this, CSIDNet has been designed with fewer layers and trained on an exceptionally small number of images, yet it significantly outperforms state-of-the-art methods.

The rest of the paper is organized as follows: Section 2 outlines the literature on image dehazing for outdoor scene enhancement, Section 3 presents the architectural design of the proposed CSIDNet, Section 4 compares and discusses the results obtained, and finally, Section 5 highlights the concluding remarks along with the future scope.

2 Related work

Some of the initial attempts at outdoor scene enhancement were based on Histogram Equalization (HE) and contrast restoration methods [7, 26]. Tan [33] maximized the local contrast based on Markov random fields, which led to over-saturated results. Fattal [6] used a refined image formation model to remove haze, but this is time consuming and fails in regions with dense haze. Meng et al. proposed the Boundary Constraint and Contextual Regularization (BCCR) [23] method to efficiently remove haze under the assumption that haze-free images have better contrast than hazy ones. This results in discontinuities in poor-contrast regions. Ancuti et al. [2] proposed a contrast enhancement method to restore the discontinuities near edges lost due to poor contrast. Contrast restoration methods often produce unrealistic images due to the underlying assumption that the pixel intensity distribution of a clear scene must be uniform. The use of better assumptions and priors helped to make significant progress in outdoor scene enhancement. He et al. observed the low intensity values in RGB images and proposed the Dark Channel Prior (DCP) [9]. DCP states that there always exist some pixels with low intensities within a local patch of one or more color channels of an RGB image. The drawback of DCP is haze overestimation in sky regions. To reduce the computational time caused by soft matting in DCP [15], subsequent works introduced the median of median filter [35], fast matting [8], and the guided filter [10]. DCP was further used by Long et al. [19], who estimated an atmospheric veil to deal with halo artifacts in the dehazing of remotely sensed hazy images. Thereafter, based on the behaviour of different image domains under hazy conditions, Tang et al. [34] proposed haze-relevant features, i.e., hue disparity, maximum saturation, and maximum contrast.

Recently, deep learning-based image dehazing models have achieved enormous popularity. Endeavours have been made to combine these models with the conventional atmospheric scattering model [25] for obtaining the clear scene. Zhu et al. introduced a Color Attenuation Prior (CAP) [38] based method to calculate the scene depth and then estimate the transmission map. However, it is not always accurate, and the further calculation of airlight from it leads to accumulation and amplification of error. Cai et al. proposed the DehazeNet architecture [4], which calculates the transmission map using four sequential operations. However, the dehazed images obtained using DehazeNet still retain some haze. Ren et al. proposed the Multi-Scale Convolutional Neural Network (MSCNN) [29], which uses a combination of fine-scale and coarse-scale networks to output the clear scene. Li et al. combined the transmission map and atmospheric light into a new variable and used it to build an input-adaptive model, i.e., the All-in-One Dehazing Network (AODNet) [16]. Ren et al. proposed the Gated Fusion Network (GFN) [30], a supervised learning-based model that takes three contrast-relevant features [1, 28] as input to perform dehazing. Wang et al. introduced the Atmospheric Illumination Prior Network (AIPNet) [36], based on the assumption that the luminance/illumination channel of a hazy image is much more affected by haze than its corresponding chrominance channels. Yang et al. [37] introduced a region detection network to approximate the transmission map, which was further used to enhance details in the dehazed image.

Yet, the relation between a hazy image and its corresponding haze-free image is quite complicated and difficult to interpret. This relation cannot be represented completely by the atmospheric scattering model [25] proposed by Narasimhan and Nayar to describe the haze formation phenomenon. As a result, dehazing methods based on this model do not perform well on natural hazy images even when they show appreciable results on synthetic images. In the recent literature, Liu et al. proposed the Generic Model-Agnostic Network (GMAN) [18], which does not take any application-specific features as input to restore the haze-free image. Since the performance of deep learning-based dehazing models depends on the dataset of hazy and haze-free images, Li et al. proposed the REalistic Single Image DEhazing (RESIDE) dataset [17]. The dataset consists of natural and synthetic hazy scene images of various haze levels with their ground truth clear counterparts. Recently, Qin et al. introduced the Feature Fusion Attention Network (FFA-Net) [27], which combines channel and pixel attention for image restoration. Ultimately, the objective of outdoor scene enhancement is to increase the robustness of vision-based applications such as object tracking; an enhancement-plus-tracking pipeline aims to boost real-time performance, so faster enhancement yields faster visual recognition. For example, regression-based tracking networks with shrinkage loss [20] have gained attention among researchers. Furthermore, Lu et al. [21, 22] proposed segmentation and tracking networks in a unified, end-to-end trainable framework. Thus, the goal is to increase the performance of enhancement networks while reducing run-time, so as to keep pace with faster vision-based applications.

3 Proposed network: compact single image dehazing network (CSIDNet)

This section explains the proposed dehazing network. Figure 1 outlines the architecture of the proposed network. In contrast to other networks, CSIDNet comprises only three convolutional layers. A detailed explanation of the architectural design of CSIDNet is provided in the following subsections.

Fig. 1 Architecture of CSIDNet. The input image is of size M × N with five input features, i.e., the R, G, B channels of the input hazy image (I), the minimum channel (Imin), and the illumination channel (IY)

3.1 Hazy input features

Inspired by DCP [9] and AIPNet [36], CSIDNet extracts the dark channel and the illumination channel from the input hazy image to learn the pattern of haze.

3.1.1 Dark channel prior

DCP [9] assumes that, in non-sky regions, the pixels of at least one color channel within a local patch always have low intensity values. Hence, the dark channel is obtained using

$$ I_{\text{dark}}(a) = \min_{n \in P(a)} \left( \min_{C_{h} \in \{\text{R, G, B}\}} I^{C_{h}}(n) \right) $$
(1)

where Ch represents the R, G, and B color channels of the input hazy image I(a), and P(a) is a local patch centered at pixel location a. Equation (1) implies that, in a haze-free image, the minimum intensity at each pixel location across all color channels has a very low value. This is mainly due to colorful objects dominated by one color channel and dark objects such as tree trunks and shadows. In this paper, the dark channel with patch size 1 × 1 has been considered.
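As a hedged illustration, the dark channel of (1) might be computed as in the following minimal NumPy sketch; the function name and the generic patch case via scipy are our own additions, while the paper itself uses only the 1 × 1 patch:

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image, patch_size=1):
    """Dark channel of an M x N x 3 RGB image in [0, 1], per Eq. (1).

    With patch_size=1, as used in this paper, the dark channel
    reduces to the per-pixel minimum over the R, G, B channels.
    """
    dark = image.min(axis=2)            # min over the color-channel axis
    if patch_size > 1:                  # generic local-patch case of Eq. (1)
        dark = minimum_filter(dark, size=patch_size)
    return dark
```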

3.1.2 Illumination channel

For a hazy input image I(a), the illumination channel, or Y channel, is obtained from the YCbCr color domain [3]. The RGB channels are converted to YCbCr channels using

$$ \begin{bmatrix} I_{\mathrm{Y}}(a) \\ I_{\text{Cb}}(a) \\ I_{\text{Cr}}(a) \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.500 \\ 0.500 & -0.419 & -0.081 \end{bmatrix} \begin{bmatrix} I_{\mathrm{R}}(a) \\ I_{\mathrm{G}}(a) \\ I_{\mathrm{B}}(a) \end{bmatrix} $$
(2)

where a is the pixel location, IY(a) is the illumination channel, ICb(a) and ICr(a) are the corresponding chrominance channels, and IR(a), IG(a), and IB(a) are the red, green, and blue color channels of the input image I(a).

The RGB color channels, dark channel, and illumination channel are then concatenated to form the hazy input features for the network as

$$ I_{\text{input}}(a) \xleftarrow{\text{ Concatenate }} \left( I_{\mathrm{R}} (a), I_{\mathrm{G}} (a), I_{\mathrm{B}} (a), I_{\text{dark}}(a), I_{\mathrm{Y}} (a) \right) . $$
(3)
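As a hedged illustration, the five-channel input of (3) might be assembled as in the following minimal NumPy sketch; the function name is our own, and the luma coefficients follow the first row of (2):

```python
import numpy as np

def hazy_input_features(image):
    """Build the 5-channel input of Eq. (3) from an M x N x 3 RGB image in [0, 1]."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    i_dark = image.min(axis=2)                  # Eq. (1) with a 1 x 1 patch
    i_y = 0.299 * r + 0.587 * g + 0.114 * b     # illumination channel, Eq. (2)
    return np.stack([r, g, b, i_dark, i_y], axis=2)
```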

3.2 Pre-activation

The hazy input features obtained from (3) are fully pre-activated [12] with batch normalization (Ψ) and the leaky ReLU activation function (Φ) as

$$ I_{\text{normalized}} (a) = {\varPhi} \left( {\varPsi} \left( I_{\text{input}}(a) \right) \right) . $$
(4)

A dropout layer (Ω) has also been included after the pre-activation stage to avoid overfitting as

$$ I_{\mathrm{D}}(a) = {\varOmega} \left( I_{\text{normalized}} (a) \right) . $$
(5)

3.3 Convolutional layers

The output of the dropout layer obtained from (5) is passed through three consecutive blocks, each consisting of a convolutional layer, a batch normalization layer, and an activation layer (i.e., Layers 1, 2, and 3 of Fig. 1), as

$$ I_{l+1}(a) = {\varPhi} \left( {\varPsi} \left( W_{l} \ast I_{l}(a) \right) \right);~~ \forall ~ l = 1, 2, 3 $$
(6)

where l is the layer number, Wl is the kernel weight matrix between layers l and l + 1, and Il(a) and Il+1(a) are the input and output of the l-th layer, respectively. The input to the first layer is I1(a) = ID(a), where ID(a) is obtained from (5).

3.3.1 Skip connection

Furthermore, to compensate for any possible information loss, the network contains one global skip connection [13]. This skip connection adds the hazy input features Iinput(a) to the output of the third convolutional layer after batch normalization, as

$$ I_{\text{skip}} (a) = {\varPsi} \left( W_{3} \ast I_{3}(a) \right) + I_{\text{input}}(a) . $$
(7)

3.3.2 Sigmoid activation

The output of the skip connection, Iskip, is then passed through a Sigmoid activation (σ) to constrain the output to the range [0, 1] as

$$ I_{\text{Sigmoid}}(a) = \sigma \left( I_{\text{skip}}(a) \right) . $$
(8)

3.4 Dehazed image

The output ISigmoid(a) contains five channels, of which the first three correspond to the required RGB dehazed output image, as

$$ I_{\text{dehazed}}(a) \xleftarrow{\text{Extract first three channels}} I_{\text{Sigmoid}}(a) . $$
(9)

Thus, CSIDNet directly outputs a haze-free image Idehazed(a) in interactive time, without computing intermediate results such as the transmission map or atmospheric light.
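To make the pipeline of Sections 3.1-3.4 concrete, the following is a minimal Keras sketch of CSIDNet assembled from (3)-(9), not the authors' code. The layer depths (16, 16, 5), 3 × 3 kernels, leaky ReLU slope, dropout rate, and He-uniform initialization follow Section 4.2; choices not stated in the paper (e.g., 'same' padding and the framework itself) are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_csidnet(height=None, width=None):
    """A sketch of CSIDNet following Eqs. (4)-(9)."""
    x_in = layers.Input(shape=(height, width, 5))   # 5-channel input, Eq. (3)

    # Pre-activation with dropout, Eqs. (4)-(5)
    x = layers.BatchNormalization()(x_in)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Dropout(0.2)(x)

    # Layers 1 and 2: convolution -> batch norm -> leaky ReLU, Eq. (6)
    for depth in (16, 16):
        x = layers.Conv2D(depth, 3, padding='same',
                          kernel_initializer='he_uniform')(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)

    # Layer 3 up to batch normalization, then the global skip, Eq. (7)
    x = layers.Conv2D(5, 3, padding='same',
                      kernel_initializer='he_uniform')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, x_in])

    # Sigmoid and extraction of the first three (RGB) channels, Eqs. (8)-(9)
    x = layers.Activation('sigmoid')(x)
    out = layers.Lambda(lambda t: t[..., :3])(x)
    return Model(x_in, out, name='CSIDNet')
```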

4 Results, validations, and discussions

This section presents a qualitative and quantitative comparison of the results obtained using the proposed network. The results are compared with the existing literature, namely BCCR [23], DCP [9], CAP [38], DehazeNet [4], MSCNN [29], AODNet [16], and GMAN [18].

While most deep learning-based dehazing models are trained on thousands to millions of images, CSIDNet has been trained on only 200 images without diminishing performance. These training images have been selected randomly from the Outdoor Training Set (OTS) of the REalistic Single Image DEhazing (RESIDE) dataset [17]. The training of CSIDNet takes about 30 minutes on a system with an Nvidia GeForce 940MX 2 GB graphics card. This is much less than the time taken by other models, which usually require at least 12-36 hours of training. The training has been conducted on a 7th-generation Intel Core i5-7200U system with a 2.5 GHz processor and 8 GB DDR4 RAM. The following subsections describe the datasets, network parameters, loss functions, quantitative and qualitative comparisons, and finally the discussions.

4.1 Datasets

The RESIDE dataset consists of 72,135 synthetic hazy images in the Outdoor Training Set (OTS) for training. For testing, it contains 500 synthetic hazy images in the Synthetic Objective Testing Set (SOTS) and 10 synthetic hazy images in the Hybrid Subjective Testing Set (HSTS), with their respective ground truths.

For the training of CSIDNet, 200 hazy images have been randomly selected from OTS to prepare one set of hazy images with their respective ground truths. Likewise, five sets in total, i.e., Set 1, Set 2, Set 3, Set 4, and Set 5, have been prepared randomly. The training sets are available online at the following link: https://drive.google.com/open?id=1uCfliFpldUUWdzT5TKPX0kbVH5mMJ1iw.

4.2 Network parameters and loss function

In CSIDNet, the depths of the convolutional layers are 16 for layer 1, 16 for layer 2, and 5 for layer 3. The proposed network has been trained on images of size 224 × 224, but can be tested on images of any resolution. The kernel size for the convolutions is 3 × 3, the slope of the leaky ReLU is 0.2, and the dropout rate is 0.2. The kernel weights of the convolutional layers have been initialized with the He uniform initializer [11]. CSIDNet is trained for 100 epochs with the Adam optimizer using momentum values β1 = 0.9 and β2 = 0.999. The loss function considered for the training of CSIDNet is the Mean Square Error (MSE), defined by

$$ L_{\text{MSE}} = \frac{1}{C_{h}N_{p}} \sum\limits_{i = 1}^{C_{h}} \sum\limits_{a = 1}^{N_{p}} { \left( I_{\text{dehazed}}^{i}(a) - I_{\text{gt}}^{i}(a) \right)^{2} } $$
(10)

where Idehazed is the dehazed image, Igt is the ground truth, Np represents the number of pixels in the image, and Ch denotes the number of color channels. For further analysis, the proposed network has also been trained with the Mean Absolute Error (MAE), also termed the L1 loss, defined by

$$ L_{L1} = \frac{1}{C_{h}N_{p}} \sum\limits_{i = 1}^{C_{h}} \sum\limits_{a = 1}^{N_{p}} { \left| I_{\text{dehazed}}^{i}(a) - I_{\text{gt}}^{i}(a) \right|} $$
(11)

With the L1 loss function, the network has been trained for 100, 150, and 200 epochs in order to find the best possible hyper-parameters.
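As a hedged illustration of this training setup, assuming the build_csidnet sketch above (the learning rate is not stated in the paper and is an assumption; data loading is illustrative):

```python
# Training configuration per Sec. 4.2: Adam with beta_1 = 0.9 and
# beta_2 = 0.999, MSE loss of Eq. (10) (or 'mae' for Eq. (11)).
model = build_csidnet(224, 224)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3,   # assumed; not stated
                                     beta_1=0.9, beta_2=0.999)
model.compile(optimizer=optimizer, loss='mse')
# model.fit(hazy_features, ground_truths, epochs=100)      # 200 OTS image pairs
```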

4.3 Quantitative comparison

The dehazed images obtained using CSIDNet have been compared quantitatively in terms of the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) index. The average PSNR and SSIM measures are tabulated in Table 1 for images from SOTS and HSTS of the RESIDE dataset [17]. Although the PSNR of GMAN is slightly higher than that of CSIDNet on the HSTS dataset, GMAN has a lower SSIM index; a lower SSIM indicates a greater number of distortions. Similarly, the SSIM index of AODNet is higher than that of CSIDNet on the SOTS dataset, but the run-time efficiency of CSIDNet is better than that of AODNet. Table 2 illustrates the comparison of average run-times on SOTS, HSTS, and natural hazy images.
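For reference, both measures can be computed with scikit-image, as in the following minimal sketch (variable names are illustrative; both arrays are RGB images in [0, 1]):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

psnr = peak_signal_noise_ratio(ground_truth, dehazed, data_range=1.0)
ssim = structural_similarity(ground_truth, dehazed, data_range=1.0,
                             channel_axis=2)   # SSIM averaged over color channels
```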

Table 1 Performance comparison: average PSNR and SSIM measures (∗ first, † second, and †† third best values)
Table 2 Performance comparison: average run-time (in seconds)

A longer run-time implies that the network introduces a lag into the process, leading to poor real-time performance. Consequently, it is important for the output to be available in interactive time. As can be seen in Table 2, the run-time of DehazeNet is among the highest of all the models, followed by GMAN, making them infeasible for real-time purposes, whereas the proposed CSIDNet is the fastest of all. Thus, CSIDNet maintains comparable PSNR and SSIM values with a faster run-time than the others.

4.4 Qualitative comparison

Figures 2, 3, and 4 show dehazed images from SOTS, HSTS, and natural hazy scenes, respectively. CSIDNet produces dehazed images without any visual artifacts. BCCR and MSCNN alter the color information near the sky region as they mainly focus on increasing the contrast. DCP generates dehazed images with halo artifacts near edges and fails to deal with sky regions. CAP produces over-saturated dehazed images and alters the color information. Compared with the aforementioned methods, DehazeNet produces visually appealing dehazed images, but some haze remains. The dehazed images obtained using GMAN contain distortions that are easily visible in the visual comparison. Among the state-of-the-art methods, the dehazed images obtained using AODNet appear better than the others; however, the blur generated near edges distorts the textural information. CSIDNet gives visually pleasing results without any visible artifacts, whereas nearly all the other methods generate noticeable distortions, especially in the sky region, likely due to excessive dehazing in regions with fine or light haze.

Fig. 2 Visual comparison on an image from SOTS of the RESIDE dataset [17]

Fig. 3 Visual comparison on an image from HSTS of the RESIDE dataset [17]

Fig. 4 Visual comparison on a natural hazy image

4.5 Discussions

To determine the compact size of the proposed network, experiments have been conducted with different numbers of layers and numbers of filters/depths. Table 3 tabulates the PSNR and SSIM index values obtained using trained models with different numbers of layers; it shows that performance on the test datasets is best with three layers. Similarly, Table 4 tabulates the PSNR and SSIM index values obtained using trained models with different numbers of filters. Figure 5 shows the plots of MSE as a loss function for different numbers of layers and filters. Finally, 16 filters were chosen for the proposed network, as they require less memory and computation time without compromising accuracy. Increasing the number of filters increases the memory requirement, which remains considerably low in the proposed network; hence, a more compact structure is obtained.

Fig. 5 Plots of the mean square error with different (a) numbers of layers and (b) numbers of filters

Table 3 Performance measures for CSIDNet with different number of layers
Table 4 Performance measures for CSIDNet with different number of filters

CSIDNet has been trained separately on the five sets, i.e., Set 1, Set 2, Set 3, Set 4, and Set 5, each containing 200 training images. For quantitative comparison, testing has been performed on images from the SOTS and HSTS datasets in terms of the PSNR and SSIM index measures. Table 5 shows the performance measures obtained with the MSE loss function using the models trained on Sets 1-5. Similarly, Tables 6, 7, and 8 show the performance measures obtained with the L1 loss function for 100, 150, and 200 epochs, respectively.

Table 5 Performance measures for CSIDNet using MSE as loss function
Table 6 Performance measures for CSIDNet using L1 loss function for 100 epochs
Table 7 Performance measures for CSIDNet using L1 loss function for 150 epochs
Table 8 Performance measures for CSIDNet using L1 loss function for 200 epochs

It can be observed that using MSE as the loss function for training gives better PSNR and SSIM index measures. This can be explained by the mathematical behaviour of MSE, which strongly penalizes the difference between the ground truth and the predicted value by squaring the error, whereas the L1 loss only considers the absolute difference. Figures 5 and 6 show the plots for the MSE and L1 loss functions, respectively.

Fig. 6 Plots of the mean absolute error/L1 loss with (a) 100 epochs, (b) 150 epochs, and (c) 200 epochs

5 Conclusion

In this paper, a compact deep learning-based single image dehazing network named Compact Single Image Dehazing Network (CSIDNet) has been proposed for outdoor scene enhancement. The proposed network not only outperforms several state-of-the-art dehazing models, but also sets a benchmark as a compact model with minimal resource requirements. The enhanced scene images obtained using CSIDNet successfully maintain a trade-off between speed and accuracy. The comparative analysis indicates that CSIDNet is faster to train and has a lower run-time while maintaining its performance and robustness both quantitatively and visually. The proposed network gives remarkable results, which suggests its scope in various critical and real-time applications. A potential future direction is the development of an end-to-end network for image dehazing, as well as denoising, under non-uniform illumination conditions, without relying on any priors or assumptions.