1 Introduction

High-quality images are critical to computer vision tasks. However, due to technical and lighting limitations, images captured under insufficient light inevitably suffer from low contrast, unexpected noise, and color shift, which degrade both perceptual quality and downstream high-level vision tasks such as object detection [16] and tracking [5]. Therefore, improving the quality of low-light images is urgently needed and has drawn significant attention in recent years.

Fig. 1. Histograms of the saturation channel produced by different enhancement methods. Our method achieves better saturation correction than SCI [19], the state-of-the-art LLIE method.

Recently, various supervised learning-based methods have been proposed for low-light image enhancement (LLIE) [8, 12, 17, 23, 25, 27, 28]. Nevertheless, because training a deep model on paired data may result in overfitting and limited generalization capability [15], unsupervised learning-based methods [4, 8, 10, 12, 17, 19] are extensively used for LLIE. Some existing unsupervised LLIE algorithms adopt an end-to-end “one-stage” approach [8, 12, 19] that enhances brightness but does not sufficiently suppress noise. Therefore, “two-stage” methods [4, 10, 17] have been proposed, which enhance image contrast and suppress noise in a “first enhance, then denoise” manner. However, this cascade processing may accumulate artifacts, and the noise amplified during brightness enhancement increases the difficulty of denoising. Furthermore, existing unsupervised low-light enhancement algorithms [4, 8, 10, 12, 17, 19] do not consider the saturation distortion caused by insufficient illumination, which leads to incorrect colors in the restored results, as shown in Fig. 1.

To address the above issues, we propose a parallel framework that includes a saturation adaptive adjustment branch, a brightness adjustment branch, a noise suppression branch, and a fusion module, which adjust saturation, correct brightness, denoise, and fuse the branches, respectively. Specifically, we propose a novel saturation adaptive adjustment method based on the Gray World Algorithm [7] and the Von Kries diagonal model [13] to adjust saturation in the HSV color space. As shown in Fig. 1, our method achieves better saturation correction than the state-of-the-art unsupervised method SCI [19]. We then design a brightness adjustment branch that uses a high-order mapping curve to adjust the original V channel at the pixel level in the HSV color space. It takes the low-light image as input and outputs the parameters of the high-order mapping curve; the estimated coefficients adjust the dynamic range of the input through the high-order curve to obtain the corrected V channel. Meanwhile, we use BM3D [6] to initially denoise the input image in the noise suppression branch of the parallel framework. Finally, the outputs of the branches are fused through a fusion module composed of a trainable unsupervised guided filter network. The whole parallel framework is trained by unsupervised learning. It avoids both the insufficient denoising of one-stage methods and the intractable noise removal of two-stage methods.

Our contributions are summarized as follows:

  (1) We propose a parallel fusion framework to simultaneously perform contrast enhancement, saturation correction, and denoising. The whole framework is trained in an unsupervised manner.

  (2) We propose a novel saturation adaptive adjustment method based on the Gray World Algorithm and the Von Kries diagonal model to correct saturation.

  (3) We introduce a trainable unsupervised guided filter fusion module in the multi-branch fusion to further suppress noise.

  (4) Experiments on the LOL, MIT-Adobe 5K, and SICE datasets show that the proposed method achieves better saturation correction and noise suppression.

2 Related Work

2.1 Traditional LLIE Methods

Early LLIE methods are based on various image priors. For example, HE-based methods [21] focus on stretching the image’s dynamic range to improve contrast, which may lead to under- or over-enhanced results. Inspired by Retinex theory [14], some methods decompose the image into a pixel-wise product of reflectance and illumination maps, and the enhanced result can be obtained through further processing. However, these methods rely on intensive parameter tuning, leading to inconsistent colors and noise in the enhanced results.

2.2 Deep Learning-Based LLIE Methods

In recent years, deep learning-based methods have shown impressive results on LLIE. High-quality normal-light images are usually used as the ground truth to guide low-light image enhancement. The first learning-based LLIE method, LL-Net [18], proposes a stacked autoencoder for simultaneous denoising and enhancement. The work in [25] is based on Retinex theory and uses dedicated subnetworks to enhance the illumination and reflectance components, respectively. Zhang et al. [30] propose KinD, which uses three subnetworks for layer decomposition, reflectance restoration, and illumination adjustment. The work in [28] proposes a recursive band network and trains it with a semi-supervised strategy. However, training a deep model on paired data may lead to overfitting and limited generalization ability, so unsupervised LLIE methods have been widely used in recent years. EnGAN [12] proposes a generator trained on unpaired data. ZeroDCE [8] designs a deep curve estimation network to adjust the dynamic range of low-light images. Liu et al. [17] build a Retinex-inspired unrolling framework with architecture search. SCI [19] presents a lightweight enhancement network and achieves state-of-the-art performance.

Fig. 2. Parallel fusion framework diagram, where the red box represents the brightness adjustment branch and the green box represents the inception module. (Color figure online)

3 Proposed Method

The proposed framework is shown in Fig. 2. It consists of three main branches and a fusion module: the input image undergoes brightness correction, saturation correction, and denoising in parallel before fusion. The input image is converted from the RGB to the HSV color space, and the S and V channels are corrected by the saturation and brightness correction branches, respectively. Meanwhile, the noise suppression branch performs preliminary denoising on the input image in parallel in the RGB color space. Finally, the outputs of the branches are fused through the fusion module. We explain the role of each branch and module in this section.

3.1 Saturation Adaptive Adjustment Branch

To solve the problem of saturation distortion, we propose a saturation adaptive adjustment method. We assume that the relative values of the three RGB components remain unchanged after the saturation of an image is adjusted. According to the Gray World Algorithm [7] and the Von Kries diagonal model [13]:

$$\begin{aligned} \left\{ \begin{aligned} {M}'(x)&= M(x) \cdot \frac{K}{\overline{M}(x)}\\ {N}'(x)&= N(x) \cdot \frac{K}{\overline{N}(x)} \end{aligned} \right. \, , \end{aligned}$$
(1)

where \(M(x)=\max {\left\{ R,G,B\right\} }\) and \(N(x)=\min {\left\{ R,G,B\right\} }\) denote the maximum and minimum channels in the RGB color space, respectively, \(\overline{M}(x)\) and \(\overline{N}(x)\) denote the mean values of the maximum and minimum channels, respectively, \({M}'(x)\) and \({N}'(x)\) denote the adjusted channels, and K is the gain coefficient. In the HSV color space, the corresponding adjustment formula for the saturation channel can be derived:

$$\begin{aligned} \begin{aligned} {S}'&= \frac{{M}'(x)-{N}'(x)}{{M}'(x)} = 1 - \frac{N(x) \cdot \frac{K}{\overline{N}(x)}}{M(x) \cdot \frac{K}{\overline{M}(x)}}\\&= 1 - (1 - S)\cdot \frac{\overline{M}(x)}{\overline{N}(x)}\\ \end{aligned} \end{aligned}$$
(2)

where \({S}'\) is the saturation channel after adjustment. We adjust the saturation channel according to Eq. 2.
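For concreteness, the following NumPy/OpenCV sketch shows one way to apply Eq. 2; the function name, the OpenCV-based HSV conversion, and the clipping to [0, 1] are illustrative assumptions rather than the authors’ released code.

```python
# A minimal sketch of the saturation adaptive adjustment (Eq. 2).
import cv2
import numpy as np

def adjust_saturation(img_bgr: np.ndarray) -> np.ndarray:
    """img_bgr: uint8 BGR image -> float BGR image with corrected saturation."""
    img = img_bgr.astype(np.float32) / 255.0
    m = img.max(axis=2)                            # per-pixel max of (R, G, B)
    n = img.min(axis=2)                            # per-pixel min of (R, G, B)
    ratio = m.mean() / max(float(n.mean()), 1e-6)  # mean(M) / mean(N), guarded

    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)     # float HSV: S, V in [0, 1]
    s = hsv[..., 1]
    hsv[..., 1] = np.clip(1.0 - (1.0 - s) * ratio, 0.0, 1.0)  # Eq. 2
    return np.clip(cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR), 0.0, 1.0)
```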

3.2 Brightness Adjustment Branch

Inspired by [8], we employ a parameter estimation network to estimate a set of best-fitting parameters for the pixel-wise light-enhancement curve that adjusts the brightness, as shown in the red box in Fig. 2. The module iteratively applies the curve with the estimated parameters to all pixels of the V channel to obtain the final enhanced V channel. The iterative process is as follows:

$$\begin{aligned} I_{n}(x) = I_{n-1}(x) + P_{n}(x) \cdot I_{n-1}(x) \cdot (1 - I_{n-1}(x)) \, , n = \left\{ 1,2,3,4 \right\} \, , \end{aligned}$$
(3)

where x denotes pixel coordinates, n is the number of iterations, \(P_{n}(x)\) is a parameter map with the same size as the given V channel. \(I_{n}(x)\) is the result of each iteration, and \(I_{0}(x)\) is the V channel of the original low-light image. We perform four iterations on the input V channel.
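A minimal PyTorch sketch of the curve iteration in Eq. 3 follows; stacking the four parameter maps along the channel dimension is an implementation assumption.

```python
import torch

def apply_curve(v: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
    """v: (B, 1, H, W) V channel in [0, 1]; params: (B, 4, H, W) maps P_1..P_4."""
    out = v
    for n in range(params.shape[1]):
        p_n = params[:, n:n + 1]              # P_n(x), same size as V
        out = out + p_n * out * (1.0 - out)   # Eq. 3 update
    return out
```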

Unlike [8], we design a multi-scale network to extract the input brightness information more effectively. Moreover, we adjust the brightness only in the V channel instead of in the RGB color space. The parameter estimation network downsamples the input to 1/2, 1/4, and 1/8 of its size and passes each scale through five skip-connected inception modules with ReLU activations. The features at the smallest scale are upsampled and gradually fused with the larger-scale features to obtain the curve parameters \(P_{n}(x)\). Finally, iterative mapping is performed according to Eq. 3 to obtain the output. The inception module is shown in the green box of Fig. 2; it consists of parallel 1\(\times \)1 convolution, 3\(\times \)3 convolution, and horizontal and vertical Sobel filters.
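The sketch below illustrates one plausible form of this inception module; the branch channel counts and the final 1\(\times \)1 fusion convolution are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionModule(nn.Module):
    """Parallel 1x1 conv, 3x3 conv, and fixed horizontal/vertical Sobel filters."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.in_ch = in_ch
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.conv3x3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        sobel = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        # Non-trainable depthwise Sobel kernels, one per input channel.
        self.register_buffer("kx", sobel.reshape(1, 1, 3, 3).repeat(in_ch, 1, 1, 1))
        self.register_buffer("ky", sobel.t().reshape(1, 1, 3, 3).repeat(in_ch, 1, 1, 1))
        self.fuse = nn.Conv2d(2 * out_ch + 2 * in_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gx = F.conv2d(x, self.kx, padding=1, groups=self.in_ch)  # horizontal edges
        gy = F.conv2d(x, self.ky, padding=1, groups=self.in_ch)  # vertical edges
        return self.fuse(torch.cat([self.conv1x1(x), self.conv3x3(x), gx, gy], 1))
```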

3.3 Noise Suppression Branch

To obtain a relatively clean image while retaining the edges and details of the input, we perform preliminary denoising on the original image through the noise suppression branch. As shown in the lower branch of Fig. 2, we use BM3D [6] as the initial denoising method. This branch can be replaced by any denoising method, and the denoising result affects the final image quality.
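As an illustration, this branch could be implemented with the `bm3d` PyPI package; the package choice and the noise level are assumptions rather than settings reported in the paper, and any off-the-shelf denoiser could be substituted.

```python
import bm3d
import numpy as np

def denoise_branch(img_rgb: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """img_rgb: float RGB image in [0, 1] -> initially denoised image."""
    # sigma is the assumed noise standard deviation (a tuning parameter).
    return np.clip(bm3d.bm3d_rgb(img_rgb, sigma_psd=sigma), 0.0, 1.0)
```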

3.4 Fusion Module

We design an unsupervised guided filtering network to fuse the branches of the parallel framework. The sharp-edged image produced by the lower branch guides the fusion of the brightness- and saturation-corrected images from the upper branches, which removes noise while retaining the proper brightness and saturation information.

Trainable Guided Filter. Traditional guided filtering [9] assumes that the guide and output images are locally linearly correlated within the filter window. Suppose q is a linear transform of the guide image G in a window \(\omega _{k}\) centered at pixel k:

$$\begin{aligned} q_{i} = a_{k} \cdot G_{i} + b_{k} \, , \forall i \in \omega _{k} \, , \end{aligned}$$
(4)

where \(a_{k}\) and \(b_{k}\) are linear coefficients. As shown in the yellow box in Fig. 2, the guided filter fusion module takes the noisy image and the guide image as input, uses five skip-connected inception modules to obtain the corresponding linear coefficients \(a_{k}\) and \(b_{k}\), and computes the output of the fusion module according to Eq. 4.
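A minimal sketch of this step follows; `CoefNet` is a hypothetical stand-in for the five-layer inception network, and broadcasting the per-pixel coefficients over the three color channels is an implementation assumption.

```python
import torch
import torch.nn as nn

class GuidedFilterFusion(nn.Module):
    def __init__(self, coef_net: nn.Module):
        super().__init__()
        self.coef_net = coef_net  # e.g. a CoefNet mapping 6 channels -> 2 maps

    def forward(self, noisy: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        ab = self.coef_net(torch.cat([noisy, guide], dim=1))
        a, b = ab[:, :1], ab[:, 1:2]   # per-pixel linear coefficients
        return a * guide + b           # Eq. 4: q = a * G + b
```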

Unsupervised Training Framework. Existing trainable guided filters [26] are almost all supervised. We split the input noisy image over its non-overlapping \(2\times 2\) windows, randomly selecting two adjacent pixels in each window to form two sub-images \(N_{1}\) and \(N_{2}\). We thus obtain two images that are conditionally independent but have similar content. Following [11], the optimization problem is transformed into:

$$\begin{aligned} \mathop {\arg \min }_{\theta } \mathbb {E}\left\| f_{\theta }(N_{1}) - N_{2} \right\| _{2}^{2} \, , \end{aligned}$$
(5)

where \(f_{\theta }\) is the trainable guided-filter denoising network parameterized by \(\theta \). Thus we obtain an unsupervised trainable guided filter.
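The sketch below shows one way to implement the \(2\times 2\) neighbor split in PyTorch, in the spirit of [11]; drawing any two distinct positions per window and sharing the indices across channels are our assumptions. Passing `idx` back in reproduces the same split on another image of the same size.

```python
import torch

def neighbor_split(img: torch.Tensor, idx=None):
    """img: (B, C, H, W) with even H, W -> two (B, C, H/2, W/2) sub-images."""
    b, c, h, w = img.shape
    # View as non-overlapping 2x2 windows: (B, C, H/2, W/2, 4).
    win = img.unfold(2, 2, 2).unfold(3, 2, 2).reshape(b, c, h // 2, w // 2, 4)
    if idx is None:
        # Two distinct positions in each window, shared across channels.
        i1 = torch.randint(0, 4, (b, 1, h // 2, w // 2, 1), device=img.device)
        i2 = (i1 + torch.randint(1, 4, i1.shape, device=img.device)) % 4
        idx = (i1, i2)
    n1 = torch.gather(win, 4, idx[0].expand(b, c, -1, -1, 1)).squeeze(-1)
    n2 = torch.gather(win, 4, idx[1].expand(b, c, -1, -1, 1)).squeeze(-1)
    return n1, n2, idx
```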

Fig. 3. Subjective comparison on the LOL dataset [25], the MIT-Adobe 5K dataset [2], and the SICE dataset [3]. RN denotes RetinexNet and ZDCE denotes ZeroDCE. Compared with other methods, our method performs better in saturation correction and noise suppression.

3.5 Loss Function

To train the Brightness Adjustment Branch, we use an exposure control loss \(L_{exp}\) that measures the distance between the average intensity \(I_{k}\) of a local region and the well-exposedness level E:

$$\begin{aligned} L_{exp} = \frac{1}{R} \sum _{k=1}^{R}\left\| I_{k} - E \right\| \, , \end{aligned}$$
(6)

where R represents the number of non-overlapping local regions of size \(16\times 16\), and we set E to 0.6 in our experiments. To preserve the monotonicity relations between neighboring pixels, we add an illumination smoothness loss \(L_{tv}\) to each curve parameter map \(P_{n}\):

$$\begin{aligned} L_{tv} = \frac{1}{T} \sum _{n=1}^{T} (\nabla _{x} P_{n} + \nabla _{y} P_{n})^{2} \, , \end{aligned}$$
(7)

where T is the number of iterations, \(\nabla _{x}\) and \(\nabla _{y}\) represent the horizontal and vertical gradient operations, respectively. The total loss for training the parameter estimation network can be expressed as: \(L_{total_1} = L_{exp} + {\lambda }_{tv} L_{tv}\), where \({\lambda }_{tv}\) is the weight of the loss.
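A minimal PyTorch sketch of these two losses follows; averaging the \(16\times 16\) regions with pooling, using an absolute-value distance in \(L_{exp}\), and penalizing the squared horizontal and vertical gradients separately in \(L_{tv}\) are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def exposure_loss(v: torch.Tensor, e: float = 0.6) -> torch.Tensor:
    """Mean distance of 16x16 region averages I_k from exposedness E (Eq. 6)."""
    regions = F.avg_pool2d(v, kernel_size=16)
    return torch.mean(torch.abs(regions - e))

def tv_loss(params: torch.Tensor) -> torch.Tensor:
    """Illumination smoothness over stacked curve maps P_n: (B, T, H, W) (Eq. 7)."""
    dx = params[..., :, 1:] - params[..., :, :-1]   # horizontal gradient
    dy = params[..., 1:, :] - params[..., :-1, :]   # vertical gradient
    return torch.mean(dx ** 2) + torch.mean(dy ** 2)

# Total loss: L_total1 = exposure_loss(v) + lambda_tv * tv_loss(params).
```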

To train the Fusion Module, we use a reconstruction loss \(L_{rec}\) to ensure structural similarity between the noisy input and the output:

$$\begin{aligned} L_{rec} = \left\| f_{\theta }(N_{1}) - N_{2} \right\| _{2}^{2} \, . \end{aligned}$$
(8)

Since the ground truths of \(N_{1}\) and \(N_{2}\) differ, directly applying the reconstruction loss alone is inappropriate and leads to over-smoothing, so we add a regularization loss \(L_{reg}\) [11]:

$$\begin{aligned} L_{reg} = \left\| f_{\theta }(N_{1}) - N_{2} - (O_{1} - O_{2}) \right\| _{2}^{2} \, , \end{aligned}$$
(9)

where \(O_{1}\) and \(O_{2}\) represent the two images split from the output. The total loss for training the trainable guided filter can be expressed as: \(L_{total_2} = {\lambda }_{rec} L_{rec} + {\lambda }_{reg} L_{reg}\) , where \({\lambda }_{rec}\) and \({\lambda }_{reg}\) are the weights of the losses.
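The sketch below combines the two losses, reusing the `neighbor_split` helper sketched in Sect. 3.4; binding the guide image into `f_theta` beforehand (e.g. via functools.partial) and applying a stop-gradient to the full-output split, as in [11], are implementation assumptions.

```python
import torch

def fusion_loss(f_theta, noisy, lam_rec=1.0, lam_reg=1.0):
    """L_total2 = lam_rec * L_rec + lam_reg * L_reg (Eqs. 8-9)."""
    n1, n2, idx = neighbor_split(noisy)                  # split the noisy input
    out1 = f_theta(n1)
    rec = torch.mean((out1 - n2) ** 2)                   # Eq. 8
    with torch.no_grad():                                # stop-gradient, as in [11]
        o1, o2, _ = neighbor_split(f_theta(noisy), idx)  # same pixel positions
    reg = torch.mean((out1 - n2 - (o1 - o2)) ** 2)       # Eq. 9
    return lam_rec * rec + lam_reg * reg
```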

4 Experiments

4.1 Experimental Setting

We conduct experiments on the LOL dataset [25], the MIT-Adobe 5K dataset [2], and the SICE dataset [3]. We randomly sample 100 images from the MIT-Adobe 5K dataset for testing and use the rest for training. The SICE dataset is a multi-exposure dataset containing 7 (or 9) images at different exposure levels for each scene. We select the first (worst-exposed) image of each scene in Part II as the test image and the third (or fourth) image as the ground truth.

Fig. 4. Detailed comparison of existing methods on the LOL dataset.

Table 1. Quantitative results in terms of four full-reference metrics (PSNR, SSIM, LPIPS, and MSE) and three no-reference metrics (LOE, NIQE, and EME) on the LOL, MIT-Adobe 5K (MIT), and SICE datasets. The best and second-best results are highlighted.

4.2 Implementation Details

We implement our framework in PyTorch on an NVIDIA RTX 3090 GPU and train the brightness adjustment module and the fusion module separately. We adopt the Adam optimizer with a learning rate of 0.0001. The cropped image size and batch size are set to 512 and 8, respectively. \({\lambda }_{tv}\) is set to 200, and \({\lambda }_{reg}\) and \({\lambda }_{rec}\) are both set to 1.

4.3 Experimental Results

We compare our method with three advanced supervised learning methods, including RetinexNet [25], DRBN [28], and KinD [30], and four unsupervised learning methods, including EnGAN [12], ZeroDCE [8], RUAS [17], and SCI [19].

Qualitative Evaluation. We present visual comparisons on typical low-light images in Fig. 3. Most previous methods cannot recover global illumination and structure well, such as RetinexNet [25] and RUAS [17], and uneven enhancement may occur in some areas of the image, as with KinD [30] and RUAS [17]. Meanwhile, as shown in Fig. 4, most existing methods do not correct saturation or suppress noise well, such as DRBN [28], EnGAN [12], ZeroDCE [8], SCI [19], and RUAS [17]. Comparatively, our method achieves good perceptual visual quality, with proper illumination and saturation as well as clean and sharp details.

Quantitative Evaluation. As shown in Table 1, we perform quantitative comparisons of the different methods using four full-reference metrics (PSNR, SSIM [24], LPIPS [29], and MSE) and three no-reference metrics (EME [1], LOE [22], and NIQE [20]). Our method achieves excellent results on most metrics compared with the existing methods, showing advantages in both illumination and structure restoration.

4.4 Ablation Study

To investigate the effectiveness of the different components of our method, we conduct ablation experiments on several key components, including each proposed branch/module and the parallel framework itself.

Contribution of Each Branch. To verify the effectiveness of each branch, we conduct ablation studies on the LOL dataset [25]. Subjective results and quantitative comparisons are shown in Fig. 5 and Table 2, respectively. Without the brightness adjustment branch, the image is very dark; without the saturation adaptive adjustment branch, the image is greenish overall; and without the fusion module, the image suffers from severe noise.

Fig. 5. Ablation on different branches/modules and frameworks. From the first to the sixth panel: (a) w/o brightness adjustment branch; (b) w/o saturation adaptive adjustment branch; (c) w/o fusion module; (d) full branches and modules; (e) cascade framework; (f) our proposed parallel framework.

Effect of Parallel Framework. To verify the effectiveness of the framework itself, we compare our proposed parallel framework with a “two-stage” cascade framework, which removes the original noise suppression branch and instead applies self-guided filtering to the outputs of the remaining branches. From Fig. 5 and Table 2, we can observe that our results are less noisy.

Table 2. Ablation on different branches/modules and frameworks. The best results are highlighted in bold. BAB means brightness adjustment branch, SAAB means saturation adaptive adjustment branch, FM means fusion module.

5 Conclusion

In this work, we propose a parallel unsupervised LLIE framework that improves brightness, corrects saturation, and suppresses noise. We design a saturation correction branch based on the Gray World Algorithm and the Von Kries diagonal model, and we place an unsupervised guided filtering module at the end of the parallel framework to fuse the branches. Extensive experiments show that our method achieves better quantitative and qualitative results than state-of-the-art unsupervised methods.