1 Introduction

Low-light image enhancement (LLIE) methods aim to restore the brightness of images captured in low-light environments and suppress the noise caused by uneven lighting. Low-light images are common in real-world scenarios: insufficient lighting leads to detail loss and low contrast, which greatly reduce visual quality. These degradations not only impair human visual perception but also hinder subsequent computer vision tasks designed for high-quality images, such as face recognition [1], object detection [2] and semantic segmentation [3]. In practical applications, the brightness of such images can be improved through digital image processing, as shown in Fig. 1.

Fig. 1 Visual comparison of a low-light image and its enhanced results

In recent years, a large number of studies on LLIE have been conducted. Traditional LLIE methods can be roughly categorized into two types: histogram equalization (HE)-based methods [4] and Retinex model-based methods [5, 6]. Although these methods can improve image brightness, the enhanced images usually suffer from severe color deviation and edge blur.

Recently, owing to the great success of deep learning in computer vision, many deep learning-based LLIE methods have been proposed. They are mainly divided into two categories: end-to-end frameworks [7,8,9,10] and Retinex-based frameworks [11,12,13]. End-to-end frameworks learn a direct mapping from a low-light image to the corresponding normal-light image. However, their results often suffer from color distortion and noise, and the models lack interpretability, making it difficult to adapt to images captured in real low-light scenes. Retinex-based frameworks typically decompose a low-light image into illumination and reflectance maps and generate a normal-light image by adjusting the two components separately. However, the training process of such frameworks is cumbersome and time-consuming, and the quality of the enhanced images largely depends on the accuracy of the decomposition.

To tackle the above issues, we propose an illumination-aware two-stage network (IATN) based on Retinex theory for LLIE, which achieves enhancement gradually through two stages: illumination estimation and feature correction. In the first stage, to reduce the dependence on image decomposition and the computational complexity of the model, we construct a tiny network based on Retinex theory, namely the illumination estimation network (IEN), to generate a coarse enhancement result. In the second stage, to address the noise and color distortion remaining in the first-stage result, an illumination-aware correction network (IACN) consisting of an illumination feature extraction branch and a feature correction branch is constructed to obtain a refined enhancement result. In the IACN, considering the diversity of lighting conditions, an initial illumination map is first obtained directly from the input low-light image and fed to the illumination feature extraction branch to extract illumination features at different scales. Then, a U-shaped feature correction branch containing multiple illumination-aware modules (IAMs) is designed to correct and restore features at different scales. The IAM is the core module for feature restoration: it guides the network toward noise removal and color correction by establishing the autocorrelation of image features in the channel dimension and the correlation between image features and illumination guidance features in the spatial dimension.

Overall, our contributions are summarized as follows:

(1) An IATN for low-light image enhancement is proposed by constructing an IEN and an IACN, in order to gradually obtain enhanced results with better visual effects.

(2) In the first stage, a tiny IEN based on Retinex theory is constructed to learn an illumination map and obtain a coarse enhancement result.

(3) In the second stage, an IACN consisting of an illumination feature extraction branch and a feature correction branch is constructed to obtain a refined enhancement result. In the feature correction branch, an IAM is designed to restore features by establishing the correlation between the illumination features and the image features.

(4) Extensive experiments on public low-light image datasets show that the proposed network achieves better performance both subjectively and objectively than some state-of-the-art methods.

2 Related works

2.1 Traditional LLIE methods

The traditional LLIE methods are roughly divided into two categories: histogram equalization (HE)-based methods and Retinex model-based methods.

Histogram-based methods enhance the image by extending the dynamic range of pixel values. For example, Celik et al. [14] enhanced the contrast of the input image by mapping the diagonal elements of the initial histogram to the diagonal elements of the target histogram. Lee et al. [15] applied the layered difference representation of 2D histograms to amplify the gray-level differences between adjacent pixels. Although such methods can enhance the brightness of an image, they often fail to achieve a satisfactory enhancement effect because they rely on simple mathematical operations without considering the image formation model.

Retinex-based models follow a common assumption derived from Retinex theory: an image can be represented as the product of an illumination layer describing the illumination distribution of the scene and a reflectance layer describing the surface properties of the objects. Retinex-based methods obtain enhancement results by separately processing the decomposed reflectance and illumination components. For example, Fu et al. [16] proposed a weighted variational model to simultaneously estimate the illumination and reflectance maps from an observed image. Hao et al. [17] proposed a novel Retinex-based LLIE method that operates in a semi-decoupled way. However, this decomposition is an ill-posed problem, and due to imprecise prior assumptions the enhancement results often suffer from overexposure.

2.2 Learning-based LLIE methods

Due to the powerful learning ability of deep networks, learning-based methods have gradually become the mainstream for LLIE. For example, Wei et al. [11] proposed a two-stage Retinex-based method called RetinexNet. The first stage decomposes the input image into reflectance and illumination maps, and the second stage obtains the enhanced result by adjusting the illumination map. Inspired by RetinexNet, Zhang et al. proposed two refined methods, KinD [18] and KinD++ [13], which include three sub-networks: decomposition-net, restoration-net and adjustment-net. Guo et al. [19] proposed an LLIE framework based on the divide-and-rule principle, which converts the image from RGB space into a luminance–chrominance space and designs an adjustable noise suppression network to eliminate noise in the brightened luminance. These methods suffer from complex and time-consuming training, so some researchers have further studied end-to-end methods. For example, Liang et al. [20] designed a DCP-guided hierarchical dynamic mechanism for end-to-end LLIE. Yang et al. [21] presented a U-shaped encoder–decoder network based on multi-scale feature complementation. Fu et al. [22] designed an unsupervised network based on Retinex theory that learns adaptive priors from pairs of low-light images and generates clean images through consistency constraints on the reflectance. However, because CNN-based methods cannot establish long-range dependencies among features, the issues of noise and color distortion have not been effectively addressed.

Due to its superior ability to capture long-range dependencies, the Transformer is widely used in computer vision tasks. The Transformer [23] was first proposed for natural language processing, and its strong performance has motivated many researchers to introduce it to vision tasks [24,25,26]. In image restoration, Zhang et al. [27] proposed a Structure and Texture-Aware Network, in which the structure sub-network is composed of stacked Transformer modules, while the texture sub-network is composed of stacked central difference convolution modules. Xu et al. [9] proposed an SNR-aware CNN–Transformer hybrid network, called SNR-Net, which enhances regions with a low signal-to-noise ratio using a Transformer and regions with a high signal-to-noise ratio using convolution operations. However, these methods carry a heavy computational burden and are difficult to apply to high-resolution images.

3 Proposed method

3.1 Motivation and overview

Based on Retinex theory, an observed image can be decomposed into two components: a reflectance map and an illumination map. The reflectance map represents the inherent properties of the object itself and is not affected by light. The illumination map reflects the light intensity and determines the dynamic range of the grayscale values of all pixels in the observed image. Under low illumination, uneven lighting may also introduce noise into the observed image. Therefore, a low-light image is affected by both light intensity and noise, and its degradation model can be defined as follows:

$$ I_{low} = R \odot L + N $$
(1)

where \(I_{low}\) stands for the low-light image, \(R\) stands for the reflectance map, \(L\) is the illumination map, \(N\) represents the noise, and \(\odot\) represents element-wise multiplication.

According to formula (1), an IATN for low-light image enhancement is proposed, as shown in Fig. 2. The first-stage network is constructed based on the first term on the right-hand side of formula (1) and obtains a coarse enhancement result by estimating the illumination map \(\overline{L}\). Due to uneven illumination intensity, the first-stage result \(\overline{R}\) often still contains amplified noise and color distortion. Therefore, an IACN is constructed in the second stage to obtain a refined enhancement result \(\overline{I}\).

Fig. 2 The overall framework of the proposed IATN

Below, we will provide a detailed introduction to the construction of the two stages in the proposed IATN.

3.2 Illumination estimation network (IEN)

In the first stage, based on Retinex theory, an IEN is constructed to learn an illumination map from the input low-light image and obtain a coarse enhancement result via the inverse formula \(\overline{R} = I / \overline{L}\). Because the illumination map reflects the light intensity of the environment and does not contain specific content, an ideal illumination map has no high-frequency components. Therefore, instead of learning complex features, the IEN only requires a very small architecture to learn an illumination map, as shown in Fig. 2. The IEN adopts three simple convolutional layers, each containing a convolution operation and an activation function to extract features and increase the nonlinearity of the network. A sigmoid function in the final convolutional layer compresses the values of the illumination map to the range [0, 1].
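A minimal PyTorch sketch of this stage is given below. The hidden width of 16 channels, the 3 × 3 kernels and the three-channel illumination map are illustrative assumptions; the paper only specifies three convolutional layers with activations and a final sigmoid, followed by the division \(\overline{R} = I / \overline{L}\).

```python
import torch
import torch.nn as nn

class IlluminationEstimationNet(nn.Module):
    """Sketch of the first-stage IEN: three convolutional layers, the last one
    followed by a sigmoid so the illumination map lies in [0, 1].
    Channel widths and kernel sizes are assumptions, not the paper's exact values."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, i_low: torch.Tensor):
        l_hat = self.body(i_low)               # estimated illumination map L_bar
        r_hat = i_low / l_hat.clamp(min=1e-4)  # coarse enhancement R_bar = I / L_bar
        return l_hat, r_hat
```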

To make the learned illumination map more accurate, a dual-constraint loss function is defined, which includes two loss terms to constrain the generated illumination map and coarse enhancement image, respectively. The first loss term is defined as the structure-aware TV loss [28] and can be expressed as:

$$ L_{is} = \left\| {\nabla \overline{L} \odot \exp \left( { - \lambda_{g} \nabla I_{GT} } \right)} \right\| $$
(2)

where \(\nabla\) denotes the gradient operator, \(I_{GT}\) denotes the ground-truth (GT) image, \(\lambda_{g}\) denotes the coefficient balancing the strength of structure awareness, and \(\exp ( \cdot )\) denotes the exponential function. The second loss term is defined as a mean squared error (MSE) loss, which is used to evaluate the difference between the coarse enhancement image and the corresponding GT image. It can be expressed as:

$$ L_{mse} = \left\| {\overline{R} - I_{GT} } \right\|_{2}^{2} $$
(3)

Based on Eqs. 2 and 3, the dual-constraint loss function of IEN can be expressed as:

$$ L_{ie} = L_{is} + L_{mse} $$
(4)
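The sketch below implements this dual-constraint loss under two stated assumptions: the norm in Eq. (2) is taken as an L1 norm over the absolute gradients, and \(\lambda_{g}\) is set to a placeholder value of 10, since the paper does not report it in this section.

```python
import torch
import torch.nn.functional as F

def gradient(x: torch.Tensor):
    """Horizontal and vertical first-order differences, zero-padded at the border."""
    dx = F.pad(x[:, :, :, 1:] - x[:, :, :, :-1], (0, 1, 0, 0))
    dy = F.pad(x[:, :, 1:, :] - x[:, :, :-1, :], (0, 0, 0, 1))
    return dx, dy

def ien_loss(l_hat, r_hat, i_gt, lambda_g: float = 10.0):
    """Dual-constraint loss of Eq. (4): structure-aware TV term on the estimated
    illumination map plus an MSE term on the coarse enhancement result."""
    gx_l, gy_l = gradient(l_hat)
    gx_g, gy_g = gradient(i_gt)
    # Eq. (2): gradients of L_bar weighted by exp(-lambda_g * |grad I_GT|)
    l_is = (gx_l.abs() * torch.exp(-lambda_g * gx_g.abs())).mean() + \
           (gy_l.abs() * torch.exp(-lambda_g * gy_g.abs())).mean()
    l_mse = F.mse_loss(r_hat, i_gt)  # Eq. (3)
    return l_is + l_mse              # Eq. (4)
```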

3.3 Illumination-aware correction network (IACN)

Although the coarse enhancement result already has good visual quality, it is still affected by noise amplification and color distortion caused by uneven lighting, and both degradations become more pronounced in darker areas. Therefore, in the second stage shown in Fig. 2, an IACN consisting of two branches, an illumination feature extraction branch and a feature correction branch, is designed to remove noise and correct color in the coarse enhancement result \(\overline{R}\), in order to obtain a refined enhancement image with better visual quality.

In the illumination feature extraction branch, illumination features at different scales are generated to guide the restoration of features in the feature correction branch. The specific operations are as follows. First, considering the illuminance differences among input low-light images, the initial illumination map \(L_{0}\) used to extract illumination features is obtained directly as the maximum value over the three channels of the input low-light image. Then, three convolutional layers, one 3 × 3 convolution and two 4 × 4 convolutions with a stride of 2, are used to obtain illumination feature maps \(F_{l}^{i}\) (\(i = 1,2,3\)) at different scales, where \(i\) indexes the scale. The number of channels at each scale in the illumination feature extraction branch matches the number of channels at the same scale in the feature correction branch. Finally, these feature maps are sent to the corresponding layers of the feature correction branch.
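A sketch of this branch, assuming channel widths of 32, 64 and 128 (the paper does not list the exact widths), could look as follows.

```python
import torch
import torch.nn as nn

class IlluminationBranch(nn.Module):
    """Sketch of the illumination feature extraction branch: the initial illumination
    map L0 is the per-pixel maximum over the RGB channels, and one 3x3 convolution
    plus two strided 4x4 convolutions produce F_l^1..F_l^3 at three scales.
    The channel widths are assumptions chosen to match a correction branch."""
    def __init__(self, widths=(32, 64, 128)):
        super().__init__()
        c1, c2, c3 = widths
        self.conv1 = nn.Conv2d(1, c1, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(c1, c2, 4, stride=2, padding=1)
        self.conv3 = nn.Conv2d(c2, c3, 4, stride=2, padding=1)

    def forward(self, i_low: torch.Tensor):
        l0 = i_low.max(dim=1, keepdim=True).values  # initial illumination map L0
        f1 = self.conv1(l0)                         # full resolution
        f2 = self.conv2(f1)                         # 1/2 resolution
        f3 = self.conv3(f2)                         # 1/4 resolution
        return f1, f2, f3
```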

In the feature correction branch, a U-shaped network containing three feature correction layers is designed, which corrects features at different scales under the guidance of the illumination features. In each feature correction layer, multiple IAMs are used to enhance features by establishing the correlation between the illumination guidance features and the features extracted from \(\overline{R}\). A 4 × 4 convolution or deconvolution with a stride of 2 is used to downsample or upsample the feature maps while expanding or compressing the channel dimension. The structure of the IAM, shown in Fig. 3, is described in detail below.

Fig. 3 The structure of IAM

The IAM receives the feature maps \(F_{r}^{i}\) extracted from \(\overline{R}\) and the guidance feature maps \(F_{l}^{i}\), and enhances the image features by constructing correlations between the two. The execution process of the IAM is as follows. First, the feature maps \(F_{r}^{i}\) undergo layer normalization and a convolutional layer to generate the query (\(Q\)), key (\(K\)) and value (\(V\)) projections. This process can be defined as:

$$\begin{aligned} Q &= W_{Q}\left( {LN\left( {F_{r}^{i} } \right)} \right),\quad K = W_{K}\left( {LN\left( {F_{r}^{i} } \right)} \right),\\ V &= W_{V}\left( {LN\left( {F_{r}^{i} } \right)} \right) \end{aligned}$$
(5)

where \(W_{Q}\), \(W_{K}\), \(W_{V}\) represent the convolutional layers, and \(LN\) denotes layer normalization.

Then, to reduce the computational complexity of the network and capture global contextual relationships between pixels, the \(Q\) and \(K\) projections are reshaped to generate a transposed-attention map across the feature dimension through dot-product interaction. In addition, to perceive the impact of illumination on different regions, the guidance feature maps are reshaped and multiplied element-wise with the reshaped \(V\) projection to generate an illumination-aware attention map in the spatial dimension. The specific operations are defined as follows:

$$ \begin{aligned} A & = {\text{softmax}}\left( \widehat{Q} \times \widehat{K} / \alpha \right) \\ IA & = \widehat{F_{l}^{i}} \odot \widehat{V} \end{aligned} $$
(6)

where \(A \in R^{C \times C}\) represents the transposed-attention map, \(\widehat{Q}\), \(\widehat{K}\) and \(\widehat{V}\) represent the feature maps reshaped from \(Q\), \(K\) and \(V\), respectively, \(\alpha\) is a learnable parameter that adaptively scales the matrix multiplication, \(\widehat{F_{l}^{i}}\) represents the illumination feature maps reshaped from \(F_{l}^{i}\), and \(IA \in R^{HW \times C}\) represents the illumination-aware attention map.

Next, the two attention maps are multiplied to achieve interaction of local and nonlocal features, and the resulting feature maps are aggregated with the input feature map \(F_{r}^{i}\) through a residual connection. Finally, the aggregated feature maps pass through a gated-convolutional feed-forward network [29] to obtain the output of the IAM, i.e., the enhanced features. This process is defined as follows:

$$ F_{o}^{i} = FN\left( {A \times IA + F_{r}^{i} } \right) $$
(7)

where \(FN\) denotes the gated-convolutional feed-forward network, and \(F_{o}^{i}\) is the output of the IAM.
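The following sketch puts Eqs. (5)–(7) together. It is a simplified, assumed implementation: 1 × 1 projections and a plain convolutional feed-forward block stand in for the paper's projection layers and the gated-convolutional feed-forward network of [29], GroupNorm with one group stands in for layer normalization, and the product \(A \times IA\) is ordered as \(IA \cdot A\) so that the matrix shapes (HW × C and C × C) agree.

```python
import torch
import torch.nn as nn

class IAM(nn.Module):
    """Sketch of an illumination-aware module: channel-wise (transposed) attention A
    from Q and K, illumination-aware spatial map IA from the guidance features and V,
    combined as in Eq. (7). Several components are simplifying assumptions (see text)."""
    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)      # stand-in for layer normalization
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)
        self.alpha = nn.Parameter(torch.ones(1))   # learnable scaling factor alpha
        self.ffn = nn.Sequential(nn.Conv2d(channels, channels * 2, 1), nn.GELU(),
                                 nn.Conv2d(channels * 2, channels, 1))

    def forward(self, f_r: torch.Tensor, f_l: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_r.shape
        x = self.norm(f_r)
        q = self.to_q(x).reshape(b, c, h * w)      # C x HW
        k = self.to_k(x).reshape(b, c, h * w)
        v = self.to_v(x).reshape(b, c, h * w)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.alpha, dim=-1)  # A: C x C
        ia = (f_l.reshape(b, c, h * w) * v).transpose(1, 2)               # IA: HW x C
        # Eq. (7): combine the two attention maps, add the residual F_r, apply FN
        fused = (ia @ attn).transpose(1, 2).reshape(b, c, h, w)
        return self.ffn(fused + f_r)
```

In the correction branch, a module like this would be applied at each scale, with `f_l` taken from the illumination branch at the matching resolution.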

For the training of IACN, we define a loss function that includes two loss terms: the Charbonnier loss [30] and structural similarity loss [31], as shown below:

$$ L_{iac} = \lambda \sqrt {\left\| {\overline{I} - I_{GT} } \right\|_{F}^{2} + \varepsilon^{2} } + \left( {1 - \lambda } \right)L_{SSIM} \left( {\overline{I} ,I_{GT} } \right) $$
(8)

where \(L_{SSIM}\) represents the structural similarity loss, the weighting parameter λ is set to 0.8, and the constant ϵ is set to 0.001.
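As a hedged sketch of Eq. (8): the Charbonnier term is written exactly as in the equation (in practice it is often averaged per pixel), the SSIM loss is taken as 1 − SSIM, which is a common but assumed convention, and `ssim_fn` is any SSIM implementation returning a similarity in [0, 1], for example from the pytorch-msssim package.

```python
import torch

def iacn_loss(i_hat, i_gt, ssim_fn, lam: float = 0.8, eps: float = 1e-3):
    """Loss of Eq. (8): Charbonnier term weighted by lambda = 0.8 plus an
    SSIM-based term weighted by (1 - lambda)."""
    charbonnier = torch.sqrt(((i_hat - i_gt) ** 2).sum() + eps ** 2)
    l_ssim = 1.0 - ssim_fn(i_hat, i_gt)   # assumed form of the structural similarity loss
    return lam * charbonnier + (1.0 - lam) * l_ssim
```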

4 Experiments and analysis

4.1 Implementation details

We implemented our model using PyTorch on a single NVIDIA RTX 3080 GPU. In the first stage of IATN, the batch size is set to 32 and the number of epochs to 10. In the second stage, the numbers of IAMs in the IACN are 1, 2, 2, 2 and 2 from the first to the fifth layer, respectively; the batch size is set to 8 and the number of epochs to 2000. In both stages, we augment the data with rotation and horizontal flipping and optimize the networks with the Adam optimizer. The initial learning rate is set to 0.0001 in both stages, and during the training of the IACN the learning rate steadily decreases to 0.000001 following the cosine annealing scheme [32].
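The second-stage schedule can be wired up as sketched below; only the optimizer and scheduler settings come from the text, while the stand-in model and the dummy step exist purely to keep the sketch runnable.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)                 # stand-in for the IACN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=2000, eta_min=1e-6)

for epoch in range(2000):
    # one pass over the augmented (rotated / horizontally flipped) LOL pairs goes here;
    # a dummy step keeps this sketch self-contained:
    optimizer.zero_grad()
    model(torch.randn(1, 3, 64, 64)).mean().backward()
    optimizer.step()
    scheduler.step()   # cosine decay from 1e-4 toward 1e-6 across the 2000 epochs
```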

4.2 Dataset and evaluation metrics

We train our model on the LOL dataset [11], which contains 485 pairs of low-/normal-light images for training and 15 pairs for testing. To verify the generalization of the model, we test the trained model on another dataset LOL-v2. The LOL-v2 dataset is the real part of LOL-v2 [12], which is larger and more diverse than LOL, including 689 pairs of low-/normal-light images for training and 100 pairs for testing.

To evaluate the performance of different LLIE methods, we adopt three well-known objective evaluation metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [31] and Learned Perceptual Image Patch Similarity (LPIPS) [33]. Higher PSNR and SSIM values indicate better image quality, whereas a lower LPIPS value indicates better image quality.
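One possible way to compute these metrics for a single image pair is sketched below, assuming the scikit-image (0.19 or newer, for `channel_axis`) and lpips packages are available; the paper does not state which implementations were used.

```python
import numpy as np
import torch
import lpips                                   # pip install lpips (assumed tool)
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def _to_lpips_tensor(img: np.ndarray) -> torch.Tensor:
    # HxWx3 uint8 -> 1x3xHxW float in [-1, 1], the range expected by LPIPS
    return torch.from_numpy(img).permute(2, 0, 1)[None].float() / 127.5 - 1.0

def evaluate_pair(enhanced: np.ndarray, gt: np.ndarray):
    """PSNR, SSIM and LPIPS for one HxWx3 uint8 image pair."""
    psnr = peak_signal_noise_ratio(gt, enhanced, data_range=255)
    ssim = structural_similarity(gt, enhanced, channel_axis=2, data_range=255)
    lpips_fn = lpips.LPIPS(net='alex')          # instantiated here for brevity; reuse in practice
    lpips_val = lpips_fn(_to_lpips_tensor(enhanced), _to_lpips_tensor(gt)).item()
    return psnr, ssim, lpips_val
```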

4.3 Quantitative and qualitative results

To verify the effectiveness of the proposed method, we compare it with some state-of-the-art (SOTA) methods for low-light enhancement, including RetinexNet [11], GLAD [34], Zero-DCE [35], RUAS [36], EnlightenGAN [37], Night-enhancement [38], URetinex [39], PairLIE [22] and Bread [19]. Note that the results of all comparison methods are reproduced by using the official codes with recommended parameters.

Table 1 shows the quantitative results of all compared methods on the LOL and LOL-v2 datasets. It is clear that our method achieves better results on both datasets. On the LOL dataset, the PSNR and SSIM values of the proposed method are higher than those of the other comparison methods, while its LPIPS value is only slightly worse than that of URetinex. On the LOL-v2 dataset, our method obtains the best values for all three metrics.

Table 1 Quantitative results of all comparison methods on LOL and LOL-v2 datasets

The qualitative results on the LOL and LOL-v2 datasets are shown in Figs. 4 and 5, respectively. For ease of observation, we select and enlarge a small area of each image. As can be seen from the figures, the results of RetinexNet have serious color distortion and noise issues. Zero-DCE, EnlightenGAN and RUAS are unable to effectively restore the brightness of darker areas. The results obtained by Night-enhancement, URetinex and Bread suffer from color distortion compared with the GT. In addition, there is severe noise in the GLAD results, and the results of Night-enhancement and PairLIE show varying degrees of edge blur. Compared with the other methods, our method achieves results with higher visual quality that are closest to the GT.

Fig. 4 Visual comparison of all comparison methods on the LOL dataset

Fig. 5 Visual comparison of all comparison methods on the LOL-v2 dataset

4.4 Complexity analysis

In this section, we compare the number of parameters and floating-point operations (FLOPs) of all methods. The comparison results are shown in Table 2, from which we can see that RUAS has the fewest parameters and FLOPs, while Night-enhancement has the most. In contrast, our method achieves the best balance between enhancement performance and computational complexity in terms of parameters and FLOPs.

Table 2 Complexity comparison of parameters and FLOPs
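For reference, numbers of this kind can be measured as sketched below. The thop package is an assumed example tool (the paper does not say which profiler was used), and profilers of this type typically report multiply-accumulate operations, so the convention behind any table entry should be checked before comparing.

```python
import torch
from thop import profile   # pip install thop (assumed tool)

def complexity(model: torch.nn.Module, h: int = 256, w: int = 256):
    """Trainable parameter count and operation count for one 3xHxW input."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    ops, _ = profile(model, inputs=(torch.randn(1, 3, h, w),), verbose=False)
    return params / 1e6, ops / 1e9   # millions of parameters, GFLOPs-scale operations
```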

4.5 Ablation study

To validate the effectiveness of the proposed network, we perform ablation experiments on the network structure and the number of IAMs using the LOL dataset.

4.5.1 Components in network structure

To verify the role of each component in the network, four ablation experiments are designed by individually removing or replacing different components in the framework. The results are shown in Table 3a. The four ablation experiments are as follows:

(1) “w/o IAM” denotes the IACN without IAMs, i.e., the IAMs are replaced by ordinary convolutional layers.

(2) “w/o \(F_{l}^{i}\)” denotes the IACN without the illumination feature maps \(F_{l}^{i}\).

(3) “w/ \(\overline{L}\)” denotes replacing \(L_{0}\) with the output \(\overline{L}\) of the IEN.

(4) “w/o IEN” means that the low-light image is used directly as the input of the IACN.

Table 3 The ablation results of the network structure and the number of IAMs

From Table 3a, it can be observed that, since the IAM is the core module of the network, replacing it with convolutional layers leads to a significant drop in the metric values. Removing or replacing the other components also degrades performance to varying degrees. Therefore, the constructed components are all effective, which further demonstrates that the proposed two-stage network achieves good enhancement results.

4.5.2 Number of IAMs

To verify the impact of the number of IAMs on network performance, we test the proposed network with different numbers of IAMs in each scale layer. The results are shown in Table 3b. From the table, it can be seen that increasing the number of IAMs in layers with smaller-scale feature maps does not improve the performance of the model (e.g., the results in the second row). The configuration used in this paper is the best among the tested combinations.

4.6 Evaluation via downstream vision tasks

To evaluate the effectiveness of the proposed method in improving the performance of downstream vision tasks, an object detection experiment is conducted on the enhanced results of all comparison methods, as shown in Fig. 6. The figure shows pedestrian detection results obtained by a pretrained YOLOv3 on the enhanced images of all comparison methods. We can clearly see that more pedestrians are accurately detected in the result of our method. Therefore, the proposed network produces enhancement results with higher contrast, which helps improve the performance of downstream tasks.

Fig. 6 Visual comparison of object detection results on the enhanced images

4.7 Limitations

Although our method obtains promising results in most cases, it still has a few limitations. First, when the input image has nonuniform lighting conditions, as shown in Fig. 7a, our method produces overexposure, as shown in Fig. 7b. Second, when the input image is captured in an extremely low-light environment (Fig. 7c), noise remains in our result (Fig. 7d). In the future, we will design more effective networks to enhance the generalization ability of the model to images under different light distributions.

Fig. 7 Failure cases. Input image a has nonuniform lighting conditions and c is an extremely low-light image; b and d are the results obtained by the proposed method for images a and c, respectively

5 Conclusion

In this paper, we propose an IATN based on Retinex theory that consists of two networks: an IEN and an IACN. The IEN in the first stage obtains a preliminary coarse enhancement result by estimating the illumination of the input low-light image. To reduce color distortion and suppress noise in the first-stage result, the IACN in the second stage is constructed as a U-shaped network containing multiple core IAMs. The IAM is designed to restore image features by establishing long-range dependencies among features and the correlation between image features and illumination features. Extensive experiments on public benchmark datasets show that our method significantly outperforms some SOTA methods. In the future, we will investigate more efficient networks to improve the naturalness and color consistency of LLIE.