1 Introduction

Taking and sharing photographs has become part of everyday life. Various interesting image-based applications have emerged, such as image retrieval [1], classification [2, 3], clustering [4, 5], stylization [6], and segmentation [7]. Despite advances in photographing devices, low-quality images can still be produced for various reasons. On one hand, many people are not familiar with basic shooting skills, e.g., the rule of thirds and exposure setting. In this context, intelligent systems have been designed to help people choose a proper content composition [8, 9]. On the other hand, the visual appearance of a photograph can be degraded by imperfect imaging conditions, such as low lightness, foggy weather, and motion blur. Therefore, enhancement models at the pixel level are also highly desired. Typical applications include detail enhancement [10], color transfer [11], low-light enhancement [12], dehazing [13], and motion deblurring [14], to name but a few.

In this paper, we focus on the issue of low lightness, which is often encountered when shooting photographs. We generally divide poor lightness into two types. The first is globally low lightness, where only a few weak light sources exist in the imaging scene, e.g., the nighttime outdoor scene and the dimly lit indoor scene in Fig. 1. The second is unbalanced lightness, where a good light source exists but fails to illuminate the whole scene well, e.g., the backlight and sidelight cases in Fig. 1. To address the first type, many enhancing models have been proposed, such as histogram-based models [15,16,17,18,19], which aim at stretching the histogram to a larger range of intensity values. However, they are limited under the second type of unbalanced lightness conditions, as these models tend to over-enhance the originally well-illuminated regions. Retinex-based models [20,21,22] are able to relieve this issue to some extent, but they are still less effective in tackling complex lightness conditions. Commonly, all the above models try to improve the contrast by manipulating only one single input.

Fig. 1

First row: Images with less lit regions under various conditions: nighttime, indoor, backlight, and sidelight. Second row: enhanced images based on our model

Advances in photographing devices benefit the task of low-light enhancement by providing more input data. Specifically, the same scene can be recorded with multiple images of different exposures, which effectively expands the intensity range, especially for the originally dark regions. Such multiple sources pave the way for the development of fusion-based enhancing models [23,24,25]. However, these models depend on the enriched data sources, which are still unavailable in many situations. For example, in most cases we have access to only a single low-light image from the Internet. If we still follow the road map of multi-source fusion, we have to artificially generate a few intermediate enhanced images as inputs in advance.

Following this road map, with only one single low-light image, we propose a low-light enhancement model that generates and fuses multiple sources. With a lightness-aware camera response model, multiple initial enhancements simulating differently exposed images are first produced. They are then fused at mid-level based on a patch-based image representation, in which image patches from each source are decomposed into several signals that are fused separately. The final enhancement is obtained by recomposing the fused signals. The highlights of our research are twofold. First, we extend the ability of the camera response function to adapt to different lightness configurations. Second, the mid-level fusion shows more competitive performance than the current state-of-the-art enhancing models, including those based on a single source and on multiple sources.

The rest of this paper is organized as follows. Section 2 introduces the related works. Section 3 presents the details of the proposed model. Qualitative and quantitative comparisons are reported in Sect. 4. We finally conclude our research in Sect. 5.

2 Related works

In this section, we briefly review the related works on low-light image enhancement. We divide them into single-source models and multi-source models.

For the single-source group, a representative class of enhancing models is based on the manipulation of the image histogram [15,16,17,18,19]. Based on the observation that the histogram of a low-light image is heavily concentrated in the low-intensity region, histogram-based models mainly aim at equalizing the intensity distribution across the whole intensity range [15,16,17,18] or reshaping the histogram into a desired distribution [19]. Since a histogram discards most spatial information of an image, these models are often limited in tackling local lightness variation. As a result, they tend to produce over-enhanced or under-enhanced results.

Differently, Retinex-based models [20, 21, 26, 27] assume that an image is composed of an illumination layer and a reflectance layer. The former represents the illumination of the imaging scene, and the latter represents the inherent characteristics of the object surface. A straightforward strategy for Retinex-based models is to modify the illumination layer, keep the reflectance layer unchanged, and recompose the two layers [26, 27]. The key to Retinex-based enhancement models lies in a successful illumination–reflectance decomposition.

Since the image decomposition is ill-posed in nature, it often requires an alternating optimization process to approximate the two layers, which can be unstable and time-consuming. Enhancement models based on a simplified Retinex model have therefore been proposed [22, 28]. These models still assume that an image is the combination of the two layers. The difference is that they roughly estimate the illumination layer with a simple MaxRGB technique and refine the MaxRGB image with an edge-preserving filter, which plays a vital role in these simplified Retinex models. For example, choosing a different filtering model can lead to slightly different enhancing effects, especially for image regions with complex patterns [28].

Another interesting assumption is that the inverse of a low-light image resembles a hazy image. By applying dehazing techniques, the darkness can be removed as if it were haze in the inverted low-light image [29,30,31]. However, methods based on this assumption tend to generate unrealistic effects on salient object boundaries.

Single-source enhancement models are usually controlled by only one or two parameters, e.g., (simplified) Retinex-based models, or are completely parameter-free, e.g., histogram-based models, which amounts to a uniform enhancing strength imposed on the whole image. Therefore, they are less spatially aware of image contents under different illumination conditions and tend to produce improper local enhancing effects.

For the multi-source group, the above issue can be largely relieved by jointly considering multiple sources as inputs to the enhancement model, which are potentially complementary to each other. With advanced imaging hardware, multiple source images of the same scene can be collected almost simultaneously with different exposures. The key remaining step for the enhancement task is the seamless fusion of these input images, which adaptively combines the different appearances of the same image region and thereby avoids over-/under-enhancement. Bertalmio and Levine [24] propose to encode the gradient and color information from a short-exposure image and a long-exposure image into an image functional and perform variational minimization to obtain the final fusion. Kou et al. [25] use a multi-resolution technique to achieve seamless fusion. Additionally, they improve the fused result by further enhancing image details: an improved image filter extracts high-frequency details from the multiple inputs and adds them to the fused image. Ma et al. [23] propose a patch decomposition model that separates an image patch into three kinds of signals. The decomposed signals from each source are then fused linearly or nonlinearly. The resultant enhancement is finally obtained by recomposing the signals.

For the situation where only one image is at hand, the technical road map of multi-source fusion needs an extension, and the stage of source generation becomes indispensable. In [32], along with the original input, Fu et al. generate two intermediate enhanced images by applying two intensity transform techniques. Hao et al. [33] produce an intermediate enhancement based on the simplified Retinex model and fuse it with the original image by designing a content-aware weight map. Ying et al. [34] propose a bio-inspired enhancement model, in which the source is generated by a simulated camera response model [35]. Different from [34], the model in this paper avoids the heuristic judgment on whether an image pixel is underexposed and is thus more flexible in generating intermediate enhancing results.

Of note, there has also been learning-based research on the low-light enhancement task [36, 37], which demonstrates very promising performance. For these methods, a collection of sufficient and reliable image pairs (normal lightness vs. low lightness) is vital for the training procedure.

3 Proposed method

3.1 Overall framework

Suppose the input \({\mathbf{I}}_{0} \in {\mathbb{R}}^{W \times H \times 3}\) is a color image represented in RGB space. The technical road map of our model is shown in Fig. 2. The model contains two main parts, i.e., lightness-aware source generation (described in Sect. 3.2) and multi-source fusion (described in Sect. 3.3).

Fig. 2

Framework of our model

Of note, the data flow in the first part proceeds as follows. We first convert the RGB input into the HSV space and send only the V channel into the generator, since the generator only aims at simulating different illumination conditions. After that, we replace the original V channel with the generated one and keep the other two channels unchanged. Finally, for all the sources, we convert the HSV image back into the RGB space, which is used in the subsequent stage of image decomposition.
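For concreteness, this data flow can be sketched as follows. It is a minimal illustration using scikit-image for the color-space conversion (our own choice, not the authors' implementation), with `v_generator` standing in for the lightness-aware generator described in Sect. 3.2.

```python
import numpy as np
from skimage.color import rgb2hsv, hsv2rgb

def generate_rgb_sources(rgb, v_generator, ks):
    """rgb: float RGB image in [0, 1], shape (H, W, 3).
    v_generator: callable mapping (V channel, exposure ratio k) -> new V channel.
    ks: exposure ratios, one per generated source."""
    hsv = rgb2hsv(rgb)
    v0 = hsv[..., 2]                      # only the V channel enters the generator
    sources = [rgb]                       # the original image is kept as one source
    for k in ks:
        hsv_k = hsv.copy()
        hsv_k[..., 2] = np.clip(v_generator(v0, k), 0.0, 1.0)  # replace V, keep H and S
        sources.append(hsv2rgb(hsv_k))    # back to RGB for the decomposition stage
    return sources
```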

3.2 Generation of fusion source

We generate the fusion sources based on the camera response model, which is jointly described by the camera response function (CRF) and the brightness transform function (BTF). The former is determined only by the camera itself, while the latter is determined by the camera and the exposure ratio \(k\). The general form of the CRF can be represented as:

$${\mathbf{V}} = f\left( {\mathbf{E}} \right).$$
(1)

Here, \({\mathbf{V}} \in {\mathbb{R}}^{W \times H}\) is the V channel in the HSV space (the observed pixel intensities of an image), and \({\mathbf{E}}\) is the ideal scene irradiance. By choosing different exposure ratios, different observed images can be obtained, e.g., \({\mathbf{V}}_{0} = f\left( {\mathbf{E}} \right)\) (the trivial case \(k = 1\)) and \({\mathbf{V}}_{1} = f\left( {k{\mathbf{E}}} \right)\).

On the other hand, the mapping between \({\mathbf{V}}_{0}\) and \({\mathbf{V}}_{1}\) can be also described by the brightness transform function, which represents the mapping between two observed images of a same scene with different exposures:

$${\mathbf{V}}_{1} = g\left( {{\mathbf{V}}_{0} ,k} \right)$$
(2)

Based on Eqs. 1 and 2, we have:

$$g\left( {f\left( {\mathbf{E}} \right),k} \right) = f\left( {k{\mathbf{E}}} \right)$$
(3)

According to [34, 35], we specify the BTF as a simple form: \({\mathbf{V}}_{1} = g\left( {{\mathbf{V}}_{0} ,k} \right) = \beta {\mathbf{V}}_{0}^{\gamma }\). Of note, \(\beta\) and \(\gamma\) are related to both the camera and the exposure ratio. Based on the comparametric equation [38], we have \(f\left( {k{\mathbf{E}}} \right) = \beta f\left( {\mathbf{E}} \right)^{\gamma }\). Except for the trivial case of \(\gamma = 1\), we can obtain the closed form of CRF as:

$$f\left( {\mathbf{E}} \right) = {\text{e}}^{{b\left( {1 - {\mathbf{E}}^{a} } \right)}} ,\quad a = \log_{k} \gamma ,\quad b = \frac{\ln \,\beta }{1 - \gamma }$$
(4)

Here, \(a\) and \(b\) are built-in parameters of a camera. They can be empirically set to a = − 0.3293 and b = 1.1258, which are suitable for most cameras [35]. We can thus obtain the BTF parameters \(\beta\) and \(\gamma\):

$$\beta = {\text{e}}^{{b\left( {1 - k^{a} } \right)}} , \gamma = k^{a}$$
(5)

Then, the brightness transform function can be further parameterized by the exposure ratio \(k\):

$${\mathbf{V}}_{1} = g\left( {{\mathbf{V}}_{0} ,k} \right) = {\text{e}}^{{b\left( {1 - k^{a} } \right)}} {\mathbf{V}}_{0}^{{k^{a} }}$$
(6)

In our application, we take \({\mathbf{V}}_{0}\) as the original image at hand and \({\mathbf{V}}_{1}\) as a generated source.
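As a minimal sketch, Eq. 6 can be implemented directly. The helper name `btf` and the clipping of dark pixel values are our own choices, not part of the paper.

```python
import numpy as np

# Camera parameters suggested in [35]; the paper reports them as suitable for most cameras.
A, B = -0.3293, 1.1258

def btf(v0, k, a=A, b=B):
    """Brightness transform function of Eq. 6: V1 = exp(b*(1 - k^a)) * V0^(k^a)."""
    v0 = np.clip(v0, 1e-6, 1.0)           # guard against numerical issues at 0
    beta = np.exp(b * (1.0 - k ** a))
    gamma = k ** a
    return beta * v0 ** gamma
```

For an exposure ratio \(k > 1\), \(\gamma = k^{a} < 1\) and \(\beta > 1\), so dark intensities are lifted more strongly than bright ones, which is the brightening behavior the generator relies on.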

We then estimate the exposure ratio \(k\) as follows. We first remove the small-scale image details from \({\mathbf{V}}_{0}\) using the fast guided filter [39]. Then, we extract the regions of low lightness determined by a threshold \(\eta\):

$${\mathbf{M}}\left( \eta \right) = \{ {\mathbf{V}}_{0} \left( p \right) \mid {\mathbf{V}}_{0} \left( p \right) < \eta \}$$
(7)

In other words, \({\mathbf{M}}\) approximately indicates the set of low-light pixels in the original image. The exposure ratio estimation can then be formulated as an optimization problem:

$$\tilde{k}_{\eta } = \mathop {\text{argmax}}\limits_{k} {\mathcal{H}}\left( {g\left( {{\mathbf{M}}\left( \eta \right),k} \right)} \right),$$
(8)

where \({\mathcal{H}}\left( \cdot \right)\) is the entropy and can be estimated from the image histogram of \({\mathbf{M}}\).

From the above modeling, the determination of low-light regions has large impact on the optimal exposure ratio. As exemplified in Fig. 1, low-light images can be divided into various specified conditions. Therefore, a single and ad hoc setting of the threshold \(\eta\) (e.g., 0.5 in [34]) is limited to describe the complex lightness conditions for an arbitrary image. We use a set of threshold values \(\left\{ {\eta_{1} , \eta_{2} , \ldots ,\eta_{N - 1} } \right\}\) to obtain different \(\tilde{k}\) values that cater to the generation of multiple sources for fusion.
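A possible implementation of Eqs. 7 and 8 is sketched below. We substitute a Gaussian blur for the fast guided filter of [39] and a simple 1-D grid search for the optimization, and we reuse the `btf` helper from the sketch above; these are simplifying assumptions rather than the authors' exact procedure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # stand-in for the fast guided filter [39]

def entropy(values, bins=256):
    """Shannon entropy of the intensity histogram of `values` (assumed in [0, 1])."""
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def estimate_k(v0, eta, k_grid=np.linspace(1.0, 10.0, 91)):
    """Eqs. 7-8: pick the exposure ratio maximizing the entropy of the
    brightened low-light pixels; btf is the sketch from Sect. 3.2."""
    v_smooth = gaussian_filter(v0, sigma=3.0)     # remove small-scale details
    m = v0[v_smooth < eta]                        # low-light region M(eta)
    if m.size == 0:
        return 1.0
    scores = [entropy(np.clip(btf(m, k), 0.0, 1.0)) for k in k_grid]
    return float(k_grid[np.argmax(scores)])

# Multiple thresholds give the exposure ratios for the N-1 generated sources, e.g.:
# ks = [estimate_k(v0, eta) for eta in (0.4, 0.5, 0.6)]
```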

3.3 Patch-based fusion

Without loss of generality, we obtain \(N\) sources \(\left\{ {{\mathbf{I}}_{0} ,{\mathbf{I}}_{1} , \ldots ,{\mathbf{I}}_{N - 1} } \right\}\) in total for the fusion process. For an image patch \({\mathbf{P}}\) of each source, we adopt the patch-based image decomposition [23]:

$${\mathbf{P}} = \left\| {{\mathbf{P}} - \mu_{{\mathbf{P}}} } \right\| \cdot \frac{{{\mathbf{P}} - \mu_{{\mathbf{P}}} }}{{\left\| {{\mathbf{P}} - \mu_{{\mathbf{P}}} } \right\|}} + \mu_{{\mathbf{P}}} = \left\| {{\tilde{\mathbf{P}}}} \right\| \cdot \frac{{{\tilde{\mathbf{P}}}}}{{\left\| {{\tilde{\mathbf{P}}}} \right\|}} + \mu_{{\mathbf{P}}} = c \cdot {\mathbf{s}} + l$$
(9)

In Eq. 9, for each \(M \times M\) square patch, we stack its RGB channels into a \(3M^{2}\)-length column vector \({\mathbf{P}}\). Here, \(\left\| \cdot \right\|\) is the L2 norm, \(c\) is the patch scale, \({\mathbf{s}}\) is the patch structure, and \(l\) is the patch mean intensity. These decomposed elements can be seen as a mid-level representation of an image. In the following, the three components are fused separately.
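A minimal sketch of the decomposition in Eq. 9 follows. The function name and the flattening order are our own choices; any fixed ordering of the stacked RGB values works, since the scale, structure, and mean are invariant to it.

```python
import numpy as np

def decompose_patch(patch):
    """Eq. 9: split an M x M x 3 patch (stacked into a length-3M^2 vector)
    into scale c, unit-norm structure s, and mean intensity l."""
    p = patch.reshape(-1).astype(float)
    l = p.mean()                          # patch mean intensity
    centered = p - l
    c = np.linalg.norm(centered)          # patch scale (L2 norm of the mean-removed patch)
    s = centered / c if c > 0 else np.zeros_like(centered)  # patch structure
    return c, s, l
```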

First, a nonlinear max-fusion is applied to the patch scale \(c_{n}\):

$$\hat{c} = \mathop {\hbox{max} }\limits_{0 \le n \le N - 1} c_{n}$$
(10)

Second, a linear weight fusion is constructed for the patch structure \({\mathbf{s}}_{n}\):

$${\bar{\mathbf{s}}} = \frac{{\mathop \sum \nolimits_{n = 0}^{N - 1} c_{n}^{\rho } {\mathbf{s}}_{n} }}{{\mathop \sum \nolimits_{n = 0}^{N - 1} c_{n}^{\rho } }}$$
(11)

From Eq. 11, we observe that the fused \({\bar{\mathbf{s}}}\) is jointly determined by the structures \(\left\{ {{\mathbf{s}}_{n} } \right\}\) of the multiple sources. The fusion weights are determined by the exponentiated patch scales \(\left\{ {c_{n} } \right\}\), where \(\rho \ge 0\) is a hyper-parameter. The obtained \({\bar{\mathbf{s}}}\) is further normalized as \({\hat{\mathbf{s}}} = {\bar{\mathbf{s}}}/\left\| {{\bar{\mathbf{s}}}} \right\|\). From Eqs. 10 and 11, we observe that the fusion weighs more heavily on strong patches while still considering the contribution of weak patches. Third, we also use a weighted linear fusion for the patch mean \(l_{n}\):

$$\hat{l} = \frac{{\mathop \sum \nolimits_{n = 0}^{N - 1} L\left( {\mu_{n}^{0} ,l_{n} } \right)l_{n} }}{{\mathop \sum \nolimits_{n = 0}^{N - 1} L\left( {\mu_{n}^{0} ,l_{n} } \right)}}$$
(12)

In Eq. 12, \(L\left( { \cdot , \cdot } \right)\) measures how well exposed the patch mean \(l_{n}\) of \({\mathbf{I}}_{n}\) is:

$$L\left( {\mu_{n}^{0} ,l_{n} } \right) = { \exp }\left( { - \frac{{\left( {\mu_{n}^{0} - 0.5} \right)^{2} }}{{2\sigma_{g}^{2} }} - \frac{{\left( {l_{n} - 0.5} \right)^{2} }}{{2\sigma_{l}^{2} }}} \right),$$
(13)

where \(\mu_{n}^{0}\) is the global mean intensity of \({\mathbf{I}}_{n}\), and \(\sigma_{g}\) and \(\sigma_{l}\) control the spreads of the two Gaussian terms.
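The three fusion rules of Eqs. 10–13 can be sketched as follows. The parameter values shown are illustrative placeholders (the paper adopts the settings of [23]), and the function name is ours.

```python
import numpy as np

def fuse_components(cs, ss, ls, mu0s, rho=4.0, sigma_g=0.2, sigma_l=0.5):
    """Eqs. 10-13: fuse per-source scales cs, structures ss, and means ls.
    mu0s are the global mean intensities of the N sources."""
    cs = np.asarray(cs, dtype=float)
    c_hat = cs.max()                                   # Eq. 10: max-fusion of scales
    w_s = cs ** rho                                    # Eq. 11: scale-based weights
    s_bar = np.sum(w_s[:, None] * np.asarray(ss), axis=0) / max(w_s.sum(), 1e-12)
    s_hat = s_bar / max(np.linalg.norm(s_bar), 1e-12)  # normalization of the structure
    w_l = np.exp(-((np.asarray(mu0s) - 0.5) ** 2) / (2 * sigma_g ** 2)
                 - ((np.asarray(ls) - 0.5) ** 2) / (2 * sigma_l ** 2))   # Eq. 13
    l_hat = np.sum(w_l * np.asarray(ls)) / max(w_l.sum(), 1e-12)         # Eq. 12
    return c_hat, s_hat, l_hat
```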

Finally, we split the stacked \({\hat{\mathbf{s}}}\) back into the RGB channels and reconstruct them with the obtained \(\hat{c}\) and \(\hat{l}\) according to Eq. 9:

$${\hat{\mathbf{P}}}_{\phi } = \hat{c} \cdot {\hat{\mathbf{s}}}_{\phi } + \hat{l}$$
(14)

where \(\phi \in \left\{ {{\text{R}},{\text{G}},{\text{B}}} \right\}\) enumerates the three color channels. We use a sliding window with a stride of \(B = \lfloor M/2 \rfloor\) to reconstruct each patch of the result image, and the pixels in overlapping regions are averaged. The reconstructed image \({\mathbf{I}}_{\text{f}}\) is taken as the final result. In our research, the fusion parameters \(M,\rho ,\sigma_{g} ,\sigma_{l}\) follow the settings in [23].
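A sketch of the sliding-window reconstruction is given below. It reuses `decompose_patch` and `fuse_components` from the sketches above, and border handling for image sizes that are not multiples of the stride is deliberately simplified.

```python
import numpy as np

def reconstruct(sources, patch_size):
    """Eq. 14 with a sliding window of stride patch_size // 2;
    pixels in overlapping regions are averaged.
    sources: list of N float RGB images of identical shape."""
    h, w, _ = sources[0].shape
    stride = max(patch_size // 2, 1)
    acc = np.zeros_like(sources[0], dtype=float)
    cnt = np.zeros((h, w, 1), dtype=float)
    mu0s = [src.mean() for src in sources]             # global means for Eq. 13
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            cs, ss, ls = [], [], []
            for src in sources:
                c, s, l = decompose_patch(src[y:y + patch_size, x:x + patch_size, :])
                cs.append(c); ss.append(s); ls.append(l)
            c_hat, s_hat, l_hat = fuse_components(cs, ss, ls, mu0s)
            patch = (c_hat * s_hat + l_hat).reshape(patch_size, patch_size, 3)  # Eq. 14
            acc[y:y + patch_size, x:x + patch_size, :] += patch
            cnt[y:y + patch_size, x:x + patch_size, :] += 1.0
    return np.clip(acc / np.maximum(cnt, 1.0), 0.0, 1.0)
```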

3.4 Algorithm summary

The whole algorithm is summarized in Table 1. The computational complexity of the whole algorithm is jointly determined by the total number of pixels, the number of fusion sources, and the patch size. In experiments, we empirically found that there is a trade-off between the number of fusion sources and the computational efficiency. We choose N = 4 sources for our research (including the original image) and set \(\eta\) to 0.4, 0.5, and 0.6 in our experiments.

Table 1 Summary of our algorithm
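Putting the pieces together, an illustrative end-to-end driver might look as follows. It chains the sketches of Sects. 3.2 and 3.3 and uses hypothetical names and parameter values, so it should be read as a summary of the data flow rather than the authors' implementation.

```python
from skimage.color import rgb2hsv

def enhance(rgb, etas=(0.4, 0.5, 0.6), patch_size=11):
    """Generate differently exposed sources from a single image and fuse them."""
    v0 = rgb2hsv(rgb)[..., 2]
    ks = [estimate_k(v0, eta) for eta in etas]          # Sect. 3.2, Eq. 8
    sources = generate_rgb_sources(rgb, btf, ks)        # N = len(etas) + 1 sources
    return reconstruct(sources, patch_size)             # Sect. 3.3, Eqs. 9-14
```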

4 Experiments

4.1 Experimental settings

In the experiments, a total of 35 images were collected from the Internet or taken by the authors. As shown in Fig. 3, the images were taken in different outdoor and indoor scenes and cover various lightness conditions, e.g., nighttime, backlight, sidelight, and so on. We introduce six models for comparison. The single-source models include the multi-scale Retinex-based model (MSRCR) [20], the dehazing-based model (DEHAZE) [30], and LIME [22], while the multi-source models include the bio-inspired model (BIMEF) [34], the multi-fusion model (MF) [32], and the lightness-aware simplified Retinex model (LA-Retinex) [33]. The codes of the first five models are publicly available from the project webpage of [34], while the code of [33] was implemented in our previous research. All of them were run on a laptop with a 2.6 GHz CPU and 8 GB RAM.

Fig. 3

A gallery of all experimental images

4.2 Visual comparisons

We first conduct visual comparisons. In Fig. 4a, we present the enhanced results of three images with dim lightness. We have the following observations. First, all the models are able to reveal the image details hidden in the darkness, especially the single-source models. Second, all the single-source models are prone to generating over-enhanced results, such as inappropriately boosted edges and textures (DEHAZE, LIME) or an unrealistic change of global appearance (MSRCR). In contrast, the results of the multi-source models have much more balanced lightness configurations and are more visually appealing. In Fig. 4b, we present the enhanced results of three images with partially low-light regions. We have similar observations as in Fig. 4a that the multi-source models perform better than the single-source ones. Furthermore, on closer inspection, our model produces fewer artifacts and brings in more vivid colors. For example, our model yields a more natural appearance on the wall and medal regions in the first row of Fig. 4b than MF and LA-Retinex, and the colors of the grass, trees, and sunset regions produced by our model are brighter than those of BIMEF in the second and third examples of Fig. 4b. The reasons are twofold. On one hand, the patch-based computation makes our method robust to artifacts to some extent. On the other hand, as the RGB channels of \(\left\{ {{\mathbf{I}}_{0} ,{\mathbf{I}}_{1} , \ldots , {\mathbf{I}}_{N - 1} } \right\}\) are jointly considered in the decomposition and fusion, our method is able to improve the color distribution.

Fig. 4

Visual comparison of the enhanced results of all seven models (best viewed enlarged on a bright display)

4.3 Quantitative comparisons

We also conduct quantitative comparisons among all the models based on multi-source fusion. We use the no-reference image quality evaluator BTMQI [40] and a visual aesthetic scoring network [41], and report all the scores in Tables 2 and 3. Of note, since our task does not change the image content composition, we only use the fine-grained scores of color harmony, color vividness, and lightness produced by the network trained in [41]. In the tables, bold font and italics highlight the best and the second best performance, respectively.

Table 2 Quantitative BTMQI scores [40]
Table 3 Quantitative aesthetic scores (color harmony, color vividness, and lightness) [41]

From both tables, our model achieves the best performance among the four fusion-based models, which again validates its effectiveness. We also have some additional observations. First, the MF model has the second best performance. This confirms the usefulness of the fusion road map, in which good results can be obtained even by combining several simple enhancement models, as in [32]. Second, for a few cases, the input images have higher scores than their enhanced results. This observation indicates that enhancement does not necessarily improve the visual quality in all cases. The reason is that the models under comparison are still not fully quality-aware or aesthetics-aware, although they try to harmonize the complementary appearances across the multiple sources with different fusion techniques.

5 Conclusions

In this paper, we propose a low-light enhancement model that generates and fuses multiple sources, which suits the situation where only one single input image is at hand. We empirically validate our model on various low-light images. Compared with single-source models and other multi-source models, our model produces better results in terms of visual naturalness and aesthetics. As mentioned above, although our model is able to improve the visual aesthetics of an image, it is still limited in that the enhancement process itself is not aesthetics-aware. In future research, we plan to extend our model by equipping it with an aesthetics optimization process [42]. We also note that the determination of the low-light regions in this paper is still heuristic to some extent. We may consider unsupervised feature selection techniques [43,44,45,46] to more accurately delineate the image regions with low lightness.