
1 Introduction

Recovering a high-quality image from a corrupted image [1,2,3] plays an essential role in follow-up visual tasks [4, 5]. Among these techniques, specular highlight removal, which aims at separating the specular and diffuse components of the corrupted image, is one of the key problems. To address this problem, early methods [6, 7] use additional hardware outside the camera. However, these approaches usually cause serious measurement distortion. Therefore, current research tends to remove highlights directly from a single image using an image processing algorithm, without the assistance of additional hardware. Typically, many works build on the dichromatic reflection model, which retains the physical information of the scene for better results.

According to the physical properties of objects, the observed image intensity \(\mathrm{I}(\mathrm{p})\) at pixel \(\mathrm{p}\) can be represented by a dichromatic reflection model, and then can be decomposed into a diffuse reflection component \(\mathrm{D}(\mathrm{p})\) representing object surface information and a specular reflection component \(\mathrm{S}(\mathrm{p})\) representing light source information:

$$\begin{aligned} \mathrm {I}(\mathrm{p}) = \mathrm{D}(\mathrm{p}) + \mathrm{S}(\mathrm{p}) = \mathrm{m}_{\textit{d}}(\mathrm{p}) \Lambda (\mathrm{p}) + \mathrm{m}_{\textit{s}}(\mathrm{p}) \Gamma (\mathrm{p}), \end{aligned}$$
(1)

where \(\mathrm{m}_{\textit{d}}(\mathrm{p})\) and \(\mathrm{m}_{\textit{s}}(\mathrm{p})\) denote the corresponding diffuse and specular reflection coefficients that represent the reflection ability of the object at position \(\mathrm{p}\). \(\Lambda (\mathrm{p})\) and \(\Gamma (\mathrm{p})\) denote the diffuse and specular chromaticity; the latter is usually regarded as the illumination chromaticity of ambient light.
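As a minimal illustration of Eq. (1), the following sketch composes a single pixel from its diffuse and specular parts. The function name and the numeric values are hypothetical and chosen only for illustration; they do not come from the experiments in this paper.

```python
import numpy as np

def dichromatic_pixel(m_d, m_s, diffuse_chroma, specular_chroma):
    """Compose I(p) = m_d * Lambda(p) + m_s * Gamma(p) for one pixel (Eq. 1)."""
    diffuse = m_d * np.asarray(diffuse_chroma, dtype=float)    # D(p)
    specular = m_s * np.asarray(specular_chroma, dtype=float)  # S(p)
    return diffuse + specular, diffuse, specular

# Hypothetical values: a reddish surface under white illumination.
Lambda = np.array([0.6, 0.3, 0.1])   # diffuse chromaticity, sums to 1
Gamma = np.array([1/3, 1/3, 1/3])    # specular (illumination) chromaticity, sums to 1
I, D, S = dichromatic_pixel(0.5, 0.3, Lambda, Gamma)
```

Because both chromaticities sum to one in this example, the channel sum of `I` equals `m_d + m_s`.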

Currently, many existing specular highlight removal algorithms [8,9,10,11,12] are derived from artificially controlled experimental scenes, in which the background and specular highlights are usually unnatural. In contrast, natural scenes are captured in an uncontrolled environment, where objects are exposed to natural light. Therefore, the highlights usually emerge irregularly. As a result, the performance of the aforementioned methods on natural scenes decreases dramatically.

In this paper, in contrast to the aforementioned methods that address artificial experimental scenes, we take natural scenes as our research object and propose a highlight removal method for a single natural scene image, focusing on the estimation of specular reflection coefficients based on the accurately estimated illumination chromaticity. We show how the characteristics of highlight areas in natural scene images differ from those in the experimental scene images often used in past methods. Our model provides the key smooth feature information, which is used in the specular reflection coefficient estimation in chromaticity space after normalization, and combines the intensity information with color information to avoid color distortion, thus achieving highlight removal with the dichromatic reflection model. The contributions of this paper are summarized as follows:

  1. We propose a highlight removal method for natural scene images, which separates the specular reflection by fully considering the distribution characteristics of ambient light.

  2. We explain the difference between natural scene images and the artificial experimental scene images often used in past methods, and propose using the smooth feature information of natural scene images to estimate the specular reflection coefficient matrix.

  3. The proposed model achieves very competitive highlight removal results in natural scenes. It largely preserves the details and structural information of the original image in challenging scenarios containing complex textures or saturated pixels.

2 Related Works

Specular highlight removal is a challenging problem in low-level vision, and many methods have been proposed to tackle it. Typically, Tan and Ikeuchi [13] proposed a method that iteratively reduces the specular component until only the diffuse component is left. They introduced the concept of specular-free (SF) images for the first time. The SF image contains only the diffuse component, and its scene geometry is the same as that of the original image. By comparing pixel values of the highlight area and the adjacent regions, the maximum value of the pixel is converted to match the neighboring pixel values to achieve highlight removal. Assuming that the camera is highly sensitive, Tan and Ikeuchi [14] further developed a method to separate the reflection components with a large number of linear equations. However, the color of objects is also affected by the material, roughness and texture. To address this problem, Shen et al. [15] proposed modified specular-free (MSF) images. They assumed that there were only two types of pixels in MSF images, normal pixels and highlighted pixels, and then calculated the reflection components of the two types of pixels using chrominance. The main disadvantage of the MSF image is that it suffers from hue-saturation ambiguity, which exists in many natural images.

Fig. 1. Single natural scene image highlight removal pipeline.

Recently, the dichromatic reflection model, which fully considers the physical properties of the scene, has become the most widely used approach. Klinker et al. [16] found that the diffuse and specular components show a T-shaped distribution in the RGB color space and used it to remove highlights. However, it has been shown that this T-shaped distribution is susceptible to noise in practical applications and is prone to deformation in real images. Yang et al. [8, 9] proposed a fast bilateral filter for estimating the maximum diffuse chrominance value in local image blocks, which propagates the diffuse pixels to the specular pixels. Shen and Zheng [10] proposed the concept of pixel intensity ratio and constructed a pseudo-chrominance space to address the problem of textured surfaces, classifying pixels into clusters and robustly estimating the intensity ratio of each cluster. Ren et al. [12] introduced a color-lines constraint into the dichromatic reflection model, but it is limited on objects with dark or impure color surfaces, where accurate pixel clustering cannot be achieved. Jie et al. [17] transformed reflection separation into the solution of a low-rank and sparse matrix problem on the assumption that the highlight part is sparse in images. In [18], Fu et al. proposed a specular highlight removal method for real-world images, but it cannot achieve good results in large white specular highlight regions because it is based on the assumption that the specular highlight is sparse in distribution and small in size.

More recently, some deep learning based methods have been proposed to tackle this problem. Wu et al. [19] built a dataset in a controlled lab environment and proposed a specular highlight removal network. Fu et al. [20] proposed a multi-task network that jointly performs specular highlight detection and removal. These methods have achieved strong performance on man-made datasets. However, deep learning based methods rely on the training data and do not generalize well to natural scene images, whose distribution is different from that of man-made images.

3 Proposed Method

In this work, we propose a single image highlight removal method for natural scenes by estimating the specular reflection coefficients:

$$\begin{aligned} \mathrm{D = I} - \textit{E}({\textit{M}(\mathrm{I}),\textit{F}({\textit{T}}({\mathrm{I}_{\textit{c}}})})) \Gamma , \end{aligned}$$
(2)

where \(\mathrm{I}\) represents a highlight image, \(\mathrm{I}_{\textit{c}}\) is the input image from different channels of \(\mathrm{I}\), \(\mathrm \Gamma \) is the specular chromaticity, and \(\mathrm{D}\) is the output diffuse image. \(T( \cdot )\) denotes the low-frequency information extraction based on image decomposition. \(F( \cdot )\) is the feature fusion function of all channels. \(M( \cdot )\) represents the space mapping from color space to chromaticity space. \(E( \cdot )\) denotes the estimation function of specular reflection coefficients. An overview of the proposed method is shown in Fig. 1. First, we extract the smooth feature components of an image by \(T( \cdot )\). Simultaneously, considering the effect of colors, the decomposition is implemented in the intensity and the three color channels. The information from the four channels is normalized and combined with \(F( \cdot )\) to estimate the specular reflection coefficients by our derivation \(E( \cdot )\). This process also uses the transformed input image, which is generated by transforming the input image into chromaticity space using \(M( \cdot )\). Finally, highlight removal is achieved by subtracting the specular reflection component from the original image \(\mathrm{I}\).

3.1 Scene Illumination Evaluation

To illustrate the difference between natural scene images and experimental scene images with highlight areas, we focus on their light distribution characteristics. According to Retinex theory [21], a digital image can be represented as the product of reflectance and illumination. The former represents the detailed information of an image, and the latter represents the ambient light. It can be simply written as:

$$\begin{aligned} \mathrm{I}_{c}({x,y}) = \mathrm{R}_{c}({x,y}) \cdot \mathrm{L}_{c}({x,y}), \end{aligned}$$
(3)

where \(\mathrm{I}_{c}({x,y})\) is the pixel intensity value at \((x, y)\) and \(c\) represents color channel R, G or B. \(\mathrm{R}_{c}({x,y})\) and \(\mathrm{L}_{c}({x,y})\) are the reflectance and illumination of channel \(c\), respectively. Gaussian filters are usually used to calculate \(\mathrm{L}_{c}({x,y})\).
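The decomposition in Eq. (3) can be sketched as follows, assuming a separable Gaussian blur as the illumination estimator. The filter width `sigma`, the edge padding, and the small constant `eps` are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3*sigma and normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def estimate_illumination(channel, sigma=5.0):
    """Estimate L_c(x, y) by blurring one channel with a separable Gaussian."""
    k = gaussian_kernel(sigma)
    radius = len(k) // 2
    # Edge padding keeps the output the same size as the input.
    padded = np.pad(np.asarray(channel, dtype=float), radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def retinex_decompose(channel, sigma=5.0, eps=1e-6):
    """Return (R_c, L_c) such that channel is approximately R_c * L_c (Eq. 3)."""
    L = estimate_illumination(channel, sigma)
    R = np.asarray(channel, dtype=float) / (L + eps)  # eps avoids division by zero
    return R, L

# Usage on a hypothetical uniform gray channel.
flat = np.full((10, 12), 0.5)
R_flat, L_flat = retinex_decompose(flat, sigma=1.0)
```

For a uniform channel the estimated illumination equals the channel itself and the reflectance is close to one everywhere.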

In Fig. 2, we present the illuminance histograms of \(\mathrm{L}_{c}({x,y})\) for the experimental and natural images. The illuminance histogram counts the number of pixels at each gray level. As shown in Fig. 2, the illumination distribution in the natural scene is brighter than that in the experimental scene. The ground truth illumination should be natural and provide the best visual quality for the panorama. A related study [22] has shown that images with an average gray value of 128 are close to the best visual effect for humans. For the given images, the average values of the experimental and natural scenes are 24 and 103, respectively. For the experimental scene, a large number of pixels in the captured images lie in the dark interval, and their neighboring pixels are mostly dark. This distribution usually does not occur under natural lighting conditions.
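The statistics discussed above can be computed with a few lines of code. This is a sketch that assumes the illumination map has already been quantized to 8-bit gray levels; the tiny input array is hypothetical.

```python
import numpy as np

def illuminance_stats(illum_8bit):
    """Return the 256-bin histogram and the mean gray value of an 8-bit map."""
    values = np.asarray(illum_8bit).astype(np.int64).ravel()
    hist = np.bincount(values, minlength=256)  # pixel count per gray level
    return hist, float(values.mean())

# Hypothetical 2x2 illumination map.
demo = np.array([[0, 0], [128, 255]], dtype=np.uint8)
hist, mean_gray = illuminance_stats(demo)
distance_to_mid_gray = abs(mean_gray - 128.0)  # gap to the reference value 128
```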

Fig. 2. Illuminance histograms of images from different surroundings. (a) presents the illuminance histogram of experimental scenes and (b) shows the illuminance histogram of natural scenes.

3.2 Smooth Feature Extraction

We start with the dichromatic reflection model presented in Eq. (1). We denote \(\mathrm{I({p})=[I}_{\textit{r}}(\mathrm{p}),\mathrm{I}_{\textit{g}}(\mathrm{p}),\mathrm{I}_{\textit{b}}(\mathrm{p})]^\textit{T}\) as the intensity value of the image pixel at \(\mathrm{p}=[x,y]\). \(\mathrm{m}_{\textit{d}}(\mathrm{p})\) and \(\mathrm{m}_{\textit{s}}(\mathrm{p})\) are diffuse and specular reflection coefficients, respectively, which are related to surface geometry. \( \Lambda (\mathrm{p})=[\Lambda _{\textit{r}}(\mathrm{p}),\Lambda _{\textit{g}}(\mathrm{p}),\Lambda _{\textit{b}}(\mathrm{p})]^\textit{T}\) is the diffuse chromaticity, and its value usually remains the same over a continuous surface with the same color. Therefore, many studies estimate diffuse chromaticity based on the assumption that it is piecewise constant. However, the surface of an object in a natural scene is sometimes rough and irregular, so the assumption is not suitable in all cases. By contrast, the illumination chromaticity \(\Gamma (\mathrm{p})=[\Gamma _{\textit{r}}(\mathrm{p}),\Gamma _{\textit{g}}(\mathrm{p}),\Gamma _{\textit{b}}(\mathrm{p})]^\textit{T}\) can be estimated accurately via the color constancy algorithm in [23] or the color-lines constraint in [12]. After normalization, the illumination is a white color, i.e., \(\Gamma _{\textit{r}}(\mathrm{p})=\Gamma _{\textit{g}}(\mathrm{p})=\Gamma _{\textit{b}}(\mathrm{p})=1/3\). Thus, we focus on the estimation of the specular reflection coefficient \(\mathrm{m}_{\textit{s}}(\mathrm{p})\).

In general, the observed image can be decomposed into two components: a low-frequency component called illumination and a high-frequency detail component called reflectance [24]. The former represents the ambient illumination and the latter represents the details of objects [21]. In addition, in the dichromatic reflection model, the specular reflection component represents the information of ambient illumination and the diffuse reflection component represents the information of objects. Thus, the low-frequency information extracted from an image can reflect the intensity of the specular reflection at each pixel. We refer to this low-frequency information as the smooth feature. Moreover, the coefficient \(\mathrm{m}_{\textit{s}}(\mathrm{p})\) encodes the position and intensity of the specular reflection. For the low-frequency image at a certain position \(\mathrm{p}\), the larger the luminance value, the stronger the specular reflection. Therefore, we employ the low-frequency component to estimate the specular reflection coefficient \(\mathrm{m}_{\textit{s}}(\mathrm{p})\). At the same time, we do not make any changes to the texture detail information.

To achieve accurate separation of the low- and high-frequency parts, we utilize the method proposed in [25]. The process of image robust sparse decomposition is shown in Fig. 3. The robust sparse representation improves the robustness to non-Gaussian noise. We first implement decomposition for the grayscale map. It can be written as follows:

$$\begin{aligned} \mathrm{I = W + H} \quad \mathrm{and} \quad \mathrm{W = NZ,} \end{aligned}$$
(4)

where \(\mathrm{I}\) is an input image with highlight and \(\mathrm{N}\) is the dictionary we construct, which mainly contains the extracted luminance information. \(\mathrm{Z}\) is a sparse coefficient matrix in which very few element values are nonzero. \(\mathrm{NZ}\) composes the low-frequency intensity information image \(\mathrm{W}\). \(\mathrm{H}\) is the error matrix, containing edge and texture detail information. To construct a dictionary, we slide a small patch with a fixed step on the input image to obtain image blocks. Then, after vectorizing each block, all vectors form a vectorization matrix. Finally, the dictionary \(\mathrm{N}\) is obtained by normalizing the vectorization matrix. We set the size of the patch to \(1/10\) of the image size through a large number of experiments. For an image of size \(150\times 150\), the patch size is set to \(15\times 15\). Thus, the sparse decomposition problem can be transformed into the following optimization problem with an equality constraint:

$$\begin{aligned} \mathop {\mathrm {min}}\limits _\mathrm{Z,H} \ \Vert \mathrm{Z}\Vert _{0} + \lambda \Vert \mathrm{H}\Vert _{2,0} \quad \mathrm{s.t. } \ \mathrm{I = NZ + H,} \end{aligned}$$
(5)

in which \(\Vert \cdot \Vert _{0}\) denotes the \(l_{0}\) norm of the matrix \(\mathrm{Z}\), which counts the number of nonzero entries in the matrix. \(\Vert \cdot \Vert _{2,0}\) denotes the \(l_{2,0}\) norm of the matrix \(\mathrm{H}\), which describes sample-specific error terms and outliers. The parameter \(\lambda \) is used to balance the effect of the two components, and their proportion can be adjusted by changing \(\lambda \). When \(\lambda \) is larger, matrix \(\mathrm{H}\) has less texture detail information, and more illuminance information is contained in matrix \(\mathrm{NZ}\). The appropriate value of \(\lambda \) is discussed in our experiment section.
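The dictionary construction described before Eq. (5) can be sketched as follows. The step size and the choice of column-wise \(l_{2}\) normalization are assumptions made for illustration, since the text only states that the vectorization matrix is normalized; the function name is hypothetical.

```python
import numpy as np

def build_dictionary(image, patch, step):
    """Slide a patch over the image, vectorize each block, normalize columns."""
    h, w = image.shape
    columns = []
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            columns.append(image[y:y + patch, x:x + patch].ravel())
    N = np.stack(columns, axis=1).astype(float)  # one vectorized block per column
    norms = np.linalg.norm(N, axis=0)
    norms[norms == 0] = 1.0                      # leave all-zero blocks unchanged
    return N / norms                             # assumption: unit l2 norm per column

# Example from the text: a 150x150 image with 15x15 patches.
rng = np.random.default_rng(0)
gray = rng.random((150, 150))
N = build_dictionary(gray, patch=15, step=15)    # step=15 is a hypothetical choice
```

With these settings the dictionary has 225 rows (one per patch pixel) and 100 columns (one per block).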

However, the \(l_{0}\) norm is highly nonconvex, and there is no efficient solution available. To make the optimization tractable, we relax it by replacing \(\Vert \cdot \Vert _{0}\) with \(\Vert \cdot \Vert _{1}\) and \(\Vert \cdot \Vert _{2,0}\) with \(\Vert \cdot \Vert _{2,1}\), so that Eq. (5) can be formulated as the following convex optimization problem:

$$\begin{aligned} \mathop {\mathrm {min}}\limits _\mathrm{Z,H} \ \Vert \mathrm{Z}\Vert _{1} + \lambda \Vert \mathrm{H}\Vert _{2,1} \quad \mathrm{s.t.}\ \mathrm{I = NZ + H,} \end{aligned}$$
(6)

where \(\Vert \cdot \Vert _{1}\) and \(\Vert \cdot \Vert _{2,1}\) denote the \(l_{1}\) and \(l_{2,1}\) norms of a matrix, respectively, which can be calculated as follows:

$$\begin{aligned} \begin{aligned} \Vert \mathrm{Z}\Vert _{1}=\sum \nolimits _{k}\sum \nolimits _{j}|\mathrm{Z}({j,k})|\\\Vert \mathrm{H}\Vert _{2,1}=\sum \nolimits _{k}\sqrt{\sum \nolimits _{j}\mathrm{H}({j,k})^2}, \end{aligned} \end{aligned}$$
(7)

where \((j,k)\) denotes the position of an element in matrix \(\mathrm{Z}\) and \(\mathrm{H}\). Many efficient algorithms have been developed to solve this convex optimization problem, among which the linearized alternating direction method with the adaptive penalty (LADMAP) [26, 27] is widely used. Then \(\mathrm{Z}\) and \(\mathrm{H}\) can be obtained, and \(\mathrm{W}\) can be calculated by multiplying matrix \(\mathrm{N}\) and \(\mathrm{Z}\).
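The two norms in Eq. (7) are straightforward to evaluate numerically. The sketch below also includes the two standard proximal operators associated with them (element-wise soft-thresholding and column-wise shrinkage), which ADMM-style solvers typically apply in their update steps. It is not a full LADMAP implementation, and the function names are hypothetical.

```python
import numpy as np

def l1_norm(Z):
    """Sum of absolute values of all entries (first line of Eq. 7)."""
    return np.abs(Z).sum()

def l21_norm(H):
    """Sum over columns k of the Euclidean norm of column k (second line of Eq. 7)."""
    return np.sqrt((H ** 2).sum(axis=0)).sum()

def soft_threshold(X, tau):
    """Proximal operator of tau * l1 norm: shrink every entry toward zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def column_shrink(X, tau):
    """Proximal operator of tau * l2,1 norm: shrink every column toward zero."""
    norms = np.linalg.norm(X, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return X * scale

Z_demo = np.array([[1.0, -2.0], [0.0, 3.0]])
H_demo = np.array([[3.0, 0.0], [4.0, 0.0]])
```

Column-wise shrinkage sets entire columns of the error matrix to zero when their norm is below the threshold, which is why the \(l_{2,1}\) norm suits sample-specific errors.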

Fig. 3. The process of image robust sparse decomposition. The former denotes the low-frequency area, and the latter denotes the high-frequency area, which is rescaled for visualization.

3.3 Coefficient Estimation and Highlight Removal

We first decompose the grayscale map into a low-frequency part and a high-frequency part. However, the grayscale map contains only luminance information without any color information. Using only the luminance to estimate the specular reflection coefficient \(\mathrm{m}_{\textit{s}}(\mathrm{p})\) causes color distortion in the diffuse reflection part after separation. To preserve the original color of the input image, robust sparse decomposition is also implemented in the three color channels R, G, and B. Then, the information of these channels is combined into the final low-frequency component. Because our estimation of the reflection coefficient is based on global information, we assign a weight to the \(\mathrm{W}\) of each color channel to fully consider the contribution of different colors at each pixel. The weight can be calculated as:

$$\begin{aligned} \omega _{\textit{c}}(\mathrm{p}) = \dfrac{\mathrm{I}_{\textit{c}}(\mathrm{p})}{\mathrm{I}_{\textit{r}}(\mathrm{p}) + \mathrm{I}_{\textit{g}}(\mathrm{p}) + \mathrm{I}_{\textit{b}}(\mathrm{p})}, \end{aligned}$$
(8)

where \(c\) represents color channel R or G or B. The final \(\mathrm{W}\) can be written as follows:

$$\begin{aligned} \mathrm{W} = \dfrac{\mathrm{W}_{\textit{U}} + \omega _{\textit{R}} \cdot \mathrm{W}_{\textit{R}} + \omega _{\textit{G}} \cdot \mathrm{W}_{\textit{G}} + \omega _{\textit{B}} \cdot \mathrm{W}_{\textit{B}}}{4}, \end{aligned}$$
(9)

where \(\mathrm{W}_{\textit{U}}\), \(\mathrm{W}_{\textit{R}}\), \(\mathrm{W}_{\textit{G}}\), \(\mathrm{W}_{\textit{B}}\) denote the low-frequency information extracted in the grayscale and the three color channels, respectively. For the final low-frequency intensity image \(\mathrm{W}\) at a certain position, the larger the value, the stronger the specular reflection, and the larger the value of \(\mathrm{m}_{\textit{s}}(\mathrm{p})\). However, it is difficult to determine the specific interval of \(\mathrm{m}_{\textit{s}}(\mathrm{p})\). Therefore, we use the chromaticity space to solve this problem. The chromaticity \(\mathrm{C(p)}\) is usually defined as:

$$\begin{aligned} \mathrm{C(p)} = \dfrac{\mathrm{I(p)}}{\sum \nolimits _{\textit{c}\in \{\textit{r},\textit{g},\textit{b}\}}\mathrm{I}_{\textit{c}}(\mathrm{p})}. \end{aligned}$$
(10)
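Equations (8) to (10) can be sketched together, since the per-channel weight in Eq. (8) has the same form as the chromaticity in Eq. (10). The array layout (height, width, 3) and the function names are assumptions made for illustration.

```python
import numpy as np

def chromaticity(I, eps=1e-12):
    """Eq. (10): divide each channel by the per-pixel channel sum."""
    I = np.asarray(I, dtype=float)
    total = I.sum(axis=-1, keepdims=True)
    return I / np.maximum(total, eps)   # eps guards against all-black pixels

def fuse_low_frequency(W_U, W_rgb, I):
    """Eq. (9): average the grayscale map with the weighted color maps.

    W_U has shape (H, W); W_rgb and I have shape (H, W, 3).
    """
    weights = chromaticity(I)           # Eq. (8) has the same form as Eq. (10)
    return (W_U + (weights * W_rgb).sum(axis=-1)) / 4.0

# Hypothetical single-pixel example.
I_demo = np.array([[[2.0, 1.0, 1.0]]])
W_demo = fuse_low_frequency(np.ones((1, 1)), np.ones((1, 1, 3)), I_demo)
```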

Substituting (1) into (10), \(\mathrm{C(p)}\) can be written as follows:

$$\begin{aligned} \mathrm{C(p)} = \dfrac{\mathrm{m}_{\textit{d}}(\mathrm{p})}{\sum \nolimits _{\textit{c}\in \{\textit{r},\textit{g},\textit{b}\}}\mathrm{I}_{\textit{c}}(\mathrm{p})}\Lambda (\mathrm{p}) + \dfrac{\mathrm{m}_{\textit{s}}(\mathrm{p})}{\sum \nolimits _{\textit{c}\in \{\textit{r},\textit{g},\textit{b}\}}\mathrm{I}_{\textit{c}}(\mathrm{p})}\Gamma (\mathrm{p}). \end{aligned}$$
(11)

Then, chromaticities \(\Lambda (\mathrm{p})\) and \(\Gamma (\mathrm{p})\) are normalized, i.e., \(\sum \nolimits _{\textit{c}\in \{\textit{r},\textit{g},\textit{b}\}}\Lambda _{\textit{c}}(\mathrm{p})=1\) and \(\sum \nolimits _{\textit{c}\in \{\textit{r},\textit{g},\textit{b}\}}\Gamma _{\textit{c}}(\mathrm{p})=1\). After that, combining with Eq. (1), the sum of the pixel intensities of the three channels can be expressed as:

$$\begin{aligned} \sum \nolimits _{\textit{c}\in \{\textit{r},\textit{g},\textit{b}\}}\mathrm{I}_{\textit{c}}(\mathrm{p}) = \mathrm{m}_{\textit{d}}(\mathrm{p}) + \mathrm{m}_{\textit{s}}(\mathrm{p}). \end{aligned}$$
(12)

As a result, chromaticity can be written as the following:

$$\begin{aligned} \mathrm{C(p)} = \dfrac{\mathrm{m}_{\textit{d}}(\mathrm{p})}{\mathrm{m}_{\textit{d}}(\mathrm{p}) + \mathrm{m}_{\textit{s}}(\mathrm{p})}\Lambda (\mathrm{p}) + \dfrac{\mathrm{m}_{\textit{s}}(\mathrm{p})}{\mathrm{m}_{\textit{d}}(\mathrm{p}) + \mathrm{m}_{\textit{s}}(\mathrm{p})}\Gamma (\mathrm{p}). \end{aligned}$$
(13)

This process can be regarded as normalizing the reflection coefficient to the [0, 1] interval. \(\mathrm{m}_{\textit{s}}(\mathrm{p})\) is strongly related to \(\mathrm{W}\). To simplify the estimation of \(\mathrm{m}_{\textit{s}}(\mathrm{p})\), we directly use \(\mathrm{W}\) to approximate it. Moreover, to avoid over-separation or incomplete separation of the specular reflection, the estimation is written as:

$$\begin{aligned} \dfrac{\mathrm{m}_{\textit{s}}(\mathrm{p})}{\mathrm{m}_{\textit{d}}(\mathrm{p}) + \mathrm{m}_{\textit{s}}(\mathrm{p})} = \alpha \,\mathrm{W,} \end{aligned}$$
(14)

where \( \alpha \) is an adjustable parameter, whose range is [0, 1]. In this work, we empirically set the value of \( \alpha \) to 0.6, and present the detailed discussion in the experiment section. Thus, the specular reflection component \(\mathrm{S(p)}\) can be obtained with Eq. (12), (13) and (14). The image after highlight removal can be obtained by subtracting the specular component from the original image.
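Combining Eq. (12) and Eq. (14) gives \(\mathrm{m}_{\textit{s}}(\mathrm{p}) = \alpha \,\mathrm{W}(\mathrm{p}) \sum \nolimits _{\textit{c}}\mathrm{I}_{\textit{c}}(\mathrm{p})\), so that \(\mathrm{S(p)} = \mathrm{m}_{\textit{s}}(\mathrm{p})\Gamma \) and \(\mathrm{D(p)} = \mathrm{I(p)} - \mathrm{S(p)}\). A minimal sketch of this final step follows. It assumes that \(\mathrm{W}\) has already been normalized to [0, 1] and that the illumination is white; clipping negative values is an added safeguard and not part of the derivation.

```python
import numpy as np

def remove_highlight(I, W, alpha=0.6, gamma=(1/3, 1/3, 1/3)):
    """Subtract the estimated specular component from an (H, W, 3) image."""
    I = np.asarray(I, dtype=float)
    total = I.sum(axis=-1)                    # Eq. (12): m_d + m_s
    m_s = alpha * np.asarray(W) * total       # Eq. (14) solved for m_s
    S = m_s[..., None] * np.asarray(gamma)    # specular component S = m_s * Gamma
    D = np.clip(I - S, 0.0, None)             # assumption: clip negative intensities
    return D, S

# Hypothetical single-pixel example.
I_px = np.array([[[0.9, 0.6, 0.6]]])
W_px = np.array([[0.5]])
D_px, S_px = remove_highlight(I_px, W_px)
```

In this example the channel sum is 2.1, so \(\mathrm{m}_{\textit{s}} = 0.6 \times 0.5 \times 2.1 = 0.63\) and each channel loses 0.21.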

4 Experiments

In this section, we evaluate the highlight removal performance of our method against currently effective methods. Following [12, 17], we first use some commonly used laboratory images to perform quantitative comparisons. Then, some typical natural images, which include challenging scenarios with rough surfaces, saturated pixels, and complex textures, are used to perform visual comparisons. Note that real-world natural images have no ground truth; therefore, we do not provide quantitative results for them. In addition, the code of the related state-of-the-art methods [1, 18] has not been released, so we cannot provide comparison results for these methods.

Fig. 4. Diffuse components of experimental scene images Masks and Fruits. (a) Input highlight images, (b) ground truths, (c) results of [10], (d) results of [12], (e) results of [17], (f) ours. (Color figure online)

Table 1. Quantitative comparison in terms of PSNR and SSIM for the images in Fig. 4
Table 2. Quantitative comparison in terms of PSNR and SSIM for the images from SPEC database.

4.1 Quantitative Comparison on Laboratory Images

We first show the separation results of experimental scene images under a black background, which was often used in past methods. Figure 4 shows Masks with pure and multicolored surfaces. The methods proposed in [10], [12], and [17] all create new artifacts while removing highlights on the yellow region of the mask on the left. The method in [12] makes the result darker due to over-separation of the specular component. In contrast, our method removes highlights better in the yellow and blue regions, producing results that are closer to the ground truth. The surface in our result looks smoother and more continuous, which is why our method produces the highest PSNR and SSIM values, as shown in Table 1. For the Fruits image, our method does not achieve the best separation results but produces the highest SSIM, which shows that it is competitive with other methods in retaining the original structure of the image.

Fig. 5. Diffuse components of images close to natural scene Woodlego, Vase, Wire, and Key. (a) Input highlight images, (b) ground truths, (c) results of [8], (d) results of [10], (e) results of [12], (f) ours.

Moreover, we use the challenging images from the SPEC database to perform quantitative and qualitative comparisons. These images are taken under ambient light conditions created in the laboratory, which are closer to natural scenes but somewhat extreme. As shown in Fig. 5, the bilateral filtering method [8] and the intensity ratio method [10] introduce a large number of black and white noise points into the images during highlight removal, impairing the original image structure even on a single-color surface. The color-lines constraint method [12] tends to generate blurry images because it is based on diffuse reflection pixel clustering. For scenes with unclear color or some metal surfaces, it is difficult to cluster pixels through the chromaticity alone, and most of the image details are lost. In contrast, our method achieves good results in all these scenarios. Specifically, the proposed method preserves the original structure and retains more edge details. For severely overexposed scenes, the proposed method can still remove the specular highlight without introducing artifacts or impairing the image structure. For these challenging scenes, the proposed method produces the highest PSNR and SSIM values by a clear margin over the other methods, as shown in Table 2.
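For reference, PSNR between a ground-truth diffuse image and an estimated one can be computed with the standard definition sketched below. The peak value of 255 assumes 8-bit images; this is an illustrative helper, not the evaluation script used for Tables 1 and 2.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)   # mean squared error over all pixels
    if mse == 0:
        return float("inf")           # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Hypothetical example: every pixel is off by one gray level.
truth = np.full((4, 4, 3), 100.0)
off_by_one = psnr(truth, truth + 1.0)
```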

4.2 Visual Effect Comparison on Natural Scene Images

Finally, we show the performance in natural scenes to further validate our method. These images, taken under natural illumination, are from [12] and [17]. As ground truth results are unavailable, we provide a visual appearance comparison in Fig. 6. The method in [17] consistently achieves better highlight removal results than the other compared methods. However, all of the compared methods are inclined to degrade details and introduce additional artifacts, as shown in the red boxes. As illustrated in the first row of Fig. 6, the proposed method achieves more natural highlight removal, and the words on the plate are the clearest. For the multicolored surfaces shown in the second row, our method produces a pleasing visual result. The transition between highlight and non-highlight regions is very natural, without the blurry edges that exist in other methods. In the third row, we remove highlights not only on the main object, the lock, but also on the background. The texture details in the background are well recovered by our method. Moreover, as shown in the fourth row, the leaves recovered by the proposed method are more natural and colorful than those of other methods. In summary, the proposed method produces better highlight removal results for natural scene images.

Fig. 6. Highlight removal results for natural scene images. (a) Input highlight images, (b) results of [8], (c) results of [10], (d) results of [12], (e) results of [17], (f) ours. (Color figure online)

4.3 Discussion of Important Parameters

The Effect of Balance Parameter \(\lambda \). Parameter \(\lambda \) is used to balance the effect of low-frequency and high-frequency components. When \(\lambda \) is larger, less high-frequency information and more low-frequency information are separated. In Fig. 7(a), we observe the trend of PSNR when \(\lambda \) changes. Generally, PSNR shows an upward tendency with increasing \(\lambda \), with only a few fluctuations. When \(\lambda >1.5\), the PSNR values decrease slightly. Intuitively, when \(\lambda \) is small, the high-frequency component is excessively separated, resulting in a lack of information used for estimation; the image after separation is partially blurred. However, when \(\lambda >1.5\), very few high-frequency parts are separated, and the low-frequency estimation does not preserve texture details very well. The maximum PSNR values appear in \(\lambda \in [1.1,1.5]\). In this work, \(\lambda \) is set to 1.4, and this value works well for most of the highlight images.

Fig. 7. Discussion of important parameters.

The Effect of Adjustable Parameter \(\alpha \). Figure 7(b) shows the visual comparisons of natural images with different \(\alpha \). A larger \(\alpha \) makes the image darker, while a smaller \(\alpha \) prevents the specular reflection from being completely separated. When \(\alpha \) is 0.6, the proposed method achieves a relatively better visual result. Although \( \alpha = 0.6\) may not be optimal for every image, small deviations from it have little impact on the final result.

5 Conclusion

In this paper, we proposed an effective method for removing specular highlights focusing on captured natural scene images. The background and the distribution characteristics of ambient light are fully considered and the difference between the natural scene and the experimental scene is explained. We first constructed smooth feature images based on robust sparse decomposition. Then, we combined the smooth feature information with three color channels’ information, and assigned different weights according to the contributions of different colors at each pixel to ensure that the color is not distorted. Finally, we converted the image into the chromaticity space, where the normalized smooth feature can be used as an accurate estimate of the specular reflection coefficient. Our method achieved very pleasing highlight removal results in natural scene images. It can preserve the original structure information and details of images. However, our method could not detect subtle bright spots or recover information that has been damaged by highlights, which will be addressed in future work.