
1 Introduction

Intrinsic image decomposition aims at separating an illumination-invariant reflectance image from an input color image. Such a decomposition has numerous applications in color enhancement, image segmentation, pattern recognition, and object tracking [1,2,3]. The separated shading component is used in BRDF estimation and shadow removal methods [4,5,6]. However, while intrinsic images have many applications, recovering them remains a substantial challenge. Estimating the intrinsic components is an ill-conditioned problem: a single image can be decomposed into infinitely many different combinations of reflectance and illumination. Thus, additional constraints or priors are needed to select an appropriate solution. Priors on reflectance (albedo) and shading are usually based on physical principles of light and object interaction, scene geometry, and material properties, as well as on expert knowledge of what intrinsic images should look like. Finally, decomposition into reflectance and illumination components is suitable only for diffuse (Lambertian) objects. According to the dichromatic model introduced by Shafer [7], if glossy (non-Lambertian) objects are present in a scene, a specular term has to be taken into account. Many classical approaches fail when the target scene contains non-Lambertian objects; since specularity depends on the view point, it is hardly possible to estimate it from a single image.

To improve the accuracy of intrinsic images, researchers use additional information, for instance a video sequence instead of a single image, RGB-D imaging sensors, or manual labeling. This information may be incomplete, suffer from sensor noise and calibration errors, or depend on human input. Computing it may be time consuming and require complex experiments or special equipment, which makes it hardly usable in, for example, industrial applications.

In this work, we leverage light fields for intrinsic image decomposition. 4D light fields are widely used in image analysis and computer graphics. The key idea of a light field is to represent a scene not as a traditional 2D image, which contains information about the accumulated intensity at each image point, but as a collection of images of the same scene from slightly different view points, see Fig. 1. The specific structure of the light field allows a wide range of applications. It is used for efficient depth estimation, virtual refocusing, automatic glare reduction, as well as object insertion and removal [8,9,10]. Recently, the inherent structure of the light field was leveraged for shape and BRDF estimation [4, 11].

Fig. 1. The top-left image shows the center view of a light field parametrized by image coordinates x and y. On the bottom and right, the epipolar plane images (EPIs) for the white lines in the center view are shown, where s and t describe view point coordinates. As the camera moves, 3D scene points trace straight lines on the EPIs, whose slope corresponds to disparity. Any assignment of a property of a scene point to rays should be constant along these lines, which can be leveraged for consistent regularization [12].

Contributions. In this paper, we formulate and solve intrinsic light field decomposition by means of an optimization problem for albedo, shading, and specularity. As far as we are aware, this is the first time this problem is addressed for 4D light fields. Based on a detailed review of the state-of-the-art in intrinsic image decomposition, we propose priors for modeling all unknowns based on additional data available in the light field. Epipolar plane image constraints encourage albedo and shading to be constant for projections of the same scene point. By means of a novel term which is specific to light fields, we can also estimate specularity and highlights, and separate them from the shading and albedo components. In experiments, we demonstrate that we outperform state-of-the-art intrinsic image decomposition based on RGB plus depth data [13], as well as an alternative approach to detect and remove light field specularity [10, 14].

2 Related Work

Intrinsic images have been a challenging research topic for many years. First introduced by Barrow and Tenenbaum [15], they divide an observed image into the product of a reflectance and an illumination image. According to Land and McCann [1], large discontinuities in pixel intensities correspond to changes in reflectance, and the remaining variation corresponds to shading. They proposed the Retinex theory, which was successfully extended and implemented for intrinsic image decomposition by Tappen et al. [16], Chung et al. [17], Grosse et al. [18], Finlayson et al. [5, 6], and many others.

Besides the Retinex approach, it is common to include additional regularization terms that describe certain physical properties of the intrinsic components. Barron and Malik [19,20,21] introduce priors on reflectance, shape, and illumination to recover intrinsic images. Shen et al. [22] employ texture information. Finlayson et al. [23] search for an invariant image which is independent of lighting and shading. Gehler and Rother [24] model reflectance values as drawn from a sparse set of basis colors. Bell et al. [25] also assume that reflectance values come from a predefined set which is unique for every image, and then iteratively adjust the reflectance values in this set.

Recently, significant improvements in intrinsic image decomposition have been achieved by using richer types of input data. Having a sequence of images with depth information available allows enforcing albedo and shading consistency between different views, Lee et al. [26]. Depth or disparity information allows incorporating spatial dependencies between pixels to construct a shading prior, Jeon et al. [27]. Chen and Koltun [13] develop a model based on RGB-D information. They separate shading into two components, direct and indirect irradiance, which significantly improves decomposition results. Barron and Malik [21] use depth to extend their SIRFS model [20] such that it is applicable to natural scenes.

Although decomposition algorithms nowadays achieve spectacular results for Lambertian scenes, their performance suffers in the non-Lambertian case in the presence of highlights or specularity. In our paper, we make use of the rich structure of the light field to estimate specularity for non-Lambertian objects. According to the dichromatic model introduced by Shafer [7], diffuse and specular reflections behave differently. Diffuse objects reflect incident light equally in all directions, thus their color is independent of the viewpoint. Specular objects reflect light in a certain direction that depends on surface orientation, thus their color depends on the viewpoint, the light source color, and physical material properties. Blake and Bülthoff [28] provide an extensive analysis of specular reflections and propose a strategy for recovering 3D structure using specularity. Swaminathan et al. [29] study photometric properties of specular pixels and model their motion depending on the surface geometry. Adato et al. [30] model specular flow with non-linear partial differential equations. Tao et al. [10, 14] introduced depth estimation for glossy surfaces. They leverage the light field structure to cluster pixels into specular and specular-free groups, then remove the specular components from the input light field.

3 Intrinsic Light Field Model

Light Field Structure. We briefly describe the light field structure and review notation. For more detailed information, we refer to [12, 31]. A light field is defined on the 4D ray space \({\mathcal R} = \varPi \times \varOmega \), which parametrizes rays \({\varvec{r}} = (x,y,s,t)\) by their intersection coordinates with two planes \(\varPi \) and \(\varOmega \). The intersection with the focal plane \(\varPi \) gives the view point coordinates \((s,t)\), while the image plane \(\varOmega \) provides the image coordinates \((x,y)\), see Fig. 1. A 4D light field is now a map \(L: \mathcal {R} \rightarrow \mathbb {R}^{n}\) on ray space. It can be scalar or vector-valued for grey scale or color images, respectively.
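As a concrete illustration of this parametrization, the following sketch (not from the paper; the array layout and names are our assumptions) stores a light field as a NumPy array indexed by view and image coordinates and slices out the epipolar plane images shown in Fig. 1.

```python
import numpy as np

# A light field stored as L[s, t, y, x, c]: view point coordinates (s, t),
# image coordinates (y, x), RGB channel c. Layout and names are assumptions.

def extract_epi_xs(L, t0, y0):
    """Horizontal EPI: the (s, x) slice for a fixed view row t0 and image row y0."""
    return L[:, t0, y0, :, :]              # shape (n_views_s, width, 3)

def extract_epi_yt(L, s0, x0):
    """Vertical EPI: the (t, y) slice for a fixed view column s0 and image column x0."""
    return L[s0, :, :, x0, :]              # shape (n_views_t, height, 3)

# Example: random data standing in for a 9x9-view light field of 256x256 views.
L = np.random.rand(9, 9, 256, 256, 3)
epi = extract_epi_xs(L, t0=4, y0=128)      # scene points trace lines with slope ~ disparity
```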

Light Field Decomposition. We model an intrinsic light field as a function

$$\begin{aligned} L({\varvec{r}}) = A({\varvec{r}})S({\varvec{r}}) + H({\varvec{r}}), \end{aligned}$$
(1)

where the radiance L of every ray \({\varvec{r}}\) is decomposed into albedo A, shading S, and specular component H. The functions \(L,A,S,H : \mathcal {R} \rightarrow \mathbb {R}^{3}\) map ray space to RGB values. Albedo represents the color of an object independent of illumination and camera position. Shading describes intensity changes due to illumination, inter-reflections, and object geometry. Finally, the specular component represents highlights that occur for non-Lambertian objects; it depends on illumination, object geometry, and camera position.

The common assumption in the literature related to intrinsic image decomposition is to model the shading component as mono-chromatic [16, 18, 24]. However, in case of multiple light sources or non-Planckian light, this modeling assumption is not sufficient. Thus, we further decompose shading into mono-chromatic shading s and trichromatic light source color C,

$$\begin{aligned} S({\varvec{r}}) = s({\varvec{r}}) C({\varvec{r}}). \end{aligned}$$
(2)

We directly compute the illumination component C in a pre-processing step with the illuminant estimation algorithm developed by Yang et al. [32] applied to the center view, assuming that it will be similar across views. After illumination color is computed, we exclude it from the original light field by switching to the new decomposition model

$$\begin{aligned} \frac{L({\varvec{r}})}{C({\varvec{r}})} = A({\varvec{r}})s({\varvec{r}}) + \frac{H({\varvec{r}})}{C({\varvec{r}})} \end{aligned}$$
(3)

which is illumination color free. Vector division is to be understood component-wise. As a further simplification, we obtain System (3) in linear form

$$\begin{aligned} L^{log}({\varvec{r}}) = A^{log}({\varvec{r}}) + {\varvec{1}}s^{log}({\varvec{r}}) + H^{log}({\varvec{r}}, A,s,H) \end{aligned}$$
(4)

by applying the logarithm. We now want to solve (4) with respect to albedo, shading, and specularity.
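A minimal sketch of the pre-processing and log-linearization in Eqs. (3) and (4), assuming the light field and the illuminant estimate are given as NumPy arrays. The paper obtains C with the method of Yang et al. [32] on the center view; the near-white illuminant used here is only a placeholder.

```python
import numpy as np

# Hedged sketch of Eqs. (3)-(4): divide out a per-ray illuminant color estimate C
# and move to the log domain, where the product A*s becomes a sum.
def to_log_model(L, C, eps=1e-6):
    """L, C: (..., 3) arrays of RGB radiance and illuminant color."""
    L_white = L / np.maximum(C, eps)         # Eq. (3): component-wise division
    return np.log(np.maximum(L_white, eps))  # Eq. (4): L^log = A^log + 1*s^log + H^log

# Example with placeholder data: the unknowns then satisfy, per ray r,
#   L_log[r] ≈ A_log[r] + s_log[r] * ones(3) + H_log[r].
L = np.random.rand(9, 9, 64, 64, 3)
C = np.full_like(L, 0.9)                     # assumed (nearly white) illuminant estimate
L_log = to_log_model(L, C)
```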

System (4) is ill-posed, since the number of unknowns is three times larger than the number of equations. To select a solution that agrees with the physical meaning of the intrinsic components, we pose it as an inverse problem and introduce a number of constraints or regularization terms for albedo, shading, and specularity. As usual, the dependence of \(H^{log}\) on all arguments except \({\varvec{r}}\) is ignored during optimization, and it is estimated as another independent component. We thus solve a global energy minimization problem where we weight the residual of (4) with different priors and regularization terms,

$$\begin{aligned} \mathop {{{\mathrm{arg\,min}}}}\limits _{(A^{log},s^{log},H^{log})} \Bigl \{\;&{||L^{log}({\varvec{r}}) - A^{log}({\varvec{r}}) - \varvec{1}s^{log}({\varvec{r}}) - H^{log}({\varvec{r}}) ||}_2^2 \\&\quad +\, P_\text {albedo}(A^{log}) \,+\, P_\text {shading}(s^{log}) \,+\, P_\text {spec}(H^{log}) \,+\, J( A^{log}, s^{log}) \;\Bigr \}. \end{aligned}$$
(5)

The priors \(P_\text {albedo}\) and \(P_\text {shading}\) for albedo and shading essentially apply the key ideas in intrinsic image decomposition to every subaperture image. They are defined in Sect. 4. The specularity prior \(P_\text {spec}\) is specific to light fields, and a main contribution of our work. It is described in detail in Sect. 5. Finally, the smoothing prior J across ray space encourages spatial smoothness and in particular consistency across different subaperture images. It relies on disparity, and is described together with the optimization framework in Sect. 6.

4 Albedo and Shading Priors

We start by describing the priors, which are key for obtaining an accurate solution for intrinsic light field decomposition from the variational model (5). In this section, we introduce the priors \(P_\text {albedo}\) and \(P_\text {shading}\) for albedo and shading, respectively.

Albedo. To model albedo, we combine ideas of Retinex theory, which is widely used to decompose an image into shading and reflectance components [16, 17, 33], with the idea that pixels with equal chromaticity are likely to have similar albedo [13, 26, 34]. Thus, the prior for albedo is the sum of two energies, \(P_\text {albedo}(A^{log}) = E_\text {retinex}(A^{log}) + E_\text {chroma}(A^{log})\), corresponding to these two models.

Under the simplifying assumption that image derivatives in the log-domain are caused either by shading or by reflectance, we classify the derivative at every ray as caused by shading or by albedo. The idea is to compute a modified gradient field \(\hat{g}\) which assigns a zero value to all derivatives that are caused by shading. The derivative classification is done with an approach similar to the Color Retinex used in [17, 18]. A partial spatial derivative \(L_x\) of the light field is classified as albedo if the neighbouring RGB vectors in the direction of differentiation are not parallel, or if it is above a certain magnitude. Thus, the modified derivative is

$$\begin{aligned} \hat{g}_x = {\left\{ \begin{array}{ll} L_x &{} \text { if } \varvec{c}_{x+1,y} \cdot \varvec{c}_{x,y} < \tau _{col} \text { or } |L_x| > \tau _{grad},\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(6)

Above, \(\varvec{c} = (r,g,b)^T\), the constant \(\tau _{col}>0\) is a threshold above which two vectors are assumed to be parallel, and \(\tau _{grad} >0\) is another user-defined constant. In a similar way, we estimate the modified partial derivative \(\hat{g}_y\) in the second spatial direction.
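The classification in Eq. (6) can be sketched as follows for a single subaperture view; normalizing the color vectors before the dot product and measuring \(|L_x|\) as the channel-wise norm are our assumptions, not prescribed by the text.

```python
import numpy as np

# Hedged sketch of Eq. (6) for one subaperture view I of shape (H, W, 3).
# tau_col, tau_grad are the user constants from the text.
def modified_gradient_x(I, tau_col=0.98, tau_grad=0.1, eps=1e-8):
    Lx = np.diff(I, axis=1, append=I[:, -1:, :])        # forward difference in x
    c = I / (np.linalg.norm(I, axis=2, keepdims=True) + eps)
    c_next = np.roll(c, -1, axis=1)                     # color vector at (x+1, y)
    dot = np.sum(c * c_next, axis=2)                    # cosine of the color angle
    albedo_edge = (dot < tau_col) | (np.linalg.norm(Lx, axis=2) > tau_grad)
    g_hat = np.where(albedo_edge[..., None], Lx, 0.0)   # zero out shading-caused edges
    return g_hat
```

The second spatial direction is handled the same way by differencing along the rows.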

The gradient of the albedo should be equal to the Retinex-modified gradient field, thus we finally obtain the Retinex energy

$$\begin{aligned} E_{retinex}(A^{log}) = \lambda _{retinex}\int _{\mathcal {R}} {||\partial _x A^{log}({\varvec{r}}) - \hat{g}_x({\varvec{r}}) ||}^2 \,+\, {||\partial _y A^{log}({\varvec{r}}) - \hat{g}_y({\varvec{r}}) ||}^2 \, d{\varvec{r}}. \end{aligned}$$
(7)

The second regularization term is based on chromaticity similarities between adjacent rays. The basic idea is that if two neighboring rays of the same view have close chromaticity values, they have the same albedo. We use the chromaticity measure described by Chen and Koltun [13], which gives a weight \(\alpha _{{\varvec{r}},\varvec{q}}\) for how likely it is that two rays \({\varvec{r}}\) and \(\varvec{q}\) have the same albedo,

$$\begin{aligned} {\begin{matrix} \alpha _{{\varvec{r}},{\varvec{q}}} = \Big ( 1 - \frac{ {||L^{ch}({\varvec{r}}) - L^{ch}(\varvec{q})||} }{\max \limits _{{\varvec{r}}^\prime \in \varOmega ,\,{\varvec{q}}^\prime \in N_A({\varvec{r}}^\prime )} { ||L^{ch}({\varvec{r}}^\prime ) - L^{ch}({\varvec{q}}^\prime )||}} \Big ) \sqrt{L^{lum}({\varvec{r}}) L^{lum}(\varvec{q})}, \end{matrix}} \end{aligned}$$
(8)

where \(N_A({\varvec{r}})\) is a neighborhood of \({\varvec{r}}\), and \(L^{ch}\) and \(L^{lum}\) are chromaticity and luminance. The chromaticity energy

$$\begin{aligned} E_{chroma}(A^{log}) = \lambda _{chroma}\int _{{\mathcal R}} \sum _{{\varvec{q}} \in N_A({\varvec{r}})} \alpha _{{\varvec{r}},{\varvec{q}}} \, {||A^{log}( {\varvec{r}} ) - A^{log}( \varvec{q} ) ||}^2 \; d {\varvec{r}} \end{aligned}$$
(9)

now penalizes dissimilarity of albedos that have chromaticity measure \(\alpha _{{\varvec{r}},\varvec{q}}\) close to one. Note that we use a mixed continuous/discrete notation for \({\varvec{r}}\) and \({\varvec{q}}\), as our choice of neighbourhood is inherently discrete, while we require a variational rayspace model in the optimization framework, see Sect. 6.

To construct the neighborhoods \(N_A({\varvec{r}})\) for every ray \({\varvec{r}}\in {\mathcal R}\), we impose the assumption that spatially close points in \(\mathbb {R}^{3}\) are likely to have similar albedo. We select the \(k_A\) nearest neighbors in \(\mathbb {R}^{3}\) of the point P on the scene surface intersected by \({\varvec{r}}\), and choose \(m_A\) out of these \(k_A\) neighbors at random. This connectivity strategy has two advantages over fully random connectivity: by restricting to nearby points we increase the chance of finding points with similar chromaticity, and by connecting randomly within this neighborhood we avoid disconnected chromaticity clusters.
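A sketch of the chromaticity weight of Eq. (8) and of this neighborhood sampling, assuming per-ray 3D positions are available from the disparity map; the kd-tree and the random generator are implementation choices, and max_chroma_dist stands for the maximum chromaticity distance in the denominator of Eq. (8).

```python
import numpy as np
from scipy.spatial import cKDTree

# Hedged sketch: pairwise chromaticity weight (Eq. 8) and the k_A / m_A
# neighborhood sampling described in the text.
def chroma_weight(L_ch_r, L_ch_q, L_lum_r, L_lum_q, max_chroma_dist):
    d = np.linalg.norm(L_ch_r - L_ch_q)
    return (1.0 - d / max_chroma_dist) * np.sqrt(L_lum_r * L_lum_q)

def sample_albedo_neighbors(points_3d, k_A=20, m_A=5, seed=0):
    """For every scene point, pick m_A random neighbors among its k_A nearest."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(points_3d)
    _, idx = tree.query(points_3d, k=k_A + 1)   # +1: the first neighbor is the point itself
    neighbors = idx[:, 1:]
    return rng.permuted(neighbors, axis=1)[:, :m_A]
```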

Shading. The shading prior is also the sum of two components, \(P_\text {shading} = E_\text {normal} + E_\text {spatial}\). To model the first component, we adopt the well-known assumption [13, 26, 35] that scene points which are spatially close to each other and share the same orientation are likely to have similar shading. To facilitate this, we construct the six-dimensional set

$$\varGamma := \left\{ \bigl ( \,P({\varvec{r}}),\; {\varvec{n}}( P({\varvec{r}}) ) \,\bigr ) \;:\; {\varvec{r}}\in {\mathcal R} \right\} ,$$

where \(P({\varvec{r}})\) is again the point of the scene surface intersected by \({\varvec{r}}\), and \({\varvec{n}}(P({\varvec{r}}))\) the corresponding outer normal. The set of neighbours \(N_S({\varvec{r}})\) now consists of the \(k_N\)-nearest neighbours of \({\varvec{r}} \in {\mathcal R}\) in the six-dimensional space \(\varGamma \). The regularization term

$$\begin{aligned} E_{normal}(s^{log}) = \lambda _{normal}\int _{{\mathcal R}} \sum _{{\varvec{q}} \in N_S({\varvec{r}})} ( s^{log}({\varvec{r}}) - s^{log}({\varvec{q}}) )^2 \; d {\varvec{r}} \end{aligned}$$
(10)

thus encourages shading components to be similar if the corresponding 3D points are spatially close to each other and their outer normals have similar orientations.

To account for indirect shading, which is caused by inter-reflections between objects in a scene, we also include a purely spatial regularization term

$$\begin{aligned} E_{spatial}(s^{log}) = \lambda _{space}\int _{{\mathcal R}} \sum _{{\varvec{q}}\in N_D({\varvec{r}})} ( s^{log}({\varvec{r}}) - s^{log}({\varvec{q}}) )^2 \; d {\varvec{r}}, \end{aligned}$$
(11)

where the neighborhood \(N_D({\varvec{r}})\) denotes the \(k_D\) nearest neighbors of the 3D scene point first intersected by \({\varvec{r}}\).
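Both neighborhoods can be built with standard nearest-neighbor queries; the sketch below assumes per-ray 3D positions and normals and uses a hypothetical weight w_n to balance position against normal coordinates in \(\varGamma \).

```python
import numpy as np
from scipy.spatial import cKDTree

# Hedged sketch of the neighborhoods behind Eqs. (10)-(11): N_S from the 6D set
# of (position, normal) pairs, N_D from 3D positions only.
def shading_neighbors(points_3d, normals, k_N=10, k_D=10, w_n=1.0):
    gamma = np.hstack([points_3d, w_n * normals])          # the set Γ from the text
    _, idx_normal = cKDTree(gamma).query(gamma, k=k_N + 1)
    _, idx_spatial = cKDTree(points_3d).query(points_3d, k=k_D + 1)
    return idx_normal[:, 1:], idx_spatial[:, 1:]           # N_S(r) and N_D(r)

def e_normal(s_log, idx_normal, lam=1.0):
    """Discrete version of Eq. (10): squared shading differences over N_S."""
    diffs = s_log[:, None] - s_log[idx_normal]
    return lam * np.sum(diffs ** 2)
```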

5 Prior for the Specular Component

In this section, we describe the specular prior in the variational energy (5). We first discuss the modeling assumptions, then show how to compute a mask for candidate specular pixels based on these assumptions, and finally construct the prior \(P_\text {spec}\).

Fig. 2. The left image shows the center view of a light field captured with a Lytro Illum camera. The right image shows the specular mask obtained by our method.

Modeling Assumptions. We combine several approaches to model specularity [2, 10, 14, 28,29,30, 36]. According to the specular motion model [28, 29], specularity changes depend on surface geometry. For instance, regions of low curvature on a specular object create color intensity changes across different views, while specular regions of high curvature result in high pixel intensities in all subaperture views. Thus, curvature information can be useful for estimating specularity. In practice, however, curvature estimation turns out to be very sensitive to inaccuracies of the 3D model of the scene: imperfect disparity maps lead to a certain amount of noise in the estimated spatial coordinates, so curvature information becomes highly unreliable. Instead of using curvature directly, we therefore propose a heuristic approach that estimates candidate regions where specularity or highlights can occur. Our main modeling assumptions are:

  • S1. Specularity is view dependent.

  • S2. If a projected 3D point has high pixel intensities and its color is constant across all subaperture views, then the point may be part of a specular surface.

  • S3. If a projected 3D point has high variation in pixel intensities, and the color of the corresponding rays changes across subaperture views, then the point may belong to a specular surface.

  • S4. If a point is classified as specular, then it is part of a specular surface, and its local neighborhood in \(\mathbb {R}^{3}\) may produce specular pixels from certain viewing angles.

  • S5. The distribution of specularity is sparse.

Potential specular objects are identified based on magnitude and variation of pixel values over different views. We compute a specular mask for the center view, and propagate it to the remaining views according to disparity.

Computing the Specular Mask. Our proposed algorithm proceeds in 4 steps:

  1.

    Let \(\varOmega _c\) be the image plane for the center view, and \(V = \{(s_1,t_1), ..., (s_N,t_N)\}\) the set of remaining N view points.

    For every \(\varvec{p} \in \varOmega _c\), we compute the vector \({\varvec{\omega }}_{\varvec{p}}\) of color intensity changes with respect to V according to

    $$\begin{aligned} \omega ^i_{\varvec{p}} = L_i(\varvec{p} + \varvec{v_i}d(\varvec{p})), \quad i = 1, \dots , N, \end{aligned}$$
    (12)

    where \(\varvec{v_i} = (s_c - s_i, t_c - t_i)\) is the view point displacement and \(d(\varvec{p})\) the estimated scalar disparity of \(\varvec{p}\).

  2.

    Identify pixels whose color and intensity vary across subaperture views, in three steps according to assumptions (S1) and (S3):

    • Filter out the percentage \(\% n_{var}\) of pixels with the lowest luminance variation \(\sigma (\varvec{\omega _p})\); we denote the set of remaining pixels by \(\varOmega ^*_c\).

    • Exclude occlusion boundaries from \(\varOmega ^*_c\). To find occlusion boundaries, we compute the k-nearest neighbors in the image domain and the corresponding spatial coordinates in \(\mathbb {R}^{3}\). If neighboring pixels in \(\varOmega ^*_c\) are far apart in \(\mathbb {R}^{3}\), with distances larger than \(d_{occ}\), we classify them as occlusion boundaries.

    • From the remaining pixels, finally exclude the percentage \(\%n_{conf}\) with the lowest confidence scores similar to the approach proposed by Tao et al. [14].

      To compute confidence, we cluster the corresponding values of \(\varvec{\omega _p}\) in two groups using K-means. Let \(\varvec{m}(\varvec{p})\) be the cluster centroid with the larger mean \(\mu (\varvec{m})\). The confidence is computed as

      $$\begin{aligned} c(\varvec{p}) = \exp \left( -\frac{1}{\sigma _{spec}^2} \left( \frac{\beta _0}{ \mu (\varvec{ m}) } + \frac{\beta _1}{ \xi (\varvec{m}) } \right) \right) , \end{aligned}$$
      (13)

      where \(\xi (\varvec{m})\) denotes the sum of all distances within the cluster.

      The confidence score grows with the mean intensity and the variation within the brightest cluster. Thus, we obtain pixels whose values vary across subaperture views. Above, \(\beta _0\) and \(\beta _1\) are user-defined parameters that control the exponential decay of the brightness and distance terms, and \(\sigma _{spec}\) scales the confidence function. We fix \(\beta _0 = 0.5, \, \beta _1 = 10^{-3}, \sigma _{spec} = 2\).

  3.

    Identify pixels where the intensity is high and the color does not change across all subaperture views, according to assumption (S2). According to Tian and Clark [36], regions with high unnormalized Wiener entropy, defined as the product of the RGB values over all pixels, are likely to be specular. We adopt their approach and identify those regions as well.

  4.

    Combine pixels found in steps 2 and 3 into the specular mask

    $$\begin{aligned} h_{mask} = {\left\{ \begin{array}{ll} 1, \, \text {specular}\\ 0, \, \text {non-specular}, \end{array}\right. } \end{aligned}$$
    (14)

    which is then grown according to assumption (S4) to include the \(k_{spec}\) nearest neighbors of each specular pixel in the initial mask.

An example specular mask for a Lytro dataset is shown in Fig. 2.
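The mask computation can be summarized in the following sketch, which operates on the per-pixel view samples \(\varvec{\omega }_{\varvec{p}}\) of Eq. (12). The occlusion-boundary test of step 2b is omitted for brevity, and the brightness/variation thresholds standing in for the Wiener-entropy criterion of step 3 are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hedged sketch of the four-step specular mask (luminance only). omega has shape
# (P, N): for each center-view pixel p, the luminance sampled in the N other
# views along its disparity, Eq. (12).
def specular_mask(omega, n_var=0.7, n_conf=0.5, beta0=0.5, beta1=1e-3, sigma_spec=2.0):
    P, _ = omega.shape
    keep = np.argsort(omega.std(axis=1))[int(n_var * P):]   # step 2a: high variation
    conf = np.zeros(P)
    for p in keep:                                           # step 2c: confidence, Eq. (13)
        centroids, labels = kmeans2(omega[p, :, None], 2, minit='points')
        bright = int(np.argmax(centroids[:, 0]))
        members = omega[p][labels == bright]
        if members.size == 0:
            continue
        mu = members.mean() + 1e-8
        xi = np.abs(members - centroids[bright, 0]).sum() + 1e-8
        conf[p] = np.exp(-(beta0 / mu + beta1 / xi) / sigma_spec ** 2)
    varying = keep[conf[keep] >= np.quantile(conf[keep], n_conf)]
    # step 3 (S2), simplified proxy: bright and nearly constant across views
    bright_const = np.where((omega.mean(axis=1) > 0.9) & (omega.std(axis=1) < 0.05))[0]
    mask = np.zeros(P, dtype=bool)                           # step 4: combine 2 and 3
    mask[varying] = True
    mask[bright_const] = True
    return mask
```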

Final Prior on Specularity. The specular component should be non-zero only within the candidate specular region given by the mask \(h_\text {mask}\) defined above. We therefore strongly penalize non-zero values outside this region by defining the final sparsity prior as

$$\begin{aligned} P_\text {spec}(H^{log}) \;=\; \lambda _{spec} \int _{\mathcal R} \gamma _w ( 1 - h_\text {mask} ) ||H^{log}(\varvec{r})||^2 \; d {\varvec{r}} \;+\; \lambda _{sparse}|| H^{log} ||_1, \end{aligned}$$
(15)

where \(\gamma _w\gg 0\) is a constant. We include an additional sparsity norm on \(H^{log}\) to account for assumption (S5).
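For completeness, a direct discrete evaluation of Eq. (15) could look as follows (the parameter values are placeholders, not the ones used in the paper).

```python
import numpy as np

# Hedged sketch of Eq. (15): outside the candidate region (h_mask == 0) the
# specular component is strongly penalized quadratically, and an L1 term keeps
# it sparse everywhere. H_log has shape (..., 3), h_mask has shape (...).
def p_spec(H_log, h_mask, lam_spec=1.0, lam_sparse=0.1, gamma_w=1e3):
    quad = gamma_w * (1.0 - h_mask)[..., None] * H_log ** 2
    return lam_spec * quad.sum() + lam_sparse * np.abs(H_log).sum()
```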

Fig. 3. Estimated disparity maps for scenes captured with a Lytro Illum camera. From left to right: an outdoor scene with the ceramic owl, a tinfoil swan, and an indoor scene with the same owl and a candle. Disparities range between \(-1.5\) and 1.5.

6 Ray Space Regularization and Optimization

We summarize the previously defined terms of the variational energy (5) in a functional F, so that to obtain the light field decomposition we have to solve

$$\begin{aligned} \begin{aligned} \mathop {{{\mathrm{arg\,min}}}}\limits _{(A^{log},s^{log},H^{log})} \Bigl \{\; F( A^{log}, s^{log}, H^{log} ) \,+\, J( A^{log}, s^{log}) \;\Bigr \}. \end{aligned} \end{aligned}$$
(16)

As is typical in intrinsic image decomposition, the overall optimization problem is rather complex. However, a detailed look at the individual terms shows that the objective F is convex. Furthermore, our intention is to define the global smoothness term J on ray space in a way that it enforces spatial smoothness within the views as well as consistency with the disparity-induced structure on the epipolar plane images. Thus, the complete objective function exactly fits the light field optimization framework for inverse problems on ray space proposed by Goldluecke and Wanner [12]. The key advantage of this framework is its computational efficiency, since it allows solving subproblems for each epipolar plane image and each view independently. It is also generic in the sense that we just need to provide a way to compute F and the related proximity operators. We thus adopt their method to solve our problem.

Table 1. Main parameters for intrinsic image decomposition used in the implementation.

In [12], the light field regularizer J in (16) is a sum of several contributions. First, there are individual regularizers \(J_{xs}\) and \(J_{yt}\) for each epipolar plane image, which depend on the disparity map and employ an anisotropic total variation to enforce consistency of the fields in their arguments with the linear patterns on the epipolar plane images, see Fig. 1. Second, for each view, there is a regularizer \(J_{st}\); as in the basic framework in [12], we use a simple total variation term for efficiency. In future work, we intend to move to something more sophisticated here.

Fig. 4. Center view images showing the light field decomposition. The first row depicts a decomposition with our approach into albedo, shading, and specularity. The second row illustrates Chen and Koltun's algorithm [13], where the center view image is decomposed into albedo and shading with the additional input of our generated depth map. The third row illustrates the original image and the diffuse and specular images obtained by Tao et al.'s method [10]. Due to the EPI constraints and the specularity term, our shading component does not include specular highlights, unlike Chen and Koltun's result. We also removed most cast shadows from the albedo image, while the smoothness priors prevent albedo and shading discontinuities. Tao et al. detect fewer of the specular regions compared to ours: their algorithm identifies mostly the boundaries of specular regions and removes only those boundaries from the diffuse image. Disparity errors create intensity variations across subaperture views, which are erroneously classified as specularity on occlusion boundaries. Our algorithm detects the complete specular regions, since it is more robust to inaccurate disparity estimation.

Fig. 5. Center view of the outdoor scene with an origami swan made from aluminum foil. Comparing the shading and albedo images, we conclude that our algorithm detects more cast shadows than Chen and Koltun's algorithm. Our specular component contains more correctly classified glossy regions than the one produced by the algorithm of Tao et al. We observe that their approach predominantly detects the boundaries of specular regions, thus only these are removed in the generated specular-free image.

Albedo and shading are independent of the view point, thus their values should not vary between views. We therefore want \(A^{log}\) and \(s^{log}\) to be constant along the direction induced by the disparity on the epipolar plane images, except at disparity discontinuities. We also regularize both components within each individual view, as noted above. The complete regularizer can thus be written as

$$\begin{aligned} J(A^{log}, s^{log}) = \mu J_{xs}(A^{log}, s^{log}) + \mu J_{yt}(A^{log}, s^{log}) + \lambda J_{st}(A^{log}, s^{log}), \end{aligned}$$
(17)

where \(\lambda , \mu > 0\) are user-defined constants which control the amount of smoothing on the separate views and on the EPIs, respectively. The objective is convex, so we achieve global optimality. For details and the actual optimization algorithm we refer to [12].
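To illustrate the structure of Eq. (17) without reproducing the full framework of [12], the sketch below replaces the anisotropic EPI total variation by a simple directional finite difference along the disparity-induced line direction \((d,1)\) in the \((x,s)\) plane, which vanishes exactly when a field is constant along the EPI lines. This is a simplified stand-in, not the regularizer actually used.

```python
import numpy as np

# Hedged sketch of the composition of Eq. (17).
def epi_line_penalty(U_xs, d_xs):
    """U_xs: field on one EPI, shape (S, X); d_xs: disparity on that EPI."""
    dUdx = np.gradient(U_xs, axis=1)
    dUds = np.gradient(U_xs, axis=0)
    return np.sum(np.abs(d_xs * dUdx + dUds))   # small if U is constant along EPI lines

def tv_view(U):
    """Isotropic total variation of a field on a single view, shape (H, W)."""
    gy, gx = np.gradient(U)
    return np.sum(np.sqrt(gx ** 2 + gy ** 2))

def J(U_epis_xs, d_epis_xs, U_views, mu=1.0, lam=1.0):
    return (mu * sum(epi_line_penalty(u, d) for u, d in zip(U_epis_xs, d_epis_xs))
            + lam * sum(tv_view(u) for u in U_views))
```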

Fig. 6. Center view of an indoor scene with several light sources and non-trivial chromaticity. We observe a difference in the lighting of the albedo images: in our approach, the albedo image is free of illumination color, in contrast to the albedo image produced with Chen and Koltun's algorithm. A likely reason is that we first compute the illumination color and exclude it from the optimization model, while in Chen and Koltun's approach the illumination color is part of the optimization. Both shading components are specular-free, while the albedo component of Chen and Koltun's algorithm contains specularity. The specular component detected with Tao et al.'s algorithm is close to zero, which can be explained by the poor initial disparity estimation that causes erroneous pixel classification. Our approach outperforms Tao et al., since our specular detection algorithm is occlusion aware and also analyses regions where pixels have high intensities.

Fig. 7. Potential locations for specularity detected with our algorithm. The left image shows candidate specular regions for the origami swan, the right image depicts the specular mask for the owl and candle scene.

7 Results

We validate our decomposition method on light fields captured with a Lytro Illum plenoptic camera, as well as on synthetic and gantry data sets provided by Wanner et al. [37]. In the paper, we present selected results for real-world indoor and outdoor scenes; the rest is shown in the supplementary material. While benchmark datasets for evaluating intrinsic image decomposition are presented in [18, 25, 38], those data sets are designed for algorithm evaluation on single RGB or RGB+D images, or on optical flow; they are not applicable to our light field based method. Since no ground truth intrinsic light fields are available so far, we evaluate our method visually and with qualitative comparisons, deferring the rendering of a novel benchmark to future work.

To recover a 3D model and estimate normals, we perform disparity estimation with the multi-view stereo method described in [9], with an improved, more occlusion-aware data term, refined with further smoothing using a generalized total variation regularizer; see the estimated disparity maps in Fig. 3. The main algorithm parameters and their values are presented in Table 1. Our method is implemented in Matlab R2015b, with run-times measured on a PC with an Intel(R) Core i7-4790 CPU at 3.60 GHz and an NVIDIA GeForce GTX 980.

Evaluation Results. Since there are no intrinsic image decomposition algorithms that consider specularity, we compare the results of our specular term against a recent algorithm for depth estimation and specular removal developed for light field cameras by Tao et al. [10]. To compare the albedo and shading terms, we investigated recently published algorithms that employ 3D information. There are several papers where depth information is used for intrinsic image decomposition [13, 21, 26, 27]. We selected the algorithm developed by Chen and Koltun [13] to compare against, since it outperforms other algorithms that use 3D information. For both comparisons, we use the authors' implementations with default parameter settings. Figures 4, 5 and 6 show the original image together with the results of our proposed decomposition method, Chen and Koltun [13], and Tao et al. [10]. For all images, the contrast was enhanced using the Matlab function imadjust for better visualization. Figure 7 illustrates the specular masks for the origami swan and the owl-with-candle light fields.

We also compared the runtimes of the algorithms. The Chen and Koltun algorithm converges in 20–30 min for a single image, and the method by Tao et al. (including depth estimation) takes 60 min. Our approach, evaluated on a cross-hair shaped subset of 17 views from a light field with \(9 \times 9\) views in total, converges in 30–40 min, which amounts to 1.7–2.4 min per frame.

8 Conclusions

In this work, we propose the first approach towards solving the intrinsic 4D light field decomposition problem while leveraging the disparity-induced structure on the epipolar plane images. In contrast to existing intrinsic image algorithms, the dense collection of views in a light field allows us to define an additional specular term in the decomposition model, so that we can optimize over the specular component as well as albedo and shading by minimizing a single variational functional. As the inverse decomposition problem is embedded in a recent framework for light field labeling [12], we can ensure that albedo and shading estimates are consistent across all views and make use of information from all of them. Experiments demonstrate that, on challenging non-Lambertian scenes, we outperform both a state-of-the-art intrinsic image decomposition method employing additional depth information [13] and a light field based method for specular removal [10, 14].