1 Introduction

Acquiring the shape and the reflectance of a scene is a key issue, e.g., for the movie industry, as it allows proper relighting. Currently proposed solutions focus on small objects and rely on multiple priors [39] or require very controlled environments [34, Chapter 9]. Well-established shape acquisition techniques such as multi-view stereo exist for accurate 3D-reconstruction. Nevertheless, they do not aim at recovering the surface reflectance. Hence, the original input images are usually mapped onto the 3D-reconstruction as texture. Since the image graylevel mixes shading information (induced by lighting and geometry) and reflectance (which is characteristic of the surface), relighting based on this approach usually lacks realism. To improve the results, reflectance needs to be separated from shading.

To illustrate our purpose more precisely, let us take the example of a Lambertian surface. At a 2D-point (pixel) \(\mathbf {p}\) conjugate to a 3D-point \(\mathbf {x}\) of a Lambertian surface, the graylevel \(I(\mathbf {p})\) is written

$$\begin{aligned} I(\mathbf {p}) = \rho (\mathbf {x}) \, \mathbf {s}(\mathbf {x}) \cdot \mathbf {n}(\mathbf {x}). \end{aligned}$$
(1)

In the right-hand side of (1), \(\rho (\mathbf {x}) \in \mathbb {R}\) is the albedo, \(\mathbf {s}(\mathbf {x}) \in \mathbb {R}^3\) the lighting vector, and \(\mathbf {n}(\mathbf {x}) \in \mathbb {S}^2 \subset \mathbb {R}^3\) the outer unit-length normal to the surface. All these elements a priori depend on \(\mathbf {x}\), i.e., they are defined locally. Whereas \(I(\mathbf {p})\) is always supposed to be given, different situations can occur, depending on which of \(\rho (\mathbf {x})\), \(\mathbf {s}(\mathbf {x})\) and \(\mathbf {n}(\mathbf {x})\) are also known.

Fig. 1

The “workshop metaphor” (extracted from a paper by Adelson and Pentland [1]). Image (a) may be interpreted by: b incorporating all the brightness variations inside the reflectance; c modulating the lighting of a white planar surface; d designing a uniformly white 3D-shape illuminated by a parallel and uniform light beam. This last interpretation is one of the solutions of the shape-from-shading problem

One equation (1) per pixel is not enough to simultaneously estimate the reflectance \(\rho (\mathbf {x})\), the lighting \(\mathbf {s}(\mathbf {x})\) and the geometry, represented here by \(\mathbf {n}(\mathbf {x})\), because there are many more unknowns than equations. Figure 1 illustrates this source of ill-posedness through the so-called workshop metaphor introduced by Adelson and Pentland in [1]: among three plausible interpretations (b), (c) and (d) of image (a), we are particularly interested in (d), which illustrates the principle of photometric 3D-reconstruction. This class of methods usually assumes that the lighting \(\mathbf {s}(\mathbf {x})\) is known. Still, there remain three scalar unknowns per Eq. (1): \(\rho (\mathbf {x})\), and \(\mathbf {n}(\mathbf {x})\), which has two degrees of freedom. Assuming moreover that the reflectance \(\rho (\mathbf {x})\) is known, the shape-from-shading technique [16] uses the shading \(\mathbf {s}(\mathbf {x}) \cdot \mathbf {n}(\mathbf {x})\) as the unique clue to recover the shape \(\mathbf {n}(\mathbf {x})\) from Eq. (1), but the problem is still ill-posed.

A classical way to make photometric 3D-reconstruction well-posed is to use \(m>1\) images taken using a single camera pose, but under varying known lighting:

$$\begin{aligned} I^i(\mathbf {p}) = \rho (\mathbf {x}) \, \mathbf {s}^i(\mathbf {x}) \cdot \mathbf {n}(\mathbf {x}), \quad i \in \{1,\ldots ,m\} \end{aligned}$$
(2)

In this variant of shape-from-shading called photometric stereo [40], the reflectance \(\rho (\mathbf {x})\) and the normal \(\mathbf {n}(\mathbf {x})\) can be estimated without any ambiguity, as soon as \(m\ge 3\) non-coplanar lighting vectors \(\mathbf {s}^i(\mathbf {x})\) are used.
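For concreteness, here is a minimal NumPy sketch (ours, not from the original paper; all names are illustrative) of the pixel-wise solve implied by (2): stack the m equations, recover the vector \(\rho (\mathbf {x}) \, \mathbf {n}(\mathbf {x})\) by least squares, then separate its norm (albedo) from its direction (normal).

```python
import numpy as np

def photometric_stereo_pixel(intensities, lightings):
    """Recover albedo and normal at one pixel from Eq. (2).

    intensities: (m,) graylevels I^i(p)
    lightings:   (m, 3) known lighting vectors s^i (m >= 3, non-coplanar)
    """
    # Solve lightings @ (rho * n) = intensities in the least-squares sense
    m_vec = np.linalg.lstsq(lightings, intensities, rcond=None)[0]
    rho = np.linalg.norm(m_vec)  # albedo = length of rho * n
    n = m_vec / rho              # unit-length outer normal
    return rho, n

# Fronto-parallel facet with albedo 0.8, under three non-coplanar lightings
s = np.array([[0.0, 0.0, 1.0], [0.7, 0.0, 0.7], [0.0, 0.7, 0.7]])
I = 0.8 * s @ np.array([0.0, 0.0, 1.0])
print(photometric_stereo_pixel(I, s))  # ~(0.8, array([0., 0., 1.]))
```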

Symmetrically to (2), solving the problem:

$$\begin{aligned} I^i(\mathbf {p}) = \rho (\mathbf {x}) \, \mathbf {s}(\mathbf {x}) \cdot \mathbf {n}^i(\mathbf {x}), \quad i \in \{1,\ldots ,m\} \end{aligned}$$
(3)

allows us to estimate the lighting \(\mathbf {s}(\mathbf {x})\), as soon as the reflectance \(\rho (\mathbf {x})\) and \(m \ge 3\) non-coplanar normals \(\mathbf {n}^i(\mathbf {x})\), \(i \in \{1,\ldots ,m\}\), are known. This can be carried out, for instance, by placing a small calibration pattern with known color and known shape near each 3D-point \(\mathbf {x}\) [32].

Fig. 2

Overview of our contribution. From a set of n images of a surface acquired from different angles, and a coarse geometry obtained for instance using multi-view stereo, we estimate a shading-free reflectance map per view (Color figure online)

The problem we aim at solving in this paper is slightly different. Suppose we are given a series of \(m>1\) images of a scene taken under a single lighting, but from m camera poses. According to Lambert’s law, this ensures that a 3D-point looks equally bright in all the images where it is visible. Such invariance is the basic clue of multi-view stereo (MVS), which has become a very popular technique for 3D-reconstruction [12]. Therefore, since an estimate of the surface shape is available, \(\mathbf {n}(\mathbf {x})\) is known. Now, we have to index the pixels by the image number i. Fortunately, additional data provided by MVS are the correspondences between the different views, taking the form of m-tuples of pixels \((\mathbf {p}^i)_{i \in \{1,\ldots ,m\}}\) which are conjugate to a common 3D-point \(\mathbf {x}\).

Our problem is written:

$$\begin{aligned} I^i(\mathbf {p}^i) = \rho (\mathbf {x}) \, \mathbf {s}(\mathbf {x}) \cdot \mathbf {n}(\mathbf {x}), \quad i \in \{1,\ldots ,m\} \end{aligned}$$
(4)

where \(\mathbf {p}^i\) is the projection of \(\mathbf {x}\) in the i-th image, and \(\rho (\mathbf {x})\) and \(\mathbf {s}(\mathbf {x})\) are unknown. Obviously, this system reduces to Eq. (1), since its m equations are one and the same: the right-hand side of (4) does not depend on i, and neither does the left-hand side \(I^i(\mathbf {p}^i)\) since, as already noticed, the lighting \(\mathbf {s}(\mathbf {x})\) does not vary from one image to another, and the surface is Lambertian.

Multi-view data helps in estimating the reflectance, because it provides the 3D-shape via MVS. However, even if \(\mathbf {n}(\mathbf {x})\) is known, Eq. (1) remains ill-posed. This is illustrated, in Fig. 1, by the solutions (b) and (c), which correspond to the same image (a) and to a common planar surface. In the absence of any prior, Eq. (1) has infinitely many solutions in \(\rho (\mathbf {x}) \, \mathbf {s}(\mathbf {x})\). In addition, determining \(\rho (\mathbf {x})\) from each of these solutions would give rise to another ambiguity, since \(\mathbf {s}(\mathbf {x})\) is not forced to be unit-length, contrary to \(\mathbf {n}(\mathbf {x})\).

Such a double source of ill-posedness probably explains why various methods for reflectance estimation have been designed, introducing a variety of priors in order to disambiguate the problem. Most of them assume that brightness variations induced by reflectance changes are likely to be strong but sparsely distributed, while the lighting is likely to induce smoother changes [21].

This suggests separating a single image into a piecewise-smooth layer and a more oscillating one. In the computer vision literature, this is often referred to as “intrinsic image decomposition,” while the terminology “cartoon + texture decomposition” is more frequently used by the mathematical imaging community. (Both these problems will be discussed in Sect. 2.)

Contributions In this work, we show the relevance of using multi-view images for reflectance estimation. Indeed, this enables a prior shape estimation using MVS, which essentially reduces the decomposition problem to the joint estimation of a set of reflectance maps, as illustrated in Fig. 2. We elaborate on the variational approach to multi-view decomposition into reflectance and shading, which we initially presented in [26]. The latter introduced a robust \(l^1\)-TV framework for the joint estimation of piecewise-smooth reflectance maps and of spherical harmonics lighting, with an additional term ensuring the consistency of the reflectance maps. The present paper extends this approach by developing the theoretical foundations of this variational model. In this perspective, our parameterization choices are further discussed and the underlying ambiguities are exhibited. The variational model is motivated by a Bayesian rationale, and the proposed numerical scheme is interpreted in terms of a majorization–minimization algorithm. Finally, we conclude that, besides a preliminary measurement of the incoming lighting, varying the lighting along with the viewing angle, in the spirit of photometric stereo, is the only way to estimate the reflectance without resorting to any prior.

Organization of the Paper After reviewing related approaches in Sect. 2, we formalize in Sect. 3 the problem of multi-view reflectance estimation. Section 4 then introduces a Bayesian-to-variational approach to this problem. A simple numerical strategy for solving the resulting variational problem, which is based on alternating majorization–minimization, is presented in Sect. 5. Experiments on both synthetic and real-world datasets are then conducted in Sect. 6, before summarizing our achievements and suggesting future research directions in Sect. 7.

2 Related Work

Studied since the 1970s [21], the decomposition of an image (or a set of images) into a piecewise-smooth component and an oscillatory one is a fundamental computer vision problem, which has been addressed in numerous ways.

Cartoon + Texture Decomposition Researchers in the field of mathematical imaging have suggested various variational models for this task, using for instance non-smooth regularization and Fourier-based frequency analysis [3], or \(l^1\)-TV variational models [23]. However, such techniques do not use an explicit photometric model for justifying the decomposition, whereas photometric analysis, which is another important branch of computer vision, may be a source of inspiration for motivating new variational models.

Photometric Stereo As discussed in the Introduction, photometric stereo techniques [40] are able to unambiguously estimate the reflectance and the geometry, by considering several images obtained from the same viewing angle but under calibrated, varying lighting. Photometric stereo has even been extended to the case of uncalibrated, varying lighting [5]. In the same spirit as uncalibrated photometric stereo, our goal is to estimate reflectance under unknown lighting. However, the problem is less constrained in our case, since we cannot ensure that the lighting is varying. Our hope is that this can be somewhat compensated for by the prior knowledge of geometry, and by resorting to appropriate priors. Various priors for reflectance have been discussed in the context of intrinsic image decomposition.

Intrinsic Image Decomposition Separating reflectance from shading in a single image is a challenging problem, often referred to as intrinsic image decomposition. Given the ill-posed nature of this problem, prior information on shape, reflectance and/or lighting must be introduced. Most of the existing works are based on the “retinex theory” [21], which states that most of the slight brightness variations in an image are due to lighting, while reflectance is piecewise-constant (as for instance a Mondrian image). A variety of clustering-based [13, 36] or sparsity-enhancing methods [14, 29, 36, 37] have been developed based on this theory. Among others, the work of Barron and Malik [4], which presents interesting results, relies on multiple priors to solve the fundamental ambiguity of shape-from-shading, an ambiguity that we aim at removing in the multi-view context. Some other methods disambiguate the problem by requiring the user to “brush” uniform reflectance parts [8, 29], or by resorting to a crowdsourced database [7]. Still, these works require user interactions, which may not be desirable in certain cases.

Multi-view 3D-reconstruction Instead of introducing possibly unverifiable priors, or relying on user interactions, ambiguities can be reduced by assuming that the geometry of the scene is known. Intrinsic image decomposition has for instance been addressed using an RGB-D camera [9] or, closer to our proposal, multiple views of the same scene under different angles [19, 20]. In the latter works, the geometry is first extracted from the multi-view images, before the problem of reflectance estimation is addressed. Geometry computation can be achieved using multi-view stereo (MVS). MVS techniques [35] have seen significant growth over the last decade, an expansion which goes hand in hand with the development of structure-from-motion (SfM) solutions [27]. Indeed, MVS requires the parameters of the cameras, which are outputs of the SfM algorithm. Nowadays, these mature methods are commonly used in uncontrolled environments, or even with large-scale Internet data [2]. For the sake of completeness, let us also mention recent efforts in the direction of multi-view, photometrically consistent 3D-reconstruction [17, 18, 22, 24, 25]. Similarly to these methods, we will resort to a compact representation of lighting, namely the spherical harmonics model.

Spherical Harmonics Lighting Model Let us consider a point \(\mathbf {x}\) lying on the surface \(\mathcal {S} \subset \mathbb {R}^3\) of the observed scene, and let \(\mathbf {n}(\mathbf {x})\) be the outer unit-length normal vector to \(\mathcal {S}\) in \(\mathbf {x}\). Let \(\mathcal {H}(\mathbf {x})\) be the hemisphere centered in \(\mathbf {x}\), having as basis plane the tangent plane to \(\mathcal {S}\) in \(\mathbf {x}\). Each light source visible from \(\mathbf {x}\) can be associated with a point \(\omega \) on \(\mathcal {H}(\mathbf {x})\). If we describe by the vector \(\mathbf {s}(\mathbf {x},\omega )\) the corresponding elementary light beam (oriented toward the source), then by definition of the reflectance (or BRDF) of the surface, denoted r, the luminance of \(\mathbf {x}\) in the direction \(\mathbf {v}\) is given by

$$\begin{aligned} L(\mathbf {x},\mathbf {v}) = \int _{\mathcal {H}(\mathbf {x})} r\left( \mathbf {x},\mathbf {n}(\mathbf {x}),\frac{{\mathbf {s}}(\mathbf {x},\omega )}{\Vert {\mathbf {s}}(\mathbf {x},\omega )\Vert },\mathbf {v}\right) \, [\mathbf {s}(\mathbf {x},\omega ) \cdot \mathbf {n}(\mathbf {x})] \,\mathrm {d}\omega , \end{aligned}$$
(5)

where \([\mathbf {s}(\mathbf {x},\omega ) \cdot \mathbf {n}(\mathbf {x})]\) is the surface illuminance. In general, r depends both on the direction of the light \({\mathbf {s}}(\mathbf {x},\omega )\) and on the viewing direction \(\mathbf {v}\), relative to \(\mathbf {n}(\mathbf {x})\).

This expression of the luminance is intractable in the general case. However, if we restrict our attention to Lambertian surfaces, the reflectance reduces to the albedo \(\rho (\mathbf {x})\), which is independent of any direction, and \(L(\mathbf {x},\mathbf {v})\) does not depend on the viewing direction \(\mathbf {v}\) anymore. If the light sources are further assumed to be distant enough from the object, then \(\mathbf {s}(\mathbf {x},\omega )\) is independent of \(\mathbf {x}\), i.e., the light beams are the same for the whole (supposedly convex) object, and thus, the lighting is completely defined on the unit sphere. Therefore, the integral (5) acts as a convolution on \(\mathcal {H}(\mathbf {x})\), having as kernel \(\mathbf {s}(\omega ) \cdot \mathbf {n}(\mathbf {x})\). Spherical harmonics, which can be considered as the analogue of Fourier series on the unit sphere, have been shown to be an efficient low-dimensional representation of this convolution [6, 33]. Many vision applications [18, 41] use second-order spherical harmonics, which can capture over \(99\%\) of the natural lighting [11] using only nine coefficients. This yields an approximation of the luminance of the form

$$\begin{aligned} L = \frac{\rho }{\pi } \, {\varvec{\sigma }} \cdot {\varvec{\nu }}, \end{aligned}$$
(6)

where \(\rho \in \mathbb {R}\) is the albedo (reflectance), \({\varvec{\sigma }} \in \mathbb {R}^9\) is a compact lighting representation, and \({\varvec{\nu }} \in \mathbb {R}^9\) stores the local geometric information. The latter is deduced from the normal according to:

$$\begin{aligned} {\varvec{\nu }} = \begin{bmatrix} \mathbf {n} \\ 1 \\ n_1\,n_2 \\ n_1\,n_3 \\ n_2\,n_3 \\ n_1^2-n_2^2 \\ 3n_3^2-1 \end{bmatrix}. \end{aligned}$$
(7)
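As an illustration, a minimal sketch (ours; the function name is hypothetical) of how the geometric vector of (7) is assembled from a unit normal, and how it enters the luminance model (8):

```python
import numpy as np

def sh_geometry_vector(n):
    """Geometric vector nu in R^9 of Eq. (7), from a unit-length normal n."""
    n1, n2, n3 = n
    return np.array([
        n1, n2, n3,                        # first-order terms: the normal itself
        1.0,                               # constant (ambient) term
        n1 * n2, n1 * n3, n2 * n3,         # second-order cross terms
        n1**2 - n2**2, 3.0 * n3**2 - 1.0,  # remaining second-order terms
    ])

# Luminance under the model (8): L = (rho / pi) * sigma . nu
rho, sigma = 0.8, np.random.randn(9)       # sigma: some lighting vector
L = rho / np.pi * sigma @ sh_geometry_vector(np.array([0.0, 0.0, 1.0]))
```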

In (6), the lighting vector \({\varvec{\sigma }}\) is the same at every point of the surface, but the reflectance \(\rho \) and the geometric vector \({\varvec{\nu }}\) vary along the surface \(\mathcal {S}\) of the observed scene. Hence, we will write (6) as:

$$\begin{aligned} L(\mathbf {x}) = \frac{\rho (\mathbf {x})}{\pi } \, {\varvec{\sigma }} \cdot {\varvec{\nu }}(\mathbf {x}),\quad \forall \mathbf {x} \in \mathcal {S}. \end{aligned}$$
(8)

Our aim in this paper is to estimate the reflectance \(\rho (\mathbf {x})\) at each point \(\mathbf {x} \in \mathcal {S}\), as well as the lighting vector \({\varvec{\sigma }}\), given a set of multi-view images and the geometric vector \({\varvec{\nu }}(\mathbf {x})\). We formalize this problem in the next section.

3 Multi-view Reflectance Estimation

In this section, we describe more carefully the problem of reflectance estimation from a set of multi-view images. First, we need to make explicit the relationship between graylevel, reflectance, lighting and geometry.

3.1 Image Formation Model

Let \(\mathbf {x} \in \mathcal {S}\) be a point on the surface of the scene. Assume that it is observed by a graylevel camera with linear response function and let \(I:\,\varOmega \subset \mathbb {R}^2 \rightarrow \mathbb {R}\) be the image, where \(\varOmega \) is the projection of \(\mathcal {S}\) onto the image plane. Then, the graylevel at the pixel \(\mathbf {p} \in \varOmega \) conjugate to \(\mathbf {x}\) is proportional to the luminance of \(\mathbf {x}\) in the direction of observation \(\mathbf {v}\):

$$\begin{aligned} I(\mathbf {p}) = \gamma \, L(\mathbf {x},\mathbf {v}), \end{aligned}$$
(9)

where the coefficient \(\gamma > 0\), referred to in the following as the “camera coefficient,” is unknown. By assuming Lambertian reflectance and the light sources distant enough from the object, Eqs. (8) and (9) yield:

$$\begin{aligned} I(\mathbf {p}) = \gamma \, \frac{\rho (\mathbf {x})}{\pi } \, {\varvec{\sigma }} \cdot {\varvec{\nu }}(\mathbf {x}). \end{aligned}$$
(10)

Now, let us assume that m images \(I^i\) of the surface, \(i \in \{1,\ldots ,m\}\), obtained while moving a single camera, are available, and discuss how to adapt (10).

Case 1: unknown, yet fixed lighting and camera coefficient If all the automatic settings of the camera are disabled, then the camera coefficient is independent of the view. We can thus incorporate this coefficient and the denominator \(\pi \) into the lighting vector: \({\varvec{\sigma }}:= \frac{\gamma }{\pi } \, {\varvec{\sigma }}\). Moreover, if the illumination is fixed, the lighting vector \({\varvec{\sigma }}\) is independent of the view. At any point \(\mathbf {x}\) which is visible in the i-th view, Eq. (10) becomes:

$$\begin{aligned} I^i(\pi ^i(\mathbf {x})) = \rho (\mathbf {x}) \, {\varvec{\sigma }} \cdot {\varvec{\nu }}(\mathbf {x}), \end{aligned}$$
(11)

where we denote by \(\pi ^i\) the 3D-to-2D projection associated with the i-th view. In (11), the unknowns are the reflectance \(\rho (\mathbf {x})\) and the lighting vector \({\varvec{\sigma }}\). Equations (11), \(i \in \{1,\ldots ,m\}\), constitute a generalization of (4) to more complex illumination scenarios. For the whole scene, this is a problem with \(n+9\) unknowns and up to nm equations, where n is the number of 3D-points \(\mathbf {x}\) which have been estimated by multi-view stereo. However, as for System (4), only n equations are linearly independent; hence, the problem of reflectance and lighting estimation is under-constrained.

Case 2: unknown and varying lighting and camera coefficient If lighting is varying, then we have to make the lighting vector view-dependent. If it is also assumed to vary, the camera coefficient can be integrated into the lighting vector along with the denominator \(\pi \), i.e., \({\varvec{\sigma }^i}:= \frac{\gamma ^i}{\pi } \, {\varvec{\sigma }^i}\), since the estimation of each \({\varvec{\sigma }^i}\) will include that of \(\gamma ^i\). Equation (10) then becomes:

$$\begin{aligned} I^i(\pi ^i(\mathbf {x})) = \rho (\mathbf {x}) \, {\varvec{\sigma }^i} \cdot {\varvec{\nu }}(\mathbf {x}). \end{aligned}$$
(12)

There are even more unknowns (\(n+9m\)), but this time the nm equations are linearly independent, at least as long as the \({\varvec{\sigma }^i}\) are not proportional, i.e., if not only the camera coefficient or the lighting intensity vary across the views, but also the lighting direction. Typically, n lies in the range \([10^3,10^6]\); hence, the problem is over-constrained as soon as at least two out of the m lighting vectors are non-collinear. This is a situation similar to uncalibrated photometric stereo [5], but much more favorable: The geometry is known; hence, the ambiguities arising in uncalibrated photometric stereo are likely to be reduced. However, contrary to uncalibrated photometric stereo, lighting is not actively controlled in our case. Lighting variations are likely to happen, e.g., in outdoor scenarios, yet they will be limited. The m lighting vectors \({\varvec{\sigma }}^i\), \(i \in \{1,\ldots ,m\}\), will thus be close to each other: Lighting variations will not be sufficient in practice for disambiguation (ill-conditioning).

Since (11) is under-constrained and (12) is ill-conditioned, additional information will have to be introduced in either case, and we can restrict our attention to the varying lighting case (12).

So far, we have assumed that graylevel images were available. To extend our study to RGB images, we abusively assume channel separation, and apply the framework independently in each channel \(\star \in \{R,G,B\}\). We then consider the expression:

$$\begin{aligned} I^i_\star (\pi ^i(\mathbf {x})) = \rho _\star (\mathbf {x}) \, {\varvec{\sigma }}_\star ^i \cdot {\varvec{\nu }}(\mathbf {x}) \end{aligned}$$
(13)

where \(\rho _\star (\mathbf {x})\) and \({\varvec{\sigma }}_\star ^i\) denote, respectively, the colored reflectance and the i-th colored lighting vector, relatively to the response of the camera in channel \(\star \). A more complete study of Model (13) is presented in [31].

Since we will apply the same framework independently in each color channel, we consider hereafter the graylevel case only, i.e., we consider the image formation model (12) instead of (13). The question which arises now is how to estimate the reflectance \(\rho (\mathbf {x})\) from a set of equations such as (12), when the geometry \({\varvec{\nu }}(\mathbf {x})\) is known but the lighting \({\varvec{\sigma }}^i\) is unknown.

3.2 Reflectance Estimation on the Surface

We place ourselves at the end of the multi-view 3D-reconstruction pipeline. Thus, the projections \(\pi ^i\) are known (in practice, they are estimated using SfM techniques), as well as the geometry, represented by a set of n 3D-points \(\mathbf {x}_j \in \mathbb {R}^3\), \(j \in \{1,\ldots ,n\}\), and the corresponding normals \(\mathbf {n}(\mathbf {x}_j)\) (obtained for instance using MVS techniques), from which the n geometric vectors \({\varvec{\nu }}_j:={\varvec{\nu }}(\mathbf {x}_j)\) are easily deduced according to (7).

The unknowns are then the n reflectance values \(\rho _j := \rho (\mathbf {x}_j) \in \mathbb {R}\) and the m lighting vectors \({\varvec{\sigma }}^i \in \mathbb {R}^9\), which are independent of the 3D-point index j due to the distant light assumption. At first glance, one may think that their estimation can be carried out by simultaneously solving (12) at all the 3D-points \(\mathbf {x}_j\), in a purely data-driven manner, using some fitting function \(F:\,\mathbb {R} \rightarrow \mathbb {R}\):

$$\begin{aligned} \min _{\begin{array}{c} \{\rho _j \in \mathbb {R}\}_j\\ \{ {\varvec{\sigma }^i} \in \mathbb {R}^9\}_i \end{array}} \sum _{i=1}^m \sum _{j=1}^n v^i_j \, F\left( \rho _j \, {\varvec{\sigma }}^i \cdot {\varvec{\nu }}_j - I^i_j \right) , \end{aligned}$$
(14)

where we denote \(I^i_j = I^i(\pi ^i(\mathbf {x}_j))\), and \(v^i_j\) is a visibility boolean such that \(v^i_j = 1\) if \(\mathbf {x}_j\) is visible in the i-th image, and \(v^i_j = 0\) otherwise.

Let us consider, for the sake of pedagogy, the simplest case of least-squares fitting (\(F(x) = x^2\)) and perfect visibility (\(v^i_j \equiv 1\)). Then, Problem (14) is rewritten in matrix form:

$$\begin{aligned} \min _{\begin{array}{c} {\varvec{\rho }} \in \mathbb {R}^n\\ \mathbf {S} \in \mathbb {R}^{9 \times m} \end{array}} \left\| \mathbf {N} \left( {\varvec{\rho }} \otimes \mathbf {S}\right) - \mathbf {I} \right\| _F^2, \end{aligned}$$
(15)

where the Kronecker product \({\varvec{\rho }} \otimes \mathbf {S}\) is a matrix of \(\mathbb {R}^{9n \times m}\), \({\varvec{\rho }}\) being a vector of \(\mathbb {R}^n\) which stores the n unknown reflectance values, and \(\mathbf {S}\) a matrix of \(\mathbb {R}^{9 \times m}\) which stores the m unknown lighting vectors \({\varvec{\sigma }}^i \in \mathbb {R}^{9}\) column-wise; \(\mathbf {N} \in \mathbb {R}^{n \times 9n}\) is a block-diagonal matrix whose j-th block, \(j \in \{1,\ldots ,n\}\), is the row vector \({\varvec{\nu }}_j^\top \); matrix \(\mathbf {I} \in \mathbb {R}^{n \times m}\) stores the graylevels; and \(\Vert \cdot \Vert _F\) is the Frobenius norm.

Using the pseudo-inverse \(\mathbf {N}^\dagger \) of \(\mathbf {N}\), (15) is rewritten:

$$\begin{aligned} \min _{\begin{array}{c} {\varvec{\rho }} \in \mathbb {R}^n\\ \mathbf {S} \in \mathbb {R}^{9 \times m} \end{array}} \left\| {\varvec{\rho }} \otimes \mathbf {S} - \mathbf {N}^\dagger \, \mathbf {I} \right\| _F^2. \end{aligned}$$
(16)

Problem (16) is a nearest Kronecker product problem, which can be solved by singular value decomposition (SVD) [15, Theorem 12.3.1].
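A sketch (ours) of this SVD solution, using Van Loan's rearrangement: reshaping \(\mathbf {M} := \mathbf {N}^\dagger \mathbf {I}\) so that each \(9 \times m\) block becomes a row turns \({\varvec{\rho }} \otimes \mathbf {S}\) into a rank-1 matrix, whose best approximation is given by the dominant singular triplet. Note that the scale split between \({\varvec{\rho }}\) and \(\mathbf {S}\) is ambiguous, consistently with the ambiguities discussed in the Introduction.

```python
import numpy as np

def nearest_kronecker(M, n, m):
    """Best rho (n,) and S (9, m) with rho (x) S ~ M (9n, m) in Frobenius norm.

    Van Loan's rearrangement: row j of R is the (row-major) vectorization of
    the j-th 9 x m block of M, so that rho (x) S rearranges into the rank-1
    matrix rho vec(S)^T; the optimum is the dominant singular triplet of R.
    """
    R = M.reshape(n, 9 * m)
    U, sv, Vt = np.linalg.svd(R, full_matrices=False)
    rho = sv[0] * U[:, 0]      # the rho <-> S scale split is ambiguous
    S = Vt[0].reshape(9, m)
    return rho, S

# Sanity check on noiseless synthetic data
n, m = 50, 13
rho_true, S_true = np.random.rand(n), np.random.randn(9, m)
M = np.kron(rho_true[:, None], S_true)
rho_hat, S_hat = nearest_kronecker(M, n, m)
print(np.allclose(np.kron(rho_hat[:, None], S_hat), M))  # True
```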

However, this matrix factorization approach suffers from three shortcomings:

(1)

    It is valid only if all 3D-points are visible under all the viewing angles, which is rather unrealistic. In practice, (15) should be replaced by

    $$\begin{aligned} \min _{\begin{array}{c} {\varvec{\rho }} \in \mathbb {R}^n\\ \mathbf {S} \in \mathbb {R}^{9 \times m} \end{array}} \left\| \mathbf {V} \circ \left[ \mathbf {N} \left( {\varvec{\rho }} \otimes \mathbf {S}\right) - \mathbf {I} \right] \right\| _F^2, \end{aligned}$$
    (17)

    where \(\mathbf {V} \in \{0,1\}^{n \times m}\) is a visibility matrix containing the values \(v^i_j\), and \(\circ \) is the Hadamard product. This yields a Kronecker product problem with missing data, which is much more arduous to solve.

(2)

    It is adapted only to least-squares estimation. Considering a more robust fitting function would prevent a direct SVD solution.

(3)

    If lighting is not varying (\({\varvec{\sigma }}^i = {\varvec{\sigma }} , \forall i \in \{1,\ldots ,m\}\)), then it can be verified that (15) is ill-posed. Among its many solutions, the following trivial one can be exhibited:

    $$\begin{aligned} \mathbf {S}_{\text {trivial}}&= {\varvec{\sigma }}_{\text {diffuse}} \, \mathbf {1}_{1 \times m}, \end{aligned}$$
    (18)
    $$\begin{aligned} {\varvec{\rho }}_{\text {trivial}}&= \left[ E_i[I^i_1],\ldots ,E_i[I^i_n] \right] ^\top , \end{aligned}$$
    (19)

    where:

    $$\begin{aligned}&{\varvec{\sigma }}_{\text {diffuse}} = \left[ 0,0,0,1,0,0,0,0,0\right] ^\top \end{aligned}$$
    (20)

    and \(E_i\) is the mean over the view indices i. This trivial solution means that the lighting is assumed to be completely diffuse, and that the reflectance is equal to the image graylevel, up to noise only. Obviously, this is not an acceptable interpretation. As discussed in the previous subsection, in real-world scenarios we will be very close to this degenerate case; hence, additional regularization will have to be introduced, which makes things even harder (see the numerical sketch after this list).
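Indeed, under fixed lighting, the trivial pair (18)–(19) reproduces the images exactly, since \({\varvec{\sigma }}_{\text {diffuse}} \cdot {\varvec{\nu }}_j = 1\) (the constant entry of \({\varvec{\nu }}_j\)). A minimal numerical check (ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 13
nu = rng.standard_normal((n, 9))
nu[:, 3] = 1.0                                # constant entry of nu, cf. Eq. (7)
rho = rng.random(n)
sigma = rng.standard_normal(9)                # fixed lighting, shared by all views
I = np.outer(rho * (nu @ sigma), np.ones(m))  # same graylevels in every view

sigma_diffuse = np.zeros(9); sigma_diffuse[3] = 1.0   # Eq. (20)
rho_trivial = I.mean(axis=1)                          # Eq. (19): mean over views
I_trivial = np.outer(rho_trivial * (nu @ sigma_diffuse), np.ones(m))
print(np.allclose(I_trivial, I))              # True: lighting is "baked in"
```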

Overall, the optimization problem which needs to be addressed is not as easy as (16). It is a non-quadratic regularized problem of the form:

$$\begin{aligned}&\min _{\begin{array}{c} \{\rho _j \in \mathbb {R}\}_j\\ \{ {\varvec{\sigma }^i} \in \mathbb {R}^9\}_i \end{array}} \sum _{i=1}^m \sum _{j=1}^n v^i_j \, F\left( \rho _j \, {\varvec{\sigma }}^i \cdot {\varvec{\nu }}_j - I^i_j \right) \nonumber \\&\quad + \sum _{j=1}^n \sum _{k \vert \mathbf {x}_k \in \mathcal {V}(\mathbf {x}_j)} R(\rho _j,\rho _k), \end{aligned}$$
(21)

where \(\mathcal {V}(\mathbf {x}_j)\) is the set of neighbors of \(\mathbf {x}_j\) on surface \(\mathcal {S}\), and the regularization function R needs to be chosen appropriately to ensure piecewise-smoothness.

However, the sampling of the points \(\mathbf {x}_j\) on surface \(\mathcal {S}\) is usually non-uniform, because the shape of \(\mathcal {S}\) is potentially complex. It may thus be difficult to design appropriate fidelity and regularization functions F and R, and an appropriate numerical solver. In addition, some thin brightness variations may be missed if the sampling is not dense enough. Overall, direct estimation of reflectance on the surface looks promising at first sight, but turns out rather tricky in practice. Therefore, we leave this as an interesting future research direction and follow in this paper a simpler approach, which consists in estimating reflectance in the image domain.

3.3 Reflectance Estimation in the Image Domain

Instead of trying to colorize the n 3D-points estimated by MVS, i.e., of parameterizing the reflectance over the (3D) surface \(\mathcal {S}\), we can also formulate the reflectance estimation problem in the (2D) image domain.

Equation (12) is equivalently written, at each pixel \(\mathbf {p} := \pi ^i(\mathbf {x}) \in \varOmega ^i:= \pi ^i(\mathcal {S})\):

$$\begin{aligned} I^i(\mathbf {p}) = \rho ^i(\mathbf {p}) \, {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i(\mathbf {p}), \end{aligned}$$
(22)

where we denote \(\rho ^i(\mathbf {p}) := \rho ({\pi ^i}^{-1}(\mathbf {p}))\) and \({\varvec{\nu }}^i(\mathbf {p}) := {\varvec{\nu }}({\pi ^i}^{-1}(\mathbf {p}))\). Instead of estimating one reflectance value \(\rho (\mathbf {x})\) per estimated 3D-point, the reflectance estimation problem is thus turned into the estimation of m “reflectance maps”

$$\begin{aligned} \rho ^i:\,\varOmega ^i \subset \mathbb {R}^2 \rightarrow \mathbb {R}. \end{aligned}$$
(23)

On the one hand, the 2D-parameterization (23) does not enforce the consistency of the reflectance maps. This will have to be explicitly enforced later on. Besides, the surface will not be directly colorized, but the estimated reflectance maps have to be back-projected and fused over the surface in a final step.

On the other hand, the question of occlusions (visibility) does not arise, and the domains \(\varOmega ^i\) are subsets of a uniform square 2D-grid. Therefore, it will be much easier to design appropriate fidelity and regularization terms. Besides, there will be as many reflectance estimates as pixels in those sets: With modern HD cameras, this number is much larger than the number of 3D-points estimated by multi-view stereo. Estimation will thus be much denser.

With such a parameterization choice, the regularized problem (21) will be turned into:

$$\begin{aligned}&\min _{\begin{array}{c} \{\rho ^i:\,\varOmega ^i \rightarrow \mathbb {R}\}_i\\ \{ {\varvec{\sigma }^i} \in \mathbb {R}^9\}_i \end{array}} \sum _{i=1}^m \sum _{\mathbf {p} \in \varOmega ^i} F\left( \rho ^i(\mathbf {p}) \, {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i(\mathbf {p}) - I^i(\mathbf {p}) \right) \nonumber \\&\qquad + \sum _{i=1}^m \sum _{\mathbf {p} \in \varOmega ^i} \sum _{ \mathbf {q} \in \mathcal {V}^i(\mathbf {p})} R(\rho ^i(\mathbf {p}),\rho ^i(\mathbf {q})) \nonumber \\&\qquad \text {s.t. } C(\{\rho ^i\}_i) = 0, \end{aligned}$$
(24)

with C some function to ensure multi-view consistency, and where \(\mathcal {V}^i(\mathbf {p})\) is the set of neighbors of pixel \(\mathbf {p}\) which lie inside \(\varOmega ^i\). Note that since \(\varOmega ^i\) is a subset of a square, regular 2D-grid, this neighborhood is much easier to handle than that appearing in (21).

In the next section, we discuss appropriate choices for F, R and C in (24), by resorting to a Bayesian rationale.

4 A Bayesian-to-variational Framework for Multi-view Reflectance Estimation

Following Mumford’s Bayesian rationale for the variational formulation [28], let us now introduce a Bayesian-to-variational framework for estimating reflectance and lighting from multi-view images.

4.1 Bayesian Inference

Our problem consists in estimating the m reflectance maps \(\rho ^i:\,\varOmega ^i \rightarrow \mathbb {R}\) and the m lighting vectors \({\varvec{\sigma }}^i \in \mathbb {R}^9\), given the m images \(I^i:\,\varOmega ^i \rightarrow \mathbb {R}\), \(i \in \{1,\ldots ,m\}\). As we already stated, a maximum likelihood approach is hopeless, because a trivial solution arises. We rather resort to Bayesian inference, estimating \((\{\rho ^i\}_i,\{\varvec{\sigma }^i\}_i)\) as the maximum a posteriori (MAP) of the distribution

$$\begin{aligned}&\mathcal {P}(\{\rho ^i\}_i,\{\varvec{\sigma }^i\}_i \vert \{I^i\}_i) \nonumber \\&\qquad = \frac{\mathcal {P}(\{I^i\}_i \vert \{\rho ^i\}_i,\{\varvec{\sigma }^i\}_i ) \, \mathcal {P}(\{\rho ^i\}_i,\{\varvec{\sigma }^i\}_i)}{\mathcal {P}(\{I^i\}_i)}, \end{aligned}$$
(25)

where the denominator is the evidence, which can be discarded since it depends neither on the reflectance nor on the lighting, and the factors in the numerator are the likelihood and the prior, respectively.

Likelihood The image formation model (22) is never strictly satisfied in practice, due to noise, cast-shadows and possibly slightly specular surfaces. We assume that such deviations from the model can be represented as independent (with respect to pixels and views) Laplace laws with zero mean and scale parameter \(\alpha \):

$$\begin{aligned}&\mathcal {P}(\{I^i\}_i \vert \{\rho ^i\}_i,\{\varvec{\sigma }^i\}_i ) \nonumber \\&\quad = \prod _{i=1}^m \left( \frac{1}{2 \alpha }\right) ^{|\varOmega ^i|} \exp \left\{ - \frac{1}{\alpha } \left\| \rho ^i \, {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i - I^i \right\| _{i,1} \right\} \nonumber \\&\quad = \left( \frac{1}{2 \alpha }\right) ^{\sum _{i=1}^m |\varOmega ^i|} \exp \left\{ - \frac{1}{\alpha } \sum _{i=1}^m \left\| \rho ^i \, {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i - I^i \right\| _{i,1} \right\} \end{aligned}$$
(26)

where \(\Vert \cdot \Vert _{i,p}\), \(p \ge 0\), is the \(\ell ^p\)-norm over \(\varOmega ^i\) and \(|\varOmega ^i|\) is the cardinality of \(\varOmega ^i\).

Prior Since the reflectance maps \(\{\rho ^i\}_i\) are independent of the lighting vectors \(\{\varvec{\sigma }^i\}_i\), the prior can be factorized as \( \mathcal {P}(\{\rho ^i\}_i,\{\varvec{\sigma }^i\}_i) = \mathcal {P}(\{\rho ^i\}_i) \, \mathcal {P}(\{\varvec{\sigma }^i\}_i)\). Since the lighting vectors are independent of each other, their prior distribution factorizes as \(\mathcal {P}(\{\varvec{\sigma }^i\}_i) = \prod _{i=1}^m \mathcal {P}(\varvec{\sigma }^i)\). As each lighting vector is unconstrained, we can consider the same uniform distribution, i.e., \(\mathcal {P}(\varvec{\sigma }^i) = \tau \), independently of the view index i. This distribution being independent of the unknowns, we can discard the lighting prior from the inference process. Regarding the reflectance maps, we follow the retinex theory [21], and consider each of them as piecewise-constant. The natural prior for each such map is thus the Potts model:

$$\begin{aligned} \mathcal {P}(\rho ^i) = K^i \exp \left\{ - \frac{1}{\beta ^i} \left\| \nabla \rho ^i \right\| _{i,0} \right\} \end{aligned}$$
(27)

where \(\nabla \rho ^i(\mathbf {p}) = \left[ \partial _x \rho ^i(\mathbf {p}),\partial _y \rho ^i(\mathbf {p}) \right] ^\top \) represents the gradient of \(\rho ^i\) at pixel \(\mathbf {p}\) (approximated, in practice, using first-order forward stencils with a Neumann boundary condition), and with \(K^i\) a normalization coefficient and \(\beta ^i\) a scale parameter. Note that we use the abusive \(\ell ^0\)-norm notation \(\Vert \nabla \rho ^i\Vert _{i,0}\) to denote:

$$\begin{aligned} \left\| \nabla \rho ^i \right\| _{i,0} = \sum _{\mathbf {p} \in \varOmega ^i} \sum _{\mathbf {q} \in \mathcal {V}^i(\mathbf {p})} f\left( \rho ^i(\mathbf {p}) - \rho ^i(\mathbf {q}) \right) \end{aligned}$$
(28)

with \(f(x) = 1\) if \(x\ne 0\), and \(f(x) = 0\) otherwise.

The m reflectance maps are obviously not independent: The reflectance, which characterizes the surface, should be independent of the view. It follows that the parameters \((K^i,\beta ^i)\) are the same for each Potts model (27) and that the reflectance prior \(\mathcal {P}(\{\rho ^i\}_i)\) can be taken as the product of m independent distributions with the same parameters \((K,\beta )\):

$$\begin{aligned} \mathcal {P}(\{\rho ^i\}_i) = K^m \exp \left\{ - \frac{1}{\beta } \sum _{i=1}^m \left\| \nabla \rho ^i \right\| _{i,0} \right\} \end{aligned}$$
(29)

but only if the coupling between the reflectance maps is enforced by the following linear constraint:

$$\begin{aligned} C^{i,j}(\rho ^i-\rho ^j) = 0,~\forall (i,j) \in \{1,\ldots ,m\}^2, \end{aligned}$$
(30)

where \(C^{i,j}\) is a \(\varOmega ^i \times \varOmega ^j \rightarrow \{0,1\}\) “correspondence function,” which is easily created from the (known) projection functions \(\{\pi ^i\}_i\) and the geometry, and which is defined as follows:

$$\begin{aligned} C^{i,j}(\mathbf {p}^i,\mathbf {p}^j) = {\left\{ \begin{array}{ll} 1 &{} \text {if pixels }\mathbf {p}^i\text { and }\mathbf {p}^j\text { correspond} \\ &{} \text { to the same surface point}, \\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(31)
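In practice, the correspondence function need not be stored densely: it can be sampled by projecting the MVS points into pairs of views. A minimal sketch (ours), assuming for illustration simple pinhole cameras given as \(3 \times 4\) matrices; in a real pipeline, the projections and the visibility information come from SfM/MVS.

```python
import numpy as np

def project(P, X):
    """Pinhole projection of 3D points X (k, 3) with a 3 x 4 camera matrix P."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:]                # 2D pixel coordinates

def correspondences(P_i, P_j, points, vis_i, vis_j):
    """Pixel pairs (p^i, p^j) conjugate to the same 3D point, i.e., the pairs
    where the correspondence function C^{i,j} of Eq. (31) equals 1.

    points: (k, 3) surface points from MVS; vis_i, vis_j: (k,) visibility booleans.
    """
    both = vis_i & vis_j
    p_i = np.round(project(P_i, points[both])).astype(int)
    p_j = np.round(project(P_j, points[both])).astype(int)
    return list(zip(map(tuple, p_i), map(tuple, p_j)))
```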

Since maximizing the MAP probability (25) is equivalent to minimizing its negative logarithm, we eventually obtain the following constrained variational problem, which makes explicit the functions F, R and C in (24):

$$\begin{aligned}&\min _{\begin{array}{c} \{\rho ^i:\,\varOmega ^i \rightarrow \mathbb {R} \}_i \\ \{{\varvec{\sigma }^i} \in \mathbb {R}^9 \}_i \end{array}} \sum _{i=1}^m \left\| \rho ^i \, {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i - I^i \right\| _{i,1} + \lambda \sum _{i=1}^m \left\| \nabla \rho ^i \right\| _{i,0} \nonumber \\&\quad \text {s.t.}\quad C^{i,j}(\rho ^i-\rho ^j) = 0,~\forall (i,j) \in \{1,\ldots ,m\}^2, \end{aligned}$$
(32)

where \(\lambda = \alpha / \beta \) and where we neglect all the normalization coefficients.

4.2 Relationship with Cartoon + Texture Decomposition

Applying a logarithm transformation to both sides of (22), we obtain:

$$\begin{aligned} \tilde{I}^i(\mathbf {p}) = \tilde{\rho }^i(\mathbf {p}) + \log \left( {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i (\mathbf {p})\right) , \end{aligned}$$
(33)

where the tilde notation is used as a shortcut for the logarithm.

By applying the exact same Bayesian-to-variational rationale, we would end up with the following variational problem:

$$\begin{aligned}&\min _{\begin{array}{c} \{\tilde{\rho }^i:\,\varOmega ^i \rightarrow \mathbb {R} \}_i \\ \{{\varvec{\sigma }^i} \in \mathbb {R}^9 \}_i \end{array}} \sum _{i=1}^m \left\| \tilde{\rho }^i +\log \left( {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i\right) - \tilde{I}^i \right\| _{i,1} + \lambda \sum _{i=1}^m \left\| \nabla \tilde{\rho }^i \right\| _{i,0} \nonumber \\&\quad \text {s.t.}\quad C^{i,j}(\tilde{\rho }^i-\tilde{\rho }^j) = 0,~\forall (i,j) \in \{1,\ldots ,m\}^2, \end{aligned}$$
(34)

The variational problem (34) can be interpreted as a multi-view cartoon + texture decomposition problem, where each log-image \(\tilde{I}^i\) is decomposed into a component \(C^i := \tilde{\rho }^i\) which is piecewise-smooth (“cartoon,” here the log-reflectance), and a component \(T^i := \log \left( {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i\right) \) which contains higher-frequency details (“texture,” here the log-shading). In contrast with conventional methods for such a task, the present one uses an explicit shading model for the texture term.

Note, however, that such a decomposition is justified only if the \(\log \)-images \(\tilde{I}^i\) are considered. If using the original images \(I^i\), our framework should rather be considered as a multi-view cartoon “\({\varvec{\times }}\)” texture decomposition framework.

4.3 Bi-convex Relaxation of the Variational Model (32)

Problem (32) is non-convex (due to the \(\ell ^0\)-regularizers) and non-smooth (due to the \(\ell ^0\)-regularizers and to the \(\ell ^1\)-fidelity term). Although some efforts have recently been devoted to the resolution of optimization problems involving \(\ell ^0\)-regularizers [38], we prefer to keep the optimization simple and approximate these regularizers by (convex, but non-smooth) anisotropic total variation terms:

$$\begin{aligned} \sum _{i=1}^m \left\| \nabla \rho ^i \right\| _{i,0} \approx \sum _{i=1}^m \left\| \nabla \rho ^i \right\| _{i,1}. \end{aligned}$$
(35)

Besides, the correspondence function may be slightly inaccurate in practice, due to errors in the prior geometry estimation obtained via multi-view stereo. Therefore, we turn the linear constraint in (32) into an additional term. Eventually, we replace the non-differentiable absolute values arising from the \(\ell ^1\)-norms by the (differentiable) Moreau envelope, i.e., the Huber loss:

$$\begin{aligned} |x |\approx \phi _\delta (x) := {\left\{ \begin{array}{ll} \dfrac{x^2}{2 \, \delta }, &{} |x |\le \delta \\ |x |- \dfrac{\delta }{2}, &{} |x |> \delta \end{array}\right. } \end{aligned}$$
(36)
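For illustration, a short sketch (ours) of the Huber loss (36), and of the smoothed anisotropic total variation term of (37) built from it, using the first-order forward differences and Neumann boundary condition mentioned in Sect. 4.1:

```python
import numpy as np

def huber(x, delta):
    """Moreau envelope of |.|, i.e., the Huber loss of Eq. (36)."""
    ax = np.abs(x)
    return np.where(ax <= delta, x**2 / (2 * delta), ax - delta / 2)

def smoothed_anisotropic_tv(rho, delta):
    """Second term of (37) for one reflectance map: Huber loss applied to
    forward differences, with Neumann boundary condition (zero at the border)."""
    dx = np.diff(rho, axis=1, append=rho[:, -1:])
    dy = np.diff(rho, axis=0, append=rho[-1:, :])
    return np.sum(huber(dx, delta) + huber(dy, delta))
```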

Altogether, this yields the following smooth, bi-convex variational problem:

$$\begin{aligned}&\min _{\begin{array}{c} \rho := \{\rho ^i:\,\varOmega ^i \rightarrow \mathbb {R} \}_i \\ \varvec{\sigma }:=\{{\varvec{\sigma }^i} \in \mathbb {R}^9 \}_i \end{array}} \varepsilon (\rho ,{\varvec{\sigma }}) := \sum _{i=1}^m \sum _{\mathbf {p} \in \varOmega ^i} \phi _\delta \left( \rho ^i(\mathbf {p}) \, {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i(\mathbf {p}) - I^i(\mathbf {p}) \right) \nonumber \\&\quad + \lambda \sum _{i=1}^m \sum _{\mathbf {p} \in \varOmega ^i} \left[ \phi _\delta \left( \partial _x \rho ^i(\mathbf {p}) \right) + \phi _\delta \left( \partial _y \rho ^i(\mathbf {p}) \right) \right] \nonumber \\&\quad + \mu \mathop {\sum \sum }_{1 \le i<j \le m} \sum _{\mathbf {p}^i \in \varOmega ^i} \sum _{\mathbf {p}^j \in \varOmega ^j} C_{i,j}(\mathbf {p}^i,\mathbf {p}^j) \, \phi _\delta \left( \rho ^i(\mathbf {p}^i) - \rho ^j(\mathbf {p}^j) \right) . \nonumber \\ \end{aligned}$$
(37)

In Eq. (37), the first term ensures photometric consistency (in the sense of the Huber loss function), the second one ensures reflectance smoothness (smoothed anisotropic total variation), and the third term ensures multi-view consistency of the reflectance estimates (again, in the sense of the Huber loss function). Lastly, \(\lambda \) and \(\mu \) are tunable hyper-parameters controlling the reflectance smoothness and the multi-view consistency, respectively.

5 Alternating Majorization–minimization for Solving (37)

To solve (37), we propose an alternating majorization–minimization method, which combines alternating and majorization–minimization optimization techniques. As sketched in Fig. 3, this algorithm works as follows. Given an estimate \((\rho ^{(k)},{\varvec{\sigma }}^{(k)})\) of the solution at iteration (k), the reflectance maps and the lighting vectors are successively updated according to:

$$\begin{aligned} \rho ^{(k+1)}&= \underset{\rho }{{\text {argmin~}}} \varepsilon _\rho ^{(k)}(\rho ), \end{aligned}$$
(38)
$$\begin{aligned} \varvec{\sigma }^{(k+1)}&= \underset{\varvec{\sigma }}{{\text {argmin~}}} \varepsilon _{\varvec{\sigma }}^{(k)}({\varvec{\sigma }}), \end{aligned}$$
(39)

where \(\varepsilon _\rho ^{(k)}\) and \(\varepsilon _{\varvec{\sigma }}^{(k)}\) are local quadratic majorants of \(\varepsilon (\cdot ,{\varvec{\sigma }^{(k)}})\) and \(\varepsilon (\rho ^{(k+1)},\cdot )\) around, respectively, \(\rho ^{(k)}\) and \(\varvec{\sigma }^{(k)}\). Then, the process is repeated until convergence.

Fig. 3

Sketch of the proposed alternating majorization–minimization solution. The partially frozen energies \(\varepsilon (\cdot ,\varvec{\sigma })\) and \(\varepsilon (\rho ,\cdot )\) are locally majorized by the quadratic functions \(\varepsilon _{\rho }\) (in red) and \(\varepsilon _{\varvec{\sigma }}\) (in blue). Then, these quadratic majorants are (globally) minimized, and the process is repeated until convergence is reached (Color figure online)

To this end, let us first remark that the function

$$\begin{aligned} \psi _\delta (x;x_0) = {\left\{ \begin{array}{ll} \dfrac{ x^2}{2 \, \delta }, &{} \quad |x_0 |\le \delta , \\ \dfrac{ x^2}{2 \, |x_0 |} + \dfrac{|x_0 |}{2} - \dfrac{\delta }{2}, &{} \quad |x_0 |> \delta , \end{array}\right. } \end{aligned}$$
(40)

is such that \(\psi _\delta (x_0;x_0) = \phi _\delta (x_0)\), and is a proper local quadratic majorant of \(\phi _\delta \) around \(x_0\), \(\forall x_0 \in \mathbb {R}\). This is easily verified if \(|x_0 |\le \delta \), from the definition (36) of \(\phi _\delta \). If \(|x_0 |> \delta \), the difference \(\psi _\delta (x;x_0) - \phi _\delta (x)\) reads:

$$\begin{aligned} {\left\{ \begin{array}{ll} \dfrac{\left( |x_0 |-\delta \right) \left( |x_0 |\, \delta - x^2 \right) }{2 \, |x_0 |\, \delta }, &{} \quad |x |\le \delta , \\ \dfrac{(|x |-|x_0 |)^2}{2 \, |x_0 |}, &{} \quad |x |> \delta , \end{array}\right. } \end{aligned}$$
(41)

which is nonnegative in any case.
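Up to an additive constant, \(\psi _\delta (x;x_0)\) equals \(w(x_0)\,x^2/2\) with \(w(x_0) = 1/\max (|x_0 |,\delta )\): this is the classical IRLS weight, which is why each minimization step below reduces to a (re)weighted least-squares problem. A minimal sketch (ours):

```python
import numpy as np

def mm_weight(x0, delta):
    """Curvature of the majorant (40): psi(x; x0) = w(x0) x^2 / 2 + const."""
    return 1.0 / np.maximum(np.abs(x0), delta)

def psi(x, x0, delta):
    """Local quadratic majorant of the Huber loss around x0, Eq. (40)."""
    const = np.where(np.abs(x0) <= delta, 0.0, np.abs(x0) / 2 - delta / 2)
    return mm_weight(x0, delta) * x**2 / 2 + const
```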

Therefore, the function

$$\begin{aligned} \varepsilon _\rho ^{(k)}(\rho )&:= \sum _{i=1}^m \sum _{\mathbf {p} \in \varOmega ^i} \psi _\delta \left( \rho ^{i}(\mathbf {p}) \, {\varvec{\sigma }}^{i,(k)} \cdot {\varvec{\nu }}^i(\mathbf {p}) - I^i(\mathbf {p}); r^{i,(k),(k)} \right) \nonumber \\&\quad + \lambda \sum _{i=1}^m \sum _{\mathbf {p} \in \varOmega ^i} \left[ \psi _\delta \left( \partial _x \rho ^i(\mathbf {p}) ; \partial _x \rho ^{i,(k)}(\mathbf {p}) \right) \right. \nonumber \\&\quad \qquad \quad \qquad \,\, \left. + \psi _\delta \left( \partial _y \rho ^i(\mathbf {p}); \partial _y \rho ^{i,(k)}(\mathbf {p}) \right) \right] \nonumber \\&\quad + \mu \mathop {\sum \sum }_{1\le i<j \le m} \, \sum _{\mathbf {p}^i \in \varOmega ^i} \, \sum _{\mathbf {p}^j \in \varOmega ^j} C_{i,j}(\mathbf {p}^i,\mathbf {p}^j) \nonumber \\&\qquad \psi _\delta \left( \rho ^i(\mathbf {p}^i) - \rho ^j(\mathbf {p}^j); \rho ^{i,(k)}(\mathbf {p}^i) - \rho ^{j,(k)}(\mathbf {p}^j)\right) , \end{aligned}$$
(42)

with

$$\begin{aligned} r^{i,(k_1),(k_2)} = \rho ^{i,(k_1)}(\mathbf {p}) \, {\varvec{\sigma }}^{i,(k_2)} \cdot {\varvec{\nu }}^i(\mathbf {p}) - I^i(\mathbf {p}), \end{aligned}$$
(43)

is a local quadratic majorant of \(\varepsilon (\cdot ,{\varvec{\sigma }^{(k)}})\) around \(\rho ^{(k)}\) which is suitable for the update (38).

Similarly, the function

$$\begin{aligned} \varepsilon _\sigma ^{(k)}({\varvec{\sigma }})&:=\sum _{i=1}^m \sum _{\mathbf {p} \in \varOmega ^i} \psi _\delta \left( \rho ^{i,(k+1)}(\mathbf {p}) {\varvec{\sigma }}^i \cdot {\varvec{\nu }}^i(\mathbf {p}) -I^i(\mathbf {p}) ; r^{i,(k+1),(k)} \right) \nonumber \\&\quad + \lambda \sum _{i=1}^m \sum _{\mathbf {p} \in \varOmega ^i} \left[ \phi _\delta \left( \partial _x \rho ^{i,(k+1)}(\mathbf {p}) \right) + \phi _\delta \left( \partial _y \rho ^{i,(k+1)}(\mathbf {p}) \right) \right] \nonumber \\&\quad + \mu \mathop {\sum \sum }_{1\le i<j \le m} \sum _{\mathbf {p}^i \in \varOmega ^i} \sum _{\mathbf {p}^j \in \varOmega ^j} \Big [ C_{i,j}(\mathbf {p}^i,\mathbf {p}^j) \nonumber \\&\quad \phi _\delta \left( \rho ^{i,(k+1)}(\mathbf {p}^i) - \rho ^{j,(k+1)} (\mathbf {p}^j) \right) \Big ] \end{aligned}$$
(44)

is a local quadratic majorant of \(\varepsilon (\rho ^{(k+1)},\cdot )\) around \(\varvec{\sigma }^{(k)}\) which is suitable for the update (39).

The update (38) then comes down to solving a large sparse linear least-squares problem, which we achieve by applying conjugate gradient iterations to the associated normal equations. Regarding (39), it comes down to solving a series of m independent small-scale linear least-squares problems, for instance, by resorting to the pseudo-inverse.
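As an illustration of the update (39), a sketch (ours; names and signatures are hypothetical) for a single view: with the reflectance frozen, each \({\varvec{\sigma }}^i\) solves a small weighted least-squares problem whose weights come from the majorant (40) evaluated at the current residuals.

```python
import numpy as np

def update_sigma(rho_i, nu_i, I_i, sigma_prev, delta):
    """Lighting update (39) for one view, as a 9-D weighted least-squares solve.

    rho_i: (k,) current reflectance at the k pixels of Omega^i
    nu_i:  (k, 9) geometric vectors;  I_i: (k,) graylevels
    """
    A = rho_i[:, None] * nu_i                    # rows: rho^i(p) nu^i(p)^T
    w = 1.0 / np.maximum(np.abs(A @ sigma_prev - I_i), delta)  # weights from (40)
    Aw = w[:, None] * A
    return np.linalg.pinv(Aw.T @ A) @ (Aw.T @ I_i)  # weighted normal equations

# Noiseless sanity check: the true lighting vector is recovered
rng = np.random.default_rng(1)
nu, rho = rng.standard_normal((200, 9)), rng.random(200) + 0.5
sigma_true = rng.standard_normal(9)
I = rho * (nu @ sigma_true)
print(np.allclose(update_sigma(rho, nu, I, np.zeros(9), 1e-4), sigma_true))
```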

We iterate the optimization steps (38) and (39) until convergence or a maximum iteration number is reached, starting from the trivial solution of the non-regularized (\(\lambda = \mu = 0\)) problem. This non-regularized solution is attained by considering diffuse lighting (see (20)) and using the input images as reflectance maps. In our experiments, we found 50 iterations were always sufficient to reach a stable solution (\(10^{-3}\) relative residual between two consecutive energy values \(\varepsilon (\rho ^{(k)},{\varvec{\sigma }}^{(k)})\) and \(\varepsilon (\rho ^{(k+1)},{\varvec{\sigma }}^{(k+1)})\)).

Proving convergence of our scheme is beyond the scope of this paper, but the proof could certainly be derived from that in [31], where a similar alternating majorization–minimization scheme, called “alternating reweighted least-squares,” is used. Note, however, that the convergence rate seems to be sublinear (see Fig. 4); hence, possibly faster numerical strategies could be explored in the future.

Fig. 4

Top: evolution of the energy \(\varepsilon (\rho ^{(k)},{\varvec{\sigma }}^{(k)})\) defined in (37), as a function of the iterations (k), for the test presented in Fig. 8. Bottom: absolute value of the relative variation between two successive energy values. Our algorithm stops when this value is less than \(10^{-3}\), which happens in less than 50 iterations and takes around 3 minutes on a recent i7 processor, with non-optimized Matlab codes for \(m=13\) images of size \(540 \times 960\)

Fig. 5

a 3D-shape used in the tests (the well-known “Joyful Yell” 3D-model), which will be imaged under two scenarios (see Figs. 6 and 7). b Same, after smoothing, thus less accurate. c, d Zooms of a and b, respectively, near the neck

6 Results

In this section, we evaluate the proposed variational method for multi-view reflectance estimation, on a variety of synthetic and real-world datasets. We start with a quantitative comparison of our results against two single-view methods, namely the cartoon + texture decomposition method from [23] and the intrinsic image decomposition method from [14].

Fig. 6

First row: three (out of \(m=13\)) synthetic views of the object of Fig. 5a, computed with a purely Lambertian reflectance taking only four different values (hair, face, shirt and plinth), illuminated by a “skydome.” Second row: estimation of the reflectance using the cartoon + texture decomposition described in [23] (with its parameter fixed to 0.4). Third row: estimation of the reflectance using the method proposed in [14] (with 4 clusters). Fourth row: estimation of the reflectance using the proposed approach (with \(\lambda = 8\) and \(\mu = 1000\)). Fifth row: ground truth (Color figure online)

Fig. 7

First row: three (out of \(m=13\)) synthetic views of the object of Fig. 5a, computed with a non-uniform shirt reflectance and a uniform, but partly specular hair reflectance, illuminated by a single extended light source. Second row: estimation of the reflectance using the cartoon + texture decomposition described in [23] (with its parameter fixed to 0.4). Third row: estimation of the reflectance using the method proposed in [14] (with 6 clusters). Fourth row: estimation of the reflectance using the proposed approach (with \(\lambda = 2.5\) and \(\mu = 1000\)). Fifth row: ground truth (Color figure online)

Table 1 RMSE on the reflectance estimates (the estimated and ground truth reflectance maps are scaled to [0, 1]), with respect to each channel and to the whole set of images, for our method and two single-view approaches

6.1 Quantitative Evaluation on a Synthetic Dataset

We first test our reflectance estimation method using \(m=13\) synthetic images, of size \(540\times 960\), of an object whose geometry is perfectly known (see Fig. 5a). Two scenarios are considered:

  • In Fig. 6, a purely Lambertian, piecewise-constant reflectance is mapped onto the surface of the object, which is then illuminated by a “skydome,” i.e., an almost diffuse lighting. Shading effects are thus rather limited; hence, applying to each image an estimation method which does not use an explicit reflectance model, e.g., the cartoon + texture decomposition method from [23], should already provide satisfactory results. The reflectance being perfectly piecewise-constant, applying sparsity-based intrinsic image decomposition methods such as [14] to each image should also work well.

  • In Fig. 7, a more complicated (non-uniform) reflectance is mapped onto the shirt, the hair is made partly specular, and the diffuse lighting is replaced by a single extended light source, which induces much stronger shading effects. It will thus be much harder to remove shading without an explicit reflectance model (cartoon + texture approach), while the single-view image decomposition approach should lack robustness to specularities.

In both cases, the competing methods [23] and [14] are applied independently to each of the \(m=13\) images. The estimates are thus not expected to be consistent, which may be problematic if the reflectance maps are to be further mapped onto the surface for, e.g., relighting applications. On the contrary, our approach simultaneously, and consistently, estimates the m reflectance maps.

Since the ground-truth reflectance is available, we can numerically evaluate these results by computing the root mean square error (RMSE) of each method, over the whole set of \(m=13\) images. The values are presented in Table 1. In order to ensure a fair comparison, the reflectance estimated by each method is scaled, in each channel, by a factor common to the \(m=13\) reflectance maps, so as to minimize the RMSE. This should thus highlight inconsistencies between the reflectance maps.

Fig. 8

Same test as in Fig. 7, using a coarse version of the 3D-shape (see Fig. 5b and d), with \(\lambda = 2.5\) and \(\mu = 1000\). Results are qualitatively similar to those shown in Fig. 7, obtained with perfect geometry. The RMSE values in the RGB channels are, respectively, 0.24, 0.14 and 0.13, which are only slightly higher than those attained with perfect geometry (see Table 1) (Color figure online)

Based on the qualitative results from Figs. 6 and 7, and the quantitative evaluations shown in Table 1, we can make the following three observations:

(1) Considering an explicit image formation model improves cartoon + texture decomposition Indeed, the cartoon part from the cartoon + texture decomposition is far less uniform than the reflectance estimated using both other methods. Shading is only blurred, and not really removed. This could be improved by increasing the regularization weight, but the price to pay would be a loss of detail in the parts containing thinner details (such as the shirt in the example of Fig. 7).

(2) Simultaneously estimating the multi-view reflectance maps makes them consistent and improves robustness to specularities When estimating each reflectance map individually, inconsistencies arise, which is obvious for the hair in the third row of Fig. 6, and explains the RMSE values in Table 1. In contrast, our results confirm our basic idea, i.e., that reflectance estimation benefits in two ways from the multi-view framework: This allows us not only to estimate the 3D-shape, but also to constrain the reflectance of each surface point to be the same in all the pictures where it is visible. In addition, since the location of bright spots due to specularity depends on the viewing angle, they usually occur in some places on the surface only under certain viewing angles. Considering multi-view data should thus improve robustness to specularities. This is confirmed in Fig. 7 by the reflectance estimates in the hair, where the specularities are slightly better removed than with single-view methods.

(3) A sparsity-based prior for the reflectance should be preferred over total variation As we use a TV-smoothing term, which favors piecewise-smooth reflectance, the satisfactory results of Fig. 6 were predictable. However, some penumbra remains visible around the neck. Since we also know the object geometry, it seems that we could compensate for penumbra. However, this would require that the lighting be known as well, which is not the case in the targeted use case, since outdoor lighting is uncontrolled. Moreover, we would have to consider not only the primary lighting, but also the successive bounces of light on the different parts of the scene. (These were taken into account by the ray-tracing algorithm, when synthesizing the images.) In contrast, the sparsity-based approach [14] is able to eliminate penumbra rather well, without modeling secondary reflections. It is also able to more appropriately remove shading on the face in the example of Fig. 7, while not degrading the thin structures of the shirt as much as total variation does. Hence, the relative simplicity of the numerical solution, which is a consequence of replacing the Potts prior by a total variation one (see Sect. 4.3), comes at a price. In future works, it may be important to design a numerical strategy handling the original non-smooth, non-convex problem (32).

Fig. 9

Test on a real-world dataset. First row: three (out of \(m=8\)) views of the scene. Second row: estimated reflectance maps using the proposed approach (with \(\lambda = 2\) and \(\mu = 1000\)). Geometry and camera parameters were estimated using an SfM/MVS pipeline (Color figure online)

6.2 Handling Inaccurate Geometry

In the previous experiments, the geometry was perfectly known. In real-world scenarios, errors in the 3D-shape estimation using SfM and MVS are unavoidable. Therefore, it is necessary to evaluate the ability of our method to handle inaccurate geometry.

For the next experiment, we thus use the surface shown in Fig. 5b (zoomed in Fig. 5d), which is obtained by smoothing the original 3D-shape of Fig. 5a (zoomed in Fig. 5c) using a tool from the MeshLab software. The results provided in Fig. 8 show that our method seems robust to such small inaccuracies in the object geometry, and is thus relevant for the intended application.
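For reproducibility, a coarse version of a mesh can be obtained with a few iterations of Laplacian smoothing, which is essentially what the MeshLab filter we used does. Below is a minimal, self-contained sketch (plain NumPy, uniform weights); it illustrates the degradation procedure, but is not the exact MeshLab implementation:

```python
import numpy as np

def laplacian_smooth(vertices, faces, iterations=10, step=0.5):
    """Uniform Laplacian smoothing of a triangle mesh.

    vertices: (n, 3) float array; faces: (m, 3) int array.
    Each iteration moves every vertex toward the centroid of its
    one-ring neighbors, erasing fine-scale geometry. Assumes every
    vertex belongs to at least one face.
    """
    n = len(vertices)
    # Build the (symmetric) vertex adjacency from the face edges.
    nbrs = [set() for _ in range(n)]
    for a, b, c in faces:
        nbrs[a].update((b, c)); nbrs[b].update((a, c)); nbrs[c].update((a, b))
    nbrs = [np.fromiter(s, dtype=int) for s in nbrs]

    v = vertices.astype(float).copy()
    for _ in range(iterations):
        centroids = np.array([v[idx].mean(axis=0) for idx in nbrs])
        v += step * (centroids - v)   # blend toward the neighbor centroid
    return v
```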

In Fig. 9, we qualitatively evaluate our method on the outputs of an SfM/MVS pipeline applied to a real-world dataset, which provides estimates of the camera parameters and a rough geometry of the scene. These experiments confirm that small inaccuracies in the input geometry can be handled. The specularities are appropriately removed, and the reflectance maps present the expected cartoon-like aspect. However, the reflectance is under-estimated on the sides of the nose and around the chin: since the lighting is fixed, these areas are self-shadowed in all the images. Two workarounds could be considered: increasing the weight of the regularization term (at the risk of losing fine-scale details), or actively controlling the lighting so as to ensure that no point on the surface is shadowed in all the views. This is further discussed in the next subsection.

Fig. 10

Quantitative influence of parameter \(\lambda \), using images from the same dataset as that of Fig. 7, with \(\mu = 1000\) (Color figure online)

6.3 Tuning the Hyper-parameters \(\lambda \) and \(\mu \)

In the previous experiments, we empirically chose the values of the parameters \(\lambda \) and \(\mu \) which provided the "best" results. Such a tuning, which may be tedious, of course deserves discussion.

In order to highlight the influence of these parameters, let us first examine what would happen with neither regularization nor multi-view consistency, i.e., when \(\lambda = \mu = 0\). In that case, only the photometric term would be optimized, which corresponds to maximum likelihood estimation. If the lighting does not vary, we are in a degenerate case which may result in estimating a diffuse lighting (see Eq. (20)) and replacing the reflectance maps by the images. Lighting would thus be "baked into" the reflectance maps, which is precisely what we aim to avoid.
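This degenerate solution is easy to exhibit numerically: whatever the image, setting the reflectance equal to the image and the shading identically to one drives the photometric residual to zero. A toy sketch (the variable names are ours, not those of the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.uniform(0.1, 0.9, size=(64, 64))   # any graylevel image

# Degenerate "maximum likelihood" solution when lambda = mu = 0:
# bake all brightness variations into the reflectance map and
# explain the image with a constant (diffuse-like) shading of 1.
rho = I.copy()
shading = np.ones_like(I)

residual = np.abs(I - rho * shading).sum()  # photometric (l1) term
print(residual)  # 0.0: perfect fit, yet no shading has been removed
```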

To avoid this effect, the smoothness term must be activated by setting \(\lambda > 0\). If we still consider \(\mu = 0\), then the variational Problem (37) comes down to \(m\) independent image restoration problems. These problems are similar to \(\ell ^1\)-TV denoising problems, except that a physically plausible fidelity term helps remove the illumination artifacts, not only through the total variation regularization, but also by incorporating prior knowledge of the surface geometry. However, because the photometric term is invariant under the transformation \((\rho ^i,{\varvec{\sigma }}^i) := (\kappa ^i \rho ^i, {\varvec{\sigma }}^i/\kappa ^i),\,\kappa ^i > 0\), each reflectance map \(\rho ^i\) is estimated only up to a scale factor, hence the \(m\) maps will not be consistent, just as is the case for the competing single-view methods.
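To make the analogy concrete, the sketch below implements a standard primal-dual (Chambolle-Pock) solver for the plain \(\ell ^1\)-TV denoising model \(\min _u \lambda \, \mathrm {TV}(u) + \Vert u - f \Vert _1\). It is only an analogue of our per-image subproblem, in which the \(\ell ^1\) data term would be replaced by the photometric term:

```python
import numpy as np

def grad(u):
    """Forward-difference gradient with Neumann boundary, shape (2, H, W)."""
    g = np.zeros((2,) + u.shape)
    g[0, :-1, :] = u[1:, :] - u[:-1, :]
    g[1, :, :-1] = u[:, 1:] - u[:, :-1]
    return g

def div(p):
    """Divergence, the negative adjoint of grad."""
    d = np.zeros(p.shape[1:])
    d[:-1, :] += p[0, :-1, :]; d[1:, :] -= p[0, :-1, :]
    d[:, :-1] += p[1, :, :-1]; d[:, 1:] -= p[1, :, :-1]
    return d

def tv_l1_denoise(f, lam=1.0, n_iter=300):
    """Chambolle-Pock iterations for min_u lam*TV(u) + ||u - f||_1."""
    tau = sigma = 1.0 / np.sqrt(8.0)     # tau * sigma * ||grad||^2 <= 1
    u, u_bar, p = f.copy(), f.copy(), np.zeros((2,) + f.shape)
    for _ in range(n_iter):
        # Dual ascent, then projection onto the lam-ball (isotropic TV).
        p += sigma * grad(u_bar)
        p /= np.maximum(1.0, np.sqrt((p ** 2).sum(axis=0)) / lam)
        # Primal descent, then prox of the l1 data term (shrinkage around f).
        u_old = u
        v = u + tau * div(p)
        u = f + np.sign(v - f) * np.maximum(np.abs(v - f) - tau, 0.0)
        u_bar = 2 * u - u_old
    return u
```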

The latter issue is solved by activating the multi-view consistency term, i.e., by setting \(\mu > 0\). In that case, there remains an ambiguity \(\{\rho ^i,{\varvec{\sigma }}^i\}_i := \{\kappa \rho ^i, {\varvec{\sigma }}^i/\kappa \},\,\kappa > 0\), but it is now global, i.e., independent of \(i\). To solve this ambiguity, it is enough in practice to set one reflectance value arbitrarily, or to normalize the reflectance values.
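One possible normalization, given here purely for illustration, fixes the global factor \(\kappa \) by forcing the median reflectance over all maps to a reference value:

```python
import numpy as np

def fix_global_scale(rho_maps, sigma, target=0.5):
    """Resolve the global ambiguity (rho, sigma) -> (k*rho, sigma/k)
    by normalizing the median reflectance to a reference value.

    rho_maps: list of (H, W) reflectance maps;
    sigma: array of lighting coefficients (rescaled inversely).
    """
    kappa = target / np.median(np.concatenate([r.ravel() for r in rho_maps]))
    return [kappa * r for r in rho_maps], sigma / kappa
```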

Overall, it is necessary to ensure that both \(\lambda \) and \(\mu \) are strictly positive. The choice of \(\mu \) is not really critical: the multi-view consistency regularizer it controls arises from relaxing a hard constraint (compare (32) and (37)), hence \(\mu \) only needs to be chosen "high enough" for the regularizer to closely approximate that hard constraint. In all the experiments, we used \(\mu = 1000\) and did not face any particular problem. Obviously, if the correspondences were not appropriately computed by SfM, then this value should be reduced, but SfM solutions such as [27] are now mature enough to provide accurate correspondences.

The choice of \(\lambda \) is much more critical, as illustrated in Fig. 10, which shows the RMSE in each channel at convergence of our algorithm, as a function of \(\lambda \), using images from the same dataset as in Fig. 7. The graph shows that an "optimal" value of \(\lambda \) is very hard to find: in this example, a high value of \(\lambda \) would diminish the RMSE in the face and the hair (which are mostly red), because it would make them uniform, as expected (see Fig. 11, last rows). However, a much lower value of \(\lambda \) is required to preserve the thin shirt details, which mostly contain green and blue components (see Fig. 11, first rows).
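In practice, producing such a curve boils down to a grid search: run the solver for each candidate \(\lambda \) and record the per-channel RMSE. A hypothetical sketch follows, in which `solve_reflectance` is a placeholder for the full multi-view solver described earlier and is not part of any public API:

```python
import numpy as np

def grid_search_lambda(images, gt_maps, lambdas, mu=1000.0):
    """Per-channel RMSE as a function of lambda (cf. Fig. 10).

    `solve_reflectance` is a hypothetical stand-in for the solver;
    it is assumed to return one (H, W, 3) reflectance map per image.
    """
    curves = {}
    for lam in lambdas:
        est_maps = solve_reflectance(images, lam=lam, mu=mu)  # hypothetical
        est = np.concatenate([m.reshape(-1, 3) for m in est_maps])
        gt = np.concatenate([m.reshape(-1, 3) for m in gt_maps])
        # Same per-channel common scaling as in the evaluation protocol.
        kappa = (est * gt).sum(axis=0) / (est * est).sum(axis=0)
        curves[lam] = np.sqrt(((kappa * est - gt) ** 2).mean(axis=0))
    return curves  # e.g. curves[2.5] -> RMSE in (R, G, B)
```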

Fig. 11

Qualitative influence of parameter \(\lambda \), using images from the same dataset as that of Fig. 7, with \(\mu = 1000\) (Color figure online)

Fig. 12

First row: three (out of \(m=13\)) synthetic images computed under varying lighting (which comes here from the right, from the front and from the left, respectively). Second row: estimated reflectance maps using the proposed approach (with \(\lambda = 1\) and \(\mu = 1000\)). The thin structures of the shirt are preserved, while shading on the face is largely reduced. These results must be compared with those of the first row in Fig. 11, obtained with the same value of \(\lambda \) but under fixed lighting (Color figure online)

There is one situation where this tuning is much easier: when the lighting is not fixed, but strongly varying. As discussed in Sect. 3, the problem of jointly estimating reflectance and lighting is then over-determined, which theoretically makes the regularization unnecessary. In Fig. 12, we show the results obtained when each image is acquired under a different lighting. In that case, the thin structures of the shirt are preserved, while shading on the face is largely reduced, despite the choice of a very low regularization weight \(\lambda = 1\). Note that we cannot use the limit case \(\lambda = 0\), because not all pixels have correspondences in all images: there may thus be a few pixels for which the problem remains under-determined, and for which diffusion is required. Overall, this experiment shows that, without any prior knowledge of the lighting, the only way to avoid introducing an empirical prior on the reflectance, and thus its tuning, is to actively control the lighting during the acquisition process. This means combining multi-view and photometric stereo.
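The over-determination under varying lighting is the classical photometric stereo argument: with \(m \ge 3\) known, non-coplanar lightings, the per-pixel vector \(\rho \, \mathbf {n}\) follows from linear least squares, as in the sketch below (directional lighting assumed; shadows and visibility ignored for brevity):

```python
import numpy as np

def photometric_stereo(I, S):
    """Per-pixel albedo and normal from m >= 3 images under known lighting.

    I: (m, H, W) graylevel images; S: (m, 3) lighting directions.
    Solves I_i = rho * dot(s_i, n) in the least-squares sense.
    """
    m, H, W = I.shape
    b = np.linalg.lstsq(S, I.reshape(m, -1), rcond=None)[0]  # (3, H*W) = rho*n
    rho = np.linalg.norm(b, axis=0)                          # albedo
    n = b / np.maximum(rho, 1e-12)                           # unit normals
    return rho.reshape(H, W), n.reshape(3, H, W)
```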

This problem is actively being addressed by the computer vision community [30]. Interestingly, this line of research focuses on highly accurate geometry estimation rather than on reflectance estimation (no reflectance estimation result is shown). It may therefore be an interesting future research direction to incorporate our reflectance estimation framework into such multi-view, multi-lighting approaches: both highly accurate geometry and reflectance could then be expected.

7 Conclusion and Perspectives

We have proposed a variational framework for estimating the reflectance of a scene from a series of multi-view images. We advocate a 2D-parameterization of reflectance, which turns the problem into that of converting the input images into reflectance maps. Invoking a Bayesian rationale leads to a variational model comprising an \(\ell ^1\)-norm-based photometric data term, a Potts regularizer and a multi-view consistency constraint; for simplicity, the latter two are relaxed into a total variation term and an \(\ell ^1\)-norm term, respectively. Numerical solving is carried out using an alternating majorization-minimization algorithm. Empirical results on both synthetic and real-world datasets demonstrate the interest of considering multi-view images for reflectance estimation, as this allows us to benefit from prior knowledge of the geometry, to improve robustness to specularities, and to guarantee consistency of the reflectance estimates.

However, the critical analysis of our results also highlighted some limitations and possible future research directions. For instance, avoiding the relaxation of the non-smooth, non-convex regularization seems necessary in order to really ensure that the estimated reflectance maps are piecewise-constant. In addition, although parameterizing reflectance in the image (2D) domain is advocated for reasons of numerical simplicity, it would be somewhat more natural to work directly on the surface, which would render the multi-view consistency constraint unnecessary. However, this would turn our simple variational framework into a more arduous optimization problem over a manifold.

Finally, we could disambiguate the problem by measuring the incoming light beforehand, using, for instance, environment maps. Without such prior measurements, it seems that the only way to avoid resorting to an arbitrary prior to limit the arising ambiguities consists in actively controlling the lighting, which would make spatial regularization unnecessary. Another extension of our work therefore consists in estimating reflectance from multi-view, multi-lighting data, in the spirit of multi-view photometric stereo techniques. However, this would require appropriately modifying the SfM/MVS pipeline, which relies on the constant-brightness assumption.