1 Introduction

Recording both 3D shape and surface reflectance is invaluable for digitally archiving and analyzing cultural heritage artifacts. While the importance of digitally archiving artifacts is generally recognized, the practice is still not widespread in museums and libraries, mostly because the digitization process is complex and requires expensive specialized setups. To enable everyone to participate in digital archiving, a method that is simple to operate and only requires a commodity device is highly desirable.

With this goal in mind, this paper presents a high-fidelity shape and albedo recovery method using a simple imaging setup that is already available in widespread devices. Our method only requires a stereo camera and a flashlight as shown in Fig. 1, and takes as input three images captured in two shots from a fixed viewpoint: two images in one shot by the stereo camera, and another image by one of its cameras with the flashlight turned on. By harnessing both geometric and photometric cues from the input images, our method recovers a fine 3D shape and a surface albedo map. Specifically, it uses the rough shape inferred from the stereo image pair to estimate the no-flash environmental lighting, and then recovers the high-frequency details of the target scene using our flash/no-flash image formation model.

Fig. 1

Our setup uses a stereo camera and a flashlight, which is common in modern smartphones, e.g., the iPhone X from 2017. We capture a stereo image pair to infer a rough depth map and a flash/no-flash image pair to recover shape details and surface albedo

Unlike previous methods that rely on complex imaging setups (Zhang et al. 2012; Choe et al. 2014), our setup is minimal yet sufficient to provide both geometric and photometric cues. Beyond an ordinary monocular camera, our method only requires one additional viewpoint (i.e., a stereo camera) and one additional lighting condition (i.e., a flashlight). As will be shown later, removing any of these inputs significantly degrades the recovery. Fortunately, many commodity smartphones today are equipped with this imaging setup, and we will demonstrate later in this paper that our method is naturally applicable to such smartphones. With this setup, recording can be conducted outside a darkroom (e.g., in an office room) and completed in a moment, as it only takes two shots without any camera movement. These properties make the digitization process easy.

The key contributions of our work are as follows:

  • A high-fidelity shape and albedo recovery method working with a simple, compact, and widespread imaging setup;

  • A flash/no-flash image formation model for Lambertian surfaces with non-uniform albedos under natural lighting;

  • A robust shape and albedo recovery method that harnesses both geometric and photometric cues.

This paper extends the preliminary version of our work (Cao et al. 2020) in three important aspects: First, we generalize the image formation model to flash/no-flash image pairs captured with different camera exposure settings. This generalization is crucial for successful application on off-the-shelf devices (see Sect. 3.1). Second, we verify the effectiveness of our imaging setup and recovery method using off-the-shelf smartphones (Sect. 4.2). Finally, we show more reconstruction examples, including outdoor objects.

2 Related Work

Our reconstruction method is related to shading-based shape recovery and flash photography.

Shading-based shape recovery: Geometric shape recovery approaches such as stereo are useful for recovering a coarse shape but have fundamental limitations in recovering high-frequency details (Klowsky et al. 2012). In contrast, photometric approaches can recover per-pixel surface normals using shading cues in the images. Various approaches have been proposed for high-quality shape recovery by combining the strengths of both geometric and photometric approaches. For example, Ikeuchi (1987) recovers the depth map from a stereo pair of normal maps, which are estimated by photometric stereo with three lights.

While photometric approaches commonly assume controlled lighting conditions without ambient lighting, this assumption is likely violated when they are combined with geometric approaches, and they then face more challenging lighting conditions. Basri and Jacobs (2003) showed that the reflectance of a Lambertian surface can be modeled as a low-dimensional linear combination of spherical harmonics. Photometric stereo under natural illumination has been shown to be feasible after this theoretical result (Basri et al. 2007; Johnson and Adelson 2011), and such approaches have been incorporated into geometric ones. A common algorithmic structure is to estimate a coarse depth map, then estimate illumination and albedo from the coarse depth map, followed by an optimization including but not limited to depth, shading, and smoothness constraints (Quéau et al. 2017; Wu et al. 2011b; Yan et al. 2018; Yu et al. 2013). Estimating global spherical harmonics coefficients usually fails in local areas where cast shadows or specularities dominate the intensity. To alleviate this problem, Han et al. (2013) split illumination into a global and a local part, Or-El et al. (2015) handled local illumination based on first-order spherical harmonics, and Maier et al. (2017) proposed spatially-varying spherical harmonics. Besides a single color image, photometric cues from other types of input have been used to improve the reconstruction quality, for example, from infrared images (Choe et al. 2014; Haque et al. 2014), from RGB-D streams (Wu et al. 2011a, 2014), or from multi-view images (Gallardo et al. 2017; Maurer et al. 2018).

Our work uses a simpler setup consisting of a stereo camera and a flashlight. With two shots, our method recovers fine geometry for Lambertian objects under natural lighting.

Fig. 2

Pipeline of our method. Given (a) an initial rough shape from a stereo camera, we first estimate (b) a coarse normal map. With (c) the flash and (d) the no-flash image, we optimize for (e) a fine normal map. Finally, we compute (f) the albedo map and perform depth normal fusion to obtain (g) the fine shape. Section 3.2 details each step

Flash photography: Images taken with a flashlight have been used for various computer vision tasks. Using the light falloff property, a flash and no-flash image pair has been used for image matting (Sun et al. 2006), foreground extraction (Sun et al. 2007), and saliency detection (He and Lau 2014). Under low-light conditions, a flash image captures high-frequency details but changes the overall appearance of the scene, while the no-flash image captures the overall environmental ambiance but is noisy. This complementary property has been used in photography enhancement under dark illumination (Eisemann and Durand 2004), denoising, detail transfer, or white balancing (Petschnigg et al. 2004).

Further, photometric cues introduced by a flashlight are useful in stereo matching. Feris et al. (2005) demonstrated that the shadows cast by a flashlight along depth discontinuities help to detect half-occluded points in stereo matching. Zhou et al. (2012) showed that the ratio of a flash/no-flash pair makes stereo matching robust against depth discontinuities. In addition, flash images are used for recovering spatially varying BRDFs (SVBRDFs). A single image captured by a flash-enabled camera, or a flash/no-flash pair (Aittala et al. 2015), is used for SVBRDF and shape recovery of near-planar objects (Aittala et al. 2016; Deschaintre et al. 2018; Li et al. 2018a) or objects with complex geometry (Li et al. 2018b).

Our work differs from the previous works in that we explicitly parameterize the image observation lit by a flashlight, and use the flash/no-flash image pair to derive an albedo-free image formation model for geometry refinement.

3 Proposed Method

Figure 2 illustrates our method for shape and albedo recovery. The inputs to our method are (a) a rough depth map inferred from a stereo image pair taken by a stereo camera and (c)+(d) a flash/no-flash image pair taken by the stereo camera’s reference camera. First, we compute a coarse surface normal map from the depth map as shown in (b). We then estimate the environmental lighting and refine the normal map based on our flash/no-flash image formation model as in (e). Finally, we fuse the fine normal map (e) and the coarse depth map (a) to obtain the fine shape (g) and compute the albedo map (f).

In the following, Sect. 3.1 describes our image formation model for the flash/no-flash image pair and Sect. 3.2 details the design choices of each step in our method.

3.1 Image Formation Model

Assuming Lambertian reflectance, the radiance \(r \in \mathbb {R}_+\) emitted from a tiny surface patch can be modeled as

$$\begin{aligned} r=\rho \, s(\mathbf {n}), \end{aligned}$$
(1)

where a shading function \(s: \mathcal {S}^2\rightarrow \mathbb {R}\) depends on the environmental lighting and is scaled by the surface albedo \(\rho \in \mathbb {R}_+\). The shading function s is applied to the unit surface normal \(\mathbf {n}\in \mathcal {S}^2\subset \mathbb {R}^3\).

Let \(m \in \mathbb {R}_+\) be the brightness of the radiance recorded by a digital camera. Assume the camera has a linear radiometric response to the radiance, with unit gain for simplicity. The intensity m is then the scene radiance r scaled by the camera exposure \(c \in \mathbb {R}_+\) as

$$\begin{aligned} m = cr. \end{aligned}$$
(2)

The camera exposure c accounts for the lens aperture, ISO, and exposure time.

Now consider that a flash/no-flash image pair is taken for an object by the same camera. Assume that the viewpoint is fixed, the object is static, and the environmental lighting stays the same during the capture. A pixel at a fixed location in the flash/no-flash image pair then records the radiance from the same surface patch, scaled by possibly different camera exposures. We use the subscript “nf” and “f” to indicate the no-flash and flash images, respectively. Using Eqs. (1) and (2), we can model the intensity recorded at the same pixel location in the flash/no-flash pair as

$$\begin{aligned} {\left\{ \begin{array}{ll} m_\mathrm {nf}= c_\mathrm {nf}\rho s_\mathrm {nf},\\ m_\mathrm {f}= c_\mathrm {f}\rho (s_\mathrm {nf}+ s_\mathrm {fo}). \end{array}\right. } \end{aligned}$$
(3)

The additional shading \(s_\mathrm {fo}\) is introduced by the flashlight (the subscript “fo” represents flash-only), which is identical to the shading if the flashlight were the only light source in the scene. Let \(\gamma = \frac{c_\mathrm {f}}{c_\mathrm {nf}}\) be the ratio of the flash image’s exposure to the no-flash image’s exposure. By modifying Eq. (3), we obtain

$$\begin{aligned} {\left\{ \begin{array}{ll} m_\mathrm {nf}= c_\mathrm {nf}\rho s_\mathrm {nf},\\ m_\mathrm {f}- \gamma m_\mathrm {nf}= c_\mathrm {f}\rho s_\mathrm {fo}. \end{array}\right. } \end{aligned}$$
(4)

The second equation implies a virtual flash-only image: The computed intensity \(m_\mathrm {f}- \gamma m_\mathrm {nf}\) is the flash-only shading scaled by the albedo and the flash image’s exposure. Figure 3 exemplifies a virtual flash-only image. Notice that the shadows caused by natural lighting disappear in the flash-only image, verifying the correctness of the subtraction.
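As a concrete illustration, the following minimal sketch (in Python/NumPy; function and variable names are ours and hypothetical, not part of the method’s implementation) computes the virtual flash-only image of Eq. (4) and the ratio image of Eq. (5) from linear raw intensities:

import numpy as np

def flash_only_image(m_f, m_nf, gamma):
    """Virtual flash-only image m_f - gamma * m_nf of Eq. (4).

    m_f, m_nf : linear (raw) flash / no-flash intensity images of equal size,
                captured from a fixed viewpoint under fixed environmental lighting.
    gamma     : exposure ratio c_f / c_nf between the two shots.
    """
    m_fo = m_f.astype(np.float64) - gamma * m_nf.astype(np.float64)
    # Negative values can only stem from noise or model violations; clip them
    # so that the ratio image of Eq. (5) stays well defined.
    return np.clip(m_fo, 0.0, None)

def ratio_image(m_f, m_nf, gamma, eps=1e-8):
    """Albedo-free ratio gamma * m_nf / (m_f - gamma * m_nf) of Eq. (5)."""
    return gamma * m_nf / np.maximum(flash_only_image(m_f, m_nf, gamma), eps)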

Fig. 3
figure 3

Subtracting the \(\gamma \)-scaled no-flash image from the flash image yields a virtual flash-only image. The ratio image is obtained by dividing the gray-scale no-flash image by the gray-scale flash-only image (Color figure online)

Further taking the ratio of the two equations in Eq. (4) yields

$$\begin{aligned} \frac{\gamma m_\mathrm {nf}}{m_\mathrm {f}- \gamma m_\mathrm {nf}} = \frac{s_\mathrm {nf}}{s_\mathrm {fo}}. \end{aligned}$$
(5)

The division cancels out the unknown albedo \(\rho \); therefore, our method can naturally handle spatially-varying albedos, unlike previous methods that assume piece-wise uniform albedos (Häfner et al. 2018, 2019). This albedo-free image formation model directly relates the shading to the measured intensity. The effect of this albedo canceling is illustrated in Fig. 3: while the surface albedo of the mat has complex spatial variation, only the shading information remains in the ratio image.

Explicitly modeling the camera exposure in the image formation model of Eq. (5) has practical merit. The identical-exposure setting (\(\gamma =1\)) used in Cao et al. (2020) is a special case of Eq. (5); in practice, however, it can cause overexposure in the flash image or underexposure in the no-flash image. Equation (5) allows us to properly expose each image of the flash/no-flash pair.

Shading model: We now discuss how we model the no-flash shading \(s_\mathrm {nf}\) and the flash-only shading \(s_\mathrm {fo}\). Suppose a light ray in direction \({\mathbf {l}} \in \mathcal {S}^2\subset \mathbb {R}^3\) with intensity \(e: \mathcal {S}^2\rightarrow \mathbb {R}\) hits a surface patch. According to Lambert’s law, the reflected light, or shading, is given by

$$\begin{aligned} s(\mathbf {n})= e(\mathbf {l})\max (\mathbf {n}^\top \mathbf {l}, 0). \end{aligned}$$
(6)

Under natural lighting, light rays reach the surface patch from infinitely many directions. The shading then becomes the integral over all possible incident directions

$$\begin{aligned} s(\mathbf {n})=\int _{\mathcal {S}^2} e(\mathbf {l})\max (\mathbf {n}^\top \mathbf {l}, 0)\mathrm{d}\mathbf {l}. \end{aligned}$$
(7)

As studied in Ramamoorthi and Hanrahan (2001) and Basri and Jacobs (2003), a Lambertian surface acts as a low-pass filter, and its shading under natural lighting is well characterized by the second-order spherical harmonics, i.e., the integral in Eq. (7) can be approximated by a linear combination of the second-order spherical harmonics. Denoting the unit surface normal \(\mathbf {n}=[n_1,n_2,n_3]^\top \), the spherical harmonics up to the second order can be stacked into a vector \(\mathbf {h}(\mathbf {n})\) as

$$\begin{aligned} \mathbf {h}(\mathbf {n})=[1,n_1,n_2,n_3,n_1n_2,n_2n_3,n_3n_1,n_1^2-n_2^2,3n_3^2-1]^\top . \end{aligned}$$

The shading under no-flash illumination \(s_\mathrm {nf}\) is then a linear combination of these spherical harmonics. Stacking the 9 coefficients into a vector \(\mathbf {l}_\mathrm {nf}\in \mathbb {R}^9\) yields

$$\begin{aligned} s_\mathrm {nf}= \mathbf {h}(\mathbf {n})^\top \mathbf {l}_\mathrm {nf}. \end{aligned}$$
(8)

Note that \({\mathbf {l}} \) and \(\mathbf {l}_\mathrm {nf}\) are different; \({\mathbf {l}} \) is a light ray direction, and \(\mathbf {l}_\mathrm {nf}\) is a stack of spherical harmonic coefficients.
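For reference, the basis above can be evaluated exactly as written; the following sketch (Python/NumPy, with names of our choosing) assumes the usual normalization constants are absorbed into the coefficient vector:

import numpy as np

def sh_basis(normals):
    """Second-order spherical harmonics basis h(n) for unit normals.

    normals : (..., 3) array of unit surface normals [n1, n2, n3].
    Returns an (..., 9) array with the ordering used in the text:
    [1, n1, n2, n3, n1*n2, n2*n3, n3*n1, n1^2 - n2^2, 3*n3^2 - 1].
    """
    n1, n2, n3 = normals[..., 0], normals[..., 1], normals[..., 2]
    ones = np.ones_like(n1)
    return np.stack([ones, n1, n2, n3,
                     n1 * n2, n2 * n3, n3 * n1,
                     n1**2 - n2**2, 3.0 * n3**2 - 1.0], axis=-1)

The no-flash shading of Eq. (8) is then the dot product of this basis with \(\mathbf {l}_\mathrm {nf}\).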

For the flashlight, we assume it is a point light located at the optical center of the camera. The incident light direction \({\mathbf {l}}\) is thus the same as the camera’s viewing direction \({\mathbf {v}}\) for each surface patch. We further assume that the flashlight emits light uniformly in all directions and the light fall-off effect is negligible. As the flashlight is the only light source contributing to the shading \(s_\mathrm {fo}\), Eq. (6) can be applied. Let \(e_\mathrm {f}\) be the flashlight intensity. Equation (6) then reads

$$\begin{aligned} s_\mathrm {fo}\!= e_\mathrm {f} \max (\mathbf {n}^\top {\mathbf {l}}, 0) \!= e_\mathrm {f} \max (\mathbf {n}^\top {\mathbf {v}}, 0) \!= e_\mathrm {f} \mathbf {n}^\top {\mathbf {v}}. \end{aligned}$$
(9)

We can drop the \(\max (\cdot , 0)\) term because \(\mathbf {n}^\top {\mathbf {v}} \) is always greater than 0 if the surface patch is visible to the camera. Inserting Eqs. (8) and (9) into Eq. (5) yields

$$\begin{aligned} \frac{\gamma m_\mathrm {nf}}{m_\mathrm {f}- \gamma m_\mathrm {nf}} = \frac{\mathbf {h}(\mathbf {n})^\top \mathbf {l}'}{\mathbf {n}^\top {\mathbf {v}}}, \end{aligned}$$
(10)

where \(\mathbf {l}'= \mathbf {l}_\mathrm {nf}/ e_\mathrm {f} \) is the spherical harmonics coefficient vector divided by the flashlight intensity; we call \(\mathbf {l}'\) the global lighting vector. This final image formation model explicitly relates the surface normal and the environmental lighting to the measured intensity.

3.2 Shape and Albedo Recovery

This section details the design choice for each step in our shape and albedo recovery method shown in Fig. 2.

Obtaining coarse surface normals We compute the initial normal map from the depth map using PlanePCA (Hoppe et al. 1992). Given the camera intrinsics, we convert the depth map into a point cloud in camera coordinates and then find each point’s surface normal by fitting a plane to its nearest neighbors. Formally, given a set of points \(\mathbf {P}=\{\mathbf {p}_1, \mathbf {p}_2, ..., \mathbf {p}_n \mid \mathbf {p}_i\in \mathbb {R}^3\}\), we find the coarse surface normal vector \(\hat{\mathbf {n}}_i\) at \(\mathbf {p}_i\) by minimizing

$$\begin{aligned} \hat{\mathbf {n}}_i= \mathop {\mathrm {argmin}}\limits _{\Vert \hat{\mathbf {n}}_i\Vert _2=1} \sum _{\mathbf {p}_j\in \mathcal {N}(\mathbf {p}_i)} \left( (\mathbf {p}_j-\bar{\mathbf {p}}_i)^\top \hat{\mathbf {n}}_i\right) ^2, \end{aligned}$$
(11)

where \(\mathcal {N}(\mathbf {p}_i)\) is the set of \(\mathbf {p}_i\)’s neighbors, and \(\bar{\mathbf {p}}_i\) is the mean of all \(\mathbf {p}_j\in \mathcal {N}(\mathbf {p}_i)\). We search for \(\mathbf {p}_i\)’s neighbors by performing a ball query as

$$\begin{aligned} \mathcal {N}(\mathbf {p}_i)=\{\mathbf {p}_j\mid \Vert \mathbf {p}_j-\mathbf {p}_i\Vert _2<r,~\forall \mathbf {p}_j\in \mathbf {P}\}, \end{aligned}$$
(12)

where r is an empirically chosen ball-search radius. PlanePCA robustly estimates a coarse, smooth normal map that expresses the low-frequency shape, which we use in the following lighting estimation step.
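The following sketch (Python with NumPy/SciPy; the names and the camera-facing sign convention are our assumptions) illustrates this PlanePCA step with the ball query of Eq. (12):

import numpy as np
from scipy.spatial import cKDTree

def plane_pca_normals(points, radius):
    """Estimate a coarse normal per 3D point by local plane fitting (PlanePCA).

    points : (n, 3) point cloud in camera coordinates (camera at the origin).
    radius : ball-query search radius r, chosen empirically.
    """
    tree = cKDTree(points)
    normals = np.zeros_like(points)
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, radius)        # neighbors N(p_i), Eq. (12)
        nbrs = points[idx] - points[idx].mean(axis=0)  # center on the mean point
        # The plane normal is the right singular vector of the centered
        # neighborhood with the smallest singular value (least-squares plane fit).
        _, _, vt = np.linalg.svd(nbrs, full_matrices=False)
        n = vt[-1]
        # Orient the normal toward the camera at the origin.
        if np.dot(n, p) > 0:
            n = -n
        normals[i] = n
    return normals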

Computing the global lighting vector Our goal now is to estimate the low-dimensional global lighting vector \(\mathbf {l}'\) in Eq. (10), given the flash/no-flash image pair and a coarse normal map. Note that solving for \(\mathbf {l}_\mathrm {nf}\) and \(e_\mathrm {f} \) separately is unnecessary for shape recovery; the unknown \(e_\mathrm {f} \) merely scales the recovered albedo map globally.

Suppose there are p pixels in the region of interest, i.e., the region of the foreground object. We stack the row vectors \(\mathbf {h}(\hat{\mathbf {n}})^\top / \hat{\mathbf {n}}^\top {\mathbf {v}} \) for each pixel vertically into a matrix \(\mathbf {N}\in \mathbb {R}^{p \times 9}\) and stack the measured \(\gamma m_\mathrm {nf}/ (m_\mathrm {f}- \gamma m_\mathrm {nf})\) into a vector \(\mathbf {m} \in \mathbb {R}^{p}\). \(\mathbf {l}'\) can be obtained by solving the following over-determined system

$$\begin{aligned} \mathbf {N}\mathbf {l}'= \mathbf {m}. \end{aligned}$$
(13)

Although the coarse normal map only expresses a low-frequency shape, we will show in the experiments that the estimated lighting is as accurate as if it were estimated from a ground-truth normal map.
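A minimal least-squares sketch of this lighting estimation step (Python/NumPy, hypothetical names, reusing the sh_basis helper sketched in Sect. 3.1):

import numpy as np

def estimate_global_lighting(m_f, m_nf, gamma, coarse_normals, view_dirs, mask):
    """Solve the over-determined system N l' = m of Eq. (13) by least squares.

    m_f, m_nf      : linear flash / no-flash images (H, W).
    gamma          : exposure ratio c_f / c_nf.
    coarse_normals : (H, W, 3) coarse unit normals from the depth map.
    view_dirs      : (H, W, 3) unit viewing directions v (surface toward camera).
    mask           : (H, W) boolean foreground mask (region of interest).
    """
    n = coarse_normals[mask]                  # (p, 3)
    v = view_dirs[mask]                       # (p, 3)
    h = sh_basis(n)                           # (p, 9)
    ndotv = np.sum(n * v, axis=-1, keepdims=True)
    N = h / ndotv                             # rows h(n)^T / (n^T v)
    m = gamma * m_nf[mask] / (m_f[mask] - gamma * m_nf[mask])
    l_prime, *_ = np.linalg.lstsq(N, m, rcond=None)
    return l_prime                            # 9-vector l' = l_nf / e_f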

Refining the normal map We formulate the surface normal refinement as per-pixel optimization. The energy function consists of a shading constraint, a surface normal constraint, and a unit-length constraint as

$$\begin{aligned} \min _\mathbf {n}E_s(\mathbf {n}) + \lambda _1 E_n(\mathbf {n}) + \lambda _2 E_u(\mathbf {n}), \end{aligned}$$
(14)

where \(\lambda _1\) and \(\lambda _2\) are weighting factors. The shading constraint \(E_s\) measures the squared difference between the ratio image and the estimated ratio image in Eq. (10)

$$\begin{aligned} E_s(\mathbf {n}) = \left( \mathbf {h}(\mathbf {n})^\top \mathbf {l}'- \mathbf {n}^\top {\mathbf {v}} \frac{\gamma m_\mathrm {nf}}{m_\mathrm {f}- \gamma m_\mathrm {nf}}\right) ^{\!2}. \end{aligned}$$
(15)

We multiply both sides of Eq. (10) with \(\mathbf {n}^\top {\mathbf {v}} \) to avoid possible numerical issues.

The surface normal constraint \(E_n\) forces the refined surface normal to be close to the coarse surface normal \(\hat{\mathbf {n}}\), i.e., their dot-product should be close to 1

$$\begin{aligned} E_n(\mathbf {n}) = (1-\mathbf {n}^\top \hat{\mathbf {n}})^2. \end{aligned}$$
(16)

Finally, \(E_u\) enforces unit length of the surface normal

$$\begin{aligned} E_u(\mathbf {n}) = (1-\mathbf {n}^\top \mathbf {n})^2. \end{aligned}$$
(17)

The energy function Eq. (14) is non-convex due to the non-convex domain \(\mathcal {S}^2\). We solve it with BFGS (Liu and Nocedal 1989).
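For illustration, a per-pixel refinement sketch using SciPy’s BFGS routine (hypothetical names; the weight values are placeholders rather than the settings used in the paper):

import numpy as np
from scipy.optimize import minimize

def refine_normal(n_init, v, l_prime, ratio, lam1=0.5, lam2=0.1):
    """Per-pixel minimization of Eq. (14) for one pixel.

    n_init  : coarse unit normal (3,); v : unit viewing direction (3,).
    l_prime : global lighting vector (9,).
    ratio   : measured gamma*m_nf / (m_f - gamma*m_nf) at this pixel.
    lam1, lam2 : placeholder weights lambda_1, lambda_2.
    """
    def energy(n):
        e_s = (sh_basis(n) @ l_prime - (n @ v) * ratio) ** 2  # shading term, Eq. (15)
        e_n = (1.0 - n @ n_init) ** 2                         # closeness term, Eq. (16)
        e_u = (1.0 - n @ n) ** 2                              # unit-length term, Eq. (17)
        return e_s + lam1 * e_n + lam2 * e_u

    res = minimize(energy, n_init, method='BFGS')
    n = res.x
    return n / np.linalg.norm(n)   # project back onto the unit sphere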

After optimizing the normal map we can compute the albedo map: According to Eqs. (3) and (8),

$$\begin{aligned} \rho = \frac{m_\mathrm {nf}}{c_\mathrm {nf}\mathbf {h}(\mathbf {n})^\top \mathbf {l}_\mathrm {nf}} = \frac{m_\mathrm {nf}}{c_\mathrm {nf}e_\mathrm {f} \mathbf {h}(\mathbf {n})^\top \mathbf {l}'}. \end{aligned}$$
(18)

A global scalar ambiguity remains in the albedo due to the camera exposure \(c_\mathrm {nf}\) and flashlight intensity \(e_\mathrm {f} \).
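Given the refined normals and the global lighting vector, the albedo therefore follows directly; a sketch up to the global scale \(1/(c_\mathrm {nf}e_\mathrm {f})\) (hypothetical names, reusing sh_basis):

import numpy as np

def albedo_map(m_nf, normals, l_prime, mask, eps=1e-8):
    """Albedo per pixel up to the global scale 1 / (c_nf * e_f), cf. Eq. (18)."""
    rho = np.zeros_like(m_nf, dtype=np.float64)
    shading = sh_basis(normals[mask]) @ l_prime   # h(n)^T l' per foreground pixel
    rho[mask] = m_nf[mask] / np.maximum(shading, eps)
    return rho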

Handling cast shadows (optional) The spherical harmonics-based image formation model of Eq. (10) can handle attached shadows but not cast shadows (Basri and Jacobs 2003). Our method is thus likely to break down and produce artifacts in regions dominated by cast shadows. In such regions, the initial normal vector estimated from the depth map is more reliable than a normal refined by our shading constraint. To this end, we heuristically introduce a confidence term \(\omega \) into the energy function’s shading constraint as

$$\begin{aligned} \min _\mathbf {n}\omega E_s(\mathbf {n}) + \lambda _1 E_n(\mathbf {n}) + \lambda _2 E_u(\mathbf {n}), \end{aligned}$$
(19)

where \(\omega \) is defined as

$$\begin{aligned} \omega = \exp \Big (-\frac{(r-\mu )^2}{2\sigma ^2}\Big ). \end{aligned}$$
(20)

r is the ratio of the flash to the no-flash intensity, and \(\mu \) and \(\sigma \) are the mean and the standard deviation of the ratio over the object region. This definition is based on the observation that cast shadows strongly deviate the ratio r from the mean ratio. From Eq. (5), once the pixel intensity is distorted by a cast shadow under the environmental light or the flashlight, the numerator or the denominator becomes close to zero, yielding a very small or very large ratio value. This phenomenon is shown in Fig. 4: when the environmental light causes shadows, the ratio of flash to no-flash becomes high (bright pixels in Fig. 4a); when the flashlight causes shadows, the ratio becomes low (dark pixels in Fig. 4a).

Fig. 4

The relation between the ratio of flash to no-flash images and cast shadows. Large ratios (bright pixels in (a)) are likely caused by cast shadows under environmental lighting (b); tiny ratios (dark pixels in (a)) are caused by cast shadows under the flashlight (c)

The above observation leads to the choice of \(\omega \) in Eq. (20). For pixels whose ratio deviates too much from the mean ratio, the shading constraint in Eq. (19) is unlikely to be reliable. The weight \(\omega \) is then small according to Eq. (20), so that the shading constraint contributes less to the normal refinement and the normal vector stays close to the initial one.
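A short sketch of this confidence weight (Python/NumPy, hypothetical names), computed from the flash-to-no-flash ratio over the object region:

import numpy as np

def shadow_confidence(m_f, m_nf, mask):
    """Confidence weight omega of Eq. (20) from the flash / no-flash ratio.

    Pixels whose ratio deviates strongly from the mean over the object region
    are likely affected by cast shadows and receive a small weight.
    """
    r = np.zeros_like(m_f, dtype=np.float64)
    r[mask] = m_f[mask] / np.maximum(m_nf[mask], 1e-8)
    mu, sigma = r[mask].mean(), r[mask].std()
    omega = np.zeros_like(r)
    omega[mask] = np.exp(-(r[mask] - mu) ** 2 / (2.0 * sigma ** 2))
    return omega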

Fusing the normal and the depth map Finally, we fuse the fine normal map with the coarse shape to obtain the fine shape. To this end, we minimize the weighted sum of normal integration and depth terms.

For the normal integration term, we follow the inverse plane fitting method (Cao et al. 2021) to minimize the sum of plane fitting residuals as

$$\begin{aligned} E_{n}(\mathbf {z}, \mathbf {d})= \sum _{i}\sum _{j\in \mathcal {N}(i)}(z_j\mathbf {n}_i^\top \mathbf {K}^{-1}\tilde{\mathbf {u}}_j + d_i)^2, \end{aligned}$$
(21)

where \(\mathbf {z}\) and \(\mathbf {d}\) are the vectorized depth map and the plane distances to the coordinate origin, respectively. \(\mathcal {N}(i)\) contains pixel i and its four neighbors; \(z_j\), \(\mathbf {n}_i\), \(\tilde{\mathbf {u}}_j\), and \(d_i\) are the j-th entry of \(\mathbf {z}\), the normal vector at pixel i, the homogeneous coordinates of pixel j, and the i-th entry of \(\mathbf {d}\), respectively. \(\mathbf {K}~\in ~\mathbb {R}^{3 \times 3}\) is the perspective camera intrinsic matrix. The inner term of Eq. (21) measures the distance of the 3D point \(z_j\mathbf {K}^{-1}\tilde{\mathbf {u}}_j\) to the plane, which is parameterized by its normal direction \(\mathbf {n}_i\) and its distance \(d_i\) to the coordinate origin. For the depth term, we force the estimated depth \(\mathbf {z}\) to be close to the initial depth \(\hat{\mathbf {z}}\)

$$\begin{aligned} E_d(\mathbf {z})=\Vert \mathbf {z}-\hat{\mathbf {z}}\Vert _2^2. \end{aligned}$$
(22)

The whole objective now reads

$$\begin{aligned} \min _{\mathbf {z}, \mathbf {d}} E_{n}(\mathbf {z}, \mathbf {d}) + \lambda _{d} E_d(\mathbf {z}), \end{aligned}$$
(23)

where \(\lambda _{d}\) is a weighting factor to be tuned. Equation (23) can be formed as a sparse linear system, and we use a multigrid method (Brandt 1977) to find its solution.
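The sketch below (Python/SciPy, hypothetical names) builds this sparse least-squares system explicitly; for simplicity it solves the system with a generic sparse solver (LSQR) instead of the multigrid solver used in the paper, and the weight value is a placeholder:

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

def fuse_depth_normals(z_init, normals, K, lam_d=0.1):
    """Fuse a fine normal map with a coarse depth map, cf. Eqs. (21)-(23).

    z_init  : (H, W) coarse depth map.
    normals : (H, W, 3) refined unit normal map.
    K       : (3, 3) camera intrinsic matrix; lam_d is a placeholder weight.
    """
    H, W = z_init.shape
    P = H * W
    Kinv = np.linalg.inv(K)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = Kinv @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)  # K^{-1} u~, (3, P)

    rows, cols, vals, rhs = [], [], [], []
    row = 0
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # pixel i and its 4 neighbors
    for y in range(H):
        for x in range(W):
            i = y * W + x
            n_i = normals[y, x]
            for dy, dx in offsets:
                yy, xx = y + dy, x + dx
                if not (0 <= yy < H and 0 <= xx < W):
                    continue
                j = yy * W + xx
                # Residual (z_j * n_i^T K^{-1} u~_j + d_i), one row of Eq. (21)
                rows += [row, row]
                cols += [j, P + i]               # unknowns: z_j and d_i
                vals += [float(n_i @ rays[:, j]), 1.0]
                rhs.append(0.0)
                row += 1
    # Depth prior rows: sqrt(lam_d) * (z_i - z_init_i) = 0, Eq. (22)
    for i in range(P):
        rows.append(row); cols.append(i); vals.append(np.sqrt(lam_d))
        rhs.append(np.sqrt(lam_d) * z_init.ravel()[i])
        row += 1

    A = sparse.csr_matrix((vals, (rows, cols)), shape=(row, 2 * P))
    sol = lsqr(A, np.array(rhs))[0]
    return sol[:P].reshape(H, W)   # refined depth z; sol[P:] holds the distances d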

4 Experiments

This section evaluates our shape and albedo recovery results quantitatively on synthetic images and qualitatively using real-world images captured with iPhones.

4.1 Experiments Using Synthetic Images

Data generation We rendered two publicly available 3D mesh models, the Stanford Bunny and a Statue, with the physically-based renderer Mitsuba. For the no-flash image, we put each object under environment map lighting. We then simulate the flashlight by placing an additional directional light source in the same scene. We obtain the objects’ ground-truth shapes, depth maps, and normal maps from the 3D models. To simulate the coarse shape from a stereo camera, we quantize the ground-truth depth map. For the ground-truth albedo, we use a texture image. To visualize the refinement of the estimated albedo map, we also compute the initial albedo according to Eq. (18) using the coarse normal map.

Baselines Although our setup combining a depth measurement with a flash/no-flash image pair is new and has no directly comparable methods, we compare our shape reconstruction results against the recent depth refinement methods by Han et al. (2013) and Yan et al. (2018). Unlike ours, both baseline methods refine the initial shape using a single color image (i.e., without flash/no-flash image pairs). We therefore aim to verify the effectiveness of our use of flash/no-flash pairs via this comparison.

We implemented the method of Han et al. (2013) ourselves as their source code is not publicly available. For Yan et al.’s method (Yan et al. 2018), we used a trained convolutional neural network provided by the authors. For a fair comparison, we use uniform albedo maps for all objects, since the baseline methods assume uniform albedo while our method can handle spatially-varying albedos. We also use the same initial normal map and shape for all three methods. We measure the mean absolute error (MAbsE) between the estimated and the ground-truth shape.

Results Table 1 summarizes the quantitative comparison with the two baseline methods (Han et al. 2013; Yan et al. 2018). Our method using flash/no-flash image pairs achieves the lowest MAbsE among all methods. Further, the confidence term \(\omega \) in the energy function improves our results in most cases, which verifies the effectiveness of our strategy for handling cast shadows.

Table 1 MAbsE of the depth maps recovered by different methods

Figure 5 shows shape and albedo recovery results by our method along with their coarse initializations and the ground truth. We also show the mean angular error (MAngE) of the normal maps and the MAbsE of the shape and albedo maps. While the coarse normal maps contain only low-frequency content, our method recovers high-frequency details and yields lower errors than the initializations. This verifies the effectiveness of the optimization of Eq. (14) based on our flash/no-flash image formation model of Eq. (10). After the depth-normal fusion, the shape also reflects the recovered details. The albedo map still retains some shading components due to the approximation error of the second-order spherical harmonics and the estimation error introduced by cast shadows in practice, but the error of the estimated albedo is smaller than that of the initial albedo. This quantitative evaluation justifies our shape and albedo recovery pipeline.

Fig. 5

Shape and albedo recovery results on the synthetic Bunny and Statue datasets. The first column shows the rendered flash/no-flash pair. The even rows display the error map. The numbers above the error maps are the mean angular error (MAngE) of normal maps and the mean absolute error (MAbsE) of shape and albedo maps. Our method recovers high-frequency shape details

Figure 6 shows lighting estimation results on synthetic data. We render the flash/no-flash images of the Stanford Bunny with uniform albedo. To verify that estimating the spherical harmonic coefficients \(\mathbf {l}_\mathrm {nf}\) from coarse normals is reliable, we compare relighting images using coefficients estimated from ground-truth normals and from coarse normals. We estimate the flashlight-intensity-scaled coefficients \(\mathbf {l}'\) by Eq. (13), use the coefficients to compute the relighting images by Eq. (8), and compute the absolute error maps between the relighting and no-flash images. For both relighting images, we compute the spherical harmonic bases \(\mathbf {h}(\mathbf {n})\) from ground-truth normals. We cancel the scale ambiguity between \(\mathbf {l}'\) and \(\mathbf {l}_\mathrm {nf}\) using the rendered no-flash image when visualizing the relighting images and computing the absolute error maps. As the spherical harmonics model approximates the shading and assumes no cast shadows, the absolute error maps show that the approximation error is inevitable and mainly appears in cast shadow regions. The comparable relighting results verify that using the initial coarse geometry for spherical harmonics estimation is reliable.

4.2 Experiments Using Smartphones

The camera system we require has become standard in modern smartphones. For example, iPhone models have supported stereo-based depth capture since the iPhone X, released in 2017. This section describes shape and albedo recovery results from images captured by iPhones. To verify our method in practical scenarios, we captured small statues indoors as well as outdoor stone statues in an old shrine. Figure 7 shows the scenes of our image capture in indoor and outdoor environments using an iPhone X. Our method is handy to use, as the recording only requires mounting a smartphone on a tripod.

Image capturing and preprocessing: We implemented a custom iOS application to control the image capture pipeline. Instead of capturing a stereo image pair and performing stereo matching ourselves, we directly acquire the depth map via Apple’s API. Due to API limitations, raw image delivery is unsupported while the stereo camera is used for depth capture. We therefore take one more no-flash shot to acquire a raw image. In summary, one scene capture using an iPhone requires three shots:

  • A depth map associated with the intrinsic parameters from the stereo camera,

  • A raw flash image from the reference camera, and

  • A raw no-flash image from the reference camera.

The flash/no-flash images are taken in auto-exposure mode, and the exposure ratio \(\gamma \) is computed from the EXIF tags.
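For example, \(\gamma \) can be computed from the EXIF exposure time, ISO, and f-number of the two shots, under the common assumption that the recorded intensity scales linearly with shutter time and sensor gain and with the aperture area (a sketch with hypothetical names):

def exposure_ratio(exp_time_f, iso_f, fnum_f, exp_time_nf, iso_nf, fnum_nf):
    """Exposure ratio gamma = c_f / c_nf from EXIF values of the two shots.

    Assumes the common model c ∝ exposure_time * ISO / f_number^2 (aperture
    area scales with the inverse squared f-number).
    """
    c_f = exp_time_f * iso_f / fnum_f ** 2
    c_nf = exp_time_nf * iso_nf / fnum_nf ** 2
    return c_f / c_nf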

The dimensions of the acquired depth maps and flash/no-flash images are \(768 \times 576\) and \(4032 \times 3024\), respectively. To close the resolution gap, we unify their dimensions to \(1008 \times 756\) by rescaling. Specifically, we upsample the depth map with bi-cubic interpolation and downsample the flash/no-flash images with inter-area interpolation. The intrinsic camera parameters (focal length and principal point coordinates) are scaled accordingly. As an implementation detail, we found that the depth map from the stereo camera and the color images from the reference camera are misaligned. Fortunately, we empirically found that the misalignment is a simple fixed offset, and therefore shifted the pixels in the flash/no-flash image pairs to align with the depth map.

Baselines: In addition to the quantitative comparisons on the synthetic dataset, we also compare our results visually with two shape and reflectance estimation methods by Häfner et al. (2018) and Boss et al. (2020). Our method takes as input flash/no-flash images and a depth map, while the baseline methods do not use all of these cues. We simulate Häfner et al.’s setup (Häfner et al. 2018), which uses a color image and a depth map, by removing the no-flash image from our input. Boss et al.’s setup (Boss et al. 2020), which uses a flash/no-flash image pair, is simulated by removing the depth map from our input.

Fig. 6

Lighting estimation from synthetic flash/no-flash images. Both relighting images are computed using GT normals and spherical harmonic coefficients estimated from GT normals (first row) or coarse normals (second row). The major approximation error lies in the cast shadow regions. Estimating spherical harmonic coefficients from coarse normals achieves a comparable relighting result, verifying the correctness of our lighting estimation using coarse normals

Fig. 7

Indoor and outdoor image capturing with phones

We used the implementations released by the authors. For Häfner et al.’s method (Häfner et al. 2018), we followed their default parameter settings and used a \(1008 \times 756\) flash image and a \(768 \times 576\) depth map as input. Since their method does not directly output a normal map, we computed normal maps (Quéau et al. 2018) from the estimated depth maps. For Boss et al.’s method (Boss et al. 2020), we used their trained neural network. To fit the \(256 \times 256\) input image dimension, we cropped and downsampled our flash/no-flash images. As Boss et al. (2020) estimate Cook-Torrance model parameters (Cook and Torrance 1982) as diffuse, roughness, and specular maps, we show the estimated diffuse maps and treat them as albedo maps for notational simplicity.

Results Figure 8 shows a visual comparison using the input from an iPhone X. Overall, our setup combining flash/no-flash imaging and a rough depth map yields high-fidelity shape and albedo recovery. Häfner et al.’s method (Häfner et al. 2018) assumes piece-wise constant albedo; we thus observe noise on the estimated shape when the surface albedo has complex spatial variation (see the stone cow in Fig. 8). Boss et al.’s method (Boss et al. 2020) exploits shading information from only two images, which is inherently ill-posed. As a consequence, the estimated shapes are distorted; for example, concave surfaces can be wrongly estimated as convex, as can be seen at the stone cow’s ear.

Fig. 8

Visual comparison on an iPhone’s input. We use all three input images: flash/no-flash images and a depth map. Removing the no-flash image leads to Häfner et al.’s setup (Häfner et al. 2018), which assumes piece-wise constant albedo and is not suitable for surfaces with complex albedo variation. Removing the depth map leads to Boss et al.’s setup (Boss et al. 2020), which is ill-posed and results in distorted shape estimation. Stereoscopic flash and no-flash photography is key to high-fidelity shape and albedo recovery via a smartphone

Fig. 9

Shape and albedo recovery results from an iPhone X; see Fig. 10 for its camera system. The objects in the first three rows are about 10 cm in height and placed in an office room. The last three rows display outdoor stone statues in an old shrine. Our method is able to recover shape details and surface albedo for both indoor and outdoor objects

Fig. 10

Reconstruction of the same object using smartphone models with different camera/flashlight configurations. The first column depicts the camera systems of the iPhone X, 11, and 12 Pro. “UW”, “W”, “T”, and “F” are short for ultra-wide, wide angle, telephoto camera, and flashlight, respectively. The reference camera in the stereo camera is colored red. Our method generates stable results across different camera/flashlight configurations

Figure 9 displays visual results by our method for cultural heritage artifacts. The first three objects are about 10 cm high and were captured in an office room (Fig. 7, left). Although there is no access to the ground truth, our method qualitatively recovers fine details that are absent in the initial shape derived from the stereo camera, despite the complex albedo. The last three rows of Fig. 9 show stone statues in an old shrine, which are fixed in place outdoors and impossible to move. With our stereoscopic flash/no-flash photography, we can recover fine shapes of such outdoor objects with a commodity smartphone, without requiring special lighting equipment or a darkroom.

To verify that our method works with different camera and flashlight configurations, we captured images of the same object using an iPhone X, 11, and 12 Pro. From the results in Fig. 10, we can see that our method produces stable results on devices with different camera systems, implying that it is applicable to a fairly large number of smartphones.

Fig. 11

Our method breaks down under direct sunlight due to the relatively weak flashlight. The virtual flash-only image (enhanced for visibility) obtained via Eq. (4) hardly provides additional photometric cues, leading to unsatisfactory recovery

Regarding runtime, each object took about 30s on a 2.3 GHz Intel i9 CPU. The computational bottlenecks are the fine normal optimization of Eq. (14) and the depth normal fusion of Eq. (23).

5 Conclusions

We presented a simple imaging setup for high-fidelity shape and albedo recovery using a stereo camera and a flashlight. This setup naturally applies to two-shot images from smartphones with a stereo camera, which have become common today. The quantitative evaluation on synthetic images justifies our shape and albedo recovery pipeline, and the qualitative results on images captured by a smartphone demonstrate our method’s effectiveness in real scenarios. The comparison with related methods shows that our setup is the minimal one for recovering high-fidelity shape and surface albedo via a smartphone.

Practical implications We have verified our method for digitizing cultural heritage artifacts using images captured by off-the-shelf smartphones. This implies that people can immediately turn their smartphones into high-fidelity 3D scanners using our setup and method. We believe that our method is useful in a scenario of crowd-sourced digital archiving, which accelerates the digitization of the world’s cultural heritages.

Limitation Our method breaks down if the object is directly lit by strong environmental lighting, such as sunlight; see Fig. 11 for an example. In this scenario, the flashlight is too weak compared with the sunlight to provide additional photometric cues. This problem might be alleviated if smartphones adopt stronger flashlights in the future. For now, we recommend capturing outdoor objects on cloudy days or around sunrise or sunset. Further, due to the flashlight's light falloff, the object needs to be close to the camera in practice.

Future work Our shape and albedo recovery method is based on images shot from a single viewpoint. A practical extension would be to use multi-view images for recovering complete objects.