1 Introduction

3D surface imaging has wide applications such as 3D modeling, reverse engineering, 3D printing, human body measurement, movies and animation, and human–machine interaction [1]. Phase-shifting-based systems with digital light processing (DLP) projectors achieve high measurement accuracy, but their cost is relatively high [2, 3]. In recent years, consumer depth sensors (e.g., Kinect [4]) have greatly reduced the cost of 3D data acquisition. However, these low-cost sensors offer only limited accuracy. To improve reconstruction quality, one class of methods fuses the aligned depth maps into a volumetric, truncated signed distance function (TSDF) representation, as in KinectFusion [5]. Another popular approach is to combine the normal information obtained by the photometric stereo (PS) method with the depth map obtained by these depth sensors [6,7,8].

Conventional photometric stereo (PS) estimates surface normals from an image sequence taken from a fixed viewpoint under varying directional (distant) lightings. Instead of directly measuring position/depth, PS estimates surface orientations by measuring the shading variations of the surface under different illuminations. A 3D shape, up to a scale, can be obtained via integration over the recovered surface normal field. PS is an excellent solution for obtaining surface normals but not directly for depth, largely because integrating normals is prone to introducing low-frequency biases [9]. Another reason PS is not as popular as other 3D modeling approaches is that it works best in a laboratory setup with a controlled environment [9].

Combining the advantages of depth and normal sensing has been studied in recent years [6,7,8, 10,11,12]. Such fusion approaches achieve high-quality 3D reconstruction by integrating the coarse base geometry estimated from the depth sensor with high-resolution surface normal details from PS. However, among these methods, point light source-based systems have compact structures but usually require specific calibration devices (e.g., mirror spheres) to precisely calibrate the parameters of the light sources [13,14,15]. Under the directional lighting assumption, self-calibration of the light sources can be realized [7], but to approximate directional lighting the light sources must be placed far away from the targets to be measured, which makes such systems bulky. Besides, some methods assume that the target surface has a uniform albedo [6], so they cannot be applied to multi-albedo targets, which are very common in practical applications.

Motivated by the shortcomings of existing systems, our goal is to design a practical PS system that meets the following conditions:

(1) The near point light source model is adopted to make the system compact.

(2) The parameters of the point light sources can be calibrated automatically.

(3) The system can be adapted to multi-albedo objects.

(4) The PS method and the stereo matching method are combined to reconstruct surfaces with rich details and high accuracy.

Following the guidelines above, we design a 3D imaging system combining the PS method and traditional stereo vision, called the photo-geometric depth camera. The system consists of two CMOS cameras, a near-infrared speckle projector, four near-infrared LEDs, and a synchronization circuit, as shown in Figs. 1 and 5. The hardware cost of the experimental prototype is about $400.

The main contributions of this paper are:

(1) A high-precision 3D imaging system is introduced, which combines photometric stereo and binocular stereo. The system is low-cost and compact and can be adapted to multi-albedo targets in a general indoor environment.

(2) A point light source auto-calibration algorithm is proposed. Traditional point light source calibration methods usually require specific calibration objects (such as mirror spheres). In our method, we first use the distant lighting model [7] to estimate the initial surface albedo map. Then, with the estimated albedo map and the normal vector field fixed, the parameters of the near lighting model are optimized. Next, with the optimized lighting model, we use the near-light photometric stereo (NLPS) method to re-compute the albedo map and use the method of [9] to compute a higher-quality depth map. We repeat the above two steps iteratively until convergence or until the number of iterations reaches a predefined maximum.

The remainder of this paper is organized as follows: The details of the proposed system are presented in Sect. 2, and experimental results are provided in Sect. 3. Finally, Sect. 4 concludes the paper.

2 Photo-geometric depth camera

2.1 Hardware

Our photo-geometric depth camera consists of two CMOS cameras, a Kinect-type near-infrared (NIR) speckle projector, four NIR LEDs with a wavelength of 830 nm, and a microcontroller-based circuit, as shown in Fig. 1. The cameras have a frame rate of 60 fps and a resolution of \(1280\times 960\) and are connected to the PC via Gigabit Ethernet interfaces. Narrowband filters are mounted on the 4-mm-focal-length lenses to filter out ambient light. The two cameras and the four LEDs are mounted on a rigid structure to keep their relative positions stable. The cameras, the LEDs, and the projector are synchronized by the trigger signal of the microcontroller-based circuit, which is controlled by the PC via a USB 2.0 interface. The NIR projector emits a large number of random speckles to enhance the surface texture.

Fig. 1
figure 1

The proposed photo-geometric depth camera

As illustrated in Fig. 2, in a reconstruction cycle, the projector is lit first and the two cameras are triggered simultaneously to capture a pair of speckle images. This stereo image pair is used to generate the initial depth map. Then, the four LEDs are lit one by one, and the cameras are triggered to capture an image under the illumination of each LED. Of these images, those from the left camera are used for the photometric computation.

Fig. 2
figure 2

The timing diagram of a reconstruction cycle

2.2 Near-light photometric stereo

We use the near point light source assumption in our image formation model [11]. In the jth image, the light vector \(\mathbf l _{ij}\) from the surface point \(\mathbf x _i\) to the light source \(\mathbf s _j\) is written as

$$\begin{aligned} \mathbf l _{ij}=\mathbf s _j-\mathbf x _i \end{aligned}$$
(1)

With the near light source assumption, the intensity observation \(o_i\) is modeled, accounting for the inverse square law, as

$$\begin{aligned} o_i=E_j\rho _i\frac{\mathbf l _{ij}\cdot \mathbf n _i}{\Vert \mathbf l _{ij}\Vert ^3} \end{aligned}$$
(2)

where \(E_j\) is the light source intensity at a unit distance, \(\rho _i\) is the surface albedo, and \(\mathbf n _i\) is the surface normal vector (Fig. 3).

Fig. 3
figure 3

Near-light photometric stereo model

Once the light source parameters are known, we can estimate the normal vector \(\mathbf n _i\) and the albedo \(\rho _i\) according to Eq. (2) from at least three observations, since the product \(\rho _i\mathbf n _i\) has three unknowns per pixel. To balance efficiency and quality, we use four point light sources in our setup.
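As a concrete illustration, the following Python/NumPy sketch recovers the albedo and normal at a single surface point from Eq. (2), assuming the light positions and intensities are already known; the function and variable names are ours, not part of the original system.

```python
import numpy as np

def nlps_pixel(o, x, S, E):
    """Recover albedo rho_i and normal n_i at one surface point via Eq. (2).

    o : (J,) observed intensities under the J lights
    x : (3,) 3D position x_i of the surface point
    S : (J, 3) point light source positions s_j
    E : (J,) light source intensities at unit distance
    """
    L = S - x                                  # light vectors l_ij = s_j - x_i (Eq. (1))
    d3 = np.linalg.norm(L, axis=1) ** 3        # ||l_ij||^3: inverse-square law + distance
    A = (E / d3)[:, None] * L                  # o_j = A[j] . b, with b = rho * n
    b, *_ = np.linalg.lstsq(A, o, rcond=None)  # least-squares solve, needs J >= 3
    rho = np.linalg.norm(b)                    # albedo = magnitude of b
    return rho, b / rho                        # unit normal = direction of b
```

With the four LEDs of our setup, \(J=4\) gives one redundant observation per pixel, which slightly suppresses noise compared with the minimal three-light case.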

2.3 Reconstruction pipeline

We use the stereo camera calibration method in [16] to calibrate our stereo cameras. The light sources are self-calibrated using the method discussed in Sect. 2.5.

Given the calibrated cameras and lightings, the 3D reconstruction pipeline is illustrated in Fig. 4. Firstly, the stereo matching method is applied to the speckle image pair to generate the initial depth map. Then, the photometric computation is applied to the four images under the illumination of the four LEDs, respectively, to generate the surface normal vector field. Finally, the obtained initial depth map and the normal vector field are integrated to generate the higher-quality depth map.

Fig. 4
figure 4

The pipeline of the reconstruction

2.4 Initial depth generation

The semi-global matching (SGM) method [17] performs energy minimization using dynamic programming along multiple 1D paths. The energy function consists of three terms: a data term for photo-consistency, a smoothness term for slanted surfaces that change the disparity slightly (parameter \(\hbox {P}_1\)), and a smoothness term for depth discontinuities (parameter \(\hbox {P}_2\)). Because SGM offers a good balance between efficiency and accuracy, we use it to estimate the initial depth map.
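For reference, a minimal sketch of this step using OpenCV's StereoSGBM matcher (a block-based SGM variant) is shown below; the focal length, baseline, and matcher settings are illustrative placeholders, not our actual configuration.

```python
import cv2
import numpy as np

f_px, baseline_mm = 1200.0, 176.92   # focal length in pixels (assumed) and baseline

sgm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,   # disparity search range, must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,         # small penalty: slanted surfaces, slight disparity changes
    P2=32 * 5 * 5,        # large penalty: depth discontinuities
    uniquenessRatio=10,
)

left = cv2.imread("speckle_left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("speckle_right.png", cv2.IMREAD_GRAYSCALE)
disp = sgm.compute(left, right).astype(np.float32) / 16.0      # fixed-point -> pixels
depth_mm = np.where(disp > 0, f_px * baseline_mm / disp, 0.0)  # depth = f * B / d
```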

2.5 Point light source self-calibration

This section proposes a new calibration method for the point light sources, covering both the geometric parameters and the light intensity. With this self-calibration method, our system does not rely on dedicated calibration objects such as mirror spheres [13,14,15], which makes the system more flexible and practical.

To estimate \(E_j\) and \(\mathbf s _j\), we use the system described in Sect. 2.1 to capture five image pairs of the target according to the timing diagram shown in Fig. 2. Our light source calibration first makes a distant lighting assumption and estimates a rough albedo map. Then, an iterative optimization is applied to estimate the parameters of the near light sources. The calibration algorithm is summarized as follows.

Algorithm 1: Point light source calibration

(1) Initialization

Rough depth map generation Each stereo pair is rectified to obtain a row-aligned epipolar geometry. The stereo matching method described in Sect. 2.4 is applied to the speckle image pair to generate the initial depth map \(\mathbf D _0\) of the target. A bilateral filter [18] is then applied to the raw depth map to obtain a noise-reduced, discontinuity-preserving depth map \(\mathbf D '_0\):

$$\begin{aligned} \mathbf D '_0(\mathbf u ) = \frac{1}{W_p}\sum _{\mathbf q \in nb(\mathbf u )}N_{\sigma _s}(\Vert \mathbf u -\mathbf q \Vert )N_{\sigma _r}(\vert \mathbf D _0(\mathbf u )-\mathbf D _0(\mathbf q )\vert )\mathbf D _0(\mathbf q ) \end{aligned}$$
(3)

where \(N_\sigma (t) = \exp (-t^2\sigma ^{-2})\) with \(\sigma _s\) and \(\sigma _r\) the spatial and range bandwidths, \(nb(\mathbf u )\) denotes the neighborhood of the pixel \(\mathbf u \), and \(W_p\) is a normalizing constant.
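A direct (unoptimized) NumPy sketch of Eq. (3) follows; the parameter values are illustrative, and treating zero depths as holes is an assumption on our part.

```python
import numpy as np

def bilateral_depth(D, radius=3, sigma_s=2.0, sigma_r=10.0):
    """Discontinuity-preserving smoothing of a depth map per Eq. (3)."""
    D = np.asarray(D, dtype=np.float64)
    H, W = D.shape
    out = np.zeros_like(D)
    for v in range(H):
        for u in range(W):
            v0, v1 = max(v - radius, 0), min(v + radius + 1, H)
            u0, u1 = max(u - radius, 0), min(u + radius + 1, W)
            patch = D[v0:v1, u0:u1]
            yy, xx = np.mgrid[v0:v1, u0:u1]
            w = (np.exp(-((yy - v) ** 2 + (xx - u) ** 2) / sigma_s ** 2)  # spatial kernel N_sigma_s
                 * np.exp(-(patch - D[v, u]) ** 2 / sigma_r ** 2)         # range kernel N_sigma_r
                 * (patch > 0))                                           # ignore invalid depths
            Wp = w.sum()                                                  # normalizer W_p
            out[v, u] = (w * patch).sum() / Wp if Wp > 0 else 0.0
    return out
```

In practice a vectorized or built-in implementation (e.g., cv2.bilateralFilter on float32 images) would replace this per-pixel loop.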

Fig. 5
figure 5

Prototype of the proposed photo-geometric depth camera

Initial position estimation According to the mounting position of the LEDs relative to the reference camera (we use the left camera as the reference camera), we can estimate an initial value \(\mathbf s _{j,0}\) for \(\mathbf s _j\).

Initial albedo estimation We follow the automatic calibration method of [7] under the distant lighting assumption to estimate the initial albedo map \(\rho \). We first robustly estimate a rank-3 approximation of the observed brightness matrix using an iterative re-weighting method and then factorize this rank-reduced brightness matrix into the corresponding lighting, albedo, and surface normal components.
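The core of this initialization can be sketched with a plain SVD, leaving out the robust iterative re-weighting of [7]; note that the factorization is only determined up to an invertible \(3\times 3\) ambiguity, which must still be resolved separately (e.g., against the rough normals from the depth map), so this is a simplified illustration rather than the full method.

```python
import numpy as np

def rank3_factorize(O):
    """Factor the (J, N) brightness matrix O under distant Lambertian lighting.

    O ~ L^T B, with L (3, J) the lightings and B = rho * n (3, N),
    so a rank-3 truncated SVD yields both factors up to a 3x3 ambiguity.
    """
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    L = (U[:, :3] * np.sqrt(s[:3])).T        # lighting directions/intensities
    B = np.sqrt(s[:3])[:, None] * Vt[:3]     # scaled normals rho * n
    rho = np.linalg.norm(B, axis=0)          # initial albedo map (up to global scale)
    return L, rho, B / rho                   # albedos and unit normals
```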

Initial intensity estimation With the depth map \(\mathbf D '_0\) and the camera parameters, the point cloud \(\mathbf x _i\) of the target surface can be generated. Furthermore, the surface normal vectors \(\mathbf n _i\) can be estimated from the point cloud [19]. Now, using Eq. (2), we can estimate the initial value of \(E_j\) with a linear least-squares fit:

$$\begin{aligned} E_j=\frac{\sum _{i=1}^No_i\Vert \mathbf l _{ij}\Vert ^3}{\sum _{i=1}^N\rho _i\mathbf l _{ij}\cdot \mathbf n _i} \end{aligned}$$
(4)
where N is the number of surface points.

Fig. 6
figure 6

Convergence curve of the iterative optimization of the light source parameters for the dataset shown in Fig. 4. The error is defined in Eq. (5)
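A one-function sketch of the closed-form estimate in Eq. (4), with illustrative names:

```python
import numpy as np

def initial_intensity(o, X, nrm, rho, s_j):
    """Initial estimate of E_j from Eq. (4).

    o : (N,) observations under LED j, X : (N, 3) surface points from D'_0,
    nrm : (N, 3) normals estimated from the point cloud, rho : (N,) albedos.
    """
    L = s_j - X                                        # l_ij = s_j - x_i
    d3 = np.linalg.norm(L, axis=1) ** 3                # ||l_ij||^3
    num = (o * d3).sum()                               # sum_i o_i ||l_ij||^3
    den = (rho * np.einsum("ij,ij->i", L, nrm)).sum()  # sum_i rho_i l_ij . n_i
    return num / den
```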

(2) Position and intensity refinement

Fitting Eq. (2) over all surface points is a typical nonlinear least-squares problem. Starting from the estimated initial values, we use the Levenberg–Marquardt (LM) algorithm [20] to optimize \(E_j\) and \(\mathbf s _j\) with the albedo map and the normal field held fixed. The cost function is defined as:

$$\begin{aligned} cost(\mathbf E , \mathbf S ) = \sqrt{\frac{1}{4N}\sum _{j=1}^4\sum _{i=1}^N\left( o_i-E_j\rho _i\frac{(\mathbf s _j-\mathbf x _i)\cdot \mathbf n _i}{\Vert \mathbf s _j-\mathbf x _i\Vert ^3}\right) ^2} \end{aligned}$$
(5)

where \(\mathbf E =\left\{ E_j\right\} \), \(\mathbf S =\left\{ \mathbf s _j\right\} \), \(j=1,2,3,4\).
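A minimal sketch of this refinement using SciPy's Levenberg–Marquardt implementation is given below; it treats Eq. (5) directly as a residual vector over all points and lights, with all names illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_lights(E0, S0, o, X, nrm, rho):
    """Jointly refine E_j and s_j (j = 1..4) by minimizing Eq. (5).

    E0 : (4,) initial intensities, S0 : (4, 3) initial positions,
    o : (4, N) observations, X : (N, 3) points, nrm : (N, 3) normals,
    rho : (N,) albedos; the albedo map and normal field stay fixed.
    """
    def residuals(p):
        E, S = p[:4], p[4:].reshape(4, 3)
        res = []
        for j in range(4):
            L = S[j] - X                                            # l_ij
            d3 = np.linalg.norm(L, axis=1) ** 3
            pred = E[j] * rho * np.einsum("ij,ij->i", L, nrm) / d3  # Eq. (2)
            res.append(o[j] - pred)                                 # per-point residuals
        return np.concatenate(res)

    p0 = np.concatenate([E0, S0.ravel()])
    sol = least_squares(residuals, p0, method="lm")                 # Levenberg-Marquardt
    return sol.x[:4], sol.x[4:].reshape(4, 3)
```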

Fig. 7
figure 7

Comparisons of the estimated albedo maps and the normal vector fields. a The gray images of the three targets and the results of the estimated albedo maps. b Results of the estimated normal vector fields. In a, the first column shows the gray images, the second column shows the results of [7], and the last column shows our results. In b, the left shows the results of [7], and the right shows ours

Fig. 8
figure 8

Comparisons of reconstruction results. a Results of the male. b Results of the female. c Results of the shoe. The left column is the results of the stereo matching, and the right column is the results after fusion

(3) Updating albedo map, surface points, and normals

With the optimized \(E_j\) and \(\mathbf s _j\), we use the near lighting model (Eq. (2)) to re-compute the albedo map and the normal field. By combining the rough depth map and the normal field with the method in [9], a higher-quality depth map can be acquired. Note that the normal field used in the next optimization round is obtained from the optimized depth map, rather than from the PS method.

(4) Iterative optimization

Return to Step (2) and repeat until convergence or until the number of iterations reaches the predefined maximum.

2.6 Depth normal fusion

To estimate the optimal depth by combining the normal vector field from the PS method with the rough depth map from the stereo matching, we form a linear system of equations as in [9] to refine the quality of the reconstructed surface:

$$\begin{aligned} \begin{bmatrix} \lambda \mathbf I \\ \nabla ^2 \end{bmatrix} \mathbf D ^*= \begin{bmatrix} \lambda \mathbf D \\ \partial \mathbf N ^* \end{bmatrix} \end{aligned}$$
(6)

where \(\mathbf D ^*\) is the refined depth map, \(\nabla ^2\) is a Laplacian operator, \(\mathbf I \) is an identity matrix, and \(\lambda \) is a weighting parameter controlling the contribution of the depth constraint. \(\partial \mathbf N ^*\) stacks \(-\frac{\partial }{\partial x}\frac{n_x}{n_z}-\frac{\partial }{\partial y}\frac{n_y}{n_z}\) for each normal \(\mathbf n \in \mathbf N ^*\). Although this forms a large linear system of equations, the left-hand matrix is sparse, so the system can be solved efficiently using existing sparse linear solvers (e.g., CHOLMOD [21]).
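A compact SciPy sketch of Eq. (6) on a dense pixel grid follows; it uses a generic sparse least-squares solver rather than CHOLMOD, and it glosses over invalid pixels and image-boundary handling.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def fuse_depth_normals(D, nrm, lam=0.1):
    """Refine depth by solving Eq. (6) in the least-squares sense.

    D : (H, W) rough depth map, nrm : (H, W, 3) PS normal field
    (n_z assumed nonzero), lam : weight of the depth constraint.
    """
    H, W = D.shape
    n = H * W
    # 5-point Laplacian stencil on the pixel grid (boundary rows are approximate)
    Lap = sp.diags([-4.0, 1.0, 1.0, 1.0, 1.0], [0, -1, 1, -W, W],
                   shape=(n, n), format="csr")
    # target divergence from the normals: p = -n_x/n_z, q = -n_y/n_z
    p = -nrm[..., 0] / nrm[..., 2]
    q = -nrm[..., 1] / nrm[..., 2]
    div = np.gradient(p, axis=1) + np.gradient(q, axis=0)  # dN* of Eq. (6)
    A = sp.vstack([lam * sp.identity(n), Lap]).tocsr()     # [lam*I; Laplacian]
    b = np.concatenate([lam * D.ravel(), div.ravel()])     # [lam*D; dN*]
    return lsqr(A, b)[0].reshape(H, W)                     # refined depth D*
```

Larger \(\lambda \) pulls \(\mathbf D ^*\) toward the stereo depth (low frequencies), while smaller \(\lambda \) lets the normal-derived Laplacian dominate (high-frequency detail).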

3 Experimental results

Figure 5 shows the prototype of the proposed photo-geometric depth camera. The baseline length of the stereo system is 176.92 mm, and the dimensions of the depth camera are \(260\, \hbox {mm} \times 76 \, \hbox {mm}\times 150 \, \hbox {mm}\).

3.1 Qualitative evaluation

We first use three targets, a male subject, a female subject, and a shoe, to evaluate our depth camera. The gray images of the three targets are shown in Fig. 7a.

Figure 6 shows the convergence curve of the iterative optimization process for the target in Fig. 4. The Y-axis is the root-mean-square error defined in Eq. (5). The error converges after about 10 iterations.

We compare our method with the distant lighting model [7]. Figure 7a shows the estimated albedo maps, and Fig. 7b shows the estimated normal vector fields. From Fig. 7a, we can see that the albedo of the eyebrows of the two persons is relatively low and the albedo of the lettering on the male subject's clothing is relatively high. Our results correctly reflect these facts, whereas the method in [7] does not. The albedo bias of [7] is also severe for the shoe. Furthermore, the surface normal vectors estimated by [7] in the face regions are severely biased. These results show that the auto-calibration method in Sect. 2.5 greatly improves the quality of the estimated albedos and normals.

Figure 8 shows the reconstruction results of the three targets. Figure 8a shows the results of the male shown in Fig. 7a, Fig. 8b shows the results of the female, and Fig. 8c shows the results of the shoe. The left column is the result using the speckle images alone, and the right column is the result after combining the initial depth map from the stereo matching with the normal vector field from the PS method. These results show that the reconstruction quality is improved remarkably by incorporating the near-light PS method, in which the parameters of the point light sources are calibrated automatically using the method described in Sect. 2.5.

We also compare our depth camera with Kinect, a popular consumer depth camera. Figure 9a shows the reconstruction results of a 30-cm-tall David sculpture, and Fig. 9b shows the results of a human face; in each, the left is the Kinect result and the right is ours. The voxel volume resolution of KinectFusion is set to \(512^3\). The results of our depth camera exhibit higher quality in reconstructed details.

3.2 Quantitative evaluation

Figure 10 shows the quantitative evaluation results, where the result of a commercial phase-shifting system with a nominal accuracy of 0.025 mm is treated as the reference model. The David sculpture is scanned from 14 different views using this system. The obtained point clouds are stitched together, and the triangular mesh model is then computed using Geomagic Qualify [22]. Figure 10a shows the comparison with KinectFusion in geometric accuracy. The root-mean-square error (RMSE) of Kinect is 1.58 mm, and the RMSE of our depth camera is 0.36 mm. Figure 10b, c shows the comparison of the estimated normals and albedos with [7]. To evaluate the estimated normals quantitatively, the normals of the reference model are taken as the ground truth. We first align the reconstructed point cloud with the reference model, find the closest point in the reference model for each point in the reconstructed point cloud as its correspondence, and then compare their normals. The mean angle error of the normals using our method is \(8.4^{\circ }\), compared with \(16.3^{\circ }\) for [7]. For albedos, because the surface of the David sculpture is uniform, we assume the ground-truth albedo is one everywhere. The estimated albedos of the two methods have no unified scale, so we first align them to the ground-truth albedos by estimating an optimal scale before error evaluation, as sketched below. The RMSE of our method is 0.232, and the RMSE of [7] is 0.547.
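The scale alignment mentioned above reduces to a one-line least-squares fit; a sketch under the stated unit-albedo assumption:

```python
import numpy as np

def align_albedo_scale(a):
    """Scale estimated albedos a to best match a ground truth of 1 everywhere.

    Minimizing sum_i (s * a_i - 1)^2 over s gives s = sum(a) / sum(a^2).
    """
    s = a.sum() / (a ** 2).sum()
    return s * a
```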

Fig. 9
figure 9

Comparison with Kinect. a Results of a David sculpture. b Result of a human face. The left shows the results of Kinect, and the right shows the results of our depth camera

Fig. 10
figure 10

Quantitative comparison. a The left is the reference model obtained by the high accuracy phase-shifting system, the middle is the error map of KinectFusion, and the right is the error map of our depth camera. The gray regions in error maps indicate the invalid points. b The first is the “ground-truth” normal map calculated from the reference model, the second and the third are the normal maps obtained by our method and [7], respectively, and the fourth and the fifth are the corresponding error maps. c The first two images show the albedo maps obtained by our method and [7], respectively, and the last two are the corresponding error maps

Fig. 11
figure 11

Comparisons between the manual method [14] and the proposed method. The left is the normal field computed from the smoothed rough depth map, whose low-frequency components are accurate but which lacks high-frequency components. The middle and the right are the normal fields computed by the PS method using the lighting parameters from the manual and automatic calibration methods, respectively

Furthermore, we calibrate the 3D locations of the four LEDs using a mirror sphere as in [14]. This calibration method requires the mirror sphere to be placed at no fewer than two different locations; to get more accurate results, we capture five images of the sphere at five different locations. The mean distance deviation of the four 3D points between the manual calibration and the automatic calibration is 3.42 cm. In Fig. 11, the left is the normal field computed from the smoothed rough depth map, whose low-frequency components are accurate but which lacks high-frequency detail. The middle and the right are the normal fields computed by the PS method using the lighting parameters from the manual and automatic calibration methods, respectively. Visually, the quality of the automatic method is only slightly worse than that of the manual method. The PS method recovers the high-frequency components, but there is a deviation in the low-frequency components. Therefore, in the fusion process [9], the low-frequency components of the normal field from the depth map are used to correct the normals from the PS method.

4 Summary

In this paper, we design a photo-geometric depth camera by combining near point light source photometric stereo with a speckle-based stereo matching method. The depth camera is compact in structure and suitable for multi-albedo targets. The parameters (including position and intensity) of the light sources can be self-calibrated. To realize the auto-calibration of the point light sources, we first use the distant lighting model [7] to estimate the initial surface albedo map. Then, with the estimated albedo map and the normal vector field fixed, the parameters of the near lighting model are optimized. Next, with the optimized lighting model, we use the NLPS method to re-compute the albedo map and use the method of [9] to compute a higher-quality depth map. These two steps are repeated iteratively until convergence or until the number of iterations reaches the predefined maximum. Experiments demonstrate that the proposed depth camera can reconstruct multi-albedo targets with high fidelity in a general indoor environment.

In the current implementation, the images captured by the cameras are transmitted to the computer before processing. In future work, we will embed an FPGA-based processing system in the depth camera so that depth maps can be generated by the camera directly.