
2.1 Introduction

Time-of-Flight cameras emit modulated infrared light and detect its reflection from the illuminated scene points. According to the tof principle described in Chap. 1, the detected signal is gated and integrated using internal reference signals, from which the tangent of its phase \(\phi \) is formed. Since the tangent of \(\phi \) is a periodic function with a period of \(2\pi \), the value \(\phi +2n\pi \) gives exactly the same tangent value for any nonnegative integer \(n\).

Commercially available tof cameras compute \(\phi \) on the assumption that \(\phi \) is within the range of \([0, 2\pi )\). For this reason, each modulation frequency \(f\) has its maximum range \(d_{\max }\) corresponding to \(2\pi \), encoded without ambiguity:

$$\begin{aligned} d_{\max }=\frac{c}{2f}, \end{aligned}$$
(2.1)

where \(c\) is the speed of light. For any scene point farther than \(d_{\max }\), the measured distance \(d\) is shorter than the actual distance \(d+nd_{\max }\), for some positive integer \(n\). This phenomenon is called phase wrapping, and estimating the unknown number of wrappings \(n\) is called phase unwrapping.

For example, the Mesa SR4000 [16] camera records a 3D point \(\varvec{\mathrm{X}}_p\) at each pixel \(p\), where the measured distance \(d_p\) equals \(\Vert \varvec{\mathrm{X}}_p\Vert \). In this case, the unwrapped 3D point \(\varvec{\mathrm{X}}_p(n_p)\) with number of wrappings \(n_p\) can be written as

$$\begin{aligned} \varvec{\mathrm{X}}_p(n_p)=\frac{d_p+n_pd_{\max }}{d_p}\varvec{\mathrm{X}}_p. \end{aligned}$$
(2.2)
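As a brief illustration of Eqs. (2.1) and (2.2), the following Python sketch computes the maximum range for a given modulation frequency and rescales a measured 3D point by a candidate number of wrappings. The 30 MHz frequency and the point coordinates are only example values, not taken from any particular camera.

```python
import numpy as np

C = 299_792_458.0            # speed of light [m/s]

def max_range(f_mod):
    """Maximum unambiguous range d_max = c / (2 f), Eq. (2.1)."""
    return C / (2.0 * f_mod)

def unwrap_point(X_p, d_max, n_p):
    """Rescale a measured 3D point along its ray, Eq. (2.2)."""
    d_p = np.linalg.norm(X_p)                  # measured (wrapped) distance
    return (d_p + n_p * d_max) / d_p * X_p

d_max = max_range(30e6)                        # ~5.0 m at 30 MHz (example value)
X_p = np.array([1.0, 0.5, 3.0])                # hypothetical measured point
X_unwrapped = unwrap_point(X_p, d_max, n_p=1)  # actual point if wrapped once
```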

Figure 2.1a shows a typical depth map acquired by the SR4000 [16], and Fig. 2.1b shows its unwrapped depth map. As shown in Fig. 2.1e, phase unwrapping is crucial for recovering large-scale scene structure.

Fig. 2.1

Structure recovery through phase unwrapping. a Wrapped tof depth map. b Unwrapped depth map corresponding to (a). Only the distance values are displayed in (a) and (b), to aid visibility. The intensity is proportional to the distance. c Amplitude image associated with (a). d and e display the 3D points corresponding to (a) and (b), respectively. d The wrapped points are displayed in red. e Their unwrapped points are displayed in blue. The remaining points are textured using the original amplitude image (c)

To increase the usable range of tof cameras, it is also possible to extend the maximum range \(d_{\max }\) by decreasing the modulation frequency \(f\). In this case, the integration time should also be extended in order to acquire a high-quality depth map, since the depth noise is inversely proportional to \(f\). With an extended integration time, moving objects are more likely to cause motion artifacts. In addition, without exact knowledge of the scale of the scene, we cannot know in advance which modulation frequency is low enough to avoid phase wrapping.

If we can accurately unwrap a depth map acquired at a high modulation frequency, then the unwrapped depth map will suffer less from noise than a depth map acquired at a lower modulation frequency and integrated for the same time. Also, if a phase-unwrapping method does not require exact knowledge of the scale of the scene, then the method will be applicable to a wider range of large-scale environments.

There exist a number of phase-unwrapping methods [4–8, 14, 17, 20, 21] that have been developed for tof cameras. These methods can be categorized into two groups, according to the number of input depth maps: those using a single depth map [5, 7, 14, 17, 21] and those using multiple depth maps [4, 6, 8, 20]. The following subsections introduce their principles, advantages, and limitations.

2.2 Phase Unwrapping from a Single Depth Map

tof cameras such as the SR4000 [16] provide an amplitude image along with its corresponding depth map. The amplitude image encodes the strength of the detected signal, which is inversely proportional to the squared distance. To obtain the corrected amplitude \(A^{\prime }\) [19], which is proportional to the reflectivity of a scene surface with respect to the infrared light, we can multiply the amplitude \(A\) by its corresponding squared distance \(d^2\):

$$\begin{aligned} A^\prime =A d^2. \end{aligned}$$
(2.3)
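In array form, the correction of Eq. (2.3) is a single elementwise operation; the sketch below assumes that the amplitude and distance images are available as NumPy arrays of the same shape.

```python
import numpy as np

def corrected_amplitude(A, d):
    """Corrected amplitude A' = A * d^2, Eq. (2.3).

    A and d are image-sized arrays of measured amplitude and distance.
    In wrapped regions, d underestimates the true distance, so A' tends
    to be abnormally low there (cf. Fig. 2.2c).
    """
    return A * d ** 2
```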

Figure 2.2 shows an example of amplitude correction. It can be observed from Fig. 2.2c that the corrected amplitude is low in the wrapped region. Based on the assumption that the reflectivity is constant over the scene, the corrected amplitude values can play an important role in detecting wrapped regions [5, 17, 21].

Fig. 2.2

Amplitude correction example. a Amplitude image. b tof depth map. c Corrected amplitude image. The intensity in (b) is proportional to the distance. The lower left part of (b) has been wrapped. Images courtesy of Choi et al. [5]

Poppinga and Birk [21] use the following inequality for testing if the depth of pixel \(p\) has been wrapped:

$$\begin{aligned} A^\prime_p \le A^\mathrm{{ref}}_{p} T, \end{aligned}$$
(2.4)

where \(T\) is a manually chosen threshold, and \(A^\mathrm{{ref}}_{p}\) is the reference amplitude of pixel \(p\) when viewing a white wall at 1 m, approximated by

$$\begin{aligned} A^\mathrm{{ref}}_p = B-\bigl ((x_p - c_x)^2 + (y_p - c_y)^2\bigr ), \end{aligned}$$
(2.5)

where \(B\) is a constant. The image coordinates of \(p\) are \((x_p,y_p)\), and \((c_x,c_y)\) is approximately the image center, which is usually better illuminated than the periphery. The reference amplitude \(A^\mathrm{{ref}}_p\) compensates for this effect, lowering the threshold \(A^\mathrm{{ref}}_p T\) when pixel \(p\) lies in the periphery.

After the detection of wrapped pixels, it is possible to directly obtain an unwrapped depth map by setting the number of wrappings of the wrapped pixels to one on the assumption that the maximum number of wrappings is 1.
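A minimal sketch of this detection-and-unwrapping scheme is given below, assuming NumPy image arrays and manually chosen constants \(B\) and \(T\); the image center is approximated by the midpoint of the image, and at most one wrapping is assumed, as in [21].

```python
import numpy as np

def reference_amplitude(xs, ys, cx, cy, B):
    """Reference amplitude of Eq. (2.5): brighter near the image center."""
    return B - ((xs - cx) ** 2 + (ys - cy) ** 2)

def unwrap_by_amplitude_test(A, d, d_max, B, T):
    """Detect wrapped pixels with Eq. (2.4) and unwrap them by one period.

    A, d : amplitude and distance images; B, T : manually chosen constants.
    Assumes the maximum number of wrappings is 1, as in Poppinga and Birk [21].
    """
    h, w = d.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A_corr = A * d ** 2                                   # Eq. (2.3)
    A_ref = reference_amplitude(xs, ys, w / 2.0, h / 2.0, B)
    wrapped = A_corr <= A_ref * T                         # Eq. (2.4)
    return np.where(wrapped, d + d_max, d)
```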

The assumption of constant reflectivity tends to break down when the scene is composed of different objects with varying reflectivity. This assumption cannot be fully relaxed without detailed knowledge of the scene reflectivity, which is hard to obtain in practice. To handle varying reflectivity robustly, it is possible to set the threshold adaptively for each image and to enforce spatial smoothness on the detection results.

Choi et al. [5] model the distribution of corrected amplitude values in an image using a mixture of Gaussians with two components, and apply expectation maximization [1] to learn the model:

$$\begin{aligned} p(A^\prime _p)=\alpha _H p(A^\prime _p|\mu _H, \sigma ^2_H)+ \alpha _L p(A^\prime _p|\mu _L, \sigma ^2_L), \end{aligned}$$
(2.6)

where \(p(A^\prime _p|\mu , \sigma ^2)\) denotes a Gaussian distribution with mean \(\mu \) and variance \(\sigma ^2\), and \(\alpha \) is the mixing coefficient of each component. The components \(p(A^\prime _p|\mu _H, \sigma ^2_H)\) and \(p(A^\prime _p|\mu _L, \sigma ^2_L)\) describe the distributions of high and low corrected amplitude values, respectively; the subscripts \(H\) and \(L\) denote the labels high and low. Using the learned distribution, it is possible to write a probabilistic version of Eq. (2.4) as

$$\begin{aligned} P(H|A^\prime _p)<0.5, \end{aligned}$$
(2.7)

where \(P(H|A^\prime _p)={\alpha _H p(A^\prime _p|\mu _H, \sigma ^2_H)}/{p(A^\prime _p)}\).
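The following sketch evaluates the test of Eq. (2.7) on a corrected amplitude image. Choi et al. [5] fit the mixture of Eq. (2.6) with expectation maximization [1]; here, scikit-learn's GaussianMixture is used as a convenient stand-in for the EM step, which is an implementation choice rather than part of the original method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def wrapped_probability(A_corr):
    """Fit the two-component mixture of Eq. (2.6) and evaluate P(H | A'_p).

    Returns, for every pixel, the posterior probability of the 'high'
    component; pixels with a value below 0.5 are candidates for being
    wrapped, as in Eq. (2.7).
    """
    samples = A_corr.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, covariance_type="full")
    gmm.fit(samples)                        # expectation maximization
    high = int(np.argmax(gmm.means_))       # the component with the larger mean is 'H'
    p_components = gmm.predict_proba(samples)
    return p_components[:, high].reshape(A_corr.shape)
```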

To enforce spatial smoothness on the detection results, Choi et al. [5] use a segmentation method [22] based on Markov random fields (MRFs). The method finds the binary labels \(n \in \{H, L\}\) or \(\{0,1\}\) that minimize the following energy:

$$\begin{aligned} E=\sum \limits _p {D_p(n_p)} + \sum \limits _{(p,q)} {V(n_p,n_q)}, \end{aligned}$$
(2.8)

where \(D_p(n_p)\) is a data cost that is defined as \(1-P(n_p|A^\prime _p)\), and \(V(n_p,n_q)\) is a discontinuity cost that penalizes a pair of adjacent pixels \(p\) and \(q\) if their labels \(n_p\) and \(n_q\) are different. \(V(n_p,n_q)\) is defined so that the penalty increases when a pair of adjacent pixels has similar corrected amplitude values:

$$\begin{aligned} V(n_p,n_q) = \lambda \exp \bigl (-\beta (A^\prime _p - A^\prime _q)^2\bigr ) \, \delta (n_p \ne n_q), \end{aligned}$$
(2.9)

where \(\lambda \) and \(\beta \) are constants, which are either manually chosen or adaptively determined. \(\delta (x)\) is a function that evaluates to 1 if its argument is true and evaluates to zero otherwise.
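A sketch of the resulting labeling problem is given below. The data and discontinuity costs follow Eqs. (2.8) and (2.9); however, whereas Choi et al. [5] minimize the energy with a graph-cut-based segmentation method [22], the sketch uses iterated conditional modes (ICM) purely as a simple stand-in optimizer.

```python
import numpy as np

def mrf_wrapping_labels(p_high, A_corr, lam=1.0, beta=1.0, n_iters=10):
    """Binary labeling (H = 0, L = 1) minimizing an energy of the form of Eq. (2.8).

    Data cost: D_p(n) = 1 - P(n | A'_p); discontinuity cost: Eq. (2.9).
    ICM is used here only as a simple stand-in for the graph-cut method [22].
    """
    D = np.stack([1.0 - p_high, p_high], axis=-1)   # costs of labels H and L
    labels = np.argmin(D, axis=-1)                  # initialize from the data term

    def pair_cost(a_p, a_q, different):
        return lam * np.exp(-beta * (a_p - a_q) ** 2) * different

    h, w = labels.shape
    for _ in range(n_iters):
        for y in range(h):
            for x in range(w):
                costs = D[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        for n in (0, 1):
                            costs[n] += pair_cost(A_corr[y, x], A_corr[ny, nx],
                                                  float(n != labels[ny, nx]))
                labels[y, x] = int(np.argmin(costs))
    return labels
```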

Fig. 2.3

Detection of wrapped regions. a Result obtained by expectation maximization. b Result obtained by MRF optimization. The pixels with labels \(L\) and \(H\) are colored in black and white, respectively. The red pixels are those with extremely high or low amplitude values, which are not processed during the classification. c Unwrapped depth map corresponding to Fig. 2.2(b). The intensity is proportional to the distance. Images courtesy of Choi et al. [5]

Figure 2.3 shows the classification results obtained by Choi et al. [5]. Because of the varying reflectivity of the scene, the result in Fig. 2.3a exhibits misclassified pixels in the lower left part. The misclassification is reduced by applying the MRF optimization, as shown in Fig. 2.3b. Figure 2.3c shows the unwrapped depth map obtained by Choi et al. [5], corresponding to Fig. 2.2b.

McClure et al. [17] also use a segmentation-based approach, in which the depth map is segmented into regions by applying the watershed transform [18]. In their method, wrapped regions are detected by checking the average corrected amplitude of each region.

On the other hand, depth values tend to be highly discontinuous across the wrapping boundaries, where there are transitions in the number of wrappings. For example, the depth maps in Figs. 2.1a and 2.2b show such discontinuities. On the assumption that the illuminated surface is smooth, the depth difference between adjacent pixels should be small. If the difference between measured distances is greater than \(0.5d_{\max }\) for a pair of adjacent pixels, say \(d_p-d_q>0.5d_{\max }\), we can set the number of relative wrappings, or briefly the shift, \(n_q-n_p\) to 1 so that the unwrapped difference satisfies \(-0.5d_{\max } \le d_p-d_q-(n_q-n_p)d_{\max } <0\), minimizing the discontinuity.

Figure 2.4 shows a one-dimensional phase-unwrapping example. In Fig. 2.4a, the phase difference between pixels \(p\) and \(q\) is greater than 0.5 (or \(\pi \)). The shifts that minimize the difference between adjacent pixels are 1 (or, \(n_q-n_p=1\)) for \(p\) and \(q\), and 0 for the other pairs of adjacent pixels. On the assumption that \(n_p\) equals 0, we can integrate the shifts from left to right to obtain the unwrapped phase image in Fig. 2.4b.

Fig. 2.4

One-dimensional phase-unwrapping example. a Measured phase image. b Unwrapped phase image where the phase difference between \(p\) and \(q\) is now less than \(0.5\). In (a) and (b), all the phase values have been divided by \(2\pi \). For example, the displayed value \(0.1\) corresponds to \(0.2\pi \)
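The one-dimensional procedure of Fig. 2.4 amounts to thresholding the differences between adjacent phase values and integrating the resulting shifts. A sketch is shown below, with phase values expressed in turns (i.e., divided by \(2\pi \)); the example sequence is merely illustrative.

```python
import numpy as np

def unwrap_1d(phi):
    """Integrate shifts along a 1D phase signal given in turns (divided by 2*pi).

    A drop of more than 0.5 between adjacent samples is taken to mean one
    additional wrapping (shift +1); a jump of more than 0.5 means one fewer.
    The first sample is assumed to have zero wrappings.
    """
    diffs = np.diff(phi)
    shifts = np.zeros(diffs.shape, dtype=int)
    shifts[diffs < -0.5] = 1           # e.g. 0.9 -> 0.1: one more wrapping
    shifts[diffs > 0.5] = -1
    n = np.concatenate(([0], np.cumsum(shifts)))
    return phi + n

# Illustrative values only (not those of Fig. 2.4): the drop from 0.9 to 0.1
# is resolved by adding one period to the remaining samples.
print(unwrap_1d(np.array([0.1, 0.5, 0.9, 0.1, 0.5])))   # [0.1 0.5 0.9 1.1 1.5]
```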

Fig. 2.5

Two-dimensional phase-unwrapping example. a Measured phase image. (b–d) Sequentially unwrapped phase images where the phase difference across the red dotted line has been minimized. From (a) to (d), all the phase values have been divided by \(2\pi \). For example, the displayed value \(0.1\) corresponds to \(0.2\pi \)

Figure 2.5 shows a two-dimensional phase-unwrapping example. From Fig. 2.5a to d, the phase values are unwrapped so as to minimize the phase difference across the red dotted line. In this two-dimensional case, phase differences greater than 0.5 never vanish, and the red dotted line cycles around the image center indefinitely. This is because of a local phase error that violates the zero-curl constraint [9, 12].

Fig. 2.6

Zero-curl constraint: \(a(x,y)+b(x+1,y)=b(x,y)+a(x,y+1)\). a The number of relative wrappings between \((x+1,y+1)\) and \((x,y)\) should be consistent, regardless of the integration path. For example, two different paths (red and blue) are shown. b shows an example in which the constraint is not satisfied. The four pixels correspond to the four pixels in the middle of Fig. 2.5a

Figure 2.6 illustrates the zero-curl constraint. Given four neighboring pixel locations \((x,y)\), \((x+1,y)\), \((x,y+1)\), and \((x+1,y+1)\), let \(a(x,y)\) and \(b(x,y)\) denote the shifts \(n(x+1,y)-n(x,y)\) and \(n(x,y+1)-n(x,y)\), respectively, where \(n(x,y)\) denotes the number of wrappings at \((x,y)\). Then, the shift \(n(x+1,y+1)-n(x,y)\) can be calculated in two different ways: either \(a(x,y)+b(x+1,y)\) or \(b(x,y)+a(x,y+1)\) following one of the two different paths shown in Fig. 2.6a. For any phase-unwrapping results to be consistent, the two values should be the same, satisfying the following equality:

$$\begin{aligned} a(x,y)+b(x+1,y)=b(x,y)+a(x,y+1). \end{aligned}$$
(2.10)

Because of noise or discontinuities in the scene, the zero-curl constraint may not be satisfied locally, and the local error is propagated to the entire image during the integration. There exist classical phase-unwrapping methods [9, 12] applied in magnetic resonance imaging [15] and interferometric synthetic aperture radar (SAR) [13], which rely on detecting [12] or fixing [9] broken zero-curl constraints. Indeed, these classical methods [9, 12] have been applied to phase unwrapping for tof cameras [7, 14].

2.2.1 Deterministic Methods

Goldstein et al. [12] assume that the shift is either 1 or -1 between adjacent pixels if their phase difference is greater than \(\pi \), and assume that it is 0 otherwise. They detect cycles of four neighboring pixels, referred to as plus and minus residues, which do not satisfy the zero-curl constraint.

If an integration path encloses an unequal number of plus and minus residues, the integrated phase values on the path suffer from global errors. In contrast, if every integration path encloses an equal number of plus and minus residues, the global errors cancel out. To prevent global errors from being generated, Goldstein et al. [12] connect nearby plus and minus residues with cuts, which interdict the integration paths, such that no net residues can be encircled.

After constructing the cuts, the integration starts from a pixel \(p\), and each neighboring pixel \(q\) is unwrapped relative to \(p\) in a greedy and sequential manner, provided that \(q\) has not yet been unwrapped and that \(p\) and \(q\) are on the same side of the cuts.
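A sketch of the residue detection underlying this method is given below: the shifts between adjacent pixels are obtained by thresholding the wrapped phase differences, and the curl of Eq. (2.10) is evaluated on every elementary \(2\times 2\) cycle, so that nonzero entries mark plus or minus residues. The construction of the cuts themselves is omitted.

```python
import numpy as np

def wrapped_shift(delta):
    """Shift assumed for a phase difference given in turns (divided by 2*pi)."""
    shift = np.zeros(delta.shape, dtype=int)
    shift[delta < -0.5] = 1     # phase drops: one more wrapping
    shift[delta > 0.5] = -1     # phase jumps: one fewer wrapping
    return shift

def residues(phi):
    """Plus/minus residues: 2x2 cycles violating the zero-curl constraint, Eq. (2.10).

    phi is a wrapped phase image indexed as phi[y, x], in turns. The returned
    array has one entry per elementary cycle; nonzero entries mark residues.
    """
    a = wrapped_shift(np.diff(phi, axis=1))   # x-directional shifts a(x, y)
    b = wrapped_shift(np.diff(phi, axis=0))   # y-directional shifts b(x, y)
    # curl = a(x,y) + b(x+1,y) - a(x,y+1) - b(x,y) on each elementary cycle
    return a[:-1, :] + b[:, 1:] - a[1:, :] - b[:, :-1]
```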

2.2.2 Probabilistic Methods

Frey et al. [9] propose a very loopy belief propagation method for estimating the shift that satisfies the zero-curl constraints. Let the set of shifts, and a measured phase image, be denoted by

$$\begin{aligned} S&=\bigl \{a(x,y),\ b(x,y)\ :\ x=1,\ldots ,N-1;\ y=1,\ldots ,M-1\bigr \} \quad \text{ and}\\ \varPhi &=\bigl \{\phi (x,y)\ :\ 0\le \phi (x,y) <1,\ x=1,\ldots ,N;\ y=1,\ldots ,M\bigr \}, \end{aligned}$$

respectively, where the phase values have been divided by \(2\pi \). The estimation is then recast as finding the solution that maximizes the following joint distribution:

$$\begin{aligned} p(S,\varPhi ) \propto&\prod \limits _{x=1}^{N-1} \prod \limits _{y=1}^{M-1} \delta \bigl (a(x,y)+b(x+1,y)-a(x,y+1)-b(x,y)\bigr )\\&\times \prod \limits _{x=1}^{N-1} \prod \limits _{y=1}^{M} e^{-(\phi (x+1,y)-\phi (x,y)+a(x,y))^2/2\sigma ^2}\\&\times \prod \limits _{x=1}^{N} \prod \limits _{y=1}^{M-1} e^{-(\phi (x,y+1)-\phi (x,y)+b(x,y))^2/2\sigma ^2}, \end{aligned}$$

where \(\delta (x)\) evaluates to \(1\) if \(x=0\) and to \(0\) otherwise. The variance \(\sigma ^2\) is estimated directly from the wrapped phase image [9].

Fig. 2.7

Graphical model that describes the zero-curl constraints (black discs) between neighboring shift variables (white discs). 3-element probability vectors (\(\mu \)’s) on the shifts between adjacent nodes (\(-\)1, 0, or 1) are propagated across the network. The x marks denote pixels [9]

Frey et al. [9] construct a graphical model describing the factorization of \(p(S,\varPhi )\), as shown in Fig. 2.7. In the graph, each shift node (white disc) is located between two pixels, and corresponds to either an \(x\)-directional shift (\(a\)’s) or a \(y\)-directional shift (\(b\)’s). Each constraint node (black disc) corresponds to a zero-curl constraint, and is connected to its four neighboring shift nodes. Every node passes messages to its neighboring nodes, and each message is a 3-vector denoted by \({\mu }\), whose elements correspond to the allowed shift values \(-\)1, 0, and 1. Each message \({\mu }\) can be considered as a probability distribution over the three possible values [9].

Fig. 2.8

a Constraint-to-shift vectors are computed from incoming shift-to-constraint vectors. b Shift-to-constraint vectors are computed from incoming constraint-to-shift vectors. c Estimates of the marginal probabilities of the shifts given the data are computed by combining incoming constraint-to-shift vectors [9]

Figure 2.8a illustrates the computation of a message \({\mu }_4\) from a constraint node to one of its neighboring shift nodes. The constraint node receives messages \({\mu }_1\), \({\mu }_2\), and \({\mu }_3\) from the rest of its neighboring shift nodes, and filters out the joint message elements that do not satisfy the zero-curl constraint:

$$\begin{aligned} \mu _{4i}=\sum \limits _{j=-1}^{1}\sum \limits _{k=-1}^{1}\sum \limits _{l=-1}^{1}\delta (k+l-i-j)\mu _{1j}\mu _{2k}\mu _{3l}, \end{aligned}$$
(2.11)

where \(\mu _{4i}\) denotes the element of \({\mu }_{4}\) corresponding to the shift value \(i \in \{-1,0,1\}\).

Figure 2.8b illustrates the computation of a message \({\mu }_2\) from a shift node to one of its neighboring constraint nodes. Among the elements of the message \({\mu }_1\) from the other neighboring constraint node, the element that is consistent with the measured shift \(\phi (x,y)-\phi (x+1,y)\) is amplified:

$$\begin{aligned} \mu _{2i}=\mu _{1i}\exp \Bigl (-\bigl (\phi (x+1,y)-\phi (x,y)+i\bigr )^2\big /{2\sigma ^2}\Bigr ). \end{aligned}$$
(2.12)

After the messages converge (or, after a fixed number of iterations), an estimate of the marginal probability of a shift is computed by using the messages passed into its corresponding shift node, as illustrated in Fig. 2.8c:

$$\begin{aligned} \hat{P}\bigl (a(x,y)=i|\varPhi \bigr ) = \frac{\mu _{1i}\mu _{2i}}{\sum \limits _j{\mu _{1j}\mu _{2j}}}. \end{aligned}$$
(2.13)

Given the estimates of the marginal probabilities, the most probable value of each shift node is selected. If some zero-curl constraints remain violated, a robust integration technique, such as least-squares integration [10], should be used [9].
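The message and marginal computations of Eqs. (2.11)–(2.13) translate directly into the functions sketched below; the normalization of each message is added only for numerical stability, and the scheduling of the updates over the network is omitted.

```python
import numpy as np

SHIFTS = (-1, 0, 1)   # allowed shift values

def constraint_to_shift(mu1, mu2, mu3):
    """Eq. (2.11): message from a constraint node to its fourth shift node.

    mu1, mu2, mu3 are the 3-vectors arriving from the other three shift nodes;
    only joint configurations satisfying the zero-curl constraint contribute.
    """
    mu4 = np.zeros(3)
    for i_idx, i in enumerate(SHIFTS):
        for j_idx, j in enumerate(SHIFTS):
            for k_idx, k in enumerate(SHIFTS):
                for l_idx, l in enumerate(SHIFTS):
                    if k + l - i - j == 0:
                        mu4[i_idx] += mu1[j_idx] * mu2[k_idx] * mu3[l_idx]
    return mu4 / mu4.sum()

def shift_to_constraint(mu1, phi_p, phi_q, sigma2):
    """Eq. (2.12): message from the shift node between pixels (x,y) and (x+1,y)."""
    weights = np.array([np.exp(-(phi_q - phi_p + i) ** 2 / (2.0 * sigma2))
                        for i in SHIFTS])
    mu2 = mu1 * weights
    return mu2 / mu2.sum()

def shift_marginal(mu1, mu2):
    """Eq. (2.13): marginal probability of a shift from its two incoming messages."""
    prod = mu1 * mu2
    return prod / prod.sum()
```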

2.2.3 Discussion

The aforementioned phase-unwrapping methods using a single depth map [5, 7, 14, 17, 21] have the advantage that the acquisition time is not extended, keeping motion artifacts to a minimum. The methods, however, rely on strong assumptions that are fragile in real-world situations. For example, the reflectivity of the scene surface may vary widely; in this case, it is hard to detect wrapped regions based on the corrected amplitude values. In addition, the scene may be discontinuous if it contains multiple objects that occlude one another. In this case, the wrapping boundaries tend to coincide with object boundaries, and it is often hard to observe the large depth discontinuities across those boundaries, which play an important role in determining the number of relative wrappings.

These assumptions can be relaxed by using multiple depth maps, at the cost of a possible extension of the acquisition time. The next subsection introduces phase-unwrapping methods using multiple depth maps.

2.3 Phase Unwrapping from Multiple Depth Maps

Suppose that a pair of depth maps \(M_1\) and \(M_2\) of a static scene are given, which have been taken at different modulation frequencies \(f_1\) and \(f_2\) from the same viewpoint. In this case, pixel \(p\) in \(M_1\) corresponds to pixel \(p\) in \(M_2\), since the corresponding region of the scene is projected onto the same location of \(M_1\) and \(M_2\). Thus, the unwrapped distances at those corresponding pixels should be consistent within the noise level.

Without prior knowledge, the noise in the unwrapped distance can be assumed to follow a zero-mean distribution. Under this assumption, the maximum likelihood estimates of the numbers of wrappings at the corresponding pixels should minimize the difference between their unwrapped distances. Let \(m_p\) and \(n_p\) be the numbers of wrappings at pixel \(p\) in \(M_1\) and \(M_2\), respectively. Then, we can choose \(m_p\) and \(n_p\) that minimize \(g(m_p, n_p)\) such that

$$\begin{aligned} g(m_p, n_p)=\bigl |d_p(f_1)+m_pd_{\max }(f_1) - d_p(f_2)-n_pd_{\max }(f_2)\bigr |, \end{aligned}$$
(2.14)

where \(d_p(f_1)\) and \(d_p(f_2)\) denote the measured distances at pixel \(p\) in \(M_1\) and \(M_2\) respectively, and \(d_{\max }(f)\) denotes the maximum range of \(f\).
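For a single pixel, the maximum likelihood choice can be found by exhaustive search over the candidate numbers of wrappings, as in the sketch below; the bound \(N\) on the number of wrappings is an assumed input.

```python
def best_wrappings(d1, d2, dmax1, dmax2, N=2):
    """Exhaustive minimization of g(m, n) in Eq. (2.14) for one pixel.

    d1, d2       : distances measured at modulation frequencies f1 and f2
    dmax1, dmax2 : corresponding maximum ranges
    N            : assumed maximum number of wrappings
    """
    best = None
    for m in range(N + 1):
        for n in range(N + 1):
            g = abs(d1 + m * dmax1 - d2 - n * dmax2)
            if best is None or g < best[0]:
                best = (g, m, n)
    _, m, n = best
    return m, n
```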

The depth consistency constraint has been mentioned by Göktürk et al. [11] and used by Falie and Buzuloiu [8] for phase unwrapping of tof cameras. The illuminating power of tof cameras is, however, limited, owing to eye-safety requirements, and the reflectivity of the scene may be very low. In this situation, the noise may be too large for the correct numbers of wrappings to minimize \(g(m_p,n_p)\). For robust estimation against noise, Droeschel et al. [6] incorporate the depth consistency constraint into their earlier work [7] for a single depth map, using an auxiliary depth map acquired at a different modulation frequency.

If we acquire a pair of depth maps of a dynamic scene sequentially and independently, the pixels at the same location may not correspond to each other. To deal with such dynamic situations, several approaches [4, 20] acquire a pair of depth maps simultaneously. These can be divided into single-camera and multicamera methods, as described below.

2.3.1 Single-Camera Methods

For obtaining a pair of depth maps sequentially, four samples of integrated electric charge are required per integration period, resulting in eight samples over two different integration periods. Payne et al. [20] propose a special hardware system that enables simultaneous acquisition of a pair of depth maps at different frequencies, by dividing the integration period into two halves and switching between frequencies \(f_1\) and \(f_2\), as shown in Fig. 2.9.

Fig. 2.9

Frequency modulation within an integration period. The first half is modulated at \(f_1\), and the other half is modulated at \(f_2\)

Payne et al. [20] also show that it is possible to obtain a pair of depth maps with only five or six samples within a combined integration period, using their system. By using fewer samples, the total readout time is reduced and the integration period for each sample can be extended, resulting in an improved signal-to-noise ratio.

2.3.2 Multicamera Methods

Choi and Lee [4] use a pair of commercially available tof cameras to simultaneously acquire a pair of depth maps from different viewpoints. The two cameras \(C_1\) and \(C_2\) are fixed to each other, and the mapping of a 3D point \(\varvec{\mathrm{X}}\) from \(C_1\) to its corresponding point \(\varvec{\mathrm{X}}^\prime \) from \(C_2\) is given by \((\varvec{\mathrm{R}}, \varvec{\mathrm{T}})\), where \(\varvec{\mathrm{R}}\) is a \(3\times 3\) rotation matrix, and \(\varvec{\mathrm{T}}\) is a \(3\times 1\) translation vector. In [4], the extrinsic parameters \(\varvec{\mathrm{R}}\) and \(\varvec{\mathrm{T}}\) are assumed to have been estimated. Figure 2.10a shows the stereo tof camera system.

Fig. 2.10

a Stereo tof camera system. (b, c) Depth maps acquired by the system. d Amplitude image corresponding to (b). (e, f) Unwrapped depth maps, corresponding to (b) and (c), respectively. The intensity in (b, c, e, f) is proportional to the depth. The maximum intensity (255) in (b, c) and (e, f) correspond to 5.2 and 15.6 m, respectively. Images courtesy of Choi and Lee [4]

Denoting by \(M_1\) and \(M_2\) a pair of depth maps acquired by the system, a pixel \(p\) in \(M_1\) and its corresponding pixel \(q\) in \(M_2\) should satisfy:

$$\begin{aligned} \varvec{\mathrm{X}}^{\prime }_q(n_q)=\varvec{\mathrm{R}}\varvec{\mathrm{X}}_p(m_p)+\varvec{\mathrm{T}}, \end{aligned}$$
(2.15)

where \(\varvec{\mathrm{X}}_p(m_p)\) and \(\varvec{\mathrm{X}}^{\prime }_q(n_q)\) denote the unwrapped 3D points of \(p\) and \(q\) with their numbers of wrappings \(m_p\) and \(n_q\), respectively.

Table 2.1 Summary of phase-unwrapping methods

Based on the relation in Eq. (2.15), Choi and Lee [4] generalize the depth consistency constraint in Eq. (2.14) for a single camera to those for the stereo camera system:

$$\begin{aligned} D_p(m_p)&= \min \limits _{n_{q^\star }\in \{0,\ldots ,N\}}\Bigl (\bigl \Vert \varvec{\mathrm{X}}^\prime _{q^\star }(n_{q^\star })-\varvec{\mathrm{R}}\varvec{\mathrm{X}}_p(m_p)-\varvec{\mathrm{T}}\bigr \Vert \Bigr ),\\ D_q(n_q)&= \min \limits _{m_{p^\star }\in \{0,\ldots ,N\}}\Bigl (\bigl \Vert \varvec{\mathrm{X}}_{p^\star }(m_{p^\star })-\varvec{\mathrm{R}}^T(\varvec{\mathrm{X}}^\prime _q(n_q)-\varvec{\mathrm{T}})\bigr \Vert \Bigr ),\nonumber \end{aligned}$$
(2.16)

where pixels \(q^\star \) and \(p^\star \) are the projections of \(\varvec{\mathrm{R}}\varvec{\mathrm{X}}_p(m_p)+\varvec{\mathrm{T}}\) and \(\varvec{\mathrm{R}}^T(\varvec{\mathrm{X}}^\prime _q(n_q)-\varvec{\mathrm{T}})\) onto \(M_2\) and \(M_1\), respectively. The integer \(N\) is the maximum number of wrappings, determined by approximate knowledge of the scale of the scene.

To robustly handle noise and occlusion, Choi and Lee [4] minimize the following MRF energy functions \(E_1\) and \(E_2\), instead of independently minimizing \(D_p(m_p)\) and \(D_q(n_q)\) at each pixel:

$$\begin{aligned} E_1&=\sum \limits _{p\in M_1}{\hat{D}_p(m_p)}+ \sum \limits _{(p,u)}{V(m_p,m_u)},\\ E_2&=\sum \limits _{q\in M_2}{\hat{D}_q(n_q)}+ \sum \limits _{(q,v)}{V(n_q,n_v)},\nonumber \end{aligned}$$
(2.17)

where \(\hat{D}_p(m_p)\) and \(\hat{D}_q(n_q)\) are the data cost of assigning \(m_p\) and \(n_q\) to pixels \(p\) and \(q\), respectively. Functions \(V(m_p,m_u)\) and \(V(n_q,n_v)\) determine the discontinuity cost of assigning (\(m_p\),\(m_u\)) and (\(n_q\),\(n_v\)) to pairs of adjacent pixels (\(p\),\(u\)) and (\(q\),\(v\)), respectively.

The data costs \(\hat{D}_p(m_p)\) and \(\hat{D}_q(n_q)\) are defined by truncating \(D_p(m_p)\) and \(D_q(n_q)\) to prevent their values from becoming too large, due to noise and occlusion:

$$\begin{aligned} \hat{D}_p(m_p)=\tau _{\varepsilon }\bigl (D_p (m_p)\bigr ), \quad \hat{D}_q(n_q)=\tau _{\varepsilon }\bigl (D_q (n_q)\bigr ), \end{aligned}$$
(2.18)
$$\begin{aligned} \tau _{\varepsilon }(x)=\left\{ \begin{array}{ll}x,&\text{ if} \; x<\varepsilon , \\ \varepsilon ,&\text{ otherwise}, \end{array}\right. \end{aligned}$$
(2.19)

where \(\varepsilon \) is a threshold proportional to the extrinsic calibration error of the system.
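The following sketch evaluates the truncated data cost \(\hat{D}_p(m_p)\) of Eqs. (2.16) and (2.18) for one pixel. The helper lookup_unwrapped, which projects a transformed point onto \(M_2\), finds the corresponding pixel \(q^\star\), and returns its unwrapped 3D point, is hypothetical and stands in for the projection step described in the text.

```python
import numpy as np

def data_cost(X_p, d_max1, R, T, lookup_unwrapped, N, eps):
    """Truncated data cost for one pixel, Eqs. (2.16), (2.18), and (2.19).

    X_p              : measured (wrapped) 3D point in camera C_1
    R, T             : extrinsic parameters mapping C_1 points into C_2
    lookup_unwrapped : hypothetical helper; lookup_unwrapped(Y, n) projects the
                       3D point Y onto M_2, finds the pixel q* it falls on, and
                       returns that pixel's unwrapped point X'_{q*}(n)
    """
    d_p = np.linalg.norm(X_p)
    costs = np.empty(N + 1)
    for m in range(N + 1):
        X_m = (d_p + m * d_max1) / d_p * X_p      # unwrapped candidate, Eq. (2.2)
        Y = R @ X_m + T                           # transformed into C_2, Eq. (2.15)
        costs[m] = min(np.linalg.norm(lookup_unwrapped(Y, n) - Y)
                       for n in range(N + 1))     # Eq. (2.16)
    return np.minimum(costs, eps)                 # truncation, Eqs. (2.18)-(2.19)
```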

The function \(V(m_p,m_u)\) is defined in a manner that preserves depth continuity between adjacent pixels. Choi and Lee [4] assume a pair of measured 3D points \(\varvec{\mathrm{X}}_p\) and \(\varvec{\mathrm{X}}_u\) to have been projected from close surface points if they are close to each other and have similar corrected amplitude values. The proximity is preserved by penalizing the pair of pixels if they have different numbers of wrappings:

$$\begin{aligned} V(m_p, m_u) = \left\{ \begin{array}{l@{\quad }l} \frac{\lambda }{r_{pu}} \exp \Bigl (-\frac{\varDelta \varvec{\mathrm{X}}^2_{pu}}{2\sigma ^2_{\varvec{\mathrm{X}}}}\Bigr ) \exp \Bigl (-\frac{\varDelta A^{\prime 2}_{pu}}{2\sigma ^2_{A^\prime }}\Bigr )&\text{ if} \left\{ \begin{array}{ll} m_p \ne m_u \quad \text{ and}\\ \varDelta \varvec{\mathrm{X}}_{pu} < 0.5\,d_{\max }(f_1) \end{array}\right. \\ 0&\text{ otherwise}. \end{array} \right. \end{aligned}$$

where \(\lambda \) is a constant, \(\varDelta \varvec{\mathrm{X}}^2_{pu}=\Vert \varvec{\mathrm{X}}_p-\varvec{\mathrm{X}}_u\Vert ^2\), and \(\varDelta A^{\prime 2}_{pu}=\Vert A^\prime _p-A^\prime _u\Vert ^2\). The variances \(\sigma ^2_{\varvec{\mathrm{X}}}\) and \(\sigma ^2_{A^\prime }\) are adaptively determined. The positive scalar \(r_{pu}\) is the image-coordinate distance between \(p\) and \(u\), which attenuates the influence of pixel pairs that are less adjacent. The function \(V(n_q,n_v)\) is defined by analogy with \(V(m_p,m_u)\).

Choi and Lee [4] minimize the MRF energies via the \(\alpha \)-expansion algorithm [2], obtaining a pair of unwrapped depth maps. To enforce further consistency between the unwrapped depth maps, they iteratively update the MRF energy corresponding to one depth map, using the unwrapped depth of the other map, and perform the minimization until the consistency no longer increases. Figure 2.10e, f show examples of unwrapped depth maps, obtained by these iterative optimizations. An alternative method for improving the depth accuracy using two tof cameras is described in [3].

2.3.3 Discussion

Table 2.1 summarizes the phase-unwrapping methods [4–7, 14, 17, 20, 21] for tof cameras. The last column of the table shows the extended maximum range that can theoretically be achieved by each method. The methods [6, 7, 14] based on the classical phase-unwrapping methods [9, 12] deliver the widest maximum range. In [4, 5], the maximum number of wrappings can be determined by the user. It follows that the maximum range of these methods can also be made sufficiently wide, by setting \(N\) to a large value. In practice, however, the limited illuminating power of commercially available tof cameras prevents distant objects from being precisely measured. This means that the phase values may be invalid, even if they can be unwrapped. In addition, the working environment may be physically confined. For the latter reason, Droeschel et al. [6, 7] limit the maximum range to \(2d_{\max }\).

2.4 Conclusions

Although the hardware system in [20] has not yet been adopted in commercially available tof cameras, we believe that future tof cameras will use such a frequency-modulation technique for accurate and precise depth measurement. In addition, the phase-unwrapping methods in [4, 6] are ready to be applied to a pair of depth maps acquired by such future tof cameras, for robust estimation of the unwrapped depth values. We believe that a suitable combination of hardware and software will extend the maximum tof range, up to a limit imposed by the illuminating power of the device.