Abstract
We propose a depth and image scene flow estimation method taking the input of a binocular video. The key component is motion-depth temporal consistency preservation, making computation in long sequences reliable. We tackle a number of fundamental technical issues, including connection establishment between motion and depth, structure consistency preservation in multiple frames, and long-range temporal constraint employment for error correction. We address all of them in a unified depth and scene flow estimation framework. Our main contributions include development of motion trajectories, which robustly link frame correspondences in a voting manner, rejection of depth/motion outliers through temporal robust regression, novel edge occurrence map estimation, and introduction of anisotropic smoothing priors for proper regularization.
1 Introduction
In many computer vision tasks, reliable depth and motion estimation is fundamental to producing high-quality results. If depth can be accurately inferred from 3D videos with the necessary temporal consistency, traditionally challenging video editing tasks that alter color, structure, and geometry, as well as high-level scene understanding and recognition tasks, can be accomplished much more easily. In addition, with the rapid proliferation of 3D display and 3D capturing devices, the “2D-plus-depth” format becomes common and important, as it can be used to generate new views for 3DTV.
Although a tremendous number of binocular videos have come into existence, with only two views, it is still very difficult to compute reliable and consistent depth in long sequences. Structure-from-motion (SFM) and multi-view stereo matching can be applied to static scenes where global constraints are established through the multi-view geometry (Snavely et al. 2006; Furukawa and Ponce 2007; Zhang et al. 2009). These techniques are not suitable for videos that contain moving or deforming objects, which handicap correspondence establishment across multiple frames.
In optical flow estimation (Baker et al. 2011), which captures 2D apparent motion, correspondence between consecutive frames can be established. 2D optical flow and depth variation are jointly considered in Patras et al. (1996), Zhang and Kambhamettu (2001), Vedula et al. (2005), Huguet and Devernay (2007), Wedel et al. (2008, 2011), Valgaerts et al. (2010), Rabe et al. (2010), which is typically referred to as image scene flow. Given intrinsic camera parameters, 3D scene flow can be constructed (Basha et al. 2010; Wedel et al. 2011). These methods either compute motion and depth independently or resort to a four-image configuration. They do not tackle the temporal-consistency preservation problem in multiple frames.
Local temporal constraints were imposed in depth estimation for video sequences, known as spatiotemporal stereo matching (Zhang et al. 2003; Richardt et al. 2010). These methods do not cope with object motion across multiple frames and thus are more suitable for static scene videos. In optical flow estimation, the methods presented in Black (1994), Bruhn and Weickert (2005) have temporal terms. They, however, may suffer from two main estimation problems. First, it is difficult or inefficient to perform long range information propagation temporally. Second, erroneous estimates caused by occasional noise, sudden luminance change, and outliers in one frame could influence later results. There is no effective way to measure and reduce estimation errors globally.
We aim at reliable depth and motion estimation from multi-frame binocular videos with appropriate temporal consistency. Our method is not based on global multi-view geometry because dynamic objects do not obey it. We also do not rely on locally established temporal constraints due to their inefficiency in information propagation.
We make several major contributions to construct the system, which can measure and reduce estimation errors in multiple frames. (1) We propose motion trajectories that link reliable corresponding points among frames. They are robust against occasional noise and abrupt luminance variation. (2) We build structure profiles by considering multi-frame edges. Through a voting-like step, only edges reliable in multiple frames are enhanced. (3) Long-range temporal constraints are advocated, based on the robust motion trajectories. Regression then corrects errors and improves estimates temporally. (4) Last but not least, we propose anisotropic smoothing to non-uniformly regularize pixels, incorporating temporal edge information and preventing unconstrained boundary degradation.
2 Related Work
Simultaneous depth and motion estimation from stereo images was studied in Zhang and Faugeras (1992). In Wedel et al. (2008), motion and depth were computed sequentially and independently, assuming that the depth in previous frames is known. To improve the results, depth and motion are jointly estimated, using two stereo pairs (Patras et al. 1996; Zhang and Kambhamettu 2001; Min and Sohn 2006; Huguet and Devernay 2007; Valgaerts et al. 2010). These approaches estimate motion fields from two calibrated cameras, where constraints are established in the four-frame configuration. Temporal consistency may still be a problem.
Recently, Wedel et al. (2011) pointed out that decoupling disparity and motion field computation is advantageous in that different optimization techniques can be applied. A semi-dense scene flow field was computed in Cech et al. (2011) through locally growing correspondence seeds. Instead of modeling image scene flow, Basha et al. (2010) imposed constraints on 3D surface motion. Calibrated multi-view sequences were used. Rabe et al. (2010) applied Kalman filtering to independently computed flow estimates and disparities. Hadfield and Bowden (2011) proposed a particle approach for scene flow estimation, making use of depth sensors and the monocular image camera in Microsoft Kinect. Vogel et al. (2011) incorporated a local rigidity constraint to regularize scene flow estimation. Our method primarily differs from these approaches in the way of enforcing temporal consistency, as multi-frame long-range structure information is made use of.
Efforts have also been devoted to motion/depth discontinuity handling. Zhang and Kambhamettu (2001) used segmentation and applied piecewise regularization. Xu et al. (2008) applied a segmentation model to optical flow estimation. Edge-preserving regularizers (Min and Sohn 2006), image-driven regularizers (Sun et al. 2008), and complementary regularizers (Zimmer et al. 2009) were used to preserve motion and depth boundaries. The color edges or segments used as guidance are generally hard to keep consistent over time, making it difficult to produce high-quality depth boundaries.
In optical flow, the two-frame configuration is common (Brox et al. 2004; Bruhn et al. 2005; Zimmer et al. 2009; Xu et al. 2010). A taxonomy of optical flow methods, along with comparisons, is reported on the Middlebury website (Baker et al. 2011). To enforce temporal smoothness, in Brox et al. (2004), Bruhn and Weickert (2005), Bruhn et al. (2005), constraints are derived by assuming that the flow vectors from consecutive frames at the same image location are similar. This assumption holds only for smoothly varying motion. Álvarez et al. (2007) enforced symmetry between the forward and backward flow to reject outliers. Assuming constant motion acceleration, Black (1994) enforced temporal smoothness by predicting motion for the next frame using the current estimate. These methods only consider consecutive frames, which is not effective and may accumulate errors when propagating information among frames that are far apart.
Using multiple frames, Irani (2002) projected flow vectors onto a subspace and assumed that the resulting matrix has a low rank for noise removal. Only global motion in static scenes is considered. To construct chained motion trajectories, particle samples were generated and linked in a probabilistic way (Sand and Teller 2008). In Sundaram et al. (2010), chained trajectories were constructed by a bi-directional check of motion vectors based on large displacement optical flow estimation (Brox et al. 2009). The quality of trajectories depends heavily on flow estimates from consecutive frames, making these methods vulnerable to noise and estimation outliers.
To achieve temporal consistency in depth estimation, Zhang et al. (2003) proposed spacetime data costs that aggregate the data term over a short period. Richardt et al. (2010) extended the idea by applying a spatio-temporal cross-bilateral grid to the data cost, derived from the locally adaptive support aggregation window of Yoon and Kweon (2006). Temporal relationships are also only considered among nearby video frames. These methods do not deal with large motion between frames and thus are more suitable for static scenes.
Contrary to all these approaches, we propose a general binocular framework addressing the temporal consistency problem in depth and motion estimation in long sequences. Temporal information from multiple frames is incorporated with novel chained profiles.
3 Notations and Problem Introduction
Given a rectified binocular video, our method computes two-view stereo for each frame pair across the two sequences, together with dense motion in each sequence. Our framework can be readily extended to unrectified stereo videos linked with a fundamental matrix (Valgaerts et al. 2010).
We denote corresponding frames in the stereo sequences as \(f_{l}^{t}\) and \(f_{r}^{t}\), indexed by time t where t={0,1,…,N−1}. For each pixel x in frame \(f_{l}^{t}\), we find the corresponding pixels in the neighboring frames either temporally using motion estimation or spatially with stereo matching, as shown in Fig. 1. The correspondence \(x_{r}^{t}\) in \(f_{r}^{t}\) is expressed as
where \({\mathrm{d}}_{l}^{t}\) is the view-dependent disparity. In this paper, we use \({\mathrm{d}}_{l}^{t}(x)\) and \({\mathrm{d}}_{l}^{t}\) interchangeably to represent the disparity value at point x. Meanwhile, optical flow correspondence in the left sequence for pixel x is
based on the displacement u in frame \(f_{l}^{t+1}\). The 2D vector \({\mathrm{u}}^{t,t+1}_{l}\) is written as \({\mathrm{u}}_{l}^{t,t+1}=(u_{l}^{t,t+1},v_{l}^{t,t+1})^{T}\), as shown in Fig. 1. Finally, the correspondence for x in \(f_{r}^{t+1}\) is expressed as \(x_{r}^{t+1}={x}+{\mathrm{u}}_{l}^{t,t+1}+{\mathrm{d}}_{l}^{t+1}\), involving both motion and stereo. 3D image scene flow is, by convention, denoted as
where \(\delta\mathrm{d}_{l}^{t,t+1} = \mathrm{d}_{l}^{t+1}(x+\mathrm{u}_{l}^{t,t+1})-\mathrm{d}_{l}^{t}\). This representation includes the spatial shift and depth variation for correspondences in two successive frames. In this paper, both “depth” and “disparity” are used to denote the displacement of pixels in two views, although, strictly speaking, depth is proportional to the reciprocal of disparity.
The above correspondences suggest a few fundamental constraints that are used in spatial-temporal depth estimation for every neighboring four frames (illustrated in Fig. 1). The four commonly used scene flow conditions are
where [k] indexes channels and \(\varGamma(\cdot)\) is the robust Charbonnier function, i.e., a variant of the \(L_1\) regularizer, written as \(\varGamma(y^{2})=\sqrt{y^{2}+\epsilon^{2}}\), to reject matching outliers. \(E_F\) and \(E_B\) are stereo constraints and \(E_L\) and \(E_R\) are motion constraints for the neighboring-four-frame set. In what follows, we omit the subscript l for all left-view unknowns for simplicity’s sake.
In Eq. (1), each f has 5 channels as adopted in Sand and Teller (2006), i.e., \(f=(f_{I}, \frac{1}{4}(f_{G}-f_{R}), \frac{1}{4}(f_{G}-f_{B}), f_{\partial h}, f_{\partial v})\), to make the following computation slightly more robust against illumination variation compared to only considering RGB colors. \(f_{I}\) is the image intensity; \(f_{\partial h}\) and \(f_{\partial v}\) are the horizontal and vertical intensity gradients.
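As a concrete sketch of this representation, the Charbonnier penalty and the 5-channel transform can be written as below; taking intensity as the RGB average is our assumption, since the exact intensity conversion is not specified in the text:

```python
import numpy as np

def charbonnier(y, eps=1e-3):
    # Robust Charbonnier penalty Gamma(y^2) = sqrt(y^2 + eps^2),
    # a differentiable variant of the L1 norm used to reject matching outliers.
    return np.sqrt(y**2 + eps**2)

def five_channel(rgb):
    # 5-channel representation f = (f_I, (f_G-f_R)/4, (f_G-f_B)/4, f_dh, f_dv)
    # following Sand and Teller (2006); intensity as RGB average is an assumption.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    f_i = (r + g + b) / 3.0                 # image intensity (assumed conversion)
    f_dh = np.gradient(f_i, axis=1)         # horizontal intensity gradient
    f_dv = np.gradient(f_i, axis=0)         # vertical intensity gradient
    return np.stack([f_i, (g - r) / 4.0, (g - b) / 4.0, f_dh, f_dv], axis=-1)
```

The two color-difference channels discount uniform illumination changes, which is why this representation is more robust than raw RGB matching.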
Key Issues
Simultaneously estimating all unknowns, i.e., depth and scene flow, is computationally expensive. It takes hours for the variational method (Huguet and Devernay 2007) to compute one scene flow field for one frame. We also found that only using these constraints cannot produce temporally consistent results. As noted in the introduction, connecting only very close frames lacks the representational ability to describe the relationship among correspondences that are far apart in the sequence. This deficiency could cause devastating failure in estimation.
We show one example in Fig. 2 where (a) and (e) are two stereo frames in the binocular sequence. (b)–(d) contain depth maps estimated using the joint method of Huguet and Devernay (2007). Even by simultaneously considering the motion and stereo terms, along with the \(L_1\) regularization, the results are not very consistent temporally.
This is explainable: when all correspondences are locally established between successive frames, they are vulnerable to noise, occlusion and illumination variation. Unlike multi-view stereo, there is no global constraint to find and eliminate errors over the sequence. Even with the temporal constraints, when one depth estimate is problematic, all following computation steps can be affected, inevitably accumulating errors and eventually causing the estimation to fail. In light of this, additional measures should be taken, especially for long sequences.
We propose new chained temporal constraints to make long-range depth-flow estimation less problematic. We show in Fig. 2(f)–(g) our depth results and in (h) our 3D scene flow estimate. Their quality is very high. In what follows, we use gray-scale values to visualize disparity maps and 3D arrows to visualize image scene flow. 2D optical flow is color-coded according to the wheel in Fig. 2(h), in which hue represents motion direction and intensity indicates flow magnitude. 3D scene flow vectors are similarly color coded for their first two dimensions.
4 Our Approach
Our method generates consistent depth and scene flow from only a binocular video, utilizing pixel correspondence among multiple frames. The overview of our system is given in Algorithm 1, which consists of the main steps of initialization, temporal constraint establishment, and final depth and joint scene flow refinement.
4.1 Initialization
To initialize depth and motion, by convention, we apply the variational method to optical flow estimation and use discrete optimization for two-view stereo matching, the latter of which is capable of estimating large disparities.
Disparity Initialization
We compute the disparity \({\mathrm{d}}_{l}^{t}\) for each frame t by optimizing
where x is defined on Ω, the domain of x in the 2D image grid. \(E_F(\mathrm{d}^{t})\) is given in Eq. (1). \(E_{S_{d}}(\nabla{\mathrm{d}}^{t})\) is the truncated \(L_1\) function to preserve discontinuity, written as \(E_{S_{d}}(\nabla{\mathrm{d}}^{t}):=\min( |\nabla{\mathrm{d}}^{t}|, \rho)\), where ρ=3. It is a regularization term to preserve edges. \(\beta_d\) is a weight. With the discrete energy function, we solve Eq. (2) using graph cuts (Kolmogorov and Zabih 2004). Occlusion is further explicitly labeled using the uniqueness check (Scharstein and Szeliski 2002). When two pixels in the left view are mapped to the same one in the right view, the pixel with the smaller disparity is marked as occluded, with \(o_d(x)=0\).
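The uniqueness check described above can be sketched as follows, assuming integer disparities and the rectified correspondence \(x_r = x - \mathrm{d}\) (simplifications for illustration):

```python
import numpy as np

def uniqueness_occlusion(disp_left):
    # Uniqueness check: when two left-view pixels on the same scanline map to
    # the same right-view pixel, the one with the SMALLER disparity (farther
    # from the camera) is marked occluded, o_d = 0. Integer disparities assumed.
    h, w = disp_left.shape
    o_d = np.ones((h, w), dtype=np.uint8)
    for y in range(h):
        best = {}  # right-view column -> (disparity, left column) of winner
        for x in range(w):
            d = int(disp_left[y, x])
            xr = x - d  # rectified stereo: right correspondence at x - d
            if xr < 0:
                continue
            if xr in best:
                d_prev, x_prev = best[xr]
                if d > d_prev:          # current pixel closer; previous occluded
                    o_d[y, x_prev] = 0
                    best[xr] = (d, x)
                else:
                    o_d[y, x] = 0
            else:
                best[xr] = (d, x)
    return o_d
```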
Disparity maps for two consecutive frames are shown in Fig. 3(a)–(b). Textureless regions, such as the mountain, and region boundaries have inconsistent estimates. Close-ups of the depth boundaries are shown in (c). The inconsistent depth values cause flickering. See our supplementary video for the depth sequence.
Optical Flow Initialization
We initialize 2D motion in a variational framework. Both the forward and backward flow vectors are computed, denoted as \(\mathrm{u}^{t,t+1}_{l}\) and \({\mathrm{u}}^{t+1,t}_{l}\), for outlier rejection. For the following robust estimation, which is detailed in Sect. 4.2, we also compute bi-directional flow between frames \(f^{t}_{l}\) and \(f^{t+2}_{l}\) (denoted as \(\mathrm{u}^{t,t+2}_{l}\) and \(\mathrm{u}^{t+2,t}_{l}\)), and between frames \(f^{t}_{l}\) and \(f^{t+3}_{l}\) (denoted as \(\mathrm{u}^{t,t+3}_{l}\) and \(\mathrm{u}^{t+3,t}_{l}\)). Figure 4 illustrates these vectors. As all motion vectors are computed similarly, we only describe estimation of \(\mathrm{u}_{l}^{t,t+1}\). It is achieved by minimizing
where \(E_{S_{u}}(\nabla\mathrm{u}_{l}^{t,t+1})\) is the total variation regularizer, expressed as \(\sqrt{\|\nabla{\mathrm{u}}_{l}^{t,t+1}\|^{2}+\epsilon^{2}}\) to preserve edges. It is a convex penalty function and is commonly used in the variational framework. \(\beta_u\) is a weight controlling the smoothness of the computed flow fields. Equation (3) is optimized by the efficient method of Brox et al. (2004). The initially estimated optical flow for two consecutive frames is shown in Fig. 3(d)–(e).
4.2 Chained Temporal Priors
Part of our contribution lies in constructing motion trajectories after initialization, which link corresponding pixels across several frames, and in proposing new structure profiles, essential in our system for forming new temporal constraints.
Robust Motion Trajectory Estimation
For each pixel, to find its correspondences in other frames, we build motion trajectories based on the optical flow estimate in each sequence. Note that motion vectors are not necessarily integer-valued and thus may link sub-pixel positions. Specifically, x+ut,t+1(x) in f t+1, mapped from x in f t based on the motion vector ut,t+1(x), may be a fractional position lying between four pixels, as shown in Fig. 5(a). When searching for the correspondence of x+ut,t+1(x) in frame t+2 or back in frame t, the motion vector for x+ut,t+1(x) has to be estimated, generally by spatial interpolation.
Here we propose a simple yet effective way to improve interpolation accuracy. We observe in our experiments that simple distance-based interpolation, e.g., the bilinear or bicubic method, could produce erroneous results. Figure 5(a) shows one example in which the orange and green regions undergo motion in different directions. A point projected between pixels \(\{y_0, y_1, y_2, y_3\}\), after bilinear interpolation, has near-zero motion magnitude, which is obviously inappropriate. This problem is quite common for sequences containing dynamic objects. We ameliorate it by incorporating color information to guide interpolation bilaterally together with spatial distance, inspired by spatial bilateral filtering (Tomasi and Manduchi 1998). The operator is written as
where t′ can be t, t+2, or other frame indexes depending on the motion definition. |w| is for normalization. The term \((f_{I}^{t}(x)-f_{I}^{t+1}(y_{i}))^{2}/\sigma_{2}\) considers the brightness similarity of points in different frames. \(\sigma_1\) and \(\sigma_2\) are set to 0.4 and 0.3, respectively. A comparison of bilateral interpolation and standard bilinear interpolation is given in Fig. 5(b). In the two motion field patches that contain dynamic object boundaries, it is clear that the bilateral method produces much sharper motion boundaries.
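A minimal sketch of this bilateral interpolation, assuming Gaussian falloff for both the spatial and brightness terms (the precise form of the operator in Eq. (4) is not reproduced in the text here, so this is an illustrative variant):

```python
import numpy as np

def bilateral_flow_interp(flow_next, f_i_t, f_i_next, x, p,
                          sigma1=0.4, sigma2=0.3):
    # Interpolate the flow at fractional position p = (px, py) in frame t+1
    # from its four integer neighbours, weighting each neighbour by spatial
    # distance AND by brightness similarity to the source pixel x = (col, row)
    # in frame t. flow_next: (H, W, 2) flow field; f_i_*: intensity images.
    px, py = p
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    acc = np.zeros(2)
    wsum = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            ds = (px - xi) ** 2 + (py - yi) ** 2              # spatial term
            dc = (f_i_t[x[1], x[0]] - f_i_next[yi, xi]) ** 2  # brightness term
            w = np.exp(-ds / sigma1 - dc / sigma2)            # bilateral weight
            acc = acc + w * flow_next[yi, xi]
            wsum += w
    return acc / wsum   # |w| normalization
```

Because the brightness term suppresses neighbours from the other side of a motion boundary, the interpolated vector follows the region the source pixel belongs to rather than averaging across the boundary.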
With this interpolation scheme, we link corresponding points among frames, which forms motion trajectories. Due to inevitable estimation errors in occlusion regions, object boundaries, and textureless regions, we identify and exclude outliers with bidirectional flow vectors (Huguet and Devernay 2007). In particular, we project x+ut,t+1(x) in f t+1, which is mapped from x in f t based on the motion vector ut,t+1(x), back to f t. We sum the two vectors with opposite directions and define the map o u, to mark glaring errors, as
where
and τ is the error threshold, set to 1. Satisfying the inequality \(|\mathrm{u}_{o}^{t, t+1}|\geq\tau\) means the motion vectors that are supposed to be opposite are inconsistent. In this case we discard ut,t+1(x) and set \(o_u(x)\) to 0.
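The bidirectional consistency test of Eq. (5) reduces to summing the forward vector with the backward vector sampled at the forward target; a sketch:

```python
import numpy as np

def bidirectional_check(u_fwd, u_bwd_at_target, tau=1.0):
    # Forward-backward consistency: the forward vector and the backward vector
    # sampled at the forward target should cancel. A residual of magnitude
    # >= tau marks the motion vector as an outlier (returns 0, i.e. o_u = 0);
    # otherwise the vector is kept (returns 1).
    u_o = np.asarray(u_fwd) + np.asarray(u_bwd_at_target)
    return 1 if np.linalg.norm(u_o) < tau else 0
```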
Removing a problematic flow vector shortens a motion trajectory. If it is too short, it may provide insufficient temporal information. Observing that many outliers are caused by occasional noise and pixel color variation, which generally do not persist across frames for the same pixel, we also utilize longer-range bidirectional flow vectors, i.e., \(\mathrm{u}^{t,t+2}_{l}(x)\) and \(\mathrm{u}^{t+2,t}_{l}(x)\), as illustrated in Fig. 4. If they pass the same bidirectional consistency check, we reconnect the trajectory from frame t to t+2. Otherwise, we continue to test the pair \(\mathrm{u}^{t,t+3}_{l}(x)\) and \(\mathrm{u}^{t+3,t}_{l}(x)\). Only when all three checks fail do we break the trajectory. The algorithm to build a motion trajectory is detailed in Algorithm 2.
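The fallback logic of Algorithm 2 can be sketched as below; the `check` callback, standing in for the bidirectional consistency test, is an illustrative interface rather than the paper's implementation:

```python
def build_trajectory(t_start, n_frames, check):
    # Extend a trajectory forward from frame t_start. check(t, t+k) returns
    # True when the bidirectional test passes for the k-frame link at the
    # current point. On failure of the 1-frame link we retry the 2- and
    # 3-frame links; only when all three fail does the trajectory break.
    links = []
    t = t_start
    while t < n_frames - 1:
        for k in (1, 2, 3):
            if t + k < n_frames and check(t, t + k):
                links.append((t, t + k))
                t += k
                break
        else:
            break   # all three checks failed: trajectory ends here
    return links
```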
With these trajectories built in the forward and backward directions, we can find a series of correspondences in other frames for each pixel x t in frame t. The motion vector w.r.t. x t and the correspondence in frame t±i is expressed as ut,t±i, which is the sum of all consecutive motion vectors in the trajectory from frame t to t±i for x t.
Figure 5(c)–(d) show respectively motion fields ut,t−5 and ut,t+5 produced by Algorithm 2. Unreliable matching is marked as black (NaN in Algorithm 2). It is typically caused by motion occlusion, which breaks the motion trajectories. This type of disconnection is, however, desirable because occlusion naturally shortens trajectories. The results in (c)–(d) also demonstrate that a pixel is occluded along at most one direction, but not both.
To demonstrate the effectiveness of the longer-range bidirectional flows \(\mathrm{u}^{t,t+2}_{l}(x)\) and \(\mathrm{u}^{t,t+3}_{l}(x)\) in generating trajectories, we show a comparison in Fig. 5(d) and (e), where (e) is produced with trajectory construction not using long-range flow and relying only on \(\mathrm{u}^{t,t+1}_{l}\). Fewer black pixels appear in the flow field in (d) due to higher robustness against occasional noise and estimation outliers.
Trajectory-Based Structure Profile
To build a practical system with constraints applied across multiple frames, besides trajectories, we also base our computation on a central observation—that is, motion and disparity boundaries are mostly in line with image structure boundaries. Salient edge maps, in this regard, are useful in shaping motion and disparity.
Unfortunately, depth and flow edges are very sensitive to noise, blurriness, illumination variation, and other kinds of image degradation. When taking sequences into consideration, boundaries of dynamic objects vary over time and are usually composed of different sets of pixels. Only using image gradient information for each frame separately can hardly yield reliable and consistent edges, as illustrated in Fig. 6(a).
Our goal in this part is to compute a series of salient structures that are temporally consistent and spatially conspicuous, regardless of whether they are on dynamic objects or along illumination variation boundaries. To this end, we first calculate edge magnitude maps separately for each frame after bilateral smoothing to remove a small degree of noise. The magnitude operator is \(\sqrt{f_{\partial h}^{2}+f_{\partial v}^{2}}\). Then we compute an edge occurrence map \(C_{f}^{t}\) by simply setting pixels whose magnitudes are smaller than a threshold (generally set to 0.01) to zero. It indicates the occurrence of significant structure edges. All maps in the input sequence, together with the computed dense motion trajectories, are used to establish a structure profile map for each frame.
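A sketch of the edge occurrence map computation under the stated threshold:

```python
import numpy as np

def edge_occurrence(f_dh, f_dv, thresh=0.01):
    # Edge occurrence map C_f^t: the gradient magnitude sqrt(f_dh^2 + f_dv^2),
    # with values below the threshold (0.01 in the text) zeroed out, so the
    # map records only significant structure edges.
    mag = np.sqrt(f_dh**2 + f_dv**2)
    mag[mag < thresh] = 0.0
    return mag
```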
For each pixel x in frame t, we project all edge occurrence values in other frames, according to the correspondences along the trajectories {ut,t+i} and {ut,t−i}, to it, as illustrated in Fig. 7. In this process, we can find consistent edge occurrence values where errors, after a voting-like process, can be quickly suppressed.
The corresponding point of x in frame t+i is x+ut,t+i(x) after chain projection, where ut,t+i(x) is the overall motion vector. The average of the occurrence values along the trajectory is expressed as
where n is the number of corresponding pixels along the trajectory. The structure profile \(\tilde{C}_{f}^{t}(x)\) embodies statistics of the occurrence of strong edges over multiple frames. Its value reveals the chance that the current pixel lies on a consistent edge.
We do not introduce weights in Eq. (6) because true edges can typically exist in a large number of frames consistently while a false one caused by noise and estimation errors does not. By projecting all correspondences to the current pixel and adding their occurrence values, a true edge point can gather a large confidence value. Occasional outliers cannot receive consistent support temporally, and therefore only have small confidence. In this voting-like process, originally weak but consistent edges can be properly enhanced.
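The voting-like averaging of Eq. (6) can be sketched as follows; nearest-pixel sampling of the occurrence maps is a simplification here, whereas the paper interpolates at sub-pixel trajectory positions:

```python
def structure_profile_value(c_maps, traj):
    # Structure profile value for one pixel: the unweighted average of
    # edge-occurrence values sampled at the pixel's trajectory correspondences,
    # given as (frame, row, col) triples. A true edge gathers consistent
    # support over frames; an outlier edge in one frame is diluted.
    vals = [c_maps[t][r][c] for (t, r, c) in traj]
    return sum(vals) / len(vals)
```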
The resulting edge occurrence maps are \(\{\tilde{C}_{f}^{t}\}\), which are used to define edge priors. Figure 6(b) shows two edge occurrence maps, where inconsistent edges are notably weakened and consistent ones are enhanced. This profile construction process is robust against noise and sudden illumination change.
Trajectory-Based Depth/Motion Profile
Another set of important profiles is constructed based on the fact that \(\tilde{C}_{f}\) does not contain scene flow information. Spatial-temporal constraints were proposed in Black (1994), Bruhn and Weickert (2005), Bruhn et al. (2005) to enforce temporal consistency of motion vectors. Locally constant speed (Bruhn and Weickert 2005; Bruhn et al. 2005) and acceleration (Black 1994) are typical assumptions. We use a temporal-linear model to fit motion and depth, in order to reject outliers while allowing for depth and motion variation.
Here, we describe our depth profile estimation procedure. The motion profile can be computed similarly. For each pixel, we adopt a linear parametric model to fit depth after projecting values from other frames to t temporally based on our trajectories. The linear model is controlled by two parameters \(w_0\) and \(w_1\), representing depth offset and slope. Regression needs to minimize the energy
where i indexes frames in the trajectory, \(\gamma_i(x)\) is the weight for the (t+i)-th frame, and \(1/\mathrm{d}^{t+i}\) is the corresponding depth for x in frame t+i. For sub-pixel point positions, we interpolate depth using the bilateral weights described in Eq. (4). \(\gamma_i(x)\) plays an important role and is defined as
which embodies two parts. They are, respectively, the temporal weight \(e^{-i^{2}/\sigma_{t}}\) in a Gaussian window, which reduces the influence of frames far from the current frame t, with \(\sigma_t=10\), and the depth occlusion \(o_d(x)\), labeled using the uniqueness check (Scharstein and Szeliski 2002). A zero \(o_d(x)\) indicates occlusion, which correspondingly decreases the weight \(\gamma_i(x)\) to zero.
Before regression in Eq. (7), we check the sum \(\sum_i \gamma_i\). If \(\sum_i \gamma_i < 3\), we skip regression for robustness. Equation (7) provides an effective way to gather statistical disparity information from multiple frames without much occlusion influence. As the initial disparity values are optimized in a global fashion in each frame, the robust regression process has in effect incorporated neighboring disparity information for each pixel.
In minimization, taking derivatives w.r.t. the parameters \(w_0(x)\) and \(w_1(x)\) and setting them to zero yields two equations. After a few simple algebraic operations, closed-form solutions are obtained as
Finally, given the estimated linear parameters, the depth profile \(\tilde{\mathrm{d}}^{t}(x)\) for x in frame t is written as
We show the \(w_0\) and \(w_1\) results in Fig. 8. \(w_0\) corresponds to scene depth. It is made temporally more consistent after optimization thanks to the regression, which rejects random outliers and preserves boundaries. The average magnitude of the \(w_1\) map is much smaller than that of \(w_0\), showing that depth does not undergo abrupt change over time.
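Since the closed-form expressions appear as display equations not reproduced in this text, the following sketch gives the standard weighted least-squares solution of the linear model \(z_i \approx w_0 + w_1 i\) in Eq. (7), together with the weight \(\gamma_i\) as described:

```python
import numpy as np

def gamma_weight(i, o_d_val, sigma_t=10.0):
    # Weight gamma_i: temporal Gaussian falloff e^{-i^2/sigma_t} times the
    # occlusion flag o_d (0 when occluded), as described in the text.
    return np.exp(-i * i / sigma_t) * o_d_val

def weighted_linear_fit(idx, z, gamma):
    # Closed-form weighted least-squares fit of z_i ~ w0 + w1 * i. Setting the
    # derivatives of the regression energy w.r.t. w0 and w1 to zero yields a
    # 2x2 normal system, solved here directly.
    g = np.asarray(gamma, dtype=float)
    i = np.asarray(idx, dtype=float)
    z = np.asarray(z, dtype=float)
    s0, s1, s2 = g.sum(), (g * i).sum(), (g * i * i).sum()
    b0, b1 = (g * z).sum(), (g * i * z).sum()
    det = s0 * s2 - s1 * s1
    w0 = (s2 * b0 - s1 * b1) / det
    w1 = (s0 * b1 - s1 * b0) / det
    return w0, w1
```

Occluded samples receive zero weight, so they simply drop out of the normal equations rather than biasing the fit.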
For motion profile computation, we similarly apply the linear model, yielding
where ũt(x) is the motion profile for pixel x in frame t and
\(o_u\) is the motion occlusion estimated through the bidirectional check. As shown in Fig. 5(c)–(d), occlusion generally arises along one direction (either forward or backward) but not both. So the situation that \(o_u(x)=0\) for all correspondences of one pixel in the trajectory seldom occurs, making it always possible to find points to gather statistics and refine motion during regression. We adopt a smaller \(\sigma_t\) (set to 3) because motion variation is typically larger than the change of scene depth.
4.3 Temporally-Constrained Depth and Scene Flow
In this section, we describe the central steps to estimate depth and image scene flow given the temporal constraints. We find that estimating depth and scene flow in the same pass is computationally expensive (Wedel et al. 2011) and unstable, because depth and scene flow are completely different variables by nature. Disparity can have very large values (up to tens or hundreds of pixels) while scene flow captures object position variation and thus has much smaller scale. Putting them together makes variational optimization difficult to perform satisfactorily and prone to getting stuck in local optima. As reported in Huguet and Devernay (2007), Wedel et al. (2011), a fully joint procedure to estimate depth and scene flow takes a long time.
It is also notable that the initial disparity values estimated frame-by-frame lack sub-pixel accuracy and are unsuitable for δdt,t+1 estimation in scene flow. With these concerns, we decouple depth and scene flow, and optimize the disparity sequence with the long-range temporal constraint. Scene flow is then updated. The algorithm is outlined in Algorithm 3. We describe below the spatio-temporal functions to constrain depth and scene flow.
4.3.1 Consistent Depth Estimation
To refine depth, we minimize
where \(\alpha_d\) and \(\beta_d\) are two weights. \(E_F(\mathrm{d}^{t})\) is the data cost defined in Eq. (1), which relates the two views at time t. The occlusion variable \({\hat{o}}_{d}(x)\) helps reduce the adverse influence of occasional occlusion. We define \({\hat{o}}_{d}(x)=\mathrm{max}(o_{d}(x),0.01)\) for the sake of numerical stability. \(E_T(\mathrm{d}^{t})\) is the temporal depth data term, defined as
where \(\tilde{\mathrm{d}}^{t}\) is the fitted depth profile. This seemingly simple term is essential in our method because it incorporates long-range temporal information from multiple frames. \(E_F\), on the contrary, is only a local frame-wise data term.
The structure profile is incorporated in depth regularization to enforce structure consistency among frames. We only impose smoothness for regions with small edge-occurrence values in \(\tilde{C}_{f}^{t}\) and allow depth discontinuity to take place where \(\tilde{C}_{f}^{t}(x)\) is large. On account of possible subpixel errors in averaging the edge-occurrence maps, edges in \(\tilde{C}_{f}^{t}\) could be slightly wider than they should be, as shown in Fig. 6. It is inappropriate to naively enforce no or small smoothness for these pixels because, without necessary regularization, edges cannot be well preserved.
We turn to an anisotropic smoothness method (Xiao et al. 2006; Sun et al. 2008; Zimmer et al. 2009) to provide critical constraints for edge-preserving regularization. We decompose the depth gradient \(\nabla\mathrm{d}\) into \(\{\nabla\mathrm{d}^{\|}, \nabla\mathrm{d}^{\bot}\}\) according to the image gradient, where
where \(\nabla f_{I}^{\|}\) is the unit vector parallel to the frame intensity gradient ∇f I , i.e. \(\nabla f_{I}^{\|}=\frac{\nabla f_{I}}{\|\nabla f_{I}\|}\). We propose the following function for anisotropic smoothness regularization:
where \(\varGamma(\cdot)\) is the robust Charbonnier function defined in Eq. (1). In Eq. (14), for all pixels, smoothness is enforced along the isophote direction, while discontinuity along the gradient is allowed only for reliable strong edges, which correspond to large \(\tilde{C}_{f}\). Hence, Eq. (14) provides the necessary constraints in different directions. We note that strong edges caused by occasional outliers in one frame do not affect the anisotropic regularizer. On the other hand, compared with \(C_f\), temporally consistent edges have been enhanced in \(\tilde{C}_{f}\), which boosts the anisotropic properties accordingly.
With a few algebraic operations, Eq. (14) can be written in the form of a diffusion tensor
where \(D(\nabla f_{I})\) is the diffusion tensor defined as
where \(\nabla f_{I}^{\|}\) and \(\nabla f_{I}^{\bot}\) are two unit vectors parallel and perpendicular to ∇f I , respectively. Equation (12) can be efficiently minimized using variational solvers to enable sub-pixel accuracy. Note that the inherent difficulty to solve for large displacements in the variational framework is greatly reduced with our initial estimate \(\tilde{\mathrm{d}}^{t}\), obtained by robust regression. Our energy minimization is discussed in Sect. 4.3.3.
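For reference, a minimal NumPy sketch of a per-pixel diffusion tensor built from image gradients is given below. It assumes the form \(D = \nabla f_I^{\bot}\nabla f_I^{\bot T} + (1-\tilde{C}_f)\,\nabla f_I^{\|}\nabla f_I^{\|T}\), which matches the behavior described above (full smoothing along the isophote, attenuated smoothing across strong edges); the function name and the exact weighting are illustrative, not the paper's implementation.

```python
import numpy as np

def diffusion_tensor(fI, C_tilde, eps=1e-6):
    """Per-pixel 2x2 diffusion tensor from image gradients (illustrative sketch).

    Smoothing is kept along the isophote direction; smoothing across edges
    is attenuated by the edge-occurrence value C_tilde in [0, 1].
    """
    gy, gx = np.gradient(fI.astype(np.float64))
    mag = np.sqrt(gx**2 + gy**2) + eps
    # Unit vectors parallel and perpendicular to the intensity gradient.
    n_par = np.stack([gx / mag, gy / mag], axis=-1)    # H x W x 2
    n_perp = np.stack([-gy / mag, gx / mag], axis=-1)  # H x W x 2
    # D = n_perp n_perp^T + (1 - C_tilde) n_par n_par^T
    D = (np.einsum('...i,...j->...ij', n_perp, n_perp)
         + (1.0 - C_tilde)[..., None, None]
         * np.einsum('...i,...j->...ij', n_par, n_par))
    return D
```

When \(\tilde{C}_f = 0\) the tensor reduces to the identity (isotropic smoothing); when \(\tilde{C}_f = 1\) only the isophote direction is smoothed.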
A depth map result is shown in Fig. 9(a), with a comparison in (b). The bottom left subfigure of (b) is the depth map obtained by enforcing smoothness uniformly in all directions with strength \((1-\tilde {C}_{f}^{t})\). When \(\tilde{C}_{f}^{t}\) is large near the boundaries, the depth estimation is ill-posed, resulting in a problematic map. Our result with anisotropic regularization preserves edges much better.
4.3.2 Scene Flow Estimation
We now estimate 2D motion and depth variation with necessary temporal constraints for image scene flow.
Data Fidelity Term
Our new data cost is given by
where \(E_B\), \(E_L\), and \(E_R\) are the traditional scene flow constraints defined in Eq. (1). \(\{\mathrm{d}^t\}\) are the disparities computed in the preceding estimation step (described in Sect. 4.3.1). \(\hat{o}_{u}\), where \(\hat{o}_{u}=\mathrm{max}(o_{u},0.01)\), and \(\hat{o}_{d}\) mask out unreliable depth and flow primarily caused by occlusion. \(\alpha_u\) is the weight for the long-range temporal constraint.
\(E_{Td}\) and \(E_{Tu}\) incorporate our new temporal profiles. \(E_{Td}\) is defined as
where \(\mathrm{d}^t\) is a shorthand for \(\mathrm{d}^t(x)\) and \(\mathrm{d}'^{t+1}:=\mathrm{d}^{t+1}(x+u^{t,t+1}(x))\). They are the disparities of x in the t-th and (t+1)-th frames, respectively. Our depth refinement makes \(\delta\mathrm{d}^{t,t+1}\) a sub-pixel value.
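The quantity \(\delta\mathrm{d}^{t,t+1}\) can be sketched as a motion-compensated disparity difference. The following NumPy snippet is a simplified illustration using nearest-neighbour warping (the paper's variational refinement works at sub-pixel accuracy, which would call for bilinear interpolation); the function name is ours, not the paper's.

```python
import numpy as np

def disparity_change(d_t, d_t1, u, v):
    """delta d^{t,t+1}(x) = d^{t+1}(x + u^{t,t+1}(x)) - d^t(x).

    Nearest-neighbour warping sketch; coordinates are clamped to the image.
    """
    H, W = d_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Target coordinates after applying the 2D motion (u, v).
    xw = np.clip(np.rint(xs + u).astype(int), 0, W - 1)
    yw = np.clip(np.rint(ys + v).astype(int), 0, H - 1)
    return d_t1[yw, xw] - d_t
```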
Similarly, the flow constraint is defined as
Smoothness Term
In regularization, we penalize sudden and significant changes of the 2D motion \(\nabla u^{t,t+1}\) and of the disparity \(\nabla\delta\mathrm{d}^{t,t+1}\) w.r.t. the frame diffusion tensor, yielding an anisotropic smoothing effect with the edge maps as guidance. We define
Here κ is a weight set to 0.5. The final objective function for scene flow estimation with temporal constraints is given by
We minimize it using the variational method detailed below.
4.3.3 Energy Minimization
Equations (12) and (20) can be solved in a coarse-to-fine framework using variational solvers. However, we found this procedure unnecessary, because in our system the variables are well initialized and estimated before optimization in both stages. To compute depth using Eq. (12), the temporal depth profile is available. When jointly optimizing the elements of scene flow using Eq. (20), the updated depth and motion profiles provide good initialization. Moreover, energy minimization with proper initialization ameliorates the inherent estimation problems for large displacements (Brox et al. 2009; Xu et al. 2010), even at the original image resolution.
With this consideration, we perform Taylor series expansion on the data term and solve for the increments
\(\mathrm{d}^{(0)}\) is initially set to \(\tilde{\mathrm{d}}\). After Δd is computed, \(\mathrm{d}^{(0)}\) is updated to \(\mathrm{d}^{(0)}+\Delta\mathrm{d}\). Then we estimate Δd again. We repeat this procedure three times. The final depth map is computed as \(\mathrm{d}=\mathrm{d}^{(0)}+\Delta\mathrm{d}\). Given the refined depth map, we set \(s^{(0)}=(\tilde{{u}},\tilde{{v}},\mathrm{d}^{t+1}(x+\tilde{u})-\mathrm{d}^{t})\) and iteratively solve for the increment Δs.
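The iterative update above can be sketched as a small driver loop. The solver for the linearized Euler-Lagrange system is abstracted behind a callable here (`solve_increment` is a hypothetical placeholder, not a function from the paper):

```python
def refine_depth(d_init, solve_increment, n_iters=3):
    """Incremental refinement loop (warping-strategy sketch).

    d^{(0)} is initialised from the fitted profile; each iteration solves the
    linearised energy for an increment and re-expands around the update.
    """
    d = d_init
    for _ in range(n_iters):
        delta = solve_increment(d)  # linearised Euler-Lagrange solve
        d = d + delta
    return d
```

The same driver applies to the scene flow increment Δs once the refined depth is available.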
This scheme is referred to as the warping strategy in Brox et al. (2004). In our system, to estimate Δd and Δs, the overall energy functions are minimized by solving their corresponding Euler-Lagrange equations. To simplify the description, we denote
In addition, \({\varGamma'}(y^{2}):=1/\sqrt{y^{2}+\epsilon^{2}}\), representing derivative of the Γ function. The Euler-Lagrange equation for Eq. (12) is given by
where div(⋅) is the divergence operator. After discretizing the equation, the only remaining nonlinearity lies in Γ′. A fixed-point loop similar to those in Brox et al. (2004), Bruhn and Weickert (2005) is applied, which removes the nonlinearity of Γ′ by using values obtained from the previous iteration. This step yields linear equations that can be quickly solved by standard linear solvers. More details are included in Appendix A.
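The Charbonnier penalty and its derivative, which drives the fixed-point (lagged-diffusivity) weights, can be written directly from the definitions above; the value of ε is a small constant (the default below is an assumption for illustration):

```python
import numpy as np

def charbonnier(y2, eps=1e-3):
    """Robust Charbonnier penalty Gamma(y^2) = sqrt(y^2 + eps^2)."""
    return np.sqrt(y2 + eps**2)

def charbonnier_deriv(y2, eps=1e-3):
    """Gamma'(y^2) = 1 / sqrt(y^2 + eps^2), the lagged-diffusivity weight."""
    return 1.0 / np.sqrt(y2 + eps**2)
```

In each fixed-point iteration, `charbonnier_deriv` is evaluated at the previous iterate's residuals, turning the Euler-Lagrange system into a linear one.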
Similarly, for final scene flow computation, we denote
The Euler-Lagrange equations for Eq. (20) are given by
where
representing the derivatives of the smoothness penalty on 2D motion. \(\varGamma'_{sd}\) is expressed as \(\varGamma'(\nabla\delta\mathrm{d}^{T} D(\nabla f_{I})\nabla\delta\mathrm{d})\), a penalty on depth variation. Applying a fixed-point conversion, the system becomes linear w.r.t. Δs and can thus be solved easily. Implementation details are in Appendix B.
5 Experiments
We extensively evaluate our method. We first perform quantitative evaluation on the POV-Ray-rendered sequences (Vaudrey et al. 2008; University of Auckland 2008), where ground truth depth and scene flow fields are available. It is notable that very few methods in the literature have reported error statistics on video sequences because most of them used the four-frame configuration without leveraging long-range temporal information. In contrast, we provide both per-sequence statistics and detailed per-frame errors for each step of the proposed method. Complete results are presented in the supplementary video. To evaluate the long-range temporal constraints, we also conduct experiments that separately test important components, including the structure, long-range depth, and motion profiles. Finally, we show results on challenging sequences containing large dynamic objects.
For synthetic sequences, \(\beta_d\) and \(\beta_u\) are set to 10 and 15, respectively, and \(\alpha_d\) and \(\alpha_u\) are both set to 10. For natural sequences, we use \(\beta_d=15\), \(\beta_u=20\), \(\alpha_d=15\), and \(\alpha_u=5\). All other parameters are fixed as specified in the paper.
5.1 Quantitative Evaluation
We employ the mean absolute error (MAE) to measure the error between our estimate d and the ground-truth disparity d∗:

\[\mathrm{MAE}(\mathrm{d})=\frac{1}{|\Omega|}\sum_{x\in\Omega}\bigl|\mathrm{d}(x)-\mathrm{d}^{*}(x)\bigr|,\]

where \(|\Omega|\) is the number of pixels in image Ω.
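The MAE is straightforward to compute; an optional mask can restrict the evaluation domain Ω (the function signature is illustrative):

```python
import numpy as np

def mae(d, d_star, mask=None):
    """Mean absolute error over the evaluation domain Omega."""
    if mask is None:
        mask = np.ones_like(d, dtype=bool)
    return np.abs(d - d_star)[mask].mean()
```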
For scene flow, we adopt the average angular error (AAE) and the root mean square error (RMSE) (Wedel et al. 2011) to evaluate the angular and end-point errors:
The RMSE measures pixel-level end-point errors, while the AAE is measured in degrees. For a fair comparison, we exclude image borders, occluded regions, and the infinite background, as suggested in Wedel et al. (2011).
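As a reference, the two flow metrics can be sketched as follows, using the common \((u, v, 1)\) convention for the angular error from optical-flow evaluation (Baker et al. 2011); the exact scene flow variant of Wedel et al. (2011) may differ in which components enter the error, so this is an illustrative sketch:

```python
import numpy as np

def aae(u, v, u_gt, v_gt):
    """Average angular error in degrees, (u, v, 1) convention."""
    num = u * u_gt + v * v_gt + 1.0
    den = np.sqrt((u**2 + v**2 + 1.0) * (u_gt**2 + v_gt**2 + 1.0))
    ang = np.arccos(np.clip(num / den, -1.0, 1.0))
    return np.degrees(ang).mean()

def rmse(u, v, u_gt, v_gt):
    """Root mean square end-point error in pixels."""
    return np.sqrt(((u - u_gt)**2 + (v - v_gt)**2).mean())
```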
Traffic Scene 1 is a 100-frame sequence (University of Auckland 2008) that contains rapidly moving cars. Two frames are shown in Fig. 11(a)–(b); the whole sequence is provided in the supplementary video. The initial disparity map (d) is erroneous at occluded regions. In the robust regression result (e), depth at occlusions is refined and estimation outliers are reduced owing to the use of temporal information. Our final variational refinement further increases sub-pixel accuracy and lowers the MAE to 0.1 pixel. The absolute error is visualized in (g)–(i), following the representation in Wedel et al. (2011): gray pixels with value 128 indicate no error, and brighter pixels indicate larger positive errors. The 3D scene flow fields in different stages are shown in (j)–(l), with the angular error maps coded in (m)–(p). Note that the rear window of the car and the dark shadow regions consistently exhibit large errors, due to transparency and large color variation.
The corresponding statistics for the whole sequence are listed in Table 1(a), where the errors produced in different stages are shown. They indicate that the two main steps each yield a substantial improvement. Detailed per-frame error plots are in the first row of Fig. 10. Our results improve consistently for all frames at each stage.
The results of Traffic Scene 2 are shown in Fig. 12. Sample frames and the ground truth depth are shown in (a)–(c). Our results at different stages are shown in (d)–(f) with errors coded in (g)–(i). The 3D scene flow vectors and errors are shown in (j)–(l). The corresponding errors are listed in Table 1(b). The per-frame statistics are plotted in the second row of Fig. 10. These visual and quantitative results verify the effectiveness of our method.
5.2 Multi-Pass Depth/Scene Flow Estimation
After obtaining scene flow and depth fields in our framework, we can use them to reconstruct motion trajectories and further refine the estimates. We have conducted experiments running our system for several passes, each taking the result from the previous pass as initialization. In this process, the scene flow fields are computed bidirectionally to construct new motion trajectories. The error statistics for Traffic Scene 1 are provided in Table 2. Notably, multi-pass estimation improves the result only marginally because the temporal information has already been well incorporated in the first-pass estimation. For the sake of efficiency, we run our algorithm only once for all examples.
5.3 Structure Profile Evaluation
In this section we evaluate the usefulness of our structure profile \(\tilde{C}^{t}_{f}\). Table 3 contains the statistics on Traffic Scene 1. The first-row errors are obtained by turning off the structure profile and simply setting \(\tilde{C}^{t}_{f}(x)=0\). The second-row statistics use the edge map \({C}^{t}_{f}(x)\) constructed in each single frame as the structure prior. The last-row statistics are obtained with the temporal structure profile \(\tilde{C}^{t}_{f}(x)\). It is clear that employing our structure profiles yields the lowest errors.
In Fig. 13, we compare the disparity maps produced using the single-frame edge prior and our temporal structure profile respectively. Temporally more consistent edges are preserved using our complete framework. In the upper image in (b), the rightmost car boundary is less erroneous. Problematic edges are also weakened, as compared in the bottom images. The corresponding edge maps are visualized in Fig. 6.
5.4 Long-Range Temporal Constraint Evaluation
We evaluate another important constraint in our method: the long-range temporal depth and motion profiles. We conduct experiments with and without the robust temporal depth and motion profiles, keeping all other components and parameters the same for a fair comparison.
When the temporal depth/motion constraint is not used, we instead perform variational minimization by setting \(\alpha_u\) and \(\alpha_d\) to zero in Eqs. (12) and (16). One comparison is given in Fig. 14, where (a) shows three frames computed without the temporal constraints and (b) shows our results. Obvious artifacts on the moving car are visible in (a). The estimates with the temporal constraints are much better, especially for pixels in highlight and occlusion regions. Table 4 lists the corresponding errors generated with and without the long-range temporal constraints.
5.5 Challenging Natural Video Examples
We also apply our method to several natural video sequences containing multiple dynamic objects with subtle motion and deforming surfaces. Figures 15 and 16 show intermediate and final results of the Balloons and Fish sequences, respectively. In each figure, the rows show, in top-down order, the input images, initial depth maps, trajectory-based edge profiles that faithfully enhance boundaries, depth maps obtained with robust regression, final depth results after sub-pixel refinement, and the final 3D scene flow fields. Figure 17 shows another dynamic scene with a moving person; our depth maps and scene flow fields are in the second and third rows, respectively. We note that consistent depth and scene flow estimation for sequences containing large dynamic objects is very challenging due to noise, illumination variation, and occlusion. Our final results properly maintain temporal consistency.
As the 2D-plus-depth solution for 3DTV utilizes depth information to synthesize new views, our method is capable of converting standard two-view 3D videos into the 2D-plus-depth format. To demonstrate the accuracy and consistency of our resulting depth maps, we synthesize new view sequences in Figs. 18 and 19. New views produced from the initial depth are also shown for comparison, which contain obvious visual artifacts. In our final results, continuous variation of depth both spatially and temporally is preserved, together with discontinuous object boundaries, thanks to the effective regularization, multi-frame profile construction, and robust optimization. The full novel-view-synthesis results are in the supplementary video.
5.6 Computational Cost
Although our method employs long-range temporal constraints, the computational cost does not grow substantially compared with single-frame methods. Despite constructing trajectory-based priors, the optimization for different frames remains independent. It is therefore also possible to use parallel computing techniques, such as OpenMP (OpenMP ARB 2012), for acceleration.
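Because the per-frame optimization is independent once the trajectory-based priors are built, the per-frame work can be mapped over a worker pool. The sketch below uses Python's thread pool as a stand-in for the OpenMP-style parallelism (for a pure-Python solver a process pool would avoid the GIL); `per_frame_solver` is a hypothetical placeholder for one frame's depth/scene-flow solve:

```python
from concurrent.futures import ThreadPoolExecutor

def process_sequence(frames, per_frame_solver, workers=4):
    """Run the independent per-frame optimisation in parallel (sketch)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results are returned in input order, one per frame.
        return list(pool.map(per_frame_solver, frames))
```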
For the binocular 100-frame video Traffic Scene 1 at resolution 640×480, using 24 cores of an Intel Xeon @2.67 GHz, each frame takes only one minute on average. The two rows of Table 5 list the running time of the depth-and-motion initialization and of the whole system when using a single CPU core.
5.7 More Discussion
Temporal consistency is known to be very important in estimating depth and scene flow maps from video sequences. However, finding good constraints over multiple frames is still an open problem. Previous joint estimation methods are generally insufficient for long-range information propagation. Recent developments in scene flow estimation (Basha et al. 2010; Vogel et al. 2011; Wedel et al. 2011) did not tackle the problem from the temporal consistency point of view either. Our approach explicitly models temporal constraints using chained priors, which makes temporal propagation practical and efficient. This marks the major difference between the proposed approach and others. Our chained temporal priors described in Sect. 4.2 are also very general: they not only benefit depth and scene flow estimation for binocular sequences, but also work for cameras with active range sensors, such as Microsoft Kinect.
6 Conclusion
In summary, the major novelty of our method lies in the long-range temporal constraints, which significantly improve depth and scene flow consistency both visually and quantitatively. In building the robust estimation system, our method contributes in the following ways. First, the motion trajectory construction finds reliable estimates consecutively and breaks links when occlusion arises consistently in multiple frames; occasional noise in one or two frames, on the contrary, can be robustly ignored. Second, the novel edge occurrence maps are constructed by incorporating structural information from multiple frames. The voting-like averaging scheme greatly suppresses errors that are not coherent across multiple frames and enhances credible estimates. Third, we propose the anisotropic smoothing scheme to provide proper regularization for all pixels based on the structure profiles.
Limitation and Future Work
Our method needs reasonable depth initialization. If, for one region, its initial depth estimation is consistently wrong for all frames, the following refinement would not improve it much. Our future work includes the extension to unrectified videos and acceleration using GPU.
Notes
2D-Plus-Depth: (2009). Stereoscopic video coding format. http://en.wikipedia.org/wiki/2D-plus-depth.
References
Álvarez, L., Deriche, R., Papadopoulo, T., & Sánchez, J. (2007). Symmetrical dense optical flow estimation with occlusions detection. International Journal of Computer Vision, 75, 371–385.
Baker, S., Scharstein, D., Lewis, J. P., Roth, S., Black, M. J., & Szeliski, R. (2011). A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92, 1–31.
Basha, T., Moses, Y., & Kiryati, N. (2010). Multi-view scene flow estimation: a view centered variational approach. In CVPR (pp. 1506–1513).
Black, M. J. (1994). Recursive non-linear estimation of discontinuous flow fields. In ECCV (Vol. 1, pp. 138–145).
Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In ECCV (Vol. 4, pp. 25–36).
Brox, T., Bregler, C., & Malik, J. (2009). Large displacement optical flow. In CVPR (pp. 41–48).
Bruhn, A., & Weickert, J. (2005). Towards ultimate motion estimation: combining highest accuracy with real-time performance. In ICCV (pp. 749–755).
Bruhn, A., Weickert, J., & Schnörr, C. (2005). Lucas/Kanade meets horn/Schunck: combining local and global optic flow methods. International Journal of Computer Vision, 61, 211–231.
Cech, J., Sanchez-Riera, J., & Horaud, R. (2011). Scene flow estimation by growing correspondence seeds. In CVPR (pp. 3129–3136).
Furukawa, Y., & Ponce, J. (2007). Accurate, dense, and robust multi-view stereopsis. In CVPR (pp. 1362–1376).
Hadfield, S., & Bowden, R. (2011). Kinecting the dots: particle based scene flow from depth sensors. In ICCV (pp. 2290–2295).
Huguet, F., & Devernay, F. (2007). A variational method for scene flow estimation from stereo sequences. In ICCV (pp. 1–7).
Irani, M. (2002). Multi-frame correspondence estimation using subspace constraints. International Journal of Computer Vision, 48, 173–194.
Kolmogorov, V., & Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 147–159.
Min, D. B., & Sohn, K. (2006). Edge-preserving simultaneous joint motion-disparity estimation. In ICPR (Vol. 2, pp. 74–77).
OpenMP ARB (2012). Open multi-processing. http://openmp.org/.
Patras, I., Alvertos, N., & Tziritas, G. (1996). Joint disparity and motion field estimation in stereoscopic image sequences. In International conference on pattern recognition (Vol. 1, pp. 359–363).
Rabe, C., Müller, T., Wedel, A., & Franke, U. (2010). Dense, robust, and accurate motion field estimation from stereo image sequences in real-time. In ECCV (Vol. 4, pp. 582–595).
Richardt, C., Orr, D., Davies, I., Criminisi, A., & Dodgson, N. A. (2010). Real-time spatiotemporal stereo matching using the dual-cross-bilateral grid. In ECCV (Vol. 3, pp. 510–523).
Sand, P., & Teller, S. J. (2006). Particle video: long-range motion estimation using point trajectories. In CVPR (Vol. 2, pp. 2195–2202).
Sand, P., & Teller, S. J. (2008). Particle video: long-range motion estimation using point trajectories. International Journal of Computer Vision, 80, 72–91.
Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47, 7–42.
Snavely, N., Seitz, S. M., & Szeliski, R. (2006). Photo tourism: exploring photo collections in 3d. ACM Transactions on Graphics, 25, 835–846.
Sun, D., Roth, S., Lewis, J. P., & Black, M. J. (2008). Learning optical flow. In ECCV (Vol. 3, pp. 83–97).
Sundaram, N., Brox, T., & Keutzer, K. (2010). Dense point trajectories by gpu-accelerated large displacement optical flow. In ECCV (Vol. 1, pp. 438–451).
Tomasi, C., & Manduchi, R. (1998). Bilateral filtering for gray and color images. In ICCV (pp. 839–846).
University of Auckland (2008). Enpeda. Image sequence analysis test site (eisats). http://www.mi.auckland.ac.nz/EISATS/.
Valgaerts, L., Bruhn, A., Zimmer, H., Weickert, J., Stoll, C., & Theobalt, C. (2010). Joint estimation of motion, structure and geometry from stereo sequences. In ECCV (Vol. 4, pp. 568–581).
Vaudrey, T., Rabe, C., Klette, R., & Milburn, J. (2008). Differences between stereo and motion behavior on synthetic and real-world stereo sequences. In International conference of image and vision computing New Zealand (IVCNZ) (pp. 1–6).
Vedula, S., Baker, S., Rander, P., Collins, R. T., & Kanade, T. (2005). Three-dimensional scene flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 475–480.
Vogel, C., Schindler, K., & Roth, S. (2011). 3d scene flow estimation with a rigid motion prior. In ICCV (pp. 1291–1298).
Wedel, A., Rabe, C., Vaudrey, T., Brox, T., Franke, U., & Cremers, D. (2008). Efficient dense scene flow from sparse or dense stereo data. In ECCV (Vol. 1, pp. 739–751).
Wedel, A., Brox, T., Vaudrey, T., Rabe, C., Franke, U., & Cremers, D. (2011). Stereoscopic scene flow computation for 3d motion understanding. International Journal of Computer Vision, 95, 29–51.
Xiao, J., Cheng, H., Sawhney, H. S., Rao, C., & Isnardi, M. A. (2006). Bilateral filtering-based optical flow estimation with occlusion detection. In ECCV (Vol. 1, pp. 211–224).
Xu, L., Chen, J., & Jia, J. (2008). A segmentation based variational model for accurate optical flow estimation. In ECCV (Vol. 1, pp. 671–684).
Xu, L., Jia, J., & Matsushita, Y. (2010). Motion detail preserving optical flow estimation. In CVPR (pp. 1293–1300).
Yoon, K. J., & Kweon, I. S. (2006). Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 650–656.
Zhang, Z., & Faugeras, O. D. (1992). Estimation of displacements from two 3-d frames obtained from stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14, 1141–1156.
Zhang, Y., & Kambhamettu, C. (2001). On 3d scene flow and structure estimation. In CVPR (Vol. 2, pp. 778–785).
Zhang, L., Curless, B., & Seitz, S. M. (2003). Spacetime stereo: shape recovery for dynamic scenes. In CVPR (Vol. 2, pp. 367–374).
Zhang, G., Jia, J., Wong, T. T., & Bao, H. (2009). Consistent depth maps recovery from a video sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 974–988.
Zimmer, H., Bruhn, A., Weickert, J., Valgaerts, L., Salgado, B. R. A., & Seidel, H. P. (2009). Complementary optic flow. In EMMCVPR (pp. 207–220).
Acknowledgements
The authors would like to thank the associate editor and all the anonymous reviewers for their time and effort. This work is supported by a grant from the Research Grants Council of the Hong Kong SAR (Project No. 413110).
Electronic Supplementary Material
Supplementary video (MP4, 121.5 MB).
Appendices
Appendix A
We give the details of solving the Euler-Lagrange equation (21):
With the applied anisotropic diffusion tensor, the smoothness term involves \(d_{hh}\), \(d_{vv}\), and \(d_{hv}\), which relate several neighboring points. We use the indices in Fig. 20 to represent the 2D coordinates: \(d_{1}=d(i+1,j+1)\). q is used to index the current point (i,j). We apply central differences in the second-order derivative computation. Specifically, we introduce the function ζ(⋅) expressed as
\((\zeta d_{v})_{v}\) and \((\zeta d_{h})_{v}\) are defined similarly. Then we discretize a grid of size \(h_{h}\times h_{v}\) to apply Gauss-Seidel relaxation. By defining
we represent the anisotropic factors in simpler forms. The increment Δd can be computed using the following iterations:
where \(\mathcal{N}\) is the set of neighboring pixels, \(\mathcal{N}_{h}(q)=\{2,6\}\), and \(\mathcal{N}_{v}(q)=\{0,4\}\). Further, \(g_{1}\) is defined as
and b can be derived as
where
\(\overline{p}=p \bmod 8\). To facilitate computation, we adopt a standard non-linear multi-grid numerical scheme (Bruhn and Weickert 2005) to accelerate convergence. The Gauss-Seidel relaxation works as the pre- and post-smoother, applied twice at each level.
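To illustrate the relaxation itself, here is a minimal dense Gauss-Seidel sweep for a generic linear system \(Ax = b\). This is a stand-in for the smoother used in the multi-grid scheme: the actual system is sparse and couples each pixel with its neighbours through the anisotropic diffusion weights, whereas this sketch operates on an explicit matrix.

```python
import numpy as np

def gauss_seidel(A, b, n_sweeps=50):
    """Plain Gauss-Seidel relaxation for A x = b (diagonally dominant A)."""
    x = np.zeros_like(b, dtype=float)
    n = len(b)
    for _ in range(n_sweeps):
        for i in range(n):
            # Use already-updated entries of x (in-place sweep).
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
    return x
```

In the full solver, a few such sweeps serve as pre- and post-smoothers on each multi-grid level rather than being iterated to convergence.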
Appendix B
After discretization, the linear equations to approximate Eq. (20) can be easily derived. Δu, Δv, and Δδd are iteratively refined, by fixing the other two variables during update. It leads to the Gauss-Seidel relaxation, written as
where
\(g_{1}\) and \(g_{2}\) are functions defined in Appendix A. The Gauss-Seidel iteration is accelerated by a non-linear multi-grid numerical scheme similar to the one used to compute disparities in Appendix A.
Hung, C.H., Xu, L. & Jia, J. Consistent Binocular Depth and Scene Flow with Chained Temporal Profiles. Int J Comput Vis 102, 271–292 (2013). https://doi.org/10.1007/s11263-012-0559-y