1 Introduction

Estimating a 3D reconstruction of a scene from 2D images has been one of the most studied topics in the computer vision community for the last four decades. As a result, the geometric models for single and multiple views are by now well known (Hartley and Zisserman 2004). The topic is also of key importance for robotics, as robots need accurate models of their environment in order to interact safely with it. The sequential 3D estimation of the scene and the camera pose is usually known in the robotics community as visual SLAM, the latter acronym standing for Simultaneous Localization and Mapping.

From a geometric point of view, we need at least two views to estimate the depth of a general scene. The standard 3D reconstruction pipeline starts from multiple views of a scene and uses the well-known geometric models to minimize an error related to the quality of the estimation. Traditional approaches minimize the geometric reprojection error of a sparse set of salient points (e.g., Davison et al. 2007; Klein and Murray 2007; Snavely et al. 2008), while more recent ones minimize the photometric error (Newcombe et al. 2011; Engel et al. 2014). These algorithms have two main limitations that are rarely mentioned in the literature: they fail for low-texture scenes and for low-parallax camera motions. Both cases are likely to appear in indoor and man-made scenes.

Although single-view reconstruction is an ill-posed problem, meaning that in general depth cannot be estimated from one view, there are solutions based on exploiting non-geometric cues and assumptions on the scene. For example, Sturm and Maybank (1999) create a piecewise planar reconstruction with user interaction. Hoiem et al. (2005), also using planar assumptions, are able to reconstruct outdoor scenes. Saxena et al. (2005) and Eigen et al. (2014) predict the depth from a single image by learning models from training data. Single-view reconstruction has been proposed for robot navigation and planning (Nabbe et al. 2006; Saxena et al. 2008), but its accuracy is usually lower than that of multiview techniques and it might fail catastrophically if the underlying assumptions are not met or the current image is far from the training set.

In this paper we propose the combination of state-of-the-art dense monocular SLAM algorithms (specifically, we take Newcombe et al. 2011 as our baseline) with higher-level features, data-driven cues and scene understanding to address the failure cases of low-texture scenes and low-parallax motions. We use 3D superpixels (3DS) (Concha and Civera 2014) to model areas of homogeneous color, data-driven 3D primitives (DDP) to predict the depth of repeating scene patterns from a single view (Fouhey et al. 2013), and layout estimation and classification (Hedau et al. 2009) to predict the depth of the walls and ceiling, which are usually textureless. Our experimental results show that our approach outperforms our baseline (Newcombe et al. 2011) in all cases. Through several sequences, we illustrate the weaknesses and strengths of each of our depth cues.

See Fig. 1 for an overview of our system. Figure 1a shows the 3D reconstruction of a state-of-the-art dense SLAM method in a bedroom scene; notice the errors in the walls. Observe the scene priors: 3D superpixels (3DS) in (c), data-driven primitives (DDP) in (d), and the layout and room labels in (e) and (f). (h) shows the improved reconstruction.

Fig. 1

Incorporating scene priors to dense monocular mapping. (a) Variational mapping fails in textureless regions (top view); notice for example the large errors in the walls. We use the following information to fix this error. Prior 1: textureless regions are planar segments. We segment the image into superpixels (b) and triangulate them from multiple views (c). Prior 2: man-made scene entities have repeating patterns that can be learned from RGB-D data. (d) shows the detections of such data-driven primitives, capturing the three normals of the scene. Prior 3: indoor scenes are box-like. We fit a box to a sparse reconstruction (e). Given the room layout, we classify the image into the room geometric parts walls–floor–ceiling (f) and clutter (g). This gives us the prior depth and shape for the pixels classified as room geometric parts in (f). (h) shows how the 3D reconstruction is improved when the three scene priors are used

This paper builds on our previous work (Concha et al. 2014). The specific contributions of this paper are:

  • The evaluation of a new single-view depth prior based on learning geometric primitives from training data.

  • The fusion of the three priors. Note that Concha et al. (2014) evaluated only two of the priors, and did so separately.

  • An extended experimental evaluation of the proposed algorithm, including several sequences from the publicly available NYU dataset.

The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 gives the details of our proposal. Section 4 presents the experimental results and Section 5 concludes.

2 Related work

2.1 Dense monocular mapping

Real-time and dense 3D reconstructions of small-size environments from monocular sequences were first achieved in Graber et al. (2011), Newcombe et al. (2011), and Stühmer et al. (2010). The problem is formulated as the minimization of an energy composed of a photometric and a regularization term; the first one models the photometric consistency of corresponding pixels and the second one the smoothness of regions with low image gradients. A typical limitation of standard regularizers based on the Total Variation or the Huber norm is that they produce high errors in large low-textured image regions. Engel et al. (2014) estimate the depth only for high-gradient pixels, producing semidense maps. In contrast, our proposal produces fully dense maps. Piniés et al. (2015) use a non-local regularizer, able to propagate information from distant pixels and obtain more accurate reconstructions. Instead of relying on the regularizer, our proposal introduces new features (3D superpixels, 3D primitives learned from data and floor-ceiling-walls-clutter classification) into the formulation. Our proposal improves over the state of the art in the case of textureless regions. But it also improves in the low-parallax case, as our two latter cues use single-view—zero-parallax—information.

2.2 Data-driven depth cues

There are several works that use machine learning and high-level cues to improve multiview reconstructions. Bao and Savarese (2011) jointly optimize 3D objects and sparse keypoints, achieving a better performance in both tasks than when optimizing them separately. Owens et al. (2013) detect gradient-based patches in the images and look for them in an RGB-D dataset to infer depth information, which they use to fill low-texture areas in keypoint-based Structure from Motion. Unlike these works, we estimate fully dense 3D reconstructions. Bao et al. (2013) and Dame et al. (2013) use object constraints to improve dense 3D reconstructions. Our approach aims to reconstruct scenes instead of objects.

2.3 Manhattan and piece-wise planar models

Furukawa et al. (2009), Gallup et al. (2010) and Vanegas et al. (2010) used the Manhattan assumption to fill textureless gaps in sparse 3D reconstructions. Mičušík and Košecká (2010), Concha and Civera (2014) and Concha and Civera (2015a) used superpixels, while Flint et al. (2011) and Tsai et al. (2011) used indoor scene understanding, for the same purpose. Our contribution is to fuse the previously mentioned cues and a new one—data-driven primitives—in a dense variational formulation. Our main advantage over these methods is the estimation of pixelwise reconstructions; the approaches cited above are not fully dense.

3 Dense mapping using scene priors

3.1 Problem formulation

Our aim is to estimate the inverse depth \({\varvec{\rho }}(\mathbf{u})\) for every pixel \(\mathbf{u}\) of a reference keyframe \(\mathbf{I}_r\) using a set of overlapping views \(\{\mathbf{I}_1, \ldots , \mathbf{I}_j, \ldots \}\). In order to do that we minimize a global energy function \(E_{{\varvec{\rho }}}\), which is the weighted sum of a photometric error data term \(\mathbf{C}(\mathbf{u},{\varvec{\rho }}(\mathbf{u}))\), a regularization term \(\mathbf{R}(\mathbf{u},{\varvec{\rho }}(\mathbf{u}))\) and our newly proposed term, a summation of penalties associated with the three scene priors \({\varvec{\rho }}_1\), \({\varvec{\rho }}_2\) and \({\varvec{\rho }}_3\):

$$\begin{aligned} E_{\varvec{\rho }} = \int \Big ( \lambda _0 \mathbf{C}(\mathbf{u},\varvec{\rho }(\mathbf{u})) + \mathbf{R}(\mathbf{u},\varvec{\rho }(\mathbf{u})) + \sum \limits _{\pi =1}^3 \frac{\lambda _\pi }{2} \mathbf{P}(\mathbf{u},\varvec{\rho }(\mathbf{u}),\varvec{\rho }_\pi (\mathbf{u})) \Big ) \, \partial \mathbf{u} \end{aligned}$$
(1)

\(\lambda _0\) and \(\lambda _\pi \) are the weighting factors that account for the relative importance of the energy terms.
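As a rough illustration of how Eq. 1 is evaluated once discretized on the keyframe pixel grid, consider the following sketch (a minimal NumPy outline under our own naming; the callable photo_cost, the prior validity masks and the Huber threshold are assumptions, not the authors' implementation):

```python
import numpy as np

def total_energy(rho, photo_cost, g, priors, masks, lambdas, lam0, eps=1e-2):
    """Discretized version of Eq. 1 on an (H, W) inverse-depth map.
    photo_cost(rho) -> (H, W) photometric cost C (assumed given); g is the
    per-pixel weight of Eq. 10; priors/masks/lambdas describe the scene priors
    (lambda_pi = 0 where a prior is not available); eps is the Huber threshold."""
    data = lam0 * photo_cost(rho)                       # photometric data term

    gy, gx = np.gradient(rho)                           # regularizer: weighted Huber
    gnorm = np.hypot(gx, gy)                            # norm of the gradient (Eq. 9)
    huber = np.where(gnorm <= eps, gnorm ** 2 / (2 * eps), gnorm - eps / 2)
    reg = g * huber

    prior = np.zeros_like(rho)                          # scene-prior terms (quadratic
    for lam, rho_pi, mask in zip(lambdas, priors, masks):   # part of Eq. 11)
        prior += 0.5 * lam * mask * (rho - rho_pi) ** 2

    return float(np.sum(data + reg + prior))
```

In practice the non-convex photometric term is handled by sampling over inverse depth (Sect. 3.6); the sketch only shows how the three kinds of terms are combined.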

3.2 The scene priors

To extract our three scene priors we need two preprocessing steps. We first extract a set of salient points \(\mathbf{u}^* \in \mathbf{u}\) in every keyframe of the sequence, compute correspondences, and estimate the salient points’ 3D positions, which we denote as \(\mathbf{p} = (\mathbf{p}_{1}^\top \ \ldots \ \mathbf{p}_{i}^\top \ \ldots \ \mathbf{p}_n^\top )^\top \), and the camera poses \(\mathbf{c} = ( \mathbf{c}_{1}^\top \ \ldots \ \mathbf{c}_{r}^\top \ \ldots \ \mathbf{c}_{j}^\top \ \ldots \ \mathbf{c}_{m}^\top )^\top \) using a standard Bundle Adjustment optimization (Snavely et al. 2008).

In the second preprocessing step, we segment every reference keyframe \(\mathbf{I}_r\) into a set of superpixels \({\mathcal S}_r = \{s^r_1, \ldots , s^r_l, \ldots , s^r_t\}\) using the algorithm by Felzenszwalb and Huttenlocher (2004).

3.2.1 3D superpixels (3DS)

We assume that the superpixels \({\mathcal S}_r = \{s^r_1, \ldots , s^r_l, \ldots , s^r_t\}\) correspond to approximately planar areas in the scene. We will estimate their 3D parameters using Concha and Civera (2014), which we will summarize here for completeness.

We can estimate the geometric parameters \({\varvec{\Pi }} = \left( {\varvec{\pi }}_1^\top \ \ldots \ {\varvec{\pi }}_k^\top \ \ldots \ {\varvec{\pi }}_q^\top \right) ^\top \) for the q planar superpixels \(\{s_1, \ldots , s_k, \ldots , s_q\}\) that were matched in two or more keyframes with the following optimization

$$\begin{aligned} \hat{{\varvec{\Pi }}} = \mathop {\hbox {arg min}\,} \limits _{{\varvec{\Pi }}} \sum \limits _{r=1}^m \sum \limits _{k=1}^{q} F({\varvec{\epsilon }}_{s_{k}}^r) ~. \end{aligned}$$
(2)

\({\varvec{\epsilon }}_{s_{k}}^r = \mathbf{u}_{s_{k}}^r - \mathbf{h} (\mathbf{u}_{s_{k}^h}^j, {\varvec{\pi }}_k^h, \mathbf{c}_{r}, \mathbf{c}_{j})\) stands for the reprojection error of the contours of superpixel \({s_{k}}\) in the keyframe \(\mathbf{I}_r\). As we approximate the superpixels by planar surfaces, \(\mathbf{h}\) stands for a homography model. We use a robust function of the error \(F(\cdot )\) to limit the influence of outliers. Each superpixel plane \({\varvec{\pi }}_k\) is parametrized by its normal \(\mathbf{n}_k\) and its distance \({d}_k\) to the origin.

The superpixel correspondences between several views are computed as follows. We first search for pairwise correspondences between two keyframes \(\mathbf{I}_{r}\) and \(\mathbf{I}_{j}\) using a Monte Carlo approach. For every superpixel \(\mathbf{s}_k\) in \(\mathbf{I}_{r}\) we create several plane hypotheses \({\varvec{\pi }}_k^h\). The plane hypotheses are ranked according to the reprojection error of the superpixel contours in image \(\mathbf{I}_{j}\)

$$\begin{aligned} \epsilon _{s_{k}^h} = \left\| \mathbf{u}_{s_{k}^h}^j - \mathbf{h} \left( \mathbf{u}_{s_{k}^h}^r, {\varvec{\pi }}_k^h, \mathbf{c}_{r}, \mathbf{c}_{j} \right) \right\| \end{aligned}$$
(3)

The planar superpixel hypotheses \({\varvec{\pi }}_k^h\) with the smallest error \(\epsilon _{s_{k}^h}\) are taken as the initial seed for the optimization of Eq. 2.
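As an illustration, the hypothesis generation and ranking of Eq. 3 can be sketched as follows (our own helper names; the plane-induced homography \(H = K(R - t\mathbf{n}^\top /d)K^{-1}\) and the sign conventions for \((\mathbf{n}, d)\) are standard assumptions, and the contour samples are assumed to be in correspondence):

```python
import numpy as np

def warp_by_plane(u_r, K, R_rj, t_rj, n, d):
    """Map contour points u_r (N, 2) from keyframe r to image j through the plane
    (n, d), using the plane-induced homography H = K (R - t n^T / d) K^{-1}.
    The pose and plane sign conventions are assumptions of this sketch."""
    H = K @ (R_rj - np.outer(t_rj, n) / d) @ np.linalg.inv(K)
    pts = np.hstack([u_r, np.ones((u_r.shape[0], 1))])       # homogeneous coordinates
    warped = (H @ pts.T).T
    return warped[:, :2] / warped[:, 2:3]

def rank_hypotheses(contour_r, contour_j, K, R_rj, t_rj, hypotheses):
    """Sort plane hypotheses (n, d) by the contour reprojection error of Eq. 3.
    contour_r and contour_j are assumed to be corresponding contour samples."""
    errors = [np.linalg.norm(warp_by_plane(contour_r, K, R_rj, t_rj, n, d) - contour_j)
              for (n, d) in hypotheses]
    order = np.argsort(errors)
    return [hypotheses[i] for i in order], [errors[i] for i in order]
```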

The scene prior inverse depth \(\varvec{\rho }_1(\mathbf{u})\) for every pixel \(\mathbf{u} \in s_k\) is computed as the intersection of its backprojected ray and the plane \({\varvec{\pi }}_k\)

$$\begin{aligned} {\varvec{\rho }}_1(\mathbf{u}) = \left| \left| { -\frac{ \mathbf{u} \mathbf{K}_r^{-1} \mathbf{R}_r \mathbf{n}_k }{ d_k \mathbf{K}_r^{-1} \mathbf{u} }} \right| \right| ~. \end{aligned}$$
(4)

where \(\mathbf{R}_r\) is the rotation matrix of the keyframe \(\mathbf{I}_r\) and \(\mathbf{K}_r\) is its internal calibration matrix.
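For reference, the ray–plane intersection behind Eq. 4 reduces to a few lines (a minimal sketch in the keyframe camera frame; expressing the plane in the world frame, as in Eq. 4, additionally involves \(\mathbf{R}_r\)):

```python
import numpy as np

def inverse_depth_from_plane(u, K, n, d):
    """Prior inverse depth for pixel u = (x, y): intersect its backprojected ray
    with the plane n . X + d = 0, both expressed in the keyframe camera frame
    (Eq. 4 additionally uses R_r when the plane is given in the world frame)."""
    ray = np.linalg.inv(K) @ np.array([u[0], u[1], 1.0])   # backprojected ray at depth 1
    # A point on the ray at depth z is X = z * ray; imposing n . X + d = 0 gives
    #   z = -d / (n . ray), hence inverse depth rho = 1 / z:
    return -(n @ ray) / d
```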

3.2.2 Data-driven primitives (DDP)

A data-driven primitive is a repetitive and distinctive image gradient pattern with an associated 3D pattern. The models for such patterns can be learned from RGB-D training data. At test time, and from a single view, the gradient patterns can be detected and their depth can be predicted. Imagine, for example, the case of a room corner: it is a primitive that appears frequently indoors, it has a clear 3D pattern, and it has several associated image patterns depending on the viewpoint.

Specifically, we use the approach of Fouhey et al. (2013). Each primitive is defined by \(\langle \mathbf{w, N, y}\rangle \); where \(\mathbf{w}\) is the weight vector of an SVM classifier, \(\mathbf N = \{ n(u) \}\) is the set of normals for each pixel \(\mathbf u\) of the primitive patch, and \(\mathbf{y} \in \{ 0, 1 \}^m\) is an indicator vector in which \(y_i = 1\) if the training patch \(\mathbf{x}_i\) is an instance of such a primitive. Each patch has a geometric component \(\mathbf{x}_i^G\) and an appearance component \(\mathbf{x}_i^A\) (HOG). In order to build the SVM classifiers \(\mathbf{w}\), the following objective function is minimized over the m training images

$$\begin{aligned} \mathop {\hbox {arg min}\,} \limits _{\mathbf{y},\mathbf{w}} R(\mathbf{w}) + \sum \limits _{i=1}^m \left( \varDelta (\mathbf{N},\mathbf{x}_i^G) + c_2L(\mathbf{w},\mathbf{x}_i^A,y_i) \right) \end{aligned}$$
(5)

where R is the classifier regularizer, each \(c_i\) trades off between terms, and L is the loss function. Notice that the above classifiers provide a set of sparse detections of geometric primitives in the test images. Dense results could be achieved by propagating these sparse detections to the entire image, but we have observed that such propagation might be inaccurate if only a small number of primitives is detected. In order to keep the geometric primitives as accurate as possible, we only consider the sparse detections.

Similarly to Sect. 3.2.1, we extract superpixels and assume that they correspond to approximately planar areas in the scene. For every superpixel we estimate a plane \({\varvec{\pi }}_k\), parametrized by its normal \(\mathbf{n}_k\) and its distance \({d}_k\) to the origin. The normal direction assigned to a superpixel is the one receiving the most votes from the geometric primitives detected on it. The distance \({d}_k\) is estimated using the approach of Sect. 3.2.3; and the inverse depth prior \(\varvec{\rho }_2(\mathbf{u})\) for every pixel \(\mathbf{u} \in s_k\) is computed as the intersection of its backprojected ray and the plane \({\varvec{\pi }}_k\) (Eq. 4).
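A possible implementation of the per-superpixel normal voting is sketched below (an illustrative outline; the overlap test, the candidate directions and the data structures are our assumptions):

```python
import numpy as np

def superpixel_normal_by_voting(sp_mask, detections, candidate_normals):
    """Assign to one superpixel the most voted normal among the data-driven
    primitive detections overlapping it.
    sp_mask           : (H, W) bool mask of the superpixel
    detections        : list of (patch_mask, normal) pairs from the DDP detector
    candidate_normals : (B, 3) unit directions to vote on, e.g. the three
                        Manhattan directions of the scene (an assumption)."""
    votes = np.zeros(len(candidate_normals))
    for patch_mask, normal in detections:
        if np.any(patch_mask & sp_mask):            # primitive overlaps the superpixel
            # Vote for the closest candidate direction (up to sign).
            votes[np.argmax(np.abs(candidate_normals @ normal))] += 1
    return candidate_normals[int(np.argmax(votes))] if votes.any() else None
```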

3.2.3 Layout

One of the goals of indoor scene understanding is the estimation of the rough geometry of a room—its layout—and the classification of every image pixel \(\mathbf u\) into the wall, floor, ceiling or clutter classes. In this paper we use the algorithm of Hedau et al. (2009) and extend it to the multiview case. For an overview of the layout and labelling algorithm see Fig. 2.

Fig. 2

Overview of the layout and the labelling algorithm. See Sect. 3.2.3 for details

The main assumption is that we are in a cuboid room. The geometric model of the room layout \({\mathcal L}\) is composed of six planes \({\mathcal L} = \left\{ {\varvec{\pi }}_1, {\varvec{\pi }}_2, {\varvec{\pi }}_3, {\varvec{\pi }}_4, {\varvec{\pi }}_5, {\varvec{\pi }}_6 \right\} \). Every plane \({\varvec{\pi }}_k\) is parametrized by its normal \(\mathbf{n}_k\) and its distance \({d}_k\) to the origin. We first estimate the plane normals \(\mathbf{n}_k\) by extracting the vanishing points \(\mathbf{v}_k^r\) of the dominant directions of the room in every keyframe \(\mathbf{I}_r\) (Košecká and Zhang 2006). These vanishing points are estimated by grouping the detected 2D line segments of the keyframe into three dominant clusters. Figure 2b shows the vanishing points as red, green and blue circles. We backproject them to the 3D world, \(\mathbf{V}_k^r = \mathbf{K}_r^{-1}\mathbf{v}_k^r\) (\(\mathbf{K}_r\) standing for the calibration matrix), group the resulting directions from all keyframes into three clusters, and take their centroids.

In order to estimate the room layout box, we fit planes to the sparse reconstruction \(\mathbf{p} = \left( \mathbf{p}_{1}^\top \ \ldots \mathbf{p}_{i}^\top \ \ldots \ \mathbf{p}_n^\top \right) ^\top \) of Fig. 2a. For this plane fitting, we start from the 3 dominant orientations of the room, i.e., the Manhattan directions provided by the vanishing points. For each orientation, we hypothesize \(N_{hyp}\) planes at different distances; specifically, \(N_{hyp}=25\) in our experiments. A plane hypothesis is considered valid if it is supported by a minimum number of points (6 in our experiments). A point supports a hypothesis if its distance to the plane is within a predefined threshold. Finally, out of the winning planes, we select the 6 extremal planes constituting the 3D box layout (Fig. 2c).
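The plane sweep over the sparse reconstruction can be sketched as follows (a schematic outline with our own helper names; the distance threshold value is an assumption, while \(N_{hyp}=25\) and the 6-point support follow the text):

```python
import numpy as np

def sweep_planes(points, direction, n_hyp=25, min_support=6, thresh=0.05):
    """Hypothesize n_hyp planes orthogonal to one Manhattan direction and keep
    those supported by at least min_support points of the sparse map.
    points: (N, 3) bundle-adjusted points; direction: unit (3,) plane normal.
    The 0.05 distance threshold (scene units) is a placeholder value."""
    proj = points @ direction                    # signed offset of each point
    offsets = np.linspace(proj.min(), proj.max(), n_hyp)
    valid = []
    for d in offsets:
        support = int(np.sum(np.abs(proj - d) < thresh))
        if support >= min_support:
            valid.append((d, support))
    return valid

def extremal_planes(points, direction, **kw):
    """The two extremal supported offsets along one direction give two opposite
    faces of the box layout (e.g. the two facing walls)."""
    valid = sweep_planes(points, direction, **kw)
    if not valid:
        return None
    offsets = [d for d, _ in valid]
    return min(offsets), max(offsets)
```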

Next, leveraging this 3D box layout, we label every superpixel from the segmentation \({\mathcal S}_r = \{s^r_1, \ldots , s^r_l, \ldots , s^r_t\}\) into 4 different classes \(\{W, F, C, Cl\}\)—wall, floor, ceiling and clutter respectively. See Hoiem et al. (2007) for details on the superpixel features and the classification algorithm. One of the most informative features for this classification is the interposition feature: superpixels belonging to the room geometry must be totally contained in one of the projected box polygons, whereas superpixels belonging to the object clutter can cross the boundary between two polygons of the projected layout box. For example, in Fig. 2d, the superpixels numbered 1 and 3 are totally contained in the wall and the floor polygons, respectively, and hence they receive room geometry labels (Fig. 2e). The superpixel numbered 2 crosses the red line of the projected box layout; only 3D objects have this physical property, and hence it is labelled as clutter (Fig. 2f). For more details see Hedau et al. (2009). Notice that this method only tells us where the objects are; it does not give us an orientation or depth prior for the clutter (object) regions. Therefore, we will not constrain the depth of the pixels \(\mathbf{u} \in Cl\) that are labeled as clutter. For the rest of the pixels \(\mathbf{u} \in \{W, F, C\}\) we compute an a priori inverse depth \(\varvec{\rho }_3\mathbf{(u)}\) from the intersection between the backprojected ray \(\mathbf{K}_r^{-1} \mathbf{u}\) and the layout plane \({\varvec{\pi }}_k \in {\mathcal L}\) into which each pixel has been classified, using Eq. 4.
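As an illustration of the interposition cue, a containment test against the projected box faces might look as follows (a sketch; representing the projected layout polygons as rasterized masks is our assumption):

```python
import numpy as np

def interposition_label(sp_mask, face_masks):
    """Return the index of the projected layout face that fully contains the
    superpixel, or None if it straddles a face boundary (a clutter cue).
    sp_mask: (H, W) bool superpixel mask; face_masks: list of (H, W) bool masks,
    one per projected box face (walls, floor, ceiling), assumed to tile the image."""
    for idx, face in enumerate(face_masks):
        if np.all(face[sp_mask]):        # every superpixel pixel lies on this face
            return idx                   # candidate wall / floor / ceiling label
    return None                          # crosses a face boundary -> likely clutter
```

Superpixels that fail the test are candidates for the clutter class; the final label is still produced by the classifier of Hoiem et al. (2007), which uses this cue together with other features.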

3.3 The photometric cost (\(\mathbf{C}(\mathbf{u},\varvec{\rho }(\mathbf{u}))\))

As in Newcombe et al. (2011), our photometric error is based on the color difference between the reference image and a set of short-baseline images. Every pixel \(\mathbf{u}\) of the reference image \(\mathbf{I}_r\) is first backprojected at an inverse distance \(\varvec{\rho }\) and projected again into every close image \(\mathbf{I}_j\).

$$\begin{aligned} \mathbf{u}^j = \mathbf{T}_{rj}(\mathbf{u},\varvec{\rho }) = \mathbf{K} \mathbf{R}^\top \left( \left( \begin{array}{c} \frac{\mathbf{K}^{-1}\mathbf{u}}{||\mathbf{K}^{-1}\mathbf{u}||}\\ \varvec{\rho }\\ \end{array} \right) - \mathbf{t} \right) \end{aligned}$$
(6)

where \(\mathbf T\), \(\mathbf R\) and \(\mathbf t\) are respectively the relative transformation, rotation and translation between keyframe r and frame j. The photometric error is the summation of the color error between every pixel in the reference image and its correspondence in every other image at a hypothesized inverse distance \(\varvec{\rho }\).

$$\begin{aligned}&\mathbf{C}(\mathbf{u},\varvec{\rho }(\mathbf{u}))= \frac{1}{|I_s|} \sum _{j=1, j\ne r}^{m}f\left( \epsilon (\mathbf{I}_j,\mathbf{I}_r,\mathbf{u},\varvec{\rho })\right) \end{aligned}$$
(7)
$$\begin{aligned}&\epsilon (\mathbf{I}_j,\mathbf{I}_r,\mathbf{u},{\varvec{\rho }}) = \mathbf{I}_r(\mathbf{u}) - \mathbf{I}_j (\mathbf{T}_{rj}(\mathbf{u},{\varvec{\rho }})) \end{aligned}$$
(8)

where \(|I_s|\) is the number of images used in the reconstruction and normalizes the photometric term. Differently from Concha et al. (2014), we use a robust cost function—Tukey’s cost function f—in the photometric term instead of the \(L_1\) norm, which improves the accuracy of the reconstruction at depth discontinuities by limiting the influence of the outliers produced by occlusions (Concha and Civera 2015b).
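A per-pixel evaluation of Eqs. 7–8 with Tukey's robust function might look as follows (a sketch, not the authors' implementation; the warping function of Eq. 6 is taken as given, and grayscale images with nearest-neighbour sampling are simplifying assumptions):

```python
import numpy as np

def tukey(e, c=4.685):
    """Tukey's biweight cost: saturates for |e| > c so outliers (e.g. occlusions)
    have a bounded influence on the photometric term."""
    e = np.asarray(e, dtype=float)
    inside = np.abs(e) <= c
    cost = np.full_like(e, c ** 2 / 6.0)
    cost[inside] = (c ** 2 / 6.0) * (1.0 - (1.0 - (e[inside] / c) ** 2) ** 3)
    return cost

def photometric_cost(u, rho, I_r, images, warp):
    """C(u, rho(u)) of Eq. 7 for one pixel u = (x, y) of the reference image I_r
    and one hypothesized inverse depth rho.  warp(u, rho, j) plays the role of
    T_rj in Eq. 6 and is assumed given; grayscale images for simplicity."""
    errors = []
    for j, I_j in enumerate(images):                     # short-baseline frames
        x, y = warp(u, rho, j)
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < I_j.shape[0] and 0 <= xi < I_j.shape[1]:
            errors.append(float(I_r[u[1], u[0]]) - float(I_j[yi, xi]))   # Eq. 8
    return float(np.mean(tukey(errors))) if errors else 0.0
```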

3.4 The gradient regularizer (\(\mathbf{R(\mathbf{u},\varvec{\rho }(\mathbf{u}))}\))

The gradient regularizer is the weighted Huber norm of the gradient of the inverse depth map \(||\nabla \varvec{\rho }(\mathbf{u})||_{\epsilon }\)

$$\begin{aligned} \mathbf{R}(\mathbf{u},\varvec{\rho }(\mathbf {u})) = \mathbf{g}(\mathbf{u})||\nabla \varvec{\rho }(\mathbf{u})||_{\epsilon } \end{aligned}$$
(9)

Depth discontinuities often coincide with image contours. \(\mathbf{g}(\mathbf{u})\) is therefore a per-pixel weight that decreases the regularization strength for high-gradient pixels.

$$\begin{aligned} g(\mathbf{u}) = e^{-\alpha ||\nabla \mathbf{I}_r(\mathbf{u})||_2} \end{aligned}$$
(10)

where \(\alpha \) is a constant.
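Computing the per-pixel weight of Eq. 10 is straightforward (a NumPy sketch; the value of \(\alpha \) shown is a placeholder, not the value used in the paper):

```python
import numpy as np

def edge_weight(I_r, alpha=10.0):
    """g(u) = exp(-alpha * ||grad I_r(u)||_2) of Eq. 10: regularize less across
    image edges, where depth discontinuities are more likely.  Grayscale input
    assumed; alpha = 10 is a placeholder, not the value used in the paper."""
    gy, gx = np.gradient(I_r.astype(float))
    return np.exp(-alpha * np.hypot(gx, gy))
```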

3.5 The terms associated with the scene priors (\(\mathbf{P(\mathbf{u},\varvec{\rho }(\mathbf{u}),\varvec{\rho }_\pi (\mathbf{u}))}\))

The scene prior terms model the distance from every point to its estimated planar prior (or priors) \(\varvec{\rho }_\pi \) detailed in Sect. 3.2. Differently from Concha et al. (2014), we use iteratively reweighted least squares to be robust against outliers (Concha and Civera 2015b). This is of key importance to deal with classification or segmentation errors: in those cases the cost function should saturate for large errors, which then have a limited influence on the solution.

$$\begin{aligned} \mathbf{P}(\mathbf{u},\varvec{\rho }(\mathbf{u}),\varvec{\rho }_\pi (\mathbf{u})) = \mathbf{w}_\pi \left( \varvec{\rho }(\mathbf{u}) - \varvec{\rho }_\pi (\mathbf{u})\right) ^2 \end{aligned}$$
(11)

\(\mathbf{w}_\pi \) is the weight derived from Tukey’s cost function. In the areas of the image where we do not have a planar constraint (areas classified as clutter in the Manhattan layout, small and textured superpixels, and areas where we did not detect any geometric primitive) we set \(\lambda _\pi = 0\). We set \(\lambda _1 = 10\) for the 3D superpixels and a smaller weight for the other two priors, \(\lambda _2 = 5\) and \(\lambda _3 = 5\). The reason is that superpixels are based on multiview geometry, whereas the layout and the geometric primitives rely on learning, which is more prone to large errors.
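The weights \(\mathbf{w}_\pi \) follow the usual iteratively reweighted least squares recipe: at each iteration a Tukey weight is recomputed from the current residual and the prior term is then treated as a weighted quadratic. A minimal sketch (the cutoff value is an assumption):

```python
import numpy as np

def tukey_weight(residual, c=4.685):
    """IRLS weight of Tukey's biweight: w = (1 - (r/c)^2)^2 inside the cutoff and
    0 outside, so gross prior errors (e.g. wrong labels or bad segments) are
    effectively ignored in Eq. 11."""
    r = np.asarray(residual, dtype=float)
    w = (1.0 - (r / c) ** 2) ** 2
    w[np.abs(r) > c] = 0.0
    return w

# Hypothetical usage: recompute the weight of one prior from the current estimate,
# then keep it fixed while solving the weighted quadratic problem of Eq. 21.
# w_pi = tukey_weight(rho - rho_pi)
```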

3.6 Solution

The energy is composed of two convex terms \(g(\mathbf{u})||\nabla {\varvec{\rho }}(\mathbf{u})||_{\epsilon } + \sum \limits _{\pi =1}^3 \frac{1}{2}\lambda _\pi w_\pi \left( \varvec{\rho }(\mathbf{u}) - \varvec{\rho }_\pi (\mathbf{u})\right) ^2\) and a non-convex term \(\lambda _0\mathbf{C}(\mathbf{u},\varvec{\rho }(\mathbf{u}))\). The convex terms and the non-convex term are optimized differently and an auxiliary variable \(\mathbf{a}\) is used to couple these two terms:

$$\begin{aligned} E_{\varvec{\rho },\mathbf{a}} = \int \Big ( \lambda _0\mathbf{C}(\mathbf{u},\mathbf{a}(\mathbf{u})) + \mathbf{g}(\mathbf{u})||\nabla {\varvec{\rho }}(\mathbf{u})||_{\epsilon } + \sum \limits _{\pi =1}^3 \frac{1}{2}\lambda _\pi \mathbf{w}_\pi \left( \varvec{\rho }(\mathbf{u}) - \varvec{\rho }_\pi (\mathbf{u}) \right) ^2 + \frac{1}{2\theta }(\varvec{\rho }(\mathbf{u}) - \mathbf{a}(\mathbf{u}))^2 \Big ) \, \partial \mathbf{u} \end{aligned}$$
(12)

The coupling term \(\frac{1}{2\theta }(\varvec{\rho }(\mathbf{u}) - \mathbf{a}(\mathbf{u}))^2\) enforces \(\varvec{\rho }\) and \(\mathbf{a}\) to become equal as the parameter \(\theta \), initialized at 0.2, is driven iteratively towards 0. In that limit, Eq. 12 reduces to the original energy of Eq. 1.

The non-convex term will be optimized by sampling and the convex terms will be efficiently optimized using a primal-dual approach.

The convex terms are converted to their primal-dual formulation using the Legendre–Fenchel transformation (details and proofs in Angeli et al. 2011). The energy in Eq. 12 is then minimized as follows

$$\begin{aligned} \hat{\varvec{\rho }} = \mathop {\hbox {arg max}}_{\mathbf{q},||\mathbf{q}||_2 \le 1} \Big \{ \mathop {\hbox {arg min}\,}_{\varvec{\rho }, \mathbf{a}} {E}(\varvec{\rho },\mathbf{a},\mathbf{q}) \Big \} \end{aligned}$$
(13)
$$\begin{aligned} {E}(\varvec{\rho },\mathbf{a},\mathbf{q})= & {} \Big \{ \big \langle \mathbf{gA}\varvec{\rho }, \mathbf{q} \big \rangle -\delta _q(\mathbf{q}) - \frac{\epsilon }{2}||\mathbf{q}||_2^2 \nonumber \\&+\sum \limits _{\pi =1}^3 \frac{1}{2}\lambda _\pi \mathbf{w}_\pi \left( \varvec{\rho } - \varvec{\rho }_\pi \right) ^2 + \frac{1}{2\theta } (\varvec{\rho } - \mathbf{a})^2 \nonumber \\&+\,\lambda _0\mathbf{C}(\mathbf{a}) \Big \} \end{aligned}$$
(14)

where \(\mathbf{q}\) is the dual variable, \(\mathbf{A}\varvec{\rho }\) stands for the gradient of \(\varvec{\rho }\), \(\epsilon \) is the threshold of the Huber norm which determines whether the \(L_1\) or the \(L_2\) norm is used (Newcombe et al. 2011), and \(\delta _q\) is an indicator function (Angeli et al. 2011).

For the dual variable \(\mathbf{q}\) the energy has to be maximized, therefore a gradient ascent step \(\frac{\partial \mathbf{E}(\varvec{\rho },\mathbf{a},q)}{\partial \mathbf{q}} = \nabla (q)\) is computed:

$$\begin{aligned} \frac{\partial E(\varvec{\rho },\mathbf{a},\mathbf{q})}{\partial \mathbf{q}} = \mathbf{gA}\varvec{\rho } - \epsilon \mathbf{q} \end{aligned}$$
(15)

Discretizing \(\nabla (q) = \frac{ \mathbf{q}^{(n+1)}-\mathbf{q}^n}{\sigma _q}\) and rearranging terms:

$$\begin{aligned} \frac{{ \mathbf q}^{(n+1)}-\mathbf{q}^n}{\sigma _q} = \mathbf{gA}\varvec{\rho }^n - \epsilon \mathbf{q}^{(n+1)} \end{aligned}$$
(16)

where \(\sigma _q\) is the differentiation step.

$$\begin{aligned} \mathbf{q}^{(n+1)}= & {} \left( \mathbf{q}^n +\sigma _q\mathbf{gA}\varvec{\rho }^n\right) / \left( 1+ \sigma _q\epsilon \right) \end{aligned}$$
(17)
$$\begin{aligned} \mathbf{q}^{(n+1)}= & {} \mathbf{q}^{(n+1)} / \max (1,||\mathbf{q}^{(n+1)}||_1) \end{aligned}$$
(18)

In the case of the variable \(\varvec{\rho }\), the energy is minimized, therefore a gradient descent step \(\frac{\partial \mathbf{E}(\varvec{\rho },\mathbf{a},\mathbf{q})}{\partial {\varvec{\rho }}} = \nabla (\varvec{\rho })\) is computed. Using the divergence theorem \(\frac{\partial \big \langle \mathbf{A}\varvec{\rho }, \mathbf{q } \big \rangle }{\partial \varvec{\rho }} = -div(\mathbf{q}) =\mathbf{A}^T\mathbf{q}\), where \(\mathbf{A}^T\mathbf{q}\) forms the negative divergence of \(\mathbf{q}\):

$$\begin{aligned} \frac{\partial E(\varvec{\rho },\mathbf{a},q)}{\partial \varvec{\rho }}&= \mathbf{gA}^T\mathbf{q} + \frac{1}{\theta }(\varvec{\rho }-\mathbf{a}) \nonumber \\&\quad +\,\sum \limits _{\pi =1}^3 \lambda _\pi \mathbf{w}_\pi \left( \varvec{\rho } - \varvec{\rho }_\pi \right) \end{aligned}$$
(19)

Discretizing \( \nabla (\varvec{\rho }) = \frac{ \varvec{\rho }^{(n+1)}-\varvec{\rho }^n}{\sigma _{\rho }} \) and rearranging terms:

$$\begin{aligned} \frac{ \varvec{\rho }^{(n+1)}-\varvec{\rho }^n}{\sigma _{ \rho }}= & {} -\mathbf{gA}^T\mathbf{q}^{(n+1)} \nonumber \\&-\,\frac{1}{\theta ^n}(\varvec{\rho }^{(n+1)}-\mathbf{a}^n) \nonumber \\&-\,\sum \limits _{\pi =1}^3 \lambda _\pi \mathbf{w}_\pi (\varvec{\rho }^{(n+1)} - \varvec{\rho }_\pi ) \end{aligned}$$
(20)

where \(\sigma _{\rho }\) is the differentiation step.

$$\begin{aligned} \varvec{\rho }^{(n+1)} \!=\! \frac{\left( \varvec{\rho }^n \!+\!\sigma _{\rho }\left( \mathbf{-gA}^T\mathbf{q}^{(n+1)}+\frac{\mathbf{a}^n}{\mathbf{\theta }^n} \!+\! \sum \limits _{\pi =1}^3 \lambda _\pi \mathbf{w}_\pi \varvec{\rho }_\pi \right) \right) }{(1+\frac{\sigma _{ \rho }}{\theta ^n} + \sum \limits _{\pi =1}^3 \lambda _\pi \mathbf{w}_\pi \sigma _{\rho })} \end{aligned}$$
(21)

The remaining non-convex function is minimized using a point-wise search for each \(\mathbf{a}\) in the range \(\mathbf{a} = [\varvec{\rho }_{min},\varvec{\rho }_{max}]\):

$$\begin{aligned} \hat{\mathbf{a}}= & {} \mathop {\hbox {arg min}\,}_{\mathbf{a}} E^{aux}(\varvec{\rho },\mathbf{a}) \end{aligned}$$
(22)
$$\begin{aligned} E^{aux}(\varvec{\rho },\mathbf{a})= & {} \frac{1}{2\theta }(\varvec{\rho } - \mathbf{a})^2 + \lambda _0\mathbf{C}(\mathbf{a}) \end{aligned}$$
(23)

Finally, we use the acceleration of the non-convex solution recommended in Newcombe et al. (2011), and we also achieve sub-sample accuracy by performing a single Newton step, using numerical derivatives, on the discrete variable \(\mathbf{a}\):

$$\begin{aligned} \hat{\mathbf{a}}^{(n+1)} = \hat{\mathbf{a}}^{(n+1)} - \frac{\nabla {E^{aux}}}{\nabla ^2{E^{aux}}} \end{aligned}$$
(24)

Equations 17, 18, 21, 22 and 24 are computed iteratively while \(\theta ^{(n+1)}= \theta ^n(1-0.001\,n)\) remains higher than 0.0001.
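For illustration, one sweep of this iterative scheme can be sketched as follows (a schematic outline; the gradient and divergence operators, the step sizes and the exhaustive search over \(\mathbf{a}\) are placeholders, and the dual projection is written with the usual per-pixel Euclidean norm):

```python
import numpy as np

def primal_dual_sweep(rho, q, a, g, priors, w, lambdas, theta,
                      grad, div, search_a, sigma_q=0.5, sigma_rho=0.5, eps=1e-2):
    """One sweep of the optimization of Eq. 12 (schematic).
    grad(rho) -> (2, H, W) gradient and div(q) -> (H, W) divergence are assumed
    given; search_a implements the point-wise search plus Newton step (Eqs. 22-24)."""
    # Dual ascent on q (Eqs. 17-18); projection written with the per-pixel 2-norm.
    q = (q + sigma_q * g * grad(rho)) / (1.0 + sigma_q * eps)
    q = q / np.maximum(1.0, np.linalg.norm(q, axis=0))

    # Primal descent on rho (Eq. 21); note that -g A^T q equals g div(q).
    prior_num = sum(lam * wi * rpi for lam, wi, rpi in zip(lambdas, w, priors))
    prior_den = sum(lam * wi for lam, wi in zip(lambdas, w))
    rho = (rho + sigma_rho * (g * div(q) + a / theta + prior_num)) \
          / (1.0 + sigma_rho / theta + sigma_rho * prior_den)

    # Point-wise exhaustive search over the auxiliary variable a (Eqs. 22-24).
    a = search_a(rho, theta)
    return rho, q, a
```

The loop would repeat these updates while decreasing \(\theta \) as in the text, until it falls below 0.0001.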

For the initialization of the iterative optimization we use the photometric depth in high-gradient image regions and the average of the scene-prior depths in textureless areas. We have observed that this initialization converges better than a purely photometric one.

4 Experimental results

We have evaluated different aspects of our proposal in indoor and outdoor sequences of small and middle-size scenes. For every indoor experiment we have an RGB-D sequence; we used the D channel as the ground truth depth for the scene, and our algorithm used only the RGB data. We used our own sequences and sequences from the public NYU dataset (Silberman et al. 2012); in both cases the camera was a Microsoft Kinect. The outdoor experiments were recorded with an RGB camera and we only show qualitative results, due to the limitations of RGB-D sensors under direct sunlight.

We divided our results into two subsets. Section 4.1 presents results on low-texture scenes. Section 4.2 presents results on low-parallax camera motions using the sequences from the NYU dataset.

4.1 Low texture scenes

4.1.1 Indoors

Fig. 3

Indoor experiments, high-parallax camera motion, close-ups. First column Reference image. Second column 3D superpixels. Third column ground truth depth—red stands for no-depth-data. Fourth column DTAM depth. Fifth column Ours, using 3DS. Notice how this last column is visually closer to the ground truth than the DTAM one (Color figure online)

We have evaluated the performance of 3D superpixels (3DS) as a prior for direct mapping with 5 indoor sequences (Bookshelf, Desktop, Corner1, Corner2 and Wall). The experiments in this section deviate from the assumptions of the other two priors—layout and geometric primitives—as most of them are close-ups, so we only evaluate the improvement obtained using 3DS. 3DS is a more general prior than the Layout and the data-driven primitives (DDP), as it can be applied to any scene; DDP requires scenes similar to the training set and the Layout requires a global view of the indoor scene. On the other hand, the triangulation of superpixels requires a high-parallax camera motion, while the other two perform reasonably well even in the single-view case.

Figure 3 and Table 1 show the qualitative and quantitative results for these experiments. The Bookshelf experiment is a clearly textured scenario where the photometric term is already very informative and the reconstruction is quite accurate with standard dense mapping. But even in this case, 3DS improves the mean error by \(6\,\%\). In the other four sequences there are larger textureless areas and the gradient-based regularization produces larger errors. In these latter cases, 3DS improves the 3D reconstructions significantly.

We have used a larger sequence recorded in our Lab to compare 3DS and Layout. A qualitative summary of the results is shown in Fig. 4, the 3D maps obtained in Fig. 5 and the quantitative results in Table 1. The 3DS and Layout mean errors are 10.2 cm and 15.5 cm respectively, both smaller than the DTAM baseline error (28.2 cm). 3DS outperforms Layout in this case because the sequence was recorded with large camera translations—and hence high parallax—which is the best configuration for 3DS. A large error in the layout can be observed in the last row of Fig. 4: the red line standing for a corner is wrongly estimated in the middle of a wall, and our approach using Layout (last column) consequently has a high depth error there.

Table 1 Mean of the estimated depth error for the standard DTAM and our approach using 3DS

Notice in the last row of Table 1 that the combination of 3DS and Layout is slightly worse than 3DS alone. The reason is that the different energy terms in the optimization are weighted with the parameters \(\lambda _\pi \), which we learn from training data; with such fixed weights, the less accurate Layout prior can slightly degrade the contribution of the more accurate 3DS one.

Finally, we have used three more sequences (Bedroom1, Bedroom2 and Kitchen) to further evaluate the performance of our algorithm in the high-parallax low-texture case, this time using the three scene priors and comparing against the baseline DTAM and also against the state-of-the-art batch approach PMVS (Furukawa and Ponce 2010). See Fig. 7 and Table 2 for the distribution of the errors in these experiments. Note that in the Bedroom2 and Kitchen experiments the solution of standard DTAM is already quite accurate and we only slightly outperform it. In Bedroom1 the baseline DTAM produces large errors because of the large untextured wall. This error is fixed by the Layout and DDP priors; the algorithm, however, did not find a 3D superpixel for the large wall, so the error with 3DS alone remains close to the DTAM baseline. Notice that we obtained competitive results in the comparison against PMVS. Note also in Fig. 11 that PMVS creates semidense maps and leaves holes in low-textured areas, whereas we achieve fully dense reconstructions (Fig. 6; Table 2).

Fig. 4

Lab experiment. Each row shows the results for a reference image. First column RGB image. Second column 3D superpixels. Third column Room layout and labels. Red lines stand for the projected box. Magenta stands for clutter, green for floor and dark blue for ceiling. Other colors stand for walls. Fourth column ground truth depth—red stands for no-depth-data. Fifth column DTAM depth. Sixth column Our approach, using 3DS. Seventh column Our approach, using Layout. The improvement of the depth maps with planarity constraints over standard DTAM is visually noticeable (Color figure online)

Fig. 5

3D maps for the Lab experiment. Notice the large DTAM errors in (a) and the more accurate reconstructions in (b)—using the layout—and (c)—using 3D superpixels. Notice the differences: b shows small misalignments, while c is globally consistent but with large errors in the objects and final parts of two walls due to wrong labels and layout errors. d Shows a side view of DTAM using 3D superpixels. Quantitative results are in Table 1

Fig. 6

Results from the Bedroom1, Bedroom2 and Kitchen sequences

4.1.2 Outdoors

We have performed two outdoor experiments—at a building corner and a façade—to evaluate 3D superpixels in outdoor scenes. Figure 8 summarizes the results. Observe how in both cases the low-texture walls are not reconstructed as planar by the DTAM baseline; 3D superpixels are able to improve the results and estimate the correct planar surfaces.

Table 2 Mean of the estimated depth error for DTAM, PMVS and ours in high-parallax low-texture sequences

4.2 Low parallax camera motion

We have used the NYU dataset (Silberman et al. 2012) to evaluate the performance of our algorithm on low-parallax camera motion sequences. The first thing to remark is that 3D superpixels perform badly in this case. In our experience, in order to have an accurate estimation of a 3D superpixel the baseline has to be greater than 0.2 times the average depth of the keyframe. This constraint does not hold for the sequences tested in this dataset, so the results for 3DS are the same as the baseline DTAM and we only present results for DDP and Layout. As previously said, this is a clear limitation of 3DS—and in general of multiview geometry—and an advantage of DDP and Layout, which give reasonable results even in the single-view case.

We have performed 4 reconstructions from the NYU dataset, which we denote as NYU \(\#1\), \(\#2\), \(\#3\) and \(\#4\) and which correspond to the sequences printer room 0001 rect (\(\#1\) and \(\#2\)), bedroom 0106 rect (\(\#3 \)) and bedroom 0110 rect (\(\#4\)) of the dataset. Figure 9 shows the Box-and-Whiskers plots of the depth error in these sequences for the baselines DTAM and PMVS and for our dense mapping algorithm using Layout, DDP and both. Notice first the huge error of PMVS compared with the rest of the approaches: being a multiview stereo algorithm, it is strongly affected by low-parallax measurements. The magnitude of the error—one order of magnitude higher than the others—can be seen in Table 3, which shows the mean error values. DTAM is less affected by the low parallax, but the use of scene priors still improves its accuracy. See the mean values in Table 3.

The performance of the different scene priors on these 4 NYU scenes can be better appreciated in Fig. 10. Observe that in the experiment NYU #1 the Layout is wrongly labeled (some cupboards are labeled as walls). This is the reason why the Layout algorithm performs slightly worse than DTAM in this sequence (see the mean values in Table 3). The labeling also has large errors in NYU #4, where part of the floor is labeled as clutter; in this case, however, the texture of the floor allows it to be reconstructed more accurately than with DTAM. In any case, this is precisely the limitation of DDP and Layout: as they rely on data-driven models, their accuracy can be low if the test image is very different from the training ones.

Fig. 7

Box and Whiskers plots showing the depth error distribution for the indoor high-parallax sequences

Fig. 8

Outdoor results, in a Corner and a Façade. The improvement of 3DS can be noticed visually

Fig. 9

Box and Whiskers plots showing the depth error distribution for 4 indoor low-parallax sequences of the NYU dataset

Table 3 Mean of the estimated depth error for DTAM, PMVS and ours in low-parallax sequences
Fig. 10

Overview of the DDP and layout results, the Ground Truth depth and our estimated depth in the NYU dataset sequences. We are able to estimate accurate reconstructions for these low-parallax sequences

Finally, Fig. 11 shows the comparison of our approach against PMVS in our high-parallax sequences and the low-parallax NYU ones. Notice first in the high-parallax sequences that PMVS is a semidense approach that only reconstructs high-gradient pixels. Our approach has the fundamental advantage over PMVS of doing a full 3D reconstruction, as seen in the figure.

Secondly, observe the bad results of PMVS in the low-parallax sequences of Fig. 11. Our approach, leveraging the single-view cues given by the Data-Driven Primitives and the Layout of the room, is able to reconstruct the scene with high accuracy even if the geometric constraints are weak.

Fig. 11

Qualitative comparison of our approach against PMVS in our high-parallax sequences (left) and the NYU low-parallax sequences (right). Notice the sparsity of PMVS in textureless areas and our dense results. Also notice the bad 3D maps produced by PMVS in the low-parallax cases and how our algorithm produces reasonable results

5 Conclusion

In this paper we have presented an algorithm that fuses several scene priors and depth cues in a dense mapping algorithm based on variational methods. Although the multiview geometric constraints stand out as the preferred ones for monocular map building, their results are poor in low-textured areas and for low-parallax motions. We show how the use of (1) Superpixels as mid-level features, (2) Data-Driven Primitives that appear frequently and can be discovered from training samples, and (3) the rough room Layout estimation and pixel labeling can improve the 3D reconstructions in the two failure cases mentioned before.

Our experimental results show that 3D superpixels offer the highest accuracy, but they suffer from the limitations of multiview geometry. First, their accuracy decreases if the parallax is low. Second, superpixel matching can be difficult in certain cases. Their use as mid-level features is therefore recommended only with strict thresholds on the parallax angle and the descriptor distances. We think that superpixels can be an excellent mid-level feature for mapping low-texture regions if mid-baseline correspondences can be found.

Data-Driven Primitives and Layout estimation and labeling are techniques designed for the single-view case, and hence are more robust to low-parallax motions. In this paper we use a multiview version of the latter for robustness, but it works reasonably well for single images. Both cues improve the reconstruction if the camera motion is small, and also in low-textured areas. The reason for the latter is that (1) data-driven primitives capture non-local patterns and hence cover some textureless areas, and (2) the Layout is a global scene model. Their main limitation is that their data-driven nature makes them inaccurate when the test image differs from the training samples. In this case, more training data or more sophisticated learning techniques could alleviate the problem.

For future work, we would like to study the potential of this research for a robust and real-time implementation. Regarding robustness, our main concern is that data-driven techniques can give large errors that are difficult to predict. Regarding real-time performance, we are quite confident that the techniques we use are low-cost. Concha and Civera (2014) already demonstrated that 3D superpixels can be reconstructed in real-time. Flint et al. (2011) and Tsai et al. (2011) estimated a multiview layout—without labeling the image—in real-time. Finally, although there is no experimental evidence of real-time performance for Data-Driven Primitives, they consist of HOG feature extraction and SVM classification, both of which have a low computational cost.