1 Introduction

In computer vision, 3D reconstruction has been one of the most widely researched areas in recent decades, and automatic geometric reconstruction plays a key role in automated intelligent systems [31, 34]. With the decreasing cost of video equipment, we now have both the opportunity and an urgent need to run automated and accurate 3D reconstruction algorithms directly on multiple photographs or video clips. Indeed, the most important technological ingredients towards this goal are already in place: feature matching algorithms [6] provide accurate correspondences, structure-from-motion (SFM) algorithms use these correspondences to estimate accurate camera poses, and multi-view-stereo (MVS) methods finally reconstruct dense and accurate surface models of complex objects from a moderate number of calibrated images. In fact, existing MVS algorithms have been reported to achieve surface coverage of about 95 % and depth accuracy of about 0.5 mm from a set of low-resolution (640×480) images [1, 18].

MVS plays an important role in the automatic acquisition of geometric objects. Existing state-of-the-art MVS algorithms can be roughly categorized into four classes: voxel-, mesh-, depth-map- and patch-based methods. Voxel-based MVS methods (VMVS) [25] represent geometry on a regularly sampled 3D grid (volume), either as a discrete occupancy function or as a function encoding distance to the closest surface. Algorithms based on deformable polygonal meshes [7, 8] represent a surface as a set of connected planar facets and operate by iteratively evolving the surface to decrease or minimize a cost function. Approaches based on multiple depth maps [9, 10] model a scene as a set of depth maps and fuse the individual depth maps into a single 3D model. Finally, patch-based MVS (PMVS) [1] algorithms output a dense collection of small oriented rectangular patches covering the observed surface, obtained from pixel-level correspondences. Recently, CMVS [17] has proven effective in reconstructing crowded scenes from images without an initialization process.

However, the above methods still face the following difficulties. First, they cannot handle incremental reconstruction tasks well: the input images must be properly sequenced manually before reconstruction, and once a geometric object has been obtained, it cannot be incrementally updated when a new view image arrives. Second, the computational cost of existing methods may rapidly increase for batch processing of images, especially when handling huge numbers of images collected from the Internet, making them impractical for real-time applications. Moreover, these methods also face difficulties when dealing with images of varied illumination or scale captured by different users.

In this paper, we propose a novel algorithm aiming at incrementally reconstructing a 3D model under the Bayesian framework. We first select a group of key views uniformly distributed on our view sphere to create an initial 3D surface modeled by PMVS as stated above. Then, when a new calibrated image is input, we (1) map it into a triangle on our view sphere, (2) search for the patches correlated with the new input view, (3) automatically update the initial 3D model using the photometric consistency constraint and geometric smoothness priors under the Bayesian inference framework, and (4) filter out patches estimated as outliers according to the visibility and photometric constraints. Note that each time a new image is added, more geometric details can be extracted and integrated to incrementally optimize the final 3D model (see Fig. 1).

Fig. 1  The framework of our incremental reconstruction system

Our method makes two main contributions. First, we propose a novel incremental 3D reconstruction framework that makes full use of new views to incrementally update and extend an existing 3D model. As a result, the reconstruction process is more efficient and convenient, which is especially useful for automatic 3D reconstruction from large numbers of real-life images or videos and for real-time reconstruction. Second, to our knowledge, no previous work has attempted to reconstruct dense 3D models using the Bayesian learning framework, in which pixel-level information and geometric-level constraints are integrated to optimize the final model. As a result, the reconstruction accuracy can be effectively improved.

The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 gives the details of our system. Experiments and discussions are presented in Sect. 4. Finally, Sect. 5 concludes our work and discusses future improvements.

2 Related work

3D off-line reconstruction from images and video has achieved impressive performance, as mentioned in Sect. 1. In comparison, few works focus on incremental reconstruction from on-line videos or asynchronously input images, which arises in many practical applications such as robot navigation and Web-based reconstruction.

The existing efforts on on-line 3D reconstruction can be roughly categorized into two classes. The first class aims to compute camera pose and reconstruct sparse 3D points incrementally for each frame or image. SLAM [21, 22] and SLAM-based methods [20, 32, 33] are powerful techniques for achieving this target. These methods were originally designed to localize a camera mounted on a robot moving around an unknown scene and to obtain visual odometry or the sparse geometry of its surrounding environment for each frame; [19] extends this to a single uncontrolled camera in a small workspace. As a key component, the performance of SFM influences the overall results, and the development of on-line SFM systems [28, 29] for large-scene reconstruction may boost the maturity of such methods. The second class uses a system of the first class as a front-end and incrementally constructs a globally consistent 3D model.

Recently, Merrell et al. [23] and Pollefeys et al. [24] published real-time methods that use an extended plane-sweeping stereo technique to reconstruct a noisy depth map for each frame and fuse these depth maps into a compact and dense map. Although the results are impressive, these methods require GPU hardware, and the depth-map fusion is performed at the end, making incremental updating impossible.

Other alternative approaches [25, 26] maintain a global 3D model calculated from sparse 3D feature points via Delaunay triangulation and free-space carving. When new features are added, the model is updated according to free-space consistency. However, these methods only fuse new features into the global model and never improve the older ones as new images and frames arrive. In addition, the simple free-space consistency, which is essentially a visibility consistency, does not suffice to fully exploit the new images. Moreover, since the model is based only on features from a SLAM-based system and lacks an extension procedure, it remains relatively sparse. Recently, another method named ProFORMA [27] was developed for on-line reconstruction. That system combines a probabilistic voting scheme with traditional free-space constraints to increase robustness. Nevertheless, it faces the same incremental-updating problem.

3 Our method

In this section, we present our incremental reconstruction algorithm in detail. Our method can be briefly summarized in the following four steps:

  1. Map the given multi-view image set I_source to a view sphere S_initial and select uniformly distributed key views to initialize a 3D model;

  2. For each new input image i_new, map it to S_initial and search for its related patch set P_update on the 3D model;

  3. Re-calculate the patches of P_update using the Bayesian learning framework to incrementally refine the 3D model;

  4. For all updated patches in P_update, check their visibility and photometric consistency and filter out those patches estimated as outliers.

Steps 2 to 4 are repeated until there are no new input images. Note that in Step 2, for any new input image only a subset P_update (named the seed patch set) of the previous 3D model is chosen to be updated, rather than all patches on the model. This is based on the fact that, in each incremental recursion step, the existing patches on the previous 3D model have different correlations with i_new, and we need not update patches with low correlations; for example, there is no (or too low a) correlation between i_new and a patch that is completely invisible to it. This reduces the computational cost without losing accuracy in our incremental reconstruction. The overall framework of the system is shown in Fig. 1.
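For concreteness, the main loop can be sketched in Python as follows. This is a minimal sketch, not our released implementation: the callables `search_seeds`, `extend`, `refine` and `filter_fn`, and the `view_sphere.map_image` method, are hypothetical stand-ins for the procedures detailed in Sects. 3.2–3.4.

```python
# Minimal sketch of the incremental loop (Steps 2-4); the helper
# callables are hypothetical stand-ins for Sects. 3.2-3.4.

def incremental_reconstruction(model, view_sphere, new_images,
                               search_seeds, extend, refine, filter_fn,
                               filter_every=3):
    """Incrementally refine `model` as calibrated images arrive."""
    for round_idx, i_new in enumerate(new_images, start=1):
        # Step 2: map i_new onto the view sphere, collect and extend seeds.
        triangle = view_sphere.map_image(i_new)                   # Eq. (1)
        p_update = extend(model, search_seeds(model, triangle))   # Sect. 3.2

        # Step 3: Bayesian MAP refinement of the selected patches.
        refine(model, p_update, i_new)                            # Sect. 3.3

        # Step 4: filtering runs only once every `filter_every` rounds.
        if round_idx % filter_every == 0:
            filter_fn(model, p_update)                            # Sect. 3.4
    return model
```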

3.1 Initialize a 3D model

Given a calibrated image set I_source, we first select an image subset uniformly distributed over different viewpoints to reconstruct an initial 3D model. The initial key views are selected as follows: (1) map each view image in I_source to a view sphere S_initial (see Fig. 2(a)), with its coordinate determined by the corresponding image plane, namely the normalized principal-axis vector obtained from its projection matrix, and (2) sample key views uniformly across the sphere. Note that a point corresponding to a key view on the sphere actually represents a view image.
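A minimal sketch of this mapping follows, assuming the standard decomposition P = K[R|t] of the projection matrix; the greedy farthest-point sampler is one plausible realisation of the uniform key-view sampling, which is otherwise unspecified here.

```python
import numpy as np

def view_sphere_coordinate(P):
    """Map a calibrated view onto the unit view sphere.

    P is the 3x4 projection matrix P = K[R|t]. Following Hartley &
    Zisserman, the camera's principal axis is det(M) * m3, where
    M = P[:, :3] and m3 is the third row of M; the determinant factor
    orients the axis toward the scene. Normalizing yields the sphere
    coordinate of the view.
    """
    M = P[:, :3]
    axis = np.linalg.det(M) * M[2, :]
    return axis / np.linalg.norm(axis)

def sample_key_views(directions, k):
    """Pick k roughly uniformly distributed key views by greedy
    farthest-point sampling on the sphere (an assumed scheme)."""
    chosen = [0]
    while len(chosen) < k:
        # Similarity of every view to its nearest already-chosen view.
        sim = np.stack([directions @ directions[c] for c in chosen]).max(axis=0)
        chosen.append(int(np.argmin(sim)))
    return chosen
```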

Fig. 2  (a) The view sphere. (b) The patch model

Next, we triangulate S_initial by grouping the neighboring key views on it into triangles using the Delaunay triangulation algorithm [11]. A 3D patch model \(\overline{S}\) is simultaneously reconstructed from the key views using [12], reflecting an initial geometric contour of the target object. Note that the geometric contour is reconstructed using the patch-based approach [1], where the 3D surface is covered by a large number of patches, and a patch p is essentially a local tangent-plane approximation of the surface. A patch p has three geometric attributes (see Fig. 2(b)): c(p), n(p) and R(p), where c(p) denotes the geometric center, n(p) is the unit normal vector oriented toward the cameras observing it, and the reference image R(p) is an image chosen from V(p), the set of images in which p is truly visible, such that the retinal plane of R(p) is nearly parallel to p with only a small distortion.

As a result, a triangulated view sphere and a 3D patch model are obtained as the initialization of our incremental updating system.
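A minimal sketch of these data structures follows: the `Patch` container mirrors the attributes above, and the spherical Delaunay triangulation is obtained from the convex hull of the view directions, since the two coincide for points on a sphere.

```python
from dataclasses import dataclass, field
import numpy as np
from scipy.spatial import ConvexHull

@dataclass
class Patch:
    """A surface patch of the model (Sect. 3.1); names follow the paper.
    V(p) is stored as a set of image indices for the later stages."""
    c: np.ndarray                          # c(p): 3D geometric center
    n: np.ndarray                          # n(p): unit normal toward cameras
    R: int                                 # R(p): reference image index
    V: set = field(default_factory=set)    # V(p): images where p is visible

def triangulate_sphere(directions):
    """Delaunay triangulation of the key views on the unit sphere: for
    points on a sphere, the convex-hull facets are exactly the
    Delaunay triangles."""
    return ConvexHull(np.asarray(directions)).simplices   # (m, 3) vertex ids
```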

3.2 Search related patches for a new input image

In our incremental reconstruction step, we first search for a corresponding patch subset on the previous 3D model for any newly input calibrated image, and then extend this subset to make the model more uniform and well sampled.

3.2.1 Search seed patches for any input image

To find the seed patches P_update for an incrementally input image i_new, we first search for the proper triangle T on S_initial into which i_new can be mapped, using SIFT [6] matches as follows:

$$ T \leftarrow\mathop{\arg\max}_{T}\sum _{v\in T} \bigl|x_{i_{new}}^{v} \bigr| $$
(1)

where \(x_{i_{new}}^{v}\) is the set of matches between i_new and the key view v corresponding to a vertex of triangle T. Then we search for the correlated patch subset P_update on the reconstructed 3D model by

$$ P_{update} = \bigcup_{v \in T} \bigl\{ p \mid p \in \overline{S},\ v = R(p) \bigr\} $$
(2)

Obviously, i_new provides more useful reconstruction details for the patches in P_update than for those outside it. We then update S_initial as follows: (1) add a new vertex representing the new image, (2) add a pyramid of triangles by connecting the new vertex to the three vertices of T, and (3) delete the triangle T in which i_new is located. As a result, we simultaneously obtain an updated view sphere (see i_new in Fig. 2(a)).
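A minimal sketch of this lookup follows. Here `match_counts` is assumed to hold the SIFT match counts of Eq. (1), and the membership test on R(p) is our reading of the visibility condition in Eq. (2).

```python
import numpy as np

def locate_triangle(match_counts, triangles):
    """Eq. (1): choose the sphere triangle T maximizing the total number
    of SIFT matches between i_new and T's three key views.
    match_counts[v] = |x^v_{i_new}|; `triangles` is an (m, 3) id array."""
    scores = [sum(match_counts.get(int(v), 0) for v in tri)
              for tri in triangles]
    return triangles[int(np.argmax(scores))]

def seed_patches(patches, triangle):
    """Eq. (2): collect P_update, read here as the patches whose
    reference image R(p) is one of T's key views."""
    keys = {int(v) for v in triangle}
    return [p for p in patches if p.R in keys]
```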

3.2.2 Extend the seed patches

Next, we extend the patch model to obtain a relatively uniform patch density across different viewpoints over the surface. The extension is driven by the orientation of the new view and the average density of the existing global surface. Note that during this process we may create new patches under local geometric constraints to improve the patch density where patches are too sparse. Our extension consists of the following steps (a code sketch follows the list):

  • Estimate the local density D_p for every patch p in the 3D model. We count its neighbors N(p) to evaluate the local density as follows:

    $$ N(p) = \bigl\{ p' \,\big|\, p' \in \overline{S},\ \bigl|\bigl(c(p) - c(p')\bigr)\cdot n(p)\bigr| + \bigl|\bigl(c(p) - c(p')\bigr)\cdot n(p')\bigr| < 2\rho \bigr\} $$
    (3)
    $$ D_{p} = \bigl|N(p)\bigr| $$
    (4)

    where ρ is computed from the distance, at the depth of c(p) and c(p′), corresponding to an image displacement of u pixels in R(p) (u=2 in our experiments);

  • Compute the global average density D g by averaging all estimated local densities;

  • For every seed patch in P_update whose local density is less than 0.5 D_g, use SMOTE [13] to oversample new patches between the seed patch and its neighbors (see Fig. 3); their initialization is summarized in Table 1. As a result, the original geometric constraints are well maintained;

    Fig. 3  Seed patch extension, where P_new is generated along the line connecting a seed patch P_0 and one of its neighbors P_1

    Table 1 The incremental algorithm
  • Add the new patches into P_update.
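The sketch below illustrates the extension, reusing the `Patch` container from Sect. 3.1. Interpolating the normals along the seed–neighbor segment is an assumption about how the local geometric constraints are maintained; the paper only states that they are.

```python
import numpy as np

def local_density(p, patches, rho):
    """D_p = |N(p)| (Eqs. (3)-(4)): count the patches lying within
    2*rho of p's tangent plane, and vice versa."""
    return sum(
        1 for q in patches if q is not p and
        abs(np.dot(p.c - q.c, p.n)) + abs(np.dot(p.c - q.c, q.n)) < 2 * rho
    )

def smote_extend(seed, neighbors, rate=2.0, rng=None):
    """SMOTE-style oversampling of a sparse seed patch (Fig. 3): P_new
    is interpolated along the segment joining the seed and a randomly
    picked neighbor; the normal is interpolated and renormalised."""
    rng = rng or np.random.default_rng()
    created = []
    for _ in range(int(rate * len(neighbors))):
        q = neighbors[int(rng.integers(len(neighbors)))]
        t = float(rng.random())            # position along the segment
        n = seed.n + t * (q.n - seed.n)
        created.append(Patch(c=seed.c + t * (q.c - seed.c),
                             n=n / np.linalg.norm(n),
                             R=seed.R, V=set(seed.V)))
    return created
```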

3.3 Incremental surface reconstruction using Bayesian learning

This subsection introduces the Bayesian model used in our incremental reconstruction. We exploit photometric consistency and geometric smoothness constraints to obtain high-quality incremental reconstruction results.

Suppose i_new is a measurement made by our camera of the real scene, which is modeled by PMVS in our method. Let S be the surface to be modeled; we want to reconstruct the most likely surface S_MAP given the measurement i_new. This can be achieved by maximizing the Bayesian posterior (MAP) probability p(S|i_new) over the solution space Ω:

$$ S_{MAP} = \mathop{\arg\max}_{S \in \Omega} p(S \mid i_{new}) $$
(5)
$$ p(S \mid i_{new}) = \frac{p(i_{new} \mid S)\, p(S)}{Z} \propto p(i_{new} \mid S)\, p(S) $$
(6)

To reduce the parameter dimension, we constrain Ω to the expanded patch subset P_update as mentioned in Sect. 3. Note that the normalizing constant Z is ignored in (6). p(i_new|S) specifies the likelihood of the measurement i_new agreeing with S; in other words, it measures how well the normal and coordinates of a patch match the real surface according to the information contained in i_new and the other correlated images. It is defined using the photometric discrepancy function [1], which we choose to express the photometric consistency:

$$ p(i_{new} \mid S) \propto \exp\Bigl( -\eta \sum_{p \in P_{update}} E_{p} \Bigr) $$
(7)
$$ E_{p} = \frac{1}{|V(p)|} \sum_{i \in V(p)} h(p, i_{new}, i) $$
(8)

where η is a control coefficient, and h(p,i_new,i) equals one minus the pairwise normalized cross correlation between the projections of the patch into images i_new and i, respectively.
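A minimal sketch of this discrepancy follows, assuming the patch textures have already been sampled as equally sized pixel grids from each image.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equally sized texture grids."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))

def photometric_discrepancy(tex_new, visible_texs):
    """E_p of Eq. (8): the mean of h(p, i_new, i) = 1 - NCC over the
    patch's projections into i_new and every image i in V(p)."""
    return float(np.mean([1.0 - ncc(tex_new, t) for t in visible_texs]))
```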

We use two constraints to define the prior p(S):

$$ p(S) \propto\exp\bigl( - \{ \lambda E_{1} + \zeta E_{2}\} \bigr) $$
(9)

where E_1 and E_2 are two geometric smoothness energy terms, and λ and ζ are weighting coefficients. E_1 is used to ensure the smoothness of the reconstructed surface. For a natural 3D object, we can model surface smoothness by accumulating sub-linear potentials of surface curvature, similar to [14]. Concretely, we define E_1 as follows:

$$ E_{1} = \sum_{p \in P_{update}} \sum_{v \in N(p)} f(p, v) $$
(10)
$$ f(p, v) = \sqrt{1 - n(p) \cdot n(v)} $$
(11)

where N(p) is the set of neighboring patches of p defined in (3), and f(p,v) is a square-root potential with f(p,v)=0 if n(p)=n(v) and positive otherwise.

However, exceptions may still exist even when Eq. (10) is satisfied. For example, in Fig. 4(a) the patch p is an outlier while still having smooth sub-linear relations with the normals of its neighbors in N(p). Although such a patch has a continuous normal, its geometric location is far from the real surface; we therefore use another geometric smoothness energy term E_2 to penalize such errors:

$$ E_{2} = \sum_{p \in P_{update}} \sum_{v \in N(p)} d(p, v) $$
(12)
$$ d(p, v) = \bigl|\bigl(c(p) - c(v)\bigr) \cdot n(p)\bigr| $$
(13)

where d(p,v) is the distance between two patches p and v along n(p) (see Fig. 4(b)).
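A minimal sketch of the combined negative log-prior λE_1 + ζE_2 follows; the concrete forms of f(p,v) and d(p,v) are our reconstructions of Eqs. (11) and (13), and `neighbors` maps each patch index to the indices of its N(p).

```python
import numpy as np

def smoothness_energy(patches, neighbors, lam=0.3, zeta=0.2):
    """Negative log-prior lam*E1 + zeta*E2 of Eqs. (9)-(13).

    E1 accumulates a square-root (sub-linear) potential on normal
    deviation, zero when n(p) = n(v); E2 accumulates the distance
    between neighboring patches along n(p).
    """
    e1 = e2 = 0.0
    for i, p in enumerate(patches):
        for j in neighbors[i]:
            v = patches[j]
            cos = np.clip(np.dot(p.n, v.n), -1.0, 1.0)
            e1 += np.sqrt(max(0.0, 1.0 - cos))     # f(p, v), Eq. (11)
            e2 += abs(np.dot(p.c - v.c, p.n))      # d(p, v), Eq. (13)
    return lam * e1 + zeta * e2
```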

Fig. 4  Geometric smoothness terms. (a) The blue patch p is an outlier; however, it has a continuous normal with its neighboring patches. (b) d(p,v) is the absolute distance between two patches p and v along n(p)

This minimization problem requires adjusting c(p) and n(p) for every patch p in S from its initial value to the final convergent solution; it is a sparse energy-minimization problem. To simplify it and reduce the number of variables, we constrain c(p) to lie on a ray so that its projection into R(p) does not change, and we model n(p) with Euler angles. Thus only three parameters per patch participate in the optimization, greatly reducing the dimension of the solution space and improving the stability of the search. We use conjugate gradient descent to solve the global optimization problem; the derivatives of the geometric smoothness prior can be computed analytically, while those of the photometric consistency term are currently estimated numerically.
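A minimal sketch of this three-parameter formulation follows, using SciPy's conjugate-gradient routine with numerical gradients as a stand-in for our optimizer; `energy` is assumed to be a callable combining the photometric and smoothness terms.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(theta, ray_origin, ray_dir):
    """Decode the 3 per-patch parameters: depth d along the viewing ray
    of R(p) (keeping the projection into R(p) fixed) and two Euler
    angles (alpha, beta) encoding the unit normal."""
    d, alpha, beta = theta
    c = ray_origin + d * ray_dir
    n = np.array([np.cos(beta) * np.cos(alpha),
                  np.cos(beta) * np.sin(alpha),
                  np.sin(beta)])
    return c, n

def refine_patch(theta0, ray_origin, ray_dir, energy):
    """Minimise the negative log-posterior for one patch with conjugate
    gradients; gradients are estimated numerically by SciPy."""
    obj = lambda t: energy(*unpack(t, ray_origin, ray_dir))
    return minimize(obj, theta0, method="CG").x
```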

Considering that input images may be similar to each other, such as adjacent frames of the same video, we further group adjacent view images and use each image group to refine the 3D model, avoiding redundant information and further reducing the reconstruction cost. Following this intuition, we replace the measurement i_new with an image group G_new in the Bayesian framework and update Eq. (8) to

$$ E_{p} = \frac{1}{|G_{new}|\,|V(p)|} \sum_{i_{new} \in G_{new}} \sum_{i \in V(p)} h(p, i_{new}, i) $$
(14)

Additionally, we construct the Bayesian solution space Ω, i.e. P_update, for each group. As mentioned in Sect. 3.2.1, we first search for the proper triangle and the correlated patches for each new view in G_new in the same way as Eq. (1) and Eq. (2); the correlated patches of G_new are then the union of these per-view sets, namely

$$ P_{update} = \bigcup_{i_{new} \in G_{new}} P_\mathit{update\_inew} $$
(15)

where \(P_{\mathit{update\_inew}}\) is the correlated patch set of each new image in G_new obtained by Eq. (2). Note that these correlated patch sets usually overlap substantially, because adjacent views come from the same video.
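A minimal sketch of Eq. (15) follows, with patches represented by indices so that the union removes the duplicates shared between adjacent frames.

```python
def group_update_set(per_image_sets):
    """Eq. (15): P_update for a group G_new is the union of the
    per-image correlated patch index sets; the large overlap between
    adjacent video frames is removed automatically by the union."""
    p_update = set()
    for s in per_image_sets:
        p_update |= set(s)
    return p_update
```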

3.4 Filter outliers

After refining the model, we finally remove erroneous patches that do not conform to the visibility consistency and photometric consensus. Thanks to the reasonable initialization from the 3D model, whose rough geometric relations are already in place before updating, few patches end up as outliers caused by bad local minima after refinement. We handle the photometric consistency by discarding patches with a poor photometric score, as calculated by Eq. (8). For the visibility consistency, for each patch p of the 3D model we update V(p) with a depth-map test similar to [1] and compare its size to a preset threshold. In conclusion, the patches satisfying

$$ \bigl\{ p \mid p \in P_{update},\ \bigl|V(p)\bigr| > \alpha,\ E_{p} < \beta \bigr\} $$
(16)

are regarded as reasonable and are retained in the final model. In addition, to improve efficiency, we perform filtering only once every three rounds as new images are gradually input.
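A minimal sketch of this filtering rule follows; `scores` is assumed to hold the per-patch discrepancies E_p of Eq. (8).

```python
def keep_patch(patch, e_p, alpha=4, beta=0.25):
    """Eq. (16): retain a refined patch only if it remains visible in
    more than alpha images (V(p) updated by the depth-map test) and
    its photometric discrepancy E_p stays below beta."""
    return len(patch.V) > alpha and e_p < beta

def filter_round(patches, scores, round_idx, every=3, alpha=4, beta=0.25):
    """Run the outlier filter only once every `every` incremental rounds."""
    if round_idx % every != 0:
        return patches
    return [p for p, e in zip(patches, scores)
            if keep_patch(p, e, alpha, beta)]
```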

As a summary, our incremental updating algorithm is shown in Table 1.

4 Experiments and discussions

We have implemented our incremental reconstruction algorithm in C++. The datasets [15, 16] used in our experiments are shown in Table 2, together with the number of input images, their approximate sizes, the number of key views we choose and the number of patches in the initial model reconstructed with PMVS [12]. In our incremental processes, we set λ, ζ, η, α and β to 0.3, 0.2, 0.7, 4 and 0.25, respectively. Figure 5 shows our incremental reconstruction results on these datasets.

Fig. 5  Our incremental reconstruction results. (a) 2D sample images, (b) the initial 3D model, (c)–(e) the incremental reconstruction results. From top to bottom, the datasets are dinosaur, human skull cast, Morpheus, predator, soldier and temple

Table 2 The datasets used in our experiments

In Fig. 5, Columns (a) and (b) show example 2D images and the initial models reconstructed from the key views, respectively. As new images are gradually added, the models are incrementally updated, as shown in the remaining three columns (c)–(e). The models are dynamically optimized and enriched with more details during this process.

To evaluate our method quantitatively, we adopt the weighted sum of normalized cross correlations (NCC), P_k, to model the accuracy of a patch k, formulated as

$$ P_{k} = \frac{1}{\sum_{i \in V(k)} 1/r(k,i)} \sum_{i \in V(k)} \frac{1 - h(k,R(k),i)}{r(k,i)} $$
(17)

where r(k,i) is the diameter of a sphere centered at k whose projected diameter equals one pixel in image i. The weight associated with each visible image in V(k) reflects its contribution to the patch refinement. This measurement indicates the accuracy of a patch k indirectly by estimating its overall photometric consistency over V(k), weighting the images according to their correlation with the patch; the larger the measurement, the more accurate the patch. During each incremental step, we calculate the ratio of patches in P_update whose weighted NCC scores increase after updating (see Fig. 6). Figure 6 shows discrete measurements: different points on the curves are unrelated, and the curves could equally be presented as tables given enough space. It can be seen that after adding a new image, the NCC accuracy of more than 50 % of its related patches improves on average, illustrating the effectiveness of our method.
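A minimal sketch of this metric follows, assuming the h values and radii have been precomputed for the images in V(k); P_k is computed as a weighted mean with weights 1/r(k,i), matching Eq. (17).

```python
import numpy as np

def patch_accuracy(h_scores, radii):
    """Weighted NCC score P_k of Eq. (17).

    h_scores[i] = h(k, R(k), i) for each image i in V(k), and
    radii[i] = r(k, i). Each NCC value (1 - h) is weighted by
    1/r(k, i), so closer or higher-resolution views contribute more.
    """
    w = 1.0 / np.asarray(radii, dtype=float)
    ncc_vals = 1.0 - np.asarray(h_scores, dtype=float)
    return float(np.sum(w * ncc_vals) / np.sum(w))
```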

Fig. 6  The overall statistical analysis. (a) The ratio of patches having higher photometric consistency scores, (b) the number of extended patches

We also find in Fig. 6(a) that the ratio varies with the image quality and the position on the view sphere during the incremental reconstruction steps. This is because the geometric smoothness terms play a larger role in the optimization for poor-quality images, so the overall accuracy may simultaneously be reduced by over-smoothing.

Figure 6(b) shows the number of extended patches in each incremental reconstruction step, with a sampling rate of 200 % in our experiments. The number clearly depends on the viewpoint of the 2D image, and more patches need to be generated in sparse regions. Note that not all extended patches are finally added to the result model, due to the global geometric constraints and the pixel-level information. Table 3 gives the acceptance ratios for extended patches in our experiments, averaged over the statistics of each new image. The performance of our filter is listed in Table 4, indicating that our system successfully removes erroneous patches. Note that few patches in our algorithm are estimated as outliers, showing that the 3D model becomes more accurate.

Table 3 The average accepted rate of extending patches
Table 4 The average accepted rate of our filter

We then use a bootstrapping-like approach to further evaluate the bias introduced by the selection of key views. For each dataset, we repeat the overall algorithm 20 times and thus obtain 20 final 3D models. Note that the selected key-view sets differ between test rounds due to the random sampling mentioned in Sect. 3.1. We then use the D2 3D matching method [30] to calculate their pairwise similarities. Finally, the average similarity AS and variability V are computed as follows:

$$ AS = \frac{2}{n(n-1)} \sum_{i<j} S(i,j) $$
(18)
$$ V = \frac{2}{n(n-1)} \sum_{i<j} \bigl( S(i,j) - AS \bigr)^{2} $$
(19)

where n = 20 is the number of test rounds and S(i,j) denotes the similarity between the final models obtained in the ith and jth rounds. Note that a high average similarity generally indicates accurate reconstructed 3D models, while a low variability indicates the robustness of our algorithm. The experimental results for each dataset are listed in Table 5. From the statistics, both the average similarities and the variabilities for the different datasets are satisfactory, reflecting the robustness and effectiveness of our algorithm.
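A minimal sketch of these statistics follows, taking as input the symmetric matrix of pairwise D2 similarities.

```python
import numpy as np

def similarity_statistics(S):
    """Average similarity AS and variability V (Eqs. (18)-(19)) from
    the pairwise similarity matrix S of the n bootstrap models, where
    S[i, j] is the D2 similarity between the models of rounds i and j."""
    n = S.shape[0]
    iu = np.triu_indices(n, k=1)          # all pairs with i < j
    pairs = S[iu]
    avg = pairs.mean()                    # AS, Eq. (18)
    var = ((pairs - avg) ** 2).mean()     # V, Eq. (19)
    return float(avg), float(var)
```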

Table 5 The average similarity and variability for each dataset

We also compare variants of our method with different smoothness combinations: (1) a uniform prior, (2) only E_1, (3) only E_2, and (4) both E_1 and E_2, on the same dataset, human skull. Note that the uniform prior carries no information and does not contribute to the refinement of patches, so condition (1) degenerates to the traditional PMVS, which updates patches by optimizing the photometric consistency alone. From the results shown in Fig. 7(a), the traditional PMVS, namely condition (1), performs better than the other three conditions, which have almost the same effect, although the prior combining E_1 and E_2 seems slightly better, with higher ratios of well-improved patches. However, due to the lack of additional constraints, patches in the traditional PMVS are easily trapped in bad local minima, manifesting a higher measurement while violating the overall geometric constraints; that is, many outliers remain in the traditional PMVS, as shown in Fig. 7(b). Figure 7(b) also shows that the method with the two smoothness terms together generates fewer outliers after updating than with either E_1 or E_2 alone, since the additional constraints effectively avoid bad local minima. The experiments also show that E_1 and E_2 complement each other and work together to guide the optimization toward the optimal solution.

Fig. 7  The comparison using different priors. (a) The ratio of patches having higher photometric consistency scores for three methods. (b) The patches estimated as outliers by our filter in each incremental step

Comparison with the traditional PMVS algorithm shows that our method successfully integrates the photometric consistency and the geometric prior in the Bayesian scheme and clearly improves the entire 3D model by maximizing the posterior to find a better solution. We also compared variants with different smoothness terms, verifying our choice of geometric prior. However, our method still has some shortcomings. First, although in principle it can generalize to arbitrary scenes, the algorithm is mainly suitable for modeling single objects: the view sphere and the limited set of uniformly distributed views or frames required for initialization make it unsuitable for large scenes. Additionally, although we group views and exploit their redundancy to accelerate the processing of videos and similar images as mentioned above, the method still faces a considerable time-complexity challenge because of the global optimization.

5 Conclusions

In this paper we propose a novel incremental reconstruction algorithm for calibrated multi-view stereo. Our method first initializes a 3D patch model using selected key views; then, whenever a new image is input, seed patches for which the new image provides useful reconstruction details are searched for and extended to make the surface of the 3D target uniform, and the patches are finally refined by incremental learning under the Bayesian framework. Experiments on six open datasets illustrate the effectiveness of our method, and a bootstrapping approach verifies its robustness. Considering that the selected key views must be distributed uniformly on the view sphere, our future work will focus on reconstructing crowded-scene models directly from real-life videos in an on-line, incremental way to remove this limitation. Another improvement lies in evaluating different 3D reconstruction methods in a more comparable way, especially for incremental reconstruction applications.