Abstract
In this paper, we present a flexible and fast system for the 3D reconstruction of multi-scale objects/scenes from uncalibrated images/video taken by a moving camera with variable parameters. The proposed system is based on incremental structure from motion and a careful exploitation of bundle adjustment. First, from two selected images, our system recovers, in a well-chosen reference frame, the coordinates of a set of 3D points. In this context, we propose a new self-calibration method based on two unknown scene points and their image projections. New images are then inserted progressively using the 3D information already obtained, and a local bundle adjustment refines the newly estimated entities. At a chosen stage, we apply a global bundle adjustment to refine all estimated entities as well as possible and to obtain a high-quality initial 3D model covering an interesting part of the object/scene. This model serves as a reference for the insertion of the remaining images. The proposed system obtains satisfactory results within a reasonable time.
1 Introduction
3D reconstruction from images/video has been an important and widely studied subject in recent decades. The goal is to recover 3D information from 2D images taken from different viewpoints or from a video stream.
Several approaches [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] have been proposed to solve this problem. Some of them are based on point matching between different images. Structure-from-motion approaches [5, 11, 12, 15] allow automatic recovery of the 3D structure and camera motion. They are based on the detection and matching of interest points between different images. The matched points, together with other estimated geometric entities (epipolar geometry), are used to recover a projective (uncalibrated images) and sparse representation of the scene. To move to a metric/Euclidean representation, a camera self-calibration step recovering the intrinsic parameters is necessary. The reconstructed sparse 3D point cloud does not properly capture the shape of objects, so dense matching methods should be used to obtain results closer to reality [16]. Delaunay triangulation, the Crust method [17] and Poisson surface reconstruction [18] are methods used to convert the obtained 3D point cloud into a triangulated surface model. Approaches based on multi-view stereo [7,8,9,10] are often used to obtain high-quality dense 3D reconstruction results, but they require calibrated stereo images as input as well as a long computation time. In robotics, simultaneous localization and mapping (SLAM) [19, 20] performs robot localization and environment map construction simultaneously, using data retrieved from sensors that may include cameras.
In this work, we propose a complete 3D reconstruction system from uncalibrated image/video sequences. Our system is able to produce very realistic three-dimensional models using a single camera. The camera intrinsic parameters are variable, the camera displacements are free, and the reconstruction environment is uncontrolled. All these factors offer more flexibility and generality for the 3D reconstruction of arbitrary (multi-scale) objects/scenes. Our system is based on incremental structure from motion. It is initialized from two images with a sufficient number of matches and a large camera motion [15]. In this context, we propose a new method for automatically recovering the intrinsic and extrinsic camera parameters corresponding to these two images. The 3D structure of the object/scene is initialized by triangulating the matched/tracked interest points between these two images. Since the quality of the initialization affects the whole system, bundle adjustment is applied to refine the estimated entities. The projection matrix of each newly inserted image is estimated after locating the projections of the reconstructed 3D points in that image. This estimation is based on the RANSAC algorithm [21], solving a linear system built from already reconstructed 3D points and their projections located in the inserted image. Next, new 3D points are recovered from the interest point matches between the inserted image and the preceding one, and a local bundle adjustment refines the newly estimated entities. Local optimization alone does not guarantee the accuracy of the obtained solutions, so our system integrates a global bundle adjustment after the insertion of \(M_{0}\) images (in our experiments \(10\le M_0 \le 20\)) to obtain an initial 3D model that is then used as a reference to insert the remaining images and reach a more complete final 3D model.
To obtain a surface model, the Poisson surface algorithm [18] or the 3D Crust method [17] can be applied to the obtained 3D point cloud. Finally, texture mapping provides realistic results.
The careful exploitation of existing techniques (incremental structure from motion [12], bundle adjustment [5, 12, 22, 23], ...) and our own view of the problem (camera self-calibration from only two images, the use of a camera with varying intrinsic parameters, a combined local and global treatment of the optimization, ...) allow us to propose a system that is fully automatic, flexible and able to reconstruct multi-scale three-dimensional models (small, medium and large) within a reasonable computation time. In contrast, some systems [5, 24] rely on a global bundle adjustment of all estimated entities, which requires a fairly long computation time and demands a careful parameter initialization step to avoid falling into local optima. Other systems [11, 12] require prior information on the camera parameters to perform a 3D reconstruction of the scene.
This paper is organized as follows. Section 2 presents related work. Section 3 describes the notations and background. The proposed method is described in Sect. 4. The experiments and the comparison of our method with other methods are presented in Sect. 5. Finally, the conclusion is presented in Sect. 6.
2 Related work
Several methods have been proposed to solve the problem of 3D reconstruction from images/video. Methods based on multi-view stereo [8, 10, 25] yield satisfactory results with high density. Tran and Davis [25] presented a graph-cut method to recover the 3D object surface using silhouettes and foreground color information. In [8], the authors presented a new method for large-scale multi-view stereo based on dense matching between very high-resolution images. It yields a very dense, high-quality 3D point cloud at a relatively low computational cost. However, it requires richly textured images to avoid resorting to costly optimization algorithms. Furukawa and Ponce [10] also proposed a method for multi-view stereo. It consists in retrieving an initial set of patches covering the surface of the object/scene from the matching result of key points detected by the Harris and DoG operators. The final patches are obtained by iterating between an expansion step, to obtain a dense set of patches, and a filtering step based on the visibility constraint, to eliminate false matches. Finally, the resulting patch model is converted into a polygonal mesh, which can be refined by applying photometric consistency and regularization constraints. All these methods provide dense three-dimensional models of good quality, but they start from calibrated stereo images, i.e., they require cameras with known parameters, and they are expensive in terms of computation time.
Structure-from-motion methods [5, 11, 12, 26] automatically recover both the three-dimensional structure and the camera motion from uncalibrated image sequences. These methods are based on the detection and matching of interest points between different images. Pollefeys et al. [5] presented a complete system for three-dimensional modeling from uncalibrated images. First, structure from motion is used to recover the projective 3D structure and camera motion. Then, the estimated entities are refined by a global bundle adjustment. To obtain a metric 3D reconstruction, they go through a camera self-calibration phase based on the absolute conic. Finally, image pairs are rectified and multi-view stereo matching is used to obtain a dense 3D reconstruction. However, the global bundle adjustment requires a very long computation time, especially with a large number of images, and can converge to a local solution due to a bad initialization. Snavely et al. [11, 26] presented a system for 3D reconstruction from large photo collections. It is based on incremental structure from motion to simultaneously recover the camera motion and a sparse 3D reconstruction of the scene. However, it requires prior information to initialize the camera parameters (the use of EXIF tags) and a long computation time, dominated by the bundle adjustment performed after the insertion of each new image, especially as the number of images increases. In [27], the authors presented a new incremental structure-from-motion technique based on a geometric verification strategy, next-best-view selection and a robust triangulation method. Fuhrmann et al. [13] presented a system for the 3D reconstruction of multi-scale scenes from images, based on structure from motion, multi-view stereo depth maps and surface reconstruction. Mouragnon et al. [12] proposed a method for the real-time estimation of motion and 3D structure from video captured by a calibrated camera. The method relies on local bundle adjustment to refine the camera poses and 3D structure. However, it requires a camera with known intrinsic parameters that remain unchanged during image acquisition, and the quality of the 3D reconstruction is not assured because errors accumulate as the number of images increases.
3 Notation and background
3.1 Pinhole camera model
The pinhole camera model projects a scene 3D point \(A_j =\left( {X_j ,Y_j ,Z_j ,1} \right) ^\mathrm{T}\) onto the image point \(a_{ij} =\left( {u_{ij} ,v_{ij} ,1} \right) ^\mathrm{T}\). This projection is represented by the following formula: \({\uplambda }_{ij} a_{ij} =P_i A_j\),
where \({\uplambda }_{ij} \) is a nonzero scale factor, \(P_i =K_i \left[ {{\begin{array}{cc} {R_i }&{} {t_i } \\ \end{array} }} \right] \) is a \(3\times 4\) projection matrix, \(t_i \) is a translation vector, \(R_i \) is \(3\times 3\) a rotation matrix defined by:
where \(\alpha _i \), \(\beta _i \) and \(\gamma _i \) are the three Euler angles.
\(K_i \) is the intrinsic parameter matrix defined by:
where \(f_i\) is the focal length, \(\varepsilon _i\) is the scale factor, \(s_i \) is the skew factor and \(\left( {u_{{0i}} ,v_{{0i}} } \right) \) are the coordinates of the principal point.
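As an illustration, the projection above can be sketched in a few lines of NumPy. The layout of \(K_i\) below (scale factor \(\varepsilon_i\) multiplying the focal length on the second diagonal entry) is one common convention and is our assumption, not a detail fixed by the text:

```python
import numpy as np

def intrinsic_matrix(f, eps, s, u0, v0):
    # Assumed layout: focal length f, scale factor eps on the v-axis, skew s,
    # principal point (u0, v0).
    return np.array([[f,   s,       u0],
                     [0.0, eps * f, v0],
                     [0.0, 0.0,     1.0]])

def project(K, R, t, A):
    # P_i = K_i [R_i | t_i]; A is a homogeneous 3D point (X, Y, Z, 1).
    P = K @ np.hstack([R, t.reshape(3, 1)])
    a = P @ A                      # lambda_ij * (u_ij, v_ij, 1)
    return a[:2] / a[2]            # divide out the nonzero scale factor
```

A point on the optical axis projects to the principal point, which gives a quick sanity check of the construction.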
3.2 Estimation of distortion coefficients
The distortion effect affects the quality of the 3D reconstruction [28]. In this work, we consider the first two coefficients of radial distortion \(k_{1}\) and \(k_{2}\) in order to obtain more accurate results.
The relationship between the distorted image points \((u_d ,v_d )\) and the undistorted image points (u, v) is defined by [29]:
where \(\left( {u_{0i} ,v_{0i} } \right) \) are the coordinates of the principal point that correspond to the ith image and \(\left( {x,y,1} \right) ^\mathrm{T}=K_i^{-1} \left( {u,v,1} \right) ^\mathrm{T}\).
So, for each image point we have the following formula:
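For concreteness, here is a small sketch of this two-coefficient radial model, assuming the standard form in which the correction factor \(k_1 r^2 + k_2 r^4\) is computed from the normalized coordinates \((x, y)\) and applied around the principal point:

```python
import numpy as np

def distort(K, k1, k2, uv):
    # Map an ideal image point (u, v) to its distorted position (u_d, v_d)
    # using the first two radial distortion coefficients.
    u0, v0 = K[0, 2], K[1, 2]
    x, y, _ = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])  # normalized coords
    r2 = x * x + y * y
    factor = k1 * r2 + k2 * r2 * r2
    return np.array([uv[0] + (uv[0] - u0) * factor,
                     uv[1] + (uv[1] - v0) * factor])
```

By construction the principal point is a fixed point of the model, and a negative \(k_1\) pulls points toward the image center (barrel distortion).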
3.3 Homography between two images
The homography between two images \(I_{i}\) and \(I_{j}\) is represented by a \(3\times 3\) matrix denoted \(H_{ij} \). For each point \(a_{ik} \) of the image \(I_{i}\) and its corresponding \(a_{jk} \) in the image \(I_{j}\), we have the following relationship:
Four matches, no three of which are collinear, are sufficient to estimate this matrix. Using the RANSAC algorithm [21] provides a reliable solution.
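A minimal sketch of the estimation step follows (the direct linear transform on four or more matches; the RANSAC wrapper from [21] is omitted here):

```python
import numpy as np

def homography_dlt(pts_i, pts_j):
    # Estimate H_ij such that a_jk ~ H_ij a_ik, from >= 4 correspondences
    # with no three points collinear (DLT, no outlier rejection).
    rows = []
    for (xi, yi), (xj, yj) in zip(pts_i, pts_j):
        rows.append([xi, yi, 1, 0, 0, 0, -xj * xi, -xj * yi, -xj])
        rows.append([0, 0, 0, xi, yi, 1, -yj * xi, -yj * yi, -yj])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(rows, float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]
```

In practice this estimator runs inside a RANSAC loop: minimal four-match samples are drawn, and the homography with the largest consensus set is kept.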
3.4 Selection of two images with a large displacement
In this work, we used the criteria already presented in [15] to select two images with a large displacement.
Let \(I_r \) be the reference image. The disparity matrix is defined as follows:
where m is the number of images, n is the number of matches, \(\Vert a_{ij} -a_{rj}\Vert _F\) is the disparity between the two points of the pair \((a_{ij} ,a_{rj} )\), the jth match between \(I_r\) and \(I_i \).
The image \(I_{r'}\) that corresponds to a large camera motion relative to the reference image \(I_r\) is obtained by the use of the following formula:
where \(D_M =\left[ {{\begin{array}{c} {e_1 } \\ \vdots \\ {e_m } \\ \end{array} }} \right] \) is a vector which represents the mean of each row of D; \(D_S =\left[ {{\begin{array}{c} {s_1 } \\ \vdots \\ {s_m } \\ \end{array} }} \right] \) is a vector which represents the standard deviation of each row of D; \(\odot \) denotes the element-by-element multiplication.
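One plausible reading of this criterion is sketched below, under the assumption (ours, not spelled out in the text) that \(I_{r'}\) is the row maximizing the element-wise product \(D_M \odot D_S\), i.e. a view whose matches moved both far on average and with a large spread:

```python
import numpy as np

def select_second_view(D):
    # D: m x n matrix, D[i, j] = disparity of the j-th match between I_r and I_i.
    # Returns the index of the view maximizing mean(row) * std(row).
    D_M = D.mean(axis=1)          # per-image mean disparity
    D_S = D.std(axis=1)           # per-image spread of disparities
    return int(np.argmax(D_M * D_S))
```

The spread term penalizes degenerate motions (e.g. pure rotation or zoom, where all disparities are nearly equal), which would destabilize the fundamental matrix estimation.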
4 Proposed method
We present an incremental 3D reconstruction system based on structure from motion and a careful exploitation of bundle adjustment. It takes as input uncalibrated images/videos captured by a camera with variable parameters. As output, it determines the camera parameters (intrinsic and extrinsic) and the three-dimensional structure. Our system offers the flexibility to adapt to and reconstruct multi-scale objects/scenes (small, medium and large) in a reasonable time compared to real-time methods [12].
As is well known, structure from motion applied to uncalibrated images recovers only a projective 3D reconstruction. To obtain a metric/Euclidean 3D reconstruction, one must go through a camera self-calibration phase. In this work, we propose a system that directly retrieves the metric structure of the 3D scene. First, our system is initialized from two images with a sufficient number of matched interest points and a large camera movement [15]. In this context, we propose a new camera self-calibration method from two images, which allows us to retrieve the coordinates of a set of 3D scene points corresponding to the matched image points. After each insertion of a new uncalibrated image, the camera parameters are retrieved based on the previously estimated 3D structure, and new 3D points are recovered from the interest point matches between the inserted image and the preceding one. Our 3D reconstruction system proceeds in three main steps: detection and matching/tracking of interest points between different images, initialization of the reconstruction system from two images with a large displacement [15] and incremental insertion of new images. The global algorithm of our 3D reconstruction system is presented as follows:
Algorithm 1 allows, from m (\(m\ge 2\)) input images, the gradual recovery of the metric 3D structure and of the camera parameters (intrinsic and extrinsic) corresponding to each image. The value of \(m_0\) is initialized to 3 because the local bundle adjustment is applied to the last three images. This provides a reliable initial solution for the global bundle adjustment (GBA), which is applied after the insertion of \(M_0\) images to obtain a high-quality initial 3D model. The value of \(M_0\) is selected between 10 and 20 because applying the GBA to a large number of images requires considerable computation time, while using too few images does not yield a reliable initial 3D model. When inserting the remaining images \(\{I_k \}_{M_0 <k\le m}\), the value of \(m_0\) can be taken greater than 3 to increase the reliability of the system. The initialization of our system from two images is a critical phase. So, to ensure stability and reliability, we select two images with a sufficient number of matched points and a large camera movement [15]. The selected images ensure the stability of the epipolar geometry computation (estimation of the fundamental matrix). This matrix is used later to estimate the camera parameters.
4.1 Interest point detection and matching/tracking
In this work, we have chosen the SIFT algorithm [30] for interest point matching between different images because of its robustness to scale changes compared to other methods [31]. False matches are eliminated and the fundamental matrix is estimated using the RANSAC algorithm [21].
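SIFT matching is commonly filtered with Lowe's ratio test before the RANSAC stage; the sketch below operates on raw descriptor arrays (the 0.8 threshold is our assumption, not a value fixed by the text):

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    # Lowe's ratio test: keep a match only when the best descriptor distance
    # is clearly smaller than the second best, which rejects ambiguous matches.
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j, k = np.argsort(dists)[:2]          # best and second-best candidates
        if dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))
    return matches
```

The surviving matches are then passed to RANSAC [21], which estimates the fundamental matrix while discarding the remaining outliers.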
4.2 Initialization of 3D reconstruction system from two images
The initialization of our 3D reconstruction system is performed from two selected images with a sufficient number of matched points and a large camera displacement [15] in order to stabilize the computations. It consists of:
1. Camera self-calibration: estimation of the intrinsic parameters from two images.
2. Estimation of the extrinsic parameters.
3. Recovery of a set of 3D points from the matched interest points.
4. Bundle adjustment taking the radial distortion into account.
4.2.1 Self-calibration
In this step, we present a new formulation of the self-calibration problem based on a judicious choice of the global reference frame and on planar calibration/self-calibration concepts [29]. Assuming that the principal point is at the image center and that the skew factor is zero, this formulation leads to a linear system that determines the scale factor and the focal length. It also allows us to estimate the camera extrinsic parameters automatically.
Let \(A_1\) and \(A_2\) be two unknown points of the 3D scene such that \(a_{i1}\) and \(a_{i2} \) are, respectively, their projections in the image \(I_i\) with \(1\le i\le 2\). We define a Euclidean reference frame (O, X, Y, Z) such that O is the midpoint of the segment \([A_1 A_2 ]\) and the two points \(A_1 \) and \(A_2 \) belong to the plane \(\hbox {Z} = 0\) (plane \(\hbox {OXY}\)) (see Fig. 1).
In this reference:
where \(d={A_1 A_2 }/2\) is half the distance between the two points and \(\theta \) is the angle between the line \((A_{1}A_{2})\) and the X-axis of the reference frame.
To simplify the calculations, we can choose the global reference frame such that \(\theta =\pi /3\) (other values can be selected). So, we obtain:
We consider the Euclidean reference (O, X, Y).
In this reference:
The projection of the plane \(Z=0\) in the image plane \(I_i\) is defined by the homography \(H_i \) as:
Since \(A_1 \in \left( {O,X,Y} \right) \) and \(A_2 \in \left( {O,X,Y} \right) \), we can write:
Expressions (2) and (3) can be written in the form:
We put:
Expression (5) can be expressed as follows:
We put:
Expression (8) can be written as follows:
\(Q_i\) is a \(3\times 3\) matrix that projects the point \(A_j^{\prime } \) into the image \(I_i \).
At first, we want to recover the projection matrices \(\{Q_i \}_{1\le i\le 2} \).
From (9), we can write:
\(H_{12}\) is the homography between the images \(I_1\) and \(I_2\).
In (10), we replace \(Q_2\) by its formula (11) and we obtain:
From [32], we know that:
where \([e_2 ]_\times =\left( {{\begin{array}{ccc} 0&{}\quad {-e_{2_z} }&{}\quad {e_{2_y} } \\ {e_{2_z} }&{}\quad 0&{}\quad {-e_{2_x} } \\ {-e_{2_y} }&{}\quad {e_{2_x} }&{}\quad 0 \\ \end{array} }} \right) \) is the antisymmetric matrix associated with the epipole \(e_2 =\left( {e_{2_x} ,e_{2_y} ,e_{2_z} } \right) ^\mathrm{T}\) of the image \(I_2\).
Then, formulas (12) and (13) give:
From formulas (10) and (14), a system of eight linear equations in the elements of \(Q_1\) is obtained. Solving this system yields the matrix \(Q_1\).
Expressions (11) and (13) give:
The matrix \(Q_2\) is estimated from Eq. (15).
Now, the estimated projection matrices will be used to recover the intrinsic parameters. Formulas (4) and (9) give:
The previous formula gives:
where \(\omega _i =\left( {{\begin{array}{ccc} {\omega _{i00} }&{}\quad {\omega _{i01} }&{}\quad {\omega _{i02} } \\ {\omega _{i10} }&{}\quad {\omega _{i11} }&{}\quad {\omega _{i12} } \\ {\omega _{i20} }&{}\quad {\omega _{i21} }&{}\quad {\omega _{i22} } \\ \end{array} }} \right) =\left( {K_i K_i^\mathrm{T} } \right) ^{-1}\) is the image of the absolute conic.
We put \({B}'=\left( {\begin{array}{cc} d&{}\quad 0\\ 0&{}\quad d\\ 0&{}\quad 0\\ \end{array}}\right) \).
Formula (17) gives:
Formula (18) gives:
From (19), the following equation system is obtained:
Assuming that the principal point \((u_{0i}, v_{0i})\) is at the center of the image and that \(s_i =0\), \(\varepsilon _i \) and \(f_i \) can be determined.
Expression (20) gives:
where:
Expression (21) is a linear system of the form:
where \(X_1 =\varepsilon _i^2 f_i^2 \) and \(X_2 =\varepsilon _i^2\).
Solving this linear system by substitution yields \(X_1 \) and \(X_2 \). The values of \(\varepsilon _i\) and \(f_i\) are then obtained as the positive roots \(\varepsilon _i =\sqrt{X_2}\) and \(f_i =\sqrt{X_1 /X_2}\).
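The final back-substitution can be sketched as follows; the coefficient matrix \(C\) and right-hand side \(b\) of Eq. (21), built from the entries of \(\omega_i\) and \(Q_i\), are assumed given (the test feeds synthetic values, not the actual coefficients of the paper):

```python
import numpy as np

def solve_focal(C, b):
    # Solve the 2x2 linear system of Eq. (21) for X1 = eps^2 f^2 and X2 = eps^2,
    # then keep the positive roots to recover eps and f.
    X1, X2 = np.linalg.solve(C, b)
    return np.sqrt(X2), np.sqrt(X1 / X2)   # (eps_i, f_i)
```

Since \(X_1 = \varepsilon_i^2 f_i^2\) and \(X_2 = \varepsilon_i^2\), the ratio \(X_1/X_2\) isolates \(f_i^2\), so both unknowns follow directly from the linear solution.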
4.2.2 Estimation of extrinsic parameters
Formula (4) gives:
where \(r_k^i\) (\(1\le k\le 3\)) denotes the kth column of the rotation matrix \(R_i\).
As already presented in [29], the formula (22) gives:
where \(\mu _i =\Vert K_i^{-1} h_1^i\Vert ^{-1}=\Vert K_i^{-1} h_2^i\Vert ^{-1}\).
In our situation, the homography \(H_i =\left[ {{\begin{array}{ccc} {h_1^i }&{} {h_2^i }&{} {h_3^i } \\ \end{array} }} \right] \) is unknown because the scene is not planar.
Formula (9) gives:
Expressions (23), (24), (25) and (26) give:
where \(\mu _i^{\prime } =\Vert K_i^{-1} q_1^i\Vert ^{-1}=\Vert K_i^{-1} q_2^i\Vert ^{-1}\) and \(q_k^i \) (\(1\le k\le 3\)) denote the kth column of the matrix \(Q_i \).
The rotation matrix is obtained from (27), (28) and (29).
It remains to estimate the value of d to obtain the translation vector.
Expression (17) gives:
where \(\nu _i\) is a nonzero scale factor.
So, from (31) we obtain:
Then, substituting (32) for \(d\) in (30), we obtain a linear system of the form:
where \(A\in {\mathbb {R}}^{3\times 3}\).
Solving this system by singular value decomposition (SVD) yields the translation vector.
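The rotation recovery of (27)-(29) follows the planar-calibration pattern of [29]; a sketch is given below, where \(Q\) is assumed to contain (up to scale) \(K_i r_1\) and \(K_i r_2\) in its first two columns, with a synthetic \(Q\) used in place of the paper's actual matrices:

```python
import numpy as np

def rotation_from_Q(K, Q):
    # Extract r1, r2 as the normalized first two columns of K^{-1} Q,
    # complete with r3 = r1 x r2, then re-orthogonalize via SVD so the
    # result is the nearest proper rotation matrix.
    M = np.linalg.inv(K) @ Q
    r1 = M[:, 0] / np.linalg.norm(M[:, 0])
    r2 = M[:, 1] / np.linalg.norm(M[:, 1])
    r3 = np.cross(r1, r2)
    U, _, Vt = np.linalg.svd(np.column_stack([r1, r2, r3]))
    return U @ Vt
```

The SVD step compensates for the fact that, with noisy estimates of \(Q_i\), the extracted columns are not exactly orthonormal.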
4.2.3 Recovering 3D point coordinates
The coordinates of the 3D points are recovered by triangulation from the matching result between the two images \(\left\{ {I_1 ,I_2 } \right\} \) and the projection matrices defined by:
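The triangulation step can be sketched with the standard linear (DLT) construction; the projection matrices in the usage example below are synthetic stand-ins for the \(P_1, P_2\) defined above:

```python
import numpy as np

def triangulate(P1, P2, a1, a2):
    # Linear (DLT) triangulation of one match: a1 = (u1, v1) in I1,
    # a2 = (u2, v2) in I2. Each observation contributes two rows of the
    # homogeneous system A X = 0; the solution is the SVD null vector.
    A = np.array([a1[0] * P1[2] - P1[0],
                  a1[1] * P1[2] - P1[1],
                  a2[0] * P2[2] - P2[0],
                  a2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]            # dehomogenize
```

Each matched pair yields one 3D point, which is how the initial sparse structure is populated.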
4.2.4 Bundle adjustment
The previously estimated entities (intrinsic and extrinsic parameters, radial distortion coefficients and 3D point coordinates) are optimized by minimizing criterion (34) using the Levenberg–Marquardt algorithm [23, 33]:
where \(n_{1,2}\) is the number of reconstructed 3D points and
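As a toy illustration of what the adjustment minimizes, the sketch below runs Gauss-Newton on a single 3D point with the cameras held fixed; the full criterion (34) also refines \(K_i\), \(R_i\), \(t_i\) and the distortion coefficients, and uses Levenberg-Marquardt rather than plain Gauss-Newton:

```python
import numpy as np

def refine_point(cameras, obs, X0, iters=10):
    # Structure-only sketch: minimize the sum of squared reprojection errors
    # of one 3D point X over fixed 3x4 projection matrices.
    X = np.asarray(X0, dtype=float)
    for _ in range(iters):
        J, r = [], []
        for P, (u, v) in zip(cameras, obs):
            p = P @ np.append(X, 1.0)
            r += [u - p[0] / p[2], v - p[1] / p[2]]   # observed - predicted
            # analytic Jacobian of the predicted (u, v) w.r.t. X
            J.append((P[0, :3] * p[2] - p[0] * P[2, :3]) / p[2] ** 2)
            J.append((P[1, :3] * p[2] - p[1] * P[2, :3]) / p[2] ** 2)
        dX, *_ = np.linalg.lstsq(np.array(J), np.array(r), rcond=None)
        X = X + dX
    return X
```

Levenberg-Marquardt adds a damping term to the normal equations of exactly this kind of linearization, which is what makes the full adjustment robust to poor initial values.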
4.3 Inserting a new image \(I_k (3\le k\le m)\)
After inserting a new uncalibrated image, a suitable projection matrix is estimated on the basis of the three-dimensional data already retrieved. The RANSAC algorithm [21] is used for the reliable recovery of this matrix, by solving a linear system built from previously reconstructed 3D points and their projections located in the inserted image [1] (Fig. 2). Then, new 3D points are retrieved from the interest point matches between the inserted image and the previous one. The following steps are performed:
1. Localization of the projections of already reconstructed 3D points in the image \(I_{k}\).
2. Estimation of the projection matrix \(P_{k}\) from \(n_{0}\) (\(n_0 \ge 6\)) 3D points and their projections located in the image \(I_{k}\) using the RANSAC method [21].
3. Recovery of a set of 3D points from the interest point matching result between \(I_{k-1}\) and \(I_{k}\).
4. Decomposition of the projection matrix \(P_{k}\) to recover the intrinsic and extrinsic parameters.
5. Local bundle adjustment over the last \(m_{0}\) images (in our experiments \(m_0 =3\)), taking the radial distortion into account.
6. If \(k=M_0\) (\(10\le M_0 \le 20\) in our experiments), a global bundle adjustment over the \(M_{0}\) inserted images.
4.3.1 Projection matrix estimation and new 3D point recovery
After locating the projections of the already reconstructed 3D points in the image \(I_{k}\), the projection matrix \(P_{k}\) is estimated from at least six reconstructed 3D points and their projections localized in the image \(I_{k}\) [1] using the RANSAC algorithm [21]. Then, the coordinates of new 3D points are estimated by triangulation from the interest point matches between the images \(\{ I_{k-1}, I_{k}\}\) and the estimated projection matrices \(P_{k-1}\) and \(P_{k}\).
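The linear system at the core of this estimation can be sketched as follows (plain DLT on six or more 3D-2D correspondences; the RANSAC loop [21] that rejects wrongly located projections is omitted):

```python
import numpy as np

def projection_matrix_dlt(X3d, x2d):
    # Estimate the 3x4 matrix P_k from >= 6 reconstructed 3D points and their
    # projections in I_k. Each correspondence contributes two homogeneous
    # equations; the solution is the SVD null vector.
    rows = []
    for (X, Y, Z), (u, v) in zip(X3d, x2d):
        A = [X, Y, Z, 1.0]
        rows.append(A + [0.0, 0.0, 0.0, 0.0] + [-u * c for c in A])
        rows.append([0.0, 0.0, 0.0, 0.0] + A + [-v * c for c in A])
    _, _, Vt = np.linalg.svd(np.asarray(rows, float))
    return Vt[-1].reshape(3, 4)
```

Since \(P_k\) is defined only up to scale, a natural check is to reproject the input 3D points and compare against the 2D observations rather than comparing matrix entries directly.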
Algorithm 2 relies on the 3D information already estimated as well as on the interest point matches between the last three images \(\{I_{k-2} ,I_{k-1} ,I_k \}\) to estimate the projection matrix \(P_k \) corresponding to the inserted image \(I_k\).
4.3.2 Local bundle adjustment between the \(m_0 =3\) latest images
The newly estimated elements (the camera parameters corresponding to the image \(I_k \), the radial distortion coefficients and the newly reconstructed 3D points) are optimized by minimizing criterion (35) [23, 33]:
where \(n_{k-1,k} \) is the number of new 3D points reconstructed from the interest point matches between the image pair \(\{I_{k-1} , I_k \}\) and
4.3.3 Global bundle adjustment between the \(M_0 \) inserted images
A global bundle adjustment applied to all images requires a very long computation time, especially with a large number of images. In this work, we combine a local bundle adjustment after the insertion of each new image with a global bundle adjustment after the insertion of \(M_0\) images \((10\le M_0 \le 20)\) to accelerate the processing while maintaining the reliability of the system. So, after the insertion of \(M_0\) images, all estimated elements (already locally optimized) are used as an initial solution to minimize criterion (36) using the Levenberg–Marquardt algorithm [23, 33].
where n is the number of reconstructed 3D points and
5 Experiments
To validate and test the robustness of the proposed approach, several image and video sequences are used. We present the results for five image/video sequences of scenes with different sizes (vase, villa pot, Medusa head [34], castle-P30 [35] and a complex scene). All experiments are executed on an HP 650 machine with an Intel Core i3 2.30 GHz CPU and 4 GB of RAM.
5.1 Vase sequence: small scale object
This first sequence consists of thirty-two images, with a resolution of \(900 \times 1200\), taken around a small object. Three images of the sequence are presented in Fig. 3a. First, we begin with the detection and matching of interest points using the SIFT method [30]. An example of interest point matching between two images is shown in Fig. 3b. Our reconstruction system is initialized from two images with a sufficient number of matches and a large camera motion [15]. After the estimation of the camera parameters using our method, a set of 3D points is recovered from the interest point matches between these two images. A bundle adjustment, taking the radial distortion into account and using the Levenberg–Marquardt algorithm [23, 33], is applied to refine the estimated entities as well as possible. Table 1 shows the estimated values of the camera intrinsic parameters and the first two radial distortion coefficients corresponding to the first two images for the five sequences. For each newly inserted image, new 3D points are recovered and a local bundle adjustment over the last \(m_0 =3\) images is performed to refine the newly estimated entities. After the insertion of \(M_0 =14\) images, a global bundle adjustment is executed to refine all parameters and to obtain a reliable initial 3D model that is then used, together with local bundle adjustment, for the insertion of new images. Four views of the sparse 3D reconstruction are shown in Fig. 3c. Figure 3d shows the 3D surface model obtained after applying the 3D Crust algorithm [17]. The textured 3D model is presented in Fig. 3e. Table 2 shows the values of \(M_0\) and the number of reconstructed 3D points (sparse 3D reconstruction) for the five sequences.
5.2 Villa pot sequence: medium-scale scene
In this second experiment, a sequence of 28 images, with a resolution of \(750 \times 1000\), was used. Three images of the sequence are shown in Fig. 4a. Figure 4b shows the result of interest point matching between two images after removing false matches using the RANSAC algorithm [21] (353 matches are obtained). Two views of the sparse 3D reconstruction are presented in Fig. 4c. Figure 4d shows the almost dense matching result after applying the match propagation algorithm [36] (297,694 matches are obtained). Three views of the dense 3D reconstruction are presented in Fig. 4e.
5.3 Medusa head sequence: medium-scale object
In this third experiment, we tested the power of our approach on the ‘Medusa head’ video downloaded from the Marc Pollefeys page [34]. Four key frames are shown in Fig. 5a. Sparse and dense matching are shown in Fig. 5b, c respectively. Two views of the sparse 3D reconstruction (7342 3D points were reconstructed) are presented in Fig. 5d. Two views of the obtained dense 3D model (dense 3D point cloud) are presented in Fig. 5e.
5.4 Castle sequence: large-scale scene
In the previous experiments, we tested our approach on small- and medium-scale scenes. In this part, we used a sequence of 30 images, with a resolution of \(1020 \times 680\), of a large-scale scene (castle-P30 [35]). Four images of the sequence are shown in Fig. 6a. We begin with the interest point matching between the different images [30]. An example of the result obtained between two images is shown in Fig. 6b. The initialization of our system is performed from two selected images [15]. Then, the rest of the images are inserted progressively using the already estimated 3D structure and bundle adjustment. Two views of the sparse 3D reconstruction obtained after the insertion of all images are shown in Fig. 6d. To obtain a dense 3D reconstruction, we must go through dense matching between images. So, we used the match propagation method [36], which starts from the sparse matching result and searches for new matches in the vicinity of the existing ones. Figure 6c shows the almost dense matching result obtained. The dense 3D model is presented in Fig. 6e.
5.5 Complex scene sequence
In this experiment, our approach was tested on a sequence of 142 images, with a resolution of \(740 \times 565\), of a complex scene composed of objects of different sizes. Four images of the sequence are shown in Fig. 7a. An example of interest point matching between two images, after removing false matches, is shown in Fig. 7b. Two views of the sparse 3D reconstruction are presented in Fig. 7c. Figure 7d shows the matching result after applying the match propagation algorithm [36]. Three views of the dense 3D reconstruction corresponding to this matching result are presented in Fig. 7e.
As shown in the experiments, our approach yields high-quality 3D reconstruction results for different types of objects/scenes. These results also demonstrate the reliability of our formulation for the estimation of the intrinsic and extrinsic camera parameters.
To evaluate our approach, four state-of-the-art methods [5, 12, 27, 37] are used. The Pollefeys approach [5] uses structure from motion and global bundle adjustment to recover the projective 3D structure and camera motion; it requires a camera self-calibration step to pass from the projective to the metric 3D structure. The Mouragnon approach [12] is based on incremental structure from motion and local bundle adjustment for the estimation of the 3D structure and camera motion from video captured by a calibrated camera. (To adapt this approach to our situation, we used our camera self-calibration method.) The Schonberger approach [27] and VisualSFM [37, 38] are two incremental structure-from-motion systems for 3D reconstruction from unordered image collections.
Table 3 shows that our method compares favorably with the other methods [5, 12, 27, 37]. This is due to the good initialization provided by our new two-image self-calibration approach (which takes radial distortion into account) and which allows us to estimate an initial set of 3D scene points; to the incremental recovery of new 3D points (after inserting new images) based on local bundle adjustment; and to the suitable integration of global bundle adjustment. Concerning computation time (excluding matching time), our approach is close to the Mouragnon approach [12] and faster than the Pollefeys approach [5], the Schonberger approach [27] and VisualSFM [37, 38].
To test the performance of our approach against the other methods [5, 12, 27, 37] as a function of the number of images used, we used sequence 2 (the other sequences lead to similar results). The results are shown in Figs. 8 and 9.
As shown in Fig. 8, for the Mouragnon approach [12] the reprojection error grows as the number of images increases, because errors accumulate in an incremental approach based on local bundle adjustment. Conversely, for the Pollefeys approach [5] the reprojection error decreases as the number of images increases, because it is based on global bundle adjustment; with a large number of images this approach therefore becomes more stable. Our approach yields more accurate results than these two methods, close to those obtained by the Schonberger approach [27] and VisualSFM [37, 38].
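The reprojection error compared here is the standard bundle adjustment residual: the image-space distance between observed features and the projections of the estimated 3D points. A minimal sketch of its RMS form (illustrative data structures, not the paper's implementation):

```python
import numpy as np

def rms_reprojection_error(cameras, points, observations):
    """Root-mean-square reprojection error.
    cameras: list of 3x4 projection matrices
    points: Nx3 array of 3D points
    observations: list of (camera_index, point_index, (u, v)) measurements"""
    Xh = np.hstack([points, np.ones((len(points), 1))])  # homogeneous points
    sq = []
    for c, i, (u, v) in observations:
        p = cameras[c] @ Xh[i]
        sq.append((p[0] / p[2] - u) ** 2 + (p[1] / p[2] - v) ** 2)
    return float(np.sqrt(np.mean(sq)))

# A perfect observation projects exactly onto its measurement: error 0
P = np.hstack([np.eye(3), np.zeros((3, 1))])
X = np.array([[0.0, 0.0, 2.0]])
obs = [(0, 0, (0.0, 0.0))]
print(rms_reprojection_error([P], X, obs))  # 0.0
```

This is the quantity that local bundle adjustment minimizes over a sliding window of cameras, and global bundle adjustment over all cameras and points at once.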
When the number of images is between 2 and 13 (in these experiments we took \(M_0 = 13\)), our approach behaves like a global structure-from-motion system [5] but with higher precision, because it applies global bundle adjustment (GBA) with a good initialization of the parameters (those obtained by local bundle adjustment). When the number of images exceeds 13, we note that the reprojection error remains almost stable, with only a slight increase, because new images are inserted on the basis of the initial 3D structure (already optimized locally and globally) and refined by local bundle adjustment.
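The adjustment schedule described above can be sketched as a simple decision rule: local bundle adjustment after each insertion, a single global pass once \(M_0\) images are reached, and local adjustment only afterward. This is a schematic summary of the strategy, not the authors' code; \(M_0 = 13\) is the value used in these experiments.

```python
M0 = 13  # number of images covered by the initial, globally optimized model

def adjustment_for(k):
    """Which bundle adjustment to run after inserting image k (k >= 3)."""
    if k < M0:
        return "local"    # refine only the most recent cameras and points
    if k == M0:
        return "global"   # one global pass builds the reference 3D model
    return "local"        # later images are refined against that model

print([adjustment_for(k) for k in range(3, 17)])
```

Running the global pass once, on a locally pre-optimized model of moderate size, is what keeps the error curve flat without paying the cost of repeated global optimization.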
As shown in Fig. 9, our approach is faster than the Schonberger approach [27] and VisualSFM [37, 38], and much faster than the Pollefeys approach [5]. The latter applies global bundle adjustment to all estimated entities after all images have been inserted, which requires a long computation time, especially as the number of images grows, and can even cause convergence problems (the Levenberg–Marquardt algorithm [33] may fail with a poor initialization). In contrast, the proposed approach produces high-quality results in a time close to that of the Mouragnon method [12], which is based on local bundle adjustment and runs in real time.
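To illustrate the role of initialization in Levenberg–Marquardt [33], the toy example below refines a single perturbed 3D point against its reprojections using SciPy's LM solver. It is a deliberately small stand-in for full bundle adjustment (which jointly optimizes all cameras and points); the cameras and point values are synthetic.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(X, cameras, measurements):
    """Stacked reprojection residuals of one 3D point over several views."""
    r = []
    Xh = np.append(X, 1.0)
    for P, (u, v) in zip(cameras, measurements):
        p = P @ Xh
        r += [p[0] / p[2] - u, p[1] / p[2] - v]
    return np.array(r)

# Synthetic two-view setup
K = np.diag([800.0, 800.0, 1.0]); K[0, 2] = 320; K[1, 2] = 240
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
meas = []
for P in (P1, P2):
    p = P @ np.append(X_true, 1.0)
    meas.append((p[0] / p[2], p[1] / p[2]))

# A reasonable (perturbed) initialization lets LM converge back to the point
X0 = X_true + np.array([0.05, -0.04, 0.3])
res = least_squares(residuals, X0, args=([P1, P2], meas), method="lm")
print(res.x)
```

With a good starting point LM converges quickly; started far from the solution, the same minimization can stall in a poor local minimum, which is exactly why the proposed system feeds locally pre-adjusted parameters into its global pass.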
6 Conclusion
In this paper, we have proposed a complete system for 3D reconstruction from images/videos taken by a moving camera with varying parameters. Our system automatically recovers the camera parameters and obtains metric 3D reconstruction results without going through a projective 3D reconstruction. It is properly initialized from two images with a large camera motion, for which we have proposed a new method for the automatic estimation of the intrinsic and extrinsic camera parameters. Incremental 3D reconstruction systems rely on local bundle adjustment to remain fast, but the quality of their results can be degraded by error accumulation. Our system therefore introduces a global bundle adjustment after a suitable number of images has been inserted, while avoiding the use of all images so as not to face an optimization problem with a very large number of parameters, which would require considerable computation time. The resulting optimized 3D model, together with local bundle adjustment, is then used for the insertion of the remaining images. Our 3D reconstruction system is fully automatic and provides greater reliability while remaining fast.
References
El Hazzat, S., Saaidi, A., Karam, A., Satori, K.: Incremental multi-view 3D reconstruction starting from two images taken by a stereo pair of cameras. 3D Res. 6, 11 (2015). doi:10.1007/s13319-015-0041-z
Lhuillier, M., Quan, L.: A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 418–433 (2005)
Wong, S.S., Chan, K.L.: 3D object model reconstruction from image sequence based on photometric consistency in volume space. Pattern Anal. Appl. 13(4), 437–450 (2009)
Ding, L., Ding, X., Fang, C.: 3D face sparse reconstruction based on local linear fitting. Vis. Comput. 30(2), 189–200 (2014)
Pollefeys, M., Gool, L.V., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual modeling with a hand-held camera. Int. J. Comput. Vis. 59(3), 207–232 (2004)
Merras, M., El Hazzat, S., Saaidi, A., Satori, K., Nazih, A.: 3D face reconstruction using images from cameras with varying parameters. Int. J. Autom. Comput. (2016). doi:10.1007/s11633-016-0999-x
Liu, J., Li, C., Mei, F., Wang, Z.: 3D entity-based stereo matching with ground control points and joint second order smoothness prior. Vis. Comput. 31(9), 1253–1269 (2015)
Tola, E., Strecha, C., Fua, P.: Efficient large-scale multi-view stereo for ultra high-resolution image sets. Mach. Vis. Appl. 23(5), 903–920 (2012)
Vu, H.H., Labatut, P., Pons, J.P., Keriven, R.: High accuracy and visibility-consistent dense multiview stereo. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 889–901 (2012)
Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. 32(8), 1362–1376 (2010)
Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. In: SIGGRAPH Conference Proceedings, pp. 835–846 (2006)
Mouragnon, E., Lhuillier, M., Dhome, M., Dekeyser, F., Sayd, P.: Generic and real-time structure from motion using local bundle adjustment. Image Vis. Comput. 27(8), 1178–1193 (2009)
Fuhrmann, S., Langguth, F., Moehrle, N., Waechter, M., Goesele, M.: MVE—an image-based reconstruction environment. Comput. Gr. 53, 44–53 (2015). Part A
El Akkad, N., El Hazzat, S., Saaidi, A., Satori, K.: Reconstruction of 3D scenes by camera self-calibration and using genetic algorithms. 3D Res. 7, 6 (2016). doi:10.1007/s13319-016-0082-y
Wang, G., Wu, Q.M.J.: Perspective 3-D Euclidean reconstruction with varying camera parameters. IEEE Trans. Circuits Syst. Video Technol. 19(12), 1793–1803 (2009)
Strecha, C., Tuytelaars, T., Van Gool, L.: Dense matching of multiple wide-baseline views. In: Proceedings of the International Conference on Computer Vision, pp. 1194–1201 (2003)
Amenta, N., Bern, M., Kamvysselis, M.: A new Voronoi-based surface reconstruction algorithm. In: Proceedings of SIGGRAPH’98, pp. 415–421 (1998)
Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proceedings of Eurographics Symposium on Geometry Processing, pp. 61–70 (2006)
Lim, H., Lim, J., Kim, H. J.: Real-time 6-DOF monocular visual SLAM in a large-scale environment. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 1532–1539 (2014)
Kerl, C., Sturm, J., Cremers, D.: Dense visual slam for RGB-D cameras. In: International Conference on Intelligent Robots and Systems (IROS), pp. 2100–2106 (2013)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Wu, C., Agarwal, S., Curless, B., Seitz. S. M.: Multicore bundle adjustment. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3057–3064 (2011)
Lourakis, M.A., Argyros, A.: SBA: a software package for generic sparse bundle adjustment. ACM Trans. Math. Softw. 36(1), 1–30 (2009)
Brown, M., Lowe, D. G.: Unsupervised 3D object recognition and reconstruction in unordered datasets. In: Proceedings of the International Conference on 3D Digital Imaging and Modelling, pp. 56–63 (2005)
Tran, S., Davis, L.: 3D surface reconstruction using graph cuts with surface constraints. In: Proceedings of the European Conference on Computer Vision, pp. 219–231 (2006)
Snavely, N., Seitz, S., Szeliski, R.: Modeling the world from internet photo collections. Int. J. Comput. Vis. 80(2), 189–210 (2008)
Schonberger, J. L., Frahm, J.-M.: Structure-from-motion revisited. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Wu, C.: Critical configurations for radial distortion self-calibration. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 25–32 (2014)
Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Wu, J., Cui, Z., Sheng, V.S., Zhao, P., Su, D., Gong, S.: A comparative study of sift and its variants. Meas. Sci. Rev. 13(3), 122–131 (2013)
Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
Moré, J.J.: The Levenberg–Marquardt algorithm: implementation and theory. In: Watson, G.A. (ed.) Numerical Analysis, Lecture Notes in Mathematics, vol. 630, pp. 105–116. Springer, Berlin (1977)
Strecha, C., Von Hansen, W., Van Gool, L., Fua, P., Thoennessen, U.: On benchmarking camera calibration and multi-view stereo for high resolution imagery. In: CVPR, pp. 1–8 (2008)
Lhuillier, M., Quan, L.: Match propagation for image-based modeling and rendering. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1140–1146 (2002)
Wu, C.: Towards linear-time incremental structure from motion. In: International Conference on 3D Vision (3DV), pp. 127–134 (2013)
El Hazzat, S., Merras, M., El Akkad, N. et al. 3D reconstruction system based on incremental structure from motion using a camera with varying parameters. Vis Comput 34, 1443–1460 (2018). https://doi.org/10.1007/s00371-017-1451-0