1 Introduction

3D reconstruction from images/video has been an important and widely studied subject in recent decades. Its goal is to recover 3D information from 2D images taken from different viewpoints or from video.

Several approaches [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] have been proposed to solve this problem. Some of them are based on point matching between different images. The structure from motion approach [5, 11, 12, 15] allows automatic recovery of the 3D structure and the camera motion. It is based on the detection and matching of interest points between different images. The matched points, together with other estimated geometric entities (epipolar geometry), are used to recover a projective (for uncalibrated images) and sparse representation of the scene. To move to a metric/Euclidean representation, a camera self-calibration step that recovers the intrinsic parameters is necessary. The reconstructed sparse 3D point cloud does not properly define the shape of objects, so dense matching methods should be used to obtain denser results that are closer to reality [16]. Delaunay triangulation, the Crust method [17] and Poisson surface reconstruction [18] are methods used to convert the obtained 3D point cloud into a triangulated surface model. The approaches based on multi-view stereo [7,8,9,10] are often used to obtain high-quality dense 3D reconstruction results, but they require calibrated stereo images as input as well as a long computation time. In robotics, simultaneous localization and mapping (SLAM) [19, 20] simultaneously performs robot localization and environment map construction using data retrieved from sensors, which may include cameras.

In this work, we propose a complete 3D reconstruction system from uncalibrated image/video sequences. Our 3D reconstruction system is able to produce very realistic three-dimensional models using a single camera. The camera intrinsic parameters may vary, the displacements of the camera are free, and the reconstruction environment is uncontrolled. All these factors offer more flexibility and generality for the 3D reconstruction of any objects/scenes (multi-scale objects/scenes). Our 3D reconstruction system is based on incremental structure from motion. It is initialized from two images with a sufficient number of matches and a large camera motion [15]. In this context, we have proposed a new method for automatic recovery of the intrinsic and extrinsic camera parameters that correspond to these two images. The 3D structure of the object/scene is initialized by triangulating the matched/tracked interest points between these two images. Since the quality of the initialization affects the whole system, bundle adjustment is applied to refine the estimated entities. The projection matrix of each newly inserted image is estimated after locating the projections of the already reconstructed 3D points in the inserted image. This estimation relies on the RANSAC algorithm [21] and on solving a linear system built from the already reconstructed 3D points and their projections located in the inserted image. Then, new 3D points are recovered from the interest point matching result between the inserted image and the preceding one, and a local bundle adjustment is performed to refine the newly estimated entities. Local optimization alone does not guarantee the accuracy of the obtained solutions. Therefore, our system integrates a global bundle adjustment after the insertion of \(M_{0}\) images (in our experiments \(10\le M_0 \le 20\)) to obtain an initial 3D model that is then used as a reference to insert the rest of the images and obtain a more complete final 3D model. To obtain a surface model, the Poisson surface algorithm [18] or the 3D Crust method [17] can be applied to the obtained 3D point cloud. Finally, texture mapping provides realistic results.

The good exploitation of existing techniques (incremental structure from motion [12], bundle adjustment [5, 12, 22, 23], ...) and our own vision of the problem (camera self-calibration from only two images, the use of a camera with varying intrinsic parameters, local and global treatment of the problem, ...) allow us to propose a system that is fully automatic, flexible and able to reconstruct multi-scale three-dimensional models (small, medium and large) within a reasonable computation time. In contrast, there are systems [5, 24] based on the global bundle adjustment of all estimated entities, which require a fairly long computation time and a very careful parameter initialization step to avoid falling into local optima. Other systems [11, 12] require prior information on the camera parameters to perform a 3D reconstruction of the scene.

This paper is organized as follows. Section 2 presents related work. Section 3 describes the notations and background. The proposed method is described in Sect. 4. The experiments and the comparison of our method with other methods are presented in Sect. 5. Finally, the conclusion is presented in Sect. 6.

2 Related work

Several methods have been proposed to solve the problem of 3D reconstruction from images/video. The methods based on multi-view stereo [8, 10, 25] produce satisfactory results with high density. Tran and Davis [25] presented a graph-cut method to recover the 3D object surface using silhouettes and foreground color information. In [8], the authors presented a new method for large-scale multi-view stereo based on dense matching between very high-resolution images. It produces a very dense, high-quality 3D point cloud at a relatively low computational cost. However, it requires richly textured images to avoid resorting to costly optimization algorithms. Furukawa and Ponce [10] also proposed a method to solve the multi-view stereo problem. It consists in retrieving an initial set of patches covering the surface of the object/scene from the matching result of key points detected by the Harris and DoG operators. The final patches are obtained by iterating between an expansion step, to obtain a dense set of patches, and a filtering step based on the visibility constraint to eliminate false matches. Finally, the resulting patch model is converted into a polygonal mesh, which can be refined by applying photometric consistency and regularization constraints. All these methods provide dense three-dimensional models of good quality, but they require cameras with known parameters (they start from calibrated stereo images) and they are expensive in terms of computation time.

Structure from motion methods [5, 11, 12, 26] allow the automatic recovery of both the three-dimensional structure and the camera motion from uncalibrated image sequences. These methods are based on the detection and matching of interest points between different images. Pollefeys et al. [5] presented a complete system for three-dimensional modeling from uncalibrated images. First, the structure from motion approach is used to recover the projective 3D structure and the camera motion. The different estimated entities are then refined by global bundle adjustment. To pass to a metric 3D reconstruction, they go through a camera self-calibration phase based on the absolute conic. Finally, pairs of images are rectified and multi-view stereo matching is used to obtain a dense 3D reconstruction. However, the global bundle adjustment requires a very long computation time, especially when a large number of images is used, and can converge to a local solution due to a bad initialization. Snavely et al. [11, 26] presented a system for 3D reconstruction from large photo collections. It is based on the incremental structure from motion approach to simultaneously recover the camera motion and a sparse 3D reconstruction of the scene. However, it requires prior information to initialize the camera parameters (the use of EXIF tags) and a long computation time, especially as the number of images increases, dominated by the bundle adjustment performed after the insertion of each new image. In [27], the authors presented a new incremental structure from motion technique based on a geometric verification strategy, next-best-view selection and a robust triangulation method. Fuhrmann et al. [13] presented a 3D reconstruction system for multi-scale scenes from images. It is based on structure from motion, multi-view stereo depth maps and surface reconstruction. Mouragnon et al. [12] proposed a method for real-time estimation of motion and 3D structure from video captured by a calibrated camera. The proposed method is based on local bundle adjustment to refine the camera poses and the 3D structure. However, this method requires a camera whose intrinsic parameters are known and remain unchanged during image acquisition. Moreover, the quality of the 3D reconstruction is not guaranteed because errors accumulate as the number of images increases.

3 Notation and background

3.1 Pinhole camera model

The pinhole camera model is used to project a scene 3D point \(A_j =\left( {X_j ,Y_j ,Z_j ,1} \right) ^\mathrm{T}\) onto the image point \(a_{ij} =\left( {u_{ij} ,v_{ij} ,1} \right) ^\mathrm{T}\). This projection is represented by the following formula:

$$\begin{aligned} {\uplambda }_{ij} a_{ij} =P_i A_j \end{aligned}$$
(1)

where \({\uplambda }_{ij} \) is a nonzero scale factor, \(P_i =K_i \left[ {{\begin{array}{cc} {R_i }&{} {t_i } \\ \end{array} }} \right] \) is a \(3\times 4\) projection matrix, \(t_i \) is a translation vector and \(R_i \) is a \(3\times 3\) rotation matrix defined by:

$$\begin{aligned} R_i= & {} \left( {{\begin{array}{ccc} 1&{}\quad 0&{}\quad 0 \\ 0&{}\quad {\cos \alpha _i }&{}\quad {-\sin \alpha _i } \\ 0&{}\quad {\sin \alpha _i }&{}\quad {\cos \alpha _i } \\ \end{array} }} \right) \left( {{\begin{array}{ccc} {\cos \beta _i }&{}\quad 0&{}\quad {\sin \beta _i } \\ 0&{}\quad 1&{}\quad 0 \\ {-\sin \beta _i }&{}\quad 0&{}\quad {\cos \beta _i } \\ \end{array} }} \right) \\&\times \left( {{\begin{array}{ccc} {\cos \gamma _i }&{} \quad {-\sin \gamma _i }&{}\quad 0 \\ {\sin \gamma _i }&{}\quad {\cos \gamma _i }&{}\quad 0 \\ 0&{}\quad 0&{}\quad 1 \\ \end{array} }} \right) \end{aligned}$$

where \(\alpha _i \), \(\beta _i \) and \(\gamma _i \) represent the three Euler angles.

\(K_i \) is the intrinsic parameter matrix defined by:

$$\begin{aligned} K_i =\left( {{\begin{array}{ccc} {f_i }&{}\quad {s_i }&{}\quad {u_{0i} } \\ 0&{}\quad {\varepsilon _i f_i }&{}\quad {v_{0i} } \\ 0&{}\quad 0&{}\quad 1 \\ \end{array} }} \right) \end{aligned}$$

where \(f_i\) is the focal length, \(\varepsilon _i\) is the scale factor, \(s_i \) is the skew factor and \(\left( {u_{{0i}} ,v_{{0i}} } \right) \) are the coordinates of the principal point.
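For illustration, the projection model (1) can be implemented, for instance with NumPy, as in the following sketch (the rotation matrix is built from the three Euler angles above; the numerical values of \(K\), the angles and \(t\) are arbitrary and serve only as an example):

```python
import numpy as np

def euler_to_R(alpha, beta, gamma):
    """Rotation matrix R_i from the three Euler angles (X, Y, Z factors as above)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def project(K, R, t, A):
    """Project the homogeneous 3D point A = (X, Y, Z, 1)^T with P = K [R | t]."""
    P = K @ np.hstack([R, t.reshape(3, 1)])   # 3x4 projection matrix
    a = P @ A
    return a / a[2]                           # remove the scale factor lambda_ij

# Example with arbitrary (illustrative) parameters
K = np.array([[1000.0, 0.0, 450.0],
              [0.0, 1000.0, 600.0],
              [0.0, 0.0, 1.0]])
print(project(K, euler_to_R(0.1, 0.05, 0.0), np.array([0.2, 0.0, 2.0]),
              np.array([0.1, 0.2, 1.0, 1.0])))
```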

3.2 Estimation of distortion coefficients

The distortion effect affects the quality of the 3D reconstruction [28]. In this work, we consider the first two coefficients of radial distortion \(k_{1}\) and \(k_{2}\) in order to obtain more accurate results.

The relationship between the distorted image points \((u_d ,v_d )\) and the undistorted image points (uv) is defined by [29]:

$$\begin{aligned} \left\{ {{\begin{array}{c} {u_d =u+\left( {u-u_{0i} } \right) \left( k_1 \left( {x^{2}+y^{2}} \right) +k_2 \left( x^{2}+y^{2}\right) ^{2} \right) } \\ v_d =v+\left( {v-v_{0i} } \right) \left( k_1 \left( {x^{2}+y^{2}} \right) +k_2 \left( x^{2}+y^{2}\right) ^{2} \right) \\ \end{array} }} \right. \end{aligned}$$

where \(\left( {u_{0i} ,v_{0i} } \right) \) are the coordinates of the principal point that correspond to the ith image and \(\left( {x,y,1} \right) ^\mathrm{T}=K_i^{-1} \left( {u,v,1} \right) ^\mathrm{T}\).

So, for each image point we have the following system:

$$\begin{aligned}&\left[ {{\begin{array}{cc} {\left( {u-u_{0i} } \right) \left( {x^{2}+y^{2}} \right) }&{}\quad {\left( {u-u_{0i} } \right) \left( {x^{2}+y^{2}} \right) ^{2}} \\ {\left( {v-v_{0i} } \right) \left( {x^{2}+y^{2}} \right) }&{}\quad {\left( {v-v_{0i} } \right) \left( {x^{2}+y^{2}} \right) ^{2}} \\ \end{array} }} \right] \left[ {{\begin{array}{c} {k_1 } \\ {k_2 } \\ \end{array} }} \right] \\&\quad =\left[ {{\begin{array}{c} {u_d -u} \\ {v_d -v} \\ \end{array} }} \right] \end{aligned}$$
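This system can be stacked over all available point correspondences and solved in the least-squares sense. A possible NumPy sketch is given below (the function name and the array layout of the inputs are our own choices for the example):

```python
import numpy as np

def estimate_radial_distortion(K, undistorted, distorted):
    """Estimate (k1, k2) from (N, 2) arrays of undistorted points (u, v)
    and distorted points (u_d, v_d), with K the intrinsic matrix."""
    u0, v0 = K[0, 2], K[1, 2]
    K_inv = np.linalg.inv(K)
    rows, rhs = [], []
    for (u, v), (ud, vd) in zip(undistorted, distorted):
        x, y, _ = K_inv @ np.array([u, v, 1.0])   # normalized coordinates
        r2 = x * x + y * y
        rows.append([(u - u0) * r2, (u - u0) * r2 ** 2])
        rows.append([(v - v0) * r2, (v - v0) * r2 ** 2])
        rhs += [ud - u, vd - v]
    k, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return k   # [k1, k2]
```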

3.3 Homography between two images

The homography between two images \(I_{i}\) and \(I_{j}\) is represented by a \(3\times 3\) matrix denoted \(H_{ij} \). For each point \(a_{ik} \) of the image \(I_{i}\) and its corresponding \(a_{jk} \) in the image \(I_{j}\), we have the following relationship:

$$\begin{aligned} a_{jk} \sim H_{ij} a_{ik} \end{aligned}$$

Four matches, of which no three points are collinear, are sufficient to estimate this matrix. The use of the RANSAC algorithm [21] provides a reliable solution.
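In practice, this estimation can be carried out, for example, with OpenCV's RANSAC-based homography estimator (a sketch; the threshold value is illustrative):

```python
import cv2
import numpy as np

def homography_ransac(pts_i, pts_j, reproj_thresh=3.0):
    """pts_i, pts_j: (N, 2) arrays of matched points between images I_i and I_j."""
    H, mask = cv2.findHomography(pts_i.astype(np.float32),
                                 pts_j.astype(np.float32),
                                 cv2.RANSAC, reproj_thresh)
    return H, mask.ravel().astype(bool)   # 3x3 homography and inlier mask
```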

3.4 Selection of two images with a large displacement

In this work, we use the criterion presented in [15] to select two images with a large displacement.

Let \(I_r \) be the reference image. The disparity matrix is defined as follows:

$$\begin{aligned} D=\left[ {{\begin{array}{ccc} \Vert a_{11} -a_{r1}\Vert _F &{}\quad \cdots &{}\quad \Vert a_{1n} -a_{rn}\Vert _F \\ \vdots &{}\quad \ddots &{}\quad \vdots \\ \Vert a_{m1} -a_{r1}\Vert _F &{}\quad \cdots &{}\quad \Vert a_{mn} -a_{rn}\Vert _F \\ \end{array} }} \right] \end{aligned}$$

where m is the number of images, n is the number of matches, \(\Vert a_{ij} -a_{rj}\Vert _F\) is the disparity between the two points and \((a_{ij} ,a_{rj} )\) is the jth pair of matched points between \(I_i\) and \(I_r \).

The image \(I_{r'}\) that corresponds to a large camera motion relative to the reference image \(I_r\) is obtained by the use of the following formula:

$$\begin{aligned} r'=\arg \max \left( \frac{D_{M}\odot D_{S}}{\Vert D_{M}\Vert _{F}\Vert D_{S}\Vert _{F}}\right) \end{aligned}$$

where \(D_M =\left[ {{\begin{array}{c} {e_1 } \\ \vdots \\ {e_m } \\ \end{array} }} \right] \) is a vector which represents the mean of each row of D; \(D_S =\left[ {{\begin{array}{c} {s_1 } \\ \vdots \\ {s_m } \\ \end{array} }} \right] \) is a vector which represents the standard deviation of each row of D; \(\odot \) denotes the element-by-element multiplication.
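A minimal NumPy sketch of this selection criterion is given below; the input `tracks` is assumed to be an (m, n, 2) array containing, for each image i and each track j, the image point \(a_{ij}\):

```python
import numpy as np

def select_far_image(tracks, r):
    """Return r', the index of the image with the largest motion w.r.t. I_r."""
    D = np.linalg.norm(tracks - tracks[r], axis=2)   # (m, n) disparity matrix
    D_M = D.mean(axis=1)                             # mean of each row of D
    D_S = D.std(axis=1)                              # standard deviation of each row
    score = (D_M * D_S) / (np.linalg.norm(D_M) * np.linalg.norm(D_S))
    score[r] = -np.inf                               # exclude the reference image itself
    return int(np.argmax(score))
```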

4 Proposed method

We present an incremental 3D reconstruction system based on the structure from motion approach and a careful exploitation of bundle adjustment. It takes as input uncalibrated images/videos captured by a camera with variable parameters. As output, it determines the camera parameters (intrinsic and extrinsic) and the three-dimensional structure. Our system offers the flexibility to adapt to and reconstruct multi-scale objects/scenes (small, medium and large) in a reasonable time compared to methods applied in real time [12].

As is well known, the structure from motion approach applied to uncalibrated images only recovers a projective 3D reconstruction. To obtain a metric/Euclidean 3D reconstruction, a camera self-calibration phase is required. In this work, we propose a system that can directly retrieve the metric structure of the 3D scene. First, our system is initialized from two images with a sufficient number of matched interest points and a large movement of the camera [15]. In this context, we have proposed a new camera self-calibration method from two images, which allows us to retrieve the coordinates of a set of 3D scene points corresponding to the matched image points. After each insertion of a new uncalibrated image, the camera parameters are retrieved based on the previously estimated 3D structure, and new 3D points are recovered from the interest point matching between the inserted image and the preceding one. Our 3D reconstruction system is realized in three main steps: detection and matching/tracking of interest points between the different images, initialization of the reconstruction system from two images with a large displacement [15], and incremental insertion of new images. The global algorithm of our 3D reconstruction system is presented as follows:

[Algorithm 1]

Algorithm 1 allows, from m (\(m\ge 2\)) input images, to gradually recover the metric 3D structure and the camera parameters (intrinsic and extrinsic) that correspond to each image. The value of \(m_0\) is initialized to 3 because the local bundle adjustment is applied to the last three images. This provides a reliable initial solution for the global bundle adjustment (GBA), which is applied after the insertion of \(M_0\) images to obtain an initial 3D model of good quality. The value of \(M_0\) is selected between 10 and 20 because applying the GBA to a large number of images requires a long computation time, while using too small a number of images does not yield a reliable initial 3D model. When inserting the remaining images \(\{I_k \}_{M_0 <k\le m}\), the value of \(m_0\) can be taken greater than 3 to increase the system reliability. The initialization of our system from two images is a crucial phase. To ensure stability and reliability, we select two images with a sufficient number of matched points and a large movement of the camera [15]. The selected images ensure the stability of the epipolar geometry computation (estimation of the fundamental matrix). This matrix will be used later to estimate the camera parameters.

4.1 Interest point detection and matching/tracking

In this work, we have chosen the SIFT algorithm [30] for interest point matching between the different images because of its robustness to scale changes compared to other methods [31]. For the elimination of false matches and the estimation of the fundamental matrix, the RANSAC algorithm is used [21].
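As an illustration, this step can be implemented with OpenCV as sketched below (the ratio test threshold and the RANSAC parameters are usual values, not those of our experiments):

```python
import cv2
import numpy as np

def match_and_filter(img1, img2, ratio=0.75):
    """SIFT matching with the ratio test, then RANSAC on the fundamental matrix."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    knn = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    inliers = mask.ravel().astype(bool)
    return pts1[inliers], pts2[inliers], F   # filtered matches and fundamental matrix
```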

4.2 Initialization of 3D reconstruction system from two images

The initialization of our 3D reconstruction system is performed from two selected images with a sufficient number of matched points and a large displacement of the camera [15] in order to stabilize the calculations. It consists in:

1. Camera self-calibration: estimation of intrinsic parameters from two images.
2. Estimation of extrinsic parameters.
3. Retrieving a set of 3D points from matched interest points.
4. Bundle adjustment taking into account the radial distortion.

4.2.1 Self-calibration

In this step, we present a new formulation of the self-calibration problem based on a suitable choice of the global reference frame and on the use of planar calibration/self-calibration concepts [29]. This formulation yields a linear system which, assuming that the principal point is at the center of the image and that the skew factor is equal to zero, allows the scale factor and the focal length to be determined. This formulation also allows us to automatically estimate the camera extrinsic parameters.

Let \(A_1\) and \(A_2\) be two unknown points of the 3D scene such that \(a_{i1}\) and \(a_{i2} \) are, respectively, their projections in the image \(I_i\), with \(1\le i\le 2\). We define a Euclidean reference frame (O, X, Y, Z) such that O is the midpoint of the segment [\(A_1 A_2 ]\) and the two points \(A_1 \) and \(A_2 \) belong to the plane \(\hbox {Z} = 0\) (plane \(\hbox {OXY})\) (see Fig. 1).

In this reference:

$$\begin{aligned} A_1= & {} \left( {d\cos \theta ,d\sin \theta ,0,1} \right) ^\mathrm{T}\\ A_2= & {} \left( {-d\cos \theta ,-d\sin \theta ,0,1} \right) ^\mathrm{T} \end{aligned}$$

where \(d=\Vert A_1 A_2\Vert /2\) is half the distance between \(A_1\) and \(A_2\), and \(\theta \) is the angle between the line \((A_{1}A_{2})\) and the X-axis of the reference frame.

Fig. 1: Different entities used for automatic estimation of camera parameters

To simplify the calculations, we can choose the global reference frame such that \(\theta =\pi /3\) (other values can be selected). So, we obtain:

$$\begin{aligned} A_1= & {} \left( {d/2,{\sqrt{3}d}/2,0,1} \right) ^\mathrm{T}\quad \hbox {and}\\ A_2= & {} \left( {-\,d/2,-{\sqrt{3}d}/2,0,1} \right) ^\mathrm{T} \end{aligned}$$

We consider the Euclidean reference (O, X, Y).

In this reference:

$$\begin{aligned} A_1= & {} \left( {d/2,{\sqrt{3}d}/2,1} \right) ^\mathrm{T} \end{aligned}$$
(2)
$$\begin{aligned} A_2= & {} \left( {-d/2,-{\sqrt{3}d}/2,1} \right) ^\mathrm{T} \end{aligned}$$
(3)

The projection of the plane \(Z=0\) in the image plane \(I_i\) is defined by the homography \(H_i \) as:

$$\begin{aligned} H_i \sim K_i R_i \left( {{\begin{array}{ccc} 1 &{}\quad 0 &{}\\ 0 &{}\quad 1 &{}\quad {R_i^\mathrm{T}}{t_i} \\ 0 &{}\quad 0 &{}\\ \end{array}}}\right) ,\quad i=1,2 \end{aligned}$$
(4)

Since \(A_1 \in \left( {O,X,Y} \right) \) and \(A_2 \in \left( {O,X,Y} \right) \), we can write:

$$\begin{aligned} a_{ij} \sim H_i A_j,\quad i=1,2\quad \hbox {and}\quad j=1,2 \end{aligned}$$
(5)

Expressions (2) and (3) can be written in the form:

$$\begin{aligned} A_1= & {} \left( {{\begin{array}{ccc} d&{}\quad 0&{}\quad 0 \\ 0&{}\quad d&{}\quad 0 \\ 0&{}\quad 0&{}\quad 1 \\ \end{array} }} \right) \left( {{\begin{array}{c} {1/2} \\ {{\sqrt{3}}/2} \\ 1 \\ \end{array} }} \right) \end{aligned}$$
(6)
$$\begin{aligned} A_2= & {} \left( {{\begin{array}{ccc} d&{}\quad 0&{}\quad 0 \\ 0&{}\quad d&{}\quad 0 \\ 0&{}\quad 0&{}\quad 1 \\ \end{array} }} \right) \left( {{\begin{array}{c} {-1/2} \\ {-{\sqrt{3}}/2} \\ 1 \\ \end{array} }} \right) \end{aligned}$$
(7)

We put:

$$\begin{aligned}&B=\left( {{\begin{array}{ccc} d&{}\quad 0&{}\quad 0 \\ 0&{}\quad d&{}\quad 0 \\ 0&{}\quad 0&{}\quad 1 \\ \end{array} }} \right) , \quad A_1^{\prime } =\left( {{\begin{array}{c} {1/2} \\ {{\sqrt{3}}/2} \\ 1 \\ \end{array} }} \right) \quad \hbox {and}\nonumber \\&\quad A_2^{\prime } =\left( {{\begin{array}{c} {-1/2} \\ {{-\sqrt{3}}/2} \\ 1 \\ \end{array} }} \right) \end{aligned}$$

Expression (5) can then be written as follows:

$$\begin{aligned} a_{ij} \sim H_i BA_j^{\prime } ,\quad i=1,2\quad \hbox {and}\quad j=1,2 \end{aligned}$$
(8)

We put:

$$\begin{aligned} Q_i =\left( {{\begin{array}{ccc} {Q_{i00} }&{}\quad {Q_{i01} }&{}\quad {Q_{i02} } \\ {Q_{i10} }&{}\quad {Q_{i11} }&{}\quad {Q_{i12} } \\ {Q_{i20} }&{}\quad {Q_{i21} }&{}\quad {Q_{i22} } \\ \end{array} }} \right) =H_i B,\quad i=1,2 \end{aligned}$$
(9)

Expression (8) can be written as follows:

$$\begin{aligned} a_{ij} \sim Q_i A_j^{\prime } ,\quad i=1,2\quad \hbox {and}\quad j=1,2 \end{aligned}$$
(10)

\(Q_i\) is a \(3\times 3\) matrix that projects the point \(A_j^{\prime } \) into the image \(I_i \).

At first, we want to recover the projection matrices \(\{Q_i \}_{1\le i\le 2} \).

From (9), we can write:

$$\begin{aligned}&B=H_1^{-1} Q_1\quad \hbox {and}\quad Q_2 =H_2 H_1^{-1} Q_1 \nonumber \\&Q_2 =H_{12} Q_1 \end{aligned}$$
(11)

where \(H_{12} = H_2 H_1^{-1}\) is the homography between the images \(I_1\) and \(I_2\).

In (10), we replace \(Q_2\) by its formula (11) and we obtain:

$$\begin{aligned} a_{2j} \sim H_{12} Q_1 A_j^{\prime } , \,\, j=1,2 \end{aligned}$$
(12)

From [32], we know that:

$$\begin{aligned} F_{12} \sim [e_2 ]_\times H_{12} \end{aligned}$$
(13)

where \([e_2 ]_\times =\left( {{\begin{array}{ccc} 0&{}\quad {-e_{2_z} }&{}\quad {e_{2_y} } \\ {e_{2_z} }&{}\quad 0&{}\quad {-e_{2_x} } \\ {-e_{2_y} }&{}\quad {e_{2_x} }&{}\quad 0 \\ \end{array} }} \right) \) is the antisymmetric matrix associated with the epipole \(e_2 =\left( {e_{2_x} ,e_{2_y} ,e_{2_z} } \right) ^\mathrm{T}\) of the image \(I_2\).

Then, formulas (12) and (13) give:

$$\begin{aligned}{}[e_2 ]_\times a_{2j} \sim F_{12} Q_1 A_j^{\prime } , \quad j=1,2 \end{aligned}$$
(14)

From formulas (10) and (14), a system of eight linear equations in the elements of \(Q_1\) is obtained. Solving this system allows the estimation of the matrix \(Q_1\).

Expressions (11) and (13) give:

$$\begin{aligned}{}[e_2 ]_\times Q_2 \sim F_{12} Q_1 \end{aligned}$$
(15)

The matrix \(Q_2\) is estimated from Eq. (15).

Now, the estimated projection matrices will be used to recover the intrinsic parameters. Formulas (4) and (9) give:

$$\begin{aligned} K^{-1}Q_i \sim R_i \left( {{\begin{array}{ccc} 1&{}\quad 0&{} \\ 0&{}\quad 1&{}\quad {R_i^\mathrm{T} } {t_i } \\ 0&{}\quad 0&{} \\ \end{array}}} \right) B \end{aligned}$$
(16)

The previous formula gives:

$$\begin{aligned} Q_i^\mathrm{T} \omega _i Q_i \sim \left( {{\begin{array}{ccc} d&{}\quad 0&{}\quad {0} \\ 0&{}\quad d&{}\quad {0} \\ &{}\quad {t_i^\mathrm{T} R_i }&{}\quad \\ \end{array} }} \right) \left( {{\begin{array}{ccc} d&{}\quad 0 &{}\\ 0&{}\quad d &{}\quad {R_i^\mathrm{T} t_i }\\ 0&{}\quad 0 &{}\\ \end{array}}} \right) \end{aligned}$$
(17)

where \(\omega _i =\left( {{\begin{array}{ccc} {\omega _{i00} }&{}\quad {\omega _{i01} }&{}\quad {\omega _{i02} } \\ {\omega _{i10} }&{}\quad {\omega _{i11} }&{}\quad {\omega _{i12} } \\ {\omega _{i20} }&{}\quad {\omega _{i21} }&{}\quad {\omega _{i22} } \\ \end{array} }} \right) =\left( {K_i K_i^\mathrm{T} } \right) ^{-1}\) is the image of the absolute conic.

$$\begin{aligned} \omega _{i00}= & {} \frac{1}{f_i^2 },\quad \omega _{i01} =\omega _{i10} =-\frac{s_i }{\varepsilon _i f_i^3 }, \\ \omega _{i02}= & {} \omega _{i20} =\frac{u_{0i} s_i -\,\varepsilon _i u_{0i} f_i }{\varepsilon _i f_i^{3}},\\ \omega _{i11}= & {} \frac{s_i^2 }{\varepsilon _i^2 f_i^4 }+\frac{1}{\varepsilon _i^2 f_i^2 },\\ \omega _{i12}= & {} \omega _{i21} =-\frac{s_i \left( {v_{0i} s_i -u_{0i} \varepsilon _i f_i } \right) }{\varepsilon _i^2 f_i^4 }-\frac{v_{0i} }{\varepsilon _i^2 f_i^2 },\\ \omega _{i22}= & {} \frac{\left( {v_{0i} s_i -u_{0i} \varepsilon _i f_i } \right) ^{2}}{\varepsilon _i^2 f_i^4 }+\frac{v_{0i}^2 }{\varepsilon _i^2 f_i^2 }+1. \end{aligned}$$

We put \({B}'=\left( {\begin{array}{cc} d&{}\quad 0\\ 0&{}\quad d\\ 0&{}\quad 0\\ \end{array}}\right) \).

Formula (17) gives:

$$\begin{aligned}&Q_i^\mathrm{T} \omega _i Q_i \sim \left( {{\begin{array}{cc} B^{\prime T}{B'}&{}\quad B^{{\prime }T}R_i^\mathrm{T} t_i \\ t_i^\mathrm{T} R_i {B'}&{}\quad t_i^\mathrm{T} t_i \\ \end{array}}}\right) \\&B^{{\prime }T}{B'}=\left( {\begin{array}{cc} d^{2}&{}\quad 0\\ 0&{}\quad d^{2}\\ \end{array}}\right) \nonumber \end{aligned}$$
(18)

Formula (18) gives:

$$\begin{aligned} \left( {{\begin{array}{cc} {\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{00} }&{}\quad {\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{01} } \\ {\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{10} }&{}\quad {\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{11} } \\ \end{array} }} \right) \sim \left( {{\begin{array}{cc} {d^{2}}&{}\quad 0 \\ 0&{}\quad {d^{2}} \\ \end{array} }} \right) ,\quad i=1,2. \end{aligned}$$
(19)

From (19), the following equation system is obtained:

$$\begin{aligned} \left\{ {{\begin{array}{l} {\left( Q_i^\mathrm{T} \omega _i Q_i\right) _{00} =\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{11} } \\ {\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{01} =\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{10} =0} \\ \end{array} }} \right. \end{aligned}$$
(20)

Assuming that the principal point \((u_{0i}, v_{0i} )\) is at the center of the image and that \(s_i =0\), \(\varepsilon _i \) and \(f_i \) can be determined.

Expression (20) gives:

$$\begin{aligned} \left\{ {{\begin{array}{c} {\alpha _1 \varepsilon _i^2 f_i^2 +\alpha _2 \varepsilon _i^2 +\alpha _3 =0} \\ {\beta _1 \varepsilon _i^2 f_i^2 +\beta _2 \varepsilon _i^2 +\beta _3 =0} \\ \end{array} }} \right. \end{aligned}$$
(21)

where:

$$\begin{aligned} \alpha _1= & {} Q_{i20}^2 -Q_{i21}^2 ,\\ \alpha _2= & {} Q_{i00}^2 -Q_{i01}^2 -2\left( {Q_{i00} Q_{i20} +Q_{i01} Q_{i21} } \right) u_{0i} \\&+\left( {Q_{i20}^2 -Q_{i21}^2 }\right) u_{0i}^2 ,\\ \alpha _3= & {} Q_{i10}^2 -Q_{i11}^2 -2(Q_{i10} Q_{i20} +Q_{i11} Q_{i21})v_{0i} \\&+\left( {Q_{i20}^2 -Q_{i21}^2 } \right) v_{0i}^2 ,\\ \beta _1= & {} Q_{i20} Q_{i21} ,\\ \beta _2= & {} Q_{i00} Q_{i01} -Q_{i01} Q_{i20} u_{0i} -Q_{i00} Q_{i21} u_{0i} \\&+Q_{i20} Q_{i21} u_{0i}^2\hbox { and }\\ \beta _3= & {} Q_{i10} Q_{i11} -Q_{i20} Q_{i11} v_{0i} -Q_{i10} Q_{i21} v_{0i}\\&+Q_{i20} Q_{i21} v_{0i}^2 . \end{aligned}$$

Expression (21) is a linear system of the form:

$$\begin{aligned} \left\{ {{\begin{array}{c} a_0 X_1 +a_1 X_2 =b_0 \\ a_2 X_1 +a_3 X_2 =b_1 \\ \end{array} }} \right. \end{aligned}$$

where \(X_1 =\varepsilon _i^2 f_i^2 \) and \(X_2 =\varepsilon _i^2\).

Solving this linear system by substitution gives \(X_1 \) and \(X_2 \).

The values of \(\varepsilon _i\) and \(f_i\) are then obtained from \(X_1\) and \(X_2\), since \(\varepsilon _i =\sqrt{X_2}\) and \(f_i =\sqrt{X_1 /X_2}\) (taking the positive roots).
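The following sketch summarizes this last step (with NumPy): given the estimated matrix \(Q_i\) and the assumed principal point, the system (21) is solved for \(X_1\) and \(X_2\), from which \(f_i\) and \(\varepsilon_i\) follow. Degenerate configurations (nearly singular system, negative solutions) would need to be handled in a real implementation:

```python
import numpy as np

def focal_and_scale_from_Q(Q, u0, v0):
    """Solve the system (21) for X1 = eps^2 f^2 and X2 = eps^2 (Q is 3x3, 0-indexed)."""
    a1 = Q[2, 0]**2 - Q[2, 1]**2
    a2 = (Q[0, 0]**2 - Q[0, 1]**2
          - 2 * (Q[0, 0] * Q[2, 0] + Q[0, 1] * Q[2, 1]) * u0
          + (Q[2, 0]**2 - Q[2, 1]**2) * u0**2)
    a3 = (Q[1, 0]**2 - Q[1, 1]**2
          - 2 * (Q[1, 0] * Q[2, 0] + Q[1, 1] * Q[2, 1]) * v0
          + (Q[2, 0]**2 - Q[2, 1]**2) * v0**2)
    b1 = Q[2, 0] * Q[2, 1]
    b2 = (Q[0, 0] * Q[0, 1] - Q[0, 1] * Q[2, 0] * u0
          - Q[0, 0] * Q[2, 1] * u0 + Q[2, 0] * Q[2, 1] * u0**2)
    b3 = (Q[1, 0] * Q[1, 1] - Q[2, 0] * Q[1, 1] * v0
          - Q[1, 0] * Q[2, 1] * v0 + Q[2, 0] * Q[2, 1] * v0**2)
    X1, X2 = np.linalg.solve(np.array([[a1, a2], [b1, b2]]),
                             np.array([-a3, -b3]))
    eps = np.sqrt(X2)            # scale factor epsilon_i (positive root)
    f = np.sqrt(X1 / X2)         # focal length f_i (positive root)
    return f, eps
```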

4.2.2 Estimation of extrinsic parameters

Formula (4) gives:

$$\begin{aligned} H_i \sim K_i \left[ {{\begin{array}{ccc} {r_1^i }&{} {r_2^i }&{} {t_i } \\ \end{array} }} \right] , \quad i=1,2 \end{aligned}$$
(22)

where \(r_k^i\) (\(1\le k\le 3\)) denotes the kth column of the rotation matrix \(R_i\).

As already presented in [29], the formula (22) gives:

$$\begin{aligned} r_1^i= & {} \mu _i K_i^{-1} h_1^i \end{aligned}$$
(23)
$$\begin{aligned} r_2^i= & {} \mu _i K_i^{-1} h_2^i \end{aligned}$$
(24)
$$\begin{aligned} t_i= & {} \mu _i K_i^{-1} h_3^i \end{aligned}$$
(25)

where \(\mu _i =\Vert K_i^{-1} h_1^i\Vert ^{-1}=\Vert K_i^{-1} h_2^i\Vert ^{-1}\).

In our situation, the homography \(H_i =\left[ {{\begin{array}{ccc} {h_1^i }&{} {h_2^i }&{} {h_3^i } \\ \end{array} }} \right] \) is unknown because the scene is not planar.

Formula (9) gives:

$$\begin{aligned} H_i =Q_i B^{-1} \end{aligned}$$
(26)

Expressions (23), (24), (25) and (26) give:

$$\begin{aligned} r_1^i= & {} \mu _i^{\prime } K_i^{-1} q_1^i \end{aligned}$$
(27)
$$\begin{aligned} r_2^i= & {} \mu _i^{\prime } K_i^{-1} q_2^i \end{aligned}$$
(28)
$$\begin{aligned} r_3^i= & {} r_1^i \times r_2^i \end{aligned}$$
(29)
$$\begin{aligned} t_i= & {} d\mu _i^{\prime } K_i^{-1} q_3^i \end{aligned}$$
(30)

where \(\mu _i^{\prime } =\Vert K_i^{-1} q_1^i\Vert ^{-1}=\Vert K_i^{-1} q_2^i\Vert ^{-1}\) and \(q_k^i \) (\(1\le k\le 3\)) denote the kth column of the matrix \(Q_i \).

The rotation matrix is obtained from (27), (28) and (29).

It remains to estimate the value of d to obtain the translation vector.

Expression (17) gives:

$$\begin{aligned} \left\{ {{\begin{array}{l} \left( Q_i^\mathrm{T} \omega _i Q_i \right) _{00} =\nu _i d^{2} \\ \left( Q_i^\mathrm{T} \omega _i Q_i \right) _{02} =\nu _i dt_i^\mathrm{T} r_1^i \\ \end{array} }} \right. \end{aligned}$$
(31)

where \(\nu _i\) is a nonzero scale factor.

So, from (31) we obtain:

$$\begin{aligned} d=\frac{\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{00} }{\left( Q_i^\mathrm{T} \omega _i Q_i \right) _{02} }\left( {t_i^\mathrm{T} r_1^i } \right) \end{aligned}$$
(32)

Then, substituting d in (30) by its expression (32), we obtain a linear system of the form:

$$\begin{aligned} At_i =0 \end{aligned}$$
(33)

where \(A\in {\mathbb {R}}^{3\times 3}\).

Solving this system by singular value decomposition (SVD) allows the translation vector to be estimated.
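The recovery of the rotation from (27)-(29) and of the translation from (33) can be sketched as follows (NumPy; building the \(3\times 3\) matrix A of (33) from (30) and (32) is assumed to be done beforehand):

```python
import numpy as np

def rotation_from_Q(K, Q):
    """Columns r1, r2 from K^{-1} q_1 and K^{-1} q_2, r3 = r1 x r2 (Eqs. 27-29)."""
    K_inv = np.linalg.inv(K)
    r1 = K_inv @ Q[:, 0]
    mu = 1.0 / np.linalg.norm(r1)    # mu'_i = ||K^{-1} q_1||^{-1}
    r1, r2 = mu * r1, mu * (K_inv @ Q[:, 1])
    return np.column_stack([r1, r2, np.cross(r1, r2)])

def translation_from_A(A):
    """Solve A t = 0 (Eq. 33): t is the right singular vector associated with
    the smallest singular value of A (defined up to sign)."""
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]
```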

4.2.3 Recovering 3D point coordinates

The coordinates of the 3D points are recovered by triangulation from the matching result between the two images \(\left\{ {I_1 ,I_2 } \right\} \) and the projection matrices defined by:

$$\begin{aligned} P_1 =K_1 \left[ {R_{1} t_1 } \right] \quad \hbox {and}\quad P_2 =K_2 \left[ {R_2 t_2 } \right] \end{aligned}$$
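For this triangulation step, OpenCV's linear triangulation can be used, for example (a sketch; `pts1` and `pts2` are the matched, undistorted image points as (N, 2) arrays):

```python
import cv2
import numpy as np

def triangulate(P1, P2, pts1, pts2):
    """Return the (N, 3) Euclidean 3D points triangulated from two views."""
    X_h = cv2.triangulatePoints(P1.astype(np.float64), P2.astype(np.float64),
                                pts1.T.astype(np.float64), pts2.T.astype(np.float64))
    return (X_h[:3] / X_h[3]).T
```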

4.2.4 Bundle adjustment

The optimization of different entities previously estimated (intrinsic and extrinsic parameters, radial distortion and the 3D point coordinates) is performed by minimizing the criterion (34) using the Levenberg–Marquardt algorithm [23, 33]:

$$\begin{aligned} C\left( \theta \right) =\sum \limits _{i=1}^2 \sum \limits _{j=1}^{n_{1,2} } \Vert a_{ij} -{\mathcal {P}}( {K_i ,k_{1i} ,k_{2i} ,R_i ,t_i ,A_j } )\Vert ^{2} \end{aligned}$$
(34)

where \(n_{1,2}\) is the number of reconstructed 3D points and

$$\begin{aligned} \theta= & {} \left\{ f_1 ,\varepsilon _1 ,s_1 ,u_{01} ,v_{01} ,k_{11} ,k_{21} ,\alpha _1 ,\beta _1 ,\gamma _1 ,t_x^1 ,t_y^1 ,t_z^1 ,\right. \\&f_2 ,\varepsilon _2 ,s_2 ,u_{02} ,v_{02} ,k_{12} ,k_{22} ,\alpha _2 ,\beta _2 ,\gamma _2 ,t_x^2 ,t_y^2 ,\\&\left. \quad t_z^2 ,X_1 ,Y_1 ,Z_1 , \ldots ,X_{n_{1,2} } ,Y_{n_{1,2} } ,Z_{n_{1,2} } \right\} \end{aligned}$$
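A possible way to minimize such a criterion is SciPy's Levenberg–Marquardt solver. The sketch below is deliberately simplified: only the focal length, the pose (axis-angle rotation and translation) and the 3D points are refined, whereas the full parameter vector \(\theta\) of (34) also contains \(\varepsilon\), s, \((u_0, v_0)\) and \((k_1, k_2)\) for each camera:

```python
import numpy as np
from scipy.optimize import least_squares

def project(f, rvec, t, X):
    """Pinhole projection with K = diag(f, f, 1) and an assumed principal point
    at the origin; rvec is an axis-angle (Rodrigues) rotation vector."""
    theta = np.linalg.norm(rvec)
    k = rvec / theta if theta > 0 else rvec
    Xc = (np.cos(theta) * X + np.sin(theta) * np.cross(k, X)
          + (1 - np.cos(theta)) * k * (k @ X)) + t
    return f * Xc[:2] / Xc[2]

def residuals(params, obs, n_pts):
    """obs: list of (camera_index, point_index, observed_2d); two cameras assumed."""
    cams = params[:14].reshape(2, 7)             # f, rvec (3), t (3) per camera
    pts = params[14:].reshape(n_pts, 3)
    r = [o - project(cams[i][0], cams[i][1:4], cams[i][4:7], pts[j])
         for i, j, o in obs]
    return np.concatenate(r)

# params0 stacks the initial estimates obtained by self-calibration/triangulation:
# sol = least_squares(residuals, params0, method='lm', args=(obs, n_pts))
```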
Fig. 2: Projections’ localization of already reconstructed 3D points in the inserted image \(I_{k}\) using the interest point matching results. \(a_{kj} \) is the projection of the 3D point \(A_j \), reconstructed from \(a_{k-2,j} \) and \(a_{k-1,j} \), localized in the image \(I_{k}\)

4.3 Inserting a new image \(I_k (3\le k\le m)\)

After inserting a new uncalibrated image, the corresponding projection matrix is estimated on the basis of the three-dimensional data already retrieved. The RANSAC algorithm [21] is used for the reliable recovery of this matrix by solving a linear system built from previously reconstructed 3D points and their projections located in the inserted image [1] (Fig. 2). Then, new 3D points are retrieved from the interest point matching result between the inserted image and the previous one. So, the following steps are performed:

1. Projections’ localization of already reconstructed 3D points in the image \(I_{k}\).
2. Estimation of the projection matrix \(P_{k}\) from \(n_{0}\) (\(n_0 \ge 6\)) 3D points and their projections located in the image \(I_{k}\) using the RANSAC method [21].
3. Recovery of a set of 3D points from the interest point matching result between \(I_{k-1}\) and \(I_{k}\).
4. Decomposition of the projection matrix \(P_{k}\) to recover the intrinsic and extrinsic parameters.
5. Local bundle adjustment over the last \(m_{0}\) images (in our experiments \(m_0 =3\)) taking into account the radial distortion.
6. If \(k=M_0\) (\(10\le M_0 \le 20\) in our experiments), a global bundle adjustment is applied over the \(M_{0}\) inserted images.

4.3.1 Projection matrix estimation and new 3D point recovery

After the projections’ localization of already reconstructed 3D points in the image \(I_{k}\), the projection matrix \(P_{k}\) is estimated from at least six already reconstructed 3D points and their projections localized in the image \(I_{k}\) [1] using the RANSAC algorithm [21]. Then, the coordinates of new 3D points are estimated by triangulation from the interest point matching result between the images \(\{ I_{k-1}, I_{k}\}\) and the estimated projection matrices \(P_{k-1}\) and \(P_{k}\).

[Algorithm 2]

Algorithm 2 is based on the manipulation of the 3D information already estimated as well as on the interest point matching between the last 3 images \(\{I_{k-2} ,I_{k-1} ,I_k \}\) to estimate the projection matrix \(P_k \) that corresponds to the inserted image \(I_k\).
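As an illustration of this step, a simple DLT estimation of \(P_k\) from 3D–2D correspondences, wrapped in a RANSAC loop, could look as follows (the iteration count and inlier threshold are illustrative, not those of our experiments):

```python
import numpy as np

def dlt_projection_matrix(X, x):
    """DLT: X (N, 3) 3D points, x (N, 2) image points, N >= 6."""
    rows = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        A = [Xw, Yw, Zw, 1.0]
        rows.append(A + [0.0] * 4 + [-u * c for c in A])
        rows.append([0.0] * 4 + A + [-v * c for c in A])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(3, 4)

def ransac_projection_matrix(X, x, iters=500, thresh=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X_h = np.hstack([X, np.ones((len(X), 1))])
    best_P, best_inliers = None, np.zeros(len(X), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(X), size=6, replace=False)   # minimal sample (n_0 = 6)
        P = dlt_projection_matrix(X[idx], x[idx])
        proj = (P @ X_h.T).T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.linalg.norm(proj - x, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_P, best_inliers = P, inliers
    if best_inliers.sum() >= 6:                           # refit on all inliers
        best_P = dlt_projection_matrix(X[best_inliers], x[best_inliers])
    return best_P, best_inliers
```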

Fig. 3: a Three images of the sequence, b result of interest point matching between two images after removing false matches by the RANSAC algorithm, c four views of the sparse 3D reconstruction, d two views of the 3D surface model achieved using the 3D Crust algorithm, e three views of the textured 3D model

Table 1: Estimated camera intrinsic parameters corresponding to the first two images for the five sequences
Table 2: Number of reconstructed 3D points (sparse 3D reconstruction) for the five sequences
Fig. 4: a Three images of the sequence, b result of interest point matching between two images after removing false matches, c two views of the sparse 3D reconstruction, d matching result after the application of the match propagation algorithm, e three views of the dense 3D reconstruction

4.3.2 Local bundle adjustment between the \(m_0 =3\) latest images

The newly estimated elements (the camera parameters that correspond to the image \(I_k \), the radial distortion coefficients and the newly reconstructed 3D points) are optimized by minimizing the criterion (35) [23, 33].

$$\begin{aligned} C\left( \theta \right) =\sum \limits _{i=k-m_0 }^k \sum \limits _{j=1}^{n_{k-1,k} } \Vert a_{ij} -{\mathcal {P}}\left( {K_i ,k_{1i} ,k_{2i} ,R_i ,t_i ,A_j } \right) \Vert ^{2} \end{aligned}$$
(35)

where \(n_{k-1,k} \) is the number of new 3D points reconstructed from the interest point matching result between the image pair \(\{I_{k-1} , I_k \}\), and

$$\begin{aligned} \theta= & {} \left\{ f_k ,\varepsilon _k ,s_k ,u_{0k} ,v_{0k} ,k_{1k} ,k_{2k} ,\alpha _k ,\beta _k ,\gamma _k ,t_x^k ,t_y^k ,t_z^k ,\right. \\&\left. \quad X_1 ,Y_1 ,Z_1 ,\ldots ,X_{n_{k-1,k} } ,Y_{n_{k-1,k} } ,Z_{n_{k-1,k} }\right\} \end{aligned}$$
Fig. 5: a Four key frames, b example of interest point matching between two images, c matching result after the application of the match propagation algorithm, d two views of the sparse 3D reconstruction, e two views of the dense 3D reconstruction

Fig. 6: a Four images of the sequence, b result of interest point matching between two images, c matching result after the application of the match propagation algorithm, d two views of the sparse 3D reconstruction, e two views of the dense 3D reconstruction

4.3.3 Global bundle adjustment between the \(M_0 \) inserted images

Global bundle adjustment applied to all images requires a very long computation time, especially when a large number of images is used. In this work, we combine the local bundle adjustment performed after the insertion of each new image with a global bundle adjustment performed after the insertion of \(M_0\) images \((10\le M_0 \le 20)\) to accelerate the processing while maintaining the system reliability. So, after the insertion of \(M_0\) images, all estimated elements (already locally optimized) are used as an initial solution to minimize the criterion (36) using the Levenberg–Marquardt algorithm [23, 33].

$$\begin{aligned} C\left( \theta \right) =\sum \limits _{i=1}^{M_0 } \sum \limits _{j=1}^n \Vert a_{ij} -\mathcal {P}\left( {K_i ,k_{1i} ,k_{2i} ,R_i ,t_i ,A_j } \right) \Vert ^{2} \end{aligned}$$
(36)

where n is the number of reconstructed 3D points and

$$\begin{aligned} \theta= & {} \left\{ f_1 ,\varepsilon _1 ,s_1 ,u_{01} ,v_{01} ,k_{11} ,k_{21} ,\alpha _1 ,\beta _1 ,\gamma _1 ,t_x^1 ,t_y^1 ,t_z^1 ,\right. \\&\ldots ,f_{M_0 } ,\varepsilon _{M_0 } ,s_{M_0 } ,u_{0M_0 } ,v_{0M_0 } ,k_{1M_0 } ,k_{2M_0 } ,\alpha _{M_0 } ,\\&\left. \beta _{M_0 } ,\gamma _{M_0 } , t_x^{M_0 } ,t_y^{M_0 } ,t_z^{M_0 } ,X_1 ,Y_1 ,Z_1 ,\ldots ,X_n ,Y_n ,Z_n \right\} \end{aligned}$$

5 Experiments

To validate and test the robustness of the proposed approach, several image and video sequences are used. We present the results for five image/video sequences of scenes with different sizes (vase, villa pot, Medusa head [34], castle-P30 [35] and a complex scene). All experiments are executed on an HP 650 machine with an Intel Core i3 2.30 GHz CPU and 4 GB of RAM.

Fig. 7: a Four images of the sequence, b interest point matching between two images after removing false matches, c two views of the sparse 3D reconstruction, d matching result after the application of the match propagation algorithm, e three views of the dense 3D reconstruction corresponding to the last matching result

5.1 Vase sequence: small scale object

This first sequence consists of thirty-two images, with a resolution of \(900 \times 1200\), taken around a small object. Three images of the sequence are presented in Fig. 3a. First, we detect and match interest points with the SIFT method [30]. An example of interest point matching between two images is shown in Fig. 3b. Our reconstruction system is initialized from two images with a sufficient number of matches and a large camera motion [15]. After the estimation of the camera parameters using our method, a set of 3D points is recovered from the result of interest point matching between these two images. A bundle adjustment, taking into account the radial distortion and using the Levenberg–Marquardt algorithm [23, 33], is applied to refine the estimated entities as well as possible. Table 1 shows the estimated values of the camera intrinsic parameters and the first two radial distortion coefficients corresponding to the first two images for the five sequences. For each new inserted image, new 3D points are recovered and a local bundle adjustment between the last \(m_0 =3\) images is performed to refine the newly estimated entities. After the insertion of \(M_0 =14\) images, a global bundle adjustment is executed to adjust all parameters and to obtain a reliable initial 3D model that is then used, together with the local bundle adjustment, for the insertion of new images. Four views of the sparse 3D reconstruction are shown in Fig. 3c. Figure 3d shows the obtained 3D surface model after applying the 3D Crust algorithm [17]. The textured 3D model is presented in Fig. 3e. Table 2 shows the values of \(M_0\) and the number of reconstructed 3D points (sparse 3D reconstruction) for the five sequences.

Table 3: Comparison results

5.2 Villa pot sequence: medium-scale scene

In this second experiment, a sequence of 28 images, with a resolution of \(750 \times 1000\), was used. Three images of the sequence are shown in Fig. 4a. Figure 4b shows the result of interest point matching between two images after removing false matches using RANSAC algorithm [21] (353 matches are obtained). Two views of the sparse 3D reconstruction are presented in Fig. 4c. Figure 4d shows the almost dense matching result after applying the Match Propagation algorithm [36] (297694 matches are obtained). Three views of the dense 3D reconstruction are presented in Fig. 4e.

5.3 Medusa head sequence: medium-scale object

In this third experiment, we tested the power of our approach on the ‘Medusa head’ video downloaded from the Marc Pollefeys page [34]. Four key frames are shown in Fig. 5a. Sparse and dense matching are shown in Fig. 5b, c respectively. Two views of the sparse 3D reconstruction (7342 3D points were reconstructed) are presented in Fig. 5d. Two views of the obtained dense 3D model (dense 3D point cloud) are presented in Fig. 5e.

5.4 Castle sequence: large-scale scene

In the previous experiments, we tested our approach on small- and medium-scale scenes. In this part, we used a sequence of 30 images, with a resolution of \(1020 \times 680\), of a large-scale scene (castle-P30 [35]). Four images of the sequence are shown in Fig. 6a. We begin with the interest point matching between the different images [30]. An example of the result obtained between two images is shown in Fig. 6b. The initialization of our system is performed from two selected images [15]. Then, the rest of the images are inserted progressively using the already estimated 3D structure and bundle adjustment. Two views of the obtained sparse 3D reconstruction, after the insertion of all images, are shown in Fig. 6d. To obtain a dense 3D reconstruction, we must go through dense matching between images. So, we used the match propagation method [36], which starts from the sparse matching result to search for new matches in the vicinity of the existing ones. Figure 6c shows the almost dense matching result obtained. The dense 3D model is presented in Fig. 6e.

5.5 Complex scene sequence

In this experiment, our approach was tested on a sequence of 142 images, with a resolution of \(740 \times 565\), of a complex scene composed of objects of different sizes. Four images of the sequence are shown in Fig. 7a. An example of interest point matching between two images, after removing false matches, is shown in Fig. 7b. Two views of the sparse 3D reconstruction are presented in Fig. 7c. Figure 7d shows the matching result after the application of the match propagation algorithm [36]. Three views of the dense 3D reconstruction that correspond to the last matching result are presented in Fig. 7e.

As shown in the experiments, our approach obtains quality 3D reconstruction results for different types of objects/scenes. These results also prove the reliability of our formulation for the estimation of the intrinsic and extrinsic camera parameters.

To evaluate our approach, four state-of-the-art methods [5, 12, 27, 37] are used. The Pollefeys approach [5] uses structure from motion and global bundle adjustment to recover the projective 3D structure and the camera motion. This approach requires a camera self-calibration step to pass from the projective 3D structure to the metric 3D structure. The Mouragnon approach [12] is based on incremental structure from motion and local bundle adjustment for the estimation of the 3D structure and the camera motion from video captured by a calibrated camera. (To adapt this approach to our situation, we used our camera self-calibration method.) The Schonberger approach [27] and VisualSFM [37, 38] are two incremental structure from motion systems for 3D reconstruction from unordered image collections.

Table 3 shows that our method gives satisfactory results compared to the other methods [5, 12, 27, 37]. This is due to the good initialization based on the proposed self-calibration approach from two images (taking the radial distortion into account), which allows us to estimate a set of 3D scene points, to the incremental recovery of new 3D points (after inserting new images) based on local bundle adjustment, and to the suitable integration of the global bundle adjustment. Concerning the computation time (not counting the matching time), our approach is close to the Mouragnon approach [12] and faster than the Pollefeys approach [5], the Schonberger approach [27] and VisualSFM [37, 38].

Fig. 8: Reprojection error as a function of the number of images

Fig. 9: Execution time needed for the sparse 3D reconstruction (not counting the matching time)

To test the performance of our approach compared to the other methods [5, 12, 27, 37] as a function of the number of images used, we used sequence 2 (the other sequences lead to similar results). The results are shown in Figs. 8 and 9.

As shown in Fig. 8, for the Mouragnon approach [12], the reprojection error rises as the number of images increases because of error accumulation, since it is an incremental approach based on local bundle adjustment. On the contrary, for the Pollefeys approach [5], the reprojection error decreases as the number of images increases because it is based on global bundle adjustment; with a large number of images, the Pollefeys approach thus becomes more stable. Our approach achieves more accurate results than these two methods, close to those obtained by the Schonberger approach [27] and VisualSFM [37, 38].

When the number of images is between 2 and 13 (in these experiments we took \(M_0 =13\)), our approach behaves like global structure from motion systems [5], with more precision, because it is based on global bundle adjustment (GBA) with a good initialization of the different parameters (those obtained by local bundle adjustment). When the number of images is greater than 13, we note that the reprojection error is almost stable, with a slight increase, because the new images are inserted on the basis of the obtained initial 3D structure (already optimized locally and globally) and on local bundle adjustment.

As shown in Fig. 9, our approach is faster than the Schonberger approach [27] and VisualSFM [37, 38], and much faster than the Pollefeys approach [5]. The latter applies the global bundle adjustment to all estimated entities after the insertion of all images, which requires a long computation time, especially as the number of images increases, and can even pose convergence problems (resolution by the Levenberg–Marquardt algorithm [33] with a bad initialization). On the other hand, the proposed approach yields quality results in a time close to that of the Mouragnon method [12], which is based on local bundle adjustment and applied in real time.

6 Conclusion

In this paper, we have proposed a complete system for 3D reconstruction from images/videos taken by a moving camera characterized by varying parameters. Our system automatically recovers the camera parameters and obtains metric 3D reconstruction results without going through a projective 3D reconstruction. It is properly initialized from two images with a large camera motion; to this end, we have proposed a new method for automatic estimation of the intrinsic and extrinsic camera parameters. Incremental 3D reconstruction systems rely on local bundle adjustment to ensure speed, but the quality of the reconstruction results can be affected by error accumulation. Our system introduces a global bundle adjustment after the insertion of a suitable number of images, avoiding the use of all images so as not to face optimization problems with a very large number of parameters, which would require a lot of computation time. The optimized 3D model obtained and the local bundle adjustment are then used for the insertion of the rest of the images. Our 3D reconstruction system is completely automatic and provides more reliability while remaining fast.