1 Introduction

3D model reconstruction is an essential procedure in virtual reality. To create a virtual scene of reality, the 3D model of the real scene must first be reconstructed. Cameras are the most commonly used sensors for data collection in 3D model reconstruction, and they are also the electronic devices most familiar to people. Almost every smart phone is equipped with two cameras, a front camera and a rear camera; the rear camera usually has a much higher resolution than the front one, sometimes close to that of a professional digital camera. These cameras are usually hand-held. In theory, anyone who owns a digital camera or a smart phone with a high-resolution camera can collect images to reconstruct a 3D model. However, such cameras are designed only for amateur photography: the focal length is not fixed, the lens distortion is large and unknown, the positions and attitudes of a hand-held camera are unknown, and the imaging configuration of a scene is not regularly aligned. These characteristics make images collected by hand-held cameras much more difficult to use in 3D model reconstruction. Nevertheless, hand-held cameras are very convenient. If 3D model reconstruction can be implemented with these hand-held cameras, the conventionally complex 3D modeling work becomes easier, and more people can study and participate in 3D modeling or even in virtual reality activities through their hand-held cameras. This is significant to the development and innovation of the photogrammetry, remote sensing and virtual reality communities.

3D modeling is a relatively complex procedure. The most frequently used methods of 3D modeling are photogrammetry methods (Jesse 2015; Agisoft 2015; Acute3D 2015; Eos Systems Inc. 2015; SimActive Inc. 2015; Rothganger et al. 2006; García-Gago et al. 2014; Rau and Chen 2003; Kocaman et al. 2006; Ozaki et al. 2011; Bujnak et al. 2009; Park and Subbarao 2004; Park et al. 2008; Wang 2012; Elias and Kebisek 2010), light detection and ranging (LiDAR) methods (Ackermann 1999; Li et al. 2012; Jiang et al. 2014; Zhang et al. 2006; Yu et al. 2014; Arefi et al. 2008; Martin et al. 2010; Kato et al. 2009; Yang et al. 2013; Zhu 2014) and LiDAR combined with photogrammetry methods (Baltsavias 1999; Ma 2004; Sohn and Dowman 2007; Chen et al. 2014; Kim and Habib 2009; Susaki 2013). Apart from the necessary auxiliary data, the first approach uses only image data, the second uses LiDAR point cloud data, and the third uses both images and LiDAR point clouds. The photogrammetry method uses cameras to collect images; tie point observations are extracted and combined with other auxiliary data to restore the relative positions, attitudes and inner parameters of each camera, and then the point cloud and 3D model of the scene can be generated. The LiDAR method uses laser ranging combined with an IMU/DGPS position and orientation system (POS) to directly acquire the 3D coordinates of ground points in the scene. It is simple and effective for extracting the digital surface model (DSM), but the instruments are very expensive. Furthermore, the edge and texture information of the scene is missing since the point cloud is collected at a regular interval. Both the photogrammetry and LiDAR methods have advantages and disadvantages; thus, the LiDAR combined with photogrammetry method was proposed to exploit the advantages and avoid the disadvantages of the two methods. However, the registration problem between photogrammetric and LiDAR data is still not perfectly solved. The photogrammetry method therefore remains a feasible and widely used method.

3D model reconstruction using images includes a number of procedures. First, all images are preprocessed for data standardization, and tie points are automatically identified and matched across all images. Then, the ground control points (if any), tie points, initial positions and attitudes, known as exterior orientation parameters (EOPs), and inner parameters, known as interior orientation parameters (IOPs), of each camera are combined in a bundle block adjustment (BBA) procedure to obtain accurate EOPs and IOPs for each camera. Finally, a dense point cloud is produced with these EOPs and IOPs, and the 3D model is reconstructed from this dense point cloud and the raw images. In this paper, common digital images are used as test data for 3D modeling experiments. The main purpose of this work is to utilize proper exterior and interior orientation models and to develop a stable and efficient workflow for 3D modeling with hand-held cameras. The whole procedure, with emphasis on the mathematical and technical details of BBA with these hand-held cameras, is discussed. The experimental results of the output dense point cloud and the reconstructed 3D model are also presented.

2 Related works

3D model reconstruction has been studied comprehensively in the photogrammetry, remote sensing and computer vision communities in recent years. As mentioned before, the most frequently used methods are the photogrammetry method, the LiDAR method and the LiDAR combined with photogrammetry method. A large body of research has focused on these methods.

In the photogrammetry community, explosive growth has occurred in 3D model reconstruction. 123D Catch, developed by Autodesk, is a free photogrammetry application that can extract 3D information from 2D images (Jesse 2015); PhotoScan, developed by Agisoft (2015), is a stand-alone software product that performs photogrammetric processing of digital images and generates 3D spatial data; Smart3DCapture, developed by Acute3D (2015), can turn photos into 3D models automatically; PhotoModeler, developed by Eos Systems Inc. (2015), extracts 3D measurements and models from photographs taken with an ordinary camera; and the software developed by SimActive Inc. (2015) is designed for generating high-quality geospatial data from imagery. Most of these software packages are customized to solve particular problems, for instance, aerial triangulation or 3D model reconstruction. The mathematical and technical details of 3D model reconstruction using only images have also been studied. Rothganger et al. (2006) used local affine-invariant image descriptors and multi-view spatial constraints to model 3D objects. García-Gago et al. (2014) developed a photogrammetric and computer vision-based approach for automatic 3D architectural modeling and its typological analysis. Rau and Chen (2003) proposed a robust method for reconstructing building models from three-dimensional line segments. Most of the above works use aerial imagery, but other image sources have also been adopted. Kocaman et al. (2006) used high-resolution satellite images to extract 3D models of buildings. Ozaki et al. (2011) developed a method for 3D modeling of dynamic remote environments using images from two cell phone cameras and a communication network. Bujnak et al. (2009) introduced a method for 3D reconstruction from image collections with only a single known focal length. Some researchers even use only 2D images and a priori information to reconstruct 3D models. Park and Subbarao (2004) and Park et al. (2008) developed a method for automatic 3D model reconstruction based on pose estimation and integration techniques and then reconstructed a 3D face from a single 2D face image with this method. To improve efficiency, graphics processing unit (GPU) parallel computing has been introduced: Wang (2012) built a framework for GPU-based 3D model reconstruction using structure from motion in his master's thesis. Elias and Kebisek (2010) reported an overview of methods for 3D model reconstruction from 2D orthographic views and argued that most of the design work lies not in designing new components but in adapting, modifying and refining existing ones. Krasić and Pejić (2014) compared semi-automatic and fully automatic photogrammetry methods in a case study of 3D modeling of the remains of the Nis Palace.

Light detection and ranging (LiDAR) systems have been widely used for 3D model reconstruction in recent years. Back in 1999, Ackermann (1999) contributed a comprehensive analysis of the status and expectations of airborne laser scanning systems. In the twenty-first century, much work is still focused on the mathematical theory and technical details of 3D reconstruction with LiDAR data. Some works focus on 3D building model reconstruction. Li et al. (2012) developed a hierarchical contour method for automatic 3D city reconstruction from LiDAR data. Jiang et al. (2014) built a model for automatic reconstruction of multilayer building 3D contour models from airborne LiDAR point clouds. Zhang et al. (2006) introduced an automatic construction of building footprints from airborne LiDAR data. Yu et al. (2014) proposed a method to automatically reconstruct 3D building models from segmented laser scanning data based on a pre-defined formal grammar and rules. Arefi et al. (2008) studied the levels of detail in 3D building reconstruction from LiDAR data. Other works focus on the 3D modeling of plants, vegetation and other objects. Martin et al. (2010) applied LiDAR point clouds to canopy surface reconstruction using the Hough transformation. Kato et al. (2009) performed an implicit surface reconstruction to capture tree crown formation using airborne LiDAR data. Yang et al. (2013) proposed a method for 3D forest reconstruction and structural parameter retrieval using a terrestrial full-waveform LiDAR instrument. Zhu (2014) used airborne and mobile laser scanning to reconstruct 3D models of railway environments.

Baltsavias (1999) reported an early comparison between photogrammetry and laser scanning. The two methods both have advantages and disadvantages, so combining them is a sensible choice. Ma (2004) studied the theory and technical details of building model reconstruction from LiDAR data and aerial photographs in his doctoral dissertation. Sohn and Dowman (2007) fused high-resolution satellite imagery and LiDAR data for automatic 3D building extraction. Chen et al. (2014) integrated LiDAR and camera data for 3D reconstruction of both indoor and outdoor environments. Kim and Habib (2009) studied the object-based integration of photogrammetric and LiDAR data for automatic generation of complex polyhedral building models, while Susaki (2013) proposed a knowledge-based modeling of buildings in dense urban areas by combining airborne LiDAR data and aerial images. Although a great deal of research has been done, the registration between these two kinds of data has still not been perfectly solved; none of the present solutions is satisfactory in both stability and efficiency. Some other methods of 3D model reconstruction have also been applied; for instance, Zhang et al. (2013) implemented a real-time 3D model reconstruction and interaction system using Kinect for a game-based virtual laboratory.

In this work, we choose an economical and practical approach, the photogrammetry method. The source images are photographed with common hand-held cameras.

3 Methodology

To reconstruct a 3D model from images, the correspondence between images must be identified via tie point extraction and a relative orientation procedure. Then, BBA is applied to improve the accuracy of the image orientation. Finally, a dense point cloud is produced using these orientation parameters and the 3D model is reconstructed. Methods of processing common digital images are quite different from conventional aerial photogrammetry, and many researchers have focused on this problem. Some good methods and algorithms have been proposed, such as structure from motion (SFM) and multi-view stereo (MVS). In this paper, an efficient and effective BBA method using a preconditioned conjugate gradient algorithm, combined with state-of-the-art SFM and MVS techniques, is applied to reconstruct 3D models with common hand-held cameras.

3.1 Tie point extraction and relative orientation

Common hand-held cameras are non-metric cameras. The correspondence problem for these images is difficult due to distortions and deformations, so a stable and efficient correspondence algorithm is required. Fortunately, the scale-invariant feature transform (SIFT) provides robust feature extraction and image matching performance and is invariant to many transformations such as scaling and rotation (Lowe 2004). SIFT is adopted to first identify feature points on the images and then match the conjugate points on the corresponding images; more information about SIFT can be found in Lowe (2004). Although SIFT is invariant to many deformations, it is still prone to errors (Agarwal et al. 2011). To avoid mismatches, the epipolar constraint is applied. Epipolar searching is an effective and efficient strategy in image matching, based on the fact that conjugate points should lie on the corresponding epipolar lines, as shown in Fig. 1. This strategy not only reduces the errors but also improves the searching efficiency. To extract tie points, exhaustive matching of all the images is implemented, but this process is very time consuming, especially when many images are involved (for instance, more than 1000). A fast matching strategy is then needed. Some researchers have already noticed this problem and reported solutions; among them, Agarwal et al. (2011) proposed a quick image matching method based on an image skeleton. When the number of images grows large, this method is a wise choice. Figures 2 and 3 show screenshots of the tie points from two scenes.

Fig. 1 Epipolar line on the left and right images, respectively

Fig. 2 Tie points shown on the test scene 1

Fig. 3 Tie points shown on the test scene 2
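
For concreteness, the following is a minimal sketch of the matching step described above: SIFT features are matched between two images and the epipolar constraint is enforced by estimating the fundamental matrix with RANSAC and keeping only inlier matches. It uses the modern OpenCV Python API (the experiments in Sect. 5 were run with an older OpenCV release), and the file names and thresholds are illustrative placeholders, not the settings used in this work.

```python
import cv2
import numpy as np

img1 = cv2.imread("scene_left.jpg", cv2.IMREAD_GRAYSCALE)    # placeholder file names
img2 = cv2.imread("scene_right.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT feature extraction on both images
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Nearest-neighbour matching with Lowe's ratio test
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.8 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Epipolar constraint: estimate the fundamental matrix with RANSAC and keep
# only matches that lie (nearly) on the corresponding epipolar lines.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
tie1, tie2 = pts1[mask.ravel() == 1], pts2[mask.ravel() == 1]
print(f"{len(good)} raw matches, {len(tie1)} tie points after epipolar filtering")
```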

Once the tie points are obtained, we can use them to perform relative orientation, connecting all the images and building a scalable block. Relative orientation is a classic and well-defined algorithm in the conventional photogrammetry process; its main purpose is to acquire the relative position and attitude of the images with respect to a local coordinate system. All the images in the block can be connected using these relative positions and attitudes. In this paper, common digital images are used for 3D modeling. These images are often unordered and irregularly aligned, and some images might have no overlap with the rest; they should be removed from the block. In the conventional relative orientation process, the images are connected one by one. Here, a block adjustment is performed each time the number of connected images increases by a certain amount. This strategy is applied to avoid error accumulation when connecting images one by one. The threshold should be determined according to the accuracy and efficiency of the relative orientation process; our empirical value is 50 images. After relative orientation, the positions and attitudes of all images in the local coordinate system are obtained and a scalable model can be built, as shown in Fig. 4.

Fig. 4 A scalable model of a scene where the relative positions and attitudes of all images are demonstrated around the scene
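
As a rough illustration of the pairwise step of relative orientation, the sketch below recovers the relative rotation and translation direction of one image with respect to another from tie points, using OpenCV's essential matrix routines and an assumed camera matrix K; the incremental connection with a block adjustment every 50 images described above is not shown here.

```python
import cv2
import numpy as np

def relative_orientation(pts1, pts2, K):
    """Recover relative rotation R and translation direction t from matched tie points.

    pts1, pts2 : Nx2 float32 arrays of conjugate image points.
    K          : 3x3 camera matrix (assumed known here).
    """
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Decompose E, choosing the (R, t) that places the points in front of both cameras.
    _, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t, pose_mask

# Hypothetical camera matrix for illustration (focal length and principal point in pixels).
K = np.array([[3000.0, 0.0, 2000.0],
              [0.0, 3000.0, 1500.0],
              [0.0, 0.0, 1.0]])
# R, t, inliers = relative_orientation(tie1, tie2, K)
```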

3.2 Bundle block adjustment

BBA further determines the camera parameters (including EOPs and IOPs) and improves their accuracy using the tie point observations and other given information. It is a significant and essential process for 3D modeling. The accuracy of BBA directly affects the accuracy of the reconstructed 3D model. Besides, good accuracy can largely improve the efficiency of the MVS, since the EOPs and IOPs are used to predict the positions of conjugate points on the corresponding images during the MVS process. The higher the accuracy of the camera parameters, the quicker the MVS process and the more accurate the final 3D model.

3.2.1 Imaging geometry

A ground point P(X, Y, Z) is imaged by a camera with parameters (Xs, Ys, Zs, ϕ, ω, κ), known as EOPs, and (f, x0, y0, k1, k2), known as IOPs. Then, an image point p(x, y) corresponding to the ground point P is obtained in the image. The camera lens center is defined as the perspective center S. The ground point P, its corresponding image point p and the perspective center S lie on the same line; this relationship is described by Eqs. (1), (2) and (3).

$$\left[ {\begin{array}{*{20}c} {x - \Delta x} \\ {y - \Delta y} \\ { - f} \\ \end{array} } \right] = R^{T} \left[ {\begin{array}{*{20}c} {X - Xs} \\ {Y - Ys} \\ {Z - Zs} \\ \end{array} } \right]$$
(1)
$$R = R(\phi,\omega,\kappa) = \left[ {\begin{array}{*{20}c} {a_{1} } & {a_{2} } & {a_{3} } \\ {b_{1} } & {b_{2} } & {b_{3} } \\ {c_{1} } & {c_{2} } & {c_{3} } \\ \end{array} } \right]$$
(2)

Combining (1) and (2), we have

$$\left[ {\begin{array}{*{20}c} {x - \Delta x} \\ {y - \Delta y} \\ { - f} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {a_{1} } & {b_{1} } & {c_{1} } \\ {a_{2} } & {b_{2} } & {c_{2} } \\ {a_{3} } & {b_{3} } & {c_{3} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {X - Xs} \\ {Y - Ys} \\ {Z - Zs} \\ \end{array} } \right]$$
(3)

f is the focal length of the camera; it can be read from the auxiliary data. \(\Delta x,\Delta y\) in Eqs. (1) and (3) are the corrections for the image point coordinates. They are expressed by the IOPs, namely the lens distortion parameters \(k_{1}\), \(k_{2}\) and the principal point translation parameters \(x_{0}\), \(y_{0}\), as shown in Eqs. (4) and (5). This interior orientation model is applied to eliminate the distortions and other deformations of common digital cameras.

$$\left\{ {\begin{array}{*{20}c} {\Delta x = x_{0} + k_{1} (x - x_{0} )r^{2} + k_{2} (x - x_{0} )r^{4} } \\ {\Delta y = y_{0} + k_{1} (y - y_{0} )r^{2} + k_{2} (y - y_{0} )r^{4} } \\ \end{array} } \right.$$
(4)
$$r = \sqrt {\left( {x - x_{0} } \right)^{2} + \left( {y - y_{0} } \right)^{2} }$$
(5)
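
The imaging geometry of Eqs. (1)–(5) can be summarized in a short projection routine. The numpy sketch below assumes a particular rotation order for R(ϕ, ω, κ), which is not spelled out in the text, and evaluates the corrections Δx, Δy at the ideal image coordinates, a common approximation when the distortions are small.

```python
import numpy as np

def rotation_matrix(phi, omega, kappa):
    """R(phi, omega, kappa) of Eq. (2); the phi-omega-kappa order is an assumption."""
    c, s = np.cos, np.sin
    R_phi = np.array([[c(phi), 0, -s(phi)],
                      [0, 1, 0],
                      [s(phi), 0, c(phi)]])
    R_omega = np.array([[1, 0, 0],
                        [0, c(omega), -s(omega)],
                        [0, s(omega), c(omega)]])
    R_kappa = np.array([[c(kappa), -s(kappa), 0],
                        [s(kappa), c(kappa), 0],
                        [0, 0, 1]])
    return R_phi @ R_omega @ R_kappa

def corrections(x, y, x0, y0, k1, k2):
    """Delta-x, Delta-y of Eqs. (4)-(5): principal point offset plus radial distortion."""
    r2 = (x - x0) ** 2 + (y - y0) ** 2
    dx = x0 + k1 * (x - x0) * r2 + k2 * (x - x0) * r2 ** 2
    dy = y0 + k1 * (y - y0) * r2 + k2 * (y - y0) * r2 ** 2
    return dx, dy

def project(P, eops, iops):
    """Project ground point P = (X, Y, Z) into the image of a camera with the given EOPs/IOPs."""
    Xs, Ys, Zs, phi, omega, kappa = eops
    f, x0, y0, k1, k2 = iops
    R = rotation_matrix(phi, omega, kappa)
    u, v, w = R.T @ (np.asarray(P, float) - np.array([Xs, Ys, Zs]))   # Eq. (1), right-hand side
    x_ideal = -f * u / w                                              # collinearity: x - dx
    y_ideal = -f * v / w                                              # collinearity: y - dy
    dx, dy = corrections(x_ideal, y_ideal, x0, y0, k1, k2)            # approximated at ideal coords
    return x_ideal + dx, y_ideal + dy
```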

3.2.2 Solving normal equation

To solve for the EOPs and IOPs of all cameras and all the ground point coordinates (GPC) based on the collinearity condition, we build error equations from Eq. (3) according to the Levenberg–Marquardt (LM) model, giving:

$$V = AX - L$$
(6)

where \(V\) is the residual vector, \(A\) is the matrix consisting of the first-order derivatives of Eq. (3) with respect to the unknowns (camera parameters and GPC), also called the Jacobian matrix, \(X\) is the unknown vector, and \(L\) is the discrepancy vector of the image points.

Then, we build the normal equation. Meanwhile, a damping term \(\lambda D\) is added in case the normal matrix \(A^{T} A\) is rank deficient, which would make the normal equation unsolvable. So we have Eq. (7):

$$(A^{T} A + \lambda D)X = A^{T} L$$
(7)

where the matrix \(D\) is usually the diagonal of the matrix \(A^{T} A\), and \(\lambda\) is a damping value in (0, 1) that is adjusted according to the result of each iteration.

The Jacobian matrix \(A\) can be partitioned into two parts, a camera part and a ground point part, so the matrix \(A\) can be rewritten as \(A = [\begin{array}{*{20}c} {A_{C} } & {A_{P} } \\ \end{array} ]\); the same can be done for \(D = \left[ {\begin{array}{*{20}c} {D_{C} } & {D_{P} } \\ \end{array} } \right]\) and \(X = \left[ {\begin{array}{*{20}c} {X_{C} } & {X_{P} } \\ \end{array} } \right]\). Then, the normal equation can be rewritten as follows:

$$\left[ {\begin{array}{*{20}c} {A_{C}^{T} A_{C} + \lambda D_{C} } & {A_{C}^{T} A_{P} } \\ {A_{P}^{T} A_{C} } & {A_{P}^{T} A_{P} + \lambda D_{P} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {X_{C} } \\ {X_{P} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {A_{C}^{T} L} \\ {A_{P}^{T} L} \\ \end{array} } \right]$$
(8)

Let \(V_{C} = A_{C}^{T} A_{C} + \lambda D_{C}\), \(V_{P} = A_{P}^{T} A_{P} + \lambda D_{P}\), \(W = A_{C}^{T} A_{P}\), \(L_{C} = A_{C}^{T} L\), \(L_{P} = A_{P}^{T} L\), and we have Eqs. (9) and (10).

$$\left[ {\begin{array}{*{20}c} {V_{C} } & W \\ {W^{T} } & {V_{P} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {X_{C} } \\ {X_{P} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {L_{C} } \\ {L_{P} } \\ \end{array} } \right]$$
(9)
$$SX_{C} = B$$
(10)

where

$$S = V_{C} - WV_{P}^{ - 1} W^{T}$$
(11)
$$B = L_{C} - WV_{P}^{ - 1} L_{P}$$
(12)

The unknown parameters \(X_{C}\) can be calculated from Eq. (10), and \(X_{P}\) can then be obtained by back-substitution into Eq. (9). The size of the normal matrix is reduced to the size of the unknown camera parameter part. This process is the so-called Schur complement.
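
A dense numpy sketch of Eqs. (7)–(12) is given below; it forms the damped camera and point blocks, eliminates the point unknowns via the Schur complement and back-substitutes. A real bundle adjustment would exploit the block-sparse structure of \(V_P\) instead of inverting it densely; the dense matrices here are for clarity only.

```python
import numpy as np

def schur_solve(A_C, A_P, L, lam):
    """One damped normal-equation solve following Eqs. (7)-(12).

    A_C, A_P : Jacobian blocks w.r.t. camera and ground point unknowns.
    L        : discrepancy vector of the image points.
    lam      : LM damping value.
    """
    N_C, N_P = A_C.T @ A_C, A_P.T @ A_P
    V_C = N_C + lam * np.diag(np.diag(N_C))      # damped camera block
    V_P = N_P + lam * np.diag(np.diag(N_P))      # damped point block
    W = A_C.T @ A_P
    L_C, L_P = A_C.T @ L, A_P.T @ L

    V_P_inv = np.linalg.inv(V_P)                 # block diagonal, cheap to invert in practice
    S = V_C - W @ V_P_inv @ W.T                  # Eq. (11): Schur complement
    B = L_C - W @ V_P_inv @ L_P                  # Eq. (12): reduced right-hand side

    X_C = np.linalg.solve(S, B)                  # Eq. (10): camera update
    X_P = V_P_inv @ (L_P - W.T @ X_C)            # back-substitution into Eq. (9)
    return X_C, X_P
```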

3.2.3 Conjugate gradient methods

Conventional BBA uses the LM and Schur complement methods to solve the normal equation. But when the number of images grows, the normal matrix becomes too large to be stored and inverted in the computer. Most researchers then choose the conjugate gradient (CG) algorithm.

The conjugate gradient algorithm was first proposed by Hestenes and Stiefel (1952); it is an iterative method for solving linear symmetric positive definite systems. During the CG iteration, an initial vector \(x_0\) is given as the approximate initial answer of the normal equation, and a new vector \(x_1\) is then computed from \(x_0\) and other given parameters. After the process is repeated a certain number of times n, it eventually converges to a vector \(x_n\), which is the final answer of the normal equation. The main advantage of CG is that it avoids matrix–matrix multiplications and matrix inversion, which are both time-consuming computations; only matrix–vector multiplications are needed.

The number of iterations to convergence is related to the normal matrix. In exact arithmetic, CG converges in at most n iterations, where n is the order of the normal matrix, and its convergence speed depends on the condition number of the matrix. It has been reported, however, that after r (much smaller than n) iterations, the solution \(x_r\) is already close enough to the true answer. If one needs to further improve the convergence speed, a proper preconditioner should be used; this method is called preconditioned conjugate gradient (PCG). The PCG method applies a preconditioner \(M^{ - 1}\) to the normal matrix so as to decrease its condition number and thus accelerate the iteration process. After applying a preconditioner, Eq. (10) can be rewritten as follows:

$$M^{ - 1} SX_{C} = M^{ - 1} B$$
(13)

The convergence speed is now governed by the condition number of the matrix \(M^{ - 1} S\). The main task shifts to finding a proper preconditioner which not only decreases the condition number of the normal matrix but is also easy to invert. The simplest and most widely used preconditioner is the block Jacobi preconditioner, which uses a block diagonal of the normal matrix as the preconditioner. Other preconditioners, such as the symmetric successive over-relaxation (SSOR) preconditioner (Agarwal et al. 2010), the QR factorization preconditioner (Byröd and Åström 2010), the balanced incomplete factorization-based preconditioner (Bru et al. 2008), the multiscale preconditioner (Byröd and Åström 2009) and the subgraph preconditioner (Jian et al. 2011), can be more efficient but may be more complicated and less stable.
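
A minimal PCG loop for Eq. (13) is sketched below. Only matrix–vector products with \(S\) are required, and the preconditioner is applied through a user-supplied function; a plain (diagonal) Jacobi preconditioner is used in the usage line purely for illustration.

```python
import numpy as np

def pcg(S_apply, b, M_inv_apply, x0=None, tol=1e-8, max_iter=500):
    """Solve S x = b with preconditioned conjugate gradients (Eq. (13)).

    S_apply     : function returning S @ v for a vector v.
    M_inv_apply : function applying the inverse preconditioner to a vector.
    """
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - S_apply(x)            # residual
    z = M_inv_apply(r)            # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Sp = S_apply(p)
        alpha = rz / (p @ Sp)
        x += alpha * p
        r -= alpha * Sp
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv_apply(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Illustration with a dense S and a plain Jacobi (diagonal) preconditioner:
# X_C = pcg(lambda v: S @ v, B, lambda v: v / np.diag(S))
```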

The PCG algorithm can greatly decrease the memory requirement of the normal equation, especially for a large number of images. As reported in Zheng et al. (2016), when the number of images exceeds 5000, the memory requirement of the normal equation is more than 6.7 GB, which is too large for a common computer. Although some high-performance computers can spare this much memory, the computation efficiency is compromised when a large portion of RAM is occupied by the normal equation. More information can be found in Zheng et al. (2016).

4 3D model reconstruction

After the relative orientation and BBA processes, the EOPs and IOPs of the cameras are recovered, and the 3D coordinates of the tie points are also obtained. But these points are not dense enough to express the 3D model, so a dense matching procedure is still necessary to produce a dense point cloud of the scene. Dense feature points are first extracted by Harris or other effective feature point extraction algorithms. Then, the well-known MVS algorithm is adopted to extract the dense 3D point cloud. In this paper, we adopt a patch-based MVS (PMVS) algorithm; the details can be found in Furukawa and Ponce (2010). The EOPs and IOPs are important orientation parameters used in PMVS to predict the potential positions of the conjugate points in the corresponding images. Some gross points may exist due to low contrast and weak texture; thus, a blunder detection and elimination algorithm should be applied to remove them.

To build a 3D model, the dense point cloud needs to be further processed. The Poisson surface reconstruction (PSR) algorithm is adopted to generate the triangulated mesh model (Furukawa and Ponce 2010). A coarse model, called the initial model, is first generated, and then a refined model is extracted from this initial model and related information by abandoning the outliers according to the method proposed in Furukawa and Ponce (2010). This process is demonstrated in Fig. 5.

Fig. 5 Reconstruction of a 3D model from a dense point cloud: the left image shows the point cloud, the middle a coarse 3D model, and the right a refined and accurate 3D model
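
The PMVS and mesh refinement modules used in this work were implemented by the authors; purely as an illustration of the Poisson meshing step, the sketch below uses the open-source Open3D library, which exposes a Poisson surface reconstruction with the same octree-depth parameter (10 in our experiments). The input file name is a placeholder, and the density-based pruning at the end is only a crude stand-in for the refinement described above.

```python
import numpy as np
import open3d as o3d

# Load the dense point cloud and estimate normals (Poisson reconstruction needs them).
pcd = o3d.io.read_point_cloud("dense_points.ply")      # placeholder file name
pcd.estimate_normals()

# Poisson surface reconstruction with octree depth 10, as in our experiments.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=10)

# Crude stand-in for the refinement step: drop poorly supported (low-density) vertices.
densities = np.asarray(densities)
mesh.remove_vertices_by_mask(densities < np.quantile(densities, 0.05))
o3d.io.write_triangle_mesh("mesh.ply", mesh)
```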

5 Experiments and analysis

5.1 Dataset

There are four scenes of test images in total, all photographed with hand-held digital cameras. The first scene is a man sitting in a chair, the second scene is a cabbage, and the third and fourth scenes are both statues. The data information is listed in Table 1, and the test images are shown in Fig. 6. All images were preprocessed before the 3D model reconstruction procedure.

Table 1 Test data information
Fig. 6 Images of the four scenes

In all the experiments, the SIFT algorithm is implemented with the well-known OpenCV library (release 2.4.4), which is available at http://opencv.org/. The BBA module was developed by the authors according to the method described in Sect. 3.2 and in Zheng et al. (2016). The PMVS module was also developed by the authors according to the method in Furukawa and Ponce (2010). The octree depth in the Poisson reconstruction process is 10. All the experiments were performed on a common laptop computer equipped with an Intel(R) Core(TM) i5-3320M CPU at 2.60 GHz, 8.00 GB RAM and a 64-bit Windows 7 operating system.

We successively performed tie point extraction, relative orientation, BBA, dense point cloud extraction and 3D model reconstruction. The test results and analysis are presented in the following subsections.

5.2 Accuracies of bundle block adjustment

Four scenes of images are tested in this paper; the root-mean-square errors (RMSE) of the image point reprojection errors are shown in Table 2.

Table 2 RMSE of the reprojection error after the BBA with four scenes

As can be seen in Table 2, after BBA, the RMSE of the image points improves from 1–2 pixels to about 0.5 pixels. This is also clearly demonstrated in Fig. 7. It indicates that the interior orientation model is quite suitable for hand-held digital cameras: BBA with these images achieves sub-pixel accuracy. Thus, 3D model reconstruction using common hand-held digital cameras is practical.

Fig. 7 The x-axis in the figures represents the x-axis of the image coordinate system, and the y-axes in a, b represent the RMSE of the reprojection error in the x direction before and after BBA, respectively

5.3 Dense point cloud and 3D model

After BBA, high-precision EOPs and IOPs are obtained. These parameters are then used in dense point cloud extraction; MVS uses the EOPs and IOPs to predict the conjugate image points. The higher the accuracy of the EOPs and IOPs, the quicker the MVS process and the more accurate the 3D model. Four models are reconstructed from the four image clusters, as shown in Fig. 8.

Fig. 8 3D model reconstruction of four scenes, where the left image is the raw image; the middle image is the screenshot of the 3D dense point cloud; and the right image is the screenshot of the reconstructed 3D model

As demonstrated in Fig. 8, all the reconstructed 3D models are basically acceptable. The models are elaborate with respect to the real objects even though some 2D features are hard to extract (such as in the low-contrast area in the images of statue 2). The low-contrast area demonstrated in Fig. 9 is also modeled well. The 3D model reconstructed from the high-resolution digital camera (dataset 4 in Table 2) is better than the others, mainly because of the disparities in resolution and lens quality.

Fig. 9 The reconstruction performance in the low-contrast area as shown in red and blue rectangles in the source image and 3D model, respectively (color figure online)

5.4 Compared to other commercial software

Two of the above scenes were also processed with the Pix4D software (version 1.1.38, 64 bit, https://pix4d.com/). The main parameter settings are shown in Table 3. The results of Pix4D and of our method are shown in Figs. 10 and 11.

Table 3 Parameter settings of Pix4D in the experiment
Fig. 10 3D model reconstruction of the cabbage by Pix4D (top) and our method (bottom), respectively. From left to right: the raw image; a screenshot of the 3D dense point cloud; a screenshot of the reconstructed 3D model; and a screenshot of the textured 3D model

Fig. 11 3D model reconstruction of statue 1 by Pix4D (top) and our method (bottom), respectively. From left to right: the raw image; a screenshot of the 3D dense point cloud; a screenshot of the reconstructed 3D model; and a screenshot of the textured 3D model

As shown in Figs. 10 and 11, the output point clouds of Pix4D and of our method are almost the same, but the reconstructed models differ. There are fewer outliers in our result than in that of Pix4D. This is mainly attributable to the refinement process of our method mentioned in Sect. 4. A corresponding analysis of the Pix4D result is not possible since the specific method used in Pix4D is unknown.

To verify the accuracy, we measured some distances in the 3D models reconstructed by Pix4D and by our method, respectively, as shown in Figs. 12 and 13. Only the relative error is meaningful, since the 3D models are scale-free and no control points were measured. The relative accuracy can be assessed by comparing our method with Pix4D. We assume that the accuracy of Pix4D is well established, since it is mature commercial software; if our method achieves the same accuracy as Pix4D, its accuracy should be acceptable.

Fig. 12 Measurements in the 3D model of the cabbage: the left image shows the 3D model reconstructed by our method, and the right image shows the 3D model reconstructed by Pix4D

Fig. 13 Measurements in the 3D model of statue 1: the left image shows the 3D model reconstructed by our method, and the right image shows the 3D model reconstructed by Pix4D

Obviously, there is a scale factor between the Pix4D model and our model, and different models have different scales. To compare the accuracy of our method with Pix4D, the average scale factor is first calculated by the following equation:

$$\hbox{sc} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\frac{{L_{\rm ours}^{i} }}{{L_{\rm pix4D}^{i} }}}$$
(14)

where sc is the average scale factor, N is the number of measured lines, \(L_{\rm ours}^{i}\) is the measured length of line i in our model, and \(L_{\rm pix4D}^{i}\) is the measured length of the line i in Pix4D model.

Then, the error and relative error of our model with respect to the Pix4D model are calculated by the following equations:

$$e^{i} = L_{\rm ours}^{i} - L_{\rm pix4D}^{i} \cdot sc$$
(15)
$$r^{i} = e^{i} /L_{\rm ours}^{i}$$
(16)

where \(e^{i}\) is the error of line i, and \(r^{i}\) is the relative error of line i.
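
The computation of Eqs. (14)–(16) is straightforward; the short sketch below shows it with placeholder length values (not the measurements of Table 4).

```python
import numpy as np

# Lengths of the same lines measured in both models (placeholder values).
L_ours  = np.array([0.412, 0.287, 0.653])
L_pix4d = np.array([1.030, 0.721, 1.641])

sc = np.mean(L_ours / L_pix4d)        # Eq. (14): average scale factor
e  = L_ours - L_pix4d * sc            # Eq. (15): error of each line
r  = e / L_ours                       # Eq. (16): relative error of each line
print(f"scale factor = {sc:.4f}, relative errors (%) = {np.round(100 * r, 2)}")
```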

As can be seen in Table 4, the relative error of our 3D model with respect to Pix4D is about 1 %, which is acceptable. According to this relative accuracy, if the true length is 1 m, the error is about 0.01 m. It also indicates that the 3D models reconstructed by our method are as accurate as those of Pix4D.

Table 4 Comparison of accuracies of 3D models reconstructed by our method and Pix4D

5.5 Potential applications

The 3D models reconstructed with common hand-held cameras have relatively good accuracy, which has potential applications in many fields. In indoor environments where the GPS signal is unavailable, one can walk through the indoor area while holding a camera and taking pictures; then the 3D model of the indoor scene and the camera trajectory can be reconstructed and restored with our method, which is very useful for indoor navigation. The same can be done at a crime scene: to protect the scene, officers only have to take as many pictures as possible without touching any object, and a high-precision 3D model of the crime scene can then be reconstructed in the laboratory. If a high-resolution camera is available, precious historical relics can be preserved by reconstructing accurate 3D models of them using this technology; if they are ever destroyed, the accurate 3D model will help engineers to rebuild or restore the relics. As reported in Krasić and Pejić (2014), the remains of the Nis Palace were successfully reconstructed using the photogrammetry method. Although our method is capable of dealing with large-scale data, these applications still need to be further tested with more datasets.

6 Conclusion

We proposed a method for 3D modeling with common hand-held cameras. The novelty of this work is not a pure mathematical algorithm but the whole framework of an inexpensive and convenient method to reconstruct 3D models with hand-held cameras. Our method can thus decrease the cost and may bring more people to participate in virtual reality activities, which will undoubtedly promote the development of virtual reality. Besides, the PCG algorithm is introduced to solve the normal equation in the bundle adjustment instead of the conventional LM model, which gives our method the potential capacity for big data (more than 5000 images in a scene). The whole procedure of 3D model reconstruction is briefly reviewed. In total, four scenes of images collected by hand-held cameras are tested. According to the test results and analysis, we can conclude that:

  1. After BBA with the test dataset, the accuracy can reach 0.5 pixels, which indicates that the adjustment model proposed in this paper is suitable for these cameras.

  2. The dense point cloud is elaborate, even in some low-contrast areas, and the final 3D models are acceptable. The 3D model reconstructed from the high-resolution digital camera is more elaborate than those from cameras with lower resolution. Overall, common hand-held cameras have high potential for 3D model reconstruction since they are more convenient and cheaper.

  3. Our experimental results are slightly better than those of Pix4D (a commercial photogrammetry software) in some respects, while the accuracy performance is about the same.

This technology has potential applications in indoor navigation, crime scene reconstruction and heritage preservation, for example, but tests on large-scale data remain to be done. Many specific problems of large-scale data still need to be solved; this is our next research interest. The authors also have to admit that more work is needed to improve the image matching strategies in occluded and shadowed areas, the robustness and efficiency of BBA, and the gross point detection and elimination strategy for both tie points and the dense point cloud. More experiments need to be carried out to further examine and verify this work.