1 Introduction

With the rapid development of technologies such as augmented reality and virtual reality, the demand for geometric 3D models is rising. 3D reconstruction is an important research topic in the field of modeling, and its innovation and optimization have become correspondingly important. Computer hardware and software have improved rapidly, and many excellent, now-classic 3D reconstruction algorithms have been applied successfully in a variety of fields; the reconstruction and rendering accuracy of large-scale, highly complex 3D scenes and the real-time performance of these algorithms keep improving, while the difficulty of applying them keeps falling. Early 3D reconstruction relied mainly on modeling tools such as AutoCAD and 3DMAX, but for large-scale scenes these tools could not overcome the excessive workload, and the reconstructed scenes lacked realism. Three-dimensional reconstruction with a laser scanner was therefore proposed. Its reconstruction accuracy is high, and the reconstructed three-dimensional model is accurate and close to reality, but the required scanning time is extremely long: a single scan of a larger scene often takes hours or even days [18]. In addition, optical devices such as laser scanners are expensive and unsuitable for general applications, and in some special fields, such as the protection of cultural relics, it is often impossible to illuminate the objects to be reconstructed with a laser [10, 16]. By comparison, 3D reconstruction based on binocular stereo vision uses photographs of the real scene as its main data source and can faithfully reflect the real scene while guaranteeing good model precision, which gives it a great advantage over laser scanning [5].

Although the mathematical theory underlying binocular stereoscopic 3D reconstruction is well developed, the method has clear advantages over alternatives, and binocular stereo vision has mature and reliable implementations, existing systems are rarely applied to the 3D reconstruction of large-scale scenes, and many defects and shortcomings remain in the details [4, 7, 14]. Challenges such as insufficient calibration space and insufficient matching accuracy persist in large-scale scene reconstruction [15]. In view of these shortcomings, this paper optimizes and improves the original binocular-stereo-vision 3D reconstruction pipeline [13]. First, during camera calibration, a calibration rod is selected as the calibration target. This overcomes the problem that checkerboards are too small: calibration rods expand the space that can be calibrated and are cheaper than checkerboards. Then, to ensure both the speed and the precision of stereo matching, the initial depth map obtained with a traditional matching cost function is optimized with the weighted least squares filter (WLSF), which greatly improves matching accuracy while preserving matching efficiency [3]. Finally, the point cloud is reconstructed: the three-dimensional coordinates of each point are computed by the similar-triangle method, yielding a dense point cloud of the large-scale scene [12, 19].

In detail, this paper provides the following key insights. Compared with the reconstruction of small objects, large-scale scene reconstruction based on binocular vision requires improvements to the calibration and matching algorithms to prevent outliers from appearing in the resulting 3D model. For the binocular vision algorithm, smoothing the depth map while preserving its accuracy effectively reduces the errors contained in the reconstructed 3D model.

2 Methodology

2.1 Technical approach for accurate 3-D reconstruction process

Figure 1 shows the flowchart of large-scale scene reconstruction. An improved calibration method is proposed for large-scale scenes, using a calibration rod instead of the traditional checkerboard. Weighted least squares filtering is used to refine the depth map. Finally, the point cloud is recovered after the camera calibration and stereo matching steps. Recovering dense point clouds is a core issue in binocular vision algorithms, and a number of scholars have studied this topic. In the following, we discuss the key steps of the point cloud reconstruction algorithm: camera calibration and stereo matching.
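For orientation, the following Python-style sketch shows how the stages of Fig. 1 fit together; every helper name is a hypothetical placeholder for the corresponding step detailed in Sections 2.2 to 2.4:

```python
# Hypothetical end-to-end pipeline mirroring Fig. 1. Each helper is a
# placeholder for the step described in the named section; see the
# sketches in Sections 2.2-2.4 for possible implementations.
def reconstruct_scene(calib_imgs, img_L, img_R, rod_world_points):
    # Sec. 2.2: rod-based calibration, then polar line (epipolar) correction
    params = calibrate_with_rod(calib_imgs, rod_world_points)
    img_L, img_R, params = polar_correct(img_L, img_R, params)
    # Sec. 2.3: initial NCC matching refined by weighted least squares filtering
    depth = wls_filter(ncc_disparity(img_L, img_R))
    # Sec. 2.4: similar-triangle triangulation of the dense point cloud
    return triangulate(depth, params)
```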

Fig. 1 Flowchart of large-scale scene reconstruction

2.2 Improved calibration algorithm using polar line correction

There are many camera calibration methods, such as direct linear transformation (DLT) calibration, radial alignment constraint (RAC) calibration, and Zhang's calibration. Zhang's calibration method is widely used in practical calibration experiments because of its simple procedure and high accuracy. A chessboard grid used to calibrate a large-scale scene is shown in Fig. 2.

Fig. 2 Checkerboard calibration picture in a large-scale scene

In our calibration experiments we first used a standard checkerboard, but found it difficult for the checkerboard to cover the whole calibration space, because of the long baseline of the binocular camera used in large-scale scenes. Moreover, owing to the limited resolution of the camera, corners are hard to recognize when the checkerboard is far from the camera. It is therefore difficult to complete calibration experiments in large-scale scenes with a standard chessboard grid.

To overcome this limitation of chessboard-grid calibration, a calibration method using a calibration rod, based on the flat (planar) calibration method, was proposed for large-scale scenes. Addressing the inconsistency of the left and right camera focal lengths during calibration, Yang F et al. [17] designed a calibration-rod algorithm that optimizes the focal length difference on the basis of Zhang's calibration method. Before the calibration images are acquired, the calibration rods are placed vertically on the ground in a fixed arrangement, with marks at fixed scale intervals serving as calibration points; a calibration plane can then be synthesized from the rods at several specific positions [9]. The camera parameters (the internal and external parameters) are then computed from the relationship between the world coordinates of the calibration points and their image coordinates [6]. Figure 3 shows an example of the rod arrangement and the resulting calibration space. Calibrating the stereo vision system in this way overcomes the shortcoming that the traditional checkerboard grid is difficult to apply to large-scale scenes, and expands the calibration space of the flat-panel calibration method to meet large-scale scene requirements.
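As a sketch of how such a calibration might be implemented (our illustration; the paper does not provide code), the synthesized rod planes can be fed to a standard Zhang-style routine such as OpenCV's calibrateCamera, once the world coordinates of the rod marks and their detected pixel positions are available:

```python
import numpy as np
import cv2

# Minimal sketch of rod-based calibration (our illustration). We assume the
# fixed scale marks on the vertically placed rods give known world
# coordinates and that their pixel positions have been detected in each
# view; a set of rods at known positions then acts as one planar target,
# so Zhang's method applies exactly as with a checkerboard.
def calibrate_with_rod(object_points, image_points, image_size):
    # object_points: list of (N, 3) float32 arrays of rod-mark world
    #                coordinates, one array per synthesized plane (Z = 0)
    # image_points:  list of (N, 2) float32 arrays of detected mark centers
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        object_points, image_points, image_size, None, None)
    return K, dist, rvecs, tvecs, rms   # intrinsics, distortion, extrinsics
```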

Fig. 3 a Calibration rod and camera position diagram. b Calibration rod. c Calibration image

In the calibration rod method, calibration accuracy is improved by adjusting the focal length in the calibration images. In the ideal binocular camera model, the left and right cameras have identical internal parameters, are placed on the same baseline, and have perfectly parallel optical axes. In practice, various factors make it difficult for the two cameras to have identical internal parameters. The calibration-rod algorithm corrects for the different focal lengths of the left and right cameras. Correcting the images also eliminates, to a certain extent, the error caused by the non-parallel optical axes, which improves calibration accuracy and speeds up calibration [1].

When a binocular industrial camera is used to capture pictures, the distortion and aberration of the camera's own imaging introduce large errors in stereo matching and cause mismatches. In this paper, polar line (epipolar) correction of the images is used to remove the error caused by the non-parallel optical axes.

The correction process for images mainly consists of two steps:

(1) The homography matrices \( H_L \), \( H_R \) corresponding to the left and right images are calculated using the internal and external camera parameters obtained from the initial calibration.

(2) Warp the images and modify the camera projection matrices. Each image is warped with its homography matrix, and the projection matrices are modified to \( {M}_L^{\ast }={H}_L{M}_L \), \( {M}_R^{\ast }={H}_R{M}_R \). Calibrating the cameras with the homography matrices gives:

$$ {M}_L^{\ast }={H}_L{M}_L={H}_L{K}_L{R}_L\left[I-{C}_L\right] $$
(1)
$$ {M}_R^{\ast }={H}_R{M}_R={H}_R{K}_R{R}_R\left[I-{C}_R\right] $$
(2)

\( K_L \), \( K_R \) are the internal parameter matrices of the left and right cameras, \( R_L \), \( R_R \) are their rotation matrices, and \( C_L \), \( C_R \) are their optical centers. Let \( e_L \), \( e_R \) be the epipoles of the left and right images, \( l_L \), \( l_R \) the polar lines, and \( u_L \), \( u_R \) the projections of scene points on the imaging planes. Let F be the fundamental matrix of the corrected images and λ ≠ 0. The necessary condition for the polar lines to coincide with corresponding rows in the two images is:

$$ {l}_R^{\ast }={e}_R^{\ast}\times {u}_R^{\ast }=\lambda {F}^{\ast }{u}_L^{\ast } $$
(3)
$$ {\left[1,0,0\right]}^T\times {\left[{u}^{\prime },\mathrm{v},1\right]}^T={\left[1,0,0\right]}^T\times {\left[u+d,v,1\right]}^T=\lambda {F}^{\ast }{\left[u,v,1\right]}^T $$
(4)

where:

$$ {F}^{\ast}\approx \left[\begin{array}{ccc}0& 0& 0\\ {}0& 0& 1\\ {}0& -1& 0\end{array}\right] $$
(5)

The correcting homography matrix is not unique. In order to select the best correcting homography matrix, the following derivation is made.

(1) Move the epipoles in the two images to infinity

Let \( {e}_L={\left[{e}_1,{e}_2,1\right]}^T \) be the epipole in the image, with \( {e}_1^2+{e}_2^2\ne 0 \). Rotating the polar line through \( e_L \) onto the u-axis maps this epipole to e ≈ [1, 0, 0]T, and the corresponding projection is:

$$ {\hat{H}}_L\approx \left[\begin{array}{ccc}{e}_1& {e}_2& 0\\ {}-{e}_2& {e}_1& 0\\ {}-{e}_1& -{e}_2& {e}_1^2+{e}_2^2\end{array}\right] $$
(6)
(2) Unify the polar lines

Since \( {e}_R^{\ast }={\left[1,0,0\right]}^T \) spans both the left and right null spaces of \( \hat{F} \), the modified fundamental matrix becomes:

$$ \hat{F}=\left[\begin{array}{ccc}0& 0& 0\\ {}0& \alpha & \beta \\ {}0& \gamma & \delta \end{array}\right] $$
(7)

The basic correcting homographies \( {\overline{H}}_L \), \( {\overline{H}}_R \) are selected such that α = δ = 0 and β = −γ:

\( {\overline{H}}_L={H}_S{\hat{H}}_L \), \( {\overline{H}}_R={\hat{H}}_R \), where \( {H}_S=\left[\begin{array}{ccc}\alpha \delta -\beta \gamma & 0& 0\\ {}0& -\gamma & -\delta \\ {}0& \alpha & \beta \end{array}\right] \)

so that

$$ {F}^{\ast }={\left({\hat{H}}_R\right)}^{-T}F{\left({H}_S{\hat{H}}_L\right)}^{-1} $$
(8)
(3) Select a pair of optimal homography matrices

Let \( {\overline{H}}_L \), \( {\overline{H}}_R \) be the basic correcting homographies. Matrices \( H_L \), \( H_R \) are also correcting homographies provided they obey \( {H}_R{F}^{\ast }{H}_L^T=\lambda {F}^{\ast } \), λ ≠ 0, which ensures that the images remain in the corrected state.

Expressing \( H_L \), \( H_R \) in terms of \( {\overline{H}}_L \), \( {\overline{H}}_R \) makes the meaning of the free parameters in this class of correcting homographies explicit:

$$ {H}_L=\left[\begin{array}{ccc}{l}_1& {l}_2& {l}_3\\ {}0& s& {u}_0\\ {}0& q& 1\end{array}\right]{\overline{H}}_L,{H}_R=\left[\begin{array}{ccc}{r}_1& {r}_2& {r}_3\\ {}0& s& {u}_0\\ {}0& q& 1\end{array}\right]{\overline{H}}_R $$
(9)

Where s ≠ 0 is the common vertical scale; \( u_0 \) is the common vertical offset; \( l_1 \), \( r_1 \) are the left and right distortions; \( l_2 \), \( r_2 \) are the left and right horizontal scales; \( l_3 \), \( r_3 \) are the left and right horizontal offsets; and q is the common projective distortion.
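The derivation above translates directly into code. The following NumPy sketch (our illustration, not the authors' implementation) recovers the epipoles from a fundamental matrix F, builds \( \hat{H} \) as in Eq. (6), and forms \( H_S \) from the entries of \( \hat{F} \) so that Eq. (8) yields \( F^{\ast } \); the free parameters of Eq. (9) are left at their defaults:

```python
import numpy as np

def epipoles(F):
    # Right null vector of F is the left epipole; left null vector the right.
    _, _, Vt = np.linalg.svd(F)
    e_L = Vt[-1] / Vt[-1, 2]            # F @ e_L = 0, scaled to [e1, e2, 1]
    _, _, Vt = np.linalg.svd(F.T)
    e_R = Vt[-1] / Vt[-1, 2]            # F.T @ e_R = 0 (assumes e[2] != 0)
    return e_L, e_R

def H_hat(e):
    # Eq. (6): rotate the epipole onto the u-axis and send it to infinity.
    e1, e2 = e[0], e[1]
    return np.array([[ e1,  e2, 0.0],
                     [-e2,  e1, 0.0],
                     [-e1, -e2, e1**2 + e2**2]])

def rectifying_homographies(F):
    e_L, e_R = epipoles(F)
    HL_hat, HR_hat = H_hat(e_L), H_hat(e_R)
    # F_hat now has the form of Eq. (7); read off alpha, beta, gamma, delta.
    F_hat = np.linalg.inv(HR_hat).T @ F @ np.linalg.inv(HL_hat)
    a, b = F_hat[1, 1], F_hat[1, 2]
    g, d = F_hat[2, 1], F_hat[2, 2]
    H_S = np.array([[a * d - b * g, 0.0, 0.0],
                    [0.0, -g, -d],
                    [0.0,  a,  b]])
    # Basic correcting pair of Eq. (8): H_L = H_S @ H_hat_L, H_R = H_hat_R.
    return H_S @ HL_hat, HR_hat
```

One can verify that with these matrices \( {\left({\hat{H}}_R\right)}^{-T}F{\left({H}_S{\hat{H}}_L\right)}^{-1} \) is proportional to the \( F^{\ast } \) of Eq. (5), so the warped image rows are corresponding polar lines.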

This paper recalibrates with the polar-corrected calibration images to make the calibration parameters more accurate. A set of calibration images taken by the left and right cameras, after polar line correction, is shown in Fig. 4.
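In practice, the correction can also be performed with standard library routines. A minimal OpenCV sketch (our illustration, assuming matched point pairs pts_L, pts_R and images img_L, img_R of size w × h are available):

```python
import cv2

# Practical polar line correction (our illustration). pts_L, pts_R are
# (N, 2) float arrays of matched points between the raw left/right images.
F, mask = cv2.findFundamentalMat(pts_L, pts_R, cv2.FM_RANSAC)
inl = mask.ravel() == 1
ok, H_L, H_R = cv2.stereoRectifyUncalibrated(pts_L[inl], pts_R[inl], F, (w, h))
left_rect = cv2.warpPerspective(img_L, H_L, (w, h))    # M_L* = H_L M_L
right_rect = cv2.warpPerspective(img_R, H_R, (w, h))   # M_R* = H_R M_R
# The corrected image pairs are then used to recalibrate, as described above.
```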

Fig. 4 Left and right polar-corrected calibration images

2.3 The exact depth map obtained by the weighted least squares filtering (WLSF) algorithm

After camera calibration yields the camera's internal and external parameters, we perform stereo matching on the left and right views of the reconstructed scene to obtain the depth map [2, 8]. Local stereo matching estimates the parallax within a local range, so it is called a support-region-based method. The energy function of a local stereo matching algorithm contains only a data term and no smoothing term, so its computational complexity is low; most real-time stereo matching algorithms are therefore local, but their matching accuracy is lower. In general, local matching is far ahead of global matching in speed but slightly deficient in accuracy. Weighing reconstruction speed against accuracy, this paper uses a stereo matching algorithm based on normalized cross-correlation (NCC) for the initial matching and then corrects and optimizes the initial result with the weighted least squares method. Provided the accuracy requirements are met, this greatly speeds up stereo matching [11].

The matching cost function measures the similarity between corresponding points of the left and right images. The normalized cross-correlation cost function used in this paper is:

$$ \mathrm{NCC}\left(p,d\right)=\frac{\sum_{\left(x,y\right)\in {W}_p}\left({I}_1\left(x,y\right)-\overline{I_1}\left({p}_x,{p}_y\right)\right)\cdotp \left({I}_2\left(x+d,y\right)-\overline{I_2}\left({p}_x+d,{p}_y\right)\right)}{\sqrt{\sum_{\left(x,y\right)\in {W}_p}{\left({I}_1\left(x,y\right)-\overline{I_1}\left({p}_x,{p}_y\right)\right)}^2\cdotp {\sum}_{\left(x,y\right)\in {W}_p}{\left({I}_2\left(x+d,y\right)-\overline{I_2}\left({p}_x+d,{p}_y\right)\right)}^2}} $$
(10)

Where NCC(p, d) is the similarity measure; the closer this value is to 1, the more similar the two matching windows are. The point p, with pixel coordinates \( \left({p}_x,{p}_y\right) \), is the point to be matched in the left image \( I_1 \); d is the horizontal offset between the candidate pixel in the right image \( I_2 \) and p; and \( W_p \) is the matching window centered on p.
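For illustration, the following NumPy sketch implements Eq. (10) by brute-force search over d (a didactic, unoptimized version; practical implementations vectorize the window sums or use integral images):

```python
import numpy as np

def ncc_disparity(I1, I2, max_d=64, r=3):
    """Brute-force NCC matching (Eq. 10): for each pixel of I1, pick the
    horizontal offset d in [0, max_d) whose (2r+1)x(2r+1) window in I2
    maximizes the normalized cross-correlation. Sign convention (x + d
    in I2) follows Eq. (10)."""
    H, W = I1.shape
    disp = np.zeros((H, W), np.float32)
    for y in range(r, H - r):
        for x in range(r, W - r - max_d):
            wL = I1[y-r:y+r+1, x-r:x+r+1].astype(np.float64)
            wL = wL - wL.mean()
            best, best_d = -1.0, 0
            for d in range(max_d):
                wR = I2[y-r:y+r+1, x+d-r:x+d+r+1].astype(np.float64)
                wR = wR - wR.mean()
                denom = np.sqrt((wL**2).sum() * (wR**2).sum()) + 1e-12
                score = (wL * wR).sum() / denom
                if score > best:
                    best, best_d = score, d
            disp[y, x] = best_d
    return disp
```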

Figure 5 shows the initial depth map. There are a large number of mismatched points in the image, and the edges between objects are very blurred.

Fig. 5 Initial depth map based on NCC

Because of these numerous mismatched points, the reconstruction would contain many errors and could not meet the goal of accurately reconstructing large-scale scenes; further denoising and smoothing of the initial depth map is required. Traditional filtering methods tend to lose edge details and achieve a poor denoising effect. The WLSF algorithm is therefore adopted to optimize the initial depth map; it has a strong ability to preserve edges while denoising.

In this paper, WLSF enhances the edges of objects in the depth map and smooths the remaining regions: the filter keeps the processed image as similar as possible to the original image while preserving edges. The mathematical expression of the WLSF is

$$ {\sum}_p\left\{{\left({O}_p-{I}_p\right)}^2+\lambda \left[{a}_{xp}(I){\left(\frac{\partial O}{\partial x}\right)}_p^2+{a}_{yp}(I){\left(\frac{\partial O}{\partial y}\right)}_p^2\right]\right\} $$
(11)

Where O is the filtered output image; I is the input initial image, here the initial depth map \( I={t}_b \); p denotes a pixel at the corresponding position; \( {a}_{xp} \) and \( {a}_{yp} \) are the smoothness weights; \( {\left({O}_p-{I}_p\right)}^2 \) is the data term; \( {a}_{xp}(I){\left(\frac{\partial O}{\partial x}\right)}_p^2+{a}_{yp}(I){\left(\frac{\partial O}{\partial y}\right)}_p^2 \) is the regularization term; and λ is the parameter balancing the two, set to 0.35 in this work.

To facilitate computation, the expression is written in matrix form, with the image before optimization denoted b and after optimization denoted f. The expression is then rewritten as:

$$ {\left(f-b\right)}^T\left(f-b\right)+\lambda \left({f}^T{D}_x^T{A}_x{D}_xf+{f}^T{D}_y^T{A}_y{D}_yf\right) $$
(12)

Where \( A_x \) and \( A_y \) are the diagonal matrices of the smoothness weights \( {a}_{xp} \) and \( {a}_{yp} \), and \( D_x \) and \( D_y \) are the forward and backward difference matrices. Denoting the matrix of the data term by R, the f that minimizes the above expression satisfies:

$$ \left(R+\lambda {L}_g\right)f=b $$
(13)

Where \( {L}_g={D}_x^T{A}_x{D}_x+{D}_y^T{A}_y{D}_y \); \( {a}_{xp}(g)={\left[{\left|\frac{\partial l}{\partial x}(p)\right|}^{\alpha }+\varepsilon \right]}^{-1} \), \( {a}_{yp}(g)={\left[{\left|\frac{\partial l}{\partial y}(p)\right|}^{\alpha }+\varepsilon \right]}^{-1} \); l is the logarithm of the luminance channel of the input initial image I; α is an amplification factor in the range [1, 2], set to 1.8 in this work; and ε is a small gain term that prevents division by zero, set to 0.0001.

From this equation, the matrix form f of the filtered image O is obtained:

$$ \mathrm{f}={F}_{\lambda }(b)={\left(R+\lambda {L}_g\right)}^{-1}b $$
(14)

In this formula, \( {F}_{\lambda }(b)\approx {\left(R+\lambda aL\right)}^{-1}b \), where \( \mathrm{L}={D}_x^T{D}_x+{D}_y^T{D}_y \) and a denotes a constant smoothness weight.

The depth image after WLSF is

$$ {t}_w={F}_{\lambda}\left({t}_b\right) $$
(15)

Where \( t_w \) is the depth image after WLSF and \( {F}_{\lambda}\left(\cdot \right) \) denotes the WLSF optimization. Figure 6 shows the depth map after WLSF optimization: the noise is significantly reduced, the object edges are smoother while the original edge details are kept, and no over-smoothing occurs.
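To make Eqs. (11)-(15) concrete, here is a minimal SciPy sketch (our illustration, not the authors' code) that assembles \( L_g \) as a sparse Laplacian and solves Eq. (13). It takes R as the identity matrix, as in the standard WLS formulation, uses the paper's parameter values λ = 0.35, α = 1.8, ε = 0.0001, and computes the guide l from the log of the input depth map:

```python
import numpy as np
from scipy.sparse import coo_matrix, identity
from scipy.sparse.linalg import spsolve

def wls_filter(t_b, lam=0.35, alpha=1.8, eps=1e-4):
    """WLS smoothing of the initial depth map t_b (Eqs. 11-15). Edge-aware
    weights a_x, a_y are computed from log-gradients of the guide, so
    edges are preserved while flat regions are smoothed."""
    H, W = t_b.shape
    n = H * W
    l = np.log(np.abs(t_b.astype(np.float64)) + eps)      # log-luminance guide
    # Smoothness weights of Eq. (13): horizontal a_x, vertical a_y.
    ax = 1.0 / (np.abs(np.diff(l, axis=1))**alpha + eps)  # (H, W-1)
    ay = 1.0 / (np.abs(np.diff(l, axis=0))**alpha + eps)  # (H-1, W)
    idx = np.arange(n).reshape(H, W)
    # Off-diagonal entries of L_g = Dx^T Ax Dx + Dy^T Ay Dy (negative weights).
    rows = np.concatenate([idx[:, :-1].ravel(), idx[:, 1:].ravel(),
                           idx[:-1, :].ravel(), idx[1:, :].ravel()])
    cols = np.concatenate([idx[:, 1:].ravel(), idx[:, :-1].ravel(),
                           idx[1:, :].ravel(), idx[:-1, :].ravel()])
    vals = np.concatenate([-ax.ravel(), -ax.ravel(), -ay.ravel(), -ay.ravel()])
    Lg = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    # Diagonal makes each row sum to zero, as for a graph Laplacian.
    diag = -np.asarray(Lg.sum(axis=1)).ravel()
    Lg = Lg + coo_matrix((diag, (np.arange(n), np.arange(n))), shape=(n, n))
    # Eq. (13)-(14) with R = identity: (R + lam * L_g) f = b.
    A = identity(n, format='csr') + lam * Lg
    f = spsolve(A.tocsc(), t_b.ravel().astype(np.float64))
    return f.reshape(H, W)
```

Solving this sparse linear system once per image is what keeps the filter fast compared with iterative edge-preserving schemes.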

Fig. 6 The depth map after WLSF optimization

2.4 Point cloud reconstruction

In this paper, we need accurate 3D point cloud data of the reconstructed scene. The 3D data of an object in space comprises height, width, and depth, and the camera inevitably loses depth information during imaging. To restore three-dimensional geometric information such as an object's shape, the three-dimensional coordinates of the points on its surface must be recovered; the set of these surface points is the point cloud. The three-dimensional coordinates of these spatial points can be solved with the stereo vision principle, that is, from the projection relationship between the pixel coordinate system and the world coordinate system. Since the accurate depth map obtained in the previous step contains the parallax of the corresponding point for each pixel in the scene, this paper uses the solution based on the parallax principle.

As shown in Fig. 7, assume the parallel binocular camera has a translation distance b (the baseline length) in the x-axis direction, and let \( O_1 \), \( O_2 \) be the optical centers of the left and right cameras. For a point p on the object to be reconstructed, the projections of p onto the two camera imaging planes are \( p_1 \) and \( p_2 \), and \( p_1 \), \( p_2 \) lie on the same horizontal line, i.e., the image rows of \( I_1 \) and \( I_2 \) are collinear.

Fig. 7 Parallel binocular camera model

Assume the coordinate systems of the left and right cameras \( C_1 \), \( C_2 \) are \( {O}_1{x}_1{y}_1{z}_1 \) and \( {O}_2{x}_2{y}_2{z}_2 \). If the coordinates of point p are \( \left({x}_1,{y}_1,{z}_1\right) \) in the left camera system \( C_1 \), its coordinates in the right camera system \( C_2 \) are \( \left({x}_1-b,{y}_1,{z}_1\right) \). From the proportions of central (perspective) projection we obtain:

$$ \left\{\begin{array}{c}{u}_1-{u}_0={a}_x\frac{x_1}{z_1}\\ {}{v}_1-{v}_0={a}_y\frac{y_1}{z_1}\end{array}\right. $$
(16)
$$ \left\{\begin{array}{c}{u}_2-{u}_0={a}_x\frac{x_1-b}{z_1}\\ {}{v}_2-{v}_0={a}_y\frac{y_1}{z_1}\end{array}\right. $$
(17)

Where \( u_0 \), \( v_0 \), \( a_x \), \( a_y \) are the internal parameters of the two cameras, and \( \left({u}_1,{v}_1\right) \), \( \left({u}_2,{v}_2\right) \) are the image coordinates of point p. From the two equations above, the three-dimensional coordinates of point p are obtained:

$$ \left\{\begin{array}{c}{x}_1=\frac{b\left({u}_1-{u}_0\right)}{u_1-{u}_2}\\ {}{y}_1=\frac{b{a}_x\left({v}_1-{v}_0\right)}{a_y\left({u}_1-{u}_2\right)}\\ {}{z}_1=\frac{b{a}_x}{\left({u}_1-{u}_2\right)}\end{array}\right. $$
(18)
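Equation (18) translates directly into code. A minimal sketch (our illustration; parameter names follow the symbols above) that converts a disparity map \( d={u}_1-{u}_2 \) into a dense point cloud:

```python
import numpy as np

def triangulate(disp, b, ax, ay, u0, v0, min_disp=0.5):
    """Recover 3D points from a disparity map via Eq. (18).
    b: baseline; ax, ay: focal lengths in pixels; (u0, v0): principal point.
    Output coordinates are in the same unit as the baseline b."""
    H, W = disp.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    d = np.where(disp > min_disp, disp, np.nan)   # d = u1 - u2; drop invalid
    x = b * (u - u0) / d
    y = b * ax * (v - v0) / (ay * d)
    z = b * ax / d
    pts = np.dstack([x, y, z]).reshape(-1, 3)
    return pts[~np.isnan(pts).any(axis=1)]        # dense point cloud (N, 3)
```

Near-zero disparities are rejected because z grows without bound as \( {u}_1-{u}_2\to 0 \), which would otherwise inject outliers into the point cloud.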

3 Results and discussion

Figure 8 shows the left and right calibration images. Figure 9 shows the calibration image after the polar line correction. After correction, the binocular camera polar lines are parallel, and the corresponding calibration points are on the same horizontal line.

Fig. 8 The left and right calibration images

Fig. 9 Corrected left and right calibration images

We reconstruct the six calibration points in Fig. 8 from the calibration results and assess the accuracy of the calibration by comparing against the actual coordinates of the points. Table 1 shows the reconstruction results before and after correction for these six calibration points; the corrected calibration point errors are smaller and closer to the exact values.

Table 1 Y coordinate of calibration points

Figure 10 shows the reconstruction of all calibration points. Following Zhang's calibration principle, five calibration rods in the same plane form one calibration board, and each board carries 30 calibration points. Figure 11 shows the reconstruction of all 11 calibration boards. For each board we take the average error of all its calibration points as the error of that board; these values are listed in Table 2. Figure 12 plots the uncorrected and corrected errors of all calibration boards and their difference, together with the overall averages. The comparison shows that the corrected coordinate errors are smaller than the errors before correction; the average error over all calibration points is 1.276260 pixels, which meets the reconstruction requirements.

Fig. 10 Calibration point reconstruction

Fig. 11 Calibration board reconstruction

Table 2 Reprojection errors of the calibration points

Fig. 12 Reprojection errors of the calibration points

Figure 13a, b show a pair of left and right views captured simultaneously, containing two people (people1 and people2) and a bicycle. Figure 13c is the disparity map, and Fig. 13d, e show the depth map obtained after WLSF, in grayscale and as a pseudo-color heat map, respectively. Compared with the initial depth map (Fig. 5), the noise in the final depth map is significantly reduced, and the edge contours of the objects are smoother and more accurate.

Fig. 13 a Left image. b Right image. c Disparity map. d Depth map in grayscale. e Depth map as a pseudo-color heat map

We use the calibration results and the depth map to reconstruct the point cloud of the scene containing the bicycle and the people. Figure 14 shows the point cloud reconstructed from the refined depth map, with the points within the depth range of the people segmented out. We then use the height of the reconstructed people and the wheelbase of the bicycle to evaluate the point cloud error.

Fig. 14 Point cloud of people

Figure 15 and Table 3 show the results of reconstructing the bicycle wheelbase and the height of the people. The human height reconstruction error is within one centimeter. The error in the bicycle wheelbase is somewhat larger, at 26.7 mm, owing to the difficulty of identifying the center position of the bicycle axles. Given that the reconstructed scenes span 10 m to 20 m or even more, we consider this error within reasonable limits; the final reconstruction results can meet the needs of large-scale scenes.

Fig. 15 People height and bicycle axle reconstruction point cloud

Table 3 People height and bicycle wheelbase reconstruction errors

4 Conclusion

By optimizing and improving the binocular stereo vision reconstruction algorithm, 3D reconstruction of large-scale scenes is completed, with reconstructed scene sizes of 10 m to 20 m or more, solving the problem of binocular-stereo-vision 3D reconstruction in large-scale scenes.

The experimental results show that the optimized algorithm in this paper greatly expands the applicable scope of Zhang's calibration method, reduces depth map noise, and improves depth map accuracy, thereby yielding better point cloud reconstruction results.

The method in this paper is better suited than traditional methods to binocular 3D reconstruction in large-scale scenes and expands the applicable scope of binocular vision reconstruction. It is worth mentioning that the system also has the potential to perform high-speed three-dimensional reconstruction: compared with laser-based reconstruction, binocular vision reconstruction is much faster while maintaining accuracy, and can replace the laser for high-speed three-dimensional reconstruction.