1 Introduction

Given a set of 2D projections of the same 3D structure, e.g., each captured by a camera from a different viewpoint, bundle adjustment [1,2,3] simultaneously calculates the 3D world points and the camera view parameters that best fit these 2D projections. Bundle adjustment finds important applications in 3D reconstruction [4] from 2D images, such as structure from motion (SfM) [5, 6] and simultaneous localization and mapping (SLAM) [7]. This paper focuses on large-scale bundle adjustment, e.g., 3D reconstruction of an entire city from a very large collection of 2D images. For such tasks, a key consideration is the efficiency and scalability of the bundle adjustment solution, which rules out computationally expensive solutions such as those based on deep neural networks, e.g., [8,9,10,11]. Instead, in practice it is common to apply the Levenberg-Marquardt (LM) algorithm, which involves solving a large system of normal equations. Further, recent trends in the development of computational infrastructure indicate that it is critical to design a solution that well utilizes parallel computation resources. Hence, this paper focuses on building a scalable parallel optimizer that solves the normal equations in LM.

A classic solution for our problem is PBA [12, 13], which involves two key ideas to reduce the computational costs of the optimizer: Schur complement and preconditioned conjugate gradient. The latter (but not the former) is parallelized to utilize the capabilities of multicore CPUs and GPUs. Note that in PBA, the parallelization is done on a low level, i.e., on the level of matrix operations. To our knowledge, the core of this algorithm (i.e., its underlying math) has remained largely unimproved since its proposal almost a decade ago. Further, as our experimental evaluation shows, on some data-sets, PBA scales poorly with the number of processors, whose performance peaks at around 12-20 processors. To improve the scalability of bundle adjustment tasks, recent approaches have instead focused on higher-level parallelism (i.e., on the level of scenarios), which decompose a large bundle adjustment problem into smaller ones based on 3D world points, cameras, maps, and domains, and solve these smaller sub-programs in parallel [14, 15]. Meanwhile, several methods have addressed bundle adjustment computation in distributed settings with high communication costs (e.g., across data centers).

For instance, [16] proposes a consensus-based framework for distributed bundle adjustment based on proximal splitting, and a distributed alternating direction method of multipliers (ADMM) framework is proposed for very large-scale bundle adjustment problems [17,18,19], where an over-relaxation technique and self-adaption schemes are employed to improve the convergence rate. These solutions, however, tend to be less efficient since they often have linear or even sub-linear convergence rate, leading to numerous iterations and, thus, high computation and communication overhead. Additionally, robust parallel bundle adjustment (RPBA) [20] extends the consensus based optimization methods with covariance information to get a better convergence bahavior. Though RPBA is implemented upon covariance information which is very effective, the naive ADMM has slow convergence rate and produces more communication overhead.

This paper presents PPCG, an efficient and scalable low-level parallel optimizer for solving the LM normal equations in large-scale bundle adjustment, which is a direct improvement over the PBA algorithm. Specifically, PPCG is based on a novel parallel Schur complement method that effectively decomposes the Hessian matrix, exploiting its special sparsity conditions to reduce structure parameters. Note that the Schur complement module in PPCG is fundamentally different from that in PBA (i.e., the decomposition results can be considered as locally individual subsystems, which are different); meanwhile, we stress that unlike PBA, our Schur complement module runs in parallel on multiple processors, with a single Reduce operation to collect the results. Based on the decomposition results, PPCG then proceeds to perform preconditioned conjugate gradient, which is carefully designed to avoid expensive matrix operations and parallel Reduce steps. We formally prove the correctness of PPCG, and evaluate its performance through extensive experiments on the BAL benchmark datasets [2]. The evaluation results demonstrate that PPCG significantly outperforms PBA in terms of both acceleration and scalability.

2 Preliminaries on bundle adjustment

Before the application of bundle adjustment, the structure and camera parameters are roughly calculated by multi-view geometry methods. Since there are various distortions for uncalibrated camera models, the reprojection error is widely accepted to measure the accuracy. Generally considered as the back-end of SfM applications, bundle adjustment is an approach for estimating more accurate structure and camera parameters.

Let \(\boldsymbol {x} \in \mathbb {R}^{m_{s}}\), \(\boldsymbol {y} \in \mathbb {R}^{m_{c}}\) and \(\boldsymbol {z} \in \mathbb {R}^{m_{p}}\) be the vectors of structure parameters, camera parameters and coordinates of feature points in the images respectively. If the 3D points (structure parameters) are reprojected into images through the same cameras, the new reprojected feature points \(\boldsymbol {z}^{*} \in \mathbb {R}^{m_{p}}\) can be obtained by

$$ \begin{array}{@{}rcl@{}} \boldsymbol{z}^{*} = {{\varPhi}}(\boldsymbol{x}, \boldsymbol{y}) \end{array} $$
(1)

where Φ(⋅,⋅) is the reprojection function, including three-dimensional translations, rotations and projections. Due to optical distortions of cameras and truncation errors, the original feature points does not coincide with the reprojected feature points. Therefore, bundle adjustment can be represented by a least squares optimization problem, which is given by

$$ \arg \underset{\boldsymbol{z}^{*}}{\min}\lVert \boldsymbol{z} - \boldsymbol{z}^{*} \rVert^{2} = \underset{\boldsymbol{{{\varDelta}} x}, \boldsymbol{{{\varDelta}} y}}{\arg\min}\lVert \boldsymbol{z} - {{\varPhi}}(\boldsymbol{x}+\boldsymbol{{{\varDelta}} x}, \boldsymbol{y}+\boldsymbol{{{\varDelta}} y}) \rVert^{2}. $$
(2)

where \({{\varDelta }}{\boldsymbol {x}} \in \mathbb {R}^{m_{s}}\), \({{\varDelta }}{\boldsymbol {y}} \in \mathbb {R}^{m_{c}}\) are the increments of x, y respectively. As long as z represents the original featured points which are all constant, we can adjust Δx and Δy iteratively to satisfy the minimum reprojection error.

Assuming that f(x,y) = zz, Jx = f(x,y)/x and Jy = f(x,y)/y, The LM algorithm improves the original Gauss-Newton algorithm through gradient descent directions, then we can represent the above optimization problem approximately by

$$ \begin{array}{@{}rcl@{}} \underset{\boldsymbol{{{\varDelta}} x}, \boldsymbol{{{\varDelta}} y}}{\arg\min}&&\left\| f(\boldsymbol{x}, \boldsymbol{y}) + \begin{bmatrix}J_{\boldsymbol{x}} &J_{\boldsymbol{y}}\end{bmatrix}\begin{bmatrix}\boldsymbol{{{\varDelta}} x}\\\boldsymbol{{{\varDelta}} y}\end{bmatrix} \right\|^{2}\\ +\lambda&&\left\| \text{ diag}\left( \begin{bmatrix} \sqrt{J^{T}_{\boldsymbol{x}}J_{\boldsymbol{x}}} \boldsymbol{0}\\ \boldsymbol{0} &\sqrt{J^{T}_{\boldsymbol{y}}J_{\boldsymbol{y}}} \end{bmatrix}\right) \begin{bmatrix}\boldsymbol{{{\varDelta}} x}\\ \boldsymbol{{{\varDelta}} y}\end{bmatrix} \right\|^{2} \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} = \underset{\boldsymbol{{{\varDelta}} x}, \boldsymbol{{{\varDelta}} y}}{\arg\min}&&\left\| f(\boldsymbol{x}, \boldsymbol{y}) + J_{\boldsymbol{x}}\boldsymbol{{{\varDelta}} x} + J_{\boldsymbol{y}}\boldsymbol{{{\varDelta}} y} \right\|^{2}\\ &&{}+\lambda\text{ diag}\left( J^{T}_{\boldsymbol{x}}J_{\boldsymbol{x}}\right)\left\|\boldsymbol{{{\varDelta}} x}\right\|^{2} + \lambda \text{ diag}\left( J^{T}_{\boldsymbol{y}}J_{\boldsymbol{y}}\right)\left\|\boldsymbol{{{\varDelta}} y}\right\|^{2} \end{array} $$
(4)

where \(\lambda \in \mathbb {R}\) is the damping factor. While setting the derivative of (3) to zero, we can obtain the following normal equations,

$$ \begin{array}{@{}rcl@{}} \begin{bmatrix} U & W \\ W^{T} & V \end{bmatrix} \begin{bmatrix} \boldsymbol{{{\varDelta}} x} \\ \boldsymbol{{{\varDelta}} y} \end{bmatrix}= - \begin{bmatrix} {J}_{\boldsymbol{x}}^{T}f(\boldsymbol{x}, \boldsymbol{y}) \\ {J}_{\boldsymbol{y}}^{T}f(\boldsymbol{x}, \boldsymbol{y}) \end{bmatrix}, \end{array} $$
(5)

where the sub-blocks \(U=J^{T}_{\boldsymbol {x}}J_{\boldsymbol {x}} + \lambda ~\text { diag}\left ({J}_{\boldsymbol {x}}^{T}J_{\boldsymbol {x}}\right )\), \(V=J^{T}_{\boldsymbol {y}}J_{\boldsymbol {y}} + \lambda \text { diag}(J^{T}_{\boldsymbol {y}}J_{\boldsymbol {y}})\), \(W=J^{T}_{\boldsymbol {x}}J_{\boldsymbol {y}}\), and the coefficient matrix is the augmented Hessian matrix. Then U is an ms × ms block diagonal and positive definite matrix, V is an much smaller mc × mc block diagonal and positive definite matrix, and W is an ms × mc matrix.

Providing that the augmented Hessian matrix is highly sparse and the Hessian matrix U is block diagonal, the computation of the inverse of U is very cheap. Furthermore, the Schur complement trick [21] eliminates the structure parameters Δx through Gaussian Elimination. Therefore, the normal equations including only camera parameters, are highly reduced and become

$$ \begin{array}{@{}rcl@{}} (V - W^{T}U^{-1}W)\boldsymbol{{{\varDelta}} y} = W^{T}U^{-1}{J}_{\boldsymbol{x}}^{T}f(\boldsymbol{x}, \boldsymbol{y}) - {J}_{\boldsymbol{y}}^{T}f(\boldsymbol{x}, \boldsymbol{y}). \end{array} $$

Let R = VWTU− 1W and \(\boldsymbol {v} = W^{T}U^{-1}{J}_{\boldsymbol {x}}^{T}f(\boldsymbol {x}, \boldsymbol {y}) - J^{T}_{\boldsymbol {y}}f(\boldsymbol {x}, \boldsymbol {y})\), we then obtain the following system of linear equations,

$$ \begin{array}{@{}rcl@{}} R\boldsymbol{{{\varDelta}} y} &= \boldsymbol{v}. \end{array} $$
(6)

The coefficient matrix R is the Schur complement called the Reduced Camera Matrix (RCM), which is much smaller and symmetrical and positive definite, and v is the Reduced Camera Vector (RCV). Δy can be solved through various iterative algorithms, then Δx can be obtained by

$$ \begin{array}{@{}rcl@{}} \boldsymbol{{{\varDelta}} x} = -U^{-1}\left( J^{T}_{\boldsymbol{x}}f(\boldsymbol{x}, \boldsymbol{y}) + W\boldsymbol{{{\varDelta}} y}\right) \end{array} $$
(7)

3 Proposed solution

The bundle adjustment introduced in Section 2, contains great potential parallelization, and can be accelerated by parallel computing for large scale problems. This section firstly presents a parallel Schur complement method to eliminate the structure parameters by enhancing the matrix inverse and multiplication operations, and improves the classical preconditioned conjugate gradient (PCG) method through parallelization in the second place. Before discussing the parallel methods, we present the definitions of memory coherence and consistency.

Definition 1

(Memory Coherence) On a shared-memory parallel computing system with multiprocessors, there is only one processor (process or thread) can read from/write to a memory location such that the updated value can be observed by all the subsequent read operations of the corresponding memory location. We call this characteristic as memory coherence.

Definition 2

(Memory Consistency) On a distributed-memory parallel computing system where each node has its own memory space, if a memory location is written to, its copies on other nodes will be updated such that the updated values can be observed in their own local memory spaces. This characteristic is called memory consistency.

To prevent failures and errors caused by parallel computing systems, we assume that both of the following parallel methods are presented under the constraints of memory coherence and consistency.

3.1 Parallel Schur complement

The augmented Hessian matrix is approximately a block-diagonal arrow-head matrix, and the diagonal submatrices are independent from each other. Considering this particular structure, the augmented Hessian matrix can be divided into three parts depending on whether the local data are calculated from structure parameters, camera parameters or observations. The augmented Hessian and its divisions can be represented by

$$ \begin{array}{@{}rcl@{}} H^{*} = &\begin{bmatrix} {U^{s}_{1}} & & & & &W_{1}\\ &{U^{s}_{2}} & & & &W_{2}\\ & &{U^{s}_{3}} & & &W_{3}\\ & & &{\ddots} & &\vdots\\ & & & &{U^{s}_{n}} &W_{n}\\ {W}_{1}^{T} &{W}_{2}^{T} &{W}_{3}^{T} &{\cdots} &{W}_{n}^{T} &U^{c}\\ \end{bmatrix}\\\Rightarrow &\begin{cases} U &= {\text{ diag}}\left[{U}_{1}^{s}, {U}_{2}^{s}, ..., {U}_{n}^{s}\right],\\ V &= U^{c} = {\sum}_{i=1}^{n}{U^{c}_{i}},\\ W &= [W_{1}, W_{2}, ..., W_{n}]^{T} \\ \end{cases} \end{array} $$
(8)

where \({U}_{1}^{s}, {U}_{2}^{s}, ..., {U}_{n}^{s}\) are Hessian matrices of structure parameters for n 3D world points, \({U}_{i}^{c}\) is the Hessian matrix of the i-th camera, whereas Uc is the Hessian matrix of all cameras, and W1,W2,...,Wn are derived from the observations between 3D world points and cameras. Sequentially, the RCM R can be presented as

$$ \begin{array}{@{}rcl@{}} R = V - W^{T}U^{-1}W = {\sum}^{n}_{i=1}({U^{c}_{i}} - {W^{T}_{i}}({U^{s}_{i}})^{-1}W_{i}). \end{array} $$
(9)

Given a parallel computing system with p concurrent computing units, we can distribute (5) into p subsystems, and the j-th subsystem can be summarized by

$$ \begin{array}{@{}rcl@{}} &&\begin{bmatrix} {U^{s}_{j}} & & & & &W_{j}\\ &{\ddots} & & & &\vdots\\ & &U_{lp+j}^{s}& & &W_{lp+j}\\ & & &{\ddots} & &\vdots\\ & & & &U^{s}_{\phi(j)} &W_{\phi(j)}\\ {W^{T}_{j}} &{\cdots} &W^{T}_{lp+j}&{\cdots} &W^{T}_{\phi(j)} &V(j)\\ \end{bmatrix} \times \begin{bmatrix} \boldsymbol{{{\varDelta}} x}_{j}\\ \vdots\\ \boldsymbol{{{\varDelta}} x}_{lp+j}\\ \vdots\\ \boldsymbol{{{\varDelta}} x}_{\phi(j)}\\ \boldsymbol{{{\varDelta}} y}\\ \end{bmatrix}\\ &=&- \begin{bmatrix} J^{T}_{\boldsymbol{x}_{j}}f(\boldsymbol{x}_{j}, \boldsymbol{y})\\ \vdots\\ {J}_{\boldsymbol{x}_{lp+j}}^{T}f(\boldsymbol{x}_{lp+j}, \boldsymbol{y})\\ \vdots\\ {J}_{\boldsymbol{x}_{\phi(j)}}^{T}f(\boldsymbol{x}_{\phi(j)}, \boldsymbol{y})\\ {J}_{\boldsymbol{y}}^{T}f(\boldsymbol{x}_{j:p:\phi(j)}, \boldsymbol{y})\\ \end{bmatrix} \end{array} $$
(10)

where \(V(j) = {U}_{j}^{c} + {\cdots } + U^{c}_{lp+j} + {\cdots } + U_{\phi (j)}^{c} (1 \le l \le \lfloor n/p\rfloor )\), ϕ(j) = ⌊n/pp when j > n mod p and ϕ(j) = (⌊n/p⌋)p + j when jn mod p. If we denote \(U(j) = {\text { diag}}\left [{U}_{j}^{s}, \cdots , {U}_{lp+j}^{s}, ..., {U}_{\phi (j)}^{s}\right ]\) and W(j) = [Wj,⋯ ,Wlp+j,⋯ ,Wϕ(j)]T, the related Schur complement is V (j) − WT(j)U− 1(j)W(j).

Theorem 1

If the 3D world points of a bundle adjustment system are divided into p subsystems and these 3D world points are evenly observed by cameras, the distribution method achieves the best load balance for parallel Schur complement when the 3D points are distributed equally.

Proof

We use the function ψ(⋅) to map the global 3D world points to local 3D world points in subsystems, i.e.ψ(1) is the first local 3D world point from the ψ(1)-th global 3D world point, ψ(2) is the second local 3D world point from the ψ(2)-th global 3D world point, et al.

Supposing that nj(1 ≤ jp) 3D world points are assigned to the j-th subsystem, the j-th divisions are denoted by \( U^{\prime }(j) = \text { diag}\left [{U}_{\psi (1)}^{s}, {U}_{\psi (2)}^{s}, \cdots , {U}_{\psi (n_{j})}^{s}\right ]\), \(W^{\prime }(j) = \left [W_{\psi (1)}, W_{\psi (2)}, \cdots , W_{\psi (n_{j})} \right ]^{T}\), \( V^{\prime }(j) = {U}_{\psi (1)}^{c} + {U}_{\psi (2)}^{c} + {\cdots } + {U}_{\psi (n_{j})}^{c}, \psi (1) \le \psi (2) \le {\cdots } \le \psi (n_{j})\). Then, the Schur complement Rj for the j-th subsystem can be inferred out like the form of (8), which implies

$$ \begin{array}{@{}rcl@{}} R_{j} = \sum\limits_{i=\psi(1)}^{\psi(n_{j})}\left( {U^{c}_{i}} - {W^{T}_{i}}({U^{s}_{i}})^{-1}W_{i}\right). \end{array} $$

As these 3D world points are evenly observed by cameras, the sparsity of W1 to Wn are considered to be equivalent such that \({U^{c}_{i}} - {W}_{i}^{T}({U}_{i}^{s})^{-1}W_{i}\) takes the same amount of time for all i = 1,2,⋯ ,n. Therefore, we can use the number of \({U^{c}_{i}} - {W^{T}_{i}}({U}_{i}^{s})^{-1}W_{i}\) to represent the load of the subsystems. Thus, the overall load for the j-th subsystem is determined by \(\max \limits _{j} n_{j}\). When nj = ⌊n/p⌋, we obtain the minimum overall load and achieves the best load balance. □

Furthermore, without the assumption that 3D world points are evenly observed by cameras, \(W^{\prime }(j) = [W_{\psi (1)},\) \(W_{\psi (2)}, \cdots , W_{\psi (n_{j})} ]^{T}\) are not equally distributed. The observations are always dense at one place, and sparse at another place, which is the locality of observations. Taking advantage of the locality, the distribution method presented in (10), achieves better load balance than other equal distribution methods.

Note that direct computation of the RCM still involves expensive operations such as matrix multiplications, parallel reductions, and sparse matrix subtractions. The operations that are all geometrical in time complexity, can be greatly reduced or avoided by associating with the following preconditioning approach.

3.2 Parallel PCG

When the number of images and cameras grows, the size of RCM increases quadratically. The sparsity of RCM mainly depends on the observations, and it turns to be much smaller and much denser after Schur complement, which becomes an important part with respect to overall performance. Before proposing the PPCG algorithm, we put forward the following theorem.

Theorem 2

If a bundle adjustment system with the augmented Hessian matrix in the form of (5), is divided into p concurrent subsystems with the augmented Hessian matrices in the form of (10), then

$$ \begin{array}{@{}rcl@{}} {\sum}^{n}_{i=1}{W}_{i}^{T}({U}_{i}^{s})^{-1}W_{i} = \sum\limits_{j=1}^{p}W^{T}(j)U^{-1}(j)W(j). \end{array} $$
(11)

Proof

As the matrix U− 1(j) is block diagonal and positive definite, the inverse of it can be easily determined by U− 1(j) \(=\! \text { diag}\left [\left ({U^{s}_{j}}\right )^{-1}, \cdots ,\left (U^{s}_{lp+j}\right )^{-1} , ...,\left (U^{s}_{\phi (j)}\right )^{-1}\right ]\), which is purely block diagonal. We expand WT(j)U− 1(j)W(j) of the right hand in Theorem 2 for each subsystems, which leads to

$$ \begin{array}{@{}rcl@{}} &&\sum\limits_{j=1}^{p} W^{T}(j)U^{-1}(j)W(j) \\ &= &{\sum}^{p}_{j=1}\left[{W^{T}_{j}}\left( {U^{s}_{j}}\right)^{-1}, \cdots, W^{T}_{\phi(j)}\left( U^{s}_{\phi(j)}\right)^{-1}\right]W(j) \end{array} $$
$$ \begin{array}{@{}rcl@{}} &= &\!\sum\limits_{j=1}^{p}\left( {W^{T}_{j}}\left( {U^{s}_{j}}\right)^{-1}W_{j} \!+ {\cdots} +\! W^{T}_{\phi(j)}\left( U^{s}_{\phi(j)}\right)^{-1}W_{\phi(j)}\right)\\ &= &\sum\limits_{i=1}^{n}{{W^{T}_{i}}\left( {U^{s}_{i}}\right)^{-1}W_{i}}. \end{array} $$

The PPCG approximates the accurate solution through updating the actual solution iteratively. Assuming that \(\boldsymbol {{{\varDelta }} y}_{k}\\ \in \mathbb {R}^{m_{c}}\) is the solution vector for the k-th iteration, \(\boldsymbol {r}_{k} \in \mathbb {R}^{m_{c}}\) is the residual vector, \(\boldsymbol {t}_{k} \in \mathbb {R}^{m_{c}}\) is the temporary vector, \(\boldsymbol {p}_{k} \in \mathbb {R}^{m_{c}}\) is the direction vector where k = 0,1,2,⋯, and M is the preconditioned matrix, the initializations are summarized by

$$ \begin{array}{@{}rcl@{}} \boldsymbol{r}_{0} = \boldsymbol{v} - R\boldsymbol{{{\varDelta}} y}_{0},\quad \boldsymbol{t}_{0} = M^{-1}\boldsymbol{r}_{0},\quad \boldsymbol{p}_{0} = \boldsymbol{t}_{0}. \end{array} $$
(12)

where Δy0 is zero or other reasonable vectors, and preconditioned matrix M is assigned to be the block diagonal of R. Since mc << ms, the communication overhead will be dominant (discussed in Section 4.3) if we distribute Δyk, rk, tk, pk and M into all the subsystems. Thus, they are duplicated in every subsystem to avoid frequent communications.

Given that the direction vectors p1,p2,⋯ ,pk are R-orthogonal, the step size αk for the k-th iteration can be determined by

$$ \begin{array}{@{}rcl@{}} \alpha_{k} = \frac{\boldsymbol{r}^{T}_{k}\boldsymbol{t}_{k}}{\boldsymbol{p}^{T}_{k}R\boldsymbol{p}_{k}}. \end{array} $$
(13)

As is demonstrated in Section 3.1, the RCM R is not explicitly calculated due to the expensive parallel reduce operations of sparse matrices. From the result of Theorem 2, we can calculate Rpk through local Hessian matrices U(j), V (j) and W(j) instead of the RCM R itself according to the implicit Schur complement method [22], that is

$$ \begin{array}{@{}rcl@{}} R\boldsymbol{p}_{k} \!&=&\! \left( \sum\limits_{i=1}^{n}\left( {U^{c}_{i}} - {W^{T}_{i}}\left( {U^{s}_{i}}\right)^{-1}W_{i}\right)\right)\boldsymbol{p}_{k} \\ \!&=&\! \left( \sum\limits_{i=1}^{n}{U^{c}_{i}}\right)\boldsymbol{p}_{k} - \left( {\sum}^{n}_{i=1}{W^{T}_{i}}\left( {U^{s}_{i}}\right)^{-1}W_{i}\right)\boldsymbol{p}_{k} \\ \!&=&\! \left( \sum\limits_{j=1}^{p}V(j)\right)\boldsymbol{p}_{k} - \left( \sum\limits_{j=1}^{p}W^{T}(j)U^{-1}(j)W(j)\right)\boldsymbol{p}_{k} \\ \!&=&\! \sum\limits_{j=1}^{p}(\mathit{V}(\mathit{j})\boldsymbol{p}_{k}) - \sum\limits_{j=1}^{p}W^{T}(j)(U^{-1}(j)(\mathit{W}(j)\boldsymbol{p}_{k})). \end{array} $$
(14)

With the step size αk and the result of Rpk, the new solution vector Δyk+ 1 and the new residual vector rk+ 1 can be generated by

$$ \begin{array}{@{}rcl@{}} \boldsymbol{{{\varDelta}} y}_{k+1} &= \boldsymbol{{{\varDelta}} y}_{k} + \alpha_{k} \boldsymbol{p}_{k}, \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{r}_{k+1} &= \boldsymbol{r}_{k} - \alpha_{k} R\boldsymbol{p}_{k} \end{array} $$
(16)

where Δyk+ 1 and rk+ 1 are updated in conjugate directions.

The residual vector rk+ 1 is applied to determine the exit conditions. Once \(\left \|\boldsymbol {r}_{k+1}\right \|\) drops below a predefined threshold, we consider that Δyk+ 1 is a sufficiently accurate solution. Otherwise, we use the new residual vector to update the direction vector. The new temporary vector tk+ 1 can be obtained by

$$ \begin{array}{@{}rcl@{}} \boldsymbol{t}_{k+1} &= M^{-1}\boldsymbol{r}_{k+1}. \end{array} $$
(17)

Consequently, the Gram-Schmidt constant βk for the k-th iteration is given by

$$ \begin{array}{@{}rcl@{}} \beta_{k} = \frac{\boldsymbol{t}^{T}_{k+1}\boldsymbol{r}_{k+1}}{\boldsymbol{t}^{T}_{k}\boldsymbol{r}_{k}}. \end{array} $$
(18)

Finally, to ensure that the direction vectors p1,p2,⋯ , pk+ 1 are R-orthogonal, the new direction vector pk+ 1 can be calculated with βk by

$$ \begin{array}{@{}rcl@{}} \boldsymbol{p}_{k+1} = \boldsymbol{t}_{k+1} + \beta_{k} \boldsymbol{p}_{k}. \end{array} $$
(19)

Until reaching the termination condition, each new iteration generates a new solution with pk+ 1. The overall parallel algorithm on a multi-process system for solving normal equations is described by Algorithm 1.

figure c

3.3 Comparisons of parallel acceleration

Lemma 1

(Amdahl’s law) [23] For a parallel bundle adjustment system with p concurrent processors, the fraction that can be enhanced is ρ, and the fraction that can not be enhanced is 1 − ρ, then the overall speedup ratio can be determined by

$$ \begin{array}{@{}rcl@{}} \gamma(p) = \frac{1}{(1-\rho) + \rho/p}. \end{array} $$
(20)

The fraction ρ is completely determined by what is the aims of the task and how is the parallel algorithm designed. If ρ = 1, the system can be perfectly parallelized, but ρ < 1 in real applications. In this case, the speedup ratio is bounded by

$$ \begin{array}{@{}rcl@{}} \gamma(p) \le \lim_{p \to +\infty}\frac{1}{(1-\rho) + \rho/p} = \frac{1}{1-\rho}. \end{array} $$
(21)

Let N be the matrix dimension of the RCM. The naive Cholesky decomposition solution incurs cubic time complexity, i.e., \(O(\frac {N^{3}}{3})\). Let π be the number of nonzero entries, and κ be the condition number of the RCM, the time complexity of single-thread preconditioned conjugate gradient (PCG) is then \(O(\pi \sqrt {\kappa })\).

According to Lemma 1, ideally a parallel version of PCG would take \(O(\frac {\pi }{\gamma (p)}\sqrt {\kappa })\) time with p processors. In practice, however, there is often additional scheduling and communication overhead, and different processors are usually not perfectly balanced. Thus, in a parallel environment whether by multiple processes or multiple threads, we can conclude the proposition of Lemma 1,

Theorem 3

For a parallel bundle adjustment system with p concurrent processors considering overhead μ(p), the fraction that can be enhanced remains ρ, and the fraction that can not be enhanced is 1 − ρ, then the overall speedup ratio can be generated by

$$ \begin{array}{@{}rcl@{}} \gamma^{\prime}(p) = \frac{1}{(1 - \rho) + \rho/p + \mu(p)}. \end{array} $$
(22)

Intuitively, the overhead μ(p) is strictly increasing as the the number of processors increases, which affects the overall speedup. There is an equilibrium between \(\gamma ^{\prime }(p)\) and the overhead μ(p), which is depicted by the the following experiments through different scales of datasets.

To study the time complexity of multi-process and multi-thread parallel methods, we compare the operations included in Algorithm 1 by Table 1. When loading data f(x,y), the startup overhead for multi-process is generally a bit larger than multi-thread on Linux platform, which is not dominant for high performance multi-node servers. However, the shared memory architecture of multi-thread will produce conflicts when reading or writing memory during the procedure of the whole algorithm. That is, the bandwidth of memory limits the concurrency of many processors for multi-thread, and the conflicts between threads will become the bottleneck as the number of processors increases.

To account for the overhead, we introduce two functions of p, \(\gamma _{-}^{\prime }(p)\) and \(\gamma _{+}^{\prime }(p)\) \((0 < \gamma ^{\prime }_{-}(p), \gamma ^{\prime }_{+}(p) < 1/(1-\rho ))\), for the PBA and the proposed PPCG algorithm, respectively. When p = 1, multi-process and multi-thread will degenerate to single process, that is to say \(\gamma ^{\prime }_{-}(p) = \gamma ^{\prime }_{+}(p)\). When very few processors (possibly 2 to 4 for our simulations, which depends on architectures and algorithms), the memory conflicts of multi-thread are comparable with multi-process, and \(\gamma ^{\prime }_{-}(p) \approx \gamma ^{\prime }_{+}(p)\). When p grows much larger (5 or more), the memory conflicts are dominant as discussed above, which results to \(\gamma ^{\prime }_{-}(p) < \gamma ^{\prime }_{+}(p)\). Then, the time complexity of PBA is \(O\left (\frac {\pi }{\gamma ^{\prime }_{-}(p)}\sqrt {\kappa }\right )\), and that of PPCG is \(O\left (\frac {\pi }{\gamma ^{\prime }_{+}(p)}\sqrt {\kappa }\right )\).

Additionally, RPBA distributes the whole block into a limited number of subblocks through consensus ADMM framework. Without considering the communication overheads of ADMM, the ideal time complexity of RPBA is \(O\left (\frac {\pi }{p}\sqrt {\kappa }\right )\). Taking overheads into account, the actual ADMM iterates with sublinear convergence rate such that the time complexity of RPBA is \(O\left (\frac {\pi }{\gamma ^{\prime }(p)}\sqrt {\kappa }\right ) (0 < \gamma ^{\prime }(p) < 1/(1-\rho ))\) when we define a function, \(\gamma ^{\prime }(p)\), to summarize it. Since the overhead of ADMM is very different from overheads produced by the OS and architectures, we can’t compare the time complexity of RPBA with others in theory. The following experiments will compare the actual performance in real bundle adjustment applications. The time complexity of PPCG and the referenced algorithms are summarized by Table 2.

Table 1 Operations of multi-process and multi-thread methods
Table 2 Time complexity analysis

4 Experiments

We have implemented the proposed PPCG algorithm using C++ 11, and developed it on an Intel Core i5 8250 quad-core hyper-threading 1.6GHz personal computer running Linux Mint 19 Operating System with kernel 4.15.0-20-generic. The capacity of RAM is 8GB and the compiler is GCC7.3.0. A Message Passing Interface (MPI) based parallel framework is utilized for parallel implementation. State-of-the-art linear algebra software packages, BLAS and LAPACK, are employed for matrix computations. All experiments are performed with double-precision floating point numbers. The structure dimension is 3, and the camera dimension is 7 including 1 intrinsic parameter and 6 extrinsic parameters without considering distortion parameters. Moreover, we also assume that the number of processes is not less than the number of processors to assure effective parallelization.

4.1 Dataset details

BAL benchmark datasets [2], i.e., Ladybug-1723, Trafalgar-257, Dubrovnik-356, Venice-1776, Final-961 and Final-1936, are used in the experiments, as listed by Table 3. As well, the sizes and the sparsity of Hessian matrices built from these datasets are listed. The Hessian matrix are enormously large, but very sparse. Considering the memory limitation of SBA, we choose two subsets Final-961 and Final-1936 instead of the whole Final dataset.

Table 3 BAL datasets for simulations

Table 4 describes the characteristics of the six public data-sets. Category is how are the images are organized in the phase of feature extraction, and datasets with leaf images contains more observation information. Intuitively, datasets with leaf images contain more geometry information than skeletal sets, which is more efficient for reconstructions. Camera number are the number of cameras (or images) used while Feature density is the average number of feature points on one images. It is mentioned that datasets with larger feature density will produce more accurate point cloud.

Table 4 Dataset discriptions

The six public datasets are divided into 3 groups according to their characteristics, and we randomly choose one dataset from each of them for the following reconstructions, i.e. Final-1936, Ladybug-1723 and Trafalgar-257.

4.2 Reconstructions

With the applications of our algorithm for bundle adjustment, we perform the process of 3D reconstruction through the structure from motion method, and present the results of reconstructed point cloud. To verify the performance of bundle adjustment, we compare the reconstructed point cloud data before and after bundle adjustment. Figures 12 and 3 depict the differences of the reconstructed point cloud for Final-1936, Ladybug-1723 and Trafalgar-257. All reconstructed point cloud data are observed in bird’s-eye view.

Fig. 1
figure 1

Comparisons of the reconstructed point cloud for Final-1936, Ladybug-1723 and Trafalgar-257. a, c and e are reconstructed point cloud after bundle adjustment, b, d and f are reconstructed point cloud before bundle adjustment

Fig. 2
figure 2

Evaluations of convergence performance a Ladybug-1723 b Trafalgar-257 c Dubrovnik-356 d Venice-1776 e Final-961 f Final-1936

Fig. 3
figure 3

Evaluations of parallel performance a Ladybug-1723 b Trafalgar-257 c Dubrovnik-356 d Venice-1776 e Final-961 f Final-1936

Figure 1a and b present markedly the differences between reconstructions after and before bundle adjustment for Final-1936. It is shown that the reconstructed point cloud after bundle adjustment is optimized where the edges and figures are more clear. By contrast, the reconstructed point cloud without bundle adjustment is very indistinct, and nothing can be recognized except the contours. The feature density of Final-1936 is relatively dense. Every image has more features, which provides more geometry information for the nonlinear optimization procedure of bundle adjustment and produces more distinct point cloud.

Figure 1c and d present the differences between reconstructions after and before bundle adjustment for Ladybug-1723. There is significant improvements from our eye-sights where objects and edges are explicitly augmented after bundle adjustment. Scattering points are concentrated to their real positions, and we can observe bold trees and paths.

Figure 1e and f compare the point cloud after bundle adjustment with that before bundle adjustment for Trafalgar-257. Since the feature density of Trafalgar-257 is sparse, the point cloud without bundle adjustment is very sparse as well. The object points are not optimized and concentrated such that buildings and objects are extremely light. After bundle adjustment, the contours and edges of the buildings are augmented, the buildings and objects turn bold. As a result, more distinct point cloud can be observed with the optimization of bundle adjustment.

4.3 Convergence evaluation

The proposed PPCG algorithm is compared with the state-of-the-art algorithms, PBA [12] and RPBA [20]. Both single-thread and multi-thread algorithms are taken into consideration in our evaluation. For simplicity, single-thread PBA and RPBA are denoted by PBA-1 and RPBA-1 respectively, Two-thread PBA and RPBA are denoted by PBA-2 and RPBA-2, et al. The convergence rates of CholDec, PBA-1, PBA-2, PBA-4, RPBA-1, RPBA-2, RPBA-4, PPCG-1, PPCG-2 and PPCG-4, are depicted by Table 5.

Table 5 Comparisons of convergence rate

For each dataset, we run all algorithms until they converge to comparable MSE. In other words, throughout this section, all methods except RPBA, reach the same accuracy in all experiments. In a different way, RPBA uses the normalized weighted squared residuals (NWSR) to estimate global errors. The Iterations item summarizes iterations needed for an optimized solution for an algorithm, including successful LM iterations (the numerator) and total LM iterations (the denominator), the Runtime item is the time consumed for a whole bundle adjustment process while the Speedup Ratio item presents the speedup with respect to CholDec. Owing to the limitation that the source codes of PBA and RPBA are optimized for shared memory architectures, the comparisons of convergence rate are only performed on a shared memory personal computer.

From the results of Table 5, it is demonstrated that PPCG-4 achieves absolutely best convergence performance among referred methods for most datasets. To measure the performance of the iterative methods, we choose CholDec as our baseline benchmark algorithm, and compares the run-time of PBA, RPBA and PPCG with it. CholDec converges much slower than other methods due to the fact that Cholesky decomposition is very complicated in time when the Hessian matrix grows. In contrast, PBA, RPBA and PPCG are globally faster than CholDec because each iteration of these iterative methods consume much less time for sparse problems, though more iterations are needed.

For datasets with more cameras, such as Ladybug-1723, Venice-1776, Final-961 and Final-1936, there is significant speedup for iterative algorithms over CholDec. Therefore, it can be concluded that the iterative methods, PBA, RPBA and PPCG, are highly effective for large problems, because the time complexity of CholDec (6) grows cubically for datasets with much more cameras and the iterative methods act more excellent than the direct method, CholDec when the RCM is very sparse. Without parallelization, PBA-1 gains speedup of 2.74 to 132.05, RPBA-1 gains speedup of 2.70 to 176.61 and PPCG-1 gains speedup of 1.54 to 130.12.

For parallel iterative methods, PBA, RPBA and PPCG with multiple threads obtain more significant speedup ratio than single-thread algorithms by reducing each iteration time through multi-threading techniques. PBA-4 achieves speedup of 2.98 to 163.06 times. PBA accelerates the traditional preconditioned conjugate gradient (PCG) [2] through multiple concurrent threads, but the efficiency is relatively low compared with the CPUs it occupies. RPBA-4 converges very fast with less iterations by improving the measurement with NWSR, and achieves speedup of 25.95 to 157.29 times. PPCG-4 achieves speedup of 3.62 to 418.68 times, and achieves speedup of 1.12 to 3.76 times against PBA-4, and obtains a speedup ratio of 1.06 to 2.66 over RPBA-4 for datasets excluding Trafalgar-257 and Dubrovnik-356 (Table 2).

Figure 2 describes the convergence curves of the algorithms, which is more clear for the trajectory of convergence. Since NWSR is a measurement which is very different from MSE, RPBA is not included in Fig. 2. The much steeper curve demonstrates much faster convergence for each iteration, and the much earlier ending point demonstrates much faster global convergence. For all of the datasets, the curve of PPCG-4 converges fastest and ends earliest among the referred algorithms excluding RPBA-1, RPBA-2 and RPBA-4.

4.4 Parallelization evaluation

We evaluate the parallel performance of our algorithm on a multi-node server. Each node in the server is equipped with 24 Intel IA-64 cores running at 1.2GHz without hyper-threading, 64GB shared RAM and 64bit Linux operating system for each computing node. PPCG is not constrained by share-memory architectures, and can be extended to multiple computing nodes, which is evaluated on the server to measure the parallel performance. The speedup ratio and the efficiency from 1 to 72 processors (3 computing nodes) are depicted by Fig. 3.

Based on the experimental results, we conclude that the speedup ratio grows as the number of processors grows, and approaches to a certain value due to the existences of the bound and overhead while the efficiency acts in the opposite way. It is shown that PPCG gains excellent parallelization whether in small or large scale problems. Moreover, the curves of speedup ratio and efficiency cross at one point, where the equilibrium of them is guaranteed. The real performance of computing nodes is fluctuating, which results to the non-smooth curves, but the trends demonstrate the deterministic or functional relationships.

4.5 Bound and overhead

Categories and descriptions for the overhead of a multi-node server are listed by Table 6, as well as functions under the MPI framework. The overhead of imbalanced workload can be eliminated by well-designed parallel algorithm while the others are determined by hardware architectures and parallel framework. When transmitting data, the nodes have to created connections to destination nodes. The overhead caused by this whole procedure is connection setup overhead, which is increased when the increasing processors enlarge routing time. And the communication overhead is the time consumed by data transmission after connections.

Table 6 Overhead of multi-node computers

Theoretically, even more computing nodes or processors can be included to get much larger speedup ratio, but the bound of the speedup ratio (discussed in Section 3.3) will restrict the overall speedup ratio, as well as the overhead of startup and communications among different processes. Firstly, according to Lemma 1, the speedup ratio \(\gamma ^{\prime }_{+}(p)\) will not increase as it approaches 1/(1 − ρ) when adding more processors. This characteristic is owing to the algorithm or the task it self. Secondly, assuming that \(\gamma ^{\prime }_{+}(p) << 1/(1-\rho )\), additional processors will speedup the whole task when the computation burden is greater than related overhead for each process. As adding more processors, the computation burden decreases while overhead increases, which will results lower efficiency. As a result, for very large bundle adjustment problems, more computing nodes can be added for higher speedup ratio with acceptable efficiency.

5 Conclusion

This paper has proposed a PPCG algorithm to solve normal equations involved in bundle adjustment problems. A parallel Schur complement algorithm is devised to reduce the size of normal equations. Then, a parallel preconditioned conjugate gradient method is developed is to solve the reduced normal equations by parallelizing the original PCG algorithm. Taking the global performance into account, theoretical results ensuring the equivalence of matrix decomposition are presented with formal proof and the corresponding algorithm is developed to avoid expensive matrix computations. The experiments on a multi-node server show that PPCG achieves significant speedup ratios on several benchmark datasets compared to other algorithms, which reaching comparable accuracy. Moreover, the speedup ratio and the computational efficiency with respect to the number of processors are also evaluated and reported. Regarding future work, we plan to take PPCG as a local optimizer for consensus optimization based large-scale distributed bundle adjustment problems, and develop a distributed framework for further improvements with efficiency and scalability.