
1 Introduction

The classical challenge of image-based large-scale reconstruction is witnessing renewed interest with the emergence of large-scale internet photo collections [2]. The computational bottleneck of 3D reconstruction and structure-from-motion methods is the problem of large-scale bundle adjustment (BA): given a set of measured image feature locations and correspondences, BA aims to jointly estimate the 3D landmark positions and camera parameters by minimizing a non-linear least squares reprojection error. More specifically, the most time-consuming step is the solution of the normal equation in the popular Levenberg-Marquardt (LM) algorithm, which is typically solved by Preconditioned Conjugate Gradients (PCG).

In this paper, we propose a new iterative solver for the normal equation that relies on the decomposable structure of the competitive block Jacobi preconditioner. Inspired by related approaches in the domain-decomposition literature, we exploit the specific structure of the Schur complement matrix to enlarge the search-space of the traditional PCG approach, leading to what we call Multidirectional Conjugate Gradients (MCG). In particular, our contributions are as follows:

Fig. 1. (a) Optimized 3D reconstruction of a final BAL dataset with 1936 poses and more than five million observations. For this problem MCG is 39% faster than PCG and the overall BA resolution is 16% faster. (b) Optimized 3D reconstruction of the Alamo dataset from 1dSfM with 571 poses and 900,000 observations. For this problem MCG is 56% faster than PCG and the overall BA resolution is 22% faster.

  • We design an extension of the popular PCG by using local contributions of the poses to augment the space in which a solution is sought.

  • We experimentally demonstrate the robustness of MCG with respect to the relevant hyper-parameters.

  • We evaluate MCG on a multitude of BA problems from BAL [1] and 1dSfM [20] datasets with different sizes and show that it is a promising alternative to PCG.

  • We experimentally confirm that the performance gain of our method increases with the density of the Schur complement matrix leading to a speedup for solving the normal equation of up to 61% (Fig. 1).

2 Related Work

Since we propose a way to solve medium- to large-scale BA using a new iterative solver that enlarges the search-space of the traditional PCG, in the following we review both scalable BA and recent CG literature.

Scalable Bundle Adjustment

A detailed survey of the theory and methods in the BA literature can be found in [17]. The sparsity of the BA problem is commonly exploited with the Schur complement matrix [5]. As the performance of BA methods is closely linked to the resolution of the normal equations, speeding up the solve step is a challenging task. Traditional direct solvers such as sparse or dense Cholesky factorization [11] have been outperformed by inexact solvers as the problem size increases and are therefore frequently replaced by Conjugate Gradients (CG) based methods [1, 6, 18]. As its convergence rate depends on the condition number of the linear system, a preconditioner is used to correct ill-conditioned BA problems [14]. Several works tackle the design of performant preconditioners for BA: [9] proposed the band block diagonals of the Schur complement matrix, [10] exploited the strength of the coupling between two poses to construct cluster-Jacobi and block-tridiagonal preconditioners, and [7] built on the combinatorial structure of BA. However, despite these advances in the design of preconditioners, the iterative solver itself has rarely been challenged.

(Multi-preconditioned) Conjugate Gradients

Although CG has been a popular iterative solver for decades [8], there are some interesting recent innovations, e.g. flexible methods with a preconditioner that changes throughout the iterations [13]. The case of a preconditioner that can be decomposed into a sum of preconditioners has been exploited by using Multi-Preconditioned Conjugate Gradients (MPCG) [4]. Unfortunately, with increasing system size MPCG rapidly becomes inefficient. As a remedy, Adaptive Multi-Preconditioned Conjugate Gradients have recently been proposed [3, 15]. This approach is particularly well adapted to domain-decomposable problems [12]. While the decomposition of the reduced camera system in BA has already been tackled, e.g. with stochastic clusters in [19], to our knowledge the decomposition inside the iterative solver has never been explored. As we will show in the following, this modification gives rise to a significant boost in performance.

3 Bundle Adjustment and Multidirectional Conjugate Gradients

We consider the general form of bundle adjustment with \(n_{p}\) poses and \(n_{l}\) landmarks. Let x be the state vector containing all the optimization variables. It is divided into a pose part \(x_{p}\) of length \(d_{p}n_{p}\) containing extrinsic and possibly intrinsic camera parameters for all poses (generally \(d_{p}=6\) if only extrinsic parameters are unknown and \(d_{p}=9\) if intrinsic parameters also need to be estimated) and a landmark part \(x_{l}\) of length \(3n_{l}\) containing the 3D coordinates of all landmarks. Let \(r\left( x\right) =[r_{1}\left( x\right) ,...,r_{k}\left( x\right) ]\) be the vector of residuals of the 3D reconstruction. The objective is to minimize the sum of squared residuals

$$\begin{aligned} F\left( x\right) =\Vert r\left( x\right) \Vert ^{2} = \sum _i \Vert r_i(x) \Vert ^{2} \end{aligned}$$
(1)

3.1 Least Squares Problem and Schur Complement

This minimization problem is usually solved with the Levenberg-Marquardt algorithm, which is based on the first-order Taylor approximation of \(r\left( x\right) \) around the current state estimate \(x^{0}=\left( x_{p}^{0},x_{l}^{0}\right) \):

$$\begin{aligned} r\left( x\right)&\approx r^{0}+J\varDelta x \end{aligned}$$
(2)

where

$$\begin{aligned} r^{0}&=r\left( x^{0}\right) ,\end{aligned}$$
(3)
$$\begin{aligned} \varDelta x&= x-x^{0},\end{aligned}$$
(4)
$$\begin{aligned} J&=\frac{\partial r}{\partial x}\mid _{x=x^{0}} \end{aligned}$$
(5)

and J is the Jacobian of r that is decomposed into a pose part \(J_{p}\) and a landmark part \(J_{l}\). An added regularization term that improves convergence gives the damped linear least squares problem

$$\begin{aligned} \underset{\varDelta x_{p},\varDelta x_{l}}{\text {min}}&\left( \Vert r^{0}+\left( \begin{array}{cc} J_{p}&J_{l}\end{array}\right) \left( \begin{array}{c} \varDelta x_{p}\\ \varDelta x_{l} \end{array}\right) \Vert ^{2}+\lambda \Vert \left( \begin{array}{cc} D_{p}&D_{l}\end{array}\right) \left( \begin{array}{c} \varDelta x_{p}\\ \varDelta x_{l} \end{array}\right) \Vert ^{2}\right) \end{aligned}$$
(6)

with \(\lambda \) a damping coefficient and \(D_{p}\) and \(D_{l}\) diagonal damping matrices for pose and landmark variables. This damped problem leads to the corresponding normal equation

$$\begin{aligned} H\left( \begin{array}{c} \varDelta x_{p}\\ \varDelta x_{l} \end{array}\right)&=-\left( \begin{array}{c} b_{p}\\ b_{l} \end{array}\right) \end{aligned}$$
(7)

where

$$\begin{aligned} H&= \left( \begin{array}{cc} U_{\lambda } &{} W\\ W^{\top } &{} V_{\lambda } \end{array}\right) ,\end{aligned}$$
(8)
$$\begin{aligned} U_{\lambda }&=J_{p}^{\top }J_{p}+\lambda D_{p}^{\top }D_{p},\end{aligned}$$
(9)
$$\begin{aligned} V_{\lambda }&=J_{l}^{\top }J_{l}+\lambda D_{l}^{\top }D_{l},\end{aligned}$$
(10)
$$\begin{aligned} W&=J_{p}^{\top }J_{l},\text { }b_{p}=J_{p}^{\top }r^{0},\end{aligned}$$
(11)
$$\begin{aligned} b_{l}&=J_{l}^{\top }r^{0} \end{aligned}$$
(12)
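For reference, and writing the damping so that it acts separately on pose and landmark variables, (7)–(12) are simply the stationarity condition of the damped problem (6):

$$\begin{aligned} \left( J^{\top }J+\lambda D^{\top }D\right) \left( \begin{array}{c} \varDelta x_{p}\\ \varDelta x_{l} \end{array}\right) =-J^{\top }r^{0},\qquad J=\left( \begin{array}{cc} J_{p}&J_{l}\end{array}\right) ,\quad D=\left( \begin{array}{cc} D_{p} &{} 0\\ 0 &{} D_{l} \end{array}\right) \end{aligned}$$

Expanding the blocks of \(J^{\top }J+\lambda D^{\top }D\) and of \(J^{\top }r^{0}\) yields exactly the expressions (8)–(12).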

As the system matrix H is of size \(\left( d_{p}n_{p}+3n_{l}\right) \times \left( d_{p}n_{p}+3n_{l}\right) \) and solving (7) directly tends to be excessively costly for large-scale problems [1], it is common to reduce the system by using the Schur complement trick and forming the reduced camera system

$$\begin{aligned} S\varDelta x_{p}&=-\widetilde{b} \end{aligned}$$
(13)

with

$$\begin{aligned} S&=U_{\lambda }-WV_{\lambda }^{-1}W^{\top },\end{aligned}$$
(14)
$$\begin{aligned} \widetilde{b}&=b_{p}-WV_{\lambda }^{-1}b_{l} \end{aligned}$$
(15)

and then solving (13) for \(\varDelta x_{p}\) and backsubstituting \(\varDelta x_{p}\) in

$$\begin{aligned} \varDelta x_{l}&=-V_{\lambda }^{-1}\left( b_{l}+W^{\top }\varDelta x_{p}\right) \end{aligned}$$
(16)
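To make the reduction concrete, here is a minimal dense Eigen sketch of (14)–(16). It is illustrative only and the function names are ours; in an actual BA implementation \(U_{\lambda }\), \(V_{\lambda }\) and W are block-sparse and \(V_{\lambda }\) is inverted block-wise per landmark.

```cpp
#include <Eigen/Dense>

// Reduced camera system (14)-(15): S = U - W V^{-1} W^T, b_tilde = b_p - W V^{-1} b_l.
struct ReducedSystem {
  Eigen::MatrixXd S;
  Eigen::VectorXd b_tilde;
};

ReducedSystem buildReducedSystem(const Eigen::MatrixXd& U, const Eigen::MatrixXd& V,
                                 const Eigen::MatrixXd& W, const Eigen::VectorXd& b_p,
                                 const Eigen::VectorXd& b_l) {
  Eigen::MatrixXd Vinv = V.inverse();  // block-diagonal in real BA, hence cheap to invert
  Eigen::MatrixXd S = U - W * Vinv * W.transpose();
  Eigen::VectorXd b_tilde = b_p - W * Vinv * b_l;
  return {S, b_tilde};
}

// Back-substitution (16): recover the landmark update from the pose update.
Eigen::VectorXd backSubstitute(const Eigen::MatrixXd& V, const Eigen::MatrixXd& W,
                               const Eigen::VectorXd& b_l, const Eigen::VectorXd& dx_p) {
  return -V.ldlt().solve(b_l + W.transpose() * dx_p);
}
```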

3.2 Multidirectional Conjugate Gradients

Direct methods such as Cholesky decomposition [17] have been studied for solving (13) for small-size problems, but this approach incurs a high computational cost as problems grow large.

A very popular iterative solver for large symmetric positive-definite systems is the CG algorithm [16]. Since its convergence rate depends on the distribution of the eigenvalues of S, it is common to replace (13) by a preconditioned system. Given a preconditioner M, the preconditioned linear system associated to

$$\begin{aligned} S\varDelta x_{p}=-\widetilde{b} \end{aligned}$$
(17)

is

$$\begin{aligned} M^{-1}S\varDelta x_{p}=-M^{-1}\widetilde{b} \end{aligned}$$
(18)

and the resulting algorithm is called Preconditioned Conjugate Gradients (PCG) (see Algorithm 1). For block-structured matrices such as S, a competitive preconditioner is the block-diagonal matrix \(D\left( S\right) \), also called the block Jacobi preconditioner [1]. It is composed of the diagonal blocks of S. Since the block \(S_{mj}\) of S is nonzero if and only if cameras m and j observe at least one common point, each diagonal block depends on a unique pose and is applied to the part of the conjugate gradients residual \(r_{i}^{j}\) associated with this pose. The motivation of this section is to enlarge the conjugate gradients search-space by using several local contributions instead of a single global contribution.
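To illustrate how such a block Jacobi preconditioner is applied, the following sketch performs one small \(d_{p}\times d_{p}\) solve per pose. The names and data layout (one dense diagonal block of S stored per pose) are assumptions made for exposition, not a description of our actual implementation.

```cpp
#include <vector>
#include <Eigen/Dense>

// Apply D(S)^{-1} to a residual r by solving one d_p x d_p system per pose.
// diag_blocks[k] is assumed to hold the k-th diagonal block S_kk of S.
Eigen::VectorXd applyBlockJacobi(const std::vector<Eigen::MatrixXd>& diag_blocks,
                                 const Eigen::VectorXd& r, int d_p) {
  Eigen::VectorXd z(r.size());
  for (int k = 0; k < static_cast<int>(diag_blocks.size()); ++k) {
    // Each diagonal block only touches the residual entries of its own pose.
    z.segment(k * d_p, d_p) = diag_blocks[k].ldlt().solve(r.segment(k * d_p, d_p));
  }
  return z;
}
```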

Algorithm 1. Preconditioned Conjugate Gradients (PCG).

Adaptive Multidirections

Fig. 2. (a) The block-Jacobi preconditioner \(D\left( S\right) \) is divided into N submatrices \(D_{p}\left( S\right) \) and each of them is directly applied to the associated block-row \(r^{p}\) of the CG residual. (b) Depending on a \(\tau \)-test the search-space is enlarged: each iteration provides N times more search-directions than PCG.

Local Preconditioners. We propose to decompose the set of poses into N subsets of sizes \(l_{1},\ldots , l_{N}\) and to consider the block-diagonal matrix \(D_{p}\left( S\right) \) of the block-Jacobi preconditioner and the associated residual \(r^{p}\) that correspond to the \(l_{p}\) poses of subset p (see Fig. 2(a)). All direct solves are performed inside these subsets rather than on the global set. Each local solve is treated as a separate preconditioned equation and provides its own search-direction. Consequently the conjugate vectors \(Z_{i+1}\in \mathbb {R}^{d_{p}n_{p}}\) in preconditioned conjugate gradients (line 10 in Algorithm 1) are replaced by conjugate matrices \(Z_{i+1}\in \mathbb {R}^{d_{p}n_{p}\times N}\), each column of which corresponds to a local preconditioned solve. The search-space is thus significantly enlarged: N search-directions are generated at each inner iteration instead of only one. An important drawback is that matrix-vector products are replaced by matrix-matrix products, which can incur a significant additional cost, so a trade-off between convergence improvement and computational cost must be struck.
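A minimal sketch of these local solves, under the same assumed block storage as above, could look as follows; each subset fills only the rows of its own poses, so column p of \(Z_{i+1}\) is zero outside subset p.

```cpp
#include <vector>
#include <Eigen/Dense>

// Build the conjugate matrix whose p-th column is the local preconditioned solve
// of subset p applied to its block-row of the residual (cf. Fig. 2(a)).
// subset_of_pose maps pose k to its subset index in [0, N).
Eigen::MatrixXd localPreconditionedDirections(
    const std::vector<Eigen::MatrixXd>& diag_blocks,
    const std::vector<int>& subset_of_pose,
    const Eigen::VectorXd& r, int d_p, int N) {
  Eigen::MatrixXd Z = Eigen::MatrixXd::Zero(r.size(), N);
  for (int k = 0; k < static_cast<int>(diag_blocks.size()); ++k) {
    Z.col(subset_of_pose[k]).segment(k * d_p, d_p) =
        diag_blocks[k].ldlt().solve(r.segment(k * d_p, d_p));
  }
  return Z;
}
```

Note that summing the columns of this matrix recovers the global block Jacobi solve \(D\left( S\right) ^{-1}r\), which is why the local and global preconditioners can share the same blocks (see the implementation remarks below).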

Algorithm 2. Multidirectional Conjugate Gradients (MCG).

Adaptive \(\tau \)-Test. Following an approach similar to [15], we propose an adaptive multidirectional conjugate gradients algorithm (MCG, see Algorithm 2) that adapts automatically if convergence is too slow. Given a threshold \(\tau \in \mathbb {R}^{+}\) chosen by the user, a \(\tau \)-test determines whether the algorithm sufficiently reduces the error (case \(t_{i}>\tau \)) or not (case \(t_{i}<\tau \)). In the first case a global block Jacobi preconditioner is used and the algorithm performs a step of PCG; in the second case local block Jacobi preconditioners are used and the search-space is enlarged (see Fig. 2(b)).
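Schematically, and reusing the two sketches above, the adaptive switch can be written as follows; the computation of the indicator \(t_{i}\) itself follows Algorithm 2 and is not reproduced here.

```cpp
// Adaptive tau-test (sketch): one global block Jacobi direction when the error
// reduction t_i is sufficient (a plain PCG step), N local directions otherwise.
// applyBlockJacobi and localPreconditionedDirections are the sketches above.
Eigen::MatrixXd adaptiveDirections(double t_i, double tau,
                                   const std::vector<Eigen::MatrixXd>& diag_blocks,
                                   const std::vector<int>& subset_of_pose,
                                   const Eigen::VectorXd& r, int d_p, int N) {
  if (t_i > tau)
    return applyBlockJacobi(diag_blocks, r, d_p);  // single column: global preconditioner
  return localPreconditionedDirections(diag_blocks, subset_of_pose, r, d_p, N);
}
```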

Optimized Implementation. Besides the matrix-matrix products, two other changes appear. Firstly, an \(N\times N\) matrix \(\varDelta _{i}\) must be inverted (or pseudo-inverted if \(\varDelta _{i}\) is not full-rank) each time \(t_{i}<\tau \) (line 4 in Algorithm 2). Secondly, a full reorthogonalization is now necessary (line 16 in Algorithm 2) because of numerical errors, whereas in PCG \(\beta _{i,j}=0\) for \(i\ne j\) by construction.
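For the small \(N\times N\) matrix \(\varDelta _{i}\) this can be done, e.g., with Eigen's complete orthogonal decomposition, which provides both a rank test and a pseudo-inverse; the following sketch is illustrative and the helper name is ours.

```cpp
#include <Eigen/Dense>

// Invert the small N x N matrix Delta_i, falling back to the Moore-Penrose
// pseudo-inverse when it is rank-deficient.
Eigen::MatrixXd invertDelta(const Eigen::MatrixXd& Delta) {
  Eigen::CompleteOrthogonalDecomposition<Eigen::MatrixXd> cod(Delta);
  if (cod.rank() == Delta.rows())
    return Delta.inverse();     // full-rank: plain inverse
  return cod.pseudoInverse();   // rank-deficient: pseudo-inverse
}
```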

To improve the efficiency of MCG we do not apply S to \(P_{i}\) directly (line 3 in Algorithm 2) when the search-space is enlarged. By construction the block \(S_{kj}\) is nonzero if and only if cameras k and j observe at least one common point. The trick is to exploit the construction of \(Z_{i}\) and to apply only the non-zero blocks \(S_{jk}\), i.e. to consider only poses j observing a common point with pose k, to the column of \(Z_{i}\) associated with the subset containing pose k, and then to compute

$$\begin{aligned} Q_{i}&=SZ_{i}-\mathop {\sum }_{j=0}^{i-1}Q_{j}\beta _{i,j} \end{aligned}$$
(19)

To get \(t_{i}\) we need a global solve (line 10 in Algorithm 2). As the local block Jacobi preconditioners \(\{D_{p}(S)\}_{p=1,...,N}\) and the global block Jacobi preconditioner \(D\left( S\right) \) share the same blocks, it is not necessary to perform all local solves to construct the conjugate matrix (line 12 in Algorithm 2); instead it is more efficient to fill this matrix with the block-row elements of the preconditioned residual \(D\left( S\right) ^{-1}r_{i+1}\).

As the behaviour of the CG residuals is a priori unknown, the best decomposition is not obvious. We decompose the set of poses into \(N-1\) subsets of equal size and fill the last subset with the few remaining poses. This structure has the practical advantage of being very easy to construct, and the parallelizable block operations are well balanced.
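A sketch of this simple partition is given below; it is purely illustrative and the helper name is ours.

```cpp
#include <algorithm>
#include <vector>

// Assign each pose to one of N subsets: N-1 subsets of equal size, the last
// subset collects the remaining poses. Assumes n_p >= N - 1.
std::vector<int> partitionPoses(int n_p, int N) {
  std::vector<int> subset_of_pose(n_p);
  const int size = std::max(1, n_p / (N - 1));  // poses per regular subset
  for (int k = 0; k < n_p; ++k)
    subset_of_pose[k] = std::min(k / size, N - 1);  // overflow goes to the last subset
  return subset_of_pose;
}
```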

4 Experimental Evaluations

4.1 Algorithm and Datasets

Levenberg-Marquardt (LM) Loop. Starting with damping parameter \(10^{-4}\), we update \(\lambda \) according to the success or failure of the LM loop. Our implementation runs for at most 25 iterations, terminating early if a relative function tolerance of \(10^{-6}\) is reached. Our evaluation is built on the LM loop implemented in [19] and we also estimate intrinsic parameters for each pose.

Iterative Solver Step. For a direct performance comparison we implement our own MCG and PCG solvers in C++ using the Eigen 3.3 library. All row-major sparse matrix-vector and matrix-matrix products are multi-threaded using 4 cores. The tolerance \(\epsilon \) and the maximum number of iterations are set to \(10^{-6}\) and 1000, respectively. Pseudo-inversion is computed with the pseudo-inverse function from Eigen.

Datasets. For our evaluation we use 9 problems with different sizes and heterogeneous Schur complement matrix densities d from the BAL [1] and 1dSfM [20] datasets (see Table 1). The values of N and \(\tau \) are chosen arbitrarily and the robustness of our algorithm to these parameters is discussed in the next subsection.

Table 1. Details of the problems from BAL (prefixed as: F for final, L for Ladybug) and 1dSfM used in our experiments. d is the density of the associated Schur complement matrix, N is the number of subsets, \(\tau \) is the adaptive threshold that enlarges the search-direction space.

We run experiments on macOS 11.2 with an Intel Core i5 with 4 cores at 2 GHz.

4.2 Sensitivity to \(\tau \) and N

In this subsection we are interested in the solver runtime ratio, defined as \(\frac{t_{MCG}}{t_{PCG}}\) where \(t_{MCG}\) (resp. \(t_{PCG}\)) is the total runtime to solve all the linear systems (13) with MCG (resp. PCG) until a given BA problem converges. We investigate the influence of \(\tau \) and N on this ratio.

Sensitivity to \(\tau \). We solve the BA problems for different values of \(\tau \) and for the fixed number of subsets N given in Table 1. For each problem a wide range of values supplies a good trade-off between the augmented search-space and the additional computational cost (see Fig. 3). Although the choice of \(\tau \) is crucial, it does not require high accuracy. This confirms the robustness of our solver to \(\tau \).

Fig. 3. Robustness to \(\tau \). The plots show the performance ratio as a function of \(\tau \) for the number of subsets given in Table 1. The wide range of values giving similar performance confirms the robustness of MCG to \(\tau \).

Sensitivity to N. Similarly, we solve the BA problems for different values of N and for the fixed \(\tau \) given in Table 1. For each problem a wide range of values supplies a good trade-off between the augmented search-space and the additional computational cost (see Fig. 4). This confirms the robustness of our solver to N.

Fig. 4. Robustness to the number of subsets N. The plots show the performance ratio as a function of N for the \(\tau \) given in Table 1. The wide range of values giving similar performance confirms the robustness of MCG to N.

Fig. 5. Density effect on the relative performance. Each point represents a BA problem from Table 1 and d is the density of the Schur complement matrix. Our solver competes with PCG for sparse Schur matrices and leads to a significant speed-up for dense Schur matrices.

4.3 Density Effect

As the performance of PCG and MCG depends on matrix-vector products and matrix-matrix products, respectively, we expect a correlation with the density of the Schur matrix. Figure 5 confirms this intuition: MCG greatly outperforms PCG for dense Schur matrices and is competitive for sparse ones.

4.4 Global Performance

Figures 6 and 7 present the total runtime with respect to the number of BA iterations for each problem and the convergence plots of the total BA cost for the F-1936 and Alamo datasets, respectively. MCG and PCG give the same error at each BA iteration but the former is more efficient in terms of runtime. Table 2 summarizes our results and highlights the strong performance of MCG for dense Schur matrices. In the best case the BA resolution is more than \(20\%\) faster than with PCG. Even for sparser matrices MCG competes with PCG: in the worst case MCG presents results similar to PCG. If we restrict the comparison to the linear system solve steps our relative results are even better: MCG is up to \(60\%\) faster than PCG and presents results similar to PCG in the worst case.

Fig. 6. Global runtime to solve the BA problems. The plots represent the total time with respect to the number of BA iterations. For almost all problems the BA resolution with MCG (orange) is significantly faster than with PCG (blue). (Color figure online)

Fig. 7. Convergence plots of (a) Final-1936 from the BAL dataset and (b) Alamo from the 1dSfM dataset. The y-axes show the total BA cost.

Table 2. Relative performance of MCG w.r.t. PCG. d is the density of the associated Schur complement matrix. MCG greatly outperforms PCG (up to 61% faster) for dense Schur matrices and competes with PCG for sparse ones. The global BA resolution is up to 22% faster.

5 Conclusion

We propose a novel iterative solver that accelerates the solution of the normal equation for large-scale BA problems. The proposed approach generalizes the traditional preconditioned conjugate gradients algorithm by enlarging its search-space, leading to convergence in far fewer iterations. Experimental validation on a multitude of large-scale BA problems confirms a significant speedup in solving the normal equation of up to 61%, especially for dense Schur matrices where baseline techniques notoriously struggle. Moreover, detailed ablation studies demonstrate robustness to variations in the hyper-parameters and an increasing speedup as a function of problem density.