
1 Introduction

Recently, we have witnessed a surge of interest in applying neural networks to various computer vision and robotics problems, such as single-view depth estimation [16], absolute pose regression [28] and 3D point-cloud classification [35]. However, we still rely on robust optimization at various stages of geometric problems, for example, robot navigation and mapping. The reason is that neural networks have not yet proven effective at solving constrained optimization problems. Classic examples of test-time geometric optimization include rotation averaging [8, 14, 15, 23, 26, 45] (a.k.a. rotation synchronization [5, 38, 44]), pose-graph optimization [30, 42], local bundle adjustment [33] and global structure from motion [46]. These optimization methods estimate the model parameters that agree best with the observed noisy measurements by minimizing a robust (typically non-convex) cost function. Often, these loss functions are designed based on certain assumptions about the sensor noise and outlier distributions; however, the observed noise distribution in a real-world application can be far from those assumptions. A few such examples of noise patterns in real datasets are displayed in Fig. 2. Furthermore, the nature and structure of the objective loss function are the same across different problem instances of a specific task; nonetheless, existing methods optimize the loss function afresh for each instance. Moreover, test-time optimization can be slow for a target task involving a large number of parameters, which often precludes a real-time solution.

In this work, motivated by the advances in machine learning, we address the following question: “can we learn the noise patterns in data, given thousands of different problem instances of a specific task, and regress the target parameters instead of optimizing them at test-time?” The answer is affirmative for some specific applications, and we propose a learning framework that outperforms baseline optimization methods on a geometric problem. We choose multiple rotation averaging (MRA) as the target application to validate our claim.

Fig. 1.

The proposed method NeuRoRA is a two-step approach: in the first step a graph-based network (CleanNet) is utilized to clean the view-graph by removing outliers and rectifying noisy measurements. An initialization from the cleaned view-graph, instantiated from a shortest path tree (SPT), is then further fine-tuned using a separate graph-based network (FineNet). The notations are outlined in Table 1.

In MRA, the task is to estimate the absolute orientations of cameras given some of their pairwise noisy relative orientations defined on a view-graph. Each problem instance of MRA involves a different number of cameras, which are usually only sparsely connected to each other. Further, the observed relative orientations are often corrupted by outliers. The conventional methods for this task [5, 8, 14, 23, 44] optimize the absolute orientations of the cameras to be most compatible (up to a robust cost) with the observed noisy relative orientations.

We propose a neural network for robust MRA. Our network is a combination of two simple four-layered message-passing neural networks defined on the view-graphs, summarized in Fig. 1. We name our method Neural Robust Rotation Averaging, which is abbreviated as NeuRoRA in the rest of the manuscript.

Contribution and Findings

  • A graph-based neural network NeuRoRA is proposed as an alternative to conventional optimizations for MRA.

  • NeuRoRA requires no explicit optimization at test-time and hence it is much faster (10–50\(\times \) on CPUs and 500–2000\(\times \) on GPUs) than the baselines.

  • The proposed NeuRoRA is more accurate than the conventional optimization methods of MRA. The mean/median orientation error of the predicted absolute orientations by the proposed method is \(1.45^{\circ }/0.74^{\circ }\), compared to \(2.17^{\circ }/1.25^{\circ }\) by an optimization method  [8].

  • Being a small network (size \({<}{0.5}\) MB), NeuRoRA is fast and can easily be deployed in real-world applications.

Fig. 2.

Here we qualitatively illustrate that the noise distribution in real data diverges considerably from the noise assumptions baked into most optimization methods. We plot the angles and axes of the observed relative orientations (first row) and of the noise (second row) in real datasets (for clarity, only \({10^3}\) random samples are displayed). The noise orientation is calculated from the ground-truth absolute and the observed relative orientations. Some of the view-graphs are shared by [46] and the rest by [12]. We plot histograms of the magnitudes of the angles (in degrees) and the axes of the orientations. Notice that the axes of the sampled relative and noise orientations for the real data in (a)–(c) are not uniformly distributed on the unit ball, and the sampled noise orientations (with somewhat vertical axes) are far from the distribution assumptions made by typical optimization algorithms; samples from two such assumed noise distributions (\(\ell _1\) [44] and \(\ell _{1/2}\) [8]) are also shown for comparison.

2 Related Works

We separate the related methods into three categories—(i) learning-based methods as an alternative to optimization, (ii) optimization methods specific to MRA, and (iii) other related optimization methods.

(i) Learning to optimize. A neural network is proposed as an alternative to non-linear least squares optimization in [11] for camera tracking and mapping. It exploits the least squares structure of the problem and uses a recurrent network to compute the update steps of the optimization variables. In a similar direction, [31] relaxes the assumptions made by inverse compositional algorithms for dense image alignment by incorporating data-driven priors. [40] proposes a bundle adjustment layer that learns to predict the dampening parameter of the Levenberg-Marquardt algorithm to optimize depth and camera parameters. In contrast to direct optimization-based methods that explicitly use regularizers to solve an ill-posed problem, [1] implicitly learns the prior through a data-driven method. Aoki et al. [3] proposed an iterative algorithm based on PointNet [35] for point-cloud registration as an alternative to direct optimization. Learning to predict an approximate solution to combinatorial optimization problems over graphs, e.g. minimum vertex cover and the traveling salesman problem, is proposed in [29]. Learning methods to optimize general black-box functions [10] have also received a lot of attention recently. These learning-based methods are tailored to specific problems, whereas in this work we are interested in an alternative learning-based solution for geometric problems, e.g. SfM.

(ii) Robust optimization for rotation averaging. MRA was first introduced in [18], where a linear solution was proposed using quaternion averaging, and later revisited in [19] using Lie-group-based averaging; in both cases the solutions were non-robust. Recently, there has been progress in designing robust algorithms [9, 22] for rotation averaging. Most of the algorithms are based on iterative methods for optimizing a robust loss function. MRA has also been addressed via sparse matrix decomposition, for example in [5, 44]. The state-of-the-art methods are listed below:

  • Chatterjee and Govindu [8] fine-tune an initialization by first performing an iterative \(\ell _1\) minimization, followed by iteratively reweighted least squares with a more robust \(\ell _\frac{1}{2}\) loss function.

  • Hartley et al. [22] propose a straightforward method that fine-tunes an initialization by the Weiszfeld algorithm of \(\ell _1\) averaging [7]. At every iteration, the absolute orientation of each camera is updated to the median of those computed from its neighbors.

  • Arrigoni et al. [5] formulate the problem as a low-rank and sparse matrix decomposition and utilize existing decomposition algorithms that cater for missing data, outliers and noise in the pairwise observations.

  • Wang et al.  [44] employ the alternating direction method to minimize a robust cost function involving the sum of unsquared deviations.

Rotation averaging has recently been surveyed extensively in the literature [4, 6, 34, 43].

(iii) Other related optimization methods. DISCO [12] employs a two-step approach: in the first step, loopy belief propagation is used for an initial estimate of the absolute orientations, which is then fine-tuned by the Levenberg-Marquardt method in the second step. The problem of detecting outliers in the view-graph has been studied extensively in the literature [13, 20, 27, 32, 36, 47]. Optimizing/cleaning the view-graph for SfM has also been proposed in [21, 37, 39].

Huang et al. [25] proposed a neural network that solves the pairwise matching problem (cf. [25], page 2, 2nd column) accurately. The key component of [25] is a network that takes two 3D scans and a relative transformation between them as input, and outputs a score indicating the goodness of the scan alignment, which is iteratively employed to fine-tune the absolute pose. Therefore, [25] is only valid (and tailored) for the alignment of multiple scans.

3 Multiple Rotation Averaging

Consider N cameras with M pairwise relative orientation measurements forming a directed view-graph \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\). A vertex \(\mathcal {V}_{v} \in \mathcal {V}\) corresponds to the absolute orientation \(\widehat{R}_{v}\) (to be estimated) of the vth camera, and an edge \(\mathcal {E}_{uv} \in \mathcal {E}\) corresponds to the observed relative orientation \(\widetilde{R}_{uv}\) from the uth camera to the vth camera. The task is to estimate the absolute orientations \(\{\widehat{R}_{v}\}\), with respect to a global reference orientation, such that the estimated orientations are most consistent with the observed noisy relative orientation measurements, i.e. \(\widetilde{R}_{uv} \approx \widehat{R}_{v} \widehat{R}_{u}^{-1}, \forall \mathcal {E}_{uv} \in \mathcal {E}\). Further, the observed measurements are corrupted by outliers, i.e. some of the orientations \(\widetilde{R}_{uv}\) are far from \( \widehat{R}_{v} \widehat{R}_{u}^{-1}\). Conventionally, the solution is obtained by minimizing a robust cost function that penalizes the discrepancy between the observed noisy relative orientations \(\{\widetilde{R}_{uv}\}\) and the estimated relative orientations \(\{R_{uv}^*\} := \{R_{v}^*{R_{u}^*}^{-1}\}\). The corresponding optimization problem can then be expressed as

$$\begin{aligned} \mathop {\text {arg min}}\limits _{\{R_{v}^*\}} \sum _{\mathcal {E}_{uv} \in \mathcal {E}}\rho \Big (d\big (R_{uv}^*, \widetilde{R}_{uv}\big )\Big ) \end{aligned}$$
(1)

where \(\rho (.)\) is a robust cost and d(., .) is a distance measure between orientations. This is a complex multi-variable nonlinear optimization problem with thousands of variables (for thousands of cameras), and there appears to be no direct (closed-form) method for minimizing the above cost even in the absence of outliers [23].
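To make the objective concrete, the following is a minimal sketch (not the authors' implementation) of evaluating the cost (1) for a candidate set of absolute orientations; the \(\ell _{1/2}\) robust cost of [8] and the angle metric are illustrative choices, and scipy's rotation utilities are used purely for convenience.

```python
# A minimal sketch of evaluating the MRA cost in Eq. (1), assuming the
# l_1/2 robust cost of [8] and the geodesic (angle) metric.
import numpy as np
from scipy.spatial.transform import Rotation as R

def rho(theta):
    """Example robust cost: rho(theta) = theta^(1/2)."""
    return np.sqrt(theta)

def mra_cost(R_abs, edges, R_rel_obs):
    """Sum of rho(d(R_v R_u^{-1}, R~_uv)) over all observed edges.

    R_abs: list of scipy Rotations (candidate absolute orientations);
    edges: list of (u, v) index pairs, aligned with R_rel_obs.
    """
    cost = 0.0
    for (u, v), R_uv in zip(edges, R_rel_obs):
        R_uv_est = R_abs[v] * R_abs[u].inv()           # estimated relative rotation
        d_theta = (R_uv_est.inv() * R_uv).magnitude()  # geodesic distance (radians)
        cost += rho(d_theta)
    return cost
```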

The Choice of Distance Measure \(d(\widetilde{R}, R)\). There are three commonly used distance measures on the rotation group SO(3): (i) the geodesic or angle metric \(d_\theta = \angle (\widetilde{R}, R)\), (ii) the chordal metric \(d_{C} = \Vert \widetilde{R} - R \Vert _F\), and (iii) the quaternion metric \(d_{Q} = \min { \{\Vert q_{\widetilde{R}} - q_{R} \Vert , \Vert q_{\widetilde{R}} + q_{R} \Vert \}}\), where \(q_{R}\) and \(q_{\widetilde{R}}\) are the quaternion representations of R and \(\widetilde{R}\) respectively, and \(\Vert .\Vert _F\) is the Frobenius norm. The metrics \(d_C\) and \(d_{Q}\) are proven to be \(2\sqrt{2}\sin (d_\theta /2)\) and \(2\sin (d_\theta /4)\) respectively [23]; thus, all three metrics agree to first order. In our implementation, we employ quaternion representations (with non-negative scalar parts).
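These relations can be checked numerically; the short sketch below (an illustration using scipy, not part of the paper's code) verifies the two identities quoted above for a random pair of rotations.

```python
# Numerical check of d_C = 2*sqrt(2)*sin(d_theta/2) and d_Q = 2*sin(d_theta/4).
import numpy as np
from scipy.spatial.transform import Rotation as R

r1, r2 = R.random(), R.random()

d_theta = (r1.inv() * r2).magnitude()                   # geodesic / angle metric
d_C = np.linalg.norm(r1.as_matrix() - r2.as_matrix())   # chordal (Frobenius) metric
q1, q2 = r1.as_quat(), r2.as_quat()
d_Q = min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2))  # quaternion metric

assert np.isclose(d_C, 2 * np.sqrt(2) * np.sin(d_theta / 2))
assert np.isclose(d_Q, 2 * np.sin(d_theta / 4))
```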

The Choice of Robust Cost \(\rho (.)\). In practical applications, e.g. robot navigation, the agent usually ends up with some corrupted measurements (outliers), caused by symmetric and repetitive man-made structures, in addition to sensor noise. To estimate absolute orientations of the cameras that are immune to such outliers, conventional methods optimize a robust cost \(\rho (.)\) as discussed above. An exhaustive list of such robust functions can be found in [8].

The noise and the outliers in the observed relative orientations are assumed to follow certain distributions (dictated by the choice of cost function) centered at the identity orientation [23, 44]. However, in real data we observe very different noise distributions; a few such examples are shown in Fig. 2. Further, the optical axes of most cameras are horizontal, and hence the axes of the relative orientations are close to vertical. By training a neural network to perform the task, our aim is for the network to capture such patterns while predicting the absolute orientations.

4 Learning to Predict Absolute Orientations

Let \(\mathcal {D}:= \{\mathcal {G}\}\) be a dataset of ground-truth view-graphs. Each view-graph \(\mathcal {G}:=(\mathcal {V}, \mathcal {E})\) contains a noisy relative orientation measurement \(\widetilde{R}_{uv}\) for each edge \(\mathcal {E}_{uv} \in \mathcal {E}\) and a ground-truth absolute orientation \( \widehat{R}_{v}\) for each camera \(\mathcal {V}_v \in \mathcal {V}\). The desired neural network learns a mapping \(\varPhi \) that takes the noisy relative measurements \(\{\widetilde{R}_{uv}\}\) as input and predicts the absolute orientations \(\{R^{\varPhi }_{v}\}:=\varPhi \Big (\{\widetilde{R}_{uv}\};\;\varTheta \Big )\) as output, where \(\varTheta \) is the set of network parameters. To train the parameters of such a network, one could minimize the discrepancy between the ground-truth \(\widehat{R}_{uv}:= \widehat{R}_{v} \widehat{R}_{u}^{-1}\) and the estimated \(R_{uv}^{\varPhi } := R_{v}^{\varPhi } {R_{u}^{\varPhi }}^{-1}\) relative orientations (cf. Eq. (1)), i.e.

$$\begin{aligned} \mathop {\text {arg min}}\limits _{\varTheta } \sum _{\mathcal {G}\in \mathcal {D}}\sum _{\mathcal {E}_{uv} \in \mathcal {E}} d\big (R_{uv}^{\varPhi }, \widehat{R}_{uv}\big ) \end{aligned}$$
(2)

In contrast to (1), where conventional methods optimize the orientation parameters for each instance of the view-graph \(\mathcal {G}\in \mathcal {D}\), in (2) the network parameters \(\varTheta \) are optimized during training so that the mapping \(\varPhi \) from the observed relative orientations \(\{\widetilde{R}_{uv}\}\) to the target absolute orientations \(\{\widehat{R}_{v}\}\) is learned effectively over the entire dataset of view-graphs \(\mathcal {D}\), i.e. \(\{\widehat{R}_{v}\} \approx \varPhi \Big (\{\widetilde{R}_{uv}\};\;\varTheta \Big )\).

Direct Training of \(\varPhi \) and Gauge Freedom. For an arbitrary orientation R,

$$\begin{aligned} R_{uv}^{*} := R_{v}^{*} {R_{u}^{*}}^{-1} = ( R_{v}^{*}R){(R_{u}^{*}R)}^{-1},\;\;\; \forall \mathcal {E}_{uv} \in \mathcal {E}\end{aligned}$$
(3)

Therefore, \(\{R^{*}_{v}\}\) and \(\{R^{*}_{v} R\}\) essentially represent the same solution to the MRA problem (1), and there is a gauge freedom of degree 3. The mapping \(\varPhi \) is thus one-to-many, as \(\{R^{\varPhi }_{v}\}\) and \(\{R^{\varPhi }_{v} R\}\) correspond to the same cost (2). This gauge freedom makes it difficult to train such a network. Alternatively, one could choose a direct cost (with no associated gauge freedom) to learn a one-to-one mapping \(\varPhi \), e.g.

$$\begin{aligned} \mathop {\text {arg min}}\limits _{\varTheta } \sum _{\mathcal {G}\in \mathcal {D}} \sum _{\mathcal {V}_v \in \mathcal {V}} d\big ( R_{v}^{\varPhi }, \widehat{R}_{v} \big ) \end{aligned}$$
(4)

where the reference orientation is fixed according to the ground-truth. Again, \(\{\widehat{R}_{v}\}\) and \(\{\widehat{R}_{v} R\}\) represent the same ground-truth with the reference orientation fixed in different directions. One could resolve the issue by fixing the reference orientation to the orientation of the first camera in all the view-graphs in \(\mathcal {D}\). However, in a graph (set representation), the nodes are permutation invariant; thus the choice of the first camera, and hence of the reference orientation, is arbitrary. Therefore, one would need to pass the reference orientation or the index of the first camera (possibly via a binary encoding) to the network as an additional input to be able to train such a network. Instead, we employ an alternative strategy adopted from the conventional optimization methods [8, 23], i.e. initialize a solution of the absolute orientations under a fixed reference and pass the initialization to the network to fine-tune the solution. The network receives the reference orientation as an additional input via the initialization (see Fig. 1(d)) and regresses the parameters, i.e. \(\{\widehat{R}_{v} \}\approx \varPhi \Big (\{\widetilde{R}_{uv}\}, \{\widetilde{R}_v\};\;\varTheta \Big )\). Further, we train the network by minimizing a combined cost where the first term (2) enforces consistency over the entire graph and the second term (4) enforces a unique solution, i.e.

$$\begin{aligned} \mathop {\text {arg min}}\limits _{\varTheta } \sum _{\mathcal {G}\in \mathcal {D}} \Big ( \sum _{\mathcal {E}_{uv} \in \mathcal {E}} d\big (R_{uv}^{\varPhi }, \widehat{R}_{uv}\big ) + \beta \sum _{\mathcal {V}_v \in \mathcal {V}} d\big ( R_{v}^{\varPhi }, \widehat{R}_{v} \big )\Big ) \end{aligned}$$
(5)

where \(\beta \) is a weight parameter. Note that the reference orientation is now fixed at the orientation of a certain camera c in the initialization \(\{\widetilde{R}_{v}\}\) as well as in the ground-truth absolute orientations \(\{ \widehat{R}_{v} \}\). Although the choice of c is not critical, in practice the camera c with the most neighboring cameras is chosen as the reference, i.e. \(\widetilde{R}_{c}\) = \(\widehat{R}_{c}\) = \(I_{3\times 3}\).
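For illustration, this gauge fixing amounts to a simple re-anchoring of a set of orientations; the following is a minimal sketch under our own conventions (scipy rotations, an adjacency-list view-graph), not the paper's code.

```python
# Re-anchor the absolute orientations so that the reference camera c (the
# node with the most neighbours) has identity orientation. Right-multiplying
# every R_v by R_c^{-1} leaves all relative rotations R_v R_u^{-1} unchanged.
from scipy.spatial.transform import Rotation as R

def fix_gauge(R_abs, neighbors):
    """R_abs: list of scipy Rotations; neighbors[v]: list of nodes adjacent to v."""
    c = max(range(len(neighbors)), key=lambda v: len(neighbors[v]))
    R_c_inv = R_abs[c].inv()
    return [r * R_c_inv for r in R_abs], c   # now the returned R_abs[c] == identity
```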

The above mapping \(\varPhi \) is now one-to-one. However, it requires an initialization \(\{\widetilde{R}_{v}\}\) as an additional input. Conventional methods initialize the absolute orientations using a spanning tree of the view-graph. However, even a single outlier in that spanning tree can lead to a very poor initialization, so it is very important to identify such outliers beforehand. Further, the noise in the relative orientation along each edge of the spanning tree propagates to the subsequent nodes while computing the initial absolute orientations. Thus, we first clean the view-graph by removing the outliers and rectifying the noisy measurements, and then bootstrap an initialization from the cleaned view-graph.

Cleaning the View-Graph. Given the local structure of the view-graph, i.e. the measurements of all the edges that the pair of adjacent nodes \(\{\mathcal {V}_{u}, \mathcal {V}_{v}\}\) is connected to (and possibly subsequent edges), an outlier edge \(\mathcal {E}_{uv}\) can be detected. Specifically, chaining the relative orientations along a cycle in this local structure yields an orientation close to the identity for a clean cycle, whereas a large deviation indicates an outlier somewhere in the cycle. The presence of outliers in multiple such cycles through the current edge indicates that the edge itself is an outlier (a small sketch of this cycle test follows the summary below). Instead of designing such explicit algorithms, we use another neural network to clean the graph. The proposed method can be summarized as follows:

  • A graph-based network is employed to clean the view-graph by removing outlier measurements and rectifying noisy measurements (see Sect. 4.2).

  • The cleaned view-graph is then utilized to initialize the absolute orientations (see Sect. 4.3).

  • The initialization is fine-tuned using a separate network (see Sect. 4.4).
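The cycle test mentioned above can be made concrete in a few lines; the sketch below (an illustration, not CleanNet itself) scores a 3-cycle by the angle of the chained rotation.

```python
# Chaining relative rotations around a 3-cycle (u -> v -> w -> u): with clean
# measurements, R_wu R_vw R_uv = (R_u R_w^-1)(R_w R_v^-1)(R_v R_u^-1) = I, so
# a large residual angle suggests an outlier on one of the three edges.
import numpy as np
from scipy.spatial.transform import Rotation as R

def cycle_residual_deg(R_uv, R_vw, R_wu):
    """Angle (degrees) of the chained rotation; near zero for a clean cycle."""
    return np.degrees((R_wu * R_vw * R_uv).magnitude())
```

Aggregating such residuals over many cycles through an edge gives an explicit outlier score; CleanNet instead learns this cue from data.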

For clarity of the rest of the paper, the notations are outlined in Table 1.

4.1 The Network Design Choice

Generalizing convolution operators to irregular domains, such as graphs, is typically expressed as a neighborhood aggregation or message-passing scheme. The proposed network is built using such Message-Passing Neural Networks (MPNNs) [17], operating directly on view-graphs \(\mathcal {G}\). An MPNN is defined in terms of message functions \(\phi ^{(t)}\) and update functions \(\gamma ^{(t)}\) that run for T time-steps (layers). At each time-step, the hidden state \(h_v^{(t)}\) at each node (feature) in the graph is updated according to

$$\begin{aligned} h_v^{(t)} = \gamma ^{(t)}\big ( h_v^{(t - 1 )}, m_v^{(t)} \big ) \end{aligned}$$
(6)

where \(m_v^{(t)}\) is the condensed message at node v, coming from the neighboring nodes \(u \in \mathcal {N}_v\), and can be expressed as follows:

$$\begin{aligned} m_v^{(t)} = \square _{\mathcal {V}_{u} \in \mathcal {N}_v} \phi ^{(t)}\big ( h_v^{(t - 1 )}, h_u^{(t - 1 )}, e_{uv}\big ) \end{aligned}$$
(7)

where \(\square \) denotes a differentiable, permutation-invariant symmetric function, e.g. mean, soft-max, etc.; \(\gamma ^{(t)}\) and \(\phi ^{(t)}\) are concatenation operations followed by 1-D convolutions and ReLUs; \(e_{uv}\) is the edge feature of the edge \(\mathcal {E}_{uv}\); \(h^{(t)}_{u\rightarrow v}:=\phi ^{(t)}\big ( h_v^{(t - 1 )}, h_u^{(t - 1 )}, e_{uv}\big )\) is the accumulated message along the edge \(\mathcal {E}_{uv}\) at time-step t; and \(\mathcal {N}_v\) is the set of all neighboring cameras connected to \(\mathcal {V}_v\). A diagram of the elements involved in computing the next-level features is shown in Fig. 3.
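A minimal PyTorch sketch of one such message-passing step is given below; it assumes mean aggregation for \(\square \) and plain linear layers standing in for the 1-D convolutions of \(\gamma ^{(t)}\) and \(\phi ^{(t)}\) (on per-node feature vectors the two are equivalent), with illustrative dimensions (quaternion edge features, 32-dimensional hidden states).

```python
# One message-passing step implementing Eqs. (6)-(7); a sketch, not the
# released NeuRoRA code. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    def __init__(self, h_dim=32, e_dim=4):
        super().__init__()
        # phi: concat(h_v, h_u, e_uv) -> message, Eq. (7)
        self.phi = nn.Sequential(nn.Linear(2 * h_dim + e_dim, h_dim), nn.ReLU())
        # gamma: concat(previous state, aggregated message) -> new state, Eq. (6)
        self.gamma = nn.Sequential(nn.Linear(2 * h_dim, h_dim), nn.ReLU())

    def forward(self, h, edge_index, e):
        # h: (N, h_dim) node states; edge_index: (2, E) with rows (u, v);
        # e: (E, e_dim) edge features, e.g. relative-orientation quaternions.
        u, v = edge_index
        msg = self.phi(torch.cat([h[v], h[u], e], dim=-1))     # h_{u->v}
        agg = torch.zeros_like(h).index_add_(0, v, msg)        # sum per target node
        deg = torch.bincount(v, minlength=h.size(0)).clamp(min=1)
        agg = agg / deg.unsqueeze(-1).float()                  # mean aggregation
        return self.gamma(torch.cat([h, agg], dim=-1))         # updated h_v
```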

Table 1. The notations and symbols used in the manuscript

4.2 View-Graph Cleaning Network

The view-graph cleaning network (CleanNet) is built on an MPNN. The input to CleanNet is a noisy view-graph and the output is a clean one, i.e. the network takes the noisy relative orientations \(\widetilde{R}_{uv}\) as the edge features \(e_{uv}\) and predicts the noise-rectified relative orientations \(R_{uv}^{*}\) from the accumulated message \(h^{(T)}_{u\rightarrow v}:=\phi ^{(T)}\big ( h_v^{(T - 1 )}, h_u^{(T - 1 )}, e_{uv}\big )\) at the last layer. It also predicts a score \(\alpha _{uv}^{*}\) depicting the probability of the edge \(\mathcal {E}_{uv}\) being an outlier, i.e.

$$\begin{aligned} R_{uv}^{*} = lp_1\Big (h^{(T)}_{u\rightarrow v}\Big ) \star \widetilde{R}_{uv} ~~\text {and}~~ \alpha _{uv}^{*} = lp_2\Big (h^{(T)}_{u\rightarrow v}\Big ) \end{aligned}$$
(8)

where \(lp_1(.)\) and \(lp_2(.)\) are single-layered linear perceptrons that map the accumulated messages to the edge noise orientation and the outlier score respectively, and ‘\(\star \)’ denotes matrix multiplication. The hidden states are initialized by null vectors, i.e. \(h_v^{(0)} = \emptyset \). Note that instead of directly estimating the rectified orientations, we predict the noise in the relative orientation measurements, which is then multiplied with the measurements to obtain the rectified orientations. The loss is chosen as a weighted combination of the mean orientation error \(\mathcal {L}_{mre}\) between the rectified \( R_{uv}^{*}\) and ground-truth \(\widehat{R}_{uv} := \widehat{R}_{v} \widehat{R}_{u}^{-1}\) relative orientations, and the mean binary cross-entropy error \(\mathcal {L}_{bce}\) between the predicted \( \alpha _{uv}^{*}\) and the ground-truth outlier score \(\widehat{\alpha }_{uv}\), i.e.

$$\begin{aligned} \mathcal {L}= \sum _{\mathcal {G}\in \mathcal {D}} \sum _{\mathcal {E}_{uv} \in \mathcal {E}} \Big ( \mathcal {L}_{mre}\big (R_{uv}^{*}, \widehat{R}_{uv}\big ) + \lambda \mathcal {L}_{bce}\big ( \alpha _{uv}^{*}, \widehat{\alpha }_{uv}\big ) \Big ) \end{aligned}$$
(9)

where \(\lambda \) is a weight parameter (fixed as \(\lambda =10\)). We represent the orientations by unit quaternions and the predictions are normalized accordingly. The error in the prediction is also normalized by the degrees of the adjacent nodes, i.e.

$$\begin{aligned} \mathcal {L}_{mre}\big (R_{uv}^{*}, \widehat{R}_{uv} \big ) = \frac{1}{|\mathcal {N}_v||\mathcal {N}_u|} d_{Q}\big (\frac{R_{uv}^{*}}{\Vert R_{uv}^{*}\Vert _2}, \widehat{R}_{uv}\big ) \end{aligned}$$
(10)

Experimentally, we observed that the above loss produces superior performance to the standard discrepancy loss (2). Note that the ground-truth outlier score \(\widehat{\alpha }_{uv}\) is generated based on the amount of noise in the relative orientation. Specifically, if the angle of the noise orientation \(\widehat{R}_{v}^{-1} \widetilde{R}_{uv} \widehat{R}_{u}\) exceeds \(20^{\circ }\), the ground-truth edge label is assigned as an outlier, i.e. \(\widehat{\alpha }_{uv} = 1\), and \(\widehat{\alpha }_{uv} = 0\) otherwise.
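A hedged sketch of this labelling rule (our reading of it, using scipy rotations) is:

```python
# Ground-truth outlier labels for CleanNet: alpha_hat = 1 when the observed
# relative rotation deviates from the ground-truth relative rotation by more
# than 20 degrees. Conjugation does not change a rotation's angle, so this
# matches the noise-orientation form in the text.
import numpy as np
from scipy.spatial.transform import Rotation as R

def outlier_label(R_uv_obs, R_u_gt, R_v_gt, thresh_deg=20.0):
    R_uv_gt = R_v_gt * R_u_gt.inv()      # ground-truth relative rotation
    noise = R_uv_obs * R_uv_gt.inv()     # residual rotation on the edge
    return 1.0 if np.degrees(noise.magnitude()) > thresh_deg else 0.0
```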

An edge \(\mathcal {E}_{uv}\) is marked as an outlier if the predicted outlier score \( \alpha _{uv}^{*}\) is greater than a predefined threshold \(\epsilon \); in all of our experiments we choose \(\epsilon = 0.75\). A cleaned view-graph \( \mathcal {G}^{*}\) is then generated by removing the outlier edges from \(\mathcal {G}\) and replacing the noisy relative orientations \(\widetilde{R}_{uv}\) by the rectified orientations \( R_{uv}^{*}\). Note that the cleaned graph \(\mathcal {G}^{*}\) is only employed to bootstrap an initialization of the absolute orientations.

Fig. 3.

An illustration of computing the next-level features in a message-passing network.

4.3 Bootstrapping Absolute Orientations

Hartley et al. [22] proposed generating a spanning tree by setting the camera with the maximum number of neighbors as the root and recursively adding adjacent cameras without forming a cycle. The reference orientation is fixed at the root of the spanning tree. The orientations of the remaining cameras in the tree are computed by propagating the rectified orientations \(R_{uv}^{*}\) away from the root node along the edges, i.e. \(\widetilde{R}_{v} = R_{uv}^{*} \widetilde{R}_{u}\).

As discussed before, the noise \(R_{uv}^{*} \widehat{R}_{uv}^{-1}\) in the relative orientation along each edge propagates to the subsequent nodes while computing the initial absolute orientations of the cameras. Therefore, the spanning tree that minimizes the sum of the depths of all nodes (a.k.a. the shortest path tree [41]) is the best spanning tree for the initialization. Starting from a root node, a shortest path tree can be computed by greedily connecting nodes to each neighboring node in breadth-first order. The best shortest path tree can be found by applying the same procedure with each node as the root (time complexity \(\mathcal {O}(n^2)\)) [24]. However, we apply the procedure just once (time complexity \(\mathcal {O}(n)\)), rooted at the node with the maximum number of adjacent nodes (similar to Hartley et al. [22]), and observed results similar to those with the best spanning tree. The reference orientation of the initialization and of the ground-truth is fixed at the root of the tree. This procedure is very fast, taking only a fraction of a second even for a large view-graph with thousands of cameras. We abbreviate this procedure as SPT and it is the default initializer in all of our experiments.
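A minimal sketch of this bootstrapping step (our own rendering, assuming the cleaned view-graph is given as adjacency lists with rectified rotations) is:

```python
# Breadth-first propagation from the highest-degree root yields a shortest
# path tree; absolute orientations are chained along its edges, R_v = R*_uv R_u.
from collections import deque
from scipy.spatial.transform import Rotation as R

def bootstrap_orientations(neighbors, R_rel):
    """neighbors[u]: list of adjacent nodes; R_rel[(u, v)]: rectified R*_uv."""
    root = max(range(len(neighbors)), key=lambda v: len(neighbors[v]))
    R_init = {root: R.identity()}            # reference fixed at the root
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in R_init:              # breadth-first => shortest paths
                R_init[v] = R_rel[(u, v)] * R_init[u]
                queue.append(v)
    return R_init
```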

4.4 Fine-Tuning Network

The fine-tuning network (FineNet) is again built on an MPNN. It takes the initial absolute orientations \(\{\widetilde{R}_{v}\}\) and the relative orientation measurements \(\{\widetilde{R}_{uv}\}\) as inputs, and predicts the refined absolute orientations \(\{R_{v}^{*}\}\) as the output. The refined orientations are estimated from the hidden states \(h_v^{(T)}\) of the nodes at the last layer of the network, i.e.

$$\begin{aligned} R_{v}^{*} = lp_3 \big ( h_v^{(T)} \big ) \star \widetilde{R}_{v} \end{aligned}$$
(11)

where \(lp_3\) is a single-layered linear perceptron. We initialize the hidden states of the MPNN with the initial orientations, i.e. \(h_v^{(0)} = \widetilde{R}_{v}\). The edge attributes are chosen as the relative discrepancy between the initial and the observed relative orientations, i.e. \(e_{uv} = \widetilde{R}_v^{-1} \widetilde{R}_{uv} \widetilde{R}_u\). The loss for the fine-tuning network is computed as the weighted sum of an edge consistency loss and the rotational distance between the predicted orientations \(R_{v}^{*}\) and the ground-truth orientations \(\widehat{R}_{v}\), i.e.

$$\begin{aligned} \mathcal {L}= \sum _{\mathcal {G}\in \mathcal {D}} \Big ( \sum _{\mathcal {E}_{uv} \in \mathcal {E}} \mathcal {L}_{mre}\big ( R_{uv}^{*}, \widehat{R}_{uv}\big ) + \beta \sum _{\mathcal {V}_v \in \mathcal {V}} \frac{1}{|\mathcal {N}_v|} d_{Q}\big ( \frac{R_{v}^{*}}{\Vert R_{v}^{*}\Vert }, \widehat{R}_{v} \big ) \Big ) \end{aligned}$$
(12)

where \(\mathcal {L}_{mre}\) is chosen as the quaternion distance (10). This is a combination of two loss functions chosen according to (5). We value consistency of the entire graph (enforced via relative orientations in the first term) over individual accuracy (second term), and so choose \(\beta = 0.1\).
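A hedged PyTorch sketch of this combined loss is shown below, with quaternions stored row-wise in batched tensors; the helper names and the degree-normalizer arguments are illustrative, following the normalizations in (10) and (12).

```python
# Combined FineNet loss (Eq. 12): an edge-consistency term plus a
# beta-weighted anchoring term on the absolute orientations.
import torch

def quat_distance(q1, q2):
    """d_Q = min(||q1 - q2||, ||q1 + q2||), computed row-wise."""
    return torch.minimum((q1 - q2).norm(dim=-1), (q1 + q2).norm(dim=-1))

def finenet_loss(q_rel_pred, q_rel_gt, q_abs_pred, q_abs_gt,
                 deg_uv, deg_v, beta=0.1):
    # deg_uv: per-edge normalizer |N_v||N_u|; deg_v: per-node normalizer |N_v|.
    q_rel_pred = q_rel_pred / q_rel_pred.norm(dim=-1, keepdim=True)  # R*/||R*||
    q_abs_pred = q_abs_pred / q_abs_pred.norm(dim=-1, keepdim=True)
    edge_term = (quat_distance(q_rel_pred, q_rel_gt) / deg_uv).sum()
    node_term = (quat_distance(q_abs_pred, q_abs_gt) / deg_v).sum()
    return edge_term + beta * node_term
```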

4.5 Training

The view-graph cleaning network and the fine-tuning network are trained separately. For each edge \(\mathcal {E}_{uv}\) in the view-graph with observed orientation \(\widetilde{R}_{uv}\), an additional edge \(\mathcal {E}_{vu}\) is included in the opposite direction with orientation \(\widetilde{R}_{vu} := \widetilde{R}_{uv}^{-1}\). This ensures that messages flow in both directions along each edge (a one-function sketch follows below). In both networks, the parameters are chosen as: the number of time-steps \(T=4\), the permutation-invariant function \(\square \) as the mean, and the lengths of the message \(m_v^{(t)}\) and the hidden state \(h_v^{(t)}\) both 32.
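The sketch below illustrates this symmetrization under the same illustrative conventions as before (scipy rotations, an edge list with one rotation per edge).

```python
# For every observed edge (u, v), add the opposite edge (v, u) with the
# inverse rotation so that messages flow in both directions of the edge.
def symmetrize_view_graph(edges, R_rel):
    for (u, v) in list(edges):                 # iterate over a snapshot
        edges.append((v, u))
        R_rel[(v, u)] = R_rel[(u, v)].inv()    # R~_vu := R~_uv^{-1}
    return edges, R_rel
```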

Network Parameters \(\varTheta \): The parameters comprise: (i) the 1-D convolutions of \(\gamma ^{(t)}\) in (6) and \(\phi ^{(t)}\) in (7), and (ii) the linear perceptrons \(lp_1\) and \(lp_2\) in (8) and \(lp_3\) in (11). With the above hyper-parameters (i.e. time-steps, length of the messages, etc.), the total number of parameters of NeuRoRA is \({\approx }49.8\)K. Note that increasing these hyper-parameters leads to a bigger network (more parameters) and hence slower inference, while a smaller network is faster but incapable of predicting accurate outputs for larger view-graphs. We tried different network sizes and found that the current size is a good balance between speed and accuracy.

Architecture Setup: The networks are implemented in PyTorch and trained on a GTX 1080 Ti GPU with a learning rate of \(0.5 \times 10^{-4}\) and weight decay \(10^{-4}\). CleanNet and FineNet are each trained for 250 epochs (taking \(\sim \)4–6 h) to learn the network parameters. To prevent the networks from over-fitting on the training dataset, we randomly drop \(25\%\) of the edges of each view-graph (along with their observed noisy relative orientations) in each epoch; during testing, all the edges are kept active. The parameters that yielded the minimum validation loss were kept for evaluation. All the baselines, including the proposed networks, were evaluated on an Intel Core i7 CPU.

5 Results

Experiments were carried out on synthetic as well as real datasets to demonstrate the true potential of our approach.

Baseline Methods. We evaluated NeuRoRA against the following baseline methods (also described in Sect. 2):

  • Chatterjee and Govindu [8]: the latest implementation with the default parameters and cost function in the shared scripts was employed. We also adopted their evaluation strategy to compare the predicted orientations.

  • Weiszfeld algorithm [22]: the algorithm is straightforward but computationally expensive, so we ran it for only 50 iterations of \(\ell _1\) averaging.

  • Arrigoni et al. [5]: the authors shared the code with the optimal parameters.

  • Wang and Singer [44]: this is employed with publicly available scripts.

We also ran the graph-cleaning network (CleanNet) followed by bootstrapping the initial orientations (using SPT) as a baseline, CleanNet-SPT, and ran SPT on the noisy graph followed by the fine-tuning network (FineNet) as another baseline, SPT-FineNet. Note that the proposed network NeuRoRA takes CleanNet-SPT as an initialization and then fine-tunes it in a single step of FineNet. NeuRoRA-v2 is a variation of the proposed method where an initialization from CleanNet-SPT is fine-tuned in two steps of FineNet, i.e. the output of FineNet in the first step is fed as the initialization of FineNet in the second step.

Synthetic Dataset. We carefully designed a synthetic dataset that closely resembles real-world datasets: the amount of noise in the observed relative measurements changes with the sensor type (e.g. camera device), while the structure of the connections in the view-graphs and the outlier ratios vary with the scene (Fig. 2). A single view-graph was generated as follows: (1) the number of cameras was sampled in the range 250–1000 and their orientations were generated randomly on a horizontal plane (yaw only), (2) pairwise edges and corresponding relative orientations were randomly introduced between the cameras, amounting to 10–30% of all possible pairs, (3) the relative orientations were then corrupted by noise with a standard deviation \(\sigma \) chosen uniformly in the range \(5^{\circ }\)–\(30^{\circ }\) once for the entire view-graph, with the noise directions chosen randomly on the vertical plane (to emulate the realistic distributions of Fig. 2), and (4) the relative orientations were further corrupted by 0–30% outliers with random orientations (a sketch of this generation procedure is shown below). Our synthetic dataset consisted of 1200 sparse view-graphs, divided into training (\(80\%\)), validation (\(10\%\)), and testing (\(10\%\)).
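A hedged sketch of generating one such view-graph, following steps (1)–(4) above (the exact sampling details are our reading of the text, not a released protocol), is:

```python
# Generate one synthetic view-graph following steps (1)-(4).
import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)

n = int(rng.integers(250, 1001))                    # (1) number of cameras
R_abs = R.from_euler('z', rng.uniform(0, 360, n), degrees=True)  # yaw-only poses

density = rng.uniform(0.10, 0.30)                   # (2) sparse random edges
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if rng.random() < density]

sigma = rng.uniform(5.0, 30.0)                      # (3) one noise std per graph
def noise_axis():                                   # random axis on a vertical plane
    t = rng.uniform(0.0, 2.0 * np.pi)
    return np.array([np.cos(t), 0.0, np.sin(t)])

R_rel = {}
for (u, v) in edges:
    noise = R.from_rotvec(np.radians(rng.normal(0.0, sigma)) * noise_axis())
    R_rel[(u, v)] = noise * R_abs[v] * R_abs[u].inv()

n_out = int(rng.uniform(0.0, 0.30) * len(edges))    # (4) gross outliers
for i in rng.choice(len(edges), size=n_out, replace=False):
    R_rel[edges[i]] = R.random()
```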

The results are furnished in Table 2, which reports the average angular error over all the view-graphs in the test set. The proposed method NeuRoRA performs remarkably well compared to the baselines in terms of accuracy and speed, and NeuRoRA-v2 further improves the results. Overall, Chatterjee [8] performs well, but its performance does not improve with a better initialization. Unlike Wang [44], Weiszfeld [22] improves with the better initialization given by CleanNet-SPT, but it cannot improve the solution further given an even better initialization from NeuRoRA. Notice that the proposed NeuRoRA is three orders of magnitude faster with a GPU than the baseline methods.

Table 2. Results of rotation averaging on the synthetic test set. The average angular error over all the view-graphs in our dataset is displayed. The proposed method NeuRoRA is remarkably faster than the baselines while producing better results. There are no GPU implementations of [5, 8, 22, 44] available; thus the runtime comparisons on CUDA are excluded. Note that NeuRoRA takes only \(\varvec{0.0016}\) s on average on a GPU.

Real Dataset. We summarize the real datasets in Table 3. There are a total of 19 publicly available view-graphs with observed noisy relative orientations and ground-truth absolute orientations. The ground-truth orientations were obtained by applying incremental bundle adjustment [2] to the view-graphs. The TNotreDame dataset is shared by Chatterjee et al. [8]. The Artsquad and SanFrancisco datasets are provided by DISCO [12]. The rest of the view-graphs are publicly shared by 1DSFM [46]. The ground-truth orientations are available for only some of the cameras (indicated in parentheses), and training, validation and testing are performed only on those cameras.

Due to the limited availability of real datasets for training, we employed network parameters pre-trained on the above synthetic dataset and further fine-tuned them on the real datasets in a round-robin fashion (leave one out). This evaluation protocol is employed because we did not want to divide the sequences into fixed training and testing splits, which might favor one particular method. The fine-tuning in each round of the round-robin uses all the real data apart from the held-out test sequence. Overall, the proposed NeuRoRA outperformed the baselines on this task in terms of accuracy and efficiency (Table 3). The Artsquad and SanFrancisco datasets, shared from a different source [12], have different orientation patterns; in particular, the SanFrancisco dataset is captured along a road, which is significantly different from the others. Thus, the performance of NeuRoRA falls short of Chatterjee [8] and Wang [44] only on those two sequences, but is still better than Weiszfeld [22] and Arrigoni [5]. Nonetheless, the proposed NeuRoRA is much faster than the others.

Table 3. Results of MRA on real datasets. The proposed method NeuRoRA is much faster than the baselines while producing overall similar or better results. The number of cameras for which ground truth is available is shown in parentheses.
Table 4. Robustness check: results of rotation averaging on multiple datasets, where NeuRoRA is trained on one dataset and evaluated on the others.

Robustness Check. In this experiment we study the generalization capability of NeuRoRA. To check the individual effects of different sensor settings, we generate a number of synthetic datasets varying (i) the number of cameras, (ii) the number of edges, (iii) the amount of noise and outliers, and (iv) planar vs. random camera motion. NeuRoRA is then trained on one such dataset and evaluated on the others. Each dataset consists of 1000 view-graphs (the large ones contain only 100). Results are furnished in Table 4. Notice that NeuRoRA generalizes well across dataset changes, except when the network is trained on planar camera motion and tested on random motion. We therefore advise using two separate networks for planar and non-planar scenes. Notice also that Chatterjee [8] demands a large amount of memory for large view-graphs and failed to execute on a system with 64 GB of RAM; we tested our method on view-graphs with up to 25K vertices and 24M edges on the same system.

6 Discussion

We have proposed a graph-based neural network for absolute orientation regression of a number of cameras from their observed relative orientations. The proposed network is exceptionally fast compared to strong optimization-based baselines while producing better results on most datasets. The outstanding performance of the current work, and of related neural networks for test-time optimization, leads to the following question: “can we then replace all the optimizations in robotics/computer vision by a suitable neural network-based regression?” The answer is obviously no. For instance, if a test-time optimization requires solving a simple convex cost with only a few parameters, a naive gradient descent will find the globally optimal parameters, while a network-based regression would only estimate sub-optimal parameters. To date, neural networks have proven consistently better at solving pattern recognition problems than at solving constrained optimization problems. A few neural network-based solutions have recently been proposed that exploit patterns in the data while solving a test-time optimization. The current work therefore also opens up many questions related to choosing the right tool for a specific application.