1 Introduction

Estimating the position and orientation of cameras from images is essential in many applications, including virtual reality, augmented reality, and autonomous driving. While the problem can be approached via a geometric pipeline consisting of image retrieval, feature extraction and matching, and a robust Perspective-n-Point (PnP) algorithm, many challenges remain, such as achieving invariance to appearance changes or selecting the best set of method hyperparameters.

Learning-based methods have been used in traditional pipelines to improve robustness and accuracy, e.g., by generating neural network (NN)-based feature descriptors [7, 15, 16, 24, 25], combining feature extraction and matching into one network [29], or incorporating differentiable outlier filtering modules [1,2,3]. Although deep 3D-based solutions have demonstrated favorable results, many prerequisites often remain, such as the need for an accurate 3D model of the scene and manual hyperparameter tuning of the remaining classical components.

The alternative end-to-end NN-based approach, termed absolute pose regression (APR), directly regresses the absolute pose of the camera from input images [13] without requiring prior knowledge about the 3D structure of the surrounding environment. Compared with deep 3D-based methods, APR methods run at least an order of magnitude faster, at the cost of inferior accuracy and longer training time. Although follow-up works such as MapNet [4] and Kendall et al. [12] attempt to improve APR methods by adding various constraints such as relative pose and scene geometry reprojection, a noticeable gap remains between APR and 3D-based methods.

Recently, Direct-PN [5] achieved state-of-the-art (SOTA) accuracy in indoor localization tasks among existing single-frame APR methods. As well as being supervised by ground-truth poses, the network directly matches the input image and a NeRF-rendered image at the predicted pose. However, it has two major limitations: (a) direct matching is very sensitive to photometric inconsistency, as images with different exposures could produce a high photometric error even from the same camera pose, which reduces the viability of photometric direct matching in environments with large photometric distortions, such as outdoor scenes; (b) there is a domain gap between real and rendered images caused by poor rendering quality or changes in content and appearance of the query scene.

In order to address these limitations, we propose a novel relocalization pipeline that combines APR and direct feature matching. First, we introduce a histogram-assisted variant of NeRF, which learns to control synthetic appearance via histograms of luminance information. This significantly reduces the gap between real and synthetic image appearance. Second, we propose a network, DFNet, that extracts domain-invariant features and regresses camera poses, trained using a contrastive loss with a customized mining method. Matching these features instead of raw pixel colors further boosts the performance of direct dense matching. Third, we improve generalizability by (i) applying a cheap Random View Synthesis (RVS) strategy that efficiently generates a synthetic training set by rendering novel views from randomly generated pseudo training poses and (ii) allowing the use of unlabeled data. We show that our method outperforms existing single-frame APR methods by as much as 56% on both the indoor 7-Scenes and outdoor Cambridge datasets. We summarize our main contributions as follows:

  1. We introduce a direct feature matching method that offers better robustness than the prior photometric matching formulation, and devise a network DFNet that can effectively bridge the feature-level domain gap between real and synthetic images.

  2. We introduce a histogram-assisted NeRF, which can scale the direct matching approach to scenes with large photometric distortions, e.g., outdoor environments, and provide more accurate rendering appearance to unseen real data.

  3. We show that a simpler synthetic data generation strategy such as RVS can improve pose regression performance.

2 Related Work

Absolute Pose Regression. Absolute pose regression aims to directly regress the 6-DOF camera pose from an image using Convolutional Neural Networks. The first work in this area is PoseNet [13], a GoogLeNet-backbone network appended with an MLP regressor. Successors of PoseNet propose several variations in network architecture, such as adding LSTM layers [35], adopting an encoder-decoder backbone [18], splitting the network into position and orientation branches [37], or incorporating attention using transformers [27, 28]. Other methods propose different strategies to train APR. Bayesian PoseNet [14] inserts Monte Carlo dropout into a Bayesian CNN to estimate pose with uncertainty. Kendall et al. [12] propose to balance the translation and rotation losses during training using learnable weights and a reprojection error. MapNet [4] trains the network using both an absolute pose loss and a relative pose loss but can infer in a single-frame manner. Direct-PoseNet (Direct-PN) [5] adds an additional photometric loss by comparing the query image with the NeRF synthesis at the predicted pose.

Fig. 1. Overview of the direct feature matching pipeline. Given an input image I, a pose regressor \(\mathcal {F}\) estimates a camera pose \(\hat{P}\), from which a luminance-prior NVS system \(\mathcal {H}\) renders a synthetic image \(\hat{I}\). Domain-invariant feature maps M and \(\hat{M}\) are extracted from I and \(\hat{I}\) using a feature extractor \(\mathcal {G}\), supplying a feature-metric direct matching signal \(\mathcal {L}_{dm}\) that optimizes the pose regressor.

Semi-supervised Learning in APR. Several APR methods explore semi-supervised learning with additional images without ground-truth pose annotations to improve pose regression performance. To the best of our knowledge, MapNet+ [4] and MapNet+PGO [4] are the pioneers in training APR on unlabeled video sequences using external VO algorithms [8, 9]. Direct-PN+ [5] finetunes on unlabeled data from arbitrary viewpoints solely based on its direct matching formulation. While the direct matching idea from Direct-PN+ inspires our proposed method, we focus on training in feature space. Our solution can scale to scenes with large photometric distortion, where the previous method fails.

Novel View Synthesis in APR. Novel View Synthesis (NVS) can be beneficial to the visual relocalization task. For example, NVS can expand the training space by generating extra synthetic data. Purkait et al. [23] propose a method to generate realistic synthetic training data for pose regression by leveraging the 3D map and feature correspondences. LENS [21] deploys a NeRF-W [17] model to sample the scene boundaries and synthesize virtual views at uniformly generated virtual camera poses. However, Purkait et al. rely on a pre-computed reconstructed 3D map, while LENS is limited by its costly offline computation and its lack of compensation for the domain gap between synthetic and real images, e.g., dynamic objects or artifacts. Another direction is to embed NVS into the pose estimation process. InLoc [31] verifies the predicted pose with view synthesis. Ng et al. [22] combine a multi-view stereo (MVS) model with a relative pose regressor (RPR). iNeRF [38], Wang et al. [36], and Direct-PN [5] utilize an inverted NeRF to optimize the camera pose. Our paper is the first to incorporate both strategies, yet it differs from the above methods in three major ways: 1) we introduce an NVS method that can adapt to real exposure changes in view synthesis; 2) we address the domain adaptation problem between actual camera footage and synthetic images; 3) our synthetic data generation strategy is comparatively less constrained and can be deployed efficiently in online training.

3 Method

We illustrate our proposed direct feature matching pipeline in Fig. 1, which contains two primary components: 1) the DFNet network, which, given an input image I, uses a pose estimator \(\mathcal {F}\) to predict a 6-DoF camera pose and a feature extractor \(\mathcal {G}\) to compute a feature map M, and 2) a histogram-assisted NeRF \(\mathcal {H}\), which compensates for high exposure fluctuation by providing luminance control when rendering a novel view given an arbitrary pose.

Training the direct feature matching pipeline can be split into two stages: (i) training DFNet and the histogram-assisted NeRF, and (ii) direct feature matching. In stage one, we train the NVS module \(\mathcal {H}\) like a standard NeRF, and DFNet with the loss term \(\mathcal {L}_{DFNet}\) in Eq. (5). In stage two, fixing the histogram-assisted NeRF and the feature extractor \(\mathcal {G}\), we further optimize the main pose estimation module \(\mathcal {F}\) via a direct feature matching signal between feature maps extracted from the real image and its synthetic counterpart \(\hat{I}\), which is rendered from the predicted pose \(\hat{P}\) of image I via the NVS module \(\mathcal {H}\). At test time, only the pose estimator \(\mathcal {F}\) is required given the query image, which ensures rapid inference.

This section is organized as follows: the DFNet pipeline is detailed in Sect. 3.1, followed by a showcase of our histogram-assisted NeRF \(\mathcal {H}\) in Sect. 3.2. To further boost the pose estimation accuracy, an efficient Random View Synthesis (RVS) training strategy is introduced in Sect. 3.3.

3.1 Direct Feature Matching for Pose Estimation

This section aims to introduce: 1) the design of our main network DFNet, 2) the direct feature matching formulation that boosts pose estimation performance in a semi-supervised training manner, and 3) the contrastive-training scheme that closes the domain gap between real images and synthetic images.

Fig. 2. (a) The training scheme for DFNet to close the domain gap between real images and rendered images. (b) The histogram-assisted NeRF architecture.

DFNet Structure. The DFNet in our pipeline consists of two networks, a pose estimator \(\mathcal {F}\) and a feature extractor \(\mathcal {G}\). The pose estimator \(\mathcal {F}\) in our DFNet is similar to an ordinary PoseNet, which predicts a 6-DoF camera pose \(\hat{P} = \mathcal {F}(I)\) for an input image I, and can be supervised by an \(L_1\) or \(L_2\) loss between the pose estimation \(\hat{P}\) and its ground truth pose P.

The feature extractor \(\mathcal {G}\) in our DFNet takes as input feature maps extracted from various convolutional blocks in the pose estimator and pushes them through a few convolutional blocks, producing the final feature maps \(M =\mathcal {G}(I)\), which are the key ingredients during feature-metric direct matching.
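
As a rough illustration of this design, the sketch below taps intermediate activations of a VGG-style backbone with forward hooks and maps the finest one to a dense feature map. The layer indices, channel widths, and helper names here are our own assumptions for illustration, not the published DFNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# Backbone shared with the pose estimator; indices 3/15/29 (assumed) tap the
# outputs of the first, third, and fifth VGG-16 conv blocks before pooling.
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
taps = {}

def make_hook(name):
    def hook(module, inputs, output):
        taps[name] = output
    return hook

for name, idx in {"fine": 3, "middle": 15, "coarse": 29}.items():
    backbone[idx].register_forward_hook(make_hook(name))

# One small conv head of G operating on the finest tapped activation.
head = nn.Conv2d(64, 128, kernel_size=3, padding=1)

def extract_feature_map(image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W). Returns a dense feature map M of shape (B, 128, H, W)."""
    _ = backbone(image)                       # forward pass fills `taps` via the hooks
    m = head(taps["fine"])                    # finest-level features
    return F.interpolate(m, size=image.shape[-2:], mode="bilinear", align_corners=False)
```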

Two key properties we seek to learn for the feature extractor \(\mathcal {G}\) are 1) domain invariance, i.e., being invariant to the domain of real images and the domain of synthetic images, and 2) transformation sensitivity, i.e., being sensitive to image differences caused by geometric transformations. With these properties learned, our feature extractor can extract domain-invariant features during feature-metric direct matching while preserving geometry-sensitive information for pose learning. We detail the way to train DFNet in the Closing the Domain Gap section.

Direct Feature Matching. Direct matching in APR was first introduced by Direct-PN [5], which minimizes the photometric difference between a real image I and a synthetic image \(\hat{I}\) rendered from the estimated pose \(\hat{P}\) of the real image I. Ideally, if the predicted pose \(\hat{P}\) is close to its ground truth pose P, and the novel view renderer produces realistic images, the rendered image \(\hat{I}\) should be indistinguishable from the real image.

In practice, we found the photometric-based supervision signal can be noisy in direct matching when part of the scene content changes, for example, when random cars and pedestrians appear over time or when the NeRF rendering quality is imperfect. Therefore, we propose to measure the distance between images in feature space instead of in photometric space, given that deep features are usually more robust to appearance changes and imperfect renderings.

Specifically, for an input image I and its pose estimation \(\hat{P}=\mathcal {F}(I)\), a synthetic image \(\hat{I} = \mathcal {H}(\hat{P}, \textbf{y}_I)\) can be rendered using the pose estimation \(\hat{P}\) and the histogram embedding \(\textbf{y}_I\) of the input image I. We then extract the feature map \(M \in \mathbb {R}^{H_M \times W_M \times C_M}\) and \(\tilde{M} \in \mathbb {R}^{H_M \times W_M \times C_M}\) for image I and \(\hat{I}\) respectively, where \(H_M\) and \(W_M\) are the spatial dimensions and \(C_M\) is the channel dimension of the feature maps. To measure the difference between two feature maps, we compute a cosine similarity between feature \(m_i \in \mathbb {R}^{C_M}\) and \(\tilde{m}_i \in \mathbb {R}^{C_M}\) for each feature location i:

$$\begin{aligned} \cos (m_i, \tilde{m}_i) = \frac{m_i\cdot \tilde{m}_i}{\Vert m_{i}\Vert _{2}\cdot \Vert \tilde{m}_{i}\Vert _{2}}. \end{aligned}$$
(1)

By minimizing the feature-metric direct matching loss \(\mathcal {L}_{dm} = \sum _i ( 1-\cos (m_i, \tilde{m}_i) )\), the pose estimator \(\mathcal {F}\) can be trained in a semi-supervised manner (note that no ground truth label is required for the input image I).
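
A minimal sketch of this loss in PyTorch is given below; the channels-first tensor layout and the helper name direct_matching_loss are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def direct_matching_loss(M: torch.Tensor, M_syn: torch.Tensor) -> torch.Tensor:
    """M, M_syn: feature maps of shape (B, C_M, H_M, W_M) from the real and
    the rendered image. Implements L_dm = sum_i (1 - cos(m_i, m~_i))."""
    cos = F.cosine_similarity(M, M_syn, dim=1)     # (B, H_M, W_M): one value per location i
    return (1.0 - cos).sum(dim=(1, 2)).mean()      # sum over locations, average over the batch
```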

Our direct feature matching may optionally follow the semi-supervised training procedure proposed by MapNet+ [4] to improve pose estimation with unlabeled sequences captured in the same scene. Unlike [4], which requires sequential frames to enforce a relative geometric constraint using a VO algorithm, our feature matching can be trained on images from arbitrary viewpoints without ground truth pose annotations. Our method can be used at train time with a batch of unlabeled images, or as a pose refiner for a single test image; in the latter case, our direct matching can also be regarded as a post-processing module. During this training stage, only the weights of the pose estimator are updated, whereas the feature extractor remains frozen.
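
Put together, this refinement step might look as in the sketch below. The module names (pose_net, feat_net, nerf_render, hist_embed) and the data loader are placeholders, and direct_matching_loss is the helper sketched above; only the assumed structure (optimize \(\mathcal {F}\) alone, keep \(\mathcal {G}\) and \(\mathcal {H}\) frozen) comes from the text.

```python
import torch

def finetune_on_unlabeled(pose_net, feat_net, nerf_render, hist_embed,
                          unlabeled_loader, lr=1e-5):
    """Refine the pose estimator F on unlabeled images using only L_dm."""
    for p in feat_net.parameters():        # G stays frozen; gradients still
        p.requires_grad_(False)            # flow through it to the pose
    optimizer = torch.optim.Adam(pose_net.parameters(), lr=lr)
    for I in unlabeled_loader:             # images without ground-truth poses
        P_hat = pose_net(I)                                # predicted pose
        I_syn = nerf_render(P_hat, hist_embed(I))          # differentiable w.r.t. P_hat
        loss = direct_matching_loss(feat_net(I), feat_net(I_syn))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```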

Fig. 3. A visual comparison of features before and after closing the domain gap. Ideally, a robust feature extractor should produce indistinguishable features for real and rendered images from the same pose. Columns 2 and 3 show features trained without and with our proposed \(\mathcal {L}_{triplet}\) loss, respectively; our method effectively produces similar features across the two domains.

Closing the Domain Gap. We notice that synthetic images from NeRF are imperfect due to rendering artifacts or a lack of adaptation to the dynamic content of the scene, which leads to a domain gap between rendered and real images. This domain gap poses difficulties for our feature extractor (Fig. 3), which we expect to produce distant features when two views come from different poses and similar features for a rendered view and a real image from the same pose.

Intuitively, we could simply enforce the feature extractor to produce similar features for a rendered image \(\hat{I}\) and a real image I via a distance function \(d(\cdot )\) during training. However, this approach leads to model collapse [6], which motivates us to explore the original triplet loss:

$$\begin{aligned} \mathcal {L}_{triplet}^{ori}=\max \left\{ d(M^{P}_{real}, M^{P}_{syn}) - d(M^{P}_{real}, M^{\bar{P}}_{syn}) +{\text {margin}}, 0\right\} , \end{aligned}$$
(2)

where \(M^{P}_{real}\) and \(M^{P}_{syn}\), the feature maps of a real image and a synthetic image at pose P, compose a positive pair, and \(M^{\bar{P}}_{syn}\) is a feature map of a synthetic image rendered at an arbitrary pose \(\bar{P}\) other than the pose P.

Taking a closer look at the task of feature-metric direct matching, we implement a customized in-triplet mining scheme that selects the minimum distance among negative pairs:

$$\begin{aligned} \mathcal {L}_{triplet}=\max \left\{ d(M^{P}_{real}, M^{P}_{syn}) - q_{\ominus } + {\text {margin}}, 0\right\} , \end{aligned}$$
(3)

where the positive pair is the same as in Eq. (2) and \(q_{\ominus }\) is the minimum distance over four negative pairs:

$$\begin{aligned} q_{\ominus } = \min \left\{ d(M^{P}_{real}, M^{\bar{P}}_{real}), d(M^{P}_{real}, M^{\bar{P}}_{syn}), d(M^{P}_{syn}, M^{\bar{P}}_{real}), d(M^{P}_{syn}, M^{\bar{P}}_{syn}) \right\} , \end{aligned}$$
(4)

which essentially takes the hardest negative pair among all matching pairs between synthetic images and real images that are in different camera poses. The margin value is set to 1.0 in our implementation. Since finding the minimum of negative pairs is non-differentiable, we implement the in-triplet mining as a prior step before \(\mathcal {L}_{triplet}\) is computed.
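
The sketch below illustrates Eqs. (2)-(4), with the mining performed as that prior, non-differentiated step. The concrete distance d between dense feature maps (mean of 1 minus cosine similarity) is our assumption, as the text does not pin it down here.

```python
import torch
import torch.nn.functional as F

def d(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # One plausible distance between dense feature maps (B, C, H, W):
    # mean of (1 - cosine similarity) over all spatial locations.
    return (1.0 - F.cosine_similarity(a, b, dim=1)).mean()

def triplet_with_in_triplet_mining(M_real_P, M_syn_P, M_real_Pbar, M_syn_Pbar,
                                   margin: float = 1.0) -> torch.Tensor:
    """Eq. (3): positive pair at pose P; q_minus is the hardest (minimum
    distance) of the four negative pairs in Eq. (4)."""
    pairs = [(M_real_P, M_real_Pbar), (M_real_P, M_syn_Pbar),
             (M_syn_P, M_real_Pbar), (M_syn_P, M_syn_Pbar)]
    with torch.no_grad():                       # mining as a prior, detached step
        idx = int(torch.stack([d(a, b) for a, b in pairs]).argmin())
    q_minus = d(*pairs[idx])                    # selected negative, with gradients
    pos = d(M_real_P, M_syn_P)
    return F.relu(pos - q_minus + margin)
```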

Overall, to train the pose estimator and to obtain the domain-invariant and transformation-sensitive properties, we adopt a siamese-style training scheme as illustrated in Fig. 2a. Given an input image I and its ground truth pose P, a synthetic image \(\hat{I}\) can be rendered via the NVS module \(\mathcal {H}\) (assumed pre-trained) using the ground truth pose P. We then present both the real image I and the synthetic image \(\hat{I}\) to the pose estimator and the feature extractor, resulting in pose estimations \(\hat{P}_{real}\) and \(\hat{P}_{syn}\) and feature maps \(M_{real}\) and \(M_{syn}\) for the real image I and synthetic image \(\hat{I}\), respectively. The training is then supervised via a combined loss function

$$\begin{aligned} \mathcal {L}_{DFNet} = \mathcal {L}_{triplet} + \mathcal {L}_{RVS} + \frac{1}{2}(\Vert P-\hat{P}_{real}\Vert _{2} + \Vert P-\hat{P}_{syn}\Vert _{2}), \end{aligned}$$
(5)

where \(\Vert \cdot \Vert _{2}\) denotes an \(L_2\) loss and \(\mathcal {L}_{RVS}\) is a supervision signal from our RVS training strategy, which we explain in Sect. 3.3.
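
For completeness, Eq. (5) can be written as a one-line helper (pose representations as flat tensors; all names are illustrative):

```python
import torch

def dfnet_loss(P, P_hat_real, P_hat_syn, l_triplet, l_rvs):
    # L_DFNet = L_triplet + L_RVS + 0.5 * (||P - P_hat_real||_2 + ||P - P_hat_syn||_2)
    pose_term = 0.5 * (torch.linalg.norm(P - P_hat_real) + torch.linalg.norm(P - P_hat_syn))
    return l_triplet + l_rvs + pose_term
```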

Fig. 4. Typically, NeRF only renders views that reflect the appearance of its training sequences, as shown by NeRF-W's synthetic view (b). However, in relocalization tasks, the query set may have a different appearance or exposure from the training set. The proposed histogram-assisted NeRF (c) renders a more accurate appearance for the unseen query set (a) in both quantitative (PSNR) and visual comparisons. We refer to the supplementary material for more examples.

3.2 Histogram-Assisted NeRF

The DFNet pipeline relies on an NVS module that renders a synthetic image, from which we extract a feature map and compare it with that of the real image. While the NVS module in our pipeline can in theory take any form as long as it provides high-quality novel view renderings, in practice we found that, due to the presence of auto exposure during image capture, it is necessary to have a renderer that can render images under a compensated exposure condition. Although employing direct matching in feature space could mitigate the exposure issue to some extent, we find that decoupling the exposure issue from the domain adaptation issue leads to better pose estimation results.

One off-the-shelf option is the recent NeRF-W [17], which offers the ability to control the rendered appearance via an appearance embedding based on frame indices. However, in the context of direct matching, since we aim to compare a real image with its synthetic version, we desire more fine-grained exposure control to render an image that matches the exposure condition of the real image, as illustrated in Fig. 4.

To this end, we propose a novel view renderer, histogram-assisted NeRF (Fig. 2b), which renders an image \(\hat{I} = \mathcal {H}(P, \textbf{y}_I)\) at an arbitrary camera pose P that matches the exposure level of a query real image I via a histogram embedding \(\textbf{y}_I\) of the query image. Specifically, our NeRF contains three components:

  1. A base network \(\mathcal {H}_{b}\) that provides a density estimation \(\sigma _b\) and a hidden state \(\textbf{z}\) for a coarse estimation: \( [\sigma _b, \textbf{z}]=\mathcal {H}_{b}(\gamma (\textbf{x})). \)

  2. A static network \(\mathcal {H}_{s}\) to model density \(\sigma _s\) and radiance \(\textbf{c}_s\) for static structure and appearance: \( [\sigma _s, \textbf{c}_s]=\mathcal {H}_{s}(\textbf{z}, \gamma (\textbf{d}), \textbf{y}_I). \)

  3. A transient network \(\mathcal {H}_{t}\) to model density \(\sigma _t\), radiance \(\textbf{c}_t\), and an uncertainty estimation \(\beta \) for dynamic objects: \( [\sigma _t, \textbf{c}_t, \beta ]=\mathcal {H}_{t}(\textbf{z}, \textbf{y}_I). \)

As for the inputs, \(\textbf{x}\) is a 3D point and \(\textbf{d}\) is the viewing direction that observes the 3D point, both encoded by a positional encoding operator \(\gamma (\cdot )\) [10, 19, 34] before being injected into each network.
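
The interfaces of these three components could be sketched as follows; the MLP depths and widths are placeholders of this sketch and do not reflect the published architecture.

```python
import torch
import torch.nn as nn

class HistogramAssistedNeRF(nn.Module):
    """Interface-only sketch of H_b, H_s, and H_t; widths/depths are assumptions."""
    def __init__(self, pos_dim, dir_dim, hist_dim, hidden=256):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(pos_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1 + hidden))          # -> [sigma_b, z]
        self.static = nn.Sequential(nn.Linear(hidden + dir_dim + hist_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1 + 3))             # -> [sigma_s, c_s]
        self.transient = nn.Sequential(nn.Linear(hidden + hist_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1 + 3 + 1))      # -> [sigma_t, c_t, beta]

    def forward(self, gamma_x, gamma_d, y_I):
        out_b = self.base(gamma_x)
        sigma_b, z = out_b[..., :1], out_b[..., 1:]
        out_s = self.static(torch.cat([z, gamma_d, y_I], dim=-1))
        sigma_s, c_s = out_s[..., :1], out_s[..., 1:]
        out_t = self.transient(torch.cat([z, y_I], dim=-1))
        sigma_t, c_t, beta = out_t[..., :1], out_t[..., 1:4], out_t[..., 4:]
        return sigma_b, (sigma_s, c_s), (sigma_t, c_t, beta)
```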

During training, the coarse density estimation from the base network \(\mathcal {H}_{b}\) provides a distribution from which the other two networks can sample more 3D points near non-empty space. Both the static and the transient network are conditioned on a histogram-based embedding \(\textbf{y}_I \in \mathbb {R}^{C_y}\), which is mapped from an \(N_b\)-bin histogram. The histogram is computed on the luma channel Y of a target image in YUV space. We found this approach works well in a direct matching context, not only in feature-metric space but also in photometric space.
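
As a concrete sketch of how such an embedding could be produced (the BT.601 luma weights and the learnable linear projection are our assumptions, while \(N_b=10\) and the embedding dimensions 50/20 follow Sect. 4.1):

```python
import torch
import torch.nn as nn

def luma_histogram(img_rgb: torch.Tensor, n_bins: int = 10) -> torch.Tensor:
    """img_rgb: (3, H, W) in [0, 1]. Returns a normalized N_b-bin histogram
    of the luma (Y) channel."""
    r, g, b = img_rgb[0], img_rgb[1], img_rgb[2]
    y = 0.299 * r + 0.587 * g + 0.114 * b          # BT.601 luma weights (assumption)
    hist = torch.histc(y, bins=n_bins, min=0.0, max=1.0)
    return hist / hist.sum()

# Map the N_b-bin histogram to the embedding y_I of dimension C_y,
# e.g. 50 for the static network and 20 for the transient network.
embed_static = nn.Linear(10, 50)
embed_transient = nn.Linear(10, 20)
y_I = embed_static(luma_histogram(torch.rand(3, 240, 320)))
```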

We adopt a similar network structure and volumetric rendering method as in NeRF-W [17], to which we refer readers for more details.

3.3 Random View Synthesis

During the training of DFNet, we can generate additional training data by synthesizing more views from randomly perturbed training poses. We refer to this process as Random View Synthesis (RVS), and we use this data generation strategy to help DFNet generalize better to unseen views.

Specifically, given a training pose P, a perturbed pose \(P^\prime \) can be generated around the training pose with a random translation noise of \(\psi \) meters and random rotation noise of \(\phi \) degrees. A synthetic image \(I^\prime = \mathcal {H}(P^\prime , \textbf{y}_{I_{nn}})\) is then rendered via histogram-assisted NeRF \(\mathcal {H}\), with \(\textbf{y}_{I_{nn}}\) being the histogram embedding of the training image with the nearest training pose. The synthetic pose-image pair \((P^\prime , I^\prime )\) is used as a training sample for the pose estimator to provide an additional supervision signal \(\mathcal {L}_{RVS} = \Vert P^\prime - \hat{P^{\prime }}\Vert _2\), where \(\hat{P^{\prime }} = \mathcal {F}(I^\prime )\) is the pose estimation of the rendered image.
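A lightweight sketch of the pose perturbation step is shown below; the uniform sampling of the noise and the helper name perturb_pose are assumptions of this sketch, not the authors' exact procedure.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_pose(T_wc: np.ndarray, psi: float = 0.2, phi: float = 10.0) -> np.ndarray:
    """T_wc: 4x4 camera-to-world pose. Returns a perturbed pose P' with
    translation noise up to psi meters and rotation noise up to phi degrees."""
    T = T_wc.copy()
    T[:3, 3] += np.random.uniform(-psi, psi, size=3)          # translation jitter
    axis = np.random.randn(3)
    axis /= np.linalg.norm(axis)                              # random rotation axis
    angle = np.deg2rad(np.random.uniform(-phi, phi))          # rotation jitter
    T[:3, :3] = R.from_rotvec(angle * axis).as_matrix() @ T[:3, :3]
    return T
```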

A key advantage of our method is its efficiency compared with prior training sample generation methods. For example, LENS [21] generates high-resolution synthetic data at up to 40 s per image and requires complicated parameter settings to find candidate poses within the scene volume. In contrast, our RVS is a lightweight strategy that seamlessly fits our DFNet training at a much cheaper cost (12.2 fps) and with fewer constraints on pose generation, while reaching similar performance. We refer to Sect. 4.5 for more discussion.

Table 1. Pose regression results on 7-Scenes dataset. We compare DFNet and DFNet\(_{dm}\) (DFNet with feature-metric direct matching) with prior single-frame APR methods and unlabeled training methods, in median translation error (m) and rotation error (\(^\circ \)). Note that MapNet+ and MapNet+PGO are sequential methods with unlabeled training. Numbers in bold represent the best performance.

4 Experiments

4.1 Implementation

We introduce the implementation details for histogram-assisted NeRF, DFNet, and direct feature matching. We also provide more details in the supplementary.

NeRF. Our histogram-assisted NeRF model is trained with re-aligned and re-centered poses in SE(3), similar to Mildenhall et al. [19]. The image histogram bin size is set to \(N_b=10\), and the histogram is embedded with a vector dimension of 50 for the static model and 20 for the transient model. We train the model with a learning rate of \(5 \times 10^{-4}\) and an exponential decay of \(5 \times 10^{-4}\) for 600 epochs.

DFNet. Our DFNet adopts an ImageNet pre-trained VGG-16 [30] as the backbone, and an Adam optimizer with a learning rate of \(1 \times 10^{-4}\) is applied during training. For feature extraction, we extract \(L=3\) feature maps from the end of the encoder's first, third, and fifth blocks, before the pooling layers. All final feature outputs are upscaled to the same size as the input image \(H \times W\) with bilinear upsampling. For pose regression, we regress the SE(3) camera pose with a fully connected layer. A singular value decomposition (SVD) is applied to ensure the rotation component of \(\hat{P}\) is normalized [5].
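
The SVD-based normalization mentioned here can be sketched as follows; the determinant-sign correction is a standard detail we assume rather than a published one.

```python
import torch

def normalize_rotation(R_hat: torch.Tensor) -> torch.Tensor:
    """Project a regressed 3x3 rotation estimate onto SO(3) via SVD."""
    U, _, Vt = torch.linalg.svd(R_hat)
    if torch.det(U @ Vt) < 0:                              # enforce det = +1 (proper rotation)
        U = torch.cat([U[:, :-1], -U[:, -1:]], dim=1)
    return U @ Vt
```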

Direct Feature Matching. To validate our feature-metric direct matching formulation, we follow the same procedure as MapNet+ [4] and Direct-PN+U [5], which use a portion of the validation images without ground truth poses for finetuning. When finetuning DFNet, we optimize the pose regression module \(\mathcal {F}\) solely based on the direct feature matching loss \(\mathcal {L}_{dm}\). We set the batch size to 1 and the learning rate to \(1 \times 10^{-5}\). For naming simplicity, we refer to our model trained with direct feature matching as DFNet\(_{dm}\).

4.2 Evaluation on the 7-Scenes Dataset

We evaluate our method on the indoor camera localization dataset 7-Scenes [11, 29]. The dataset consists of seven indoor scenes ranging from \(1\,\textrm{m}^3\) to \(18\,\textrm{m}^3\). Each scene contains 1000 to 7000 training frames and 1000 to 5000 validation frames. Both histogram-assisted NeRF and DFNet use subsampled training data with a spacing window \(d=5\) for scenes containing \(\le \) 2000 frames and \(d=10\) otherwise. RVS poses are sampled around the training poses, with parameters \(t_\psi =0.2\) m, \(r_\phi =10^\circ \), and \(d_{max}=0.2\) m. For a fair comparison with other unlabeled training methods such as MapNet+ and Direct-PN, we finetune our DFNet\(_{dm}\) using the same amount of unlabeled samples, which is 1/5 or 1/10 of the sequences based on the spacing window above, ensuring our method does not overfit to the entire test sequences.

We compare our method quantitatively with prior single-frame APR methods and unlabeled training APR methods in Table 1. The results show that both our DFNet and DFNet\(_{dm}\) obtain superior accuracy, and DFNet\(_{dm}\) achieves 56% and 57% improvements in the averaged median translation and rotation errors compared with the prior SOTA performance.

Table 2. Single-frame APR results on the Cambridge dataset. We report the median position and orientation errors in \(m/^\circ \) and the respective rankings over the scene average as in [27, 28]. The best results are highlighted in bold. For fair comparison, we omit prior APR methods which did not publish results on Cambridge.
Table 3. Comparison between our method, sequential-based APR methods, and 3D structure-based methods.

4.3 Evaluation on Cambridge Dataset

We further evaluate our approach on four outdoor scenes from the Cambridge Landmarks [13] dataset, with areas ranging from \(875\,\textrm{m}^2\) to \(5600\,\textrm{m}^2\). Each scene contains roughly 200 to 1500 training samples. Our models are trained with 50% of the training data, and DFNet's RVS parameters are \(t_\psi =3\) m, \(r_\phi =7.5^\circ \), and \(d_{max}=1\) m. For finetuning DFNet\(_{dm}\) with unlabeled data, we use 50% of the unlabeled validation sequence, since fewer validation samples are available than in 7-Scenes. Table 2 compares our approach with prior single-frame APR methods, omitting APR methods that did not report results on Cambridge. We observe that our DFNet\(_{dm}\) outperforms the other methods significantly (by 60%+ in scene average), which further proves the effectiveness of our approach.

Table 4. (a) The effect of various levels of features on the DFNet\(_{dm}\) result. Letters F, M, and C denote features extracted from the fine, middle, and coarse levels in DFNet. (b) Ablation on DFNet (upper part) and histogram-assisted NeRF in photometric direct matching (lower part). DFM denotes Direct Feature Matching.
Fig. 5. Pose difference vs. feature dissimilarity. X-axis: camera position (left) and orientation (right) difference between a real image and a rendered image. Y-axis: feature dissimilarity \(\mathcal {L}_{dm}\). Our direct feature matching loss \(\mathcal {L}_{dm}\) is closely related to pose error, leading to effective training of the APR method.

Fig. 6. (a) Top row: feature collapse when training DFNet on Kings without triplet loss. Bottom row: training DFNet with triplet loss avoids the feature collapse issue. (b) Feature maps of other Cambridge scenes when training with triplet loss. Finer-level features consistently contain more meaningful details and are therefore more beneficial for direct feature matching.

4.4 Comparison to Sequential APR and 3D Approaches

Table 3 compares our method with other types of relocalization approaches, including several state-of-the-art sequential-based APR approaches and the 3D structure-based method Active Search [26]. We notice that our DFNet\(_{dm}\) outperforms most sequential-based APR methods in terms of scene-average performance, except for the translation error of VLocNet [33] on 7-Scenes; nevertheless, we achieve superior accuracy to VLocNet in 7 out of 11 scenes. For the first time, the performance of single-image APR is comparable to 3D structure-based methods: our DFNet\(_{dm}\) is slightly more accurate than Active Search [26] in the average rotation error on 7-Scenes, and although it still trails slightly in 7-Scenes translation error and in the Cambridge errors, the margins are small.

4.5 Ablation Study

Effectiveness of Direct Feature Matching. We run a toy example of direct feature matching on Shop Facade using the finest features and combinations of multi-level features, as in Table 4(a). We find that finer-level features are more helpful for direct matching, which we attribute to their ability to preserve high-frequency details and sharper content, as shown in Fig. 6(b). This explains why we only use the finest features in the feature-metric direct matching implementation. Furthermore, Fig. 5 shows that the direct matching loss \(\mathcal {L}_{dm}\) successfully correlates pose differences with feature dissimilarity between real and rendered images.

Table 5. Data generation strategy comparison: RVS vs. LENS [21] on 7-Scenes. An EfficientNet backbone (as in LENS) is used in DFNet for a fair comparison. Our RVS strategy obtains results comparable to LENS while using much less training data and rendering at a much lower resolution, enabling online training.

Feature Collapse. We demonstrate the difference between training DFNet's feature extractor with and without the triplet loss in Fig. 6(a). For the without-triplet-loss case, we replace our triplet loss with a mean squared error (MSE) loss. Intuitively, losses that only minimize positive sample distances, such as MSE, \(L_2\), or \(L_1\) losses, may lead to feature collapse [6], since the feature extraction blocks in DFNet are likely to learn to cheat. In contrast, the triplet loss, supervised with additional negative samples, works well for extracting dense domain-invariant features.

Summary of Ablation. We break down our design decisions to show how each component contributes to the pose regression accuracy in Table 4(b). We start by training a DFNet model using the standard triplet loss without mining. The performance improves noticeably when we add RVS. We also see around 16%/36% gains in translation and rotation errors when adding the customized triplet loss \(\mathcal {L}_{triplet}\). We then validate DFNet\(_{dm}\)'s direct feature matching (DFM), which further reduces the error significantly. The DFM approach with histogram-assisted NeRF outperforms the NeRF-W one, which validates the effectiveness of our histogram embedding design. Finally, we train a Direct-PN+U model with our histogram-assisted NeRF modification. The results show that the photometric direct matching-based method can also benefit from our new NVS method, though its pose estimation accuracy is worse than that of our feature-metric direct matching method.

Effectiveness of RVS. Table 5 compares our online RVS strategy with the peer work LENS [21], which also uses NeRF data generation for APR training. Although both data generation methods effectively improve APR performance, our RVS strategy is a much cheaper alternative, requiring a lower rendering resolution (80\(\,\times \,\)60 vs. 320\(\,\times \,\)240 [21]) and less data. We reach similar performance to LENS when we replace our VGG-16 backbone with an EfficientNet-B0 [32], which shows that a simpler data generation strategy can also effectively improve APR methods.

5 Conclusion

In summary, we introduce an Absolute Pose Regression (APR) pipeline for camera relocalization. Specifically: 1) we propose a histogram-assisted NeRF to compensate for dramatic exposure variance in large-scale scenes with challenging exposure conditions; serving as a novel view renderer, it enables a direct matching training scheme; 2) we explore a direct matching scheme in feature space, which leads to more robust performance than the photometric approach, and address the domain gap that arises when matching real images with synthetic images via a contrastive learning scheme; 3) we devise an efficient data generation strategy that proposes pseudo training poses around existing training trajectories, leading to better generalization to unseen data. As a result, our method achieves state-of-the-art accuracy, outperforming existing single-image APR methods by as much as 56% and becoming comparable to 3D structure-based methods.