
1 Introduction

Camera pose estimation, the task of regressing a camera's position and rotation relative to an object in a given image, is a fundamental problem in computer vision and robotics. Given RGB or RGB-D images [13], the camera parameters can be estimated by reconstructing the target scene with SfM [12], by directly regressing the camera pose [3], or by iteratively optimizing the camera parameters [6]. Recently, many researchers have turned to view synthesis with neural radiance fields (NeRF) [8] to improve camera pose estimation. Among these, LENS [9] and Direct-PoseNet [1] are practical and sophisticated approaches that exploit novel views rendered by a pre-trained NeRF for localization. Concretely, LENS uses novel views rendered by a pre-trained NeRF as data augmentation to train the camera pose regressor PoseNet [3], which directly estimates camera parameters from a single image with a convolutional neural network; Direct-PoseNet takes a similar approach. However, if a diverse and abundant set of multi-view images is unavailable when training the NeRF, it cannot render novel views effectively, and these novel views are crucial for training the camera pose regressor. In such circumstances, the degraded view quality leads to suboptimal camera localization.

Hence, in this paper, we tackle the problem of learning the pose regressor from a viewpoint-biased and limited training set. Because NeRF training tends to fail in such a situation, we propose augmenting the regressor's training set with a few-shot NeRF instead of the original NeRF employed in previous frameworks. Concretely, we adopt DietNeRF [2] as the few-shot NeRF for data augmentation. With DietNeRF, we can render high-quality novel views with a consistent 3D structure, which enables stable training of the regressor. In the training phase, the regressor is trained on both the actual observed data and the additional views rendered by the pre-trained DietNeRF.

In the experiments, to validate the effectiveness of the proposed method, we compared DietNeRF with the original NeRF on training data with few shots and viewpoint bias. Our experiments demonstrate that the novel views rendered by DietNeRF improve camera pose estimation performance beyond what the original NeRF achieves.

2 Related Work

2.1 Neural Radiance Fields

Mildenhall et al. proposed neural radiance fields (NeRF) [8], which learn a multi-layer perceptron (MLP) that represents the three-dimensional (3D) space of a target scene from multi-view images with camera poses. The learned 3D representation can then be used to render the scene from unobserved viewpoints.

While NeRF can learn a consistent 3D structure and generate realistic novel views, training a NeRF requires multi-view images with camera parameters, which are laborious to collect. Moreover, when the training data are few or the viewpoints are biased, training tends to fail and the rendered images are of poor quality. Several methods have therefore been proposed to reduce the amount of training data required [2, 4, 15]. pixelNeRF [15] learns a NeRF from a single image by conditioning the color and density at each 3D coordinate on features extracted by a pre-trained CNN. InfoNeRF [4] minimizes the density of sampling points along each ray except for the high-density points where an object exists, thereby suppressing noise and improving rendering quality. DietNeRF [2] uses CLIP [10] during training to prevent training collapse and improve generation quality when the amount of training data is small. Because CLIP's image encoder extracts semantic features, DietNeRF encourages these features to be similar across viewpoints of the same scene during training, so that unobserved regions that do not appear in the training data remain semantically consistent. As a result, even when the training data are few or the training viewpoints are biased, unobserved regions can be completed plausibly.

DietNeRF. In this section, we describe the training phase of DietNeRF in detail. The DietNeRF model takes 3D coordinates \(\textbf{x}\) and a view direction \(\textbf{d}\) as input and outputs the density \(\sigma \) and color \(\textbf{c}\) at those coordinates. This mapping is modeled by a multi-layer perceptron (MLP). To compute a pixel value, we sample a ray \(\textbf{r}\) in 3D space based on the camera pose, query the MLP for the properties \((\sigma ,\textbf{c})\) of the samples along the ray, and aggregate them via volume rendering into the rendered pixel value \(\mathbf {\hat{C}(r)}\). The MLP's trainable parameters are optimized by minimizing the following photometric loss function,

$$\begin{aligned} \mathcal {L}_\mathrm{{MSE}}(\mathcal {R})=\frac{1}{N}\sum _{\textbf{r}\in \mathcal {R}}||\mathbf {C(r)}-\mathbf {\hat{C}(r)}||^2_2, \end{aligned}$$
(1)

where \(\mathbf {C(r)}\) is the ground-truth color and \(\mathcal {R}\) is a set of N rays.
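As a rough illustration of this rendering-and-loss step, the sketch below composites per-sample densities and colors along each ray and then evaluates the photometric loss of Eq. (1). It is a minimal PyTorch sketch under the standard NeRF volume-rendering formulation, not the authors' implementation.

```python
import torch

def composite_rays(sigmas: torch.Tensor, colors: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Volume rendering: alpha-composite per-sample (sigma, c) along each ray.

    sigmas: (R, S) densities, colors: (R, S, 3), deltas: (R, S) sample spacings.
    Returns the rendered ray colors C_hat(r) of shape (R, 3).
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    # Transmittance: probability that the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = alphas * trans
    return (weights.unsqueeze(-1) * colors).sum(dim=-2)

def photometric_loss(rendered: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Eq. (1): mean over rays of the squared L2 distance between C(r) and C_hat(r)."""
    return ((gt - rendered) ** 2).sum(dim=-1).mean()
```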

To hallucinate unseen regions, DietNeRF introduces an auxiliary semantic consistency loss, which encourages the feature vector of a synthesized image \(\mathbf {\hat{I}}\) to be close to that of a ground-truth image \(\textbf{I}\); since the features are normalized, maximizing their cosine similarity is equivalent to minimizing their semantic distance. These feature vectors are extracted with CLIP's [10] image encoder \(\phi \). This term is formulated as

$$\begin{aligned} \mathcal {L}_\mathrm{{sc}}(\textbf{I},\hat{\textbf{I}})=\phi (\textbf{I})\phi (\mathbf {\hat{I}})^\top . \end{aligned}$$
(2)
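The semantic consistency term can be sketched as below, assuming `phi` is any image encoder (such as CLIP's image encoder) whose features are L2-normalized, so that the inner product of Eq. (2) is a cosine similarity; the negative sign turns it into a quantity to minimize alongside the photometric loss.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(phi, gt_image: torch.Tensor, rendered_image: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between encoder features of the real and rendered images.

    Minimizing this is equivalent to maximizing Eq. (2) when the features are normalized.
    `phi` maps an image batch (B, 3, H, W) to feature vectors (B, D) and is an assumption here.
    """
    f_gt = F.normalize(phi(gt_image), dim=-1)
    f_rend = F.normalize(phi(rendered_image), dim=-1)
    return -(f_gt * f_rend).sum(dim=-1).mean()
```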

The total loss function for training DietNeRF is described as

$$\begin{aligned} \mathcal {L}_\mathrm{{total}}=\lambda _\mathrm{{MSE}}\mathcal {L}_\mathrm{{MSE}}+\lambda _\mathrm{{sc}}\mathcal {L}_\mathrm{{sc}}, \end{aligned}$$
(3)

where \(\lambda _\mathrm{{MSE}}\) and \(\lambda _\mathrm{{sc}}\) are hyperparameters that balance these loss terms. Before training the pose regressor, we train DietNeRF according to the final loss function (Eq. (3)).
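Combining the two terms of Eq. (3) is then a weighted sum; the sketch below reuses the two helpers sketched above, and the lambda values are illustrative placeholders rather than the settings used in this work.

```python
def dietnerf_total_loss(rendered_rgb, gt_rgb, phi, gt_image, rendered_image,
                        lam_mse: float = 1.0, lam_sc: float = 0.1):
    """Eq. (3): lambda_MSE * L_MSE + lambda_sc * L_sc (weights are illustrative only)."""
    return (lam_mse * photometric_loss(rendered_rgb, gt_rgb)
            + lam_sc * semantic_consistency_loss(phi, gt_image, rendered_image))
```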

2.2 Camera Pose Estimation

Camera pose estimation is a key component of various applications, and several approaches have been proposed to achieve it. Among them, absolute pose regression trains a convolutional neural network (CNN) to regress the camera parameters from a given image, using pairs of scene images and their corresponding camera poses. PoseNet [3] is one of the representative works. PoseNet regresses the parameters with a MobileNet-V2 [11] backbone, enabling fast inference. However, since PoseNet is a CNN-based regressor, it easily overfits the training images and their camera distribution, resulting in poor performance, and this overfitting becomes more pronounced when large-scale and diverse training data are unavailable.
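For illustration, a PoseNet-style absolute pose regressor with a MobileNet-V2 backbone might look like the following sketch; the 7-dimensional output (translation plus unit quaternion) is one common parameterization and is an assumption here, not necessarily the exact head used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class PoseRegressor(nn.Module):
    """Regress an absolute camera pose (translation, quaternion) from a single image."""

    def __init__(self):
        super().__init__()
        # Pretrained ImageNet weights could be loaded here; omitted to keep the sketch offline.
        self.backbone = mobilenet_v2(weights=None).features  # (B, 1280, h, w) feature maps
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(1280, 7)  # 3 translation + 4 quaternion components

    def forward(self, images: torch.Tensor):
        feat = self.pool(self.backbone(images)).flatten(1)
        out = self.head(feat)
        t, q = out[:, :3], out[:, 3:]
        return t, F.normalize(q, dim=-1)  # unit quaternion for the rotation
```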

Fig. 1. Overview of the proposed method: (a) training DietNeRF from a small amount of training data; (b) generating synthetic data for PoseNet using DietNeRF; (c) training PoseNet using the synthetic data together with the small amount of real training data.

For better estimation, many researchers have turned to novel view synthesis with NeRF [8]. LENS [9] augments unseen views using NeRF-W [7] to enhance pose regressor training: it builds a 3D grid from the density information of NeRF-W, selects virtual viewpoints that are not too close to the object, assigns each selected viewpoint an orientation derived from the nearest training camera pose, and adds the images rendered from these viewpoints by NeRF-W, together with their camera poses, to the training set of the pose regressor. Direct-PoseNet [1] instead uses the photometric error against renderings from a pre-trained NeRF during training, which has the advantage that unlabeled images can also be used to train the pose regressor.

However, the performance of CNN-based regressors like PoseNet depends heavily on the quality of the views and the viewpoint distribution. In addition, building a training set for both the regressor and the NeRF model is laborious. Therefore, in this paper, we use a few-shot NeRF, which can generate plausible unobserved views from a limited dataset, to augment the training data and boost the regressor's generalization ability.

3 Proposed Method

In this paper, we introduce an improved pipeline for few-shot, viewpoint-biased camera pose estimation. As shown in Fig. 1, the proposed method consists of three steps: training DietNeRF [2] as a view augmenter (Sect. 2.1 and Sect. 3.1), generating synthetic data for PoseNet [3] (Sect. 3.2), and training PoseNet for camera pose estimation (Sect. 3.3).

3.1 The Training of DietNeRF

To generate novel views for the pose regressor, as shown in Fig. 1(a), we first train DietNeRF [2] on the given small dataset using the procedure described in Sect. 2.1.

3.2 View Synthesis for Data Augmentation

A camera pose regressor such as PoseNet [3] tends to overfit when the training data are limited and viewpoint-biased, resulting in poor camera pose estimation performance. To solve this problem, our strategy is to use the novel views rendered by DietNeRF [2] as additional training data. The augmented dataset consists of images from unseen viewpoints and their corresponding camera poses, both of which can be obtained directly from DietNeRF.

To sample viewpoints for data augmentation, we assume that the target object is observed from a hemisphere of constant radius. Such a viewpoint distribution is naturally modeled by the von Mises distribution from directional statistics, which is parameterized by a mean direction and a concentration around that mean; when the concentration is zero, the von Mises distribution reduces to a uniform distribution. Since the synthetic viewpoints should capture a wide range of the target scene, as under a uniform distribution, we sample the azimuth and elevation angles from a von Mises distribution with mean 0 and concentration 0 and convert them to 3D coordinates. Following this strategy, we sample N viewpoints and render the corresponding additional view images with DietNeRF, as shown in Fig. 1(b).
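A minimal sketch of this sampling step is shown below. It assumes a hemisphere of fixed radius around the object and uses NumPy's von Mises sampler (with concentration 0 the draws are uniform); folding the elevation onto the upper hemisphere and the radius value are simplifications for illustration.

```python
import numpy as np

def sample_viewpoints(n: int, radius: float = 4.0, mu_azim: float = 0.0,
                      mu_elev: float = 0.0, kappa: float = 0.0) -> np.ndarray:
    """Sample n camera centers on a hemisphere of fixed radius around the origin.

    Azimuth and elevation are drawn from von Mises distributions; kappa = 0 gives a
    uniform (unbiased) distribution, while larger kappa concentrates the views around
    the mean direction (the biased settings).
    """
    azim = np.random.vonmises(mu_azim, kappa, size=n)              # in [-pi, pi)
    elev = np.abs(np.random.vonmises(mu_elev, kappa, size=n)) / 2  # fold into [0, pi/2]
    x = radius * np.cos(elev) * np.cos(azim)
    y = radius * np.cos(elev) * np.sin(azim)
    z = radius * np.sin(elev)
    return np.stack([x, y, z], axis=-1)  # (n, 3); a look-at pose toward the object follows
```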

3.3 The Training of Camera Pose Regressor

Finally, we train PoseNet [3] on the real multi-view images with camera poses and the additional synthetic images generated by DietNeRF [2] (Sect. 3.2), as shown in Fig. 1(c). The camera extrinsic parameters to be estimated consist of the rotation and the translation. The loss function \(\mathcal {L}_\mathrm{{pose}}\) is defined from the predicted camera pose \(\hat{\textbf{P}}\) and the ground truth \(\textbf{P}\) of the training data:

$$\begin{aligned} \mathcal {L}_\mathrm{{pose}}=\frac{1}{|\textbf{P}|}||\textbf{P}-\hat{\textbf{P}}||^2_2. \end{aligned}$$
(4)
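A minimal sketch of this training step follows, where each pose is represented as a flattened parameter vector and the real and DietNeRF-rendered pairs are simply concatenated into one training set; both choices are assumptions for illustration.

```python
import torch

def pose_loss(pred_pose: torch.Tensor, gt_pose: torch.Tensor) -> torch.Tensor:
    """Eq. (4): mean squared error between predicted and ground-truth pose vectors."""
    return ((gt_pose - pred_pose) ** 2).mean()

# Training mixes real and synthetic (DietNeRF-rendered) image/pose pairs, e.g.:
# dataset = ConcatDataset([real_pairs, synthetic_pairs])
# for images, poses in DataLoader(dataset, batch_size=..., shuffle=True):
#     loss = pose_loss(regressor(images), poses)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```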

4 Evaluation

We perform experiments from two perspectives: (i) we quantitatively and qualitatively evaluate the novel view quality of the original NeRF and DietNeRF for view augmentation in a viewpoint-biased setting, and (ii) we quantitatively compare our model with previous work on the camera pose estimation task.

4.1 Evaluation Setting

Dataset. We used the NeRF synthetic dataset proposed in the original NeRF paper [8], which is rendered from high-quality 3D models using Blender. Because we aim to improve performance in few-shot and viewpoint-biased settings, we created subsets of 10 images from the NeRF synthetic dataset as training sets. Following Sect. 3.2, we sampled the augmented unseen viewpoints from a von Mises distribution with a concentration of 0. To evaluate our model in viewpoint-biased settings, we controlled the mean parameter of the von Mises distribution. The viewpoint-biased data we created are categorized into three types, random, side, and front, whose viewpoint distributions cover the whole hemisphere, the target object from the side, and the target object from the front, respectively. Because side and front leave large regions unobserved due to self-occlusion, we investigated variations of these two settings: by controlling the concentration of the von Mises distribution over the azimuth, we additionally created low-, middle-, and high-concentration viewpoint datasets for side and front. These concentrations differ in the extent of the observed regions, as shown in Fig. 2.

Fig. 2. Low-, middle-, and high-concentration viewpoint distributions for evaluation in the viewpoint-biased setting

Table 1. Number of training successes in the random setting
Table 2. Number of training successes in the side/front settings

Evaluation Metrics. We quantitatively evaluated the image completion quality in unobserved regions using the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [14], and Learned Perceptual Image Patch Similarity (LPIPS) [16]. To quantitatively evaluate camera pose estimation, we used the translation error and the rotation error. The translation error measures the error in camera position and is computed as the mean squared error between the ground-truth and predicted translations, while the rotation error is the mean squared error of the rotation angle between the ground-truth and predicted rotations.
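For reference, these two quantities could be computed per test image as sketched below, where the rotation error uses the standard geodesic angle between rotation matrices; this formulation is an assumption, not necessarily the exact implementation used here.

```python
import numpy as np

def translation_error(t_gt: np.ndarray, t_pred: np.ndarray) -> float:
    """Mean squared error between ground-truth and predicted 3D translations."""
    return float(np.mean((t_gt - t_pred) ** 2))

def rotation_error_deg(R_gt: np.ndarray, R_pred: np.ndarray) -> float:
    """Angle (in degrees) of the relative rotation between two 3x3 rotation matrices."""
    cos_theta = (np.trace(R_gt @ R_pred.T) - 1.0) / 2.0
    cos_theta = float(np.clip(cos_theta, -1.0, 1.0))  # guard against numerical drift
    return float(np.degrees(np.arccos(cos_theta)))
```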

Table 3. Rendering quality in the random setting
Table 4. Rendering quality in the side/front settings

Network Details. We optimized the trainable parameters of DietNeRF using Adam [5] with a batch size of 1,024 and an initial learning rate of 0.0005. For training stability, we applied an exponential learning-rate schedule that decays the learning rate by a factor of 0.1 over 250,000 iterations. Following the original paper [2], we minimized Eq. (3) for the first 200,000 iterations and then minimized only Eq. (1) from 200,000 to 250,000 iterations for better generalization.
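As an illustration of this schedule, the sketch below sets up Adam with an exponential decay whose per-step factor brings the learning rate to 0.1 times its initial value after 250,000 iterations; the placeholder module stands in for the DietNeRF MLP.

```python
import torch

model = torch.nn.Linear(3, 4)  # placeholder standing in for the DietNeRF MLP
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

total_iters = 250_000
gamma = 0.1 ** (1.0 / total_iters)  # per-step factor so lr reaches 0.1 * initial at 250k steps
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for step in range(total_iters):
    # loss = Eq. (3) while step < 200_000, Eq. (1) afterwards (see Sect. 2.1)
    # loss.backward(); optimizer.step(); optimizer.zero_grad()
    scheduler.step()
```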

Fig. 3. Visual comparison between NeRF and DietNeRF. NeRF produces artifacts at unseen viewpoints when the training dataset is small

4.2 The Completion Performance of DietNeRF

Quantitative Comparison. Quantitative completion results for NeRF and DietNeRF and the number of successful training runs are shown in Tables 1, 2, 3 and 4. From the results for the random distribution in Table 3, we confirmed that DietNeRF did not tend to overfit the training data and that its rendering quality was slightly better than NeRF's. In particular, for the Hotdog scene, DietNeRF outperformed NeRF in both rendering quality and training stability. These results are similar to those reported in the DietNeRF paper [2] and indicate that DietNeRF's generalization ability is superior to that of the vanilla NeRF model. On the other hand, when the training viewpoint distribution is biased to side, the PSNR scores of NeRF and DietNeRF for the Hotdog scene were 24.02 and 20.47, respectively, whereas in the front case the PSNRs were 23.63 and 25.55, the opposite trend. From these comparisons in the viewpoint-biased settings, we found that the rendering performance of DietNeRF depends on both the training viewpoint distribution and the target object.

Visual Results of random. We took a closer look at the rendering quality of NeRF and DietNeRF with respect to boosting camera pose estimation performance. The rendered images of NeRF in this few-shot setting are shown in Fig. 3. The figure clearly shows that, although NeRF's PSNR score was partially competitive with DietNeRF's, its renderings at unseen viewpoints collapsed and contained artifacts. In contrast, DietNeRF can complete unseen regions even when they were not observed in the training phase, because CLIP's semantic features enhance DietNeRF's generalization across viewpoints.

Fig. 4. Rendering results outside of the side training viewpoints

Fig. 5. Rendering results outside of the front training viewpoints

Visual Results of side and front. Figures 4 and 5 show the rendering results for the side and front settings, respectively. Interestingly, we found that DietNeRF's completion ability depends not only on the training viewpoint distribution but also on the symmetry of the target object. Specifically, DietNeRF tends to be able to complete invisible regions when the object has a symmetric structure (Lego, Drums, and Hotdog) and the training viewpoints capture one side of that symmetry.

Even when the training viewpoints are biased (side and front) and DietNeRF is superior to NeRF overall, NeRF is sometimes superior to DietNeRF on validation views (middle and high concentrations) in the vicinity of the training viewpoints. This indicates that, while DietNeRF is good at completing unseen regions, it may not match NeRF's rendering quality in visible regions.

4.3 The Performance of Camera Pose Estimation

Quantitative Comparison. The camera pose estimation results for the random and the side/front settings are shown in Tables 5 and 6, respectively. These scores were obtained from PoseNet trained on the real data plus synthetic data generated by NeRF or DietNeRF. When the training data were sampled from the random distribution, DietNeRF generated higher-quality novel views than NeRF, and these renderings enhanced PoseNet's generalization, as shown in Table 5. When the training views were biased to side or front, whichever of NeRF and DietNeRF achieved the higher generation quality also yielded the better camera pose estimation accuracy.

Table 5. Camera pose estimation results in the random setting
Table 6. Camera pose estimation results in the side/front settings

The Effect of Viewpoint Augmentation Scale. Figure 6 shows the results when varying the number of additional images generated by Blender, NeRF, and DietNeRF. From the figure, we found that DietNeRF improved the performance significantly, and that increasing the number of additional images improved camera pose estimation for all synthesizers (Blender, NeRF, and DietNeRF). When the number of synthetic images was set to the value that minimized the camera pose estimation error of DietNeRF trained with 10 real images, the resulting error was equivalent to that of PoseNet trained with 100 Blender-rendered images for the translation error and with 150 Blender-rendered images for the rotation error.

Fig. 6. Camera pose estimation error as the scale of synthetic data changes. Blue: PoseNet trained on DietNeRF-synthesized images; orange: NeRF-synthesized images; green: ground-truth images rendered by Blender. (Color figure online)

5 Conclusion

In this paper, we proposed a view augmentation technique for learning a camera pose estimation model, PoseNet, from a small amount of training data. The proposed method improves camera pose estimation by training DietNeRF on the small dataset, rendering new viewpoint images from it as synthetic data, and training PoseNet on this synthetic data together with the small amount of real training data. In addition, we validated that increasing the amount of synthetic data improves camera pose estimation, confirming the benefit of augmenting the training data. In future work, it is necessary to verify whether the proposed method is effective in more realistic scenes.