1 Introduction

Recent advances in Neural Radiance Fields (NeRFs) [11] have substantially improved the fidelity of synthesized novel views by fitting a neural network that predicts the volume density and emitted radiance of each 3D point in a scene. Thanks to the differentiable volume rendering step, a set of images with known camera poses is the only input required for model fitting. Moreover, the limited amount of data, i.e. (image, camera pose) pairs, needed to train a NeRF model facilitates its adoption and drives the growing range of its applications. Among these, view synthesis has recently emerged for street view reconstruction [12, 19] in the context of AR/VR applications, robotics, and autonomous driving, with considerable effort devoted to vehicle novel view generation. However, these attempts focus on images of large-scale unbounded scenes, such as those from KITTI [9], and usually fail to achieve high-quality 3D vehicle reconstruction.
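As a brief reminder of the underlying model, in the original formulation [11] the color of a camera ray \(\textbf{r}(t) = \textbf{o} + t\textbf{d}\) is obtained by volume rendering the predicted density \(\sigma \) and view-dependent radiance \(\textbf{c}\):

$$\begin{aligned} C(\textbf{r}) = \int _{t_n}^{t_f} T(t)\, \sigma (\textbf{r}(t))\, \textbf{c}(\textbf{r}(t), \textbf{d})\, dt, \quad T(t) = \exp \left( -\int _{t_n}^{t} \sigma (\textbf{r}(s))\, ds \right) \end{aligned}$$

Since this estimate is differentiable with respect to the network weights, the model can be fitted by minimizing a photometric loss against the posed input images.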

Fig. 1. A visualization of the CarPatch data: RGB images (left), depth images (center), and semantic segmentation of vehicle components (right).

In this paper, we introduce an additional use case for neural radiance fields, i.e. vehicle inspection, where the goal is to represent a single high-quality instance of a given car. A high-fidelity 3D vehicle representation could be beneficial whenever the car body has to be analyzed in detail. For instance, insurance companies or body shops could rely on NeRF-generated views to assess possible external damage after a road accident and estimate its repair cost. Moreover, rental companies could compare two NeRF models, trained before and after each rental respectively, to assign responsibility for any new damage. This would avoid an expert on-site inspection or a rough evaluation based on a limited number of captures.

For this purpose, we provide an experimental overview of state-of-the-art NeRF methods suitable for vehicle reconstruction. To make the experimental setting reproducible and to provide a basis for future experimentation, we propose CarPatch, a new benchmark for assessing neural radiance field methods on the vehicle inspection task. Specifically, we generate a novel dataset consisting of 8 synthetic scenes, corresponding to as many high-quality 3D car meshes with realistic details and challenging lighting conditions. As depicted in Fig. 1, we provide not only RGB images with camera poses, but also binary masks of different car components to validate the reconstruction quality of specific vehicle parts (e.g. wheels or windows). Moreover, for each camera position, we generate the ground-truth depth map with the twofold goal of examining the ability of NeRF architectures to correctly predict volume density and, at the same time, enabling future work based on RGB-D inputs. We evaluate the novel view generation and depth estimation performance of several methods under diverse settings (both global and component-level). Finally, since collecting images for fitting neural radiance fields can be time-consuming in real scenarios, we provide the same scenes with a varying number of training images, in order to assess robustness to the amount of training data.

After an overview of the main related works in Sect. 2, we thoroughly describe the process of 3D mesh gathering, scene setup, and dataset generation in Sect. 3. The evaluation of existing NeRF architectures on CarPatch is presented in Sect. 4.

2 Related Work

We provide a brief overview of the latest developments in neural radiance fields, including the significant extensions and applications that have influenced our work. NeRF limitations have been tackled by several works aiming to reduce its complexity, increase the reconstruction quality, and develop more challenging benchmarks.

Neural Scene Reconstruction. The handling of aliasing artifacts is a well-known issue in rendering algorithms. Mip-NeRF [1, 2] and Zip-NeRF [3] tackle aliasing by reasoning on volumetric frustums along a cone. These approaches have inspired works such as Able-NeRF [16], which replaces the MLP of the original implementation with a transformer-based architecture. Reflections pose a further challenge for NeRF: several works address reflective surfaces by taking the reflectance of the scene into account [4, 5, 17]. Moreover, computational cost is a widely recognized concern. Various works in the literature have demonstrated that it is possible to achieve high-fidelity reconstructions while reducing the overall training time. Two notable works in this direction are NSVF [10], which uses a voxel-based representation for more efficient rendering of large scenes, and Instant-NGP [13], which proposes a multi-resolution hash table combined with a light MLP to achieve faster training times. Other approaches such as DVGO [15] and Plenoxels [8] optimize voxel grids of features to enable fast radiance field reconstruction. TensoRF [7] combines the traditional CP decomposition [6] with a new vector-matrix decomposition method, leading to faster training and higher-quality reconstruction.

In this work, in order to keep training times practical for vehicle inspection, we select a set of architectures that strike a balance between training time and reconstruction quality.

Table 1. Comparison between existing datasets used as benchmarks for neural radiance field evaluation and CarPatch. We provide each scene with varying amounts of training data (40, 60, 80, and 100 images), allowing users to test the robustness of their architectures. We also release depth and segmentation data for all images.

Scene Representation Benchmarks. One of the most widely used benchmarks for evaluating NeRF is the NeRF Synthetic Blender dataset [11]. This dataset consists of 8 different scenes generated using Blender, each with 100 training images and 200 test images. Other synthetic datasets include the Shiny Blender dataset [17], which mostly contains single objects with simple geometries, and BlendedMVS [20], which provides various scenes to test NeRF implementations at different scales. These works do not provide ground-truth information about the semantic content of the images, which makes it difficult to study the ability of NeRF to reconstruct certain surfaces compared to others. In our CarPatch dataset, we provide ground-truth segmentation of vehicle components in the scene, allowing architectures to be evaluated on specific parts. Table 1 presents a comparison between the most common benchmark datasets and our proposed dataset.

3 The CarPatch Dataset

In this section, we detail the source data and the procedure used to generate our CarPatch dataset. In particular, we describe how we gathered the 3D models, set up the Blender scenes, and designed the image capture process (Fig. 2).

Fig. 2. Sample RGB images (left), depth data (center), and segmentation masks (right) from CarPatch, for different car models.

3.1 Synthetic 3D Models and Scene Setup

All the 3D models included in the CarPatch scenes have been downloaded from Sketchfab, a large collection of free 3D objects available for research use. Table 2 provides a detailed list of all the starting models used. Each of them has been edited in Blender to enhance its realism; specifically, we improved the materials, colors, and lighting in each scene to create a more challenging environment.

The scenes have been set up following the Google Blender dataset [11]. Lighting conditions and rendering settings were customized to create a more realistic environment. The vehicle was placed at the center of the scene at position (0, 0, 0), with nine lights distributed around the car with varying emission strengths to create shadows and enhance reflections on the materials' surfaces. To improve realism, we resized the objects to match their real-world size. The camera and lights were placed so as to provide an accurate representation of the environment, making the scenes similar to real-world scenarios.
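As a reference, the following Blender Python sketch reproduces this kind of setup; the light radius, height, and energy values are illustrative assumptions, not the exact CarPatch settings.

```python
import math
import bpy

# Distribute nine point lights on a circle around the vehicle at (0, 0, 0),
# with varying emission strengths to create shadows and surface reflections.
# Radius, height, and energy values below are illustrative placeholders.
NUM_LIGHTS = 9
RADIUS = 6.0   # meters (vehicles are resized to real-world scale)
HEIGHT = 3.0

for k in range(NUM_LIGHTS):
    angle = 2.0 * math.pi * k / NUM_LIGHTS
    location = (RADIUS * math.cos(angle), RADIUS * math.sin(angle), HEIGHT)
    bpy.ops.object.light_add(type='POINT', location=location)
    bpy.context.object.data.energy = 500.0 + 250.0 * (k % 3)  # varying strengths
```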

Table 2. Summary of the source 3D models from which our dataset has been generated, including their key features.

3.2 Dataset Building

The dataset was built using the Python interface provided by Blender, which allowed us to control the objects in the environment. For each rendered image, we captured not only the RGB color values but also the corresponding depth map, as well as pixel-wise semantic segmentation masks for eight vehicle components: bumpers, lights, mirrors, hoods/trunks, fenders, doors, wheels, and windows. Examples of these segmentation masks can be seen in Fig. 1. Please note that all the pixels belonging to a component (e.g. doors) are grouped into the same class, regardless of the specific component location (e.g. front/rear/right/left door). The bpycv utility has been used to collect this additional metadata, enabling us to evaluate NeRF models on the RGB reconstruction and depth estimation of the overall vehicle as well as each of its subparts.
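A minimal sketch of this rendering loop is shown below; the component-to-id mapping and the object naming convention are hypothetical, while the bpycv calls follow its documented usage.

```python
import bpy
import bpycv
import cv2
import numpy as np

# Hypothetical mapping from the eight annotated components to integer ids.
COMPONENT_IDS = {"bumper": 1, "light": 2, "mirror": 3, "hood_trunk": 4,
                 "fender": 5, "door": 6, "wheel": 7, "window": 8}

# bpycv reads the "inst_id" custom property to build the segmentation map;
# here we assume component names appear in the Blender object names.
for obj in bpy.data.objects:
    for name, inst_id in COMPONENT_IDS.items():
        if name in obj.name.lower():
            obj["inst_id"] = inst_id

result = bpycv.render_data()  # renders RGB, depth, and instance maps at once
cv2.imwrite("rgb.png", result["image"][..., ::-1])  # RGB -> BGR for OpenCV
np.save("depth.npy", result["depth"])               # depth in meters, 0 = background
cv2.imwrite("seg.png", np.uint16(result["inst"]))   # per-pixel component ids
```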

For rendering the training images, the camera was moved randomly on the hemisphere centered at (0, 0, 0) and above the ground, with the camera rotation angle sampled from a uniform distribution before each new capture. For the test set, the camera was kept at a fixed distance from the ground and rotated around the Z-axis by a fixed angle of \(\frac{2\pi }{\#test\_views}\) radians before each new capture.
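The following sketch illustrates the two sampling schemes; the camera radius and height are illustrative assumptions, and in practice each sampled position is paired with a rotation that points the camera at the origin.

```python
import math
import random

def sample_train_position(radius=4.0):
    """Random training camera on the upper hemisphere centered at (0, 0, 0);
    angles are drawn from uniform distributions, keeping the camera above ground."""
    azimuth = random.uniform(0.0, 2.0 * math.pi)
    elevation = random.uniform(0.0, 0.5 * math.pi)
    return (radius * math.cos(elevation) * math.cos(azimuth),
            radius * math.cos(elevation) * math.sin(azimuth),
            radius * math.sin(elevation))

def test_positions(radius=4.0, height=1.5, n_test_views=200):
    """Test cameras: fixed distance from the ground, rotated around the
    Z-axis by a fixed step of 2*pi / n_test_views radians between captures."""
    step = 2.0 * math.pi / n_test_views
    return [(radius * math.cos(k * step), radius * math.sin(k * step), height)
            for k in range(n_test_views)]
```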

In order to guarantee the fairness of current and future comparisons, we explicitly provide four versions of each scene, varying the number of training images (40, 60, 80, and 100, respectively). Different versions of the same scene have no overlap in training camera poses, while the test set is always the same and contains 200 images per scene.

We release the code for dataset creation and metrics evaluation at https://github.com/davidedinuc/carpatch.

4 Benchmark

This section presents the selection and testing of various recent NeRF-based methods [7, 13, 15] on the proposed CarPatch dataset, with a detailed description of the experimental setting for each baseline. Additionally, we assess the quality of the reconstructed vehicles in terms of their appearance and 3D surface reconstruction, utilizing the depth maps generated during volume rendering.

4.1 Compared Methods

To overcome the challenges posed by illumination and reflective surfaces during vehicle reconstruction, it is crucial to choose an appropriate neural rendering approach. We tested the selected approaches on CarPatch without modifying the implementation details available in the original repositories whenever possible. However, some parameters had to be adjusted in order to fit our models (which are larger than the reference dataset meshes) into the scene. All tests were performed on a GeForce GTX 1080 Ti. After considering various NeRF systems, we selected the following baselines:

  • Instant-NGP [13] Since the original implementation of Instant-NGP is in CUDA, we used an available PyTorch implementation of this approach to ensure a fair comparison with the other methods. In our experiments, we used a batch size of 8192, a scene scale of 0.5, and a total of 30,000 iteration steps.

  • TensoRF [7] In our setting, a batch of 4096 rays was used. Additionally, we increased the overall scale of the scene from 1 to 3.5. These adjustments were made after experimentation and careful inspection of the resulting reconstructions. Training lasts 30,000 iterations.

  • DVGO [15] For this method, the training process consists of two phases: a coarse phase of 5,000 iterations, followed by a fine phase of 20,000 iterations that improves the model's ability to learn intricate details of the scene. In our experiments, we used a batch size of 8192 while keeping the default scene size.

Table 3. Quantitative results on the CarPatch test set for each vehicle model.
Table 4. Quantitative results on the CarPatch test set for each vehicle component averaged over the vehicle models.

4.2 Metrics

The effectiveness of the chosen methods has been assessed with the perceptual metrics typically used in NeRF-based reconstruction tasks, namely PSNR, SSIM [18], and LPIPS [21].

However, appearance-based metrics depend on the emitted radiance as well as on the learned volume density. We therefore propose two supplementary depth-based metrics to assess the volume density alone. Since ground-truth 3D models of the vehicles are not available in real-world scenarios, we use the depth map as our knowledge of the 3D surface of the objects. Specifically, we define a depth map as a matrix

$$\begin{aligned} D = \{d_{ij}\}, d_{ij} \in [0, R] \end{aligned}$$
(1)

in which each value \(d_{ij}\) ranges from 0 to the maximum depth value R. Furthermore, we estimate surface normals from the depth maps [14]. First, we define the (unnormalized) direction of a surface normal as:

$$\begin{aligned} \textbf{d} = \langle d_x, d_y, d_z \rangle = \left( -\frac{\partial d_{ij}}{\partial i}, -\frac{\partial d_{ij}}{\partial j}, 1 \right) \approx \left( d_{ij} - d_{(i+1)j},\ d_{ij} - d_{i(j+1)},\ 1 \right) \end{aligned}$$
(2)

where the first two components are the negative depth gradients along the i and j directions, respectively. Afterward, we normalize the vector to obtain a unit-length normal \(\textbf{n}(d_{ij}) = \frac{\textbf{d}}{\left\Vert \textbf{d} \right\Vert }\).
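In NumPy, this normal estimation can be sketched as follows (a direct transcription of Eq. 2, assuming depth maps stored as float arrays):

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map via forward
    finite differences (Eq. 2), then normalize to unit length."""
    di = depth[1:, :-1] - depth[:-1, :-1]   # depth gradient along i (rows)
    dj = depth[:-1, 1:] - depth[:-1, :-1]   # depth gradient along j (columns)
    d = np.stack([-di, -dj, np.ones_like(di)], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)
```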

We assess the 3D reconstruction’s quality through the following metrics:

  • Depth Root Mean Squared Error (D-RMSE) This metric measures the root mean squared difference, in meters, between the ground truth and predicted depth maps.

    $$\begin{aligned} \text {D-RMSE} = \sqrt{\frac{\sum _{i=1}^{M}\sum _{j=1}^{N} (\hat{d}_{ij} - d_{ij})^2}{M \cdot N}} \end{aligned}$$
    (3)
  • Surface Normal Root Mean Squared Error (SN-RMSE) This metric measures the root mean squared angular error, in degrees, between the ground truth and predicted surface normals.

    $$\begin{aligned} \text {SN-RMSE} = \sqrt{\frac{\sum _{i=1}^{M}\sum _{j=1}^{N} \left( \arccos \left( \textbf{n}(\hat{d}_{ij}) \cdot \textbf{n}(d_{ij}) \right) \right) ^2}{M \cdot N}} \end{aligned}$$
    (4)

D-RMSE and SN-RMSE are computed only for those pixels with a positive depth value in both GT and predicted depth maps. This avoids computing depth estimation errors on background pixels (which have a fixed depth value of 0).
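A NumPy sketch of both metrics, as defined in Eqs. 3-4, is given below; it reuses the normals_from_depth helper sketched above, and the handling of the one-pixel border lost by the finite differences is an implementation assumption.

```python
import numpy as np

def depth_metrics(gt_depth, pred_depth):
    """Compute D-RMSE (meters) and SN-RMSE (degrees) over the pixels
    with a positive depth in BOTH maps, ignoring the background (depth 0)."""
    mask = (gt_depth > 0) & (pred_depth > 0)
    d_rmse = np.sqrt(np.mean((pred_depth[mask] - gt_depth[mask]) ** 2))

    n_gt = normals_from_depth(gt_depth)
    n_pred = normals_from_depth(pred_depth)
    cos = np.clip(np.sum(n_gt * n_pred, axis=-1), -1.0, 1.0)
    ang_err = np.degrees(np.arccos(cos))  # per-pixel angular error
    fg = mask[:-1, :-1]                   # normal maps are one pixel smaller
    sn_rmse = np.sqrt(np.mean(ang_err[fg] ** 2))
    return d_rmse, sn_rmse
```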

Fig. 3. Performance by varying the number of training images, in terms of PSNR, SSIM, LPIPS, D-RMSE, and SN-RMSE. Despite its lower overall performance, Instant-NGP [13] exhibits low variance with respect to the amount of training data.

Fig. 4. Performance by camera viewing angle, in terms of PSNR, SSIM, LPIPS, D-RMSE, and SN-RMSE. Depending on the training camera distribution, all the methods struggle wherever the viewpoints are more sparse (e.g. between \(225^\circ \) and \(270^\circ \)). The red arrow represents where the front of the vehicle is facing. (Color figure online)

4.3 Results

This section presents both quantitative and qualitative results obtained by the selected NeRF baselines. We discuss their performance on the CarPatch dataset by analyzing the impact of the viewing camera angle and of the number of training images.

According to Table 3, all the selected NeRF approaches achieve satisfactory results. Although the baselines show similar appearance scores (PSNR, SSIM, and LPIPS), our evaluation with the depth-based metrics (D-RMSE and SN-RMSE) reveals significant differences in the 3D reconstruction of the vehicles. DVGO outperforms its competitors in depth estimation, with a \(+13.5\%\) improvement over Instant-NGP and a \(+7.6\%\) improvement over TensoRF. In contrast, TensoRF predicts a more accurate 3D surface, with the lowest angular error on the surface normals.

Since our use case is vehicle inspection, in Table 4 we report results computed on each car component. For this purpose, we mask both GT and predictions with the specific component mask before computing the metrics. However, given the limited area of each component, this would lead to an unbalanced ratio between background and foreground pixels and, ultimately, to a biased metric value. Since D-RMSE and SN-RMSE are computed only on foreground pixels (see Sect. 4.2), the depth-based metrics are not affected by this issue. For PSNR, SSIM, and LPIPS, instead, we compute component-level metrics over the image crop delimited by the bounding box around each mask, as sketched below. As expected, NeRF struggles to reconstruct reflective and transparent components (e.g. mirrors, lights, and windows), obtaining the highest depth and normal estimation errors. Over the single components, however, TensoRF outperforms the competitors on most metrics, particularly on surface normal estimation. The errors in the reconstruction of specific components' surfaces can also be appreciated in the qualitative results of Fig. 5.
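A minimal sketch of this cropping strategy follows; the helper names are ours, and the PSNR shown is the standard definition for images in a known value range.

```python
import numpy as np

def crop_to_component(img_gt, img_pred, comp_mask):
    """Crop both images to the bounding box of a component mask, so that
    appearance metrics are not dominated by background pixels."""
    ys, xs = np.nonzero(comp_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    return img_gt[y0:y1, x0:x1], img_pred[y0:y1, x0:x1]

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two images in [0, max_val]."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```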

Moreover, we analyze the performance of each method as a function of the number of training images. We trained the baselines on every version of the CarPatch dataset and report the results in Fig. 3. Reducing the number of training images has a significant impact on all metrics regardless of the method. However, Instant-NGP proves more robust to the number of camera viewpoints, with a smoother performance drop, especially in terms of LPIPS, D-RMSE, and SN-RMSE.

Finally, we discuss how the distribution of training camera viewpoints around the vehicle affects the performance of each method at certain camera angles. As depicted in Fig. 4, the metrics vary considerably between \(180^\circ \) and \(270^\circ \) and between \(0^\circ \) and \(45^\circ \). Indeed, in these areas the datasets have sparser camera viewpoints and, as expected, all the methods are affected.

Fig. 5. Sample 3D reconstruction of the Tesla: (left) the reconstructed RGB, depth, and surface normals, (right) the reconstructed surfaces on the triangle mesh.

5 Conclusion

In this article, we have proposed a new benchmark for the evaluation and comparison of NeRF-based techniques. Focusing on one of the many concrete applications of this recent technology, i.e. vehicle inspection, we first created a new synthetic dataset including renderings of 8 vehicles. In addition to the set of RGB views annotated with camera poses, the dataset is enriched with semantic segmentation masks as well as depth maps to further analyze the results and compare the methods.

The presence of reflective surfaces and transparent parts makes vehicle reconstruction still challenging. We propose additional metrics, as well as new graphical ways of displaying the results, to make these limitations more evident. We are confident that CarPatch can serve as a solid basis for research on NeRF models in general and, more specifically, on their application to the field of vehicle reconstruction.