1 Introduction

Recent advances in Neural Radiance Fields (NeRFs) [11] have substantially improved the fidelity of synthesized novel views by fitting a neural network that predicts the volume density and emitted radiance of each 3D point in a scene. Thanks to the differentiable volume rendering step, a set of images with known camera poses is the only input required for model fitting. Moreover, the limited amount of data, i.e. (image, camera pose) pairs, needed to train a NeRF model facilitates its adoption and drives the growing range of its applications. Among these, view synthesis has recently emerged for street view reconstruction [12, 19] in the context of AR/VR applications, robotics, and autonomous driving, with considerable effort devoted to vehicle novel view generation. However, these attempts focus on images of large-scale unbounded scenes, such as those from KITTI [9], and usually fail to achieve high-quality 3D vehicle reconstruction.
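As a brief reminder of the underlying model, in the original formulation [11] the color of a camera ray \(\textbf{r}(t) = \textbf{o} + t\textbf{d}\) is obtained by volume rendering the predicted density \(\sigma \) and view-dependent radiance \(\textbf{c}\):

$$\begin{aligned} C(\textbf{r}) = \int _{t_n}^{t_f} T(t)\, \sigma (\textbf{r}(t))\, \textbf{c}(\textbf{r}(t), \textbf{d})\, dt, \quad T(t) = \exp \left( -\int _{t_n}^{t} \sigma (\textbf{r}(s))\, ds \right) \end{aligned}$$

Since this estimate is differentiable with respect to the network weights, the model can be fitted by minimizing a photometric loss against the posed input images.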

Fig. 1. A visualization of the CarPatch data: RGB images (left), depth images (center), and semantic segmentation of vehicle components (right).

In this paper, we introduce an additional use case for neural radiance fields, i.e. vehicle inspection, where the goal is to represent a single high-quality instance of a given car. A high-fidelity 3D vehicle representation could be beneficial whenever the car body has to be analyzed in detail. For instance, insurance companies or body shops could rely on NeRF-generated views to assess possible external damage after a road accident and estimate its repair cost. Moreover, rental companies could compare two NeRF models, trained before and after each rental respectively, to assign responsibility for any new damage. This would avoid an expert on-site inspection or a rough evaluation based on a limited number of captures.

For this purpose, we provide an experimental overview of state-of-the-art NeRF methods suitable for vehicle reconstruction. To make the experimental setting reproducible and to provide a basis for future experimentation, we propose CarPatch, a new benchmark for assessing neural radiance field methods on the vehicle inspection task. Specifically, we generate a novel dataset consisting of 8 synthetic scenes, corresponding to as many high-quality 3D car meshes with realistic details and challenging lighting conditions. As depicted in Fig. 1, we provide not only RGB images with camera poses, but also binary masks of different car components to validate the reconstruction quality of specific vehicle parts (e.g. wheels or windows). Moreover, for each camera position, we generate the ground-truth depth map with the twofold goal of examining the ability of NeRF architectures to correctly predict volume density and, at the same time, enabling future work based on RGB-D inputs. We evaluate the novel view generation and depth estimation performance of several methods under diverse settings (both global and component-level). Finally, since collecting images for fitting neural radiance fields can be time-consuming in real scenarios, we provide the same scenes with a varying number of training images, in order to assess robustness to the amount of training data.

After an overview of the main related works in Sect. 2, we thoroughly describe the process of 3D mesh gathering, scene setup, and dataset generation in Sect. 3. The evaluation of existing NeRF architectures on CarPatch is presented in Sect. 4.

2 Related Work

We provide a brief overview of the latest developments in neural radiance fields, including the significant extensions and applications that have influenced our work. NeRF limitations have been tackled by several works aiming to reduce its complexity, increase the reconstruction quality, and develop more challenging benchmarks.

Neural Scene Reconstruction. The handling of aliasing artifacts is a well-known issue in rendering algorithms. Mip-NeRF [1, 2] and Zip-NeRF [3] tackle aliasing by reasoning on volumetric frustums along a cone. These approaches have inspired works such as Able-NeRF [16], which replaces the MLP of the original implementation with a transformer-based architecture. Reflections pose a further challenge for NeRF: several works address reflective surfaces by taking the reflectance of the scene into account [4, 5, 17]. Moreover, computational cost is a widely recognized concern. Various works in the literature have demonstrated that it is possible to achieve high-fidelity reconstructions while reducing the overall training time. Two notable works in this direction are NSVF [10], which uses a voxel-based representation for more efficient rendering of large scenes, and Instant-NGP [13], which proposes a multi-resolution hash table combined with a light MLP to achieve faster training times. Other approaches such as DVGO [15] and Plenoxels [8] optimize voxel grids of features to enable fast radiance field reconstruction. TensoRF [7] combines the traditional CP decomposition [6] with a new vector-matrix decomposition method, leading to faster training and higher-quality reconstruction.

In this work, in order to keep training times practical for vehicle inspection, we select a set of architectures that strike a balance between training time and reconstruction quality.

Table 1. Comparison between existing datasets used as benchmarks for neural radiance field evaluation and CarPatch. We provide each scene with varying amounts of training data (40, 60, 80, and 100 images), allowing users to test the robustness of their architectures. We also release depth and segmentation data for all images.

Scene Representation Benchmarks. One of the most widely used benchmarks for evaluating NeRF is the NeRF Synthetic Blender dataset [11]. This dataset consists of 8 different scenes generated using Blender, each with 100 training images and 200 test images. Other synthetic datasets include the Shiny Blender dataset [17], which mostly contains single objects with simple geometries, and BlendedMVS [20], which provides various scenes to test NeRF implementations at different scales. These works do not provide ground-truth information about the semantic content of the images, which makes it difficult to study the ability of NeRF to reconstruct certain surfaces compared to others. In our CarPatch dataset, we provide ground-truth segmentation of vehicle components in the scene, allowing architectures to be evaluated on specific parts. Table 1 presents a comparison between the most common benchmark datasets and our proposed dataset.

3 The CarPatch Dataset

In this section, we detail the source data and the procedure used to generate our CarPatch dataset. In particular, we describe how we gathered the 3D models, set up the Blender scenes, and designed the image capture process (Fig. 2).

Fig. 2. Sample RGB images (left), depth data (center), and segmentation masks (right) from CarPatch, for different car models.

3.1 Synthetic 3D Models and Scene Setup

All the 3D models included in the CarPatch scenes have been downloaded from Sketchfab, a large collection of free 3D objects available for research use. Table 2 provides a detailed list of all the starting models used. Each of them has been edited in Blender to enhance its realism; specifically, we improved the materials, colors, and lighting in each scene to create a more challenging environment.

The scenes have been set up following the Google Blender dataset [11]. Lighting conditions and rendering settings were customized to create a more realistic environment. The vehicle was placed at the center of the scene at position (0, 0, 0), with nine lights distributed around the car with varying emission strengths to create shadows and enhance reflections on the materials' surfaces. To improve realism, we resized the objects to match their real-world size. The camera and lights were placed so as to provide an accurate representation of the environment, making the scenes similar to real-world scenarios.
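As a reference, the following Blender Python sketch reproduces this kind of setup; the light radius, height, and energy values are illustrative assumptions, not the exact CarPatch settings.

```python
import math
import bpy

# Distribute nine point lights on a circle around the vehicle at (0, 0, 0),
# with varying emission strengths to create shadows and surface reflections.
# Radius, height, and energy values below are illustrative placeholders.
NUM_LIGHTS = 9
RADIUS = 6.0   # meters (vehicles are resized to real-world scale)
HEIGHT = 3.0

for k in range(NUM_LIGHTS):
    angle = 2.0 * math.pi * k / NUM_LIGHTS
    location = (RADIUS * math.cos(angle), RADIUS * math.sin(angle), HEIGHT)
    bpy.ops.object.light_add(type='POINT', location=location)
    bpy.context.object.data.energy = 500.0 + 250.0 * (k % 3)  # varying strengths
```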

Table 2. Summary of the source 3D models from which our dataset has been generated, including their key features.

3.2 Dataset Building

The dataset was built using the Python interface provided by Blender, which allowed us to control the objects in the environment. For each rendered image, we captured not only the RGB color values but also the corresponding depth map, as well as pixel-wise semantic segmentation masks for eight vehicle components: bumpers, lights, mirrors, hoods/trunks, fenders, doors, wheels, and windows. Examples of these segmentation masks can be seen in Fig. 1. Please note that all the pixels belonging to a component (e.g. doors) are grouped into the same class, regardless of the specific component location (e.g. front/rear/right/left door). The bpycv utility has been used to collect this additional metadata, enabling us to evaluate NeRF models on the RGB reconstruction and depth estimation of the overall vehicle as well as each of its subparts.
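A minimal sketch of this rendering loop is shown below; the component-to-id mapping and the object naming convention are hypothetical, while the bpycv calls follow its documented usage.

```python
import bpy
import bpycv
import cv2
import numpy as np

# Hypothetical mapping from the eight annotated components to integer ids.
COMPONENT_IDS = {"bumper": 1, "light": 2, "mirror": 3, "hood_trunk": 4,
                 "fender": 5, "door": 6, "wheel": 7, "window": 8}

# bpycv reads the "inst_id" custom property to build the segmentation map;
# here we assume component names appear in the Blender object names.
for obj in bpy.data.objects:
    for name, inst_id in COMPONENT_IDS.items():
        if name in obj.name.lower():
            obj["inst_id"] = inst_id

result = bpycv.render_data()  # renders RGB, depth, and instance maps at once
cv2.imwrite("rgb.png", result["image"][..., ::-1])  # RGB -> BGR for OpenCV
np.save("depth.npy", result["depth"])               # depth in meters, 0 = background
cv2.imwrite("seg.png", np.uint16(result["inst"]))   # per-pixel component ids
```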

For rendering the training images, the camera was moved randomly on the hemisphere centered at (0, 0, 0) and above the ground, with the camera rotation angle sampled from a uniform distribution before each new capture. For the test set, the camera was kept at a fixed distance from the ground and rotated around the Z-axis by a fixed angle of \(\frac{2\pi }{\#test\_views}\) radians before each new capture.
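The following sketch illustrates the two sampling schemes; the camera radius and height are illustrative assumptions, and in practice each sampled position is paired with a rotation that points the camera at the origin.

```python
import math
import random

def sample_train_position(radius=4.0):
    """Random training camera on the upper hemisphere centered at (0, 0, 0);
    angles are drawn from uniform distributions, keeping the camera above ground."""
    azimuth = random.uniform(0.0, 2.0 * math.pi)
    elevation = random.uniform(0.0, 0.5 * math.pi)
    return (radius * math.cos(elevation) * math.cos(azimuth),
            radius * math.cos(elevation) * math.sin(azimuth),
            radius * math.sin(elevation))

def test_positions(radius=4.0, height=1.5, n_test_views=200):
    """Test cameras: fixed distance from the ground, rotated around the
    Z-axis by a fixed step of 2*pi / n_test_views radians between captures."""
    step = 2.0 * math.pi / n_test_views
    return [(radius * math.cos(k * step), radius * math.sin(k * step), height)
            for k in range(n_test_views)]
```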

In order to guarantee the fairness of current and future comparisons, we explicitly provide four versions of each scene, varying the number of training images (40, 60, 80, and 100, respectively). Different versions of the same scene have no overlap in training camera poses, while the test set is always the same and contains 200 images per scene.

We release the code for dataset creation and metrics evaluation at https://github.com/davidedinuc/carpatch.

4 Benchmark

This section presents the selection and testing of various recent NeRF-based methods [7, 13, 15] on the proposed CarPatch dataset, with a detailed description of the experimental setting for each baseline. Additionally, we assess the quality of the reconstructed vehicles in terms of their appearance and 3D surface reconstruction, utilizing the depth maps generated during volume rendering.

4.1 Compared Methods

To overcome the challenges posed by illumination and reflective surfaces during vehicle reconstruction, it is crucial to choose an appropriate neural rendering approach. We tested the selected approaches on CarPatch without modifying the implementation details available in the original repositories whenever possible. However, some parameters had to be adjusted in order to fit our models (which are larger than the reference dataset meshes) into the scene. All tests were performed on a GeForce GTX 1080 Ti. After considering various NeRF systems, we selected the following baselines:

  • Instant-NGP [13] Since the original implementation of Instant-NGP is in CUDA, we used an available PyTorch implementation of this approach to ensure a fair comparison with the other methods. In our experiments, we used a batch size of 8192, a scene scale of 0.5, and a total of 30,000 iteration steps.

  • TensoRF [7] In our setting, a batch of 4096 rays was used. Additionally, we increased the overall scale of the scene from 1 to 3.5. These adjustments were made after experimentation and careful inspection of the resulting reconstructions. Training lasts 30,000 iterations.

  • DVGO [15] For this method, the training process consists of two phases: a coarse phase of 5,000 iterations, followed by a fine phase of 20,000 iterations that improves the model's ability to learn intricate details of the scene. In our experiments, we used a batch size of 8192 while keeping the default scene size.

Table 3. Quantitative results on the CarPatch test set for each vehicle model.
Table 4. Quantitative results on the CarPatch test set for each vehicle component averaged over the vehicle models.

4.2 Metrics

The effectiveness of the chosen methods has been assessed with the perceptual metrics typically used in NeRF-based reconstruction tasks, namely PSNR, SSIM [18], and LPIPS [21].

However, appearance-based metrics depend on the emitted radiance as well as on the learned volume density. We therefore propose two supplementary depth-based metrics to assess the volume density alone. Since ground-truth 3D models of the vehicles are not available in real-world scenarios, we use the depth map as our knowledge of the 3D surface of the objects. Specifically, we define a depth map as a matrix

$$\begin{aligned} D = \{d_{ij}\}, d_{ij} \in [0, R] \end{aligned}$$
(1)

in which each value \(d_{ij}\) ranges from 0 to the maximum depth value R. Furthermore, we estimate surface normals from the depth maps [14]. First, we define the (unnormalized) direction of a surface normal as:

$$\begin{aligned} \textbf{d} = \langle d_x, d_y, d_z \rangle = \left( -\frac{\partial d_{ij}}{\partial i}, -\frac{\partial d_{ij}}{\partial j}, 1 \right) \approx \left( d_{ij} - d_{(i+1)j},\ d_{ij} - d_{i(j+1)},\ 1 \right) \end{aligned}$$
(2)

where the first two components are the negative depth gradients along the i and j directions, respectively. Afterward, we normalize the vector to obtain a unit-length normal \(\textbf{n}(d_{ij}) = \frac{\textbf{d}}{\left\Vert \textbf{d} \right\Vert }\).
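In NumPy, this normal estimation can be sketched as follows (a direct transcription of Eq. 2, assuming depth maps stored as float arrays):

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map via forward
    finite differences (Eq. 2), then normalize to unit length."""
    di = depth[1:, :-1] - depth[:-1, :-1]   # depth gradient along i (rows)
    dj = depth[:-1, 1:] - depth[:-1, :-1]   # depth gradient along j (columns)
    d = np.stack([-di, -dj, np.ones_like(di)], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)
```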

We assess the 3D reconstruction’s quality through the following metrics:

  • Depth Root Mean Squared Error (D-RMSE) This metric measures the root mean squared difference, in meters, between the ground truth and predicted depth maps.

    $$\begin{aligned} \text {D-RMSE} = \sqrt{\frac{\sum _{i=1}^{M}\sum _{j=1}^{N} (\hat{d}_{ij} - d_{ij})^2}{M \cdot N}} \end{aligned}$$
    (3)
  • Surface Normal Root Mean Squared Error (SN-RMSE) This metric measures the root mean squared angular error, in degrees, between the ground truth and predicted surface normals.

    $$\begin{aligned} \text {SN-RMSE} = \sqrt{\frac{\sum _{i=1}^{M}\sum _{j=1}^{N} \left( \arccos \left( \textbf{n}(\hat{d}_{ij}) \cdot \textbf{n}(d_{ij}) \right) \right) ^2}{M \cdot N}} \end{aligned}$$
    (4)

D-RMSE and SN-RMSE are computed only for those pixels with a positive depth value in both GT and predicted depth maps. This avoids computing depth estimation errors on background pixels (which have a fixed depth value of 0).
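A NumPy sketch of both metrics, as defined in Eqs. 3-4, is given below; it reuses the normals_from_depth helper sketched above, and the handling of the one-pixel border lost by the finite differences is an implementation assumption.

```python
import numpy as np

def depth_metrics(gt_depth, pred_depth):
    """Compute D-RMSE (meters) and SN-RMSE (degrees) over the pixels
    with a positive depth in BOTH maps, ignoring the background (depth 0)."""
    mask = (gt_depth > 0) & (pred_depth > 0)
    d_rmse = np.sqrt(np.mean((pred_depth[mask] - gt_depth[mask]) ** 2))

    n_gt = normals_from_depth(gt_depth)
    n_pred = normals_from_depth(pred_depth)
    cos = np.clip(np.sum(n_gt * n_pred, axis=-1), -1.0, 1.0)
    ang_err = np.degrees(np.arccos(cos))  # per-pixel angular error
    fg = mask[:-1, :-1]                   # normal maps are one pixel smaller
    sn_rmse = np.sqrt(np.mean(ang_err[fg] ** 2))
    return d_rmse, sn_rmse
```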

Fig. 3. Performance by varying the number of training images, in terms of PSNR, SSIM, LPIPS, D-RMSE, and SN-RMSE. Despite its lower overall performance, Instant-NGP [13] exhibits low variance with respect to the amount of training data.

Fig. 4. Performance by camera viewing angle, in terms of PSNR, SSIM, LPIPS, D-RMSE, and SN-RMSE. Depending on the training camera distribution, all the methods struggle wherever the viewpoints are more sparse (e.g. between \(225^\circ \) and \(270^\circ \)). The red arrow represents where the front of the vehicle is facing. (Color figure online)

4.3 Results

This section presents both quantitative and qualitative results obtained by the selected NeRF baselines. We discuss their performance on the CarPatch dataset by analyzing the impact of the viewing camera angle and of the number of training images.

According to Table 3, all the selected NeRF approaches achieve satisfactory results. Although the baselines show similar appearance scores (PSNR, SSIM, and LPIPS), our evaluation with the depth-based metrics (D-RMSE and SN-RMSE) reveals significant differences in the 3D reconstruction of the vehicles. DVGO outperforms its competitors in depth estimation, with a \(+13.5\%\) improvement over Instant-NGP and a \(+7.6\%\) improvement over TensoRF. In contrast, TensoRF predicts a more accurate 3D surface, with the lowest angular error on the surface normals.

Since our use case is vehicle inspection, in Table 4 we report results computed on each car component. For this purpose, we mask both GT and predictions with the specific component mask before computing the metrics. However, given the limited area of each component, this would lead to an unbalanced ratio between background and foreground pixels and, ultimately, to a biased metric value. Since D-RMSE and SN-RMSE are computed only on foreground pixels (see Sect. 4.2), the depth-based metrics are not affected by this issue. For PSNR, SSIM, and LPIPS, instead, we compute component-level metrics over the image crop delimited by the bounding box around each mask, as sketched below. As expected, NeRF struggles to reconstruct reflective and transparent components (e.g. mirrors, lights, and windows), obtaining the highest depth and normal estimation errors. Over the single components, however, TensoRF outperforms the competitors on most metrics, particularly on surface normal estimation. The errors in the reconstruction of specific components' surfaces can also be appreciated in the qualitative results of Fig. 5.
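A minimal sketch of this cropping strategy follows; the helper names are ours, and the PSNR shown is the standard definition for images in a known value range.

```python
import numpy as np

def crop_to_component(img_gt, img_pred, comp_mask):
    """Crop both images to the bounding box of a component mask, so that
    appearance metrics are not dominated by background pixels."""
    ys, xs = np.nonzero(comp_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    return img_gt[y0:y1, x0:x1], img_pred[y0:y1, x0:x1]

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two images in [0, max_val]."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```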

Moreover, we analyze the performance of each method as a function of the number of training images. We trained the baselines on every version of the CarPatch dataset and report the results in Fig. 3. Reducing the number of training images has a significant impact on all metrics regardless of the method. However, Instant-NGP proves more robust to the number of camera viewpoints, with a smoother performance drop, especially in terms of LPIPS, D-RMSE, and SN-RMSE.

Finally, we discuss how the distribution of training camera viewpoints around the vehicle affects the performance of each method at certain camera angles. As depicted in Fig. 4, the metrics vary considerably between \(180^\circ \) and \(270^\circ \) and between \(0^\circ \) and \(45^\circ \). Indeed, in these areas the datasets have sparser camera viewpoints and, as expected, all the methods are affected.

Fig. 5. Sample 3D reconstruction of the Tesla: (left) the reconstructed RGB, depth, and surface normals, (right) the reconstructed surfaces on the triangle mesh.

5 Conclusion

In this article, we have proposed a new benchmark for the evaluation and comparison of NeRF-based techniques. Focusing on one of the many concrete applications of this recent technology, i.e. vehicle inspection, we first created a new synthetic dataset including renderings of 8 vehicles. In addition to the set of RGB views annotated with camera poses, the dataset is enriched with semantic segmentation masks as well as depth maps to further analyze the results and compare the methods.

The presence of reflective surfaces and transparent parts makes vehicle reconstruction still challenging. We propose additional metrics, as well as new graphical ways of displaying the results, to make these limitations more evident. We are confident that CarPatch can serve as a solid basis for research on NeRF models in general and, more specifically, on their application to the field of vehicle reconstruction.