
1 Introduction

Deep learning has shown immense potential in the field of 3D vision in recent years, advancing challenging tasks such as 3D object reconstruction, pose estimation, shape retrieval, and robotic grasping. But unlike for 2D tasks [10, 23, 28], large scale real world datasets for 3D object understanding are scarce. Hence, to allow for further advancement of the state-of-the-art in 3D object understanding, we introduce our dataset consisting of 998 high resolution, textured 3D models of everyday tabletop objects along with 847K real world RGB-D images of them. Accurate annotation of camera pose and object pose is provided for each image. Figure 1 shows some sample data from our dataset.

We primarily focus on learned multi-view 3D reconstruction due to the lack of real world datasets for the task. 3D reconstruction methods [15, 38, 43, 48, 50] learn to predict the 3D model of an object from its color images with known camera and object poses. They require a large number of training examples to generalize to unseen images. While datasets like Pix3D [44], PASCAL3D+ [52] and ObjectNet3D [51] provide 3D models and real world images, they are mostly limited to a single image per model.

Fig. 1. Sample data from our dataset. From left to right: a visualization of the textured 3D model, and three sample multi-view images with the wireframe object model superimposed based on the annotated camera and object poses.

Existing multi-view 3D reconstruction methods [8, 21, 38, 43, 50] rely heavily on synthetic datasets, especially ShapeNet [6], for training and evaluation. A few works [25, 38] utilize real world datasets [7], but only for qualitative evaluation purposes, not for training or quantitative evaluation. To remedy this, we present our dataset and validate its usefulness by performing training as well as qualitative and quantitative evaluation with various state-of-the-art multi-view 3D reconstruction baselines.

The contributions of our work are as follows:

1. To the best of our knowledge, our dataset is the first real world dataset that can be used for training and quantitative evaluation of learning-based multi-view 3D reconstruction algorithms.

2. We present two novel methods for automatic and semi-automatic data annotation. We will make the annotation tools publicly available to allow future extensions of the dataset.

2 Related Work

3D Shapes Dataset: Datasets like the Princeton Shape Benchmark [42], FAUST [2], and ShapeNet [6] provide a large collection of 3D CAD models of diverse objects, but without associated real world RGB images. PASCAL3D+ [52] and ObjectNet3D [51] performed rough alignment between images from existing datasets and 3D models from online shape repositories. IKEA [27] also performed 2D-3D alignment on existing datasets, with finer alignment but a smaller set of images and shapes (759 images and 90 shapes). Pix3D [44] extended IKEA to 10K images and 395 shapes through crowdsourcing and manual scanning of some objects. These datasets mostly have single-view images associated with the shapes.

Datasets like [4, 19, 24] have utilized RGB-D sensors to capture a relatively small number of objects and are mostly geared towards robot manipulation tasks rather than 3D reconstruction. Knapitsch et al. [22] provided a small number of large scale scenes which are suitable for benchmarking traditional Structure-from-Motion (SfM) and Multi-view Stereo (MVS) algorithms rather than learned 3D reconstruction.

Table 1. Comparison between different datasets. Objectron and CO3D only provide point cloud models of the objects. Pix3D contains a mixture of scanned and CAD 3D models. PASCAL3D+ and ObjectNet3D only have rough object pose annotation, while the annotation is not provided in Redwood-OS. Only our dataset provides precisely scanned texture-mapped 3D models that are further registered to multi-view RGB images.

The dataset that is closest to ours is Redwood-OS [7]. It provides RGB-D videos of 398 objects and their 3D scene reconstructions. There are several crucial limitations that have prevented widespread adoption of this dataset for multi-view 3D reconstruction, though. Firstly, the dataset is not annotated with camera and object pose information. While the camera poses can be obtained using Simultaneous Localization and Mapping (SLAM) or Structure-from-Motion (SfM) techniques [3, 11, 32, 40, 41], obtaining accurate object poses is considerably harder. Also, the 3D reconstructions were performed at scene level rather than object level, making it difficult to use them directly as supervision for object reconstruction.

More recently, Objectron [1] and CO3D [37] have provided large scale video sequences of real world objects along with point clouds and object poses but without precise dense 3D models. We aim to tackle the shortcomings of the existing datasets and create a dataset that can effectively serve as a real world benchmark for learning-based multi-view 3D reconstruction models. Table 1 shows the comparison between the relevant datasets.

3D Reconstruction: The methods in [15, 16, 34, 45, 48, 54] predict 3D models from single-view color images. Since a single-view image can only provide a limited coverage of a target object, multi-view input is preferred in many applications. SLAM and Structure-from-Motion methods  [3, 11, 32, 40, 41] are popular ways of performing 3D reconstruction but they struggle with poorly textured and non-Lambertian surfaces and require careful input view selection. Deep learning has emerged as a potential solution to tackle these issues. Early works like [8, 17, 21] used Recurrent Neural Networks (RNN) to perform multi-view 3D reconstruction. Pixel2Mesh++ [50] introduced cross-view perceptual feature pooling and multi-view deformation reasoning to refine an initial shape. MeshMVS [43] predicted a coarse volume from Multi-view Stereo depths first and then applied deformations on it to get a finer shape. All of these works were trained and evaluated exclusively on synthetic datasets due to the lack of proper real world datasets.

Some recent works like DVR [33], IDR [55], NeuS [49], and Geo-Neus [13] have focused on unsupervised 3D reconstruction with expensive per-scene optimization for each object. These methods encode each scene into a separate Multi-layer Perceptron (MLP) that implicitly represents the scene as a Signed Distance Function (SDF) or Occupancy Field. They have obtained impressive results on small scale datasets of real world objects [20, 53]. Our dataset can further be used to evaluate these methods quantitatively at a much larger scale.

3 Data Acquisition

Our data acquisition takes place in two steps. First, a detailed and textured 3D model of an object is generated using a Shining3D® EinScan-SE 3D scanner. The scanner uses a calibrated turntable, a 1.3 Megapixel camera and visible light sources to obtain the 3D model of the object. Then, an Intel® RealSense LiDAR Camera L515 is used to record an RGB-D video sequence of the object placed on a round ottoman chair, capturing a 360° view around the object. The video is recorded at 30 frames per second in HD resolution (1280 \(\times \) 720). Figure 1 shows a number of 3D models and some sample color images from our dataset.

Datasets like [7, 24] perform 3D model generation and video recording in one step by reconstructing the 3D scene captured by the images. The quality of the 3D models generated this way depends heavily on the trajectory of the camera and requires some level of expertise for data collection. Furthermore, these datasets use consumer grade cameras which cannot reconstruct fine details in the 3D geometry. We therefore use specialized hardware designed for high quality 3D scanning.

Another approach is to utilize 3D CAD models from online repositories and match them with real world 2D images, which are also mostly collected online [9, 27, 51, 52]. The downside of this approach is that it is difficult to ensure exact instance-level match between 3D models and 2D images. According to a survey conducted by Sun et al.  [44], test subjects reported that only a small fraction of the images matched the corresponding shapes in datasets [51, 52].

4 Data Annotation

The most challenging aspect of creating a large scale real world dataset for object reconstruction is generating ground truth annotations. Most learning-based 3D reconstruction methods require accurate camera poses as well as consistent object poses in the camera coordinate frame. While it is fairly easy to obtain the camera poses, obtaining accurate object poses is more challenging.

The methods in [44, 52] perform object pose estimation by manually annotating corresponding keypoints in the 3D models and 2D images, and then performing 2D-3D alignment with the Perspective-n-Point (PnP) [14, 26] and Levenberg-Marquardt algorithms [31]. Note that these datasets mostly contain a single image for each 3D model, which makes this kind of annotation feasible. In comparison, we aim to do this for video sequences with up to 1000 images, which would be labor intensive. Additionally, estimating an object pose that is consistent over multi-view images would require keypoint matches at sub-pixel accuracy, which is impossible with manual annotation.

On the other hand, the methods in [9, 51] manually annotate the object pose directly by either trying to align the 3D model with the scene reconstruction [9] or the re-projected 3D model with 2D image [51]. We found these techniques to be inadequate for producing multi-view consistent object poses and therefore develop our own annotation systems.

4.1 Notations

We represent an object pose by \(\xi \in SE(3)\) where SE(3) is the 3D Special Euclidean Lie group [47] of 4 \(\times \) 4 rigid body transformation matrices:

$$\begin{aligned} \xi = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \end{aligned}$$
(1)

where R is the 3 \(\times \) 3 rotation matrix and t is the 3D translation vector.

We define object pose \({}_{\text {w}} \xi _{\text {obj}}\) as the transformation from canonical object frame (obj) to world frame (w). Similarly, the pose of the \(i^{\text {th}}\) camera \({}_{\text {w}} \xi _{\text {cam}_i}\) represents the transformation from camera to world frame. The canonical object frame is centered at the object with z-axis pointing upwards along the gravity direction while the world frame is arbitrary (e.g. pose of the first camera).

We use pinhole camera model with camera intrinsics matrix \(K \):

$$\begin{aligned} K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \end{aligned}$$
(2)

where \(f_x\) and \(f_y\) are the focal lengths and \(c_x\) and \(c_y\) the principal point coordinates. These parameters are provided by the camera manufacturers.

The image coordinates \(p \) of a 3D point \(P_w\) in homogeneous world coordinates can be computed as:

$$\begin{aligned} p = K \begin{bmatrix} R_i^{T} & -R_i^{T} t_i \end{bmatrix} P_w \end{aligned}$$
(3)

where \(R_i\) and \(t_i\) are the rotation and translation components of the camera pose.
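
As a concrete illustration, the following is a minimal NumPy sketch of the projection in Eq. (3); the function and variable names are ours, not from the dataset's tooling.

```python
# A minimal sketch of the projection model of Eq. (3), assuming illustrative
# intrinsics; not the dataset's actual annotation code.
import numpy as np

def project_point(K, R_i, t_i, P_w):
    """Project a 3D point (world frame) into camera i.

    K   : 3x3 intrinsics matrix (Eq. (2)).
    R_i : 3x3 rotation of the camera-to-world pose.
    t_i : (3,) translation of the camera-to-world pose.
    P_w : (3,) point in world coordinates.
    """
    # World-to-camera transform is the inverse of the camera pose: [R^T | -R^T t]
    P_c = R_i.T @ (P_w - t_i)
    p = K @ P_c                  # homogeneous image coordinates
    return p[:2] / p[2]          # perspective division -> pixel coordinates

# Example: a point 1 m in front of a camera sitting at the world origin
K = np.array([[600.0, 0.0, 640.0],
              [0.0, 600.0, 360.0],
              [0.0, 0.0, 1.0]])
print(project_point(K, np.eye(3), np.zeros(3), np.array([0.0, 0.0, 1.0])))  # -> [640. 360.]
```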

Fig. 2. Texture-rich Object Annotation. Step 1: Synthetic views of the 3D model are rendered. Step 2: Feature matching is performed between and across the real and synthetic images. Step 3: The poses of the real and virtual cameras are estimated. Step 4: The object pose is estimated by 7-DOF alignment between the estimated and ground truth virtual camera poses.

The images taken by our RGB-D camera suffer from radial and tangential distortion. For the purpose of annotation, we undistort the images so that the pinhole camera model holds.
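
The undistortion can be done with standard tools; the sketch below uses OpenCV as one possible choice (the paper does not prescribe a library), with placeholder distortion coefficients standing in for the manufacturer-provided calibration.

```python
# A hedged sketch of the undistortion step using OpenCV; K and the distortion
# coefficients below are placeholders for the camera's factory calibration.
import cv2
import numpy as np

K = np.array([[600.0, 0.0, 640.0],
              [0.0, 600.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.array([0.1, -0.05, 0.001, 0.001, 0.0])   # placeholder [k1, k2, p1, p2, k3]

img = cv2.imread("frame_000000.png")               # illustrative file name
undistorted = cv2.undistort(img, K, dist)          # pinhole model holds afterwards
cv2.imwrite("frame_000000_undistorted.png", undistorted)
```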

We now present two methods for annotating our dataset depending on the texture-richness of the object being scanned: Texture-rich Object Annotation and Textureless Object Annotation.

4.2 Texture-rich Object Annotation

Since our 3D models have high-fidelity textures from our 3D scanner, we can utilize them to annotate the object pose in the recorded video sequence. We perform joint camera and object pose estimation by matching keypoints between the images and the 3D model to ensure camera and object pose consistency over multiple views. Figure 2 illustrates the annotation process, which involves the following steps:

i. Rendering synthetic views of a 3D model: Instead of directly matching keypoints between a 3D model and 2D images, we render synthetic views of the 3D model and perform 2D keypoint matching. We use the physically based rendering engine Pyrender [29] to render synthetic views. This allows us to utilize robust keypoint matching algorithms developed for RGB images. The virtual camera poses for rendering are randomly sampled around the object by varying the camera distance and the azimuth/elevation angles with respect to the object. We verify the quality of each rendered image by checking whether there are sufficient keypoint matches against the real images. 150 images are rendered for each object model.
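
The following is a minimal sketch of this rendering step using Pyrender; the intrinsics, sampling ranges, lighting and file names are illustrative assumptions rather than the exact values used for the dataset.

```python
# A sketch of step i: render views of the scanned model from virtual cameras
# sampled over distance, azimuth and elevation; all numeric ranges are assumed.
import numpy as np
import trimesh
import pyrender

mesh = pyrender.Mesh.from_trimesh(trimesh.load("model.obj", force="mesh"))

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world pose using the OpenGL convention (camera looks along -z)."""
    z = eye - target
    z /= np.linalg.norm(z)
    x = np.cross(up, z); x /= np.linalg.norm(x)
    y = np.cross(z, x)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = x, y, z, eye
    return pose

renderer = pyrender.OffscreenRenderer(1280, 720)
camera = pyrender.IntrinsicsCamera(fx=600.0, fy=600.0, cx=640.0, cy=360.0)

virtual_poses = []                                         # ground truth poses, kept for step iv
for i in range(150):
    r = np.random.uniform(0.4, 0.8)                        # camera distance (m), assumed range
    az = np.random.uniform(0.0, 2.0 * np.pi)               # azimuth
    el = np.random.uniform(np.deg2rad(10), np.deg2rad(70)) # elevation
    eye = r * np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
    pose = look_at(eye)

    scene = pyrender.Scene()
    scene.add(mesh)
    scene.add(camera, pose=pose)
    scene.add(pyrender.DirectionalLight(intensity=3.0), pose=pose)
    color, _ = renderer.render(scene)                      # rendered RGB image to be saved
    virtual_poses.append(pose)
```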

ii. Feature matching: We perform exhaustive feature matching across as well as within the real and synthetic images using the neural network based feature matching technique SuperGlue [39].

iii. Camera pose estimation: Given the keypoint matches, we estimate the camera poses of both the real and virtual cameras in the same world coordinate frame using the SfM tool COLMAP [40, 41].

iv. Object pose estimation: Let \(\{\hat{\xi }_i\ |\ i=1,...,150\}\) be the ground truth poses of the virtual cameras in the object frame (we keep track of the ground truth poses during the rendering step). Let \(\{\xi _{i}\ |\ i=1,...,150\}\) be the corresponding poses estimated by COLMAP in the world frame. By aligning \(\{\xi _i\}\) and \(\{\hat{\xi }_i\}\) we can estimate the object pose. We use the Kabsch-Umeyama algorithm [46] under a Random Sample Consensus (RANSAC) [5] scheme to perform a 7-DOF (pose + scale) alignment. Since COLMAP only uses 2D image information, its poses have arbitrary scale; hence we perform a 7-DOF alignment instead of 6-DOF to obtain metric scale. Applying the Kabsch-Umeyama algorithm yields a 7-DOF transformation S in the Sim(3) Lie group parameterized as:

$$\begin{aligned} S = \begin{bmatrix} sR_s & t_s \\ 0 & 1 \end{bmatrix} \end{aligned}$$
(4)

The camera poses from COLMAP can then be transformed to metric-scale poses:

$$\begin{aligned} {}_{\text {w}} \xi _{\text {cam}_i} = \begin{bmatrix} R_s R_i & sR_s t_i + t_s \\ 0 & 1 \end{bmatrix} \end{aligned}$$
(5)

where \(R_i\) and \(t_i\) are the rotation and translation components of the camera poses from COLMAP.

Since the ground truth virtual camera poses \(\{\hat{\xi }_i | i=1,...,150\}\) are in the object frame, the transformation in Eq. (5) leads to camera poses in the object frame, i.e. \({}_{\text {w}} \xi _{\text {obj}} = \mathbb {I}\), where \(\mathbb {I}\) is the 4 \(\times \) 4 identity matrix.
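
For concreteness, the sketch below shows the core of step iv: a Kabsch-Umeyama fit on corresponding camera centers followed by the metric-scale transform of Eq. (5). The RANSAC inlier loop used in practice is omitted for brevity, and the function names are ours.

```python
# A simplified sketch of the 7-DOF (Sim(3)) alignment of Eq. (4)/(5); the
# RANSAC scheme from the paper is reduced to a single least-squares fit here.
import numpy as np

def umeyama(src, dst):
    """Return s, R, t minimizing || dst - (s * R @ src + t) ||^2 (src, dst: Nx3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # reflection correction
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

def align_to_object_frame(xi_colmap, xi_gt):
    """xi_colmap / xi_gt: lists of 4x4 camera-to-world poses (COLMAP frame / object frame)."""
    centers_colmap = np.stack([xi[:3, 3] for xi in xi_colmap])
    centers_gt = np.stack([xi[:3, 3] for xi in xi_gt])
    s, R_s, t_s = umeyama(centers_colmap, centers_gt)
    aligned = []
    for xi in xi_colmap:                    # apply Eq. (5) to every COLMAP pose
        out = np.eye(4)
        out[:3, :3] = R_s @ xi[:3, :3]
        out[:3, 3] = s * R_s @ xi[:3, 3] + t_s
        aligned.append(out)
    return s, R_s, t_s, aligned
```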

4.3 Textureless Object Annotation

While the pipeline outlined in Sub-section 4.2 can accurately annotate texture-rich objects, it fails for textureless objects since correct feature matches among the images cannot be established. To tackle this problem, we develop another annotation system, shown in Fig. 3, that can handle objects lacking good textures. It consists of the following steps:

Fig. 3. Textureless object annotation. Step 1: Camera pose annotation (+ dense scene reconstruction). Step 2: Manual annotation of a rough object pose, where a transparent projection of the object model is superimposed over an RGB image for 2D visualization (top) and the 3D object is placed alongside the dense scene reconstruction for 3D visualization (bottom). Step 3: The object pose is refined such that the object projection overlaps with the ground truth mask (green). (Color figure online)

i. Camera pose estimation: Even when the object being scanned is textureless, the background has sufficient texture to allow successful camera pose estimation. We therefore utilize the RGB-D version of ORB-SLAM2 [32] to obtain the camera poses \(\{ {}_{\text {w}} \xi _{\text {cam}_i} \}\). Since it uses depth information alongside RGB, the poses are in metric scale.

ii. Manual annotation of rough object pose: We create an annotation interface, shown in Step 2 of Fig. 3, to estimate a rough object pose. To facilitate the annotation, we reconstruct the 3D scene from the RGB-D images and the camera poses estimated in the previous step using Truncated Signed Distance Function (TSDF) fusion [56]. The object pose \({}_{\text {w}} \xi _{\text {obj}}\) is initialized at a fixed distance in front of the first camera, with its z-axis aligned with the principal axis of the 3D scene found using Principal Component Analysis (PCA). An annotator can then update the 3 translation and 3 Euler angle (roll-pitch-yaw) components of the 6D object pose using the keyboard to align the object model with the scene. In addition to the 3D scene, we also show the projection of the object model over an RGB image. The RGB image can be changed to verify the consistency of the object pose over multiple views.
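
A hedged sketch of the scene reconstruction and pose initialization is given below. The paper cites TSDF fusion [56] without naming a library; Open3D is used here purely as a stand-in, and the voxel parameters, intrinsics and data layout are illustrative.

```python
# A sketch of step ii with Open3D (one possible TSDF implementation, not
# necessarily the one used for the dataset); parameter values are assumed.
import numpy as np
import open3d as o3d

def fuse_scene(frames, intrinsic):
    """frames: list of (color_path, depth_path, 4x4 camera-to-world pose from ORB-SLAM2)."""
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=0.005, sdf_trunc=0.02,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
    for color_path, depth_path, cam_pose in frames:
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.io.read_image(color_path), o3d.io.read_image(depth_path),
            depth_scale=1000.0, convert_rgb_to_intensity=False)
        # Open3D expects the world-to-camera extrinsic, i.e. the inverse camera pose.
        volume.integrate(rgbd, intrinsic, np.linalg.inv(cam_pose))
    return volume.extract_triangle_mesh()

def principal_axis(mesh):
    """Largest-variance direction of the fused scene, used to initialize the object z-axis."""
    pts = np.asarray(mesh.vertices)
    cov = np.cov((pts - pts.mean(0)).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, np.argmax(eigvals)]

# Example intrinsics matching the 1280x720 recordings (placeholder focal lengths).
intrinsic = o3d.camera.PinholeCameraIntrinsic(1280, 720, 600.0, 600.0, 640.0, 360.0)
```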

iii. Object pose refinement: We find that obtaining an accurate object pose through manual annotation alone is difficult, so we refine it further by aligning the projection of the 3D object model with ground truth object masks in different images. The ground truth object masks are obtained from Cascade Mask R-CNN [18] with a 152-layer ResNeXt backbone pretrained on ImageNet.

Let \({}_{\text {w}} \xi _{\text {obj}}\) be the rough object pose from manual annotation and \({}_{\text {w}} \xi _{\text {cam}_i}\) be the pose of the \(i^{\text {th}}\) camera. The camera-centric object pose is represented as follows:

$$\begin{aligned} \xi = {}_{\text {cam}_i} \xi _{\text {obj}} = ({}_{\text {w}} \xi _{\text {cam}_i})^{-1} \times {}_{\text {w}} \xi _{\text {obj}} \end{aligned}$$
(6)

The transformation \(\xi \in SE(3)\) is used to differentiably render [36] the object model onto the image of camera i to obtain the rendered object mask by applying the projection model of Eq. (3). Since direct optimization in the manifold space SE(3) is not possible, we instead optimize the linearized increment of the manifold around \(\xi \). This is a common technique in SLAM and Visual Odometry [11, 32].

Let \(\delta \xi \in \mathfrak {se}(3)\) represent the linearized increment of \(\xi \) belonging to the Lie algebra \(\mathfrak {se}(3)\) corresponding to Lie Group \(SE (3)\) [47]. The updated object pose is given by:

$$\begin{aligned} \xi ' = \xi \times exp(\delta \xi ) \end{aligned}$$
(7)

Here, exp represents the exponential map that transforms \(\mathfrak {se}(3)\) to \(SE (3)\). The object pose w.r.t. world frame can also be updated by right multiplication of the initial pose with \(exp(\delta \xi )\).

We can optimize \(\delta \xi \) in order to increase the overlap between the rendered mask M at \(\xi '\) and ground truth mask \(\hat{M}\) using least-squares minimization of the mask loss:

$$\begin{aligned} \mathcal {L}_{\text {mask}} = \text {mean} (\Vert M \ominus \hat{M} \Vert _{2}) \end{aligned}$$
(8)

where \(\ominus \) represents element-wise subtraction.

The optimization is performed using stochastic gradient descent for each camera for 30 iterations using the PyTorch [35] library. Since \(\delta \xi \in \mathfrak {se}(3)\) cannot represent large changes in pose, we update the pose \(\xi \leftarrow \xi '\) every 30 iterations and relinearize \(\delta \xi \) around the new \(\xi \).
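
The sketch below illustrates this refinement loop in PyTorch with an explicitly written \(\mathfrak {se}(3)\) exponential map. The differentiable silhouette renderer is represented by a render_mask callable, a hypothetical placeholder for the renderer of [36]; the learning rate and loop counts are illustrative.

```python
# A sketch of the pose refinement of Eqs. (6)-(8): optimize a 6-vector se(3)
# increment, fold it into the pose every 30 iterations, and relinearize.
import torch

def so3_hat(w):
    """3-vector -> 3x3 skew-symmetric matrix."""
    zero = torch.zeros((), dtype=w.dtype, device=w.device)
    return torch.stack([
        torch.stack([zero, -w[2], w[1]]),
        torch.stack([w[2], zero, -w[0]]),
        torch.stack([-w[1], w[0], zero])])

def se3_exp(delta_xi, eps=1e-8):
    """Exponential map se(3) -> SE(3); delta_xi = (v, w), translational then rotational part."""
    v, w = delta_xi[:3], delta_xi[3:]
    theta = torch.linalg.norm(w) + eps
    W = so3_hat(w)
    I = torch.eye(3, dtype=delta_xi.dtype, device=delta_xi.device)
    R = I + torch.sin(theta) / theta * W + (1 - torch.cos(theta)) / theta ** 2 * (W @ W)
    V = I + (1 - torch.cos(theta)) / theta ** 2 * W + (theta - torch.sin(theta)) / theta ** 3 * (W @ W)
    top = torch.cat([R, (V @ v).unsqueeze(1)], dim=1)
    bottom = torch.tensor([[0.0, 0.0, 0.0, 1.0]], dtype=delta_xi.dtype, device=delta_xi.device)
    return torch.cat([top, bottom], dim=0)

def refine_pose(xi, gt_mask, render_mask, num_relinearizations=10, iters=30, lr=1e-3):
    """xi: 4x4 camera-centric object pose (Eq. (6)); render_mask: hypothetical
    differentiable silhouette renderer mapping a 4x4 pose to a mask image."""
    for _ in range(num_relinearizations):
        delta_xi = torch.zeros(6, requires_grad=True)
        opt = torch.optim.SGD([delta_xi], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            mask = render_mask(xi @ se3_exp(delta_xi))   # Eq. (7), then projection
            loss = ((mask - gt_mask) ** 2).mean()        # mask loss of Eq. (8)
            loss.backward()
            opt.step()
        # Relinearize: fold the converged increment into the pose, start a fresh delta.
        xi = (xi @ se3_exp(delta_xi)).detach()
    return xi
```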

5 Dataset Statistics

We collected 998 objects in total. It typically takes about 20 min to scan the 3D model of an object and record a video, but about 2 h to register the scanned 3D model to all the video frames. Table 2 shows the category distribution of objects in our dataset along with the method used to annotate each object (texture-rich vs. textureless). Each category contains 39–115 objects, with an average of 67 objects per category. A majority of the objects (89%) were annotated using the texture-rich pipeline, which requires no user input. Table 3 shows the distribution of images over the categories. We have on average 56K images per category.

Table 2. Annotation statistics.
Table 3. Image distribution over the categories. The number of images in each category has been rounded to the nearest 1000.

6 Evaluation

To verify the usefulness of our dataset, we train and evaluate state-of-the-art multi-view 3D reconstruction baselines exclusively on our dataset. From each object, we randomly sample 100 different 3-view image tuples as the multi-view inputs. To ensure fair evaluation and avoid overfitting, we split our dataset into training, testing and validation sets in an approximately 70%-20%-10% ratio. The split is performed such that the distribution within each object category is also 70%-20%-10%. Only the data in the training set is used to fit the baseline models, while the validation set is used to decide when to save the model parameters during training (checkpointing). All the evaluation results presented here are on the test set, which is entirely held out during training.
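
A minimal sketch of such a per-category stratified split is shown below; the data structure and random seed are illustrative, not the exact procedure used to produce the released splits.

```python
# A sketch of a stratified 70/20/10 split over object IDs, grouped by category.
import random

def split_dataset(objects_by_category, seed=0):
    """objects_by_category: dict mapping category name -> list of object IDs (assumed layout)."""
    rng = random.Random(seed)
    splits = {"train": [], "test": [], "val": []}
    for category, obj_ids in objects_by_category.items():
        ids = sorted(obj_ids)
        rng.shuffle(ids)
        n = len(ids)
        n_train, n_test = int(0.7 * n), int(0.2 * n)
        splits["train"] += ids[:n_train]
        splits["test"] += ids[n_train:n_train + n_test]
        splits["val"] += ids[n_train + n_test:]   # remaining ~10%
    return splits
```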

6.1 Experiments

We evaluate our dataset with several recent learning-based 3D reconstruction baseline methods, including Multi-view Pixel2Mesh (MVP2M) [50], Pixel2Mesh++ (P2M++) [50], the multi-view extension of Mesh R-CNN [15] (MV M-RCNN) provided by [43], MeshMVS [43], DVR [33], IDR [55] and COLMAP [40, 41]. We use the ‘Sphere-Init’ version of Mesh R-CNN and the ‘Back-projected depth’ version of MeshMVS.

MVP2M pools multi-view image features and uses them to deform an initial ellipsoid to the desired shape. Pixel2Mesh++ deforms the mesh predicted by MVP2M by taking a weighted sum of deformation hypotheses sampled near the MVP2M mesh vertices. MV M-RCNN improves on MVP2M with a deeper backbone, a better training recipe and a higher resolution initial shape.

MeshMVS first predicts depth images using Multi-view Stereo and uses the depths to obtain a coarse shape, which is then deformed using techniques similar to MVP2M and MV M-RCNN. To train the depth prediction network of MeshMVS, we use depths rendered from the 3D object models, since the recorded depth can be inaccurate or missing altogether at close distances. We also evaluate the baseline MeshMVS (RGB-D), which uses ground truth depths instead of predicted depths to obtain the coarse shape, essentially performing shape completion instead of prediction.

We also include the per-scene optimized baselines DVR, IDR and COLMAP, which do not require training generalizable priors with 3D supervision. DVR and IDR perform NeRF [30] like optimization to learn 3D models from images using an implicit neural representation. COLMAP performs Structure-from-Motion (SfM) to first generate a sparse point cloud, which is further densified using the Patch Match Stereo algorithm [41]. These methods require a larger number of images to produce satisfactory results, hence we use 64 input images. Since the time required to reconstruct a scene is large for these methods, we evaluate them only on 30 scenes from the test set, 2 from each category.

All of the baselines require the object in the images to be segmented from the background. We do this by rendering 2D image masks of the 3D object models using the annotated camera/object poses. We also transform the images to the size and intrinsics (Eq. (2)) required by each baseline before training/testing.
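
When the images are resized to a baseline's expected resolution, the intrinsics of Eq. (2) must be scaled accordingly; a small sketch (crop handling omitted) is given below.

```python
# A sketch of adjusting the intrinsics when resizing images; cropping and
# principal-point offsets beyond pure scaling are not handled here.
import numpy as np

def resize_intrinsics(K, src_hw, dst_hw):
    """K: 3x3 intrinsics; src_hw/dst_hw: (height, width) before/after resizing."""
    sy = dst_hw[0] / src_hw[0]
    sx = dst_hw[1] / src_hw[1]
    K_new = K.copy()
    K_new[0, 0] *= sx   # fx
    K_new[1, 1] *= sy   # fy
    K_new[0, 2] *= sx   # cx
    K_new[1, 2] *= sy   # cy
    return K_new
```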

Metrics: We follow recent works [15, 43, 50] and choose F1-score (the harmonic mean of precision and recall) at a threshold \(\tau =0.3\) as our evaluation metric. Precision in this context is defined as the fraction of points in the predicted model within \(\tau \) distance from the ground truth points, while recall is the fraction of points in the ground truth model within \(\tau \) distance from the predicted points.

We also report the Chamfer Distance between a predicted model P and ground truth model Q, which measures the mean squared distance between the closest pairs of points \(\Lambda _{P,Q} = \{(p, \text {arg min}_{q \in Q}\Vert p -q \Vert ): p \in P\}\) in the two models:

$$\begin{aligned} \mathcal {L}_{\text {chamfer}}(P, Q) = |P|^{-1} \sum _{(p, q) \in \Lambda _{P,Q}}{||p-q||^{2}} + |Q|^{-1} \sum _{(q, p) \in \Lambda _{Q,P}}{||q-p||^{2}} \end{aligned}$$
(9)

We uniformly sample 10k points from predicted and ground truth meshes to evaluate these metrics. Following [12, 15], we rescale the 3D models so that the longest edge of the ground truth mesh bounding box has length 10.
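
The metrics can be computed from the two sampled point sets with nearest-neighbor queries; the sketch below uses SciPy KD-trees and is a simplified stand-in for the exact evaluation code.

```python
# A sketch of F1-score at threshold tau and the Chamfer Distance of Eq. (9),
# computed on pre-sampled (and pre-rescaled) point sets.
import numpy as np
from scipy.spatial import cKDTree

def f1_and_chamfer(pred_pts, gt_pts, tau=0.3):
    """pred_pts, gt_pts: Nx3 arrays of points sampled from the meshes."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest GT point per predicted point
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # nearest predicted point per GT point
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    chamfer = (d_pred_to_gt ** 2).mean() + (d_gt_to_pred ** 2).mean()   # Eq. (9)
    return f1, chamfer
```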

Fig. 4. Qualitative evaluation. From left to right: an input image, the ground truth mesh, and results from MVP2M, P2M++, MV M-RCNN, MeshMVS, and MeshMVS (RGB-D).

Results: The quantitative comparison of different learning-based 3D reconstruction baselines on our dataset is presented in Table 4. Note that both the training and testing sets contain objects from all categories; test F1-scores on individual categories as well as over all categories are reported here. Figure 4 visualizes the shapes generated by different methods for qualitative evaluation.

We can see that overall Pixel2Mesh++ performs the best (barring MeshMVS (RGB-D)). This is contrary to the results on ShapeNet reported in [43], where MeshMVS performs the best. This can be attributed to the high depth prediction error of MeshMVS (the average depth error is \(\sim \)6% of the total depth range). When the predicted depth is replaced with ground truth depth, we indeed see a significant improvement in the performance of MeshMVS, indicating that depth prediction is the main bottleneck in its performance.

Fig. 5. Qualitative evaluation. From left to right: an input image, and results from DVR, IDR, and COLMAP.

Table 5 shows the quantitative comparison between the different unsupervised, per-scene optimized baselines. Here, IDR outperforms the other two baselines, which is in line with the results presented in [55] on the DTU dataset. COLMAP performs worse than the rest because the textures on most of the objects are insufficient for dense reconstruction using Patch Match Stereo, leading to sparse and noisy results (Fig. 5).

Table 4. Quantitative comparison of state-of-the-art learning-based multi-view 3D reconstruction methods on our dataset. We report F1-score and Chamfer Distance on each semantic category as well as over all categories. The baseline MeshMVS (RGB-D) is not considered for highlighting the best performance since it uses ground truth depth as additional input.
Table 5. Quantitative comparison of state-of-the-art NeRF-based 3D reconstruction methods along with COLMAP on our dataset. We report F1-score and Chamfer Distance on each semantic category as well as over all categories.

Single Category Training: We compare the difference in performance when each category is trained and evaluated separately. In this case, there is a separate set of model parameters for each category. For these experiments, we sample 200 different 3-view image tuples as inputs from each scene. Table 6 shows the results for the MV M-RCNN baseline when each category is trained separately versus when all are trained together. We see that the performance is generally better when using all categories, showing that 3D reconstruction models can learn to generalize over multiple categories in our dataset.

Table 6. Single Vs All Category Training evaluation on MV M-RCNN baseline.

7 Discussion

The results presented in Tables 4 and 6, as well as the qualitative evaluation in Fig. 4, show that the problem of generalizable multi-view 3D reconstruction is far from solved. While works like Pixel2Mesh++, Mesh R-CNN and MeshMVS have offered promising avenues for advancing the state-of-the-art, more research is still needed in this direction. Table 5 and Fig. 5 show the limitations of traditional 3D reconstruction methods like COLMAP. While more recent NeRF-based methods like DVR and IDR generate high quality reconstructions, their running time is generally on the order of 10 h and they require a larger number of input images (64 in our case). We hope that our dataset can serve as a challenging benchmark for these problems, aiding and inspiring future work in 3D shape generation.

8 Conclusion

We present a large scale dataset of 3D models and their real world multi-view images. Two methods were developed for annotating the dataset, which provide high accuracy camera and object poses. Experiments show that our dataset can be used for training and evaluating multi-view 3D reconstruction methods, something that has been lacking in existing real world datasets.