
1 Introduction

Simultaneous localization and mapping (SLAM) is an essential component of robot navigation, virtual reality (VR), and augmented reality (AR) systems. Various datasets and benchmarks have been proposed for SLAM [11, 35, 39] and related problems, including visual-inertial odometry [6, 30], camera re-localization [15, 29, 32], and depth estimation [21, 33]. Currently, only a few building-scale SLAM datasets [28] include ground truth camera poses and dense 3D geometry. Such datasets enable, for example, the evaluation of algorithms needed in large-scale AR applications.

The lack of building-scale SLAM datasets is explained by the difficulty of acquiring ground truth data. Some works have utilized a high-end LiDAR for obtaining the 3D geometry of the environment [2, 4, 26, 28]. Ground truth camera poses may be acquired using a motion capture (MoCap) system when the environment is small enough [35, 40]. The high cost of equipment, complex sensor setup, and slow capturing process make these approaches inconvenient and ill-suited for crowd-sourced data collection.

An alternative is to perform 3D reconstruction using a monocular, stereo, or depth camera. Consumer RGB-D cameras, in particular, are interesting because of their relatively good accuracy, fast acquisition speed, low cost, and effectiveness in textureless environments. RGB-D cameras have been used to collect datasets for depth estimation [21, 33], scene understanding [8], and camera re-localization [32, 38], among other tasks. The problem is that existing RGB-D reconstruction systems (e.g. [5, 9, 22]) are limited to room-scale and apartment-scale environments.

Synthetic SLAM datasets that include perfect ground truth have also been proposed [20, 27, 39]. The challenge is that data such as time-of-flight (ToF) depth maps and infrared images are difficult to synthesize realistically. Consequently, training and evaluation done using synthetic data may not reflect an algorithm's real-world performance. To address this domain gap, algorithms are often fine-tuned using real data.

We propose a framework for creating building-scale 3D reconstructions using a consumer depth camera (Azure Kinect). Unlike existing approaches, we register color images and depth maps using a color-to-depth (C2D) strategy. This allows us to directly utilize the raw depth maps captured by the wide field-of-view (FoV) infrared camera. Coupled with an open-source SLAM library [19], the framework is used to acquire a building-scale 3D vision dataset (BS3D) that is considerably larger than similar datasets, as shown in Fig. 1. The BS3D dataset includes 392k synchronized color images, depth maps and infrared images, inertial measurements, camera poses, enhanced depth maps, surface reconstructions, and laser scans. Our framework will be released to the public to enable fast, easy, and affordable indoor 3D reconstruction.

Fig. 1. Building-scale 3D reconstruction (4300 m\(^2\)) obtained using an RGB-D camera and the proposed framework. The magnified area (90 m\(^2\)) is larger than any reconstruction in the ScanNet dataset [8].

2 Related Work

This section introduces commonly used RGB-D SLAM datasets and corresponding data acquisition processes. A summary of the datasets is provided in Table 1. As there exist countless SLAM datasets, the scope is restricted to real-world indoor scenarios. We leave out datasets focusing on aerial scenarios (e.g. EuRoC MAV [2]) and autonomous driving (e.g. KITTI [11]). We also omit RGB-D datasets captured with a stationary scanner (e.g. Matterport3D [4]) as they cannot be used for SLAM evaluation. Synthetic datasets, such as SceneNet RGB-D [20], TartanAir [39], and ICL [27] are also omitted.

The ADVIO [6] dataset is a realistic visual-inertial odometry benchmark that includes building-scale environments. The ground truth trajectory is computed using an inertial navigation system (INS) together with manual location fixes. The main limitation of the dataset is that it does not come with ground truth 3D geometry. LaMAR [28] is a large-scale SLAM benchmark that utilizes high-end mapping platforms (NavVis M6 trolley and VLX backpack) for ground truth generation. Although the capturing setup includes a variety of devices (e.g. HoloLens2 and iPad Pro), it does not include a dedicated RGB-D camera.

OpenLORIS-Scene [31] focuses on the lifelong SLAM scenario where environments are dynamic and changing, similar to LaMAR [28]. The data is collected over an extended period of time using wheeled robots equipped with various sensors, including RGB-D, stereo, IMU, wheel odometry, and LiDAR. Ground truth poses are acquired using an external motion capture (MoCap) system, or with a 2D laser SLAM method. The dataset is not ideal for handheld SLAM evaluation because of the limited motion patterns of a ground robot.

TUM RGB-D SLAM [35] is one of the most popular SLAM datasets. The RGB-D images are acquired using the Microsoft Kinect v1 consumer depth camera. The ground truth trajectory is incomplete because the MoCap system covers only a small part of the environment. CoRBS [40] consists of four room-scale environments. It also utilizes MoCap for acquiring ground truth trajectories for the Microsoft Kinect v2. Unlike [35], CoRBS provides ground truth 3D geometry acquired using a laser scanner. The data also includes infrared images, but not inertial measurements, unlike our dataset.

7-Scenes [32] and 12-Scenes [38] are commonly used for evaluating camera localization. 7-Scenes includes seven scenes captured using the Kinect v1. KinectFusion [22] is used to obtain ground truth poses and dense 3D models from the RGB-D images. 12-Scenes consists of multiple rooms captured using a Structure.io depth sensor and an iPad color camera. The reconstructions are larger compared to 7-Scenes, about 37 m\(^3\) on average. They are acquired using the VoxelHashing framework [23], an alternative to KinectFusion with better scalability.

ScanNet [8] is an RGB-D dataset containing 2.5M views acquired in 707 distinct spaces. It includes estimated calibration parameters, camera poses, 3D surface reconstructions, textured meshes, and object-level semantic segmentations. The hardware consists of a Structure.io depth sensor attached to a tablet computer. Pose estimation is done using BundleFusion [9], after which volumetric integration is performed through VoxelHashing [23].

Sun3D [43] is a large RGB-D database with camera poses, point clouds, object labels, and refined depth maps. The reconstruction process is based on structure from motion (SfM), where manual object annotations are utilized to reduce drift and loop-closure failures. Refined depth maps are obtained via volumetric fusion similar to KinectFusion [22]. We emphasize that the ScanNet [8] and Sun3D [43] reconstructions are considerably smaller and of lower quality than those provided in our dataset. Unlike [28, 31, 35], our system does not require a complex and expensive capturing setup, nor manual annotation as in [6, 43].

Table 1. List of indoor RGB-D SLAM datasets. The BS3D acquisition setup does not require high-end LiDARs [28, 31, 40], MoCap systems [31, 36, 40], or manual annotation [6, 43]. BS3D is building-scale, unlike [8, 32, 36, 38, 40, 43]. Note that ADVIO [6] and LaMAR [28] do not have a dedicated depth camera.

3 Reconstruction Framework

In this section, we introduce the RGB-D reconstruction framework shown in Fig. 2. The framework produces accurate 3D reconstructions of building-scale environments using low-cost hardware. The system is fully automatic and robust against poor lighting conditions and fast motions. Color images are only used for loop closure detection as they are susceptible to motion blur and rolling shutter distortion. Raw depth maps enable accurate odometry and the refinement of loop closure transformations.

Fig. 2. Overview of the RGB-D reconstruction system.

3.1 Hardware

Data is captured using the Azure Kinect depth camera, which is well-suited for crowd-sourcing due to its popularity and affordability. We capture synchronized depth, color, and infrared images at 30 Hz using the official recorder application running on a laptop computer. We use the wide FoV mode of the infrared camera with 2\(\,\times \,\)2 binning to extend the Z-range. The resolution of the raw depth maps and IR images is 512\(\,\times \,\)512 pixels. Auto-exposure is enabled when capturing color images at a resolution of 720\(\,\times \,\)1280 pixels. We also record accelerometer and gyroscope readings at 1.6 kHz.
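For reference, a comparable capture configuration can be set up programmatically. The sketch below uses the third-party pyk4a Python bindings rather than the official recorder application used in this work; the exact enum and attribute names are assumptions and may differ between binding versions.

```python
# Minimal capture sketch using the third-party pyk4a bindings (assumed API).
from pyk4a import PyK4A, Config, ColorResolution, DepthMode, FPS

k4a = PyK4A(Config(
    color_resolution=ColorResolution.RES_720P,   # 720x1280 color images
    depth_mode=DepthMode.WFOV_2X2BINNED,         # wide FoV with 2x2 binning -> 512x512
    camera_fps=FPS.FPS_30,                       # 30 Hz
    synchronized_images_only=True,               # depth/color/IR from the same capture
))
k4a.start()

capture = k4a.get_capture()
depth = capture.depth            # 512x512 uint16, millimetres
infrared = capture.ir            # 512x512 uint16 active IR
color = capture.color            # 720x1280 BGRA
imu = k4a.get_imu_sample()       # accelerometer + gyroscope sample (1.6 kHz stream)
```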

3.2 Color-to-Depth Alignment

Most RGB-D reconstruction systems expect that color images and depth maps are spatially and temporally aligned. Modern depth cameras typically produce temporally synchronized images, so the main concern is spatial alignment. Conventionally, raw depth maps are transformed to the coordinate system of the color camera, which we refer to as depth-to-color (D2C) alignment.

In the case of the Azure Kinect, the color camera’s FoV (90\(\,\times \,\)59\(^\circ \)) is much narrower than that of the infrared camera (120\(\,\times \,\)120\(^\circ \)). Thus, D2C alignment would not take advantage of the infrared camera’s wide FoV because the depth maps would be heavily cropped. Moreover, D2C alignment may introduce artefacts into the raw depth maps.

We propose an alternative called color-to-depth (C2D) alignment, where the color images are transformed instead. In the experiments, we show that this drastically improves the quality of the reconstructions. The main challenge of C2D is that it requires a fully dense depth map. Fortunately, a reasonably good alignment can be achieved even with a low-quality depth map. This is because the baseline between the cameras is narrow and missing depth values often appear in areas that are far from the camera.

For the C2D alignment, we first perform depth inpainting using linear interpolation. Then, the color image is transformed to the raw depth frame. To preserve as much color information as possible, the output resolution (1024\(\,\times \,\)1024 pixels) is higher than that of the raw depth maps. After that, holes in the color image caused by occlusions are inpainted using the OpenCV implementation of [37]. We note that minor artefacts in the aligned color images have little impact on the SIFT-based loop closure detection.
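To make the C2D step concrete, the following NumPy/OpenCV sketch warps a color image into the wide-FoV depth frame. It is a simplified illustration rather than the exact implementation: depth holes are filled with OpenCV's Telea inpainting instead of linear interpolation, occlusions are not explicitly handled, and undistorted pinhole intrinsics/extrinsics are assumed to be given.

```python
import cv2
import numpy as np

def align_color_to_depth(color, depth_raw, K_d, K_c, T_d2c, out_size=1024):
    """Warp a color image into the wide-FoV depth frame (C2D alignment sketch).

    color:     HxWx3 uint8 color image (undistorted pinhole)
    depth_raw: 512x512 uint16 raw depth map in millimetres
    K_d, K_c:  3x3 float intrinsics of the depth and color cameras
    T_d2c:     4x4 rigid transform from the depth to the color camera
    """
    # 1) Fill missing depth so that every pixel can be re-projected
    #    (Telea inpainting here; the framework uses linear interpolation).
    holes = (depth_raw == 0).astype(np.uint8)
    depth_filled = cv2.inpaint(depth_raw.astype(np.float32), holes, 3, cv2.INPAINT_TELEA)

    # 2) Upscale the depth-frame grid to preserve more color detail.
    scale = out_size / depth_raw.shape[0]
    K_up = K_d.copy()
    K_up[:2] *= scale
    depth_up = cv2.resize(depth_filled, (out_size, out_size), interpolation=cv2.INTER_NEAREST)

    # 3) Back-project every depth pixel to a 3D point in the depth frame (metres).
    u, v = np.meshgrid(np.arange(out_size), np.arange(out_size))
    z = depth_up / 1000.0
    x = (u - K_up[0, 2]) * z / K_up[0, 0]
    y = (v - K_up[1, 2]) * z / K_up[1, 1]
    pts = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)

    # 4) Transform the points into the color frame and project them.
    pts_c = (T_d2c @ pts.T).T[:, :3]
    z_c = np.maximum(pts_c[:, 2], 1e-6)
    map_x = (K_c[0, 0] * pts_c[:, 0] / z_c + K_c[0, 2]).reshape(out_size, out_size).astype(np.float32)
    map_y = (K_c[1, 1] * pts_c[:, 1] / z_c + K_c[1, 2]).reshape(out_size, out_size).astype(np.float32)

    # 5) Sample the color image and inpaint the remaining holes
    #    (occlusions and pixels outside the narrow color FoV).
    aligned = cv2.remap(color, map_x, map_y, cv2.INTER_LINEAR,
                        borderMode=cv2.BORDER_CONSTANT, borderValue=0)
    color_holes = (aligned.sum(axis=2) == 0).astype(np.uint8)
    return cv2.inpaint(aligned, color_holes, 3, cv2.INPAINT_TELEA)
```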

3.3 RGB-D Mapping

We process the RGB-D sequences using the open-source SLAM library RTAB-Map [19]. Odometry is computed from the raw depth maps using the point-to-plane variant of the iterative closest point (ICP) algorithm [25]. We use the scan-to-map odometry strategy [19], where incoming frames are registered against a point cloud map created from past keyframes. The wide FoV ensures that ICP odometry rarely fails, but in case it does, a new map is initialized.
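The odometry itself is computed by RTAB-Map; the Open3D sketch below only illustrates the underlying point-to-plane ICP registration between depth frames (frame-to-frame here, whereas the framework registers against a local map of past keyframes). The voxel size and correspondence distance are illustrative values.

```python
import numpy as np
import open3d as o3d

def icp_step(depth_prev, depth_cur, intrinsic, init=np.eye(4)):
    """Point-to-plane ICP between two raw depth maps (Open3D illustration)."""
    def to_cloud(depth_mm):
        img = o3d.geometry.Image(depth_mm.astype(np.uint16))
        pcd = o3d.geometry.PointCloud.create_from_depth_image(
            img, intrinsic, depth_scale=1000.0, depth_trunc=8.0)
        pcd = pcd.voxel_down_sample(voxel_size=0.05)
        pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.2, max_nn=30))
        return pcd

    source, target = to_cloud(depth_cur), to_cloud(depth_prev)
    result = o3d.pipelines.registration.registration_icp(
        source, target, 0.1, init,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return result.transformation  # 4x4 motion of the current frame w.r.t. the previous one

# Example usage with the 512x512 wide-FoV depth camera (fx, fy, cx, cy from calibration):
# intrinsic = o3d.camera.PinholeCameraIntrinsic(512, 512, fx, fy, cx, cy)
```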

Loop closure detection is needed for drift correction and for merging individual maps. For this purpose, SIFT features are extracted from the valid area of the aligned color images. Loop closures are detected using the bag-of-words approach [18], and the transformations are estimated using Perspective-n-Point (PnP) RANSAC and refined using ICP [25]. Graph optimization is done using the GTSAM library [10] and the Gauss-Newton algorithm.
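A simplified sketch of the loop closure transformation estimation is given below: SIFT correspondences between the query frame and a matched map frame are turned into 2D-3D correspondences using the map frame's raw depth, and the relative pose is estimated with PnP RANSAC (OpenCV). The thresholds are illustrative, and the subsequent ICP refinement and graph optimization are left to RTAB-Map and GTSAM.

```python
import cv2
import numpy as np

def loop_closure_pose(pts3d_map, pts2d_query, K, min_inliers=20):
    """Estimate a loop-closure transform from 2D-3D SIFT correspondences.

    pts3d_map:   Nx3 3D points (matched keypoints of the map frame,
                 back-projected using its raw depth map and pose)
    pts2d_query: Nx2 pixel coordinates of the matching keypoints in the query frame
    K:           3x3 intrinsics of the aligned color camera
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d_map.astype(np.float32), pts2d_query.astype(np.float32), K, None,
        reprojectionError=4.0, iterationsCount=500)
    if not ok or inliers is None or len(inliers) < min_inliers:
        return None  # reject weak loop closure candidates
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T  # pose of the map points in the query camera frame (refined with ICP in practice)
```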

RTAB-Map supports multi-session mapping, which is a necessary feature when reconstructing building-scale environments. It is not practical to collect hours of data in a single recording. Furthermore, the ability to later update and expand the map is useful. In practice, individual sequences are first processed separately, followed by multi-session mapping. The sessions are merged by finding loop closures and performing graph optimization. The input is a sequence of keyframes along with odometry poses and SIFT features computed during single-session mapping. The sessions are processed in an order such that there is at least some overlap between the current session and the global map built so far.

3.4 Surface Reconstruction

It is often useful to have a 3D surface representation of the environment. There exist many classical [14, 22] and learning-based [1, 41] surface reconstruction approaches. Methods that utilize deep neural networks, such as NeuralFusion [41], have produced impressive results on the task of depth map fusion. Neural radiance fields (NeRFs) have also been adapted to RGB-D imagery [1], showing good performance. We did not use learning-based approaches in this work because they are typically limited to small scenes, at least for the time being. Moreover, scene-specific learning [1] takes several hours even with powerful hardware.

Surface reconstruction is done in segments due to the large scale of the environment and the vast number of frames. To that end, we first create a point cloud from downsampled raw depth maps. Every point includes a view index along with its 3D coordinates. The point cloud is partitioned into manageable segments using the K-means algorithm. A mesh is created for each segment using the scalable TSDF fusion implementation [46], which is based on [7, 22] and uses a hierarchical hashing structure to support large scenes.
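The meshing of each segment can be sketched with Open3D's ScalableTSDFVolume, a functionally similar alternative to the implementation of [46] used in this work; the voxel size, truncation distance, and depth truncation below are illustrative values.

```python
import open3d as o3d

def fuse_segment(frames, intrinsic):
    """Fuse one K-means segment of frames into a triangle mesh via scalable TSDF fusion.

    frames: list of (color_path, depth_path, extrinsic), where extrinsic is the
            4x4 world-to-camera matrix of the frame
    """
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=0.02,                                           # 2 cm voxels
        sdf_trunc=0.08,                                              # truncation distance (m)
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

    for color_path, depth_path, extrinsic in frames:
        color = o3d.io.read_image(color_path)
        depth = o3d.io.read_image(depth_path)
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            color, depth, depth_scale=1000.0, depth_trunc=6.0,
            convert_rgb_to_intensity=False)
        volume.integrate(rgbd, intrinsic, extrinsic)

    return volume.extract_triangle_mesh()
```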

4 BS3D Dataset

The BS3D dataset was collected at the university campus using the Azure Kinect (Sect. 3.1). Figure 3 shows example frames from the dataset. The collection was done in multiple sessions due to the large scale of the environment. The recordings were processed using the framework described in Sect. 3.

Fig. 3. Example frames from the dataset. Environments are diverse and challenging, including cafeterias, stairs, study areas, corridors, and lobbies.

4.1 Dataset Features

The reconstruction shown in Fig. 1 consists of 47 overlapping recording sessions. An additional 14 sessions, including 3D laser scans, were recorded for evaluation purposes. Most sessions begin and end at the same location to encourage loop closure detection. The total duration of the sessions is 3 h 38 min, and the combined trajectory length is 6.4 km. The reconstructed floor area is approximately 4300 m\(^2\).

The dataset consists of 392k frames, including color images, raw depth maps, and infrared images. Color images and depth maps are provided in both coordinate frames (color and infrared camera). The images have been undistorted for convenience, but the original recordings are also included. We provide camera poses in a global reference frame for every image. The data also includes inertial measurements, as well as enhanced depth maps and surface normals rendered from the mesh, as visualized in Fig. 4.

Fig. 4. The BS3D dataset includes color and infrared images, depth maps, IMU data, camera parameters, and surface reconstructions. Enhanced depth maps and surface normals are rendered from the mesh.

4.2 Laser Scan

We utilize the FARO 3D X 130 laser scanner for acquiring ground truth 3D geometry. The scanned area includes a lobby and corridors of different sizes (800 m\(^2\)). The 28 individual scans were registered using the SCENE software that comes with the laser scanner. Noticeable artefacts, e.g. those caused by mirrors, were manually removed. The laser scan is used to evaluate the reconstruction framework in Sect. 5. However, this data also enables, for example, training and evaluation of RGB-D surface reconstruction algorithms.

5 Experiments

We compare our framework with state-of-the-art RGB-D reconstruction methods [3, 5, 9]. The value of the BS3D dataset is demonstrated by training a recent monocular depth estimation model [44]. We also benchmark visual-inertial odometry approaches [3, 12, 34] using either color or infrared images to further highlight the unique aspects of the BS3D dataset.

5.1 Reconstruction Framework

In this experiment, we compare the framework against Redwood [5], BundleFusion [9], and ORB-SLAM3 [3]. The RGB-D images are provided for [3, 5, 9] in the coordinate frame of the color camera. Given the estimated camera poses, we create a point cloud and compare it to the laser scan (Sect. 4.2). The evaluation metrics are the overlap of the point clouds and the RMSE of inlier correspondences. Before comparison, we create uniformly sampled point clouds using voxel downsampling (1 cm\(^3\) voxels), which computes the centroid of the points in each voxel. The overlap is defined as the ratio of inlier correspondences to the number of ground truth points. A 3D point is considered an inlier if its distance to the closest ground truth point is below a threshold \(\gamma \).
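The metric computation can be summarized by the following Open3D sketch, which follows the definitions above (a sketch, not the exact evaluation script).

```python
import numpy as np
import open3d as o3d

def overlap_and_rmse(pcd_est, pcd_gt, gamma=0.05, voxel=0.01):
    """Overlap (%) and inlier RMSE of a reconstruction against the laser scan.

    gamma: inlier distance threshold in metres
    voxel: edge length of the 1 cm^3 voxel grid used for uniform resampling
    """
    est = pcd_est.voxel_down_sample(voxel)
    gt = pcd_gt.voxel_down_sample(voxel)

    # Distance from every reconstructed point to its closest ground truth point.
    dists = np.asarray(est.compute_point_cloud_distance(gt))
    inlier = dists < gamma

    overlap = 100.0 * inlier.sum() / len(gt.points)   # inlier correspondences / ground truth points
    rmse = float(np.sqrt(np.mean(dists[inlier] ** 2)))
    return overlap, rmse
```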

Table 2 shows the results for environments of different sizes. All methods are able to reconstruct the small environment (35 m\(^2\)) consisting of 2.8k frames. The differences between the methods become more evident when reconstructing the medium-size environment (160 m\(^2\)) consisting of 7.3k frames. BundleFusion [9] only produces a partial reconstruction because of odometry failures. The proposed approach gives the most accurate reconstructions as visualized in Fig. 5. Note that it is not possible to achieve 100% overlap because the depth camera does not observe all parts of the ground truth.

The largest environment (350 m\(^2\)) consists of 24k frames acquired in four sessions. Redwood [5] does not scale to input sequences this long. ORB-SLAM3 [3] frequently loses odometry in open spaces, which leads to incomplete and less accurate reconstructions. Our method suffers from the same problem when C2D is disabled. The unreliable odometry is likely due to the color camera’s limited FoV, rolling shutter distortion, and motion blur. The C2D alignment improves the accuracy and robustness of ICP-based odometry and loop closures. Without C2D, the frequent odometry failures result in disconnected maps and noticeable drift. We note that the reconstruction in Fig. 1 was computed from \(\sim \)300k frames, which is far more than [3, 5, 9] can handle.

Table 2. Comparison of RGB-D reconstruction methods in small, medium and large-scale environments (from top to bottom). Overlap of the point clouds and inlier RMSE computed for distance thresholds \(\gamma \) (mm). Some methods only work in small and/or medium scale environments.
Fig. 5. Reconstructions obtained using Redwood [5], ORB-SLAM3 [3], and the proposed method. Colors depict errors (distance to the closest ground truth point).

5.2 Depth Estimation

We investigate whether the BS3D dataset can be used to train better models for monocular depth estimation. For this experiment, we use the state-of-the-art LeReS model [44] based on ResNet50. The original model was trained using 354k samples taken from various datasets [13, 16, 24, 42, 45]. We finetune the model using 16.5k samples from BS3D. We set the learning rate to 2e−5 and train for only 4 epochs to avoid overfitting. Other training details, including the loss functions, are the same as in [44].
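In outline, the finetuning follows a standard supervised loop such as the hypothetical PyTorch sketch below; `model`, `train_loader`, and `leres_loss` are placeholders for the LeReS network, a BS3D dataloader, and the original LeReS loss functions, and the optimizer choice is an assumption (only the learning rate and number of epochs above are taken from our setup).

```python
import torch

# Hypothetical finetuning loop (model, train_loader, and leres_loss are placeholders).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model.train()
for epoch in range(4):                      # only 4 epochs to avoid overfitting
    for rgb, depth_gt in train_loader:      # 16.5k BS3D training samples
        pred = model(rgb.cuda())
        loss = leres_loss(pred, depth_gt.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```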

For testing, we use NYUD-v2 [21] and iBims-1 [17], which are not seen during training. We also evaluate on BS3D by sampling 535 images from an unseen part of the building. Table 3 shows that finetuning improves the performance on iBims-1 and BS3D. The finetuned model performs marginally worse on NYUD-v2, which is not surprising considering that NYUD-v2 mainly contains room-scale scenes that are not present in BS3D. The qualitative comparison in Fig. 6 also shows a clear improvement over the pretrained model on iBims-1, which contains both small and large scenes. The model trained only on BS3D cannot compete with the other models, except on BS3D itself, where its performance is surprisingly good. Its weaker performance on the other datasets is expected given the small training set.

Table 3. Monocular depth estimation using LeReS [44] trained from scratch on BS3D, the pretrained model, and the finetuned model. NYUD-v2 [21], iBims-1 [17], and BS3D are used for testing.
Fig. 6. Comparison of the pretrained and finetuned (BS3D) monocular depth estimation model LeReS [44] on the independent iBims-1 [17] dataset, unseen during training.

5.3 Visual-Inertial Odometry

The BS3D dataset includes active infrared images along with color images and IMU data. This opens up interesting possibilities, for example, comparing color and infrared as inputs for visual-inertial odometry. Infrared-inertial odometry is attractive in the sense that it does not require external light, meaning it works even in completely dark environments.

We evaluate OpenVINS [12], ORB-SLAM3 [3], and DM-VIO [34] using color-inertial and infrared-inertial inputs. Note that ORB-SLAM3 has an unfair advantage because it has a loop closure detector that cannot be disabled. In the case of infrared images, we apply a power-law transformation (\(\overline{I}=0.04 \cdot I^{0.6}\)) to increase brightness. As supported by [34], we provide a mask of valid pixels to ignore the black areas near the edges of the infrared images. We adjust the parameters related to feature detection when using infrared images with [3, 12]. We use the standard error metrics, namely the absolute trajectory error (ATE) and the relative pose error (RPE), which measures drift per second. The methods are evaluated 5 times on each of the 10 sequences (Table 4).
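The brightness adjustment for the infrared images is a simple per-pixel power-law (gamma) mapping, sketched below; the output range convention (clipping to 8 bits) is an assumption, as only the transform itself is specified above.

```python
import numpy as np

def enhance_infrared(ir):
    """Apply the power-law transform I_out = 0.04 * I^0.6 to a 16-bit IR image."""
    out = 0.04 * ir.astype(np.float32) ** 0.6
    # Assumed 8-bit output convention for the downstream odometry methods.
    return np.clip(out, 0, 255).astype(np.uint8)
```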

From the results in Table 5, we can see that ORB-SLAM3 has the lowest ATE in color-inertial odometry, mainly because of its loop closure detection. In most cases, ORB-SLAM3 and OpenVINS fail to initialize when using infrared images. We conclude that off-the-shelf feature detectors (FAST and ORB) are quite poor at detecting good features in infrared images. Interestingly, DM-VIO performs better when using infrared images instead of color, which is likely due to the infrared camera’s global shutter and wider FoV. This result reveals the great potential of active infrared images for visual-inertial odometry and the need for new research.

Table 4. Evaluation sequences used in the visual-inertial odometry experiment. The last column shows whether the camera returns to the starting point (a chance for loop closure).
Table 5. Comparison of visual-inertial odometry methods using color-inertial and infrared-inertial inputs. Average absolute trajectory error (ATE) and relative pose error (RPE). The last column shows the percentage of successful runs.

6 Conclusion

We presented a framework for acquiring high-quality 3D reconstructions using a consumer depth camera. The ability to produce building-scale reconstructions is a significant improvement over existing methods, which are limited to smaller environments such as rooms or apartments. The proposed C2D alignment enables the use of raw depth maps, resulting in more accurate 3D reconstructions. Our approach is fast, easy to use, and requires no expensive hardware, making it ideal for crowd-sourced data collection. We acquired the building-scale BS3D dataset and demonstrated its value for monocular depth estimation. BS3D is also unique because it includes active infrared images, which are often missing from other datasets. We employed the infrared images for visual-inertial odometry, revealing a promising new research direction.