1 Introduction

Real-time online 3D reconstruction has become increasingly important with the rise of applications like mixed reality, autonomous driving, robotics, or live 3D content creation via scanning. The majority of 3D reconstruction hardware platforms like phones, tablets, or mixed reality headsets contain a multitude of sensors, but few algorithms leverage them jointly to increase their accuracy, robustness, and reliability. For instance, the HoloLens2 has four tracking cameras and a depth camera for mapping, yet its depth camera is not used for tracking, nor are its tracking cameras used for mapping. Fusing the data from multiple sensors is challenging since different sensors typically operate in different domains and have diverse value ranges as well as noise and outlier statistics. This diversity is, however, exactly what motivates sensor fusion. For example, RGB stereo cameras typically have a larger field of view and higher resolution than time-of-flight (ToF) cameras, but struggle on homogeneously textured surfaces. ToF cameras perform well regardless of texture, but show performance drops around edges. Fig. 1 shows the online fusion result of a ToF camera and a multi-view stereo (MVS) depth sensor. Both traditional and recent learning-based techniques, such as TSDF Fusion [11] and RoutedFusion [54] respectively, exhibit a high degree of noise and outliers when fusing multi-sensor depth. Although other recent works tackle depth map fusion [23, 50, 55, 56] with learning techniques, there is as yet no work that considers multiple sensors for online dense reconstruction.

Fig. 1. Online multi-sensor depth map fusion. We fuse depth streams from a time-of-flight (ToF) camera and multi-view stereo (MVS) depth. Compared to competitive depth fusion methods such as TSDF Fusion and RoutedFusion, our learning-based approach can handle multiple depth sensors and significantly reduces the amount of outliers without loss of completeness.

In this paper, we present an approach for sensor fusion (SenFuNet) which jointly learns (1) the iterative online fusion of depth maps from a single sensor and (2) the effective fusion of depth data from multiple different sensors. During training, our method learns relevant sensor properties which impact the reconstruction accuracy to locally emphasize the better sensor for particular input and geometry configurations (see Fig. 1). We demonstrate with multiple sensor combinations that the learned sensor weighting is generic and can also be used as an expert system, e.g. for fusing the results of different stereo methods. In this case, our method predicts which algorithm performs better on which part of the scene. Since our approach handles time-asynchronous sensor inputs, it is also applicable to collaborative multi-agent reconstruction. Our contributions are:

  • Our approach learns location-dependent fusion weights for the individual sensor contributions according to learned sensor statistics. For various sensor combinations, our method produces multi-sensor results that are consistently better than those obtained from the individual sensors.

  • Our pipeline is trained end-to-end in an online manner, is light-weight and real-time capable, and generalizes well even with small amounts of training data.

  • In contrast to early fusion approaches, which directly fuse depth values and thus generally assume a time-synchronized sensor setup, our approach is flexible and can fuse the recovered scene reconstructions from asynchronous sensors. Our system is therefore more robust than early fusion to sensor differences such as sampling frequency, pose, and resolution.

2 Related Work

In this section, we discuss dense online 3D reconstruction, multi-sensor depth fusion and multi-sensor dense 3D reconstruction.

Dense Online 3D Scene Reconstruction. The foundation for many volumetric online 3D reconstruction methods via truncated signed distance functions (TSDF) was laid by Curless and Levoy [11]. Popular extensions of this seminal work are KinectFusion [24] and scalable generalizations with voxel hashing [25, 38, 39], octrees [48], or increased pose robustness via sparse image features [5]. Further extensions add tracking for Simultaneous Localization and Mapping (SLAM) [37, 47, 50, 60] and potentially also handle loop closures, e.g. BundleFusion [12]. To account for greater depth noise, RoutedFusion [54] learns online updates of the volumetric grid. NeuralFusion [55] extends this idea by additionally learning the scene representation, which significantly improves robustness to outliers. DI-Fusion [23], similarly to [55], learns the scene representation, but additionally decodes a confidence of the signed distance per voxel. Continual Neural Mapping [56] learns a continuous scene representation through a neural network from sequential depth maps. Several recent works do not require depth input and instead perform online reconstruction from RGB cameras, such as Atlas [35], VolumeFusion [9], TransformerFusion [4] and NeuralRecon [51]. None of these approaches considers multiple sensors, and extending them to sensor-aware data fusion is often by no means straightforward. Nevertheless, by treating all sensors equally, they can be used as baseline methods.

The majority of the aforementioned traditional methods do not properly account for varying noise and outlier levels for different depth values, which are better handled by probabilistic fusion methods [15, 16, 17, 28]. Cao et al. [7] introduced a probabilistic approach based on a Gaussian mixture model into a surfel-based reconstruction framework to account for uncertainties in the observed depth. For a recent survey on online RGB-D 3D scene reconstruction, readers are referred to [61]. Overall, none of the state-of-the-art methods for dense online 3D scene reconstruction considers multiple sensors.

Multi-sensor Depth Fusion. The task of fusing depth maps from diverse sensors has been studied extensively. Many works study the fusion of a specific set of sensors, for example, RGB stereo and time-of-flight (ToF) [1, 2, 10, 13, 14, 18, 33, 52], RGB stereo and Lidar [31], RGB and Lidar [40, 41, 44], RGB stereo and monocular depth [34], and the fusion of multiple RGB stereo algorithms [42]. All these methods are tailored to a specific set of sensors, while we do not enforce such a limitation. Few works study the fusion of arbitrary depth sensors [43]. Contrary to our method, all methods performing depth map fusion assume time-synchronized sensors, which is hard, if not impossible, to achieve with realistic multi-sensor equipment.

Multi-sensor Dense 3D Reconstruction. Some works consider the problem of offline multi-sensor dense 3D reconstruction, for example, depth map fusion for semantic 3D reconstruction [45], combining multi-view stereo with a ToF sensor in a probabilistic framework [27], the combination of a depth sensor with photometric stereo [6], and large scene reconstruction using unsynchronized RGBD cameras mounted on an indoor robot [57]. These offline methods do not address the online problem setting that we are concerned with. Some works use sensor fusion to achieve robust pose estimation in an online setting [20, 58]; in contrast to our method, these works do not leverage sensor fusion for mapping. Ali et al. [3] present an online framework which is perhaps most closely related to our work. They take Lidar and stereo depth maps as input and fuse the TSDF signals of both sensors with a linear average before updating the global grid using TSDF Fusion [11]. To reduce noise further, they optimize a least-squares problem which encourages surface smoothing. Contrary to our method, no learning is used and their system is only designed to fuse stereo depth with Lidar.

Fig. 2. SenFuNet Architecture. Given a depth stream \(D^i_t\), with known camera poses, our method fuses each frame at time t from sensor i into global sensor-specific shape \(S^i_t\) and feature \(F^i_t\) grids. The Shape Integration Module fuses the frames into \(S^i_t = \{V^i_t, W^i_t\}\) consisting of a TSDF grid \(V^i_t\) and a weight counter grid \(W^i_t\). In parallel, the Feature Integration Layer extracts features from the depth maps using a 2D Feature Network \(\mathcal {F}^i\) and integrates them into the feature grid \(F^i_t\). Next, \(S^i_t\) and \(F^i_t\) are combined and decoded through a 3D Weighting Network \(\mathcal {G}\) into a sensor weighting \(\alpha \in [0, 1]\). Together with \(S^i_t\) and \(\alpha \), the Fusion Module computes the fused grid \(V_t\) at each voxel location.

3 Method

Overview. Given multiple noisy depth streams \(D^i_t : \mathbb {R}^2 \rightarrow \mathbb {R}\) from different sensors with known camera calibration, i.e. extrinsics \(P^i_t \in \mathbb{S}\mathbb{E}(3)\) and intrinsics \(K^i \in \mathbb {R}^{3 \times 3}\), our method integrates each depth frame at time \(t \in \mathbb {N}\) from sensor \(i \in \{1, \ 2\}\) into a globally consistent shape grid \(S^i_t\) and feature grid \(F^i_t\). Through a series of operations, we then decode \(S^i_t\) and \(F^i_t\) into a fused TSDF grid \(V_t \in \mathbb {R}^{X \times Y \times Z}\), which can be converted into a mesh with marching cubes [29]. Our overall framework can be split into four components (see Fig. 2). First, the Shape Integration Module integrates depth frames \(D^i_t\) successively into the zero-initialized shape representation \(S^i_t = \{V^i_t, W^i_t\}\). \(S^i_t\) consists of a TSDF grid \(V^i_t \in \mathbb {R}^{X \times Y \times Z}\) and a corresponding weight grid \(W^i_t \in \mathbb {N}^{X \times Y \times Z}\), which keeps track of the number of updates to each voxel. In parallel, the Feature Integration Layer extracts features from the depth maps using a 2D feature network \(\mathcal {F}^i : D^i_t \in \mathbb {R}^{W \times H \times 1} \rightarrow f^i_t \in \mathbb {R}^{W \times H \times n}\), with n being the feature dimension. We use separate feature networks per sensor to learn sensor-specific, depth-dependent statistics such as shape and edge information. The extracted features \(f^i_t\) are then integrated into the zero-initialized feature grid \(F^i_t \in \mathbb {R}^{X \times Y \times Z \times n}\). Next, \(S^i_t\) and \(F^i_t\) are combined and decoded through a 3D Weighting Network \(\mathcal {G}\) into a location-dependent sensor weighting \(\alpha \in [0, 1]\). Together with \(S^i_t\) and \(\alpha \), the Fusion Module fuses the information into \(V_t\) at each voxel location. Key to our approach is the separation of per-sensor information into different representations along with the successive aggregation of shapes and features in the 3D domain. This strategy enables \(\mathcal {G}\) to learn a fusion strategy for the incoming multi-sensor depth stream. Our method is able to fuse the sensors in a spatially dependent manner, ranging from a smooth combination to a hard selection, as illustrated in Fig. 3. Our scheme hence avoids post-hoc outlier filtering by thresholding on the weight \(W_t^i\), which is difficult to tune and prone to reduce scene completion [54]. Another popular outlier filtering technique is free-space carving, but this can be computationally expensive and is not required by our method. Instead, we use the learned \(\alpha \) as part of an outlier filter at test time, requiring no manual tuning. Next, we describe each component in detail.

Fig. 3. Overview. Left to right: Sequential fusion of a multi-sensor noisy depth stream. Our method integrates each depth frame at time t and produces a sensor weighting which fuses the sensors in a spatially dependent manner. For example, areas in yellow denote high trust of the ToF sensor.

(a) Shape Integration Module. For each depth map \(D^i_t\) and pixel, a full perspective unprojection of the depth into the world coordinate frame yields a 3D point \({\textbf {x}}_w \in \mathbb {R}^3\). Along each ray from the camera center, centered at \({\textbf {x}}_w\), we sample T points uniformly over a predetermined distance l. The coordinates are then converted to the voxel space and a local shape grid \(S^{i, *}_{t-1}\) is extracted from \(S^i_{t-1}\) through nearest neighbor extraction. To incrementally update the local shape grid, we follow the moving average update scheme of TSDF Fusion [11]. For numerical stability, the weights are clipped at a maximum weight \(\omega _\text {max}\). For more details, see the suppl. material.
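
To make the update scheme concrete, the following numpy sketch illustrates one possible implementation of this moving-average TSDF update. Function and variable names are our own, the voxel grid origin is assumed to coincide with the world origin, and details such as the nearest-neighbor extraction of the local grid and the handling of duplicate voxel indices are simplified (the paper defers these to the supplementary material).

```python
import numpy as np

def integrate_depth(depth, K, cam_to_world, V, W,
                    voxel_size=0.01, l=0.1, T=11, w_max=500):
    """Moving-average TSDF update along each camera ray (illustrative sketch)."""
    H, W_img = depth.shape
    u, v = np.meshgrid(np.arange(W_img), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)
    d = depth.reshape(-1)
    pix, d = pix[d > 0], d[d > 0]
    rays = (np.linalg.inv(K) @ pix.T).T                   # pixel rays at unit depth
    # T samples per ray, centered on the measured surface, spread over distance l
    offsets = np.linspace(-l / 2.0, l / 2.0, T)
    depths = d[:, None] + offsets[None, :]                # (N, T) sample depths
    pts_cam = rays[:, None, :] * depths[:, :, None]       # (N, T, 3) camera coords
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    pts_w = pts_cam @ R.T + t                             # world coordinates x_w
    sdf = (d[:, None] - depths).reshape(-1)               # projective signed distance
    idx = np.round(pts_w.reshape(-1, 3) / voxel_size).astype(np.int64)
    flat = np.ravel_multi_index((idx[:, 0], idx[:, 1], idx[:, 2]),
                                V.shape, mode='clip')     # a real system masks OOB points
    # duplicate indices are not deduplicated here (last write wins)
    w_old, v_old = W.flat[flat], V.flat[flat]
    V.flat[flat] = (w_old * v_old + sdf) / (w_old + 1.0)  # TSDF Fusion moving average
    W.flat[flat] = np.minimum(w_old + 1.0, w_max)         # clip weights at omega_max
```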

(b) Feature Integration Layer. Each depth map \(D^i_t\) is passed through a 2D network \(\mathcal {F}^i\) to extract context information \(f^i_t\), which can be useful during the sensor fusion process. When fusing sensors based on the stereo matching principle, we provide the RGB frame as additional input channels to \(\mathcal {F}^i\). The network is fully convolutional and comprises 7 network blocks, each consisting of the following operations: 1) a \(3 \times 3\) convolution with zero padding 1, input channel dimension 4 and output dimension 4 (except the first block, which takes 1 channel as input when only depth is provided), 2) a \(\mathop {\textrm{tanh}}\) activation, 3) another \(3 \times 3\) convolution with zero padding 1 outputting 4 channels and 4) a \(\mathop {\textrm{tanh}}\) activation. The output of each of the first six blocks is added to the output of the next block via a residual connection. Finally, we normalize the feature vectors at each pixel location and concatenate the input depth.
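
A hypothetical PyTorch rendition of this block structure is given below; the exact residual wiring and normalization are our reading of the description, so treat it as a sketch rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureNet2D(nn.Module):
    """Per-sensor 2D feature network (sketch): 7 residual blocks of 3x3 conv + tanh."""
    def __init__(self, in_ch=1, width=4, num_blocks=7):
        super().__init__()
        def block(c_in):
            return nn.Sequential(
                nn.Conv2d(c_in, width, 3, padding=1), nn.Tanh(),
                nn.Conv2d(width, width, 3, padding=1), nn.Tanh())
        self.blocks = nn.ModuleList(
            [block(in_ch)] + [block(width) for _ in range(num_blocks - 1)])

    def forward(self, depth, rgb=None):
        # in_ch = 4 when RGB is concatenated (stereo-based sensors), else 1
        x = depth if rgb is None else torch.cat([depth, rgb], dim=1)
        x = self.blocks[0](x)
        for blk in self.blocks[1:]:
            x = blk(x) + x                       # residual: add previous block output
        x = F.normalize(x, dim=1)                # per-pixel feature normalization
        return torch.cat([x, depth], dim=1)      # append the input depth -> n channels

# f = FeatureNet2D(in_ch=1)(torch.rand(1, 1, 256, 256))   # output shape (1, 5, 256, 256)
```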

Next, we repeat the features T times along the direction of the viewing ray from the camera, \(f^i_t \xrightarrow [\text {T times}]{\text {Repeat}}f^{i, T}_t \in \mathbb {R}^{W \times H \times T \times n}\). The local feature grid \(F^{i,*}_{t-1}\) is then updated using the precomputed update indices from the Shape Integration Module with a moving average update: \(F^{i,*}_t = \frac{W^{i,*}_{t-1}F^{i,*}_{t-1} + f^{i,T}_t}{W^{i,*}_{t-1} + 1} .\) For all update locations the grid \(F^i_t\) is replaced with \(F^{i, *}_t\).
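
In code, the repeat-and-average step could look as follows; the shapes and names are hypothetical, and the update locations are assumed to be those precomputed by the Shape Integration Module.

```python
import torch

def update_local_features(F_local, W_local, feats, T=11):
    """F_local: (M, T, n) extracted feature grid entries along M rays,
    W_local: (M, T) weight counters, feats: (M, n) per-pixel 2D features."""
    feats_T = feats.unsqueeze(1).expand(-1, T, -1)    # repeat T times along the ray
    W = W_local.unsqueeze(-1).float()
    return (W * F_local + feats_T) / (W + 1.0)        # moving-average update
```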

(c) Weighting Network. The task of the weighting network \(\mathcal {G}\) is to predict the optimal fusion strategy for the surface hypotheses \(V^i_t\). The input to the network is prepared by first concatenating the features \(F^i_t\) and the \(\mathop {\textrm{tanh}}\)-transformed weight counters \(W^i_t\), and second by concatenating the resulting vectors across the sensors. Due to memory constraints, the entire scene cannot fit onto the GPU, and hence we use a sliding-window approach at test time to feed \(\mathcal {G}\) chunks of data. First, the minimum bounding grid of the measured scene (i.e. where \(W^i_t > 0\)) is extracted from the global grids. Then, the extracted grid is processed in chunks of size \(d \times d \times d\) through \(\mathcal {G} : \mathbb {R}^{d \times d \times d \times 2(n+1)} \rightarrow \alpha \in \mathbb {R}^{d \times d \times d \times 1}\) into \(\alpha \in [0, 1]\). To avoid edge effects, we use a stride of d/2 and update the central chunk of side length d/2. The architecture of \(\mathcal {G}\) combines 2 layers of 3D convolutions with kernel size 3 and replication padding 1, interleaved with \(\mathop {\textrm{ReLU}}\) activations. The first layer outputs 32 channels and the second layer 16 channels. Finally, the 16-dimensional features are decoded into the sensor weighting \(\alpha \) by a \(1\!\times \!1\!\times \!1\) convolution followed by a sigmoid activation.
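
The layer configuration translates to roughly the following PyTorch module; the sliding-window indexing at the end is only indicative of the test-time chunking with stride d/2.

```python
import torch
import torch.nn as nn

n, d = 5, 64                                   # feature dimension and chunk side length
G = nn.Sequential(                             # sketch of the Weighting Network
    nn.Conv3d(2 * (n + 1), 32, 3, padding=1, padding_mode='replicate'), nn.ReLU(),
    nn.Conv3d(32, 16, 3, padding=1, padding_mode='replicate'), nn.ReLU(),
    nn.Conv3d(16, 1, 1), nn.Sigmoid())         # 1x1x1 conv + sigmoid -> alpha in [0, 1]

chunk = torch.rand(1, 2 * (n + 1), d, d, d)    # concatenated features + tanh(weights)
alpha = G(chunk)                               # (1, 1, d, d, d)
center = alpha[..., d//4:3*d//4, d//4:3*d//4, d//4:3*d//4]  # keep central d/2 cube
```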

(d) Fusion Module. The task of the fusion module is to combine \(\alpha \) with the shapes \(S^i_t\). In the following, we define the set of voxels where only sensor 1 integrates as \(C^1 = \{W^1_t > 0, \ W^2_{t-1} = 0\}\), the set where only sensor 2 integrates as \(C^2 = \{W^1_t = 0, \ W^2_{t-1} > 0\}\) and the set where both sensors integrate as \(C^{12} = \{W^1_t> 0, \ W^2_{t-1} > 0\}\). Let us also introduce \(\alpha _1 = \alpha \) and \(\alpha _2 = 1 - \alpha \). The fusion module computes the fused grid \(V_t\) as

$$\begin{aligned} V_t = {\left\{ \begin{array}{ll} \alpha _1 V_t^1 + \alpha _2V_{t-1}^2 &{} \text{ if } C^{12} \\ V_t^1 &{} \text{ if } C^1\\ V_{t-1}^2 &{} \text{ if } C^2. \end{array}\right. } \end{aligned}$$
(1)

Depending on the voxel set, \(V_t\) is computed either as a weighted average of the two surface hypotheses or by selecting one of them. With only one sensor observation, a weighted average would corrupt the result. Hence, the single observed surface is selected. At test time, we additionally apply a learned Outlier Filter which utilizes \(\alpha _i\) and \(W^i_t\). The filter is formulated for sensors 1 and 2 as

$$\begin{aligned} \hat{W}_t^1 = \mathbbm {1}_{\{C^1, \ \alpha _1> 0.5\}}W_t^1 ,\quad \hat{W}_{t-1}^2 = \mathbbm {1}_{\{C^2, \ \alpha _2 > 0.5\}}W_{t-1}^2 , \end{aligned}$$
(2)

where \(\mathbbm {1}_{\{.\}}\) denotes the indicator function. When only one sensor is observed at a certain voxel, we remove the observation if \(\alpha _i\), which could be interpreted as a confidence, is below 0.5. This is done by setting the weight counter to 0.
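
Eqs. (1) and (2) amount to a per-voxel selection or blending that can be written compactly; the sketch below drops the time indices and uses hypothetical tensor names.

```python
import torch

def fuse_and_filter(V1, W1, V2, W2, alpha):
    """Fusion Module (Eq. 1) and learned outlier filter (Eq. 2), per voxel."""
    only1 = (W1 > 0) & (W2 == 0)                 # C^1
    only2 = (W1 == 0) & (W2 > 0)                 # C^2
    both = (W1 > 0) & (W2 > 0)                   # C^12
    a1, a2 = alpha, 1.0 - alpha

    V = torch.zeros_like(V1)
    V[both] = a1[both] * V1[both] + a2[both] * V2[both]   # weighted average
    V[only1] = V1[only1]                                   # select the observed sensor
    V[only2] = V2[only2]

    # test-time filter: zero the weight counter where a single-sensor
    # observation has confidence alpha_i below 0.5
    W1_hat = torch.where(only1 & (a1 <= 0.5), torch.zeros_like(W1), W1)
    W2_hat = torch.where(only2 & (a2 <= 0.5), torch.zeros_like(W2), W2)
    return V, W1_hat, W2_hat
```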

Loss Function. The full pipeline is trained end-to-end using the overall loss

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{f} + \lambda _1 \sum _{i=1}^2\mathcal {L}_{C^i}^{in} + \lambda _2 \sum _{i=1}^2\mathcal {L}_{C^i}^{out}. \end{aligned}$$
(3)

The term \(\mathcal {L}_{f}\) computes the mean \(L_1\) error to the ground truth TSDF masked by \(C^{12}\) (4). To supervise the voxel sets \(C^{1}\) and \(C^{2}\), we introduce two additional terms, which penalize \(L_1\) deviations from the optimal \(\alpha \). The purpose of these terms is to provide a training signal for the outlier filter. If the \(L_1\) TSDF error is smaller than some threshold \(\eta \), the observation is deemed to be an inlier, and the corresponding confidence \(\alpha _i\) should be 1, otherwise 0. The loss is computed as the mean \(L_1\) error to the optimal \(\alpha \):

$$\begin{aligned} \mathcal {L}_{f} = \frac{1}{N_{C^{12}}}\sum \mathbbm {1}_{\{C^{12}\}}|V_t&- V^{GT}|_1, \ \mathcal {L}_{C^i}^{in} = \frac{1}{N_{C^i}^{in}} \sum \mathbbm {1}_{\{C^i, \ |V_t-V^{GT}|_1 < \eta \}}| \alpha _i - 1|_1, \nonumber \\ \mathcal {L}_{C^i}^{out}&= \frac{1}{N_{C^i}^{out}} \sum \mathbbm {1}_{\{C^i, \ |V_t-V^{GT}|_1 > \eta \}}| \alpha _i|_1, \end{aligned}$$
(4)

where the normalization factors are defined as

$$\begin{aligned} N_{C^{12}} = \sum&\mathbbm {1}_{\{C^{12}\}}, \ N_{C^i}^{in} = \sum \mathbbm {1}_{\{C^i, \ |V_t - V^{GT}|_1 < \eta \}}, \nonumber \\ N_{C^i}^{out}&= \sum \mathbbm {1}_{\{C^i, \ |V_t - V^{GT}|_1 > \eta \}} . \end{aligned}$$
(5)
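
A masked-L1 implementation of (3)-(5) could look as follows; the epsilon guard against empty voxel sets is our addition.

```python
import torch

def senfunet_loss(V, V_gt, alpha, W1, W2, lam1=1/60, lam2=1/600, eta=0.04, eps=1e-8):
    """Sketch of the full training loss over one chunk of the fused grid."""
    err = (V - V_gt).abs()
    both = (W1 > 0) & (W2 > 0)                                # C^12
    loss = (err * both).sum() / (both.sum() + eps)            # L_f
    for Ci, ai in [((W1 > 0) & (W2 == 0), alpha),             # C^1 with alpha_1
                   ((W1 == 0) & (W2 > 0), 1.0 - alpha)]:      # C^2 with alpha_2
        inl = Ci & (err < eta)                                 # inliers: push alpha_i -> 1
        out = Ci & (err > eta)                                 # outliers: push alpha_i -> 0
        loss = loss + lam1 * ((ai - 1.0).abs() * inl).sum() / (inl.sum() + eps)
        loss = loss + lam2 * (ai.abs() * out).sum() / (out.sum() + eps)
    return loss
```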

Training Forward Pass. After the integration of a new depth frame \(D^i_t\) into the shape and feature grids, the update indices from the Shape Integration Module are used to compute the minimum bounding box of updated voxels in \(S^i_t\) and \(F^i_t\). The update box varies in size between frames and cannot always fit on the GPU. Due to this and for the sake of training efficiency, we extract a \(d \times d \times d\) chunk from within the box volume. The chunk location is randomly selected using a uniform distribution along each axis of the bounding box. If the bounding box is smaller than d along any dimension, the chunk shrinks to the minimum size along the affected dimension. To maximize the number of voxels that are used to train the networks \(\mathcal {F}^i\), we sample chunks until we find one with at least 2000 update indices, making at most 600 attempts. If not enough valid indices are found, the next frame is integrated. The update indices in the chunk are finally used to mask the loss. We randomly reset the shape and feature grids with a probability of 0.01 at each frame integration to improve training robustness.
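
The chunk selection can be summarized by the following sketch; the array shapes and the bounding-box convention are assumptions on our part.

```python
import numpy as np

def sample_training_chunk(update_idx, bbox_min, bbox_max, d=64,
                          min_valid=2000, max_attempts=600):
    """update_idx: (M, 3) voxel indices touched by the current frame."""
    extent = bbox_max - bbox_min + 1
    size = np.minimum(extent, d)                    # shrink along too-small dimensions
    for _ in range(max_attempts):
        lo = bbox_min + (np.random.rand(3) * (extent - size + 1)).astype(int)
        hi = lo + size
        inside = np.all((update_idx >= lo) & (update_idx < hi), axis=1)
        if inside.sum() >= min_valid:
            return lo, hi, update_idx[inside]       # indices used to mask the loss
    return None                                     # skip: integrate the next frame
```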

4 Experiments

We first describe our experimental setup and then evaluate our method against state-of-the-art online depth fusion methods on Replica, the real-world CoRBS and the Scene3D datasets. All reported results are averages over the respective test scenes. Further experiments and details are in the supplementary material.

Implementation Details. We use \(\omega _\text {max} = 500\) and extract \(T = 11\) points along the update band \(l = 0.1\) m. We store \(n = 5\) features at each voxel location and use a chunk side length of \(d = 64\). For the loss (3), we set \(\lambda _1 = 1/60\), \(\lambda _2 = 1/600\) and \(\eta = 0.04\) m. In total, the networks of our model comprise 27.7K parameters, of which 24.3K belong to \(\mathcal {G}\) and the remaining parameters are split equally between \(\mathcal {F}^1\) and \(\mathcal {F}^2\). For our method and all baselines, the image size is \(W=H = 256\), the voxel size is 0.01 m and we mask the 10-pixel border of all depth maps to avoid edge artifacts, i.e. pixels belonging to the mask are not integrated into 3D. Since our TSDF updates cannot be larger than 0.05 m, we truncate the ground truth TSDF grid at \(l/2 = 0.05\) m.

Evaluation Metrics. The TSDF grids are evaluated using the Mean Absolute Distance (MAD), Mean Squared Error (MSE), Intersection over Union (IoU) and Accuracy (Acc.). The meshes, produced by marching cubes [29] from the TSDF grids, are evaluated using the F-score which is the harmonic mean of the Precision (P) and Recall (R).
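
For reference, simple grid-level versions of these metrics are sketched below; the exact IoU and Accuracy definitions are not spelled out in this section, so the occupancy-from-TSDF-sign convention used here is an assumption.

```python
import numpy as np

def grid_metrics(V, V_gt):
    """MAD, MSE and sign-based IoU / Accuracy between predicted and GT TSDF grids."""
    mad = np.abs(V - V_gt).mean()
    mse = ((V - V_gt) ** 2).mean()
    occ, occ_gt = V < 0, V_gt < 0                    # occupancy from the TSDF sign
    iou = (occ & occ_gt).sum() / max((occ | occ_gt).sum(), 1)
    acc = (occ == occ_gt).mean()
    return mad, mse, iou, acc

def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
```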

Baseline Methods. Since there is no other multi-sensor online 3D reconstruction method that addresses the same problem, we define our own baselines by generalizing single-sensor fusion pipelines to multiple sensors. TSDF Fusion [11] is the gold standard for fast, dense mapping of posed depth maps. It generalizes to the multi-sensor setting effortlessly by integrating all depth frames into the same TSDF grid at runtime. RoutedFusion [54] extends TSDF Fusion by learning the TSDF mapping. We generalize RoutedFusion to multiple sensors by feeding all depth frames into the same TSDF grid, but each sensor is assigned a separate fusion network to account for sensor-dependent noise. NeuralFusion [55] extends RoutedFusion for better outlier handling, but despite efforts and help from the authors, the network did not converge during training due to heavy loss oscillations caused by integrating different sensors. DI-Fusion [23] learns the scene representation and predicts the signed distance value as well as the uncertainty \(\sigma \) per voxel. We use the model provided by the authors and integrate all frames from both sensors into the same volumetric grid. In the following, we refer to each multi-sensor baseline by the corresponding single-sensor name. For additional comparison, when time-synchronized sensors with ground truth depth are available, we train a so-called “Early Fusion” baseline by fusing the 2D depth frames of both sensors. The fusion is performed with a modified version of the 2D denoising network proposed by Weder et al. [54], followed by TSDF Fusion to attain the 3D model (see supplementary material). This baseline should be interpreted as a light-weight alternative to our proposed SenFuNet, but it assumes synchronized sensors, which SenFuNet does not. Finally, for the single-sensor results, we evaluate the TSDF grids \(V^i_t\). To make the comparisons fair, we do not use weight counter thresholding as a post-processing outlier filter for any method. For DI-Fusion, we filter outliers by thresholding the learned voxel uncertainty, using the default value provided in the implementation.

Table 1. Replica Dataset. ToF+PSMNet Fusion. (a) Our method outperforms the baselines as well as both of the sensor inputs and sets a new state-of-the-art for multi-sensor online depth fusion. (b) The denoising network mitigates outliers along planar regions, compare to Table 1a. Our method even outperforms the Early Fusion baseline, which assumes synchronized sensors.
Fig. 4. Replica Dataset. Our method fuses the sensors consistently better than the baselines. Concretely, our method learns to detect and remove outliers much more effectively (best viewed on screen). Top row: ToF\(\boldsymbol{+}\)PSMNet Fusion without denoising. See also Table 1a. Bottom row: ToF\(\boldsymbol{+}\)PSMNet Fusion with denoising. See also Table 1b.

4.1 Experiments on the Replica Dataset

The Replica dataset [49] comprises high-quality 3D reconstructions of a variety of indoor scenes. We collect data from Replica to create a multi-sensor dataset suitable for depth map fusion. To prepare ground truth signed distance grids, we first make the 3D meshes watertight using screened Poisson surface reconstruction [26]. The meshes are then converted to signed distance grids using a modified version of mesh-to-sdf to accommodate non-cubic voxel grids. Ground truth depth and an RGB stereo pair are extracted using AI Habitat [32] along random trajectories. In total, we collected 92698 frames and use 7 training and 3 test scenes. We simulate a depth sensor by adding noise to the ground truth depth of the left stereo view. Correspondingly, a depth map for the left view can be predicted from the RGB stereo pairs using (optionally multi-view) stereo algorithms. In the following, we construct two sensor combinations and evaluate our model.

ToF+PSMNet Fusion. We simulate a ToF sensor by adding realistic noise [21] to the ground truth depth maps. To balance the two sensors, we increase the noise level by a factor of 5 compared to the original implementation. We simulate another depth sensor from the RGB stereo pair using PSMNet [8]. We train the network on the Replica training set and keep it fixed while training our pipeline. Table 1a shows that our method outperforms TSDF Fusion, RoutedFusion and DI-Fusion on all metrics except Recall, with an F-score gain of at least 13%. Additionally, the F-score improves by at least 18% compared to the input sensors. Specifically, note the absence of outliers (colored yellow) in Fig. 4 Top row when comparing our method to TSDF Fusion. Also note the sensor weighting: for example, the ToF reconstruction contains a lot of noise on the right wall, and thus our method puts more weight on the stereo sensor in this region.

Weder et al. [54] showed that a 2D denoising network (called routing network in their paper) that preprocesses the depth maps can improve performance when noise is present in planar regions. To this end, we train our own denoising network on the Replica training set and train a new model which applies this fixed denoising network. According to Table 1, this yields a gain of 10% on the F-score of the fused model compared to not using a denoising network, see also Fig. 4 Bottom row. Early Fusion is a strong alternative to our method when the sensors are synchronized. We want to highlight, however, that the resource overhead of our method is worthwhile since we outperform Early Fusion even in the synchronized setting.

Table 2. Time Asynchronous Evaluation. SenFuNet outperforms Early Fusion for sensors with different sampling frequencies. *With depth denoising.

Time Asynchronous Evaluation. RGB cameras often have higher frame rates than ToF sensors, which makes Early Fusion more challenging as one sensor might lack new data. We simulate this setting by giving the PSMNet sensor twice the sampling rate of the ToF sensor, i.e. we drop every second ToF frame. To provide a corresponding ToF frame for Early Fusion, we reproject the latest observed ToF frame into the current view of the PSMNet sensor. As demonstrated in Table 2, the gap between our SenFuNet late fusion approach and Early Fusion becomes even larger (cf. Table 1b).
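
The reprojection used to supply Early Fusion with a ToF frame in the PSMNet view can be implemented as a simple forward warp with a z-buffer. The sketch below assumes camera-to-world poses and ignores occlusion-aware filtering; all names are our own.

```python
import numpy as np

def reproject_depth(depth_src, K_src, pose_src, K_dst, pose_dst, shape_dst):
    """Warp the latest ToF depth map into the current PSMNet view (z-buffer splatting)."""
    H, W = depth_src.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)
    d = depth_src.reshape(-1)
    valid = d > 0
    pts_cam = (np.linalg.inv(K_src) @ pix[valid].T) * d[valid]        # (3, N)
    pts_w = pose_src[:3, :3] @ pts_cam + pose_src[:3, 3:4]            # to world
    T_dst = np.linalg.inv(pose_dst)
    pts_dst = T_dst[:3, :3] @ pts_w + T_dst[:3, 3:4]                  # to target camera
    z = pts_dst[2]
    proj = K_dst @ pts_dst
    u2 = np.round(proj[0] / z).astype(int)
    v2 = np.round(proj[1] / z).astype(int)
    Hd, Wd = shape_dst
    keep = (z > 0) & (u2 >= 0) & (u2 < Wd) & (v2 >= 0) & (v2 < Hd)
    out = np.full(shape_dst, np.inf)
    np.minimum.at(out, (v2[keep], u2[keep]), z[keep])                 # keep nearest depth
    out[np.isinf(out)] = 0.0                                          # holes -> invalid
    return out
```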

Table 3. Replica Dataset. SGM+PSMNet Fusion. Our method does not assume a particular sensor pairing and works well for all tested sensors. The gain from the denoising network is marginal with a 2% F-score improvement since there are few outliers on planar regions of the stereo depth maps and the denoising network over-smooths the depth along discontinuities. Our method outperforms Early Fusion (which generally assumes synchronized sensors) on most metrics even without depth denoising.

SGM+PSMNet Fusion. Our method does not assume a particular sensor pairing. We show this by replacing the ToF sensor with a stereo sensor acquired using semi-global matching (SGM) [22]. In Table 3, we show state-of-the-art sensor fusion performance both with and without a denoising network. The denoising network tends to over-smooth depth discontinuities, which negatively affects performance when few outliers exist. Additionally, even without using a denoising network, we outperform Early Fusion on most metrics. TSDF Fusion, RoutedFusion and DI-Fusion aggregate outliers across the sensors, leading to worse performance than the single-sensor results.

4.2 Experiments on the CoRBS Dataset

The real-world CoRBS dataset [53] provides a selection of reconstructed objects with very accurate ground truth 3D models and camera trajectories, along with data from a consumer-grade RGBD camera. We apply our method to the dataset by training a model on the desk scene and testing it on the human scene. The procedure to create the ground truth signed distance grids is identical to the Replica dataset. We create an additional depth sensor alongside the ToF depth using MVS with COLMAP [46]. Fig. 5 shows that our model can fuse very imbalanced sensors while the baseline methods fail severely. Even if one sensor (MVS) is significantly worse, our method still improves on most metrics. This confirms that our method learns to meaningfully fuse the sensors even if one sensor adds very little.

4.3 Experiments on the Scene3D Dataset

We demonstrate that our framework can fuse imbalanced sensors on room-sized real-world scenes using the RGBD Scene3D dataset [59]. The Scene3D dataset comprises a collection of 3D models of indoor and outdoor scenes. We train our model on the stonewall scene and test it on the copy room scene. To create the ground truth training grid, we follow the steps outlined previously, except that it was not necessary to make the mesh watertight. We fuse every 10th frame during training and testing. As in the CoRBS study, we create an MVS depth sensor using COLMAP and perform ToF and MVS fusion. We only integrate MVS depth in the interval [0.5, 3.0] m. Table 4a along with Fig. 6 Top row shows that our method yields a fused result better than the individual input sensors and the baseline methods. Further, Fig. 1 shows our method in comparison with TSDF Fusion [11] and RoutedFusion [54] on the lounge scene.

Fig. 5. CoRBS Dataset. ToF+MVS Fusion. Our model can find synergies between very imbalanced sensors. (a) The numerical results show that our fused model is better than the individual depth sensor inputs and significantly better than any of the baseline methods. (b) Contrary to our method, the baseline methods cannot handle the high degree of outliers from the MVS sensor.

Fig. 6. Top row: Our method effectively fuses the ToF and MVS sensors. Note specifically the absence of yellow outliers in the corner of the bookshelf. See also Table 4a. Bottom row: Multi-Agent ToF Reconstruction. Our method is flexible and can perform Multi-Agent reconstruction. Note that our model learns where to trust each agent at different spatial locations for maximum completeness, while also being noise aware. See for instance the left bottom corner of the bookshelf where both agents integrate, but the noise-free agent is given a higher weighting. The above scene is taken from the Scene3D Dataset [59] (best viewed on screen). See also Table 4b.

Table 4. Scene3D Dataset. (a) ToF+MVS Fusion. Our method outperforms the baselines on real-world data on a room-sized scene. (b) Multi-Agent ToF Reconstruction. Our method is flexible and can perform collaborative sensor fusion from multiple sensors with different camera trajectories.

Multi-agent ToF Fusion. Our method is not exclusively applicable to sensor fusion, but is more flexible. We demonstrate this by formulating a Multi-Agent reconstruction problem, which assumes that two identical ToF sensors with different camera trajectories are provided. The task is to fuse the reconstructions from the two agents. This requires an understanding of when to perform sensor selection for increased completeness and when to perform smooth fusion where both sensors have registered observations. Note that this formulation is different from typical works on collaborative 3D reconstruction, e.g. [19], where the goal is to align 3D reconstruction fragments to produce a complete model. In our Multi-Agent setting, the alignment is given and the task is instead to perform data fusion on the 3D fragments. No modification of our method is required to perform this task. We set \(\lambda _1 = 1/1200\) and \(\lambda _2 = 1/12000\) and split the original trajectory into 100-frame chunks that are divided between the agents. Table 4b and Fig. 6 Bottom row show that our method effectively fuses the incoming data streams and yields a 4% F-score gain over the TSDF Fusion baseline.

Fig. 7. (a) Performance over Camera Trajectory. Our fused output outperforms the single sensor reconstructions (\(V_{t}^{i}\)) for all frames along the camera trajectory. The above results are evaluated on the Replica office 0 scene using \(\{\)ToF, PSMNet\(\}\) with depth denoising. Note that the results get slightly worse after 300 frames. This is due to additional noise from the depth sensors when viewing the scene from further away. (b) Effect of Learned Outlier Filter. The learned filter is crucial for robust outlier handling. Erroneous outlier predictions shown in yellow are effectively removed by our approach while keeping the correct green-colored predictions. (Color figure online)

4.4 More Statistical Analysis

Performance over Camera Trajectory. To show that our fused output is not only better at the end of the fusion process, we visualize the quantitative performance across the accumulated trajectory. In Fig. 7a, we evaluate the office 0 scene on the sensors \(\{\)ToF, PSMNet\(\}\) with depth denoising. Our fused model consistently improves on the inputs.

Table 5. Architecture Ablation. We vary the number of 3D convolutional layers with kernel size 3. Performance is optimal at 2 layers, equivalent to a receptive field of \(9^3\).

Architecture Ablation. We perform a network ablation on the Replica dataset with the sensors \(\{\)SGM, PSMNet\(\}\) without depth denoising. In Table 5, we investigate the number of layers with kernel size 3 in the Weighting Network \(\mathcal {G}\). Two layers yield optimal performance which amounts to a receptive field of \(9 \!\times \! 9 \!\times \! 9\). This is realistic given that the support for a specific sensor is local by nature.

Table 6. Generalization Capability. Our model generalizes well when evaluated on the office 0 scene; we verify this by comparing against a model trained only on office 0.

Generalization Capability. Table 6 shows our model’s generalization ability for \(\{\)SGM, PSMNet\(\}\) fusion when evaluated against a model trained and tested on the office 0 scene. Our model generalizes well and performs almost on par with one which is only trained on the office 0 scene. The generalization capability is not surprising since \(\mathcal {G}\) has a limited receptive field of \(9 \!\times \! 9 \!\times \! 9\).

Effect of Learned Outlier Filter. To show the effectiveness of the filter, we study the feature space on the input side of \(\mathcal {G}\). Specifically, we consider the hotel 0 scene and sensors \(\{\)ToF, PSMNet\(\}\) with depth denoising. First, we concatenate both feature grids and flatten the resulting grid. Then, we reduce the observations of the 12-dim feature space to a 2-dim representation using tSNE [30]. We then colorize each point with the corresponding signed distance error at the original voxel position. We repeat the visualization with and without the learned outlier filter. Fig. 7b shows that the filter effectively removes outliers while keeping good predictions.
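
The visualization boils down to the following steps, here using scikit-learn's tSNE with a random subsample for tractability; the tensor names, grid shapes and the subsampling are our own choices.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_feature_space(F1, F2, sdf_error, mask, max_points=20000):
    """F1, F2: per-sensor (X, Y, Z, n+1) feature-plus-weight grids,
    sdf_error: (X, Y, Z) signed distance error, mask: (X, Y, Z) observed voxels."""
    X = np.concatenate([F1[mask], F2[mask]], axis=-1)       # (M, 2(n+1)) = (M, 12)
    err = sdf_error[mask]
    sel = np.random.choice(len(X), min(max_points, len(X)), replace=False)
    emb = TSNE(n_components=2, init='pca').fit_transform(X[sel])
    plt.scatter(emb[:, 0], emb[:, 1], c=err[sel], s=1, cmap='viridis')
    plt.colorbar(label='signed distance error')
    plt.show()
```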

Table 7. Loss Ablation. When only the term \(\mathcal {L}_f\) is used, we observe a significant performance drop compared to the full loss. Note, however, that even with only the term \(\mathcal {L}_f\), our model still improves on the single-sensor input metrics, cf. Table 3a.

Loss Ablation. Table 7 shows the performance difference when the model is trained only with the term \(\mathcal {L}_f\) compared to the full loss (3). We perform \(\{\)SGM, PSMNet\(\}\) fusion on the Replica dataset. The extra terms of the full loss clearly help improve overall performance and specifically to filter outliers.

Limitations. Our framework currently supports two sensors. Its extension to a k-sensor setting is straightforward, but the memory footprint grows linearly with the number of sensors; however, few devices have more than two or three depth sensors. While our method generates better results on average, some local regions may not improve, and our method struggles with overlapping outliers from both sensors. For qualitative examples, see the supplementary material. Ideally, outliers would be filtered and the data fused with a learned scene representation as in [55], but our efforts to make [55] work with multiple sensors suggest that this is a harder learning problem which deserves attention in future work.

5 Conclusion

We propose a machine learning approach for online multi-sensor 3D reconstruction using depth maps. We show that making the fusion decision on 3D features rather than directly on 2D depth maps generally improves surface accuracy and outlier robustness. This also holds when 2D fusion is straightforward, i.e. for time-synchronized sensors with equal resolution and calibration. The experiments demonstrate that our model handles various sensors, scales to room-sized real-world scenes, and produces a fused result that is quantitatively and qualitatively better than the single sensor inputs and the compared baseline methods.