1 Introduction

Converting unstructured point clouds to surfaces is a key step of most scanning-based asset creation workflows, including games and AR/VR applications. While scanning technologies have become more easily accessible (e.g., depth cameras on smartphones, portable scanners), algorithms for producing a surface mesh remain limited. A good surfacing algorithm should be able to handle raw point clouds with noise and varying sampling density, work with different surface topologies, and generalize across a large range of scanned shapes.

Screened Poisson Reconstruction (SPR) [19] is the most commonly used method to convert an unstructured point cloud, along with its per-point normals, to a surface mesh. While the method is general, in the absence of any data priors SPR typically produces smooth surfaces, can incorrectly close off holes and tunnels in noisy or non-uniformly sampled scans (see Fig. 1), and degenerates further when per-point normal estimates are erroneous.

Fig. 1. We present Points2Surf, a method to reconstruct an accurate implicit surface from a noisy point cloud. Unlike current data-driven surface reconstruction methods such as DeepSDF and AtlasNet, it is patch-based and improves detail reconstruction; unlike Screened Poisson Reconstruction (SPR), it benefits from a learned prior of low-level patch shapes that improves reconstruction accuracy. Note the quality of the reconstructions, both geometric and topological, compared to the original surfaces. The ability of Points2Surf to generalize to new shapes makes it the first learning-based approach with significant generalization ability under both geometric and topological variations.

Hence, several data-driven alternatives  [8, 13, 21, 30] have recently been proposed. These methods specialize to particular object categories (e.g., cars, planes, chairs), and typically regress a global latent vector from any input scan. The networks can then decode a final shape (e.g., a collection of primitives  [13] or a mesh  [30]) from the estimated global latent vector. While such data-specific approaches handle noisy and partial scans, the methods do not generalize to new surfaces with varying shape and topology (see Fig. 1).

As a solution, we present Points2Surf, a method that learns to produce implicit surfaces directly from raw point clouds. At test time, our method reliably handles raw scans, reproduces fine-scale features even from noisy and non-uniformly sampled point sets, works for objects with varying topological attributes, and generalizes to new objects (see Fig. 1).

Our key insight is to decompose the problem into learning a global and a local function. For the global function, we learn the sign (i.e., inside or outside) of an implicit signed distance function, while, for the local function, we use a patch-based approach to learn absolute distance fields with respect to local point cloud patches. The global task is coarse (i.e., to learn the inside/outside of objects) and hence can be generalized across significant shape variations. The local task exploits the geometric observation that a large variety of shapes can be expressed in terms of a much smaller collection of atomic shape patches  [2], which generalizes across arbitrary shapes. We demonstrate that such a factorization leads to a simple, robust, and generalizable approach to learn an implicit signed distance field, from which a final surface is extracted using Marching Cubes  [23].

We test our algorithm on a range of synthetic and real examples, compare on unseen classes against both classical (reduction in reconstruction error by 30% over SPR) and learning-based strong baselines (reduction in reconstruction error by 470% over DeepSDF [30] and 270% over AtlasNet [13]), and provide ablation studies. We consistently demonstrate both qualitative and quantitative improvement over all the methods that can be applied directly to raw scans.

2 Related Work

Several methods have been proposed to reconstruct surfaces from point clouds. We divide these into methods that aggregate information from a large dataset into a data-driven prior, and methods that do not use a data-driven prior.

Non-data-driven Surface Reconstruction. Berger et al. [4] present an in-depth survey that focuses primarily on non-data-driven methods. Here we focus on approaches that are most relevant to our method. Scale space meshing [10] applies iterative mean curvature motion to smooth the points for meshing; it preserves multi-resolution features well. Ohrhallinger et al. propose a combinatorial method [27] which compares favorably with previous methods such as Wrap [11], TightCocone [9] and Shrink [5], especially for sparse sampling and thin structures. However, these methods are not designed to process noisy point clouds. Another line of work deforms initial meshes [22, 33] or parametric patches [36] to fit a noisy point cloud. These approaches, however, cannot change the topology and connectivity of the original meshes or patches, usually resulting in a connectivity or topology different from that of the ground truth. The most widely used approaches to reconstruct surfaces with arbitrary topology from noisy point clouds fit implicit functions to the point cloud and generate a surface as a level set of the function. Early work by Hoppe et al. introduced this approach [16], and since then several methods have focused on different representations of the implicit function, such as Fourier coefficients [17], wavelets [24], radial basis functions [29] or multi-scale approaches [26, 28]. Alliez et al. [1] use a PCA of 3D Voronoi cells to estimate gradients and fit an implicit function by solving an eigenvalue problem; this approach tends to over-smooth geometric detail. Poisson reconstruction [18, 19] is the current gold standard for non-data-driven surface reconstruction from point clouds. None of the above methods make use of a prior that distills information about typical surface shapes from a large dataset. Hence, while they are very general, they fail to handle partial and/or noisy input. We provide extensive comparisons to Screened Poisson Reconstruction (SPR) [19] in Sect. 4.

Data-Driven Surface Reconstruction. Recently, several methods have been proposed to learn a prior of typical surface shapes from a large dataset. Early work was done by Sauerer et al. [21], who train a decision tree to predict the absolute distance part of an SDF, but still require ground-truth normals to obtain the sign (inside/outside) of the SDF. More recent data-driven methods represent surfaces with a single latent feature vector in a learned feature space. An encoder can be trained to obtain the feature vector from a point cloud. The feature representation acts as a strong prior, since only shapes that are representable in the feature space are reconstructed. Early methods use voxel-based representations of the surfaces, with spatial data structures like octrees offsetting the cost of a full volumetric grid [34, 35]. Scan2Mesh [8] reconstructs a coarse mesh, including vertices and triangles, from a scan with impressive robustness to missing parts. However, the result is typically very coarse, not watertight or manifold, and the method does not apply to arbitrary new shapes. AtlasNet [13] uses multiple parametric surfaces as a representation that jointly form a surface, achieving impressive accuracy and cross-category generalization. More recently, several approaches learn implicit function representations of surfaces [6, 25, 30]. These methods are trained to learn a functional that maps a latent encoding of a surface to an implicit function that can be used to extract a continuous surface. The implicit representation is more suitable for surfaces with complex topology and tends to produce aesthetically pleasing smooth surfaces.

The single latent feature vector that the methods above use to represent a surface acts as a strong prior, allowing these methods to reconstruct surfaces even in the presence of strong noise or missing parts; but it also limits the generality of these methods. The feature space typically captures only shapes that are similar to the shapes in the training set, and the variety of shapes that can be captured by the feature space is limited by the fixed capacity of the latent feature vector. Instead, we propose to decompose the SDF that is used to reconstruct the surface into a coarse global sign and a detailed local absolute distance. Separate feature vectors are used to represent the global and local parts, allowing us to represent detailed local geometry, without losing coarse global information about the shape.

3 Method

Our goal is to reconstruct a watertight surface S from a 3D point cloud \(P = \{p_1, \ldots , p_N\}\) that was sampled from the surface S through a noisy sampling process, like a 3D scanner. We represent a surface as the zero-set of a Signed Distance Function (SDF) \(f_S\):

$$\begin{aligned} S = L_0(f_S) = \{x \in \mathbb {R}^3\ |\ f_S(x)=0\}. \end{aligned}$$
(1)

Recent work  [6, 25, 30] has shown that such an implicit representation of the surface is particularly suitable for neural networks, which can be trained as functionals that take as input a latent representation of the point cloud and output an approximation of the SDF:

$$\begin{aligned} f_S(x) \approx \tilde{f}_P(x) = s_{\theta }(x | z),\ \text {with}\ z = e_{\phi }(P), \end{aligned}$$
(2)

where z is a latent description of surface S that can be encoded from the input point cloud with an encoder e, and s is implemented by a neural network that is conditioned on the latent vector z. The networks s and e are parameterized by \(\theta \) and \(\phi \), respectively. This representation of the surface is continuous, usually produces watertight meshes, and can naturally encode arbitrary topology. Different from non-data-driven methods like SPR [19], the trained network obtains a strong prior from the dataset it was trained on, which allows robust reconstruction of surfaces even from unreliable input, such as noisy and sparsely sampled point clouds. However, encoding the entire surface with a single latent vector imposes limitations on the accuracy and generality of the reconstruction, due to the limited capacity of the latent representation.

In this work, we factorize the SDF into the absolute distance \(f^d\) and the sign of the distance \(f^s\), and take a closer look at the information needed to approximate each factor. To estimate the absolute distance \(\tilde{f}^d(x)\) at a query point x, we only need a local neighborhood of the query point:

$$\begin{aligned} \tilde{f}_P^d(x) = s^d_{\theta }(x | z^d_x),\ \text {with}\ z^d_x = e^d_{\phi }(\mathbf {p}^d_x), \end{aligned}$$
(3)

where \(\mathbf {p}^d_x \subset P\) is a local neighborhood of the point cloud centered around x. Estimating the distance from an encoding of a local neighborhood gives us more accuracy than estimating it from an encoding of the entire shape, since the local encoding \(z^d_x\) can represent the local neighborhood around x more accurately than the global encoding z. Note that in a point cloud without noise and with sufficiently dense sampling, the single closest point \(p^* \in P\) to the query x would be enough to obtain a good approximation of the absolute distance. But since we work with noisy and sparsely sampled point clouds, using a larger local neighborhood increases robustness.

In order to estimate the sign \(\tilde{f}^s(x)\) at the query point x, we need global information about the entire shape, since the interior/exterior of a watertight surface cannot be estimated reliably from a local patch. Instead, we take a global sub-sample of the point cloud P as input:

$$\begin{aligned} \tilde{f}_P^s(x) = \mathrm {sgn}\big (\tilde{g}_P^s(x)\big ) = \mathrm {sgn}\big (s^s_{\theta }(x | z^s_x)\big ),\ \text {with}\ z^s_x = e^s_{\psi }(\mathbf {p}^s_x), \end{aligned}$$
(4)

where \(\mathbf {p}^s_x \subset P\) is a global subsample of the point cloud, \(\psi \) are the parameters of the encoder, and \(\tilde{g}_P^s(x)\) is the logit of the probability that x has a positive sign. Working with logits avoids discontinuities near the surface, where the sign changes. Since it is more important to have accurate information closer to the query point, we sample \(\mathbf {p}^s_x\) with a density gradient that is highest near the query point and falls off with distance from the query point.

We found that sharing information between the two latent descriptions \(z^s_x\) and \(z^d_x\) benefits both the absolute distance and the sign of the SDF, giving us the formulation we use in Points2Surf:

$$\begin{aligned} \big (\tilde{f}_P^d(x), \tilde{g}_P^s(x)\big ) = s_{\theta }(x | z^d_x, z^s_x),\ \text {with}\ z^d_x = e^d_{\phi }(\mathbf {p}^d_x) \ \text {and}\ z^s_x = e^s_{\psi }(\mathbf {p}^s_x). \end{aligned}$$
(5)

We describe the architecture of our encoders and decoder, the training setup, and our patch sampling strategy in more detail in Sect. 3.1.

To reconstruct the surface S, we apply Marching Cubes  [23] to a sample grid of the estimated SDF \(\tilde{f}^d(x) * \tilde{f}^s(x)\). In Sect. 3.2, we describe a strategy to improve performance by only evaluating a subset of the grid samples.

Fig. 2. Points2Surf architecture. Given a query point x (red) and a point cloud P (gray), we sample a local patch (yellow) and a coarse global subsample (purple) of the point cloud. These are encoded into two feature vectors that are fed to a decoder, which outputs a logit of the sign probability and the absolute distance of the SDF at the query point x. (Color figure online)

3.1 Architecture and Training

Figure 2 shows an overview of our architecture. Our approach estimates the absolute distance \(\tilde{f}_P^d(x)\) and the sign logits \(\tilde{g}_P^s(x)\) at a query point based on two inputs: the query point x and the point cloud P.

Pointset Sampling. The local patch \(\mathbf {p}^d_x\) and the global sub-sample \(\mathbf {p}^s_x\) are both chosen from the point cloud P based on the query point x. The set \(\mathbf {p}^d_x\) consists of the \(n_d\) nearest neighbors of the query point (we choose \(n_d = 300\) but also experiment with other values). Unlike a fixed radius, nearest neighbors remain suitable for query points at an arbitrary distance from the point cloud. The global sub-sample \(\mathbf {p}^s_x\) is sampled from P with a density gradient that decreases with distance from x:

$$\begin{aligned} \rho (p_i) = \frac{v(p_i)}{\sum _{p_j \in P} v(p_j)},\ \text {with}\ v(p_i) = \left[ 1 - 1.5 \frac{\Vert p_i-x\Vert _2}{\max _{p_j \in P} \Vert p_j-x\Vert _2} \right] _{0.05}^1, \end{aligned}$$
(6)

where \(\rho \) is the sample probability for a point \(p_i \in P\), v is the gradient that decreases with distance from x, and the square brackets denote clamping. The minimum value for the clamping ensures that some far points are taken and the sub-sample can represent a closed object. We sample \(n_s\) points from P according to this probability (we choose \(n_s = 1000\) in our experiments).
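To make this concrete, the following is a minimal NumPy sketch of both pointset selections with the parameters stated above (\(n_d = 300\), \(n_s = 1000\)). The function names are ours, not from the authors' released code, and a k-d tree would replace the brute-force sort in practice.

```python
import numpy as np

def sample_local_patch(P, x, n_d=300):
    """Select the n_d nearest neighbors of the query point x in P."""
    dists = np.linalg.norm(P - x, axis=1)
    return P[np.argsort(dists)[:n_d]]

def sample_global_subsample(P, x, n_s=1000, seed=0):
    """Sample n_s points from P with the density gradient of Eq. (6)."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(P - x, axis=1)
    v = np.clip(1.0 - 1.5 * dists / dists.max(), 0.05, 1.0)  # clamped gradient
    rho = v / v.sum()                       # per-point sample probabilities
    return P[rng.choice(len(P), size=n_s, replace=False, p=rho)]
```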

Pointset Normalization. Both \(\mathbf {p}^d_x\) and \(\mathbf {p}^s_x\) are normalized by centering them at the query point, and scaling them to unit radius. After running the network, the estimated distance is scaled back to the original size before comparing to the ground truth. Due to the centering, the query point is always at the origin of the normalized coordinate frame and does not need to be passed explicitly to the network. To normalize the orientation of both point subsets, we use a data-driven approach: a Quaternion Spatial Transformer Network (QSTN)  [15] takes as input the global subset \(\mathbf {p}^s_x\) and estimates a rotation represented as quaternion q that is used to rotate both point subsets. We take the global subset as input, since the global information can help the network with finding a more consistent rotation. The QSTN is trained end-to-end with the full architecture, without direct supervision for the rotation.
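A short sketch of this normalization (the learned QSTN rotation is omitted; the helper name is ours):

```python
import numpy as np

def normalize_pointset(pts, x):
    """Center a point subset at the query point x and scale it to unit radius.
    The returned scale maps the predicted distance back to the original size."""
    centered = pts - x                              # query point -> origin
    scale = np.linalg.norm(centered, axis=1).max()  # radius of the subset
    return centered / scale, scale
```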

Encoder and Decoder Architecture. The local encoder \(e^d_{\phi }\) and the global encoder \(e^s_{\psi }\) are both implemented as PointNets [31], sharing the same architecture but not the parameters. Following the PointNet architecture, a feature representation for each point is computed using 5 MLP layers, with a spatial transformer in feature space after the third layer. Each layer except the last one uses batch normalization and ReLU. The point feature representations are then aggregated into pointset feature representations \(z^d_x = e^d_{\phi }(\mathbf {p}^d_x)\) and \(z^s_x = e^s_{\psi }(\mathbf {p}^s_x)\) using a channel-wise maximum. The decoder \(s_{\theta }\) is implemented as a 4-layer MLP that takes as input the concatenated feature vectors \(z^d_x\) and \(z^s_x\) and outputs both the absolute distance \(\tilde{f}^d(x)\) and the sign logits \(\tilde{g}^s(x)\).
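The following PyTorch sketch illustrates the two-branch encoder-decoder structure described above. It is a simplified sketch, not the authors' implementation: the layer widths are illustrative assumptions, and the QSTN and the feature-space spatial transformer after the third layer are omitted for brevity.

```python
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """Simplified PointNet: 5 shared per-point MLP layers (BN + ReLU on all
    but the last), aggregated by a channel-wise maximum."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        dims = [3, 64, 64, 64, 128, feat_dim]   # widths are assumptions
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Conv1d(dims[i], dims[i + 1], 1))
            if i < len(dims) - 2:               # last layer: no BN/ReLU
                layers += [nn.BatchNorm1d(dims[i + 1]), nn.ReLU()]
        self.mlp = nn.Sequential(*layers)

    def forward(self, pts):                     # pts: (B, N, 3)
        feat = self.mlp(pts.transpose(1, 2))    # per-point features (B, C, N)
        return feat.max(dim=2).values           # pointset feature (B, C)

class Points2SurfDecoder(nn.Module):
    """4-layer MLP on the concatenated local and global feature vectors."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2))                  # abs. distance and sign logit

    def forward(self, z_d, z_s):
        out = self.mlp(torch.cat([z_d, z_s], dim=1))
        return out[:, 0].abs(), out[:, 1]       # non-negative distance, logit
```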

Losses and Training. We train our networks end-to-end to regress the absolute distance of the query point x from the watertight ground-truth surface S and classify the sign as positive (outside S) or negative (inside S). We assume that ground-truth surfaces are available during training for supervision. We use an \(L_2\)-based regression for the absolute distance:

$$\begin{aligned} \mathcal {L}^d(x, P, S) = \Vert \tanh (|\tilde{f}_P^d(x)|) - \tanh (|d(x, S)|)\Vert _2^2, \end{aligned}$$
(7)

where d(xS) is the distance of x to the ground-truth surface S. The \(\tanh \) function gives more weight to smaller absolute distances, which are more important for an accurate surface reconstruction. For the sign classification, we use the binary cross entropy H as loss:

$$\begin{aligned} \mathcal {L}^s(x, P, S) = H\Big (\sigma \big (\tilde{g}_P^s(x)\big ),\ [f_S(x) > 0]\Big ), \end{aligned}$$
(8)

where \(\sigma \) is the logistic function to convert the sign logits to probabilities, and \([f_S(x) > 0]\) is 1 if x is outside the surface S and 0 otherwise. In our optimization, we minimize these two losses for all shapes and query points in the training set:

$$\begin{aligned} \sum _{(P, S) \in \mathcal {S}} \sum _{x \in \mathcal {X}_S} \mathcal {L}^d(x, P, S) + \mathcal {L}^s(x, P, S), \end{aligned}$$
(9)

where \(\mathcal {S}\) is the set of surfaces S and corresponding point clouds P in the training set and \(\mathcal {X}_S\) the set of query points for shape S. Estimating the sign as a classification task instead of regressing the signed distance allows the network to express confidence through the magnitude of the sign logits, improving performance.
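Assuming the model outputs from the sketch above, Eqs. (7)–(9) transcribe directly into PyTorch; this is a minimal sketch, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def distance_loss(pred_dist, gt_dist):
    """Eq. (7): L2 on tanh of absolute distances; tanh emphasizes query
    points close to the surface."""
    return (torch.tanh(pred_dist.abs()) - torch.tanh(gt_dist.abs())).pow(2).mean()

def sign_loss(sign_logit, gt_sdf):
    """Eq. (8): binary cross entropy on the sign logit; the target is 1 for
    query points outside the surface."""
    return F.binary_cross_entropy_with_logits(sign_logit, (gt_sdf > 0).float())

def total_loss(pred_dist, sign_logit, gt_sdf):
    """Eq. (9): sum of both losses, averaged over the query points."""
    return distance_loss(pred_dist, gt_sdf) + sign_loss(sign_logit, gt_sdf)
```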

3.2 Surface Reconstruction

At inference time, we want to reconstruct a surface \(\tilde{S}\) from an estimated SDF \(\tilde{f}(x) = \tilde{f}^d(x) * \tilde{f}^s(x)\). A straightforward approach is to apply Marching Cubes [23] to a volumetric grid of SDF samples. Obtaining a high-resolution result, however, would require evaluating a prohibitive number of samples for each shape. We observe that in order to reconstruct a surface, a Truncated Signed Distance Field (TSDF) is sufficient, where the SDF is truncated to the interval \([-\epsilon , \epsilon ]\) (we set \(\epsilon \) to three times the grid spacing in all our experiments). We only need to evaluate the SDF for samples that are inside this interval, while samples outside the interval merely need the correct sign. We leave samples whose distance to the nearest point in P is larger than \(\epsilon \) blank, and in a second step, we propagate the signed distance values from non-blank to blank samples to obtain the correct sign in the truncated regions of the TSDF. We iteratively apply a box filter of size \(3^3\) at the blank samples until convergence. In each step, we update initially unknown samples only if the magnitude of the filter response is greater than a user-defined confidence threshold (we use 13 in our experiments). After each step, the updated samples are set to −1 if the filter response was negative, or to +1 if it was positive.
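The sign-propagation step can be sketched as follows, under our reading of the text: blank cells start at 0, known cells hold \(\pm 1\), and the \(3^3\) box filter is applied iteratively with the confidence threshold of 13 (out of 27 cells). An illustrative NumPy/SciPy sketch, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def propagate_signs(signs, blank_mask, threshold=13.0, max_iters=100):
    """signs: float grid with +1/-1 at known cells and 0 at blank cells
    (blank_mask == True). Fills blanks with propagated signs."""
    for _ in range(max_iters):
        if not blank_mask.any():
            break
        response = uniform_filter(signs, size=3) * 27.0     # box-filter sum
        confident = blank_mask & (np.abs(response) > threshold)
        signs[confident] = np.sign(response[confident])     # set to -1 or +1
        blank_mask &= ~confident
    return signs
```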

4 Results

We compare our method to SPR as the gold standard for non-data-driven surface reconstruction and to two state-of-the-art data-driven surface reconstruction methods. We provide both qualitative and quantitative comparisons on several datasets in Sect. 4.2, and perform an ablation study in Sect. 4.3.

4.1 Datasets

We train and evaluate on the ABC dataset [20], which contains a large number and variety of high-quality CAD meshes. We pick 4950 clean watertight meshes for training and 100 meshes for the validation and test sets. Note that each mesh produces a large number of diverse patches as training samples. Operating on local patches also allows us to generalize better, which we demonstrate on two additional test-only datasets: the Famous dataset, consisting of 22 diverse meshes well-known in geometry processing, such as the Utah teapot and the Stanford Bunny, and the Real dataset, consisting of 3 real scans of complex objects used in several denoising papers [32, 37]. Examples from each dataset are shown in Fig. 3. The ABC dataset contains predominantly CAD models of mechanical parts, while the Famous dataset contains more organic shapes, such as characters and animals. Since we train on the ABC dataset, the Famous dataset serves to test the generalizability of our method versus the baselines.

Fig. 3. Dataset examples. Examples from the ABC dataset and its three variants are shown on the left; examples from the Famous dataset and its five variants on the right.

Pointcloud Sampling. As a pre-processing step, we center all meshes at the origin and scale them uniformly to fit within the unit cube. To obtain point clouds P from the meshes S in the datasets, we simulate scanning them with a time-of-flight sensor from random viewpoints using BlenSor [14]. BlenSor realistically simulates various types of scanner noise and artifacts, such as backfolding, ray reflections, and per-ray noise. We scan each mesh in the Famous dataset with 10 scans and each mesh in the ABC dataset with a random number of scans between 5 and 30. For each scan, we place the scanner at a random location on a sphere centered at the origin, with the radius chosen randomly in U[3L, 5L], where L is the largest side of the mesh bounding box. The scanner is oriented to point at a location with a small random offset from the origin, chosen from \(U[-0.1L, 0.1L]\) along each axis, and rotated randomly around the view direction. Each scan has a resolution of \(176 \times 144\), resulting in roughly 25k points, minus some points missing due to simulated scanning artifacts. The point clouds of multiple scans of a mesh are merged to obtain the final point cloud.
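The random scanner placement can be sketched in a few lines (geometry only; the scan simulation itself is done by BlenSor, which we do not reproduce here):

```python
import numpy as np

def sample_scanner_pose(L, seed=0):
    """L: largest side of the mesh bounding box. Returns a scanner position,
    look-at target, and roll angle, following the distributions stated above."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)             # uniform direction on a sphere
    position = rng.uniform(3 * L, 5 * L) * direction   # radius in U[3L, 5L]
    look_at = rng.uniform(-0.1 * L, 0.1 * L, size=3)   # small offset from the origin
    roll = rng.uniform(0.0, 2.0 * np.pi)               # rotation around view direction
    return position, look_at, roll
```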

Dataset Variants. We create multiple versions of both the ABC and Famous datasets with varying amounts of per-ray noise. This Gaussian noise added to the depth values simulates inaccuracies in the depth measurements. For both datasets, we create a noise-free version of the point clouds, called ABC no-noise and Famous no-noise. We also make versions with strong noise (standard deviation 0.05L), called ABC max-noise and Famous max-noise. Since we need varying noise strength for training, we create a version of ABC where the standard deviation is randomly chosen in U[0, 0.05L] (ABC var-noise), and a version with a constant noise strength of 0.01L for the test set (Famous med-noise). Additionally, we create sparser (5 scans) and denser (30 scans) point clouds, in comparison to the 10 scans of the other variants of Famous. Both variants have a medium noise strength of 0.01L. The training set uses the ABC var-noise version; all other variants are used for evaluation only.

Query Points. The training set also contains a set \(\mathcal {X}_S\) of query points for each (point cloud, mesh) pair \((P, S) \in \mathcal {S}\). Query points closer to the surface are more important for the surface reconstruction and more difficult to estimate. Hence, we randomly sample a set of 1000 points on the surface and offset them in the normal direction by a uniform random amount in \(U[-0.02L, 0.02L]\). An additional 1000 query points are sampled randomly in the unit cube that contains the surface, for a total of 2000 query points per mesh. During training, we randomly sample a subset of 1000 query points per mesh in each epoch.
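A sketch of this query-point generation, assuming a hypothetical helper sample_surface(mesh, n) that returns surface points and their normals (e.g., built on a mesh library such as trimesh):

```python
import numpy as np

def make_query_points(mesh, L, n=1000, seed=0):
    """Returns 2n query points: n offset from the surface along the normals,
    n uniform in the unit cube (meshes are pre-scaled to fit the unit cube)."""
    rng = np.random.default_rng(seed)
    surf_pts, normals = sample_surface(mesh, n)   # hypothetical helper
    offsets = rng.uniform(-0.02 * L, 0.02 * L, size=(n, 1))
    near_surface = surf_pts + offsets * normals   # offset along the normals
    uniform = rng.uniform(-0.5, 0.5, size=(n, 3))
    return np.concatenate([near_surface, uniform], axis=0)
```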

4.2 Comparison to Baselines

We compare our method to recent data-driven surface reconstruction methods, AtlasNet  [13] and DeepSDF  [30], and to SPR  [19], which is still the gold standard in non-data-driven surface reconstruction from point clouds. Both AtlasNet and DeepSDF represent a full surface as a single latent vector that is decoded into either a set of parametric surface patches in AtlasNet, or an SDF in DeepSDF. In contrast, SPR solves for an SDF that has a given sparse set of point normals as gradients, and takes on values of 0 at the point locations. We use the default values and training protocols given by the authors for all baselines (more details in the Supplementary) and re-train the two data-driven methods on our training set. We provide SPR with point normals as input, which we estimate from the input point cloud using the recent PCPNet  [15].

Error Metric. To measure the reconstruction error of each method, we sample both the reconstructed surface and the ground-truth surface with 10k points and compute the Chamfer distance  [3, 12] between the two point sets:

$$\begin{aligned} d_{\text {ch}}(A, B) := \frac{1}{|A|} \sum _{p_i \in A} \min _{p_j \in B} \Vert p_i - p_j\Vert ^2_2\ + \frac{1}{|B|} \sum _{p_j \in B} \min _{p_i \in A} \Vert p_j - p_i\Vert ^2_2, \end{aligned}$$
(10)

where A and B are point sets sampled on the two surfaces. The Chamfer distance penalizes both false negatives (missing parts) and false positives (excess parts).
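Equation (10) transcribes into a few lines; a minimal sketch using SciPy's k-d tree, with A and B as \((n, 3)\) arrays of surface samples:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(A, B):
    """Symmetric Chamfer distance of Eq. (10) between two point sets."""
    d_ab, _ = cKDTree(B).query(A)   # nearest neighbor in B for each point of A
    d_ba, _ = cKDTree(A).query(B)   # nearest neighbor in A for each point of B
    return (d_ab ** 2).mean() + (d_ba ** 2).mean()
```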

Table 1. Comparison of reconstruction errors. We show the Chamfer distance between reconstructed and ground-truth surfaces, averaged over all shapes in a dataset. Both the absolute value of the error multiplied by 100 (abs.) and the error relative to Points2Surf (rel.) are shown to facilitate the comparison. Our method consistently performs better than the baselines, due to its strong and generalizable prior.
Fig. 4. Qualitative comparison of surface reconstructions. We evaluate one example from each dataset variant with each method. Colors show the distance of the reconstructed surface to the ground-truth surface.

Quantitative and Qualitative Comparison. A quantitative comparison of the reconstruction quality is shown in Table 1, and Fig. 4 compares a few reconstructions qualitatively. All methods were trained on the training set of the ABC var-noise dataset, which contains predominantly mechanical parts, while the more organic shapes in the Famous dataset test how well each method can generalize to novel types of shapes.

Both DeepSDF and AtlasNet use a global shape prior, which is well suited for a dataset with high geometric consistency among the shapes, like cars in ShapeNet, but struggles with the significantly larger geometric and topological diversity in our datasets, reflected in a higher reconstruction error than SPR or Points2Surf. This is also clearly visible in Fig. 4, where the surfaces reconstructed by DeepSDF and AtlasNet appear over-smoothed and inaccurate.

In SPR, the full shape space does not need to be encoded into a prior with limited capacity, resulting in a better accuracy. But this lack of a strong prior also prevents SPR from robustly reconstructing typical surface features, such as holes or planar patches (see Figs. 1 and 4).

Points2Surf learns a prior of local surface details, instead of a prior for global surface shapes. This local prior helps recover surface details like holes and planar patches more robustly, improving our accuracy over SPR. Since there is less variety and more consistency in local surface details compared to global surface shapes, our method generalizes better and achieves a higher accuracy than the data-driven methods that use a prior over the global surface shape.

Generalization. A comparison of our generalization performance against AtlasNet and DeepSDF shows an advantage for our method. In Table 1, we can see that the error for DeepSDF and AtlasNet increases more when going from the ABC dataset to the Famous dataset than the error for our method. This suggests that our method generalizes better from the CAD models in the ABC dataset to the more organic shapes in the Famous dataset.

Fig. 5. Comparison of reconstruction details. Our learned prior improves the reconstruction robustness for geometric detail compared to SPR.

Topological Quality. Figure 5 shows examples of geometric detail that benefits from our prior. The first example shows that small features such as holes can be recovered from a very weak geometric signal in a noisy point cloud. Concavities, such as the space between the legs of the Armadillo, and fine shape details like the Armadillo’s hand are also recovered more accurately in the presence of strong noise. In the heart example, the concavity makes it hard to estimate the correct normal direction based on only a local neighborhood, which causes SPR to produce artifacts. In contrast, the global information we use in our patches helps us estimate the correct sign, even if the local neighborhood is misleading.

Effect of Noise. Examples of reconstructions from point clouds with increasing amounts of noise are shown in Fig. 6. Our learned prior for local patches and our coarse global surface information makes it easier to find small holes and large concavities. In the medium noise setting, we can recover the small holes and the large concavity of the surface. With maximum noise, it is very difficult to detect the small holes, but we can still recover the concavity.

Fig. 6. Effect of noise on our reconstruction. DeepSDF (D.SDF), AtlasNet (A.Net), SPR and Points2Surf (P2S) are applied to increasingly noisy point clouds. Our patch-based data-driven approach is more accurate than DeepSDF and AtlasNet, and can more robustly recover small holes and concavities than SPR.

Fig. 7. Reconstruction of real-world point clouds. Snapshots of the real-world objects are shown on the left. DeepSDF and AtlasNet do not generalize well, resulting in inaccurate reconstructions, while the smoothness prior of SPR results in loss of detail near concavities and holes. Our data-driven local prior better preserves these details.

Real-World Data. The real-world point clouds in Fig. 1 bottom and Fig. 7 bottom both originate from a multi-view dataset  [37] and were obtained with a plane-sweep algorithm  [7] from multiple photographs of an object. We additionally remove outliers using the recent PointCleanNet  [32]. Figure 7 top was obtained by the authors through an SfM approach. DeepSDF and AtlasNet do not generalize well to unseen shape categories. SPR performs significantly better but its smoothness prior tends to over-smooth shapes and close holes. Points2Surf better preserves holes and details, at the cost of a slight increase in topological noise. Our technique also generalizes to unseen point-cloud acquisition methods.

4.3 Ablation Study

We evaluate several design choices against our default configuration (\(e_{\text {vanilla}}\)) in an ablation study, as shown in Table 2. We evaluate the number of nearest neighbors \(k=300\) that form the local patch by decreasing and increasing k by a factor of 4 (\(k_{\text {small}}\) and \(k_{\text {large}}\)), effectively halving and doubling the size of the local patch. A large k performs significantly worse because we lose local detail with a larger patch size. A small k still works reasonably well, but gives a lower performance, especially with strong noise. We also test a fixed radius for the local patch, with three different sizes (\(r_{\text {small}} := 0.05L\), \(r_{\text {med}} := 0.1L\) and \(r_{\text {large}} := 0.2L\)). A fixed patch size is less suitable than nearest neighbors when computing the distance at query points that are far away from the surface, giving a lower performance than the standard nearest-neighbor setting. The next variant uses a single shared encoder (\(e_{\text {shared}}\)) for both the global sub-sample \(\mathbf {p}^s_x\) and the local patch \(\mathbf {p}^d_x\), by concatenating the two before encoding them. The performance of \(e_{\text {shared}}\) is competitive, but shows that using two separate encoders increases performance. Omitting the QSTN (\(e_{\text {no}\_\text {QSTN}}\)) speeds up training by roughly 10% and yields slightly better results, probably because our outputs are rotation-invariant, in contrast to the normals of PCPNet. Using a uniform global sub-sample (\(e_{\text {uniform}}\)) increases the quality over the distance-dependent sub-sampling in \(e_{\text {vanilla}}\); the uniform sub-sample preserves more information about the far side of the object, which benefits the inside/outside classification. Due to resource constraints, we trained all models in Table 2 for 50 epochs only. For applications where speed, memory, and simplicity are important, we recommend using a shared encoder without the QSTN and with uniform sub-sampling.

Table 2. Ablation Study. We compare Points2Surf (\(e_{\text {vanilla}}\)) to several variants and show the Chamfer distance relative to Points2Surf. Please see the text for details.

5 Conclusion

We have presented Points2Surf as a method for surface reconstruction from raw point clouds. Our method reliably captures both geometric and topological details, and generalizes to unseen shapes more robustly than current methods.

A limitation of our method is that the distance-dependent global sub-sample may cause inconsistencies between the outputs of neighboring patches, which results in bumpy surfaces.

One interesting future direction would be to perform multi-scale reconstructions, where coarser levels provide consistency for finer levels, reducing the bumpiness. This should also decrease the computation times significantly. Finally, it would be interesting to develop a differentiable version of Marching Cubes to jointly train SDF estimation and surface extraction.