
1 Introduction

As low-cost sensors such as depth cameras and LiDAR become increasingly available, 3D data has gained considerable attention in the vision and robotics communities. However, viewpoint occlusion and low sensor resolution in 3D scans often lead to incomplete shapes, which cannot be used directly in practical applications. To this end, it is desirable to recover a complete 3D model from a partial shape, which has significant value in a variety of tasks such as 3D reconstruction  [4, 7, 8, 45], robotics  [38], scene understanding  [3, 46] and autonomous driving  [26].

Recent learning-based works succeed in performing shape completion on volumetric representations of 3D objects, such as occupancy grids or TSDF volumes, where convolution operations can be applied directly  [4, 11, 19, 43, 47]. However, volumetric representations incur high memory costs and limited shape fidelity. In contrast, a point cloud is a more compact and finer-grained representation of 3D data than voxels, but is harder to incorporate into a neural network due to its irregular structure. Recently, several network architectures have been designed for direct point cloud shape completion, such as PCN  [51], TopNet  [36] and RL-GAN-Net  [31], demonstrating the advantage of learning-based point cloud shape completion. These methods are all based on encoder-decoder networks, where the complete 3D model is recovered from a global feature vector produced by the encoder. Though the encoded vector can represent the overall shape, relying on it alone in shape completion discards geometric details present in the original point cloud.

Fig. 1. (a) Given a partial input, the proposed network reconstructs the complete shape from known and missing parts using separated feature representations. (b) Compared with PCN  [51] and TopNet  [36], our method performs better in both detail preservation and latent shape prediction.

To address this problem, rather than using only a global feature vector to generate the whole complete model, we extract multi-level features and aggregate different levels to represent the known and missing parts separately, which contributes to both preserving existing details and predicting the missing shape, as illustrated in Fig. 1(a).

We first extract multi-level features for each point with a hierarchical feature learning architecture. These features represent the existing details well, but they do not carry enough cues about the missing part. If we use them directly for shape completion, the generated points tend to cluster around the existing partial input. To this end, we aggregate diverse features for the known and missing parts separately. We propose two feature aggregation strategies, i.e., global & local feature aggregation (GLFA) and residual feature aggregation (RFA). In global & local feature aggregation, the main idea is to leverage local features to represent the known part while aggregating global features for the missing part. In residual feature aggregation, we represent the missing part with residual features that indicate the difference between the holistic shape and the partial model. We then combine the known and missing part features with a proper ratio and reconstruct the complete model from the combined features. Finally, a refinement component is designed to make the generated points uniformly distributed and to reduce outliers.

To summarize, our main contributions are:

  • We propose a novel learning-based point cloud completion architecture that achieves better performance on both detail preservation and latent shape prediction via a separated feature aggregation strategy.

  • We design a refinement component that refines the reconstructed coordinates to be uniformly distributed and reduces noise and outliers.

  • Extensive experiments demonstrate that our proposed network outperforms state-of-the-art 3D point cloud completion methods both quantitatively and qualitatively.

2 Related Work

Deep Learning on Point Cloud. Qi et al.  [28] first introduced PointNet, a deep learning network that uses symmetric functions to process point clouds directly. PointNet++  [29] captures local point cloud structure from neighborhoods at multiple scales. FoldingNet  [49] uses a novel folding-based decoder that deforms a canonical 2D grid onto the underlying shape surface. A series of network architectures have since been proposed for point cloud analysis  [2, 13, 14, 18, 20, 22, 30, 35, 39, 40, 42, 44, 53, 54] and related applications such as object detection  [5, 10, 17, 26, 27, 32, 41, 52] and reconstruction [12, 39, 48].

Non-learning Based Shape Completion. Shape completion has long been a problem of interest in the graphics and vision fields. Effective descriptors were developed in the early years, such as  [15, 24, 34], which leverage geometric cues to fill missing parts of a surface; these methods are usually limited to filling small holes. Another way to complete a shape is to use symmetric structure as a prior  [23, 25, 37]. However, such methods work well only when the missing part can be inferred from the existing partial model. Other researchers proposed data-driven methods  [16, 21, 33] that retrieve the most likely model for the partial input from a large 3D shape database. Though convincing results can be obtained, the matching process is time-consuming and scales with the database size.

Learning Based Shape Completion. Recently, more researchers have turned to learning methods for 3D vision tasks. Learning-based shape completion methods usually use a deep neural network with an encoder-decoder architecture to directly map the partial input to a complete shape. Most pioneering works [4, 11, 19, 43, 47] rely on volumetric representations where convolution operations can be applied directly. Volumetric representations incur large computation and memory costs, so most works operate on low-resolution voxel grids and lose details. To avoid these limitations, Yuan et al. proposed PCN  [51], which generates the complete shape directly from a partial point cloud. PCN recovers the complete point cloud in a two-stage process: it first generates a coarse, low-resolution result and then recovers the final output from it. TopNet  [36] explores a hierarchical rooted tree structure as a decoder to generate arbitrary groupings of points for completion. RL-GAN-Net  [31] presents a completion framework that uses a reinforcement learning agent to control a GAN generator. All these approaches generate the complete point cloud by feeding an encoded global feature vector to a decoder network. Though the global feature can roughly represent the underlying holistic surface, it discards many details present in the partial input.

Fig. 2. Overall network architecture. Given a partial point cloud of N points with XYZ coordinates, our network extracts multi-level features and interpolates each level's features to the same size \(N \times C_m\) (implemented with PointNet++  [29] layers). Multi-level features are then aggregated to represent the known and missing parts separately with a feature aggregation strategy. A high-resolution completion result \(Y_{rec}\) of size \(rN \times 3\) is reconstructed from the expanded features, where r is the expansion factor. The reconstructed result is finally refined by a refinement component.

3 Network Architecture

The overall architecture of our network is shown in Fig. 2. First, multi-level features are extracted via hierarchical feature learning to efficiently represent both local and global properties. Then we aggregate features for the known and missing parts of the point set separately to provide explicit cues for detail preservation and shape prediction. After that, we expand the features and reconstruct coordinates from the expanded features to obtain a high-resolution result. Finally, a refinement component makes the complete point cloud uniformly distributed and smooth.

3.1 Multi-level Features Extraction

Both local and global structures can be explored efficiently by extracting features at different levels. We extract multi-level features via the hierarchical feature learning architecture proposed in PointNet++  [29], which is also used in PU-Net [50]. The feature extraction module consists of multiple sampling and grouping layers that embed multi-level features, and then interpolates each level's features to the same point number and feature size of \(N \times C_m\), where N is the input point number and \(C_m\) is the interpolated multi-level feature dimension. We progressively capture features with gradually increasing grouping radius, and denote the features at the i-th level as \(f_{i}\). Different from PU-Net  [50], we also extract a global feature \(f_{global}\) of size \(1 \times C_m\) from the last-level features and duplicate it to \(N \times C_m\).
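The per-level interpolation follows the feature propagation step of PointNet++. A minimal sketch of that step in numpy, with illustrative function and argument names:

```python
import numpy as np

def interpolate_to_input(level_xyz, level_feat, input_xyz, k=3, eps=1e-8):
    """Upsample one level's features back to the N input points via
    inverse-distance-weighted interpolation over the k nearest level
    points (the feature propagation step of PointNet++).

    level_xyz:  (M, 3) coordinates of the subsampled level points
    level_feat: (M, C) features at those points
    input_xyz:  (N, 3) original input coordinates
    returns:    (N, C) interpolated features, one row per input point
    """
    # pairwise distances between input points and level points: (N, M)
    d = np.linalg.norm(input_xyz[:, None, :] - level_xyz[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                    # k nearest level points
    nd = np.take_along_axis(d, idx, axis=1)               # (N, k) distances
    w = 1.0 / (nd + eps)
    w /= w.sum(axis=1, keepdims=True)                     # normalized inverse-distance weights
    return (level_feat[idx] * w[..., None]).sum(axis=1)   # (N, C)
```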

3.2 Separated Feature Aggregation

If we directly concatenate the multi-level features for completion, the main problem is that these features preserve the existing details but do not capture enough information to complete the missing shape. The generated points tend to cluster around the original partial input, even if we add a repulsion loss on the generated results. We discuss this problem in the experiments (Sect. 4.4).

To this end, we provide explicit cues for the network to balance detail preservation and shape prediction, representing the known and missing parts with diverse features. For the known part features, denoted as \(f_{known}\), we want to ensure the propagation of the local structure of the input. The missing part features, denoted as \(f_{missing}\), can be regarded either as features predicted from the global structure or as features indicating the difference between the complete shape and the existing partial structure.

Fig. 3. Illustration of the proposed feature aggregation strategies and the process of feature expansion and reconstruction.

Taking advantage of the multi-level features, we propose and compare two feature aggregation strategies, namely GLFA and RFA, which are illustrated in Fig. 3(a) and (b).

Global & Local Feature Aggregation (GLFA). Intuitively, the known part should be represented with more local features to keep the original details, whereas the missing part should be expressed with more global features to predict the latent underlying surface. Generally, lower-level feature layers in a network correspond to local features at smaller scales, while higher-level features relate more to global structure. We therefore represent \(f_{known}\) and \(f_{missing}\) as:

$$\begin{aligned} \begin{aligned} f_{known}&=[f_1, \ldots , f_{m}, C_{origin}, f_{global}],\\ f_{missing}&= [f_{n-m+1}, \ldots , f_n, C_{missing}, f_{global}],\\ C_{missing}&= \mathcal {T}(C_{origin}),\\ m&= {\lfloor {\frac{n}{2}} \rfloor } + 1,\\ \end{aligned} \end{aligned}$$
(1)

where n is the total number of levels. We aggregate the first m levels of features \([f_1, \ldots , f_m]\) into \(f_{known}\) and the last m levels \([f_{n-m+1}, \ldots , f_n]\) into \(f_{missing}\).

\(C_{origin}\) denotes the original point coordinates of the partial point cloud, and \(C_{missing}\) indicates possible coordinates of the missing part. Initially we simply used \(C_{origin}\) as \(C_{missing}\), but we found that the network converges faster if \(C_{missing}\) is given more appropriate initial coordinates. We therefore apply a learnable transformation net \(\mathcal {T}(\cdot )\) that transforms \(C_{origin}\) into suitable coordinates for \(C_{missing}\). \(\mathcal {T}(\cdot )\) is similar to the T-Net proposed in PointNet [28], containing a series of multi-layer perceptrons, a predicted transform matrix and a coordinate bias. Our motivation for using a transformation net is that most objects in the world exhibit symmetry, so a proper transformation of the original coordinates can better fit the missing part coordinates.
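As a sketch of how Eq. (1) can be assembled, assuming PyTorch tensors and placeholder modules (`t_net` stands in for \(\mathcal {T}(\cdot )\)); this is illustrative rather than the exact implementation:

```python
import torch

def aggregate_glfa(level_feats, c_origin, f_global, t_net):
    """GLFA (Eq. 1): the first m levels feed the known part, the last m
    levels feed the missing part; both get coordinates and the global feature.

    level_feats: list of n tensors of shape (N, C_m)
    c_origin:    (N, 3) input coordinates
    f_global:    (N, C_m) duplicated global feature
    """
    n = len(level_feats)
    m = n // 2 + 1                                               # m = floor(n/2) + 1
    c_missing = t_net(c_origin)                                  # T(C_origin), (N, 3)
    f_known = torch.cat(level_feats[:m] + [c_origin, f_global], dim=1)
    f_missing = torch.cat(level_feats[n - m:] + [c_missing, f_global], dim=1)
    return f_known, f_missing
```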

Residual Feature Aggregation (RFA). A more direct way to address this issue is to represent the missing part with residual features between the global shape and the known part. We first compute the difference between the global feature vector and the known part features in feature space. A shared multi-layer perceptron then maps this difference to residual features in the latent feature space. \(f_{known}\) and \(f_{missing}\) are represented as:

$$\begin{aligned} \begin{aligned} f_{known}&= [f_1, \ldots , f_n, C_{origin}, f_{global}],\\ f_{missing}&= [f_{res_1} , \ldots , f_{res_n}, C_{missing}, f_{global}],\\ C_{missing}&= \mathcal {T}(C_{origin}),\\ f_{res_i}&=\mathbf {MLP}([f_{global} - f_i]), \end{aligned} \end{aligned}$$
(2)

where \(f_{res_i}\) denotes the residual features between \(f_{global}\) and \(f_i\), and \(\mathbf {MLP}(\cdot )\) is a shared MLP of several fully connected layers.
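A corresponding sketch of Eq. (2) under the same assumptions as above, with `res_mlp` as a placeholder for the shared MLP:

```python
def aggregate_rfa(level_feats, c_origin, f_global, t_net, res_mlp):
    """RFA (Eq. 2): the known part keeps all n levels; the missing part
    is built from residuals between the global feature and each level,
    mapped through a shared MLP.
    """
    c_missing = t_net(c_origin)                                  # T(C_origin)
    f_known = torch.cat(level_feats + [c_origin, f_global], dim=1)
    residuals = [res_mlp(f_global - f) for f in level_feats]     # f_res_i, i = 1..n
    f_missing = torch.cat(residuals + [c_missing, f_global], dim=1)
    return f_known, f_missing
```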

3.3 Feature Expansion and Reconstruction

Feature Expansion with Combination. After feature aggregation, we combine the known and missing features and expand them to a high-resolution representation. As shown in Fig. 3(c), we expand the feature number by duplicating \(f_{known}\) and \(f_{missing}\) j and k times respectively, where \(j+k=r\). We then apply r separate MLPs to the r feature sets to generate diverse point sets; each MLP shares weights within its feature set.

An important factor is how we define the combination ratio of j and k. In our experiments, we define \(j:k=1:1\), as the visible part is often about a half of the whole object. We discuss the effect of the ratio in the supplementary material.

Coordinates Reconstruction. We reconstruct a coarse model \(Y_{rec}\) from the expanded features via a shared MLP, producing point coordinates of size \(rN\times 3\). The coordinates reconstructed from the known part and missing part features are denoted as \(Y_{known}\) and \(Y_{missing}\) respectively.
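A minimal sketch of the expansion and reconstruction steps, again with hypothetical module names (`branch_mlps` for the r separate MLPs, `coord_mlp` for the shared coordinate regressor):

```python
def expand_and_reconstruct(f_known, f_missing, branch_mlps, coord_mlp, j, k):
    """Duplicate the known/missing features j and k times (j + k = r),
    diversify each copy with its own MLP, then regress coordinates with
    a shared MLP to obtain Y_rec of shape (r*N, 3).
    """
    copies = [f_known] * j + [f_missing] * k             # r feature sets, each (N, C)
    diverse = [mlp(f) for mlp, f in zip(branch_mlps, copies)]
    coords = [coord_mlp(f) for f in diverse]             # r tensors of shape (N, 3)
    return torch.cat(coords, dim=0)                      # first j*N rows form Y_known
```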

3.4 Refinement Component

In practice, we notice that the reconstructed points tend to lie too close to each other due to the high correlation of the duplicated features, and they suffer from noise and outliers. We therefore design a refinement component to improve the distribution uniformity and remove invalid points.

We first apply farthest point sampling (FPS) to obtain a relatively uniform subsampled point set of \(\frac{r}{2}N\) points. However, the FPS result is random, since the subsampling depends on which point is selected first, and it cannot get rid of the noise and outliers in the point set.
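For reference, a plain-numpy sketch of FPS; note how the random first pick makes the result non-deterministic, and how any point, including an outlier, can survive:

```python
def farthest_point_sampling(xyz, m, seed=None):
    """Iteratively pick the point farthest from the set chosen so far.
    xyz: (P, 3) array of points; m: number of points to keep.
    """
    rng = np.random.default_rng(seed)
    idx = np.empty(m, dtype=int)
    idx[0] = rng.integers(len(xyz))                      # random first point
    dist = np.linalg.norm(xyz - xyz[idx[0]], axis=1)     # distance to the chosen set
    for i in range(1, m):
        idx[i] = int(dist.argmax())                      # farthest from the set
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[idx[i]], axis=1))
    return xyz[idx]
```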

We then apply an attention module to remove the incorrect points. The attention module comprises a shared MLP of three stacked fully connected layers, with a softplus activation on the last layer, that produces a score map; it selects the top tN features and points as shown in Fig. 2. The input to the attention module is the \(\frac{r}{2}N \times C\) features corresponding to the sampled points, and the output is the indices of the tN most significant points. The selected points are denoted as \(Y_{att}\).
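A sketch of this selection step, with `score_mlp` as a stand-in for the three-layer shared MLP:

```python
def attention_select(points, feats, score_mlp, tn):
    """Score every point with a shared MLP + softplus and keep the top tN.
    points: (P, 3); feats: (P, C); score_mlp maps (P, C) -> (P, 1).
    """
    scores = torch.nn.functional.softplus(score_mlp(feats)).squeeze(-1)  # (P,)
    top = torch.topk(scores, tn).indices                 # indices of the tN best points
    return points[top], feats[top]                       # Y_att and its features
```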

Finally, we apply a local folding unit, as proposed in PCN  [51] building on FoldingNet  [49], to approximate a smooth high-resolution surface. The input to the local folding unit is the \(tN \times (3+C)\) points and corresponding features. For each point \({x_i}\), a patch of \(u^2\) points is generated in local coordinates centered at \({x_i}\) via the folding operation and transformed into global coordinates by adding \({x_i}\) to the output. The final output is the refined point coordinates of size \(rN\times 3\).
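A sketch of the local folding step under the same assumptions; `fold_mlp` and the grid scale are illustrative choices, not the paper's exact values:

```python
def local_folding(points, feats, fold_mlp, u=2, grid_scale=0.05):
    """For each of the tN points, fold a u*u 2D grid into a local 3D
    patch conditioned on the point's feature, then shift the patch to
    the point's global position. Output: (tN * u * u, 3).
    """
    t_n, c = feats.shape
    g = torch.linspace(-grid_scale, grid_scale, u)
    grid = torch.cartesian_prod(g, g)                    # (u*u, 2) canonical grid
    grid = grid.unsqueeze(0).expand(t_n, -1, -1)         # (tN, u*u, 2)
    f = feats.unsqueeze(1).expand(-1, u * u, -1)         # (tN, u*u, C)
    patch = fold_mlp(torch.cat([f, grid], dim=-1))       # (tN, u*u, 3) local coords
    return (patch + points.unsqueeze(1)).reshape(-1, 3)  # shift into the global frame
```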

3.5 Loss Function

To measure the difference between two point clouds \((S_{1},S_{2})\), Chamfer Distance (CD) and Earth Mover's Distance (EMD) were introduced in recent work  [6]. Like [31, 36], we choose Chamfer Distance due to its efficiency compared with EMD:

$$\begin{aligned} \mathcal {L}_{CD}(S_{1},\!S_{2})\!=\!\frac{1}{\vert S_{1}\vert }\sum _{x\in S_{1}}\!\min _{y\in S_{2}}\Vert x-y\Vert _{2} +\frac{1}{\vert S_{2}\vert }\sum _{y\in S_{2}}\!\min _{x\in S_{1}}\Vert y-x\Vert _{2}. \end{aligned}$$
(3)
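A dense-matrix sketch of Eq. (3) in PyTorch; this is fine at the point counts used here, though practical implementations typically use a dedicated CUDA kernel:

```python
def chamfer_distance(s1, s2):
    """Symmetric Chamfer Distance of Eq. (3). s1: (P1, 3); s2: (P2, 3)."""
    d = torch.cdist(s1, s2)                              # (P1, P2) pairwise L2 distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```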

Besides the FPS algorithm, we also explicitly encourage a uniform distribution of the attention module's output through the loss function, to reduce incorrect points. We apply the repulsion loss proposed in PU-Net  [50], defined as:

$$\begin{aligned} \mathcal {L}_{rep}(S)=\sum _{x_i\in S}\sum _{x_{i^{\prime }}\in K(x_i)}\eta (\Vert x_{i^{\prime }}-x_{i}\Vert )w(\Vert x_{i^{\prime }}-x_{i}\Vert ), \end{aligned}$$
(4)

where \(\eta (\cdot )\) and \(w(\cdot )\) are two repulsion terms that penalize \(x_{i}\) if it is too close to its neighboring points \(x_{i^{\prime }}\) in the neighborhood \(K(x_i)\).
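As a sketch, one instantiation of Eq. (4) follows the PU-Net choices \(\eta (r) = -r\) and \(w(r) = e^{-r^2/h^2}\); the neighborhood size k and bandwidth h below are illustrative:

```python
def repulsion_loss(xyz, k=5, h=0.03):
    """Eq. (4) with eta(r) = -r and w(r) = exp(-r^2 / h^2), summed over
    each point's k nearest neighbors. xyz: (P, 3).
    """
    d = torch.cdist(xyz, xyz)                            # (P, P) pairwise distances
    knn = d.topk(k + 1, largest=False).values[:, 1:]     # k neighbors, dropping self
    return (-knn * torch.exp(-(knn ** 2) / h ** 2)).sum()
```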

We jointly train the network by minimizing the following loss function:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{sum}=\alpha \mathcal {L}_{CD}(Y_{rec}, {Y}_{gt}) + \mathcal {L}_{CD}(Y_{att},{Y}_{gt})\\ +\, \mathcal {L}_{CD}(Y_{final}, {Y}_{gt}) + \beta \mathcal {L}_{rep}(Y_{att}), \end{aligned} \end{aligned}$$
(5)

where we set \(\alpha =0.5\), as \(Y_{rec}\) does not need to be a very accurate result, and \(\beta = 0.2\) in our experiments. Note that we apply the repulsion loss only on \(Y_{att}\): the local folding unit approximates a smooth surface that represents the local geometry of the shape, so we only need to ensure that \(Y_{att}\) is uniformly distributed.
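Putting the pieces together, a sketch of Eq. (5) that reuses the chamfer_distance and repulsion_loss sketches above:

```python
def total_loss(y_rec, y_att, y_final, y_gt, alpha=0.5, beta=0.2):
    """Joint objective of Eq. (5) with the paper's alpha and beta."""
    return (alpha * chamfer_distance(y_rec, y_gt)
            + chamfer_distance(y_att, y_gt)
            + chamfer_distance(y_final, y_gt)
            + beta * repulsion_loss(y_att))
```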

We do not explicitly constrain \(Y_{known}\) and \(Y_{missing}\) in the loss function to be close to the known and missing points respectively. If we supervised \(Y_{known}\) with the partial input, the whole completion task would be left to the reconstruction from the missing part features. Instead, we expect the network itself to learn to reconstruct diverse coordinates according to the different feature representations of the known and missing parts, so that completion can proceed in both \(Y_{known}\) and \(Y_{missing}\). Our experiments demonstrate this in Sect. 4.3.

Table 1. Quantitative comparison with state-of-the-art methods on known categories, using Chamfer Distance multiplied by \(10^4\) as the metric.
Table 2. Quantitative comparison with state-of-the-art methods on novel categories, using Chamfer Distance multiplied by \(10^4\) as the metric.

4 Experiments

Training Set. We conduct our experiments on the dataset used by PCN  [51], which is a subset of the ShapeNet dataset. The ground truth contains 16384 points uniformly sampled from each mesh, and the partial inputs of 2048 points are generated by back-projecting 2.5D depth images into 3D. The training set contains 28974 models from 8 categories. Each model comes with a complete point cloud and about 7 partial point clouds taken from different viewpoints for data augmentation. The validation set contains 100 models.

Testing Set from ShapeNet. The testing set from ShapeNet [1] is divided into two parts: one contains the 8 known object categories on which the models are trained; the other contains 8 novel categories that are not in the training set. The novel categories are further divided into two groups: one visually similar to the known categories and one visually dissimilar. There are 150 models in each category.

Testing Set from KITTI. We also test our method on real-world scans from the KITTI dataset [9]. The test scans are cars extracted from each frame according to the ground truth object bounding boxes. The testing set contains 2483 partial point clouds labeled as cars.

Training Setup. We train our model for 30 epochs with a batch size of 8. The initial learning rate is 0.0007, decayed by a factor of 0.7 every 50,000 iterations. We set the other parameters to \(r=8\), \(t=2\), \(u=2\) in our implementation.

Fig. 4. Qualitative comparisons on both known and novel categories.

4.1 Completion Results on ShapeNet

We qualitatively and quantitatively compare our network on both known and novel categories with several state-of-the-art point cloud completion methods: FC (PointNet auto-encoder)  [28], FoldingNet  [49], PCN  [51], and TopNet  [36]. For FC, FoldingNet and PCN, we use the pre-trained models and evaluation code released in PCN's public GitHub project. For TopNet, we use their public code and retrain their network on the PCN training set. For clarity, we denote our network with the GLFA and RFA separated feature aggregation strategies as NSFA-GLFA and NSFA-RFA respectively, and choose the variant with 5 feature extraction levels as our baseline network. Tables 1 and 2 show the quantitative comparison results on known and novel categories respectively. Both NSFA-GLFA and NSFA-RFA achieve lower values on the evaluation metric in both known and novel categories.

Beyond the quantitative results, since we claim that our method preserves shape details, we select qualitative examples of richly detailed models from the testing set. Results are shown in Fig. 4, where (a)-(d) are from known categories and (e)-(i) are from novel categories. We observe that FC, FoldingNet, PCN and TopNet can predict an overall shape, but most details of the model are lost. In contrast, NSFA-GLFA and NSFA-RFA perform better in both detail preservation and missing shape prediction.

4.2 Completion Results on KITTI

As there are no complete ground truth point clouds for KITTI, we use two metrics proposed in PCN to evaluate performance quantitatively: 1) Fidelity error, the average distance from each point in the input to its nearest neighbor in the output, which measures how well the input is preserved; 2) Minimal Matching Distance (MMD), the Chamfer Distance between the output and the ShapeNet car point cloud closest to the output in terms of CD, which measures how much the output resembles a typical car.
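A sketch of both metrics, reusing the chamfer_distance sketch above; the list of candidate ShapeNet cars is an assumed input:

```python
def fidelity_error(inp, out):
    """Mean distance from each input point to its nearest output point;
    low values mean the partial input is well preserved."""
    return torch.cdist(inp, out).min(dim=1).values.mean()

def minimal_matching_distance(out, shapenet_cars):
    """CD between the output and its closest ShapeNet car, scanning a
    list of candidate car point clouds."""
    return min(chamfer_distance(out, car) for car in shapenet_cars)
```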

Fig. 5. Results on the KITTI dataset. The numbers below the completion results show the fidelity error for each model.

Table 3. Quantitative comparison on KITTI with state-of-the-art methods using the Fidelity and MMD metrics.

The quantitative and qualitative results are shown in Table 3 and Fig. 5 respectively, with the fidelity error of each model attached to the qualitative results. We use NSFA-RFA for this comparison, as it performs better than NSFA-GLFA on the car category of ShapeNet.

From both the quantitative and qualitative results, our method achieves the lowest fidelity error, consistent with our claim that our network preserves the existing details better. In Fig. 5, the other methods also produce reasonable results, but some details of the partial model are lost. Besides, the performance of all methods on the MMD metric is very close, indicating that the results from all methods resemble typical cars.

4.3 Reconstructed Coordinates Visualization

Since we consider the known and missing parts separately, we visualize the reconstructed coordinates \(Y_{known}\) and \(Y_{missing}\) in \(Y_{rec}\) to validate our method. Figure 6 shows visualization examples for both NSFA-GLFA and NSFA-RFA. In general, \(Y_{known}\) is closer to the partial input and \(Y_{missing}\) is closer to the missing shape. Meanwhile, \(Y_{known}\) also exhibits some completion effects, and \(Y_{missing}\) contains some original points. This is reasonable, as we aggregate the global feature into both part features, so completion can proceed in both parts. In particular, the completion effects on \(Y_{known}\) are more significant for NSFA-RFA than for NSFA-GLFA. The reason may be that NSFA-RFA concatenates all level features into \(f_{known}\) while NSFA-GLFA uses only the low-level local features, making the completion effects more apparent on the \(Y_{known}\) of NSFA-RFA.

Fig. 6. Reconstructed coordinates visualized from the known and missing part features.

4.4 Feature Aggregation Strategy Evaluation

As mentioned in Sect. 3.2, directly concatenating the multi-level features for completion leads to an imbalance between detail preservation and shape completion. We denote our network without feature aggregation strategies as NWoSFA and evaluate the proposed strategies with the qualitative examples shown in Fig. 7. The generated points of NWoSFA tend to gather at the original input. This effect is relieved in NSFA-GLFA, and the generated points of NSFA-RFA spread well over the surface. The reason NSFA-RFA performs better than NSFA-GLFA may be its more significant completion effects on the known part, as mentioned in Sect. 4.3, which help fill the missing shape better. The quantitative results are shown in Table 5. Notably, NSFA-RFA and NSFA-GLFA significantly boost performance on the known categories. For novel categories, NSFA-GLFA achieves the lowest Chamfer Distance, while NSFA-RFA does not work as well. This indicates that it is hard to generate missing part features through residual feature aggregation for a completely unseen category; in contrast, NSFA-GLFA uses the global feature to represent the missing part, which is more suitable for novel categories.

Fig. 7. Qualitative evaluation of the proposed feature aggregation strategies.

Table 4. Quantitative evaluation of the proposed refinement component on known and novel categories.
Table 5. Quantitative evaluation of the proposed feature aggregation strategies.

4.5 Effectiveness of Refinement Component

We also evaluate the effectiveness of the refinement component. Some intermediate results produced by NSFA-GLFA are shown in Fig. 8. The FPS algorithm improves the uniformity of the point distribution but cannot remove the incorrect points. With the attention module, the generated points are more uniformly distributed with fewer outliers and less noise. Finally, the local folding unit spreads the points into a high-resolution result with a smooth surface. The quantitative results are shown in Table 4.

Fig. 8. Effectiveness of the refinement component. The FPS module makes the result more uniformly distributed, the attention module removes invalid points, and the local folding unit spreads the points into a smooth surface.

Fig. 9. Symmetrical characteristic during completion.

4.6 Symmetrical Characteristic During Completion

During completion, we find that our network tries to exploit the symmetry of the object to complete the model. As shown in Fig. 9, \(Y_{missing}\) is close to the partial input after a proper transformation. This indicates that details can be preserved not only in the partial input but also in the predicted symmetrical part, by taking advantage of the object's symmetry.

5 Conclusion

In this work we have proposed a network for point cloud completion with two separated feature aggregation strategies, namely GLFA and RFA, which consider the known part and the missing part separately. RFA achieves better overall performance on known categories, while GLFA shows its advantage on novel categories. Both strategies yield significant improvements over previous methods in both detail preservation and shape prediction.