
1 Introduction

Over the last two decades, industrial robots have been widely used across manufacturing to enable efficient automated production. Vision-guided industrial robots are a key part of state-of-the-art manufacturing processes. One of the main tasks for an industrial robot is picking an object and placing it at a specified location. This task is well solved in the 2D placement case, whereas the 3D scattered stacking case, illustrated in Fig. 1, remains a challenge. 3D object detection and pose estimation are essential for robot bin picking in the 3D case. The goal is to locate pickable object instances and determine their 6D poses, i.e. a rotation matrix and a translation vector.

Fig. 1. Left: robot bin picking system. Middle: objects are stacked randomly in a bin. Right: scene point clouds from the 3D sensor. The mismatching regions are marked with red bounding boxes. (Color figure online)

Various algorithms have been proposed for 3D object detection and pose estimation based on 2D images, range images or 3D point clouds. However, few of them are suited to real-world industrial robot bin picking in the 3D scattered stacking case, for several reasons: (a) industrial objects are generally texture-less and simple in shape, so traditional 2D/3D key point based object detection and pose estimation are not appropriate; (b) 3D scattered stacked objects form a heavily cluttered environment with occlusions; (c) the 3D sensor may not capture complete scene information because of reflection, transmission and absorption; (d) industrial bin picking requires wrong pose hypotheses to be eliminated. Therefore, no satisfactory general solution to this task exists yet.

Several methods in the literature tackle this problem using different input data. We focus here on the Point Pair Feature (PPF) approach from [1], which relies on point cloud data. This algorithm has been improved and extended by many authors [2,3,4,5,6,7] and has been used in various robot bin picking applications [5, 8]. Recent studies [4] have shown its potential to deal with sensor noise, background clutter and occlusions, which are the major weakness of another popular pose estimation algorithm, LINEMOD [9, 10]. Moreover, smarter sampling strategies [2,3,4] and GPU acceleration [6] enable the PPF method to obtain fast and robust pose estimates in heavily cluttered scenes with occlusions. However, this method may fail in mismatching regions caused by false features or features with insufficient information. In Fig. 1, the mismatching regions are marked with red bounding boxes; in these regions a wrong pose may be given a high score. The robot may then try to pick these high-score, wrongly detected objects, which reduces system robustness. In fact, even a human may not be able to identify the correct poses in these regions of the point cloud. To tackle this problem, we propose a new clustering strategy within the PPF based 3D object detection and pose estimation framework to further improve the pose hypothesis result. Our main contribution is the use of Density Based Spatial Clustering of Applications with Noise (DBSCAN) [11, 12] and Principal Component Analysis (PCA) in the PPF method. By studying the clustering step further, we introduce a probable pose distribution perspective for mismatching region detection.

The remainder of this paper is organized as follows. Section 2 gives an overview of state-of-the-art 3D object detection and pose estimation methods as well as their applications to robot bin picking. Our improved PPF method and the mismatching region detection are described in Sect. 3. Experimental results are presented and discussed in Sect. 4, and the conclusion follows in Sect. 5.

2 Related Work

This section gives an overview of state-of-the-art approaches to 3D object detection and pose estimation and introduces their application to robot bin picking.

2.1 3D Object Detection and Pose Estimation

Many algorithms have been proposed for 3D object detection and pose estimation. We focus only on recent work here and split it into several categories.

Point Wise Correspondence Based Methods. Iterative Closest Point (ICP) [13] is a classic point wise correspondence based method for point cloud registration, which can also be used for pose estimation. However, it requires a good initialization to avoid being trapped in a local minimum during optimization. Go-ICP [14] provides a globally optimal registration by using global branch-and-bound (BnB) optimization in ICP. Since ICP enables highly accurate point cloud registration, many algorithms use ICP as the refinement step of a coarse-to-fine pose estimation framework.

Template Based Methods. Template based methods, which are widely used in 2D object detection, have been extended to the 3D case in recent years [9, 10]. The templates store the appearance of the object from different viewpoints. An object is detected when a template matches the image, and the matching template gives its 3D pose. LINEMOD, presented in [9, 10], renders the 3D object model to generate a large number of templates covering the full view hemisphere. It combines 2D image gradients and 3D surface normals to build multiple modalities. Though such template based approaches can work accurately and quickly in practice, they still show robustness problems in cluttered environments with occlusions.

Local Feature Based Methods. Local feature based methods, including the point pair feature (PPF) [1], point feature histograms (PFH) [15] and 3D key points extended from traditional 2D key points, such as 3D-SURF [16], are robust to occlusions. A review can be found in [17]. These features are usually combined with other techniques, such as voting, for object detection. However, they are usually computationally expensive and have difficulty in heavily cluttered scenes. In this paper, we focus on one of the promising methods, the PPF based method [1]. Several improvements and extensions of the PPF based method have been proposed in recent years. The works [2,3,4] improve the PPF method [1] at almost every stage through smarter subsampling, voting and pose evaluation strategies. [4] shows its potential to handle scene noise, cluttered environments with occlusions, and the loss of information caused by subsampling, compared with the original PPF method [1] and other state-of-the-art algorithms [3, 9]. [5] discusses different point pair features for voting. [6] uses a color point pair feature and gives its GPU implementation. A RANSAC integrated PPF method is proposed in [7].

Learning Based Methods. With the rapid development of learning techniques, several learning based methods have been proposed for this task. [18, 19] use Iterative Hough Forest on local patches to detect objects and estimate their 3D poses. [20,21,22] extend the traditional convolutional neural network to the 3D case for object detection and pose estimation. These recent studies show the potential of state-of-the-art learning techniques for this task.

2.2 3D Object Pose Estimation in Robot Bin Picking

Since most industrial objects are texture-less, LINEMOD [9, 10] and PPF method [1, 5, 8] are popular in robot bin picking application. The first Amazon Picking Challenge [23, 24] presents some state-of-the-art robot bin picking frameworks combined with various methods and tricks. SegICP, presented in [25], shows the potential to combine CNN-based object detection with ICP on this task.

3 Proposed Method

We describe here our contributions, which make the PPF method [1] more robust for robot bin picking. This paper focuses on the pose clustering stage of the PPF method. In contrast to the traditional PPF method and its improvements, we take a probable pose distribution perspective on the PPF method and use it to detect mismatching regions in the scene. This section first introduces our improved PPF method and then describes our clustering strategy in detail.

3.1 Improved PPF Method

We revisit PPF based object detection and pose estimation pipeline and describe our improved PPF method here. The improved PPF method can be divided into five stages.

Preprocessing the Model and Scene Data. Generally, the model data is in CAD format and the scene data is a point cloud without normals. In order to use the PPF method, remeshing techniques are used to transform the model and scene data into uniform dense point clouds with normals. We also use subsampling to accelerate the PPF method.

Model Training. In this step, the point pair feature is used to create a hash table for the voting framework. For every point pair on the model surface, the feature \(\mathbf {F}_{m}(\mathbf {m}_{i},\mathbf {m}_{j})\) and the corresponding angle \(\alpha ^{model}_{ij}\) are calculated. The distances and angles in the PPF are discretized in steps of \(d_{dist}\) and \(d_{angle} = 2\pi /n_{angle}\), respectively. The hash table is indexed by the hashed PPF, and each stored element is \((\mathbf {m}_{i},\mathbf {m}_{j},\alpha ^{model}_{ij})\).
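The training step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the four-component feature of [1], i.e. the pair distance and the three angles between the two normals and the difference vector, and it stores only the point indices, omitting the angle \(\alpha ^{model}_{ij}\) for brevity. The function names and the default values of `d_dist` and `n_angle` are hypothetical.

```python
import math
from collections import defaultdict


def _angle(u, v):
    """Angle in [0, pi] between two 3D vectors (0 if either is zero)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    c = sum(x * y for x, y in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, c)))


def point_pair_feature(p1, n1, p2, n2):
    """Feature (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)) with d = p2 - p1."""
    d = tuple(b - a for a, b in zip(p1, p2))
    dist = math.sqrt(sum(x * x for x in d))
    return (dist, _angle(n1, d), _angle(n2, d), _angle(n1, n2))


def discretize(feature, d_dist, n_angle):
    """Quantize the distance in steps of d_dist and the angles in steps of 2*pi/n_angle."""
    d_angle = 2.0 * math.pi / n_angle
    dist, a1, a2, a3 = feature
    return (int(dist // d_dist), int(a1 // d_angle),
            int(a2 // d_angle), int(a3 // d_angle))


def build_hash_table(points, normals, d_dist=0.05, n_angle=30):
    """Map each discretized feature to the list of model point pairs (i, j) producing it."""
    table = defaultdict(list)
    for i in range(len(points)):
        for j in range(len(points)):
            if i == j:
                continue
            f = point_pair_feature(points[i], normals[i], points[j], normals[j])
            table[discretize(f, d_dist, n_angle)].append((i, j))
    return table
```

Point pairs with similar geometry fall into the same bucket, so a scene feature can retrieve all candidate model pairs with a single lookup.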

Voting. The voting framework is similar to the Generalized Hough Transform. Each local coordinate \((\mathbf {m}_{r},\alpha )\) is represented by a cell of a two-dimensional accumulator array, where \(\alpha \) is discretized using \(d_{angle}\). We subsample the scene points to obtain reference points. For each reference point \(s_{r}\), the point pair feature \(\mathbf {F}_{s}(s_{r},s_{j})\) and the corresponding angle \(\alpha ^{scene}_{rj}\) are calculated. The matching model features \(\mathbf {F}_{m}(\mathbf {m}_{r},\mathbf {m}_{j})\) are then retrieved from the hash table, and \(\alpha = \alpha ^{model}_{rj} - \alpha ^{scene}_{rj}\) is computed for each match. In the original version, the accumulator cell \((\mathbf {m}_{r},\alpha )\) then receives one vote. In contrast, we compute the vote value as

$$\begin{aligned} vote =\left\{ \begin{array}{lr} 1-\lambda |n_{1} \cdot n_{2}|, &{} 1-\lambda |n_{1} \cdot n_{2}| \ge v_{min} \\ v_{min}, &{} 1-\lambda |n_{1} \cdot n_{2}| < v_{min} \\ \end{array} \right. \end{aligned}$$
(1)

where \(|n_{1} \cdot n_{2}|\) denotes the absolute value of the dot product of the two normals of the point pair, \(\lambda \) is a weighting parameter set to 1 by default, and \(v_{min}\) is the lower bound of the vote, set to 0.1 by default. The weighted vote down-weights point pairs with nearly parallel normals, which are typical of planar surfaces. In addition, if the distance between two points is larger than the size of the model, the pair is skipped, because only point pairs belonging to the same object should vote. After all points \(s_{j}\) are processed, the peak in the accumulator array gives the local coordinate corresponding to \(s_{r}\). We thus obtain a set of possible object poses, which we call the probable pose distribution in this paper.
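As a small worked illustration of Eq. (1), the sketch below computes the weighted vote and adds it to an accumulator cell. It assumes unit-length normals; the function names and the dictionary-based accumulator are hypothetical choices made for this example only.

```python
from collections import defaultdict


def weighted_vote(n1, n2, lam=1.0, v_min=0.1):
    """Vote value of Eq. (1) for a point pair with unit normals n1 and n2."""
    dot = abs(sum(a * b for a, b in zip(n1, n2)))
    # Taking the maximum with v_min is equivalent to the two-case definition.
    return max(1.0 - lam * dot, v_min)


def cast_vote(accumulator, m_r, alpha_bin, n1, n2):
    """Add the weighted vote to the accumulator cell (m_r, alpha_bin)."""
    accumulator[(m_r, alpha_bin)] += weighted_vote(n1, n2)


# Accumulator indexed by (model reference point index, discretized alpha).
accumulator = defaultdict(float)
cast_vote(accumulator, 0, 3, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))  # parallel normals
cast_vote(accumulator, 0, 3, (0.0, 0.0, 1.0), (1.0, 0.0, 0.0))  # perpendicular normals
```

With the default parameters, a pair with parallel normals contributes only \(v_{min}\), while a pair with perpendicular normals contributes a full vote of 1.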

Pose Clustering. In the original version of the PPF method, the retrieved poses are sorted in descending order of the number of votes, so that the most likely pose hypotheses are grouped first. Proximal poses are then clustered by Euclidean distance, and the score of a cluster is the sum of the scores of the poses it contains. Finally, isolated poses with low scores are removed and the poses in each cluster are averaged to obtain the final pose hypotheses. In this paper, we use a new clustering strategy that combines DBSCAN and PCA to further improve the pose hypothesis result. It is described in detail in the following sections.
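The original greedy clustering described above can be sketched as follows. This is a simplified illustration that clusters on pose position only and ignores orientation; the function name and the `radius` and `min_score` parameters are hypothetical.

```python
import math


def greedy_cluster(poses, radius, min_score=0.0):
    """Greedy pose clustering on positions.

    poses: list of (position, votes), position being an (x, y, z) tuple.
    Returns a list of (mean_position, score) sorted by descending score.
    """
    # Process the most likely hypotheses first.
    ordered = sorted(poses, key=lambda p: p[1], reverse=True)
    clusters = []
    for pos, votes in ordered:
        for c in clusters:
            if math.dist(pos, c["center"]) <= radius:
                c["members"].append(pos)
                c["score"] += votes  # cluster score is the sum of member scores
                break
        else:
            clusters.append({"center": pos, "members": [pos], "score": votes})

    result = []
    for c in clusters:
        if c["score"] < min_score:
            continue  # drop isolated, low-score clusters
        n = len(c["members"])
        mean = tuple(sum(m[k] for m in c["members"]) / n for k in range(3))
        result.append((mean, c["score"]))
    result.sort(key=lambda r: r[1], reverse=True)
    return result
```

Because every cluster is a ball of fixed radius around a high-score seed, this strategy cannot follow an elongated or irregular group of hypotheses.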

Post Processing the Pose Hypotheses. After the high-score pose hypotheses are obtained, post processing is recommended to improve the accuracy of the detected poses. The pose hypothesis scores are reevaluated to reject wrong pose hypotheses, and the ICP algorithm is used to obtain accurate poses. One popular evaluation strategy is to count how many model points are found in the scene point cloud. However, false pose hypotheses can obtain high evaluation scores in a mismatching region, which passes a wrong pose to the robot. Hence, we first evaluate the pose hypotheses in the clustering stage to detect those that lie in a mismatching region.

3.2 Mismatching Region

Given a scene point cloud and a model point cloud, we define the mismatching region as a region where a wrong pose hypothesis can be given a high score. For example, traditional evaluation strategies in the post processing stage, such as rerendering the depth image or computing the area of the model surface detected in the scene, will assign a high score to a wrong pose hypothesis in the mismatching regions shown in Fig. 1. The mismatching region is caused by false features or features with insufficient information. These false features may appear when scene features are similar to model features, when multiple model features couple, or when scene features couple with model features.

Fig. 2. The probable pose distribution visualized on the mismatching region. Left: model point cloud. Middle: scene point cloud with visualization of the pose positions. Right: visualization of the pose orientations in normalized axis-angle space. (Color figure online)

To illustrate the cause of the mismatching region, we apply the PPF method to scenes containing mismatching regions and visualize the probable pose distribution, i.e. the set of high-vote pose hypotheses after the voting stage. The result is illustrated in Fig. 2. When matching a spherical feature, the pose positions gather at one point while the orientations are randomly distributed in the normalized axis-angle space, because the overlapping surface of the model can slide on the scene with little change in overlapping area. For a cube with six planar features, the pose positions can slide over a wide range on the planar scene while the orientations are constrained to six spatial curves. For real scene matching with the object in our robot bin picking system, the mismatching region clearly contains a large area over which the pose positions slide. The mismatching region is therefore the region where the overlapping surface of the model can slide on the scene with little change in overlapping area, and it can be detected by finding the sliding areas of the pose hypotheses in the probable pose distribution. The mismatching region depends strongly on the shape of the objects and on the way they are stacked. If the shape is simple, such as a cube or a cylinder, and the sensor can only scan part of it, the features cannot determine a specific pose because the pose hypotheses lack enough reliable constraints; this shows up as false features or features with insufficient information. For example, the object in our robot bin picking system, shown in Fig. 2, causes the mismatching region problem.

To detect the poses in the mismatching region, we propose a new clustering strategy on the probable pose distribution. Since we focus only on the robot bin picking application, we consider the pose position more significant than the orientation, which may vary for symmetric structures without affecting the bin picking. We use DBSCAN to cluster the probable poses by their position distribution and use PCA to find the size of each pose cluster. We then apply a threshold on the cluster size to detect the mismatching region.

3.3 DBSCAN: Density-Based Clustering

With an appropriate sampling step, we make the following assumptions about the PPF method: (a) Every reference point on an object returns a high-vote true pose hypothesis. (b) A highly visible object, on which many reference points are sampled, gathers plenty of high-vote pose hypotheses at its true pose, as illustrated in Fig. 2. (c) A scene feature dissimilar to the object returns a low-vote wrong pose hypothesis. (d) The high-vote pose hypotheses, i.e. the probable pose distribution, contain all visible object poses in the scene. (e) A mismatching region, where the overlap can slide, forms a dense cluster of pose hypotheses of arbitrary shape. The original PPF method uses a greedy strategy in the pose clustering stage: it takes the highest-score pose hypothesis as a cluster center and groups its nearest pose hypotheses around it. Repeating this loop yields clustered pose hypotheses in descending order of score, which are passed to post processing. The resulting cluster shapes are similar to those of K-means, which limits the use of the probable pose distribution.

Algorithm 1. Basic pipeline of DBSCAN.

Hence, we process the probable pose distribution with Density Based Spatial Clustering of Applications with Noise (DBSCAN) [11], which can discover clusters of arbitrary shape and also handles outliers effectively. DBSCAN defines a cluster as a set of density-connected points, controlled by two parameters, eps and minpts: eps is the radius of the neighborhood of a given point, and minpts is the minimum number of points required in the eps-neighborhood of a point to form a cluster. This paper implements the DBSCAN algorithm of [11, 12] for clustering the probable pose positions. As stated before, only the position of the pose is used for clustering. The basic pipeline of DBSCAN is shown in Algorithm 1.
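For reference, a compact DBSCAN over pose positions can be sketched as below. This is a brute-force illustration of the standard algorithm rather than the implementation used in this paper: neighborhoods are found by a linear scan, the eps-neighborhood of a point includes the point itself, and the function name and the noise label are choices made for this example.

```python
import math

NOISE = -1


def dbscan(points, eps, minpts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def region_query(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < minpts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= minpts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels
```

Because a cluster grows through chains of core points, an elongated band of sliding pose positions ends up in a single cluster, which a fixed-radius greedy grouping would split.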

Figure 3 shows the result of clustering the poses in position space for the scene data shown in Fig. 1. The high-density clusters are detected and marked with different colors, while isolated or low-density points are classified as noise. A pose cluster of small size evidently represents a specific true pose.

Fig. 3. Left: visualization of the DBSCAN result. Right: visualization of the PCA result. The oriented bounding boxes are drawn and the mismatching regions are shown in red. The noise poses are marked with crosses. (Color figure online)

3.4 Mismatching Region Detection

After obtaining the clustered pose hypotheses with DBSCAN, we use Principal Component Analysis (PCA) to extract cluster features. PCA is a popular machine learning and statistics technique for extracting the main directions of a dataset. In our approach, we first compute the center and main directions of the cluster, and then project the points of the cluster onto the main directions to obtain the cluster size, which represents the sliding area of the overlap. The sizes of all clusters are illustrated with 3D oriented bounding boxes in Fig. 3. The mismatching region is detected from the sum of the box dimensions and marked in red.
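The size computation can be sketched as follows. This is a minimal illustration that assumes NumPy is available; the function names are hypothetical, and the threshold on the summed box dimensions is a user-chosen parameter whose value is not specified in this paper.

```python
import numpy as np


def cluster_extents(positions):
    """Return (center, axes, extents) of the PCA-oriented bounding box of a cluster."""
    pts = np.asarray(positions, dtype=float)
    center = pts.mean(axis=0)
    centered = pts - center
    cov = centered.T @ centered / max(len(pts) - 1, 1)
    # eigh handles symmetric matrices; the columns of `axes` are the main directions.
    _, axes = np.linalg.eigh(cov)
    projected = centered @ axes
    extents = projected.max(axis=0) - projected.min(axis=0)
    return center, axes, extents


def is_mismatching(positions, size_threshold):
    """Flag a cluster whose summed box dimensions exceed the threshold."""
    _, _, extents = cluster_extents(positions)
    return float(extents.sum()) > size_threshold
```

A tight cluster around a true pose has a small box, whereas a band of sliding hypotheses produces a box that is long along at least one main direction and is therefore flagged.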

To obtain the final pose hypotheses, we apply the original greedy strategy within each cluster. The highest-vote pose hypothesis in the cluster is used to gather its neighborhood, and the final vote is the sum of all neighbor votes. The result represents the final pose hypothesis of the cluster. We apply this procedure to every cluster, including those in the mismatching region. Each cluster outputs a single pose hypothesis with its vote, and the pose hypotheses from the mismatching region are marked as unreliable. Finally, our approach returns only the high-vote reliable pose hypotheses to the post processing stage.
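This final selection step can be sketched as follows. It is a simplified, position-only illustration; the function name, the `radius` parameter and the convention that label -1 denotes DBSCAN noise are assumptions of this example.

```python
import math


def final_hypotheses(poses, labels, mismatching_labels, radius):
    """Select one pose hypothesis per cluster and separate out unreliable ones.

    poses: list of (position, votes); labels: cluster label per pose (-1 = noise);
    mismatching_labels: set of cluster labels flagged as mismatching regions.
    Returns (reliable, unreliable), each a list of (position, summed_votes).
    """
    reliable, unreliable = [], []
    for label in sorted(set(labels) - {-1}):
        members = [pose for pose, l in zip(poses, labels) if l == label]
        # The highest-vote hypothesis of the cluster acts as the seed.
        best_pos, _ = max(members, key=lambda pose: pose[1])
        # The final vote is the sum of the votes in the seed's neighborhood.
        votes = sum(v for pos, v in members if math.dist(pos, best_pos) <= radius)
        target = unreliable if label in mismatching_labels else reliable
        target.append((best_pos, votes))
    reliable.sort(key=lambda h: h[1], reverse=True)
    return reliable, unreliable
```

Only the first returned list would be passed on to post processing; the second is kept so that the hypotheses from mismatching regions can still be inspected or visualized.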

4 Experimental Results

We evaluate our approach on a real robot bin picking application with mismatching region detection. Given scene point cloud data, our algorithm not only detects the objects and estimates their poses, but also marks the poses in the mismatching region, as shown in Fig. 4. Multiple high-score pose hypotheses are detected, and wrong pose hypotheses with high evaluation scores are marked in red. Compared with the state-of-the-art PPF method, our approach improves the pose hypothesis result and enables robust 3D object detection and pose estimation in the robot bin picking application. As stated in Sect. 3, the mismatching region problem is caused by the shape of the objects, the way the objects are stacked and the capability of the sensor. If no mismatching region exists, our method behaves like the original version because DBSCAN then behaves like the original clustering. According to our analysis in Sect. 3, this method should also work in other object detection and pose estimation tasks.

Fig. 4. 3D object pose estimation with mismatching region detection on real scene data for robot bin picking. The objects in the mismatching region are marked in red. (Color figure online)

The algorithm takes several parameters in the DBSCAN and filtering stages, which are related to the sampling step and the reference point selection strategy of the PPF method. Since the complexity of the clustering algorithm stays the same and the number of probable pose hypotheses is generally small, our approach does not increase the runtime much compared with the original version. However, mismatching region detection remains a challenge in cases where the overlap cannot slide.

5 Conclusion

In this paper, we revisit the state-of-the-art PPF method and propose an improved PPF method with mismatching region detection for 3D object detection and pose estimation. We demonstrate the mismatching problem that arises when the traditional PPF method is used for robot bin picking. To improve the pose hypothesis result, we propose a new clustering strategy that combines DBSCAN and PCA. We also define the mismatching region and the probable pose distribution to extract more information from the PPF method. Experiments on real scene point clouds show that, unlike the original version, our approach is able to detect mismatching regions.

We believe that the PPF method still holds promising potential if it is combined with 2D images for 3D object detection and pose estimation. The probable pose distribution also has the potential to offer additional constraints for other pose estimation techniques.