1 Introduction

Picking workpieces randomly placed on a conveyor belt requires robots to precisely estimate the pose of the object. Achieving precise localization has been widely studied in recent decades due to its strong impact on productivity for manufacturers. Recognition and pose estimation of parts are mainly based on 2D or 3D vision techniques: pose estimation with a 2D vision system is ideal for planar parts whose third dimension is negligible, while for complex objects a 3D representation is highly preferable. However, 3D vision still has some limitations [1]: (a) the cost of such industrial sensors is still higher than that of a conventional high-resolution industrial camera; (b) with a 3D sensor it is usually impossible to recognize specific patterns drawn on the object surfaces that may identify the correct object side.

In order to reduce the cost of the vision system, some researchers aim to acquire the pose of 3D parts with a 2D vision system. Hinterstoisser et al. [2] developed a real-time template recognition approach to match the target object with a template and then detect its pose. In their later work [3], a template-based Line-Mod approach with a Kinect is used to detect objects. Rios-Cabrera and Tuytelaars [4] detected multiple specific 3D objects based on the template-based Line-Mod method, learning the templates online and speeding up detection with cascades. Brachmann et al. [5] estimated the 3D pose of specific objects from a single RGB-D image, where the key concept is an intermediate representation in the form of a dense 3D object coordinate labelling paired with a dense class labelling. Bonde et al. [6] presented a learning-based instance recognition framework from single-view point clouds, using soft-label shape features of an object to classify both its location and its pose. Borgefors et al. [7] estimated the pose of a target object by comparing the detected image with its precomputed 2D views using a matching metric computed from the edge similarity between the model and the image, but the method suffers from occlusion. Steger et al. [8, 10] computed the similarity as the dot product of pixel gradient vectors and claim that the method is robust to illumination changes and occlusion. Belongie et al. [9] developed the shape context method to calculate similarity, which achieves high matching accuracy but is time-consuming.

This paper develops a pose estimation method for 3D objects using a 2D vision system. We first establish a 2D view library of the object from its 3D CAD model. Then, the shape context is employed to calculate the similarity between the target contour and the sample contours in the 2D view library. Template matching methods usually suffer from noise and changes in illumination, which reduce the precision of the localization. Our work is similar to reference [10], but the main difference is that we measure the similarity between the 2D view and the target image using a few points sampled from the shape contours, while [10] calculates the dot product of pixel gradient vectors as the similarity. Compared with [10], our method reduces the computational complexity of the similarity.

The rest of the paper is organized as follows: the construction of the 2D view library from the 3D CAD model is described in Sect. 2; the shape context description of the sampling points is presented in Sect. 3; finally, several experiments are given in Sect. 4.

2 Construction of Hierarchical View Library

Generally, a 2D view of an object can be obtained by projecting the CAD model of the object onto a plane; hence, a 2D view library can be established by projecting onto various planes. Inspired by [10], we use a “view ball” to establish the view library of the object, where a virtual camera projects the object on the ball. Thus, the 3D pose of an object, i.e., X(α, β, r), can be described by a point on the sphere, where α denotes the longitude of the point, β denotes its latitude, and r denotes its distance from the center of the sphere (Fig. 1).

Fig. 1. The description of the “view ball”.
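
To make this parameterization concrete, the sketch below converts a pose X(α, β, r) into a virtual camera position and viewing direction. This is a minimal illustration of the idea, not the authors' implementation; the axis conventions and the function name are our assumptions.

```python
import numpy as np

def view_ball_to_camera(alpha, beta, r):
    """Map a view-ball pose X(alpha, beta, r) to a virtual camera center.

    alpha: longitude (rad); beta: latitude (rad); r: distance from the
    sphere center (taken as the object origin). Axis conventions are an
    assumption; the paper does not state them explicitly.
    """
    # Spherical-to-Cartesian conversion: the camera sits on the sphere.
    x = r * np.cos(beta) * np.cos(alpha)
    y = r * np.cos(beta) * np.sin(alpha)
    z = r * np.sin(beta)
    center = np.array([x, y, z])
    # The virtual camera looks back at the object at the sphere center.
    view_dir = -center / np.linalg.norm(center)
    return center, view_dir

# Example: a view from 30 deg longitude, 45 deg latitude, 500 mm away.
center, view_dir = view_ball_to_camera(np.deg2rad(30), np.deg2rad(45), 500.0)
```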

By projecting the CAD model from different positions, various view images of the model are obtained (Fig. 2).

Fig. 2. Obtaining 2D projection images, where “1” represents the virtual camera and “2” represents the CAD model.

Once the 2D images of the object are captured, the view library can be established. In this work, we employ a pyramid structure to layer the projected views, which involves three steps, as shown in Fig. 3: (a) views that reach a given degree of mutual similarity are first grouped into an ‘aspect’ class; (b) the ‘aspects’ are stored on the second level of the pyramid; (c) the similarities of the images on the second pyramid level are calculated and clustered again, and this is repeated until a four-level pyramid is obtained.

Fig. 3. Resulting aspects on pyramid levels 1 to 4 of the hierarchical model. The cameras of the aspects are visualized by small black pyramids. The blue region visualizes the set of aspects on the different levels that end up in a single aspect on the top pyramid level. (Color figure online)
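
Steps (a)–(c) amount to bottom-up clustering of views by similarity. The sketch below is one plausible greedy reading of this procedure, assuming a pairwise similarity function is available; the thresholds and the data layout are our assumptions, not values from the paper.

```python
def build_pyramid(views, similarity, thresholds):
    """Cluster projection views bottom-up into pyramid levels.

    views: list of 2D projection images (pyramid level 1).
    similarity: callable scoring two views in [0, 1].
    thresholds: one clustering threshold per additional level
    (their values are assumptions; the paper does not report them).
    Returns one dict per level mapping a representative view index
    to the indices it covers on the level below.
    """
    levels = [{i: [i] for i in range(len(views))}]  # level 1: one view per node
    current = list(levels[0].keys())
    for t in thresholds:                            # build levels 2, 3, 4
        clusters = {}
        for v in current:
            # Attach v to the first sufficiently similar representative,
            # otherwise open a new cluster (a new 'aspect').
            for rep in clusters:
                if similarity(views[rep], views[v]) >= t:
                    clusters[rep].append(v)
                    break
            else:
                clusters[v] = [v]
        levels.append(clusters)
        current = list(clusters.keys())
    return levels
```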

Once the pyramid model of the view library has been established, matching can be performed with a hierarchical search, which improves the search efficiency for an object to be identified [11,12,13,14]. A coarse-scale search is first performed at the top level, i.e., level 4. At each level, the similarities between the input image and all nodes on that level are calculated, and the node with the largest similarity is selected, which determines the child nodes to examine on the next level. This process is repeated down the pyramid until the most similar model on level 1 is found, as shown in Fig. 4.

Fig. 4. (a) Object images; (b) object recognition using the hierarchy of views.
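
The descent can be written as a simple coarse-to-fine tree walk: score every aspect on the top level, keep the best node, and restrict the next level's search to its children. A sketch under the same assumed pyramid layout as in the previous snippet:

```python
def hierarchical_search(image, views, levels, similarity):
    """Coarse-to-fine search through the view pyramid.

    levels: as returned by build_pyramid (top level last).
    Returns the index of the best-matching view on level 1.
    """
    candidates = list(levels[-1].keys())   # all aspects on the top level
    for level in reversed(levels):         # descend from level 4 to level 1
        # Score every candidate node on this level against the input image.
        best = max(candidates, key=lambda v: similarity(views[v], image))
        candidates = level[best]           # its children become the candidates
    return best
```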

3 Computation of Similarity

This work estimates the pose of a target object by calculating the similarity between the target image and the template views in the library.

It is generally believed that there is a correspondence between the extracted image features and the geometric features of the 3D object [7,8,9,10]. Therefore, the pose of the 3D object can be calculated from the corresponding feature set by matching the extracted features with the geometric features of the 3D object. However, in previous work [7,8,9,10], calculating the similarity with the dot product of pixel gradient vectors is time-consuming.

Borrowing the idea of shape context [9], we introduce a new shape descriptor to match the extracted features with the geometric features. The descriptor contains only coordinate information, from which we can calculate the Euclidean distance and the cosine similarity between two shapes. Notice that the shape context [9] includes the gray value, the location, and the set of vectors originating from a point to all other sample points on the shape, whereas the proposed descriptor contains only the points \( (x_{k}, y_{k}) \), k = 1, …, n, on the contours. For example, 50 points on the edge of a bearing describe its shape, as shown in Fig. 5.

Fig. 5. Sample 50 points on the edge of the bearing.
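
As an illustration, with OpenCV 4 one could extract the outer contour and take the sample points from it roughly as follows. The Otsu thresholding step and the even spacing of points (rather than random selection) are our simplifying assumptions.

```python
import cv2
import numpy as np

def sample_contour_points(gray, n_samp=50):
    """Extract the object's outer contour and sample n_samp points on it.

    gray: single-channel image. Otsu thresholding is an assumption
    about the imaging setup; any edge/contour extractor would do.
    """
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    # For multiple objects, select the target contour by area or shape;
    # here we simply keep the largest one.
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2)
    # Points spaced evenly along the contour (the paper samples randomly).
    idx = np.linspace(0, len(contour) - 1, n_samp).astype(int)
    return contour[idx].astype(float)
```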

Once the points on the shape are detected, the hierarchical search is performed by calculating the shape similarity. For a new input image, the contour of the object is extracted by simple image processing; for multiple objects, the contour of the target object can be selected by its shape or area. In the following, a new algorithm is proposed for calculating the similarity between two contours.

Assume that a point \( p_{i} \) lies on the target shape (where i stands for the serial number, the same as below); our first task is to find its matching point on the sample shape. After finding all matching points, we calculate the distance between the matched pairs, which defines the shape similarity. If the two shapes are identical, the distance between corresponding points is zero; in other words, the shape similarity is 100%.

For example, if there are m points \( P = \{ p_{1}, p_{2}, p_{3}, \ldots, p_{m} \} \) on the target contour, we randomly select n sampling points from the set P. We then sort these points: the point with the minimum x coordinate is taken as the initial point, the point closest to it is taken as the second point, the point closest to the second is taken as the third, and so on. After sorting, we denote by \( p_{i} \) the i-th point on the target shape and by \( q_{i} \) its matching point on the sample shape. We set a distance threshold \( d_{thd} \) to reject false matches. The mismatched points are also used as a penalty in calculating the similarity; that is, more mismatched points indicate a lower similarity between the two shapes.
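
A sketch of the ordering and matching just described: points are chained by nearest neighbor starting from the minimum-x point, and same-index pairs farther apart than \( d_{thd} \) are counted as mismatches. The helper names are ours.

```python
import numpy as np

def sort_points(points):
    """Order points as in the text: start at the minimum-x point, then
    repeatedly append the nearest not-yet-used point."""
    pts = [np.asarray(p, dtype=float) for p in points]
    start = min(range(len(pts)), key=lambda i: pts[i][0])
    ordered = [pts.pop(start)]
    while pts:
        j = min(range(len(pts)),
                key=lambda i: np.linalg.norm(pts[i] - ordered[-1]))
        ordered.append(pts.pop(j))
    return np.array(ordered)

def match_points(target, sample, d_thd):
    """Pair same-index points of the two sorted sets; pairs farther
    apart than d_thd count toward n_left (the mismatches)."""
    t, s = sort_points(target), sort_points(sample)
    dists = np.linalg.norm(t - s, axis=1)
    matched = dists <= d_thd
    return t[matched], s[matched], int((~matched).sum())
```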

The distance between the two shapes is obtained by:

$$ d_{p} = \frac{1}{n_{samp}}\sum\limits_{i = 1}^{n_{samp}} \left\| p_{i} - q_{i} \right\|^{2} \quad \text{where}\; \left\| p_{i} - q_{i} \right\| \le d_{thd} $$
(1)
$$ d_{e} = d_{p} / d_{thd} $$
(2)

where \( d_{p} \) represents the average distance between the matched point pairs. Since the maximum allowed distance between two matched points is \( d_{thd} \), it follows that \( 0 < d_{e} \le 1 \). We then define the distance-based similarity between the two point sets as \( 1 - d_{e} \).

The similarity computed with Eqs. 1 and 2 suffers from image noise. In order to improve the robustness of the matching, we introduce vector sets to calculate the similarity of the two shapes. Taking \( p_{1} \) as the initial point, we connect \( p_{1} \) to the other points to build a vector set \( P_{vector} = \{ \vec{p}_{12}, \vec{p}_{13}, \ldots, \vec{p}_{1n_{samp}} \} \), as illustrated in Fig. 6.

Fig. 6. Vector sets from \( p_{1} \) to the rest of the points.

We build the vector set \( Q_{vector} = \{ \vec{q}_{12}, \vec{q}_{13}, \ldots, \vec{q}_{1n_{samp}} \} \) for the sample shape in the same way. The similarity of the vectors, using the dot product, is then defined as follows.

$$ S_{v} = \left| \frac{1}{n_{samp} - 1}\sum\limits_{i = 2}^{n_{samp}} \frac{\vec{p}_{1i} \cdot \vec{q}_{1i}}{\left\| \vec{p}_{1i} \right\| \cdot \left\| \vec{q}_{1i} \right\|} \right| $$
(3)

The higher the similarity of the two shapes, the smaller the angles between the corresponding vectors. We regard \( S_{v} \) as the cosine similarity between the two shapes.

Considering the Euclidean distance between the matching points and the cosine similarity between the matching vectors, we define the similarity function as:

$$ S = \frac{1}{2}(1 + S_{v} - d_{e}) - \frac{n_{left}}{\alpha \cdot n_{samp}} $$
(4)

where \( n_{left} \) is the number of mismatched points.

Note that in this algorithm, the number of sample points and the value of the threshold \( d_{thd} \) depend on the size of the actual target object and the size of the image. In the experiments, we set the number of selected points to \( n_{samp} \) = m/20 and the threshold to \( d_{thd} \) = 0.00001. The term \( n_{left}/n_{samp} \) is divided by α so that the penalty term does not become too large; experiments show that α = 3 gives the best results.
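
Putting Eqs. (1)–(4) together, a direct transcription could look as follows. It assumes the two point sets are already sorted and of equal length \( n_{samp} \), and that coordinates are normalized so that the very small \( d_{thd} \) is meaningful; these assumptions are ours.

```python
import numpy as np

def shape_similarity(p, q, d_thd=1e-5, alpha=3):
    """Similarity S of Eq. (4) between two sorted point sets p and q,
    each of shape (n_samp, 2)."""
    n_samp = len(p)
    dists = np.linalg.norm(p - q, axis=1)
    matched = dists <= d_thd
    n_left = int((~matched).sum())              # mismatched points
    # Eqs. (1)-(2): squared distances of matched pairs, averaged over
    # n_samp as printed in Eq. (1), then normalized by the threshold.
    d_p = (dists[matched] ** 2).sum() / n_samp
    d_e = d_p / d_thd
    # Eq. (3): cosine similarity of the vector fans from p1 and q1.
    pv, qv = p[1:] - p[0], q[1:] - q[0]
    cos = (pv * qv).sum(axis=1) / (np.linalg.norm(pv, axis=1)
                                   * np.linalg.norm(qv, axis=1))
    s_v = abs(cos.mean())
    # Eq. (4): combine, penalizing mismatches scaled by alpha.
    return 0.5 * (1 + s_v - d_e) - n_left / (alpha * n_samp)
```

For two identical shapes this returns S = 1 (d_e = 0, S_v = 1, n_left = 0), and the score decreases with point distance, angular deviation, and mismatch count.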

4 Experiments

We first create a layered 2D view library by projecting the bearing CAD model from different poses, where the scope of the view library is shown in Table 1. The three pyramid layers are shown in Figs. 7, 8 and 9, respectively. The established pyramid model has 3 layers, and the numbers of 2D views from top to bottom are 27, 208, and 1002.

Table 1. The scope of the view library

Fig. 7. The third layer of the view library

Fig. 8. The second layer of the view library

Fig. 9. The first layer of the view library

Given a picture of the bearing, as shown in Fig. 10, we find the model whose pose is most similar to the target object by the following steps: (a) we execute a coarse search in the third layer to find the model whose pose is roughly similar to the target object, as shown in Fig. 7, where the result is marked by a red box; this determines the child nodes. (b) We repeat the search down the pyramid until the bottom is reached; the models most similar to the target object in the second and first layers are shown in Figs. 8 and 9.

Fig. 10. The object’s image

Since each 2D projection view corresponds to a position relationship between the camera and the actual object, the relative position of the object and the camera can easily be determined by finding the 2D view that best matches the actual object (Fig. 11).

Fig. 11. Using the model to locate a metal polyhedron (a, b) and a bearing (c, d)

To evaluate the accuracy, we place each object at two different positions. At each position, we take five pictures and use the mean value as the final position result. Table 2 compares the results of the algorithm with the actual positions of the target object.

Table 2. Results of the accuracy evaluation

To illustrate the difference between our approach and the approach proposed in [10], which calculates the dot product of pixel gradient vectors as the similarity, we compare the two algorithms in terms of accuracy and recognition time, as shown in Fig. 12.

Fig. 12. Comparison of the two algorithms in terms of accuracy

As can be seen from Table 2, the position error is about 0.78 mm and the computation time is less than 1 s. Our approach is faster in recognition, whereas its accuracy is lower.

5 Conclusion

This paper has developed a pose estimation method for 3D objects using a 2D vision system. We propose a new way of calculating the similarity between the model and the target image, based on points sampled on the shape of the object. We first extract the model and object contours, sample points on the contours, and use the coordinates of the sampling points as the shape descriptor. We sort these points in a fixed order to determine the correspondence between the two shapes, and filter out wrong matches with a distance threshold. Finally, the similarity between the model and the target image is obtained by calculating the Euclidean distance and the cosine similarity between the matched points, and the 3D pose of the object is determined. The proposed approach has a significant advantage in speed and noise robustness compared with approaches that compute similarity from pixel values. However, a limitation of the algorithm is that its stability is hard to guarantee for objects with complex shapes, or for objects whose contour features are not distinct and are difficult to extract, where wrong matches may occur. We will address this limitation in future work.