1 Introduction

Picking workpieces randomly placed on a conveyor belt requires robots to precisely estimate the pose of the object. Achieving precise localization has been widely studied in recent decades due to its strong impact on productivity for manufacturers. Recognition and pose estimation of parts are mainly based on 2D or 3D vision techniques: pose estimation with a 2D vision system is ideal for planar parts whose third dimension is negligible, while for complex objects a 3D representation is highly preferable. However, 3D vision still has some limitations [1]: (a) the cost of such industrial sensors is still higher than that of a conventional high-resolution industrial camera; (b) with a 3D sensor it is usually impossible to recognize specific patterns drawn on the object surfaces that may identify the correct object side.

In order to reduce the cost of the vision system, some researchers aim to acquire the pose of 3D parts with a 2D vision system. Hinterstoisser et al. [2] developed a real-time template recognition approach to match the target object with a template and then detect its pose. In their later work [3], a template-based Line-Mod approach with a Kinect is used to detect objects. Rios-Cabrera and Tuytelaars [4] detected multiple specific 3D objects based on the template-based Line-Mod method, learning the templates online and speeding up detection with cascades. Brachmann et al. [5] estimated the 3D pose of specific objects from a single RGB-D image, where the key concept is an intermediate representation in the form of a dense 3D object coordinate labelling paired with a dense class labelling. Bonde et al. [6] presented a learning-based instance recognition framework from single-view point clouds, using soft-label shape features of an object to classify both its location and its pose. Borgefors et al. [7] estimated the pose of a target object by comparing the detected image with its precomputed 2D views using a matching metric computed from the edge similarity between the model and the image, but the method suffers from occlusion. Steger et al. [8, 10] computed the similarity as the dot product of pixel gradient vectors and claim that the method is robust to illumination changes and occlusion. Belongie et al. [9] developed the shape context method to calculate similarity, which achieves high matching accuracy but is time-consuming.

This paper develops a pose estimation method for 3D objects using a 2D vision system. We first establish a 2D view library of the object from its 3D CAD model. Then, the shape context is employed to calculate the similarity between the target contour and the sample contours in the 2D view library. Template matching methods usually suffer from noise and changes in illumination, which reduce the precision of the localization. Our work is similar to reference [10], but the main difference is that we measure the similarity between the 2D view and the target image using a few points sampled from the shape contours, while [10] calculates the dot product of pixel gradient vectors as the similarity. Compared with [10], our method reduces the computational complexity of the similarity.

The rest of the paper is organized as follows: the construction of the 2D view library from the 3D CAD model is described in Sect. 2; the shape context description of the sampling points is presented in Sect. 3; finally, several experiments are given in Sect. 4.

2 Construction of Hierarchical View Library

Generally, a 2D view of an object can be obtained by projecting the CAD model of the object onto a plane; hence, a 2D view library can be established by projecting onto various planes. Inspired by [10], we use a “view ball” to establish the view library of the object, where a virtual camera projects the object on the ball. Thus, the 3D pose of an object, i.e., X(α, β, r), can be described by a point on the sphere, where α denotes the longitude of the point, β denotes its latitude, and r denotes its distance from the center of the sphere (Fig. 1).

Fig. 1. The description of the “view ball”.
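
To make this parameterization concrete, the sketch below converts a pose X(α, β, r) into a virtual camera position and viewing direction. This is a minimal illustration of the idea, not the authors' implementation; the axis conventions and the function name are our assumptions.

```python
import numpy as np

def view_ball_to_camera(alpha, beta, r):
    """Map a view-ball pose X(alpha, beta, r) to a virtual camera center.

    alpha: longitude (rad); beta: latitude (rad); r: distance from the
    sphere center (taken as the object origin). Axis conventions are an
    assumption; the paper does not state them explicitly.
    """
    # Spherical-to-Cartesian conversion: the camera sits on the sphere.
    x = r * np.cos(beta) * np.cos(alpha)
    y = r * np.cos(beta) * np.sin(alpha)
    z = r * np.sin(beta)
    center = np.array([x, y, z])
    # The virtual camera looks back at the object at the sphere center.
    view_dir = -center / np.linalg.norm(center)
    return center, view_dir

# Example: a view from 30 deg longitude, 45 deg latitude, 500 mm away.
center, view_dir = view_ball_to_camera(np.deg2rad(30), np.deg2rad(45), 500.0)
```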

By projecting the CAD model from different positions, various view images of the model are obtained (Fig. 2).

Fig. 2. Obtaining 2D projection images, where “1” represents the virtual camera and “2” represents the CAD model.

Once the 2D images of the object are captured, the view library can be established. In this work, we employ a pyramid structure to layer the projected views, which involves three steps, as shown in Fig. 3: (a) views that reach a given degree of mutual similarity are first grouped into an ‘aspect’ class; (b) the ‘aspects’ are stored on the second level of the pyramid; (c) the similarities of the images on the second pyramid level are calculated and clustered again, and this is repeated until a four-level pyramid is obtained.

Fig. 3. Resulting aspects on pyramid levels 1 to 4 of the hierarchical model. The cameras of the aspects are visualized by small black pyramids. The blue region visualizes the set of aspects on the different levels that end up in a single aspect on the top pyramid level. (Color figure online)
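
Steps (a)–(c) amount to bottom-up clustering of views by similarity. The sketch below is one plausible greedy reading of this procedure, assuming a pairwise similarity function is available; the thresholds and the data layout are our assumptions, not values from the paper.

```python
def build_pyramid(views, similarity, thresholds):
    """Cluster projection views bottom-up into pyramid levels.

    views: list of 2D projection images (pyramid level 1).
    similarity: callable scoring two views in [0, 1].
    thresholds: one clustering threshold per additional level
    (their values are assumptions; the paper does not report them).
    Returns one dict per level mapping a representative view index
    to the indices it covers on the level below.
    """
    levels = [{i: [i] for i in range(len(views))}]  # level 1: one view per node
    current = list(levels[0].keys())
    for t in thresholds:                            # build levels 2, 3, 4
        clusters = {}
        for v in current:
            # Attach v to the first sufficiently similar representative,
            # otherwise open a new cluster (a new 'aspect').
            for rep in clusters:
                if similarity(views[rep], views[v]) >= t:
                    clusters[rep].append(v)
                    break
            else:
                clusters[v] = [v]
        levels.append(clusters)
        current = list(clusters.keys())
    return levels
```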

Once the pyramid model of the view library has been established, matching can be performed with a hierarchical search, which improves the search efficiency for an object to be identified [11,12,13,14]. A coarse-scale search is first performed at the top level, i.e., level 4. At each level, the similarities between the input image and all nodes on that level are calculated, and the node with the largest similarity is selected, which determines the child nodes to examine on the next level. This process is repeated down the pyramid until the most similar model on level 1 is found, as shown in Fig. 4.

Fig. 4. (a) Object images; (b) object recognition using the hierarchy of views.
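
The descent can be written as a simple coarse-to-fine tree walk: score every aspect on the top level, keep the best node, and restrict the next level's search to its children. A sketch under the same assumed pyramid layout as in the previous snippet:

```python
def hierarchical_search(image, views, levels, similarity):
    """Coarse-to-fine search through the view pyramid.

    levels: as returned by build_pyramid (top level last).
    Returns the index of the best-matching view on level 1.
    """
    candidates = list(levels[-1].keys())   # all aspects on the top level
    for level in reversed(levels):         # descend from level 4 to level 1
        # Score every candidate node on this level against the input image.
        best = max(candidates, key=lambda v: similarity(views[v], image))
        candidates = level[best]           # its children become the candidates
    return best
```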

3 Computation of Similarity

This work estimates the pose of a target object by calculating the similarity between the target image and the template views in the library.

It is generally believed that there is a correspondence between the extracted image features and the geometric features of the 3D object [7,8,9,10]. Therefore, the pose of the 3D object can be calculated from the corresponding feature set by matching the extracted features with the geometric features of the 3D object. However, in previous work [7,8,9,10], calculating the similarity with the dot product of pixel gradient vectors is time-consuming.

Borrowing the idea of shape context [9], we introduce a new shape descriptor to match the extracted features with the geometric features. The descriptor contains only coordinate information, from which we can calculate the Euclidean distance and the cosine similarity between two shapes. Notice that the shape context [9] includes the gray value, the location, and the set of vectors originating from a point to all other sample points on the shape, whereas the proposed descriptor contains only the points \( (x_{k}, y_{k}) \), k = 1, …, n, on the contours. For example, 50 points on the edge of a bearing describe its shape, as shown in Fig. 5.

Fig. 5. Sample 50 points on the edge of the bearing.
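
As an illustration, with OpenCV 4 one could extract the outer contour and take the sample points from it roughly as follows. The Otsu thresholding step and the even spacing of points (rather than random selection) are our simplifying assumptions.

```python
import cv2
import numpy as np

def sample_contour_points(gray, n_samp=50):
    """Extract the object's outer contour and sample n_samp points on it.

    gray: single-channel image. Otsu thresholding is an assumption
    about the imaging setup; any edge/contour extractor would do.
    """
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    # For multiple objects, select the target contour by area or shape;
    # here we simply keep the largest one.
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2)
    # Points spaced evenly along the contour (the paper samples randomly).
    idx = np.linspace(0, len(contour) - 1, n_samp).astype(int)
    return contour[idx].astype(float)
```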

Once the points on the shape are detected, the hierarchical search is performed by calculating the shape similarity. For a new input image, the contour of the object is extracted by simple image processing; for multiple objects, the contour of the target object can be selected by its shape or area. In the following, a new algorithm is proposed for calculating the similarity between two contours.

Assume that a point \( p_{i} \) lies on the target shape (where i stands for the serial number, the same as below); our first task is to find its matching point on the sample shape. After finding all matching points, we calculate the distance between the matched pairs, which defines the shape similarity. If the two shapes are identical, the distance between corresponding points is zero; in other words, the shape similarity is 100%.

For example, if there are m points \( P = \{ p_{1}, p_{2}, p_{3}, \ldots, p_{m} \} \) on the target contour, we randomly select n sampling points from the set P. We then sort these points: the point with the minimum x coordinate is taken as the initial point, the point closest to it is taken as the second point, the point closest to the second is taken as the third, and so on. After sorting, we denote by \( p_{i} \) the i-th point on the target shape and by \( q_{i} \) its matching point on the sample shape. We set a distance threshold \( d_{thd} \) to reject false matches. The mismatched points are also used as a penalty in calculating the similarity; that is, more mismatched points indicate a lower similarity between the two shapes.
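
A sketch of the ordering and matching just described: points are chained by nearest neighbor starting from the minimum-x point, and same-index pairs farther apart than \( d_{thd} \) are counted as mismatches. The helper names are ours.

```python
import numpy as np

def sort_points(points):
    """Order points as in the text: start at the minimum-x point, then
    repeatedly append the nearest not-yet-used point."""
    pts = [np.asarray(p, dtype=float) for p in points]
    start = min(range(len(pts)), key=lambda i: pts[i][0])
    ordered = [pts.pop(start)]
    while pts:
        j = min(range(len(pts)),
                key=lambda i: np.linalg.norm(pts[i] - ordered[-1]))
        ordered.append(pts.pop(j))
    return np.array(ordered)

def match_points(target, sample, d_thd):
    """Pair same-index points of the two sorted sets; pairs farther
    apart than d_thd count toward n_left (the mismatches)."""
    t, s = sort_points(target), sort_points(sample)
    dists = np.linalg.norm(t - s, axis=1)
    matched = dists <= d_thd
    return t[matched], s[matched], int((~matched).sum())
```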

The distance between the two shapes is obtained by:

$$ d_{p} = \frac{1}{n_{samp}}\sum\limits_{i = 1}^{n_{samp}} \left\| p_{i} - q_{i} \right\|^{2} \quad \text{where}\; \left\| p_{i} - q_{i} \right\| \le d_{thd} $$
(1)
$$ d_{e} = d_{p} / d_{thd} $$
(2)

where \( d_{p} \) represents the average distance between the matched point pairs. Since the maximum allowed distance between two matched points is \( d_{thd} \), it follows that \( 0 < d_{e} \le 1 \). We then define the distance-based similarity between the two point sets as \( 1 - d_{e} \).

The similarity computed with Eqs. 1 and 2 suffers from image noise. In order to improve the robustness of the matching, we introduce vector sets to calculate the similarity of the two shapes. Taking \( p_{1} \) as the initial point, we connect \( p_{1} \) to the other points to build a vector set \( P_{vector} = \{ \vec{p}_{12}, \vec{p}_{13}, \ldots, \vec{p}_{1n_{samp}} \} \), as illustrated in Fig. 6.

Fig. 6. Vector sets from \( p_{1} \) to the rest of the points.

We build the vector set \( Q_{vector} = \{ \vec{q}_{12}, \vec{q}_{13}, \ldots, \vec{q}_{1n_{samp}} \} \) for the sample shape in the same way. The similarity of the vectors, using the dot product, is then defined as follows.

$$ S_{v} = \left| \frac{1}{n_{samp} - 1}\sum\limits_{i = 2}^{n_{samp}} \frac{\vec{p}_{1i} \cdot \vec{q}_{1i}}{\left\| \vec{p}_{1i} \right\| \cdot \left\| \vec{q}_{1i} \right\|} \right| $$
(3)

The higher the similarity of the two shapes, the smaller the angles between the corresponding vectors. We regard \( S_{v} \) as the cosine similarity between the two shapes.

Considering the Euclidean distance between the matching points and the cosine similarity between the matching vectors, we define the similarity function as:

$$ S = \frac{1}{2}(1 + S_{v} - d_{e}) - \frac{n_{left}}{\alpha \cdot n_{samp}} $$
(4)

where \( n_{left} \) is the number of mismatched points.

Note that in this algorithm, the number of sample points and the value of the threshold \( d_{thd} \) depend on the size of the actual target object and the size of the image. In the experiments, we set the number of selected points to \( n_{samp} \) = m/20 and the threshold to \( d_{thd} \) = 0.00001. The term \( n_{left}/n_{samp} \) is divided by α so that the penalty term does not become too large; experiments show that α = 3 gives the best results.
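
Putting Eqs. (1)–(4) together, a direct transcription could look as follows. It assumes the two point sets are already sorted and of equal length \( n_{samp} \), and that coordinates are normalized so that the very small \( d_{thd} \) is meaningful; these assumptions are ours.

```python
import numpy as np

def shape_similarity(p, q, d_thd=1e-5, alpha=3):
    """Similarity S of Eq. (4) between two sorted point sets p and q,
    each of shape (n_samp, 2)."""
    n_samp = len(p)
    dists = np.linalg.norm(p - q, axis=1)
    matched = dists <= d_thd
    n_left = int((~matched).sum())              # mismatched points
    # Eqs. (1)-(2): squared distances of matched pairs, averaged over
    # n_samp as printed in Eq. (1), then normalized by the threshold.
    d_p = (dists[matched] ** 2).sum() / n_samp
    d_e = d_p / d_thd
    # Eq. (3): cosine similarity of the vector fans from p1 and q1.
    pv, qv = p[1:] - p[0], q[1:] - q[0]
    cos = (pv * qv).sum(axis=1) / (np.linalg.norm(pv, axis=1)
                                   * np.linalg.norm(qv, axis=1))
    s_v = abs(cos.mean())
    # Eq. (4): combine, penalizing mismatches scaled by alpha.
    return 0.5 * (1 + s_v - d_e) - n_left / (alpha * n_samp)
```

For two identical shapes this returns S = 1 (d_e = 0, S_v = 1, n_left = 0), and the score decreases with point distance, angular deviation, and mismatch count.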

4 Experiments

We first create a layered 2D view library by projecting the bearing CAD model from different poses, where the scope of the view library is shown in Table 1. The three pyramid layers are shown in Figs. 7, 8 and 9, respectively. The established pyramid model has 3 layers, and the numbers of 2D views from top to bottom are 27, 208, and 1002.

Table 1. The scope of the view library

Fig. 7. The third layer of the view library

Fig. 8. The second layer of the view library

Fig. 9. The first layer of the view library

Given a picture of the bearing, as shown in Fig. 10, we find the model whose pose is most similar to the target object by the following steps: (a) we execute a coarse search in the third layer to find the model whose pose is roughly similar to the target object, as shown in Fig. 7, where the result is marked by a red box; this determines the child nodes. (b) We repeat the search down the pyramid until the bottom is reached; the models most similar to the target object in the second and first layers are shown in Figs. 8 and 9.

Fig. 10. The object’s image

Since each 2D projection view corresponds to a position relationship between the camera and the actual object, the relative position of the object and the camera can easily be determined by finding the 2D view that best matches the actual object (Fig. 11).

Fig. 11. Using the model to locate a metal polyhedron (a, b) and a bearing (c, d)

To evaluate the accuracy, we place each object at two different positions. At each position, we take five pictures and use the mean value as the final position result. Table 2 compares the results of the algorithm with the actual positions of the target object.

Table 2. Results of the accuracy evaluation

To illustrate the difference between our approach and the approach proposed in [10], which calculates the dot product of pixel gradient vectors as the similarity, we compare the two algorithms in terms of accuracy and recognition time, as shown in Fig. 12.

Fig. 12. Comparison of the two algorithms in terms of accuracy

As can be seen from Table 2, the position error is about 0.78 mm and the computation time is less than 1 s. Our approach is faster in recognition, whereas its accuracy is lower.

5 Conclusion

This paper has developed a pose estimation method for 3D objects using a 2D vision system. We propose a new way of calculating the similarity between the model and the target image, based on points sampled on the shape of the object. We first extract the model and object contours, sample points on the contours, and use the coordinates of the sampling points as the shape descriptor. We sort these points in a fixed order to determine the correspondence between the two shapes, and filter out wrong matches with a distance threshold. Finally, the similarity between the model and the target image is obtained by calculating the Euclidean distance and the cosine similarity between the matched points, and the 3D pose of the object is determined. The proposed approach has a significant advantage in speed and noise robustness compared with approaches that compute similarity from pixel values. However, a limitation of the algorithm is that its stability is hard to guarantee for objects with complex shapes, or for objects whose contour features are not distinct and are difficult to extract, where wrong matches may occur. We will address this limitation in future work.