
1 Introduction

With the rapid development of high technology, robotics plays an increasingly important role in human life. Vision processing in particular is one of the most active areas, as it helps a robot learn about the environments it explores. This work considers a scenario in which a NAO robot recognizes previously learned objects by fusing multiple cameras, in order to improve recognition quality and reduce uncertainty and imprecision. We first review how related work has dealt with object recognition, then propose a solution for the considered case.

In fact, the problem of recognizing an object has been addressed for several decades. A huge number of methodologies have been proposed, each trying to prove its strengths and overcome the weaknesses of the preceding solutions. For instance, Berg et al. [1] used the Geometric Blur approach for feature descriptors and proposed an algorithm to compute the correspondences between images. The query image was then classified according to its lowest cost of correspondence to the sample images. Besides that, Ling and Jacobs [2] introduced the term “inner-distance” as the length of the shortest path between landmark points within the shape silhouette. The inner-distance was used to build shape representations, which led to good matching results. Among texture-based approaches, [3] proposed a texture descriptor based on Random Sets and experimentally showed that it outperformed the co-occurrence matrix descriptor; decision tree induction was used in that work to learn the classifier. Another example can be found in [4], where color and texture information were both used in an agricultural scenario to recognize fruits. On the other hand, context-based methods like [5–7] considered contextual information surrounding the target objects. This information comes from the interaction among objects in the scene and helps to disambiguate appearance inputs in recognition tasks. Similarly successful, methods based on local feature description such as SIFT [8] and SURF [9] have received many positive evaluations and have been widely applied [10–13]. SIFT extracts keypoints from the object to build feature vectors; the matching (using the Euclidean distance) between an input object and the ones in the database is then computed to find the best candidate class. After that, the agreement on the object and its location, scale, and orientation is determined by using a hash table implementation of the Generalized Hough Transform. In a different manner, SURF uses a blob detector based on the Hessian matrix to find interest points, then computes the descriptor from the sum of Haar wavelet responses. Finally, by comparing the descriptors obtained from different images, the matching pairs can be found.

For the purpose of collecting spatial information about the detected objects, and to avoid the imprecision of 2D images under non-ideal lighting conditions such as outdoor environments, some works concentrated on 3D object recognition. In [14], an extended version of the Generalized Hough Transform was used in 3D scenes: each point in the input cloud votes for a spatial position of the object’s reference point, and the accumulator bin with the maximum number of votes indicates an instance of the object in the scene. In [15, 16], 3D extensions of the SIFT and SURF descriptors also gave positive recognition results. In addition, Zhong [17] introduced a new 3D shape descriptor called the Intrinsic Shape Signature to characterize a local/semi-local region of a point cloud. This descriptor uses a view-independent representation of the 3D shape to match shape patches from different views directly, and a view-dependent transform encoding the viewing geometry to facilitate fast pose estimation. In contrast, [18, 19] considered the use of point pairs for the description, and the feature matching is then done with a hash table. Recently, the SHOT descriptor [20] has emerged as an efficient tool for 3D object recognition [21, 22]. Indeed, the descriptor encodes histograms of basic first-order differential entities (i.e. the normals of the points within the support), which are more representative of the local structure of the surface than plain 3D coordinates. After defining a unique and robust 3D local reference frame, the discriminative power of the descriptor is enhanced by taking into account the locations of the points within the support, thereby describing a signature.

It is clear that all of the above-mentioned approaches have experimentally shown good results in object recognition. Nevertheless, many of them did not focus on the problem of uncertainty and imprecision, which may come from the quality of data and sensors, the lighting conditions, the viewing angles to the objects and, particularly, the similarity among confusing objects. Therefore, in this work we propose to use multiple cameras to recognize objects that have many similarities. The proposed method is implemented on a NAO robot because of our ongoing robotics project, but it is not restricted to this platform and can be applied to any other vision-based platform. In order to take advantage of both 2D and 3D recognition, we use not only the 2D camera of the NAO robot but also a 2D Axis IP camera and a 3D Axus camera; Fig. 1 shows the multi-camera environment in which the robot is requested to recognize objects. The fusion of these three heterogeneous sensors brings additional advantages to each one, because the NAO camera and the IP camera provide characteristics about the 2D features of the detected objects whereas the Axus camera provides depth information. We propose an evidential classifier based on Dempster-Shafer theory (or Evidence theory) [23] for each camera, and then combine them at the decision level in order to give more reliable object recognition results.

Fig. 1. The multi-camera setup helps the NAO robot recognize objects.

The outline of the paper is as follows. First, we describe our approach step by step in Sect. 2, then we give an illustrative example in Sect. 3. Section 4 presents our experimental results to validate the approach, and finally Sect. 5 concludes the paper.

2 Our Recognition Approach

2.1 An Evidential Classifier for Each Camera

Processing Flow: Figure 2 shows the classification flow for each camera. First, an input image in 2D or 3D form is captured, depending on the type of camera sensor. For the NAO camera and the IP camera (2D), the input data are \(640 \times 480\) images; for the Axus camera (3D), the input images are in the form of point clouds, since we implement the 3D processing with the PCL library [24]. To focus on the classification, only one instance of the object appears in the captured scene.

First, interest points (or keypoints) of the object in the scene are extracted. In an image, an interest point is a point that carries rich information about the local image structure around it; such points characterize the patterns in the image well. After that, we use descriptor methods to build a feature vector for each interest point. We use the term “feature points” for the interest points that have been described by the descriptor. The descriptors used in this work are SURF [9] for 2D data and SHOT [20] for 3D data, owing to their strong properties explained above. From the set of feature points acquired, we build a mass function that describes the camera’s degree of belief about the class of the detected object. Thereafter, a decision is made by choosing the class with the maximum pignistic probability. The processing flow is described in more detail below.

Fig. 2. Evidential classifier for each camera.

Evidence Theory in the Scenario: Suppose the robot has to recognize an object that can belong to only one of N classes, i.e. the space of discernment is:

$$\begin{aligned} \varOmega = \{O_1, O_2, ..., O_N\} \end{aligned}$$
(1)

Then we have the power set, which contains all the subsets of the space of discernment:

$$\begin{aligned} 2^\varOmega = \{ \{\emptyset \},\ \{O_1\},\ \{O_2\},\ ...,\ \{O_N\},\ \{O_1 \cup O_2\},\ ...,\ \{O_1 \cup O_N\},\ ...,\ \{\varOmega \} \} \end{aligned}$$
(2)

In Evidence Theory, we have to determine a mass function which describes the degree of belief for all possible hypotheses in the power set. This function satisfies:

$$\begin{aligned} \begin{aligned} m: 2^\varOmega \rightarrow [0,1] \\ \sum _{H \in 2^\varOmega }m(H)=1 \end{aligned} \end{aligned}$$
(3)

To illustrate the proposed approach, we consider the simple case in Fig. 3, where we suppose that there are three classes of objects: A, B and C. For the sake of explanation, we assume that we have only one training image for each class. Given an input image that contains a set X of feature points of the object, our task is to decide the appropriate class for X. The basic idea is that each feature point \(x_i \in X\) votes for a hypothesis \(H \in 2^\varOmega \) based on its matching to the training images. In Fig. 3, the feature point \(x_1\) matches the images of both class A and class B, so we accumulate one vote for the hypothesis \(H = \{A \cup B\}\). Similarly, the feature point \(x_2\) votes for \(H = \{C\}\). By applying the same principle to all the feature points of X, and after a normalization step, we obtain all the elements of the mass function. The steps of defining the matching and constructing the mass function are described mathematically in the following.
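To make the voting idea concrete, the following minimal Python sketch (our own illustration rather than part of the original description) accumulates one vote per feature point for the hypothesis formed by the set of classes that the point matches; the matching criterion and the normalization step are formalized below.

```python
from collections import Counter

# Hypothetical matching results for three feature points of the input image X:
# each entry lists the classes (A, B, C) matched by one feature point.
matched_classes_per_point = [{"A", "B"},   # x1 matches the images of A and B
                             {"C"},        # x2 matches the image of C only
                             {"A", "B"}]   # x3 matches the images of A and B

# Each feature point votes for the hypothesis made of all the classes it matches.
votes = Counter(frozenset(m) for m in matched_classes_per_point if m)
print(votes)   # two votes for the hypothesis {A, B}, one vote for {C}
```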

Fig. 3. Illustration of the idea: each input feature point votes for a hypothesis.

Construction of Mass and Decision: First, let us denote by \(\varDelta (p_i, p_j)\) the normalized distance between two feature points \(p_i\) and \(p_j\); the shorter the distance, the more similar the two feature points.

$$\begin{aligned} \varDelta (p_i, p_j) \in [0, 1] \end{aligned}$$
(4)

In order to decide the matching between a feature point \(p^X_i\) of an input image X (X can also be understood as the set of feature points of the input image) and a training image M whose class is \(O_j \in \varOmega \), we use the idea in [25]. We find the two nearest neighbours of \(p^X_i\) in M, called \(p^M_{i_1}\) and \(p^M_{i_2}\) (the feature points in M were previously extracted in the training phase). We suppose that \(p^M_{i_1}\) is closer to \(p^X_i\) than \(p^M_{i_2}\), i.e. \(\varDelta (p^X_i, p^M_{i_1}) \le \varDelta (p^X_i, p^M_{i_2})\). We then define a matching function between the feature point \(p^X_i\) of the input image X and the model M:

$$\begin{aligned} \delta (p^X_i, M) = {\left\{ \begin{array}{ll} 1, &{} \text { if } \varDelta (p^X_i, p^M_{i_1}) \le \alpha \text { and } \frac{\varDelta (p^X_i, p^M_{i_1})}{\varDelta (p^X_i, p^M_{i_2})} \le \beta \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)

where \(\alpha \) and \(\beta \) are two user-defined parameters such that \(0 \le \alpha , \beta \le 1\). The former guarantees that the distance between \(p^X_i\) and its most similar feature point found in M is small enough, whereas the latter helps to avoid false matches. In this work, we choose \(\beta = 0.8\) as suggested in [25], and we add \(\alpha = 0.25\) in order to reduce noise. Indeed, these two parameters help us to find a strong and distinctive match between the feature point \(p^X_i\) and its closest feature point in M. If \(\delta (p^X_i, M) = 1\), we say that \(p^X_i\) is matched to the training image M, i.e. matched to the class \(O_j \in \varOmega \) of M; otherwise it is not matched. In the same way, we can find all the matches of the feature points in the input image X to the training image M.
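As a hedged illustration, Eq. (5) could be implemented as in the following sketch, which assumes that descriptors are NumPy vectors and that \(\varDelta \) is the Euclidean distance between descriptors rescaled to [0, 1] (the choice of normalization is not fixed by the method); the name delta_match is ours.

```python
import numpy as np

ALPHA, BETA = 0.25, 0.8   # thresholds chosen in this work (Eq. 5)

def delta_match(p_x, model_descriptors, alpha=ALPHA, beta=BETA):
    """Eq. (5): return 1 if the input descriptor p_x matches the training image
    whose descriptors are given in `model_descriptors` (an (R, d) array with
    R >= 2), 0 otherwise.  Distances are assumed to lie in [0, 1]."""
    dist = np.linalg.norm(model_descriptors - p_x, axis=1)   # Δ(p_x, ·)
    i1, i2 = np.argsort(dist)[:2]                            # two nearest neighbours
    # First condition: the best match is close enough; second condition:
    # the ratio test of [25], written as a product to avoid division by zero.
    return int(dist[i1] <= alpha and dist[i1] <= beta * dist[i2])
```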

Now, we define the matching between X and the class \(O_j\) by considering all the matches between the feature points \(p^X_i\) in X and the class \(O_j\). If the class \(O_j\) has several training images \(M_k\), we choose the training image \(M_{max}\) that has the maximum number of matches to X according to Eq. (5):

$$\begin{aligned} \delta ^{max}(p^X_i, O_j) = \delta (p^X_i, M_{max}) \end{aligned}$$
(6)

Table 1 shows an example illustrating the matches between the input feature points and the output classes. A cell \(c(p^X_i, O_j)\) represents the matching between the feature point \(p^X_i\) of X and the class \(O_j\), where \(i = 1,2,...,R_X\) (\(R_X\) being the number of feature points in X) and \(j = 1,2,...,N\) (N being the number of classes). If the cell is red, the feature point \(p^X_i\) matches the class \(O_j\) (i.e. \(\delta ^{max}(p^X_i, O_j)=1\)); otherwise it does not.

Table 1. Matching between the feature points of input image X and the classes

After determining the matching between the input feature points and the output classes, we can construct the mass function as follows. Each feature point \(p^X_i\) votes for the hypothesis in the power set composed of the classes that match \(p^X_i\). Mathematically, let us define an accumulated-vote function that counts the votes for each hypothesis:

$$\begin{aligned} accVote(X, H) = \sum _{p^X_i \in X}\phi (p^X_i, H), ~ ~ ~ H \in 2^\varOmega \end{aligned}$$
(7)

where \(\phi (p^X_i, H)\) is a function indicating whether the feature point \(p^X_i\) matches every element class in H:

$$\begin{aligned} \phi (p^X_i, H) = {\left\{ \begin{array}{ll} 1, &{} \text { if } \sum \nolimits _{O_j \in H}\delta ^{max}(p^X_i, O_j) = |H| \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

where |H| is the cardinality of H and \(\delta ^{max}(p^X_i, O_j)\) was defined above. Indeed, \(\phi (p^X_i, H)\) indicates whether a feature point \(p^X_i\) matches every element class in the hypothesis H, and \(accVote(X, H)\) counts the number of feature points in X that match every element class in H. After that, we calculate the mass function from the accumulated votes:

$$\begin{aligned} m^X(H) = \frac{accVote(X, H)}{G^X} \end{aligned}$$
(9)

where \(G^X\) is the normalization factor that guarantees the condition in Eq. (3):

$$\begin{aligned} G^X = \sum _{H \in 2^{\varOmega }, H \ne \emptyset }accVote(X, H) \end{aligned}$$
(10)

It is worth noting that in this work we assume that the class of the object in the input image X always belongs to \(\varOmega \), so we set \(m^X(\emptyset ) = 0\).
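The whole construction of the mass function can be sketched as follows (illustrative code of ours, reusing the delta_match sketch above). Consistently with the voting scheme of Fig. 3, each feature point votes once, for the hypothesis made of exactly the classes it matches, and the accumulated votes are normalized by \(G^X\); the data layout (a dictionary mapping each class to the descriptor arrays of its training images) is an assumption of the sketch.

```python
from collections import Counter

def build_mass(X_descriptors, models_by_class, alpha=0.25, beta=0.8):
    """Eqs. (6)-(10): build the mass function of one camera.

    X_descriptors   : list of descriptor vectors of the input image X
    models_by_class : dict {class name: list of (R, d) descriptor arrays,
                      one array per training image of that class}
    Returns a dict {frozenset of classes: mass value}."""
    # Eq. (6): for each class, keep the training image with the most matches to X
    best_model = {
        c: max(models, key=lambda M: sum(delta_match(p, M, alpha, beta)
                                         for p in X_descriptors))
        for c, models in models_by_class.items()
    }
    # Voting: each feature point votes for the set of classes it matches
    votes = Counter()
    for p in X_descriptors:
        H = frozenset(c for c, M in best_model.items()
                      if delta_match(p, M, alpha, beta) == 1)
        if H:                       # unmatched points do not vote, so m(∅) = 0
            votes[H] += 1
    # Eqs. (9)-(10): normalize the accumulated votes into a mass function
    G = sum(votes.values())
    return {H: v / G for H, v in votes.items()} if G else {}
```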

Once we have constructed the mass function, we can make a decision about the class of the object. Since the maximum of belief is too pessimistic and the maximum of plausibility is too optimistic, we choose the class with the maximum pignistic probability [26]:

$$\begin{aligned} BetP^X(O_j) = \frac{1}{1 - m^X(\emptyset )}\sum _{O_j \in H}\frac{m^X(H)}{|H|} \end{aligned}$$
(11)
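A corresponding sketch of the pignistic transform of Eq. (11), for a mass function represented as a dictionary keyed by frozensets of classes and satisfying \(m(\emptyset ) = 0\), could be:

```python
def pignistic(mass):
    """Eq. (11): pignistic probability of each singleton class, for a mass
    function given as {frozenset of classes: mass value} with m(empty) = 0."""
    classes = set().union(*mass.keys()) if mass else set()
    return {c: sum(v / len(H) for H, v in mass.items() if c in H)
            for c in classes}

# Decision for one camera, e.g.:
# betp = pignistic(mass); decision = max(betp, key=betp.get)
```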

2.2 Fusion of Cameras

Based on Evidence theory, each camera gives a decision about the class of the detected object. In addition, by using Dempster’s rule of combination [23], we can integrate information from multiple cameras in order to make a better decision. The rule is usually defined for two sources, but its associativity and commutativity allow a trivial extension to several sources:

$$\begin{aligned} m_{comb}(H) = \frac{1}{1 - K}\sum _{H_1 \cap H_2 \cap ... \cap H_S = H}m_1(H_1)m_2(H_2)...m_S(H_S), ~ ~ ~ H \in 2^{\varOmega }, H \ne \emptyset \end{aligned}$$
(12)

where S is the number of information sources (i.e. the number of cameras, three in this work) and K measures the conflict among the sources:

$$\begin{aligned} K = \sum _{H_1 \cap H_2 \cap ... \cap H_S = \emptyset }m_1(H_1)m_2(H_2)...m_S(H_S) \end{aligned}$$
(13)

Finally, the decision about the class of the detected object can be made by using the pignistic probability as in Eq. (11).
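A possible implementation of Eqs. (12) and (13), for the same dictionary representation of mass functions, is sketched below; the multi-source combination simply folds the two-source rule over the list of cameras, which is valid thanks to associativity and commutativity.

```python
from functools import reduce

def dempster_combine(m1, m2):
    """Dempster's rule for two mass functions {frozenset: mass} with m(empty) = 0."""
    joint, conflict = {}, 0.0
    for H1, v1 in m1.items():
        for H2, v2 in m2.items():
            inter = H1 & H2
            if inter:
                joint[inter] = joint.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2       # mass assigned to the empty set, i.e. K in Eq. (13)
    if conflict >= 1.0:
        raise ValueError("total conflict between the two sources")
    return {H: v / (1.0 - conflict) for H, v in joint.items()}

def fuse(masses):
    """Combine the mass functions of all S cameras (S = 3 in this work)."""
    return reduce(dempster_combine, masses)
```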

3 Illustrative Example

In this section, we provide an example to illustrate the proposed approach. Suppose that we want the robot to recognize an object in a captured scene, with three classes in the space of discernment, i.e.:

$$\begin{aligned} \varOmega = \{O_1, O_2, O_3\} \end{aligned}$$
(14)

so there are 8 possible hypotheses in the power set:

$$\begin{aligned} 2^{\varOmega } = \{ \{\emptyset \}, \{O_3\}, \{O_2\}, \{O_2 \cup O_3\}, \{O_1\}, \{O_1 \cup O_3\}, \{O_1 \cup O_2\}, \{\varOmega \} \} \end{aligned}$$
(15)

For simplicity, we suppose that for each class we have only one training image. Assume that the NAO camera captures the scene and finds 10 feature points in the input image \(X_{NAO}\). For each of these input feature points, we find the two nearest neighbour feature points in each training image. After that, we use Eqs. (4), (5), and (6) to construct the matching between the input image and each class. Table 2 shows an example of the matching found. Each cell describes the matching between a feature point and a class; if \(\delta ^{max}(p^{X_{NAO}}_i, O_j) = 1\), the cell is red, otherwise white. The last row indicates the hypothesis voted for by the corresponding feature point.

Table 2. Matching between the input image \(X_{NAO}\) and the classes

From Table 2, we have determined the strength of each hypothesis in the power set. Table 3 then shows the accumulated vote for each hypothesis, calculated by Eqs. (7) and (8). Each cell in the table is the value of \(\phi (p^{X_{NAO}}_i, H), H \in 2^{\varOmega }\). Recall that if \(\phi (p^{X_{NAO}}_i, H) = 1\), the feature point \(p^{X_{NAO}}_i\) votes for the hypothesis H. According to Eq. (10), we have \(G^{X_{NAO}} = \sum {accVote} = 1 + 3 + 1 + 2 + 2 + 1 + 0 = 10\). From this information, we calculate the mass values in the last column by using Eq. (9).

Table 3. Accumulated vote for each hypothesis

After that, we assume that we use not only the NAO camera but also the IP camera (2D) and the Axus camera (3D). By performing the same steps, we obtain the two mass vectors output by the two additional sensors. Table 4 shows example values of these masses. Additionally, we also calculate the combination of the masses using Dempster’s rule (\(m_{comb}\)) and transform it into the pignistic probability (BetP) for each singleton hypothesis. The last column is the final decision from the fusion of the three cameras, which recognizes that the detected object belongs to the class \(O_1\).

Table 4. Mass values from the three camera sensors
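As a usage sketch, the fusion and the final pignistic decision could be computed as below, reusing the fuse and pignistic functions sketched in Sect. 2; the mass values here are hypothetical numbers chosen for illustration only, not the values of Table 4.

```python
O1, O2, O3 = "O1", "O2", "O3"

# Hypothetical mass functions from the three cameras (illustration only)
m_nao  = {frozenset({O1}): 0.3, frozenset({O1, O2}): 0.4, frozenset({O1, O2, O3}): 0.3}
m_ip   = {frozenset({O1}): 0.5, frozenset({O2}): 0.2, frozenset({O1, O2, O3}): 0.3}
m_axus = {frozenset({O1, O3}): 0.6, frozenset({O1, O2, O3}): 0.4}

m_comb = fuse([m_nao, m_ip, m_axus])   # Dempster's rule over the three cameras
betp   = pignistic(m_comb)             # pignistic probabilities of O1, O2, O3
print(max(betp, key=betp.get))         # with these numbers the decision is O1
```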

4 Experiments

As mentioned previously, the focus of this work is how to resolve uncertainty and imprecision during the object recognition process of the NAO robot. For that reason, we carried out three experiments, each of which involves a set of confusing objects, as shown in Fig. 4. The first set contains 4 cups whose similar spatial structures can cause uncertainty for the 3D camera. Conversely, the second experiment contains 4 boxes that have similar brand information on their surfaces, which may limit the recognition of the 2D cameras. Finally, in the third experiment, we tested 4 Lego bricks, which are considered difficult for both the 2D and the 3D cameras.

For the training phase, we captured two training images for each object with each camera, from different viewpoints. We then manually removed the background in these images in order to keep only the model objects. For the test phase, the NAO robot is requested to recognize an object appearing in front of it and to announce the result to the human. The two other cameras (IP and Axus) are placed on either side of the robot to help it improve the recognition. The three cameras capture the scene at the same time whenever the robot wants to recognize the object in the scene. To focus on the recognition task, the image region containing the object is restricted in order to avoid noise from the rest of the scene. For each of the three experiments, we performed 32 recognition tests with different objects of 4 classes (i.e. 8 tests per object). In each test, the objects were rotated and placed at different angles to the cameras in order to introduce additional uncertainty.

Table 5 shows the experimental results, comparing the recognition rate of each camera (using the proposed classifier individually) with that of the fusion of the three cameras. Note that the rate for each individual camera cannot be high, because of the confusion between similar objects and because the objects are rotated in each test. The fifth column is the result when we fuse the three cameras by a simple majority vote: each camera gives its own recognition result based on the proposed classifier, and we choose the output class voted for by the largest number of cameras. The last column shows the result of the Dempster-Shafer combination of the three cameras, which outperforms the majority voting and improves the recognition rate on average.

Fig. 4. Confusing objects used in the experiments.

Table 5. Experimental results

5 Conclusion

The work in this paper focuses on how to resolve uncertainty and imprecision in object recognition for a NAO robot. Since the robot may face difficulties during its visual operation due to lighting conditions, viewing angles and camera quality, we propose to add more cameras in order to improve the recognition rate. Each camera extracts feature points from the captured scene, then provides a mass function based on the matching between the input and the training images. After that, Dempster’s rule of combination is used to fuse the information from these cameras. The approach generalizes to both 2D and 3D cameras, and the experiments give positive results, which demonstrate the advantage of the fusion. Our future work will consider a more complex scenario in which the NAO robot builds a semantic map based on the recognition approach used in this work.