1 Introduction

In the future, most factory operations will be performed by autonomous robots that need visual feedback to work within their workspace (Pérez et al. 2016). However, a standard industrial robot manipulator is not yet equipped with a visual sensor that would let it manipulate objects at arbitrary locations accurately. Many vision techniques applicable to such systems have been introduced (Pérez et al. 2016; Wilson 2016), but stereo vision is the most commonly used because of its safety, wide range, and accuracy. Stereo vision is an imaging technique for recovering depth from camera images by comparing two views of the same scene (Borangiu and Dumitrache 2010; Corke 2011; Fevery et al. 2010). It has been applied in industrial automation and three-dimensional (3D) machine vision applications to perform tasks such as bin picking, volume measurement, and 3D object location and identification (Nof 2009; Pérez et al. 2016; Point_Grey 2015). This technique gives robots more flexible position control for tracking and grasping an unknown object at an arbitrary location (Borangiu and Dumitrache 2010). One important advantage of stereo vision is that it is non-invasive with respect to the surrounding environment, since it does not require additional light sources (Borangiu and Dumitrache 2010). Another advantage is that, among the passive methods, it extracts relative depth information directly (Kheng et al. 2010; Pérez et al. 2016).

To apply a stereo vision system in a desired control application, a coordinate transformation between the stereo camera and the robot arm is often required, namely eye-to-hand calibration. Eye-to-hand calibration is the task of computing the relative 3D position and orientation between the camera and the robot arm in an eye-to-hand camera configuration (Miseikis et al. 2016). This calibration is difficult to determine because of the non-linear relationship involved (Daniilidis 1999; Dornaika and Horaud 1998; Tao 2015; Wu et al. 2014). Previous researchers used mathematical approaches for calibration, such as generic geometric methods (Tsai and Lenz 1989) and closed-form methods (Dornaika and Horaud 1998; Tsai and Lenz 1989). However, those techniques suffer from long computation times. More recently, many researchers have adopted artificial intelligence (AI) approaches inspired by human brain behavior, such as fuzzy logic and neural networks, to reduce the computation time.

The use of neural networks and fuzzy logic for solving the camera-to-robot-arm coordinate transformation is reported in (Jafari and Jarvis 2004; Juang et al. 2015; Wu et al. 2014). A neural network does not require prior knowledge, but it needs sufficient training data and suitable learning algorithms. In contrast, a fuzzy logic system requires linguistic rules (If–Then rules) instead of learning examples as prior knowledge, and it is not capable of learning (Abe 2012). There exist numerous possibilities for fusing neural networks with fuzzy logic so that each technique can overcome its individual drawbacks and benefit from the other's merits.

This paper presents a method of integrating the measuring functions of a 3D binocular stereo vision system into an industrial robot system to manipulate a targeted object. We consider an ANFIS structure with a first-order Sugeno model containing 343 rules. Gaussian membership functions with the product inference rule are used at the fuzzification level. To adjust the parameters of the membership functions, we use the hybrid learning algorithm, which combines least-squares and gradient-descent methods (Jang 1993). The developed ANFIS controller has three inputs, the 3D object position \( (X_{c}, Y_{c}, Z_{c}) \) in the camera coordinate frame obtained by the stereo vision system, and three outputs, the 3D object position \( (X_{r}, Y_{r}, Z_{r}) \) in the robot frame.

The study covers the design techniques and procedures related to the vision system setup, stereo camera calibration, camera-to-robot coordinate calibration, and system performance analysis. The rest of this paper is structured as follows. Section 2 summarizes the stereo vision system, including stereo camera calibration, object feature extraction, and pose estimation; the camera-to-robot-arm coordinate transformation and the ANFIS training data are also described. Section 3 presents the experimental results and discussion. Finally, a brief conclusion is given in Sect. 4.

2 Eye-to-hand calibration for stereo vision-based object manipulation system

In this study, the stereo vision-based object manipulation system with eye-to-hand calibration using ANFIS is shown in Fig. 1. A personal computer (PC) is connected to the robot arm controller via an RS232 serial interface to deliver commands and receive responses during the robot arm movement process, while the gripper and the stereo camera are connected to the PC through USB interfaces. The PC also hosts the graphical user interface used to control the robot arm, gripper, and stereo camera. The stereo camera is built from two identical Logitech C310 cameras, aligned along the y axis and separated by a baseline along the x axis. It serves as the vision sensor that captures the object in the 3D world coordinate frame; the object's features are then extracted and its pose is estimated using an image processing algorithm. The robot arm controller is programmed to receive commands from the PC and drive the robot arm to the desired position, and it also sends the position of the robot arm end-effector back to the PC as feedback.

Fig. 1

Stereo vision based object manipulation system architecture

The 6-DOF robot arm is driven by the robot controller to move to the estimated object position, and the controller reads the end-effector position of the robot arm when the PC requests it. When the position is reached, the gripper grasps the object on command from the PC and the task continues. The stereo vision system and the eye-to-hand calibration using ANFIS, with a detailed description of each part of the system, are presented in the following sections.

2.1 Stereo vision system

Stereo vision attempts to compute the 3D data in a way similar to the human brain. A 3D binocular stereo vision system uses two cameras which capture images of the same scene from different positions, and then calculates the 3D coordinates for each pixel by comparing the parallax shifts between the two images (Borangiu and Dumitrache 2010; Corke 2011; Jiadi et al. 2014; Point_Grey 2015).

To use the cameras in a stereo vision system, knowledge of the camera model and its parameters is important. The projection of an object under the pinhole camera model is described in Fig. 2. The coordinate vector of a 3D point \( P = [X,Y,Z]^{T} \) is projected onto the 2D camera image plane as \( p = [x,y]^{T} \), and from similar triangles it can be calculated through Eq. (1), where the parameter f is the focal length of the camera.

$$ x = f\frac{X}{Z};\;y = f\frac{Y}{Z}. $$
(1)
Fig. 2

Pinhole camera model

The image coordinates are measured in pixels, while the spatial coordinates are in millimeters, so equations are needed to convert between these two measurement systems:

$$ u = k_{u} (x + x_{0} ) = k_{u} f\frac{X}{Z} + k_{u} x_{0} , $$
$$ v = k_{v} (y + y_{0} ) = k_{v} f\frac{Y}{Z} + k_{v} y_{0} , $$

where u and v are the pixel coordinates, \( k_{u} \) and \( k_{v} \) are the column-wise and row-wise pixel densities, respectively, measured in pixels per millimeter, and \( (x_{0}, y_{0}) \) is the offset of the principal point in the image plane. The relationship between the world reference frame and the image frame for the projected 3D point P′ can be formulated as:

$$ p' = A\,[R\,|\,t]\,P', \quad p'^{T} = \left[\, p^{T} \mid s \,\right], \quad t^{T} = \left[\, t_{1} \;\; t_{2} \;\; t_{3} \,\right], \quad P'^{T} = \left[\, P^{T} \mid 1 \,\right], $$
(2)

or

$$ \begin{bmatrix} x \\ y \\ s \end{bmatrix} = \begin{bmatrix} \alpha & \gamma & u_{0} \\ 0 & \beta & v_{0} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_{1} \\ r_{21} & r_{22} & r_{23} & t_{2} \\ r_{31} & r_{32} & r_{33} & t_{3} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}. $$
(3)

In Eq. (3), \( \alpha = f/k_{u} \) and \( \beta = f/k_{v} \) are the focal lengths in horizontal and vertical pixels, respectively [here f is the focal length in millimeters and \( (k_{u}, k_{v}) \) denote the pixel size in millimeters, i.e., the reciprocal of the pixel densities used above], \( (u_{0}, v_{0}) \) are the coordinates of the principal point, and γ is the skew factor that models non-orthogonal uv axes. Since (x, y, s) is homogeneous, the pixel coordinates are recovered by dividing x and y by the scale factor s.
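For illustration only, the following Python/NumPy sketch builds the intrinsic matrix of Eq. (3) with made-up parameter values and projects a homogeneous 3D point into pixel coordinates, dividing by the scale factor s as described above.

```python
import numpy as np

# Illustrative (made-up) intrinsic parameters: focal lengths in pixels,
# skew factor, and principal point.
alpha, beta, gamma, u0, v0 = 800.0, 800.0, 0.0, 320.0, 240.0
A = np.array([[alpha, gamma, u0],
              [0.0,   beta,  v0],
              [0.0,   0.0,   1.0]])

# Extrinsic parameters [R|t]: identity rotation and zero translation as an example.
R = np.eye(3)
t = np.zeros((3, 1))
Rt = np.hstack((R, t))                    # 3x4 matrix

P = np.array([100.0, 50.0, 1000.0, 1.0])  # homogeneous 3D point (mm)
x, y, s = A @ Rt @ P                      # Eq. (3)
u, v = x / s, y / s                       # divide by the scale factor s
print(u, v)
```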

The intrinsic and extrinsic camera parameters are obtained by camera calibration, which is essential in stereo vision systems and plays a crucial role in many computer vision tasks; the accuracy of the estimated object distances is determined by the precision of the calibration. Camera calibration is the process of determining the camera's intrinsic parameters and its extrinsic parameters with respect to the world coordinate system (Bouguet 2015; Corke 2011; Nguyen et al. 2015). The intrinsic parameters are the characteristics of the camera, i.e., (α, β, γ, u0, v0), where (u0, v0) is the principal point, α and β are the scale factors along the image u and v axes, and γ describes the skewness of the two image axes. The extrinsic parameters are the orientation and location of the camera, i.e., (R, t), where R is the rotation and t the translation of the right camera with respect to the left camera (Zhang et al. 2011).

In this study, object feature extraction based on the HSV color space is used to detect the targeted object in the workspace and to estimate its pose; the algorithm, implemented on the Matlab platform, is shown in Fig. 3. First, the stereo camera model is initialized and the object color threshold is adjusted in HSV space. Second, an image pair is captured with both cameras at the same time. Next, the object is extracted from each image by thresholding in HSV space; median filtering removes noise, and morphological opening and closing are applied to locate the object boundary. The centroid of the located object in both the left and right images is then calculated. Finally, the 3D location of the object is obtained by triangulation.
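The pipeline above is implemented in Matlab; as a rough analogue, the following Python/OpenCV sketch illustrates the single-image part of the pipeline (HSV thresholding, median filtering, morphological opening and closing, and centroid computation). The HSV bounds are placeholder values and would need to be tuned to the actual object color.

```python
import cv2
import numpy as np

def object_centroid(image_bgr, hsv_low=(0, 120, 70), hsv_high=(10, 255, 255)):
    """Return the (u, v) pixel centroid of the thresholded color blob, or None."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(hsv_low, np.uint8), np.array(hsv_high, np.uint8))
    mask = cv2.medianBlur(mask, 5)                           # noise removal
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # morphological opening
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # morphological closing
    m = cv2.moments(mask)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])        # centroid (u, v)

# Usage: run on the left and right images, then triangulate the two centroids.
# c_left  = object_centroid(cv2.imread("left.png"))
# c_right = object_centroid(cv2.imread("right.png"))
```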

Fig. 3

Flow chart diagram of object feature extraction and pose estimation

The configuration of the stereo vision system used in the approach task is shown in Fig. 4: the two cameras are fixed in the workspace and mounted in parallel. The distance between the two camera optical centers is b, and the cameras have the same focal length f. Given a reference point \( P(X_{p}, Y_{p}, Z_{p}) \), its projection is \( p_{1}(x_{1}, y_{1}) \) in image plane 1 and \( p_{2}(x_{2}, y_{2}) \) in image plane 2. By perspective projection, the image coordinates of P in the two image planes are then given by Eq. (4), with the simplified geometry shown in Fig. 5:

$$ \frac{X}{{x_{1} }} = \frac{Z}{f};\;\frac{Y}{{y_{1} }} = \frac{Z}{f};\;\frac{b - X}{{x_{2} }} = \frac{Z}{f}. $$
(4)
Fig. 4

Configuration scheme of stereo vision

Fig. 5

Triangulation scheme of stereo vision

In Fig. 4, we assume that the two cameras have identical camera parameters, which are obtained by stereo camera calibration in Matlab (Bouguet 2015). The images of P on the two cameras are \( p_{1} \) and \( p_{2} \), \( d = x_{1} + x_{2} \) is the parallax, and the Y-axis is perpendicular to the page (Liu and Chen 2009). From the principle of similar triangles we obtain Eq. (4). From Fig. 5, the baseline b can be written as Eq. (5), and Z, the depth of point P, follows from Eq. (6):

$$ b = \frac{Z}{f}x_{1} + \frac{Z}{f}x_{2} , $$
(5)
$$ Z = \frac{b \times f}{{x_{1} + x_{2} }}, $$
(6)

The disparity d is formed from the x coordinates of the projections in image 1 and image 2, as written in Eq. (7):

$$ d = x_{1} + x_{2} . $$
(7)

Substituting Eq. (7) into Eq. (6) gives the depth Z of point P as Eq. (8). Once Z is obtained, the X and Y coordinates of point P follow from Eqs. (9) and (10), respectively:

$$ Z = \frac{b \times f}{d}, $$
(8)
$$ X = \frac{{Z \times x_{1} }}{f} $$
(9)
$$ Y = \frac{{Z \times y_{1} }}{f}, $$
(10)

where \( x_{1} \) and \( x_{2} \) are the pixel locations in the 2D images, and X, Y, and Z are the actual 3D position coordinates.
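A minimal sketch of the triangulation of Eqs. (7)–(10), assuming the image coordinates x1, y1, x2 are expressed in the same units as the focal length f (e.g., both in pixels):

```python
def triangulate(x1, y1, x2, b, f):
    """Recover (X, Y, Z) of point P from the parallel stereo geometry of Fig. 5.

    x1, y1 : coordinates of p1 in image plane 1
    x2     : x coordinate of p2 in image plane 2
    b      : baseline between the two optical centers
    f      : focal length (same units as x1, y1, x2)
    """
    d = x1 + x2        # disparity, Eq. (7)
    Z = b * f / d      # depth, Eq. (8)
    X = Z * x1 / f     # Eq. (9)
    Y = Z * y1 / f     # Eq. (10)
    return X, Y, Z
```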

2.2 ANFIS-based eye-to-hand calibration

Because the camera and the robot arm view the targeted object from different frames, it is necessary to know the relative position and orientation between the gripper and the end-effector, the end-effector and the robot base, the object and the robot base, the camera and the robot base, and the object and the camera. These coordinate transformation relationships are shown in Fig. 6. \( {}^{B}\xi_{E} \) is the transformation from the end-effector to the robot base, which can be found from the forward kinematics of the 6-DOF robot arm using its DH parameters (Kucuk and Bingul 2006). \( {}^{E}\xi_{G} \) is the transformation between the end-effector and the gripper, which is only a translation along the z axis of the end-effector. \( {}^{C}\xi_{T} \) is the targeted object pose with respect to the camera frame, obtained by the stereo vision system. \( {}^{B}\xi_{T} \) is the targeted object pose with respect to the robot base, obtained by pointing the robot arm end-effector at the desired object position using the teaching box of the robot arm controller. \( {}^{B}\xi_{C} \) is the camera pose with respect to the robot base, which is obtained using the ANFIS method described in the following section.
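For illustration, if \( {}^{B}\xi_{C} \) were available as an explicit transform, the object pose in the robot base frame would follow by composing homogeneous transforms, \( {}^{B}\xi_{T} = {}^{B}\xi_{C} \cdot {}^{C}\xi_{T} \). The NumPy sketch below uses placeholder values; in this work the camera-to-base mapping is instead learned implicitly by the ANFIS.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Placeholder camera pose in the robot base frame (B_xi_C) and
# target position in the camera frame (C_xi_T), both in mm.
B_T_C = make_transform(np.eye(3), np.array([300.0, 0.0, 500.0]))
target_cam = np.array([-78.0, -50.3, 1022.6, 1.0])   # homogeneous point

# Target position in the robot base frame: B_xi_T = B_xi_C * C_xi_T.
target_base = B_T_C @ target_cam
print(target_base[:3])
```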

Fig. 6

Coordinate transformation relationship

2.2.1 ANFIS architecture

The ANFIS performs the mapping between input and output through a learning algorithm that optimizes the parameters of a given FIS. The ANFIS architecture consists of a fuzzy layer, a product layer, a normalized layer, a de-fuzzy layer, and a summation layer. For a simple explanation, Fig. 7 shows the structure of a two-input type-3 ANFIS with 4 rules, in which a circle indicates a fixed node and a square indicates an adjustable node. As an example, we consider two inputs x, y and one output z of the FIS. The ANFIS used in this paper implements a first-order Sugeno FIS. Among fuzzy systems, the Sugeno fuzzy model is the most widely applied because of its high interpretability, computational efficiency, and built-in optimal and adaptive techniques.

Fig. 7

ANFIS structure

According to Jang (1993), type-3 ANFIS uses Takagi–Sugeno if–then rules of the following form:

$$ {\text{Rule}}\;1:{\text{IF}}\;\;x\;\;{\text{is}}\;\;A_{1} \;{\text{and}}\;\;y\;\;{\text{is}}\;\;B_{1} \;\;{\text{THEN}}\;\;z_{1} = p_{1} x + q_{1} y + r_{1} , $$
$$ {\text{Rule }}2:{\text{IF}}\;\;x\;\;{\text{is}}\;\;A_{2} \;{\text{and}}\;\;y\;\;{\text{is}}\;\;B_{1} \;\;{\text{THEN}}\;\;z_{2} = p_{2} x + q_{2} y + r_{2} , $$
$$ {\text{Rule}}\;3:{\text{IF}}\;\;x\;\;{\text{is}}\;\;A_{1} \;\;{\text{and}}\;\;y\;\;{\text{is}}\;\;B_{2} \;\;{\text{THEN}}\;\;z_{3} = p_{3} x + q_{3} y + r_{3} , $$
$$ {\text{Rule}}\;4:{\text{IF}}\;\;x\;\;{\text{is}}\;\;A_{2} \;\;{\text{and}}\;\;y\;\;{\text{is}}\;\;B_{2} \;\;{\text{THEN}}\;\;z_{4} = p_{4} x + q_{4} y + r_{4} , $$
(11)

where x and y are the two input variables, \( z_{i}(x, y) \) (i = 1, ..., 4) are the rule outputs, \( A_{i} \) and \( B_{i} \) (i = 1, 2) are linguistic labels that cover the universes of discourse of the input variables, and \( p_{i} \), \( q_{i} \), and \( r_{i} \) (i = 1, ..., 4) are the linear consequent parameters. The output of each rule is a linear combination of the input variables plus a constant. A typical ANFIS consists of a 5-layer structure; the layers and their functions are described as follows:

Layer 1: fuzzification layer

In this layer, each node computes the membership value of the fuzzy sets \( A_{i} \) and \( B_{i} \) (i = 1, 2), i.e., \( \mu_{A_{i}}(x) \) and \( \mu_{B_{i}}(y) \). The membership value depends directly on the membership function. Since the membership functions have adjustable parameters, this layer is called an adaptive layer. The Gaussian membership function shown in Eq. (12) is used in this study:

$$ gaussmf(x, c_{i}, s_{i}) = e^{ - \frac{(x - c_{i})^{2}}{2 s_{i}^{2}} }, $$
(12)

where x is the input and \( (c_{i} ,s_{i} ) \) is the parameter set that changes the shape of the MF. The parameters of this layer are termed the premise parameters.

Layer 2: product layer

In this layer, the T-norm operation is used to calculate the firing strength of a rule via multiplication:

$$ \omega_{i} = \mu_{A_{i}}(x)\,\mu_{B_{i}}(y). $$
(13)

Layer 3: normalization layer

In this layer the ratio of a rule’s firing strength to the total of all firing strengths is calculated:

$$ \varpi_{i} = \frac{{\omega_{i} }}{{\sum\limits_{i = 1}^{4} {\omega_{i} } }} = \frac{{\omega_{i} }}{{\omega_{1} + \omega_{2} + \omega_{3} + \omega_{4} }}. $$
(14)

Layer 4: defuzzification layer

In the fourth layer, each node computes the weighted first-order (linear) consequent of its rule, i.e., the THEN part of the fuzzy rule:

$$ \varpi_{i} z_{i} (x,y) = \varpi_{i} (p_{i} x + q_{i} y + r_{i} ), $$
(15)

where \( \varpi_{i} \) is the output of layer 3 and \( \{ p_{i}, q_{i}, r_{i} \} \) is the consequent parameter set.

Layer 5: summation layer

The single node in the fifth layer is a fixed node that computes the overall output as the summation of all incoming signals:

$$ z = \sum\limits_{i = 1}^{4} {\varpi_{i} z_{i} } (x,y) = \frac{{\omega_{1} z_{1} + \omega_{2} z_{2} + \omega_{3} z_{3} + \omega_{4} z_{4} }}{{\omega_{1} + \omega_{2} + \omega_{3} + \omega_{4} }}. $$
(16)
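To make the five layers concrete, the following sketch evaluates the two-input, four-rule, first-order Sugeno ANFIS of Fig. 7 for given premise parameters (the Gaussian centers c and widths s) and consequent parameters (p, q, r). The numerical values are arbitrary; in practice they are produced by the hybrid learning algorithm.

```python
import numpy as np

def gaussmf(x, c, s):
    """Gaussian membership function, Eq. (12)."""
    return np.exp(-((x - c) ** 2) / (2.0 * s ** 2))

def anfis_forward(x, y, premise, consequent):
    """Forward pass of the 2-input, 4-rule type-3 ANFIS of Fig. 7.

    premise    : dict of (c, s) pairs for the MFs A1, A2, B1, B2
    consequent : list of four (p, q, r) tuples, one per rule
    """
    # Layer 1: fuzzification
    muA = [gaussmf(x, *premise["A1"]), gaussmf(x, *premise["A2"])]
    muB = [gaussmf(y, *premise["B1"]), gaussmf(y, *premise["B2"])]

    # Layer 2: product (rule firing strengths), rule pairing as in Eq. (11)
    w = np.array([muA[0] * muB[0], muA[1] * muB[0],
                  muA[0] * muB[1], muA[1] * muB[1]])

    # Layer 3: normalization, Eq. (14)
    w_bar = w / w.sum()

    # Layers 4-5: weighted first-order consequents and summation, Eqs. (15)-(16)
    z_rules = np.array([p * x + q * y + r for (p, q, r) in consequent])
    return float(np.dot(w_bar, z_rules))

# Example usage with arbitrary parameters.
premise = {"A1": (0.0, 1.0), "A2": (2.0, 1.0), "B1": (0.0, 1.0), "B2": (2.0, 1.0)}
consequent = [(1.0, 0.5, 0.0), (0.8, 0.2, 0.1), (0.5, 1.0, 0.0), (0.3, 0.7, 0.2)]
print(anfis_forward(1.0, 1.5, premise, consequent))
```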

2.2.2 ANFIS structure for computing camera to robot arm calibration

The ANFIS structure for computing the camera-to-robot-arm calibration is shown in Fig. 8. It consists of three ANFIS networks with first-order Sugeno fuzzy systems, one for each axis (x, y, z) of the camera-to-robot-arm 3D coordinate transformation. Configurations with 3, 5, and 7 Gaussian MFs per input, combined with the product inference rule at the fuzzification layer, are used, and the hybrid learning algorithm adjusts the premise and consequent parameters. In Fig. 8, \( {}^{C}\xi_{T} \), or \( (X_{c}, Y_{c}, Z_{c}) \), is the targeted object position with respect to the camera frame obtained by the stereo vision system, and \( {}^{B}\xi_{T} \), or \( (X_{r}, Y_{r}, Z_{r}) \), is the targeted object position with respect to the robot base obtained by positioning the end-effector at the desired object position with the teaching box of the robot arm controller. \( {}^{B}\xi_{C} \), the camera pose with respect to the robot base, is obtained by training the ANFIS.

Fig. 8

Proposed ANFIS architecture for computing eye to hand coordinate transformation

The training data are very important in the camera-to-robot-arm position calibration process for obtaining accurate 3D object positions. The training data generation consists of the following steps. First, the object was placed at one of the calibration positions. Second, the 3D object position was measured by the object feature extraction and position estimation process. Third, the 3D position of the object was recorded to a file. Fourth, steps one to three were repeated so that each position was measured twice for comparison. This whole procedure was then repeated for the next calibration position. In this research, 234 positions were calibrated, with two measurements for every point.
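As a minimal sketch of how the paired measurements might be organized for training (file names and layout are hypothetical), each per-axis ANFIS takes the full camera-frame coordinate as input and one robot-frame coordinate as target:

```python
import numpy as np

# Hypothetical files: each row is one measurement (X, Y, Z) in millimeters.
cam_xyz   = np.loadtxt("object_camera_frame.csv", delimiter=",")  # inputs  (Xc, Yc, Zc)
robot_xyz = np.loadtxt("object_robot_frame.csv", delimiter=",")   # targets (Xr, Yr, Zr)

# One first-order Sugeno ANFIS is trained per output axis (x, y, z),
# each mapping the 3D camera-frame coordinate to one robot-frame coordinate.
training_sets = [(cam_xyz, robot_xyz[:, axis]) for axis in range(3)]
```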

3 Experimental results

In this section, several experiments are performed to evaluate the ANFIS method for eye-to-hand calibration. Stereo camera calibration is applied before image processing to obtain both the intrinsic and extrinsic parameters; the eye-to-hand calibration is then performed with ANFIS. Figure 9 depicts the experimental setup. We chose a red cylindrical bottle cup as the target object, placed in the workspace. A pair of Logitech C310 cameras, each working at 640 × 480 pixels, is placed facing the robot arm to build the stereo vision system.

Fig. 9

Experimental setup

3.1 Experiment 1: stereo camera calibration

Before the eye-to-hand calibration, we performed stereo camera calibration to obtain the intrinsic and extrinsic camera parameters. In this paper, we use the method proposed by Bouguet (2015) with a classical black-and-white chessboard to calibrate the cameras. The stereo rig was set up with a baseline of 92 mm between the two cameras and then calibrated. Our chessboard has 63 squares in a 9 × 7 pattern, each square measuring 40 mm × 40 mm. An accurate chessboard size is required for an accurate estimation of the target object features. For the calibration procedure, 18 images of the chessboard at different positions and orientations, each 640 × 480 pixels, are captured by each camera simultaneously and then loaded into Matlab. The chessboard square corners are detected with sub-pixel accuracy as the input to the calibration method; the output includes the intrinsic, distortion, and extrinsic matrices of the two cameras as well as the perspective transformation matrix. All of these outputs are needed to re-project depth information to real-world coordinates. The results of the stereo camera calibration are shown in Table 1.
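The calibration here was performed with Bouguet's Matlab toolbox; as a rough, analogous sketch, a similar chessboard-based procedure could be reproduced in Python with OpenCV as follows (the board geometry is taken from the text, while file names and flags are illustrative assumptions):

```python
import cv2
import glob
import numpy as np

square = 40.0                                   # chessboard square size in mm
pattern = (8, 6)                                # inner corners of the 9 x 7 square board
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, left_pts, right_pts = [], [], []
for fl, fr in zip(sorted(glob.glob("left_*.png")), sorted(glob.glob("right_*.png"))):
    gl = cv2.imread(fl, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(fr, cv2.IMREAD_GRAYSCALE)
    okl, cl = cv2.findChessboardCorners(gl, pattern)
    okr, cr = cv2.findChessboardCorners(gr, pattern)
    if okl and okr:
        crit = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
        cl = cv2.cornerSubPix(gl, cl, (11, 11), (-1, -1), crit)   # sub-pixel corners
        cr = cv2.cornerSubPix(gr, cr, (11, 11), (-1, -1), crit)
        obj_pts.append(objp); left_pts.append(cl); right_pts.append(cr)

# Per-camera intrinsics (K, distortion), then the extrinsic R, t between the cameras.
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, gl.shape[::-1], None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, gr.shape[::-1], None, None)
_, K1, d1, K2, d2, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, d1, K2, d2, gl.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC)
```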

Table 1 The intrinsic and extrinsic parameters of stereo camera

From Table 1, the intrinsic parameters, including the focal length and principal point, are very similar for the right and left cameras. All of these parameters are used in the triangulation step. For the extrinsic parameters, the rotation matrix R between the two cameras is approximately the identity matrix, which means there is essentially no rotation, only translation, between the cameras. From the translation vector t, the translations along the y and z axes are small, and the baseline between the two cameras is 92 mm. According to these results, the stereo camera calibration was successful, and the intrinsic and extrinsic parameters can be used in the triangulation process. However, the resulting camera parameters are not perfectly accurate because the stereo rig is hand-made and the cameras have low resolution.

3.2 Experiment 2: object feature extraction and pose estimation

After the stereo camera calibration provided both the intrinsic and extrinsic camera parameters, the image processing system performed the object feature extraction and pose estimation described in Sect. 2. Figure 10 shows the results of each step for the two cameras. First, an image pair is captured with both cameras at the same time. Next, the object is extracted from each image by thresholding in HSV color space; median filtering removes noise, and morphological opening and closing are applied to locate the object boundary. The object boundary and the centroid of the located object are then found in both the left and right images. Finally, the position of the object is determined from the estimated centroids.

Fig. 10

Object feature extraction and pose estimation process. a Color image captured by left camera. b Color image captured by right camera. c HSV color image from left camera. d HSV color image from right camera. e Filtered image of red color object on left camera. f Filtered image of red color object on right camera. g Red color object pose estimation on left camera. h Red color object pose estimation on right camera

With respect to the stereo camera coordinate frame, the 3D object position is obtained using the triangulation method discussed in Sect. 2. The resulting 3D object position is (−78.03, −50.28, 1022.62) for (x, y, z), respectively, and is saved to a file for ANFIS training. According to the results, the color-based object detection using HSV color space thresholding succeeds in distinguishing the object from the background and is robust to lighting changes.

3.3 Experiment 3: ANFIS-based eye-to-hand calibration

Three ANFIS structures with first-order Sugeno fuzzy systems are trained to calibrate the position of the stereo camera with respect to the base frame of the robot arm, using 3, 5, and 7 Gaussian membership functions (MFs) and data collected twice per point. The product inference rule is used at the fuzzification layer, and the hybrid learning algorithm adjusts the premise and consequent parameters. The training procedure follows the steps described in Sect. 2. The 138 and 234 3D object points obtained from the two data collections, captured by the calibrated stereo vision system and used as ANFIS training inputs, are shown in Fig. 11a, b, respectively. Figure 12 shows the 3D object positions with respect to the base frame of the robot arm used as the ANFIS output training data, obtained by positioning the end-effector with the teaching box control.

Fig. 11

Input data of ANFIS Training. a 138 data recorded. b 234 data recorded

Fig. 12

Output data of training ANFIS

We further trained the proposed ANFIS structures with different numbers of MFs (3, 5, and 7) on the two data collections to obtain the minimum, or at least an acceptable, error. We found that 5 MFs gave the smallest error. Figure 13 depicts the training error for the X, Y, and Z axes using 5 MFs. At the end of the training process, the ANFIS network has learned the input-output mapping and is then evaluated with the testing data.

Fig. 13

The smallest training error of ANFIS in different number of membership functions. a X-axis training error. b Y-axis training error. c Z-axis training error

The details of the training results using different numbers of MFs for the two data collections are shown in Table 2. We compared the 138-point and 234-point data collections as inputs for ANFIS training using the stereo vision system. Table 2 summarizes the training errors for the 138 and 234 points of doubly captured data using 3, 5, and 7 membership functions. We found that 5 membership functions with the 234-point data collection generated the smallest training errors (shown in italics) compared with 3 and 7 membership functions. The 138-point training data could not reach the smallest error, regardless of the number of MFs.

Table 2 ANFIS training error

After training the ANFIS and obtaining the smallest training error, the ANFIS was tested by positioning the object at 16 points to assess the performance of the eye-to-hand calibration system, as shown in Fig. 14. Based on the experimental results, the ANFIS testing errors were (0.44, 2.01, 1.53) mm for the x, y, and z axes, respectively. This shows the success of the designed system, because the targeted object can still be reached and grasped by the gripper.

Fig. 14

Testing data for ANFIS

In this experiment, three processes are needed to estimate the 3D object pose: (1) image capture using the stereo camera, with an elapsed time of 0.001546 s; (2) object feature extraction and 3D pose estimation, with an elapsed time of 0.313688 s, in which the two captured images are processed to detect the object centroid using HSV color thresholding and the 3D pose is estimated by triangulation; and (3) 3D object position estimation using the ANFIS-based eye-to-hand calibration, with an elapsed time of 0.187859 s. The total elapsed time is therefore 0.503093 s, corresponding to a sampling rate of roughly 2 Hz. With a modest improvement in the efficiency of process 2 or 3, the sampling rate could be increased further, which would be adequate for moderate-speed applications.

4 Conclusions

In this work, a calibrated stereo camera was successfully developed and integrated into a stereo vision-based object manipulation system with eye-to-hand calibration using the ANFIS method. Based on the experimental results, it is concluded that eye-to-hand calibration with the ANFIS method achieves good performance and can therefore be implemented in different applications such as object tracking and grasping.