1 Introduction

Mobile manipulators for material handling are extensively used to transfer material in production lines characterized by short product life cycles, small volumes and a wide variety of products [1]. The guidance control system of the mobile base of a mobile manipulator inevitably introduces position and orientation errors of the mobile base. Uncertainty in either the location of the mobile base or the location of the object produces a position mismatch, which typically causes the subsequent pick-and-place operation to fail. A direct means of solving this problem is to use a CCD camera to provide visual feedback; the camera is generally attached to the end-effector of the manipulator.

The visual servo control structures of such an eye-in-hand manipulator can be classified into four major categories [2], according to the coordinates in which the error signal is defined and whether the control structure is hierarchical. In position-based control, the control objective is to eliminate errors defined by the pose of the target with respect to the camera, and the features extracted from the image plane are used to estimate the spatial pose of the target. The control law then sends commands to the joint-level controllers that drive the servomechanism; this hierarchical architecture is called a dynamic position-based look-and-move control structure. If the servomechanism is driven directly by the control law rather than by the joint-level controllers, the architecture is referred to as a position-based visual servo control structure. In image-based control, by contrast, the errors fed to the control law are determined directly from the features extracted from the image plane. If these errors are then sent to the joint-level controllers that drive the servomechanism, the architecture is hierarchical and is called a dynamic image-based look-and-move control structure. If instead the visual servo controller eliminates the joint-level controllers and controls the servomechanism directly, the architecture is termed an image-based visual servo control structure.

Apart from the traditional visual servo control methods mentioned above, behavior-based approaches to visual servo control have already been presented in the literature [3, 4]. An uncalibrated eye-in-hand vision system is adopted to provide visual information to the mobile manipulator to pick up a workpiece located on the station. This paper focuses on the development of the behavioristic image-based visual servoing algorithm for the robot manipulator mounted on the mobile base.

2 Image processing

In this study, the workpiece to be picked up by the manipulator’s end-effector is a rectangular parallelepiped. The images of the workpiece are captured from above by a CCD camera attached to the end-effector as the end-effector moves toward the workpiece, so the image of the workpiece’s top surface is a quadrangle. Only information about the quadrangle is of interest, so the captured RGB color image is first converted into the YCbCr color space. The converted image is then preprocessed by thresholding, followed by morphological opening and closing operations, to yield a clear binary image of the target area, as shown in Fig. 1. Finally, all the image features can be obtained from this target area.

Fig. 1 Image of the workpiece’s top surface
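As an illustration, the preprocessing described above might be implemented with OpenCV as in the following sketch; the choice of the Cr channel, the threshold value and the structuring-element size are assumptions made for illustration, not the parameters used in this study.

```python
import cv2
import numpy as np

def extract_target_region(bgr_image, thresh=140, kernel_size=5):
    """Segment the workpiece's top surface from a color image.

    The frame is converted to YCbCr, a chrominance channel is thresholded,
    and morphological opening/closing clean the result so that a clear
    binary image of the quadrangular target area remains.
    """
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)    # OpenCV orders channels Y, Cr, Cb
    cr = ycrcb[:, :, 1]                                     # chrominance channel (assumed discriminative)
    _, binary = cv2.threshold(cr, thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # remove small speckles
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # fill small holes
    return binary
```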

3 Control strategy for picking up the target object

3.1 Selecting image features

Six image features are used to determine the translational and orientational motion in 3D. Each image feature uniquely directs one D.O.F. of motion relative to the camera frame. Before the image features are defined, the coordinate symbols are introduced. Specifically, as displayed in Fig. 2, let \( \delta {}^{E}X_{1} \), \( \delta {}^{E}X_{2} \) and \( \delta {}^{E}X_{3} \) represent differential changes in translation along the \( {}^{E}X \), \( {}^{E}Y \) and \( {}^{E}Z \) axes of the end-effector frame. In this study, the end-effector frame is fixed on the surface of the flange of the manipulator’s last link, with the \( {}^{E}Z \) axis along the last joint axis. Let \( \delta {}^{E}X_{4} \), \( \delta {}^{E}X_{5} \) and \( \delta {}^{E}X_{6} \) denote the differential changes in orientation about the \( {}^{E}X \), \( {}^{E}Y \) and \( {}^{E}Z \) axes of the end-effector frame. The camera frame must also be considered. The origin of the camera frame is at the center of the camera, and its z-axis is the optical axis of the camera. Similarly, \( \delta {}^{C}X_{i} \), for i = 1, 2,…,6, represent the differential movements with respect to the camera frame.

Fig. 2 End-effector and object frames

The two-dimensional geometric moment, \( m_{pq} \), and central moment, \( \mu_{pq} \), of order p + q for an M × N discretized image, \( f(x,y) \), are defined as

$$ m_{pq} = \sum_{y = 0}^{N - 1} \sum_{x = 0}^{M - 1} x^{p} y^{q} f(x,y), $$
(1)
$$ \mu_{pq} = \sum_{y = 0}^{N - 1} \sum_{x = 0}^{M - 1} (x - x_{g})^{p} (y - y_{g})^{q} f(x,y), $$
(2)

where \( (x_{g}, y_{g}) \) are the coordinates of the image centroid. According to the definitions and properties of image moments [5], the six image features are selected as follows.

\( F_{1} = y_{g} = m_{01}/m_{00} \), the y-coordinate of the centroid of the quadrangular image. It directs camera motion \( {}^{C}X_{1} \);

\( F_{2} = x_{g} = m_{10}/m_{00} \), the x-coordinate of the centroid of the quadrangular image. It directs camera motion \( {}^{C}X_{2} \);

\( F_{3} = 1 - A_{2}/A_{1} \), where \( A_{i} \) denotes the area of the quadrangle, \( m_{00} \), with \( i = 1 \) in the desired pose and \( i = 2 \) in the present pose. \( F_{3} \) guides camera motion \( {}^{C}X_{3} \);

\( F_{4} = f_{y} = (s_{2} c_{3} - c_{2} s_{3})/K \), which captures the variation due to rotation about the axis that passes through the center of gravity of the quadrangle and is parallel to the V-axis of the image plane, where \( s_{2} = \mu_{30} - 3\mu_{12} \), \( c_{2} = \mu_{03} - 3\mu_{21} \), \( c_{3} = \left( \mu_{20} - \mu_{02} \right)^{2} - 4\mu_{11}^{2} \), \( s_{3} = 4\mu_{11} \left( \mu_{20} - \mu_{02} \right) \), \( K = I_{1} I_{3}^{3/2} /\sqrt {m_{00}} \), \( I_{1} = \left( \mu_{20} - \mu_{02} \right)^{2} + 4\mu_{11}^{2} \), and \( I_{3} = \mu_{20} + \mu_{02} \). \( F_{4} \) directs camera motion \( {}^{C}X_{4} \);

\( F_{5} = f_{x} = (c_{2} c_{3} + s_{2} s_{3})/K \), which captures the variation due to rotation about the axis that passes through the center of gravity of the quadrangle and is parallel to the U-axis of the image plane. It directs camera motion \( {}^{C}X_{5} \);

\( F_{6} = 0.5\tan^{-1}\left[ 2\mu_{11} /\left( \mu_{20} - \mu_{02} \right) \right] \), the angle between the \( U \) axis and the principal axis of least inertia of the quadrangle in the image plane. \( F_{6} \) guides camera motion \( {}^{C}X_{6} \). A sketch of the computation of these features from the image moments is given below.
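For illustration, the following sketch computes the six features from the raw and central moments of the binary target image (Eqs. 1 and 2); the function name and the use of NumPy are assumptions, and the reference area \( A_{1} \) would come from the teach-by-showing step described in Sect. 3.4.

```python
import numpy as np

def image_features(binary, A1):
    """Compute F1..F6 from a binary image f(x, y) of the quadrangle.

    A1 is the area m00 of the quadrangle in the desired (reference) pose.
    """
    f = (binary > 0).astype(float)
    N, M = f.shape                                   # rows (y) and columns (x)
    X, Y = np.meshgrid(np.arange(M), np.arange(N))

    m = lambda p, q: np.sum(X**p * Y**q * f)         # geometric moments m_pq, Eq. (1)
    m00, m10, m01 = m(0, 0), m(1, 0), m(0, 1)
    xg, yg = m10 / m00, m01 / m00                    # image centroid

    mu = lambda p, q: np.sum((X - xg)**p * (Y - yg)**q * f)   # central moments mu_pq, Eq. (2)
    mu11, mu20, mu02 = mu(1, 1), mu(2, 0), mu(0, 2)
    mu30, mu03, mu21, mu12 = mu(3, 0), mu(0, 3), mu(2, 1), mu(1, 2)

    s2, c2 = mu30 - 3 * mu12, mu03 - 3 * mu21
    c3 = (mu20 - mu02)**2 - 4 * mu11**2
    s3 = 4 * mu11 * (mu20 - mu02)
    I1 = (mu20 - mu02)**2 + 4 * mu11**2
    I3 = mu20 + mu02
    K = I1 * I3**1.5 / np.sqrt(m00)

    F1, F2 = yg, xg                                  # centroid coordinates
    F3 = 1.0 - m00 / A1                              # relative area change
    F4 = (s2 * c3 - c2 * s3) / K
    F5 = (c2 * c3 + s2 * s3) / K
    # principal-axis angle; atan2 avoids division by zero and equals
    # 0.5 * atan(2*mu11 / (mu20 - mu02)) up to the quadrant
    F6 = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    return F1, F2, F3, F4, F5, F6
```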

3.2 Motion planning based on behavior design

In this work, fuzzy rules are used to enable the controllers to map image features in the image space onto motion commands in the camera space. These commands are then transformed into commands relative to the end-effector frame, which eventually control the manipulator.

3.2.1 Six main behaviors

The complete manipulation task can be divided into two categories of behaviors: vision-based behaviors (Center 1, Center 2, Zoom, Yaw, Pitch and Roll) and non-vision-based behaviors (Top and Catch). The manipulation starts with the Top behavior, which moves the end-effector to the top location, where the pose of the end-effector relative to the manipulator base is fixed. The first three vision-based behaviors are translational motions of the camera toward the workpiece, while the other three are orientational motions of the camera. Catch is a non-vision-based behavior; it is activated only when the end-effector has reached the target location, and it moves the end-effector a short distance forward, after which the gripper closes to grasp the workpiece. These behaviors are defined as follows:

Center 1 is based on the first image feature, \( F_{1} \). This behavior translates the camera along the \( {}^{C}X \) axis of the camera frame to keep the centroid of the quadrangular image at the desired pixel in the image plane.

Center 2 is based on the second image feature, \( F_{2} \). This behavior translates the camera along the \( {}^{C}Y \) axis of the camera frame to keep the centroid of the quadrangular image at the desired pixel in the image plane.

Zoom is based on \( F_{3} \); it moves the camera along the \( {}^{C}Z \) axis of the camera frame to keep the size of the object at a predefined value.

Yaw and Pitch are based on \( F_{4} \) and \( F_{5} \), respectively; they rotate the camera about \( {}^{C}X \) and \( {}^{C}Y \) to keep the top surface of the target object parallel to the image plane.

Roll is based on \( F_{6} \); it rotates the camera about \( {}^{C}Z \) so that the principal angle equals that in the reference image, in which the gripper’s two fingers are parallel to the short sides of the target.

Vision-based behaviors are defined from the perspective of an eye-in-hand camera, so movements are performed relative to the camera frame. Figure 3 presents these behaviors.

Fig. 3 Movements associated with defined behaviors

3.2.2 Neural-fuzzy controller

The control law and membership functions of a traditional fuzzy controller are usually determined by experience. However, the manipulation tasks considered here are nonlinear and coupled, and no single set of membership functions is suitable for the entire work environment. Among artificial neural network learning schemes, the back-propagation architecture is the most popular and effective for solving complex and ill-defined problems. Hence, six simple NFCs (neural-fuzzy controllers) [6] using back-propagation are designed herein. Each controller takes one image feature as its input and outputs a change in one D.O.F. of the camera motion. The back-propagation algorithm is used only to adjust the consequents of the fuzzy rules of each NFC at each iteration during the manipulation. Restated, the camera is guided intuitively by the image features on the image plane.

Fuzzy singleton rules are applied to simplify the NFCs; they are defined as follows:

$$ R_{i}^{j}: \text{ if } \delta F_{i} \text{ is } A_{i}^{j}, \text{ then } \delta {}^{C}X_{i} \text{ is } w_{i}^{j}, $$
(3)

where the input variable \( \delta F_{i} \) is an image feature error; the output variable \( \delta {}^{C}X_{i} \) denotes a relative motion command in the camera frame; \( A_{i}^{j} \) are linguistic terms of the precondition part with membership functions \( \mu_{A_{i}^{j}}(\delta F_{i}) \); and \( w_{i}^{j} \) are the real numbers of the consequent part, for i = 1, 2,…,6 and j = 1, 2,…,7. That is, each index i = 1, 2,…,6 corresponds to an NFC with seven rules that controls one D.O.F. of motion relative to the camera frame. The input membership functions and the singletons of the consequent parts of the fuzzy rules of each NFC are shown in Fig. 4. A simplified defuzzifier is used here: the final output \( \delta {}^{C}X_{i} \) of the neural-fuzzy system is determined as \( \delta {}^{C}X_{i} = \sum\nolimits_{j = 1}^{7} \mu_{A_{i}^{j}}(\delta F_{i})\, w_{i}^{j} \).

Fig. 4 Membership functions of the input/output variables
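A minimal sketch of one such NFC is given below; it assumes seven triangular input membership functions whose centers are supplied by the user (the exact shapes in Fig. 4 are not reproduced here) and uses the simplified weighted-sum defuzzifier defined above.

```python
import numpy as np

class SingletonNFC:
    """One-input/one-output neural-fuzzy controller with 7 singleton rules."""

    def __init__(self, centers, w_init=None):
        # centers: ascending centers of the seven triangular sets A_i^1..A_i^7
        self.centers = np.asarray(centers, float)
        self.w = np.zeros(7) if w_init is None else np.asarray(w_init, float)

    def firing_strengths(self, dF):
        """Triangular membership degrees mu_{A_i^j}(dF); outer sets are shoulders."""
        mu = np.zeros(7)
        c = self.centers
        for j in range(7):
            left = c[j - 1] if j > 0 else -np.inf
            right = c[j + 1] if j < 6 else np.inf
            if left < dF <= c[j]:
                mu[j] = 1.0 if left == -np.inf else (dF - left) / (c[j] - left)
            elif c[j] < dF < right:
                mu[j] = 1.0 if right == np.inf else (right - dF) / (right - c[j])
        return mu

    def output(self, dF):
        """Simplified defuzzification: delta_CX_i = sum_j mu_j * w_j."""
        return float(self.firing_strengths(dF) @ self.w)
```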

In this work, only the real numbers \( w_{i}^{j} \) are tuned on-line. Accordingly, the error function to be minimized is defined by

$$ E_{i} = \frac{1}{2}(F_{i}^{r} - F_{i} )^{2} = \frac{1}{2}(\delta F_{i} )^{2} $$
(4)

The derivative of the error function \( E_{i} \) with respect to the jth consequent can be approximated as:

$$ \frac{{\partial E_{i} }}{{\partial w_{i}^{j} }} = \frac{{\partial E_{i} }}{{\partial (\delta {}^{C}X_{i} )}}\frac{{\partial (\delta {}^{C}X_{i} )}}{{\partial w_{i}^{j} }} \approx \frac{{E_{i} (t) - E_{i} (t - 1)}}{{\delta {}^{C}X_{i} (t) - \delta {}^{C}X_{i} (t - 1)}}\frac{{\partial (\delta {}^{C}X_{i} )}}{{\partial w_{i}^{j} }} $$
(5)

To decrease \( E_{i} \) with respect to \( w_{i}^{j} \), the change of the consequent at time stage t, \( \Delta w_{i}^{j}(t) \), can be chosen as \( \Delta w_{i}^{j}(t) = -\eta\, \partial E_{i}/\partial w_{i}^{j} \), where \( \eta \) is the learning-rate parameter. Additionally, to speed up learning, the learning rate [7] is designed as a saturated linear function of the feature error:

$$ \eta_{i}\left( \delta F_{i} \right) = \begin{cases} \left( \eta_{i} \right)_{\max}, & \text{if } \left| \delta F_{i} \right| \ge \left( \delta F_{i} \right)_{\text{sat}} \\ \dfrac{\left( \eta_{i} \right)_{\max} - \left( \eta_{i} \right)_{\min}}{\left( \delta F_{i} \right)_{\text{sat}}} \left| \delta F_{i} \right| + \left( \eta_{i} \right)_{\min}, & \text{if } \left| \delta F_{i} \right| < \left( \delta F_{i} \right)_{\text{sat}} \end{cases} $$
(6)

where \( \left( \eta_{i} \right)_{\min} \) and \( \left( \eta_{i} \right)_{\max} \) are, respectively, the minimum and maximum learning rates chosen by the user, and \( \left( \delta F_{i} \right)_{\text{sat}} \) is a saturation value of the feature error, also chosen by the user. Thus, the learning rule for adapting the consequent at time stage t is \( w_{i}^{j}(t + 1) = w_{i}^{j}(t) + \Delta w_{i}^{j}(t) \).
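A sketch of this on-line tuning step, combining the finite-difference gradient approximation of Eq. (5) with the saturated-linear learning rate of Eq. (6), is given below; it assumes the SingletonNFC sketch above and that the previous-step values of the feature error and controller output are stored by the caller.

```python
import numpy as np

def update_consequents(nfc, dF, dF_prev, dCX, dCX_prev,
                       eta_min, eta_max, dF_sat):
    """One back-propagation step for the singleton consequents w_i^j.

    E = 0.5 * dF**2, and dE/dw_j is approximated by
    (E(t) - E(t-1)) / (dCX(t) - dCX(t-1)) * mu_j,
    since d(dCX)/dw_j = mu_j for the simplified defuzzifier.
    """
    # saturated-linear learning rate, Eq. (6)
    if abs(dF) >= dF_sat:
        eta = eta_max
    else:
        eta = (eta_max - eta_min) / dF_sat * abs(dF) + eta_min

    E, E_prev = 0.5 * dF**2, 0.5 * dF_prev**2
    denom = dCX - dCX_prev
    if abs(denom) < 1e-9:                      # skip update when the output is unchanged
        return
    grad_common = (E - E_prev) / denom         # finite-difference part of Eq. (5)
    mu = nfc.firing_strengths(dF)              # d(dCX)/dw_j for each rule j
    nfc.w -= eta * grad_common * mu            # w_j(t+1) = w_j(t) - eta * dE/dw_j
```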

3.3 Rough motion transformation

In this work, the pose of the camera relative to the end-effector is invariant, as shown in Fig. 5(a), so the camera and the end-effector can be regarded as a single rigid body. After analyzing the motion of this rigid body, the transformation from the output values in the camera frame to the motion commands in the end-effector frame is obtained.

Fig. 5 (a) End-effector and camera frames, (b) unexpected camera displacement caused by a rotation about \( {}^{E}X \)

Figure 5(b) demonstrates that if the controller output is only a rotation \( \delta {}^{C}X_{4} \) about \( {}^{C}X \) but the command sent to the manipulator is a rotation about \( {}^{E}X \), then the unexpected camera displacements in the \( - {}^{C}Y \) and \( - {}^{C}Z \) directions are \( \delta {}^{C}X_{2}^{'} \) and \( \delta {}^{C}X_{3}^{'} \), respectively. Similar situations occur in \( \delta {}^{C}X_{5} \) and \( \delta {}^{C}X_{6} \). Accordingly, the motion command sent to the manipulator should be transformed as:

$$ \begin{aligned} \delta {}^{E}X_{1} &= \delta {}^{C}X_{1} - d_{z} \sin (\delta {}^{C}X_{6}) \tan (\delta {}^{C}X_{6} /2) - 2d_{k} \sin (\delta {}^{C}X_{5} /2) \cos \bigl( 90^{\circ} - \delta {}^{C}X_{5} /2 - \tan^{-1}(d_{z} /d_{x}) \bigr) \\ \delta {}^{E}X_{2} &= \delta {}^{C}X_{2} + d_{z} \sin (\delta {}^{C}X_{6}) + d_{x} \sin (\delta {}^{C}X_{4}) \\ \delta {}^{E}X_{3} &= \delta {}^{C}X_{3} + d_{x} \sin (\delta {}^{C}X_{4}) \tan (\delta {}^{C}X_{4} /2) - 2d_{k} \sin (\delta {}^{C}X_{5} /2) \sin \bigl( 90^{\circ} - \delta {}^{C}X_{5} /2 - \tan^{-1}(d_{z} /d_{x}) \bigr) \\ \delta {}^{E}X_{4} &= \delta {}^{C}X_{4}, \quad \delta {}^{E}X_{5} = \delta {}^{C}X_{5}, \quad \delta {}^{E}X_{6} = \delta {}^{C}X_{6}, \end{aligned} $$
(7)

where \( d_{x} \) is the distance between the axes \( {}^{C}X \) and \( {}^{E}X \), and \( d_{z} \) is the distance between the axes \( {}^{C}Z \) and \( {}^{E}Z \). The two line segments associated with \( d_{x} \) and \( d_{z} \) are assumed to be mutually perpendicular; they are measured roughly with a ruler by eye. The diagonal distance is therefore taken as \( d_{k} = \sqrt{d_{x}^{2} + d_{z}^{2}} \).
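A direct transcription of Eq. (7) might look as follows; the offsets dx and dz are the roughly measured distances defined above, and angles are assumed to be in radians, so the 90° term becomes π/2.

```python
import numpy as np

def camera_to_end_effector(dCX, dx, dz):
    """Rough motion transformation of Eq. (7).

    dCX: relative motion command [dCX1..dCX6] in the camera frame
         (translations in length units, rotations in radians).
    Returns the corresponding command in the end-effector frame.
    """
    dk = np.hypot(dx, dz)                        # dk = sqrt(dx^2 + dz^2)
    c1, c2, c3, c4, c5, c6 = dCX
    # arctan2(dz, dx) equals tan^-1(dz/dx) here, since dx and dz are positive distances
    phi = np.pi / 2 - c5 / 2 - np.arctan2(dz, dx)

    e1 = c1 - dz * np.sin(c6) * np.tan(c6 / 2) - 2 * dk * np.sin(c5 / 2) * np.cos(phi)
    e2 = c2 + dz * np.sin(c6) + dx * np.sin(c4)
    e3 = c3 + dx * np.sin(c4) * np.tan(c4 / 2) - 2 * dk * np.sin(c5 / 2) * np.sin(phi)
    return [e1, e2, e3, c4, c5, c6]
```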

3.4 Control strategy

The behavior-based look-and-move control structure is depicted in Fig. 6. The manipulation starts with the Top behavior, which moves the end-effector to the top location, as depicted in Fig. 2, to perform pick-and-place tasks. \( F_{i}^{r} \) is the value of \( F_{i} \) measured by the teach-by-showing method. The reference image is captured by the vision system when the end-effector is driven by a teaching box to the target location, \( {}^{o}T_{t} \), relative to the object frame, as shown in Fig. 2, where the object frame is fixed on the top surface of the object. The target location is defined as follows. The end-effector is first driven by the teaching box to a location that allows the gripper to grasp the workpiece. It is then driven to a second location, a safe distance from the first (in this study, 10 cm above it); this second location is called the target location. Consequently, the reference features that correspond to the reference image at the target location are \( F_{1}^{r} \), \( F_{2}^{r} \), \( F_{3}^{r} \), \( F_{4}^{r} \), \( F_{5}^{r} \) and \( F_{6}^{r} \).

Fig. 6 Behavior-based look-and-move control structure

Each \( \delta F_{i} \), for i = 1, 2,…,6, defined as the error between \( F_{i} \) and the feature value of the reference image at the target location, \( F_{i}^{r} \), is then input to the proposed behavior-based controller to obtain the motion command in the end-effector frame. The internal structure of the behavior-based controller is shown in Fig. 7. The relative motion command in the camera frame generated from the NFCs can be represented as a six-element vector \( [\delta {}^{C}X_{1} ,\delta {}^{C}X_{2} ,\delta {}^{C}X_{3} ,\delta {}^{C}X_{4} ,\delta {}^{C}X_{5} ,\delta {}^{C}X_{6} ]^{T} \). The rough motion transformation can be applied to transform the relative motion command in the camera frame to the relative motion command \( [\delta {}^{E}X_{1} ,\delta {}^{E}X_{2} ,\delta {}^{E}X_{3} ,\delta {}^{E}X_{4} ,\delta {}^{E}X_{5} ,\delta {}^{E}X_{6} ]^{T} \) in the end-effector frame.

Fig. 7 Internal structure of behavior-based controller

The elements of the original relative motion command in the camera frame are mutually independent, but most elements of the relative motion command in the end-effector frame are coupled. To achieve smoother manipulation, the behaviors in the end-effector frame are fused by the proposed behavior fusion scheme to produce a final command that drives the end-effector toward the target location. The inputs to the behavior fusion controller are the image feature errors, \( \delta F_{i} \), for i = 1, 2,…,6, while the outputs are the normalized fusion weights of the behaviors. Before being input to the fuzzy fusion-weight controller, each image feature error \( \delta F_{i} \) is compared with a preset threshold to decide whether the corresponding behavior needs to be fused. The membership functions of the input and output variables are displayed in Fig. 8. According to the membership functions of the input variables, the rule base of the fusion-weight controller consists of seven rules. At each time stage, the input variables are fuzzified and the fuzzy inference is performed. The crisp value of the behavior fusion weight, \( W_{i} \), is then determined by center-of-gravity defuzzification. Finally, the defuzzified weights are normalized as \( N_{i} = W_{i} / \sum W_{i} \). Accordingly, the fused relative motion command in the end-effector frame becomes \( {}^{E}X = \left[ {}^{E}X_{i} \right]_{6 \times 1} = \left[ \delta {}^{E}X_{i} \cdot N_{i} \right]_{6 \times 1} \). In the speed command module, the motion speed of the manipulator is dynamically adjusted to a percentage (10–40%) of full speed based on the third image feature error, \( \delta F_{3} \). A sketch of this fusion step is given after Fig. 8.

Fig. 8 Membership functions of the input/output variables
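The fusion and speed-command steps described above might be sketched as follows; the fusion-weight controller is assumed to be available as a callable, and the thresholds and the speed mapping are illustrative assumptions rather than the values used in this study.

```python
import numpy as np

def fuse_behaviors(dEX, dF, weight_ctrl, thresholds):
    """Fuse the six end-effector-frame behaviors into one command.

    dEX        : relative motion command [dEX1..dEX6] from Eq. (7)
    dF         : image feature errors [dF1..dF6]
    weight_ctrl: fuzzy controller returning a crisp weight W_i for |dF_i|
    thresholds : per-feature activation thresholds
    """
    W = np.array([weight_ctrl(abs(e)) if abs(e) > t else 0.0
                  for e, t in zip(dF, thresholds)])
    if W.sum() == 0.0:
        return np.zeros(6)                       # no behavior active: target reached
    N = W / W.sum()                              # normalized fusion weights N_i
    return np.asarray(dEX) * N                   # fused command EX_i = dEX_i * N_i

def speed_percentage(dF3, lo=10.0, hi=40.0, dF3_max=200.0):
    """Speed command: 10-40 % of full speed, scaled by |dF3| (illustrative mapping)."""
    return lo + (hi - lo) * min(abs(dF3) / dF3_max, 1.0)
```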

During the manipulation, the non-vision-based behavior Top is initially activated to move the end-effector of the manipulator to the top location. Whenever \( \left| \delta F_{i} \right| \) exceeds the specified limiting value, \( \varepsilon_{i} \), the corresponding vision-based behaviors are activated and fused to command the end-effector to approach the target location. This process is performed iteratively until all the image feature errors are below the specified limiting values. Finally, the non-vision-based behavior Catch is triggered, and the end-effector is commanded to move through a short distance (in this study, 10 cm) along the \( {}^{E}Z \) axis of the end-effector frame to grasp the workpiece.
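Combining the sketches above, the overall look-and-move loop of Fig. 6 could be organized as follows; the robot and camera interfaces (move_to_top_location, move_relative, grab, and so on) are hypothetical placeholders, and the on-line tuning of the NFC consequents is omitted for brevity.

```python
def pick_up_workpiece(robot, camera, nfcs, F_ref, A1, eps, dx, dz,
                      weight_ctrl, thresholds):
    """Behavior-based look-and-move manipulation (Sect. 3.4), as a sketch."""
    robot.move_to_top_location()                      # Top behavior (non-vision-based)
    while True:
        binary = extract_target_region(camera.grab())
        F = image_features(binary, A1)                # A1: area m00 in the reference image
        dF = [Fr - Fi for Fr, Fi in zip(F_ref, F)]    # image feature errors
        if all(abs(e) <= e_lim for e, e_lim in zip(dF, eps)):
            break                                     # all errors within the limits eps_i
        dCX = [nfc.output(e) for nfc, e in zip(nfcs, dF)]   # camera-frame command
        dEX = camera_to_end_effector(dCX, dx, dz)           # rough motion transformation
        cmd = fuse_behaviors(dEX, dF, weight_ctrl, thresholds)
        robot.move_relative(cmd, speed=speed_percentage(dF[2]))
        # the NFC consequents w_i^j would also be updated here, as in Sect. 3.2.2
    robot.move_along_ez(0.10)                         # Catch: 10 cm along the EZ axis
    robot.close_gripper()                             # grasp the workpiece
```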

4 Experimentation

To verify the proposed control strategy, the mobile base is stopped and fixed next to the workstation, roughly parallel to the workstation surface. As presented in Fig. 9, the workpiece is oriented in three directions and placed in six different positions, separated by 15 cm, to simulate the orientation and position errors that may arise in the application stage. At each location, the workpiece is also tilted by 15° and 25° relative to the station surface to simulate non-flat ground in the application stage.

Fig. 9 Possible locations of the workpiece to be picked up

In the experiments performed to evaluate the positioning performance of the eye-in-hand manipulator, the end-effector of the manipulator is first driven to the top location. The end-effector is then visually guided to grasp the workpiece according to the proposed control strategy with the preset parameters. In the approaching stage, the image feature errors decrease as the number of execution steps grows. This stage continues until the image feature errors, \( \delta F_{i} \), for i = 1, 2,…,6, are below \( \varepsilon_{1} = 2 \), \( \varepsilon_{2} = 2 \), \( \varepsilon_{3} = 5 \), \( \varepsilon_{4} = 0.001 \), \( \varepsilon_{5} = 0.001 \) and \( \varepsilon_{6} = 0 \), respectively. Figure 10 displays the images captured in the course of approaching the workpiece.

Fig. 10 Images in the course of approaching the workpiece in Pos2 with a rotation of 45° and a tilt of 25°

The final and desired locations of the end-effector are recorded during the task manipulation, and the coordinate transformation matrix between these two locations can then be determined. The position error is defined as the magnitude of the position vector of this matrix, and the orientation error is defined as the angle of rotation about the principal axis, obtained from the rotational part of the matrix.
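Assuming the two end-effector locations are available as 4 × 4 homogeneous transformation matrices, these error measures might be computed as follows.

```python
import numpy as np

def positioning_errors(T_final, T_desired):
    """Position and orientation error between two 4x4 homogeneous transforms."""
    T_err = np.linalg.inv(T_desired) @ T_final          # relative transform
    pos_err = np.linalg.norm(T_err[:3, 3])              # magnitude of the position vector
    R = T_err[:3, :3]
    # angle of the equivalent rotation about the principal (axis-angle) axis
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    ang_err = np.degrees(np.arccos(cos_theta))
    return pos_err, ang_err
```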

Table 1 shows the resulting positioning errors and the number of steps for each test. The number of steps is the number required for the end-effector to travel from the top location to the target location. Executing this behavior-based control strategy takes a minimum of about 9 steps and a maximum of 12 steps when the workpiece is placed without a tilt, and the initial orientation of the workpiece only weakly influences the total execution time. When the tilt angle is 15° or 25°, about 17 to 25 steps are required; as the tilt angle increases, more steps are needed. In all of the tests, the final position error is less than 2.46 mm, and the final orientation error is less than 1.21°.

Table 1 Positioning errors of the eye-in-hand manipulator and the number of steps for each test

5 Conclusion

This paper adopts an uncalibrated eye-in-hand vision system to provide visual information to a mobile manipulator for picking up a workpiece located on a workstation. A novel vision-guided control strategy with a behavior-based look-and-move structure is proposed. The strategy comprises the NFCs with a varying learning rate, the rough motion transformation and the behavior fusion scheme. Notably, the rough motion transformation is inaccurate; however, the designed NFCs compensate for this inaccuracy by tuning the consequents of the fuzzy rules with the back-propagation algorithm. This approach saves considerable time by avoiding the extensive computation of hand-eye calibration. Finally, the proposed control strategy is experimentally shown to enable the manipulator to rapidly approach a target object and precisely position its end-effector in the desired pose relative to the object, regardless of where the object is located on the workstation. The advantages of this work over conventional approaches are as follows: (1) no camera calibration or hand-eye calibration is performed; (2) the selected image features are insensitive to illumination and to the distance between the camera and the workpiece; and (3) the end-effector is controlled to approach the workpiece more smoothly and quickly by the behavior fusion scheme and the speed command module of the proposed control strategy.