1 Introduction

Nowadays, in conjunction with the continuous advancement of science and technology, robots play an increasingly crucial role in all areas of life and society, gradually replacing humans in difficult and dangerous tasks to produce products of high quality and precision. Therefore, the requirements for flexibility, accuracy, and robustness of robots are increasing, and robots must be able to operate in diverse environments. To meet such requirements, robots are frequently equipped with sensors to measure and perceive their surroundings (particularly when they operate in unstructured environments). Commonly used sensors are pressure sensors, ultrasonic sensors, lasers, vision sensors, etc. Among them, vision sensors (e.g., cameras) are the most common because they perceive the environment much as humans do and allow non-contact measurements.

Cameras began to be used to control the motion of robots in the 1960s (Shirai and Inoue 1973). Feedback control using visual information from vision sensors (closed-loop control) was only developed in the 1990s to increase the accuracy of robot systems (Hashimoto 1993; Cong and Hanh 2022). This closed-loop control approach is called visual servoing.

Visual servoing (VS) is a feedback control technique that uses visual information to enhance the accuracy and versatility of robot systems (Cong and Hanh 2022; Chaumette et al. 2016; Zhao et al. 2021). The control tasks in visual servoing are defined to manipulate the movements of robots using visual features extracted in real time from images of the objects of interest (Li et al. 2018; Cong and Hanh 2021). The error function in VS is defined as the difference between the visual features at the current and the desired position of the robot/camera. The purpose of a visual control scheme is to regulate this error to zero. One, two or more cameras can be used to gather visual information about the objects for the robot control task (Cong and Hanh 2022, 2021; Zhao et al. 2021; Wu et al. 2022).

There are two main configurations for combining a camera and a robot in visual servoing applications. In the first configuration, eye-in-hand, the camera is mounted on the end-effector of the robot and moves with the robot. A transformation between the camera frame and the end-effector frame is defined to convert motions from the camera frame to the end-effector frame. In the second configuration, called eye-to-hand, one or more cameras are fixed in the workspace to observe both the robot and the objects. This configuration requires calculating the transformation matrix between the camera and robot coordinate frames at each iteration. A hybrid configuration can combine both eye-in-hand and eye-to-hand cameras.

The two main control schemes commonly applied in visual servoing are the position-based visual servoing (PBVS) scheme and the image-based visual servoing (IBVS) scheme (Cong and Hanh 2022; Zhao et al. 2021; Wu et al. 2022; Sanderson and Weiss 1980). The combination of the two creates a hybrid scheme called 2-1/2D visual servoing (Malis et al. 1999). In the PBVS system, the control inputs are calculated in 3D space; the 3D coordinates of the objects are estimated from the image features. There are different methods for determining the pose of an object [see (Dementhon and Davis 1995)], all of which require knowing the complete geometry of the object and the camera calibration parameters. In the IBVS system, the control inputs are calculated directly in the 2D image space. This method is robust to camera and robot calibration errors (Espiau 1993; Malis et al. 2010; Xiaolin and Hongwen 2020). However, convergence is only guaranteed in a small region (which is, in general, impossible to determine analytically) around the desired location. An illustrative diagram of the two VS schemes is shown in Fig. 1.

Fig. 1
figure 1

Two visual servoing control schemes

Two essential aspects that significantly affect the behaviour of a visual servoing system are the choice of visual features used as input by the controller and the choice of control scheme. With the same set of features, the system behaves differently under different control schemes, and the same control law gives different behaviours when different features are considered. The behaviour obtained from a particular combination of the two choices is often not as expected; choosing a particular set of features or a particular control scheme can result in stability and convergence problems.

This paper reviews common techniques in visual servoing control and assesses them based on the efficiency and problem-solving ability of the control schemes. The popular schemes are simulated in Matlab to obtain performance indicators and behaviour graphs. To conduct the assessment, the paper defines typical control tasks, namely translations along and rotations around the X, Y and Z axes. The evaluations are conducted with varying motion parameters and varying levels of noise. After all the tests have been performed, the main conclusions drawn from them are presented and discussed.

2 Review of visual servoing

2.1 General theory of visual servoing

In visual servoing control, visual features are extracted from the image to control the motion of the robot (Ren et al. 2020; Zhong et al. 2019; Chwa 2020). The control law is designed so that the current visual features \(s\left( t \right)\) converge to the desired features \(s_{d}\). Therefore, every visual servoing task aims at eliminating the error \(e\left( t \right)\), defined as (Hanh and Cong 2022):

$${\mathbf{e}} = {\varvec{s}}\left( t \right) - {\varvec{s}}_{{\varvec{d}}}$$
(1)

In visual servoing, the relationship between the derivative of the vector s and the relative velocity between the camera and the object is determined by the interaction matrix:

$$\dot{\varvec{s}} = {\varvec{L}}_{{\varvec{s}}} {}_{\varvec{ }}^{{\varvec{c}}} {\varvec{v}}_{{\varvec{o}}} + \frac{{\partial {\varvec{s}}}}{\partial t}$$
(2)

where ∂s/∂t is the derivative of s caused by the object's own motion, \({\varvec{L}}_{{\varvec{s}}} \in {\varvec{R}}^{{{\varvec{k}} \times {\varvec{n}}}}\) is the interaction matrix, and \({}_{\varvec{ }}^{{\varvec{c}}} {\varvec{v}}_{{\varvec{o}}}\) is the relative velocity between the camera and the object.

In the case of a static object, ∂s/∂t = 0 and the relative velocity reduces to the camera velocity \({\varvec{v}}_{{\varvec{c}}}\), so:

$$\dot{\varvec{e}} = \dot{\varvec{s}} = {\varvec{L}}_{{\varvec{s}}} {\varvec{v}}_{{\varvec{c}}}$$
(3)

To ensure that the feature error decreases exponentially, the following first-order control law is imposed:

$$\dot{\varvec{e}} = -{\varvec{Ke}}$$
(4)

where \({\varvec{K}}\) is a constant positive-definite gain matrix. Combining (4) with (1) and (3), we get:

$${\varvec{v}}_{{\varvec{c}}} = - {\varvec{KL}}_{{\varvec{S}}}^{ + } {\varvec{e}} = - {\varvec{KL}}_{{\varvec{S}}}^{ + } \left( {{\varvec{s}} - {\varvec{s}}_{{\varvec{d}}} } \right)$$
(5)

where \({\varvec{L}}_{{\varvec{S}}}^{ + } = \left( {{\varvec{L}}_{{\varvec{S}}}^{{\varvec{T}}} {\varvec{L}}_{{\varvec{S}}} } \right)^{ - 1} {\varvec{L}}_{{\varvec{S}}}^{{\varvec{T}}}\) is the pseudo-inverse of the interaction matrix.
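As an illustration, the generic law (5) reduces to a few lines of code once the interaction matrix is available. The following Python/NumPy sketch (the function name and the scalar gain are illustrative choices, not taken from the cited papers) computes one velocity command; how \(L_s\) is built depends on the chosen scheme and is covered in the following sections.

```python
import numpy as np

def vs_control_step(s, s_d, L_s, gain=0.5):
    """One iteration of the generic law (5): v_c = -K * pinv(L_s) * (s - s_d).

    s, s_d : current and desired feature vectors, shape (k,)
    L_s    : interaction matrix, shape (k, 6)
    gain   : scalar stand-in for the gain matrix K (illustrative value)
    """
    e = np.asarray(s) - np.asarray(s_d)   # feature error, Eq. (1)
    L_pinv = np.linalg.pinv(L_s)          # Moore-Penrose pseudo-inverse of L_s
    return -gain * L_pinv @ e             # camera velocity screw (vx, vy, vz, wx, wy, wz)
```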

2.2 Visual servoing control schemes

Control schemes mainly differ in the way visual information is used. Different control laws (Wang et al. 2017; Ren et al. 2020; Liu et al. 2020; Malis 2004) affect the response of the system differently. This section presents visual servoing control schemes, starting from the traditional schemes and moving on to hybrid and enhanced control schemes. To form the control laws of these schemes, 2D pixel coordinates or 3D coordinates are used; other visual features are detailed in the next section.

2.2.1 Position-based visual servoing (PBVS)

In PBVS, the error signal is computed in 3D Cartesian space. The state vector is the pose of the robot \({\varvec{P}} = \left[ {{\varvec{t}}_{{\varvec{c}}}^{{\varvec{d}}} ,\theta {\varvec{u}}} \right]^{T},\) where \({\varvec{t}}_{{\varvec{c}}}^{{\varvec{d}}}\) is the translation vector from the current position to the desired position, expressed in the current camera frame \({\varvec{F}}\), and θu is the axis-angle representation of the orientation. We have the relationship:

$$\dot{\varvec{P}} = {\varvec{L}}_{{\varvec{S}}} {\varvec{v}}_{{\varvec{c}}} = \left[ {\begin{array}{*{20}c} {\varvec{R}} & 0 \\ 0 & {{\varvec{L}}_{{\varvec{W}}} } \\ \end{array} } \right]{\varvec{v}}_{{\varvec{c}}}$$
(6)

where R is the rotation matrix and \({\varvec{L}}_{{\varvec{W}}}\) is the interaction matrix of the rotational motion:

$$L_{w} = I_{3} - \frac{\theta }{2}\left[ u \right]_{ \times } + \left( {1 - \frac{sinc\theta }{{sinc^{2} \frac{\theta }{2}}}} \right)\left[ u \right]_{ \times }^{2},$$

\({\varvec{L}}_{{\varvec{W}}}^{ - 1}\) can be approximated by the identity matrix.

Assume \({\varvec{v}}_{{\varvec{c}}} = \left( {{\varvec{v}}_{{\varvec{t}}} ,{\varvec{\omega}}_{{\varvec{c}}} } \right)^{T}\), where \({\varvec{v}}_{{\varvec{t}}}\) is the translational velocity and \({\varvec{\omega}}_{{\varvec{c}}}\) is the angular velocity of the camera.

From (5) we have:

$$\left\{ {\begin{array}{*{20}c} {{\mathbf{v}}_{{\mathbf{t}}} = - {\varvec{K}}_{{\varvec{t}}} {\varvec{R}}^{{\varvec{T}}} {\varvec{t}}_{{\varvec{c}}}^{{\varvec{d}}} = - {\varvec{K}}_{{\varvec{t}}} t} \\ {{\varvec{\omega}}_{{\varvec{c}}} = - {\varvec{K}}_{{\varvec{\omega}}} \theta u } \\ \end{array} } \right.$$
(7)

where \({\varvec{t}} = \left( {t_{x} ,t_{y} ,t_{z} } \right)^{T}\) is the translation vector from the current camera frame \({\varvec{F}}\) to the desired camera frame \({\varvec{F}}^{\varvec{*}}\), expressed in the desired camera frame.
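A minimal sketch of the PBVS law (7) in the same Python/NumPy style, assuming that a pose estimation step has already produced the translation t (expressed in the desired frame) and the axis-angle error θu; scalar gains replace the gain matrices for brevity:

```python
import numpy as np

def pbvs_control(t, theta_u, K_t=0.5, K_w=0.5):
    """PBVS law of Eq. (7) with scalar gains.

    t       : translation to the desired camera frame, expressed in that frame, shape (3,)
    theta_u : axis-angle orientation error theta*u, shape (3,)
    """
    v_t = -K_t * np.asarray(t)          # translational velocity
    w_c = -K_w * np.asarray(theta_u)    # angular velocity
    return np.hstack((v_t, w_c))        # camera velocity screw
```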

2.2.2 Image-based visual servoing (IBVS)

Let \({\varvec{P}}_{{\varvec{i}}} = \left( {X_{i} ,Y_{i} ,Z_{i} } \right)^{T}\) be the 3D coordinates of a feature point and \({\varvec{p}}_{{\varvec{i}}} = \left( {x_{i} ,y_{i} } \right)^{T}\) its image coordinates. Using the perspective projection of the pinhole camera model, we have:

$$\begin{gathered} x_{i} = f\frac{{X_{i} }}{{Z_{i} }} \hfill \\ y_{i} = f\frac{{Y_{i} }}{{Z_{i} }} \hfill \\ \end{gathered}$$
(8)

where \(f\) is the focal length. The velocity of point \({\varvec{P}}_{{\varvec{i}}}\) is related to the camera velocity by:

$$\dot{\varvec{P}}_{{\varvec{i}}} = - {\mathbf{v}}_{{\varvec{c}}} - {\varvec{\omega}}_{{\varvec{c}}} \times {\varvec{P}}_{{\varvec{i}}}$$
(9)

Taking the time derivative of Eq. (8) and combining it with Eq. (9), we obtain:

$$\dot{\varvec{p}}_{{\varvec{i}}} = {\varvec{L}}_{{{\varvec{si}}}} {\varvec{v}}_{{\varvec{c}}}$$
(10)

where:

$${\varvec{L}}_{{{\varvec{si}}}} = \left[ {\begin{array}{*{20}c} {\frac{ - 1}{{Z_{i} }}} & 0 & {\frac{{x_{i} }}{{Z_{i} }}} & {x_{i} y_{i} } & { - \left( {1 + x_{i}^{2} } \right)} & {y_{i} } \\ 0 & {\frac{ - 1}{{Z_{i} }}} & {\frac{{y_{i} }}{{Z_{i} }}} & {1 + y_{i}^{2} } & { - x_{i} y_{i} } & { - x_{i} } \\ \end{array} } \right]$$
(11)

Define the feature vector \({\varvec{s}} = \left( {{\varvec{p}}_{1} ,{\varvec{p}}_{2} , \ldots ,{\varvec{p}}_{{\varvec{n}}} } \right)^{T}\), where n is the number of feature points. The interaction matrix is obtained by stacking the interaction matrices (11) of the individual feature points:

$${\varvec{L}}_{{\varvec{S}}} = \left( {{\varvec{L}}_{{{\varvec{s}}1}} ,{\varvec{L}}_{{{\varvec{s}}2}} , \ldots ,{\varvec{L}}_{{{\varvec{s}}n}} } \right)^{T} \in R^{2n \times 6}$$
(12)
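To make the construction concrete, the sketch below builds the point interaction matrix (11), stacks it as in (12) and applies the law (5). The depth values Z_i are assumed to be available (in practice they must be estimated), and the gain is an illustrative scalar.

```python
import numpy as np

def point_interaction_matrix(x, y, Z):
    """Interaction matrix of one normalized image point (x, y) at depth Z, Eq. (11)."""
    return np.array([
        [-1.0 / Z, 0.0,      x / Z, x * y,      -(1.0 + x**2),  y],
        [0.0,     -1.0 / Z,  y / Z, 1.0 + y**2, -x * y,        -x],
    ])

def ibvs_velocity(points, points_d, depths, gain=0.5):
    """IBVS: stack Eq. (11) for every point (Eq. (12)) and apply Eq. (5).

    points, points_d : (n, 2) current and desired normalized image coordinates
    depths           : (n,) current depth estimates Z_i (assumed known here)
    """
    L = np.vstack([point_interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(points, depths)])    # (2n, 6)
    e = (np.asarray(points) - np.asarray(points_d)).reshape(-1)
    return -gain * np.linalg.pinv(L) @ e                     # camera velocity screw
```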

A commonly mentioned issue when applying IBVS is the “camera retreat” problem, in which the camera moves away from the target along its optical axis and then returns (a back-and-forth movement), as shown in Fig. 2.

Fig. 2
figure 2

Camera retreat problem. a image feature trajectories. b translation along the X, Y and Z axes

2.2.3 2 ½ D Visual Servoing (2.5DVS)

The 2 ½ D visual servoing control scheme was proposed by Malis et al. (1999) with the aim of combining the advantages of both IBVS and PBVS; hence this method is also known as hybrid visual servoing. Different from the two schemes above, the 2.5DVS scheme decouples the camera translation and rotation control laws. The axis-angle parameters \(\theta {\varvec{u}}\) obtained from the rotation matrix are used to calculate the angular velocity commands (Malis et al. 1999):

$$\frac{{d\left( {\varvec{\theta u}} \right)}}{dt} = \left[ {\begin{array}{*{20}c} 0\quad {{\varvec{L}}_{{\varvec{W}}} } \\ \end{array} } \right]{\varvec{v}}_{{\varvec{c}}}$$
(13)

For calculating the translation velocities, define the extended image features \({\varvec{m}}_{{\varvec{e}}} = \left( {x,y,log\left( Z \right)} \right)^{T}\). According to Malis et al. (1999), we have:

$$\dot{\varvec{m}}_{{\varvec{e}}} = \left[ {\begin{array}{*{20}c} {\frac{1}{{d^{*} }}{\varvec{L}}_{{\varvec{v}}} } \quad {{\varvec{L}}_{{\varvec{v\omega }}} } \\ \end{array} } \right]{\varvec{v}}_{{\varvec{c}}}$$
(14)

where:

$$\begin{gathered} L_{v} = \frac{1}{{\rho_{1} }}\left[ {\begin{array}{*{20}c} { - 1} & 0 & x \\ 0 & { - 1} & y \\ 0 & 0 & { - 1} \\ \end{array} } \right] \hfill \\ L_{v\omega } = \left[ {\begin{array}{*{20}c} {xy} & { - \left( {1 + x^{2} } \right)} & y \\ {1 + y^{2} } & { - xy} & { - x} \\ { - y} & x & 0 \\ \end{array} } \right] \hfill \\ \end{gathered}$$
(15)

\(d^{*}\) is the distance from the camera at the desired position to a fixed plane \(\pi\) containing the feature points, \({\uprho }_{1}\) is a coefficient defined as (Malis et al. 1999):

$$\rho_{1} = \frac{Z}{{d^{*} }}$$
(16)

Define the feature vectors \({\varvec{s}} = \left( {{\varvec{m}}_{{\varvec{e}}} ,\theta {\varvec{u}}} \right)^{T}\) and \({\varvec{s}}_{{\varvec{d}}} = \left( {{\varvec{m}}_{{{\varvec{ed}}}} ,0} \right)^{T}\). The control task is to minimize the error \({\varvec{e}} = \left( {\left( {{\varvec{m}}_{{\varvec{e}}} - {\varvec{m}}_{{{\varvec{ed}}}} } \right)^{T} ,\theta {\varvec{u}}^{T} } \right)^{T}\). The derivative of the error is related to the camera velocity by (3) with the interaction matrix (Malis et al. 1999):

$${\varvec{L}}_{{\varvec{S}}} = \left[ {\begin{array}{*{20}c} {\frac{1}{{d^{*} }}{\varvec{L}}_{{\varvec{v}}} } & {{\varvec{L}}_{{\varvec{v\omega }}} } \\ 0 & {{\varvec{L}}_{{\varvec{W}}} } \\ \end{array} } \right]$$
(17)

and the camera velocity obtained from (5) is (Malis et al. 1999):

$${\varvec{v}}_{{\varvec{c}}} = - {\varvec{K}}\left[ {\begin{array}{*{20}c} {d^{*} {\varvec{L}}_{{\varvec{v}}}^{ - 1} } & { - d^{*} {\varvec{L}}_{{\varvec{v}}}^{ - 1} {\varvec{L}}_{{\varvec{v\omega }}} } \\ 0 & {{\varvec{I}}_{3} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {x - x_{d} } \\ {y - y_{d} } \\ \end{array} } \\ {\begin{array}{*{20}c} {log\left( {Z/Z_{d} } \right)} \\ {\theta {\varvec{u}}} \\ \end{array} } \\ \end{array} } \right]$$
(18)

Because the 2.5D VS decouples the camera translation and rotation control laws, it avoids the “camera retreat” problem caused by the coupled control law of IBVS.

In Eq. (18), the ratio \(\rho_{1}\), the ratio \(Z/Z_{d}\) and the axis-angle parameter \(\theta {\varvec{u}}\) are estimated using the homography matrix, which can be decomposed as (Malis et al. 1999):

$$\varvec{H = R + t}_{{{\varvec{d}}^{\varvec{*}} }} {\varvec{n}}^{{\varvec{*T}}}$$
(19)

where R is the rotation matrix between the camera’s current frame \({\varvec{F}}\) and the camera’s desired frame \({\varvec{F}}^{\varvec{*}}\), \(n^{*}\) is the unit vector normal to plane \(\pi\) expressed in frame \({\varvec{F}}^{\varvec{*}}\), and \({\varvec{t}}_{{{\varvec{d}}^{\varvec{*}} }}\) is defined as \({\varvec{t}}/d^{*}\), \({\varvec{t}}\) being the translation vector between \({\varvec{F}}\) and \({\varvec{F}}^{\varvec{*}}\). The ratio \(\rho_{1}\) is computed as (Malis et al. 1999):

$$\rho_{1} = \frac{{{\text{det}}\left( {\varvec{H}} \right)}}{{\left( {{\varvec{Rn}}^{\varvec{*}} } \right)^{{\varvec{T}}} {\varvec{m}}}}$$
(20)

and the ratio \(Z/Z_{d}\) is determined as (Malis et al. 1999):

$$\frac{Z}{{Z_{d} }} = \rho_{1} {\varvec{n}}^{*T} {\varvec{m}}^{*}$$
(21)

where \({\varvec{m}}\) and \({\varvec{m}}^{\varvec{*}}\) are the normalized coordinates of an image point in the current and desired camera frames, respectively.
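Assuming the homography decomposition (19)-(21) has already provided ρ1, Z/Z_d and θu, the velocity law (18) can be sketched as follows (Python/NumPy, with an illustrative scalar gain):

```python
import numpy as np

def hybrid_25d_velocity(x, y, x_d, y_d, log_Z_ratio, theta_u, rho1, d_star, gain=0.5):
    """2.5D VS law of Eq. (18); rho1 = Z/d*, log_Z_ratio = log(Z/Z_d) and theta_u
    are assumed to come from the homography decomposition of Eqs. (19)-(21)."""
    L_v = (1.0 / rho1) * np.array([[-1.0,  0.0,  x],
                                   [ 0.0, -1.0,  y],
                                   [ 0.0,  0.0, -1.0]])        # Eq. (15)
    L_vw = np.array([[x * y,      -(1.0 + x**2),  y],
                     [1.0 + y**2, -x * y,        -x],
                     [-y,          x,             0.0]])       # Eq. (15)
    e = np.hstack(([x - x_d, y - y_d, log_Z_ratio], theta_u))  # hybrid error vector
    L_v_inv = np.linalg.inv(L_v)
    upper = np.hstack((d_star * L_v_inv, -d_star * L_v_inv @ L_vw))
    lower = np.hstack((np.zeros((3, 3)), np.eye(3)))
    return -gain * np.vstack((upper, lower)) @ e               # camera velocity screw
```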

2.2.4 Partitioned visual servoing (PVS)

The 2½D visual servoing control scheme described above separates rotation from translational motion by selecting features in both 2D and 3D spaces. Another scheme, proposed by Corke and Hutchinson (2001), also decouples motion along and around the Z-axis, but uses only features extracted directly in the image, with the aim of solving some problems in IBVS, mainly the “camera retreat” problem that occurs when a rotation around the optical axis is required. The Z-axis motion is decoupled as follows:

$${\dot{\mathbf{s}}} = {\mathbf{L}}_{{\text{s}}} {\varvec{v}}_{c} = {\varvec{L}}_{xy} {\varvec{v}}_{xy} + {\varvec{L}}_{z} {\varvec{v}}_{z}$$
(22)

where the matrix \({\varvec{L}}_{{{\varvec{xy}}}}\) consists of the 1st, 2nd, 4th, and 5th columns of \({\varvec{L}}_{{\varvec{s}}}\), and \({\varvec{L}}_{{\varvec{z}}}\) contains the 3rd and 6th columns of \({\varvec{L}}_{{\varvec{s}}}\).

Since \({\varvec{v}}_{{\varvec{z}}}\) is calculated separately, from Eq. (22) the velocity in the x and y directions can be determined:

$${\varvec{v}}_{xy} = {\varvec{L}}_{xy}^{ + } \left( { - K{\varvec{e}} - {\varvec{L}}_{z} {\varvec{v}}_{z} } \right)$$
(23)

The velocities along and around the Z-axis are determined using an angle feature and an area-based feature extracted from the image:

$$\begin{gathered} \omega_{z} = k_{{\omega_{z} }} \left( {\theta_{m}^{*} - \theta_{m} } \right) \hfill \\ v_{tz} = k_{{v_{z} }} \left( { \epsilon^{*} - \epsilon } \right) \hfill \\ \end{gathered}$$
(24)
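A sketch of one partitioned iteration, Eqs. (22)-(24): the Z-axis components are computed first from the angle and area-based features, and the remaining four components from the pseudo-inverse of the reduced interaction matrix. The stacked matrix L_s, the error e and the image features are assumed to be given; the gains are illustrative.

```python
import numpy as np

def partitioned_vs_step(L_s, e, theta_m, theta_m_d, sigma, sigma_d,
                        k_wz=0.5, k_vz=0.5, gain=0.5):
    """Partitioned VS of Eqs. (22)-(24).

    L_s            : stacked interaction matrix, shape (2n, 6)
    e              : stacked image feature error, shape (2n,)
    theta_m, sigma : current angle and area-based image features (desired: *_d)
    """
    v_z = np.array([k_vz * (sigma_d - sigma),        # translation along Z, Eq. (24)
                    k_wz * (theta_m_d - theta_m)])   # rotation around Z, Eq. (24)
    L_xy = L_s[:, [0, 1, 3, 4]]                      # columns for vx, vy, wx, wy
    L_z = L_s[:, [2, 5]]                             # columns for vz, wz
    v_xy = np.linalg.pinv(L_xy) @ (-gain * e - L_z @ v_z)   # Eq. (23)
    vx, vy, wx, wy = v_xy
    return np.array([vx, vy, v_z[0], wx, wy, v_z[1]])       # full velocity screw
```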

2.2.5 Shortest path visual servoing (SPVS)

Shortest path visual servoing was presented by Kyrki and Kragic (Kyrki et al. 2004) with the aim of solving two fundamental problems in visual servoing: the possibility of objects leaving the camera’s field of view and of the joints approaching their limits. This method is designed to ensure that the camera trajectory is a straight line from the starting point to the desired position by using position-based control for the translation:

$${\mathbf{v}}_{{\mathbf{t}}} = - {\varvec{K}}_{{\varvec{t}}} {\varvec{t}}$$
(25)

The rotation around the Z-axis also uses a position-based control law:

$${\upomega }_{{\text{z}}} = {\text{k}}_{{{\upomega }_{{\text{z}}} }} {\text{u}}_{{\text{z}}} {\uptheta }$$
(26)

The rotations around the X and Y axes are controlled using a single feature point in the image to ensure the visibility of the target points:

$$\left[ {\omega_{x} , \omega_{y} } \right]^{{\text{T}}} = {\varvec{L}}_{{\omega_{x,y} }}^{ - 1} \left( { - {\varvec{K}}e - {\varvec{L}}_{t} {\varvec{v}}_{t} - {\varvec{L}}_{{\omega_{z} }} {\varvec{\omega}}_{z} } \right)$$
(27)

where \({\mathbf{L}}_{{\text{t}}}\), \({\mathbf{L}}_{{{\upomega }_{{\text{z}}} }}\), \({\mathbf{L}}_{{{\upomega }_{{{\text{x}},{\text{y}}}} }}^{ }\) are the interaction matrices for the translation, the rotation around the Z-axis, and the rotations around the X and Y axes, respectively.

2.3 Selection of visual features

The behavior of a VS system depends on the selection of visual features. The visual features are observed by the visual sensor and form the input of the control scheme. A visual sensor provides many potential visual features, but if the features are not selected properly, stability problems can arise and the robot's displacement can become large (Ren et al. 2020). Therefore, the selection of visual features in VS is important: it determines the speed, accuracy and reliability of the visual measurements and, consequently, the accuracy and robustness of the VS system (Ren et al. 2020; Feddema et al. 1991; Janabi-Sharifi and Wilson 1997). For this reason, visual features for VS systems should be designed to meet the following criteria: avoidance of local minima, robustness to calibration and modeling errors, non-singularity, reasonable trajectories of the camera and of the features in the image, and, finally, maximum decoupling of the degrees of freedom together with a linear relationship between the visual features and the controlled degrees of freedom.

2.3.1 Geometric features

The most common visual features are geometric features (e.g., points, lines, circles, etc.). Geometric features are defined to describe the geometric content of an image (2D visual features) or the relationship between a coordinate system attached to the robot and the coordinate system attached to the object (3D visual features). Both 2D and 3D features can be used at the same time in hybrid schemes.

2D visual features are often extracted from 2D images, such as points, lines, ellipses, regions of interest or contours (Feddema et al. 1991; Janabi-Sharifi and Wilson 1997; Shi and Tomasi 1994; Andreff et al. 2000). These features are extracted by image processing algorithms. In the case of a feature point, Cartesian coordinates are often used, but polar or cylindrical coordinates can also be used (Chaumette 2004).

In addition, image moments and moment invariants can also be used in VS (Chaumette 2002, 2004; Tahri and Chaumette 2003, 2004, 2005, 2005; Zhao et al. 2013). Using image moments has many advantages over traditional VS, as it allows a general representation that applies not only to basic geometric primitives but also to complex objects of unspecified shape. In (Chaumette 2004), the use of image moments to form the visual Jacobian matrix is discussed; this formulation allows decoupling degrees of freedom depending on the selected moment type. Tahri and Chaumette (2005) show the use of moment invariants to design a decoupled 2D visual servoing control scheme.

Visual features can also be selected in 3D space, such as the position or coordinates of a 3D point (Martinet et al. 1996; Deng et al. 2003; Wilson et al. 1996). A model of the object and measurements in the image are often used to estimate the pose of the object relative to the camera. In (Cervera et al. 2003), the 3D coordinates of the object are used as the feature vector, and the camera calibration parameters must be known in advance. In PBVS, an object’s orientation can be represented by roll-pitch-yaw angles, the axis-angle representation (Wang and Wilson 1992), or quaternions (Hu et al. 2010).

Several combinations of feature types can also be considered; for example, the combination of 2D and 3D features is presented in (Wilson et al. 1996; Deng et al. 2002; Marchand and Chaumette 2005), and the combination of polar and Cartesian pixel coordinates is presented in Corke et al. (2009).

2.3.2 Photometric features

In contrast to geometric features, photometric features computed from the luminance of the pixels have recently been used in VS. Using photometric features does not require complex image processing such as feature extraction, matching and tracking. Furthermore, partial occlusions and depth estimation errors have only a small effect. This approach is implemented by considering the entire image as a set of features for computing the control input (Nayar et al. 1996; Deguchi 2000). The control input can be defined in an eigenspace, from a kernel of the pixel intensities, or from the set of all pixels.

In (Nayar et al. 1996), pixel intensities are not used directly; an eigenspace decomposition is performed first to reduce the dimension of the image data, and control is then performed in the eigenspace rather than directly from the pixel intensities. Furthermore, this approach requires computing the eigenspace off-line and projecting each new frame onto this subspace. An interesting approach, also based on pixel intensities, was recently proposed in Kallem et al. (2007). It uses kernel methods, resulting in highly decoupled control laws. It is also possible to use the brightness of all pixels in the image as the visual feature set (Collewet et al. 2008; Collewet and Marchand 2009a, b).

2.3.3 The velocity field features

The velocity field in the image is used as the visual feature in Dong and Zhang (2020), where the relationship between the change of the velocity field of the features and the velocity of the camera is modelled. This method is used to drive the camera to a position parallel to a plane and to follow a trajectory. The camera’s movement is controlled so that the velocity field in the image equals the velocity field at the desired location (Xu et al. 2018). In (Kelly et al. 2004), the velocity field is used in the VS of a robotic arm observed by a fixed camera (Fig. 3). The desired velocity field \(v_{d}\) is defined in the image space as a tangent vector representing the desired image velocity \(\dot{x}\) at each point of the image space. The velocity field error is defined as the difference between the desired velocity field \(v_{d}\) and the image velocity feature \(\dot{x}\). The velocity field is also used in Kelly et al. (2006) to control a mobile robot with a fixed camera.

Fig. 3
figure 3

Visual servoing using velocity field (Kelly et al. 2004)

2.4 Combined and enhanced control schemes

To increase system performance and overcome the disadvantages of the visual servoing control schemes, in addition to using advanced control schemes (2.5D VS, Partitioned VS, Shortest Path VS), VS control schemes can be combined with each other, or VS can be combined with other controllers, to exploit the advantages of each controller and overcome the drawbacks of the VS scheme.

Some studies have followed this direction, such as combining VS with sliding mode control (Zanne et al. 2000), partitioning approaches (Corke and Hutchinson 2001; Pages et al. 2006; Gans et al. 2003), planning approaches (Shu et al. 2018; Keshmiri et al. 2017), and switching schemes (Gans and Hutchinson 2002, 2007; Norouzi-Ghazbi and Janabi-Sharifi 2021; Zhao et al. 2017; Ghasemi et al. 2019, 2020).

2.4.1 Switching approaches

One way to use multiple controllers in combination is a switching scheme, in which one controller is selected at a time depending on which criteria need to be optimized (Norouzi-Ghazbi and Janabi-Sharifi 2021; Abhilash and Ashok 2016). To implement this scheme, two levels of control strategy are required: the lower level implements the VS control laws, and the higher level determines which control scheme should be applied.

A switching system is represented by the differential equation:

$$\dot{x}\left( t \right) = f_{\sigma \left( t \right)} \left( {x,t,u_{\sigma \left( t \right)} } \right):\sigma \in \left\{ {1 \ldots n} \right\}$$
(28)

where \(f_{\sigma \left( t \right)}\) is a set of n functions. The switching is directly affected by the choice of the control input u. Each visual servoing control scheme provides a controlled velocity \(u = \left[ {T_{x} ,T_{y} ,T_{z} ,\omega_{x} ,\omega_{y} ,\omega_{z} } \right]^{T}\), and a switching rule determines the actual control input used at each control cycle. The stability of the switching system is not guaranteed by the stability of the individual controllers: a set of stable controllers may become unstable if an inappropriate switching rule is applied, and unstable systems may become asymptotically stable when a suitable switching rule is applied. The overall stability of the system is therefore difficult to guarantee; however, under certain conditions it can be demonstrated (Malis et al. 1999; Dementhon and Davis 1995).

Gans (Gans and Hutchinson 2007) implemented a switching system between the two control laws, PBVS and IBVS, based on the value of a Lyapunov function. The system starts with the IBVS controller while monitoring the Lyapunov function of the PBVS controller, defined by \(L = \frac{1}{2}\left\| {{\varvec{e}}\left( t \right)} \right\|^{2}\). At any point in time, if the value of this Lyapunov function exceeds the threshold γP, the system switches to the PBVS scheme. Likewise, while using PBVS, if the value of the Lyapunov function for IBVS exceeds the threshold γI, the system switches back to the IBVS scheme. If the thresholds are chosen appropriately, the system can obtain the relative advantages of IBVS and PBVS while limiting their shortcomings.
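A schematic sketch of this high-level switching rule is given below; the thresholds and the way the two low-level errors are obtained are placeholders, since they depend on the particular implementation.

```python
import numpy as np

def lyapunov_value(e):
    """Candidate Lyapunov value L = 0.5 * ||e||^2 for an error vector e."""
    e = np.asarray(e, dtype=float)
    return 0.5 * float(e @ e)

def select_scheme(current, e_ibvs, e_pbvs, gamma_I=1.0, gamma_P=1.0):
    """Switch to the other scheme when its monitored Lyapunov value exceeds the
    corresponding threshold (gamma_I, gamma_P are illustrative values)."""
    if current == "IBVS" and lyapunov_value(e_pbvs) > gamma_P:
        return "PBVS"
    if current == "PBVS" and lyapunov_value(e_ibvs) > gamma_I:
        return "IBVS"
    return current
```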

Gans (Gans and Hutchinson 2002) implemented a switching scheme between two controllers based on a homography matrix and on an affine approximation. The homography-based scheme can be used for general motion, including rotations around the x and y axes. If the motion does not include rotations around the x and y axes, the performance of the two methods is the same; however, when noise is present, the affine method is more efficient. Three different switching rules are implemented and compared: “Deterministic Switching”, “Random Switching” and “Biased Random Switching”. The results show that the Biased Random rule performs better than the other switching rules.

2.4.2 Task sequencing

The original approach, at the beginning of visual servoing research, constrained all degrees of freedom of the robot in a single task. However, in the early stages of the control process this is not necessary; typically, some features reach their desired positions before others. In addition, the classical methods of choosing a trajectory may be neither suitable nor optimal.

A more efficient way to control the system is to use some of the robot’s degrees of freedom to perform secondary tasks while the robot is still far from the target. These secondary tasks can improve the robustness of the control system, for example by avoiding joint limits, keeping the targets within the camera’s field of view, or avoiding collisions with obstacles (Alatartsev et al. 2015; Ahmadi et al. 2022). When the robot approaches the target, all degrees of freedom are controlled to reach the desired position. This approach is called task sequencing (Kurtser and Edan 2020; Diyaley et al. 2020).

To use this approach, it is necessary to ensure that all sub-tasks have different priorities and that the sub-tasks do not interfere with the main task. To achieve this, a redundancy formalism is used (Mansard and Chaumette 2004, 2005).

The task sequencing approach proposes a solution to increase the robustness of the system by dividing the global task into multiple subtasks. This is done by adding and removing those subtasks from a stack, according to the conditions of the environment.

To build an environment-adaptive system that may include trajectory constraints, more subtasks are added to the stack. This can be done using a cost function that determines the safe positions of the robot. The function can be defined in joint space and takes high values in dangerous situations and low values in safe situations. Using predictive control, it is possible to estimate when the system will be in danger with the current task stack and to change it to remediate. If the cost function exceeds a certain threshold, a task-stack change is performed. This method has been successfully implemented and tested by Mansard on a real robot, as described in Mansard and Chaumette (2004) and Mansard and Chaumette (2005).

2.4.3 Feature trajectory planning

In order to fulfill requirements such as obstacle avoidance, joint limit avoidance or collision avoidance, it is possible to create a trajectory for the robot and use visual servoing to control the robot to follow the established trajectory (Hosoda et al. 1995; Mezouar and Chaumette 2002; Dejun and Kam 2020). Multiple constraints can be integrated concurrently. Trajectories of the features \(s_{d}\) that allow the camera to reach the desired position while satisfying the constraints can be generated using path planning techniques, such as the well-known “potential field” method (Zhao et al. 2020).

Separating trajectory planning from tracking allows a significant improvement in the robustness of visual servoing to modeling errors. Indeed, modeling errors have a large effect when the error s − sd is large, but a small effect when s − sd is small. Once the desired feature trajectory sd(t), satisfying sd(0) = s(0), has been designed in the planning stage, the scheme tracks the changes of sd and keeps the error s − sd small. More precisely, we have:

$$\dot{\varvec{e}} = \dot{\varvec{s}} - \dot{\varvec{s}}_{{\varvec{d}}} = {\varvec{L}}_{e} {\varvec{v}}_{c} - \dot{s}_{d}$$
(29)

From that, using the control law \(\dot{e} = - {\lambda e}\), the velocity can be obtained:

$${\varvec{v}}_{c} = - {\uplambda }\widehat{{{\varvec{L}}_{e}^{ + } }}{\varvec{e}} + \widehat{{{\varvec{L}}_{e}^{ + } }}\dot{s}_{d}$$
(30)

The second component of this control law predicts the variation of sd, eliminating the tracking error.
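A minimal sketch of the tracking law (30); the interaction matrix estimate and the planned feature velocity \(\dot{s}_d\) are assumed to be supplied by the planning stage.

```python
import numpy as np

def tracking_vs_step(s, s_d, s_d_dot, L_e, lam=0.5):
    """Tracking law of Eq. (30): feedback on e = s - s_d plus a feed-forward term
    that compensates for the planned motion of the desired features s_d(t)."""
    L_pinv = np.linalg.pinv(L_e)          # estimate of L_e^+
    e = np.asarray(s) - np.asarray(s_d)
    return -lam * L_pinv @ e + L_pinv @ np.asarray(s_d_dot)
```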

2.5 Problems in visual servoing

Choosing a suitable set of visual features and designing good control schemes are both necessary to avoid errors when performing visual servoing. Common problems in visual servoing control tasks are directly affected by these choices: local minima, singularities, and loss of visibility of the visual features.

2.5.1 Local minima

In general, the local minima problem only occurs with specific configurations. When stuck in a local minimum, the camera velocity is zero while the feature errors are not regulated to zero. This results in convergence to a final position different from the desired one. When the feature vector s is built from three points in the image and LS has full rank, we have Ker (\(L_S^{ + }\)) = 0, implying that there are no local minima. However, the same image of three points can be seen from four different camera positions, which means that there exist four camera poses such that s = sd, corresponding to four global minima. When four points are used, there is theoretically only one camera pose. However, with four points \(L_{S} \in R^{8 \times 6}\), which implies that \(\dim Ker\left( {L_{S}^{ + } } \right) = 2\). Using four points, the control law tries to enforce eight constraints on the image trajectory while the system has only six degrees of freedom. In that case, because the control law may compute image motions that are not physically realizable, the system may reach a local minimum.

Several control strategies have been developed to avoid local minima in visual servoing such as using hybrid schemes or trajectory planning.

2.5.2 Singularity

When the interaction matrix becomes singular, the camera velocity tends to infinity, thereby causing instability of the system. A singularity can appear when pixel coordinates are selected as image features. For example, when four points are used and the required motion of the camera is a rotation of 180° around the optical axis, the feature point trajectories are straight lines passing through the center of the image and the Jacobian matrix becomes singular. For this movement, it is not appropriate to use pixel coordinates. If these coordinates are replaced by cylindrical coordinates (ρ, θ), the singularity can be avoided when performing a 180° rotation around the optical axis.

In PBVS, the problems of local minima and singularities can be avoided depending on the choice of the error e; a straight-line camera trajectory from the starting position to the desired position is obtained when the error is defined in the desired camera frame. Singularities can also be avoided by using partitioned visual servoing, 2 ½ D visual servoing, switching schemes, or the PBVS scheme.

2.5.3 Visibility of the features

When using 2D or 3D visual servoing schemes with poor calibration and an initial position far from the desired position, the target can leave the camera’s field of view. Visual servoing control laws must therefore be designed to keep the features in the camera’s field of view, ensuring reliable feedback throughout the servoing. To minimize the probability of features leaving the FOV, a “repulsive potential field” can be applied, a path planning strategy can be used, or a schematic transformation map as well as structured light can be employed.

2.6 Interaction matrix approximation for IBVS

When using IBVS, the following stability and convergence problems are encountered:

  • The system reaches a local minimum far from the desired position. This happens when the interaction matrix is rank-deficient.

  • The interaction matrix becomes singular and leads to an unstable system.

  • Unnecessary back-and-forth movements of the camera when performing a rotation around the optical axis, caused by the control law enforcing straight-line feature trajectories in the image.

  • When asked to perform a rotation of 180° around the optical axis, the camera retreats to infinity.

In this section, methods for approximating the interaction matrix are presented to address the above problems, and the advantages and disadvantages of each method are analyzed.

Constant Jacobian matrix: \(\widehat{{{\text{L}}^{ + } }}\) has a fixed value equal to the pseudo-inverse of the interaction matrix at the desired position. Therefore, only the features and depths at the desired position are needed to calculate the interaction matrix. Denoting by \({\text{L}}^{*}\) the interaction matrix at the desired position, if \({\text{L}}^{*}\) has full rank, the camera velocity can be calculated by:

$$v = - K\left( {L^{*} } \right)^{ + } e\left( t \right) = - K\left( {\left( {L^{*} } \right)^{T} L^{*} } \right)^{ - 1} \left( {L^{*} } \right)^{T} e\left( t \right)$$
(31)

With this method, the stability of the system is only guaranteed in a small region around the desired position, and this region is difficult to characterize. The trajectories of the feature points in the image are not straight lines and are difficult to predict; therefore, some features may leave the camera’s field of view, especially when the start position is far from the desired position.

Varying Jacobian matrix: the depth z(t) of each feature is estimated either from a known 3D model of the object or from camera motion measurements. The interaction matrix can then be calculated using expression (11). Denoting \(L^{\prime} = L\left( {s\left( t \right),\widehat{z\left( t \right)}} \right)\), if the matrix L′ has full rank, the camera velocity can be calculated by:

$$v = - K\left( {L^{\prime}} \right)^{ + } e\left( t \right) = - K\left( {\mathop {L^{\prime}}\nolimits^{{\text{T}}} L^{\prime}} \right)^{ - 1} \mathop {L^{\prime}}\nolimits^{{\text{T}}} e\left( t \right)$$
(32)

With this method, the trajectories of the feature points are straight lines from the initial position to the desired position. However, the camera can reach local minima and the interaction matrix can become singular.

Pseudo-inverse of the mean of the Jacobians: The approximation of the interaction matrix is obtained by averaging the two matrices in the above method:

$$\hat{L} = \frac{{L^{\prime} + L^{*} }}{2}$$
(33)

In general, the rank of the average matrix is not related to the ranks of the two matrices \(L^{\prime}\) and \(L^{*}\). If this matrix has full rank, the camera velocity is calculated by:

$$v = - K\left( {\frac{{L^{\prime} + L^{*} }}{2}} \right)^{ + } e\left( t \right)$$
(34)

This scheme shows good performance for translations and for rotations around the camera's optical axis, without any retreating motion. However, the camera goes to infinity when the required movement is a rotation of 180° around the optical axis. Because there is no backward motion, the trajectories of the features are curves with large curvature, and some of the features may leave the camera’s field of view.

Mean of the Jacobian pseudo-inverses: This method directly approximates the pseudo-inverse matrix:

$$\widehat{{L^{ + } }} = \frac{{\left( {L^{\prime}} \right)^{ + } + \left( {L^{*} } \right)^{ + } }}{2}$$
(35)

and the camera velocity is calculated by:

$$v = - K\frac{{\left( {L^{\prime}} \right)^{ + } + \left( {L^{*} } \right)^{ + } }}{2}e\left( t \right)$$
(36)

If the required motion is only rotations around the camera’s Z-axis, there is no retreat motion. However, a small retreat occurs if the required motion of the camera is a combination of translation and rotation around the optical axis. The retreat motion makes the trajectories of the feature points almost linear. Thus, it helps to ensure that the features do not leave the camera’s field of view.

The MJPM scheme has an advantage over the PMJM scheme when the required camera rotation is very close to 180°: MJPM converges with a smooth trajectory, whereas PMJM shows an unreasonable trajectory in which the camera performs a large rotation around the optical axis at the beginning of the motion because the interaction matrix is close to singularity (Cong and Hanh 2019).
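The four approximations of this section differ only in how the pseudo-inverse estimate is formed, which the following sketch makes explicit (the method names follow the text; L_current and L_desired stand for L′ and L*, and the scalar gain is illustrative).

```python
import numpy as np

def ibvs_velocity_approx(e, L_current, L_desired, method="PMJM", gain=0.5):
    """Camera velocity for the interaction matrix approximations of Sect. 2.6.

    L_current : L' = L(s(t), Z_hat(t));  L_desired : L* at the desired pose.
    """
    if method == "constant":      # constant Jacobian, Eq. (31)
        L_hat_pinv = np.linalg.pinv(L_desired)
    elif method == "varying":     # varying Jacobian, Eq. (32)
        L_hat_pinv = np.linalg.pinv(L_current)
    elif method == "PMJM":        # pseudo-inverse of the mean of the Jacobians, Eq. (34)
        L_hat_pinv = np.linalg.pinv(0.5 * (L_current + L_desired))
    elif method == "MJPM":        # mean of the Jacobian pseudo-inverses, Eq. (36)
        L_hat_pinv = 0.5 * (np.linalg.pinv(L_current) + np.linalg.pinv(L_desired))
    else:
        raise ValueError("unknown approximation method")
    return -gain * L_hat_pinv @ e
```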

2.7 Parametric approximations for the Jacobian matrix

E. Nematollahi and F. Janabi-Sharifi proposed a new class of parametric approximations of the interaction matrix (Nematollahi and Janabi-Sharifi 2009) to address some of the difficulties of IBVS systems related to interaction matrix approximations. Their paper proposes three methods, of which the third is an extension of the other two; since the first and second methods are special cases of the third, only the third method is discussed here.

Consider the following system of equations:

$$\begin{gathered} \left( {\delta_{1} L^{\prime} + \delta_{2} L^{*} } \right)v = - K\left( {\delta_{1} + \delta_{2} } \right)e\left( t \right) \hfill \\ \left( {\eta_{1} L^{\prime} + \eta_{2} L^{*} } \right)v = - K\left( {\eta_{1} + \eta_{2} } \right)e\left( t \right) \hfill \\ \end{gathered}$$
(37)

where \(\delta_{1} ,\delta_{2} ,\eta_{1} ,\eta_{2}\) are the coefficients that need to be determined.

The camera velocity is calculated by:

$$v = - K\left[ {\begin{array}{*{20}c} {L_{\delta } } \\ {L_{\eta } } \\ \end{array} } \right]^{ + } \left[ {\begin{array}{*{20}c} {\left( {\delta_{1} + \delta_{2} } \right)e\left( t \right)} \\ {\left( {\eta_{1} + \eta_{2} } \right)e\left( t \right)} \\ \end{array} } \right] = - KA^{ - 1} B^{T} e\left( t \right)$$
(38)

where \(L_{\delta } = \delta_{1} L^{\prime} + \delta_{2} L^{*} ,L_{\eta } = \eta_{1} L^{\prime} + \eta_{2} L^{*}\)

$$A = L_{\delta }^{T} L_{\delta } + L_{\eta }^{T} L_{\eta }$$
$$B = \left( {\delta_{1} + \delta_{2} } \right)L_{\delta } + \left( {\eta_{1} + \eta_{2} } \right)L_{\eta }$$

We can see that this method includes all the above methods except the MJPM method. For example, choosing \(\delta_{1} = \delta_{2} = 1/2\) and \(\eta_{1} = \eta_{2} = 0\) yields the PMJM method. If the parameters are chosen appropriately, this method shows superior behavior over the other methods; in particular, it can control the camera to perform a pure rotation of 180° around the optical axis without any retreat motion.
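A sketch of the parametric law (38); the coefficients are free design parameters, and the default values below simply reproduce the PMJM case mentioned above.

```python
import numpy as np

def parametric_jacobian_velocity(e, L_current, L_desired,
                                 d1=0.5, d2=0.5, n1=0.0, n2=0.0, gain=0.5):
    """Parametric approximation of Eqs. (37)-(38);
    d1 = d2 = 1/2, n1 = n2 = 0 recovers the PMJM scheme."""
    L_d = d1 * L_current + d2 * L_desired
    L_n = n1 * L_current + n2 * L_desired
    A = L_d.T @ L_d + L_n.T @ L_n
    B = (d1 + d2) * L_d + (n1 + n2) * L_n
    return -gain * np.linalg.solve(A, B.T @ e)   # Eq. (38): v = -K A^{-1} B^T e
```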

3 Assessment of visual servoing techniques

3.1 Description of evaluation criteria and methods

To make the evaluation of visual servoing control techniques as objective as possible, it is important to adhere rigorously to the evaluation criteria. Toward this goal, the paper builds a series of control tasks that are commonly found in visual servoing and proposes quantitative criteria for evaluation. The techniques are evaluated on tasks in which visual servoing systems often encounter difficulties and reveal their weaknesses.

3.1.1 Control tasks

Various visual servoing control methods have been developed to meet the specific requirements of different tasks. Therefore, to evaluate visual servoing techniques, the paper selects five control tasks that most often cause problems in visual servoing.

Task 1: Rotation around the optical axis. The first task considered is pure rotation around the optical axis. With this task, the IBVS algorithms expose problems such as unnecessary movements along the optical axis (the “camera retreat” and “camera advance” problems) and the camera moving to infinity when a rotation of 180° is required. Several schemes have been developed to solve this problem, such as the PC&SH method, 2 ½D VS, SPVS, or the use of different approximations of the Jacobian matrix.

Task 2: Translation along the optical axis. The second task is pure motion along the optical axis with an initial position from a distance of 1 m in front of the target to a distance of 1 m behind the target. This particular motion is chosen because, in essence, visual servoing control schemes depend on depth estimation.

Task 3: Rotation around the Y-axis of the camera. The third task corresponds to pure rotation around the Y-axis of the camera coordinate frame. This task represents rotations around axes parallel to the image plane. The target rotation angles range from 10° to 70°; when the rotation angle is greater than 70°, the feature points become almost aligned, so visual servoing cannot be performed.

Task 4: Motion along the Y-axis of the camera. This task represents translations along any axis in a plane parallel to the image plane.

Task 5: General motion. The final task involves generalized movements that require the visual servoing system to perform translational and rotational motions simultaneously. In this task, in order to better assess the performance of the VS algorithms, the simulation is performed with a PUMA560 robot: the velocity obtained during visual servoing is used to control the joint motions through the robot’s Jacobian matrix. Figure 4 shows the PUMA560 robot from the Robotics Toolbox used for the simulation.

Fig. 4
figure 4

PUMA560 robot

3.1.2 Performance metrics

For the evaluations to be quantitative, it is necessary to define a set of performance metrics that are used to evaluate the schemes. The following metrics are selected for evaluation:

Number of iterations to convergence: Visual servoing is considered successful and stops when the average error of the feature points is less than 1 pixel. If the average errors of five consecutive iterations differ by no more than 0.1 pixel, the error is considered to have converged to a constant value and the process is stopped. The number of iterations can be increased or decreased by changing the gain λ; nevertheless, it is a useful parameter for comparing the performance of different systems, or of one system across different tasks.

Error at termination: When visual servoing halts, the pixel error of each feature point with respect to its desired position is calculated. Visual servoing stops when the error decreases to zero or converges to a constant value, as mentioned above. Additionally, if 300 iterations have been performed without convergence, visual servoing is halted. Finally, the VS process is halted if the camera has moved back 10 m from the target, advanced to a depth of 0 m, or the feature points have moved more than 3000 pixels from the image center.

Maximum feature excursion: At each iteration, the distance from each feature point to the center of the image is calculated. The maximum value, in pixels, reached during the entire process is used for the evaluation.

Maximum camera translation: At each iteration, the distance of the camera from the desired position is calculated. The maximum value is used for the evaluation.
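The four metrics above can be computed from logged simulation data; the sketch below shows one possible implementation (the variable names and the data layout are assumptions of this illustration, not part of the evaluated toolchain).

```python
import numpy as np

def performance_metrics(avg_errors, feature_positions, camera_positions,
                        camera_position_d, err_tol=1.0, conv_tol=0.1):
    """Compute the four metrics of Sect. 3.1.2 from logged data.

    avg_errors        : per-iteration average pixel error of the features
    feature_positions : (iters, n, 2) pixel coordinates relative to the image centre
    camera_positions  : (iters, 3) camera positions; camera_position_d: desired position
    """
    errors = np.asarray(avg_errors)
    n_iter = len(errors)
    for k in range(4, len(errors)):
        # stop when the error is below 1 px, or flat (within 0.1 px) over 5 iterations
        if errors[k] < err_tol or np.ptp(errors[k - 4:k + 1]) < conv_tol:
            n_iter = k + 1
            break
    final_error = errors[n_iter - 1]
    max_excursion = np.max(np.linalg.norm(np.asarray(feature_positions), axis=-1))
    max_translation = np.max(np.linalg.norm(
        np.asarray(camera_positions) - np.asarray(camera_position_d), axis=-1))
    return n_iter, final_error, max_excursion, max_translation
```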

3.1.3 Simulation methodology

All simulations are performed in Matlab with the support of the Machine Vision Toolbox and the Robotics Toolbox (Zhong et al. 2019). For each simulation, the feature points are the vertices of a square defined in the 3D coordinate system. The square has a side length of 0.1 m, and the desired camera position is 1 m from the square plane. To estimate the homography matrix, the feature points must be coplanar. Using the vertices of a square gives consistent results that an arbitrarily shaped configuration might not achieve.

The simulated camera has a pixel size of \(10^{-5} \times 10^{-5}\) m and a focal length f = 0.008 m. The image plane can be considered infinite, but if a feature moves more than 6000 pixels from the center of the image, the visual servoing system stops. Furthermore, if the error has neither reached zero nor converged to a constant value after 300 iterations, visual servoing also halts.

The gain for each system is a 6 × 6 diagonal matrix, allowing the convergence rate of each degree of freedom to be adjusted independently. The gains are chosen so that an error in a given degree of freedom is regulated to zero in 30 iterations while the errors in the other degrees of freedom remain zero.

The performance of a visual servoing system is affected by various conditions such as signal noise, calibration errors, and errors in the kinematic parameters of the robot. The paper performs simulations of visual servoing with image noise. The noise shifts the coordinates of the feature points, which causes errors when calculating the camera's displacement. To measure the influence of noise on the different systems, the paper simulates noise during feature detection by adding an “offset” value to the pixel coordinates of the features. This offset follows a Gaussian distribution with zero mean and variance ranging from 0 to 1. Since the noise is random, each simulation is performed 100 times over the entire range of motion and the results are averaged to smooth them and remove outliers.
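The noise model described above amounts to a per-coordinate Gaussian offset; a possible sketch of the perturbation and of the repeated-run averaging is given below (run_once is a placeholder for a single servoing simulation returning the metric of interest).

```python
import numpy as np

def add_pixel_noise(pixel_coords, variance, rng):
    """Add a zero-mean Gaussian offset with the given variance to every pixel coordinate."""
    return np.asarray(pixel_coords) + rng.normal(0.0, np.sqrt(variance),
                                                 size=np.shape(pixel_coords))

def averaged_metric(run_once, variance, n_runs=100, seed=0):
    """Average a metric returned by run_once(rng, variance) over repeated noisy runs."""
    rng = np.random.default_rng(seed)
    return np.mean([run_once(rng, variance) for _ in range(n_runs)])
```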

3.2 Simulation results and evaluation

In this section, the simulation results are presented for the five visual servoing control schemes described above: image-based visual servoing (IBVS), position-based visual servoing (PBVS), 2 ½ D visual servoing (2.5D VS), partitioned visual servoing (PVS) and shortest path visual servoing (SPVS). Each scheme is simulated on the control tasks described above. Each subsection of this section details one control task: the notable results are summarized at the beginning of the subsection, followed by detailed descriptions for each performance metric.

Graphs are shown in groups of five figures, one for each VS scheme, arranged in clockwise order: IBVS, PBVS, 2.5D VS, PVS and SPVS. The simulation results under the influence of noise are presented as 3D graphs, with the noise variance increasing along the right axis, the motion variable (rotation or translation) increasing along the left axis, and the performance index increasing along the vertical axis.

3.2.1 Rotation around optical axis

The trajectories of the feature points are shown in Fig. 5. The IBVS method produces trajectories that are straight lines from the starting position to the ending position, while the other methods produce trajectories that are circular arcs. When the rotation is close to 180°, the features in IBVS move to the center of the image, the Jacobian matrix becomes singular, and the system cannot converge.

(a) Remaining pixel error

Fig. 5
figure 5

Trajectories of feature points with a rotation of 60° around the Z axis

The simulation results are shown in Fig. 6. IBVS regulates the error to zero for rotations of less than 160°, even as the effect of noise increases. However, when the angle of rotation is greater than 160°, the camera retreats to infinity, bringing the feature points to the center of the image, and the system cannot converge.

Fig. 6
figure 6

Average pixel error when rotating around the Z axis

The remaining methods show a dependence of the error on the noise level. Among them, the PVS method is less influenced by noise, while the other three methods are affected to almost the same degree.

When there is no noise, or the noise is small, the algorithms are able to regulate the error to zero. However, as the noise increases, the error after 300 iterations remains between 1.5 and 3 pixels, depending on the magnitude of the noise. The error of the IBVS algorithm also increases with the magnitude of the noise, although this is not visible in Fig. 6.

The rotation angle has almost no effect on the error.

(b) Number of iterations

Figure 7 shows the change of the number of iterations to convergence or failure with different values of rotation angle and noise.

Fig. 7
figure 7

Average iterations until convergence with rotation around the Z-axis

The graphs show that in all methods the number of iterations depends on the noise level and on the rotation angle. The IBVS method shows a sudden change in the number of iterations when the rotation angle is close to 180°; this is because the camera retreats rapidly to the limit value, which stops the process, so the number of iterations is very small. Compared with the other methods, IBVS is less affected by noise: when the noise variance is less than 0.5, the number of iterations is almost unchanged, and when the variance is greater than 0.5, the number of iterations increases but remains below 300. For rotation angles less than 30°, the number of iterations increases slightly as the rotation increases and is almost unchanged for rotation angles from 30° to 160°.

The PBVS and 2.5DVS methods give almost the same results. The number of iterations starts to increase rapidly when the noise variance exceeds 0.25, and when the variance reaches one, both methods take the full 300 iterations. As the rotation angle changes, there is also a slight increase in the number of iterations.

For the PVS method, the number of iterations increases when noise and rotation angle increase. Especially, if the rotation angle is large, the number of iterations increases sharply as the value of the noise increases.

(c) Maximum feature point excursion

The dependence of the maximum excursion of the feature points on the rotation angle and the noise is shown in Fig. 8.

Fig. 8
figure 8

Average maximum feature point excursion with rotation around the Z-axis

The results are similar to those for the number of iterations. Except for the PVS method, all methods show that the excursion increases with the noise value and is independent of the rotation angle. In the IBVS method, when the rotation angle is greater than 160°, the feature points move towards the center of the image, so the maximum excursion remains constant and equal to the value at the initial position. For the PVS method, the trajectory of the feature points changes as the noise value increases, causing the excursion to increase. When noise is present, this distance depends only on the rotation angle and not on the magnitude of the noise: the excursion increases sharply as the angle increases and reaches its maximum at a rotation of 180°.

(d) Maximum camera translation

For rotation around the optical axis, the camera translation is the most interesting metric. The graph for this parameter is shown in Fig. 9.

Fig. 9
figure 9

Average maximum camera translation with rotation around the Z-axis

As mentioned before, IBVS exhibits a large retreat movement. The translation of the camera increases exponentially with the rotation angle; when the rotation angle reaches 165°, the retreat exceeds 10 m and the process stops.

PBVS has the smallest camera translation. There is a small increase in this translation value as the magnitude of the noise increases.

The camera translates a large distance in the PVS method. This value increases with the rotation angle, reaches a maximum of 1.23 m at a rotation angle of 140°, and then gradually decreases. The translation does not depend on the magnitude of the noise.

The SPVS and 2.5DVS methods have almost the same graphs, with the camera displacement increasing with the noise and almost unchanged with the rotation angle. The maximum retreat value is only about 25 mm.

3.2.2 Translation along the optical axis of the camera

The trajectories of the feature points in the image for motion along the Z-axis are almost the same for all methods (almost straight lines). Figure 10 shows the trajectory for the IBVS method.

Fig. 10
figure 10

Trajectories of feature points in translation along the Z-axis

There are no stability problems with the translation along the optical axis, and the systems show approximately the same graphs for each criterion.

(a) Remaining pixel error

The pixel error graphs are shown in Fig. 11 and give similar results for all schemes. The error increases with increasing noise and is almost independent of the initial distance. IBVS gives the smallest error, with values varying from 0.8 to 1.2 pixels. PBVS and 2.5DVS have the largest errors, with a range of 0.5–2.5 pixels. The error of the SPVS method is quite uniform; in some cases, when the noise variance equals 1, the error increases to about 1.8 pixels. The error range of the PVS method is 0.5 to 1.5 pixels.

(b) Number of iterations

Fig. 11 Average pixel error when translation along the Z axis

The graphs of the number of iterations are shown in Fig. 12. The three methods IBVS, PVS and SPVS give nearly the same results: the average number of iterations stays below 300, and it increases rapidly when the noise is greater than 0.25. IBVS requires the fewest iterations and is less affected by noise than the other methods; for noise values of 0.25 and 0.5, its number of iterations is almost unchanged. The two remaining methods behave similarly, with the number of iterations increasing sharply with the noise and reaching 300 when the noise is greater than 0.75.

Fig. 12 Average iterations with translation along the Z-axis

Without noise, the number of iterations increases as the motion value increases; with noise, the motion value does not affect the number of iterations (except for the 2.5DVS method, which is still affected).

3.2.3 Rotation around the Y-axis of the camera

Based on the trajectories of the feature points when rotating around the Y-axis in Fig. 13, it can be seen that the feature trajectories of all methods are almost straight lines; those of PVS and 2.5DVS have a relatively small curvature.

(a) Remaining pixel error

Fig. 13 Trajectories of feature points when rotated around the Y-axis

The pixel error graphs are shown in Fig. 14; the results show a difference between the methods.

Fig. 14 Average pixel error when rotating around the Y axis

For all methods, the error increases as the noise increases; when the noise reaches 1, every method gives an error of more than 1 pixel. IBVS and PVS give lower errors than the other methods: when the noise is less than 1, these two methods give an error of less than 1 pixel, and their maximum error at a noise of 1 is about 1.4 pixels.

2.5DVS and SPVS give similar error graphs, with the error increasing as the noise increases; however, SPVS has a smaller error, with a maximum of about 1.7 pixels, compared with 2.5 pixels for 2.5DVS. The PBVS graph varies less regularly with the noise, but in general the error increases with the noise and reaches a maximum of 1.8 pixels.

The angle of rotation has almost no effect on the error.

(b) Number of iterations


Figure 15 shows the number of iterations needed to stop when rotating around the Y-axis. For all methods, the number of iterations depends on both the noise level and the rotation angle, and it increases as the rotation angle increases. For the 2.5DVS method, the number of iterations rises rapidly to more than 100 as the rotation angle approaches 80°.

Fig. 15 Average iterations when rotating around the Y axis

The graphs of 2.5DVS, PBVS and SPVS show that the number of iterations increases rapidly as the noise increases, with 2.5DVS affected the most; all three methods reach 300 iterations when the noise variance is greater than 0.75. IBVS and PVS are less affected by noise, and their number of iterations remains below 300 even when the noise variance is equal to 1.

(c) Maximum camera translation


The graphs of the camera translation for different rotation angles and noise levels are shown in Fig. 16. The two methods PBVS and SPVS have almost no camera displacement because they compute the translational velocities along the three axes directly from the estimated translation vector. Under the influence of noise, the camera displacement of these two methods increases with the noise, with a maximum of 6 mm when the noise is equal to 1.

Fig. 16 Average maximum camera translation with rotation around the Y axis

For the other three methods, the camera makes unnecessary movements even though the task requires only rotation around the Y-axis of the camera. This is because these methods use image features to compute the translational velocity of the camera. The camera displacement increases with the rotation angle and is almost unaffected by noise, except for IBVS, whose displacement increases slightly as the noise increases. The PVS method has the largest displacement, with a maximum of 2.2 m at a rotation angle of 80°. IBVS and 2.5DVS have similar displacements, with a maximum of about 1 m at a rotation angle of 80°.
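The difference between the two families can be sketched as follows; the symbols, sign conventions and gains are illustrative assumptions, not the exact control laws of the compared schemes. A position-based law feeds the estimated translation error directly into the translational velocity, so a pure-rotation task ideally commands no translation, whereas an image-based law extracts the translation from the feature error through the pseudo-inverse of the interaction matrix, which couples rotation and translation and explains the unnecessary camera motion observed here.

```python
import numpy as np

def translational_velocity_position_based(t_err, lam=0.5):
    """Position-based style: t_err is the translation from the current to the
    desired camera frame, expressed in the current frame (assumed to come from
    pose estimation; sign conventions vary with how the pose error is defined).
    A pure-rotation task gives t_err = 0, hence zero translational velocity."""
    return lam * np.asarray(t_err)

def translational_velocity_image_based(L, e, lam=0.5):
    """Image-based style: the full 6-DOF camera velocity is computed from the
    feature error e through the pseudo-inverse of the interaction matrix L;
    the first three components are the translational velocity, which is in
    general non-zero even for a pure-rotation task."""
    v = -lam * np.linalg.pinv(L) @ e
    return v[:3]
```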

3.2.4 Translation along the Y-axis of the camera

The trajectories in the image of the feature points for motion along the Y axis are approximately the same. Figure 17 shows the trajectory for the PBVS method.

Fig. 17 Trajectories of feature points when translating along the Y-axis

For this task, all methods yielded roughly the same performance metrics. Distance of motion does not have much effect on performance. IBVS and PVS are less affected by noise than the other methods.

(a) Remaining pixel error


Figure 18 shows the average pixel error when translating along the Y-axis. The results are almost the same for all schemes. The error increases as the noise increases and is independent of the initial distance. PBVS and 2.5DVS have the largest error, with a range of 0.75–2.5 pixels. The error of the PVS method is quite uniform; when the noise increases to 1, the error only rises to about 1.5 pixels. The other two methods have an error range of 0.75–2 pixels.

(b) Number of iterations

Fig. 18 Average pixel error when translating along the Y axis


The graphs of the number of iterations are shown in Fig. 19. The graphs show that the number of iterations of all methods depends on the noise level and the motion value.

Fig. 19 Average iterations when translating along the Y axis

The three methods PBVS, 2.5DVS and SPVS reach 300 iterations when the noise is greater than 0.75. The other two methods, IBVS and PVS, remain below 300 iterations when the noise is equal to 1, and their number of iterations is almost unchanged if the noise is less than 0.5.

3.2.5 General motion

Tasks starting from a general initial camera pose are performed using the PUMA 560 robot. The desired position of the robot is at a distance of 1 m from the feature points, with the end-effector at the pose P0 = [0.4318, − 0.15, − 0.6] and roll, pitch, yaw angles of π, π, π/2, as shown in Fig. 20.

Fig. 20 The desired configuration of the robot

All general-motion tasks were performed with white image noise of variance equal to 1. Each task was run 50 times and the criteria were averaged.
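The averaging protocol can be pictured as the following loop. This is only a sketch of the procedure described above (Gaussian pixel noise of variance 1, a 300-iteration cap, 50 runs averaged per task); the per-iteration servo step and the stopping threshold are assumed placeholders, not the actual simulation code.

```python
import numpy as np

N_RUNS, MAX_ITERS, NOISE_STD = 50, 300, 1.0   # variance 1 -> std 1 pixel
CONV_THRESHOLD = 0.5                           # assumed stopping threshold (pixels)

def run_trial(servo_step, s_init, s_desired, rng):
    """servo_step is a placeholder for one iteration of a given scheme: it
    takes the noisy measured features and returns the updated true features
    after applying the computed camera velocity."""
    s = s_init.copy()
    for k in range(MAX_ITERS):
        s_meas = s + rng.normal(0.0, NOISE_STD, size=s.shape)   # white pixel noise
        if np.linalg.norm(s_meas - s_desired, axis=-1).mean() < CONV_THRESHOLD:
            break
        s = servo_step(s_meas, s_desired)
    remaining_error = np.linalg.norm(s - s_desired, axis=-1).mean()
    return remaining_error, k + 1

def evaluate(servo_step, s_init, s_desired, seed=0):
    """Average the criteria over N_RUNS independent noisy runs."""
    rng = np.random.default_rng(seed)
    results = [run_trial(servo_step, s_init, s_desired, rng) for _ in range(N_RUNS)]
    errors, iters = np.array(results).T
    return errors.mean(), iters.mean()
```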

(a) Test 1


Test 1 is a combination of three translational motions along the X, Y and Z axes with the translation vector P = [0.3, 0.2, 0.6]. The robot configuration at this location is shown in Fig. 21.

Fig. 21 The initial configuration of the robot in Test 1

Table 1 shows the criteria for each method. Figures 22 and 23 show the trajectories of the feature points in the image and the trajectory of the camera for each method, respectively. For test 1, which requires only translational motions along the X, Y and Z axes, the results are almost the same: all methods regulate the error to zero and need almost the same number of iterations to converge. The 2.5DVS method converges more slowly than the others; the velocity graph in Fig. 24 shows that its Z-axis velocity is only half that of the other methods, which explains the slower convergence. The PVS method also has a small Z velocity, but larger X and Y velocities, so its convergence speed is comparable to the other methods.

Table 1 Performance metric when executing test 1
Fig. 22 The trajectories of the feature points in Test 1

Fig. 23 The 3D trajectory of robot in Test 1

Fig. 24 The translational velocity of robot in Test 1

For all systems, the feature point trajectories are almost straight lines from the initial to the desired coordinates, with those of the PVS and 2.5DVS methods showing a small curvature. The camera trajectories in 3D space are also straight lines.

(b) Test 2


Test 2 is a combination of three rotations around the X, Y, and Z axes with rotation angles of 10°, 15° and − 90°, respectively. There is no translational motion required in test 2. The objective of this test is to investigate the independence of rotational and translational motions. Table 2 presents the performance metrics for test 2.

Table 2 Performance metric when executing test 2

Based on the data in Table 2, all methods are able to drive the camera to the desired position. In particular, the 2.5DVS method converges faster than the others. In the PVS method, the displacement of the feature points is larger than half of the image size, so the feature points can leave the image boundary. All other methods keep the feature points in the image, as shown in Figs. 25 and 26.

Fig. 25 The trajectories of the feature points in Test 2

Fig. 26 The change in the 3D coordinates of the robot in Test 2

Figures 27 and 28 show the angular and translational velocities of the camera, respectively. The translational velocities of PBVS and SPVS are almost zero, so there is no translational motion. The IBVS method has the highest velocity in the Z direction, while the PVS method has the largest velocity in the Y direction. The angular velocity graphs of PBVS, 2.5DVS and SPVS are almost the same; although the task requires rotations about three axes, these three methods rotate about only two of them. The IBVS and PVS methods require rotations about all three axes to reach the desired position.

Fig. 27 The angular velocity of the robot in Test 2

Fig. 28 The translational velocity of the robot in Test 2

Figure 26 shows the change in the 3D coordinates of the robot. The two methods PBVS and SPVS have no redundant motion along the X, Y and Z axes, while the other methods make unnecessary motions along these axes; the IBVS method has the largest motion, 0.3 m in the Z direction, and the 2.5DVS method has the smallest motion in all directions.

(c) Test 3


Test 3 requires translation along the X, Y and Z axes and a rotation around the Z-axis. The initial position of the camera relative to the desired position is P = [0.2, 0, 0.5], and the required rotation is 180° around the Z-axis. Figure 29 shows the robot configuration at the initial position for test 3.

Fig. 29 The initial configuration of the robot in Test 3

Based on the data in Table 3, all methods except IBVS converge after about 90 iterations. Because test 3 requires a 180° rotation, IBVS attempts a retreat to infinity; due to the robot's limits, the system stops after only 5 iterations because it exceeds the robot's workspace.

Table 3 Performance metric when executing test 3

To better see the behavior of each system, consider the feature point trajectories in Fig. 30 and the 3D camera trajectories in Fig. 31. Each method gives a different trajectory. PBVS and SPVS have similar feature point trajectories in the image because they share the same control law for the translation and for the rotation around the Z-axis. The 2.5DVS method gives slightly different trajectories because it uses a different control law for the translational motion. The PVS method gives trajectories different from the other methods: the feature points follow curves, so their distance from the center of the image is also larger than for the other methods (Table 3). The 3D trajectory of the PVS method is likewise a strongly curved path, while that of PBVS is almost linear. Test 3 involves no motion along the X and Y axes, and only the SPVS method avoids unnecessary movements; its trajectory is a straight line from the initial to the final position. PBVS and 2.5DVS have redundant movements along the Y-axis. The IBVS method has a retreat motion in the Z direction: the camera retreats up to a height of 0.8 m, exceeding the limits of the robot.

Fig. 30 The trajectories of the feature points in Test 3

Fig. 31 The 3D trajectory of the robot in Test 3

Figures 32 and 33 show the velocity plots. PBVS and SPVS have similar velocity graphs (they share the same control law). The PVS method shows very large variations of the velocities in the X and Y directions. IBVS has a large Z-axis velocity to perform the retreat motion.

(d) Test 4

Fig. 32 The translational velocity of the robot in Test 3

Fig. 33 The angular velocity of the robot in Test 3


Test 4 is a general motion that requires translation and rotation about the X, Y and Z axes. The initial position of the camera relative to the desired position is P = [− 0.3, − 0.2, 0.4], and the rotation angles are π/6, − π/6 and − π/2. The robot configuration at the initial position is shown in Fig. 34.

Fig. 34 The initial configuration of the robot in Test 4

In this test, the features are located near the edge of the image, so the phenomenon of features leaving the camera's field of view can be observed. Table 4 shows the performance metrics for test 4. All methods drive the camera to the desired position with an average error of about 0.8 pixels and roughly 80–90 iterations; the SPVS method needs the most iterations, 94.1. In the PBVS and PVS methods, the distance from the feature points to the center of the image exceeds half the image size, so these points can leave the camera's field of view. The remaining methods keep the features within the field of view. The trajectories of the feature points are shown in Fig. 35.

Table 4 Performance metric when executing test 4
Fig. 35 The trajectories of the feature points in Test 4

In Fig. 35, the feature point trajectories of PBVS and PVS are curves with a large radius, so the two lower points leave the camera's field of view. The remaining three methods use pixel coordinates in their control schemes, so they always tend to drive the feature points towards their desired positions and thus keep them in the image. The SPVS method produces feature point trajectories closest to straight lines.

Figure 36 shows the 3D trajectories of the camera. The trajectories of IBVS and PVS are arbitrary curves, the PBVS trajectory is a curve with small curvature, and the SPVS trajectory is a straight line. Thus, the SPVS method has the shortest trajectory both in 3D space and in the image. The 2.5DVS method also has an almost straight-line trajectory.

Fig. 36 The 3D trajectory of robot in Test 4

The camera velocities are shown in Figs. 37 and 38. IBVS and PVS show large variations in the velocity values, while in the other methods the velocities decrease exponentially to zero.

Fig. 37 The translational velocity of robot in Test 4

Fig. 38 The angular velocity of robot in Test 4

4 Conclusion

This paper has presented the general theory of visual servoing, whose two main components are the control scheme and the visual features. Several enhancement and combination schemes are used to increase the robustness of the system. Problems in visual servoing have been presented, and interaction matrix approximations are used to address them in IBVS.

Several standardized tests have been performed in the presence of image noise to evaluate the performance of each system on difficult tasks. These data can be used to select an appropriate visual servoing system for a specific task.

IBVS and PVS are less affected by noise than the other methods because their control laws operate directly in the image. However, they create unnecessary movements of the robot. PBVS and 2.5DVS are more sensitive to noise and may not achieve the desired error if the noise is large.

From the general tasks, it can be concluded that PVS and PBVS produce complex trajectories in the image, and their feature points easily leave the camera's field of view, especially when the points are located near the image boundary. IBVS and PVS produce redundant movements, as mentioned above. The SPVS method gives the best results, with a 3D trajectory that is a straight line from the starting position to the desired position, no unwanted motion, and the features kept in the camera's field of view. The 2.5DVS method performs somewhere in between.