
1 Introduction

Vision-based control (VBC) is a mainstream approach for driving a mobile robot to achieve multi-objective control, such as navigation and path planning, through visual feedback [1, 2]. The error of the feature points on the image plane is used as feedback to drive the mobile robot towards the target. The feature error reflects the difference between the current image features and the desired image features. One practical advantage of the VBC method is that no geometric model of the target is needed. A key issue is how to achieve an efficient mapping from the 2D feature errors to the velocities in the 3D Cartesian coordinate system; the kinematics bridges the two [3]. Multi-objective visual control requires the motion process to consider different objectives such as stability, rapidity, and keeping the target in view. To achieve a trade-off between these objectives, one feasible way is to use modular controllers [4]. Another is to regard performance efficiency as the primary objective and treat the others as constraints [5], which simplifies the multi-objective control task into a single-objective task with constraints. However, previous methods rely heavily on dynamic modeling, and such dynamic models are difficult to make scalable and generalizable.

The Jacobian matrix transforms the multi-objective visual control problem into a matrix estimation and optimization problem [6]. Three conventional methods are used to approximate the Jacobian matrix for multi-objective visual control: the first employs the current matrix, the second the desired matrix, and the third the average of the two [7]. The universal Jacobian matrix is a linear combination of the current matrix and the desired matrix, with a control parameter ranging from 0 to 1 balancing the two. It is appropriate to assign a smaller value to the parameter to ensure stable convergence when the mobile robot is close to the target, whereas a larger value achieves faster convergence [8]. Previous work used the proportional-integral-derivative (PID) method to estimate the Jacobian matrix, but the performance is limited by unsuitable parameters [9]. Until now, the estimation of the matrix has still depended on human experts' experience. Reinforcement Learning (RL) does not require prior knowledge about the environment; it drives agents to explore the environment and obtain the best solution via trial and error.

RL methods have demonstrated excellent performance in developing more advanced robots, including multi-objective visual control systems [10]. For example, [11] developed a novel RL method to stabilize a biped robot on a rotating platform, addressing the overfitting problem with guaranteed model complexity. However, such conventional tabular RL methods face a bottleneck due to their discrete action spaces and cannot handle high-dimensional or continuous control tasks. Advances in deep learning have made it possible to extract high-level features from sensory data, providing an opportunity to scale to problems with infinite state spaces. Deep Reinforcement Learning (DRL) applies the computational power of deep learning to relax the curse of dimensionality for complex tasks [12], learning a direct mapping from the state space to the policy space in an end-to-end way. Carlos Sampedro et al. developed a deep reinforcement learning controller to map the state space to linear velocity commands for multirotor aerial robots [13]. The learning performance of a DRL-based controller is significantly impacted by its hyperparameters, and a heuristic strategy is one approach to tuning them [14].

This paper proposes a reinforcement learning method with a heuristic strategy to obtain the transient matrix for multi-objective VBC systems. The Discounted Sampling Policy Gradient (DSPG) algorithm employs discounted return sampling, which makes it suitable for control systems with continuous spaces. The proposed method reduces the computation of the performance gradient to an expectation estimated from the discounted sampled returns; it requires only a fixed computing interval instead of a predetermined environmental model. The deep policy network produces control actions via an on-policy learning method, so an appropriate action is chosen to improve the performance of conventional multi-objective visual control in terms of convergence rate and stability. The hyperparameter of the proposed RL-based controller is optimized by a heuristic strategy, which yields a better-performing learning model for mobile robots.

Major contributions of this work are stated as follows:

  1) A VBC method with the DSPG algorithm (VBC-DSPG) is proposed for multi-objective control. The DSPG algorithm is used to select the time-varying Jacobian matrix online.

  2) A parameter tuning method with Cosine Annealing for the VBC-DSPG method is proposed. The learning rate of the RL model is tuned to improve the learning performance of the proposed method.

  3) Related experiments are conducted to investigate the potential of the proposed scheme on a well-known robot platform.

This paper is organized as follows. Section 2 gives a brief introduction to the multi-objective visual control model. Section 3 details the proposed VBC-DSPG method with Cosine Annealing. Section 4 describes the experimental setup and results. The last section presents the conclusions.

2 Vision-Based Multi-objective Control

2.1 Multi-objective Visual Control Model

In a visual control process, a robot uses an RGB-D camera to extract features from the current image, denoted \(\mathbf {S}_{c}=\left( \mathbf {s}_{1}^{c},\mathbf {s}_{2}^{c}, \ldots ,\mathbf {s}_{N}^{c}\right) ^{\mathrm {T}}\). To use visual control methods with a closed-loop feedback mechanism, we denote the desired features as \(\mathbf {S}_{*}=\left( \mathbf {s}_{1}^{*}, \mathbf {s}_{2}^{*}, \ldots , \mathbf {s}_{N}^{*}\right) ^{\mathrm {T}}\). Then, the errors \(\mathbf {e}(t)=\left[ \mathbf {s}_{1}^{c}-\mathbf {s}_{1}^{*}, \mathbf {s}_{2}^{c}-\mathbf {s}_{2}^{*}, \ldots , \mathbf {s}_{N}^{c}-\mathbf {s}_{N}^{*}\right] ^{\mathrm {T}} \in \mathfrak {R}^{2 N \times 1}\) between the current features and the desired features can be calculated. The feature error \(\mathbf {e}(t)\) is a time-varying vector whose change depends on the motion of the robot. The velocities of the robot are derived from the time derivative of \(\mathbf {e}(t)\), as follows

$$\begin{aligned} \dot{\mathbf {e}}(t)=d \mathbf {e}(t) / d t =\left( d\left[ \mathbf {s}_{1}^{c}-\mathbf {s}_{1}^{*}\right] / d t, d\left[ \mathbf {s}_{2}^{c}-\mathbf {s}_{2}^{*}\right] / d t, \ldots , d\left[ \mathbf {s}_{N}^{c}-\mathbf {s}_{N}^{*}\right] / d t\right) . \end{aligned}$$
(1)
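As a minimal numerical sketch (not the authors' implementation), the feature-error vector \(\mathbf{e}(t)\) and a finite-difference approximation of its time derivative can be computed as follows; the array layout, the helper names, and the sampling period `dt` are illustrative assumptions.

```python
import numpy as np

def feature_error(s_current, s_desired):
    """Stack the per-feature differences s_i^c - s_i^* into e(t), shape (2N, 1)."""
    return (np.asarray(s_current, dtype=float)
            - np.asarray(s_desired, dtype=float)).reshape(-1, 1)

def feature_error_rate(e_now, e_prev, dt):
    """Finite-difference approximation of Eq. (1), d e(t)/dt."""
    return (e_now - e_prev) / dt

# Example with N = 4 feature points, each a (u, v) pixel coordinate.
s_c = [[320.0, 240.0], [400.0, 240.0], [400.0, 300.0], [320.0, 300.0]]
s_star = [[310.0, 235.0], [395.0, 238.0], [398.0, 305.0], [315.0, 302.0]]
e_t = feature_error(s_c, s_star)          # shape (8, 1)
```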

Multi-objective visual control requires the movement of the robot to simultaneously satisfy three objectives: little fluctuation, short convergence time, and no loss of the target. The Jacobian matrix \(\mathbf {J}\) transforms the rate of change of the feature errors into the velocities of the robot. The conversion is as follows

$$\begin{aligned} \dot{\mathbf {e}}(t)=\mathbf {J}\left[ \omega _{t}, v_{t}^{x}, v_{t}^{y}\right] ^{\mathrm {T}}=\mathbf {J}[\varvec{\omega }, \mathbf {v}]^{\mathrm {T}}, \quad \mathbf {J} \in \mathfrak {R}^{2 N \times 3}, \end{aligned}$$
(2)

where N is the number of the features. The previous study [4] used \(\dot{\mathbf {e}}(t)=-\varphi \mathbf {e}(t)\) to ensure an exponential decoupled decrease for the error vector \(\mathbf {e}(t)\). Equation (3) gives the visual control law using the universal Jacobian matrix.

$$\begin{aligned} \begin{array}{l} {\left[ \omega _{t}^{r}, v_{t}^{x}, v_{t}^{y}\right] ^{\mathrm {T}}=-\varphi \hat{\mathbf {J}}^{+} \mathbf {e}(t), \quad \hat{\mathbf {J}}^{+} \in \mathfrak {R}^{3 \times 2 N}} \\ \mathbf {J}=x \mathbf {J}_{1}+(1-x) \mathbf {J}_{2} \end{array}, \end{aligned}$$
(3)

where \(\hat{\mathbf {J}}^{+}\) is the pseudo-inverse of the Jacobian matrix and \(\varphi \) is a constant servo gain [4]. Based on the Jacobian matrix, a multi-objective visual control model is given by,

(4)

where x is the control parameter and its value ranges from 0 to 1. \(f_{1}(x), f_{2}(x), \ldots , f_{m}(x)\) represent the objectives of a perfect control performance, which are defined for a specific task. \(\varphi (\bullet )\) is a conversion function mapping the feature errors to the velocities of the robot. In this work, the objectives are little fluctuation, short convergence time, and a low probability of target loss. An appropriate value of the control parameter x enables efficient multi-objective control.
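A minimal sketch of the control law in Eq. (3): the blended Jacobian is pseudo-inverted and applied to the feature errors to obtain the robot velocities. The function and variable names are illustrative; `J1` and `J2` stand for the current and desired Jacobian matrices discussed above.

```python
import numpy as np

def vbc_control_law(J1, J2, e, x, phi):
    """Velocity command [omega, v_x, v_y]^T from Eq. (3).

    J1, J2 : (2N, 3) Jacobian matrices (e.g. current and desired)
    e      : (2N, 1) feature-error vector
    x      : control parameter in [0, 1]
    phi    : constant servo gain
    """
    J = x * J1 + (1.0 - x) * J2      # universal (blended) Jacobian
    J_pinv = np.linalg.pinv(J)       # Moore-Penrose pseudo-inverse, shape (3, 2N)
    return -phi * (J_pinv @ e)       # velocities in the robot frame
```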

2.2 Computation of the Jacobian Matrix

Figure 1 describes a coordinate transformation model for visual features. In this model, the coordinate of the origin \(O_{1}\left( u_{o}, v_{o}\right) \) on the image plane is mapped to the coordinate of the origin \(O_{2}\) on the camera plane.

Fig. 1.

Coordinate transformation model for visual features.

Let the pixel coordinate of the point \(\mathbf {a}\) on the camera plane be \(\mathbf {a}(u, v)\), and let \(\mathbf {A}\) be the corresponding 3D point in the Cartesian coordinate system; the value of the pixel for the point \(\mathbf {a}\) on the image plane is \(\mathbf {d}\left( u^{\prime }, v^{\prime }\right) \). Taking the points \(\mathbf {a}\) and \(\mathbf {A}\) as an example, Eq. (5) gives the transformation from a point on the camera plane to a 3D point in the Cartesian coordinate system.

$$\begin{aligned} \left\{ \begin{array}{l} X_{a}=Z_{a} \vartheta _{x}\left( u-u_{o}\right) / f \\ Y_{a}=Z_{a} \vartheta _{y}\left( v-v_{o}\right) / f \end{array}\right. , \end{aligned}$$
(5)

where f is the focal length of the RGB-D camera. The scaling constants from the image plane to the camera plane in the x-axis and y-axis directions are represented by \(\vartheta _{x}\) and \(\vartheta _{y}\). For the \(j\)-th feature point \(\left( u_{j}, v_{j}\right) \), the Jacobian matrix is given by Eq. (6).

$$\begin{aligned} \mathbf {J}_{j}=\left[ \begin{array}{ccc} -\left( f / Z_{a} \vartheta _{x}\right) &{} 0 &{} v_{j}\left( \vartheta _{y} / \vartheta _{x}\right) \\ 0 &{} -\left( f / Z_{a} \vartheta _{y}\right) &{} u_{j}\left( \vartheta _{x} / \vartheta _{y}\right) \end{array}\right] . \end{aligned}$$
(6)

For the N feature points, the complete Jacobian matrix is \(\mathbf {J}=[\mathbf {J}_{1},\mathbf {J}_{2},\ldots ,\mathbf {J}_{N}]^{\mathrm {T}} \in \mathfrak {R}^{2 N \times 3}\).
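The stacked Jacobian can be assembled per feature point as in the sketch below. This is an illustrative implementation of Eq. (6), assuming the per-point depth `Z` and the intrinsics (`f`, and `theta_x`, `theta_y` standing in for \(\vartheta_x, \vartheta_y\)) are provided by the RGB-D camera.

```python
import numpy as np

def point_jacobian(u_j, v_j, Z, f, theta_x, theta_y):
    """Per-feature block J_j of Eq. (6), shape (2, 3)."""
    return np.array([
        [-f / (Z * theta_x), 0.0,                v_j * (theta_y / theta_x)],
        [0.0,               -f / (Z * theta_y),  u_j * (theta_x / theta_y)],
    ])

def full_jacobian(points, depths, f, theta_x, theta_y):
    """Stack the N per-feature blocks into J of shape (2N, 3)."""
    blocks = [point_jacobian(u, v, Z, f, theta_x, theta_y)
              for (u, v), Z in zip(points, depths)]
    return np.vstack(blocks)
```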

3 A Discounted Sampling Policy Gradient with a Heuristic Strategy for Multi-objective Visual Control

3.1 Reinforcement Learning

An RL agent selects an action \(a_{t}\) using a policy \(\pi (s, a)\) when it receives a state \(s_{t}\). The agent then obtains a reward \(r_{t}\) from the environment and moves to the next state \(s_{t+1}\). This process is repeated until the RL agent reaches the target position. The action policy \(\pi \) is constantly updated as the agent explores. The RL method maximizes the expectation of the cumulative reward \(R\left( s_{t}\right) =r_{t}+\gamma R\left( s_{t+1}\right) =\sum _{i=0}^{\infty } \gamma ^{i} r_{t+i}\) for each state, where \(\gamma \) is the discount factor. The value function represents the expectation of the cumulative reward; for the state \(s_{t}\) it is given by,

$$\begin{aligned} \begin{array}{l} V\left( s_{t}\right) =\mathrm {E}\left[ r_{t+1}+\gamma r_{t+2}+\gamma ^{2} r_{t+3}+\ldots \mid s=s_{t}\right] \\ =\sum _{a} \pi (s, a) \sum _{s_{t+1}} P_{s_{t+1}}^{a}\left( r_{t}+\gamma V\left( s_{t+1}\right) \right) \end{array}. \end{aligned}$$
(7)

The Q-learning method uses the action-value function to represent the expectation of the cumulative reward. The action-value function for one-step prediction in RL problems can be written as,

$$\begin{aligned} Q(s, a)=\mathrm {E}\left[ \sum _{k=0}^{\infty } \gamma ^{k} r_{t+k+1} \mid s_{t}=s, a_{t}=a\right] \!. \end{aligned}$$
(8)
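As a small illustrative helper (not from the paper), the cumulative discounted return \(R(s_t)=\sum_i \gamma^i r_{t+i}\) that both value functions take expectations of can be computed by the backward recursion stated above:

```python
def discounted_return(rewards, gamma):
    """Backward recursion R(s_t) = r_t + gamma * R(s_{t+1}) over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# discounted_return([1.0, 0.0, 2.0], gamma=0.95) -> [2.805, 1.9, 2.0]
```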

For RL problems, the policy gradient optimizes the policy with a gradient in the policy space instead of relying on the value function. This method is practical for tasks with stochastic action spaces [15]. DSPG algorithms are often used to address action selection in continuous or high-dimensional spaces for RL tasks in unstructured environments. The agent moves to the next state and receives a reward from the environment, and sampling discounted rewards over a fixed interval allows the value function to be estimated, yielding a policy that is optimal or closer to optimal.

3.2 Discounted Sampling Policy Gradient with a Cosine Annealing

Similar to the Deep Q-Network (DQN) method, the DSPG method uses a nonlinear neural network as the function approximator. The DSPG method uses low-dimensional observations, such as joint angles or pixels, to learn competitive policies in many cases.

The goal of the DSPG method is to maximize the long-term discounted return with respect to the weights \(\theta \), i.e., \(J(\theta )=\mathrm {E}_{\tau \sim p_{\theta }(\tau )}\left[ \sum _{t} r\left( s_{t}, a_{t}\right) \right] \). The optimal parameter setting \(\theta ^{*}\) lets the robot follow an optimal behavior trajectory \(\tau \) that obtains the maximum long-term return \(r(\tau )=\max \sum _{t} r\left( s_{t}, a_{t}\right) \). A better trajectory means the robot makes wiser decisions when selecting the time-varying matrix \(\mathbf {J}\). \(s_{0}\) is the initial state. The probability of a trajectory under a differentiable distribution function \(p_{\theta }(\tau )\) is shown in Eq. (9).

$$\begin{aligned} p_{\theta }(\tau )=p\left( s_{0}\right) \prod _{t=0}^{T-1} p\left( s_{t+1} \mid s_{t}, a_{t}\right) \pi _{\theta }\left( a_{t} \mid s_{t}\right) , \end{aligned}$$
(9)

where \(\pi _{\theta }\left( a_{t} \mid s_{t}\right) \) is the parameterized policy and \(p\left( s_{t+1} \mid s_{t}, a_{t}\right) \) is the state transition probability. The gradient for the objective function is given by,

$$\begin{aligned} \begin{array}{l} \nabla _{\theta } J(\theta )=\int \nabla _{\theta } p_{\theta }(\tau ) r(\tau ) d \tau \\ =\mathrm {E}_{\tau \sim p_{\theta }(\tau )}\left[ \left( \sum _{t} \nabla _{\theta } \log \left( \pi _{\theta }\left( a_{t} \mid s_{t}\right) \right) \right) \left( \sum _{t} r\left( s_{t}, a_{t}\right) \right) \right] \\ \approx \frac{1}{N} \sum _{j=1}^{N} \sum _{t}\left[ \nabla _{\theta } \log \left( \pi _{\theta }\left( a_{t, j} \mid s_{t, j}\right) \right) \left( \sum _{t^{\prime }=t} r\left( s_{t^{\prime }, j}, a_{t^{\prime }, j}\right) \right) \right] , \end{array} \end{aligned}$$
(10)

where N trajectories are sampled to evaluate the objective function, and the sampled returns are used to approximate the value function. Within an episode, the effect of the current reward on the current policy decreases as the time step increases, so a discounted return sampling method for the expectation is proposed. The objective function with discounted return sampling is given by,

$$\begin{aligned} \begin{array}{l} \nabla _{\theta } J(\theta )=\int \nabla _{\theta } p_{\theta }(\tau ) r(\tau ) d \tau \\ \approx \frac{1}{N} \sum _{j=1}^{N} \sum _{t}\left[ \nabla _{\theta } \log \left( \pi _{\theta }\left( a_{t, j} \mid s_{t, j}\right) \right) \left( \sum _{t^{\prime }=t} \gamma ^{t^{\prime }-t} r\left( s_{t^{\prime }, j}, a_{t^{\prime }, j}\right) \right) \right] \end{array}, \end{aligned}$$
(11)

where \(\gamma \) is the discount factor. \(\eta _{t}\) is the learning rate. The updating law for the policy network is given by,

$$\begin{aligned} \theta _{t+1} \leftarrow \theta _{t}+\eta _{t} \nabla _{\theta _{t}} J\left( \theta _{t}\right) . \end{aligned}$$
(12)
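The following numpy sketch illustrates the Monte-Carlo estimate of Eq. (11) and the update of Eq. (12). It is not the authors' implementation: it assumes the caller supplies a function `grad_log_pi(theta, s, a)` returning \(\nabla_\theta \log \pi_\theta(a\mid s)\) for whatever policy parameterization is used.

```python
import numpy as np

def dspg_gradient(trajectories, grad_log_pi, theta, gamma):
    """Monte-Carlo estimate of the gradient in Eq. (11).

    trajectories : list of episodes, each a list of (s_t, a_t, r_t) tuples
    grad_log_pi  : callable (theta, s, a) -> grad of log pi_theta(a | s), same shape as theta
    """
    grad = np.zeros_like(theta)
    for episode in trajectories:
        rewards = [r for (_, _, r) in episode]
        for t, (s, a, _) in enumerate(episode):
            # discounted return-to-go from time step t
            G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
            grad = grad + grad_log_pi(theta, s, a) * G_t
    return grad / len(trajectories)

def policy_update(theta, grad_J, eta):
    """Gradient-ascent step of Eq. (12)."""
    return theta + eta * grad_J
```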

Hyperparameter tuning strategies evaluate the performance of each configuration on the learning task, which automates parameter testing during the training of a machine learning model. However, evaluating a single configuration is time-consuming, and the repeated evaluations required during tuning become very inefficient for complex models. Some heuristic methods, such as evolutionary algorithms, are difficult to apply to hyperparameter tuning. Previous works often leverage Simulated Annealing (SA) to optimize hyperparameters [16]; as a heuristic method, SA simulates the annealing process in thermodynamics. A rule of the agent's learning process should be satisfied: more current experience should be retained in the early stage of learning, and conversely, more prior experience should be preserved later on. A Cosine Annealing scheme is therefore employed to tune the learning rate of the RL controller. The updating law with Cosine Annealing is given by,

$$\begin{aligned} \eta _{t} \leftarrow \eta _{\min }+\frac{1}{2}\left( 1+\cos \left( \frac{T_{t}}{T_{\max }} \pi \right) \right) \left( \eta _{\max }-\eta _{\min }\right) , \end{aligned}$$
(13)

where \(\eta _{\max }\) and \(\eta _{\min }\) are the constant maximum and minimum learning rates respectively, \(T_{t}\) is the current epoch number, and \(T_{\max }\) is the maximum number of epochs. An exploration noise \(\mathrm {N}\) sampled from a noise process is added to the policy network output to form an exploratory action policy, given by,

$$\begin{aligned} a_{t}=\pi _{\theta }\left( s_{t}\right) +\mathrm {N}. \end{aligned}$$
(14)
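Below is a short sketch of the cosine-annealed learning rate of Eq. (13) and the noisy action of Eq. (14). The Gaussian noise process is an illustrative assumption; the paper only states that \(\mathrm{N}\) is sampled from a noise process.

```python
import numpy as np

def cosine_annealing(T_t, T_max, eta_min, eta_max):
    """Learning-rate schedule of Eq. (13)."""
    return eta_min + 0.5 * (1.0 + np.cos(np.pi * T_t / T_max)) * (eta_max - eta_min)

def exploratory_action(policy, state, noise_std=0.1, rng=None):
    """Eq. (14): policy output plus exploration noise N (Gaussian assumed here)."""
    rng = rng or np.random.default_rng()
    return policy(state) + rng.normal(0.0, noise_std)
```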

3.3 An Adaptive Multi-objective Visual Control Model with DSPG (VBC-DSPG)

The DSPG method employs the policy network to select a time-varying matrix and uses sampled discounted returns to estimate the value of the action function. An RGB-D camera extracts image features using contour recognition methods, from which the current observation \(s_{t}\) is calculated. The observation consists of a 4-dimensional vector defined by Eq. (15).

$$\begin{aligned} \mathbf {S}_{t}=\left( \mathbf {e}_{x}, d \mathbf {e}_{x} / d t, \mathbf {e}_{y}, d \mathbf {e}_{y} / d t\right) , \end{aligned}$$
(15)

where \(\mathbf {e}_{x}\) and \(\mathbf {e}_{y}\) represent the feature-error vectors of the current positions with respect to the desired positions in the x and y directions of the image plane, respectively. The policy network receives the current observation vector as input and outputs a distribution \(\pi _{\theta }\left( a_{t} \mid s_{t}\right) \) over the action space. Because the control parameter x ranges from 0 to 1, this work defines a mapping function to limit the output action to that interval; Eq. (16) gives the mapping function.

$$\begin{aligned} x=f\left( a_{t}\right) =\frac{1}{\pi } \arctan \left( a_{t}\right) +0.5. \end{aligned}$$
(16)

As illustrated in Fig. 2, the DSPG agent consists of a feed-forward neural network with an input layer, an output layer, and hidden layers of 20 and 40 units. In the hidden layers, the Sigmoid function is used as the activation function.

Fig. 2.

Structure of the proposed DSPG model.
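The sketch below mirrors the described network (4-dimensional observation, hidden layers of 20 and 40 sigmoid units, scalar action) together with the arctan mapping of Eq. (16); the weight initialization and forward-pass details are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_policy(seed=0):
    """Feed-forward policy network of Fig. 2: 4 -> 20 -> 40 -> 1."""
    rng = np.random.default_rng(seed)
    sizes = [4, 20, 40, 1]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def policy_forward(params, s_t):
    """Map the observation of Eq. (15) to a raw action a_t."""
    h = np.asarray(s_t, dtype=float)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:     # Sigmoid activations in the hidden layers
            h = sigmoid(h)
    return float(h[0])

def action_to_parameter(a_t):
    """Eq. (16): squash the raw action into the control parameter x in (0, 1)."""
    return np.arctan(a_t) / np.pi + 0.5
```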

In the VBC-DSPG method, the shaping reinforcement signal \(r_{t}\) in Eq. (17) is used to update the policy.

$$\begin{aligned} r_{t}=-\frac{R}{2}\left( \sum _{i=1}^{N} \left[ \sqrt{\left( e_{x}^{i}\right) ^{2}+\left( e_{y}^{i}\right) ^{2}}+\sqrt{\left( d e_{x}^{i} / d t\right) ^{2}+\left( d e_{y}^{i} / d t\right) ^{2}}\right] \right) /\left( N \sqrt{\mathrm {row}^{2}+\mathrm {col}^{2}}\right) , \end{aligned}$$
(17)

where the height and width of the image plane are represented by \(\mathrm {row}\) and \(\mathrm {col}\) respectively, N is the number of feature points, and R is the reward the environment gives to the RL agent depending on the actual situation. The robot receives a higher reward when it is closer to the target.

The main motivation for this reinforcement signal is to drive the robot to reach the target as soon as possible: the reward in Eq. (17) penalizes the learning agent when the feature errors are large. The weights of the policy network are updated so that the robot adaptively selects a time-varying matrix for multi-objective visual control.
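A minimal sketch of the shaping reward of Eq. (17); `row` and `col` are the image height and width in pixels, and each error array is assumed to hold one entry per feature point.

```python
import numpy as np

def shaping_reward(e_x, e_y, de_x, de_y, R, row, col):
    """Shaping reward of Eq. (17)."""
    e_x, e_y = np.asarray(e_x), np.asarray(e_y)
    de_x, de_y = np.asarray(de_x), np.asarray(de_y)
    N = len(e_x)
    position_term = np.sqrt(e_x ** 2 + e_y ** 2)    # feature-error magnitudes
    velocity_term = np.sqrt(de_x ** 2 + de_y ** 2)  # feature-error rate magnitudes
    total = np.sum(position_term + velocity_term)
    return -0.5 * R * total / (N * np.sqrt(row ** 2 + col ** 2))
```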

4 Experiments

4.1 Experimental Environment

As shown in Fig. 3, we construct a simulation model with the same dynamics as the real device using the commercial robotic software Webots 7.0.3. In the simulation, the robot learns a policy using the DSPG algorithm, and when the policy converges, it is further optimized using Cosine Annealing. Noise and disturbances are added to the simulation environment to create a complex task environment close to the real world. Experiments on a mobile robot are then performed to assess the practicality of the proposed method. Finally, the proposed method and the conventional pseudo-inverse kinematics methods are used to select the matrix, and the experimental results are recorded to test whether the proposed method performs better.

Fig. 3.

A visual control task in the simulation environment.

4.2 Training for RL

An episodic RL setting is used to train the RL agent. The control task requires the mobile robot to move from an initial position to a target position, as shown in Fig. 3. The visual control process of the robot should satisfy multiple objectives: little fluctuation, short convergence time, and no loss of the target. In each episode, the initial position of the RL agent is randomly selected, and the desired position is (0.463 m, 0.593 m, 37.8\(^{\circ }\)). In each time step, the DSPG agent acts with an added exploration noise. The gain \(\varphi \) is set to 0.35. The hyperparameters for the DSPG agent are as follows: the initial learning rate \(\eta \) is 0.5, the discount factor \(\gamma \) is 0.95, the reward R is 10, and the maximum number of epochs \(T_{\max }\) is 1000.

The rules for the RL agent are as follows. 1) The episode is terminated if the robot loses some features. 2) The episode is terminated if the feature errors remain within a certain pixel threshold for a certain time. 3) The episode is terminated if the robot does not reach the desired position after the maximum number of steps. 4) After a new action is selected by the RL agent, the matrix is updated using Eq. (3). 5) The robot returns to the starting position when a new episode starts.

When an episode terminates negatively, a reward of −10 is given to the RL agent; otherwise, the RL agent is rewarded by the reward function. With this configuration, the agent learns a stable motion behavior after a considerable number of training episodes. Since the RL agent is given a negative reward before reaching the desired position, the cumulative reward per episode is negative. After training, the agent has learned a policy mapping the state space to the action space. The sketch below shows how the termination rules and rewards fit into one training episode.
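This pseudocode-style sketch reuses the helpers sketched earlier (`exploratory_action`, `policy_forward`, `action_to_parameter`). The environment interface (`env.reset`, `env.step`, and its termination flags) is hypothetical and stands in for the Webots simulation; `max_steps` stands in for the unspecified maximum number of steps.

```python
def run_episode(env, params, noise_std=0.1, max_steps=500):
    """One training episode following rules 1)-5); returns the sampled trajectory."""
    s_t = env.reset()                                  # rule 5): back to the start position
    trajectory = []
    for _ in range(max_steps):                         # rule 3): step budget
        a_t = exploratory_action(lambda s: policy_forward(params, s), s_t,
                                 noise_std=noise_std)
        x = action_to_parameter(a_t)                   # rule 4): update the matrix via Eq. (3)
        s_next, r_t, lost_features, stalled, arrived = env.step(x)
        if lost_features or stalled:                   # rules 1) and 2): negative termination
            trajectory.append((s_t, a_t, -10.0))
            break
        trajectory.append((s_t, a_t, r_t))
        if arrived:
            break
        s_t = s_next
    return trajectory
```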

Fig. 4.

Experimental results in terms of feature errors, feature trajectories, and velocities. For each item, from left to right and top to bottom: A-JM, C-JM, D-JM, and the proposed method.

4.3 Simulation

In the simulation, to verify the effectiveness of the proposed method, the proposed VBC-DSPG method and three competitors were run on the platform. The initial position is (0.0289 m, 0.062 m, 0.07\(^{\circ })\) and the desired position is (0.463 m, 0.593 m, 37.8\(^{\circ })\). The three competitors are the Jacobian matrix method using the current matrix (C-JM) [1], the method using the desired matrix (D-JM) [1], and the method using the average matrix (A-JM) [4]. The four methods were each tested 50 times, and the averages over these 50 runs are shown in Fig. 4.

Experimental results show that the C-JM method converges in around 129 control cycles, the D-JM method in around 115, and the A-JM method in around 111, while the proposed method converges in around 92 control cycles. Although the three conventional competitors reach the desired position, their convergence rates are slower than that of the proposed method. Different methods select different Jacobian matrices, which results in different behaviors. As shown in Fig. 4, at the beginning of the control process the proposed method chooses a larger velocity, which drives the robot to reach the target position faster.

However, an excessive velocity at the beginning may lead to instability, difficulty in control, and even loss of the target, as observed with the C-JM method. Figure 4 also shows the feature trajectories of the four methods. The conventional methods drive one dimension to the desired position while the others lag behind, so there are many fluctuations in the feature trajectories. The proposed method enables the robot to learn a policy that achieves better multi-objective control performance.

5 Conclusion

This paper developed a new multi-objective visual control method that integrates a discounted sampling policy gradient with a heuristic strategy. A DSPG-based scheme is proposed to handle the multi-objective visual control challenges of mobile robots by adaptively selecting an appropriate Jacobian matrix. The policy network is obtained through a training process and then automatically computes a time-varying matrix to achieve robust control performance. The proposed method has been extensively compared with three conventional methods.