
1 Introduction

In recent years, soft robots have received extensive attention from researchers due to their inherent compliance, environmental adaptability, low inertia, and safe human-machine interaction [1]. Inspired by nature and composed of low-modulus materials, soft robots can deform continuously like flexible structures in biological systems [2]. These features give soft robots clear advantages and promising prospects in applications such as flexible grasping, surgery, rehabilitation, and bionic locomotion [3, 4].

Conventional rigid robotic arms have been extensively employed in tasks such as grasping, assembling and handling. However, their limited DOF and the possibility of harming humans restrict them from working in unstructured environments or human-machine interaction scenarios [5]. Compared with rigid robotic arms, soft robotic arms have the advantages of light weight, flexibility and safety, so they can be used in unstructured environments or human-machine interaction scenarios and perform well. With the development of soft robotic arms, a variety of actuation methods have been applied, such as hydraulic actuation [6], shape memory alloy actuation [7], pneumatic artificial muscle actuation [8], and cable-driven actuation [9]. Among these actuation techniques, hydraulic actuation is widely applied and has been extensively studied due to its conformability [10]. However, modeling and control of hydraulic soft robotic arms are challenging because of the strong nonlinearity between hydraulic pressure and elastic deformation [11].

Researchers have devoted much effort to the motion control of soft robotic arms using both model-based and model-free methods [12]. The premise of model-based control approaches is a mathematical model of the controlled object; an accurate model, or a reasonably simplified one, is the prerequisite for good control performance. Xie et al. develop a kinematic model of a soft robotic arm using the piecewise constant-curvature (PCC) assumption to predict its tip position [5]. Ohta et al. develop a kinematic model of a robotic arm using DH parameters and present simulation and experimental results for closed-loop position control based on the kinematic model [13]. Yang et al. build a direct kinematic model from sensor data to deformation and an inverse kinematic model used to calculate the actuation of SMA coils based on a given planned deformation [14]. To achieve more precise control performance, some studies focus on dynamic models and have made great progress. Renda et al. develop a dynamic model of a soft continuum robotic arm using a rigorous geometrically exact method [15]. Tutcu et al. combine a kinematic model with a quasi-static equilibrium solution for more accurate modeling of the end effector of a soft continuum robot [16]. In addition, some novel methods have been derived; for example, Chen et al. use the force balance of the end plate to build the model [17], and Tang et al. propose a model-based online learning and adaptive control algorithm for a wearable soft robot [18].

For multi-segment hydraulic soft robotic arms, model-based control can hardly achieve real-time, high-accuracy performance without additional restrictions because of the complexity and imprecision of the mathematical model, whereas model-free methods offer the possibility of good control performance. Li et al. use an adaptive Kalman filter to achieve path tracking for a continuum robot [19]. Melingui et al. develop two controllers based on a distal supervised learning scheme and an adaptive neural network to control the CBHA's kinematics and dynamics [20]. With the development of machine learning techniques, reinforcement learning has been widely used in robotic control [21], and model-free reinforcement learning has clear advantages in soft robotic arm control tasks. Ma et al. propose a reinforcement learning method based on the Deep Deterministic Policy Gradient (DDPG) algorithm to solve the position control problem [22]. Shahid et al. develop a control policy parameterized by a neural network and learned with the Proximal Policy Optimization (PPO) algorithm [23]. Satheeshbabu et al. present an open-loop position control policy based on deep reinforcement learning using Deep Q-Learning with experience replay [24]. Although some effort has been devoted to controlling soft robotic arms with reinforcement learning, most existing studies focus on planar motions, or on spatial motions with fewer control inputs in limited environments or small action spaces.

In this paper, we investigate the motion control of a double-segment hydraulic soft robotic arm with six control inputs and a large state-action space. To achieve open-loop motion control, a model-free control policy based on deep reinforcement learning (RL) is proposed using the Deep Deterministic Policy Gradient (DDPG) algorithm. The kinematic model of the soft robotic arm [5] is employed instead of the physical prototype to train the control policy. A complete training framework is established with the Reinforcement Learning Toolbox and Deep Learning Toolbox in Matlab. To make the control policy converge quickly and avoid local optima, the reward is shaped by combining the position error and the action. A control policy with good performance is obtained via parameter optimization and reward function optimization. A series of simulations is implemented to evaluate the control policy, and its effectiveness and good tracking performance are verified.

The remainder of this paper is structured as follows. Section 2 describes the architecture of the system. Section 3 introduces the training framework and configurations and presents the simulation results. Section 4 presents conclusions and future work.

2 System Description

2.1 Hydraulic Soft Robotic Arm

The studied hydraulic soft robotic arm is shown in Fig. 1. It is made entirely of soft materials, and each segment is composed of an elastic cylinder, two connectors, and three chambers with double-helical fiber reinforcement. Table 1 gives the key parameters of the arm. Each segment of the soft robotic arm is independent and can be quickly assembled and disassembled through the connector. In the current design, the arm can be extended to three segments, but considering the overall length of the arm and the existing experimental conditions, a two-segment arm is employed in this work.

Fig. 1. (a) Two-segment hydraulic soft robotic arm prototype. (b) Schematic of a one-segment hydraulic soft robotic arm.

Table 1. Key parameters of the hydraulic soft robotic arm.

2.2 Markov Decision Process Modeling

A Markov Decision Process (MDP) formally defines the reinforcement learning problem; applying reinforcement learning to robots requires the task to be abstracted and represented as an MDP. An MDP is based on the interaction between an agent and an environment, and its elements include state, action and reward. The motion control task is modeled as a continuous-state, continuous-action MDP. Assuming the simplest form of representation, the RL-based motion control task of the hydraulic soft robotic arm is abstracted as follows:

State(s): The state is the condition of the agent as described by the environment. In the soft robotic arm motion control task, the state consists of two parts: the current state of the soft robotic arm and the action at the previous time step. More specifically, the current state of the soft robotic arm is the error between the arm's tip position and the target position along each coordinate axis.

Action(s): The action space is the collection of actions the agent can take. Agents based on DDPG can output continuous actions. Since it is difficult to establish an accurate dynamic model from the chamber pressures to the tip position of the soft robotic arm, the forward kinematic model from the chamber lengths to the tip position is used to train the agent in simulation. Therefore, the actions the soft robotic arm can take are the increments of the chamber lengths; the upper and lower limits of each increment are +1 mm and -1 mm, respectively. According to the maximum pressure of the chambers, the upper bound of the chamber length is 200 mm. This setting allows the soft robotic arm to reach the target position smoothly and quickly, as sketched below.
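As an illustration of this action handling, the following minimal Python sketch (not from the paper; the function and variable names are illustrative) applies a policy output in [-1, 1] as per-chamber length increments and enforces the 200 mm upper bound:

```python
import numpy as np

MAX_INCREMENT_MM = 1.0  # per-step chamber length change limit (+/-1 mm)
MAX_LENGTH_MM = 200.0   # upper bound on chamber length, set by the maximum chamber pressure

def apply_action(chamber_lengths_mm: np.ndarray, raw_action: np.ndarray) -> np.ndarray:
    """Interpret a raw policy output in [-1, 1] as bounded per-chamber length increments.

    `chamber_lengths_mm` holds the six chamber lengths of the two-segment arm;
    a lower length bound is not specified in the text and is therefore not enforced here.
    """
    increments_mm = np.clip(raw_action, -1.0, 1.0) * MAX_INCREMENT_MM
    return np.minimum(chamber_lengths_mm + increments_mm, MAX_LENGTH_MM)

# Example: six chambers at 150 mm and an arbitrary policy output.
lengths = np.full(6, 150.0)
print(apply_action(lengths, np.array([0.5, -1.0, 0.2, 0.9, -0.3, 1.0])))
```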

Reward(s): The reward is a quantitative indicator used to judge each action of the agent and guide the robot to complete the task. In our task, in order to make the robotic arm move to the target position quickly and stably, the Euclidean distance between the arm's tip position and the target position and the action at the previous time step are used as the basis for the reward. Actions that move the manipulator away from the target or reduce the stability of the robot incur larger penalties; conversely, actions that bring the robot closer to the target and toward stability are rewarded. This speeds up the training process of the policy and contributes to the steady-state performance of the soft robotic arm. The reward structure is as follows:

$$\begin{aligned} r = \left\{ \begin{array}{ll} -0.001\,err_d - 0.05\sum {| a_i |} - 0.0003\left( | err_x | + | err_y | + | err_z | \right) , & err_d > \varepsilon \\ 500 - 0.05\sum {| a_i |}, & err_d \le \varepsilon \end{array} \right. \end{aligned}$$
(1)

where \(\varepsilon = 5\,\text {mm}\) is the target threshold, \(err_d\) is the Euclidean distance between the arm's tip position and the target position, and \(err_x\), \(err_y\) and \(err_z\) are the errors between the tip position of the arm and the target position along each coordinate axis. When the agent reaches the target within \(\varepsilon \), the training episode is done. The reward penalizes actions that are not conducive to completing the task and drives the soft robotic arm to reach the target position along the shortest path.
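For concreteness, Eq. (1) can be transcribed directly into a short Python function; the function signature and names below are illustrative rather than part of the paper's implementation:

```python
import numpy as np

EPSILON_MM = 5.0  # target threshold epsilon in Eq. (1)

def reward(err_xyz: np.ndarray, prev_action: np.ndarray) -> float:
    """Reward of Eq. (1).

    err_xyz:     per-axis error between the arm tip and the target position (mm).
    prev_action: the action at the previous time step (per-chamber length increments).
    """
    err_d = float(np.linalg.norm(err_xyz))               # Euclidean tip-to-target distance
    action_penalty = 0.05 * float(np.sum(np.abs(prev_action)))
    if err_d > EPSILON_MM:
        return -0.001 * err_d - action_penalty - 0.0003 * float(np.sum(np.abs(err_xyz)))
    return 500.0 - action_penalty                         # large bonus once the target is reached

# Example evaluation with arbitrary values.
print(reward(np.array([10.0, -5.0, 2.0]), np.array([0.3, -0.2, 0.0, 0.5, -1.0, 0.1])))
```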

2.3 Deep Deterministic Policy Gradient Framework

DDPG is a model-free reinforcement learning method that extends to continuous action control [25]. We use an actor-critic framework in DDPG to make the policy stable. Deep neural networks are used to approximate the optimal policy function \(\mu \) and the Q function, namely the policy network and the Q network, and deep learning methods are used to train these networks. DDPG learns the Q network while learning the policy network. The implementation and training of the Q function follow the DQN approach [26]. The value iteration update of the Q function follows the Bellman equation and is defined as:

$$\begin{aligned} Q_{t+1}^\mu (s_t,a_t) = Q_t^\mu (s_t,a_t) + \alpha \left( r_t + \gamma \mathop {\max }\limits _a Q_t^\mu (s_{t + 1},a) - Q_t^\mu (s_t,a_t) \right) \end{aligned}$$
(2)

where \(s_t\) is the state at time step t, \(a_t\) is the action at time step t, \(s_{t+1}\) is the state after taking action \(a_t\), \(r_t\) is the reward for \(a_t\), \(\alpha \) is the learning rate, and \(\gamma \) is the discount factor.
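As a simple illustration of the update rule in Eq. (2), the sketch below applies one temporal-difference step to a small tabular Q function; DDPG itself replaces the table with the critic network, so this is only meant to make the update concrete:

```python
import numpy as np

def q_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
             alpha: float = 0.1, gamma: float = 0.99) -> None:
    """One value-iteration step following Eq. (2)."""
    td_target = r + gamma * np.max(Q[s_next])   # r_t + gamma * max_a Q(s_{t+1}, a)
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s_t, a_t) toward the target

# Toy example with 3 states and 2 actions.
Q = np.zeros((3, 2))
q_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)
```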

In training over continuous action spaces, exploration is important for finding potentially better policies, so random noise is added to the action to turn the deterministic policy into a stochastic one; the executed action is then sampled from this stochastic process and sent to the environment. The resulting policy is called the behavior policy and is denoted by \(\beta \). An Ornstein-Uhlenbeck process is used to generate the random noise, as shown in Eq. 3.

$$\begin{aligned} \mathrm {d}n = \varPhi (\eta - n)\,\mathrm {d}t + \sigma \,\mathrm {d}W \end{aligned}$$
(3)

where \(\eta \) is the mean, \(\varPhi \) is the decay rate, \(\sigma \) is the variance, and W is a Wiener process. Training the policy network amounts to finding the optimal policy network parameters, and stochastic gradient ascent is used to train the network. Eq. 4 is used to judge the performance of a policy, and the optimal policy is defined by Eq. 5. The whole algorithm framework is shown in Algorithm 1.

$$\begin{aligned} \begin{aligned} {J_{_\beta }}\left( \mu \right)&= \int _S {{\rho ^\beta }} \left( s \right) {Q^\mu }\left( {s,a} \right) ds\\&= {E_{s \sim {\rho ^\beta }}}\left[ {{Q^\mu }\left( {s,a} \right) } \right] \end{aligned} \end{aligned}$$
(4)

where s is the state and \(\rho ^\beta \) is the distribution of states visited under the behavior policy.

$$\begin{aligned} \mu = \mathop {\arg \max }\limits _\mu J(\mu ) \end{aligned}$$
(5)
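In practice, maximizing \(J_{\beta }(\mu )\) by stochastic gradient ascent is typically implemented by minimizing the negative expected Q value over a sampled minibatch, alongside a critic regression toward the temporal-difference target. The PyTorch-style sketch below illustrates one such update step under assumed dimensions (state and action sizes follow Sect. 2.2); it omits the target networks that DDPG uses for the critic target and is not the paper's Matlab implementation:

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 9, 6, 0.99  # state/action sizes follow the MDP of Sect. 2.2

# Small stand-in networks; DDPG additionally keeps target copies of both, omitted here.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# A random minibatch stands in for samples drawn from the experience replay buffer.
s, a = torch.randn(256, state_dim), torch.randn(256, action_dim)
r, s_next = torch.randn(256, 1), torch.randn(256, state_dim)

# Critic: regress Q(s, a) toward the temporal-difference target r + gamma * Q(s', mu(s')).
with torch.no_grad():
    y = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor: gradient ascent on J(mu) = E[Q(s, mu(s))], implemented as descent on its negative.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```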
Algorithm 1. DDPG training framework for the motion control policy.
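Within a training loop such as Algorithm 1, the exploration noise of Eq. (3) can be realized by a discretized Ornstein-Uhlenbeck process. The following Python sketch is illustrative only; the noise parameters are assumptions rather than the values used in training:

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process: dn = Phi * (eta - n) * dt + sigma * dW."""

    def __init__(self, dim: int, eta: float = 0.0, phi: float = 0.15,
                 sigma: float = 0.2, dt: float = 0.01, seed: int = 0):
        self.eta, self.phi, self.sigma, self.dt = eta, phi, sigma, dt
        self.n = np.zeros(dim)
        self.rng = np.random.default_rng(seed)

    def sample(self) -> np.ndarray:
        # Euler-Maruyama step; the Wiener increment has standard deviation sqrt(dt).
        dn = self.phi * (self.eta - self.n) * self.dt \
            + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.n.shape)
        self.n += dn
        return self.n.copy()

# Behavior policy beta: deterministic actor output plus correlated noise, clipped to [-1, 1].
noise = OUNoise(dim=6)
deterministic_action = np.zeros(6)  # stand-in for the actor output mu(s)
behavior_action = np.clip(deterministic_action + noise.sample(), -1.0, 1.0)
print(behavior_action)
```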

3 Training and Simulations

3.1 Training Setup

A high-performance computer with a 10900X CPU and an RTX 2080Ti GPU is used to train the control policy and validate its effectiveness in simulations. The policy training framework is deployed in Matlab using the Reinforcement Learning Toolbox and Deep Learning Toolbox, and the policy is trained with a critic network and an actor network. The critic network has two hidden layers with 400 neurons each, and its learning rate is \(1e^{-3}\). The actor network has four hidden layers with 400 neurons each, and its learning rate is \(1e^{-4}\). The outputs of the actor network are bounded between \(-1\) and \(+1\) by a tanhLayer followed by a ScalingLayer. The other training parameters are set as follows: the maximum number of training episodes is 100000, the maximum number of steps per episode is 200, the discount factor is 0.99, the minibatch size is 256, the target smoothing factor is 0.001, and the experience replay buffer size is \(1e^8\). A simulation model is built based on the kinematic model of the soft robotic arm [5] and connected with the agent in Simulink.
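The networks in this work are built with Matlab's Reinforcement Learning Toolbox and Deep Learning Toolbox. Purely for illustration, an equivalent PyTorch sketch of the described actor (four hidden layers of 400 neurons, tanh output followed by scaling) and critic (two hidden layers of 400 neurons) might look as follows; the ReLU activations, the concatenated state-action critic input, and the state dimension are assumptions based on the MDP definition in Sect. 2.2:

```python
import torch
import torch.nn as nn

STATE_DIM = 9       # 3 per-axis tip errors + 6 previous chamber-length increments (Sect. 2.2)
ACTION_DIM = 6      # one length increment per chamber
ACTION_SCALE = 1.0  # scales the tanh output to the +/-1 mm increment bound

# Actor: four hidden layers of 400 neurons, tanh output followed by scaling,
# mirroring the tanhLayer + ScalingLayer described above (ReLU activations are an assumption).
actor = nn.Sequential(
    nn.Linear(STATE_DIM, 400), nn.ReLU(),
    nn.Linear(400, 400), nn.ReLU(),
    nn.Linear(400, 400), nn.ReLU(),
    nn.Linear(400, 400), nn.ReLU(),
    nn.Linear(400, ACTION_DIM), nn.Tanh(),
)

# Critic: two hidden layers of 400 neurons acting on the concatenated state and action.
critic = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 400), nn.ReLU(),
    nn.Linear(400, 400), nn.ReLU(),
    nn.Linear(400, 1),
)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # actor learning rate from the setup
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # critic learning rate from the setup

action = ACTION_SCALE * actor(torch.randn(1, STATE_DIM))
print(action.shape)  # torch.Size([1, 6])
```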

3.2 Position Control Results

Position control simulations and trajectory tracking simulations are implemented to validate the effectiveness and dynamic performance of the presented control policy. For position control, we select a series of points in the workspace of the soft robotic arm as target points to test the steady-state performance of the control policy. The results are shown in Fig. 2; the simulation step size is set to 0.01 s. The control policy transfers the soft robotic arm from the initial state to the target state within a few steps, and the steady-state error is kept within 3 mm. The simulation results reveal the effectiveness and stability of the control policy.

Fig. 2. (a) Position control with the target point (162, 72, 307). (b) Position control with the target point (–91, 147, 311).

3.3 Trajectory Tracking Results

The dynamic response of the control policy is the key factor determining the dynamic performance of the system. As shown in Fig. 3, we select several trajectories within the workspace of the soft robotic arm to verify the dynamic performance of the control policy. The soft robotic arm begins moving along the target trajectory at 50 s; the policy controls the soft robotic arm to quickly follow the target trajectory, and the dynamic error is kept within 5 mm during the whole movement. The simulation results demonstrate the rapid dynamic response of the control policy, and the soft robotic arm under this control policy has good tracking performance.
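As an example of how such a target trajectory can be generated for the simulations, the short Python sketch below linearly interpolates between the two endpoints of Fig. 3(a), starting at 50 s as described above; the trajectory duration and the total simulation time are assumptions:

```python
import numpy as np

def linear_trajectory(p_start, p_end, t, t_start=50.0, duration=50.0):
    """Target tip position at time t (s): hold p_start until t_start, then move linearly to p_end."""
    s = np.clip((t - t_start) / duration, 0.0, 1.0)
    return (1.0 - s) * np.asarray(p_start, dtype=float) + s * np.asarray(p_end, dtype=float)

# Endpoints of the trajectory in Fig. 3(a); the 0.01 s step matches Sect. 3.2, the duration is assumed.
times = np.arange(0.0, 120.0, 0.01)
targets = np.array([linear_trajectory((30, 80, 339), (100, 170, 310), t) for t in times])
print(targets[0], targets[-1])
```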

Fig. 3. (a) Trajectory tracking control with the target trajectory from the point (30, 80, 339) to the point (100, 170, 310). (b) Trajectory tracking control with the target trajectory from the point (90, 30, 340) to the point (170, 100, 300).

4 Conclusion and Future Work

Focusing on the motion control of a hydraulic soft robotic arm, this paper implements the kinematic model of the soft robotic arm in simulation and develops a model-free control policy based on deep reinforcement learning. The Reinforcement Learning Toolbox and Deep Learning Toolbox are used to deploy the policy training framework, and the Deep Deterministic Policy Gradient (DDPG) algorithm is used to train the policy. The simulation experiments show the effectiveness, robustness and good dynamic performance of the proposed control policy for motion control. Pending experimental verification, this work is a promising attempt at applying reinforcement learning to the motion control of a hydraulic soft robotic arm with highly nonlinear characteristics.

In future work, the proposed control policy will be further improved and optimized, and the policy will be deployed in the physical prototype control system.