1 Introduction

The ocean covers more than seventy percent of our planet; however, more than eighty percent of the ocean remains unobserved and unexplored. This uncharted part of our planet offers huge potential for industrial sectors as well as for disruptive, exploration-driven scientific discoveries. Soft robots are compliant, lightweight, and multifunctional, and offer good environmental adaptability and safety. Compared with existing rigid robots, soft robots have many advantages in a diverse range of underwater applications, such as manipulation in coral reefs, cleaning coastal and offshore pollutants, collecting marine biological samples, and monitoring underwater structures (Palli et al. 2017; Zhang et al. 2018a; Xie et al. 2020; Zhuo et al. 2020). However, developing agile, dexterous, and reliable underwater soft robots faces substantial challenges in structural design, actuation, modeling, and control. Soft manipulators have been applied to underwater grasping tasks in very recent studies owing to their outstanding environmental adaptability and safety (Teeples et al. 2018; Liu et al. 2020; Kurumaya et al. 2018; Mura et al. 2018; Xu et al. 2018). Soft manipulators generally exhibit strong nonlinearities (e.g., asymmetric hysteresis and creep) due to the characteristics of the materials used as actuators and structures (Hosovsky et al. 2016; Shiva et al. 2016; Stilli et al. 2017; Pawlowski et al. 2019; Thérien and Plante 2016). Furthermore, a soft manipulator mounted on a vehicle for underwater grasping tasks suffers from the effects of ocean currents, water pressure, load changes, and disturbances caused by the movement of the vehicle (Zhang et al. 2018b). Efficiently controlling a soft manipulator for underwater tasks therefore remains a meaningful and challenging problem.

In previous studies, the common control approaches for manipulators can be divided into model-based controllers and model-free controllers (Zhang et al. 2016). Model-based controllers are derived from physical or semi-physical models of manipulators (Best et al. 2016; Trivedi and Rahn 2014; Robinson et al. 2014; Li et al. 2018; Chen et al. 2020), and their control performance depends on the accuracy of the model. Compared with model-based controllers, model-free controllers require no model information about the soft manipulator but rely on control structures fed by accurate real-time feedback data (Vikas et al. 2015; George et al. 2018; Li et al. 2017; Jiang et al. 2020; Bruder et al. 2002, 2020).

Recently, researchers have started to apply machine learning methods to model-based controllers to improve the robustness of soft manipulators. For common dynamic control problems of soft manipulators, Thuruthel et al. proposed a model-based learning method for closed-loop predictive control of a soft robotic manipulator (George et al. 2019). The feedforward dynamic model was established via a recurrent neural network, and a closed-loop control policy was then derived by trajectory optimization and supervised learning based on the dynamic model. Fang et al. (2019) proposed a vision-based online learning kinematic controller for precise robotic tasks using local Gaussian process regression, which required neither a physical model of the manipulator nor camera parameters. To improve the position control accuracy of soft manipulators, Hofer et al. (2019) presented a norm-optimal iterative learning control algorithm for a soft robotic arm and applied it to adjust the output of a PID controller, improving the robustness of the manipulation system. To improve model-based control methods, which tolerate changes in the external environment poorly, Ho et al. (2018) used localized online learning-based control to update the inverse model of a redundant two-segment soft robot, allowing the system to adapt to unknown external disturbances. However, machine learning controllers applied to soft manipulators usually require an offline pre-training process, and the trained model cannot be updated online in practical scenarios. Furthermore, the training results are likely to get stuck at a locally optimal value (Liu et al. 2017). Therefore, an online training process is essential for underwater tasks of soft manipulators subject to time-varying water-current disturbances.

In our previous work, we integrated an opposite-bending-and-stretching structure (OBSS) soft manipulator onto a remotely operated vehicle (ROV) system (as shown in Fig. 1) and accomplished harvesting tasks under manual control (Gong et al. 2018, 2019, 2020). However, the soft manipulator has obvious hysteresis and low rigidity, so its movement is easily affected by external disturbances. Therefore, to further improve robustness and adaptivity for autonomous delicate grasping in the aquatic environment, we propose a learning adaptive controller based on the temporal-difference reinforcement learning method. In this controller, we design an action selection guidance strategy based on human experience; thus, compared with the above-mentioned controllers, it offers good online learning ability and control performance and does not need an offline training process. Using the proposed controller, the predictive output of a feedforward prediction model (chamber length vs. pneumatic pressure) can be adjusted online, which endows the soft manipulator with robustness against underwater disturbances (external loads and steady flow). For brevity, we name this controller the prediction model-based guided reinforcement learning adaptive controller (GRLMAC). We then test and validate the effectiveness of GRLMAC on simulation and experiment platforms by carrying out static reaching tasks, dynamic trajectory tracking tasks, and grasping tasks.

Fig. 1

The underwater grasping system consists of the OBSS soft manipulator and an ROV. The OBSS soft manipulator system contains a sensing system (a binocular camera), a control system, and a multi-channel pneumatic system

This paper is organized as follows. Section 2 briefly introduces the inverse kinematics modeling of the OBSS soft manipulator. Section 3 briefly introduces the feed-forward prediction model, then designs the guided reinforcement learning policy that modifies the prediction model output and proposes the prediction model-based guided reinforcement learning adaptive controller (GRLMAC). Section 4 establishes the simulation platform in the MATLAB environment and analyzes the control performance, learning efficiency, and robustness of GRLMAC under different external loads and a time-varying disturbance. Section 5 presents the physical experimental platform and conducts experiments to further verify the performance of GRLMAC. Section 6 draws conclusions.

2 Inverse kinematics model

The physical prototype and space coordinate systems of the OBSS soft manipulator are shown in Fig. 2. The soft manipulator consists of two bending segments, one extending segment, and one soft gripper, all fabricated from silicone rubber. Each bending segment has 2 DOFs and three actuated chambers, while the extending segment has 1 DOF and one actuated chamber. The two bending segments, assembled with an offset angle of 180°, have the same radius and initial length and always maintain equal bending angles with opposing (S-shaped) curvatures during manipulation, so that the orientation of the soft gripper is always kept vertically downward.

Fig. 2

The OBSS soft manipulator. a The physical prototype of the OBSS soft manipulator, which consists of four parts: bending segment 1, bending segment 2, the extending segment, and the soft gripper. b Space coordinate systems of the OBSS soft manipulator, where O0-X0Y0Z0 is the base coordinate system; O1-X1Y1Z1, O2-X2Y2Z2, and O3-X3Y3Z3 are moving coordinate systems for each segment of the soft manipulator. c Geometric relationship of bending segment 1

Based on the characteristics of the OBSS soft manipulator, we have established its kinematics model in our previous work (Gong et al. 2019). In this section, we will introduce the modeling process briefly, and the notations are summarized in Table 1.

Table 1 Notation and definitions

The constraint conditions for kinematics modeling are determined as follows

$$\left\{ \begin{aligned} &\theta_{1} = \theta_{2}, \quad \phi_{2} = \phi_{1} + \pi \\ &\lambda_{1} = \lambda_{2}, \quad r_{1} = r_{2} \\ &l_{1j} = l_{2j} \quad (j = 1, 2, 3) \end{aligned} \right.$$
(1)

where θi (i = 1, 2, denoting the ith bending segment) is the bending angle, ϕi is the deflection angle, λi is the radius of curvature of the central axis, ri is the distance from the cross-sectional center to the center of a chamber, and lij is the length of the jth chamber in the ith bending segment. Based on the above conditions, the end center coordinates of each bending segment, relative to the base coordinate system O0-X0Y0Z0, are expressed as Eq. (2).

$$x_{1} = \frac{x_{2}}{2} = \frac{x}{2}, \quad y_{1} = \frac{y_{2}}{2} = \frac{y}{2}, \quad \left| z_{1} \right| = \frac{\left| z_{2} \right|}{2} = \frac{\left| z \right| - l_{e}}{2}$$
(2)

where O1 (x1, y1, z1) denotes the end center coordinates of bending segment 1, O2 (x2, y2, z2) denotes the end center coordinates of bending segment 2, O3 (x, y, z) denotes the end center coordinates of the soft gripper, and le is the length of the extending segment. Then, the deflection angle ϕ1 can be calculated by

$$\phi_{1} = \tan^{-1}\!\left( \frac{y}{x} \right)$$
(3)

Based on the geometric relationship described in Fig. 2, the bending angle θ1 can be obtained by setting the value of z1

$$\theta_{1} = \pi - 2\sin^{-1}\!\left( \frac{z_{1}}{\sqrt{x_{1}^{2} + y_{1}^{2} + z_{1}^{2}}} \right)$$
(4)

Then, the radius of curvature λ1 is

$$\lambda_{1} = \sqrt{\frac{x_{1}^{2} + y_{1}^{2} + z_{1}^{2}}{2\left( 1 - \cos \theta_{1} \right)}}$$
(5)

Based on Eqs. (3)–(5), the length of the jth chamber in bending segment 1 can be obtained

$$\left\{ \begin{aligned} l_{11} &= \theta_{1}\left( \lambda_{1} - r_{1}\cos \phi_{1} \right) \\ l_{12} &= \theta_{1}\left( \lambda_{1} - r_{1}\cos\!\left( \frac{2\pi}{3} - \phi_{1} \right) \right) \\ l_{13} &= \theta_{1}\left( \lambda_{1} - r_{1}\cos\!\left( \frac{4\pi}{3} - \phi_{1} \right) \right) \end{aligned} \right.$$
(6)

The length of the jth chamber in bending segment 2 then follows from Eq. (1), and the length of the extending segment le follows from Eq. (2). If the results do not satisfy the length requirement for each chamber, we modify the value of z1 and recompute the chamber lengths.
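The inverse kinematics above reduces to a few closed-form steps followed by a feasibility check on the chamber lengths. As an illustration only, the following Python sketch implements Eqs. (2)-(6) for bending segment 1; the chamber offset radius and the admissible length range are placeholder values, not the parameters of the physical prototype, and a nonzero bending angle is assumed.

```python
import numpy as np

R1 = 12.0                    # r1 [mm]: chamber offset radius (placeholder value)
L_MIN, L_MAX = 80.0, 160.0   # admissible chamber length range [mm] (assumed)

def bending_segment_lengths(x, y, z, l_e):
    """Chamber lengths of bending segment 1 for a gripper-end target (x, y, z).

    l_e is the current guess for the extending-segment length; by Eq. (2),
    changing l_e is equivalent to modifying z1 when the result is infeasible.
    Returns (l11, l12, l13) or None if a length limit is violated.
    """
    x1, y1 = x / 2.0, y / 2.0
    z1 = (abs(z) - l_e) / 2.0                        # |z1| from Eq. (2)

    phi1 = np.arctan2(y, x)                          # Eq. (3); atan2 handles x = 0
    chord = np.sqrt(x1**2 + y1**2 + z1**2)
    theta1 = np.pi - 2.0 * np.arcsin(z1 / chord)     # Eq. (4)
    lam1 = np.sqrt((x1**2 + y1**2 + z1**2)
                   / (2.0 * (1.0 - np.cos(theta1)))) # Eq. (5), assumes theta1 != 0

    lengths = tuple(theta1 * (lam1 - R1 * np.cos(k * 2.0 * np.pi / 3.0 - phi1))
                    for k in range(3))               # Eq. (6), k = 0, 1, 2
    if all(L_MIN <= l <= L_MAX for l in lengths):
        return lengths          # bending segment 2 mirrors these lengths (Eq. 1)
    return None                 # caller adjusts z1 (via l_e) and retries
```

A full solver would wrap this function in a loop over l_e (equivalently, z1) until the length constraints are satisfied, as described above.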

3 Guided reinforcement learning model-based adaptive controller

3.1 Hysteresis model

To measure the relationship between pressure and length for a chamber in the bending segment and in the extending segment, we conducted an isotonic test in water under a no-load condition. From the measurement results (shown in Fig. 3), we found that the actuated chamber in each segment of the soft manipulator exhibits obvious asymmetric hysteresis. To describe this phenomenon, the extended unparallel Prandtl-Ishlinskii (EUPI) model, expressed as Eq. (7), is adopted in this paper. This model is an effective way to describe the hysteresis of artificial muscles (Hao et al. 2017; Sun et al. 2017).

$$\left\{ \begin{aligned} u_{\text{p}}(t) &= \Gamma_{\text{UPI}}[l](t) + \mathrm{P}[l](t) \\ \Gamma_{\text{UPI}}[l](t) &= \sum_{m = 1}^{N_{r}} \omega_{m} F_{r_{m},\alpha_{m},\beta_{m}}[l](t) \\ F_{r_{m},\alpha_{m},\beta_{m}}[l](t) &= \max\left\{ \alpha_{m}\left( l(t) - r_{m} \right),\; \min\left\{ \beta_{m}\left( l(t) + r_{m} \right),\; F_{r_{m},\alpha_{m},\beta_{m}}[l](t - 1) \right\} \right\} \\ \mathrm{P}[l](t) &= p_{1} l(t)^{3} + p_{2} l(t)^{2} + p_{3} l(t) \end{aligned} \right.$$
(7)

where up(t) represents the driving pressure for the chamber predicted by the EUPI model. The EUPI model consists of an unparallel PI (UPI) portion ΓUPI[l](t) and a polynomial portion P[l](t). l(t) is the chamber length of the soft manipulator, ωm > 0 (m = 1, 2, …, Nr, where Nr is the total number of dead zones) is a weight coefficient, \(F_{r_{m},\alpha_{m},\beta_{m}}[l](t)\) is the unparallel PI operator, αm and βm are the tilt coefficients of the pressurization edge and depressurization edge respectively, rm (rm ≥ 0) is the boundary threshold of the mth dead zone, and pn (n = 1, 2, 3) is the weight coefficient of the polynomial portion. αm, βm, ωm, and pn are identified by a particle swarm optimization (PSO) algorithm. The EUPI fitting curves are shown in Fig. 3.
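For reference, Eq. (7) can be evaluated recursively as shown in the Python sketch below. The parameter values in the usage comment are made up for illustration; in the paper they are identified from the isotonic test data by PSO.

```python
import numpy as np

class EUPIModel:
    """Minimal EUPI hysteresis model of Eq. (7): chamber length l -> predicted pressure u_p."""

    def __init__(self, r, alpha, beta, omega, poly):
        self.r = np.asarray(r, dtype=float)          # dead-zone thresholds r_m >= 0
        self.alpha = np.asarray(alpha, dtype=float)  # pressurization-edge tilt coefficients
        self.beta = np.asarray(beta, dtype=float)    # depressurization-edge tilt coefficients
        self.omega = np.asarray(omega, dtype=float)  # operator weights omega_m > 0
        self.poly = poly                             # (p1, p2, p3) of the polynomial portion
        self.F_prev = np.zeros_like(self.r)          # operator memory F[l](t - 1)

    def step(self, l):
        """Advance one sample and return u_p(t) for the chamber length l(t)."""
        # UPI operator: max{alpha_m (l - r_m), min{beta_m (l + r_m), F[l](t - 1)}}
        F = np.maximum(self.alpha * (l - self.r),
                       np.minimum(self.beta * (l + self.r), self.F_prev))
        self.F_prev = F
        p1, p2, p3 = self.poly
        return float(self.omega @ F) + p1 * l**3 + p2 * l**2 + p3 * l

# Example with three operators and made-up parameters:
# fpm = EUPIModel(r=[0.0, 5.0, 10.0], alpha=[0.8, 0.5, 0.3],
#                 beta=[0.6, 0.4, 0.2], omega=[1.0, 0.7, 0.4],
#                 poly=(1e-4, -1e-2, 0.5))
# u_p = fpm.step(120.0)
```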

Fig. 3

Pressure-length hysteresis curves and corresponding fitting curves for chambers of the OBSS soft manipulator in the underwater, no-load condition. a The hysteresis curve and fitting curve for the bending segment. b The hysteresis curve and fitting curve for the extending segment

3.2 Guided reinforcement learning policy

Based on the kinematic model of the soft manipulator and the EUPI model of the chamber, we establish a motion control system for the soft manipulator. In this control system, we first calculate each chamber's desired length from the target position through the kinematic model. Then, we feed the desired length into the EUPI model and calculate the predictive driving pressure of each chamber; that is, the EUPI model is treated as a feed-forward prediction model (FPM) in our work.

Nevertheless, the EUPI model is identified under a specific condition (no external load), so its generality is poor, and its predictive performance is easily affected by changes in external conditions. To address this problem, we use a correction coefficient κ(t) to modify the predictive driving pressure up(t); the actual driving pressure for a chamber of the soft manipulator is expressed as

$$u_{{\text{a}}} (t) = u_{{\text{p}}} (t) + \kappa (t)$$
(8)

where κ(t) is updated as follows

$$\kappa (t) = \kappa (t - 1) + \Delta \kappa (s_{1} (t))$$
(9)

where Δκ(s1(t)) is an adjustment function of s1(t) = ld(t + 1) − l(t) (ld is the desired chamber length). In this paper, Δκ(s1(t)) is chosen as the exponential function in Eq. (10) to ensure that it is bounded.

$$\begin{aligned} \Delta\kappa(s_{1}(t)) &= \left( p^{\prime}_{1}\, e^{\left( p^{\prime}_{2} - \frac{p^{\prime}_{3}}{\left| s_{1}(t) \right|} \right)} + p^{\prime}_{4} \right) \operatorname{sgn}\left( s_{1}(t) \right) \\ \operatorname{sgn}\left( s_{1}(t) \right) &= \begin{cases} -1 & s_{1}(t) < 0 \\ 0 & s_{1}(t) = 0 \\ 1 & s_{1}(t) > 0 \end{cases} \end{aligned}$$
(10)

In Eq. (10), \(p_{1}^{\prime}\), \(p_{2}^{\prime}\), \(p_{3}^{\prime}\), and \(p_{4}^{\prime}\) are positive parameters.
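As a small illustration of Eqs. (8)-(10), the correction step can be written as follows; the four parameters are assumed to have already been chosen by the action-selection policy introduced next.

```python
import math

def delta_kappa(s1, p1, p2, p3, p4):
    """Bounded adjustment of Eq. (10), with s1 = l_d(t+1) - l(t)."""
    if s1 == 0.0:
        return 0.0
    magnitude = p1 * math.exp(p2 - p3 / abs(s1)) + p4
    return math.copysign(magnitude, s1)

def corrected_pressure(u_p, kappa_prev, s1, params):
    """Eqs. (8)-(9): update kappa(t) and return (u_a(t), kappa(t))."""
    kappa = kappa_prev + delta_kappa(s1, *params)
    return u_p + kappa, kappa
```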

To improve the flexibility and stability of the adjustment of κ(t), we require the above parameters in Δκ(s1(t)) to also depend on s1(t). To this end, we design an online learning strategy that determines \(p_{1}^{\prime}\), \(p_{2}^{\prime}\), \(p_{3}^{\prime}\), and \(p_{4}^{\prime}\) using the Sarsa learning algorithm, which, compared with dynamic programming and Monte Carlo methods, learns quickly without knowledge of the environment model (such as the state transition probabilities) (Sutton and Barto 1998; Sutton 1988; Kirkpatrick and Valasek 2009; Kirkpatrick et al. 2013). The detailed design procedure is described as follows.

For the soft manipulator system, we define the state variables s1(t) and s2(t + 1) = ld(t + 1) − l(t + 1), which belong to a state space S = {s | −∞ < s < ∞}. Based on the displacement range of the actuated chamber shown in Fig. 3, the state space S is divided into the following seven continuous intervals.

$$\begin{aligned} \mathbf{S} &= \left\{ \mathbf{S}_{1}, \mathbf{S}_{2}, \mathbf{S}_{3}, \mathbf{S}_{4}, \mathbf{S}_{5}, \mathbf{S}_{6}, \mathbf{S}_{7} \right\} \\ \mathbf{S}_{1} &= \left\{ s \mid -\infty < s < -50 \right\}, \quad \mathbf{S}_{2} = \left\{ s \mid -50 \le s < -0.1 \right\}, \\ \mathbf{S}_{3} &= \left\{ s \mid -0.1 \le s < -0.00001 \right\}, \quad \mathbf{S}_{4} = \left\{ s \mid -0.00001 \le s \le 0.00001 \right\}, \\ \mathbf{S}_{5} &= \left\{ s \mid 0.00001 < s \le 0.1 \right\}, \quad \mathbf{S}_{6} = \left\{ s \mid 0.1 < s \le 50 \right\}, \quad \mathbf{S}_{7} = \left\{ s \mid 50 < s < \infty \right\}. \end{aligned}$$
(11)
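The interval partition of Eq. (11) maps directly onto a threshold chain. A minimal Python version, with the error expressed in the same units as the chamber length, is:

```python
def state_interval(s):
    """Return the index k (1..7) of the interval S_k in Eq. (11) containing the error s."""
    if s < -50.0:
        return 1
    if s < -0.1:
        return 2
    if s < -1e-5:
        return 3
    if s <= 1e-5:
        return 4
    if s <= 0.1:
        return 5
    if s <= 50.0:
        return 6
    return 7
```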

Then, corresponding to each state interval and based on the driving performance of the soft manipulator, we define an action space A containing four actions, expressed in Eq. (12).

$$\begin{aligned} \mathbf{A} &= \left\{ a_{1}, a_{2}, a_{3}, a_{4} \right\} \\ a_{1} &: \; \left[ p^{\prime}_{1} \;\; p^{\prime}_{2} \;\; p^{\prime}_{3} \;\; p^{\prime}_{4} \right] = \left[ 10^{3} \;\; 1 \;\; 50 \;\; 10^{2} \right], \\ a_{2} &: \; \left[ p^{\prime}_{1} \;\; p^{\prime}_{2} \;\; p^{\prime}_{3} \;\; p^{\prime}_{4} \right] = \left[ 10^{2} \;\; \tfrac{0.1}{50} \;\; 0.1 \;\; 10 \right], \\ a_{3} &: \; \left[ p^{\prime}_{1} \;\; p^{\prime}_{2} \;\; p^{\prime}_{3} \;\; p^{\prime}_{4} \right] = \left[ 10 \;\; \tfrac{0.00001}{0.1} \;\; 0.00001 \;\; 1 \right], \\ a_{4} &: \; \left[ p^{\prime}_{1} \;\; p^{\prime}_{2} \;\; p^{\prime}_{3} \;\; p^{\prime}_{4} \right] = \left[ 1 \;\; 1 \;\; 0.00001 \;\; 0 \right] \end{aligned}$$
(12)

Based on Eq. (12), at the current state s1(t) we select an action a(t) from A to determine the parameters in Δκ(s1(t)). The driving pressure ua(t) is then calculated by Eqs. (8)-(10) and executed, and the state variable s2(t + 1) is obtained. To evaluate the selected action a(t), considering the plausibility and validity of the selected action at state s1(t), we design a reward matrix R. In our work, the allowable range of s2(t + 1) is set as [−0.1, 0.1], and the reward matrix R is described by Eq. (13).

$$\mathbf{R} = \begin{array}{c|cccc} s_{2}(t + 1) \in \mathbf{S} \;\backslash\; a(t) \in \mathbf{A} & a_{1} & a_{2} & a_{3} & a_{4} \\ \hline \mathbf{S}_{1} & 0 & 0 & 0 & 0 \\ \mathbf{S}_{2} & 0 & 0 & 0 & 0 \\ \mathbf{S}_{3} & 1 & 1 & 1 & 1 \\ \mathbf{S}_{4} & 10 & 10 & 10 & 10 \\ \mathbf{S}_{5} & 1 & 1 & 1 & 1 \\ \mathbf{S}_{6} & 0 & 0 & 0 & 0 \\ \mathbf{S}_{7} & 0 & 0 & 0 & 0 \end{array}$$
(13)

According to Eq. (13), a reward r(t) = 0 means that the state s2(t + 1) generated by action a(t) is a bad or infeasible state, r(t) = 1 means that s2(t + 1) is reasonable but not the best state, and r(t) = 10 means that s2(t + 1) is the best state under action a(t). Therefore, to drive s2(t + 1) toward the best state, the soft manipulator system needs to learn to select the appropriate action from A at the current state s1(t).
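Since every column of Eq. (13) is identical, the reward depends only on the interval into which the next error s2(t + 1) falls; reusing state_interval from the sketch above, a compact encoding is:

```python
# Rows of the reward matrix R in Eq. (13); the reward is the same for every action.
INTERVAL_REWARD = {1: 0, 2: 0, 3: 1, 4: 10, 5: 1, 6: 0, 7: 0}

def reward(s2_next):
    """r(t) of Eq. (13): 10 inside the +/-0.00001 band, 1 within +/-0.1, 0 otherwise."""
    return INTERVAL_REWARD[state_interval(s2_next)]
```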

To this end, based on the Sarsa algorithm (Gong et al. 2018, 2019, 2020; Hao et al. 2017), a state-action value matrix Q(S, A) ∈ R7×4 is designed, and its recursive update is defined as

$$\mathbf{Q}\left( \mathbf{S}_{1}(t), a(t) \right) \leftarrow \mathbf{Q}\left( \mathbf{S}_{1}(t), a(t) \right) + \alpha\left[ r(t) + \gamma\, \mathbf{Q}\left( \mathbf{S}_{1}(t + 1), a(t + 1) \right) - \mathbf{Q}\left( \mathbf{S}_{1}(t), a(t) \right) \right]$$
(14)

where S1(t) is the interval of the state space S to which the state s1(t) belongs, α is the learning rate, and γ ∈ [0, 1] is the discount factor. Then, based on Q, we select the action a(t) from the action space A.

The ε-greedy policy is the basic and most commonly used action selection strategy for reinforcement learning methods and contains an exploration phase and an exploitation phase (Sutton and Barto 1998). In the exploration phase, we arbitrarily choose an action from A with a small probability ε. In the exploitation phase, we choose the optimal action (the action corresponding to the maximum of Q at state s1(t) ∈ Sk) with probability 1 − ε. For the soft manipulator system, the ε-greedy policy is represented as Eq. (15).

$$\left\{ \begin{aligned} &\text{if } \operatorname{rand}() < \varepsilon: \quad a(t) \leftarrow \operatorname{rand}_{a}\left( \mathbf{A}(1, \{ a_{1}, a_{2}, a_{3}, a_{4} \}) \right) \\ &\text{else}: \quad a(t) \leftarrow \max_{a}\left( \mathbf{Q}\left( \mathbf{S}_{1}(t), \{ a_{1}, a_{2}, a_{3}, a_{4} \} \right) \right) \end{aligned} \right.$$
(15)

With Eq. (15), converging to the optimal Q usually requires a large number of steps, and the result can easily fall into a local optimum because of the diversity of candidate actions.

We therefore design an action selection guidance strategy to improve the ε-greedy policy, based on our knowledge of the actuation performance of the chamber and our experience in operating the soft manipulator, which was obtained from a large number of experiments in our earlier work (Gong et al. 2019). This strategy reduces the step cost and improves the convergence rate and the global convergence of the reinforcement learning method. With it, we can determine the corresponding action choice range for each state interval in the state space S. For example, when s1(t) belongs to S1, our experience indicates that, to make the actuated chamber reach the desired length rapidly, κ(t) must be adjusted considerably; thus action a1 is the best option.

Based on the action selection guidance strategy, the ε-greedy policy can then be rewritten as Eq. (16). The improved ε-greedy policy reduces the variety of action choices and imposes reasonable choice ranges, which give the learning method fast convergence and allow κ(t) to be adjusted online in real time without an offline pre-training phase.

$$\left\{ \begin{aligned} &\text{if } \operatorname{rand}() < \varepsilon: \quad a(t) \leftarrow \operatorname{rand}_{a}\left( \mathbf{A}(1, \{ a_{1}, a_{2}, a_{3}, a_{4} \}) \right) \\ &\text{else}: \quad \begin{cases} \text{if } s_{1}(t) \in \mathbf{S}_{1}, & a(t) \leftarrow \max_{a} \mathbf{Q}\left( \mathbf{S}_{1}, \{ a_{1} \} \right) \\ \text{if } s_{1}(t) \in \mathbf{S}_{2}, & a(t) \leftarrow \max_{a} \mathbf{Q}\left( \mathbf{S}_{2}, \{ a_{1}, a_{2} \} \right) \\ \text{if } s_{1}(t) \in \mathbf{S}_{3}, & a(t) \leftarrow \max_{a} \mathbf{Q}\left( \mathbf{S}_{3}, \{ a_{2}, a_{3} \} \right) \\ \text{if } s_{1}(t) \in \mathbf{S}_{4}, & a(t) \leftarrow \max_{a} \mathbf{Q}\left( \mathbf{S}_{4}, \{ a_{3}, a_{4} \} \right) \\ \text{if } s_{1}(t) \in \mathbf{S}_{5}, & a(t) \leftarrow \max_{a} \mathbf{Q}\left( \mathbf{S}_{5}, \{ a_{2}, a_{3} \} \right) \\ \text{if } s_{1}(t) \in \mathbf{S}_{6}, & a(t) \leftarrow \max_{a} \mathbf{Q}\left( \mathbf{S}_{6}, \{ a_{1}, a_{2} \} \right) \\ \text{if } s_{1}(t) \in \mathbf{S}_{7}, & a(t) \leftarrow \max_{a} \mathbf{Q}\left( \mathbf{S}_{7}, \{ a_{1} \} \right) \end{cases} \end{aligned} \right.$$
(16)
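Combining Eqs. (12), (14), and (16), the guided Sarsa step can be sketched as below. The action-parameter table reproduces Eq. (12) and the per-interval candidate sets reproduce Eq. (16); the learning rate and discount factor follow Sect. 4 (α = 1, γ = 0.8), while ε = 0.1 is only an assumed exploration rate.

```python
import random
import numpy as np

# Eq. (12): parameters (p1', p2', p3', p4') assigned by each action.
ACTION_PARAMS = {
    1: (1e3, 1.0, 50.0, 1e2),
    2: (1e2, 0.1 / 50.0, 0.1, 10.0),
    3: (10.0, 1e-5 / 0.1, 1e-5, 1.0),
    4: (1.0, 1.0, 1e-5, 0.0),
}

# Eq. (16): admissible actions for each state interval (the guidance strategy).
GUIDED_ACTIONS = {1: [1], 2: [1, 2], 3: [2, 3], 4: [3, 4],
                  5: [2, 3], 6: [1, 2], 7: [1]}

Q = np.zeros((7, 4))   # state-action value matrix Q(S, A), initialized to zero

def select_action(k, epsilon=0.1):
    """Guided epsilon-greedy policy of Eq. (16) for state interval k (1..7)."""
    if random.random() < epsilon:
        return random.choice([1, 2, 3, 4])           # exploration over the whole of A
    candidates = GUIDED_ACTIONS[k]                   # exploitation restricted by guidance
    return max(candidates, key=lambda a: Q[k - 1, a - 1])

def sarsa_update(k, a, r, k_next, a_next, alpha=1.0, gamma=0.8):
    """On-policy temporal-difference update of Eq. (14)."""
    td_target = r + gamma * Q[k_next - 1, a_next - 1]
    Q[k - 1, a - 1] += alpha * (td_target - Q[k - 1, a - 1])
```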

According to Eqs. (8)-(16), we construct the prediction model-based guided reinforcement learning adaptive controller (GRLMAC) for a chamber; the control schematic for the OBSS soft manipulator system is depicted in Fig. 4, and the corresponding control procedure is shown in Algorithm 1.

Fig. 4

Schematic diagram of GRLMAC for the soft manipulator. Based on the target position Ptarget(t) and the distance ∆P(t), the desired chamber lengths ldesired(t) and the chamber length errors ∆l(t) are obtained via the inverse kinematics model. The predictive driving pressure uprediction(t) is calculated by feeding ldesired(t) into the FPM and is adjusted online by the correction coefficient κ(t), which is obtained by feeding ∆l(t) and κ(t − 1) into the GRL module. uactual(t) is the actual driving pressure for the soft manipulator

Algorithm 1

In Fig. 4, \(\Delta\mathbf{P}(t) = [\Delta x(t), \Delta y(t), \Delta z(t)]\) is the distance between the soft manipulator end and the target, \(\mathbf{P}_{\text{target}}(t) = [x_{t}(t), y_{t}(t), z_{t}(t)]\) is the position of the target, \(\mathbf{l}_{\text{desired}}(t) = [l_{d11}(t), l_{d12}(t), l_{d13}(t), l_{d21}(t), l_{d22}(t), l_{d23}(t), l_{de}(t)]\) and \(\Delta\mathbf{l}(t) = [\Delta l_{11}(t), \Delta l_{12}(t), \Delta l_{13}(t), \Delta l_{21}(t), \Delta l_{22}(t), \Delta l_{23}(t), \Delta l_{e}(t)]\) are the desired lengths and the length errors of the chambers respectively, \(\mathbf{u}_{\text{prediction}}(t) = [u_{p11}(t), u_{p12}(t), u_{p13}(t), u_{p21}(t), u_{p22}(t), u_{p23}(t), u_{pe}(t)]\) is the predictive driving pressure for the soft manipulator calculated by Eq. (7), \(\boldsymbol{\kappa}(t) = [\kappa_{11}(t), \kappa_{12}(t), \kappa_{13}(t), \kappa_{21}(t), \kappa_{22}(t), \kappa_{23}(t), \kappa_{e}(t)]\) is the correction coefficient for uprediction(t), and \(\mathbf{u}_{\text{actual}}(t) = [u_{a11}(t), u_{a12}(t), u_{a13}(t), u_{a21}(t), u_{a22}(t), u_{a23}(t), u_{ae}(t)]\) is the actual driving pressure for the soft manipulator. Each chamber requires its own GRLMAC to control its length variation.
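To make the data flow of Fig. 4 concrete for a single chamber, the sketch below strings the previous sketches together for one control step. The chamber object with apply_pressure and read_length methods is a placeholder for the pneumatic and camera-feedback interfaces, and the desired length is assumed to come from the inverse kinematics of Sect. 2.

```python
def grlmac_chamber_step(fpm, chamber, l_desired, l_measured, kappa, prev):
    """One GRLMAC step for a single chamber (Fig. 4 pipeline).

    fpm        -- EUPIModel instance used as the feed-forward prediction model
    l_desired  -- desired chamber length from the inverse kinematics model
    l_measured -- current chamber length estimated from camera feedback
    kappa      -- correction coefficient kappa(t - 1)
    prev       -- (k, a, r) of the previous step, kept for the Sarsa update, or None
    """
    s1 = l_desired - l_measured                   # chamber length error (RL state variable)
    k = state_interval(s1)
    a = select_action(k)                          # guided epsilon-greedy choice, Eq. (16)

    # The previous transition can be closed now that the next state and action are known.
    if prev is not None:
        k_prev, a_prev, r_prev = prev
        sarsa_update(k_prev, a_prev, r_prev, k, a)

    u_p = fpm.step(l_desired)                     # feed-forward pressure, Eq. (7)
    u_a, kappa = corrected_pressure(u_p, kappa, s1, ACTION_PARAMS[a])  # Eqs. (8)-(10)
    chamber.apply_pressure(u_a)                   # placeholder actuation interface

    l_next = chamber.read_length()                # placeholder feedback interface
    r = reward(l_desired - l_next)                # Eq. (13) evaluated on s2(t + 1)
    return kappa, (k, a, r)
```

In the full controller, this step would run for each of the seven chambers, each with its own κ(t) and Q, as stated above.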

3.3 Stability analysis

As a controllable system, the actuated chamber of the OBSS soft manipulator should satisfy the following assumptions, which are described in Bu et al. (2019):

A1: The input and the output of the soft manipulator control system are measurable and controllable. When disturbances are bounded, there is always one bounded input signal corresponding to a bounded desired output signal, which makes the actual system output signal equal to the desired one.

A2: The nonlinear system function has a continuous partial derivative with respect to the current input signal.

A3: The actuated chamber control system satisfies the generalized Lipschitz condition; that is, there exists a parameter b > 0 such that Eq. (17) holds.

$$\begin{aligned} &t_{1} \ne t_{2}, \quad t_{1}, t_{2} > 0, \quad u_{a}(t_{1}) \ne u_{a}(t_{2}) \\ &\left| l(t_{1} + 1) - l(t_{2} + 1) \right| \le b \left| u_{a}(t_{1}) - u_{a}(t_{2}) \right| \end{aligned}$$
(17)

Based on assumptions A2 and A3, when |ua(t) − ua(t − 1)| ≠ 0 there must exist ψ(t) ∈ R such that |l(t) − l(t − 1)| = ψ(t)|ua(t) − ua(t − 1)|. Moreover, as shown in Fig. 3, the input pressure and the output length of the actuated chamber have the same monotonicity, which means that

$$\operatorname{sgn}\left( \Delta u_{a}(t) \right) = \operatorname{sgn}\left( \Delta l(t) \right)$$

The discrete-time state equation can then be expressed as

$$\left\{ \begin{aligned} \mathbf{x}_{\text{st}}(t + 1) &= \mathbf{A}\mathbf{x}_{\text{st}}(t) + \mathbf{B}\Delta u_{a}(t) \\ \Delta l(t + 1) &= \mathbf{C}\mathbf{x}_{\text{st}}(t) + \mathbf{D}\Delta u_{a}(t) \end{aligned} \right.$$
(18)

where xst(t) denotes the state variable, and A, B, C, and D are coefficient matrices expressed as follows

$$\begin{aligned} \mathbf{A} &= \begin{bmatrix} -\dfrac{C_{ac}}{K_{ac}} & 0 \\ 0 & -\dfrac{C_{ac}}{K_{ac}} \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \\ \mathbf{C} &= \begin{bmatrix} -\dfrac{C_{ac}}{\left( K_{ac} \right)^{2}} \\ 0 \end{bmatrix}^{\mathrm{T}}, \quad \mathbf{D} = \begin{bmatrix} \dfrac{1}{K_{ac}} \\ 0 \end{bmatrix}, \quad \mathbf{x}_{\text{st}}(t) = \begin{bmatrix} K_{ac}\Delta l(t) \\ K_{ac}\Delta l(t) \end{bmatrix} \end{aligned}$$

where Cac > 0 and Kac > 0.

To analyze the stability of GRLMAC, we need to construct a common Lyapunov function that exists in each state interval and satisfies the following conditions

$$\left\{ \begin{aligned} &\mathrm{V}\left( \mathbf{x}_{\text{st}}(t) \right) > 0 \;\; \text{and} \;\; \mathrm{V}(0) = 0 \\ &\Delta \mathrm{V}\left( \mathbf{x}_{\text{st}}(t) \right) = \mathrm{V}\left( \mathbf{x}_{\text{st}}(t + 1) \right) - \mathrm{V}\left( \mathbf{x}_{\text{st}}(t) \right) < 0 \\ &\text{when } \mathbf{x}_{\text{st}}(t) \to \infty, \;\; \mathrm{V}\left( \mathbf{x}_{\text{st}}(t) \right) \to \infty \end{aligned} \right.$$
(19)

Then, based on Eqs. (18) and (19), we consider the following Lyapunov candidate function

$$\mathrm{V}\left( \mathbf{x}_{\text{st}}(t + 1) \right) = \mathbf{x}_{\text{st}}(t + 1)^{\mathrm{T}} \mathbf{P}\, \mathbf{x}_{\text{st}}(t + 1)$$
(20)

where P is a symmetric positive-definite matrix satisfying

$$\mathbf{A}^{\mathrm{T}}\mathbf{P}\mathbf{A} - \mathbf{P} = -\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Then, we obtain

$$\mathbf{P} = \begin{bmatrix} \dfrac{\left( K_{ac} \right)^{2}}{\left( K_{ac} \right)^{2} - \left( C_{ac} \right)^{2}} & 0 \\ 0 & \dfrac{\left( K_{ac} \right)^{2}}{\left( K_{ac} \right)^{2} - \left( C_{ac} \right)^{2}} \end{bmatrix}$$
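This expression can be checked directly: since A = −(Cac/Kac)I is a scalar multiple of the identity, the Lyapunov equation above reduces to

$$\mathbf{A}^{\mathrm{T}}\mathbf{P}\mathbf{A} - \mathbf{P} = \left( \frac{\left( C_{ac} \right)^{2}}{\left( K_{ac} \right)^{2}} - 1 \right)\mathbf{P} = -\mathbf{I} \quad \Longrightarrow \quad \mathbf{P} = \frac{\left( K_{ac} \right)^{2}}{\left( K_{ac} \right)^{2} - \left( C_{ac} \right)^{2}}\,\mathbf{I}$$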

Because P must be positive definite, we have Kac > Cac.

Then, the increment of V(xst(t + 1)) is

$$\begin{aligned} \Delta V\left( \mathbf{x}_{\text{st}}(t + 1) \right) &= \mathbf{x}_{\text{st}}(t + 1)^{\mathrm{T}} \mathbf{P}\, \mathbf{x}_{\text{st}}(t + 1) - \mathbf{x}_{\text{st}}(t)^{\mathrm{T}} \mathbf{P}\, \mathbf{x}_{\text{st}}(t) \\ &= \mathbf{x}_{\text{st}}^{\mathrm{T}}(t)\left[ \mathbf{A}^{\mathrm{T}}\mathbf{P}\mathbf{A} - \mathbf{P} \right]\mathbf{x}_{\text{st}}(t) + \mathbf{x}_{\text{st}}^{\mathrm{T}}(t + 1)\mathbf{P}\mathbf{B}\,\Delta u_{a}(t) \\ &\quad + \Delta u_{a}(t)\,\mathbf{B}^{\mathrm{T}}\mathbf{P}\left[ \mathbf{x}_{\text{st}}(t + 1) - \mathbf{B}\,\Delta u_{a}(t) \right] \end{aligned}$$
(21)

Hence, for \(\Delta V(\mathbf{x}_{\text{st}}(t + 1)) < 0\), the magnitude of Δua(t) should be more than 2Kac times that of Δl(t). In our work, according to the driving performance of the actuated chamber (shown in Fig. 3) and multiple parameter-tuning experiments, we design the action space A and determine its parameters so that the proposed controller satisfies the above stability condition.

4 Simulation and results

To verify the control performance of GRLMAC, tracking tasks are performed under a time-varying disturbance and different external load conditions. For all tasks, the simulation step size is h = 0.01 s and the parameters of GRLMAC are set as follows: the learning rate α = 1, the discount factor γ = 0.8, the initial value of κ = [1, 1, 1, 1, 1, 1, 1], and the value matrix Q(s, a) initialized to zero. All simulations are performed on a PC with an i7 CPU @ 2.70 GHz, 16 GB RAM, and MATLAB 2016b.

4.1 Static performance

To validate the static performance of GRLMAC, we formulated reaching tasks under different external conditions. The actual time consumed for each reaching task is 2 s. Figure 5 shows the tracking performance of GRLMAC for a target point (100, 100, 400) under different external conditions. For different external load conditions (Mload = 0 g and 200 g) without disturbance, GRLMAC maintains a short settling time (less than 0.2 s) and a low steady-state distance, as shown in Table 2. Then, to demonstrate the robustness of GRLMAC to external disturbance, we add a time-varying disturbance in the X direction (Xdis = 5t). Compared with the results without disturbance, the static performance is not significantly affected by the external disturbance (the variations in settling time and distance are about 0.02 s and 0.3 mm), as shown in Fig. 5 and Table 2. These results indicate that GRLMAC endows the system with strong robustness and a fast online adjustment capability against external disturbance.

Fig. 5

Reaching tasks results under different external conditions

Table 2 Indicators of static performance without or with disturbance (target point is (100, 100, 400))

4.2 Dynamic performance

To validate the dynamic performance of GRLMAC, trajectory tracking tasks with different initial external loads and disturbances are formulated. The trajectory is set as (20sin(πt/1.25), 20cos(πt/1.25), 500), and each task takes 5 s. Figure 6 presents the simulation results and shows that GRLMAC maintains superior control performance under different external loads and the time-varying disturbance (the mean distance is less than 0.2 mm). Comparing the indicators in Table 3, we find that the GRL module efficiently adjusts the predictive driving pressure uprediction(t) and keeps the variation in distance below 0.02 mm, which illustrates that the proposed controller improves the robustness of the soft manipulator against external disturbance.

Fig. 6

Trajectory tracking tasks results under different external conditions

Table 3 Indicators of dynamic performance without or with disturbance (trajectory is (20sin(πt/1.25), 20cos(πt/1.25), 500))

5 Experiments and results

To further verify the performance of GRLMAC, experiments were conducted on a soft manipulator system. As shown in Fig. 7, the system is composed of a soft manipulator, a binocular camera (ZED, Stereolabs, USA) used to measure the distance between the target and the gripper, a multi-channel pneumatic system, a vibration pump, and a PC.

Fig. 7

The OBSS soft manipulator experiment system

In the multi-channel pneumatic system, eight proportional valves (ITV0030, SMC, Japan) are used to actuate the soft manipulator. In addition, a vibration pump is used to generate a constant flow disturbance on the manipulator in the X direction. In this section, three kinds of experimental tasks are performed: the static reaching task, the dynamic trajectory tracking task, and the grasping task. All of these tasks are conducted in a water environment, and the parameters of GRLMAC are set to the same values as in Sect. 4. It should be noted that the soft manipulator is controlled only in the XY plane for these experiments, which avoids position-detection problems caused by the manipulator shading the camera. Moreover, because of the response time of the proportional valves, data transmission time, and airflow rate, each control step of the OBSS control system takes about 0.8-1.5 s. Hence, to show the actual response performance of the proposed controller, the X-axis in the following curve figures is given in control steps.

5.1 Static reaching task

In this section, we conducted reaching tasks to validate the static performance of GRLMAC. In each reaching task, the soft manipulator with an external load (0 g, 30 g, or 96 g) is controlled to move toward a target point under a water flow disturbance. As shown in Fig. 8, the control performance of GRLMAC is not significantly affected by varying loads and flow disturbance, which demonstrates that the proposed controller has good robustness. Moreover, settling takes fewer than 20 steps, which means the GRL module achieves high online learning efficiency by introducing the action selection guidance strategy. It is noteworthy that the instability of the position coordinates detected by the binocular camera has a significant impact on the stability and accuracy of GRLMAC; therefore, the steady-state error (shown in Table 4) is larger than in the simulation results.

Fig. 8

Static reaching task results. For more details refer to Movie S1

Table 4 Indicators of static performance without or with disturbance

5.2 Dynamic trajectory tracking task

To verify the dynamic performance and robustness of GRLMAC more systematically, we control the soft manipulator to track a square signal and a sine signal with and without a constant flow disturbance. Figures 9 and 10 illustrate the control performance of GRLMAC for the trajectory tracking tasks. GRLMAC achieves almost the same control performance whether or not the flow disturbance is present (the change in error is about 1 mm, as shown in Table 5). This result validates the effectiveness and robustness of the proposed controller. However, the stability and control accuracy of GRLMAC for the sine signal are still affected by the instability of the measured data.

Fig. 9

Dynamic trajectory tracking task results for a square signal. For more details refer to Movie S2

Fig. 10

Dynamic trajectory tracking task results for a sine signal. For more details refer to Movie S2

Table 5 Indicators of dynamic performance without or with disturbance

5.3 Grasping task

In this section, to validate the grasping performance, we execute grasping tasks under a water flow disturbance. The objects (a ping-pong ball, a sea cucumber, and a scallop) need to be grasped and placed into a circle by the OBSS soft manipulator controlled via GRLMAC. The initial position of the soft gripper is set as shown in Fig. 11. For each grasping task, the soft manipulator takes 15 steps to move to the object; after reaching it, the manipulator takes 3 more steps to grasp, return, and release the object, so the whole task takes 18 steps. As shown in Movie S3, the soft manipulator can autonomously grasp objects of different sizes and weights into the circle under a flow disturbance using the GRLMAC algorithm. This result demonstrates that the proposed controller is robust to external disturbance. Moreover, reaching the object takes fewer than 15 steps, which illustrates that GRLMAC has a fast online learning ability.

Fig. 11

Grasping task platform. Three kinds of objects (a sea cucumber, a ping-pong ball filled with bolts, and a scallop) need to be grasped into the circle in order. All tasks are executed under a flow disturbance generated by a vibration pump

Moreover, to highlight the characteristics of GRLMAC, a comparison of control performance between GRLMAC and the other controllers mentioned in the introduction is provided in Table 6. According to the comparison results, the proposed controller has good online learning ability, which gives it better control performance than offline-learning controllers.

Table 6 Comparison of control performance

By combining the OBSS soft manipulator with GRLMAC, our soft manipulator offers a promising option for high-performance, low-cost underwater manipulation systems for marine tasks. To validate its gripping ability in a real-world underwater environment with ocean currents, water pressure, and limited visibility, we mounted the OBSS soft manipulator on an ROV and used this robot to collect seafood animals in the natural undersea environment (Fig. 12 and Movie S4). The results show that the low visibility in the offshore marine area and the strong, time-varying current in the open ocean increase the difficulty of grasping tasks. From the above experimental results, we also notice that the stability and accuracy of the proposed controller are affected by measurement noise and non-stationary stochastic disturbances, which affect the action selection because the action space A is designed based on the length error of a chamber. Therefore, an online signal processing algorithm based on optimal estimation theory will be essential in the future to obtain stable feedback data.

Fig. 12

Grasping task in the natural undersea environment. a Performing the grasping task in the offshore marine area with low visibility. b Performing the grasping task in the open ocean with a strong and time-varying current. For more details refer to Movie S4

6 Conclusion

In this study, a prediction model-based guided reinforcement learning adaptive controller (GRLMAC) is presented to control an OBSS soft manipulator, so that the soft manipulator can efficiently complete grasping tasks in a water environment with external disturbances (e.g., currents, water pressure, and external loads). In GRLMAC, an action selection guidance strategy based on human experience is designed to direct the reinforcement learning method to choose an appropriate adjustment behavior for the FPM. This approach endows reinforcement learning with efficient online learning ability and avoids an offline training process. To verify the control performance of GRLMAC, both simulation and experiment platforms were established, and tracking and grasping tasks were conducted. Both simulation and experimental results show that the proposed controller achieves good position control performance (the distance is about 1 mm for reaching tasks) and robustness (the error change is less than 1 mm) under different external loads and time-varying disturbances. Moreover, its efficient online learning ability enables the manipulator to reach the target point within a few steps (about 20 settling steps), which reduces time consumption. These results demonstrate the effectiveness of GRLMAC in underwater grasping tasks. In future work, we will analyze the effects of stochastic environmental disturbances on the grasping task, then design a disturbance-prediction policy and introduce it into the ε-greedy policy to select appropriate control actions of the soft manipulator under water disturbances.