1 Introduction

Although goal-directed movements have been extensively studied in the field of sensorimotor control over the past decades, their underlying computational mechanisms are still not fully understood (Todorov and Jordan 2002; Harris and Wolpert 1998; Franklin et al. 2003; Burdet et al. 2001, 2006; Wolpert et al. 1995; Selen et al. 2009; Zhou et al. 2017; Kadiallah et al. 2011; Mistry et al. 2013; Česonis and Franklin 2020). Various computational models have been proposed to account for sensorimotor control and learning (Shadmehr and Mussa-Ivaldi 2012; Krakauer et al. 2019). One widely accepted theory is that the CNS selects trajectories that minimize a cost function (Flash and Hogan 1985; Uno et al. 1989; Harris and Wolpert 1998; Todorov 2005; Jiang and Jiang 2014; Bian et al. 2020). In particular, the authors of Todorov and Jordan (2002) suggest that the CNS uses a model-based optimal feedback principle to coordinate body movement by minimizing an integral-quadratic cost index that trades off energy consumption against task constraints. Such optimal control frameworks have been found to successfully explain diverse phenomena, such as approximately straight movement trajectories and bell-shaped velocity curves (Morasso 1981), variability patterns and flexibility of arm movement trajectories (Todorov and Jordan 2002; Liu and Todorov 2007), adaptation to force fields and visuomotor transforms (Ueyama 2014; Braun et al. 2009), kinematic invariance despite the sacrifice of optimality (Mistry et al. 2013), and the fast timescale of motor learning (Crevecoeur et al. 2020), to name a few. A common assumption in these studies is that the CNS first identifies the system dynamics and then solves the optimal control problem using the identified model. (This kind of mechanism is referred to as a model-based mechanism, following Haith and Krakauer (2013).) However, there is currently no strong experimental evidence on how the CNS manages to generate an internal representation of the environment, especially for complex environments. To this end, model-free learning approaches, such as reinforcement learning (RL) and adaptive dynamic programming (ADP), have been utilized to explain sensorimotor learning behavior (Haith and Krakauer 2013; d’Acremont et al. 2009; Fiete et al. 2007; Jiang and Jiang 2014; Bian et al. 2020). RL and ADP are biologically inspired learning approaches that study how an agent iteratively modifies its control policies toward the optimal policy maximizing a cumulative reward function, directly using the observed responses from its interactions with the environment (Sutton and Barto 2018; Bertsekas 2019; Jiang et al. 2020; Jiang and Jiang 2017). Thus, an intermediate internal representation of the environment is no longer needed in the computation of the optimal control policy. (This kind of mechanism is referred to as a model-free mechanism, following Haith and Krakauer (2013).) Computational models based on RL and ADP also succeed in explaining many experimental observations in sensorimotor control and learning; see Fiete et al. (2007), d’Acremont et al. (2009), Huang et al. (2011), Izawa and Shadmehr (2011), Shmuelof et al. (2012), Haith and Krakauer (2013), Jiang and Jiang (2014), Vaswani et al. (2015) and Bian et al. (2020). Some other related but independent works include the iterative model reference adaptive control framework presented in Zhou et al. (2011) and the direct policy updating framework (Hadjiosif et al. 2021). The main difference between these models and the RL/ADP models is that the former are not based on the optimal control framework.

Noise exists at all levels of sensorimotor interactions (Parker et al. 2002; Orbán and Wolpert 2011). Sensory inputs are noisy, which limits the accuracy of our perception. Motor commands are also noisy, which leads to inaccurate movements. The noisy sensory inputs and motor commands further result in imprecise estimation of the state of the environment and of our body, and in ambiguity of the parameters that characterize the task (Orbán and Wolpert 2011). Although the CNS may use mechanisms in the style of Kalman filtering or Bayesian integration to minimize the effects (estimation errors) caused by the noise in sensorimotor interactions (Parker et al. 2002; Körding and Wolpert 2004, 2006; Wolpert 2007; Orbán and Wolpert 2011; Bach and Dolan 2012), the estimation errors can never be completely suppressed (Sternad et al. 2011; Acerbi et al. 2017). A natural question to ask is then: How does the CNS manage to learn near-optimal policies or adapt to a new environment in the presence of estimation errors? Motor adaptation and learning often involve iterative (trial-by-trial) improvement processes (Haith and Krakauer 2013). From the computational perspective, even small errors in such iterative processes may accumulate or be amplified over the iterations and finally cause divergence or failure of the process (Bertsekas 2011). Thus, it is a nontrivial question why the learning performance of the CNS is not affected by the estimation errors, or equivalently, why the learning of the CNS is robust to estimation errors in the learning process. This question is not addressed in most of the model-based mechanisms mentioned previously, since the internal models are often assumed to be perfect and accurate. Although the effects of the parameter estimation errors for the uncertain force fields are explicitly investigated in the computational models proposed in Mistry et al. (2013) and Crevecoeur et al. (2020), partial knowledge of the internal model is still required there. Most of the model-free mechanisms mentioned in the last paragraph offer no formal theoretical treatment of this issue either. The computational models proposed in Zhou et al. (2011) and Hadjiosif et al. (2021) are able to adjust the control policies iteratively without formulating an intermediate internal model, by utilizing the estimation errors of the sensory output through model reference adaptive control and direct policy updating, respectively. However, these models do not reflect the objective of minimizing the metabolic cost, which can be naturally embedded in the optimal control framework and is widely deemed to be one of the underlying principles that the CNS follows in choosing control policies (Todorov and Jordan 2002; Liu and Todorov 2007; Burdet et al. 2001; Franklin et al. 2008; Selen et al. 2009; Franklin and Wolpert 2011).

With the above discussions in mind, in this paper we argue that our recently developed robust reinforcement learning theory (Bian and Jiang 2019; Pang et al. 2021) provides a candidate model-free adaptive optimal control principle for explaining the robustness features observed in human motor adaptation and learning. Previous theoretical studies of RL and ADP often implicitly assume that the algorithms can be implemented or solved exactly without any estimation errors, which is a strong assumption since model uncertainties and noisy data are common in reality. In Bian and Jiang (2019) and Pang et al. (2021), it is shown by theoretical analysis that value iteration and policy iteration, two main classes of learning algorithms in RL and ADP, are robust to errors in the learning process aimed at solving a linear quadratic regulator (LQR) problem. More concretely, in Pang et al. (2021) we prove that the policy iteration algorithm is small-disturbance input-to-state stable. In other words, whenever the estimation error in each iteration is bounded and small, the solutions of the policy iteration algorithm are also bounded and enter a small neighborhood of the optimal solution of the LQR problem. In light of this robustness result, we propose in this paper a novel model-free computational model, named optimistic least-squares policy iteration (O-LSPI), to explain the robustness of the learning and adaptation of the CNS in the arm reaching task. We demonstrate through numerical studies that although the unmeasurable control-dependent noise in the human arm movement model introduces estimation errors into the learning algorithm, O-LSPI is still capable of finding near-optimal policies in different force fields and producing results nearly identical to those observed in the experiments conducted by Burdet et al. (2001, 2006) and Franklin et al. (2003).

The rest of this paper is organized as follows: Sect. 2 introduces robust reinforcement learning, or more precisely the O-LSPI algorithm, as a novel computational principle of human movement, in the context of the general LQR problem with control-dependent stochastic noise. In Sect. 3, the proposed O-LSPI algorithm is applied to the human arm movement model to reproduce the arm reaching task in simulation. Section 4 presents some discussions about the proposed mechanism. Section 5 closes the paper with some concluding remarks.

2 Theory of robust reinforcement learning

2.1 Problem formulation

Consider the linear stochastic system with control-dependent noise

$$\begin{aligned} {\mathrm{d}}x = (Ax + Bu){\mathrm{d}}t + B\sum _{k=1}^{q} C_ku{\mathrm{d}}w_k, \end{aligned}$$
(1)

where \(A\in {\mathbb {R}}^{n\times n}\) and \(B\in {\mathbb {R}}^{n\times m}\) are constant matrices describing the system dynamics, \(u\in {\mathbb {R}}^m\) is the control signal, \(C_k\in {\mathbb {R}}^{m\times m}\) is the gain matrix of the control-dependent noise, and \(w_k\), \(k = 1, 2, \ldots , q\), are independent one-dimensional Brownian motions. It is assumed that the pair \((A, B)\) is controllable. The control-dependent noise in (1) is used to capture the psychophysical observation that the variability of motor errors increases with the magnitude of the movement (Harris and Wolpert 1998; Liu and Todorov 2007). Although the actual human arm system is nonlinear due to the complex behaviors of its components, e.g., tendons and muscles, as demonstrated in Harris and Wolpert (1998), Liu and Todorov (2007), Zhou et al. (2011), Crevecoeur et al. (2020) and Mistry et al. (2013), for the simple arm reaching task considered in this paper (introduced in detail in the next section), the nonlinear dynamics can be linearized (Khalil 2002) and well approximated by the linear dynamics (1).

Following Todorov and Jordan (2002) and Liu and Todorov (2007), the optimal control problem is to find an optimal control policy minimizing the following cost with respect to the nominal system of (1), i.e., (1) without the control-dependent noise,

$$\begin{aligned} J(x(0),u) = \int _0^{\infty } \left( x^{\mathrm{T}}Qx + u^{\mathrm{T}}Ru\right) {\mathrm{d}}t, \end{aligned}$$
(2)

where \(Q\in {\mathbb {S}}^n\) and \(R\in {\mathbb {S}}^m\) are positive definite constant weighting matrices, with \({\mathbb {S}}^n\) denoting the set of all real symmetric matrices of order n. It is well known (Liberzon 2012, Section 6.2.2) that the optimal control policy is \(u^* = -K^*x\), where \(K^* = R^{-1}B^{\mathrm{T}}P^*\) and \(P^*\in {\mathbb {S}}^n\) is the unique positive definite solution of the algebraic Riccati equation (ARE)

$$\begin{aligned} A^{\mathrm{T}}P + PA + Q - PBR^{-1}B^{\mathrm{T}}P = 0. \end{aligned}$$
(3)

In addition, \(K^*\) is stabilizing in the sense that \(A-BK^*\) is Hurwitz, i.e., all its eigenvalues have negative real parts. Define

$$\begin{aligned} \begin{aligned} {\mathcal {A}}(K)&= I_n\otimes (A-BK)^{\mathrm{T}} + (A-BK)^{\mathrm{T}}\otimes I_n \\&\quad + \sum _{k=1}^q (BC_kK)^{\mathrm{T}}\otimes (BC_kK)^{\mathrm{T}}. \end{aligned} \end{aligned}$$

As can be directly checked (Kleinman 1969), the system (1) in closed loop with \(u= -Kx\) is mean-square stable in the sense of Willems and Willems (1976, Definition 1) if \({\mathcal {A}}(K)\) is Hurwitz. In particular, if K is stabilizing and the gain matrices \(C_k\) of the control-dependent noise are small enough, then \({\mathcal {A}}(K)\) is Hurwitz.
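As a quick numerical illustration of this setup (a sketch with toy matrices of our own choosing, not data from the paper), the optimal gain \(K^*\) can be obtained from the ARE (3) with scipy.linalg.solve_continuous_are, and mean-square stability of a closed loop can be screened by testing whether \({\mathcal {A}}(K)\) is Hurwitz:

```python
# Illustrative sketch only: toy matrices, not from the paper.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [0.0, 0.0]])      # assumed n = 2, m = 1, q = 1
B = np.array([[0.0], [1.0]])
C = [np.array([[0.1]])]                     # control-dependent noise gains C_k
Q, R = np.eye(2), np.array([[1.0]])

# Optimal gain from the ARE (3): K* = R^{-1} B^T P*.
P_star = solve_continuous_are(A, B, Q, R)
K_star = np.linalg.solve(R, B.T @ P_star)

def calA(K):
    """The operator A(K); if it is Hurwitz, the closed loop is mean-square stable."""
    n = A.shape[0]
    Acl_T = (A - B @ K).T
    M = np.kron(np.eye(n), Acl_T) + np.kron(Acl_T, np.eye(n))
    for Ck in C:
        BCK_T = (B @ Ck @ K).T
        M += np.kron(BCK_T, BCK_T)
    return M

print("K* =", K_star)
print("A - B K* Hurwitz:", np.all(np.linalg.eigvals(A - B @ K_star).real < 0))
print("A(K*) Hurwitz:", np.all(np.linalg.eigvals(calA(K_star)).real < 0))
```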

2.2 Policy iteration

Notice that (3) is a nonlinear matrix equation in P, which is difficult to solve directly. Policy iteration is an iterative method that finds \(P^*\) by successively solving a sequence of transformed linear matrix equations.

For any \(P\in {\mathbb {S}}^n\) and any \(K\in {\mathbb {R}}^{m\times n}\), define

$$\begin{aligned} \begin{aligned} G(P)&\triangleq \left[ \begin{array}{cc} Q + A^{\mathrm{T}}P + PA &{} PB \\ B^{\mathrm{T}}P &{} R \end{array}\right] \\&= \left[ \begin{array}{c|c} [G(P)]_{xx} &{} [G(P)]_{ux}^{\mathrm{T}} \\ \hline [G(P)]_{ux} &{} [G(P)]_{uu} \end{array} \right] . \end{aligned} \end{aligned}$$

and

$$\begin{aligned} {\mathcal {H}}(G(P),K) = \left[ \begin{array}{cc} I_n&-K^{\mathrm{T}} \end{array}\right] G(P)\left[ \begin{array}{c} I_n \\ -K \end{array}\right] . \end{aligned}$$

The following policy iteration method was originally presented in Kleinman (1968).

Algorithm 1

(Kleinman’s Policy Iteration)

  1. (1)

    Choose a stabilizing control gain \(K_1\), and let \(i=1\).

  2. (2)

    (Policy evaluation) Evaluate the performance of control gain \(K_i\), by solving

    $$\begin{aligned} {\mathcal {H}}(G_i,K_i)=0 \end{aligned}$$
    (4)

    for \(P_i\in {\mathbb {S}}^n\), where \(G_i \triangleq G(P_i)\).

  3. (3)

    (Policy improvement) Get the improved policy by

    $$\begin{aligned} K_{i+1} = [G_i]_{uu}^{-1}[G_i]_{ux}. \end{aligned}$$
    (5)
  4. (4)

    Set \(i\leftarrow i+1\) and go back to Step (2).

The following properties were proved in Kleinman (1968).

  1. (i)

    \(A-BK_i\) is Hurwitz for all \(i=1,2,\ldots \).

  2. (ii)

    \(P_1\ge P_2 \ge P_3\ge \cdots \ge P^*\).

  3. (iii)

    \(\lim _{i\rightarrow \infty }P_i=P^*\), \(\lim _{i\rightarrow \infty }K_i = K^*\).
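To make Algorithm 1 and these properties concrete, the following minimal sketch runs Kleinman's policy iteration on a toy double-integrator LQR problem (the matrices are illustrative choices of ours, not taken from the paper); the policy-evaluation step (4) reduces to a Lyapunov equation and is solved here with scipy.linalg.solve_continuous_lyapunov:

```python
# Minimal sketch of Algorithm 1 on an illustrative LQR problem (not the paper's data).
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

K = np.array([[11.0, 9.0]])          # any stabilizing initial gain K_1
for i in range(10):
    # Policy evaluation (4): (A - BK)^T P + P(A - BK) + Q + K^T R K = 0.
    P = solve_continuous_lyapunov((A - B @ K).T, -(Q + K.T @ R @ K))
    # Policy improvement (5): K_{i+1} = [G_i]_{uu}^{-1}[G_i]_{ux} = R^{-1} B^T P_i.
    K = np.linalg.solve(R, B.T @ P)

K_star = np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R))
print("||K_10 - K*||_F =", np.linalg.norm(K - K_star))   # close to machine precision
```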

2.3 Robust policy iteration

Clearly, the implementation of Kleinman’s policy iteration algorithm relies upon the exact knowledge of the system matrices A and B. In the absence of the exact knowledge of A and B, only an estimate of G in Algorithm 1 obtained from data can be used, which leads to the following inexact, yet implementable, policy iteration algorithm:

Algorithm 2

(Inexact Policy Iteration)

  1. (1)

    Choose a stabilizing control gain \(\hat{K}_1\), and let \(i=1\).

  2. (2)

    (Inexact policy evaluation) Obtain \(\hat{G}_i\in {\mathbb {S}}^{m+n}\) as an approximation of \(G(\hat{P}_i)\), where \(\hat{P}_i\) is the solution of

    $$\begin{aligned} {\mathcal {H}}(G(\hat{P}_i),\hat{K}_i)=0. \end{aligned}$$
    (6)
  3. (3)

    (Policy update) Construct a new control gain

    $$\begin{aligned} \hat{K}_{i+1} = [\hat{G}_i]_{uu}^{-1}[\hat{G}_i]_{ux}. \end{aligned}$$
    (7)
  4. (4)

    Set \(i\leftarrow i+1\) and go back to Step (2).

With the error \(\Delta G_i \triangleq \hat{G}_i - G(\hat{P}_i)\) in each iteration, the sequences \(\{\hat{G}_i\}_{i=1}^\infty \) and \(\{\hat{K}_i\}_{i=1}^\infty \) generated by Algorithm 2 would be different from the sequences \(\{G_i\}_{i=1}^\infty \) and \(\{K_i\}_{i=1}^\infty \) generated by Algorithm 1. Thus, a natural question to ask is: Is policy iteration robust to the errors in the learning process? In other words, in the presence of error \(\Delta G_i\), when will \(\hat{K}_i\) still converge to a small neighborhood of \(K^*\)? In our recent work (Pang et al. 2021), we provide an answer to this question, as shown in the following theorem.

Theorem 1

For any given stabilizing control gain \(\hat{K}_1\) and any \(\epsilon >0\), there exists \(\delta (\epsilon ,\hat{K}_1)>0\), such that if \(Q>0\) and \(\Vert \Delta G \Vert _\infty < \delta \),

  1. (i)

    \([\hat{G}_i]_{uu}\) is invertible, \(\hat{K}_i\) is stabilizing, \(\Vert \hat{K}_i\Vert _F<M_0\), \(\forall i\in {\mathbb {Z}}_+\), \(i>0\), where \(M_0(\delta )>0\).

  2. (ii)

    \(\limsup _{i\rightarrow \infty } \Vert \hat{K}_i-K^* \Vert _F<\epsilon \).

  3. (iii)

    \(\lim _{i\rightarrow \infty } \Vert \Delta G_i\Vert _F = 0\) implies \(\lim _{i\rightarrow \infty } \Vert \hat{K}_i-K^* \Vert _F=0\).

Intuitively, Theorem 1 implies that in Algorithm 2, if the error signal \(\Delta G\) is bounded and not too large, then the generated control policy \(\hat{K}_i\) is also bounded and will ultimately be in a neighborhood of the optimal policy \(K^*\) whose size is proportional to the \(l^\infty \)-norm of the error signal. The smaller the error, the better the ultimately generated policy. In other words, Algorithm 2 is not sensitive to small errors in the learning process.
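This robustness property can be probed numerically. The sketch below (toy double-integrator data of our choosing, not from the paper) runs Algorithm 2 with an artificial bounded symmetric error injected into each policy evaluation and reports how far the final gain ends up from \(K^*\); the gap should grow gracefully with the error bound, as Theorem 1 predicts:

```python
# Sketch of Algorithm 2 with injected bounded errors Delta_G_i (illustrative data only).
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
n, m = 2, 1
K_star = np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R))

def inexact_pi(delta, iters=30):
    """Policy iteration with a bounded symmetric error of size <= delta added to G_i."""
    K = np.array([[11.0, 9.0]])
    for _ in range(iters):
        P = solve_continuous_lyapunov((A - B @ K).T, -(Q + K.T @ R @ K))
        G = np.block([[Q + A.T @ P + P @ A, P @ B], [B.T @ P, R]])
        E = rng.uniform(-delta, delta, size=(n + m, n + m))
        G_hat = G + (E + E.T) / 2
        # Policy update (7): K_{i+1} = [G_hat]_{uu}^{-1}[G_hat]_{ux}.
        K = np.linalg.solve(G_hat[n:, n:], G_hat[n:, :n])
    return np.linalg.norm(K - K_star)

for delta in [0.0, 0.01, 0.1]:
    print(f"error bound {delta:5.2f}  ->  final ||K_i - K*||_F = {inexact_pi(delta):.4f}")
```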

2.4 Optimistic least-squares policy iteration

This subsection presents a specific method to construct the estimate \(\hat{G}_i\) in Step (2) of Algorithm 2 from the input/state data generated by system (1) (the sensory data generated in the sensorimotor interactions), without knowledge of the system matrices A and B, the gain matrices \(\{C_k\}_{k=1}^q\), or the control-dependent noise. Thus, the resulting Algorithm 3, named optimistic least-squares policy iteration (O-LSPI), is a novel model-free computational mechanism and an instantiation of Algorithm 2.

The O-LSPI is based on the following lemma.

Lemma 1

For any stabilizing control gain K, its associated \(P_K\) satisfying (4) is the unique stable equilibrium of the linear dynamical system

$$\begin{aligned} \dot{P} = {\mathcal {H}}(G(P),K), \quad P(0)\in {\mathbb {S}}^n, \end{aligned}$$
(8)

and \(\lim \limits _{t\rightarrow \infty } G(P(t)) = G(P_K)\).

Proof

Vectorizing (8), we have

$$\begin{aligned} \begin{aligned} {{\,\mathrm{vec}\,}}(\dot{P})&= \left( I_n\otimes (A-BK)^{\mathrm{T}} + (A-BK)^{\mathrm{T}}\otimes I_n\right) \\&\times {{\,\mathrm{vec}\,}}(P) + {{\,\mathrm{vec}\,}}(Q+K^{\mathrm{T}}RK). \end{aligned} \end{aligned}$$
(9)

Since \(A-BK\) is Hurwitz, the matrix \(I_n\otimes (A-BK)^{\mathrm{T}} + (A-BK)^{\mathrm{T}}\otimes I_n\) is also Hurwitz (its eigenvalues are pairwise sums of the eigenvalues of \(A-BK\)), so this linear dynamical system admits a unique equilibrium \(P_K\), which is globally asymptotically stable. \(\square \)

Lemma 1 implies that in Algorithm 1, instead of solving (4), one can solve the ODE (8). This is actually the continuous-time version of the optimistic policy iteration in Tsitsiklis (2002) and Bertsekas (2011) for finite state and action spaces (thus the name “optimistic”). Lemma 1 is a well-known result in control theory (Mori et al. 1986), where (4) and (8) are in fact the algebraic Lyapunov matrix equation and the Lyapunov matrix differential equation, respectively.
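Lemma 1 can be checked numerically. The following sketch (again with the illustrative double-integrator matrices used earlier, which are our own choice) integrates the Lyapunov ODE (8) with scipy.integrate.solve_ivp and compares the terminal value with the solution of the algebraic equation (4):

```python
# Sketch of Lemma 1: the Lyapunov ODE (8) converges to the solution P_K of (4).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[11.0, 9.0]])          # a stabilizing gain
Acl = A - B @ K
n = A.shape[0]

def ode(t, p):
    P = p.reshape(n, n)
    # H(G(P), K) = (A - BK)^T P + P(A - BK) + Q + K^T R K
    return (Acl.T @ P + P @ Acl + Q + K.T @ R @ K).ravel()

sol = solve_ivp(ode, (0.0, 10.0), np.zeros(n * n), rtol=1e-8, atol=1e-10)
P_ode = sol.y[:, -1].reshape(n, n)
P_K = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
print("||P(10) - P_K||_F =", np.linalg.norm(P_ode - P_K))   # should be tiny
```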

Now, we show how \(\hat{G}_i\) in Algorithm 2 can be estimated by the CNS directly from the sensory data, i.e., the input/state data collected from system (1), based on (8) and least squares. Suppose that in the i-th iteration, a control policy

$$\begin{aligned} u_i = -\hat{K}_ix + y \end{aligned}$$
(10)

is applied to the system (1) to collect data, where \(y\in {\mathbb {R}}^m\) is the exploration noise. For any \(\bar{P}_i\in {\mathbb {S}}^n\), Itô's formula (Pavliotis 2014, Lemma 3.2) yields

$$\begin{aligned} \begin{aligned} {\mathrm{d}}(x^{\mathrm{T}}\bar{P}_ix)&= 2x^{\mathrm{T}}\bar{P}_i(Ax+Bu){\mathrm{d}}t + u^{\mathrm{T}}\varSigma (\bar{P}_i)u{\mathrm{d}}t \\&\quad + 2x^{\mathrm{T}}\bar{P}_iB\sum _{k=1}^q C_ku{\mathrm{d}}w_k, \end{aligned} \end{aligned}$$

where \(\varSigma (\bar{P}_i) = \sum _{k=1}^qC^{\mathrm{T}}_kB^{\mathrm{T}}\bar{P}_iBC_k\). Define \(t_j = j\Delta t\), where \(j = 0,1,\ldots ,M\), \(\Delta t > 0\) and M is a positive integer. Integrating the above equation from \(t_j\) to \(t_{j+1}\), we have

$$\begin{aligned}&x^{\mathrm{T}}(t_{j+1})\bar{P}_ix(t_{j+1}) - x^{\mathrm{T}}(t_j)\bar{P}_ix(t_j) \\&= \int _{t_j}^{t_{j+1}} z^{\mathrm{T}}\theta (\bar{P}_i)z{\mathrm{d}}t + \int _{t_j}^{t_{j+1}} 2x^{\mathrm{T}}\bar{P}_iB\sum _{k=1}^q C_ku{\mathrm{d}}w_k, \nonumber \end{aligned}$$
(11)

where \(z = [x^{\mathrm{T}}, u^{\mathrm{T}}]^{\mathrm{T}}\) and

$$\begin{aligned} \begin{aligned} \theta (\bar{P}_i)&\triangleq \left[ \begin{array}{cc} A^{\mathrm{T}}\bar{P}_i + \bar{P}_iA &{}\quad \bar{P}_iB \\ B^{\mathrm{T}}\bar{P}_i &{}\quad \varSigma (\bar{P}_i) \end{array}\right] \\&= \left[ \begin{array}{c|c} [\theta (\bar{P}_i)]_{xx} &{} [\theta (\bar{P}_i)]_{ux}^{\mathrm{T}} \\ \hline [\theta (\bar{P}_i)]_{ux} &{} [\theta (\bar{P}_i)]_{uu} \end{array} \right] . \end{aligned} \end{aligned}$$

Taking the expectation on both sides of (11) yields

$$\begin{aligned} (X_{j+1}-X_{j})^{\mathrm{T}}{{\,\mathrm{svec}\,}}(\bar{P}_i) = Z^{\mathrm{T}}_j {{\,\mathrm{svec}\,}}(\theta (\bar{P}_i)), \end{aligned}$$
(12)

where for any \(Y\in {\mathbb {S}}^m\)

$$\begin{aligned} \begin{aligned} {{\,\mathrm{svec}\,}}(Y)&= [y_{11},\sqrt{2}y_{12},\ldots ,\sqrt{2}y_{1m},y_{22},\sqrt{2}y_{23},\\&\quad \ldots ,\sqrt{2}y_{m-1,m},y_{m,m}]^{\mathrm{T}}\in {\mathbb {R}}^{\frac{1}{2}m(m+1)}, \\ X_j&= {\mathbb {E}}[{{\,\mathrm{svec}\,}}(x(t_j)x^{\mathrm{T}}(t_j))],\\ Z_j&= {\mathbb {E}}\left[ \int _{t_j}^{t_{j+1}}{{\,\mathrm{svec}\,}}(zz^{\mathrm{T}}){\mathrm{d}}t\right] . \end{aligned} \end{aligned}$$

Rewriting the above linear equations (12) for \(j=0,\ldots ,M-1\) into a compact form, we obtain

$$\begin{aligned} \varPhi _{i,M}{{\,\mathrm{svec}\,}}(\theta (\bar{P}_i)) = \varPsi _{i,M}{{\,\mathrm{svec}\,}}(\bar{P}_i), \end{aligned}$$
(13)

where

$$\begin{aligned} \begin{aligned} \varPhi _{i,M}&= [Z_0,Z_1,\ldots ,Z_{M-1}]^{\mathrm{T}},\\ \varPsi _{i,M}&= [X_1-X_0, X_2-X_1,\ldots ,X_M-X_{M-1}]^{\mathrm{T}}, \end{aligned} \end{aligned}$$

and the subscript i of \(\varPhi \) and \(\varPsi \) is used to emphasize that (10) is the control policy being applied. The following assumption is made.

Assumption 1

\(\varPhi _{i,M}\) has full column rank.

Remark 1

Assumption 1 is in the spirit of the persistent excitation condition in adaptive control (Åström and Wittenmark 1995). Similar assumptions are needed in other RL methods, see Jiang and Jiang (2014, 2017), Kamalapurkar et al. (2018), Kiumarsi et al. (2018), Bian et al. (2016, 2020), Bian and Jiang (2019), Pang and Jiang (2020, 2021), Pang et al. (2019, 2020). Assumption 1 makes the data-based differential equation (18) a good approximation of the model-based differential equation (8), which is a key component in the convergence analysis of the O-LSPI in the sequel. The presence of the exploration noise y is necessary for Assumption 1; otherwise, u would always be linearly dependent on x.
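To make the \({{\,\mathrm{svec}\,}}\) notation and Assumption 1 concrete, here is a small helper sketch (the function names are ours, introduced only for illustration): it implements \({{\,\mathrm{svec}\,}}\) as defined above, verifies that it preserves inner products of symmetric matrices, and checks the full-column-rank condition of Assumption 1 on a given data matrix.

```python
# Helper sketch: svec as defined in the text, plus the rank check of Assumption 1.
import numpy as np

def svec(Y):
    """Stack the upper triangle row by row, off-diagonal entries scaled by sqrt(2)."""
    d = Y.shape[0]
    return np.concatenate([np.r_[Y[i, i], np.sqrt(2.0) * Y[i, i + 1:]] for i in range(d)])

def satisfies_assumption_1(Phi):
    """Assumption 1: the data matrix Phi_{i,M} has full column rank."""
    return np.linalg.matrix_rank(Phi) == Phi.shape[1]

# svec preserves the Frobenius inner product: tr(P X) = svec(X)^T svec(P).
rng = np.random.default_rng(1)
P = rng.standard_normal((3, 3)); P = (P + P.T) / 2
X = rng.standard_normal((3, 3)); X = (X + X.T) / 2
assert np.isclose(np.trace(P @ X), svec(X) @ svec(P))
```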

Under Assumption 1, (13) can be rewritten as

$$\begin{aligned} {{\,\mathrm{svec}\,}}(\theta (\bar{P}_i)) = \varPhi _{i,M}^\dagger \varPsi _{i,M}{{\,\mathrm{svec}\,}}(\bar{P}_i), \end{aligned}$$
(14)

where \(\varPhi _{i,M}^\dagger \) denotes the Moore–Penrose pseudoinverse of \(\varPhi _{i,M}\). Notice that (8) can be rewritten as

$$\begin{aligned} \dot{\bar{P}}_i = {\mathcal {H}}(\underbrace{\theta (\bar{P}_i) - 0_n\oplus [\theta (\bar{P}_i)]_{uu} + Q\oplus R}_{=G(\bar{P}_i)},K) \end{aligned}$$
(15)

with \(\bar{P}_i(0)\in {\mathbb {S}}^n\), where

$$\begin{aligned} \begin{aligned} 0_n\oplus [\theta (\bar{P}_i)]_{uu}&= \left[ \begin{array}{ll} 0_n &{}\quad 0_{n\times m} \\ 0_{m\times n} &{}\quad [\theta (\bar{P}_i)]_{uu} \end{array}\right] , \\ Q \oplus R&= \left[ \begin{array}{ll} Q &{}\quad 0_{n\times m} \\ 0_{m\times n} &{}\quad R \end{array}\right] . \end{aligned} \end{aligned}$$

If (14) is inserted into (15), then (15) only depends on the data-based matrices \(\varPhi _{i,M}\) and \(\varPsi _{i,M}\), i.e., the precise knowledge of the system matrices A, B and \(\{C_k\}_{k=1}^q\) is not needed. However, the expectations in \(\varPhi _{i,M}\) and \(\varPsi _{i,M}\) are not known directly. Thus, they need to be estimated from the data. Suppose that in total N trajectories of state and input data of length \(t_M\) are collected along the solutions of system (1) under control policy (10). Then, we can construct an approximation \(\hat{\theta }(\bar{P}_i)\) of \(\theta (\bar{P}_i)\) using

$$\begin{aligned} {{\,\mathrm{svec}\,}}(\hat{\theta }(\bar{P}_i)) = \hat{\varPhi }_{i,M,N}^\dagger \hat{\varPsi }_{i,M,N}{{\,\mathrm{svec}\,}}(\bar{P}_i), \end{aligned}$$
(16)

where

$$\begin{aligned} \begin{aligned}&\hat{\varPhi }_{i,M,N} = [\hat{Z}_{0,N},\hat{Z}_{1,N},\ldots ,\hat{Z}_{M-1,N}]^{\mathrm{T}},\\&\hat{\varPsi }_{i,M,N} = [\hat{X}_{1,N}-\hat{X}_{0,N}, \hat{X}_{2,N}-\hat{X}_{1,N},\\&\quad \qquad \qquad \ldots ,\hat{X}_{M,N}-\hat{X}_{M-1,N}]^{\mathrm{T}},\\&\hat{Z}_{j,N} = \frac{1}{N}\sum _{l=1}^N \int _{t_j}^{t_{j+1}}{{\,\mathrm{svec}\,}}\big (z^{(l)}[z^{(l)}]^{\mathrm{T}}\big ){\mathrm{d}}t,\ z^{(l)} = \left[ \begin{array}{c} x^{(l)} \\ u^{(l)} \end{array}\right] ,\quad j = 0,\ldots , M-1,\\&\hat{X}_{j,N} = \frac{1}{N}\sum _{l=1}^N {{\,\mathrm{svec}\,}}\big (x_{t_j}^{(l)}[x_{t_j}^{(l)}]^{\mathrm{T}}\big ),\quad j = 0,\ldots , M, \end{aligned} \end{aligned}$$

and \(x^{(l)}\), \(u^{(l)}\) are the l-th state trajectory and input trajectory, respectively. By the strong law of large numbers, almost surely

$$\begin{aligned} \lim \limits _{N\rightarrow \infty } \hat{\varPhi }_{i,M,N} = \varPhi _{i,M},\quad \lim \limits _{N\rightarrow \infty } \hat{\varPsi }_{i,M,N} = \varPsi _{i,M}. \end{aligned}$$
(17)

This implies that the solution of the following ordinary differential equation

$$\begin{aligned} \dot{\check{P}}_i = {\mathcal {H}}(\hat{\theta }(\check{P}_i) - 0_n\oplus [\hat{\theta }(\check{P}_i)]_{uu} + Q\oplus R,K), \end{aligned}$$
(18)

with \(\check{P}_i(0)\in {\mathbb {S}}^n\) would be close to the solution of (15), if N is large enough and \(\bar{P}_i(0) = \check{P}_i(0)\). Thus, by Lemma 1, \(\check{P}_i(s)\) and \(\hat{\theta }(\check{P}_i(s))\) would be close to \(\hat{P}_i\) and \(\theta (\hat{P}_i)\) in Algorithm 2, respectively, for N and \(s>0\) large enough. In view of the relationship between \(\theta (\cdot )\) and \(G(\cdot )\) in (15), an estimate \(\hat{G}_i\) of \(G(\hat{P}_i)\) can be constructed as

$$\begin{aligned} \hat{G}_i = \hat{\theta }(\check{P}_i(s))- 0_n\oplus [\hat{\theta }(\check{P}_i(s))]_{uu} + Q\oplus R \end{aligned}$$
(19)

in Algorithm 2. The O-LSPI algorithm is summarized in Algorithm 3.

Algorithm 3

(Optimistic Least-Squares Policy Iteration)

  1. (1)

    (Initialization) Choose a stabilizing control gain \(\hat{K}_1\), number of trajectories N, time step \(\Delta t>0\), length of samples M, length of policy evaluation \(s>0\), number of iterations \(\bar{I}\). Let \(i=1\).

  2. (2)

    Collect input/state data to construct data matrices \(\hat{\varPhi }_{i,M,N}\) and \(\hat{\varPsi }_{i,M,N}\) in (16), by applying control policy (10) to (1).

  3. (3)

    (Inexact policy evaluation) Obtain \(\hat{G}_i\) defined in (19) by solving the initial value problem of equation (18) on [0, s] with initial value \(\check{P}_i(0)=0_n\).

  4. (4)

    (Policy update) Construct a new control gain

    $$\begin{aligned} \hat{K}_{i+1} = [\hat{G}_i]_{uu}^{-1}[\hat{G}_i]_{ux}. \end{aligned}$$
    (20)
  5. (5)

    Set \(i\leftarrow i+1\). If \(i<\bar{I}\), go back to Step (2).

  6. (6)

    Use \(u_{\bar{I}} = -\hat{K}_{\bar{I}}x\) as the approximately optimal control policy.

It is worth emphasizing again that Algorithm 3 does not need the knowledge of system matrices A and B, gain matrices \(\{C_k\}_{k=1}^q\) and the control-dependent noise \(w_k\). Based on Theorem 1, the convergence of Algorithm 3 is given in the following theorem, whose proof is given in “Appendix 1.”

Theorem 2

For any given stabilizing control gain \(\hat{K}_1\) and \(\epsilon _1>0\), there exist an integer \(N_0\), an integer \(\bar{I}\) and a constant \(s_0>0\), such that if Assumption 1 is satisfied for all \(i=1,\ldots ,\bar{I}\), then for any \(N>N_0\) and \(s>s_0\), almost surely

$$\begin{aligned} \Vert \hat{K}_{\bar{I}} - K^*\Vert _F<\epsilon _1, \end{aligned}$$

and \(\hat{K}_i\) is stabilizing for all \(i=1,\ldots ,\bar{I}\).

2.5 An example: double integrator

To verify the effectiveness of the O-LSPI algorithm and its convergence result Theorem 2, consider a double integrator disturbed by control-dependent stochastic noise,

$$\begin{aligned} {\mathrm{d}}x = \left[ \begin{array}{cc} 0 &{}\quad 1 \\ 0 &{}\quad 0 \end{array}\right] x{\mathrm{d}}t + \left[ \begin{array}{c} 0 \\ 1 \end{array}\right] (u{\mathrm{d}}t + 0.1u{\mathrm{d}}w_1). \end{aligned}$$
(21)

It is assumed that the system parameters in (21) are unknown and the stochastic noise \(w_1\) is unmeasurable, but an initial stabilizing control gain \(\hat{K}_1 = [ 11, 9]\) is available. Setting \(Q=I_2\) and \(R = 1\), we run O-LSPI with parameters \(N={10}^4\), \(\Delta t = 0.05\), \(M=7\), \(s=10\), \(\bar{I} = 10\), and exploration noise

$$\begin{aligned} y = \sum _{j=1}^{100} \sin (\eta _j t), \end{aligned}$$

where \(\{\eta _j\}_{j=1}^{100}\) are independently drawn from the Gaussian distribution with mean \(-250\) and standard deviation 500. The simulation results are shown in Fig. 1, where the relative errors of \(\hat{K}_i\) and \(\hat{P}_i\) with respect to their optimal values \(K^*\) and \(P^*\) are plotted. One can see that the relative errors converge to a small neighborhood of zero, which implies that \(\hat{K}_i\) and \(\hat{P}_i\) converge to small neighborhoods of their respective optimal values.
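For readers who want to experiment with the scheme, the following compact Python sketch implements O-LSPI (Algorithm 3) on the double integrator (21). The parameters stated in the text (\(\Delta t = 0.05\), \(M=7\), \(s=10\), the exploration noise, the initial gain) are used, whereas the initial-state spread, the Euler–Maruyama sub-step, and the smaller number of trajectories N are assumptions made to keep the sketch fast; it is an illustration of the algorithm, not a reproduction of the authors' implementation.

```python
# Illustrative O-LSPI sketch on the double integrator (21); assumptions noted in comments.
import numpy as np
from scipy.linalg import solve_continuous_are
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C1 = np.array([[0.1]])
Q, R = np.eye(2), np.array([[1.0]])
n, m = 2, 1
dP, dz = n * (n + 1) // 2, (n + m) * (n + m + 1) // 2

def svec(Y):
    return np.concatenate([np.r_[Y[i, i], np.sqrt(2) * Y[i, i + 1:]] for i in range(len(Y))])

def smat(v, d):
    Y, k = np.zeros((d, d)), 0
    for i in range(d):
        Y[i, i] = v[k]; k += 1
        Y[i, i + 1:] = Y[i + 1:, i] = v[k:k + d - i - 1] / np.sqrt(2); k += d - i - 1
    return Y

def collect(K, N=2000, M=7, dt=0.05, sub=50):
    """Simulate N trajectories of (21) under u = -Kx + y and build the matrices in (16)."""
    eta = rng.normal(-250.0, 500.0, size=100)        # exploration-noise frequencies
    h = dt / sub                                     # assumed Euler-Maruyama sub-step
    x = rng.normal(0.0, 0.5, size=(N, n))            # assumed initial-state spread
    X, Phi = [svec(x.T @ x / N)], []
    for j in range(M):
        Zj = np.zeros(dz)
        for k in range(sub):
            y = np.sin(eta * (j * dt + k * h)).sum() # exploration noise in (10)
            u = -x @ K.T + y
            z = np.hstack([x, u])
            Zj += svec(z.T @ z / N) * h
            dw = rng.standard_normal((N, 1)) * np.sqrt(h)
            x = x + (x @ A.T + u @ B.T) * h + (u @ C1.T * dw) @ B.T
        Phi.append(Zj)
        X.append(svec(x.T @ x / N))
    return np.array(Phi), np.diff(np.array(X), axis=0)   # Phi_hat, Psi_hat

def olspi_step(K, s=10.0):
    Phi, Psi = collect(K)                            # Assumption 1: Phi has full column rank
    W = np.linalg.pinv(Phi) @ Psi                    # maps svec(P) to svec(theta_hat(P)), cf. (16)
    def ode(t, p):                                   # data-based ODE (18)
        theta = smat(W @ p, n + m)
        G = theta.copy()
        G[:n, :n] += Q                               # G = theta - 0 (+) theta_uu + Q (+) R
        G[n:, n:] = R
        H = np.block([np.eye(n), -K.T]) @ G @ np.vstack([np.eye(n), -K])
        return svec(H)
    p = solve_ivp(ode, (0.0, s), np.zeros(dP), rtol=1e-6, atol=1e-9).y[:, -1]
    theta = smat(W @ p, n + m)
    return np.linalg.solve(R, theta[n:, :n])         # policy update (20)

K = np.array([[11.0, 9.0]])                          # initial stabilizing gain K_1
for i in range(10):
    K = olspi_step(K)
K_star = np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R))
print("relative error of the learned gain:", np.linalg.norm(K - K_star) / np.linalg.norm(K_star))
```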

Fig. 1 Simulation results on the double integrator

3 Numerical studies in motor learning and control

3.1 Human arm movement model

We consider the sensorimotor control tasks studied in Harris and Wolpert (1998), Burdet et al. (2001, 2006) and Franklin et al. (2003), where human subjects make point-to-point reaching movements in the horizontal plane. The objective is to reproduce similar results to those observed from experiments in Burdet et al. (2001, 2006) and Franklin et al. (2003), using the proposed O-LSPI algorithm.

In our numerical experiment, the dynamics of the arm are simplified to a point-mass model (Liu and Todorov 2007) as shown below,

$$\begin{aligned} \begin{aligned} \dot{p}&= v, \\ m\dot{v}&= a - bv + f, \\ \tau \dot{a}&= u - a + C_1u\xi _1 + C_2u\xi _2, \end{aligned} \end{aligned}$$
(22)

where \(p=[p_x,p_y]^{\mathrm{T}}\), \(v=[v_x,v_y]^{\mathrm{T}}\), \(a=[a_x,a_y]^{\mathrm{T}}\), \(u=[u_x,u_y]^{\mathrm{T}}\), \(f=[f_x,f_y]^{\mathrm{T}}\) are the two-dimensional hand position, velocity, actuator state, control input and external force generated from the force fields, respectively; m denotes the mass of the hand; b is the viscosity constant; \(\tau \) is the time constant; \(\xi _1\) and \(\xi _2\) are Gaussian white noises (Pavliotis 2014); and

$$\begin{aligned} C_1 = \left[ \begin{array}{cc} c_1 &{}\quad 0 \\ c_2 &{}\quad 0 \end{array}\right] , \qquad C_2 = \left[ \begin{array}{cc} 0 &{}\quad c_2 \\ 0 &{}\quad c_1 \end{array}\right] \end{aligned}$$

are gain matrices of the control-dependent noise (Harris and Wolpert 1998). To fit this model into the optimal control problem formulated in Sect. 2.1, (22) is rewritten in state-space form,

$$\begin{aligned} {\mathrm{d}}x = (Ax + Bu){\mathrm{d}}t + B(C_1u{\mathrm{d}}w_1 + C_2u{\mathrm{d}}w_2) + Df{\mathrm{d}}t, \end{aligned}$$
(23)

where \(w_1\) and \(w_2\) are one-dimensional standard Brownian motions, and

$$\begin{aligned} \begin{aligned} x&= \left[ \begin{array}{c} p \\ v \\ a \end{array}\right] ,\ \ A = \left[ \begin{array}{ccc} 0_2 &{}\quad I_2 &{}\quad 0_2 \\ 0_2 &{}\quad -\frac{b}{m}I_2 &{}\quad \frac{1}{m}I_2\\ 0_2 &{}\quad 0_2 &{}\quad -\frac{1}{\tau }I_2 \end{array}\right] ,\\ B&= \left[ \begin{array}{c} 0_2 \\ 0_2 \\ \frac{1}{\tau }I_2 \end{array}\right] , \ \ D = \left[ \begin{array}{c} 0_{2} \\ I_2 \\ 0_2 \end{array}\right] . \end{aligned} \end{aligned}$$

The weighting matrices in cost function (2) are chosen as

$$\begin{aligned} Q= & {} \left[ \begin{array}{cccccc} 2000 &{}\quad -40 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ -40 &{}\quad 1000 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 20 &{}\quad -1 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad -1 &{}\quad 20 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0.01 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0.01 \end{array}\right] ,\\&\quad R = 0.01I_2. \end{aligned}$$
Table 1 Parameters of the arm movement model
Fig. 2 Simulated movement trajectories generated by O-LSPI. a Five movement trajectories of the subject after learning in the NF. b The first five consecutive movement trajectories of the subject in the VF. c Five consecutive movement trajectories of the subject after 30 trials in the VF. d Five after-effect trials in the NF

The term f in (23) is used to model possible external disturbances exerted on the hand (Liu and Todorov 2007) from the force fields. Three kinds of disturbances are considered here: the null field (NF) (Burdet et al. 2001, 2006; Franklin et al. 2003), \(f\equiv 0\), where no external disturbances are exerted; the velocity-dependent force field (VF) (Burdet et al. 2006; Franklin et al. 2003),

$$\begin{aligned} f = \chi \left[ \begin{array}{cc} 13 &{}\quad -18 \\ 18 &{}\quad 13 \end{array}\right] \left[ \begin{array}{c} v_x \\ v_y \end{array}\right] \end{aligned}$$

where \(\chi \in [2/3,1]\) is a constant that can be adjusted to the subject’s strength; the divergent force field (DF) (Burdet et al. 2001, 2006; Franklin et al. 2003),

$$\begin{aligned} f = \left[ \begin{array}{cc} \beta &{}\quad 0 \\ 0 &{}\quad 0 \end{array}\right] \left[ \begin{array}{c} p_x \\ 0 \end{array}\right] \end{aligned}$$
(24)

where \(\beta >0\), so that a negative elastic (destabilizing) force perpendicular to the target direction is produced.

The model parameters used in our simulations are given in Table 1.
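For concreteness, the state-space matrices of (23), the weighting matrices, and the three force fields can be assembled as in the sketch below. The numerical parameter values in the sketch are illustrative placeholders standing in for Table 1 (which is not reproduced here) and should not be read as the paper's values.

```python
# Sketch of the arm model (23) and the NF/VF/DF force fields.
# The parameter values below are illustrative guesses, not the values of Table 1.
import numpy as np

m_h, b, tau = 1.3, 10.0, 0.05          # assumed hand mass [kg], viscosity [N s/m], time constant [s]
c1, c2 = 0.075, 0.025                  # assumed control-dependent noise gains
chi, beta = 1.0, 300.0                 # VF scaling (chi in [2/3, 1]) and DF gain

I2, Z2 = np.eye(2), np.zeros((2, 2))
A = np.block([[Z2, I2, Z2],
              [Z2, -(b / m_h) * I2, (1 / m_h) * I2],
              [Z2, Z2, -(1 / tau) * I2]])
B = np.vstack([Z2, Z2, (1 / tau) * I2])
D = np.vstack([Z2, I2, Z2])
C1 = np.array([[c1, 0.0], [c2, 0.0]])
C2 = np.array([[0.0, c2], [0.0, c1]])
Q = np.diag([2000.0, 1000.0, 20.0, 20.0, 0.01, 0.01])
Q[0, 1] = Q[1, 0] = -40.0
Q[2, 3] = Q[3, 2] = -1.0
R = 0.01 * np.eye(2)

def force(field, p, v):
    """External force f for the null (NF), velocity-dependent (VF) and divergent (DF) fields."""
    if field == "NF":
        return np.zeros(2)
    if field == "VF":
        return chi * np.array([[13.0, -18.0], [18.0, 13.0]]) @ v
    if field == "DF":
        return np.array([beta * p[0], 0.0])
    raise ValueError(field)
```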

3.2 Sensorimotor control in velocity-dependent force field

In the experiments conducted by Franklin et al. (2003) and Burdet et al. (2006), subjects sat in a chair and moved the parallel-link direct drive air-magnet floating manipulandum (PFM) in a series of forward reaching movements performed in the horizontal plane. Subjects performed reaching movements from a start circle to a target circle over a total distance of 0.25 m. They first practiced in the NF until they achieved enough successful trials. Trials were considered successful if they ended inside the target within the prescribed time \(0.6\pm 0.1\) s. Then, the VF was activated without informing the subjects in advance. Subjects practiced in the VF until achieving enough successful trials. Next, they took a short break and performed several trials in the NF. These trials were called after-effects and were recorded to confirm that the subjects had adapted to the force field. It was observed (Franklin et al. 2003; Burdet et al. 2006) that the initial trials of the subjects in the VF were drastically distorted, but the movements gradually became straighter. After learning through enough trials, the trajectories were relatively straight and consistently reached the final target position. Inspection of the stiffness, depicted graphically as the elastic restoring force corresponding to a unit displacement of the hand in the force fields (Burdet et al. 2001, 2006), revealed that after adaptation the endpoint stiffness was selectively modified in the direction of the instability. See Gomi and Kawato (1996) and Franklin et al. (2003) for more details.

In this subsection, we apply O-LSPI to the human arm movement model (23) to reproduce the experimental results in Franklin et al. (2003) and Burdet et al. (2006).

Fig. 3 Simulated velocity and endpoint force curves generated by O-LSPI. a Simulated curves of the subject after learning in the NF. b Simulated curves of the subject in several trials when first exposed to the VF. c Simulated curves of the subject after 30 trials in the VF. d After-effect trials in the NF. x-velocity and y-velocity curves are shown in the first and second rows, respectively. Bell-shaped velocity curves are clearly observed in the y-velocity curves. x-endpoint force and y-endpoint force curves are shown in the third and fourth rows, respectively. A comparison of the first and third figures in the x-endpoint force suggests that subjects adapted to the VF by generating compensation force to counteract the force produced by the VF. The shapes of these curves closely resemble the experimental results reported in Franklin et al. (2003) and Burdet et al. (2006)

The experiments in the NF are simulated first. The O-LSPI starts with an initial stabilizing control gain \(\hat{K}_1\in {\mathbb {R}}^{2\times 6}\), such that \(A-B\hat{K}_1\) is Hurwitz. Such a control gain can be found by robust control theory (Zhou and Doyle 1998), if some upper and lower bounds of the elements in A and B are available and the pair \((A,B)\) is stabilizable. Indeed, the first several trials in the NF can be interpreted as the search for an initial stabilizing control gain, by estimating the bounds of the parameters b, m and \(\tau \) and using robust control techniques. If the CNS figures out that \(b\in [-8,12]\), \(m\in [1,1.5]\) and \(\tau \in [0.03,0.07]\) in the first several trials, an initial control gain can be chosen as

$$\begin{aligned} \hat{K}_1 = \left[ \begin{array}{cccccc} 100 &{}\quad 0 &{}\quad 10 &{}\quad 0 &{}\quad 10 &{}\quad 0 \\ 0 &{}\quad 100 &{}\quad 0 &{}\quad 10 &{}\quad 0 &{}\quad 10 \end{array}\right] . \end{aligned}$$

Then, during the i-th trial, we collect input/state data generated by control policy (10) and construct the estimate \(\hat{G}_i\) from the data to update the control policy. After 30 trials, the control policy is updated to

$$\begin{aligned}&\hat{K}_{30}= \nonumber \\&\left[ \begin{array}{cccccc} 446.21 &{}\quad -6.98 &{}\quad 49.80 &{}\quad -0.94 &{}\quad 1.41 &{}\quad -0.02 \\ -5.46 &{}\quad 315.00 &{}\quad -0.84 &{}\quad 43.38 &{}\quad -0.01 &{}\quad 1.31 \end{array}\right] \nonumber \\ \end{aligned}$$
(25)

which is nearly optimal since the optimal policy in NF is

$$\begin{aligned}&\hat{K}^*_{\mathrm {NF}} = \\&\left[ \begin{array}{cccccc} 447.19 &{}\quad -5.91 &{}\quad 49.70 &{}\quad -0.94 &{}\quad 1.41 &{}\quad -0.01 \\ -4.27 &{}\quad 316.17 &{}\quad -0.91 &{}\quad 43.34 &{}\quad -0.01 &{}\quad 1.30 \end{array}\right] . \end{aligned}$$

Next, the experiments in the VF are simulated. Since the velocity-dependent force field is activated without notifying the subjects, the first trial in the VF is performed under the same control policy as learnt in the NF. After the first trial, the CNS realizes that it is facing a new environment different from the NF, so O-LSPI is applied starting from the second trial. The initial control gain used for the second trial is obtained by tripling the gains in (25) learnt in the NF. This mimics the experimental observation (Franklin et al. 2008) that muscle activities increased dramatically after the first trial. After 30 trials, the control policy is updated to

$$\begin{aligned}&\hat{K}_{30} = \\&\left[ \begin{array}{cccccc} 426.00 &{}\quad 100.54 &{}\quad 66.30 &{}\quad -12.56 &{}\quad 1.65 &{}\quad -0.04 \\ -155.75 &{}\quad 299.92 &{}\quad 5.93 &{}\quad 61.57 &{}\quad -0.04 &{}\quad 1.59 \end{array}\right] \end{aligned}$$

which is nearly optimal since the optimal policy in the VF is

$$\begin{aligned}&\hat{K}^*_{\mathrm {VF}} =\\&\left[ \begin{array}{cccccc} 419.42 &{}\quad 101.30 &{}\quad 65.99 &{}\quad -12.37 &{}\quad 1.65 &{}\quad -0.04 \\ -155.19 &{}\quad 299.56 &{}\quad 6.05 &{}\quad 61.80 &{}\quad -0.04 &{}\quad 1.59 \end{array}\right] . \end{aligned}$$
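As a quick sanity check on the closeness of these two gains, one can compute the relative error directly from the matrices printed above:

```python
# Relative error (Frobenius norm) between the learned gain and the optimal VF gain above.
import numpy as np

K_hat_30 = np.array([[426.00, 100.54, 66.30, -12.56, 1.65, -0.04],
                     [-155.75, 299.92, 5.93, 61.57, -0.04, 1.59]])
K_star_VF = np.array([[419.42, 101.30, 65.99, -12.37, 1.65, -0.04],
                      [-155.19, 299.56, 6.05, 61.80, -0.04, 1.59]])
rel_err = np.linalg.norm(K_hat_30 - K_star_VF) / np.linalg.norm(K_star_VF)
print(f"relative error of K_hat_30 in the VF: {rel_err:.2%}")   # about 1.2 %
```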

The simulated movement trajectories, the velocity curves and the endpoint force curves are shown in Figs. 2 and 3. It can be seen that the simulated movement trajectories in the NF are approximately straight lines, and the velocity curves along the y-axis are bell-shaped. The subject successfully reaches the target in the first trial in the VF, but the trajectory is heavily distorted to the upper-left side, since the subject is still using the near-optimal control policy learnt in the NF when first exposed to the VF. Although the VF produces a stable interaction with the arm, the near-optimal control policy learnt in the NF is far from optimal in the new environment. The O-LSPI takes effect from the second trial in the VF. Motor adaptation can be observed by comparing the first five consecutive trials, where the movement trajectories become straighter and straighter. After 30 trials, the movement trajectories become approximately straight lines again, and the velocity curves become bell-shaped again. This implies that after 30 trials in the VF, the CNS can learn the optimal control policy well from data, without knowing or using the precise system parameters or the unmeasurable control-dependent noise. The stiffness ellipses after 30 trials are shown in Fig. 4. The stiffness in the VF increases significantly in the direction of the external force, compared with the stiffness in the NF. Finally, our numerical study clearly shows the after-effects of the subject's behavior when the VF is suddenly deactivated. These observations clearly indicate that motor adaptation to the VF indeed occurs. One can find through comparison that our simulation results in Figs. 2, 3 and 4 are consistent with the experimental observations in Franklin et al. (2003) and Burdet et al. (2006).

The relative estimation errors of \(\hat{G}_i\) in Algorithm 3 are shown in Fig. 5. One can see that the errors can be as large as \(16\%\) in the learning process. This implies that the CNS is able to adapt to the VF with imperfect or noisy information.

3.3 Sensorimotor control in divergent force field

The effects of the divergent force field on the subjects in the arm reaching movement are also studied in Burdet et al. (2001, 2006) and Franklin et al. (2003); their experimental results are reproduced using our proposed computational method in this subsection. We set \(\beta = 300\) in (24).

Fig. 4 Illustration of the stiffness geometry in the VF. Compared with the stiffness in the NF (green), the stiffness in the VF (red) increased significantly in the direction of the external force

Fig. 5 Relative estimation error \(\Delta G_i\) between \(\hat{G}_i\) and its true value \(G(\hat{P}_i)\) in O-LSPI in the VF. Although the estimation errors exist in the learning process and can be as large as \(16\%\) of the true value measured by the Frobenius norm, the subject is still able to adapt to the VF

Fig. 6 Simulated movement trajectories generated by O-LSPI. a Five movement trajectories of the subject after learning in the NF. b Five independent movement trajectories of the subject when first exposed to the DF. For safety reasons, the DF is turned off when the trajectory deviates more than 2.5 cm from the y-axis. The black lines indicate this safety zone. c Five consecutive movement trajectories of the subject after 30 trials in the DF. d Five after-effect trials in the NF

Fig. 7 Simulated velocity and endpoint force curves generated by O-LSPI. a Simulated curves of the subject after learning in the NF. b Simulated curves of the subject in several trials when first exposed to the DF. c Simulated curves of the subject after 30 trials in the DF. d After-effect trials in the NF. x-velocity and y-velocity curves are shown in the first and second rows, respectively. Bell-shaped velocity curves are clearly observed in the y-velocity curves. x-endpoint force and y-endpoint force curves are shown in the third and fourth rows, respectively. The third figure in the x-endpoint force suggests that subjects adapted to the DF by generating compensation force in the x-direction to counteract the force produced by the DF. The shapes of these curves closely resemble the experimental results reported in Burdet et al. (2001, 2006) and Franklin et al. (2003)

The simulation of movements in the NF before the DF is applied is the same as that presented in the last subsection. However, with \(\beta = 300\) the DF produces an unstable interaction with the arm, so that the near-optimal policy learned in the NF is not stabilizing anymore. Therefore, when we apply the same near-optimal control policy learnt in the NF to generate the movements for the first five trials in the DF, unstable behaviors are observed, as shown in Fig. 6b. In this case, it is hypothesized that the CNS re-learns a new initial stabilizing controller using robust control theory (see Remark 2 for an example), such that

$$\begin{aligned} A-B\hat{K}_1 = \left[ \begin{array}{cccccc} 0 &{}\; 0 &{}\; 1 &{}\; 0 &{}\; 0 &{}\; 0 \\ 0 &{}\; 0 &{}\; 0 &{}\; 1 &{}\; 0 &{}\; 0 \\ \frac{\beta }{m} &{}\; 0 &{}\; -\frac{b}{m} &{}\; 0 &{}\; \frac{1}{m} &{}\; 0 \\ 0 &{}\; 0 &{}\; 0 &{}\; -\frac{b}{m} &{}\; 0 &{}\; \frac{1}{m} \\ 0 &{}\; 0 &{}\; 0 &{}\; 0 &{}\; -\frac{1}{\tau } &{}\; 0 \\ 0 &{}\; 0 &{}\; 0 &{}\; 0 &{}\; 0 &{}\; -\frac{1}{\tau } \end{array}\right] - \left[ \begin{array}{cc} 0 &{}\; 0 \\ 0 &{}\; 0 \\ 0 &{}\; 0 \\ 0 &{}\; 0 \\ \frac{1}{\tau } &{}\; 0 \\ 0 &{}\; \frac{1}{\tau } \end{array}\right] \hat{K}_1 \end{aligned}$$

is Hurwitz. Here, we increase the first entry in the first row of the control gain in (25) by 600 and use the resulting gain as the initial stabilizing control gain. After 30 trials in the DF, the O-LSPI has updated the control gain to

$$\begin{aligned} \hat{K}_{30} = \left[ \begin{array}{cccccc} 1462.15 &{}\quad 4.70 &{}\quad 81.59 &{}\quad -0.14 &{}\quad 1.87 &{}\quad 0.01 \\ -7.06 &{}\quad 310.12 &{}\quad -0.88 &{}\quad 43.09 &{}\quad 0.00 &{}\quad 1.31 \end{array}\right] , \end{aligned}$$

which is near-optimal, since the corresponding optimal control gain is

$$\begin{aligned} K^*_{\mathrm {DF}} = \left[ \begin{array}{cccccc} 1481.89 &{}\quad -4.86 &{}\quad 82.19 &{}\quad -0.76 &{}\quad 1.88 &{}\quad -0.01 \\ -6.65 &{}\quad 316.19 &{}\quad -0.799 &{}\quad 43.34 &{}\quad -0.01 &{}\quad 1.30 \end{array}\right] . \end{aligned}$$

The simulated movement trajectories, the velocity curves and the endpoint force curves are shown in Figs. 6 and 7. It can be seen that the simulated movement trajectories in the NF are approximately straight lines, and the velocity curves along the y-axis are bell-shaped. Due to the control-dependent noise, the movement trajectories differ slightly from trial to trial. Since the DF produces an unstable interaction with the arm, unstable behaviors are observed in the first several trials when the subject is first exposed to the DF. Then, O-LSPI is applied, and after 30 trials, the movement trajectories become approximately straight as in the NF, which implies that the CNS has learned to adapt to the DF. The stiffness ellipses after 30 trials are shown in Fig. 8. It is clear that the stiffness in the DF increases significantly in the direction of the divergent force, while the change of stiffness along the movement direction is not significant, compared with the stiffness in the NF. Finally, the after-effects of the subject's behavior are simulated when the DF is suddenly deactivated. The after-effect movement trajectories are even straighter than the trajectories in the NF. The reason is that the CNS has learned to compensate more for displacements along the x-axis. One can find through comparison that our simulation results in Figs. 6, 7 and 8 are consistent with the experimental observations in Burdet et al. (2001, 2006) and Franklin et al. (2003).

The relative estimation errors of \(\hat{G}_i\) in Algorithm 3 are shown in Fig. 9. The errors can be as large as \(90\%\) of the true values in the learning process. This implies that the CNS is able to adapt to the DF even in the presence of largely imperfect or noisy information.

Fig. 8 Illustration of the stiffness geometry in the DF. Compared with the stiffness in the NF (green), the stiffness in the DF (red) increased significantly in the direction of the external force

Fig. 9 Relative estimation error \(\Delta G_i\) between \(\hat{G}_i\) and its true value \(G(\hat{P}_i)\) in O-LSPI in the DF. Although the estimation errors exist in the learning process and can be as large as \(90\%\) of the true value measured by the Frobenius norm, the subject is still able to adapt to the DF

Remark 2

By robust control theory (Zhou and Doyle 1998), an initial stabilizing controller can be found provided that the bounds of the parameters of a linear system are known. For example, suppose the CNS has access to the bounds of the unknown system parameters:

$$\begin{aligned} \begin{aligned}&0.5\le m\le 5,\quad 2 \le b \le 20, \\&0.02 \le \tau \le 0.5,\quad 200 \le \beta \le 500, \end{aligned} \end{aligned}$$
(26)

in the DF. Then, with these bounds at hand, a direct application of robust control theory (implemented with the Robust Control Toolbox (Balas et al. 2007)) yields:

$$\begin{aligned}&K =\\&\left[ \begin{array}{cccccc} 15854 &{}\quad -17.23 &{}\quad 477.10 &{}\quad -7 &{}\quad 29.76 &{}\quad -270.88 \\ -11.65 &{}\quad 3.89 &{}\quad -1.43 &{}\quad 4.55 &{}\quad -0.04 &{}\quad 58.57 \end{array}\right] \end{aligned}$$

that stabilizes the hand in the DF for all possible parameters satisfying the conditions in (26). In this way, an initial stabilizing (but generally not optimal) control gain required by the O-LSPI algorithm can be found, without exact knowledge of the system dynamics.
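A full robust synthesis is beyond the scope of a short sketch, but a rough vertex screen of the gain given above is easy to write down. The check below (our own illustration, not the Robust Control Toolbox route used in the remark) verifies that \(A-B\hat{K}\) is Hurwitz at every vertex of the parameter box (26); passing it does not prove robust stability over the whole box, it is only a quick plausibility check.

```python
# Vertex screen of the candidate gain over the parameter box (26); illustrative only.
import itertools
import numpy as np

def arm_AB(m, b, tau, beta):
    """Open-loop A, B of the arm model in the DF (cf. the matrix displayed in Sect. 3.3)."""
    I2, Z2 = np.eye(2), np.zeros((2, 2))
    A = np.block([[Z2, I2, Z2],
                  [Z2, -(b / m) * I2, (1 / m) * I2],
                  [Z2, Z2, -(1 / tau) * I2]])
    A[2, 0] += beta / m                      # divergent-field term acting on p_x
    B = np.vstack([Z2, Z2, (1 / tau) * I2])
    return A, B

K = np.array([[15854.0, -17.23, 477.10, -7.0, 29.76, -270.88],
              [-11.65, 3.89, -1.43, 4.55, -0.04, 58.57]])
bounds = [(0.5, 5.0), (2.0, 20.0), (0.02, 0.5), (200.0, 500.0)]   # m, b, tau, beta in (26)

def hurwitz_at(m, b, tau, beta):
    A, B = arm_AB(m, b, tau, beta)
    return np.all(np.linalg.eigvals(A - B @ K).real < 0)

print("Hurwitz at all parameter-box vertices:",
      all(hurwitz_at(*v) for v in itertools.product(*bounds)))
```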

Remark 3

It should be mentioned that if the bounds of the parameters are completely unknown, the value iteration, an alternative reinforcement learning algorithm, can be used to find the optimal LQR control policies and reproduce the experimental results in the divergent field (Bian et al. 2020), without any initial stabilizing control policy. Please see Bian and Jiang (2019), Pang and Jiang (2020) and Bian et al. (2020) for details.

3.4 Fitts’s law

In this subsection, we further validate our computational model using Fitts's law (Fitts 1954; Schmidt et al. 2018). Fitts's law is one of the most widespread formal rules in the study of human behavior. The law states that the movement duration \(t_f\) required to rapidly reach a target area is a function of the distance d to the target and the size \(\gamma \) of the target. There are multiple versions of Fitts's law (Schmidt et al. 2018), two of which are used in our validation. The first one is the logarithmic law (log law)

$$\begin{aligned} t_f = \alpha _0 + \alpha _1 \log _2\left( \frac{2d}{\gamma }\right) , \end{aligned}$$

where \(\alpha _0\) and \(\alpha _1\) are two constants. The second one is the power law

$$\begin{aligned} t_f = \alpha _0\left( \frac{d}{\gamma }\right) ^{\alpha _1}. \end{aligned}$$

In our case, \(\gamma \) is the diameter of the target circle, and d is the distance from the starting point to the center of the target. We generate trials using the after-learning control policies in the NF, VF and DF, respectively. The movement duration \(t_f\) is defined as the time when the hand cursor enters the target. The data fitting results are shown in Fig. 10 and Table 2. It can be seen that our simulation results are consistent with the predictions of Fitts's law.
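Both fits are plain least-squares regressions. The sketch below shows how they might be computed; the duration and distance samples are made-up placeholders, since the simulated trial data are not reproduced here:

```python
# Least-squares fits of the log and power forms of Fitts's law (placeholder data).
import numpy as np

d = np.array([0.10, 0.15, 0.20, 0.25, 0.30])          # assumed target distances [m]
gamma = 0.02                                           # assumed target diameter [m]
t_f = np.array([0.42, 0.48, 0.55, 0.60, 0.66])         # assumed movement durations [s]

# Log law: t_f = a0 + a1 * log2(2 d / gamma)  ->  ordinary linear regression.
a1_log, a0_log = np.polyfit(np.log2(2 * d / gamma), t_f, 1)

# Power law: t_f = a0 * (d / gamma)^a1  ->  linear regression after taking logarithms.
a1_pow, log_a0 = np.polyfit(np.log(d / gamma), np.log(t_f), 1)
a0_pow = np.exp(log_a0)

print(f"log law:   t_f = {a0_log:.3f} + {a1_log:.3f} * log2(2d/gamma)")
print(f"power law: t_f = {a0_pow:.3f} * (d/gamma)^{a1_pow:.3f}")
```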

Fig. 10 Log and power versions of Fitts's law. Crosses in the first, second and third rows represent after-learning movement durations simulated in the NF, the VF and the DF, respectively. Solid lines in A, C, E are least-squares fits using the log Fitts's law, and solid lines in B, D, F are least-squares fits using the power Fitts's law

4 Discussion

4.1 Model-free learning

Most of the previous computational models for sensorimotor control assume that the CNS has exact knowledge of the motor system and of the environment with which it is interacting (Wolpert et al. 1995; Todorov 2005; Liu and Todorov 2007; Todorov and Jordan 2002; Harris and Wolpert 1998; Mussa-Ivaldi et al. 1985; Yeo et al. 2016; Česonis and Franklin 2020; Zhou et al. 2017; Mistry et al. 2013; Ueyama 2014; Cluff and Scott 2015; Gaveau et al. 2014). The optimal control policies are then computed based on this assumption. In contrast, our proposed computational model, Algorithm 3, is a model-free approach which does not need accurate model information and does not estimate the unknown model parameters. Algorithm 3 shows that near-optimal policies can be derived using the sensory data and are robust to the control-dependent noise. The numerical experiments in the last section show that our proposed model can generate the typical movement trajectories and stiffness ellipses observed in previous experiments in different settings. As one of the key differences from the sensorimotor models mentioned above, our proposed computational mechanism suggests that, when confronted with unknown environments and imprecise dynamics, the CNS may update and improve its command signals for movement through learning and repeated trials using sensory data.

Table 2 Parameters in the log law and power law estimated by least squares

Two model-based computational mechanisms that do not require an accurate internal model are proposed in Mistry et al. (2013) and Crevecoeur et al. (2020). Assuming that the parameters of the uncertain force fields are unknown, modified and adaptive linear quadratic Gaussian control schemes are proposed in Mistry et al. (2013) and Crevecoeur et al. (2020), respectively, to explain the experimental phenomena observed there. However, accurate models of the arm and of the rest of the environment are still required. Furthermore, the parameters of the uncertain force fields need to be estimated explicitly. By contrast, our proposed mechanism assumes that all the parameters in the model are unknown; the control policies are generated directly from the sensory data, and no parameter of the arm or environment model is explicitly estimated.

Model-free approaches based on the optimal control framework are also developed in Jiang and Jiang (2014, 2015), Bian et al. (2016, 2020) for sensorimotor control. Although these model-free approaches successfully reproduce the experimental observations in the arm reaching task, they implicitly assume that the control-dependent noises \(w_1\) and \(w_2\) in (23) are measurable and explicitly use the noise-corrupted data in the computation of the iterative estimates of the optimal policy. In this paper, we conjecture that the CNS makes decisions without direct access to any noise-dependent data, so the term model-free is more apt for the proposed computational model. Interestingly, both theoretical and numerical studies have shown that, without exactly cancelling the noise-dependent terms, human motor learning and control is inherently robust to the small errors occurring in the learning process.

Model-free approaches not based on the optimal control framework are proposed in Zhou et al. (2011) and Hadjiosif et al. (2021). The model in Zhou et al. (2011) synthesizes iterative learning control and model reference adaptive control to reproduce the behavior observed in human motor adaptation, so that the motion command is carried out without the need for inverse kinematics or knowledge of the disturbances in the environment. It is assumed in Zhou et al. (2011) that the CNS aims at letting the arm track an ideal trajectory of a reference model. Although this mechanism does not require internal models of the human arm and the environment, the reference model needs to be identified from the experimental data or specially designed. The model in Hadjiosif et al. (2021) takes the derivative of the movement errors with respect to the parameters in the control policies, such that the control policies are directly updated by gradient descent without the knowledge of a forward model. Although this model successfully characterizes some properties of the implicit adaptation under mirror-reversed visual feedback, the relationship between the sensory output and the control policy is simply modeled as a static function, rather than as a dynamical system as in our paper [see Eq. (1)]. In addition, both the models in Zhou et al. (2011) and Hadjiosif et al. (2021) only aim at minimizing sensory output errors, without minimizing the metabolic cost, which is considered in our model and is deemed to be one of the fundamental principles in sensorimotor systems by many previous studies (Todorov and Jordan 2002; Liu and Todorov 2007; Burdet et al. 2001; Franklin et al. 2008; Selen et al. 2009; Franklin and Wolpert 2011).

4.2 Robust reinforcement learning

With the control-dependent noise unmeasurable, we can only solve the data-based ODE (18), an approximation of the precise model-based ODE (8); this is the main source of the discrepancies \(\Delta G_i\) between \(\hat{G}_i\) and its true value \(G(\hat{P}_i)\) in Algorithm 3. The simulation results in the last section show that even if the errors \(\Delta G_i\) are present and can be as large as \(90\%\) of the true value (see Figs. 5 and 9), O-LSPI still successfully reproduces movement trajectories, velocity and endpoint force curves, and stiffness geometries similar to those observed in the experiments in Burdet et al. (2001, 2006) and Franklin et al. (2003). This is consistent with our theoretical result, Theorem 1, on the robustness of reinforcement learning algorithms and suggests that the CNS is able to learn to adapt to new environments and external disturbances in an error-tolerant, robust way. Such observations and implications are not included or explored in the previously proposed computational mechanisms based on optimal control theory in Todorov and Jordan (2002), Harris and Wolpert (1998), Burdet et al. (2001, 2006), Franklin et al. (2003), Wolpert et al. (1995), Selen et al. (2009), Zhou et al. (2017), Kadiallah et al. (2011), Mistry et al. (2013), Česonis and Franklin (2020), Haith and Krakauer (2013), d’Acremont et al. (2009), Fiete et al. (2007), Jiang and Jiang (2014) and Bian et al. (2020).

It is worth emphasizing that the robustness issue studied in this paper is different from the robustness problem considered in Ueyama (2014), Jiang and Jiang (2015), Crevecoeur et al. (2019), Gravell et al. (2020) and Bian et al. (2020). The robustness of the closed-loop system consisting of the optimal controller (the optimal policy learned by the CNS) and the controlled plant (the sensorimotor system and the environment that the CNS is interacting with) to external disturbances is analyzed in Ueyama (2014), Jiang and Jiang (2015), Crevecoeur et al. (2019), Gravell et al. (2020) and Bian et al. (2020), whereas this paper is devoted to investigating the robustness of the learning algorithm, Algorithm 3, against the errors arising in the learning process.

4.3 Exploration noise

For the convergence of our proposed computational algorithm, it is necessary to add the exploration noise y to the state-feedback term \(-\hat{K}_ix\) in (10). Indeed, without the exploration noise, the control input \(u_i\) is a linear combination of the state x, i.e., \(u_i=-\hat{K}_ix\), and thus Assumption 1 cannot hold. As a result, there is no guarantee that the least-squares problem in (16) has a reasonable solution or that Algorithm 3 converges to a small neighborhood of the optimal solution. In other words, our computational mechanism suggests that the intrinsic control-dependent noises \(\xi _1\) and \(\xi _2\) in the human arm movement model (22) alone are not enough to guarantee that the CNS can successfully adapt to the force fields in the arm reaching experiments, and that extrinsic exploration noise actively added to the control input is indispensable in sensorimotor learning. This is consistent with the findings in the literature that the CNS actively regulates motor variability through extrinsic noise to facilitate motor learning (Wu et al. 2014; Sternad 2018). In fact, it has been reported that noise is even able to teach people to make specific movements (Thorp et al. 2017). Evidence can also be found in studies of songbird learning (Fiete et al. 2007; Tumer and Brainard 2007). The song-related avian brain area known as the robust nucleus of the arcopallium (RA) is responsible for song production (Hahnloser et al. 2002), i.e., it acts as the controller of the avian song control system. It is found in Fiete et al. (2007) that another song-related avian brain area, the lateral magnocellular nucleus of the nidopallium (LMAN), produces fluctuating glutamatergic input to area RA to generate behavioral variability in songbirds for trial-and-error learning. In particular, lesion of area LMAN has little immediate effect on song production in adults, but arrests song learning in juveniles (Fiete et al. 2007). Thus, the extrinsic signal provided to area RA by area LMAN in songbird learning plays the same role as the extrinsic exploration noise in our Algorithm 3.

4.4 Infinite-horizon optimal control

In our proposed computational method, the cost function (2) penalizes the trajectory deviation and the energy consumption over an infinite time horizon, which yields a time-invariant control policy in each trial. A main advantage of this infinite-horizon optimal control framework is that the movement duration need not be prescribed by the CNS (Huh et al. 2010; Huh 2012; Qian et al. 2013; Li et al. 2015; Česonis and Franklin 2021). Intuitively, this seems more realistic because the duration of each movement is difficult to prescribe and varies from trial to trial as a result of the randomness in the trajectories caused by the signal-dependent noise. The finite-horizon optimal control framework is utilized in Liu and Todorov (2007), where the cost function is integrated over a prescribed finite time interval and the resulting control policy is time-varying. It is suggested in Liu and Todorov (2007) that if the target is not reached at the prescribed reaching time, the CNS can similarly plan an independent new trajectory between the actual position of the hand and the final target, and the final trajectory will be the superposition of all the trajectories. By contrast, our model matches the intuition that the motor system keeps moving the hand toward the target until it is reached.

Although both finite-horizon models and infinite-horizon models have been widely and successfully used to explain goal-directed reaching tasks, it remains an unsettled problem which kind of model humans actually use (Li et al. 2015). Both classes of models have unique merits and limitations (see Česonis and Franklin 2021 and the references therein for details). Recently, a novel model combining the strengths of finite-horizon and infinite-horizon optimal control was proposed in Česonis and Franklin (2021) to address their individual limitations.

5 Conclusion

In this paper, we have proposed a new computational mechanism based on robust reinforcement learning, named optimistic least-squares policy iteration (O-LSPI), to model the robustness phenomenon in motor learning, control and adaptation. The O-LSPI model suggests that in spite of the estimation errors caused by the unmeasurable control-dependent noise, the CNS could still find near-optimal control policies directly from the noisy sensory data. Simulated movement trajectories, velocity and endpoint force curves and stiffness geometries consistent with the experimental observations in Burdet et al. (2001, 2006) and Franklin et al. (2003) are reproduced using the O-LSPI model, with estimation errors as large as \(90\%\) of the corresponding true values. Therefore, we argue that human sensorimotor systems may use a mechanism in the spirit of robust reinforcement learning to achieve motor learning, control and adaptation that are robust to the estimation errors present in the sensorimotor interactions.