1 Introduction

Movement Primitives (MPs) are a well-established approach for representing movement policies in robotics. MPs have several beneficial properties: they generalize to new situations, allow temporal modulation of the movement, can be co-activated to concurrently solve multiple tasks, can be sequenced to generate longer and more complex movements, and are easy to learn from demonstrations. Owing to these properties, MPs have been successfully applied to reaching (d'Avella and Bizzi 2005) and locomotion (Dominici et al. 2011; Moro et al. 2012), and they are state of the art for robot movement representation and generation. However, many approaches for movement generation based on MPs (Ijspeert et al. 2003; Williams et al. 2007; d'Avella and Bizzi 2005; Khansari-Zadeh and Billard 2011; Rozo et al. 2013; Rückert et al. 2012; Righetti and Ijspeert 2006) exhibit only a subset of these properties. Hence, a generalized framework that unifies all these properties in one principled formulation is needed.

We formalize the concept of probabilistic movement primitives (ProMPs) as a general probabilistic framework for representing and learning MPs. A ProMP represents a distribution over trajectories. The trajectory distribution can be defined in joint space, task space, or any other space that suits the experiment. In this paper, we focus on joint-space trajectories. Working with distributions enables us to formulate the described properties using operations from probability theory. For example, modulation of a movement to a novel target can be realized by conditioning on the desired target's positions or velocities. Similarly, consistent parallel activation of two elementary behaviors can be accomplished by taking the product of two independent trajectory distributions. A trajectory distribution can encode the variance of the movement, and, hence, a ProMP can directly encode optimal behavior in systems with linear dynamics, quadratic costs and Gaussian noise (Todorov and Jordan 2002). In contrast, deterministic approaches, e.g., the DMP approach, can only represent the mean solution, which is known to be suboptimal. Even if this assumption does not hold, we believe that it offers a good approximation of physical robotic systems. Finally, a probabilistic framework allows us to model the coupling between the degrees of freedom (DoFs) of the robot by estimating the covariance between different DoFs.

The benefits of using a probabilistic representation have so far not been extensively exploited for representing and learning MPs. The main reason for this limitation has been the difficulty of extracting a policy for controlling the robot from a trajectory distribution. We show how this step can be accomplished and derive a control policy that exactly reproduces a given trajectory distribution. While the ProMP framework introduces many novel components, it also incorporates many of the advantages of well-known previous movement primitive representations (Schaal et al. 2005; d'Avella and Bizzi 2005), such as temporal rescaling of movements and the ability to represent both rhythmic and stroke-based movements.

In this paper, we unify and complement our prior work (Paraschos et al. 2013a, b; Neumann et al. 2014) on ProMPs. Note that Neumann et al. (2014) contains only a brief summary of our work on ProMPs, presented in the context of an overview paper that spans multiple topics, and therefore provides less information than the corresponding conference papers. Here, we present many more details that are necessary to reproduce the results. We introduce a new regularization technique for achieving smoother movements and present an expectation-maximization algorithm for learning rhythmic ProMPs in more detail. We extend the description of our controller derivation and show how it is used in physical tasks, e.g., controlling a 7-DoF arm for playing maracas, robot hockey, and 'Astrojax'. Moreover, we show new comparisons to state-of-the-art MP approaches in terms of optimality, generalizability, composition of primitives, and robustness of the movement representations. We also evaluate our ProMP controller on non-linear systems and make the source code of all examples publicly available (see footnote 1).

2 Properties of movement primitive frameworks

We categorize MPs into state-based (Khansari-Zadeh and Billard 2011; Calinon et al. 2010a) and trajectory-based representations (Schaal et al. 2005; Neumann et al. 2009; Rückert et al. 2012; Rozo et al. 2013). Trajectory-based primitives typically use time as the driving force of the movement. They require simple, typically linear, controllers, and scale well to a large number of DoFs. In contrast, state-based primitives (Khansari-Zadeh and Billard 2011; Calinon et al. 2010a) do not require the knowledge of a time step but often need to use more complex, non-linear policies. Such increased complexity has limited the application of state-based primitives to a rather small number of dimensions, such as the Cartesian coordinates of the task space of a robot. The main focus of this paper is on trajectory-based representations. We begin with a discussion on the properties of MPs.

2.1 Concise representation

MPs offer a concise representation of the movement, with only a few open parameters to set. The small number of parameters simplifies learning the movement from demonstrations and the use of reinforcement learning algorithms to adapt and refine the primitive through trial and error. MP frameworks can be trained from demonstrations using simple learning methods, e.g., linear regression, and have been successfully used in fairly complex scenarios, including “Ball-in-the-Cup” (Kober et al. 2010), Ball-Throwing (Ude et al. 2010; da Silva et al. 2012), Pancake-Flipping (Kormushev et al. 2010), Tetherball (Daniel et al. 2012a), and bi-pedal locomotion (Nakanishi et al. 2004).

2.2 Adaptation and time modulation

Many MPs offer an intrinsic adaptation mechanism to match a new situation or an altered task, e.g., hitting a different incoming ball when playing table tennis. The adaptation commonly comes in the form of a modification of the desired target position and velocity at the end of the primitive or as a modulation of the amplitude of the primitive (Ijspeert et al. 2003). Our approach (Paraschos et al. 2013a, b) can be used to adapt the movement at any time point during the trajectory's execution.

Furthermore, adaptation of MPs includes temporal modulation. Temporal modulation is a valuable property as it enables MPs to be applied in scenarios where correct timing is critical for the success of the task, e.g., in hitting, batting, or in locomotion to adjust the walking speed of the robot (Righetti and Ijspeert 2006).

2.3 Combination and sequencing

The expressiveness of an MP approach can be significantly improved if multiple primitives can be co-activated simultaneously to compose more complex movements. However, most MP approaches do not support the co-activation of primitives in a principled way. Instead, concurrent activation requires a prioritization scheme (Mülling et al. 2013; Pastor et al. 2011) in order not to disrupt the motion. In our approach (Paraschos et al. 2013a), we co-activate primitives to solve multiple tasks at the same time, without the need for such a scheme. Besides simultaneous activation, MP architectures aim to support the sequencing of MPs (Konidaris et al. 2012) to achieve a smooth transition from one primitive to another. Such sequencing is needed to dynamically concatenate primitives in order to obtain longer, more complex movements. We show that, in our framework, a smooth transition can be achieved in a principled way, similar to the combination of primitives.

2.4 Coupling the DoFs

Movement primitive approaches are typically applied to robots with multiple Degrees of Freedom (DoFs). In order to reproduce coordinated movements, MPs need a synchronization mechanism among the different DoFs. Using time, or a function of time, as a reference signal (Schaal et al. 2007; Ijspeert 2008), one can implement simple time-alignment mechanisms. However, when the robot deviates from the desired trajectory due to noise or unmodeled effects, coordinated recovery from perturbations is advantageous. In addition to time synchronization, ProMPs estimate such correlations directly from demonstrations and use them to synchronize the DoFs of the system.

2.5 Optimal behavior

Many trajectory-based representations use a single desired trajectory that is followed by a feedback controller with constant gains. However, following such a single trajectory has been proven to be suboptimal for many tasks if the system's dynamics are stochastic (Todorov and Jordan 2002). In this paper, we focus on control-affine systems with Gaussian control noise, which is a standard assumption for physical systems. In this case, a distribution over trajectories is a good representation of the optimal behavior. Such a distribution can be realized by using time-varying feedback gains, which are often used as an approximation of optimal behavior (Li and Todorov 2010). Feedback controllers with time-varying gains modulate the stiffness of the system to provide high precision at the 'important' time points of a task, while the system is less tightly controlled at time points where accurate control is not critical. The time-varying gains of the controller can be approximated (Calinon et al. 2010b), computed with an LQR by specifying a cost function (Calinon 2016; Bruno et al. 2015), improved with reinforcement learning (Buchli et al. 2011), or, as in our approach, computed in closed form (Paraschos et al. 2013a).

2.6 Stability

Generating stable behavior is an important aspect of MPs. However, stability guarantees often have limited use as they assume linearity of the dynamics. Yet even simple real-world systems are non-linear, e.g., a pendulum, where gravity alone introduces non-linearities in the dynamics. Discrete DMPs (Ijspeert et al. 2003) generate stable behavior by moving towards an attractor at the end of the movement, while periodic MPs (Ijspeert et al. 2003; Righetti and Ijspeert 2006) stabilize the movement on a unit circle. The probabilistic framework of Calinon et al. (2010a) initially did not provide any stability guarantees, but it still generated stable movements as long as the disturbances did not perturb the system "far" from the region where the demonstrations occurred. In Khansari-Zadeh and Billard (2011), the authors alleviated this problem and learned asymptotically stable control laws. Recently, Calinon (2016) proposed the use of a Linear Quadratic Regulator (LQR) for control, which yields a stable closed-loop system (Stengel 2012). The ProMP approach (Paraschos et al. 2013a) derives a controller that exactly reproduces the demonstrated trajectory distribution and, thus, provides stability guarantees as long as the demonstrated trajectory distribution was generated by a stable control law.

3 Related work

A commonly used trajectory-based representation is the Dynamic Movement Primitive (DMP) approach, introduced in Ijspeert et al. (2003); see Ijspeert et al. (2013) for a recent review. A DMP consists of a linear attractor system that is modulated by a time-dependent forcing function. The DMP introduced the concept of a phase, defined as a monotonic function of time. By adjusting the phase derivative, we can temporally scale the movement. The forcing function is represented by normalized Gaussian basis functions, multiplied with the phase signal. Since the phase decreases exponentially to zero, the forcing function asymptotically vanishes at the end of the movement. At that point, only the attractor dynamics remain active, which guarantees the stability of the linear system. When used in an imitation learning scenario, the weights of the basis functions can be fitted from a single demonstration using linear regression. Generalization to new, unseen situations in DMPs is limited. The original formulation only allowed changing the position at the end of the movement, which is implemented by modifying the position of the goal attractor or, for rhythmic DMPs, by adjusting the amplitude of the forcing function. Extensions exist that also allow setting a desired final velocity (Kober et al. 2010; Mülling et al. 2013; Paraschos et al. 2009). Directly changing intermediate points in the trajectory is not possible. DMPs can be sequenced given proper initialization (Paraschos et al. 2009), but only instant switching from one primitive to another is considered. Kulvicius et al. (2012) extended DMPs to support sequencing of primitives and evaluated their approach on a handwriting dataset. Gams et al. (2014) proposed the use of DMPs for tasks that include interactions with the environment.

Although DMPs provide many beneficial properties, such as temporal scaling of the movement, learning from a single demonstration, or generalizing to new final positions, further work is still needed for concurrently activating multiple primitives, generalizing to intermediate via-points, representing optimal behavior in stochastic systems, and capturing the correlation of the individual joints of the robot. When DMPs are applied to systems with multiple DoFs, the trajectories are synchronized based only on the internal phase variable. Multiple DMPs for the same DoF cannot be activated simultaneously without further considerations on prioritized control and partial cancellation of the movement.

Probabilistic approaches use distributions to additionally encode the variability of the movement (Calinon et al. 2010a; Rozo et al. 2013; Kormushev et al. 2010; Calinon 2016). The variability of the movement, or the variance in distribution terms, is crucial, as it reflects the importance of single time points for the movement execution and is often a requirement for representing optimal behavior in stochastic systems (Todorov and Jordan 2002). Moreover, capturing the variance of the movement leads to better generalization capabilities and to more natural movements. A probabilistic MP approach was proposed by Calinon et al. (2010a), where a Gaussian Mixture Regression (GMR) model is used to represent the trajectory. Given a set of trajectories, the GMR is trained with an Expectation-Maximization (EM) algorithm (Rozo et al. 2013). A unifying formulation that extends the DMPs and uses them in a probabilistic framework is discussed in Kormushev et al. (2010). Yet, it is unclear how a GMR model can be conditioned to reach different final or intermediate positions. An extension of the approach (Calinon 2016) enabled generalization to different situations by recording the movement in several different spaces and tracking the affine transformation to each space. While the approach is capable of generalizing, for example when an object changes its position, it cannot modulate the encoded variance.

Besides representing the variance of the trajectory, we need a controller that reproduces the encoded distribution on a real system. A feedback controller whose gains are based on the inverse of the covariance at the current time step was presented in Calinon et al. (2010b). The control law is based on the intuition that the gains have to be lower when the variance of the trajectories is higher. A comparison to this control law is presented in the evaluation section of this paper. As our experiments show, the trajectory distribution resulting from executing this controller does not match the desired one. In Calinon (2016) and Bruno et al. (2015), the authors proposed the use of minimum intervention control to generate the gains of the feedback controller. In this approach, the inverse of the covariance at every time point is used as the metric for the quadratic state costs. However, while weighting the state error with the inverse of the covariance is intuitively appealing, we will show in our comparison that this approach cannot match the desired trajectory distribution. Additionally, the proposed cost function includes a quadratic action penalty to limit the actions, which is not learned from the demonstrations.

A different approach for computing a control law for a GMR model was proposed by Khansari-Zadeh and Billard (2011). In this approach, the control gains are proven to be stable if the system is linear. The authors derive the stability constraints from Lyapunov stability theory. In Khansari-Zadeh et al. (2014), the authors extend their approach to generate stable controllers with state-dependent stiffness. The resulting controller shares similarities with the ProMP controller.

The approach by Rückert et al. (2012) also offers a probabilistic interpretation of MPs by representing them with learned graphical models. A probabilistic planning algorithm is used to obtain a controller that optimizes the cost function represented by the graphical model. The resulting controller is also a linear feedback controller with time varying gains. However, this approach heavily depends on the quality of the used planner and imitation learning of such a representation is not straightforward.

The ability to combine multiple MPs into a single movement provides significantly better generalization capabilities, enables the use of MP libraries, and has recently attracted the attention of the community. Mülling et al. (2013) use a library for table tennis in which multiple DMPs are concurrently activated to perform striking movements. Each primitive is activated with a weight provided by a trained gating network. The primitives are then combined at the acceleration level, which is equivalent to a linear combination of primitives in parameter space. The primitives and the activation weights are refined with Reinforcement Learning (RL). A different approach was proposed by Matsubara et al. (2011), who use DMPs in combination with a style parameter. The parameters of the DMPs are linearly interpolated according to the given style parameter. Forte et al. (2012) proposed a similar approach, where a library of DMPs learned from multiple demonstrations is used. Generalization is obtained from a Gaussian Process Regression (GPR) model, which is capable of modeling non-linear transformations of the style variable. The major limitation of approaches based on deterministic representations, e.g., on DMPs, is the inability to concurrently solve a combination of tasks, where we have one task per primitive. Since there is no notion of the importance of each time point in the trajectory, the resulting combined primitive is just an interpolation of the participating primitives' trajectories. For probabilistic representations (Khansari-Zadeh and Billard 2011; Calinon et al. 2010a), it remains unclear how primitives can be combined. In ProMPs, we propose a new combination operator based on a product of trajectory distributions. We show that by co-activating ProMPs, the resulting movement solves a combination of tasks that is given by a combination of different cost functions. We evaluate this property in two different scenarios in the experiments section.

Smoothly sequencing, also called blending, two movement primitives can be considered as a special case of the combination of MPs. Discrete DMPs can be trivially sequenced (Paraschos et al. 2009); however, the transition from one primitive to the next one is typically instantaneous, which can lead to a jump in the acceleration profile. Special cases of discrete and periodic primitive blending, such as transient motions, have been considered in Ernesti et al. (2012) and Degallier et al. (2011). As opposed to these approaches, ProMPs can cope with combination and blending of primitives independently of their periodicity.

In the next section, we will first introduce probabilistic movement primitives and show their advantageous properties. Next, we will show how to compute a time-varying feedback controller that reproduces a given trajectory distribution. Subsequently, we will demonstrate the performance and advantageous properties of ProMPs in several experiments on simulated and real robot tasks.

4 Probabilistic movement primitives (ProMPs)

ProMPs provide a single principled framework for implementing the desirable properties of MPs, summarized in Table 1. We will first introduce the probabilistic model for representing the trajectory distribution, which is based on a basis function representation. Such a representation significantly reduces the number of model parameters and facilitates learning. We proceed by illustrating how our representation can be trained from imitation data for both stroke-based and periodic movements. Training from imitation allows us to rapidly reproduce tasks that are easy to demonstrate to the robot. Here, we describe a simple maximum likelihood training procedure that can be used for stroke-based movements and an expectation-maximization algorithm that can be used to train the primitive in the case of missing data or for rhythmic movements. We continue by discussing the implementation of the desirable properties, i.e., temporal modulation of the movement, encoding of the coupling between the joints that allows the generation of coordinated movements, conditioning to generalize a trained primitive to a novel situation, adaptation to task parameters to allow task-dependent variables to modify the primitive, and combination and blending of primitives to solve more complex tasks. Finally, in Sect. 4.4, we present the analytical derivation of a stochastic feedback controller that is capable of exactly reproducing the trajectory distribution. Such a feedback controller is essential for using trajectory distributions to control a physical system (Fig. 1).

Table 1 Properties and their implementation in the ProMPs
Fig. 1 Two real robot setups that we used for the evaluation of our approach. (left) The KUKA arm playing the maracas musical instrument. We demonstrated a slow version of the rhythmic shaking movement and progressively increased the speed. (right) The KUKA arm playing with an Astrojax. The robot learned to play from demonstrations

4.1 Probabilistic trajectory representation

We start our discussion with the simple case of a single degree of freedom, where the joint angle q is a scalar, and we subsequently extend it to the multiple DoF case, where the vector \(\varvec{q}\) describes multiple joint angles. We model a single movement execution as a trajectory \(\varvec{\tau }=\left\{ q_{t}\right\} _{t=0 \ldots T}\), defined by the joint angle \(q_{t}\) over time. In our framework, a MP describes multiple ways to execute a movement, which naturally leads to a probability distribution over trajectories. We encode our policy representation with a hierarchical Bayesian model, which is presented in Fig. 2.

4.1.1 Concise encoding of trajectory distributions

Our movement primitive representation models the time-varying variance of the trajectories. Representing the variance information is crucial as it reflects the importance of single time points for the movement execution. We use a basis-function representation as it reduces the number of model parameters in comparison to a simple distribution over the joint positions for each time step. This reduction in parameters greatly facilitates learning. Additionally, it allows us to derive a continuous-time approach and to transfer data between systems, e.g., from a motion capture system to the robotic platform, directly and without interpolating the data. When controlling the system, a continuous-time approach allows for choosing the control frequency and is robust to jitter. Further, as we will discuss in Sect. 4.1.2, it enables the temporal modulation of the movement. Additionally, it allows us to generalize the primitive at any time point (Sect. 4.3.1) and to derive our feedback controller in closed form (Sect. 4.4).

Fig. 2 The hierarchical Bayesian model used in ProMPs. The probability distribution \(p( \varvec{y}_{1:T} | \varvec{w})\) of the observed trajectories depends on the parameter vector \(\varvec{w}\). The distribution over the parameter vector \(\varvec{w}\) is given by \(p(\varvec{w} | \varvec{\theta })\). The parameter vector \(\varvec{w}\) is integrated out in the ProMP formulation

We use a weight vector \(\varvec{w}\) to compactly represent a single trajectory. The probability of observing a trajectory \(\varvec{\tau }\) given the underlying weight vector \(\varvec{w}\) is given as a linear basis function model

$$\begin{aligned} \varvec{y}_{t}= & {} \left[ \begin{array}{l} q_{t} \\ \dot{q}_{t} \end{array}\right] = \varvec{\Phi }_{t}\varvec{w}+\varvec{\epsilon }_{y}, \end{aligned}$$
(1)
$$\begin{aligned} p(\varvec{\tau }|\varvec{w})= & {} \prod _{t}\mathcal {N}\left( \varvec{y}_{t} \big | \varvec{\Phi }_{t}\varvec{w},\varvec{\Sigma }_{y}\right) , \end{aligned}$$
(2)

where \(\varvec{\Phi }_{t}=[\varvec{\phi }_{t},\dot{\varvec{\phi }}_t]^\mathrm{T}\) defines the \(2\times n\) dimensional time-dependent basis function matrix for the joint positions \(q_{t}\) and velocities \(\dot{q}_{t}\). The basis functions for the velocities \(\dot{\varvec{\phi }}_t\) are the time derivatives of \(\varvec{\phi }_t\). The variable n defines the number of basis functions and \(\varvec{\epsilon }_{y}\sim \mathcal {N}\left( \varvec{0},\varvec{\Sigma }_{y}\right) \) represents zero-mean i.i.d. Gaussian noise.

In order to capture the variance of the trajectories, we introduce a distribution \(p(\varvec{w};\varvec{\theta })\) over the weight vector \(\varvec{w}\), with parameters \(\varvec{\theta }\). In most cases, the distribution \(p(\varvec{w};\varvec{\theta })\) will be Gaussian where the parameter vector \(\varvec{\theta }= \{\varvec{\mu }_{\varvec{w}}, \varvec{\Sigma }_{\varvec{w}} \}\) specifies the mean and the variance of \(\varvec{w}\). However, also more complex distributions such as Gaussian mixture models can be used for this task (Rueckert et al. 2015). The trajectory distribution \(p(\varvec{\tau };\varvec{\theta })\) can now be computed by marginalizing out the weight vector \(\varvec{w}\), i.e.

$$\begin{aligned} p(\varvec{\tau };\varvec{\theta })=\int p(\varvec{\tau }|\varvec{w})p(\varvec{w};\varvec{\theta })d\varvec{w}, \end{aligned}$$
(3)

to obtain the probability distribution over the trajectories \(\varvec{\tau }\). The distribution \(p(\varvec{\tau };\varvec{\theta })\) defines the hierarchical Bayesian model that is illustrated in Fig. 2. The model's parameters are given by the observation noise variance \(\varvec{\Sigma }_{y}\) and the parameters \(\varvec{\theta }\) of the weight distribution \(p(\varvec{w};\varvec{\theta })\).

Fig. 3 Trajectory distributions showing the joint positions (first row) and velocities (second row). The shaded area denotes two times the standard deviation. a The demonstrated trajectory distribution, generated by a stochastic optimal control algorithm for a via-point task. The resulting trajectories show variability due to the noise in the system. b The trajectory distribution generated using ProMPs (blue). ProMPs can exactly reproduce the demonstrated trajectory distribution (shown in red below the blue shaded area). c The resulting trajectory distribution produced by the inverse covariance control approach (blue). Due to latency effects, it misses the via-points in time and generates high actions, which lead to the velocity spike. d Trajectory distribution produced by DMPs. While the DMP can follow the mean of the demonstrations, it cannot adapt its variance. The accuracy at the via-points is worse than for ProMPs, while the control actions are higher in non-relevant areas of the trajectory. In blue we tuned the DMP gains to reproduce the trajectory distribution with the lowest cost, and in green we used lower gains (Color figure online)

Illustrative example To illustrate the properties of our MP representation, we use a simple toy task as a running example throughout this section, where we also compare to other state-of-the-art MP approaches. In our toy task, we use a trajectory distribution that passes through two via-points. The simulated system has linear dynamics and Gaussian i.i.d. noise on the actions. In this illustrative example, we control the acceleration of the system. We generate demonstrations with an optimal control algorithm (Toussaint 2009). The cost function is given as

$$\begin{aligned} C(\varvec{\tau }, \varvec{u}) = \!\sum _{i= \{t_\text {via} \}} (\varvec{y}_i^d - \varvec{y}_i)^T \varvec{Q} (\varvec{y}_i^d - \varvec{y}_i) + \sum _{i=1}^T \varvec{u}^T_i \varvec{R} \varvec{u}_i, \end{aligned}$$
(4)

where \(t_{\mathrm{via}} = \{0.4\,\mathrm{s}, 0.7\,\mathrm{s} \}\) is the set of time points of the via-points and \(\varvec{Q},\varvec{R}\) are the state and action cost matrices, respectively. We simulate trajectories with the resulting controller to obtain the demonstrations. The demonstrations exhibit variability due to the noise of the system. The optimal trajectory distribution is presented in Fig. 3a.
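For reference, the following is a minimal sketch (variable and function names are illustrative, not part of the original implementation) of how the via-point cost of Eq. (4) is evaluated for a sampled trajectory:

```python
# Minimal sketch of the via-point cost of Eq. (4): quadratic state costs at the
# via-point time steps plus a quadratic action penalty over the whole trajectory.
import numpy as np

def trajectory_cost(Y, U, Y_des, via_steps, Q, R):
    # Y: (T, dim_y) visited states, U: (T, dim_u) actions,
    # Y_des: (T, dim_y) desired states, via_steps: indices of the via-point steps
    state_cost = sum((Y_des[i] - Y[i]) @ Q @ (Y_des[i] - Y[i]) for i in via_steps)
    action_cost = sum(u @ R @ u for u in U)
    return state_cost + action_cost
```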

Table 2 Comparison of different control approaches on a hand-specified cost function

The use of a cost function enables us to quantify the quality of the resulting MP policies. The ProMP policy is capable of exactly reproducing the variance of the movement, as shown in Fig. 3b. For the trajectory reproduction of ProMPs, we used the controller that we describe in Sect. 4.4. Additionally, we evaluate the heuristic controller presented in Calinon et al. (2010b), which computes the feedback gains inversely proportional to the variance of the trajectory. The trajectory distribution of this inverse covariance controller does not match the demonstrated distribution, see Fig. 3c. The DMP approach uses constant feedback gains to follow a single trajectory and, hence, cannot adapt the variance of the resulting trajectory distribution. In Fig. 3d, we generated trajectory distributions for two different settings of the feedback gains to illustrate the resulting variances. We empirically optimized the gains for the inverse covariance controller and the DMPs via search. The average costs generated by each control law are shown in the upper part of Table 2. The ProMP achieves a cost similar to the optimal controller, while all other controllers cannot reproduce the optimal behavior.

Further, we compare our approach to Calinon (2016), where we fit the proposed Gaussian Mixture Model (GMM) to the demonstrations and then use Gaussian Mixture Regression (GMR) to derive the desired trajectory distribution. We present the fitted regression model in Fig. 4 (blue). We generated trajectories using Minimum Intervention Control (Calinon 2016) and present the results in Fig. 4 (red), where we jointly optimized for the number of mixture components and the action penalty. We also used the optimal number of components, but the same action penalty as in the cost function used to generate the demonstrations (green). The resulting controller cannot reproduce the given distribution.

Moreover, we evaluated our approach using simple Gaussian distributions and optimal control. At every time step, we fit a Gaussian distribution over the state and use it to define a quadratic cost function. The cost function has the form of Eq. (4), where \(\varvec{y}_i^d\) is set to the mean and \(\varvec{Q}\) to the inverse of the covariance. We optimize the action penalty \(\varvec{R}\) such that the true cost function used to generate the data is minimized. We present our results in Table 2. This approach derives the controller in the same way as Calinon (2016), but uses a simple Gaussian distribution to model each time step instead of the state-defined GMR. Compared to ProMPs, the performance on the true cost function is worse, as can be seen in the table. This approach also does not provide any generalization or modulation mechanism.

As another baseline, we fit a Gaussian distribution at every time step over the state-action space. At reproduction, we condition the distribution of that time step on the current state to obtain the action, which results in a linear Gaussian action policy. As the demonstrations have been generated by a time-dependent linear controller, the performance of this approach is close to optimal and similar to the ProMP controller, as shown in Table 2. However, fitting a Gaussian distribution over the state-action space requires the actions to be known during the demonstrations, which limits the applicability of the approach to tele-operation setups. Similar to the optimal control approach from the previous paragraph, this approach does not provide any generalization mechanism.

Fig. 4 Evaluation of the GMM-GMR approach, using the minimum intervention principle for control (Calinon 2016). The distribution learned with the GMM-GMR approach is presented in blue. The approach captures the mean of the distribution accurately; however, the variance at the via-points is higher than in the demonstrations. For reproduction, we used the optimal action penalty (red) or the same action penalty as in the demonstrations (green). While the mean of the reproductions matches the mean of the demonstrations, there is a mismatch in the variance (Color figure online)

4.1.2 Temporal modulation

With temporal modulation, we can adjust the execution speed of the movement. Similar to the DMP approach, we introduce a phase variable z to decouple the movement from the time signal. By modifying the rate of the phase variable, we can modulate the speed of the movement. Without loss of generality, we define the phase as \(z_{0}=0\) at the beginning of the movement and as \(z_{T}=1\) at the end. We typically use a constant velocity \(\dot{z}_{t}=1/T\) for reproducing the recorded motion, but we can also adapt it dynamically during the execution of the movement. The basis functions \(\varvec{\phi }_{t}\) now directly depend on the phase instead of time, such that

$$\begin{aligned} \varvec{\phi }_{t}= & {} \varvec{\phi }(z_{t}), \end{aligned}$$
(5)
$$\begin{aligned} \dot{\varvec{\phi }}_{t}= & {} \dot{\varvec{\phi }}(z_{t})\dot{z}_{t}, \end{aligned}$$
(6)

where \(\dot{\varvec{\phi }}_{t}\) denotes the corresponding derivative. An illustration of temporal scaling for our running example is shown in Fig. 5.

Fig. 5 Temporal modulation of the ProMPs. The demonstrated distribution is shown in red. The green distribution shows an execution at a slower pace, the blue one at a faster pace (Color figure online)

4.1.3 Rhythmic and stroke-based movements

The choice of the basis functions depends on the type of movement, which can be either rhythmic or stroke-based. For stroke-based movements, we use Gaussian basis functions \(b_{i}^{\text {G}}\), while for rhythmic movements, we use Von-Mises basis functions \(b_{i}^{\text {VM}}\) to model periodicity in the phase variable z, i.e.,

$$\begin{aligned} b_{i}^{\text {G}}(z)= & {} \exp \left( -\frac{(z_{t}-c_{i})^{2}}{2h}\right) , \end{aligned}$$
(7)
$$\begin{aligned} b_{i}^{\text {VM}}(z)= & {} \exp \left( \frac{{\cos (2\pi (z_{t}-c_{i}))}}{h}\right) , \end{aligned}$$
(8)

where h defines the width of the basis and \(c_{i}\) the center for the ith basis function. We normalize the basis functions

$$\begin{aligned} \phi _{i}(z_{t})=\frac{b_{i}(z)}{\sum _{j=1}^n b_{j}(z)}, \end{aligned}$$
(9)

to obtain a constant summed activation and improve the regression's performance. The centers of the basis functions are placed uniformly in \([-2h, 1+2h]\) in the phase domain. We place basis functions outside the interval [0, 1] to improve the homogeneity of the basis vector, i.e., by including the "tails" of the bases placed outside, and, therefore, improve the performance of our model.
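To make the basis construction concrete, the following sketch (not the authors' original code; the number of bases, the width h, and the queried phase value are illustrative choices) implements the normalized Gaussian and Von-Mises bases of Eqs. (7)-(9) over the phase variable z:

```python
# Minimal sketch of the phase-dependent basis functions of Eqs. (7)-(9).
import numpy as np

def gaussian_basis(z, centers, h):
    # Eq. (7): unnormalized Gaussian bases for stroke-based movements
    return np.exp(-(z - centers) ** 2 / (2.0 * h))

def von_mises_basis(z, centers, h):
    # Eq. (8): periodic Von-Mises bases for rhythmic movements
    return np.exp(np.cos(2.0 * np.pi * (z - centers)) / h)

def normalized_basis(z, centers, h, rhythmic=False):
    # Eq. (9): normalize so the activations sum to one at every phase value
    b = von_mises_basis(z, centers, h) if rhythmic else gaussian_basis(z, centers, h)
    return b / np.sum(b)

# Centers placed uniformly in [-2h, 1 + 2h], as described above
n_basis, h = 10, 0.02
centers = np.linspace(-2 * h, 1 + 2 * h, n_basis)
phi = normalized_basis(0.3, centers, h)   # activations at phase z = 0.3
```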

4.1.4 Encoding coupling between joints

So far, we have considered each degree of freedom to be modeled independently. However, for many tasks we have to coordinate the movement of multiple joints. The trajectory distributions \(p\left( \varvec{\tau };\varvec{\theta }\right) \) can be easily extended to the multi-DoF case. For each dimension i, we maintain a parameter vector \(\varvec{w}_{i}\), and we define the combined weight vector \(\varvec{w}\) as \(\varvec{w}=[\varvec{w}_{1}^{T},\ldots ,\varvec{w}_{n}^{T}]^{T}\), a concatenation of the weight vectors. The basis matrix \(\varvec{\Phi }_{t}\) now extends to a block-diagonal matrix containing the basis functions and their derivatives for each dimension. The observation vector \(\varvec{y}_{t}\) consists of the angles and velocities of all joints. The probability of an observation \(\varvec{y}\) at time t is given by

$$\begin{aligned} p(\varvec{y}_{t}|\varvec{w})= & {} \mathcal {N}\left( \left[ \begin{array}{c} \varvec{y}_{1,t}\\ \vdots \\ \varvec{y}_{d,t} \end{array}\right] \Bigg |\left[ \begin{array}{ccc} \varvec{\Phi }_{t} &{} \ldots &{} \varvec{0}\\ \vdots &{} \ddots &{} \vdots \\ \varvec{0} &{} \cdots &{} \varvec{\Phi }_{t} \end{array}\right] \varvec{w},\varvec{\Sigma }_{y}\right) \nonumber \\= & {} \mathcal {N}\left( \varvec{y}_{t}\big |\varvec{\Psi }_{t}\varvec{w},\varvec{\Sigma }_{y}\right) \end{aligned}$$
(10)

where \(\varvec{y}_{i,t}=[q_{i,t},\dot{q}_{i,t}]^{T}\) denotes the joint angle and velocity of the \(i{\text {th}}\) joint. We now maintain a distribution \(p(\varvec{w};\varvec{\theta })\) over the combined parameter vector \(\varvec{w}\). By introducing \(p(\varvec{w}; \varvec{\theta })\), we extend our representation to additionally capture the correlation between the joints. The extended multi-DoF representation is used throughout the rest of the paper, including the experimental section. Controlling the robot in a coordinated manner using the coupling between the joints, for example, allows the robot to reach a via-point defined in task space while the joints exhibit variability. In the multi-DoF model, Eq. (2) becomes

$$\begin{aligned} p(\varvec{\tau }|\varvec{w}) = \prod _{t}\mathcal {N}\left( \varvec{y}_{t} \big | \varvec{\Psi }_{t}\varvec{w},\varvec{\Sigma }_{y}\right) . \end{aligned}$$
(11)

Additionally, our model captures the covariance between joint positions and velocities for each time step. Therefore, it encodes a linear relationship between them and enables us to compute the desired velocity if the position is known, or vice versa. We further exploit this property in Sect. 4.3.1 for adaptation to novel situations.
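As a concrete illustration of Eq. (10), the following sketch (assuming the basis activations \(\varvec{\phi }_t\) and their derivatives \(\dot{\varvec{\phi }}_t\) are already available, e.g., from the basis code above; dimensions are illustrative) stacks the single-DoF blocks of Eq. (1) into the block-diagonal observation matrix \(\varvec{\Psi }_t\):

```python
# Minimal sketch of the block-diagonal multi-DoF basis matrix of Eq. (10).
import numpy as np

def single_dof_block(phi_t, dphi_t):
    # 2 x n block of Eq. (1), mapping weights to [q_t, qdot_t] for one joint
    return np.vstack([phi_t, dphi_t])

def multi_dof_basis(phi_t, dphi_t, n_dof):
    # (2 * n_dof) x (n * n_dof) block-diagonal matrix Psi_t
    return np.kron(np.eye(n_dof), single_dof_block(phi_t, dphi_t))

# Example: 7 DoFs with 10 basis functions each -> Psi_t is 14 x 70
phi_t, dphi_t = np.random.rand(10), np.random.rand(10)
Psi_t = multi_dof_basis(phi_t, dphi_t, n_dof=7)
```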

[Algorithm 1]

4.2 Learning from demonstrations

To simplify the learning of the parameters \(\varvec{\theta }\), we will assume a Gaussian distribution for \(p(\varvec{w};\varvec{\theta })=\mathcal {N}\left( \varvec{w}\big | \varvec{\mu }_{w},\varvec{\Sigma }_{w}\right) \) over the parameters \(\varvec{w}\). Consequently, the distribution of the state \(p(\varvec{y}_{t};\varvec{\theta })\) for time step t is given by

$$\begin{aligned} p\left( \varvec{y}_{t};\varvec{\theta }\right)= & {} \int \mathcal {N}\left( \varvec{y}_{t}\big |\varvec{\Psi }_{t}\varvec{w},\varvec{\Sigma }_{y}\right) \mathcal {N}\left( \varvec{w}\big |\varvec{\mu }_{\varvec{w}},\varvec{\Sigma }_{\varvec{w}}\right) d\varvec{w} \nonumber \\= & {} \mathcal {N}\left( \varvec{y}_{t}\big |\varvec{\Psi }_{t}\varvec{\mu }_{\varvec{w}},\varvec{\Psi }_{t}\varvec{\Sigma }_{\varvec{w}}\varvec{\Psi }_{t}^{T}+\varvec{\Sigma }_{y}\right) , \end{aligned}$$
(12)

and, thus, we can easily evaluate the mean and the variance for any time point t. As a ProMP represents multiple ways to execute an elemental movement, we need multiple demonstrations in order to learn \(p(\varvec{w};\varvec{\theta })\), or, in the special case that only one demonstration is available, a prior variance profile for \(p(\varvec{w})\) should be given (see footnote 2).
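A minimal sketch of evaluating the marginal of Eq. (12), under the assumption that the weight distribution parameters and the basis matrix at time t are given, could look as follows:

```python
# Minimal sketch of the state marginal of Eq. (12).
import numpy as np

def state_marginal(Psi_t, mu_w, Sigma_w, Sigma_y):
    # Mean and covariance of the state distribution at time t
    mean = Psi_t @ mu_w
    cov = Psi_t @ Sigma_w @ Psi_t.T + Sigma_y
    return mean, cov
```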

4.2.1 Learning stroke-based movements

For stroke-based movements, we can estimate the parameters \(\varvec{\theta }=\{\varvec{\mu }_{\varvec{w}},\varvec{\Sigma }_{\varvec{w}}\}\) from demonstrations by a simple maximum likelihood estimation algorithm. We estimate the weights for each trajectory individually with linear ridge regression, i.e.

$$\begin{aligned} \varvec{w}_i = \left( {\varvec{\Psi }}^T {\varvec{\Psi }} + \lambda \varvec{I} \right) ^{-1} {\varvec{\Psi }}^T \varvec{Y}_i \end{aligned}$$
(13)

where \(\varvec{Y}_i\) contains the positions of all joints and time steps of demonstration i, and \({\varvec{\Psi }}\) is the corresponding basis function matrix for all time steps. We align the demonstrations by adjusting the phase signal: for each demonstration, we set \(z_\text {begin}=0\) at the beginning and \(z_\text {end}=1\) at the end. The ridge factor \(\lambda \) is generally set to a very small value, typically \(\lambda =10^{-12}\), as larger values degrade the estimation of the trajectory distribution. In this paper, we also propose a new regularization scheme that is based on minimizing the jerk of the trajectories, i.e.,

$$\begin{aligned} \varvec{w}_i = \left( {\varvec{\Psi }}^T {\varvec{\Psi }} + \lambda \varvec{\Gamma }^T \varvec{\Gamma }\right) ^{-1} {\varvec{\Psi }}^T \varvec{Y}_i, \end{aligned}$$
(14)

where \(\varvec{\Gamma }\) denotes the third time derivative (see footnote 3) of \(\varvec{\Psi }\); the third derivative is needed because the jerk is the third derivative of the position. The jerk minimization scheme generates smoother torque profiles and, hence, performs better in the cost function comparison presented in Table 2. The mean \(\varvec{\mu }_{\varvec{w}}\) and covariance \(\varvec{\Sigma }_{\varvec{w}}\) are computed from the samples \(\varvec{w}_i\),

$$\begin{aligned} \varvec{\mu }_{\varvec{w}} = \frac{1}{N}\sum _{i=1}^N \varvec{w}_{i}, \,\,\, \varvec{\hat{\Sigma }_{\varvec{w}}} = \frac{1}{N} \sum _{i=1}^N ( \varvec{w}_i - \varvec{\mu }_{\varvec{w}} ) ( \varvec{w}_i - \varvec{\mu }_{\varvec{w}} )^T \end{aligned}$$
(15)

where N is the number of demonstrations. We use an Inverse-Wishart distribution as a prior for the covariance matrix \(\varvec{\Sigma }_{\varvec{w}}\). The maximum a-posteriori estimate of the covariance (O'Hagan and Forster 2004) given the prior becomes

$$\begin{aligned} \varvec{\Sigma }_{\varvec{w}} = \frac{N\varvec{\hat{\Sigma }_{\varvec{w}}}+\lambda _{\varvec{w}} \varvec{I}}{N+\lambda _{\varvec{w}}}, \end{aligned}$$
(16)

where the value of \(\lambda _{\varvec{w}}\) is set such that the covariance matrix \(\varvec{\Sigma }_{\varvec{w}}\) is positive-definite. The complete algorithm is shown in Algorithm 1.
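The following sketch summarizes the maximum-likelihood procedure of Eqs. (13)-(16) (cf. Algorithm 1). It is a simplified illustration rather than the published implementation; the regularizer values, variable names, and the single-DoF demonstration format are assumptions:

```python
# Minimal sketch of learning a ProMP from stroke-based demonstrations.
import numpy as np

def learn_promp(Psi, demos, Gamma=None, lam=1e-12, lam_w=1e-6):
    # Psi: (T, n_w) stacked basis matrix, demos: list of (T,) position vectors,
    # Gamma: optional third derivative of Psi for the jerk penalty of Eq. (14)
    n_w = Psi.shape[1]
    reg = lam * (np.eye(n_w) if Gamma is None else Gamma.T @ Gamma)
    # Eq. (13)/(14): ridge-regression weights of each demonstration
    W = np.array([np.linalg.solve(Psi.T @ Psi + reg, Psi.T @ Y_i) for Y_i in demos])
    mu_w = W.mean(axis=0)                                    # Eq. (15), mean
    diff = W - mu_w
    Sigma_hat = diff.T @ diff / len(demos)                   # Eq. (15), covariance
    # Eq. (16): maximum a-posteriori estimate with an inverse-Wishart prior
    Sigma_w = (len(demos) * Sigma_hat + lam_w * np.eye(n_w)) / (len(demos) + lam_w)
    return mu_w, Sigma_w
```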

4.2.2 Learning periodic movements

In this section, we present an Expectation-Maximization (EM) algorithm that can be used to learn from missing data or for rhythmic movements. Using the previous learning approach for periodic movements would require each demonstration to finish at the same state at which it started, as we use a single weight vector per demonstration and the basis functions are periodic. However, due to the variability, single trajectories typically do not end exactly where they started. Yet, rhythmic movements can be learned by using an EM algorithm that we can train with partial trajectories, i.e., trajectories that do not cover a whole period.

We derive an Expectation-Maximization (EM) algorithm that infers the latent variables, i.e., the weights for each demonstration, during training (Ewerton et al. 2015). We assume that our set of demonstrations contains multiple periods. First, we determine the period length from the demonstration and construct the basis and the phase signal. We randomly split the demonstration into N potentially overlapping segments. Each segment must be shorter than one period so that the periodicity of the basis functions does not become ambiguous within a single segment. The initial guess for the parameters is estimated using linear ridge regression. In the expectation step, we compute the posterior distribution of the weights

$$\begin{aligned} p(\varvec{w}_i| \varvec{Y}_i, \varvec{\mu }_{\varvec{w}}, \varvec{\Sigma }_{\varvec{w}}) \propto p(\varvec{Y}_i| \varvec{w}_i) p( \varvec{w}_i| \varvec{\mu }_{\varvec{w}}, \varvec{\Sigma }_{\varvec{w}}), \end{aligned}$$
(17)

for each demonstration. The posterior can be computed using the Bayes rule for Gaussian distributions. The expectation step becomes

$$\begin{aligned} \varvec{\mu }_i= & {} \varvec{\mu }_{\varvec{w}} + \varvec{\Sigma }_{\varvec{w}} \varvec{\Psi }_i^T \left( \varvec{\Psi }_i \varvec{\Sigma }_{\varvec{w}} \varvec{\Psi }_i^T \right) ^{-1} \left( \varvec{Y}_i - \varvec{\Psi }_i \varvec{\mu }_{\varvec{w}} \right) , \end{aligned}$$
(18)
$$\begin{aligned} \varvec{\Sigma }_i= & {} \varvec{\Sigma }_{\varvec{w}} - \varvec{\Sigma }_{\varvec{w}} \varvec{\Psi }^T_i \left( \varvec{\Psi }_i \varvec{\Sigma }_{\varvec{w}} \varvec{\Psi }^T_i \right) ^{-1} \varvec{\Psi }_i \varvec{\Sigma }_{\varvec{w}}, \end{aligned}$$
(19)

where the index i denotes the ith segment of the demonstration and \(\varvec{\Psi }_i\) the basis functions for that segment. We dropped the time dependency from the notation of \(\varvec{\Psi }_i\) for clarity. In the maximization step, we optimize the expected complete-data log-likelihood

$$\begin{aligned} \text {argmax}_{\varvec{\theta }^\prime } \sum _{i=1}^N \int p \left( \varvec{w}_i \big | \varvec{Y}_i, \varvec{\theta }\right) \log \left[ p \left( \varvec{Y}_i \big | \varvec{w}_i\right) p \left( \varvec{w}_i \big | \varvec{\theta }^\prime \right) \right] d\varvec{w}_i, \end{aligned}$$
(20)

where \(\varvec{\theta }^\prime = \{ \varvec{\mu }_{\varvec{w}}^\prime , \varvec{\Sigma }_{\varvec{w}}^\prime \}\) denotes the new parameters of the weight distribution. Thus, the maximization step becomes

$$\begin{aligned} \varvec{\mu }_{\varvec{w}}^\prime= & {} \frac{1}{N} \sum _{i=1}^N \varvec{\mu }_i, \end{aligned}$$
(21)
$$\begin{aligned} \varvec{\Sigma }_{\varvec{w}}^\prime= & {} \frac{1}{N} \sum _{i=1}^N \left( \left( \varvec{\mu }_i - \varvec{\mu }_{\varvec{w}}^\prime \right) \left( \varvec{\mu }_i - \varvec{\mu }_{\varvec{w}}^\prime \right) ^T + \varvec{\Sigma }_i \right) , \end{aligned}$$
(22)

for computing the updates in closed form. We iterate between the expectation and the maximization step until convergence. Our algorithm is based on the EM approach for hierarchical Bayesian models with Gaussian distributions presented in Lazaric and Ghavamzadeh (2010) and has been evaluated in Paraschos et al. (2013a) and Ewerton et al. (2015) for the ProMP representation. The algorithm for learning periodic movements is shown in Algorithm 2.

In both learning approaches, the estimate of the weight covariance \(\varvec{\Sigma }_{\varvec{w}}\) may lose positive definiteness due to numerical problems. To correct this, we use an eigen-decomposition to find the closest symmetric positive-definite matrix to our estimate, as described in Higham (1988).
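The EM updates of Eqs. (18)-(22), together with the positive-definiteness correction, can be sketched as follows; the segment representation, names, and tolerance are illustrative assumptions rather than the authors' implementation:

```python
# Minimal sketch of one EM iteration for periodic ProMPs, Eqs. (18)-(22).
import numpy as np

def em_step(segments, mu_w, Sigma_w):
    # segments: list of (Psi_i, Y_i) pairs, one per (partial) segment
    means, covs = [], []
    for Psi_i, Y_i in segments:                      # E-step, Eqs. (18)-(19)
        S = Psi_i @ Sigma_w @ Psi_i.T
        K = Sigma_w @ Psi_i.T @ np.linalg.inv(S)
        means.append(mu_w + K @ (Y_i - Psi_i @ mu_w))
        covs.append(Sigma_w - K @ Psi_i @ Sigma_w)
    mu_new = np.mean(means, axis=0)                  # M-step, Eq. (21)
    Sigma_new = np.mean([np.outer(m - mu_new, m - mu_new) + C
                         for m, C in zip(means, covs)], axis=0)   # Eq. (22)
    return mu_new, nearest_spd(Sigma_new)

def nearest_spd(A, eps=1e-10):
    # Project onto the symmetric positive-definite cone via eigen-decomposition,
    # in the spirit of the Higham (1988) correction mentioned above
    A = 0.5 * (A + A.T)
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(np.maximum(vals, eps)) @ vecs.T
```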

[Algorithm 2]

4.3 New probabilistic operators for movement primitives

With the probabilistic representation, we can exploit probabilistic operators, i.e., modulate the trajectory by conditioning and co-activate MPs by computing the product of distributions. Using Gaussian distributions for \(p(\varvec{w};\varvec{\theta })\), all operators can be computed in closed form.

4.3.1 Modulation of the trajectory distribution by conditioning

The modulation of via-points and final positions is an important property for adapting the MP to new situations. In our probabilistic formulation, such operations can be described by conditioning the MP to reach a certain state \(\varvec{y}_{t}^{*}\) at time t. Note that conditioning can be performed for any time point t. It is performed by adding a desired observation

$$\begin{aligned} \varvec{x}_{t}^{*}=\left\{ \varvec{y}_{t}^{*},\varvec{\Sigma }_{y}^{*}\right\} \end{aligned}$$
(23)

to our probabilistic model and applying Bayes theorem, i.e.

$$\begin{aligned} p\left( \varvec{w}\big |\varvec{x}_{t}^{*}\right) \propto \mathcal {N}\left( \varvec{y}_{t}^{*}\big |\varvec{\Psi }_{t}\varvec{w},\varvec{\Sigma }_{y}^{*}\right) p(\varvec{w}), \end{aligned}$$
(24)

where the state vector \(\varvec{y}_{t}^{*}\) represents the desired position and velocity vector at time t and \(\varvec{\Sigma }_{y}^{*}\) describes the accuracy of the desired observation. We can also condition on any subset of \(\varvec{y}_{t}^{*}\). For example, when specifying a desired joint position \(q_{1}\) for the first joint, the trajectory distribution will automatically infer the most probable joint positions for the remaining joints. Conditioning on part of the state is done by constructing the basis function matrix \(\varvec{\Psi }\) used in Eqs. (25) and (26) such that it contains only the variables that participate in the conditioning. For example, Maeda et al. (2014) used such an approach based on ProMPs to model human–robot interaction, where conditioning on the human movement yields the desired movement of the robot.

For Gaussian trajectory distributions, the conditional distribution \(p\left( \varvec{w}\big |\varvec{x}_{t}^{*}\right) \) for \(\varvec{w}\) is Gaussian with mean and variance

$$\begin{aligned} \varvec{\mu }_{\varvec{w}}^{[\text {new}]}= & {} \varvec{\mu }_{\varvec{w}} + {\varvec{L} \left( \varvec{y}_{t}^{*}-\varvec{\Psi }_{t}^{T}\varvec{\mu }_{\varvec{w}}\right) },\end{aligned}$$
(25)
$$\begin{aligned} \varvec{\Sigma }_{\varvec{w}}^{[\text {new}]}= & {} \varvec{\Sigma }_{\varvec{w}}-\varvec{L} \varvec{\Psi }_{t}^{T}\varvec{\Sigma }_{\varvec{w}}, \end{aligned}$$
(26)

where \(\varvec{L}\) is given by

$$\begin{aligned} \varvec{L} = \varvec{\Sigma }_{\varvec{w}}\varvec{\Psi }_{t}\left( \varvec{\Sigma }_{y}^{*}+\varvec{\Psi }_{t}^{T}\varvec{\Sigma }_{\varvec{w}}\varvec{\Psi }_{t}\right) ^{-1}. \end{aligned}$$
(27)
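A minimal sketch of the conditioning operator of Eqs. (25)-(27) follows. It is written with \(\varvec{\Psi }_t\) as the observation matrix of Eq. (10), so the update is mathematically equivalent to Eqs. (25)-(27) up to the transpose convention; partial conditioning corresponds to passing only the rows of the conditioned variables:

```python
# Minimal sketch of conditioning a ProMP on a desired observation.
import numpy as np

def condition(mu_w, Sigma_w, Psi_t, y_star, Sigma_y_star):
    # Psi_t: (dim_y, n_w) observation matrix (only the conditioned rows)
    # Eq. (27): gain of the conditioning update
    L = Sigma_w @ Psi_t.T @ np.linalg.inv(Sigma_y_star + Psi_t @ Sigma_w @ Psi_t.T)
    mu_new = mu_w + L @ (y_star - Psi_t @ mu_w)       # Eq. (25)
    Sigma_new = Sigma_w - L @ Psi_t @ Sigma_w          # Eq. (26)
    return mu_new, Sigma_new
```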

Illustrative Example Conditioning a ProMP to different target states, positions and velocities, is illustrated in Fig. 6. We observe that, despite the modulation of the ProMP by conditioning, the trajectories stay within the original distribution; how the ProMP modulates is hence learned from the original demonstrations. Modulation strategies in other approaches, such as the DMPs, do not show this effect (Schaal et al. 2005). DMPs can reach the desired target position and velocity at the end of the movement, but deform the trajectory significantly. In contrast, the trajectory distribution obtained by conditioning a ProMP even matches the distribution of the optimal controller that has the conditioned via-point as an additional cost term.

Fig. 6 Generalization of primitives. We modulate the MPs such that they pass through additional via-points (blue and green) and evaluate the quality of the generalized MP policies. The resulting distributions are illustrated only for comparison and are not used for training. The added via-points are depicted with colored boxes. a, b Evaluation of the optimal controller given the additional via-points on the final position (a) or final velocity (b). c, d Evaluation of the ProMP on the same via-points. ProMPs reproduce the optimal behavior even though the unconditioned demonstrations were used for training. e, f Generalization to the same via-points with DMPs. The position generalization is a linear interpolation of the mean trajectory and quickly leaves the demonstrated distribution. The final-velocity generalization produces drastically different trajectories than the demonstrated ones. g, h Evaluation of the optimal controller and the ProMPs on additional via-points at intermediate and final locations, which require adaptation of both the position and the velocity simultaneously (Color figure online)

4.3.2 Adaptation to task parameters

In many situations, we need to adapt the primitive based on an external state variable \(\varvec{\hat{s}}\), such as a desired target angle when shooting hockey pucks. The value of such external variables is typically known during training and also before reproduction of the primitive. Hence, we can directly learn this adaptation by learning a mapping from the external variable to the mean weight vector \(\varvec{\mu }_{\varvec{w}}\). We use a simple linear mapping, which is equivalent to modeling a joint distribution

$$\begin{aligned} p\left( \varvec{w}, \varvec{\hat{s}} \right)= & {} \mathcal {N}\left( \begin{bmatrix} \varvec{w} \\ \varvec{\hat{s}} \end{bmatrix}\big | \varvec{\mu }, \varvec{\Sigma }\right) \nonumber \\= & {} \mathcal {N}\left( \varvec{w} \big | \varvec{O} \varvec{\hat{s}} + \varvec{o} , \varvec{\Sigma }_{\varvec{w}} \right) \mathcal {N}\left( \varvec{\hat{s}} \big | \varvec{\mu }_{\varvec{\hat{s}}}, \varvec{\Sigma }_{\varvec{\hat{s}}} \right) , \end{aligned}$$
(28)

however, instead of estimating the joint distribution, the transformation parameters \(\{\varvec{O},\varvec{o}\}\) are learned directly with linear ridge regression.
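As an illustration, the mapping \(\{\varvec{O},\varvec{o}\}\) can be fitted by ridge regression on pairs of task variables and demonstration weights; the sketch below uses illustrative names and a hypothetical regularizer value:

```python
# Minimal sketch of the task-parameter adaptation of Eq. (28).
import numpy as np

def fit_task_mapping(S, W, lam=1e-6):
    # S: (N, dim_s) task variables, W: (N, n_w) weight vectors of the demonstrations
    S1 = np.hstack([S, np.ones((S.shape[0], 1))])          # append a bias term
    M = np.linalg.solve(S1.T @ S1 + lam * np.eye(S1.shape[1]), S1.T @ W)
    O, o = M[:-1].T, M[-1]                                  # linear map and offset
    return O, o

def adapted_mean(O, o, s_hat):
    # Mean of p(w | s_hat) for a new task variable s_hat
    return O @ s_hat + o
```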

4.3.3 Combination and blending of movement primitives

We can use a product of trajectory distributions to continuously combine and blend different MPs into a single movement. Suppose that we maintain a set of primitives, indexed by i, that we want to combine. We can co-activate them by taking the product of their distributions,

$$\begin{aligned} p_{\text {{new}}}(\varvec{\tau })\propto {\textstyle \prod _{i}}p_{i}(\varvec{\tau })^{\alpha ^{[i]}}, \end{aligned}$$
(29)

where the \(\alpha ^{[i]}\in [0,1]\) factors denote the activation of the \(i{\text {th}}\) primitive. The product captures the overlapping region of the active MPs, i.e., the part of the trajectory space where all MPs have high probability mass.

Fig. 7 Combination and blending of two primitives. We want to combine two MPs to obtain an MP that achieves both tasks of the single MPs at the same time. We show the resulting distribution in green and the participating primitives in blue and red. a The resulting optimal distribution, generated by adding both cost functions that were used to generate the single-primitive distributions. b Combining DMPs linearly in weight space results in a linearly interpolated trajectory. The movement misses all the via-points. c We co-activate two ProMPs with equal weights. The resulting movement passes through all via-points. d We smoothly blend from the red primitive to the blue primitive. The resulting movement (green) first follows the red primitive and, subsequently, switches to exactly following the blue primitive (Color figure online)

We also want to be able to modulate the activations of the primitives, for example, to continuously blend the movement execution from one primitive to the next one. Hence, we decompose the trajectory into its single time steps and use time-varying activation functions \(\alpha _{t}^{[i]}\), i.e.,

$$\begin{aligned} p^{*}(\varvec{\tau })&\propto \prod _t \prod _i p_i (\varvec{y}_t)^{\alpha _t^{[i]}}, \end{aligned}$$
(30)
$$\begin{aligned} p_{i}(\varvec{y}_{t})&= \int p_{i} \left( \varvec{y}_t \big | \varvec{w}^{[i]}\right) p_i \left( \varvec{w}^{[i]}\right) d\varvec{w}^{[i]}. \end{aligned}$$
(31)

For Gaussian distributions \(p_i(\varvec{y}_{t})=\mathcal {N}(\varvec{y}_{t}|\varvec{\mu }_{t}^{[i]},\varvec{\Sigma }_{t}^{[i]})\), the resulting distribution \(p^{*}(\varvec{y}_{t})\) is again Gaussian with variance and mean,

$$\begin{aligned} \varvec{\Sigma }_t^*&= \left( \sum _i \left( \varvec{\Sigma }_t^{[i]}/ \alpha _t^{[i]}\right) ^{-1}\right) ^{-1}, \end{aligned}$$
(32)
$$\begin{aligned} \varvec{\mu }_t^*&= \varvec{\Sigma }_t^* \left( \sum _i \left( \varvec{\Sigma }_t^{[i]} / \alpha _t^{[i]} \right) ^{-1} \varvec{\mu }_t^{[i]} \right) . \end{aligned}$$
(33)
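A minimal sketch of the per-time-step product of Gaussians of Eqs. (32)-(33) follows; the inputs are the means, covariances, and activations of the co-activated primitives at one time step, with illustrative names:

```python
# Minimal sketch of combining / blending primitives at a single time step.
import numpy as np

def blend_step(mus, Sigmas, alphas):
    # Eq. (32): the combined precision is the sum of the activation-scaled precisions
    precisions = [alpha * np.linalg.inv(S) for S, alpha in zip(Sigmas, alphas)]
    Sigma_star = np.linalg.inv(sum(precisions))
    # Eq. (33): precision-weighted combination of the means
    mu_star = Sigma_star @ sum(P @ mu for P, mu in zip(precisions, mus))
    return mu_star, Sigma_star
```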

Illustrative Example. Co-activation of two ProMPs is shown in Fig. 7c and blending of two ProMPs in Fig. 7d. We trained the ProMPs such that each primitive solves a different task, indicated by the via-points of the same color in the figures. The combined primitive is capable of reaching all four via-points, i.e., it achieves both tasks at the same time. Additionally, we compare our combination approach to the optimal controller obtained by adding the cost functions of the two tasks. The optimal controller results are shown in Fig. 7a. Combining movements with the DMPs results in averaging between the trajectories and, therefore, in missing all of the via-points. The trajectory distribution is shown in Fig. 7b. We quantified the results in terms of the average cost in Table 2. While the ProMP approach achieves an average cost of the same order of magnitude as the optimal controller, the performance of the DMP combination is highly degraded.

4.4 Using trajectory distributions for robot control

In order to fully exploit the properties of trajectory distributions, a policy that reproduces these distributions is needed for controlling the robot. To this end, we derive a stochastic feedback controller that can accurately reproduce the mean \(\varvec{\mu }_{t}\), the variances \(\varvec{\Sigma }_{t}\), and the correlations \(\varvec{\Sigma }_{t,t+1}\) for all time steps t of a given trajectory distribution. The derivation of the controller is based on matching the moments of Gaussian distributions. In our approach, there is no notion of a cost function.

Such a controller can only be obtained by using a model. We approximate the continuous-time dynamics of the system by a linearized discrete-time system with step duration \(\mathrm {dt}\),

$$\begin{aligned} \varvec{y}_{t+{{\mathrm{dt}}}}=\left( \varvec{I}+\varvec{A}_{t}{{\mathrm{dt}}}\right) \varvec{y}_{t}+\varvec{B}_{t}{{\mathrm{dt}}}\varvec{u}+\varvec{c}_{t}{{\mathrm{dt}}}, \end{aligned}$$
(34)

where the system matrices \(\varvec{A}_{t}\), the input matrices \(\varvec{B}_{t}\) and the drift vectors \(\varvec{c}_{t}\) can be obtained by a first-order Taylor expansion of the dynamical system at the current state \(\varvec{y}_t\).Footnote 4 We assume that the control actions are generated by a stochastic linear feedback controller with time-varying feedback gains, i.e.,

$$\begin{aligned} \varvec{u}=\varvec{K}_{t}\varvec{y}_{t}+\varvec{k}_{t}+\varvec{\epsilon }_{\varvec{u}},\quad \varvec{\epsilon }_{\varvec{u}}\sim \mathcal {N}\left( \varvec{\epsilon }_{\varvec{u}}\big |0, \varvec{\Sigma }_{\varvec{u}}{{{\mathrm{dt}}}}^{-1}\right) , \end{aligned}$$
(35)

where the matrix \(\varvec{K}_{t}\) denotes a feedback gain matrix and \(\varvec{k}_{t}\) a feed-forward component. We use a control noise which behaves like a Wiener process (Stark and Woods 2001), and, hence, its variance grows linearly with the step durationFootnote 5 \({{\mathrm{dt}}}\). By substituting Eq. (35) into Eq. (34), we can rewrite the next state of the system as

$$\begin{aligned} \varvec{y}_{t+{{\mathrm{dt}}}}&= \left( \varvec{I}+\left( \varvec{A}_{t}+\varvec{B}_{t}\varvec{K}_{t}\right) {{\mathrm{dt}}}\right) \varvec{y}_{t}\nonumber \\&\quad +\varvec{B}_{t}{{\mathrm{dt}}}(\varvec{k}_{t}+\varvec{\epsilon }_{u})+\varvec{c}{{\mathrm{dt}}}\nonumber \\&= \varvec{F}_{t}\varvec{y}_{t}+\varvec{f}_{t}+\varvec{B}_{t}{{\mathrm{dt}}}\varvec{\epsilon }_{u}, \end{aligned}$$
(36)

where we defined

$$\begin{aligned} \varvec{F}_{t}&=\left( \varvec{I}+\left( \varvec{A}_{t}+\varvec{B}_{t}\varvec{K}_{t}\right) {{\mathrm{dt}}}\right) ,\nonumber \\ \varvec{f}_{t}&=\varvec{B}_{t}\varvec{k}_{t}{{\mathrm{dt}}}+\varvec{c}{{\mathrm{dt}}}. \end{aligned}$$
(37)

We will omit the time-index as subscript for most matrices in the remainder of the paper to improve readability. From Eq. (12), we know that the distribution for our current state \(\varvec{y}_{t}\) is Gaussian with mean \(\varvec{\mu }_{t}=\varvec{\Psi }_{t}\varvec{\mu }_{w}\) and covarianceFootnote 6 \(\varvec{\Sigma }_{t}=\varvec{\Psi }_{t}\varvec{\Sigma }_{\varvec{w}}\varvec{\Psi }_{t}^{T}\). As the system dynamics are modeled by a Gaussian linear model, we can obtain the distribution of the next state \(p\left( \varvec{y}{}_{t+\mathrm {dt}}\right) \) analytically from the forward model by integrating out the current state

$$\begin{aligned} p(\varvec{y}_{t+{{\mathrm{dt}}}})&=\int _{\varvec{y}_t}\mathcal {N}\left( \varvec{y}_{t+{{\mathrm{dt}}}}\big |\varvec{F}\varvec{y}_t+\varvec{f},\varvec{\Sigma }_s{{\mathrm{dt}}}\right) \mathcal {N}\left( \varvec{y}_t \big | \varvec{\mu }_t, \varvec{\Sigma }_t\right) d\varvec{y}_t\nonumber \\&= \mathcal {N}\left( \varvec{y}_{t+{{\mathrm{dt}}}} \big |\varvec{F} \varvec{\mu }_t +\varvec{f},\varvec{F}\varvec{\Sigma }_t \varvec{F}^T+ \varvec{\Sigma }_s {{\mathrm{dt}}}\right) \!, \end{aligned}$$
(38)

where \({{\mathrm{dt}}}\varvec{\Sigma }_s ={{\mathrm{dt}}}\varvec{B} \varvec{\Sigma }_u \varvec{B}^T\) represents the system noise matrix. Both sides of Eq. (38) are Gaussian distributions. The distribution of the next state can be computed in two ways: from our desired trajectory distribution \(p(\varvec{\tau };\varvec{\theta })\) and from the forward model in Eq. (38). We proceed by matching the means and the covariances of both computations given our control law,

$$\begin{aligned} \varvec{\mu }_{t+{{\mathrm{dt}}}} ={}&\varvec{F}\varvec{\mu }_{t}+(\varvec{Bk}+\varvec{c}){{\mathrm{dt}}}, \end{aligned}$$
(39)
$$\begin{aligned} \varvec{\Sigma }_{t+{{\mathrm{dt}}}}={}&\varvec{F}\varvec{\Sigma }_{t}\varvec{F}^{T}+\varvec{\Sigma }_{s}{{\mathrm{dt}}}, \end{aligned}$$
(40)

where \(\varvec{F}\) is given in Eq. (37) and contains the time-varying feedback gains \(\varvec{K}\). Using both constraints, we can now obtain the time-dependent gains \(\varvec{K}_t\) and \(\varvec{k}_t\). Note that the linearized model given by \(\varvec{A}_t\), \(\varvec{B}_t\) and \(\varvec{c}_t\) depends on the current state \(\varvec{y}_t\), which is used as linearization point. As our computation of the gains will depend on the linearized model, our controller gains also depend implicitly on the current state, i.e., \(\varvec{K}_t = \varvec{K}(\varvec{y}_t)\) and \(\varvec{k}_t = \varvec{k}(\varvec{y}_t)\). Therefore, our controller is in fact a non-linear controller. However, we will omit the state dependence of our gains in the remaining derivation for the sake of clarity.
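To make the matching constraints concrete, the following sketch propagates the mean and covariance of the closed-loop system over one time step according to Eqs. (37), (39) and (40). All inputs are placeholders; in our framework they come from the linearized model, the controller, and the noise estimate.

```python
# Sketch of the one-step moment propagation of Eqs. (37), (39), (40).
# All inputs are placeholders; in the paper they stem from the linearized
# dynamics at the current state and from the ProMP controller.
import numpy as np

def propagate_moments(mu_t, Sigma_t, A, B, c, K, k, Sigma_s, dt):
    F = np.eye(len(mu_t)) + (A + B @ K) * dt          # Eq. (37)
    f = B @ k * dt + c * dt
    mu_next = F @ mu_t + f                            # Eq. (39)
    Sigma_next = F @ Sigma_t @ F.T + Sigma_s * dt     # Eq. (40)
    return mu_next, Sigma_next
```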

4.4.1 Derivation of the controller gains

We continue with the derivation of the controller gains, \(\varvec{K}\). To perform the derivation we assume, for the moment, that the stochasticity of the controller \(\varvec{\Sigma }_u\) is known. In Sect. 4.4.3, we show how the stochasticity of the controller can be computed in closed form. By rearranging terms, the covariance constraint becomes

$$\begin{aligned} \varvec{\Sigma }_{t+{{\mathrm{dt}}}} - \varvec{\Sigma }_t&= \varvec{\Sigma }_{s}{{\mathrm{dt}}}+\left( \varvec{A}+\varvec{B}\varvec{K}\right) \varvec{\Sigma }_t{{\mathrm{dt}}}\nonumber \\&\quad + \varvec{\Sigma }_{t}\left( \varvec{A}+\varvec{B}\varvec{K}\right) ^{T}{{\mathrm{dt}}}+O({{\mathrm{dt}}}^{2}), \end{aligned}$$
(41)

where \(O({{\mathrm{dt}}}^{2})\) denotes all second order terms in \({{\mathrm{dt}}}\). After dividing by \({{\mathrm{dt}}}\) and taking the limit of \({{\mathrm{dt}}}\rightarrow 0\), the second order terms disappear and we obtain the time derivative of the covariance

$$\begin{aligned} \dot{\varvec{\Sigma }}_t= & {} \lim _{{{\mathrm{dt}}}\rightarrow 0}\frac{\varvec{\Sigma }_{t+{{\mathrm{dt}}}}-\varvec{\Sigma }_t}{{{\mathrm{dt}}}} \nonumber \\= & {} \left( \varvec{A}+\varvec{B} \varvec{K}\right) \varvec{\Sigma }_t +\varvec{\Sigma }_t \left( \varvec{A}+\varvec{B}\varvec{K}\right) ^T+\varvec{\Sigma }_s, \end{aligned}$$
(42)

which is a special case of the continuous-time Riccati equation. Note that this operation was only possible due to the continuous-time formulation of the basis functions.

The derivative of the covariance matrix \(\dot{\varvec{\Sigma }}_{t}\) can additionally be obtained from the trajectory distribution by

$$\begin{aligned} \dot{\varvec{\Sigma }}_{t}=\dot{\varvec{\Psi }}_{t}\varvec{\Sigma }_{w}\varvec{\Psi }_{t}^{T}+\varvec{\Psi }_{t}\varvec{\Sigma }_{w}\dot{\varvec{\Psi }}_{t}^{T}, \end{aligned}$$
(43)

which we substitute into Eq. (42). After rearranging terms, the equation reads

$$\begin{aligned} \varvec{M}+\varvec{M}^{T}= \varvec{B} \varvec{K}\varvec{\Sigma }_{t}+{\left( \varvec{B}\varvec{K}\varvec{\Sigma }_{t}\right) }^{T}, \end{aligned}$$
(44)

where we defined

$$\begin{aligned} \varvec{M} = \dot{\varvec{\Psi }}_t \varvec{\Sigma }_w \varvec{\Psi }_t^T - \varvec{A} \varvec{\Sigma }_t -0.5 \varvec{\Sigma }_s, \end{aligned}$$
(45)

to demonstrate the structure of the equation. A solution can be obtained by setting \(\varvec{M}=\varvec{B} \varvec{K}\varvec{\Sigma }_{t}\) and solving for the gain matrix \(\varvec{K}\),

$$\begin{aligned} \varvec{K}=\varvec{B}^{\dag }\left( \dot{\varvec{\Psi }}_t \varvec{\Sigma }_w \varvec{\Psi }_t^T - \varvec{A}\varvec{\Sigma }_t - 0.5 \varvec{\Sigma }_s \right) \varvec{\Sigma }_t^{-1}, \end{aligned}$$
(46)

where \(\varvec{B}^{\dag }\) denotes the pseudo-inverse of the control matrix \(\varvec{B}\).
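The gain computation of Eq. (46) can be summarized in a few lines. The following sketch assumes that the basis matrix Psi_t, its time derivative dPsi_t, the weight covariance Sigma_w, the linearized model (A, B), and the system noise Sigma_s are given; all names are illustrative and not part of our released code.

```python
# Sketch of Eq. (46): feedback gains for one time step of the ProMP controller.
import numpy as np

def feedback_gains(Psi_t, dPsi_t, Sigma_w, A, B, Sigma_s):
    Sigma_t = Psi_t @ Sigma_w @ Psi_t.T                            # state covariance
    M = dPsi_t @ Sigma_w @ Psi_t.T - A @ Sigma_t - 0.5 * Sigma_s   # Eq. (45)
    return np.linalg.pinv(B) @ M @ np.linalg.inv(Sigma_t)          # Eq. (46)
```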

4.4.2 Derivation of the feed-forward controls

Similarly, we obtain the feed-forward control signal \(\varvec{k}\) by matching the mean of the trajectory distribution \(\varvec{\mu }_{t+{{\mathrm{dt}}}}\) with the mean computed with the forward model. After rearranging terms, dividing by \({{\mathrm{dt}}}\), and taking the limit of \({{\mathrm{dt}}}\rightarrow 0\), we arrive at

$$\begin{aligned} \dot{\varvec{\mu }}_{t}=\left( \varvec{A}+\varvec{B} \varvec{K}\right) \varvec{\mu }_{t}+\varvec{Bk}+\varvec{c}, \end{aligned}$$
(47)

the differential equation for the mean of the trajectory. We use the trajectory distribution \(p(\varvec{\tau };\varvec{\theta })\) to obtain \(\varvec{\mu }_t =\varvec{\Psi }_t \varvec{\mu }_w\) and \(\dot{\varvec{\mu }}_t= \dot{\varvec{\Psi }}_t \varvec{\mu }_w\) and solve Eq. (47) for \(\varvec{k}\),

$$\begin{aligned} \varvec{k}=\varvec{B}^{\dag }\left( \dot{\varvec{\Psi }}_{t}\varvec{\mu }_{w} - \left( \varvec{A}+\varvec{BK}\right) \varvec{\Psi }_{t}\varvec{\mu }_{w}-\varvec{c}\right) . \end{aligned}$$
(48)

The time-varying feedback gains \(\varvec{K}\) do not depend on the mean of the trajectory distribution, but only on the variance at that time step. Similarly, the feed-forward controls \(\varvec{k}\) depend on the variance only through the feedback gains \(\varvec{K}\) and otherwise depend only on the mean.
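A corresponding sketch for Eq. (48) computes the feed-forward term for one time step, given the gains K from Eq. (46); again, all variable names are illustrative.

```python
# Sketch of Eq. (48): feed-forward controls for one time step, with K given.
import numpy as np

def feedforward_controls(Psi_t, dPsi_t, mu_w, A, B, c, K):
    mu_t = Psi_t @ mu_w       # mean of the desired state distribution
    dmu_t = dPsi_t @ mu_w     # its time derivative
    return np.linalg.pinv(B) @ (dmu_t - (A + B @ K) @ mu_t - c)    # Eq. (48)
```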

4.4.3 Estimation of the control noise

The last step required to match the trajectory distribution is to match the control noise matrix \(\varvec{\Sigma }_{u}\) which is needed to generate the distribution. This noise can be higher than the system noise to induce a higher variance in the distribution. Such a higher variance can, for example, be useful for exploration in reinforcement learning.

We compute the system noise covariance \(\varvec{\Sigma }_{s}=\varvec{B}\varvec{\Sigma }_{u}\varvec{B}^{T}\) by examining the cross-correlation between time steps of the trajectory distribution. To do so, we compute the joint distribution \(p\left( \varvec{y}_{t},\varvec{y}_{t+{{\mathrm{dt}}}}\right) \) of the current state \(\varvec{y}_{t}\) and the next state \(\varvec{y}_{t+{{\mathrm{dt}}}}\) as

$$\begin{aligned}&p\left( \varvec{y}_{t},\varvec{y}_{t+{{\mathrm{dt}}}}\right) \nonumber \\&\quad =\mathcal {N}\left( \left[ \begin{array}{l} \varvec{y}_{t}\\ \varvec{y}_{t+{{\mathrm{dt}}}} \end{array}\right] \left. \Big |\right. \left[ \begin{array}{l} \varvec{\mu }_{t}\\ \varvec{\mu }_{t+{{\mathrm{dt}}}} \end{array}\right] ,\left[ \begin{array}{ll} \varvec{\Sigma }_{t} &{} \varvec{C}_{t}\\ \varvec{C}_{t}^{T} &{} \varvec{\Sigma }_{t+{{\mathrm{dt}}}} \end{array}\right] \right) , \end{aligned}$$
(49)

where \(\varvec{C}_{t}=\varvec{\Psi }_{t}\varvec{\Sigma }_{\varvec{w}}\varvec{\Psi }_{t+{{\mathrm{dt}}}}^{T}\) is the cross-correlation between consecutive time points. We use our linear Gaussian model to match the cross-correlation. The joint distribution for \(\varvec{y}_{t}\) and \(\varvec{y}_{t+{{\mathrm{dt}}}}\) can also be obtained from our system dynamics, i.e.,

$$\begin{aligned} p\left( \varvec{y}_t,\varvec{y}_{t+\mathrm {dt}}\right) =\mathcal {N}\left( \varvec{y}_t|\varvec{\mu }_{t},\varvec{\Sigma }_{t}\right) \mathcal {N}\left( \varvec{y}_{t+\mathrm {dt}}|\varvec{F}\varvec{y}_{t}+\varvec{f},\varvec{\Sigma }_{s}{{\mathrm{dt}}}\right) \end{aligned}$$

which yields a Gaussian distribution with mean and covariance

$$\begin{aligned} \hat{\varvec{\mu }}_t = \begin{bmatrix} \varvec{\mu }_{t}\\ \varvec{F}\varvec{\mu }_{t}+\varvec{f} \end{bmatrix}, \quad \hat{\varvec{\Sigma }}_t = \begin{bmatrix} \varvec{\Sigma }_{t} & \varvec{\Sigma }_{t}\varvec{F}^{T}\\ \varvec{F}\varvec{\Sigma }_{t} & \varvec{F}\varvec{\Sigma }_{t}\varvec{F}^{T}+\varvec{\Sigma }_{s}{{\mathrm{dt}}} \end{bmatrix}. \end{aligned}$$
(50)

The noise covariance \(\varvec{\Sigma }_{s}\) is obtained by matching both covariance matrices given in Eqs. (49) and (50),

$$\begin{aligned} \varvec{\Sigma }_s {{\mathrm{dt}}}&=\varvec{\Sigma }_{t+{{\mathrm{dt}}}} - \varvec{F} \varvec{\Sigma }_t \varvec{F}^T = \varvec{\Sigma }_{t+{{\mathrm{dt}}}} -\varvec{F} \varvec{\Sigma }_t \varvec{\Sigma }_t^{-1} \varvec{\Sigma }_t \varvec{F}^T \nonumber \\&=\varvec{\Sigma }_{t+{{\mathrm{dt}}}} - \varvec{C}_t^T \varvec{\Sigma }_t^{-1} \varvec{C}_t , \end{aligned}$$
(51)

and solving for \(\varvec{\Sigma }_{s}\). The variance \(\varvec{\Sigma }_{u}\) of the control noise is then given by

$$\begin{aligned} \varvec{\Sigma }_{u}=\varvec{B}^{\dag }\varvec{\Sigma }_{s}\varvec{B}^{\dag T}. \end{aligned}$$
(52)

The variance of our stochastic feedback controller does not depend on the controller gains and can be pre-computed before estimating the controller gains. If the computed desired control noise is smaller than the real control noise of the system, we use the control noise of the system to calculate the feedback gain matrix \(\varvec{K}\). Otherwise the estimated \(\varvec{\Sigma }_{u}\) is used to allow the trajectory distribution to increase its variance.
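The closed-form noise estimation of Eqs. (49)–(52) can be sketched as follows; the basis matrices of two consecutive time steps and the weight covariance are assumed given, and all names are illustrative.

```python
# Sketch of Eqs. (49)-(52): recover the system noise from the cross-covariance of
# consecutive states and project it to control space via the pseudo-inverse of B.
import numpy as np

def control_noise(Psi_t, Psi_next, Sigma_w, B, dt):
    Sigma_t = Psi_t @ Sigma_w @ Psi_t.T
    Sigma_next = Psi_next @ Sigma_w @ Psi_next.T
    C_t = Psi_t @ Sigma_w @ Psi_next.T                                  # Eq. (49)
    Sigma_s = (Sigma_next - C_t.T @ np.linalg.inv(Sigma_t) @ C_t) / dt  # Eq. (51)
    B_pinv = np.linalg.pinv(B)
    return B_pinv @ Sigma_s @ B_pinv.T                                  # Eq. (52)
```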

4.4.4 Controlling a physical system

On a non-linear physical system, we first obtain the linearization of the dynamics model using the current state \(\varvec{y}_t\) and use this linearization to obtain the parameters of the controller for the current time step in an online manner.

For a physical system, we also have to consider that the variance of the control noise \(\varvec{\Sigma }_{\varvec{u}}\), computed from Eq. (52), contains two sources of noise: first, the inherent system noise \(\varvec{\Sigma }_u^\prime \), and, second, the additional noise injected into the system by the demonstrator. If we applied the control noise \(\varvec{\Sigma }_{\varvec{u}}\) directly, the inherent system noise would still be present and, as a result, our controller would not match the demonstrated distribution, which already contains the system noise. Therefore, we compute the control noise covariance

$$\begin{aligned} \varvec{\Sigma }_{\varvec{u}}^{[\text {new}]} = \varvec{\Sigma }_{\varvec{u}} - \varvec{\Sigma }_u^\prime \end{aligned}$$
(53)

by subtracting the estimated system noise \(\varvec{\Sigma }_u^\prime \) from the controller noise \(\varvec{\Sigma }_{\varvec{u}}\), computed from Eq. (52). If the resulting controller noise is not positive definite, e.g., when the system noise estimate is higher than the control noise, we set the control noise to zero.
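A minimal sketch of Eq. (53), including the positive-definiteness check described above; the estimated control noise Sigma_u and the inherent system noise Sigma_u_sys are assumed given, and the names are illustrative.

```python
# Sketch of Eq. (53): subtract the estimated inherent system noise from the
# control noise; if the result is not positive definite, set it to zero.
import numpy as np

def adjust_control_noise(Sigma_u, Sigma_u_sys):
    Sigma_new = Sigma_u - Sigma_u_sys
    if np.min(np.linalg.eigvalsh(Sigma_new)) < 0.0:   # not positive definite
        return np.zeros_like(Sigma_new)
    return Sigma_new
```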

Illustrative example—robustness analysis. In order to evaluate the robustness of our approach, we test different MP approaches under strong perturbations occurring during the execution of the movement, see Fig. 8. Our control approach demonstrates compliant behavior when the variance of the movement is high. It allows larger deviations from the demonstrated distribution and takes more time to “return” to the distribution. However, it passes accurately through the via-points, as these points have small variance. The DMPs, on the other hand, use high feedback gains, which results in a less compliant movement that quickly returns to the mean trajectory. Such a strategy results in unnecessarily high control actions, as DMPs have no notion of the importance of individual time points.

Fig. 8

Robustness evaluation. We applied a perturbation between the dashed lines with an amplitude of \(P=200\,(\hbox {m/s}^2)\) (green), or an amplitude of \(P=-200\,(\hbox {m/s}^2)\) (blue). The ProMPs a show compliant behavior but pass through the via-point accurately. The DMPs b are much stiffer and compensate the perturbation faster, before the via-point was reached. The DMPs exhibit a less efficient recovery strategy due to the higher actions. a ProMPs, b DMPs (Color figure online)

Table 3 Overview of the experimental evaluation of ProMPs

4.4.5 Relation to optimal control

Our controller derivation has strong relations to optimal control (OC). Equation (42) resembles a continuous-time Riccati equation that is typically used for state estimation (Todorov 2008); only the observation noise is missing, as it is not present in our application. It is well known that state estimation and optimal control are dual problems that can be solved in the same framework (Todorov 2008). Yet, our usage of the Riccati equation is quite different from OC and state estimation. Both approaches use the Riccati equation for backwards integration of the value function or the covariance, respectively. In contrast, we assume that the covariance and its derivative are already known. In this case, we can use the Riccati equation to obtain the controller gains and no backwards integration is required. By circumventing the backwards integration, we can also avoid limitations of many OC algorithms. Almost all OC methods require a linearization of the model along a nominal mean trajectory. Using this linearization, an approximately optimal linear controller can be obtained (Li and Todorov 2010; Toussaint 2009). In contrast, our ProMP controller is non-linear, as the linearization of the system is computed online for the current state. The use of OC or state estimation would also require that we know either the reward function or the observation model. Both quantities are unknown in the imitation learning scenario.

5 Experiments

We evaluate our approach on simulated and real robot experiments. Our experimental setups cover several aspects of our framework, i.e., stroke-based and rhythmic movements, linear and non-linear systems, simple trajectory following tasks, coordinated movements, and complex experiments such as table tennis or robot hockey.

For the real-robot experiments, i.e., the Astrojax, the maracas and the hockey task, we gathered demonstrations by kinesthetic teach-in, whereas for the simulated tasks we specified a cost function for finding the optimal time-varying controller. We used the optimal control algorithm from Toussaint (2009). For stroke-based movements, we train our approach as in Sect. 4.2.1 and for periodic tasks we use the EM approach in Sect. 4.2.2. An overview of the experiments performed and their objectives is given in Table 3. The open parameters of our approach were hand-picked and no further tuning was necessary.

5.1 7-link reaching task

In this task, we use a seven-link planar robot that has to reach desired target positions in task space, at different time points, with its end-effector. Our goal is to demonstrate the co-activation of ProMPs to solve a combination of tasks by combining two different movements. In addition, the task evaluates the necessity of the coupling between the joints of the robot, which is captured by the ProMPs. As many joint configurations can lead to the same end-effector position, the end-effector of the robot can exhibit high accuracy, whereas each individual joint can exhibit higher variability. In this experiment, the end-effector has low variability at the task-space via-points. In order to successfully reproduce the demonstrated movements, the ProMPs must correctly capture and reproduce the coupling between the DoFs of the robot.

In the first set of demonstrations, the robot has to reach the via-point at \(t_{1}=0.25\,\text {s}\). The reproduced behavior with the ProMPs is illustrated in Fig. 9 (top). We learned the coupling of all seven joints with one ProMP. The ProMP exactly reproduced the via-points in task space while exhibiting a large variability for time steps in between the via-points. Moreover, the ProMP could also reproduce the coupling of the joints from the optimal control law, which can be seen from the small variance of the end-effector in comparison to the rather large variance of the single joints at the via-points. The ProMP achieved an average cost value similar to that of the optimal controller.

In the second set of demonstrations, the via-point was located at time step \(t_{2}=0.75\,\hbox {s}\). The movement of the robot is illustrated for specific time steps in Fig. 9 (middle). We combined both primitives and the resulting movement is illustrated in Fig. 9 (bottom). The combination of both MPs accurately reaches both via-points at \(t_{1}=0.25\,\text {s}\) and \(t_{2}=0.75\,\text {s}\), generating a primitive that satisfies both tasks.

Moreover, we evaluated the reproduction cost of our approach with respect to the number of training demonstrations in Fig. 10. The comparison was performed on the first set of demonstrations, i.e., the top row of Fig. 9. With only two training demonstrations, our approach depends heavily on the regularization coefficients for the estimation of the covariance matrix and, on average, produces higher actions compared to using more demonstrations for training. In Fig. 10, we show that the performance of our approach does not significantly improve when using more than 20 demonstrations for training. Additionally, we evaluated the performance of the inverse covariance controller (Calinon et al. 2010b) and the DMPs (Ijspeert et al. 2003). The cost for every experiment is averaged over 200 reproductions. Additionally, we average over 10 trials, where, for each trial, we randomly regenerated the demonstrations using an optimal control law.

Fig. 9

A 7-link planar robot has to reach a target position at \(T=1.0\,\text {s}\) with its end-effector while passing a via-point at \(t_{1}=0.25\,\text {s}\) (top) or \(t_{2}=0.75\,\text {s}\) (middle). The plot shows the mean posture of the robot at different time steps in black and samples generated by the ProMP in gray. The ProMP approach was able to exactly reproduce the demonstrations, which have been generated by an optimal control law. The combination of both learned ProMPs is shown at the bottom. The resulting movement reached both via-points with high accuracy

Fig. 10

Evaluation of the reproduction cost versus the number of demonstrations provided for training on the 7-link task-space via-point task. We present the results using ProMPs (blue), the Inv. Cov. Ctl. (red) Calinon et al. (2010b), and DMPs (green) Ijspeert et al. (2003). The cost is averaged over 200 reproductions for every approach and over 10 trials (Color figure online)

5.2 Double pendulum

In this experiment we evaluate our control approach on a system with non-linear dynamics. We use a simulated double-pendulum with unit link lengths and unit masses. Non-linearities are induced due to gravity, centripetal and Coriolis forces. During the execution of our controller we compute a linearization of the system dynamics at every time step at the state \(\varvec{y}_t\) to obtain \(\{\varvec{A}_t, \varvec{B}_t, \varvec{c}_t\}\).
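The per-time-step linearization can, for instance, be obtained numerically. The following sketch uses finite differences on a generic dynamics function \(\dot{\varvec{y}}=f(\varvec{y},\varvec{u})\) to obtain \(\{\varvec{A}_t, \varvec{B}_t, \varvec{c}_t\}\) at the current state; it is only an illustration under these assumptions, whereas an analytical first-order Taylor expansion can equally be used.

```python
# Sketch: finite-difference linearization of y_dot = f(y, u) around (y, u),
# returning A, B, c such that f(y', u') ~= A y' + B u' + c near the linearization point.
import numpy as np

def linearize(f, y, u, eps=1e-6):
    n, m = len(y), len(u)
    f0 = f(y, u)
    A, B = np.zeros((n, n)), np.zeros((n, m))
    for i in range(n):
        dy = np.zeros(n); dy[i] = eps
        A[:, i] = (f(y + dy, u) - f0) / eps
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(y, u + du) - f0) / eps
    c = f0 - A @ y - B @ u   # drift term of the local affine model
    return A, B, c
```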

Fig. 11

Double pendulum, non-linear system. In red we depict the demonstrated trajectory distribution. (first row) In this experiment, we use the optimal controller to generate demonstrations on a linear system. Subsequently, we executed our controller on a non-linear double-pendulum system. The reproduced trajectory distribution (blue) matches the demonstrations (red) despite the changed dynamics. The ProMP controller uses the linearization at the current state to compute the control gains. (second row) We illustrate the performance of our approach when using state-independent gains (blue), where the linearization is performed offline along the mean state trajectory. As can be seen, ProMPs with state-independent gains are not capable of reproducing the demonstrated trajectory distribution. In green, we evaluate the performance of a linearized version of the non-linear ProMP controller which has been learned by fitting a linear model to the data produced by the ProMP controller. The linearized ProMP controller also fails at tracking the distribution, showing that the state-dependent gains of the ProMP controller, which cause the non-linearity, are essential for accurate tracking in non-linear systems. a First joint and b second joint (Color figure online)

Fig. 12

The KUKA light-weight arm playing “Astrojax”. The robot holds one of the balls in its fingers and starts by releasing the ball that is connected to the other end of the string. It subsequently reproduces the demonstrated rhythmic movement, showing the same human-like variability in its movement pattern

In this experiment, we also evaluate the robustness of the controller to changes in the system dynamics. To this end, we generated demonstrations on a linear double-link system, i.e., without gravity, centripetal, and Coriolis forces taken into account, using the optimal controller. Subsequently, we executed the learned trajectory distribution on the non-linear dynamical system using the ProMP controller, which uses the linearization of the real dynamics. The linearization is performed in an online manner at the current state of the system for each of the reproductions, resulting in state-dependent gains and a non-linear control architecture. Our results are presented in Fig. 11. The reproduced trajectory distribution matches the demonstrations, despite the drastic change in the dynamics of the system. Additionally, we compare to a ProMP controller that uses a pre-linearization of the system dynamics along the mean trajectory, shown in Fig. 11 (second row). Linearizing at the mean trajectory results in a linear feedback controller with state-independent gains and, hence, the resulting controller cannot reproduce the demonstrated trajectory distribution. Moreover, we evaluated the reproduction of a linear Gaussian controller per time step, learned from data obtained from the ProMP controller. We used the ProMP reproductions as our classical optimal control method (Toussaint 2009) failed to find a solution that minimized the given cost function. This approach is a linearized version of the non-linear ProMP controller. Our results in Fig. 11 show that the tracking performance degrades significantly, which shows that the non-linearities of the ProMP controller are essential for accurate distribution tracking in non-linear systems.

5.3 Playing astrojax

‘Astrojax’ is a toy consisting of three balls on a string. Two balls are fixed at either end of the string, while one ball is free to slide along the string. Roughly, ‘Astrojax’ lies between a ‘YoYo’ and juggling. In order to successfully play ‘Astrojax’, the bottom two balls should orbit each other and not touch. We use the ‘Astrojax’ experiment to demonstrate that ProMPs can successfully learn and reproduce periodic movements. The real-robot setup is shown in Figs. 1 and 12. The hand performs a stable grasp and is not controlled by the ProMPs. We demonstrated a rhythmic movement to the robot which creates a “basic orbit” pattern. We subsequently used the ProMPs to learn the movement with thirty Von Mises basis functions for each joint. The robot could reproduce the behavior and recreated the same pattern, as illustrated in Fig. 12. The demonstrations exhibit a lot of variability and the robot generates periodic movements which show the same type of variability. During the demonstrations, we were capable of sustaining a successful orbit of the ‘Astrojax’ for a mean duration of \(t_{\text {demo}}=8.2\,(\hbox {s})\). During the reproduction, we achieved a mean orbiting duration of \(t_\text {reprod.}=15.2\,(\hbox {s})\). In contrast, the DMP approach would always repeat exactly the same movement, rendering the behavior different from the demonstrated one. DMPs are not capable of reproducing variability, being compliant, or generating coordinated movements. GMR approaches, to our knowledge, have not yet been applied to periodic movements. A video with the robot playing ‘Astrojax’ can be found at http://www.ausy.tu-darmstadt.de/uploads/Team/AlexandrosParaschos/Astrojax.mp4.

5.4 Robot maracas

Fig. 13

a The trajectory distribution of the fourth joint when playing maracas. The speed of the movement is adapted by modulating the speed of the phase signal \(z_t\). The desired distribution is shown in blue and the generated distribution from the feedback controller in green. Both distributions match. b, c Blending between two rhythmic movements (blue and red areas). The green area is produced by continuously switching from the blue to the red movement (Color figure online)

The maracas is a musical instrument containing grains. Shaking the maracas produces sounds. We used the KUKA lightweight arm for the experiments and the DLR hand to grasp the instrument. The hand was only used for holding the maracas and was not controlled by the ProMPs. Our setup is shown in Fig. 1.

Fig. 14

Robot hockey. The robot shoots a hockey puck. The figure shows overlaid images of the real-robot setup that is set on the floor, taken from above. We demonstrate ten straight shots with varying distances and ten shots with varying angles. The pictures show samples from the ProMP model for straight shots (b) and shots with different angles (c). Learning from the union of the two data sets yields a model that represents variance in both distance and angle (d). Co-activating the individual MPs leads to a combined MP that reproduces shots where both models had probability mass, i.e., in the center at medium distance (e). The last picture shows the effect of conditioning on the angle of the shot (f)

As demonstrating fast movements with kinesthetic teach-in can be difficult on the real robot arm due to inertia, friction, and model discrepancies, we demonstrated a slower movement of ten periods. We used this slow demonstration for learning the primitive but modulated the speed of the phase during reproduction. The sped-up execution achieved a shaking movement of appropriate speed that generates the desired sound of the instrument.

We learned the rhythmic movement using \(N=10\) Von Mises basis functions per dimension. The ProMP was trained on all seven DoFs of the robot. We optimized the parameters of the ProMP using the Expectation-Maximization algorithm. To do so, we split the demonstration into \(M=400\) segments and assigned the appropriate phase signal. We executed our controller after training and measured that the generated trajectories stay, on average, \(94.4\%\) of the total time within two standard deviations of the demonstrated distribution. After learning the ProMP model from the demonstration, we progressively increase the speed of the movement by modulating the phase, such that the robot successfully plays the instrument.

The speed of the motion can be changed during execution to achieve different sound patterns. We show an example movement of the robot in Fig. 13a. The desired trajectory distribution of the demonstrated rhythmic movement and the resulting distribution generated from the feedback controller again match.

Additionally, we demonstrated a second type of rhythmic shaking movement and used it to continuously blend between both movements to produce different sounds. One such transition between the two ProMPs is shown for one joint in Fig. 13b, c. We measured the trajectory reproduction accuracy of our controller against the desired blended distributions and found that the trajectories are within two standard deviations for 92.7 and 93.4% of the total execution time, respectively. A video showing the demonstration phase, reproduction with time modulation, and blending of two primitives can be found at http://www.ausy.tu-darmstadt.de/uploads/Team/AlexandrosParaschos/Maracas.mp4.

Fig. 15

The simulated table tennis setup. (left) Shown are the robot arm mounted on the linear axes, the ball position, the hitting plane in which the robot will try to hit the ball, and the hitting point prediction. Due to the induced noise in our simulation, the desired and actual hitting points may differ. On the opponent’s side, we can see the robot’s target for this simulation. In our experiments, we use 15 different combinations of initial ball positions and targets covering most of the table

5.5 Robot hockey

In the hockey task, the robot has to shoot a hockey puck in different directions and for different distances. The task setup is depicted in Fig. 14a. We used the KUKA lightweight arm for this experiment and controlled the accelerations of the arm with the ProMPs using an inverse dynamics controller. The control parameters of the robot at the time steps \(t_{k},\, k\in \{1,\dots ,K\}\), are the desired position vector \(\varvec{q}_{t}\in \mathbb {R}^{7}\) and the desired acceleration \(\ddot{\varvec{q}}_{t}\in \mathbb {R}^{7}\) of each joint. The ProMPs provide at every time point the desired acceleration \(\ddot{\varvec{q}}_{t}\), while the desired position \(\varvec{q}_{t}\) is obtained from second-order Euler integration of the acceleration. The duration of the control step is \({{\mathrm{dt}}}=1\) ms. A hockey stick is mounted as an end-effector for hitting the puck.

We again used two sets of demonstrations. The first set contained \(M_{1}=10\) demonstrations where the robot shot the puck straight at varying distances. The demonstrations were provided by a human tutor, using kinesthetic teaching. The second set also contained \(M_{2}=10\) demonstrations where the demonstrator shot the puck at varying angles, while trying to keep the variance of the distance relatively small. For both demonstration sets, we trained two ProMPs using \(N=10\) Gaussian basis functions per dimension, which resulted in a weight vector \(\varvec{w}\in \mathbb {R}^{70}\). By reproducing the learned primitives, we obtain the behaviors illustrated in Fig. 14b, c, respectively. The shots exhibit the demonstrated variability in either angle or distance. We generated the images in Fig. 14 by taking a picture of the robot’s configuration after the execution of the primitive, once the puck had stopped. The figures show an overlay of the images from multiple executions of each primitive. By training a primitive on the union of the two datasets, the robot is able to shoot the puck at a variety of angles and distances, as illustrated in Fig. 14d. Additionally, we co-activated the two individual primitives and the resulting MP shoots only in the center at medium distance, i.e., the intersection of both MPs, as illustrated in Fig. 14e. This experiment again illustrates the achievement of a combination of tasks, where the first task was to shoot at a desired angle and the second, to shoot at a desired distance.

Finally, we learned a conditional distribution over the trajectories, conditioned on the angle of the final puck position, as described in Sect. 4.3.2. The resulting primitive was able to shoot at the desired angle, as illustrated in Fig. 14f. All operations are computed in closed form; no re-estimation of the primitive parameters is needed to compute the generalization or the combination of the primitives.

We provide a cost function evaluation of the two demonstrated datasets, the “angle” and the “distance” dataset, and the respective reproductions in Table 4. The cost function is chosen intuitively to resemble the desired task. By giving the human demonstrator a specific task, we can assume that he is minimizing a similar cost function, at least approximately. Our approach successfully reproduces the same costs as in the demonstrations. The “distance” dataset contains demonstrations that shoot the puck at different distances but aim at the same angle; therefore, its cost function only penalizes deviations from the desired angle. Similarly, in the “angle” dataset, the cost function penalizes deviations from the desired distance. Since shooting the puck at a specific distance is quite hard due to different environmental factors, e.g., the friction between the puck surface and the floor, we chose a lower deviation penalty.

We also evaluated the cost on the combined movement which is supposed to solve both tasks, i.e., shoot at a specific distance and angle. For this evaluation, we added the cost functions from the “distance” and “angle” datasets. In Table 4, we show that the reproduction of the combination, which is a newly composed behavior not present in the demonstrations, achieves significantly lower costs than both original datasets.

Table 4 Evaluation of the average cost for the Robot Hockey experiment

5.6 Simulated table tennis

In this experiment, we evaluate the generalization capabilities of the ProMPs for a complex task. As a comparison, we use the DMP approach presented in Kober et al. (2010). The robot, a simulated BioRob 5-DoF arm (Klug et al. 2008), is mounted on two linear axes and equipped with an additional shoulder joint. The setup is shown in Fig. 15. We control the robot with inverse dynamics control. We used an imperfect inverse dynamics model to render the simulation more realistic. As a result, the desired and actual trajectories do not match exactly, making the robot more sensitive to jerky movements, which are harder to track. At the beginning of each experiment, the ball is set to different pre-specified positions and initial velocities.

The robot has to return the ball to a specific target area on the opponent’s side of the table. For this experiment, we gathered trajectories for 15 different combinations of initial ball configurations and robot targets, generated from an analytical player (Muelling et al. 2011). We trained the ProMP approach with the whole data set and created a single primitive. In our experiment, the ball state is set at the beginning of a trial and the ProMP is conditioned on the predicted hitting position and velocity in joint space, obtained from the analytical player. A delay before the start of the execution of the primitive is provided by the simulation. In order to make the task more realistic, we assume that the ball state is estimated, instead of being directly observed, with zero-mean i.i.d. Gaussian noise. The noise on the ball position increases the task difficulty significantly as it also affects the estimated time until the ball reaches the hitting plane. We evaluate the ProMPs and the DMPs on each of the 15 task setups by computing the average distance to the target and the average success rate. We display our results in Fig. 16.

The DMP was trained with only one demonstration, while the goal position and velocity were modified according to the predicted hitting point using the approach presented in Kober et al. (2010). The DMP had inferior performance as it significantly deforms the trajectories, which makes the resulting trajectory harder to track as the feedback controller saturates at the torque limits due to the deformation. This saturation has the effect that the robot does not reach the specified hitting point with the specified velocity.

Fig. 16

The distance between the impact position of the ball on the opponent’s field and the actual targeted point, in meters, for the DMP and the ProMP approaches. We tested 15 different configurations of ball initial states and robot targets. We average the results over 20 samples where Gaussian observation noise was added to the initial ball position. The bars denote the mean error and the error-bars one standard deviation. (bottom) Shows the success rate for each configuration. If the distance between the landing position and the target position is less than 0.4 meters, it is counted as a success. The performance of ProMPs is superior in all experiments, generally leading to smaller errors with an increased success rate

6 Discussion and conclusion

Probabilistic movement primitives are a promising approach for learning, modulating, and re-using movements in a modular control architecture. To effectively take advantage of such a control architecture, ProMPs support simultaneous activation, match the quality of the encoded behavior from the demonstrations, are able to adapt to different desired target positions, and can be efficiently learned by imitation. In ProMPs, we parametrize the desired trajectory distribution of the primitive by a hierarchical Bayesian model with Gaussian distributions. The trajectory distribution can be easily obtained from demonstrations and simultaneously defines a feedback controller which is used for movement execution. Our probabilistic formulation introduces new operations for movement primitives, such as conditioning and combination of primitives. All these mechanisms do not exist for alternative representations and, with ProMPs, we provide a single mathematical framework to describe them. Future work will focus on using the ProMPs in a modular control architecture and improving upon imitation learning by reinforcement learning.

The advanced flexibility of ProMPs comes at the cost of requiring multiple demonstrations in order to accurately encode the distribution over the trajectories. The number of demonstrations required depends on the complexity of the task and, from our experience, \({\sim }10{-}20\) suffice for simple tasks. Prior knowledge about the task can be incorporated by using prior distributions and regularization techniques. Furthermore, our approach is appropriate for tasks that have a strong coupling to time. For tasks loosely coupled with time, other approaches might produce better results. Finally, it should be noted that our approach cannot capture multiple modes, since we only use a single Gaussian component to encode the trajectory distribution.