1 Introduction

Humans are very good at learning and reproducing complex tasks, but it is often tedious and cumbersome to program a robot for performing them. Programming by demonstration (PbD) alleviates this problem, giving the possibility to teach a skill to a robot through demonstrations. The skill can be acquired from an expert in the corresponding field and removes the bottleneck for the teacher to have knowledge of robotics or programming. This also provides a great potential for industrial applications as PbD can significantly reduce the setting up time of an assembly line. Since PbD aims at speeding the setting up time, collecting a lot of demonstrations can be an expensive and time consuming task. Thus, it is a desirable attribute to learn from as few demonstrations as possible. Moreover, since the demonstrations are finite, the learned controller should not only be able to generate motions within the demonstrated ranges, but also beyond them. There are skills in which multiple demonstrations can look very different due to the underlying task-specific variations [18, 27, 28]. As an example, Fig. 1 presents a sweeping task. In this task, the trash position can completely modify the trajectory of the broom, even for a fixed starting and collection point. For this task, the trash position can be interpreted as a task parameter, governing the variations in different demonstrations.

This paper utilizes these types of task-specific demonstrations. Our approach combines dynamical systems [19, 26] and statistical machine learning techniques [7]. The existing works extending dynamic movement primitives (DMPs) to include task parameters have used discriminative approaches for learning [10, 18, 27, 28]. Discriminative models are used for modeling the dependence of an unobserved variable y on an observed variable x. This is done by modeling the conditional probability distribution P(y|x), which is then used for predicting y from x. On the contrary, we combine a generative approach with the DMP, as it can make use of incomplete/unlabeled training data. This results in an 1-Step learning procedure similar to [27]. Generative models encode the joint distribution of P(xy) of the variables of interest. The conditional distribution can later be inferred from the joint distribution, i.e., \(P(y|x) = P(x,y)/p(x)\) where p(x) is obtained by marginalizing out y from P(xy). The aim of this work is to learn from very few demonstrations which implies sparsely distributed data. We solve the data sparsity problem by augmenting training data with additional incomplete data while poor local optima are avoided with deterministic annealing expectation maximization (DAEM).

The main contributions of this paper are:

  • Instead of a discriminative model, we have used a generative model for modeling the forcing terms in a task-parameterized DMP (Sect. 3).

  • The local maxima problem during the likelihood maximization is resolved with DAEM (Sect. 3).

  • We solve the data scarcity problem by using additional incomplete data and the expectation maximization (EM) algorithm [9] (Sect. 4). The detailed derivation of the incomplete data EM is also provided (Appendix).

  • Through simulated and real robot experiments, we show that our approach requires very few demonstrations for learning and provides superior extrapolation capabilities when compared with the related works (Sect. 6).

Fig. 1
figure 1

This figure shows the illustration of a sweeping task. The position of the trash (colored circles) can be considered as the task parameter, governing variations in the demonstrations. For a new trash position (blue circle), which is away from the demonstrated region, the robot should be able to generate trajectory for moving trash to the collection point (color figure online)

2 Movement primitives

2.1 Dynamic movement primitive

DMP is a way to learn motor actions [26]. It can encode discrete as well as rhythmic movements. We consider the DMP formulation presented in [19], as it overcomes the numerical problems which arises when changing the goal position in the original formulation [26]. A separate DMP is learned for each considered degree of freedom (DOF). A canonical system acts as a clock and for synchronization each DMP is driven by the common clock signal.

$$\begin{aligned} \tau \dot{s}= -\alpha _s s \end{aligned}$$
(1)

The parameter s is usually initialized to one and it monotonically decays to zero, \(\tau \) is the temporal scaling factor while \(\alpha _s\) determines the duration of the movement. From Eq. (1), the time t and s are related as \(s(t)=\exp ({\frac{-\alpha _s t}{\tau }})\). The canonical system drives the second-order transformed system:

$$\begin{aligned} \tau \dot{v}= & {} k(g-x)-d v -k(g-x_0) s + s k \mathcal {F}(s) \\ \tau \dot{x}= & {} v \end{aligned}$$

where g and \(x_0\) are goal and start positions, respectively, k acts like a spring constant while the damping term d is set such that the system is critically damped. The learning of forcing term \(\mathcal {F}(s)\) allows arbitrarily complex movements. \(\mathcal {F}(s)\) is defined as \(\frac{\sum \nolimits _{i=1}^{K} \psi _i (s) \omega _i }{\sum \nolimits _{i=1}^{K} \psi _i (s)}\) where \(\psi _i(s) = \exp (-h_i {(s-c_i)}^2) \) are Gaussian basis functions with spread \(h_i\), centers \(c_i\) and adjustable weights \(\omega _i\). To encode a movement, we first register x(t) and its first and second derivatives v(t) and \(\dot{v}(t)\), respectively, at each time step \(t=0,\ldots ,T\). Then for a suitable value of \(\tau \), we integrate the canonical system and calculate the target value \(\mathcal {F}_\mathrm{tar} (s)\) for each time step.

$$\begin{aligned} \mathcal {F}_\mathrm{tar}(s) =\frac{\dot{v}\tau - k(g-x) +d v +k(g-x_0) s }{sk} \end{aligned}$$

Now learning is performed to minimize the error criterion \(J=\sum _s {\left( \mathcal {F}_\mathrm{tar} (s) - \mathcal {F}(s) \right) } ^2\) which is a linear regression problem and the weights \(\omega _i\) are learned with weighted least squares.

2.2 DMP learning with a Gaussian mixture model

The forcing term \(\mathcal {F}(s)\) can be encoded with a Gaussian mixture model (GMM) [2] or any other suitable function approximator [27]. When using a GMM, the manual specification of the meta parameters related to the basis functions (means and spread) is not required as the means and covariances of the GMM components are learned using EM. That is why the GMM-based encoding also requires less number of components as compared with the number of basis functions. The number of GMM components can also be optimized by using an appropriate model selection criterion [22]. A GMM with K components is parameterized by \(\varvec{\theta }_{(K)}={\{\varvec{\pi }_k,{\varvec{\mu }}_k,{\varvec{{\varvec{\Sigma }}}}_k\}}_{k=1}^K\), where \(\varvec{\pi }_1,\dots ,\varvec{\pi }_K\) are mixing coefficients with constraints \(\varvec{\pi }_{k} > 0 \quad \text{ and } \quad \sum _{k=1}^{K} \varvec{\pi }_{k} =1 \), \({\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_K\) are means and \({\varvec{\Sigma }}_1,\dots ,{\varvec{\Sigma }}_K\) are covariance matrices. The learning scheme is as follows. First a dataset is created

$$\begin{aligned} \mathbf x = {\left( \begin{array}{ccc} s_{1} &{}\quad \ldots &{}\quad s_{T} \\ \mathcal {F}_\mathrm{tar}(s_{1}) &{}\quad \ldots &{}\quad \mathcal {F}_\mathrm{tar}(s_{T}) \end{array} \right) } \end{aligned}$$
(2)

Then, a GMM is fitted to the data with EM [9]. Eigenvalues of covariance matrices are regularized to avoid singularity during EM [5]. Now for retrieving \(\mathcal {F}(s)\) for a given s value we can use Gaussian mixture regression (GMR) [4]. In GMR, input and output variables in each component are represented separately

$$\begin{aligned} \varvec{\mu }_{k} = \begin{bmatrix} \varvec{\mu }_{k}^{I}\\ \varvec{\mu }_{k}^{O}\end{bmatrix},\quad \varvec{\Sigma }_{k} = \begin{bmatrix} \varvec{\Sigma }_{k}^{I}&\quad \varvec{\Sigma }_{k}^{IO}\\ \varvec{\Sigma }_{k}^{OI}&\quad \varvec{\Sigma }_{k}^{O}\end{bmatrix} \end{aligned}$$

For a given input variable \(\mathbf x ^{I}\), the expected value of \(\mathbf x ^{O}\) is calculated as:

$$\begin{aligned} E(\mathbf x ^{O} | \mathbf x ^{I} ) = \sum _{k=1}^{K} h_{k} \hat{\mathbf{x }}_{k} \end{aligned}$$
$$\begin{aligned} \text {with} \qquad&h_{k}&= \frac{\varvec{\pi }_{k}\mathcal {N}\left( \mathbf x ^{I}; \varvec{\mu }_{k}^{I},\varvec{\Sigma }_{k}^{I}\right) }{\sum _{l=1}^{K}\varvec{\pi }_{l}\mathcal {N}\left( \mathbf x ^{I}; \varvec{\mu }_{l}^{I},\varvec{\Sigma }_{l}^{I}\right) } \\&\hat{\mathbf{x }}_{k}&= \varvec{\mu }_{k}^{O} + \varvec{\Sigma }_{k}^{OI}{\left( \varvec{\Sigma }_{k}^{I}\right) }^{-1} \left( \mathbf x ^{I} - \varvec{\mu }_{k}^{I}\right) \end{aligned}$$
Fig. 2
figure 2

TP-DMP learning using mixture of GMMs. a The mixture of GMMs, b the underlying regression surface and c the intuitive reasoning for such a response

3 Task-parameterized DMP

The DMP parameters can be separated into two types: (1) the shape parameters \(\omega _i\) associated with the basis functions and (2) the DMP meta parameters which are all parameters other than the shape parameters, i.e., \(\tau ,g,K,\), etc. DMP presented in Sect. 2 does not consider external parameters \(\varvec{\mathcal {T}}\), which are referred to as task parameters in this work (e.g., the trash position in Fig. 1). Also the only input to a DMP is the clock signal. In the task-parameterized DMP (TP-DMP), we firstly want to learn from multiple demonstrations executed for different task parameters. Secondly for adapting the motion to a new task, the task parameters should also be passed as an input along with the clock signal. Following are the preprocessing steps that we consider in our TP-DMP framework:

  1. 1.

    Since a common clock (canonical system) drive all DMPs, we assign a common time duration to all demonstrations.

  2. 2.

    All demonstrations are linearly resampled to have an equal number of samples. A sample inbetween two data points is created by linear interpolation of the neighboring data points.

3.1 TP-DMP learning using mixture of GMMs

We consider learning as a density estimation problem where we want to learn the joint distribution of \((s,\varvec{\mathcal {T}},\mathcal {F})\). Learning a single GMM over all demonstrations encoding \((s,\varvec{\mathcal {T}},\mathcal {F})\) can suffer from curse of dimensionality in higher-dimensional space. That is why we learn a separate GMM for each demonstration in lower-dimensional space, i.e., \((s,\mathcal {F})\), by first creating the data sets as in Eq. (2) and then applying EM as mentioned in Sect. 2.2.

$$\begin{aligned} \varvec{\mu }_{o,m} = {\left( \begin{array}{c} \varvec{\mu }_{o,m}^{s} \\ \varvec{\mu }_{o,m}^{\mathcal {F}} \end{array} \right) },\quad \varvec{\Sigma }_{o,m} = {\left( \begin{array}{cc} \varvec{\Sigma }_{o,m}^{ss} &{}\quad \varvec{\Sigma }_{o,m}^{s \mathcal {F}} \\ \varvec{\Sigma }_{o,m}^{\mathcal {F} s} &{}\quad \varvec{\Sigma }_{o,m}^{\mathcal {F} \mathcal {F}} \end{array} \right) } \end{aligned}$$

Here, the subscripts o and m denote the indexes of demonstrations and components, respectively, while the terms s and \(\mathcal {F}\) in superscript denote the dimensions corresponding to s and \(\mathcal {F}\), respectively. Since the task parameters remain constant during a demonstration, we can simply concatenate their values in the learned means of the GMM components. The diagonal values corresponding to the task parameters in the covariances are not learned and are set to small value \(\epsilon \)

$$\begin{aligned} \varvec{\mu }_{o,m}= & {} {\left( \begin{array}{c} \varvec{\mu }_{o,m}^{s} \\ \varvec{\mathcal {T}}_{o} \\ \varvec{\mu }_{o,m}^{\mathcal {F}} \end{array} \right) },\nonumber \\ \varvec{\Sigma }_{o,m}= & {} {\left( \begin{array}{ccccc} \varvec{\Sigma }_{o,m}^{s s} &{}\quad 0 &{}\quad \dots &{}\quad 0 &{}\quad \varvec{\Sigma }_{o,m}^{s \mathcal {F}} \\ 0 &{}\quad \epsilon &{}\quad \ddots &{}\quad &{}\quad 0 \\ \vdots &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad &{}\quad \ddots &{}\quad \epsilon &{}\quad 0 \\ \varvec{\Sigma }_{o,m}^{\mathcal {F} s} &{}\quad 0 &{}\quad \dots &{}\quad 0 &{}\quad \varvec{\Sigma }_{o,m}^{\mathcal {F} \mathcal {F}} \end{array} \right) }. \end{aligned}$$

If the task parameters are varying during the demonstrations, the GMMs can simply be learned over all of the variables, i.e., \((s,\varvec{\mathcal {T}},\mathcal {F})\), and we will have unconstrained covariances. In fact, it is the fixed task parameters for each demonstration that makes the learning challenging. The idea of combining separately learned HMMs has also been introduced in [13], but in our approach we apply an additional EM cycle over the separately learned GMMs for achieving task-specific generalization. We mix the separately learned GMMs to achieve generalization for novel task parameter values. Similar to the mixing coefficients \(\varvec{\pi }\) which represent the weights of the components within a GMM, we introduce a mixing coefficient \(\varvec{\phi }\) representing the mixing weight of each GMM, having the same constraints as that of \(\pi \), i.e., \(\pi _i>0\) and \(\sum \nolimits _{i=1}^{K} \pi _i = 1\). With these settings we apply EM to learn a mixture of GMMs. We do not update the mixing weights and the means of the components within each GMM as they are important for preserving the local behavior of each demonstration. As EM maximizes the likelihood locally, it can converge to a local maxima. To overcome local maxima problem, we use DAEM [29]. DAEM has a temperature parameter \(\beta \) which has a small value at the beginning and it gradually increases to one. It uses the ideas from statistical mechanics, by applying the concept of maximum entropy. The objective function is considered as the thermodynamic free energy, which is regulated by the temperature. EM is applied deterministically for each temperature value and the estimated parameters become initialization for the next temperature value. At lower temperature values, DAEM suppresses poor local optima and increases the likelihood of convergence to the global maximum. For M demonstrations with T data points in each demonstration, we first create a single dataset containing all demonstrations (define \(N = T M\)) and then apply the DAEM for learning mixture of GMMs, whose update equations can be written as:

E-step

$$\begin{aligned} p_{i,o,m}= & {} \left( \varvec{\phi }_{o}^{t} \varvec{\pi }_{o,m}\mathcal {N}\left( \mathbf x _{i};\varvec{\mu }_{o,m}^{t},\varvec{\Sigma }_{o,m}^{t}\right) \right) ^{\beta } \\ b_{i,o}= & {} \frac{ \sum \nolimits _{l=1}^{K} p_{i,o,l}}{\sum \nolimits _{r=1}^{M} \sum \nolimits _{l=1}^{K} p_{i,r,l}}, \quad q_{i,o,m} = \frac{ p_{i,o,m}}{\sum \nolimits _{r=1}^{M} \sum \nolimits _{l=1}^{K} p_{i,r,l}} \end{aligned}$$

M-step

$$\begin{aligned} \varvec{\phi }_{o}^{t+1}= & {} \frac{\sum \nolimits _{i=1}^{N} b_{i,o}}{N}, \\ \varvec{\Sigma }_{o,m}^{t+1}= & {} \frac{\sum \nolimits _{i=1}^{N}q_{i,o,m}(\mathbf x _{i} - \varvec{\mu }_{o,m}){(\mathbf x _{i} - \varvec{\mu }_{o,m})}^\top }{\sum _{i=1}^{N} q_{i,o,m}} \end{aligned}$$

where \(b_{i,o}\) represents the responsibility that the oth GMM takes for explaining the ith data point while \(q_{i,o,m}\) represents the responsibility that the mth component of the oth GMM takes for explaining the ith data point. It is a common practice to apply regularization on learned covariances for avoiding singularities during EM. As a regularization measure, if any of the eigenvalues of the learned covariances becomes lower than a predefined threshold \(\epsilon \), then we reset it to \(\epsilon \). We apply the described approach to the mixture of GMMs learned for encoding the forcing terms of a DMP, with variations along a scalar task parameter. The learned mixture of GMMs, which is also the estimated density function, is shown in Fig. 2a. The GMMs remain unchanged. The regression for the mixture of GMMs is calculated as:

$$\begin{aligned} E(\mathbf x ^{O} | \mathbf x ^{I} ) = \sum \limits _{r=1}^{M} \sum \limits _{l=1}^{K} v_{r,l} \hat{\mathbf{x }}_{r,l} \end{aligned}$$
$$\begin{aligned} \text {with}&\quad v_{o,m}&= \frac{ \varvec{\phi }_{o} \varvec{\pi }_{o,m}\mathcal {N}\left( \mathbf x ^{I}; \varvec{\mu }_{o,m}^{I},\varvec{\Sigma }_{o,m}^{I}\right) }{\sum \nolimits _{r=1}^{M} \sum \nolimits _{l=1}^{K} \varvec{\phi }_{r} \varvec{\pi }_{r,l}\mathcal {N}\left( \mathbf x ^{I}; \varvec{\mu }_{r,l}^{I},\varvec{\Sigma }_{r,l}^{I}\right) } \\&\hat{\mathbf{x }}_{o,m}&= \varvec{\mu }_{o,m}^{O} + \varvec{\Sigma }_{o,m}^{OI}{\left( \varvec{\Sigma }_{o,m}^{I}\right) }^{-1} \left( \mathbf x ^{I} - \varvec{\mu }_{o,m}^{I}\right) . \end{aligned}$$
Fig. 3
figure 3

Result of manually setting the variance along task variable. a The mixture of GMMs, b the mixture of GMMs after manually setting the variance along task variable and c the underlying regression surface

For evaluation, we generate linearly spaced samples of \(\varvec{\mathcal {T}}\) and s within the demonstrated ranges and use GMR to predict the value of \(\mathcal {F}\) for each sample, i.e., \( \{ s \times \varvec{\mathcal {T}} \} \mapsto \mathcal {F}\). The surface plot of this data can be visualized in Fig. 2b. We can see that it has a step along the task parameter. This shows that as we have sparse data in the task space (only two trajectories), each GMM in the mixture kept concentrated at regions of demonstrations. Due to data sparsity, the density estimate is overfitted in its current form. The reason for the step in surface plot can be explained by Fig. 2c. If we have well-separated Gaussians (Fig. 2c-top) then their activation functions, calculated as the responsibility term in EM, switches in a very narrow region (Fig. 2c-bottom). The same phenomenon occurred in the regression surface where regions close to a GMM are mostly influenced by it. As we move away, the activation transits sharply to a nearby GMM. This also means that this model is overfitted and can only be useful for exact reproduction of task parameter values in training dataset and it fails to generalize for novel task parameter values. Instead of using EM, a simple trick to avoid overfitting problem could have been to manually specify a reasonable value of variance \(\epsilon \) for the task variables, as in the LWR-based approaches [28]. This is analogous of applying a regularization term along task variables. Although this can provide interpolation behavior, it fails to extrapolate beyond the demonstrated regions, as the underlying regression surface changes its behavior beyond the demonstrated interval as shown in Fig. 3.

4 Generalizing from incomplete data via the EM

Due to the few demonstrations, the challenge of data sparsity is posed in task space. We show that the data sparsity problem can be solved by augmenting the training data with the additional incomplete data spanning the input space. This can subsequently be used for getting a better estimate of the underlying distribution and thus improving the generalization behavior of the already learned GMMs. Since the mixture of GMMs is a generative model, it can benefit from incomplete data [3].

4.1 Defining input data distribution

For a task-parameterized DMP, the input variables are \((s,\varvec{\mathcal {T}})\) for which we want to predict the value of output \(\mathcal {F}\). Without loss of generality, we can assume that the input variables are independent of each other but conditionally dependent for a given output value, as shown in Fig. 4.

Fig. 4
figure 4

Graphical model illustrating dependence of input and output variables

We first separately model the distribution of each input variable. Among the input variables, the clock signal s is generated by an exponentially decaying function (canonical system) and has uneven distribution of samples in different regions. Since a GMM can model any arbitrarily complex density function, we model the distribution of s by fitting a univariate GMM with W components (through EM) to its samples \(\varvec{\theta }^s_{(W)}={\{{\pi }_w^s,\mu _w^s,{({\sigma _w^s})^2}\}}_{w=1}^W\). The same procedure cannot be used for the very limited samples of task parameters \(\varvec{\mathcal {T}}\), which are equal to the number of demonstrations. For simplicity, we assume each dimension of task variables to follow a univariate normal distribution, i.e., for the dth dimension of \(\varvec{\mathcal {T}} = [ \mathcal {T}^1 \ldots \mathcal {T}^d \ldots \mathcal {T}^D ]^\top \), \(\mathcal {T}^d \sim \mathcal {N}\big ( \mu _d,\sigma _d^2\big ) \) where \(\mu _d = \frac{\mathcal {T}_1^d+\mathcal {T}_2^d+\dots +\mathcal {T}_M^d}{M}\) and \(\sigma _d^2 = \frac{1}{M-1} \sum \nolimits _{i=1}^{M} (\mathcal {T}_i^d-\mu _d)^2\).

Since the inputs are provided in a regression problem, we refer to them as the observable variables. Similarly, the output variables that need to be predicted are termed as missing variables. Using the assumption of independence, the resultant distribution of the input variables is defined by concatenating all the distribution learned separately to form a multivariate GMM with W components.

\({\varvec{\theta }}^\mathrm{obs}_{(W)}={\{{\pi }_w^\mathrm{obs},{\varvec{\mu }}_w^\mathrm{obs},{{\varvec{\Sigma }}}_w^\mathrm{obs}\}}_{w=1}^W\) with \(\pi _w^\mathrm{obs} = \pi _w^s\),

\(\varvec{\mu }_w^\mathrm{obs} = \begin{bmatrix} \mu _w^s \\ \mu _1 \\ \vdots \\ \mu _D \end{bmatrix}\)    and    \(\varvec{\Sigma }_{w}^\mathrm{obs} = {\left[ \begin{array}{cccc} (\sigma _{w}^{s})^2 &{}\quad 0 &{}\quad \dots &{}\quad 0 \\ 0 &{}\quad \sigma _1^2 &{}\quad \ddots &{}\quad \vdots \\ \vdots &{}\quad \ddots &{}\quad \ddots &{}\quad 0 \\ 0 &{}\quad \ldots &{}\quad 0 &{}\quad \sigma _D^2 \end{array} \right] }\).

As the output dimension is not considered for now, we term the input data distribution as incomplete data GMM (IDGMM).

Fig. 5
figure 5

If we have data points whose certain dimensions are missing then they can still be used when applying EM for fitting a GMM. For instance, the output value of the data plotted on the input axis is missing. For this incomplete data, the expected values of the missing output values are also estimated within EM

4.2 Incorporating incomplete data via EM

Figure 5 shows a dataset with a mixture of complete and incomplete data. The incomplete data still provides useful information when applying EM for fitting a GMM [9]. As mentioned earlier, the regression using mixture of GMMs encountered problem due to the data sparsity in task space. What would be the effect of filling regions inbetween GMMs with incomplete data, i.e, without the outputs \(\mathcal {F}\)? The EM applied with additional incomplete samples provide smooth activation of responsibilities when switching from one GMM to a nearby GMM. Since we treat learning as a density estimation problem, the amount of incomplete data required to fill the empty regions can increase drastically with the increase in the dimensions of task parameters (curse of dimensionality). The computational burden of EM also increases with the increase in size of training data. To avoid these problems, we instead directly use the IDGMM.

To use the IDGMM, a weighting parameter analogous to the number of data points represented by the IDGMM has to be specified by the user. Our training dataset consists of N data points. The weighting parameter is also set equal to N and thus each IDGMM component represent \(\varvec{\pi }_{w}^\mathrm{obs} \times N\) data points. It has to be noted that the EM applied with this additional data still provides a maximum likelihood estimate of the model parameters as the IDGMM is defined on data and not on the parameters of the model. Now, for benefiting from this incomplete data, we use the current mixture of GMMs for calculating the Expectation of incomplete terms appearing in likelihood maximization [12]. It turns out that, for a data point \(\mathbf x _i\) with observable (input dimensions) and missing (output dimensions) parts \(\begin{bmatrix} \mathbf x ^\mathrm{obs}_i\\ \mathbf x ^\mathrm{miss}_i\end{bmatrix}\), we have to calculate three expectations [12], i.e., \( E[z_{i,k}|\mathbf x ^\mathrm{obs},\varvec{\theta }_t]\), \(E[z_{i,k},\mathbf x ^\mathrm{miss}|\mathbf x ^\mathrm{obs},\varvec{\theta }_t]\) and \(E[z_{i,k},\mathbf x ^\mathrm{miss}\mathbf{x ^{\mathrm{miss}^\top }}|\mathbf x ^\mathrm{obs},\varvec{\theta }_t]\), where \(z_{i,k}\) is an indicator variable defining association of \(\mathbf x _i\) to the \(k_{th}\) cluster. These expectations can be directly used in M-step [12].

In the M-step, the value of \(d_{i,o,m}^{t+1}\) calculated only on the observable dimensions can be directly used for updating \(\varvec{{\phi }}_{o}^{t+1}\).

E-step

$$\begin{aligned} p_{i,o,m}^{t+1}= & {} \left( \varvec{\phi }_{o}^{t} \varvec{\pi }_{o,m}\mathcal {N}\left( \mathbf x _{i};\varvec{\mu }_{o,m}^{t},\varvec{\Sigma }_{o,m}^{t}\right) \right) ^{\beta } \\ c_{w,o,m}^{t+1}= & {} \left( \varvec{\phi }_{o}^{t} \varvec{\pi }_{o,m} \mathcal {N}\left( \varvec{\mu }_{w}^\mathrm{obs}; \varvec{\mu }_{o,m}^\mathrm{obs},\varvec{\Sigma }_{w}^\mathrm{obs}+\varvec{\Sigma }_{o,m}^\mathrm{obs}\right) \right) ^{\beta } \\ b_{i,o}^{t+1}= & {} \frac{ \sum \nolimits _{l=1}^{K} p_{i,o,l}^{t+1}}{\sum \nolimits _{r=1}^{M} \sum \nolimits _{l=1}^{K} p_{i,r,l}^{t+1}}, \quad q_{i,o,m}^{t+1} = \frac{ p_{i,o,m}^{t+1}}{\sum \nolimits _{r=1}^{M} \sum \nolimits _{l=1}^{K} p_{i,r,l}^{t+1}} \\ d_{w,o,m}^{t+1}= & {} \frac{ c_{w,o,m}^{t+1}}{\sum _{r=1}^{M}\sum _{l=1}^{K} c_{w,r,l}^{t+1}} \times \varvec{\pi }_{w}^\mathrm{obs} \times N \end{aligned}$$

M-step

$$\begin{aligned} \varvec{\phi }_{o}^{t+1}= & {} \frac{\sum \nolimits _{i=1}^{N} b_{i,o}^{t+1} + \sum _{w=1}^{W} \sum _{l=1}^{K} d_{w,o,l}^{t+1}}{2N}, \\ \varvec{\Sigma }_{o,m}^{t+1}= & {} \frac{\sum \nolimits _{i=1}^{N}q_{i,o,m}^{t+1}(\mathbf x _{i} - \varvec{\mu }_{o,m}){(\mathbf x _{i} - \varvec{\mu }_{o,m})}^\top + \sum \nolimits _{w=1}^{W} {\varvec{A}}_{w,o,m} }{\sum _{i=1}^{N} q_{i,o,m}^{t+1} + \sum \nolimits _{w=1}^{W} d_{w,o,m}^{t+1}} \end{aligned}$$

where

$$\begin{aligned} \varvec{\mu }_{o,m}= & {} \begin{bmatrix} \varvec{\mu }_{o,m}^\mathrm{obs}\\ \varvec{\mu }_{o,m}^\mathrm{miss}\end{bmatrix}, \quad \varvec{\Sigma }_{o,m} = \begin{bmatrix} \varvec{\Sigma }_{o,m}^\mathrm{obs}&\quad \varvec{\Sigma }_{o,m}^\mathrm{obs .miss}\\ \varvec{\Sigma }_{o,m}^\mathrm{miss .obs}&\quad \varvec{\Sigma }_{o,m}^\mathrm{miss}\end{bmatrix} \end{aligned}$$
$$\begin{aligned} {\varvec{A}}_{w,o,m}= & {} d_{w,o,m}^{t+1} \begin{bmatrix} {\varvec{X}}_w^\mathrm{obs} - \varvec{\mu }_{o,m}^\mathrm{obs}\\ {\varvec{X}}_w^\mathrm{miss} - \varvec{\mu }_{o,m}^\mathrm{miss}\end{bmatrix} \begin{bmatrix} {\varvec{X}}_w^\mathrm{obs} - \varvec{\mu }_{o,m}^\mathrm{obs}\\ {\varvec{X}}_w^\mathrm{miss} - \varvec{\mu }_{o,m}^\mathrm{miss}\end{bmatrix}^\top \\= & {} \begin{bmatrix} {\varvec{A}}_{11}&\quad {\varvec{A}}_{12} \\ {\varvec{A}}_{21}&\quad {\varvec{A}}_{22} \end{bmatrix} \end{aligned}$$

with

$$\begin{aligned} {\varvec{A}}_{11}= & {} d_{w,o,m}^{t+1} \left( \varvec{\Sigma }_{w}^\mathrm{obs} + \left( \varvec{\mu }_{w}^\mathrm{obs} - \varvec{\mu }_{o,m}^\mathrm{obs}\right) \left( \varvec{\mu }_{w}^\mathrm{obs} - \varvec{\mu }_{o,m}^\mathrm{obs}\right) ^\top \right) \\ {\varvec{A}}_{21}= & {} d_{w,o,m}^{t+1} \varvec{\Sigma }_{o,m}^\mathrm{miss.obs}{\left( \varvec{\Sigma }_{o,m}^\mathrm{obs}\right) }^{-1} {\varvec{\mathcal {A}}}_{w,o,m} \\ {\varvec{A}}_{12}= & {} {A_{21}}^\top \\ {\varvec{A}}_{22}= & {} d_{w,o,m}^{t+1} \varvec{\Sigma }_{o,m}^\mathrm{miss} + d_{w,o,m}^{t+1} \varvec{\Sigma }_{o,m}^\mathrm{miss.obs} \\&{\left( \varvec{\Sigma }_{o,m}^\mathrm{obs}\right) }^{-1} {\varvec{\mathcal {A}}}_{w,o,m}{\left( \varvec{\Sigma }_{o,m}^\mathrm{obs}\right) }^{-1} {\varvec{\Sigma }_{o,m}^\mathrm{miss.obs}}^{\top } \\&- d_{w,o,m}^{t+1} \varvec{\Sigma }_{o,m}^\mathrm{miss.obs} {\left( \varvec{\Sigma }_{o,m}^\mathrm{obs}\right) }^{-1} {\left( \varvec{\Sigma }_{o,m}^\mathrm{miss.obs}\right) }^{\top } \end{aligned}$$

where \({\varvec{\mathcal {A}}}_{w,o,m} = \varvec{\Sigma }_{w}^\mathrm{obs} + (\varvec{\mu }_{w}^\mathrm{obs} - {\varvec{\mu }}_{o,m}^\mathrm{obs}) (\varvec{\mu }_{w}^\mathrm{obs} - {{\varvec{\mu }}_{o,m}^\mathrm{obs})}^\top \). Since we do not update the means, the additional data cannot pull the already learned GMMs away from the demonstrated regions, the components cannot get to a saddle point during DAEM and it also results in a fast rate of convergence for EM.

One may wonder about the result of fitting a single GMM instead of using the mixture of GMMs. As the data are concentrated at discrete regions of input space (task parameters), the components of a single GMM will also get attracted to same regions as by the mixture of GMMs. An appropriate regularization term must be used for a single GMM, to avoid singularity issues of covariance matrices, as the data concentrated at discrete regions in higher-dimensional space. The reason for using the mixture of GMMs instead of a single GMM is that now we optimize the weight \(\varvec{\phi }\) of each GMM separately, without disturbing the weights \(\varvec{\pi }\) within each GMM, as they are important for preserving the local behavior of each GMM. This also results in a smaller number of parameters update during the second EM cycle, in contrast to updating the mixing weights \(\varvec{\pi }\) of a single GMM.

4.3 Computational complexity

The computational complexity of our approach during motion execution is \(\mathcal {O}(n)\) for the number of DMPs, the number of demonstrations and the number of GMM components. Since GMR involves matrix inversion over input variables, the computational complexity is \(\mathcal {O}(n^3)\) for the dimensions of task parameters. For the special case of fixed task parameters throughout the trajectory, one can find conditional GMMs for the fixed task parameters (Table 1). This makes the computational complexity irrelevant of task parameters, i.e., for the fixed task parameters \(\mathcal {T}\), the conditional parameters for the regression are calculated as:

$$\begin{aligned}&\varvec{\hat{\phi }}_{o} \varvec{\hat{\pi }}_{o,m} = \frac{ \varvec{\phi }_{o} \varvec{\pi }_{o,m} \mathcal {N}\left( \mathcal {T}; \varvec{\mu }_{o,m}^{\mathcal {T}},\varvec{\Sigma }_{o,m}^{\mathcal {T}}\right) }{ \sum _{r=1}^{M}\sum _{l=1}^{K} \varvec{\phi }_{r} \varvec{\pi }_{r,l} \mathcal {N}\left( \mathcal {T}; \varvec{\mu }_{r,l}^{\mathcal {T}},\varvec{\Sigma }_{r,l}^{\mathcal {T}}\right) } \\&\hat{\varvec{\mu }}_{o,m} = \varvec{\mu }_{o,m}^{\{s,\mathcal {F}\}} + \varvec{\Sigma }_{o,m}^{\{s,\mathcal {F}\}.\mathcal {T}}{\left( \varvec{\Sigma }_{o,m}^{\mathcal {T}}\right) }^{-1} \left( \mathcal {T} - \varvec{\mu }_{o,m}^{\mathcal {T}}\right) \\&\hat{\varvec{\Sigma }}_{o,m} = \varvec{\Sigma }_{o,m}^{\{s,\mathcal {F}\}} - \varvec{\Sigma }_{o,m}^{\{s,\mathcal {F}\}.\mathcal {T}}{\left( \varvec{\Sigma }_{o,m}^{\mathcal {T}}\right) }^{-1}\varvec{\Sigma }_{o,m}^{\mathcal {T}.\{s,\mathcal {F}\}} \end{aligned}$$

where the terms \(s,\mathcal {F}\) and \(\mathcal {T}\) in superscript denote the dimensions corresponding to s, \(\mathcal {F}\) and \(\mathcal {T}\), respectively.

Fig. 6
figure 6

A step by step illustration of TP-DMP learning with additional incomplete data and DAEM. a Demonstrations, b initial GMMs encoding forcing terms of DMPs for y-axis, c transformed GMMs using incomplete data and DAEM, d learned regression surfaces for y-axis, e multiple generated movements

5 Related work

In general, there are two main approaches for learning motor skills; firstly those based on mimicking motion data using dynamical systems, i.e., DMPs [19, 26], secondly those relying on statistical machine learning, i.e., Gaussian mixture models (GMMs) [7] and hidden Markov models (HMMs) [16, 17]. DMPs consider one-shot learning and provide spatial and temporal scalability properties as well as guaranteed convergence to the goal position. Learning is done at acceleration level and many variations exist for DMPs [11, 20, 21]. Statistical machine learning-based approaches directly learn on spatial data and can easily encode multiple demonstrations at a time, but lack certain properties presented by a DMP, for instance spatial amplification of the movement or guaranteed convergence to a goal position.

Among the GMM-based approaches, [7] used different frames of reference for capturing distinct aspects of multiple demonstrations. Task parameters are defined as frames of reference. For generalization, the already learned GMMs are placed with respect to new frames and multiplied to retrieve a GMM for a new scenario. A similar method is [6], where they have shown that such an approach can also provide some extrapolation capability. However, these approaches lack typical properties associated with DMPs, e.g., spatial amplification of the movement and guaranteed convergence to a goal position. Additionally, the frames of references cannot be arbitrarily placed and the user needs to have a certain level of understanding about how to place the frame of references for the TP-GMM to be successful. Probabilistic movement primitives (ProMPs) have been shown to have better inference capabilities than a DMP and can combine or blend multiple demonstrations to achieve task-specific generalizations. However, ProMPs require a large number of demonstrations for learning. According to [25], ProMPs have a high-dimensional covariance matrix and many free parameters. That is why they suffer from overfitting without a large number of demonstrations.

Table 1 Computational complexity during motion generation with respect to the variables of interest

Task-specific variations of DMPs are considered in [10, 18, 27, 28]. Basis targets can be extracted for the corresponding style parameters of each demonstration [18]. The basis targets and style parameters are combined to get appropriate basis functions. For generalization to a new situation with different task parameters, for example different height of an obstacle in a point to point reaching task, one has to provide appropriate style parameters. To get the style parameters for novel situations, the mapping from task parameters to style parameters is learned by supervised learning.

A more direct 2-Step (2S) approach is considered in [10, 28]. In the first step, a mapping from task parameters to the DMP parameters is learned. It can either be learned by locally weighted regression (\(\mathbf LWR ^{\text {2S}}\)), by placing local kernels along the task parameters [28], or a function approximator such as Gaussian process regression (\(\mathbf GPR ^{\text {2S}}\)) [10]. In the second step, the inferred DMP parameters are used for motion generation. An extension of [28] is [27], where they used locally weighted projection regression (LWPR) [30] for avoiding manual placement of the basis functions, which significantly reduces the required number of basis functions as compared with LWR. They also emphasized that any suitable function approximator can be used for directly encoding forcing terms of the DMPs, eliminating need to follow the 2-Step procedure. When using GPR for learning the direct encoding, they called their approach \(\mathbf GPR ^{\text {1S}}\), where the superscript 1S and 2S emphasize on the 1-Step and 2-Step approaches, respectively. These approaches can provide generalization capability when interpolating along the task parameters, but they do not perform well for very few demonstrations and cannot extrapolate beyond the demonstrated regions of task parameters, as we also show in our experiments.

In our work, the use of a generative model with a DMP may seem like a strange choice at first, as existing approaches use discriminative models [10, 18, 27, 28]. The discriminative models have been shown to yield lower asymptotic errors compared to their generative counterparts, but they also require higher amount of training data to reach those levels [14]. More specifically, a discriminative model requires \(\mathcal {O}(n)\) training examples for reaching its asymptotic error while a generative model require \(\mathcal {O}(\log {}n)\) [14]. This implies that, for few training examples, the generative model might have already reached its lowest asymptotic error, and thus performing better than a discriminative model. Since PbD focuses on learning from as few examples as possible, a generative model is indeed a useful choice.

6 Results

For all experiments the temperature schedule \((\beta )\) which we use for annealing (DAEM) is \([0.1~0.2~\dots ~ 1.0 ]\). The regularization term \(\epsilon \) for eigenvalues of the GMM covariances is set to \(10^{-6}\). For initializing GMMs, the trajectories are equally segmented in time domain and then the Gaussians are calculated from the samples of each segment. The parameters of all models are empirically set.

6.1 Simulation of variable height obstacle avoidance

Our first experiment consists of a planar point to point reaching task with the variable height of the obstacles. If there is an obstacle in the way, the trajectory has to change according to the height of the obstacle. In this experiment, the task parameter is defined as the maximum height to avoid the obstacle.Footnote 1 The goal of learning is to adapt a motion trajectory for a new desirable height. The demonstrations can be visualized in Fig. 6a. There are only two demonstrations with 200 samples in each demonstration. The task parameters associated with these demonstrations are [0.0903, 0.1598]. Two DMPs are learned for generating motion in the x and y axis. We set the number of components in each GMM to 6. The two GMMs \((\varepsilon =10^{-6})\) corresponding to the forcing terms of DMPs for motion on the y-axis are shown in Fig. 6b. Now, we apply our prescribed approach with the components in the IDGMM empirically set to 15. This transforms the complete data GMMs as shown in Fig. 6c. A single computer with Ubuntu 14.04, Intel Core i7 4790K QuadCore 4.0 GHz and 16 GiB memory took approximately 13 s for fitting the GMM in MATLAB R2015b. The forcing terms of all DMPs can be predicted in less than 1 ms during the reproduction in simulation.

Table 2 Task errors (simulation) with different approaches
Fig. 7
figure 7

Learned regression surfaces for y-axis a with \(\mathbf GMR ^{\text {1S}}\) without DAEM, b with \(\mathbf LWR ^{\text {2S}}\), c with \(\mathbf GPR ^{\text {1S}}\) and d with \(\mathbf GPR ^{\text {2S}}\)

Fig. 8
figure 8

Motion reproductions a with \(\mathbf GMR ^{\text {1S}}\) without DAEM, b with \(\mathbf LWR ^{\text {2S}}\), c with \(\mathbf GPR ^{\text {1S}}\), d with \(\mathbf GPR ^{\text {2S}}\) and e with TP-GMM

Comparison the proposed approach is compared with \(\mathbf LWR ^{\text {2S}}\) [28], \(\mathbf GPR ^{\text {1S}}\) [27], \(\mathbf GPR ^{\text {2S}}\) [10] and TP-GMM [5]. Since we have used Gaussian mixture regression for direct prediction of forcing terms, we refer to our approach as \(\mathbf GMR ^{\text {1S}}\). For \(\mathbf LWR ^{\text {2S}}\), [28] and [27] used tricubic kernel and Gaussian kernel, respectively. We use Gaussian kernel with 5 equally spaced kernels placed along the task space. The choice of kernel is seldom important for the performance of LWR [28]. In \(\mathbf LWR ^{\text {2S}}\) and \(\mathbf GPR ^{\text {2S}}\), the number of basis functions for shape parameters are set to 10, as the smaller values do not correctly capture the demonstrations. On the other hand, the GMMs in our approach require fewer number of Gaussians (i.e., 6) in this experiment. As in [27], GPR is used with the covariance function of the Mat\(\acute{\text {e}}\)rn form, with isotropic distance measure and hyperparameters optimization [24]. Three frames of reference are defined for TP-GMM, one at the starting point, one above the starting point with height equal to the desired height and one at the ending point. The number of components in the TP-GMM is empirically set to 4, as the higher values yield spiky motions. When performing learning with few demonstrations, if the GMMs in TP-GMM are learned with the high number of Gaussians then they converge to the data points of the individual trajectories. Calculating the product of GMMs with Gaussians concentrated at regions of individual trajectories results in sudden activation of the responsibility term from one Gaussian to another Gaussian of a different trajectory and thus results the spiky behavior and wrong motion reproduction. Also the GMMs in TP-GMM is formed by encoding the relationship inbetween the clock signal and the spatial data with GMMs in different frame of references and their products afterward while the GMMs in a TP-DMP model encode the relationship inbetween the clock signal, task parameters and the forcing term of a DMP. Since both models operate differently and encode different set of variables, setting the same number of Gaussians in each is not needed for fair comparison.

Fig. 9
figure 9

Motion reproductions for changing the height during execution a with \(\mathbf GMR ^{\text {1S}}\), b with \(\mathbf LWR ^{\text {2S}}\), c with \(\mathbf GPR ^{\text {1S}}\), d with \(\mathbf GPR ^{\text {2S}}\) and e with TP-GMM

The errors can be defined as the difference inbetween the maximum desired height of the trajectory (i.e., the input task parameter value used for the regression) and the actual achieved height values by the reproduced trajectories. We generate 50 linearly spaced task parameter values in the range \([\min (\varvec{\mathcal {T}}) , \min (\varvec{\mathcal {T}}) + 2.5 \times (\max (\varvec{\mathcal {T}}) - \min (\varvec{\mathcal {T}}))]\) ([0.0903,0.2641]). The generated trajectories for these task parameters are shown in Figs. 6e and 8. Table 2 presents the Mean absolute Error (ME) and its Standard Deviation (SD) when interpolating (task parameter in the range [0.0903,0.1577]) and extrapolating (task parameter in the range [0.1612,0.2641]). All approaches produce small errors when interpolating while \(\mathbf GMR ^{\text {1S}}\) outperformed all other approaches when extrapolating beyond the demonstrated task parameters. The use of DAEM is critical for the performance of \(\mathbf GMR ^{\text {1S}}\) as the performance degrades substantially without annealing. The TP-GMM fails to preserve the shape information of the demonstrations, as it can be seen in Fig. 8e.

Table 3 Task errors (simulation) of different approaches when the height is changed during the trajectories

Figures 6d and 7 present the surface plots of the generated forcing terms \(( \{ s \times \varvec{\mathcal {T}} \} \mapsto \mathcal {F} )\) of the DMP-based approaches for y-axis. The regression surfaces of all the approaches except \(\mathbf GMR ^{\text {1S}}\) change their response in the extrapolation range, which is also the reason for their large errors when extrapolating. The regression surface of \(\mathbf GMR ^{\text {1S}}\) without DAEM shows that the EM converged to a poor local optima without annealing. The kernels in \(\mathbf LWR \) and \(\mathbf GP \) based approaches predict by mostly using the nearby data. Thus, their performance degrades when trying to extrapolate. The emphasis in TP-GMM model is to strictly pass through certain frames of reference and thus it can lose the shape information. Additionally, with the clock signal as the only input, the starting point of the reproductions with the TP-GMM, which is marked by a ‘−’ sign in Fig. 8e, moves quite far away from the starting point of the demonstrations. This problem can happen when the clock signal is the only input in the TP-GMM without consideration of the current point of a trajectory. A remedy to such a problem can be to encode the current point as an input.

Fig. 10
figure 10

Schematic of our vision system

Fig. 11
figure 11

Demonstrations are collected by setting the robot in gravity compensation mode while a Cartesian impedance controller is used for motion generation. a A human provides the demonstration for moving the trash to the dustpan. b The robot generates motion for a new trash position

Fig. 12
figure 12

Comparison of \(\mathbf GMR ^{\text {1S}}\), \(\mathbf LWR ^{\text {2S}}\), \(\mathbf GPR ^{\text {1S}}\), \(\mathbf GPR ^{\text {2S}}\) and \(\mathbf TP-GMM \) for the sweeping task. a Demonstrations for the sweeping task. Circles represent the trash positions while the rectangle represents the bounding box enclosing these trash positions. b Blue dots represent trash positions for interpolation evaluation while the red ones are for extrapolation evaluations. cf) Generated movements for new trash positions (color figure online)

Since we use a generative model for encoding the forcing terms of the DMP, our approach can benefit from the incomplete data. The IDGMM spans beyond the demonstrated range and thus retrieves a better underlying function when compared with the supervised learning approaches, which only rely on training data.

Varying task parameter the task parameter is not necessary to be fixed during the reproduction phase and can vary if needed. Now 20 trajectories are generated with the initial desired task parameters linearly sampled from the interval [0.0903, 0.1577]. After one third of the executed motion, the desired task parameters are multiplied with 1.5. They now lie in the interval [0.1355,0.2366]. Figure 9 contains the plot of the generated trajectories. A benefit of using a DMP-based approach is that the output trajectories are always smooth, even though the change in the task parameters is discontinuous. The reproduction errors are given in Table 3. Again, due to the aforementioned reasons, the TP-DMP outperforms all other approaches by producing least amount of error. Care should be taken in a real robot experiment, as an instantaneous change in the value of desired task parameters can cause a high acceleration at the end-effector. The high accelerations can be avoided by smoothly changing the task parameters when required.

6.2 Real robot experiments

Experimental setup the experiments are conducted using a KUKA lightweight robot IV+. For collecting demonstrations, the robot is set to gravity compensation mode. With \(\mathbf GPR ^{\text {1S}}\), offline trajectories are generated due to its high computational cost. This also means that with \(\mathbf GPR ^{\text {1S}}\), the task parameters should remain fixed during the motion reproduction. The markers are tracked with Kinect RGB-D camera by using ROS wrapper for Alvar, an open source augmented reality tag tracking library [1]. More specifically, we have a fixed marker with respect to robot’s frame of reference and a moving marker. The fixed marker is used for localizing the camera while the moving marker is used for tracking objects, as shown in Fig. 10. A low pass filter is applied to remove high frequency noise in the vision system. The GMM model is always learned offline in MATLAB. A single computer with Ubuntu 12.04 32-bit, Intel Core i5-2500 quad core 3.3 GHz and 16 GiB memory is used for marker tracking as well as online motion generation. For motion reproduction with \(\mathbf GMR ^{\text {1S}}\), the forcing terms of all DMPs are predicted within 7 ms. A Cartesian impedance controller with a control frequency of 100Hz is used for motion generation.

Table 4 Task errors (KUKA) with different approaches

6.2.1 Sweeping task

This experiment considers a sweeping task, which consists of moving trash to a collection point. The task parameters are defined as the planar coordinates of the trash. A teacher physically holds the end-effector for various trash positions and demonstrates the required motions for moving the trash to the collection point, as shown in Fig. 11a. Learning is performed in task space (position and orientation of the end-effector). Each demonstration has a duration of approximately 5 s with a sampling rate of 10 ms.

Three DMPs are learned, two for generating a planar motion and one for encoding the planar orientation of the end-effector. As shown in Fig. 12a, four demonstrations are provided for different positions of the trash. In our demonstrations, the x and y values of the trash position lie between [\(-0.0216\)m ,0.0350m] and [\(-0.54\)m, \(-0.454\)m], respectively, (drawn as a rectangle). We selected 25 new trash positions (a grid) for evaluation in the extended x and y ranges of [\(-0.0358\)m ,0.0491m] and [\(-0.6045\)m, \(-0.3895\)m], respectively, which can be visualized in Fig. 12b. The blue samples, which are close to the bounding box of the demonstrated region, are used for evaluating interpolation performance. The red samples, which are far away from the bounding box, are used for evaluating extrapolation performance.

Comparison we defined error as the minimum distance inbetween the trash position and the generated trajectory. Table 4 contains the ME and its SD when using our approach, \(\mathbf LWR ^{\text {2S}}\), \(\mathbf GPR ^{\text {1S}}\), \(\mathbf GPR ^{\text {2S}}\) and TP-GMM. Three frames at starting point, ending point and at the trash location are defined for TP-GMM. The number of components in the TP-GMM is empirically set to 6, as the higher values yield spiky motion. The basis functions in \(\mathbf LWR ^{\text {2S}}\) and \(\mathbf GPR ^{\text {2S}}\) are set to 60. For our approach, the components in each GMM, as well as the IDGMM, are set to 40. The remaining settings are same as in previous experiment. Like in the previous experiment, our approach requires less components as compared with the basis functions in the other approaches. Again, our approach produces less errors for interpolation as well as extrapolation as shown in Fig. 12c–f and Table 4. The error for other approaches increases considerably as we move away from the demonstrated interval. Additionally, some of the trajectories generated by TP-GMM surpassed the collection points as the TP-GMM model does not have the notion of a goal position. Interestingly, no demonstrations are provided for the trash positions in the upper half of the sweeping area, which makes it an interesting region for comparing different approaches. \(\mathbf GMR ^{\text {1S}}\) also successfully generates the motion for the trash position in this region, as shown in Fig. 12f.

Fig. 13
figure 13

Joint angles and end-effector trajectory of the demonstrated motions. a Demonstrated joint angles in radians and time in seconds. b End-effector trajectory in Cartesian space

Fig. 14
figure 14

Reproductions with \(\mathbf LWR ^{\text {2S}}\). a Reproduced joint angles. b End-effector trajectory in Cartesian space

Fig. 15
figure 15

Reproductions with \(\mathbf GPR ^{\text {1S}}\). a Reproduced joint angles. b End-effector trajectory in Cartesian space

Fig. 16
figure 16

Reproductions with \(\mathbf GPR ^{\text {2S}}\). a Reproduced joint angles. b End-effector trajectory in Cartesian space

Fig. 17
figure 17

Reproductions with \(\mathbf TP-GMM \). a Reproduced joint angles. b End-effector trajectory in Cartesian space

Fig. 18
figure 18

Reproductions with \(\mathbf GMR ^{\text {1S}}\). a Reproduced joint angles. b End-effector trajectory in Cartesian space

Fig. 19
figure 19

Executed motions for the two extrapolation evaluations. a, c Robot initial configuration and different target positions. b Executed striking motion for hitting the target in (a). d Executed striking motion for hitting the target in (c)

6.2.2 Striking task

This experiment considers a task which involves striking a ball such that it hits a desired target position. The ball is always placed at the same point and the task parameter is defined as the the x-coordinate of the target. The y-coordinate of the target is fixed in the two demonstrations. Similar to the previous experiment, a teacher physically holds the end-effector and demonstrates the required motion for hitting the target. The demonstrations have a duration of appropriately 1.5 s with a sampling rate of 10 ms. Due to the small duration of the motions we increase the size of data ten times (to approximately 1500) by inserting samples in between the adjacent data points of the trajectories by using linear interpolation. Now we consider learning in joint space. Different demonstrations can produce completely different final joints configuration during the demonstrations. The final joint configurations are measurable in the demonstrations but are unknown when reproducing motion for a new target position. Thus, it is necessary to predict the final joints configuration during the reproduction phase. Seven DMPs are learned with one DMP for each joint of the robot and thus utilizing all DOFs.

Our approach can also easily incorporate the learning of meta parameters in a DMP. For \(\mathbf GMR ^{\text {1S}}\), we learn the distribution of \((s,[\varvec{\mathcal {T}} \, g],\mathcal {F})\). Similar to task parameters, the goal terms g (meta parameters) are constants for a single demonstration. Thus, we simply interpret them as additional task parameters. This also means that different DOF in each demonstration will now have different task parameters. The final joint configurations (goal positions), which we set as an additional task parameter, cannot be known in advance. As there is no distinction inbetween input and output variables when fitting a GMM and during GMR, any set of variables can be selected as input, to retrieve the expected value of remaining variables. Thus with GMR, the observable task variable can be used for predicting the expected value of missing task variables and for motion generation. So now, with GMR, we not only predict the goal terms g of each DMP but also generate the forcing terms. For \(\mathbf GMR ^{\text {1S}}\) the components in each GMM (\(\varepsilon =10^{-4}\)), as well as in the IDGMM, are set to 60. The number of basis functions in \(\mathbf LWR ^{\text {2S}}\) and \(\mathbf GPR ^{\text {2S}}\) is set to 100. The goal positions for \(\mathbf LWR ^{\text {2S}}\) and \(\mathbf GPR ^{\text {2S}}\) are predicted in the first step along with the DMP parameters by using LWR and GPR, respectively. The goal position in \(\mathbf GPR ^{\text {1S}}\) is predicted at each time step along with the forcing terms by using GPR. Two frames of reference are defined for TP-GMM: one at the start of the trajectory and one at the end of the trajectory. As mentioned before, the final joint configurations are not known and hence the TP-GMM cannot be used directly in this experiment. To use TP-GMM, we first predict the final joint configuration with GP and then the second frame of reference is placed at that final joint configuration for motion reproduction i.e., the offset vector for second frame of reference looks like \(b_2=[t_f~j_{f1}~j_{f2}~j_{f3}~j_{f4}~j_{f5}~j_{f6}~j_{f7}]'\) where \(t_f\) is the final time value and \(j_{fn}\) is the predicted final joint angle for the nth joint. The transformation matrices for the two frames of references are set to identity matrices. The number of components in the TP-GMM is empirically set to 4. The remaining settings are same as in the previous experiments.

Comparison a binary evaluation criterion is defined as a success if the robot is able to hit the target and failure otherwise. Demonstrations are provided for hitting a target with x-coordinate of \(-0.4891 m\) and \(-0.6703 m\), as shown in Fig. 13a. Figure 13b shows the end-effector pose. Interpolation performance is evaluated for the target x-coordinates of \(-0.5663 m\) while extrapolation performance is evaluated for the target x-coordinates of \(-0.4146 m\) and \(-0.7933 m\). When generalizing for novel goal positions, the DMPs can produce high accelerations at the beginning of the movement [15]. This is due to the initial interaction inbetween the linear dynamics and forcing terms. This undesirable behavior was slightly observed in this experiment. A simple solution to solve this problem is to gradually activate the forcing terms. Thus, we multiply the predicted forcing terms by \((1-s^{10})\). As s decays exponentially, the effect of this term vanishes very quickly.

Figure 14 contains the reproduction results with \(\mathbf LWR ^{\text {2S}}\). It produces a good trajectory for interpolation performance as it lies inbetween the demonstrated trajectories. The trajectories for the extrapolation fail to reproduce the task as they are similar to the demonstrations. The \(\mathbf LWR ^{\text {2S}}\) also encountered the same problem during the first experiment where successful trajectories were generated for interpolation intervals but it failed to reproduce during the extrapolation intervals. Figure 15 contains the reproduction results with \(\mathbf GPR ^{\text {1S}}\). It can easily be observed that the initial end-effector trajectory required to approach the ball (and critical for the execution of the task) is very similar for the three reproductions. The trajectories lose the important shape information required for the successful execution of the task. Also the green and brown trajectories of one of the joints are almost overlapping. Since the reproduction is in joint space, an incorrect motion of even a single joint can lead to the failure of the task. Figure 16 contains the reproduction results with \(\mathbf GPR ^{\text {2S}}\). Some of the reproduced trajectories are very different from the demonstrated ones. As with \(\mathbf GPR ^{\text {1S}}\), the trajectories generated with \(\mathbf GPR ^{\text {2S}}\) lose the initial shape information which is required for the successful execution of the task. The failure of both of the GP-based approaches can be attributed the small amount of training data, i.e., only two trajectories. Figure 17 contains the reproduction results with TP-GMM. The reproduced trajectories do not capture the demonstrated motions and it fails to learn anything useful for this task. The joint distribution of all the eight variables (one phase and seven joint angles) is encoded in the TP-GMM. As we only have two trajectories, the product of GMMs in TP-GMM suffers from a severe curse of dimensionality.

Figure 18 contains the reproduction results with \(\mathbf GMR ^{\text {1S}}\). The shape information is preserved in the reproduced trajectories and the DMPs goal parameters are correctly inferred. The generated joint angles trajectories extend further away from the demonstrated trajectories for extrapolation. The joint angles trajectories for interpolation are inbetween the extrapolation trajectories. Executing motion trajectories generated by \(\mathbf GMR ^{\text {1S}}\) on KUKA show that our approach always yields success in the extended range of [0.4146, 0.7933]. For the two extrapolation evaluations with our approach, the ball trajectories, as well as the executed motions on KUKA, are visualized in Fig. 19, where the different final joint configurations of the reproduced motions are also observed.

7 Conclusion and future work

We have shown how the task-specific generalization of a DMP can be achieved by formulating learning as a density estimation problem. The proposed approach captures the local behavior of each demonstration by using a GMM. These GMMs are then mixed to get the task-specific generalization. We have handled the data sparsity along task parameters by introducing additional incomplete data filling the input space. Deterministic Annealing EM is used to avoid the local maxima problem. We retain the local behavior of each GMM by keeping the means and mixing weights within the GMMs fixed. The task-specific generalization is achieved by just adapting covariances and mixing coefficient of the already learned GMMs. The TP-DMP framework can perform learning in task space as well as in joint space and can even handle the learning of meta parameters of a DMP. As shown in the experiments, our approach requires very few demonstrations for learning and it outperforms the existing approaches specially when extrapolating beyond the demonstrated ranges of the task variables. As future work, we plan to extend our proposed work with sample reuse approach [8] and to investigate the scalability issue to complex tasks.