1 Introduction

In contrast to industrial robots in large factories, a wide range of service robots are designed to move in unconstrained environments in which they should fulfill a series of tasks while swiftly reacting to perturbations. The expectations and promises of service robotics applications are very challenging and cannot be achieved without joint efforts from different fields of robotics. This exploitation of various methods can hardly be done serially and instead requires closer interactions between learning, planning and control. One of the prior requirement to face such challenge is to design a versatile representation of what the robot should do (how it should move, which behavior it should follow) that is compatible with the above techniques and that can be shared bilaterally. In particular, in continuously changing environments, the movements of service robots need to be generated and adapted to the ongoing situation very quickly.

This tutorial takes the perspective that the challenges of recognizing, predicting and generating movements can be achieved within the same encoding strategy. It will show that simple probabilistic mixture models can be exploited to model the natural variations in human and robot motions, as well as to make links between learning, online planning and optimal control. Gaussian mixture models provide a structure that is compatible with many robot learning approaches. It is flexible to the requirements of service robotics, because the representation can be easily adapted to the application requirements while preserving the core probabilistic mixture modeling strategy (addition of transition information in the form of an HMM, subspace clustering with MFA or MPPCA, etc.). Finally, the model is not tied to a specific parameter estimation technique, which allows the movements to be acquired by different interaction modalities and learning strategies.

The tutorial will focus on task-parameterized Gaussian mixture model (TP-GMM), by presenting a number of extensions and ongoing challenges targeting applications in unconstrained environment.

1.1 Organization of the paper

Section 2 discusses the importance of considering adaptive models of movements in robotics. It introduces the proposed approach from a high-level perspective and motivates it using a toy example with a single Gaussian.

Section 3 presents the core of the approach by taking the example of a standard Gaussian mixture model (GMM) modified as a task-parameterized GMM (TP-GMM). It also discusses the practical use of regularization terms.

The next four sections present techniques that rely on the core TP-GMM encoding strategy but that tackle different challenges, by moving progressively from encoding issues to kinematic and dynamic retrieval of data. Section 4 addresses the challenge of handling high-dimensional data with subspace clustering extensions of the model. Section 5 discusses the challenge of generating continuous movements from task-parameterized models. It presents two distinct approaches based on Gaussian mixture regression (Sect. 5.1) or trajectory models encoding of dynamic features (Sect. 5.2).

Section 6 then extends the motion synthesis challenge to the important problem of learning a controller for the robot, by presenting a minimal intervention control strategy that can exploit the proposed task parameterized model.

Section 7 gives an overview of more advanced forms of task parameterization that can be considered, such as constraints in different data spaces (e.g., to handle constraints at joint and end-effector levels simultaneously), as well as priority and projection constraints.

Finally, Sect. 8 presents comparisons with other task-adaptive approaches, and Sect. 9 discusses further work.

Each section of the paper is accompanied by source codes designed as simple didactic examples implemented in Matlab/GNU Octave. In order to facilitate reading, implementation details such as estimation update rules have been gathered at the end of the paper in the form of Appendices.

2 Adaptive models of movements

Task-parameterized models of movements/behaviors refer to representations that can automatically adapt to a set of task parameters that can, for example, describe the current context, situation, state of the environment or state of the robot configuration. The task parameters refer to the variables that can be collected by the system and that describe a situation, such as positions of objects in the environment. The task parameters can be fixed during an execution trial or they can vary while the motion is executed. The model parameters refer to the variables learned by the system, namely that are stored in memory (the internal representation of the movement). During reproduction, a new set of task parameters (description of the present situation) is used to produce new movements (e.g., adaptation to new position of objects after having observed the skill in a different situation).

Several denominations have been introduced in the literature to describe these models, such as task-parameterized [20, 64, 91] (the denomination used here), parametric [52, 60, 102], stylistic [13] or object-centric warping [57]. In these models, the encoding of skills usually serve several purposes, including classification, prediction, synthesis and online adaptation. A taxonomy of task-parameterized models is presented in [14], with three broad categories, namely

  1. 1.

    approaches employing M models for the M demonstrations, performed in M different situations, see, e.g., [21, 29, 44, 48, 51, 60, 97],

  2. 2.

    approaches employing P models for the P frames of reference that are possibly relevant for the task, see, e.g., [3, 25, 66], and

  3. 3.

    approaches employing a single model whose parameters are modulated by task parameters, see, e.g., [43, 52, 71, 72, 80, 102].

In the majority of these approaches, the retrieval of movements from the model parameters and the task parameters is viewed as a regression problem. This generality might look appealing at first sight, but it also limits the generalization scope of these models, see Fig. 1. Task-parameterized Gaussian mixture models (TP-GMM) aim at increasing this generalization capability by exploiting the functional nature of task parameters. Indeed, in robotics applications, task parameters can most of the time be related to frames of reference, coordinate systems, basis functions or local projections, whose structure can be exploited to speed up learning and provide the system with better extrapolation capability.

Fig. 1
figure 1

Solving a task adaptation problem as a standard regression problem does not always provide satisfying generalization capability. Left 4 demonstrations where the task parameters are treated as inputs (position and direction of the second object concatenated in a vector), and where the motion model parameters are treated as outputs (GMM parameters concatenated in a vector). Here, each demonstration is aligned by fitting a single model to the concatenated set of demonstrations. Right 6 reproduction attempts by treating the problem of adapting the movement to the new task parameters as standard regression, namely by relying on the training set to regenerate new model parameters based on new task parameters and by using this model to generate a new motion with Gaussian mixture regression. Here, both inputs and outputs are treated as multidimensional vectors. The 6 reproduction attempts consider situations of increasing complexity. We can observe that the system provides good interpolation results but cannot extrapolate well when faced with situations that are far from the regions covered by the demonstrations. This example was implemented with Gaussian process regression, but similar reproduction results are observed with other regression mechanisms. As expected, reproductions far outside the regions covered by the demonstrations will tend to collapse to an average of the different trajectory models, resulting in poor generalization capability

2.1 Proposed approach

The proposed approach uses a generative model to encode the movement, where the variability and correlation information is used to infer the impedance parameters of a virtual spring–damper system. These parameters figuratively correspond to the stiffness of a spring and to the damping coefficient of a viscous damper, with the difference that they can also be full stiffness and damping matrices.

In its task-parameterized version, the model uses several frames of reference to describe the robot behavior in multiple coordinate systems. The variations and correlations observed from the perspective of these different frames are exploited to determine the impedance of the system with a linear quadratic regulator. Figure 2 illustrates the overall approach, which can be decomposed into multiple steps, involving statistical modeling, dynamical systems and optimal control. This illustration will be used as a guiding thread to describe throughout the article the different model components and algorithms enabling learning, adaptation, synthesis and control of movement skills.

Fig. 2
figure 2

Illustration of the overall approach (see main text for details). a Observation of a task in different situations and generalization to new contexts. Multiple demonstrations provide the opportunity to discern the structure of the task. b Probabilistic encoding of continuous movements in multiple coordinate systems. c Exploitation of variability and correlation information to adapt the motion to new situations. With cross-situational observations of the same task, the robot can generalize the skill to new situations. d Computation of the underlying optimal control strategy driving the observed behavior

The proposed task-parameterized model is not new: preliminary versions were investigated in [14, 16, 20] for the special case of frames of reference representing rotations and translations in Cartesian space. The current paper discusses the potentials of the approach, and introduces several routes for further investigation, which aim at applying the proposed technique to a wider range of affine transformations (directly exploiting the robotics application domain), including constraints in both configuration and operational spaces, as well as priority constraints. It also shows that the proposed method can be applied to different probabilistic encoding strategies, including subspace clustering approaches that enable the model to handle feature spaces of high dimensions.

Table 1 will be used as a reference to the notation, dimensions and names of variables employed in the paper. As a general rule, lowercase and uppercase bold fonts, respectively, indicate vectors and matrices, while normal fonts indicate scalars. Table 2 lists all examples available as Matlab/GNU Octave source codes.

2.2 Example with a single Gaussian

Before presenting the details of the task-parameterized model, the approach is motivated by an introductory example with a single Gaussian. Two frames are considered, described respectively at each time step t by \(\{{\varvec{b}}_{t,1},{\varvec{A}}_{t,1}\}\) and \(\{{\varvec{b}}_{t,2},{\varvec{A}}_{t,2}\}\), representing the origin of the observer \({\varvec{b}}\) and a set of basis vectors \(\{{\varvec{e}}_1,{\varvec{e}}_2,\ldots \}\) forming a transformation matrix \({\varvec{A}}=[{\varvec{e}}_1 {\varvec{e}}_2 \ldots ]\).

Table 1 Notation and names of variables

A set of demonstrations is observed from the perspective of the two frames. During reproduction, each frame expects the new datapoints to lie within the same range. If \(\mathcal {N}\Big ({\varvec{\mu }}^{(1)},{\varvec{{\varSigma }}}^{(1)}\Big )\) and \(\mathcal {N}\Big ({\varvec{\mu }}^{(2)},{\varvec{{\varSigma }}}^{(2)}\Big )\) describe the observations in the first and second frames; the two observers, respectively, expect the reproduction attempts to lie within the distributions \(\mathcal {N}\Big ({\varvec{\hat{\xi }}}{}_{t}^{(1)},\, {\varvec{\hat{{\varSigma }}}}{}_{t}^{(1)} \Big )\) and \(\mathcal {N}\Big ({\varvec{\hat{\xi }}}{}_{t}^{(2)},\, {\varvec{\hat{{\varSigma }}}}{}_{t}^{(2)} \Big )\) with

$$\begin{aligned} {\varvec{\hat{\xi }}}{}^{(1)}_{t} = {\varvec{A}}_{t,1}\; {\varvec{\mu }}^{(1)} + {\varvec{b}}_{t,1}&,\quad {\varvec{\hat{{\varSigma }}}}{}^{(1)}_{t} = {\varvec{A}}_{t,1}\; {\varvec{{\varSigma }}}^{(1)} {\varvec{A}}_{t,1}^{\scriptscriptstyle \top }\;,\end{aligned}$$
(1)
$$\begin{aligned} {\varvec{\hat{\xi }}}{}^{(2)}_{t} = {\varvec{A}}_{t,2}\; {\varvec{\mu }}^{(2)} + {\varvec{b}}_{t,2}&,\quad {\varvec{\hat{{\varSigma }}}}{}^{(2)}_{t} = {\varvec{A}}_{t,2}\; {\varvec{{\varSigma }}}^{(2)} {\varvec{A}}_{t,2}^{\scriptscriptstyle \top }\;, \end{aligned}$$
(2)

computed with the linear transformation property of normal distributions.

Table 2 List of Matlab/GNU Octave examples (by alphabetic order)

During reproduction, a trade-off needs to be determined to concord with the distributions expected by each frame. The underlying objective function is defined as the weighted sum of the quadratic error terms

$$\begin{aligned} {\varvec{\hat{\xi }}}_t \;=\; \arg \underset{{\varvec{\xi }}_t}{\min } \sum _{j=1}^2 {\big ({\varvec{\xi }}_t-{\varvec{\hat{\xi }}}{}_t^{(j)}\big )}^{\scriptscriptstyle \top }\; {{\varvec{\hat{{\varSigma }}}}{}_t^{(j)}}^{-1} \big ({\varvec{\xi }}_t-{\varvec{\hat{\xi }}}{}_t^{(j)}\big ). \end{aligned}$$
(3)

The above objective can be solved easily by differentiating and equating to zero the above equation, yielding a point \({\varvec{\hat{\xi }}}_{t}\), with an estimation error defined by a covariance \({\varvec{\hat{{\varSigma }}}}_{t}\). It is easy to show that the resulting \(\mathcal {N}\Big ( {\varvec{\hat{\xi }}}_{t},{\varvec{\hat{{\varSigma }}}}_{t} \Big )\) corresponds to the product of the two Gaussians \(\mathcal {N}\Big ({\varvec{\hat{\xi }}}{}_{t}^{(1)},\; {\varvec{\hat{{\varSigma }}}}{}_{t}^{(1)} \Big )\) and \(\mathcal {N}\Big ({\varvec{\hat{\xi }}}{}_{t}^{(2)},\; {\varvec{\hat{{\varSigma }}}}{}_{t}^{(2)} \Big )\), see [18] for details.

Fig. 3
figure 3

Minimization of the objective function in Eq. (3) composed of a weighted sum of quadratic error terms, whose result corresponds to a product of Gaussians

Figure 3 illustrates this process for one of the Gaussian in Fig. 2.

3 Task-parameterized Gaussian mixture model (TP-GMM)

The task-parameterized Gaussian mixture model (TP-GMM) is a direct extension of the objective problem presented above, by considering multiple frames and multiple clusters of datapoints (soft clustering via mixture modeling). It probabilistically encodes the relevance of candidate frames, which can change during the task. In contrast to approaches such as [71] that aim at extracting a single (most prominent) coordinate system located at the end of a motion segment, the proposed approach allows the superposition and transition of different coordinate systems that are relevant for the task (parallel organization of behavior primitives, adaptation to multiple viapoints in the middle of the movement, or modulation based on positions, orientations or geometries of objects).

Each demonstration \(m\in \{1,\ldots ,M\}\) contains \(T_m\) datapoints forming a dataset of N datapoints \(\{{\varvec{\xi }}_{t}\}_{t=1}^N\) with \(N=\sum _{m}^{M} T_m\).

The task parameters are represented by P coordinate systems, defined at time step t by \(\{{\varvec{b}}_{t,j},{\varvec{A}}_{t,j}\}_{j=1}^P\), representing, respectively, the origin of the observer and a transformation matrix.

The demonstrations \({\varvec{\xi }}\in \mathbb {R}^{D\times N}\) are observed from these different viewpoints, forming P trajectory samples \({\varvec{X}}^{(j)}\in \mathbb {R}^{D\times N}\). These samples can be collected from sensors located at the frames or computed with

$$\begin{aligned} {\varvec{X}}^{(j)}_t = {\varvec{A}}_{t,j}^{-1} ({\varvec{\xi }}_t - {\varvec{b}}_{t,j}). \end{aligned}$$
(4)

The parameters of a TP-GMM with K components are defined by \(\big \{\pi _i,\{{\varvec{\mu }}^{(j)}_i,{\varvec{{\varSigma }}}^{(j)}_i\}_{j=1}^P\big \}_{i=1}^K\) (\(\pi _i\) are the mixing coefficients, \({\varvec{\mu }}^{(j)}_i\) and \({\varvec{{\varSigma }}}^{(j)}_i\) are the center and covariance matrix of the ith Gaussian component in frame j).

Learning of the parameters is achieved by log-likelihood maximization subject to the constraint that the data in the different frames arose from the same source, resulting in an expectation-maximization (EM) algorithm [23] to iteratively update the model parameters until convergence, see “Appendix 1” for details. Other forms of learning for mixture models are possible, including spectral clustering [53, 69, 84], online learning [27, 34, 68, 83, 99] or self-refinement [19].

For a movement in Cartesian space with 10 demonstrations and 3 candidate frames, the overall learning process typically takes 1–3 s on a standard laptop. The reproduction is much faster and can be computed online (usually below 1 ms).

The learned model is then used to reproduce movements in other situations (for new position and orientation of candidate frames). A new GMM with parameters \(\{\pi _i,{\varvec{\hat{\xi }}}_{t,i},{\varvec{\hat{{\varSigma }}}}_{t,i}\}_{i=1}^K\) can automatically be generated with

$$\begin{aligned}&\mathcal {N} \Big ( {\varvec{\hat{\xi }}}_{t,i}, {\varvec{\hat{{\varSigma }}}}_{t,i} \Big ) \;\propto \; \prod \limits _{j=1}^P \mathcal {N} \Big ( {\varvec{\hat{\xi }}}{}_{t,i}^{(j)},\; {\varvec{\hat{{\varSigma }}}}{}_{t,i}^{(j)} \Big ), \;{{\mathrm {with}}}\nonumber \\&\quad {\varvec{\hat{\xi }}}{}^{(j)}_{t,i} = {\varvec{A}}_{t,j} {\varvec{\mu }}^{(j)}_i + {\varvec{b}}_{t,j} \;,\quad {\varvec{\hat{{\varSigma }}}}{}^{(j)}_{t,i} = {\varvec{A}}_{t,j}{\varvec{{\varSigma }}}^{(j)}_i {\varvec{A}}_{t,j}^{\scriptscriptstyle \top }, \end{aligned}$$
(5)

where the result of the Gaussian product is given by

$$\begin{aligned} {\varvec{\hat{{\varSigma }}}}_{t,i} = \Bigg ( \sum \limits _{j=1}^P {{\varvec{\hat{{\varSigma }}}}{}^{(j)}_{t,i}}^{-1} \Bigg )^{-1},\quad {\varvec{\hat{\xi }}}_{t,i} = {\varvec{\hat{{\varSigma }}}}_{t,i} \sum \limits _{j=1}^P {{\varvec{\hat{{\varSigma }}}}{}^{(j)}_{t,i}}^{-1} {\varvec{\hat{\xi }}}{}^{(j)}_{t,i} . \end{aligned}$$
(6)

For computational efficiency, the above equations can be computed with precision matrices instead of covariances. Figure 4 depicts the different steps of the above computation.

Fig. 4
figure 4

TP-GMM retrieval process and associated variables. a, b The model parameters (TP-GMM with 3 Gaussians and 2 frames). c Temporary GMM retrieved at time step t for a new configuration of the two frames. df Details of computation, where each temporary Gaussian is retrieved as a product of linearly transformed Gaussians

The proposed task-parameterized approach requires each frame to evaluate the local variability of the demonstrations. This section showed that a mixture model could be employed to extract this variability. However, other encoding strategies can be used as long as the local variations take the form of full covariances aligned with the different frames. In particular, data-driven encoding strategies can alternatively be employed. This will be shown later in Sect. 8 using two examples with task-parameterized Gaussian process (TP-GP) and task-parameterized locally weighted regression (TP-LWR).

figure a

3.1 Regularization of the TP-GMM parameters

In applications that are prone to overfitting, it is relevant to introduce regularization terms. Regularization has the effect of avoiding singularities and smoothing the solution space. An option is to define a minimal admissible eigenvalue \(\lambda _{\min }\) and adjust each covariance matrix \({\varvec{{\varSigma }}}^{(j)}_i\) so that

$$\begin{aligned}&{\varvec{{\varSigma }}}^{(j)}_i \leftarrow \; {\varvec{V}}^{(j)}_i {\varvec{\tilde{D}}}^{(j)}_i {{\varvec{V}}^{(j)}_i}^{\scriptscriptstyle \top }, \\&{{\mathrm {with}}}\quad {\varvec{\tilde{D}}}^{(j)}_i = \begin{bmatrix} \tilde{\lambda }^{2(j)}_{i,1}&0&\cdots&0 \\ 0&\tilde{\lambda }^{2(j)}_{i,2}&\cdots&0 \\ \vdots&\vdots&\ddots&\vdots \\ 0&0&\cdots&\tilde{\lambda }^{2(j)}_{i,d} \end{bmatrix}, \nonumber \\&{{\mathrm {and}}}\quad \tilde{\lambda }^{(j)}_{i,k} = \max (\tilde{\lambda }^{(j)}_{i,k},\lambda _{\min }) \quad \forall k\in \{1,\ldots ,d\}, \nonumber \end{aligned}$$
(7)

where \({\varvec{V}}^{(j)}_i\) is a matrix containing the stacked eigenvectors of \({\varvec{{\varSigma }}}^{(j)}_i\), with \(\lambda ^{(j)}_{i,k}\) the corresponding eigenvalues.

Another approach is to set a priori uncertainties on the covariance parameters in the form of a diagonal isotropic covariance \({\varvec{I}}\rho \) (Tikhonov regularization), so that

$$\begin{aligned} {\varvec{{\varSigma }}}^{(j)}_i \;\leftarrow \; {\varvec{{\varSigma }}}^{(j)}_i + {\varvec{I}}\rho , \end{aligned}$$
(8)

with \({\varvec{I}}\) an identity matrix and \(\rho \) a small scalar factor that can be either set empirically or estimated from the data. The difference with Eq. (7) is that a value \(\rho \) is added to each \(\lambda ^{(j)}_{i,k}\) instead of truncating the eigenvalues. The same development can be done with singular value decomposition, emphasizing the effect of the regularization on the condition number, by forcing it to be higher than a threshold as in Eq. (7) or by increasing the singular values as in Eq. (8). It is in some applications convenient to apply small regularization terms at different steps of the procedure (e.g., at each iteration in the EM process and after convergence before computing Gaussian products).

4 Extension to task-parameterized subspace clustering

Classical Gaussian mixture models tend to perform poorly in high-dimensional spaces if too few datapoints are available. This is also true for robotics problems aiming at encoding multivariate and multimodal signals from only few demonstrations. Namely, if the training set is \(\{{\varvec{\xi }}_t\}_{t=1}^N\) with \({\varvec{\xi }}_t\in \mathbb {R}^D\), the curse of dimensionality occurs if the dimension of the data D is too large compared to the size of the training set N. In particular, the problem can affect the full covariances \({\varvec{{\varSigma }}}^{(j)}_i\in \mathbb {R}^{D\times D}\) in (52) because the number of parameters to be estimated quadratically grows with D.

Bouveyron and Brunet reviewed various ways of viewing the problem and coping with high-dimensional data in clustering problems [11]. In practice, three viewpoints can be considered:

  1. 1.

    Since D is too large compared to N, a global dimensionality reduction should be applied as a pre-processing step to reduce D.

  2. 2.

    Since D is too large compared to N, the solution space contains many poor local optima; the solution space should be smoothed by introducing ridge or lasso regularization in the estimation of the covariance (avoiding numerical problem and singular solutions when inverting the covariances). As discussed in Sect. 3.1, a simple form of regularization can be achieved after the maximization step of each EM loop.

  3. 3.

    Since D is too large compared to N, the model is probably over-parametrized, and a more parsimonious model should be used (thus estimating a fewer number of parameters).

One example falling in the last category would be to consider spherical or diagonal covariances instead of full matrices, corresponding to a separate treatment of each variable. Although commonly employed in robotics, such decoupling is a limiting factor to encode gestures and sensorimotor streams, because it does not fully exploit principles underlying coordination, motor skill acquisition and actionperception couplings [41, 46, 54, 67, 81, 86, 94, 103].

Our rationale is that diagonal constraints are too strong for motor skill encoding, because it loses important synergistic information among the variables. There are, however, a wide range of alternatives in mixture modeling, which are in-between the encoding of diagonal and full covariances and that can readily be exploited in the context of robot skills acquisition. These alternatives can be studied as a subspace clustering problem that aims at grouping the data such that they can be locally projected in a subspace of reduced dimensionality, thus helping the analysis of the local trend of the movement, while reducing the number of parameters to be estimated, and “locking” the most important synergies to cope with perturbations.

Many possible constraints can be considered, grouped in families such as parsimonious GMM [7, 11, 62], mixtures of factor analyzers (MFA) [61] or mixtures of probabilistic principal component analyzers (MPPCA) [93]. These techniques will next be described in the context of task-parameterized models.

4.1 Parsimonious TP-GMM

By following the perspective of Sect. 3.1, a parsimonious TP-GMM can be defined by considering the spectral decomposition of the covariances

$$\begin{aligned} {\varvec{{\varSigma }}}^{(j)}_i = {\varvec{V}}^{(j)}_i {\varvec{D}}^{(j)}_i {{\varvec{V}}^{(j)}_i}^{\scriptscriptstyle \top }, \end{aligned}$$
(9)

with \({\varvec{V}}^{(j)}_i\) a matrix of ordered eigenvectors (determining the orientation of the cluster) and \({\varvec{D}}^{(j)}_i\) a diagonal matrix with ordered eigenvalues \(\lambda ^{(j)}_{i,k}\) (determining the shape of the cluster), where constraints are set by sharing some of these elements among the clusters, and/or by keeping only the first d eigenvectors and eigenvalues in the parameterization.

The high-dimensional data clustering (HDDC) approach from [12] lies in this category of models addressing both subspace clustering and regularization. An example of implementation is to consider that the subspace of each cluster i is generated by the first \(d_i\) eigenvectors associated with the first \(d^{(j)}_i\) eigenvalues \(\lambda ^{(j)}_{i,k}\), and that outside of this subspace, the variance is spherical, modeled by a single parameter

$$\begin{aligned} \bar{\lambda }^{(j)}_i= & {} \frac{1}{D-d^{(j)}_i} \sum _{\;\;\;k=d^{(j)}_i+1}^D \lambda ^{(j)}_{i,k}\nonumber \\= & {} \frac{1}{D-d^{(j)}_i} \Bigg ( {{\mathrm {tr}}}\left( {\varvec{{\varSigma }}}^{(j)}_i\right) - \sum _{k=1}^{d^{(j)}_i} \lambda ^{(j)}_{i,k} \Bigg ), \end{aligned}$$
(10)

which is used to reconstruct a full covariance matrix by replacing the last \(D-d^{(j)}_i\) eigenvalues with \(\bar{\lambda }^{(j)}_i\).

figure b

4.2 Task-parameterized mixture of factor analyzers (TP-MFA)

Factor analysis (FA) is an approach as old as principal component analysis (PCA) to cope with dimension reduction, often overshadowed by PCA although is has an equivalently important literature on the topic [12]. The basic idea of factor analysis is to reduce the dimensionality of the data while keeping the observed covariance structure, see [26] for an example of application in robotics.

The TP-GMM presented in Sect. 3 is fully compatible with subspace clustering approaches based on factor analysis. A task-parameterized mixture of factor analyzers (TP-MFA) assumes for each component i and frame j a covariance structure of the form

$$\begin{aligned} {\varvec{{\varSigma }}}^{(j)}_i = {\varvec{{\varLambda }}}^{(j)}_i {{\varvec{{\varLambda }}}^{(j)}_i}^{\scriptscriptstyle \top }+ {\varvec{{\varPsi }}}^{(j)}_i, \end{aligned}$$
(11)

where \({\varvec{{\varLambda }}}^{(j)}_i\in \mathbb {R}^{D\times d}\), known as the factor loadings matrix, typically has \(d<D\) (providing a parsimonious representation of the data) and a diagonal noise matrix \({\varvec{{\varPsi }}}^{(j)}_i\).

The factor loading and noise terms of the covariance matrix can be constrained in different ways (e.g., such as being shared across Gaussian components), yielding a collection of eight parsimonious covariance structures [62]. For example, the task-parameterized mixture of probabilistic principal component analyzers (TP-MPPCA) [93] is a special case of TP-MFA with the distribution of the errors assumed to be isotropic with \({\varvec{{\varPsi }}}^{(j)}_i={\varvec{I}}{\sigma ^{(j)}_i}^2\).

Figure 5 shows that the covariance structure in MFA can span a wide range of covariances.

Fig. 5
figure 5

Exploitation of the covariance structure in a mixture of factor analyzers (MFA) to consider intermediary steps between the modeling as diagonal covariances (left) and full covariances (right)

“Appendix 2” details the structure of TP-MFA and provides an EM algorithm to estimate the model parameters.

The hypothesis of TP-MFA models can be viewed as less restrictive as TP-HDDC models based on eigendecomposition (see Sect. 4.1), because the subspace of each class does not need to be spanned by orthogonal vectors, whereas it is a necessary condition in models based on eigendecomposition [12].

Similarly to parsimonious GMM based on eigendecomposition, the covariances in TP-MFA can be constrained by fixing d or by sharing elements among the mixture components. This encoding strategy can then be extended to variants of MFA aiming at optimizing the sharing and re-use of subspaces among the Gaussian components, such as in semi-tied covariance [31]. These techniques can be exploited to extend the concept of synergies to a wider range of rich motor skills, with a simultaneous segmentation and re-use of previously discovered synergies.

For each approach, a dedicated EM update can be derived corresponding to the type of constraints considered [62]. They all reconstruct estimates of the full covariances, which is an important characteristic that will be exploited in the next sections of this article.

The TP-MFA extension of TP-GMM opens several roads for further investigation. Bayesian nonparametric approaches such as [101] can be used to simultaneously select the number of clusters and the dimension of the subspace in each cluster. Another extension is to use tied structures in the covariances to enable the organization and reuse of previously acquired synergies [31].

Another possible extension is to enable deep learning strategies in task-parameterized models. As discussed in [92], the prior of each FA can be replaced by a separate second-level MFA that learns to model the aggregated posterior of that FA (instead of the isotropic Gaussian), providing a hierarchical structure organization where one layer of latent variables can be learned at a time. This can be exploited as a link with deep learning strategies [9, 40] for real-valued high-dimensional data within directed graphical models.

Figure 6 shows a kinematic example with TP-MFA used for encoding and synthesis purposes.

figure c
Fig. 6
figure 6

Example of TP-MFA to encode and retrieve full-body dancing motion from [36]. Here, the motion of one of the two partners (\(D=94\)) is retrieved by adapting it online to the motion of the other partner. In the model, the red stick figure (in thin line) is the frame of reference and the gray stick figure (in thick line) is the motion encoded in TP-MFA. The black stick figure (in thick line) shows the motion regenerated with TP-MFA and Gaussian mixture regression, based on the observation of the red stick figure. For 12 Gaussian components (\(K=12\)) and a subspace of 2 dimensions (\(d=2\)) encoding a motion of 94 dimensions, the total number of parameters in TP-MFA is 4511. The corresponding number of parameters in a TP-GMM with full covariances would be 54719. We can see that TP-MFA generates a smooth and natural movement similar to the original dance (color figure online)

5 Extension to motion synthesis

While the previous sections focused only on TP-GMM as an encoding strategy, this section addresses the problem of generating movements from the model.

Several approaches can be used to retrieve continuous movements from a TP-GMM. The next two subsections provide examples for two different synthesis techniques. The first technique is to encode a decay term or a time variable as an additional feature in the mixture and use Gaussian mixture regression (GMR) [33] to retrieve movements adapted to the current situations. The second technique is to encode both static and dynamic features in the mixture model as in trajectory-HMM [30, 89, 95, 106]. These two techniques are described next.

5.1 Gaussian mixture regression (GMR)

With a GMM representation, the reproduction of a movement can be formalized as a regression problem [33]. We showed in [17, 18] that in robot learning, Gaussian mixture regression (GMR) offers a simple solution to generate continuous movements from a GMM. GMR relies on basic properties of normal distributions (linear transformation and conditioning). It provides a probabilistic retrieval of movements or policies, in which the model can compute the next actions on-the-fly, with a computation time that is independent of the number of datapoints used to train the model.

Fig. 7
figure 7

Illustration of the encoding of \(\mathcal {P}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}},{\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}})\) as a Gaussian mixture model (GMM) with two components, and estimation of \(\mathcal {P}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}|{\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}})\) with Gaussian mixture regression (GMR), where both \({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}\) and \({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}\) can be multidimensional. The model can emulate a large spectrum of regression mechanisms, from standard linear regression (when a single component \(K=1\) is used), to non-linear kernel regression (with \(K=N\) and a Gaussian centered on each datapoint)

In contrast to other regression methods such as locally weighted regression (LWR) [79], locally weighted projection regression (LWPR) [100] or Gaussian process regression (GPR) [35, 70, 75], GMR does not model the regression function directly. It models the joint probability density function of the data and then derives the regression function from the joint density model, see [88] for an excellent review of regression approaches. The estimation of the model parameters is thus achieved in an offline phase that depends linearly on the number of datapoints. Regression is then independent of this number and can be computed very rapidly, which makes the approach an interesting alternative to regression methods whose processing grows with the size of the dataset. The other benefit is that both input and output variables can be multidimensional without modification of the model.

GMR can, for example, be employed in robot applications requiring input and output dimensions to be specified at run time (e.g., to handle missing sensory inputs or to react swiftly by retrieving partial outputs).

The superscripts \(\mathcal {I}\) and \(\mathcal {O}\) will be further used to describe the dimensions that span for input and output (used as exponents for vectors and matrices). The general case of a GMM encoding a dataset \({\varvec{\xi }}\) with the joint distribution \(\mathcal {P}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}},{\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}})\sim \sum _{i=1}^K\pi _i\;\mathcal {N}({\varvec{\mu }}_i,{\varvec{{\varSigma }}}_i)\) will first be described, which will later be extended to its task-parameterized version.

At each iteration step t, the datapoint \({\varvec{\xi }}_t\) can be decomposed as two subvectors \({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t\) and \({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}_t\) spanning for the input and output dimensions. For trajectory encoding in task space, \(\mathcal {I}\) corresponds to the time input dimension (e.g., value of a decay term), and \(\mathcal {O}\) corresponds to the output dimensions describing a path (e.g., end-effector position in task space).

With this notation, a block decomposition of the datapoint \({\varvec{\xi }}_t\), vectors \({\varvec{\mu }}_i\) and matrices \({\varvec{{\varSigma }}}_i\) can be written as

$$\begin{aligned} {\varvec{\xi }}_t=\begin{bmatrix}{\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t \\ {\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}_t\end{bmatrix} \;,\quad {\varvec{\mu }}_i=\begin{bmatrix}{\varvec{\mu }}^{\scriptscriptstyle {{\mathcal {I}}}}_i \\ {\varvec{\mu }}^{\scriptscriptstyle {{\mathcal {O}}}}_i\end{bmatrix} \;,\quad {\varvec{{\varSigma }}}_i=\begin{bmatrix}{\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {I}}}}_i \;{\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {IO}}}}_i \\ {\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {OI}}}}_i {\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {O}}}}_i\end{bmatrix}. \end{aligned}$$
(12)

At each time step t during reproduction, \(\mathcal {P}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}_t|{\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t)\) is computed as the conditional distribution

$$\begin{aligned}&\mathcal {P}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}_t|{\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t) \sim \sum _{i=1}^K h_i({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t)\; \mathcal {N} \left( {\varvec{\hat{\mu }}}^{\scriptscriptstyle {{\mathcal {O}}}}_i({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t),{\varvec{\hat{{\varSigma }}}}^{\scriptscriptstyle {{\mathcal {O}}}}_i\right) , \end{aligned}$$
(13)
$$\begin{aligned}&{{\mathrm {with}}}\quad \hat{{\varvec{\mu }}}^{\scriptscriptstyle {{\mathcal {O}}}}_i({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t) = {\varvec{\mu }}^{\scriptscriptstyle {{\mathcal {O}}}}_i + {\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {OI}}}}_i{{\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {I}}}}_i}^{-1} ({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t-{\varvec{\mu }}^{\scriptscriptstyle {{\mathcal {I}}}}_i), \end{aligned}$$
(14)
$$\begin{aligned}&\hat{{\varvec{{\varSigma }}}}^{\scriptscriptstyle {{\mathcal {O}}}}_i = {\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {O}}}}_i - {\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {OI}}}}_i{{\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {I}}}}_i}^{-1} {\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {IO}}}}_i,\end{aligned}$$
(15)
$$\begin{aligned}&{{\mathrm {and}}}\quad h_i({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t) = \frac{\pi _i \mathcal {N}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t |\; {\varvec{\mu }}^{\scriptscriptstyle {{\mathcal {I}}}}_i,{\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {I}}}}_i)}{\sum _k^K \pi _k \mathcal {N}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t |\; {\varvec{\mu }}^{\scriptscriptstyle {{\mathcal {I}}}}_k,{\varvec{{\varSigma }}}^{\scriptscriptstyle {{\mathcal {I}}}}_k)}. \end{aligned}$$
(16)

Note that Eq. (13) represents a multimodal distribution. For problems in which a single peaked output distribution is preferred, Eq. (13) can be approximated by a normal distribution (see “Appendix 3” for details of computation)

$$\begin{aligned}&{\mathcal {P}}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}_t | {\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t) = {\mathcal {N}} \Big ({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}_t |\; {\varvec{\hat{\mu }}}{}^{\scriptscriptstyle {{\mathcal {O}}}}_t, {\varvec{\hat{\varSigma }}}{}^{\scriptscriptstyle {{\mathcal {O}}}}_t \Big ), \quad {{\mathrm {with}}} \end{aligned}$$
(17)
$$\begin{aligned}&{\varvec{\hat{\mu }}}{}^{\scriptscriptstyle {{\mathcal {O}}}}_t = \sum _{i=1}^K h_i({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t) \; {\varvec{\hat{\mu }}}_i^{\scriptscriptstyle {{\mathcal {O}}}} ({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t) ,\end{aligned}$$
(18)
$$\begin{aligned}&{\varvec{\hat{\varSigma }}}{}^{\scriptscriptstyle {{\mathcal {O}}}}_t = \sum _{i=1}^K h_i({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t) \Big ( {\varvec{\hat{{\varSigma }}}}_i^{\scriptscriptstyle {{\mathcal {O}}}} + {\varvec{\hat{\mu }}}_i^{\scriptscriptstyle {{\mathcal {O}}}}({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t) \; {{\varvec{\hat{\mu }}}_i^{\scriptscriptstyle {{\mathcal {O}}}} ({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}_t)}^{\scriptscriptstyle \top }\Big ) - {\varvec{\hat{\mu }}}^{\scriptscriptstyle {{\mathcal {O}}}}_t {{\varvec{\hat{\mu }}}^{\scriptscriptstyle {{\mathcal {O}}}}_t}^{\scriptscriptstyle \top }. \end{aligned}$$
(19)

The retrieved signal in Eq. (17) encapsulates variation and correlation information in the form of full covariance matrices. GMR has so far mostly been used in three manners:

  1. 1.

    as an autonomous system with \({{\varvec{\xi }}=[{\varvec{x}}^{\scriptscriptstyle \top },\dot{{\varvec{x}}}^{\scriptscriptstyle \top }]}^{\scriptscriptstyle \top }\), by learning \(\mathcal {P}({\varvec{x}},\dot{{\varvec{x}}})\) with a GMM, with \({\varvec{x}}\) and \(\dot{{\varvec{x}}}\) representing position and velocity of the system (either in task space or joint space), and by retrieving iteratively during reproduction a series of velocity commands by estimating \(\mathcal {P}(\dot{{\varvec{x}}}|{\varvec{x}})\) with GMR [17, 38, 39, 47];

  2. 2.

    as time-indexed trajectories with \({{\varvec{\xi }}=[t,{\varvec{x}}^{\scriptscriptstyle \top }]}^{\scriptscriptstyle \top }\), by learning \(\mathcal {P}(t,{\varvec{x}})\) with a GMM and retrieving \(\mathcal {P}({\varvec{x}}|t)\) with GMR for each time step to reproduce smooth trajectories (infinitely differentiable) [18].

  3. 3.

    as a probabilistic formulation of dynamic movement primitives (DMP) [20].

Alternatively, any subset of input–output dimensions can be selected, which can change, if required, at each iteration during reproduction. It can, for example, handle different sources of missing data, as the system is able to consider any combination of multidimensional input/output mappings during the retrieval phase. Expectations on the remaining dimensions can be computed within the control loop of the robot, corresponding to a convex sum of linear approximations (with weights varying non-linearly).

Figure 7 depicts the use of GMR in the case of time-indexed trajectories.

GMR can be viewed as a trade-off between a global and local approach in the sense that the placement and spread of the basis functions are learned, together with their responses, as a soft partitioning problem through expectation-maximization (EM),Footnote 1 while the prediction is a weighted superposition of locally linear systems. It provides variation and correlation information for the retrieved multidimensional output, enabling the extraction of local coordination patterns in the movement.

Note that if the application requires the encoding of high-dimension data from few observations, subspace learning techniques such as MFA (see Sect. 4) can be used jointly with GMR to locally reduce the dimensionality without modifying the regression process.

The combination of TP-GMM and GMR is simply achieved by augmenting the dataset in each frame with an input dimension and defining all task parameters \({\varvec{A}}_{t,j}\) and \({\varvec{b}}_{t,j}\) so that the input is not modulated by the task parameterization. Compared to an initial TP-GMM encoding \({\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}\) with task parameters \({\varvec{A}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j}\) and \({\varvec{b}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j}\), the combination of TP-GMM and GMR instead encodes \({\varvec{\xi }}=\big [{{\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {I}}}}}^{\scriptscriptstyle \top }, {{\varvec{\xi }}^{\scriptscriptstyle {{\mathcal {O}}}}}^{\scriptscriptstyle \top }\big ]^{\scriptscriptstyle \top }\) with task parameters

$$\begin{aligned} {\varvec{A}}_{t,j} = \begin{bmatrix} {\varvec{I}}&{\varvec{0}} \\ {\varvec{0}}&{\varvec{A}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j} \end{bmatrix} \;,\quad {\varvec{b}}_{t,j} = \begin{bmatrix} {\varvec{0}} \\ {\varvec{b}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j} \end{bmatrix}, \end{aligned}$$
(20)

where in the case of a decay term (or an explicit time variable driving the system), the identity matrix \({\varvec{I}}\) collapses to 1.Footnote 2

figure d

5.2 GMM with dynamic features (trajectory-GMM)

In the field of speech processing, the extraction of statistics from both static and dynamic features within a hidden Markov model (HMM) has a long history [30, 95, 106]. In particular, it can be used in speech synthesis to avoid discontinuities in the generated speech spectra. The synthesized speech then becomes natural and smooth even when a small number of Gaussians is used. This is achieved by coordinating the distributions of both static and dynamic features (the dynamic features are often called delta and delta-delta parameters). In speech processing, these parameters usually correspond to the evolution of mel-frequency cepstral coefficients characterizing the power spectrum of a sound, but the same trajectory-HMM approach can be used with any form of continuous signals. In robotics, this approach has rarely been exploited, at the exception of the work from Sugiura et al. employing it to represent object manipulation movements [89].

For the encoding of movements, velocity and acceleration can alternatively be used as dynamic features. By considering an Euler approximation, the velocity is computed as

$$\begin{aligned} {\varvec{\dot{x}}}_t = \frac{{\varvec{x}}_{t+1}-{\varvec{x}}_t}{\Delta t}, \end{aligned}$$
(21)

where \({\varvec{x}}_t\) is a multivariate position vector. The acceleration is similarly computed as

$$\begin{aligned} {\varvec{\ddot{x}}}_t = \frac{{\varvec{\dot{x}}}_{t+1}-{\varvec{\dot{x}}}_t}{\Delta t} = \frac{{\varvec{x}}_{t+2}-2{\varvec{x}}_{t+1}+{\varvec{x}}_t}{\Delta t^2}. \end{aligned}$$
(22)

By using (21) and (22), a vector \({\varvec{\zeta }}_t\) will be used to represent the concatenated position, velocity and acceleration vectors at time step t, namelyFootnote 3

(23)

\({\varvec{\zeta }}\) and \({\varvec{x}}\) are then defined as large vectors concatenating \({\varvec{\zeta }}_t\) and \({\varvec{x}}_t\) for all time steps, namely

$$\begin{aligned} {\varvec{\zeta }}=\left[ \begin{matrix}{\varvec{\zeta }}_1 \\ {\varvec{\zeta }}_2 \\ \vdots \\ {\varvec{\zeta }}_T \end{matrix}\right] ,\quad {\varvec{x}}=\left[ \begin{matrix}{\varvec{x}}_1 \\ {\varvec{x}}_2 \\ \vdots \\ {\varvec{x}}_T \end{matrix}\right] . \end{aligned}$$
(24)

Similarly to the matrix operator (23) defined for a single time step, a large sparse matrix \({\varvec{{\varPhi }}}\) can be defined so that \({\varvec{\zeta }}={\varvec{{\varPhi }}}{\varvec{x}}\), namelyFootnote 4

(25)

The dataset \(\{{\varvec{\zeta }}_t\}_{t=1}^N\) with \(N=\sum _m^M T_m\) is composed of M trajectory samples, where the mth trajectory sample has \(T_m\) datapoints. It can be encoded in a Gaussian mixture model (GMM), hidden Markov model (HMM) or hidden semi-Markov model (HSMM) [73]. The example of a GMM encoding \(\mathcal {P}({\varvec{\zeta }})\sim \sum _{i=1}^K\pi _i\;\mathcal {N}({\varvec{\mu }}_i,{\varvec{{\varSigma }}}_i)\) will be described here, which will later be extended to its task-parameterized version.

After training, for a given sequence of states \({\varvec{s}}=\{s_1,s_2,\ldots ,s_T\}\) of T time steps, with discrete states \(s_t\in \{1,\ldots ,K\}\),Footnote 5 the likelihood of a movement \({\varvec{\zeta }}\) is given by

$$\begin{aligned} \mathcal {P}({\varvec{\zeta }}|{\varvec{s}}) = \prod _{t=1}^T \mathcal {N}({\varvec{\zeta }}_t|{\varvec{\mu }}_{s_t},{\varvec{{\varSigma }}}_{s_t}), \end{aligned}$$
(26)

where \({\varvec{\mu }}_{s_t}\) and \({\varvec{{\varSigma }}}_{s_t}\) are the center and covariance of state \(s_t\) at time step t. This product can be rewritten as the conditional distribution

$$\begin{aligned}&\mathcal {P}({\varvec{\zeta }}|{\varvec{s}}) = \mathcal {N}({\varvec{\zeta }}|{\varvec{\mu }}_{{\varvec{s}}},{\varvec{{\varSigma }}}_{{\varvec{s}}}), \\&{{\mathrm {with}}}\quad {\varvec{\mu }}_{{\varvec{s}}} = \left[ \begin{matrix}{\varvec{\mu }}_{s_1} \\ {\varvec{\mu }}_{s_2} \\ \vdots \\ {\varvec{\mu }}_{s_T} \end{matrix}\right] \quad {{\mathrm {and}}}\quad {\varvec{{\varSigma }}}_{{\varvec{s}}} = \left[ \begin{matrix}{\varvec{{\varSigma }}}_{s_1} &{} {\varvec{0}} &{} \cdots &{} {\varvec{0}} \\ {\varvec{0}} &{} {\varvec{{\varSigma }}}_{s_2} &{} \cdots &{} {\varvec{0}} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {\varvec{0}} &{} {\varvec{0}} &{} \cdots &{} {\varvec{{\varSigma }}}_{s_T} \end{matrix}\right] .\nonumber \end{aligned}$$
(27)

By using the relation \({\varvec{\zeta }}={\varvec{{\varPhi }}}{\varvec{x}}\), we then seek during reproduction for a trajectory \({\varvec{x}}\) maximizing (27), namely

$$\begin{aligned} {\varvec{\hat{x}}} = \arg \max _{{\varvec{x}}} \; \log \;\mathcal {P}({\varvec{{\varPhi }}}{\varvec{x}} \;|\;{\varvec{s}}). \end{aligned}$$
(28)

The part of \(\log \;\mathcal {P}({\varvec{{\varPhi }}}{\varvec{x}} \;|\;{\varvec{s}})\) dependent on \({\varvec{x}}\) takes the quadratic error form

$$\begin{aligned} c&= ({\varvec{\mu }}_{{\varvec{s}}}-{\varvec{\zeta }})^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} ({\varvec{\mu }}_{{\varvec{s}}}-{\varvec{\zeta }}) \\&= ({\varvec{\mu }}_{{\varvec{s}}}-{\varvec{{\varPhi }}}{\varvec{x}})^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} ({\varvec{\mu }}_{{\varvec{s}}}-{\varvec{{\varPhi }}}{\varvec{x}}). \nonumber \end{aligned}$$
(29)

A solution can be found by differentiating the above objective function with respect to \({\varvec{x}}\) and equating to 0, providing the trajectory (in vector form)

$$\begin{aligned} {\varvec{\hat{x}}} = \left( {\varvec{{\varPhi }}}^{\scriptscriptstyle \top }{{\varvec{{\varSigma }}}_{{\varvec{s}}}}^{-1} {\varvec{{\varPhi }}}\right) ^{-1} {\varvec{{\varPhi }}}^{\scriptscriptstyle \top }{{\varvec{{\varSigma }}}_{{\varvec{s}}}}^{-1} {\varvec{\mu }}_{{\varvec{s}}}, \end{aligned}$$
(30)

with the covariance error of the weighted least squares estimate given by

$$\begin{aligned} {\varvec{\hat{{\varSigma }}}}^{{\varvec{x}}} = \sigma \left( {\varvec{{\varPhi }}}^{\scriptscriptstyle \top }{{\varvec{{\varSigma }}}_{{\varvec{s}}}}^{-1} {\varvec{{\varPhi }}}\right) ^{-1}, \end{aligned}$$
(31)

where \(\sigma \) is a scale factor.Footnote 6

The resulting Gaussian \(\mathcal {N}({\varvec{\hat{x}}},{\varvec{\hat{{\varSigma }}}}^{{\varvec{x}}})\) forms a trajectory distribution. Other forms of trajectory distributions can be employed, where the main differences lie in the structure given to \({\varvec{\hat{{\varSigma }}}}^{{\varvec{x}}}\) and in the way the basis functions are defined. A relevant example is the probabilistic movement primitives approach proposed by Paraschos et al. [72]. The structure of the trajectory distribution defined in [72] requires multiple trajectory demonstrations to avoid overfitting, but the problem can be circumvented by employing factorization and variational inference techniques [77].

In trajectory-GMM, the problem of setting the shape and spread of the basis functions, as well as the problem of determining the sparse structure of \({\varvec{\hat{{\varSigma }}}}^{{\varvec{x}}}\), is directly framed within the GMM likelihood maximization problem, allowing the use of an EM algorithm to automatically organize the basis functions and find an appropriate structure for \({\varvec{\hat{{\varSigma }}}}^{{\varvec{x}}}\), which will, for example, result in a sparse band-diagonal structure when the components are sufficiently decoupled, and which will account for the correlations within \({\varvec{\zeta }}_t\).

An illustration of the trajectory-GMM properties that are of interest in robotics are shown in Fig. 8.

Fig. 8
figure 8

Example of trajectory-GMM encoding and retrieval. The planar motion contains multiple options and is learned from a set of partial demonstrations that can be provided in any order. Left four demonstrations (represented with different shades of gray), corresponding to different subparts of a longer movement, where a part in the movement contains two optional paths. Center the four demonstrations are used to train a trajectory-GMM with \(K=18\) components. Right two movements retrieved from the trajectory-GMM by stochastic sampling (with equal chance to take one or the other path). We can see that the movements are smooth, with an average position and full covariance estimated at each time step (represented as a light red flow tube of one standard deviation)

The combination of TP-GMM and trajectory-GMM is achieved by augmenting the position data with its derivatives (e.g., with velocity and acceleration) and defining all task parameters \({\varvec{A}}_{t,j}\) and \({\varvec{b}}_{t,j}\) so that they also apply to the derivatives. Compared to an initial TP-GMM encoding \({\varvec{x}}\) with task parameters \({\varvec{A}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j}\) and \({\varvec{b}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j}\), the combination of TP-GMM and trajectory-GMM instead encodes \({\varvec{\zeta }}=\big [{\varvec{x}}^{\scriptscriptstyle \top },{\varvec{\dot{x}}}^{\scriptscriptstyle \top },{\varvec{\ddot{x}}}^{\scriptscriptstyle \top }\big ]^{\scriptscriptstyle \top }\) with task parameters

$$\begin{aligned} {\varvec{A}}_{t,j} = \begin{bmatrix} {\varvec{A}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j}&{\varvec{0}}&{\varvec{0}} \\ {\varvec{0}}&{\varvec{A}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j}&{\varvec{0}} \\ {\varvec{0}}&{\varvec{0}}&{\varvec{A}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j} \end{bmatrix} \;,\quad {\varvec{b}}_{t,j} = \begin{bmatrix} {\varvec{b}}^{\scriptscriptstyle {{\mathcal {O}}}}_{t,j} \\ {\varvec{0}} \\ {\varvec{0}} \end{bmatrix}. \end{aligned}$$
(32)
figure e

5.3 Dynamic-based versus time-based features in GMM

GMR with time-indexed trajectories (Sect. 5.1) offers a simple solution for the generation of trajectories from a GMM, with the disadvantage that time-based GMR often requires in practice a preprocessing step such as dynamic time warping (DTW) to re-align multiple demonstrations in time.

For the same reason, time-based GMR may also reduce the applicability of motion synthesis to more complex forms of teaching interactions, such as learning recurring patterns (combination of discrete and periodic motions) or demonstrating different options in a movement. This is not the case for trajectory-GMM that can handle these two issues without further modification of the model (see Sect. 5.2). This is achieved at the expense of increasing the GMM dimensionality (by encoding position, velocity and acceleration instead of position and time as in time-based GMR).

Another advantage of encoding dynamic features over time features is that partial demonstrations can easily be used in the training set, see Fig. 8. In service robotics applications, it can sometimes be difficult and inefficient to demonstrate a complex manipulation task in a single shot. A more user-friendly way would be to provide incremental corrections or piecewise demonstrations of whole body movements, where the models can be trained with partial chunks of the whole movement, see, e.g., [55]. This is also required in kinesthetic teaching with robots endowed with a high number of articulations (since the user cannot control all degrees of freedom at the same time with two hands). Trajectory-GMM provides here a way to handle partial demonstrations without further modification of the model.

6 Extension to minimal intervention control

Previous sections discussed the problem of generating a reference trajectory that can adapt to the current situation, by assuming that a controller is available to track the retrieved reference trajectory. In this section, the problem is extended to that of directly finding a controller to reproduce the movement.

Section 2.2 showed that the objective function (3) underlying TP-GMM aims at finding points \({\varvec{\xi }}_t\) minimizing a weighted sum of quadratic error terms, whose result corresponds to a product of Gaussians. A similar function can be defined for the search of a controller, whose objective is to find a feedforward and feedback policy (instead of finding a reference trajectory).

This section depicts the canonical problem of searching a controller \({\varvec{u}}_t\) for the discrete linear dynamical system (double integrator)

$$\begin{aligned} \underbrace{\begin{bmatrix}{\varvec{x}}_{t+1}\\{\varvec{\dot{x}}}_{t+1}\end{bmatrix}}_{{\varvec{\xi }}_{t+1}} = \underbrace{\begin{bmatrix}{\varvec{I}}&\;{\varvec{I}}\Delta t\\{\varvec{0}}&\;{\varvec{I}}\end{bmatrix}}_{{\varvec{A}}} \underbrace{\begin{bmatrix}{\varvec{x}}_t\\{\varvec{\dot{x}}}_t\end{bmatrix}}_{{\varvec{\xi }}_t} + \underbrace{\begin{bmatrix}{\varvec{0}}\\{\varvec{I}}\Delta t\end{bmatrix}}_{{\varvec{B}}} {\varvec{u}}_t. \end{aligned}$$
(33)

minimizing the cost

$$\begin{aligned} C&= {\big ({\varvec{\hat{\xi }}}_T-{\varvec{\xi }}_T\big )}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{s_T}^{-1} \big ({\varvec{\hat{\xi }}}_T-{\varvec{\xi }}_T\big ) \nonumber \\&\quad +\; \sum _{t=1}^{T-1} \Big ({\big ({\varvec{\hat{\xi }}}_t-{\varvec{\xi }}_t\big )}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{s_t}^{-1} \big ({\varvec{\hat{\xi }}}_t-{\varvec{\xi }}_t\big ) \;+\; {\varvec{u}}_t^{\scriptscriptstyle \top }{\varvec{R}}_t\; {\varvec{u}}_t \Big ) \nonumber \\&= {\big ({\varvec{\mu }}_{{\varvec{s}}}-{\varvec{\zeta }}\big )}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} \big ({\varvec{\mu }}_{{\varvec{s}}}-{\varvec{\zeta }}\big ) \;+\; {\varvec{U}}^{\scriptscriptstyle \top }{\varvec{\tilde{R}}} {\varvec{U}}, \end{aligned}$$
(34)

with \({\varvec{\mu }}_{{\varvec{s}}}\in \mathbb {R}^{TCD}\) and \({\varvec{{\varSigma }}}_{{\varvec{s}}}={{\mathrm {blockdiag}}}({\varvec{{\varSigma }}}_{s_1},{\varvec{{\varSigma }}}_{s_2},\ldots ,{\varvec{{\varSigma }}}_{s_T})\) with \({\varvec{{\varSigma }}}_{{\varvec{s}}}\in \mathbb {R}^{TCD\times TCD}\) defined as in Eq. (27), and \({\varvec{\tilde{R}}}={{\mathrm {blockdiag}}}({\varvec{R}},{\varvec{R}},\ldots ,{\varvec{R}})\) with \({\varvec{\tilde{R}}}\in \mathbb {R}^{(T-1)D\times (T-1)D}\) an additional cost on the control inputs. The problem corresponds to an unconstrained linear model predictive control (MPC) problem.

It is worth noting that the objective function (29) used in the context of GMM with dynamic features (see Sect. 5.2) is a special case of (34) with \({\varvec{\tilde{R}}}={\varvec{0}}\).

We showed in [16] that TP-GMM can be used to find a controller autonomously regulating the stiffness and damping behavior of the robot, see also Fig. 2d. The model shares links with optimal feedback control strategies in which deviations from an average trajectory are corrected only when they interfere with task performance, resulting in a controller satisfying minimal intervention principle [94, 103]. The approach also shares similarities with the solution proposed by Medina et al. in the context of risk-sensitive control for haptic assistance [63], by exploiting the predicted variability to form a minimal intervention controller (in task space or in joint space). The retrieved variability and correlation information is exploited to generate safe and natural movements within an optimal control strategy, in accordance with the predicted range of motion that could correctly reproduce the task in the current situation. Indeed, we demonstrated in [16] that TP-GMM is fully compatible with linear quadratic regulation (LQR) strategies, providing a controller adapted to the current situation with both impedance gains and reference trajectories varying with respect to external task parameters.

The tracking problem can be solved by different techniques, either exploiting tools from physics, dynamic programming or linear algebra [6, 10]. It can, for example, be solved with a batch approach by expressing all future states \({\varvec{x}}_t\) as explicit function of the state \({\varvec{x}}_1\). By writing

$$\begin{aligned} {\varvec{\xi }}_{2}&= {\varvec{A}} {\varvec{\xi }}_1 + {\varvec{B}} {\varvec{u}}_1,\\ {\varvec{\xi }}_{3}&= {\varvec{A}} {\varvec{\xi }}_2 + {\varvec{B}} {\varvec{u}}_2 = {\varvec{A}} ({\varvec{A}} {\varvec{\xi }}_1 + {\varvec{B}} {\varvec{u}}_1) + {\varvec{B}} {\varvec{u}}_2,\\&\vdots \\ {\varvec{\xi }}_{T}&= {\varvec{A}}^T {\varvec{\xi }}_1 + {\varvec{A}}^{T-1}{\varvec{B}}{\varvec{u}}_1 + {\varvec{A}}^{T-2}{\varvec{B}}{\varvec{u}}_2 + \cdots + {\varvec{B}}{\varvec{u}}_T \end{aligned}$$

in a matrix form, we get

$$\begin{aligned} \underbrace{ \left[ \begin{matrix} {\varvec{\xi }}_1\\ {\varvec{\xi }}_2\\ {\varvec{\xi }}_3\\ \vdots \\ {\varvec{\xi }}_T \end{matrix}\right] }_{{\varvec{\zeta }}} = \underbrace{ \left[ \begin{matrix} {\varvec{I}}\\ {\varvec{A}}\\ {\varvec{A}}^{2}\\ \vdots \\ {\varvec{A}}^{T} \end{matrix}\right] }_{{\varvec{S}}^{{\varvec{\xi }}}} {\varvec{\xi }}_1 + \underbrace{ \left[ \begin{matrix} {\varvec{0}} &{} {\varvec{0}} &{} \cdots &{} {\varvec{0}} \\ {\varvec{B}} &{} {\varvec{0}} &{} \cdots &{} {\varvec{0}} \\ {\varvec{A}}{\varvec{B}} &{} {\varvec{B}} &{} \cdots &{} {\varvec{0}} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {\varvec{A}}^{T-1}{\varvec{B}} &{} {\varvec{A}}^{T-2}{\varvec{B}} &{} \cdots &{} {\varvec{B}} \end{matrix}\right] }_{{\varvec{S}}^{{\varvec{u}}}} \underbrace{ \left[ \begin{matrix} {\varvec{u}}_1\\ {\varvec{u}}_2\\ \vdots \\ {\varvec{u}}_{T-1} \end{matrix}\right] }_{{\varvec{U}}} , \end{aligned}$$
(35)

with \({\varvec{\zeta }}\in \mathbb {R}^{TCD}\), \({\varvec{S}}^{{\varvec{\xi }}}\in \mathbb {R}^{TCD\times CD}\), \({\varvec{\xi }}_1\in \mathbb {R}^{CD}\), \({\varvec{S}}^{{\varvec{u}}}\in \mathbb {R}^{TCD\times (T-1)D}\) and \({\varvec{U}}\in \mathbb {R}^{(T-1)D}\).

Substituting (35) into (34), we get the cost function

$$\begin{aligned} C&= {\big ({\varvec{\mu }}_{{\varvec{s}}} - {\varvec{S}}^{{\varvec{\xi }}} {\varvec{\xi }}_1 - {\varvec{S}}^{{\varvec{u}}} {\varvec{U}}\big )}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} \; \big ({\varvec{\mu }}_{{\varvec{s}}} - {\varvec{S}}^{{\varvec{\xi }}} {\varvec{\xi }}_1 - {\varvec{S}}^{{\varvec{u}}} {\varvec{U}}\big ) \nonumber \\&\quad +\; {\varvec{U}}^{\scriptscriptstyle \top }{\varvec{\tilde{R}}} {\varvec{U}}. \end{aligned}$$
(36)

Differentiating with respect to \({\varvec{U}}\) and equating to zero yields the sequence of control inputs

$$\begin{aligned} {\varvec{\hat{U}}} = {\big ({{\varvec{S}}^{{\varvec{u}}}}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} {\varvec{S}}^{{\varvec{u}}} + {\varvec{\tilde{R}}}\big )}^{-1} {{\varvec{S}}^{{\varvec{u}}}}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} \; \big ({\varvec{\mu }}_{{\varvec{s}}} - {\varvec{S}}^{{\varvec{\xi }}} {\varvec{\xi }}_1 \big ) , \end{aligned}$$
(37)

corresponding to a damped weighted least squares estimate (ridge regression).

The sequence of acceleration commands (37) can either be used as a planning technique to reconstruct a trajectory with (35) or as a control technique to estimate feedforward and feedback terms for the current iteration.

The same controller can also be found iteratively with a Riccati equation, see [16] for details.

It is worth noting that the constraint (33) defines the same set of relations as (21) and (22) used in trajectory-GMM (GMM with dynamic features). The main difference between the two problems is that trajectory-GMM seeks for a reference trajectory, while the above problem seeks for a controller. Both problems result in a weighted least squares solution, with the difference that (37) uses a Tikhonov regularization term corresponding to the cost \({\varvec{R}}\) that we set on the control inputs.

figure f

Extension to multiple coordinate systems

The above optimal control problem can be extended to reference trajectories expressed in P coordinate systems. By extending the cost in (34) to

$$\begin{aligned} \tilde{C}&= \sum _{j=1}^P {\Big ({\varvec{\mu }}_{{\varvec{s}}}^{(j)}-{\varvec{\zeta }}\Big )}^{\scriptscriptstyle \top }{{\varvec{{\varSigma }}}_{{\varvec{s}}}^{(j)}}^{-1} \Big ({\varvec{\mu }}_{{\varvec{s}}}^{(j)}-{\varvec{\zeta }}\Big ) + {\varvec{U}}^{\scriptscriptstyle \top }{\varvec{\tilde{R}}} {\varvec{U}}, \end{aligned}$$
(38)

the sequence of control inputs becomes

$$\begin{aligned}&{\varvec{\hat{U}}} = {\big ({{\varvec{S}}^{{\varvec{u}}}}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} {\varvec{S}}^{{\varvec{u}}} + {\varvec{\tilde{R}}}\big )}^{-1} {{\varvec{S}}^{{\varvec{u}}}}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} \; \big ({{\varvec{\mu }}_{{\varvec{s}}}} - {\varvec{S}}^{{\varvec{\xi }}} {\varvec{\xi }}_1 \big ) ,\\&\quad {{\mathrm {with}}}\quad {\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} = \sum _{j=1}^P {{\varvec{{\varSigma }}}_{{\varvec{s}}}^{(j)}}^{-1} \;,\; {{\varvec{\mu }}_{{\varvec{s}}}} = {\varvec{{\varSigma }}}_{{\varvec{s}}} \sum _{j=1}^P {{\varvec{{\varSigma }}}_{{\varvec{s}}}^{(j)}}^{-1} {\varvec{\mu }}_{{\varvec{s}}}^{(j)}, \end{aligned}$$

whose intermediary variables \({\varvec{\mu }}_{{\varvec{s}}}\) and \({\varvec{{\varSigma }}}_{{\varvec{s}}}\) corresponds to the Gaussian resulting from the product of Gaussians with centers \({\varvec{\mu }}_{{\varvec{s}}}^{(j)}\) and covariances \({\varvec{{\varSigma }}}_{{\varvec{s}}}^{(j)}\). Namely, solving

$$\begin{aligned} {\varvec{\hat{U}}} = \arg \underset{{\varvec{U}}}{\min } \sum _{j=1}^P {\Big ({\varvec{\mu }}_{{\varvec{s}}}^{(j)}-{\varvec{\zeta }}\Big )}^{{\scriptscriptstyle \top }} {{\varvec{{\varSigma }}}_{{\varvec{s}}}^{(j)}}^{-1} \Big ({\varvec{\mu }}_{{\varvec{s}}}^{(j)}-{\varvec{\zeta }}\Big ) + {\varvec{U}}^{\scriptscriptstyle \top }{\varvec{\tilde{R}}} {\varvec{U}} \end{aligned}$$

is equivalent to the two step optimization process

$$\begin{aligned} {\varvec{\mu }}_{{\varvec{s}}}&= \arg \underset{{\varvec{\zeta }}}{\min } \sum _{j=1}^P {\Big ({\varvec{\mu }}_{{\varvec{s}}}^{(j)}-{\varvec{\zeta }}\Big )}^{\scriptscriptstyle \top }{{\varvec{{\varSigma }}}_{{\varvec{s}}}^{(j)}}^{-1} \Big ({\varvec{\mu }}_{{\varvec{s}}}^{(j)}-{\varvec{\zeta }}\Big ), \\ {\varvec{\hat{U}}}&= \arg \underset{{\varvec{U}}}{\min } {\big ({\varvec{\mu }}_{{\varvec{s}}}-{\varvec{\zeta }}\big )}^{\scriptscriptstyle \top }{\varvec{{\varSigma }}}_{{\varvec{s}}}^{-1} \big ({\varvec{\mu }}_{{\varvec{s}}}-{\varvec{\zeta }}\big ) \;+\; {\varvec{U}}^{\scriptscriptstyle \top }{\varvec{\tilde{R}}} {\varvec{U}}, \end{aligned}$$

showing that the combination of TP-GMM with LQR corresponds to an optimal controller acting in multiple coordinate systems.

Fig. 9
figure 9

Learning of two behaviors with the Baxter robot at Idiap. The taught tasks consist of holding a cup horizontally with one hand and holding a sugar cube above the cup with the other hand. The demonstrations are provided in two steps by kinesthetic teaching, namely by holding the arms of the robot and moving them through the task while the robot compensates for the effect of gravity on its limbs. This procedure allows the user to move the robot arms without feeling their weight and without feeling the motors in the articulations, while the sensors are used to record the movement. Here, the data are recorded in several frames of reference (top image). During reproduction, the robot is controlled by following a minimal intervention principle, where the impedance parameters of the robot (stiffness and damping of a virtual spring pulling the robot arms) are automatically set in accordance to the extracted variation and coordination patterns. First sequence brief demonstration to show the robot how to hold a cup horizontally. Second sequence brief demonstration to show how to hold a sugar cube above the cup. Third sequence manual displacement of the left arm to test the learned behavior (the coordination of the two hands is successfully retrieved). Last sequence combination of the two learned tasks within a minimal intervention controller. Here, the user pushes the robot to show that the robot remains soft for perturbations that do not conflict with the acquired task constraints (automatic exploitation of the redundant degrees of freedom that do not conflict with the task)

Table 3 Task parameters as candidate projection operators (with affine transformations defined by \({\varvec{A}}_{t,j}\) and \({\varvec{b}}_{t,j}\))

The problem can be viewed as a form of inverse optimal control (IOC) [1, 2, 24], or more precisely, as a rudimentary form of IOC which can be solved analytically. Namely, it can provide a controller without exploratory search, at the expense of being restricted to simple forms of objectives (weighted sums of quadratic errors whose weights are learned from the demonstrations). This dual view can be exploited in further research to bridge action-level and goal-level imitation or to provide better initial estimates in IOC problems.

Figure 9 shows that a TP-GMM with a single Gaussian, combined with a minimal intervention controller based on LQR, can be used to encode and retrieve various behaviors.

figure g

7 Extension to task parameters in the form of projection constraints

Thus far, this tutorial considered problems in which the task parameters were related to position, orientation or shape of objects in a Cartesian space. However, the use of TP-GMM is not limited can be extended to other forms of locally linear transformations or projections. The consideration of non square \({\varvec{A}}_{t,j}\) matrices is, for example, relevant for the consideration of soft constraints in both configuration and operational spaces (through Jacobian operators), see [15] for a preliminary work in this direction.

It can also provide a principled way to learn nullspace constraints in a probabilistic form. The different frames correspond in this case to various candidate subspace projections of the movement, with statistical information extracted from the different projections.

An important and challenging category of applications includes the problems requiring priority constraints [37, 58, 78, 96, 104]. Such constraints can be learned and encoded within a TP-GMM from an initial set of task hierarchies given as potential candidates to describe the observed skill. The probabilistic encoding is exploited here to discover in which manner the subtasks are prioritized.

For a controller handling constraints both in configuration and operational spaces, some of the most common candidate projection operators are presented in Table 3, covering a very wide range of robotics applications.

Note here that the Gaussian product is computed in configuration space (\({\varvec{q}}\) and \({\varvec{x}}\) represent, respectively, poses in joint space and task space). Equation (39) describes joint space constraints in a fixed frame. It corresponds to the canonical frame defined by \({\varvec{A}}_{t,j}={\varvec{I}}\) (identity matrix) and \({\varvec{b}}_{t,j}={\varvec{0}}\). Equation (40) describes absolute position constraints (in operational space), where \({\varvec{J}}^{\dagger }\) is the Jacobian pseudoinverse used as least-norm inverse kinematics solution. Note that Eq. (40) describes a moving frame, where the task parameters change at each iteration (observation of a changing pose in configuration space). Equation (41) describes relative position constraints, where the constraint in task space is related to an object described at each time step t by a position \({\varvec{b}}^{\scriptscriptstyle {{\mathcal {O}}}}_t\) and an orientation matrix \({\varvec{A}}^{\scriptscriptstyle {{\mathcal {O}}}}_t\) in task space. Equation (42) describes nullspace/priority constraints in joint space, with \({\varvec{N}}={\varvec{I}}-{\varvec{J}}^{\dagger }{\varvec{J}}\) a nullspace projection operator. Equation (43) describes absolute position nullspace/priority constraints, where the secondary objective is described in task space (for a point in the kinematic chain with corresponding Jacobian \({\varvec{\tilde{J}}}\)). Finally, Eq. (44) describes relative position nullspace/priority constraints.

Fig. 10
figure 10

Example of TP-GMM with constraints in both joint space (frame 1) and task space (frames 2–4). The task consists of pointing at two objects (red and blue) and then coming back to a neutral pose. Left the demonstrations show a preferred posture employed to complete the task (first robot link oriented to the right and the other two links oriented upwards). Right reproduction attempts by synthesizing a motion for the newly encountered situations (new position of blue and red objects), which is achieved through GMR. Each row in the graph shows the configuration at a different time step of the movement. We can see that the generated motion satisfies the demonstrated constraints (pointing sequentially at the two objects while trying to maintain a preferred configuration in joint space). Note here that the motion is generated in an online manner, which allows the robot to handle objects that are moving during the execution of the pointing gestures (color figure online)

The above equations can be retrieved without much effort by discretizing (with an Euler approximation) the standard inverse kinematics and nullspace control relations that can be found in most robotics textbooks, see, e.g., [5].

Figure 10 shows an example of constraints in configuration and operational spaces. The task requires the robot to consider, in parallel and in series, constraints in task space (pointing to objects) and in joint space (maintaining a preferred posture). The frame in joint space is defined by Eq. (42), and the frames in task space are defined by Eq. (41) for the two objects and by Eq. (40) for the motion in the robot frame, with \({\varvec{x}}\) encoding Euler angles and \({\varvec{J}}\) describing the orientation part of the end-effector Jacobian.

Figure 11 presents a TP-GMM example with task parameters taking the form of nullspace bases. The frames are defined by Eqs. (41) and (44) with two different combinations of nullspaces and Jacobians corresponding to the left and right arm.

figure h

8 Comparisons with other task-adaptive approaches

This section aims at comparing different techniques to adapt movements to new situations, using the task adaptation problem depicted in Figs. 1 and 2.

Qualitative comparisons for the task-parameterized approaches presented throughout the article are presented in Figs. 12 and 13.

Figure 12 shows that GMM/GMR MFA/GMR and trajectory-GMM could all retrieve suitable trajectories for each of the new situations.

Figure 13 shows that the proposed task-parameterized approach can also be employed with data-driven encoding strategies, at the expense of a slower reproduction process that depends on the number of demonstrations. For this simple dataset, a kernel-based approach (first row) generated new trajectories that are similar to the trajectories generated by the weighted least-squares approach (second row).

figure i

The same task adaptation problem was also tested with two baseline approaches that are described next.

8.1 Gaussian process regression (GPR) with trajectory models

An alternative way of handling task-parameterized movements is to fit a model to each demonstration and associate it with a task-specific feature, goal, style variable or perceptual feedback, see, for example, [21, 29, 44, 48, 51, 60, 97].

Fig. 11
figure 11

Illustration of the encoding of priority constraints in a TP-GMM. The top row shows 3 demonstrations with a bimanual robot composed of 5 articulations (the color goes from light gray to black to show the evolution of the movement). The task consists of tracking two objects with the left and right end-effectors (the path of the objects are depicted in red). In some parts of the demonstrations, the two objects could not be reached, and the demonstrator either made a compromise (left graph) or gave priority to the left or right hand (middle and right graphs). The bottom row shows reproduction attempts for new trajectories of the two objects. We can see that, although faced with different situations, the priority constraints are reproduced similarly to the corresponding demonstrations (color figure online)

Fig. 12
figure 12

Generalization capability of three model-based task-parameterized approaches. Each row shows the results for a different approach, and each column shows a different situation (with increasing generalization complexity). In each graph, the four demonstrations and the associated adapted model parameters are depicted in semi-transparent colors

Fig. 13
figure 13

Generalization capability of two data-driven task-parameterized approaches. Each row shows the results for a different approach, and each column shows a different situation (with increasing generalization complexity). In each graph, the four demonstrations and the associated adapted model parameters are depicted in semi-transparent colors

Fig. 14
figure 14

Generalization capability of three alternative approaches to task parameterization. Each row shows the results for a different approach, and each column shows a different situation (with increasing generalization complexity). In each graph, the four demonstrations and the associated adapted model parameters are depicted in semi-transparent colors

Such approach is typically better suited for task parameters that do not change during the demonstration. It can be achieved in a non-parametric or parametric way, by either encoding the relationships between the raw trajectory variables and the task parameters (first row in Fig. 14) or by encoding the relationships between the trajectory model parameters and the task parameters (second row in Fig. 14, see also Fig. 1).

In this second approach, the output variables \({\varvec{{\varTheta }}}\) are the model parameters and the query points \({\varvec{Q}}\) are the task parameters. The reproduction step consists of retrieving new model parameters \({\varvec{{\varTheta }}}^d\) from new task parameters \({\varvec{Q}}^d\), which can, for example, be achieved with Gaussian process regression (GPR) [75]. After centering the training data, the joint distribution of the demonstrated and new outputs can be estimated as

$$\begin{aligned} \begin{bmatrix}{\varvec{{\varTheta }}}\\ {\varvec{{\varTheta }}}^{d} \end{bmatrix} = \mathcal {N}\left( {\varvec{0}}, \begin{bmatrix}{\varvec{K}}({\varvec{Q}},{\varvec{Q}}) + \sigma ^2 {\varvec{I}}&{\varvec{K}}({\varvec{Q}},{\varvec{Q}}^d)\\ {\varvec{K}}({\varvec{Q}}^d,{\varvec{Q}})&{\varvec{K}}({\varvec{Q}}^d,{\varvec{Q}}^d) \end{bmatrix} \right) , \end{aligned}$$
(45)

where \({\varvec{Q}}\) is the concatenation of query points \({\varvec{q}}_m\), and \({\varvec{{\varTheta }}}\) the concatenation of outputs \({\varvec{\theta }}_m\), with \(m\in \{1,\ldots ,M\}\) (M is the number of demonstrations). Squared exponential covariance functions \({\varvec{K}}\) are considered here.

By using the conditional probability property of normal distributions, the expected outputs \({\varvec{\hat{{\varTheta }}}}\) associated with the new query points \({\varvec{Q}}^d\) are given by

$$\begin{aligned} {\varvec{\hat{{\varTheta }}}} = {\varvec{K}}({\varvec{Q}}^d,{\varvec{Q}}) \; [ {\varvec{K}}({\varvec{Q}},{\varvec{Q}}) + \sigma ^2 {\varvec{I}} ]^{-1} {\varvec{{\varTheta }}}, \end{aligned}$$
(46)

with the covariance of the prediction given by

$$\begin{aligned} {\varvec{\hat{{\varSigma }}}}^{{\varvec{{\varTheta }}}}= & {} {\varvec{K}}({\varvec{Q}}^d, {\varvec{Q}}^d) - {\varvec{K}}({\varvec{Q}}^d,{\varvec{Q}}) [ {\varvec{K}}({\varvec{Q}},{\varvec{Q}})\nonumber \\&+\, \sigma ^2 {\varvec{I}}]^{-1} {\varvec{K}}({\varvec{Q}},{\varvec{Q}}^d). \end{aligned}$$
(47)

The above formulation can be used with various trajectory models. For a fair comparison, GMM encoding of trajectories was considered, with GMR used to regenerate the trajectories. Thus, a GMM \({\varvec{\theta }}_m=\{\pi _{i,m},{\varvec{\mu }}_{i,m},{\varvec{{\varSigma }}}_{i,m}\}_{i=1}^K\) is fit to each demonstration \({\varvec{\xi }}_m\), with associate query point \({\varvec{q}}_m = \{{\varvec{A}}_{m,j}, {\varvec{b}}_{m,j}\}_{j=1}^P\) describing the demonstration context. After all demonstrations are collected, \([ {\varvec{K}}({\varvec{Q}},{\varvec{Q}}) + \sigma ^2 {\varvec{I}} ]^{-1} {\varvec{{\varTheta }}}\) in Eqs. (45), (46) can be precomputed.

During reproduction, a new query point \({\varvec{Q}}^d\) with \({\varvec{Q}}^d = \{{\varvec{A}}_{j}, {\varvec{b}}_{j}\}_{j=1}^P\) is used to retrieve a new GMM using Eq. (46), which is then used to retrieve a new trajectory through GMR.

The first and second row of Fig. 14 show generalization results for the use of GPR with a non-parametric (first row) and parametric (second row) encoding of the data. Although GPR can easily handle situations within the range of the demonstrations, its generalization capability can degrade if the query points are too far from the demonstrations (it collapses to an average of the models), which is confirmed by the results of the experiment.

figure j

8.2 Parametric HMM/GMM (PHMM/PGMM)

Another approach to handle task-parameterized movements is to encode all demonstrations in a single mixture model, where each cluster in the mixture learns the relations between the task parameters and the motion. The parametric hidden Markov model (PHMM) is a representative approach in this category. It was originally introduced for recognition and prediction of gestures [102] and extended in robotics to movement generation [52, 105]. We will refer to a parametric Gaussian mixture model (PGMM) when the transition and initial state probabilities are not taken into account in the likelihood estimation.Footnote 7

The original model modulates in each cluster the center of a Gaussian distribution through an affine relationship with the task parameters. This is achieved by concatenating the task parameters in a vector \({\varvec{Q}}_t\) as in the above, and defining the centers of the Gaussians in a task-adaptive manner with

$$\begin{aligned} {\varvec{\mu }}_{t,i} = {\varvec{Z}}_i \; {\big [{\varvec{Q}}_t^{\scriptscriptstyle \top },\; 1\big ]}^{\scriptscriptstyle \top }, \end{aligned}$$
(48)

where \({\varvec{\tilde{Z}}}_i\) describe the model parameters to be estimated. In the case of affine relationships, this can be done with EM, see “Appendix 4” for details. The other parameters of the model are estimated as the standard GMM formulation.

After having trained the model, each new set of task parameters concatenated in \({\varvec{Q}}_t\) will provide new Gaussian centers \({\varvec{\mu }}_{t,i}\) in the GMM using Eq. (48), where GMR can then be used to retrieve a new trajectory.

The last row of Fig. 14 shows generalization results with PGMM. The main drawback of this model is that only the centers of the Gaussians are adapted to the current situation. The covariances are estimated as constant matrices \({\varvec{{\varSigma }}}_i\), estimated with the standard EM procedure for GMM. This limitation is confirmed in the experiment, where EM often converged to local optima that were unable to extract the underlying structures of the task for the encoding of continuous movements.

figure k

9 Discussion and future work

Recent service robots are provided with numerous and/or high-resolution sensors and actuators. This increase of dimensionality is problematic in applications where the sample size remains bounded by the cost of data acquisition. There is a non-negligible number of applications that would require models capable of targeting wide-ranging data. Namely, models that could start learning from a small number of demonstrations, while still being able to continue learning once more data become available. Robot learning from demonstration is one such field, whose learning challenge often requires the design of appropriate domain-specific priors to ensure that generalization can be achieved from small training sets.

We showed throughout this article that an efficient and versatile prior knowledge for task adaptation is to consider that the task parameters describing the current situation (body and workspace configuration that the robot encounters) can be represented as affine transformations (including frames of reference, coordinate systems or projections).

The above prior requires the experimenter to provide the robot with a set of candidate frames that could potentially be relevant for the task. We showed that the representation as affine transformations has a simple interpretation, can be easily implemented and remains valid for a wide range of skills that a service robot can encounter. It was shown in [4] that when frames are missing during reproduction (e.g., when occlusions occur or when frames are collected at different rates), the system is still able to reproduce an appropriate behavior for the current circumstance.

A limitation of the current TP-GMM approach is that it requires the experimenter to provide an initial set of frames that will act as candidate projections/transformations of the data that might potentially be relevant for the task. The number of frames can be overspecified by the experimenter (e.g., by providing an exhaustive list), but it comes at the expense of requiring more demonstrations to obtain sufficient statistics to discard the frames that have no role in the task. The problem is not different from the problem of selecting the variables that will form the feature vectors fed to a learning process. The only difference lies in the selection of frames in the form of affine transformations that are most often associated with easily interpretable behaviors.

In practice, the experimenter selects objects and locations in the robot kinematic chain that might be relevant for the task, which are typically the end-effectors of the robot, where tools, grippers or parts in contact with the environment are mounted, see also discussion in [45].Footnote 8 The issue of predefining an initial set of frames is not restrictive when the number of frames remains reasonably low (e.g., when they come from a set of predefined objects tracked with visual markers in a lab setting). However, for perception in unconstrained environment, the number of frames could grow quickly (e.g., detection of phantom objects), while the number of demonstrations remains low. Further work is thus required to detect redundant frames or remove irrelevant frames, as well as to automatically determine in which manner the frames are coordinated with each other and locally contribute to the achievement of the task. A promising route for further investigation is to exploit the recent developments in multilinear algebra and tensor analysis [49, 85] that exploit the multivariate structure of the data for statistical analysis and compression, without transforming it to a matrix form (by processing data jointly in spatial and spectral ways, instead of flattening the higher-order tensor dataset).

In service robotics, movements often need to be expressed simultaneously in multiple coordinate systems and are stored as multidimensional arrays (tensor-variate data). Multilinear algebra could thus provide a principled method to simultaneously extract eigenframes, eigenposes and eigentrajectories. Multiway analysis of tensor-variate data offers a rich set of data decomposition techniques, whose advantage has been demonstrated in computer imaging fields such as face processing [98], video analysis [107], geoscience [76] or neuroimaging [8], but which remains an uncharted territory in robotics and motor skills learning.

Another open route for further investigation concerns the use of a richer set of task parameters. A wide range of motor skills could potentially be adapted to this framework, by exploiting the functional nature of task parameters to build models that learn the local structure of the task from a low number of demonstrations. Indeed, most task parameterization in robot control can be related to some form of frames of reference, coordinate systems, projections or basis functions, where the involvement of the frames can change during the execution of the task, with transformations represented as locally linear projection operators (e.g., Jacobians for inverse kinematics, kernel matrix for nullspace projections, etc.). A wider range of robot skills could be defined in such way, see, e.g., the possible tasks described in §6.2.1 of [5].

The potential applications are diverse, with an objective in line with the original purpose of motor primitives to be composed together serially or in parallel [28]. TP-GMM could potentially be employed as a tool to revisit existing robotics techniques in a probabilistic form. This includes the consideration of soft constraints in both configuration and operational spaces, where the frames would correspond to different subspace projections of the same movement, with the extracted regularities employed to learn bimanual tasks or whole-body movements.

One of the important next challenges in robot learning will be to extend the concept of movement primitives to a broader range of behaviors including impedance, reaction and collaboration primitives.

10 Conclusion

In service robotics, movements often need to be modulated by external parameters describing the current situation (body and workspace configuration that the robot encounters). This tutorial showed that in many cases, the task parameters can be represented as affine transformations. Based on this prior assumption, a task-parameterized model was presented by exploiting this structure to learn a skill from a small number of demonstrations.

The proposed approach was implemented and tested with various statistical encoding strategies, including standard mixture models, kernel approaches and subspace clustering methods. It was shown that a wide variety of problems in robotics can be reinterpreted by introducing such relation between the task parameters and the model parameters. The approach was demonstrated in a series of control and planning problems in operational and configuration spaces. Each section of the article was accompanied by source codes to help the practitioners study, test and extend the proposed approach.