
1 Introduction

One of the hallmarks of human intelligence is the ability to predict the future based on past observations. Through perceiving and forecasting how the environment evolves and how a fellow human acts, a human learns to interact with the world [60]. Remarkably, humans acquire such a prediction ability from just a few experiences, and it nevertheless generalizes across different scenarios [50]. Similarly, to allow natural and effective interaction with humans, artificial agents (e.g., robots) should be able to do the same, i.e., forecast how a human moves or acts in the near future conditioned on a series of historical movements [29]. As a more concrete example illustrated in Fig. 1, when deployed in natural environments, robots are supposed to predict unfamiliar actions after seeing only a few examples [20, 27]. While human motion prediction has attracted increasing attention [9, 16, 19, 26, 32], the existing approaches rely on extensive annotated motion capture (mocap) data and are brittle to novel actions.

Fig. 1.

Illustration of the importance of few-shot human motion prediction as a first step towards seamless human-robot interaction and collaboration. In real-world scenarios, prediction typically happens in an on-line, streaming manner with limited training data. Specifically, a robot has acquired a general-purpose prediction ability, e.g., through learning on several known action classes using our meta-learning approach. The robot is then deployed in a natural environment. Now a person performs a never-before-seen action, e.g., greeting, while the robot is watching (a). The person then stops, and the robot has no sensory inputs, which is illustrated by blinding its eyes with a sheet of paper (b). The robot adapts the generic initial model into a task-specific predictor, predicts the future motion of the person, and performs or demonstrates it in a human-like, realistic way (c) and (d).

We believe that the significant gap between human and machine prediction arises from two issues. First, motion dynamics are difficult to model because they entangle physical constraints with goal-directed behaviors [32]. Beyond a few action classes (e.g., walking) [8, 22], it is challenging to derive sophisticated physical models for general types of motion [42]. Second, there is a lack of large-scale, annotated motion data. Current mocap datasets are constructed in dedicated, sensor-equipped environments and are therefore not scalable. This motivates the exploration of motion models learned from limited training data. Unfortunately, the state-of-the-art deep recurrent encoder-decoder models [4, 16, 18, 19, 26, 32] require a substantial amount of annotated data to learn the desired motion dynamics. Stark evidence of this is that a constant pose predictor [32], a naïve approach that does not produce interesting motion, sometimes achieves the best performance. An attractive solution is learning a “basis” of underlying knowledge that is shared across a wide variety of action classes, including never-before-seen actions. In principle, this can be achieved by transfer learning [3, 38, 44, 68], which fine-tunes a network pre-trained on another task with more labeled data; nevertheless, the benefit of pre-training decreases as the source task diverges from the target task [70].

Here we make the first attempt towards few-shot human motion prediction. Inspired by the recent progress on few-shot learning and meta-learning [14, 47, 58, 61, 66], we propose a general meta-learning framework—proactive and adaptive meta-learning (PAML), which can be applied to human motion prediction. Our key insight is that good generalization from few examples relies on both a generic initial model and an effective strategy for adapting this model to novel tasks. We then introduce a novel combination of the state-of-the-art model-agnostic meta-learning (MAML) [14] and model regression networks (MRN) [66, 69], and unify them into an integrated, end-to-end framework. MAML enables the meta-learner to aggregate contextual information from various prediction tasks and thus produces a generic model initialization, while MRN allows the meta-learner to adapt a few-shot model and thus improves its generalization.

More concretely, a beneficial common initialization would serve as a good starting point for training on a novel action. This can be accomplished by explicitly learning the initial parameters of a predictor model such that the model achieves maximal performance on a new task after the parameters have been updated with a few training examples from that task. Hence, we make use of MAML [14], which initializes the weights of a network so that standard stochastic gradient descent (SGD) can make rapid progress on a new task. We learn this initialization through a meta-learning procedure over a large set of motion prediction tasks with small amounts of data. After obtaining the pre-trained model, MAML uses one or a few SGD updates to adapt it to a novel task. Although the initial model is somewhat generic, plain SGD updates can only slightly modify its parameters [68], especially in the small-sample-size regime; larger updates would lead to severe over-fitting to the new data [23]. This is still far from satisfactory, because the resulting task-specific model differs from the one that would be learned from a large set of samples.

To address this limitation, we consider meta-learning approaches that learn an update function or learning rule. Specifically, we leverage MRN [66, 69] as the adaptation strategy, which describes a method for learning from small datasets through estimating a generic model transformation. That is, MRN learns a meta-level network that operates on the space of model parameters, which is trained to regress many-shot model parameters (trained on large datasets) from few-shot model parameters (trained on small datasets). While MRN was developed in the context of convolutional neural networks, we extend it to recurrent neural networks. By unifying MAML with MRN, our resulting PAML model is not only directly initialized to produce the desired parameters that are useful for later adaptation, but it can also be effectively adapted to novel actions through exploiting the structure of model parameters shared across action classes.

Our contributions are three-fold. (1) To the best of our knowledge, this is the first time the few-shot learning problem for human motion prediction has been explored. We show how meta-learning can be operationalized for such a task. (2) We present a novel meta-learning approach, combining MAML with MRN, that jointly learns a generic model initialization and an effective model adaptation strategy. Our approach is general and can be applied to different tasks. (3) We show how our approach significantly facilitates the prediction of novel actions from few examples on the challenging mocap H3.6M dataset [25].

2 Related Work

Human motion prediction has great application potential in computer vision and robotic vision, including human-robot interaction and collaboration [29], motion generation for computer graphics [30], action anticipation [24, 28], and proactive decision-making in autonomous driving systems [37]. It is typically addressed by state-space equations and latent-variable models. Traditional approaches focus on hidden Markov models [7], linear dynamic models [41], Gaussian process latent variable models [59, 62], bilinear spatio-temporal basis models [1], and restricted Boltzmann machines [52, 53, 54, 55]. In the deep learning era, recurrent neural networks (RNNs) based approaches have attracted more attention and significantly pushed the state of the art [16, 18, 19, 26, 32].

Flagship techniques include LSTM-3LR and ERD [16], SRNNs [26], and residual sup. [32]. LSTM-3LR (3-layer long short-term memory network) learns pose representations and temporal dynamics simultaneously via curriculum learning [16]. In addition to the concatenated LSTM units of LSTM-3LR, ERD (encoder-recurrent-decoder) further introduces non-linear space encoders for data preprocessing [16]. SRNNs (structural RNNs) model human activity with a hand-designed spatio-temporal graph and introduce the encoded semantic knowledge into recurrent networks [26]. These approaches fail to consider the knowledge shared across action classes; they thus learn action-specific models and restrict training to the corresponding subsets of the mocap dataset. Residual sup. is a simple sequence-to-sequence architecture with a residual connection, which incorporates action class information via one-hot vectors [32]. Despite their promise, these existing methods learn directly on the target task with large amounts of training data and cannot generalize well from a few examples or to novel action classes. There has been little work on few-shot motion prediction like ours, which is crucial for robot learning in practice. Our task is also significantly different from few-shot imitation learning: while that line of work aims to learn and mimic human motion from demonstrations [11, 15, 39, 71], our goal is to predict unseen future motion based on historical observations.

Few-shot or low-shot learning has long stood as one of the fundamental unsolved problems and has been addressed from different perspectives [13, 14, 17, 21, 45, 56, 61, 63, 64, 65, 66, 67]. Our approach falls into a classic yet recently revived class of approaches, termed meta-learning, that frames few-shot learning itself as a “learning-to-learn” problem [47, 57, 58]. The idea is to use the common knowledge captured across a set of few-shot learning tasks during meta-training for a novel few-shot learning problem, in a way that (1) accumulates statistics over the training set using RNNs [61], memory-augmented networks [45], or multilayer perceptrons [12], (2) produces a generic network initialization [14, 36, 65], (3) embeds examples into a universal feature space [51], (4) estimates the model parameters that would be learned from a large dataset using a few novel-class examples [6] or from a small-dataset model [66, 69], (5) modifies the weights of one network using another [46, 48, 49], or (6) learns to optimize through a learned update rule instead of hand-designed SGD [2, 31, 43].

Often, these prior approaches are developed with image classification in mind and cannot be easily re-purposed to handle different model architectures or readily applied to other domains such as human motion prediction. Moreover, they aim either to obtain a better model initialization [14, 36, 65] or to learn an update function or learning rule [2, 5, 43, 48, 66], but not both. By contrast, we present a unified view that takes both aspects into consideration and show how they complement each other in an end-to-end meta-learning framework. Our approach is also general and can be applied to other tasks as well.

3 Proactive and Adaptive Meta-learning

We now present our meta-learning framework for few-shot human motion prediction. The predictor (i.e., learner) is a recurrent encoder-decoder network, which frames motion prediction as a sequence-to-sequence problem. To enable the predictor to rapidly produce satisfactory prediction from just a few training sequences for a novel task (i.e., action class), we introduce proactive and adaptive meta-learning (PAML). Through meta-learning from a large collection of few-shot prediction tasks on known action classes, PAML jointly learns a generic model initialization and an effective model adaptation strategy.

3.1 Meta-learning Setup for Human Motion Prediction

Human motion is typically represented as sequential data. Given a historical motion sequence, we predict possible motion in the short-term or long-term future. In few-shot motion prediction, we aim to train a predictor model that can quickly adapt to a new task using only a few training sequences. To achieve this, we introduce a meta-learning mechanism that treats entire prediction tasks as training examples. During meta-learning, the predictor is trained on a set of prediction tasks guided by a high-level meta-learner, such that the trained predictor can accomplish the desired few-shot adaptation ability.

The predictor (i.e., learner), represented by a parametrized function \(\mathcal {P}_\theta \) with parameters \(\theta \), maps an input historical sequence \(\mathbf {X}\) to an output future sequence \(\mathbf {\widehat{Y}}\). We denote the input motion sequence of length n as \(\mathbf {X}=\left\{ \mathbf {x}^1,\mathbf {x}^2,\ldots ,\mathbf {x}^n\right\} \), where \(\mathbf {x}^i\in \mathbb {R}^d, i=1,\ldots ,n\) is a mocap vector consisting of a set of 3D body joint angles [35], and d is the number of joint angles. The learner predicts the future sequence \(\mathbf {\widehat{Y}}=\left\{ \mathbf {\widehat{x}}^{n+1}, \mathbf {\widehat{x}}^{n+2},\ldots ,\mathbf {\widehat{x}}^{n+m}\right\} \) in the next m timesteps, where \(\mathbf {\widehat{x}}^j\in \mathbb {R}^d, j=n+1,\ldots ,n+m\) is the predicted mocap vector at the j-th timestep. The groundtruth of the future sequence is denoted as \(\mathbf {Y}^{gt}=\left\{ \mathbf {x}^{n+1},\mathbf {x}^{n+2},\ldots ,\mathbf {x}^{n+m}\right\} \).

During meta-learning, we are interested in training a learning procedure (i.e., the meta-learner) that enables the predictor model to adapt to a large number of prediction tasks. For the k-shot prediction task, each task \(\mathcal {T} = \left\{ \mathcal {L}, \mathcal {D}_\text {train}, \mathcal {D}_\text {test}\right\} \) aims to predict a certain action from a few (k) examples. It consists of a loss function \(\mathcal {L}\), a small training set \(\mathcal {D}_\text {train}=\left\{ \left( \mathbf {X}_u,\mathbf {{Y}}_u^{gt}\right) \right\} , u=1,\ldots ,k\) with k action-specific past and future sequence pairs, and a test set \(\mathcal {D}_\text {test}\) that has a set number of past and future sequence pairs for evaluation. A frame-wise Euclidean distance is commonly used as the loss function \(\mathcal {L}\) for motion prediction. For each task, the meta-learner takes \(\mathcal {D}_\text {train}\) as input and produces a predictor (i.e., learner) that achieves high average prediction performance on its corresponding \(\mathcal {D}_\text {test}\).

More precisely, we consider a distribution \(p\left( \mathcal {T}\right) \) over prediction tasks that we want our predictor to be able to adapt to. Meta-learning algorithms have two phases: meta-training and meta-test. During meta-training, a prediction task \(\mathcal {T}_i\) is sampled from \(p\left( \mathcal {T}\right) \), and the predictor \(\mathcal {P}\) is trained on its corresponding small training set \(\mathcal {D}_\text {train}\) with the loss \(\mathcal {L}_{\mathcal {T}_i}\) from \({\mathcal {T}_i}\). The predictor is then improved by considering how the test error on the corresponding test set \(\mathcal {D}_\text {test}\) changes with respect to the parameters. This test error serves as the training error of the meta-learning process. During meta-test, a held-out set of prediction tasks drawn from \(p\left( \mathcal {T}\right) \) (i.e., novel action classes), each with its own small training set \(\mathcal {D}_\text {train}\) and test set \(\mathcal {D}_\text {test}\), is used to evaluate the performance of the predictor.
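To make the task construction concrete, the following is a minimal sketch of how a k-shot prediction task \(\mathcal {T} = \left\{ \mathcal {L}, \mathcal {D}_\text {train}, \mathcal {D}_\text {test}\right\} \) and its frame-wise Euclidean loss could be instantiated. The helper names (`sample_task`, `sequence_pool`) and the test-set size are illustrative assumptions, not part of the paper's released code.

```python
import random
import torch

def split_clip(clip, n_past=50, n_future=10):
    """Split one mocap clip of shape (n_past + n_future, d) into (X, Y_gt)."""
    return clip[:n_past], clip[n_past:n_past + n_future]

def sample_task(sequence_pool, k=5, n_test=8, n_past=50, n_future=10):
    """Sample one k-shot prediction task from a pool of action-specific clips.

    sequence_pool: dict mapping action name -> list of (n_past + n_future, d) tensors.
    Returns the action name, a k-sized D_train, and an n_test-sized D_test,
    each a list of (X, Y_gt) pairs.
    """
    action = random.choice(list(sequence_pool.keys()))
    clips = random.sample(sequence_pool[action], k + n_test)
    pairs = [split_clip(c, n_past, n_future) for c in clips]
    return action, pairs[:k], pairs[k:]

def frame_wise_loss(pred, target):
    """Frame-wise Euclidean distance between predicted and groundtruth frames (the loss L)."""
    return torch.norm(pred - target, dim=-1).mean()
```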

3.2 Learner: Encoder-Decoder Architecture

We use the state-of-the-art recurrent encoder-decoder motion predictor in [32] as our learner \(\mathcal {P}\). The encoder and decoder use GRU (gated recurrent unit) [10] cells as building blocks. The input sequence is passed through the encoder to infer a latent representation. This latent representation and a seed motion frame are then fed into the decoder to output the prediction for the first timestep. The decoder takes its own output as the input for the next timestep and generates further predictions sequentially. Unlike [32], to deal with novel action classes, we do not use one-hot vectors to indicate the action class.
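As a rough illustration of this architecture, the sketch below implements a residual GRU sequence-to-sequence predictor in the spirit of [32]; the joint-angle dimensionality `d_joint`, the unbatched interface, and other details are assumptions for readability rather than a faithful reproduction.

```python
import torch
import torch.nn as nn

class ResidualGRUPredictor(nn.Module):
    """Sketch of a residual sequence-to-sequence motion predictor (cf. [32])."""

    def __init__(self, d_joint=54, d_hidden=1024):
        super().__init__()
        # A single GRU cell shared (tied) between encoding and decoding.
        self.cell = nn.GRUCell(d_joint, d_hidden)
        self.out = nn.Linear(d_hidden, d_joint)

    def forward(self, past, n_future):
        # past: (n_past, d_joint) historical mocap frames.
        h = past.new_zeros(1, self.cell.hidden_size)
        for frame in past:                       # encode the history
            h = self.cell(frame.unsqueeze(0), h)
        prev = past[-1].unsqueeze(0)             # seed motion frame
        preds = []
        for _ in range(n_future):                # decode autoregressively
            h = self.cell(prev, h)
            prev = prev + self.out(h)            # residual connection: predict a delta
            preds.append(prev.squeeze(0))
        return torch.stack(preds)                # (n_future, d_joint)
```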

3.3 Proactive Meta-learner: Generic Model Initialization

Intuitively, if we had a universal predictor that is broadly applicable to a variety of tasks in \(p\left( \mathcal {T}\right) \) rather than to a specific task, it would serve as a good starting point for training on a novel target task. We explicitly learn such a general-purpose initial model using model-agnostic meta-learning (MAML) [14]. MAML is developed for gradient-based learning rules (e.g., SGD) and aims to learn a model such that a few SGD updates can make rapid progress on a new task.

Concretely, when adapting to a new task \(\mathcal {T}_i\), the initial parameters \(\theta \) of the predictor become \(\theta '_i\). In MAML, this is computed using one or more SGD updates on \(\mathcal {D}_\text {train}\) of task \(\mathcal {T}_i\). For the sake of simplicity and without loss of generality, we consider one SGD update:

$$\begin{aligned} \theta _i'=\theta -\alpha \nabla _{\theta }\mathcal {L}_{\mathcal {T}_i}\left( \mathcal {P}_\theta \right) , \end{aligned}$$
(1)

where \(\alpha \) is the learning rate hyper-parameter. We optimize \(\theta \) such that the updated \(\theta '_i\) will produce maximal performance on \(\mathcal {D}_\text {test}\) of task \(\mathcal {T}_i\). When averaged across the tasks sampled from \(p\left( \mathcal {T}\right) \), we have the meta-objective function:

$$\begin{aligned} \min _\theta \sum _{\mathcal {T}_i\sim p(\mathcal {T})}\mathcal {L}_{\mathcal {T}_i}\left( \mathcal {P}_{\theta _i'}\right) =\min _\theta \sum _{\mathcal {T}_i\sim p(\mathcal {T})}\mathcal {L}_{\mathcal {T}_i}\left( \mathcal {P}_{\theta -\alpha \nabla _\theta \mathcal {L}_{\mathcal {T}_i}\left( \mathcal {P}_\theta \right) }\right) . \end{aligned}$$
(2)

Note that the meta-optimization is performed over the predictor parameters \(\theta \), whereas the objective is computed using the updated parameters \(\theta '\). This meta-optimization across tasks is performed via SGD in the form of

$$\begin{aligned} \theta \leftarrow \theta -\beta \nabla _{\theta }\sum _{\mathcal {T}_i\sim p(\mathcal {T})}\mathcal {L}_{\mathcal {T}_i}\left( \mathcal {P}_{\theta _i'}\right) , \end{aligned}$$
(3)

where \(\beta \) is the meta-learning rate hyper-parameter. During each iteration, we sample a mini-batch of tasks from \(p\left( \mathcal {T}\right) \) and perform the corresponding learner update in Eq. (1) and meta-learner update in Eq. (3).
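The following sketch shows one such meta-iteration for the predictor, mirroring Eqs. (1)–(3). It relies on torch.func.functional_call (available in recent PyTorch) for a differentiable inner update; the single inner step and plain SGD meta-update are simplifications of the actual training setup, and the task format follows the hypothetical sampler above.

```python
import torch

def maml_meta_step(predictor, tasks, loss_fn, alpha=0.05, beta=5e-4):
    """One meta-update of the initialization theta (Eqs. 1-3), sketched.

    tasks: an iterable of (D_train, D_test) pairs, each a list of (X, Y_gt) tuples.
    """
    theta = dict(predictor.named_parameters())
    meta_loss = 0.0
    for d_train, d_test in tasks:
        # Inner update (Eq. 1): one SGD step on the task's small training set.
        train_loss = sum(loss_fn(torch.func.functional_call(predictor, theta, (x, y.shape[0])), y)
                         for x, y in d_train)
        grads = torch.autograd.grad(train_loss, theta.values(), create_graph=True)
        theta_i = {n: p - alpha * g for (n, p), g in zip(theta.items(), grads)}
        # Outer objective (Eq. 2): test error of the adapted parameters.
        meta_loss = meta_loss + sum(
            loss_fn(torch.func.functional_call(predictor, theta_i, (x, y.shape[0])), y)
            for x, y in d_test)
    # Meta-update (Eq. 3): SGD on the initial parameters theta.
    meta_grads = torch.autograd.grad(meta_loss, theta.values())
    with torch.no_grad():
        for p, g in zip(theta.values(), meta_grads):
            p -= beta * g
```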

3.4 Adaptive Meta-learner: Model Adaptation Strategy

In MAML, the model parameters \(\theta '_i\) of a new task \(\mathcal {T}_i\) are obtained by performing a few plain SGD updates on top of the initial \(\theta \) using its small training set \(\mathcal {D}_\text {train}\). After meta-training, \(\theta \) tends to be generic. However, with limited training data from \(\mathcal {D}_\text {train}\), SGD updates can only modify \(\theta \) slightly, which is still far from the desired \(\theta ^*_i\) that would be learned from a large set of target samples. Higher-level knowledge is thus necessary to guide model adaptation to novel tasks.

In fact, during meta-training, for each of the known action classes, we have a large training set of annotated sequences, from which we sample to generate few-shot training sequences. Note that for the novel classes during meta-test, there are no large annotated training sets. Such a setup—meta-learners are trained by sampling small training sets from a large universe of annotated examples—is common in few-shot image classification through meta-learning [14, 21, 61, 66]. While previous approaches (e.g., MAML) only use this original large set for sampling few-shot training sets, we explicitly leverage it and learn the corresponding many-shot model \(\theta ^*_i\) for \(\mathcal {T}_i\). During sampling, tasks drawn from the same action class have their own few-shot training sequences but share the same \(\theta ^*_i\) of that action class. We then use model regression networks (MRN) [66, 69] as the adaptation strategy. MRN was developed in image classification scenarios and captures learning-to-learn knowledge about a generic transformation from few-shot to many-shot models.

Let \(\theta ^0_i\) denote the model parameters learned from \(\mathcal {D}_\text {train}\) by using SGD (i.e., \(\theta '_i\) in Eq. (1)). Let \(\theta ^*_i\) denote the underlying model parameters learned from a large set of annotated samples. We aim to make the updated \(\theta '_i\) as close as possible to the desired \(\theta ^*_i\). MRN assumes that there exists a generic non-linear transformation, represented by a regression function \(\mathcal {H}_\phi \) parameterized by \(\phi \) in the model parameter space, such that \(\theta ^*_i \approx \mathcal {H}_\phi \left( \theta ^0_i\right) \) for a broad range of tasks \(\mathcal {T}_i\). The squared Euclidean distance is used as the regression loss. We then estimate \(\mathcal {H}_\phi \) from a large set of known tasks \(\mathcal {T}_i\) drawn from \(p\left( \mathcal {T}\right) \) during meta-training:

$$\begin{aligned} \min _{\phi }\sum _{\mathcal {T}_i\sim p(\mathcal {T})}\left\| \mathcal {H}_\phi \left( \theta ^0_i\right) -\theta ^*_i\right\| _2^2. \end{aligned}$$
(4)

Consistent with [66], we use multilayer feed-forward networks as \(\mathcal {H}\).
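A minimal sketch of such a regression network is given below, operating on a flattened parameter vector; in practice the transformation may be applied per layer or per parameter block, and the layer widths here are assumptions.

```python
import torch
import torch.nn as nn

class ModelRegressionNet(nn.Module):
    """Sketch of the regression function H_phi in Eq. (4): a feed-forward network
    that maps (flattened) few-shot parameters toward their many-shot counterparts."""

    def __init__(self, n_params, d_hidden=512, n_layers=3):
        super().__init__()
        layers, d_in = [], n_params
        for _ in range(n_layers - 1):
            layers += [nn.Linear(d_in, d_hidden), nn.LeakyReLU()]
            d_in = d_hidden
        layers.append(nn.Linear(d_in, n_params))
        self.net = nn.Sequential(*layers)

    def forward(self, theta_flat):
        return self.net(theta_flat)

def regression_loss(mrn, theta_few_shot, theta_many_shot):
    """Squared Euclidean regression loss of Eq. (4) for a single task."""
    return ((mrn(theta_few_shot) - theta_many_shot) ** 2).sum()
```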

3.5 An Integrated Framework

We introduce the adaptation strategy in both the meta-training and meta-test phases. For task \(\mathcal {T}_i\), after performing a few SGD updates on small training set \(\mathcal {D}_\text {train}\), we then apply the transformation \(\mathcal {H}\) to obtain \(\theta '_i\). Equation (1) is modified as

$$\begin{aligned} \theta _i'=\mathcal {H}_\phi \left( \theta -\alpha \nabla _{\theta }\mathcal {L}_{\mathcal {T}_i}\left( \mathcal {P}_\theta \right) \right) . \end{aligned}$$
(5)
Algorithm 1 (the PAML meta-training procedure).

During meta-training, for task \(\mathcal {T}_i\), we also have the underlying parameters \(\theta ^*_i\), which are obtained by performing SGD updates on the corresponding large sample set. Now, the meta-objective in Eq. (2) becomes

$$\begin{aligned} \min _{\theta ,\phi }\sum _{\mathcal {T}_i\sim p(\mathcal {T})}\widetilde{\mathcal {L}}_{\mathcal {T}_i}\left( \mathcal {P}_{\theta _i'}\right) =\min _{\theta ,\phi }\sum _{\mathcal {T}_i\sim p(\mathcal {T})}\mathcal {L}_{\mathcal {T}_i}\left( \mathcal {P}_{\theta _i'}\right) +\frac{1}{2}\lambda \left\| \theta '_i-\theta ^*_i\right\| _2^2, \end{aligned}$$
(6)

where \(\lambda \) is the trade-off hyper-parameter. This is a joint optimization with respect to both \(\theta \) and \(\phi \), and we perform the meta-optimization across tasks using SGD, as shown in Algorithm 1. Hence, we integrate both model initialization and adaptation into an end-to-end meta-learning framework. The model is initialized to produce the parameters that are optimal for its adaptation; meanwhile, the model is adapted by leveraging “learning-to-learn” knowledge about the relationship between few-shot and many-shot models. During meta-test, for a novel prediction task, with the learned generic model initialization \(\theta \) and model adaptation \(\mathcal {H}_\phi \), we use Eq. (5) to obtain the task-specific predictor model.
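As a sketch of the joint optimization summarized in Algorithm 1, the code below combines the inner SGD step, the learned transformation \(\mathcal {H}_\phi \) of Eq. (5), and the meta-objective of Eq. (6). The `flatten`/`unflatten` helpers that convert between the predictor's parameter dictionary and a flat vector, and the availability of the many-shot parameters \(\theta ^*_i\) for each sampled task, are assumptions of this illustration.

```python
import torch

def paml_meta_step(predictor, mrn, tasks, loss_fn, flatten, unflatten,
                   alpha=0.05, beta=5e-4, gamma=5e-4, lam=0.1):
    """One joint meta-update of (theta, phi), sketching Eqs. (5)-(6).

    tasks: iterable of (D_train, D_test, theta_star) where theta_star is the
    many-shot parameter dict of the task's action class (same layout as theta).
    """
    theta = dict(predictor.named_parameters())
    meta_obj = 0.0
    for d_train, d_test, theta_star in tasks:
        # Inner SGD step on D_train, then the learned transformation H_phi (Eq. 5).
        train_loss = sum(loss_fn(torch.func.functional_call(predictor, theta, (x, y.shape[0])), y)
                         for x, y in d_train)
        grads = torch.autograd.grad(train_loss, theta.values(), create_graph=True)
        theta_0 = {n: p - alpha * g for (n, p), g in zip(theta.items(), grads)}
        theta_i = unflatten(mrn(flatten(theta_0)))
        # Meta-objective (Eq. 6): test loss plus the regression term toward theta_star.
        test_loss = sum(loss_fn(torch.func.functional_call(predictor, theta_i, (x, y.shape[0])), y)
                        for x, y in d_test)
        reg = 0.5 * lam * sum(((p - q) ** 2).sum()
                              for p, q in zip(theta_i.values(), theta_star.values()))
        meta_obj = meta_obj + test_loss + reg
    # Joint SGD meta-update over theta (rate beta) and phi (rate gamma).
    meta_grads = torch.autograd.grad(meta_obj, list(theta.values()) + list(mrn.parameters()))
    with torch.no_grad():
        for p, g in zip(theta.values(), meta_grads[:len(theta)]):
            p -= beta * g
        for p, g in zip(mrn.parameters(), meta_grads[len(theta):]):
            p -= gamma * g
```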

4 Experimental Evaluation

In this section, we explore the use of our proactive and adaptive meta-learning (PAML). PAML is general and can be in principle applied to a broad range of few-shot learning tasks. For performance calibration, we begin with a sanity check of our approach on a standard few-shot image classification task and compare with existing meta-learning approaches. We then focus on our main task of human motion prediction. Through comparing with the state-of-the-art motion prediction approaches, we show that PAML significantly improves the prediction performance in the small-sample size regime.

Table 1. Performance sanity check of our approach by comparing with some state-of-the-art meta-learning approaches to few-shot image classification on the widely used mini-ImageNet dataset. Our PAML outperforms these baselines, showing its general effectiveness for few-shot learning

4.1 Sanity Check on Few-Shot Image Classification

The majority of the existing few-shot learning and meta-learning approaches are developed in the scenario of classification tasks. As a sanity check, the first question is how our meta-learning approach compares with these prior techniques. For a fair comparison, we evaluate on the standard few-shot image classification task. The most common setup is an N-way, k-shot classification that aims to classify data into N classes when we only have a small number (k) of labeled instances per class for training. The loss function is the cross-entropy error between the predicted and true labels. Following [14, 33, 34, 43, 51, 61], we evaluate on the most widely used mini-ImageNet benchmark. It consists of 64 meta-training and 24 meta-test classes, with 600 images of size \(84 \times 84\) per class.

During meta-training, each task is sampled as an N-way, k-shot classification problem: we first randomly sample N classes from the meta-training classes; for each class, we randomly sample k and 1 examples to form the training and test set, respectively. During meta-test, we report performance on the unseen classes from the meta-test classes. We use the convolutional network in [14] as the classifier (i.e., learner). Our model adaptation meta-network is a 2-layer fully-connected network with Leaky ReLU nonlinearity.
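For concreteness, a minimal sketch of this episode construction is given below; the helper names, query-set size, and dictionary-based data layout are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def sample_episode(class_to_images, n_way=5, k_shot=1, n_query=1):
    """Sample one N-way, k-shot classification task from the meta-training classes.

    class_to_images: dict mapping class id -> list of image tensors of shape (3, 84, 84).
    """
    classes = random.sample(list(class_to_images.keys()), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = random.sample(class_to_images[cls], k_shot + n_query)
        support += [(img, label) for img in imgs[:k_shot]]
        query += [(img, label) for img in imgs[k_shot:]]
    return support, query

def episode_loss(classifier, batch):
    """Cross-entropy loss of the classifier over a support or query set."""
    xs = torch.stack([x for x, _ in batch])
    ys = torch.tensor([y for _, y in batch])
    return F.cross_entropy(classifier(xs), ys)
```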

Table 1 summarizes the performance comparisons in the standard 5-way, 1-/5-shot setting. Our PAML consistently outperforms all the baselines. In particular, there is a notable \(5\%\) performance improvement compared with MAML, showing the complementary benefits of our model adaptation strategy. This sanity check verifies the effectiveness of our meta-learning framework. Moreover, some of these existing methods, such as matching networks [61] and prototypical networks [51], are designed with few-shot classification in mind, and are not readily applicable to other domains such as human motion prediction.

4.2 Few-Shot Human Motion Prediction

We now focus on using our meta-learning approach for human motion prediction. To the best of our knowledge, we are the first to explore the few-shot learning problem for human motion prediction. Since there is no published protocol, we propose an evaluation protocol for this task.

Dataset. We evaluate on Human 3.6M (H3.6M) [25], a heavily benchmarked, large-scale mocap dataset that has been widely used in human motion analysis. H3.6M contains seven actors performing 15 varied actions. Following the standard experimental setup in [16, 26, 32], we down-sample the dataset by two, train on six subjects, and test on subject five. Each action contains hours of video of these actors performing that activity. Sequence clips are randomly taken from the training and test videos to construct the corresponding training and test sequences [26]. Given the past 50 mocap frames (2 s in total), we forecast the future 10 frames (400 ms in total) in short-term prediction and the future 25 frames (1 s in total) in long-term prediction.

Few-Shot Learning Task and Meta-learning Setup. We use 11 action classes for meta-training: directions, greeting, phoning, posing, purchases, sitting, sitting down, taking photo, waiting, walking dog, and walking together. We use the remaining 4 action classes for meta-test: walking, eating, smoking, and discussion. These four actions are commonly used to evaluate motion prediction algorithms [16, 26, 32]. The k-shot motion prediction task that we address is: for a certain action, given a small collection of k action-specific past and future sequence pairs, we aim to learn a predictor model that can predict the possible future motion for a new past sequence from that action. Accordingly, the setup of k-shot prediction tasks in meta-learning is as follows. During meta-training, for each task, we randomly select one action out of the 11 and sample k action-specific sequence pairs as \(\mathcal {D}_\text {train}\). During meta-test, for each of the 4 novel actions, we sample k sequence pairs from its training set to produce the small set \(\mathcal {D}_\text {train}\). We then adapt our meta-learned predictor into the target action-specific predictor and evaluate it on the corresponding test set. We run five trials for each action and report the average performance.

Implementation Details. In our experiments, the predictor is residual sup., the state-of-the-art encoder-decoder network for motion prediction [32]. For both the encoder and decoder, we use a single GRU cell [10] with a hidden size of 1,024. Following [32], we tie the weights between the encoder and decoder. We use fully-connected networks with Leaky ReLU nonlinearity as our model adaptation meta-networks. In most cases, k is set to 5, and we also evaluate how performance changes when k varies. By cross-validation, the trade-off hyper-parameter \(\lambda \) is set to 0.1, the learning rate \(\alpha \) is set to 0.05, and the meta-learning rates \(\beta \) and \(\gamma \) are set to 0.0005. For the predictor, we clip the gradient to a maximum \(\ell _2\)-norm of 5. We run 10,000 iterations during meta-training. We use PyTorch [40] to train our model.
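The snippet below collects these settings and shows the gradient-clipping step; the use of torch.nn.utils.clip_grad_norm_ and a plain optimizer interface is a simplified sketch of the training loop, not the exact released implementation.

```python
import torch

# Hyper-parameters as described above (inner learning rate, meta-learning rates, trade-off).
alpha, beta, gamma, lam = 0.05, 5e-4, 5e-4, 0.1
max_grad_norm, n_meta_iters, k_shot = 5.0, 10000, 5

def optimize_step(predictor, optimizer, loss):
    """Backpropagate, clip the predictor gradient to a maximum l2-norm of 5, and update."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(predictor.parameters(), max_grad_norm)
    optimizer.step()
```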

Table 2. Mean angle error comparisons between our PAML and variants of the state-of-the-art residual sup. [32] on the 4 novel actions of H3.6M for \(k=5\)-shot human motion prediction. Our PAML consistently and significantly outperforms all the baselines. In particular, it is superior to the multi-task learning and transfer learning baselines on all the actions across different time horizons

Baselines. For a fair comparison, we compare with residual sup. [32], which is the same predictor as ours but is not meta-learned. In particular, we evaluate its variants in the small-sample size regime and consider learning both action-specific and action-agnostic models in the following scenarios.

  • Action-specific training from scratch: for each of the 4 target actions, we learn an action-specific predictor from its k training sequence pairs.

  • Action-agnostic training from scratch: we learn a single predictor for the 4 target actions from all their training sequence pairs.

  • Off-the-shelf transfer: we learn a single predictor for the 11 meta-training actions from their large amounts of training sequence pairs, and directly use this predictor for the 4 target actions without modification.

  • Multi-task learning: we learn a single predictor for all the 15 actions from large amounts of training sequence pairs of the 11 meta-training actions and k sequence pairs per action of the 4 target actions.

  • Fine-tuning transfer: after learning a single predictor for the 11 meta-training actions from their large amounts of training sequence pairs, we fine-tune it to be an action-specific predictor for each of the 4 target actions, respectively, using its k training sequence pairs.

Evaluation Metrics. We evaluate our approach both quantitatively and qualitatively. For the quantitative evaluation, we use the standard metric—mean error between the predicted motion and the groundtruth motion in the angle space [16, 26, 32]. Following the preprocessing in [32, 54], we exclude the translation and rotation of the whole body. We also qualitatively visualize the prediction frame by frame.
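A sketch of this metric is shown below; the exact index layout of the excluded whole-body translation and rotation dimensions is an assumption.

```python
import torch

def mean_angle_error(pred, gt, n_global_dims=6):
    """Mean angle error: Euclidean distance in joint-angle space, averaged over
    test sequences, reported per prediction timestep.

    pred, gt: (n_sequences, n_future, d) tensors; the first n_global_dims
    dimensions (whole-body translation and rotation) are excluded.
    """
    diff = pred[..., n_global_dims:] - gt[..., n_global_dims:]
    return torch.norm(diff, dim=-1).mean(dim=0)   # error at each future timestep
```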

Comparison with the State-of-the-Art Approaches. Table 2 shows the quantitative comparisons between our PAML and a variety of variants of residual sup. While residual sup. has achieved impressive performance with a large amount of annotated mocap sequences [32], its prediction significantly degrades in the small-sample size regime. As expected, directly training the predictor from a few examples leads to poor performance (i.e., with the angle error in range 2–3), due to severe over-fitting. In such scenarios of training from scratch, learning an action-agnostic model is slightly better than learning an action-specific one (e.g., decreasing the angle error by 0.1 at 80 ms for walking), since the former allows the predictor to exploit some common motion regularities from multiple actions. By transferring knowledge from relevant actions with large sets of samples in a more principled manner, the prediction performance is slightly improved. This is achieved by multi-task learning, e.g., training an action-agnostic predictor using both the 11 source and 4 target actions, or transfer learning, e.g., first training an action-agnostic predictor using the source actions, and then using it either in an off-the-shelf manner or through fine-tuning.

However, modeling multiple actions is more challenging than modeling each action separately, due to the significant diversity of different actions. The performance improvement of these multi-task learning and transfer learning baselines is limited and their performance is also comparably low. This thus demonstrates the general difficulty of our few-shot motion prediction task. By contrast, our PAML consistently and significantly outperforms the baselines on almost all the actions across different time horizons, showing the effectiveness of our meta-learning mechanism. There is even a noticeable performance boost for the complicated motion (e.g., decreasing the angle error by 0.3 at 80 ms for smoking). By explicitly learning from a large number of few-shot prediction tasks during meta-training, PAML is able to extract and leverage knowledge shared both across different actions and across multiple few-shot prediction tasks, thus improving the prediction of novel actions from a few examples by a large margin.

Moreover, as mentioned before, most current meta-learning approaches, such as matching networks [61] and prototypical networks [51], are developed for simpler tasks like image classification with task-specific model architectures (e.g., learning an embedding space that is useful for nearest neighbor or prototype classifiers), and are not readily applicable to our problem. Unlike them, our approach is general and can be effectively used across a broad range of tasks, as shown in Tables 1 and 2. Figure 2 further visualizes our prediction and compares it with one of the top performing baselines. From Fig. 2, we can see that our PAML generates lower-error, smoother, and more realistic predictions.

Fig. 2.

Visualizations for \(k=5\)-shot motion prediction on smoking and discussion. Top: the input sequence and the groundtruth of the prediction sequence. Middle: multi-task learning of residual sup. [32], one of the top performing baselines. Bottom: our prediction results. The groundtruth and the input sequences are shown in black, and the predictions are shown in color. Our PAML produces smoother and more human-like prediction. Best viewed in color with zoom. (Color figure online)

Table 3. Ablation on model initialization vs. adaptation. Each component by itself outperforms the fine-tuning baseline. Our full model achieves the best performance

Ablation Studies. In Tables 3 and 4, we evaluate the contributions of the different components of our approach.

Model Initialization vs. Model Adaptation. Our meta-learning approach consists of two components: a generic model initialization and an effective model adaptation meta-network. In Table 3, we can see that each component by itself is superior to the baselines reported in Table 2 in almost all the scenarios. This shows that meta-learning, in general, by leveraging shared knowledge across relevant tasks, enables us to deal with a novel task in a sample-efficient way. Moreover, our full PAML model consistently outperforms its variants, showing the complementarity of each component. This verifies the importance of simultaneously learning a generic initial model and an effective adaptation strategy.

Structure of \(\mathcal {H}\). In Table 4 we compare different implementations of the model adaptation meta-network \(\mathcal {H}\): a simple affine transformation, or fully-connected networks with 2–4 layers. Since Leaky ReLU is used in [66], we try both ReLU and Leaky ReLU as the activation function in the hidden layers. The results show that 3-layer fully-connected networks with Leaky ReLU achieve the best performance.

Table 4. Ablation on the structure of \(\mathcal {H}\). We vary the number of fully-connected layers and try ReLU and Leaky ReLU as activation function. The results show that “3-layer, Leaky ReLU” works best, but in general \(\mathcal {H}\) is robust to specific implementation choices
Fig. 3.

Impact of the training sample size k for k-shot motion prediction. We compare our PAML with fine-tuning transfer of residual sup. [32], one of the top performing baselines. As a reference, we also include the oracle performance, i.e., residual sup. trained from thousands of annotated sequence pairs. X-axis: number of training sequence pairs k per task. Y-axis: mean angle error. Ours consistently outperforms fine-tuning, and with only 100 sequence pairs it achieves performance close to the oracle.

Impact of Training Sample Sizes. In the previous experiments, we focused on a fixed \(k=5\)-shot motion prediction task. To test how our meta-learning approach benefits from more training sequences, we evaluate how performance changes with the sample size k. Figure 3 summarizes the comparisons with fine-tuning transfer, one of the top performing baselines reported in Table 2, when k varies from 1 to 100 at 80 ms. As a reference, we also include the oracle performance, i.e., the residual sup. baseline trained on the entire training set of the target action (i.e., with thousands of annotated sequence pairs). Figure 3 shows that our approach consistently outperforms fine-tuning and improves with more training sequences. Interestingly, through our meta-learning mechanism, with only 100 sequence pairs we achieve performance close to that of the oracle trained from thousands of sequence pairs.

5 Conclusions

In this work we have formulated a novel problem of few-shot human motion prediction and proposed a conceptually simple but powerful approach to address this problem. Our key insight is to jointly learn a generic model initialization and an effective model adaptation strategy through meta-learning. To do so, we utilize a novel combination of model-agnostic meta-learning and model regression networks, two meta-learning approaches that have complementary strengths, and unify them into an integrated, end-to-end framework. As a sanity check, we demonstrate that our approach significantly outperforms existing techniques on the most widely benchmarked few-shot image classification task. We then present the state-of-the-art results on few-shot human motion prediction.