
1 Introduction

Predicting the evolution of the surrounding physical world over time is an essential aspect of human intelligence. For example, in a seamless interaction, a robot is expected to have some notion of how people will move or act in the near future, conditioned on a series of historical movements. Human motion prediction has thus been widely used in computer vision and robotics, including autonomous driving [54], character animation [62], robot navigation [59], motion tracking [48], and human-robot interaction [7, 34, 35, 38]. Owing to deep learning techniques, there has been significant progress over the past few years in modeling and predicting motions. Despite notable successes, forecasting human motions, especially over longer time horizons (i.e., up to several seconds), remains fundamentally challenging because of the difficulty of modeling multi-modal motion dynamics and the uncertainty of conscious human movements. Learning such uncertainty can, for example, help reduce the search space in motion tracking problems.

As a powerful tool, deep generative models have thus been introduced for this purpose, where random codes from a prior distribution are employed to capture the multi-modal distribution of future human motions. However, current motion capture datasets are typically constructed such that there is only a single ground truth future sequence for each historical sequence [30, 60], which makes it difficult for generators to model the underlying multi-modal density of future motions. Indeed, in practice, generators tend to ignore differences in random codes and simply produce similar predictions. This is known as mode collapse – the samples concentrate in the major mode, as depicted with a representative example in Fig. 1 – and has been widely observed [72]. Recent work has alleviated this problem by explicitly promoting diversity in sampling using post-hoc diversity mappings [72], or by sequentially generating different body parts [51] to achieve combinatorial diversity. These techniques, however, induce additional modeling complexity without guaranteeing that the diversity modeling accurately covers multiple plausible modes of human motions.

Fig. 1. Our Spatial-Temporal AnchoR-based Sampling (STARS) is able to capture multiple modes, thus facilitating diverse human motion prediction. Left: with traditional generative models such as conditional variational autoencoders (CVAEs), the predicted motions are often concentrated in the major mode with less diversity (illustrated with 8 samples). Right: STARS is able to cover more modes, where motions in the same mode have similar characteristics but vary widely across modes. Here, we use 4 anchors to pinpoint different modes. With each anchor, we sample noise and generate 2 similar motions with slight variation in each mode.

To this end, we propose a simple yet effective strategy – Multi-Level Spatial-Temporal AnchoR-Based Sampling (STARS) – with the key insight that future motions are not completely random or independent of each other; they share deterministic properties in line with physical laws and human body constraints, and continue trends of historical movements. For example, we may expect changes in velocity or direction to be shared deterministically among some future motions, whereas their magnitudes may differ stochastically. Based on this observation, we disentangle latent codes in the generative model into a stochastic component (noise) and a deterministic learnable component named anchors. With this disentanglement, the diversity of predictions is jointly determined by random noise and by anchors that are learned to specialize in certain modes of future motion. In contrast, the diversity of traditional generative models is determined solely by independent noise, as depicted in Fig. 1. Now, on the one hand, random noise only accounts for modeling the uncertainty within the mode identified by the anchor, which reduces the burden of having to model the entire future diversity. On the other hand, the model can better capture the deterministic states of multiple modes by directly optimizing the anchors, thereby reducing the modeling complexity.

Naturally, human motions exhibit variation in the spatial and temporal domains, and these two types of variation are comparatively independent. Inspired by this, we propose a further decomposition that factorizes anchors into spatial and temporal anchors. Specifically, our spatial anchors capture future motion variation at the spatial level while remaining constant at the temporal level, and vice versa. Another appealing property of our approach is that, through straightforward linear interpolation of spatial-temporal anchors, we achieve flexible and seamless control over the predictions (Fig. 6 and Fig. 7). Unlike low-level controls that combine motions of different body parts [51, 72], our work enables manipulation of future motions natively in space and time, which is an under-explored problem. Additionally, we propose a multi-level mechanism for spatial-temporal anchors to capture multi-scale modes of future motions.

As a key advantage, spatial-temporal anchors are compatible with any motion predictor. Here, we introduce an Interaction-Enhanced Spatial-Temporal Graph Convolutional Network (IE-STGCN). This model encodes the spatial locality of human motion and achieves state-of-the-art performance as a motion predictor.

Our contributions can be summarized as follows. (1) We propose a novel anchor-based generative model that formulates sampling as learning deterministic anchors combined with likelihood sampling, to better capture the multiple modes of human motions. (2) We propose a multi-level spatial-temporal decomposition of anchors for interpretable control over future motions. (3) We develop a spatial-temporal graph neural network with interaction enhancement to incorporate our anchor-based sampling. (4) We demonstrate that our approach, as a unified framework for modeling human motions, significantly outperforms state-of-the-art models in both diverse and deterministic human motion prediction.

2 Related Work

Deterministic Motion Prediction. Existing work on deterministic human motion forecasting predicts a single future motion based on a sequence of past poses [1, 6, 15, 42, 49] or video frames [11, 71, 74], or under the constraints of the scene context [8, 12, 25], using recurrent neural networks (RNNs) [63], temporal convolutional networks (TCNs) [3], and graph neural networks (GNNs) [33] for sequence modeling. Early work commonly used RNNs [20, 21, 22, 31, 65], which are limited in long-term temporal encoding due to error accumulation [18, 53] and training difficulty [55]. Some recent attempts exploit GNNs [16, 50] to encode poses at the spatial level, but such work still relies on RNNs [41], CNNs [14, 39, 40], or feed-forward networks [52] for temporal modeling. More recently, spatial-temporal graph convolutional networks (STGCNs) [61, 67, 69] have been proposed to jointly encode spatial and temporal correlations with spatial-temporal graphs. Continuing this effort, we propose IE-STGCN, which additionally encodes inductive biases such as spatial locality into STGCNs.

Stochastic Motion Prediction. Stochastic human motion prediction is an emerging trend driven by the development of deep generative models such as variational autoencoders (VAEs) [32], generative adversarial networks (GANs) [19], and normalizing flows (NFs) [58]. Most existing work [2, 4, 26, 37, 43, 64, 66, 73] produces various predictions from a set of codes independently sampled from a given distribution. As observed in DLow [72], such likelihood-based sampling cannot produce enough diversity, as many samples are merely perturbations within the major mode. To overcome this issue, DLow employs a two-stage framework, using post-hoc mappings to shape the latent samples and improve diversity. GSPS [51] generates different body parts sequentially to achieve combinatorial diversity. Nevertheless, their explicit promotion of diversity induces additional complexity but does not directly enhance multi-mode capture. We instead introduce anchors, which are comparatively easy to optimize, to locate the deterministic components of motion modes and promote sample diversity.

Controllable Motion Prediction. Controllable motion prediction has been explored in computer graphics for virtual character generation [27, 28, 44]. In the prediction task, DLow [72] and GSPS [51] propose to control the predicted motion by separating upper and lower body parts, fixing one part while controlling the diversity of the other. In this paper, through the use of spatial-temporal anchors, we propose a different and more natural form of controllability, in native space and time. By varying and interpolating the spatial and temporal anchors, we achieve high-level control over spatial and temporal variation, respectively.

Learnable Anchors. Our anchor-based sampling, i.e., sampling with deterministic learnable codes, is inspired by work on leveraging predefined primitives and learnable codes for applications such as trajectory prediction [10, 13, 36, 46, 57], object detection [9, 45], human pose estimation [68], and video representation learning [24]. Anchors usually refer to hypotheses about the predictions, such as box candidates with different shapes and locations in object detection [45]. In a similar spirit, anchors in the context of human motion prediction encode assumptions about future movements. The difference is that our anchors are not hand-crafted or predefined primitives; instead, they are latent codes learned from the data. Meanwhile, we endow anchors with explainability, i.e., they describe the multi-level spatial-temporal variation of future motions.

3 Methodology

Problem Formulation. We denote the input motion sequence of length \(T_h\) as \(\textbf{X}=[\textbf{x}_1, \textbf{x}_2,\ldots ,\textbf{x}_{T_h}]^T\), where the 3D coordinates of V joints describe each pose \(\textbf{x}_i \in \mathbb {R}^{{V}\times C^{(0)}}\); here, \(C^{(0)} = 3\). The K output sequences of length \(T_p\) are denoted as \(\widehat{\textbf{Y}}_1, \widehat{\textbf{Y}}_2, \ldots , \widehat{\textbf{Y}}_K\). We have access to a single ground truth future motion of length \(T_p\), denoted \(\textbf{Y}\). Our objectives are: (1) one of the K predictions is as close to the ground truth as possible; and (2) the K sequences are as diverse as possible, yet represent realistic future motions.

In this section, we first briefly review deep generative models, describe how they draw samples to generate multiple futures, and discuss their limitations (Sect. 3.1). We then detail our insights on STARS, including anchor-based sampling and multi-level spatial-temporal anchors (Sect. 3.1 and Fig. 2). To model human motion, we design an IE-STGCN and incorporate our spatial-temporal anchors into it (Sect. 3.2), as illustrated in Fig. 3.

Fig. 2. Comparison of generative models without and with anchor-based sampling. Anchors and network parameters are jointly optimized. (a) Conventional generative model with only stochastic noise; (b) Generative model with deterministic anchor process: an anchor with Gaussian noise corresponds to a prediction; (c) Spatial-temporal compositional anchors: any pair of combined spatial and temporal anchors corresponds to a prediction; (d) Multi-level spatial-temporal anchors: anchors at different levels are combined for encoding multi-scale modes.

3.1 Multi-level Spatial-Temporal Anchor-Based Sampling

Preliminaries: Deep Generative Models. There is a large body of work on the generation of multiple hypotheses with deep generative models, most of which learn a parametric probability distribution function explicitly or implicitly. Let \(p(\textbf{Y}|\textbf{X})\) denote the distribution of the future human motion \(\textbf{Y}\) conditioned on the past sequence \(\textbf{X}\). With a latent variable \(\textbf{z}\in \mathcal {Z}\), the distribution can be reparameterized as \(p(\textbf{Y}|\textbf{X}) = \int p(\textbf{Y}|\textbf{X},\textbf{z})p(\textbf{z}) \textrm{d} \textbf{z}\), where \(p(\textbf{z})\) is often a Gaussian prior distribution. To generate a future motion sequence \(\mathbf {\widehat{Y}}\), \(\textbf{z}\) is drawn from the given distribution \(p(\textbf{z})\), and then a deterministic generator \(\mathcal {G}:\mathcal {Z}\times \mathcal {X} \rightarrow \mathcal {Y}\) is used for mapping, as illustrated in Fig. 2(a):

$$\begin{aligned} \textbf{z} \sim p(\textbf{z}), \ \mathbf {\widehat{Y}} = \mathcal {G}(\textbf{z}, \textbf{X}), \end{aligned}$$
(1)

where \(\mathcal {G}\) is a deep neural network parameterized by \(\theta \). The goal of generative modeling is to make the distribution \(p_{\theta }(\mathbf {\widehat{Y}}|\textbf{X})\) derived from the generator \(\mathcal {G}\) close to the actual distribution \(p(\textbf{Y}|\textbf{X})\).

To generate K diverse motion predictions, traditional approaches independently sample a set of latent codes \(Z = \{\textbf{z}_1, \ldots , \textbf{z}_K\}\) from the prior distribution \(p(\textbf{z})\) and decode each with the generator. Although, in theory, generative models are capable of covering different modes, they are not guaranteed to locate all the modes precisely, and mode collapse has been widely observed [70, 72].

Anchor-Based Sampling. To address this problem, we propose a simple yet effective sampling strategy. Our intuition is that the diversity of future motions can be characterized by: (1) a deterministic component – across different actions performed by different subjects, there exist correlated or shareable changes in velocity, direction, movement patterns, etc., which naturally emerge and can be directly learned from data; and (2) a stochastic component – given an action carried out by a subject, the magnitude of these changes varies stochastically.

Therefore, we disentangle the code in the latent space of the generative model into a stochastic component sampled from \(p(\textbf{z})\), and a deterministic component represented by a set of K learnable parameters called anchors \(\mathcal {A} = \{\textbf{a}_k\}_{k=1}^K\). Deterministic anchors are expected to identify as many modes as possible, which is achieved through a carefully designed optimization, while stochastic noise further specifies motion variation within certain modes. With this latent code disentanglement, we denote the new multi-modal distribution as

$$\begin{aligned} p_{\theta }(\mathbf {\widehat{Y}}|\textbf{X}, \mathcal {A}) = \frac{1}{K}\sum _{k=1}^K \int p_{\theta }(\mathbf {\widehat{Y}}|\textbf{X}, \textbf{z},\textbf{a}_k)p(\textbf{z}) \textrm{d} \textbf{z}. \end{aligned}$$
(2)

Consequently, as illustrated in Fig. 2(b), given the k-th learned anchor \(\textbf{a}_k \in \mathcal {A}\) and randomly sampled noise \(\textbf{z}\), we can generate the prediction \(\widehat{\textbf{Y}}_k\) as

$$\begin{aligned} \textbf{z} \sim p(\textbf{z}),\ \widehat{\textbf{Y}}_k = \mathcal {G}(\textbf{a}_k, \textbf{z}, \textbf{X}). \end{aligned}$$
(3)

Using each anchor once produces a total of K predictions, although anchors are not restricted to being used exactly once, or at all. To incorporate anchors into the network, we find it effective to simply add the selected anchors to the latent features, as shown in Fig. 3.
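To make this concrete, below is a minimal PyTorch sketch of anchor-based sampling; the module name, feature shapes, and injection point are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class AnchorInjection(nn.Module):
    """Sketch of anchor-based sampling: K learnable anchors form the
    deterministic component; Gaussian noise adds within-mode variation.
    Names and shapes are assumptions, not the paper's exact code."""

    def __init__(self, num_anchors: int, feat_dim: int):
        super().__init__()
        # Anchors are plain parameters, optimized jointly with the network.
        self.anchors = nn.Parameter(0.01 * torch.randn(num_anchors, feat_dim))

    def forward(self, feat: torch.Tensor, k: int) -> torch.Tensor:
        # feat: (B, feat_dim) latent feature; k selects the k-th anchor.
        z = torch.randn_like(feat)         # stochastic component, z ~ N(0, I)
        return feat + self.anchors[k] + z  # simple addition into the feature
```

Generating one prediction per anchor, for k = 1, ..., K, then realizes Eq. (3).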

Spatial-Temporal Compositional Anchors. We observe that the diversity of future motions can be roughly divided into two relatively independent types: spatial variation and temporal variation. This sheds light on a feasible further decomposition of the K anchors into two types of learnable codes: spatial anchors \(\mathcal {A}_s = \{\textbf{a}^s_i\}_{i=1}^{K_s}\) and temporal anchors \(\mathcal {A}_t = \{\textbf{a}^t_j\}_{j=1}^{K_t}\), where \(K = K_s \times K_t\). With this decomposition, we can still yield a total of \(K_s \times K_t\) compositional anchors from the pairs of spatial-temporal anchors. Note that the temporal anchors here in fact control the frequency variation of future motion sequences, since our temporal features are in the frequency domain, as we will show in Sect. 3.2. More specifically, all spatial anchors are identical in the temporal dimension but characterize the variation of motion in the spatial dimension, controlling movement trends and directions. Meanwhile, all temporal anchors remain unchanged in the spatial dimension but differ in the temporal dimension, producing disparities in frequency that affect movement speed.

Fig. 3. Overview of our STARS w/ IE-STGCN framework. We combine the multi-level spatial-temporal anchors and the sampled noise with the backbone IE-STGCN. To generate one of the predictions given a past motion, we draw noise \(\textbf{z}_k\) and add the selected spatial-temporal anchors to the latent feature at each level.

To produce \(\widehat{\textbf{Y}}_k\), as depicted in Fig. 2(c), we sample \(\textbf{z}\) and select the i-th spatial anchor \(\textbf{a}_i^s\) and the j-th temporal anchor \(\textbf{a}_j^t\),

$$\begin{aligned} \textbf{z} \sim p(\textbf{z}),\ \widehat{\textbf{Y}}_k = \mathcal {G}(\textbf{a}_i^s + \textbf{a}_j^t, \textbf{z}, \textbf{X}), \end{aligned}$$
(4)

where \(\textbf{a}_i^s + \textbf{a}_j^t\) is a spatial-temporal compositional anchor corresponding to an original anchor \(\textbf{a}_k\). Furthermore, motion control over spatial and temporal variation can be customized through these spatial-temporal anchors. For example, we can produce future motions with similar trends by fixing the spatial anchors while varying or interpolating the temporal anchors, as shown in Sect. 4.3.
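A sketch of this decomposition is given below, under the assumption that anchors live in the (frequency, joint, channel) feature layout of Sect. 3.2; the class name and initialization are illustrative.

```python
import torch
import torch.nn as nn

class STCompositionalAnchors(nn.Module):
    """Sketch: spatial anchors vary over joints but are shared across all
    frequency components; temporal anchors do the opposite."""

    def __init__(self, K_s: int, K_t: int, M: int, V: int, C: int):
        super().__init__()
        self.spatial = nn.Parameter(0.01 * torch.randn(K_s, 1, V, C))   # constant over M
        self.temporal = nn.Parameter(0.01 * torch.randn(K_t, M, 1, C))  # constant over V

    def forward(self, i: int, j: int) -> torch.Tensor:
        # Broadcasting yields one of the K_s * K_t compositional anchors
        # a_i^s + a_j^t, with shape (M, V, C).
        return self.spatial[i] + self.temporal[j]
```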

Multi-level Spatial-Temporal Anchors. To further capture multi-scale modes of future motions, we propose a multi-level mechanism to extend the spatial-temporal anchors. As an illustration, Fig. 2(d) shows a simple two-level case of this design. We introduce two different spatial-temporal anchor sets, \(\{\mathcal {A}_t^{(1)},\mathcal {A}_s^{(1)}\}\) and \(\{\mathcal {A}_t^{(2)},\mathcal {A}_s^{(2)}\}\), and assign them sequentially to different network parts \(\mathcal {G}^{(1)},\mathcal {G}^{(2)}\). Suppose (i, j) is the spatial-temporal index corresponding to the 1D index k; we can then generate \(\widehat{\textbf{Y}}_k\) through a two-level process as

$$\begin{aligned} \textbf{z} \sim p(\textbf{z}),\ \widehat{\textbf{Y}}_k = \mathcal {G}^{(2)}(\textbf{a}_i^{s_{2}} + \textbf{a}_j^{t_{2}}, \textbf{z}, \mathcal {G}^{(1)}(\textbf{a}_i^{s_{1}} + \textbf{a}_j^{t_{1}}, \textbf{X})), \end{aligned}$$
(5)

where \(\textbf{a}_i^{s_{1}} \in \mathcal {A}_s^{(1)}, \textbf{a}_j^{t_{1}} \in \mathcal {A}_t^{(1)}, \textbf{a}_i^{s_{2}} \in \mathcal {A}_s^{(2)}, \textbf{a}_j^{t_{2}} \in \mathcal {A}_t^{(2)}\). In principle, anchors can be applied at more levels to encode richer assumptions about future motions.
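The two-level process of Eq. (5) can be sketched as follows, where G1 and G2 stand in for the two halves of the backbone and the anchor containers follow the compositional sketch above; all names are assumptions.

```python
import torch

def two_level_forward(G1, G2, anchors1, anchors2, i: int, j: int, X: torch.Tensor):
    """Sketch of Eq. (5): the same index pair (i, j) selects anchors at both
    levels, and noise enters at the second level."""
    h = G1(anchors1(i, j), X)        # first-level features
    z = torch.randn_like(h)          # z ~ p(z)
    return G2(anchors2(i, j), z, h)  # second-level prediction
```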

Training. During training, the model uses each spatial-temporal anchor explicitly to generate K future motions for each past motion sequence. The loss functions are largely adopted from [51], and we summarize them into three categories: (1) reconstruction losses, which optimize the best predictions under different definitions among the K generated motions and thus pull each anchor toward its own nearest mode; (2) a diversity-promoting loss, which explicitly promotes pairwise distances between predictions and prevents anchors from collapsing onto the same mode; and (3) motion constraint losses, which encourage the output movements to be realistic. All anchors are learned directly from the data via gradient descent. In the forward pass, we explicitly take every anchor \(\textbf{a}_k \in \mathcal {A}=\{\textbf{a}_k\}_{k=1}^K\) as an additional input to the network and produce a total of K outputs. In the backward pass, each anchor is optimized separately based on its corresponding outputs and losses, while the backbone network is updated based on the fused losses from all outputs. This separate backward pass is handled automatically by PyTorch [56]. Please refer to the supplementary material for more details.
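The loss structure can be sketched as below; this is a simplification under assumed tensor shapes, and the exact terms follow [51] and the supplementary material.

```python
import torch

def stars_losses(preds: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Sketch of the loss categories. preds: (K, T_p, V, 3), one prediction
    per anchor; gt: (T_p, V, 3)."""
    K = preds.shape[0]
    # (1) Reconstruction: penalize only the best of the K samples, so each
    #     anchor is pulled toward its own nearest mode.
    per_sample = (preds - gt.unsqueeze(0)).norm(dim=-1).mean(dim=(1, 2))  # (K,)
    rec = per_sample.min()
    # (2) Diversity promotion: reward pairwise distance between samples to
    #     keep anchors from collapsing onto one mode.
    flat = preds.reshape(K, -1)
    div = torch.exp(-torch.cdist(flat, flat)).triu(diagonal=1).sum() * 2 / (K * (K - 1))
    # (3) Motion constraint losses (realism terms, e.g., limb-length
    #     consistency) are omitted here for brevity.
    return rec + div
```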

3.2 Interaction-Enhanced Spatial-Temporal Graph Convolutional Network

In principle, our proposed anchor-based sampling permits a flexible network architecture. Here, to incorporate our multi-level spatial-temporal anchors, we naturally represent motion sequences as spatial-temporal graphs (more precisely, spatial-frequency graphs), instead of the widely used spatial graphs [51, 52]. Our approach builds upon the Discrete Cosine Transform (DCT) [51, 52] to transform the motion into the frequency domain. Specifically, given a past motion \(\textbf{X}_{1:T_h} \in \mathbb {R}^{T_h \times V \times C^{(0)}}\), where each pose has V joints, we first replicate the last pose \(T_p\) times to obtain \(\textbf{X}_{1:T_h+T_p} = [\textbf{x}_1, \textbf{x}_2,\ldots ,\textbf{x}_{T_h}, \textbf{x}_{T_h}, \ldots ,\textbf{x}_{T_h}]^T\). With M predefined DCT basis vectors \(\textbf{C} \in \mathbb {R}^{ M\times (T_h+T_p)}\), the motion is transformed as

$$\begin{aligned} \widetilde{\textbf{X}} = \textbf{C}\textbf{X}_{1:T_h+T_p}. \end{aligned}$$
(6)
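For concreteness, a sketch of the padding and transform follows; the DCT-II variant and its normalization are our assumptions, as the paper only specifies M predefined basis vectors.

```python
import math
import torch

def dct_basis(M: int, T: int) -> torch.Tensor:
    """Orthonormal DCT-II basis C in R^{M x T} (assumed variant)."""
    t = torch.arange(T, dtype=torch.float32)
    m = torch.arange(M, dtype=torch.float32).unsqueeze(1)
    C = torch.cos(math.pi * (2.0 * t + 1.0) * m / (2.0 * T)) * math.sqrt(2.0 / T)
    C[0] = C[0] / math.sqrt(2.0)
    return C

def to_frequency(X: torch.Tensor, T_p: int, M: int) -> torch.Tensor:
    """Eq. (6): replicate the last observed pose T_p times, then project.
    X: (T_h, V, 3) -> DCT coefficients (M, V, 3)."""
    X_pad = torch.cat([X, X[-1:].expand(T_p, -1, -1)], dim=0)  # (T_h + T_p, V, 3)
    C = dct_basis(M, X_pad.shape[0])
    return torch.einsum('mt,tvc->mvc', C, X_pad)
```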

We formulate \(\widetilde{\textbf{X}} \in \mathbb {R}^{M \times V \times C^{(0)}}\) in the 0-th layer, and the latent features in any l-th graph layer, as spatial-temporal graphs \((\mathcal {V}^{(l)}, \mathcal {E}^{(l)})\) with \(M \times V\) nodes. We specify node i by the 2D index \((f_i, v_i)\), denoting joint \(v_i\) with frequency component \(f_i\). The edge \((i, j) \in \mathcal {E}^{(l)}\), associated with the interaction between node i and node j, is represented by \(\textbf{Adj}^{(l)}[i][j]\), where the adjacency matrix \(\textbf{Adj}^{(l)} \in \mathbb {R}^{M V\times M V}\) is learnable. We bottleneck spatial-temporal interactions as in [61], by factorizing the adjacency matrix into the product of low-rank spatial and frequency matrices, \(\textbf{Adj}^{(l)} = \textbf{Adj}^{(l)}_{s} \textbf{Adj}^{(l)}_{f}\). The spatial adjacency matrix \(\textbf{Adj}^{(l)}_{s} \in \mathbb {R}^{M V\times M V}\) connects only nodes with the same frequency, while the frequency adjacency matrix \(\textbf{Adj}^{(l)}_{f} \in \mathbb {R}^{M V\times M V}\) is responsible only for the interplay between nodes representing the same joint.

The spatial-temporal graph can be conveniently encoded by a graph convolutional network (GCN). Given a set of trainable weights \(\textbf{W}^{(l)} \in \mathbb {R}^{C^{(l)}\times C^{(l+1)}}\) and activation function \(\sigma (\cdot )\), such as ReLU, a spatial-temporal graph convolutional layer projects the input from \(C^{(l)}\) to \(C^{(l+1)}\) dimensions by

$$\begin{aligned} \textbf{H}^{(l+1)}_k = \sigma (\textbf{Adj}^{(l)}\textbf{H}^{(l)}_k\textbf{W}^{(l)})=\sigma (\textbf{Adj}_{s}^{(l)}\textbf{Adj}_{f}^{(l)}\textbf{H}^{(l)}_k\textbf{W}^{(l)}), \end{aligned}$$
(7)

where \(\textbf{H}^{(l)}_k \in \mathbb {R}^{M V \times C^{(l)}}\) denotes the latent feature of the prediction \(\widehat{\textbf{Y}}_k\) at the l-th layer. The backbone consists of multiple graph convolutional layers. After generating the predicted DCT coefficients \(\widetilde{\textbf{Y}}_k \in \mathbb {R}^{M \times V \times C^{(L)}}\), reshaped from \(\textbf{H}^{(L)}_k\) with \(C^{(L)} = 3\), we recover \(\widehat{\textbf{Y}}_k\) via the Inverse DCT (IDCT) as

$$\begin{aligned} \mathbf {\widehat{Y}}_k = (\mathbf {C^\textsf{T}}\widetilde{\textbf{Y}}_k)_{T_h+1:T_h+T_p}, \end{aligned}$$
(8)

where the last \(T_p\) frames of the recovered sequence represent future poses.
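A sketch of one such layer and of the recovery step is given below; the initialization and the structure of the learnable adjacencies are simplified assumptions.

```python
import torch
import torch.nn as nn

class STGCLayer(nn.Module):
    """Sketch of Eq. (7): graph convolution over the M*V spatial-frequency
    nodes, with the adjacency factorized into spatial and frequency parts."""

    def __init__(self, num_nodes: int, c_in: int, c_out: int):
        super().__init__()
        self.adj_s = nn.Parameter(torch.eye(num_nodes))  # Adj_s^{(l)}, spatial
        self.adj_f = nn.Parameter(torch.eye(num_nodes))  # Adj_f^{(l)}, frequency
        self.weight = nn.Parameter(torch.randn(c_in, c_out) * c_in ** -0.5)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (B, M*V, c_in) -> (B, M*V, c_out), following Eq. (7)
        return torch.relu(self.adj_s @ self.adj_f @ H @ self.weight)

def recover_motion(C: torch.Tensor, Y_dct: torch.Tensor, T_h: int) -> torch.Tensor:
    """Eq. (8): IDCT via the transposed basis, keeping the future frames.
    C: (M, T_h + T_p) DCT basis; Y_dct: (M, V, 3) predicted coefficients."""
    Y_full = torch.einsum('mt,mvc->tvc', C, Y_dct)  # (T_h + T_p, V, 3)
    return Y_full[T_h:]                             # last T_p frames
```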

Conceptually, interactions between spatial-temporal nodes should be relatively invariant across layers, and different interactions should not be equally important. For example, we would expect constraints and dependencies between “left arm” and “left forearm,” while the movements of “head” and “left forearm” are relatively independent. We consider it redundant to construct a complete spatial-temporal graph for each layer independently. Therefore, we introduce cross-layer interaction sharing to share parameters between graphs in different layers, and spatial interaction pruning to prune the complete graph.

Cross-Layer Interaction Sharing. Much care has been taken in employing learnable interactions between spatial nodes across all graph layers [52, 61, 67]. We consider the spatial relationships to be relatively unchanged across layers. Empirically, we find that sharing the adjacency matrix at intervals of one layer is effective. As shown in Fig. 3, we set \(\textbf{Adj}_s^{(4)} = \textbf{Adj}_s^{(6)} = \textbf{Adj}_s^{(8)}\) and \(\textbf{Adj}_s^{(5)} = \textbf{Adj}_s^{(7)}\).

Spatial Interaction Pruning. To emphasize the physical relationships and constraints between spatial joints, we prune the spatial connections \(\mathbf {\widehat{Adj}}_s^{(l)} = \textbf{M}_s \odot \textbf{Adj}_s^{(l)}\) in every graph layer l using a predefined mask \(\textbf{M}_s\), where \(\odot \) denotes the element-wise product. Inspired by [47], we emphasize spatial locality based on skeletal connections and mirror-symmetry tendencies. We define the predefined mask matrix as

$$\begin{aligned} \textbf{M}_s[i][j] = {\left\{ \begin{array}{ll} 1, &{} v_i \text{ and } v_j \text{ are physically connected, } f_i=f_j \\ 1, &{} v_i \text{ and } v_j \text{ are mirror-symmetric, } f_i=f_j \\ 0, &{} \text{otherwise}. \end{array}\right. } \end{aligned}$$
(9)
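Constructing this mask can be sketched as follows; the bone and mirror joint-index pairs are skeleton-specific inputs assumed to be given, and a frequency-major node ordering is assumed.

```python
import torch

def build_spatial_mask(M: int, V: int, bone_edges, mirror_pairs) -> torch.Tensor:
    """Sketch of Eq. (9): an (M*V, M*V) binary mask keeping only edges
    between physically connected or mirror-symmetric joints that share
    the same frequency component."""
    joint_mask = torch.zeros(V, V)
    for a, b in list(bone_edges) + list(mirror_pairs):
        joint_mask[a, b] = joint_mask[b, a] = 1.0
    # The condition f_i = f_j makes the full mask block-diagonal over the
    # M frequency components.
    return torch.block_diag(*([joint_mask] * M))
```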

Finally, our architecture consists of four original STGCN layers without spatial pruning and four pruned STGCN layers, as illustrated in Fig. 3. Please refer to the supplementary material for more details of the architecture.

4 Experiments

4.1 Experimental Setup for Diverse Prediction

Datasets. We evaluate on two motion capture datasets, Human3.6M [30] and HumanEva-I [60]. Human3.6M consists of 11 subjects and 3.6 million frames recorded at 50 Hz. Following [51, 72], we use a 17-joint skeleton representation and train our model to predict 100 future frames given 25 past frames, without global translation. We train on five subjects (S1, S5, S6, S7, and S8) and test on two subjects (S9 and S11). HumanEva-I contains 3 subjects recorded at 60 Hz. Following [51, 72], the pose is represented by 15 joints. We use the official train/test split [60]. The model forecasts 60 future frames given 15 past frames.

Metrics. For a fair comparison, we measure the diversity and accuracy of the predictions according to the evaluation metrics in [2, 51, 70, 72]. (1) Average Pairwise Distance (APD): average \(\ell _2\) distance between all prediction pairs, defined as \(\frac{1}{K(K-1)} \sum _{i=1}^K \sum _{j\ne i}^K \Vert \widehat{\textbf{Y}}_i - \widehat{\textbf{Y}}_j\Vert _2\). (2) Average Displacement Error (ADE): average \(\ell _2\) distance over time between the ground truth and the closest prediction, computed as \(\frac{1}{T_p}\min _k \Vert \widehat{\textbf{Y}}_k - \textbf{Y}\Vert _2\). (3) Final Displacement Error (FDE): \(\ell _2\) distance between the last frames of the ground truth and of the closest prediction, defined as \(\min _k\Vert \widehat{\textbf{Y}}_k[T_p] - \textbf{Y}[T_p]\Vert _2\). To measure the ability to produce multi-modal predictions, we also report multi-modal versions of ADE and FDE. We define the multi-modal ground truth [70] as \(\{\textbf{Y}_n\}_{n=1}^N\), clustered based on historical pose distances and representing possible multi-modal future motions; details are in the supplementary material. (4) Multi-Modal ADE (MMADE): the average displacement error between the predictions and the multi-modal ground truth, denoted as \(\frac{1}{NT_p}\sum _{n=1}^N\min _k\Vert \widehat{\textbf{Y}}_k - \textbf{Y}_n\Vert _2\). (5) Multi-Modal FDE (MMFDE): the final displacement error between the predictions and the multi-modal ground truth, denoted as \(\frac{1}{N}\sum _{n=1}^N\min _k\Vert \widehat{\textbf{Y}}_k[T_p] - \textbf{Y}_n[T_p]\Vert _2\). All metrics are in meters.
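For reference, the sample-based metrics can be sketched as below; the per-frame flattening convention is our assumption, and the evaluation protocols of [51, 72] are authoritative.

```python
import torch

def apd(preds: torch.Tensor) -> torch.Tensor:
    """APD for K predictions; preds: (K, T_p, V, 3)."""
    K = preds.shape[0]
    flat = preds.reshape(K, -1)
    return torch.cdist(flat, flat).sum() / (K * (K - 1))

def ade_fde(preds: torch.Tensor, gt: torch.Tensor):
    """ADE/FDE against a single ground truth gt: (T_p, V, 3). For MMADE and
    MMFDE, average these minima over the multi-modal ground truth set."""
    K, T_p = preds.shape[0], gt.shape[0]
    d = (preds - gt.unsqueeze(0)).reshape(K, T_p, -1).norm(dim=-1)  # (K, T_p)
    ade = d.mean(dim=1).min()  # closest sample, averaged over time
    fde = d[:, -1].min()       # closest sample at the final frame
    return ade, fde
```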

Baselines. To evaluate our stochastic motion prediction method, we consider two types of baselines: (1) stochastic methods, including the CVAE-based Pose-Knows [64] and MT-VAE [66], as well as the CGAN-based HP-GAN [4]; and (2) diversity-promoting methods, including Best-of-Many [5], GMVAE [17], DeLiGAN [23], DSF [70], DLow [72], MOJO [75], and GSPS [51].

Implementation Details. The backbone consists of 8 GCN layers. We perform spatial pruning on 4 of them (denoted as 'Pruned'); the remaining 4 layers are not pruned. In each layer, we use batch normalization [29] and residual connections. We add the K spatial-temporal compositional anchors at layers 4 and 6, and perform random sampling at layer 5. Here, \(K=50\) unless otherwise specified. For Human3.6M, the model is trained for 500 epochs, with a batch size of 16 and 5000 training instances per epoch. For HumanEva-I, the model is trained for 500 epochs, with a batch size of 16 and 2000 training instances per epoch. Additional implementation details are provided in the supplementary material.

4.2 Quantitative Results and Ablation of Diverse Prediction

We compare our method with the baselines in Table 1 on Human3.6M and HumanEva-I. We produce one prediction with each spatial-temporal anchor for a total of 50 predictions, consistent with the literature [51, 72]. On all metrics, our method consistently outperforms all baselines on both datasets. Methods such as GMVAE [17] and DeLiGAN [23] achieve relatively low accuracy (ADE, FDE, MMADE, and MMFDE) and diversity (APD), since they still rely on pure random sampling. Methods such as DSF [70], DLow [72], and GSPS [51] explicitly promote diversity by introducing assumptions in the latent codes or directly in the generation process. Instead, we use anchors to locate diverse modes learned directly from the data, which proves more effective.

Table 1. Quantitative results on Human3.6M and HumanEva-I for \(K=50\). Our model significantly outperforms all stochastic prediction baselines on all metrics. The baseline results are reported from [51, 72, 75]
Table 2. Ablation study on Human3.6M and HumanEva-I for \(K=50\). We compare the following 4 cases: (I) 50 original anchors; (II) 2 temporal anchors and 25 spatial anchors; (III) 50 spatial-temporal compositional anchors from 5 temporal anchors and 10 spatial anchors; (IV) 50 spatial-temporal compositional anchors for both levels

Effectiveness of Multi-level Spatial-Temporal Anchors. As shown in Table 2, compared with not using spatial-temporal decoupling (I), using it (II and III) leads to relatively lower diversity, but facilitates mode capture and results in higher accuracy on Human3.6M. Applying the multi-level mechanism (IV) improves diversity, but sacrifices a little accuracy on Human3.6M. By contrast, we observe improvements in both diversity and accuracy on HumanEva-I. The results suggest that there is an intrinsic trade-off between diversity and accuracy. Higher diversity indicates that the model has a better chance of covering multiple modes. However, when the diversity exceeds a certain level, the trade-off between diversity and accuracy becomes noticeable.

Fig. 4. Ablation study on Human3.6M. We report ADE, MMADE, FDE, and MMFDE, comparing settings with different numbers of anchors and samples.

Impact of Number of Anchors and Samples. We investigate the effect of two important hyperparameters, i.e., the number of anchors and the number of samples. As illustrated in Fig. 4(a), we fix the number of samples to 50 and compare the results when the number of anchors varies over {0, 5, 10, 25, 50}. The results show that more anchors enable the model to better capture the major modes (ADE, FDE) as well as the other modes (MMADE, MMFDE). In Fig. 4(b), we vary the sample size over {10, 20, 50, 100}, keeping the number of anchors equal to the number of samples. The results show that the larger the number of samples, the easier it is for some sample to approach the ground truth.

Table 3. Ablation study on Human3.6M for \(K=100\). We demonstrate the generalizability of our anchor-based sampling. For a fair comparison, we add single-level anchor-based sampling to GSPS [51] and IE-STGCN, without changing any other design and without using spatial-temporal decomposition. We observe that our anchor-based sampling mechanism consistently improves diversity and accuracy for both approaches. Meanwhile, our backbone is more lightweight but performs better

Generalizability of Anchor-Based Sampling. In Table 3, we demonstrate that our anchor-based sampling is model-agnostic and can be inserted as a plug-and-play module into different motion predictors. Concretely, applying our anchor-based sampling to the baseline method GSPS [51] yields consistent improvements under every metric, with the gains in diversity and multi-modal accuracy being particularly evident. For simplicity, this evaluation uses single-level anchors, yet the improvements are pronounced. We also emphasize that the total number of parameters in our IE-STGCN predictor is only \(\mathbf {22\%}\) of that in GSPS.

Fig. 5. Visualization of end poses on Human3.6M. We show the historical poses as red and black skeletons, and the predicted end poses in purple and green. As highlighted by the red and blue dashed boxes, the best predictions of our method are closer to the ground truth than those of the state-of-the-art baseline GSPS [51].

Fig. 6. Visualization of controllable motion prediction on Human3.6M and HumanEva-I. We control the trends and speeds of motions by controlling the spatial and temporal anchors. For example, the third and fourth rows show similar motion trends, but the motion in the third row is faster.

4.3 Qualitative Results of Diverse Prediction

We visualize the start pose, the end pose of the ground truth future motion, and the end poses of 10 motion samples in Fig. 5. The qualitative comparisons support the ADE results in Table 1: our best predicted samples are closer to the ground truth. In Fig. 6 and Fig. 7, we provide the predicted sequences sampled every ten frames. As mentioned before, our spatial-temporal anchors provide a new form of control over spatial-temporal aspects. With the same temporal anchor, the motion frequencies are similar, but the motion patterns differ. Conversely, if we fix the spatial anchor, the motion trends are similar, but the speed may differ. We show smooth control through the linear interpolation of spatial-temporal anchors in Fig. 7. The interpolated anchors produce interesting and valid pose sequences, and smooth changes in spatial trend and temporal velocity can be observed.

Fig. 7. Linear interpolation of anchors. We seamlessly control the trends and speeds of future motions by linearly interpolating spatial and temporal anchors. Specifically, given two anchors \(\textbf{a}_1\) and \(\textbf{a}_2\) and a coefficient \(\alpha \), we produce predictions from the interpolated anchor \((1-\alpha )\textbf{a}_1 + \alpha \textbf{a}_2\).
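A sketch of this interpolation sweep is shown below; `generate` stands in for a forward pass of the model with a fixed noise sample, and all names are illustrative.

```python
import torch

def interpolate_anchors(generate, a1: torch.Tensor, a2: torch.Tensor,
                        X: torch.Tensor, steps: int = 5):
    """Decode predictions from anchors interpolated between a1 and a2."""
    outputs = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        anchor = (1.0 - alpha) * a1 + alpha * a2  # (1 - alpha) a_1 + alpha a_2
        outputs.append(generate(anchor, X))
    return outputs
```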

4.4 Effectiveness on Deterministic Prediction

Our model easily extends to deterministic prediction by setting \(K=1\). Without diverse sampling, we retrain two deterministic variants: IE-STGCN-Short, dedicated to short-term prediction, and IE-STGCN-Long, for long-term prediction. We use different settings for deterministic prediction, following existing work [16, 52, 61], for fair comparison. Here, we evaluate on Human3.6M and use the 22-joint skeleton representation. Given a 400 ms historical motion sequence, the model generates a 400 ms motion for short-term prediction and a 1000 ms motion for long-term prediction. We use five subjects (S1, S6, S7, S8, and S9) for training and subject 5 (S5) for testing. We compare our two model variants with recent state-of-the-art deterministic prediction baselines: LTD [52], STS-GCN [61], and MSR-GCN [16]. We report the Mean Per Joint Position Error (MPJPE) [30] in millimeters at each time step, defined as \(\frac{1}{V} \sum _{i=1}^V\Vert \widehat{\textbf{y}}_{t}[i] - \textbf{y}_{t}[i]\Vert _2\), where \(\widehat{\textbf{y}}_{t}[i]\) and \(\textbf{y}_{t}[i]\) are the predicted and ground truth 3D positions of the i-th joint at time t. Table 4 includes short-term (80\(\sim \)400 ms) and long-term (560\(\sim \)1000 ms) comparisons, showing that our models outperform the baselines on both horizons. Additional experimental results and implementation details of the two deterministic models are provided in the supplementary material.
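A minimal sketch of this metric, in the units of its inputs:

```python
import torch

def mpjpe(pred_t: torch.Tensor, gt_t: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Position Error at one time step; pred_t, gt_t: (V, 3)."""
    return (pred_t - gt_t).norm(dim=-1).mean()
```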

Table 4. Quantitative results on Human3.6M for \(K=1\). Both our long-term and short-term deterministic models significantly outperform all deterministic baselines

5 Conclusion

In this paper, we present a simple yet effective approach, STARS, for predicting multiple plausible and diverse future motions. Our spatial-temporal anchors further enable novel controllable motion prediction. To incorporate these anchors, we propose a novel motion predictor, IE-STGCN. Extensive experiments on Human3.6M and HumanEva-I show the state-of-the-art performance of our unified approach for both diverse and deterministic motion prediction. In the future, we will consider human-scene interaction and investigate integrating our predictor into human-robot interaction systems.