Abstract
Predicting diverse human motions given a sequence of historical poses has received increasing attention. Despite rapid progress, existing work captures the multi-modal nature of human motions primarily through likelihood-based sampling, where mode collapse has been widely observed. In this paper, we propose a simple yet effective approach that disentangles randomly sampled codes from a deterministic learnable component named anchors to promote sample precision and diversity. Anchors are further factorized into spatial anchors and temporal anchors, which provide interpretable control over spatial-temporal disparity. In principle, our spatial-temporal anchor-based sampling (STARS) can be applied to different motion predictors. Here we propose an interaction-enhanced spatial-temporal graph convolutional network (IE-STGCN) that encodes prior knowledge of human motions (e.g., spatial locality), and incorporate the anchors into it. Extensive experiments demonstrate that our approach outperforms the state of the art in both stochastic and deterministic prediction, suggesting it as a unified framework for modeling human motions. Our code and pretrained models are available at https://github.com/Sirui-Xu/STARS.
Y.-X. Wang and L.-Y. Gui contributed equally to this work.
1 Introduction
Predicting the evolution of the surrounding physical world over time is an essential aspect of human intelligence. For example, in a seamless interaction, a robot is supposed to have some notion of how people move or act in the near future, conditioned on a series of historical movements. Human motion prediction has thus been widely used in computer vision and robotics, such as autonomous driving [54], character animation [62], robot navigation [59], motion tracking [48], and human-robot interaction [7, 34, 35, 38]. Owing to deep learning techniques, there has been significant progress over the past few years in modeling and predicting motions. Despite notable successes, forecasting human motions, especially over longer time horizons (i.e., up to several seconds), is fundamentally challenging, because of the difficulty of modeling multi-modal motion dynamics and uncertainty of human conscious movements. Learning such uncertainty can, for example, help reduce the search space in motion tracking problems.
As a powerful tool, deep generative models are thus introduced for this purpose, where random codes from a prior distribution are employed to capture the multi-modal distribution of future human motions. However, current motion capture datasets are typically constructed in a way that there is only a single ground truth future sequence for each single historical sequence [30, 60], which makes it difficult for generators to model the underlying multi-modal densities of future motion distribution. Indeed, in practice, generators tend to ignore differences in random codes and simply produce similar predictions. This is known as mode collapse – the samples are concentrated in the major mode, as depicted with a representative example in Fig. 1, which has been widely observed [72]. Recent work has alleviated this problem by explicitly promoting diversity in sampling using post-hoc diversity mappings [72], or through sequentially generating different body parts [51] to achieve combinatorial diversity. These techniques, however, induce additional modeling complexity, without guaranteeing that the diversity modeling accurately covers multiple plausible modes of human motions.
To this end, we propose a simple yet effective strategy – Multi-Level Spatial-Temporal AnchoR-Based Sampling (STARS) – with the key insight that future motions are not completely random or independent of each other; they share some deterministic properties in line with physical laws and human body constraints, and continue trends of historical movements. For example, we may expect changes in velocity or direction to be shared deterministically among some future motions, whereas their magnitudes might differ stochastically. Based on this observation, we disentangle latent codes in the generative model into a stochastic component (noise) and a deterministic learnable component named anchors. With this disentanglement, the diversity of predictions is jointly affected by random noise as well as anchors that are learned to be specialized for certain modes of future motion. In contrast, the diversity from traditional generative models is determined solely by independent noise, as depicted in Fig. 1. Now, on the one hand, random noise only accounts for modeling the uncertainty within the mode identified by the anchor, which reduces the burden of having to model the entire future diversity. On the other hand, the model can better capture deterministic states of multiple modes by directly optimizing the anchors, thereby reducing the modeling complexity.
Naturally, human motions exhibit variation in the spatial and temporal domains, and these two types of variation are comparatively independent. Inspired by this, we propose a further decomposition to factorize anchors into spatial and temporal anchors. Specifically, our designed spatial anchors capture future motion variation at the spatial level, but remain constant at the temporal level, and vice versa. Another appealing property of our approach is that, by introducing straightforward linear interpolation of spatial-temporal anchors, we achieve flexible and seamless control over the predictions (Fig. 6 and Fig. 7). Unlike low-level controls that combine motions of different body parts [51, 72], our work enables manipulation of future motions in the native space and time, which is an under-explored problem. Additionally, we propose a multi-level mechanism for spatial-temporal anchors to capture multi-scale modes of future motions.
As a key advantage, spatial-temporal anchors are compatible with any motion predictor. Here, we introduce an Interaction-Enhanced Spatial-Temporal Graph Convolutional Network (IE-STGCN). This model encodes the spatial locality of human motion and achieves state-of-the-art performance as a motion predictor.
Our contributions can be summarized as follows. (1) We propose a novel anchor-based generative model that formulates sampling as learning deterministic anchors with likelihood sampling to better capture the multiple modes of human motions. (2) We propose a multi-level spatial-temporal decomposition of anchors for interpretable control over future motions. (3) We develop a spatial-temporal graph neural network with interaction enhancement to incorporate our anchor-based sampling. (4) We demonstrate that our approach, as a unified framework for modeling human motions, significantly outperforms state-of-the-art models in both diverse and deterministic human motion prediction.
2 Related Work
Deterministic Motion Prediction. Existing work on deterministic human motion forecasting predicts a single future motion based on a sequence of past poses [1, 6, 15, 42, 49], or video frames [11, 71, 74], or under the constraints of the scene context [8, 12, 25], by using recurrent neural networks (RNNs) [63], temporal convolutional networks (TCNs) [3], and graph neural networks (GNNs) [33] for sequence modeling. Common early trends involve the use of RNNs [20,21,22, 31, 65], which are limited in long-term temporal encoding due to error accumulation [18, 53] and training difficulty [55]. Some recent attempts exploit GNNs [16, 50] to encode poses from the spatial level, but such work still relies on RNNs [41], CNNs [14, 39, 40], or feed-forward networks [52] for temporal modeling. Recently, spatial-temporal graph convolutional networks (STGCNs) [61, 67, 69] are proposed to jointly encode the spatial and temporal correlations with spatial-temporal graphs. Continuing this effort, we propose IE-STGCN, which additionally encodes inductive biases such as spatial locality into STGCNs.
Stochastic Motion Prediction. Stochastic human motion prediction is an emerging trend with the development of deep generative models such as variational autoencoders (VAEs) [32], generative adversarial networks (GANs) [19], and normalizing flows (NFs) [58]. Most existing work [2, 4, 26, 37, 43, 64, 66, 73] produces various predictions from a set of codes independently sampled from a given distribution. As depicted in DLow [72], such likelihood-based sampling cannot produce enough diversity, as many samples are merely perturbations in the major mode. To overcome the issue, DLow employs a two-stage framework, using post-hoc mappings to shape the latent samples to improve the diversity. GSPS [51] generates different body parts in a sequential manner to achieve combinatorial diversity. Nevertheless, their explicit promotion of diversity induces additional complexity but does not directly enhance multi-mode capture. We introduce anchors that are comparatively easy to optimize, to locate deterministic components of motion modes and impose sample diversity.
Controllable Motion Prediction. Controllable motion prediction has been explored in computer graphics for virtual character generation [27, 28, 44]. In the prediction task, DLow [72] and GSPS [51] propose to control the predicted motion by separating upper and lower body parts, fixing one part while controlling the diversity of the other. In this paper, through the use of spatial-temporal anchors, we propose different but more natural controllability in native space and time. By varying and interpolating the spatial and temporal anchors, we achieve high-level control over the spatial and temporal variation, respectively.
Learnable Anchors. Our anchor-based sampling, i.e., sampling with deterministic learnable codes, is inspired by work on leveraging predefined primitives and learnable codes for applications such as trajectory prediction [10, 13, 36, 46, 57], object detection [9, 45], human pose estimation [68], and video representation learning [24]. Anchors usually refer to the hypothesis of predictions, such as box candidates with different shapes and locations in object detection [45]. In a similar spirit, anchors in the context of human motion prediction indicate assumptions about future movements. The difference is that the anchors here are not hand-crafted or predefined primitives; instead, they are latent codes learned from the data. In addition, we endow anchors with explainability: they describe the multi-level spatial-temporal variation of future motions.
3 Methodology
Problem Formulation. We denote the input motion sequence of length \(T_h\) as \(\textbf{X}=[\textbf{x}_1, \textbf{x}_2,\ldots ,\textbf{x}_{T_h}]^T\), where the 3D coordinates of V joints are used to describe each pose \(\textbf{x}_i \in \mathbb {R}^{{V}\times C^{(0)}}\). Here, we have \(C^{(0)} = 3\). The K output sequences of length \(T_p\) are denoted as \(\widehat{\textbf{Y}}_1, \widehat{\textbf{Y}}_2, \ldots , \widehat{\textbf{Y}}_K\). We have access to a single ground truth future motion of length \(T_p\) as \(\textbf{Y}\). Our objectives are: (1) one of the K predictions is as close to the ground truth as possible; and (2) the K sequences are as diverse as possible, yet representing realistic future motions.
In this section, we first briefly review deep generative models, describe how they draw samples to generate multiple futures, and discuss their limitations (Sect. 3.1). We then detail our insights on STARS including anchor-based sampling and multi-level spatial-temporal anchors (Sect. 3.1 and Fig. 2). To model the human motion, we design an IE-STGCN and incorporate our spatial-temporal anchors into it (Sect. 3.2), as illustrated in Fig. 3.
3.1 Multi-level Spatial-Temporal Anchor-Based Sampling
Preliminaries: Deep Generative Models. There is a large body of work on the generation of multiple hypotheses with deep generative models, most of which learn a parametric probability distribution function explicitly or implicitly. Let \(p(\textbf{Y}|\textbf{X})\) denote the distribution of the future human motion \(\textbf{Y}\) conditioned on the past sequence \(\textbf{X}\). With a latent variable \(\textbf{z}\in \mathcal {Z}\), the distribution can be reparameterized as \(p(\textbf{Y}|\textbf{X}) = \int p(\textbf{Y}|\textbf{X},\textbf{z})p(\textbf{z}) \textrm{d} \textbf{z}\), where \(p(\textbf{z})\) is often a Gaussian prior distribution. To generate a future motion sequence \(\mathbf {\widehat{Y}}\), \(\textbf{z}\) is drawn from the given distribution \(p(\textbf{z})\), and then a deterministic generator \(\mathcal {G}:\mathcal {Z}\times \mathcal {X} \rightarrow \mathcal {Y}\) is used for mapping, as illustrated in Fig. 2(a):
\[\textbf{z} \sim p(\textbf{z}), \quad \widehat{\textbf{Y}} = \mathcal {G}(\textbf{z}, \textbf{X}),\]
where \(\mathcal {G}\) is a deep neural network parameterized by \(\theta \). The goal of generative modeling is to make the distribution \(p_{\theta }(\mathbf {\widehat{Y}}|\textbf{X})\) derived from the generator \(\mathcal {G}\) close to the actual distribution \(p(\textbf{Y}|\textbf{X})\).
To generate K diverse motion predictions, traditional approaches first independently sample a set of latent codes \(Z = \{\textbf{z}_1, \ldots , \textbf{z}_K\}\) from a prior distribution \(p(\textbf{z})\). Although in theory, generative models are capable of covering different modes, they are not guaranteed to locate all the modes precisely, and mode collapse has been widely observed [70, 72].
Anchor-Based Sampling. To address this problem, we propose a simple yet effective sampling strategy. Our intuition is that the diversity in future motions could be characterized by: (1) a deterministic component – across different actions performed by different subjects, there exist correlated or shareable changes in velocity, direction, movement patterns, etc., which naturally emerge and can be directly learned from data; and (2) a stochastic component – given an action carried out by a subject, the magnitude of these changes varies stochastically.
Therefore, we disentangle the code in the latent space of the generative model into a stochastic component sampled from \(p(\textbf{z})\), and a deterministic component represented by a set of K learnable parameters called anchors \(\mathcal {A} = \{\textbf{a}_k\}_{k=1}^K\). Deterministic anchors are expected to identify as many modes as possible, which is achieved through a carefully designed optimization, while stochastic noise further specifies motion variation within certain modes. With this latent code disentanglement, we denote the new multi-modal distribution as
\[p(\textbf{Y}|\textbf{X}, \mathcal {A}) = \frac{1}{K}\sum _{k=1}^{K} \int p(\textbf{Y}|\textbf{X},\textbf{z},\textbf{a}_k)\,p(\textbf{z})\, \textrm{d} \textbf{z}.\]
Consequently, as illustrated in Fig. 2(b), suppose we select the k-th learned anchor \(\textbf{a}_k \in \mathcal {A}\), along with the randomly sampled noise \(\textbf{z}\in Z\), we can generate the prediction \(\widehat{\textbf{Y}}_k\) as,
\[\widehat{\textbf{Y}}_k = \mathcal {G}(\textbf{a}_k, \textbf{z}, \textbf{X}).\]
We can produce a total of K predictions by using each anchor once, although the anchors need not all be used, and an anchor may be used more than once. To incorporate anchors into the network, we find it effective to simply add the selected anchors to the latent features, as shown in Fig. 3.
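As an illustration, anchor-based sampling can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the released implementation: `toy_generator`, the 2-D latent feature, and the way the noise is injected are hypothetical stand-ins, and the anchors are fixed by hand here, whereas in our method they are learned. Only the additive combination of anchor and latent feature follows the description above.

```python
import random


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def toy_generator(anchor, z, x_feat):
    """Hypothetical stand-in for G(a_k, z, X): the anchor and the noise
    are simply added to a latent feature of the past motion."""
    return add(add(x_feat, anchor), z)


def sample_with_anchors(anchors, x_feat, noise_std=0.1, rng=None):
    """Produce one prediction per anchor (K predictions in total)."""
    rng = rng or random.Random(0)
    preds = []
    for a_k in anchors:
        z = [rng.gauss(0.0, noise_std) for _ in x_feat]  # z ~ p(z)
        preds.append(toy_generator(a_k, z, x_feat))
    return preds


# K = 3 hand-picked anchors over a 2-D latent feature (illustrative values).
anchors = [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
x_feat = [0.5, 0.5]

# With the noise switched off, each output is fully determined by its anchor.
preds = sample_with_anchors(anchors, x_feat, noise_std=0.0)
```

With a non-zero `noise_std`, each output varies around the location set by its anchor, which mirrors the division of labor between anchors (modes) and noise (variation within a mode).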
Spatial-Temporal Compositional Anchors. We observe that the diversity of future motions can be roughly divided into two types, namely spatial variation and temporal variation, which are relatively independent. This sheds light on a feasible further decomposition of the K anchors into two types of learnable codes: spatial anchors \(\mathcal {A}_s = \{\textbf{a}^s_i\}_{i=1}^{K_s}\) and temporal anchors \(\mathcal {A}_t = \{\textbf{a}^t_j\}_{j=1}^{K_t}\), where \(K = K_s \times K_t\). With this decomposition, we still can yield a total of \(K_s \times K_t\) compositional anchors through each pair of spatial-temporal anchors. Note that the temporal anchors here, in fact, control the frequency variation of future motion sequences, since our temporal features are in the frequency domain, as we will demonstrate in Sect. 3.2. To be more specific, conceptually, all spatial anchors are set to be identical in the temporal dimension but characterize the variation of motion in the spatial dimension, taking control of the movement trends and directions. Meanwhile, all temporal anchors remain unchanged in the spatial dimension but differ in the temporal dimension, producing disparities in frequency to affect the movement speed.
To produce \(\widehat{\textbf{Y}}_k\), as depicted in Fig. 2(c), we sample \(\textbf{z}\) and select i-th spatial anchor \(\textbf{a}_i^s\) and j-th temporal anchor \(\textbf{a}_j^t\),
\[\widehat{\textbf{Y}}_k = \mathcal {G}(\textbf{a}_i^s + \textbf{a}_j^t, \textbf{z}, \textbf{X}),\]
where \(\textbf{a}_i^s + \textbf{a}_j^t\) is a spatial-temporal compositional anchor corresponding to an original anchor \(\textbf{a}_k\). Furthermore, motion control over spatial and temporal variation can be customized through these spatial-temporal anchors. For example, we can produce future motions with similar trends by fixing the spatial anchors while varying or interpolating the temporal anchors, as shown in Sect. 4.3.
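The composition and interpolation of spatial and temporal anchors can be sketched as follows. The tensor layout is an assumption made for illustration: a spatial anchor is stored as a V x C array that is repeated over the M frequency components, and a temporal anchor as an M x C array that is repeated over the V joints. The row-major mapping from the flat index k to the pair (i, j) and all numeric values are likewise hypothetical.

```python
def compose_anchor(spatial, temporal):
    """Broadcast-add a spatial anchor (V x C) and a temporal anchor (M x C).

    The result has shape M x V x C: the spatial part is repeated over the
    M frequency components and the temporal part over the V joints.
    """
    return [[[s + t for s, t in zip(s_row, t_row)] for s_row in spatial]
            for t_row in temporal]


def index_to_pair(k, num_temporal):
    """Map a flat index k to (i, j); the row-major order is an assumption."""
    return divmod(k, num_temporal)


def lerp(a, b, alpha):
    """Linearly interpolate two anchors of the same 2-D shape."""
    return [[(1 - alpha) * x + alpha * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]


# Toy sizes: V = 2 joints, M = 3 frequencies, C = 1 channel, K_s = K_t = 2.
spatial_anchors = [[[1.0], [2.0]], [[-1.0], [-2.0]]]
temporal_anchors = [[[0.0], [0.1], [0.2]], [[0.0], [0.5], [1.0]]]

# K = K_s * K_t compositional anchors, one per (i, j) pair.
anchors = []
for k in range(len(spatial_anchors) * len(temporal_anchors)):
    i, j = index_to_pair(k, len(temporal_anchors))
    anchors.append(compose_anchor(spatial_anchors[i], temporal_anchors[j]))

# Halfway between the two temporal anchors, with the spatial anchor fixed.
blended = compose_anchor(spatial_anchors[0],
                         lerp(temporal_anchors[0], temporal_anchors[1], 0.5))
```

Only K_s + K_t anchors are stored, yet K_s x K_t combinations are available, and fixing one factor while interpolating the other changes a single kind of variation at a time.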
Multi-level Spatial-Temporal Anchors. To further learn and capture multi-scale modes of future motions, we propose a multi-level mechanism to extend the spatial-temporal anchors. As an illustration, Fig. 2(d) shows a simple two-level case for this design. We introduce two different spatial-temporal anchor sets, \(\{\mathcal {A}_t^{(1)},\mathcal {A}_s^{(1)}\}\) and \(\{\mathcal {A}_t^{(2)},\mathcal {A}_s^{(2)}\}\), and assign them sequentially to different network parts \(\mathcal {G}^{(1)},\mathcal {G}^{(2)}\). Suppose (i, j) is a spatial-temporal index corresponding to the 1D index k, we can generate \(\widehat{\textbf{Y}}_k\) through a two-level process as
\[\widehat{\textbf{Y}}_k = \mathcal {G}^{(2)}\big (a_i^{s_{2}} + a_j^{t_{2}}, \textbf{z}, \mathcal {G}^{(1)}(a_i^{s_{1}} + a_j^{t_{1}}, \textbf{X})\big ),\]
where \(a_i^{s_{1}} \in \mathcal {A}_s^{(1)}, a_j^{t_{1}} \in \mathcal {A}_t^{(1)}, a_i^{s_{2}} \in \mathcal {A}_s^{(2)}, a_j^{t_{2}} \in \mathcal {A}_t^{(2)}\). As a principled way, anchors can be applied at more levels to encode richer assumptions about future motions.
Training. During training, the model uses each spatial-temporal anchor explicitly to generate K future motions for each past motion sequence. The loss functions are mostly adopted from [51], and we summarize them into three categories: (1) reconstruction losses, which optimize the best predictions under different definitions among the K generated motions, and thus drive each anchor toward its own nearest mode; (2) a diversity-promoting loss that explicitly promotes pairwise distances between predictions, preventing the anchors from collapsing to the same mode; and (3) motion constraint losses that encourage output movements to be realistic. All anchors are directly learned from the data via gradient descent. In the forward pass, we explicitly take every anchor \(\textbf{a}_i \in \mathcal {A}=\{\textbf{a}_k\}_{k=1}^K\) as an additional input to the network and produce a total of K outputs. In the backward pass, each anchor is optimized separately based on its corresponding outputs and losses, while the backbone network is updated based on the fused losses from all outputs. This separate backward pass is handled automatically by PyTorch [56]. Please refer to the supplementary material for more details.
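To make the first two loss categories concrete, the sketch below shows one possible form of each. These are illustrative assumptions, not the exact losses of [51]: `best_of_k_loss` penalizes only the prediction closest to the ground truth, and `diversity_loss` uses a hypothetical exponential repulsion with a free `scale` parameter. Predictions are flattened to plain vectors for brevity.

```python
import math


def l2(u, v):
    """Euclidean distance between two flat vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_of_k_loss(preds, target):
    """Reconstruction term: only the closest of the K predictions is
    penalized, so each anchor is pulled toward the mode it is nearest to."""
    dists = [l2(p, target) for p in preds]
    k_best = min(range(len(preds)), key=dists.__getitem__)
    return dists[k_best], k_best


def diversity_loss(preds, scale=1.0):
    """Illustrative repulsion term: close to 1 when predictions coincide,
    close to 0 when all pairs are far apart."""
    pairs = [(i, j) for i in range(len(preds))
             for j in range(i + 1, len(preds))]
    return sum(math.exp(-l2(preds[i], preds[j]) / scale)
               for i, j in pairs) / len(pairs)


# Three toy predictions and one ground-truth future (flattened).
preds = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
target = [3.0, 5.0]
recon, k_best = best_of_k_loss(preds, target)

collapsed = [[0.0, 0.0]] * 3
spread_penalty = diversity_loss(preds)
collapse_penalty = diversity_loss(collapsed)
```

In this toy case the second prediction is selected by the reconstruction term, and the collapsed set receives the maximum diversity penalty while the spread-out set receives almost none.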
3.2 Interaction-Enhanced Spatial-Temporal Graph Convolutional Network
In principle, our proposed anchor-based sampling permits flexible network architecture. Here, to incorporate our multi-level spatial-temporal anchors, we naturally represent motion sequences as spatial-temporal graphs (to be precise, spatial-frequency graphs), instead of the widely used spatial graphs [51, 52]. Our approach builds upon the Discrete Cosine Transform (DCT) [51, 52] to transform the motion into the frequency domain. Specifically, given a past motion \(\textbf{X}_{1:T_h} \in \mathbb {R}^{T_h \times V \times C^{(0)}}\), where each pose has V joints, we first replicate the last pose \(T_p\) times to get \(\textbf{X}_{1:T_h+T_p} = [\textbf{x}_1, \textbf{x}_2,\ldots ,\textbf{x}_{T_h}, \textbf{x}_{T_h}, \ldots ,\textbf{x}_{T_h}]^T\). With predefined M basis \(\textbf{C} \in \mathbb {R}^{ M\times (T_h+T_p)}\) for DCT, the motion is transformed as
\[\widetilde{\textbf{X}} = \textbf{C}\,\textbf{X}_{1:T_h+T_p}.\]
We formulate \(\widetilde{\textbf{X}} \in \mathbb {R}^{M \times V \times C^{(0)}}\) in the 0-th layer and latent features in any l-th graph layer as spatial-temporal graphs \((\mathcal {V}^{(l)}, \mathcal {E}^{(l)})\) with \(M \times V\) nodes. We specify the node i by the 2D index \((f_i, v_i)\) for joint \(v_i\) with frequency component \(f_i\). The edge \((i, j) \in \mathcal {E}^{(l)}\) associated with the interaction between node i and node j is represented by \(\textbf{Adj}^{(l)}[i][j]\), where the adjacency matrix \(\textbf{Adj}^{(l)} \in \mathbb {R}^{M V\times M V}\) is learnable. We bottleneck spatial-temporal interactions as in [61], by factorizing the adjacency matrix into the product of low-rank spatial and temporal matrices \(\textbf{Adj}^{(l)} = \textbf{Adj}^{(l)}_{s} \textbf{Adj}^{(l)}_{f}\). The spatial adjacency matrix \(\textbf{Adj}^{(l)}_{s} \in \mathbb {R}^{M V\times M V}\) connects only nodes with the same frequency, and the frequency adjacency matrix \(\textbf{Adj}^{(l)}_{f} \in \mathbb {R}^{M V\times M V}\) is only responsible for the interplay between nodes representing the same joint.
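The sparsity pattern of the two factors can be illustrated with a small sketch. The node ordering (node index = f * V + v), the storage of the factors as per-frequency V x V blocks and per-joint M x M blocks, and the sizes used in the parameter count are assumptions made for illustration; the text above only fixes which pairs of nodes each factor may connect.

```python
def spatial_adjacency(per_freq):
    """per_freq[f] is a V x V matrix; the result is MV x MV and links
    only nodes that share the same frequency component f."""
    m, v = len(per_freq), len(per_freq[0])
    adj = [[0.0] * (m * v) for _ in range(m * v)]
    for f in range(m):
        for a in range(v):
            for b in range(v):
                adj[f * v + a][f * v + b] = per_freq[f][a][b]
    return adj


def frequency_adjacency(per_joint):
    """per_joint[j] is an M x M matrix; the result is MV x MV and links
    only nodes that represent the same joint j."""
    v, m = len(per_joint), len(per_joint[0])
    adj = [[0.0] * (m * v) for _ in range(m * v)]
    for j in range(v):
        for f in range(m):
            for g in range(m):
                adj[f * v + j][g * v + j] = per_joint[j][f][g]
    return adj


def matmul(a, b):
    """Plain dense matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def free_parameters(m, v):
    """(factorized, full) number of learnable adjacency entries."""
    return m * v * v + v * m * m, (m * v) ** 2


# Toy graph: M = 2 frequency components, V = 2 joints, all weights 1.
ones = [[1.0, 1.0], [1.0, 1.0]]
adj_s = spatial_adjacency([ones, ones])
adj_f = frequency_adjacency([ones, ones])
adj = matmul(adj_s, adj_f)
```

Each factor is sparse, but their product connects every pair of nodes, so the factorization keeps full spatial-temporal connectivity with far fewer free entries. For hypothetical sizes M = 10 and V = 17, `free_parameters(10, 17)` gives 4590 entries for the factorized form against 28900 for a full matrix.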
The spatial-temporal graph can be conveniently encoded by a graph convolutional network (GCN). Given a set of trainable weights \(\textbf{W}^{(l)} \in \mathbb {R}^{C^{(l)}\times C^{(l+1)}}\) and activation function \(\sigma (\cdot )\), such as ReLU, a spatial-temporal graph convolutional layer projects the input from \(C^{(l)}\) to \(C^{(l+1)}\) dimensions by
\[\textbf{H}^{(l+1)}_k = \sigma \big (\textbf{Adj}^{(l)}\,\textbf{H}^{(l)}_k\,\textbf{W}^{(l)}\big ) = \sigma \big (\textbf{Adj}^{(l)}_{s}\,\textbf{Adj}^{(l)}_{f}\,\textbf{H}^{(l)}_k\,\textbf{W}^{(l)}\big ),\]
where \(\textbf{H}^{(l)}_k \in \mathbb {R}^{M V \times C^{(l)}}\) denotes the latent feature of the prediction \(\widehat{\textbf{Y}}_k\) at l-th layer. The backbone consists of multiple graph convolutional layers. After generating predicted DCT coefficients \(\widetilde{\textbf{Y}}_k \in \mathbb {R}^{M \times V \times C^{(L)}}\) reshaped from \(\textbf{H}^{(L)}_k\), where \(C^{(L)} = 3\), we recover \(\widehat{\textbf{Y}}_k\) via Inverse DCT (IDCT) as
\[\widehat{\textbf{Y}}_k = \textbf{C}^T\,\widetilde{\textbf{Y}}_k,\]
where the last \(T_p\) frames of the recovered sequence represent future poses.
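The padding, DCT, and IDCT steps can be sketched for a single coordinate trajectory as follows. The orthonormal DCT-II basis is an assumption (the specific DCT convention is not stated here), and the sequence lengths and values are toy numbers chosen for illustration.

```python
import math


def dct_basis(num_basis, length):
    """First `num_basis` rows of the orthonormal DCT-II matrix (M x N)."""
    rows = []
    for m in range(num_basis):
        scale = math.sqrt((1.0 if m == 0 else 2.0) / length)
        rows.append([scale * math.cos(math.pi * (n + 0.5) * m / length)
                     for n in range(length)])
    return rows


def pad_last(history, t_pred):
    """Replicate the last observed value t_pred times."""
    return history + [history[-1]] * t_pred


def dct(basis, signal):
    """Project a length-N signal onto the M basis rows (C x)."""
    return [sum(c * x for c, x in zip(row, signal)) for row in basis]


def idct(basis, coeffs):
    """Map M coefficients back to a length-N signal (C^T y)."""
    length = len(basis[0])
    return [sum(basis[m][n] * coeffs[m] for m in range(len(basis)))
            for n in range(length)]


# T_h = 3 observed frames, T_p = 2 future frames, one coordinate.
history = [0.0, 1.0, 2.0]
padded = pad_last(history, 2)            # [0, 1, 2, 2, 2]

full = dct_basis(5, 5)                   # M = T_h + T_p: lossless
recon = idct(full, dct(full, padded))

low = dct_basis(2, 5)                    # M = 2: keeps only low frequencies
smooth = idct(low, dct(low, padded))
future = smooth[-2:]                     # last T_p frames are the prediction
```

With all T_h + T_p basis rows the round trip is exact; with fewer rows the recovered sequence is a smoothed version of the input, which is why a small M acts as a compact, low-frequency representation of the motion.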
Conceptually, interactions between spatial-temporal nodes should be relatively invariant across layers, and different interactions should not be equally important. For example, we would expect constraints and dependencies between “left arm” and “left forearm,” while the movements of “head” and “left forearm” are relatively independent. We consider it redundant to construct a complete spatial-temporal graph for each layer independently. Therefore, we introduce cross-layer interaction sharing to share parameters between graphs in different layers, and spatial interaction pruning to prune the complete graph.
Cross-Layer Interaction Sharing. Much effort has gone into employing learnable interactions between spatial nodes across all graph layers [52, 61, 67]. We consider the spatial relationship to be relatively unchanged across layers. Empirically, we find that sharing the adjacency matrix between every other layer is effective. As shown in Fig. 3, we set \(\textbf{Adj}_s^{(4)} = \textbf{Adj}_s^{(6)} = \textbf{Adj}_s^{(8)}\) and \(\textbf{Adj}_s^{(5)} = \textbf{Adj}_s^{(7)}\).
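One simple way to realize this sharing is to let the layers in a group hold the same matrix object, so that an update made through one layer is seen by the others. The helper below is a hypothetical sketch: the function name, the 2 x 2 placeholder matrices, and the choice to give layers 1 to 3 their own matrices are assumptions, since only the two shared groups are specified above.

```python
def build_spatial_adjacencies(num_layers, shared_groups, make_matrix):
    """Return {layer: matrix}; layers in the same group share one object,
    and every remaining layer gets its own matrix."""
    layer_to_adj = {}
    for group in shared_groups:
        shared = make_matrix()
        for layer in group:
            layer_to_adj[layer] = shared
    for layer in range(1, num_layers + 1):
        if layer not in layer_to_adj:
            layer_to_adj[layer] = make_matrix()
    return layer_to_adj


def zeros_2x2():
    """Placeholder for a learnable spatial adjacency matrix."""
    return [[0.0, 0.0], [0.0, 0.0]]


# Groups taken from the text: layers {4, 6, 8} and layers {5, 7}.
adjs = build_spatial_adjacencies(8, [(4, 6, 8), (5, 7)], zeros_2x2)

# An update through layer 4 is visible in layers 6 and 8, but not in layer 5.
adjs[4][0][0] = 1.0
num_distinct = len({id(m) for m in adjs.values()})
```

In this sketch the eight layers hold five distinct matrices instead of eight.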
Spatial Interaction Pruning. To emphasize the physical relationships and constraints between spatial joints, we prune the spatial connections \(\mathbf {\widehat{Adj}}_s^{(l)} = \textbf{M}_s \odot \textbf{Adj}_s^{(l)}\) in every graph layer l using a predefined mask \(\textbf{M}_s\), where \(\odot \) is an element-wise product. Inspired by [47], we emphasize spatial locality based on skeletal connections and mirror symmetry tendencies. We denote our proposed predefined mask matrix as
\[\textbf{M}_s[i][j] = \begin{cases} 1, & v_i \text{ and } v_j \text{ are physically connected},\ f_i = f_j,\\ 1, & v_i \text{ and } v_j \text{ are mirror-symmetric},\ f_i = f_j,\\ 0, & \text{otherwise}. \end{cases}\]
Finally, our architecture consists of four original STGCN layers without spatial pruning and four pruned STGCN layers, as illustrated in Fig. 3. Please refer to the supplementary material for more information on the architecture.
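The construction of a pruning mask from skeletal connections and mirror symmetry can be sketched on a toy skeleton. The five-joint skeleton, its bone list, and its mirror pairs are hypothetical, and the mask is built for the V x V block of a single frequency component. Whether a joint keeps a connection to itself is not stated in the text, so it is exposed as a flag.

```python
def build_spatial_mask(num_joints, bones, mirror_pairs, self_loops=False):
    """Binary V x V mask that keeps bone and mirror-symmetric connections."""
    mask = [[0] * num_joints for _ in range(num_joints)]
    for a, b in list(bones) + list(mirror_pairs):
        mask[a][b] = mask[b][a] = 1
    if self_loops:
        for v in range(num_joints):
            mask[v][v] = 1
    return mask


def prune(adj, mask):
    """Element-wise product M_s * Adj_s for one frequency block."""
    return [[w * m for w, m in zip(adj_row, mask_row)]
            for adj_row, mask_row in zip(adj, mask)]


# Hypothetical skeleton: 0 pelvis, 1 left hip, 2 right hip,
# 3 left knee, 4 right knee.
bones = [(0, 1), (0, 2), (1, 3), (2, 4)]
mirror_pairs = [(1, 2), (3, 4)]
mask = build_spatial_mask(5, bones, mirror_pairs)

dense = [[0.5] * 5 for _ in range(5)]   # stand-in for a learned Adj_s block
pruned = prune(dense, mask)
```

Here the left hip and right knee are neither connected by a bone nor mirror images of each other, so their interaction is removed, while the two knees remain linked through mirror symmetry.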
4 Experiments
4.1 Experimental Setup for Diverse Prediction
Datasets. We perform evaluation on two motion capture datasets, Human3.6M [30] and HumanEva-I [60]. Human3.6M consists of 11 subjects and 3.6 million frames recorded at 50 Hz. Following [51, 72], we use a 17-joint skeleton representation and train our model to predict 100 future frames given 25 past frames without global translation. We train on five subjects (S1, S5, S6, S7, and S8) and test on two subjects (S9 and S11). HumanEva-I contains 3 subjects recorded at 60 Hz. Following [51, 72], the pose is represented by 15 joints. We use the official train/test split [60]. The model forecasts 60 future frames given 15 past frames.
Metrics. For a fair comparison, we measure the diversity and accuracy of the predictions according to the evaluation metrics in [2, 51, 70, 72]. (1) Average Pairwise Distance (APD): average \(\ell _2\) distance between all prediction pairs, defined as \(\frac{1}{K(K-1)} \sum _{i=1}^K \sum _{j\ne i}^K \Vert \widehat{\textbf{Y}}_i - \widehat{\textbf{Y}}_j\Vert _2\). (2) Average Displacement Error (ADE): average \(\ell _2\) distance over time between the ground truth and the closest prediction, computed as \(\frac{1}{T_p}\min _k \Vert \widehat{\textbf{Y}}_k - \textbf{Y}\Vert _2\). (3) Final Displacement Error (FDE): \(\ell _2\) distance of the last frame between the ground truth and the closest prediction, defined as \(\min _k\Vert \widehat{\textbf{Y}}_k[T_p] - \textbf{Y}[T_p]\Vert _2\). To measure the ability to produce multi-modal predictions, we also report multi-modal versions of ADE and FDE. We define the multi-modal ground truth [70] as \(\{\textbf{Y}_n\}_{n=1}^N\), which is clustered based on historical pose distances, representing possible multi-modal future motions. Details of the multi-modal ground truth are given in the supplementary material. (4) Multi-Modal ADE (MMADE): the average displacement error between the predictions and the multi-modal ground truth, denoted as \(\frac{1}{NT_p}\sum _{n=1}^N\min _k\Vert \widehat{\textbf{Y}}_k - \textbf{Y}_n\Vert _2\). (5) Multi-Modal FDE (MMFDE): the final displacement error between the predictions and the multi-modal ground truth, denoted as \(\frac{1}{N}\sum _{n=1}^N\min _k\Vert \widehat{\textbf{Y}}_k[T_p] - \textbf{Y}_n[T_p]\Vert _2\). All metrics here are in meters.
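For reference, the metrics can be sketched in plain Python. Each sequence is a list of frames and each frame is a flat list of coordinates. Two reading choices are assumptions on our part: APD treats each sequence as one flattened vector, and ADE averages per-frame distances over the T_p frames, in line with the verbal description of an average distance over time.

```python
import math


def frame_dist(a, b):
    """l2 distance between two frames (flat coordinate lists)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def seq_dist(a, b):
    """l2 distance between two whole sequences, flattened."""
    return math.sqrt(sum(frame_dist(fa, fb) ** 2 for fa, fb in zip(a, b)))


def apd(preds):
    """Average pairwise distance over all ordered pairs i != j."""
    k = len(preds)
    total = sum(seq_dist(preds[i], preds[j])
                for i in range(k) for j in range(k) if i != j)
    return total / (k * (k - 1))


def ade(preds, gt):
    """Mean per-frame error of the prediction closest to the ground truth."""
    return min(sum(frame_dist(p, g) for p, g in zip(pred, gt)) / len(gt)
               for pred in preds)


def fde(preds, gt):
    """Last-frame error of the prediction closest to the ground truth."""
    return min(frame_dist(pred[-1], gt[-1]) for pred in preds)


def mmade(preds, multimodal_gt):
    """ADE averaged over the N multi-modal ground-truth futures."""
    return sum(ade(preds, gt) for gt in multimodal_gt) / len(multimodal_gt)


# Toy case: K = 2 predictions, T_p = 2 frames, 2 coordinates per frame.
gt = [[0.0, 0.0], [0.0, 0.0]]
preds = [[[3.0, 4.0], [3.0, 4.0]],
         [[0.0, 0.0], [0.0, 1.0]]]

scores = {"APD": apd(preds), "ADE": ade(preds, gt), "FDE": fde(preds, gt)}
```

In this toy case the second prediction is the closest one under both accuracy metrics, giving an ADE of 0.5 and an FDE of 1.0, while the APD equals the single pairwise distance between the two predictions.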
Baselines. To evaluate our stochastic motion prediction method, we consider two types of baselines: (1) stochastic methods, including the CVAE-based methods Pose-Knows [64] and MT-VAE [66], as well as the CGAN-based method HP-GAN [4]; (2) diversity-promoting methods, including Best-of-Many [5], GMVAE [17], DeLiGAN [23], DSF [70], DLow [72], MOJO [75], and GSPS [51].
Implementation Details. The backbone consists of 8 GCN layers. We perform spatial pruning on 4 GCN layers (denoted as ‘Pruned’). The remaining 4 layers are not pruned. In each layer, we use batch normalization [29] and residual connections. We add K spatial-temporal compositional anchors at layers 4 and 6, and perform random sampling at layer 5. Here, \(K=50\) unless otherwise specified. For Human3.6M, the model is trained for 500 epochs, with a batch size of 16 and 5000 training instances per epoch. For HumanEva-I, the model is trained for 500 epochs, with a batch size of 16 and 2000 training instances per epoch. Additional implementation details are provided in the supplementary material.
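The settings reported above and in the dataset description can be collected in a small configuration object. This is only a convenience sketch: the class and field names are hypothetical and do not correspond to the released code, while the values are those stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiversePredictionConfig:
    """Settings for diverse prediction; field names are illustrative."""
    dataset: str
    frame_rate_hz: int
    num_joints: int
    past_frames: int
    future_frames: int
    instances_per_epoch: int
    num_gcn_layers: int = 8
    num_pruned_layers: int = 4
    anchor_layers: tuple = (4, 6)
    noise_layer: int = 5
    num_anchors: int = 50        # K
    epochs: int = 500
    batch_size: int = 16

    @property
    def horizon_seconds(self):
        """Prediction horizon implied by the frame count and frame rate."""
        return self.future_frames / self.frame_rate_hz


H36M = DiversePredictionConfig("Human3.6M", 50, 17, 25, 100, 5000)
HUMANEVA = DiversePredictionConfig("HumanEva-I", 60, 15, 15, 60, 2000)
```

The derived horizon makes the two setups easier to compare: 100 frames at 50 Hz correspond to 2 s on Human3.6M, and 60 frames at 60 Hz correspond to 1 s on HumanEva-I.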
4.2 Quantitative Results and Ablation of Diverse Prediction
We compare our method with the baselines in Table 1 on Human3.6M and HumanEva-I. We produce one prediction using each spatial-temporal anchor for a total of 50 predictions, which is consistent with the literature [51, 72]. For all metrics, our method consistently outperforms all baselines on both datasets. Methods such as GMVAE [17] and DeLiGAN [23] have relatively low accuracy (ADE, FDE, MMADE, and MMFDE) and diversity (APD), since they still follow a pure random sampling. Methods such as DSF [70], DLow [72] and GSPS [51] explicitly promote diversity by introducing assumptions in the latent codes or directly in the generation process. Instead, we propose to use anchors to locate diverse modes directly learned from the data, which is more effective.
Effectiveness of Multi-level Spatial-Temporal Anchors. As shown in Table 2, compared with not using spatial-temporal decoupling (I), using it (II and III) leads to relatively lower diversity, but facilitates mode capture and results in higher accuracy on Human3.6M. Applying the multi-level mechanism (IV) improves diversity, but sacrifices a little accuracy on Human3.6M. By contrast, we observe improvements in both diversity and accuracy on HumanEva-I. The results suggest that there is an intrinsic trade-off between diversity and accuracy. Higher diversity indicates that the model has a better chance of covering multiple modes. However, when the diversity exceeds a certain level, the trade-off between diversity and accuracy becomes noticeable.
Impact of Number of Anchors and Samples. We investigate the effect of two important hyperparameters on the model, i.e., the number of anchors and the number of samples. As illustrated in Fig. 4(a), we fix the number of samples to 50 and compare the results when the number of anchors is set to 0, 5, 10, 25, and 50. The results show that more anchors enable the model to better capture the major modes (ADE, FDE) and also other modes (MMADE, MMFDE). In Fig. 4(b), we vary the sample size over 10, 20, 50, and 100 and keep the number of anchors the same as the number of samples. The results show that the larger the number of samples is, the easier it is for a sample to approach the ground truth.
Generalizability of Anchor-Based Sampling. In Table 3, we demonstrate that our anchor-based sampling is model-agnostic and can be inserted as a plug-and-play module into different motion predictors. Concretely, we apply our anchor-based sampling to the baseline method GSPS [51], which also achieves consistent improvements under every metric, with improvements in terms of diversity and multi-modal accuracy being particularly evident. For simplicity, this evaluation uses simple single-level anchors, but the improvements are pronounced. We would also like to emphasize that the total number of parameters in our IE-STGCN predictor is only \(\mathbf {22\%}\) of that in GSPS.
4.3 Qualitative Results of Diverse Prediction
We visualize the start pose, the end pose of the ground truth future motions, and the end pose of 10 motion samples in Fig. 5. The qualitative comparisons support the ADE results in Table 1 that our best predicted samples are closer to the ground truth. In Fig. 6 and Fig. 7, we provide the predicted sequences sampled every ten frames. As mentioned before, our spatial-temporal anchors provide a new form of control over spatial-temporal aspects. With the same temporal anchor, the motion frequencies are similar, but the motion patterns are different. Conversely, if we control the spatial anchors to be the same, the motion trends are similar, but the speed might be different. We show a smooth control through the linear interpolation of spatial-temporal anchors in Fig. 7. The new interpolated anchors produce some interesting and valid pose sequences, and smooth changes in spatial trend and temporal velocity can be observed.
4.4 Effectiveness on Deterministic Prediction
Our model can be easily extended to deterministic prediction by specifying \(K=1\). Without diverse sampling, we retrain two deterministic prediction model variants: IE-STGCN-Short dedicated to short-term prediction and IE-STGCN-Long for long-term prediction. For fair comparison, we follow the settings of existing work on deterministic prediction [16, 52, 61], which differ from those used for diverse prediction. Here, we evaluate on Human3.6M and use the 22-joint skeleton representation. Given a 400 ms historical motion sequence, the model generates a 400 ms motion for short-term prediction and a 1000 ms motion for long-term prediction. We use five subjects (S1, S6, S7, S8 and S9) for training and one subject (S5) for testing. We compare our two model variants with recent state-of-the-art deterministic prediction baselines: LTD [52], STS-GCN [61], and MSR-GCN [16]. We report the Mean Per Joint Position Error (MPJPE) [30] in millimeters at each time step, defined as \(\frac{1}{V} \sum _{i=1}^V\Vert \widehat{\textbf{y}}_{t}[i] - \textbf{y}_{t}[i]\Vert _2\), where \(\widehat{\textbf{y}}_{t}[i]\) and \(\textbf{y}_{t}[i]\) are the predicted and ground-truth 3D positions of the i-th joint at time t. Table 4 includes short-term (80\(\sim \)400 ms) and long-term (560\(\sim \)1000 ms) comparisons, showing that our models outperform the baseline models on both short-term and long-term horizons. Additional experimental results and implementation details of our two deterministic prediction models are provided in the supplementary material.
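As a small worked example, MPJPE for a single time step can be computed as follows. The two-joint poses are toy values, and the result is in the same unit as the input coordinates (millimeters in our evaluation).

```python
import math


def mpjpe(pred_pose, gt_pose):
    """Mean Euclidean distance over the V joints of one frame."""
    return sum(math.dist(p, g)
               for p, g in zip(pred_pose, gt_pose)) / len(gt_pose)


# Toy frame with V = 2 joints: one joint is exact, the other is 5 units off.
pred_pose = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt_pose = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(pred_pose, gt_pose)
```

The per-joint errors are 0 and 5, so the frame-level MPJPE is 2.5; the values reported per horizon are obtained by evaluating this quantity at the corresponding time step.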
5 Conclusion
In this paper, we present a simple yet effective approach, STARS, to predict multiple plausible and diverse future motions. Our spatial-temporal anchors also enable a novel form of controllable motion prediction. To incorporate these anchors, we propose a novel motion predictor, IE-STGCN. Extensive experiments on Human3.6M and HumanEva-I show the state-of-the-art performance of our unified approach for both diverse and deterministic motion prediction. In the future, we will consider human-scene interaction and investigate the integration of our predictor into human-robot interaction systems.
References
Aksan, E., Kaufmann, M., Hilliges, O.: Structured prediction helps 3D human motion modelling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7144–7153 (2019)
Aliakbarian, S., Saleh, F.S., Salzmann, M., Petersson, L., Gould, S.: A stochastic conditioning scheme for diverse human motion prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5223–5232 (2020)
Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
Barsoum, E., Kender, J.R., Liu, Z.: HP-GAN: probabilistic 3D human motion prediction via GAN. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1418–1427 (2018)
Bhattacharyya, A., Schiele, B., Fritz, M.: Accurate and diverse sampling of sequences based on a “best of many” sample objective. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8485–8493 (2018)
Butepage, J., Black, M.J., Kragic, D., Kjellstrom, H.: Deep representation learning for human motion prediction and classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6158–6166 (2017)
Bütepage, J., Kjellström, H., Kragic, D.: Anticipating many futures: Online human motion prediction and generation for human-robot interaction. In: IEEE International Conference on Robotics and Automation, pp. 4563–4570 (2018)
Cao, Z., et al.: Long-term human motion prediction with scene context. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 387–404. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_23
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449 (2019)
Chao, Y.W., Yang, J., Price, B., Cohen, S., Deng, J.: Forecasting human dynamics from static images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 548–556 (2017)
Corona, E., Pumarola, A., Alenya, G., Moreno-Noguer, F.: Context-aware human motion prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6992–7001 (2020)
Cui, H., et al.: Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In: International Conference on Robotics and Automation, pp. 2090–2096 (2019)
Cui, Q., Sun, H.: Towards accurate 3D human motion prediction from incomplete observations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4801–4810 (2021)
Cui, Q., Sun, H., Yang, F.: Learning dynamic relationships for 3D human motion prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6519–6527 (2020)
Dang, L., Nie, Y., Long, C., Zhang, Q., Li, G.: MSR-GCN: multi-scale residual graph convolution networks for human motion prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11467–11476 (2021)
Dilokthanakul, N., et al.: Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648 (2016)
Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for human dynamics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4346–4354 (2015)
Goodfellow, I., et al.: Generative adversarial nets. In: 27th Proceedings of the International Conference on Advances in Neural Information Processing Systems (2014)
Gui, L.Y., Wang, Y.X., Liang, X., Moura, J.M.F.: Adversarial geometry-aware human motion prediction. In: European Conference on Computer Vision, pp. 786–803 (2018)
Gui, L.Y., Wang, Y.X., Ramanan, D., Moura, J.M.F.: Few-shot human motion prediction via meta-learning. In: European Conference on Computer Vision, pp. 432–450 (2018)
Gui, L.Y., Zhang, K., Wang, Y.X., Liang, X., Moura, J.M.F., Veloso, M.: Teaching robots to predict human motion. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 562–567 (2018)
Gurumurthy, S., Kiran Sarvadevabhatla, R., Venkatesh Babu, R.: DeLiGAN: generative adversarial networks for diverse and limited data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 166–174 (2017)
Han, T., Xie, W., Zisserman, A.: Memory-augmented dense predictive coding for video representation learning. In: European Conference on Computer Vision, pp. 312–329 (2020)
Hassan, M., et al.: Stochastic scene-aware motion prediction. In: Proceedings of the International Conference on Computer Vision, pp. 11374–11384 (2021)
Hernandez, A., Gall, J., Moreno-Noguer, F.: Human motion prediction via spatio-temporal inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7134–7143 (2019)
Holden, D., Komura, T., Saito, J.: Phase-functioned neural networks for character control. ACM Trans. Graph. 36, 1–13 (2017)
Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. ACM Trans. Graph. 35, 1–11 (2016)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)
Jain, A., Zamir, A.R., Savarese, S., Saxena, A.: Structural-RNN: deep learning on spatio-temporal graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5308–5317 (2016)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
Koppula, H.S., Saxena, A.: Anticipating human activities using object affordances for reactive robotic response. IEEE Trans. Pattern Anal. Mach. Intell. 38(1), 14–29 (2016)
Koppula, H.S., Saxena, A.: Anticipating human activities for reactive robotic response. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2071–2071 (2013)
Kothari, P., Sifringer, B., Alahi, A.: Interpretable social anchors for human trajectory forecasting in crowds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15551–15561 (2021)
Kundu, J.N., Gor, M., Babu, R.V.: BiHMP-GAN: bidirectional 3D human motion prediction GAN. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8553–8560 (2019)
Lasota, P.A., Shah, J.A.: A multiple-predictor approach to human motion prediction. In: IEEE International Conference on Robotics and Automation, pp. 2300–2307 (2017)
Lebailly, T., Kiciroglu, S., Salzmann, M., Fua, P., Wang, W.: Motion prediction using temporal inception module. In: Proceedings of the Asian Conference on Computer Vision (2020)
Li, C., Zhang, Z., Lee, W.S., Lee, G.H.: Convolutional sequence to sequence model for human dynamics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5226–5234 (2018)
Li, M., Chen, S., Zhao, Y., Zhang, Y., Wang, Y., Tian, Q.: Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 214–223 (2020)
Li, X., Li, H., Joo, H., Liu, Y., Sheikh, Y.: Structure from recurrent motion: From rigidity to recurrency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3032–3040 (2018)
Lin, X., Amer, M.R.: Human motion modeling using DVGANs. arXiv preprint arXiv:1804.10652 (2018)
Ling, H.Y., Zinno, F., Cheng, G., Van De Panne, M.: Character controllers using motion VAEs. ACM Trans. Graph. 39(4), 40–1 (2020)
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Liu, Y., Zhang, J., Fang, L., Jiang, Q., Zhou, B.: Multimodal motion prediction with stacked transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7577–7586 (2021)
Liu, Z., et al.: Motion prediction using trajectory cues. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13299–13308 (2021)
Luber, M., Stork, J.A., Tipaldi, G.D., Arras, K.O.: People tracking with human motion predictions from social forces. In: IEEE International Conference on Robotics and Automation, pp. 464–469 (2010)
Lyu, K., Liu, Z., Wu, S., Chen, H., Zhang, X., Yin, Y.: Learning human motion prediction via stochastic differential equations. In: Proceedings of ACM International Conference on Multimedia, pp. 4976–4984 (2021)
Mao, W., Liu, M., Salzmann, M.: History repeats itself: human motion prediction via motion attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 474–489. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_28
Mao, W., Liu, M., Salzmann, M.: Generating smooth pose sequences for diverse human motion prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13309–13318 (2021)
Mao, W., Liu, M., Salzmann, M., Li, H.: Learning trajectory dependencies for human motion prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9489–9497 (2019)
Martinez, J., Black, M.J., Romero, J.: On human motion prediction using recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2891–2900 (2017)
Paden, B., Cáp, M., Yong, S.Z., Yershov, D.S., Frazzoli, E.: A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. Intell. Veh. 1, 33–55 (2016)
Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning, pp. 1310–1318. PMLR (2013)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: 32nd Proceedings of the International Conference on Advances in Neural Information Processing Systems (2019)
Phan-Minh, T., Grigore, E.C., Boulton, F.A., Beijbom, O., Wolff, E.M.: CoverNet: multimodal behavior prediction using trajectory sets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14074–14083 (2020)
Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: International Conference on Machine Learning, pp. 1530–1538. PMLR (2015)
Rudenko, A., Palmieri, L., Arras, K.O.: Joint long-term prediction of human motion using a planning-based social force approach. In: IEEE International Conference on Robotics and Automation, pp. 4571–4577 (2018)
Sigal, L., Balan, A.O., Black, M.J.: HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vision 87(1), 4–27 (2010)
Sofianos, T., Sampieri, A., Franco, L., Galasso, F.: Space-time-separable graph convolutional network for pose forecasting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11209–11218 (2021)
Starke, S., Zhao, Y., Zinno, F., Komura, T.: Neural animation layering for synthesizing martial arts movements. ACM Trans. Graph. 40, 1–16 (2021)
Sutskever, I., Martens, J., Hinton, G.: Generating text with recurrent neural networks. In: International Conference on Machine Learning, pp. 1017–1024 (2011)
Walker, J., Marino, K., Gupta, A., Hebert, M.: The pose knows: Video forecasting by generating pose futures. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3332–3341 (2017)
Wang, B., Adeli, E., Chiu, H.k., Huang, D.A., Niebles, J.C.: Imitation learning for human pose prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7124–7133 (2019)
Yan, X., et al.: MT-VAE: learning motion transformations to generate multimodal human dynamics. In: European Conference on Computer Vision, pp. 276–293 (2018)
Yan, Z., Zhai, D.H., Xia, Y.: DMS-GCN: dynamic mutiscale spatiotemporal graph convolutional networks for human motion prediction. arXiv preprint arXiv:2112.10365 (2021)
Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1385–1392 (2011)
Yu, B., Yin, H., Zhu, Z.: Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875 (2017)
Yuan, Y., Kitani, K.: Diverse trajectory forecasting with determinantal point processes. arXiv preprint arXiv:1907.04967 (2019)
Yuan, Y., Kitani, K.: Ego-pose estimation and forecasting as real-time PD control. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 10082–10092 (2019)
Yuan, Y., Kitani, K.: DLow: diversifying latent flows for diverse human motion prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 346–364. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_20
Yuan, Y., Kitani, K.: Residual force control for agile human behavior imitation and extended motion synthesis. Adv. Neural. Inf. Process. Syst. 33, 21763–21774 (2020)
Zhang, J.Y., Felsen, P., Kanazawa, A., Malik, J.: Predicting 3D human dynamics from video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7114–7123 (2019)
Zhang, Y., Black, M.J., Tang, S.: We are more than our joints: predicting how 3D bodies move. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3372–3382 (2021)
Acknowledgement
This work was supported in part by NSF Grant 2106825, the Jump ARCHES endowment through the Health Care Engineering Systems Center, the New Frontiers Initiative, the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign through the NCSA Fellows program, and the IBM-Illinois Discovery Accelerator Institute.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xu, S., Wang, YX., Gui, LY. (2022). Diverse Human Motion Prediction Guided by Multi-level Spatial-Temporal Anchors. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13682. Springer, Cham. https://doi.org/10.1007/978-3-031-20047-2_15
DOI: https://doi.org/10.1007/978-3-031-20047-2_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20046-5
Online ISBN: 978-3-031-20047-2
eBook Packages: Computer Science, Computer Science (R0)