1 Introduction

Human motion prediction, i.e., predicting the future 3D poses of a person based on past poses, is an important problem in computer vision and has many useful applications in autonomous driving  [53], human-robot interaction  [37] and healthcare  [65]. It is a challenging problem because the future motion of a person is potentially diverse and multi-modal due to the complex nature of human behavior. For many safety-critical applications, it is important to predict a diverse set of human motions instead of just the most likely one. For example, an autonomous vehicle should be aware that a nearby pedestrian can suddenly cross the road even though the pedestrian will most likely remain in place. This diversity requirement calls for a generative approach that can fully characterize the multi-modal distribution of future human motion.

Fig. 1. In the latent space of a conditional variational autoencoder (CVAE), samples (stars) from our method DLow are able to cover more modes (colored ellipses) than the CVAE samples. In the motion space, DLow generates a diverse set of future human motions while the CVAE only produces perturbations of the motion of the major mode.

Deep generative models, e.g., variational autoencoders (VAEs)  [36], are effective tools to model multi-modal data distributions. Most existing work  [3, 6, 40, 44, 59, 66, 69] using deep generative models for human motion prediction focuses on the design of the generative model so that it can effectively learn the data distribution. Once the generative model is learned, little attention has been paid to the sampling method used to produce motion samples (predicted future motions) from the pretrained generative model (weights kept fixed). Most prior work predicts a set of motions by randomly sampling a set of latent codes from the latent prior and decoding them with the generator into motion samples. We argue that such a sampling strategy is not guaranteed to produce a diverse set of samples for two reasons: (1) the samples are independently drawn, which makes it difficult to enforce diversity; (2) the samples are drawn based on likelihood only, which means many samples may concentrate around the major modes (which have more observed data) of the data distribution and fail to cover the minor modes (as shown in Fig. 1 (Bottom)). The poor sample efficiency of random sampling means that one needs to draw a large number of samples to cover all the modes, which is computationally expensive and can lead to high latency, making it unsuitable for real-time applications such as autonomous driving and virtual reality. This prompts us to address an overlooked aspect of diverse human motion prediction—the sampling strategy.

We propose a novel sampling method, Diversifying Latent Flows (DLow), to obtain a diverse set of samples from a pretrained deep generative model. For this work, we use a conditional variational autoencoder (CVAE) as our pretrained generative model, but other generative models can also be used with our approach. DLow is motivated by the two previously mentioned problems with random (independent) sampling. To tackle problem (1), where sample independence limits diversity, we introduce a new random variable and a set of learnable deterministic mapping functions to correlate the motion samples. We first transform the random variable with the mapping functions to generate a set of correlated latent codes, which are then decoded into motion samples by the generator. As all motion samples are generated from a common random factor, this formulation allows us to model the joint sample distribution and gives us the opportunity to impose diversity on the samples by optimizing the parameters of the mapping functions. To address problem (2), where likelihood-based sampling limits diversity, we introduce a diversity-promoting prior (loss function) on the samples during the training of DLow. The prior follows an energy-based formulation with an energy function based on pairwise sample distance. During training, we optimize the mapping functions to minimize the cross entropy between the joint sample distribution and the diversity-promoting prior, thereby increasing sample diversity. To strike a balance between diversity and likelihood, we add a KL term to the optimization that enhances the likelihood of each sample. The relative weight of the prior term and the KL term controls the trade-off between the diversity and likelihood of the generated motion samples. Furthermore, our approach is highly flexible: by designing different forms of the diversity-promoting prior, we can impose a variety of structures on the samples besides diversity. For example, we can design the prior to encourage the motion samples to better cover the ground truth, thereby achieving higher sample accuracy. Other designs of the prior can enable new applications, such as controllable motion prediction, where we generate diverse motion samples that share some common features (e.g., similar leg motion but diverse upper-body motion).

The contributions of this work are the following: (1) We propose a novel perspective for addressing sample diversity in deep generative models—designing sampling methods for a pretrained generative model. (2) We propose a principled sampling method, DLow, which formulates diversity sampling as a constrained optimization problem over a set of learnable mapping functions using a diversity-promoting prior on the samples and KL constraints on the latent codes, which allows us to balance between sample diversity and likelihood. (3) Our approach allows for flexible design of the diversity-promoting prior to obtain more accurate samples or enable new applications such as controllable motion prediction. (4) We demonstrate through human motion prediction experiments that our approach outperforms state-of-the-art baseline methods in terms of sample diversity and accuracy.

2 Related Work

Human Motion Prediction. Most previous work takes a deterministic approach to modeling human motion and regresses a single future motion from past 3D poses [1, 9, 13, 16, 17, 21, 33, 43, 49, 50, 55, 67] or video frames [10, 73, 75]. While these approaches are able to predict the most likely future motion, they fail to model the multi-modal nature of human motion, which is essential for safety-critical applications. More related to our work, stochastic human motion prediction methods have gained popularity with the development of deep generative models. These methods  [3, 6, 40, 44, 59, 66, 69, 74] often build upon popular generative models such as conditional generative adversarial networks (CGANs;  [20]) or conditional variational autoencoders (CVAEs;  [36]). The aforementioned methods differ in the design of their generative models, but at test time they follow the same sampling strategy—randomly and independently sampling trajectories from the pretrained generative model without considering the correlation between samples. In this work, we propose a principled sampling method that can produce a diverse set of samples, thus improving sample efficiency compared to the random sampling typically used in prior work.

Diverse Inference. Producing a diverse set of solutions has been investigated in numerous problems in computer vision and machine learning. One branch of these diversity-driven methods stems from the M-Best MAP problem  [52, 60], including diverse M-Best solutions  [7] and multiple choice learning  [27, 42]. Alternatively, submodular function maximization has been applied to select a diverse subset of garments from fashion images  [30]. Another family of methods  [5, 18, 19, 31, 38, 68, 72] seeks diversity using determinantal point processes (DPPs;  [39, 48]), which are efficient probabilistic models that can measure the global diversity and quality within a set. Similarly, Fisher information  [58] has been used for diverse feature  [22] and data  [62] selection. Diversity has also been a key aspect in generative modeling. A vast body of work has tried to alleviate the mode collapse problem in GANs  [4, 11, 12, 15, 24, 45, 63, 70] and the posterior collapse problem in VAEs  [8, 28, 35, 46, 64, 76]. Normalizing flows  [56] have also been used to promote diversity in trajectory forecasting  [23, 57]. This line of work aims to improve the diversity of the data distribution learned by deep generative models. We address diversity from a different angle by improving the strategy for producing samples from a pretrained deep generative model.

3 Diversifying Latent Flows (DLow)

For many existing methods on generative vision tasks such as multi-modal human motion prediction, the primary focus is to learn a good generative model that can capture the multi-modal distribution of the data. In contrast, once the generative model is learned, little attention has been paid to devising sampling strategies for producing diverse samples from the pretrained generative model.

In this section, we will introduce our method, Diversifying Latent Flows (DLow), as a principled way for drawing a diverse and likely set of samples from a pretrained generative model (weights kept fixed). To provide the proper context, we will first start with a brief review of deep generative models and how traditional methods produce samples from a pretrained generative model.

Background: Deep Generative Models. Let \(\mathbf {x} \in \mathcal {X}\) denote data (e.g., human motion) drawn from a data distribution \(p(\mathbf {x}|\mathbf {c})\), where \(\mathbf {c}\) is some conditional information (e.g., past motion). One can reparameterize the data distribution by introducing a latent variable \(\mathbf {z} \in \mathcal {Z}\) such that \(p(\mathbf {x}|\mathbf {c}) = \int _\mathbf {z} p(\mathbf {x}|\mathbf {z}, \mathbf {c}) p(\mathbf {z})d\mathbf {z}\), where \(p(\mathbf {z})\) is a Gaussian prior distribution. Deep generative models learn \(p(\mathbf {x}|\mathbf {c})\) by modeling the conditional distribution \(p(\mathbf {x}|\mathbf {z}, \mathbf {c})\), and the generative process can be described as sampling \(\mathbf {z}\) and mapping it to a data sample \(\mathbf {x}\) using a deterministic generator function \(G_\theta : \mathcal {Z} \rightarrow \mathcal {X}\) as

$$\begin{aligned}&\mathbf {z} \sim p(\mathbf {z})\,, \end{aligned}$$
(1)
$$\begin{aligned}&\mathbf {x} = G_\theta (\mathbf {z}, \mathbf {c})\,, \end{aligned}$$
(2)

where the generator \(G_\theta \) is instantiated as a deep neural network parametrized by \(\theta \). This generative process produces samples from the implicit sample distribution \(p_\theta (\mathbf {x}|\mathbf {c})\) of the generative model, and the goal of generative modeling is to learn a generator \(G_\theta \) such that \(p_\theta (\mathbf {x}|\mathbf {c}) \approx p(\mathbf {x}|\mathbf {c})\). There are various approaches for learning the generator function \(G_\theta \), which yield different types of deep generative models such as variational autoencoders (VAEs;  [36]), normalizing flows (NFs;  [56]), and generative adversarial networks (GANs;  [20]). Note that even though the discussion in this work is focused on conditional generative models, our method can be readily applied to the unconditional case.

Random Sampling. Once the generator function \(G_\theta \) is learned, traditional approaches produce samples from the learned data distribution \(p_\theta (\mathbf {x}|\mathbf {c})\) by first randomly sampling a set of latent codes \(Z = \{\mathbf {z}_1, \ldots , \mathbf {z}_K\}\) from the latent prior \(p(\mathbf {z})\) (Eq. (1)) and then decoding Z with the generator \(G_\theta \) into a set of data samples \(X = \{\mathbf {x}_1, \ldots , \mathbf {x}_K\}\) (Eq. (2)). We argue that such a sampling strategy may result in a less diverse sample set for two reasons: (1) independent sampling cannot model the repulsion between samples within a diverse set; (2) the sampling is based only on the data likelihood, so many samples can concentrate around a small number of modes that have more training data. As a result, random sampling can lead to low sample efficiency because many samples are similar to one another and fail to cover other modes of the data distribution.
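For concreteness, the random-sampling baseline can be summarized by the following minimal Python sketch; `generator(z, c)` is a hypothetical stand-in for the pretrained generator \(G_\theta \), and all names and tensor layouts are illustrative assumptions rather than parts of any released implementation.

```python
import torch

def random_sampling(generator, c, K, n_z):
    # Draw K latent codes independently from the Gaussian prior p(z).
    Z = torch.randn(K, n_z)
    # Decode each code with the fixed generator; the samples are uncorrelated,
    # so nothing prevents them from clustering around the major modes.
    X = [generator(z, c) for z in Z]
    return torch.stack(X)
```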

Fig. 2. Overview of our DLow framework applied to diverse human motion prediction. The network \(Q_\gamma \) takes past motion \(\mathbf {c}\) as input and outputs the parameters of the mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\). Each mapping \(\mathcal {T}_{\psi _k}\) transforms the random variable \(\varvec{\epsilon }\) to a different latent code \(\mathbf {z}_k\) and also warps the density \(p(\varvec{\epsilon })\) to the latent code density \(r_\psi (\mathbf {z}_k|\mathbf {c})\). Each latent code \(\mathbf {z}_k\) is decoded by the CVAE decoder into a motion sample \(\mathbf {x}_k\).

DLow Sampling. To address the above issues with the random sampling approach, we propose an alternative sampling method, Diversifying Latent Flows (DLow), that can generate a diverse and likely set of samples from a pretrained deep generative model. Again, we stress that the weights of the generative model are kept fixed for DLow. We later apply DLow to the task of human motion prediction in Sect. 4 to demonstrate DLow’s ability to improve sample diversity.

Instead of sampling each latent code \(\mathbf {z}_k \in Z\) independently according to \(p(\mathbf {z})\), we introduce a random variable \(\varvec{\epsilon }\) and conditionally generate the latent codes Z and data samples X as follows:

$$\begin{aligned}&\varvec{\epsilon } \sim p(\varvec{\epsilon })\,, \end{aligned}$$
(3)
$$\begin{aligned}&\mathbf {z} _k = \mathcal {T}_{\psi _k}(\varvec{\epsilon }) \,,\;\, \quad \quad 1 \le k \le K\,, \end{aligned}$$
(4)
$$\begin{aligned}&\mathbf {x} _k = G_\theta (\mathbf {z}_k, \mathbf {c})\,, \,\quad 1 \le k \le K\,, \end{aligned}$$
(5)

where \(p(\varvec{\epsilon })\) is a Gaussian distribution, \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\) are latent mapping functions with parameters \(\psi = \{ \psi _1, \ldots , \psi _K \}\), and each \(\mathcal {T}_{\psi _k}\) maps \(\varvec{\epsilon }\) to a different latent code \(\mathbf {z}_k\). The above generative process defines a joint distribution \(r_\psi (X,Z|\mathbf {c})=p_\theta (X|Z, \mathbf {c})r_\psi (Z|\mathbf {c})\) over the samples X and latent codes Z, where \(p_\theta (X|Z,\mathbf {c})\) is the conditional distribution induced by the generator \(G_\theta (\mathbf {z}, \mathbf {c})\). Notice that in our setup, \(r_\psi (X,Z|\mathbf {c})\) depends only on \(\psi \) as the generator parameters \(\theta \) are learned in advance and are kept fixed. The data samples X can be viewed as a sample from the joint sample distribution \(r_\psi (X|\mathbf {c})=\int r_\psi (X,Z|\mathbf {c})dZ\) and the latent codes Z can be regarded as a sample from the joint latent distribution \(r_\psi (Z|\mathbf {c})\) induced by warping \(p(\varvec{\epsilon })\) through \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\). If we further marginalize out all variables except for \(\mathbf {x}_k\) from \(r_\psi (X|\mathbf {c})\), we obtain the marginal sample distribution \(r_\psi (\mathbf {x}_k|\mathbf {c})\) from which each sample \(\mathbf {x}_k\) is drawn. Similarly, each latent code \(\mathbf {z}_k \in Z\) can be viewed as a latent sample from the marginal latent distribution \(r_\psi (\mathbf {z}_k|\mathbf {c})\).

The above distribution reparametrizations are illustrated in Fig. 2. We can see that all latent codes Z and data samples X are correlated as they are uniquely determined by \(\varvec{\epsilon }\), and by sampling \(\varvec{\epsilon }\) one can easily produce Z and X from the joint latent distribution \(r_\psi (Z|\mathbf {c})\) and joint sample distribution \(r_\psi (X|\mathbf {c})\). Because \(r_\psi (Z|\mathbf {c})\) and \(r_\psi (X|\mathbf {c})\) are controlled by the latent mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\), we can impose structural constraints on \(r_\psi (Z|\mathbf {c})\) and \(r_\psi (X|\mathbf {c})\) by optimizing the parameters \(\psi \) of the latent mapping functions.
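The sampling process of Eqs. (3)-(5) can be sketched as follows, assuming hypothetical callables `mappings[k]` that implement \(\mathcal {T}_{\psi _k}\) (their concrete affine form is specified later in Eq. (10)) and the same frozen `generator` as in the sketch above.

```python
import torch

def dlow_sampling(generator, mappings, c, n_z):
    # A single shared noise variable correlates all K samples.
    eps = torch.randn(n_z)
    Z = [T(eps) for T in mappings]        # z_k = T_{psi_k}(eps), Eq. (4)
    X = [generator(z, c) for z in Z]      # x_k = G_theta(z_k, c), Eq. (5)
    return torch.stack(X)
```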

To encourage the diversity of samples X, we introduce a diversity-promoting prior p(X) (specific form defined later) and formulate a constrained optimization problem:

$$\begin{aligned} \min _\psi&\quad -\mathbb {E}_{X \sim r_\psi (X|\mathbf {c})}[\log p(X)]\,, \end{aligned}$$
(6)
$$\begin{aligned} \text {s.t.}&\quad \text {KL} (r_\psi (\mathbf {z}_k|\mathbf {c}) \Vert p(\mathbf {z}_k)) = 0\,, \quad 1 \le k \le K\,, \end{aligned}$$
(7)

where we minimize the cross entropy between the sample distribution \(r_\psi (X|\mathbf {c})\) and the diversity-promoting prior p(X). However, the objective in Eq. (6) alone can result in very low-likelihood samples \(\mathbf {x}_k\) corresponding to latent codes \(\mathbf {z}_k\) that are far away from the Gaussian prior \(p(\mathbf {z}_k)\). To ensure that each sample \(\mathbf {x}_k\) also has high likelihood under the generative model \(p_\theta (\mathbf {x}|\mathbf {c})\), we add constraints in Eq. (7) on the KL divergence between \(r_\psi (\mathbf {z}_k|\mathbf {c})\) and the Gaussian prior \(p(\mathbf {z}_k)\) (same as \(p(\mathbf {z})\)) to make \(r_\psi (\mathbf {z}_k|\mathbf {c}) = p(\mathbf {z}_k)\) and thus \(r_\psi (\mathbf {x}_k|\mathbf {c}) = p_\theta (\mathbf {x}_k|\mathbf {c})\) where \(r_\psi (\mathbf {x}_k|\mathbf {c})= \int p_\theta (\mathbf {x}_k|\mathbf {z}_k, \mathbf {c})r_\psi (\mathbf {z}_k|\mathbf {c})d\mathbf {z}_k\) and \( p_\theta (\mathbf {x}_k|\mathbf {c})= \int p_\theta (\mathbf {x}_k|\mathbf {z}_k, \mathbf {c})p(\mathbf {z}_k)d\mathbf {z}_k\). To optimize this constrained objective, we soften the constraints with the Lagrangian function:

$$\begin{aligned} \min _\psi \, -\mathbb {E}_{X \sim r_\psi (X|\mathbf {c})}[\log p(X)] +\beta \sum _{k=1}^K\text {KL} (r_\psi (\mathbf {z}_k|\mathbf {c}) \Vert p(\mathbf {z}_k))\,, \end{aligned}$$
(8)

where we use the same Lagrangian multiplier \(\beta \) for all constraints. Despite having a similar form, the above objective is very different from the objective function of \(\beta \)-VAE  [29] in several ways: (1) our goal is to learn a diverse sampling distribution \(r_\psi (X|\mathbf {c})\) for a pretrained generative model rather than learning the generative model itself; (2) the first part of our objective is a diversifying term instead of a reconstruction term; (3) our objective function applies to most deep generative models, not just VAEs. In this objective, softening the hard KL constraints allows for a trade-off between the diversity and likelihood of the samples X. For small \(\beta \), \(r_\psi (\mathbf {z}_k|\mathbf {c})\) is allowed to deviate from \(p(\mathbf {z}_k)\) so that \(r_\psi (\mathbf {z}_1|\mathbf {c}), \ldots , r_\psi (\mathbf {z}_K|\mathbf {c})\) can attend to different regions in the latent space, as shown in Fig. 2 (latent space), to further improve sample diversity. For large \(\beta \), the objective focuses on minimizing the KL term so that \(r_\psi (\mathbf {z}_k|\mathbf {c})\approx p(\mathbf {z}_k)\) and \(r_\psi (\mathbf {x}_k|\mathbf {c})\approx p_\theta (\mathbf {x}_k|\mathbf {c})\), and thus the sample \(\mathbf {x}_k\) will have high likelihood under \(p_\theta (\mathbf {x}_k|\mathbf {c})\).

The overall DLow objective is defined as:

$$\begin{aligned} L_\text {DLow} = L_{\text {prior}} + \beta L_{\text {KL}}\,, \end{aligned}$$
(9)

where \(L_{\text {prior}}\) and \(L_{\text {KL}}\) are the first and second term in Eq. (8) respectively. In the following, we will discuss in detail how we design the latent mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\) and the diversity-promoting prior p(X).

Latent Mapping Functions. Each latent mapping \(\mathcal {T}_{\psi _k}\) transforms the Gaussian distribution \(p(\varvec{\epsilon })\) to the marginal latent distribution \(r_\psi (\mathbf {z}_k|\mathbf {c})\) for latent code \(\mathbf {z}_k\) where \(\mathcal {T}_{\psi _k}\) is also conditioned on \(\mathbf {c}\). As \(r_\psi (\mathbf {z}_k|\mathbf {c})\) should stay close to the Gaussian latent prior \(p(\mathbf {z}_k)\), it would be ideal if the mapping \(\mathcal {T}_{\psi _k}\) makes \(r_\psi (\mathbf {z}_k|\mathbf {c})\) also a Gaussian. Thus, we design \(\mathcal {T}_{\psi _k}\) to be an invertible affine transformation:

$$\begin{aligned} \mathcal {T}_{\psi _k}(\varvec{\epsilon }) = \mathbf {A}_k(\mathbf {c})\varvec{\epsilon } + \mathbf {b}_k(\mathbf {c}) \,, \end{aligned}$$
(10)

where the mapping parameters \(\psi _k = \{\mathbf {A}_k(\mathbf {c}), \mathbf {b}_k(\mathbf {c})\}\), \(\mathbf {A}_k \in \mathbb {R}^{n_z \times n_z}\) is a nonsingular matrix, \(\mathbf {b}_k \in \mathbb {R}^{n_z}\) is a vector, and \(n_z\) is the number of dimensions for \(\mathbf {z}_k\) and \(\varvec{\epsilon }\). As shown in Fig. 2, we use a K-head network \(Q_\gamma (\mathbf {c})\) to output \(\psi _1, \ldots , \psi _K\), and the parameters \(\gamma \) of the network \(Q_\gamma (\mathbf {c})\) are the parameters to be optimized with the DLow objective in Eq. (9).
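A minimal sketch of such a K-head network is given below. For simplicity it predicts a diagonal \(\mathbf {A}_k\) from an already-encoded context feature; this diagonal restriction, the layer sizes, and all names are assumptions of the sketch, not restrictions of the method (the paper allows any nonsingular \(\mathbf {A}_k\)).

```python
import torch
import torch.nn as nn

class MappingNet(nn.Module):
    """K-head network Q_gamma that outputs the affine parameters psi_k."""
    def __init__(self, n_c, n_z, K, n_h=256):
        super().__init__()
        self.n_z, self.K = n_z, K
        self.backbone = nn.Sequential(nn.Linear(n_c, n_h), nn.ReLU())
        self.head_A = nn.Linear(n_h, K * n_z)   # diagonal entries of A_k
        self.head_b = nn.Linear(n_h, K * n_z)   # offsets b_k

    def forward(self, c_feat):
        # c_feat: encoded past-motion feature vector of size n_c.
        h = self.backbone(c_feat)
        A_diag = self.head_A(h).view(self.K, self.n_z)
        b = self.head_b(h).view(self.K, self.n_z)
        return A_diag, b

def affine_map(eps, A_diag, b):
    # z_k = A_k eps + b_k for all k at once (Eq. (10), diagonal A_k).
    return A_diag * eps.unsqueeze(0) + b        # shape (K, n_z)
```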

Under the invertible affine transformation \(\mathcal {T}_{\psi _k}\), \(r_\psi (\mathbf {z}_k|\mathbf {c})\) becomes a Gaussian distribution \(\mathcal {N}(\mathbf {b}_k, \mathbf {A}_k\mathbf {A}_k^T)\). This allows us to compute the KL divergence terms in \(L_\text {KL}\) analytically:

$$\begin{aligned} \text {KL} (r_\psi (\mathbf {z}_k|\mathbf {c})\Vert p(\mathbf {z}_k)) = \frac{1}{2}\left( {\text {tr}}\left( \mathbf {A}_k\mathbf {A}_k^T\right) +\mathbf {b}_k^T\mathbf {b}_k -n_z - \log \det \left( \mathbf {A}_k\mathbf {A}_k^T\right) \right) . \end{aligned}$$
(11)

The KL divergence is minimized when \(r_\psi (\mathbf {z}_k|\mathbf {c}) = p(\mathbf {z}_k)\), which implies that \(\mathbf {A}_k\mathbf {A}_k^T = \mathbf {I}\) and \(\mathbf {b}_k = \mathbf {0}\). Geometrically, this means that \(\mathbf {A}_k\) is in the orthogonal group \(O(n_z)\), which includes all rotations and reflections in an \(n_z\)-dimensional space. Hence, any mapping \(\mathcal {T}_{\psi _k}\) that is a rotation or reflection will minimize the KL divergence. As mentioned before, there is a trade-off between diversity and likelihood in Eq. (9). To improve sample diversity (minimize \(L_\text {prior}\)) without compromising likelihood (KL divergence), we can optimize \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\) to be different rotations or reflections that map \(\varvec{\epsilon }\) to different feasible points \(\mathbf {z}_1, \ldots , \mathbf {z}_K\) in the latent space. This geometric understanding sheds light on the mapping space admitted by the hard KL constraints. In practice, we use soft KL constraints in the DLow objective to further enlarge the feasible mapping space, which allows us to achieve lower \(L_\text {prior}\) and better sample diversity.
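Under the diagonal-\(\mathbf {A}_k\) simplification of the sketch above, Eq. (11) reduces to the familiar closed-form KL between diagonal Gaussians, e.g.:

```python
import torch

def kl_to_standard_normal(A_diag, b):
    # KL( N(b_k, A_k A_k^T) || N(0, I) ) per head k; A_diag, b: (K, n_z).
    var = A_diag ** 2                        # diagonal of A_k A_k^T
    return 0.5 * (var.sum(dim=1) + (b ** 2).sum(dim=1)
                  - A_diag.shape[1] - torch.log(var).sum(dim=1))
```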

Diversity-Promoting Prior. In the DLow objective, a diversity-promoting prior p(X) on the joint sample distribution is used to guide the optimization of the latent mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\). With an energy-based formulation, the prior p(X) can be defined using an energy function E(X):

$$\begin{aligned} p(X) = \exp (-E(X)) / \mathcal {S}\,, \end{aligned}$$
(12)

where \(\mathcal {S}\) is a normalizing constant. Dropping the constant \(\mathcal {S}\), the first term in Eq. (8) can be rewritten as

$$\begin{aligned} L_{\text {prior}} = \mathbb {E}_{X \sim r_\psi (X|\mathbf {c})}[E(X)]\,. \end{aligned}$$
(13)

To promote sample diversity of X, we design an energy function \(E := E_d\) based on a pairwise distance metric \(\mathcal {D}\):

$$\begin{aligned} E_d(X) = \frac{1}{K(K-1)}\sum _{i=1}^K\sum _{j\ne i}^K \exp \left( -\frac{\mathcal {D}^2(\mathbf {x}_i, \mathbf {x}_j)}{\sigma _d}\right) , \end{aligned}$$
(14)

where we use the Euclidean distance for \(\mathcal {D}\) and an RBF kernel with scale \(\sigma _d\). Minimizing \(L_{\text {prior}}\) moves the samples towards a lower-energy (diverse) configuration. \(L_{\text {prior}}\) can be evaluated efficiently with the reparametrization trick  [36].
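A direct implementation of Eq. (14) is shown below; `X` holds the K motion samples flattened into vectors and `sigma_d` is the RBF scale. The tensor layout is an assumption of this sketch.

```python
import torch

def diversity_energy(X, sigma_d):
    # E_d(X): mean pairwise RBF similarity over the K flattened samples.
    K = X.shape[0]
    d2 = torch.cdist(X, X, p=2) ** 2         # (K, K) squared distances
    rbf = torch.exp(-d2 / sigma_d)
    off_diag = rbf.sum() - rbf.diagonal().sum()
    return off_diag / (K * (K - 1))
```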

Up to this point, we have described the proposed sampling method, DLow, for generating a diverse set of samples from a pretrained generative model \(p_\theta (\mathbf {x}|\mathbf {c})\). By introducing a common random variable \(\varvec{\epsilon }\), DLow allows us to generate correlated samples X. Moreover, by introducing learnable mapping functions \(\mathcal {T}_{\psi _k}\), we can model the joint sample distribution \(r_\psi (X|\mathbf {c})\) and impose structural constraints, such as diversity, on the sample set X which cannot be modeled by random sampling from the generative model.

4 Diverse Human Motion Prediction

Equipped with a method to generate diverse samples from a pretrained deep generative model, we now turn our attention to the task of diverse human motion prediction. Let the pose of a person be a V-dimensional vector of 3D joint positions. We use \(\mathbf {c} \in \mathbb {R}^{H \times V}\) to denote the past motion over H time steps and \(\mathbf {x} \in \mathbb {R}^{T \times V}\) to denote the future motion over a horizon of T time steps. Given a past motion \(\mathbf {c}\), the goal of diverse human motion prediction is to generate a diverse set of future motions \(X = \{\mathbf {x}_1, \ldots , \mathbf {x}_K\}\).

To capture the multi-modal distribution of the future trajectory \(\mathbf {x}\), we take a generative approach and use a conditional variational autoencoder (CVAE) to learn the future trajectory distribution \(p_\theta (\mathbf {x}|\mathbf {c})\). Here we use a CVAE for its training stability compared to other popular approaches such as CGANs, but other suitable deep generative models could also be used. The CVAE uses a variational lower bound  [34] as a surrogate for the intractable true data log-likelihood:

$$\begin{aligned} \mathcal {L}(\mathbf {x} ; \theta , \phi )= \; \mathbb {E}_{q_{\phi }(\mathbf {z} | \mathbf {x}, \mathbf {c})}\left[ \log p_{\theta }(\mathbf {x} | \mathbf {z}, \mathbf {c})\right] -{\text {KL}}\left( q_{\phi }(\mathbf {z} | \mathbf {x}, \mathbf {c}) \Vert p(\mathbf {z})\right) , \end{aligned}$$
(15)

where \(q_{\phi }(\mathbf {z} | \mathbf {x}, \mathbf {c})\) is a \(\phi \)-parametrized approximate posterior distribution. We use multivariate Gaussians for the prior, the posterior (encoder distribution) and the likelihood (decoder distribution): \(p(\mathbf {z})=\mathcal {N}(\mathbf {0}, \mathbf {I})\), \(q_{\phi }(\mathbf {z} | \mathbf {x}, \mathbf {c}) = \mathcal {N}(\varvec{\mu }, \text {Diag}(\varvec{\sigma }^2))\), and \(p_\theta (\mathbf {x}|\mathbf {z}, \mathbf {c}) = \mathcal {N}(\tilde{\mathbf {x}}, \alpha \mathbf {I})\), where \(\alpha \) is a hyperparameter. Both the encoder and decoder are implemented as recurrent neural networks (RNNs) (network architectures are given in the supplementary materials). The encoder network \(F_\phi \) outputs the parameters of the posterior distribution: \((\varvec{\mu }, \varvec{\sigma }) = F_\phi (\mathbf {x}, \mathbf {c})\); the decoder network \(G_\theta \) outputs the reconstructed future trajectory \(\tilde{\mathbf {x}} = G_\theta (\mathbf {z}, \mathbf {c})\). The CVAE is learned by jointly optimizing the encoder and decoder with Eq. (15).
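For reference, the negative of the lower bound in Eq. (15) can be computed as in the sketch below, where `encoder` and `decoder` stand for the RNN-based \(F_\phi \) and \(G_\theta \) (not reproduced here); the reconstruction scaling follows from the Gaussian decoder assumption up to an additive constant.

```python
import torch

def cvae_loss(encoder, decoder, x, c, alpha):
    mu, sigma = encoder(x, c)                       # q_phi(z | x, c)
    z = mu + sigma * torch.randn_like(sigma)        # reparametrization trick
    x_rec = decoder(z, c)
    recon = ((x - x_rec) ** 2).sum() / (2 * alpha)  # -log p_theta(x|z,c) + const
    kl = 0.5 * (sigma ** 2 + mu ** 2 - 1 - torch.log(sigma ** 2)).sum()
    return recon + kl                               # negative ELBO to minimize
```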

4.1 Diversity Sampling with DLow

Once the CVAE is learned, we follow the DLow framework proposed in Sect. 3 to optimize the network \(Q_\gamma \) and learn the latent mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\). Before doing so, to fully leverage the DLow framework, we highlight one of DLow's key features: the design of the diversity-promoting prior p(X) in \(L_\text {prior}\) can be flexibly changed by modifying the underlying energy function E(X). This allows us to impose various structural constraints besides diversity on the sample set X. Below, we provide two examples of such prior designs that (1) improve sample accuracy or (2) enable new applications such as controllable motion prediction.

Reconstruction Energy. To ensure that the sample set X is both diverse and accurate, i.e., the ground truth future motion \(\hat{\mathbf {x}}\) is close to one of the samples in X, we can modify the prior’s energy function E in Eq. (12) by adding a reconstruction term \(E_r\):

$$\begin{aligned}&E(X) = E_d(X) + \lambda _r E_r(X)\,,\end{aligned}$$
(16)
$$\begin{aligned}&E_r(X) = \min _k \mathcal {D}^2(\mathbf {x}_k, \hat{\mathbf {x}})\,, \end{aligned}$$
(17)

where \(\lambda _r\) is a weighting factor and we use the Euclidean distance as the distance metric \(\mathcal {D}\). As DLow produces a correlated set of samples X instead of independent samples, the network \(Q_\gamma \) can learn to distribute the samples so that they are both diverse and accurate, covering the ground truth better. We use this prior design for our main experiments.
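Putting the pieces together, one DLow training step with the energy of Eqs. (16)-(17) could look as follows. This sketch reuses `diversity_energy`, `kl_to_standard_normal`, `MappingNet`, and the hypothetical frozen `generator` from the earlier sketches, so it is illustrative rather than a faithful reproduction of the authors' implementation.

```python
import torch

def dlow_step(mapping_net, generator, c_feat, c, x_gt,
              n_z, beta, sigma_d, lambda_r, optimizer):
    A_diag, b = mapping_net(c_feat)                # psi_1..psi_K from Q_gamma
    eps = torch.randn(n_z)
    Z = A_diag * eps.unsqueeze(0) + b              # Eq. (4), affine mappings
    X = torch.stack([generator(z, c) for z in Z])  # Eq. (5), frozen decoder
    X = X.flatten(start_dim=1)                     # (K, T*V) motion vectors
    e_d = diversity_energy(X, sigma_d)             # Eq. (14)
    e_r = ((X - x_gt.flatten()) ** 2).sum(dim=1).min()  # Eq. (17)
    l_prior = e_d + lambda_r * e_r                 # Eq. (16) plugged into Eq. (13)
    l_kl = kl_to_standard_normal(A_diag, b).sum()
    loss = l_prior + beta * l_kl                   # Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the parameters of `mapping_net` are updated by the optimizer; the generator stays frozen throughout, as emphasized in Sect. 3.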

Controllable Motion Prediction. Another possible design of the diversity-promoting prior p(X) is one that promotes diversity in a certain subspace of the sample space. In the context of human motion prediction, we may want certain body parts to move similarly but other parts to move differently. For example, we may want leg motion to be similar but upper-body motion to be diverse across motion samples. We call this task controllable motion prediction, i.e., finding a set of diverse samples that share some common features, which allows users or downstream systems to explore variations of a certain type of sample.

Formally, we divide the human joints into two sets, \(J_s\) and \(J_d\), and ask samples in X to have similar motions for joints \(J_s\) but diverse motions for joints \(J_d\). We can slice a motion sample \(\mathbf {x}_k\) into two parts: \(\mathbf {x}_k = \left( \mathbf {x}_k^s, \mathbf {x}_k^d\right) \) where \(\mathbf {x}_k^s\) and \(\mathbf {x}_k^d\) correspond to \(J_s\) and \(J_d\) respectively. Similarly, we can slice the sample set X into two sets: \(X_s = \{\mathbf {x}_1^s, \ldots , \mathbf {x}_K^s\}\) and \(X_d = \{\mathbf {x}_1^d, \ldots , \mathbf {x}_K^d\}\). We then define a new energy function E for the prior p(X):

$$\begin{aligned}&E(X) = E_d(X_d) + \lambda _s E_s(X_s) + \lambda _r E_r(X)\,,\end{aligned}$$
(18)
$$\begin{aligned}&E_s(X_s) = \frac{1}{K(K-1)}\sum _{i=1}^K\sum _{j\ne i}^K \mathcal {D}^2(\mathbf {x}_i^s, \mathbf {x}_j^s)\,, \end{aligned}$$
(19)

where we add another energy term \(E_s\) weighted by \(\lambda _s\) to minimize the motion distance between samples for joints \(J_s\), and we only compute the diversity-promoting term \(E_d\) using motions of joints \(J_d\). After optimizing \(Q_\gamma \) using the DLow objective with the new energy E, we can produce diverse samples X that have similar motions for joints \(J_s\).
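A sketch of this energy is given below; `idx_s` and `idx_d` are hypothetical index tensors selecting the flattened coordinates belonging to \(J_s\) and \(J_d\), and `diversity_energy` is reused from the sketch after Eq. (14).

```python
import torch

def similarity_energy(X_s):
    # E_s(X_s): mean pairwise squared distance over the "similar" joints J_s.
    K = X_s.shape[0]
    d2 = torch.cdist(X_s, X_s, p=2) ** 2
    return (d2.sum() - d2.diagonal().sum()) / (K * (K - 1))

def controllable_energy(X, x_gt, idx_s, idx_d, sigma_d, lambda_s, lambda_r):
    e_d = diversity_energy(X[:, idx_d], sigma_d)   # diverse joints J_d, Eq. (14)
    e_s = similarity_energy(X[:, idx_s])           # similar joints J_s, Eq. (19)
    e_r = ((X - x_gt) ** 2).sum(dim=1).min()       # reconstruction, Eq. (17)
    return e_d + lambda_s * e_s + lambda_r * e_r   # Eq. (18)
```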

Furthermore, we may also want to use a reference motion sample \(\mathbf {x}_\text {ref}\) to provide the desired features. To achieve this, we can treat \(\mathbf {x}_\text {ref}\) as the first sample \(\mathbf {x}_1\) in X. We first find its corresponding latent code \(\mathbf {z}_1 := \mathbf {z}_\text {ref}\) using the CVAE encoder: \(\mathbf {z}_\text {ref} =F_\phi ^{\varvec{\mu }}(\mathbf {x}_\text {ref}, \mathbf {c})\). We can then find the common variable \(\varvec{\epsilon }_\text {ref}\) for generating X using the inverse mapping \(\mathcal {T}_{\psi _1}^{-1}\):

$$\begin{aligned} \varvec{\epsilon }_\text {ref} = \mathcal {T}_{\psi _1}^{-1}(\mathbf {z}_\text {ref}) = \mathbf {A}_1^{-1}(\mathbf {z}_\text {ref} - \mathbf {b}_1)\,. \end{aligned}$$
(20)

With \(\varvec{\epsilon }_\text {ref}\) known, we can generate X that includes \(\mathbf {x}_\text {ref}\). In practice, we force \(\mathcal {T}_{\psi _1}\) to be an identity mapping to enforce \(r_\psi (\mathbf {z}_1|\mathbf {c}) = p(\mathbf {z}_1)\) so that \(r_\psi (\mathbf {z}_1|\mathbf {c})\) covers the posterior distribution of \(\mathbf {z}_\text {ref}\). Otherwise, if \(\mathbf {z}_\text {ref}\) lies outside of the high density region of \(r_\psi (\mathbf {z}_1|\mathbf {c})\), it may lead to low-likelihood \(\varvec{\epsilon }_\text {ref}\) after the inverse mapping.
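Recovering the shared noise for a reference motion (Eq. (20)) is a one-line inverse under the diagonal-\(\mathbf {A}_1\) simplification used in the sketches above; `encoder_mu` denotes the posterior-mean branch \(F_\phi ^{\varvec{\mu }}\) of the CVAE encoder, and the names are of our choosing.

```python
import torch

def epsilon_from_reference(encoder_mu, x_ref, c, A1_diag, b1):
    z_ref = encoder_mu(x_ref, c)        # z_ref = F_phi^mu(x_ref, c)
    # With T_psi_1 fixed to the identity (A_1 = I, b_1 = 0), eps_ref = z_ref.
    return (z_ref - b1) / A1_diag       # eps_ref = A_1^{-1} (z_ref - b_1)
```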

5 Experiments

Datasets. We perform evaluation on two public motion capture datasets: Human3.6M  [32] and HumanEva-I  [61]. Human3.6M is a large-scale dataset with 11 subjects (7 with ground truth) and 3.6 million video frames in total. Each subject performs 15 actions, and the motion is recorded at 50 Hz. Following previous work [47, 51, 54, 71], we adopt a 17-joint skeleton and train on five subjects (S1, S5, S6, S7, S8) and test on two subjects (S9 and S11). HumanEva-I is a relatively small dataset, containing only three subjects recorded at 60 Hz. We adopt a 15-joint skeleton  [54] and use the same train/test split provided in the dataset. By using both a large dataset with more variation in motion and a small dataset with less variation, we can better evaluate the generalization of our method to different types of data. For Human3.6M, we predict future motion for 2 s based on observed motion of 0.5 s. For HumanEva-I, we forecast future motion for 1 s given observed motion of 0.25 s.

Baselines. To fully evaluate our method, we consider three types of baselines: (1) Deterministic motion prediction methods, including ERD  [16] and acLSTM  [43]; (2) Stochastic motion prediction methods, including CVAE based methods, Pose-Knows  [66] and MT-VAE  [69], as well as a CGAN based method, HP-GAN  [6]; (3) Diversity-promoting methods for generative models, including Best-of-Many  [8], GMVAE  [14], DeLiGAN  [26], and DSF  [72].

Metrics. We use the following metrics to measure both sample diversity and accuracy. (1) Average Pairwise Distance (APD): average L2 distance between all pairs of motion samples, measuring diversity within the sample set, computed as \(\frac{1}{K(K-1)}\sum _{i=1}^K \sum _{j\ne i}^K \Vert \mathbf {x}_i - \mathbf {x}_j\Vert \). (2) Average Displacement Error (ADE): average L2 distance over all time steps between the ground truth motion \(\hat{\mathbf {x}}\) and the closest sample, computed as \(\frac{1}{T}\min _{\mathbf {x} \in X} \Vert \hat{\mathbf {x}} - \mathbf {x}\Vert \). (3) Final Displacement Error (FDE): L2 distance between the final ground truth pose \(\hat{\mathbf {x}}^T\) and the closest sample's final pose, computed as \(\min _{\mathbf {x} \in X} \Vert \hat{\mathbf {x}}^T - \mathbf {x}^T\Vert \). (4) Multi-Modal ADE (MMADE): the multi-modal version of ADE, which obtains multi-modal ground truth future motions by grouping similar past motions. (5) Multi-Modal FDE (MMFDE): the multi-modal version of FDE.

In these metrics, APD has been used to measure sample diversity  [3]. ADE and FDE are common metrics for evaluating sample accuracy in trajectory forecasting literature  [2, 25, 41]. MMADE and MMFDE  [72] are metrics used to measure a method’s ability to produce multi-modal predictions.
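The sample-based metrics above can be computed as in the following sketch, where `X` is assumed to have shape (K, T, V) and `x_gt` shape (T, V); MMADE and MMFDE follow the same pattern with multiple grouped ground-truth futures.

```python
import torch

def apd(X):
    # Average pairwise L2 distance between flattened motion samples.
    Xf = X.flatten(start_dim=1)
    d = torch.cdist(Xf, Xf, p=2)
    K = X.shape[0]
    return (d.sum() - d.diagonal().sum()) / (K * (K - 1))

def ade(X, x_gt):
    # (1/T) * distance of the closest sample to the ground-truth motion.
    T = x_gt.shape[0]
    d = (X - x_gt).flatten(start_dim=1).norm(dim=1)
    return d.min() / T

def fde(X, x_gt):
    # Distance between the final ground-truth pose and the closest final pose.
    return (X[:, -1] - x_gt[-1]).norm(dim=1).min()
```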

Table 1. Quantitative results on Human3.6M and HumanEva-I.

5.1 Quantitative Results

We summarize the quantitative results on Human3.6M and HumanEva-I in Table 1. The metrics are computed with a sample set size of \(K = 50\). For both datasets, our method, DLow, outperforms all baselines in terms of both sample diversity (APD) and accuracy (ADE, FDE) as well as coverage of the multi-modal ground truth (MMADE, MMFDE). Deterministic methods such as ERD  [16] and acLSTM  [43] do not perform well because they only predict one future trajectory, which can lead to mode averaging. Methods such as MT-VAE  [69] produce trajectory samples that lack diversity, so they fail to cover the multi-modal ground truth (indicated by high MMADE and MMFDE) despite having decently low ADE and FDE. We would also like to point out that the closest competitor, DSF  [72], can only generate one deterministic set of samples, while our method can produce multiple diverse sets by sampling \(\varvec{\epsilon }\). We also show how each metric changes with different values of K in the supplementary materials.

Table 2. Ablation study on Human3.6M and HumanEva-I.

Ablation Study. We further perform an ablation study (Table 2) to analyze the effects of the two energy terms \(E_d\) and \(E_r\) in Eq. (16). First, without the reconstruction term \(E_r\), the DLow variant achieves higher diversity (APD) at the cost of sample accuracy (ADE, FDE, MMADE, MMFDE). This is expected because the network only optimizes the diversity term \(E_d\) and focuses solely on diversity. Second, for the variant without \(E_d\), both sample diversity and accuracy decrease. It is intuitive to see why the diversity (APD) decreases. To see why the sample accuracy (ADE, FDE, MMADE, MMFDE) also decreases, note that a more diverse set of samples has a better chance of covering the ground truth. Finally, when we remove both \(E_d\) and \(E_r\) (i.e., only optimize \(L_\text {KL}\)), the results are the worst, which is expected.

Fig. 3. Qualitative results on Human3.6M and HumanEva-I.

Fig. 4. Varying \(\beta \) in DLow allows us to balance between diversity and likelihood.

5.2 Qualitative Results

To visually evaluate the diversity and accuracy of each method, we present a qualitative comparison in Fig. 3 where we render the start pose, the end pose of the ground truth future motion, and the end pose of 10 motion samples. Note that we do not model the global translation of the person, which is why some sitting motions appear to be floating. For Human3.6M, we can see that our method DLow can predict a wide array of future motions, including standing, sitting, bending, crouching, and turning, which cover the ground truth bending motion. In contrast, the baseline methods mostly produce perturbations of a single motion—standing. For HumanEva-I, we can see that DLow produces interesting variations of the fighting motion, while the baselines produce almost identical future motions.

Fig. 5. Effect of varying \(\varvec{\epsilon }\) on motion samples.

Fig. 6. Controllable motion prediction. DLow enables samples to have leg motion more similar to the reference.

Diversity vs. Likelihood. As discussed in the approach section, the \(\beta \) in Eq. (8) represents the trade-off between sample diversity and likelihood. To verify this, we trained three DLow models with different \(\beta \) (1, 10, 100) and visualize the motion samples generated by each model in Fig. 4. We can see that a larger \(\beta \) leads to less diverse samples which correspond to the major mode of the generator distribution, while a smaller \(\beta \) can produce more diverse motion samples covering other plausible yet less likely future motions.

Effect of Varying \(\varvec{\epsilon }\). A key difference between our method and DSF  [72] is that we can generate multiple diverse sets of samples while DSF can only produce a fixed diverse set. To demonstrate this, we show in Fig. 5 how the motion samples of DLow change with different \(\varvec{\epsilon }\). By comparing the four sets of motion samples, one can conclude that changing \(\varvec{\epsilon }\) varies each set of samples but preserves the main structure of each motion.

Controllable Motion Prediction. As highlighted before, the flexible design of the diversity-promoting prior enables a new application, controllable motion prediction, where we predict diverse motions that share some common features. We showcase this application by conducting an experiment using the energy function defined in Eq. (18). The network is trained so that the leg motion of the motion samples is similar while the upper-body motion is diverse. The results are shown in Fig. 6. We can see that given a reference motion, our method can generate diverse upper-body motion and preserve similar leg motion, while random samples from the CVAE cannot enforce similar leg motion. Please refer to the supplementary materials for more results.

6 Conclusion

We have proposed a novel sampling strategy, DLow, for deep generative models to obtain a diverse set of future human motions. We introduced learnable latent mapping functions which allowed us to generate a set of correlated samples, whose diversity can be optimized by a diversity-promoting prior. Experiments demonstrated superior performance in generating diverse motion samples. Moreover, we showed that the flexible design of the diversity-promoting prior further enables new applications, such as controllable human motion prediction. We hope that our exploration of deep generative models through the lens of diversity will encourage more work towards understanding the complex nature of modeling and predicting future human behavior.