1 Introduction

Human motion prediction, i.e., predicting the future 3D poses of a person based on past poses, is an important problem in computer vision and has many useful applications in autonomous driving  [53], human-robot interaction  [37] and healthcare  [65]. It is a challenging problem because the future motion of a person is potentially diverse and multi-modal due to the complex nature of human behavior. For many safety-critical applications, it is important to predict a diverse set of human motions instead of just the most likely one. For example, an autonomous vehicle should be aware that a nearby pedestrian can suddenly cross the road even though the pedestrian will most likely remain in place. This diversity requirement calls for a generative approach that can fully characterize the multi-modal distribution of future human motion.

Fig. 1. In the latent space of a conditional variational autoencoder (CVAE), samples (stars) from our method DLow are able to cover more modes (colored ellipses) than the CVAE samples. In the motion space, DLow generates a diverse set of future human motions while the CVAE only produces perturbations of the motion of the major mode.

Deep generative models, e.g., variational autoencoders (VAEs)  [36], are effective tools to model multi-modal data distributions. Most existing work  [3, 6, 40, 44, 59, 66, 69] using deep generative models for human motion prediction focuses on the design of the generative model so that it can effectively learn the data distribution. Once the generative model is learned, little attention has been paid to the sampling method used to produce motion samples (predicted future motions) from the pretrained generative model (weights kept fixed). Most prior work predicts a set of motions by randomly sampling a set of latent codes from the latent prior and decoding them with the generator into motion samples. We argue that such a sampling strategy is not guaranteed to produce a diverse set of samples for two reasons: (1) the samples are independently drawn, which makes it difficult to enforce diversity; (2) the samples are drawn based on likelihood only, which means many samples may concentrate around the major modes (which have more observed data) of the data distribution and fail to cover the minor modes (as shown in Fig. 1 (Bottom)). The poor sample efficiency of random sampling means that one needs to draw a large number of samples to cover all the modes, which is computationally expensive and can lead to high latency, making it unsuitable for real-time applications such as autonomous driving and virtual reality. This prompts us to address an overlooked aspect of diverse human motion prediction—the sampling strategy.

We propose a novel sampling method, Diversifying Latent Flows (DLow), to obtain a diverse set of samples from a pretrained deep generative model. For this work, we use a conditional variational autoencoder (CVAE) as our pretrained generative model, but other generative models can also be used with our approach. DLow is motivated by the two previously mentioned problems with random (independent) sampling. To tackle problem (1), where sample independence limits diversity, we introduce a new random variable and a set of learnable deterministic mapping functions to correlate the motion samples. We first transform the random variable with the mapping functions to generate a set of correlated latent codes, which are then decoded into motion samples by the generator. As all motion samples are generated from a common random factor, this formulation allows us to model the joint sample distribution and gives us the opportunity to impose diversity on the samples by optimizing the parameters of the mapping functions. To address problem (2), where likelihood-based sampling limits diversity, we introduce a diversity-promoting prior (loss function) on the samples during the training of DLow. The prior follows an energy-based formulation with an energy function based on pairwise sample distance. During training, we optimize the mapping functions to minimize the cross entropy between the joint sample distribution and the diversity-promoting prior, thereby increasing sample diversity. To strike a balance between diversity and likelihood, we add a KL term to the optimization that enhances the likelihood of each sample. The relative weight of the prior term and the KL term controls the trade-off between the diversity and likelihood of the generated motion samples. Furthermore, our approach is highly flexible: by designing different forms of the diversity-promoting prior, we can impose a variety of structures on the samples besides diversity. For example, we can design the prior to encourage the motion samples to better cover the ground truth, thereby achieving higher sample accuracy. Other designs of the prior can enable new applications, such as controllable motion prediction, where we generate diverse motion samples that share some common features (e.g., similar leg motion but diverse upper-body motion).

The contributions of this work are the following: (1) We propose a novel perspective for addressing sample diversity in deep generative models—designing sampling methods for a pretrained generative model. (2) We propose a principled sampling method, DLow, which formulates diversity sampling as a constrained optimization problem over a set of learnable mapping functions using a diversity-promoting prior on the samples and KL constraints on the latent codes, which allows us to balance between sample diversity and likelihood. (3) Our approach allows for flexible design of the diversity-promoting prior to obtain more accurate samples or enable new applications such as controllable motion prediction. (4) We demonstrate through human motion prediction experiments that our approach outperforms state-of-the-art baseline methods in terms of sample diversity and accuracy.

2 Related Work

Human Motion Prediction. Most previous work takes a deterministic approach to modeling human motion and regresses a single future motion from past 3D poses [1, 9, 13, 16, 17, 21, 33, 43, 49, 50, 55, 67] or video frames [10, 73, 75]. While these approaches are able to predict the most likely future motion, they fail to model the multi-modal nature of human motion, which is essential for safety-critical applications. More related to our work, stochastic human motion prediction methods have gained popularity with the development of deep generative models. These methods  [3, 6, 40, 44, 59, 66, 69, 74] often build upon popular generative models such as conditional generative adversarial networks (CGANs;  [20]) or conditional variational autoencoders (CVAEs;  [36]). The aforementioned methods differ in the design of their generative models, but at test time they follow the same sampling strategy—randomly and independently sampling trajectories from the pretrained generative model without considering the correlation between samples. In this work, we propose a principled sampling method that can produce a diverse set of samples, thus improving sample efficiency compared to the random sampling typically used in prior work.

Diverse Inference. Producing a diverse set of solutions has been investigated in numerous problems in computer vision and machine learning. One branch of these diversity-driven methods stems from the M-Best MAP problem  [52, 60], including diverse M-Best solutions  [7] and multiple choice learning  [27, 42]. Alternatively, submodular function maximization has been applied to select a diverse subset of garments from fashion images  [30]. Another family of methods  [5, 18, 19, 31, 38, 68, 72] seeks diversity using determinantal point processes (DPPs;  [39, 48]), which are efficient probabilistic models that can measure the global diversity and quality within a set. Similarly, Fisher information  [58] has been used for diverse feature  [22] and data  [62] selection. Diversity has also been a key aspect in generative modeling. A vast body of work has tried to alleviate the mode collapse problem in GANs  [4, 11, 12, 15, 24, 45, 63, 70] and the posterior collapse problem in VAEs  [8, 28, 35, 46, 64, 76]. Normalizing flows  [56] have also been used to promote diversity in trajectory forecasting  [23, 57]. This line of work aims to improve the diversity of the data distribution learned by deep generative models. We address diversity from a different angle by improving the strategy for producing samples from a pretrained deep generative model.

3 Diversifying Latent Flows (DLow)

For many existing methods on generative vision tasks such as multi-modal human motion prediction, the primary focus is to learn a good generative model that can capture the multi-modal distribution of the data. In contrast, once the generative model is learned, little attention has been paid to devising sampling strategies for producing diverse samples from the pretrained generative model.

In this section, we will introduce our method, Diversifying Latent Flows (DLow), as a principled way for drawing a diverse and likely set of samples from a pretrained generative model (weights kept fixed). To provide the proper context, we will first start with a brief review of deep generative models and how traditional methods produce samples from a pretrained generative model.

Background: Deep Generative Models. Let \(\mathbf {x} \in \mathcal {X}\) denote data (e.g., human motion) drawn from a data distribution \(p(\mathbf {x}|\mathbf {c})\), where \(\mathbf {c}\) is some conditional information (e.g., past motion). One can reparameterize the data distribution by introducing a latent variable \(\mathbf {z} \in \mathcal {Z}\) such that \(p(\mathbf {x}|\mathbf {c}) = \int _\mathbf {z} p(\mathbf {x}|\mathbf {z}, \mathbf {c}) p(\mathbf {z})d\mathbf {z}\), where \(p(\mathbf {z})\) is a Gaussian prior distribution. Deep generative models learn \(p(\mathbf {x}|\mathbf {c})\) by modeling the conditional distribution \(p(\mathbf {x}|\mathbf {z}, \mathbf {c})\), and the generative process can be described as sampling \(\mathbf {z}\) and mapping it to a data sample \(\mathbf {x}\) using a deterministic generator function \(G_\theta : \mathcal {Z} \rightarrow \mathcal {X}\) as

$$\begin{aligned}&\mathbf {z} \sim p(\mathbf {z})\,, \end{aligned}$$
(1)
$$\begin{aligned}&\mathbf {x} = G_\theta (\mathbf {z}, \mathbf {c})\,, \end{aligned}$$
(2)

where the generator \(G_\theta \) is instantiated as a deep neural network parametrized by \(\theta \). This generative process produces samples from the implicit sample distribution \(p_\theta (\mathbf {x}|\mathbf {c})\) of the generative model, and the goal of generative modeling is to learn a generator \(G_\theta \) such that \(p_\theta (\mathbf {x}|\mathbf {c}) \approx p(\mathbf {x}|\mathbf {c})\). There are various approaches for learning the generator function \(G_\theta \), which yield different types of deep generative models such as variational autoencoders (VAEs;  [36]), normalizing flows (NFs;  [56]), and generative adversarial networks (GANs;  [20]). Note that even though the discussion in this work is focused on conditional generative models, our method can be readily applied to the unconditional case.

Random Sampling. Once the generator function \(G_\theta \) is learned, traditional approaches produce samples from the learned data distribution \(p_\theta (\mathbf {x}|\mathbf {c})\) by first randomly sampling a set of latent codes \(Z = \{\mathbf {z}_1, \ldots , \mathbf {z}_K\}\) from the latent prior \(p(\mathbf {z})\) (Eq. (1)) and then decoding Z with the generator \(G_\theta \) into a set of data samples \(X = \{\mathbf {x}_1, \ldots , \mathbf {x}_K\}\) (Eq. (2)). We argue that such a sampling strategy may result in a less diverse sample set for two reasons: (1) independent sampling cannot model the repulsion between samples within a diverse set; (2) the sampling is based only on the data likelihood, so many samples can concentrate around a small number of modes that have more training data. As a result, random sampling can lead to low sample efficiency because many samples are similar to one another and fail to cover other modes of the data distribution.
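For concreteness, the random-sampling baseline can be summarized by the following minimal Python sketch; `generator(z, c)` is a hypothetical stand-in for the pretrained generator \(G_\theta \), and all names and tensor layouts are illustrative assumptions rather than parts of any released implementation.

```python
import torch

def random_sampling(generator, c, K, n_z):
    # Draw K latent codes independently from the Gaussian prior p(z).
    Z = torch.randn(K, n_z)
    # Decode each code with the fixed generator; the samples are uncorrelated,
    # so nothing prevents them from clustering around the major modes.
    X = [generator(z, c) for z in Z]
    return torch.stack(X)
```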

Fig. 2. Overview of our DLow framework applied to diverse human motion prediction. The network \(Q_\gamma \) takes past motion \(\mathbf {c}\) as input and outputs the parameters of the mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\). Each mapping \(\mathcal {T}_{\psi _k}\) transforms the random variable \(\varvec{\epsilon }\) to a different latent code \(\mathbf {z}_k\) and also warps the density \(p(\varvec{\epsilon })\) to the latent code density \(r_\psi (\mathbf {z}_k|\mathbf {c})\). Each latent code \(\mathbf {z}_k\) is decoded by the CVAE decoder into a motion sample \(\mathbf {x}_k\).

DLow Sampling. To address the above issues with the random sampling approach, we propose an alternative sampling method, Diversifying Latent Flows (DLow), that can generate a diverse and likely set of samples from a pretrained deep generative model. Again, we stress that the weights of the generative model are kept fixed for DLow. We later apply DLow to the task of human motion prediction in Sect. 4 to demonstrate DLow’s ability to improve sample diversity.

Instead of sampling each latent code \(\mathbf {z}_k \in Z\) independently according to \(p(\mathbf {z})\), we introduce a random variable \(\varvec{\epsilon }\) and conditionally generate the latent codes Z and data samples X as follows:

$$\begin{aligned}&\varvec{\epsilon } \sim p(\varvec{\epsilon })\,, \end{aligned}$$
(3)
$$\begin{aligned}&\mathbf {z} _k = \mathcal {T}_{\psi _k}(\varvec{\epsilon }) \,,\;\, \quad \quad 1 \le k \le K\,, \end{aligned}$$
(4)
$$\begin{aligned}&\mathbf {x} _k = G_\theta (\mathbf {z}_k, \mathbf {c})\,, \,\quad 1 \le k \le K\,, \end{aligned}$$
(5)

where \(p(\varvec{\epsilon })\) is a Gaussian distribution, \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\) are latent mapping functions with parameters \(\psi = \{ \psi _1, \ldots , \psi _K \}\), and each \(\mathcal {T}_{\psi _k}\) maps \(\varvec{\epsilon }\) to a different latent code \(\mathbf {z}_k\). The above generative process defines a joint distribution \(r_\psi (X,Z|\mathbf {c})=p_\theta (X|Z, \mathbf {c})r_\psi (Z|\mathbf {c})\) over the samples X and latent codes Z, where \(p_\theta (X|Z,\mathbf {c})\) is the conditional distribution induced by the generator \(G_\theta (\mathbf {z}, \mathbf {c})\). Notice that in our setup, \(r_\psi (X,Z|\mathbf {c})\) depends only on \(\psi \) as the generator parameters \(\theta \) are learned in advance and are kept fixed. The data samples X can be viewed as a sample from the joint sample distribution \(r_\psi (X|\mathbf {c})=\int r_\psi (X,Z|\mathbf {c})dZ\) and the latent codes Z can be regarded as a sample from the joint latent distribution \(r_\psi (Z|\mathbf {c})\) induced by warping \(p(\varvec{\epsilon })\) through \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\). If we further marginalize out all variables except for \(\mathbf {x}_k\) from \(r_\psi (X|\mathbf {c})\), we obtain the marginal sample distribution \(r_\psi (\mathbf {x}_k|\mathbf {c})\) from which each sample \(\mathbf {x}_k\) is drawn. Similarly, each latent code \(\mathbf {z}_k \in Z\) can be viewed as a latent sample from the marginal latent distribution \(r_\psi (\mathbf {z}_k|\mathbf {c})\).

The above distribution reparametrizations are illustrated in Fig. 2. We can see that all latent codes Z and data samples X are correlated as they are uniquely determined by \(\varvec{\epsilon }\), and by sampling \(\varvec{\epsilon }\) one can easily produce Z and X from the joint latent distribution \(r_\psi (Z|\mathbf {c})\) and joint sample distribution \(r_\psi (X|\mathbf {c})\). Because \(r_\psi (Z|\mathbf {c})\) and \(r_\psi (X|\mathbf {c})\) are controlled by the latent mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\), we can impose structural constraints on \(r_\psi (Z|\mathbf {c})\) and \(r_\psi (X|\mathbf {c})\) by optimizing the parameters \(\psi \) of the latent mapping functions.
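The sampling process of Eqs. (3)-(5) can be sketched as follows, assuming hypothetical callables `mappings[k]` that implement \(\mathcal {T}_{\psi _k}\) (their concrete affine form is specified later in Eq. (10)) and the same frozen `generator` as in the sketch above.

```python
import torch

def dlow_sampling(generator, mappings, c, n_z):
    # A single shared noise variable correlates all K samples.
    eps = torch.randn(n_z)
    Z = [T(eps) for T in mappings]        # z_k = T_{psi_k}(eps), Eq. (4)
    X = [generator(z, c) for z in Z]      # x_k = G_theta(z_k, c), Eq. (5)
    return torch.stack(X)
```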

To encourage the diversity of samples X, we introduce a diversity-promoting prior p(X) (specific form defined later) and formulate a constrained optimization problem:

$$\begin{aligned} \min _\psi&\quad -\mathbb {E}_{X \sim r_\psi (X|\mathbf {c})}[\log p(X)]\,, \end{aligned}$$
(6)
$$\begin{aligned} \text {s.t.}&\quad \text {KL} (r_\psi (\mathbf {z}_k|\mathbf {c}) \Vert p(\mathbf {z}_k)) = 0\,, \quad 1 \le k \le K\,, \end{aligned}$$
(7)

where we minimize the cross entropy between the sample distribution \(r_\psi (X|\mathbf {c})\) and the diversity-promoting prior p(X). However, the objective in Eq. (6) alone can result in very low-likelihood samples \(\mathbf {x}_k\) corresponding to latent codes \(\mathbf {z}_k\) that are far away from the Gaussian prior \(p(\mathbf {z}_k)\). To ensure that each sample \(\mathbf {x}_k\) also has high likelihood under the generative model \(p_\theta (\mathbf {x}|\mathbf {c})\), we add constraints in Eq. (7) on the KL divergence between \(r_\psi (\mathbf {z}_k|\mathbf {c})\) and the Gaussian prior \(p(\mathbf {z}_k)\) (same as \(p(\mathbf {z})\)) to make \(r_\psi (\mathbf {z}_k|\mathbf {c}) = p(\mathbf {z}_k)\) and thus \(r_\psi (\mathbf {x}_k|\mathbf {c}) = p_\theta (\mathbf {x}_k|\mathbf {c})\) where \(r_\psi (\mathbf {x}_k|\mathbf {c})= \int p_\theta (\mathbf {x}_k|\mathbf {z}_k, \mathbf {c})r_\psi (\mathbf {z}_k|\mathbf {c})d\mathbf {z}_k\) and \( p_\theta (\mathbf {x}_k|\mathbf {c})= \int p_\theta (\mathbf {x}_k|\mathbf {z}_k, \mathbf {c})p(\mathbf {z}_k)d\mathbf {z}_k\). To optimize this constrained objective, we soften the constraints with the Lagrangian function:

$$\begin{aligned} \min _\psi \, -\mathbb {E}_{X \sim r_\psi (X|\mathbf {c})}[\log p(X)] +\beta \sum _{k=1}^K\text {KL} (r_\psi (\mathbf {z}_k|\mathbf {c}) \Vert p(\mathbf {z}_k))\,, \end{aligned}$$
(8)

where we use the same Lagrangian multiplier \(\beta \) for all constraints. Despite having a similar form, the above objective is very different from the objective function of \(\beta \)-VAE  [29] in several ways: (1) our goal is to learn a diverse sampling distribution \(r_\psi (X|\mathbf {c})\) for a pretrained generative model rather than learning the generative model itself; (2) the first part of our objective is a diversifying term instead of a reconstruction term; (3) our objective function applies to most deep generative models, not just VAEs. In this objective, softening the hard KL constraints allows for a trade-off between the diversity and likelihood of the samples X. For small \(\beta \), \(r_\psi (\mathbf {z}_k|\mathbf {c})\) is allowed to deviate from \(p(\mathbf {z}_k)\) so that \(r_\psi (\mathbf {z}_1|\mathbf {c}), \ldots , r_\psi (\mathbf {z}_K|\mathbf {c})\) can attend to different regions in the latent space, as shown in Fig. 2 (latent space), to further improve sample diversity. For large \(\beta \), the objective focuses on minimizing the KL term so that \(r_\psi (\mathbf {z}_k|\mathbf {c})\approx p(\mathbf {z}_k)\) and \(r_\psi (\mathbf {x}_k|\mathbf {c})\approx p_\theta (\mathbf {x}_k|\mathbf {c})\), and thus the sample \(\mathbf {x}_k\) will have high likelihood under \(p_\theta (\mathbf {x}_k|\mathbf {c})\).

The overall DLow objective is defined as:

$$\begin{aligned} L_\text {DLow} = L_{\text {prior}} + \beta L_{\text {KL}}\,, \end{aligned}$$
(9)

where \(L_{\text {prior}}\) and \(L_{\text {KL}}\) are the first and second term in Eq. (8) respectively. In the following, we will discuss in detail how we design the latent mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\) and the diversity-promoting prior p(X).

Latent Mapping Functions. Each latent mapping \(\mathcal {T}_{\psi _k}\) transforms the Gaussian distribution \(p(\varvec{\epsilon })\) to the marginal latent distribution \(r_\psi (\mathbf {z}_k|\mathbf {c})\) for latent code \(\mathbf {z}_k\) where \(\mathcal {T}_{\psi _k}\) is also conditioned on \(\mathbf {c}\). As \(r_\psi (\mathbf {z}_k|\mathbf {c})\) should stay close to the Gaussian latent prior \(p(\mathbf {z}_k)\), it would be ideal if the mapping \(\mathcal {T}_{\psi _k}\) makes \(r_\psi (\mathbf {z}_k|\mathbf {c})\) also a Gaussian. Thus, we design \(\mathcal {T}_{\psi _k}\) to be an invertible affine transformation:

$$\begin{aligned} \mathcal {T}_{\psi _k}(\varvec{\epsilon }) = \mathbf {A}_k(\mathbf {c})\varvec{\epsilon } + \mathbf {b}_k(\mathbf {c}) \,, \end{aligned}$$
(10)

where the mapping parameters \(\psi _k = \{\mathbf {A}_k(\mathbf {c}), \mathbf {b}_k(\mathbf {c})\}\), \(\mathbf {A}_k \in \mathbb {R}^{n_z \times n_z}\) is a nonsingular matrix, \(\mathbf {b}_k \in \mathbb {R}^{n_z}\) is a vector, and \(n_z\) is the number of dimensions for \(\mathbf {z}_k\) and \(\varvec{\epsilon }\). As shown in Fig. 2, we use a K-head network \(Q_\gamma (\mathbf {c})\) to output \(\psi _1, \ldots , \psi _K\), and the parameters \(\gamma \) of the network \(Q_\gamma (\mathbf {c})\) are the parameters to be optimized with the DLow objective in Eq. (9).
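A minimal sketch of such a K-head network is given below. For simplicity it predicts a diagonal \(\mathbf {A}_k\) from an already-encoded context feature; this diagonal restriction, the layer sizes, and all names are assumptions of the sketch, not restrictions of the method (the paper allows any nonsingular \(\mathbf {A}_k\)).

```python
import torch
import torch.nn as nn

class MappingNet(nn.Module):
    """K-head network Q_gamma that outputs the affine parameters psi_k."""
    def __init__(self, n_c, n_z, K, n_h=256):
        super().__init__()
        self.n_z, self.K = n_z, K
        self.backbone = nn.Sequential(nn.Linear(n_c, n_h), nn.ReLU())
        self.head_A = nn.Linear(n_h, K * n_z)   # diagonal entries of A_k
        self.head_b = nn.Linear(n_h, K * n_z)   # offsets b_k

    def forward(self, c_feat):
        # c_feat: encoded past-motion feature vector of size n_c.
        h = self.backbone(c_feat)
        A_diag = self.head_A(h).view(self.K, self.n_z)
        b = self.head_b(h).view(self.K, self.n_z)
        return A_diag, b

def affine_map(eps, A_diag, b):
    # z_k = A_k eps + b_k for all k at once (Eq. (10), diagonal A_k).
    return A_diag * eps.unsqueeze(0) + b        # shape (K, n_z)
```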

Under the invertible affine transformation \(\mathcal {T}_{\psi _k}\), \(r_\psi (\mathbf {z}_k|\mathbf {c})\) becomes a Gaussian distribution \(\mathcal {N}(\mathbf {b}_k, \mathbf {A}_k\mathbf {A}_k^T)\). This allows us to compute the KL divergence terms in \(L_\text {KL}\) analytically:

$$\begin{aligned} \text {KL} (r_\psi (\mathbf {z}_k|\mathbf {c})\Vert p(\mathbf {z}_k)) = \frac{1}{2}\left( {\text {tr}}\left( \mathbf {A}_k\mathbf {A}_k^T\right) +\mathbf {b}_k^T\mathbf {b}_k -n_z - \log \det \left( \mathbf {A}_k\mathbf {A}_k^T\right) \right) . \end{aligned}$$
(11)

The KL divergence is minimized when \(r_\psi (\mathbf {z}_k|\mathbf {c}) = p(\mathbf {z}_k)\), which implies that \(\mathbf {A}_k\mathbf {A}_k^T = \mathbf {I}\) and \(\mathbf {b}_k = \mathbf {0}\). Geometrically, this means that \(\mathbf {A}_k\) is in the orthogonal group \(O(n_z)\), which includes all rotations and reflections in an \(n_z\)-dimensional space. Hence, any mapping \(\mathcal {T}_{\psi _k}\) that is a rotation or reflection will minimize the KL divergence. As mentioned before, there is a trade-off between diversity and likelihood in Eq. (9). To improve sample diversity (minimize \(L_\text {prior}\)) without compromising likelihood (KL divergence), we can optimize \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\) to be different rotations or reflections that map \(\varvec{\epsilon }\) to different feasible points \(\mathbf {z}_1, \ldots , \mathbf {z}_K\) in the latent space. This geometric understanding sheds light on the mapping space admitted by the hard KL constraints. In practice, we use soft KL constraints in the DLow objective to further enlarge the feasible mapping space, which allows us to achieve lower \(L_\text {prior}\) and better sample diversity.
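Under the diagonal-\(\mathbf {A}_k\) simplification of the sketch above, Eq. (11) reduces to the familiar closed-form KL between diagonal Gaussians, e.g.:

```python
import torch

def kl_to_standard_normal(A_diag, b):
    # KL( N(b_k, A_k A_k^T) || N(0, I) ) per head k; A_diag, b: (K, n_z).
    var = A_diag ** 2                        # diagonal of A_k A_k^T
    return 0.5 * (var.sum(dim=1) + (b ** 2).sum(dim=1)
                  - A_diag.shape[1] - torch.log(var).sum(dim=1))
```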

Diversity-Promoting Prior. In the DLow objective, a diversity-promoting prior p(X) on the joint sample distribution is used to guide the optimization of the latent mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\). With an energy-based formulation, the prior p(X) can be defined using an energy function E(X):

$$\begin{aligned} p(X) = \exp (-E(X)) / \mathcal {S}\,, \end{aligned}$$
(12)

where \(\mathcal {S}\) is a normalizing constant. Dropping the constant \(\mathcal {S}\), the first term in Eq. (8) can be rewritten as

$$\begin{aligned} L_{\text {prior}} = \mathbb {E}_{X \sim r_\psi (X|\mathbf {c})}[E(X)]\,. \end{aligned}$$
(13)

To promote sample diversity of X, we design an energy function \(E := E_d\) based on a pairwise distance metric \(\mathcal {D}\):

$$\begin{aligned} E_d(X) = \frac{1}{K(K-1)}\sum _{i=1}^K\sum _{j\ne i}^K \exp \left( -\frac{\mathcal {D}^2(\mathbf {x}_i, \mathbf {x}_j)}{\sigma _d}\right) , \end{aligned}$$
(14)

where we use the Euclidean distance for \(\mathcal {D}\) and an RBF kernel with scale \(\sigma _d\). Minimizing \(L_{\text {prior}}\) moves the samples towards a lower-energy (diverse) configuration. \(L_{\text {prior}}\) can be evaluated efficiently with the reparametrization trick  [36].
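A direct implementation of Eq. (14) is shown below; `X` holds the K motion samples flattened into vectors and `sigma_d` is the RBF scale. The tensor layout is an assumption of this sketch.

```python
import torch

def diversity_energy(X, sigma_d):
    # E_d(X): mean pairwise RBF similarity over the K flattened samples.
    K = X.shape[0]
    d2 = torch.cdist(X, X, p=2) ** 2         # (K, K) squared distances
    rbf = torch.exp(-d2 / sigma_d)
    off_diag = rbf.sum() - rbf.diagonal().sum()
    return off_diag / (K * (K - 1))
```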

Up to this point, we have described the proposed sampling method, DLow, for generating a diverse set of samples from a pretrained generative model \(p_\theta (\mathbf {x}|\mathbf {c})\). By introducing a common random variable \(\varvec{\epsilon }\), DLow allows us to generate correlated samples X. Moreover, by introducing learnable mapping functions \(\mathcal {T}_{\psi _k}\), we can model the joint sample distribution \(r_\psi (X|\mathbf {c})\) and impose structural constraints, such as diversity, on the sample set X which cannot be modeled by random sampling from the generative model.

4 Diverse Human Motion Prediction

Equipped with a method to generate diverse samples from a pretrained deep generative model, we now turn our attention to the task of diverse human motion prediction. Let the pose of a person be a V-dimensional vector of 3D joint positions. We use \(\mathbf {c} \in \mathbb {R}^{H \times V}\) to denote the past motion over H time steps and \(\mathbf {x} \in \mathbb {R}^{T \times V}\) to denote the future motion over a horizon of T time steps. Given a past motion \(\mathbf {c}\), the goal of diverse human motion prediction is to generate a diverse set of future motions \(X = \{\mathbf {x}_1, \ldots , \mathbf {x}_K\}\).

To capture the multi-modal distribution of the future trajectory \(\mathbf {x}\), we take a generative approach and use a conditional variational autoencoder (CVAE) to learn the future trajectory distribution \(p_\theta (\mathbf {x}|\mathbf {c})\). Here we use a CVAE for its training stability compared to other popular approaches such as CGANs, but other suitable deep generative models could also be used. The CVAE uses a variational lower bound  [34] as a surrogate for the intractable true data log-likelihood:

$$\begin{aligned} \mathcal {L}(\mathbf {x} ; \theta , \phi )= \; \mathbb {E}_{q_{\phi }(\mathbf {z} | \mathbf {x}, \mathbf {c})}\left[ \log p_{\theta }(\mathbf {x} | \mathbf {z}, \mathbf {c})\right] -{\text {KL}}\left( q_{\phi }(\mathbf {z} | \mathbf {x}, \mathbf {c}) \Vert p(\mathbf {z})\right) , \end{aligned}$$
(15)

where \(q_{\phi }(\mathbf {z} | \mathbf {x}, \mathbf {c})\) is a \(\phi \)-parametrized approximate posterior distribution. We use multivariate Gaussians for the prior, the posterior (encoder distribution) and the likelihood (decoder distribution): \(p(\mathbf {z})=\mathcal {N}(\mathbf {0}, \mathbf {I})\), \(q_{\phi }(\mathbf {z} | \mathbf {x}, \mathbf {c}) = \mathcal {N}(\varvec{\mu }, \text {Diag}(\varvec{\sigma }^2))\), and \(p_\theta (\mathbf {x}|\mathbf {z}, \mathbf {c}) = \mathcal {N}(\tilde{\mathbf {x}}, \alpha \mathbf {I})\), where \(\alpha \) is a hyperparameter. Both the encoder and decoder are implemented as recurrent neural networks (RNNs) (network architectures are given in the supplementary materials). The encoder network \(F_\phi \) outputs the parameters of the posterior distribution: \((\varvec{\mu }, \varvec{\sigma }) = F_\phi (\mathbf {x}, \mathbf {c})\); the decoder network \(G_\theta \) outputs the reconstructed future trajectory \(\tilde{\mathbf {x}} = G_\theta (\mathbf {z}, \mathbf {c})\). The CVAE is learned by jointly optimizing the encoder and decoder with Eq. (15).
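For reference, the negative of the lower bound in Eq. (15) can be computed as in the sketch below, where `encoder` and `decoder` stand for the RNN-based \(F_\phi \) and \(G_\theta \) (not reproduced here); the reconstruction scaling follows from the Gaussian decoder assumption up to an additive constant.

```python
import torch

def cvae_loss(encoder, decoder, x, c, alpha):
    mu, sigma = encoder(x, c)                       # q_phi(z | x, c)
    z = mu + sigma * torch.randn_like(sigma)        # reparametrization trick
    x_rec = decoder(z, c)
    recon = ((x - x_rec) ** 2).sum() / (2 * alpha)  # -log p_theta(x|z,c) + const
    kl = 0.5 * (sigma ** 2 + mu ** 2 - 1 - torch.log(sigma ** 2)).sum()
    return recon + kl                               # negative ELBO to minimize
```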

4.1 Diversity Sampling with DLow

Once the CVAE is learned, we follow the DLow framework proposed in Sect. 3 to optimize the network \(Q_\gamma \) and learn the latent mapping functions \(\mathcal {T}_{\psi _1}, \ldots , \mathcal {T}_{\psi _K}\). Before doing so, to fully leverage the DLow framework, we highlight one of DLow's key features: the design of the diversity-promoting prior p(X) in \(L_\text {prior}\) can be flexibly changed by modifying the underlying energy function E(X). This allows us to impose various structural constraints besides diversity on the sample set X. Below, we provide two examples of such prior designs that (1) improve sample accuracy or (2) enable new applications such as controllable motion prediction.

Reconstruction Energy. To ensure that the sample set X is both diverse and accurate, i.e., the ground truth future motion \(\hat{\mathbf {x}}\) is close to one of the samples in X, we can modify the prior’s energy function E in Eq. (12) by adding a reconstruction term \(E_r\):

$$\begin{aligned}&E(X) = E_d(X) + \lambda _r E_r(X)\,,\end{aligned}$$
(16)
$$\begin{aligned}&E_r(X) = \min _k \mathcal {D}^2(\mathbf {x}_k, \hat{\mathbf {x}})\,, \end{aligned}$$
(17)

where \(\lambda _r\) is a weighting factor and we use the Euclidean distance as the distance metric \(\mathcal {D}\). As DLow produces a correlated set of samples X instead of independent samples, the network \(Q_\gamma \) can learn to distribute the samples so that they are both diverse and accurate, covering the ground truth better. We use this prior design for our main experiments.
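Putting the pieces together, one DLow training step with the energy of Eqs. (16)-(17) could look as follows. This sketch reuses `diversity_energy`, `kl_to_standard_normal`, `MappingNet`, and the hypothetical frozen `generator` from the earlier sketches, so it is illustrative rather than a faithful reproduction of the authors' implementation.

```python
import torch

def dlow_step(mapping_net, generator, c_feat, c, x_gt,
              n_z, beta, sigma_d, lambda_r, optimizer):
    A_diag, b = mapping_net(c_feat)                # psi_1..psi_K from Q_gamma
    eps = torch.randn(n_z)
    Z = A_diag * eps.unsqueeze(0) + b              # Eq. (4), affine mappings
    X = torch.stack([generator(z, c) for z in Z])  # Eq. (5), frozen decoder
    X = X.flatten(start_dim=1)                     # (K, T*V) motion vectors
    e_d = diversity_energy(X, sigma_d)             # Eq. (14)
    e_r = ((X - x_gt.flatten()) ** 2).sum(dim=1).min()  # Eq. (17)
    l_prior = e_d + lambda_r * e_r                 # Eq. (16) plugged into Eq. (13)
    l_kl = kl_to_standard_normal(A_diag, b).sum()
    loss = l_prior + beta * l_kl                   # Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the parameters of `mapping_net` are updated by the optimizer; the generator stays frozen throughout, as emphasized in Sect. 3.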

Controllable Motion Prediction. Another possible design of the diversity-promoting prior p(X) is one that promotes diversity in a certain subspace of the sample space. In the context of human motion prediction, we may want certain body parts to move similarly but other parts to move differently. For example, we may want leg motion to be similar but upper-body motion to be diverse across motion samples. We call this task controllable motion prediction, i.e., finding a set of diverse samples that share some common features, which allows users or downstream systems to explore variations of a certain type of sample.

Formally, we divide the human joints into two sets, \(J_s\) and \(J_d\), and ask samples in X to have similar motions for joints \(J_s\) but diverse motions for joints \(J_d\). We can slice a motion sample \(\mathbf {x}_k\) into two parts: \(\mathbf {x}_k = \left( \mathbf {x}_k^s, \mathbf {x}_k^d\right) \) where \(\mathbf {x}_k^s\) and \(\mathbf {x}_k^d\) correspond to \(J_s\) and \(J_d\) respectively. Similarly, we can slice the sample set X into two sets: \(X_s = \{\mathbf {x}_1^s, \ldots , \mathbf {x}_K^s\}\) and \(X_d = \{\mathbf {x}_1^d, \ldots , \mathbf {x}_K^d\}\). We then define a new energy function E for the prior p(X):

$$\begin{aligned}&E(X) = E_d(X_d) + \lambda _s E_s(X_s) + \lambda _r E_r(X)\,,\end{aligned}$$
(18)
$$\begin{aligned}&E_s(X_s) = \frac{1}{K(K-1)}\sum _{i=1}^K\sum _{j\ne i}^K \mathcal {D}^2(\mathbf {x}_i^s, \mathbf {x}_j^s)\,, \end{aligned}$$
(19)

where we add another energy term \(E_s\) weighted by \(\lambda _s\) to minimize the motion distance between samples for joints \(J_s\), and we only compute the diversity-promoting term \(E_d\) using motions of joints \(J_d\). After optimizing \(Q_\gamma \) using the DLow objective with the new energy E, we can produce diverse samples X that have similar motions for joints \(J_s\).
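A sketch of this energy is given below; `idx_s` and `idx_d` are hypothetical index tensors selecting the flattened coordinates belonging to \(J_s\) and \(J_d\), and `diversity_energy` is reused from the sketch after Eq. (14).

```python
import torch

def similarity_energy(X_s):
    # E_s(X_s): mean pairwise squared distance over the "similar" joints J_s.
    K = X_s.shape[0]
    d2 = torch.cdist(X_s, X_s, p=2) ** 2
    return (d2.sum() - d2.diagonal().sum()) / (K * (K - 1))

def controllable_energy(X, x_gt, idx_s, idx_d, sigma_d, lambda_s, lambda_r):
    e_d = diversity_energy(X[:, idx_d], sigma_d)   # diverse joints J_d, Eq. (14)
    e_s = similarity_energy(X[:, idx_s])           # similar joints J_s, Eq. (19)
    e_r = ((X - x_gt) ** 2).sum(dim=1).min()       # reconstruction, Eq. (17)
    return e_d + lambda_s * e_s + lambda_r * e_r   # Eq. (18)
```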

Furthermore, we may also want to use a reference motion sample \(\mathbf {x}_\text {ref}\) to provide the desired features. To achieve this, we can treat \(\mathbf {x}_\text {ref}\) as the first sample \(\mathbf {x}_1\) in X. We first find its corresponding latent code \(\mathbf {z}_1 := \mathbf {z}_\text {ref}\) using the CVAE encoder: \(\mathbf {z}_\text {ref} =F_\phi ^{\varvec{\mu }}(\mathbf {x}_\text {ref}, \mathbf {c})\). We can then find the common variable \(\varvec{\epsilon }_\text {ref}\) for generating X using the inverse mapping \(\mathcal {T}_{\psi _1}^{-1}\):

$$\begin{aligned} \varvec{\epsilon }_\text {ref} = \mathcal {T}_{\psi _1}^{-1}(\mathbf {z}_\text {ref}) = \mathbf {A}_1^{-1}(\mathbf {z}_\text {ref} - \mathbf {b}_1)\,. \end{aligned}$$
(20)

With \(\varvec{\epsilon }_\text {ref}\) known, we can generate X that includes \(\mathbf {x}_\text {ref}\). In practice, we force \(\mathcal {T}_{\psi _1}\) to be an identity mapping to enforce \(r_\psi (\mathbf {z}_1|\mathbf {c}) = p(\mathbf {z}_1)\) so that \(r_\psi (\mathbf {z}_1|\mathbf {c})\) covers the posterior distribution of \(\mathbf {z}_\text {ref}\). Otherwise, if \(\mathbf {z}_\text {ref}\) lies outside of the high density region of \(r_\psi (\mathbf {z}_1|\mathbf {c})\), it may lead to low-likelihood \(\varvec{\epsilon }_\text {ref}\) after the inverse mapping.
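Recovering the shared noise for a reference motion (Eq. (20)) is a one-line inverse under the diagonal-\(\mathbf {A}_1\) simplification used in the sketches above; `encoder_mu` denotes the posterior-mean branch \(F_\phi ^{\varvec{\mu }}\) of the CVAE encoder, and the names are of our choosing.

```python
import torch

def epsilon_from_reference(encoder_mu, x_ref, c, A1_diag, b1):
    z_ref = encoder_mu(x_ref, c)        # z_ref = F_phi^mu(x_ref, c)
    # With T_psi_1 fixed to the identity (A_1 = I, b_1 = 0), eps_ref = z_ref.
    return (z_ref - b1) / A1_diag       # eps_ref = A_1^{-1} (z_ref - b_1)
```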

5 Experiments

Datasets. We perform evaluation on two public motion capture datasets: Human3.6M  [32] and HumanEva-I  [61]. Human3.6M is a large-scale dataset with 11 subjects (7 with ground truth) and 3.6 million video frames in total. Each subject performs 15 actions, and the motion is recorded at 50 Hz. Following previous work [47, 51, 54, 71], we adopt a 17-joint skeleton and train on five subjects (S1, S5, S6, S7, S8) and test on two subjects (S9 and S11). HumanEva-I is a relatively small dataset, containing only three subjects recorded at 60 Hz. We adopt a 15-joint skeleton  [54] and use the same train/test split provided in the dataset. By using both a large dataset with more variation in motion and a small dataset with less variation, we can better evaluate the generalization of our method to different types of data. For Human3.6M, we predict future motion for 2 s based on observed motion of 0.5 s. For HumanEva-I, we forecast future motion for 1 s given observed motion of 0.25 s.

Baselines. To fully evaluate our method, we consider three types of baselines: (1) Deterministic motion prediction methods, including ERD  [16] and acLSTM  [43]; (2) Stochastic motion prediction methods, including CVAE based methods, Pose-Knows  [66] and MT-VAE  [69], as well as a CGAN based method, HP-GAN  [6]; (3) Diversity-promoting methods for generative models, including Best-of-Many  [8], GMVAE  [14], DeLiGAN  [26], and DSF  [72].

Metrics. We use the following metrics to measure both sample diversity and accuracy. (1) Average Pairwise Distance (APD): average L2 distance between all pairs of motion samples, measuring diversity within the sample set, computed as \(\frac{1}{K(K-1)}\sum _{i=1}^K \sum _{j\ne i}^K \Vert \mathbf {x}_i - \mathbf {x}_j\Vert \). (2) Average Displacement Error (ADE): average L2 distance over all time steps between the ground truth motion \(\hat{\mathbf {x}}\) and the closest sample, computed as \(\frac{1}{T}\min _{\mathbf {x} \in X} \Vert \hat{\mathbf {x}} - \mathbf {x}\Vert \). (3) Final Displacement Error (FDE): L2 distance between the final ground truth pose \(\hat{\mathbf {x}}^T\) and the closest sample's final pose, computed as \(\min _{\mathbf {x} \in X} \Vert \hat{\mathbf {x}}^T - \mathbf {x}^T\Vert \). (4) Multi-Modal ADE (MMADE): the multi-modal version of ADE, which obtains multi-modal ground truth future motions by grouping similar past motions. (5) Multi-Modal FDE (MMFDE): the multi-modal version of FDE.

In these metrics, APD has been used to measure sample diversity  [3]. ADE and FDE are common metrics for evaluating sample accuracy in trajectory forecasting literature  [2, 25, 41]. MMADE and MMFDE  [72] are metrics used to measure a method’s ability to produce multi-modal predictions.
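The sample-based metrics above can be computed as in the following sketch, where `X` is assumed to have shape (K, T, V) and `x_gt` shape (T, V); MMADE and MMFDE follow the same pattern with multiple grouped ground-truth futures.

```python
import torch

def apd(X):
    # Average pairwise L2 distance between flattened motion samples.
    Xf = X.flatten(start_dim=1)
    d = torch.cdist(Xf, Xf, p=2)
    K = X.shape[0]
    return (d.sum() - d.diagonal().sum()) / (K * (K - 1))

def ade(X, x_gt):
    # (1/T) * distance of the closest sample to the ground-truth motion.
    T = x_gt.shape[0]
    d = (X - x_gt).flatten(start_dim=1).norm(dim=1)
    return d.min() / T

def fde(X, x_gt):
    # Distance between the final ground-truth pose and the closest final pose.
    return (X[:, -1] - x_gt[-1]).norm(dim=1).min()
```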

Table 1. Quantitative results on Human3.6M and HumanEva-I.

5.1 Quantitative Results

We summarize the quantitative results on Human3.6M and HumanEva-I in Table 1. The metrics are computed with a sample set size of \(K = 50\). For both datasets, our method, DLow, outperforms all baselines in terms of both sample diversity (APD) and accuracy (ADE, FDE) as well as coverage of the multi-modal ground truth (MMADE, MMFDE). Deterministic methods such as ERD  [16] and acLSTM  [43] do not perform well because they only predict one future trajectory, which can lead to mode averaging. Methods such as MT-VAE  [69] produce trajectory samples that lack diversity, so they fail to cover the multi-modal ground truth (indicated by high MMADE and MMFDE) despite having decently low ADE and FDE. We would also like to point out that the closest competitor, DSF  [72], can only generate one deterministic set of samples, while our method can produce multiple diverse sets by sampling \(\varvec{\epsilon }\). We also show how each metric changes with different values of K in the supplementary materials.

Table 2. Ablation study on Human3.6M and HumanEva-I.

Ablation Study. We further perform an ablation study (Table 2) to analyze the effects of the two energy terms \(E_d\) and \(E_r\) in Eq. (16). First, without the reconstruction term \(E_r\), the DLow variant achieves higher diversity (APD) at the cost of sample accuracy (ADE, FDE, MMADE, MMFDE). This is expected because the network only optimizes the diversity term \(E_d\) and focuses solely on diversity. Second, for the variant without \(E_d\), both sample diversity and accuracy decrease. It is intuitive to see why the diversity (APD) decreases. To see why the sample accuracy (ADE, FDE, MMADE, MMFDE) also decreases, note that a more diverse set of samples has a better chance of covering the ground truth. Finally, when we remove both \(E_d\) and \(E_r\) (i.e., only optimize \(L_\text {KL}\)), the results are the worst, which is expected.

Fig. 3. Qualitative results on Human3.6M and HumanEva-I.

Fig. 4. Varying \(\beta \) in DLow allows us to balance between diversity and likelihood.

5.2 Qualitative Results

To visually evaluate the diversity and accuracy of each method, we present a qualitative comparison in Fig. 3 where we render the start pose, the end pose of the ground truth future motion, and the end pose of 10 motion samples. Note that we do not model the global translation of the person, which is why some sitting motions appear to be floating. For Human3.6M, we can see that our method DLow can predict a wide array of future motions, including standing, sitting, bending, crouching, and turning, which cover the ground truth bending motion. In contrast, the baseline methods mostly produce perturbations of a single motion—standing. For HumanEva-I, we can see that DLow produces interesting variations of the fighting motion, while the baselines produce almost identical future motions.

Fig. 5. Effect of varying \(\varvec{\epsilon }\) on motion samples.

Fig. 6. Controllable motion prediction. DLow enables samples to have leg motion more similar to the reference.

Diversity vs. Likelihood. As discussed in the approach section, the \(\beta \) in Eq. (8) represents the trade-off between sample diversity and likelihood. To verify this, we trained three DLow models with different \(\beta \) (1, 10, 100) and visualize the motion samples generated by each model in Fig. 4. We can see that a larger \(\beta \) leads to less diverse samples which correspond to the major mode of the generator distribution, while a smaller \(\beta \) can produce more diverse motion samples covering other plausible yet less likely future motions.

Effect of Varying \(\varvec{\epsilon }\). A key difference between our method and DSF  [72] is that we can generate multiple diverse sets of samples while DSF can only produce a fixed diverse set. To demonstrate this, we show in Fig. 5 how the motion samples of DLow change with different \(\varvec{\epsilon }\). By comparing the four sets of motion samples, one can conclude that changing \(\varvec{\epsilon }\) varies each set of samples but preserves the main structure of each motion.

Controllable Motion Prediction. As highlighted before, the flexible design of the diversity-promoting prior enables a new application, controllable motion prediction, where we predict diverse motions that share some common features. We showcase this application by conducting an experiment using the energy function defined in Eq. (18). The network is trained so that the leg motion of the motion samples is similar while the upper-body motion is diverse. The results are shown in Fig. 6. We can see that given a reference motion, our method can generate diverse upper-body motion and preserve similar leg motion, while random samples from the CVAE cannot enforce similar leg motion. Please refer to the supplementary materials for more results.

6 Conclusion

We have proposed a novel sampling strategy, DLow, for deep generative models to obtain a diverse set of future human motions. We introduced learnable latent mapping functions which allowed us to generate a set of correlated samples, whose diversity can be optimized by a diversity-promoting prior. Experiments demonstrated superior performance in generating diverse motion samples. Moreover, we showed that the flexible design of the diversity-promoting prior further enables new applications, such as controllable human motion prediction. We hope that our exploration of deep generative models through the lens of diversity will encourage more work towards understanding the complex nature of modeling and predicting future human behavior.