
1 Introduction

Fig. 1.

Our model learns a disentangled representation of shape and pose for meshes. In the middle are two source subjects taken from the AMASS and SMAL datasets respectively. On the left are meshes with the same pose but varying shapes, which we construct by transferring shape codes extracted from other meshes using our method. On the right are meshes with the same subject identity but varying poses, which we construct by transferring pose codes.

Parameterizing 3D mesh deformation with different factors, such as pose and shape, is crucial in computer graphics for efficient 3D shape manipulation, and in computer vision for extracting structure and understanding human and animal motion in videos.

Although parametric models of meshes such as SCAPE [1], SMPL [25], Dyna [31] and Adam [20] for bodies, MANO [34] for hands, SMAL [45] for animals, and the Basel Face Model [29], FLAME [23] and their combinations [30] for faces have been extremely useful for many applications, learning them is a difficult task that requires expert knowledge and manual intervention. SMPL, for example, is learned from a set of meshes in correspondence, and requires defining a skeleton hierarchy, manually initializing blendweights to bind each vertex to body parts, carefully unposing meshes, and a training procedure with several stages.

In this paper, we address the problem of unsupervised disentanglement of pose and shape for 3D meshes. Like other models such as SMPL, our method requires a dataset of meshes registered to a template for training. But unlike other methods, we learn to factor pose and shape based on the data alone without making assumptions on the number of parts, the skeleton or the kinematic chain. Our model only requires that the same shape can be seen in different poses, which is available for datasets collected from scanners or motion capture devices. We call our model unsupervised because we do not make use of meshes annotated with pose or shape codes, and we make no assumptions on the underlying parts or skeleton. This flexibility makes our model applicable to a wide variety of objects, such as humans, hands, animals and faces.

Fig. 2.

A schematic overview of the shape and pose disentangling mesh auto-encoder. The input mesh \(\mathbf {X}\) is separately processed by a shape branch and a pose branch to get shape code \(\varvec{\beta }\) and pose code \(\varvec{\theta }\). The two latent codes are subsequently concatenated and decoded to the reconstructed mesh \(\mathbf {\hat{X}}\) (top). The shape codes of two deformations of the same subject are swapped to reconstruct each other (bottom left). The pose code of one subject is used to reconstruct itself after a cycle of decoding-encoding (bottom right).

Unsupervised disentanglement from meshes is a challenging task. Most datasets  [23, 25, 27, 34] contain the same shape in different poses, e.g., they capture a human or an animal moving. However, real world datasets do not contain two different shapes in the same pose – two different humans, or animals are highly unlikely to be captured performing the exact same pose or motion. This makes disentangling pose and shape from data difficult.

We achieve disentanglement with an auto-encoding neural network based on two key observations. First, we should be able to auto-encode a mesh into two codes (pose and shape), which we achieve with two separate encoder branches, see Fig. 2 (top). Second, given two meshes \(\mathbf {X}_{1}^{s}\) and \(\mathbf {X}_{2}^{s}\) of the same subject s in two different poses, we should be able to swap their shape codes and reconstruct exactly the two input meshes. This is imposed with a cross-consistency loss, see Fig. 2 (lower left). These two constraints, however, are not sufficient and lead to degenerate solutions, with shape information flowing into the pose code.

If we had access to two different shapes in the exact same pose, we could impose an analogous cross-consistency loss on the pose. But as mentioned, such data is not available. Our idea is to generate such pairs of different shapes with the exact same pose on the fly during training with our disentangling network.

Given two meshes with different shapes and poses \(\mathbf {X}_1^s\) and \(\mathbf {X}^t\), we generate a proxy mesh \(\tilde{\mathbf {X}}^{t}\) with the pose of mesh \(\mathbf {X}^{s}_1\) and the shape of mesh \(\mathbf {X}^t\) within the training loop. If disentanglement is effective, we should recover the original pose code from the proxy mesh and mix it with the shape code of mesh \(\mathbf {X}_1^s\) to decode it into mesh \(\mathbf {X}_{1}^{s}\). We ask the network to satisfy this constraint with a self-consistency loss. For the self-consistency constraint to work well, the proxy mesh must not contain any shape characteristic of mesh \(\mathbf {X}_1^s\), which occurs if the pose code carries shape information. To resolve this, we replace the initially decoded proxy mesh \(\mathbf {\tilde{X}}^t\) with an As-Rigid-As-Possible [38] approximation. Self-consistency is best understood with the illustration in Fig. 2 (lower right).

Our experiments show that these two simple—but not immediately obvious—losses allow the network to discover independent pose and shape factors directly from 3D meshes. To demonstrate the wide applicability of our method, we use it to disentangle pose and shape in four different publicly available datasets of full body humans [27], hands [34], faces [23] and animals [45]. We show several downstream applications, such as pose transfer, pose-aware shape retrieval, and pose and shape interpolation. We will make our code and model publicly available so that researchers can learn their own models from data.

2 Related Work

Disentangled Representations for 2D Images. The motivation behind feature disentanglement is that images can be synthesized from individual factors of variation. A pioneering work for disentanglement learning is InfoGAN [7], which maximizes a variational lower bound on the mutual information between the latent code and the generator distribution. Beta-VAE [15] and its follow-up work [6] penalized a KL divergence term to reduce variable correlations. Similarly, Kim et al. [21] encouraged a factorial marginal distribution of the latent variables.

Another line of work incorporates Spatial Transformer Networks [17] to explicitly model object deformations [26, 36, 37]. Iosanos et al. [35] recovered a 3D deformable template from a set of images and transformed it to fit image coordinates. Recently, adversarial training has been exploited to enforce feature disentanglement [10, 11, 24, 28, 40]. Our work has similarities with [16, 43], where latent features are mixed and then separated. But unlike them, our method does not depend on auxiliary classifiers or adversarial losses, which are notoriously hard to train and tune. The idea of swapping codes (cross-consistency) to factor out appearance or identity has also been used in [33], but we additionally introduce the self-consistency loss, which is critical for disentanglement. Furthermore, all these works focus on 2D images while we focus on disentanglement for 3D meshes.

Deep Learning for 3D Reconstructions. With the advances in geometric deep learning, a number of models have been proposed to analyse and reconstruct 3D shapes. Particularly related to us are mesh auto-encoders. Tan et al. [41] designed a mesh variational auto-encoder using fully-connected layers. Instead of operating directly on mesh vertices, the model deals with a rotation-invariant mesh representation [12]. Ranjan et al. [32] generalized downsampling and upsampling layers to meshes by collapsing unimportant edges based on a quadric error measure. DEMEA [42] performs mesh deformation in a low-dimensional embedded deformation layer which helps reduce reconstruction artifacts. These models do not separate shape from pose when embedding meshes into the latent space. Jiang et al. [19] decomposed 3D facial meshes into an identity code and an expression code, but their approach needs supervision on expression labels. Similarly, Jiang et al. [18] trained a disentangled human body model in a hierarchical manner with a predefined anatomical segmentation. Deng et al. [9] condition human shape occupancy on pose, but require pose labels for training. Levinson et al. [22] trained on pairs of shapes in the exact same pose, which is unrealistic for non-synthetic datasets. LIMP [8] explicitly enforced that changes in pose preserve pairwise geodesic distances. Although it works well for small datasets, the intensive computation makes it unsuitable for larger datasets. The Geometrically Disentangled VAE (GDVAE) [3] is capable of learning shape and pose from pointclouds in a completely unsupervised manner. GDVAE exploits the fact that isometric deformations preserve the spectrum of the Laplace-Beltrami Operator (LBO) to disentangle shape. While we require meshes in correspondence and GDVAE does not, we obtain significantly better disentanglement and reconstruction quality. Furthermore, in practice GDVAE uses meshes in correspondence to compute the LBO spectrum of each mesh. While the spectrum should be invariant to connectivity, in practice it is known to be very sensitive to noise and different discretizations. Instead of relying on the LBO spectrum, we assume the subject identity is known, which requires no extra labelling, and impose shape and pose consistency by swapping and mixing codes during training.

3D Deformation Transfer. Traditional deformation transfer methods solve an optimization problem for each pair of source and target meshes. The seminal work of Sumner et al. [39] transfers deformation via per-triangle affine transformations, assuming correspondence. While general, this approach produces artifacts when transferring between significantly different shapes. Ben-Chen et al. [4] formulated deformation transfer as a space deformation problem. Recently, Lin et al. [13] achieved automatic deformation transfer between two different domains of meshes without correspondence. They build an auto-encoder for each of the source and target domains, and deformation transfer is performed in latent space by a cycle-consistent adversarial network [44]. For every new pair of shapes a new model needs to be trained, whereas we train on multiple shapes simultaneously, and our training procedure is much simpler. These approaches focus on transferring pose deformations between pairs of meshes, whereas our ability to transfer deformation is a natural consequence of the learned disentangled representation.

3 Method

Given a set of meshes with the same topology, our goal is to learn a latent representation with disentangled shape and pose components. In our context, we refer to shape as the intrinsic geometric properties of a surface (height, limb lengths, body shape etc.), which remain invariant under approximately isometric deformations. We refer to the other properties that vary with motion as pose.

Our model is built on three mild assumptions. i) All the meshes are registered and have the same connectivity. ii) There are enough shape and pose variations in the training set to cover the latent space. iii) The same shape can be seen in different poses, which naturally occurs when capturing a body, face, hand or animal in motion. Note that models like SMPL [25] are built on the same assumptions, but unlike those models we do not hand-define the number of parts, the skeleton, or the surface-to-part associations.

3.1 Overview

Our model follows the classical auto-encoder architecture. The encoder function \(f_\mathrm {enc}\) embeds input mesh \(\mathbf {X}\) into latent shape space and latent pose space: \(f_\mathrm {enc}(\mathbf {X}) = \left( f_{\varvec{\beta }}(\mathbf {X}), f_{\varvec{\theta }}(\mathbf {X})\right) = (\varvec{\beta }, \varvec{\theta })\), where \(\varvec{\beta }\) denotes shape code, and \(\varvec{\theta }\) denotes pose code. The encoder consists of two branches for shape \(f_{\varvec{\beta }}(\mathbf {X})= \varvec{\beta }\) and for pose \(f_{\varvec{\theta }}(\mathbf {X})=\varvec{\theta }\) respectively, which are independent and do not share weights. The decoder function \(g_\mathrm {dec}\) takes shape and pose codes as inputs, and transforms them back to the corresponding mesh: \(g_\mathrm {dec}(\varvec{\beta }, \varvec{\theta })=\tilde{\mathbf {X}}\).
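To make the interface concrete, below is a minimal PyTorch sketch of the two-branch auto-encoder, assuming each mesh is given as a \((V, 3)\) vertex tensor. The real model uses spiral convolutions with mesh down/upsampling (Sect. 3.5); the plain fully-connected layers and the hidden width of 256 used here are placeholders for illustration only.

```python
import torch
import torch.nn as nn

class DisentanglingAE(nn.Module):
    """Minimal sketch of the two-branch shape/pose auto-encoder interface."""
    def __init__(self, num_verts, shape_dim=16, pose_dim=112):
        super().__init__()
        d_in = num_verts * 3
        # Independent branches (no weight sharing) for shape and pose codes.
        self.f_beta = nn.Sequential(nn.Linear(d_in, 256), nn.LeakyReLU(0.02),
                                    nn.Linear(256, shape_dim))
        self.f_theta = nn.Sequential(nn.Linear(d_in, 256), nn.LeakyReLU(0.02),
                                     nn.Linear(256, pose_dim))
        # Decoder maps the concatenated codes back to vertex positions.
        self.g_dec = nn.Sequential(nn.Linear(shape_dim + pose_dim, 256),
                                   nn.LeakyReLU(0.02), nn.Linear(256, d_in))
        self.num_verts = num_verts

    def encode(self, X):                       # X: (B, V, 3)
        x = X.flatten(1)
        return self.f_beta(x), self.f_theta(x)

    def decode(self, beta, theta):             # codes -> (B, V, 3)
        out = self.g_dec(torch.cat([beta, theta], dim=-1))
        return out.view(-1, self.num_verts, 3)
```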

The challenge is to disentangle pose and shape in an unsupervised manner, without supervision on \(\varvec{\theta }\) or \(\varvec{\beta }\) coming from an existing parametric model. We achieve this with a cross-consistency and a self-consistency loss during training. An overview of our approach is given in Fig. 2.

3.2 Cross-Consistency

Given two meshes, \(\mathbf {X}_{1}^{s}\) and \(\mathbf {X}_{2}^{s}\) (superscript indicates subject identity and subscript labels individual meshes of a given subject), of subject s in different poses we should be able to swap their shape codes and recover exactly the same meshes.

We randomly sample a mesh pair \((\mathbf {X}_1^s, \mathbf {X}_2^s)\) of the same subject from the training set and decompose it into \((\varvec{\beta }_{1}^{s}, \varvec{\theta }_{1}^{s})\) and \((\varvec{\beta }_{2}^{s}, \varvec{\theta }_{2}^{s})\) respectively. Cross-consistency implies that the original meshes should be recovered by swapping the shape codes \(\varvec{\beta }_{1}^{s}\) and \(\varvec{\beta }_{2}^{s}\):

$$\begin{aligned} g_\mathrm {dec}(\varvec{\beta }_{2}^{s}, \varvec{\theta }_{1}^{s})&=\mathbf {X}_{1}^{s} \end{aligned}$$
(1)
$$\begin{aligned} g_\mathrm {dec}(\varvec{\beta }_{1}^{s}, \varvec{\theta }_{2}^{s})&=\mathbf {X}_{2}^{s} \end{aligned}$$
(2)

Since the cross-consistency constraint holds in both directions, optimizing one loss term suffices. The loss is defined as

$$\begin{aligned} \mathcal {L}_C = {\left\Vert {g_\mathrm {dec}\left( f_{\varvec{\beta }}(\mathbf {X}_2^s), f_{\varvec{\theta }}(\mathcal {T}(\mathbf {X}_1^s))\right) - \mathbf {X}_1^s} \right\Vert }_{1}, \end{aligned}$$
(3)

where \(\mathcal {T}\) is a family of pose invariant mesh transformations such as random scaling and uniform noise corruption, which serves as data augmentation to improve generalization and robustness of the pose branch. The cross-consistency is useful to make the model aware of the distinction between shape and pose, but as we discussed in the introduction, it alone does not guarantee disentangled representations. This motivates our self-consistency loss, which we explain next.
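As an illustration, the cross-consistency loss of Eq. (3) could be implemented as follows, reusing the auto-encoder sketch above. The concrete form of the augmentation \(\mathcal {T}\) (scaling range and noise magnitude) is an assumption, not values reported in the paper.

```python
def augment(X, noise_std=0.01, scale_range=(0.9, 1.1)):
    # Pose-invariant transformations T: random global scaling plus small uniform noise.
    s = torch.empty(X.shape[0], 1, 1, device=X.device).uniform_(*scale_range)
    return X * s + noise_std * (2.0 * torch.rand_like(X) - 1.0)

def cross_consistency_loss(model, X1, X2):
    # X1, X2: two poses of the same subject, each (B, V, 3).
    beta2, _ = model.encode(X2)                # shape code from the other pose
    _, theta1 = model.encode(augment(X1))      # pose code from the augmented target
    return (model.decode(beta2, theta1) - X1).abs().mean()   # L1 loss of Eq. (3)
```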

3.3 Self-consistency

Having pairs of meshes with different shapes and the exact same pose would simplify the task, but such data is never available in real world datasets. The key idea of self-consistency is to generate such mesh pairs consisting of two different shapes in the same pose on the fly during the training process.

We sample a triplet \((\mathbf {X}_{1}^{s}, \mathbf {X}_{2}^{s}, \mathbf {X}^{t})\), where mesh \(\mathbf {X}^t\) shares neither shape nor pose with \((\mathbf {X}_1^s, \mathbf {X}_2^s)\). We combine the shape from \(\mathbf {X}^t\) and pose from \(\mathbf {X}_1^s\) to generate an intermediate mesh \(\mathbf {\tilde{X}}^t = g_\mathrm {dec}(\varvec{\beta }^t, \varvec{\theta }_1^s)\).

Since \(\mathbf {\tilde{X}}^t\) should have the same pose \(\varvec{\tilde{\theta }}^{t} = f_{\varvec{\theta }}(\mathbf {\tilde{X}}^t)\) as \(\mathbf {X}_1^s\), and \(\mathbf {X}_2^s\) has the same shape \(\varvec{\beta }_{2}^{s}\) as \(\mathbf {X}_1^s\), we should be able to reconstruct \(\mathbf {X}_1^s\) with

$$\begin{aligned} g_\mathrm {dec}\left( \varvec{\beta }_{2}^{s}, \varvec{\tilde{\theta }}^{t}\right)&=\mathbf {X}_{1}^{s}. \end{aligned}$$
(4)

The intuition behind this constraint is that the encoding and decoding of pose code should remain self-consistent with changes in the shape.

Although this loss alone is already quite effective, degeneracy can occur in the network if the proxy mesh \(\tilde{\mathbf {X}}^{t}\) inherits shape attributes of \(\mathbf {X}_1^s\) through the pose code. We make sure this does not happen by incorporating ARAP deformation  [38] within the training loop.

As-rigid-as-possible Deformation. We use ARAP to deform \(\mathbf {X}^{t}\) to match the pose of the network prediction \(\mathbf {\tilde{X}}^{t}\) while preserving the original shape as much as possible,

$$\begin{aligned} \mathbf {\tilde{X}}^{t'} = \text {ARAP}\left( \mathbf {X}^{t}, \mathbf {\tilde{X}}^{t}\right) , \end{aligned}$$
(5)

where \(\mathbf {\tilde{X}}^{t'}\) is the desired deformed shape, see Fig. 3. Specifically, we deform \(\mathbf {X}^{t}\) to match a few randomly selected anchor points of the network prediction \(\mathbf {\tilde{X}}^{t}\). ARAP is a detail-preserving surface deformation algorithm that encourages locally rigid transformations. Note that we can successfully apply ARAP because the shape of \(\mathbf {\tilde{X}}^{t}\) should converge to the shape of \(\mathbf {X}^{t}\) during training. Hence, when only pose is different in the pair \((\mathbf {X}^{t},\mathbf {\tilde{X}}^{t})\), the ARAP loss approaches zero, and disentanglement is successful.

In the following, we provide a brief introduction to the optimization procedure of ARAP. We refer interested readers to  [38] for more details. Let \(\mathbf {X}\) be a triangle mesh embedded in \(\mathbb {R}^3\) and \(\mathbf {\tilde{X}}\) be the deformed mesh. Each vertex i has an associated cell \(\mathcal {C}_i\), which covers the vertex itself and its one-ring neighbourhood \(\mathcal {N}(i)\). If a cell \(\mathcal {C}_i\) is rigidly transformed to \(\mathcal {\tilde{C}}_{i}\), the transformation can be represented by a rotation matrix \(\mathbf {R}_i\) satisfying \(\varvec{\tilde{e}}_{ij}=\mathbf {R}_i\textit{\textbf{e}}_{ij}\) for every edge \(\textit{\textbf{e}}_{ij} = (\textit{\textbf{v}}_j-\textit{\textbf{v}}_i)\) incident at vertex \(\textit{\textbf{v}}_i\). If \(\mathcal {\tilde{C}}_{i}\) and \(\mathcal {C}_{i}\) cannot be rigidly aligned, then \(\mathbf {R}_i\) is the optimal rotation matrix that aligns \(\mathcal {C}_i\) and \(\mathcal {\tilde{C}}_{i}\) with minimal non-rigid distortion. This objective can be formulated as follows.

$$\begin{aligned} E\left( \mathcal {C}_i,\mathcal {\tilde{C}}_{i} \right) =\sum _{j\in \mathcal {N}\left( i \right) }{w_{ij}\left\Vert \varvec{\tilde{e}}_{ij}-\mathbf {R}_i\textit{\textbf{e}}_{ij} \right\Vert ^2} \end{aligned}$$
(6)

where \(w_{ij}\) adjusts the importance of each edge. ARAP deformation minimizes Eq. (6) for all vertices i by an iterative procedure. It alternates between first estimating the current optimal rotation \(\mathbf {R}_i\) for cell \(\mathcal {C}_i\) while keeping the vertices \(\varvec{\tilde{v}}_i\) (and hence the edges \(\varvec{\tilde{e}}_{ij}\)) fixed, and second computing the updated vertices \(\varvec{\tilde{v}}_i\) based on the updated \(\mathbf {R}_i\). Let the covariance matrix \(\mathbf {S}_i=\sum _{j\in \mathcal {N}\left( i \right) }{w_{ij}\textit{\textbf{e}}_{ij}\varvec{\tilde{e}}_{ij}^{T}}\) have the singular value decomposition \(\mathbf {S}_i=\mathbf {U}_i\mathbf {\Sigma }_i\mathbf {V}_i^{T}\). Then the rotation \(\mathbf {R}_i\) can be analytically calculated as \(\mathbf {R}_i = \mathbf {V}_i\mathbf {U}_i^T\) up to a change of sign [2]. Fixing \(\mathbf {R}_i\) simplifies Eq. (6) to a weighted least squares problem over the vertices \(\varvec{\tilde{v}}_i\) of the form

Fig. 3.

ARAP corrects artifacts in network prediction caused by embedding shape information in pose code. Notice how the circled region in the initial prediction resembles that of the pose source. This is rectified after applying ARAP for only 1 iteration.

$$\begin{aligned} \sum _{j\in \mathcal {N}\left( i \right) }{w_{ij}\left( \varvec{\tilde{v}}_i-\varvec{\tilde{v}}_j \right) }\,\,=\sum _{j\in \mathcal {N}\left( i \right) }{\frac{w_{ij}}{2}\left( \mathbf {R}_i+\mathbf {R}_j \right) \left( \textit{\textbf{v}}_i-\textit{\textbf{v}}_j \right) }, \end{aligned}$$
(7)

which can be solved efficiently by a sparse Cholesky solver.

Note that Eq. (7) is an underdetermined problem, so at least one anchor vertex needs to be fixed to obtain a unique solution. We take \(\mathbf {\tilde{X}}^{t}\) as an initial guess and randomly fix a small number of anchor vertices across its surface that should be matched by deforming the source mesh \(\mathbf {X}^t\) (i.e. \(\varvec{\tilde{v}}^{t}_j:=\textit{\textbf{v}}^{t}_j\) for all anchor vertices \(\textit{\textbf{v}}^{t}_j\)). There is a tradeoff when choosing the number of anchor vertices: fixing too many does not improve the shape much, while fixing too few can cause the pose to deviate. We found that fixing 1%–10% of the vertices gives good results in most cases. For training efficiency, we only run ARAP for one iteration. This is sufficient since ARAP runs on every input training batch. We also adopted uniform weighting instead of cotangent weighting for \(w_{ij}\) and did not observe any performance drop under this choice.
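For reference, a single ARAP iteration with uniform weights and fixed anchor vertices could be sketched as follows in NumPy/SciPy. It follows the local/global procedure described above, but is an illustrative implementation rather than the authors' code.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def arap_one_iteration(V, V_tilde, neighbors, anchors):
    """One local/global ARAP step with uniform weights (w_ij = 1).

    V:         (n, 3) source vertices (the shape source X^t)
    V_tilde:   (n, 3) initial guess (the network prediction)
    neighbors: list of one-ring index arrays, one per vertex
    anchors:   vertex indices pinned to their positions in V_tilde
    """
    n = V.shape[0]
    # Local step: per-cell optimal rotation from the SVD of S_i.
    R = np.zeros((n, 3, 3))
    for i in range(n):
        E = V[neighbors[i]] - V[i]               # source edges e_ij
        E_def = V_tilde[neighbors[i]] - V_tilde[i]
        U, _, Vt = np.linalg.svd(E.T @ E_def)    # S_i = sum_j e_ij ~e_ij^T
        Ri = Vt.T @ U.T
        if np.linalg.det(Ri) < 0:                # resolve the sign ambiguity
            Vt[-1] *= -1
            Ri = Vt.T @ U.T
        R[i] = Ri
    # Global step: assemble and solve the linear system of Eq. (7).
    L = sp.lil_matrix((n, n))
    b = np.zeros((n, 3))
    for i in range(n):
        for j in neighbors[i]:
            L[i, i] += 1.0
            L[i, j] -= 1.0
            b[i] += 0.5 * (R[i] + R[j]) @ (V[i] - V[j])
    for a in anchors:                            # pin anchors to the prediction
        L[a, :] = 0.0
        L[a, a] = 1.0
        b[a] = V_tilde[a]
    A = L.tocsr()                                # sparse solve, one column per coordinate
    return np.stack([spla.spsolve(A, b[:, k]) for k in range(3)], axis=1)
```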

Self-consistency Loss. Let \(\mathbf {\tilde{X}}^{t'}\) be the output of ARAP, which should have the pose of \(\mathbf {X}_1^s\) with the shape of \(\mathbf {X}^t\). We enforce the equality in Eq. (4) with the following self-consistency loss:

$$\begin{aligned} \mathcal {L}_S = {\left\Vert {g_\mathrm {dec}\left( f_{\varvec{\beta }}(\mathbf {X}_2^s), f_{\varvec{\theta }}(\mathcal {T}(\mathbf {\tilde{X}}^{t'}))\right) - \mathbf {X}_1^s} \right\Vert }_{1} \end{aligned}$$
(8)

where again, the intuition is that the extracted pose code \(f_{\varvec{\theta }}(\mathcal {T}(\mathbf {\tilde{X}}^{t'}))\) should be independent of shape. Note that while ARAP is computed on the fly during training, we do not backpropagate through it.
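Putting the pieces together, a sketch of the self-consistency term with ARAP in the loop might look as follows, reusing `augment` and `arap_one_iteration` from the sketches above. The 5% anchor fraction and the single-sample, CPU-only handling are simplifying assumptions for illustration.

```python
def self_consistency_loss(model, X1, X2, Xt, neighbors, anchor_frac=0.05):
    # X1, X2: two poses of subject s; Xt: a mesh of another subject t.
    # All tensors are assumed to be (1, V, 3) and on the CPU so the NumPy
    # ARAP routine above can be reused directly.
    with torch.no_grad():                        # no gradients flow through proxy/ARAP
        beta_t, _ = model.encode(Xt)
        _, theta1 = model.encode(X1)
        proxy = model.decode(beta_t, theta1)[0]  # pose of X1, shape of Xt
        n = proxy.shape[0]
        anchors = torch.randperm(n)[: max(1, int(anchor_frac * n))].tolist()
        proxy = arap_one_iteration(Xt[0].numpy(), proxy.numpy(), neighbors, anchors)
        proxy = torch.as_tensor(proxy, dtype=X1.dtype).unsqueeze(0)
    beta2, _ = model.encode(X2)
    _, theta_p = model.encode(augment(proxy))    # pose re-extracted from the proxy
    return (model.decode(beta2, theta_p) - X1).abs().mean()   # L1 loss of Eq. (8)
```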

3.4 Loss Terms and Objective Function

The overall objective we seek to optimize is

$$\begin{aligned} \mathcal {L} = \lambda _C\mathcal {L}_C + \lambda _S\mathcal {L}_S \end{aligned}$$
(9)

In all our experiments we set \(\lambda _C = \lambda _S = 0.5\). We also experimented with edge length constraints and other local shape preserving losses, but observed no benefit or worse performance.
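Based on the loss sketches above, one optimization step combining the two terms with these weights could look as follows.

```python
def training_step(model, optimizer, X1, X2, Xt, neighbors, lam_c=0.5, lam_s=0.5):
    # One gradient step on the combined objective of Eq. (9).
    optimizer.zero_grad()
    loss = lam_c * cross_consistency_loss(model, X1, X2) \
         + lam_s * self_consistency_loss(model, X1, X2, Xt, neighbors)
    loss.backward()
    optimizer.step()
    return loss.item()
```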

3.5 Implementation Details

We preprocess the input meshes by centering them around the origin. For the disentangling mesh auto-encoder, we use an architecture similar to [5]. In particular, we adopt the spiral convolution operator, which aggregates and orders local vertices along a spiral trajectory. Each encoder branch consists of four consecutive mesh convolution layers and downsampling layers. The last layer is fully-connected and maps the flattened features to the latent space. The decoder architecture mirrors the encoder, except that mesh downsampling layers are replaced by upsampling layers. We follow the practice in [32], which downsamples and upsamples meshes based on quadric error metrics. We choose leaky ReLU with a negative slope of 0.02 as the activation function. The model is optimized with the ADAM solver and a cosine annealing learning rate scheduler.
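For orientation, the optimizer and scheduler setup described here could be instantiated as follows; the learning rate, schedule length and vertex count are placeholders rather than values reported in the paper.

```python
# Hypothetical setup: 6890 is the SMPL vertex count, lr and T_max are placeholders.
model = DisentanglingAE(num_verts=6890)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
```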

4 Experiments

In this section, we evaluate our proposed approach on a variety of datasets and tasks. We conduct quantitative evaluations on the AMASS and COMA datasets. We compare our model to the state-of-the-art unsupervised disentangling models proposed in [3, 19]. We also perform an ablation study to evaluate the importance of each loss. In addition, we qualitatively show pose transfer results on four datasets (AMASS, SMAL, COMA and MANO) to demonstrate the wide applicability of our method. Finally, we show the usefulness of our disentangled codes for the tasks of shape and pose retrieval and motion sequence interpolation.

4.1 Datasets

We use the following four publicly available datasets to evaluate our method:

AMASS [27] is a large human motion sequence dataset that unifies 15 smaller datasets by fitting the SMPL body model to motion capture markers. It consists of 344 subjects and more than 10k motions. We follow the protocol splits and sample 1 out of every 100 frames from the middle 90% of each sequence.

SMAL [45] is a parametric articulated body model for quadrupedal animals. Since this dataset does not contain sufficient scans, we synthesize SMAL shapes and poses using the procedure in [14], obtaining 100 shapes with 160 distinct poses each. We use a 9:1 data split.

MANO [34] is the 3D hand model used to fit AMASS together with SMPL. We treat it as a standalone dataset since its training scans contain more pose variations. To keep things simple without losing generality, we train the model on right hands and flipped left hands. The official training set contains fewer than 2000 samples, hence we augment it by sampling MANO shape and pose parameters from a Gaussian distribution.

COMA  [32] is a facial expression dataset consisting of 12 subjects under 12 types of extreme expressions. We follow the same splits as in  [32].

Table 1. AMASS pose transfer errors for different models. The numbers are in millimeters. The error of our model is close to the supervised baseline, indicating that our self-consistency loss is a good substitute for pose supervision.

4.2 Quantitative Evaluation

AMASS Pose Transfer. In the following, we show quantitative results of our model trained on AMASS. Since AMASS comes with SMPL parameters, we utilize the SMPL model to generate pseudo-groundtruth for evaluating pose-transferred reconstructions. We sample a subset of paired meshes (with different shapes and poses) along with their pose-transferred pseudo-groundtruth. The error is calculated between model-predicted transfer results and the pseudo-groundtruth. We use 128-dimensional latent codes, 16 for shape and 112 for pose.

We compare our method to the Geometric Disentanglement Variational Autoencoder (GDVAE) [3], a state-of-the-art unsupervised method which can disentangle pose and shape from 3D pointclouds. It is important to note that a fair comparison to GDVAE is not possible as we make different assumptions. They do not assume mesh correspondence while we do. However, GDVAE uses LBO spectra computed on meshes which are in perfect correspondence. Since the LBO spectrum is sensitive to noise and the type of discretization, the performance of GDVAE could deteriorate significantly when it is computed on meshes that are not in correspondence. Furthermore, we assume we can see the same shape in different poses, but as argued earlier, this is the typical case in datasets with dynamics. Hence, despite the differences in assumptions, we think the comparison is meaningful.

We report the one-side Chamfer distance for GDVAE (i.e., average distance between every point and its nearest point on groundtruth surface) and report the vertex-to-vertex error for our method. Note that the Chamfer distance would be lower for our method, but we want the metric to reflect how well we predict the semantics (body part locations) as well.

We also compare our method with a supervised baseline, which leverages pose labels from SMPL. In that case, the intermediate mesh \(\mathbf {\tilde{X}}^{t'}\) is replaced by the pseudo-groundtruth coming from the SMPL model.

Table 1 summarizes reconstruction errors of pose-transferred meshes on AMASS dataset using different models. The supervised baseline with pose supervision achieves the lowest error, which serves as the performance upper bound for our model. Remarkably, our unsupervised model is only 4mm worse than the supervised baseline, suggesting that our proposed approach, which only requires seeing a subject in different poses, is sufficient to disentangle shape from pose. In addition, our approach achieves a much lower error compared to GDVAE. Again, we compare for completeness, but we do not want to claim we are superior as our assumptions are different, and the losses are conceptually very different.

We can also observe from Table 1 that training solely with the cross-consistency constraint leads to degenerate solutions. This shows that our approach can exploit the weak signal of seeing the same subject in different poses only when combined with the self-consistency loss. Notably, enforcing the self-consistency constraint already drives the model to learn a reasonably well-disentangled representation, which is further improved by incorporating ARAP in the loop. We hypothesize that without ARAP, the intermediate mesh \(\tilde{\mathbf {X}}^t\) is noisy in shape but relatively accurate in pose at early stages of training, which still helps disentanglement.

AMASS Pose-Aware Shape Retrieval. Shape retrieval refers to the task of retrieving similar objects given a query object. Our model learns disentangled representations for shape and pose; hence we can retrieve objects either similar in shape or similar in pose. Our evaluation of shape retrieval accuracy follows the experimental setting in [3]. Specifically, we evaluate on the AMASS dataset, which comes with groundtruth SMPL parameters. To avoid confusion of notation, we denote with \(\varvec{\dot{\beta }}\) the SMPL shape parameters and with \(\varvec{\dot{\theta }}\) the SMPL pose parameters. For each query object \(\mathbf {X}\), we encode it into a latent code and search for its closest neighbour \(\mathbf {Y}\) in latent space. The retrieval accuracy is determined by the Euclidean error between the SMPL parameters of \(\mathbf {X}\) and \(\mathbf {Y}\): \(E_{\varvec{\dot{\beta }}}(\mathbf {X}, \mathbf {Y}) = \Vert \varvec{\dot{\beta }}(\mathbf {X}) - \varvec{\dot{\beta }}(\mathbf {Y}) \Vert _2\), \(E_{\varvec{\dot{\theta }}}(\mathbf {X}, \mathbf {Y}) = \Vert q(\varvec{\dot{\theta }}(\mathbf {X})) - q(\varvec{\dot{\theta }}(\mathbf {Y})) \Vert _2\), where \(q(\cdot )\) converts axis-angle representations to unit quaternions. To compare properly with GDVAE, which uses 5 dimensions for shape and 15 dimensions for pose, we reduce the latent dimension of our model with principal component analysis (PCA). We show results for shape retrieval and pose retrieval in Table 2.
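A sketch of this retrieval protocol is given below; the shapes of the SMPL parameter arrays (10 shape parameters, per-joint unit quaternions) are assumptions for illustration, not part of the evaluation code released with the paper.

```python
import numpy as np

def retrieve_nearest(codes, smpl_betas, smpl_quats, query):
    # codes:      (N, d) latent codes (shape or pose, optionally PCA-reduced)
    # smpl_betas: (N, 10) SMPL shape parameters
    # smpl_quats: (N, J*4) flattened per-joint unit quaternions q(theta)
    d = np.linalg.norm(codes - codes[query], axis=1)
    d[query] = np.inf                       # exclude the query itself
    nn = int(np.argmin(d))                  # closest neighbour in latent space
    e_beta = np.linalg.norm(smpl_betas[query] - smpl_betas[nn])
    e_theta = np.linalg.norm(smpl_quats[query] - smpl_quats[nn])
    return nn, e_beta, e_theta
```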

Ideally, if the shape code is disentangled from the pose code, we should get a low \(E_{\varvec{\dot{\beta }}}\) and a high \(E_{\varvec{\dot{\theta }}}\) when retrieving with \(\varvec{\beta }\), and vice versa. This is in accordance with our results. Interestingly, dimensionality reduction with PCA boosts the shape difference for pose retrieval, which indicates that some degree of entanglement is still present in our pose code. An example of pose retrieval is shown in Fig. 4 – notice the pose similarity of the retrieved shapes.

Table 2. Mean error on SMPL parameters for shape retrieval. Column 1 corresponds to retrieval with shape code \(\varvec{\beta }\) and column 2 with pose code \(\varvec{\theta }\). Arrows indicate if the desired metrics should be high or low when retrieving with \(\varvec{\beta }\) or \(\varvec{\theta }\).
Fig. 4.

An example of pose retrieval with our model. Bottom left: the three meshes most similar to the query in pose code. Bottom right: the three meshes of different subjects most similar to the query in pose code.

COMA Expression Extrapolation. The COMA dataset spans twelve types of extreme expressions. To evaluate the generalization capability of our model, we adopt the expression extrapolation setting of [32]. Specifically, we run a 12-fold cross-validation, leaving one expression class out and training on the rest, and subsequently evaluate reconstruction on the left-out class. Table 3 shows the average reconstruction performance of our model compared with FLAME [23] and the approach of Jiang et al. [19] (see supplementary material for the full table). Both Jiang et al. and our model allocate 4 dimensions for identity and 4 dimensions for expression, while FLAME allocates 8 dimensions for each. Our model consistently outperforms the other two by a large margin.

Table 3. Mean errors of expression extrapolation on COMA dataset. All numbers are in millimeters. The results of Jiang et al. and FLAME are taken from  [19].
Fig. 5.

Pose transfer from pose sources to shape sources. Please see supplementary video at https://virtualhumans.mpi-inf.mpg.de/unsup_shape_pose/ for transferring animated sequences.

Fig. 6.

Latent interpolation of shape and pose codes on the AMASS dataset. The leftmost column shows source meshes, the rightmost column target meshes. Intermediate columns are linear interpolations of the respective codes at uniform steps between \(s=0\) and \(s=1\). The first two rows show interpolation of pose, and the last two rows show interpolation of shape.

4.3 Qualitative Evaluation

Pose Transfer. We qualitatively evaluate pose transfer on AMASS, SMAL, COMA and MANO. In each dataset, a pose sequence is transferred to a given shape. If our model learns a disentangled representation, the outputs should preserve the identity of the shape source while inheriting the deformation from the pose sources. Figure 5 visualizes the transfer results. We can observe that the subject shape is preserved well under new poses. The effect is most obvious for bodies, animals and faces, and less so for hands due to their visual similarity.

Latent Interpolation. Latent representations learned by our model should ideally be smooth and vary continuously. We demonstrate this via linearly interpolating our learned shape codes and pose codes. When interpolating shape, we always fix the pose code to that of the source mesh. The same holds when we interpolate pose. Interpolation results are shown in Fig. 6. We can observe the smooth transition between nearby meshes. Furthermore, we can see that mesh shapes remain unchanged during pose interpolation, and vice versa. This indicates that variations in shape and pose are independent of each other.
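With the model interface sketched in Sect. 3.1, pose interpolation with a fixed shape code could be implemented roughly as follows; the number of steps is arbitrary.

```python
def interpolate_pose(model, X_src, X_tgt, steps=5):
    # Keep the shape code of the source fixed and linearly blend only the pose codes.
    beta_s, theta_s = model.encode(X_src)
    _, theta_t = model.encode(X_tgt)
    return [model.decode(beta_s, (1 - a) * theta_s + a * theta_t)
            for a in torch.linspace(0, 1, steps)]
```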

5 Conclusion and Future Work

In this paper, we introduced an auto-encoder model that disentangles shape and pose for 3D meshes in an unsupervised manner. We exploited subject identity information, which is commonly available when scanning shapes or capturing them with motion capture. We showed two key ideas to achieve disentanglement, namely a cross-consistency and a self-consistency loss coupled with ARAP deformation within the training loop. Our model is straightforward to train and generalizes well across various datasets. We demonstrated the use of the latent codes by performing pose transfer, shape retrieval and latent interpolation. Although our method provides an exciting next step in unsupervised learning of deformable models from data, there is still room for improvement. In contrast to hand-crafted models like SMPL, where every parameter carries meaning (joint axes and angles per part), we have no control over specific parts of the mesh with our pose code. We also observed that interpolation of large torso rotations squeezes the meshes. In future work, we plan to explore a more structured pose space that allows easy part-level manipulation by users, and to generalize our method to work with un-registered pointclouds as input. Since our model builds on simple yet effective ideas, we hope researchers can build on it and make further progress in this exciting research direction.