1 Introduction

Dynamic 3D human modeling has been a long-standing challenge for the 3D vision and graphics communities, as it is critical to various applications such as VR/AR, animation, and robot simulation. Traditional methods leverage well-designed parametric models [2] and physics-based simulation [20, 22, 60, 66] to model the inner human body and the deformable outer clothing separately, but they typically demand substantial engineering effort and computational cost. Recently, many learning-based methods have been proposed [4, 11, 25, 32, 35, 36, 43, 59]; unfortunately, some of these methods cannot model fine-grained geometric details beyond the inner body, while the others only support frame-wise reconstruction when producing a dynamic sequence.

Fig. 1. LoRD represents a dynamic human with a set of overlapping local parts. Each part is temporally tracked with the estimated SMPL meshes and contains low-dimensional latent codes for motion, canonical shape, and (optionally) texture, which a 4D implicit network decodes to recover the detailed temporal evolution of local surface patches. At test time, these latent codes are optimized against different types of input observations, such as sparse point clouds or monocular RGB-D video, to produce high-fidelity 4D human reconstructions.

The key challenge of dynamic human modeling is to find a 4D representation that captures both surface geometry and temporal motion. Typically, existing 4D human representation methods infer a single holistic latent code to control global motion and shape, which is unfortunately prone to over-smoothed shapes and missing fine-grained surface details. Recent efforts infer local representations for 3D modeling [6, 15, 19, 30, 53]. Typically, these methods utilize a set of local parts to model the geometry of local surface regions for reconstructing complete 3D shapes. Such local formulations improve the model capacity for recovering detailed geometry, with stronger generalization ability than global free-form modeling [44, 48, 51]. However, it is nontrivial to directly extend these local methods to the 4D scenario of modeling a dynamic 3D human with temporal motion, as naïvely applying them to per-frame reconstruction cannot maintain the desirable properties of 4D modeling, such as temporal inter-/extrapolation and 4D spatial completion.

To this end, this paper proposes a Local 4D implicit Representation for Dynamic human, named LoRD, which combines the merits of 4D human modeling and local representations and is capable of producing high-fidelity human mesh sequences. Given a dynamic clothed human sequence over a time span \(T \in \left[ 0,1\right] \), we decouple its temporal evolution into two factors: inner body skeleton motion and outer surface deformation. We handle the skeleton motion with the widely-used SMPL parametric model [40], which uses a shape parameter and a series of pose parameters to represent the temporal evolution of the inner body. For the outer surface deformation, we resort to a local implicit framework. Specifically, we sample a set of local parts on the inner body mesh of the canonical frame (\(T=0\)); each part is represented by a 3D sphere with intrinsic parameters (not to be confused with camera intrinsics), namely a radius and a transformation with respect to the world coordinate frame, together with latent codes encoding local deformation and canonical shape information. Since SMPL models share a unified mesh topology, we can find correspondences in subsequent frames and temporally align the local coordinate systems for each part. Then we use a 4D local implicit network to model the surface deformation within each part, conditioned on their latent codes. This representation utilizes the inner body model to handle the global skeleton motion and leaves the detailed surface dynamics to the powerful local implicit network, which facilitates dynamic human modeling with high-quality geometry.

Technically, our local representation is learned on 100 human sequences with ground-truth meshes and their corresponding inner body meshes, where each sequence contains \(L=17\) frames. For each training sequence, we first sample the local parts on the surface of the inner body mesh and randomly initialize the latent codes. Then we use the objective function introduced by IGR [23] to optimize the local implicit network and latent codes. At test time, we fix the local implicit network to support a particular application (e.g., 4D reconstruction from sparse points, non-rigid depth fusion) via the auto-decoding method [51]. To obtain the inner body mesh, we use the existing work H4D [29] to provide a plausible body estimation. Moreover, our representation can be combined with the H4D motion model to conduct the body reference optimization introduced by PaMIR [77], refining the inner body to handle imperfect body estimation (detailed in Sect. 3.4). This improves the robustness of LoRD against inaccurate inner body tracking.

To summarize, the main contributions of our work are: 1) We propose a novel local 4D implicit representation, which divides the surface of a dynamic human into a collection of local parts and supports high-fidelity dynamic human modeling; 2) To temporally align each part for training and test-time optimization, we leverage the inner SMPL body mesh for local part tracking; 3) We design an inner body refining strategy based on our local representation to optimize imperfect initial body estimations; 4) Our representation only requires a small set of data for training, and outperforms the state-of-the-art methods in practical applications, e.g., 4D reconstruction from sparse points and non-rigid depth fusion.

2 Related Work

4D Representation. Deep learning methods have shown impressive results on 3D-related tasks based on various representations, such as voxels [12, 21, 69], point clouds [1, 18, 55, 56], meshes [7, 24, 31, 37, 68], and neural implicit surfaces [6, 9, 10, 17, 19, 30, 44, 51]. While great success has been achieved for static 3D objects, recent works [28, 48, 57] investigate 4D representations that model dynamic 3D objects with an additional temporal dimension. When targeting dynamic humans, recent methods [28, 48] often suffer from missing surface details and inaccurate motion due to their global shape modeling and lack of a human motion prior. In contrast, the proposed local 4D representation leverages inner body tracking to handle the global skeleton motion and leaves the detailed dynamics to a set of local parts, which effectively recovers high-fidelity surface deformation and generalizes well to novel sequences.

Local Shape Representation. Implicit representations conditioned on a global latent vector [44, 51] often produce over-smooth results and fail to recover detailed geometry such as human hands and clothing wrinkles. To tackle this problem, some recent works utilize local implicit representations for shape modeling [14, 19, 30, 53] and neural rendering [39, 52], but none of them builds a 4D representation that captures how 3D geometry deforms continuously over time. Similar to us, a family of works [8, 67] builds human avatars that support shape generation under arbitrary body poses. However, they process different timestamps independently and do not explicitly estimate temporal correspondences, which are shown to be important for recovering geometric details from multiple input frames and for applications like motion completion/prediction. In contrast, our method extends local representations to the 4D scenario by combining a human prior model with a 4D implicit network, which can directly produce 4D results with a one-shot optimization process.

Dynamic Human Modeling. When it comes to capturing dynamic humans, some methods [26, 27, 73] require a pre-scanned template as a good initialization to obtain results from monocular color information. Recent methods [46, 61, 74, 76] utilize depth sensors to achieve real-time speed based on the classical deformation graph [62] and volumetric fusion [47], dispensing with subject-specific templates. Since these methods operate frame by frame without an intermediate motion representation, they are prone to error accumulation and struggle to recover from tracking failures. Most recently, NDG [5] learns a globally consistent deformation graph to facilitate non-rigid reconstruction, but it requires per-sequence retraining and relies on multi-view depth sensors, which is inconvenient in practice. As a popular line of work, NeRF-based [45] human modeling methods [52, 54] typically do not provide both local and temporal modeling. Most similar to us, Zheng et al. [75] propose a structured temporal NeRF for dynamic human rendering. We note that these methods mainly focus on rendering quality and usually produce unsatisfactory geometry. In contrast, LoRD models motion and shape jointly with a local representation, so that information from the two domains can be exchanged through the 4D model and benefit each other, producing high-fidelity geometry.

Fig. 2. Overview of our framework. We use a set of spherical parts to model the local surface deformation of a dynamic human. Given a 3D point \(\left( x,y,z\right) \) in the world coordinate frame, we determine which part it falls into and transform it into the local coordinate frame, i.e., \(\left( x',y',z'\right) \), according to the estimated SMPL parameters. The transformed point is fed to a local implicit network, conditioned on the latent codes of the part, to obtain a signed distance and an (optional) RGB value. Note that our local implicit network is shared by all parts. Meshes are extracted with Marching Cubes [41].

3 Method

Fig. 2 gives an overview of our framework: given a 3D clothed human mesh sequence of length \(L=17\) frames that performs some motion in a normalized time span \(\left[ 0,1\right] \), we first define a set of local parts (Sect. 3.1) around the inner body surface of the canonical frame (\(T=0\) in our setup). Then we temporally track these parts, which are driven by the skeleton motion of the inner body model (SMPL). Note that we use the ground-truth SMPL mesh during training, whereas the SMPL parameters are estimated with an off-the-shelf method [29] at test time. Each part contains a motion code \(c_m\), a shape code \(c_s\) and an optional texture code \(c_t\), which can be decoded by our local implicit network (Sect. 3.2) to obtain the reconstructed surface. Overall, we utilize the inner body model to track the global skeleton motion and leave the detailed temporal deformation, geometry, and texture of the local surface patches to the local implicit network. Training and test-time optimization are discussed in Sect. 3.3 and Sect. 3.4, respectively.

3.1 Local Part Formulation

Inner Body Model. There are many ways to track the global skeleton motion of a dynamic human, e.g., optical/scene flow [38, 65], dense human correspondence [64, 71], and deformation graphs [62]. In our formulation, we choose the widely-used SMPL model [40], as it naturally provides surface correspondences between frames and its low-dimensional representations are easy to optimize.

LoRD represents a 4D human with a set of local parts (defined as 3D spheres) \(\mathcal {P}=\left\{ \mathcal {P}_{k}\right\} _{k=1}^{K}\), where \(\mathcal {P}_{k}=\left\{ \textbf{r},\textbf{R}_k,\textbf{c}_k\right\} \) holds the intrinsic parameters of part k (not to be confused with camera intrinsics); \(\textbf{r} \in \mathbb {R}\) is the radius of the sphere shared by all parts (we use \(r=5\,\textrm{cm}\) in our experiments); \(\textbf{R}_k \in \mathbb {R}^{9}\) and \(\textbf{c}_k \in \mathbb {R}^{3}\) are the rotation matrix relative to the world coordinate frame and the center of the sphere for each part, respectively. Given the inner body mesh of the canonical frame, a sampling algorithm (detailed in Supp. Mat.) is conducted on its surface to obtain the part centers. Inspired by [6, 30], to make the result smooth across part borders, we use an overlapping strategy during the part sampling process, where each part overlaps with its neighboring parts by at most \(1.5\times \) the part radius \(\textbf{r}\), finally producing 2127 parts. The transformation of each part is defined with respect to the local coordinate frame, as shown in Fig. 2. Details are in Supp. Mat.
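
To make the part formulation concrete, the following PyTorch sketch (with hypothetical names `LocalPart`, `world_to_local`, and `is_covered`, which are ours, not the authors') shows one way to store the intrinsic parameters \(\left\{ \textbf{r},\textbf{R}_k,\textbf{c}_k\right\} \) and latent codes of a part and to map a world-frame point into its local coordinate frame; it is our reading of the formulation above, not released code.

```python
import torch

class LocalPart:
    """One spherical part P_k = {r, R_k, c_k} plus its latent codes."""
    def __init__(self, radius, rotation, center, code_dim=128):
        self.r = radius                # shared scalar radius (5 cm in the paper)
        self.R = rotation              # (3, 3) rotation w.r.t. the world frame
        self.c = center                # (3,) sphere center in the world frame
        # latent codes, randomly initialized from N(0, 0.01) as in Sect. 3.3
        self.c_m = 0.01 * torch.randn(code_dim)   # motion code
        self.c_s = 0.01 * torch.randn(code_dim)   # canonical shape code

def world_to_local(x_world, part):
    """Map a world-frame point (3,) into the part's local coordinate frame."""
    return part.R.T @ (x_world - part.c)

def is_covered(x_world, part):
    """Point x is covered by part k iff d(x, c_k) <= r."""
    return torch.linalg.norm(x_world - part.c) <= part.r
```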

3.2 Local Implicit Network

Besides the intrinsic parameters, each local part also carries latent parameters as low-dimensional codes \(c_m\), \(c_s\) and \(c_t\), which respectively encode the local surface deformation, canonical geometry, and texture. The goal of the local parts is to represent the detailed temporal deformation and geometry of the local surface patches. To this end, we follow D-NeRF [54] and use a 4D implicit network, which consists of a motion model and a canonical shape model. Moreover, if the observed data contain texture information, an additional texture model is triggered to predict colors for the vertices of the reconstructed mesh. Note that the implicit network is shared by all local parts. Next, we briefly introduce each model; the detailed architectures can be found in Supp. Mat.

Motion Model. As shown in Fig. 2, we formulate the motion model \(f^m\left( \textbf{x},T\mid c_m\right) \) as a 4D function conditioned on the motion code \(c_m \in \mathbb {R}^{128}\), which takes a 3D point \(\textbf{x}=\left( x,y,z\right) \) in the local coordinate frame and a time value T (normalized to \(\left[ 0,1\right] \)) as input, and predicts a deformation vector \(\mathrm {\Delta }\textbf{x}\) that transforms this point to the canonical frame, i.e. \(T=0\), by \(\textbf{x}^{*}=\textbf{x}+\mathrm {\Delta }\textbf{x}\). We adopt the network architecture of IM-Net [9] and reduce the feature dimension of each hidden layer by a factor of 4 [30] to obtain an efficient motion model.
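
A minimal sketch of such a motion model is given below; the MLP depth, widths, and activations are illustrative stand-ins for the reduced IM-Net backbone described above, not the exact architecture.

```python
import torch
import torch.nn as nn

class MotionModel(nn.Module):
    """Sketch of f^m(x, T | c_m): predicts a deformation Δx that warps a
    local-frame point at time T back to the canonical frame (T = 0)."""
    def __init__(self, code_dim=128, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1 + code_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, t, c_m):
        # x: (N, 3) local points, t: (N, 1) time in [0, 1], c_m: (N, code_dim)
        dx = self.net(torch.cat([x, t, c_m], dim=-1))
        return x + dx                    # x* = x + Δx, the canonical position
```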

Canonical Shape Model. The canonical shape model \(f^s\left( \textbf{x}\mid c_s\right) \) is a neural signed distance function that only holds the static implicit geometry of the canonical frame, as the temporal deformation is handled by the motion model. Specifically, given a 3D query point at time T, we first obtain its position in the space of the canonical frame with the motion model, and then use the canonical shape model, conditioned on a canonical shape latent code \(c_s \in \mathbb {R}^{128}\), to predict the signed distance of the given point to the surface. We adopt the same network architecture as DeepSDF [51] for the canonical shape model. For training and testing efficiency, we reduce the number of layers to 6 and the feature channels of each layer to 256. During inference, we compute the bounding box of the human based on the inner body mesh for each frame, and utilize the Marching Cubes algorithm [41] to extract the iso-surface.
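
The composition of the two models can be sketched as follows, reusing the `MotionModel` and `LocalPart` from the earlier sketches; `query_sdf` is a hypothetical helper of ours, and the real networks differ in detail (e.g., activation and normalization choices).

```python
import torch
import torch.nn as nn

class CanonicalSDF(nn.Module):
    """Sketch of f^s(x | c_s): a DeepSDF-style MLP (the paper uses 6 layers
    with 256 channels) predicting a signed distance in the canonical frame."""
    def __init__(self, code_dim=128, hidden=256, n_layers=6):
        super().__init__()
        dims = [3 + code_dim] + [hidden] * (n_layers - 1) + [1]
        layers = []
        for i in range(n_layers):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < n_layers - 1:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x, c_s):
        return self.net(torch.cat([x, c_s], dim=-1)).squeeze(-1)

def query_sdf(x_local, t, part, motion_model, shape_model):
    """Warp local-frame points to the canonical frame with the motion model,
    then evaluate the canonical SDF there."""
    c_m = part.c_m.expand(len(x_local), -1)
    c_s = part.c_s.expand(len(x_local), -1)
    x_canonical = motion_model(x_local, t, c_m)   # x* = x + Δx
    return shape_model(x_canonical, c_s)
```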

Texture Model. If the input data contain texture information, e.g., colored point clouds, our representation can be extended to support surface texture inference. We achieve this by learning a function \(f^t\left( \textbf{x}, T\mid c_t\right) \) to predict the 4D texture field [49, 58, 59] of the dynamic local surface, conditioned on a texture code \(c_t \in \mathbb {R}^{128}\). It takes a 3D point \(\textbf{x}\) in the local coordinate frame and a time value T, and outputs the RGB value of this point. We use the architecture of the TextureField [49] decoder for our texture model; please refer to Supp. Mat. for the detailed network architecture. Note that we use our texture model in a per-sequence fashion at test time without pre-training, i.e., we fit the input sequence while updating the network parameters, for better visualization results.
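
As a rough stand-in for the TextureField decoder (whose exact architecture is in Supp. Mat.), the texture model can be pictured as another conditioned MLP; this sketch is only illustrative.

```python
import torch
import torch.nn as nn

class TextureModel(nn.Module):
    """Sketch of f^t(x, T | c_t): a 4D texture field returning an RGB value
    for a local-frame point at time T. The real model uses the TextureField
    decoder; this plain MLP is a stand-in."""
    def __init__(self, code_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, x, t, c_t):
        return self.net(torch.cat([x, t, c_t], dim=-1))
```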

3.3 Training

Thanks to our local formulation, the training of our model is very data efficient. We only use 100 sequences of length \(L=17\) frames from the CAPE dataset [42] to learn our representation. During training, we adopt the auto-decoding method [51] and optimize our motion model, canonical shape model, and the latent codes of the training parts. Specifically, given a training sequence that contains ground-truth clothed meshes and the corresponding inner body meshes, we first sample a set of local parts on the surface of the inner body mesh of the first frame. Since the SMPL mesh has a unified surface topology, we can obtain the rotation and location of each part in the following time steps and thus align their local coordinate frames. Next, we initialize the motion code and canonical shape code of each part with vectors randomly sampled from \(N\left( 0,0.01\right) \); these codes are optimized jointly with the network parameters during training. To train our implicit networks, the query points are sampled from three sources, i.e., the surface, near-surface space, and free space within the bounding box.
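
The auto-decoding setup can be summarized by the sketch below, assuming the model classes from the Sect. 3.2 sketches; the variable names (`codes_m`, `codes_s`) are ours.

```python
import torch

K, D = 2127, 128                                        # parts, code dimension
codes_m = (0.01 * torch.randn(K, D)).requires_grad_()   # motion codes
codes_s = (0.01 * torch.randn(K, D)).requires_grad_()   # canonical shape codes

motion_model, shape_model = MotionModel(), CanonicalSDF()
optimizer = torch.optim.Adam(
    list(motion_model.parameters()) + list(shape_model.parameters())
    + [codes_m, codes_s],              # latent codes are free parameters too
    lr=1e-3)                           # learning rate from Sect. 4
```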

Loss Functions. The point sets sampled on-surface and off-surface are denoted as \(\mathcal {X}\) and \(\bar{\mathcal {X}}\), respectively. We optimize our 4D implicit function \(f(\cdot )\) based on the loss functions introduced by IGR [23]:

$$\begin{aligned} \mathcal {L}_{\textrm{s}}=\frac{1}{|\mathcal {X}|}\sum _{\boldsymbol{x} \in \mathcal {X}}\left( |f(\boldsymbol{x})|+\left\| \nabla _{\boldsymbol{x}} f(\boldsymbol{x})-\boldsymbol{n}(\boldsymbol{x})\right\| \right) , \quad \mathcal {L}_{\textrm{e}}=\frac{1}{|\bar{\mathcal {X}}|}\sum _{\boldsymbol{x} \in \bar{\mathcal {X}}}\left( \left\| \nabla _{\boldsymbol{x}} f(\boldsymbol{x})\right\| -1\right) ^{2} \end{aligned}$$

where \(\mathcal {L}_{\textrm{s}}\) encourages zero signed distance values for on-surface points and the alignment of their normals with the ground truth, and \(\mathcal {L}_{\textrm{e}}\) is the regularization term encouraging the learned function to satisfy the Eikonal equation [13]. In addition, we add a latent regularization term \(\mathcal {L}_{\textrm{c}}=\left\| c_m\right\| _{2}+\left\| c_s\right\| _{2}\) to constrain the learning of the latent spaces. The final objective function for training is \(\mathcal {L}=\lambda _1\mathcal {L}_{\textrm{s}}+\lambda _2\mathcal {L}_{\textrm{e}}+\lambda _3\mathcal {L}_{\textrm{c}}\). We use \(\lambda _1=1.0\), \(\lambda _2=1e^{-1}\), \(\lambda _3=1e^{-3}\) in our experiments.
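
A sketch of these two losses in PyTorch, computing \(\nabla _{\boldsymbol{x}} f\) with autograd; `f` is assumed to map query points to SDF values, and sampling/batching are omitted.

```python
import torch

def igr_losses(f, x_surf, normals, x_off):
    """L_s on surface points (zero SDF, normals aligned with ground truth)
    and the Eikonal regularizer L_e on off-surface points."""
    x_surf = x_surf.detach().requires_grad_(True)
    x_off = x_off.detach().requires_grad_(True)
    sdf_s, sdf_o = f(x_surf), f(x_off)
    grad_s = torch.autograd.grad(sdf_s.sum(), x_surf, create_graph=True)[0]
    grad_o = torch.autograd.grad(sdf_o.sum(), x_off, create_graph=True)[0]
    loss_s = (sdf_s.abs() + (grad_s - normals).norm(dim=-1)).mean()
    loss_e = ((grad_o.norm(dim=-1) - 1.0) ** 2).mean()   # |∇f| = 1 off-surface
    return loss_s, loss_e
```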

Evaluating SDF for Query Points. During training, the sampled points are only evaluated by the local parts that cover them. In our case, “point \(\textbf{x}\) is covered by part k” means the Euclidean distance between \(\textbf{x}\) and the part center \(\textbf{c}_k\) is less than or equal to the pre-defined part radius \(\textbf{r}\), i.e., \(d\left( \textbf{x},\textbf{c}_k\right) \le \textbf{r}\). Since the sampled parts are highly overlapping, for one query point we randomly choose n parts that cover it, evaluate its SDF under each, and average the n SDF values (\(n=4\) in our experiments) as the final output. This encourages the network to produce smooth results in the overlapping regions. If a point is not covered by any part, e.g., points sampled in free space far from the surface, we choose the n nearest parts to obtain the SDF prediction. Note that this is important for reconstructing complete results, since we cannot ensure that the local parts sampled from the inner body mesh completely cover the surface of the clothed human.
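
The blending rule can be sketched as below, reusing the hypothetical helpers from the earlier sketches (`query_sdf`, `world_to_local`); this is our illustrative reading, not the exact implementation.

```python
import torch

def blended_sdf(x_world, t, parts, motion_model, shape_model, n=4):
    """SDF at one world-frame point, averaged over n randomly chosen covering
    parts, with a fall-back to the n nearest parts when none cover it."""
    dists = torch.stack([torch.linalg.norm(x_world - p.c) for p in parts])
    covering = (dists <= parts[0].r).nonzero().flatten()
    if len(covering) >= n:
        chosen = covering[torch.randperm(len(covering))[:n]]  # random n parts
    else:
        chosen = dists.argsort()[:n]                          # n nearest parts
    preds = [query_sdf(world_to_local(x_world, parts[k])[None], t,
                       parts[k], motion_model, shape_model)
             for k in chosen.tolist()]
    return torch.stack(preds).mean()     # averaging smooths the part borders
```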

3.4 Test-Time Optimization

After learning our local representation, we can conduct test-time optimization to reconstruct a dynamic human from the given observations. In our experiments, we mainly focus on recovering 4D humans from complete point clouds or partial depth sequences. Generally speaking, the test-time optimization is similar to the training process and performs backward optimization in the auto-decoding fashion, except that we fix the network parameters and only update the latent codes of each local part. Since we leverage the loss functions from IGR [23] and directly perform optimization on the point clouds with a local representation, the geometry covered by each part is a non-watertight surface, which causes the extracted surface to contain artificial interior back-faces. We borrow the post-processing algorithm from LIG [30] to remove such artifacts. The details of the post-processing algorithm and the choices of hyper-parameters can be found in Supp. Mat. In addition, there are some technical details that we clarify below.

Inner Body Estimation. Given a testing sequence, we first need to estimate the inner body meshes to sample local parts. As temporal consistency facilitates our reconstruction, we use the recent motion-based human body estimation method H4D [29] to fit the SMPL parameters via backward optimization.

Inner Body Refining. The fitting results of H4D [29] are accurate enough in most cases, but remain imperfect on some sequences, which may cause the observations of some local parts to vary too much over time. Inspired by PaMIR [77], we propose a strategy to refine the initial inner body fitting from H4D. Specifically, we first sample and track the local parts on the initial body mesh sequence produced by H4D, and optimize the latent codes of each part. Then we fix the latent codes and local parts, feed the SMPL vertices to our local implicit network, and optimize the SMPL shape and initial pose parameters as well as the latent motion vector of H4D. We follow the body reference optimization proposed in PaMIR to build the loss functions of our refining process:

$$\begin{aligned} \mathcal {L}_{\textrm{SMPL}}=\left\{ \begin{array}{ll}|f\left( x\right) | & f\left( x\right) \ge 0 \\ \frac{1}{\eta }|f\left( x\right) | & f\left( x\right) <0\end{array}\right. , \quad \mathcal {L}_{\textrm{reg}}=\left\| V-V^{\textrm{init}}\right\| _{2}, \end{aligned}$$

where \(\eta =5\) and \(f\left( \cdot \right) \) is our local implicit signed distance function; \(V=\left( \beta ,\theta _0, c_m\right) \) contains the shape parameter, the initial pose parameter, and the latent motion code of H4D, and the superscript “init” denotes the initial estimations. This reflects the fact that, if the body estimation is accurate, the vertices of the body mesh should receive negative SDF predictions (inside the surface). Moreover, we also use an additional observation loss \(\mathcal {L}_{\textrm{obs}}\), which denotes a Chamfer loss for complete point clouds and a point-to-surface loss for partial point clouds from depth images. The final objective function is \(\mathcal {L}=\lambda _1\mathcal {L}_{\textrm{SMPL}}+\lambda _2\mathcal {L}_{\textrm{obs}}+\lambda _3\mathcal {L}_{\textrm{reg}}\), where \(\lambda _1=1.0\), \(\lambda _2=1e^{2}\) and \(\lambda _3=1e^{-3}\) in our experiments. We verify the effectiveness of our inner body refining strategy in Sect. 4.4.
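
For clarity, a sketch of the asymmetric \(\mathcal {L}_{\textrm{SMPL}}\) term (with \(\eta =5\)) and the regularizer, applied to SDF values predicted at the SMPL vertices; function names are ours.

```python
import torch

def smpl_refine_loss(sdf_at_verts, eta=5.0):
    """Penalize SMPL vertices with positive SDF (outside the surface) eta
    times more strongly than those inside, matching the piecewise loss."""
    weight = torch.where(sdf_at_verts >= 0,
                         torch.ones_like(sdf_at_verts),
                         torch.full_like(sdf_at_verts, 1.0 / eta))
    return (weight * sdf_at_verts.abs()).mean()

def reg_loss(V, V_init):
    """Keep V = (beta, theta_0, c_m), flattened into one vector, near the
    initial H4D estimate V_init."""
    return (V - V_init).norm()
```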

Texture Model Optimization. As mentioned in Sect. 3.2, we optimize the texture model for each testing sequence. Given a colored point cloud sequence, we can obtain the ground-truth color \(C_T\left( \textbf{x}\right) \) of a surface point \(\textbf{x}\) at time T. We then feed \(\textbf{x}\) to the texture model, conditioned on the texture code \(c_t^k\) of part k, to get the color prediction, and again average the n predicted colors as the final output (Sect. 3.3). To optimize the network parameters and texture codes, we add the \(L_1\)-loss \(\mathcal {L}_{\textrm{color}}=\left| f^{t}(\textbf{x},T\mid c_t)-C_T(\textbf{x})\right| \) to the objective function.

Fig. 3. 4D human fitting. We choose SoTA implicit 3D/4D representations to overfit a given mesh sequence and compare the results with ours. The colors on our results indicate the correspondences across different frames, which cannot be obtained by the framewise baselines, i.e., NGLoD and DeepSDF. The zoomed-in part shows that we reconstruct better finger details than NGLoD. (Color figure online)

4 Experiments

In this section, we evaluate the representation capability of LoRD and its value in practical applications, namely 4D reconstruction and non-rigid depth fusion.

Fig. 4. Temporal inter-/extrapolation. Colored meshes are inter-/extrapolated frames. (Color figure online)

Dataset and Metric. For training and evaluation, we use the CAPE dataset [42], which contains more than 600 motion sequences of 15 persons wearing different types of outfits, with SMPL registrations provided. Additionally, some raw scanned sequences with texture information are also available. We choose 100 sub-sequences of length \(L=17\) for training, and use sub-sequences of novel subjects for testing. To compare with the baseline methods, we use Chamfer Distance-\(L_2\) [44], normal consistency [59] (the average L2 distance between the normal of a given point on the source mesh and the normal of its nearest neighbor on the target mesh), and F-Score [68] as evaluation metrics.
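
For reference, the three metrics can be sketched on sampled surface point sets as follows (assumed shapes: `(P, 3)` points with matching unit normals; `tau` is the 5 mm F-Score threshold from Table 1).

```python
import torch

def chamfer_l2(x, y):
    """Symmetric squared-distance Chamfer between two point sets (in m^2)."""
    d = torch.cdist(x, y) ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def normal_consistency(x, nx, y, ny):
    """Average L2 distance between source normals and the normals of their
    nearest neighbors on the target."""
    idx = torch.cdist(x, y).argmin(dim=1)
    return (nx - ny[idx]).norm(dim=-1).mean()

def f_score(x, y, tau=0.005):
    """Harmonic mean of precision and recall at distance threshold tau."""
    d = torch.cdist(x, y)
    precision = (d.min(dim=1).values < tau).float().mean()
    recall = (d.min(dim=0).values < tau).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```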

Implementation Details. We use PyTorch with the Adam optimizer [34], a learning rate of \(1e^{-3}\), and batch size 1 for both training and test-time optimization. The experiments are conducted on a single Nvidia 2080Ti GPU. The test-time optimization takes around 15 min for each 17-frame sequence.

Table 1. Comparisons on 4D human fitting. Left: framewise methods; right: temporal methods. “Ch.-\(L_2\)” and “Normal” mean Chamfer Distance (\(\times 10^{-4}\,\textrm{m}^2\)) and Surface Normal Consistency, respectively. The threshold for computing F-Score is \(\tau =5\,\textrm{mm}\).

4.1 Representation Capability

4D Human Fitting. We first evaluate the efficacy of LoRD in representing dynamic humans by overfitting a given mesh sequence. We select one sequence from the CAPE dataset for this task. For comparison, we choose the 3D neural SDF methods DeepSDF [51] and NGLoD [63]: DeepSDF is a global representation that represents the complete shape with a single latent code, while NGLoD is a SoTA local neural SDF representation based on an octree; both are 3D representations that must operate frame-wise to produce a temporal sequence. In addition, we choose the SoTA 4D representation methods OFlow [48] and 4D-CR [28] as baselines.

The quantitative results are shown in Table 1. Our LoRD representation clearly outperforms DeepSDF and all the SoTA 4D representation methods, and performs comparably with the framewise method NGLoD. We show the visual results in Fig. 3, where the colors of our results indicate dense correspondences w.r.t. the first frame. Specifically, for each vertex on the reconstructed mesh at time T, we use the optimized motion codes to transform it to the first frame, and take the color value of the nearest vertex. We note that this cannot be achieved by DeepSDF or NGLoD, since they do not model temporal information.

Temporal Inter-/Extrapolation. To further show the superiority of LoRD over framewise representations, we show the temporal inter-/extrapolation results achieved by our method in Fig. 4. Given a sequence of length \(L=17\) frames, for interpolation we randomly choose 9 frames as the observations to perform SDF fitting; the goal is to complete the missing frames and obtain a temporally complete sequence. For extrapolation, we only use the first 9 frames and must predict the future motion of the last 8 frames. Figure 4 shows that LoRD produces plausible results in both interpolation and extrapolation modes. Again, these temporal completion tasks cannot be achieved by framewise 3D representations, e.g., DeepSDF and NGLoD. We also provide results on the interpolation of latent codes in Supp. Mat. (Sect. 2.2) as a sanity check.

4.2 4D Reconstruction from Sparse Points

We then show that LoRD can support various applications. First, we demonstrate that LoRD achieves high-quality 4D reconstruction from sparse point clouds. In this case, we assume the point normal directions are available (oriented point clouds, as also assumed by Poisson Reconstruction [33] and LIG [30]).

Fig. 5. 4D reconstruction from sparse points. Each input point cloud contains around 4000 points. Note the detailed geometry in the zoomed-in parts and the surface deformation recovered by our method. We provide more qualitative results in Supp. Mat.

Comparison with Instance-Level Methods. We first compare LoRD with instance-level methods, where “instance-level” means we only overfit one sequence at a time and do not consider generalization to other instances. We choose the traditional Poisson Surface Reconstruction with octree depth \(d=10\) (PSR10) [33], Alpha Shapes [16], and Ball Pivoting [3] as baselines. Moreover, we also compare with the SoTA network-based surface reconstruction method Deep Hybrid Self-Prior (DHSP) and the non-rigid reconstruction method Neural Deformation Graph (NDG) [5]. The quantitative results are shown in Fig. 6 (a, I). The leftmost column gives the sampled point cloud density (number of points per square meter of surface), where smaller numbers correspond to sparser point clouds; the surface area of the SMPL mesh used for point sampling is around \(2\,\textrm{m}^2\). As can be seen, our method outperforms all the baselines by a large margin. More importantly, sparser point clouds hardly affect our performance while the baseline methods are significantly affected; this is because LoRD is a 4D representation, so sparse observations from different frames can compensate for each other through the motion model. The qualitative comparisons are shown in Fig. 5 (above the solid line): our method recovers geometric details on the face and clothing with high-resolution texture, while the baselines only produce over-smoothed results due to the limited information in the sparse inputs.

Fig. 6. (a) Comparisons on 4D reconstruction from sparse points. The leftmost column in Block I gives the sampled point cloud density, where smaller numbers correspond to sparser point clouds. The results in Block II are obtained from point clouds of density 2000 points/\(\textrm{m}^2\). (b) Qualitative comparisons on non-rigid depth fusion.

Comparison with Generalizable Methods. To show the generalization ability of our method, we train LoRD on the training set of 100 sequences, then fix the network parameters and optimize the latent codes of the local parts to fit the input point cloud via back-propagation. In this experiment, we use a point density of 2000 points/\(\textrm{m}^2\) (the same as the results in the last group of Fig. 6 (a, I)), and choose 10 testing sequences of novel subjects for evaluation. As framewise baselines, we choose IPNet [4] and PTF [70], which take a point cloud as input and output a reconstructed mesh in a feed-forward fashion, and CAPE [42] and LIG [30], which obtain reconstructions via backward optimization similar to ours. OFlow and 4D-CR are still considered as temporal baselines; we remove their encoders, fix the decoder parameters, and perform backward optimization. For OFlow and 4D-CR, we use the ground-truth occupancy instead of oriented point clouds as supervision for more stable results. The results are shown in Fig. 5 (below the solid line) and Fig. 6 (a, II): our method beats all the baselines both qualitatively and quantitatively. We can observe the fine-grained geometry recovered by LoRD in the zoomed-in parts of Fig. 5, as well as detailed clothing deformation, which shows that our model trained on a small set of data generalizes well to novel motion sequences. More results are in Supp. Mat.

4.3 Non-Rigid Depth Fusion

We further test LoRD on the application of non-rigid depth fusion. Given a static RGB-D camera with a person standing in front of it performing different actions, the goal is to accurately track the human motion, merge all depth observations in a time span, and finally produce a dynamic mesh sequence. In this experiment, we use mesh sequences of length \(L=17\) from the CAPE dataset [42], and render each frame to get a depth image of resolution \(512\times 512\). We compute the normal map from the depth image, and back-project each pixel into 3D space with the known camera intrinsics to obtain a partial oriented point cloud as the observation. Then we run H4D [29] to get the inner body estimation, and use our pretrained LoRD model to perform auto-decoding. Our approach formulates non-rigid fusion as a temporal completion problem within local parts. We choose DynamicFusion [46], NPMs [50] and PTF [70] as baselines and show the qualitative comparisons in Fig. 6 (b). We observe that PTF produces overly smooth results, NPMs cannot model the detailed surface geometry of different subjects, and DynamicFusion fails to track human motion that is very fast or contains self-occlusion, leading to unsatisfactory fusion. In contrast, our model is capable of producing more complete fusion results than DynamicFusion, e.g., the back of the first example, and more detailed geometry than PTF and NPMs. Additional results, including non-rigid fusion on real-world data and a comparison to the more recent human-specific fusion work DoubleFusion [74], are provided in Supp. Mat. for the sake of space. Our method shows robustness to SMPL fitting errors and provides more complete results than DoubleFusion.
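
The back-projection step can be sketched as follows for a pinhole camera (assumed intrinsics `fx, fy, cx, cy`; the normals, computed from the depth-based normal map, are omitted here).

```python
import torch

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to camera-frame 3D points, one per valid pixel."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = torch.stack([x, y, z], dim=-1)   # (H, W, 3) camera-frame points
    return pts[z > 0]                      # keep pixels with valid depth
```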

Table 2. Ablation study. Left: the effectiveness of the inner body refining under different noise levels; right: the effect of the part radius. We choose part radius \(r=5\,\textrm{cm}\) in our experiments. Visualization examples are in Supp. Mat.

4.4 Ablation Study

Imperfect Body Tracking. We first provide an ablation study to demonstrate the effectiveness of the proposed inner body refining method. We use the 4D reconstruction task with a point density of 2000 points/\(\textrm{m}^2\) for evaluation. Given the initially estimated SMPL inner body, we manually add random Gaussian noise to it and compare the reconstruction performance before and after refining. Specifically, we perturb the SMPL shape (\(\beta \)) and pose (\(\theta \)) parameters by \(\beta +=\lambda _{\beta } \cdot \sigma \cdot \mu \) and \(\theta +=\lambda _{\theta } \cdot \sigma \cdot \mu \), where \(\mu \sim N\left( 0,1\right) \), \(\lambda _{\beta }=0.05\), \(\lambda _{\theta }=0.01\), and \(\sigma \in \left[ 3, 5\right] \) represents the noise level. The quantitative results are shown in Table 2 (left). Without inner body refining, the reconstruction performance drops quickly as the noise level increases. With our refining method, the performance improves and is generally stable across noise levels.
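
The perturbation itself is straightforward; a sketch, with assumed SMPL parameter shapes:

```python
import torch

def perturb_smpl(beta, theta, sigma, lam_beta=0.05, lam_theta=0.01):
    """Add scaled Gaussian noise to SMPL shape (beta) and pose (theta);
    sigma in [3, 5] is the noise level used in Table 2 (left)."""
    beta = beta + lam_beta * sigma * torch.randn_like(beta)
    theta = theta + lam_theta * sigma * torch.randn_like(theta)
    return beta, theta
```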

Local Part Size. We then study the effect of different radii for the local parts. To this end, we use our pretrained model and test on the 4D reconstruction task as before. The comparisons are shown in Table 2 (right). As can be seen, the reconstruction performance is affected by the choice of part radius r; we choose \(r=5\,\textrm{cm}\) in our experiments for slightly better results. We find that overly small parts tend to produce artifacts, possibly due to the limited receptive field within each part, while larger parts lead to overly smooth results.

5 Conclusion

This work introduces LoRD, a local 4D implicit representation for dynamic humans, which optimizes a part-level temporal network to model detailed human surface deformation, e.g., clothing wrinkles. LoRD is learned on a very small set of training data (100 sequences). Once trained, it can be used to fit different types of observed data, including sparse point clouds and monocular depth images, via auto-decoding. LoRD is capable of reconstructing high-fidelity 4D humans and outperforms the state-of-the-art methods.