1 Introduction

Dynamic 3D human modeling has been a long-standing challenge for the 3D vision and graphics communities, as it is critical to various applications such as VR/AR, animation, and robot simulation. Traditional methods leverage well-designed parametric models [2] and physics-based simulation [20, 22, 60, 66] to model the inner human body and the deformable outer clothing separately, but they typically demand substantial engineering effort and computational cost. Recently, many learning-based methods have been proposed [4, 11, 25, 32, 35, 36, 43, 59]; unfortunately, some of these methods cannot model fine-grained geometric details beyond the inner body, while the others only support frame-wise reconstruction when producing a dynamic sequence.

Fig. 1. LoRD represents a dynamic human with a set of overlapping local parts. Each part is temporally tracked with the estimated SMPL meshes and contains low-dimensional latent codes for motion, canonical shape, and (optionally) texture, which a 4D implicit network decodes to recover the detailed temporal evolution of local surface patches. At test time, these latent codes are optimized against different types of input observations, such as sparse point clouds or monocular RGB-D video, to produce high-fidelity 4D human reconstructions.

The key challenge of dynamic human modeling is to find a 4D representation that captures both surface geometry and temporal motion. Typically, existing 4D human representation methods infer a single holistic latent code to control global motion and shape, which is unfortunately prone to over-smoothed shapes and missing fine-grained surface details. Recent efforts infer local representations for 3D modeling [6, 15, 19, 30, 53]. Typically, these methods utilize a set of local parts to model the geometry of local surface regions for reconstructing complete 3D shapes. Such local formulations improve the model capacity for recovering detailed geometry, with stronger generalization ability than global free-form modeling [44, 48, 51]. However, it is nontrivial to directly extend these local methods to the 4D scenario of modeling a dynamic 3D human with temporal motion, as naïvely applying them to per-frame reconstruction cannot maintain the desirable properties of 4D modeling, such as temporal inter-/extrapolation and 4D spatial completion.

To this end, this paper proposes a Local 4D implicit Representation for Dynamic human, named LoRD, which combines the merits of 4D human modeling and local representations and is capable of producing high-fidelity human mesh sequences. Given a dynamic clothed human sequence over a time span \(T \in \left[ 0,1\right] \), we decouple its temporal evolution into two factors: inner body skeleton motion and outer surface deformation. We handle the skeleton motion with the widely-used SMPL parametric model [40], which uses a shape parameter and a series of pose parameters to represent the temporal evolution of the inner body. For the outer surface deformation, we resort to a local implicit framework. Specifically, we sample a set of local parts on the inner body mesh of the canonical frame (\(T=0\)); each part is represented by a 3D sphere with intrinsic parameters (not to be confused with camera intrinsics), namely a radius and a transformation with respect to the world coordinate frame, together with latent codes encoding local deformation and canonical shape information. Since SMPL models share a unified mesh topology, we can find correspondences in subsequent frames and temporally align the local coordinate systems for each part. Then we use a 4D local implicit network to model the surface deformation within each part, conditioned on their latent codes. This representation utilizes the inner body model to handle the global skeleton motion and leaves the detailed surface dynamics to the powerful local implicit network, which facilitates dynamic human modeling with high-quality geometry.

Technically, our local representation is learned on 100 human sequences with ground-truth meshes and their corresponding inner body meshes, where each sequence contains \(L=17\) frames. For each training sequence, we first sample the local parts on the surface of the inner body mesh and randomly initialize the latent codes. Then we use the objective function introduced by IGR [23] to optimize the local implicit network and latent codes. At test time, we fix the local implicit network to support a particular application (e.g., 4D reconstruction from sparse points, non-rigid depth fusion) via the auto-decoding method [51]. To obtain the inner body mesh, we use the existing work H4D [29] to provide a plausible body estimation. Moreover, our representation can be combined with the H4D motion model to conduct the body reference optimization introduced by PaMIR [77], refining the inner body to handle imperfect body estimation (detailed in Sect. 3.4). This improves the robustness of LoRD against inaccurate inner body tracking.

To summarize, the main contributions of our work are: 1) We propose a novel local 4D implicit representation, which divides the surface of a dynamic human into a collection of local parts and supports high-fidelity dynamic human modeling; 2) To temporally align each part for training and test-time optimization, we leverage the inner SMPL body mesh for local part tracking; 3) We design an inner body refining strategy based on our local representation to optimize imperfect initial body estimations; 4) Our representation only requires a small set of data for training, and outperforms the state-of-the-art methods in practical applications, e.g., 4D reconstruction from sparse points and non-rigid depth fusion.

2 Related Work

4D Representation. Deep learning methods have shown impressive results on 3D-related tasks based on various representations, such as voxels [12, 21, 69], point clouds [1, 18, 55, 56], meshes [7, 24, 31, 37, 68], and neural implicit surfaces [6, 9, 10, 17, 19, 30, 44, 51]. While great success has been achieved for static 3D objects, recent works [28, 48, 57] investigate 4D representations that model dynamic 3D objects with an additional temporal dimension. When targeting dynamic humans, recent methods [28, 48] often suffer from missing surface details and inaccurate motion due to their global shape modeling and lack of a human motion prior. In contrast, the proposed local 4D representation leverages inner body tracking to handle the global skeleton motion and leaves the detailed dynamics to a set of local parts, which effectively recovers high-fidelity surface deformation and generalizes well to novel sequences.

Local Shape Representation. Implicit representations conditioned on a global latent vector [44, 51] often produce over-smooth results and fail to recover detailed geometry such as human hands and clothing wrinkles. To tackle this problem, some recent works utilize local implicit representations for shape modeling [14, 19, 30, 53] and neural rendering [39, 52], but none of them builds a 4D representation that captures how 3D geometry deforms continuously over time. Similar to us, a family of works [8, 67] builds human avatars that support shape generation under arbitrary body poses. However, they process different timestamps independently and do not explicitly estimate temporal correspondences, which are shown to be important for recovering geometric details from multiple input frames and for applications like motion completion/prediction. In contrast, our method extends local representations to the 4D scenario by combining a human prior model with a 4D implicit network, which can directly produce 4D results with a one-shot optimization process.

Dynamic Human Modeling. When it comes to capturing dynamic humans, some methods [26, 27, 73] require a pre-scanned template as a good initialization to obtain results from monocular color information. Recent methods [46, 61, 74, 76] utilize depth sensors to achieve real-time speed based on the classical deformation graph [62] and volumetric fusion [47], dispensing with subject-specific templates. Since these methods operate frame by frame without an intermediate motion representation, they are prone to error accumulation and struggle to recover from tracking failures. Most recently, NDG [5] learns a globally consistent deformation graph to facilitate non-rigid reconstruction, but it requires per-sequence retraining and relies on multi-view depth sensors, which is inconvenient in practice. As a popular line of work, NeRF-based [45] human modeling methods [52, 54] typically do not provide both local and temporal modeling. Most similar to us, Zheng et al. [75] propose a structured temporal NeRF for dynamic human rendering. We note that these methods mainly focus on rendering quality and usually produce unsatisfactory geometry. In contrast, LoRD models motion and shape jointly with a local representation, so that information from the two domains can be exchanged through the 4D model and benefit each other, producing high-fidelity geometry.

Fig. 2. Overview of our framework. We use a set of spherical parts to model the local surface deformation of a dynamic human. Given a 3D point \(\left( x,y,z\right) \) in the world coordinate frame, we determine which part it falls into and transform it into the local coordinate frame, i.e., \(\left( x',y',z'\right) \), according to the estimated SMPL parameters. The transformed point is fed to a local implicit network, conditioned on the latent codes of the part, to obtain a signed distance and an (optional) RGB value. Note that our local implicit network is shared by all parts. Meshes are extracted with Marching Cubes [41].

3 Method

Fig. 2 gives an overview of our framework: given a 3D clothed human mesh sequence of length \(L=17\) frames that performs some motion in a normalized time span \(\left[ 0,1\right] \), we first define a set of local parts (Sect. 3.1) around the inner body surface of the canonical frame (\(T=0\) in our setup). Then we temporally track these parts, which are driven by the skeleton motion of the inner body model (SMPL). Note that we use the ground-truth SMPL mesh during training, whereas the SMPL parameters are estimated with an off-the-shelf method [29] at test time. Each part contains a motion code \(c_m\), a shape code \(c_s\) and an optional texture code \(c_t\), which can be decoded by our local implicit network (Sect. 3.2) to obtain the reconstructed surface. Overall, we utilize the inner body model to track the global skeleton motion and leave the detailed temporal deformation, geometry, and texture of the local surface patches to the local implicit network. Training and test-time optimization are discussed in Sect. 3.3 and Sect. 3.4, respectively.

3.1 Local Part Formulation

Inner Body Model. There are many ways to track the global skeleton motion of a dynamic human, e.g., optical/scene flow [38, 65], dense human correspondence [64, 71], and deformation graphs [62]. In our formulation, we choose the widely-used SMPL model [40], as it naturally provides surface correspondences between frames and its low-dimensional representations are easy to optimize.

LoRD represents a 4D human with a set of local parts (defined as 3D spheres) \(\mathcal {P}=\left\{ \mathcal {P}_{k}\right\} _{k=1}^{K}\), where \(\mathcal {P}_{k}=\left\{ \textbf{r},\textbf{R}_k,\textbf{c}_k\right\} \) holds the intrinsic parameters of part k (not to be confused with camera intrinsics); \(\textbf{r} \in \mathbb {R}\) is the radius of the sphere shared by all parts (we use \(r=5\,\textrm{cm}\) in our experiments); \(\textbf{R}_k \in \mathbb {R}^{9}\) and \(\textbf{c}_k \in \mathbb {R}^{3}\) are the rotation matrix relative to the world coordinate frame and the center of the sphere for each part, respectively. Given the inner body mesh of the canonical frame, a sampling algorithm (detailed in Supp. Mat.) is conducted on its surface to obtain the part centers. Inspired by [6, 30], to make the result smooth across part borders, we use an overlapping strategy during the part sampling process, where each part overlaps with its neighboring parts by at most \(1.5\times \) the part radius \(\textbf{r}\), finally producing 2127 parts. The transformation of each part is defined with respect to the local coordinate frame, as shown in Fig. 2. Details are in Supp. Mat.
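
To make the part formulation concrete, the following PyTorch sketch (with hypothetical names `LocalPart`, `world_to_local`, and `is_covered`, which are ours, not the authors') shows one way to store the intrinsic parameters \(\left\{ \textbf{r},\textbf{R}_k,\textbf{c}_k\right\} \) and latent codes of a part and to map a world-frame point into its local coordinate frame; it is our reading of the formulation above, not released code.

```python
import torch

class LocalPart:
    """One spherical part P_k = {r, R_k, c_k} plus its latent codes."""
    def __init__(self, radius, rotation, center, code_dim=128):
        self.r = radius                # shared scalar radius (5 cm in the paper)
        self.R = rotation              # (3, 3) rotation w.r.t. the world frame
        self.c = center                # (3,) sphere center in the world frame
        # latent codes, randomly initialized from N(0, 0.01) as in Sect. 3.3
        self.c_m = 0.01 * torch.randn(code_dim)   # motion code
        self.c_s = 0.01 * torch.randn(code_dim)   # canonical shape code

def world_to_local(x_world, part):
    """Map a world-frame point (3,) into the part's local coordinate frame."""
    return part.R.T @ (x_world - part.c)

def is_covered(x_world, part):
    """Point x is covered by part k iff d(x, c_k) <= r."""
    return torch.linalg.norm(x_world - part.c) <= part.r
```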

3.2 Local Implicit Network

Besides the intrinsic parameters, each local part also carries latent parameters as low-dimensional codes \(c_m\), \(c_s\) and \(c_t\), which respectively encode the local surface deformation, canonical geometry, and texture. The goal of the local parts is to represent the detailed temporal deformation and geometry of the local surface patches. To this end, we follow D-NeRF [54] and use a 4D implicit network, which consists of a motion model and a canonical shape model. Moreover, if the observed data contain texture information, an additional texture model is triggered to predict colors for the vertices of the reconstructed mesh. Note that the implicit network is shared by all local parts. Next, we briefly introduce each model; the detailed architectures can be found in Supp. Mat.

Motion Model. As shown in Fig. 2, we formulate the motion model \(f^m\left( \textbf{x},T\mid c_m\right) \) as a 4D function conditioned on the motion code \(c_m \in \mathbb {R}^{128}\), which takes a 3D point \(\textbf{x}=\left( x,y,z\right) \) in the local coordinate frame and a time value T (normalized to \(\left[ 0,1\right] \)) as input, and predicts a deformation vector \(\mathrm {\Delta }\textbf{x}\) that transforms this point to the canonical frame, i.e. \(T=0\), by \(\textbf{x}^{*}=\textbf{x}+\mathrm {\Delta }\textbf{x}\). We adopt the network architecture of IM-Net [9] and reduce the feature dimension of each hidden layer by a factor of 4 [30] to obtain an efficient motion model.
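
A minimal sketch of such a motion model is given below; the MLP depth, widths, and activations are illustrative stand-ins for the reduced IM-Net backbone described above, not the exact architecture.

```python
import torch
import torch.nn as nn

class MotionModel(nn.Module):
    """Sketch of f^m(x, T | c_m): predicts a deformation Δx that warps a
    local-frame point at time T back to the canonical frame (T = 0)."""
    def __init__(self, code_dim=128, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1 + code_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, t, c_m):
        # x: (N, 3) local points, t: (N, 1) time in [0, 1], c_m: (N, code_dim)
        dx = self.net(torch.cat([x, t, c_m], dim=-1))
        return x + dx                    # x* = x + Δx, the canonical position
```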

Canonical Shape Model. The canonical shape model \(f^s\left( \textbf{x}\mid c_s\right) \) is a neural signed distance function that only holds the static implicit geometry of the canonical frame, as the temporal deformation is handled by the motion model. Specifically, given a 3D query point at time T, we first obtain its position in the space of the canonical frame with the motion model, and then use the canonical shape model, conditioned on a canonical shape latent code \(c_s \in \mathbb {R}^{128}\), to predict the signed distance of the given point to the surface. We adopt the same network architecture as DeepSDF [51] for the canonical shape model. For training and testing efficiency, we reduce the number of layers to 6 and the feature channels of each layer to 256. During inference, we compute the bounding box of the human based on the inner body mesh for each frame, and utilize the Marching Cubes algorithm [41] to extract the iso-surface.
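
The composition of the two models can be sketched as follows, reusing the `MotionModel` and `LocalPart` from the earlier sketches; `query_sdf` is a hypothetical helper of ours, and the real networks differ in detail (e.g., activation and normalization choices).

```python
import torch
import torch.nn as nn

class CanonicalSDF(nn.Module):
    """Sketch of f^s(x | c_s): a DeepSDF-style MLP (the paper uses 6 layers
    with 256 channels) predicting a signed distance in the canonical frame."""
    def __init__(self, code_dim=128, hidden=256, n_layers=6):
        super().__init__()
        dims = [3 + code_dim] + [hidden] * (n_layers - 1) + [1]
        layers = []
        for i in range(n_layers):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < n_layers - 1:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x, c_s):
        return self.net(torch.cat([x, c_s], dim=-1)).squeeze(-1)

def query_sdf(x_local, t, part, motion_model, shape_model):
    """Warp local-frame points to the canonical frame with the motion model,
    then evaluate the canonical SDF there."""
    c_m = part.c_m.expand(len(x_local), -1)
    c_s = part.c_s.expand(len(x_local), -1)
    x_canonical = motion_model(x_local, t, c_m)   # x* = x + Δx
    return shape_model(x_canonical, c_s)
```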

Texture Model. If the input data contain texture information, e.g., colored point clouds, our representation can be extended to support surface texture inference. We achieve this by learning a function \(f^t\left( \textbf{x}, T\mid c_t\right) \) to predict the 4D texture field [49, 58, 59] of the dynamic local surface, conditioned on a texture code \(c_t \in \mathbb {R}^{128}\). It takes a 3D point \(\textbf{x}\) in the local coordinate frame and a time value T, and outputs the RGB value of this point. We use the architecture of the TextureField [49] decoder for our texture model; please refer to Supp. Mat. for the detailed network architecture. Note that we use our texture model in a per-sequence fashion at test time without pre-training, i.e., we fit the input sequence while updating the network parameters, for better visualization results.
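
As a rough stand-in for the TextureField decoder (whose exact architecture is in Supp. Mat.), the texture model can be pictured as another conditioned MLP; this sketch is only illustrative.

```python
import torch
import torch.nn as nn

class TextureModel(nn.Module):
    """Sketch of f^t(x, T | c_t): a 4D texture field returning an RGB value
    for a local-frame point at time T. The real model uses the TextureField
    decoder; this plain MLP is a stand-in."""
    def __init__(self, code_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, x, t, c_t):
        return self.net(torch.cat([x, t, c_t], dim=-1))
```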

3.3 Training

Thanks to our local formulation, the training of our model is very data efficient. We only use 100 sequences of length \(L=17\) frames from the CAPE dataset [42] to learn our representation. During training, we adopt the auto-decoding method [51] and optimize our motion model, canonical shape model, and the latent codes of the training parts. Specifically, given a training sequence that contains ground-truth clothed meshes and the corresponding inner body meshes, we first sample a set of local parts on the surface of the inner body mesh of the first frame. Since the SMPL mesh has a unified surface topology, we can obtain the rotation and location of each part in the following time steps and thus align their local coordinate frames. Next, we initialize the motion code and canonical shape code of each part with vectors randomly sampled from \(N\left( 0,0.01\right) \); these codes are optimized jointly with the network parameters during training. To train our implicit networks, the query points are sampled from three sources, i.e., the surface, near-surface space, and free space within the bounding box.
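
The auto-decoding setup can be summarized by the sketch below, assuming the model classes from the Sect. 3.2 sketches; the variable names (`codes_m`, `codes_s`) are ours.

```python
import torch

K, D = 2127, 128                                        # parts, code dimension
codes_m = (0.01 * torch.randn(K, D)).requires_grad_()   # motion codes
codes_s = (0.01 * torch.randn(K, D)).requires_grad_()   # canonical shape codes

motion_model, shape_model = MotionModel(), CanonicalSDF()
optimizer = torch.optim.Adam(
    list(motion_model.parameters()) + list(shape_model.parameters())
    + [codes_m, codes_s],              # latent codes are free parameters too
    lr=1e-3)                           # learning rate from Sect. 4
```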

Loss Functions. The point sets sampled on-surface and off-surface are denoted as \(\mathcal {X}\) and \(\bar{\mathcal {X}}\), respectively. We optimize our 4D implicit function \(f(\cdot )\) based on the loss functions introduced by IGR [23]:

$$\begin{aligned} \mathcal {L}_{\textrm{s}}=\frac{1}{|\mathcal {X}|}\sum _{\boldsymbol{x} \in \mathcal {X}}\left( |f(\boldsymbol{x})|+\left\| \nabla _{\boldsymbol{x}} f(\boldsymbol{x})-\boldsymbol{n}(\boldsymbol{x})\right\| \right) , \quad \mathcal {L}_{\textrm{e}}=\frac{1}{|\bar{\mathcal {X}}|}\sum _{\boldsymbol{x} \in \bar{\mathcal {X}}}\left( \left\| \nabla _{\boldsymbol{x}} f(\boldsymbol{x})\right\| -1\right) ^{2} \end{aligned}$$

where \(\mathcal {L}_{\textrm{s}}\) encourages zero signed distance values for on-surface points and the alignment of their normals with the ground truth, and \(\mathcal {L}_{\textrm{e}}\) is the regularization term encouraging the learned function to satisfy the Eikonal equation [13]. In addition, we add a latent regularization term \(\mathcal {L}_{\textrm{c}}=\left\| c_m\right\| _{2}+\left\| c_s\right\| _{2}\) to constrain the learning of the latent spaces. The final objective function for training is \(\mathcal {L}=\lambda _1\mathcal {L}_{\textrm{s}}+\lambda _2\mathcal {L}_{\textrm{e}}+\lambda _3\mathcal {L}_{\textrm{c}}\). We use \(\lambda _1=1.0\), \(\lambda _2=1e^{-1}\), \(\lambda _3=1e^{-3}\) in our experiments.
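
A sketch of these two losses in PyTorch, computing \(\nabla _{\boldsymbol{x}} f\) with autograd; `f` is assumed to map query points to SDF values, and sampling/batching are omitted.

```python
import torch

def igr_losses(f, x_surf, normals, x_off):
    """L_s on surface points (zero SDF, normals aligned with ground truth)
    and the Eikonal regularizer L_e on off-surface points."""
    x_surf = x_surf.detach().requires_grad_(True)
    x_off = x_off.detach().requires_grad_(True)
    sdf_s, sdf_o = f(x_surf), f(x_off)
    grad_s = torch.autograd.grad(sdf_s.sum(), x_surf, create_graph=True)[0]
    grad_o = torch.autograd.grad(sdf_o.sum(), x_off, create_graph=True)[0]
    loss_s = (sdf_s.abs() + (grad_s - normals).norm(dim=-1)).mean()
    loss_e = ((grad_o.norm(dim=-1) - 1.0) ** 2).mean()   # |∇f| = 1 off-surface
    return loss_s, loss_e
```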

Evaluating SDF for Query Points. During training, the sampled points are only evaluated by the local parts that cover them. In our case, “point \(\textbf{x}\) is covered by part k” means the Euclidean distance between \(\textbf{x}\) and the part center \(\textbf{c}_k\) is less than or equal to the pre-defined part radius \(\textbf{r}\), i.e., \(d\left( \textbf{x},\textbf{c}_k\right) \le \textbf{r}\). Since the sampled parts are highly overlapping, for one query point we randomly choose n parts that cover it, evaluate its SDF under each, and average the n SDF values (\(n=4\) in our experiments) as the final output. This encourages the network to produce smooth results in the overlapping regions. If a point is not covered by any part, e.g., points sampled in free space far from the surface, we choose the n nearest parts to obtain the SDF prediction. Note that this is important for reconstructing complete results, since we cannot ensure that the local parts sampled from the inner body mesh completely cover the surface of the clothed human.
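
The blending rule can be sketched as below, reusing the hypothetical helpers from the earlier sketches (`query_sdf`, `world_to_local`); this is our illustrative reading, not the exact implementation.

```python
import torch

def blended_sdf(x_world, t, parts, motion_model, shape_model, n=4):
    """SDF at one world-frame point, averaged over n randomly chosen covering
    parts, with a fall-back to the n nearest parts when none cover it."""
    dists = torch.stack([torch.linalg.norm(x_world - p.c) for p in parts])
    covering = (dists <= parts[0].r).nonzero().flatten()
    if len(covering) >= n:
        chosen = covering[torch.randperm(len(covering))[:n]]  # random n parts
    else:
        chosen = dists.argsort()[:n]                          # n nearest parts
    preds = [query_sdf(world_to_local(x_world, parts[k])[None], t,
                       parts[k], motion_model, shape_model)
             for k in chosen.tolist()]
    return torch.stack(preds).mean()     # averaging smooths the part borders
```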

3.4 Test-Time Optimization

After learning our local representation, we can conduct test-time optimization to reconstruct a dynamic human from the given observations. In our experiments, we mainly focus on recovering 4D humans from complete point clouds or partial depth sequences. Generally speaking, the test-time optimization is similar to the training process and performs backward optimization in the auto-decoding fashion, except that we fix the network parameters and only update the latent codes of each local part. Since we leverage the loss functions from IGR [23] and directly perform optimization on the point clouds with a local representation, the geometry covered by each part is a non-watertight surface, which causes the extracted surface to contain artificial interior back-faces. We borrow the post-processing algorithm from LIG [30] to remove such artifacts. The details of the post-processing algorithm and the choices of hyper-parameters can be found in Supp. Mat. In addition, there are some technical details that we clarify below.

Inner Body Estimation. Given a testing sequence, we first need to estimate the inner body meshes to sample local parts. As temporal consistency facilitates our reconstruction, we use the recent motion-based human body estimation method H4D [29] to fit the SMPL parameters via backward optimization.

Inner Body Refining. The fitting results of H4D [29] are accurate enough in most cases, but remain imperfect on some sequences, which may cause the observations of some local parts to vary too much over time. Inspired by PaMIR [77], we propose a strategy to refine the initial inner body fitting from H4D. Specifically, we first sample and track the local parts on the initial body mesh sequence produced by H4D, and optimize the latent codes of each part. Then we fix the latent codes and local parts, feed the SMPL vertices to our local implicit network, and optimize the SMPL shape and initial pose parameters as well as the latent motion vector of H4D. We follow the body reference optimization proposed in PaMIR to build the loss functions of our refining process:

$$\begin{aligned} \mathcal {L}_{\textrm{SMPL}}=\left\{ \begin{array}{ll}|f\left( x\right) | & f\left( x\right) \ge 0 \\ \frac{1}{\eta }|f\left( x\right) | & f\left( x\right) <0\end{array}\right. , \quad \mathcal {L}_{\textrm{reg}}=\left\| V-V^{\textrm{init}}\right\| _{2}, \end{aligned}$$

where \(\eta =5\) and \(f\left( \cdot \right) \) is our local implicit signed distance function; \(V=\left( \beta ,\theta _0, c_m\right) \) contains the shape parameter, the initial pose parameter, and the latent motion code of H4D, and the superscript “init” denotes the initial estimations. This reflects the fact that, if the body estimation is accurate, the vertices of the body mesh should receive negative SDF predictions (inside the surface). Moreover, we also use an additional observation loss \(\mathcal {L}_{\textrm{obs}}\), which denotes a Chamfer loss for complete point clouds and a point-to-surface loss for partial point clouds from depth images. The final objective function is \(\mathcal {L}=\lambda _1\mathcal {L}_{\textrm{SMPL}}+\lambda _2\mathcal {L}_{\textrm{obs}}+\lambda _3\mathcal {L}_{\textrm{reg}}\), where \(\lambda _1=1.0\), \(\lambda _2=1e^{2}\) and \(\lambda _3=1e^{-3}\) in our experiments. We verify the effectiveness of our inner body refining strategy in Sect. 4.4.
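
For clarity, a sketch of the asymmetric \(\mathcal {L}_{\textrm{SMPL}}\) term (with \(\eta =5\)) and the regularizer, applied to SDF values predicted at the SMPL vertices; function names are ours.

```python
import torch

def smpl_refine_loss(sdf_at_verts, eta=5.0):
    """Penalize SMPL vertices with positive SDF (outside the surface) eta
    times more strongly than those inside, matching the piecewise loss."""
    weight = torch.where(sdf_at_verts >= 0,
                         torch.ones_like(sdf_at_verts),
                         torch.full_like(sdf_at_verts, 1.0 / eta))
    return (weight * sdf_at_verts.abs()).mean()

def reg_loss(V, V_init):
    """Keep V = (beta, theta_0, c_m), flattened into one vector, near the
    initial H4D estimate V_init."""
    return (V - V_init).norm()
```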

Texture Model Optimization. As mentioned in Sect. 3.2, we optimize the texture model for each testing sequence. Given a colored point cloud sequence, we can obtain the ground-truth color \(C_T\left( \textbf{x}\right) \) of a surface point \(\textbf{x}\) at time T. We then feed \(\textbf{x}\) to the texture model, conditioned on the texture code \(c_t^k\) of part k, to get the color prediction, and again average the n predicted colors as the final output (Sect. 3.3). To optimize the network parameters and texture codes, we add the \(L_1\)-loss \(\mathcal {L}_{\textrm{color}}=\left| f^{t}(\textbf{x},T\mid c_t)-C_T(\textbf{x})\right| \) to the objective function.

Fig. 3. 4D human fitting. We choose SoTA implicit 3D/4D representations to overfit a given mesh sequence and compare the results with ours. The colors on our results indicate the correspondences across different frames, which cannot be obtained by the framewise baselines, i.e., NGLoD and DeepSDF. The zoomed-in part shows that we reconstruct better finger details than NGLoD. (Color figure online)

4 Experiments

In this section, we evaluate the representation capability of LoRD and its value in practical applications, namely 4D reconstruction and non-rigid depth fusion.

Fig. 4. Temporal inter-/extrapolation. Colored meshes are inter-/extrapolated frames. (Color figure online)

Dataset and Metric. For training and evaluation, we use the CAPE dataset [42], which contains more than 600 motion sequences of 15 persons wearing different types of outfits, with SMPL registrations provided. Additionally, some raw scanned sequences with texture information are also available. We choose 100 sub-sequences of length \(L=17\) for training, and use sub-sequences of novel subjects for testing. To compare with the baseline methods, we use Chamfer Distance-\(L_2\) [44], normal consistency [59] (the average L2 distance between the normal of a given point on the source mesh and the normal of its nearest neighbor on the target mesh), and F-Score [68] as evaluation metrics.
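
For reference, the three metrics can be sketched on sampled surface point sets as follows (assumed shapes: `(P, 3)` points with matching unit normals; `tau` is the 5 mm F-Score threshold from Table 1).

```python
import torch

def chamfer_l2(x, y):
    """Symmetric squared-distance Chamfer between two point sets (in m^2)."""
    d = torch.cdist(x, y) ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def normal_consistency(x, nx, y, ny):
    """Average L2 distance between source normals and the normals of their
    nearest neighbors on the target."""
    idx = torch.cdist(x, y).argmin(dim=1)
    return (nx - ny[idx]).norm(dim=-1).mean()

def f_score(x, y, tau=0.005):
    """Harmonic mean of precision and recall at distance threshold tau."""
    d = torch.cdist(x, y)
    precision = (d.min(dim=1).values < tau).float().mean()
    recall = (d.min(dim=0).values < tau).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```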

Implementation Details. We use PyTorch with the Adam optimizer [34], a learning rate of \(1e^{-3}\), and batch size 1 for both training and test-time optimization. The experiments are conducted on a single Nvidia 2080Ti GPU. The test-time optimization takes around 15 min for each 17-frame sequence.

Table 1. Comparisons on 4D human fitting. Left: framewise methods; right: temporal methods. “Ch.-\(L_2\)” and “Normal” mean Chamfer Distance (\(\times 10^{-4}\,\textrm{m}^2\)) and Surface Normal Consistency, respectively. The threshold for computing F-Score is \(\tau =5\,\textrm{mm}\).

4.1 Representation Capability

4D Human Fitting. We first evaluate the efficacy of LoRD in representing dynamic humans by overfitting a given mesh sequence. We select one sequence from the CAPE dataset for this task. For comparison, we choose the 3D neural SDF methods DeepSDF [51] and NGLoD [63]: DeepSDF is a global representation that represents the complete shape with a single latent code, while NGLoD is a SoTA local neural SDF representation based on an octree; both are 3D representations that must operate frame-wise to produce a temporal sequence. In addition, we choose the SoTA 4D representation methods OFlow [48] and 4D-CR [28] as baselines.

The quantitative results are shown in Table 1. Our LoRD representation clearly outperforms DeepSDF and all the SoTA 4D representation methods, and performs comparably with the framewise method NGLoD. We show the visual results in Fig. 3, where the colors of our results indicate dense correspondences w.r.t. the first frame. Specifically, for each vertex on the reconstructed mesh at time T, we use the optimized motion codes to transform it to the first frame, and take the color value of the nearest vertex. We note that this cannot be achieved by DeepSDF or NGLoD, since they do not model temporal information.

Temporal Inter-/Extrapolation. To further show the superiority of LoRD over framewise representations, we show the temporal inter-/extrapolation results achieved by our method in Fig. 4. Given a sequence of length \(L=17\) frames, for interpolation we randomly choose 9 frames as the observations to perform SDF fitting; the goal is to complete the missing frames and obtain a temporally complete sequence. For extrapolation, we only use the first 9 frames and must predict the future motion of the last 8 frames. Figure 4 shows that LoRD produces plausible results in both interpolation and extrapolation modes. Again, these temporal completion tasks cannot be achieved by framewise 3D representations, e.g., DeepSDF and NGLoD. We also provide results on the interpolation of latent codes in Supp. Mat. (Sect. 2.2) as a sanity check.

4.2 4D Reconstruction from Sparse Points

We then show that LoRD can support various applications. First, we demonstrate that LoRD achieves high-quality 4D reconstruction from sparse point clouds. In this case, we assume the point normal directions are available (oriented point clouds, as also assumed by Poisson Reconstruction [33] and LIG [30]).

Fig. 5. 4D reconstruction from sparse points. Each input point cloud contains around 4000 points. Note the detailed geometry in the zoomed-in parts and the surface deformation recovered by our method. We provide more qualitative results in Supp. Mat.

Comparison with Instance-Level Methods. We first compare LoRD with instance-level methods, where “instance-level” means we only overfit one sequence at a time and do not consider generalization to other instances. We choose the traditional Poisson Surface Reconstruction with octree depth \(d=10\) (PSR10) [33], Alpha Shapes [16], and Ball Pivoting [3] as baselines. Moreover, we also compare with the SoTA network-based surface reconstruction method Deep Hybrid Self-Prior (DHSP) and the non-rigid reconstruction method Neural Deformation Graph (NDG) [5]. The quantitative results are shown in Fig. 6 (a, I). The leftmost column gives the sampled point cloud density (number of points per square meter of surface), where smaller numbers correspond to sparser point clouds; the surface area of the SMPL mesh used for point sampling is around \(2\,\textrm{m}^2\). As can be seen, our method outperforms all the baselines by a large margin. More importantly, sparser point clouds hardly affect our performance while the baseline methods are significantly affected; this is because LoRD is a 4D representation, so sparse observations from different frames can compensate for each other through the motion model. The qualitative comparisons are shown in Fig. 5 (above the solid line): our method recovers geometric details on the face and clothing with high-resolution texture, while the baselines only produce over-smoothed results due to the limited information in the sparse inputs.

Fig. 6. (a) Comparisons on 4D reconstruction from sparse points. The leftmost column in Block I gives the sampled point cloud density, where smaller numbers correspond to sparser point clouds. The results in Block II are obtained from point clouds of density 2000 points/\(\textrm{m}^2\). (b) Qualitative comparisons on non-rigid depth fusion.

Comparison with Generalizable Methods. To show the generalization ability of our method, we train LoRD on the training set of 100 sequences, then fix the network parameters and optimize the latent codes of the local parts to fit the input point cloud via back-propagation. In this experiment, we use a point density of 2000 points/\(\textrm{m}^2\) (the same as the results in the last group of Fig. 6 (a, I)), and choose 10 testing sequences of novel subjects for evaluation. As framewise baselines, we choose IPNet [4] and PTF [70], which take a point cloud as input and output a reconstructed mesh in a feed-forward fashion, and CAPE [42] and LIG [30], which obtain reconstructions via backward optimization similar to ours. OFlow and 4D-CR are still considered as temporal baselines; we remove their encoders, fix the decoder parameters, and perform backward optimization. For OFlow and 4D-CR, we use the ground-truth occupancy instead of oriented point clouds as supervision for more stable results. The results are shown in Fig. 5 (below the solid line) and Fig. 6 (a, II): our method beats all the baselines both qualitatively and quantitatively. We can observe the fine-grained geometry recovered by LoRD in the zoomed-in parts of Fig. 5, as well as detailed clothing deformation, which shows that our model trained on a small set of data generalizes well to novel motion sequences. More results are in Supp. Mat.

4.3 Non-Rigid Depth Fusion

We further test LoRD on the application of non-rigid depth fusion. Given a static RGB-D camera with a person standing in front of it performing different actions, the goal is to accurately track the human motion, merge all depth observations in a time span, and finally produce a dynamic mesh sequence. In this experiment, we use mesh sequences of length \(L=17\) from the CAPE dataset [42], and render each frame to get a depth image of resolution \(512\times 512\). We compute the normal map from the depth image, and back-project each pixel into 3D space with the known camera intrinsics to obtain a partial oriented point cloud as the observation. Then we run H4D [29] to get the inner body estimation, and use our pretrained LoRD model to perform auto-decoding. Our approach formulates non-rigid fusion as a temporal completion problem within local parts. We choose DynamicFusion [46], NPMs [50] and PTF [70] as baselines and show the qualitative comparisons in Fig. 6 (b). We observe that PTF produces overly smooth results, NPMs cannot model the detailed surface geometry of different subjects, and DynamicFusion fails to track human motion that is very fast or contains self-occlusion, leading to unsatisfactory fusion. In contrast, our model is capable of producing more complete fusion results than DynamicFusion, e.g., the back of the first example, and more detailed geometry than PTF and NPMs. Additional results, including non-rigid fusion on real-world data and a comparison to the more recent human-specific fusion work DoubleFusion [74], are provided in Supp. Mat. for the sake of space. Our method shows robustness to SMPL fitting errors and provides more complete results than DoubleFusion.
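
The back-projection step can be sketched as follows for a pinhole camera (assumed intrinsics `fx, fy, cx, cy`; the normals, computed from the depth-based normal map, are omitted here).

```python
import torch

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to camera-frame 3D points, one per valid pixel."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = torch.stack([x, y, z], dim=-1)   # (H, W, 3) camera-frame points
    return pts[z > 0]                      # keep pixels with valid depth
```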

Table 2. Ablation study. Left: the effectiveness of the inner body refining under different noise levels; right: the effect of the part radius. We choose part radius \(r=5\,\textrm{cm}\) in our experiments. Visualization examples are in Supp. Mat.

4.4 Ablation Study

Imperfect Body Tracking. We first provide an ablation study to demonstrate the effectiveness of the proposed inner body refining method. We use the 4D reconstruction task with a point density of 2000 points/\(\textrm{m}^2\) for evaluation. Given the initially estimated SMPL inner body, we manually add random Gaussian noise to it and compare the reconstruction performance before and after refining. Specifically, we perturb the SMPL shape (\(\beta \)) and pose (\(\theta \)) parameters by \(\beta +=\lambda _{\beta } \cdot \sigma \cdot \mu \) and \(\theta +=\lambda _{\theta } \cdot \sigma \cdot \mu \), where \(\mu \sim N\left( 0,1\right) \), \(\lambda _{\beta }=0.05\), \(\lambda _{\theta }=0.01\), and \(\sigma \in \left[ 3, 5\right] \) represents the noise level. The quantitative results are shown in Table 2 (left). Without inner body refining, the reconstruction performance drops quickly as the noise level increases. With our refining method, the performance improves and is generally stable across noise levels.
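
The perturbation itself is straightforward; a sketch, with assumed SMPL parameter shapes:

```python
import torch

def perturb_smpl(beta, theta, sigma, lam_beta=0.05, lam_theta=0.01):
    """Add scaled Gaussian noise to SMPL shape (beta) and pose (theta);
    sigma in [3, 5] is the noise level used in Table 2 (left)."""
    beta = beta + lam_beta * sigma * torch.randn_like(beta)
    theta = theta + lam_theta * sigma * torch.randn_like(theta)
    return beta, theta
```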

Local Part Size. We then study the effect of different radii for the local parts. To this end, we use our pretrained model and test on the 4D reconstruction task as before. The comparisons are shown in Table 2 (right). As can be seen, the reconstruction performance is affected by the choice of part radius r; we choose \(r=5\,\textrm{cm}\) in our experiments for slightly better results. We find that overly small parts tend to produce artifacts, possibly due to the limited receptive field within each part, while larger parts lead to overly smooth results.

5 Conclusion

This work introduces LoRD, a local 4D implicit representation for dynamic humans, which optimizes a part-level temporal network to model detailed human surface deformation, e.g., clothing wrinkles. LoRD is learned on a very small set of training data (100 sequences). Once trained, it can be used to fit different types of observed data, including sparse point clouds and monocular depth images, via auto-decoding. LoRD is capable of reconstructing high-fidelity 4D humans and outperforms the state-of-the-art methods.