
1 Introduction

Estimating motion in dynamic scenes is a fundamental and long-standing problem in computer vision [16]. Most existing 3D motion estimation works focus on specific objects such as humans [42], yet knowing the 3D motion of all objects in a dynamic scene can benefit many vision applications, such as robot path planning [8]. Tracking all points in space from multiview data alone is challenging; however, the recent emergence of neural fields [59] brings hope for breakthroughs on this problem.

Neural fields, also known as coordinate-based neural networks, have demonstrated great potential in dynamic 3D scene reconstruction from multiview data [51, 59]. Coordinate-based representations naturally support fine-grained modeling of point motion, require no prior knowledge about the scene geometry, and can track all points in space. In this paper, we address the problem of estimating 3D motion from multiview image sequences, for general scenes and for all points in space (Fig. 1).

Despite recent progress on neural-field-based dynamic scene representation (e.g., [9, 10, 11, 20, 21, 25, 36, 37, 41, 52, 56, 57, 63]), estimating 3D motion from multiview data remains challenging for the following reasons. First, motion ambiguity exists among points with the same color, so one cannot confidently track interchangeable points on non-rigid surfaces from visual observations alone (cf. the possibility of position swapping). Second, the color of any point may change over time; for example, spatially or temporally varying lighting conditions can blur the notion of a point’s identity over time.

Fig. 1.

Our method can handle topologically varying scenes and estimate physical motion for all points in the space. Topologically varying means that the topology of the scene can change, such as a new person entering the scene in (a). All points in the space are tracked, such as the ball in (b). Only the sequence of images to be analyzed is used and no prior knowledge is required in our framework. (See supplementary material for an animated version of the figure.)

In this paper, we propose to regularize the estimated motion to be predictable to address the aforementioned ambiguity issues. The key insight behind motion predictability is that underlying motion patterns exist in a dynamic real-world scene. Chaotic motions (e.g., position swapping for similarly-colored points) are not predictable and should be penalized. In our work, the motion in a scene is “implicitly” regularized by enforcing predictability, which is intrinsically different from explicitly designed regularizing terms, such as elastic regularization [36] and as-rigid-as-possible regularization [52].

State-of-the-art solutions combine space-time radiance neural fields with neural motion fields to model dynamic scenes, optimizing these fields jointly over a set of visual observations in a self-supervised manner by comparing predicted images to actual observations. But vision-based supervision alone typically results in noisy and poorly disentangled motion fields (cf. the aforementioned ambiguities). Therefore, some recent works use data-driven priors like depth [57] and 2D optical flow [21] as regularization. In contrast, we propose to improve motion field optimization through predictability-based regularization. Instead of learning a motion field M that maps each 3D position \(\textbf{p}\) and timestep t to a deformation vector \(\Delta _{t \rightarrow t+\delta t}\textbf{p}\), we condition the motion field on a predictable embedding of the motion for the queried time (denoted \(\boldsymbol{\omega }_{t \rightarrow t+\delta t}\)), i.e., \(\Delta _{t \rightarrow t+\delta t}\textbf{p} = M(\textbf{p}, \boldsymbol{\omega }_{t \rightarrow t+\delta t})\). These motion embeddings are either directly optimized jointly with the space-time field over the observations, or inferred by a predictor function P that takes a set of past embeddings and infers the next motion embedding. During scene optimization, we enforce each motion embedding regressed from the observations to be predictable by our model P. We thereby promote the encoding of underlying motion patterns and penalize chaotic and unrealistic deformations. In summary, our contributions are as follows:

  • We propose to leverage predictability as a prior w.r.t. the motion in a dynamic scene. Predictability regularization implicitly penalizes chaotic motion estimation and can help solve the ambiguity of motion.

  • We condition point motions on embedding vectors and design a predictor on the embedding space to enforce motion predictability.

  • We demonstrate the benefits of the resulting additional supervision (predictability regularization) on motion learning through a variety of qualitative and quantitative evaluations.

  • We provide insights into how the proposed framework can be leveraged for motion prediction as a by-product.

2 Related Work

Neural Fields. A neural field is a field that is parameterized fully or in part by a neural network [6, 59]. Neural fields are widely used for implicitly encoding the geometry of a scene, such as occupancy [29] and distance function [7, 35]. Our method is built on the milestone work NeRF [30], in which the radiance and density are encoded in neural fields. NeRF led to a series of breakthroughs in the fields of 3D scene understanding and rendering, such as relighting [2, 3, 46], human face and body capture [14, 24, 34, 39, 40, 49], and city-scale reconstruction [43, 50, 53, 58]. A recent method also named PREF [15] is developed for compact neural signal modeling.

Motion Estimation and 4D Reconstruction. Large-scale learning-based motion estimation from multiview data has achieved impressive performance [22, 42], but most methods are constrained to tracking specific objects such as humans [42]. In this paper, we are concerned with estimating the motion of all points without access to any annotations, which is related to the 4D reconstruction problem, where motion is usually estimated. Some methods have been developed with known geometry information such as depth or point clouds. DynamicFusion [32], Schmidt et al. [44], Bozic et al. [5], and Yoon et al. [61] estimate motion from videos with depth. OFlow [33] and ShapeFlow [17] infer a deformation flow field with the knowledge of occupancy. More recently, motivated by the success of NeRF, a number of methods have been designed to reconstruct 4D scenes as well as motion directly from multiview data, which can be acquired from a multi-camera system or a single moving camera. D-NeRF [41], Nerfies [36] and NR-NeRF [52] set a canonical frame and align dynamic points to it. DCT-NeRF [56] proposes to track the trajectory of a point along the whole sequence. NSFF [21], VideoNeRF [57], and NeRFlow [9] propose to represent the dynamic scene with a 4D space-time field, and are thus able to handle topologically varying scenes. The 4D fields are under-determined, and precomputed data-driven priors are usually needed to achieve good performance. HyperNeRF [37] proposes to align frames towards a hyperspace for topologically varying scenes and achieves state-of-the-art performance without the need for data-driven priors. These methods are able to render visually appealing images for novel views and time, yet their performance on 3D motion estimation has room for improvement.

Scene Flow Estimation. The 3D motion field is also known as dense scene flow [28, 31, 62]. Vedula et al. [54] introduced the concept and demonstrated a framework for acquiring dense, non-rigid scene flow from optical flow. Basha et al. [1] proposed a 3D point cloud parameterization of the 3D structure and scene flow with calibrated multi-view videos. Vogel et al. [55] suggested representing the dynamic 3D scene by a collection of planar, rigidly moving, local segments. More recently, Yang et al. [60] proposed a framework adopting 3D rigid transformations for analyzing background segmentation and rigidly moving objects.

Predictability. The study of the predictability of time series data dates back to [4, 38], in which predictability is interpreted as the ability to be decomposed into lower-dimensional components. The idea of extracting principal components as predictability is adopted for blind source separation in [48]. Differential entropy is used for measuring predictability in [12]. Our method shares a similar motivation as the above methods in terms of discovering low-rank structures, while predictability in our method is not explicitly defined but implicitly introduced through a predictor network.

3 Preliminaries

Our method is built upon the NeRF framework [30] and is inspired by recent progress on dynamic scenes [21, 57]. For each 3D point \(\textbf{p}=(x,y,z)\) in the considered space, we represent its volume density by \(\boldsymbol{\sigma }(\textbf{p})\), and its color from a viewing direction \(\textbf{d}\) by \(\textbf{c}(\textbf{p},\textbf{d})\). In NeRF, these two attributes are defined as the output of a continuous function F modeled by a neural network, i.e., \((\textbf{c},\boldsymbol{\sigma })=F(\textbf{p},\textbf{d})\). This neural field can be queried to render images of the represented scene through volume rendering. For each camera ray \(\textbf{r}\) defined by its optical origin \(\textbf{o}\) and direction \(\textbf{d}\) intersecting a pixel, we compute the color \(\textbf{C}(\textbf{r})\) of said pixel by sampling points along the ray, i.e., sampling \(\textbf{p}_i=\textbf{o}+i\textbf{d}\), then querying and accumulating their attributes according to F. Overall, the expected color \(\textbf{C}(\textbf{r})\) of the ray \(\textbf{r}\) is:

$$\begin{aligned} \textbf{C}(\textbf{r})=\int _{i_n}^{i_f} e^{-\int _{i_n}^i\boldsymbol{\sigma }(\textbf{p}_j) dj}\boldsymbol{\sigma }\big (\textbf{p}_i\big )\textbf{c}\big (\textbf{p}_i,\textbf{d}\big )di, \end{aligned}$$
(1)

where \(i_n, i_f\) are the near and far bounds. The integral in Eq. 1 is numerically approximated by a summation over a set of points sampled along the ray.
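
For concreteness, the sketch below shows one common way to implement this quadrature in PyTorch, following the discretization used in NeRF [30]; the tensor shapes, the per-segment distances `deltas`, and the small constant added for numerical stability are illustrative assumptions rather than details of our implementation.

```python
import torch

def render_ray_color(sigma, color, deltas):
    """Approximate Eq. 1 for a single ray by quadrature over S sampled points.

    sigma:  (S,)   densities at the sampled points
    color:  (S, 3) RGB values at the sampled points
    deltas: (S,)   distances between consecutive samples along the ray
    """
    # Opacity contributed by each segment: alpha_i = 1 - exp(-sigma_i * delta_i).
    alpha = 1.0 - torch.exp(-sigma * deltas)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): probability that the ray
    # reaches sample i without being absorbed earlier.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = trans * alpha                                # (S,)
    return (weights.unsqueeze(-1) * color).sum(dim=0)      # (3,) expected color
```

In practice, the same computation is carried out for a whole batch of rays at once.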

For dynamic scenes, existing solutions can be roughly categorized into two groups: methods either model the motion and radiance with two distinct fields [36, 41], or regularize the motion derived from a space-time field [9, 21, 57]. In the former, the color of a point \(\textbf{p}\) at time t is represented by \(F_k(M(\textbf{p}, t),\textbf{d})\), where \(F_k\) represents the kth canonical time-invariant space and M is a learned neural motion field defining the motion \(\Delta \textbf{p}\) of any point \(\textbf{p}\) at time t w.r.t. its position in the canonical space. Our method falls into the latter category, in which each point in the dynamic scene is represented by a space-time field \(F(\textbf{p}, \textbf{d}, t)\). Unlike canonical-space-based methods, for the space-time field we need to specify the frame of F when jointly training with a motion field M. We opt for a space-time field rather than a canonical-space one for two reasons. First, we presume that underlying patterns exist in the motion over a certain time range, so canonical-frame-based motion estimation frameworks are not suitable: their motions go from the predefined canonical frame to another frame, whereas we need the motion between frames within a certain range. Second, space-time fields are more generic, as they can handle geometry that does not exist in the canonical frame (e.g., objects entering the scene mid-sequence). Note that for both categories, the scene fields are optimized jointly leveraging observation-based self-supervision, i.e., computing the image reconstruction loss for each time step t as:

$$\begin{aligned} \mathcal {L}_{\textrm{rec}}=\sum _{\textbf{r}}\Vert \textbf{C}_{\textrm{gt}}^t(\textbf{r}) - \textbf{C}^t(\textbf{r}) \Vert _2^2, \end{aligned}$$
(2)

where \(\textbf{C}_{\textrm{gt}}^t\) is the observed pixel color and \(\textbf{C}^t\) the color rendered from F and M.

4 Method

4.1 Overview

Our framework consists of three components: a neural space-time field F, a motion field M and a motion predictor P. An overview of their interactions is presented in Fig. 2. In our framework and implementations, we do not model view-dependent effects with the space-time field, so the space-time field outputs the color and occupancy for each point \((x, y, z, t)\), whereas the motion field provides the motion of any point between two time steps, according to the space-time field. Let the motion of point \(\textbf{p}=(x,y,z)\) from time t to \(t+\delta t\) be \(\Delta _{t\rightarrow t+\delta t}\textbf{p}\); then for \(\textbf{p}\) at time t we have:

$$\begin{aligned} (\textbf{c}_t,\boldsymbol{\sigma }_t)=F(\textbf{p}+\Delta _{t\rightarrow t+\delta t}\textbf{p},t+\delta t). \end{aligned}$$
(3)

The idea is that for a scene observed at time \(t+\delta t\), we can obtain the attributes of \(\textbf{p}\) at time t by querying the space-time field with the point location at \(t+\delta t\).
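
The sketch below illustrates this query in PyTorch. The two toy MLPs stand in for F and M only to make shapes concrete; the actual architectures (the same as in NeRF [30], with positional encoding) and the way `omega` is obtained are described in Sects. 4.2 and 5, so everything below besides Eq. 3 itself should be read as an assumption.

```python
import torch
import torch.nn as nn

class SpaceTimeField(nn.Module):
    """Toy stand-in for F(p, t) -> (color, density); positional encoding omitted."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))
    def forward(self, p, t):
        out = self.mlp(torch.cat([p, t], dim=-1))
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3:])  # c, sigma

class MotionField(nn.Module):
    """Toy stand-in for M(p, omega) -> displacement of p from t to t + dt."""
    def __init__(self, emb_dim=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))
    def forward(self, p, omega):
        omega = omega.expand(p.shape[0], -1)      # broadcast embedding to all points
        return self.mlp(torch.cat([p, omega], dim=-1))

# Eq. 3: the attributes of p at time t are read from the field at t + dt,
# after displacing p by the estimated motion.
F, M = SpaceTimeField(), MotionField()
p = torch.rand(1024, 3)                           # sampled 3D points
t, dt = torch.full((1024, 1), 0.3), 0.1
omega = torch.randn(32)                           # motion embedding for t -> t+dt
c_t, sigma_t = F(p + M(p, omega), t + dt)
```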

In our framework, the motion network is conditioned on an embedding vector \(\boldsymbol{\omega }\) (instead of the queried timestep) and the motion can be written as \(\Delta _{t\rightarrow t+\delta t}\textbf{p}=M(\textbf{p},\boldsymbol{\omega }_{t\rightarrow t+\delta t})\), where \(\boldsymbol{\omega }_{t\rightarrow t+\delta t}\) depends on the time t and interval \(\delta t\). Replacing the temporal variable t with a vector \(\boldsymbol{\omega }\) as input to M enables predictability via the embedding, as further detailed in Sect. 4.2. All networks and the embedding vector w.r.t. time t are optimized using the reconstruction loss \(\mathcal {L}_{\textrm{rec}}\) (cf. Eq. 2), with color \(C^t\) predicted from \(F,M,\boldsymbol{\omega }_{t\rightarrow t+\delta t}\) according to Eqs. 1 and 3.

We define the predictor P as a function taking as input the motion embedding vectors of several previous frames and inferring the motion embedding vectors for the future frames accordingly. Mathematically, we have \(\boldsymbol{\omega }_{t\rightarrow t+\delta t}=P(\boldsymbol{\omega }_\textrm{prev})\) with \(\boldsymbol{\omega }_\textrm{prev}=\{\boldsymbol{\omega }_{t-(i+1)\delta t \rightarrow t-i\delta t}\}_{i=0}^{\tau -1}\) the set of the \(\tau \) previous frames’ embeddings. For example, in Fig. 2, the embedding vector \(\boldsymbol{\omega }_{3\rightarrow 4}\) for the motion from \(t_3\) to \(t_4\) is predicted from the previous three embedding vectors, that is, \(P\left( \{\boldsymbol{\omega }_{0\rightarrow 1},\boldsymbol{\omega }_{1\rightarrow 2},\boldsymbol{\omega }_{2\rightarrow 3}\}\right) \).
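
As an illustration, a minimal predictor in this per-frame-embedding formulation could be a small network regressing the next embedding from the concatenated previous ones; the concatenation-MLP design and the dimensions below are assumptions made for exposition (Sect. 4.2 replaces the embeddings by low-dimensional weights, and Sect. 5 gives the architecture we actually use).

```python
import torch
import torch.nn as nn

class MotionPredictor(nn.Module):
    """P: tau previous motion embeddings -> next motion embedding."""
    def __init__(self, emb_dim=32, tau=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(tau * emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, emb_dim))
    def forward(self, prev):                 # prev: (tau, emb_dim)
        return self.net(prev.flatten())      # (emb_dim,)

P = MotionPredictor()
prev = torch.randn(3, 32)                    # omega_{0->1}, omega_{1->2}, omega_{2->3}
omega_3_to_4 = P(prev)                       # predicted embedding for t_3 -> t_4
```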

Fig. 2.

Overview of the proposed framework. Three networks are trained jointly: the space-time field, the motion field and the predictor. The space-time field returns color and occupancy for each point at a specific time. The motion field predicts the motion of a point based on a motion embedding vector. The predictor generates the future motion embedding based on previously observed embeddings.

4.2 Neural Motion Fields with Motion Embedding

The motion field is conditioned on an embedding vector, sampled from a latent space depicting motion patterns. Such an embedding can be implemented in various ways. The simplest is to associate each motion of interest with a trainable embedding vector; this technique has been widely used for conditioning neural fields w.r.t. appearance [26] and deformation [36]. However, our empirical studies show that associating each motion with its own embedding significantly slows down the convergence of the predictor, as demonstrated in Fig. 3. We presume that this phenomenon is caused by the large and unstructured solution space induced by frame-wise motion embeddings. To validate this assumption and improve the convergence speed, we propose to reduce the dimension of the input and output space of the predictor.

Inspired by mixture-of-experts-based prediction networks [13, 23, 47], we design a set \(\textbf{B}\in \mathbb {R}^{n\times m}\) of n embedding basis vectors, i.e., \(\textbf{B} = [\textbf{b}_1,\cdots ,\textbf{b}_n]^T\) with each \(\textbf{b}_i\in \mathbb {R}^m\) a basis vector. \(\textbf{B}\) is shared across all frames. The motion embedding then becomes \(\boldsymbol{\omega }_{t \rightarrow t + \delta t}= \textbf{w}_{t \rightarrow t + \delta t} \cdot \textbf{B}\), with \(\textbf{w}_{t \rightarrow t + \delta t} \in \mathbb {R}^n\) optimizable linear-combination weights. Accordingly, we redefine the model P to receive and predict these weight vectors instead of the embeddings, thus reducing its per-frame input and output spaces to \(\mathbb {R}^n\), i.e., the dimensionality of the basis vectors no longer affects the predictor. In our experiments, we set \(n=5\) and \(m=32\), so the dimension of the predictor’s output space is reduced from 32 to 5. An illustration and a comparison of the training losses between the two schemes are presented in Fig. 3.
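
The reparameterization can be written in a few lines, as sketched below; the initialization of \(\textbf{B}\) and the shallow two-layer regressor are illustrative assumptions (our actual predictor architecture is given in Sect. 5), while n, m and \(\tau \) follow the values above.

```python
import torch
import torch.nn as nn

n, m, tau, num_transitions = 5, 32, 3, 24   # basis size, embedding dim, history, transitions

# Shared basis B and one weight vector per frame transition, both optimized
# jointly with the fields during reconstruction.
B = nn.Parameter(torch.randn(n, m) * 0.1)
W = nn.Parameter(torch.zeros(num_transitions, n))   # w_{t -> t+1} for each t

def motion_embedding(t):
    """omega_{t -> t+1} = w_{t -> t+1} . B."""
    return W[t] @ B                                  # (m,)

# The predictor now maps the tau previous weight vectors (R^{tau*n}) to the
# next one (R^n): its output dimension drops from m = 32 to n = 5.
predictor = nn.Sequential(nn.Linear(tau * n, 128), nn.ReLU(), nn.Linear(128, n))
w_pred = predictor(W[0:tau].flatten())               # predicted w_{3->4}
omega_pred = w_pred @ B                              # corresponding embedding
```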

Fig. 3.

We use a set of basis vectors for the motion embedding (middle), rather than associating each frame with a motion vector (left). With these shared basis vectors, the input and output spaces of the predictor switch to the linear-combination weights. The comparison of training losses (right) indicates that the predictor converges faster on the space of linear-combination weights.

4.3 Regularizing with Motion Prediction

Our proposed solution makes it possible to complement the usual self-supervision of space-time neural fields (through visual reconstruction only) by a regularization term over motion. However, while predicting motion embeddings is straightforward, i.e., by simply forwarding the embedding vectors of previous frames into P, leveraging P for the regularization of M is not trivial.

In our framework, motion embeddings can be acquired either from reconstruction, i.e., optimizing each embedding along with other components (e.g., both the motion embedding \(\boldsymbol{\omega }_{3\rightarrow 4} = \textbf{w}_{3\rightarrow 4}\cdot \textbf{B}\) and the motion network \(M(\textbf{p},\boldsymbol{\omega }_{3\rightarrow 4})\) can be optimized on observed images at \(t=3,4\)); or through the predictor (e.g., \(\boldsymbol{\omega }_{3\rightarrow 4} = P(\left\{ \textbf{w}_{t-1 \rightarrow t}\right\} _{t=1}^{3})\cdot \textbf{B}\)). We leverage this redundancy for regularization, i.e., proposing a loss to minimize the difference between the self-supervised embeddings and their corresponding predicted versions:

$$\begin{aligned} \mathcal {L}_{\textrm{pred}}=\Vert P\left( \textbf{w}_{\textrm{prev}}\right) - \mathop {\mathrm {arg\,min}}\limits _{\textbf{w}_{t \rightarrow t + \delta t}} \mathcal {L}_{\textrm{rec}} \Vert _2^2, \text {~where~} \textbf{w}_\textrm{prev}=\{\textbf{w}_{t-(i+1)\delta t \rightarrow t-i\delta t}\}_{i=0}^{\tau -1}. \end{aligned}$$
(4)

In the above equation, the first term \(P(\cdot )\) represents the motion embedding predicted according to previous \(\tau \) frames, and the second term \(\mathop {\mathrm {arg\,min}}\limits _{\textbf{w}_{t \rightarrow t + \delta t}} \mathcal {L}_{\textrm{rec}}\) is the vector acquired from minimizing the reconstruction loss.

It is, however, impractical to compute this second term during training, since the reconstruction problem can take hours to solve via optimization. We propose instead to obtain \(\textbf{w}_{{t \rightarrow t + \delta t}}\) in an online manner, and to jointly optimize the frame weights \(\textbf{w}\) over both \(\mathcal {L}_{\textrm{rec}}\) and \(\mathcal {L}_{\textrm{pred}}\) at each optimization step. That is, at each step, all current frame weights \(\textbf{w}\) are first used to compute \(\mathcal {L}_{\textrm{pred}}\) and optimize downstream models accordingly, and are then themselves optimized w.r.t. \(\mathcal {L}_{\textrm{rec}}\). The details of implementing the two losses with batches of frames are introduced in the next section.
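
The online variant of Eq. 4 then reduces to comparing the prediction with the current estimate of the weight vector, as in the sketch below; `predictor` and `W` refer to the objects sketched in Sect. 4.2, and the assumption \(t \ge \tau \) (enough past transitions are available) is ours.

```python
def prediction_loss(predictor, W, t, tau=3):
    """Online version of Eq. 4: compare P(w_prev) with the current estimate of
    w_{t -> t+dt} instead of solving the inner argmin over L_rec."""
    w_prev = W[t - tau:t].reshape(-1)          # the tau previous weight vectors
    return ((predictor(w_prev) - W[t]) ** 2).sum()
```

Depending on the chosen update order (cf. the two-step scheme above), the gradient of this term can be applied to the predictor alone or to both the predictor and the weights.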

4.4 Optimization

During optimization, we sample a short sequence of frames from the training set. To simplify notation, we assume that the predictor takes \(\tau =3\) frames as input and predicts the motion of the next frame. An illustration is presented in Fig. 4. Four consecutive frames \((t_i,t_{i+1},t_{i+2},t_{i+3})\) are first sampled from the observed sequence and the corresponding embedding vectors \(\boldsymbol{\omega }\) are acquired as in Sect. 4.2. Note that training images can be sampled from different synchronized cameras if available.

We disentangle appearance- and motion-related information during optimization by applying \(\mathcal {L}_{\textrm{rec}}\) to images reconstructed both with and without motion reparameterization. That is, we sample F for radiance/density values \((\textbf{c}_t, \boldsymbol{\sigma }_t)\) both as \(F\left( \textbf{p} + M(\textbf{p}, \boldsymbol{\omega }_{t \rightarrow t + \delta t}), t + \delta t\right) \) and as \(F\left( \textbf{p}, t\right) \) (cf. Fig. 4).
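
Putting the pieces together, one optimization step can be sketched as follows. It reuses the F, M, `predictor`, B, W and `prediction_loss` stand-ins from the previous sketches; `render` is a hypothetical helper that runs the quadrature of Eq. 1 over the sampled points of each ray, and the batch layout is an assumption.

```python
def training_step(F, M, predictor, B, W, batch, optimizer, gamma=0.01, tau=3):
    """One joint step on a sampled frame transition t -> t + dt (cf. Fig. 4)."""
    t = batch["t"]                              # frame index, assumed t >= tau
    p, time, dt = batch["points"], batch["time"], batch["dt"]

    # Reconstruction with and without motion reparameterization (Sect. 4.4),
    # both compared against the observed colors at time t.
    # `render` is a user-supplied volume renderer (e.g., based on the Eq. 1 sketch).
    omega = W[t] @ B
    c_dir, s_dir = F(p, time)
    c_warp, s_warp = F(p + M(p, omega), time + dt)
    loss_rec = ((render(c_dir, s_dir, batch) - batch["rgb_gt"]) ** 2).sum() + \
               ((render(c_warp, s_warp, batch) - batch["rgb_gt"]) ** 2).sum()

    # Predictability regularization (Eq. 4), computed online.
    loss_pred = prediction_loss(predictor, W, t, tau)

    loss = loss_rec + gamma * loss_pred          # combined objective, cf. Sect. 5
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```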

Fig. 4.

System optimization, demonstrated on a batch of 4 frames. Predictor P infers a vector \(\boldsymbol{\omega }\) based on the preceding 3 frames; \(\mathcal {L}_{\textrm{pred}}\) minimizes the difference between these predicted embeddings and their sampled equivalents; whereas reconstruction loss \(\mathcal {L}_{\textrm{rec}}\) is applied to the predicted four frames, with and without motion reparameterization.

5 Experiments

We qualitatively and quantitatively evaluate our method in this section. We urge the reader to check our video to better appraise the quality of motion. The following three datasets are used for evaluation:

  • ZJU-MoCap [40] is a multi-camera dataset, with videos of one person performing different actions. Since each video sequence records a single human, the scene topology varies little and we compare our method with canonical-frame-based representations of dynamic scenes. We use videos from 11 cameras for evaluation.

  • Panoptic [18] includes videos from multiple synchronized cameras under many different settings including multi-person activities and human-object interactions. We select 4 challenging and representative video clips from the 31 HD cameras and denote them as Sports, Tools, Ian, and Cello. Each clip has 400 frames and all the clips involve human-object interaction.

  • HyperNeRF [37] is a single-camera dataset, i.e., with one view available at each timestamp. Unlike the previous two datasets that use static cameras, in HyperNeRF the multiview information is generated by moving the camera around the scene. HyperNeRF is challenging not only because of the single-camera setting, but also because of its topologically varying scenes.

Details about the clips (e.g., starting and ending frame number) are included in the supplementary. All the sequences are split into short intervals consisting of 25 frames. On each interval, the networks are trained using an Adam optimizer [19] with a learning rate that decays from \(5\times 10^{-4}\) to \(5\times 10^{-6}\) every 50k iterations. During training, the two losses are added with a balancing parameter, i.e., \(\mathcal {L}=\mathcal {L}_{\textrm{rec}}+\gamma \mathcal {L}_{\textrm{pred}}\) with \(\gamma \) set to 0.01 in all experiments. A batch of 1,024 rays is randomly sampled from the selected frames for training the motion field and the space-time field. We observe that using viewing direction \(\textbf{d}\) in F leads to worse performance if the scene of interest mostly contains Lambertian surfaces. In our experiments, the viewing direction is not taken as the input for the space-time field, i.e., a space-time irradiance field [57]. The network structures of the motion field and the space-time field are the same as in NeRF [30]. The predictor consists of 5 fully connected layers with a width of 128 and ReLU activations.
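
For reference, the predictor described above and one possible reading of the optimization schedule can be written as follows; the layer count, width, activation, learning-rate range and \(\gamma \) are taken from the text, whereas the input/output sizes, the parameter grouping, and the exact decay rule are assumptions.

```python
import torch
import torch.nn as nn

n, tau = 5, 3   # weight-vector dimension and history length (Sects. 4.2, 4.4)

# Five fully connected layers of width 128 with ReLU activations, mapping the
# tau previous weight vectors to the next one.
predictor = nn.Sequential(
    nn.Linear(tau * n, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, n),
)

params = list(predictor.parameters())   # plus F, M, B and W in practice
optimizer = torch.optim.Adam(params, lr=5e-4)
# Exponential decay from 5e-4 to 5e-6 over 50k iterations (one reading of the
# schedule described above).
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=(5e-6 / 5e-4) ** (1.0 / 50_000))
```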

Fig. 5.

Comparison of the estimated motion on the ZJU-MoCap dataset. Only one person is captured for each sequence and we compare our method with canonical frame-based methods Nerfies [36] and D-NeRF [41]. Motion for 20 frames is demonstrated.

Fig. 6.

Motion estimation comparison on the Panoptic dataset [18]. Motions estimated by VideoNeRF are more chaotic than NSFF, possibly due to the 2D optical flow supervision adopted in NSFF. Our method faithfully estimates the motions of people and objects, whereas NSFF fails to track some points, e.g., the ball in Sports and Ian.

Fig. 7.

Comparison of the estimated motion on the HyperNeRF dataset [37]. We randomly sample points on the surfaces and then show their motions.

5.1 Qualitative Evaluation

We visually compare the estimated motion in this section. Since the neural motion field tracks all points in space, we randomly sample points and then show their trajectories. Different sampling strategies are used for different datasets. For ZJU-MoCap, we first sample a dense grid of points, remove the empty ones with \(\sigma <20\), and then randomly sample points from the non-empty ones. For Panoptic, since the background (walls and floors) is kept in the scenes, we sample meaningful points near the persons in the scene, leveraging the provided people positions. For HyperNeRF, since the scenes are all front-facing, we sample points on the surfaces according to the depth generated by the space-time field F from one view.
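
As an example, the ZJU-MoCap sampling strategy can be sketched as below; the grid bounds, resolution, and number of visualized points are illustrative choices, and F is the space-time field stand-in from Sect. 4.1 (queried in chunks in practice).

```python
import torch

def sample_nonempty_points(F, t, grid_res=64, sigma_thresh=20.0, k=2000):
    """Sample k points to visualize: dense grid, keep points with sigma >= thresh."""
    axis = torch.linspace(-1.0, 1.0, grid_res)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    p = grid.reshape(-1, 3)                               # (grid_res^3, 3)
    with torch.no_grad():                                 # query F in chunks in practice
        _, sigma = F(p, torch.full((p.shape[0], 1), float(t)))
    keep = p[sigma.squeeze(-1) >= sigma_thresh]           # drop empty space
    idx = torch.randperm(keep.shape[0])[:k]
    return keep[idx]
```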

On Multi-Camera Dataset. We first present our results on ZJU-MoCap in Fig. 5. Since there is only one person in this dataset, the topology of the scene remains roughly unchanged and canonical-space-based methods can be applied. Nerfies [36] and D-NeRF [41] are selected for comparison. As can be observed from the images, our method generates smooth motion, as opposed to the rugged and noisy motion from the other two methods.

Figure 6 demonstrates the performance of our method and competitors on the Panoptic dataset. The scenes contain complex geometries, and objects may appear or disappear in the middle of a sequence. Two space-time-field-based methods, VideoNeRF [57] and NSFF [21], are selected for comparison. Our method estimates the motion of both people and objects accurately, while VideoNeRF presents chaotic results and the motions from NSFF are occasionally inaccurate. The results in Figs. 5 and 6 validate our claim that our method can track all points in space well without prior knowledge of the scene.

On Single-Camera Dataset. To further validate our method, we demonstrate motion estimation in single-camera settings, which are more commonly encountered by dynamic-scene novel-view rendering methods. We consider the challenging scenes captured by HyperNeRF [37]. As shown in Fig. 7, we compare again to VideoNeRF [57] and NSFF [21]. Our results are more temporally consistent and accurate than those of the competitors. These results highlight the practical value of our method, which can accurately handle single-camera image sequences captured in the wild.

5.2 Quantitative Evaluation

Quantitative evaluation is difficult for our task since manually labeling a dense set of points in space is expensive, if not infeasible. We thus use the sparser human body joints provided by the Panoptic dataset to quantify the accuracy of the estimated motion. MPJPE [45] and 3D-PCK [27] are two widely used metrics for evaluating 3D human pose tracking performance, but neither suits our task, since our tracking requires the positions of points at the starting frame as input. We propose to calculate the tracking error across K frames and use the averaged value as a metric. We denote this metric \(\textrm{mMPJPE}_K\) (mean MPJPE), computed as:

$$\begin{aligned} \textrm{mMPJPE}_K = \frac{1}{N_f} \frac{1}{K} \sum _{u=1}^{N_f} \sum _{v=u+1}^{u+K} \textrm{MPJPE}(P_{u\rightarrow v},P_v^{\textrm{gt}}), \end{aligned}$$
(5)

where K is the number of frames over which the motion is evaluated and \(N_f\) is the total number of frames in the sequence. \(P_{u\rightarrow v}\) represents the estimated positions for the vth frame given the positions of the uth frame as input, and \(P_v^{\textrm{gt}}\) the ground-truth joint positions for the vth frame.
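
Eq. 5 can be computed as in the following sketch; `track_fn`, the window clipping at the end of the sequence, and the NumPy layout are assumptions on top of the formula.

```python
import numpy as np

def mMPJPE_K(track_fn, joints_gt, K):
    """Eq. 5: average K-frame tracking error over all starting frames.

    joints_gt: (N_f, J, 3) ground-truth joint positions (frames indexed from 0).
    track_fn(P_u, u, v) -> (J, 3): joints given at frame u, tracked to frame v
    through the estimated motion field (a hypothetical wrapper around M).
    """
    N_f = joints_gt.shape[0]
    total = 0.0
    for u in range(N_f):
        # Windows are clipped at the end of the sequence (boundary handling
        # is our assumption, not specified by Eq. 5).
        for v in range(u + 1, min(u + K, N_f - 1) + 1):
            pred = track_fn(joints_gt[u], u, v)
            total += np.linalg.norm(pred - joints_gt[v], axis=-1).mean()  # MPJPE
    return total / (N_f * K)
```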

We report the \(\textrm{mMPJPE}_K\) metric with \(K =\) 5, 10, 15 on the Panoptic dataset in Table 1. Our method achieves more accurate tracking than the other two methods, except on Ian when tracking over 5 and 10 frames. NSFF requires both 2D optical flow and depth, while VideoNeRF requires depth information; in contrast, we do not use any data-driven prior to guide the motion estimation module. Moreover, in Fig. 8 we visualize the tracked pose and the ground-truth pose on one sequence and compute the corresponding mMPJPE metrics (\(N_f=1\) for one sequence).

Table 1. Quantitatively evaluating the estimated motion on the Panoptic dataset. Locations of the body joints in the starting frame are used as the inputs and we calculate the averaged tracking error for the body joints.
Fig. 8.

Visualization of motion tracking results on the Panoptic dataset.

Fig. 9.

Accuracy evaluation of the motion predictor. Left: plot of the MPJPE of predicted future body-joint locations; horizontal lines are the mMPJPE\(_{15}\) results on the corresponding scenes. Right: visualization of the predicted future motion of densely sampled points on the last observed frames, together with the \(10^\textrm{th}\) ground-truth future frames.

Fig. 10.

Transferability of the motion predictor. We train the whole framework on the left-side sequence, then freeze the predictor and fine-tune the other models on the right-side sequence. The motions of the next 10 frames are predicted from the last observed frame (first image on the right side) and visualized. The other two images show the real movements at the \(5^\textrm{th}\) and \(10^\textrm{th}\) future frames.

5.3 Analysis of the Motion Predictor

We analyze the motion predictor P in two aspects: prediction accuracy and transferability. For the accuracy evaluation, we compare the predicted future locations of the body joints with the ground-truth future locations. The results are shown in Fig. 9. The training sequences are separated into 20 intervals and we test the prediction results on each interval. The MPJPE of the predicted body-joint locations is averaged over all the intervals and plotted. Comparing against the mMPJPE\(_{15}\) values from Table 1 (horizontal lines in Fig. 9), we observe that the model can predict the unseen motion of the next 5 time steps with a low error, close to the tracking error over actual observations.

We further demonstrate the transferability of the predictor in Fig. 10. Since the predictor generates motion codes in a latent space, the same model should work for motion sequences with similar patterns. We test this intuition on the ZJU-MoCap dataset, on two sequences in which the person performs similar actions. We can observe from the right side of the figure that the predicted motions align with the real movements. The results demonstrate that the predictor is indeed transferable when the motions are similar.

6 Discussion

Limitations. Our method sometimes fails on non-rigid or monochromatic elements, for which the motion estimation problem becomes underconstrained: in the non-rigid case, several points may converge to the same location, and in monochromatic areas it may be hard to tell which part moved. We presume that a more advanced (possibly pre-trained) motion prediction model could be leveraged in such cases. Moreover, while our method shows higher precision in estimating natural motion (e.g., dense human motion tracking), addressing other challenging scenes (e.g., scenes with chaotic particles) is left for future work.

Conclusion. We introduced a novel solution for the regularization and prediction of 3D dense motion in dynamic scenes. Leveraging advances in neural fields, we propose a combination of space-time and motion fields conditioned on motion embeddings. Through predictability-based regularization over these embeddings, we promote the encoding of scene-relevant motions and penalize ambiguous and noisy deformations. We acknowledge that this scheme may not benefit all types of scenes (cf. the limitations above), but it shows higher precision in natural settings.