Abstract
Predicting human motion is critical for assistive robots and AR/VR applications, where the interaction with humans needs to be safe and comfortable. Meanwhile, an accurate prediction depends on understanding both the scene context and human intentions. Even though many works study scene-aware human motion prediction, the latter is largely underexplored due to the lack of ego-centric views that disclose human intent and the limited diversity in motion and scenes. To reduce the gap, we propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, as well as ego-centric views with the eye gaze that serves as a surrogate for inferring human intent. By employing inertial sensors for motion capture, our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects. We perform an extensive study of the benefits of leveraging the eye gaze for ego-centric human motion prediction with various state-of-the-art architectures. Moreover, to realize the full potential of the gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches. Our network achieves the top performance in human motion prediction on the proposed dataset, thanks to the intent information from eye gaze and the denoised gaze feature modulated by the motion. Code and data can be found at https://github.com/y-zheng18/GIMO.
1 Introduction
A large portion of the human brain cortex is devoted to processing visual signals collected by the optic nerve, and over half of the nerve fibers carry information from the fovea that is responsible for sharp central vision. When modulated through foveal fixation, or equivalently, eye gaze, important sensory input of fine details perceived with the fovea can inform future actions of the human agent [8, 42]. As shown in Fig. 1, a human agent intending to perform two different tasks exhibits distinctive gaze patterns, even though the first few moves are barely distinguishable. Hence, it is beneficial to employ eye gaze when making human motion predictions in the 3D scene, which is of great importance for human-machine interactions [1, 6]. For example, a human agent wearing an AR/VR headset may approach a chair to sit on it or grab a cup on the table behind it. If the latter is true, we may want the headset to send a warning for collision avoidance based on the forecasted future. To resolve ambiguities for reliable human motion prediction, there is an increasing interest in leveraging eye gaze, as it correlates strongly with the user intent that motivates the subsequent actions.
The key to understanding the role of gaze and how it can effectively inform human motion prediction lies in two folds. First, it is critical to have a dataset with high-quality 3D body pose annotations and corresponding eye gaze. Besides data quality, the 3D scene and motion dynamics should be diverse to enable meaningful learning and evaluation of the gain when eye gaze is incorporated. Second, it is crucial to have a network architecture that can efficiently utilize sparse eye gaze during predictions given the multi-modal setting (e.g., gaze, human motion and scene geometry) and the fact that not every single gaze is of the same significance regarding the agent’s intent (e.g., one may get distracted by a salient object in the scene that has nothing to do with the task at hand).
However, most existing human motion datasets do not support evaluating the effect of eye gaze due to the lack of ego-centric data annotated with both gaze and 3D body pose within the same scene. Recently, a few datasets on ego-centric social interaction and object manipulation have been proposed in which gaze and the viewer's 3D poses are available. Nevertheless, they are not suitable for ego-centric human motion prediction since the diversity of scenes and the variation in motion dynamics are very limited. To validate the benefits of eye gaze in human motion prediction, we propose a large-scale ego-centric dataset, which contains the scene context, eye gaze, and accurate 3D body poses of the human actors. By employing an advanced motion capture system based on Inertial Measurement Units (IMUs), we can collect 3D pose data with high fidelity and avoid the limitations of conventional multi-camera systems. For example, the actor can walk through any environment without performing a cumbersome setup of motion capture devices. Moreover, accurate poses can be recorded without any 2D-3D lifting, which could induce errors due to occlusions and noise in the detection. These advantages enable the actors to perform various long-horizon activities in a diverse set of daily living environments.
To assess the effectiveness of eye gaze in improving human motion prediction, we perform an extensive study with multiple state-of-the-art architectures. However, we note that gaze and motion could both be inherently ambiguous in forecasting future movements. For example, the gaze may be allocated to a TV monitor while walking towards the dining table. In this case, the actor may simply follow the momentum, thus rendering the eye gaze uninformative about the body motion. To fully utilize the potential of eye gaze, we propose a novel architecture that employs cross-modal attention such that not only can future motion benefit from the eye gaze, but the significance of gaze in predicting the future can also be reinforced by the observed motion. With eye gaze, better human motion predictions are observed across various architectures. Furthermore, the proposed architecture achieves top performance measured under different criteria, verifying the effectiveness of our bidirectional fusion scheme.
In summary, we make the following contributions. First, we provide a large-scale human motion dataset that enables investigating the benefits of eye gaze under diverse scenes and motion dynamics. Second, we propose a novel architecture with a bidirectional multi-modal fusion that better suits gaze-informed human motion prediction through mutual disambiguation between motion and gaze. Finally, we validate the usefulness of eye gaze for human motion prediction with multiple architectures and verify the effectiveness of the proposed architecture by showing top performance on the proposed benchmark.
2 Related Work
Datasets for Human Motions. Human motion modeling is a long-standing problem and has been extensively explored with high-quality motion capture datasets, ranging from the small-scale CMU Graphics Lab Motion Capture Database [5] to large-scale ones like AMASS [31]. Human3.6M [13] captures high-quality motions using a multi-view camera system and serves as a standard benchmark for motion prediction and 3D pose estimation. While these datasets provide adequate data for learning motion dynamics, the constraints from the 3D environment are usually not included. Later datasets contain the 3D scene: scene-aware motion prediction can be studied using the GTA-1M dataset [4], and PROX [11] includes both 3D scenes and human interaction motions, which have been used to explore the scene-aware motion generation task [51] and the problem of placing humans in scenes [60, 62]. Since the data are always collected with a human agent, ego-centric videos are provided in EgoPose [54, 55], Kinpoly [30], and HPS [9] to study how motion estimation and prediction can benefit from these ego-centric observations. Moreover, social interaction is considered in You2Me [36] and EgoBody [57]. However, existing datasets do not contain diverse 3D scenes and intention-driven human motions. We therefore collect a large-scale dataset for gaze-guided human motion prediction, consisting of high-quality human motions, 3D scenes, ego-centric videos, and the corresponding eye gaze information.
Human Motion Prediction. RNNs have proven successful in modeling human motion dynamics [3, 7, 27, 34, 61]. [32] proposes an attention-based model that guides future prediction with the motion history. To effectively exploit both spatial and temporal dependencies in human pose sequences, ST-Transformer [2] designs a spatio-temporal transformer architecture to model human motions. Pose Transformers [35] investigates a non-autoregressive formulation using a transformer model and shows superior performance in terms of both efficiency and accuracy. As human motions are tightly correlated with the scene context, scene-aware motion prediction is also actively studied [4, 10, 63]. A three-stage pipeline is established to predict long-term human motions conditioned on the scene context [4]. SAMP [10] further includes object geometry to estimate interaction positions and orientations, and generates motions following a calculated collision-free trajectory. Besides scene constraints, other modalities such as gaze and music also provide clues for future motion prediction. Transformers [48] are applied to generate dance movements conditioned on music [24, 25, 47]. MoGaze [21] verifies the effectiveness of eye gaze information for motion prediction with an RNN model in a full-body manipulation scenario. Our work aims to predict long-term future motions under both 3D scene and gaze constraints. We differ from existing motion prediction works in that they focus on dense motion prediction, while we predict long-term sparse motions to understand human intentions.
Human Motion Estimation. 3D pose estimation is extensively studied for third-person-view images or videos [12, 18,19,20, 29, 43, 56, 58]. VIBE [18] proposes a sequential model to estimate human poses and shapes from videos, along with a motion discriminator that constrains the predictions to a plausible motion manifold. TCMR [12] explicitly enforces the neural network to leverage past and future frames to eliminate jitter in the predictions. Motion priors have been found effective in improving temporal smoothness and tackling occlusions [23, 40, 59]. Ego-centric pose estimation has received more attention recently. Pose estimation from images captured with a fish-eye camera is explored in [41, 44, 45, 50, 53]. [15] deploys a chest-mounted camera and predicts motions based on an implicit motion graph. Following the chest-mounted camera setting, You2Me [36] introduces the motions of the visible second person as an additional signal to constrain the motion estimation of the camera wearer. [30, 54, 55] explore motion estimation and prediction with a head-mounted front-facing camera. In this work, we address the ego-centric motion prediction task where past motions are given. Our proposed dataset can also benefit the ego-centric motion estimation problem.
3 GIMO Dataset: Gaze and Motion with Scene Context
Human motion is affected both by the scene, which imposes physical constraints, and by the agent's psychological demands, which drive body movements. To concretely assess the benefits induced by eye gaze, we need both ego-centric views and 3D body poses of the agent. In particular, they should be temporally synchronized and spatially aligned within the 3D scenes. Current datasets for human motion prediction are either collected in a virtual environment, at the risk of being unrealistic, or captured by an array of cameras with limited scene diversity and motion dynamics. Moreover, eye gaze is usually not available.
Therefore, we propose a real-world large-scale dataset that provides high-quality human motions, ego-centric views with eye gaze, as well as 3D environments. Next, we describe our data collection pipeline.
3.1 Hardware Setup
We employ a commercial IMU-based motion capture system to record high-quality 3D body poses of the human agent, whose eye gaze in 3D is detected using an AR device mounted on the head. The 3D scenes are scanned by a smartphone equipped with a LiDAR sensor (please see Fig. 2, top-right).
Motion Capture. To capture daily activities in various indoor environments, we resort to motion capture from IMU signals following HPS [9]. While HPS only provides SMPL [28] models with body movements, we take advantage of an advanced commercial product that records the subject's 3D body and hand joint movements at 96 fps. To obtain the full-body pose and hand gestures of the subject, we fit the SMPL-X [37] model to the joint positions computed from the recorded IMU signals. Compared to human motion datasets like PROX [11], where the 3D body pose is estimated from monocular RGB videos, the pose obtained with the above procedure is free from estimation errors caused by noisy detection and occlusions. Fitting parametric human body models to poses from multi-view RGB(D) streams or marker-based systems is also commonly used to collect human motion data [13, 17, 57]; however, our pipeline requires much less effort in presetting the environment, so we can collect human motion data in any indoor scene. These characteristics enable us to ensure the diversity of scenes and motion dynamics in our dataset.
Gaze Capture. Following [57], we use a Hololens2 and its Research Mode API [46] to capture the 3D eye gaze. It also records ego-centric video at 30 fps at \(760\times 428\) resolution, long-throw depth streams at 1–5 fps at \(512\times 512\), and the 6D poses of the head-mounted camera. The 3D scene is reconstructed through TSDF fusion of the recorded depth, which is used for the subsequent global alignment. The eye gaze is recorded as a 3D point in the coordinate system of the headset.
3D Scene Acquisition. To obtain high-quality 3D geometry of the scene (the TSDF reconstruction from the Hololens2 is usually noisy), we use an iPhone 13 Pro Max equipped with a LiDAR sensor to scan the environment with the 3D Scanner App. The output mesh contains about 500k vertices and photorealistic texture, providing sufficient detail to infer the affordances of the scene. The data collection process involving human agents and the alignment of the different coordinate frames to the scanned meshes are described in the following.
3.2 Data Collection with Human in Action
One distinct feature of our dataset is that it captures long-term motions with clear intentions. Different from prior datasets for motion estimation purposes where the subjects are performing random actions such as jumping and waving hands, we aim at collecting motion trajectories with semantic meaning, e.g., walk to open the door. Thus, we focus on collecting data from various daily activities in indoor scenes. The full statistics of our dataset are listed in Table 1.
To this end, we recruit 11 university students (4 female and 7 male) and ask them to perform the activities defined in Table 2. The subjects are instructed to start from a location distant from the goal object and then move to the destination to act, so that long-term motion with clear intention can be obtained. Specifically, the collection process includes the following steps: (i) the subject wears the head-mounted Hololens2 and the IMU-based motion capture clothes and gloves, and calibration is performed to set up the motion capture system; (ii) the subject chooses actions from the activities in Table 2 according to the affordances of the scene; (iii) the 3D scene is scanned; (iv) the subject carries out the planned activities in the scene while data are collected; (v) the scene is reset for the following subjects to perform their activities. Note that if a subject changes the scene geometry, we reset the objects to their original states to avoid rescanning the whole environment.
As a result, our dataset contains 129k ego-centric images, 11 subjects, and 217 motion trajectories in 19 scenes, manifesting enough capacity and diversity for gaze-informed human motion prediction. As illustrated in Fig. 2, the motions are smooth and convey clear semantic intentions.
3.3 Data Preparation
Synchronization. Due to compatibility issues, it is difficult to synchronize the motion capture system with the Hololens2 without modifying their commercial software. Instead, we use a hand gesture that can be observed in the ego-centric view as a starting signal. Once the pose and the ego-centric image of the hand gesture are aligned, the remaining frames can be synchronized according to their timestamps.
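The gesture-anchored synchronization described above can be sketched as follows. This is an illustrative toy, not the released pipeline: the function and variable names are ours, and we assume both streams carry per-frame timestamps on their own clocks.

```python
# Hypothetical sketch of timestamp-based synchronization: once the frame
# showing the starting hand gesture is identified in each stream, later
# frames are matched by their time offsets from that shared anchor.

def sync_streams(mocap_ts, ego_ts, mocap_anchor, ego_anchor):
    """For each ego-centric frame, find the mocap frame whose offset from
    the gesture anchor is closest.

    mocap_ts / ego_ts : per-frame timestamps in seconds (per-stream clocks).
    mocap_anchor / ego_anchor : indices of the gesture frame in each stream.
    """
    t0_mocap = mocap_ts[mocap_anchor]
    t0_ego = ego_ts[ego_anchor]
    pairs = []
    for i, t in enumerate(ego_ts):
        offset = t - t0_ego  # time since the gesture in the ego stream
        # nearest mocap frame with the same offset from its own anchor
        j = min(range(len(mocap_ts)),
                key=lambda k: abs((mocap_ts[k] - t0_mocap) - offset))
        pairs.append((i, j))
    return pairs

# Mocap at 96 fps, ego video at 30 fps, gesture at frame 0 of both streams.
mocap_ts = [k / 96.0 for k in range(96)]   # one second of mocap
ego_ts = [k / 30.0 for k in range(30)]     # one second of video
pairs = sync_streams(mocap_ts, ego_ts, 0, 0)
```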
Parametric Model Fitting. To obtain the 3D body pose and shape of the subject, we fit SMPL-X [37] model to the 3D joints (23 body joints, 15 left-hand joints, and 15 right-hand joints), which are computed from the recorded IMU signals by the provided commercial software. In addition, the 6D head pose is used to determine the head position and orientation of the SMPL-X model.
Alignment. The Hololens2 coordinate system and the fitted SMPL-X models need to be aligned with the high-quality 3D scene scans. The former is aligned through ICP between the TSDF fusion result of the depth recorded by the Hololens2 and the 3D scene. The SMPL-X motion sequence is first transformed to the Hololens2 coordinate system via human annotations, i.e., the start and end shapes of the human body are scanned by the Hololens2 and visible in the TSDF reconstruction, serving as anchor shapes for aligning the fitted models. The poses can then be aligned to the 3D scene using the global transformation obtained from the previous ICP alignment between the scans. We name our dataset GIMO and describe our method for gaze-informed motion prediction in the following.
4 Gaze-Informed Human Motion Prediction
Gaze conveys relevant information about the subject’s intent, which can be used to enhance long-horizon human motion prediction. On the other hand, past motions [2, 4], ego-centric views [10, 55], or 3D context [10, 51] could provide helpful constraints on human motion, yet, the prediction is still challenging and suffers from uncertainties in the future. Here, we aim at gaze-informed long-term human motion prediction. Specifically, given the past motion, 3D scene, and 3D eye gaze as inputs, we study how they can be integrated to resolve the ambiguities in future motion and generate intention-aware motion predictions.
To fully utilize the geometry information provided by the 3D scene and intention clues from past motions and gaze, we propose a novel framework with a bidirectional fusion scheme that facilitates the communication between different modalities. As shown in Fig. 3, we use PointNet++ [39] as the encoding backbone to extract per-point features of the 3D scene, followed by several cross-modal transformers to exchange information across the multi-modal embeddings.
4.1 Problem Definition
We represent a motion sample as a parametric sequence \(X_{i:j}=\{x_i, x_{i+1}, \cdots , x_{j}\}\) where \(x_{k}=(t_k, r_k, h_k, \beta _{k}, p_{k})\) is a pose frame at time k. Here \(t\in R^3\) is the global translation, \(r\in SO(3)\) denotes the global orientation, \(h_k\in R^{32}\) refers to the body pose embedding, \(\beta \in R^{10}\) is the shape parameter, and \(p\in R^{24}\) is the hand pose, where SMPL-X body mesh \(M=\mathcal {M}(t_k, r_k, h_k, \beta _{k}, p_{k})\) can be obtained using VPoser [37]. The 3D scene is represented as a point cloud \(S\in R^{n\times 3}\), and the 3D gaze point \(g\in R^3\) is defined as the intersection points between the gaze direction and the scene. Thus, given the inputs of a motion sequence \(X_{1:t}\) along with the corresponding 3D gaze \(G_{1:t}=\{g_1, g_2, \cdots , g_t\}\) and the 3D scene S, we aim to predict the future motion \(X_{t:t+T}={\varPhi }(X_{1:t}, G_{1:t}, S|\theta )\) where \(\theta \) represents the network parameters.
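For concreteness, the inputs defined above can be sketched as plain containers. The dataclass and its field names are our own illustration, with sizes taken from the parameterization in the text.

```python
from dataclasses import dataclass
import numpy as np

# A minimal container mirroring the per-frame parameterization above
# (field sizes follow the dimensions stated in the text; names are ours).

@dataclass
class PoseFrame:
    t: np.ndarray      # global translation, (3,)
    r: np.ndarray      # global orientation in SO(3), e.g. axis-angle, (3,)
    h: np.ndarray      # body pose embedding (VPoser latent), (32,)
    beta: np.ndarray   # body shape parameters, (10,)
    p: np.ndarray      # hand pose parameters, (24,)

def random_frame(rng):
    return PoseFrame(rng.standard_normal(3), rng.standard_normal(3),
                     rng.standard_normal(32), rng.standard_normal(10),
                     rng.standard_normal(24))

rng = np.random.default_rng(0)
past = [random_frame(rng) for _ in range(6)]    # X_{1:t}, 3 s at 2 fps
gaze = rng.standard_normal((6, 3))              # G_{1:t}, one 3D point per frame
scene = rng.standard_normal((500_000, 3))       # S, point cloud of the scan
```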
4.2 Multi-modal Feature Extraction
Instead of extracting the multi-modal embeddings independently [25], we propose a novel scheme to integrate the motion, gaze, and scene features. The gist is to let the motion and gaze features communicate to each other, so their uncertainties regarding the future can be mutually decreased, resulting in more effective utilization of the gaze information.
Scene Feature Extraction. To learn the constraints from the 3D scene and guide the network to pay attention to local geometric structures, we apply PointNet++ to extract both global and local scene features. Specifically, we obtain the per-point feature map and a global descriptor of the scene as follows:

\((F_P, F_o) = \text{PointNet++}(S), \qquad (1)\)
where \(S\in R^{n\times 3}\) is the input point cloud, \(F_P\in R^{n\times d_p}\) is the per-point \(d_p\)-dimensional feature map, and \(F_o\in R^{d_o}\) is the global descriptor of the scene. Given the per-point feature map \(F_P\), the feature of an arbitrary point e can be computed through inverse distance weighted interpolation [39]:

\(F_{P|e} = \frac{\sum _{i=1}^{n_e} \omega _i F_P(p_i)}{\sum _{i=1}^{n_e} \omega _i}, \quad \omega _i = \frac{1}{\Vert e - p_i\Vert }, \qquad (2)\)
where \(\{p_{1}, p_{2}, \cdots , p_{n_e}\}\) are the nearest neighbors of e in the scene point cloud.
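A minimal NumPy sketch of this interpolation, following our reading of the PointNet++ scheme (the exact number of neighbors and weight exponent used in the paper may differ):

```python
import numpy as np

# Inverse-distance-weighted feature interpolation: the feature at a query
# point e is a weighted average of the features of its n_e nearest scene
# points, with weights proportional to 1/distance.

def interpolate_feature(e, points, feats, n_e=3, eps=1e-8):
    """e: (3,) query point; points: (n, 3) scene cloud; feats: (n, d)
    per-point features. Returns the (d,) interpolated feature F_{P|e}."""
    d = np.linalg.norm(points - e, axis=1)   # distances to all scene points
    idx = np.argsort(d)[:n_e]                # indices of n_e nearest neighbors
    w = 1.0 / (d[idx] + eps)                 # inverse-distance weights
    w = w / w.sum()                          # normalize to sum to 1
    return (w[:, None] * feats[idx]).sum(axis=0)

points = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [5., 5., 5.]])
feats = np.array([[1., 0.], [0., 1.], [0.5, 0.5], [9., 9.]])
f = interpolate_feature(np.array([0.1, 0.1, 0.]), points, feats)
# The query sits close to the first point, so f is dominated by feats[0].
```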
Gaze Feature Extraction. We query the gaze point feature \(f_g\) from the per-point scene feature map \(F_P\) according to Eq. 2, i.e., \(f_g=F_{P|g}\). Thus, the interpolated gaze feature contains relevant scene information that provides cues to infer the subject’s intention.
Motion Feature Extraction. A linear layer is used to extract the motion embedding \(f_m\) from the input motion parameter x. To endow the embedding with awareness of the 3D scene, we further query the scene features of the SMPL-X vertices using Eq. 2. These SMPL-X per-vertex features are then fed to PointNet [38] to get the ambient scene context feature \(f_{m\_v}\) of the current motion pose:

\(f_{m\_v} = \text{PointNet}(F_{P|\mathcal {M}(x)}), \qquad (3)\)
where \(\mathcal {M}(x)\) is the SMPL-X vertex set with motion parameter x.
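The ambient-context step can be sketched as follows. We stand in for the learned PointNet with a single shared linear map plus max pooling, so this is a structural illustration only, under our own naming.

```python
import numpy as np

# Scene features are queried at each SMPL-X vertex (Eq. 2), then
# aggregated PointNet-style: a shared per-point map followed by a
# permutation-invariant max pool. A real PointNet uses learned MLPs.

def ambient_context(vertex_feats, W):
    """vertex_feats: (V, d) scene features at the body vertices;
    W: (d, d_out) shared linear map. Returns the (d_out,) context f_{m_v}."""
    return np.maximum(vertex_feats @ W, 0.0).max(axis=0)  # ReLU + max pool

rng = np.random.default_rng(0)
vertex_feats = rng.standard_normal((10475, 16))   # SMPL-X has 10,475 vertices
W = rng.standard_normal((16, 32))
f_mv = ambient_context(vertex_feats, W)
```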
4.3 Attention-Aware Multi-modal Feature Fusion
Given the multi-modal nature of the gaze, scene, and motion, an efficient feature fusion module is necessary to leverage the information from different modalities. Instead of directly concatenating the features [25], we propose a more effective scheme by deploying a cross-modal transformer [14] to fuse the gaze, motion, and scene features (Fig. 3). We explain our design in the following.
Cross-Modal Transformer. The cross-modal transformer [14] is used to capture the correlations between input embedding sequences and to establish communication between the multi-modal information. It is largely based on the attention mechanism [48]. An attention function [14] maps a query and key-value pairs to an output as:

\(\text{Att}(q, k, v) = \text{softmax}\left( \frac{(qW_q)(kW_k)^T}{\sqrt{d_K}}\right) vW_v,\)
where \(q\in R^{l_q\times d_q}\), \(k\in R^{l_{kv}\times d_k}\), \(v\in R^{l_{kv}\times d_v}\) are the input query, key and value vectors, and \(W_q\in R^{d_q\times d_K}\), \(W_k\in R^{d_k\times d_K}\), \(W_v\in R^{d_v\times d_V}\) embed the inputs. Here d denotes the dimension of the input vector and l is the sequence length.
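A toy NumPy rendering of this attention function, with the embedding matrices named as in the text (sizes are arbitrary and chosen for illustration):

```python
import numpy as np

# Scaled dot-product attention: queries and keys are embedded, compared by
# dot product scaled by sqrt(d_K), and the softmax weights mix the
# embedded values.

def attention(q, k, v, Wq, Wk, Wv):
    """q: (l_q, d_q); k: (l_kv, d_k); v: (l_kv, d_v). Returns (l_q, d_V)."""
    scores = (q @ Wq) @ (k @ Wk).T / np.sqrt(Wq.shape[1])   # (l_q, l_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ (v @ Wv)

rng = np.random.default_rng(0)
l_q, l_kv, d, dK = 6, 6, 16, 8   # e.g. six gaze queries over six motion frames
q = rng.standard_normal((l_q, d))
k = rng.standard_normal((l_kv, d))
v = rng.standard_normal((l_kv, d))
Wq, Wk, Wv = (rng.standard_normal((d, dK)) for _ in range(3))
out = attention(q, k, v, Wq, Wk, Wv)
```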
As shown in Fig. 3(b), the cross-modal transformer is built on a stack of attention layers, which maps a \(t_i\)-length input into a \(t_q\)-length output by querying with a \(t_q\)-length feature:

\(f_{out} = \text{CrossTrans}(f_{query}, f_{input}), \quad f_{query}\in R^{t_q\times d},\ f_{input}\in R^{t_i\times d}.\)
It has proven effective in processing multi-modal signals, e.g., text and audio.
Motion Feature Fusion. The motion feature should be aware of the 3D scene context and the subject's intention inferred from the gaze information, so that it can guide the prediction network to generate more reasonable motion trajectories (e.g., free from penetration and collision) and accurate estimates of the ending position or pose of the subject. For this purpose, we first use the scene context feature \(f_{m\_v}\) acquired from the ambient 3D environment (Eq. 3) as the query to update the motion feature \(f_m\) through a motion-scene transformer:

\(f_{m\_s} = \text{CrossTrans}(f_{m\_v}, f_m).\)
Thus, the output motion embedding \(f_{m\_s}\) is expected to be aware of the 3D scene. We then feed \(f_{m\_s}\) to the next motion-gaze transformer, where the gaze feature \(f_g\) is the query input:

\(f_{m\_g} = \text{CrossTrans}(f_g, f_{m\_s}).\)
The final motion embedding \(f_{m\_g}\) is expected to integrate both the 3D scene information and the intention clues from the gaze features.
Gaze Feature Fusion. While gaze can help generate intention-aware motion features, the motion could also provide informative guidance to mitigate the randomness of gaze, since not every gaze point reveals meaningful user intent. Therefore, we treat the gaze embedding in a bidirectional manner, i.e., the motion embedding \(f_m\) is also used as the query to update the gaze features such that the network can learn which gaze features contribute more to the future motion:

\(f_{g\_m} = \text{CrossTrans}(f_m, f_g).\)
The bidirectionally fused multi-modal features are then composed into holistic temporal representations of the input to perform human motion prediction. As illustrated in Fig. 3(c), the updated gaze feature \(f_{g\_m}\), the motion feature \(f_{m\_g}\), and the global scene feature \(F_o\) are used to predict the future motion by:

\(X_{t:t+T} = \text{CrossTrans}(h_{position}, \text{cat}(f_{g\_m}, f_{m\_g}, F_o)),\)
where cat denotes the concatenation operation, and \(h_{position}\) is the latent vector that contains temporal positional encodings for the output [14]. We verify the effectiveness of our design in utilizing gaze information through experiments.
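Putting the fusion together, the data flow of the bidirectional scheme can be schematized as below. Here `cross_transformer` is a shape-preserving stub standing in for the learned cross-modal transformer layers, so only the wiring, not the actual computation, reflects the network.

```python
import numpy as np

# Schematic of the bidirectional fusion as we read Fig. 3: each cross-modal
# transformer takes (query, input); the stub below just broadcasts a pooled
# summary so the shapes are right, while real layers are learned.

def cross_transformer(query, inp):
    """Stub: returns a query-length output that 'attends' over `inp`."""
    return query + inp.mean(axis=0, keepdims=True)

t, d = 6, 32
f_m = np.zeros((t, d))      # motion embeddings of the past frames
f_mv = np.zeros((t, d))     # ambient scene context per pose (Eq. 3)
f_g = np.zeros((t, d))      # gaze features queried from the scene
F_o = np.zeros((1, d))      # global scene descriptor

f_ms = cross_transformer(f_mv, f_m)    # motion made scene-aware
f_mg = cross_transformer(f_g, f_ms)    # motion enriched with gaze intent
f_gm = cross_transformer(f_m, f_g)     # gaze denoised by observed motion
fused = np.concatenate([f_gm, f_mg, np.repeat(F_o, t, axis=0)], axis=1)
```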
5 Experiments
In this part, we explain our experimental setup and results. Our goal is to examine the following questions:
1. Does gaze help disambiguate human motion prediction?
2. How do state-of-the-art methods perform on our dataset?
3. What is the contribution of each part of our design to the final performance? Overall, is the proposed architecture effective?
5.1 Experimental Setup
In our experiments, we predict the future motion over 5 s from a 3 s input, where during the first 3 s of a trajectory the subject is about to start an activity, and in the next 5 s the trajectory proceeds to finish it. We set the motion frame rate to 2 fps, i.e., 6 input poses and 10 output poses. Since we aim to explore the effect of gaze in disambiguating motion prediction, high-frequency motion is not necessary. Note that once the waypoints are predicted, a full motion sequence at high fps can be easily generated [51].
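The timing arithmetic above as a quick sanity check (the 96 fps capture rate is from Sect. 3.1; the striding subsampler is our own illustration):

```python
# 3 s of input and 5 s of output at 2 fps give 6 and 10 poses; a high-rate
# capture is subsampled to sparse waypoints by simple striding.

FPS_CAPTURE = 96   # IMU mocap rate (Sect. 3.1)
FPS_MODEL = 2      # sparse waypoint rate used for prediction

n_in = 3 * FPS_MODEL     # 6 input poses
n_out = 5 * FPS_MODEL    # 10 output poses

stride = FPS_CAPTURE // FPS_MODEL          # 48 raw frames per waypoint
raw = list(range(8 * FPS_CAPTURE))         # 8 s of captured frame indices
waypoints = raw[::stride]                  # 16 sparse poses over 8 s
past, future = waypoints[:n_in], waypoints[n_in:n_in + n_out]
```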
Baselines. We implement several state-of-the-art motion prediction and generation baselines, including ST-Transformer [2] and an RNN network [22] for full motion prediction from past motion input, and the transformer-based MultimodalNet [25] for motion synthesis from multi-modal data (i.e., gaze, motion, and the 3D scene features in our experiments). We build our pipeline with 6 cross-modal transformer layers [14]. An L1 loss between the predicted motion and the ground truth is used to train the network. More details about the network architecture and training are available in the supplementary material.
5.2 Evaluation
To evaluate, we divide the 217 trajectories of our dataset into 180 trajectories for training and 37 for testing. The 37 motions consist of 27 trajectories (different from the training ones) performed in known scenes from the training set and 10 in 2 new environments scanned only for evaluation purposes.
Evaluation Metrics. We employ the destination error and the path error as our evaluation metrics. The destination error refers to the global translation error, rotation error, and mean per-joint position error (MPJPE) [13] of the last pose in the predicted motion. The destination pose contains essential information about the subject's goal, which is the primary focus of our experiments. The path error is computed as the mean error of the predicted poses over the 5 s horizon [4]. We compute the global translation and rotation errors as the l1 distance between the predicted SMPL-X translation and orientation parameters and the ground truth [51].
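A hedged sketch of these metrics as we read them (the joint count and the reduction used for the l1 distance are our assumptions, since the text does not pin them down):

```python
import numpy as np

# MPJPE: mean Euclidean distance over joints of the destination pose.
# Translation/orientation errors: l1 distance on SMPL-X global parameters
# (summed over components here; an averaged variant is equally plausible).

def mpjpe(pred_joints, gt_joints):
    """pred_joints, gt_joints: (J, 3) joint positions."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def l1_error(pred_param, gt_param):
    return np.abs(pred_param - gt_param).sum()

gt = np.zeros((23, 3))                     # 23 body joints (Sect. 3.3)
pred = gt + np.array([0.3, 0.0, 0.0])      # every joint offset by 30 cm
err_pose = mpjpe(pred, gt)                 # 0.3
err_trans = l1_error(np.array([0.1, 0.2, 0.3]), np.zeros(3))  # 0.6
```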
Quantitative Evaluation. As shown in Table 3 and Table 4, while the state-of-the-art method based on a spatio-temporal transformer [2] suffers from ambiguities, since its prediction relies only on the past motion, a simple RNN method with motion and gaze input [21] can significantly reduce the ambiguity, indicating the effectiveness of gaze in guiding motion prediction. Our method achieves promising results in predicting reasonable future motion with small destination and translation errors. Compared to MultimodalNet [25], built on vanilla transformers [47], our method is better at recognizing the subject's intent from the gaze and thus predicts more accurate destination poses.
Qualitative Evaluation. Figure 4 shows that in a “going to sit” activity performed in a scene from the training set (top row), our method manages to generate accurate destination poses, i.e., sitting on the sofa. In the new environment, the subject first grabs a blackboard eraser and then starts wiping. While all the methods generate walking actions, our method without gaze input fails to predict the correct motion. When given gaze, MultimodalNet [25] and our method both reach out a hand and try to grab something; our prediction successfully arrives at the destination where the eraser lies, whereas MultimodalNet [25] reaches out to the wrong place. More visualizations and failure cases are included in the supplementary material.
5.3 Ablation Study
In this part, we aim to answer question 3 by finding the factors that contribute to the superior performance of our method.
Variant 1: Gaze. We evaluate the baselines' performance with and without gaze input to explore how gaze influences the motion prediction results. As clearly demonstrated in Table 3 and Table 4, the RNN network [21] and MultimodalNet [25] both gain significant accuracy improvements given gaze inputs. Figure 4 shows that without gaze, our method is confused about the future destination. To gain more intuition about the role of gaze in motion prediction, we visualize the attention weights of the gaze feature query over the motion features, as depicted in Fig. 5. Interestingly, we find that the gaze features do influence the ending poses of the predicted motion, implying that gaze can serve as a strong indicator of the destination of a motion, which reveals the user's intent.
Variant 2: PointNet++ for Scene Feature Query. We propose to use PointNet++ [39] to extract the per-point features of the scene such that the gaze feature and scene-aware motion feature can be obtained (Sect. 4.2). We replace it with PointNet to extract the global scene feature and use a linear layer to get the gaze feature. Results in Table 3 and Table 4 demonstrate that this variant performs well on scenes from the training set but loses its competitiveness when generalizing to new environments with different 3D structures.
Variant 3: Cross-Modal Transformer. The cross-modal transformer architecture has proven effective at bridging multi-modal information [14]. We replace it with the vanilla transformer [48] as used in [25]. Results in Table 3 and Table 4 (Ours (vanilla)) show a loss of accuracy compared to the full design. Note that the path error of this variant on the new scenes is even larger than the results without gaze input, indicating that the vanilla transformer might not be efficient enough to capture the correlations between multi-modal inputs.
6 Discussion and Future Work
We present the GIMO dataset, a real-world dataset with ego-centric images, 3D gazes, 3D scene context, and ground-truth human motions. With the collected dataset, we define a new task, i.e., gaze-informed human motion prediction, and further contribute a novel framework that minimizes the ambiguities in motion prediction by leveraging eye gaze to infer the subject's potential intention. While our method only relies on 3D inputs, future work can incorporate visual information from ego-centric images to further improve accuracy. Besides the proposed task and framework, our dataset can benefit various applications, e.g., intention-aware motion synthesis and gaze-guided ego-centric pose estimation. We believe our work not only opens new directions for motion prediction but also has foreseeable impacts on ego-centric vision topics.
References
Admoni, H., Scassellati, B.: Social eye gaze in human-robot interaction: a review. J. Hum.-Robot Interact. 6(1), 25–63 (2017)
Aksan, E., Kaufmann, M., Cao, P., Hilliges, O.: A spatio-temporal transformer for 3D human motion prediction. In: 2021 International Conference on 3D Vision (3DV), pp. 565–574. IEEE (2021)
Aksan, E., Kaufmann, M., Hilliges, O.: Structured prediction helps 3D human motion modelling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7144–7153 (2019)
Cao, Z., Gao, H., Mangalam, K., Cai, Q.-Z., Vo, M., Malik, J.: Long-term human motion prediction with scene context. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 387–404. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_23
CMU Graphics Lab (2000). http://mocap.cs.cmu.edu/
Duarte, N.F., Raković, M., Tasevski, J., Coco, M.I., Billard, A., Santos-Victor, J.: Action anticipation: reading the intentions of humans and robots. IEEE Robot. Autom. Lett. 3(4), 4132–4139 (2018)
Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for human dynamics. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4346–4354 (2015)
Gottlieb, J., Oudeyer, P.Y., Lopes, M., Baranes, A.: Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends Cogn. Sci. 17(11), 585–593 (2013)
Guzov, V., Mir, A., Sattler, T., Pons-Moll, G.: Human poseitioning system (HPS): 3D human pose estimation and self-localization in large scenes from body-mounted sensors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4318–4329 (2021)
Hassan, M., et al.: Stochastic scene-aware motion prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11374–11384 (2021)
Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3D human pose ambiguities with 3D scene constraints. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2282–2292 (2019)
Hossain, M.R.I., Little, J.J.: Exploiting temporal information for 3D human pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 68–84 (2018)
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)
Jaegle, A., et al.: Perceiver IO: a general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795 (2021)
Jiang, H., Grauman, K.: Seeing invisible poses: estimating 3D body pose from egocentric video. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3501–3509. IEEE (2017)
Joo, H., et al.: Panoptic studio: a massively multiview system for social motion capture. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3334–3342 (2015)
Joo, H., Simon, T., Sheikh, Y.: Total capture: a 3D deformation model for tracking faces, hands, and bodies. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8320–8329 (2018)
Kocabas, M., Athanasiou, N., Black, M.J.: Vibe: video inference for human body pose and shape estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5253–5263 (2020)
Kocabas, M., Huang, C.H.P., Hilliges, O., Black, M.J.: Pare: part attention regressor for 3D human body estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11127–11137 (2021)
Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2252–2261 (2019)
Kratzer, P., Bihlmaier, S., Midlagajni, N.B., Prakash, R., Toussaint, M., Mainprice, J.: Mogaze: a dataset of full-body motions that includes workspace geometry and eye-gaze. IEEE Robot. Autom. Lett. 6(2), 367–373 (2020)
Kratzer, P., Toussaint, M., Mainprice, J.: Prediction of human full-body movements with motion optimization and recurrent neural networks. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 1792–1798 (2020)
Li, J., et al.: Task-generic hierarchical human motion prior using VAEs. In: 2021 International Conference on 3D Vision (3DV), pp. 771–781. IEEE (2021)
Li, J., et al.: Learning to generate diverse dance motions with transformer. arXiv preprint arXiv:2008.08171 (2020)
Li, R., Yang, S., Ross, D.A., Kanazawa, A.: AI choreographer: music conditioned 3D dance generation with AIST++. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13401–13412 (2021)
Li, Y., Liu, M., Rehg, J.: In the eye of the beholder: gaze and actions in first person video. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
Li, Z., Zhou, Y., Xiao, S., He, C., Huang, Z., Li, H.: Auto-conditioned recurrent networks for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363 (2017)
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. (TOG) 34(6), 1–16 (2015)
Luo, Z., Golestaneh, S.A., Kitani, K.M.: 3D human motion estimation via motion compression and refinement. In: Proceedings of the Asian Conference on Computer Vision (2020)
Luo, Z., Hachiuma, R., Yuan, Y., Kitani, K.: Dynamics-regulated kinematic policy for egocentric pose estimation. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: archive of motion capture as surface shapes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5442–5451 (2019)
Mao, W., Liu, M., Salzmann, M.: History repeats itself: human motion prediction via motion attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 474–489. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_28
von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 601–617 (2018)
Martinez, J., Black, M.J., Romero, J.: On human motion prediction using recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2891–2900 (2017)
Martínez-González, A., Villamizar, M., Odobez, J.M.: Pose transformers (POTR): human motion prediction with non-autoregressive transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2276–2284 (2021)
Ng, E., Xiang, D., Joo, H., Grauman, K.: You2me: inferring body pose in egocentric video via first and second person interactions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9890–9900 (2020)
Pavlakos, G., et al.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10975–10985 (2019)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: Humor: 3D human motion model for robust pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11488–11499 (2021)
Rhodin, H., et al.: Egocap: egocentric marker-less motion capture with two fisheye cameras. ACM Trans. Graph. (TOG) 35(6), 1–11 (2016)
Tatler, B.W., Hayhoe, M.M., Land, M.F., Ballard, D.H.: Eye guidance in natural vision: reinterpreting salience. J. Vis. 11(5) (2011)
Tian, Y., Zhang, H., Liu, Y., Wang, L.: Recovering 3D human mesh from monocular images: a survey. arXiv preprint arXiv:2203.01923 (2022)
Tome, D., et al.: Selfpose: 3D egocentric pose estimation from a headset mounted camera. arXiv preprint arXiv:2011.01519 (2020)
Tome, D., Peluse, P., Agapito, L., Badino, H.: xR-EgoPose: egocentric 3D human pose from an HMD camera. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7728–7738 (2019)
Ungureanu, D., et al.: Hololens 2 research mode as a tool for computer vision research. arXiv preprint arXiv:2008.11239 (2020)
Valle-Pérez, G., Henter, G.E., Beskow, J., Holzapfel, A., Oudeyer, P.Y., Alexanderson, S.: Transflower: probabilistic autoregressive dance generation with multimodal attention. ACM Trans. Graph. (TOG) 40(6), 1–14 (2021)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Von Marcard, T., Pons-Moll, G., Rosenhahn, B.: Human pose estimation from video and IMUs. IEEE Trans. Pattern Anal. Mach. Intell. 38(8), 1533–1547 (2016)
Wang, J., Liu, L., Xu, W., Sarkar, K., Theobalt, C.: Estimating egocentric 3D human pose in global space. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11500–11509 (2021)
Wang, J., Xu, H., Xu, J., Liu, S., Wang, X.: Synthesizing long-term 3D human motion and interaction in 3D scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9401–9411 (2021)
Wei, P., Liu, Y., Shu, T., Zheng, N., Zhu, S.C.: Where and why are they looking? Jointly inferring human attention and intentions in complex tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6801–6809 (2018)
Xu, W., et al.: Mo2cap2: real-time mobile 3D motion capture with a cap-mounted fisheye camera. IEEE Trans. Visual Comput. Graphics 25(5), 2093–2101 (2019)
Yuan, Y., Kitani, K.: 3D ego-pose estimation via imitation learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 735–750 (2018)
Yuan, Y., Kitani, K.: Ego-pose estimation and forecasting as real-time PD control. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10082–10092 (2019)
Zhang, H., et al.: PyMAF: 3D human pose and shape regression with pyramidal mesh alignment feedback loop. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
Zhang, S., et al.: Egobody: human body shape, motion and social interactions from head-mounted devices. arXiv preprint arXiv:2112.07642 (2021)
Zhang, S., Zhang, Y., Bogo, F., Pollefeys, M., Tang, S.: Learning motion priors for 4D human body capture in 3D scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11343–11353 (2021)
Zhang, S., Zhang, Y., Ma, Q., Black, M.J., Tang, S.: Place: proximity learning of articulation and contact in 3D environments. In: 2020 International Conference on 3D Vision (3DV), pp. 642–651. IEEE (2020)
Zhang, Y., Black, M.J., Tang, S.: We are more than our joints: predicting how 3D bodies move. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3372–3382 (2021)
Zhang, Y., Hassan, M., Neumann, H., Black, M.J., Tang, S.: Generating 3D people in scenes without people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6194–6204 (2020)
Zhang, Y., Tang, S.: The wanderings of odysseus in 3D scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20481–20491 (2022)
Acknowledgments
The authors are supported by a grant from the Stanford HAI Institute, a Vannevar Bush Faculty Fellowship, a gift from the Amazon Research Awards program, and NSFC grants No. 62125107 and No. 62171255. The Toyota Research Institute also provided funds to support this work.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zheng, Y. et al. (2022). GIMO: Gaze-Informed Human Motion Prediction in Context. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_39
DOI: https://doi.org/10.1007/978-3-031-19778-9_39
Print ISBN: 978-3-031-19777-2
Online ISBN: 978-3-031-19778-9