1 Introduction

A large portion of the human cerebral cortex is devoted to processing visual signals collected by the optic nerve, and over half of the nerve fibers carry information from the fovea, which is responsible for sharp central vision. When modulated through foveal fixation, or equivalently, eye gaze, the fine details perceived with the fovea provide important sensory input that can inform the future actions of the human agent [8, 42]. As shown in Fig. 1, a human agent intending to perform two different tasks exhibits distinctive gaze patterns, even though the first few moves are hardly distinguishable. Hence, it is beneficial to employ eye gaze when predicting human motion in a 3D scene, which is of great importance for human-machine interaction [1, 6]. For example, a human agent wearing an AR/VR headset may approach a chair to sit on it or to grab a cup on the table behind it. If the latter is true, we may want the headset to issue a collision-avoidance warning based on the forecasted future. To resolve such ambiguities for reliable human motion prediction, there is increasing interest in leveraging eye gaze, as it is highly correlated with the user intent that motivates the subsequent actions.

Fig. 1. Human motion driven by different intents looks similar at the beginning. However, the scanning patterns of the eye gaze (red dots) during the starting phase are quite distinctive, which suggests that we can leverage eye gaze to reduce uncertainties when predicting future body movements. (Color figure online)

The key to understanding the role of gaze and how it can effectively inform human motion prediction is twofold. First, it is critical to have a dataset with high-quality 3D body pose annotations and corresponding eye gaze. Beyond data quality, the 3D scenes and motion dynamics should be diverse enough to enable meaningful learning and evaluation of the gain obtained when eye gaze is incorporated. Second, it is crucial to have a network architecture that can efficiently utilize sparse eye gaze during prediction, given the multi-modal setting (e.g., gaze, human motion, and scene geometry) and the fact that not every gaze point carries the same significance regarding the agent’s intent (e.g., one may get distracted by a salient object in the scene that has nothing to do with the task at hand).

However, most existing human motion datasets do not support evaluating the effect of eye gaze due to the lack of ego-centric data annotated with both gaze and 3D body pose within the same scene. Recently, a few datasets have been proposed for ego-centric social interaction and object manipulation in which gaze and the viewer’s 3D poses are available. Nevertheless, they are not suitable for ego-centric human motion prediction since the diversity of scenes and the variation in motion dynamics are very limited. To validate the benefits of eye gaze in human motion prediction, we propose a large-scale ego-centric dataset, which contains the scene context, eye gaze, and accurate 3D body poses of the human actors. By employing an advanced motion capture system based on Inertial Measurement Units (IMUs), we can collect 3D pose data with high fidelity and avoid the limits of conventional multi-camera systems. For example, the actor can walk through any environment without performing a cumbersome setup of motion capture devices. Moreover, accurate poses can be recorded without any 2D-3D lifting, which could induce errors due to occlusions and noisy detections. These advantages enable the actors to perform various long-horizon activities in a diverse set of daily living environments.

To assess the effectiveness of eye gaze in improving human motion prediction, we perform an extensive study with multiple state-of-the-art architectures. However, we note that gaze and motion could both be inherently ambiguous in forecasting future movements. For example, the gaze may be allocated to a TV monitor while walking towards the dining table. In this case, the actor may simply follow the momentum, thus rendering the eye gaze uninformative about the body motion. To fully utilize the potential of eye gaze, we propose a novel architecture with cross-modal attention such that not only can future motion benefit from the eye gaze, but the significance of gaze in predicting the future can also be reinforced by the observed motion. With eye gaze, we observe better human motion predictions across various architectures. Furthermore, the proposed architecture achieves top performance measured under different criteria, verifying the effectiveness of our bidirectional fusion scheme.

In summary, we make the following contributions. First, we provide a large-scale human motion dataset that enables investigating the benefits of eye gaze under diverse scenes and motion dynamics. Second, we propose a novel architecture with a bidirectional multi-modal fusion that better suits gaze-informed human motion prediction through mutual disambiguation between motion and gaze. Finally, we validate the usefulness of eye gaze for human motion prediction with multiple architectures and verify the effectiveness of the proposed architecture by showing top performance on the proposed benchmark.

2 Related Work

Datasets for Human Motions. Human motion modeling is a long-standing problem and is extensively explored with high-quality motion capture datasets, ranging from the small-scale CMU Graphics Lab Motion Capture Database [5] to large-scale ones like AMASS [31]. Human3.6M [13] captures high-quality motions using a multi-view camera system and serves as a standard benchmark for motion prediction and 3D pose estimation. While these datasets provide adequate data to learn motion dynamics, the constraints from the 3D environment are usually not included. Later, more datasets containing the 3D scene were proposed, and scene-aware motion prediction can be studied using the GTA-1M dataset [4]. PROX [11] includes both 3D scenes and human interaction motions, which can be used to explore the scene-aware motion generation task [51] and the problem of placing humans in the scene [60, 62]. As the data is always collected with a human agent, ego-centric videos are provided in EgoPose [54, 55], Kinpoly [30], and HPS [9] to study how motion estimation and prediction can benefit from these ego-centric observations. Moreover, social interaction is considered in You2Me [36] and EgoBody [57]. However, existing datasets do not contain diverse 3D scenes and intentional human motions. We therefore collect a large-scale dataset for gaze-guided human motion prediction, which consists of high-quality human motions, 3D scenes, ego-centric videos, and the corresponding eye gaze information.

Human Motion Prediction. RNNs have proven successful in modeling human motion dynamics [3, 7, 27, 34, 61]. [32] proposes an attention-based model to guide the future prediction with motion history. To effectively exploit both spatial and temporal dependencies in human pose sequences, ST-Transformer [2] designs a spatial-temporal transformer architecture to model human motions. Pose Transformers [35] investigates a non-autoregressive formulation using a transformer model and shows superior performance in terms of both efficiency and accuracy. As human motions are tightly correlated with the scene context, scene-aware motion prediction is also actively studied [4, 10, 63]. A three-stage pipeline is established to predict long-term human motions conditioned on the scene context [4]. SAMP [10] further includes object geometry to estimate interaction positions and orientations, and generates motions following a calculated collision-free trajectory. Besides the scene constraints, other modalities such as gaze and music also provide clues for future motion prediction. Transformers [48] are applied to generate dance movements conditioned on music [24, 25, 47]. MoGaze [21] verifies the effectiveness of eye gaze information for motion prediction with an RNN model in a full-body manipulation scenario. Our work aims to predict long-term future motions with both 3D scene and gaze constraints. We differ from existing motion prediction works in that they focus on dense motion prediction, whereas we predict long-term sparse motions to understand human intentions.

Human Motion Estimation. 3D pose estimation is extensively studied in third-person view images or videos [12, 18,19,20, 29, 43, 56, 58]. VIBE [18] proposes a sequential model to estimate human poses and shapes from videos, along with a motion discriminator to constrain the predictions to a plausible motion manifold. TCMR [12] explicitly enforces the neural nets to leverage past and future frames to eliminate jitter in predictions. Motion priors are found to be effective in improving the temporal smoothness and tackling occlusion issues [23, 40, 59]. Ego-centric pose estimation has received more attention recently. Pose estimation from images captured using a fisheye camera is explored in [41, 44, 45, 50, 53]. [15] deploys a chest-mounted camera and predicts motions based on an implicit motion graph. Following the chest-mounted camera setting, You2Me [36] introduces the motions of a visible second person as an additional signal to constrain the motion estimation of the camera wearer. [30, 54, 55] explore motion estimation and prediction with a head-mounted, front-facing camera. In this work, we address the ego-centric motion prediction task where past motions are given. Our proposed dataset can also benefit the ego-centric motion estimation problem.

Fig. 2. We collect human motion data in various indoor environments (1st, 2nd rows), allowing the human subject to perform a diverse range of daily activities exhibiting rich dynamics (bottom). Top-right: motion and gaze capture devices.

3 GIMO Dataset: Gaze and Motion with Scene Context

Human motion is affected by both the scene, which imposes physical constraints, and the agent’s psychological demands, which drive the body movements. To concretely assess the benefits induced by eye gaze, we need both ego-centric views and 3D body poses of the agent. In particular, they should be temporally synchronized and spatially aligned within the 3D scenes. Current datasets for human motion prediction are either collected in virtual environments, at the risk of being unrealistic, or captured by an array of cameras with limited scene diversity and motion dynamics. Moreover, eye gaze is usually not available.

Therefore, we propose a real-world large-scale dataset that provides high-quality human motions, ego-centric views with eye gaze, as well as 3D environments. Next, we describe our data collection pipeline.

3.1 Hardware Setup

We employ a commercial IMU-based motion capture system to record high-quality 3D body poses of the human agent, whose 3D eye gaze is detected using an AR device mounted on the head. The 3D scenes are scanned by a smartphone equipped with a LiDAR sensor (please see Fig. 2, top-right).

Motion Capture. To capture daily activities in various indoor environments, we resort to motion capture from IMU signals following HPS [9]. While HPS only provides SMPL [28] models with body movements, we take advantage of an advanced commercial product that records the subject’s 3D body and hand joint movements at 96 fps. To obtain the full-body pose and hand gesture of the subject, we fit the SMPL-X [37] model to the recorded IMU signals from multiple joints. Compared to human motion datasets like PROX [11], where the 3D body pose is estimated from monocular RGB videos, the poses obtained with the above procedure are free from estimation errors caused by noisy detection and occlusions. Fitting parametric human body models to poses from multi-view RGB(D) streams or marker-based systems is also commonly used to collect human motion data [13, 17, 57]; however, our pipeline requires much less effort in presetting the environment, so we can collect human motion data in any indoor scene. These characteristics give us the capability to ensure the diversity of the scenes and motion dynamics in our dataset.

Table 1. Statistics of existing and our datasets. \(^*\) means virtual 3D scenes, e.g., from game engine [4] or CAD models [10]. Ego denotes egocentric images are available, and Intent indicates whether the motions have clear intentions, e.g., fetching a book.

Gaze Capture. Following [57], we use Hololens2 and its Research Mode API [46] to capture the 3D eye gaze. It also records ego-centric video at 30 fps at \(760\times 428\) resolution, long-throw depth streams at 1–5 fps at \(512\times 512\), and the 6D poses of the head-mounted camera. The 3D scene is reconstructed through TSDF fusion of the recorded depth, which is used for the subsequent global alignment. The eye gaze is recorded as a 3D point in the coordinate system of the headset.

3D Scene Acquisition. To obtain high-quality 3D geometry of the scene (the TSDF reconstruction from Hololens2 is usually noisy), we use an iPhone 13 Pro Max equipped with a LiDAR sensor to scan the environment with the 3D Scanner App. The output mesh contains about 500k vertices with photorealistic texture, providing sufficient detail to infer the affordances of the scene. The data collection process involving human agents and the alignment of the different coordinate frames to the scanned meshes are described in the following.

3.2 Data Collection with Human in Action

One distinct feature of our dataset is that it captures long-term motions with clear intentions. Different from prior datasets collected for motion estimation purposes, where the subjects perform random actions such as jumping and waving hands, we aim at collecting motion trajectories with semantic meaning, e.g., walking to open a door. Thus, we focus on collecting data from various daily activities in indoor scenes. The full statistics of our dataset are listed in Table 1.

Table 2. Activities performed by our subjects.

To this end, we recruit 11 university students (4 female and 7 male) and ask them to perform the activities defined in Table 2. The subjects are instructed to start from a location distant from the goal object and then move to the destination to act, so that long-term motion with clear intention can be obtained. Specifically, the collection process includes the following steps: (i) the subject wears the head-mounted Hololens2 and the IMU-based motion capture suit and gloves, and calibration is performed to set up the motion capture system; (ii) the subject chooses actions from the activities in Table 2 according to the affordances of the scene; (iii) the 3D scene is scanned; (iv) the subject carries out the planned activities in the scene while data are collected; (v) the scene is reset for the following subjects to perform their activities. Note that if a subject changes the scene geometry, we reset the objects to their original states to avoid rescanning the whole environment.

As a result, our dataset contains 129k ego-centric images, 11 subjects, and 217 motion trajectories in 19 scenes, providing sufficient capacity and diversity for gaze-informed human motion prediction. As illustrated in Fig. 2, the motions are smooth and convey clear semantic intentions.

3.3 Data Preparation

Synchronization. Due to compatibility issues, it is difficult to synchronize the motion capture system with Hololens2 without modifying their commercial software. Instead, we use a hand gesture that can be observed in the ego-centric view as a starting signal. Once the pose and the ego-centric image of the hand gesture are aligned, the remaining frames can be synchronized according to their timestamps.
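To make the timestamp matching concrete, the following is a minimal sketch, assuming the two streams are available as monotonically increasing per-frame timestamp arrays and the gesture frame has already been identified in each; all names are illustrative rather than part of our released tooling.

```python
import numpy as np

def synchronize_streams(mocap_ts, holo_ts, mocap_anchor, holo_anchor):
    """Match each HoloLens frame to the nearest mocap frame after aligning both
    timelines at the hand-gesture anchor frame.

    mocap_ts, holo_ts: 1-D arrays of monotonically increasing timestamps (seconds).
    mocap_anchor, holo_anchor: indices of the gesture frame in each stream.
    Returns, for every HoloLens frame, the index of the closest mocap frame.
    """
    # Shift both timelines so that the anchor frames coincide at t = 0.
    mocap_rel = mocap_ts - mocap_ts[mocap_anchor]
    holo_rel = holo_ts - holo_ts[holo_anchor]
    # For every HoloLens timestamp, inspect the neighboring mocap timestamps and keep the closer one.
    idx = np.clip(np.searchsorted(mocap_rel, holo_rel), 1, len(mocap_rel) - 1)
    prev_closer = np.abs(holo_rel - mocap_rel[idx - 1]) <= np.abs(holo_rel - mocap_rel[idx])
    return np.where(prev_closer, idx - 1, idx)
```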

Parametric Model Fitting. To obtain the 3D body pose and shape of the subject, we fit the SMPL-X [37] model to the 3D joints (23 body joints, 15 left-hand joints, and 15 right-hand joints) computed from the recorded IMU signals by the provided commercial software. In addition, the 6D head pose is used to determine the head position and orientation of the SMPL-X model.
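Below is a hedged sketch of such a fitting step using the public smplx package, assuming the IMU joints have been mapped to SMPL-X joint indices via a `joint_map`; the mapping, learning rate, and iteration count are illustrative assumptions, not our exact optimization.

```python
import torch
import smplx

def fit_smplx_to_joints(target_joints, joint_map, model_path, num_iters=200, lr=0.05):
    """Optimize SMPL-X parameters so that selected model joints match the 53
    IMU-derived 3D joints. target_joints: (53, 3) tensor; joint_map: LongTensor of
    the corresponding SMPL-X joint indices (the correspondence is assumed known)."""
    model = smplx.create(model_path, model_type='smplx', use_pca=False)
    params = [model.transl, model.global_orient, model.body_pose,
              model.left_hand_pose, model.right_hand_pose, model.betas]
    for p in params:
        p.requires_grad_(True)
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(num_iters):
        opt.zero_grad()
        joints = model().joints[0]                       # (J, 3) joints of the posed model
        loss = ((joints[joint_map] - target_joints) ** 2).sum()
        loss.backward()
        opt.step()
    return model  # holds the fitted translation, orientation, body/hand poses, and shape
```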

Alignment. The Hololens2 coordinate system and the fitted SMPL-X models need to be aligned with the high-quality 3D scene scans. The former is aligned through ICP between the TSDF fusion result of the depth recorded by Hololens2 and the 3D scene. The SMPL-X motion sequence is first transformed into the Hololens2 coordinate system via human annotations: the start and end shapes of the human body are scanned by Hololens2 and visible in the TSDF reconstruction, and they serve as anchor shapes for aligning the fitted models. The poses can then be aligned to the 3D scene using the global transformation obtained from the previous ICP alignment between the scene scans. We name our dataset GIMO and describe our method for gaze-informed motion prediction in the following.
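As an illustration of the first alignment step, the following sketch runs point-to-point ICP with Open3D between the HoloLens TSDF reconstruction and the LiDAR scan; the file paths, voxel size, correspondence threshold, and initial transform are assumptions made for the example.

```python
import numpy as np
import open3d as o3d

def align_hololens_to_scan(tsdf_pcd_path, scan_pcd_path, init_T=np.eye(4)):
    """Estimate the rigid transform from the HoloLens (TSDF) frame to the LiDAR scan
    frame with point-to-point ICP. Paths, voxel size, and threshold are placeholders."""
    source = o3d.io.read_point_cloud(tsdf_pcd_path)   # TSDF fusion of HoloLens depth
    target = o3d.io.read_point_cloud(scan_pcd_path)   # high-quality iPhone LiDAR scan
    source = source.voxel_down_sample(voxel_size=0.02)
    target = target.voxel_down_sample(voxel_size=0.02)
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_correspondence_distance=0.05, init=init_T,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 matrix to apply to HoloLens-frame poses and gaze
```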

Fig. 3. Our gaze-informed human motion prediction architecture. Multi-modal features, i.e., gaze feature, human motion feature, and global scene feature, are extracted and then fused through the proposed bidirectional fusion scheme (a). The fused features are then stacked into a holistic representation and used for future motion prediction (c). The cross-modal transformer component [14] is illustrated in (b). Please refer to Sect. 4 for more details.

4 Gaze-Informed Human Motion Prediction

Gaze conveys relevant information about the subject’s intent, which can be used to enhance long-horizon human motion prediction. On the other hand, past motions [2, 4], ego-centric views [10, 55], or the 3D context [10, 51] can provide helpful constraints on human motion; yet the prediction remains challenging and suffers from uncertainty about the future. Here, we aim at gaze-informed long-term human motion prediction. Specifically, given the past motion, the 3D scene, and 3D eye gaze as inputs, we study how they can be integrated to resolve the ambiguities in future motion and generate intention-aware motion predictions.

To fully utilize the geometric information provided by the 3D scene and the intention cues from past motions and gaze, we propose a novel framework with a bidirectional fusion scheme that facilitates communication between the different modalities. As shown in Fig. 3, we use PointNet++ [39] as the encoding backbone to extract per-point features of the 3D scene, followed by several cross-modal transformers to exchange information across the multi-modal embeddings.

4.1 Problem Definition

We represent a motion sample as a parametric sequence \(X_{i:j}=\{x_i, x_{i+1}, \cdots , x_{j}\}\) where \(x_{k}=(t_k, r_k, h_k, \beta _{k}, p_{k})\) is a pose frame at time k. Here \(t\in R^3\) is the global translation, \(r\in SO(3)\) denotes the global orientation, \(h\in R^{32}\) refers to the body pose embedding, \(\beta \in R^{10}\) is the shape parameter, and \(p\in R^{24}\) is the hand pose, from which the SMPL-X body mesh \(M=\mathcal {M}(t_k, r_k, h_k, \beta _{k}, p_{k})\) can be obtained using VPoser [37]. The 3D scene is represented as a point cloud \(S\in R^{n\times 3}\), and the 3D gaze point \(g\in R^3\) is defined as the intersection point between the gaze direction and the scene. Thus, given a motion sequence \(X_{1:t}\) along with the corresponding 3D gaze \(G_{1:t}=\{g_1, g_2, \cdots , g_t\}\) and the 3D scene S, we aim to predict the future motion \(X_{t:t+T}={\varPhi }(X_{1:t}, G_{1:t}, S|\theta )\) where \(\theta \) represents the network parameters.
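For concreteness, a minimal sketch of this parameterization and the prediction interface is given below; the flattened 72-dimensional layout (3 + 3 + 32 + 10 + 24) and the function names are illustrative assumptions rather than our exact implementation.

```python
from dataclasses import dataclass
import torch

@dataclass
class PoseFrame:
    """One pose frame x_k = (t_k, r_k, h_k, beta_k, p_k)."""
    trans: torch.Tensor       # (3,)  global translation t
    orient: torch.Tensor      # (3,)  global orientation r (axis-angle for SO(3))
    body_embed: torch.Tensor  # (32,) VPoser body pose embedding h
    betas: torch.Tensor       # (10,) shape parameters beta
    hand_pose: torch.Tensor   # (24,) hand pose p

def predict_future(model, past_poses, past_gaze, scene_points):
    """past_poses: (t, 72) flattened frames x_1..x_t, past_gaze: (t, 3) gaze points g,
    scene_points: (n, 3) scene point cloud S. Returns a (T, 72) tensor for the future frames."""
    return model(past_poses, past_gaze, scene_points)
```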

4.2 Multi-modal Feature Extraction

Instead of extracting the multi-modal embeddings independently [25], we propose a novel scheme to integrate the motion, gaze, and scene features. The gist is to let the motion and gaze features communicate to each other, so their uncertainties regarding the future can be mutually decreased, resulting in more effective utilization of the gaze information.

Scene Feature Extraction. To learn the constraints from the 3D scene and guide the network to pay attention to local geometric structures, we apply PointNet++ to extract both global and local scene features. Specifically, we obtain the per-point feature map and a global descriptor of the scene as follows:

$$\begin{aligned} F_P, F_o = {\varPhi }_{scene}(S|\theta _{s}) \end{aligned}$$
(1)

where \(S\in R^{n\times 3}\) is the input point cloud, \(F_P\in R^{n\times d_p}\) is the per-point \(d_p\)-dimensional feature map, and \(F_o\in R^{d_o}\) is the global descriptor of the scene. Given the per-point feature \(F_P\), the feature of an arbitrary point e can be computed through inverse distance weighted interpolation [39]:

$$\begin{aligned} F_{P|e} = \frac{{\varSigma }_{i=1}^{n_e}w_iF_{P|p_i}}{{\varSigma }_{i=1}^{n_e}w_i}, w_i = \frac{1}{||p_{i} - e||_2} \end{aligned}$$
(2)

where \(\{p_{1}, p_{2}, \cdots , p_{n_e}\}\) are the nearest neighbors of e in the scene point cloud.
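A minimal PyTorch sketch of this interpolation (Eq. 2), using the k nearest neighbors of the query point, is given below; the choice of k and the small epsilon are assumptions.

```python
import torch

def interpolate_point_feature(query, scene_xyz, scene_feat, k=3, eps=1e-8):
    """Inverse-distance-weighted interpolation of per-point scene features (Eq. 2).

    query: (3,) query point e; scene_xyz: (n, 3) scene points; scene_feat: (n, d_p).
    Returns the (d_p,) interpolated feature F_{P|e}.
    """
    dist = torch.norm(scene_xyz - query, dim=-1)             # distances ||p_i - e||
    knn_dist, knn_idx = torch.topk(dist, k, largest=False)   # k nearest neighbors of e
    w = 1.0 / (knn_dist + eps)                                # weights w_i
    return (w.unsqueeze(-1) * scene_feat[knn_idx]).sum(0) / w.sum()
```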

Gaze Feature Extraction. We query the gaze point feature \(f_g\) from the per-point scene feature map \(F_P\) according to Eq. 2, i.e., \(f_g=F_{P|g}\). Thus, the interpolated gaze feature contains relevant scene information that provides cues to infer the subject’s intention.

Motion Feature Extraction. A linear layer is used to extract the motion embedding \(f_m\) from the input motion parameter x. To endow the embedding with awareness of the 3D scene, we further query the scene features of the SMPL-X vertices using Eq. 2. These SMPL-X per-vertex features are then fed to PointNet [38] to get the ambient scene context feature \(f_{m\_v}\) of the current motion pose:

$$\begin{aligned} f_{m\_v} = PointNet(\{F_{P|v}, v\in \mathcal {M}(x)\}) \end{aligned}$$
(3)

where \(\mathcal {M}(x)\) is the SMPL-X vertex set with motion parameter x.
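The following is a hedged sketch of this step: the per-vertex scene features are concatenated with the vertex coordinates (an assumption; Eq. 3 specifies only the features) and passed through a PointNet-style shared MLP with max pooling.

```python
import torch
import torch.nn as nn

class AmbientSceneEncoder(nn.Module):
    """PointNet-style encoder (shared MLP + max pooling) over the scene features
    queried at the SMPL-X vertices of the current pose (cf. Eq. 3)."""
    def __init__(self, d_p, d_out):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_p + 3, 128), nn.ReLU(),
                                 nn.Linear(128, d_out))

    def forward(self, vert_feat, verts):
        # vert_feat: (V, d_p) interpolated features F_{P|v}; verts: (V, 3) SMPL-X vertices.
        x = torch.cat([vert_feat, verts], dim=-1)
        return self.mlp(x).max(dim=0).values  # (d_out,) ambient context feature f_{m_v}
```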

4.3 Attention-Aware Multi-modal Feature Fusion

Given the multi-modal nature of the gaze, scene, and motion, an efficient feature fusion module is necessary to leverage the information from different modalities. Instead of directly concatenating the features [25], we propose a more effective scheme by deploying a cross-modal transformer [14] to fuse the gaze, motion, and scene features (Fig. 3). We explain our design in the following.

Cross-Modal Transformer. The cross-modal transformer [14] is used to capture the correlations between input embedding sequences and to establish communication between the modalities. It is largely based on the attention mechanism [48]. An attention function [14] maps a query and key-value pairs to an output as:

$$\begin{aligned} Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_K}})V, Q=qW_q, K=kW_k, V=vW_v \end{aligned}$$
(4)

where \(q\in R^{l_q\times d_q}\), \(k\in R^{l_{kv}\times d_k}\), \(v\in R^{l_{kv}\times d_v}\) are the input query, key, and value vectors, and \(W_q\in R^{d_q\times d_K}\), \(W_k\in R^{d_k\times d_K}\), \(W_v\in R^{d_v\times d_V}\) embed the inputs. Here d denotes the dimension of the input vectors and l the sequence length.

As shown in Fig. 3(b), the cross-modal transformer is built on a stack of attention layers, which maps a \(t_i\)-length input into a \(t_q\)-length output by querying a \(t_q\)-length feature:

$$\begin{aligned} \phi _{out} = cross\_trans(\phi _{query}, \phi _{input}) \end{aligned}$$
(5)

It has proven to be efficient in processing multi-modal signals, e.g., text and audio.
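A single cross-attention layer implementing Eqs. 4 and 5 can be sketched as follows; this is a minimal, single-head version without the feed-forward and normalization layers of the full cross-modal transformer [14].

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Single-head cross-attention: queries from one modality, keys/values from another."""
    def __init__(self, d_q, d_kv, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_q, d_model, bias=False)
        self.W_k = nn.Linear(d_kv, d_model, bias=False)
        self.W_v = nn.Linear(d_kv, d_model, bias=False)

    def forward(self, query_seq, input_seq):
        # query_seq: (l_q, d_q), input_seq: (l_kv, d_kv) -> output: (l_q, d_model), cf. Eq. 5.
        Q, K, V = self.W_q(query_seq), self.W_k(input_seq), self.W_v(input_seq)
        attn = F.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # Eq. 4
        return attn @ V
```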

Motion Feature Fusion. The motion feature should be aware of the 3D scene context and the subject’s intention inferred from the gaze information, so that it can guide the prediction network to generate more reasonable motion trajectories (e.g., free from penetration and collision) and accurate estimations of the ending position or pose of the subject. For this purpose, we first use the scene context feature \(f_{m\_v}\) acquired from the ambient 3D environment (Eq. 3) as the query to update the motion feature \(f_m\) through a motion-scene transformer:

$$\begin{aligned} f_{m\_s} = cross\_trans(f_{m\_v}, f_m) \end{aligned}$$
(6)

Thus, the output motion embedding \(f_{m\_s}\) is expected to be aware of the 3D scene. We then feed \(f_{m\_s}\) to the next motion-gaze transformer where the gaze feature \(f_g\) is the query input:

$$\begin{aligned} f_{m\_g} = cross\_trans(f_g, f_{m\_s}) \end{aligned}$$
(7)

The final motion embedding \(f_{m\_g}\) is expected to integrate both the 3D scene information and the intention clues from the gaze features.

Fig. 4. Qualitative results. Top row: results on a known scene from the training set. Bottom row: results in a new environment. We compare our method with MultimodalNet [25] and ours without gaze. Please zoom in for details.

Gaze Feature Fusion. While gaze can help generate intention-aware motion features, the motion could also provide informative guidance to mitigate the randomness of gaze since not every gaze point reveals meaningful user intent. Therefore, we treat the gaze embedding in a bidirectional manner, i.e., the motion embedding \(f_m\) is also used as the query to update the gaze features such that the network can learn which gaze features contribute more to the future motion:

$$\begin{aligned} f_{g\_m} = cross\_trans(f_m, f_g) \end{aligned}$$
(8)

The bidirectionally fused multi-modal features are then composed into a holistic temporal representation of the input to perform human motion prediction. As illustrated in Fig. 3(c), the updated gaze feature \(f_{g\_m}\), motion feature \(f_{m\_g}\), and the global scene feature \(F_o\) are used to predict the future motion by:

$$\begin{aligned} X_{t:t+T} = cross\_trans(h_{position}, cat(f_{g\_m}, f_{m\_g}, F_o)_{1:t}) \end{aligned}$$
(9)

where cat denotes the concatenation operation, and \(h_{position}\) is the latent vector that contains temporal positional encodings for the output [14]. We verify the effectiveness of our design in utilizing gaze information through experiments.
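Putting Eqs. 6–9 together, the fusion stage can be sketched as below, under the simplifying assumption of a single shared cross_trans callable for readability; in the actual architecture each fusion step is a separate cross-modal transformer with its own parameters and dimensions.

```python
import torch

def bidirectional_fusion(cross_trans, f_m, f_m_v, f_g, F_o, h_position):
    """f_m, f_m_v, f_g: (t, d) per-frame motion, ambient-scene, and gaze features;
    F_o: (d_o,) global scene descriptor; h_position: (T, d') output positional encodings."""
    f_m_s = cross_trans(f_m_v, f_m)   # Eq. 6: scene-aware motion feature
    f_m_g = cross_trans(f_g, f_m_s)   # Eq. 7: gaze-informed motion feature
    f_g_m = cross_trans(f_m, f_g)     # Eq. 8: motion-refined gaze feature
    # Eq. 9: stack the fused features with the broadcast global scene descriptor and
    # decode the future motion by querying with the output positional encodings.
    fused = torch.cat([f_g_m, f_m_g, F_o.expand(f_m.shape[0], -1)], dim=-1)
    return cross_trans(h_position, fused)
```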

5 Experiments

In this part, we explain our experimental setup and results. Our goal is to examine the following questions:

1. Does gaze help disambiguate human motion prediction?
2. How do state-of-the-art methods perform on our dataset?
3. What is the contribution of each part of our design to the final performance? Overall, is the proposed architecture effective?

5.1 Experimental Setup

In our experiments, we predict the future motion over 5 s from a 3 s input, where the first 3 s of a trajectory covers the moment just before an activity starts, and the trajectory proceeds to finish the activity in the following 5 s. We set the motion frame rate to 2 fps, i.e., 6 input poses and 10 output poses. Since we aim to explore the effect of gaze in disambiguating motion prediction, high-frequency motion is not necessary. Note that once the waypoints are predicted, a full motion sequence at a high frame rate can easily be generated [51].
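A minimal sketch of this windowing is shown below; the 30 fps source rate follows the ego-centric stream, and the exact slicing is an illustrative assumption.

```python
def make_sample(poses, gaze, src_fps=30, out_fps=2, t_obs=3, t_pred=5):
    """Slice a synchronized trajectory into a 6-pose observation (3 s at 2 fps, with gaze)
    and a 10-pose prediction target (the following 5 s)."""
    step = src_fps // out_fps                          # keep every 15th frame
    n_in, n_out = t_obs * out_fps, t_pred * out_fps
    frames = [poses[i * step] for i in range(n_in + n_out)]
    obs_gaze = [gaze[i * step] for i in range(n_in)]   # gaze only for the observed window
    return frames[:n_in], obs_gaze, frames[n_in:]
```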

Baselines. We implement several state-of-the-art motion prediction and generation baselines, including ST-Transformer [2] and an RNN network [22] for full motion prediction from the past motion input, and the transformer-based MultimodalNet [25] for motion synthesis from multi-modal data (i.e., gaze, motion, and the 3D scene feature in our experiments). We build our pipeline by incorporating 6 cross-modal transformer layers [14]. An L1 loss between the predicted motion and the ground truth is used to train the network. More details about the network architecture and training are available in the supplementary material.

Table 3. Destination accuracy. We report the global translation and orientation error and mean per-joint position error (MPJPE).
Table 4. Path errors of the predicted motions.

5.2 Evaluation

For evaluation, we divide the 217 trajectories of our dataset into 180 trajectories for training and 37 for testing. The 37 test motions consist of 27 trajectories (different from the training ones) performed in known scenes from the training set and 10 trajectories in 2 new environments scanned only for evaluation purposes.

Fig. 5. The attention map of the 6 input gaze features over the 10 output motion frames. The gaze influences the ending outputs the most (brighter means larger weight), indicating that the gaze features reveal the subject’s final goals.

Evaluation Metrics. We employ the destination error and the path error as our evaluation metrics. The destination error refers to the global translation error, rotation error, and mean per-joint position error (MPJPE) [13] of the last pose in the predicted motion. The destination pose contains essential information about the subject’s goal, which is the primary focus of our experiments. The path error is computed as the mean error of the predicted poses over the 5 s horizon [4]. We compute the global translation and rotation errors as the L1 distance between the predicted SMPL-X translation and orientation parameters and the ground truth [51].
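A hedged sketch of these metrics is given below, assuming the predicted and ground-truth sequences store the global translation in the first three parameter dimensions and the orientation in the next three, and that joint positions have already been recovered from the SMPL-X parameters.

```python
import torch

def destination_and_path_errors(pred, gt, joints_pred, joints_gt):
    """pred, gt: (T, d) SMPL-X parameter sequences; joints_pred, joints_gt: (T, J, 3)."""
    dest_trans = (pred[-1, :3] - gt[-1, :3]).abs().mean()               # L1 destination translation error
    dest_orient = (pred[-1, 3:6] - gt[-1, 3:6]).abs().mean()            # L1 destination orientation error
    dest_mpjpe = (joints_pred[-1] - joints_gt[-1]).norm(dim=-1).mean()  # MPJPE of the last pose
    path_trans = (pred[:, :3] - gt[:, :3]).abs().mean()                 # mean error over the 5 s horizon
    return dest_trans, dest_orient, dest_mpjpe, path_trans
```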

Quantitative Evaluation. As shown in Table 3 and Table 4, while the state-of-the-art method based on a spatio-temporal transformer [2] suffers from ambiguities since the prediction is made only from the past motion, a simple RNN method with motion and gaze input [21] can significantly reduce the ambiguity, indicating the effectiveness of gaze in guiding motion prediction. Our method achieves promising results in predicting reasonable future motion with small destination and translation errors. Compared to MultimodalNet [25] built on vanilla transformers [47], our method is better at recognizing the subject’s intent from the gaze and thus predicts more accurate destination poses.

Qualitative Evaluation. Figure 4 shows that in a “going to sit” activity performed in a scene from the training set (top row), our method manages to generate accurate destination poses, i.e., sitting on the sofa. In the new environment, the subject first grabs a blackboard eraser and then starts wiping. While all the methods generate walking actions, ours without gaze input fails to predict the correct motion. When given gaze, MultimodalNet [25] and our method both reach out a hand and try to grab something. Our prediction successfully arrives at the destination point where the eraser lies; however, MultimodalNet [25] reaches toward the wrong place. More visualizations and failure cases are included in the supplementary material.

5.3 Ablation Study

In this part, we aim to answer question 3 by finding the factors that contribute to the superior performance of our method.

Variant 1: Gaze. We evaluate the baselines’ performance with and without gaze input to explore how gaze influences the motion prediction results. As clearly demonstrated in Table 3 and Table 4, the RNN network [21] and MultimodalNet [25] both gain significant accuracy improvements given gaze inputs. Figure 4 shows that without gaze, our method is confused about the future destination. To gain more intuition about the role of gaze in motion prediction, we visualize the attention weights of the gaze feature query over the motion features, as depicted in Fig. 5. Interestingly, we find that the gaze features indeed influence the ending poses of the predicted motion, implying that gaze can serve as a strong indicator of the destination of a motion, which reveals the user’s intent.

Variant 2: PointNet++ for Scene Feature Query. We propose to use PointNet++ [39] to extract the per-point features of the scene such that the gaze feature and the scene-aware motion feature can be obtained (Sect. 4.2). We replace it with PointNet to extract the global scene feature and use a linear layer to get the gaze feature. Results in Table 3 and Table 4 demonstrate that this variant performs well on scenes from the training set but loses its competitiveness when generalizing to new environments with different 3D structures.

Variant 3: Cross-Modal Transformer. The cross-modal transformer architecture has proven effective in bridging multi-modal information [14]. We replace it with the vanilla transformer [48] as used in [25]. Results in Table 3 and Table 4 (Ours (vanilla)) show a loss of accuracy compared to the full design. Note that the path error of this variant on the new scenes is even larger than the results without gaze input, indicating that the vanilla transformer might not be efficient enough to capture the correlations between multi-modal inputs.

6 Discussion and Future Work

We present the GIMO dataset, a real-world dataset with ego-centric images, 3D gaze, 3D scene context, and ground-truth human motions. With the collected dataset, we define a new task, i.e., gaze-informed human motion prediction, and further contribute a novel framework that minimizes the ambiguities in motion prediction by leveraging eye gaze to infer the subject’s potential intention. While our method only relies on 3D inputs, future work can incorporate visual information from ego-centric images to further improve accuracy. Besides the proposed task and framework, our dataset can benefit various applications, e.g., intention-aware motion synthesis and gaze-guided ego-centric pose estimation. We believe our work not only opens new directions for motion prediction but also has foreseeable impact on ego-centric vision topics.