1 Introduction

Pedestrian trajectory prediction attempts to forecast the socially-acceptable future paths of people based on their past movement patterns. These behavior patterns often depend on each pedestrian’s surrounding environment, as well as on collaborative movement, mimicking a group leader, or collision avoidance. Collaborative movement, one of the most frequent patterns, occurs when several colleagues form a group and move together. Computational social scientists estimate that up to 70% of the people in a crowd will form groups [40, 48]. Members of a group also gather surrounding information and have the same destination [40]. Such groups have characteristics that are distinguishable from those of individuals, maintain rather stable formations, and even provide important cues that can be used for future trajectory prediction [48, 78].

Pioneering works in human trajectory forecasting model the group movement by assigning additional hand-crafted terms as energy potentials [41, 47, 66]. These works account for the presence of other group members and physics-based attractive forces, which are only valid between the same group members. In recent works, convolutional neural networks (CNNs) and graph neural networks (GNNs) show impressive progress modeling the social interactions, including traveling together and collision avoidance [1, 2, 17, 39, 54]. Nevertheless, trajectory prediction is still a challenging problem because of the complexity of implicitly learning individual and group behavior at once.

Fig. 1. Comparison of existing agent-agent interaction graphs and the proposed group-aware GP-Graph. To capture social interactions, (a) existing pedestrian trajectory prediction methods model each pedestrian as a graph node. Since the pedestrian graph is a complete graph, it is difficult to capture the group’s movement because the graph becomes overly complex in a crowded scene. (b) GP-Graph is able to directly learn intra-/inter-group interactions while keeping the agent-wise structure.

Several attempts explicitly encode group coherence behaviors by updating the hidden states of an LSTM with a summation of other agents’ states, multiplied by a binary group indicator function [6]. However, existing studies have a critical problem when it comes to capturing the group interaction. Since their forecasting models focus more on individuals, the group features are shared at the individual node as illustrated in Fig. 1(a). Although this approach can conceptually capture group movement behavior, it is difficult for learning-based methods to represent it because of the overwhelming number of edges for the individual interactions. This problem becomes even harder in crowded environments.

To address this issue, we propose a novel general architecture for pedestrian trajectory prediction: GrouP-Graph (GP-Graph). As illustrated in Fig. 1(b), our GP-Graph captures intra-group (among members of a group) and inter-group interactions by disentangling input pedestrian graphs. Specifically, our GP-Graph first learns to assign each pedestrian to the most likely behavior group. The group indices of each pedestrian are generated using a pairwise distance matrix. To make the indexing process end-to-end trainable, we introduce a straight-through group back-propagation trick inspired by the Straight-Through estimator [5, 21, 35]. Using the group information, GP-Graph then transforms the input pedestrian graph into both intra- and inter-group interaction graphs. We construct the intra-group graph by masking out edges of the input pedestrian graph between pedestrians that do not belong to the same group. For the inter-group graph, we propose group pooling & unpooling operations to represent a group with multiple members as one graph node. By applying these processes, the GP-Graph architecture has three advantages: (1) It reduces the complexity of trajectory prediction, which is caused by the different social behaviors of individuals, by modeling group interactions. (2) It alleviates inherent scene bias by considering the huge number of unseen pedestrian graph nodes between the training and test environments, as discussed in [8]. (3) It offers a graph augmentation effect with pedestrian node grouping.

Next, through weight sharing with baseline trajectory predictors, we force a hierarchy representation from both the input pedestrian graph and the disentangled interactions. This representation is used to infer a probability map for socially-acceptable future trajectories after passing through our group integration module. In addition, we introduce a group-level latent vector sampling to ensure collective inferences over a set of plausible future trajectories.

To the best of our knowledge, this is the first model that literally pools pedestrian colleagues into one group node to efficiently capture group motion behaviors, and learns pedestrian grouping in an end-to-end manner. Furthermore, GP-Graph has the best performance on various datasets among existing methods when combined with GNN-based models, and it can be integrated with all types of trajectory prediction models, achieving consistent improvements. We also provide extensive ablation studies to analyze and evaluate our GP-Graph.

2 Related Works

2.1 Trajectory Prediction

Earlier works [18, 38, 42, 66] model human motions in crowds using hand-crafted functions to describe attractive and repulsive forces. Since then, pedestrian trajectory prediction has been advanced by research interest in computer vision. Such research leverages the impressive capacity of CNNs which can capture social interactions between surrounding pedestrians. One pioneering work is Social-LSTM [1], which introduces a social pooling mechanism considering a neighbor’s hidden state information inside a spatial grid. Much of the emphasis in subsequent research has been to add human-environment interactions from a surveillance view perspective [11, 23, 33, 37, 49, 52, 58, 59, 61, 75]. Instead of taking environmental information into account, some methods directly share hidden states of agents between other interactive agents [17, 50, 64]. In particular, Social-GAN [17] takes the interactions via max-pooling in all neighborhood features in the scene, and Social-Attention [64] introduces an attention mechanism to impose a relative importance on neighbors and performs a weighted aggregation for the features.

In terms of graph notations, each pedestrian and their social relations can be represented as a node and an edge, respectively. When predicting pedestrian trajectories, graph representation is used to model social interactions with graph convolutional networks (GCNs) [2, 22, 39, 59], graph attention networks (GATs) [3, 19, 23, 32, 54, 63], and transformers [16, 69, 70]. Usually, these approaches infer future paths through recurrent estimations [1, 9, 16, 17, 26, 50, 74] or extrapolations [2, 31, 39, 54]. Other types of relevant research are based on probabilistic inferences for multi-modal trajectory prediction using Gaussian modeling [1, 2, 30, 39, 54, 55, 65, 69], generative models [11, 17, 19, 23, 49, 58, 75], and a conditional variational autoencoder [9, 20, 26, 27, 29, 36, 50, 60]. We note that these approaches focus only on learning implicit representations for group behaviors from agent-agent interactions.

2.2 Group-Aware Representation

Contextual and spatial information can be derived from group-aware representations of agent dynamics. To accomplish this, one of the group-aware approaches is social grouping, which describes agents in groups that move differently than independent agents.

In early approaches [24, 76, 77], pedestrians can be divided into several groups based on behavior patterns. To represent the collective activities of agents in a supervised manner, a work in [41] exploits conditional random fields (CRF) to jointly predict the future trajectories of pedestrians and their group membership. Yamaguchi et al. [66] harness distance, speed, and overlap time to train a linear SVM to classify whether two pedestrians are in the same group or not. In contrast, a work in [14] proposes automatic detection for small groups of individuals using a bottom-up hierarchical clustering with speed and proximity features.

Group-aware predictors recognize the affiliations and relations of individual agents, and encode their proper reactions to moving groups. Several physics-based techniques represent group relations by adding attractive forces among group members [40, 41, 44, 46, 51, 56, 66]. Although a dominant learning paradigm [1, 4, 43, 62, 73] implicitly learns intra- and inter-group coherency, only two works in [6, 12] explicitly define group information. To be specific, one [6] identifies pedestrians walking together in the crowd using a coherent filtering algorithm [77], and utilizes the group information in a social pooling layer to share their hidden states. Another work [12] proposes a generative adversarial model (GAN)-based trajectory model, jointly learning informative latent features for simultaneous pedestrian trajectory forecasting and group detection. These approaches only learn individual-level interactions within a group, but do not encode their affiliated groups and future paths at the same time. Unlike them, our GP-Graph aggregates a group-group relation via a novel group pooling in the proposed end-to-end trainable architecture without any supervision.

2.3 Graph Node Pooling

Pooling operations are used for features extracted from grid data, like images, as well as graph-structured data. However, there is no geographic proximity or order information in the graph nodes that existing pooling operations require. As alternative methods, three types of graph pooling are introduced: topology-based pooling [10, 45], global pooling [15, 72], and hierarchical pooling [7, 13, 68]. These approaches are designed for general graph structures. However, since human behavior prediction has time-variant and generative properties, it is not possible to leverage the advantages of these pooling operations for this task.

Fig. 2. An overview of our GP-Graph architecture. Starting with graph-structured trajectories for N pedestrians, we first estimate grouping information with the Group Assignment Module. We then generate both intra-/inter-group interaction graphs by masking out unrelated nodes and by performing pedestrian group pooling. The weight-shared trajectory prediction model takes the three types of graphs and captures group-aware social interactions. Group pooling operators are then applied to encode agent-wise features from group-wise features, which are then fed into the Group Integration Module to estimate the probability distribution for future trajectory prediction.

3 Proposed Method

In this work, we focus on how group awareness in crowds is formed for pedestrian trajectory prediction. We start with a definition of a pedestrian graph and trajectory prediction in Sect. 3.1. We then introduce our end-to-end learnable pedestrian group assignment technique in Sect. 3.2. Using group index information and our novel pedestrian group pooling & unpooling operations, we construct a group hierarchy representation of pedestrian graphs in Sect. 3.3. The overall architecture of our GP-Graph is illustrated in Fig. 2.

3.1 Problem Definition

Pedestrian trajectory prediction can be defined as a sequential inference task made from observations of all agents in a scene. Suppose that N is the number of pedestrians in a scene. The history trajectory of each pedestrian \(n \in [1, ..., N]\) can be represented as \({\boldsymbol{X}}_n\!=\!\{ (x_n^t, y_n^t)\,|\,t\!\in \![1, ..., T_{obs}] \}\), where \((x_n^t, y_n^t)\) is the 2D spatial coordinate of pedestrian n at a specific time t. Similarly, the ground truth future trajectory of pedestrian n can be defined as \({\boldsymbol{Y}}_n\!=\!\{ (x_n^t, y_n^t)\,|\,t\!\in \![T_{obs}\!+\!1, ..., T_{pred}] \}\).

The social interactions are modeled from the past trajectories of other pedestrians. In general, the pedestrian graph \(\mathcal {G}_{ped}\!=\!(\mathcal {V}_{ped}, \mathcal {E}_{ped})\) refers to a set of pedestrian nodes \(\mathcal {V}_{ped} = \{ {\boldsymbol{X}}_n\,|\,n\!\in \![1, ..., N] \}\) and edges on their pairwise social interaction \(\mathcal {E}_{ped} = \{ e_{i,j}\,|\,i,j\!\in \![1, ..., N] \}\). The trajectory prediction process forecasts their future sequences based on their past trajectory and the social interaction as:

$$\begin{aligned} \widehat{{\boldsymbol{Y}}} = F_\theta \left( {\boldsymbol{X}},\,\mathcal {G}_{ped}\right) \end{aligned}$$
(1)

where \(\widehat{{\boldsymbol{Y}}} = \{ \widehat{{\boldsymbol{Y}}}_n\,|\,n\!\in \![1, ..., N] \}\) denotes the estimated future trajectories of all pedestrians in a scene, and \(F_\theta (\,\cdot \,)\) is the trajectory generation network.

3.2 Learning the Trajectory Grouping Network

Our goal in this work is to encode powerful group-wise features beyond existing agent-wise social interaction aggregation models to achieve highly accurate human trajectory prediction. The group-wise features represent group members in input scenes as single nodes, making pedestrian graphs simpler. We use a U-Net architecture with pooling layers to encode the features on graphs. By reducing the number of nodes through the pooling layers in the U-Net, higher-level group-wise features can be obtained. After that, agent-wise features are recovered through unpooling operations.

Conventional pooling & unpooling operators work on grid-structured data, like images, and it is not feasible to apply them directly to graph-structured data. Some earlier works attempt to handle this issue [7, 13]. These works focus on capturing global information by removing relatively redundant nodes using a graph pooling, and restoring the original shapes by adding dummy nodes with a graph unpooling if needed. However, in pedestrian trajectory prediction, each node must keep its identity index information and describe the dynamic property of the group behavior in scenes. For that, we present pedestrian graph-oriented group pooling & unpooling methods. We note that this is the first work to exploit the pedestrian index itself as a group representation.

Learning Pedestrian Grouping.   First of all, we estimate the group to which each pedestrian belongs using a Group Assignment Module. Using the history trajectory of each pedestrian, we measure the feature similarity among all pedestrian pairs based on their \(L_2\) distance. With this pairwise distance, we pick out all pairs of pedestrians that are likely to be colleagues (affiliated with the same group). The pairwise distance matrix \({\boldsymbol{D}}\) and a set of colleague index pairs \(\varUpsilon \) are defined as:

$$\begin{aligned} {\boldsymbol{D}}_{\,i,j} = \Vert F_\phi ({\boldsymbol{X}}_i) - F_\phi ({\boldsymbol{X}}_j)\Vert ~~~\text {for}~~ i,j \in [1, ..., N], \end{aligned}$$
(2)
$$\begin{aligned} \varUpsilon = \{ \text {pair}(i,\,j)\,|\,i,j \in [1, ..., N], ~i \ne j, ~{\boldsymbol{D}}_{\,i,j} \le \pi \}, \end{aligned}$$
(3)

where \(F_\phi (\,\cdot \,)\) is a learnable convolutional layer and \(\pi \) is a learnable thresholding parameter.
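As a rough illustration of Eqs. (2) and (3), the following minimal sketch computes the pairwise distance matrix and thresholds it. It makes several assumptions: the feature vectors stand in for the output of the learnable layer \(F_\phi \), the threshold is a fixed number rather than a learned \(\pi \), indices are 0-based, and each pair is stored once with i < j. The function names are hypothetical.

```python
import math

def pairwise_distance(features):
    """Return the N x N matrix of L2 distances between feature vectors (Eq. 2)."""
    n = len(features)
    return [[math.dist(features[i], features[j]) for j in range(n)] for i in range(n)]

def colleague_pairs(dist, threshold):
    """Return pairs (i, j) with i < j whose distance is at most the threshold (Eq. 3)."""
    n = len(dist)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] <= threshold}

# Hypothetical embedded features of four pedestrians (stand-in for F_phi(X_n)).
features = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.0, 5.6)]
D = pairwise_distance(features)
pairs = colleague_pairs(D, threshold=1.0)
print(sorted(pairs))  # [(0, 1), (2, 3)]
```

Pedestrians 0 and 1, and pedestrians 2 and 3, fall within the threshold of each other, so they are marked as colleague pairs.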

Next, using the pairwise colleague set \(\varUpsilon \), we arrange the colleague members in associated groups and assign their group index. We make a group index set G, which is formulated as follows:

$$\begin{aligned} G = \Big \{ G_k \,|\, G_k = \!\!\bigcup _{(i,j) \in \varUpsilon }\! \{i,\,j\},~~G_a\!\cap G_b = \varnothing ~~\text {for}~ a \ne b \Big \} \end{aligned}$$
(4)

where \(G_k\) denotes the k-th group and is the union of each pair set \((i, j)\). This information is used as important prior knowledge in the subsequent pedestrian group pooling and unpooling operators.
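One possible way to turn colleague pairs into disjoint groups, as Eq. (4) requires, is a union-find pass. The sketch below is an assumption about how this step could be realized, not the authors' implementation; in particular, treating unpaired pedestrians as singleton groups and using 0-based indices are choices made here for illustration.

```python
def build_groups(num_pedestrians, pairs):
    """Merge colleague pairs into disjoint groups with union-find."""
    parent = list(range(num_pedestrians))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root to the smaller one, so that the root
            # of a group is always its smallest member.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for i in range(num_pedestrians):
        groups.setdefault(find(i), []).append(i)
    # Groups are ordered by their smallest member; unpaired pedestrians
    # end up as singleton groups (an assumption of this sketch).
    return [groups[root] for root in sorted(groups)]

groups = build_groups(5, {(0, 1), (1, 3)})
print(groups)  # [[0, 1, 3], [2], [4]]
```

Because pairs (0, 1) and (1, 3) share pedestrian 1, all three are merged into one group, which keeps the groups mutually disjoint.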

Pedestrian Group Pooling.   Based on the group behavior property that group members gather surrounding information and share behavioral patterns, we group the pedestrian nodes, where the corresponding nodes’ features are aggregated into one node. The aggregated group features are then stacked for subsequent social interaction capturing modules (i.e., GNNs). Here, the representative feature of each group is obtained by average pooling over its members’ features. With this feature, we can model the group-wise graph structures, which have far fewer nodes than the input pedestrian graph, as will be demonstrated in Sect. 4.3. We define the pooled group-wise trajectory feature \({\boldsymbol{Z}}\) as follows:

$$\begin{aligned} {\boldsymbol{Z}} = \{{\boldsymbol{Z}}_k\,|\,k \in [1, ..., K]\}, ~~~~~{\boldsymbol{Z}}_k = \frac{1}{|G_k|} \sum _{i\;\!\in \;\!G_k} \!{\boldsymbol{X}}_i, \end{aligned}$$
(5)

where K is the total number of groups in G.
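The averaging in Eq. (5) can be sketched in a few lines. This is an illustrative example only: trajectories are plain lists of (x, y) tuples, the group list is given by hand, and the name `group_pool` is hypothetical.

```python
def group_pool(trajectories, groups):
    """Average the member trajectories of each group, step by step (Eq. 5)."""
    pooled = []
    for members in groups:
        steps = len(trajectories[members[0]])
        pooled.append([
            (
                sum(trajectories[i][t][0] for i in members) / len(members),
                sum(trajectories[i][t][1] for i in members) / len(members),
            )
            for t in range(steps)
        ])
    return pooled

# Three pedestrians observed for two time steps; pedestrians 0 and 1 form a group.
X = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 2.0), (1.0, 2.0)],
    [(4.0, 4.0), (4.0, 5.0)],
]
Z = group_pool(X, groups=[[0, 1], [2]])
print(Z[0])  # [(0.0, 1.0), (1.0, 1.0)]
```

The three pedestrian nodes are reduced to two group nodes; the singleton group keeps its own trajectory unchanged.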

Pedestrian Group Unpooling.   Next, we upscale the group-wise graph structures back to their original size by using an unpooling operation. This enables each pedestrian trajectory to be forecast with output agent-wise feature fusion information. In existing methods [7, 13], zero vector nodes are appended into the group features during unpooling. The output of the convolution process on the zero vector nodes fails to exhibit the group properties. To alleviate this issue, we duplicate the group features and then assign them into nodes for all the relevant group members so that they have identical group behavior information. The pedestrian group unpooling operator can be formulated as follows:

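$$\begin{aligned} \widetilde{{\boldsymbol{X}}} = \{ \widetilde{{\boldsymbol{X}}}_n\,|\,n \in [1, ..., N] \}, ~~~~~\widetilde{{\boldsymbol{X}}}_n = {\boldsymbol{Z}}_k ~~\text {where}~ n \in G_k, \end{aligned}$$
(6)

where \(\widetilde{{\boldsymbol{X}}}\) is the agent-wise trajectory feature reconstructed from \({\boldsymbol{Z}}\), having the same order of pedestrian indices as in \({\boldsymbol{X}}\).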
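The duplication step described above can be sketched as follows. The example is a simplified illustration with hypothetical names: group features are opaque values, groups are lists of 0-based pedestrian indices, and every pedestrian is assumed to belong to exactly one group.

```python
def group_unpool(group_features, groups, num_pedestrians):
    """Copy each group's feature to all of its members, in pedestrian-index order."""
    unpooled = [None] * num_pedestrians
    for k, members in enumerate(groups):
        for n in members:
            unpooled[n] = group_features[k]
    return unpooled

groups = [[0, 2], [1]]
Z = ["feature_of_group_0", "feature_of_group_1"]
restored = group_unpool(Z, groups, num_pedestrians=3)
print(restored)
# ['feature_of_group_0', 'feature_of_group_1', 'feature_of_group_0']
```

Unlike zero-padding, every member receives the feature of its own group, and the output keeps the original pedestrian ordering.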

Straight-Through Group Estimator.   A major hurdle when training the group assignment module in Eq. (4), which is a sampling function, is that index information is not treated as a learnable parameter. Accordingly, the group index cannot be trained using standard backpropagation algorithms. This is why existing methods train the group detection task in steps that are separate from the main trajectory prediction network.

We tackle this problem by introducing a Straight-through (ST) trick, inspired by the biased path derivative estimators in [5, 21, 35]. Instead of making the discrete index set \(G_k\) differentiable, we separate the forward pass and backward pass of the group assignment module in the training process. Our intuition for constructing the backward pass is that group members have similar features with closer pairwise distance between colleagues.

In the forward pass, we perform our group pooling over both pedestrian features and the group index from the input trajectory and estimated group assignment information, respectively. For the backward pass, we propose group-wise continuous relaxed features to approximate the group indexing process. We compute the probability that a pair of pedestrians belongs to the same group using the proposed differentiable binary thresholding function \(\frac{1}{1+\exp (x-\pi )}\), and apply it on the pairwise distance matrix \({\boldsymbol{D}}\). We then normalize each probability by the sum of all neighbors’ probabilities to obtain the normalized probability \({\boldsymbol{A}}\). Lastly, we compute a new pedestrian trajectory feature \({\boldsymbol{X}}'\) by aggregating features between group members through the matrix multiplication of \({\boldsymbol{X}}\) and \({\boldsymbol{A}}\) as follows:

$$\begin{aligned} {\boldsymbol{A}}_{\,i,j} = \frac{\frac{1}{1 + \exp \!\big (\frac{{\boldsymbol{D}}_{\,i,j}-\pi }{\tau }\big )}}{\sum _{i=1}^{N} \Big ({\frac{1}{1 + \exp \!\big (\frac{{\boldsymbol{D}}_{\,i,j}-\pi }{\tau }\big )}}\Big )} ~~~\text {for}~~ i,j \in [1, ..., N], \end{aligned}$$
(7)
$$\begin{aligned} {\boldsymbol{X}}' = \langle \, {\boldsymbol{X}} - {\boldsymbol{X}}{\boldsymbol{A}} \,\,\rangle + {\boldsymbol{X}}{\boldsymbol{A}}, \end{aligned}$$
(8)

where \(\tau \) is the temperature of the sigmoid function and \(\langle \,\cdot \,\rangle \) is the detach (in PyTorch) or stop gradient (in TensorFlow) function, which blocks backpropagation through its argument.
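To make Eq. (7) concrete, the following sketch evaluates the relaxed assignment matrix for a small hand-made distance matrix. It is a plain-Python illustration without automatic differentiation, so it shows only the values of \({\boldsymbol{A}}\) and not the gradient flow; the helper names and the example distances are assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_assignment(dist, pi, tau):
    """Sigmoid of (pi - D) / tau, normalized over the first index as in Eq. (7)."""
    n = len(dist)
    # 1 / (1 + exp((D - pi) / tau)) equals sigmoid((pi - D) / tau).
    prob = [[sigmoid((pi - dist[i][j]) / tau) for j in range(n)] for i in range(n)]
    # Normalize over i so that every column j sums to one.
    col_sum = [sum(prob[i][j] for i in range(n)) for j in range(n)]
    return [[prob[i][j] / col_sum[j] for j in range(n)] for i in range(n)]

# Pedestrians 0 and 1 are close to each other; pedestrian 2 is far from both.
D = [[0.0, 0.5, 7.0],
     [0.5, 0.0, 7.0],
     [7.0, 7.0, 0.0]]
A = soft_assignment(D, pi=1.0, tau=0.1)
for row in A:
    print([round(a, 3) for a in row])
```

With a small temperature, the weights within the close pair are split almost evenly, while the distant pedestrian keeps nearly all of its weight on itself, which approximates a hard group assignment.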

To explain Eq. (8) further, in our implementation we replace the input \({\boldsymbol{X}}\) of the pedestrian group pooling module with the new pedestrian trajectory feature \({\boldsymbol{X}}'\). In the forward pass, the two \({\boldsymbol{X}}{\boldsymbol{A}}\) terms cancel, so the loss is computed on the trajectory feature \({\boldsymbol{X}}\) itself. In the backward pass, in contrast, the stop gradient \(\langle \,\cdot \,\rangle \) ensures that the loss is only backpropagated through \({\boldsymbol{X}}{\boldsymbol{A}}\). In this way, we can train both the convolutional layer \(F_\phi \) and the learnable threshold parameter \(\pi \), which are used for the computation of the pairwise distance matrix \({\boldsymbol{D}}\) and the construction of the group index set G, respectively.
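The behavior of Eq. (8) can be restated as a short worked derivation. Here \(\mathcal {L}\) denotes a scalar training loss, a symbol introduced only for this illustration, and the gradient expression is the standard rule for a matrix product:

$$\begin{aligned} \text {forward:}~~~ {\boldsymbol{X}}' &= ({\boldsymbol{X}} - {\boldsymbol{X}}{\boldsymbol{A}}) + {\boldsymbol{X}}{\boldsymbol{A}} = {\boldsymbol{X}}, \\ \text {backward:}~~~ \frac{\partial \mathcal {L}}{\partial {\boldsymbol{A}}} &= {\boldsymbol{X}}^{\top } \frac{\partial \mathcal {L}}{\partial {\boldsymbol{X}}'}, ~~~~~\frac{\partial \langle \, {\boldsymbol{X}} - {\boldsymbol{X}}{\boldsymbol{A}} \,\rangle }{\partial {\boldsymbol{A}}} = 0. \end{aligned}$$

The detached term contributes only to the value, while the remaining \({\boldsymbol{X}}{\boldsymbol{A}}\) term contributes only to the gradient. Since \({\boldsymbol{A}}\) depends on \({\boldsymbol{D}}\) and \(\pi \) through Eq. (7), the gradient reaches both \(F_\phi \) and \(\pi \).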

Fig. 3. An illustration of our pedestrian group assignment method using a pairwise group probability matrix A. With a group index set G, a pedestrian group hierarchy is constructed based on three types of interaction graphs.

3.3 Pedestrian Group Hierarchy Architecture

Using the estimated pedestrian grouping information, we reconstruct the initial social interaction graph \(\mathcal {G}_{ped}\) in an efficient form for pedestrian trajectory prediction. Instead of the existing complex and complete pedestrian graph, intra- and inter-group interaction graphs capture the group-aware social relations, as illustrated in Fig. 3.

Intra-group Interaction Graph.   We design a pedestrian interaction graph that captures relations between members affiliated with the same group. The intra-group interaction graph \(\mathcal {G}_{member}\!=\!(\mathcal {V}_{ped}, \mathcal {E}_{member})\) consists of a set of pedestrian nodes \(\mathcal {V}_{ped}\) and edges on their pairwise social interaction of group members \(\mathcal {E}_{member} = \{ e_{i,j}\,|\,i,j\!\in \![1, ..., N], k\!\in \![1, ..., K], \{i,j\}\!\subset \!G_k \}\). Through this graph representation, pedestrian nodes can learn social norms of internal collision avoidance between group members while maintaining their own formations and on-going directions.

Inter-group Interaction Graph.   Inter-group interactions (group-group relations) are indispensable for learning social norms between groups as well. To account for various group behaviors such as following a leading group, avoiding collisions and joining a new group, we create an inter-group interaction graph \(\mathcal {G}_{group}\!=\!(\mathcal {V}_{group}, \mathcal {E}_{group})\). Here, nodes refer to each group’s features \({\boldsymbol{Z}}\) generated with our pedestrian group pooling operation, and edges represent the pairwise group-group interactions \(\mathcal {E}_{group} = \{ \bar{e}_{p,q}\,|\,p,q\!\in \![1, ..., K] \}\).
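The edge sets of the two graphs can be illustrated with a small sketch. It assumes 0-based indices and a hand-made group list, represents the intra-group edges as a binary mask over the complete pedestrian graph, and keeps self-loops because the definitions above do not exclude i = j or p = q. The function names are hypothetical.

```python
def intra_group_mask(num_pedestrians, groups):
    """Binary N x N mask that keeps edge (i, j) only if i and j share a group."""
    mask = [[0] * num_pedestrians for _ in range(num_pedestrians)]
    for members in groups:
        for i in members:
            for j in members:
                mask[i][j] = 1
    return mask

def inter_group_edges(num_groups):
    """All ordered group pairs (p, q), i.e. a complete graph over group nodes."""
    return [(p, q) for p in range(num_groups) for q in range(num_groups)]

groups = [[0, 1], [2]]
mask = intra_group_mask(3, groups)
edges = inter_group_edges(len(groups))
print(mask)   # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(edges)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The complete pedestrian graph with nine edges is split into a sparser intra-group mask and a smaller graph over two group nodes.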

Group Integration Network.   We incorporate the social interactions as a form of group hierarchy into well-designed existing trajectory prediction baseline models, as shown in Fig. 3(b). Meaningful features can be extracted by feeding different types of graph-structured data into the same baseline model. Here, the baseline models share their weights to reduce the number of parameters while enriching the augmentation effect. Afterward, the output features from the baseline models are aggregated agent-wise, and are then used to predict the probability map of future trajectories using our group integration module. The generated output trajectory \(\widehat{Y}\) with the group integration network \(F_\psi \) is formulated as:

(9)

Group-Level Latent Vector Sampling.    To infer the multi-modal future paths of pedestrians, an additional random latent vector is introduced along with the input observation path. This latent vector becomes a factor that determines a person’s choice of behavior patterns, such as acceleration/deceleration and turning right/left. There are two ways to adopt this latent vector in trajectory generation: (1) Scene-level sampling [17], where everyone in the scene shares one latent vector, unifying the behavior patterns of all pedestrians in a scene (e.g., all pedestrians slow down); (2) Pedestrian-level sampling [50], which allocates a different latent vector to each pedestrian, but forces the pedestrians to have different patterns, so that the group behavior property is lost.

We propose a group-level latent vector sampling method as a compromise between the two. We use the group information estimated by the GP-Graph to share one latent vector among the members of each group. If two people are not associated with the same group, independent random noise is assigned to each as a latent vector. In this way, it is possible to sample a multi-modal trajectory that is independent of other groups’ members and follows the associated group’s behavior. The effectiveness of the group-level sampling is visualized in Sect. 4.3.
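The sharing rule can be sketched as below. This is a minimal illustration under stated assumptions: the latent vector is drawn from a standard normal distribution (the text does not fix the distribution), groups are lists of 0-based pedestrian indices, and the function name is hypothetical.

```python
import random

def sample_group_latents(groups, num_pedestrians, dim, rng):
    """Draw one Gaussian latent vector per group and share it among the members."""
    latents = [None] * num_pedestrians
    for members in groups:
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for n in members:
            latents[n] = z
    return latents

rng = random.Random(0)
latents = sample_group_latents([[0, 1], [2]], num_pedestrians=3, dim=4, rng=rng)
print(latents[0] == latents[1])  # True: members of the same group share a latent
print(latents[0] == latents[2])  # False: different groups get independent noise
```

Scene-level sampling corresponds to passing a single group that contains everyone, and pedestrian-level sampling corresponds to passing only singleton groups.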

3.4 Implementation Details

To validate the generality of our GP-Graph, we incorporate it into four state-of-the-art baselines: three different GNN-based baseline methods including STGCNN (GCN-based) [39], SGCN (GAT-based) [54] and STAR (Transformer-based) [69], and one non-GNN model, PECNet [36]. We simply replace their trajectory prediction parts with ours. We additionally embed our agent/intra-/inter-graphs on the baseline networks, and compute integrated output trajectories to obtain the group-aware prediction.

For our proposed modules, we initialize the learnable parameter \(\pi \) to one, which lets the group pooling cut the total number of nodes roughly in half at the initial training step. Other learnable parameters such as \(F_\theta \), \(F_\phi \) and \(F_\psi \) are randomly initialized. We set the hyperparameter \(\tau \) to 0.1 to give the binary thresholding function a steep slope.

To train the GP-Graph architecture, we use the same training hyperparameters (e.g., batch size, train epochs, learning rate, learning rate decay), loss functions, and optimizers as the baseline models. We note that we do not use additional group labels, for an apples-to-apples comparison with the baseline models. Our group assignment module is trained to estimate effective groups for trajectory prediction in an unsupervised manner. Thanks to our powerful Straight-Through Group Estimator, it accomplishes promising results compared with other supervised group detection networks [7] that require additional group labels.

4 Experiments

In this section, we conduct comprehensive experiments to verify how the grouping strategy contributes to pedestrian trajectory prediction. We first briefly describe our experimental setup (Sect. 4.1). We then provide comparison results with various baseline models for both group detection and trajectory prediction (Sect. 4.2 and Sect. 4.3). We lastly conduct an extensive ablation study to demonstrate the effect of each component of our method (Sect. 4.4).

Table 1. Comparison between the GP-Graph architecture and the vanilla agent-wise interaction graph for four state-of-the-art multi-modal trajectory prediction models: Social-STGCNN [39], SGCN [54], STAR [69] and PECNet [36]. The models are evaluated on the ETH [42], UCY [28], SDD [47] and GCS [67] datasets. Gain: performance improvement w.r.t. FDE over the baseline models, Unit for ADE and FDE: meter, Bold: Best.

4.1 Experimental Setup

Datasets.   We evaluate the effectiveness of our GP-Graph by incorporating it into several baseline models and checking the performance improvement on public datasets: ETH [42], UCY [28], Stanford Drone Dataset (SDD) [47], and the Grand Central Station (GCS) [67] datasets. The ETH & UCY datasets contain five unique scenes (ETH, Hotel, Univ, Zara1 and Zara2) with 1,536 pedestrians, and the official leave-one-out strategy is used to train and to validate the models. SDD consists of various types of objects captured from a bird’s-eye view, and GCS shows highly congested pedestrian walking scenes. We use the standard training and evaluation protocol [17, 19, 36, 39, 50, 54] in which the first 3.2 s (8 frames) are observed and the next 4.8 s (12 frames) are used as the ground truth trajectory. Additionally, two scenes (Seq-eth, Seq-hotel) of the ETH dataset provide ground-truth group labels. We use them to evaluate how accurately our GP-Graph groups individual pedestrians.

Evaluation Protocols.   For multi-modal human trajectory prediction, we follow the standard evaluation protocol of Social-GAN [17], generating 20 samples based on the predicted probabilistic distributions, and then choosing the best sample to measure the evaluation metrics. We use the same evaluation metrics as previous works [1, 17, 34, 61] for future trajectory prediction. Average Displacement Error (ADE) computes the average Euclidean distance between a predicted and a ground-truth trajectory over all time steps, while Final Displacement Error (FDE) computes the Euclidean distance between the end-points of the prediction and the ground truth. Collision rate (COL) checks the percentage of test cases where the predicted trajectories of different agents run into collisions, and Temporal Correlation Coefficient (TCC) measures the Pearson correlation coefficient of motion patterns between a predicted and a ground-truth trajectory. We use both ADE and FDE as accuracy measures, and both COL and TCC as reliability measures in our group-wise prediction. For the COL metric, we average a set of collision ratios over the 20 multi-modal samples.
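As a small worked example of the two accuracy measures, the sketch below computes ADE and FDE for one pedestrian and takes the best value over a set of samples. It assumes trajectories are lists of (x, y) tuples of equal length and selects the minimum ADE and the minimum FDE independently, which is one common convention and may differ from a given benchmark implementation. The function names are hypothetical.

```python
import math

def ade(pred, gt):
    """Average Euclidean distance between matching time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Euclidean distance between the final predicted and ground-truth points."""
    return math.dist(pred[-1], gt[-1])

def best_of_k(samples, gt):
    """Minimum ADE and minimum FDE over all sampled trajectories."""
    return min(ade(s, gt) for s in samples), min(fde(s, gt) for s in samples)

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant offset of 1 m: ADE 1, FDE 1
    [(0.0, 0.0), (1.0, 0.0), (2.0, 2.0)],  # exact until the last step: ADE 2/3, FDE 2
]
min_ade, min_fde = best_of_k(samples, gt)
print(round(min_ade, 3), min_fde)  # 0.667 1.0
```

In this example the best ADE and the best FDE come from different samples, which shows why the selection convention matters when reporting results.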

For grouping measures, we use precision and recall values based on two popular metrics, proposed in prior works [6, 12]: A group pair score (PW) measures the ratio between group pairs that disagree on their cluster membership, and all possible pairs in a scene. A Group-MITRE score (GM) is a ratio of the minimum number of links for group members and fake counterparts for pedestrians who are not affiliated with any group.

4.2 Quantitative Results

Evaluation on Trajectory Prediction.   We first compare our GP-Graph with conventional agent-wise prediction models on the trajectory prediction benchmarks. As reported in Table 1, our GP-Graph achieves consistent performance improvements on all the baseline models. Additionally, our group-aware prediction also reduces the collision rate between agents, and shows analogous motion patterns with its ground truth by capturing the group movement behavior well. The results demonstrate that the trajectory prediction models benefit from the group-awareness cue of our group assignment module.

Table 2. Comparison of GP-Graph on SGCN with other state-of-the-art group detection models (Precision/Recall). For a fair comparison, the evaluation results are taken directly from [6, 12]. \(\mathcal {S}\): Use a loss for supervision, Bold: Best, Underline: Second best.

Evaluation on Group Estimation.   We also compare the grouping ability of our GP-Graph with that of state-of-the-art models in Table 2. Our group assignment module trained in an unsupervised manner achieves superior results in the PW precision in both scenes, but shows relatively low recall values over the baseline models.

There are various group interaction scenarios in both scenes, and we found that our model sometimes fails to assign pedestrians into one large group when either a person joins the group or the group splits into both sides to avoid a collision. In this situation, while forecasting agent-wise trajectories, it is advantageous to divide the group into sub-groups or singletons, letting them have different behavior patterns. Although false-negative group links sometimes occur during the group estimation because of this, it is not a big issue for trajectory prediction.

To measure the maximum capability of our group estimator, we additionally carry out an experiment with a supervision loss to reduce the false-negative group links. We use a binary cross-entropy loss between the distance matrix and the ground-truth group label. As shown in Table 2, the performance is comparable to the state-of-the-art group estimation models with respect to the PW and GM metrics. This indicates that our learned trajectory grouping network can properly assign groups without needing complex clustering algorithms.

Fig. 4. (Top): Examples of pedestrian trajectory prediction results. (Bottom): Examples of group estimation results on ETH/UCY datasets [28, 42].

4.3 Qualitative Results

Trajectory Visualization.   In Fig. 4, we visualize some prediction results of GP-Graph and other methods. Since GP-Graph estimates group-aware representations and captures both intra- and inter-group interactions, the predicted trajectories are closer to socially-acceptable trajectories and form more stable behaviors between group members than those of the comparison models. Figure 4 also shows the groups of pedestrians formed by our group assignment module. GP-Graph uses movement patterns and proximity information to create a group node for pedestrians who will exhibit the same behaviors and walking directions in the future. This simplifies complex pedestrian graphs and eliminates potential errors associated with collision avoidance between colleagues.

Fig. 5. (a) Visualization of predicted trajectory distribution in ZARA1 scene. (b–d) Examples of three sampled trajectories with scene-level, pedestrian-level, and group-level latent vector sampling strategy.

Table 3. Ablation study of various pooling & unpooling operations on SGCN [54] (FDE/COL/TCC). In the case of our Pedestrian Group Pooling & Unpooling, we additionally provide experimental results using the ground-truth group labels (Oracle). Bold: Best, Underline: Second best.

Group-Level Latent Vector Sampling.   To demonstrate the effectiveness of the group-level latent vector sampling strategy, we compare it with two previous strategies, scene-level and pedestrian-level sampling, in Fig. 5. Even though the probability maps of pedestrians are well predicted with the estimated group information (Fig. 5(a)), the previous sampling strategies still have limitations. For example, all sampled trajectories lean toward the same direction (Fig. 5(b)), or are scattered in different patterns even among group members, which leads to collisions between colleagues (Fig. 5(c)). Our GP-Graph with the proposed group-level sampling strategy predicts collaborative walking trajectories for associated group members that are independent of those of other groups (Fig. 5(d)).
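The difference between the three strategies can be sketched as follows. This is a minimal illustration under stated assumptions: latent vectors are drawn from a standard Gaussian, group memberships are given as one integer label per pedestrian, and the function name `sample_latents` is hypothetical. How the base predictor consumes the latent vectors is not shown.

```python
import random


def sample_latents(group_ids, dim, strategy, rng):
    """Return one latent vector per pedestrian under the chosen strategy.

    group_ids : list with one group label per pedestrian
    strategy  : "scene", "pedestrian" or "group"
    The Gaussian prior is an assumption made for illustration.
    """
    def draw():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    if strategy == "scene":
        shared = draw()  # a single latent shared by everyone in the scene
        return [shared for _ in group_ids]
    if strategy == "pedestrian":
        return [draw() for _ in group_ids]  # independent latent per pedestrian
    if strategy == "group":
        per_group = {}
        for g in group_ids:
            if g not in per_group:
                per_group[g] = draw()  # one latent per group
        return [per_group[g] for g in group_ids]  # shared by group members
    raise ValueError(f"unknown strategy: {strategy}")


# Pedestrians 0 and 1 form a group; pedestrian 2 walks alone.
group_ids = [0, 0, 1]
latents = sample_latents(group_ids, dim=2, strategy="group", rng=random.Random(0))
print(latents)
```

With group-level sampling, members of the same group receive an identical latent vector while different groups receive independent ones, which matches the behavior illustrated in Fig. 5(d).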

4.4 Ablation Study

Pooling & Unpooling.   To check the effectiveness of the proposed group pooling & unpooling layers, we compare them with other pooling methods, including gPool [13] and SAGPool [25], with respect to FDE, COL and TCC. gPool performs top-k pooling by employing a projection vector to compute a rank score for each node. SAGPool is similar to gPool, but encodes topology information in a self-attention manner. As shown in Table 3, both gPool and SAGPool lose pedestrian features because their pooling operations discard nodes deemed unimportant. By contrast, our pooling approach focuses on group representations of the pedestrian graph structure because it is optimized to capture group-related patterns.
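To make the loss of pedestrian features concrete, the following is a minimal sketch of score-based top-k pooling in the spirit of gPool. The sigmoid gating of the kept features and the rounding of the kept-node count are assumptions made for illustration, and the sketch omits the adjacency-matrix update that a full graph pooling layer would also perform.

```python
import math


def topk_pool(features, projection, ratio):
    """Keep the top-scoring fraction of nodes and gate their features.

    features   : list of per-node feature vectors
    projection : projection vector used to score each node
    ratio      : fraction of nodes to keep (at least one node is kept)
    """
    norm = math.sqrt(sum(p * p for p in projection))
    scores = [
        sum(x * p for x, p in zip(row, projection)) / norm for row in features
    ]
    k = max(1, math.ceil(ratio * len(features)))
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore the original node order
    pooled = [
        # Gate each kept feature by a sigmoid of its score (assumed gating).
        [x / (1.0 + math.exp(-scores[i])) for x in features[i]]
        for i in kept
    ]
    return kept, pooled


# Four pedestrian nodes with 2-D features; half of them are kept.
features = [[1.0, 0.0], [0.0, 0.5], [2.0, 2.0], [-1.0, -1.0]]
kept, pooled = topk_pool(features, projection=[1.0, 1.0], ratio=0.5)
print(kept, pooled)
```

Nodes 1 and 3 are dropped entirely, so their features never reach later layers. A group pooling operation instead aggregates every member into a group node.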

Group Hierarchy Graph.   We examine each component of the group hierarchy graph in Table 4. Both the intra- and inter-group interaction graphs yield a noticeable performance improvement over the baseline models, and the inter-group graph with our group pooling operation plays the most important role in the improvement (variants 1 to 4). The best performance is achieved when all three types of interaction graphs are used with a weight-shared baseline model, which takes full advantage of graph augmentations (variants 4 and 5).

Grouping Method.   We introduce a learnable threshold parameter \(\pi \) in the group assignment module in Eq. (2) because, in practice, the total number of groups in a scene can change according to the trajectory features of the input pedestrian nodes. To highlight the importance of \(\pi \), we test fixed-ratio group pooling with a node reduction ratio of 50%. As expected, the learnable threshold yields lower errors than fixed-ratio group pooling (variants 5 and 6). This shows that it is effective to allow the number of groups to vary, since it can differ even when the same number of pedestrians exists in a scene.
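The following sketch illustrates why a threshold allows the number of groups to vary. It assumes that groups are the connected components obtained by linking every pair whose distance does not exceed the threshold, and it uses scalar stand-ins for the learned trajectory features. Both choices, as well as the names `assign_groups` and `pairwise_abs`, are assumptions for illustration; Eq. (2) remains the reference definition.

```python
def pairwise_abs(values):
    """Pairwise absolute differences of scalar stand-in features."""
    return [[abs(a - b) for b in values] for a in values]


def assign_groups(distances, threshold):
    """Label connected components of the graph with edges distance <= threshold."""
    n = len(distances)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Relabel component roots as consecutive group indices.
    root_to_label = {}
    labels = []
    for i in range(n):
        root = find(i)
        if root not in root_to_label:
            root_to_label[root] = len(root_to_label)
        labels.append(root_to_label[root])
    return labels


# Two scenes with the same number of pedestrians but different structure.
paired_scene = assign_groups(pairwise_abs([0.0, 0.1, 5.0, 5.1]), threshold=0.5)
sparse_scene = assign_groups(pairwise_abs([0.0, 2.0, 4.0, 6.0]), threshold=0.5)
print(paired_scene, sparse_scene)
```

Both scenes contain four pedestrians, yet the first yields two groups and the second yields four. A fixed 50% reduction ratio would force two group nodes in both cases.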

Additionally, we report results for the group-level latent vector sampling strategy (variants 5 and 7). Since the ADE and FDE metrics are based on a best-of-many strategy, there is no difference in numerical performance. However, the strategy allows each group to keep its own behavior patterns and to remain independent of other groups, as in Fig. 5.
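For reference, a best-of-many evaluation for a single pedestrian can be sketched as below. The sketch takes the minimum ADE and the minimum FDE independently over the sampled trajectories, which is a common convention but an assumption here; the function names are illustrative.

```python
import math


def ade_fde(prediction, truth):
    """Average and final displacement error of one predicted trajectory."""
    errors = [math.dist(p, t) for p, t in zip(prediction, truth)]
    return sum(errors) / len(errors), errors[-1]


def best_of_many(samples, truth):
    """Minimum ADE and minimum FDE over all sampled trajectories."""
    scores = [ade_fde(sample, truth) for sample in samples]
    return min(ade for ade, _ in scores), min(fde for _, fde in scores)


# Ground-truth future path and two sampled predictions for one pedestrian.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts away from the true path
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)],  # stays close to the true path
]

min_ade, min_fde = best_of_many(samples, truth)
print(min_ade, min_fde)
```

Because only the closest sample counts for each pedestrian, these metrics do not directly measure whether the trajectories within one sample are mutually consistent.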

Table 4. Ablation study (ADE/FDE). AW, MB, GP, WS, FG and GS respectively denote agent-wise pedestrian graph, intra-group member graph, inter-group graph, weight sharing among different interaction graphs, fixed-ratio node reduction of grouping and group-level latent vector sampling. All tests are performed on SGCN. Bold: Best, Underline: Second best.

5 Conclusion

In this paper, we present a GP-Graph architecture for learning group-aware motion representations. We model group behaviors in crowded scenes by proposing a group hierarchy graph using novel pedestrian group pooling & unpooling operations. We use them for our group assignment module and a straightforward group estimation trick. Based on the GP-Graph, we introduce a multi-modal trajectory prediction framework that can attend to intra- and inter-group interaction features to capture human-human interactions as well as group-group interactions. Experiments demonstrate that our method significantly improves performance on challenging pedestrian trajectory prediction datasets.