Keywords

1 Introduction

Trajectory prediction is a hot topic in the field of computer vision. Predicting the Trajectories of pedestrians is essential for Self-driving cars and robots. However, the problem of pedestrian trajectory prediction is extremely complicated, due to interdependent of pedestrians. And timing predictions often lead to error accumulation.

Traditional mathematical statistical methods [1, 2] rely on artificially designed features to model pedestrian movements and interactions. With the development of neural networks in recent years, the methods based on Neural Network have surpassed the traditional methods and achieved better results. Trajectory prediction methods based on deep learning mostly use LSTM [3] model to encode and decode each pedestrian trajectory sequence in the scene using LSTM network to obtain the trajectory characteristics of each pedestrian.

When dealing with the interactions between pedestrians, the above methods mostly deal with each pedestrian individually, which increase the amount of calculation and time cost. On the other hand, the error propagation problem is not solved properly. The errors of the previous time step will be amplified and affect the result of the subsequent time step.

Inspired by the attention mechanism in natural language processing [4], a novel method for trajectory prediction is proposed. We assign different weight coefficients to historical features based on the attention mechanism and fuse historical features.

Contributions of this paper are summarized as follows:

  • Proposed a pedestrian interaction processing method based on graph neural network, which realizes feature fusion between pedestrians, and reduces the time consumption.

  • Proposed an attention-based historical feature fusion method that adaptively selects the historical trajectory characteristics to improve the accuracy of prediction.

  • Evaluated on several publicly available real-world crowd datasets. And the results demonstrate the effectiveness of our method.

The remainder of this paper is organized as follows. Section 2 reviews main work related to trajectory prediction. Section 3 describes the proposed model. Section 4 evaluates the effectiveness of the framework while conclusions and suggestions for future work are summarized in Sect. 5.

2 Related Work

At present, trajectory prediction methods [1, 2] include traditional mathematical statistics methods and methods based on deep learning. Traditional mathematical statistical methods rely on artificially designed features to model pedestrian movements and interactions. The bottlenecks of traditional mathematical statistics methods are that they cannot consider long-term dependence on information and adapt to complex mobile scenarios.

In recent years, methods based on deep learning have proven to be superior to traditional mathematical statistical methods. Among them, the methods based on LSTM and the methods based on GAN [5] are the most representative. Because the nature of pedestrian trajectories is a set of natural motion sequences with time series characteristics, some methods [6,7,8] mainly build models based on recurrent neural networks. Alahi et al. [6] have introduced the social pooling layer to bring the hidden states of neighboring pedestrians together to form interaction features, and achieved the purpose of modeling pedestrian interaction. The social pooling layer meshes the target scene and pools the hidden layer features of other pedestrians in the neighborhood of each pedestrian grid according to the grid range. At the same time, the results generated by the GAN-based method [9, 10] perform well in authenticity and diversity. Gupta et al. [9] have proposed a trajectory sampler that handles the interactions between all the observed pedestrians by pooling the GAN input random vector with a vector combining the hidden representations of the other pedestrian trajectories.

Deep Learning on Graphs Generalizing neural networks to graph-structured data is an emerging topic. Recently, researchers apply attention mechanisms to graph-structured data [11] to model spatial correlations for graph classification. Regarding the attention mechanism [4], the literature proposes a network structure called “transformer” based on the attention mechanism to mine the relationship between input and output.

Inspired by the above, we model the pedestrians in the scene as graph nodes and implement pedestrian interaction through graph attention networks. We model the direct relationship between history and future time steps to mitigate the problem of error propagation.

3 Methodology

3.1 Problem Definition

Suppose pedestrians are represented by \( P_{1} \), \( P_{2} \)\( P_{N} \). The position of pedestrian \( P_{i} \) at time-step T is denoted as (\( X_{Ti} \), \( Y_{Ti} \)). The problem is defined to predict the trajectories (\( X_{Ti} \), \( Y_{Ti} \)), where T = Tobs+1, Tobs+2… Tpred.

3.2 Overall Flowchart

The overall network architecture is shown in Fig. 1. First, we use the LSTM network to extract the trajectory features, and merge the trajectory features into the graph attention network to obtain the global interaction features. Then, based on the attention mechanism, the trajectory features of multiple time steps are fused to obtain the final trajectory features. Finally, the trajectory features are input into the LSTM network to obtain the predicted position.

Fig. 1.
figure 1

Overall network architecture

3.3 Extract Trajectory Features

At first, we embed the location of each person to get a fixed length vector \( d_{Ti} \) and input it to the E-LSTM.

$$ d_{Ti} = F\left( {x_{i}^{t} ,y_{i}^{t} } \right) $$
(1)
$$ S_{i}^{t} = {\text{E}} - LSTM\left( {S_{i}^{t - 1} ,d_{Ti} ;W_{E} } \right) $$
(2)

where the function \( F\left( \cdot \right) \) is an embedding function. \( S_{i}^{t} \) is the hidden state of the E-LSTM at time-step t. \( W_{E} \) is the weight of the E-LSTM.

3.4 Pedestrian Interaction

We consider the pedestrians as nodes on the graph, and use GAT [11] to implement our pedestrian interaction mechanism. \( S_{i}^{{t = T_{obs} }} \) is input into the attention layer of the graph, and the attention coefficient between node pairs (i, j) is calculated:

$$ \alpha_{i.j}^{t} = \frac{{exp(LeakyRelu(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {a}^{T} [WS_{i}^{t} ||WS_{j}^{t} ]))}}{{\sum k \in N_{i} exp(LeakyRelu(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {a}^{T} [WS_{i}^{t} ||WS_{k}^{t} ])}} $$
(3)

Where || is the concatenation operation, \( \alpha_{i.j}^{t} \) is the attention coefficient of node j to i, \( N_{i} \) represents the neighbors of node i on the graph, \( W \) is the weight matrix, \( a \) is a single-layer feedforward neural network, parametrized by a weight vector \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {a}^{T} \). It is normalized by a softmax function with LeakyReLU.

After the standardized attention coefficient is obtained, the output of the attention network of node i graph is calculated by the following formula:

$$ \hat{S}_{i}^{t} = \sigma \left( {\sum j \in N_{i} \alpha_{i.j}^{t} WS_{i}^{t} } \right) $$
(4)

Where \( \sigma \) is a nonlinear function, \( \hat{S}_{i}^{t} \) is the state of pedestrian i after merging the characteristics of the surrounding pedestrian trajectory.

3.5 Trajectory Feature Fusion

To ease the error propagation, we added the attention module. The attention module models the direct relationship between each future time steps and historical time steps to generate the pedestrian final trajectory feature.

For pedestrian i, the correlation between the time step \( t_{P} \left( {t_{P} = T_{obs} , \ldots T_{obs + Q} } \right) \) and the historical time steps t(t = \( T_{1} , \ldots T_{obs - 1} \)) is calculated.

$$ \mu_{{t_{P} ,t}} = \hat{S}_{i}^{{t_{p} }} ,S_{i}^{t} $$
(5)
$$ \gamma_{{t_{P} ,t}} = \frac{{exp\left( {\mu_{{t_{P} ,t}} } \right)}}{{\mathop \sum \nolimits_{{t_{r} = T_{1} ,}}^{{T_{obs - 1} }} exp\left( {\mu_{{t_{P} ,t_{r} }} } \right)}} $$
(6)

Where \( \left\langle {\cdot,\cdot} \right\rangle \) denotes the inner product operator, \( \gamma_{{t_{P} ,t}} \) is the attention score. The trajectory characteristics of pedestrian i are calculated by the following formula:

$$ \hat{P}_{i}^{t} = \left( {\sum\nolimits_{{t = T_{1} }}^{{T_{obs - 1} }} {\gamma_{{t_{P} ,t}} S_{i}^{t} } } \right)||\hat{S}_{i}^{{t_{p} }} $$
(7)

In many cases, the trajectory of a pedestrian is multi-modal. And different people may choose different modes of action. In order to enhance the diversity of the generated trajectory, the pedestrian trajectory characteristics are added to the noise vector and input to the LSTM network to decode and obtain the prediction result.

$$ r_{i}^{{T_{obs} }} = \hat{P}_{i}^{t} ||Z $$
(8)
$$ r_{i}^{{T_{obs} + 1}} = D - LSTM\left( {r_{i}^{{T_{obs} }} ,d_{i}^{{T_{obs} }} ;W_{D} } \right) $$
(9)
$$ \left( {x_{i}^{{T_{obs} + 1}} , y_{i}^{{T_{obs} + 1}} } \right) = \delta_{3} \left( {r_{i}^{{T_{obs} + 1}} } \right) $$
(10)

Where Z represents noise, \( r_{i}^{{T_{obs} }} \) represents intermediate state vector, \( \delta_{3} \left( \cdot \right) \) is a function, \( W_{D} \) is the D-LSTM weight. After the predicted position is obtained at time step \( T_{obs + 1} \), the subsequent time step location is predicted.

For each pedestrian, the model produces multiple predicted trajectories by randomly sampling z from N (0,1) (the standard normal distribution). And select the trajectory with the smallest distance from the real position as the model output to calculate the loss.

$$ L = min_{k} \left| {\left| {P_{i} - \hat{P}_{i}^{k} } \right|} \right|_{2} $$
(11)

Where \( P_{i} \) is the ground-truth trajectory of pedestrian i, \( \hat{P}_{i} \) is the trajectory produced by our model, k is a hyperparameter.

4 Experiment

Training Samples.

We evaluate our proposed model on two public pedestrian walking datasets, ETH [12] and UCY [13], which contain rich social interactions. The ETH dataset consists of two scenarios named ETH and HOTEL. UCY dataset includes two scenarios and in three components, named ZARA-01, ZARA-02 and UCY. These datasets contain thousands of real-world pedestrian trajectories and cover rich human-human interactions. We evaluate our model on these 5 datasets. We follow the leave-one-out evaluation methodology in [9].

Parameter Settings.

We iteratively train the network with a batch size of 64 for 300 epochs using Adam optimizer with a learning rate of 0.01. The dimensions of the hidden state for LSTM is 32. For first graph attention layer, the shape of W is 16 × 16. For the second layer, the shape of W is 16 × 32. Batch Normalization is applied over the input of graph attention layer. The dimension of Z is set to 16.

Evaluation Index.

There are two types of metrics for evaluating the performance of trajectory prediction, including the Average Displacement error (ADE) and Final Displacement error (FDE) in meters.

  1. 1.

    Average Displacement Error (ADE): Average L2 distance between ground truth and our prediction over all predicted time steps.

  2. 2.

    Final Displacement Error (FDE): The distance between the predicted final destination and the true final destination at end of the prediction period Tpred.

Quantitative Evaluation.

All experiments are based on ETH and UCY datasets. The results and analysis are as follows.

Table 1. Comparative experiment on attention module.

In Table 1, We evaluated the attention module of the experiment. × indicates that the attention module has been removed from the network. The results show that the effect is improved on some data sets, which shows that the attention module has a certain effect. But on the ETH dataset, the error becomes larger, which is related to the number of pedestrians’ historical time step. Compared with other datasets, the average residence time of pedestrians in the ETH dataset is shorter.

Table 2. Comparison with several baseline models.

In Table 2, we evaluate our model against all baseline models. The results show that our method outperforms all compared methods on all datasets. Compared with S-LSTM and SGAN, the ADE is reduced by 31% and 18% respectively. For FDE, the performance is increased by 32% and 21% respectively. These results show that our model has advantages compared to other methods.

Fig. 2.
figure 2

Trajectories generated by our model

Figure 2 shows some examples of predicted trajectories drawn from datasets. The predicted paths of our models appear able to better capture the direction of pedestrian movement. The generated trajectories do not have a linear trend, and the model also performs well in the case of multiple pedestrians.

5 Conclusion

In this paper, we propose a novel method for the prediction of pedestrian trajectories. We use the graph attention network to handle global pedestrian interaction. Furthermore, we use the attention module to select and fuse historical features. Experimental results show that the attention mechanism effectively reduces the error propagation and improves prediction results. Test results on two datasets prove that our method can effectively improve prediction accuracy. We have noticed that the attention module failed to get the expected result on the ETH dataset, the issue will be further analyzed to improve our method in the future work.