
1 Introduction

Predicting the future trajectories of humans in dynamic scenarios is a crucial task in autonomous driving, as it supplies vital information to the downstream decision-making process. Accurate pedestrian trajectory prediction enables the system to plan the vehicle's movement in advance in a rapidly changing environment. For example, from the driver's point of view, a pedestrian's trajectory affects the next decision: whether to drive straight on or wait for the pedestrian to pass. Similarly, in driver assistance systems, the controller must predict the trajectories of pedestrians before making a decision.

Fig. 1. The goal of this paper is to predict the future trajectories of pedestrians. The problem can be divided into two sub-problems: historical trajectory modeling and social interaction modeling. We use an encoder-decoder framework to solve it.

However, trajectory prediction is challenging because two crucial points must be considered. One is that the historical trajectory of a pedestrian affects his/her future positions; the other is that the complex relationships among pedestrians affect one another and even their future decisions. First, it is obvious that the past locations of pedestrians contain information for judging future positions, but how to model the historical course is an intricate problem. Second, the interaction among pedestrians is mainly driven by common sense and social customs, which requires understanding these complex relationships and modeling them in a proper way. Meanwhile, the complexity of pedestrian trajectory prediction comes from diverse social behaviors, such as walking in parallel with others in a group, avoiding collisions, and approaching a specific point from different directions. This paper refers to the various behaviors mentioned above as social interaction and proposes a way to describe them.

Given the importance of this field, researchers have focused on modeling social interactions and proposing deep networks. Yet there are still many problems with current solutions. Social-Force-based models [9, 12], for example, rely on handcrafted rules to model human interaction; although they are easy to compute in simple scenarios, manual modeling cannot handle complex ones. With the rapid development of deep learning, many previous works have used LSTMs [1, 8, 16, 29] to model pedestrian trajectories. As a neural memory network, an LSTM can use its hidden vectors to express social relationships among pedestrians while modeling their trajectories. However, when an LSTM models long sequences, the deviation grows as time increases. In addition, some previous works [21, 25] add an attention mechanism on top of the LSTM network to perform social modeling, which yields certain improvements on open datasets. In recent years, Graph-Neural-Network-based models [10, 17, 22] have been introduced in this field to model human interactions, but these methods generally require heavy computation and more time to obtain results. In summary, previous methods have problems in two respects. First, LSTMs are mostly used for trajectory modeling, but they are unstable on long sequences: as the prediction horizon increases, the deviation error becomes more significant. Second, most previous works use complex social relationship modeling; although they achieve relatively good results, the amount of computation is enormous and unsuitable for real-time autonomous driving systems.

As shown in Fig. 1, this paper proposes a new encoder-decoder framework that uses a Transformer to solve the historical trajectory modeling problem. We design Social-Transformer to overcome the two aforementioned limitations.

Experiments on multiple pedestrian trajectory prediction datasets demonstrate the improvement of our model. Our contributions are as follows:

  1. We propose a Transformer-based model, a novel structure for pedestrian modeling: an encoder module models the historical trajectories and a decoder module decodes the future locations, while the intermediate hidden vectors are used to model the social relationships. Since the Transformer is inherently good at handling long sequences, the stability of our predictions improves greatly as time increases. At the same time, the Transformer supports parallel computation, so it is far more efficient than an LSTM.

  2. We are the first to apply an attention mechanism to the hidden vectors of a Transformer-based model to build human interactions. This requires few parameters yet works very well: it increases calculation speed while reducing the amount of computation, making it applicable to real-time autonomous driving systems.

  3. We conduct experiments on two open datasets and achieve improvements of roughly 2% to 50% over other methods.

2 Related Work

Due to the importance of the trajectory prediction problem, many researchers have proposed solutions in this field. In this section, we give a brief overview of the different methods.

Human Trajectory Prediction in Deep Models. With the development of deep learning, many researchers have searched for a proper model to solve the prediction problem. Social-LSTM [1] is one of the first deep models for pedestrian trajectory prediction. It uses a recurrent neural network to model the historical trajectory of each pedestrian, then applies a pooling mechanism to aggregate the hidden states before predicting the trajectory. Subsequently, Social-GAN [7] wrapped the LSTM framework in a GAN and improved performance on this basis. Considering the coordinate trajectories of pedestrians alone is obviously limiting for prediction. Later works such as Sophie [21], SR-LSTM [29], and Peek Into The Future (PIF) [14] added visual features to the models and applied new pooling mechanisms to improve prediction accuracy. Sophie innovatively uses an attentive GAN to model the interaction among people: the attention mechanism lets a person in a scene notice the other people and computes their influence weights on him. Similarly, SR-LSTM weighs the contribution of each pedestrian to the others via an original weighting mechanism, though it is not named attention. Differently, PIF uses a two-stream model that captures both trajectories and behaviors of pedestrians during training; it achieves better performance because the two tasks complement each other. Since Graph CNNs were introduced by [11], which brought CNN concepts to graphs, many works, such as Social-STGCNN [17], STGAT [10], and Recursive Social Behavior Graph (RSBG) [22], have used Graph Neural Networks to build human interactions and achieved better prediction performance. Social-STGCNN replaces the need for aggregation methods by modeling the interactions as a graph. STGAT models the spatial interactions captured by a graph attention mechanism at each time step and adopts an extra LSTM to encode the temporal correlations of interactions.

Encoder-Decoder Model. The Encoder-Decoder framework was first proposed by [23]. It maps an input sequence to an output sequence, where the two lengths may differ. The Encoder-Decoder model and its variants are considered the best solution for many complex tasks, for example machine translation [26], speech recognition [20], and video captioning [23]. Similarly, our problem is to predict the future trajectories of all pedestrians given their historical trajectories, and the input and output sequence lengths may differ. At the same time, the Encoder-Decoder model is designed to generate new sequences from existing ones, which suits our problem well. We therefore adopt the Encoder-Decoder framework as our principal architecture.

Recent Advancements in Transformer. The Transformer was first proposed by [24] and caused a great sensation, achieving strong results in many downstream tasks. It pioneeringly uses the attention mechanism to attend to all the information in an entire sequence. At the same time, it couples the encoder and decoder through cross-attention, so that the decoder can also attend to the encoded information while decoding. Natural language processing models built on the Transformer, such as BERT [5] and GPT [6], perform well in many areas such as translation, text generation, and sentiment analysis. Beyond natural language processing, the Transformer also has many excellent applications in computer vision. For example, DETR [4] uses the Transformer framework for detection and panoptic segmentation on the COCO 2017 datasets, and SegFormer [27] and SETR [30] achieve the best results in image segmentation. The Transformer overcomes the shortcoming of the LSTM that the error grows over time when predicting long sequences. It allows the sequence to be computed in parallel while attending to the parts of the sequence that currently have the most significant influence. For our problem, the Transformer can be used to model the historical trajectory, while the intermediate hidden vectors model the social relationships, so that the two limitations mentioned above are resolved: first, the error does not deviate significantly as time goes by; second, the use of a simple social modeling method significantly reduces the amount of computation and meets the real-time needs of autonomous driving scenarios.

3 Method

Our goal is to predict the future trajectories of the pedestrians in a scene. Obviously, a person's historical trajectory is an essential factor in determining his/her future positions. In addition, social relationships also play an important role; for example, a person may alter his/her path to avoid colliding with other people. Therefore, this problem can be decomposed into two sub-problems: modeling the historical trajectory of every pedestrian, and modeling the social interactions of all pedestrians in a scene.

In this section, we present our Social-Transformer model (shown in Fig. 2). It has three components: a Transformer-based encoder module that takes the pedestrians' historical trajectories as input, a Social-Attention-based module that captures the spatial correlations of interactions, and a Transformer-based decoder module that outputs the predicted trajectory of every pedestrian.

Fig. 2. Overview of our Social-Transformer method. The framework is based on an encoder-decoder model and consists of three parts: the encoder takes the historical trajectories of pedestrians as input; the Social-Interaction module models the social relationships among people; the decoder generates the predicted locations of pedestrians.

3.1 Problem Definition

Trajectory prediction can be cast as a sequence generation problem, which estimates all pedestrians' future states based on the given past information. First, we assume that each scene has been preprocessed to obtain the spatial coordinates of every pedestrian; previous works also follow this convention [1, 2, 7, 15]. Then, we suppose there are N pedestrians in a scene, represented as \(p_1,p_2,...,p_N\). Finally, at time instant \(\tau \), every pedestrian's position is represented by his/her spatial xy-coordinates \((x_i^\tau ,y_i^\tau )\). Therefore, the past and current trajectory of pedestrian i is represented by the ordered set:

$$\begin{aligned} traj_i=\{(x_i^\tau , y_i^\tau )\mid \tau \in \{t_0,t_1,\ldots ,t_{pred}\}\}\quad \forall i \in \{1,\ldots ,N\} \end{aligned}$$
(1)

Throughout the paper, we use \(t_{0}\) to \(t_{obs}\) as the past time, and \(t_{obs+1}\) to \(t_{pred}\) as the future time. We observe all pedestrians' xy-coordinates from \(t_0\) to \(t_{obs}\), feed these sequences into a model with parameters \(W^*\), and predict their xy-coordinates from \(t_{obs+1}\) to \(t_{pred}\):

$$\begin{aligned} traj_{1:N}^{t_{obs+1}\,:\,t_{pred}}=f(traj_{1\,:\,N}^{t_0\,:\,t_{obs}}\,;\,W^*) \end{aligned}$$
(2)
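To make the data layout concrete, the following minimal sketch (not from the paper; the shapes and the placeholder function are purely illustrative) shows the tensors implied by Eqs. (1) and (2), using the paper's setting of 8 observed and 12 predicted frames:

```python
import torch

# Illustrative values: N pedestrians, 8 observed frames, 12 predicted frames.
N, T_obs, T_pred = 12, 8, 12

# Observed xy-coordinates of all pedestrians from t_0 to t_obs: [N, T_obs, 2].
traj_obs = torch.randn(N, T_obs, 2)

def f(traj: torch.Tensor) -> torch.Tensor:
    """Placeholder for the model f(.; W*) in Eq. (2): [N, T_obs, 2] -> [N, T_pred, 2]."""
    return torch.zeros(traj.shape[0], T_pred, 2)

traj_pred = f(traj_obs)  # predicted xy-coordinates from t_{obs+1} to t_{pred}
```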

3.2 Model Description

The Social-Transformer model consists of three main parts (as shown in Fig. 2): a Transformer-based encoder module that models the pedestrians' historical trajectories, a Social-Attention-based module that models the social interactions of all pedestrians in a scene, and a similar Transformer-based decoder module that predicts the future trajectories of all pedestrians.

First, the encoder module takes the sequences of all pedestrians in a scene, \(traj_{1\,:\,N}^{t_0\,:\,t_{obs}}\), as input and outputs the features extracted from the historical trajectories. Second, the Social-Attention-based module takes the outcome of the previous module and models the social interactions of all pedestrians in the scene: it concentrates on the most critical information, builds a social interaction graph, and updates the relationship weights for every pedestrian. The historical trajectory and social interaction information then form the input of the final Transformer-based decoder module. Finally, the decoder module compares its output with the ground truth and reconstructs the trajectories of all pedestrians. In the following sections, we describe each module in detail.

Historical Trajectory Modeling. Each pedestrian has his/her own motion states, including preferred speed, acceleration, and so on. Previous works mostly use LSTMs to memorize historical motions [1, 7, 10]. In recent years, however, the Transformer has replaced the LSTM for sequence generation problems [24] because it shows better performance on many tasks and computes at a more stable speed. Yet no one has tried to use the Transformer as a backbone for modeling the past motions of pedestrians. We therefore describe how we adapt this backbone to our problem.

The Transformer is mostly used in NLP tasks and has two input components: token embedding and position embedding. The token embedding is the embedding vector of the input sequence, with shape [B, L, C], where B is the batch size, L is the sequence length, and C is the feature dimension at each time step \(\tau \). In our problem, we first compute the relative position of each pedestrian with respect to the previous time step:

$$\begin{aligned} \varDelta x_i^\tau =x_i^\tau -x_i^{\tau -1} \end{aligned}$$
(3)
$$\begin{aligned} \varDelta y_i^\tau =y_i^\tau -y_i^{\tau -1} \end{aligned}$$
(4)

Then we use a linear layer to project the relative positions into a high-dimensional space and use the vector \(e_i^\tau \) to denote the embedded trajectory (token embedding) of pedestrian i at time \(\tau \):

$$\begin{aligned} e_i^{\tau }=\phi (\varDelta x_i^{\tau }, \varDelta y_i^{\tau };W^e) \end{aligned}$$
(5)

where \(\phi \) is an embedding function and \(W^e\) is the embedding weight.
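As a hedged sketch (assuming a PyTorch implementation; the paper does not publish code), Eqs. (3)-(5) amount to a frame-to-frame difference followed by a linear projection; the embedding dimension of 32 follows the implementation details in Sect. 3.2, and the layer name is ours:

```python
import torch
import torch.nn as nn

phi = nn.Linear(2, 32)  # phi(.; W^e): projects (dx, dy) into a 32-d token embedding

traj = torch.randn(12, 8, 2)        # [N, T_obs, 2] absolute xy-coordinates
delta = traj[:, 1:] - traj[:, :-1]  # Eqs. (3)-(4): relative displacements
                                    # (T_obs - 1 steps, since the first frame has no predecessor)
e = phi(delta)                      # Eq. (5): token embeddings, [N, T_obs - 1, 32]
```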

Position embedding provides the temporal information for every pedestrian from \(t_0\) to \(t_{obs}\): since the Transformer, unlike the LSTM, does not consume its input in chronological order, it needs a positional embedding to encode timing information. For example, \(t_0\) is represented by the number 0 and \(t_1\) by the number 1. The Transformer uses a sinusoidal formula PE to compute the position embedding at each time \(\tau \), taking the position \(pos\in \{1:T\}\) and the model dimension index i as inputs:

$$\begin{aligned} p_i^{\tau }=PE(pos,i) \end{aligned}$$
(6)

Through this function we can calculate a fixed position embedding for each sequence at each time \(\tau \). In this paper, we use \(p_i^{\tau }\) to denote the position embedding that is input to the Transformer. In summary, we define the input of the encoder module as:

$$\begin{aligned} input_i^{\tau }=e_i^{\tau }+p_i^{\tau } \end{aligned}$$
(7)

The Transformer encoder then encodes this input and generates the hidden vectors:

$$\begin{aligned} y_i^{t_{obs}}=Trans_{encoder}(input_i;W_{trans_{encoder}}) \end{aligned}$$
(8)
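A minimal sketch of Eqs. (6)-(8), assuming PyTorch's built-in Transformer encoder: the sinusoidal PE follows [24], and the model dimension of 64, feed-forward dimension of 128, 6 layers, and 4 heads follow our implementation details; here we assume the 32-d token embedding has already been projected to the model dimension (e.g., by another linear layer), which is our simplification:

```python
import math
import torch
import torch.nn as nn

def positional_embedding(T: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal PE(pos, i) from [24], returned as [T, d_model]."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)   # [T, 1]
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

d_model = 64
e = torch.randn(12, 7, d_model)                       # token embeddings (Eq. 5), already in d_model
inp = e + positional_embedding(e.shape[1], d_model)   # Eq. (7): input = e + p

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                   dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
y = encoder(inp)                                      # Eq. (8): hidden states, [N, T, d_model]
```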

Pedestrian Interaction Modeling. Naively using a single Transformer module does not capture the interactions between people. However, accurate trajectory prediction must take the social relationships into account. For example, when pedestrians appear at different positions in a crowd, their focus also differs: people walking along the road concentrate on vehicles and traffic lights, while pedestrians walking in a crowd pay close attention to the movement of the pedestrians in front of them. Therefore, we want every pedestrian in the scene to pay more attention to the people who have the most significant influence on him/her.

To achieve this, we use an attention mechanism to model the social connections, taking the encoded historical trajectories of the pedestrians in a scene as the input of the Social-Attention module. We first establish a relationship matrix \(M\in \mathbb {R}^{N\times {N}}\) over the pedestrians, where each pedestrian is represented by his/her encoded historical trajectory and the entry in the ith row and jth column is the relationship weight between pedestrians i and j. Let \(y_i\) be the encoded trajectory of pedestrian i and \(y_{1:N/i}\) the encoded trajectories of all other pedestrians:

$$\begin{aligned} y^{'}_i=ATT_{social}(y_{1:N/i}^{\tau };W_{social}),\, \tau \in \{t_{0}:t_{obs}\} \end{aligned}$$
(9)

Using the attention mechanism, each pedestrian attends to the other pedestrians in the scene and the influence of each of them is calculated, generating the weight matrix M. Each pedestrian's encoded historical trajectory is then updated with these weights, so that it also incorporates the social relations. We next use the updated vector \(y_i^{'}\) to represent every pedestrian and decode his/her future trajectory.
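As a sketch under our own assumptions (scaled dot-product attention with learned projections, and a diagonal mask so each pedestrian attends only to the others, matching \(y_{1:N/i}\) in Eq. (9)), the Social-Attention module could look as follows; the layer names and sizes are ours, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SocialAttention(nn.Module):
    """Attention over the N pedestrians of one scene (a sketch of Eq. 9)."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: [N, d_model], one encoded vector per pedestrian (e.g. the last encoder state)
        scores = self.q(y) @ self.k(y).T * self.scale   # pairwise influence scores
        mask = torch.eye(y.shape[0], dtype=torch.bool)  # exclude self-attention (y_{1:N/i})
        M = F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)  # relationship matrix, [N, N]
        return M @ self.v(y)                            # y': socially updated encodings

social = SocialAttention()
y_prime = social(torch.randn(12, 64))  # N = 12 pedestrians in a scene
```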

Future Trajectory Prediction. For trajectory prediction, we want our model to output a plausible trajectory distribution rather than a single exact location. Most previous works [1, 3, 25] reflect this uncertainty by predicting the parameters of a Gaussian distribution and then sampling the exact future location from it. In this paper, we take the same approach to compute the loss during training.

Similar to the encoder module used for historical trajectory modeling, the prediction module is based on the decoding part of the Transformer. This module takes the preceding vector, which includes the historical trajectory and social relationship information, as input and outputs the parameters of a two-dimensional Gaussian distribution for \(t_{obs+1}\) to \(t_{pred}\). We then sample the specific coordinate values from this Gaussian distribution and compute a negative log-likelihood loss against the ground truth.

$$\begin{aligned} result_{i}^{t_{obs+1}:t_{pred}}=Trans_{decoder}(y^{'}_i;W_{trans_{decoder}}) \end{aligned}$$
(10)

where \(W_{trans_{decoder}}\) denotes the parameters of the decoder module and \(y_i^{'}\) is the vector encoding the historical trajectory and social interactions of pedestrian i.
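The following sketch shows how such a bivariate-Gaussian head and the negative log-likelihood loss can be implemented. Our assumptions, not the paper's specification: a linear projection to the five parameters \((\mu _x,\mu _y,\sigma _x,\sigma _y,\rho )\), with exp/tanh transforms to keep the scales positive and the correlation in \((-1,1)\):

```python
import math
import torch
import torch.nn as nn

head = nn.Linear(64, 5)  # decoder state -> (mu_x, mu_y, log sigma_x, log sigma_y, rho_raw)

def bivariate_nll(params: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # params: [N, T_pred, 5]; target: [N, T_pred, 2] ground-truth xy-coordinates
    mu = params[..., :2]
    sigma = torch.exp(params[..., 2:4])   # positive standard deviations
    rho = torch.tanh(params[..., 4])      # correlation in (-1, 1)
    d = (target - mu) / sigma             # normalized offsets
    z = d[..., 0] ** 2 + d[..., 1] ** 2 - 2 * rho * d[..., 0] * d[..., 1]
    log_norm = (torch.log(sigma[..., 0]) + torch.log(sigma[..., 1])
                + 0.5 * torch.log(1 - rho ** 2) + math.log(2 * math.pi))
    return (z / (2 * (1 - rho ** 2)) + log_norm).mean()

loss = bivariate_nll(head(torch.randn(12, 12, 64)), torch.randn(12, 12, 2))
```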

Implementation Details. Since our datasets are relatively small, we use a lightweight Transformer with few parameters. We use an embedding of dimension 32 for the token embedding before adding the positional information at the input of the Transformer module. We set the parameters of the Transformer module as follows: the model dimension is 64, the fully connected layer dimension is 128, the number of layers is 6, the query dimension is 24, the value dimension is 24, and the number of attention heads is 4. During training, we use a learning rate of 0.001 and the Adam optimizer. The Social-Transformer model was trained on a single GPU.
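For reference, these hyperparameters can be collected in a single configuration sketch (the key names are illustrative; the values are taken from the paragraph above):

```python
config = {
    "token_embed_dim": 32,   # token embedding dimension
    "d_model": 64,           # Transformer model dimension
    "dim_feedforward": 128,  # fully connected layer dimension
    "num_layers": 6,         # encoder/decoder layers
    "num_heads": 4,          # multi-head attention heads
    "d_query": 24,           # query dimension
    "d_value": 24,           # value dimension
    "lr": 1e-3,              # learning rate (Adam optimizer)
}
```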

4 Experiments

In this section, we first introduce two datasets that are commonly used in pedestrian trajectory prediction. Then we compare our model's performance against other baselines on these datasets. Finally, we present a qualitative analysis of our model in several representative situations.

4.1 Datasets and Evaluation Metrics

We compare against other baselines on two public pedestrian datasets, ETH [19] and UCY [13], which include the trajectories of pedestrians in open scenes and the social relationships within the crowd. The trajectories are sampled every 0.4 s. The ETH dataset is divided into two scenes, ETH and HOTEL, and the UCY dataset is divided into three scenes, ZARA01, ZARA02, and UNIV; in total, we train and test our model on these five scenes. Following Social-LSTM [1], we observe trajectories of 3.2 s (8 frames) and predict the trajectories of the next 4.8 s (12 frames). These datasets include many challenging behaviors: sudden turns, gathering together, suddenly disappearing from certain scenes, etc.

We use two metrics to measure the model's performance: the average displacement error (ADE) [18] and the final displacement error (FDE) [1]. ADE measures the average prediction error over the whole predicted trajectory, and FDE measures the prediction error at the final endpoint. The output of our model is a two-dimensional Gaussian distribution; by sampling from it, we calculate the ADE and FDE between the sampled xy-coordinates and the ground truth.

$$\begin{aligned} ADE=\frac{\sum _{i=1}^{N}\sum _{\tau =t_{obs+1}}^{t_{pred}}\Vert \hat{traj}_{i}^{\tau }-traj_{i}^{\tau }\Vert _{2}}{N\times (t_{pred}-t_{obs})} \end{aligned}$$
(11)
$$\begin{aligned} FDE=\frac{\sum _{i=1}^{N} \Vert \hat{traj}_{i}^{t_{pred}}-traj_{i}^{t_{pred}}\Vert _{2}}{N} \end{aligned}$$
(12)
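Under the same tensor layout as before ([N, T_pred, 2] predictions and ground truth), Eqs. (11) and (12) can be computed directly; this is a sketch, not the paper's evaluation code:

```python
import torch

def ade(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Eq. (11): mean L2 error over all pedestrians and all predicted time steps."""
    return (pred - gt).norm(dim=-1).mean()

def fde(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Eq. (12): mean L2 error at the final predicted time step t_pred only."""
    return (pred[:, -1] - gt[:, -1]).norm(dim=-1).mean()
```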

4.2 Quantitative Analysis

In this section, we present the quantitative analysis of our model.

Table 1. Quantitative results of all the baselines and our model. All models take 8 frames as input and predict the next 12 frames. Our model achieves the best performance on the ETH, HOTEL, and UNIV datasets on both ADE and FDE.

Performances on Datasets. Table 1 compares the performance of Social-Transformer with the other models on the ADE/FDE metrics. Overall, our model performs best on the ETH, HOTEL, and UNIV datasets on both ADE and FDE. For the ADE metric, the strongest baseline, i.e., the one with the lowest average prediction error, is STGAT. Our model nevertheless makes significant progress over STGAT, especially on the ETH dataset, where the error is reduced by 40%. On the HOTEL and UNIV datasets, our model also reduces the error by 2% to 10%. For the FDE metric, the results show that our model outperforms all compared methods on all datasets; compared with STGAT, it reduces the error by more than 50% on most datasets.

These results show that our model has advantages over the other methods, especially on the final displacement error. They indicate that Social-Transformer is more stable when predicting the sequence: the deviation does not grow with the length of the predicted sequence.

Ablation Study. In this section, our purpose is to test the rationality of the social modeling. Table 2 shows the ADE and FDE values on the five datasets with and without social modeling in our model. With social modeling, the ADE on the five datasets is reduced by about 9.25% to 12% compared with the model without it. The FDE metric changes even more noticeably: on almost every dataset, the final displacement error drops by more than 15%. On the ZARA01 dataset, the improvement in FDE is substantial, from 0.89 to 0.55. These improvements show that social interaction modeling plays an important role in predicting the future locations of pedestrians, especially on the FDE metric.

Table 2. Ablation study of the Social-Transformer model. The study is conducted on the five datasets.
Fig. 3. Visualization results of our model. The triangular points represent the historical trajectory of a pedestrian, the cross points represent the actual future trajectory, and the circular points represent the predictions of our model. We chose four representative scenes, (a)-(d).

4.3 Qualitative Analysis

Visualization results are shown in Fig. 3 (see the caption for the plotting conventions). We chose four representative scenes from all the results.

Figure 3(a) shows that our model successfully predicts the turning behavior of a pedestrian. In this scene there is only one pedestrian; mainly by judging his historical trajectory, our Social-Transformer model correctly predicts the right turn.

Figure 3(b) shows a complex social behavior: two pedestrians walking side by side. In this case, the target pedestrian obviously pays more attention to his nearest companion than to others. Our model notices this and predicts similar paths for them.

Figure 3(c) shows a simple scene, where our model accurately predicts the future trajectory and direction of the pedestrians.

Figure 3(d) shows a complex social scene. In this swarming and collision-avoidance scene, the target pedestrians show a strong tendency to separate their walking directions, and our model successfully predicts their diverging paths.

5 Conclusions

We have presented an encoder-decoder based model that learns both the historical trajectories of pedestrians and their social interactions. First, we use a Transformer-based encoder to model the past trajectory of each pedestrian. Then we share information among all pedestrians in a scene, letting every pedestrian notice the others and computing the influence weights among them. Next, we update the vector representing each pedestrian so that it combines the historical trajectory and social interaction information. Finally, we use the decoder to take the encoded vector as input and compute the deviation between the decoder output and the ground truth. Our model outperforms state-of-the-art methods on two publicly available datasets, and we also show the necessity of the social interaction modeling module. Based on the Transformer, we can perform parallel computation and reduce time consumption, which keeps the prediction error of long sequences small; it does not deviate further as time increases. At the same time, our model requires less computation and runs faster, so it can meet the real-time requirements of autonomous driving systems. In future work, since the complex traffic environment also affects pedestrian prediction, we will extend our work to multi-class agents, such as bicycles, vehicles, and pedestrians, that attend to each other, to accomplish more complicated social interaction modeling. In addition, visual information is also very important for prediction: many pedestrians decide their next step based on the surrounding environment. We will therefore consider the influence of the environment, including traffic lights, road type, and weather.