1 Introduction

Future path prediction, which aims to forecast the trajectories of multiple agents over the next few seconds, has received considerable attention in the multimedia community [27, 32]. It is a fundamental problem in a variety of applications such as autonomous driving [7], long-term object tracking [21], monitoring, and robotics. Recently, Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, have demonstrated promising performance in modeling trajectory sequences [1, 14].

Trajectory prediction is difficult because of two intrinsic properties: (1) Pedestrians walking in public often interact with each other and change their paths to avoid collisions or to overtake others. (2) A pedestrian may follow any of several viable trajectories that avoid a collision, such as moving to the left, moving to the right, or stopping. Because of these properties, some trajectory prediction methods model social interactions [1, 11, 23, 35]. Other methods feed additional semantic maps of the surroundings into the model so that the predicted trajectories comply with the restrictions in the maps [13, 25, 26]. Given the same observed trajectories of the agents in a scene, a pedestrian may take different trajectories depending on their latent goals. Because the future is uncertain, many studies focus on multiple-future trajectory prediction [11, 14, 18]. Some recent works also propose probabilistic multi-future trajectory prediction [23, 28], which provides useful quantitative results. Moreover, some vision-based methods apply raw camera data and LiDAR point cloud data directly for interaction modeling and trajectory prediction [12, 22].

In practice, one can drive a car equipped with a LiDAR [17] or use overhead surveillance cameras on open roads to collect as many trajectories of road users as possible. However, not all trajectory data are helpful for training a robust and accurate prediction model. Some observed trajectories are noisy because of the movement of the car, and the trajectories in some scenes are relatively simple because the road users move at constant speeds. In this study, we propose a data-efficient active learning method for trajectory prediction that selects a compact, less noisy, and more informative training set from all the observed trajectory data. To the best of our knowledge, this is the first work to consider the trajectory prediction problem from the perspective of the trajectory samples. The main contributions of this study are summarized as follows.

  • This study proposes a simple and efficient active learning strategy that removes noisy and redundant trajectory data to obtain a compact and informative training set. The storage and computation costs of the training stage are greatly reduced.

  • Our proposed method can actively learn from streaming trajectory data incrementally and efficiently. To the best of our knowledge, it is the first work that considers the value of individual trajectory data in the trajectory prediction task.

  • Our proposed active prediction method achieves better performance than previous state-of-the-art methods with a much smaller training set on five public pedestrian trajectory datasets.

2 Related Work

With the rapid development of deep learning, various trajectory prediction methods have been proposed. Social-LSTM [1] is one of the earliest methods to apply a Recurrent Neural Network to pedestrian trajectory prediction. In Social-LSTM, a pooling layer is designed to share human-human interaction features among pedestrians. Later works [4, 19, 30, 34] followed this pattern and designed different approaches for passing information about human-human interactions. Instead of producing only one deterministic trajectory for each pedestrian, Generative Adversarial Network (GAN)-based methods [2, 9, 13, 16, 25, 29] have been designed to predict multiple plausible trajectories. Moreover, auto-encoder-based methods [5, 20] have been developed that encode important features of pedestrians and then make predictions with a decoder. Owing to the success of the Transformer structure [29] in sequence processing [6], recent works [10, 33] utilize this structure for pedestrian trajectory prediction and achieve competitive performance. However, these methods indiscriminately use all the available trajectory data to learn movement patterns for future trajectory prediction. We argue that not all available data are useful or meaningful during training, and that blindly using such a large amount of data could damage model performance, not to mention the expensive computation and time costs.

3 Model

In this section, we first give the problem definition and then introduce our proposed model in detail. The pipeline of our IAL-TP method is illustrated in Fig. 1.

Fig. 1. The structure of our proposed IAL-TP model. At first, the model is initialized with samples from the base pool. At each iteration, a subset is selected out of the candidate pool based on the model inference and then added into the active set. Once the number of trajectory samples in active set meets the requirement, retrain the model with samples from the collected active set.

3.1 Construct Two Pools

In real-world applications, trajectory data are easy to collect automatically with sensors such as LiDAR and cameras. Previous methods save all the collected trajectories and then train their models on them. However, the raw trajectory data are noisy and redundant, and it is unnecessary to save and train on all of them. Some collected trajectories fluctuate drastically over time, and some are straight lines traversed at constant speed. The fluctuating trajectories are noisy, while the straight, constant-speed trajectories are redundant and too easy for the prediction model. To address this issue, we begin by constructing two disjoint pools: a base pool \(\mathcal {P}^{b}\) with only a small amount of trajectory data for initial model learning, and a candidate pool \(\mathcal {P}^{c}\) with the remaining trajectory data for incremental active learning. Following the above problem formulation, the whole training trajectory data is \(\varGamma ^{obs}=\{\varGamma _{i}^{obs}|\forall i \in \{1,2,...,N\}\}\). We define the base pool \(\mathcal {P}^{b}=\{\varGamma ^{obs}_j|\forall j\in \mathcal {S}^{b} \}\) and the candidate pool \(\mathcal {P}^{c}=\{\varGamma ^{obs}_k|\forall k\in \mathcal {S}^{c}\}\), where \(\mathcal {S}^{b}\) and \(\mathcal {S}^{c}\) are two disjoint index subsets with \(\mathcal {S}^{b}\cup \mathcal {S}^{c}=\{1,2,...,N\}\). We randomly select \(\lambda \%\) of the whole training data as the base pool \(\mathcal {P}^{b}\) and set \(\lambda =5\) in our work.
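The pool construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `split_pools`, the `seed` argument, and the rounding rule are hypothetical, and indices are 0-based here whereas the text uses \(1,...,N\).

```python
import random


def split_pools(n_samples, lam_percent=5.0, seed=0):
    """Randomly split sample indices into a base index set and a candidate index set.

    Hypothetical helper: the text only states that lambda% of the data is
    drawn at random for the base pool; the seed and rounding are assumptions.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Keep at least one sample in the base pool so an initial model can be trained.
    n_base = max(1, round(n_samples * lam_percent / 100.0))
    base = sorted(indices[:n_base])        # plays the role of S^b
    candidate = sorted(indices[n_base:])   # plays the role of S^c
    return base, candidate


base_idx, cand_idx = split_pools(1000, lam_percent=5.0)
print(len(base_idx), len(cand_idx))  # 50 950
```

By construction the two index lists are disjoint and together cover every sample, which mirrors the condition \(\mathcal {S}^{b}\cup \mathcal {S}^{c}=\{1,2,...,N\}\).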

3.2 Incremental Active Learning

Algorithm 1. Incremental active learning.

Instead of learning from all the trajectory samples, we propose an active learning method that incrementally selects “worthy” trajectory data from the candidate pool and merges them into the active set \(\mathcal {P}^{a}\), and then uses these more valuable samples for model learning. Suppose we expect to select \(p\%\) of the trajectory samples from the candidate pool; the expected number of trajectory samples in the active set, \(\hat{N}^{a}\), is then defined as follows:

$$\begin{aligned} \hat{N}^{a} = p\% \times N \end{aligned} \tag{1}$$

At each iteration, we run inference on all the trajectory samples in the candidate pool \(\mathcal {P}^{c}\), rank these samples based on their inference errors, and choose the median ones as the subset \(\varDelta \mathcal {P}^{c}\). According to [3], the larger the error, the noisier the sample is likely to be, and the smaller the error, the more easily the sample can be learned. Therefore, our selection strategy is to merge the samples with median errors into the active set. These median samples contain less noise than those with larger errors and, at the same time, are more representative than those with smaller errors. In the experiments, we explore different selection strategies to show the effectiveness of the proposed one. Note that the subset is removed from the candidate pool once selected. We iterate the above selection steps until the number of samples in the active set, \(N^{a}\), equals \(\hat{N}^{a}\). The overall incremental active learning method is given in Algorithm 1, followed by a detailed explanation.

Beginning with the untrained model \(M^{0}\), the whole set of observed trajectories \(\varGamma ^{obs}\), and the hyper-parameter \(\lambda \%\), we first construct the base pool and the candidate pool (line 1), and then train an initial model \(M^{b}\) with the trajectory samples from the base pool \(\mathcal {P}^{b}\) (line 2). Before starting the iterations, we define an empty active set \(\mathcal {P}^{a}\) and the number \(N^{a}\) of trajectory samples in \(\mathcal {P}^{a}\). We also calculate the expected number \(\hat{N}^{a}\) of trajectory samples for final model learning (line 3). At each iteration, we first run inference on all the samples in the candidate pool with model \(M^{b}\) and obtain their errors (line 5). Then, we select a batch of samples with median errors from the candidate pool as the subset \(\varDelta \mathcal {P}^{c}\) (line 6). Afterwards, the model \(M^{b}\) is fine-tuned on the subset \(\varDelta \mathcal {P}^{c}\) to ensure that the model learns this subset well and thus avoids selecting similar samples at the next iteration (line 7). Finally, we update \(\mathcal {P}^{c}\) and \(\mathcal {P}^{a}\) and recalculate the number \(N^{a}\) of samples in \(\mathcal {P}^{a}\) (lines 8 and 9). When \(N^{a}\) equals the expected \(\hat{N}^{a}\), the iteration finishes, and we obtain an active set \(\mathcal {P}^{a}\) containing more valuable and representative trajectory samples. We then retrain the model \(M^{0}\) on the active set until convergence and return the final model \(M^{a}\).
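The loop described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the callables `train`, `fine_tune`, and `sample_error`, the `batch_size` argument, and the toy scalar "model" are all assumptions introduced so that the example runs on its own. A real system would plug in the prediction backbone and its inference error instead.

```python
def median_batch(errors, k):
    """Return the k sample ids in the middle of the ascending error ranking."""
    ranked = sorted(errors, key=errors.get)
    start = (len(ranked) - k) // 2
    return ranked[start:start + k]


def incremental_active_learning(samples, base_idx, cand_idx, p_percent,
                                batch_size, train, fine_tune, sample_error):
    """Build an active set of about p% of N samples, then retrain from scratch."""
    # Expected active-set size, following Eq. (1); capped by the candidate pool size.
    n_target = min(round(len(samples) * p_percent / 100.0), len(cand_idx))
    model = train([samples[i] for i in base_idx])   # initial model from the base pool
    candidate, active = list(cand_idx), []
    while len(active) < n_target and candidate:
        # Infer every remaining candidate and record its error.
        errors = {i: sample_error(model, samples[i]) for i in candidate}
        k = min(batch_size, n_target - len(active), len(candidate))
        chosen = median_batch(errors, k)
        # Fine-tune on the chosen batch so similar samples rank as easy next time.
        model = fine_tune(model, [samples[i] for i in chosen])
        chosen_set = set(chosen)
        candidate = [i for i in candidate if i not in chosen_set]
        active.extend(chosen)
    # Retrain from scratch on the active set only.
    return train([samples[i] for i in active]), active


# Toy stand-ins (assumptions): the "model" is a single number, the mean of its data.
def train_mean(batch):
    return sum(batch) / len(batch)


def blend_mean(model, batch):
    return 0.5 * model + 0.5 * sum(batch) / len(batch)


def abs_error(model, x):
    return abs(x - model)


samples = [float(v) for v in range(100)]
base_idx = list(range(0, 100, 20))
cand_idx = [i for i in range(100) if i not in base_idx]
model, active = incremental_active_learning(
    samples, base_idx, cand_idx, p_percent=50, batch_size=10,
    train=train_mean, fine_tune=blend_mean, sample_error=abs_error)
print(len(active))  # 50
```

Passing the training, fine-tuning, and error functions as arguments keeps the selection logic independent of any particular backbone, which is one possible design and not something the text prescribes.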

3.3 Backbone for Prediction

In order to make accurate trajectory predictions, we utilize our previous state-of-the-art framework [31], which is able to extract global spatial-temporal feature representations of pedestrians for future trajectory prediction.

4 Experiments

We report experimental results on two public datasets, ETH [24] and UCY [15], which together comprise five scenes (ETH, HOTEL, UNIV, ZARA1, and ZARA2). We observe 8 frames and predict the next 12 frames of each trajectory.

4.1 Evaluation Metrics

As in other baselines, we use two evaluation metrics: Average Displacement Error (ADE) and Final Displacement Error (FDE).
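For reference, the two metrics can be computed as in the sketch below, which follows their standard definitions: ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and FDE is that distance at the final predicted step. The function name `displacement_errors` and the list-of-tuples input format are assumptions for illustration. Averaging over pedestrians is omitted for brevity.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for one pedestrian.

    pred, truth: equal-length lists of (x, y) positions over the prediction horizon.
    """
    if not pred or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and of equal length")
    # Euclidean distance at every predicted time step.
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(dists) / len(dists)  # average over the horizon
    fde = dists[-1]                # error at the last predicted step
    return ade, fde


truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pred = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]
print(displacement_errors(pred, truth))  # (1.0, 2.0)
```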

Table 1. Experimental results of our proposed IAL-TP model with different \(p\%\). We use four different \(p\%\) to evaluate the influence of different numbers of trajectory samples that participate in the model training phase.

Table 2. Experimental results of baselines compared to our proposed model. Original Social-STGCNN model is a probabilistic model and we try our best to adapt it to a deterministic model.

Fig. 2. Model performance versus part of the training trajectory samples. The x-axis shows the percentage of trajectory samples used. The y-axis shows the corresponding ADE/FDE error.

4.2 Data Efficiency and Model Robustness

In our work, the most important hyper-parameter is the percentage \(p\%\), which determines the number \(\hat{N}^{a}\) of trajectory samples in the active set, i.e., the number of trajectory samples learned by the model in the training phase. Note that \(p\%=100\%\) means that all trajectory samples in the candidate pool are used for model learning, which is the same as in the existing baselines. Table 1 shows the performance of our proposed model with different \(p\%\).

We can observe that when \(p\%=50\%\), namely when using only half of the trajectory samples from the candidate pool, our proposed IAL-TP model achieves the best performance, with the smallest error. This indicates that the datasets contain many redundant trajectory samples, and it also validates the necessity of our active learning idea. Specifically, the ADE on the ETH dataset is the same as the result with \(p\%=100\%\), while the ADE on the HOTEL dataset with \(p\%=50\%\) outperforms the result with \(p\%=100\%\) by \(19.2\%\), which is a significant improvement. One possible reason is that the ETH and HOTEL datasets are relatively more crowded than the other datasets [1]. This suggests that the more training data there are, the more necessary and effective active learning becomes.

For comparison, Fig. 2 shows the results of several existing methods trained with only part of the training trajectory samples. Note that the original Social-GAN [11] and Social-STGCNN [23] are probabilistic models, and we adapt them to deterministic models. We can observe that our model consistently outperforms Social-GAN and Social-STGCNN on both the ADE and FDE metrics when the training trajectory data are reduced. In addition, for the same increase in \(p\%\), our model shows the smallest reduction in both ADE and FDE, which validates the robustness of our proposed model. Unless otherwise indicated, we set \(p\%=50\%\) in the following sections.

4.3 Selection Strategy

To validate the effectiveness of our proposed “Median” selection strategy (introduced in Sect. 3.2), we design two other selection strategies for comparison. One selects the samples with the largest errors and is referred to as “Max”; the other selects the samples with the smallest errors and is referred to as “Min”. Table 3 shows the results of the three selection strategies. We can observe that the “Median” strategy outperforms the other two. This supports the view that samples with median inference errors are more representative. As discussed in Sect. 3.2, samples with larger errors are more likely to be noisy, which has a negative influence on model learning. In addition, samples with smaller errors are more likely to be too “easy” for model learning, which makes them less valuable. Thus, our proposed “Median” selection strategy is more appropriate when seeking the “worthy” trajectory samples.
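The three strategies differ only in which part of the error ranking they take. The sketch below makes that concrete. It is an illustration under assumptions: the function name `select_by_strategy`, the lowercase strategy labels, and the choice of a centered window for “Median” are not specified in the text.

```python
def select_by_strategy(errors, batch_size, strategy="median"):
    """Pick sample indices from a list of per-sample inference errors.

    "min" takes the smallest errors, "max" the largest, and "median" a
    window centered in the ascending ranking (an assumed reading of "median").
    """
    ranked = sorted(range(len(errors)), key=lambda i: errors[i])  # ascending error
    k = min(batch_size, len(ranked))
    if strategy == "min":
        return ranked[:k]
    if strategy == "max":
        return ranked[len(ranked) - k:]
    if strategy == "median":
        start = (len(ranked) - k) // 2
        return ranked[start:start + k]
    raise ValueError(f"unknown strategy: {strategy}")


errors = [0.9, 0.1, 0.5, 0.3, 0.7, 2.0, 0.05]
print(select_by_strategy(errors, 3, "min"))     # [6, 1, 3]
print(select_by_strategy(errors, 3, "median"))  # [3, 2, 4]
print(select_by_strategy(errors, 3, "max"))     # [4, 0, 5]
```

In this toy ranking, “Max” picks up the outlier with error 2.0 (a likely noisy sample), while “Min” picks the samples the model already predicts almost perfectly.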

Table 3. Experimental results (ADE/FDE) of our proposed IAL-TP model with three different strategies, “Max”, “Median”, and “Min”.

4.4 Quantitative Analysis

Table 2 shows the quantitative results of our proposed model and the baselines. Overall, the IAL-TP model outperforms all the baselines on both metrics with only half of the training trajectory samples. The ADE improves by \(14.8\%\) compared to TF-based, and the FDE improves by \(12.0\%\) compared to TPNet. Specifically, the ADE of our IAL-TP model on the ETH dataset is 0.56, an improvement of \(42.8\%\) over Social-STGCNN, and the FDE is 1.04, an improvement of \(48.3\%\) over TPNet. This validates the necessity of our active learning idea for the pedestrian trajectory prediction problem. It also shows that the active set selected by our strategy is a compact and representative training set.
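The text does not state how the improvement percentages are computed. A common convention, assumed here, is the relative reduction in error with respect to the baseline, where \(e_{\text{base}}\) and \(e_{\text{ours}}\) are hypothetical symbols for the baseline error and our error on the same metric:

$$\begin{aligned} \text{improvement} = \frac{e_{\text{base}} - e_{\text{ours}}}{e_{\text{base}}} \times 100\% \end{aligned}$$

Under this convention, a positive value means that our error is lower than the baseline error, and the value is bounded above by \(100\%\).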

5 Conclusion

In this paper, we propose a novel trajectory prediction model via incremental active learning (IAL-TP). In this model, we design a simple and effective method to iteratively select more valuable and representative trajectory samples for model learning, which can filter out noisy and redundant samples. This incremental active learning method greatly improves the generalization ability and the robustness of the model. Experimental results on five public datasets validate the effectiveness of our model. Additionally, it can achieve better performance than the state-of-the-art methods with only a small fraction of the whole training data.