1 Introduction

In recent years, although domestic private car ownership has increased substantially, taxis are playing an increasingly significant role in urban transportation due to the rapid development of the sharing economy [1]. In 2020, the taxi vacancy rate in Beijing rose to nearly 40%, which caused a serious waste of resources and aggravated traffic congestion in the city. A key step toward solving this problem is therefore to accurately predict the demand for taxis in each region.

Traffic forecasting has received extensive attention and research in the past few decades [20, 21]. Traffic prediction models can be divided into parametric and non-parametric models. Parametric models derive their parameters from an assumed data distribution; typical examples include time series models [2], the autoregressive integrated moving average (ARIMA) model [3], linear regression models [4], and Kalman filtering models [5]. Parametric models have relatively simple structures and specify the learning method in the form of a fixed function, so they depend strictly on the stationarity hypothesis and cannot reflect the uncertainty and non-linearity of the traffic state. Data-driven non-parametric models, including traditional machine learning and deep learning approaches, can effectively overcome these problems: without prior knowledge, they can still fit a wide range of functional forms given enough historical data. Traditional machine learning models include k-nearest neighbors [6], decision trees [7] (such as CART and C4.5), naive Bayes [8], support vector machines [9], and neural networks [10]. Since traditional machine learning methods usually require complex feature engineering and are not suitable for processing large datasets, deep learning algorithms have emerged. Common deep learning models include auto-encoders [11], generative adversarial networks [12], convolutional neural networks (CNN) [13], and recurrent neural networks (RNN) [14]. Although the RNN is designed for time-series problems and the CNN can extract spatial correlation, both are only suitable for structured data. In particular, a CNN divides the traffic network into a two-dimensional regular lattice, which destroys the original structure of the traffic network.

Therefore, we design a temporal attention-based graph convolutional network (TAGCN) to predict taxi demand in each functional area. The model processes traffic data directly on graphs, effectively captures complex spatial-temporal correlations, and predicts local peaks and the values at the start and end points of the data more accurately. The main contributions of this paper are summarized as follows:

  • We propose a new temporal graph convolutional network model based on the attention mechanism, which embeds attention to highlight the characteristics of the traffic data. The spatial-temporal convolution component combines a graph convolutional network (GCN) and a temporal convolutional network (TCN) to capture the spatial-temporal correlation of the traffic data.

  • Departing from the uniform lattice division used in previous work, we divide the city into multiple lattices of different sizes according to the function of each area and convert the spatial layout into a graph structure.

  • Three real-world datasets are utilized in the experiments. Compared with 7 state-of-the-art baselines, TAGCN achieves the best prediction results.

2 Related Work

A large number of deep learning methods are widely used in traffic prediction problems, without requiring hand-crafted features or cross-domain knowledge sharing. For example, the Contextualized Spatial-Temporal Network for Taxi Origin-Destination Demand Prediction (CSTN) [15] uses a CNN to learn local spatial dependence and ConvLSTM to model the change of taxi demand over time. The Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction (DMVST-Net) [16] uses a CNN to capture spatial proximity and a Long Short-Term Memory (LSTM) network to model the time series. In Revisiting Spatial-Temporal Similarity: A Deep Learning Framework for Traffic Prediction (STDN) [17], short-term temporal dependence is obtained by an LSTM, temporal periodicity over the previous days is captured by attention, and the spatial relationship between adjacent regions is captured by a CNN.

Although LSTM performs well on several sequence problems (such as speech/text recognition [18] and machine translation [19]), the network can only handle one time step at a time, and each step must wait for the previous one to complete. This means that LSTM cannot exploit large-scale parallel computing. Moreover, the data processed by LSTM and CNN belong to Euclidean space and have a regular structure. In real life, however, there are many irregular data structures, such as social networks, chemical molecular structures, and knowledge graphs, whose topology greatly degrades the prediction performance of RNN and CNN. At the same time, the above works divide the city into regions of uniform size, which cannot distinguish the individual functional areas well.

Methods for processing non-Euclidean spatial data include the graph convolutional network (GCN) [22], graph neural network (GNN) [23], DeepWalk (online learning) [24], node2vec [25], etc. The typical model, GCN, plays the same role as CNN as a feature extractor. However, since the spatial relationship of a traffic network is a topological structure and therefore non-Euclidean, GCN can handle the graph data and extract the spatial features of the topology better than CNN. The extracted features can be used in graph classification, node classification, link prediction, and graph embedding, which fully demonstrates the ability of GCN to process highly nonlinear data in non-Euclidean space.

Therefore, we select GCN as the internal component to extract the correlation of the input spatial data. For the time-series data, a temporal convolutional network (TCN) [28] is utilized to obtain the temporal dependence, while the characteristics of the traffic demand data are highlighted by a temporal attention mechanism [34]. In terms of functional areas, we divide the city into multiple lattices of different sizes. This work aims to predict the taxi demand of each area at the \({(t+1)}^{th}\) time interval, given the historical data of the previous \(t\) time intervals.
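Formally, the task can be stated as follows (the notation here is ours, introduced for clarity): let \(A\) be the distance-based adjacency matrix over the lattices and \(X_i\) the vector of taxi demand in all lattices during the \(i^{th}\) interval; we seek a mapping \(f\) with

$$\hat{X}_{t+1} = f\left(A; X_{1}, X_{2}, \dots , X_{t}\right)$$

where \(\hat{X}_{t+1}\) is the predicted demand in the next interval.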

3 Algorithm Design

3.1 Data Design

We extract information about passengers boarding taxis from the three original datasets, draw a scatter plot based on the extracted data, determine the area with the densest data, and select all the data in that area. To minimize the differences between the datasets, the final area of each of the three datasets is about 22.2 km × 9.16 km.

According to the functionality of each region, we divide the cities in the three datasets into four types of areas: school district, recreation area, residential area, and business district. The city is then partitioned into lattices according to the distribution of functional areas; in the end, 60 lattices of different sizes are obtained and numbered. To determine the scope of each lattice, we extract the boundaries of all lattices one by one, determine the lattice containing the starting position of each taxi trip, and compute the center of each lattice. The latitude and longitude of the center point are calculated as shown in Eq. (1) and Eq. (2), respectively,

$$\varphi = \frac{{\varphi }_{1}+ {\varphi }_{2}}{2}$$
(1)
$$\lambda = \frac{{\lambda }_{1}+ {\lambda }_{2}}{2}$$
(2)

where \(\varphi \) and \(\lambda \) represent the latitude and longitude of the center point, respectively; \({\varphi }_{1}\) and \({\varphi }_{2}\) represent the latitudes of the lattice's two opposite boundary vertices, and \({\lambda }_{1}\) and \({\lambda }_{2}\) represent the longitudes of those vertices.

Before it can be input into the model, the data must first be preprocessed, i.e., the \(A\) matrix and the \(V\) matrix are calculated. The \(A\) matrix is an adjacency matrix based on the distances between lattices, storing the mutual distances between the 60 lattices. The \(V\) matrix represents the taxi demand in each lattice. In the calculation of matrix \(A\), we use the distance between the center points of two lattices to represent the distance between the corresponding areas. The distance between two center points is calculated by Eq. (3).

$$d=2r\,\mathrm{arcsin}\left(\sqrt{{\mathrm{sin}}^{2}\left(\frac{{\varphi }_{2} - {\varphi }_{1}}{2}\right)+ \mathrm{cos}\left({\varphi }_{1}\right)\mathrm{cos}\left({\varphi }_{2}\right){\mathrm{sin}}^{2}\left(\frac{{\lambda }_{2} - {\lambda }_{1}}{2}\right)}\right)$$
(3)

where \(d\) is the distance between the two center points, \(r\) is the radius of the earth, and the latitudes and longitudes are expressed in radians.
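As an illustration, the lattice centers and the entries of matrix \(A\) could be computed as follows (a minimal sketch; the function names and the kilometre-valued earth radius are our choices, not from the paper):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius; the exact value used is our assumption

def lattice_center(lat1, lat2, lon1, lon2):
    """Center point of a rectangular lattice, per Eqs. (1)-(2)."""
    return (lat1 + lat2) / 2.0, (lon1 + lon2) / 2.0

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lattice centers, per Eq. (3).

    Coordinates are given in degrees and converted to radians internally.
    """
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    h = (math.sin((p2 - p1) / 2.0) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# Entry (i, j) of the 60 x 60 matrix A is haversine(*center_i, *center_j).
```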

After the data processing is completed, we standardize the data with the Z-score method, which is mainly used for data with a scattered distribution and many outliers. After Z-score processing, the data have a standard deviation of 1 and a mean of 0. The calculation of the Z-score is shown in Eq. (4).

$$z = \frac{x-\mu }{\sigma }$$
(4)

where \(x\) is the original data, and \(\mu \) and \(\sigma \) are the mean and the standard deviation of the overall sample space, respectively.
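A short numpy sketch of Eq. (4) (keeping \(\mu \) and \(\sigma \) so that model outputs can later be mapped back to demand counts is our convention):

```python
import numpy as np

def z_score(x):
    """Z-score standardization of Eq. (4): zero mean, unit standard deviation."""
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma, mu, sigma

# V is the demand matrix, e.g. of shape (num_intervals, 60); predictions in the
# normalized space are inverted with: y = z * sigma + mu.
```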

3.2 Model Design

This paper addresses the spatial-temporal problem of taxi demand forecasting. We use two temporal convolutional layers to obtain long-term temporal dependence and one spatial convolutional layer to extract spatial correlation. The input data is first passed through the attention layer to highlight its characteristics, and the output of the attention layer is fed into the first temporal convolutional layer to extract low-level features. The adjacency matrix is derived from the distances between the lattices; after normalization, its entries become comparable while maintaining their relative relationships. The outputs of the first temporal convolutional layer and the adjacency matrix are input to the spatial convolutional layer to enhance the spatial relationships of the features. These results are then fed into the second temporal convolutional layer to obtain the temporal dependence and high-level features of the fused data. The TCN is composed of dilated convolution, causal convolution, and a residual structure. To counteract the gradient explosion or vanishing caused by the fully convolutional structure, we add BatchNorm2D to the post-processing layer. Finally, to map the learned distributed feature representation to the sample label space, we add a fully connected layer, which preserves the complexity of the model. The model structure is shown in Fig. 1.

Fig. 1.
figure 1

The model structure. TAGCN is composed of temporal attention mechanism, temporal convolutional layer, spatial convolutional layer, and post-processing layer.
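For concreteness, the data flow of Fig. 1 can be sketched as a toy PyTorch module. This is our simplification for illustration, not the authors' implementation: plain Conv2d layers stand in for the TCN blocks, and all shapes and sizes are assumptions.

```python
import torch
import torch.nn as nn

class TAGCNSketch(nn.Module):
    """Toy version of the Fig. 1 pipeline: temporal attention -> temporal conv
    -> spatial (graph) conv -> temporal conv -> BatchNorm2d -> fully connected.
    x: (batch, channels, nodes, time); L_hat: normalized Laplacian of Eq. (5);
    att: temporal attention weights of Eqs. (7)-(8), shape (batch, time, time)."""
    def __init__(self, c_in=1, c_hid=16, t_in=12, kernel=3):
        super().__init__()
        self.tcn1 = nn.Conv2d(c_in, c_hid, (1, kernel))    # stand-in for the first TCN
        self.theta = nn.Linear(c_hid, c_hid)               # GCN feature transform
        self.tcn2 = nn.Conv2d(c_hid, c_hid, (1, kernel))   # stand-in for the second TCN
        self.bn = nn.BatchNorm2d(c_hid)                    # post-processing, Eq. (9)
        self.fc = nn.Linear(c_hid * (t_in - 2 * (kernel - 1)), 1)

    def forward(self, x, L_hat, att):
        b, c, n, t = x.shape
        x = torch.matmul(x.reshape(b, -1, t), att).reshape(b, c, n, t)  # weight time steps
        x = torch.tanh(self.tcn1(x))                                    # low-level features
        x = torch.einsum('ij,bcjt->bcit', L_hat, x)                     # mix nodes via L_hat
        x = torch.relu(self.theta(x.permute(0, 2, 3, 1))).permute(0, 3, 1, 2)
        x = self.bn(torch.tanh(self.tcn2(x)))                           # high-level features
        b, c, n, t = x.shape
        return self.fc(x.permute(0, 2, 1, 3).reshape(b, n, -1)).squeeze(-1)

# y = TAGCNSketch()(torch.randn(8, 1, 60, 12), torch.eye(60),
#                   torch.softmax(torch.randn(8, 12, 12), dim=1))  # -> (8, 60)
```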

Spatial Convolution Layer.

We use GCN to extract spatial correlation based on the distances between lattices. The core of GCN is the spectral decomposition of the graph Laplacian matrix: the eigenvectors of the Laplacian matrix of the graph play the role of the eigenfunctions \({e}^{-i\omega t}\) of the Laplacian operator in classical Fourier analysis. The Laplacian matrix makes the transfer intensity of the data features in GCN proportional to their state differences. To increase the influence of each original node in the calculation, we use a modified version of the Laplacian matrix, defined as shown in Eq. (5),

$$L = {\tilde{D}}^{-\frac{1}{2}}\tilde{A}{\tilde{D}}^{-\frac{1}{2}}$$
(5)

where \(\tilde{A}=A+I\), \(I\) is the identity matrix, and \(\tilde{D}\) is the degree matrix of \(\tilde{A}\), given by \({\tilde{D}}_{ii}=\sum_{j}{\tilde{A}}_{ij}\).
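Eq. (5) translates directly into code; a numpy sketch (assuming \(A\) is the dense 60 × 60 adjacency matrix described in Sect. 3.1):

```python
import numpy as np

def normalized_laplacian(A):
    """Renormalized graph Laplacian of Eq. (5): L = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                 # degree vector: D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt
```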

However, because the spectral graph convolution kernel is global, it has a large number of parameters, and its computation involves an eigendecomposition of high computational complexity, so the graph convolution operation is extremely expensive. Fitting the convolution kernel with Chebyshev polynomials [35] reduces this computational complexity. The \(K\)-th order truncated expansion in terms of the Chebyshev polynomials \({T}_{k}(x)\) is shown in Eq. (6),

$${g}_{{\theta }^{\prime}}\left(\Lambda \right)\approx \sum_{k=0}^{K}{\theta }_{k}^{\prime}{T}_{k}(\tilde{\Lambda })$$
(6)

where \({g}_{{\theta }^{\prime}}\left(\Lambda \right)\) is a function of the eigenvalues of the Laplacian matrix, \(K\) is the highest order of the polynomial, \(\tilde{\Lambda }\) is the scaled eigenvalue matrix, \(\tilde{\Lambda }=\frac{2\Lambda }{{\lambda }_{max}}-{I}_{n}\), \({\lambda }_{max}\) is the spectral radius (largest eigenvalue) of \(L\), and the Chebyshev polynomials are recursively defined as \({T}_{k}\left(x\right)=2x{T}_{k-1}\left(x\right)-{T}_{k-2}\left(x\right)\), \({T}_{1}\left(x\right)=x\), \({T}_{0}\left(x\right)=1\).
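A numpy sketch of the recursion, applying \({T}_{k}\) directly to the scaled Laplacian acting on the graph signal (equivalent to acting on the eigenvalues); the per-channel learnable coefficients of the real model are reduced to plain scalars here for brevity:

```python
import numpy as np

def cheb_polynomials(L, K):
    """Chebyshev polynomials T_0..T_K of the scaled Laplacian, per Eq. (6)."""
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()        # largest eigenvalue of L
    L_tilde = 2.0 * L / lam_max - np.eye(n)      # rescale the spectrum into [-1, 1]
    T = [np.eye(n), L_tilde]                     # T_0 = I, T_1 = L~
    for k in range(2, K + 1):
        T.append(2.0 * L_tilde @ T[-1] - T[-2])  # T_k = 2 L~ T_{k-1} - T_{k-2}
    return T

def cheb_conv(x, T, theta):
    """Order-K Chebyshev filter: sum_k theta_k * T_k(L~) @ x."""
    return sum(th * (Tk @ x) for th, Tk in zip(theta, T))
```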

Temporal Convolution Layer.

We use a Temporal Convolutional Network (TCN) to obtain the temporal dependence; it is composed of dilated convolution with equal input and output lengths, causal convolution, and a residual structure. Causality means that an element at time \(t\) in the output sequence can only depend on elements at time \(t\) and earlier in the input sequence. To ensure that the output tensor has the same length as the input tensor, zeros are padded to the left of the input tensor. For causal convolution, the temporal modeling length is limited by the size of the convolution kernel, so obtaining long-term dependence would require linearly stacking many layers. Dilated convolution was proposed to address this: even with few convolutional layers, it yields a large receptive field. However, even with dilated convolution the network may still be deep, which can cause gradients to vanish; the residual structure solves this, because its cross-layer connections transmit messages across layers. A residual block includes two convolutional layers with nonlinear mappings, and WeightNorm [27] is added to each layer to regularize the network. To keep the TCN from reducing to an overly complex linear regression model, we add the Tanh activation function to the residual block to introduce nonlinearity, as sketched below.
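A minimal PyTorch sketch of one such residual block (channel counts, kernel size, and the 1×1 skip projection are our assumptions; the paper does not give implementation details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm

class CausalConv1d(nn.Module):
    """Dilated causal convolution: zero-pad on the left so the output has the
    same length as the input and never looks into the future."""
    def __init__(self, c_in, c_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = weight_norm(nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation))

    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class ResidualBlock(nn.Module):
    """One TCN residual block: two causal convolutions with Tanh nonlinearity
    and a cross-layer (skip) connection."""
    def __init__(self, c_in, c_out, kernel_size=3, dilation=1):
        super().__init__()
        self.conv1 = CausalConv1d(c_in, c_out, kernel_size, dilation)
        self.conv2 = CausalConv1d(c_out, c_out, kernel_size, dilation)
        # 1x1 convolution aligns channel counts on the skip path when needed
        self.skip = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        y = torch.tanh(self.conv2(torch.tanh(self.conv1(x))))
        return torch.tanh(y + self.skip(x))

# Stacking blocks with dilations 1, 2, 4, ... grows the receptive field exponentially.
```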

Attention Mechanism.

In the time dimension, traffic states in different time periods are correlated, and the strength of this correlation varies with the actual situation. Adding the attention mechanism highlights the correlation between the input data and the current output data, and yields the temporal dependence intensity between time \(i\) and time \(j\) to improve the accuracy of prediction. The temporal attention matrix \(T\) is computed dynamically from the current input and the learned parameters, as shown in Eqs. (7) and (8). The sigmoid activation in Eq. (7) compresses the scores into the range [0, 1], and Eq. (8) then normalizes them with the softmax function.

$$T = {V}_{e}\,\sigma \left(\left(X{U}_{1}\right){U}_{2}\left(X{U}_{3}\right)+{b}_{e}\right)$$
(7)
$${T}_{i,j}^{\mathrm{^{\prime}}} = \frac{{e}^{{T}_{i,j}}}{\sum_{i=1}^{N}{e}^{{T}_{i,j}}}$$
(8)
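Here \(V_e\), \(U_1\), \(U_2\), \(U_3\), and \(b_e\) are learnable parameters, \(X\) is the input, and \({T}_{i,j}^{\prime}\) is the normalized attention weight between time steps \(i\) and \(j\). A minimal PyTorch sketch of Eqs. (7)-(8), in the style of the ASTGCN temporal attention [33, 34] (the parameter shapes are our assumptions):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal attention of Eqs. (7)-(8).

    Input x: (batch, nodes N, channels C, time T); output: (batch, T, T) weights.
    """
    def __init__(self, n_nodes, c_in, t_len):
        super().__init__()
        self.U1 = nn.Parameter(torch.randn(n_nodes))       # contracts the node axis
        self.U2 = nn.Parameter(torch.randn(c_in, n_nodes))
        self.U3 = nn.Parameter(torch.randn(c_in))          # contracts the channel axis
        self.be = nn.Parameter(torch.zeros(t_len, t_len))
        self.Ve = nn.Parameter(torch.randn(t_len, t_len))

    def forward(self, x):
        lhs = (x.permute(0, 3, 2, 1) @ self.U1) @ self.U2  # (batch, T, N)
        rhs = self.U3 @ x                                  # (batch, N, T)
        scores = torch.sigmoid(lhs @ rhs + self.be)        # inner part of Eq. (7)
        return torch.softmax(self.Ve @ scores, dim=1)      # Eq. (8): normalize over i
```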

Post-processing Layer.

Since the model is composed entirely of convolutions, it is prone to vanishing and exploding gradients. To mitigate these problems, we introduce BatchNorm2D, which standardizes the output of each layer to a consistent mean and variance, improving the stability of model training and accelerating network convergence. Intuitively, BatchNorm2D pushes the outputs out of the saturated zones of the activation functions. The calculation of BatchNorm2D is shown in Eq. (9),

$$y= \frac{x-E(x)}{\sqrt{Var(x)+ \epsilon }} \times \gamma + \beta $$
(9)

where \(x\) is the input data, \(E(x)\) and \(Var(x)\) are the mean and variance of \(x\), \(\epsilon \) is a small constant added to prevent the denominator from being zero, and \(\gamma \) and \(\beta \) are learnable parameters initialized to 1 and 0, respectively. In the Adam optimizer, the step size is updated via the mean of the gradient and of the squared gradient; different adaptive learning rates are computed for different parameters from the first-order and second-order moment estimates of the gradient. The fully connected layer fuses the learned distributed feature representation into the output value. Finally, the difference between the predicted value \({y}_{i}\) and the actual value \({x}_{i}\) is measured by the mean squared error loss function, MSELoss.
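A minimal end-to-end training sketch of this post-processing stage (all layer sizes and hyperparameters are illustrative; the Conv2d stage is a stand-in for the upstream attention/TCN/GCN pipeline, and the dummy tensors replace a real data loader):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=(1, 3)),  # placeholder for the upstream feature extractor
    nn.BatchNorm2d(16),                    # Eq. (9): stabilize training
    nn.Flatten(),
    nn.Linear(16 * 60 * 10, 60),           # fully connected: features -> 60 lattice demands
)
criterion = nn.MSELoss()                   # mean squared error between y_hat and y
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 1, 60, 12)              # dummy batch: (batch, channel, nodes, time)
y = torch.randn(8, 60)                     # dummy targets: demand per lattice

for epoch in range(5):                     # a few steps on the dummy batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```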

4 Experiment

4.1 Dataset Analysis

Three real-world datasets are used in our experiments: the desensitized taxi order datasets of Chengdu City in August 2014, Chengdu City in November 2016, and Haikou City from May to October 2017 [26]. All datasets contain the time, longitude, and latitude of each trajectory. From every dataset we select orders whose start time falls between 7:00 and 21:00, together with the latitude and longitude of the pick-up location. For Chengdu City, the longitude range is [103.95, 104.15] and the latitude range is [30.65, 30.75], as shown in Fig. 2.

Fig. 2.
figure 2

Parts of Chengdu City, Sichuan Province, China.

4.2 Baselines

In order to specifically judge the performance of our model, we choose seven state-of-the-art models to compare with TAGCN. The baselines are introduced as follows:

  • Long Short-Term Memory Network (LSTM) [29]: Information is added and removed through its gated structure.

  • Gated Recurrent Unit (GRU) [30]: It realizes forgetting and selective memory by merging them into a single update gate.

  • Spatial-Temporal Dynamic Network (STDN) [17]: It utilizes CNN, LSTM and attention to obtain spatial-temporal features.

  • Graph Convolution Networks (GCN) [22]: It is a generalization of CNN for learning non-grid data in the field of graphs.

  • Temporal Graph Convolutional Network (T-GCN) [31]: It combines GCN and GRU to capture the road network topology and the temporal dependence.

  • Spatio-Temporal Graph Convolutional Networks (STGCN) [32]: It consists of multiple spatio-temporal convolution modules to capture spatio-temporal correlation.

  • Attention-Based Spatial-Temporal Graph Convolutional Networks (ASTGCN) [33]: It uses three components to process three different temporal segments, each handled by two layers of spatial-temporal blocks.

4.3 Evaluation Index

We use the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) to evaluate the baselines and our model. RMSE is sensitive to very large and very small errors in the predicted values, so it reflects the precision of the model well. MAPE is more robust than RMSE, and MAE reflects the true magnitude of the prediction error. As Table 1 shows, TAGCN attains the smallest value on all three indicators.
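For reference, the three indices can be computed as follows (the small eps guard in MAPE is our addition to handle intervals with zero demand):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred, eps=1e-8):
    """Mean Absolute Percentage Error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / (y_true + eps))) * 100.0
```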

Table 1. MAE, RMSE and MAPE of each model.

4.4 Spatial-Temporal Analysis

Figures 3–6 show the predictions of taxi demand in the recreation area, school district, residential area, and business district by the eight models. The X-axis represents time, with one unit corresponding to half an hour, and the Y-axis represents the demand for taxis. The prediction results of TAGCN are visibly more accurate than those of the baselines. As can be seen from Fig. 3, 11:00–13:00 and 17:00–20:00 are the peaks of taxi demand in the recreation area, when people leave the area after resting or finishing their entertainment. Figure 4 shows that 11:30–12:30 and 17:00–18:00 are the peaks of taxi demand in the school district, because students leave school during these periods. In Fig. 5, 7:00–8:00 and 12:00–13:00 are the peaks of taxi demand in the residential area, corresponding to the rush hours before work time. Figure 6 shows that 11:30–12:00 and 18:00–20:00 are the peaks of taxi demand in the business district, when people get off work.

Fig. 3.
figure 3

Demand in recreation area.

Fig. 4.
figure 4

Demand in school district.

Fig. 5.
figure 5

Demand in residential area.

Fig. 6.
figure 6

Demand in business district.

The comparisons between TAGCN and LSTM, GRU, and GCN are shown in panels (a) of the figures. LSTM and GRU are designed to predict sequence information and cannot obtain spatial information; GCN can process irregular graph data to obtain spatial characteristics but does not capture temporal characteristics. The comparisons between TAGCN, T-GCN, STDN, and ASTGCN are shown in panels (b), and the comparisons of TAGCN with STGCN and TAGCN-w/o-attention are shown in panels (c). T-GCN, STDN, ASTGCN, STGCN, and TAGCN-w/o-attention can all obtain spatial and temporal dependence simultaneously, yet the predictions of TAGCN for local peaks and edge values are better than those of the baselines. The reason is that traditional convolutional networks find it difficult to capture long-term dependence due to the limited size of the convolution kernel, whereas a TCN composed of dilated convolution and causal convolution can extract features across time steps, and the temporal attention highlights the features of the time-series data. Compared with the attention-based ASTGCN model, the advantage of our model again lies in the TCN, since ASTGCN uses standard convolution to extract temporal features. Therefore, adding the TCN and the temporal attention on top of GCN makes the predictions of the model more stable.

5 Conclusion

We proposed a temporal attention-based graph convolutional network model to predict passengers' demand for taxis in each functional area of a city. The model focuses on extracting temporal dependence and spatial information while highlighting the characteristics of the time-series data. Comparing the evaluation indicators and predicted values with those of several state-of-the-art models, we conclude that TAGCN is superior to the baselines in predicting local peaks and edge values. TAGCN can assist in the scheduling of taxis to avoid wasting road resources and passengers' time. In the future, we will consider passengers' destinations and introduce external factors to further improve the accuracy of our model.