1 Introduction

Intelligent Transportation Systems (ITS) apply computer technology and artificial intelligence to transportation and service control, strengthening the connection among vehicles, roads, and users, and thus forming a safe, efficient, and accurate comprehensive transportation system. As a critical task in intelligent transportation, traffic prediction significantly impacts traffic management, vehicle allocation, travel time prediction, and other downstream tasks, and has become an important research direction in ITS.

Existing traffic forecasting methods can be divided into three categories: statistical methods, traditional machine learning methods, and deep learning methods. Statistical methods predict future traffic conditions through historical averages or probabilistic modeling. The former corresponds to the historical average model, which assumes that traffic at the same location follows similar daily patterns; therefore, the historical average can be used as the prediction result. The representative of the latter is the Auto-Regressive Integrated Moving Average (ARIMA) model [1, 2], which models traffic volume as a time series. However, unexpected situations such as traffic congestion and accidents often occur in the transportation system. In such cases, statistical methods may be unable to make accurate predictions.
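To make the historical average baseline concrete, the following minimal sketch averages all readings that fall on the same time-of-day slot and uses that profile as the forecast. The function name, the toy speeds, and the choice of four slots per day are illustrative assumptions, not part of the original method description.

```python
from collections import defaultdict

def historical_average(history, slots_per_day):
    """Average the readings that fall on the same time-of-day slot."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for step, value in enumerate(history):
        slot = step % slots_per_day
        sums[slot] += value
        counts[slot] += 1
    return [sums[s] / counts[s] for s in range(slots_per_day)]

# Toy data (hypothetical speeds): three "days" with four readings each.
history = [60, 40, 30, 55,
           62, 38, 34, 57,
           58, 42, 32, 53]
profile = historical_average(history, slots_per_day=4)
# The forecast for slot s of any future day is simply profile[s].
```

Because the forecast depends only on the time of day, a sudden accident or congestion event cannot change it, which is the limitation noted above.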

Machine learning-based methods, which generally rely on support vector machines (SVM) [3] and hidden Markov models, handle unexpected conditions better than statistical methods. However, their performance depends heavily on the effectiveness of feature extraction. Moreover, with the popularization of the Global Positioning System (GPS) and the development of traffic sensors, traffic data has accumulated rapidly, and these methods are no longer suitable for processing ever-expanding datasets. As a result, deep learning has gradually gained attention and become the mainstream research direction.

Deep learning methods mainly use Convolutional Neural Networks (CNN) [4, 5] or Graph Convolutional Networks (GCN) [6,7,8] to extract spatial features of nodes, and use Recurrent Neural Networks (RNN) [10], CNNs, and attention mechanisms to extract temporal features of nodes. Compared to CNN-based models, which only apply to grid-format data, GCNs can extract spatial features from data with non-Euclidean structures. However, their effectiveness depends heavily on the quality of the graph and requires prior knowledge of the physical topology or traffic network. In addition, graph structures generated from adjacency relationships typically reflect only local correlations between nodes, which is insufficient to capture complex dependencies over a long spatial range. In practice, the state of a node may be affected by non-adjacent nodes. For example, a single accident may cause congestion across the entire transportation network. Therefore, the global correlation between non-adjacent nodes is crucial for long-term prediction, yet it has been largely ignored in previous models.

To solve the above issues, we propose a new spatio-temporal Graph Neural Network (STLGCN) for long-term traffic prediction. Different from existing methods that require the pre-generation of graph structures, our model adaptively generates multi-scale feature maps by calculating node correlations based on time information. To improve the effectiveness of graph generation, we have designed a self-generated position encoding method. In addition, we propose a new graph convolution method to optimize the use of traffic information. Our contributions are summarized as follows:

  • An adaptive graph generation method is proposed for learning a multi-scale feature graph based on temporal information. This method uses self-generated position encoding to improve the effectiveness of the generated graph.

  • A novel graph convolution method is proposed to optimize the utilization of related traffic information. This method can better extract valuable features and filter out irrelevant features.

  • Extensive experiments conducted on real-world datasets show that our proposed method achieves higher accuracy than the baseline methods.

2 Related Work

This section briefly introduces methods used in graph convolution, graph generation, and traffic forecasting.

2.1 Graph Convolution

Graph convolution is a commonly used technology in applications, including natural language processing and social network analysis. It is specifically designed to aggregate information from neighboring nodes through a convolution kernel for processing graph structured data. Based on the way of processing the adjacency matrix, there are two main graph convolution methods: spectral-based and spatial-based [9].

The spectral-based method originates from spectral graph theory. It defines the graph Fourier transform based on the graph Laplacian and then constructs graph filters following traditional signal processing. For example, Bruna et al. [8] first defined spectral-based graph convolution based on graph theory. However, this method involves the eigendecomposition of the Laplacian matrix and multiple matrix multiplications, so its computational cost is enormous. At the same time, the number of learnable parameters in its convolutional kernel equals the number of graph nodes, which further increases the computational complexity and makes the method less effective. Graph convolution therefore received little attention until ChebNet [9] was proposed. ChebNet uses Chebyshev polynomials to parameterize the convolutional kernels, which significantly reduces time and space complexity and makes the filters spatially localized.
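For reference, the two formulations discussed above can be written in their standard textbook form. The symbols below (\(L\), \(U\), \(\Lambda\), \(\theta\), \(K\), \(\lambda_{max}\)) are introduced here for illustration and are not used elsewhere in this paper. With the eigendecomposition \(L=U\Lambda U^{\top }\) of the graph Laplacian, the spectral convolution of a signal x with a filter \(g_\theta \) is:

$$\begin{aligned} g_\theta \star x = U g_\theta (\Lambda ) U^{\top } x, \end{aligned}$$

which requires the full eigenbasis U and, when \(g_\theta (\Lambda )=\textrm{diag}(\theta )\), one parameter per node. The Chebyshev approximation replaces the filter with a polynomial of order \(K-1\):

$$\begin{aligned} g_\theta \star x \approx \sum _{k=0}^{K-1}\theta _k T_k(\tilde{L})x, \quad \tilde{L}=\frac{2}{\lambda _{max}}L-I, \end{aligned}$$

where \(T_0(y)=1\), \(T_1(y)=y\), and \(T_k(y)=2yT_{k-1}(y)-T_{k-2}(y)\). Only K parameters are needed, no eigendecomposition is required, and since \(T_k(\tilde{L})\) involves at most the k-th power of the Laplacian, each output depends only on nodes within \(K-1\) hops, which is the spatial localization property mentioned above.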

Similar to traditional CNN convolution on images, spatial-based graph convolution is defined based on the spatial relationships of graph nodes. This method, whose essence is the transmission of node information along the graph edges, integrates the information of the central node and its neighboring nodes to update the feature representation of the central node. There are several main methods in this category. Micheli et al. [9] defined spatial convolution operations through message passing. Atwood et al. [12] viewed graph convolution as a diffusion process of information between nodes and defined a convolution on graphs based on diffusion theory. Velickovic et al. [6] introduced an attention mechanism that uses attention weights to aggregate information from neighboring nodes.

2.2 Graph Generation

Graph generation methods can be divided into two categories [18]. The first generates an adjacency matrix based on spatial features, while the second is based on temporal features. Most methods belong to the first category and use the spatial distance or the status of road connections as spatial features. Spatial graphs generated in this way can effectively capture local correlations in transportation networks. However, they often perform poorly in capturing global correlations, which are even more crucial for long-term prediction.

There are three main methods for generating graph structures based on temporal features: 1. Metric Learning [22], 2. Probabilistic Modeling [19], and 3. Directly Optimizing [20].

Among them, metric learning aims to learn a metric function that calculates the correlation between nodes. Common methods for metric learning include kernel-based methods and attention-based methods. The former commonly use Gaussian or polynomial kernel functions and neural networks. In contrast, attention-based methods use attention mechanisms to calculate correlations, which makes them more dynamic than kernel-based methods.

Probabilistic modeling aims to learn the probability distribution of edges. This method assumes that graphs can be generated by sampling edges, whose probabilities can be modeled by learnable parameters. Probabilistic modeling is often combined with the Bayesian theorem to filter out irrelevant edges.

Direct optimization methods are based on the prior knowledge of the graph and handle the adjacency matrix directly. This method assumes that similar nodes are connected and generates the graph based on this assumption. Direct optimization methods are often combined with GCN for graph generation and use regularization for optimization.

2.3 Traffic Forecasting

Traffic prediction problems aim to predict future traffic status given historical traffic information. The information here is usually provided by sensor networks on the road, and the states between sensor nodes are generally strongly correlated in time and space. Thus, how to capture the implicit spatial and temporal dependencies in data is a critical issue in this field.

In recent years, deep learning-based methods have become the focus of this field. In the early stages, CNN-based methods typically converted urban traffic data into grid format to meet the requirements of image convolution. For example, Guo [21] designed a 3D CNN for capturing spatial-temporal correlations. Yu [15] combined CNN with LSTM for traffic forecasting. Although these methods have achieved improvements over traditional machine learning methods, the transformation process results in the loss of topological information about roads.

Considering the importance of spatio-temporal dependence in traffic prediction problems, spatio-temporal graph convolution is a more suitable choice for processing traffic data. These methods use GCN to capture spatial correlations between nodes and use CNN, RNN, or attention mechanisms to capture temporal correlations, effectively utilizing the topological information of the road. One representative is the DCRNN model proposed by Li [14], which embeds graph convolution in a GRU to solve the task of spatio-temporal prediction. Yu et al. obtained spatio-temporal correlations by stacking spatio-temporal modules constructed from TCN [10] and GCN. In addition, some methods directly use graph convolution to capture spatio-temporal correlations by generating a spatio-temporal graph [16].

3 Methods

In this section, after introducing some basic concepts, we provide a detailed introduction to the proposed model’s network framework, focusing on the graph generator and spatio-temporal blocks.

3.1 Preliminary

One of the most important goals of traffic forecasting is to predict the future traffic condition given historical features. These traffic features (e.g., velocity, flow, volume) can basically reflect the real-time traffic conditions of road segments in a city.

In this article, we denote the sensor network as a weighted undirected graph \(G=(V,E,W)\), where V is the set of sensor nodes with \(|V|=N\), E denotes the set of edges, and W denotes the edge weights. Based on the above definitions, the traffic features observed at time t can be denoted by \(X_t^p=(x_{i,t} ), i=1,\ldots ,N\), where i represents the i-th node. The primary purpose of traffic forecasting is to learn a function \(f(\cdot )\) which establishes a mapping from T historical signals to \(T'\) future signals, i.e.:

$$\begin{aligned} f: x_{future}=f(x_{historical}), \end{aligned}$$
(1)

where:

$$x_{future}= [ x_{t+1},...,x_{t+T^{'} } ]\in R^{T^{'}\times N\times d}, $$
$$x_{historical}=[x_{t-T+1},\ldots ,x_t]\in R^{T\times N\times d},$$

and d is the dimension of the feature.
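The mapping in Eq. (1) is usually trained on sliding windows cut from one long recording. The following sketch shows how such (historical, future) pairs could be built; the function name `make_windows`, the nested-list layout `[time][node][feature]`, and the toy values are assumptions made for illustration.

```python
def make_windows(series, T, T_future):
    """Split a [time][node][feature] series into (historical, future) pairs."""
    samples = []
    for t in range(T, len(series) - T_future + 1):
        x_hist = series[t - T:t]            # T steps ending at the current time
        x_future = series[t:t + T_future]   # the next T' steps
        samples.append((x_hist, x_future))
    return samples

# Toy series: 6 time steps, N = 2 nodes, d = 1 feature.
series = [[[float(10 * step + node)] for node in range(2)] for step in range(6)]
samples = make_windows(series, T=3, T_future=2)
```

Each pair has shapes \(T\times N\times d\) and \(T'\times N\times d\), matching the definitions of \(x_{historical}\) and \(x_{future}\) above.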

Fig. 1. Framework of STLGCN, consisting of the Graph Generator block, Spatial-Temporal blocks, and Prediction block.

3.2 Framework

The proposed spatial-temporal graph convolutional network framework is shown in Fig. 1. It consists of a Graph Generator block, Spatial-Temporal blocks, and a Prediction block.

The Graph Generator block generates multi-scale graphs by redefining the neighbors of each node and generating a multi-order neighborhood graph. Self-generated position encoding is used during graph generation. In the Spatial-Temporal blocks, dilated causal convolutions are used to extract temporal correlations, and a novel graph convolution method is used to capture the spatial dependencies of the graphs produced by the Graph Generator block. In the Prediction block, two temporal convolutions are used for prediction. The Graph Generator block and the Spatial-Temporal blocks are introduced in detail below.

3.3 Graph Generation Block

Figure 2 shows the framework of the graph generation block. This section introduces the definition of multi-order neighbors, the self-generated position encoding method, and the graph generation algorithm used in this block.

Fig. 2. Graph Generator Block.

Definition of Multi-order Neighbors. For each pair of nodes \(v,u\in V\) in a given graph \(G=(V,E)\), if there is a path:

$$\begin{aligned} p_{v,u}^{(S)}=(e_{v,s_1},e_{s_1,s_2},\ldots ,e_{s_{k-2},s_{k-1}},e_{s_{k-1},u}), \end{aligned}$$
(2)

that connects v and u, then v and u are called multi-order neighbors on graph G, where \(S=\{s_1,...,s_{k-1}\}\) represents the intermediate nodes that the path \(p_{v,u}^{(S)}\) passes through, and \(e_{i,j}\in E,\{i,j\}\subset \{v,u,s_1,...,s_{k-1}\}\). In addition, the length of the shortest path:

$$\begin{aligned} k = \min _S |p_{v,u}^{(S)}|, \end{aligned}$$
(3)

connecting v and u on graph G is called the order of the neighbor pair (v, u), and v and u are called k-hop neighbors of each other.
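As an illustration of Eq. (3), the order of every neighbor of a node can be found with a breadth-first search, since BFS visits nodes in order of shortest-path length on an unweighted graph. The helper `neighbor_order` and the small example graph are hypothetical; the paper itself does not compute orders this way but derives multi-order correlations from data.

```python
from collections import deque

def neighbor_order(adjacency, source):
    """Return {node: k}, where k is the shortest-path length from source."""
    order = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in order:          # first visit gives the shortest path
                order[nxt] = order[node] + 1
                queue.append(nxt)
    return order

# Hypothetical undirected graph with edges 0-1, 1-2, 1-3, 2-3.
adjacency = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
orders = neighbor_order(adjacency, source=0)
```

In this example, node 1 is a 1-hop neighbor of node 0, while nodes 2 and 3 are both 2-hop neighbors.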

Self-generated Position Encoding. The success of the Transformer can be attributed, in part, to its ingenious design of position encoding, which enables the model to distinguish different nodes and obtain unique relationships between them. Inspired by this, we propose a self-generated position encoding method, which can simultaneously determine the temporal and spatial patterns of the data and enables the encoding process to be learnable. By applying this method in the graph representation, the model proposed in this article can more effectively capture the temporal and spatial correlations between nodes.

Specifically, let \(X_i^{(T)}\in R^{d\times T}\) be the features of node i over T time slots, where d denotes the dimension of node features. To capture temporal patterns, a convolutional layer with a kernel size of \(1\times 1\) is used for temporal encoding. For spatial encoding, a learnable parameter P is used to distinguish the nodes from one another and is adaptively optimized during training. The entire process can be represented as:

$$\begin{aligned} X' = Conv(X) + P, \end{aligned}$$
(4)

where \(Conv(\cdot )\) represents the convolutional layer, and P represents the learnable position encoding.
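A minimal sketch of Eq. (4) follows. It assumes that the \(1\times 1\) convolution acts as the same linear map over feature channels at every node and time step, and that P holds one vector per node that is broadcast over time; the paper does not state the shape of P, so this is one plausible reading. The data layout `[node][time][channel]`, the dimensions, and the random initial values are illustrative, and in practice W, b, and P would be trained by backpropagation.

```python
import random

def conv1x1(x, weight, bias):
    """1x1 convolution: the same linear map over channels at every (node, time)."""
    out = []
    for node in x:                        # x[node][time][channel]
        out.append([[sum(w * c for w, c in zip(row, step)) + b
                     for row, b in zip(weight, bias)]
                    for step in node])
    return out

def add_position(x, P):
    """Add one position vector per node, broadcast over the time axis."""
    return [[[v + p for v, p in zip(step, P[i])] for step in node]
            for i, node in enumerate(x)]

random.seed(0)
N, T, d_in, d_out = 3, 4, 2, 5            # hypothetical sizes
X = [[[random.random() for _ in range(d_in)] for _ in range(T)] for _ in range(N)]
W = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
b = [0.0] * d_out
P = [[random.gauss(0, 0.1) for _ in range(d_out)] for _ in range(N)]
X_prime = add_position(conv1x1(X, W, b), P)   # X' = Conv(X) + P
```

Because P differs per node, two nodes with identical readings still receive different encodings, which is what lets the later similarity computation tell them apart.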

Graph Generation Algorithm. The graph generation algorithm adopts a metric learning method, using cosine similarity as the kernel function. The first-order adjacency matrix \(A^{(1)}\) of graph G is calculated first:

$$\begin{aligned} A_{i,j}^{(1)}=cosine(X_i,X_j)=\frac{(X_i\cdot X_j)}{\Vert X_i\Vert \Vert X_j\Vert }, \end{aligned}$$
(5)

Then, for each pair of nodes v and u, their k-order correlation is generated from all \((k-1)\)-order correlations of v and u, i.e.:

$$\begin{aligned} A^{(k)}(v,u)=A_{i,j}^{(k)}=cosine(A_{i,:}^{(k-1)},A_{j,:}^{(k-1)}), \end{aligned}$$
(6)

where \(A_{i,:}^{(k-1)}\) represents the i-th row of matrix \(A^{(k-1)}\), corresponding to node v, that is, all of its \((k-1)\)-order correlations. By iteratively solving the above recursive equation, any multi-order graph \(A^{(k)},k=1,2,...\) can be obtained. Finally, the set of multi-order graphs is obtained by combining all the generated adjacency matrices, i.e.:

$$\begin{aligned} A=[A^{(1)},...,A^{(k)}], \end{aligned}$$
(7)
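The recursion in Eqs. (5) to (7) can be sketched in a few lines. The sketch assumes that each node's encoded features have been flattened into a single vector before the cosine similarity is taken; the function names and the toy feature vectors are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(rows):
    return [[cosine(r_i, r_j) for r_j in rows] for r_i in rows]

def multi_order_graphs(features, max_order):
    graphs = [similarity_matrix(features)]            # A^(1), Eq. (5)
    for _ in range(max_order - 1):
        # A^(k) compares rows of A^(k-1), Eq. (6)
        graphs.append(similarity_matrix(graphs[-1]))
    return graphs                                      # [A^(1), ..., A^(k)], Eq. (7)

# Toy features: nodes 0 and 1 behave alike, node 2 differs.
features = [[1.0, 0.0, 1.0], [1.0, 0.1, 0.9], [0.0, 1.0, 0.0]]
A = multi_order_graphs(features, max_order=2)
```

Two nodes receive a high second-order score when they correlate with the rest of the network in the same way, even if their own signals are not directly similar, which is how higher orders can expose non-local relationships.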

3.4 Spatial-Temporal Blocks

This block sequentially uses dilated convolution and a new graph convolution method to capture temporal and spatial correlations. The framework of the convolution module proposed in this article for capturing spatial correlation is shown in Fig. 3.

Fig. 3. Spatial-temporal Block.

Temporal Correlation Capturing. Compared to RNNs, CNN-based methods have fewer parameters and are easier to optimize. Therefore, we use gated causal convolution with dilation to capture temporal correlations. The dilated causal convolution enlarges the receptive field by increasing the dilation factor, which improves efficiency. In addition, by stacking dilated causal convolutional layers with different kernel sizes, temporal correlations at various scales can be effectively captured. This process can be represented as:

$$\begin{aligned} x\star f(t)=\sum _{s=0}^{K-1}f(s)x(t-d\times s), \end{aligned}$$
(8)

where d is the dilation factor and K is the kernel size. In addition, a gating mechanism is adopted to control the transmission of information. Specifically, assuming that the input is \(X\in R^{N\times d\times S}\), the final output can be represented as:

$$\begin{aligned} h=\tau (\theta _1\star X+b)\odot \sigma (\theta _2 \star X+c), \end{aligned}$$
(9)

where \(\tau (\cdot )\) and \(\sigma (\cdot )\) represent the Tanh and Sigmoid activation functions, respectively.
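Equations (8) and (9) can be illustrated on a single scalar time series. The sketch below assumes zero padding on the left, so that inputs before the first time step count as zero; the kernel values and the input are arbitrary examples rather than trained parameters.

```python
import math

def dilated_causal_conv(x, kernel, dilation, bias=0.0):
    """y(t) = bias + sum_s kernel[s] * x(t - dilation * s); x is 0 before t = 0."""
    out = []
    for t in range(len(x)):
        total = bias
        for s, f_s in enumerate(kernel):
            idx = t - dilation * s
            if idx >= 0:                  # causal: only current and past inputs
                total += f_s * x[idx]
        out.append(total)
    return out

def gated_unit(x, theta1, theta2, dilation, b=0.0, c=0.0):
    """h = tanh(theta1 * x + b) ⊙ sigmoid(theta2 * x + c), as in Eq. (9)."""
    filt = dilated_causal_conv(x, theta1, dilation, b)
    gate = dilated_causal_conv(x, theta2, dilation, c)
    return [math.tanh(f) * (1.0 / (1.0 + math.exp(-g))) for f, g in zip(filt, gate)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = dilated_causal_conv(x, kernel=[1.0, 1.0], dilation=2)
h = gated_unit(x, theta1=[0.5, -0.5], theta2=[1.0, 0.0], dilation=2)
```

With a kernel of size 2 and dilation 2, each output combines the current input with the one two steps earlier, so `y` is `[1.0, 2.0, 4.0, 6.0, 8.0]`. The sigmoid branch acts as a gate in (0, 1) that scales how much of the tanh branch passes through.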

Spatial Correlation Capturing. Global features, which are usually implicit, refer to the overall dependency information between nodes and are crucial for revealing the global correlation in the data. Existing GCN methods typically use a spatial graph to aggregate information, which may neglect these global features. For example, when dealing with a fully connected graph, previous methods usually choose a threshold, such as the degree of relevance or the number of neighbors, to filter out irrelevant neighbor information. However, if the threshold is set inappropriately, irrelevant information may be included or valuable information may be ignored when aggregating information from neighbors.

To solve this problem, we propose a novel graph convolution method. When sampling the neighborhoods of nodes, we assume that highly correlated nodes should appear in all graphs regardless of sampling rate, while unrelated nodes may only appear in graphs with high sampling rates. Based on this assumption, we apply different sampling rates to the sub-graphs generated by the graph generator block to filter out irrelevant influences and enhance the effect of highly correlated neighbors. The sampling strategy is the same in all sub-graphs. The sampling frequency is \(a_1,a_2,...,N,\) where \(a_1<a_2<...<N\) and N is the total number of neighbors. We then use these sampled subgraphs to aggregate information from neighbors. As a result, STLGCN can aggregate more useful information from highly related neighbors and ignore irrelevant information more efficiently.
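The multi-rate sampling idea can be sketched as follows. The paper does not specify how neighbors are selected or how the sampled sub-graphs are combined, so this sketch makes two explicit assumptions: neighbors are chosen as the top-k most correlated nodes in a row of the generated graph, and the per-rate mean aggregations are simply averaged. All names and values are illustrative.

```python
def top_k_neighbors(A, k):
    """For each node, keep the indices of its k most correlated nodes."""
    kept = []
    for row in A:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept.append(ranked[:k])
    return kept

def aggregate(A, X, rates):
    """Mean-aggregate neighbor features at each rate, then average over rates."""
    n, dim = len(X), len(X[0])
    out = [[0.0] * dim for _ in range(n)]
    for k in rates:
        for i, nbrs in enumerate(top_k_neighbors(A, k)):
            for c in range(dim):
                out[i][c] += sum(X[j][c] for j in nbrs) / (len(nbrs) * len(rates))
    return out

# Toy correlation graph: nodes 0 and 1 are strongly related, node 2 is not.
A = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
X = [[1.0], [2.0], [9.0]]
H = aggregate(A, X, rates=[1, 2, 3])   # a_1 = 1 < a_2 = 2 < N = 3
```

For node 0, the weakly related node 2 appears only at the highest rate, so its value 9.0 contributes less than in a plain average over all nodes (about 2.17 instead of 4.0), while the strongly related neighbors are counted at every rate.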

4 Experiments

4.1 Datasets

Experiments in this paper are conducted on two public transportation datasets: the PEMS-BAY dataset and the METR-LA dataset. The PEMS-BAY dataset records the speed data of 325 road nodes in the California highway network, and the METR-LA dataset contains the traffic speed data of 207 nodes on the Los Angeles expressway. In terms of data processing, the readings are aggregated into 5-min time windows. Each dataset is divided into a training set, a validation set, and a test set at a ratio of 6:2:2. The dataset statistics are shown in Table 1.
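The 6:2:2 split can be sketched as below. The paper does not say whether the split is chronological or shuffled; this sketch assumes a chronological split, which is common for time series because it keeps future data out of the training set. The function name and the one-day example are illustrative.

```python
def chronological_split(samples, parts=(6, 2, 2)):
    """Split samples in time order according to integer ratio parts."""
    total = sum(parts)
    n = len(samples)
    n_train = n * parts[0] // total
    n_val = n * parts[1] // total
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test_set = samples[n_train + n_val:]   # the remainder goes to the test set
    return train, val, test_set

# One day at 5-minute resolution contains 24 * 60 / 5 = 288 time steps.
samples = list(range(288))
train, val, test_set = chronological_split(samples)
```

Integer arithmetic is used for the boundaries so that rounding never drops or duplicates a sample.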

Table 1. The statistics of METR-LA and PEMS-BAY

4.2 Baseline Algorithms

The proposed STLGCN is compared with the following models.

  • T-GCN [15], which integrates GCN into GRU, is used for traffic forecasting.

  • STGCN [13], which combines GCN with one-dimensional causal convolution and adopts a sandwich structure for spatial-temporal relationship acquisition.

  • DCRNN [12], which uses diffusion graph convolution and integrates it with an RNN. It adopts an encoder-decoder structure for traffic forecasting.

  • STSGCN [14], which generates the spatial-temporal adjacency matrix of nodes and obtains the spatial-temporal relationship through graph convolution.

  • AGCRN [11], which proposes a learnable node encoding method.

  • Graph WaveNet [2], which adopts improved diffusion graph convolution to obtain spatial relationships, uses dilated causal convolution to obtain temporal relationships, and extracts spatial-temporal relationships by alternating between the two.

4.3 Experiment Settings

All experiments are conducted on a hardware platform equipped with an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00 GHz and an NVIDIA GeForce RTX 2080 Ti. The dilation sizes are set to 1, 2, 1, 2, 1, 2, 1, 2. There are four spatial-temporal convolution blocks and two diffusion steps. The graph generator block generates zero-order and first-order neighbors. The sampling frequency is set to 50 nodes, 100 nodes, and all nodes, respectively. Dropout is set to 0.3, and the Adam optimizer is used. The model is trained for one hundred rounds, and the decay index is 0.0001. The loss function is MAE. The evaluation metrics are MAE, RMSE, and MAPE, where MAE and RMSE reflect the fitting of the model to the extreme values, and MAPE reflects the average prediction of the model.
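For reference, the three evaluation metrics can be computed as in the following sketch, using their standard definitions. Skipping targets equal to zero in MAPE is an assumption made here to avoid division by zero; the paper does not describe how such readings are handled. The example values are illustrative.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; zero targets are skipped."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

y_true = [50.0, 60.0, 40.0]
y_pred = [48.0, 63.0, 40.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

On this toy example the errors are 2, 3, and 0, giving an MAE of 5/3, an RMSE of \(\sqrt{13/3}\), and a MAPE of 3%.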

4.4 Experimental Results

Table 2. Performance of STLGCN and other baselines.

This paper mainly compares the long-term forecasting ability of the models. To this end, forecast horizons of 30 min, 45 min, and 60 min are adopted. All models are evaluated in the same environment, and their training parameters are kept consistent. The prediction results of the models are shown in Table 2. It can be seen that STLGCN achieves the best result on all metrics. On the METR-LA dataset, compared with the best baseline model, the MAPE of STLGCN at 30 min, 45 min, and 60 min is improved by 0.04%, 0.13%, and 0.11%, respectively, so the improvement is larger at the longer horizons. On the PEMS-BAY dataset, compared with the best baseline model, the MAPE of STLGCN at 30 min, 45 min, and 60 min is improved by 0.08%, 0.06%, and 0.03%, respectively. Among the baseline models, those using an RNN perform poorly. For example, the three RNN-based models (T-GCN, DCRNN, and AGCRN) perform worse than the TCN-based models. The best-performing baseline is Graph WaveNet, which is closely related to the diffusion graph convolution it uses. STSGCN builds a spatial-temporal graph and exploits spatial-temporal correlations more effectively, so its performance is also relatively good.

5 Conclusions

In this study, we propose a novel spatial-temporal graph convolutional network for long-term traffic forecasting (STLGCN). Our model adaptively generates a multi-order graph using a graph generator that incorporates self-generated position encoding to enhance the effectiveness of the generated graph. Additionally, we propose a graph convolution method to extract useful traffic information from the generated graph while filtering out irrelevant data, improving the model’s ability to capture spatial-temporal correlations.