1 Introduction

As urbanization accelerates, the surge in population and vehicles presents unprecedented challenges to urban traffic infrastructure [1]. Traffic prediction, a vital aspect of Intelligent Transportation Systems (ITSs) [2, 3], has evolved into a research area of common interest to both academia and industry. For traffic managers, traffic prediction provides real-time traffic information, helping drivers effectively avoid congestion. For travelers, it offers trip suggestions and helps optimize travel choices [4].

Fig. 1: Spatio-temporal Pearson analysis on the PeMS08 dataset

Traffic prediction aims to forecast future traffic flow states by observing traffic time series data and the underlying road network structure. It has always been challenging due to complex spatio-temporal dependencies: traffic data exhibits temporal, spatial, and spatio-temporal relationships. Temporal dependency refers to the influence of past time points on future ones, where complex temporal correlations make long-term traffic forecasting difficult. For example, when using data from the past 12 time steps to predict the next 12, the 8th to 12th steps are generally harder to predict, while the 1st to 4th steps are easier. Spatial dependency refers to the mutual influence between sensor nodes at the same time step. As shown in Fig. 1, temporal dependency usually exhibits slight periodic changes, and spatial dependency between roads displays different correlations under different environments. Spatio-temporal dependency involves the mutual influence between sensor nodes over time. Although recent research has made progress in capturing complex spatio-temporal dependencies, limitations remain in the following two aspects.

  1.

    Insufficient modeling of spatial dependencies. Early studies [5,6,7] constructed static adjacency matrices in the intrinsic space based on prior knowledge provided by road networks. This approach considered the intrinsic spatial dependencies between nodes but often ignored the dynamic spatial dependencies caused by factors such as traffic congestion. GWN [8] and AGCRN [9] addressed this limitation by constructing adaptive adjacency matrices using learnable node embeddings in a single space. However, they overlooked multiple latent spatial dependencies between nodes under different factors. These dependencies are influenced by spatial heterogeneity, causal associations, and uncertainty. For example, as shown in Fig. 2, spatial heterogeneity means that traffic conditions in different locations, such as residential and commercial areas, vary significantly due to their attributes, like road conditions and points of interest. Causal associations refer to the interactions and dependencies between different components of the traffic system [10, 11]. For instance, traffic congestion in a residential area may be caused by a traffic accident in a commercial area, and this congestion may in turn lead to subsequent multi-vehicle collisions. It is necessary to identify potential factors that may trigger events from pre-event data and analyze the direct and indirect impacts of events from post-event data. Uncertainty refers to the effects of various factors, such as weather changes and holidays, which increase the complexity of traffic dynamics prediction and management. Therefore, diversified spatial dependencies can be fully considered by dynamically modeling in multiple specific spaces.

  2.

    Limitations in processing long time series data. Due to their chain-like structure, Recurrent Neural Networks (RNNs) and their variants often suffer from gradient vanishing or exploding issues, making them inefficient at capturing long-term dependencies. In contrast, Transformer models are widely applied to various sequence processing tasks due to their unique structural design [12]. Transformers rely entirely on self-attention mechanisms to capture global dependencies between data points, effectively extracting long-term temporal dependencies.

Considering the aforementioned challenges, we introduce a Space-specific Graph Convolutional Recurrent Transformer Network (SSGCRTN) to explore the interaction between the input data and the spatio-temporal correlations in road networks. The main contributions are as follows:

  • We propose a novel Space-Specific Graph Convolution (SSGC), consisting of Intrinsic Space Graph Convolution (ISGC), Latent Space Graph Convolution (LSGC), and an Adaptive Fusion Layer (AFL). Notably, in the LSGC, we introduce a multi-head mechanism that divides embeddings into multiple subspaces and automatically learns dynamic graphs for each subspace. Learning in multiple subspaces effectively models the diversity of spatial relationships.

  • We develop a new Space-Specific Graph Convolutional Recurrent Network (SSGCRN), which replaces the gated units of GRU with SSGC to capture parallel spatio-temporal dependencies.

  • To better capture the causal relationships in context, we introduce a Spatio-Temporal Interaction Module (STIM), which uses a bidirectional SSGCRN to recursively integrate spatio-temporal dependencies in each cycle. At the end of STIM, a Transformer-based Global Temporal Fusion Module (GTFM) is introduced. It employs a self-attention mechanism to dynamically extract key features at each time step and flexibly allocate weights, thereby effectively capturing global spatio-temporal correlations.

  • Extensive experiments on four real-world traffic flow datasets and two speed datasets have confirmed the superior performance of our method.

Fig. 2: Traffic data’s spatial diversity. There are multiple latent spatial dependencies between nodes, which are influenced by spatial heterogeneity, dynamic associations, and uncertainty

2 Related works

Traffic prediction has been extensively researched [13], resulting in two main categories: traditional methods and deep learning methods. Traditional methods include both classic statistical learning and machine learning techniques.

Initially, traffic prediction tasks were viewed as simple time series forecasting problems, addressed using classic statistical approaches like ARIMA [14] and VAR [15]. These models assume linear relationships, whereas traffic data exhibits complex nonlinear relationships, so they perform worse than machine learning methods. The use of Markov jump neural networks for the synchronization of sampled-data control systems can handle random fluctuations in traffic prediction, providing a way to manage nonlinear traffic data that surpasses the limitations of traditional statistical methods [16]. Subsequently, other traditional machine learning methods, such as SVR [17], RFR [18], and KNN [19], have also been widely applied to traffic prediction. These methods capture the nonlinear characteristics of traffic data well but often require manual feature engineering, which limits their flexibility and degree of automation. The coupled inertial memristive neural network model offers a new perspective by simulating the complex interactions between road sections in the traffic network and capturing the nonlinear and periodic characteristics of traffic data [20]. Additionally, the event-triggered output tracking method provides an efficient solution for real-time traffic prediction by reducing communication requirements and improving system response efficiency, demonstrating higher efficiency and real-time performance in practical applications [21].

In recent years, the focus has shifted to deep learning methods due to their superior automatic feature learning capabilities. RNNs and their variants, LSTM and GRU [22,23,24,25], have been used to extract temporal features in traffic prediction. However, as the sequence length increases, these methods may encounter low computational efficiency and error accumulation. In contrast, Convolutional Neural Networks (CNNs) have fewer parameters and support transfer learning, leading to the widespread adoption of CNN-based methods like WaveNet [26] and TCN [27] for traffic prediction. Some researchers have also created grid-based maps of traffic data and used CNNs for spatial information extraction [28, 29]. However, these CNN-based methods are limited in capturing the topological structure of traffic road networks. Graph Neural Networks (GNNs) represent data using graphs and are widely applied with excellent performance in tasks such as graph classification, node clustering, and other graph-structured tasks. TCGNN [30] uses GNNs for network flow classification, PistGNN [31] serves recommendation systems, and HetGAPN [32] improves text processing in natural language processing. The flexibility and powerful capabilities of GNNs mean that they are not limited to traditional graph-structured tasks; they are beginning to demonstrate their unique value in the traffic domain as well.

Recent research has seen the adoption of GNN-based methods for analyzing spatial correlations in traffic data [33], while also incorporating RNNs, CNNs, and attention mechanisms to grasp the temporal dynamics of traffic data. For instance, DCRNN [5] combines GRU with diffusion GCNs, simulating the diffusion process of traffic spatial correlations through a directed graph. However, it only considers the fixed connection relationships between traffic nodes, capturing spatial dependencies through bidirectional random walks, while neglecting the dynamic associations between nodes. In contrast, ASTGCN [7] and STSGCN [34], by integrating spatio-temporal attention mechanisms, more effectively capture dynamic spatial-temporal dependencies. Nevertheless, relying solely on spatial or temporal attention components makes it difficult to comprehensively capture the global spatio-temporal dependencies in traffic flow data. This is especially true in the context of real-time urban traffic dynamics, where the fixed graph structure represented by a static adjacency matrix cannot effectively capture this dynamic nature. To address this, GWN [8] introduces a new adaptive adjacency matrix through node embeddings. STFGNN [35] generates temporal graphs in a data-driven manner and innovatively fuses spatio-temporal relationships at different time steps in parallel. STGODE [36] proposes a novel tensor-form GNN to address deep GCNs’ over-smoothing issue. Z-GCNETs [37] and TAMP-S2GCNets [38] are adept at extracting hidden temporal features by analyzing the topological characteristics of time. DSTAGNN [39] creates dynamic-aware graphs following a data-driven approach to represent dynamic node associations in traffic networks. STGPCN [40] uses graph product to take spatial and temporal graphs as inputs, automatically creating a new large overlapping spatio-temporal graph. DGCRN [41] finely models the dynamic graphs at each timestep using a graph generation algorithm, also introducing a novel training strategy. HSTGCNT [42] combines spatio-temporal graph convolutional networks with long-term temporal transformer networks to capture long-term and short-term temporal relationships and integrates these relationships through an attention fusion module. However, most GNN-based methods either overlook the inherent spatial dependencies in predefined graph matrices or ignore the dynamic associations of traffic nodes under different factor influences. Key determinants of intricate traffic flow fluctuations, like the causative links among traffic incidents, have not been completely accounted for. The latest research leverages the inherent spatial information of road networks, combining generative models with textual descriptions of traffic systems for traffic generation, resulting in more comprehensive and realistic traffic conditions [43].

Compared to the aforementioned methods, we consider the real connections between nodes in the traffic network and learn the latent associations between nodes under different influencing factors through embeddings in multiple subspaces. We also simulate the complex spatial interactions in the road network using an adaptive fusion structure. Additionally, we account for the causal relationships between traffic events and apply a self-attention mechanism to each node to capture global features better.

3 Preliminaries

Definition 1: Traffic Network. The actual traffic road network and sensors deployed for recording traffic information are formalized as a graph \(G=\left( V,E,A \right) \), where G represents the actual traffic road network, \(V=\left\{ v_1, v_2, \ldots , v_N\right\} \) denotes the set of sensor nodes on the road, E is the set of edges between the neighboring sensors, and \(A \in \mathbb {R}^{N \times N}\) corresponds to the adjacency matrix of G.

Definition 2: Traffic Signal Matrix. Traffic characteristics (such as traffic flow and speed) recorded by sensors at the t-th timestep are represented as a tensor \(X_t \in \mathbb {R}^{N \times C}\), where C is the number of traffic characteristics.

Traffic prediction involves learning a function, represented as \(\mathfrak {F}(\cdot )\), on traffic network G, to forecast traffic flow at sensor nodes based on historical data from H timesteps, for the forthcoming P timesteps:

$$\begin{aligned} {{X}_{(t+1):(t+P)}}=\mathfrak {F}\left( {{X}_{(t-H+1):t}};G \right) , \end{aligned}$$
(1)

where \(X_{(t+1):(t+P)} \in {{\mathbb {R}}^{P\times N\times C}}\) and \({{X}_{(t-H+1):t}}\in {{\mathbb {R}}^{H\times N\times C}}\).

4 Methodology

The architecture of our model is illustrated in Fig. 3. Initially, the input data is processed by STIM to extract preliminary spatio-temporal features. Next, it passes through GTFM to capture the global spatio-temporal features. Finally, a fully connected layer learns the nonlinear spatio-temporal dependencies and adjusts the dimensions to the required output size.

Fig. 3: SSGCRTN primarily consists of STIM and GTFM. STIM utilizes a bidirectional SSGCRN to model local spatio-temporal dependencies and capture causal relationships between traffic events. SSGC replaces the fully connected layers of GRU to achieve synchronized extraction of spatial and temporal dependencies. SSGC comprises ISGC, LSGC, and AFL. ISGC learns fixed spatial correlations among road network nodes, LSGC explores multiple latent spatial correlations, and AFL dynamically adjusts the mutual influence of different types of spatial correlations

4.1 Space-specific graph convolution

4.1.1 Intrinsic space graph convolution

The traffic network visually encodes the interconnections between urban centers and suburban roads, and the topological links between roads are essentially unchanging. Consequently, a thresholded Gaussian kernel, based on the standard deviation of the actual distances between sensors and a preset threshold, is used to construct the adjacency matrix \(A^{IS}\) in the intrinsic space. The initial node weights mirror the non-Euclidean topological connections among different nodes:

$$\begin{aligned} {A^{IS}} = \left\{ {\begin{array}{ll} {\exp \left( { - \frac{{d{{\left( {{v_i},{v_j}} \right) }^2}}}{{{\sigma ^2}}}} \right) ,} & {{\text {if}}\,d{{\left( {{v_i},{v_j}} \right) }^2} \le k} \\ {0,} & \text { otherwise} \end{array}} \right. , \end{aligned}$$
(2)

where \(d\left( v_i, v_j\right) \) is defined as the distance between node \({{v}_{i}}\) and node \({{v}_{j}}\), \(\sigma \) represents the standard deviation of the distance, and k signifies the threshold.
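As a concrete illustration, the following minimal NumPy sketch builds \(A^{IS}\) from a pairwise distance matrix according to (2); the function name and the assumption of a fully populated distance matrix are ours.

```python
import numpy as np

def gaussian_kernel_adjacency(dist: np.ndarray, k: float) -> np.ndarray:
    """Sketch of Eq. (2): intrinsic-space adjacency from sensor distances.

    dist: (N, N) matrix of road-network distances d(v_i, v_j).
    k:    threshold applied to the squared distance.
    """
    sigma2 = dist.std() ** 2                  # sigma^2: variance of the distances
    a_is = np.exp(-np.square(dist) / sigma2)  # Gaussian kernel weights
    a_is[np.square(dist) > k] = 0.0           # zero out pairs beyond the threshold
    return a_is
```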

Previous studies [44, 45] often used multi-layer GCNs to model long-distance spatial dependencies, but these models only consider the information of directly connected neighbors at each layer. Additionally, stacking multiple GCN layers can cause nodes in locally connected subgraphs to have overly similar representations, reducing predictive performance. In this paper, we update node representations using a single-layer GCN and introduce multi-hop neighbors in this layer to obtain richer road network topological information, effectively alleviating the over-smoothing issue. The K-hop ISGC is formulated as follows:

$$\begin{aligned} {X_{IS}}&= \sum \limits _{k = 0}^K {{H^{(k)}}} {W^{(k)}},\nonumber \\ {H^{(k)}}&= {H^{(k - 1)}}{\tilde{A}^{IS}}, \nonumber \\ {H^{(0)}}&= {X_{in}},\\ {\tilde{A}^{IS}}&= {\left( {{{\tilde{D}}^{IS}}} \right) ^{ - 1}}{A^{IS}}, \nonumber \\ \tilde{D}^{IS}_{i,i}&= \sum \limits _j {A_{i,j}^{IS}}, \nonumber \end{aligned}$$
(3)

where \(X_{i n}\) and \(X_{IS}\) represent the initial and resultant node states, respectively. \(W^{(k)} \in \mathbb {R}^{d_{i n} \times d_{\text{ out } }}\) is a learnable parameter, and K stands for the total number of hops.
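A minimal PyTorch sketch of this K-hop propagation is shown below; it assumes the standard \(\tilde{A}H\) orientation for the propagation step and batched node features, and the class name ISGC is ours.

```python
import torch
import torch.nn as nn

class ISGC(nn.Module):
    """Sketch of Eq. (3): single-layer, K-hop intrinsic-space graph convolution."""

    def __init__(self, d_in: int, d_out: int, k_hops: int):
        super().__init__()
        self.k_hops = k_hops
        # One learnable weight matrix W^(k) per hop, k = 0, ..., K
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.empty(d_in, d_out)) for _ in range(k_hops + 1)]
        )
        for w in self.weights:
            nn.init.xavier_uniform_(w)

    def forward(self, x: torch.Tensor, a_is: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d_in) node features, a_is: (N, N) adjacency from Eq. (2)
        a_norm = a_is / a_is.sum(dim=1, keepdim=True).clamp(min=1e-6)  # D^-1 A
        h, out = x, 0.0
        for k in range(self.k_hops + 1):
            out = out + h @ self.weights[k]             # accumulate H^(k) W^(k)
            h = torch.einsum("ij,bjd->bid", a_norm, h)  # propagate one more hop
        return out
```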

4.1.2 Latent space graph convolution

The spatial correlations between sensors are influenced by various irregular factors, leading to spatial graph structures that are not entirely consistent under different conditions. Sole dependence on a predefined graph for spatial dependencies, without direct relevance to the prediction task, can result in notable bias. GWN generates the adjacency matrix through learnable node embeddings. First, two learnable parameters \(E_1\) and \(E_2\) are randomly initialized, where \(E_1, E_2 \in \mathbb {R}^{N \times D}\). Then, the calculation is performed as follows:

$$\begin{aligned} \hat{A}_{adp}={\text {Softmax}}\left( {\text {ReLU}}\left( E_1 E_2^T\right) \right) . \end{aligned}$$
(4)

AGCRN directly generates \({D}^{-\frac{1}{2}} {A}{D}^{-\frac{1}{2}}\) to avoid unnecessary and repetitive calculations:

$$\begin{aligned} D^{-\frac{1}{2}} A D^{-\frac{1}{2}}={\text {Softmax}}\left( {\text {ReLU}}\left( E_A E_A^T\right) \right) . \end{aligned}$$
(5)

We observe that both GWN and AGCRN overlook the multiple latent spatial dependencies that may exist between nodes. To address this, we propose modeling dynamic spatial dependencies in multiple latent spaces. Specifically, we first randomly initialize two learnable node embedding dictionaries \({{E}_{1}},{{E}_{2}}\in {{\mathbb {R}}^{N\times {{n}_{space}}\times {{d}_{k}}}}\), where \({{n}_{space}}\) is the number of latent spaces, and \({{d}_{k}}\) represents the embedding dimension of each latent space. The use of two different dictionaries allows the model to learn and represent node characteristics from two distinct perspectives, akin to observing data from different viewpoints, thus enhancing the potential to capture complex dependencies. \(E_1\) can be seen as providing an "emitting" characteristic description of the nodes, while \(E_2\) provides a corresponding "receiving" characteristic description. This design enables each node to interact with other nodes not just in a single way but in multiple latent spaces, playing different roles (i.e., emitter or receiver). Additionally, by independently optimizing \(E_1\) and \(E_2\), we can more flexibly adjust the interaction modes between nodes during training, which is advantageous for learning complex and nonlinear spatial relationships. We then infer the latent dependencies in each latent space using the following formula:

$$\begin{aligned} A_i^{LS} = {\text {Softmax}}\left( \frac{{\text {ReLU}}\left( E_{1,i}E_{2,i}^T \right) }{\sqrt{d_k}} \right) , \end{aligned}$$
(6)

where i represents the i-th latent space, and ReLU is the activation function used to filter out some of the weaker connections. Notably, the Softmax function is directly applied to achieve the normalized graph, bypassing the need to first create an adjacency matrix and then calculate the Laplacian matrix, thereby eliminating unnecessary computations. We replace \(A^{IS}\) in (3) with \(A_i^{LS}\), and then obtain \(X_i^{LS}\) through graph convolution operations. Each latent space independently learns different spatial dependencies, allowing the entire model to capture the diversity of spatial relationships. The final output \(X_{LS}\) is obtained by averaging the outputs of each latent space \(X_i^{LS}\) to model the multiple latent spatial dependencies:

$$\begin{aligned} X_{LS}=\frac{1}{n_{space}} \sum _{i=1}^{n_{space}} X_i^{LS}. \end{aligned}$$
(7)

In this process, our model integrates information from different processing streams, effectively capturing multi-level and multi-granularity spatial dependencies in the data. By integrating outputs from various latent spaces, the model enhances its comprehensive understanding of spatial domain knowledge, enabling it to interpret and utilize this information at multiple levels.
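The following PyTorch sketch illustrates how the per-space graphs of (6) can be generated and averaged as in (7); for brevity it uses a one-hop convolution with a shared linear projection rather than the full K-hop scheme of (3), and the class name LSGC is ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSGC(nn.Module):
    """Sketch of Eqs. (6)-(7): one dynamic graph per latent space, then averaging."""

    def __init__(self, num_nodes: int, n_space: int, d_k: int, d_in: int, d_out: int):
        super().__init__()
        # "Emitting" and "receiving" node-embedding dictionaries E_1, E_2
        self.e1 = nn.Parameter(torch.randn(num_nodes, n_space, d_k))
        self.e2 = nn.Parameter(torch.randn(num_nodes, n_space, d_k))
        self.proj = nn.Linear(d_in, d_out)
        self.d_k = d_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d_in) node features
        # A_i^LS for every latent space i, shape (n_space, N, N)
        scores = torch.einsum("nsd,msd->snm", self.e1, self.e2)
        a_ls = F.softmax(F.relu(scores) / self.d_k ** 0.5, dim=-1)
        # One-hop graph convolution in each latent space, then average over spaces
        x_ls = torch.einsum("snm,bmd->bsnd", a_ls, x)   # (B, n_space, N, d_in)
        return self.proj(x_ls.mean(dim=1))              # (B, N, d_out)
```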

4.1.3 Adaptive fusion layer

ISGC focuses on the information of neighboring nodes around existing connected nodes on the road, making it localized. The learning of multiple subspaces in LSGC reveals potential connections between unknown nodes, allowing LSGC to effectively capture the dependencies between two spatially distant road nodes, making it global. To coordinate and enhance the interaction between these two types of spatial dependencies, we use an adaptive fusion layer to integrate them. The fusion method is as follows:

$$\begin{aligned} {X_s} = {W_{IS}}{X_{IS}} + {W_{LS}}{X_{LS}}, \end{aligned}$$
(8)

where \(W_{LS}\) is a learnable parameter and \(W_{IS}=1-W_{LS}\).
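A minimal sketch of this fusion is given below; squashing the learnable weight with a sigmoid so that \(W_{IS}\) and \(W_{LS}\) remain complementary in \((0,1)\) is our assumption rather than a detail stated above.

```python
import torch
import torch.nn as nn

class AdaptiveFusionLayer(nn.Module):
    """Sketch of Eq. (8): adaptive fusion of intrinsic- and latent-space outputs."""

    def __init__(self):
        super().__init__()
        # Single learnable scalar; the sigmoid keeps W_LS in (0, 1) so that
        # W_IS = 1 - W_LS (the squashing is an assumption on our part).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x_is: torch.Tensor, x_ls: torch.Tensor) -> torch.Tensor:
        w_ls = torch.sigmoid(self.alpha)
        return (1.0 - w_ls) * x_is + w_ls * x_ls
```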

4.2 Bidirectional spatio-temporal dependency fusion

Inspired by AGCRN [9] and DGCRN [41], we design the SSGCRN, which replaces the traditional gating units of GRU with SSGC. This allows multi-granularity information from the spatial domain to be directly used to model short-term dynamics and complex interactions in the spatio-temporal domain simultaneously. This design not only enhances the accuracy of spatio-temporal data processing but also deepens the model’s understanding of node interactions at different temporal and spatial granularities. SSGCRN fuses parallel spatio-temporal dependencies across multiple spatial dimensions at each adjacent time step to effectively learn the underlying local spatio-temporal dependencies in traffic data.

It’s worth noting that relationships in traffic data are not invariably arranged sequentially. Indeed, there are intricate causal links among various traffic events. Therefore, we introduce STIM, which recursively fuses the intrinsic and latent spatio-temporal dependencies of each period through forward and backward SSGCRN. Specifically, we concatenate the input data \(X_s^{(t)}\) at the current time t with the output of the SSGCRN from the previous time step \(h_{t-1}^{(d)} \in \mathbb {R}^{N \times C^{\prime }}\) as the input information for the current time step, and then perform the following calculations:

$$\begin{aligned} z_t^{(d)}&=\sigma \left( \mathcal {G}\left( \left[ X_s^{(t)} \Vert h_{t-1}^{(d)}\right] ; \Upsilon _z^{(d)}\right) \right) ,\nonumber \\ r_t^{(d)}&= \sigma \left( \mathcal {G}\left( \left[ X_s^{(t)} \Vert h_{t - 1}^{(d)} \right] ;\Upsilon _r^{(d)} \right) \right) ,\nonumber \\ c_t^{(d)}&= \tanh \left( \mathcal {G}\left( \left[ X_s^{(t)} \Vert r_t^{(d)} \odot h_{t - 1}^{(d)} \right] ;\Upsilon _c^{(d)} \right) \right) ,\nonumber \\ h_t^{(d)}&= z_t^{(d)} \odot h_{t - 1}^{(d)} + \left( 1 - z_t^{(d)} \right) \odot c_t^{(d)}, \end{aligned}$$
(9)
Table 1 Details of datasets

where d represents the direction of STIM, with 1 indicating forward and -1 indicating backward. \([\cdot \Vert \cdot ]\) denotes concatenation along the feature dimension, \(\odot \) signifies element-wise multiplication, \(z_t^{(d)}\) and \(r_t^{(d)}\) are the outputs of the update gate and reset gate at time step t, respectively, and \(c_t^{(d)}\) is the new candidate activation state at the current time step, representing the latent new information calculated from the current input and the adjusted previous hidden state. \(\mathcal {G}\) denotes the SSGC module with learnable parameters \(\Upsilon _z^{(d)}\), \(\Upsilon _r^{(d)}\), and \(\Upsilon _c^{(d)}\). The forward and backward features of STIM are concatenated, denoted as \(\left[ h_t^{(1)} \Vert h_t^{(-1)}\right] \). Once the final step is completed, we collect the hidden features from all time steps within STIM to form a comprehensive feature \(X^H \in \mathbb {R}^{N \times T \times 2 C^{\prime }}\):

$$\begin{aligned} {X^H} = \left[ h_1^{(1)}, h_1^{(-1)}; h_2^{(1)}, h_2^{(-1)}; \cdots ; h_T^{(1)}, h_T^{(-1)} \right] . \end{aligned}$$
(10)
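A sketch of one recurrent step is given below; the gate modules are assumed to be SSGC blocks mapping the concatenated input to the hidden dimension, and the class name SSGCRNCell is ours. Running one such cell forward over \(t = 1, \ldots, T\) and a second cell backward, then concatenating the two hidden states at each step, produces \(X^H\) as in (10).

```python
import torch
import torch.nn as nn

class SSGCRNCell(nn.Module):
    """Sketch of Eq. (9): a GRU-style cell whose gates are SSGC blocks.

    ssgc_z, ssgc_r, ssgc_c: modules mapping (B, N, d_in + d_hidden) features
    to (B, N, d_hidden), e.g. an SSGC with its adjacency matrices baked in.
    """

    def __init__(self, ssgc_z: nn.Module, ssgc_r: nn.Module, ssgc_c: nn.Module):
        super().__init__()
        self.ssgc_z, self.ssgc_r, self.ssgc_c = ssgc_z, ssgc_r, ssgc_c

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        xh = torch.cat([x_t, h_prev], dim=-1)             # [X_s^(t) || h_{t-1}]
        z = torch.sigmoid(self.ssgc_z(xh))                # update gate
        r = torch.sigmoid(self.ssgc_r(xh))                # reset gate
        c = torch.tanh(self.ssgc_c(torch.cat([x_t, r * h_prev], dim=-1)))
        return z * h_prev + (1.0 - z) * c                 # new hidden state
```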

4.3 Global spatio-temporal dependency fusion

In STIM, the output at a given time step is influenced by both the current input and the hidden state from the previous time step. For long sequences, especially those that require understanding interdependent features, STIM needs to gather sufficient information over multiple time steps to connect events that are temporally distant. The self-attention mechanism, especially as applied in Transformers, has proven to be highly effective in learning the long-distance interdependencies of time series data, as it can directly capture the relationships between any two positions in the sequence. Therefore, we adopt GTFM after STIM to deeply integrate hidden features from all time steps in STIM. GTFM mainly consists of temporal multi-head self-attention and fully connected layers. To enhance the representational power of node features, we introduce residual connections into the network. Specifically, it can be represented as follows:

$$\begin{aligned} {Q_{{v_i}}} = X_{{v_i}}^H{W_q},\nonumber \\ {K_{{v_i}}} = X_{{v_i}}^H{W_k},\nonumber \\ {V_{{v_i}}} = X_{{v_i}}^H{W_v}, \end{aligned}$$
(11)

where \({W_q}\), \({W_k}\), and \({W_v}\) represent learnable weight matrices. The dynamic dependency between different time steps of node \(v_i\), \(T^T \in \mathbb {R}^{T \times T}\), is computed through the dot product of \(Q_{v_i}\) and \(K_{v_i}^T\):

$$\begin{aligned} {T^T}&= {\text {Softmax}}\left( \frac{{Q_{{v_i}}}K_{{v_i}}^T}{\sqrt{{d_k}}}\right) ,\nonumber \\ GT_{v_i}&= {T^T}V_{v_i}, \\ O_{v_i}&= {\text {LN}}\left( {\text {Concat}}\left( G{T_1}, \ldots ,G{T_H}\right) {W^T} + X_{{v_i}}^H\right) ,\nonumber \end{aligned}$$
(12)

where \(\frac{1}{\sqrt{d_k}}\) is a scaling factor, the Softmax function maps the relevance of \(Q_{v_i}\) and \(K_{v_i}\) to a range of [0,1], Concat is used to concatenate attention features, LN refers to layer normalization to improve model convergence, and \(W^T\) is a learnable parameter. After all nodes are computed, we obtain the final output \(O \in \mathbb {R}^{N \times T \times 2C^{\prime }}\).
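For illustration, the sketch below applies temporal multi-head self-attention independently to each node’s hidden sequence, with a residual connection and layer normalization; the use of PyTorch’s built-in attention module and the head count are our simplifying assumptions.

```python
import torch
import torch.nn as nn

class GTFM(nn.Module):
    """Sketch of Eqs. (11)-(12): per-node temporal multi-head self-attention."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # d_model corresponds to 2C' and must be divisible by n_heads
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x_h: torch.Tensor) -> torch.Tensor:
        # x_h: (B, N, T, d_model) -- hidden features X^H from STIM
        b, n, t, d = x_h.shape
        seq = x_h.reshape(b * n, t, d)      # attend over time, independently per node
        out, _ = self.attn(seq, seq, seq)   # Q = K = V = X^H_{v_i}
        out = self.norm(out + seq)          # residual connection + layer norm
        return out.reshape(b, n, t, d)
```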

4.4 Multi-step traffic prediction

To finalize multi-step prediction, a dimension-specific linear transformation is applied to the output sequence via a fully connected neural network layer. This approach is more efficient than single-step prediction methods:

$$\begin{aligned} \hat{Y} = {W_l}O + {b_l}, \end{aligned}$$
(13)

where \(W_l\) and \(b_l\) are the weight matrix and bias term, respectively, and \(\hat{Y} \in \mathbb {R}^{N \times T \times 1}\) is the final prediction result.

5 Experiment

5.1 Datasets

To evaluate the performance of SSGCRTN, we conduct extensive experiments on six public traffic datasets: PeMS03/04/07/08 [34] and PeMSD7(M)/(L) [6]. The data are collected every 30 seconds in real time by the Caltrans Performance Measurement System (PeMS) and aggregated into 5-minute intervals; the datasets are summarized in Table 1. The distribution of sensors in PeMS can be seen in Fig. 4.

Each dataset consists of two parts. The first part is a CSV file that provides distance information between sensors with connectivity relationships. It includes three attribute columns: “from”, “to”, and “cost”. The “from” and “to” columns record the IDs of the two stations, and “cost” records the corresponding distance. The second part is a data file. For the PeMS03/04/07/08 datasets, we select the recorded traffic flow data. For the PeMSD7(M)/(L) datasets, we select the recorded speed data.

Fig. 4: Visualisation of PeMS sensor positions. The PeMS03/04/07/08 datasets correspond to districts 3, 4, 7, and 8 in California, respectively. PeMSD7(M) and PeMSD7(L) correspond to medium-scale and large-scale data from district 7, respectively

5.2 Data preprocessing

For consistency with benchmarks in earlier studies [34, 35], the datasets are split into training, validation, and test sets following a 6:2:2 ratio. The 60% training set provides sufficient data to help the model learn complex patterns. Traffic data typically includes data from multiple time periods and locations, requiring a large amount of data to capture the relationships between these variables. The 20% validation set is used to adjust model parameters and prevent overfitting, while the remaining 20% test set is used to evaluate the model’s final performance, ensuring that the model generalizes well to new data.

Missing data in the dataset is handled using masking. To accelerate the model’s convergence speed, we normalize the input data using Z-Score normalization, as follows:

$$\begin{aligned} x^{\prime }=\frac{x-{\text {mean}}(x)}{{\text {std}}(x)}, \end{aligned}$$
(14)

where \({\text {mean}}(x)\) denotes the average value and \({\text {std}}(x)\) signifies the standard deviation of the training data.
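A minimal sketch of the split and normalization is shown below; the chronological 6:2:2 split and training-set statistics follow the description above, while the function name and interface are illustrative.

```python
import numpy as np

def split_and_normalize(data: np.ndarray, train_ratio=0.6, val_ratio=0.2):
    """Chronological 6:2:2 split with Z-score normalization (Eq. (14)).

    data: (T, N, C) traffic signal tensor. The mean and std are computed on the
    training portion only, then applied to all three splits.
    """
    n = data.shape[0]
    i_train, i_val = int(n * train_ratio), int(n * (train_ratio + val_ratio))
    train, val, test = data[:i_train], data[i_train:i_val], data[i_val:]
    mean, std = train.mean(), train.std()
    norm = lambda x: (x - mean) / std
    return norm(train), norm(val), norm(test), mean, std
```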

5.3 Experiment settings

This research aims to predict traffic conditions for the next hour using traffic flow data from the previous hour. To achieve this, we set the values of H and P to 12.

The experimental setup includes a computer running Windows 10 OS, featuring an Intel Xeon Gold 5320 CPU @ 2.20 GHz. The system is equipped with 200 GB RAM and a single NVIDIA RTX 3090 GPU. For the PeMS03/04/07, and PeMSD7(M)/(L) datasets, we set \({n_{space}}=4\), \({{d}_{k}}=64\). For the PeMS08 dataset, \({n_{space}} = 16\), \({{d}_{k}}=64\). All tasks are evaluated using three widely-adopted evaluation metrics: MAE, MAPE and RMSE. The definitions of these metrics are as follows:

1) MAE:

$$\begin{aligned} {\text {MAE = }}\frac{1}{N}\sum \limits _{t = 1}^N {\left| {{{\hat{y}}_t} - {y_t}} \right| }, \end{aligned}$$
(15)

2) MAPE:

$$\begin{aligned} {\text {MAPE = }}\frac{{100\% }}{N}\sum \limits _{t = 1}^N {\left| {\frac{{{{\hat{y}}_t} - {y_t}}}{{{y_t}}}} \right| }, \end{aligned}$$
(16)

3) RMSE:

$$\begin{aligned} \textrm{RMSE}=\sqrt{\frac{1}{N} \sum _{t=1}^N\left( \hat{y}_t-y_t\right) ^2}, \end{aligned}$$
(17)

where N denotes the number of samples.
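The sketch below implements these three metrics in NumPy; masking near-zero ground-truth values in MAPE is a common safeguard that we add as an assumption, since the exact handling is not specified here.

```python
import numpy as np

def mae(y_hat: np.ndarray, y: np.ndarray) -> float:
    return float(np.mean(np.abs(y_hat - y)))

def mape(y_hat: np.ndarray, y: np.ndarray, eps: float = 1e-5) -> float:
    # Ignore near-zero ground-truth values to avoid division by zero (our assumption)
    mask = np.abs(y) > eps
    return float(100.0 * np.mean(np.abs((y_hat[mask] - y[mask]) / y[mask])))

def rmse(y_hat: np.ndarray, y: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))
```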

MAE is selected as the loss function, and the Adam optimizer is utilized for the training process. We set the number of training epochs to 150, with a batch size of 64 and a learning rate of 0.001. To prevent overfitting, an early stopping strategy is employed. The model exhibiting the smallest loss on the validation set is chosen as the final model for evaluation. The experiments are conducted five times, and the average results of the evaluation metrics are reported.
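A minimal training-loop sketch consistent with this setup is given below; the early-stopping patience, data-loader interface, and device handling are illustrative assumptions rather than details reported above.

```python
import copy
import torch

def train(model, train_loader, val_loader, epochs=150, lr=1e-3, patience=20):
    """Train with MAE loss and Adam, keeping the model with the lowest validation loss."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state, wait = float("inf"), None, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optim.zero_grad()
            loss = torch.mean(torch.abs(model(x) - y))   # MAE loss
            loss.backward()
            optim.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(torch.mean(torch.abs(model(x) - y)).item()
                           for x, y in val_loader) / len(val_loader)
        if val_loss < best_loss:
            best_loss, best_state, wait = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            wait += 1
            if wait >= patience:                         # early stopping
                break
    model.load_state_dict(best_state)
    return model
```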

Table 2 Comparison of SSGCRTN and baselines on PeMS03, PeMS04, PeMS07, and PeMS08 datasets

5.4 Baseline methods

To evaluate SSGCRTN, it is compared with sixteen baseline models.

  • DCRNN [5]: Utilizes bidirectional random walks to represent spatial correlations and employs a GRU to capture temporal dependencies.

  • STGCN [6]: Exploits spatio-temporal correlations through a sequence of spatio-temporal convolutional blocks.

  • ASTGCN [7]: Employs a spatial-temporal attention mechanism to capture hidden spatio-temporal patterns.

  • GWN [8]: Captures latent spatial dependencies using an adaptive matrix and extends the receptive field with dilated 1D convolutional layers.

  • STG2Seq [46]: Uses graph convolutions exclusively to extract spatial correlations for multi-step traffic forecasting.

  • STSGCN [34]: Employs a graph convolution module designed to concurrently identify local spatial and temporal correlations.

  • AGCRN [9]: Captures node-specific spatial and temporal correlations through a node-adaptive parameter learning module.

  • LSGCN [47]: Uses graph convolution and cosine graph attention network to extract long-term and short-term spatial dependencies.

  • STFGNN [35]: Captures hidden spatial dependencies by integrating data-driven graphs with predefined spatial graphs.

  • Z-GCNETs [37]: Introduces a time-aware zigzag topology layer in GCN to capture significant time-aware topological features in data.

  • STGODE [36]: Proposes a novel tensor form of GNN to extract distant spatio-temporal correlations.

  • TAMP-S2GCNets [38]: Models spatio-temporal data using dynamic matrix construction and temporal graph sequences.

  • DSTAGNN [39]: Generates dynamic spatio-temporal graphs through a data-driven approach, enhancing the multi-head attention mechanism to represent dynamic associations between nodes.

  • DGCRN [41]: Utilizes a generative tactic to meticulously model the complex topological structure of dynamic graphs for each time interval.

  • STGPCN [40]: Convolves various spatio-temporal graphs defined by graph accumulation operations to capture spatio-temporal relationships.

  • STC-CGCN [25]: Introduces prior knowledge such as comfort to improve prediction accuracy.

5.5 Experiment results

Table 3 Comparison of SSGCRTN and baselines on PeMSD7(M) and PeMSD7(L) datasets
Fig. 5: Three metric values for each horizon on six datasets

Fig. 6: Model parameters and the cost of training

Tables 2 and 3 respectively present the experimental results of SSGCRTN and other baseline models on traffic flow and speed datasets. We draw the following conclusions:

Early models [5,6,7, 34, 46, 47] constructed static graphs to account for intrinsic spatial dependencies, capturing only shared patterns of traffic sequences while neglecting latent spatial dependencies. To account for latent dependencies, Graph WaveNet constructs adaptive graphs, ST-CGCN dynamically generates complex adjacency matrices, and STGPCN produces multiple spatio-temporal graphs. STFGNN and STGODE introduce temporal graphs and GODE, respectively. Z-GCNETs and TAMP-S2GCNets fully consider the topological properties conditioned on time, thereby enhancing the models’ spatio-temporal awareness and outperforming other graph-based methods. However, they fail to effectively explore parallel spatio-temporal dependencies, resulting in inferior performance compared to AGCRN and DGCRN. DSTAGNN combines dynamic graphs generated from historical data to better uncover potential spatial dependencies among nodes while using self-attention mechanisms to capture long-term temporal dependencies. Consequently, it is highly competitive compared to our model.

It is worth noting that SSGCRTN shows sub-optimal performance relative to the baselines only on PeMS07 (for MAE and RMSE) and PeMSD7(L) (for MAPE). We conjecture this is because these two datasets have the second-largest and the largest numbers of traffic nodes, respectively, making it difficult for SSGC to identify useful signals. The optimal performance on all other datasets can be attributed to three main reasons: 1) our model considers intrinsic and multiple latent spatial dependencies; 2) our model captures parallel spatio-temporal dependencies and enhances the understanding of temporal context; 3) our model accounts for long-term temporal dependencies.

Figure 5 shows the performance of several models at twelve prediction horizons across six datasets under three metrics. As the prediction horizon increases, the complexity of forecasting escalates, leading to a continual rise in MAE, RMSE, and MAPE. SSGCRTN identifies both preceding and succeeding dependencies in the temporal dimension for short-term predictions, while employing node-specific temporal self-attention for long-term forecasting. Thus, SSGCRTN is suitable for both short-term and long-term forecasting tasks, demonstrating its versatility and enduring stability.

Figure 6 compares SSGCRTN with other advanced spatio-temporal prediction models in terms of parameter count and training cost. The findings indicate that SSGCRTN effectively controls the overall number of parameters. During the training phase, SSGCRTN consumes more time because SSGC extracts features from multiple subspaces, learning a broader range of spatial knowledge than other baseline methods. Despite this, its overall training cost remains acceptable, and together with its outstanding predictive performance this makes it a preferred choice for spatio-temporal prediction.

Figure 7 demonstrates the quality of predictions at different times of the day by capturing twenty-four prediction snapshots along the time axis on the test set. SSGCRTN responds more quickly to dynamic changes in traffic flow under conditions of missing data and more accurately predicts the start of traffic peaks.

Fig. 7: Predictions of our model and DGCRN in different scenarios. (a) and (c) display the model’s effectiveness in predicting during instances of missing data. (b) and (d) visualize the prediction outcomes during peak data moments

Fig. 8: Heatmap of MAE for sensor nodes containing a lot of missing data in the PeMS04 and PeMS08 datasets. Model names are indicated along the horizontal axis, and the sensor node numbers are marked on the vertical axis

Moreover, the efficacy of SSGCRTN in situations with substantial data gaps is corroborated. Specifically, the ten road nodes with the most missing data in the PeMS04 and PeMS08 test sets are chosen, and the MAE values for various models are depicted as a heatmap, as illustrated in Fig. 8. The MAE values for some nodes in PeMS08 are generally lower than those in PeMS04, indicating that a lower data loss rate correlates with lower MAE values. It is evident that SSGCRTN consistently exhibits lower MAE (lighter colors) on these road nodes, strongly demonstrating its outstanding performance even in real-world scenarios with substantial missing data.

We also use relative error rates to quantify the differences between our model and other advanced baseline models. Across all datasets, the average MAE, RMSE, and MAPE values for our model are 12.70, 21.29, and 10.07%, respectively. The corresponding values for TAMP-S2GCNets are 13.72 (108.03%), 22.92 (107.66%), and 10.80% (107.25%); for DSTAGNN, 12.99 (102.28%), 21.62 (101.55%), and 10.24% (101.68%); and for DGCRN, 13.19 (103.86%), 21.88 (102.77%), and 11.47% (113.90%), where the percentages in parentheses express each value relative to ours. Overall, the improvement of our model over these models ranges from 1.55% to 13.90%. This improvement is crucial in traffic prediction, as even slight enhancements can significantly impact the accuracy of the final prediction results.

5.6 Ablation study

To substantiate the efficacy of various components of SSGCRTN, ablation studies are performed using the PeMS04 and PeMS08 datasets. We design five variants of the SSGCRTN model as follows:

  1.

    SSGC: This model specifically adopts our proposed SSGC for traffic prediction.

  2.

    STIM: This model removes the GTFM.

  3.

    w/o ISGC: This model removes the ISGC.

  4.

    w/o LSGC: This model removes the LSGC.

  5.

    w/o Reverse SSGCRN: This model removes the reverse SSGCRN from STIM.

The results of the ablation experiments are shown in Table 4. We draw the following conclusions:

Table 4 Component analysis of the SSGCRTN on PeMS04 and PeMS08 datasets
Fig. 9: The influence of \({n_{space}}\) and \(d_k\) on the model’s performance. Bar charts are used for numerical comparisons, and line charts are used for trend analysis. (a) As \({n_{space}}\) increases from 1 to 16, the overall performance of the model first increases and then decreases, with the best performance at \({n_{space}}\) of 16. (b) With \({n_{space}}\) fixed at 16, increasing \(d_k\) from 8 to 128 shows that the model performs best at \(d_k\) of 64

Fig. 10: Heatmaps of \({{A}^{IS}}\) and \(A_{i}^{LS}\) showing the last 50 sensor nodes in each dataset. The first to sixth rows are visualized on PeMS03, PeMS04, PeMS07, PeMS08, PeMSD7(M), and PeMSD7(L), respectively. The first column represents \({{A}^{IS}}\), and the last three columns are selected from the three specific spaces in \(A_{i}^{LS}\)

It’s evident that each component is indispensable for our model. This indicates that the model extracts different types of knowledge from multiple subgraphs through the causal fusion strategy, enhancing its perception of temporal and spatial dynamics and better capturing spatio-temporal features.

Compared to the other variants, SSGC exhibits the worst predictive performance, suggesting that focusing solely on the spatial correlations of traffic flow data, without deeply modeling the dynamic spatio-temporal associations between road network nodes over time, is insufficient for accurate traffic flow prediction. The fact that the w/o LSGC variant shows a larger error than w/o ISGC indicates that the dynamic features obtained from multiple latent spaces are more crucial than fixed spatial topological features. \(A_i^{LS}\), which is trained alongside the model, learns latent dependencies directly related to downstream tasks. The ISGC component also enhances predictive performance, suggesting that it learns the positional relationships between sensor nodes during training, capturing the intrinsic spatial dependencies between them. Furthermore, the unidirectional SSGCRN’s predictive performance falls short of STIM, indicating that the causal fusion strategy enables the model to learn and integrate spatio-temporal relationships in parallel at each time step; an in-depth study of the causal relationships in traffic events can achieve better predictions. Lastly, after removing GTFM, all three evaluation metrics increase significantly, indicating that GTFM greatly enhances the model’s ability to capture global spatio-temporal dependencies.

5.7 Hyperparameter effects

Since there may be multiple hidden spatial dependencies between nodes, our work introduces a multi-head mechanism, aiming for \(E_1\) and \(E_2\) to simulate dependencies in the latent space. Additionally, the node embedding dimension is an important parameter in the LSGC component, affecting the quality of the embeddings and determining whether SSGCRTN can effectively capture the diversity of spatial relationships. Figure 9 shows the prediction results of our model on PeMS08 with different hyperparameters. When we adjust one parameter, other parameters are set to their optimal defaults. The study finds that increasing \(n_{space}\) can lead to performance improvement. When \(n_{space}= 16\), the model performs best. This indicates that as \(n_{space}\) increases, the model acquires more latent spatial dependency information, validating the effectiveness of multiple latent spaces. SSGCRTN performs best when the embedding dimension is 64; both smaller and larger node embedding dimensions reduce performance. This may be because when the embedding dimension is small, the node embedding module can only contain relatively limited information, and when the embedding dimension is too large, the number of module parameters increases sharply, making it difficult for the model to optimize. In conclusion, finding the appropriate node embedding dimension is crucial for the spatio-temporal capture capability of SSGCRTN.

5.8 Analysis of multiple graphs

We visualize \({{A}^{IS}}\) and \(A_i^{LS}\) as heatmaps on six datasets, as shown in Fig. 10. \({{A}^{IS}}\) reflects the proximity of nodes in the intrinsic space, with its values determined by the actual distances between sensors – the closer the distance, the higher the value. \(A_i^{LS}\) reflects node interaction and similarity in the latent space, continuously adjusting as the model trains. In the \(A_i^{LS}\) heatmap, some rows, such as those within the red border, have higher values, indicating that the current node has a broad influence and affects most nodes. The dynamic graphs generated in multiple latent spaces fully utilize semantic associations between road nodes, better capturing latent traffic flow information. This pattern is found in the heatmaps in Fig. 10 (b)-(d), (f)-(h), (j)-(l), (n)-(p), (r)-(t), and (v)-(x). As shown in Fig. 10 (a), (e), (i), (m), (q), and (u), \({{A}^{IS}}\) heatmaps are relatively sparse compared to \(A_i^{LS}\), indicating lower intrinsic spatial correlation between sensor nodes in the original traffic network. \({{A}^{IS}}\) is static and cannot reflect real-time traffic conditions like peak hours. Due to the lack of specific geographical locations of sensor nodes and surrounding points of interest, the distribution characteristics of high-impact nodes in \(A_i^{LS}\) require further investigation.

6 Conclusion

This paper presents SSGCRTN, a novel approach to traffic prediction. It not only extracts intrinsic and various latent spatial dependencies through SSGC but also deeply explores the fusion of spatio-temporal correlations. Specifically, we combine SSGC with an RNN-based model to process spatio-temporal relationships in parallel at different times. SSGC not only captures existing links between road nodes but also explores latent node correlations under various factors. Since SSGC involves modeling multiple specific spaces, it may have limitations with large-scale node datasets. Additionally, SSGCRTN incorporates a temporal self-attention mechanism for each node \({{v}_{i}}\), crucial for identifying key traffic sequence features and understanding global spatio-temporal dependencies. Comprehensive experiments on six real traffic datasets show that SSGCRTN outperforms existing methods. As a generalized framework suitable for various time-series prediction tasks, we aim to adapt SSGCRTN for forecasting in areas like weather and air quality in future work.