1 Introduction

A time series is a set of random variables arranged in time order. Time series forecasting uses accumulated experience to estimate the unknown future values of these random variables from an analysis of historical observations. In traditional time series analysis, most studies concern univariate series. The classical models are the Auto-Regressive (AR) model [1], the Moving Average (MA) model [2], and the Auto-Regressive Moving Average (ARMA) model [3]. The AR model focuses on the historical observations themselves, while the MA model focuses on the accumulated forecast errors, smoothing out random fluctuations in the forecast. The ARMA model combines the two in a simple linear fashion, allowing the series to be analyzed from both perspectives in an integrated manner. However, these models can only handle small-scale, stationary, univariate time series under homoskedasticity. The later AutoRegressive Integrated Moving Average (ARIMA) model [4] made it possible to deal with non-stationary time series; its main idea is to transform the non-stationary series into a stationary one by differencing before analysis. In parallel, many models emerged for heteroskedastic, multivariate, and nonlinear time series analysis, for example the Threshold Autoregressive (TAR) model [5] and the Autoregressive Conditional Heteroskedasticity (ARCH) model [6]. The TAR model partitions the state space with thresholds and then uses several linear forms to handle nonlinear time series. The ARCH model applies to the heteroskedastic case; its basic idea is to assume the error term is conditionally normally distributed with a time-varying variance and to model that variance as a linear combination of the squared past errors over a finite window. Another central idea of traditional time series analysis is transforming the series to the frequency domain. Spectral analysis is a powerful tool for mining the hidden periodicity of time series; its main idea is to model the time series as a linear combination of sines and cosines via the Fourier transform [7]. Modern spectral analysis methods are still applied in many fields and perform excellently.

Traditional models are usually effective only for specific temporal patterns. Machine learning models generalize better than traditional models and can often handle more complex time series. Common choices are Support Vector Regression (SVR) [8], Random Forests (RF) [9], and Extreme Gradient Boosting (XGBoost) [10, 11]. SVR is the application of the SVM to regression problems; by selecting different kernel functions it can handle both linear and nonlinear regression. RF is an ensemble model that uses decision trees as its base learners and combines their outputs with specific rules to obtain the final predicted values. XGBoost, like Random Forest, is also an ensemble algorithm; however, instead of simply combining the outputs of the base models, it keeps adding trees that fit the residuals of the previous predictions, so that the predicted values progressively approach the actual values.
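As a minimal illustration of how such models are applied to forecasting, the sketch below (not taken from any of the cited works) frames one-step-ahead prediction as supervised regression over lagged observations and fits a random forest; the window length and toy series are arbitrary choices.

```python
# Hypothetical sketch: one-step-ahead forecasting as supervised regression on lagged values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_lagged(series, n_lags):
    """Turn a 1-D series into (samples, n_lags) lag features and next-step targets."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.randn(500)  # toy series
X, y = make_lagged(series, n_lags=12)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:-50], y[:-50])
pred = model.predict(X[-50:])  # predictions on the held-out tail of the series
```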

In recent years, with the development of hardware technology, the computing power of computers has improved significantly, and neural networks have become popular and developed rapidly. Many models for processing time series have also emerged in this field. For example, an LSTM [12, 13] model with temporal information enhancement (T-LSTM) was proposed by Mou et al. [14]; it builds on the original LSTM model and improves prediction accuracy by capturing the intrinsic correlation between traffic flow and temporal information. That model relies purely on recurrent neural networks. Later, Wen et al. [15] made a first attempt to use CNNs [16,17,18] for time series prediction problems. The main idea is to model the anomaly detection problem in time series as an image segmentation problem and then process the time series in a convolutional manner with a U-net-like [19, 20] architecture, which also achieves good results.

Although the above models can handle most time series prediction tasks, multivariate prediction still needs improvement. The emergence of graph neural networks (GNN) [21, 22] has dramatically challenged the dominance of traditional neural networks. Because the graph structure can represent non-Euclidean data, GNNs have received attention from several research fields, such as computer vision [23, 24] and natural language processing [16, 25]. Breakthroughs have also occurred in multivariate time series prediction. In 2020, Wu et al. [26] proposed the Multivariate Time series forecasting with Graph Neural Networks (MTGNN) model specifically for multivariate time series forecasting problems. In the same year, Song et al. [27] proposed the Spatial-Temporal Synchronous Graph Convolutional Networks (STSGCN) model, which effectively captures the complex correlations and heterogeneities of graphs through a spatiotemporal synchronous modelling mechanism. Since then, time series analysis models based on graph neural networks have flourished.

Graph neural networks have become an essential tool for mining the spatial dependence of multivariate time series. Many scholars have attempted to improve modelling methods that exploit the inherent dependencies among multivariate time series to improve prediction accuracy, and various graph models have emerged as a result. These graph operators act similarly to the different convolution kernels in convolutional networks. As is well known, image processing often requires matching different convolution kernels, such as dilated convolution [28, 29], grouped convolution [30], and separable convolution [31, 32], to achieve the desired results, since different kernels extract different levels of features from the image. Inspired by convolutional networks, this paper makes a first attempt to use multiple graph operators in feature extraction, using different graph operators to explore the spatial dependencies between nodes from different perspectives and thereby improve the accuracy of the model's predictions.

2 Research background

Traffic flow is a typical multivariate time series; it refers to the number of traffic entities, or other traffic indicators, passing through a location, section, or lane during a specific period. Most traffic flows are highly nonlinear [33], time-dependent, and uncertain, making it difficult for traditional time series models to meet the needs of practical applications. In recent years, graph neural networks have developed rapidly. Their powerful representational ability can be used to model the spatial structure of multivariate time series and achieve accurate predictions, and this has become an important approach to multivariate series problems. Typical examples are the Diffusion Convolutional Recurrent Neural Network (DCRNN) [34] and Gated Attention Networks for Learning on Large and Spatiotemporal Graphs (GaAN) [35]. DCRNN captures the spatial dependencies between nodes through a diffusion mechanism based on random walks on the graph; GaAN is based on a multi-head attention mechanism but adds a gate value to each head to adjust the importance of each attention head.

Most of the above algorithms model traffic flow from a single perspective only. However, given the complexity of traffic road networks, multiple spatial dependencies may coexist at the underlying layer, so a single type of network cannot fully exploit the dependency information among nodes. This paper proposes an integrated model, Recurrent Neural Networks Integrating Graph gated attention and Graph diffusion convolution Operators (iGoRNN), to address this problem. The model is based on an encoder-decoder architecture whose core building block is a graph operator integrator that efficiently fuses the information captured by different graph operators, improving the network's ability to understand the spatial dependence of multivariate time-series data.

3 Research methodology

3.1 Problem description

In this paper, the sensor network of a road section is abstracted as a weighted directed graph \( \mathcal {G} = (\mathcal {V}, \mathcal {E}, \textbf{W}) \), where \( \mathcal {V} \) is the set of sensors in the road network; \( N = \vert \mathcal {V} \vert \) denotes the number of sensors; \( \mathcal {E} \) denotes the set of edges, representing the links between sensors; and \( \textbf{W} \in \mathbb {R}^{N \times N} \) is a weighted proximity matrix representing the nodes' proximity.

To facilitate the description of the problem, we define the traffic flow signals collected by all sensors at time t as \( \textbf{X}^{(t)} \in \mathbb {R}^{N \times P} \), where P denotes the number of traffic indicators detected by the sensors, and \( \hat{\textbf{X}}^{(t)} \in \mathbb {R}^{N \times P} \) as the predicted value of the traffic indicators at time t. In turn, \( \mathcal {X} = \left\{ \textbf{X}^{(t - T^{\prime } + 1)},\textbf{X}^{(t - T^{\prime } + 2)}, \cdots , \textbf{X}^{(t)} \right\} \) represents the observed values over the historical \( T^{\prime } \) timestamps, and \( \mathcal {\hat{X}} = \left\{ \hat{\mathbf {{X}}}^{(t+1)},\hat{\mathbf {{X}}}^{(t+2)}, \cdots , \hat{\mathbf {{X}}}^{(t+T)} \right\} \) represents the predicted values for the future T timestamps. Therefore, the ultimate goal of this paper is to learn a function \( \mathscr {H}(\cdot )\) that maps \( T^{\prime } \) historical timestamps to T future timestamps.

$$\begin{aligned} \mathcal {\hat{X}} = \mathscr {H}(\mathcal {X}) \end{aligned}$$
(1)
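To make the input-output contract of (1) concrete, the following shape-level sketch (ours, not the paper's code) shows the mapping from \( T^{\prime } \) historical frames to T predicted frames, with a naive repeat-last-frame placeholder standing in for the learned function \( \mathscr {H} \); the sensor count and other numbers are illustrative.

```python
# Shape-level sketch of the forecasting map in (1); all concrete numbers are illustrative.
import numpy as np

N, P = 207, 1            # number of sensors and indicators per sensor (toy values)
T_hist, T_pred = 12, 12  # T' historical timestamps and T future timestamps

def H(X_hist: np.ndarray) -> np.ndarray:
    """Placeholder for the learned map: (T', N, P) observations -> (T, N, P) predictions."""
    assert X_hist.shape == (T_hist, N, P)
    return np.repeat(X_hist[-1:], T_pred, axis=0)  # naive 'repeat last frame' stand-in

X_hat = H(np.random.rand(T_hist, N, P))
assert X_hat.shape == (T_pred, N, P)
```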

3.2 Model architecture

In this paper, we design an encoder-decoder architecture, shown in Fig. 1. Its encoding and decoding process is mainly carried out by graph operator integrators (Go-Integrators): the vertical flow of information in the figure can be understood as the encoding and decoding process, and the horizontal flow as the feature capture process. For more complex multivariate time-series data, the depth of information mining can be increased by stacking integrator layers horizontally.

Fig. 1 Model Body Architecture

The encoder’s input in the model is the historical observation \( \mathcal {X} \); the output of the decoder is the future prediction \( \mathcal {\hat{X}} \); \( \textbf{H} \in \mathbb {R}^{N\times Q} \) denotes the input or output of the intermediate hidden layer.

3.2.1 Graph operator integrator

The Graph Operator Integrator (Go-Integrator) is the core building block of iGoRNN: at the higher level it controls the overall flow of information, and at the lower level it fuses the information captured by each graph operator. To enable the integrator to perform this fusion efficiently, we present a first feasible integrator architecture, see Fig. 2. Subfigure (a) is the standard information integrator for the starting and intermediate nodes of the encoder-decoder. Since the last layer of the encoder does not need to produce an output, we also designed the simplified architecture in subfigure (b), which saves computational resources to a certain extent.

Fig. 2 Go-integrator unit

Two graph networks are used inside the integrator for information mining: a diffusion convolutional network based on a static graph and a gated attention network based on a dynamic graph. The static graph objectively captures the physical distances between sensors, while the dynamic graph singles out the neighbouring nodes that have a high impact on the central node. The two graph networks, one grounded in physical space and one in value space, each capture spatial dependencies between nodes, and the integrator then fuses these multiple features.

In the complete integrator unit, the historical data are sent, together with the hidden states, to two Graph Gated Recurrent Units (GGRU) [36, 37] for computation. Their outputs are then concatenated with the original input and finally mapped to the specified dimension by a feed-forward layer. Equation (2) gives the detailed fusion process.

$$\begin{aligned} \hat{\textbf{H}}_{o}^{(t)} = \mathscr {F}_{d_{o}}^{\tau }\left( \textbf{X}^{(t)} \parallel \coprod _{i=1}^{I} \mathscr {R}[\Gamma _{i}]\left( \textbf{X}^{(t)}, \textbf{H}_{i}^{(t-1)}\right) \right) \end{aligned}$$
(2)

Here, \( \textbf{X}^{(t)} \) denotes the historical observation at the current moment, and I denotes the number of GGRU units involved in the calculation. \( \textbf{H}_{i}^{(t-1)} \) denotes the output of the i-th GGRU unit in the Go-Integrator of the previous layer, which also serves as the input of the i-th GGRU unit in this layer, and \( \hat{\textbf{H}}_{o}^{(t)} \) denotes the final output of the Go-Integrator. \( \parallel \) denotes concatenation, and \( \coprod \) indicates sequential concatenation. \( \mathscr {R}[\Gamma _{i}] \) denotes the GGRU unit using the graph operator \( \Gamma _{i} \), \( \mathscr {F}_{d_{o}}^{\tau } \) denotes the feed-forward neural network, \( d_{o} \) represents the dimension of the model input, and \( \tau \) denotes the use of the \( \tanh \) activation function.
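A minimal PyTorch-style sketch of the fusion in (2) is given below; it assumes each GGRU unit is any module mapping \( (\textbf{X}^{(t)}, \textbf{H}_{i}^{(t-1)}) \) to a new hidden state, and all class names, dimensions, and the stand-in GRU cells in the usage lines are our own, not the authors' implementation.

```python
# Sketch of the Go-Integrator fusion in (2): run I GGRU units, concatenate their
# outputs with the raw input, and map to the output dimension with a tanh feed-forward layer.
import torch
import torch.nn as nn

class GoIntegrator(nn.Module):
    def __init__(self, ggru_units, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.ggrus = nn.ModuleList(ggru_units)          # the I graph-gated recurrent units
        fused = in_dim + len(ggru_units) * hidden_dim   # X^(t) concatenated with every H_i
        self.ff = nn.Linear(fused, out_dim)             # feed-forward layer F^tau_{d_o}

    def forward(self, x_t, h_prev_list):
        # x_t: (N, P) node features; h_prev_list[i]: (N, Q) hidden state of the i-th GGRU
        h_new = [ggru(x_t, h) for ggru, h in zip(self.ggrus, h_prev_list)]
        fused = torch.cat([x_t] + h_new, dim=-1)        # "X || concat_i R[Gamma_i](X, H_i)"
        return torch.tanh(self.ff(fused)), h_new        # output H_o and updated hidden states

# Usage with plain GRU cells standing in for GGRU(DC) and GGRU(GA):
N, P, Q = 207, 2, 64
units = [nn.GRUCell(P, Q), nn.GRUCell(P, Q)]
integ = GoIntegrator(units, in_dim=P, hidden_dim=Q, out_dim=Q)
out, states = integ(torch.rand(N, P), [torch.zeros(N, Q), torch.zeros(N, Q)])
```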

3.2.2 Graph gated recurrent unit

The Graph Gated Recurrent Unit (GGRU) is a particular type of Gated Recurrent Unit (GRU). GGRU replaces the fully connected layers of the GRU with a graph operator, see Fig. 3. Compared with the classical GRU, GGRU not only learns temporal patterns but also mines spatial dependencies between sequences through the graph operator, making it applicable to spatial time-series data.

Fig. 3 Graph gated recurrent unit

The macro architecture of GGRU is consistent with that of an ordinary GRU (Fig. 3), where \( \textbf{X}^{(t)}, \textbf{H}^{(t)} \) denote the input and output at timestamp t; \( \textbf{r}^{(t)},\textbf{u}^{(t)} \) denote the states of the reset gate and the update gate at timestamp t; and \( \Theta _{r}, \Theta _{u}, \Theta _{C} \) are the different filter parameters. \( \Gamma _{\mathcal {G}} \) denotes the execution of the \( \Gamma \) operator on the specified graph \( \mathcal {G} \), and \( \odot \) denotes the Hadamard product. The specific calculation process is given in the following equations.

$$\begin{aligned} \textbf{r}^{(t)}= & {} \sigma (\Gamma _{\mathcal {G}\Theta _{r}} [\textbf{X}^{(t)} \parallel \textbf{H}^{(t-1)}] + b_{r}) \end{aligned}$$
(3)
$$\begin{aligned} \textbf{u}^{(t)}= & {} \sigma (\Gamma _{\mathcal {G}\Theta _{u}} [\textbf{X}^{(t)} \parallel \textbf{H}^{(t-1)}] + b_{u}) \end{aligned}$$
(4)
$$\begin{aligned} \textbf{C}^{(t)}= & {} \tanh (\Gamma _{\mathcal {G}\Theta _{C}} [\textbf{X}^{(t)} \parallel (\textbf{r}^{(t)} \odot \textbf{H}^{(t-1)})]+ b_{c}) \end{aligned}$$
(5)
$$\begin{aligned} \textbf{H}^{(t)}= & {} \textbf{u}^{(t)} \odot \textbf{H}^{(t-1)} + (1-\textbf{u}^{(t)})\odot \textbf{C}^{(t)} \end{aligned}$$
(6)

Equations (3) and (4) are the update process of the gating states. The GGRU first fuses the input data \( \textbf{X}^{(t)} \) of the current moment with the output data \( \textbf{H}^{(t-1)} \) of the previous moment and then feeds the result into the graph operator network to capture spatial dependence. Finally, after activation by the sigmoid function, the gating states \( \textbf{r}^{(t)} \) and \( \textbf{u}^{(t)} \) at time t are obtained, with entries in (0, 1). Once the gating signals are updated, the reset and update gates can capture the essential features in the current message.

Equation (5) is the process of capturing essential features using the reset gate. First, the reset gate resets the hidden matrix \( \textbf{H}^{(t-1)} \); this reset is similar to forgetting, further reducing the historical information retained in \( \textbf{H}^{(t-1)} \). The result is then fused with the observed data \( \textbf{X}^{(t)} \) at the current moment and fed into the graph operator network for spatial feature extraction. Finally, activation by the \( \tanh \) function yields the candidate matrix \( \textbf{C}^{(t)} \).

The final step of the algorithm, (6), updates the historical information \( \textbf{H}^{(t-1)} \) using the candidate matrix \( \textbf{C}^{(t)} \), which yields the output \( \textbf{H}^{(t)} \) at the current moment.

In practice, we embed the diffusion convolution and gated attention operators into the GGRU to obtain two types of feature capturers, GGRU(DC) and GGRU(GA), respectively. These two GGRUs capture node information from different perspectives and thus improve the network's understanding of spatial-temporal data.
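The sketch below illustrates equations (3)-(6) as a recurrent cell in which the graph operator \( \Gamma \) is abstracted as a callable over the concatenated node features; the class name and the plain linear layers used as stand-ins for \( \Gamma \) in the usage lines are our assumptions, not the paper's code. In iGoRNN, those callables would be the diffusion-convolution or gated-attention operators of the following subsections.

```python
# Sketch of the GGRU cell, equations (3)-(6); Gamma is any map from (N, P+Q) features to (N, Q).
import torch
import torch.nn as nn

class GGRUCell(nn.Module):
    def __init__(self, graph_op_r, graph_op_u, graph_op_c, hidden_dim):
        super().__init__()
        self.g_r, self.g_u, self.g_c = graph_op_r, graph_op_u, graph_op_c
        self.b_r = nn.Parameter(torch.zeros(hidden_dim))
        self.b_u = nn.Parameter(torch.zeros(hidden_dim))
        self.b_c = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)            # X^(t) || H^(t-1)
        r = torch.sigmoid(self.g_r(xh) + self.b_r)       # (3) reset gate
        u = torch.sigmoid(self.g_u(xh) + self.b_u)       # (4) update gate
        xc = torch.cat([x_t, r * h_prev], dim=-1)        # X^(t) || (r * H^(t-1))
        c = torch.tanh(self.g_c(xc) + self.b_c)          # (5) candidate state
        return u * h_prev + (1.0 - u) * c                # (6) new hidden state H^(t)

# Usage with plain linear layers standing in for the graph operator Gamma:
N, P, Q = 207, 2, 64
mk = lambda: nn.Linear(P + Q, Q)
cell = GGRUCell(mk(), mk(), mk(), hidden_dim=Q)
h = cell(torch.rand(N, P), torch.zeros(N, Q))
```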

3.2.3 Graph diffusion convolution operator

Graph Convolutional Networks (GCN) [38,39,40] introduce convolution on general graph-structured data. For such non-grid data, instead of predicting each node individually, a GCN aggregates as much useful information as possible from the neighboring nodes of each node, in a process similar to image convolution. Like CNNs, GCNs have properties such as local feature learning, parameter sharing, and invariance.

However, a GCN uses only the direct neighbors of each node, and the relationships between nodes cannot simply be interpreted as binary adjacency; the actual relationships are much more complex. Diffusion Convolutional Neural Networks (DCNN) [41, 42], a special kind of GCN, break through this limitation and aim to mine the spatial dependencies between nodes at a deeper level. In 2018, Li et al. [34] constructed the first DCRNN model based on the DCNN model to solve spatiotemporal prediction problems. Equation (7) expresses the diffusion convolution process.

$$\begin{aligned} \textbf{H}= & {} \coprod _{q=1}^{Q} \left( a \sum \limits _{p=1}^{P} (\Theta _{O}\mathcal {P}_{O}^{K} + \Theta _{I}\mathcal {P}_{I}^{K})\textbf{X}_{[:,p]} \right) \nonumber \\ \mathcal {P}_{O}^{K}= & {} \sum \limits _{k=0}^{K} \alpha (1 - \alpha )^{k} (\textbf{D}_{O}^{-1} \textbf{W})^{k}\nonumber \\ \mathcal {P}_{I}^{K}= & {} \sum \limits _{k=0}^{K} \beta (1 - \beta )^{k} (\textbf{D}_{I}^{-1} \textbf{W})^{k} \end{aligned}$$
(7)

Here, \( \textbf{X} \in \mathbb {R}^{N\times P} \) represents the model's input, and \( \textbf{H} \in \mathbb {R}^{N \times Q} \) represents the model's output. The diffusion process is truncated after a finite number of K steps and eventually maps the P-dimensional input features to Q-dimensional outputs. \( \textbf{D}_{O}, \textbf{D}_{I} \) denote the out-degree and in-degree matrices of the graph, respectively, so \( \textbf{D}_{O}^{-1}\textbf{W},\textbf{D}_{I}^{-1}\textbf{W} \) represent the forward and reverse state transition matrices. \( \alpha ,\beta \in [0,1] \) denote the restart probabilities of the random walk, and \( \Theta _{O},\Theta _{I} \in \mathbb {R}^{Q\times P\times K}\) are the model parameters, with \( \Theta _{O[q,p,k]} = \alpha (1 - \alpha )^{k} \) and \( \Theta _{I[q,p,k]} = \beta (1 - \beta )^{k} \). \( \mathcal {P}_{O[i,:]}^{K}, \mathcal {P}_{I[i,:]}^{K} \) denote the probabilities of performing K-step diffusion from node i to its neighboring nodes, in the forward and reverse directions respectively.

Equation (7) describes a two-way diffusion process, which gives the model the ability to capture both upstream and downstream traffic impacts. In particular, if the restart probability \( \beta = 0 \), the model degenerates to a one-way diffusion process that can be applied to capture features on particular road segments, giving the model more flexibility and adaptability.
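The following sketch illustrates the two-way truncated diffusion of (7) under a simplified parameterization: the restart-weighted K-step transition matrices are precomputed and a learnable \( P \rightarrow Q \) linear map is applied per direction, with a generic ReLU standing in for the activation a; the reverse direction uses the transposed adjacency, as in DCRNN. The dimensions, the toy adjacency, and these simplifications are ours.

```python
# Sketch of two-way truncated graph diffusion, a simplified reading of (7).
import torch
import torch.nn as nn

def diffusion_matrix(W, K, restart):
    """Sum_{k=0..K} restart*(1-restart)^k * (D^{-1} W)^k with row-normalized transitions."""
    T = W / W.sum(dim=1, keepdim=True).clamp(min=1e-8)   # one-step transition matrix
    P_k, out = torch.eye(W.size(0)), torch.zeros_like(W)
    for k in range(K + 1):
        out = out + restart * (1.0 - restart) ** k * P_k
        P_k = P_k @ T
    return out

class DiffusionConv(nn.Module):
    def __init__(self, in_dim, out_dim, W, K=2, alpha=0.1, beta=0.1):
        super().__init__()
        self.register_buffer("P_fwd", diffusion_matrix(W, K, alpha))      # forward diffusion
        self.register_buffer("P_bwd", diffusion_matrix(W.t(), K, beta))   # reverse diffusion
        self.theta_fwd = nn.Linear(in_dim, out_dim, bias=False)
        self.theta_bwd = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, X):                                 # X: (N, P) node features
        return torch.relu(self.theta_fwd(self.P_fwd @ X) +
                          self.theta_bwd(self.P_bwd @ X))

W = torch.rand(207, 207)                                  # toy weighted adjacency
H = DiffusionConv(in_dim=2, out_dim=64, W=W, K=2)(torch.rand(207, 2))    # (207, 64)
```

With this parameterization, setting beta to 0 zeroes every reverse-direction weight, recovering the one-way diffusion mentioned above.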

Since the graph diffusion convolution operator is based on a static graph, its adjacency matrix is constructed with a distance-based Gaussian kernel [43], see (8).

$$\begin{aligned} w_{ij} = {\left\{ \begin{array}{ll} \exp \left( -\dfrac{\text {dist}(v_{i}, v_{j})^{2}}{\sigma ^{2}} \right) &{} \text {if dist}(v_{i}, v_{j}) \le \kappa \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

Here, \( v_{i} \) denotes the i-th sensor in the road network, \( \text {dist}(v_{i}, v_{j}) \) and \( w_{ij} \) denote the actual distance and the weight between two sensors, respectively, and \( \sigma \) is the standard deviation of the distance set. \( \kappa \) is the threshold parameter: lowering the threshold removes more distant pairs and makes the graph sparser, which can improve the convergence speed of the model.
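A small numpy sketch of (8) is given below; the toy distance matrix and the variable names are illustrative only.

```python
# Sketch of the thresholded Gaussian-kernel adjacency in (8).
import numpy as np

def gaussian_adjacency(dist, kappa):
    """w_ij = exp(-dist_ij^2 / sigma^2) if dist_ij <= kappa, else 0."""
    sigma = dist.std()                        # standard deviation of the distance set
    W = np.exp(-np.square(dist) / sigma ** 2)
    W[dist > kappa] = 0.0                     # a smaller kappa removes more distant pairs
    return W

rng = np.random.default_rng(0)
dist = rng.uniform(1.0, 10.0, size=(5, 5))    # toy pairwise road-network distances
dist = (dist + dist.T) / 2.0                  # make the toy matrix symmetric
np.fill_diagonal(dist, 0.0)
W = gaussian_adjacency(dist, kappa=5.0)
```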

3.2.4 Graph gated attention operator

Graph gated attention networks (GaAN) are a special kind of graph attention network (GAT) [44, 45]. Classical attention mechanisms emerged in computer vision and flourished in natural language processing. Instead of attending to all of the information equally, the attention mechanism focuses limited attention on the critical information; its advantage is that it extracts the most helpful information with fewer parameters and avoids wasting resources, although it may also cause information loss. To address this, Vaswani et al. [46] proposed the multi-head attention mechanism [47, 48], which adds multiple attention heads to the network, each projecting into a separate subspace, allowing the model to capture information from several different perspectives simultaneously. However, the multi-head attention mechanism still has a shortcoming: it does not consider the differences in importance between the heads. Later, Zhang et al. [35] proposed the gated attention mechanism, which explores multiple representation subspaces between the central node and its neighbouring nodes while assessing the value of these subspaces, dynamically adjusting each subspace's contribution to the outcome in a gated manner.

Suppose the K-head attention mechanism is applied to Graph Gated Attention networks for feature capture. For each node i, a K-dimensional gating vector \( \textbf{g}_{i} \) is used to regulate the contribution of each attention head. Equation (9) gives the specific calculation of the gating vector \( \textbf{g}_{i} \).

$$\begin{aligned} \textbf{g}_{i}= & {} [g_{i}^{(1)}, \cdots , g_{i}^{(K)}] \nonumber \\= & {} \mathscr {L}_{\theta _{K}}^{\sigma } \left( \textbf{x}_{i} \parallel \mathop {\textbf{Max}}_{j\in \mathcal {N}_{i}} \left( \{ \mathscr {L}_{\theta _{m}} (\textbf{z}_{j})\} \right) \parallel \dfrac{\sum _{j\in \mathcal {N}_{i}} \textbf{z}_{j}}{\vert \mathcal {N}_{i}\vert }\right) \end{aligned}$$
(9)

Given node i, all its neighboring nodes are represented by the set \( \mathcal {N}_{i} \). \( \textbf{x}_{i} = \textbf{X}_{i,:} \) denotes the input feature vector of this node; \( \textbf{z}_{\mathcal {N}_{i}} = \{ \textbf{z}_{j} \mid j \in \mathcal {N}_{i} \}\) denotes the set of reference vectors of all its adjacent nodes, where \( \textbf{z}_{i} = \mathscr {L}_{\theta _{h}}(\textbf{x}_{i}) \). \( \mathop {\textbf{Max}} \) denotes the element-wise maximum. \( \mathscr {L}_{\theta _{K}}^{\sigma } \) denotes mapping the vector to K dimensions and scaling the result to the [0, 1] interval with the sigmoid function.

Since Graph Gated Attention is a dynamic network, it needs to calculate the weight relationship between node i and its neighbors in real-time. The specific calculation process is shown in (10).

$$\begin{aligned} w_{ij}^{k}= & {} \dfrac{\exp {(\phi _{w}^{(k)} (\textbf{x}_{i}, \textbf{z}_{j} ))}}{\sum _{l=1}^{\vert \mathcal {N}_{i}\vert } \exp {(\phi _{w}^{(k)}(\textbf{x}_{i}, \textbf{z}_{l} ))} }\nonumber \\ \phi _{w}^{(k)}(\textbf{x}, \textbf{z})= & {} \langle \mathscr {L}_{\theta _{xa}^{(k)}}(\textbf{x}), \mathscr {L}_{\theta _{za}^{(k)}}(\textbf{z}) \rangle \end{aligned}$$
(10)

Since there are a total of K attention heads in the Graph Gated Attention Operator algorithm, we must compute a weight matrix for each head. Here \( w_{ij}^{k} \) denotes the weights of node i and node j under the k-th attention head. \( \mathscr {L}_{\theta _{xa}^{(k)}}, \mathscr {L}_{\theta _{za}^{(k)}} \) is used to generate the query and key vectors of dimension \( d_{a} \). \( \langle \cdot , \cdot \rangle \) denotes the inner product operation.

Once the gating values and weight matrices of each attention head are obtained, the graph aggregation process for node i can be completed. The information captured by the K attention heads is first multiplied by the respective gating values and then concatenated with the original input. Finally, the output vector \( \textbf{y}_{i} \) is obtained after mapping by the fully connected layer. Equation (11) describes the implementation.

$$\begin{aligned} \begin{aligned} \textbf{y}_{i} = \mathscr {L}_{\theta _{o}}\left( \textbf{x}_{i} \parallel \coprod _{k=1}^{K} \left( g_{i}^{(k)} \sum _{j\in \mathcal {N}_{i}} w_{i,j}^{(k)} \mathscr {L}_{\theta _{v}^{(k)}}^{\iota } (\textbf{z}_{j}) \right) \right) \end{aligned} \end{aligned}$$
(11)

Here, \( \mathscr {L}_{\theta _{v}^{(k)}} \) generates a value vector of dimension \( d_{v} \), \( \mathscr {L}_{\theta _{o}} \) is responsible for mapping the final output to the specified dimension, and \( \iota \) denotes the LeakyReLU activation function.
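The sketch below walks through (9)-(11) for a single centre node and a dense tensor of its neighbours' features; the module names, the dimensions d_z, d_a, d_v, and the single-node formulation are our simplifications rather than the GaAN reference implementation.

```python
# Single-node sketch of gated multi-head graph attention, equations (9)-(11).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphAttention(nn.Module):
    def __init__(self, in_dim, d_z, d_a, d_v, K, out_dim):
        super().__init__()
        self.K, self.d_a, self.d_v = K, d_a, d_v
        self.ref = nn.Linear(in_dim, d_z)                # reference vectors z_j
        self.gate_m = nn.Linear(d_z, d_z)                # map inside the max-pool of (9)
        self.gate = nn.Linear(in_dim + 2 * d_z, K)       # produces the K gate values
        self.q = nn.Linear(in_dim, K * d_a)              # per-head query of x_i
        self.k = nn.Linear(d_z, K * d_a)                 # per-head key of z_j
        self.v = nn.Linear(d_z, K * d_v)                 # per-head value of z_j
        self.out = nn.Linear(in_dim + K * d_v, out_dim)  # final projection

    def forward(self, x_i, x_nbr):                       # x_i: (F,), x_nbr: (M, F)
        z = self.ref(x_nbr)                                            # (M, d_z)
        g = torch.sigmoid(self.gate(torch.cat(                         # (K,) gates, eq (9)
            [x_i, self.gate_m(z).max(dim=0).values, z.mean(dim=0)])))
        q = self.q(x_i).view(self.K, self.d_a)                         # (K, d_a)
        k = self.k(z).view(-1, self.K, self.d_a)                       # (M, K, d_a)
        w = F.softmax((k * q).sum(-1), dim=0)                          # (M, K) weights, eq (10)
        v = F.leaky_relu(self.v(z)).view(-1, self.K, self.d_v)         # (M, K, d_v)
        heads = (w.unsqueeze(-1) * v).sum(dim=0)                       # per-head aggregation
        gated = (g.unsqueeze(-1) * heads).reshape(-1)                  # gate then concatenate heads
        return self.out(torch.cat([x_i, gated]))                       # output y_i, eq (11)

attn = GatedGraphAttention(in_dim=2, d_z=16, d_a=8, d_v=8, K=2, out_dim=16)
y_i = attn(torch.rand(2), torch.rand(5, 2))               # centre node with 5 neighbours
```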

4 Experimental design

4.1 Experimental setup

The METR-LA and PeMS-BAY datasets used in the experiments were obtained from Li et al. [34]. They contain traffic information from freeways in Los Angeles County (METR-LA) and the San Francisco Bay Area (PeMS-BAY). The nodes in each dataset represent the traffic speed sensors, and the edge weights are proximities computed from the road-network distances between sensors using (8). The sampling interval of the sensors is 5 minutes. Table 1 records the detailed statistics of the data, and Fig. 4 shows the traffic speed variation over 48 hours on three road segments of each dataset.

Table 1 Statistical information of the experimental dataset
Fig. 4 Traffic flow changes over 48 hours on three road sections of the METR-LA and PeMS-BAY datasets

Table 2 Performance comparison of different traffic speed prediction models on the METR-LA dataset

This paper uses the first 70% of each data set as the training set, the middle 10% as the validation set, and the last 20% as the test set. A sliding window of size 12 is used to group the sequences, and the grouped data are fed to the model to predict the traffic flow in the next hour. In the integrator, the diffusion convolution operator uses a bi-directional diffusion mode with the truncation step set to 2, and the number of attention heads in the gated attention operator is set to 2. The learning rate is initialized to 0.01 and dynamically adjusted during training with the Adam optimizer using a multiplier of \( 0.99^{\text {epoch}} \).
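The data preparation just described can be sketched as follows; the array shapes and the placeholder readings are illustrative, not the actual METR-LA or PeMS-BAY files.

```python
# Sketch of the chronological 70/10/20 split and the size-12 sliding windows.
import numpy as np

def make_windows(data, t_in=12, t_out=12):
    """data: (T, N) readings -> inputs (S, t_in, N) and targets (S, t_out, N)."""
    xs, ys = [], []
    for s in range(len(data) - t_in - t_out + 1):
        xs.append(data[s:s + t_in])
        ys.append(data[s + t_in:s + t_in + t_out])
    return np.stack(xs), np.stack(ys)

data = np.random.rand(5000, 207)            # placeholder for (timesteps, sensors) speed readings
n = len(data)
train, val, test = data[:int(0.7 * n)], data[int(0.7 * n):int(0.8 * n)], data[int(0.8 * n):]
x_train, y_train = make_windows(train)      # 5-minute steps: 12 steps of history -> next hour
```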

Before training starts, the data are transformed into a distribution with mean 0 and variance 1 using the expectation and sample variance of the training set. The advantage of this normalization is that the data are restricted to a smaller range of values, which facilitates the computation of gradients in subsequent steps and ensures fast convergence of the model. A uniform data scale also avoids the influence of differing value ranges on the training results. Equations (12) and (13) correspond to the normalization and denormalization processes, respectively.

$$\begin{aligned} \textbf{X}^{*}= & {} \frac{\textbf{X} - E(\textbf{X})}{\sqrt{D(\textbf{X})}} \end{aligned}$$
(12)
$$\begin{aligned} \textbf{X}= & {} \textbf{X}^{*} \cdot \sqrt{D(\textbf{X})} + E(\textbf{X}) \end{aligned}$$
(13)

Here, \( \textbf{X} \) denotes the training set sample, and \( E(\textbf{X}),D(\textbf{X}) \) indicates the training set sample mean and sample variance, respectively.
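A compact sketch of (12) and (13), with the statistics estimated on the training split (variable names are ours):

```python
# z-score normalization (12) and the matching denormalization (13).
import numpy as np

X = np.random.rand(1000, 207)        # stand-in for the training-set observations
mean, std = X.mean(), X.std()        # E(X) and sqrt(D(X)) from the training set only
X_norm = (X - mean) / std            # (12) applied before training
X_back = X_norm * std + mean         # (13) applied to the model outputs
```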

4.2 Comparison experiments

In this section, we select standard baseline models from four major classes of time series analysis: the statistical learning model ARIMA [49], the machine learning model LSVM [50], the deep learning model FC-LSTM [51], and the graph-based deep learning models DCRNN [34], STGCN [52], GaAN [35], ASTGCN [53], and GMAN [54]. The experiments use Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) as evaluation metrics to assess the prediction performance of the models.

$$\begin{aligned} \text {MAE}(\textbf{x}, \hat{\textbf{x}})= & {} \dfrac{1}{\vert \Omega \vert } \sum _{i \in \Omega } \vert x_{i} - \hat{x}_{i} \vert \end{aligned}$$
(14)
$$\begin{aligned} \text {RMSE}(\textbf{x}, \hat{\textbf{x}})= & {} \sqrt{\dfrac{1}{\vert \Omega \vert } \sum _{i\in \Omega } (x_{i} - \hat{x}_{i})^{2} } \end{aligned}$$
(15)
$$\begin{aligned} \text {MAPE}(\textbf{x}, \hat{\textbf{x}})= & {} \dfrac{1}{\vert \Omega \vert } \sum _{i \in \Omega } \left| \dfrac{x_{i} - \hat{x}_{i}}{x_{i}} \right| \end{aligned}$$
(16)
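For reference, the three metrics can be computed as in the sketch below, with \( \Omega \) taken as all predicted entries; any masking of missing values is omitted here.

```python
# MAE (14), RMSE (15), and MAPE (16) over all predicted entries.
import numpy as np

def mae(x, x_hat):
    return np.mean(np.abs(x - x_hat))

def rmse(x, x_hat):
    return np.sqrt(np.mean((x - x_hat) ** 2))

def mape(x, x_hat):
    return np.mean(np.abs((x - x_hat) / x))

x, x_hat = np.array([60.0, 55.0, 40.0]), np.array([58.0, 57.0, 35.0])
print(mae(x, x_hat), rmse(x, x_hat), mape(x, x_hat))
```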
Table 3 Performance comparison of different traffic speed prediction models on the PeMS-BAY dataset
Fig. 5 Eight different Integrator structures. \( \dagger \) indicates the use of a serial structure, \( \ddagger \) indicates the use of a parallel structure, \( \Downarrow \) indicates that each GGRU unit has separate hidden layer integrator inputs and outputs, and \( \downarrow \) indicates that all GGRU units share the same hidden layer. \( \prime \) indicates the inclusion of a residual-like structure

Tables 2 and 3 compare the prediction results of the iGoRNN model with the other baseline models. Comparing the scores of each model on short-term and long-term predictions, we can see that the graph-based neural network models achieve better prediction accuracy. For smoother sequences (PeMS-BAY), some traditional models achieve good results for short-term predictions (15 min) but perform poorly for long-term predictions (60 min). For unstable sequences (METR-LA), conventional models perform poorly on both short- and long-term predictions. On the other hand, the GaAN model performs well on the METR-LA dataset, while GMAN is more suitable for smoother series. All of this illustrates the importance of introducing graph structure in complex time series forecasting. Moreover, the model we designed shows optimal or suboptimal performance in both long-term and short-term forecasting, especially for more complex time series, because it draws on the advantages of multiple graph structures simultaneously.

4.3 Ablation experiments

We designed several similar Integrator structures (Fig. 5) to evaluate our model comprehensively and did further ablation experiments using the METR-LA dataset.

Fig. 6 Decreasing curves of the loss value for different prediction lengths of the iGoRNN model with different Integrators

Fig. 7 Performance analysis of the iGoRNN model using different Integrators. Sub-figure (a) shows the optimal MAE loss value of the model for different prediction lengths, and sub-figure (b) shows the number of floating-point operations for each model

Figure 6 shows the decrease of the MAE loss on the training and validation sets for different prediction lengths. Most integrators converge to similar loss values, except for individual variants whose convergence is noticeably poorer. Fig. 7-a shows the optimal MAE loss values of the eight final models, from which we can see that iGoRNN\(^{\ddagger \downarrow \prime } \) achieves the best results for short-term predictions, while iGoRNN\(^{\dagger \downarrow \prime } \) works best for long-term forecasts. Fig. 7-b shows the number of Floating Point Operations (FLOPs) required by each model. The parallel structure is generally less computationally intensive than the serial structure, mainly because the second graph operator of the latter is sensitive to the intermediate dimensionality of the model. Moreover, the parallel structure is more amenable to parallelized computation and therefore has an advantage in large-scale settings.

Due to the limitation of experimental resources, we only compare the performance and efficiency of the iGoRNN model with a single-layer stacking structure here; nevertheless, it already shows relatively good performance. We plan to conduct a more in-depth analysis on more complex datasets in the future, including exploring more efficient ways of aggregating information and the combined effect of the number of stacked layers on prediction accuracy and time consumption.

5 Conclusion

Multivariate time series analysis has applications in economics, sociology, meteorology, environmental science, engineering, and many other fields. Researchers have proposed many time series models to meet the needs of various sectors of society. These models share a core idea: mining as much high-quality information as possible from historical observations to achieve accurate forecasting. For univariate time series, a model can only achieve precise forecasting by thoroughly learning the hidden periodic patterns of the series. For multivariate time series, in addition to mining temporal patterns from each series itself, exploiting the potential dependencies between the series also has a positive effect on the final prediction results. To fully exploit the spatial dependencies among multivariate time series, this paper designs an integrated model, iGoRNN. The model assumes that no single graph network can mine all the underlying correlations among multivariate time series; hence, iGoRNN utilizes multiple graph operators to capture high-quality information from the series more comprehensively and from different perspectives. The network as a whole adopts a recurrent encoder-decoder structure so that the model can aggregate temporal features while capturing spatial features, instead of fragmenting the intertwined spatiotemporal features. Finally, this paper conducts comparison experiments with baseline models on the publicly available METR-LA and PeMS-BAY datasets. The experimental results show that the iGoRNN network outperforms the other models in prediction accuracy and is well suited to multivariate time series prediction tasks. Meanwhile, this paper presents seven other Integrator modules to complement prediction tasks in different domains.