1 Introduction

The development of mobile computing and data acquisition techniques has facilitated the collection of location-based data [1, 2]. Among various spatial–temporal mining applications in data-driven urban sensing scenarios, traffic flow forecasting has become one of the most important smart city applications [3]. Accurate prediction of traffic volume for each geographical region in a city can not only benefit public risk assessment (e.g., crowd flow tragedy mitigation with traffic control [4]), but also enhance the service quality of various intelligent transportation systems (e.g., location-based recommendation services [5] and traffic management for congestion alleviation [6]).

Among various traffic prediction methods, deep learning-based methods stand out owing to the feature representation effectiveness of neural network architectures. Many recently proposed spatial–temporal data forecasting frameworks focus on modeling the time-evolving regularities of traffic over the temporal dimension and the underlying cross-region geographical dependencies over the spatial dimension. For example, a periodically shifted attention mechanism is introduced in STDN [7] to model traffic temporal dependencies, with the joint learning of recurrent neural networks. ST-ResNet [8] proposes to perform image-like convolutional operations over the generated traffic matrix with spatial and temporal information.

Recently, graph neural networks (GNNs) have been utilized as a powerful modeling method to fuse complex relational information over graph-structured data [9, 10]. Along this research line, some traffic prediction models, such as ST-GCN [11] and DCRNN [12], design spectral convolution-based encoders over a geographical graph of regions constructed from their adjacency relations. In particular, they take inspiration from the graph convolutional network (GCN) [13] and follow the graph-structured message passing paradigm to perform neighborhood feature transformation and refine embeddings. Despite the effectiveness of the aforementioned traffic flow forecasting approaches, we identify three significant challenges that have not been well addressed in previous research.

First, most existing traffic prediction methods build the inter-region relation encoding function by considering only nearby geographical signals, and thus overlook the global traffic dependencies across different regions in the entire urban space [14]. For instance, two geographical areas can be inter-dependent in terms of their time-varying traffic patterns even though they are not spatially adjacent [15, 16]. As a result, it is necessary to enhance the cross-region traffic dependency modeling with the awareness of global context.

Second, while it is intuitively useful to employ a recurrent neural framework or an attention mechanism to encode the time-evolving patterns of traffic flow in the region embedding function, it is non-trivial to do so well. Specifically, citywide traffic flow in real-life scenarios is often governed by multi-resolution periodicity (e.g., hourly, daily, weekly) [17], making it difficult to distill the desired temporal hierarchy of traffic regularities in a comprehensive manner.

Third, while the location-aware dependencies between geographical neighborhood regions could be learned by the convolutional neural kernels with different latent channels [8], existing solutions often equally treat feature representations learned from different channel dimensions and perform cross-channel feature aggregation with the same weight. During the spatial pattern integration paradigm, the importance of different hidden channel views can be quite different since the encoded channel-specific embeddings may reflect different types of spatial semantics. Hence, when injecting the spatial contextual signals into the region-wise relation learning, it is important to discriminate the channel-specific contributions for assisting the traffic prediction of the target region at future time slots.

In light of the aforementioned limitations and challenges, we study traffic flow prediction with the goal of preserving both local and global region-wise traffic dependencies, capturing multi-resolution temporal dynamics, and encoding the latent semantics of spatial context. In our work, we present a new traffic forecasting framework, the Spatial-Temporal Convolutional Graph Attention model (\(ST-CGA^{+}\)), to deal with all these challenges.

Specifically, to handle the temporal dynamics with the awareness of multi-resolution regularities, we propose a multi-resolution transformer network to learn time granularity-aware representations that preserve the multi-level temporal patterns of traffic flow. Instead of performing temporal encoding with a single dimension of periodicity, the multi-resolution transformer module enables \(ST-CGA^{+}\) to capture the multi-level periodicity of traffic flow regularities with hierarchical self-attention layers under multiple feature representation subspaces. To inject the global traffic dependencies across different regions, we further integrate our hierarchical transformer network with an attentional graph neural network to refine the learned latent representations with multiple embedding propagation layers. In our graph neural component, we parameterize the weight matrices for calculating attention over region-wise relations. In addition, to capture spatial context, we encode spatial semantics across latent channel dimensions with a channel-aware spatial encoder, which explicitly learns importance weights of different hidden channels during spatial–temporal pattern aggregation. To aggregate the resolution-specific embeddings, a cross-resolution gating mechanism is developed to promote the collaboration of resolution-specific spatial–temporal pattern representations. At the prediction phase of our model, we incorporate external knowledge from meteorological data sources with an external factor fusion module.

Our contributions can be summarized as follows:

  • General aspects. \(ST-CGA^{+}\) not only captures spatial dependencies from the local view to the global view, but also integrates multi-granularity temporal dynamics. In addition, a channel-aware recalibration network injects channel information into our model to distinguish the importance of representations in different subspaces. We highlight the importance of explicitly exploring the multi-resolution temporal dynamics and maintaining globally dependent representations across different regions, as well as encoding latent semantics when incorporating spatial context, in predicting the traffic flow of each region in a city.

  • Methodologies. We present the \(ST-CGA^{+}\) architecture for modeling traffic flow data from both spatial and temporal dimensions. To handle the multi-level periodicity, we integrate hierarchically structured transformer networks with the integrative framework of a self-attention module and an attentive graph neural network, to jointly encode multi-resolution temporal patterns and global inter-region traffic dependencies. To inject the spatial contextual signals into our predictive solution, we develop a multi-view collaboration module which explicitly embeds multi-level temporal signals into the channel-aware spatial relation encoder, with the cooperation of the designed channel-aware convolution-based recalibrated residual network and the gated cross-resolution aggregation mechanism.

  • Experimental findings. We demonstrate the effectiveness of our proposed \(ST-CGA^{+}\) framework on four real-world datasets collected from Beijing and New York City. Evaluation results suggest that \(ST-CGA^{+}\) outperforms different types of traffic prediction models under different settings. We further conduct a case study to show that the proposed method is able to automatically capture the dependencies between different regions. Furthermore, the computational cost evaluation indicates that \(ST-CGA^{+}\) achieves efficiency comparable to that of state-of-the-art baselines.

2 Related work

Traditional approaches (e.g., ARIMA [18], SVR [19]) simply learned the historical temporal dependency from traffic flow data, which resulted in poor generalization ability. Srinivasan et al. proposed a hybrid model that predicted short-term traffic flow with the integration of a feed-forward neural network [20]. This hybrid model fitted complex nonlinear traffic states to some extent, but significant deficiencies remained.

2.1 Deep traffic flow prediction techniques

The promising representation ability of deep learning techniques has led to advances on the traffic flow prediction task [10]. Many methods have been developed to model the traffic variation patterns from spatial and temporal dimensions based on different neural network architectures [4, 14, 15, 22]. In particular, given the time-ordered nature of traffic data, one straightforward solution is to use recurrent neural networks for temporal effect encoding. For example, ST-RNN [21] utilizes a recurrent neural network to encode the temporal effects of geo-tagged series data, and D-LSTM [22] is built upon the long short-term memory framework. Later on, several subsequent works map the time interval-specific traffic flow information across the entire urban area into a 2-dimensional image-like matrix and utilize convolutional neural networks to model the correlations between different regions [14, 23]. For example, DeepST [24] models the spatial correlations for traffic flow prediction using an image-based convolutional neural network with geographical grid kernels. Combining recurrent neural networks and convolution can also achieve good results, as in DMVST-Net [14], an integrative traffic prediction model that combines an LSTM encoder and a local convolutional network for spatial–temporal pattern learning.

Additionally, due to the effectiveness of the attention mechanism for explicit relation learning, integrative forecasting frameworks have been developed to jointly capture spatial and temporal dependencies from traffic data, such as the convolutional shifted attention mechanism [7] and the attention-based conv-lstm model [25]. To incorporate external factors from other data sources or meta knowledge from geographical attributes, some hybrid solutions are designed for predicting traffic volume with a data fusion network (UrbanFM) [26] or a meta-learning framework (ST-MetaNet) [4]. For instance, UrbanFM proposes a general fusion network to consider the influences of external factors. ST-MetaNet attempts to learn correlations between locations by incorporating the region attributes into the meta-learning framework. Unlike the above traffic prediction methods, our \(ST-CGA^{+}\) framework addresses their key limitations by modeling the temporal dynamics of traffic patterns under a multi-scale paradigm and region-wise dependencies with global context.

2.2 GNNs for spatial–temporal data prediction

Another research line relevant to our work is the recently emerged graph neural networks (GNNs), which have become the state-of-the-art models for representation learning on graph-structured data [27, 28]. In graph neural network architectures, the graph dependence information is learned through the message passing process between nodes based on their connections (edges). Owing to their convincing performance, GNNs have been applied in various graph domains, such as user–item interaction learning in recommendation [29], social network analysis [30], and heterogeneous feature aggregation in academic networks [9].

Motivated by the information propagation paradigm, graph neural network models have been applied to spatial–temporal data forecasting [31, 32]. For example, a graph attentive method has been developed to model relationships between users’ sequential behaviors and points of interest (POIs) with the consideration of spatial and temporal factors [33]. Furthermore, several GNN-based traffic prediction approaches have been proposed recently to encode the spatial correlations among geographical regions. Specifically, ST-GCN [11] is equipped with a graph convolutional neural network to model spatial–temporal dependencies with convolution blocks. DCRNN [12] designs bidirectional random walks to capture spatial correlations and a scheduled sampling-based encoder-decoder for temporal pattern modeling. Geng et al. [15] enhance the graph convolutional framework to predict the ride-sharing demand with the incorporation of multi-dimensional spatial–temporal data. In GMAN, an encoder-decoder architecture is adopted to perform feature encoding from the constructed traffic network data [34]. In this work, our \(ST-CGA^{+}\) goes a step further by endowing spatial–temporal graph neural networks with the capability to jointly encode the dynamic traffic evolving patterns and the underlying global-level geographical dependencies.

2.3 Attention mechanism

The attention mechanism has been shown to be effective in aggregating relational data, such as sequential language data learning [35] and user behavior modeling [36]. In spatial–temporal data mining scenarios, some works employ neural attention networks to model spatial–temporal data [37, 38, 39]. For example, Feng et al. [37] study the mobility prediction problem with an attentional recurrent neural network to capture transition regularities of human mobility traces. In [38], the attention mechanism is integrated with a graph convolutional network to model the migration behavior with heterogeneous data sources. In addition, variational attention is proposed to predict a user’s next footprint based on his/her historical point-of-interest check-in traces [39]. In [40] and [41], a self-attention layer is applied to encode the traffic evolving patterns by automatically performing the temporal aggregation over the input. Recently, there has been rising enthusiasm for designing transformer-based neural network models for sequential data [42, 43]. Motivated by the effectiveness of the transformer network, our work integrates a position-aware self-attentive network within multi-head representation spaces with the graph neural architecture for encoding the multi-level periodicity of traffic flow.

3 Methodology

Before presenting our traffic prediction framework, let us first formally introduce the key notations. In our work, the traffic flow across different regions is represented as a three-way tensor \({\mathcal {X}}\), i.e., \({\mathcal {X}} \in {\mathbb {R}}^{M\times N\times T}\). We divide the city map into \(M\times N\) regions. In tensor \({\mathcal {X}}\), each element \(x_{m,n}^t\) denotes the traffic volume of the region with the index (m, n) at the t-th time slot (e.g., hour or day). In our traffic flow forecasting scenario, we consider both (i) the incoming traffic volume (inflow), i.e., the number of vehicles arriving at a specific time slot, and (ii) the outgoing traffic volume (outflow), i.e., the number of vehicles departing at a specific time slot.

We now present our spatial–temporal convolutional graph neural network, termed as \(ST-CGA^{+}\), which is illustrated in Fig. 1. It is composed of three key components: (i) capturing the multi-level periodic patterns with resolution-aware transformer encoder; (ii) region-wise traffic dependency modeling, which captures traffic dependencies across different regions under the global context; (iii) spatial relation learning, which injects latent semantics with channel-aware recalibration network.

Fig. 1 The \(ST-CGA^{+}\) framework. Given |P| different time resolutions, we can obtain |P| hidden states (e.g., \(\mathbf{Z} ^p \in {\mathbb {R}}^{M\times N\times d}\)) from the multi-resolution hierarchical transformer networks. Then, we feed the resolution-aware representations \(\mathbf{Z} ^p\) into the channel-aware spatial relation encoder (\(\varOmega \) represents the learned mask tensor).

3.1 Temporal hierarchy modeling

With the consideration of multi-level temporal patterns of traffic regularities, we design a hierarchical transformer network with different time scales to model the temporal hierarchy of traffic flow data. Given the defined three-dimensional traffic flow tensor \({\mathcal {X}} \in {\mathbb {R}}^{M\times N\times T}\), we first generate different resolution-specific traffic flow tensors with multiple time granularities for both inflow and outflow. To simplify the notation, and without loss of generality, we do not differentiate between inflow and outflow, and unify their data points as \(x_{m,n}^t\), which represents the traffic volume of spatial region (m, n) at the t-th time slot. To capture the temporal hierarchy of traffic regularities, we define the periodicity resolution p to indicate the time difference between two consecutively sampled traffic volume data points \(x_{m,n}^t\) and \(x_{m,n}^{t'}\) from region \(r_{m,n}\). For example, the periodicity resolution p (time difference \((t-t')\)) can be an hour, a day or a week, with the consideration of hourly, daily or weekly traffic regularities.

Based on the original input tensor \({\mathcal {X}}\), we generate different resolution-aware traffic flow tensors \({\mathcal {X}}^p\) corresponding to different settings of the periodicity resolution p. Here, \({\mathcal {X}}^p\) is formally defined as \({\mathcal {X}}^p \in {\mathbb {R}}^{M\times N\times T_p}\), where \(T_p\) represents the input series length of tensor \({\mathcal {X}}^p\) and varies with the periodicity resolution p (e.g., hour, day, week). Given that the most fine-grained time resolution in the original traffic tensor \({\mathcal {X}}\) is an hour, we take the periodicity resolution \(p\in \{hour, day\}\) as concrete examples to show the generated resolution-specific traffic tensor \(\mathbf{X} ^p_{m,n}\) of region \(r_{m,n}\) as below:

$$\begin{aligned} \mathbf{X} ^p_{m,n}&=(x^{p,1}_{m,n},...,x^{p,t}_{m,n},x^{p,(t+1)}_{m,n},...,x^{p,T_p}_{m,n}),~(p=hour) \nonumber \\ \mathbf{X} ^p_{m,n}&=(x^{p,1}_{m,n},...,x^{p,t}_{m,n},x^{p,(t+24)}_{m,n},...,x^{p,T_p}_{m,n}),~(p=day) \end{aligned}$$
(1)
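As a rough illustration of Eq. 1, the sketch below samples one region's hourly series with a stride of 1 (hourly resolution) or 24 (daily resolution). The helper name `resolution_series`, the choice to anchor the sample at the most recent hour, and the toy data are assumptions made for illustration; the original text does not specify them.

```python
def resolution_series(hourly, step, length):
    """Return `length` points spaced `step` hours apart, ending at the latest hour.

    Hypothetical helper: `step` plays the role of the periodicity resolution p
    (1 for hour, 24 for day) and `length` the role of T_p.
    """
    newest_first = hourly[::-1][::step][:length]
    return newest_first[::-1]  # restore chronological order


# Toy hourly volumes for one region r_{m,n}: three days, value == hour index.
hourly = list(range(72))

x_hour = resolution_series(hourly, step=1, length=4)   # consecutive hours
x_day = resolution_series(hourly, step=24, length=3)   # same hour on consecutive days

print(x_hour)
print(x_day)
```

With a stride of 24, consecutive entries of the daily series are exactly one day apart, which matches the \(t\) to \(t+24\) jump in the second line of Eq. 1.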

3.1.1 Resolution-aware self-attention network

We first design a resolution-aware self-attention network to encode temporal dependencies of traffic flow across time slots. Our temporal encoder is inspired by the recent progress of neural attention mechanisms for relation encoding and feature aggregation in various applications, such as machine translation [44] and graph representation [45].

In our \(ST-CGA^{+}\) framework, we utilize the self-attentive mechanism as the base encoding function to learn temporal representations. In matrix form, our self-attention network defines three representation projection matrices, i.e., the query (\(\mathbf{Q} \in {\mathbb {R}}^{T_p\times d}\)), key (\(\mathbf{K} \in {\mathbb {R}}^{T_p\times d}\)) and value (\(\mathbf{V} \in {\mathbb {R}}^{T_p\times d}\)) matrices. These transformation operations aim to map the input traffic data \({\mathcal {X}}^p\in {\mathbb {R}}^{T_p\times d}\) across all geographical regions in a city, into three dimensions of latent representation units. In particular, to generate the \(\mathbf{Q} \), \(\mathbf{K} \) and \(\mathbf{V} \) matrices for each resolution-specific traffic series, we conduct embedding transformation over \({\mathcal {X}}^p \in {\mathbb {R}}^{M\times N\times T_p}\) with the periodicity resolution of p, using the trainable weight matrices \(\mathbf{W} _Q^p \in {\mathbb {R}}^{d\times d}\), \(\mathbf{W} _K^p \in {\mathbb {R}}^{d\times d}\) and \(\mathbf{W} _V^p \in {\mathbb {R}}^{d\times d}\), respectively. Formally, we define the self-attention operation based on scaled dot-product attention as follows:

$$\begin{aligned} {[}\mathbf{Q }^{p}, \mathbf{K }^{p}, \mathbf{V }^{p} ]&= \mathbf{E }^{p} \cdot [\mathbf{W }_{Q}^{p}, \mathbf{W }_{K}^{p}, \mathbf{W }_{V}^{p}] \nonumber \\ \mathbf{Y }^{p} = Att(\mathbf{Q }^{p},\mathbf{K }^{p},\mathbf{V }^{p})&=\varphi \left( \frac{\mathbf{Q }^{p} (\mathbf{K }^{p})^{T}}{\sqrt{d}}\right) \mathbf{V }^{p} \end{aligned}$$
(2)

where \(\mathbf{E} ^p \in {\mathbb {R}}^{d}\) represents the generated embedding of the input traffic flow data point \(x^{p,t}_{m,n}\). We define \(\mathbf{Y} ^p\in {\mathbb {R}}^{|R|\times d}\) to denote the encoded latent representations from the temporally ordered traffic series data under the periodicity resolution of p. In \(\mathbf{Y} ^p\), each row corresponds to the learned feature embeddings of region \(r_{m,n}\). \(\varphi (\cdot )\) represents the softmax function. In our temporal encoder, we incorporate \(\sqrt{d}\) as the scale factor to mitigate the effect of large values in the inner product calculations.
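The scaled dot-product attention of Eq. 2 can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the helper names, the toy embedding matrix `E` (one row per time step) and the identity projection matrices are assumptions chosen to keep the arithmetic easy to follow.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(E, Wq, Wk, Wv):
    """Eq. 2: project E to Q, K, V, then apply softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = matmul(E, Wq), matmul(E, Wk), matmul(E, Wv)
    d = len(Q[0])
    K_transposed = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_transposed)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Toy input: T_p = 3 time steps, d = 2 (assumed values).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_identity = [[1.0, 0.0], [0.0, 1.0]]  # stands in for the trained W_Q, W_K, W_V

Y, weights = attention(E, W_identity, W_identity, W_identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, so every output row of `Y` is a convex combination of the value vectors of all time steps.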

Positional encoding strategy To inject the positional information of temporally ordered traffic volume data points into our temporal encoder, we add a positional embedding strategy to determine the order of encoded data points. Following the positional encoding method in the transformer architecture [46], we perform the position embedding using the following sine and cosine functions:

$$\begin{aligned} \hat{\mathbf{E }}_{t, 2i} = \text {sin}\left( \frac{t}{10000^{2i/d}}\right) ;~~\hat{\mathbf{E }}_{t, 2i+1}= \text {cos}\left( \frac{t}{10000^{2i/d}}\right) \end{aligned}$$
(3)

where 2i and \(2i+1\) represent the even and odd element indices of the positional embedding vector \(\hat{\mathbf{E }}_t\) at the t-th encoded time step. Specifically, elements with an even index are constructed using the sine function, and elements with an odd index are generated with the cosine function. The positional encoding strategy enables the temporal dependency modeling with the awareness of positional information, by updating the input embedding \(\mathbf{E} ^p_{m,n}\) of region \(r_{m,n}\) as \( \bar{\mathbf{E }}^p_{m,n} = \mathbf{E} ^p_{m,n} + \hat{\mathbf{E }}^p_{m,n}\). Different from recurrent neural networks [21], self-attention has the advantages of (i) enabling sequential pattern learning in a parallel way and (ii) explicitly modeling both short- and long-range dependencies.
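A short sketch of Eq. 3 and of the additive update \(\bar{\mathbf{E }} = \mathbf{E} + \hat{\mathbf{E }}\) is given below. The function name `positional_encoding` and the toy embeddings are illustrative assumptions.

```python
import math


def positional_encoding(T, d):
    """Eq. 3: sine on even indices, cosine on odd indices, for T time steps."""
    pe = [[0.0] * d for _ in range(T)]
    for t in range(T):
        for even in range(0, d, 2):          # `even` corresponds to 2i in Eq. 3
            angle = t / (10000 ** (even / d))
            pe[t][even] = math.sin(angle)
            if even + 1 < d:
                pe[t][even + 1] = math.cos(angle)
    return pe


T_p, d = 3, 4                                # assumed toy sizes
E = [[0.5] * d for _ in range(T_p)]          # placeholder input embeddings
E_hat = positional_encoding(T_p, d)

# Position-aware embeddings: element-wise sum of input and positional embeddings.
E_bar = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(E, E_hat)]
print([round(v, 4) for v in E_bar[1]])
```

Because the encoding depends only on the time step index and the embedding size, it can be precomputed once per resolution and reused for every region.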

Multi-head representation space In our temporal hierarchy encoder, we perform the self-attention in H latent representation subspaces. To make this attention computation within multi-head spaces, we split the query, key and value into H different matrices. These partitioned vectors are fed into the self-attention mechanism individually. Each self-attention process corresponds to a head and the encoded head-specific latent embeddings \(\mathbf{Y} ^p_h\) are concatenated into a single vector. Our developed self-attention mechanism with multi-head representation learning space is formally given:

$$\begin{aligned} \mathbf{Y }^{p}&= Concat(\mathbf{Y }^{p}_{h}) \end{aligned}$$
(4)
$$\begin{aligned} \mathbf{Y }^{p}_{h}&= \varphi \left( \frac{\mathbf{Q }^{p}_{h} (\mathbf{K }^{p}_{h})^{T}}{\sqrt{d}}\right) \mathbf{V }^{p}_{h} \end{aligned}$$
(5)

By expanding the self-attention layer with multiple heads, we allow the temporal pattern encoder to project input embeddings into different representation subspaces with multiple sets of query, key and value transformation matrices.
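The head splitting and concatenation of Eqs. 4 and 5 can be sketched as follows. The helper names and toy matrices are assumptions; the split into contiguous column blocks is one common convention that the text does not spell out, and the scale factor is kept at \(\sqrt{d}\) as written in Eq. 5.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention(Q, K, V, scale):
    """Scaled dot-product attention for one head (rows are time steps)."""
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


def split_heads(X, H):
    """Partition the d columns of X into H contiguous blocks of width d // H."""
    width = len(X[0]) // H
    return [[row[h * width:(h + 1) * width] for row in X] for h in range(H)]


def multi_head_attention(Q, K, V, H):
    scale = math.sqrt(len(Q[0]))  # sqrt(d), following Eq. 5 as written
    heads = [head_attention(q, k, v, scale)
             for q, k, v in zip(split_heads(Q, H), split_heads(K, H), split_heads(V, H))]
    # Eq. 4: concatenate the head outputs time step by time step.
    return [[value for head in heads for value in head[t]] for t in range(len(Q))]


Q = K = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
V = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
Y = multi_head_attention(Q, K, V, H=2)
print([[round(v, 3) for v in row] for row in Y])
```

Each head attends using only its own slice of the query and key columns, so different heads can assign different weights to the same pair of time steps.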

3.1.2 Layer normalization with residual connections

We further utilize the residual connection [47] by adding the original positional input embedding \(\bar{\mathbf{E }}^p\) to the concatenated representation \(\mathbf{Y} ^p\) encoded from the multi-head self-attention network. By doing so, the learned low-layer features are propagated to the higher layers of the network for feature interaction. Then, we apply layer normalization with the aim of stabilizing neural network training [48] as follows:

$$\begin{aligned} LayerNorm(\mathbf{Y }^{p}) = \omega _{1} \odot \frac{\mathbf{Y }^{p}- \mu }{\sqrt{\sigma ^{2}+\epsilon }}+ \omega _{2} \end{aligned}$$
(6)

where \(\omega _1\) and \(\omega _2\) represent the learned scaling factors and bias terms, respectively. \(\mu \) and \(\sigma ^{2}\) are the mean and variance of the input vector \(\mathbf{Y} ^p\). \(\epsilon \) is a small constant that prevents division by zero. The element-wise product operation is denoted as \(\odot \). In addition, to augment our temporal hierarchy encoder with the capability of modeling nonlinearities for feature interaction, we feed the encoded temporal representations into a point-wise feed-forward network as below:

$$\begin{aligned} \bar{\mathbf{Y }}^p = ReLU(\mathbf{Y} ^p \mathbf{W} _1 + \mathbf{b} _1 ) \mathbf{W} _2 + \mathbf{b} _2 \end{aligned}$$
(7)

where \(\mathbf{W} _1 \in {\mathbb {R}}^{d\times d}\) and \(\mathbf{W} _2 \in {\mathbb {R}}^{d\times d}\) are learned transformation matrices, and \(\mathbf{b} _1\) and \(\mathbf{b} _2\) denote the bias terms. The output \(\bar{\mathbf{Y }}^p\) is then added back to the input of the point-wise feed-forward network through a residual connection, followed by layer normalization. We illustrate the architecture of our temporal encoder with the developed resolution-aware transformer network in Fig. 2.
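The residual connections, layer normalization (Eq. 6) and point-wise feed-forward network (Eq. 7) can be combined as in the sketch below, which processes a single d-dimensional vector. The function names are assumptions, and the scaling factors and biases of the normalization are fixed to one and zero for illustration, whereas they are learned in the model.

```python
import math


def layer_norm(vec, gain, bias, eps=1e-6):
    """Eq. 6: normalize by the mean and variance of the vector, then scale and shift."""
    mu = sum(vec) / len(vec)
    var = sum((v - mu) ** 2 for v in vec) / len(vec)
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(vec, gain, bias)]


def feed_forward(vec, W1, b1, W2, b2):
    """Eq. 7: ReLU(vec W1 + b1) W2 + b2 for one row vector."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b for col, b in zip(zip(*W2), b2)]


def encoder_sublayers(e_bar, y_attn, W1, b1, W2, b2):
    """Residual + LayerNorm around attention, then residual + LayerNorm around the FFN."""
    d = len(e_bar)
    ones, zeros = [1.0] * d, [0.0] * d       # illustrative (untrained) gain and bias
    y = layer_norm([e + a for e, a in zip(e_bar, y_attn)], ones, zeros)
    ffn_out = feed_forward(y, W1, b1, W2, b2)
    return layer_norm([a + b for a, b in zip(y, ffn_out)], ones, zeros)


identity = [[1.0, 0.0], [0.0, 1.0]]
normalized = layer_norm([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4)
out = encoder_sublayers([0.2, 0.8], [1.0, -1.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
print([round(v, 4) for v in normalized])
print([round(v, 4) for v in out])
```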

Fig. 2 The resolution-aware transformer network

3.2 Region-wise graph attentive learning

To capture the traffic dependencies across different regions in the entire urban space, we integrate the designed transformer-based temporal encoder with a resolution-aware graph attention network. In particular, we first define a region graph \(G=(V,E)\), where V and E denote the vertex and edge sets, respectively. In graph G, each vertex represents a region \(r_{m,n}\) and each edge corresponds to the pairwise relationship between two regions. In our graph neural network architecture, we perform embedding propagation over the region graph G through our attentive message passing paradigm, in order to capture region-wise traffic dependencies from a global perspective. Different from the graph convolutional network, which aggregates information between neighboring nodes based entirely on the graph structure, the graph attention mechanism captures pairwise relations between two neighbors in an explicit manner [49].

The input to our graph attention layer is a set of node features initialized with temporal representations under the periodicity resolution of p: \(\bar{\mathbf{Y }}^p\), i.e., \(\bar{\mathbf{y }}^p_{0,0}\), ..., \(\bar{\mathbf{y }}^p_{m,n}\), ..., \(\bar{\mathbf{y }}^p_{M-1,N-1} \in {\mathbb {R}}^{d}\), where d denotes the latent embedding dimensionality. Our developed multi-resolution graph attention layer consists of four key operations. To enhance the expressive power of feature representation during the embedding propagation paradigm, we add a learnable linear transformation on each individual region node embedding \(\bar{\mathbf{y }}^p_{m,n}\), which is formally represented as follows:

$$\begin{aligned} {\widetilde{\mathbf{Y }}}^{p} = {\bar{\mathbf{Y }}}^{p} \cdot \mathbf{W }_{p},~~\mathbf{W }_{p}\in {\mathbb {R}}^{d\times d'} \end{aligned}$$
(8)

We calculate the attention coefficient \(\epsilon _{(m,n),(m',n')}\) for the pairwise dependency between neighboring region nodes \(r_{m,n}\) and \(r_{m',n'}\) with the mechanism: \({\mathbb {R}}^{d'} \times {\mathbb {R}}^{d'} \rightarrow {\mathbb {R}}\). To compute the attention score between \(r_{m,n}\) and \(r_{m',n'}\), we concatenate the projected representations \(\widetilde{\mathbf{y }}^p_{m,n}\) and \(\widetilde{\mathbf{y }}^p_{m',n'}\). Then, we take the dot product between the learnable weight vector \(\varvec{\alpha }\) and the concatenated embedding \([\widetilde{\mathbf{y }}^p_{m,n}, \widetilde{\mathbf{y }}^p_{m',n'}]\). We further apply the LeakyReLU activation and the softmax normalization as follows:

$$\begin{aligned} {\hat{\epsilon }}_{(m,n),(m',n')}&= LeakyReLU(\varvec{\alpha }^T [\widetilde{\mathbf{y }}^p_{m,n}, \widetilde{\mathbf{y }}^p_{m',n'}]) \nonumber \\ \epsilon _{(m,n),(m',n')}&= \frac{exp({\hat{\epsilon }}_{(m,n),(m',n')})}{\sum _{(m'',n'')\in {\mathcal {N}}(m,n)} exp({\hat{\epsilon }}_{(m,n),(m'',n'')})} \end{aligned}$$
(9)

To enrich the model representation ability in our graph-structured traffic dependency encoder, we perform the message passing with multi-head representation spaces with the following operation:

$$\begin{aligned} \mathbf{z} _{m,n}^p = Concat_{h=1}^{H} LeakyReLU \Big (\sum _{{(m',n')}\in {\mathcal {N}}(m,n)} \epsilon _{(m,n),(m',n')}^h \widetilde{\mathbf{y }}^p_{m',n'} \Big ) \end{aligned}$$
(10)

\(\mathbf{z} _{m,n}^p\) represents the aggregated feature embedding by preserving the inter-region dependencies with respect to their traffic distributions under the periodicity resolution of p. We define \(\epsilon _{(m,n),(m',n')}^h\) to denote the attentive score for the representation subspace of h. The model flow is shown in Fig. 3.
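The attention coefficients of Eq. 9 and a single-head version of the aggregation in Eq. 10 can be sketched as below. Several choices are assumptions made for illustration: regions are keyed by (m, n) tuples, the `neighbors` dictionary is a hypothetical stand-in for \({\mathcal {N}}(m,n)\), the LeakyReLU slope is set to 0.2, and the weighted sum runs over the projected embeddings of the neighboring regions, as is standard for graph attention layers.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_coefficients(y, neighbors, alpha, node):
    """Eq. 9: LeakyReLU(alpha^T [y_node, y_nb]) followed by softmax over the neighbors."""
    raw = {}
    for nb in neighbors[node]:
        concat = y[node] + y[nb]             # list concatenation of the two embeddings
        raw[nb] = leaky_relu(sum(a * c for a, c in zip(alpha, concat)))
    peak = max(raw.values())
    exps = {nb: math.exp(v - peak) for nb, v in raw.items()}
    total = sum(exps.values())
    return {nb: e / total for nb, e in exps.items()}


def aggregate(y, neighbors, alpha, node):
    """Single-head Eq. 10: attention-weighted sum of neighbor embeddings, then LeakyReLU."""
    eps = attention_coefficients(y, neighbors, alpha, node)
    dim = len(y[node])
    summed = [sum(eps[nb] * y[nb][j] for nb in neighbors[node]) for j in range(dim)]
    return [leaky_relu(v) for v in summed]


# Toy projected embeddings for three regions (d' = 2).
y = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [1.0, 1.0]}
neighbors = {(0, 0): [(0, 1), (1, 0)]}
alpha = [0.5, -0.5, 1.0, 0.0]                # hypothetical learned weight vector, length 2d'

eps = attention_coefficients(y, neighbors, alpha, (0, 0))
z = aggregate(y, neighbors, alpha, (0, 0))
print({nb: round(w, 3) for nb, w in eps.items()})
print([round(v, 3) for v in z])
```

In the multi-head setting, this computation would be repeated with H separate weight vectors and the H outputs concatenated.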

Fig. 3 Region-wise graph attentive learning module

3.3 Spatial context injection in \(ST-CGA^{+}\) framework

In this component, we inject the contextual signals from different latent channel dimensions into geographical dependencies, using the designed convolution-based recalibrated residual network.

3.3.1 Convolution-based residual unit

Given the learned latent feature representations \(\mathbf{Z} \in {\mathbb {R}}^{M\times N\times d\times |P|}\) of region-wise traffic transitional regularities across both time slots and different geographical areas, we design a convolution-based residual network to encode relational structures of spatial context between both nearby and distant regions. To address the gradient vanishing problem and strengthen feature propagation [47], we feed \(\mathbf{Z} \) into our convolution-based subnet by employing the ResNet for each individual resolution p with the residual mappings. Our convolution-based residual network is formally defined as:

$$\begin{aligned} \widetilde{\mathbf{Z }}^p={\mathcal {F}}(\mathbf{Z} ^p) + \mathbf{Z} ^p \end{aligned}$$
(11)

We define \({\mathcal {F}}(\cdot )\) to represent the residual operator. \(\widetilde{\mathbf{Z }}^p\) indicates the encoded high-level feature representation under the setting of periodicity resolution of p. We integrate two convolutional layers in our residual operator \({\mathcal {F}}(\cdot )\) with the formal presentation as follows:

$$\begin{aligned} \mathbf{Z} ^{p(l+1)}=ReLU(W^{(l)} * \mathbf{Z} ^{p(l)} +b^{(l)}) \end{aligned}$$
(12)

where the trainable transformation matrices and bias terms are denoted as \(W^{(l)}\) and \(b^{(l)}\), respectively. We define \(\mathbf{Z} ^{p(l+1)} \in {\mathbb {R}}^{M\times N\times C}\) to represent the learned feature embeddings from the convolutional unit with the ReLU activation function. In our spatial encoder, the kernel, which captures region-wise dependencies based on adjacent geographical relationships, has a spatial size of \(3\times 3\) with a stride of 1. We define \(\widetilde{\mathbf{Z }}^p \in {\mathbb {R}}^{M\times N\times C}\) as the refined feature representations, which preserve the spatial context and enhance the cross-region traffic dependency modeling.
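A single-channel sketch of the residual unit in Eqs. 11 and 12 follows. It assumes zero padding so that the \(M\times N\) grid size is preserved (the text does not state the padding scheme), uses one channel instead of C, and the function names and kernels are illustrative.

```python
def conv3x3_relu(grid, kernel, bias=0.0):
    """Eq. 12 for one channel: 3x3 convolution with stride 1, zero padding, then ReLU."""
    M, N = len(grid), len(grid[0])
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = bias
            for dm in (-1, 0, 1):
                for dn in (-1, 0, 1):
                    mm, nn = m + dm, n + dn
                    if 0 <= mm < M and 0 <= nn < N:   # cells outside the grid count as zero
                        acc += kernel[dm + 1][dn + 1] * grid[mm][nn]
            out[m][n] = max(0.0, acc)
    return out


def residual_unit(grid, kernel_1, kernel_2):
    """Eq. 11: F(Z) + Z, where F stacks two convolutional layers."""
    f = conv3x3_relu(conv3x3_relu(grid, kernel_1), kernel_2)
    return [[f[m][n] + grid[m][n] for n in range(len(grid[0]))] for m in range(len(grid))]


Z = [[1.0, 2.0], [3.0, 4.0]]                             # toy 2 x 2 region grid
center = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
zeros = [[0.0] * 3 for _ in range(3)]

print(residual_unit(Z, center, center))   # F acts as the identity here
print(residual_unit(Z, zeros, zeros))     # F is zero, only the skip path remains
```

When the residual operator contributes nothing, the skip connection still passes the input through unchanged, which is the property that eases gradient propagation.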

3.3.2 Channel-aware recalibration network

Inspired by the strength of feature learning paradigm with pyramid network structure, we propose a channel-aware recalibration network based on a bottom-up and top-down neural architecture, to endow our spatial relation encoder with the capability of capturing latent semantics of spatial relationships across different representation channels, based on a hierarchical feature aggregation framework. Toward this end, we learn a mask tensor \(\varvec{\varOmega } \in {\mathbb {R}}^{M\times N\times C}\) which corresponds to importance weights of latent channel dimensions. In our recalibration network, two candidate functions can be chosen as the feature encoder:

Fully convolutional networks (FConv) Our first candidate encoder performs convolution and max pooling operations several times, to increase the receptive field and obtain the intermediate hidden representation with a bottom-up architecture. Then, the global feature interaction signals across all regions and channels are expanded by a symmetrical top-down architecture to generate the weights of the input for each position in \(\mathbf{Z} \). After the convolution operations, we use bilinear interpolation to upsample the embeddings. The number of bilinear interpolation steps is the same as the number of max pooling steps, so that the output is consistent with the input embedding dimensionality. The final output is constrained to [0, 1] with the normalization to get the final mask \(\varvec{\varOmega }\).

Fully connected layers (FCL) Another encoder function stretches the input representation tensor \(\mathbf{Z} ^p\) into a one-dimensional vector and feeds it into stacked feed-forward neural networks to generate an intermediate latent representation. A symmetric structure that mirrors this bottom-up network, with the feed-forward layers stacked in reverse order, is then utilized for learning the mask tensor \(\varvec{\varOmega }\). Similarly, a sigmoid function is applied to map the output into the range of [0, 1] to generate the final mask \(\varvec{\varOmega }\).

With the joint consideration of region spatial relations across \({\mathbb {R}}^{M\times N}\) and channels \({\mathbb {R}}^C\), we generate the mask tensor \(\varvec{\varOmega }\), whose entries correspond to the element positions in \(\widetilde{\mathbf{Z }}^p\). We further apply the mask tensor to the resolution-specific representation \(\widetilde{\mathbf{Z }}^p\) to obtain \(\varvec{\varLambda }^p\) with the following recalibration operation:

$$\begin{aligned} \varvec{\varLambda }^p = \varvec{\varOmega } \circ \widetilde{\mathbf{Z }}^p = \varvec{\varOmega } \circ \left( {\mathcal {F}}(\mathbf{Z} ^p) + \mathbf{Z} ^p\right) \end{aligned}$$
(13)

where \(\circ \) denotes element-wise multiplication. By integrating the channel-aware recalibration network into \(ST-CGA^{+}\), the learned relevance scores enter the traffic pattern representation process, and the mask tensor \(\varvec{\varOmega }\) (i) enhances the representation capability by explicitly differentiating dimension units (i.e., regions and latent channels); and (ii) serves as a gradient update filter during back propagation, i.e., it enhances the robustness of \(ST-CGA^{+}\) in learning gradients for parameter inference in the convolutional residual unit [50].
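To make the recalibration step concrete, the following NumPy sketch builds a mask with an FCL-style encoder and applies it element-wise to the refined embeddings. It is only an illustration under stated assumptions: the function names (`fcl_mask`, `recalibrate`), the single hidden layer, the toy tensor sizes and the random weights are hypothetical and do not reflect the actual configuration of \(ST-CGA^{+}\).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fcl_mask(z, w_down, w_up):
    """FCL-style mask: flatten -> bottleneck (ReLU) -> expand -> sigmoid -> reshape."""
    flat = z.reshape(-1)                            # stretch Z^p into a 1-D vector
    hidden = np.maximum(flat @ w_down, 0.0)         # intermediate latent representation
    return sigmoid(hidden @ w_up).reshape(z.shape)  # one weight per (region, channel)

def recalibrate(omega, z_tilde):
    """Element-wise product of the mask and the refined embeddings."""
    return omega * z_tilde

rng = np.random.default_rng(0)
M, N, C, hidden_dim = 4, 5, 3, 8                    # toy sizes (assumed)
z = rng.normal(size=(M, N, C))                      # stands in for Z^p
z_tilde = z + 0.1 * rng.normal(size=(M, N, C))      # stand-in for the residual conv output
w_down = rng.normal(scale=0.1, size=(M * N * C, hidden_dim))  # random, untrained weights
w_up = rng.normal(scale=0.1, size=(hidden_dim, M * N * C))

omega = fcl_mask(z, w_down, w_up)
lam = recalibrate(omega, z_tilde)
```

Because the mask has the same shape as the embeddings, every region and channel receives its own scaling factor, which is what distinguishes this recalibration from a single global gate.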

3.4 Cross-resolution pattern integration

To aggregate the complex spatial and temporal patterns (encoded by the integrative architecture of the channel-aware convolutional graph attention network), we develop a gating mechanism to promote the collaboration of different resolution-specific representations \(\varvec{\varLambda }^p\) (\(p\in P=\{hour, day, week, month\}\)). These representations correspond to the hourly (\(\varvec{\varLambda }^{p_h}\)), daily (\(\varvec{\varLambda }^{p_d}\)), weekly (\(\varvec{\varLambda }^{p_w}\)) and monthly (\(\varvec{\varLambda }^{p_m}\)) traffic transitional regularities. In particular, we weight the resolution-specific embeddings with learnable importance scores and combine them through a parametric-matrix-based sum operation as follows:

$$\begin{aligned} \varvec{\varLambda }= \mathbf{W} ^{p_h} \circ \varvec{\varLambda }^{p_h} + \mathbf{W} ^{p_d} \circ \varvec{\varLambda }^{p_d} + \mathbf{W} ^{p_w} \circ \varvec{\varLambda }^{p_w} +\mathbf{W} ^{p_m} \circ \varvec{\varLambda }^{p_m} \end{aligned}$$
(14)

where \(\mathbf{W} ^{p_h}\), \(\mathbf{W} ^{p_d}\), \(\mathbf{W} ^{p_w}\) and \(\mathbf{W} ^{p_m}\) represent the learnable transformation matrices for different resolution-aware representations. With the element-wise multiplication \(\circ \), we generate the final pattern representation \(\varvec{\varLambda }\), which explicitly explores spatial–temporal dependencies under the cross-resolution learning scenario.
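The fusion in Eq. (14) can be sketched in a few lines of NumPy. The function name `fuse_resolutions`, the toy tensor shape and the randomly initialized weights are assumptions for illustration; in the model the weights are learned jointly with the other parameters.

```python
import numpy as np

def fuse_resolutions(reps, weights):
    """Sum over resolutions of W^p (element-wise) Lambda^p, as in Eq. (14)."""
    fused = np.zeros_like(next(iter(reps.values())))
    for p, lam_p in reps.items():
        fused += weights[p] * lam_p
    return fused

rng = np.random.default_rng(0)
shape = (4, 5, 3)                                   # toy (M, N, C), assumed
resolutions = ["hour", "day", "week", "month"]
reps = {p: rng.normal(size=shape) for p in resolutions}     # stand-ins for Lambda^p
weights = {p: rng.normal(size=shape) for p in resolutions}  # learnable in practice
fused = fuse_resolutions(reps, weights)
```

Setting all weights to one reduces the operation to a plain sum of the four representations, which is a convenient sanity check.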

3.5 External factor fusion

The traffic transitional regularities are also affected by various external factors, such as meteorological conditions and external temporal information (e.g., holidays). Hence, in the prediction scenario of traffic flow, it is also crucial to account for the influences of such external data sources which are defined as follows with details:

External factors We consider four types of external factors as the complementary data sources, namely weather conditions, temperature (in \(^\circ \)C), wind speed (in mph) and holiday signals. Each type of data source is associated with several encoding vectors with respect to different feature dimensions. Taking the weather condition as a concrete example, we generate four vectors (\(f_{sun}\in {\mathbb {R}}^{T}\), \(f_{rain}\in {\mathbb {R}}^{T}\), \(f_{fog}\in {\mathbb {R}}^{T}\), \(f_{snow}\in {\mathbb {R}}^{T}\)) corresponding to sunny, rainy, foggy and snowy conditions, respectively. Each element \(f^t_*=1\) if the t-th day is positive for the weather condition feature (e.g., rainy day), and \(f^t_*=0\) otherwise. A similar strategy is applied for encoding holiday signals, i.e., weekday, weekend and national holiday. Furthermore, we use min–max normalization to scale quantitative temperature and wind speed values into the range of [0, 1] for each target time resolution (e.g., hour or half an hour).
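The encoding described above can be illustrated with a short Python sketch. The helper names (`one_hot_series`, `min_max_scale`) and the sample observations are hypothetical; the handling of a constant series is also an assumption, since the text does not specify it.

```python
def one_hot_series(labels, categories):
    """Return one binary indicator vector per category, each of length T."""
    return {c: [1 if lab == c else 0 for lab in labels] for c in categories}

def min_max_scale(values):
    """Scale values linearly into [0, 1]; a constant series maps to all zeros (assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

weather = ["sunny", "rainy", "rainy", "foggy"]   # hypothetical observations, T = 4
temperature_c = [10.0, 15.0, 20.0, 30.0]         # hypothetical readings in degrees Celsius

f_weather = one_hot_series(weather, ["sunny", "rainy", "foggy", "snowy"])
temp_scaled = min_max_scale(temperature_c)
```

Here `f_weather["rainy"]` plays the role of \(f_{rain}\), and `temp_scaled` is the normalized temperature series.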

Based on the aforementioned definition, we could associate each region with the external feature vector \(f^T_{*}\). We utilize a multi-layer perceptron architecture to map \(f^T_{*}\) into a latent space with a representation of \(\mathbf{B} _{m,n} \in \mathbf{B} \), where \(\mathbf{B} \in {\mathbb {R}}^{M\times N\times d}\). We further perform the concatenation of \(\mathbf{B} \) and \(\varvec{\varLambda }\), and feed it into the prediction layer with a feed-forward network structure to forecast the future traffic.

3.6 The learning process of \(ST-CGA^{+}\)

In this subsection, we first present our optimized loss function for the learning process of \(ST-CGA^{+}\), and then, we provide detailed time complexity analysis of our method.

3.6.1 Optimized objective

In this work, we aim to simultaneously predict the input and output traffic volume of each region across the entire city with the following defined loss function:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}&= \sum _{m=0}^{M-1}\sum _{n=0}^{N-1} \lambda [({\bar{x}}^{i}_{m,n,t})-(x^{i}_{m,n,t})]^{2}\\&\quad +(1-\lambda )[({\bar{x}}^{o}_{m,n,t})-(x^{o}_{m,n,t})]^{2} \end{aligned} \end{aligned}$$
(15)

where \(\lambda \) balances the influence of input and output traffic flow. \({\bar{x}}^{i}_{m,n,t}\) and \({\bar{x}}^{o}_{m,n,t}\) represent the ground truth traffic volume of the input and output flow, respectively, at the region \(r_{m,n}\) and the t-th time slot, and \(x^{i}_{m,n,t}\) and \(x^{o}_{m,n,t}\) are the corresponding predicted values.
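As a worked instance of Eq. (15), the sketch below evaluates the loss on a toy \(2\times 2\) grid for a single time slot. It assumes that the double sum covers both the inflow and the outflow term; the function name `flow_loss`, the sample values and the choice \(\lambda =0.25\) are hypothetical.

```python
import numpy as np

def flow_loss(true_in, pred_in, true_out, pred_out, lam=0.5):
    """Lambda-weighted sum of squared inflow and outflow errors over all regions."""
    in_err = np.sum((true_in - pred_in) ** 2)
    out_err = np.sum((true_out - pred_out) ** 2)
    return lam * in_err + (1.0 - lam) * out_err

true_in = np.array([[3.0, 1.0], [0.0, 2.0]])    # toy ground-truth inflow
pred_in = np.array([[2.0, 1.0], [0.0, 4.0]])    # toy predicted inflow
true_out = np.array([[1.0, 1.0], [1.0, 1.0]])   # toy ground-truth outflow
pred_out = np.array([[1.0, 0.0], [1.0, 1.0]])   # toy predicted outflow

loss = flow_loss(true_in, pred_in, true_out, pred_out, lam=0.25)
```

In this example the squared inflow errors sum to 5 and the squared outflow errors sum to 1, so the loss is \(0.25\times 5 + 0.75\times 1 = 2\).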

3.6.2 Complexity analysis of \(ST-CGA^{+}\) framework

We then analyze the time complexity of our \(ST-CGA^{+}\) framework. The model first takes linear time for data preparation. The resulting |P| resolution-specific tensors require \(O(|P|\times T\times M\times N\times d)\) operations to calculate the query, key and value matrices, and \(O(|P|\times T^2\times d)\) operations for the weighted summations. The complexity of the output linear mapping after the concatenation is \(O(|P|\times T\times d^2)\). In the subsequent region-wise graph attention module, \(ST-CGA^{+}\) takes \(O(|P|\times M\times N\times d\times d')\) for the high-level feature representation, and \(O(|P|\times M^2\times N^2\times d')\) for computing the weights and the attentive aggregation. For the region-wise spatial relation modeling, \(ST-CGA^{+}\) first employs \(O(|P|\times M\times N\times d\times d_{\text {conv}}^2\times C)\) computations for each convolution, where \(d_{\text {conv}}\) is the size of the convolutional kernel. In the subsequent channel-aware recalibration network, \(O(|P|\times M\times N\times C^2\times d_{\text {FConv}}^2)\) is required for the fully convolutional approach and \(O(|P|\times M\times N\times C\times d_{\text {FCL}})\) is required for the fully connected scheme, where \(d_{\text {FConv}}\) is the kernel size for the convolutions and \(d_{\text {FCL}}\) is the dimensionality of the FCL hidden layer. The external factor fusion has \(O(|f|\times d)\) complexity, where |f| denotes the dimensionality of the external features. Overall, the \(O(|P|\times M^2\times N^2\times d')\) computations from the graph attention module dominate the time complexity of \(ST-CGA^{+}\) in most real-world cases. Hence, the time complexity of our method is comparable to that of state-of-the-art neural graph approaches.
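To give a feel for the relative magnitude of these terms, the following sketch plugs concrete values into the leading-order expressions. The \(32\times 32\) grid matches the BJ-Taxi setting and \(d=C=64\) with a \(3\times 3\) kernel match the reported configuration, whereas \(d'=64\), \(T=4\) and \(|P|=3\) are assumptions made only for this illustration; constant factors are ignored.

```python
def complexity_terms(P, T, M, N, d, d_prime, C, k_conv):
    """Leading-order operation counts for the main terms of the analysis."""
    return {
        "qkv_projection": P * T * M * N * d,
        "temporal_attention": P * T**2 * d,
        "graph_feature_map": P * M * N * d * d_prime,
        "graph_attention": P * M**2 * N**2 * d_prime,
        "convolution": P * M * N * d * k_conv**2 * C,
    }

# Assumed values: d_prime, T and P are illustrative choices, not reported settings.
terms = complexity_terms(P=3, T=4, M=32, N=32, d=64, d_prime=64, C=64, k_conv=3)
dominant = max(terms, key=terms.get)
for name, count in sorted(terms.items(), key=lambda kv: -kv[1]):
    print(f"{name:>20}: {count:,}")
```

Under these assumed values the graph attention term is the largest. The ranking depends on the grid size \(M\times N\) and on \(d'\), so it should be re-evaluated for other configurations.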

4 Evaluation

In this section, we evaluate the performance of \(ST-CGA^{+}\) on four traffic flow datasets collected from different cities and applications and compare it with various state-of-the-art forecasting techniques. Specifically, the experiments aim to answer the following research questions:

  • RQ1: Does our \(ST-CGA^{+}\) consistently outperform the state-of-the-art baselines in making predictions on different traffic flow datasets?

  • RQ2: What are the impacts of key modules of our framework in boosting the traffic prediction performance?

  • RQ3: How does \(ST-CGA^{+}\) perform with the incorporation of different granularity-aware temporal encoders?

  • RQ4: How do different encoding functions in our channel-aware recalibration network affect the model accuracy?

  • RQ5: How do different hyperparameter configurations affect the performance of the developed \(ST-CGA^{+}\)?

  • RQ6: How do we understand the interpretable representation capability of our designed graph neural network in capturing relational patterns across different regions?

  • RQ7: How is the model efficiency of \(ST-CGA^{+}\)?

4.1 Experimented datasets

The experimental datasets are collected from two cities (i.e., New York City and Beijing) and record trajectories of taxi and bike mobility with geographical coordinates. We present the details of each dataset as follows:

  • NYC-Taxi. This dataset consists of more than 22,000,000 taxi trajectory records across the geographical area of New York City, spanning from Jan 2015 to Mar 2015. Following the same settings as in [7], the taxi trajectories are mapped into \(10\times 20\) disjoint regions to generate the corresponding traffic inflow and outflow volume of each region with a measurement interval of 30 minutes.

  • BJ-Taxi. It is another taxi dataset (more than 34,000 taxi trajectories) collected from Beijing over four different periods (Jul 2013–Oct 2013; Mar 2014–Jun 2014; Mar 2015–Jun 2015; Nov 2015–Apr 2016). We divide the spatial coverage area into \(32\times 32\) regions for trajectory mapping, which is consistent with [8]. In this dataset, the traffic volume of each region is estimated every 30 minutes.

  • NYC-Bike-1. This dataset is collected from the bicycle-sharing system in New York City and consists of 6,800 trajectories from Apr 2014 to Sep 2014 [8]. The urban area is partitioned into different regions following a \(16\times 8\) grid map. The temporally ordered traffic series is constructed with a time interval of one hour.

  • NYC-Bike-2. It is another collected bike trajectory dataset spanning from Jul 2016 to Aug 2016. There are more than 2,600,000 trajectory logs included in these data. We apply the \(10\times 20\) grid map and half an hour measured time interval (following the same settings in [7]) for spatial and temporal data mapping, respectively.

In our experiments, given the dataset-specific time interval for traffic volume measurement, we configure the time granularity set as \(P=\{30 mins, day, week\}\) for the NYC-Taxi, BJ-Taxi and NYC-Bike-2 data, and \(P=\{hour, day, week\}\) for the NYC-Bike-1 data. Following the same data pre-processing steps as in [7, 14], we keep the data instances with traffic volume \(\ge 10\) in the experimental datasets. During the forecasting phase of time slot-specific traffic flow, we apply min–max normalization to project the data into the range of \([-1, 1]\) and use tanh as the activation function for the nonlinear transformation. The model output is then re-mapped into the value range of the input data.
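The scaling to \([-1, 1]\) and the re-mapping of model outputs can be sketched as follows. The function names and the sample volumes are hypothetical, and taking the minimum and maximum from the training data is an assumption rather than a detail stated above.

```python
def to_unit_range(values, lo, hi):
    """Min-max scale values from [lo, hi] into [-1, 1]."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def from_unit_range(scaled, lo, hi):
    """Inverse mapping: bring outputs in [-1, 1] back to the original scale."""
    return [(s + 1.0) / 2.0 * (hi - lo) + lo for s in scaled]

volumes = [10.0, 55.0, 100.0]            # hypothetical traffic volumes
lo, hi = min(volumes), max(volumes)      # assumed to come from the training data
scaled = to_unit_range(volumes, lo, hi)
restored = from_unit_range(scaled, lo, hi)
```

Evaluation metrics are then computed on the restored values, so that errors are expressed in the original traffic volume units.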

4.2 Evaluation protocols

In this subsection, we first introduce the evaluation metrics used in our performance comparison and then elaborate the compared baselines based on different neural network structures. Finally, we present the hyperparameter settings of our \(ST-CGA^{+}\) with implementation details.

4.2.1 Performance evaluation metric

In the performance evaluation of all compared methods, we use two representative metrics, root mean squared error (RMSE) and mean absolute percentage error (MAPE), which have been widely employed in traffic volume prediction tasks [4, 15, 51]. RMSE and MAPE are denoted as:

$$\begin{aligned} RMSE&= \sqrt{\frac{1}{num}\sum _{i}(x_i-{\bar{x}}_i)^2} \end{aligned}$$
(16)
$$\begin{aligned} MAPE&= \frac{100\%}{num} \sum _{i}|\frac{x_i-{\bar{x}}_i}{{\bar{x}}_i}| \end{aligned}$$
(17)

where \({\bar{x}}\) and x are ground truth and the corresponding predicted value, respectively; num is the number of ground truths. Note that the lower RMSE and MAPE scores indicate better performance in predicting traffic flow.
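For reference, Eqs. (16) and (17) translate directly into code. The sketch below uses plain Python; the function names and the sample values are hypothetical, and the MAPE function assumes non-zero ground truth values.

```python
import math

def rmse(truth, pred):
    """Root mean squared error, Eq. (16)."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(truth, pred)) / len(truth))

def mape(truth, pred):
    """Mean absolute percentage error in percent, Eq. (17); truth must be non-zero."""
    return 100.0 / len(truth) * sum(abs((p - t) / t) for t, p in zip(truth, pred))

truth = [10.0, 20.0, 40.0]   # hypothetical ground-truth volumes
pred = [12.0, 18.0, 40.0]    # hypothetical predictions

print(f"RMSE = {rmse(truth, pred):.4f}, MAPE = {mape(truth, pred):.2f}%")
```

RMSE penalizes large absolute deviations, whereas MAPE measures errors relative to the ground truth volume, so the two metrics can rank methods differently on low-volume regions.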

4.2.2 Compared baselines

To comprehensively evaluate the effectiveness of our \(ST-CGA^{+}\), we compare it with several types of baselines built on different model architectures. Most of the compared methods serve as strong baselines in the traffic prediction domain.

  • Random prediction (RP): this method predicts the traffic flow at random. The traffic flow is randomly generated by referring to the range defined by the maximum and minimum traffic flow in the database.

  • Average prediction (AP): it always predicts the average value based on past values.

Traditional time series prediction methods:

  • ARIMA [18]: this time series analysis method models the temporal structure of the data by regressing future values on past values.

  • SVR [19]: it is another representative time series prediction technique which transforms data into feature space using nonlinear function.

Neural network-enhanced hybrid model:

  • Fuzzy+NN [20]: this hybrid model predicts short-term traffic flow with the integration of feed-forward neural network and fuzzy input fuzzy output filter.

Recurrent Neural Network Spatial–Temporal Prediction:

  • ST-RNN [21]: it utilizes the recurrent neural network to encode the temporal effects of geo-tagged series data.

  • D-LSTM [22]: it is stacked by long short-term memory networks to predict traffic with temporal dependencies.

Convolutional neural traffic prediction models:

  • DeepST [24]: this method models the spatial correlations for traffic flow prediction using image-based convolutional neural network with geographical grid kernels.

  • ST-ResNet [8]: it enhances the convolution neural network-based traffic prediction with the incorporation of residual network for model training efficiency.

  • DMVST-Net [14]: it is an integrative traffic prediction model which combines the LSTM encoder and local convolutional network for spatial–temporal pattern learning.

Traffic prediction with attentive mechanism:

  • STDN [7]: it models the spatial similarity and long-term periodic temporal pattern with a designed flow gating mechanism and shifted attention mechanism, respectively.

Graph neural network for traffic forecasting:

  • DCRNN [12]: it designs bidirectional random walks to capture spatial correlations and scheduled sampling-based encoder-decoder for temporal pattern modeling.

  • ST-GCN [11]: it proposes to use graph convolutional layers on the graph-structured time series data to model the corresponding spatial and temporal similarities.

  • ST-MGCN [15]: this method captures the non-Euclidean correlations among spatially adjacent regions with multiple graph convolutional layers in predicting traffic flow.

  • GMAN [34]: GMAN is built upon the graph-based attention network for aggregating information from both spatial and temporal dimensions.

  • ST-CGA [40]: it is the prior version of this work. The main difference between them is that ST-CGA merely performs the temporal information encoding under a singular representation learning space, without explicitly modeling sequential traffic transitional patterns.

  • ST-GDN [41]: it jointly learns the local region-wise geographical dependencies and the spatial semantics from a global perspective.

Deep hybrid traffic predictive techniques:

  • UrbanFM [26]: it utilizes the convolutional network-based feature extraction network to consider local region-wise dependencies and designs a diffusion network to model the external factors (e.g., meteorological data).

  • ST-MetaNet [4]: this is a meta-learning traffic prediction approach which employs the meta knowledge from geo-graph attributes for spatial correlation modeling based on the graph attention and recurrent neural network.

4.2.3 Implementation details for reproducibility

In our evaluation, we present the partition details of the training, validation and test datasets in Table 1. The number of records in the training, validation and test datasets is shown in Table 2. The validation set, which is held back from the training set, gives an estimate of model skill while tuning the model’s hyperparameters.

Table 1 Training/validation/test data split details
Table 2 Number of records of training, testing and validation data

The settings of \(ST-CGA^{+}\). The proposed \(ST-CGA^{+}\) is implemented with TensorFlow. We optimize the model using the Adam optimizer with a batch size of 32 and a learning rate of \(10^{-3}\). In particular, the settings of each module in our \(ST-CGA^{+}\) are elaborated as follows:

  • In our graph-structured attentive layer, the feature embedding size d is chosen from \(\{32, 64, 128, 256\}\) and the depth of our designed graph neural network is selected from \(\{1, 2, 3, 4, 5\}\).

  • In our geographical relation modeling, we tune the channel dimensionality C from \(\{32, 64, 128, 256\}\) and apply the kernel with the \(3\times 3\) filter size.

  • In our multi-scale temporal encoder, the length of input traffic series for different resolutions (i.e., hour–\(T_h\), day–\(T_d\) and week–\(T_w\)) is chosen from \(\{1, 2, 3, 4, 5, 6\}\), \(\{1, 2, 3, 4, 5\}\), \(\{1, 2, 3, 4, 5, 6\}\), respectively.

  • In the external factor fusion network, the prediction layer is configured with 3 layers of feed-forward network.

In the process of parameter adjustment, we use grid search to evaluate every combination of the candidate parameter choices. Some typical parameter selections are presented in Figs. 8 and 9. The best parameters obtained for the four datasets are: feature embedding size \(d=64\), graph neural network depth of 3, channel dimensionality \(C=64\), filter size of \(3\times 3\) and 3 feed-forward network layers. For BJ-Taxi and NYC-Taxi, the best \(T_h\), \(T_d\) and \(T_w\) are 4, 3 and 1, respectively. For NYC-Bike-1 and NYC-Bike-2, the best \(T_h\), \(T_d\) and \(T_w\) are 3, 2 and 1, respectively.
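A minimal sketch of such an exhaustive search over the candidate sets listed above is given below. The dictionary keys, the helper names and, in particular, `toy_validation_error` are hypothetical: the latter is a stand-in for training the model and measuring its validation error, and it is not derived from any experimental result.

```python
from itertools import product

search_space = {
    "d": [32, 64, 128, 256],
    "gnn_depth": [1, 2, 3, 4, 5],
    "C": [32, 64, 128, 256],
    "T_h": [1, 2, 3, 4, 5, 6],
    "T_d": [1, 2, 3, 4, 5],
    "T_w": [1, 2, 3, 4, 5, 6],
}

def grid(space):
    """Yield every hyperparameter combination as a dict."""
    keys = list(space)
    for combo in product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))

def grid_search(space, evaluate):
    """Return the configuration with the lowest validation score."""
    return min(grid(space), key=evaluate)

def toy_validation_error(cfg):
    """Hypothetical stand-in for 'train the model and return the validation RMSE'."""
    target = {"d": 64, "gnn_depth": 3, "C": 64, "T_h": 4, "T_d": 3, "T_w": 1}
    return sum(abs(cfg[k] - target[k]) for k in target)

best = grid_search(search_space, toy_validation_error)
n_configs = sum(1 for _ in grid(search_space))
```

With these candidate sets the grid contains \(4\times 5\times 4\times 6\times 5\times 6 = 14{,}400\) configurations, which illustrates why exhaustive search is only practical for a small number of candidate values per parameter.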

Baseline settings and performance tuning All the methods are trained from scratch, without any pre-training, on a single NVIDIA GeForce GTX 1080 Ti GPU. The experiments of most baselines are performed with their released code, and the hyperparameter initialization settings are consistent with their original papers. For fair comparison, we further apply the grid search strategy [52] to discover the optimal parameter settings of each baseline on the validation set. Moreover, early stopping is adopted to terminate the training process based on the validation performance. After tuning the hyperparameters of all baselines, we report their best performance in the evaluation results.

4.3 RQ1: performance comparison

We evaluate our proposed \(ST-CGA^{+}\) method on the four experimental datasets with respect to both inflow and outflow traffic volume. We apply the trained model to predict the traffic flow in the test time period. The prediction results are presented in Tables 3 and 4, which show that \(ST-CGA^{+}\) outperforms various baselines by a significant margin in terms of RMSE and MAPE. Such performance improvements are attributed to the joint learning of global cross-region traffic dependencies and channel-aware spatial contextual information under a convolutional graph neural network. In addition, we conduct statistical significance tests on the four datasets. We first apply the Shapiro–Wilk test to the experimental results, with the null hypothesis that they follow a normal distribution: (i) BJ-Taxi, RMSE: w=0.992767, p-value=0.8856, MAPE: w=0.999013, p-value=0.8671; (ii) NYC-Taxi, RMSE: w=0.998345, p-value=0.5431, MAPE: w=0.998573, p-value=0.5012; (iii) NYC-Bike-1, RMSE: w=0.997653, p-value=0.6281, MAPE: w=0.998961, p-value=0.6601; (iv) NYC-Bike-2, RMSE: w=0.999753, p-value=0.7302, MAPE: w=0.999502, p-value=0.7042. The w statistic is close to 1 and the p-value is well above 0.05 in every case, so we cannot reject the hypothesis that the results are normally distributed. We then perform Student’s t test on the experimental results: (i) BJ-Taxi, p-value for RMSE: 7.366e-04, p-value for MAPE: 5.128e-05; (ii) NYC-Taxi, p-value for RMSE: 6.326e-05, p-value for MAPE: 7.727e-06; (iii) NYC-Bike-1, p-value for RMSE: 1.966e-08, p-value for MAPE: 9.463e-08; (iv) NYC-Bike-2, p-value for RMSE: 6.606e-06, p-value for MAPE: 3.529e-05. In every case, the p-value is much less than 0.05. We also perform ANOVA on all the results: multi-group comparisons of the means are carried out by a one-way analysis of variance (ANOVA) test with post hoc contrasts by the Student–Newman–Keuls test. The statistical significance threshold for all tests is set at p less than 0.05. The above statistical analysis shows that our experimental results are statistically significant.

From the performance comparison between \(ST-CGA^{+}\) and state-of-the-art spatial–temporal prediction techniques (with various neural network structures), we can observe that graph neural network-based models (e.g., ST-GCN and ST-MGCN) achieve better performance compared with others in most evaluation cases. This observation suggests the rationality of formulating the region-wise relation learning on graphs. Our developed \(ST-CGA^{+}\) framework is built on the graph neural network architecture with the injection of region-wise dependency with respect to traffic variation patterns. Moreover, from evaluation results in Tables 3 and  4, we can observe that both GNN-based methods and deep hybrid approaches achieve better performance than the spatial–temporal forecasting techniques based on the recurrent or convolutional neural networks (e.g., ST-RNN, D-LSTM, DeepST and ST-ResNet). This observation suggests that only modeling the traffic data from either temporal (with recurrent neural units) or spatial (with convolution-based feature extraction) dimension can hardly capture the complex traffic variation patterns across time slots and geographical regions. Different from ST-CGA and ST-GDN which ignore the sequential signals for modeling temporal dependence, our framework designs the multi-scale transformer network to encode the multi-level periodic patterns of traffic flow, and we further integrate it with an attentive graph neural architecture to capture the global cross-region traffic dependencies. In our proposed \(ST-CGA^{+}\) framework, we not only encode the temporal signals of traffic data with complex multi-grained periodic patterns, but also learn the spatial dependencies among different regions with a channel-aware convolution-based graph neural architecture.

We further compare performance by visualizing the traffic volume prediction errors of \(ST-CGA^{+}\) and several better-performing baselines (as shown in Fig. 5). Specifically, we visualize the forecasting errors between the ground truth traffic flow volume \({\bar{x}}_{m,n}^t\) and the corresponding estimated value \(x_{m,n}^t\), i.e., \([({\bar{x}}_{m,n}^t)-(x_{m,n}^t)]^{2}\), with geographical heatmaps for the BJ-Taxi dataset. \({\bar{x}}_{m,n}^t\) and \(x_{m,n}^t\) are calculated by averaging over the traffic inflow and outflow volume. In those figures, larger prediction errors are represented with brighter pixels, and each pixel corresponds to an individual geographical region. We can observe the performance superiority of our \(ST-CGA^{+}\) framework, which is consistent with the reported quantitative results in Tables 3 and 4.

Additionally, we present the evaluation errors (measured by RMSE) of different types of prediction cases in Fig. 4. Particularly, prediction cases of traffic inflow and outflow for each region are grouped into one of four categories (i.e., (0,325], (325,650], (650,975], (975,1300]) in terms of region’s traffic volume. We can observe that our \(ST-CGA^{+}\) always obtains the best performance as compared to other state-of-the-art competitors for different traffic volume groups, which further verifies the effectiveness of \(ST-CGA^{+}\) under different spatial–temporal data volumes.
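The grouped evaluation can be sketched as follows. The bin boundaries are the four groups listed above, while the function name `grouped_rmse`, the sample values, and the choice to assign each case to a group by its ground truth volume are assumptions made for illustration.

```python
import math

BINS = [(0, 325), (325, 650), (650, 975), (975, 1300)]  # (low, high] volume groups

def grouped_rmse(truth, pred, bins=BINS):
    """RMSE per traffic-volume group; a case belongs to the bin with low < truth <= high."""
    result = {}
    for low, high in bins:
        errs = [(p - t) ** 2 for t, p in zip(truth, pred) if low < t <= high]
        result[(low, high)] = math.sqrt(sum(errs) / len(errs)) if errs else None
    return result

truth = [100.0, 300.0, 400.0, 1000.0]   # hypothetical ground-truth volumes
pred = [103.0, 296.0, 410.0, 1000.0]    # hypothetical predictions
scores = grouped_rmse(truth, pred)
```

Groups without any prediction case are reported as `None` instead of zero, so that an empty group is not mistaken for a perfect prediction.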

Fig. 4 Evaluation errors (measured by RMSE) of different groups of prediction cases in terms of regions’ traffic volume from the predicted time slot on BJ-Taxi data

Table 3 Performance comparison of all methods on BJ-Taxi and NYC-Taxi in terms of RMSE and MAPE [40]
Table 4 Performance comparison of all methods on NYC-Bike-1 and NYC-Bike-2 in terms of RMSE and MAPE [40]

Fig. 5 Visualization for traffic volume prediction errors with geographical heatmaps

Fig. 6 Ablation study of the proposed \(ST-CGA^{+}\) framework in terms of RMSE and MAPE

4.4 RQ2: ablation analysis of the proposed model

In addition to the overall performance comparison between our \(ST-CGA^{+}\) and various state-of-the-art traffic prediction models, we further perform experiments to investigate the effectiveness of each designed sub-network in \(ST-CGA^{+}\). In particular, we consider the following model variants in our ablation study and show the evaluation results in Fig. 6.

  • Impact of graph attention module The first model variant \(ST-CGA^{+}\)\(-g\) does not inject the cross-region time-aware traffic dependence with the attentive graph neural network. The performance gap between \(ST-CGA^{+}\) and \(ST-CGA^{+}\)\(-g\) indicates the effectiveness of our graph attention layer to learn the inter-region traffic dependencies.

  • Impact of spatial relation encoder module The variant \(ST-CGA^{+}\)\(-s\) excludes the spatial relation encoder from our \(ST-CGA^{+}\) framework by removing the integrative architecture of the residual neural network and the channel-aware convolutional sub-network. The results justify the necessity of exploring the geographical relational structures to augment the modeling process of region-wise dependencies.

  • Impact of channel-aware recalibration module To study the effect of our designed channel-aware recalibration sub-network, the variant \(ST-CGA^{+}\)\(-c\) excludes the embedding recalibration component, which enhances the region-wise spatial dependency learning with learned channel-aware importance weights. By comparing \(ST-CGA^{+}\) and \(ST-CGA^{+}\)\(-c\), we can observe the positive effect of our encoded channel-aware region embeddings.

  • Impact of external data fusion module In this variant \(ST-CGA^{+}\)\(-e\), we do not incorporate the external knowledge with meteorological data in our traffic volume prediction framework. The performance gap between \(ST-CGA^{+}\) and \(ST-CGA^{+}\)\(-e\) suggests the effectiveness of our external data fusion module.

Overall, the investigation of the effects of our designed sub-networks indicates that our complete model \(ST-CGA^{+}\) could achieve the best performance as compared to other model variants in terms of RMSE and MAPE on different traffic inflow and outflow datasets. The observations demonstrate the significance of individual components in capturing the spatial context among regions and incorporating the external knowledge from meteorological data.

4.5 RQ3: performance versus multi-grained dynamics

We study the effects of multi-grained dynamic learning in \(ST-CGA^{+}\) with different temporal resolutions. To achieve this goal, we examine the performance of \(ST-CGA^{+}\) with the following settings of the resolution set \(P\):

  • \(ST-CGA^{+}\)\(_{h}\): \(p \in \{hour/30mins\}\)

  • \(ST-CGA^{+}\)\(_{h,d}\): \(p \in \{hour/30mins, day\}\)

  • \(ST-CGA^{+}\)\(_{h,w}\): \(p \in \{hour/30mins, week\}\)

  • \(ST-CGA^{+}\)\(_{h,d,w}\): \(p \in \{hour/30mins, day, week\}\)

  • \(ST-CGA^{+}\)\(_{h,d,w,m}\): \(p \in \{hour/30mins, day, week, month\}\)

We present the evaluation results under different temporal resolution settings in Fig. 7. We observe that \(ST-CGA^{+}\)\(_{h,d,w,m}\) achieves the best performance among the variants. This observation suggests that encoding temporal information with multiple resolution-specific latent representations is beneficial for capturing complex temporal patterns of traffic flow. Furthermore, compared with \(ST-CGA^{+}\)\(_{h}\), which learns temporal patterns at a single time resolution (hourly regularities), \(ST-CGA^{+}\)\(_{h,d}\) (i.e., < hourly, daily > regularities) and \(ST-CGA^{+}\)\(_{h,w}\) (< hourly, weekly > regularities) achieve better performance. The prediction performance can be further improved by exploring more temporal resolutions, i.e., \(ST-CGA^{+}\)\(_{h,d,w}\) (< hourly, daily, weekly > regularities) and \(ST-CGA^{+}\)\(_{h,d,w,m}\) (< hourly, daily, weekly, monthly > regularities). The above observations confirm the validity of our \(ST-CGA^{+}\) for modeling long-term temporal dependency with the multi-grained multi-head self-attentive layer.

Fig. 7 Influence of multi-resolution dynamics learning of \(ST-CGA^{+}\) in terms of RMSE and MAPE

4.6 RQ4: impact of encoder functions in channel-aware recalibration network

We investigate the impact of encoder functions in our channel-aware recalibration network for capturing geographical relationships between regions with different latent representation channels. In particular, we compare two encoding functions (fully convolutional network: \(ST-CGA^{+}\)\(_{FConv}\), and fully connected layers: \(ST-CGA^{+}\)\(_{FCL}\)), with 64 kernels and \(3\times 3\) filter size. In the convolutional encoder, we set the corresponding stride as (2,2). In addition, the nonlinear projection function in the fully connected neural encoder performs the embedding transformation from the dimension of (\(M\times N\times d\)) to (\(M\times N\times C\)). The results are shown in Table 5, where we can observe that \(ST-CGA^{+}\)\(_{FCL}\) performs better than \(ST-CGA^{+}\)\(_{FConv}\). A potential reason is that the fully connected neural network strengthens the convolution-based residual unit by injecting high-level nonlinearities. In contrast, \(ST-CGA^{+}\)\(_{FConv}\) may introduce some noise and can hardly capture the nonlinear cross-channel feature interaction in a comprehensive manner.

Table 5 Effect investigation of encoder functions in the channel-aware recalibration network on NYC-Taxi data

4.7 RQ5: parameter effect study

In this subsection, we study how the parameter settings of our \(ST-CGA^{+}\) affect the traffic flow prediction performance. Figs. 8 and 9 show the traffic prediction accuracy under different hyperparameter settings in terms of RMSE and MAPE. In each experiment, we vary the target parameter while keeping the other parameters unchanged. For different types of hyperparameters, we present the following observations.

Impact of filter size We first evaluate the model performance with different filter sizes (as shown in Figs. 8 and 9). We can observe that prediction accuracy improves when we increase the filter size from 2 to 3, which indicates that performing the convolution over regions with \(3\times 3\) spatial coverage is more beneficial for capturing the geographical dependencies than the \(2\times 2\) filter setting. However, filter sizes larger than \(3\times 3\) do not necessarily bring further gains. The reason is that convolutional operations with a larger filter size involve more trainable parameters, which may increase the training difficulty of neural networks.

Impact of sequence length We can observe that \(ST-CGA^{+}\) achieves comparable prediction performance with sequence lengths of only \(T_h=4\) and \(T_d=3\), which suggests that \(ST-CGA^{+}\) is effective in capturing long-term temporal dependencies of traffic variation patterns without requiring long traffic data series.

Impact of channel dimensionality We evaluate the model performance by varying the channel dimensionality. The best performance is achieved with a channel embedding size of 64. With latent dimensionality \(>64\), the performance degrades, which we attribute to overfitting.

Impact of # of graph neural layers Since our \(ST-CGA^{+}\) is built on the graph neural framework, we evaluate the effect of stacking more graph layers to distill the region-wise traffic dependence. Specifically, we search the number of graph neural layers in the range of {1,2,3,4,5} and present the evaluation results in Fig. 8. We can observe that moderately increasing the depth of the graph attentive mechanism boosts the traffic forecasting performance: the \(ST-CGA^{+}\) framework with 2 and 3 attention layers outperforms the model that considers first-order neighbors only. We attribute these improvements to the effective modeling of traffic dependencies among different regions through the injection of high-order graph-structured connective patterns. Nevertheless, when stacking four or five embedding propagation layers over the region graph, the prediction performance becomes worse and overfitting can be observed.
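The link between the number of stacked layers and the order of neighbors can be illustrated with a small sketch. It assumes a hypothetical \(9\times 9\) grid in which each region is connected to its four adjacent cells, and it tracks only which regions can influence the target after each propagation round; it does not reproduce the attention computation of \(ST-CGA^{+}\).

```python
from typing import Dict, List, Set, Tuple

Region = Tuple[int, int]


def grid_adjacency(rows: int, cols: int) -> Dict[Region, List[Region]]:
    """4-neighbor adjacency over a rows x cols grid of regions (an assumption)."""
    adj: Dict[Region, List[Region]] = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            adj[(r, c)] = [(i, j) for i, j in candidates
                           if 0 <= i < rows and 0 <= j < cols]
    return adj


def receptive_field(adj: Dict[Region, List[Region]], target: Region,
                    num_layers: int) -> Set[Region]:
    """Regions whose features can reach `target` after `num_layers` rounds."""
    reached = {target}
    for _ in range(num_layers):
        # One propagation layer pulls in the neighbors of everything reached so far.
        reached = reached | {n for node in reached for n in adj[node]}
    return reached


adj = grid_adjacency(9, 9)
for layers in range(1, 6):
    size = len(receptive_field(adj, (4, 4), layers))
    print(f"{layers} layer(s): {size} regions in the receptive field")
```

One layer covers only first-order neighbors, while each additional layer extends the receptive field by one hop, which is the sense in which deeper stacks inject high-order connectivity.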

Fig. 8 Hyperparameter study on BJ-Taxi data in terms of RMSE and MAPE

Fig. 9 Hyperparameter study on NYC-Taxi data in terms of RMSE and MAPE

4.8 RQ6: model interpretation with case study

We further demonstrate the interpretation ability of our model with a case study. In particular, we visualize the attention weights learned by our graph neural network for region-wise traffic dependency modeling. The visualization results on the BJ-Taxi dataset are shown in Fig. 10. Given the target geographical area of “Dongzhimen Bridge” in Beijing, we can observe that the regions most relevant to the target are either spatial neighbors or share similar region functionalities (e.g., shopping center and transportation hub). In Fig. 10, we highlight several examples of geographical regions with larger relevance scores with respect to “Dongzhimen Bridge,” such as “Sanyuan Bridge” and “Xizhimen Bridge.” All of these areas are transportation hubs and thus share similar traffic patterns in urban space. Hence, the learned attention weights between different regions show the explainability of our \(ST-CGA^{+}\) in capturing cross-region similarities in traffic patterns through our graph-structured relation encoder.
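A minimal sketch of how the most relevant regions could be read off from attention scores is given below. The region names and raw scores are made-up placeholders, not values from the BJ-Taxi experiment, and the softmax normalization is the generic form used in attention mechanisms rather than the exact formulation of \(ST-CGA^{+}\).

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into attention weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_regions(names: Sequence[str], scores: Sequence[float],
                  k: int) -> List[Tuple[str, float]]:
    """Return the k regions with the largest attention weight to the target."""
    weights = softmax(scores)
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical regions and raw relevance scores (illustrative only).
names = ["region_a", "region_b", "region_c", "region_d"]
scores = [2.0, 0.5, 1.5, -1.0]
for name, weight in top_k_regions(names, scores, k=2):
    print(f"{name}: attention weight {weight:.3f}")
```

Regions surfaced this way can then be compared against their known urban functions, as done for the target region in the case study.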

Fig. 10 Model explainability of \(ST-CGA^{+}\) on BJ-Taxi dataset with extracted region examples. The target region (Dongzhimen Bridge) has four key urban functions (i.e., CBD, Transportation Hub, Subway Station, Shopping Center). Eight highly relevant regions sharing similar functions are highlighted in the geographic map

4.9 RQ7: model efficiency study

Finally, we investigate the efficiency of our \(ST-CGA^{+}\) framework. Table 6 presents the computational cost of the training phase (with 300 epochs) and the inference phase for \(ST-CGA^{+}\) and the six best-performing baselines on four different traffic datasets. Inference time is measured for each model on the validation data. Table 7 reports efficiency in terms of storage space. All experiments are conducted with the default parameter configurations on a single NVIDIA GeForce GTX 1080 Ti GPU. We can observe that \(ST-CGA^{+}\) is more efficient than most of the compared approaches and achieves efficiency competitive with ST-GCN, although the attention-based graph embedding propagation layer has a higher computational cost than adjacency matrix-based graph convolution. Considering the prediction accuracy gap between \(ST-CGA^{+}\) and ST-GCN, this additional computational cost pays off by explicitly learning global region inter-dependencies.
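A wall-clock measurement of this kind could be sketched as below. The helper `time_phase` and the stand-in workload `dummy_inference` are hypothetical names introduced for illustration; the actual measurement procedure behind Table 6 is not described in detail. Note that when timing GPU code, the device typically has to be synchronized before reading the clock, which this CPU-only sketch omits.

```python
import time
from typing import Callable


def time_phase(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of `fn`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def dummy_inference() -> int:
    """Stand-in workload (assumption); a real run would call the model here."""
    return sum(i * i for i in range(100_000))


elapsed = time_phase(dummy_inference)
print(f"inference time: {elapsed:.4f} s")
```

Taking the minimum over several runs reduces the influence of transient system load; reporting the mean instead is an equally reasonable choice.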

Table 6 Model performance study with running time (s)

Table 7 Model performance study with storage space

5 Conclusion

In this paper, we propose \(ST-CGA^{+}\) for accurate traffic flow prediction, which is of great importance for intelligent transportation applications such as traffic management, congestion alleviation and public risk assessment. We study the traffic flow forecasting problem by developing a spatial–temporal convolutional graph attention network that effectively aggregates the hierarchical time- and region-wise dependent effects. \(ST-CGA^{+}\) addresses the challenges with three key modules: (i) a multi-scale temporal encoder, composed of multi-head self-attention layers, that explicitly models the intra-region traffic variation patterns at different time resolutions; (ii) an inter-region traffic dependency modeling component, which performs embedding propagation over the regions with a graph attention network; and (iii) spatially aware relation encoding, which incorporates the geographical context into the dependency learning across different regions. In addition, \(ST-CGA^{+}\) captures high-order spatial relation structures, especially channel information, with a channel-aware convolutional graph learning model, and integrates the collaborative signals from the spatial, temporal and semantic dimensions. In the experiments, compared with the previous state-of-the-art baselines, our method achieves an improvement of at least 5% in terms of RMSE and MAPE. For instance, \(ST-CGA^{+}\) reduces the RMSE of inflow/outflow predictions on BJ-Taxi to 14.46 and 17.41, respectively. Extensive experiments show that the proposed \(ST-CGA^{+}\) framework consistently outperforms different types of baselines under various experimental settings. In addition to the superior performance, the ablation study shows that the main modules we designed contribute significantly to the model's effectiveness. We also find that the distribution of the learned attention weights corresponds to the urban function of each region.

However, our model has the following shortcomings: (i) There are plenty of additional factors (e.g., weather, holidays, point-of-interest features of regions) in the traffic flow prediction scenario. Although our model considers some of this information, the design of the external factor encoder in Sect. 3 is too simple. Extracting the useful parts of external factors efficiently and accurately, and injecting this information into the model, is very important for improving performance. In the future, we plan to extend our traffic flow prediction framework by incorporating rich contextual information from more external data sources, in order to enhance the geographical context modeling phase. (ii) Our model lacks the ability to handle newly arriving data. A real-time spatial–temporal predictive solution deserves exploration, with the aim of handling newly arriving traffic flow data in a dynamic hyperparameter updating framework. (iii) \(ST-CGA^{+}\) suffers from data quality issues: (a) insufficient training data and skewed data distributions; (b) real-world traffic data are often noisy. In the future, to tackle these challenges, we will further optimize the network structure of \(ST-CGA^{+}\) with data augmentation and contrastive learning techniques based on graph neural networks. In addition, while the graph attention network provides a strong ability to capture cross-region correlations, it also lengthens the training time. Improper selection of the periodic parameters also leads to poor performance, and the large number of hyperparameters makes a complete grid search time-consuming. In view of the above shortcomings, we will continue to improve our model in the future.