
1 Introduction

From financial investment and market analysis [6] to traffic [21], electricity management, healthcare [4], and climate science, accurately predicting the future values of a series based on its available historical records has long been a coveted task in various scientific and industrial fields. A wide variety of methods has been employed for time series forecasting, ranging from statistical ones [2] to recent deep learning approaches [22]. However, several major challenges remain. Real-world time series data are often subject to noisy and irregular observations, missing values, repeated patterns of variable periodicities and very long-term dependencies. Moreover, although time series are supposed to represent continuous phenomena, the data is usually collected by sensors, so the observations are determined by a sampling rate with potential information loss. Standard sequential neural networks, such as recurrent (RNNs) [27] and convolutional networks (CNNs) [20], are discrete models and assume regular spacing between observations; several continuous analogues of such architectures that implicitly handle the time information have been proposed to address irregularly sampled and missing data [26]. Furthermore, the variable periodicities and long-term dependencies present in the data make models prone to shape and temporal distortions, overfitting and poor local minima when trained with standard loss functions (e.g., MSE). Variants of DTW and MSE have been proposed to mitigate these phenomena and can increase the forecasting quality of deep neural networks [16, 19].

A novel perspective for boosting the robustness of neural networks on complex time series is to extract representative embeddings for patterns after transforming them into another representation domain, such as the spectral one. Such approaches have seen much use in the text domain, where graph-based text mining (i.e., Graph-of-Words) [25] can be used to capture the relationships between terms and to build document-level representations. It is natural, then, that such approaches might be suitable for more general sequence modeling. Capitalizing on the recent success of graph neural networks (GNNs) on graph-structured data, a new family of algorithms jointly learns a correlation graph between interrelated time series while simultaneously performing forecasting [3, 29, 32]. The nodes of the learnable graph structure represent the individual time series and the links between them express their temporal similarities. However, since such methods rely on series-to-series correlations, they do not explicitly represent the evolution of the inter-series temporal dynamics. Some preliminary studies have proposed simple computational methods for mapping time series to temporal graphs in which each node corresponds to a time step, such as the visibility graph [17] and the recurrence network [7].

In this paper, we propose a novel neural network, TimeGNN, that extends these previous approaches by jointly learning dynamic temporal graphs and forecasting from raw time series data. TimeGNN (i) extracts temporal embeddings from sliding windows of the input series using dilated convolutions with different receptive field sizes, (ii) constructs a learnable graph structure, which is forward and directed, based on the similarity of the embedding vectors in each window in a differentiable way, and (iii) applies standard GNN architectures to learn embeddings for each node and produces forecasts based on the representation vector of the last time step. We evaluate the proposed architecture on various real-world datasets and compare it against several deep learning benchmarks, including graph-based approaches. Our results indicate that TimeGNN is significantly less costly in both inference and training while achieving comparable forecasting performance. The code implementation for this paper is available at https://github.com/xun468/Time-GNN.

2 Related Work

Time Series Forecasting Models. Time series forecasting has been a long-studied challenge in several application domains. Among statistical methods, linear models, including the autoregressive integrated moving average (ARIMA) [2] and its multivariate extension, the vector autoregressive model (VAR) [10], constitute the most dominant approaches. The need to capture non-linear patterns and to overcome the strong assumptions of statistical methods, e.g., the stationarity assumption, has led to the application of deep neural networks, originally introduced for sequential modeling, to the time series forecasting setting. These models include recurrent neural networks (RNNs) [27] and their improved variants that alleviate the vanishing gradient problem, namely the LSTM [12] and the GRU [5]. An alternative way of capturing long-term dependencies is to enlarge the receptive field via stacked dilated convolutions, as proposed with the Temporal Convolution Network (TCN) [1]. Bridging CNNs and LSTMs to capture both short-term local dependency patterns among variables and long-term patterns, the Long- and Short-term Time-series network (LSTNet) [18] has been proposed. For univariate point forecasting, the recently proposed N-BEATS model [24] introduces a deep neural architecture based on a deep stack of fully-connected layers with basis expansion. Attention-based approaches have also been employed for time series forecasting, including the Transformer [30] and Informer [35]. Finally, for efficient long-term modeling, the more recent Autoformer architecture [31] introduces an auto-correlation mechanism in place of self-attention, which extracts and aggregates similar sub-series based on the series periodicity.

Graph Neural Networks. Over the past few years, graph neural networks (GNNs) have been applied with great success to machine learning problems on graphs in various fields, including chemistry for drug screening [14] and biology for predicting the functions of proteins modeled as graphs [9]. The field of GNNs has been largely dominated by the so-called message passing neural networks (MPNNs) [8], where each node updates its feature vector by aggregating the feature vectors of its neighbors. In the case of time series data on arbitrary known graphs, e.g., in traffic forecasting, several architectures that combine sequential models with GNNs have been proposed [21, 28, 33, 34].

Joint Graph Structure Learning and Forecasting. Since spatial-temporal forecasting requires an a priori topology, which is not available for most real-world time series datasets, graph structure learning has arisen as a viable alternative. Recent models perform joint graph learning and forecasting for multivariate time series data using GNNs, intending to capture temporal patterns and exploit the interdependency among time series while predicting the series' future values. The most dominant algorithms include NRI [15], MTGNN [32] and GTS [29], in which the graph nodes represent the individual time series and the edges represent their temporal evolution. MTGNN obtains the graph adjacency matrix as a degree-k structure from the pairwise scores of the embeddings of each series in the multivariate collection, which might pose challenges to end-to-end learning. On the other hand, NRI and GTS employ the Gumbel softmax trick [13] to differentiably sample a discrete adjacency matrix from the edge probabilities. Both models compute fixed-size representations of each node based on the time series, with the former dynamically producing the representations per individual window and the latter extracting global representations from the whole training series. MTGNN combines temporal convolution with graph convolution layers, whereas GTS uses a Diffusion Convolutional Recurrent Neural Network (DCRNN) [21], in which the hidden representations of nodes are diffused using graph convolutions at each step.

3 Method

Let \(\{\textbf{X}_{i,1:T}\}_{i=1}^m\) be a multivariate time series that consists of m channels and has a length equal to T. Then, \(\textbf{X}_t \in \mathbb {R}^{m}\) represents the observed values at time step t. Let also \(\mathcal {G}\) denote the set of temporal dynamic graph structures that we want to infer.

Given the observed values of \(\tau \) previous time steps of the time series, i.e., \(\textbf{X}_{t-\tau }, \ldots , \textbf{X}_{t-1}\), the goal is to forecast the next h time steps (e.g., \(h=1\) for 1-step forecasting), i.e., \(\hat{\textbf{X}}_{t}, \hat{\textbf{X}}_{t+1}, \ldots ,\) \(\hat{\textbf{X}}_{t+h-1}\). These values can be obtained by the forecasting model \(\mathcal {F}\) with parameters \(\varPhi \) and the graphs \(\mathcal {G}\) as follows:

$$\begin{aligned} \hat{\textbf{X}}_{t}, \hat{\textbf{X}}_{t+1}, \ldots , \hat{\textbf{X}}_{t+h-1} = \mathcal {F}(\textbf{X}_{t-\tau }, \ldots , \textbf{X}_{t-1} ; \mathcal {G} ; \varPhi ) \end{aligned}$$
(1)
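To make the windowed formulation of Eq. (1) concrete, the following minimal sketch (the helper name make_windows and the variables tau and horizon are our own, not part of the paper) pairs each input window of \(\tau \) observations with its h forecast targets:

```python
import numpy as np

def make_windows(series: np.ndarray, tau: int, horizon: int):
    """Split a multivariate series of shape (T, m) into (window, target)
    pairs following Eq. (1): inputs of shape (K, tau, m) and targets of
    shape (K, horizon, m), where K is the number of sliding windows."""
    T = series.shape[0]
    inputs, targets = [], []
    for t in range(tau, T - horizon + 1):
        inputs.append(series[t - tau:t])        # X_{t-tau}, ..., X_{t-1}
        targets.append(series[t:t + horizon])   # X_t, ..., X_{t+h-1}
    return np.stack(inputs), np.stack(targets)

# Toy example: T = 200 time steps, m = 8 channels, tau = 96, h = 1
X = np.random.randn(200, 8).astype(np.float32)
windows, targets = make_windows(X, tau=96, horizon=1)
print(windows.shape, targets.shape)  # (104, 96, 8) (104, 1, 8)
```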
Fig. 1. The proposed TimeGNN framework for graph learning from raw time series and forecasting based on embeddings learned on the parameterized graph structures.

3.1 Time Series Feature Extraction

Unlike previous methods, which extract one feature vector per variable of the multivariate input, our method extracts one feature vector per time step in each window k of length \(\tau \). Temporal sub-patterns are learned using stacked dilated convolutions, similar to the main blocks of the inception architecture [23].

Given the sliding windows \(\textbf{S} = \{\textbf{X}_{t-\tau +k-K}, \ldots , \textbf{X}_{t+k-K-1}\}_{k=1}^K\), we perform the following convolutional operations to extract three feature maps \(\textbf{f}_0^k\), \(\textbf{f}_1^k\), \(\textbf{f}_2^k\) per window \(\textbf{S}^k\), with \(\textbf{f}_i^k \in \mathbb {R}^{\tau \times d}\) for hidden dimension d of the convolutional kernels:

$$\begin{aligned} \begin{aligned} &\textbf{f}_0^k = \textbf{S}^k *\textbf{C}_0^{1,1} +\textbf{b}_{01} \\ &\textbf{f}_1^k = (\textbf{S}^k *\textbf{C}_1^{1,1} +\textbf{b}_{11}) *\textbf{C}_2^{3,3} + \textbf{b}_{23} \\ &\textbf{f}_2^k = (\textbf{S}^k *\textbf{C}_2^{1,1} + \textbf{b}_{21}) *\textbf{C}_2^{5,5} + \textbf{b}_{25} \end{aligned} \end{aligned}$$
(2)

where \(*\) denotes the convolution operator, \(\textbf{C}_0^{1,1}\), \(\textbf{C}_1^{1,1}\), \(\textbf{C}_2^{1,1}\) are convolutional kernels of size 1 and dilation rate 1, \(\textbf{C}_2^{3,3}\) is a convolutional kernel of size 3 and dilation rate 3, \(\textbf{C}_2^{5,5}\) is a convolutional kernel of size 5 and dilation rate 5, and \(\textbf{b}_{01}, \textbf{b}_{11}, \textbf{b}_{21}, \textbf{b}_{23}, \textbf{b}_{25}\) are the corresponding bias terms.

The final representations per window k are obtained by applying a fully connected layer to the concatenated features \(\textbf{f}_0^k, \textbf{f}_1^k, \textbf{f}_2^k\), i.e., \(\textbf{z}^k = \text {FC}(\textbf{f}_0^k \Vert \textbf{f}_1^k\Vert \textbf{f}_2^k)\), such that \(\textbf{z}^k \in \mathbb {R}^{\tau \times d}\). In the following sections, we refer to the hidden representation of time step i produced by the feature extraction module in window k as \(\textbf{z}_i^k, \forall ~i \in \{1, \ldots , \tau \}\).
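This feature-extraction block can be sketched in PyTorch as follows; it is a minimal illustration under our own assumptions (the class name FeatureExtractor, the use of 'same'-style padding, and the absence of intermediate non-linearities are our choices, not specified by Eq. (2)):

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Inception-style feature extraction over a window of shape (batch, tau, m),
    producing one d-dimensional feature vector per time step, z^k in R^{tau x d}."""
    def __init__(self, m: int, d: int):
        super().__init__()
        self.branch0 = nn.Conv1d(m, d, kernel_size=1)                # C_0^{1,1}
        self.branch1 = nn.Sequential(
            nn.Conv1d(m, d, kernel_size=1),                          # C_1^{1,1}
            nn.Conv1d(d, d, kernel_size=3, dilation=3, padding=3),   # C^{3,3}
        )
        self.branch2 = nn.Sequential(
            nn.Conv1d(m, d, kernel_size=1),                          # C_2^{1,1}
            nn.Conv1d(d, d, kernel_size=5, dilation=5, padding=10),  # C^{5,5}
        )
        self.fc = nn.Linear(3 * d, d)  # projection of the concatenated features

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        x = s.transpose(1, 2)                      # (batch, m, tau) for Conv1d
        f = torch.cat([self.branch0(x), self.branch1(x), self.branch2(x)], dim=1)
        return self.fc(f.transpose(1, 2))          # (batch, tau, d)

z = FeatureExtractor(m=8, d=32)(torch.randn(16, 96, 8))
print(z.shape)  # torch.Size([16, 96, 32])
```

Each branch preserves the window length \(\tau \), so concatenating along the channel dimension and projecting with the fully connected layer yields one d-dimensional vector per time step.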

3.2 Graph Structure Learning

The set \(\mathcal {G} = \{\mathcal {G}^k\}_{k=1}^K\) describes the collection of graph structures that are parameterized for the individual sliding windows of length \(\tau \) of the series, where K denotes the total number of windows. The goal of the graph learning module is to learn an adjacency matrix \(\textbf{A}^k \in \{0,1\}^{\tau \times \tau }\) for each temporal window of observations \(\textbf{S}^k\). Following the works of [15, 29], we use the Gumbel softmax trick to sample a discrete adjacency matrix, as described below.

For the Gumbel softmax trick, let \(\textbf{A}^k\) refer to a random variable of the matrix Bernoulli distribution parameterized by \(\boldsymbol{\theta }^k \in [0,1]^{\tau \times \tau }\), so that \(A_{ij}^k \sim Ber(\theta _{ij}^k)\) independently for each pair (i, j). By applying the Gumbel reparameterization trick [13] to enable differentiable sampling, we obtain the following:

$$\begin{aligned} \begin{gathered} A_{ij}^k = \sigma ((\log (\theta _{ij}^k/(1-\theta _{ij}^k)) + (\textbf{g}_{i,j}^1 - \textbf{g}_{i,j}^2))/s),\\ \textbf{g}_{i,j}^1,\textbf{g}_{i,j}^2 \sim \text {Gumbel}(0,1), \forall ~i,j \end{gathered} \end{aligned}$$
(3)

where \(\textbf{g}_{i,j}^1,\textbf{g}_{i,j}^2\) are i.i.d. samples drawn from the Gumbel(0, 1) distribution, \(\sigma \) is the sigmoid activation and s is a parameter that controls the smoothness of the samples, so that the distribution converges to categorical values as \(s \rightarrow 0\).
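Eq. (3) amounts to a binary-concrete relaxation and can be written as a short function; the sketch below is our reading of it (the function name gumbel_sigmoid_sample and the default value of s are ours):

```python
import torch

def gumbel_sigmoid_sample(theta: torch.Tensor, s: float = 0.5) -> torch.Tensor:
    """Differentiable relaxed sampling of Bernoulli edges, Eq. (3).

    theta: edge probabilities in (0, 1), shape (tau, tau).
    Returns relaxed adjacency entries in (0, 1); as s -> 0 they approach {0, 1}."""
    eps = 1e-8
    logits = torch.log(theta + eps) - torch.log(1.0 - theta + eps)   # log(theta / (1 - theta))
    g1 = -torch.log(-torch.log(torch.rand_like(theta) + eps) + eps)  # Gumbel(0, 1) noise
    g2 = -torch.log(-torch.log(torch.rand_like(theta) + eps) + eps)
    return torch.sigmoid((logits + g1 - g2) / s)
```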

The link predictor takes each pair of extracted features \((\textbf{z}_i^k,\textbf{z}_j^k)\) of window k and maps their similarity to a probability \(\theta _{ij}^k \in [0,1]\) by applying fully connected layers. The Gumbel reparameterization trick of Eq. (3) is then applied to these probabilities to sample edges while retaining differentiability:

$$\begin{aligned} \theta _{ij}^k = \sigma \Big ( \text {FC}\big (\text {FC}(\textbf{z}_i^k\Vert \textbf{z}_j^k)\big )\Big ) \end{aligned}$$
(4)

In order to obtain directed and forward graph structures \(\mathcal {G}\) (i.e., with no look-back edges to previous time steps in the history), we only learn the upper triangular part of the adjacency matrices.
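A possible way of combining the link predictor of Eq. (4), the sampling of Eq. (3) and the forward-in-time constraint is sketched below; the module name GraphLearner, the ReLU between the two fully connected layers and the hidden width are our assumptions, and gumbel_sigmoid_sample refers to the helper sketched above:

```python
import torch
import torch.nn as nn

class GraphLearner(nn.Module):
    """Scores every ordered pair of per-step embeddings and samples a
    forward, directed adjacency matrix (upper-triangular only)."""
    def __init__(self, d: int, hidden: int = 64, s: float = 0.5):
        super().__init__()
        self.link_predictor = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.s = s

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (tau, d) per-step embeddings of one window
        tau = z.size(0)
        zi = z.unsqueeze(1).expand(tau, tau, -1)   # z_i broadcast over j
        zj = z.unsqueeze(0).expand(tau, tau, -1)   # z_j broadcast over i
        theta = torch.sigmoid(
            self.link_predictor(torch.cat([zi, zj], dim=-1))
        ).squeeze(-1)                              # Eq. (4): theta in [0, 1]^{tau x tau}
        adj = gumbel_sigmoid_sample(theta, self.s) # Eq. (3): relaxed sampling
        # keep only edges from earlier to later time steps (no look-back)
        return torch.triu(adj, diagonal=1)
```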

3.3 Graph Neural Network for Forecasting

Once the collection \(\mathcal {G}\) of learnable graph structures, one per sliding window k, has been sampled, standard GNN architectures can be applied to capture the node-to-node relations, i.e., the temporal graph dynamics. GraphSAGE [11] was chosen as the basic GNN building block of the node embedding learning architecture, as it can effectively generalize across different graphs with the same attributes. GraphSAGE is an inductive framework that exploits node feature information and generates node embeddings (i.e., \(\textbf{h}_u\) for node u) via a learnable function, by sampling and aggregating features from a node's local neighborhood (i.e., \(\mathcal {N}(u)\)).

Let \((\mathcal {V}^k, \mathcal {E}^k)\) denote the sets of nodes and edges of the learnable graph structure \(\mathcal {G}^k\). The node embedding update process for each of the \(p \in \{1, \ldots , P\}\) aggregation steps employs the mean-based (convolutional) aggregator, which calculates the element-wise mean of the vectors in \(\{\textbf{h}_v^{p-1}, \forall v \in \mathcal {N}(u)\}\), such that:

$$\begin{aligned} \textbf{h}_u^{p} \xleftarrow {} \sigma (\textbf{W} \cdot \text {MEAN}(\{\textbf{h}_u^{p-1}\} \cup \{\textbf{h}_v^{p-1}, \forall v \in \mathcal {N}(u)\})) \end{aligned}$$
(5)

where \(\textbf{W}\) denotes trainable weights. The final normalized representation (i.e., \(\mathbf {\tilde{h}}_u^{P}\)) of the last node (i.e., time step) in each forward and directed graph, denoted as \(\textbf{z}_{u_T} = \mathbf {\tilde{h}}_{u_T}^P\), is passed to the output module. The output module consists of two fully connected layers that reduce this vector to the final output dimension, so as to correspond to the forecasts \(\hat{\textbf{X}}_{t}, \hat{\textbf{X}}_{t+1}, \ldots , \hat{\textbf{X}}_{t+h-1}\). Figure 1 illustrates the feature extraction, graph learning, GNN and output modules of the proposed TimeGNN architecture.
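The aggregation of Eq. (5), the last-node readout and the two-layer output head can be sketched as follows. We write the mean aggregator by hand over a dense (relaxed) adjacency so the sketch stays self-contained; the class names, the choice of ReLU as the non-linearity \(\sigma \), the aggregation over incoming edges and the use of L2 normalization for \(\mathbf {\tilde{h}}\) are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanSAGELayer(nn.Module):
    """One GraphSAGE-style mean aggregation step, Eq. (5), on a dense adjacency."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (tau, d_in); adj[i, j] is the (relaxed) weight of edge i -> j.
        # Aggregate each node with itself and its incoming neighbours.
        adj_hat = adj.t() + torch.eye(adj.size(0), device=adj.device)
        agg = adj_hat @ h / adj_hat.sum(dim=1, keepdim=True).clamp(min=1e-8)
        return torch.relu(self.W(agg))

class GNNForecaster(nn.Module):
    """Stacked mean-aggregation layers followed by the two-layer output head
    applied to the normalized representation of the last time step."""
    def __init__(self, d: int, m: int, horizon: int, p_steps: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(MeanSAGELayer(d, d) for _ in range(p_steps))
        self.out = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, m * horizon))
        self.m, self.horizon = m, horizon

    def forward(self, z: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = z                                       # (tau, d) per-step embeddings
        for layer in self.layers:
            h = layer(h, adj)
        h_last = F.normalize(h[-1], dim=-1)         # normalized last-node embedding
        return self.out(h_last).view(self.horizon, self.m)
```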

3.4 Training and Inference

To train the parameters of Eq. (1) for the time series point forecasting task, we use the mean absolute error (MAE) loss. Let \(\mathbf {\hat{X}}^{(i)}, i \in \{1, \ldots , K\}\) denote the predicted values for the K samples; then the MAE loss is defined as:

$$\begin{aligned} \mathcal {L} = \frac{1}{K}\sum _{i=1}^{K}\Vert \mathbf {\hat{X}}^{(i)}-\textbf{X}^{(i)}\Vert \end{aligned}$$

The optimized weights of the feature extraction, graph structure learning, GNN and output modules are selected based on the minimum validation loss during training, which is evaluated as described in the experimental setup (Sect. 4.3).
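For illustration, one training epoch under this objective could look as follows; this is a sketch in which model is assumed to wrap the feature extraction, graph learning and GNN modules sketched above and to return forecasts of shape (batch, h, m):

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cpu"):
    """One training epoch minimizing the MAE loss of Sect. 3.4.
    loader is assumed to yield (windows, targets) batches of shapes
    (B, tau, m) and (B, h, m)."""
    criterion = nn.L1Loss()  # mean absolute error
    model.train()
    total = 0.0
    for windows, targets in loader:
        windows, targets = windows.to(device), targets.to(device)
        optimizer.zero_grad()
        preds = model(windows)
        loss = criterion(preds, targets)
        loss.backward()
        optimizer.step()
        total += loss.item() * windows.size(0)
    return total / len(loader.dataset)
```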

4 Experimental Evaluation

We next describe the experimental setup, including the datasets and baselines used for comparison. We then present and analyze the results obtained by the proposed TimeGNN architecture and the baseline models.

4.1 Datasets

The proposed approach was evaluated on the following multivariate time series datasets:

Exchange-Rate, which consists of the daily exchange rates of 8 countries from 1990 to 2016, following the preprocessing of [18].

Weather, which contains hourly observations of 12 climatological features over a period of four years, preprocessed as in [35].

Fig. 2. Computation costs of TimeGNN, TimeMTGNN and the baseline models. (a) Inference and training time per epoch on each dataset. (b) Inference and training epoch times for varying window sizes on the Weather dataset.

Electricity-Load, which is based on the UCI Electricity Consuming Load dataset recording the electricity consumption of 370 Portuguese clients from 2011 to 2014. As in [35], the recordings are binned into hourly intervals over the period of 2012 to 2014 and clients with incomplete records are removed.

Solar-Energy, which contains the solar power production records of 137 PV plants in the state of Alabama in 2006, sampled every 10 minutes.

Traffic, a collection of 48 months (2015-2016) of hourly data from the California Department of Transportation, describing the road occupancy rates (between 0 and 1) measured by different sensors.

4.2 Baselines

We consider five baseline models for comparison with our proposed TimeGNN architecture: two graph-based methods, MTGNN [32] and GTS [29], and three non-graph-based methods, LSTNet [18], LSTM [12], and TCN [1]. We also evaluate the performance of TimeMTGNN, a variant of MTGNN that includes our proposed graph learning module. LSTM and TCN follow the hidden dimension size and number of layers of TimeGNN, which were fixed to three layers with hidden dimensions of 32 and 64 for the Exchange-Rate and Weather datasets, and 128 for Electricity-Load, Solar-Energy and Traffic. For MTGNN, GTS, and LSTNet, the parameters were kept as close as possible to the ones reported in their respective experimental setups.

4.3 Experimental Setup

Each model is trained for two runs of 50 epochs and the average mean squared error (MSE) and mean absolute error (MAE) scores on the test set are recorded. The model chosen for evaluation is the one that performs best on the validation set during training. The same dataloader is used for all models, with train, validation, and test splits of 0.7, 0.1, and 0.2 respectively. The data is split first and each split is scaled using the standard scaler. The dataloader uses windows of length 96 and a batch size of 16. The forecasting horizons tested are 1, 3, 6, and 9 time steps into the future, where the exact duration of a time step depends on the dataset (e.g., 3 time steps correspond to 3 hours into the future for the Weather dataset and 3 days into the future for the Exchange-Rate dataset). In this paper, we use single-step forecasting for ease of comparison with the other baseline methods. For training, we use the Adam optimizer with a learning rate of 0.001. Experiments for the Weather and Exchange-Rate datasets were conducted on an NVIDIA T4 and those for Electricity-Load, Solar-Energy, and Traffic on an NVIDIA A40.
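The split-then-scale procedure described above can be sketched as follows; the function name prepare is ours, make_windows refers to the helper sketched in Sect. 3, and fitting the scaler on the training split only is our assumption, since the wording above ("each split is scaled") leaves this detail open:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def prepare(series: np.ndarray, tau: int = 96, horizon: int = 1):
    """Chronological 0.7/0.1/0.2 split, then standard scaling and windowing.
    series has shape (T, m)."""
    T = len(series)
    n_train, n_val = int(0.7 * T), int(0.1 * T)
    splits = {
        "train": series[:n_train],
        "val": series[n_train:n_train + n_val],
        "test": series[n_train + n_val:],
    }
    scaler = StandardScaler().fit(splits["train"])   # fit on training data only
    return {name: make_windows(scaler.transform(part), tau, horizon)
            for name, part in splits.items()}
```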

Table 1. Forecasting performance of all baselines on all multivariate datasets for different horizons h (best in bold, second best underlined).

4.4 Results

Scalability. We compare the inference and training times of the graph-based models TimeGNN, MTGNN and GTS in Fig. 2, which also includes measurements for the TimeMTGNN variant from the ablation study described in the relevant paragraph below. Figure 2(a) shows the computational costs on each dataset. Among the baseline models, GTS is the most costly in both inference and training time due to its use of the entire training dataset for graph construction. In contrast, MTGNN learns static node features and is consequently more efficient. As the number of variables grows, the inference times of MTGNN and GTS increase noticeably, since their graph sizes grow as well. TimeGNN's graph does not grow with the number of variables and, consequently, its inference time scales well across datasets. The training epoch times follow the same pattern as the inference times.

Since the size of the graphs used by TimeGNN depends on the window size, Fig. 2(b) shows the cost of increasing the window size on the Weather dataset. As the window size increases, so does the cost of inference and training for all models. Because the graph learning modules of MTGNN and GTS do not interact with the window size, the increase in their cost can primarily be attributed to their forecasting modules. MTGNN's inference times do not increase as dramatically as those of GTS, implying a more robust forecasting module. As the window size grows, TimeGNN's inference and training costs increase more slowly than those of the other methods, and it remains the fastest of the GNN methods. In other words, the time-based graph learning module does not become overly cumbersome as the window size increases.

Forecasting Quality. Table 1 summarizes the forecasting performance of the baseline models and TimeGNN for different horizons \(h \in \{1,3,6,9\}\).

In general, GTS has the best forecasting performance on the smaller Exchange-Rate dataset; its use of the training data during graph construction may give it an advantage over the other methods here. TimeGNN, however, shows signs of overfitting during training and is unable to match the other two GNNs. On the Weather dataset, the purely recurrent methods achieve the best MSE scores across all horizons. TimeGNN is competitive with the recurrent methods on this metric and surpasses them on MAE, which suggests that TimeGNN produces larger outlier errors than the recurrent methods; it is nonetheless the best performing GNN method on this dataset.

On the larger Electricity-Load, Solar-Energy, and Traffic datasets, MTGNN is generally the top performer, with LSTNet close behind. However, for larger horizons, TimeGNN performs better than GTS and competitively with LSTNet and the other recurrent models. This shows that time-domain graphs can successfully capture long-term dependencies within a dataset, although TimeGNN struggles more with short-term predictions. The latter could also be attributed to the simplicity of TimeGNN's forecasting module compared to the other graph-based approaches.

Ablation Study. To empirically examine the effect of the forecasting module and the representational power of the proposed graph construction module in TimeGNN, we conducted an ablation study in which we replaced MTGNN's graph construction module with our own, yielding the so-called TimeMTGNN baseline. The remaining modules and hyperparameters of TimeMTGNN are kept as similar as possible to MTGNN. TimeMTGNN shows forecasting performance comparable to MTGNN on the larger Electricity-Load, Solar-Energy, and Traffic datasets and higher performance on the smaller Exchange-Rate and Weather datasets. This shows that the TimeGNN graph construction module is capable of learning meaningful graph representations that do not impede, and in some cases improve, forecasting quality. As seen in Fig. 2, the computational performance of TimeMTGNN suffers in comparison to MTGNN. A major contributing factor is the number of graphs produced: MTGNN learns a single graph per dataset, whereas TimeMTGNN produces one graph per window, so the number of GNN operations is greatly increased. However, the focus of this experiment was to confirm that the proposed temporal graph-learning module preserves or improves accuracy over static ones, rather than to optimize efficiency.

5 Conclusion

We have presented a novel method for representing and dynamically generating graphs from raw time series. While conventional methods construct graphs based on the variables, we instead construct graphs in which each time step is a node. We use this method in TimeGNN, a model consisting of a graph construction module and a simple GNN-based forecasting module, and examine its performance against state-of-the-art neural networks. While TimeGNN's relative performance differs between datasets, this representation is clearly able to capture and learn the underlying properties of time series. Moreover, it is far faster and more scalable than existing graph-based methods as both the number of variables and the window size increase.