
1 Introduction

Time series forecasting is an important task in industry and academia, with applications in fields such as retail demand forecasting [1], finance [2,3,4], and traffic flow prediction [5]. Traditionally, time series forecasting was dominated by linear models such as the autoregressive integrated moving average (ARIMA) model, which require prior knowledge about time series structures such as seasonality and trend. With the increasing abundance of data and computational power, however, deep learning models have gained much research interest due to their ability to learn complex temporal relationships in a purely data-driven manner, requiring minimal human intervention and subject-matter expertise. In this work, we combine deep learning with state space models (SSM) for sequential modelling. Our work follows a recent trend of combining the powerful modelling capabilities of deep learning with well-understood theoretical frameworks such as SSMs.

Recurrent neural networks (RNN) are a popular class of neural networks for sequential modelling. There exists a great abundance of literature on time series modelling with RNNs across different domains [6,7,8,9,10,11,12,13,14,15,16]. However, vanilla RNNs have deterministic transition functions, which may limit their expressive power when modelling sequences with high variability and complexity [18]. There is recent evidence that the performance of RNNs on complex sequential data such as speech, music, and videos can be improved when uncertainty is incorporated into the modelling process [19,20,21,22,23,24]. This approach makes an RNN more expressive: instead of outputting a single deterministic hidden state at every time step, it now considers many possible future paths before making a prediction. Inspired by this, we propose an RNN cell with stochastic hidden states for time series forecasting, achieved by inserting a latent random variable into the RNN update function. Our approach corresponds to a state space formulation of time series modelling in which the RNN transition function defines the latent state equation, and another neural network defines the observation equation given the RNN hidden state. The main contributions of our paper are as follows:

  1.

    We propose a novel deep stochastic recurrent architecture for multistep-ahead time series forecasting, which leverages both the ability of regular RNNs to model long-term dynamics and the stochastic framework of state space models.

  2.

    We conduct experiments using publicly available datasets in the fields of finance, traffic flow prediction, air quality forecasting, and disease transmission. Results demonstrate that our stochastic RNN consistently outperforms its deterministic counterpart, and that it is capable of generating probabilistic forecasts.

2 Related Works

2.1 Recurrent Neural Networks

The recurrent neural network (RNN) is a deep architecture specifically designed to handle sequential data, and has delivered state-of-the-art performance in areas such as natural language processing [25]. The structure of the RNN is such that at each time step t, the hidden state of the network - which learns a representation of the raw inputs - is updated using the external input for time t as well as network outputs from the previous step \(t - 1\). The weights of the network are shared across all time steps and the model is trained using back-propagation. When used to model long sequences of data, the RNN is subject to the vanishing/exploding gradient problem [26]. Variants of the RNN such as the LSTM [27] and the GRU [28] were proposed to address this issue. These variants use gated mechanisms to regulate the flow of information. The GRU is a simplification of the LSTM without a memory cell, which is more computationally efficient to train and offers comparable performance to the LSTM [29].

2.2 Stochastic Gradient Variational Bayes

The authors in [21] proposed combining an RNN with a variational auto-encoder (VAE) to leverage the RNN’s ability to capture time dependencies and the VAE’s role as a generative model. The proposed structure consists of an encoder that learns a mapping from data to a distribution over latent variables, and a decoder that maps latent representations back to data. The model can be trained efficiently with Stochastic Gradient Variational Bayes (SGVB) [30], enabling efficient, large-scale unsupervised variational learning on sequential data. Consider an input \(\boldsymbol{x}\) of arbitrary size; we wish to model the data distribution \(p(\boldsymbol{x})\) given some unobserved latent variable \(\boldsymbol{z}\) (again, of arbitrary dimension). The aim is to maximise the marginal likelihood \(p(\boldsymbol{x}) = \int p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z}) \,d\boldsymbol{z}\), which is often intractable when the likelihood \(p(\boldsymbol{x}|\boldsymbol{z})\) is expressed by a neural network with non-linear layers. Instead we apply variational inference and maximise the evidence lower bound (ELBO):

$$\begin{aligned} \log p(\boldsymbol{x})&=\log \int p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})\,d\boldsymbol{z} = \log \int p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})\frac{q(\boldsymbol{z}|\boldsymbol{x})}{q(\boldsymbol{z}|\boldsymbol{x})}\,d\boldsymbol{z} \nonumber \\&\ge \mathbb {E}_{\boldsymbol{z} \sim q(\boldsymbol{z}|\boldsymbol{x})} [\log p(\boldsymbol{x}|\boldsymbol{z})] - KL(q(\boldsymbol{z}|\boldsymbol{x})\,||\,p(\boldsymbol{z})) = \text {ELBO}, \end{aligned}$$
(1)

where \(q(\boldsymbol{z}|\boldsymbol{x})\) is the variational approximation to the true posterior distribution \(p(\boldsymbol{z}|\boldsymbol{x})\) and KL is the Kullback-Leibler divergence. For the rest of this paper we refer to \(p(\boldsymbol{x}|\boldsymbol{z})\) as the decoding distribution and \(q(\boldsymbol{z}|\boldsymbol{x})\) as the encoding distribution. The relationship between the marginal likelihood \(p(\boldsymbol{x})\) and the ELBO is given by

$$\begin{aligned} \log p(\boldsymbol{x}) = \mathbb {E}_{\boldsymbol{z} \sim q(\boldsymbol{z}|\boldsymbol{x})} [\log p(\boldsymbol{x}|\boldsymbol{z})] - KL(q(\boldsymbol{z}|\boldsymbol{x})\,||\,p(\boldsymbol{z})) + KL(q(\boldsymbol{z}|\boldsymbol{x})\,||\,p(\boldsymbol{z}|\boldsymbol{x})), \end{aligned}$$
(2)

where the third KL term specifies the tightness of the lower bound. The expectation \(\mathbb {E}_{\boldsymbol{z} \sim q(\boldsymbol{z}|\boldsymbol{x})} [\log p(\boldsymbol{x}|\boldsymbol{z})]\) can be interpreted as an expected negative reconstruction error, and \(KL(q(\boldsymbol{z}|\boldsymbol{x})||p(\boldsymbol{z}))\) serves as a regulariser.

2.3 State Space Models

State space models provide a unified framework for time series modelling; they are probabilistic graphical models that describe the relationships between observations and the underlying latent variables [35]. Exact inference is feasible only for hidden Markov models (HMM) and linear Gaussian state space models (LGSS), and neither is well suited to long-term prediction [31]. SSMs can be viewed as a probabilistic extension of RNNs. Inside an RNN, the evolution of the hidden states \(\boldsymbol{h}\) is governed by a non-linear transition function f: \(\boldsymbol{h}_{t+1} = f(\boldsymbol{h}_t, \boldsymbol{x}_{t+1})\), where \(\boldsymbol{x}\) is the input vector. In an SSM, by contrast, the hidden states are treated as random variables. It is therefore natural to combine the non-linear gated mechanisms of the RNN with the stochastic transitions of the SSM; this creates a sequential generative model that is more expressive than the RNN and better at modelling long-term dynamics than the SSM. Many recent works draw connections between SSMs and VAEs using an RNN. The authors in [18] and [19] propose a sequential VAE with nonlinear state transitions in the latent space; in [32] the authors investigate various inference schemes for variational RNNs; in [22] the authors stack a stochastic SSM layer on top of a deterministic RNN layer; in [23] the authors propose a latent transition scheme that is stochastic conditioned on some inferable parameters; the authors in [33] propose a deep Kalman filter with exogenous inputs; the authors in [34] propose a stochastic variant of the Bi-LSTM; and in [37] the authors use an RNN to parameterise an LGSS.

3 Stochastic Recurrent Neural Network

3.1 Problem Statement

Consider a multivariate dataset comprising \(N+1\) time series: the covariates \(\boldsymbol{x}_{1:T+\tau } = \{\boldsymbol{x}_1,\boldsymbol{x}_2,\dots ,\boldsymbol{x}_{T+\tau }\} \in \mathbb {R}^{N \times (T+\tau )}\) and the target variable \(y_{1:T} \in \mathbb {R}^{1 \times T}\). We refer to the period \(\{T+1,T+2,\dots ,T+\tau \}\) as the prediction period, where \(\tau \in \mathbb {Z}^+\) is the number of prediction steps, and we wish to model the conditional distribution

$$\begin{aligned} P(y_{T+1:T+\tau }|y_{1:T}, \boldsymbol{x}_{1:T+\tau }).\end{aligned}$$
(3)

3.2 Stochastic GRU Cell

Here we introduce the update equations of our stochastic GRU, which forms the backbone of our temporal model:

$$\begin{aligned} \boldsymbol{u}_t&= \sigma (\boldsymbol{W}_u\cdot \boldsymbol{x}_t + \boldsymbol{C}_u\cdot \boldsymbol{z}_t + \boldsymbol{M}_u\cdot \boldsymbol{h}_{t-1} + \boldsymbol{b}_u) \end{aligned}$$
(4)
$$\begin{aligned} \boldsymbol{r}_t&= \sigma (\boldsymbol{W}_r\cdot \boldsymbol{x}_t + \boldsymbol{C}_r\cdot \boldsymbol{z}_t + \boldsymbol{M}_r\cdot \boldsymbol{h}_{t-1} + \boldsymbol{b}_r) \end{aligned}$$
(5)
$$\begin{aligned} \boldsymbol{\tilde{h}}_t&= tanh(\boldsymbol{W}_h\cdot \boldsymbol{x}_t + \boldsymbol{C}_h\cdot \boldsymbol{z}_t + \boldsymbol{r}_t\odot \boldsymbol{M}_h\cdot \boldsymbol{h}_{t-1} + \boldsymbol{b}_h) \end{aligned}$$
(6)
$$\begin{aligned} \boldsymbol{h}_t&= \boldsymbol{u}_t\odot \boldsymbol{h}_{t-1} + (1-\boldsymbol{u}_t)\odot \boldsymbol{\tilde{h}}_t, \end{aligned}$$
(7)

where \(\sigma \) is the sigmoid activation function, \(\boldsymbol{z}_t\) is a latent random variable which captures the stochasticity of the temporal process, \(\boldsymbol{u}_t\) and \(\boldsymbol{r}_t\) are the update and reset gates, \(\boldsymbol{W}\), \(\boldsymbol{C}\) and \(\boldsymbol{M}\) are weight matrices, \(\boldsymbol{b}\) is a bias vector, \(\boldsymbol{h}_t\) is the GRU hidden state, and \(\odot \) denotes the element-wise (Hadamard) product. Our stochastic adaptation can be seen as a generalisation of the regular GRU: setting \(\boldsymbol{C} = 0\) recovers the standard GRU cell [28].
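To make the cell concrete, the following is a minimal PyTorch sketch of the update equations (4)–(7). The framework, the class name and the packing of \(\boldsymbol{W}\), \(\boldsymbol{C}\), \(\boldsymbol{M}\) and \(\boldsymbol{b}\) into single linear layers are our own assumptions, not the authors' implementation.

```python
# Sketch of the stochastic GRU cell (4)-(7): a standard GRU update in which
# the latent variable z_t enters every gate alongside x_t and h_{t-1}.
import torch
import torch.nn as nn


class StochasticGRUCell(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim):
        super().__init__()
        # Each layer implements W.x_t + C.z_t + M.h_{t-1} + b in one affine map.
        self.lin_u = nn.Linear(x_dim + z_dim + h_dim, h_dim)   # update gate, eq. (4)
        self.lin_r = nn.Linear(x_dim + z_dim + h_dim, h_dim)   # reset gate, eq. (5)
        self.lin_hx = nn.Linear(x_dim + z_dim, h_dim)          # W_h.x_t + C_h.z_t + b_h
        self.lin_hh = nn.Linear(h_dim, h_dim, bias=False)      # M_h.h_{t-1}

    def forward(self, x_t, z_t, h_prev):
        xzh = torch.cat([x_t, z_t, h_prev], dim=-1)
        u_t = torch.sigmoid(self.lin_u(xzh))                               # eq. (4)
        r_t = torch.sigmoid(self.lin_r(xzh))                               # eq. (5)
        h_tilde = torch.tanh(self.lin_hx(torch.cat([x_t, z_t], dim=-1))
                             + r_t * self.lin_hh(h_prev))                  # eq. (6)
        return u_t * h_prev + (1.0 - u_t) * h_tilde                        # eq. (7)
```

Dropping the \(\boldsymbol{z}_t\) inputs (equivalently, zeroing the corresponding weights) reduces this cell to the standard GRU update.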

3.3 Generative Model

The role of the generative model is to establish probabilistic relationships between the target variable \(y_t\), the intermediate variables of interest (\(\boldsymbol{h}_t\), \(\boldsymbol{z}_t\)), and the input \(\boldsymbol{x}_t\). Our model uses neural networks to describe the non-linear transition and emission processes, and we preserve the architectural workings of an RNN: relevant information is encoded within the hidden states that evolve with time, and the hidden states contain all necessary information required to estimate the target variable at each time step. A graphical representation of the generative model is shown in Fig. 1a; the RNN transitions are now stochastic, facilitated by the latent random variable \(\boldsymbol{z}_t\). The joint probability distribution of the generative model can be factorised as follows:

$$\begin{aligned} p_\theta (y_{2:T},\boldsymbol{z}_{2:T},\boldsymbol{h}_{2:T}|\boldsymbol{x}_{1:T}) =\prod _{t=2}^{T}p_{\theta _1}(y_t|\boldsymbol{h}_t)p_{\theta _2}(\boldsymbol{h}_t|\boldsymbol{h}_{t-1},\boldsymbol{z}_{t},\boldsymbol{x}_t)p_{\theta _3}(\boldsymbol{z}_t|\boldsymbol{h}_{t-1}) \end{aligned}$$
(8)

where

$$\begin{aligned} p_{\theta _3}(\boldsymbol{z}_t|\boldsymbol{h}_{t-1})&=N(\boldsymbol{\mu }(\boldsymbol{h}_{t-1}),\boldsymbol{\sigma ^2}(\boldsymbol{h}_{t-1})\boldsymbol{\textit{I}})\end{aligned}$$
(9)
$$\begin{aligned} \boldsymbol{h}_t&=\textit{GRU}(\boldsymbol{h}_{t-1},\boldsymbol{z}_t,\boldsymbol{x}_t)\end{aligned}$$
(10)
$$\begin{aligned} y_t\sim p_{\theta _1}(y_t|\boldsymbol{h}_t)&=\textit{N}(\mu (\boldsymbol{h}_{t}),\sigma ^2(\boldsymbol{h}_{t})), \end{aligned}$$
(11)

where \(\textit{GRU}\) is the stochastic GRU update function given by (4)–(7). Equation (9) defines the prior distribution of \(\boldsymbol{z}_t\), which we assume to be Gaussian with a diagonal covariance matrix, parameterised by a multi-layer perceptron (MLP). When conditioning on past time series for prediction, we use (9), (10) and the last available hidden state \(\boldsymbol{h}_{last}\) to calculate \(\boldsymbol{h}_1\) for the next sequence; otherwise we initialise the hidden state to \(\boldsymbol{0}\). We refer to the collection of parameters of the generative model as \(\theta \), i.e. \(\theta = \{\theta _1,\theta _2,\theta _3\}\), and we refer to (11) as our generative distribution, which is also parameterised by an MLP.
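As a complement to the equations, the sketch below rolls the generative model forward by one step, reusing the hypothetical `StochasticGRUCell` above. The two-headed `GaussianMLP` and its sizes are illustrative assumptions, not the architecture reported in Table 1.

```python
# Sketch of one generative step (8)-(11): sample z_t from the prior (9),
# apply the stochastic GRU transition (10), then parameterise the Gaussian
# emission (11).
import torch
import torch.nn as nn


class GaussianMLP(nn.Module):
    """MLP returning the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, in_dim, out_dim, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, out_dim)
        self.logvar = nn.Linear(hidden, out_dim)

    def forward(self, inp):
        e = self.body(inp)
        return self.mu(e), self.logvar(e)


def generative_step(prior_net, gru_cell, emission_net, h_prev, x_t):
    mu_z, logvar_z = prior_net(h_prev)                   # prior p(z_t|h_{t-1}), eq. (9)
    z_t = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
    h_t = gru_cell(x_t, z_t, h_prev)                     # transition, eq. (10)
    mu_y, logvar_y = emission_net(h_t)                   # emission p(y_t|h_t), eq. (11)
    return h_t, z_t, mu_y, logvar_y
```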

Fig. 1. Proposed generative and inference models

3.4 Inference Model

We wish to maximise the marginal log-likelihood \(\log p_\theta (y_{2:T}|\boldsymbol{x}_{2:T})\); however, the random variable \(\boldsymbol{z}_{t}\) of the non-linear SSM cannot be analytically integrated out. We instead maximise the variational lower bound (ELBO) with respect to the generative model parameters \(\theta \) and the inference model parameters, which we denote \(\phi \) [36]. The variational approximation of the true posterior \(p(\boldsymbol{z}_{2:T},\boldsymbol{h}_{2:T}|y_{1:T},\boldsymbol{x}_{1:T})\) can be factorised as follows:

$$\begin{aligned} q_\phi (\boldsymbol{z}_{2:T},\boldsymbol{h}_{2:T}|y_{1:T},\boldsymbol{x}_{1:T}) = \prod _{t=2}^{T}q_\phi (\boldsymbol{z}_t|y_{1:T})q_\phi (\boldsymbol{h}_t|\boldsymbol{h}_{t-1},\boldsymbol{z}_t,\boldsymbol{x}_t) \end{aligned}$$
(12)

and

$$\begin{aligned} q_\phi (\boldsymbol{h}_t|\boldsymbol{h}_{t-1},\boldsymbol{z}_t,\boldsymbol{x}_t) = p_{\theta _2}(\boldsymbol{h}_t|\boldsymbol{h}_{t-1},\boldsymbol{z}_t,\boldsymbol{x}_t), \end{aligned}$$
(13)

where \(p_{\theta _2}\) is the same as in (8); this is because the GRU transition function is fully deterministic given \(\boldsymbol{z}_t\), and hence \(p_{\theta _2}\) is simply a delta distribution centred at the GRU output value given by (4)–(7). The graphical model of the inference network is given in Fig. 1b. Since the purpose of the inference model is to infer the filtering distribution \(q_\phi (\boldsymbol{z}_t|\boldsymbol{y}_{1:t})\), and since an RNN hidden state contains a representation of current and past inputs, we use a second GRU with hidden states \(\boldsymbol{g}_t\) as our inference model; it takes the observed target value \(y_t\) and the previous hidden state \(\boldsymbol{g}_{t-1}\) as inputs and maps \(\boldsymbol{g}_t\) to the parameters of the inferred distribution of \(\boldsymbol{z}_t\):

$$\begin{aligned} \boldsymbol{g}_t&=\textit{GRU}(\boldsymbol{g}_{t-1},\boldsymbol{y}_t)\end{aligned}$$
(14)
$$\begin{aligned} \boldsymbol{z}_t\sim q_\phi (\boldsymbol{z}_t|\boldsymbol{y}_{1:t})&=N(\boldsymbol{\mu }(\boldsymbol{g}_{t}),\boldsymbol{\sigma ^2}(\boldsymbol{g}_{t})\boldsymbol{\textit{I}}). \end{aligned}$$
(15)
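A minimal sketch of the inference network (14)–(15), again assuming PyTorch: a standard `nn.GRUCell` runs over the observed targets, and two linear heads map \(\boldsymbol{g}_t\) to the mean and log-variance of the approximate posterior. The dimensions shown are placeholders only.

```python
# Sketch of the inference model (14)-(15): a standard GRU runs over the
# observed targets y_{1:t}, and linear heads map its hidden state g_t to the
# mean and log-variance of the approximate posterior q(z_t | y_{1:t}).
import torch.nn as nn


class InferenceNet(nn.Module):
    def __init__(self, y_dim=1, g_dim=16, z_dim=4):    # placeholder sizes
        super().__init__()
        self.gru = nn.GRUCell(y_dim, g_dim)            # eq. (14)
        self.mu = nn.Linear(g_dim, z_dim)              # eq. (15), posterior mean
        self.logvar = nn.Linear(g_dim, z_dim)          # eq. (15), posterior log-variance

    def forward(self, y_t, g_prev):
        g_t = self.gru(y_t, g_prev)
        return g_t, self.mu(g_t), self.logvar(g_t)
```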

3.5 Model Training

The objective function of our stochastic RNN is the ELBO \(\textit{L}(\theta ,\phi )\) given by:

$$\begin{aligned} \textit{L}(\theta ,\phi )&=\int \int q_\phi \log \frac{p_\theta }{q_\phi }\,d\boldsymbol{z}_{2:T}\,d\boldsymbol{h}_{2:T}\nonumber \\&=\sum _{t=2}^{T}\mathbb {E}_{q_\phi } [\log p_\theta (y_t|\boldsymbol{h}_t)] - KL(q_\phi (\boldsymbol{z}_t|\boldsymbol{y}_{1:t})\,||\,p_\theta (\boldsymbol{z}_t|\boldsymbol{h}_{t-1})),\end{aligned}$$
(16)

where \(p_\theta \) and \(q_\phi \) are the generative and inference distributions given by (8) and (12) respectively. During training, we use the posterior network (15) to infer the latent variable \(\boldsymbol{z}_t\) used for reconstruction. During testing, we use the prior network (9), which is trained through the KL term of the ELBO, to predict \(\boldsymbol{z}_t\) one step ahead. We seek to optimise the ELBO jointly with respect to the decoder parameters \(\theta \) and the encoder parameters \(\phi \), i.e. we wish to find:

$$\begin{aligned} (\theta ^*,\phi ^*)=\mathop {\mathrm {arg\,max}}\limits _{\theta ,\phi }\,\textit{L}(\theta ,\phi ). \end{aligned}$$
(17)

Since we cannot back-propagate through a sampling operation, we apply the reparameterisation trick [30] and write

$$\begin{aligned} \boldsymbol{z}=\boldsymbol{\mu }+\boldsymbol{\sigma }\odot \boldsymbol{\epsilon }, \end{aligned}$$
(18)

where \(\boldsymbol{\epsilon }\sim \boldsymbol{N}(\boldsymbol{0},\boldsymbol{\textit{I}})\), so that we sample \(\boldsymbol{\epsilon }\) instead of \(\boldsymbol{z}\). The KL divergence term in (16) can be computed analytically since we assume the prior and posterior of \(\boldsymbol{z}_t\) to be normally distributed.
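Putting the pieces together, the sketch below evaluates the ELBO (16) for one sequence using the reparameterised posterior sample (18), a Gaussian log-likelihood for \(y_t\), and the closed-form KL divergence between two diagonal Gaussians. It reuses the hypothetical networks sketched above; batching and optimiser boilerplate are omitted.

```python
# Sketch of the ELBO (16) for a single sequence: reparameterised posterior
# samples (18), a Gaussian log-likelihood for y_t and the closed-form KL
# between the diagonal Gaussian posterior (15) and prior (9).
import math
import torch


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (torch.exp(logvar_q) + (mu_q - mu_p) ** 2) / torch.exp(logvar_p)
        - 1.0
    )


def elbo(y, x, prior_net, gru_cell, emission_net, inference_net, h0, g0):
    h_prev, g_prev, total = h0, g0, 0.0
    for t in range(1, y.shape[0]):                  # time steps 2..T in the paper
        # posterior q(z_t | y_{1:t}) from the inference GRU, eq. (14)-(15)
        g_prev, mu_q, logvar_q = inference_net(y[t], g_prev)
        # prior p(z_t | h_{t-1}) from the generative model, eq. (9)
        mu_p, logvar_p = prior_net(h_prev)
        # reparameterised sample, eq. (18)
        z_t = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)
        # deterministic GRU transition given z_t, eq. (10)
        h_prev = gru_cell(x[t], z_t, h_prev)
        # Gaussian log-likelihood of the observation, eq. (11)
        mu_y, logvar_y = emission_net(h_prev)
        log_lik = -0.5 * torch.sum(
            logvar_y + (y[t] - mu_y) ** 2 / torch.exp(logvar_y)
            + math.log(2.0 * math.pi)
        )
        total = total + log_lik - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return total                                    # maximise, or minimise -total
```

In practice one would take gradient steps on the negative of this quantity with an optimiser such as Adam, as described in Sect. 4.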

3.6 Model Prediction

Given the last available GRU hidden state \(\boldsymbol{h}_{last}\), prediction window \(\tau \) and covariates \(\boldsymbol{x}_{T+1:T+\tau }\), we generate predicted target values in an autoregressive manner, assuming that at every time step the hidden state of the GRU \(\boldsymbol{h}_t\) contains all relevant information up to time t. The prediction algorithm of our stochastic GRU is given by Algorithm 1.

Algorithm 1. Prediction procedure of the stochastic GRU
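The listing of Algorithm 1 is not reproduced here; the sketch below gives our reading of the procedure described above: Monte Carlo rollouts that repeatedly sample \(\boldsymbol{z}_t\) from the prior (9), update the hidden state via (10) and emit \(y_t\) via (11), with the point forecast taken as the mean over the sampled paths.

```python
# Our reading of the autoregressive prediction in Section 3.6: each Monte
# Carlo path samples z_t from the prior (9), updates the hidden state via
# (10) and records the emission mean of (11).
import torch


@torch.no_grad()
def predict(prior_net, gru_cell, emission_net, h_last, x_future, n_samples=500):
    paths = []
    for _ in range(n_samples):
        h_prev, y_path = h_last, []
        for t in range(x_future.shape[0]):              # tau prediction steps
            mu_z, logvar_z = prior_net(h_prev)          # prior over z_t
            z_t = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
            h_prev = gru_cell(x_future[t], z_t, h_prev) # stochastic transition
            mu_y, _ = emission_net(h_prev)              # predictive mean of y_t
            y_path.append(mu_y)
        paths.append(torch.stack(y_path))
    paths = torch.stack(paths)                          # (n_samples, tau, y_dim)
    return paths.mean(dim=0), paths                     # point forecast and samples
```

The spread of the returned sample paths provides the probabilistic forecast bands, while their mean gives the point forecast used for the error metrics.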

4 Experiments

We highlight the model performance on 6 publicly available datasets:

  1.

    Equity options trading price time series available from the Chicago Board Options Exchange (CBOE) datashop. This dataset describes the minute-level traded prices of an option throughout the day. We study 3 options with Microsoft and Amazon stocks as underlyings, where \(\boldsymbol{x}_t=\) underlying stock price and \(y_t=\) traded option price

  2.

    The Beijing PM2.5 multivariate dataset describes hourly PM2.5 (a type of air pollution) concentrations at the US Embassy in Beijing, and is freely available from the UCI Machine Learning Repository. The covariates we use are \(\boldsymbol{x}_t=\) temperature, pressure, cumulated wind speed, dew point, cumulated hours of rainfall and cumulated hours of snow, and \(y_t=\) PM2.5 concentration. We use data from 01/11/2014 onwards

  3.

    The Metro Interstate Traffic Volume dataset describes the hourly Interstate 94 westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. This dataset is available on the UCI Machine Learning Repository. The covariates we use in this experiment are \(\boldsymbol{x}_t=\) temperature, mm of rainfall in the hour, mm of snow in the hour, and percentage of cloud cover, and \(y_t=\) hourly traffic volume. We use data from 02/10/2012 9AM onwards

  4.

    The Hungarian Chickenpox dataset describes weekly chickenpox cases (a childhood disease) in different Hungarian counties. This dataset is also available on the UCI Machine Learning Repository. For this experiment, \(y_t=\) number of chickenpox cases in the Hungarian capital city Budapest, and \(\boldsymbol{x}_t=\) number of chickenpox cases in Pest, Bacs, Komarom and Heves, which are 4 nearby counties. We use data from 03/01/2005 onwards

We generate probabilistic forecasts using 500 Monte Carlo simulations and take the mean prediction as our point forecast when computing the error metrics. We tested numbers of simulations between 100 and 1000 and found that above 500 the differences in performance were small, while with fewer than 500 we could not obtain realistic confidence intervals for some time series. Graphical illustrations of the prediction results are provided in Fig. 2a–2f. We compare our model against an AR(1) model that assumes the prediction equals the last observed value (\(y_{T+\tau }=y_T\)), a standard LSTM model and a standard GRU model. As the performance metric, we normalise the root mean squared error (rmse) to enable comparison between time series:

$$\begin{aligned} nrmse = \frac{\sqrt{\frac{\sum _{i=1}^{N}(y_i-\hat{y}_i)^2}{N}}}{\bar{y}}, \end{aligned}$$
(19)

where \(\bar{y} = mean(y)\), \(\hat{y}_i\) is the mean predicted value of \(y_i\), and N is the prediction size.

For replication purposes, Table 1 lists (in order): the number of training, validation and conditioning steps; the (non-overlapping) sequence lengths used for training; the number of prediction steps; the dimensions of \(\boldsymbol{z}_t\), \(\boldsymbol{h}_t\) and \(\boldsymbol{g}_t\); details of the MLPs corresponding to (9) (\(\boldsymbol{z}_t\) prior) and (15) (\(\boldsymbol{z}_t\) post) in the form (n layers, n hidden units per layer); and lastly the size of the hidden states of the benchmark RNNs (LSTM and GRU). We use the Adam optimiser with a learning rate of 0.001.

In Tables 2, 3, 4 and 5 we observe that the nrmse of the stochastic GRU is lower than that of its deterministic counterpart for all datasets investigated and across all prediction steps. This shows that our proposed method better captures both the long- and short-term dynamics of the time series. In multistep time series forecasting, it is often difficult to accurately model the long-term dynamics. Our approach provides an additional degree of freedom, facilitated by the latent random variable which must be inferred using the inference network; we believe this allows the stochastic GRU to better capture the stochasticity of the time series at every time step. In Fig. 2e, for example, we observe that our model captures well the long-term cyclicity of the traffic volume, and in Fig. 2d, where the time series is much more erratic, our model still accurately predicts the general shape of the time series in the prediction period.
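For reference, a minimal implementation of the nrmse metric in (19), assuming NumPy arrays for the observed and predicted values:

```python
# Minimal implementation of the nrmse metric (19): RMSE over the prediction
# window, normalised by the mean of the observed target values.
import numpy as np


def nrmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true)
```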

Table 1. Model and training parameters
Table 2. nrmse for 30 steps-ahead options price predictions
Table 3. nrmse for 30 steps-ahead PM2.5 concentration predictions
Table 4. nrmse for 30 steps-ahead traffic volume predictions
Table 5. nrmse for 30 steps-ahead Hungarian chickenpox predictions
Table 6. nrmse of MLP benchmark and our proposed model for 30 steps-ahead forecasts
Fig. 2. Model prediction results on different datasets

To investigate the effectiveness of our temporal model, we compare our prediction errors against a model without a temporal component, constructed as a 3-layer MLP with 5 hidden nodes and ReLU activation functions. Since we use covariates in the prediction period (3), we would like to verify that our model outperforms a simple regression-type benchmark which approximates a function of the form \(y_t=f_\psi (\boldsymbol{x}_t)\); we use the MLP to parameterise \(f_\psi \). Table 6 shows that our proposed model outperforms this regression-type benchmark in all experiments, which demonstrates the effectiveness of the temporal component. It is also worth noting that our experiments use the actual values of the future covariates; in a real forecasting setting, the future covariates could themselves be outputs of other mathematical models, or they could be estimated using expert judgement.

5 Conclusion

In this paper we have presented a stochastic adaptation of the Gated Recurrent Unit which is trained with stochastic gradient variational Bayes. Our design preserves the architectural workings of an RNN, which encapsulates all relevant information in the hidden state; however, taking inspiration from the stochastic transition functions of state space models, we inject a latent random variable into the update equations of the GRU, which makes it more expressive at modelling highly variable transition dynamics than a regular RNN with deterministic transitions. We have tested the performance of our model on several publicly available datasets, and the results demonstrate the effectiveness of our design. Given that GRUs are now popular building blocks for much more complex deep architectures, we believe that our stochastic GRU could prove useful as an improved component to be integrated into sophisticated deep learning models for sequential modelling.