
1 Introduction

One of the most important tasks for a financial institution is to monitor the volatility of its portfolio and other market variables. However, there are many different ways to quantify this latent and unobservable variable, such as historical volatility (HV, a.k.a. close-to-close, the standard deviation of log-returns over a time window) [27]Footnote 1, realised volatility (RV, the square root of the sum of squared log-returns over a time window) [1], implied volatility (IV, backwards calculated from options prices via an option pricing model, such as Black-Scholes) [20], and many more [26]. Because volatility is a key factor in security valuation, risk management, and options pricing, as well as affecting investment choice and valuation of public and corporate liabilities, sophisticated computational models are studied for financial volatility forecasting to support practitioners’ judgment and decision-making in quantitative finance [6, 19, 27, 30]. In the 2020s, such computer-assisted forecasting methods are dominated by Generalised Auto-Regressive Conditional Heteroscedasticity (GARCH) models and relatively simple Neural Networks (NN), leaving much of Machine Learning (ML) and Deep Learning (DL) unexplored [10].
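To make the first two definitions concrete, the following minimal Python sketch (illustrative only, not the released code; the 22-day window is an arbitrary choice standing in for one trading month) computes HV and RV from a series of daily closing prices.

```python
# Illustrative sketch (not the released code): computing historical volatility (HV)
# and realised volatility (RV) from daily closing prices; the 22-day window is an
# arbitrary choice standing in for "one trading month".
import numpy as np
import pandas as pd

def log_returns(prices: pd.Series) -> pd.Series:
    """Log-returns in percent, r_t = log(P_t / P_{t-1}) * 100."""
    return np.log(prices / prices.shift(1)).dropna() * 100

def historical_volatility(prices: pd.Series, window: int = 22) -> pd.Series:
    """HV: standard deviation of log-returns over a rolling window."""
    return log_returns(prices).rolling(window).std()

def realised_volatility(prices: pd.Series, window: int = 22) -> pd.Series:
    """RV: square root of the sum of squared log-returns over a rolling window."""
    r = log_returns(prices)
    return np.sqrt((r ** 2).rolling(window).sum())
```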

Hence, this multidisciplinary paper will exemplify the power of ML/DL in forecasting financial volatility to practitioners in quantitative finance. We will compare the financial forecasting ability of a range of methods, proceeding from simpler or shallower models (i.e., the GARCH models and Multi-Layer Perceptrons (MLP)) to deeper and more complex NNs (i.e., Recurrent NNs (RNN), Convolutional NNs (CNN), Temporal Convolutional Networks (TCN), and the Temporal Fusion Transformer (TFT)). These performance evaluations and statistical analyses on five assets (i.e., S&P500, NASDAQ100, gold, silver, and oil), together with the release of our Python codeFootnote 2 under the MIT license, should encourage practitioners to apply DL as a way to reduce error in forecasting financial volatility.

2 Related Work

A convenient property of financial price data is captured by the efficient market hypothesis, which stipulates that all publicly available information is reflected in the market prices of assets at a given time point [30]. At the finest resolution, market prices are a list of the prices of all the buy and sell orders that were matched. These can be aggregated over time (e.g., 1-hour or 1-day intervals) to create coarser-grained views, and each interval can be further described by its highest, lowest, opening, and closing price, as well as the total number of assets traded, known as volume; however, introducing additional data tends to be helpful in this predictive modeling task [21, 35]. In addition to the definition of volatility and the financial price data, the volatility forecasting model should consider the time period for which the data is useful: if the goal were to forecast the next 15 minutes, using data from the previous 50 years might be wasteful, but with only 1 week, information from past market regimes that could repeat might be missed. Moreover, the amount of information provided at inference time is important, as it affects the computation time and may dilute the useful information; when inferring the volatility of the next 30 days, all data could be useful, but the most recent entries are likely to carry more insight than earlier ones. Finally, the timing of the data and the window of time that the volatility captures must also be considered, keeping in mind that the further into the future we are trying to forecast, the more uncertain any forecast will be. Although this part of modeling should depend on the reason for forecasting volatilityFootnote 3, asking the model to forecast volatility over a wide range of time frames may be beneficialFootnote 4.

Of the many types of models that can be used to understand and forecast volatility, none are as widespread as the auto-regressive (AR) models: the seminal Auto-Regressive Conditional Heteroscedasticity (ARCH) model conditions future volatility on previous observations [7], and its adaptation as the well-known GARCH model adds an Auto-Regressive Moving Average (ARMA) component [2]. Since these models were introduced in the 1980s, there have been many advancements that attempt to address their inability to capture several stylized facts of volatility [8]Footnote 5. Despite the countless variants of the GARCH model, several experiments have found that the simple ARCH(1) and GARCH(1, 1) forecasting models perform the best [11, 24].
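As a point of reference for readers less familiar with these models, a GARCH(1, 1) fit-and-forecast step might look as follows; this hedged sketch uses the open-source `arch` package and synthetic returns, and is not necessarily the implementation used in the experiments reported later.

```python
# Hedged sketch of a GARCH(1, 1) fit-and-forecast step using the open-source
# `arch` package and synthetic returns; not necessarily the implementation used here.
import numpy as np
import pandas as pd
from arch import arch_model

returns = pd.Series(np.random.normal(0.0, 1.0, 1000))  # placeholder for real log-returns (%)

model = arch_model(returns, mean="Constant", vol="GARCH", p=1, q=1)
result = model.fit(disp="off")

# One-step-ahead conditional variance; its square root is the volatility forecast.
forecast = result.forecast(horizon=1)
next_day_volatility = np.sqrt(forecast.variance.iloc[-1, 0])
```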

ML and DL models have also shown much success and are rising in popularity [4, 5, 12]. NN-based models are commonly used, and although they do not have the same theoretical underpinnings as the GARCH models, they are flexible, possessing the ability to learn an arbitrary mapping f from input \({\textbf {X}}\) to output y; \(y = f({\textbf {X}})\). In the context of time series analysis, a Nonlinear Auto-Regressive (NAR) framework is often adopted with the MLP, enforcing an AR property on the nonlinear mapping (e.g., \(\widehat{y}_{t+1} = f([y_t,y_{t-1},...,y_{t-m}]^T)\) with t referring to a given time point) [15]. This can be extended into a NARX framework by including exogenous variables (such as those derived from several indices, exchange rates, and outputs of GARCH models), thus providing more information to the model [3], which has been beneficial for forecasting performance [14]. Other NN architectures (e.g., RNNs, CNNs, and Long Short-Term Memory (LSTM) models) have also been used in volatility forecasting. For instance, LSTM and GARCH models have been combined to forecast HV [13], and gold prices have been converted into a 3-channel RGB image and then processed with a pre-trained VGG16 (a well-known and high-performing CNN model) [32].
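For illustration, the NAR framing above can be reproduced with a few lines of scikit-learn; this is a hedged sketch with an arbitrary lag length and layer sizes, not the MLP configuration used in this paper.

```python
# Illustrative sketch of the NAR framing with an MLP: the lagged window
# [y_t, ..., y_{t-m}] is the input and y_{t+1} is the target. Lag length and
# layer sizes are arbitrary, not the configuration used in this paper.
import numpy as np
from sklearn.neural_network import MLPRegressor

def lagged_matrix(y: np.ndarray, m: int):
    """Rows of X hold m+1 consecutive past values; targets are the next value."""
    X = np.stack([y[i : i + m + 1] for i in range(len(y) - m - 1)])
    targets = y[m + 1 :]
    return X, targets

y = np.random.rand(500)                      # placeholder for a volatility series
X, targets = lagged_matrix(y, m=21)
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000).fit(X, targets)
next_value = mlp.predict(X[-1:])             # one-step-ahead forecast
```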

Whilst RNNs, LSTMs, and CNNs are deep models, they are not considered the state-of-the-art (SOTA) for time series processing in DL, and the models used in financial volatility forecasting tend to be even shallower and simpler, a distinct gap highlighted in a recent systematic literature review [10]. The SOTA label is reserved for recent, much deeper and more complex models, often adapted from other fields, such as TCNs, which have been successful in music generation, speech enhancement, and many other areas involving time series [16, 23, 25]. The TCN is a CNN adaptation consisting of 1-dimensional convolutional blocks structured so that the temporal ordering of data is not violated (i.e., only past data can be seen when forecasting), known as a causal convolution [23]. In conjunction with a progressively increasing dilation size, the receptive field can grow exponentially with the number of layers, thus allowing the exploitation of long-term relationships. These blocks also often use residual connections, layer normalization, gradient clipping, and dropout, all of which have been shown to improve learning and performance [34]. Another recently developed SOTA model that handles sequential data well is the Transformer [31]. Its TFT variant deploys a gating mechanism to skip unused components of the network, variable selection networks to select relevant input variables at each time step, static covariate encoders to provide context to the model, temporal processing to learn long- and short-term relationships, and quantile predictions to forecast with a corresponding confidence [17].
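The causal, dilated convolution at the heart of the TCN can be sketched as follows; this is an illustrative PyTorch fragment (channel sizes, kernel size, and dropout rate are arbitrary), not the exact blocks used in the experiments.

```python
# Illustrative PyTorch sketch of a causal, dilated convolutional block in the
# style of a TCN; channel sizes, kernel size, and dropout rate are arbitrary and
# this is not the exact architecture used in the experiments.
import torch
from torch import nn

class CausalConvBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1, dropout: float = 0.1):
        super().__init__()
        # Left-padding of (kernel_size - 1) * dilation keeps the convolution causal:
        # the output at time t depends only on inputs at times <= t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, channels, time)
        out = nn.functional.pad(x, (self.left_pad, 0))      # pad the time axis on the left only
        out = self.dropout(torch.relu(self.conv(out)))
        return out + x                                      # residual connection

# Stacking blocks with dilations 1, 2, 4, ... grows the receptive field exponentially.
tcn = nn.Sequential(*[CausalConvBlock(channels=16, dilation=2 ** i) for i in range(4)])
y = tcn(torch.randn(8, 16, 64))                             # output keeps shape (8, 16, 64)
```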

3 Experimental Comparison of Forecasting Models

Our experimental study of forecasting models will next exemplify, through comparative performance evaluation and statistical significance testing, the power of DL in forecasting financial volatility. The volatility of five assets will be forecast with the SOTA methods; simpler or shallower DL models; and recent deeper and more complex models. The results will indicate that in almost all cases, DL models forecast volatility with less error than the SOTA models in financial volatility research. These experiments will be repeated to give evidence that the differences between competing models are statistically significant, thereby encouraging their use in practice and further study as a shared task.

Table 1. Description of data

3.1 Posing the Problem as a Shared Task

Volatility was forecast for five assets: S&P500, NASDAQ-100 (NDX), gold, silver, and oil. The data for each, as well as the corresponding volatility indices, were retrieved from Global Financial DataFootnote 6 (Table 1). The proper permissions to use the data for the purposes of this study and its reporting were obtained from Global Financial DataFootnote 7. The data consisted of the daily closing prices, as well as the open, high, and low prices for S&P500, NDX, gold, and oil; volume was available only for S&P500 and NDX. Each asset was restricted to a starting date corresponding to when its volatility index was introduced, except for S&P500 and NDX: the volatility index for S&P500 was originally based on the S&P100 and only changed on 22 September 2003, and the volatility index for NDX began earlier than one of the exogenous variables used. Additionally, the ending date was restricted to 31 December 2018.

Exogenous variables were also retrieved from Global Financial Data, consisting of several other indices (SZSE, BSE SENSEX, FTSE100, and DJIA), exchange rates (US-YEN, US-EURO, and the US dollar trade weighted index), and United States fundamentals (Federal Reserve primary credit rate, mean and median duration of unemployment, consumer price index inflation rate, Government debt per Gross Domestic Product (GDP), gross Federal debt, and currency in circulation). All variables were date matched with the underlying assets by bringing forward the nearest historical value.
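A possible implementation of this date matching (a sketch, not the released code, with illustrative column names) is the following pandas fragment, which carries forward the most recent past value of each exogenous series onto the asset's trading dates.

```python
# Possible implementation of the date matching (a sketch, not the released code):
# each exogenous series is aligned to the asset's trading dates by carrying
# forward the most recent past observation. Column names are illustrative.
import pandas as pd

def align_to_asset(asset: pd.DataFrame, exog: pd.DataFrame) -> pd.DataFrame:
    """Both frames carry a 'date' column; returns exog matched to the asset's dates."""
    asset = asset.sort_values("date")
    exog = exog.sort_values("date")
    # direction="backward" picks the nearest exogenous value on or before each asset date.
    return pd.merge_asof(asset, exog, on="date", direction="backward")

asset = pd.DataFrame({"date": pd.to_datetime(["2018-01-02", "2018-01-03"]), "close": [100.0, 101.5]})
cpi = pd.DataFrame({"date": pd.to_datetime(["2017-12-01"]), "cpi": [246.5]})
merged = align_to_asset(asset, cpi)   # the 2017-12-01 CPI value is carried forward to both dates
```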

The task was to forecast the month-long HV and IV (Fig. 1), starting from 1 day ahead, for S&P500, NDX, gold, silver, and oil. The ground truth for HV was the standard deviation of log returns starting from 1 trading day ahead to 22 trading days ahead (= one calendar month). In other words, with t referring to the current time, we defined HV over a certain period \([\tau _1, \, \tau _2 ] = [t+1, \, t+22]\) as the standard deviation (\(\text {std}(\cdot )\)) of log-returns as follows:

$$\begin{aligned} \text {HV} = \sqrt{\frac{1}{N} \sum _{t=\tau _1}^{\tau _2}{\left( r_t - \frac{1}{N} \sum _{t=\tau _1}^{\tau _2}{r_t} \right) ^2}} = \text {std} \left( \begin{bmatrix} r_{\tau _2} \\ r_{\tau _2 - 1} \\ r_{\tau _2 - 2} \\ \vdots \\ r_{\tau _1} \end{bmatrix} \right) \end{aligned}$$
(1)

where \(N = \tau _2 - \tau _1 = 21\) is the number of samples between the time steps, \(P_t\) is the price at time t, and \(r_t = \log (P_t/P_{t-1}) \cdot 100\). For IV, this meant the ground truth was simply the value of the volatility index for the next trading day, as the volatility index was already defined for the next calendar month. The values of the volatility indices were also adjusted by a factor of \(1/\sqrt{252}\), de-annualizing them to be on the same scale as HV.
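An assumed implementation of these targets (a sketch consistent with Eq. (1) and the de-annualization above, not necessarily the released code) is given below.

```python
# Assumed implementation of the targets (a sketch consistent with Eq. (1) and the
# de-annualization above, not necessarily the released code).
import numpy as np
import pandas as pd

def hv_target(prices: pd.Series) -> pd.Series:
    """For each day t, the std of log-returns over trading days t+1 ... t+22 (Eq. 1)."""
    r = np.log(prices / prices.shift(1)) * 100
    # The 22-day rolling std at day t+22 covers r_{t+1}, ..., r_{t+22};
    # shifting it back by 22 places that value at day t.
    return r.rolling(22).std().shift(-22)

def iv_target(vol_index: pd.Series) -> pd.Series:
    """Next trading day's volatility index value, de-annualized to the HV scale."""
    return vol_index.shift(-1) / np.sqrt(252)
```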

Fig. 1. Ground truth and naïve forecasts for HV (a) and IV (b).

3.2 Methods

Five methods were used to represent the SOTA financial volatility forecasting performance; their combination will be the benchmark for comparison. These five methods were: a naïve modelFootnote 8, a GARCH model, an MLP model, and two models from the literature: ANN-GARCH [14] and CNN-LSTM [32].

Two models from DL were investigated to represent the experimental forecasting performance. The first was the TCN, both unmodified and with several modifications. The first modification was to leverage the naïve model and forecast a residual, defined as either the difference or the log difference. Another modification was to add multiple tasks to the network, introducing another loss function that has a separate but related training effect [28]. The additional task was to predict either the direction of the forecast (up or down) or the change in direction (change or no change). The final modification was to include additional input channels, introducing new information to the network [33], such as descriptors of the underlying asset (log returns, naïve forecasts for HV and IV, and current direction of movement) and variables that describe the market (US dollar trade-weighted index, Federal Reserve primary credit rate, mean and median duration of unemployment, consumer price index inflation rate, Government debt per GDP, gross Federal debt, and currency in circulation), as there is literature to suggest that this may improve performance [14]. The second DL model explored was the TFT. Additional variables were also included: descriptors of the asset (open, high, low, close, volume where possible, log returns, squared log returns, inverse price of the underlying asset, and naïve forecasts for both HV and IV), descriptors of the market (the US dollar trade-weighted index, Federal Reserve primary credit rate, mean and median duration of unemployment, consumer price index inflation rate, Government debt per GDP, gross Federal debt, and currency in circulation), and descriptors of time (day of the week, month, and the number of days since the previous observation).
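To illustrate the residual and multi-task modifications, the following hedged PyTorch sketch combines an MSE loss on the residual with respect to the naïve forecast with a binary cross-entropy loss on the direction of movement; the loss weight and the choice of the plain difference as the residual are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the residual and multi-task modifications: an MSE loss on the
# residual with respect to the naive forecast plus a binary cross-entropy loss on
# the direction of movement. The loss weight and the choice of the plain difference
# as the residual are illustrative assumptions, not the paper's exact configuration.
import torch
from torch import nn

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def multi_task_loss(residual_pred: torch.Tensor,
                    direction_logit: torch.Tensor,
                    target: torch.Tensor,
                    naive: torch.Tensor,
                    aux_weight: float = 0.2) -> torch.Tensor:
    residual_true = target - naive               # residual the network is asked to forecast
    direction_true = (target > naive).float()    # secondary task: did volatility move up?
    return mse(residual_pred, residual_true) + aux_weight * bce(direction_logit, direction_true)
```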

To engineer and evaluate the forecasting models using these five methods, a 70-15-15 train-validation-test split of the data was used because it did not violate the temporal ordering of the dataFootnote 9. All performances were quantified with the Mean Squared Error (MSE), with statistical significance testing to determine whether competing models were statistically significantly different from each other. After the hyperparameters of a model were chosen and the testing phase was completed, the model was reinitialized with a random seed, re-trained, and re-tested. This was repeated until ten MSE values were obtained for each model. These values were then compared across models in a pair-wise fashion to determine if they were from the same distributionFootnote 10. The Shapiro-Wilk (SW) test was first applied to assess the normality of each distribution with significance level \(\alpha \) = 0.05Footnote 11. If both distributions were normal, then Student's t-test was used; otherwise, the Kruskal-Wallis (KW) test was employed.
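The pair-wise testing procedure can be sketched with SciPy as follows; this is an illustrative helper, not the released evaluation code.

```python
# Sketch of the pair-wise testing procedure with SciPy (an illustrative helper,
# not the released evaluation code): Shapiro-Wilk for normality, then Student's
# t-test if both samples look normal, otherwise Kruskal-Wallis.
from scipy import stats

def compare_models(mse_a, mse_b, alpha: float = 0.05) -> float:
    """Return the p-value of the appropriate test for two samples of MSE values."""
    normal_a = stats.shapiro(mse_a).pvalue > alpha
    normal_b = stats.shapiro(mse_b).pvalue > alpha
    if normal_a and normal_b:
        return stats.ttest_ind(mse_a, mse_b).pvalue
    return stats.kruskal(mse_a, mse_b).pvalue
```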

Table 2. Smallest error models (in bold), and models for which no statistically significant difference could be found from the smallest error models.
Table 3. Performance (MSE) of forecasting models on the test set. Smaller is better, bold is the best. HS refers to Hyperparameter Search.

3.3 Result Evaluation and Analysis

A comparison of the benchmark models and experimental models gave evidence of a clear trend: across almost all volatility forecasting tasks and assets investigated, the experimental models outperformed the benchmark models with statistical significance. Based on 10 repetitions, for almost all assets and tasks, the experimental models produced smaller errors, and the differences were found to be statistically significant (Table 2).

Of the benchmark models, the ANN-GARCH model from the literature performed best overall in forecasting HV, but for IV forecasting, the naïve model performed best overall, achieving the smallest errors for all five assets (Table 3). However, a comparison of the traditional grid search hyperparameter optimization method against the more recent Bayesian Optimisation HyperBand (BOHB) search [9] indicated no clear trend; BOHB only produced a better forecasting model for the HV of oil and the IV of gold and silver. Both methods were given roughly the same wall time and were both tested using the unmodified TCN. Although it is difficult to say whether one method is superior to the other, the continued use of grid search is justified, and it was the primary hyperparameter optimization method for the remaining experimental TCN models.

Of the experimental models, the TFT performed best overall for HV forecasting, achieving the smallest errors for S&P500, NDX, and oil (Table 3). An encoder length of 21 days was optimal for all assets, with no set of input variables being consistently best. S&P500 and NDX performed best with the addition of variables that describe time and the underlying asset, gold and silver performed best with the addition of variables that describe time, and oil performed best with the addition of market and time descriptors. The inclusion of exogenous variables only increased performance for forecasting gold HV. The smallest error for gold was achieved by a benchmark model, specifically the ANN-GARCH.

The TCN variants were the best-performing experimental models for IV forecasting, achieving the smallest errors for all assets (Table 3). The optimal modification, consistent across all assets, was to use a secondary task of predicting the direction, as well as forecasting the residuals. S&P500, NDX, and gold also benefited from the inclusion of the volatility index value and the previous direction of movements. For the TFT model, an encoder length of 10 days was preferred for all assets, except for S&P500, which preferred a length of 126.

4 Discussion

This experimental study exemplified the value of DL in forecasting financial volatility and expedited further progress in such DL applications by releasing open-source software and proposing a shared task. It created a benchmark of experimental evaluation results that consisted of the SOTA in NN-based financial volatility forecasting, several traditional models, and a naïve baseline model. This was then compared to several DL methods, representing the competing experimental models.

These results, however, come with some limitations. The main limitation is that the implementations of several models (GARCH and TFT) were external open-source packages and thus not necessarily under the same strict control as the other models used.

This study differs from prior publications by presenting a multidisciplinary approach to DL experimentation in forecasting financial volatility. While some other studies on financial volatility forecasting exist, they tend to be limited to literature reviews [10, 27, 29] or expert systems in economics [5, 14]. Our results imply that DL may offer better volatility forecasting performance than traditional methods, and hence, our code release and proposed shared task should expedite future work. The most obvious direction is an investigation into other DL models that have not yet been used for volatility forecasting. Another avenue, which would combine well with the larger capacity of deeper models, is to make use of multi-modal data (e.g., extending from numeric data to text [18, 22]).

Moving forward, the most vital work is not further exploration of DL models and methods, but rather the establishment of the proposed shared task, which could include, for example, sharing of relevant resources (e.g., code to train models and/or the resulting trained models) and tracks for studying models on a given data modality or expanding them across modalities. This would allow easy and direct comparisons without the need to implement competing models, enabling the synthesis of publications and propelling the field of financial volatility forecasting further and faster. The task should also help gain a deeper understanding of the factors and mechanisms that may affect the economic feasibility of a statistical result. In conclusion, harnessing the diversity of thought and other community effects is likely to accelerate the knowledge discovery and methodological innovations required to proceed from statistical significance to economic impact.