1 Introduction

Financial time series prediction is an important and challenging task in empirical finance. The data generation process of these series is complex because of its chaotic, noisy, non-stationary and nonlinear nature (Cao and Tay 2001). The use of support vector regression (SVR) in financial forecasting has therefore been proposed in the literature, because it makes no assumptions about the distribution of the data, is a purely data-driven technique, is very flexible, has shown high forecasting accuracy, and has been reported to outperform artificial neural networks and traditional statistical methods both theoretically and empirically (Sapankevych and Sankar 2009; Cavalcante et al. 2016). Moreover, SVR is a machine learning technique based on statistical learning theory that implements the Structural Risk Minimization principle, which results in better generalization performance (Cao and Tay 2003).

Volatility is a measure of the degree of fluctuation of financial returns and a proxy for risk; it is therefore a key variable in risk management, asset pricing and portfolio selection (Brownlees and Gallo 2009). Linear and nonlinear parametric generalized autoregressive conditional heteroscedasticity (GARCH) models make assumptions about the functional form of the data generating process and the error distribution. In addition, empirical studies provide evidence that GARCH has low forecasting performance (Jorion 1995; Brailsford and Faff 1996; Mcmillan and Speight 2000; Choudhry and Wu 2008). Modifications have therefore been proposed to improve its forecast accuracy, such as changes in model specification and estimation, the use of different proxies for volatility, and changes in forecast evaluation metrics (Chen et al. 2010). To overcome these limitations, volatility forecasting models based on SVR have been proposed in the literature, because SVR is able to capture non-linear characteristics of financial time series such as volatility clustering, leptokurtosis and the leverage effect, without any assumptions about the distributional properties of the data. As shown by Fernando et al. (2003), Chen et al. (2010), Li (2014) and Santamaría-Bonfil et al. (2015), SVR produces better volatility forecasts than GARCH models, owing to its ability to capture the dynamic and nonlinear behavior of financial time series.

The volatility of financial asset returns changes over time owing to capital market behavior. Empirical evidence shows that the financial market oscillates between several regimes, so that the overall distribution of returns is a mixture of normals (Levy and Kaplanski 2015). In general, studies report the existence of two regimes (one with high and the other with low volatility) for the distribution of stock returns in the equity market. However, markets can have more than two states (Bae et al. 2014), in which case a mixture of more than two normal distributions is needed to model the regime-switching behavior (Guidolin 2011). As SVR is a kernel-based methodology, its forecasting performance depends strongly on the choice of kernel function. To improve the learning and generalization ability of SVR and take advantage of different kernel functions, hybrid kernels can be constructed via linear or non-linear combinations of kernels (Huang et al. 2014). Empirical evidence shows that a hybrid kernel yields higher forecasting accuracy than SVR with a single kernel (Huang et al. 2014). However, to the best of our knowledge, no research on volatility forecasting via SVR has used a hybrid kernel. In this context, the main purpose of this article is to use a mixture of Gaussian kernels in the SVR based on GARCH (1,1) (hereafter SVR–GARCH) in order to improve the prediction accuracy and to model (i) market regimes and (ii) stylized characteristics of financial returns such as high kurtosis, heavy tails and volatility clustering.

The accuracy of SVR–GARCH with a linear combination of one, two, three and four Gaussian kernels in one-period-ahead volatility forecasting is compared with that of the SVR–GARCH with a Morlet wavelet kernel and of the GARCH, EGARCH and GJR models, in terms of two evaluation metrics: the mean absolute error (MAE) and the root mean squared error (RMSE). For each GARCH-type model, four different distributions for the innovations are considered: the Normal, the Student’s t, the skewed Student’s t and the GED.

The remainder of this paper is organized as follows. Section 2 provides a brief explanation of the Support Vector Machine (SVM) for regression. Section 3 explains the use of mixture of Gaussian kernels in the SVR–GARCH. Section 4 describes the empirical modeling. Section 5 shows the empirical results of the proposed model on daily financial returns of Nikkei 225 and Ibovespa indexes. Section 6 provides the concluding remarks of this paper.

2 Support vector regression

The support vector machine (SVM) is a machine learning algorithm based on the statistical learning theory developed by Vapnik and Chervonenkis (1974). SVM for linear and non-linear regression is called support vector regression (SVR) (Smola and Schölkopf 2004). The non-linear SVR can be written as follows. Consider a set of training data \(\{(x_1,y_1),\ldots ,(x_n,y_n)\}\), where \(x_i\in \mathcal {X} \subseteq \mathbb {R}^n\) is the input vector and \(y_i\in Y \subseteq \mathbb {R}\) is the output scalar. The goal of SVR is to find a function f(x) that approximates the output scalar \(y_i\) to within a given forecast error. To achieve this goal, the SVR nonlinearly maps the input space (\(\mathbb {R}^n\)) into a higher-dimensional feature space (\(\mathcal {F}\)), where the non-linear relations of the input space are approximated by a linear regression in the feature space (Vapnik 1995):

$$\begin{aligned} f(x)=w^T\phi (x)+b, \quad \text{ with } \; \phi :\mathbb {R}^n \rightarrow \mathcal {F},w \in \mathcal {F} \end{aligned}$$
(1)

where w is the weight vector, b is the bias term and \(\phi ({.})\) is the nonlinear mapping function, which projects the input vector into a higher-dimensional feature space, where the linear regression is defined. Vapnik (1995) introduced the \(\epsilon \)-insensitive loss function (\(L_\epsilon \)) to measure the difference between the actual and predicted values. The goal of \(\epsilon \)-SVR is to find a function that deviates from \(y_i\) by at most \(\epsilon \). The vector w and the constant b can be found by minimizing the following regularized function R(C) (Vapnik 1995):

$$\begin{aligned} {\textit{Minimize}}: R(C)=\dfrac{1}{2}\Vert w\Vert ^2+\dfrac{C}{n}\sum _{i=1}^{n}L_\epsilon (y_i, f(x_i)), \end{aligned}$$
(2)

where:

$$\begin{aligned} L_\epsilon (y_i, f(x_i))= {\left\{ \begin{array}{ll} |y_i-f(x_i)|-\epsilon , &{} \quad \text{ if } \; |y_i-f(x_i)|>\epsilon \\ 0, &{} \quad \text{ otherwise } \end{array}\right. }, \epsilon \ge 0 \end{aligned}$$
(3)

is the \(\epsilon \)-insensitive loss function (\(L_\epsilon \)). Only the observations on or outside the \(\epsilon \)-insensitive zone serve as support vectors to construct the decision function (\(f(\mathbf x )\)). To account for the errors outside the \(\epsilon \)-insensitive zone, slack variables (\(\xi _i,\xi _i^*\), \(i=1,2,\ldots ,n\)) are introduced (Smola and Schölkopf 2004). Then, the primal problem of SVR is given by:

$$\begin{aligned} {\textit{Minimize}}: \dfrac{1}{2}\Vert w\Vert ^2+ C \sum _{i=1}^{n}(\xi _i+\xi _i^*), \end{aligned}$$
(4)
$$\begin{aligned} \text{ subject } \text{ to }\ {\left\{ \begin{array}{ll} y_i-w^T\phi (x_i)-b\le \epsilon +\xi _i,\\ w^T\phi (x_i)+b-y_i\le \epsilon +\xi _i^*,\\ \xi _i,\xi _i^*\ge 0, \quad i=1,\ldots ,n \end{array}\right. } \end{aligned}$$
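
As an illustration of Eqs. 2 and 3, the following minimal sketch evaluates the \(\epsilon \)-insensitive loss and the regularized function for given predictions. The function names and the toy numbers are assumptions made for this example only; they are not part of the estimation procedure used in the paper.

```python
def eps_insensitive_loss(y, f_x, eps):
    """Eq. 3: zero inside the tube of half-width eps, linear outside it."""
    return max(abs(y - f_x) - eps, 0.0)


def regularized_risk(w, targets, predictions, C, eps):
    """Eq. 2: 0.5 * ||w||^2 + (C / n) * sum of eps-insensitive losses."""
    n = len(targets)
    complexity = 0.5 * sum(w_j * w_j for w_j in w)
    empirical = sum(eps_insensitive_loss(y, f, eps)
                    for y, f in zip(targets, predictions))
    return complexity + (C / n) * empirical


# Toy values (hypothetical): only the second point lies outside the tube.
targets = [1.0, 2.0, 3.0]
predictions = [1.05, 2.5, 3.0]
losses = [eps_insensitive_loss(y, f, eps=0.1)
          for y, f in zip(targets, predictions)]
risk = regularized_risk([0.5, -0.5], targets, predictions, C=3.0, eps=0.1)
```

In this toy case the first and third observations contribute no loss, because their errors are smaller than \(\epsilon \); only such out-of-tube observations receive positive slack in the primal problem.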

Because the primal problem above is a convex quadratic program with linear constraints, the SVR always attains the global optimum. The term \(\dfrac{1}{2}\Vert w\Vert ^2\) characterizes the model complexity. The parameter C denotes the trade-off between the function complexity and the training error \(\sum _{i=1}^{n}(\xi _i+\xi _i^*)\) (Sermpinis et al. 2014). If the value of C is large, the algorithm tends to overfit the data and has lower generalization ability. The parameter \(\epsilon \) controls the width of the \(\epsilon \)-insensitive zone: the higher \(\epsilon \) is, the fewer support vectors are selected (Cherkassky and Ma 2004). The parameters C and \(\epsilon \) are the SVR meta-parameters and, in general, are determined by cross-validation (Haykin 1998). To solve the problem in Eq. 4, it is possible to use the Lagrange multipliers (\(\alpha _i\) and \(\alpha _i^*\)) and the Karush–Kuhn–Tucker conditions (Karush 1939; Kuhn and Tucker 1951), transforming the problem into its dual form (Vapnik 1995):

$$\begin{aligned} \text{ Maximize }:\mathcal {L}&=-\dfrac{1}{2}\sum _{i=1}^{n}\sum _{j=1}^{n}(\alpha _i-\alpha _i^*)(\alpha _j-\alpha _j^*)\langle \phi ({x}_i),\phi ({x}_j)\rangle \nonumber \\&\quad \,+\;\sum _{i=1}^{n}y_i(\alpha _i-\alpha _i^*) -\epsilon \sum _{i=1}^{n}(\alpha _i+\alpha _i^*) \end{aligned}$$
(5)
$$\begin{aligned} \text{ subject } \text{ to }\ {\left\{ \begin{array}{ll} \sum _{i=1}^{n}(\alpha _i^*-\alpha _i)=0, \\ 0\le \alpha _i \le C, i=1,\ldots ,n \\ 0\le \alpha _i^* \le C, i=1,\ldots ,n \end{array}\right. } \end{aligned}$$

Once the Lagrange multipliers have been calculated, we obtain the support vector expansion: \(w=\sum _{i=1}^{n}(\alpha _i-\alpha _i^*)\phi ({x}_i)\). From the solution of the dual problem, the \(\epsilon \)-SVR function can be written as (Vapnik 1995):

$$\begin{aligned} f(x)=\sum _{i=1}^{n}(\alpha _i-\alpha _i^*)\langle \phi ({x}_i),\phi (x) \rangle +b \end{aligned}$$
(6)

where \(\langle \phi (x_i),\phi (x) \rangle \) is the dot product in the feature space \(\mathcal {F}\). To avoid the complexity of computing \(\phi (.)\), we can substitute the dot product by a kernel function:

$$\begin{aligned} f(x)=\sum _{i=1}^{n}(\alpha _i-\alpha _i^*) K(x_i,x)+b^* \end{aligned}$$
(7)

The kernel function \(K(x,x')=\langle \phi (x),\phi (x')\rangle \) is critical to the forecasting performance of the SVR. Any function that satisfies Mercer’s theorem (Mercer 1909) is an admissible kernel. So far there is no analytical method for choosing the most appropriate kernel for a given problem (Sangeetha and Kalpana 2010). Before estimating w and b with SVR, it is necessary to choose the regularization parameter C, the loss function parameter \(\epsilon \) and the parameters of the chosen kernel (Sermpinis et al. 2014).
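
To make the kernel expansion of Eq. 7 concrete, the sketch below evaluates f(x) from a set of support vectors, the differences \(\alpha _i-\alpha _i^*\) and the constant b, using a Gaussian kernel. The numerical values are hypothetical and serve only to exercise the formula; in practice they would come from solving the dual problem.

```python
import math


def gaussian_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2) for equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, dual_coefs, b, kernel):
    """Eq. 7: f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i^*.
    """
    return b + sum(c * kernel(sv, x)
                   for sv, c in zip(support_vectors, dual_coefs))


# Hypothetical fitted quantities, chosen only to exercise the formula.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.25]
b = 0.1


def kernel(u, v):
    return gaussian_kernel(u, v, gamma=1.0)


prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b, kernel)
```

Note that the mapping \(\phi \) never appears in the code: only kernel evaluations between the new point and the support vectors are required.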

3 SVR–GARCH with a mixture of Gaussian kernels

In empirical finance, the normal distribution is the most convenient distributional assumption for asset returns (Wirjanto and Xu 2009). Nevertheless, empirical studies show that the distribution of returns departs from the shape of the normal distribution because of its substantial leptokurtosis (fat tails) and skewness (asymmetry) (Wang and Taaffe 2015). In addition, the stock market oscillates between regimes (or states) (Tu 2010). To explain these facts, it is more appropriate to use a mixture of two or more normal distributions (Wang and Taaffe 2015), because such mixtures are more flexible and can model complex phenomena (Marron and Wand 1992; McLachlan and Peel 2004).

Conditional volatility models such as the Autoregressive Conditional Heteroscedasticity (ARCH) (Engle 1982) and Generalized ARCH (GARCH) (Bollerslev 1986) can capture volatility clustering and time-varying volatility. However, empirical evidence shows that these models with Gaussian or heavier-tailed innovations cannot model the full extent of skewness and kurtosis (Wirjanto and Xu 2009). Since any continuous distribution can be well approximated by a finite mixture of normal distributions, the use of mixtures of normals for GARCH innovations has been proposed by Bai et al. (2003), Haas et al. (2004), Marcucci (2005), Alexander and Lazar (2006) and Wirjanto and Xu (2009).

Given that financial returns are subject to regime-switching behavior between k states, even if the distribution of financial returns in each regime is normal, the overall distribution, given the probability of each state, is not normal. In fact, it is a mixture of k normal distributions (Levy and Kaplanski 2015). One way to accommodate this situation in the context of volatility forecasting via SVR is to use a mixture of Gaussian kernels. The mixture of normal distributions can capture extreme events, high kurtosis and heavy tails of financial returns, and can approximate any continuous probability distribution arbitrarily well (McLachlan and Peel 2004; Wang and Taaffe 2015). In this context, we attempt to find the optimal number of Gaussian kernels in the mixture for the SVR–GARCH model. We test a linear combination of one, two, three and four Gaussian kernels in the SVR based on GARCH(1,1), which can capture market regimes and may show better out-of-sample forecasting results than the SVR–GARCH with a single kernel.

To verify which kernels are most used in volatility forecasting via SVR, we conducted a search covering 2000 to 2016 through each of the following online search engines and databases: Elsevier, Wiley Online Library, IEEE Xplore Digital Library, SCOPUS, ISI Web of Knowledge, Sciencedirect, Google Scholar and ProQuest Journals. We used the same query for every search engine:

(“support vector machine” OR “support vector regression”) AND (“financial time series forecasting” OR “volatility forecasting” OR “volatility”)

Then, we select papers about volatility forecasting via SVR published in peer-reviewed journals with an impact factor. We found that the ten selected research articles used only a single-kernel and that the Gaussian is the most widely used kernel function (Table 1):

Table 1 Kernel choice in volatility forecasting

The kernel function maps non-linear observations of the input data into a higher-dimensional feature space, in which the data are linearly separable (Vapnik 1995). In this paper, we use a linear combination of \(K=1,2,3,4\) Gaussian kernels:

$$\begin{aligned} K_{mix}(x,x')= \sum _{k=1}^{K}\rho _k \times K_k(x,x'), \quad \rho _k\ge 0 \quad \text{ and } \quad \sum _{k=1}^{K}\rho _k=1 \end{aligned}$$
(8)

where \(\rho _k\) is the weighting coefficient of the k-th kernel and \(K_k(x,x')=\exp \left( -\gamma _k ||\ x-x' ||^2\right) \). Following Huang et al. (2014), the optimal \(\rho \) is obtained by a grid search with a step length of 0.1.
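
The mixture kernel of Eq. 8 and a grid of admissible weights can be sketched as follows. This is a minimal illustration under stated assumptions: each Gaussian component is given its own width parameter, the helper names `mixture_kernel` and `weight_grid` are hypothetical, and the grid is enumerated in integer steps to avoid floating-point drift.

```python
import math
from itertools import product


def gaussian_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def mixture_kernel(x, z, gammas, rhos):
    """Eq. 8 with Gaussian components: sum_k rho_k * K_k(x, z)."""
    return sum(rho * gaussian_kernel(x, z, g) for rho, g in zip(rhos, gammas))


def weight_grid(n_kernels, steps=10):
    """All weight vectors on a grid of step 1/steps with rho_k >= 0, sum 1."""
    grid = []
    for head in product(range(steps + 1), repeat=n_kernels - 1):
        last = steps - sum(head)  # remaining mass goes to the last kernel
        if last >= 0:
            grid.append(tuple(v / steps for v in head + (last,)))
    return grid


# Two-kernel example with illustrative widths and weights.
k_same = mixture_kernel([0.2, 0.1], [0.2, 0.1], [0.9801, 0.01], [0.37, 0.63])
k_far = mixture_kernel([0.0], [1.0], [1.0, 2.0], [0.5, 0.5])
n_candidates = [len(weight_grid(k)) for k in (1, 2, 3, 4)]
```

With a step of 0.1 the grid contains 1, 11, 66 and 286 candidate weight vectors for one, two, three and four kernels, so an exhaustive search over the weights remains cheap. Because the weights are non-negative, the combination of admissible kernels is itself an admissible kernel.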

Empirical research shows that wavelet kernels yield better volatility forecasts than the Gaussian kernel (Tang et al. 2009a, b; Li 2014). We therefore also use the Morlet wavelet kernel in the SVR–GARCH (Zhang et al. 2004):

$$\begin{aligned} k(x,x')=\prod _{i=1}^{N}\cos \Big (1.75\times \dfrac{x_i-x_i'}{a}\Big )\exp \Big ( -\dfrac{||\ x_i-x_i' ||^2}{2a^2}\Big ),\quad {x},{x}' \in \mathbb {R}^N \end{aligned}$$
(9)
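
A direct transcription of the Morlet wavelet kernel in Eq. 9 is given below as a minimal sketch; the function name is an assumption, and a denotes the dilation parameter of the wavelet.

```python
import math


def morlet_wavelet_kernel(x, z, a):
    """Eq. 9: product over coordinates of cos(1.75 d / a) * exp(-d^2 / (2 a^2))."""
    value = 1.0
    for x_i, z_i in zip(x, z):
        d = x_i - z_i
        value *= math.cos(1.75 * d / a) * math.exp(-d * d / (2.0 * a * a))
    return value


k_identical = morlet_wavelet_kernel([0.3, -0.2], [0.3, -0.2], a=1.0)
k_unit_gap = morlet_wavelet_kernel([1.0], [0.0], a=1.0)
```

Unlike the Gaussian kernel, this kernel can take negative values, because the cosine factor oscillates with the distance between the two points.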

4 Empirical modelling

4.1 Parametric volatility models

Let \(P_t\) be the asset price at time t. Then, the return of the asset at time t is given by:

$$\begin{aligned} r_t=\log \left( \frac{P_t}{P_{t-1}}\right) \end{aligned}$$
(10)
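
For illustration, Eq. 10 can be applied to a price series as in the short sketch below; the prices are hypothetical.

```python
import math


def log_returns(prices):
    """Eq. 10: r_t = log(P_t / P_{t-1}) for consecutive prices."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]


prices = [100.0, 110.0, 99.0]  # hypothetical closing prices
returns = log_returns(prices)
```

A series of T prices yields T - 1 returns, and the log returns sum to the log return over the whole period.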

The GARCH can capture volatility clustering and persistence. The GARCH(1,1) is specified as follows:

$$\begin{aligned} r_t= & {} u_t+a_t \end{aligned}$$
(11)
$$\begin{aligned} a_t= & {} \sqrt{h_t}z_t,\quad z_t \sim i.i.d(0,1) \end{aligned}$$
(12)
$$\begin{aligned} h_t= & {} \alpha _0+\alpha _1a^2_{t-1}+\beta _1h_{t-1} \end{aligned}$$
(13)

where \(u_t\) is the conditional mean of the returns, \(a_t\) is the innovation, \(\alpha _0>0\) and \(\alpha _1, \beta _1\ge 0\). One of the drawbacks of the standard GARCH is that negative and positive shocks have the same impact on the volatility forecasts (Franses and van Dijk 1996). Glosten et al. (1993) introduced the GJR model to capture the asymmetric response of volatility to shocks. The GJR(1,1) is defined as:

$$\begin{aligned} h_t=\alpha _0+\alpha _1a^2_{t-1}+\beta _1h_{t-1}+\gamma S^-_{t-1}a^2_{t-1}, \end{aligned}$$
(14)

where

$$\begin{aligned} S^-_{t-1}= {\left\{ \begin{array}{ll} 1, &{} \quad \text{ if }\quad a_{t-1} < 0 \\ 0, &{} \quad \text{ otherwise } \end{array}\right. } \end{aligned}$$
(15)

where \(\alpha _0>0\), \(\alpha _1, \beta _1\ge 0\) and \(\alpha _1+\gamma \ge 0\). The exponential generalized autoregressive conditional heteroscedasticity (EGARCH) model (Nelson 1991) can model the skewness of financial returns and ensures that the variance is always positive. The EGARCH(1,1) is written as follows:

$$\begin{aligned} \ln (h_t)=\alpha _0+\beta _1\ln \Big (h_{t-1}\Big )+\alpha _1\Bigg [\dfrac{|a_{t-1}|}{\sqrt{h_{t-1}}}- \sqrt{\dfrac{2}{\pi }}\Bigg ]-\gamma \Bigg (\dfrac{a_{t-1}}{\sqrt{h_{t-1}}}\Bigg ) \end{aligned}$$
(16)

where \(\gamma \) is the asymmetric response parameter. If this parameter is positive, a negative shock generates more volatility than a positive shock of the same size. In order to model the fat tails of the empirical distribution of financial returns, the errors \(z_t\) can follow a Student’s t, a skewed Student’s t or a Generalized Error Distribution (GED) (Marcucci 2005):

  1.

    A random variable X that follows a Student’s t distribution has the following probability density function (pdf) (Casella and Berger 2001):

    $$\begin{aligned} f(x) = \frac{\Gamma (\frac{\nu +1}{2})}{\sqrt{\nu \pi }\,\Gamma (\frac{\nu }{2})} \left( 1+\frac{x^2}{\nu } \right) ^{(-\frac{\nu +1}{2})} \end{aligned}$$
    (17)

    where \(\nu \) is the degree of freedom parameter and \(\Gamma (.)\) is the Gamma function.

  2.

    Generalized error distribution (GED): a random variable X that follows a GED with zero mean and unit variance has the following pdf (Tsay 2010):

    $$\begin{aligned} f\left( x\right) =\dfrac{\nu \exp \left[ -\frac{1}{2}\left| x / \lambda \right| ^\nu \right] }{\lambda 2^{(1+1 / \nu ) }\Gamma (1/\nu ) }, \quad 0<\nu \le \infty \end{aligned}$$
    (18)

    where:

    $$\begin{aligned} \lambda =\left[ \dfrac{2^{-(2 / \nu )}\Gamma \left( 1 / \nu \right) }{\Gamma (3 / \nu ) }\right] ^{1 / 2} \end{aligned}$$
    (19)

    where the parameter \(\nu \) denotes the thickness-of-tail. When \(0<\nu <2\), GED has thicker tails than the normal distribution.

  3.

    The skewed Student’s t-distribution can model asymmetric effects and excess kurtosis; its pdf takes the following form (Fernandez and Steel 1998):

    $$\begin{aligned}&f( x| \iota ,\nu ) =\dfrac{2}{\iota +1 / \iota } [ g(\iota (sx+m)|\nu ) I_{( -\infty ,0)} (x+ m / s)] \end{aligned}$$
    (20)
    $$\begin{aligned}&+\dfrac{2}{\iota +1 / \iota }[ g((sx+m)/\iota |\nu ) I_{( 0,+\infty )} (x+m / s)], \end{aligned}$$
    (21)

    where \(g(\cdot |\nu )\) is a Student’s t-distribution with \(\nu \) degrees of freedom,

    $$\begin{aligned} m= & {} \dfrac{\Gamma \left( \left( \nu +1\right) / 2\right) \sqrt{\nu -2}}{\sqrt{\pi }\Gamma \left( \nu / 2\right) }(\iota -1 / \iota ), \end{aligned}$$
    (22)
    $$\begin{aligned} s= & {} \sqrt{(\iota ^{2}+1 /\iota ^2-1)-m^2} \end{aligned}$$
    (23)

    where \(\iota \) is the asymmetry parameter.
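
To illustrate how the conditional variance recursions of Eqs. 13 and 14 operate, the sketch below filters a short series of innovations through a GARCH(1,1) and a GJR(1,1). The parameter values, the innovations and the starting variance `h0` are hypothetical; in an application the parameters would be estimated by maximum likelihood under one of the innovation distributions listed above.

```python
def garch_variance(a, alpha0, alpha1, beta1, h0):
    """Eq. 13: h_t = alpha0 + alpha1 * a_{t-1}^2 + beta1 * h_{t-1}."""
    h = [h0]  # h0 is an assumed starting value for the recursion
    for a_prev in a[:-1]:
        h.append(alpha0 + alpha1 * a_prev ** 2 + beta1 * h[-1])
    return h


def gjr_variance(a, alpha0, alpha1, beta1, gamma, h0):
    """Eq. 14: adds gamma * a_{t-1}^2 when the previous shock is negative."""
    h = [h0]
    for a_prev in a[:-1]:
        indicator = 1.0 if a_prev < 0 else 0.0  # S^-_{t-1} of Eq. 15
        h.append(alpha0 + (alpha1 + gamma * indicator) * a_prev ** 2
                 + beta1 * h[-1])
    return h


# Hypothetical innovations: a positive shock followed by a negative one.
shocks = [1.0, -1.0, 0.5]
h_garch = garch_variance(shocks, alpha0=0.1, alpha1=0.2, beta1=0.7, h0=1.0)
h_gjr = gjr_variance(shocks, alpha0=0.1, alpha1=0.2, beta1=0.7, gamma=0.1,
                     h0=1.0)
```

In this toy example the two shocks have the same magnitude, so the GARCH variance is unchanged after either of them, whereas the GJR variance rises after the negative shock only.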

4.2 SVR based on GARCH

In order to forecast volatility, we have to define the inputs and outputs of the SVR decision function. Previous studies have shown that the GARCH(1,1) is sufficient to model financial volatility (Poon and Granger 2003; Hansen and Lunde 2005). Thus, in this work the conditional variance is modeled by a GARCH (1,1), while the conditional mean is modeled by an AR (1) process (Franses and van Dijk 1996). To forecast volatility, we then use an SVR based on GARCH (1,1) (hereafter SVR–GARCH). The output variable is \(h_t\) and the input vector is \(x_t=[a^2_{t-1},h_{t-1}]\). The SVR–GARCH is given by the following structure:

$$\begin{aligned} r_t=f\left( r_{t-1}\right) +a_t \end{aligned}$$
(24)

where f is the decision function estimated by SVR for the mean equation. We get the squared residuals from the conditional mean estimation of the SVR–GARCH, then we estimate the conditional variance equation given by:

$$\begin{aligned} \tilde{h}_t=g(\tilde{h}_{t-1},a_{t-1}^2) \end{aligned}$$
(25)

where g is the decision function estimated by SVR, \(a_{t}^2\) are the squared residuals and \(\tilde{h}\) is the volatility proxy. In the mean equation, we use a single Gaussian kernel (\(K(x,x')= \exp \left( -\gamma ||\ x-x' ||^2\right) \)), because it is a common choice in financial time series forecasting via SVR (Sapankevych and Sankar 2009). In the volatility equation of SVR–GARCH, we use a linear combination of one, two, three and four Gaussian kernels given by Eq. 8.

As volatility is not directly observable, a proxy is necessary. As in Brooks (2001), Brooks and Persand (2003) and Chen et al. (2010), we use the following proxy:

$$\begin{aligned} \tilde{h}_t=(r_t-\bar{r})^2 \end{aligned}$$
(26)

where \(r_t\) are the financial returns and \(\bar{r}\) is the mean of the returns. Any volatility proxy is an imperfect estimator of the true conditional variance (Patton 2011). The use of another proxy might alter the results presented here; however, this issue is beyond the scope of this paper.
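
The construction of the proxy in Eq. 26 and of the input/output pairs of the variance equation (Eq. 25) can be sketched as follows. The helper names and the toy numbers are assumptions for illustration; the squared residuals would in practice come from the fitted mean equation.

```python
def volatility_proxy(returns):
    """Eq. 26: squared deviation of each return from the sample mean."""
    mean_r = sum(returns) / len(returns)
    return [(r - mean_r) ** 2 for r in returns]


def build_variance_dataset(sq_residuals, proxy):
    """Pairs x_t = [a_{t-1}^2, h_{t-1}] with target h_t (Eq. 25)."""
    inputs = [[sq_residuals[t - 1], proxy[t - 1]]
              for t in range(1, len(proxy))]
    targets = proxy[1:]
    return inputs, targets


returns = [0.01, -0.02, 0.04]          # hypothetical returns
sq_residuals = [1e-4, 4e-4, 16e-4]     # hypothetical squared residuals
proxy = volatility_proxy(returns)
inputs, targets = build_variance_dataset(sq_residuals, proxy)
```

Because each input uses values lagged by one period, a series of T observations yields T - 1 training pairs.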

Before applying the SVR–GARCH for volatility prediction, we use the validation procedure (also known as the holdout method) based on grid search and sensitivity analysis to select the kernel parameter \(\gamma \) (for the Gaussian kernel), the regularization parameter C and the loss function parameter \(\epsilon \) (Stone 1974; Kohavi 1995; Arlot and Celisse 2010). We divide the database into three mutually exclusive sets: training, validation and testing (Shalev-Shwartz and Ben-David 2014). The training set is used to estimate the model parameters, and the performance of the various parameter values is then evaluated in the validation set. Following Cao and Tay (2001) and Chen et al. (2010), we carry out a sensitivity analysis to assess the effects of varying the parameters C, \(\epsilon \) and \(\gamma \) on the MAE of the volatility forecasts in the validation set. To this end, we perform a grid search for each parameter, keeping the others fixed. For each value of each parameter, we forecast in the validation set and then calculate the MAE. We choose the parameters that give the smallest MAE. Finally, we evaluate the generalization performance of the SVR–GARCH in the test set. In this paper, the whole data set is divided into three subsets: the first 50% forms the training set, the next 20% forms the validation set and the last 30% is reserved for the test set.
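
The holdout procedure described above can be sketched as follows. The rounding rule in `chronological_split` and the sequential updating in `one_at_a_time_search` are assumptions made for this illustration, and `validation_mae` stands for any hypothetical function that fits the model on the training set with the given parameters and returns the MAE on the validation set.

```python
def chronological_split(n, train_pct=50, test_pct=30):
    """Sizes of the training, validation and test sets, in time order.

    Assumption: training and test sizes are rounded down and the
    validation set receives the remaining observations.
    """
    n_train = n * train_pct // 100
    n_test = n * test_pct // 100
    n_val = n - n_train - n_test
    return n_train, n_val, n_test


def one_at_a_time_search(validation_mae, grids, start):
    """Vary one parameter over its grid while the others stay fixed."""
    best = dict(start)
    for name, grid in grids.items():
        scores = {value: validation_mae({**best, name: value})
                  for value in grid}
        best[name] = min(scores, key=scores.get)
    return best


# Hypothetical validation loss with its minimum at C=1, epsilon=0.1, gamma=0.5.
def toy_validation_mae(params):
    return (abs(params["C"] - 1.0) + abs(params["epsilon"] - 0.1)
            + abs(params["gamma"] - 0.5))


grids = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 1.0],
         "gamma": [0.1, 0.5, 2.0]}
start = {"C": 0.1, "epsilon": 0.01, "gamma": 0.1}
best_params = one_at_a_time_search(toy_validation_mae, grids, start)
sizes = chronological_split(1422)
```

A search of this kind evaluates only the sum of the grid sizes rather than their product, at the cost of possibly missing interactions between parameters. The test set is never touched during the search.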

We use the MAE and the RMSE to evaluate the prediction performance. The RMSE is given by:

$$\begin{aligned} RMSE=\sqrt{\dfrac{1}{n}\sum _{t=1}^{n} (y_t-\hat{y_t})^2 } \end{aligned}$$
(27)

MAE measures the magnitude of overall error and is given by the following equation (Hyndman and Koehler 2006):

$$\begin{aligned} MAE=\frac{1}{n} \sum _{t=1}^{n} |y_t-\hat{y_t} |\end{aligned}$$
(28)

where \(y_t\) denotes the observation at time t and \(\hat{y_t}\) denotes the forecast of \(y_t\). The model that produces the smallest values of MAE and RMSE is judged to be the best model. Since the MAE is a random variable, a statistical procedure is needed to determine whether one model shows superior predictive performance over another. We use the two-sided Diebold–Mariano (DM) test (Diebold and Mariano 1995) to compare the forecast performance of two competing models. The DM test statistic is based on the difference of the MAE loss function and has the following null and alternative hypotheses:

$$\begin{aligned} H_0: MAE_1-MAE_0=0\quad \text{ versus }\quad H_1:MAE_1-MAE_0\ne 0 \end{aligned}$$

where \(MAE_0\) is the MAE of the competing model and \(MAE_1\) is the MAE of the proposed model. Thus, if the null hypothesis is rejected, there is evidence that one model is superior to the other. Moreover, according to Chen et al. (2010), the Diebold–Mariano (DM) statistic in a robust form for a time series with volatility \(\sigma _t\) is given by:

$$\begin{aligned} DM={\frac{1}{\sqrt{n}}}{\frac{1}{\sqrt{\hat{S^2}}}}\sum _{t=T_1}^{T-1} (|\sigma _{t+1}^2 -\hat{\sigma }_{1,t+1}^2 |- |\sigma _{t+1}^2 -\hat{\sigma }_{0,t+1}^2 |)\sim N(0,1) \end{aligned}$$
(29)

where \(\hat{\sigma }_{0,t+1}^2\) is the volatility estimated by the competing model, \(\hat{\sigma }_{1,t+1}^2\) is the volatility estimated by the proposed model and \(\hat{S^2}\) denotes the (co)variance estimate obtained with the Newey and West (1987) procedure. Negative (positive) values of the DM statistic indicate that the proposed model performs better (worse) than the competing model.
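
The evaluation metrics and the DM statistic of Eq. 29 can be sketched as below. The truncation lag of the Newey–West estimator (`lags`) and the toy forecasts are assumptions made for this example, since the lag used in the paper is not stated here; the p-value relies on the asymptotic standard normal distribution of the statistic.

```python
import math


def mae(actual, forecast):
    """Eq. 28: mean absolute error."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Eq. 27: root mean squared error."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast))
                     / len(actual))


def newey_west_variance(d, lags):
    """Long-run variance of d with Bartlett weights (Newey and West 1987)."""
    n = len(d)
    mean_d = sum(d) / n
    dev = [v - mean_d for v in d]
    lrv = sum(v * v for v in dev) / n
    for k in range(1, lags + 1):
        gamma_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        lrv += 2.0 * (1.0 - k / (lags + 1)) * gamma_k
    return lrv


def diebold_mariano(actual, proposed, competing, lags=0):
    """DM statistic on absolute-error loss and two-sided normal p-value."""
    d = [abs(y - p) - abs(y - c)
         for y, p, c in zip(actual, proposed, competing)]
    n = len(d)
    stat = sum(d) / math.sqrt(n * newey_west_variance(d, lags))
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Hypothetical proxy values and forecasts from two models.
actual = [1.0, 2.0, 3.0, 4.0]
proposed = [1.1, 2.1, 2.9, 4.2]
competing = [1.5, 2.6, 2.5, 4.9]
dm_stat, dm_p = diebold_mariano(actual, proposed, competing)
```

In this toy example the proposed model has the smaller absolute error at every point, so the statistic is negative, in line with the sign convention stated above.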

5 Empirical results

In this section, we apply the SVR–GARCH with a linear combination of one, two, three and four Gaussian kernels to volatility forecasting and compare its performance with that of three parametric volatility models, namely the GARCH, EGARCH and GJR. The first dataset consists of the Nikkei 225 daily closing prices from May 1, 2010 to January 28, 2016, for a total of 1422 observations, all obtained from Yahoo Finance and then transformed into log returns as in Eq. 10. The second dataset consists of the daily closing prices of the Ibovespa (Bovespa index) for the period December 22, 2007 to January 04, 2016. The first half of the data is used as the training set, 20% is reserved for the validation set and the remaining 30% is used as the test set. Table 2 shows the descriptive statistics for the Nikkei 225 and Ibovespa over the whole sample period.

Table 2 Descriptive statistics for daily returns

The returns are characterized by excess kurtosis and deviate from the normal distribution. Table 3 and Table 4 show the parameter estimates for the GARCH (1,1), EGARCH (1,1) and GJR (1,1) models for the Nikkei and Ibovespa returns. For each model, four different distributions for the innovations are considered: the Normal, the Student’s t, the skewed Student’s t and the GED (Generalized Error Distribution). The Nikkei and Ibovespa series are best fitted by the GJR with skewed Student’s t innovations, according to the highest value of the log likelihood (LL) and the smallest values of AIC and BIC.

Table 3 Goodness of fit for Nikkei returns
Table 4 Goodness of fit for Ibovespa returns

Then, we select the parameters C, \(\epsilon \) and the kernel parameters for the conditional mean and volatility equations via the holdout validation procedure. For the mean equation, we use a Gaussian kernel, and for the volatility equation we use a linear combination of one, two, three and four Gaussian kernels. The first 711 observations of the Nikkei returns series are used for training, observations 712 to 996 for validation and observations 997 to 1422 for the test set. We use the training set to estimate the function f of the mean equation and g of the volatility equation of SVR–GARCH. In this section, we only report the parameter selection for the SVR–GARCH with a linear combination of two Gaussian kernels. The parameter selection for the SVR–GARCH with one, three and four Gaussian kernels and with the Morlet kernel is similar and is not reported here to save space. For the same reason, we do not report the results for the Ibovespa returns.

First, we estimate the conditional mean equation in the training set:

$$\begin{aligned} r_t=f(r_{t-1})\quad \text{ for } \; t \in \{2,\ldots ,711\} \end{aligned}$$
(30)

For the selection of the optimal parameters, we use a grid search for each parameter, while keeping the others fixed. For each value of each parameter, we make a forecast in the validation set in order to minimize the following expression:

$$\begin{aligned} MAE=\frac{1}{285} \sum _{t=712}^{996} |r_t - f(r_{t-1}) |\end{aligned}$$
(31)

For the sensitivity analysis of C, we fix \(\epsilon =0.0001\) and \(\gamma =1.25\), and the parameter C takes values in the range [0, 10]. The value \(C=0.004\) leads to the best validation performance. Epsilon varies in the range [0, 5], with \(\gamma =1.25\) and \(C=0.025\). The validation MAE attains its minimum when \(\epsilon =0.2205\). The parameter \(\gamma \) takes values in the range [0, 10], with \(C=0.004\) and \(\epsilon =0.2205\). The value \(\gamma =0.9\) results in the best validation performance. Thus, the best parameters of SVR–GARCH for the conditional mean of the returns are \(C=0.004\), \(\epsilon =0.2205\) and \(\gamma =0.9\) (Table 5):

Table 5 Sensitivity analysis of SVR in conditional mean estimation

We then estimate the conditional mean equation by using the SVR–GARCH with the best parameters for the conditional mean up to observation 996, in order to obtain the residuals \(a_t\) in the following way:

$$\begin{aligned} a_t=r_t-f(r_{t-1})\quad \text{ for } \; t \in \{2,\ldots ,996\} \end{aligned}$$
(32)

Then we estimate the volatility equation of SVR–GARCH(1,1):

$$\begin{aligned} \tilde{h}_t=g(\tilde{h}_{t-1},a_{t-1}^2)\quad \text{ for } \; t \in \{2,\ldots ,711\} \end{aligned}$$
(33)

where \(a_t^2\) are the squared residuals. The volatility proxy \(\tilde{h}_t\) is calculated up to observation 996, and the parameter selection is made in order to minimize the following expression:

$$\begin{aligned} MAE=\frac{1}{285} \sum _{t=712}^{996} |\tilde{h}_t - g(\tilde{h}_{t-1},a_{t-1}^2) |\end{aligned}$$
(34)

For the sensitivity analysis of C, we fix \(\epsilon =0.0001\), \(\gamma _1=0.01\), \(\gamma _2=0.07\) and \(\rho =0.25\), and the parameter C takes values in the range [0, 10]. The validation MAE attains its minimum when \(C=0.625\). As in the mean equation, we follow the same procedure for the other parameters (Table 6).

Table 6 Sensitivity analysis of SVR in conditional variance estimation

Thus, the appropriate parameters of SVR–GARCH for the conditional variance are \(C=5.184\), \(\epsilon =0.05929\), \(\gamma _1=0.9801\), \(\gamma _2=0.01 \) and \(\rho =0.37\).

5.1 Volatility forecasting evaluation

With the SVR–GARCH optimal parameters (C, \(\epsilon \) and kernel parameters), we make the one-period-ahead volatility forecasts in the test set (i.e. out-of-sample). After each forecast, we calculate the forecast error and repeat the forecasting process for the next period. Table 7 reports the values of MAE and RMSE obtained from the different models for the Nikkei and Ibovespa returns.
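
The out-of-sample loop described above can be sketched as follows, under the assumption that the fitted decision function is held fixed over the test set and is fed the lagged squared residual and the lagged proxy at each step. The function `g` below is a hypothetical stand-in for the estimated SVR decision function, and the numbers are illustrative.

```python
def one_step_ahead_forecasts(g, sq_residuals, proxy, first_test_index):
    """Forecast h_t from [a_{t-1}^2, h_{t-1}] for every t in the test set."""
    forecasts, errors = [], []
    for t in range(first_test_index, len(proxy)):
        h_hat = g([sq_residuals[t - 1], proxy[t - 1]])
        forecasts.append(h_hat)
        errors.append(proxy[t] - h_hat)
    return forecasts, errors


# Hypothetical stand-in for the fitted decision function of Eq. 25.
def g(x):
    return 0.1 * x[0] + 0.8 * x[1]


sq_residuals = [1.0, 4.0, 9.0, 16.0]
proxy = [2.0, 3.0, 5.0, 7.0]
forecasts, errors = one_step_ahead_forecasts(g, sq_residuals, proxy,
                                             first_test_index=2)
```

The resulting errors are the inputs to the MAE, RMSE and Diebold–Mariano calculations.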

Table 7 Out-of-sample evaluation of one-period-ahead volatility forecasts

For the Nikkei 225 series, the SVR–GARCH with a mixture of three Gaussian kernels achieves the smallest value of MAE, whereas the SVR–GARCH with the Morlet wavelet kernel achieves the smallest value of RMSE. According to the MAE and RMSE measures in Table 7, the SVR–GARCH with a linear combination of four Gaussian kernels is the best model for the Ibovespa series. To compare the predictive power of two models, we use the two-sided Diebold–Mariano test given by the following null and alternative hypotheses for the Nikkei returns:

$$\begin{aligned}&H_0: \dfrac{1}{426}\sum _{t=997}^{1422}\left( \left| \tilde{h}_t- \hat{h}_{1,t}\right| -\left| \tilde{h}_t- \hat{h}_{0,t}\right| \right) =0 \quad \text{ versus } \nonumber \\&H_1:\dfrac{1}{426}\sum _{t=997}^{1422}\left( \left| \tilde{h}_t- \hat{h}_{1,t}\right| - \left| \tilde{h}_t- \hat{h}_{0,t}\right| \right) \ne 0, \end{aligned}$$
(35)
Table 8 Diebold–Mariano test (benchmark: SVR–GARCH 4, one-step-ahead)
Table 9 Diebold–Mariano test (benchmark: SVR–GARCH 3, one-step-ahead)

where \(\tilde{h}_t\) is the volatility proxy, \(\hat{h}_{0,t}\) is the volatility estimated by the competing model and \({\hat{h}}_{1,t}\) is the volatility estimated by the proposed model. Moreover, the DM test statistic is given by Chen et al. (2010):

$$\begin{aligned} DM={\frac{1}{\sqrt{426}}}{\frac{1}{\sqrt{{\hat{S}}^{2}}}}\sum _{t=997}^{1422} \left( \left| {\tilde{h}}_{t}- {\hat{h}}_{1,t}\right| -\left| {\tilde{h}}_{t}- {\hat{h}}_{0,t}\right| \right) \sim N(0,1) \end{aligned}$$
(36)

Tables 8 and 9 report the DM statistics and p-values of the Diebold–Mariano test for the difference of the MAE loss function for the Nikkei 225 and Ibovespa daily returns, respectively.

For the Nikkei 225 and Ibovespa series, all SVR–GARCH models significantly outperform every GARCH model at any usual significance level. For the Nikkei 225 series, except for the SVR–GARCH with the Morlet wavelet kernel, the sign of the DM statistic is always negative, implying that the benchmark’s loss is lower than the loss implied by the competing models. However, we cannot reject the null hypothesis for the SVR–GARCH with the Morlet wavelet kernel and with one and two Gaussian kernels, which means that these models have equal forecasting ability. For the other models, we always reject the null hypothesis of equal forecast accuracy at any usual significance level. For the Ibovespa series, the sign of the DM statistic for the SVR–GARCH with a linear combination of three kernels is always negative and we always reject the null hypothesis of equal forecast accuracy.

6 Concluding remarks

The main contribution of this paper is the use of a mixture of one, two, three and four Gaussian kernels in the SVR based on GARCH(1,1) to take into account the existence of market regimes. We compare these models with the SVR–GARCH with a Morlet wavelet kernel and with the GARCH, EGARCH and GJR models in terms of their ability to forecast volatility, using the MAE, the RMSE and the Diebold–Mariano test. All GARCH models are estimated assuming Gaussian, Student’s t, skewed Student’s t and GED innovations. To determine the optimal SVR parameters, we used the validation technique (holdout method) based on grid search and sensitivity analysis. Nikkei 225 and Ibovespa daily returns were used as the dataset. The empirical results indicate that the mixture of Gaussian kernels can improve the SVR–GARCH one-period-ahead volatility forecasts. In sum, the mixture of normal distributions can model the overall distribution of financial returns when markets display regime behaviour and can also better approximate nonlinear characteristics of financial returns such as heavy tails, volatility clustering and time-varying skewness.