1 Introduction

The stock market is an extremely complex nonlinear dynamic system, and the ability to forecast stock return volatility with precision is of great importance for investors and regulators. Volatility modeling and prediction of stock index returns are central subjects in financial risk management, including derivative pricing, risk measurement and multi-period portfolio selection. Because volatility can proxy for risk conditions in financial markets, it has attracted a great deal of attention. Taking into account the stylized facts of financial phenomena, GARCH-type models are extensively used to capture the clustering property of financial asset returns in volatility prediction (Bentes 2015a, b; Abounoori et al. 2016; Lv and Shan 2013). However, GARCH-family models are parametric models that require prior structural assumptions about the financial data. Many researchers have found that financial time series do not follow the normal distribution; instead, they exhibit leptokurtosis and heavy tails in addition to excessive skewness and kurtosis. Hence, non-Gaussian distributional assumptions need to be imposed on GARCH-type processes to capture the left-skewed distributional characteristics of financial returns. Moreover, GARCH-family processes cannot describe the nonlinear, complex dependence relationships among financial asset variables, which may lead to unsatisfactory forecasts. Overcoming the drawbacks of conventional financial forecasting models and enhancing volatility prediction therefore calls for more advanced techniques that can identify these complex patterns.

With the rapid development of artificial intelligence, artificial neural network and machine learning techniques have attracted increasing attention in recent years and have been widely applied in finance (Hajizadeh et al. 2012; Bildirici and Ersin 2009; Kristjanpoller et al. 2014), especially for predicting financial asset returns and volatility (Lahmiri 2016a, b). Integrating artificial intelligence techniques with conventional GARCH-nG models can effectively improve the volatility forecasting performance for stock indices and has become a hot research topic (Tseng et al. 2009). Furthermore, intelligent optimization algorithms, including the genetic algorithm and particle swarm optimization, have been utilized to improve the estimation accuracy of the parameters of these intelligent models.

However, it may not always be appropriate to feed the forecast series of GARCH-type models into an artificial neural network for further volatility prediction. Because financial asset variables typically display nonlinear and non-Gaussian behaviors, two kinds of extension are required. One research extension is the use of non-normal distributional assumptions in modeling the returns distribution; the other is the use of assumption-free techniques to identify the dynamic pattern of the volatility process. Hence, our research focuses on these two aspects and compares the forecasting ability of AGARCH-nG-ANN models with an improved support vector machine method based on a modified optimization algorithm.

The support vector machine is a powerful machine learning method grounded in statistical learning theory. It takes structural risk minimization as its principle, which effectively suppresses over-learning and yields good generalization. Model training is transformed into solving a quadratic programming problem, which guarantees global optimality and avoids the local minimum problem of neural networks. The LS-SVM (Suykens et al. 2002) is a support vector machine that takes the quadratic loss function as the empirical risk. It replaces the inequality constraints with equality constraints and transforms training into the solution of a system of linear equations, which simplifies the computation and shortens the training time while producing more deterministic results, making it suitable for online applications. Therefore, this study investigates the predictive power of the LS-SVM based on an improved particle swarm optimization algorithm for stock index volatility forecasting, and the outcomes are contrasted with those of the hybrid approaches.

Previous work on volatility forecasting has mainly combined artificial neural networks with GARCH models without non-Gaussian innovations, or has used the support vector machine without parameter optimization. In this paper, we construct an LS-SVM method optimized by a modified PSO algorithm to enhance returns volatility forecasting. In addition, we incorporate non-Gaussian innovations into the earlier AGARCH-ANN models. We then compare the volatility forecasting performance of the proposed model with two extended alternatives using stock market data from the Wind database: asymmetric GARCH models with non-Gaussian distributions hybridized with artificial neural networks, and individual parametric asymmetric GARCH models. The forecasting accuracy results provide guidance for stock market investors.

2 Literature review

Because of their high peakedness and excess kurtosis, financial time series data are generally not normally distributed. Several works have examined the influence of different distributional choices. Generally, the generalized error distribution is assumed for the innovations because of its capability to characterize excess kurtosis, while the Student-t distribution is assumed to flexibly characterize the fat tails of financial time series. Zhu and Galbraith (2010) proposed the asymmetric Student-t (AST) distribution, which allows separate parameters to control the skewness and the thickness of the tails. By exploiting asymmetric and fat-tailed models, researchers have found evidence supporting the usefulness of such extensions in describing the complex patterns of financial data (Alberg et al. 2008).

There has been growing interest in nonlinear modeling, and nonparametric models for volatility prediction have undergone major developments (Kristjanpoller and Minutolo 2016). Researchers have made efforts to apply artificial intelligence techniques in finance and economics (Cheng and Wei 2009; Ramyar and Kianfa 2017). Previous studies have found that hybrid models integrating asymmetric GARCH models with artificial neural networks track the actual volatility process much more closely (Kristjanpoller et al. 2014), indicating that GARCH-type ANN models provide better prediction performance than pure GARCH-type models; extending the asymmetric GARCH family with artificial neural network approaches is especially effective at enhancing forecasting ability. However, using extracted technical indicators as inputs to a back-propagation neural network, Lahmiri (2017) found that this produced more accurate estimates than the hybrid EGARCH-ANN approaches. Currently, the ANN is one of the most widely used nonlinear forecasting models for dealing with complex correlation patterns among variables, especially in stock price prediction (Qiu et al. 2016; Zahedi and Rounaghi 2015; Göçken et al. 2016). Ince et al. (2017) investigated the predictive accuracy of exchange rate forecasts with ANNs and the monetary model. Choudhary and Haider (2012) assessed several ANN combination models for inflation forecasting and found that they may serve as credible tools. Dhamija and Bhalla (2010) employed data from the financial meltdown period to demonstrate the usefulness of GARCH models and artificial neural networks in determining the long-run nonlinearity of the sample data. Monfared and Enke (2015) presented an adaptive neural network filter to predict the error in GARCH models and then applied it to predict the GARCH process.

However, the ANN technique has several troublesome problems, such as the difficulty of determining the number of hidden layer nodes, excessive learning, and the local minimum problem in the training process. Fortunately, the support vector machine, developed from statistical learning theory, has clear advantages for small-sample, nonlinear and high-dimensional prediction problems. Researchers have applied the SVM extensively in economics and finance, for instance in economic forecasting and bankruptcy prediction (Zhao et al. 2017). With further study of SVM methods, various improvements have been proposed and applied (Rojo-Álvarez et al. 2014). The LS-SVM method has been shown to be robust and suitable for large-scale computing. Zhao et al. (2016) stated that the LS-SVM is advantageous as a nonlinear prediction model but still needs modifications to improve its functionality. Ismail et al. (2011) proposed integrating self-organizing maps with the LS-SVM technique for time series forecasting. The PSO algorithm has been widely used in recent years (Abualigah and Khader 2017; Abualigah et al. 2018). To improve classification accuracy, Liu and Zhou (2015) introduced the PSO algorithm into the LS-SVM model for a novel data classification approach. Building on these ideas, this paper presents an improved PSO algorithm to optimize the LS-SVM parameters and thereby enhance prediction accuracy. Furthermore, since few studies have comprehensively compared parametric and nonparametric approaches to stock volatility forecasting, our study contributes to the selection of financial volatility prediction techniques.

The motivation of this study is to enhance the returns volatility forecasting of previous methods and to investigate whether improvements can be achieved with the LS-SVM technique optimized by the IPSO algorithm. To compare the performance of the LS-SVM-IPSO, the hybrid AGARCH-nG-ANN and the AGARCH family models, three kinds of models are constructed for stock index volatility forecasting.

3 GARCH-type models with non-Gaussian distributions

In this paper, three volatility forecasting techniques are compared for predicting stock index historical volatility: the traditional GARCH-type processes, a hybrid artificial neural network that uses the GARCH-nG prediction outcomes as inputs, and the IPSO-modified LS-SVM method. This section focuses on the extensions of the GARCH-type family models. A critical feature of recent progress has been the incorporation of fatter tails and asymmetric effects into the variance process, which improves the fit of the tail decay rates, since declines in financial returns may involve more extreme movements.

4 Asymmetric GARCH models

To capture volatility clustering and heteroscedasticity effects, the time-varying volatility process σt can be expressed by the GARCH(p, q) model (Bollerslev 1986) as follows, in which the conditional variance depends on the q past conditional variances as well as the p past squared innovations.

$$ \begin{aligned} & \sigma_{t}^{2} = \omega_{1} + \sum\nolimits_{i = 1}^{p} {\alpha_{1i} \varepsilon_{t - i}^{2} } + \sum\nolimits_{j = 1}^{q} {\beta_{1j} \sigma_{t - j}^{2} } \\ & \varepsilon_{t} = \sigma_{t} Z_{t} \\ \end{aligned} $$
(1)

where Zt are i.i.d. random variables with zero mean and unit variance, εt is an uncorrelated series with zero mean and conditional variance σt2, α1i > 0 and β1j > 0, and the following constraint is satisfied to ensure stationarity.

$$ \sum\nolimits_{i = 1}^{p} {\alpha_{1i} } + \sum\nolimits_{j = 1}^{q} {\beta_{1j} } < 1 $$
(2)
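As a minimal illustration of the recursion in Eq. (1), the following Python sketch computes a conditional variance path for a GARCH(p, q) process; initializing with the unconditional variance is a common but hypothetical choice, not one prescribed by the text.

```python
import numpy as np

def garch_variance(eps, omega, alpha, beta, sigma2_0=None):
    """Conditional variances of a GARCH(p, q) process, Eq. (1).

    eps   : array of past innovations epsilon_t
    alpha : coefficients on the p lagged squared innovations
    beta  : coefficients on the q lagged conditional variances
    """
    p, q = len(alpha), len(beta)
    T = len(eps)
    # hypothetical start-up: the unconditional variance implied by Eq. (2)
    if sigma2_0 is None:
        sigma2_0 = omega / (1.0 - sum(alpha) - sum(beta))
    sigma2 = np.full(T + 1, sigma2_0)
    for t in range(max(p, q), T + 1):
        arch = sum(alpha[i] * eps[t - 1 - i] ** 2 for i in range(p))   # ARCH terms
        garch = sum(beta[j] * sigma2[t - 1 - j] for j in range(q))     # GARCH terms
        sigma2[t] = omega + arch + garch
    return sigma2
```

The stationarity condition of Eq. (2), sum(alpha) + sum(beta) < 1, guarantees that the start-up value above is finite and positive.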

To better capture the significant asymmetry in asset returns caused by the asymmetric effects of different shocks, the EGARCH and GJR-GARCH models are most frequently employed. These two specifications account for the leverage effect of returns on the conditional variance: a large decline in returns tends to cause larger return volatility than an increase of the same magnitude.

The EGARCH model (Nelson 1991) describes the variance as an asymmetric function of the disturbance εt, so it can account for the asymmetric effects in volatility and capture the asymmetric behavior of returns. It can be expressed in the following form:

$$ \log (\sigma_{t}^{2} ) = \omega_{2} + \sum\nolimits_{i = 1}^{p} {\left\{ {\alpha_{2i} \left[ {\frac{{\left| {\varepsilon_{t - i} } \right|}}{{\sigma_{t - i} }} - \sqrt {\frac{2}{\pi }} } \right] + \gamma_{2i} \frac{{\varepsilon_{t - i} }}{{\sigma_{t - i} }}} \right\}} + \sum\nolimits_{j = 1}^{q} {\beta_{2j} \log (\sigma_{t - j}^{2} )} $$
(3)

where ω2, α2i, γ2i and β2j denote the model parameters, on which no sign restrictions are imposed.

Another model that can capture the asymmetric characteristics of returns is the GJR-GARCH model, which allows the conditional variance to react differently to the negative and positive innovations. It can be expressed as follows:

$$ \sigma_{t}^{2} = \omega_{3} + \sum\nolimits_{i = 1}^{p} {\beta_{3i} \sigma_{t - i}^{2} } + \sum\nolimits_{i = 1}^{q} {\alpha_{3i} \varepsilon_{t - i}^{2} } + \sum\nolimits_{i = 1}^{q} {\gamma_{3i} \varepsilon_{t - i}^{2} \varTheta_{t - i} } $$
(4)

where Θt−i = 1 if εt−i < 0 and Θt−i = 0 if εt−i ≥ 0, with α3i ≥ 0, β3i ≥ 0 and α3i + γ3i ≥ 0. Besides, the following constraint needs to be satisfied.

$$ \sum\nolimits_{i = 1}^{q} {\alpha_{3i} } + \sum\nolimits_{i = 1}^{p} {\beta_{3i} } + \frac{1}{2}\sum\nolimits_{i = 1}^{q} {\gamma_{3i} } < 1 $$
(5)
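A sketch of the GJR-GARCH(1,1) case of Eq. (4) makes the role of the indicator concrete: a negative shock raises the next-period variance by an extra gamma times the squared innovation. The parameter values used below are illustrative, not estimates from the study.

```python
import numpy as np

def gjr_garch_variance(eps, omega, alpha, beta, gamma, sigma2_0=1.0):
    """Conditional variance of a GJR-GARCH(1,1), i.e. Eq. (4) with p = q = 1."""
    T = len(eps)
    sigma2 = np.empty(T + 1)
    sigma2[0] = sigma2_0
    for t in range(1, T + 1):
        e = eps[t - 1]
        indicator = 1.0 if e < 0 else 0.0   # Theta: reacts only to negative shocks
        sigma2[t] = omega + beta * sigma2[t - 1] + (alpha + gamma * indicator) * e ** 2
    return sigma2
```

With, say, omega = 0.05, alpha = 0.05, beta = 0.9 and gamma = 0.1, a shock of −1 yields a larger next-period variance than a shock of +1, which is exactly the leverage effect the model is designed to capture.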

5 Non-Gaussian innovation distributions

Considering that financial return distributions typically display leptokurtosis and heavy tails in addition to excess kurtosis, non-Gaussian distributional assumptions are often imposed on GARCH-type volatility models (Chuang et al. 2007), such as the Student-t distribution, the generalized error distribution (GED) and the generalized asymmetric Student-t distribution (Zhu and Galbraith 2011). The different hypotheses on the error distribution in GARCH-type models capture different statistical features, and the asymmetric GARCH models allow various features of the returns distribution by specifying the distribution of the innovations.

For a stock returns series yt with mean μ and variance σ2, the Student-t density function has the following expression.

$$ f(y,\mu ,\sigma ,\upsilon ) = \frac{{\varGamma \left( {(\upsilon + 1)/2} \right)}}{{\sigma \sqrt {\pi (\upsilon - 2)} \varGamma \left( {\upsilon /2} \right)}}\left( {1 + \frac{{(y - \mu )^{2} }}{{\sigma^{2} (\upsilon - 2)}}} \right)^{{\frac{ - (\upsilon + 1)}{2}}} $$
(6)

where Γ represents the gamma function, and υ is the degrees-of-freedom parameter that controls the tail thickness, satisfying υ > 2.

Under the generalized error distribution assumption, the returns have the following density function.

$$ f(y,\mu ,\sigma ,\theta ) = \frac{\theta }{2\sigma \varGamma (1/\theta )}\exp \left( { - \left( {\frac{{\left| {y - \mu } \right|}}{\sigma }} \right)^{\theta } } \right) $$
(7)

where θ is the shape parameter that controls the tail thickness.

The generalized asymmetric Student-t distribution allows separate parameters to control the thickness and skewness of the tails, which is important for quantities that rely on tail features. The general form of the AST density in its rescaled version can be expressed as follows:

$$ f_{\text{AST}} (y) = \left\{ {\begin{array}{*{20}l} {\frac{1}{\sigma }\left[ {1 + \frac{1}{{v_{1} }}\left( {\frac{y - \mu }{{2\alpha_{4} \sigma K(v_{1} )}}} \right)^{2} } \right]^{{ - (v_{1} + 1)/2}} ,} \hfill & {{\text{if}}\;y \le \mu } \hfill \\ {\frac{1}{\sigma }\left[ {1 + \frac{1}{{v_{2} }}\left( {\frac{y - \mu }{{2(1 - \alpha_{4} )\sigma K(v_{2} )}}} \right)^{2} } \right]^{{ - (v_{2} + 1)/2}} ,} \hfill & {{\text{if}}\;y > \mu } \hfill \\ \end{array} } \right. $$
(8)

where \( K(v) = \varGamma ((v + 1)/2)/\left[ {\sqrt {\pi v} \varGamma (v/2)} \right] \), Γ represents the gamma function, α4 ∈ (0,1) is the skewness parameter, and v1 > 0 and v2 > 0 are the left and right tail parameters.
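The symmetric densities above can be coded directly, which is useful for checking shapes and tail behavior before estimation. This is an illustrative sketch of the Student-t density of Eq. (6) and the GED density of Eq. (7), not the estimation code used in the study.

```python
import math

def student_t_pdf(y, mu, sigma, nu):
    """Standardized Student-t density of Eq. (6) (variance sigma**2, nu > 2)."""
    c = math.gamma((nu + 1) / 2) / (
        sigma * math.sqrt(math.pi * (nu - 2)) * math.gamma(nu / 2))
    return c * (1 + (y - mu) ** 2 / (sigma ** 2 * (nu - 2))) ** (-(nu + 1) / 2)

def ged_pdf(y, mu, sigma, theta):
    """Generalized error density of Eq. (7); smaller theta gives fatter tails."""
    c = theta / (2 * sigma * math.gamma(1 / theta))
    return c * math.exp(-((abs(y - mu) / sigma) ** theta))
```

Both functions are symmetric about mu and integrate to one, which is a quick sanity check on the normalizing constants.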

5.1 GARCH-nG parameter estimation method

For the selection of the optimal lag orders of the GARCH-nG models, penalized model order selection criteria, namely the Bayesian information criterion (BIC) and the Akaike information criterion (AIC), are employed.

Traditionally, the maximum likelihood method is employed to estimate the parameters of the GARCH-type variance equations, and the estimates are obtained through numerical maximization of the corresponding likelihood functions. Let Ψ be the parameter vector of the likelihood function; the general log-likelihood can then be expressed as follows.

$$ l(\varPsi ;y) = \sum\nolimits_{t = 1}^{T} {\left\{ {\log \delta - \log \sigma_{t} + \log f_{Y} \left( {\tau + \delta \frac{{r_{t} - \mu }}{{\sigma_{t} }}} \right)} \right\}} $$
(9)

where τ denotes the mean value of the density function fY, and δ denotes the standard deviation of the density function fY.
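To make the estimation step concrete, the following sketch evaluates the negative log-likelihood of a GARCH(1,1) model with standardized Student-t innovations, in the spirit of Eq. (9); the start-up by the unconditional variance and the parameter ordering are assumptions made here for illustration.

```python
import math

def garch_t_nll(params, r):
    """Negative log-likelihood of a GARCH(1,1) with standardized Student-t errors.

    params = (mu, omega, alpha, beta, nu); r is the return series.
    """
    mu, omega, alpha, beta, nu = params
    # reject infeasible parameter vectors so an optimizer stays in bounds
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1 or nu <= 2:
        return float("inf")
    sigma2 = omega / (1 - alpha - beta)   # unconditional variance as start-up
    ll = 0.0
    for rt in r:
        sigma = math.sqrt(sigma2)
        z = (rt - mu) / sigma
        # log of the standardized Student-t density of Eq. (6)
        ll += (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
               - 0.5 * math.log(math.pi * (nu - 2)) - math.log(sigma)
               - (nu + 1) / 2 * math.log1p(z * z / (nu - 2)))
        sigma2 = omega + alpha * (rt - mu) ** 2 + beta * sigma2
    return -ll
```

In practice this function would be handed to a numerical optimizer such as scipy.optimize.minimize to obtain the maximum likelihood estimates.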

6 Least square support vector machine with improved particle swarm optimization

Financial time series often exhibit nonlinear patterns that cannot be characterized by parametric models, so approximating the real relationships in financial data calls for more advanced methods. In this study, we focus on artificial intelligence-based ANN methods and the machine learning-based LS-SVM method. Specifically, the ANN is hybridized with the asymmetric GARCH models with non-Gaussian innovations, and we propose a least square support vector machine technique with improved particle swarm optimization, in which a multi-region adaptive PSO algorithm is employed for parameter estimation.

6.1 Artificial neural network

The artificial neural network has unique advantages for stock price forecasting, which is a multi-factor, uncertain and nonlinear time series prediction problem. Volatility changes display nonlinear features, and the transaction information inherent in volatility changes embodies a large number of decisions. By studying the historical data, a neural network can autonomously discover the rules and patterns hidden in complex financial time series.

The ANN, which requires no distributional hypotheses, mimics the structure of a biological neural network in which sets of neurons are connected in layers, and it adapts on the basis of the properties extracted from the problem at hand. A typical ANN is organized hierarchically into an input layer with x variables, one hidden layer and an output layer whose forecast variable y is calculated as follows.

$$ y_{i} = f\left( {\sum\nolimits_{j = 1}^{p} {x_{j} w_{ij} + \eta } } \right) $$
(10)

where wij denotes the connection weight between neuron j and neuron i, η is the bias, and f denotes the activation function, which shapes the output amplitude. The input layer of the ANN corresponds to the input variables, and the hidden layer captures the nonlinear relationships.
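A one-hidden-layer forward pass corresponding to Eq. (10) can be sketched as follows; the linear output neuron is an assumption suitable for regression-type volatility forecasting.

```python
import numpy as np

def forward(x, W_h, b_h, w_o, b_o):
    """One-hidden-layer network: each hidden neuron applies Eq. (10)
    with a sigmoid activation; the output neuron is linear."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(W_h @ x + b_h)     # hidden layer: f(sum_j x_j * w_ij + eta)
    return w_o @ h + b_o           # linear output for the forecast value
```

Here W_h stacks the weights w_ij row by row, so each row of W_h @ x is the weighted sum inside Eq. (10) for one hidden neuron.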

Typically, the sigmoid function is chosen as the activation function, and the Levenberg–Marquardt algorithm is employed for network training, with the connection weights adjusted according to the following rule.

$$ \Delta w_{i} = - (M_{i}^{T} M_{i} + cI)^{ - 1} M_{i}^{T} e_{i} $$
(11)

in which M is the Jacobian matrix, I is the identity matrix, c is the adaptive learning parameter, and e is the error vector.

Commonly used ANN techniques include the BP neural network (BP-NN), the wavelet neural network and the RBF neural network; the BP-NN is chosen in this work since it is the most extensively employed. Theoretically, the BP-NN can approximate any complicated nonlinear function and displays strong generalization ability and self-learning features. Each layer of the back-propagation neural network takes inputs from the previous layer and sends outputs to the next layer. The BP neural network is introduced into stock volatility analysis and evaluation to understand the internal mechanism of the volatility dynamics and thereby predict volatility changes effectively.

6.2 Least square support vector machine

However, the ANN also suffers from over-fitting, and the BP neural network in particular falls easily into local solutions and converges slowly on high-dimensional data (Dai et al. 2014). The support vector machine, by contrast, is a machine learning algorithm with a complete mathematical theory and excellent learning ability that has become a hot research topic and has been successfully applied in many fields. The SVM takes the training error as the constraint of the optimization problem and the minimization of the confidence range as the optimization goal; that is, it is a learning method based on the structural risk minimization criterion, whose generalization ability is clearly superior to that of the neural network. Given two classes of sample points, the data are mapped into a high-dimensional space through nonlinear functions, and a pair of parallel hyperplanes is constructed to separate the training samples while maximizing the distance between the sample points and the hyperplane.

SVM training solves a constrained quadratic programming problem whose number of constraints equals the sample size, so training may take a long time for large samples. To improve training efficiency, the least square support vector machine was developed from the SVM. Compared with the SVM, the LS-SVM replaces the slack variables by the squared training errors and uses equality constraints instead of inequality constraints. The training process then only requires solving a linear system of equations, which avoids the time-consuming QP problem. Moreover, it offers simple computation and fast convergence with high precision and has been widely applied to nonlinear process modeling. Since this paper focuses on nonlinear modeling based on LS-SVM regression, its basic principle is described below.

Suppose the training sample set is {(x1,y1), (x2,y2),…, (xl,yl)}, where xi ∈ RN is the input of the ith sample and yi ∈ R is its output, i = 1, 2, …, l. For the nonlinear system, the regression function is assumed to take the following form.

$$ f(x) = w^{T} \cdot \varphi \left( x \right) + b $$
(12)

where φ(·) denotes the kernel-space mapping function that maps the input data from the original space to a high-dimensional feature space. The slack variables of the SVM optimization problem are replaced by squared error terms, and the inequality constraints become equality constraints. The optimization problem of the LS-SVM regression is then expressed as follows.

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{w,b,e} J(w,e) = \frac{1}{2}w^{T} w + \frac{C}{2}\sum\nolimits_{i = 1}^{l} {e_{i}^{2} } \\ & {\text{s}}.{\text{t}}.\quad y_{i} = w^{T} \varphi (x_{i} ) + b + e_{i} ,\;i = 1,2, \ldots ,l \\ \end{aligned} $$
(13)

where w ∈ RN is the weight vector, ei ∈ R the error variable, b the bias, and C the regularization parameter.

According to the above optimization problem, the Lagrange function can be defined as follows.

$$ L(w,b,e;\alpha ) = J(w,e) - \sum\nolimits_{i = 1}^{l} {\alpha_{i} (w^{T} \varphi (x_{i} ) + b + e_{i} - y_{i} )} $$
(14)

where the Lagrange multipliers (i.e., support values) αi ∈ R. According to the KKT conditions, the following hold.

$$ \left\{ {\begin{array}{*{20}l} {\frac{\partial L}{\partial w} = 0 \to w = \sum\nolimits_{i = 1}^{l} {\alpha_{i} \varphi (x_{i} )} } \hfill \\ {\frac{\partial L}{\partial b} = 0 \to \sum\nolimits_{i = 1}^{l} {\alpha_{i} = 0} } \hfill \\ {\frac{\partial L}{{\partial e_{i} }} = 0 \to \alpha_{i} = Ce_{i} } \hfill \\ {\frac{\partial L}{{\partial \alpha_{i} }} = 0 \to w^{T} \varphi (x_{i} ) + b + e_{i} - y_{i} = 0} \hfill \\ \end{array} } \right. $$
(15)

Eliminating the variables w and ei yields the following matrix equation.

$$ \left[ {\begin{array}{*{20}l} 0 \hfill & 1 \hfill & 1 \hfill & \cdots \hfill & 1 \hfill \\ 1 \hfill & {K(x_{1} ,x_{1} ) + \frac{1}{C}} \hfill & {K(x_{1} ,x_{2} )} \hfill & \cdots \hfill & {K(x_{1} ,x_{n} )} \hfill \\ 1 \hfill & {K(x_{2} ,x_{1} )} \hfill & {K(x_{2} ,x_{2} ) + \frac{1}{C}} \hfill & \cdots \hfill & {K(x_{2} ,x_{n} )} \hfill \\ \vdots \hfill & \vdots \hfill & {} \hfill & \ddots \hfill & \vdots \hfill \\ 1 \hfill & {K(x_{n} ,x_{1} )} \hfill & {K(x_{n} ,x_{2} )} \hfill & \cdots \hfill & {K(x_{n} ,x_{n} ) + \frac{1}{C}} \hfill \\ \end{array} } \right] $$
(16)
$$ \left[ \begin{aligned} b \hfill \\ \alpha_{1} \hfill \\ \alpha_{2} \hfill \\ \vdots \hfill \\ \alpha_{n} \hfill \\ \end{aligned} \right] = \left[ \begin{aligned} 0 \hfill \\ y_{1} \hfill \\ y_{2} \hfill \\ \vdots \hfill \\ y_{n} \hfill \\ \end{aligned} \right] $$
(17)

where K(x,xi) = <φ(x),φ(xi)> in Eq. (16) stands for the kernel function, which represents the inner product in the high-dimensional feature space. According to functional theory, any function satisfying the Mercer conditions can be used as a kernel function; common choices include linear functions, perceptron functions and radial basis functions. In view of the better performance of radial basis functions, the radial basis function K(xi,xj) = exp(− ‖xi − xj‖2/(2σ2)) is chosen as the kernel of the LS-SVM. Finally, the decision function is determined as shown in Eq. (18):

$$ f(x) = \sum\nolimits_{i = 1}^{n} {\alpha_{i} } K(x,x_{i} ) + b $$
(18)

The selection of the kernel function greatly influences the generalization ability of the system and thus constitutes a core issue of support vector machine theory. Unfortunately, no universally effective way to choose kernel functions exists so far. In practice, the following kernel functions are commonly used:

  1. Polynomial kernel function: K(xi, x) = [(xi·x) + 1]q, where q stands for the polynomial parameter;

  2. Radial basis function: K(xi, x) = exp(− ‖xi − x‖2/σ2), where σ denotes the radial basis function parameter;

  3. Sigmoid kernel function: K(xi, x) = tanh(v(xi·x) + c), where v > 0, c < 0.
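The three kernels listed above can be written down directly; the default parameter values in this sketch are hypothetical.

```python
import numpy as np

def kernel_poly(xi, x, q=3):
    """Polynomial kernel [(xi . x) + 1]**q."""
    return (np.dot(xi, x) + 1.0) ** q

def kernel_rbf(xi, x, sigma=1.0):
    """Radial basis kernel exp(-||xi - x||**2 / sigma**2)."""
    return np.exp(-np.sum((np.asarray(xi) - np.asarray(x)) ** 2) / sigma ** 2)

def kernel_sigmoid(xi, x, v=0.5, c=-1.0):
    """Sigmoid kernel tanh(v (xi . x) + c), with v > 0 and c < 0."""
    return np.tanh(v * np.dot(xi, x) + c)
```

Note that the radial basis kernel always equals one when both arguments coincide, and the sigmoid kernel is bounded in (−1, 1).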

It can be seen that determining the middle-layer coefficients of the LS-SVM is finally transformed into solving a convex optimization problem, which is theoretically optimal. As shown in the LS-SVM structure in Fig. 1, x = (x1, x2,…, xm) denotes the input vector, m the dimension of the input vector and n the number of support vectors; the number of nodes in the middle layer is n, with n ≠ m in general.

Fig. 1
figure 1

The structure of LS-SVM

6.3 Improved particle swarm optimization algorithm

In general, two methods of parameter estimation are commonly used. (1) The linear least squares estimate: its advantage is a simple calculation process, requiring only the solution of large systems of equations; its disadvantage is that the regression equation has large residuals and relatively low prediction accuracy. (2) The quasi-Newton method for unconstrained optimization: during parameter optimization this method easily falls into local extrema, so the obtained model may deviate from reality; it also requires partial derivatives of the optimized function, a requirement that is difficult to satisfy in practical problems. Finding a fast and effective algorithm for parameter estimation is therefore an important research issue. Fang et al. (2010) pointed out that the determination of parameters is a nonlinear combinatorial optimization problem and that, for most objective functions, the optimal parameters constitute ill-posed problems with no analytic solutions. Advanced and flexible optimization methods, such as intelligent optimization algorithms, are therefore required. However, the direct use of particle swarm optimization for parameter estimation suffers from large estimation errors. In this section, some improvements are proposed and the improved algorithm is employed to estimate the parameters.

The basic PSO algorithm searches for the optimal solution through cooperation and information sharing among the individuals of the population: in every iteration, the objective function of each particle is evaluated, the best position pbest of each particle at time t and the best position gbest of the group are determined, and the particle velocities and positions are then updated. As the particles move toward their own historical best positions and gather around the best position in the swarm, the population converges rapidly, which can also cause it to fall into a local optimum and converge prematurely. Moreover, using real financial market data to estimate the parameters of a nonlinear model can readily lead to a local optimum and thus inaccurate outcomes. It is therefore important to improve the computational efficiency of the PSO algorithm.

The basic particle swarm algorithm assumes that each particle flies in an n-dimensional space: Xi = (xi1, xi2,…, xin) is the current position of particle i, Vi = (vi1, vi2,…, vin) is its current velocity, and pi = (pbesti1, pbesti2,…, pbestin) is its individual best position. The particles are dynamically adjusted according to individual and group flight experience, with the velocity vij and position xij update equations given by Eq. (19).

$$ \begin{aligned} & v_{{ij}} \left( {t + 1} \right) = v_{{ij}} \left( t \right) + c_{1} r_{1} \left( {p{\text{best}}_{{ij}} \left( t \right) - x_{{ij}} \left( t \right)} \right) \\ & \quad \quad \quad \quad \quad + c_{2} r_{2} \left( {g{\text{best}}_{j} \left( t \right) - x_{{ij}} \left( t \right)} \right) \\ & x_{{ij}} \left( {t + 1} \right) = x_{{ij}} \left( t \right) + v_{{ij}} \left( {t + 1} \right) \\ \end{aligned} $$
(19)

where c1 and c2 are the learning factors: c1 adjusts the step of the particle toward its own best position, and c2 adjusts its step toward the swarm optimum; r1 and r2 are random numbers in [0,1]; and gbestj(t) denotes the jth component of the best position of the particle swarm at iteration t. In the first formula, the first term is the inertia term and the second is the cognition term, representing learning from the particle's own history. The third, social term is based on group information, and its adjustment reflects the coordination between particles: sociality is essentially the increased probability that an individual performs a behavior when it sees other individuals reinforcing that behavior.
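The update rule of Eq. (19) leads to the following basic PSO sketch for minimizing an objective function. The inertia weight of 0.7 and the learning factors c1 = c2 = 1.5 are common damping choices added here for numerical stability, not values prescribed by the text.

```python
import random

def pso_minimize(f, dim, n_particles=30, iters=200, c1=1.5, c2=1.5,
                 lb=-5.0, ub=5.0, seed=0):
    """Basic PSO of Eq. (19): velocities mix an inertia term, an
    individual cognition term (pbest) and a social term (gbest)."""
    rng = random.Random(seed)
    X = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_val = [f(x) for x in X]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                V[i][j] = (0.7 * V[i][j]                        # inertia (hypothetical weight)
                           + c1 * r1 * (pbest[i][j] - X[i][j])  # individual cognition
                           + c2 * r2 * (gbest[j] - X[i][j]))    # social term
                X[i][j] = min(ub, max(lb, X[i][j] + V[i][j]))   # keep inside [lb, ub]
            val = f(X[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = X[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = X[i][:], val
    return gbest, gbest_val
```

On a simple convex objective such as the sphere function, the swarm contracts quickly onto the global minimum, which also illustrates the premature-convergence risk discussed below.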

To improve the initialization accuracy of the particle swarm and reduce the subsequent search time, Yang and Lee (2012) proposed a multi-basin PSO algorithm that keeps the particle population exploring multiple regions so as to avoid premature convergence to a local optimum. To increase particle diversity, Pehlivanoglu (2013) proposed a multi-frequency vibrational PSO algorithm with parameter mutation. Building on these methods and on the idea of mutation, a multi-region adaptive PSO algorithm with parameter and population adaptive mutation is proposed here. A multi-region search over the initial particles improves the initialization, the adjustment probability is adaptively determined according to the convergence of the algorithm, and the traversal property of chaos is utilized to avoid falling into local optima. The improved particle swarm optimization algorithm comprises the following operations: population initialization with multi-region local search, adaptive mutation of the particle swarm parameters, and adaptive mutation of the population's global region.

6.3.1 Population initialization with multi-region local search

The linear local search method is utilized to find a local minimum from the randomly generated initial parameter set Θi. The iterative formula of the local search is \( \varTheta_{i + 1} = \varTheta_{i} + l_{i} d_{i} \), \( d_{i} = - M(\varTheta_{i} )^{ - 1} \nabla f(\varTheta_{i} ) \), where li denotes the step length, di denotes the descent direction, and M(Θi) represents a non-singular square matrix, an approximation of the Hessian matrix satisfying positive definiteness, so that \( d_{i}^{T} \nabla f(\varTheta_{i} ) = - \nabla f(\varTheta_{i} )^{T} M(\varTheta_{i} )^{ - 1} \nabla f(\varTheta_{i} ) < 0 \). The direction di is updated at each iteration, and the step length li satisfies the Wolfe conditions, that is, \( f(\varTheta_{i} + l_{i} d_{i} ) \le f(\varTheta_{i} ) + z_{1} l_{i} \nabla f(\varTheta_{i} )^{T} d_{i} \) and \( \nabla f(\varTheta_{i} + l_{i} d_{i} )^{T} d_{i} \ge z_{2} \nabla f(\varTheta_{i} )^{T} d_{i} \), where \( z_{1} ,z_{2} \in (0,1) \) with \( z_{1} < z_{2} \).

The individual region converging to a local extremum is defined as \( W(\varTheta^{n} ) := \{ \varTheta \in R^{n} :\lim_{i \to \infty } \varTheta_{i} = \varTheta^{n} \} \). Based on the local search method with global convergence, the whole search space Rn is divided into several regions W(Θn), each of which contains the optimal value to which the initial particles in that region converge; initial particles in the same region converge to the same objective function value. The parameter set \( S_{1} = (\varTheta_{1}^{1} ,\varTheta_{2}^{1} , \ldots ,\varTheta_{n}^{1} ) \) initialized after the local search is therefore superior to the original parameter set \( S_{0} = (\varTheta_{1}^{0} ,\varTheta_{2}^{0} , \ldots ,\varTheta_{n}^{0} ) \).
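The local search above can be sketched as a simple gradient-based routine. As an assumption for illustration, M(Θi) is taken as the identity matrix (steepest descent) and the step length is found by backtracking until the first (sufficient-decrease) Wolfe condition holds:

```python
def local_search(f, grad, theta0, step0=1.0, z1=1e-4, tol=1e-8, max_iter=200):
    """Gradient-based local search with d_i = -grad(theta_i), i.e. M = I.
    The step l_i is halved until the sufficient-decrease condition
    f(x + l*d) <= f(x) + z1 * l * grad(x)^T d  is satisfied."""
    theta = list(theta0)
    for _ in range(max_iter):
        g = grad(theta)
        if sum(gi * gi for gi in g) < tol:   # gradient ~ 0: local minimum
            break
        d = [-gi for gi in g]
        gTd = sum(gi * di for gi, di in zip(g, d))
        l, fx = step0, f(theta)
        while l > 1e-12 and f([t + l * di for t, di in zip(theta, d)]) > fx + z1 * l * gTd:
            l *= 0.5                          # backtracking line search
        theta = [t + l * di for t, di in zip(theta, d)]
    return theta
```

On a smooth objective this converges to the local minimizer of the region containing the starting point, which is exactly the role the initialization step assigns to it.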

6.3.2 Parameter adaptive mutation

As the particles follow the optimal particles, the swarm converges quickly and its diversity decreases. In order to increase the diversity of particles and to prevent the swarm from falling into a local optimum, adaptive crossover and mutation operators are introduced according to the distance between each particle and the optimal particle, \( L = \sqrt {\sum\nolimits_{i = 1}^{D} {(x_{1i} - x_{2i} )^{2} } } \), adaptively determining the adjustment probability as the algorithm converges. The particles in the population are taken out sequentially to determine whether the spatial distance between the extracted particle and the backed-up multi-region initialized optimal particle is less than the threshold \( \Delta \tau = (1 - t/T)^{p} (ub - lb) \), where ub and lb are the upper and lower limits of the problem and p represents an adjustment parameter. If the distance is less than the threshold, the crossover operation is performed to enhance the search of the region between the two particles.

In the early iterations the particle population is diverse, so a larger threshold value is suitable for population adjustment. As the iterations proceed, the population tends to aggregate and a smaller threshold needs to be set to allow more particles to cross and mutate. The crossover formula can be expressed as bx1 = x1e + x2(1 − e), bx2 = x1(1 − e) + x2e, where bx1, bx2 denote the new particles generated by crossover and e stands for a random number sequence in (0,1). After the crossover operation, the new particle's fitness value is calculated; if the fitness value improves, the old particle is replaced by the new one. If, on the contrary, the fitness value worsens, the mutation operation is introduced to enhance the fine search around the particle. The mutation operation can be expressed as \( cx_{1} = x + (1 - t/T)^{q} (ub - x) \), \( cx_{2} = x - (1 - t/T)^{q} (x - lb) \), where cx1, cx2 are the new particles generated by mutation and q denotes the variation weight. After the crossover and mutation operations, the global optimal particle is perturbed by a small distance within the upper and lower limits to obtain a new set of particles; the fitness values of these particles are then recalculated, and particles with better fitness values replace the optimal particles.
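The crossover and mutation formulas above translate directly into code. The sketch below is a minimal pure-Python rendering in which the threshold logic is omitted and q is an assumed default weight:

```python
import random

def crossover(x1, x2, rng=random):
    """bx1 = e*x1 + (1-e)*x2, bx2 = (1-e)*x1 + e*x2, e ~ U(0,1) per dimension.
    The per-dimension sum of the two children equals that of the parents."""
    bx1, bx2 = [], []
    for a, b in zip(x1, x2):
        e = rng.random()
        bx1.append(e * a + (1 - e) * b)
        bx2.append((1 - e) * a + e * b)
    return bx1, bx2

def mutate(x, t, T, lb, ub, q=2.0):
    """cx1 = x + (1 - t/T)^q (ub - x), cx2 = x - (1 - t/T)^q (x - lb);
    the perturbation shrinks to zero as iteration t approaches the limit T."""
    s = (1 - t / T) ** q
    cx1 = [xi + s * (ub - xi) for xi in x]
    cx2 = [xi - s * (xi - lb) for xi in x]
    return cx1, cx2
```

Early on (t small) the mutation spans the full search interval [lb, ub], while near the iteration limit it degenerates into a fine local perturbation, matching the coarse-to-fine behavior described above.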

6.3.3 Population global adaptive variation

For the sake of avoiding the algorithm getting stuck in a local optimum, chaotic variables are introduced; their ergodic properties lead the particle swarm toward the optimal solution. According to the particle convergence trend and the differences of fitness values in the population, a mutation probability is set for each particle. Define the degree of convergence \( O_{t} = 1/[1 + u\sqrt {(1/n)(\sum\nolimits_{i = 1}^{n} {(fit_{i}^{t} - fit_{mean}^{t} )}^{2} )} ] \), where u is the adjustment factor, \( fit_{i}^{t} \) denotes the fitness value of the ith particle at the tth iteration, and \( fit_{mean}^{t} \) represents the average fitness value of all the particles after the tth iteration. As the population aggregates, the differences of fitness values become smaller and the convergence degree Ot becomes larger. Set the mutation probability \( P_{mu} = \sin ((\pi /2) \cdot (f_{gbest}^{t} /f_{i}^{t} ) \cdot O_{t} ) \). When rand < Pmu is met, the particle is mutated so that it can jump out of a local extremum. Chaos mutation is applied to the particles that meet the mutation condition, with the chaos variation formula expressed as \( x_{ijm} = x_{ij\hbox{min} }^{t} + \rho_{ij}^{t + 1} (x_{ij\hbox{max} }^{t} - x_{ij\hbox{min} }^{t} ) \), \( x_{ij}^{new} = 0.5(x_{ij} - x_{ijm} ) \), where \( \rho_{ij}^{t + 1} \) is the chaotic variable traversing (0,1), xijm denotes the value of the ith particle in the jth dimension after the variation, \( x_{ij\hbox{min} }^{t} \) and \( x_{ij\hbox{max} }^{t} \), respectively, represent the minimum and maximum values experienced up to the tth iteration and \( x_{ij}^{new} \) is the new particle generated after the mutation.
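A compact sketch of the convergence degree and the chaotic mutation follows. The logistic map is used here as the chaotic generator, which is an assumption for illustration (the text only requires a chaotic variable traversing (0,1)), and the new-particle formula follows the text as written:

```python
import math

def convergence_degree(fits, u=1.0):
    """O_t = 1 / (1 + u * std(fitness)); approaches 1 as the swarm clusters."""
    n = len(fits)
    mean = sum(fits) / n
    std = math.sqrt(sum((f - mean) ** 2 for f in fits) / n)
    return 1.0 / (1.0 + u * std)

def logistic_chaos(rho, mu=4.0):
    """One step of the logistic map, a standard chaotic generator on (0,1)."""
    return mu * rho * (1 - rho)

def chaos_mutate(x, x_min, x_max, rho):
    """x_m = x_min + rho*(x_max - x_min) per dimension (chaotic resampling of
    the range visited so far), then x_new = 0.5*(x - x_m) as in the text."""
    x_m = []
    for lo, hi in zip(x_min, x_max):
        rho = logistic_chaos(rho)
        x_m.append(lo + rho * (hi - lo))
    x_new = [0.5 * (xi - xmi) for xi, xmi in zip(x, x_m)]
    return x_new, x_m
```

Because the logistic map with mu = 4 is ergodic on (0,1), repeated mutations eventually sample the whole experienced range, which is what lets trapped particles escape a local extremum.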

6.3.4 Multi-region adaptive PSO algorithm flow design

  • Step 1 Initialize the population randomly, apply the local-search iteration Θi+1 on the multi-region parameter space set Θn, and adjust the particle population size and velocity Vi to obtain the particles backed up by the multi-region local search.

  • Step 2 Update particle velocity vij and position xij according to formula (18) and record the global optimal particle gbest and the historical optimal particle pbest.

  • Step 3 Perform the crossover-mutation operation on the particles in the population according to the conditions.

  • Step 3.1 Determine whether the distance L between the particle taken out in sequence and the global optimal particle meets the threshold condition ∆τ. If not, move to the next particle and repeat step 3.1; otherwise go to the next step.

  • Step 3.2 Perform the crossover operation on the particles satisfying the threshold condition ∆τ to generate the new particle bxi. If the fitness value improves, replace the particle, move on to the next particle and repeat from step 3.1; otherwise go to step 3.3.

  • Step 3.3 Perform the mutation operation to generate new particles cxi, calculate the new particle fitness values and compare them with those of the original particles, replace the particles with poorer fitness values, then turn to the next particle and repeat from step 3.1; update the global optimal particle gbest and the historical optimal particle pbest.

  • Step 4 Calculate the mutation probability Pmu according to the convergence degree Ot to determine whether each particle satisfies the chaotic mutation condition. If a particle satisfies the condition, apply the chaotic mutation xijm for adaptive mutation and calculate the particle fitness values. Then update the global optimal particle gbest and the historical optimal particle pbest.

  • Step 5 Determine whether the algorithm meets the termination criteria. If satisfied, terminate the algorithm and output gbest and pbest; if not, set t = t + 1 and go to step 2.
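The steps above can be condensed into a runnable sketch. For brevity this version initializes randomly instead of with the multi-region local search, and omits the crossover, mutation and chaos stages, so it reduces to the core loop of Steps 1, 2 and 5 under assumed default coefficients:

```python
import random

def ipso_sketch(f, n, dim, lb, ub, T=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Core PSO loop: random init, velocity/position updates, pbest/gbest
    bookkeeping, iteration-count termination. Minimization of f."""
    rng = random.Random(seed)
    X = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n)]
    V = [[0.0] * dim for _ in range(n)]
    pbest = [row[:] for row in X]
    pfit = [f(x) for x in X]
    g = min(range(n), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]
    for t in range(T):
        for i in range(n):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                V[i][j] = (w * V[i][j] + c1 * r1 * (pbest[i][j] - X[i][j])
                           + c2 * r2 * (gbest[j] - X[i][j]))
                X[i][j] = min(ub, max(lb, X[i][j] + V[i][j]))  # clamp to bounds
            fit = f(X[i])
            if fit < pfit[i]:            # update personal best
                pfit[i], pbest[i] = fit, X[i][:]
                if fit < gfit:           # update global best
                    gfit, gbest = fit, X[i][:]
    return gbest, gfit
```

On a two-dimensional sphere function, this loop drives the best fitness close to zero within a few hundred iterations; the full algorithm adds the multi-region initialization and the adaptive operators of Sects. 6.3.1-6.3.3 on top of this skeleton.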

7 Empirical results

7.1 Data description

In this section, the S&P 500 index (SP500) and the Chinese Shanghai-Shenzhen 300 composite index (HS300) are investigated. The study dataset comprises the actual observations from January 4, 2010 to December 29, 2017, of which the first 80% are used for training and the last 20% are reserved for forecast evaluation. The data are collected from the Wind database. The basic statistical features of the stock index returns, including the volatility, skewness and kurtosis, as well as the heteroscedasticity and normality tests, are computed for both series. Figure 2 displays the SP500 and HS300 stock indices and the corresponding returns. Both the Ljung-Box Q2(10) statistics and the ARCH effects test on the squared returns indicate that both return series exhibit strong heteroscedasticity. Additionally, the Jarque-Bera tests of both series reject the normal distribution, and the kurtosis values suggest that both return series are more peaked than the Gaussian distribution.
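For reference, the skewness, kurtosis and Jarque-Bera statistic reported for the return series can be computed as below (a plain-Python sketch; the paper's values come from the full SP500/HS300 samples, which are not reproduced here):

```python
import math

def return_stats(r):
    """Sample skewness S, kurtosis K and the Jarque-Bera statistic
    JB = (n/6) * (S^2 + (K - 3)^2 / 4), asymptotically chi2(2) under normality."""
    n = len(r)
    mean = sum(r) / n
    m2 = sum((x - mean) ** 2 for x in r) / n   # central moments
    m3 = sum((x - mean) ** 3 for x in r) / n
    m4 = sum((x - mean) ** 4 for x in r) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return skew, kurt, jb
```

A kurtosis above 3 (the Gaussian value) indicates the leptokurtosis the tests detect, and a large JB statistic rejects normality.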

Fig. 2
figure 2

The stock price index and returns of HS300 and SP500

There exist certain explanatory factors that significantly influence asset returns in financial markets. Primarily, eight technical indicators are extracted from the historical volatility series. Technical indicator analysis is based on statistical principles: it applies arithmetic operations and statistical calculations to large amounts of historical data to build a system of mathematical indicator formulas. Combining historical data with technical indicators makes the stock forecasting results more effective. Table 1 provides descriptions of the technical analysis indicators.

Table 1 Description of technical indicator

7.2 Performance indicator

In order to compare the prediction performances of the different approaches, the following indicators, namely the mean absolute error (MAE), the mean squared error (MSE) and the root mean squared error (RMSE), are chosen to evaluate the error accuracy.

$$ {\text{MAE}} = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left| {y_{a,i} - y_{f,i} } \right|} $$
(20)
$$ {\text{MSE}} = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {(y_{a,i} - y_{f,i} )^{2} } $$
(21)
$$ {\text{RMSE}} = \left( {\frac{1}{n}\sum\nolimits_{i = 1}^{n} {(y_{a,i} - y_{f,i} )^{2} } } \right)^{1/2} $$
(22)

where ya,i denotes the ith actual value, yf,i denotes the corresponding model forecast value, and n represents the number of observations.
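Formulas (20)-(22) correspond to the following straightforward computation:

```python
import math

def forecast_errors(actual, forecast):
    """MAE, MSE and RMSE of a forecast series against the actual series."""
    n = len(actual)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / n
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n
    rmse = math.sqrt(mse)
    return mae, mse, rmse
```

RMSE is simply the square root of MSE, so the two always rank models identically; MAE can rank differently because it penalizes large errors less heavily.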

7.3 Experimental results and discussion

In the GARCH-nG method, the fit of the GARCH-nG, EGARCH-nG and GJR-nG models is evaluated first. According to the AIC and BIC criteria, the best lag order for GARCH-nG is determined over (p,q) combinations ranging from (1,1) to (10,10), and the model specification best fitting the distributional characteristics of the data is selected on the basis of these criteria. We report the corresponding parameter estimates of the asymmetric GARCH-non-Gaussian models for the SP500 and HS300 indices in Tables 2 and 3. Furthermore, the volatility forecasting results using the EGARCH parametric method with the Student-t, GED and AST distributions are presented in Fig. 3, taking the SP500 and HS300 index volatilities as examples.

Table 2 Parameter estimation of EGARCH-non-Gaussian models of S&P500
Table 3 Parameter estimation of GJR-GARCH-non-Gaussian models of HS300
Fig. 3
figure 3

The volatility forecasting using EGARCH with different innovation distributions

In the second hybrid method, the volatility estimates obtained from the GARCH-nG models, complemented with technical indicators, are fed to the neural network. The GARCH-nG models are hybridized with the ANN method, and the realized volatility is taken as the output target of the network. That is, the hybrid method takes the simulated volatility together with the specified explanatory variables as inputs for training the network. This hybrid method retains the fine properties of the GARCH-nG models while enhancing them with the ANN technique. During the ANN stage, the sample dataset is divided into a training period and a prediction period: the training data are used to determine the model specification, and the forecast data are reserved for evaluation. Repeated training is carried out to determine an appropriate number of hidden-layer neurons. When the error between the desired and actual outputs is less than the specified value, or the termination criterion is satisfied, network training is completed and the weights and biases are saved. The back-propagation network is trained with the Levenberg-Marquardt method, and the input variables are normalized to values between − 1 and 1.

In the third method, the proposed IPSO algorithm is applied to optimize the parameters of the LS-SVM. The evolutions of the fitness function over generations of the IPSO algorithm and the PSO algorithm are compared in Fig. 4. The plot clearly shows that the IPSO algorithm converges faster and reaches lower fitness values than the PSO algorithm. Since the method uses the technical indicators affecting stock index returns volatility, it is expected to better capture the impacts of market variables.
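To make the third method concrete, the sketch below shows the LS-SVM regression core whose hyperparameters (gamma, sigma) the IPSO would tune: fitting solves the standard LS-SVM linear system, and an IPSO fitness function would simply return the validation RMSE of the fitted model for a candidate (gamma, sigma) pair. This is a minimal pure-Python illustration, not the implementation used in the paper:

```python
import math

def rbf(x, z, sigma):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2*sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lssvm_fit(X, y, gamma, sigma):
    """Solve the LS-SVM system [0 1^T; 1 K + I/gamma][b; a] = [0; y] and
    return the predictor f(x) = b + sum_i a_i * k(x, x_i)."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j], sigma) + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve(A, [0.0] + list(y))
    b, alpha = sol[0], sol[1:]
    return lambda x: b + sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X))
```

In the experiments, each candidate (gamma, sigma) would be scored by fitting on the training window and evaluating the forecast error on a validation window, with the IPSO of Sect. 6.3 minimizing that score.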

Fig. 4
figure 4

The fitness functions of IPSO algorithm (left panel) and PSO algorithm (right panel)

Tables 4 and 5 report the evaluation accuracy results according to the measurement metrics; the lower the indicators, the higher the prediction accuracy of the model. The results show that the AGARCH process models with the asymmetric Student-t innovation distribution display slightly lower prediction errors than those with the normal distribution for both stock indices. This finding suggests that the S&P 500 and HS300 indices exhibit fat-tailed properties, which requires non-Gaussian distributional assumptions when modeling the volatility dynamics. Similarly, the asymmetric GARCH-type models provide lower measurement errors when modeling stock index volatility, which suggests that it is appropriate to consider asymmetric effects in volatility modeling. These results provide evidence of the usefulness of non-Gaussian distributions in fitting financial asset returns.

Table 4 The stock index volatility forecasting results of S&P 500
Table 5 The stock index volatility forecasting results of HS300

In addition, the hybrid AGARCH-ANN models, which utilize the outputs of the AGARCH family models as inputs, perform better than the AGARCH models on the whole. The hybrid AGARCH-ANN model proves robust across different specifications and outperforms the GARCH-nG models in predicting stock index volatility. These results support the findings of previous works that combining asymmetric GARCH models with neural networks improves the prediction ability of the individual GARCH process models in forecasting stock index volatility.

However, the hybrid approach is not always as efficient as the modified machine learning methods. As indicated by the performance measurements, the LS-SVM-IPSO technique outperforms the other methods, providing the lowest errors in forecasting the volatility of the S&P 500 and HS300 indices. Moreover, in contrast to the parametric method, the least squares method does not need to specify the type of the innovation distribution, offering flexible implementation.

The relative prediction errors of the LS-SVM-IPSO model on both the training set and the test set are smaller than those of the hybrid ANN models, showing that the LS-SVM-IPSO model has advantages over them. Furthermore, Figs. 5 and 6 plot the predicted values of the proposed method against the actual observations. The proposed LS-SVM-IPSO approach proves a powerful and efficient tool with very small deviations in both the training and testing samples, indicating that the model possesses fine approximation capability and generalization ability.

Fig. 5
figure 5

The volatility forecasting and estimation errors of S&P500 index

Fig. 6
figure 6

The volatility forecasting and estimation errors of HS300 index

A similar conclusion can be inferred from the scatter graphs of goodness of fit for the volatility forecasts of the SP500 and HS300 stock indices in Figs. 7 and 8. The scatter plots indicate that the proposed LS-SVM-IPSO approach can effectively forecast the volatility tendency of stock indices.

Fig. 7
figure 7

The scatter plot of goodness of fitting for SP500

Fig. 8
figure 8

The scatter plot of goodness of fitting for HS300

8 Conclusion

Accurately forecasting stock returns volatility is challenging owing to the highly complex nonlinear nature of returns. In this study, we apply the LS-SVM technique optimized by the IPSO algorithm, which can map nonlinear functions without prior hypotheses, to forecast stock returns volatility. Furthermore, the individual GARCH family models with non-Gaussian distributions, and those hybridized with the ANN method, are also compared in the empirical experiments.

In terms of the loss functions, the forecast performances obtained by the three different methods have been compared. The asymmetric GARCH volatility model with the generalized asymmetric Student-t distribution combined with the ANN technique exhibits higher accuracy than the individual parametric methods. The empirical studies find that the SVM technique without PSO optimization has modeling accuracy and generalization ability similar to the ANN-based hybrid models. However, the LS-SVM-IPSO technique exhibits the most promising prediction performance, showing the lowest forecasting errors. Compared with the neural network methods, the optimized least squares support vector machine displays noticeably higher modeling accuracy, a closer degree of approximation and better generalization ability, demonstrating an efficient and superior forecasting approach on the whole.