1 Introduction

Several techniques have been developed to forecast time series using stock value data. There also exist numerous forecasting applications, as analyzed in [1]: signal statistics pre-processing and communications, industrial control processing, econometrics, meteorology, physics, biology, medicine, oceanography, seismology, astronomy, and psychology.

A possible solution to this problem was described by Box et al. [2], who developed a time-series forecasting analysis technique based on linear systems. Basically, the procedure consisted of suppressing the non-seasonality of the series by parameter analysis, which measures time-series data correlation, and models the selection which best fits the data obtained (a specific-order ARIMA model). But, in real systems, non-linear and stochastic phenomena occur, and so, the series’ dynamics cannot be described exactly by classical models. Artificial neural networks (ANNs) have improved forecasting results, detecting the non-linear nature of the data. ANNs based on radial basis functions (RBFs) allow a better forecasting adjustment; they implement local approximations to non-linear functions, minimizing the mean square error to achieve the adjustment of neural parameters. Platt’s algorithm [3], resource allocating network (RAN), consisted of controlling the size of the neural network, thus, reducing the computational cost associated with the calculus of the optimum weights in perceptron networks.

Matrix decomposition techniques have been used as an improvement of Platt’s model [4]; these take the most relevant data in the input space, to avoid processing non-relevant information. The model incorporating these techniques, “neural model with automatic parameter adjustment for prediction” (NAPA-PRED), also includes neural pruning [5]. An improved version, INAPA-PRED, based on support vector machine (SVM) philosophy, is presented in Sect 6.2 in the discussion of endogenous time-series forecasting.

The next step was to include exogenous information in these models. Principal component analysis (PCA) is a well-established tool in finance. It has been shown that prediction results can be improved by using this technique [4]. Although both methods linearly transform the observed signal into components, the difference is that, in PCA, the goal is to obtain principal components which are uncorrelated (features), giving projections of the data in the direction of the maximum variance [6]. PCA algorithms use only second-order statistical information. On the other hand, in [7], we can discover interesting structures in finance using the new signal-processing tool, independent component analysis (ICA), which finds statistically independent components by using higher-order statistical information to separate the signals [8, 9]. This new technique may use entropy [10], contrast functions based on information theory [11], mutual information [12], or geometric considerations in data distribution spaces [13-17], etc. Forecasting and analyzing financial time series using ICA can contribute to a better understanding and more accurate prediction of financial markets [18, 19]. Nevertheless, in this paper, we seek to exclude pre-processing techniques which may contaminate raw data.

2 Forecasting model (cross prediction model)

The cross prediction model (CPM) is shown in Eq. 1. We consider a data set consisting of correlated signals from a stock exchange and seek to build a forecasting function, P, for one of the set of signals, {series1,...,series S }, which allows us to include exogenous data from the other series. If we consider just one series [4], the individual forecasting function can be expressed in terms of RBFs as [20]:

$$ {\mathbf{F}}{\left( {\mathbf{x}} \right)} = {\sum\limits_{i = 1}^N {f_{i} {\left( {\mathbf{x}} \right)}} } = {\sum\limits_{i = 1}^N {h_{i} \exp {\left\{ { - \frac{{{\left\| {{\mathbf{x}} - c_{i} } \right\|}}} {{r^{2}_{i} }}} \right\}}} } $$
(1)

where x is a p-dimensional input vector at time t, N is the number of neurons (RBFs), f i is the output for each i-th neuron, c i is the center of the i-th neuron which controls the local space situation of this cell, and r i is the radius of the i-th neuron. The global output is a linear combination of the individual outputs for each neuron with weight h i . Thus, we are using a method for moving beyond linearity, where the core idea is to augment/replace the input vector, x, with additional variables, which are transformations of x, and then use linear models in this new space of derived input features. RBFs are one of the most popular kernel methods for regression over the domain \(\mathbb{R}^{n} \) and consists of fitting a different, but simple model, at each query point, c i , using those observations close to this target point in order to obtain a smoothed function. This localization is achieved via a weighting function or kernel f i .

We apply/extend this regularization concept to extra series, including a row of neurons (Eq. 1), for each series and weighting these values with a factor, b ij (the neural resources of ANNs are updated using an on-line SVM algorithm, INAPA-PRED, see Sect. 6.2). Finally, the global smoothed function for the stock j is defined as:

$$ {\mathbf{P}}_{j} {\left( {\mathbf{x}} \right)} = {\sum\limits_{i = 1}^S {b_{{ij}} F_{i} {\left( {{\mathbf{x}},j} \right)}} } $$
(2)

where F i is the smoothed function of each series, S is the number of input series, and b ij are the weights for j-stock forecasting. Obviously, one of these weight factors must be relevant in this linear fit (b jj ~1, or auto-weight factor).

We can use matrix notation to include the set of forecasts in an S-dimensional vector, P (B in Fig. 1):

$${\mathbf{P}}{\left( {\mathbf{x}} \right)} = {\text{diag}}{\left( {{\mathbf{B}} \cdot {\mathbf{F}}{\left( {\mathbf{x}} \right)}} \right)}$$
(3)

where F=(F 1,...,F S ) is an S×S matrix with \(F_{I} \in \mathbb{R}^{S} \) and B is an S×S weight matrix. The operator diag extracts the main diagonal.

Fig. 1
figure 1

Schematic representation of CPM with adaptive radii, centers, and input space ANNs (RAN + NAPAPRED + Reg [18]). This improvement consists of neural parameter adaptation when input space increases, i.e., RBF centers and radii are statistically updated when dynamic series change takes place

To test this model, we can choose a set of values for the weight factors as functions of the correlation factors between the series, and so, Eq. 2 can be expressed as:

$$ {\mathbf{P}}{\left( {\mathbf{x}} \right)} = {\left( {1 - {\sum\limits_{i \ne j}^S {\rho _{i} } }} \right)}F_{j} + {\sum\limits_{i \ne j}^S {\rho _{i} F_{i} } } $$
(4)

where P is the forecasting function for the desired stock, j, and ρ I is the correlation factor with the exogenous series, i.

We can include Eq. 4 in the generalized additive models for regression proposed in supervised learning [21]:

$${\mathbf{E}}{\left\{ {Y|X_{1} , \ldots ,X_{n} } \right\}} = \alpha + f_{1} {\left( {X_{1} } \right)} + \ldots + f_{n} {\left( {X_{n} } \right)} $$
(5)

where the X i s usually represent predictors and Y represents the system output; f j s are non-specific smooth (“non-parametric”) functions. Thus, we can fit this model by minimizing the mean square error function, or by the other methods presented in [21].

3 Forecasting model and genetic algorithms

CPM uses a genetic algorithm for b I parameter fitting (see Sect. 6.1 for further discussion). A canonical GA is constructed by operations of parameter encoding, population initialization, crossover, mutation, mate selection, population replacement, etc. Our encoding parametric system consists of the codification into genes and chromosomes or individuals as a string of binary digits using the complement representation, although other encoding methods are also possible, i.e., [22-25], where the value of each parameter is a gene and an individual is encoded by a string of real numbers instead of binary ones. In the initial population generation step, we assume that the parameters lie in a bounded region [0,1] (at the edge of this region, we can reconstruct the model without exogenous data) and N individuals are generated randomly. After the initial population, N, is generated, the fitness of each chromosome I i is determined by the function:

$$ {\aleph }{\left( {I_{i} } \right)} = \frac{1} {{{\text{e}}{\left( {I_{i} } \right)}}} $$
(6)

To overcome the problem of convergence in the optimal solution, we add a positive constant to the denominator. Another important question in the canonical GA is the definition of the selection operator. New generations for mating are chosen with their fitness function values determined by roulette wheel selection. Once the new individuals have been selected, we apply crossover (P c ) to generate two offspring. In the next step, the mutation operator (P m ) is applied to these offspring to prevent premature convergence. In order to improve the convergence speed of the algorithm, we included mechanisms such as elitist strategy, in which the best individual in the current generation always survives into the next.

The GA used in the forecasting function (Eq. 2) has an absolute error value start criterion. Once it has started, it uses the values (or individual) found to be optimal (elite) the last time and applies a local search around this elite individual. Thus, an efficient search is performed around an individual (set of b i s) in which one parameter is more relevant than the others.

The computational time depends on the encoding length and the number of individuals and genes. Because of the probabilistic nature of the GA-based method, the proposed method almost converges to a global optimal solution on average. Our simulation did not produce any non-convergent cases. Figure 2 shows the iterative procedure implemented for the global prediction system including GA.

Fig. 2
figure 2

Pseudo-code of CPM+GA

4 Simulations

With the aim of assessing the performance of the CP model, we used indexes of Spanish banks and companies for a given period, with particular attention to the IBEX35 index, which we consider is the most representative sample of Spanish stock movements.

We considered the simplest case, which consists of two time series corresponding to the companies’ ACS (series1) and BBVA (series2). The former is the target of the forecasting process and the latter is introduced as external information. The period studied was July-October 2000. Each time series includes 200 points corresponding to selling days (quoting days).

The horizon of the forecasting process (hor) was set at 8; the weight function of the forecasting function was a correlation function between the two time series for series2 (in particular, we chose its square) and the difference to one for series1. We established a 10-day forecasting window (W), and the maximum lag number was set at 2W, in order to achieve a 10×20 Toeplitz matrix. The first time point of the algorithm was set at 50. Figure 3 shows the forecasting results from lag 50 to 200, corresponding to series1.

Fig. 3a-h
figure 3

Simulation results. a top: real series ACS; bottom: real series BBVA. b Real series and predicted ACS series with CPM. c Absolute value of the error with CPM. d Real series and predicted ACS series with CPM+GA. e Real series and predicted ACS series without exogenous data. f Absolute value of the error with CPM+GA. g NRMSE evolution for CPM (dots) and CPM+GA (line). h NRMSE evolution for selected stock indexes

Note the instability of the system in the very first iterations, until it reaches an acceptable convergence. The most interesting feature of the result is shown in Table 1. From this table, it is easy to deduce that, if we move one of the two series horizontally, the correlation decreases dramatically. This is how we avoid the delay problem encountered with certain networks, where the information introduced into the system is non-relevant. This problem is due to the increased level of information, together with the fact that we have enclosed only one additional time series (series2), despite the increased neuron resources. At the end of the process, we used 20 neurons for net 1 and 21 for net 2. Although the forecasting is acceptable, we expect a better performance with more data point series.

Table 1 Correlation coefficients between real and predicted signals with different lags

The next step consisted of using the general algorithm including the GA. A four-individual (N ind) population of dimension 2×1 was used; this is sufficient because the searching space was bounded. The GA was run four times before convergence was reached. The individuals were encoded with 34 bits (17 bits for each parameter). In this case, convergence is defined in terms of the adjustment function; other authors use other GA parameters, such as the absence of change in the individuals after a certain number of generations. We observed a considerable improvement in the forecasting results and evidence of the disappearance of the delay problem, as shown in Fig. 3.

The number of neurons at the end of the process is the same as in the former case, since we have only modified the weight of each series during the forecasting process. The dynamics and values for the weights are shown in Table 2.

Table 2 Dynamics and values of the weights for the GA

Error behavior is shown in Table 2. Note that:

  • We can bound the error by appropriate selection of the b 1 parameters, when the dynamics of the series is coherent (avoiding large fluctuations in stock values)

  • The algorithm converges faster, as shown at the very beginning of the graph

  • The forecasting results are better using GA, as shown in Fig. 3, which describes the evolution of the normalized mean square error

Finally, we carried out a simulation with nine indexes and computed the prediction function for five series, obtaining similar results, as presented in Table 2. In the complete model, we limited the input space dimension to three for the extra series and to five for the target series. The NRMSE depends on each series (data set) and target series (evolution). In this table, we also compare the ICA method versus CPM for a limited data set (70 iterations). The NRMSE of the indexes increases at the beginning of the process using the ICA method and converges to CPM NRMSE values when the data set is enlarged (estimators approach higher-order statistics). This effect is also observed when the dynamics of the series change suddenly, due to a new, independent event.

Due to the symmetric character of our forecasting model, it is possible to implement parallel programming languages (like PVM) to build a more general forecasting model for a set of series. We would launch the same number for “child” processes and banks; these would run forecasting vectors, which would be weighted by a square matrix with dimension equal to the number of series, B. The “parent” process would have the results of the forecasting process for the calculation of the error vector, in order to update the neuron resources. Therefore, we would utilize the computational cost of a forecasting function to calculate the rest of the series.

5 Conclusions

In addition to the above ideas, we conclude that our new forecasting model for time series is characterized by:

  • The enclosing of external information. We avoid pre-processing and data contamination by applying ICA and PCA. Series are enclosed in the net directly.

  • Forecasting results are improved by means of hybrid techniques using well known techniques like GA.

  • The possibility of implementation in parallel programming languages (e.g., “parallel virtual machine” (PVM)), and the better performance and lower computational time achieved by using a neuronal matrix architecture.

6 Appendix: theoretical background

The purpose of this section is twofold: on the one hand, we discuss in detail the GA implemented for parameter fitting. We model the GA as an inhomogeneous Markov chain [26] exhibiting the features and properties of our genetic operators. On the other hand, we discuss further the endogenous model used as a “kernel” for time-series forecasting.

6.1 Description of GA

As noted in Sect. 3, CPM uses a GA for b I parameter fitting. A GA can be modeled by means of a time inhomogeneous Markov chain [26], obtaining interesting properties related with weak and strong ergodicity, convergence, and the distribution probability of the process [27]. In the latter reference, a canonical GA is constituted operations of parameter encoding, population initialization, crossover, mutation, mate selection, population replacement, fitness scaling, etc., proving that, with these simple operators, a GA does not converge to a population containing only optimal members. However, some GAs converge to the optimum, e.g., the elitist GA [28] and those which introduce reduction operators [29].

We have borrowed the notation mainly from [30], where the model for GAs is an inhomogeneous Markov chain model of probability distributions (S) over the set of all possible populations of a fixed finite size. Let C be the set of all possible creatures in a given world (vectors of dimension equal to the number of extra series) and a function f:CR +. The task of the GAs is to find an element, cC, for which f(c) is maximal. We encode creatures into genes and chromosomes or individuals as strings of length ℓ of binary digits (size of alphabet A is a=2) using one-complement representation; other encoding methods are also possible [22-24] or [25], where the value of each parameter is a gene and an individual is encoded by a string of real numbers instead of binary ones.

In the initial population generation step (choosing randomly p ∈℘ N , where ℘ N is the set of populations, i.e., the set of N-tuples of creatures containing a LN·ℓ elements), we assume that creatures lie in a bounded region [0,1] (at the edge of this region, we can reconstruct the model without exogenous data). After the initial population, p, has been generated, the fitness of each chromosome, c i , is determined via the function:

$$ f{\left( {c_{i} } \right)} = \frac{1} {{e{\left( {c_{i} } \right)}}} $$
(7)

where e is an error function (i.e., the square error sum in a set of neural outputs, solving the convergence problem in the optimal solution by adding a positive constant to the denominator).

The next step in the canonical GA is to define the selection operator. New generations for mating are selected, depending on their fitness function values, using roulette wheel selection. Let p=(c 1,...,c N ) ∈℘ N , n\(\mathcal{N} \), and f be the fitness function acting in each component of p. Scaled fitness selection of p is a lottery for every position 1≤IN in population p, such that creature c j is selected with probability:

$$\frac{{f_{n} {\left( {p,j} \right)}}} {{{\sum\nolimits_{i = 1}^N {f_{n} {\left( {p,i} \right)}} }}} $$
(8)

thus, proportional fitness selection can be described by column stochastic matrices F n , n\(\mathcal{N} \), with components:

$$ \langle q,{\mathbf{F}}_{n} p\rangle = {\prod\limits_{i = 1}^N {\frac{{n{\left( {q_{i} } \right)}f_{n} {\left( {p,q_{i} } \right)}}} {{{\sum\nolimits_{j = 1}^N {f_{n} {\left( {p,j} \right)}} }}}} } $$
(9)

where p,q ∈℘ N so p i ,q i C, 〈⋯〉 denotes the standard inner product, and n(d i ) is the number of occurrences of q i in p.

Once the two individuals have been selected, an elementary crossover operator C(K,P c ) is applied (setting the crossover rate at a value, i.e., P c → 0, which implies children similar to parent individuals), which is given (assuming even N) by:

$${\mathbf{C}}{\left( {K,P_{c} } \right)} = {\prod\limits_{i = 1}^{N \mathord{\left/ {\vphantom {N 2}} \right. \kern-\nulldelimiterspace} 2} {{\left( {{\left( {1 - P_{c} } \right)}\mathcal{I} + P_{c} {\mathbf{C}}{\left( {2i - 1,2i,k_{I} } \right)}} \right)}} } $$
(10)

where C(2i − 1,2i,k I ) denotes the elementary crossover operation of c I , c j creatures at position 1≤k≤ℓ and \(\mathcal{I} \)is the identity matrix, to generate two offspring (see [27] for further properties of the crossover operator).

The mutation operator, M P m , is applied (with probability P m ) independently at each bit in a population p ∈℘ N , to avoid premature convergence (see [22] for further discussion). The multi-bit mutation operator with change probability following a simulated annealing law with respect to the position 1≤iL in p ∈℘ N :

$$ {\mathbf{P}}_{{m{\left( i \right)}}} = \mu \cdot \exp {\left( {\frac{{ - \bmod {\left\{ {\frac{{i - 1}} {N}} \right\}}}} {\emptyset }} \right)} $$
(11)

where ∅ is a normalization constant and μ, the change probability at the beginning of each creature p i in population p, can be described as a positive stochastic matrix in the form:

$$ \langle q,{\mathbf{M}}_{{{\mathbf{P}}_{m} }} p\rangle = \mu ^{{\Delta {\left( {p,q} \right)}}} \exp {\left( { - {\sum\limits_{{\text{dif}}{\left( i \right)}}^{\Delta {\left( {p,q} \right)}} {\frac{{\bmod {\left\{ {\frac{{i - 1}} {N}} \right\}}}} {\emptyset }} }} \right)}.{\prod\limits_{{\text{equ}}{\left( i \right)}}^{L - \Delta {\left( {p,q} \right)}} {{\left[ {1 - \mu \cdot \exp {\left( {\frac{{\bmod {\left\{ {\frac{{i - 1}} {N}} \right\}}}} {\emptyset }} \right)}} \right]}} } $$
(12)

where \(\Delta {\left( {p,q} \right)} \) is the Hamming distance between p and q∈℘ N , dif(i), resp. equ(i) is the set of indexes where p and q are different resp. equal. From Eq. 12 and observing how the matrices act on populations, we can write:

$$ {\mathbf{M}}_{{{\mathbf{P}}_{m} }} = {\prod\limits_{\lambda = 1}^N {{\left( {{\left[ {1 - \mu \cdot \exp {\left( {\frac{{\bmod {\left\{ {\frac{{\lambda - 1}} {N}} \right\}}}} {\emptyset }} \right)}} \right]}1 + \mu \cdot \exp {\left( {\frac{{\bmod {\left\{ {\frac{{\lambda - 1}} {N}} \right\}}}} {\emptyset }} \right)}\ifmmode\expandafter\hat\else\expandafter\^\fi{m}^{1} {\left( \lambda \right)}} \right)}} } $$
(13)

where \(\ifmmode\expandafter\hat\else\expandafter\^\fi{m}^{1} {\left( \lambda \right)} = {\mathbf{1}} \otimes {\mathbf{1}} \ldots \otimes {\overbrace {\ifmmode\expandafter\hat\else\expandafter\^\fi{m}^{1} }^\lambda } \otimes \ldots {\mathbf{1}} \) is a linear operator on V , the free vector space over A L and \(\ifmmode\expandafter\hat\else\expandafter\^\fi{m}^{1} \) is the linear 1-bit mutation operator on V 1, the free vector space over A. The latter operator is defined acting on the alphabet as:

$$\langle \ifmmode\expandafter\hat\else\expandafter\^\fi{a}{\left( {{\tau }\ifmmode{'}\else$'$\fi} \right)},\ifmmode\expandafter\hat\else\expandafter\^\fi{m}^{1} \ifmmode\expandafter\hat\else\expandafter\^\fi{a}{\left( \tau \right)}\rangle = {\left( {a - 1} \right)}^{{ - 1}} ,\;\;\;\;0 \leqslant {\tau }\ifmmode{'}\else$'$\fi \ne \tau \leqslant a - 1 $$
(14)

i.e., the probability of changing a letter in the alphabet once mutation occurs, with probability equal to .

The spectrum of M P m can be evaluated according to the following expression:

$${\text{sp}}{\left( {{\mathbf{M}}_{{{\mathbf{P}}_{m} }} } \right)} = {\left\{ {{\left( {1 - \frac{{\mu {\left( \lambda \right)}}} {{a - 1}}} \right)}^{\lambda } ;\;\;\;\;\lambda \in {\left[ {0,L} \right]}} \right\}} $$
(15)

where \(\mu {\left( \lambda \right)} = \exp {\left( {\frac{{ - \bmod \frac{{\lambda - 1}} {N}}} {\emptyset }} \right)} \).

The operator presented in Eq. 13 has similar properties to the constant multi-bit mutation operator, M μ , presented in [27]. M μ is a contracting map in the sense presented in [30]. It is easy to prove that M P m is a contracting map too, by using the Corollary B.2 in [27] and the eigenvalues of this operator (Eq. 15).

We can also compare the coefficients of ergodicity:

$$\tau _{r} {\left( {{\mathbf{M}}_{{{\mathbf{P}}_{m} }} } \right)} < \tau _{r} {\left( {{\mathbf{M}}_{\mu } } \right)} $$
(16)

where \(\tau _{r} {\left( {\mathbf{X}} \right)} = \max {\left\{ {{\left\| {{\mathbf{X}}_{\nu } } \right\|}_{r} :\nu \in \mathcal{R}^{n} ,\;\nu \bot e\;{\text{and}}\;{\left\| \nu \right\|}_{r} = 1} \right\}} \) .

Mutation is more likely at the beginning of the string of binary digits (“small neighborhood philosophy”). In order to improve the speed of convergence of the algorithm, we have included mechanisms such as elitist strategy (reduction operator [31]) in which the best individual in the current generation always survives into the next (a further discussion of the reduction operator, P R , can be found in [32]).

Finally, the GA is modeled, at each step, as the stochastic matrix product acting on probability distributions over the populations:

$${\mathbf{S}}_{{{\mathbf{P}}^{n}_{m} ,{\mathbf{P}}^{n}_{c} }} = {\mathbf{P}}^{n}_{R} \cdot {\mathbf{F}}_{n} \cdot {\mathbf{C}}^{k}_{{{\mathbf{P}}^{n}_{c} }} \cdot {\mathbf{M}}_{{{\mathbf{P}}^{n}_{m} }} $$
(17)

As shown above, the GA used in the forecasting function (Eq. 2) has an absolute error value start criterion (i.e., error>uga=1.5). Once it starts, it uses the values (or individual) found to be optimal (elite) the last time, and applies a local search (using the selected mutation and crossover operators) around this elite individual. Thus, we perform an efficient search around an individual (set of b I s) in which one parameter is more relevant than the others.

The computational time depends on the encoding length and the number of individuals and genes. Because of the probabilistic nature of the GA-based method, the proposed method almost converges to a global optimal solution on average. In our simulation, no non-convergent case was found. Figure 4 shows the GA-pseudocode and Fig. 1 completes the iterative procedure implemented for the overall prediction system including GA.

Fig. 4
figure 4

Pseudo-code of GA

6.2 Endogenous learning machine

The purpose of this section is to introduce the foundations of SVMs [33] and their connection with RT [34] in order to show the new on-line algorithm for time-series forecasting.

SVMs are learning algorithms based on the structural risk minimization principle [35] (SRM), characterized by the use of the expansion of SV “admissible” kernels and the sparsity of the solution. They have been proposed as a technique in time-series forecasting [36, 37] and have addressed the overfitting problem present in classical neural networks, thanks to their high capacity for generalization. The solution for SVM prediction is achieved by solving the constrained quadratic programming problem. SV machines, thus, are non-parametric techniques, i.e., the number of base functions is unknown beforehand. The solution of this complex problem in real-time applications can be extremely complicated because of high computational time demands.

SVMs are essentially regularization networks (RN) with the kernels being Green’s function of the corresponding regularization operators [38]. Using this connection, with a suitable choice of regularization operator (based on SVM philosophy), we should obtain a parametric model that is very resistant to the overfitting problem. Our parametric model is a RAN [3] characterized by the control of neural resources and by the use of matrix decompositions, i.e., singular value decomposition (SVD) and QR decomposition to input selection and neural pruning [4].

The following text is organized, thus: in the next subsection, we give a brief overview of the basic VC theory. The SV algorithm and its connection to RT theory is then presented and, finally, the new on-line algorithm is described.

6.2.1 Foundations of VC theory

A general notion of functional approximation problems Footnote 1 can be described as follows:

Let \(\Delta \equiv {\left\{ {{\left( {x_{1} ,y_{1} } \right)}, \ldots ,{\left( {x_{{\ell }} ,y_{{\ell }} } \right)}} \right\}} \) be a set of independent and identically distributed training samples with unknown probability distribution function, F(x,y). The learning problem is that of choosing, from the given set of functions, f(x,α ), α∈Λ, where Λ is a set of parameters, the one that best approximates the output, y, of the system. Thus, the selection of the desired function must be based on the training set, \(\Delta \), i.e., applying the empirical risk minimization principle:

$$ R{\left( {\alpha _{{\ell }} } \right)} \equiv {\int {L{\left( {y,f{\left( {x,\alpha } \right)}} \right)}} }{\text{d}}F{\left( {x,y} \right)} \simeq \frac{1} {{\ell }}{\sum\limits_{i = 1}^{\ell } {{\left( {y_{i} - f{\left( {x_{i} ,\alpha } \right)}} \right)}^{2} = R_{{{\text{emp}}}} {\left( {\alpha _{{\ell }} } \right)}} } $$
(18)

where we substitute the loss function, L(y,f(x,α)), measuring the discrepancy between the response, y, to a given input, x, and the solution, f(x,α), for a specific loss which forms the least squares method. Under certain conditions, the functional empirical risk converges towards the expected risk and, hence, the approximation in Eq. 18 holds, i.e., ℓ→∞. However, under small sample sizes, non-convergence may occur and the overfitting problem could arise [39]. This is avoided by introducing a regularization term [34] to limit the complexity of the loss function class arising from the problem of model selection [39].

The theory on controlling the generalization ability of learning machines is devoted to constructing a new, inductive principle to minimize the functional risk and to control the complexity of the loss function class. This is a major task, as noted above, whenever a small sample of training instances “ℓ” is used [33]. To construct learning methods, we use the bounds found by Vapnik:

$$R{\left( {\alpha ^{k}_{{_{{\ell }} }} } \right)} \leqslant R_{{{\text{emp}}}} {\left( {\alpha ^{k}_{{_{{\ell }} }} } \right)} + {\Re }{\left( {\frac{{\ell }} {{h_{k} }}} \right)} $$
(19)

where R is the actual risk, R emp is the empirical risk depending on the samples, ℜ is the confidence interval, α k Footnote 2 is the set of selected parameters that defines the class of approximation functions, and h is the VC dimension Footnote 3. In order to minimize the right-hand side of inequality 19, we apply the SRM principle as follows:

Let \(\mathcal{L}_{{\text{1}}} \subset \mathcal{L}_{{\text{2}}} \subset \cdots \subset \mathcal{L}_{{\text{k}}} \cdots \) be a nested “admissible” Footnote 4 family of loss function classes with a finite VC dimension denoted by “h i ” with i=1,...,k, for a given set of observations, \(\Delta \). The SRM principle chooses the suitable class, \(\mathcal{L}_k \) (and the function L(x,α k ), minimizing the guaranteed risk (right-hand side of inequality 19. In other words, the higher the complexity in the class function, the lower the empirical risk with the higher confidence interval (the second term in the bounds of the expected risk).

6.2.2 Support vector machines and regularization theory

The SV algorithm is a non-linear generalization of the generalized portrait developed in the 1960s by Vapnik and Lerner [40]. The basic idea in SVM for regression and function estimation is to use a mapping function, Φ, from the input space, \(\mathcal{F}\) , into a high-dimensional feature space, ℱ, and then to apply a linear regression. Thus, the standard linear regression transforms into:

$$f{\left( x \right)} = \langle \omega \cdot \Phi {\left( x \right)}\rangle + b $$
(20)

where \(\Phi :\chi \to \mathcal{F} \), b is a bias or threshold, and \(\omega \in \mathcal{F}\) is a vector defining the function class. The target is to determine ω, i.e., the set of parameters in the neural network, minimizing the regularized risk expressed as:

$$R_{{{\text{reg}}}} {\left[ f \right]} = R_{{{\text{emp}}}} {\left[ f \right]} + \lambda {\left\| \omega \right\|}^{2} $$
(21)

thus, we are enforcing “flatness” in the feature space, that is, we seek small ω. Note that Eq. 21 is very common in RN with a certain second term.

The SVM algorithm is a suitable way of solving the minimization of Eq. 21, which can be expressed as a quadratic programming problem using the formulation stated in [33]:

$$ {\text{minimize}}\frac{1} {2}{\left\| \omega \right\|}^{2} + C{\sum\limits_{i = 1}^{\ell } {{\left( {\xi _{i} + \xi ^{ * }_{i} } \right)}} }, $$
(22)

given a suitable loss function L(·) Footnote 5 , a constant C≥ 0 and with variables ξ I , ξ * I ≥ 0. The optimization problem is solved by constructing a Lagrange function, introducing dual variables, using Eq. 22 and the selected loss function.

Once it is uniquely solved, we can write the vector ω in terms of the data points as follows:

$$ \omega = {\sum\limits_{i = 1}^{\ell } {{\left( {\alpha _{i} - \alpha ^{ * }_{i} } \right)}} }\Phi {\left( {x_{i} } \right)} $$
(23)

where α i and α * i are the solutions of the above-mentioned quadratic problem. Once this problem, with high computational demands Footnote 6, is solved, we introduce Eq. 23 into Eq. 20 and obtain the solution in terms of dot products:

$$ f{\left( x \right)} = {\sum\limits_{i = 1}^{\ell } {{\left( {\alpha _{i} - \alpha ^{ * }_{i} } \right)}} }\langle \Phi {\left( {x_{i} } \right)} \cdot \Phi {\left( x \right)}\rangle + b $$
(24)

At this point, we use a strategy to avoid computing the dot product in the high-dimensional feature space in Eq. 24, replacing it with a kernel function that satisfies Mercer’s condition. Mercer’s theorem guarantees the existence of this kernel function:

$$f{\left( x \right)} = {\sum\limits_{i = 1}^{\ell } {h_{I} \cdot k{\left( {x_{I} ,x} \right)} + b} } $$
(25)

where \( h_{i} \equiv {\left( {\alpha _{i} - \alpha ^{ * }_{i} } \right)} \) and \( k{\left( {x_{i} ,x} \right)} = \langle \Phi {\left( {x_{i} } \right)} \cdot \Phi {\left( x \right)}\rangle . \) Finally, we note, regarding the sparsity of the SV expansion (Eq. 24), that only the elements satisfying |f(x i )−y i |≥ε, where ε is the standard deviation of f(x i ) from y i (see selected loss function), have non-zero Lagrange multipliers, α i and \( \alpha ^{*}_{i} \). This can be proved by applying Karush-Kuhn-Tucher (KKT) conditions [41] to the SV dual optimization problem.

RT appears in the methods described for solving ill-posed problems [34]. In RN, we minimize an expression similar to Eq. 21. However, the search criterion is to enforce smoothness (instead of flatness) for the function in the input space (instead of the feature space). Thus, we have:

$$R_{{{\text{reg}}}} {\left| f \right|} = R_{{{\text{emp}}}} {\left| f \right|} + \frac{\lambda } {2}{\left\| {\ifmmode\expandafter\hat\else\expandafter\^\fi{P}f} \right\|}^{2} $$
(26)

where \(\ifmmode\expandafter\hat\else\expandafter\^\fi{P} \) denotes a regularization operator in the sense of [34], mapping from the Hilbert space, H, of functions to a dot product space, D, such that \(\langle f,g\rangle \;\forall f,g \in H \) is well defined. Applying Frèchet’s differential Footnote 7 to Eq. 26 and the concept of Green’s function of \(\ifmmode\expandafter\hat\else\expandafter\^\fi{P} * \ifmmode\expandafter\hat\else\expandafter\^\fi{P}, \) we have:

$$ \ifmmode\expandafter\hat\else\expandafter\^\fi{P} * \ifmmode\expandafter\hat\else\expandafter\^\fi{P} \cdot G{\left( {x_{i} ,x_{j} } \right)} = \delta {\left( {x_{i} - x_{j} } \right)} $$
(27)

(here, δ denotes Dirac’s δ, that is 〈f, δ(x i )〉=f(x i )), and, hence [4]:

$$ f{\left( x \right)} = \lambda {\sum\limits_{i = 1}^{\ell } {{\left[ {y_{i} - f{\left( {x_{i} } \right)}} \right]}_{\varepsilon } \cdot G{\left( {x,x_{i} } \right)}} } $$
(28)

The correspondence between SVM and RN (Eqs. 25 and 28) is proved if and only if Green’s function, G, is an “admissible” kernel in the terms of Mercer’s theorem, i.e., if we can write G as:

$$ G{\left( {x_{i} ,x_{j} } \right)} = \langle \Phi {\left( {x_{i} } \right)},\Phi {\left( {x_{j} } \right)}\rangle \;{\text{with}}\;\Phi :x_{i} \to {\left( {\ifmmode\expandafter\hat\else\expandafter\^\fi{P}G} \right)}{\left( {x_{i} ,.} \right)} $$
(29)

A similar proof of this connection can be found in [38]. Hence, given a regularization operator, we can find an admissible kernel such that the SV machine using it will enforce flatness in the feature space and minimize Eq. 26. Moreover, given an SV kernel, we can find a regularization operator such that the SVM can be seen as an RN.

6.2.3 On-line algorithm using regularization operators

In this section, we describe a new on-line RN based on “RAN” algorithms Footnote 8 [3], which consist of a network using RBFs, a strategy for:

  • Allocating new units (RBFs), using a two-part novelty condition [3]

  • Input space selection and neural pruning using matrix decompositions such as SVD and QR with pivoting [4]

together with a learning rule based on SRM, as discussed in the previous sections. Our network has one layer, as stated in Eq. 25. In terms of RBFs, the latter equation can be expressed as:

$$ f{\left( x \right)} = {\sum\limits_{i = 1}^{N{\left( t \right)}} {h_{i} \cdot \exp {\left( { - \frac{{{\left\| {x{\left( t \right)} - x_{i} {\left( t \right)}} \right\|}^{2} }} {{2\sigma ^{2}_{i} {\left( t \right)}}}} \right)}} } + b $$
(30)

where N(t) is the number of neurons, x i (t) is the center of the neurons, and σ i (t) is their radius at time t. In order to minimize Eq. 26, we propose a regularization operator based on SVM philosophy. We enforce flatness in the feature space, as described in Sect 6.2, using the regularization operator, \({\left\| {\ifmmode\expandafter\hat\else\expandafter\^\fi{P}f} \right\|}^{2} \equiv {\left\| \omega \right\|}^{2} , \) and, thus:

$$ R_{{{\text{reg}}}} {\left| f \right|} = R_{{{\text{emp}}}} {\left| f \right|} + \frac{\lambda } {2}{\sum\limits_{i,j = 1}^{N{\left( t \right)}} {h_{i} h_{j} k{\left( {x_{i} ,x_{j} } \right)}} } $$
(31)

We assume that R emp=(yf(x))2 and minimize Eq. 31 by adjusting the centers and radii (gradient descend method, \(\Delta \chi = - \eta {\partial R{\left| f \right|}} \mathord{\left/ {\vphantom {{\partial R{\left| f \right|}} {\partial \chi }}} \right. \kern-\nulldelimiterspace} {\partial \chi } \) with simulated annealing):

$$ \Delta x_{i} = - 2\frac{\eta } {{\sigma _{i} }}{\left( {x - x_{i} } \right)}h_{i} {\left( {f{\left( x \right)} - y} \right)}k{\left( {x,x_{i} } \right)} + \alpha {\sum\limits_{i,j = 1}^{N{\left( t \right)}} {h_{i} h_{j} k{\left( {x_{i} ,x_{j} } \right)}{\left( {x_{i} - x_{j} } \right)}} } $$
(32)

and

$$ \Delta h_{i} = \ifmmode\expandafter\tilde\else\expandafter\~\fi{\alpha }{\left( t \right)}f{\left( {x_{i} } \right)} - \eta {\left( {f{\left( x \right)} - y} \right)}k{\left( {x,x_{i} } \right)} $$
(33)

where \(\alpha {\left( t \right)}\;{\text{and}}\; \ifmmode\expandafter\tilde\else\expandafter\~\fi{\alpha }{\left( t \right)} \) are scalar-valued “adaptation gains”, related to a similar gain used in the stochastic approximation processes; as in these methods, it should decrease in time. The second summand in Eq. 32 can be evaluated in several regions, inspired by the “divide-and-conquer” principle and used in unsupervised learning, i.e., competitive learning in self-organizing maps [42] or in expert SVMs [43]. This is necessary because of the volatile nature of time series, i.e., the dynamics of stock returns vary between different regions, leading to gradual changes in the dependence between the input and output variables. Thus, the super-index in the latter equation is redefined as:

$$ N_{c} {\left( t \right)} = {\left\{ {s_{i} {\left( t \right)}:{\left\| {x{\left( t \right)} - x_{i} {\left( t \right)}} \right\|} \leqslant \rho } \right\}} $$
(34)

the set of neurons close to the current input. The structure of the algorithm is shown below as pseudo-code, including the set of initial parameters:

program online-algorithm

(Note: to denotes current iteration; k denotes

prediction horizon);

Initialize parameters and variables

Build input Toeplitz matrix A using (3W-1)

input values

Input space selection: determine Np relevant

lags L using SVD and QR_wp [8]

Determine input vector: x=x(to-k-L(1))

while (true)

if (n_rbfs > 0)

Compute f(x)

Find nearest RBF:|x-x_dmin|

else

f(x)=x(to-k-1)

Calculate error: e=|f(x)-x(to)|

if (e>epsilon and |x-x_d_min|>delta) [7]

Add RBF with parameters:

x_I=x, sigma_i=kappa*|x-c_dmin}|, h=e

else

Execute pruning (SVD & QR_wp to neural activations) [8]

Update parameters minimizing actual risk

(15}(16}

if (e>theta*epsilon and n_inps<max_inps)

n_inps = n_inps+1

Determine new lags: L=[L_1,L_2,...,L_Np]

Add rbf_add RBFs

to = to+1

end

6.2.4 Experiments

The application goal of our network is to predict complex time series. We chose the high-dimensional chaotic system generated by the Mackey-Glass delay differential equation:

$$\frac{{{\text{d}}x{\left( t \right)}}} {{{\text{d}}t}} = - b \cdot x{\left( t \right)} + a \cdot \frac{{x{\left( {t - \tau } \right)}}} {{1 + x^{{10}} {\left( {t - \tau } \right)}}} $$
(35)

with b=0.1, a=0.2, and delay t d =17. This equation was originally presented as a model of blood regulation and became popular in modeling time series benchmarks. We added two modifications to Eq. 35:

  • Zero-mean Gaussian noise with standard deviation equal to 1/4 of the standard deviation of the original series

  • Random dynamics changes in terms of delay (between 100 and 300 time steps) t d =17,23,30

We integrated the chaotic model using MatLab software on a Pentium III 850 MHz computer, obtaining 2,000 patterns. For our comparison, we used 100 prediction results from SVM_online (presented in this paper), standard SVM (with ε-insensitive loss), and NAPA_PRED (RAN algorithm using matrix decompositions being one of the best on-line algorithms available to date [4]). Clearly, there is a remarkable difference between the above on-line algorithm and SVM philosophy. Standard SVM and SVM_online achieve similar results for this set of data at the beginning of the process. In addition, there is a noticeable improvement in the last iterations because of the volatile nature of the series. The change in time delay, t d , leads to gradual changes in the dependence between the input and output variables and, in general, it is hard for a single model including SVMs to capture such a dynamic input-output relationship inherent in the data. Focusing our attention on the on-line algorithm, we observe the better performance of the new algorithm, such as a lower number of neurons (“sparsity”), and improved input space dimension and forecasting results (Table 3).

Table 3 Evolution of normalized root mean square error (NRMSE). Xth-S denotes the Xth-step prediction NRMSE on the test set (noisy Mackey-Glass with delay changing operation mode