1 Introduction

Today artificial neural networks (ANNs) are successfully used in a wide range of data processing problems, where the data are presented either in the form of “object-property” tables or as time series, often produced by non-stationary, nonlinear stochastic or chaotic systems. The advantages of ANNs over other existing approaches derive from their universal approximation capabilities and their ability to learn.

Conventionally, “learning” is defined as the process of adjusting synaptic weights by means of an optimization procedure that searches for the extremum of a given learning criterion. The quality of learning can be improved by adjusting the network topology along with its synaptic weights (Haykin 1999; Cichocki and Unbehauen 1993). This idea is the foundation of evolving computational intelligence systems (Kasabov 2001, 2003, 2007; Kasabov and Song 2002; Kasabov et al. 2005; Lughofer 2011; Angelov and Filev 2004, 2005; Angelov and Kasabov 2005; Angelov and Lughofer 2008; Angelov and Zhou 2006, 2008; Angelov et al. 2004, 2005, 2006, 2007, 2008, 2010; Lughofer and Klement 2003, 2004; Lughofer and Bodenhofer 2006; Lughofer and Guardiola 2008a, b; Lughofer and Kindermann 2008, 2010; Lughofer and Angelov 2009, 2011; Lughofer 2006, 2008a, b, c, 2010a, b; Lughofer et al. 2003, 2004, 2005, 2007, 2009). Under this approach, the best-known architectures are DENFIS by Kasabov and Song (2002), eTS by Angelov and Filev (2004) and FLEXFIS by Lughofer (2008c). These systems are essentially five-layer Takagi–Sugeno networks whose evolution takes place in the fuzzification layer. They can process data in an online mode: a clustering task is solved in the antecedent part (unsupervised learning), while the consequent parameters are tuned by supervised learning using the exponentially weighted recurrent least-squares method. Although these systems have good approximation properties and can process non-stationary signals, they require large samples to tune their parameters, and the clustering procedure, like any procedure based on self-learning, cannot be optimized in terms of speed.

A rather interesting class of computational intelligence systems whose architecture evolves during learning is cascade-correlation neural networks (Fahlman and Lebiere 1990; Prechelt 1997; Schalkoff 1997; Avedjan et al. 1999), owing to their high efficiency and the simplicity with which both the synaptic weights and the network topology are learned. Such a network starts with a simple architecture consisting of a pool (ensemble) of neurons that are trained independently (the first cascade). Each neuron in the pool may have a different activation function and/or a different learning algorithm, and the neurons do not interact with each other during training. After all the neurons in the pool of the first cascade have had their weights adjusted, the neuron that is best with respect to the learning criterion forms the first cascade, and its synaptic weights are no longer adjusted. The second cascade is then formed, usually from a similar pool of neurons; the only difference is that each neuron trained in the pool of the second cascade has an additional input (and, therefore, an additional synaptic weight), namely the output of the first cascade. As in the first cascade, all but the best-performing neuron are eliminated, and the winner's synaptic weights are thereafter fixed. Neurons of the third cascade have two additional inputs, the outputs of the first and second cascades. The network continues to add new cascades to its architecture until the desired quality of problem solving over the given training set is reached; a sketch of this growth procedure is given below.
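The classical growth procedure just described can be outlined in a minimal Python sketch; here `make_pool` and `train` are hypothetical placeholders for the candidate pool and the per-neuron training routine, and the stopping test is illustrative rather than part of any specific published algorithm:

```python
import numpy as np

def augmented_inputs(X, frozen_predictors):
    """Stack the original features with the outputs of the already-frozen cascades."""
    Z = X
    for predict in frozen_predictors:          # each winner acts on the current Z
        Z = np.hstack([Z, predict(Z).reshape(-1, 1)])
    return Z

def grow_cascade_network(X, y, make_pool, train, target_mse, max_cascades=10):
    """Cascade-correlation-style growth: train a pool of candidates,
    keep only the best one, freeze its weights, and repeat."""
    mse = lambda y_hat: float(np.mean((y_hat - y) ** 2))
    frozen = []                                # winning neuron of each cascade
    for _ in range(max_cascades):
        Z = augmented_inputs(X, frozen)        # n inputs + one per frozen cascade
        pool = make_pool(Z.shape[1])           # independent candidate neurons
        trained = [train(neuron, Z, y) for neuron in pool]
        best = min(trained, key=lambda f: mse(f(Z)))
        frozen.append(best)                    # freeze the winner's weights
        if mse(best(Z)) <= target_mse:         # desired quality reached
            break
    return frozen
```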

The authors of the most popular cascade neural network, the cascade-correlation learning architecture (CasCorLA), S. E. Fahlman and C. Lebiere, used elementary Rosenblatt perceptrons with traditional sigmoidal activation functions and adjusted the synaptic weights with the Quickprop algorithm (a modification of the \(\delta \)-learning rule). Since the output signal of such a neuron depends non-linearly on its synaptic weights, the rate of learning cannot be increased substantially. To avoid multi-epoch learning (Bodyanskiy et al. 2008, 2009, 2011a, b, c; Bodyanskiy and Viktorov 2009a, b; Bodyanskiy and Kolodyazhniy 2010), nodes whose outputs depend linearly on the synaptic weights should be used instead. This makes it possible to use learning algorithms that are optimal in terms of speed and to process data as they arrive at the network input. However, when the network learns in an online mode, it is impossible to determine the best neuron in the pool: for non-stationary objects, one neuron of the pool may be the best on one part of the training set but not on the others. We therefore suggest retaining all the neurons in the training pool and using an optimization procedure, derived from a general network quality criterion, to determine the output of each cascade. In this paper, we construct such a hybrid neural network with an optimized neuron pool in each cascade.

2 An optimized cascade neural network architecture

The architecture of the desired hybrid neural network with an optimized pool of neurons in each cascade is shown in Fig. 1.

Fig. 1
figure 1

The optimized cascade neural network architecture

An input of the network (the so-called “receptive layer”) is a vector signal

$$\begin{aligned} x(k)=\left( x_1(k),x_2(k), \ldots , x_n(k)\right) ^T\!, \end{aligned}$$

where \(k=1,2,\ldots \) is either the index of a sample in the “object-property” table or the current discrete time. These signals are fed to the inputs of every neuron \(N_j^{[m]}\) of the network (\(j=1,2,\ldots , q\), where \(q\) is the number of neurons in the training pool, and \(m=1,2,\ldots \) is the number of the cascade), each of which produces an output \(\hat{y}_j^{[m]}(k)\). These outputs are then combined by a generalizing neuron \(GN^{[m]}\), which generates the optimal output \(\hat{y}^{*[m]}(k)\) of the \(m\)-th cascade. While the input of the neurons in the first cascade is \(x(k)\) (possibly extended with a threshold component \(x_0(k)\equiv 1\)), the neurons of the second cascade have an additional input for the generated signal \(\hat{y}^{*[1]}(k)\), the neurons of the third cascade have two additional inputs \(\hat{y}^{*[1]}(k)\), \(\hat{y}^{*[2]}(k)\), and the neurons of the \(m\)-th cascade have \((m-1)\) additional inputs \(\hat{y}^{*[1]}(k), \hat{y}^{*[2]}(k), \ldots ,\hat{y}^{*[m-1]}(k)\), as sketched below. New cascades are added to the network during training when it becomes clear that the existing cascades do not provide the desired quality.
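To make the data flow concrete, here is a minimal Python sketch of one forward pass through this architecture; the callables standing in for the pool neurons \(N_j^{[m]}\) and the generalizing neuron \(GN^{[m]}\) are hypothetical placeholders (the actual optimization of the generalizing neuron is described below in Section 5):

```python
import numpy as np

def forward_pass(x, cascades, generalize):
    """One forward pass through the optimized cascade network.

    x          : input vector x(k) of the receptive layer, shape (n,)
                 (a bias component x_0 = 1 is assumed to be handled by the nodes)
    cascades   : list of pools; cascades[m] is a list of callables, one per
                 neuron N_j^{[m+1]}, each mapping the cascade input to a scalar
    generalize : placeholder for GN, combining pool outputs into y*^{[m]}
    """
    prev_optimal = []                        # y*^{[1]}, ..., y*^{[m-1]}
    for pool in cascades:
        # input of the m-th cascade: x(k) plus all previous optimal outputs
        x_m = np.concatenate([np.asarray(x, dtype=float), np.array(prev_optimal)])
        pool_outputs = np.array([neuron(x_m) for neuron in pool])
        prev_optimal.append(generalize(pool_outputs))
    return prev_optimal[-1]                  # output of the last cascade
```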

3 Training elementary Rosenblatt perceptrons in a cascade neural network

For now, let us assume that the \(j\)-th node in the \(m\)-th cascade is an elementary Rosenblatt perceptron with the activation function

$$\begin{aligned} 0<\sigma ^{[m]}_j({\gamma _j}^{[m]} u_j^{[m]})=\frac{1}{1+e^{- \gamma ^{[m]}_j u_j^{[m]}}}<1, \end{aligned}$$

where \(u_j^{[m]}\) is an internal activation signal of the \(j\)-th neuron in the \(m\)-th cascade, and \(\gamma _j^{[m]}\) is a gain parameter. In such a case, the neurons in the pool of the first cascade will have the following outputs:

$$\begin{aligned} \hat{y}_j^{[1]}=\sigma _j^{[1]} \left( \gamma _j^{[1]} \sum _{i=0}^{n} w_{ji}^{[1]}x_i\right) =\sigma _j^{[1]}\left( \gamma _j^{[1]}w_j^{[1]T} x \right) , \end{aligned}$$

where \(w_{ji}^{[1]}\) is the \(i\)-th synaptic weight of the \(j\)-th neuron in the first cascade. The outputs of the neurons of the second cascade are

$$\begin{aligned} \hat{y}_j^{[2]}=\sigma _j^{[2]} \left( \gamma _j^{[2]} \left( \sum _{i=0}^{n} w_{ji}^{[2]}x_i+w_{j,n+1}^{[2]}\hat{y}^{*[1]}\right) \right) , \end{aligned}$$

and the outputs of the \(m\)-th cascade are

$$\begin{aligned} \begin{aligned} \hat{y}_j^{[m]}&=\sigma _j^{[m]}\bigg ( \gamma _j^{[m]} \bigg ( \sum _{i=0}^{n} w_{ji}^{[m]}x_i+w_{j,n+1}^{[m]}\hat{y}^{*[1]}\\&\quad +w_{j,n+2}^{[m]}\hat{y}^{*[2]}+\cdots +w_{j,n+m-1}^{[m]}\hat{y}^{*[m-1]}\bigg )\bigg )\\&=\sigma _j^{[m]}\left( \gamma _j^{[m]}\sum _{i=0}^{n+m-1}w_{ji}^{[m]}x_i^{[m]}\right) =\sigma _j^{[m]}\left( \gamma _j^{[m]}w_j^{[m]T}x^{[m]}\right) \end{aligned}, \end{aligned}$$

where \(x^{[m]}=\left( x^T, \hat{y}^{*[1]},\ldots , \hat{y}^{*[m-1]}\right) ^T\).

Thus a cascade network that uses Rosenblatt perceptrons as nodes and contains \(m\) cascades depends on \(\big (m(n+2)+\sum _{p=1}^{m-1}p\big )\) adjustable parameters, including the gain parameters \(\gamma _j^{[p]}\), \(p=1,2,\ldots ,m\). We use a conventional quadratic function as the learning criterion

$$\begin{aligned} E_j^{[m]}&= \frac{1}{2}\left( e_j^{[m]}(k)\right) ^2=\frac{1}{2}\left( y(k)-\hat{y}_j^{[m]}(k)\right) ^2\nonumber \\&= \frac{1}{2}\left( y(k)-\sigma _j^{[m]}\left( \gamma _j^{[m]}w_j^{[m]T}x^{[m]}(k)\right) \right) ^2, \end{aligned}$$
(1)

where \(y(k)\) is a reference signal. Gradient minimization of criterion (1) with respect to \(w_j^{[m]}\) gives

$$\begin{aligned} w_j^{[m]}(k+1)&= w_j^{[m]}(k)+\eta _j^{[m]}(k+1)e_j^{[m]}(k+1)\gamma _j^{[m]}\nonumber \\&\times \hat{y}_j^{[m]}(k\!+\!1)\left( 1\!-\!\hat{y}_j^{[m]}(k\!+\!1)\right) x^{[m]}(k\!+\!1)\nonumber \\&= w_j^{[m]}(k)+\eta _j^{[m]}(k+1)e_j^{[m]}(k+1)\nonumber \\&\times \gamma _j^{[m]}J_j^{[m]}(k+1), \end{aligned}$$
(2)

(here \(\eta _j^{[m]}(k+1)\) is a learning rate parameter), and minimization of (1) with respect to \(\gamma _j^{[m]}\) can be performed using the Kruschke–Movellan algorithm (Kruschke and Movellan 1991):

$$\begin{aligned} \gamma _j^{[m]}(k\!+\!1)&\!=\!\gamma _j^{[m]}(k)\!+\!\eta _j^{[m]}(k\!+\!1)e_j^{[m]}(k\!+\!1)\hat{y}_j^{[m]}(k\!+\!1)\nonumber \\&\quad \times \left( 1-\hat{y}_j^{[m]}(k+1)\right) u_j^{[m]}(k+1). \end{aligned}$$
(3)

Combining (2) and (3), we obtain a general learning algorithm for the \(j\)-th neuron in the \(m\)-th cascade:

$$\begin{aligned} \begin{aligned} \left( \begin{array}{c} w_j^{[m]}(k+1)\\ \gamma _j^{[m]}(k+1) \end{array}\right)&=\left( \begin{array}{c} w_j^{[m]}(k)\\ \gamma _j^{[m]}(k) \end{array}\right) \\&\quad +\eta _j^{[m]}(k+1)e_j^{[m]}(k+1)\hat{y}_j^{[m]}(k+1)\\&\quad \times \left( 1-\hat{y}_j^{[m]}(k+1)\right) \left( \begin{array}{c} \gamma _j^{[m]}x^{[m]}(k+1)\\ u_j^{[m]}(k+1) \end{array}\right) , \end{aligned} \end{aligned}$$

or, introducing new variables, in a more compact form:

$$\begin{aligned}&\tilde{w}_j^{[m]}(k+1)\\&\quad =\tilde{w}_j^{[m]}(k)+\eta _j^{[m]}(k+1)\\&\qquad \times e_j^{[m]}(k\!+\!1) \hat{y}_j^{[m]}(k\!+\!1)(1\!-\!\hat{y}_j^{[m]}(k\!+\!1))\tilde{x}^{[m]}(k\!+\!1)\\&\quad =\tilde{w}_j^{[m]}(k)+\eta _j^{[m]}(k+1)e_j^{[m]}(k+1)\tilde{J}_j^{[m]}(k+1). \end{aligned}$$

Weight adjustment can be improved by introducing a momentum term to the learning process (Chan and Fallside 1987; Almeida and Silva 1990; Holmes and Veitch 1991), so that instead of the learning criterion (1) we use the function

$$\begin{aligned}&E_j^{[m]}(k)=\frac{\eta }{2}(e_j^{[m]}(k))^2\nonumber \\&\quad +\frac{1-\eta }{2}\Vert \tilde{w}_j^{[m]}(k)-\tilde{w}_j^{[m]}(k-1)\Vert ^2,\quad 0<\eta \leqslant 1 \end{aligned}$$
(4)

and the algorithm is

$$\begin{aligned} \tilde{w}_j^{[m]}(k+1)&=\tilde{w}_j^{[m]}(k)+\eta _j^{[m]}(k+1)\Big (\eta e_j^{[m]}(k+1)\tilde{J}_j^{[m]}(k+1)\nonumber \\&\quad +(1-\eta )\big (\tilde{w}_j^{[m]}(k)-\tilde{w}_j^{[m]}(k-1)\big )\Big ), \end{aligned}$$
(5)

which is a modification of the Silva–Almeida procedure (Almeida and Silva 1990).

Using the approach suggested in Bodyanskiy et al. (2001b, 2003b), we can endow the algorithm with both tracking and filtering properties, so the final version of the algorithm is

$$\begin{aligned} \left\{ \begin{array}{l} \tilde{w}_j^{[m]}(k+1)=\tilde{w}_j^{[m]}(k)\\ \quad +\frac{\eta e_j^{[m]}(k+1)\tilde{J}_j^{[m]}(k+1)}{r_j^{[m]}(k+1)}\\ \quad +\frac{(1-\eta )(\tilde{w}_j^{[m]}(k)-\tilde{w}_j^{[m]}(k-1))}{r_j^{[m]}(k+1)}, \\ r_j^{[m]}(k+1)=r_j^{[m]}(k)+||\tilde{J}_j^{[m]}(k+1)||^2\\ \quad -||\tilde{J}_j^{[m]}(k-s)||^2 \end{array}\right. \end{aligned}$$
(6)

where \(s\) is a sliding window size.

It is interesting to note that for \(s=1\) and \(\eta =1\) we obtain a nonlinear version of the well-known Kaczmarz–Widrow–Hoff algorithm (Kaczmarz 1937, 1993; Hoff and Widrow 1960):

$$\begin{aligned} \tilde{w}_j^{[m]}(k+1)=\tilde{w}_j^{[m]}(k)+\frac{e_j^{[m]}(k+1)\tilde{J}_j^{[m]}(k+1)}{\Vert \tilde{J}_j^{[m]}(k+1)\Vert ^2}, \end{aligned}$$

which is widely used for training artificial neural networks and is characterized by a high convergence rate.
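For illustration, a minimal online-training sketch of procedure (6) for a single perceptron node is given below (Python/NumPy); the initial value of \(r\), the window length and the default \(\eta\) are assumptions made for the sketch, not part of the original formulation. With `window=1` and `eta=1` the update reduces to the Kaczmarz–Widrow–Hoff form given above.

```python
import numpy as np
from collections import deque

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class PerceptronNode:
    """One elementary perceptron of a cascade trained by procedure (6).

    The augmented weight vector stacks the synaptic weights w (the bias is
    assumed to be carried by a component x_0 = 1 of x) and the gain gamma;
    the augmented regressor stacks gamma*x and the internal activation u.
    """

    def __init__(self, dim, eta=0.9, window=10):
        self.w = np.zeros(dim)                   # synaptic weights
        self.gamma = 1.0                         # gain parameter
        self.prev_step = np.zeros(dim + 1)       # w_tilde(k) - w_tilde(k-1)
        self.eta = eta                           # momentum weighting, 0 < eta <= 1
        self.r = 1e-3                            # r(k), kept strictly positive
        self.norms = deque(maxlen=window)        # ||J_tilde||^2 inside the window

    def predict(self, x):
        return sigmoid(self.gamma * (self.w @ x))

    def update(self, x, y):
        u = self.w @ x                           # internal activation signal
        y_hat = sigmoid(self.gamma * u)
        e = y - y_hat                            # prediction error
        x_tilde = np.concatenate([self.gamma * x, [u]])
        J = y_hat * (1.0 - y_hat) * x_tilde      # instantaneous gradient factor
        # r(k+1) = r(k) + ||J(k+1)||^2 - ||J(k-s)||^2 over the sliding window
        if len(self.norms) == self.norms.maxlen:
            self.r -= self.norms[0]
        self.norms.append(J @ J)
        self.r += J @ J
        step = (self.eta * e * J + (1.0 - self.eta) * self.prev_step) / self.r
        self.prev_step = step
        self.w += step[:-1]
        self.gamma += step[-1]
        return y_hat
```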

4 Training neo-fuzzy neurons in a cascade neural network

The low learning rate of Rosenblatt perceptrons, together with the difficulty of interpreting the results (inherent to ANNs in general), encourages the search for alternative approaches to the synthesis of evolving systems in general and cascade neural networks in particular. High interpretability and transparency, together with good approximation capabilities and the ability to learn, are the main features of neuro-fuzzy systems (Jang et al. 1997), which are the foundation of hybrid artificial intelligence systems.

In Bodyanskiy and Viktorov (2009a, b) and Bodyanskiy and Kolodyazhniy (2010), hybrid cascade systems were introduced that use neo-fuzzy neurons (Kusanagi et al. 1992; Uchino and Yamakawa 1997; Miki and Yamakawa 1999) as network nodes, which makes it possible to significantly increase the rate of synaptic weight adjustment. A neo-fuzzy neuron (NFN) is a non-linear system providing the following mapping:

$$\begin{aligned} \hat{y}=\sum _{i=1}^{n} f_i(x_i), \end{aligned}$$

where \(x_i\) is the \(i\mathrm{th}\) input (\(i=1,2,\ldots ,n\)) and \(\hat{y}\) is the output of the neo-fuzzy neuron. The structural units of the neo-fuzzy neuron are the non-linear synapses \(NS_i\), which transform the input signals in the following way:

$$\begin{aligned} f_i(x_i)=\sum _{l=1}^{h} w_{li} \mu _{li}(x_i), \end{aligned}$$

where \(w_{li}\) is the \(l\mathrm{th}\) synaptic weight of the \(i\mathrm{th}\) non-linear synapse, \(l=1,2,\ldots ,h\), and \(h\) is the total number of synaptic weights and, therefore, of membership functions \(\mu _{li}(x_i)\) in the synapse. Thus \(NS_i\) implements fuzzy inference of the form

$$\begin{aligned} \text {IF}\, x_i \,\text {IS}\, X_{li} \,\text {THEN THE OUTPUT IS} \,w_{li}, \end{aligned}$$

where \(X_{li}\) is a fuzzy set with membership function \(\mu _{li}\) and \(w_{li}\) is a singleton (a synaptic weight in the consequent). It can be seen that the non-linear synapse in fact implements zero-order Takagi–Sugeno fuzzy inference.

Figure 2 shows the \(j\mathrm{th}\) neo-fuzzy neuron of the first cascade (according to the network topology shown in Fig. 1).

Fig. 2
figure 2

A neo-fuzzy neuron of the first cascade

$$\begin{aligned} \left\{ \begin{array}{l} \hat{y}_j^{[1]}(k)=\sum _{i=1}^{n}f_{ji}^{[1]}(x_i(k))=\sum _{i=1}^{n}\sum _{l=1}^{h} w_{jli}^{[1]} \mu _{jli}^{[1]}(x_i(k)),\\ \text {IF}\, x_i(k) \,\text {IS}\, X_{jli}\, \text {THEN THE OUTPUT IS}\, w_{jli}^{[1]}. \end{array}\right. \end{aligned}$$
(7)

The authors of the neo-fuzzy neuron (Kusanagi et al. 1992; Uchino and Yamakawa 1997; Miki and Yamakawa 1999) used traditional triangular membership functions satisfying the conditions of a Ruspini partition (partition of unity):

$$\begin{aligned} \mu _{jli}^{[1]}(x_i)= \left\{ \begin{array}{ll} &{} \frac{x_i-c_{j,l-1,i}^{[1]}}{c_{jli}^{[1]}-c_{j,l-1,i}^{[1]}},\,\quad \text {if}\, x_i \in [c_{j,l-1,i}^{[1]},c_{jli}^{[1]}], \\ &{} \frac{c_{j,l+1,i}^{[1]}-x_i}{c_{j,l+1,i}^{[1]}-c_{jli}^{[1]}},\,\quad \text {if}\, x_i \in [c_{jli}^{[1]},c_{j,l+1,i}^{[1]}] \\ &{} 0,\,\quad \text {otherwise}, \end{array}\right. \end{aligned}$$
(8)

where \(c_{jli}^{[1]}\) are the centers of the membership functions, chosen more or less arbitrarily (usually evenly distributed) over the interval [0,1]; naturally, it is assumed that \(0\le x_i \le 1\). This choice ensures that the input signal \(x_i\) activates only two neighboring membership functions, whose sum is always equal to 1, which means that

$$\begin{aligned} \mu _{jli}^{[1]}(x_i)+\mu _{j,l+1,i}^{[1]}(x_i)=1 \end{aligned}$$

and

$$\begin{aligned} f_{ji}^{[1]}(x_i)=w_{jli}^{[1]}\mu _{jli}^{[1]}(x_i)+w_{j,l+1,i}^{[1]}\mu _{j,l+1,i}^{[1]}(x_i). \end{aligned}$$
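As an illustration, a single non-linear synapse with the triangular membership functions (8) can be sketched as follows (Python/NumPy); the evenly spaced centers on [0, 1] follow the text, while the clipping of out-of-range inputs and the example weights are illustrative assumptions:

```python
import numpy as np

def triangular_memberships(x, centers):
    """Membership degrees of a Ruspini (unity) partition built from triangular
    functions with sorted centers; at most two neighbouring functions are
    non-zero and their values sum to one, as in (8)."""
    mu = np.zeros(len(centers))
    x = np.clip(x, centers[0], centers[-1])          # assume x in [0, 1]
    r = np.searchsorted(centers, x)                  # index of the right neighbour
    if r == 0:
        mu[0] = 1.0
    else:
        l = r - 1
        mu[r] = (x - centers[l]) / (centers[r] - centers[l])   # rising branch
        mu[l] = 1.0 - mu[r]                                    # falling branch
    return mu

def nonlinear_synapse(x, weights, centers):
    """f_i(x_i) = sum_l w_li * mu_li(x_i): zero-order Takagi-Sugeno inference."""
    return weights @ triangular_memberships(x, centers)

# usage: h = 5 evenly spaced centers on [0, 1] and illustrative singleton weights
centers = np.linspace(0.0, 1.0, 5)
weights = np.array([0.1, 0.4, 0.9, 0.4, 0.1])
print(nonlinear_synapse(0.33, weights, centers))     # 0.68*0.4 + 0.32*0.9 = 0.56
```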

Approximating capabilities can be improved using cubic splines (Bodyanskiy and Viktorov 2009b) instead of triangular membership functions:

$$\begin{aligned} \mu _{jli}^{[1]}(x_i)= \left\{ \begin{array}{ll} &{} \frac{1}{4}\left( 2+3\frac{2x_i-c_{jli}^{[1]}-c_{j,l-1,i}^{[1]}}{c_{jli}^{[1]}-c_{j,l-1,i}^{[1]}}\right. \\ &{}\left. \quad -\left( \frac{2x_i-c_{jli}^{[1]}-c_{j,l-1,i}^{[1]}}{c_{jli}^{[1]}-c_{j,l-1,i}^{[1]}}\right) ^3\right) ,\\ &{}\text {if}\, x_i \in [c_{j,l-1,i}^{[1]},c_{jli}^{[1]}], \\ &{}\frac{1}{4}\Bigg (2-3\frac{2x_i-c_{j,l+1,i}^{[1]}-c_{jli}^{[1]}}{c_{j,l+1,i}^{[1]}-c_{jli}^{[1]}}\\ &{}\quad +\Bigg (\frac{2x_i-c_{j,l+1,i}^{[1]}-c_{jli}^{[1]}}{c_{j,l+1,i}^{[1]}-c_{jli}^{[1]}}\Bigg )^3\Bigg ),\\ {} &{}\text {if}\, x_i \in [c_{jli}^{[1]},c_{j,l+1,i}^{[1]}], \\ &{} 0, \text {otherwise} \end{array}\right. \end{aligned}$$
(9)

or B-splines (Bodyanskiy and Kolodyazhniy 2010):

$$\begin{aligned} \mu _{jli}^{g[1]}(x_i)=\left\{ \begin{array}{ll} \left\{ \begin{array}{ll} 1, &{}\quad \text {if}\, x_i \in [c_{jli}^{[1]},c_{j,l+1,i}^{[1]}], \\ 0, &{}\quad \text {otherwise}, \end{array}\right. &{}\quad \text {for}\, g=1, \\ \dfrac{x_i-c_{jli}^{[1]}}{c_{j,l+g-1,i}^{[1]}-c_{jli}^{[1]}}\, \mu _{jli}^{g-1,[1]}(x_i) +\dfrac{c_{j,l+g,i}^{[1]}-x_i}{c_{j,l+g,i}^{[1]}-c_{j,l+1,i}^{[1]}}\, \mu _{j,l+1,i}^{g-1,[1]}(x_i), &{}\quad \text {for}\, g>1, \end{array}\right. \end{aligned}$$
(10)

where \(\mu _{jli}^{g[1]}(x_i)\) is the \(l\)-th spline of order \(g\). It can be seen that for \(g=2\) we obtain the triangular membership functions (8). B-splines also ensure a partition of unity, but in the general case they can activate an arbitrary number of membership functions and can be defined beyond the interval [0,1], which may be useful for the subsequent cascades (i.e., those following the first).
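Recursion (10) can be coded directly; a small recursive sketch follows (Python), where the half-open interval convention for the order-1 splines is an implementation assumption made to avoid double counting at the knots:

```python
def bspline_membership(x, centers, l, g):
    """mu_l^g(x): l-th B-spline of order g over the knot sequence 'centers',
    following recursion (10). For g = 2 this reproduces the triangular
    membership functions (8), with the index shifted by one."""
    if g == 1:
        # half-open interval [c_l, c_{l+1}) is an implementation assumption
        return 1.0 if centers[l] <= x < centers[l + 1] else 0.0
    left = (x - centers[l]) / (centers[l + g - 1] - centers[l])
    right = (centers[l + g] - x) / (centers[l + g] - centers[l + 1])
    return (left * bspline_membership(x, centers, l, g - 1)
            + right * bspline_membership(x, centers, l + 1, g - 1))

# usage: an order-2 spline on evenly spaced centers is a triangle on [0.25, 0.75]
centers = [0.0, 0.25, 0.5, 0.75, 1.0]
print(bspline_membership(0.33, centers, l=1, g=2))   # 0.32, rising branch
```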

It is clear that other structures such as polynomial harmonic functions, wavelets, orthogonal functions, etc. can be used as membership functions for non-linear synapses.

It is still unclear which of these functions provides the best results, which is why the idea of using not a single neuron but a pool of neurons with different membership and activation functions appears promising.

Similarly to (7), we can determine the outputs of the remaining cascades: the outputs of the neurons of the second cascade are

$$\begin{aligned} \hat{y}_{j}^{[2]}\!=\!\sum _{i=1}^{n}\sum _{l=1}^{h} w_{jli}^{[2]} \mu _{jli}^{[2]}(x_i)\!+\!\sum _{l=1}^{h} w_{jl,n+1}^{[2]} \mu _{jl,n\!+\!1}^{[2]} (\hat{y}^{*[1]}), \end{aligned}$$

and the outputs of the \(m\mathrm{th}\) cascade are

$$\begin{aligned} \begin{aligned} \hat{y}_{j}^{[m]}&=\sum _{i=1}^{n}\sum _{l=1}^{h} w_{jli}^{[m]} \mu _{jli}^{[m]}(x_i)\\&\quad +\sum _{p=n+1}^{n+m-1}\sum _{l=1}^{h} w_{jlp}^{[m]} \mu _{jlp}^{[m]} (\hat{y}^{*[p-n]}). \end{aligned} \end{aligned}$$

Thus, a cascade network formed from neo-fuzzy neurons and consisting of \(m\) cascades contains \(h\big ( mn+\sum _{p=1}^{m-1}p \big )\) adjustable parameters. Introducing a vector of membership functions for the \(j\)-th neo-fuzzy neuron in the \(m\)-th cascade,

$$\begin{aligned} \begin{aligned}&\mu _{j}^{[m]}(k)=(\mu _{j11}^{[m]}(x_{1}(k)),\ldots ,\mu _{jh1}^{[m]}(x_{1}(k)),\mu _{j12}^{[m]}(x_{2}(k)),\\&\qquad \ldots ,\mu _{jh2}^{[m]}(x_{2}(k)),\ldots ,\mu _{jli}^{[m]}(x_{i}(k)),\ldots ,\mu _{jhn}^{[m]}(x_{n}(k)),\\&\qquad \mu _{j1,n+1}^{[m]}(\hat{y}^{*[1]}(k)),\ldots ,\mu _{jh,n+m-1}^{[m]}(\hat{y}^{*[m-1]}(k)))^{T} \end{aligned} \end{aligned}$$

and a corresponding vector of synaptic weights,

$$\begin{aligned} \begin{aligned} w_{j}^{[m]}&=(w_{j11}^{[m]},\ldots ,w_{jh1}^{[m]},w_{j12}^{[m]},\ldots ,w_{jh2}^{[m]},\ldots ,\\&w_{jli}^{[m]},\ldots ,w_{jhn}^{[m]},w_{j1,n+1}^{[m]},\ldots ,w_{jh,n+m-1}^{[m]})^{T}, \end{aligned} \end{aligned}$$

we obtain an output

$$\begin{aligned} \hat{y}_{j}^{[m]}(k)=w_{j}^{[m]T}\mu _{j}^{[m]}(k). \end{aligned}$$

The learning criterion (1) for this case will be

$$\begin{aligned} E_{j}^{[m]}(k)=\frac{1}{2}(e_{j}^{[m]}(k))^{2}=\frac{1}{2}(y(k)-w_{j}^{[m]T}\mu _{j}^{[m]}(k))^{2} \end{aligned}$$
(11)

and its minimization can be achieved using a “sliding window” modification of the procedure from Bodyanskiy et al. (1986):

$$\begin{aligned} \left\{ \begin{array}{ll} &{} w_{j}^{[m]}(k+1)=w_{j}^{[m]}(k) \\ &{}\,\,\,+\frac{e_{j}^{[m]}(k+1)\mu _{j}^{[m]}(k+1)}{r_{j}^{[m]}(k+1)} \\ &{} r_{j}^{[m]}(k+1)=r_{j}^{[m]}(k)+||\mu _{j}^{[m]}(k+1)||^{2}\\ &{} \,\,\,\,-\Vert \mu _{j}^{[m]}(k-s)\Vert ^{2} \end{array} \right. \end{aligned}$$
(12)

or when \(s=1\) (Bodyanskiy et al. 2003a):

$$\begin{aligned} w_{j}^{[m]}(k+1)=w_{j}^{[m]}(k)+\frac{e_{j}^{[m]}(k+1)\mu _{j}^{[m]}(k+1)}{\Vert \mu _{j}^{[m]}(k+1)\Vert ^{2}}, \end{aligned}$$

which reduces to the one-step-optimal Kaczmarz–Widrow–Hoff algorithm. It is clear that other algorithms could be used instead of (12), for example, the exponentially weighted recurrent least-squares method (EWRLSM) used in DENFIS (Kasabov and Song 2002), eTS (Angelov and Filev 2004) and FLEXFIS (Angelov et al. 2005; Lughofer 2008c). One should remember, however, that the EWRLSM may become unstable when the forgetting factor is rather small. When the criterion with the momentum term (4) is used instead of (11), we obtain the final learning algorithm for the neo-fuzzy neuron:

$$\begin{aligned} \left\{ \begin{array}{ll} &{} w_{j}^{[m]}(k+1)=w_{j}^{[m]}(k) \\ &{}\quad +\Bigg (\frac{\eta e_{j}^{[m]}(k+1)\mu _{j}^{[m]}(k+1)}{r_{j}^{[m]}(k+1)}\\ &{}\quad +\frac{(1-\eta )(w_{j}^{[m]}(k)-w_{j}^{[m]}(k-1))}{r_{j}^{[m]}(k+1)}\Bigg ), \\ &{} r_{j}^{[m]}(k+1)=r_{j}^{[m]}(k)+||\mu _{j}^{[m]}(k+1)||^{2}\\ &{}\quad -\Vert \mu _{j}^{[m]}(k-s)\Vert ^{2}. \end{array} \right. \end{aligned}$$
(13)

It should be kept in mind that, since the NFN output depends linearly on its synaptic weights, any adaptive algorithm of linear identification (Ljung 1999) can be used (second-order recursive least-squares methods, robust algorithms, algorithms that discard outdated information, etc.), which allows non-stationary signals to be processed in an online mode.
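A minimal sketch of one neo-fuzzy node trained by procedure (13) is given below (Python/NumPy); the node operates directly on the membership vector \(\mu _j^{[m]}(k)\) (built, for example, with the triangular functions sketched earlier), and the initial value of \(r\), the window length and the default \(\eta\) are illustrative assumptions. With `window=1` and `eta=1` the update reduces to the Kaczmarz–Widrow–Hoff form shown above.

```python
import numpy as np
from collections import deque

class NeoFuzzyNode:
    """Online training of a neo-fuzzy neuron by the sliding-window procedure (13).

    The node is linear in its weights: y_hat = w^T mu, where mu is the vector
    of membership degrees of all non-linear synapses of the node.
    """

    def __init__(self, dim, eta=0.9, window=10):
        self.w = np.zeros(dim)                 # synaptic weights (singletons)
        self.prev_step = np.zeros(dim)         # w(k) - w(k-1), momentum term
        self.eta = eta                         # 0 < eta <= 1
        self.r = 1e-3                          # r(k), kept strictly positive
        self.norms = deque(maxlen=window)      # ||mu||^2 values inside the window

    def predict(self, mu):
        return self.w @ mu

    def update(self, mu, y):
        e = y - self.w @ mu                    # a priori error e(k+1)
        # r(k+1) = r(k) + ||mu(k+1)||^2 - ||mu(k-s)||^2
        if len(self.norms) == self.norms.maxlen:
            self.r -= self.norms[0]
        self.norms.append(mu @ mu)
        self.r += mu @ mu
        step = (self.eta * e * mu + (1.0 - self.eta) * self.prev_step) / self.r
        self.prev_step = step
        self.w += step
        return self.w @ mu                     # a posteriori output
```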

5 Optimization of the pool output

The outputs generated by the neurons of each pool are combined by the corresponding generalizing neuron \(GN^{[m]}\), whose output \(\hat{y}^{*[m]}(k)\) must be more accurate than any individual output \(\hat{y}_{j}^{[m]}(k)\). This task can be solved within the framework of neural network ensembles. Since the well-known ensemble algorithms are not designed to operate in an online mode, adaptive generalizing forecasting can be used in this case (Bodyanskiy et al. 1983, 1989, 1999, 2001a; Bodyanskiy and Pliss 1990; Bodyanskiy and Vorobyov 2000).

Let us introduce the vector of pool outputs of the \(m\)-th cascade (the inputs of \(GN^{[m]}\)):

$$\begin{aligned} \hat{y}^{[m]}(k)=(\hat{y}_{1}^{[m]}(k),\hat{y}_{2}^{[m]}(k),\ldots ,\hat{y}_{q}^{[m]}(k))^{T}; \end{aligned}$$

then an optimal output of the neuron \(GN^{[m]}\), which is in essence an adaptive linear associator (Cichocki and Unbehauen 1993; Haykin 1999), can be defined as

$$\begin{aligned} \hat{y}^{*[m]}(k)=\sum _{j=1}^{q} c_{j}^{[m]}\hat{y}_{j}^{[m]}(k)=c^{[m]T}\hat{y}^{[m]}(k) \end{aligned}$$

subject to the unbiasedness constraint

$$\begin{aligned} \sum _{j=1}^{q} c_{j}^{[m]}=E^{T}c^{[m]}=1, \end{aligned}$$
(14)

where \(c^{[m]}=(c_{1}^{[m]},c_{2}^{[m]},\ldots ,c_{q}^{[m]})^{T}\), \(E=(1,1,\ldots ,1)^{T}\) are \((q\times 1)\)-vectors.

Introducing a learning criterion on a sliding window

$$\begin{aligned} \begin{aligned} E^{[m]}(k)&=\frac{1}{2}\sum _{\tau =k-s+1}^{k}(y(\tau )-\hat{y}^{*[m]}(\tau ))^{2}\\&=\frac{1}{2}\sum _{\tau =k-s+1}^{k}(y(\tau )-c^{[m]T}\hat{y}^{[m]}(\tau ))^{2}, \end{aligned} \end{aligned}$$

and taking into account the constraint (14), we obtain the Lagrangian function

$$\begin{aligned} L^{[m]}(k)=E^{[m]}(k)+\lambda (1-E^{T}c^{[m]}) \end{aligned}$$
(15)

where \(\lambda \) is an undetermined Lagrange multiplier.

Direct minimization of (15) with respect to \(c^{[m]}\) gives

$$\begin{aligned} \left\{ \begin{array}{ll} &{} \hat{y}^{*[m]}(k+1)=\frac{\hat{y}^{[m]T}(k+1)P^{[m]}(k+1)E}{E^{T}P^{[m]}(k+1)E}, \\ &{} P^{[m]}(k+1)=\Bigg (\sum _{\tau =k-s+2}^{k+1}\hat{y}^{[m]}(\tau )\hat{y}^{[m]T}(\tau )\Bigg )^{-1} \end{array} \right. \end{aligned}$$
(16)

or in a recurrent form:

$$\begin{aligned} \left\{ \begin{array}{ll} &{} \tilde{P}^{[m]}(k+1)=P^{[m]}(k)\\ &{}\quad -\frac{P^{[m]}(k)\hat{y}^{[m]}(k+1)\hat{y}^{[m]T}(k+1)P^{[m]}(k)}{1+\hat{y}^{[m]T}(k+1)P^{[m]}(k)\hat{y}^{[m]}(k+1)}, \\ &{} P^{[m]}(k+1)=\tilde{P}^{[m]}(k+1)\\ &{}\quad +\frac{\tilde{P}^{[m]}(k+1)\hat{y}^{[m]}(k-s+1)\hat{y}^{[m]T}(k-s+1)\tilde{P}^{[m]}(k+1)}{1-\hat{y}^{[m]T}(k-s+1)\tilde{P}^{[m]}(k+1)\hat{y}^{[m]}(k-s+1)},\\ &{} \hat{y}^{*[m]}(k+1)=\frac{\hat{y}^{[m]T}(k+1)P^{[m]}(k+1)E}{E^{T}P^{[m]}(k+1)E}.\\ \end{array} \right. \end{aligned}$$
(17)

When \(s=1\), expressions (16) and (17) take an extremely simple form:

$$\begin{aligned} \hat{y}^{*[m]}(k+1)&= \frac{\hat{y}^{[m]T}(k+1)\hat{y}^{[m]}(k+1)}{E^{T}\hat{y}^{[m]}(k+1)}\nonumber \\&= \frac{||\hat{y}^{[m]}(k\!+\!1)||^{2}}{E^{T}\hat{y}^{[m]}(k\!+\!1)} =\frac{\sum _{j=1}^{q} (\hat{y}_{j}^{[m]}(k\!+\!1))^{2}}{\sum _{j=1}^{q} \hat{y}_{j}^{[m]}(k\!+\!1)}.\nonumber \\ \end{aligned}$$
(18)

It is important to note that the training of both the neo-fuzzy neurons and the generalizing neurons can be organized in an online adaptive mode. In this case the weights of the neurons in all previous cascades are not frozen but are constantly adjusted, and the number of cascades can both grow and shrink in real time, which distinguishes the proposed neural network from other well-known cascade systems.
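A direct sketch of the generalizing neuron computed from (16) over a sliding window of pool outputs is shown below (Python/NumPy); the use of a pseudo-inverse and the fallback to the plain mean for a degenerate window are numerical safeguards assumed for the sketch rather than parts of (16). For a window of length one the result coincides with (18).

```python
import numpy as np
from collections import deque

class GeneralizingNeuron:
    """Optimal unbiased combination of the pool outputs, Eq. (16)."""

    def __init__(self, q, window=20):
        self.window = deque(maxlen=window)     # last s pool-output vectors
        self.E = np.ones(q)                    # vector of ones

    def combine(self, y_pool):
        """y_pool: vector (y_1^[m](k+1), ..., y_q^[m](k+1)) of the current pool."""
        y_pool = np.asarray(y_pool, dtype=float)
        self.window.append(y_pool)
        Y = np.array(self.window)              # s x q matrix of recent outputs
        # P = (sum_tau y(tau) y(tau)^T)^{-1}; pinv used as a numerical safeguard
        P = np.linalg.pinv(Y.T @ Y)
        num = y_pool @ P @ self.E
        den = self.E @ P @ self.E
        if den == 0.0:                         # degenerate window: fall back to mean
            return float(np.mean(y_pool))
        return float(num / den)

# for a window of length 1 this reduces to Eq. (18):
# y* = ||y_pool||^2 / sum_j y_j
```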

6 Experimental results

For a comparison with the evolving approaches eTS (Angelov and Filev 2004), \(\text {Simp}\_{\text {eTS}}\) (Angelov and Filev 2005), SAFIS (Rong et al. 2006), MRAN (Yingwei et al. 1997), RANEKF (Kadirkamanathan and Niranjan 1993), and FLEXFIS (Angelov et al. 2005; Lughofer 2008c), the following nonlinear dynamic system identification example was used (the results of these methods are reported in Angelov and Zhou (2006)):

$$\begin{aligned} y(n+1)=\frac{y(n)y(n-1)(y(n)-0.5)}{1 + y^{2}(n) + y^{2}(n-1)} + u(n), \end{aligned}$$
(19)

where \(u(n)=\sin (2\pi n/25)\), \(y(0)=0\), and \(y(1)=0\). For the incremental and evolving training procedures, 5,000 samples were created starting with \(y(0)=0\); a further 200 test samples were created for computing the root mean-squared error (RMSE), which serves as a reliable estimate of the generalization error.
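For reference, the data for (19) can be generated in a few lines (Python/NumPy), using \(u(n)=\sin (2\pi n/25)\) as given above; the choice to take the 200 test samples as a continuation of the same trajectory is an assumption of the sketch:

```python
import numpy as np

def generate_system_19(n_samples, y0=0.0, y1=0.0):
    """Simulate the nonlinear dynamic system (19) with u(n) = sin(2*pi*n/25)."""
    y = np.zeros(n_samples)
    y[0], y[1] = y0, y1
    for n in range(1, n_samples - 1):
        u = np.sin(2.0 * np.pi * n / 25.0)
        y[n + 1] = (y[n] * y[n - 1] * (y[n] - 0.5)
                    / (1.0 + y[n] ** 2 + y[n - 1] ** 2) + u)
    return y

series = generate_system_19(5200)
train, test = series[:5000], series[5000:]    # 5,000 training + 200 test samples
```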

For a comparison with the evolving fuzzy modelling approaches DENFIS (Kasabov and Song 2002), eTS (Angelov and Filev 2004) and its extension exTS (Angelov and Zhou 2006) for the multi-input single-output Takagi–Sugeno model case, FLEXFIS was used for predicting the Mackey–Glass chaotic time series, given below (the results of the first method are reported in Kasabov and Song (2002), while those of the other two are reported in Angelov and Zhou (2006)):

$$\begin{aligned} \frac{\mathrm{d}x(t)}{\mathrm{d}t}=\frac{0.2x(t-\tau )}{1 + x^{10}(t-\tau )}- 0.1x(t), \end{aligned}$$
(20)

where \(x(0)=1.2\), \(\tau =17\), and \(x(t)=0\) for \(t<0\). The task is to predict \(x(t+85)\) from the input vector \([x(t-18), x(t-12), x(t-6), x(t)]\) for any value of \(t\). 3,000 training samples were collected for \(t\) in the interval [201, 3,200], and 500 test samples in the interval [5,001, 5,500] were used to compute the non-dimensional error index (NDEI), i.e., the RMSE divided by the standard deviation of the target series, on unseen data (Table 1).

Table 1 FLEXFIS, eTS, \(\text {Simp}\_{\text {eTS}}\), SAFIS, MRAN, RANEKF and Cascade NN

It should be mentioned that the proposed cascade NN works in an online mode (unlike the other systems in Table 2).

Table 2 FLEXFIS, eTS, \(\text {Simp}\_{\text {eTS}}\), SAFIS, MRAN, RANEKF and Cascade NN

The electric load data were provided by a local supplier from Kharkiv, Ukraine. The data describe the hourly electric load in that region in 2007 (8,760 samples with a sampling time of one hour); 6,132 data points were used for training and 2,628 for testing. The cascade neural network produced forecasts one hour ahead.

The input variables of the cascade neural network were the load one week earlier and one day earlier at the same hour as the predicted one, the load one hour earlier, the current load, the change in load from the previous hour to the current one, and the number of the current hour within the year (6 inputs altogether).

The cascade neural network used for prediction contained 4 cascades (3 neo-fuzzy neurons in each cascade, with 5 membership functions per input in each neo-fuzzy neuron). After training, the network achieved a prediction mean squared error (MSE) of 0.02165. Figure 3 shows the forecast for 3 weeks at the end of March and the beginning of April 2007 (Table 3).

Fig. 3
figure 3

The electric energy forecast

Table 3 Cascades’ forecasting accuracy

This period includes the Easter holidays and the change-over from winter to summer time, so electricity consumption is less regular than during most other weeks of the year. For this period alone, the cascade network provided a prediction with MSE = 0.02886; for comparison, the MSE for July 2007 was 0.0223, since there were no holidays during that month. These results are about 40 % more accurate than those previously obtained for the same time series by an RBFN and an MLP (Table 4).

Table 4 Forecasting results

The proposed cascade NN was also applied to the prediction of the Narendra time series (Narendra and Parthasarathy 1990):

$$\begin{aligned} \begin{aligned}&y(k+1)=\frac{y(k)}{1 + y^{2}(k)} + f(u(k)),\\&f(u(k))=u^{3}(k),\\&u(k)=\sin (\pi k/250), \quad \text {if} \, k<500,\\&u(k)=0.8\sin (\pi k/250) + 0.2\sin (\pi k/25), \quad \text {if} \, k\ge 500.\\ \end{aligned} \end{aligned}$$
(21)
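The Narendra series (21) can be reproduced as follows (Python/NumPy); the length of the simulated series and the zero initial condition are illustrative assumptions:

```python
import numpy as np

def generate_narendra(n_samples=1000, y0=0.0):
    """Simulate the Narendra benchmark (21)."""
    y = np.zeros(n_samples)
    y[0] = y0
    for k in range(n_samples - 1):
        if k < 500:
            u = np.sin(np.pi * k / 250.0)
        else:
            u = 0.8 * np.sin(np.pi * k / 250.0) + 0.2 * np.sin(np.pi * k / 25.0)
        y[k + 1] = y[k] / (1.0 + y[k] ** 2) + u ** 3   # f(u) = u^3
    return y

series = generate_narendra()
```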

The number of inputs is 6, the number of cascades is 5, and the number of neurons in each cascade is 3 (Table 5).

Table 5 The Narendra time series prediction

7 Summary

This paper proposes a new architecture and learning algorithms for a hybrid cascade neural network with pool optimization in each cascade. The proposed system differs from existing cascade systems in its ability to operate in an online mode, which allows it to process non-stationary, stochastic and chaotic nonlinear signals with the required accuracy. Compared with the well-known evolving neuro-fuzzy systems based on Takagi–Sugeno fuzzy reasoning, the proposed system is computationally simpler and possesses both tracking and filtering capabilities.