
1 Introduction

Combinations of different neuro-fuzzy systems are of considerable use nowadays for a wide variety of data processing problems. This is explained by a number of advantages that neuro-fuzzy systems hold over other existing methods, which stem from their ability to learn as well as from their universal approximation capabilities.

The quality of the training procedure can be improved by adapting both a network’s set of synaptic weights and its topology [1,2,3,4,5,6,7,8]. This idea underlies evolving (growing) systems of computational intelligence [9,10,11]. It is worth mentioning that probably one of the most successful implementations of this approach is cascade-correlation neural networks [12,13,14], owing to their high efficiency and the simplicity of learning both the network structure and the synaptic weights. In general terms, such a network starts with a rather simple architecture containing an ensemble of neurons to be trained independently (the first cascade). Every neuron in the ensemble may possess a different activation function and learning procedure. Nodes (neurons) in the ensemble do not interact with each other while they are being trained.

Eventually, when all the elements in the ensemble of the first cascade have had their weights adapted, the neuron that is best with respect to a learning criterion forms the first cascade, and its synaptic weights are no longer adjusted. Next, the second cascade is formed from similar neurons in a training ensemble. The sole difference is that the neurons to be trained in the ensemble of the second cascade have an additional input (and consequently an additional synaptic weight), which is the output of the first cascade. In the same way as in the first cascade, the second one discards all elements except the single one that gives the best performance; its synaptic weights are then fixed. Nodes in the third cascade have two additional inputs, namely the outputs of the first and second cascades. The growing network keeps adding new cascades to its topology until it reaches the required quality of the results on the given training set.

To avoid multi-epoch learning [15,16,17,18,19,20,21,22,23], various kinds of neurons (preferably ones whose outputs depend linearly on the synaptic weights) may be used as the network’s elements. This makes it possible to exploit speed-optimal learning algorithms and to handle data as it arrives at the network. At the same time, if the system is trained in an online manner, it is impossible to detect the best neuron in the ensemble: while handling non-stationary data, one node of a training ensemble may prove to be the best element for one part of the training sample but not for the other parts. It is therefore reasonable to keep all the units in the training ensemble and to use a specific optimization method (selected in agreement with a general quality criterion for the network) to estimate the output of the cascade.

It should be noted that the widely recognized cascade neural networks implement a non-linear mapping \( R^{n} \to R^{1} \), which means that a conventional cascade neural network is a system with a single output. By contrast, many problems solved by means of neuro-fuzzy systems require a multidimensional mapping \( R^{n} \to R^{g} \), so the number of elements to be trained in every cascade is \( g \) times larger than in a conventional cascade network, which makes such a system too cumbersome. Hence, it seems reasonable to use a specific multidimensional neuron with multiple outputs as the cascade network’s unit instead of the traditional one.

The growing cascade neuro-fuzzy system of computational intelligence described below is an attempt to develop a system for handling a data stream that is fed to the system in an online fashion and that has far fewer parameters to be tuned than other widely recognized analogues.

2 An Architecture of the Hybrid Growing System

A scheme of the introduced hybrid system is shown in Fig. 1. In fact, it coincides with the architecture of the hybrid evolving neural network with an optimized ensemble in every cascade developed in [24,25,26,27,28,29]. The basic difference lies in the type of elements used and, accordingly, in the learning procedures.

Fig. 1. An architecture of the growing neuro-fuzzy system.

A network’s input is described by a vector signal \( x\left( k \right) = \left( {x_{1} \left( k \right),x_{2} \left( k \right), \ldots ,x_{n} \left( k \right)} \right)^{T} \), where \( k = 1,2, \ldots \) stands either for the number of an observation in the “object-property” table or for the index of the current discrete time. These signals are sent to the inputs of each neuron \( MN_{j}^{[m]} \) in the system (\( j = 1,2, \ldots ,q \) denotes the neuron’s index in a training ensemble, \( m = 1,2, \ldots \) specifies the cascade’s number). A vector output \( \hat{y}^{[m]j} \left( k \right) = \left( {\hat{y}_{1}^{[m]j} \left( k \right),\hat{y}_{2}^{[m]j} \left( k \right), \ldots ,\hat{y}_{d}^{[m]j} \left( k \right), \ldots ,\hat{y}_{g}^{[m]j} \left( k \right)} \right)^{T} \), \( d = 1,2, \ldots ,g \), is eventually produced. These outputs are then fed to a generalizing neuron \( GMN^{[m]} \), which produces an optimized vector output \( \hat{y}^{*[m]} \left( k \right) \) for the cascade m. While the nodes of the first cascade receive only \( x\left( k \right) \) as input, elements of the second cascade take \( g \) additional incoming signals given by the obtained signal \( \hat{y}^{*[1]} \left( k \right) \), neurons in the third cascade have \( 2g \) additional inputs \( \hat{y}^{*[1]} \left( k \right),\hat{y}^{*[2]} \left( k \right) \), whilst neurons in the m-th cascade have \( \left( {m - 1} \right)g \) additional incoming signals \( \hat{y}^{*[1]} \left( k \right),\hat{y}^{*[2]} \left( k \right), \ldots ,\hat{y}^{*[m - 1]} \left( k \right) \). New cascades are added to the hybrid system during learning as soon as it becomes clear that the architecture with the current number of cascades does not provide the required accuracy.
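
As a simple illustration, the sketch below (a hypothetical NumPy helper, not taken from the paper) shows how the input of a neuron in the \( m \)-th cascade is assembled by concatenating \( x\left( k \right) \) with the optimized outputs of all preceding cascades, giving \( n + \left( {m - 1} \right)g \) components in total.

```python
import numpy as np

# Hypothetical helper: builds the input of a neuron in cascade m by
# concatenating the original input x(k) with the optimized outputs
# y*[1](k), ..., y*[m-1](k) of the preceding cascades.
def cascade_input(x, previous_outputs):
    """x: (n,) input vector; previous_outputs: list of (g,) cascade outputs."""
    return np.concatenate([x, *previous_outputs])

# e.g. a third-cascade neuron with n = 4 inputs and g = 2 outputs per cascade
x = np.random.rand(4)
inp = cascade_input(x, [np.zeros(2), np.zeros(2)])   # shape (4 + 2*2,) = (8,)
```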

Since the output signal of a conventional neo-fuzzy neuron [30,31,32] depends linearly on its synaptic weights, any adaptive identification algorithm [33,34,35] may be applied to train the network’s neo-fuzzy neurons, for example the exponentially-weighted least-squares method in the recurrent form

$$ \left\{ \begin{aligned} & w_{d}^{[m]j} \left( {k + 1} \right) = w_{d}^{[m]j} \left( k \right) + \frac{{P_{d}^{[m]j} \left( k \right)\left( {y^{d} \left( {k + 1} \right) - \left( {w_{d}^{[m]j} \left( k \right)} \right)^{T} \mu_{d}^{[m]j} \left( {k + 1} \right)} \right)}}{{\alpha + \left( {\mu_{d}^{[m]j} \left( {k + 1} \right)} \right)^{T} P_{d}^{[m]j} \left( k \right)\mu_{d}^{[m]j} \left( {k + 1} \right)}}\mu_{d}^{[m]j} \left( {k + 1} \right), \\ & P_{d}^{[m]j} \left( {k + 1} \right) = \frac{1}{\alpha }\left( {P_{d}^{[m]j} \left( k \right) - \frac{{P_{d}^{[m]j} \left( k \right)\mu_{d}^{[m]j} \left( {k + 1} \right)\left( {\mu_{d}^{[m]j} \left( {k + 1} \right)} \right)^{T} P_{d}^{[m]j} \left( k \right)}}{{\alpha + \left( {\mu_{d}^{[m]j} \left( {k + 1} \right)} \right)^{T} P_{d}^{[m]j} \left( k \right)\mu_{d}^{[m]j} \left( {k + 1} \right)}}} \right) \\ \end{aligned} \right. $$
(1)

(here \( y^{d} \left( {k + 1} \right),\,d = 1,2, \ldots ,g \) specifies an external learning signal and \( 0 < \alpha \le 1 \) is a forgetting factor) or the gradient learning algorithm with both tracking and filtering properties [35]

$$ \left\{ \begin{aligned} & w_{d}^{[m]j} \left( {k + 1} \right) = w_{d}^{[m]j} \left( k \right) + \frac{{y^{d} \left( {k + 1} \right) - \left( {w_{d}^{[m]j} \left( k \right)} \right)^{T} \mu_{d}^{[m]j} \left( {k + 1} \right)}}{{r_{d}^{[m]j} \left( {k + 1} \right)}}\mu_{d}^{[m]j} \left( {k + 1} \right), \\ & r_{d}^{[m]j} \left( {k + 1} \right) = \alpha r_{d}^{[m]j} \left( k \right) + \left\| {\mu_{d}^{[m]j} \left( {k + 1} \right)} \right\|^{2} ,\,\,0 \le \alpha \le 1. \\ \end{aligned} \right. $$
(2)
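
For illustration, the following minimal sketch (assuming NumPy; the helper name ewrls_step is introduced here and is not from the paper) performs one step of the exponentially-weighted recurrent least-squares update (1) for a single output \( d \) of a neo-fuzzy neuron.

```python
import numpy as np

# One step of the exponentially-weighted recurrent LS update (1) for a single
# output: w is the weight vector, P the covariance-like matrix, mu the vector
# of membership degrees at step k+1, y the external learning signal y^d(k+1)
# and alpha the forgetting factor (0 < alpha <= 1).
def ewrls_step(w, P, mu, y, alpha=0.97):
    denom = alpha + mu @ P @ mu
    err = y - w @ mu
    w_new = w + (P @ mu) * err / denom
    P_new = (P - np.outer(P @ mu, mu @ P) / denom) / alpha
    return w_new, P_new

# usage: hn-dimensional membership vector, large initial P
hn = 6
w, P = np.zeros(hn), 1e3 * np.eye(hn)
w, P = ewrls_step(w, P, np.random.rand(hn), y=0.5)
```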

Building the multidimensional neuron \( MN_{j}^{[1]} \) of the cascade system from \( g \) conventional neo-fuzzy neurons (Fig. 2) is redundant, since the vector of input signals \( x\left( k \right) \) (in the first cascade) is sent to identical non-linear synapses \( NS_{di}^{[1]j} \) of each neo-fuzzy neuron, and each neuron produces a signal \( \hat{y}_{d}^{[1]j} \left( k \right),\,\,d = 1,2, \ldots ,g \) at its output. As a result, the components of the output vector \( \hat{y}^{[1]j} \left( k \right) = \left( {\hat{y}_{1}^{[1]j} \left( k \right),\hat{y}_{2}^{[1]j} \left( k \right), \ldots ,\hat{y}_{g}^{[1]j} \left( k \right)} \right)^{T} \) are computed independently.

Fig. 2. An architecture of the traditional neo-fuzzy neuron.

This redundancy can be avoided by introducing a multidimensional neo-fuzzy neuron [36], whose architecture is shown in Fig. 3 and is a modification of the system proposed in [37]. Its structural units are composite non-linear synapses \( MNS_{i}^{[1]j} \), each of which contains \( h \) membership functions \( \mu_{li}^{[1]j} \) and \( gh \) tunable synaptic weights \( w_{dli}^{[1]j} \). In this way, the multidimensional neo-fuzzy neuron in the first cascade contains \( ghn \) synaptic weights but only \( hn \) membership functions, i.e. \( g \) times fewer membership functions than if the cascade were formed of conventional neo-fuzzy neurons.

Fig. 3. An architecture of the multidimensional neo-fuzzy neuron.

Introducing a \( \left( {hn \times 1} \right) \) – vector of membership functions

\( \mu^{[1]j} \left( k \right) = \left( {\mu_{11}^{[1]j} \left( {x_{1} \left( k \right)} \right),\mu_{21}^{[1]j} \left( {x_{1} \left( k \right)} \right), \ldots ,\mu_{h1}^{[1]j} \left( {x_{1} \left( k \right)} \right), \ldots ,\mu_{li}^{[1]j} \left( {x_{i} \left( k \right)} \right), \ldots ,\mu_{hn}^{[1]j} \left( {x_{n} \left( k \right)} \right)} \right)^{T} \) and a \( \left( {g \times hn} \right) \) – matrix of synaptic weights

$$ W^{[1]j} = \left( {\begin{array}{*{20}c} {w_{111}^{[1]j} } & {w_{112}^{[1]j} } & \cdots & {\begin{array}{*{20}c} {w_{1li}^{[1]j} } & {\begin{array}{*{20}c} \cdots & {w_{1hn}^{[1]j} } \\ \end{array} } \\ \end{array} } \\ {w_{211}^{[1]j} } & {w_{212}^{[1]j} } & \cdots & {\begin{array}{*{20}c} {w_{2li}^{[1]j} } & {\begin{array}{*{20}c} \cdots & {w_{2hn}^{[1]j} } \\ \end{array} } \\ \end{array} } \\ \vdots & \vdots & \vdots & {\begin{array}{*{20}c} \vdots & {\begin{array}{*{20}c} \vdots & \vdots \\ \end{array} } \\ \end{array} } \\ {w_{g11}^{[1]j} } & {w_{g12}^{[1]j} } & \cdots & {\begin{array}{*{20}c} {w_{gli}^{[1]j} } & {\begin{array}{*{20}c} \cdots & {w_{ghn}^{[1]j} } \\ \end{array} } \\ \end{array} } \\ \end{array} } \right), $$

the output signal of \( MN_{j}^{[1]} \) at the \( k \)-th time instant can be written as

$$ \hat{y}^{[1]j} \left( k \right) = W^{[1]j} \mu^{[1]j} \left( k \right). $$
(3)
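
A minimal sketch of the forward pass (3) is given below; it assumes NumPy and, purely for illustration, triangular membership functions on a uniform grid over [0, 1] (the grid and the helper names are assumptions, not specified in the paper).

```python
import numpy as np

# Membership degrees of a scalar input in h triangular functions whose
# centers are evenly spaced on [0, 1] (an illustrative assumption).
def triangular_memberships(x_i, h):
    centers = np.linspace(0.0, 1.0, h)
    width = centers[1] - centers[0]
    return np.clip(1.0 - np.abs(x_i - centers) / width, 0.0, 1.0)

# Output (3) of the multidimensional neo-fuzzy neuron: y_hat = W @ mu,
# where W is (g x hn) and mu stacks the memberships of all n inputs.
def mn_forward(W, x, h):
    mu = np.concatenate([triangular_memberships(x_i, h) for x_i in x])
    return W @ mu, mu

n, h, g = 3, 4, 2                       # 3 inputs, 4 MFs per input, 2 outputs
W = np.random.randn(g, h * n) * 0.1
y_hat, mu = mn_forward(W, np.array([0.2, 0.5, 0.9]), h)
```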

The multidimensional neo-fuzzy neuron may be trained by applying either a matrix modification of the exponentially-weighted recurrent least-squares method (1) in the form

$$ \left\{ \begin{aligned} & W^{[1]j} \left( {k + 1} \right) = W^{[1]j} \left( k \right) + \frac{{\left( {y\left( {k + 1} \right) - W^{[1]j} \left( k \right)\mu^{[1]j} \left( {k + 1} \right)} \right)\left( {\mu^{[1]j} \left( {k + 1} \right)} \right)^{T} P^{[1]j} \left( k \right)}}{{\alpha + \left( {\mu^{[1]j} \left( {k + 1} \right)} \right)^{T} P^{[1]j} \left( k \right)\mu^{[1]j} \left( {k + 1} \right)}}, \\ & P^{[1]j} \left( {k + 1} \right) = \frac{1}{\alpha }\left( {P^{[1]j} \left( k \right) - \frac{{P^{[1]j} \left( k \right)\mu^{[1]j} \left( {k + 1} \right)\left( {\mu^{[1]j} \left( {k + 1} \right)} \right)^{T} P^{[1]j} \left( k \right)}}{{\alpha + \left( {\mu^{[1]j} \left( {k + 1} \right)} \right)^{T} P^{[1]j} \left( k \right)\mu^{[1]j} \left( {k + 1} \right)}}} \right),0 < \alpha \le 1 \\ \end{aligned} \right. $$
(4)

or a multidimensional version of the algorithm (2) [38]:

$$ \left\{ \begin{aligned} & W^{[1]j} \left( {k + 1} \right) = W^{[1]j} \left( k \right) + \frac{{y\left( {k + 1} \right) - W^{[1]j} \left( k \right)\mu^{[1]j} \left( {k + 1} \right)}}{{r^{[1]j} \left( {k + 1} \right)}}\left( {\mu^{[1]j} \left( {k + 1} \right)} \right)^{T} , \\ & r^{[1]j} \left( {k + 1} \right) = \alpha r^{[1]j} \left( k \right) + \left\| {\mu^{[1]j} \left( {k + 1} \right)} \right\|^{2} ,\,\,0 \le \alpha \le 1, \\ \end{aligned} \right. $$
(5)

here \( y\left( {k + 1} \right) = \left( {y^{1} \left( {k + 1} \right),y^{2} \left( {k + 1} \right), \ldots ,y^{g} \left( {k + 1} \right)} \right)^{T} . \)
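
One step of the multidimensional gradient rule (5) can be sketched as follows (a NumPy illustration; the function name is an assumption), with the membership vector produced as in (3).

```python
import numpy as np

# One step of the gradient learning rule (5): W is the (g x hn) weight
# matrix, r the scalar state r(k), mu the (hn,) membership vector at k+1,
# y the (g,) target vector y(k+1) and alpha the forgetting factor.
def grad_step(W, r, mu, y, alpha=0.9):
    r_new = alpha * r + mu @ mu              # r(k+1) = alpha*r(k) + ||mu||^2
    err = y - W @ mu                         # vector of prediction errors
    W_new = W + np.outer(err, mu) / r_new    # rank-one update of the weights
    return W_new, r_new
```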

The remaining cascades are trained in a similar fashion, the only difference being that the vector of membership functions \( \mu^{[m]j} \left( {k + 1} \right) \) in the \( m \)-th cascade grows by \( \left( {m - 1} \right)g \) elements, which are driven by the outputs of the preceding cascades.

3 Output Signals’ Optimization of the Multidimensional Neo-fuzzy Neuron Ensemble

The outputs generated by the neurons of each ensemble are combined by the corresponding neuron \( GMN^{[m]} \), whose output \( \hat{y}^{*[m]} \left( k \right) \) must be more accurate than any individual output \( \hat{y}_{j}^{[m]} \left( k \right) \). This task can be solved with the neural network ensemble approach. Although the well-known ensemble algorithms are not designed for online operation, the adaptive generalizing forecasting approach [39, 40] can be used in this case.

Let us introduce the vector of the ensemble’s outputs (the inputs of the generalizing neuron) for the \( m \)-th cascade

$$ \hat{y}^{[m]} \left( k \right) = \left( {\hat{y}_{1}^{[m]} \left( k \right),\hat{y}_{2}^{[m]} \left( k \right), \ldots ,\hat{y}_{q}^{[m]} \left( k \right)} \right)^{T} ; $$

then the optimal output of the neuron \( GMN^{[m]} \), which is essentially an adaptive linear associator [1,2,3,4,5,6,7,8], can be defined as

$$ \hat{y}^{*[m]} \left( k \right) = \sum\limits_{j = 1}^{q} {c_{j}^{[m]} \hat{y}_{j}^{[m]} \left( k \right) = c^{[m]T} \hat{y}^{[m]} \left( k \right)} $$

under the additional unbiasedness constraint

$$ \sum\limits_{j = 1}^{q} {c_{j}^{\left[ m \right]} } = E^{T} c^{\left[ m \right]} = 1 $$
(6)

where \( c^{\left[ m \right]} = \left( {c_{1}^{\left[ m \right]} ,\,c_{2}^{\left[ m \right]} ,\, \ldots ,\,c_{q}^{\left[ m \right]} } \right)^{T} \) and \( E = \left( {1,1, \ldots ,1} \right)^{T} \) are \( \left( {q \times 1} \right) \) – vectors.

The vector of generalization coefficients \( c^{[m]} \) can be found with the method of undetermined Lagrange multipliers. To this end, we introduce a \( \left( {k \times g} \right) \) – matrix of reference signals and a \( \left( {k \times gq} \right) \) – matrix of the ensemble’s output signals

$$ Y\left( k \right) = \left( {\begin{array}{*{20}c} {y^{T} \left( 1 \right)} \\ {y^{T} \left( 2 \right)} \\ \vdots \\ {y^{T} \left( k \right)} \\ \end{array} } \right),\,\hat{Y}^{[m]} \left( k \right) = \left( {\begin{array}{*{20}c} {\hat{y}_{1}^{[m]T} \left( 1 \right)} & {\hat{y}_{2}^{[m]T} \left( 1 \right)} & \ldots & {\hat{y}_{q}^{[m]T} \left( 1 \right)} \\ {\hat{y}_{1}^{[m]T} \left( 2 \right)} & {\hat{y}_{2}^{[m]T} \left( 2 \right)} & \cdots & {\hat{y}_{q}^{[m]T} \left( 2 \right)} \\ \vdots & \vdots & \vdots & \vdots \\ {\hat{y}_{1}^{[m]T} \left( k \right)} & {\hat{y}_{2}^{[m]T} \left( k \right)} & \cdots & {\hat{y}_{q}^{[m]T} \left( k \right)} \\ \end{array} } \right), $$

a \( \left( {k \times g} \right) - \) matrix of innovations

$$ V^{[m]} \left( k \right) = Y\left( k \right) - \hat{Y}^{[m]} \left( k \right)I \otimes c^{[m]} $$

and the Lagrange function

$$ \begin{aligned} & L^{[m]} \left( k \right) = \frac{1}{2}Tr\left( {V^{[m]T} \left( k \right)V^{[m]} \left( k \right)} \right) + \lambda \left( {E^{T} c^{[m]} - 1} \right) \\ & = \frac{1}{2}Tr\left( {Y\left( k \right) - \hat{Y}^{[m]} \left( k \right)I \otimes c^{[m]} } \right)^{T} \left( {Y\left( k \right) - \hat{Y}^{[m]} \left( k \right)I \otimes c^{[m]} } \right) + \lambda \left( {E^{T} c^{[m]} - 1} \right) \\ & = \frac{1}{2}\sum\limits_{\tau = 1}^{k} {\left\| {y\left( \tau \right) - \hat{y}^{[m]} \left( \tau \right)c^{[m]} } \right\|}^{2} + \lambda \left( {E^{T} c^{[m]} - 1} \right). \\ \end{aligned} $$
(7)

Here \( I \) is a \( \left( {g \times g} \right) - \) identity matrix, \( \otimes \) is the tensor product symbol, \( \lambda \) stands for an undetermined Lagrange multiplier.

Solving the Karush-Kuhn-Tucker system of equations

$$ \left\{ \begin{aligned} & \nabla_{{c^{[m]} }} L^{[m]} \left( k \right) = \sum\limits_{\tau = 1}^{k} {\left( { - \hat{y}^{[m]T} \left( \tau \right)y\left( \tau \right) + \hat{y}^{[m]T} \left( \tau \right)\hat{y}^{[m]} \left( \tau \right)c^{[m]} } \right) + \lambda E = \vec{0},} \\ & \frac{{\partial L^{[m]} \left( k \right)}}{\partial \lambda } = E^{T} c^{[m]} - 1 = 0 \\ \end{aligned} \right. $$

yields the desired vector of generalization coefficients

$$ c^{[m]} \left( k \right) = c^{*[m]} \left( k \right) + P^{[m]} \left( k \right)\frac{{1 - E^{T} c^{*[m]} \left( k \right)}}{{E^{T} P^{[m]} \left( k \right)E}}E $$
(8)

where

$$ \left\{ \begin{aligned} & P^{[m]} \left( k \right) = \left( {\sum\limits_{\tau = 1}^{k} {\hat{y}^{[m]T} \left( \tau \right)\hat{y}^{[m]} \left( \tau \right)} } \right)^{ - 1} , \\ & c^{*[m]} \left( k \right) = P^{[m]} \left( k \right)\sum\limits_{\tau = 1}^{k} {\hat{y}^{[m]T} \left( \tau \right)y\left( \tau \right)} = P^{[m]} \left( k \right)p^{[m]} \left( k \right), \\ \end{aligned} \right. $$

\( c^{*[m]} \left( k \right) \) is the conventional least-squares estimate obtained from the \( k \) observations available so far.
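
A batch sketch of the constrained combination (8) is shown below (assuming NumPy; as an illustrative convention, every element of the list Y_hat is the (g × q) matrix of ensemble outputs at one time step and every element of Y is the corresponding (g,) reference vector).

```python
import numpy as np

# Constrained least-squares combination weights (8) for one cascade.
def generalization_weights(Y_hat, Y):
    q = Y_hat[0].shape[1]
    R = sum(Yh.T @ Yh for Yh in Y_hat)                 # (q x q) information matrix
    p = sum(Yh.T @ y for Yh, y in zip(Y_hat, Y))       # (q,) cross-correlation
    P = np.linalg.inv(R)                               # P^[m](k)
    c_star = P @ p                                     # ordinary LS estimate
    E = np.ones(q)
    return c_star + P @ E * (1.0 - E @ c_star) / (E @ P @ E)   # eq. (8)
```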

To examine the properties of the obtained vector of generalization coefficients, a few straightforward transformations are needed. Noting that the vector of learning errors of the neuron \( GMN^{[m]} \) can be written in the form

$$ \begin{aligned} e^{[m]} \left( k \right) & = y\left( k \right) - \hat{y}^{*[m]} \left( k \right) = y\left( k \right) - \hat{y}^{[m]} \left( k \right)c^{[m]} \\ & = y\left( k \right)E^{T} c^{[m]} - \hat{y}^{[m]} \left( k \right)c^{[m]} \\ & = \left( {y\left( k \right)E^{T} - \hat{y}^{[m]} \left( k \right)} \right)c^{[m]} = \upsilon^{[m]} \left( k \right)c^{[m]} , \\ \end{aligned} $$

the Lagrange function (7) can also be written in the form

$$ \begin{aligned} & L^{[m]} \left( k \right) = \frac{1}{2}\sum\limits_{\tau = 1}^{k} {c^{[m]T} \upsilon^{[m]} \left( \tau \right)\upsilon^{[m]T} \left( \tau \right)} c^{[m]} + \lambda \left( {E^{T} c^{[m]} - 1} \right) \\ & = \frac{1}{2}c^{[m]T} R^{[m]} \left( k \right)c^{[m]} + \lambda \left( {E^{T} c^{[m]} - 1} \right) \\ \end{aligned} $$

and then solving a system of equations

$$ \left\{ \begin{aligned} & \nabla_{{c^{[m]} }} L^{[m]} \left( k \right) = R^{[m]} \left( k \right)c^{[m]} + \lambda E = \vec{0}, \\ & \frac{{\partial L^{[m]} }}{\partial \lambda } = E^{T} c^{[m]} - 1 = 0, \\ \end{aligned} \right. $$

we obtain

$$ \left\{ \begin{aligned} & c^{[m]} \left( k \right) = \left( {R^{[m]} \left( k \right)} \right)^{ - 1} E\left( {E^{T} \left( {R^{[m]} \left( k \right)} \right)^{ - 1} E} \right)^{ - 1} , \\ & \lambda = - 2E^{T} \left( {R^{[m]} \left( k \right)} \right)^{ - 1} E \\ \end{aligned} \right. $$

where \( R^{[m]} \left( k \right) = \sum\limits_{\tau = 1}^{k} {\upsilon^{[m]} \left( \tau \right)} \upsilon^{[m]T} \left( \tau \right) = V^{[m]T} \left( k \right)V^{[m]} \left( k \right). \)

The value of the Lagrange function at the saddle point is

$$ L^{*} \left( k \right) = \left( {E^{T} \left( {R^{[m]} \left( k \right)} \right)^{ - 1} E} \right)^{ - 1} , $$

and analyzing it with the help of the Cauchy-Schwarz inequality, it can be shown that the generalized output signal \( \hat{y}^{*[m]} \left( k \right) \) is not inferior in accuracy to the best neuron \( \hat{y}^{[m]j} \left( k \right) \), \( j = 1,2, \ldots ,q \), in the ensemble.

In order to process information in an online manner, the expression (8) should be rewritten in a recurrent form, which (by using the Sherman-Morrison-Woodbury formula) takes the form

$$ \left\{ \begin{aligned} & P^{\left[ m \right]} \left( {k + 1} \right) = P^{\left[ m \right]} \left( k \right) - P^{\left[ m \right]} \left( k \right)\hat{y}^{\left[ m \right]T} \left( {k + 1} \right)\left( {I + \hat{y}^{\left[ m \right]} \left( {k + 1} \right)P^{\left[ m \right]} \left( k \right)\hat{y}^{\left[ m \right]T} \left( {k + 1} \right)} \right)^{ - 1} \\ & \cdot \hat{y}^{[m]} \left( {k + 1} \right)P^{[m]} \left( k \right) = \left( {I - P^{[m]} \left( k \right)\hat{y}^{[m]T} \left( {k + 1} \right)\hat{y}^{[m]} \left( {k + 1} \right)} \right)^{ - 1} P^{[m]} \left( k \right), \\ & p^{[m]} \left( {k + 1} \right) = p^{[m]} \left( k \right) + \hat{y}^{[m]T} \left( {k + 1} \right)y\left( {k + 1} \right), \\ & c^{*[m]} \left( {k + 1} \right) = P^{[m]} \left( {k + 1} \right)p^{[m]} \left( {k + 1} \right), \\ & c^{[m]} \left( {k + 1} \right) = c^{*[m]} \left( {k + 1} \right) + P^{[m]} \left( {k + 1} \right)\left( {E^{T} P^{[m]} \left( {k + 1} \right)E} \right)^{ - 1} \left( {1 - E^{T} c^{*[m]} \left( {k + 1} \right)} \right)E. \\ \end{aligned} \right. $$
(9)
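
One step of the recurrent procedure (9) can be sketched as follows (a NumPy illustration using the matrix-inversion-lemma form of the first relation; here y_hat denotes the (g × q) matrix of ensemble outputs at step k + 1).

```python
import numpy as np

# One step of the recurrent procedure (9): P is (q x q), p is (q,),
# y_hat is the (g x q) ensemble output matrix and y the (g,) reference.
def generalization_step(P, p, y_hat, y):
    g = y_hat.shape[0]
    S = np.linalg.inv(np.eye(g) + y_hat @ P @ y_hat.T)  # only a (g x g) inverse
    P = P - P @ y_hat.T @ S @ y_hat @ P                  # matrix inversion lemma
    p = p + y_hat.T @ y
    c_star = P @ p
    E = np.ones(P.shape[0])
    c = c_star + P @ E * (1.0 - E @ c_star) / (E @ P @ E)
    return P, p, c
```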

The computational burden of the algorithm (9), which is in fact a Gauss-Newton optimization procedure, is due to the inversion of \( \left( {g \times g} \right) \) – matrices at every time instant \( k \). When \( g \) is large, it is much simpler to tune the weight vector \( c^{[m]} \left( k \right) \) with gradient learning algorithms. Such an algorithm is readily obtained by applying the Arrow-Hurwicz gradient procedure to the search for the saddle point of the Lagrange function, which in this case takes the form

$$ \left\{ \begin{aligned} & c^{[m]} \left( {k + 1} \right) = c^{[m]} \left( k \right) - \eta_{c} \left( {k + 1} \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right), \\ & \lambda \left( {k + 1} \right) = \lambda \left( k \right) + \eta_{\lambda } \left( {k + 1} \right)\frac{{\partial L^{[m]} \left( k \right)}}{\partial \lambda } \\ \end{aligned} \right. $$
(10)

or, writing (10) explicitly,

$$ \left\{ \begin{aligned} & c^{[m]} \left( {k + 1} \right) = c^{[m]} \left( k \right) + \eta_{c} \left( {k + 1} \right)\left( {\hat{y}^{[m]T} \left( k \right)e^{[m]} \left( k \right) - \lambda \left( k \right)E} \right), \\ & \lambda \left( {k + 1} \right) = \lambda \left( k \right) + \eta_{\lambda } \left( {k + 1} \right)\left( {E^{T} c^{[m]} \left( {k + 1} \right) - 1} \right) \\ \end{aligned} \right. $$
(11)

where \( \eta_{c} \left( {k + 1} \right) \), \( \eta_{\lambda } \left( {k + 1} \right) \) are some learning rate parameters.

The Arrow-Hurwicz procedure converges to a saddle point of the Lagrange function for a sufficiently wide range of the learning rate parameters \( \eta_{c} \left( {k + 1} \right) \) and \( \eta_{\lambda } \left( {k + 1} \right) \). However, these parameters can be optimized to reduce the training time. For this purpose, we rewrite the expression (10) in the form

$$ \left\{ \begin{aligned} & \hat{y}^{[m]} \left( k \right)c^{[m]} \left( {k + 1} \right) = \hat{y}^{[m]} \left( k \right)c^{[m]} \left( k \right) - \eta_{c} \left( {k + 1} \right)\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right), \\ & y\left( k \right) - \hat{y}^{[m]} \left( k \right)c^{[m]} \left( {k + 1} \right) = y\left( k \right) - \hat{y}^{[m]} \left( k \right)c^{[m]} \left( k \right) + \eta_{c} \left( {k + 1} \right)\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right). \\ \end{aligned} \right. $$
(12)

The left-hand side of the second equation in (12) is the a posteriori error \( \tilde{e}^{[m]} \left( k \right) \) obtained after one cycle of parameter tuning, i.e.

$$ \tilde{e}^{[m]} \left( k \right) = e^{[m]} \left( k \right) + \eta_{c} \left( {k + 1} \right)\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right). $$

Introducing the squared norm of this error

$$ \begin{aligned} & \left\| {\tilde{e}^{[m]} \left( k \right)} \right\|^{2} = \left\| {e^{[m]} \left( k \right)} \right\|^{2} + 2\eta_{c} \left( {k + 1} \right)e^{[m]T} \left( k \right)\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right) \\ & + \eta_{c}^{2} \left( {k + 1} \right)\left\| {\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right)} \right\|^{2} \\ \end{aligned} $$

and minimizing it with respect to \( \eta_{c} \left( {k + 1} \right) \), i.e. solving the equation

$$ \frac{{\partial \left\| {\tilde{e}^{[m]} \left( k \right)} \right\|^{2} }}{{\partial \eta_{c} }} = 0, $$

we arrive at the optimal value of the learning rate parameter

$$ \eta_{c} \left( {k + 1} \right) = - \frac{{e^{[m]T} \left( k \right)\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right)}}{{\left\| {\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right)} \right\|^{2} }}. $$

The algorithms (10) and (11) can then finally be written as

$$ \left\{ \begin{aligned} & \nabla_{{c^{[m]} }} L\left( k \right) = - \left( {\hat{y}^{[m]T} \left( k \right)e^{[m]} \left( k \right) - \lambda \left( k \right)E} \right), \\ & c^{[m]} \left( {k + 1} \right) = c^{[m]} \left( k \right) + \frac{{e^{[m]T} \left( k \right)\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right)}}{{\left\| {\hat{y}^{[m]} \left( k \right)\nabla_{{c^{[m]} }} L^{[m]} \left( k \right)} \right\|^{2} }}\nabla_{{c^{[m]} }} L\left( k \right), \\ & \lambda \left( {k + 1} \right) = \lambda \left( k \right) + \eta_{\lambda } \left( {k + 1} \right)\left( {E^{T} c^{[m]} \left( {k + 1} \right) - 1} \right). \\ \end{aligned} \right. $$
(13)

The procedure (13) is computationally much simpler than (9), and in the absence of the constraint (6) it turns into a multidimensional modification of the Kaczmarz-Widrow-Hoff algorithm, which is widely used for training artificial neural networks.
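
A sketch of one iteration of (13) is given below (NumPy; the function name and the choice of eta_lam are illustrative assumptions).

```python
import numpy as np

# One iteration of the gradient procedure (13): c is the (q,) weight vector,
# lam the Lagrange multiplier, y_hat the (g x q) ensemble output matrix at
# step k and y the (g,) reference vector.
def generalization_grad_step(c, lam, y_hat, y, eta_lam=0.01):
    E = np.ones_like(c)
    e = y - y_hat @ c                                # combination error e^[m](k)
    grad = -(y_hat.T @ e - lam * E)                  # gradient w.r.t. c
    denom = np.linalg.norm(y_hat @ grad) ** 2
    if denom > 0.0:                                  # optimal step from (13)
        c = c + (e @ y_hat @ grad) / denom * grad
    lam = lam + eta_lam * (E @ c - 1.0)
    return c, lam
```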

The elements of the generalization coefficient vector can be interpreted as membership levels if a non-negativity constraint on the synaptic weights of the generalizing neuron \( GMN^{[m]} \) is introduced into the Lagrange function to be optimized, i.e.

$$ \sum\limits_{j = 1}^{q} {\tilde{c}_{j}^{[m]} } = E^{T} \tilde{c}^{[m]} = 1,\,\,\,\,0 \le \tilde{c}_{j}^{[m]} \le 1\,\,\,\,\forall j = 1,2, \ldots ,q. $$
(14)

Introducing the Lagrange function with additional constraints-inequalities

$$ \begin{aligned} & \tilde{L}^{[m]} \left( k \right) = \frac{1}{2}Tr\left( {V^{[m]T} \left( k \right)V^{[m]} \left( k \right)} \right) + \lambda \left( {E^{T} \tilde{c}^{[m]} - 1} \right) - \rho^{T} \tilde{c}^{[m]} \\ & = \frac{1}{2}Tr\left( {Y\left( k \right) - \hat{Y}^{[m]} \left( k \right)I \otimes \tilde{c}^{[m]} } \right)^{T} \left( {Y\left( k \right) - \hat{Y}^{[m]} \left( k \right)I \otimes \tilde{c}^{[m]} } \right) + \lambda \left( {E^{T} \tilde{c}^{[m]} - 1} \right) - \rho^{T} \tilde{c}^{[m]} \\ & = \frac{1}{2}\sum\limits_{\tau = 1}^{k} {\left\| {y\left( \tau \right) - \hat{y}^{[m]} \left( \tau \right)\tilde{c}^{[m]} } \right\|^{2} + } \lambda \left( {E^{T} \tilde{c}^{[m]} - 1} \right) - \rho^{T} \tilde{c}^{[m]} \\ \end{aligned} $$

(here \( \rho \) is a \( \left( {q \times 1} \right) \) – vector of non-negative undetermined Lagrange multipliers) and solving the Karush-Kuhn-Tucker system of equations

$$ \left\{ \begin{aligned} & \nabla_{{\tilde{c}^{[m]} }} \tilde{L}^{[m]} \left( k \right) = \vec{0}, \\ & \frac{{\partial \tilde{L}^{[m]} \left( k \right)}}{\partial \lambda } = 0, \\ & \rho_{j} \ge 0\,\,\,\forall j = 1,2, \ldots ,q, \\ \end{aligned} \right. $$

an analytical solution takes on the form

$$ \left\{ \begin{aligned} & \tilde{c}^{[m]} \left( k \right) = P^{[m]} \left( k \right)\left( {p^{[m]} \left( k \right) - \lambda E + \rho } \right), \\ & \lambda = \frac{{E^{T} P^{[m]} \left( k \right)p^{[m]} \left( k \right) - 1 + E^{T} P^{[m]} \left( k \right)\rho }}{{E^{T} P^{[m]} \left( k \right)E}} \\ \end{aligned} \right. $$

and, applying the Arrow-Hurwicz-Uzawa procedure, we obtain the learning algorithm of the neuron \( GMN^{[m]} \) in the form

$$ \left\{ \begin{aligned} & \tilde{c}^{[m]} \left( {k + 1} \right) = c^{*[m]} \left( {k + 1} \right) - P^{[m]} \left( {k + 1} \right)\frac{{E^{T} c^{*[m]} \left( {k + 1} \right) - 1 + E^{T} P^{[m]} \left( {k + 1} \right)\rho \left( k \right)}}{{E^{T} P^{[m]} \left( {k + 1} \right)E}}E \\ & + P^{[m]} \left( {k + 1} \right)\rho \left( k \right), \\ \\ & \rho \left( {k + 1} \right) = \mathop{\Pr}\nolimits_{ + } \left( {\rho \left( k \right) - \eta_{\rho } \left( {k + 1} \right)\tilde{c}^{[m]} \left( {k + 1} \right)} \right). \\ \end{aligned} \right. $$
(15)

The first relation of (15) can be transformed into

$$ \begin{aligned} & \tilde{c}^{[m]} \left( {k + 1} \right) = c^{[m]} \left( {k + 1} \right) - P^{[m]} \left( {k + 1} \right)\frac{{E^{T} P^{[m]} \left( {k + 1} \right)\rho \left( k \right)}}{{E^{T} P^{[m]} \left( {k + 1} \right)E}}E + P^{[m]} \left( {k + 1} \right)\rho \left( k \right) \\ & = c^{[m]} \left( {k + 1} \right) + \left( {I - \frac{{P^{[m]} \left( {k + 1} \right)EE^{T} }}{{E^{T} P^{[m]} \left( {k + 1} \right)E}}} \right)P^{[m]} \left( {k + 1} \right)\rho \left( k \right) \\ \end{aligned} $$
(16)

where \( c^{[m]} \left( {k + 1} \right) \) is defined by the expression (8) and \( \left( {I - P^{[m]} \left( {k + 1} \right)EE^{T} \left( {E^{T} P^{[m]} \left( {k + 1} \right)E} \right)^{ - 1} } \right) \) is a projector onto the hyperplane \( \tilde{c}^{[m]T} \left( {k + 1} \right)E = 1 \). It is easy to see that the vectors \( E \) and \( \left( {I - P^{[m]} \left( {k + 1} \right)EE^{T} \left( {E^{T} P^{[m]} \left( {k + 1} \right)E} \right)^{ - 1} } \right)P^{[m]} \left( {k + 1} \right)\rho \left( k \right) \) are orthogonal, so the relations (14) and (15) can be written in a simpler form

$$ \left\{ \begin{aligned} & \tilde{c}^{[m]} \left( {k + 1} \right) = c^{[m]} \left( {k + 1} \right) + \mathop{\Pr}\nolimits_{{c^{[m]T} E = 1}} \left( {P^{[m]} \left( {k + 1} \right)\rho \left( k \right)} \right), \\ & \rho \left( {k + 1} \right) = \mathop{\Pr}\nolimits_{ + } \left( {\rho \left( k \right) - \eta_{\rho } \left( {k + 1} \right)\tilde{c}^{[m]} \left( {k + 1} \right)} \right). \\ \end{aligned} \right. $$

Then the learning algorithm of the generalizing neuron with the constraints (14) finally takes on the form

$$ \left\{ \begin{aligned} & P^{[m]} \left( {k + 1} \right) = P^{[m]} \left( k \right) - P^{[m]} \left( k \right)\hat{y}^{[m]T} \left( {k + 1} \right)\left( {I + \hat{y}^{[m]} \left( {k + 1} \right)P^{[m]} \left( k \right)\hat{y}^{[m]T} \left( {k + 1} \right)} \right)^{ - 1} \\ & \cdot \hat{y}^{[m]} \left( {k + 1} \right)P^{[m]} \left( k \right) = \left( {I - P^{[m]} \left( k \right)\hat{y}^{[m]T} \left( {k + 1} \right)\hat{y}^{[m]} \left( {k + 1} \right)} \right)^{ - 1} P^{[m]} \left( k \right), \\ & p^{[m]} \left( {k + 1} \right) = p^{[m]} \left( k \right) + \hat{y}^{[m]T} \left( {k + 1} \right)y\left( {k + 1} \right), \\ & c^{*[m]} \left( {k + 1} \right) = P^{[m]} \left( {k + 1} \right)p^{[m]} \left( {k + 1} \right), \\ & c^{[m]} \left( {k + 1} \right) = c^{*[m]} \left( {k + 1} \right) + P^{[m]} \left( {k + 1} \right)\left( {E^{T} P^{[m]} \left( {k + 1} \right)E} \right)^{ - 1} \left( {1 - E^{T} c^{*[m]} \left( {k + 1} \right)} \right)E, \\ & \tilde{c}^{[m]} \left( {k + 1} \right) = c^{[m]} \left( {k + 1} \right) - P^{[m]} \left( {k + 1} \right)\frac{{E^{T} P^{[m]} \left( {k + 1} \right)\rho \left( k \right)}}{{E^{T} P^{[m]} \left( {k + 1} \right)E}}E + P^{[m]} \left( {k + 1} \right)\rho \left( k \right), \\ & \rho \left( {k + 1} \right) = \mathop{\Pr}\nolimits_{ + } \left( {\rho \left( k \right) - \eta_{\rho } \left( {k + 1} \right)\tilde{c}^{[m]} \left( {k + 1} \right)} \right). \\ \end{aligned} \right. $$
(16)

Similarly to the previous case, the learning procedure (16) can be considerably simplified with the help of the gradient algorithm

$$ \left\{ \begin{aligned} & \tilde{c}^{[m]} \left( {k + 1} \right) = \tilde{c}^{[m]} \left( k \right) - \eta_{c} \left( {k + 1} \right)\nabla_{{\tilde{c}^{[m]} }} \tilde{L}^{[m]} \left( k \right), \\ & \lambda \left( {k + 1} \right) = \lambda \left( k \right) + \eta_{\lambda } \left( {k + 1} \right)\left( {E^{T} \tilde{c}^{[m]} \left( {k + 1} \right) - 1} \right), \\ & \rho \left( {k + 1} \right) = \mathop{\Pr}\nolimits_{ + } \left( {\rho \left( k \right) - \eta_{\rho } \left( {k + 1} \right)\tilde{c}^{[m]} \left( {k + 1} \right)} \right). \\ \end{aligned} \right. $$

Carrying out transformations similar to those above, we finally obtain

$$ \left\{ \begin{aligned} & \nabla_{{\tilde{c}^{[m]} }} \tilde{L}\left( k \right) = - \left( {\hat{y}^{[m]T} \left( k \right)e^{[m]} \left( k \right) - \lambda \left( k \right)E + \rho \left( k \right)} \right), \\ & \tilde{c}^{[m]} \left( {k + 1} \right) = \tilde{c}^{[m]} \left( k \right) + \frac{{e^{[m]T} \left( k \right)\hat{y}^{[m]} \left( k \right)\nabla_{{\tilde{c}^{[m]} }} \tilde{L}^{[m]} \left( k \right)}}{{\left\| {\hat{y}^{[m]} \left( k \right)\nabla_{{\tilde{c}^{[m]} }} \tilde{L}^{[m]} \left( k \right)} \right\|^{2} }}\nabla_{{\tilde{c}^{[m]} }} \tilde{L}^{[m]} \left( k \right), \\ & \lambda \left( {k + 1} \right) = \lambda \left( k \right) + \eta_{\lambda } \left( {k + 1} \right)\left( {E^{T} \tilde{c}^{[m]} \left( {k + 1} \right) - 1} \right), \\ & \rho \left( {k + 1} \right) = \mathop{\Pr}\nolimits_{ + } \left( {\rho \left( k \right) - \eta_{\rho } \left( {k + 1} \right)\tilde{c}^{[m]} \left( {k + 1} \right)} \right). \\ \end{aligned} \right. $$
(17)

The algorithm (17) comprises the procedure (13) as a particular case.
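
For completeness, a sketch of one iteration of (17) follows (NumPy; the learning rates are illustrative), with the projection \( \Pr_{+} \) implemented as clipping to the non-negative orthant.

```python
import numpy as np

# One iteration of the constrained procedure (17): rho is the (q,) vector of
# non-negative Lagrange multipliers enforcing the constraints (14).
def constrained_grad_step(c, lam, rho, y_hat, y, eta_lam=0.01, eta_rho=0.01):
    E = np.ones_like(c)
    e = y - y_hat @ c
    grad = -(y_hat.T @ e - lam * E + rho)            # gradient w.r.t. c~
    denom = np.linalg.norm(y_hat @ grad) ** 2
    if denom > 0.0:
        c = c + (e @ y_hat @ grad) / denom * grad
    lam = lam + eta_lam * (E @ c - 1.0)
    rho = np.maximum(rho - eta_rho * c, 0.0)         # projection Pr_+
    return c, lam, rho
```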

4 Experimental Results

To illustrate the effectiveness of the suggested adaptive neuro-fuzzy system and its learning procedures, we carried out an experiment on identification of the chaotic Lorenz attractor. The Lorenz attractor is a fractal structure corresponding to the behavior of the Lorenz oscillator, a three-dimensional dynamical system that produces a chaotic flow renowned for its lemniscate shape. The state of the dynamical system (the three variables of the three-dimensional system) evolves over time in a complex, non-repeating pattern.

The Lorenz attractor is described by the system of differential equations

$$ \left\{ \begin{aligned} & \dot{x} = \sigma \left( {y - x} \right), \\ & \dot{y} = x\left( {r - z} \right) - y, \\ & \dot{z} = xy - bz. \\ \end{aligned} \right. $$
(18)

The system of Eq. (18) can also be written in the recurrent form

$$ \left\{ \begin{aligned} & x\left( {i + 1} \right) = x\left( i \right) + \sigma \left( {y\left( i \right) - x\left( i \right)} \right)dt, \\ & y\left( {i + 1} \right) = y\left( i \right) + \left( {rx\left( i \right) - x\left( i \right)z\left( i \right) - y\left( i \right)} \right)dt, \\ & z\left( {i + 1} \right) = z\left( i \right) + \left( {x\left( i \right)y\left( i \right) - bz\left( i \right)} \right)dt \\ \end{aligned} \right. $$
(19)

where the parameter values are \( \sigma = 10,\,r = 28,\,b = 2.66,\,dt = 0.001 \).

A data set of 10000 samples was generated with the help of (19); 7000 points form the training set, and 3000 samples make up the validation set.
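
The data generation step can be reproduced with the short NumPy sketch below (the initial condition and the sequential 7000/3000 split are assumptions, since they are not specified here).

```python
import numpy as np

# Euler discretization (19) of the Lorenz system with the parameters used
# in the experiment: sigma = 10, r = 28, b = 2.66, dt = 0.001.
def lorenz_series(n_samples=10000, sigma=10.0, r=28.0, b=2.66, dt=0.001,
                  start=(1.0, 1.0, 1.0)):
    data = np.empty((n_samples, 3))
    x, y, z = start                       # assumed initial condition
    for i in range(n_samples):
        data[i] = (x, y, z)
        x, y, z = (x + sigma * (y - x) * dt,
                   y + (r * x - x * z - y) * dt,
                   z + (x * y - b * z) * dt)
    return data

data = lorenz_series()
train, valid = data[:7000], data[7000:]   # assumed sequential split
```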

Our system contained 2 cascades, each consisting of 2 multidimensional neurons and a generalizing neuron. The first neuron in each cascade uses 2 membership functions. The graphical results are shown in Figs. 4, 5 and 6, and the forecasting results for the last cascade are given in Table 1.

Fig. 4. Identification of the Lorenz attractor: the X-component results.

Fig. 5. Identification of the Lorenz attractor: the Y-component results.

Fig. 6. Identification of the Lorenz attractor: the Z-component results.

Table 1. Forecasting results

5 Conclusion

This article has introduced a hybrid growing neuro-fuzzy architecture and learning algorithms for a multidimensional growing hybrid cascade neuro-fuzzy system that enables optimization of the neuron ensemble in every cascade. The most important advantage of the considered hybrid neuro-fuzzy system is its ability to process a data stream in parallel using specialized elements with enhanced approximating properties. The developed system is rather simple from the implementation standpoint and combines high processing speed with good approximation properties. Its high training speed makes it possible to process sequential data online. A distinctive feature of the introduced system is that every cascade is formed by an ensemble of neurons whose outputs are combined by a dedicated optimization procedure, so every cascade produces an output signal of optimal accuracy. The proposed system, which is ultimately a growing (evolving) system of computational intelligence, processes incoming data in an online fashion, unlike most conventional systems.