1 Introduction

In the past two decades, the recursive least squares (RLS) algorithm and the extended Kalman filter (EKF) algorithm for training multilayered feedforward neural networks (MFNNs) have been extensively investigated [1–5]. The RLS and EKF algorithms belong to the online training approach, in which the weights are updated immediately after the presentation of each training pattern. The advantages of online training are that it does not require storage of the entire input–output history and that, in conjunction with a weight decay factor, it can be used to estimate processes that are “mildly” non-stationary. The RLS and EKF algorithms are efficient second-order training methods. Compared with first-order methods, such as the backpropagation (BP) algorithm [6], they have a faster convergence rate. Moreover, in the RLS and EKF algorithms, fewer parameters need to be tuned during training.

Leung et al. [1, 2] found that the RLS algorithm has an implicit weight decay term [7–9], whose strength is controlled by the initial value of the error covariance matrix. With this weight decay term, the magnitudes of the trained weights are constrained to be small. Hence, the network output function is smooth and the generalization ability is improved. Besides, when the magnitudes of the trained weights are small, the effect of weight faults can be suppressed [10, 11].

However, the weight decay effect in the standard RLS algorithm is not substantial and decreases as the number of training cycles increases. That is, the generalization ability of a network trained with this algorithm is not fully controllable. To tackle this problem, a constant weight decay RLS algorithm, namely the true weight decay RLS (TWDRLS) algorithm, was proposed [12]. The TWDRLS algorithm makes the weight decay effect more effective, and consequently a network trained with it exhibits better generalization ability. However, the computational complexity of the TWDRLS algorithm is \(O(M^3)\), which is much higher than that of the standard RLS algorithm, \(O(M^2)\), where M is the number of weights in the network. Therefore, it is necessary to reduce the complexity of this algorithm so that it can be applied to large-scale practical problems.

This paper first derives a set of concise equations for the TWDRLS algorithm and discusses the decay effect of the algorithm in this form. The main contribution of this paper is to propose a decoupled version of the TWDRLS algorithm. The goal is to reduce both the computational complexity and the storage requirement. In this decoupled version, instead of using one set of TWDRLS equations to train all the weights, each neuron, except the input neurons, is associated with a set of decoupled TWDRLS equations, which is used for training its corresponding input weights only. The overall complexity of all the decoupled sets of TWDRLS equations is much lower than that of the global algorithm.

The rest of the paper is organized as follows. In Sect. 2, we give a brief review of the RLS and TWDRLS algorithms. We describe the decoupled TWDRLS algorithm in Sect. 3. The computer simulation results are presented in Sect. 4. Finally, we summarize our findings in Sect. 5.

2 TWDRLS algorithm

Since an MFNN with a single hidden layer and a sufficient number of hidden neurons is able to approximate any function [13], we consider an MFNN model with \(n_o\) output nodes, \(n_h\) hidden nodes, and \(n_{\rm in}\) input nodes. The output of the ith neuron in the output layer is denoted by \(y^o_i\). The output of the jth neuron in the hidden layer is denoted by \(y^h_j\). The kth element of the network input x is denoted by \(y^{\rm in}_k\).

The connection weight from the jth hidden neuron to the ith output neuron is denoted by \(w^o_{i,j}\). Output biases are implemented as weights and are denoted by \(w^o_{i,n_{h}+1}\). The connection weight from the kth input to the jth hidden neuron is denoted by \(w^{\rm in}_{j,k}\). Input biases are implemented as weights and are denoted by \(w^{\rm in}_{j,n_{\rm in}+1}\). The total number of weights in the network is equal to

$$ M= n_o (n_{h}+1) + n_{h}(n_{\rm in}+1). $$
(1)
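For example, the generalized XOR network used later in Sect. 4.1 has two inputs, ten hidden neurons, and one output neuron (\(n_{\rm in}=2\), \(n_h=10\), \(n_o=1\)), so

$$ M = 1\,(10+1) + 10\,(2+1) = 41 , $$

which agrees with the 41 weights quoted there.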

Using the conventional notation of the RLS approach [14, 15], we arrange all the weights in an M-dimensional vector, given by

$$ {\user2{w}} = \left[ w^{\rm in}_{1,1},\ldots, w^{\rm in}_{n_{h},(n_{\rm in}+1)},w^{o}_{1,1},\ldots,w^{o}_{n_o,(n_{h}+1)}\right]^T. $$
(2)

In the RLS approach, the objective at the tth training iteration is to minimize the following energy function:

$$ E({\user2{w}}) = \sum_{\tau=1}^{t} \left[ {\user2{d}}(\tau) - {\user2{h}}({\user2{w}}, {\user2{x}} (\tau) ) \right]^T \left[ {\user2{d}}(\tau) - {\user2{h}}({\user2{w}}, {\user2{x}} (\tau) ) \right] + \left[ {\user2{w}} - \hat{{\user2{w}}} (0) \right]^T {\user2{P}}^{-1} (0) \left[ {\user2{w}} - \hat{{\user2{w}}} (0) \right] $$
(3)

where {x(τ), d(τ)} is the training input–output pair at the τth training iteration and \({\user2{h}}(\cdot, \cdot)\) is an \(n_o\)-dimensional nonlinear function describing the network. The matrix P(0) is the initial error covariance matrix in the RLS algorithm, and it is usually set to \(\delta^{-1} {\user2{I}}_{M \times M}\), where \({\user2{I}}_{M \times M}\) is the M × M identity matrix. The magnitude of the initial weight vector \(\hat{{\user2{w}}} (0)\) should be small.

The minimization of (3) results in the following recursive equations [14, 15]:

$$ {\user2{K}}(t) = {\user2{P}}(t-1) {\user2{H}}^{T} (t) \left[ {\user2{I}}_{n_o \times n_o} + {\user2{H}}(t) {\user2{P}}(t-1) {\user2{H}}^{T} (t) \right]^{-1} $$
(4)
$$ {\user2{P}}(t) = {\user2{P}}(t-1) - {\user2{K}}(t) {\user2{H}}(t) {\user2{P}}(t-1) $$
(5)
$$ \hat{ {\user2{w}}} (t) = \hat{{\user2{w}}} (t-1) + {\user2{K}}(t) \left[ {\user2{d}}(t) - {\user2{h}}(\hat{{\user2{w}}} (t-1), {\user2{x}}(t)) \right] , $$
(6)

where

$$ {\user2{H}} (t) = \left[ \left. \frac{ \partial {\user2{h}} ({\user2{w}}, {\user2{x}}(t) ) }{\partial {\user2{w}}} \right|_{{\user2{w}}= \hat{{\user2{w}}} (t-1)} \right]^T, $$
(7)

is the gradient matrix of h(w, x(t)), with size \(n_o \times M\); and K(t) is the so-called Kalman gain matrix, with size \(M \times n_o\). The matrix P(t) is the so-called error covariance matrix and is symmetric positive definite.
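To make the recursion concrete, the following Python/NumPy sketch performs one RLS update step. The helper callbacks `net_forward` and `net_jacobian`, which return \({\user2{h}}(\hat{{\user2{w}}}(t-1), {\user2{x}}(t))\) and the gradient matrix H(t) of (7), are illustrative assumptions and not part of the original formulation.

```python
import numpy as np

def rls_step(w_hat, P, x, d, net_forward, net_jacobian):
    """One standard RLS update, following Eqs. (4)-(6).

    w_hat : (M,) current weight estimate
    P     : (M, M) error covariance matrix, initialized as (1/delta) * I
    x, d  : current training input and n_o-dimensional target
    net_forward(w, x)  -> network output h(w, x), shape (n_o,)
    net_jacobian(w, x) -> gradient matrix H(t) of Eq. (7), shape (n_o, M)
    """
    H = net_jacobian(w_hat, x)                           # (n_o, M)
    S = np.eye(H.shape[0]) + H @ P @ H.T                 # n_o x n_o innovation matrix
    K = P @ H.T @ np.linalg.inv(S)                       # Kalman gain, Eq. (4)
    P_new = P - K @ H @ P                                # covariance update, Eq. (5)
    w_new = w_hat + K @ (d - net_forward(w_hat, x))      # weight update, Eq. (6)
    return w_new, P_new
```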

From (3), the standard RLS algorithm has a weight decay term \({\user2{w}}^T {\user2{P}}^{-1}(0) {\user2{w}}\). However, as mentioned in [1, 2], the standard RLS algorithm has only a limited weight decay effect, equal to \(\frac{\delta}{t_o}\) per training iteration, where \(t_o\) is the number of training iterations. That is, the effect of the weight decay term in each training iteration is inversely proportional to the number of training iterations. Hence, the more training presentations take place, the weaker the smoothing effect in the data fitting process. When the number of presentations cannot be determined a priori, using the value of δ to control the generalization ability becomes impractical.

In [12], the TWDRLS algorithm was proposed to enhance the weight decay effect. In this algorithm, a new energy function was considered, given by

$$ \begin{aligned} E({\user2{w}}) &= \sum_{\tau=1}^{t} \left[ \left[ {\user2{d}}(\tau) - {\user2{h}}({\user2{w}}, {\user2{x}} (\tau) ) \right]^T \left[ {\user2{d}}(\tau) - {\user2{h}}({\user2{w}}, {\user2{x}} (\tau) ) \right] + \alpha {\user2{w}}^T {\user2{w}} \right] \\ &\quad + \left[{\user2{w}} - \hat{{\user2{w}}} (0) \right]^T \tilde{{\user2{P}}}^{-1} (0) \left[ {\user2{w}} - \hat{{\user2{w}}} (0) \right] \end{aligned} $$
(8)

where α is a regularization parameter and the matrix \(\tilde{{\user2{P}}}(0)\) is the initial error covariance matrix, which is usually set to \(\delta^{-1} {\user2{I}}_{M \times M}\). In (8), there is a constant decay term \(\alpha {\user2{w}}^T {\user2{w}}\), and the decay effect per training iteration does not decrease as the number of training iterations increases.

The gradient of this energy function in (8) is given by

$$ \frac{\partial E({\user2{w}}) }{\partial {\user2{w}}} \approx 2 \tilde{{\user2{P}}}^{-1} (0) \left[ {\user2{w}} - \hat{{\user2{w}}} (0) \right] + 2 \sum_{\tau=1}^{t} \left[ \alpha {\user2{w}} - {\user2{H}}^{T} (\tau) \left[ {\user2{d}}(\tau) -{\user2{H}} (\tau) {\user2{w}} - {\user2{\xi}} (\tau) \right] \right] . $$
(9)

In the above, we use the common linearization technique of the RLS approach [4, 5]. That is, we linearize h(w, x(τ)) around the estimate \(\hat{{\user2{w}}}(\tau-1), \) given by

$$ {\user2{h}}({\user2{w}}, {\user2{x}} (\tau) ) = {\user2{H}} (\tau) {\user2{w}} + {\user2{\xi}} (\tau) $$
(10)

where

$$ {\user2{H}} (\tau) = \left[ \left. \frac{ \partial {\user2{h}} ({\user2{w}}, {\user2{x}}(\tau) ) }{\partial {\user2{w}}} \right|_{{\user2{w}}= \hat{{\user2{w}}} (\tau-1)} \right]^T, $$
(11)
$$ {\user2{\xi}} (\tau) = {\user2{h}} ( \hat{{\user2{w}}} (\tau-1),{\user2{x}}(\tau) ) - {\user2{H}}(\tau) \hat {{\user2{w}}} (\tau-1) + {\user2{\rho}} (\tau) $$
(12)

is the residual in the expansion of h(w, x(τ)), and \({\user2{\rho}} (\tau)\) consists of the higher-order residual terms. In the derivation, we assume that the higher-order residual is not a function of w. This assumption is commonly used in the derivation of the RLS and EKF equations [4, 5].

To minimize the energy function, we set the gradient to zero. Hence, we have

$$ \hat{{\user2{w}}} (t) = \tilde{{\user2{P}}}(t) {\user2{r}}(t) $$
(13)

where

$$ \tilde{{\user2{P}}}^{-1} (t) = \tilde{{\user2{P}}}^{-1} (0) + \sum_{\tau=1}^{t} \left[ {\user2{H}}^{T} (\tau) {\user2{H}}(\tau) + \alpha {\user2{I}}_{M \times M} \right] $$
(14)
$$ =\tilde{{\user2{P}}}^{-1} (t-1) + {\user2{H}}^{T} (t) {\user2{H}}(t) + \alpha {\user2{I}}_{M \times M} $$
(15)
$$ {\user2{r}}(t) = \tilde{{\user2{P}}}^{-1} (0) \hat{{\user2{w}}} (0) + \sum_{\tau=1}^{t} {\user2{H}}^T (\tau) \left[ {\user2{d}}(\tau) - {\user2{\xi}} (\tau) \right] $$
(16)
$$ = {\user2{r}} (t-1) + {\user2{H}}^T (t) \left[ {\user2{d}}(t) - {\user2{\xi}} (t) \right] . $$
(17)

Furthermore, define

$$ {\user2{P}}^* (t) \buildrel{\Updelta} \over {=} \left[ {\user2{I}}_{M \times M} + \alpha \tilde{{\user2{P}}}(t-1) \right]^{-1} \tilde{{\user2{P}}}(t-1) . $$
(18)

Hence, we have

$$ {{\user2{P}}^*}^{-1} (t) = \tilde{{\user2{P}}}^{-1} (t-1) + \alpha {\user2{I}}_{M \times M} . $$
(19)

Note that

$$ \left[ \left[ {\user2{I}}_{M \times M} + \alpha \tilde{{\user2{P}}}(t-1) \right]^{-1} \tilde{{\user2{P}}}(t-1) \right] \left[ \tilde{{\user2{P}}}^{-1} (t-1) + \alpha {\user2{I}}_{M \times M} \right] = {\user2{I}}_{M \times M} . $$
(20)

Employing the matrix inversion lemma:

$$ ( {\user2{A}}^{-1} + {\user2{B}} {\user2{C}}^{-1} {\user2{B}}^T )^{-1} = {\user2{A}} - {\user2{A}} {\user2{B}} ({\user2{C}} +{\user2{B}}^T {\user2{A}} {\user2{B}})^{-1} {\user2{B}}^T {\user2{A}} , $$
(21)

in the recursive calculation of \(\tilde{{\user2{P}}}(t)\), Eq. (13) becomes the following recursive equations:

$$ {\user2{P}}^{*} (t-1) = \left[ {\user2{I}}_{M \times M} + \alpha \tilde{{\user2{P}}}(t-1) \right]^{-1} \tilde{{\user2{P}}}(t-1) $$
(22)
$$ {\user2{K}}(t) = {\user2{P}}^* (t-1) {\user2{H}}^T (t) \left[ {\user2{I}}_{n_o \times n_o}+ {\user2{H}}(t) {\user2{P}}^*(t-1) {\user2{H}}^T(t) \right]^{-1} $$
(23)
$$ \tilde{{\user2{P}}}(t) = {\user2{P}}^*(t-1) - {\user2{K}}(t) {\user2{H}}(t) {\user2{P}}^*(t-1) $$
(24)
$$ \hat{{\user2{w}}} (t) = \hat{{\user2{w}}} (t-1) - \alpha \tilde{{\user2{P}}}(t) \hat{{\user2{w}}} (t-1) + {\user2{K}}(t) \left[ {\user2{d}}(t) - {\user2{h}} \left( \hat{{\user2{w}}}(t-1), {\user2{x}}(t) \right) \right] . $$
(25)

Equations (22)–(25) are the general global true weight decay recursive equations. They are more compact than those presented in [12].

From (22)–(25), we can easily observe that when the regularization parameter α is set to zero, the term \(\alpha \tilde{{\user2{P}}}(t) \hat{{\user2{w}}} (t-1)\) vanishes in (25), and (22)–(25) reduce to the standard RLS equations. The main difference between the standard RLS equations and the TWDRLS equations is the introduction of the weight decay term \( - \alpha \tilde{{\user2{P}}}(t) \hat{{\user2{w}}} (t-1)\) in (25). The inclusion of this term guarantees that, at each update, the weight vector decays by an amount proportional to \(\alpha \tilde{{\user2{P}}}(t)\). Note that, from its definition, \(\tilde{{\user2{P}}}(t)\) is positive definite. Therefore, the magnitude of the weight vector will not become too large, and the generalization ability of the trained network is improved.
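As an illustration, a direct transcription of (22)–(25) into Python/NumPy is sketched below, under the same assumed helper callbacks as before (`net_forward` and `net_jacobian` are hypothetical names, not from the original paper). The \(O(M^3)\) cost appears in the solve involving the M × M matrix.

```python
import numpy as np

def twdrls_step(w_hat, P_tilde, x, d, alpha, net_forward, net_jacobian):
    """One global TWDRLS update, following Eqs. (22)-(25)."""
    M = w_hat.size
    H = net_jacobian(w_hat, x)                                     # (n_o, M), Eq. (11)
    # Eq. (22): P*(t-1) = [I + alpha * P~(t-1)]^{-1} P~(t-1); this solve is the O(M^3) step.
    P_star = np.linalg.solve(np.eye(M) + alpha * P_tilde, P_tilde)
    # Eq. (23): Kalman gain.
    S = np.eye(H.shape[0]) + H @ P_star @ H.T
    K = P_star @ H.T @ np.linalg.inv(S)
    # Eq. (24): covariance update.
    P_tilde_new = P_star - K @ H @ P_star
    # Eq. (25): weight update with the true weight decay term -alpha * P~(t) * w(t-1).
    w_new = w_hat - alpha * (P_tilde_new @ w_hat) \
            + K @ (d - net_forward(w_hat, x))
    return w_new, P_tilde_new
```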

We can also explain the weight decay effect from the energy function, given by (8). Clearly, when the regularization parameter α is set to zero, the energy function of the TWDRLS in (8) becomes the energy function of the standard RLS, given by (3).

From the energy function point of view, the objective of the TWDRLS algorithm is the same as that of batch-mode weight decay methods. Hence, existing heuristic methods [7, 16–18] for choosing the value of α can be used in the TWDRLS case. Those methods can be applied to any training algorithm whose cost function contains a quadratic weight decay term. The simplest method is test set validation [7], in which we use a test set to select the most suitable value of the regularization parameter. Since the aim of this paper is to develop the RLS equations for the weight decay regularizer rather than to develop Bayesian theories for model selection [19–22], in this paper we train a number of networks with different values of α and then select a network based on a test set.
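A minimal sketch of this test-set validation procedure is given below; `train_network` and `test_error` are placeholder functions standing for a full TWDRLS training run and the evaluation criterion (false rate or MSE), and the candidate grid is only an example, not a value taken from the paper.

```python
import numpy as np

def select_regularization(alphas, train_network, test_error):
    """Pick the regularization parameter by test set validation (cf. [7]).

    alphas        : iterable of candidate values of alpha
    train_network : train_network(alpha) -> network trained with TWDRLS
    test_error    : test_error(net) -> scalar test-set error
    """
    results = [(a, test_error(train_network(a))) for a in alphas]
    best_alpha, best_err = min(results, key=lambda item: item[1])
    return best_alpha, best_err, results

# Example candidate grid (an assumption, not taken from the paper):
# alphas = np.logspace(-5, -1, num=9)
```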

A drawback of the TWDRLS algorithm is the requirement of computing the inverse of the M-dimensional matrix \(({\user2{I}}_{M \times M} + \alpha \tilde{{\user2{P}}}(t-1) ). \) This complexity is equal to \(O(M^3)\), which is much higher than that of the standard RLS algorithm, \(O(M^2)\). The TWDRLS algorithm is thus computationally prohibitive even for a network of moderate size. In the next section, a decoupled version of the TWDRLS algorithm will be proposed to solve this high-complexity problem.

3 Decoupling the TWDRLS algorithm

3.1 Derivation

In order to decouple the TWDRLS algorithm, we first divide the weight vector into several smaller local groups. For the ith output neuron, we use a decoupled weight vector

$$ {\user2{w}}^o_{i} = \left[ w^o_{i,1},\ldots,w^o_{i,(n_{h}+1)} \right]^T $$
(26)

to represent those weights connecting hidden neurons to the ith output neuron. For the jth hidden neuron, we use a decoupled weight vector

$$ {\user2{w}}^{\rm in}_{j} = \left[ w^{\rm in}_{j,1}, \ldots,w^{\rm in}_{j,(n_{\rm in}+1)} \right]^T $$
(27)

to represent those weights connecting inputs to the jth hidden neuron.

In the decoupled version of the TWDRLS algorithm, we consider the estimation of each decoupled weight vector separately. When we consider the weight vector of a neuron, we assume that the other decoupled weight vectors are constant vectors.

The ith output neuron is associated with a decoupled weight vector \({\user2{w}}^o_i\). Since we assume that the other decoupled weight vectors are constant, the energy function for this decoupled weight vector is given by

$$ \begin{aligned} E({\user2{w}}^o_{i}) &= \sum_{\tau=1}^{t} \left[ \left[ d_i(\tau) - h_i({\user2{w}}, {\user2{x}} (\tau) ) \right]^2 + \alpha {{\user2{w}}^o_{i}}^T {\user2{w}}^o_{i} \right] \\ & \quad+ \left[{\user2{w}}^o_{i} - \hat{{\user2{w}}}^o_{i} (0) \right]^T {{\user2{P}}^o}^{-1}_{i} (0) \left[ {\user2{w}}^o_{i} - \hat{{\user2{w}}}^o_{i} (0) \right] . \end{aligned} $$
(28)

Following a derivation similar to the previous analysis, we obtain the following recursive equations for the output neurons:

$$ {\user2{P}}^{o*}_{i} (t-1) = \left[ {\user2{I}}_{(n_{h}+1) \times (n_{h}+1)} + \alpha {\user2{P}}^{o}_{i}(t-1) \right]^{-1} {\user2{P}}^o_{i}(t-1) $$
(29)
$$ {\user2{K}}^o_{i}(t) = {\user2{P}}^{o*}_{i} (t-1) {{\user2{H}}^o_{i}}^T (t) \left[ 1+ {\user2{H}}^o_{i}(t) {\user2{P}}^{o*}_{i} (t-1) {{\user2{H}}^o_{i}}^T(t) \right]^{-1} $$
(30)
$$ {\user2{P}}^o_{i} (t) = {\user2{P}}^{o*}_{i}(t-1) - {\user2{K}}^o_{i}(t) {\user2{H}}^o_{i}(t) {\user2{P}}^{o*}_{i}(t-1) $$
(31)
$$ \hat{{\user2{w}}}^o_{i} (t) = \hat{{\user2{w}}}^o_{i} (t-1) - \alpha {\user2{P}}^o_{i} (t) \hat{{\user2{w}}}^o_{i} (t-1) + {\user2{K}}^o_{i}(t) \left[ d_i(t) - h_i \left( \hat{{\user2{w}}} (t-1), {\user2{x}}(t) \right) \right] , $$
(32)

where

$$ {\user2{H}}^o_{i} (\tau) = \left[ \left. \frac{ \partial h_i ({\user2{w}}, {\user2{x}} (\tau) ) }{\partial {\user2{w}}^o_{i}} \right|_{{\user2{w}} = \hat{{\user2{w}}} (\tau-1)} \right]^T , $$
(33)

is the \(1 \times (n_h+1)\) decoupled gradient matrix, \({\user2{K}}^o_i(t)\) is the \((n_h+1) \times 1\) decoupled Kalman gain, and \({\user2{P}}^o_i(t)\) is the \((n_h+1) \times (n_h+1)\) decoupled error covariance matrix.
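A sketch of this per-output-neuron update in Python/NumPy follows; the arguments `H_o_i`, `d_i`, and `h_i` stand for the decoupled gradient of (33), the target of the ith output, and the current ith network output, respectively, and the calling convention is an assumption made for illustration.

```python
import numpy as np

def decoupled_output_step(w_o_i, P_o_i, alpha, H_o_i, d_i, h_i):
    """Decoupled TWDRLS update for the ith output neuron, Eqs. (29)-(32).

    w_o_i : (n_h+1,) weights feeding the ith output neuron
    P_o_i : (n_h+1, n_h+1) decoupled error covariance matrix
    H_o_i : (n_h+1,) decoupled gradient of h_i w.r.t. w_o_i, Eq. (33)
    d_i, h_i : scalar target and current output of the ith output neuron
    """
    n = w_o_i.size
    P_star = np.linalg.solve(np.eye(n) + alpha * P_o_i, P_o_i)     # Eq. (29)
    denom = 1.0 + H_o_i @ P_star @ H_o_i                           # scalar in Eq. (30)
    K = (P_star @ H_o_i) / denom                                   # decoupled Kalman gain
    P_new = P_star - np.outer(K, H_o_i) @ P_star                   # Eq. (31)
    w_new = w_o_i - alpha * (P_new @ w_o_i) + K * (d_i - h_i)      # Eq. (32)
    return w_new, P_new
```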

Similarly, the jth hidden neuron is associated with a decoupled weight vector \({\user2{w}}^{\rm in}_j\). The energy function for this decoupled weight vector is given by

$$ \begin{aligned} E({\user2{w}}^{\rm in}_{j}) &= \sum_{\tau=1}^{t} \left[ \left[ {\user2{d}} (\tau) - {\user2{h}}({\user2{w}}, {\user2{x}} (\tau) ) \right]^T \left[ {\user2{d}} (\tau) - {\user2{h}}({\user2{w}}, {\user2{x}} (\tau) ) \right] + \alpha {{\user2{w}}^{\rm in}_{j}}^T {\user2{w}}^{\rm in}_{j} \right]\\ &\quad + \left[{\user2{w}}^{\rm in}_{j} - \hat{{\user2{w}}}^{\rm in}_{j} (0) \right]^T {{\user2{P}}^{\rm in}_j}^{-1} (0) \left[ {\user2{w}}^{\rm in}_{j} - \hat{{\user2{w}}}^{\rm in}_{j} (0) \right] . \end{aligned} $$
(34)

With the energy function, the recursive equations are given by

$$ {\user2{P}}^{{\rm in}*}_{j} (t-1) = \left[ {\user2{I}}_{(n_{\rm in}+1) \times (n_{\rm in}+1)} + \alpha {\user2{P}}^{\rm in}_{j}(t-1) \right]^{-1} {\user2{P}}^{\rm in}_{j}(t-1) $$
(35)
$$ {\user2{K}}^{\rm in}_{j}(t) = {\user2{P}}^{{\rm in}*}_{j} (t-1) {{\user2{H}}^{\rm in}_{j}}^T (t) \left[ {\user2{I}}_{n_{o} \times n_{o}} + {\user2{H}}^{\rm in}_{j}(t) {\user2{P}}^{{\rm in}*}_{j} (t-1) {{\user2{H}}^{\rm in}_{j}}^T(t) \right]^{-1} $$
(36)
$$ {\user2{P}}^{\rm in}_{j} (t) = {\user2{P}}^{{\rm in}*}_{j}(t-1) - {\user2{K}}^{\rm in}_{j}(t) {\user2{H}}^{\rm in}_{j}(t) {\user2{P}}^{{\rm in}*}_{j}(t-1) $$
(37)
$$ \hat{{\user2{w}}}^{\rm in}_{j} (t) = \hat{{\user2{w}}}^{\rm in}_{j} (t-1) - \alpha {\user2{P}}^{\rm in}_{j} (t) \hat{{\user2{w}}}^{\rm in}_{j} (t-1) + {\user2{K}}^{\rm in}_{j}(t) \left[ {\user2{d}} (t) - {\user2{h}} \left( \hat{{\user2{w}}} (t-1), {\user2{x}}(t) \right) \right] , $$
(38)

where

$$ {\user2{H}}^{\rm in}_{j} (\tau) = \left[ \left. \frac{ \partial {\user2{h}} ({\user2{w}}, {\user2{x}} (\tau) ) }{\partial {\user2{w}}^{\rm in}_{j}} \right|_{{\user2{w}} = \hat{{\user2{w}}} (\tau-1)} \right]^T , $$
(39)

is the \(n_o \times (n_{\rm in}+1)\) decoupled gradient matrix, \({\user2{K}}^{\rm in}_j(t)\) is the \((n_{\rm in}+1) \times n_o\) decoupled Kalman gain, and \({\user2{P}}^{\rm in}_j(t)\) is the \((n_{\rm in}+1) \times (n_{\rm in}+1)\) decoupled error covariance matrix.

The training process of the decoupled TWDRLS algorithm is as follows. We first train the output decoupled weight vectors \({\user2{w}}^o_i\). Afterwards, we update the hidden decoupled weight vectors \({\user2{w}}^{\rm in}_j\). At each training stage, only the weight vector concerned is updated, while all other weights remain unchanged.
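For completeness, the hidden-neuron update (35)–(38) and the training order just described are sketched below; the matrix shapes follow (39), and the calling convention is again an illustrative assumption.

```python
import numpy as np

def decoupled_hidden_step(w_in_j, P_in_j, alpha, H_in_j, d, h):
    """Decoupled TWDRLS update for the jth hidden neuron, Eqs. (35)-(38).

    w_in_j : (n_in+1,) weights feeding the jth hidden neuron
    P_in_j : (n_in+1, n_in+1) decoupled error covariance matrix
    H_in_j : (n_o, n_in+1) decoupled gradient matrix, Eq. (39)
    d, h   : (n_o,) target vector and current network output
    """
    n = w_in_j.size
    P_star = np.linalg.solve(np.eye(n) + alpha * P_in_j, P_in_j)        # Eq. (35)
    S = np.eye(H_in_j.shape[0]) + H_in_j @ P_star @ H_in_j.T            # Eq. (36)
    K = P_star @ H_in_j.T @ np.linalg.inv(S)                            # (n_in+1, n_o)
    P_new = P_star - K @ H_in_j @ P_star                                # Eq. (37)
    w_new = w_in_j - alpha * (P_new @ w_in_j) + K @ (d - h)             # Eq. (38)
    return w_new, P_new

# Training order (Sect. 3.1): the output vectors w_o_i are updated first via
# decoupled_output_step, then the hidden vectors w_in_j via
# decoupled_hidden_step; only the vector being updated changes at each stage.
```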

3.2 Complexity

In the global TWDRLS algorithm, the complexity mainly comes from computing the inverse of the M-dimensional matrix \(({\user2{I}}_{M \times M} + \alpha \tilde{{\user2{P}}}(t-1))\). This complexity is equal to \(O(M^3)\). So, the computational complexity is equal to

$$ TCC_{\rm global} = O ( M^3) = O\left( \left( n_o (n_{h}+1) + n_{h}(n_{\rm in}+1) \right )^3 \right) . $$

Since the size of the matrix is M × M, the space complexity (storage requirement) is equal to

$$ TCS_{\rm global} = O (M^2) = O\left( \left( n_o (n_{h}+1) + n_{h}(n_{\rm in}+1) \right )^2 \right) . $$

From (29), for each output neuron, the computational cost of the decoupled TWDRLS algorithm mainly comes from the inversion of an \((n_h+1) \times (n_h+1)\) matrix. Hence, the computational complexity for each output neuron is \(O((n_h+1)^3)\) and the corresponding space complexity is \(O((n_h+1)^2)\). From (35), for each hidden neuron, the computational cost mainly comes from the inversion of an \((n_{\rm in}+1) \times (n_{\rm in}+1)\) matrix. Hence, the computational complexity for each hidden neuron is \(O((n_{\rm in}+1)^3)\) and the corresponding space complexity is \(O((n_{\rm in}+1)^2)\).

Hence, the total computational complexity of the decoupled TWDRLS algorithm is equal to

$$ TCC_{\rm decouple} = O\left( n_o \left( n_{h} +1 \right )^3 + n_{h} \left( n_{\rm in} +1 \right )^3 \right) $$

and the space complexity (storage requirement) is equal to

$$ TCS_{\rm decouple} = O\left( n_o \left( n_{h} +1 \right )^2 + n_{h} \left( n_{\rm in} +1 \right )^2 \right) . $$

They are much smaller than the computational and space complexities of the global case. Some examples related to the complexity issue will be given in the next section.
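As a rough illustration (leading-order operation counts only, ignoring the constants hidden in the O(·) notation), the following snippet evaluates these expressions for the two network configurations used in Sect. 4.

```python
def complexity_estimates(n_in, n_h, n_o):
    """Leading-order per-iteration cost terms for the global vs. decoupled TWDRLS."""
    M = n_o * (n_h + 1) + n_h * (n_in + 1)                          # Eq. (1)
    global_time = M ** 3
    decoupled_time = n_o * (n_h + 1) ** 3 + n_h * (n_in + 1) ** 3
    global_storage = M ** 2
    decoupled_storage = n_o * (n_h + 1) ** 2 + n_h * (n_in + 1) ** 2
    return M, global_time, decoupled_time, global_storage, decoupled_storage

# Generalized XOR network (Sect. 4.1): n_in = 2, n_h = 10, n_o = 1
print(complexity_estimates(2, 10, 1))   # M = 41:  41^3 = 68921   vs  1*11^3 + 10*3^3 = 1601
# Sunspot network (Sect. 4.2): n_in = 12, n_h = 8, n_o = 1
print(complexity_estimates(12, 8, 1))   # M = 113: 113^3 = 1442897 vs 1*9^3 + 8*13^3 = 18305
```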

4 Computer simulations

The proposed decoupled TWDRLS algorithm is applied to two problems: the generalized XOR problem and the sunspot data prediction problem. Its performance is compared with that of the global version. The first problem is a typical nonlinear classification problem, while the second is a standard nonlinear time series prediction problem. The initial weights are small zero-mean independent identically distributed Gaussian random variables. The activation function of the hidden neurons is the hyperbolic tangent function. Since the generalized XOR problem is a classification problem, its output neuron uses the hyperbolic tangent activation function. Since the sunspot data prediction problem is a regression problem, its output neuron uses a linear activation function. The training for each problem is performed 10 times with different random initial weights.

4.1 Generalized XOR problem

The generalized XOR problem is formulated as \(d = {\rm sign}(x_1 x_2)\) with inputs in the range [−1, 1]. The desired output is either −1 (corresponding to logical zero) or 1 (corresponding to logical one). The network has two input neurons, ten hidden neurons, and one output neuron. As a result, there are 41 weights in the network. The training set and the test set, shown in Fig. 1, contain 50 and 2,000 samples, respectively. The total number of training cycles is set to 200 because, after 200 cycles, the training error decreases very slowly. In each cycle, 50 randomly selected samples from the training set are fed to the network one by one.
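A minimal data-generation sketch for this problem is shown below; the uniform sampling of the inputs over [−1, 1] and the helper name are illustrative assumptions, since the sampling distribution is not specified here.

```python
import numpy as np

def make_xor_samples(n_samples, seed=0):
    """Generate samples of the generalized XOR problem d = sign(x1 * x2)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_samples, 2))   # inputs in [-1, 1]
    d = np.sign(X[:, 0] * X[:, 1])                    # desired output: -1 or +1
    return X, d

X_train, d_train = make_xor_samples(50)      # training set size used in the paper
X_test,  d_test  = make_xor_samples(2000)    # test set size used in the paper
```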

Fig. 1 Training and test samples for the generalized XOR problem. a Training samples. b Test samples

Since the generalized XOR problem is a classification problem, the criterion used to evaluate the model performance is the false rate (misclassification rate). A test pattern is misclassified when the sign of the network output is not the same as that of the desired output. Figure 2 summarizes the average test set false rates over the ten runs. The average test set false rates obtained by the global and decoupled TWDRLS algorithms are usually lower than those obtained by the standard RLS and decoupled RLS algorithms over a wide range of regularization parameters. This means that both the global and decoupled TWDRLS algorithms can improve the generalization ability. In terms of average false rate, the performance of the decoupled TWDRLS algorithm is quite similar to that of the global one. The computational and space complexities of the global and decoupled algorithms are listed in Table 1. From Fig. 2 and Table 1, we can conclude that the performance of the decoupled TWDRLS algorithm is comparable to that of the global one, while its time and space complexities are much smaller.

Fig. 2 Average test set false rate over 10 runs for the generalized XOR problem

Table 1 Computational and space complexities of the global and decoupled TWDRLS algorithms for the generalized XOR problem

The decision boundaries obtained from typical trained networks are plotted in Fig. 3. From Figs. 1 and 3, the decision boundaries obtained from the networks trained with the TWDRLS algorithms are closer to the ideal one. Also, the performance of the decoupled TWDRLS algorithm is very close to that of the global TWDRLS algorithm.

Fig. 3 Decision boundaries of various trained networks for the generalized XOR problem. Note that when α = 0, the TWDRLS algorithm is identical to the RLS algorithm. a RLS. b Decoupled RLS. c Global TWDRLS, α = 0.00562. d Decoupled TWDRLS, α = 0.00178

From Fig. 2, the average test set false rate first decreases and then increases as the regularization parameter α increases. This shows that a proper selection of α indeed improves the generalization ability of the network. On the other hand, the test set false rate becomes very high when the decay parameter α is very large: the weight decay effect is then so strong that the trained network cannot learn the target function. To illustrate this further, Fig. 4 plots the decision boundary of a network trained with α = 0.0178. The boundary is quite far from the ideal one because the overly strong weight decay prevents the network from capturing the desired decision boundary.

Fig. 4 Decision boundary of a network trained with the decoupled TWDRLS algorithm and a too-large regularization parameter, α = 0.0178. Since the regularization parameter is too large, the trained network cannot form a good decision boundary

4.2 Sunspot data prediction

The sunspot data from 1700 to 1979, shown in Fig. 5, are taken as the training and the test sets. Following the common practice, we divide the data into a training set (1700–1920) and two test sets, namely, Test-set 1 (1921–1955) and Test-set 2 (1956–1979). The sunspot series is rather non-stationary and Test-set 2 is atypical.

Fig. 5 Sunspot data

In the simulation, we assume that the series is generated from the following auto-regressive model, given by

$$ d(t)=\varphi(d(t-1),\ldots,d(t-12))+ \epsilon (t) $$
(40)

where \(\epsilon(t)\) is noise and \(\varphi (\cdot,\ldots,\cdot)\) is an unknown nonlinear function. A network with 12 input neurons, 8 hidden neurons, and one output neuron is used for approximating \(\varphi (\cdot,\ldots,\cdot). \) There are 113 weights in the MFNN model. The total number of training cycles is equal to 200. As this is a time series problem, the criterion to evaluate the model performance is the mean squared error (MSE) of the test set.
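A sketch of how the lagged input vectors of (40) can be assembled is given below; any normalization of the sunspot series is left out, as it is not specified here, and the helper name is an assumption.

```python
import numpy as np

def make_lagged_pairs(series, order=12):
    """Build (input, target) pairs for the AR model of Eq. (40):
    x(t) = [d(t-1), ..., d(t-order)], target d(t)."""
    X, d = [], []
    for t in range(order, len(series)):
        X.append(series[t - order:t][::-1])    # d(t-1), ..., d(t-12)
        d.append(series[t])
    return np.asarray(X), np.asarray(d)

# e.g. split the annual series into the training years 1700-1920 and the
# two test sets 1921-1955 and 1956-1979, as described above.
```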

Figure 6 summarizes the average MSE over the 10 runs. Over a wide range of the regularization parameter α, both the global and decoupled TWDRLS algorithms greatly improve the generalization ability of the trained networks, especially on Test-set 2, which is quite different from the training set. However, the test MSE becomes very large at large values of α, because the weight decay effect is then too strong and the network cannot learn the target function. In most cases, the performance of the decoupled training is comparable to that of the global training. Also, Table 2 shows that the complexities of the decoupled training are much smaller than those of the global one.

Fig. 6 MSE of networks trained by the global and decoupled TWDRLS algorithms. Note that when α = 0, the TWDRLS algorithm is identical to the RLS algorithm. a Test-set 1 average MSE. b Test-set 2 average MSE

Table 2 Computational and space complexities of the global and decoupled TWDRLS algorithms for the sunspot data prediction problem

5 Conclusion

We have investigated the problem of training the MFNN model using the TWDRLS algorithm. We derived a set of concise equations for the decoupled TWDRLS algorithm. Computer simulations indicate that both the decoupled and global TWDRLS algorithms can improve the generalization ability of MFNNs. The performance of the decoupled TWDRLS algorithm is comparable to that of the global one. However, when the decoupled approach is used, the computational complexity and the storage requirement are greatly reduced. In the decoupled version, each neuron has its own set of RLS equations. Hence, the decoupled version is well suited to parallel computing [25, 26]. One direction for future work is to develop a parallel implementation of the decoupled version, in which each processor handles the computation of one set of RLS equations.