
1 Introduction

Over the last few years, the domain of complex-valued neural networks has received increasing interest. Popular applications of this type of network include antenna design, estimation of direction of arrival and beamforming, radar imaging, communications signal processing, image processing, and many others (for an extensive presentation, see [11]).

These networks appear as a natural choice for solving problems such as channel equalization or time series prediction in the signal processing domain, because some signals are naturally expressed in complex-valued form. Several methods, which include different network architectures and different learning algorithms, have been proposed to increase the efficiency of learning in complex-valued neural networks (see, for example, [18]). Some of these methods are specially designed for these networks, while others are extended from the real-valued case.

One such method, which has proven its efficiency in many applications, is the scaled conjugate gradient learning method. First proposed in [15], the scaled conjugate gradient method has since become a well-known and widely used algorithm for training feedforward neural networks. Taking this fact into account, it seems natural to extend this learning algorithm to complex-valued neural networks as well.

In this paper, we present the deduction of the scaled conjugate gradient method. We also give a general formula for calculating the gradient of the error function that works both for fully complex and for split complex activation functions, in the context of a multilayer feedforward complex-valued neural network. We test the proposed scaled conjugate gradient method on both synthetic and real-world applications. The synthetic applications include two fully complex function approximation problems and one split complex function approximation problem. The real-world application is a nonlinear time series prediction problem.

The remainder of this paper is organized as follows: Sect. 2 gives a thorough explanation of the conjugate gradient methods for the optimization of an error function defined on the complex plane. Then, Sect. 3 presents the scaled conjugate gradient algorithm for complex-valued feedforward neural networks. The experimental results of the four applications of the proposed algorithms are shown and discussed in Sect. 4, along with a detailed description of the nature of each problem. Section 5 is dedicated to presenting the conclusions of the study.

2 Conjugate Gradient Algorithms

Conjugate gradient methods belong to the larger class of line search algorithms. To minimize the error function of a neural network, a series of steps through the weight space is necessary to find the weight vector at which the minimum of the function is attained. Each step is determined by a search direction and a real number telling us how far to move in that direction. In classical gradient descent, the search direction is that of the negative gradient and the real number is the learning rate parameter. In the general case, we can choose some particular search direction and then determine the minimum of the error function along that direction, thus yielding the real number that tells us how far to move. This is the line search algorithm, and it constitutes the basis for a family of methods that perform better than classical gradient descent. Our deduction of the conjugate gradient algorithms mainly follows the one presented in [3], which in turn follows that in [13].

Let’s assume that we have a complex-valued neural network with an error function denoted by E, and a 2N-dimensional weight vector denoted by \(\mathbf {w}=(w_{1}^{R},w_{1}^{I},\ldots ,w_{N}^{R},w_{N}^{I})^{T}\). We must find the weight vector \(\mathbf {w}^{*}\) that minimizes the function \(E(\mathbf {w})\). Suppose we are iterating through the weight space to find the value of \(\mathbf {w}^{*}\) or a very good approximation of it. Further, let’s assume that at step k of the iteration, we want the search direction to be \(\mathbf {p}_{k}\), which is also a 2N-dimensional vector. In this case, the next value of the weight vector is given by \(\mathbf {w}_{k+1}=\mathbf {w}_{k}+\lambda _{k}\mathbf {p}_{k},\) where the parameter \(\lambda _{k}\) is a real number telling us how far in the direction of \(\mathbf {p}_{k}\) we want to go, which means that \(\lambda _{k}\) should be chosen to minimize \(E(\lambda )=E(\mathbf {w}_{k}+\lambda \mathbf {p}_{k}).\)

This is a real-valued function in one real variable, so its minimum is attained when \(\frac{\partial E(\lambda )}{\partial \lambda }=\frac{\partial E(\mathbf {w}_{k}+\lambda \mathbf {p}_{k})}{\partial \lambda }=0.\) By the chain rule, we can write that

$$\begin{aligned} \frac{\partial E(\mathbf {w}_{k}+\lambda \mathbf {p}_{k})}{\partial \lambda }&=\sum _{i=1}^{N}\frac{\partial E(\mathbf {w}_{k}+\lambda \mathbf {p}_{k})}{\partial (w_{i}^{k,R}+\lambda p_{i}^{k,R})}\frac{\partial (w_{i}^{k,R}+\lambda p_{i}^{k,R})}{\partial \lambda }\nonumber \\&\quad +\sum _{i=1}^{N}\frac{\partial E(\mathbf {w}_{k}+\lambda \mathbf {p}_{k})}{\partial (w_{i}^{k,I}+\lambda p_{i}^{k,I})}\frac{\partial (w_{i}^{k,I}+\lambda p_{i}^{k,I})}{\partial \lambda }\nonumber \\&=\sum _{i=1}^{N}\left[ \frac{\partial E(\mathbf {w}_{k+1})}{\partial w_{i}^{k+1,R}}p_{i}^{k,R}+\frac{\partial E(\mathbf {w}_{k+1})}{\partial w_{i}^{k+1,I}}p_{i}^{k,I}\right] \nonumber \\&=\langle \nabla E(\mathbf {w}_{k+1}),\mathbf {p}_{k}\rangle , \end{aligned}$$
(1)

where \(\langle \mathbf {a},\mathbf {b}\rangle \) is the Euclidean scalar product on \(\mathbb {R}^{2N}\simeq \mathbb {C}^{N}\), given by \(\langle \mathbf {a},\mathbf {b}\rangle =\left( \sum _{i=1}^{N}a_{i}\overline{b_{i}}\right) ^{R}=\sum _{i=1}^{N}a_{i}^{R}b_{i}^{R}+a_{i}^{I}b_{i}^{I},\) for all \(\mathbf {a},\mathbf {b}\in \mathbb {R}^{2N}\simeq \mathbb {C}^{N}\), where \(a^{R}\) and \(a^{I}\) denote the real and imaginary parts of the complex number a, and \(\overline{a}\) denotes its conjugate.
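To make the notation concrete, this scalar product can be computed directly from the complex representation of the vectors. Below is a minimal NumPy sketch; the function name real_dot is ours and will be reused in later sketches.

```python
import numpy as np

def real_dot(a, b):
    """Euclidean scalar product on R^{2N} ~ C^N for complex vectors a, b:
    <a, b> = Re(sum_i a_i * conj(b_i)) = sum_i (a_i^R b_i^R + a_i^I b_i^I)."""
    return float(np.sum(a.real * b.real + a.imag * b.imag))
```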

So, if we denote \(\mathbf {g}:=\nabla E\), then (1) can be written in the form

$$\begin{aligned} \langle \mathbf {g}(\mathbf {w}_{k+1}),\mathbf {p}_{k}\rangle =0. \end{aligned}$$
(2)

The next search direction \(\mathbf {p}_{k+1}\) is chosen so that the component of the gradient parallel to the previous search direction \(\mathbf {p}_{k}\) remains zero. As a consequence, we have that \(\langle \mathbf {g}(\mathbf {w}_{k+1}+\lambda \mathbf {p}_{k+1}),\mathbf {p}_{k}\rangle =0.\) By the Taylor series expansion to the first order, we have that \(\mathbf {g}(\mathbf {w}_{k+1}+\lambda \mathbf {p}_{k+1})=\mathbf {g}(\mathbf {w}_{k+1})+\nabla \mathbf {g}(\mathbf {w}_{k+1})^{T}\lambda \mathbf {p}_{k+1},\) and then, if we take (2) into account, we obtain that \(\lambda \langle \nabla \mathbf {g}(\mathbf {w}_{k+1})^{T}\mathbf {p}_{k+1},\mathbf {p}_{k}\rangle =0,\) which is equivalent to \(\langle \mathbf {H}^{T}(\mathbf {w}_{k+1})\mathbf {p}_{k+1},\mathbf {p}_{k}\rangle =0,\) or, further to

$$\begin{aligned} \langle \mathbf {p}_{k+1},\mathbf {H}(\mathbf {w}_{k+1})\mathbf {p}_{k}\rangle =0, \end{aligned}$$
(3)

where we denoted by \(\mathbf {H}(\mathbf {w}_{k+1})\) the Hessian of the error function \(E(\mathbf {w})\), because \(\mathbf {g}=\nabla E\), and thus \(\nabla \mathbf {g}\) is the Hessian.

The search directions that satisfy equation (3) are said to be conjugate directions. The conjugate gradient algorithm builds the search directions \(\mathbf {p}_{k}\), such that each new direction is conjugate to all the previous ones.

Next, we will explain the linear conjugate gradient algorithm. For this, we consider an error function of the form

$$\begin{aligned} E(\mathbf {w})=E_{0}+\mathbf {b}^{T}\mathbf {w}+\frac{1}{2}\mathbf {w}^{T}\mathbf {H}\mathbf {w}, \end{aligned}$$
(4)

where \(\mathbf {b}\) is a constant vector and \(\mathbf {H}\) is a constant matrix, assumed to be positive definite. The gradient of this function is given by

$$\begin{aligned} \mathbf {g}(\mathbf {w})=\mathbf {b}+\mathbf {H}\mathbf {w}. \end{aligned}$$
(5)

The weight vector \(\mathbf {w}^{*}\) that minimizes the function \(E(\mathbf {w})\) satisfies the equation

$$\begin{aligned} \mathbf {b}+\mathbf {H}\mathbf {w}^{*}=0. \end{aligned}$$
(6)

As we saw earlier from equation (3), a set of 2N nonzero vectors \(\{\mathbf {p}_{1},\mathbf {p}_{2},\ldots ,\mathbf {p}_{2N}\}\subset \mathbb {R}^{2N}\) is said to be conjugate with respect to the positive definite matrix \(\mathbf {H}\) if and only if

$$\begin{aligned} \langle \mathbf {p}_{i},\mathbf {H}\mathbf {p}_{j}\rangle =0,\forall i\ne j. \end{aligned}$$
(7)

It is easy to show that, under these conditions, the set of 2N vectors is also linearly independent, which means that they form a basis of \(\mathbb {R}^{2N}\simeq \mathbb {C}^{N}\). If we start from the initial point \(\mathbf {w}_{1}\) and want to find the value of \(\mathbf {w}^{*}\) that minimizes the error function given in (4), then, taking into account the above remarks, we can write

$$\begin{aligned} \mathbf {w}^{*}-\mathbf {w}_{1}=\sum _{i=1}^{2N}\alpha _{i}\mathbf {p}_{i}. \end{aligned}$$
(8)

Now, if we set

$$\begin{aligned} \mathbf {w}_{k}=\mathbf {w}_{1}+\sum _{i=1}^{k-1}\alpha _{i}\mathbf {p}_{i}, \end{aligned}$$
(9)

then (8) can be written in the iterative form

$$\begin{aligned} \mathbf {w}_{k+1}=\mathbf {w}_{k}+\alpha _{k}\mathbf {p}_{k}, \end{aligned}$$
(10)

which means that the value of \(\mathbf {w}^{*}\) can be determined in at most 2N steps for the error function (4), using the above iteration. We still have to determine the real parameters \(\alpha _{k}\) that tell us how far we should go in each of the 2N conjugate directions \(\mathbf {p}_{k}\).

For this, we will multiply equation (8) by \(\mathbf {H}\) to the left, and take the Euclidean \(\mathbb {R}^{2N}\) scalar product with \(\mathbf {p}_{k}\). Taking into account equation (6), we obtain that \(-\langle \mathbf {p}_{k},\mathbf {b}+\mathbf {H}\mathbf {w}_{1}\rangle =\sum _{i=1}^{2N}\alpha _{i}\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{i}\rangle .\) But, because the directions \(\mathbf {p}_{k}\) are conjugate with respect to matrix \(\mathbf {H}\), we have from (7) that \(\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{i}\rangle =0,\forall i\ne k\), so the above equation yields the following value for \(\alpha _{k}\):

$$\begin{aligned} \alpha _{k}=-\frac{\langle \mathbf {p}_{k},\mathbf {b}+\mathbf {H}\mathbf {w}_{1}\rangle }{\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle }. \end{aligned}$$
(11)

Now, if we multiply equation (9) by \(\mathbf {H}\) to the left, and take the Euclidean \(\mathbb {R}^{2N}\) scalar product with \(\mathbf {p}_{k}\), we have that: \(\langle \mathbf {p}_{k},\mathbf {H}\mathbf {w}_{k}\rangle =\langle \mathbf {p}_{k},\mathbf {H}\mathbf {w}_{1}\rangle +\sum _{i=1}^{k-1}\alpha _{i}\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{i}\rangle ,\) or, taking into account that \(\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{i}\rangle =0,\forall i\ne k\), we get that \(\langle \mathbf {p}_{k},\mathbf {H}\mathbf {w}_{k}\rangle =\langle \mathbf {p}_{k},\mathbf {H}\mathbf {w}_{1}\rangle ,\) and so the relation (11) for calculating \(\alpha _{k}\) becomes:

$$\begin{aligned} \alpha _{k}=-\frac{\langle \mathbf {p}_{k},\mathbf {b}+\mathbf {H}\mathbf {w}_{k}\rangle }{\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle }=-\frac{\langle \mathbf {p}_{k},\mathbf {g}_{k}\rangle }{\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle }, \end{aligned}$$
(12)

where \(\mathbf {g}_{k}:=\mathbf {g}(\mathbf {w}_{k})=\mathbf {b}+\mathbf {H}\mathbf {w}_{k}\), as relation (5) shows.

Finally, we need to construct the mutually conjugate directions \(\mathbf {p}_{k}\). For this, the first direction is initialized by the negative gradient of the error function at the initial point \(\mathbf {w}_{1}\), i.e. \(\mathbf {p}_{1}=-\mathbf {g}_{1}\). We have the following update rule for the conjugate directions:

$$\begin{aligned} \mathbf {p}_{k+1}=-\mathbf {g}_{k+1}+\beta _{k}\mathbf {p}_{k}. \end{aligned}$$
(13)

Taking the Euclidean \(\mathbb {R}^{2N}\) scalar product with \(\mathbf {H}\mathbf {p}_{k}\), and imposing the conjugacy condition \(\langle \mathbf {p}_{k+1},\mathbf {H}\mathbf {p}_{k}\rangle =0\), we obtain that

$$\begin{aligned} \beta _{k}=\frac{\langle \mathbf {g}_{k+1},\mathbf {H}\mathbf {p}_{k}\rangle }{\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle }. \end{aligned}$$
(14)

It can easily be shown by induction that repeated application of relations (13) and (14) yields a set of mutually conjugate directions with respect to the positive definite matrix \(\mathbf {H}\).
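As an illustration of the construction above, the following sketch implements the linear conjugate gradient iteration for the quadratic error (4), with \(\mathbf {b}\) and a positive definite \(\mathbf {H}\) given explicitly, using the iteration (10), the step length (12), and the direction updates (13) and (14). The function name and the stopping tolerance are our own illustrative choices.

```python
import numpy as np

def linear_cg(H, b, w, tol=1e-10):
    """Minimize E(w) = E0 + b^T w + 0.5 w^T H w for positive definite H,
    using the iteration (10), step lengths (12) and direction updates (13)-(14).
    All quantities are real 2N-dimensional arrays."""
    g = b + H @ w                                  # gradient, relation (5)
    p = -g                                         # first direction p_1 = -g_1
    for _ in range(len(w)):                        # at most 2N steps for a quadratic
        Hp = H @ p
        alpha = -np.dot(p, g) / np.dot(p, Hp)      # step length (12)
        w = w + alpha * p                          # weight update (10)
        g_new = b + H @ w
        if np.linalg.norm(g_new) < tol:            # gradient already negligible
            break
        beta = np.dot(g_new, Hp) / np.dot(p, Hp)   # conjugacy coefficient (14)
        p = -g_new + beta * p                      # new direction (13)
        g = g_new
    return w
```

For a quadratic error the loop terminates in at most 2N iterations; the tolerance only stops it earlier when the gradient is already negligible.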

So far, we have dealt with a quadratic error function having a positive definite Hessian matrix \(\mathbf {H}\). In practical applications, however, the error function may be far from quadratic, so the expressions for calculating \(\alpha _{k}\) and \(\beta _{k}\) deduced above may not be as accurate as in the quadratic case. Furthermore, these expressions require the explicit calculation of the Hessian matrix \(\mathbf {H}\) at each step of the algorithm, because the Hessian is constant only for a quadratic error function. This calculation is computationally intensive and should be avoided. In what follows, we deduce expressions for \(\alpha _{k}\) and \(\beta _{k}\) that do not require the explicit calculation of the Hessian matrix and do not even assume that the Hessian is positive definite.

First of all, let’s consider the expression for \(\alpha _{k}\), given in (12). Because of the iterative relation (10), we can replace the explicit calculation of \(\alpha _{k}\) with an inexact line search that minimizes \(E(\mathbf {w}_{k+1})=E(\mathbf {w}_{k}+\alpha _{k}\mathbf {p}_{k})\), i.e. a line minimization along the search direction \(\mathbf {p}_{k}\), starting at the point \(\mathbf {w}_{k}\). In our experiments, we used the golden section search, which is guaranteed to have linear convergence, see [4, 14].
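A minimal sketch of such a golden section line minimization is given below; the bracket [0, alpha_max] and the tolerance are illustrative assumptions, and E, w_k, p_k, alpha_max are placeholders, not quantities defined elsewhere in the paper.

```python
import numpy as np

GOLDEN = (np.sqrt(5.0) - 1.0) / 2.0   # inverse golden ratio, about 0.618

def golden_section(f, a, b, tol=1e-6):
    """Minimize a one-dimensional function f on the bracket [a, b] by golden
    section search (linear convergence). Function values are recomputed each
    iteration for clarity; a production version would reuse one of them."""
    c = b - GOLDEN * (b - a)
    d = a + GOLDEN * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - GOLDEN * (b - a)
        d = a + GOLDEN * (b - a)
    return (a + b) / 2.0

# inexact line search for alpha_k along p_k, starting at w_k:
# alpha_k = golden_section(lambda t: E(w_k + t * p_k), 0.0, alpha_max)
```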

Now, let’s turn our attention to \(\beta _{k}\). From (5), we have that

$$ \mathbf {g}_{k+1}-\mathbf {g}_{k}=\mathbf {H}(\mathbf {w}_{k+1}-\mathbf {w}_{k})=\alpha _{k}\mathbf {H}\mathbf {p}_{k}, $$

and so the expression (14) becomes:

$$ \beta _{k}=\frac{\langle \mathbf {g}_{k+1},\mathbf {g}_{k+1}-\mathbf {g}_{k}\rangle }{\langle \mathbf {p}_{k},\mathbf {g}_{k+1}-\mathbf {g}_{k}\rangle }. $$

This is known as the Hestenes-Stiefel update expression, see [10].

Taking into account that \(\langle \mathbf {p}_{k},\mathbf {g}_{k+1}\rangle =0\) by (2), and that \(\langle \mathbf {p}_{k},\mathbf {g}_{k}\rangle =-\langle \mathbf {g}_{k},\mathbf {g}_{k}\rangle \) by (13) and (2) applied at the previous step, the denominator above equals \(\langle \mathbf {g}_{k},\mathbf {g}_{k}\rangle \), and we obtain the Polak-Ribiere update expression (see [17]):

$$\begin{aligned} \beta _{k}=\frac{\langle \mathbf {g}_{k+1},\mathbf {g}_{k+1}-\mathbf {g}_{k}\rangle }{\langle \mathbf {g}_{k},\mathbf {g}_{k}\rangle }. \end{aligned}$$
(15)

For the quadratic error function with exact line searches, successive gradients are mutually orthogonal, so \(\langle \mathbf {g}_{k},\mathbf {g}_{k+1}\rangle =0\), and expression (15) becomes:

$$ \beta _{k}=\frac{\langle \mathbf {g}_{k+1},\mathbf {g}_{k+1}\rangle }{\langle \mathbf {g}_{k},\mathbf {g}_{k}\rangle }. $$

This expression is known as the Fletcher-Reeves update formula, see [19].
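For reference, the three update expressions can be written compactly in terms of the real scalar product defined after relation (1); a sketch (helper names are ours):

```python
import numpy as np

def real_dot(a, b):
    # <a, b> = sum_i (a_i^R b_i^R + a_i^I b_i^I), as defined after (1)
    return float(np.sum(a.real * b.real + a.imag * b.imag))

def beta_hestenes_stiefel(g_new, g_old, p):
    return real_dot(g_new, g_new - g_old) / real_dot(p, g_new - g_old)

def beta_polak_ribiere(g_new, g_old):
    return real_dot(g_new, g_new - g_old) / real_dot(g_old, g_old)

def beta_fletcher_reeves(g_new, g_old):
    return real_dot(g_new, g_new) / real_dot(g_old, g_old)
```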

3 Scaled Conjugate Gradient Algorithm

As we have seen above, in real-world applications the Hessian matrix can be far from positive definite. Because of this, Møller proposed in [15] the scaled conjugate gradient algorithm, which combines the model trust region approach known from the Levenberg-Marquardt algorithm with the conjugate gradient method presented above. To ensure positive definiteness, we add to the Hessian matrix a sufficiently large positive constant \(\lambda _{k}\) multiplied by the identity matrix. With this change, the formula for the step length given in (12) becomes

$$\begin{aligned} \alpha _{k}=-\frac{\langle \mathbf {p}_{k},\mathbf {g}_{k}\rangle }{\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle +\lambda _{k}\langle \mathbf {p}_{k},\mathbf {p}_{k}\rangle }. \end{aligned}$$
(16)

Let us denote the denominator of (16) by \(\delta _{k}:=\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle +\lambda _{k}\langle \mathbf {p}_{k},\mathbf {p}_{k}\rangle \). For a positive definite Hessian matrix, we have that \(\delta _{k}>0\). But if \(\delta _{k}\le 0\), then we should increase the value of \(\delta _{k}\) in order to make it positive. Let \(\overline{\delta _{k}}\) denote the new value of \(\delta _{k}\), and, accordingly, let \(\overline{\lambda _{k}}\) denote the new value of \(\lambda _{k}\). It is clear that we have the relation

$$\begin{aligned} \overline{\delta _{k}}=\delta _{k}+(\overline{\lambda _{k}}-\lambda _{k})\langle \mathbf {p}_{k},\mathbf {p}_{k}\rangle , \end{aligned}$$
(17)

and, in order to have \(\overline{\delta _{k}}>0\), we must have \(\overline{\lambda _{k}}>\lambda _{k}-\delta _{k}/\langle \mathbf {p}_{k},\mathbf {p}_{k}\rangle \). Møller in [15] chooses to set \(\overline{\lambda _{k}}=2\left( \lambda _{k}-\frac{\delta _{k}}{\langle \mathbf {p}_{k},\mathbf {p}_{k}\rangle }\right) \), and so the expression for the new value of \(\delta _{k}\) given in (17), becomes \(\overline{\delta _{k}}=-\delta _{k}+\lambda _{k}\langle \mathbf {p}_{k},\mathbf {p}_{k}\rangle =-\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle ,\) which is now positive and will be used in (16) to calculate the value of \(\alpha _{k}\).
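A sketch of this adjustment is given below; it takes the scalar products \(\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle \) and \(\langle \mathbf {p}_{k},\mathbf {p}_{k}\rangle \) as inputs, and the argument names are our own.

```python
def ensure_positive_delta(pHp, pp, lam):
    """delta_k = <p_k, H p_k> + lambda_k <p_k, p_k>; if it is not positive,
    raise lambda_k following Moller's choice, which makes the new
    delta_k equal to -<p_k, H p_k> > 0."""
    delta = pHp + lam * pp
    if delta <= 0.0:
        lam = 2.0 * (lam - delta / pp)   # new lambda_k
        delta = -pHp                     # = old delta_k + (new lambda_k - old lambda_k) * pp
    return delta, lam
```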

Another problem mentioned above is the quadratic approximation of the error function E. The scaled conjugate gradient algorithm addresses this problem by considering a comparison parameter defined by

$$\begin{aligned} \varDelta _{k}=\frac{E(\mathbf {w}_{k})-E(\mathbf {w}_{k}+\alpha _{k}\mathbf {p}_{k})}{E(\mathbf {w}_{k})-E_{Q}(\mathbf {w}_{k}+\alpha _{k}\mathbf {p}_{k})}, \end{aligned}$$
(18)

where \(E_{Q}(\mathbf {w})\) represents the local quadratic approximation of the error function in the neighborhood of the point \(\mathbf {w}_{k}\), given by

$$\begin{aligned} E_{Q}(\mathbf {w}_{k}+\alpha _{k}\mathbf {p}_{k})=E(\mathbf {w}_{k})+\alpha _{k}\langle \mathbf {p}_{k},\mathbf {g}_{k}\rangle +\frac{1}{2}\alpha _{k}^{2}\langle \mathbf {p}_{k},\mathbf {H}\mathbf {p}_{k}\rangle . \end{aligned}$$
(19)

We can easily see that \(\varDelta _{k}\) measures how good the quadratic approximation really is. Plugging relation (19) into relation (18), and taking into account expression (12) for \(\alpha _{k}\), we have that \(\varDelta _{k}=-\frac{2(E(\mathbf {w}_{k})-E(\mathbf {w}_{k}+\alpha _{k}\mathbf {p}_{k}))}{\alpha _{k}\langle \mathbf {p}_{k},\mathbf {g}_{k}\rangle },\) where the minus sign appears because \(\alpha _{k}\langle \mathbf {p}_{k},\mathbf {g}_{k}\rangle \) is negative along a descent direction.

The value of \(\lambda _{k}\) is then updated in the following way

$$ \lambda _{k+1}={\left\{ \begin{array}{ll} \lambda _{k}/2, & \text {if }\varDelta _{k}>0.75\\ 4\lambda _{k}, & \text {if }\varDelta _{k}<0.25\\ \lambda _{k}, & \text {otherwise} \end{array}\right. }, $$

in order to ensure a better quadratic approximation.

Thus, there are two stages of updating \(\lambda _{k}\): one to ensure that \(\delta _{k}>0\) and one according to the validity of the local quadratic approximation. The two stages are applied successively after each weight update.
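The second stage can be sketched as follows, using the simplified form of \(\varDelta _{k}\) derived above; the argument names are ours, and this sketch is meant to be read together with the one for the first stage given earlier.

```python
def update_lambda(E_old, E_new, alpha, p_dot_g, lam):
    """Comparison parameter (18) in the simplified form
    Delta_k = -2(E(w_k) - E(w_k + alpha_k p_k)) / (alpha_k <p_k, g_k>),
    followed by the update of lambda_k. Note that p_dot_g = <p_k, g_k>
    is negative along a descent direction, so Delta_k is positive
    whenever the error actually decreases."""
    Delta = -2.0 * (E_old - E_new) / (alpha * p_dot_g)
    if Delta > 0.75:
        lam = lam / 2.0
    elif Delta < 0.25:
        lam = 4.0 * lam
    return lam, Delta
```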

In order to apply the scaled conjugate gradient algorithm to a complex-valued feedforward neural network, we only need to calculate the gradient of the error function at different steps. In what follows, we will give a method for calculating such gradients, using the well-known backpropagation scheme.

Let’s assume that we have a fully connected complex-valued feedforward network with L layers, where layer 1 is the input layer, layer L is the output layer, and the layers denoted by \(\{2,\ldots ,L-1\}\) are hidden layers. The error function \(E:\mathbb {R}^{2N}\simeq \mathbb {C}^{N}\rightarrow \mathbb {R}\) for such a network is

$$ E(\mathbf {w})=\frac{1}{2}\sum _{i=1}^{c}[(y_{i}^{L,R}-t_{i}^{R})^{2}+(y_{i}^{L,I}-t_{i}^{I})^{2}], $$

where \((y_{i}^{L})_{1\le i\le c}\) represent the outputs of the network, \((t_{i})_{1\le i\le c}\) represent the targets, and \(\mathbf {w}\) represents the vector of the weights and biases of the network. As a consequence, in order to compute the gradient \(\mathbf {g}(\mathbf {w}):=\nabla E(\mathbf {w})\), we must calculate all the partial derivatives of the form \(\frac{\partial E}{\partial w_{jk}^{l,R}}(\mathbf {w})\) and \(\frac{\partial E}{\partial w_{jk}^{l,I}}(\mathbf {w})\), where \(w_{jk}^{l}\) denotes the weight connecting neuron j from layer l to neuron k from layer \(l-1\), for all \(l\in \{2,\ldots ,L\}\). We further denote

$$ s_{j}^{l}:=\sum _{k}w_{jk}^{l}x_{k}^{l-1}+w_{j0}^{l},\qquad y_{j}^{l}:=G^{l}(s_{j}^{l}), $$

where \(G^{l}\) is the activation function of layer \(l\in \{2,\ldots ,L\}\), \((x_{k}^{1})_{1\le k\le d}\) are the network inputs, and \(x_{k}^{l}:=y_{k}^{l}\), \(\forall l\in \{2,\ldots ,L-1\}\), \(\forall k\); in other words, the outputs \(y_{k}^{l}\) of layer l are also the inputs \(x_{k}^{l}\) to layer \(l+1\), while \(y_{k}^{L}\) are the network outputs.
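For clarity, the forward pass defined by these relations can be sketched as follows; the data layout, with weights[l] of shape (n_l, n_{l-1}) and the biases \(w_{j0}^{l}\) stored separately, is our own convention.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward pass through a fully connected complex-valued network.
    weights, biases and activations are lists indexed by layer 2..L;
    activations[l] is the (fully complex or split complex) function G^l."""
    y = x
    for W, b, G in zip(weights, biases, activations):
        s = W @ y + b          # s_j^l = sum_k w_jk^l x_k^{l-1} + w_j0^l
        y = G(s)               # y_j^l = G^l(s_j^l)
    return y
```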

By direct calculation, we obtain the following formula for computing the components of the gradient of the error function:

$$ \frac{\partial E}{\partial w_{jk}^{l,R}}(\mathbf {w})+i\frac{\partial E}{\partial w_{jk}^{l,I}}(\mathbf {w})=\delta _{j}^{l}\overline{x_{k}^{l-1}},\forall l\in \{2,\ldots ,L\}, $$

where

$$ \delta _{j}^{l}={\left\{ \begin{array}{ll} \left( \sum _{m}\overline{w_{mj}^{l+1}}\delta _{m}^{l+1}\right) ^{R}\left( \frac{\partial G^{l,R}(s_{j}^{l})}{\partial s_{j}^{l,R}}+i\frac{\partial G^{l,R}(s_{j}^{l})}{\partial s_{j}^{l,I}}\right) \\ +\left( \sum _{m}\overline{w_{mj}^{l+1}}\delta _{m}^{l+1}\right) ^{I}\left( \frac{\partial G^{l,I}(s_{j}^{l})}{\partial s_{j}^{l,R}}+i\frac{\partial G^{l,I}(s_{j}^{l})}{\partial s_{j}^{l,I}}\right) , &{} l\le L-1\\ (y_{j}^{l,R}-t_{j}^{R})\left( \frac{\partial G^{l,R}(s_{j}^{l})}{\partial s_{j}^{l,R}}+i\frac{\partial G^{l,R}(s_{j}^{l})}{\partial s_{j}^{l,I}}\right) \\ +(y_{j}^{l,I}-t_{j}^{I})\left( \frac{\partial G^{l,I}(s_{j}^{l})}{\partial s_{j}^{l,R}}+i\frac{\partial G^{l,I}(s_{j}^{l})}{\partial s_{j}^{l,I}}\right) , &{} l=L \end{array}\right. }. $$

The above formula works both for fully complex and for split complex activation functions, and thus provides a unified formulation of the fully complex and split complex variants of the complex-valued backpropagation algorithm found in the literature.
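As an illustration only, under the assumption of a split hyperbolic tangent hidden activation and an identity output activation (the setting used later in Sect. 4.3), the recursion above could be sketched as follows; all helper names and array shapes are our own choices, not part of the paper.

```python
import numpy as np

def split_tanh(s):
    return np.tanh(s.real) + 1j * np.tanh(s.imag)

def split_tanh_partials(s):
    """Partial derivatives of G(s) = tanh(s^R) + i tanh(s^I), packed as
    (dG^R/ds^R + i dG^R/ds^I, dG^I/ds^R + i dG^I/ds^I); the cross partials vanish."""
    dR = 1.0 - np.tanh(s.real) ** 2
    dI = 1.0 - np.tanh(s.imag) ** 2
    return dR + 0j, 1j * dI

def output_delta(y, t):
    """delta_j^L for the identity output activation G(s) = s, where
    dG^R/ds^R = dG^I/ds^I = 1 and the cross partials vanish: delta^L = y - t."""
    return (y.real - t.real) + 1j * (y.imag - t.imag)

def hidden_delta(delta_next, W_next, s):
    """delta_j^l for l <= L-1 with the split tanh activation; W_next[m, j]
    is the weight w_mj^{l+1}."""
    back = np.conj(W_next).T @ delta_next     # sum_m conj(w_mj^{l+1}) delta_m^{l+1}
    gR, gI = split_tanh_partials(s)
    return back.real * gR + back.imag * gI

def layer_gradient(delta, x_prev):
    """dE/dw^{l,R} + i dE/dw^{l,I} = delta_j^l * conj(x_k^{l-1}), as an outer product."""
    return np.outer(delta, np.conj(x_prev))
```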

4 Experimental Results

4.1 Fully Complex Synthetic Function I

The first synthetic fully complex function on which we test the proposed algorithm is the two-variable quadratic function \(f_{1}(z_{1},z_{2})=\frac{1}{6}\left( z_{1}^{2}+z_{2}^{2}\right) \). Fully complex functions treat complex numbers as a whole, and not the real and imaginary parts separately, as split complex functions do. This problem was used as a benchmark to test the performance of different complex-valued neural network architectures and learning algorithms, for example in [20–22, 25].

To train the networks, we randomly generated 3000 training samples, each having inputs \(z_{1},z_{2}\) inside the disk centered at the origin with radius 2.5. For testing, we generated 1000 samples with the same characteristics. All the networks had one hidden layer consisting of 15 neurons and were trained for 5000 epochs. The activation function for the hidden layer was the fully complex hyperbolic tangent function, given by \(G^{2}(z)=\tanh z=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}},\) and the activation function for the output layer was the identity \(G^{3}(z)=z\).
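A possible way to generate such a data set is sketched below; the uniform distribution over the disk is our assumption, since the paper only states that the samples were generated randomly inside it.

```python
import numpy as np

def sample_disk(n_samples, n_inputs, radius, seed=0):
    """Draw complex inputs uniformly from the disk of the given radius."""
    rng = np.random.default_rng(seed)
    r = radius * np.sqrt(rng.uniform(size=(n_samples, n_inputs)))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=(n_samples, n_inputs))
    return r * np.exp(1j * theta)

z = sample_disk(3000, 2, 2.5)                  # training inputs z_1, z_2
t = (z[:, 0] ** 2 + z[:, 1] ** 2) / 6.0        # targets f_1(z_1, z_2)
G2 = np.tanh                                   # fully complex tanh (works on complex arrays)
```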

In our experiments, we trained complex-valued feedforward neural networks using the classical gradient descent algorithm (abbreviated GD), the gradient descent algorithm with momentum (GDM), the conjugate gradient algorithm with Hestenes-Stiefel updates (CGHS), the conjugate gradient algorithm with Polak-Ribiere updates (CGPR), the conjugate gradient algorithm with Fletcher-Reeves updates (CGFR), and the scaled conjugate gradient algorithm (SCG).

Table 1 Experimental results for the function \(f_{1}\)

Training was repeated 50 times for each algorithm, and the resulting mean and standard deviation of the mean squared error (MSE) are given in Table 1.

The best algorithm was clearly SCG, followed by the conjugate gradient algorithms. The table also gives the MSE of other algorithms used to learn this function, together with the references in which these algorithms and network architectures first appeared. We can see that the proposed algorithm performed better than all these other algorithms.

4.2 Fully Complex Synthetic Function II

A more complicated example is given by the following function: \(f_{2}(z_{1},z_{2},z_{3},z_{4})=\frac{1}{1.5}\left( z_{3}+10z_{1}z_{4}+\frac{z_{2}^{2}}{z_{1}}\right) ,\) which was used as a benchmark in [1, 22–24]. We generated 3000 random training samples and 1000 testing samples, all having inputs inside the unit disk. Variable \(z_{1}\) was chosen so that its modulus is greater than 0.1, because its reciprocal appears in the expression of the function, and otherwise it could have led to very large values of the function in comparison with the other variables. The networks had a single hidden layer with 25 neurons, the same activation functions as in the previous experiment, and were trained for 5000 epochs. Table 2 shows the results of running each of the algorithms 50 times. The table also presents the values of the MSE for different learning methods and architectures found in the literature.

In this experiment also, SCG performed better than the conjugate gradient algorithms, but worse than some other types of architectures used to learn this problem.

Table 2 Experimental results for the function \(f_{2}\)

4.3 Split Complex Synthetic Function I

We now test the proposed algorithm on a split complex function. The function, also used in [2, 5, 12], is \(f_{3}(x+iy)=\sin x\cosh y+i\cos x\sinh y.\)

The training set had 3000 samples and the test set had 1000 samples, randomly generated from the unit disk. The neural networks had 15 neurons in a single hidden layer. The activation functions were the split hyperbolic tangent for the hidden layer: \(G^{2}(x+iy)=\tanh x+i\tanh y=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}+i\frac{e^{y}-e^{-y}}{e^{y}+e^{-y}},\) and the identity function for the output layer: \(G^{3}(z)=z\).

The mean and standard deviation of the mean squared error (MSE) over 50 runs are presented in Table 3. The performances of the algorithms were similar to the ones in the previous experiments.

Table 3 Experimental results for the function \(f_{3}\)

4.4 Nonlinear Time Series Prediction

The last experiment deals with the prediction of complex-valued nonlinear signals. It involves passing the output of the autoregressive filter given by \(y(k)=1.79y(k-1)-1.85y(k-2)+1.27y(k-3)-0.41y(k-4)+n(k)\) through the nonlinearity given by \(z(k)=\frac{z(k-1)}{1+z^{2}(k-1)}+y^{3}(k)\), which was proposed in [16] and then used in [7–9].

The complex-valued noise n(k) was chosen so that the variance of the signal as a whole is 1, taking into account that \(\sigma ^{2}=(\sigma ^{R})^{2}+(\sigma ^{I})^{2}\). The filter had 4 taps, so the networks had 4 inputs, 4 hidden neurons, and one output neuron. They were trained for 5000 epochs with 5000 training samples.
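A sketch of the signal generation is given below. The value of noise_var is an illustrative assumption; the paper instead scales the noise so that the variance of the resulting signal as a whole equals 1.

```python
import numpy as np

def generate_series(n_samples, noise_var=0.01, seed=0):
    """AR(4) filter y(k) driven by circular complex Gaussian noise n(k),
    passed through the nonlinearity z(k) = z(k-1)/(1 + z(k-1)^2) + y(k)^3."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(noise_var / 2.0)   # split the variance between real and imaginary parts
    n = rng.normal(0.0, sigma, n_samples) + 1j * rng.normal(0.0, sigma, n_samples)
    y = np.zeros(n_samples, dtype=complex)
    z = np.zeros(n_samples, dtype=complex)
    for k in range(4, n_samples):
        y[k] = 1.79 * y[k-1] - 1.85 * y[k-2] + 1.27 * y[k-3] - 0.41 * y[k-4] + n[k]
        z[k] = z[k-1] / (1.0 + z[k-1] ** 2) + y[k] ** 3
    return z
```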

After running each algorithm 50 times, the results are given in Table 4. In the table, we present a measure of performance called the prediction gain, defined by \(R_{p}=10\log _{10}\frac{\sigma _{x}^{2}}{\sigma _{e}^{2}},\) where \(\sigma _{x}^{2}\) represents the variance of the input signal and \(\sigma _{e}^{2}\) represents the variance of the prediction error. The prediction gain is given in dB; by definition, a larger prediction gain means better performance. It can easily be seen that in this case SCG, CGHS, and CGFR gave approximately the same results, with CGPR performing slightly worse, and these results were better than those of some classical algorithms and network architectures found in the literature.
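The prediction gain can be computed directly from the complex-valued signal and its prediction, since NumPy's variance of a complex array already sums the real and imaginary contributions; a minimal sketch:

```python
import numpy as np

def prediction_gain(x, x_hat):
    """R_p = 10 log10(sigma_x^2 / sigma_e^2), in dB."""
    e = x - x_hat
    return 10.0 * np.log10(np.var(x) / np.var(e))
```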

Table 4 Experimental results for nonlinear time series prediction

5 Conclusions

The full deductions of the scaled conjugate gradient algorithm and of the best-known variants of the conjugate gradient algorithm for training complex-valued feedforward neural networks were presented. A method for computing the gradients of the error function was given, which can be applied both for fully complex and for split complex activation functions. The three variants of the conjugate gradient algorithm with different update rules and the scaled conjugate gradient algorithm for optimizing the error function were applied to train networks used to solve four well-known synthetic and real-world problems.

Experimental results showed that the scaled conjugate gradient method performed better on the proposed problems than the classical gradient descent and gradient descent with momentum algorithms, in some cases as much as four orders of magnitude better in terms of training and testing mean squared error.

The scaled conjugate gradient algorithm was generally better than the classical variants of the conjugate gradient algorithm. This ranking of the algorithms in terms of performance is consistent with the one observed in the real-valued case, which is yet another argument for extending these learning methods to the complex-valued domain.

In conclusion, the scaled conjugate gradient algorithm represents an efficient and fast method for training feedforward complex-valued neural networks, as shown by its performance in solving very heterogeneous synthetic and real-world problems.