
1 Introduction

Research on the effect of noise on neural networks has been conducted for almost two decades. From the early 1990s to the mid-1990s, researchers investigated the effect of noise on the performance of multilayer perceptrons (MLPs) and recurrent neural networks (RNNs) [6, 11–13, 15], as well as associative networks [2, 19]. Later, from the mid-1990s to the late 1990s, researchers started to analyze the effects of additive input noise (AIN) [5, 7, 8, 14] and additive weight noise (AWN) [1] on back-propagation learning, and the objective functions of these noise-injection-based learning algorithms were revealed. In the 2000s, researchers investigated the effect of chaotic noise (CN) on MLPs [3, 4].

In recent years, the effects of AWN and multiplicative weight noise (MWN) on RBF and MLP learning algorithms have been investigated [9, 10, 16, 17]. It has been shown that the objective function of the RBF learning algorithm with AWN or MWN injection is identical to that of the original RBF learning algorithm [9]. Hence, injecting AWN or MWN during RBF learning cannot improve the generalization ability of an RBF network. Injecting AWN during MLP learning can improve the generalization ability of an MLP, but injecting MWN during MLP learning might not [10, 17]. These results clarify a common misconception that injecting noise during learning always improves the generalization ability of a neural network.

Now, we would like to investigate another question: would similar results be obtained for other learning algorithms? One obvious approach is to investigate the effect of noise on a gradient system, as many learning algorithms are developed by the method of gradient descent. Understanding the effect of noise on gradient systems can aid in understanding the effect of noise on these learning algorithms. Therefore, the objective of this paper is to investigate the effects of three types of noise (multiplicative noise, additive noise and chaotic noise) on a gradient system with forgetting. The energy functions of the corresponding gradient systems are revealed.

In the next section, the gradient systems with noise are introduced. The energy functions of these gradient systems with noise are analyzed in Sect. 3. The effect of noise on the gradient systems is elucidated in Sect. 4. Finally, Sect. 5 gives the conclusion of the paper.

2 Models

Let \({\mathbf x}(t) \in R^n\) and let \(F({\mathbf x}) \in R\) be a bounded smooth function of \({\mathbf x}\). The energy function is given by \(V({\mathbf x}) = F({\mathbf x}) + \frac{\lambda }{2}\Vert {\mathbf x}\Vert ^2_2\), where \(\lambda \) is a small positive number called the forgetting factor. The gradient system is defined as follows:

$$\begin{aligned} {\mathbf x}(t+1) = {\mathbf x}(t) - \mu \left( \frac{\partial F({\mathbf x}(t))}{\partial {\mathbf x}} + \lambda {\mathbf x}(t)\right) , \end{aligned}$$
(1)

where \(\mu \) is the learning step and it is a small positive number, and \(\partial F({\mathbf x}(t))/\partial {\mathbf x} = \left. \partial F({\mathbf x})/\partial {\mathbf x} \right| _{{\mathbf x} = {\mathbf x}(t)}\).
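For concreteness, (1) can be simulated in a few lines. Below is a minimal Python sketch, assuming the hypothetical bounded smooth objective \(F({\mathbf x}) = \sum _i \cos x_i\), chosen purely for illustration (it is not from the analysis above):

```python
import numpy as np

# A minimal sketch of update rule (1). F(x) = sum_i cos(x_i) is a
# hypothetical bounded smooth objective chosen purely for illustration.
mu, lam = 0.01, 0.001              # learning step and forgetting factor

def grad_F(x):
    return -np.sin(x)              # gradient of sum_i cos(x_i)

x = np.random.randn(5)             # initial state x(0)
for t in range(10_000):
    x = x - mu * (grad_F(x) + lam * x)   # update rule (1)
```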

2.1 Multiplicative/Additive Noise

With multiplicative noise, the vector \({\mathbf x}(t)\) in (1) is replaced by \(\tilde{\mathbf x}(t)\), where

$$\begin{aligned} \tilde{\mathbf x}(t) &= {\mathbf x}(t) + {\mathbf b}(t)\otimes {\mathbf x}(t), \\ {\mathbf b}(t)\otimes {\mathbf x}(t) &= (b_1(t)x_1(t), b_2(t)x_2(t), \cdots , b_n(t) x_n(t))^T. \end{aligned}$$
(2)

With additive noise,

$$\begin{aligned} \tilde{\mathbf x}(t) = {\mathbf x}(t) + {\mathbf b}(t). \end{aligned}$$
(3)

In (2) and (3), \({\mathbf b}(t) \in R^n\) is a Gaussian random vector with mean \({\mathbf 0}\) and covariance matrix \(S_b {\mathbf I}_{n\times n}\). That is, \(E[b_i(t)] = 0\) for all \(i = 1, \cdots , n\) and \(t \ge 0\), \(E[b_i^2(t)] = S_b\), and \(E[b_i(t)b_j(t)] = 0\) for \(i \ne j\). The noise is also temporally white, i.e. \(E[b_i(t_1)b_i(t_2)] = 0\) for \(t_1 \ne t_2\). The gradient system with noise is given as follows:

$$\begin{aligned} {\mathbf x}(t+1) = \tilde{\mathbf x}(t) - \mu \left( \frac{\partial F(\tilde{\mathbf x}(t))}{\partial {\mathbf x}} + \lambda \tilde{\mathbf x}(t) \right) . \end{aligned}$$
(4)
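Continuing the sketch above, the noisy update (4) differs only in that the state is corrupted before the gradient step. Again, the objective and all constants are illustrative assumptions:

```python
import numpy as np

mu, lam, S_b = 0.01, 0.001, 0.01
grad_F = lambda x: -np.sin(x)      # same illustrative objective as before

def noisy_state(x, multiplicative):
    # Draw b(t) ~ N(0, S_b I) and corrupt the state per (2) or (3).
    b = np.sqrt(S_b) * np.random.randn(*x.shape)
    return x + b * x if multiplicative else x + b

x = np.random.randn(5)
for t in range(10_000):
    xt = noisy_state(x, multiplicative=True)  # False gives additive noise
    x = xt - mu * (grad_F(xt) + lam * xt)     # noisy update rule (4)
```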

2.2 Chaotic Noise

With chaotic noise injection, the noise is added to the gradient vector as follows [3, 4, 20]:

$$\begin{aligned} {\mathbf x}(t+1) = {\mathbf x}(t) - \mu \left( \frac{\partial F({\mathbf x}(t))}{\partial {\mathbf x}} + \lambda {\mathbf x}(t) + \kappa n(t) {\mathbf e}\right) , \end{aligned}$$
(5)

where \({\mathbf e}\) is the constant vector of all ones, \(\kappa \) is a positive constant, and \(n(t)\) is a deterministic noise sequence generated by the logistic map

$$\begin{aligned} n(t+1) = \alpha n(t) (1-n(t)), \;\;\;\; 3.6 < \alpha < 4. \end{aligned}$$
(6)
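A corresponding sketch of the chaotic-noise system (5)–(6), again with the illustrative objective used earlier; the values of \(\kappa \), \(\alpha \) and \(n(0)\) are assumptions for illustration:

```python
import numpy as np

mu, lam, kappa, alpha = 0.01, 0.001, 0.01, 3.8   # kappa, alpha assumed
grad_F = lambda x: -np.sin(x)      # same illustrative objective as before

x = np.random.randn(5)
e = np.ones_like(x)                # constant vector of all ones
n = 0.3                            # assumed n(0) in (0, 1)
for t in range(10_000):
    x = x - mu * (grad_F(x) + lam * x + kappa * n * e)   # update rule (5)
    n = alpha * n * (1.0 - n)                            # logistic map (6)
```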

3 Energy Functions

In this section, the energy functions of these gradient systems with noise are revealed. The effect of noise on the gradient systems will be discussed in the next section.

3.1 Multiplicative/Additive Noise

Given \({\mathbf x}(t)\), the mean update of (4) can be written as follows:

$$\begin{aligned} E[{\mathbf x}(t+1)|{\mathbf x}(t)] = E[\tilde{\mathbf x}(t)|{\mathbf x}(t)] - \mu E\left[ \left. \frac{\partial F(\tilde{\mathbf x}(t))}{\partial {\mathbf x}} + \lambda \tilde{\mathbf x}(t)\right| {\mathbf x}(t)\right] . \end{aligned}$$
(7)

In (7), the expectation is taken over the probability space of \(\tilde{\mathbf x}(t)\). Since \(E[{\mathbf b}(t)] = {\mathbf 0}\), (2) and (3) both give \(E[\tilde{\mathbf x}(t)|{\mathbf x}(t)]= {\mathbf x}(t)\). Equation (7) can thus be rewritten as follows:

$$\begin{aligned} E[{\mathbf x}(t+1)|{\mathbf x}(t)] = {\mathbf x}(t) - \mu \left( E\left[ \left. \frac{\partial F(\tilde{\mathbf x})}{\partial {\mathbf x}} \right| {\mathbf x}(t)\right] + \lambda {\mathbf x}(t) \right) . \end{aligned}$$
(8)

Next, we let \(V_{\otimes }({\mathbf x})\) be a scalar function such that

$$\begin{aligned} E[{\mathbf x}(t+1)|{\mathbf x}(t)] = {\mathbf x}(t) - \mu \frac{\partial V_{\otimes }({\mathbf x}(t))}{\partial {\mathbf x}}. \end{aligned}$$
(9)

The energy function is stated in the following theorem.

Theorem 1

For the gradient system defined in (1) with \({\mathbf x}(t)\) corrupted by multiplicative noise as stated in (2),

$$\begin{aligned} E[F(\tilde{\mathbf x})|{\mathbf x}] = F({\mathbf x}) + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_j} x_j^2 \end{aligned}$$
(10)

and

$$\begin{aligned} V_{\otimes }({\mathbf x}) = F({\mathbf x}) + \frac{\lambda }{2} \Vert {\mathbf x}\Vert ^2_2 + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_j} x_j^2 - S_b \int {\mathbf x}\otimes \text{ diag }\left\{ {\mathbf H}({\mathbf x})\right\} \cdot d{\mathbf x}, \end{aligned}$$
(11)

where \(\int \) is the line integral, \({\mathbf H}({\mathbf x})\) is the Hessian matrix of \(F({\mathbf x})\), i.e. \({\mathbf H}({\mathbf x}) = \nabla \nabla _{\mathbf x} F({\mathbf x})\) and

$$ \text{ diag }\left\{ {\mathbf H}({\mathbf x})\right\} = \left( \frac{\partial ^2 F({\mathbf x})}{\partial x_1^2}, \frac{\partial ^2 F({\mathbf x})}{\partial x_2^2}, \cdots , \frac{\partial ^2 F({\mathbf x})}{\partial x_n^2} \right) ^T. $$

Proof: Consider (8) and let \(\partial F({\mathbf x})/\partial x_i\) be the \(i^{th}\) element of \(\partial F({\mathbf x})/\partial {\mathbf x}\). Expanding \(\partial F(\tilde{\mathbf x})/\partial x_i\) about \({\mathbf x}\) to second order in \({\mathbf b}\) gives

$$\begin{aligned} \frac{\partial F(\tilde{\mathbf x})}{\partial x_i} = \frac{\partial F({\mathbf x})}{\partial x_i} + \sum _{j=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_i} (b_j x_j) + \frac{1}{2} \sum _{k=1}^n \sum _{j=1}^n \frac{\partial ^3 F({\mathbf x})}{\partial x_k \partial x_j \partial x_i} b_k b_j x_k x_j. \end{aligned}$$
(12)

Therefore,

$$\begin{aligned} E\left[ \left. \frac{\partial F(\tilde{\mathbf x})}{\partial x_i} \right| {\mathbf x} \right] = \frac{\partial F({\mathbf x})}{\partial x_i} + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^3 F({\mathbf x})}{\partial x_j \partial x_j \partial x_i} x_j^2. \end{aligned}$$
(13)

On the other hand, by expanding \(F(\tilde{\mathbf x})\) about \({\mathbf x}\) to second order, we get that

$$ F(\tilde{\mathbf x}) = F({\mathbf x}) + \sum _{i=1}^n \frac{\partial F({\mathbf x})}{\partial x_i} b_i x_i + \frac{1}{2} \sum _{j=1}^n \sum _{i=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_i} b_j b_i x_j x_i $$

and hence

$$\begin{aligned} E[F(\tilde{\mathbf x})|{\mathbf x}] = F({\mathbf x}) + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_j} x_j^2. \end{aligned}$$
(14)

Differentiating both sides of (14) with respect to \(x_i\), we get that

$$\begin{aligned} \frac{\partial }{\partial x_i} E[F(\tilde{\mathbf x})|{\mathbf x}] = \frac{\partial F({\mathbf x})}{\partial x_i} + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^3 F({\mathbf x})}{\partial x_i \partial x_j \partial x_j} x_j^2 + S_b \frac{\partial ^2 F({\mathbf x})}{\partial x_i \partial x_i} x_i. \end{aligned}$$
(15)

As \(F({\mathbf x})\) is smooth, \(\partial ^3 F({\mathbf x})/\partial x_j \partial x_j \partial x_i = \partial ^3 F({\mathbf x})/\partial x_i \partial x_j \partial x_j\). Comparing (13) and (15), we get that

$$ E\left[ \left. \frac{\partial F(\tilde{\mathbf x})}{\partial x_i} \right| {\mathbf x} \right] = \frac{\partial E[F(\tilde{\mathbf x})|{\mathbf x}]}{\partial x_i} - S_b \frac{\partial ^2 F({\mathbf x})}{\partial x_i \partial x_i} x_i. $$

Further by (8) and (9), we get that

$$\begin{aligned} V_{\otimes }({\mathbf x}) = E[F(\tilde{\mathbf x})|{\mathbf x}] - S_b \int {\mathbf x}\otimes \text{ diag }\left\{ {\mathbf H}({\mathbf x})\right\} \cdot d{\mathbf x} + \frac{\lambda }{2} \Vert {\mathbf x}\Vert ^2_2. \end{aligned}$$
(16)

Substituting (10) into (16) and rearranging the terms, we obtain the energy function given in (11), and the proof is completed. Q.E.D.
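Since (10) rests on a Taylor expansion truncated at second order, it can be checked by Monte Carlo simulation. The sketch below, again assuming the illustrative \(F({\mathbf x}) = \sum _i \cos x_i\) (so \(\partial ^2 F/\partial x_j^2 = -\cos x_j\)), compares the two sides of (10); for small \(S_b\) they agree up to higher-order terms:

```python
import numpy as np

rng = np.random.default_rng(0)
S_b, dim, trials = 1e-4, 4, 200_000

F  = lambda x: np.cos(x).sum(axis=-1)   # illustrative objective
d2 = lambda x: -np.cos(x)               # diagonal of its Hessian

x = rng.standard_normal(dim)
b = np.sqrt(S_b) * rng.standard_normal((trials, dim))
lhs = F(x + b * x).mean()                      # Monte Carlo E[F(x~)|x]
rhs = F(x) + 0.5 * S_b * (d2(x) * x**2).sum()  # right-hand side of (10)
print(lhs, rhs)                                # agree up to O(S_b^2) terms
```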

Similarly, the energy function of the gradient system with additive noise is stated in the following theorem.

Theorem 2

For the gradient system defined in (1) with \({\mathbf x}(t)\) corrupted by additive noise as stated in (3),

$$\begin{aligned} V_{\oplus }({\mathbf x}) = F({\mathbf x}) + \frac{\lambda }{2} \Vert {\mathbf x} \Vert ^2_2 + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_j}. \end{aligned}$$
(17)

Proof: For additive noise, the noisy \(\tilde{\mathbf x}\) in (4) is given by \(\tilde{\mathbf x} = {\mathbf x} + {\mathbf b}\). As before, we consider (8) and let \(\partial F({\mathbf x})/\partial x_i\) be the \(i^{th}\) element of \(\partial F({\mathbf x})/\partial {\mathbf x}\). Expanding about \({\mathbf x}\) to second order in \({\mathbf b}\),

$$ \frac{\partial F(\tilde{\mathbf x})}{\partial x_i} = \frac{\partial F({\mathbf x})}{\partial x_i} + \sum _{j=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_i} b_j + \frac{1}{2} \sum _{k=1}^n \sum _{j=1}^n \frac{\partial ^3 F({\mathbf x})}{\partial x_k \partial x_j \partial x_i} b_k b_j. $$

Therefore,

$$\begin{aligned} E\left[ \left. \frac{\partial F(\tilde{\mathbf x})}{\partial x_i} \right| {\mathbf x} \right] = \frac{\partial F({\mathbf x})}{\partial x_i} + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^3 F({\mathbf x})}{\partial x_j \partial x_j \partial x_i}. \end{aligned}$$
(18)

Using the same technique as in the multiplicative noise case, we get that

$$\begin{aligned} E[F(\tilde{\mathbf x})|{\mathbf x}] = F({\mathbf x}) + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_j} \end{aligned}$$
(19)

and

$$\begin{aligned} \frac{\partial }{\partial x_i} E[F(\tilde{\mathbf x})|{\mathbf x}] = \frac{\partial F({\mathbf x})}{\partial x_i} + \frac{S_b}{2} \sum _{j=1}^n \frac{\partial ^3 F({\mathbf x})}{\partial x_i \partial x_j \partial x_j}. \end{aligned}$$
(20)

Comparing (18) with (20), we get that \(E\left[ \left. \partial F(\tilde{\mathbf x})/\partial x_i \right| {\mathbf x} \right] = \partial E[F(\tilde{\mathbf x})|{\mathbf x}]/\partial x_i\) and thus

$$\begin{aligned} \frac{\partial V_{\oplus }({\mathbf x}(t))}{\partial {\mathbf x}} = \frac{\partial E[F(\tilde{\mathbf x})|{\mathbf x}]}{\partial {\mathbf x}} + \lambda {\mathbf x}. \end{aligned}$$
(21)

By (19), (20) and (21), the energy function as stated in (17) can be obtained and the proof is completed. Q.E.D.

3.2 Chaotic Noise

For the system with chaotic noise injection, all elements of \({\mathbf x}\) suffer the same amount of noise \(\kappa n(t)\) at the \(t^{th}\) step. Figure 1 plots the value \(\sum _{\tau =0}^{T-1} n(t+\tau )/T\) for \(t=1,\cdots , 2000\) and for different values of T. As observed from the figure, it is reasonable to assume that \(\sum _{\tau =0}^{T-1} n(t+ \tau )/T\) is a constant for all t if \(T \gg 1\). Then, we can get the following theorem on the energy function of a gradient system with chaotic noise.

Fig. 1. \(\sum _{\tau =0}^{T-1} n(t+\tau )/T\) against t, for \(t=1, \cdots , 2000\); \(T = 500, 1000, 5000\); \(\alpha = 3.8\).
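The near-constancy of the windowed average assumed above can be reproduced with a short computation; the window sizes below match those in Fig. 1, and the initial value \(n(0)\) is an assumption:

```python
import numpy as np

alpha, N = 3.8, 20_000
n = np.empty(N)
n[0] = 0.3                                 # assumed initial value n(0)
for t in range(N - 1):
    n[t + 1] = alpha * n[t] * (1.0 - n[t])   # logistic map (6)

for T in (500, 1000, 5000):
    means = np.convolve(n, np.ones(T) / T, mode="valid")
    print(T, means.min(), means.max())     # the spread shrinks as T grows
```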

Theorem 3

For the gradient system defined in (5), with \(T \gg 1\) and \(\mu T \rightarrow 0\),

$$\begin{aligned} V_{\odot }({\mathbf x}) = F({\mathbf x}) + \frac{\lambda }{2} \Vert {\mathbf x} \Vert ^2_2 + \kappa '\sum _{i=1}^n x_i, \end{aligned}$$
(22)

where \(\kappa '\) is a constant.

Proof: Suppose \(\mu T \rightarrow 0\). Then, for all t, we may assume that

$$ {\mathbf x}(t+\tau ) = {\mathbf x}(t), \;\; \frac{\partial F({\mathbf x}(t+\tau ))}{\partial {\mathbf x}} = \frac{\partial F({\mathbf x}(t))}{\partial {\mathbf x}} $$

for \(\tau = 0, 1, \cdots , T-1\). In such case, we can get from (5) that

$$\begin{aligned} {\mathbf x}(t+T) &= {\mathbf x}(t) - \mu ' \left( \frac{\partial F({\mathbf x}(t))}{\partial {\mathbf x}} + \lambda {\mathbf x}(t) + \frac{\kappa }{T} \sum _{\tau = 0}^{T-1} n(t+ \tau ) {\mathbf e} \right) \\ &= {\mathbf x}(t) - \mu ' \left( \frac{\partial F({\mathbf x}(t))}{\partial {\mathbf x}} + \lambda {\mathbf x}(t) + \kappa \bar{n} {\mathbf e} \right) , \end{aligned}$$
(23)

where \(\mu ' = \mu T\) and \(\bar{n} = \lim _{T\rightarrow \infty } \sum _{t=1}^T n(t) /T\). Clearly, the energy function is given by (22) with \(\kappa ' = \kappa \bar{n}\). Q.E.D.

4 Effect of Noise

For multiplicative noise, let us rewrite \(V_{\otimes } ({\mathbf x})\) as \(V_{\otimes } ({\mathbf x}) = F({\mathbf x}) + \frac{\lambda }{2} \Vert {\mathbf x}\Vert ^2_2 + S_b \mathcal{R}({\mathbf x})\), where \(\mathcal{R}({\mathbf x})\) corresponds to a regularizer. From (11), it is given by

$$\begin{aligned} \mathcal{R}({\mathbf x}) = \frac{1}{2} \sum _{j=1}^n \frac{\partial ^2 F({\mathbf x})}{\partial x_j \partial x_j} x_j^2 - \int {\mathbf x}\otimes \text{ diag }\left\{ {\mathbf H}({\mathbf x})\right\} \cdot d{\mathbf x}. \end{aligned}$$
(24)

The effect of the first term is to pull \({\mathbf x}\) towards the zero vector, while the second term pushes it away. Therefore, multiplicative noise in a gradient system leads to two opposing effects. It should also be noted that if \(F({\mathbf x})\) is quadratic, then \({\mathbf H}({\mathbf x})\) is a constant matrix (say \(\bar{\mathbf H}\)) and \(\mathcal{R}({\mathbf x}) = 0\); in that case, multiplicative noise has no effect on the gradient system.
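As a worked instance of the quadratic case, take \(F({\mathbf x}) = \frac{1}{2} {\mathbf x}^T \bar{\mathbf H} {\mathbf x}\), so that \(\text{ diag }\left\{ {\mathbf H}({\mathbf x})\right\} = (\bar{H}_{11}, \cdots , \bar{H}_{nn})^T\). The line integral in (24) then evaluates componentwise and the two terms cancel:

$$ \mathcal{R}({\mathbf x}) = \frac{1}{2} \sum _{j=1}^n \bar{H}_{jj} x_j^2 - \sum _{j=1}^n \int _0^{x_j} \bar{H}_{jj} u \, du = \frac{1}{2} \sum _{j=1}^n \bar{H}_{jj} x_j^2 - \frac{1}{2} \sum _{j=1}^n \bar{H}_{jj} x_j^2 = 0. $$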

For additive noise, the additional term \((S_b/2) \sum _{j=1}^n \partial ^2 F({\mathbf x})/\partial x_j \partial x_j\) has the effect of bringing the solution closer to the zero vector. This term reduces to a constant if \(F({\mathbf x})\) is quadratic; the difference between \(V_{\oplus }({\mathbf x})\) and \(V({\mathbf x})\) is then just a constant. So, additive noise has no effect on a gradient system if \(F({\mathbf x})\) is quadratic.

For chaotic noise, from (22), the additional term is \(\kappa ' \sum _{i=1}^n x_i\). Its effect is to let \({\mathbf x}\) slide along the direction \(-[1 \; 1\;\cdots \; 1]^T\). If all the \(x_i\) are positive, this term moves them slightly towards the zero vector; if all the \(x_i\) are negative, it moves \({\mathbf x}\) slightly further away from the zero vector. The effect of the noise on the gradient system thus depends on the location of the minimum point of \(V({\mathbf x})\), and the noise has no effect if \(\lim _{T\rightarrow \infty } \sum _{t=1}^T n(t)/T = 0\).

5 Conclusion

In this paper, we have introduced the models of gradient systems with three different types of noise, namely multiplicative, additive and chaotic noise. The energy functions of the corresponding gradient systems with noise have been revealed. By investigating the additional term in each energy function as compared with the original energy function, it is found that only additive noise has a clear effect on the gradient system: it pushes the state vector slightly towards the zero vector. With multiplicative noise, two opposing effects exist, one moving the state vector towards the zero vector and one moving it away. With chaotic noise, the effect depends on the location of the minimum point of \(V({\mathbf x})\); the state vector could be pushed either towards or away from the zero vector. Moreover, if \(F({\mathbf x})\) is quadratic, neither multiplicative nor additive noise has any effect on the gradient system.

Treating (i) the state vector as the weight vector of a neural network, (ii) \(F(\cdot )\) as the mean squared error and (iii) moving towards the zero vector as improving generalization, our results imply that (a) injecting AWN during MLP learning can improve generalization, (b) injecting MWN or CN during MLP learning might not, and (c) injecting AWN or MWN during RBF learning cannot improve generalization. Results (a) and (b) apply equally to other nonlinear neural networks. Treating \({\mathbf x}\) as the state variable in the stochastic Wang's kWTA model [18] or as the neuronal outputs of a Hopfield network, the effect of noise on these models can be analyzed by the same technique. Due to the page limit, those results will be presented elsewhere.