
1 Introduction

In recent years, great progress has been made in understanding the mechanisms of training neural networks when the width of the network is large. The first step was given by Neal [7], where it was shown that for an NTK-parametrized NN, the output before training converges to a Gaussian process on the space of inputs as the width increases. This means that, even in the case of a neural network with nonlinear transformations, Bayesian regression with this Gaussian process as its prior distribution becomes tractable when we take the width to infinity (Williams [12] and Goldberg et al. [2]). This idea has been extended to deep neural networks by Lee et al. [5].

Bayesian regression and training by gradient methods were linked by Jacot et al. [4]. They found that the gradient method in an NTK-parametrized NN with large width is equivalent to kernel learning with the neural tangent kernel (NTK), and established a connection between this kernel and the maximum-a-posteriori estimator in Bayesian inference. They and Lee et al. [6] also showed that, as the NTK-parametrized NN becomes wider, the model becomes linearized along gradient descent or gradient flow during training, and the parameters hardly change. This “lazy” regime appears, as shown in Chizat et al. [1], not only in over-parametrized neural networks, but also in more abstract settings, depending on the choice of scaling and initialization.

Due to the universal nature discovered in [4] and [6], it has not been possible to distinguish whether parameters are pre- or post-neuronal by focusing only on the behavior of the individual parameters. In this paper, we show that, during learning, the cumulative sums of the parameters over all neurons behave differently according to the type of parameter. This implies that it is possible to distinguish whether the parameters are pre- or post-neuronal. We also show that, as the width of the network tends to infinity, the “energy” of the cumulative sum is conserved (Theorem 2).

2 Related Works

Integral Representation of Mean-Field Parametrized NN. A mean-field parametrized NN has the form of a Riemann sum, and thus admits an integral representation when the width tends to infinity. In Sonoda-Murata [8] and Murata [11], the relationship between the distribution of the parameters and the output is described via the ridgelet transform and its reconstruction theorem. On the other hand, in the case of our NTK-parametrized NN, the output before training is given by a stochastic integral when the network is infinitely wide. It would be of independent interest to investigate a reconstruction theorem in this situation.

Dynamics of Infinitely Wide Mean-Field Parametrized NN. Another method for training a mean-field parametrized NN is stochastic gradient descent. It is described by a stochastic differential equation in the parameter space; in particular, it gives a gradient Langevin dynamics. When the width of the network is infinite, the parameter space is infinite-dimensional, and the corresponding dynamics is described by an infinite-dimensional Langevin dynamics in a reproducing kernel Hilbert space, which appears as a collection of features. This infinite-dimensional model contains all models of finite width, and thus allows us to analyze them in a unified manner. The convergence of this learning procedure and the generalization error are discussed in Suzuki [9] and Suzuki-Akiyama [10].

3 Our Contribution

We consider the following NTK-parametrized NN of width m:

$$\begin{aligned} f( x; \theta ) = \frac{1}{\sqrt{m}} \sum _{j=1}^{m} b_{j} \sigma ( a_{j} x + a_{0,j} ) . \end{aligned}$$

Here, the input \( x \in \mathbb {R} \) is one-dimensional and the activation function \(\sigma : \mathbb {R} \rightarrow \mathbb {R}\) is assumed to be non-negative and Lipschitz continuous. We denote the coordinates of the parameter \( \theta = ( \boldsymbol{a}_{0}, \boldsymbol{a}, \boldsymbol{b} ) \) as follows.

  • Pre-neuronal thresholds: \(\boldsymbol{a}_{0} = ( a_{0,1}, a_{0,2}, \ldots , a_{0,m} ) \in \mathbb {R}^{m}\),

  • Pre-neuronal weights: \(\boldsymbol{a} = ( a_{1}, a_{2}, \ldots , a_{m} ) \in \mathbb {R}^{m} \),

  • Post-neuronal weights: \(\boldsymbol{b} = ( b_{1}, b_{2}, \ldots , b_{m} ) \in \mathbb {R}^{m}\).
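
Throughout, \(\sigma\) could be, for example, the softplus function, which is non-negative and 1-Lipschitz. The following minimal sketch (ours, not the authors' code) implements the forward pass under this example choice of activation:

```python
# Minimal sketch (not from the paper) of the NTK-parametrized forward pass
# f(x; theta) = (1/sqrt(m)) * sum_j b_j * sigma(a_j * x + a_{0,j}),
# using softplus as one admissible non-negative Lipschitz activation.
import numpy as np

def softplus(z):
    # numerically stable softplus; non-negative and 1-Lipschitz
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def forward(x, a0, a, b):
    """Output of the width-m network for a scalar input x."""
    m = len(b)
    return (softplus(a * x + a0) @ b) / np.sqrt(m)

# initialization theta(0) ~ N(0, I_{3m})
rng = np.random.default_rng(0)
m = 1000
a0, a, b = (rng.standard_normal(m) for _ in range(3))
print(forward(0.5, a0, a, b))
```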

Given training data \( \{ ( x_{i}, y_{i} ) \}_{i=1}^{n} \), we put \( \hat{y}_{i} ( \theta ) := f( x_{i}; \theta ) \) and define the loss function by

$$\begin{aligned} L( \theta ) := \frac{1}{n} \sum _{i=1}^{n} ( \hat{y}_{i} ( \theta ) - y_{i} )^{2}. \end{aligned}$$

The solution to the associated gradient flow equation \( \frac{\mathrm {d}}{\mathrm {d}t} \theta (t) = - \frac{1}{2} ( \nabla _{\theta } L ) ( \theta (t) ) \) is denoted by \( \theta (t) = ( \boldsymbol{a}_{0}(t), \boldsymbol{a}(t), \boldsymbol{b}(t) ) = ( \{ a_{0,j}(t) \}_{j=1}^{m}, \{ a_{j}(t) \}_{j=1}^{m}, \{ b_{j}(t) \}_{j=1}^{m} ) \), where the initialization is set as \( \theta (0) = ( \boldsymbol{a}_{0} (0), \boldsymbol{a} (0), \boldsymbol{b} (0) ) \sim \mathrm {N} ( \boldsymbol{0}, I_{3m} ). \) Here, \(I_{3m}\) is the identity matrix of order 3m.
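
For later reference, a direct computation from these definitions (assuming \(\sigma\) is differentiable) writes the gradient flow coordinate-wise as

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} a_{0,j}(t) &= - \frac{1}{n \sqrt{m}} \sum _{i=1}^{n} \big ( \hat{y}_{i} ( \theta (t) ) - y_{i} \big ) \, b_{j}(t) \, \sigma ^{\prime } \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ), \\ \frac{\mathrm {d}}{\mathrm {d}t} a_{j}(t) &= - \frac{1}{n \sqrt{m}} \sum _{i=1}^{n} \big ( \hat{y}_{i} ( \theta (t) ) - y_{i} \big ) \, b_{j}(t) \, \sigma ^{\prime } \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ) \, x_{i}, \\ \frac{\mathrm {d}}{\mathrm {d}t} b_{j}(t) &= - \frac{1}{n \sqrt{m}} \sum _{i=1}^{n} \big ( \hat{y}_{i} ( \theta (t) ) - y_{i} \big ) \, \sigma \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ). \end{aligned}$$

Summing these over j and dividing by \(\sqrt{m}\) yields the time derivatives of the cumulative-sum processes considered below.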

It is known that, when the width m of the network is sufficiently large and training is performed, the optimal parameters are obtained as values close to the initial ones (Jacot et al. [4]). In this paper, we further investigate the behavior of the parameters. Specifically, we consider cumulative sums of the parameters over all neurons at each epoch, normalized by a scale depending on the width m. We focus on what arises when we track these normalized cumulative sums along the gradient flow, even though the values of the individual parameters hardly vary. It suffices to consider only the two cumulative sums \(\sum _{j=1}^{m} a_{j}(t)\) and \(\sum _{j=1}^{m} b_{j}(t)\) associated with the pre- and post-neuronal weights respectively, since the thresholds play the same role as the pre-neuronal weights once we regard \(\{ (x_{i}, 1) \}_{i=1}^{n}\) as two-dimensional inputs.

To compare their behavior among different widths during training, we have to determine which scale is appropriate for normalizing the cumulative sums of the parameters. The initialization gives us a hint. At the initialization, the variances of the cumulative sums are given by \( \sum _{j=1}^{m} \mathrm {Var} ( a_{j} (0) ) = \sum _{j=1}^{m} \mathrm {Var} ( b_{j} (0) ) = m \). Thus it is natural to normalize \(\sum _{j=1}^{m} a_{j}(t)\) and \(\sum _{j=1}^{m} b_{j}(t)\) by the scale \(\sqrt{m}\). Moreover, we embed them into the space of continuous functions on the interval [0, 1] as follows. On the m-equidistant partition \(\{ s_{k} := \frac{k}{m} \}_{k=0}^{m}\) of the interval, we set \( A_{s_{k}}^{(m)} (t) := \frac{1}{\sqrt{m}} \sum _{j=1}^{k} a_{j} (t) \) and \( B_{s_{k}}^{(m)} (t) := \frac{1}{\sqrt{m}} \sum _{j=1}^{k} b_{j} (t) \), and then we extend them to the subintervals \([ s_{k-1}, s_{k} ]\) by linear interpolation:

$$\begin{aligned} \begin{aligned} \begin{array}{l} \displaystyle A_{s}^{(m)} (t) := \frac{ A_{s_{k}}^{(m)} (t) - A_{s_{k-1}}^{(m)} (t) }{ s_{k} - s_{k-1} } ( s - s_{k-1} ) + A_{s_{k-1}}^{(m)} (t), \\ \displaystyle B_{s}^{(m)} (t) := \frac{ B_{s_{k}}^{(m)} (t) - B_{s_{k-1}}^{(m)} (t) }{ s_{k} - s_{k-1} } ( s - s_{k-1} ) + B_{s_{k-1}}^{(m)} (t) \end{array} \quad \text {if }s_{k-1} \le s \le s_{k}. \end{aligned} \end{aligned}$$

For each width m and time t of the gradient flow, these embedded functions \( A^{(m)} (t) = \{ A_{s}^{(m)} (t) \}_{0 \le s \le 1} \) and \( B^{(m)} (t) = \{ B_{s}^{(m)} (t) \}_{0 \le s \le 1} \) are random continuous functions on [0, 1], that is, stochastic processes.
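
A short sketch (ours, not from the paper) of this embedding: the normalized partial sums are placed on the grid \(s_k = k/m\) and linearly interpolated, here via np.interp.

```python
# Sketch (not from the paper) of the embedding of normalized cumulative sums
# into C([0,1]): piecewise-linear interpolation of (1/sqrt(m)) * partial sums.
import numpy as np

def embed_cumsum(params):
    """Return s -> (1/sqrt(m)) * sum_{j <= s*m} params[j], linearly interpolated."""
    m = len(params)
    grid = np.arange(m + 1) / m                          # s_k = k/m, k = 0, ..., m
    values = np.concatenate(([0.0], np.cumsum(params))) / np.sqrt(m)
    return lambda s: np.interp(s, grid, values)          # A^{(m)}_s or B^{(m)}_s

rng = np.random.default_rng(0)
m = 10_000
A0 = embed_cumsum(rng.standard_normal(m))                # A^{(m)}(0)
B0 = embed_cumsum(rng.standard_normal(m))                # B^{(m)}(0)
print(A0(0.5), B0(1.0))
```

At the initialization these embedded paths resemble independent Brownian motions, in line with the Donsker-type convergence described next.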

With this embedding, it is necessary that they do not diverge as \(m \rightarrow \infty \) in order to compare them appropriately among various widths. At the initialization, by Donsker's invariance principle, well known in probability theory, the stochastic processes \( \{ ( A^{(m)}(0), B^{(m)}(0) ) \}_{m=1}^{\infty } \) converge to a two-dimensional Brownian motion. In general, for any time t of the gradient flow, the following holds.

Theorem 1

The family \(\{ ( A^{(m)}(t), B^{(m)}(t) )\}_{m=1}^{\infty }\) is tight.

Fig. 1. Outputs after training

This implies that a certain subsequence \(\{ ( A^{(m_{k})}(t), B^{(m_{k})}(t) ) \}_{k=1}^{\infty }\) converges almost surely (after replacing the probability space appropriately if necessary). In what follows, we denote this subsequence again by \(\{ (A^{(m)}(t), B^{(m)}(t) ) \}\) for simplicity of notation. The limit (A(t), B(t)) of the subsequence gives a dynamics on the infinite-dimensional Banach space \(C([0,1] \rightarrow \mathbb {R}^{2})\), and describing this dynamics is of independent interest. In terms of \(B(t) = \{ B_{s}(t) \}_{0 \le s \le 1}\), we have

$$\begin{aligned} f(x_{i}; \theta ) = \frac{1}{\sqrt{m}} \sum _{j=1}^{m} \sigma ( a_{j} x_{i} + a_{0,j} ) b_{j} \rightarrow \int _{0}^{1} \sigma ( a_{s} x_{i} + a_{0,s} ) \mathrm {d}B_{s}(0) =: \hat{y}_{i}^{(\infty )} \end{aligned}$$

in probability as \(m \rightarrow \infty \); this limit is called a stochastic integral. In the above, \(\{ a_{s} \}_{0 \le s \le 1}\) and \(\{ a_{0,s} \}_{0 \le s \le 1}\) are mutually independent Gaussian processes on [0, 1] with zero mean and covariance function \( \mathbf {E} [ a_{s} a_{u} ] = \mathbf {E} [ a_{0,s} a_{0,u} ] = \mathbf {1}_{\{ 0 \}} (u-s) \). Here, \(\mathbf {1}_{\{ 0 \}} \) is the indicator function of the singleton \(\{0\}\). They are also independent of B(0). Although one may naturally expect that the dynamics of \(\{ ( A(t), B(t) ) \}_{t \ge 0}\) is described by the neural tangent kernel, \(C([0,1] \rightarrow \mathbb {R}^{2})\) is a Banach space without a Hilbert structure, so it is difficult to employ the notions of gradient and kernel, which depend on an inner product structure.
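
As a simple sanity check (ours, not from the paper), at every finite width the initial output \(f(x_{i}; \theta(0))\) already has mean zero and variance \(\mathbf {E} [ \sigma ( a_{1}(0) x_{i} + a_{0,1}(0) )^{2} ]\), which is what the Itô isometry suggests for the second moment of the limiting stochastic integral. The sketch below compares the two numerically under a softplus activation:

```python
# Sanity check (ours): variance of the initial output vs. E[sigma(a*x + a0)^2].
import numpy as np

def softplus(z):
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

rng = np.random.default_rng(1)
x, m, trials = 0.7, 1000, 2000

a0 = rng.standard_normal((trials, m))
a = rng.standard_normal((trials, m))
b = rng.standard_normal((trials, m))
f0 = np.einsum("tj,tj->t", softplus(a * x + a0), b) / np.sqrt(m)
print("empirical mean / variance of f(x; theta(0)):", f0.mean(), f0.var())

# Monte Carlo estimate of E[sigma(a*x + a0)^2] with a, a0 ~ N(0, 1) independent
z = x * rng.standard_normal(500_000) + rng.standard_normal(500_000)
print("E[sigma(a x + a0)^2] ~", (softplus(z) ** 2).mean())
```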

Now, among NTK-parametrized NNs of various widths, we can compare the dynamics of the cumulative sums at an “appropriate scale”. Figure 1 shows the outputs of neural networks of widths \(m = 100, 1000, 10000\) after training. The training data are indicated by points, and we have used gradient descent. Figures 2, 3, 4 and 5 below show the changes of the parameters and of their cumulative sums during training. Each line in Figs. 2 and 4 represents how the corresponding parameter varies during training.
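
For concreteness, the following is a minimal training sketch (ours, with toy data and a softplus activation, not the exact setup behind the figures): plain gradient descent, viewed as an Euler discretization of the gradient flow \( \frac{\mathrm {d}}{\mathrm {d}t} \theta (t) = - \frac{1}{2} ( \nabla _{\theta } L ) ( \theta (t) ) \).

```python
# Training sketch (ours): gradient descent on L(theta) with toy data.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def sigmoid(z):                                   # derivative of softplus
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, lr, epochs = 1000, 0.1, 500
x = np.linspace(-1.0, 1.0, 20)                    # toy inputs (ours)
y = np.sin(np.pi * x)                             # toy targets (ours)
n = len(x)

a0, a, b = (rng.standard_normal(m) for _ in range(3))     # theta(0) ~ N(0, I_{3m})
for _ in range(epochs):
    pre = np.outer(x, a) + a0                     # (n, m): a_j x_i + a_{0,j}
    yhat = softplus(pre) @ b / np.sqrt(m)
    r = (yhat - y) / n
    # one Euler step of d theta/dt = -(1/2) grad L; the 1/2 cancels the 2 from the square
    grad_b = softplus(pre).T @ r / np.sqrt(m)
    grad_a = (sigmoid(pre) * b).T @ (r * x) / np.sqrt(m)
    grad_a0 = (sigmoid(pre) * b).T @ r / np.sqrt(m)
    a0, a, b = a0 - lr * grad_a0, a - lr * grad_a, b - lr * grad_b

print("training loss:", np.mean((softplus(np.outer(x, a) + a0) @ b / np.sqrt(m) - y) ** 2))
```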

Fig. 2. Changes of parameters \(a_j\) during the training

Fig. 3. Cumulative sums of parameters \(a_j\) before/after the training

Fig. 4. Changes of parameters \(b_j\) during the training

Fig. 5. Cumulative sums of parameters \(b_j\) before/after the training

The figures show that, as the width increases, the variation of the cumulative sum of the parameters a becomes smaller, while the cumulative sum of the parameters b is clearly varied.

In fact, when \(t=0\) and \(m \rightarrow \infty \), by the law of large numbers, we have

$$\begin{aligned} \begin{aligned} \left. \frac{\mathrm {d}}{\mathrm {d}t}\right| _{t=0} A_{s_{m}}^{(m)} (t)&= - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i} ( \theta (0) ) - y_{i} \big ) \frac{1}{m} \sum _{j=1}^{m} \sigma ^{\prime } \big ( a_{j} (0) x_{i} + a_{0,j} (0) \big ) x_{i} b_{j}(0) \\&\rightarrow - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}^{( \infty )} - y_{i} \big ) \mathbf {E} \big [ \sigma ^{\prime } \big ( a_{1} (0) x_{i} + a_{0,1} (0) \big ) x_{i} \big ] \mathbf {E} [ b_{1}(0) ] = 0. \end{aligned} \end{aligned}$$

On the other hand, since the activation function \(\sigma \) is non-negative and not identically zero,

$$\begin{aligned} \begin{aligned} \left. \frac{\mathrm {d}}{\mathrm {d}t}\right| _{t=0} B_{s_{m}}^{(m)} (t)&= - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i} ( \theta (0) ) - y_{i} \big ) \frac{1}{m} \sum _{j=1}^{m} \sigma \big ( a_{j} (0) x_{i} + a_{0,j} (0) \big ) \\&\rightarrow - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}^{( \infty )} - y_{i} \big ) \mathbf {E} \big [ \sigma \big ( a_{1} (0) x_{i} + a_{0,1} (0) \big ) \big ] \ne 0. \end{aligned} \end{aligned}$$
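
The following sketch (ours, with the same toy data and softplus activation as in the training sketch above) illustrates these two limits numerically: the first time derivative concentrates near zero as m grows, while the second stays bounded away from zero.

```python
# Illustration (ours) of d/dt A^{(m)}_{s_m}(0) -> 0 while d/dt B^{(m)}_{s_m}(0) does not.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 20)                    # toy data (ours)
y = np.sin(np.pi * x)
n = len(x)

for m in (100, 1000, 10_000):
    a0, a, b = (rng.standard_normal(m) for _ in range(3))
    pre = np.outer(x, a) + a0
    r = (softplus(pre) @ b / np.sqrt(m) - y) / n
    dA = -np.sum((sigmoid(pre) * b).T @ (r * x)) / m      # d/dt A^{(m)}_{s_m} at t = 0
    dB = -np.sum(softplus(pre).T @ r) / m                 # d/dt B^{(m)}_{s_m} at t = 0
    print(m, dA, dB)
```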

As computed above and observed numerically, the cumulative sum of the parameters b varies along the gradient flow. It can be shown, however, that the following “energy” is conserved along the gradient flow.

Theorem 2

We have \(\displaystyle \lim _{m \rightarrow \infty } \frac{1}{m} \sum _{j=1}^m \big ( b_j (t)- \mathbf {E} [b_j(t)] \big )^2 = 1 \) for all \(t \ge 0\).

Here, \(\mathbf {E}\) denotes the expectation operator. The same holds for \(a_{0,j}(t)\) and \(a_{j}(t)\).

Figures 6 and 7 below confirm Theorem 2 for the learning shown in Fig. 1. The expectations have been estimated by Monte Carlo methods.
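
A sketch (ours) of how such a Monte Carlo check can be organized; here `train` is a hypothetical helper that runs the gradient descent sketched earlier from a seed-dependent \(\mathrm {N} ( \boldsymbol{0}, I_{3m} )\) initialization on fixed training data and returns the parameters after the given number of epochs.

```python
# Monte Carlo check of Theorem 2 (sketch, ours). `train` is a hypothetical helper
# returning a dict with the parameter vector "b" after `epochs` steps of training.
import numpy as np

def energy_b(train, m, epochs, n_runs=100):
    # independent runs differ only in the initialization seed
    runs = np.stack([train(m=m, epochs=epochs, seed=s)["b"] for s in range(n_runs)])
    mean_b = runs.mean(axis=0)                            # Monte Carlo estimate of E[b_j(t)]
    energies = np.mean((runs - mean_b) ** 2, axis=1)      # (1/m) sum_j (b_j(t) - E[b_j(t)])^2
    return energies.mean()                                # Theorem 2: close to 1 for large m
```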

Fig. 6. Graph of \(\displaystyle \frac{1}{m} \sum _{j=1}^{m} ( a_{j} (t) - \mathbf {E} [ a_{j} (t) ] )^{2}\)

Fig. 7. Graph of \(\displaystyle \frac{1}{m} \sum _{j=1}^{m} ( b_{j} (t) - \mathbf {E} [ b_{j} (t) ] )^{2}\)

4 Conclusion

In this paper, we showed that in a wide three-layer neural network, the cumulative sum of the pre-neuronal parameters hardly varies along the gradient flow, while that of the post-neuronal parameters does vary. This reveals a critical difference between the behaviors of the pre- and post-neuronal parameters and gives a first attempt at distinguishing them, which has not been possible so far. Furthermore, we showed that the energy is conserved along the gradient flow.