
1 Introduction

Neural network training is among the most important, difficult, and expensive problems in machine learning and deep learning. Many challenges remain, such as ill-conditioning, local minima, cliffs, exploding gradients, and numerical stability problems [1]. Therefore, how to guarantee neural network training performance remains an open problem and an important goal that deserves deeper study.

Many algorithms have been proposed to train neural networks. Among them, the gradient descent (GD) algorithm and the stochastic gradient descent (SGD) algorithm are the two basic ones. GD is a traditional method that is often regarded as slow or unreliable [1]. SGD, first proposed in [2], is the most widely used method and allows neural networks to scale to large datasets in machine learning and deep learning [3]. However, training with SGD can sometimes be slow, especially in situations with high curvature, small but consistent gradients, or noisy gradients [1, 4].

To overcome the above problems, many variants of SGD have been presented. To keep the convergence rate almost as fast as that of SGD in the presence of gradient noise, noise reduction methods were developed one after another: dynamic sample size algorithms [5], which gradually increase the mini-batch size; gradient aggregation algorithms [6, 7], which improve the quality of the search directions; and iterate averaging algorithms [8], which employ a more aggressive stepsize sequence. These methods have proved effective at reducing noise in practice, but their convergence rates are all linear or sublinear. To improve the convergence rate, second-order methods [9] such as the Hessian-free Newton algorithm and the Gauss-Newton algorithm were proposed. Second-order methods have superlinear or quadratic convergence rates. However, they are only locally convergent, which means that their initial iterates must be sufficiently close to the optimal solution. Moreover, no treatment of the non-convex case for these methods has been totally satisfactory or universally accepted [4].

SGD with momentum [10, 11] is a more stable and faster learning method that can escape local minima; however, its convergence is only established empirically. Recently, [12,13,14,15] gave theoretical analyses and improved methods with faster convergence rates. A quadratic convergence rate was obtained in [14, 15] under a weaker growth condition for both convex and strongly convex problems. Unfortunately, the convergence rate is only linear when the convexity condition is not satisfied. Thus, for SGD and its variants, superlinear convergence rates in general settings, especially for non-convex problems, have not been established in a universally accepted way.

Different from the above methods, adaptive gradient methods such as AdaGrad [16] and Adam [17] automatically adapt their learning rates throughout the course of training; they are fairly robust and outperform SGD in practice [18]. However, their convergence rates may be even worse than that of SGD [19]. Overall, how to improve the convergence rate and training performance under more general conditions, covering not only convex but also non-convex problems, is still an open question that deserves further study.

In this paper, the finite-time stability problem in the neural network training process is studied. A new learning rate is designed by using finite-time control theory, and it is proved that the training errors converge to zero, or to a sufficiently small set, in finite time. Furthermore, the new learning rate is a function of the gradient. The resulting algorithm not only has a faster convergence rate and better stability but also retains the strengths of SGD and its variants.

The paper is organized as follows: In Sect. 2, the problem and preliminary results are introduced. In Sect. 3, the finite-time stable weight update law is given. Training results on the CIFAR-10 dataset, compared with other algorithms, are given in Sect. 4. Conclusions are drawn in Sect. 5.

2 Problem Formulation

To simplify the proof and the presentation of our results, a depth-L feedforward fully-connected neural network with an \(l_{2}\)-regression task is considered, shown as follows [3]:

$$\begin{aligned} \left\{ \begin{array}{ll} h_{i,0}=\phi (W_{\text {in}}x_{i})\\ h_{i,l}=\phi (W_{l}h_{i,l-1})\\ y_{i}=W_{\text {out}}h_{i,L} \end{array}\right. \end{aligned}$$
(1)

where \(l=1,\cdots ,L\), \(i=1,\cdots ,n\), \(x_{i}\in R^{m_{\text {in}}}\) and \(y_{i}\in R^{m_{\text {ob}}}\) are the \(i\text {th}\) training input and output, n and m are the numbers of training samples and neurons, and \(W_{\text {in}}\in R^{m\times m_{\text {in}}}\), \(W_{\text {out}}\in R^{m_{\text {ob}}\times m}\) and \(W_{l}\in R^{m\times m}\) are the weight matrices of the input layer, output layer and \(l\text {th}\) hidden layer, respectively. The function \(\phi (\cdot )\) is the activation function. Let:

$$\begin{aligned} W\triangleq [\text {vec}(W_{\text {in}})^{T}, \text {vec}(W_{1})^{T},\cdots ,\text {vec}(W_{l})^{T}, \text {vec}(W_{\text {out}}^{T})^{T}]^{T} \end{aligned}$$

where \(\text {vec}(\bullet )\) is defined for a matrix \(A=[a_{1},\cdots ,a_{n}]\) by:

$$\begin{aligned} \text {vec}(A)\triangleq \begin{bmatrix}a_{1}\\ \vdots \\ a_{n}\end{bmatrix} \end{aligned}$$

then FNN (1) can be rewritten as follows:

$$\begin{aligned} y_{i}=F(W,x_{i}) \end{aligned}$$
(2)

where \(F(W,x_{i})\) is a nonlinear function. Considering an over-parameterized neural network, for the \(i\text {th}\) training input there is a corresponding sampling data \(y_{i}^{*}\) and an actual optimal weight matrix \(W^{*}\) such that:

$$\begin{aligned} y_{i}^{*}=F(W^{*},x_{i}) \end{aligned}$$
(3)

then the training error is defined as follows:

$$\begin{aligned} e_{i}\triangleq y_{i}^{*}-y_{i} \end{aligned}$$
(4)

and the \(l_{2}\)-regression loss function E can be defined as:

$$\begin{aligned} E(W)\triangleq \frac{1}{2}\sum \limits _{i=1}^{n}\Vert e_{i}\Vert _{2}^{2}. \end{aligned}$$
(5)

The \(l_{2}\)-regression loss problem is to find a weight W such that for any given \(\epsilon >0\):

$$\begin{aligned} E(W)\le \epsilon . \end{aligned}$$
(6)
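For concreteness, the following minimal NumPy sketch (an illustration, not the authors' implementation; the activation is assumed to be ReLU, which the model above leaves unspecified) evaluates the forward pass of model (1), the error of Eq. (4), and the loss of Eq. (5):

```python
import numpy as np

def phi(z):
    """Assumed activation: ReLU (the model leaves phi unspecified)."""
    return np.maximum(z, 0.0)

def forward(W_in, W_hidden, W_out, x):
    """Forward pass of the depth-L network of Eq. (1) for one input x."""
    h = phi(W_in @ x)                 # h_{i,0} = phi(W_in x_i)
    for W_l in W_hidden:              # h_{i,l} = phi(W_l h_{i,l-1}), l=1..L
        h = phi(W_l @ h)
    return W_out @ h                  # y_i = W_out h_{i,L}

def l2_loss(W_in, W_hidden, W_out, xs, ys_star):
    """l2-regression loss E(W) of Eq. (5) over n training pairs."""
    E = 0.0
    for x_i, y_star_i in zip(xs, ys_star):
        e_i = y_star_i - forward(W_in, W_hidden, W_out, x_i)  # Eq. (4)
        E += 0.5 * float(e_i @ e_i)
    return E
```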

For Eq. (2), the following ordinary differential equation (ODE) is established for \(y_{i}\):

$$\begin{aligned} \left\{ \begin{array}{ll} \dot{y}_{i}=J_{i}(t)u\\ u=\dot{W} \end{array}\right. \end{aligned}$$
(7)

where \(J_{i}(t)\) is the Jacobian matrix calculated as follows:

$$\begin{aligned} J_{i}(t)=\frac{\partial F(W,x_{i})}{\partial W}=\left[ \frac{\partial y_{i}}{\partial W_{q}}\right] . \end{aligned}$$
(8)
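In a deep learning framework this Jacobian would be obtained by automatic differentiation; the following sketch instead approximates \(J_{i}\) by finite differences over the flat weight vector W, purely to make Eq. (8) concrete (the function name and step size eps are illustrative assumptions):

```python
import numpy as np

def jacobian_fd(f, W, eps=1e-6):
    """Finite-difference approximation of J_i = dF(W, x_i)/dW, Eq. (8).

    f maps the flat weight vector W (the vec-concatenation of all weight
    matrices) to the network output y_i; eps is an assumed step size.
    In practice automatic differentiation would be used instead.
    """
    y0 = np.asarray(f(W))
    J = np.zeros((y0.size, W.size))
    for q in range(W.size):
        W_pert = W.copy()
        W_pert[q] += eps
        J[:, q] = (np.asarray(f(W_pert)) - y0) / eps  # column: dy_i/dW_q
    return J
```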

For the training error \(e_{i}\), define:

$$\begin{aligned} e\triangleq \begin{bmatrix}e_{1}\\ \vdots \\ e_{n}\end{bmatrix}, y\triangleq \begin{bmatrix}y_{1}\\ \vdots \\ y_{n}\end{bmatrix}, y^{*}\triangleq \begin{bmatrix}y_{1}^{*}\\ \vdots \\ y_{n}^{*}\end{bmatrix}, J(t)\triangleq \begin{bmatrix}J_{1}(t)\\ \vdots \\ J_{n}(t)\end{bmatrix} \end{aligned}$$
(9)

then from ODE (7), we get \(e=y^{*}-y\) and:

$$\begin{aligned} \left\{ \begin{array}{ll} \dot{e}=-J(t)u\\ u=\dot{W}. \end{array}\right. \end{aligned}$$
(10)

In this paper, our goal is to give a new weight update law that makes the FNN converge to a small enough set with better stability. To achieve this goal, the dynamics of error system (10) with finite-time stable performance are considered.

3 Design of the Weight Update Law with Finite-Time Stability Performance

3.1 Weight Varying Rate Design

To ensure the closed-loop system of error system (10) is asymptotically stable, the following controller form is considered:

$$\begin{aligned} u&=Q^{\frac{1}{2}}(e)R^{-\frac{1}{2}}u_{I} \end{aligned}$$
(11)

where \(R>0\), \(u_{I}\) is a unit vector under the \(L_{2}\) norm, and \(Q(e)>0\) is a one-dimensional positive real function for \(e\ne 0\).

For error system (10) and the controller u given by (11), if the Lyapunov function is taken as follows:

$$\begin{aligned} V(t)=\frac{1}{2}e^{T}e \end{aligned}$$
(12)

then we can get:

$$\begin{aligned} \dot{V}(t)&=e^{T}\dot{e}\\&=-e^{T}J(t)u\\&=-Q^{\frac{1}{2}}(e)e^{T}J(t)R^{-\frac{1}{2}}u_{I}. \end{aligned}$$

Then Q(e) and \(u_{I}\) can be designed further for error system (10) so that the closed-loop system has the desired performance.

Furthermore, if \(u_{I}\) is taken as follows:

$$\begin{aligned} u_{I}=(e^{T}J(t)R^{-1}J^{T}(t)e)^{-\frac{1}{2}}R^{-\frac{1}{2}}J^{T}(t)e \end{aligned}$$
(13)

then we have \(u_{I}^{T}u_{I}=1\),

$$\begin{aligned} \dot{V}(t)=-Q^{\frac{1}{2}}(e)(e^{T}J(t)R^{-1}J^{T}(t)e)^{\frac{1}{2}} \end{aligned}$$
(14)
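As a numerical sanity check of Eqs. (13) and (14), the following sketch (toy dimensions; R is taken as a scaled identity so that \(R^{-\frac{1}{2}}\) is trivial to form) verifies that \(u_{I}^{T}u_{I}=1\) and that \(\dot{V}(t)\) computed from system (10) matches Eq. (14); all numerical values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_w = 4, 10                      # toy dimensions (assumed)
J = rng.standard_normal((n_out, n_w))   # stacked Jacobian J(t), Eq. (9)
e = rng.standard_normal(n_out)          # stacked error e, Eq. (9)
r = 0.01                                # take R = r*I, so R^{-1/2} = r^{-1/2} I
R_inv = (1.0 / r) * np.eye(n_w)

g = J.T @ e                             # J^T(t) e
quad = e @ J @ R_inv @ g                # e^T J R^{-1} J^T e
u_I = quad ** -0.5 * r ** -0.5 * g      # Eq. (13)
print(np.isclose(u_I @ u_I, 1.0))       # u_I is a unit vector -> True

c1, beta = 1.0, 0.4                     # Q(e) = c1^2 (e^T e)^(2*beta), Eq. (15)
Q_sqrt = c1 * (e @ e) ** beta           # Q^{1/2}(e)
u = Q_sqrt * r ** -0.5 * u_I            # controller (11)
V_dot = -e @ (J @ u)                    # V' = e^T e' = -e^T J u, Eqs. (10), (12)
print(np.isclose(V_dot, -Q_sqrt * np.sqrt(quad)))  # Eq. (14) -> True
```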

and \(\dot{V}(t)<0\), i.e., error system (10) is stable. Moreover, if Q(e) is taken as follows:

$$\begin{aligned} Q(e)=c^{2}_{1}(e^{T}e)^{2\beta } \end{aligned}$$
(15)

where \(c_{1}>0\) and \(\beta \in (0, \frac{1}{2})\), then we have:

Theorem 1

If V(t) is taken as Eq. (12), \(u_{I}\) is taken as Eq. (13), and Q(e) is taken as Eq. (15), then the closed-loop system of system (10) and controller (11) is finite-time stable, which means that there is a finite time T such that \(e(t)\equiv 0\) for all \(t>T\).

Proof

Since V(t) is taken as Eq. (12) and \(u_{I}\) is taken as Eq. (13), Eq. (14) is established for system (10). Furthermore, substituting Q(e) of Eq. (15) into Eq. (14), we have:

$$\begin{aligned} \dot{V}(t)&=-c_{1}(e^{T}e)^{\beta }(e^{T}J(t)R^{-1}J^{T}(t)e)^{\frac{1}{2}}\\&=-c_{1}(e^{T}e)^{\beta +\frac{1}{2}}(e^{T}J(t)R^{-1}J^{T}(t)e)^{\frac{1}{2}}(e^{T}e)^{-\frac{1}{2}}. \end{aligned}$$

Let \(\underline{\lambda }_{\min }\) be the lower bound of the smallest eigenvalue of the matrix \(J(t)R^{-1}J^{T}(t)\); then:

$$\begin{aligned} \underline{\lambda }^{\frac{1}{2}}_{\min } \le (e^{T}J(t)R^{-1}J^{T}(t)e)^{\frac{1}{2}}(e^{T}e)^{-\frac{1}{2}}. \end{aligned}$$

Thus, for \(\dot{V}(t)\), we have:

$$\begin{aligned} \dot{V}(t)\le -\underline{\lambda }^{\frac{1}{2}}_{\min }c_{1}(e^{T}e)^{\beta +\frac{1}{2}}. \end{aligned}$$

Since \(V(t)=\frac{1}{2}e^{T}e\), the above inequality is equivalent to:

$$\begin{aligned} \dot{V}(t)+2^{\beta +\frac{1}{2}}\underline{\lambda }^{\frac{1}{2}}_{\min }c_{1}(V(e))^{\beta +\frac{1}{2}}\le 0. \end{aligned}$$

Thus, defining c and \(\alpha \) as:

$$\begin{aligned} c\triangleq 2^{\beta +\frac{1}{2}}\underline{\lambda }^{\frac{1}{2}}_{\min }c_{1}, \alpha \triangleq \beta +\frac{1}{2} \end{aligned}$$

with \(\beta \in (0, \frac{1}{2})\), we have \(e(t)\equiv 0\) for \(t\ge T(e)\), where the settling time T(e) satisfies:

$$\begin{aligned} T(e)\le \frac{1}{c(1-\alpha )}V^{1-\alpha }(e)\le \frac{1}{c(1-\alpha )}V^{1-\alpha }(e_{0}) . \end{aligned}$$
(16)

This completes the proof of Theorem 1.

According to Eq. (16), if we let:

$$\begin{aligned} T_{0}\triangleq \frac{1}{c(1-\alpha )}V^{1-\alpha }(e_{0}) \end{aligned}$$
(17)

then \(T_{0}\) can be taken as a settling time. Substituting Eq. (12) into Eq. (17), we obtain the following formula for \(T_{0}\):

$$\begin{aligned} T_{0}=(e_{0}^{T}e_{0})^{\frac{1}{2}-\beta }\underline{\lambda }^{-\frac{1}{2}}_{\min }(1-2\beta )^{-1}c_{1}^{-1}. \end{aligned}$$
(18)
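The settling-time bound of Eq. (18) is straightforward to evaluate; a minimal sketch, assuming a known lower bound \(\underline{\lambda }_{\min }\), is:

```python
import numpy as np

def settling_time_bound(e0, lam_min_lb, c1, beta):
    """Settling-time bound T_0 of Eq. (18).

    e0         : initial stacked error vector e(t_0)
    lam_min_lb : assumed lower bound on the smallest eigenvalue of
                 J(t) R^{-1} J^T(t) along the trajectory
    c1, beta   : design parameters with c1 > 0 and 0 < beta < 1/2
    """
    assert c1 > 0 and 0.0 < beta < 0.5
    return ((e0 @ e0) ** (0.5 - beta)
            / (np.sqrt(lam_min_lb) * (1.0 - 2.0 * beta) * c1))
```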

Theorem 2

For the FNN with the error dynamics (10), if u is taken as follows:

$$\begin{aligned} u=c_{1}(e^{T}e)^{\beta }(e^{T}J(t)R^{-1}J^{T}(t)e)^{-\frac{1}{2}}R^{-1}J^{T}(t)e \end{aligned}$$
(19)

where \(c_{1}>0\) and \(\beta \in (0, \frac{1}{2})\), then e(t) is finite-time stable and a settling time upper bound can be taken as \(T_{0}\).

Remark 1

Theorem 2 follows easily from Theorem 1. The closed-loop system of the error dynamics (10) and the controller (19) is finite-time stable, and the error vector e converges to zero, or to a small enough set, in finite time, which can be estimated by Eq. (18).
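A direct transcription of the controller of Eq. (19) might look as follows (a sketch; the default values of \(c_{1}\) and \(\beta \) are illustrative, not taken from the analysis above):

```python
import numpy as np

def finite_time_update_direction(e, J, R_inv, c1=1.0, beta=0.4):
    """Weight varying rate u of Eq. (19); assumes e != 0.

    e, J  : stacked error and Jacobian of Eq. (9)
    R_inv : inverse of the positive definite design matrix R
    c1 > 0 and 0 < beta < 1/2 are design parameters (defaults illustrative)
    """
    g = J.T @ e                          # J^T(t) e
    quad = e @ (J @ (R_inv @ g))         # e^T J R^{-1} J^T e
    return c1 * (e @ e) ** beta * quad ** -0.5 * (R_inv @ g)
```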

3.2 Design of the Weight Update Law

In this section, the numerical stability problem of the weight update law for W is considered. From error system (10) and Theorem 2, we can easily see that W should satisfy the following ODE in the training process:

$$\begin{aligned} \left\{ \begin{array}{ll} \dot{W}=u\\ W(t_{0})=W_{0} \end{array}\right. \end{aligned}$$
(20)

where u satisfies Eq. (19), which is a nonlinear function of t and W, and \(W_{0}\) is a given initial value of W. To update W, the BP algorithm and the AdaGrad algorithm are commonly used, which can be formulated as follows:

$$\begin{aligned} W(t+1)=W(t)+\eta (t)u(t) \end{aligned}$$
(21)

where \(\eta (t)=\eta _{0}\) is a constant in the BP algorithm, and \(\eta (t)=\eta _{0}\left( \sum _{j=0}^{t}\Vert u(j)\Vert ^{2}\right) ^{-\frac{1}{2}}\) in the AdaGrad algorithm.
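A minimal sketch of the discrete update (21) with both learning-rate schedules is given below (eta0 is an illustrative default; note that standard AdaGrad accumulates squared gradients elementwise, whereas this scalar accumulator follows the formula above):

```python
import numpy as np

def train_step(W, u, history, eta0=0.01, mode="bp"):
    """One discrete update W(t+1) = W(t) + eta(t) u(t), Eq. (21).

    mode="bp"      : eta(t) = eta0, a constant step.
    mode="adagrad" : eta(t) = eta0 * (sum_{j<=t} ||u(j)||^2)^(-1/2),
                     the scalar-accumulator form written above.
    `history` accumulates the values ||u(j)||^2.
    """
    history.append(float(u @ u))
    eta = eta0 if mode == "bp" else eta0 / np.sqrt(sum(history))
    return W + eta * u
```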

4 Simulation

Fig. 1. Comparison among the SGD algorithm with momentum, the HJB algorithm integrated with SGD, and our algorithm.

Fig. 2. Comparison among the AdaGrad algorithm, the HJB algorithm integrated with AdaGrad, and our algorithm.

Fig. 3. Comparison among the Adam algorithm, the HJB algorithm integrated with Adam, and our algorithm.

In this section, the classification problem for images in the CIFAR-10 dataset is studied. In the training process, VGG16 is used as the basic neural network, and two dropout layers are added just before the output layer, with dropout probabilities of 0.25 and 0.5, respectively.

To verify the performance of our algorithm, the simulations shown in Figs. 1, 2 and 3 are carried out. In Fig. 1, the hyperparameters of the SGD-M optimizer are \(learning \ rate=0.01\), \(decay=1{e}{-}{8}\), and \(momentum=0.9\); those of the HJB optimizer are \(R =0.000012I\); and those of our algorithm are \(\gamma = 1\), \(R = 0.01I\) and \(\beta = 0.42\). In Fig. 2, the hyperparameters of AdaGrad are \(learning \ rate=0.0003\), \(initial \ accumulator \ value=0.1\) and \(epsilon=1{e}{-}{8}\); those of the HJB optimizer are \(R =0.000012I\); and those of our algorithm are \(\gamma = 1\), \(R = 0.05I\). In Fig. 3, the hyperparameters of Adam are \(learning \ rate=3{e}{-}{6}\), \(\beta _{1}=0.9\), \(\beta _{2}=0.999\), and \(\epsilon =1{e}{-}{3}\); those of the HJB optimizer are \(R =0.000012I\); and those of our algorithm are \(\gamma = 1\), \(R = 3{e}{-}{5}I\) and \(\beta = 0.45\).

Compared to these algorithms, the validation accuracy of our algorithm is better. In Fig. 1, the highest accuracy is achieved at \(Epoch=12\) by our algorithm, while it takes \(Epoch=31\) for SGD-M and \(Epoch=35\) for HJB-SGD. In Fig. 2, the highest accuracy is achieved at \(Epoch=12\) by our algorithm, while it takes \(Epoch=31\) for AdaGrad and \(Epoch=30\) for HJB-AdaGrad. In Fig. 3, the highest accuracy is achieved at \(Epoch=11\) by our algorithm, while it takes \(Epoch=31\) for Adam and \(Epoch=35\) for HJB-Adam. Thus, our algorithm needs fewer epochs to obtain a training result with higher validation accuracy and better robustness; i.e., compared to the six other algorithms, our algorithm has the best performance.

Thus, based on the simulation results of Figs. 1, 2 and 3 and the above analyses, we can conclude that our new algorithm has better stability and higher accuracy, and needs fewer epochs in the training process, than the existing algorithms mentioned.

5 Conclusions

In this paper, a new training algorithm is established for neural networks by using finite-time convergence theory from control theory, addressing the basic problem of how to improve the training convergence rate. To achieve this goal, the normalization variable \(u_{I}\) is introduced, a new weight varying law is designed by using the Lyapunov stability analysis method, and the finite-time stability performance is proved in theory. To verify the performance of our algorithm, simulations of the image classification problem on the CIFAR-10 dataset using VGG16 are considered. Four typical training algorithms, namely SGD-M, AdaGrad, Adam, and the HJB algorithm integrated with each of them, are all compared to our algorithm. The simulations show that our algorithm has superior training performance.