1 Introduction

Deep neural networks (DNNs) have become a significant paradigm of artificial intelligence in many fields such as industrial vision [1, 2], natural language processing [3, 4], medical assistance [5], autonomous driving [6], etc. In this paradigm, training deep models is an arduous and important task because neural network models are extremely complex. To address this problem, deep model training is generally transformed into an optimization problem that minimizes the value of a loss function. For this reason, optimization algorithms are necessary when training deep models. Moreover, because they are easy to implement, many algorithms based on the stochastic gradient descent method have received wide attention and application [7, 8].

The stochastic gradient descent (SGD) algorithm is a classical and simple optimization algorithm for deep neural network training. SGD is easily applied to various deep training tasks because of its simple algorithmic logic. Importantly, SGD is superior to other optimization algorithms in terms of generalization ability in many applications. However, the convergence rate of SGD is unsatisfactory, so a large number of labeled samples are required. To speed up the convergence of SGD, many fast optimization algorithms with adaptive stepsizes have been proposed, for instance, Adam [9], AMSGrad [10], AdamW [11], etc. These adaptive algorithms not only design an adaptive stepsize, but also fully exploit historical gradient information to improve the convergence speed. Therefore, adaptive algorithms have received extensive attention in recent years.

Although the above algorithms have been applied in some scenarios, they are all restricted to convex conditions. However, in many practical scenarios the convex condition is not easy to satisfy, which greatly degrades the performance of the above convex optimization algorithms or even prevents them from converging. Consequently, research on non-convex optimization algorithms is crucial to the wide application of DNNs. To tackle the non-convex problem, many non-convex optimization algorithms have been proposed in the last few decades, such as [12,13,14]. However, these methods all use a fixed stepsize, converge slowly, and exhibit poor performance on non-convex problems. For this reason, many researchers have further studied non-convex issues for Adam with adaptive stepsizes, for example, non-convex optimization for Adam [15], non-convex stochastic optimization for Adam-type algorithms [16], and Yogi [17]. However, whether the AdamW optimization algorithm converges under non-convex conditions is still an open problem.

AdamW improves on Adam's generalization ability by decoupling the weight decay from the adaptive stepsize. However, AdamW is a typical convex optimization algorithm and may fail to converge in the non-convex case. Therefore, it is urgent to propose an optimization method that ensures AdamW-style updates still converge in the non-convex case. In this paper, we propose a non-convex optimization algorithm, named DAda-NC, which uses a decoupled method to recover the weight decay regularization of AdamW. Importantly, DAda-NC also combines the \(sign(\cdot )\) function with its stepsize to ensure that it converges even under non-convex conditions. Our main contributions are summarized below:

  • We propose a decoupled adaptive training algorithm for deep learning under non-convex conditions, named DAda-NC.

  • We solve the non-convergence problem of adaptive algorithms in the non-convex case by exploiting the \(sign(\cdot )\) function.

  • We further improve the generalization ability of adaptive algorithms by exploiting a decoupled weight decay factor.

  • We present extensive experiments on both image classification and language processing tasks on different public datasets.

The rest of this article is structured as follows: Sect. 2 presents the notation and preliminaries of this work. The algorithm design of DAda-NC is provided in Sect. 3, and the experimental results are shown in Sect. 4. We conclude this work in Sect. 5 and discuss future work in Sect. 6. Finally, we state our acknowledgements in Sect. 7.

2 Notation and Preliminaries

2.1 Notation

In this paper, we use bold letters to denote vectors, such as \(\mathbf {x}\) and \(\mathbf {y}\). For all \(\mathbf {x}, \mathbf {y} \in \mathbb {R}^d\), \(\mathbf {x}^2\) denotes the element-wise square, \(\sqrt{\mathbf {x}}\) represents the element-wise square root, and \(\frac{\mathbf {x}}{\mathbf {y}}\) denotes the element-wise division. Moreover, \(\mathbf {x}_t\) represents the value of \(\mathbf {x}\) at the t-th time. In addition, \(x_{t,i}\) denotes the i-th coordinate of the vector \(\mathbf {x}_t\).

2.2 Online Non-convex Optimization

In this paper, we consider an online learning problem. In this problem, the decision vector \(\mathbf {x}_t\) and the loss function \(f_t(\cdot )\) change from time t to time \(t+1\). Specifically, the optimizer generates a decision \(\mathbf {x}_t\) at the t-th time, and the deep model outputs a result following this decision. Then, the adversary (i.e., the loss function) returns a loss to the optimizer based on the output result. Finally, the optimizer uses its strategy to update the decision to \(\mathbf {x}_{t+1}\) for the next time. Consequently, the optimization goal of online learning can be formulated as follows:

$$\begin{aligned} \min _{\mathbf {x}_t\in \mathbb {R}^d} R(T) = \frac{1}{T}\sum _{t=1}^T\left( f_t(\mathbf {x}_t)- f_t(\mathbf {x}^*) \right) , \end{aligned}$$
(1)

where T is the time horizon, \(f_t(\cdot )\) is the loss function at time t, R(T) denotes the average regret over the T rounds, \(\mathbf {x}_t\) is the decision vector at time t, and \(\mathbf {x}^*\) represents the global optimal solution.
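To make this protocol concrete, the following is a minimal sketch of the online learning loop behind Eq. (1), assuming hypothetical `model`, `loss_fn`, `optimizer`, and `data_stream` objects; the per-round optimal losses \(f_t(\mathbf {x}^*)\) are not available in practice, so they are passed in here only to illustrate how the regret is defined.

```python
import torch

def online_learning(model, loss_fn, optimizer, data_stream, opt_losses, T):
    """Run T online rounds and return the average regret of Eq. (1).

    opt_losses[t] stands for f_t(x*), which is unknown in practice and is
    used here only to illustrate the definition of the regret.
    """
    regret = 0.0
    for t in range(T):
        sample, target = data_stream[t]        # adversary reveals the sample s_t
        optimizer.zero_grad()
        loss = loss_fn(model(sample), target)  # f_t(x_t)
        loss.backward()
        optimizer.step()                       # optimizer produces x_{t+1}
        regret += loss.item() - opt_losses[t]
    return regret / T
```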

In our work, we focus on an online learning problem under non-convex conditions. For this reason, we next introduce the non-convex optimization problem. Specifically, we give the following non-convex form to describe the problem intuitively:

$$\begin{aligned} f_t(\mathbf {x}_t) := \mathbb {E}_{\mathbf {s}_t\sim \mathbb {P}}\left[ \ell _t(\mathbf {x}_t,\mathbf {s}_t) \right] , \end{aligned}$$
(2)

where the time \(t\in \{1,\ldots ,T\}\), \(\mathbf {x}_t\in \mathbb {R}^d\), \(\ell \) is a smooth loss function which is possibly non-convex, the vector \(\mathbf {x}_t\) collects the model parameters of the DNN at time t, \(\mathbf {s}_t\) is the labeled sample used for training at time t, and \(\mathbb {P}\) is an unknown data distribution.

2.3 Adam

In this section, we review the design of Adam, which is shown in Algorithm 1. According to its original design, Adam is intended to solve optimization problems under convex conditions. Therefore, the loss functions \(f_t\) and the feasible region for Adam are both assumed to be convex.

Algorithm 1. Adam.

As shown in Algorithm 1, the first-order moment of Adam is designed as an exponential moving average of the gradient, which is formed as follows:

$$\begin{aligned} \mathbf {m}_t = \beta _{1}\mathbf {m}_{t-1} + (1-\beta _{1})\mathbf {g}_t. \end{aligned}$$
(3)

The first-order momentum in adaptive optimization algorithms accounts for the influence of historical gradient information on the current gradient, analogous to the effect of inertia on object motion in physics. An important purpose of the first-order momentum is to accelerate the iterations of the optimization algorithm, and the second-order momentum plays a similar role. In particular, the second-order momentum is an exponential moving average of the square of the gradient, and its form is shown below:

$$\begin{aligned} \mathbf {v}_t = \mathbf {v}_{t-1} - (1-\beta _{2})(\mathbf {v}_{t-1} - \mathbf {g}_t^2). \end{aligned}$$
(4)

Importantly, the second-order momentum makes the step size of the optimization algorithm adaptive, thereby avoiding unnecessary oscillation of the algorithm near the optimal point. Finally, Adam updates the decision vector based on the two momentums and the learning rate \(\alpha _t\), as shown below:

$$\begin{aligned} \mathbf {x}_{t+1} = \mathbf {x}_{t} - \alpha _t\mathbf {m}_t / (\sqrt{\mathbf {v}_t}+\epsilon ). \end{aligned}$$
(5)
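Putting Eqs. (3)-(5) together, a minimal NumPy sketch of one Adam iteration is given below; the bias-correction steps of the original Adam paper [9] are omitted for brevity, and all names are illustrative.

```python
import numpy as np

def adam_step(x, m, v, grad, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of the decision vector x (bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * grad       # Eq. (3): first-order momentum
    v = v - (1 - beta2) * (v - grad ** 2)    # Eq. (4): second-order momentum
    x = x - alpha * m / (np.sqrt(v) + eps)   # Eq. (5): adaptive update
    return x, m, v
```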

Although Adam has received a lot of attention and many applications, its applicable condition is that the loss function is convex. Therefore, in the non-convex case, the performance of Adam is unsatisfactory. For this reason, we propose a novel adaptive momentum optimization algorithm that is specifically designed for non-convex situations.

3 Algorithm Design of DAda-NC

In this section, we first present some definitions and assumptions that guarantee the usability of the proposed algorithm. We then introduce the design details of our proposed algorithm.

Definition 1

A function \(\ell \) is L-smooth if, for all \(\mathbf {x}, \mathbf {y}\in \mathbb {R}^d\), it satisfies the following condition:

$$\begin{aligned} \Vert \nabla \ell (\mathbf {x}) - \nabla \ell (\mathbf {y})\Vert \le L\Vert \mathbf {x} - \mathbf {y}\Vert . \end{aligned}$$
(6)

Definition 2

For a function \(\ell \), the stochastic first-order (SFO) complexity of an algorithm is the number of gradient evaluations of \(\ell \) with respect to its first argument performed by the algorithm.

Definition 3

The SFO complexity of SGD to obtain a \(\delta \)-accurate solution is \(O(1/\delta ^2)\).

The definitions in this paper are common in many previous similar works, such as [17, 18].

Assumption 1

This work assumes that the gradient of the function \(\ell \) is bounded, i.e., for \(\forall \mathbf {x}\in \mathbb {R}^d\), we have \(\Vert \nabla \ell (\mathbf {x})\Vert \le G\).

Assumption 2

This work assumes that the variance of the stochastic gradients is bounded, i.e., for \(\forall \mathbf {x}\in \mathbb {R}^d\), we have \(\mathbb {E}\Vert \nabla \ell (\mathbf {x}) - \nabla f(\mathbf {x})\Vert ^2 \le \sigma ^2\).

The assumptions in our work are typical of many similar works on first-order-momentum-based optimization algorithms, such as [19, 20].

Algorithm 2. DAda-NC.

Next, we introduce our proposed algorithm, DAda-NC, in detail. In the first step of DAda-NC, a batch of samples is stochastically chosen from the training dataset and the corresponding stochastic gradient is returned. In order to eliminate the excessive influence of the second-order weight term, we design a decoupled weight decay strategy. Therefore, a weight decay term is first added to the gradient as follows:

$$\begin{aligned} \mathbf {g}_t = \nabla f_t(\mathbf {x}_{t-1}) + \lambda \mathbf {x}_{t-1}. \end{aligned}$$
(7)

Moreover, DAda-NC uses the same update strategy as Adam for the first-order momentum, which is shown below:

$$\begin{aligned} \mathbf {m}_t = \beta _{1}\mathbf {m}_{t-1} + (1-\beta _{1})\mathbf {g}_t. \end{aligned}$$
(8)

Importantly, DAda-NC applies the \(sign(\cdot )\) function to the second-order momentum, which allows it to handle the non-convex optimization problem. The form of the second-order momentum in DAda-NC is shown as follows:

$$\begin{aligned} \mathbf {v}_t = \mathbf {v}_{t-1} - (1-\beta _{2})sign(\mathbf {v}_{t-1}-\mathbf {g}_t^2)\mathbf {g}_t^2. \end{aligned}$$
(9)

Finally, the update rule of the decision vector with decoupled weight decay is shown below:

$$\begin{aligned} \mathbf {x}_{t+1} = \mathbf {x}_{t} - \alpha _t\Big (\mathbf {m}_t / (\sqrt{\mathbf {v}_t}+\epsilon ) + \lambda \mathbf {x}_{t-1}\Big ). \end{aligned}$$
(10)
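Putting Eqs. (7)-(10) together, one DAda-NC iteration can be sketched as follows; this is a minimal NumPy illustration that follows the update rules exactly as written above, with all variable names being illustrative.

```python
import numpy as np

def dada_nc_step(x, x_prev, m, v, grad_f, alpha, lam=0.125e-3,
                 beta1=0.9, beta2=0.999, eps=1e-3):
    """One DAda-NC update; grad_f is the stochastic gradient of f_t at x_prev."""
    g = grad_f + lam * x_prev                                    # Eq. (7): weight decay added to the gradient
    m = beta1 * m + (1 - beta1) * g                              # Eq. (8): first-order momentum
    v = v - (1 - beta2) * np.sign(v - g ** 2) * g ** 2           # Eq. (9): sign-based second-order momentum
    x_new = x - alpha * (m / (np.sqrt(v) + eps) + lam * x_prev)  # Eq. (10): decoupled weight decay update
    return x_new, m, v
```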

This completes the description of the proposed algorithm. In order to demonstrate the performance of our proposed algorithm in practice, we next apply it to various deep learning tasks.

4 Experiments

In this section, we evaluate the performance of DAda-NC on different datasets and learning rate settings under non-convex conditions. In our experiments, we conduct two groups of deep learning tasks, i.e., image classification and language processing. Moreover, we use a single 1080Ti GPU and PyTorch with Python 3.6 for all model training tasks.

4.1 Datasets and Parameter Settings

In our experiments, we use different public datasets for the image classification task and the language processing task, respectively. A summary of the datasets used in our experiments is shown in Table 1. In the image classification tasks, we use the well-known public dataset CIFAR-10, an image dataset of 60,000 \(32\times 32\) images in 10 classes. Furthermore, we employ three classical architectures, i.e., VGG-11 [21], ResNet-34 [22] and DenseNet-121 [23], for this dataset. For the language processing tasks, we use a famous language dataset, Penn Treebank, with 1-, 2- and 3-layer LSTM architectures, respectively. The Penn Treebank is a dataset obtained by annotating a corpus, and its annotations include part-of-speech tags and syntactic parses. The corpus of the Penn Treebank dataset comes from 1989 Wall Street Journal articles, and it contains about 1M words extracted from 2,499 articles.

Table 1. Datasets and architectures used in our experiments.
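As an illustration of this setup, the sketch below shows how the CIFAR-10 data and one of the architectures could be prepared with torchvision; the batch size and normalization constants are illustrative assumptions rather than the exact settings of our training script.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Illustrative CIFAR-10 pipeline; the normalization constants are commonly
# used per-channel statistics, not necessarily those used in this work.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# ResNet-34 with a 10-way output layer for the CIFAR-10 classes.
model = torchvision.models.resnet34(num_classes=10)
```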

We compare our proposed algorithm with the classical adaptive optimization algorithm Adam and the recent adaptive non-convex optimization algorithm Yogi. We set up all the algorithms executed in our experiments as follows (a code sketch of these configurations is given after the list):

  • Adam [9] is run with stepsize \(\alpha = 0.001\), first-order momentum coefficient \(\beta _1 = 0.9\), second-order momentum coefficient \(\beta _2 = 0.999\), and regularization term \(\epsilon = 10^{-8}\). Moreover, the momentums are initialized as \(\mathbf {m}_0 = 0\) and \(\mathbf {v}_0 = 0\).

  • Yogi [17] is run with stepsize \(\alpha = 0.01\), first-order momentum coefficient \(\beta _1 = 0.9\), second-order momentum coefficient \(\beta _2 = 0.999\), and regularization term \(\epsilon = 10^{-3}\). Moreover, the momentums are initialized as \(\mathbf {m}_0 = 0\) and \(\mathbf {v}_0 = 0\).

  • DAda-NC (Ours) is run with stepsize \(\alpha = 0.5\), decoupled coefficient \(\lambda = 0.125\times 10^{-3}\), first-order momentum coefficient \(\beta _1 = 0.9\), second-order momentum coefficient \(\beta _2 = 0.999\), and regularization term \(\epsilon = 10^{-3}\). Moreover, the momentums are initialized as \(\mathbf {m}_0 = 0\) and \(\mathbf {v}_0 = 0\).
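The settings above could be instantiated as follows. Note that only Adam ships with PyTorch; `Yogi` and `DAdaNC` are hypothetical custom optimizer classes assumed to implement [17] and Algorithm 2, respectively.

```python
import torch
import torchvision

model = torchvision.models.resnet34(num_classes=10)  # any of the evaluated models

# Adam is provided by PyTorch; Yogi and DAdaNC are hypothetical
# torch.optim.Optimizer subclasses implementing [17] and Algorithm 2.
adam = torch.optim.Adam(model.parameters(), lr=1e-3,
                        betas=(0.9, 0.999), eps=1e-8)
# yogi = Yogi(model.parameters(), lr=1e-2, betas=(0.9, 0.999), eps=1e-3)
# dada_nc = DAdaNC(model.parameters(), lr=0.5, weight_decay=0.125e-3,
#                  betas=(0.9, 0.999), eps=1e-3)
```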

Fig. 1. Comparison of each executed algorithm with respect to loss vs. epochs on CIFAR-10.

4.2 Image Classification

In this group of experiments, we conduct the image classification task, a standard and well-studied problem in machine learning, on the public dataset CIFAR-10. Furthermore, three different neural network architectures, VGG-11 (with 11 layers), ResNet-34 (with 34 layers) and DenseNet-121 (with 121 layers), are used in the experiments on CIFAR-10. Next, we present the results and analysis of our experiments.

Fig. 2. Comparison of each executed algorithm with respect to training accuracy vs. epochs on CIFAR-10.

Fig. 3. Comparison of each executed algorithm with respect to test accuracy vs. epochs on CIFAR-10.

The experimental results on the loss with respect to the epochs are shown in Fig. 1. From this figure, we can see that the loss of our proposed algorithm reaches the lowest value among all executed algorithms by the end of the experiment. In fact, the purpose of model training is to minimize the value of the loss, so the final loss value can reflect the quality of the model training results. In other words, the stable loss value reached by the optimization algorithm during training potentially reflects the generalization ability of the algorithm. Therefore, Fig. 1 shows that our proposed algorithm outperforms Adam and Yogi under non-convex conditions.

Table 2. The top-1 training accuracy of the compared algorithms on the three models.

Moreover, the experimental results on the training accuracy are shown in Fig. 2. From this figure, we observe that our proposed algorithm attains a higher training accuracy than Adam and Yogi. The training accuracy directly reflects the optimization ability of the algorithms used in the training process. Therefore, Fig. 2 demonstrates that our proposed algorithm performs better than Adam and Yogi in terms of training accuracy for image classification tasks. Furthermore, Table 2 summarizes the top-1 training accuracy of each compared algorithm under non-convex conditions. The results in this table further demonstrate that our proposed algorithm performs the best in terms of training accuracy.

Table 3. The top-1 test accuracy of the compared algorithms on the three models.

In addition, another important indicator of the performance of an optimization algorithm is the test accuracy after model training. For this reason, we also conduct a set of experiments on the test accuracy for the image classification tasks; the experimental results are shown in Fig. 3. Note that the training samples are used repeatedly, while the test samples are held out, so the training accuracy is usually slightly higher than the test accuracy. Nonetheless, Fig. 3 reveals that our proposed algorithm obtains a higher test accuracy than Adam and Yogi after model training. Moreover, we compare the top-1 test accuracy of all the executed algorithms in this group of experiments in Table 3. The results also verify that our proposed algorithm outperforms Adam and Yogi in terms of test accuracy.

4.3 Language Processing

Fig. 4. Comparison of each executed algorithm with respect to perplexity vs. epochs on Penn Treebank. The lower the perplexity, the better.

Language processing is one of the important applications of artificial intelligence. Therefore, we also conduct another group of experiments on language processing. In this group of experiments, we use the classic 1-, 2- and 3-layer LSTM models on the Penn Treebank dataset, respectively. The total numbers of model parameters of the 1-, 2- and 3-layer LSTMs are 5,293,200, 13,632,400 and 24,221,600, respectively. It is well known that perplexity is an important evaluation indicator for language models; therefore, we also use perplexity to evaluate the performance of the executed optimization algorithms. Specifically, if it is known that the sentence \((w_1, w_2, \ldots , w_m)\) appears in the corpus, then the higher the probability the language model assigns to the sentence, the better the language model predicts the corpus. For this reason, perplexity is commonly used to characterize the predictive ability of language models. The perplexity P of a sentence S under a language model can be formulated as follows:

$$\begin{aligned} P(S)&= p(w_1, w_2, \ldots , w_m)^{-\frac{1}{m}} \nonumber \\&= \left[ \prod _{i=1}^m \frac{1}{p(w_i | w_1, w_2, \ldots , w_{i-1})}\right] ^{\frac{1}{m}}, \end{aligned}$$
(11)

where \(p(w_i\,|\,w_1, \ldots , w_{i-1})\) denotes the conditional probability assigned by the language model.
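As a small illustration of Eq. (11), the snippet below computes the perplexity of a sentence from its per-token conditional probabilities; the probabilities are assumed to come from an already trained language model, and the example values are purely illustrative.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sentence given p(w_i | w_1, ..., w_{i-1}) for each
    token, following Eq. (11)."""
    m = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / m)

# Example with illustrative probabilities for a 4-token sentence.
print(perplexity([0.2, 0.1, 0.5, 0.25]))  # ~4.47
```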

Table 4. The results of the test process on the LSTM models.

The results of this group of experiments are shown in Fig. 4. This figure directly shows that our proposed algorithm attains the lowest perplexity on all of the 1-, 2- and 3-layer LSTM models. Note that the lower the perplexity value, the better the model performance. Both Adam and Yogi perform worse than our proposed algorithm. Therefore, our proposed algorithm also performs well in the training of language processing models under non-convex conditions. In addition, we further show the results of the test process in Table 4. We see that our proposed algorithm attains the lowest loss, perplexity and bits-per-character among all the compared algorithms in the test process.

5 Conclusion

In this paper, we focus on the convergence and generalization ability of adaptive optimization algorithms under non-convex conditions. In our proposed algorithm, we first use a decoupled weight decay method to further improve the generalization ability. We then redesign the second-order momentum of adaptive algorithms with the \(sign(\cdot )\) function, which ensures that our proposed algorithm converges under non-convex conditions. Finally, we conduct extensive experiments on both image classification and language processing tasks. All the experimental results demonstrate that our proposed algorithm outperforms Adam and Yogi on standard public datasets in non-convex cases.

6 Future Work

Nonetheless, it is not known whether our proposed algorithm retains such good performance beyond the non-convex case, for example under convex or strongly convex conditions. Therefore, we will explore these issues in our future work. Furthermore, the theoretical proof of convergence of our proposed algorithm is also left for future work.